
AI assist in software engineering is unambiguously demonstrated to some degree at this point: the "no LLM output in my project" stance is cope.

But "reliable, durable, scalable outcomes in adversarial real-world scenarios" is not convincingly demonstrated in public; the asterisks are load-bearing, as GPT 5.2 Pro would say.

That game is still on, and AI assist beyond FIM is still premature for safety-critical or generally outcome-critical applications: i.e. you can do it if it doesn't have to work.

I've got a horse in this race: formal methods as the methodology, AI assist as the thing that makes it economically viable. My stuff is north of demonstrated in the small and south of proven in the large; it's still a bet.

But I like the stock. The no-free-lunch thing here is that AI can only turn specifications into code if the specification is already so precise that it is code.

The irreducible heavy lift is that someone has to prompt it, and if the input is vibes the output will be vibes. If the input is zero-sorry rigor... you've just moved the cost around.

The modern software industry is an expensive exercise in "how do we capture all the value and redirect it from expert computer scientists to some arbitrary financier".

You can't. Not at less than the cost of the experts if the outcomes are non-negotiable.


What is FIM?

Ten years ago it seemed obvious where the next AI breakthrough was coming from: it would be DeepMind using C51 or Rainbow and PBT to do Alpha-something, the evals would be sound, and it would be superhuman on something important.

And then "Language Models are Few-Shot Learners" collided with Sam Altman's ambition/unscrupulousness, and now TensorRT-LLM is dictating the shape of data centers in a self-reinforcing loop.

LLMs are interesting and useful but the tail is wagging the dog because of path-dependent corruption arbitraging a fragile governance model. You can get a model trained on text corpora to balance nested delimiters via paged attention if you're willing to sell enough bonds, but you could also just do the parse with a PDA from the 60s and use the FLOPs for something useful.
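
To make that concrete, the PDA in question is a dozen lines with no attention anywhere; a toy sketch in lean4 (all names mine):

    -- the stack is the automaton's entire memory: push on open,
    -- pop-and-compare on close, accept iff the stack is empty at the end
    def matching : Char → Option Char
      | ')' => some '('
      | ']' => some '['
      | '}' => some '{'
      | _   => none

    def balancedAux : List Char → List Char → Bool
      | stack, [] => stack.isEmpty
      | stack, c :: rest =>
        if c == '(' || c == '[' || c == '{' then
          balancedAux (c :: stack) rest
        else
          match matching c, stack with
          | some opener, top :: stack' => top == opener && balancedAux stack' rest
          | some _, [] => false
          | none, _ => balancedAux stack rest

    def balanced (s : String) : Bool :=
      balancedAux [] s.toList

    #eval balanced "([]{()})"  -- true
    #eval balanced "([)]"      -- false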

We had it right: dial in an ever-growing set of tasks, opportunistically unify on durable generalities, put in the work.

Instead we asserted generality, lied about the numbers, and lit a trillion dollars on fire.

We've clearly got new capabilities, it's not a total write-off, but God damn was this an expensive way to spend five years making two years of progress.


> lit a trillion dollars on fire

Not all bad then. Hopefully this impedes them getting more.


It is tempting to be stealthy when you start seeing discontinuous capabilities go from totally random to somewhat predictable. But most of the key stuff is on GitHub.

The moats here are around mechanism design and values (to the extent they differ). The frontier labs are doomed in this world: the commons locked up behind paywalls gets hyper-mirrored, value accrues in very different places, and it's not a nice orderly exponent from a sci-fi novel. It's nothing like what the talking heads at Davos say; Anthropic aren't in the top five groups I know in terms of being good at it; it'll get written off as fringe until one day it happens in like a day. So why be secretive?

You get on the ladder by throwing out Python and JSON and learning lean4; you tie property tests to lean theorems via FFI when you have to; you start building out rfl-driven pretty printers of proven AST properties.
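
The bottom rung of that ladder is tiny. A toy sketch (AST and names mine) of the closed-term case where rfl does all the work:

    inductive Expr where
      | lit : Nat → Expr
      | add : Expr → Expr → Expr
    deriving Repr

    def eval : Expr → Nat
      | .lit n   => n
      | .add a b => eval a + eval b

    -- a constant-folding pass over the AST
    def fold : Expr → Expr
      | .lit n   => .lit n
      | .add a b =>
        match fold a, fold b with
        | .lit x, .lit y => .lit (x + y)
        | a', b'         => .add a' b'

    -- on closed terms the kernel computes both sides, so the proof
    -- that the pass preserved meaning is literally rfl
    example : fold (.add (.lit 1) (.lit 2)) = .lit 3 := rfl
    example : eval (fold (.add (.lit 1) (.add (.lit 2) (.lit 3)))) = 6 := rfl

Property tests pick up exactly where rfl stops: open terms, quantifiers, the stuff you tie back to lean theorems over FFI.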

And yeah, the droids run out ahead in little Firecracker VMs, reading from an effect/coeffect attestation graph and writing back to it. The result is saved; useful results are indexed. Human review is about big-picture stuff; human coding is about airtight correctness (and fixing it when it breaks despite your "proof" that had a bug in the axioms).

Programming jobs are impacted, but not as much as people think. Droids do what David Graeber called bullshit jobs for the most part, and then they're savants (not polymath geniuses) at a few things: at reverse engineering and infosec they'll just run you over, and they're fucking going in CIC.

This is about formal methods just as much as AI.


The thing you want has a kind of academic jargon name (coeffects algebra with graded/indexed monads for discharge) but is very intuitive, and it can do useful and complete attestation without compromising anyone's credentials (in the limit case because everyone chooses what proxy to run).

https://imgur.com/a/Ztyw5x5
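
If the jargon grates: effects grade what a step emits into the world, coeffects grade what it's entitled to demand from its context, and sequencing combines the grades. A toy sketch in lean4 (every name hypothetical):

    -- what a computation may demand from its context (the coeffect side)
    inductive Demand where
      | readEnv
      | netAccess
      | clock

    -- a computation indexed by the demands it must discharge
    structure Attested (demands : List Demand) (α : Type) where
      run : α

    -- pure demands nothing: the unit of the grading monoid
    def pure' {α : Type} (a : α) : Attested [] α := ⟨a⟩

    -- sequencing accumulates demands: the "graded" in graded monad
    def andThen {α β : Type} {ds₁ ds₂ : List Demand}
        (m : Attested ds₁ α) (f : α → Attested ds₂ β) :
        Attested (ds₁ ++ ds₂) β :=
      ⟨(f m.run).run⟩

Credentials never enter the picture because the index records the shape of the need, not the secret that satisfies it; that's the sense in which everyone gets to choose what proxy to run.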


Sorry, but you lost me. How are coeffects different from effects? I think I’m missing some steps between monads and credentials. Maybe fill in the blanks?

I compile nix derivations to a well-posed effect/coeffect/graded monad algebra so I can do a real bill of materials and build on an action cache engine maintained by professionals, but that's mostly for long-tail stuff.

These days with a la carte access to all of the container ecosystem primitives as nice, ergonomic, orthogonal operations I don't really see the value in nixpkgs. Don't really see the value in a container registry either: a correctly attested content addressable store with some DNS abbreviations is 100 lines of code because mostly it's git and an S3 shim.
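
For a sense of how small: the skeleton is a map from hash-of-content to content and nothing else. A toy in-memory sketch in lean4 (64-bit hash standing in for SHA-256, no attestation, no S3):

    -- the key *is* the hash of the content, so a blob can never
    -- drift from its name (modulo collisions; a real store uses SHA-256)
    abbrev Key := UInt64

    structure CAS where
      blobs : List (Key × String) := []

    def CAS.put (s : CAS) (content : String) : Key × CAS :=
      let k := hash content
      (k, { blobs := (k, content) :: s.blobs })

    def CAS.get (s : CAS) (k : Key) : Option String :=
      (s.blobs.find? (·.1 == k)).map (·.2)

    #eval
      let (k, s) := ({} : CAS).put "build output"
      s.get k  -- some "build output"

The attestation header and the S3/DNS plumbing are the rest of the claimed 100 lines.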

The category error at the heart of nixpkgs is the idea that the environment in which software is compiled must resemble the environment in which it executes. Silly stuff. So whether you're a patchelf --rpath ... Person or an unshare --bind-mount Enjoyer (isomorphic), just remember: in 2026 the guy with the daemon that runs as root does not want you to have nice things.


They can all write lean4 now; don't accept numbers that don't carry proofs. The CAS I use for builds has a coeffect discharge cert in the attestation header, a couple lines of code. Graded monads are a snap in CIC.
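
The "snap" part, concretely: the grade is an index on the type, and the discharge cert is an inequality the kernel checks. A minimal sketch with Nat/+/0 as the grading monoid (names mine):

    -- grade = static budget of some resource, say network calls;
    -- pure costs 0, bind adds budgets
    structure M (g : Nat) (α : Type) where
      run : α

    def M.pure {α : Type} (a : α) : M 0 α := ⟨a⟩

    def M.bind {α β : Type} {g₁ g₂ : Nat}
        (m : M g₁ α) (f : α → M g₂ β) : M (g₁ + g₂) β :=
      ⟨(f m.run).run⟩

    -- the discharge cert: a proof the grade fits the granted budget
    def M.discharge {α : Type} {g : Nat}
        (m : M g α) (budget : Nat) (_cert : g ≤ budget) : α :=
      m.run

    #eval (M.bind (⟨41⟩ : M 1 Nat) fun n => (⟨n + 1⟩ : M 1 Nat)).discharge 3 (by decide)
    -- 42

That _cert argument is the whole attestation story: the number doesn't typecheck without the proof.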


There are some numbers that are uncomputable in lean. You can do things to approximate them in lean; however, those approximations may still be wrong. Lean's uncomputable namespace is very interesting.


As with many things the secret sauce of AI-assisted software engineering is mathematics, skepticism, and hard work.

The first thing anyone should do is immediately understand that they are in a finite-sum adversarial relationship with all of their vendors. If you're getting repeatably and unambiguously good outcomes with Claude Code, you're either in the bandit arm where high-Twitter-follower-count people go or you're counting cards in a twelve-deck shoe and it'll be 24 soon. Switch to opencode today; the little thing in the corner is a clock, and a prominent clock is more or less proof you're not in a casino.

There is another key observation that took me a long time to internalize, which is that the loss surface of formalism is not convex: most of us have tried to lift the droids out of the untyped lambda calculus and into System F, and this does not, as a general thing, go well. But a droid in the untyped lambda calculus will with 100% likelihood eventually Superfund your codebase, no exceptions.

The droids are wildly "happy," for lack of a better term, in CIC, but the sweet spot is System F Omega. A web app that's not in Halogen is an unforced error now: pure vibecoders can do lastingly valuable work in PureScript, and they can't in React, which is irretrievably broken under any sane effects algebra.

So AI coding is kind of a mean reversion. Knuth and Lamport and Hoare and Dijkstra had an LLM: it was an army of guys in short-sleeved dress shirts with half a page of fanfold system prompt on every page. That is automatable now, but there's a straightforward algorithm for remaining relevant: crank the ambition up until Opus 4.5 is whiffing.

Computer Scientist is still a high impact job. The cast of Office Space is redundant. But we kinda knew that the first time we saw Office Space.

Personally I'm working harder than ever on harder problems than ever, with pretty extreme AI-assist hygiene: you spend all your time on hard problems with serious impact. The burden of understanding the code and being able to quit vim has gone up, not down, and mathematics is absorbing computer programming.

The popular narrative about Claude Code 10x-ing JavaScript projects is maskirovka.


I mostly write lean4 now and emit proof-carrying System F Omega via rfl. It's the right level of abstraction when the droids have been pinned to theory-laden symbolisms. It's also just pleasant to use.


Anthropic might be the first gigantic company to destroy itself by bootstrapping a capability race it definitionally cannot win.

They've been leading in AI coding outcomes (not exactly the Olympics) by being first on a few things, notably a serious commitment to both high-cost/high-effort post-training (curated code and a fucking gigaton of Scale/Surge/etc) and basically the entire non-retired elite ex-Meta engagement org banditing the fuck out of "best pair programmer ever!"

But Opus is good enough to build the tools you need to not need Opus much. Once you escape the Claude Code Casino, you speed-run to agent-as-stochastic-omega-tactic fast. I'll be AI sovereign in January with better outcomes.

The big AI establishment says AI will change everything. Except their job and status. Everything but that. gl


> AI sovereign in January

You mean you won't need tokens anymore? Are you taking bets?


I mean I'm running TensorRT-LLM on a basket of spot vendors at NVFP4, with auction convexity math and ClickHouse Keeper and custom passthrough.

I need more tokens, not fewer, because the available open-weight models aren't quite as strong, but I roofline sm_100 and sm_120 for a living: I get a factor of 2 on the spot arb, a factor of 2 on the utilization, and a factor of 4-16 on the quant.
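
Back of the envelope, assuming those factors compound multiplicatively against a list-price BF16 baseline: 2 (spot arb) x 2 (utilization) x 4-16 (NVFP4 quant) = ~16x to ~64x effective cost per token.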

I come out ahead.


This library is a small thing. A few hundred lines of C++. A template adapter.

But it stands on the shoulders of two decades of work by people who thought deeply about the shape of computation itself.

