
Having gone through the explanations in the Transformer Explainer [1], I now have a good intuition for GPT-2. Is there a resource that gives intuition on what has changed since then to improve things like approaching a problem more conceptually, being better at coding, suggesting next steps when wanted, etc.? I have a feeling this is a result of more than just increasing the number of transformer blocks, heads, and embedding dimensions.

[1] https://poloclub.github.io/transformer-explainer/



Most improvements like this don't come from the architecture itself, scale aside. They come down to training, which is a hair away from being black magic.

The exceptions are improvements in context length and inference efficiency, as well as modality support. Those are architectural. But behavioral changes are almost always down to scale, pretraining data, SFT (supervised fine-tuning), RLHF (reinforcement learning from human feedback), and RLVR (reinforcement learning with verifiable rewards). A rough sketch of how those differ is below.
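
To make those terms concrete, here's a very rough sketch (hypothetical Python; `model`, `reward_model`, and `verifier` are placeholder objects, not any real library's API) of what signal each post-training stage optimizes:

  # Hypothetical placeholders, not a real API: model, reward_model, verifier.

  def sft_loss(model, prompt, reference_answer):
      # Supervised fine-tuning: push the model toward a curated answer.
      return -model.log_prob(reference_answer, given=prompt)

  def rlhf_loss(model, reward_model, prompt):
      # RLHF: sample an answer, score it with a learned preference model,
      # and reinforce higher-scoring answers.
      answer = model.sample(prompt)
      reward = reward_model.score(prompt, answer)
      return -reward * model.log_prob(answer, given=prompt)

  def rlvr_loss(model, verifier, prompt):
      # RLVR: same idea, but the reward is a programmatic check
      # (unit tests pass, a math answer verifies) rather than a learned model.
      answer = model.sample(prompt)
      reward = 1.0 if verifier(prompt, answer) else 0.0
      return -reward * model.log_prob(answer, given=prompt)

The architecture is the same across all three stages; only the training signal changes.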



