Why does everyone keep repeating this mantra? I wrote the x86 decoder for https://github.com/jart/blink which is based on Intel's xed disassembler. It's tiny to do if you have the know-how.
    master jart@studio:~/blink$ ls -hal o/tiny/blink/x86.o
    -rw-r--r-- 1 jart staff 23K Jun 22 19:03 o/tiny/blink/x86.o
Modern microprocessors have 100,000,000,000+ transistors, so how much die space could 184,000 bits (23 KB × 8) for x86 decoding really need? What proof is there that this isn't just a holy war over the user-facing design? The stuff that actually matters is probably memory speed and other chip internals, and companies like Intel, AMD, NVIDIA, and ARM aren't sharing that with us. So if you think you understand the challenges and tradeoffs they're facing, then I'm willing to bet it's just false confidence and peanut-gallery consensus, since we don't know what we don't know.
Decoding 1 x86 instruction per cycle is easy. That was solved something like 40 years ago.
The problem is that a superscalar CPU needs to decode multiple x86 instructions per cycle. I think the latest Intel big-core pipeline can do (IIRC) 6 instructions per cycle, so to keep the pipeline full the decoder MUST be able to decode 6 per cycle too.
If it's ARM, multiple decode is easy. The M1 does (IIRC) 8 per cycle without much trouble, because the instruction length is fixed: the first decoder starts at PC, the second starts at PC+4, and so on. But x86 instructions are variable length, so after the first decoder decodes the instruction at IP, where does the second decoder start decoding?
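To make that dependency concrete, here's a toy C sketch (purely illustrative software, not how real decoders are built; x86_insn_length is a hypothetical helper): with a fixed 4-byte ISA every decoder's start offset is known up front, while for x86 decoder N can't know where to start until decoders 1..N-1 have computed their lengths.

    /* Toy software illustration, not RTL. x86_insn_length() stands in
       for a hypothetical full length-decode of a single instruction.  */
    #include <stddef.h>
    #include <stdint.h>

    size_t x86_insn_length(const uint8_t *p);

    /* Fixed 4-byte ISA: all n start offsets are independent, so n
       decoders can run in parallel.                                   */
    void starts_fixed(size_t starts[], size_t n) {
      for (size_t i = 0; i < n; i++) starts[i] = i * 4;
    }

    /* x86: each start offset depends on the length of every earlier
       instruction, which is the serial dependency in the decoder.     */
    void starts_x86(const uint8_t *buf, size_t starts[], size_t n) {
      size_t off = 0;
      for (size_t i = 0; i < n; i++) {
        starts[i] = off;
        off += x86_insn_length(buf + off);
      }
    }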
It isn't quite that bad. The decoders write stop bits back into the L1i, to demarcate where the instructions align. Since those bits aren't indexed in the cache and don't affect associativity, they don't really cost much. A handful of 6T SRAM cells per cache line.
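Roughly the idea, as a C sketch (predecode formats vary by microarchitecture; this is just the concept, not any real chip's layout):

    /* Concept only: each I-cache line carries a sideband of boundary
       bits filled in by the decoders on the first pass, so later
       fetches can find instruction starts without re-scanning.        */
    #include <stdint.h>

    struct icache_line {
      uint8_t  bytes[64];    /* cached instruction bytes                */
      uint64_t start_mask;   /* bit i set => an instruction starts at
                                bytes[i]; not part of the tag/index, so
                                it doesn't affect associativity         */
    };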
I would have assumed it just decodes the x86 into a 32-bit ARM-like internal ISA, similar to how a JIT works in software. x86 decoding is extremely costly in software if you build an interpreter, probably around 30%, and that's assuming you have a cache. But with JIT code morphing in Blink, decoding cost drops to essentially nothing. As best I understand it, all x86 microprocessors since the NexGen Nx586 have worked this way too. Once you're code morphing the user-facing frontend ISA, a much bigger problem rears its ugly head, which is the 4096-byte page size. That's something Apple really harped on with their M1 design, which increased it to 16 KB. It matters because morphed code can't be connected across page boundaries.
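Here's my own simplified sketch of why page granularity bites a code-morphing design (not Blink's or any CPU's actual data structures; the names are made up):

    /* Simplified illustration: translations are tracked and invalidated
       per guest page, and a translated block stops at a page boundary
       because the next page can be unmapped or rewritten independently.
       Smaller pages mean more breaks and more bookkeeping.              */
    #include <stdint.h>

    #define GUEST_PAGE 4096        /* 16384 with M1-style 16 KB pages */

    static int same_page(uint64_t pc_a, uint64_t pc_b) {
      return (pc_a / GUEST_PAGE) == (pc_b / GUEST_PAGE);
    }

    /* While translating a block, stop before crossing into a new page:
       if (!same_page(block_start, next_pc)) end the block there.       */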
It decodes to uOPs optimized for the exact microarchitecture of that particular CPU. High-performance ARM64 designs do the same.
But in the specific case of tracking variable-length instruction boundaries, that happens in the L1i cache. uOP caches make decode bandwidth less critical, but it's still important enough to optimize.
That's called a uOP cache, which Intel has been using since Sandy Bridge (AMD too, though I don't remember off the top of my head since when). But that's more transistors for the cache and its control mechanism.
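Conceptually it's just another cache sitting in front of the decoders; a hedged C sketch of the lookup idea (the real entry format, tagging, and uOP encoding are microarchitecture-specific and not public):

    /* Concept sketch only: map a fetch address to already-decoded uOPs
       so hot code can bypass the legacy x86 decoders entirely.         */
    #include <stdint.h>

    struct uop { uint32_t bits; };      /* internal format, opaque here */

    struct uop_cache_entry {
      uint64_t   tag;                   /* derived from the fetch address */
      struct uop uops[6];               /* up to one issue group's worth  */
      uint8_t    count;
      uint8_t    valid;
    };

    /* Hit  => issue entry->uops[0..count) directly this cycle.
       Miss => run the x86 decoders and fill the entry for next time.   */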
It's definitely better than what NVIDIA does, inventing an entirely new ISA each year. If the hardware isn't paying the cost for a frontend, it shovels the burden onto software. There's a reason every AI app has to bundle a 500 MB matrix multiplication library in each download, and it's because GPUs force you to compile your code ten times for the last ten years of ISAs.
Part of it is that, but part of it is that people pay for getting from 95% optimal to 99% optimal, and doing that is actually a lot of work. If you peek inside the matrix multiplication library you'll notice it's not just "we have the best algorithm for each of the last 7 GPU microarchitectures" but also 7 implementations for the latest architecture, because that's how you have to be to go fast. It's kind of like how, if you take an uninformed look at glibc memcpy, you'll see there's an AVX2 path and an ERMS path, but also that it switches between algorithms based on the size of the input. You can easily go "yeah, my SSE2 code is tiny and gets decent performance," but if you stop there you're leaving something on the table, and with GPUs it's this but even more extreme.
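A simplified sketch of that size-based dispatch in C (the thresholds and helper names are invented for illustration; glibc's actual selection logic is more involved and tuned per CPU):

    /* Illustrative only: copy_small/copy_avx2/copy_erms are hypothetical
       helpers, and the cutoffs are made up, not glibc's real tuning.     */
    #include <stddef.h>

    void *copy_small(void *d, const void *s, size_t n); /* short, branchy */
    void *copy_avx2 (void *d, const void *s, size_t n); /* vector loop    */
    void *copy_erms (void *d, const void *s, size_t n); /* rep movsb      */

    void *memcpy_dispatch(void *d, const void *s, size_t n) {
      if (n < 128)        return copy_small(d, s, n); /* skip vector setup */
      if (n < 256 * 1024) return copy_avx2 (d, s, n); /* cache-resident    */
      return copy_erms(d, s, n);                      /* large copies      */
    }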
Using the uOPs directly as the ISA would be a bad idea for code density.
In RISC-V land, vendors tend to target standard extensions/profiles, but when their hardware is capable of other operations they often expose those through custom extensions.
Chips and Cheese specifically talks about this in the article I mention[0].
x86 decoders take up a small but still significant share of the silicon and power budget, usually somewhere between 3% and 7%. Not a terrible cost to pay, but if legacy is your only reason, why keep paying it? It's extra watts and silicon you could dedicate to something else.
Correct. However, because ARM has fixed-length instructions the decoder can make more assumptions, keeping it simpler.
Like I said, it's only a small amount of extra silicon you're paying for the x86 tax, but with the world mostly becoming ARM-compatible, there's no longer much reason to pay it.