Why does everyone keep repeating this mantra? I wrote the x86 decoder for https://github.com/jart/blink which is based on Intel's xed disassembler. It's tiny to do if you have the know-how.
    master jart@studio:~/blink$ ls -hal o/tiny/blink/x86.o
    -rw-r--r-- 1 jart staff 23K Jun 22 19:03 o/tiny/blink/x86.o
Modern microprocessors have 100,000,000,000+ transistors, so how much die space could 184,000 bits (23 KB × 8) for x86 decoding really need? What proof is there that this isn't just a holy war over the user-facing design? The stuff that actually matters is probably memory speed and other chip internals, and companies like Intel, AMD, NVIDIA, and ARM aren't sharing that with us. So if you think you understand the challenges and tradeoffs they're facing, then I'm willing to bet it's just false confidence and peanut-gallery consensus, since we don't know what we don't know.
Decoding 1 x86 instruction per cycle is easy. That was solved something like 40 years ago.
The problem is that a superscalar CPU needs to decode multiple x86 instructions per cycle. I think the latest Intel big-core pipeline can do (IIRC) 6 instructions per cycle, so to keep the pipeline full the decoder MUST be able to decode 6 per cycle too.
If it's ARM, multiple decode is easy. The M1 does (IIRC) 8 per cycle without much trouble, because the instruction length is fixed: the first decoder starts at PC, the second starts at PC+4, and so on. But x86 instructions are variable length, so after the first decoder decodes the instruction at IP, where does the second decoder start decoding?
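To make that dependency concrete, here's a toy C sketch (purely illustrative software, not how real decoders are built; x86_insn_length is a hypothetical helper): with a fixed 4-byte ISA every decoder's start offset is known up front, while for x86 decoder N can't know where to start until decoders 1..N-1 have computed their lengths.

    /* Toy software illustration, not RTL. x86_insn_length() stands in
       for a hypothetical full length-decode of a single instruction.  */
    #include <stddef.h>
    #include <stdint.h>

    size_t x86_insn_length(const uint8_t *p);

    /* Fixed 4-byte ISA: all n start offsets are independent, so n
       decoders can run in parallel.                                   */
    void starts_fixed(size_t starts[], size_t n) {
      for (size_t i = 0; i < n; i++) starts[i] = i * 4;
    }

    /* x86: each start offset depends on the length of every earlier
       instruction, which is the serial dependency in the decoder.     */
    void starts_x86(const uint8_t *buf, size_t starts[], size_t n) {
      size_t off = 0;
      for (size_t i = 0; i < n; i++) {
        starts[i] = off;
        off += x86_insn_length(buf + off);
      }
    }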
It isn't quite that bad. The decoders write stop bits back into the L1i, to demarcate where the instructions align. Since those bits aren't indexed in the cache and don't affect associativity, they don't really cost much. A handful of 6T SRAM cells per cache line.
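Roughly the idea, as a C sketch (predecode formats vary by microarchitecture; this is just the concept, not any real chip's layout):

    /* Concept only: each I-cache line carries a sideband of boundary
       bits filled in by the decoders on the first pass, so later
       fetches can find instruction starts without re-scanning.        */
    #include <stdint.h>

    struct icache_line {
      uint8_t  bytes[64];    /* cached instruction bytes                */
      uint64_t start_mask;   /* bit i set => an instruction starts at
                                bytes[i]; not part of the tag/index, so
                                it doesn't affect associativity         */
    };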
I would have assumed it just decodes the x86 into a 32-bit ARM-like internal ISA, similar to how a JIT works in software. x86 decoding is extremely costly in software if you build an interpreter, probably around 30%, and that's assuming you have a cache. But with JIT code morphing in Blink, decoding cost drops to essentially nothing. As best I understand it, all x86 microprocessors since the NexGen Nx586 have worked this way too. Once you're code morphing the user-facing frontend ISA, a much bigger problem rears its ugly head, which is the 4096-byte page size. That's something Apple really harped on with their M1 design, which increased it to 16 KB. It matters because morphed code can't be connected across page boundaries.
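Here's my own simplified sketch of why page granularity bites a code-morphing design (not Blink's or any CPU's actual data structures; the names are made up):

    /* Simplified illustration: translations are tracked and invalidated
       per guest page, and a translated block stops at a page boundary
       because the next page can be unmapped or rewritten independently.
       Smaller pages mean more breaks and more bookkeeping.              */
    #include <stdint.h>

    #define GUEST_PAGE 4096        /* 16384 with M1-style 16 KB pages */

    static int same_page(uint64_t pc_a, uint64_t pc_b) {
      return (pc_a / GUEST_PAGE) == (pc_b / GUEST_PAGE);
    }

    /* While translating a block, stop before crossing into a new page:
       if (!same_page(block_start, next_pc)) end the block there.       */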
It decodes to uOPs optimized for the exact microarchitecture of that particular CPU. High-performance ARM64 designs do the same.
But in the specific case of tracking variable-length instruction boundaries, that happens in the L1i cache. uOP caches make decode bandwidth less critical, but it's still important enough to optimize.
That's called a uOP cache, which Intel has been using since Sandy Bridge (AMD too, though I don't remember off the top of my head since when). But that's more transistors for the cache and its control mechanism.
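Conceptually it's just another cache sitting in front of the decoders; a hedged C sketch of the lookup idea (the real entry format, tagging, and uOP encoding are microarchitecture-specific and not public):

    /* Concept sketch only: map a fetch address to already-decoded uOPs
       so hot code can bypass the legacy x86 decoders entirely.         */
    #include <stdint.h>

    struct uop { uint32_t bits; };      /* internal format, opaque here */

    struct uop_cache_entry {
      uint64_t   tag;                   /* derived from the fetch address */
      struct uop uops[6];               /* up to one issue group's worth  */
      uint8_t    count;
      uint8_t    valid;
    };

    /* Hit  => issue entry->uops[0..count) directly this cycle.
       Miss => run the x86 decoders and fill the entry for next time.   */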
It's definitely better than what NVIDIA does, inventing an entirely new ISA each year. If the hardware isn't paying the cost for a frontend, it shovels the burden onto software. There's a reason every AI app has to bundle a 500 MB matrix multiplication library in each download, and it's because GPUs force you to compile your code ten times for the last ten years of ISAs.
Part of it is that, but part of it is that people pay for getting from 95% optimal to 99% optimal, and doing that is actually a lot of work. If you peek inside the matrix multiplication library you'll notice it's not just "we have the best algorithm for each of the last 7 GPU microarchitectures" but also 7 implementations for the latest architecture, because that's how you have to be to go fast. It's kind of like how, if you take an uninformed look at glibc memcpy, you'll see there's an AVX2 path and an ERMS path, but also that it switches between algorithms based on the size of the input. You can easily go "yeah, my SSE2 code is tiny and gets decent performance," but if you stop there you're leaving something on the table, and with GPUs it's this but even more extreme.
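A simplified sketch of that size-based dispatch in C (the thresholds and helper names are invented for illustration; glibc's actual selection logic is more involved and tuned per CPU):

    /* Illustrative only: copy_small/copy_avx2/copy_erms are hypothetical
       helpers, and the cutoffs are made up, not glibc's real tuning.     */
    #include <stddef.h>

    void *copy_small(void *d, const void *s, size_t n); /* short, branchy */
    void *copy_avx2 (void *d, const void *s, size_t n); /* vector loop    */
    void *copy_erms (void *d, const void *s, size_t n); /* rep movsb      */

    void *memcpy_dispatch(void *d, const void *s, size_t n) {
      if (n < 128)        return copy_small(d, s, n); /* skip vector setup */
      if (n < 256 * 1024) return copy_avx2 (d, s, n); /* cache-resident    */
      return copy_erms(d, s, n);                      /* large copies      */
    }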
Using the uOPs directly as the ISA would be a bad idea for code density.
In RISC-V land, vendors tend to target standard extensions/profiles, but when their hardware is capable of other operations they often expose those through custom extensions.
Chips and Cheese specifically talks about this in the article I mention[0].
x86 decoders take up a small but still significant share of the silicon and power budget, usually somewhere between 3% and 7%. Not a terrible cost to pay, but if legacy is your only reason, why keep paying it? It's extra watts and silicon you could dedicate to something else.
Correct. However, because ARM has fixed-length instructions the decoder can make more assumptions, keeping it simpler.
Like I said, it's only a small amount of extra silicon you're paying for the x86 tax, but with the world mostly becoming ARM-compatible, there's no longer much reason to pay it.