I would have assumed it just decodes the x86 into a 32-bit ARM-like internal ISA, similar to how a JIT works in software. x86 decoding is extremely costly in software if you build an interpreter: probably something like 30% of runtime, and that's assuming you cache your decodings. But with JIT code morphing in Blink, decoding cost drops to essentially nothing. As best as I understand it, all x86 microprocessors since the NexGen Nx586 have worked this way too. Once you're code morphing the user-facing frontend ISA, a much bigger problem rears its ugly head, which is the 4096-byte page size. That's something Apple really harped on with their M1 design, which increased it to 16 KB. It matters because translated code can't be chained across page boundaries.
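To make the page-boundary point concrete, here's a tiny sketch (just arithmetic, nothing emulator-specific): if a translated block has to stop at the end of the guest page it started in, the page size directly caps how far you can translate before chaining to the next block. The PC value below is made up; the only real numbers are the 4 KiB and 16 KiB page sizes.

    /* Minimal sketch of the page-boundary constraint on translated code. */
    #include <stdio.h>
    #include <stdint.h>

    static uint64_t bytes_left_in_page(uint64_t pc, uint64_t page_size) {
        /* assumes page_size is a power of two */
        return page_size - (pc & (page_size - 1));
    }

    int main(void) {
        uint64_t pc = 0x401f80;  /* hypothetical guest program counter */
        printf("4 KiB pages:  %llu bytes before the block must stop\n",
               (unsigned long long)bytes_left_in_page(pc, 4096));
        printf("16 KiB pages: %llu bytes before the block must stop\n",
               (unsigned long long)bytes_left_in_page(pc, 16384));
        return 0;
    }

With 16 KB pages there are simply 4x fewer boundaries to stop and re-chain at.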
It decodes to uOPs optimized for the exact microarchitecture of that particular CPU. High-performance ARM64 designs do the same.
But in the specific case of tracking variable-length instruction boundaries, that happens in the L1i cache. uOP caches make decode bandwidth less critical, but it is still important enough to optimize.
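For anyone who hasn't seen it spelled out: the usual trick is predecode bits stored alongside the L1i line, so instruction boundaries get found once at fill time instead of on every fetch. Toy sketch below; the layout is invented, not any vendor's actual format.

    /* Toy model of predecode marks in an I-cache line: one "instruction
     * starts here" bit per byte. Invented layout, for illustration only. */
    #include <stdint.h>
    #include <stdio.h>

    struct icache_line {
        uint8_t  bytes[64];     /* raw x86 bytes */
        uint64_t start_mask;    /* bit i set => bytes[i] begins an instruction */
    };

    static int instruction_starts(const struct icache_line *line) {
        /* the decoders can pick out several instructions per cycle using this */
        return __builtin_popcountll(line->start_mask);
    }

    int main(void) {
        struct icache_line line = { .start_mask = 0x8421ULL };  /* starts at 0, 5, 10, 15 */
        printf("%d instruction starts marked in this line\n", instruction_starts(&line));
        return 0;
    }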
That's called a uOP cache, which Intel has been using since Sandy Bridge (and AMD too, but I don't remember off the top of my head since when). But that's more transistors for the cache and its control mechanism.
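Rough idea in code, if it helps: index by fetch address, and on a hit hand already-decoded uOPs to the backend instead of going through the legacy decoders again. Sizes and fields here are invented and don't match Intel's or AMD's real structures.

    /* Toy sketch of a uOP cache lookup bypassing the x86 decoders on a hit. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define UOP_CACHE_SETS 64
    #define UOPS_PER_ENTRY 6

    struct uop_cache_entry {
        bool     valid;
        uint64_t tag;                    /* fetch address this entry maps */
        uint32_t uops[UOPS_PER_ENTRY];   /* pre-decoded micro-ops */
        int      uop_count;
    };

    static struct uop_cache_entry uop_cache[UOP_CACHE_SETS];

    /* On a miss, the frontend falls back to the legacy decoders and fills. */
    static const struct uop_cache_entry *uop_cache_lookup(uint64_t fetch_addr) {
        struct uop_cache_entry *e = &uop_cache[(fetch_addr >> 5) % UOP_CACHE_SETS];
        return (e->valid && e->tag == fetch_addr) ? e : NULL;
    }

    int main(void) {
        const struct uop_cache_entry *hit = uop_cache_lookup(0x401000);
        printf("%s\n", hit ? "hit: skip decode" : "miss: fall back to legacy decoders");
        return 0;
    }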
It's definitely better than what NVIDIA does, inventing an entirely new ISA each year. If the hardware isn't paying the cost of a frontend, it shovels the burden onto software. There's a reason every AI app has to bundle a 500 MB matrix multiplication library in each download, and it's because GPUs force you to compile your code ten times over for the last ten years of ISAs.
Part of it is that, but part of it is that people pay for getting from 95% optimal to 99% optimal, and doing that is actually a lot of work. If you peek inside the matrix multiplication library you'll note that it's not just "we have the best algorithm for the last 7 GPU microarchitectures" but also 7 implementations for the latest architecture, because that's what it takes to go fast. Kind of like how even a quick look at glibc memcpy shows an AVX2 path and an ERMS path, but it will also switch between algorithms based on the size of the input. You can easily go "yeah, my SSE2 code is tiny and gets decent performance", but if you stop there you're leaving performance on the table, and with GPUs it's this but even more extreme.
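The shape of that memcpy-style selection looks roughly like this: one entry point that picks an implementation by CPU feature and then again by size. The thresholds and the stand-in implementations below are made up (they just call memcpy); glibc's real selection goes through ifunc resolvers and is a lot hairier.

    /* Illustrative sketch of size- and feature-based dispatch. */
    #include <stddef.h>
    #include <string.h>

    static void *copy_small(void *d, const void *s, size_t n)     { return memcpy(d, s, n); }
    static void *copy_vector(void *d, const void *s, size_t n)    { return memcpy(d, s, n); }
    static void *copy_rep_movsb(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

    /* Feature detection stubbed out; a real version would use CPUID. */
    static int cpu_has_avx2(void) { return 1; }
    static int cpu_has_erms(void) { return 1; }

    void *dispatching_copy(void *dst, const void *src, size_t n) {
        if (n < 128)                              /* made-up threshold */
            return copy_small(dst, src, n);
        if (n < (256 * 1024) && cpu_has_avx2())   /* made-up threshold */
            return copy_vector(dst, src, n);
        if (cpu_has_erms())
            return copy_rep_movsb(dst, src, n);
        return memcpy(dst, src, n);
    }

    int main(void) {
        char src[512] = "hello", dst[512];
        dispatching_copy(dst, src, sizeof src);
        return 0;
    }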
Using the uOPs directly as the ISA would be a bad idea for code density.
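Back-of-the-envelope version of the density argument: the x86 encoding length below is real, but the uOP width is a pure assumption standing in for "a wide, fixed internal format", since the real formats are undocumented and vary by design.

    /* "add dword ptr [rdi], eax" really is 2 bytes of x86 (01 07) and cracks
     * into roughly load + add + store; the 8-byte uOP size is made up. */
    #include <stdio.h>

    int main(void) {
        int x86_bytes = 2;           /* add [rdi], eax */
        int uops = 3;                /* roughly: load + add + store */
        int assumed_uop_bytes = 8;   /* assumption for an exposed uOP format */
        printf("x86: %d bytes vs exposed uOPs: ~%d bytes\n",
               x86_bytes, uops * assumed_uop_bytes);
        return 0;
    }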
In RISC-V land, vendors tend to target standard extensions/profiles, but when the hardware is capable of other operations, they often expose those through custom extensions.
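Concretely, before the toolchain grows a mnemonic, a vendor will usually hand you a header that wraps the instruction in inline asm via the assembler's .insn directive, something like the sketch below. It needs a RISC-V toolchain to build, and the opcode/funct values are placeholders in the custom-0 opcode space, not any real vendor's encoding.

    /* Sketch only: exposing a hypothetical R-type custom instruction to C. */
    #include <stdint.h>

    static inline uint64_t vendor_op(uint64_t a, uint64_t b) {
        uint64_t result;
        __asm__ volatile (
            /* .insn r <opcode>, <funct3>, <funct7>, rd, rs1, rs2 */
            ".insn r 0x0b, 0x0, 0x0, %0, %1, %2"
            : "=r"(result)
            : "r"(a), "r"(b));
        return result;
    }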