cmovq's comments | Hacker News

Interestingly, the original Pi had the same amount of memory as the PS3, which was still available at the time of the Pi's release. It still amazes me how much we did with only 512 MB.

And you didn't even get the luxury of unified memory on the PS3 - the CPU and GPU had separate 256MB chunks.

I can still barely believe they got Grand Theft Auto 5 running on that thing.


I think it’s more amazing that the latest GTA has not only been V for three console generations, but that it’s still selling well.

What Moore giveth, Gates taketh away...

This is true, but if you run out of the 32 register names you’ll still need to spill to memory. The large register file exists to allow multiple instructions to execute in parallel, among other things.

They’re used by the internal register renamer/allocator: if it sees you storing a result to memory and then reusing the named register for a new result, it will allocate a new physical register so your instruction doesn’t stall waiting for the previous write to go through.

I do not understand what you want to say.

The register renamer allocates a new physical register when you attempt to write the same register as a previous instruction, as otherwise you would have to wait for that instruction to complete, and you would also have to wait for any instructions that would want to read the value from that register.

When you store a value into memory, the register renamer does nothing, because you do not attempt to modify any register.

The only optimization is that if a following instruction attempts to read the value stored in the memory, that instruction does not wait for the previous store to complete, in order to be able to load the stored value from the memory, but it gets the value directly from the store queue. But this has nothing to do with register renaming.

Thus if your algorithm has already used all the visible register numbers, and you will still need in the future all the values that occupy the registers, then you have to store one register into the memory, typically in the stack, and the register renamer cannot do anything to prevent this.
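
To make the spilling concrete, here is a hedged C sketch (my own illustration, not from the thread): a function keeping ~20 independent accumulators live at once exceeds the 16 architectural GPRs of x86-64, so the compiler must spill some of them to the stack, and the hardware renamer cannot remove those stores.

    #include <stddef.h>

    /* Illustrative only: with ~20 simultaneously live accumulators
       (plus the pointer and index), the compiler runs out of the 16
       x86-64 GPR names and must spill some values to the stack.
       Hardware renaming cannot remove these stores; it only removes
       false dependencies between reuses of the same register name. */
    long sum_many_strides(const long *a, size_t n)
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0,
             s5 = 0, s6 = 0, s7 = 0, s8 = 0, s9 = 0,
             s10 = 0, s11 = 0, s12 = 0, s13 = 0, s14 = 0,
             s15 = 0, s16 = 0, s17 = 0, s18 = 0, s19 = 0;
        for (size_t i = 0; i + 20 <= n; i += 20) {
            s0 += a[i];       s1 += a[i + 1];   s2 += a[i + 2];
            s3 += a[i + 3];   s4 += a[i + 4];   s5 += a[i + 5];
            s6 += a[i + 6];   s7 += a[i + 7];   s8 += a[i + 8];
            s9 += a[i + 9];   s10 += a[i + 10]; s11 += a[i + 11];
            s12 += a[i + 12]; s13 += a[i + 13]; s14 += a[i + 14];
            s15 += a[i + 15]; s16 += a[i + 16]; s17 += a[i + 17];
            s18 += a[i + 18]; s19 += a[i + 19];
        }
        return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8 + s9
             + s10 + s11 + s12 + s13 + s14 + s15 + s16 + s17
             + s18 + s19;
    }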

This is why Intel will increase the number of architectural general-purpose registers of x86-64 from 16 to 32, matching Arm AArch64 and IBM POWER, with the APX ISA extension, which will be available in the Nova Lake desktop/laptop CPUs and in the Diamond Rapids server CPUs, which are expected by the end of this year.

Register renaming is a typical example of the general strategy that is used when shared resources prevent concurrency: the shared resources must be multiplied, so that each concurrent task uses its private resource.


> When you store a value into memory, the register renamer does nothing, because you do not attempt to modify any register.

you are of course correct about everything. But the extreme pedant in me can't help pointing out that there are in fact a few mainstream CPUs [1] that can rename memory to physical registers, at least in some cases. This is done explicitly to mitigate the cost of spilling. Edit: this is different from the store-forwarding optimization you mentioned.

[1] Ryzen for example: https://www.agner.org/forum/viewtopic.php?t=41


That feature does not exist in every AMD Zen, but only in certain Zen generations, and not predictably, i.e. not in successive generations. This optimization has been introduced and then removed a couple of times. Therefore it is not an optimization whose presence you can count on in a processor.

I believe that it is not useful to group such an optimization with register renaming. The effect of register renaming is to replace a single register shared by multiple instructions with multiple registers, so that each instruction may use its own private register, without interfering with the other instructions.

On the other hand, the optimization you mention is better viewed as an enhancement of the optimization I mentioned, which is implemented in all modern CPUs: after a store instruction, the stored value persists for some time in the store queue, and subsequent instructions can access it there instead of going to memory.

With this additional optimization, stored values that are needed by subsequent instructions are retained in temporary registers, for as long as they are needed, even after the store queue is drained to memory.

Unlike with register renaming, here the purpose is not to multiply the memory locations that store a value so that they can be accessed independently. Here the purpose is to cache the value close to the execution units, to be available quickly, instead of taking it from the far away memory.

As mentioned at your link, the most frequent case when this optimization is efficient is when arguments are pushed in the stack before invoking a function and then the invoked function loads the arguments in registers. On the CPUs where this optimization is implemented the passing of arguments to the function bypasses the stack, becoming much faster.

However, this calling convention matters mainly for legacy 32-bit applications, because 64-bit programs pass most arguments in registers, so they do not need this optimization. Therefore the optimization is more important on Windows, where it is more common to run ancient 32-bit executables that have never been recompiled to 64-bit.
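
For reference, a minimal C sketch of the pattern described above, assuming a 32-bit cdecl target where arguments pass on the stack (the function names are made up):

    /* Under a 32-bit cdecl ABI, the caller pushes the arguments onto
       the stack and the callee immediately loads them back. Memory
       renaming (where present) can short-circuit this store/reload
       round trip at the rename stage instead of going through the
       store queue or the cache. */
    static int add(int x, int y) { return x + y; }

    int call_site(int a, int b)
    {
        /* Typically compiles to roughly: push b; push a; call add */
        return add(a, b);
    }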


Yes, it is not in all Zen cpus.

I don't think it makes sense to distinguish it from renaming. It is effectively aliasing a memory location (or better, an offset off the stack pointer) with a physical register, treating named stack offsets as additional architectural registers. AFAIK this is done in the renaming stage.


The named stack offsets are treated as additional hidden registers, not as additional architectural registers.

You do not access them using architectural register numbers, as you would do with the renamed physical registers, but you access them with an indexed memory addressing mode.

The aliasing between a stack location and a hidden register is of the same nature as the aliasing between a stack location at its true address in main memory and the location in the L1 cache where stack locations are normally cached in any other modern CPU.

This optimization present in some Zen CPUs just caches some locations from the stack even closer to the execution units of the CPU core than the L1 cache used for the same purpose in other CPUs, allowing those stack locations to be accessed as fast as the registers.


The stack offset (or in general the memory location's address [1]) has a name (its unique address), exactly like an architectural register, so how can it be a hidden register?

In any case, as far as I know the feature is known as Memory Renaming, and it was discussed in academia decades before it showed up in actual consumer CPUs. It uses the renaming hardware and behaves more like renaming (0-latency movs resolved at rename time, in the front end) than an actual cache (which involves an AGU and a load unit and is resolved in the execution stages, in the OoO backend).

[1] more precisely, the feature seems to use address expressions to name the stack slots, instead of actual addresses, although it can handle offset changes after push/pop/call/ret, probably thanks to the Stack Engine that canonicalizes the offsets at the decode stage.



> A miscompile of an AI program can cause bad medical advice

I think the AI program is plenty capable of causing bad medical advice on its own without being miscompiled.


On data center as well. I think AMD rightly decided to focus on larger chips for data center instead of consumer laptops where margins are tiny in comparison and growth has been slow for a few years.


In general AMD seems to not want anything to do with down-market parts.

They still have great laptop & desktop parts; in fact they're essentially the same parts as the servers (with fewer Core Complex Die (CCD) chiplets and a simpler IO die)! Their embedded chips and mobile chips are all the same chiplets too!!

And there are some APU parts that are more consumer focused, which have been quite solid. And now Strix Halo, which, were it not for DDR5 prices shooting to the moon, would be an incredible prosumer APU.

Where AMD is just totally missing is the low end. There's nothing like the Intel N100/N97/N150, which is a super ragingly popular chip for consumer appliances like NAS boxes. I'm hoping their Sound Wave design is real, materializes, and offers something a bit more affordable than their usual parts.

The news at the end of October was that their new low-end lineup is going to be old Zen 2 & Zen 3 chips. That's mostly fine, they're still amazing chips, just not quite as fast & efficient. But there are still no truly small AMD parts. https://wccftech.com/amd-prepares-rebadged-zen-2-ryzen-10-an...

It's crazy how AMD has innovated by building far, far fewer designs than in the past. There aren't a bunch of different chips designed for different price points; the whole range across all markets (for CPUs) is the same core, the same ~3 designs, variously built out.

I do wish AMD had a better low-end story. The Steam Deck is such a killer machine and no one else can make anything with such clear value, because no one else can buy a bunch of slightly weird old chips for cheap; everyone else has to buy much more expensive mainline chips. I really wish there were some smaller interesting APUs available.


Damn, I love the Strix Halo. The Framework Desktop idles at 10 W and has modern standby consuming less than 1 W, but stays fully connected, so an Xbox controller can wake it over Bluetooth etc.

My 3080 SFF PC eats 70 W idle and 400 W under load.

Game performance is roughly the same from a normie point of view.


How did you get Bluetooth wake working?!


That's the true magic of "modern standby".

The OS can just leave BT on and still get the interrupt and service it.


I have a 7840U Framework and it idles around 7-8 W with not much happening.


The Intel video encoding pipeline alone is reason to go Intel on the low end. Those low-power devices simply need better transcoding support than AMD can currently provide.


Updating this post. Found the review I was looking for!

The newest RDNA4 fixes the previously weak encoder performance for game streaming and is now competitive. Unfortunately (at release at least) AV1 is still pretty weak. https://youtu.be/kkf7q4L5xl8

One thing noted is that AMD seems to have really good output at lower bitrates (~4 min mark). Would be nice to have even deeper dives into this. And whether the quality changes over time with driver updates would also be curious to know. One of the comments details how a bunch of the asks in this video (split-frame encoding, improved AV1) already landed 1 mo after the video. Hopefully progress continues for RDNA4! https://youtube.com/watch?v=kkf7q4L5xl8&lc=UgzYN-iSC7N097XZi...


> It's crazy how AMD has innovated by building far far less designs than the past. There's not a bunch of different chips designed for different price points, the whole range across all markets (for cpus) is the same core, the same ~3 designs, variously built out.

AMD bet the farm on the chiplet architecture, and their risky bet has paid off in a big way. Intel's fortunately timed stumbling helped, but AMD ultimately made the right call about core-scaling at a time when most games and software titles were not written to take advantage of multicore parallelism. IMO, AMD deserves much more than the 25% marketshare, as Zen chips deliver amazing value.


> Their embedded chips, mobile chips are all the same chiplets too!!

Depends on where in embedded, but the laptop and APU chips are monolithic, not chiplet based.


I don't get the feeling that they've focused anywhere in particular (and maybe rightly so), they're in everything from low-powered consoles to high powered workstations and data centers, and seemingly everywhere in-between those too.


Dithering is still very common in rendering pipelines. 8 bits per channel is not enough to capture subtle gradients, and you’ll get tons of banding. Particularly in mostly monochrome gradients produced by light sources. So you render everything to a floating point buffer and apply dithering.

Unlike the examples in this post, this dithering is basically invisible at high resolutions, but it’s still very much in use.
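
For the curious, a minimal C sketch of the technique (my own illustration, not any particular engine's pipeline): noise of up to half a quantization step is added just before the float-to-8-bit conversion, which turns banding into unstructured noise. Real pipelines typically use blue noise or an ordered pattern rather than rand().

    #include <stdint.h>
    #include <stdlib.h>

    /* Quantize a linear value in [0, 1] to 8 bits, adding up to
       +/- half an LSB of noise just before rounding, so smooth
       gradients dither instead of banding. */
    static uint8_t dither_quantize(float v)
    {
        float noise = (float)rand() / (float)RAND_MAX - 0.5f; /* [-0.5, 0.5) */
        float scaled = v * 255.0f + noise;
        if (scaled < 0.0f)   scaled = 0.0f;
        if (scaled > 255.0f) scaled = 255.0f;
        return (uint8_t)(scaled + 0.5f);
    }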


Another place where dithering is useful in graphics is when you can’t take enough samples at every point to get a good estimate of some value. Add jitter to each sample and then blur, and suddenly each point will be influenced by the samples made around it, giving higher fidelity.

I recently learned the slogan “Add jitter as close to the quantisation step as possible.” I realised that “quantisation step” is not just when clamping to a bit depth, but basically any time there is an if-test on a continuous value! This opens my mind to a lot of possible places to add dithering!
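
As a toy example of that slogan (entirely my own sketch, in C): the inside/outside test of a disc is an if-test on a continuous value, and jittering the sample position recovers fractional coverage from purely binary tests.

    #include <stdlib.h>

    /* Estimate how much of the unit pixel at (px, py) is covered by
       a disc of radius r centered at the origin. The if-test is the
       quantisation step; jittering the sample position inside the
       pixel turns a hard binary edge into a smooth coverage value. */
    static float disc_coverage(float px, float py, float r, int samples)
    {
        int hits = 0;
        for (int i = 0; i < samples; i++) {
            float jx = px + (float)rand() / (float)RAND_MAX; /* jitter in [0, 1) */
            float jy = py + (float)rand() / (float)RAND_MAX;
            if (jx * jx + jy * jy <= r * r)
                hits++;
        }
        return (float)hits / (float)samples;
    }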


Hell, one could dither vertex positions and normals


A lot of display hardware uses a combination of spatial and temporal dithering these days. You can see it sometimes if you look up close, it appears as very faint flickering "snow" (the kind you'd see on old analog TV). Ironically, making this kind of dithering even less perceivable may turn out to be the foremost benefit of high pixel resolutions (beyond 1080p) and refresh rates (beyond 120Hz) since it seems that raising those specs is easier than directly improving color depth in hardware.
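
A rough sketch of the temporal half of that trick (illustrative only, not any panel's actual algorithm): a 6-bit panel approximating an 8-bit target by alternating between the two nearest levels across frames.

    /* Alternate between the two nearest 6-bit levels so the
       time-averaged output approximates the 8-bit target. The faint
       flicker this produces is the "snow" visible up close. */
    static unsigned panel_level_6bit(unsigned value8, unsigned frame)
    {
        unsigned lo   = value8 >> 2;            /* 6-bit level at or below */
        unsigned hi   = lo < 63 ? lo + 1 : 63;  /* next level up, clamped */
        unsigned frac = value8 & 3;             /* show `hi` this often, out of 4 */
        return ((frame & 3) < frac) ? hi : lo;
    }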


Adobe Illustrator 2026 has only -just- added a dithering option to their gradient tool.


> gl.drawArrays(gl.TRIANGLES, 0, 6);

Using 2 tris for this isn’t ideal because you will get duplicate fragment invocations along the diagonal seam where the triangles meet. It is slightly more efficient to use one larger triangle extending outside the viewport; the offscreen parts will be clipped and won’t generate any additional fragments. [1]

[1]: https://wallisc.github.io/rendering/2021/04/18/Fullscreen-Pa...


Also see https://michaldrobot.com/2014/04/01/gcn-execution-patterns-i...

A bit of an older article but still very relevant.

I've found that with WebGL2 you can also skip the whole upload/binding of the buffer and just emit the vertices/coordinates from the vertex shader as well.

Less of an impact than cutting it down, but if you're just trying to get a fragment shader going, why not use the least amount of data and CPU->GPU upload possible.


Yeah it’s nice even just to avoid having to set up a vertex buffer:

    // gl_VertexID 0,1,2 -> corners (0,0), (1,0), (0,1)
    ivec2 vert = ivec2(gl_VertexID & 1, gl_VertexID >> 1);

    // UVs (0,0), (2,0), (0,2) give clip positions (-1,-1), (3,-1), (-1,3):
    // a single triangle that covers the whole viewport after clipping.
    out_uv = 2.0 * vec2(vert);
    gl_Position = vec4(out_uv * 2.0 - 1.0, 0.0, 1.0);


That's a great insight, you're right.


Interestingly, a complete implementation of strtol [1] is shorter than this wrapper. If you don't like strtol's API or error handling, just implement your own.

[1]: https://github.com/gcc-mirror/gcc/blob/master/libiberty/strt...

> If an overflow is detected, it calls abort()

An aside, but this doesn't detect overflows on Windows due to both long and int being 32 bits (you'd want strtoll for that).
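
For completeness, a minimal sketch of handling strtol's errors by hand (the wrapper name is mine): check the end pointer for empty input or trailing junk, and errno for range errors.

    #include <errno.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /* Parse a base-10 long from s, rejecting empty input, trailing
       garbage, and out-of-range values, instead of calling abort(). */
    static bool parse_long(const char *s, long *out)
    {
        char *end;
        errno = 0;
        long v = strtol(s, &end, 10);
        if (end == s || *end != '\0')   /* no digits, or junk after them */
            return false;
        if (errno == ERANGE && (v == LONG_MAX || v == LONG_MIN))
            return false;               /* overflow or underflow */
        *out = v;
        return true;
    }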


> Components are stored in contiguous arrays in a SoA (Structure of Arrays) manner, which allows for fast iteration and processing

Does this actually matter in Lua? Aren’t all array elements going to be pointers to heap allocated objects anyways?

The point of SoA is that the values you’re likely to access are adjacent in memory, but if you’re chasing a pointer to get each value then you’re not getting anything out of it.
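
For reference, the layout difference in question, sketched in C rather than Lua (types are illustrative):

    /* Array of Structures: each entity's fields are interleaved, so a
       pass over positions also drags velocities through the cache. */
    struct EntityAoS { float px, py, vx, vy; };

    /* Structure of Arrays: each component is contiguous, so a pass
       that reads only positions streams exactly the bytes it needs. */
    struct EntitiesSoA {
        float px[1024], py[1024];
        float vx[1024], vy[1024];
    };

    static void move_right(struct EntitiesSoA *e, float dx)
    {
        for (int i = 0; i < 1024; i++)
            e->px[i] += dx;   /* touches only the px array */
    }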


Yes, organizing components as SoA can provide a significant performance boost in Lua, especially with LuaJIT. Both iteration and element access become faster, and it also reduces memory allocations and GC pressure when creating entities. And yes, Lua tables can be contiguous in memory if you use them carefully.


Do you have any published benchmarks?


Comparative benchmarks are a big task on their own, and usually the author's library wins them. I have internal benchmarks in the repository, but they are not designed for comparison or for evaluation by outsiders. Maybe I'll get to that someday.

As for the SoA approach, here you can find a small and exaggerated example: https://luajit.org/ext_ffi.html


Lua uses tagged unions so that primitives are stored inline within a table. Some time ago I benchmarked this and the perf gains from SoA were significant. Besides, even if you had to chase pointers, SoA still means you can reduce the number of allocations.


One thing I noticed is that projects written in Rust always mention it in the title (there's one on the front page right now), compared to other languages that don't. That probably adds to the numbers.


The Crossfit of programming.


Go projects often do the same thing.

