My reading is that the 286 doesn't really have a lot of addressing modes the way the 68000 and friends do; rather, every address is generated by summing an optional 8- or 16-bit immediate displacement and from zero to two registers. There are no modes where you do one memory fetch, then use the result as the base address for a second fetch, which is arguably a vaguely RISC-flavored choice. There is a one-cycle penalty for summing all three elements ("based indexed mode").
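To make that concrete, a sketch of how a single C access maps onto one memory operand (the register choice is mine for illustration, not what any particular compiler emits):

    /* Illustrative only: one C load mapping onto the 8086/286
       "based indexed + displacement" form. The address is computed as
       base register + index register + 8/16-bit displacement, all in a
       single memory operand -- no second dependent fetch is possible. */
    short get_field(short *rec, int i) {
        return rec[i + 2];      /* e.g. mov ax, [bx + si + 4] */
    }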
What you say about memory indirect addressing is true only of the MC68020 (1984) and later CPUs.
The MC68000 and MC68010 had essentially the same addressing modes as the 80286, i.e. indexed addressing with up to 3 components (base register + index register + displacement).
The difference is that the addressing modes of the MC68000 could be used in a very regular way: all 8 address registers were equivalent, and all 8 data registers were equivalent.
To reduce opcode size, the 8086 and 80286 permitted only certain combinations of registers in their addressing modes, and they did not allow auto-increment and auto-decrement modes except in special instructions with dedicated registers (PUSH, POP, MOVS, CMPS, STOS, LODS). The result is an instruction set where no 2 registers are alike, which increases the cognitive burden on the programmer.
The 80386 not only added extra addressing modes taken from the DEC VAX (e.g. scaled indexed addressing), but it also made the addressing modes much more regular than those of the 8086/80286, even though it preserved the restriction of auto-increment and auto-decrement to a small set of special instructions.
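For comparison, a sketch of what the 386's scaled indexed form buys you (again, the registers are just for illustration):

    /* On the 386, the index register can be scaled by 1, 2, 4, or 8
       inside the memory operand itself, so a 32-bit array access needs
       no separate shift instruction to scale the index. */
    int third(int *a, int i) {
        return a[i + 2];        /* e.g. mov eax, [ebx + esi*4 + 8] */
    }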
There's a straightforward answer to the "why not" question: because it will result in codebases with the same kinds of memory unsafety and vulnerabilities as existing C code.
If an LLM is in fact capable of generating code free of memory safety errors, then it's certainly also capable of writing the Rust types that guarantee this and are checkable. We could go even further and have automated generation of proofs, either in C using tools similar to CompCert, or perhaps something like ATS2. The reason we don't do these at scale is that they're tedious and verbose, and that's presumably something AI can solve.
Similar points were also made in Martin Kleppmann's recent blog post [1].
It is equally odd to me that people want to cling so hard to C when something like Rust (and other modern languages, for that matter) have so much nicer ecosystems, memory safety aside. I mean, C doesn't even have a built-in hashtable or vector, let alone pattern matching, traits, and sum types. I get that this is about AI and vibe coding, but we aren't at a point yet where zero human interaction is reasonable, so every codebase should assume some level of hybrid human/AI involvement. Why people want so badly to start a new codebase in C is beyond me (and yes, I've written a lot of C in my time, and I don't hate it, but it didn't age well in expressiveness).
> It is equally odd to me that people want to cling so hard to C when something like Rust (and other modern languages, for that matter) have so much nicer ecosystems, memory safety aside.
Simplicity? I learned Rust years ago (when it was still pre-release), and when I now look at a lot of codebases, I can barely get a sense of what is going on, with all the new stuff that got introduced. It's like looking at something familiar and different at the same time.
I do not feel the same when I see Go code, since so little has changed or been added to it. The biggest addition is probably generics, and even that is rarely used.
For me, this is what I think appeals to C programmers: the fact that the language does not evolve and has stayed static.
Compare this to C++, which has become a mess over time (and I know I'll get downvoted for this): Rust feels like it's going way too far down the Rust++ route.
Everybody and their dog wants something added to make Rust do more things, but at the same time, it feels like it's repeating C++'s history. I have seen the same issue with other languages that started simple and then became monsters of feature sets. D comes to mind.
So when you look at codebases from different developers, the different styles that come from using different feature sets create a disconnect and make it harder to read other people's code. With C, because of the language's limits, you're more often funneled down a single, easier-to-read way of writing the same code. If that makes sense?
Proofs of what? "This new feature should make the 18 to 21 year old demographic happy by aligning with popular cultural norms". This would be difficult to formalize as a proof.
Memory safety in particular; UB in general, actually (you've got to watch out for integer overflows, among other things). But one could prove arbitrary properties, including the absence of panics (which would have been helpful in a recent Cloudflare outage), etc.
In order to prove the absence of UB, you have to be able to reason about other things. For example, to safely call qsort, you have to prove that the comparison function is a total order. That's not easy, especially if you're comparing larger and more complicated structures with pointers.
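To make that concrete, here's the classic way the total-order requirement gets violated in real C code (the overflow bug is well known; this sketch isn't from any particular codebase):

    #include <stdlib.h>

    /* NOT a total order: the subtraction can overflow, so for some
       inputs cmp(a, b) and cmp(b, a) can both report "less than".
       Passing this to qsort is undefined behavior per the C standard. */
    int bad_cmp(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    /* Overflow-free and a genuine total order. */
    int good_cmp(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        int v[] = {2000000000, -2000000000, 7};
        qsort(v, 3, sizeof v[0], good_cmp);   /* fine */
        /* qsort(v, 3, sizeof v[0], bad_cmp) is UB: the subtraction
           2000000000 - (-2000000000) overflows int. */
        return 0;
    }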
And of course, proving the lack of pointer aliasing in C is extremely difficult, even more so if pointer arithmetic is employed.
In this context it's proofs of properties about the program you're writing. A classic one is that any lossless compression algorithm should satisfy decompress(compress(x)) == x for any x.
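As a sketch, that property is directly executable as a test; compress and decompress here are hypothetical stand-ins for whatever codec is being checked:

    #include <assert.h>
    #include <string.h>

    /* Hypothetical codec signatures, stand-ins for the code under test;
       each returns the number of bytes written to 'out'. */
    size_t compress(const unsigned char *in, size_t n, unsigned char *out);
    size_t decompress(const unsigned char *in, size_t n, unsigned char *out);

    /* The proof obligation as a runtime check; a real proof would cover
       all x, not just the inputs we happen to test. */
    void check_roundtrip(const unsigned char *x, size_t n) {
        unsigned char c[4096], d[4096];   /* sized for the sketch only */
        size_t cn = compress(x, n, c);
        size_t dn = decompress(c, cn, d);
        assert(dn == n && memcmp(d, x, n) == 0);
    }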
That's because the 1-instruction variant may read past the end of an array. Let's say s is a single null byte at 0x2000fff, for example, and that memory is only mapped up to, but not including, 0x2001000; the function as written is fine, but the optimized version may page fault.
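A sketch of the shape of the bug (the addresses in the comments are the made-up ones from above):

    #include <stdint.h>
    #include <string.h>

    /* Safe: s[1] is only read when s[0] is non-null, so a string that is
       a lone '\0' at 0x2000fff never touches the unmapped 0x2001000. */
    int has_two_bytes(const char *s) {
        return s[0] != '\0' && s[1] != '\0';
    }

    /* The single-load "optimization": one 16-bit load covering s[0] and
       s[1] reads s[1] unconditionally, and faults when s sits on the
       last byte of the mapping. (Little-endian byte order assumed.) */
    int has_two_bytes_fast(const char *s) {
        uint16_t w;
        memcpy(&w, s, 2);             /* always reads both bytes */
        return (w & 0xff) && (w >> 8);
    }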
A grim portent for their mental health, given the attempt to reframe a judgement that demolished them and called them "character assassins" as supportive.
Really, though, this is the first time I've ever looked at TechRights for real, and the whole place is very... Always Sunny meme.
Unfortunately graphics APIs suck pretty hard when it comes to actually sharing memory between CPU and GPU. A copy is definitely required when using WebGPU, and also on discrete cards (which is what these APIs were originally designed for). It's possible that using native APIs directly would let us avoid copies, but we haven't done that.
There were at least two renderers written for the CM2 that used strips. At least one of them used scans and general communication; most likely both did.
1) For the given processor set, where each processor holds an object, 'spawn' a processor in a new set, one processor for each span.
(a) The spawn operation consists of the source processors setting the number of nodes in the new domain, then performing an add-scan, then sending the total allocation back to the front end.
The front end then allocates a new power-of-2 shape that can hold those.
The object-set then uses general communication to send scan information to the first of these in the strip-set (the address is left over from the scan).
(b) In the strip-set, use a mask-copy-scan to get all the parameters to all the elements of the scan set.
(c) Each of these elements of the strip set determines the pixel location of the leftmost element.
(d) Use a general send to seed the strip with the parameters of the strip.
(e) Scan those using a mask-copy-scan in the pixel-set.
(f) Apply the shader or the interpolation in the pixel-set.
Note that steps (d) and (e) also depend on encoding the depth information in the high bits and using a max combiner to perform z-buffering.
Edit: There must have been an additional span/scan in a pixel space that is then sent to image space with z-buffering; otherwise strip seeds could collide and be sorted by z, which could drop pixels from the losing strip.
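For anyone unfamiliar with the primitives named above, here's a rough sequential C model of their semantics; on the CM2 each loop iteration would be one processor, and this is just my sketch to pin down what the steps compute:

    #include <stddef.h>

    /* add-scan: exclusive prefix sum, used in (a) to assign each object
       a contiguous block of processors in the new set. */
    void add_scan(const int *in, int *out, size_t n) {
        int acc = 0;
        for (size_t i = 0; i < n; i++) { out[i] = acc; acc += in[i]; }
    }

    /* mask-copy-scan: within each segment (mask marks segment heads),
       copy the head's value to every element -- used in (b) and (e) to
       spread parameters down a strip or span. Segments are assumed to
       start with the mask set. */
    void mask_copy_scan(const int *in, const int *mask, int *out, size_t n) {
        int cur = 0;
        for (size_t i = 0; i < n; i++) {
            if (mask[i]) cur = in[i];
            out[i] = cur;
        }
    }

    /* General send with a max combiner: when several elements target the
       same address, the largest value wins. Packing depth into the high
       bits makes this perform z-buffering, per the note on (d)/(e). */
    void send_max(const unsigned *val, const size_t *dest,
                  unsigned *mem, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (val[i] > mem[dest[i]]) mem[dest[i]] = val[i];
    }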
The output of this renderer is a bitmap, so you have to do an upload to the GPU if that's the environment you're targeting. As part of the larger work, we also have Vello Hybrid, which does the geometry on CPU but the pixel painting on GPU.
We have definitely thought about having the CPU renderer run while the shaders are being compiled (shader compilation is a problem) but haven't implemented it.
In any interactive environment you have to upload to the GPU on each frame to output to a display, right? Or maybe integrated SoCs can skip that? Of course you only need to upload the dirty rects, but in the worst case that's the full image.
>geometry on CPU but the pixel painting on GPU
Wow. Is this akin to running just the vertex shader on the CPU?
It just depends on what architecture your computer has.
On a PC, the CPU typically has exclusive access to system RAM, while the GPU has its own dedicated VRAM. The graphics driver runs code on both the CPU and the GPU (the GPU has its own embedded processor), so data is constantly being copied back and forth between the two memory pools.
Mobile platforms like the iPhone, as well as macOS laptops, are different: they use unified memory, meaning the CPU and GPU share the same physical RAM. That makes it possible to allocate a Metal surface that both can access, so the CPU can modify it and the GPU can display it directly.
However, you won't get good frame rates on a MacBook if you try to draw a full-screen, pixel-perfect surface entirely on the CPU; it just can't push pixels that fast. But you can write a software renderer where the CPU updates pixels and the GPU displays them, without copying the surface around.
Surely not if the CPU and video output device share common RAM?
Or with old VGA, the display RAM was mapped to known system RAM addresses and the CPU would write directly to it. (You could write to an off-screen buffer and flip for double/triple buffering.)
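The classic mode 13h pattern, roughly (a sketch assuming a DOS-era flat-address environment; the "flip" here is the software-copy variant, since mode 13h had no hardware page flipping):

    #include <string.h>

    /* Mode 13h: 320x200, 8 bits per pixel, framebuffer memory-mapped at
       0xA0000. The CPU draws with plain stores; presenting a frame is a
       copy from the off-screen buffer into display RAM. */
    #define VGA ((volatile unsigned char *)0xA0000)
    static unsigned char backbuf[320 * 200];

    void put_pixel(int x, int y, unsigned char color) {
        backbuf[y * 320 + x] = color;
    }

    void present(void) {
        memcpy((void *)VGA, backbuf, sizeof backbuf);
    }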
I regularly do remote VNC and X11 access on stuff like the Raspberry Pi Zero, and in those cases the GPU does not work; you won't be able to open a GL context at all. Also, whenever I update my kernel on Arch Linux I'm not able to open a GL context until I reboot, so I really need apps that don't need a GPU context just to show stuff.
But I seem to recall there are dirt-cheap hacks to do the same. I may be conflating it with the "resistor jammed into a DVI port" trick, which worked back in the VGA and DVI days. Memory unlocked - did this to an old Mac Mini in a closet for some reason.
It's analogous, but vertex shaders are just triangles, and in 2D graphics you have a lot of other stuff going on.
The actual process of fine rasterization happens in quads, so there's a simple vertex shader that runs on GPU, sampling from the geometry buffers that are produced on CPU and uploaded.
I've got a mostly-written emulator (in Rust). It's very easy to emulate, possibly the best gameplay bang for the emulator coding effort buck aside from NES. My main intent in writing this emulator is getting it running on an RP2350 board, like Adafruit Fruit Jam or Olimex RP2350pc.
It should also be possible to get the next generation (SNES, Genesis) on such hardware, but it's a much tighter fit and more effort.
I almost mentioned it in the talk, as an example of a language that's deployed very successfully and expresses parallelism at scale. Ultimately I didn't, as the core of what I'm talking about is control over dynamic allocation and scheduling, and that's not the strength of VHDL.
Right. This is the binary tree version of the algorithm, and is nice and concise, very readable. What would take it to the next level for me is the version in the stack monoid paper, which chunks things up into workgroups. I haven't done benchmarks against the Pareas version (unfortunately it's not that easy), but I would expect the workgroup optimized version to be quite a bit faster.
To be clear, you can express workgroup parallelism in Futhark, or rather, if the compiler sees that you've programmed your problem in such a way that it can take advantage of workgroup parallelism, it will.
But you're right, it would be interesting to see how the different approaches stack up to each other. The Pareas project linked above also includes an implementation using radix sort.