
> You could argue now that C has prevented CPUs from implementing these abstractions (because arguably, C cannot express them), but I would like to ask first how you think it should be done, and why it's not a good idea to implement it at the language/compiler level as it's currently done?

As I said top-thread, it's a matter of aesthetics. These are all Turing-complete languages, and in theory you can do whatever you like in any of them. But map/reduce/fold and friends make it much clearer, to my eye, that I'm applying the same pattern across a blob of data, and it's easier to map that in my brain to the idea "the compiler should be able to SIMD this." Contrast that with loops, which require me to look at a sequential operation and trust that the compiler will unroll it and then decide to SIMD some of it. The end result is (handwaving implementation) the same, but the aesthetic differs.
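To make that concrete, here's a minimal C++17 sketch (my own illustration, not code from the thread; the names dot_loop and dot_reduce are invented, and both assume the vectors have equal length): the std::transform_reduce version states the pattern "combine pairwise, then fold" up front, while the loop states a sequential recipe the optimizer has to reverse-engineer.

    #include <numeric>   // std::transform_reduce
    #include <vector>

    // Sequential recipe: the compiler has to prove it may unroll and vectorize this.
    double dot_loop(const std::vector<double>& a, const std::vector<double>& b) {
        double sum = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)
            sum += a[i] * b[i];
        return sum;
    }

    // Pattern first: "multiply pairwise, then reduce." The shape that maps onto
    // SIMD is stated directly rather than recovered from a loop.
    double dot_reduce(const std::vector<double>& a, const std::vector<double>& b) {
        return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0);
    }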

As you've noted, I'm not unable to do this in C or C++ or Rust. In fact, C++ is especially clever here: templates can recursively inline static implementations so that the end result of, say, the dot product of two N-dimensional vectors is "a1 x b1 + a2 x b2 + a3 x b3 + ..." for arbitrary dimension, letting the compiler see it as one expression and maximizing the chance it chooses SIMD to compute it. But getting there is so many layers of abstraction away (I had to stare at a lot of Boost code to learn that little fact about vector math) that the language gets in the way of predicting the parallelism.
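Roughly reconstructed as a hedged sketch (this is my own minimal version of the idea, not Boost's actual code): the template recursion expands the whole product into one flat expression, with no loop left for the optimizer to analyze.

    #include <array>
    #include <cstddef>

    // Compile-time recursion: Dot<N>::eval expands to
    // a[N-1]*b[N-1] + ... + a[1]*b[1] + a[0]*b[0] as a single expression.
    template <std::size_t I>
    struct Dot {
        template <typename T, std::size_t N>
        static T eval(const std::array<T, N>& a, const std::array<T, N>& b) {
            return a[I - 1] * b[I - 1] + Dot<I - 1>::eval(a, b);
        }
    };

    template <>
    struct Dot<0> {
        template <typename T, std::size_t N>
        static T eval(const std::array<T, N>&, const std::array<T, N>&) {
            return T{};
        }
    };

    template <typename T, std::size_t N>
    T dot(const std::array<T, N>& a, const std::array<T, N>& b) {
        return Dot<N>::eval(a, b);   // for N == 4: a3*b3 + a2*b2 + a1*b1 + a0*b0
    }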

> If there comes up a new way that lets CPUs understand type theory

CPUs don't understand type theory. Compilers do, and they can already use that additional information to unroll and SIMD my loops today. My annoyance isn't that it's impossible; it's that I'd rather the abstraction-to-concrete model be "parallel, except sometimes serial if the CPU doesn't have parallel instructions or we hit capacity on the pipelines," not the current model of "serial, and maybe the compiler can figure out how to parallelize it for you."

> To solve practical problems, you need to compile logic/arithmetic instructions serially to achieve the intended effect... Seems to me that it turned out that most degrees of freedom are more accidental than structured, and it's not practical to manually specify them

I agree... eventually. There's a lot of parallelism already allowed under the hood, in the space below where most programmers think about their code, as evidenced by C leaving the evaluation order of operands within an expression unspecified: the source reads as one serial expression, but the compiler may schedule the subexpressions in whatever order it likes.
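A small illustration of that freedom (my own example, written in C++ where the same rule applies; f and g are invented names): the compiler may call f() and g() in either order, so even this "serial" source already grants it scheduling latitude.

    #include <cstdio>

    int f() { std::puts("f"); return 1; }
    int g() { std::puts("g"); return 2; }

    int main() {
        // The operands of '+' may be evaluated in either order: a conforming
        // compiler can call f() before g() or g() before f(). The source looks
        // serial, but the ordering is not pinned down by the language.
        int x = f() + g();
        std::printf("%d\n", x);
    }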

Whether degrees of freedom evolved by accident is irrelevant to whether a new language could specify those parts of the system (sequential vs. intentionally-undefined ordering) explicitly. C, for example, has lots of undefined behavior around memory management; Rust constrains it. It's up to the language designer what is bound and what is allowed to be an arbitrary implementation detail, intentionally left undefined to give flexibility to compilers.
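As a hedged illustration of that difference (my own minimal example, not from the thread): the snippet below compiles cleanly in C or C++ but reads through a dangling pointer, which the standard leaves undefined; the equivalent Rust program is rejected by the borrow checker before it ever runs.

    #include <cstdio>

    int main() {
        int* p;
        {
            int x = 42;
            p = &x;              // p points at a local about to go out of scope
        }                        // x's lifetime ends here
        std::printf("%d\n", *p); // undefined behavior: dereferencing a dangling
                                 // pointer; Rust's borrow checker rejects the
                                 // equivalent program at compile time
    }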

Even the modern x86 instruction set is a bit of a lie; under the hood, CPUs emulate it by taking chunks of instructions and data and breaking them down for simultaneous execution on multiple parallel pipelines (including speculative work that never goes anywhere and is thrown away after a branch misprediction). CPUs wouldn't be nearly as fast as they are if they couldn't do that.

I'm not advocating for breaking the x86 abstraction; that's a bit too ambitious. But I'd like to see a language take off that abandons the embarrassingly serial, PDP-11-era mental model in favor of a parallel one.


