ARM is cache coherent, meaning it has some kind of MESI protocol (or a variant like MOESI or MESIF) going on.
ARM is _RELAXED_ however, or perhaps... "more relaxed" than x86. More reorderings are possible on ARM, but that only makes memory fences (load-acquire, store-release) even more important.
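A minimal sketch in C11 atomics of the classic publication pattern (function and variable names are mine, just for illustration):

```c
#include <stdatomic.h>

int payload;            /* plain data */
atomic_int ready = 0;   /* publication flag */

/* Thread 1: write the data, then publish it. The release store keeps
 * the payload write from being reordered after the flag write. */
void produce(void) {
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Thread 2: the acquire load pairs with the release store, so once
 * ready == 1 is observed, payload == 42 is guaranteed to be visible. */
int consume(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin */
    return payload;
}
```

With `memory_order_relaxed` on both sides, the consumer could legally see `ready == 1` but a stale `payload` on ARM; on x86 the hardware (TSO) happens to save you, though the compiler is still allowed to break it.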
> but that only makes memory fences (load-acquire, store-release) even more important.
I would say the importance lies in the correctness of the algorithm that implements the shared-memory synchronization.
Lowering the algorithm/code to the architecture instructions that enforce memory ordering is trivial at that point. So what if that means a fence instruction isn't needed on x86 but is needed on arm64 at some particular point in the code?
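A sketch of how the same source line lowers differently (typical gcc/clang codegen at -O2 as I understand it; details vary by compiler version):

```c
#include <stdatomic.h>

atomic_int flag;

void publish(void) {
    /* Same portable source, different lowering:
     *   x86-64: a plain "mov" -- every store is already a release
     *           store under x86-TSO, so no extra fence is emitted.
     *   arm64:  "stlr" -- the dedicated store-release instruction,
     *           because the hardware would otherwise be free to
     *           reorder. */
    atomic_store_explicit(&flag, 1, memory_order_release);
}
```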
Makes me wonder why there are no portable systems-language keywords that abstract "hey, I know my peripheral touched this DRAM, please don't take it from the cache".
I suspect you're talking about the case where the peripheral issues a memory read request and you want the coherence protocol to return the value from the CPU cache via a snoop, instead of the value having already been evicted from the CPU cache so the request has to go to DRAM (off-chip memory)?
If the peripheral issues a memory write, that location in the CPU cache must be invalidated so a CPU memory read of the location does not return an old/stale value.
In my own experimentation (not a real-world use case) on a very specific system, I was surprised by the rate of peripheral read requests that resulted in snoop hits, where the value would be returned from the CPU cache (instead of from the DDR PHY controller). The baseline rate was surprisingly low. Modifying the experiment so that the CPU continuously read the memory while the CPU-peripheral interaction was taking place resulted in much, much higher snoop hit rates. Even then, the overall performance difference between the two cases was not nearly as big as I had hoped. Perhaps a value returned from the DDR PHY controller was not as slow as I expected (some unknown/unexpected caching or bypassing behavior in the DDR controller?). Again, this was not a use case with real-world memory access patterns...
A language keyword for "please don't take it from cache" is tricky, because it would be an incredibly low-level specifier, used purely for performance, in a system whose performance is very complex to reason about. Maybe having more knobs could help (it is much easier to use such a specifier than to write code that has the CPU continuously access the memory, hoping that keeps it in cache). But I think this could get into the realm where people get distracted by performance and start doing things in its name without proper controls and measurements in place to help understand what may actually be happening in the system.
Instructions related to the memory model exist for correctness. Memory prefetch instructions are just suggestions to an already sophisticated memory unit. Memory QoS can be thought of as having an impact on performance, but it is a much higher-level solution aimed at partitioning resources.
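The closest existing knob I can think of is a prefetch hint, and it is exactly that, a hint. A sketch using the GCC/Clang builtin (the buffer, function name, and stride assumption are mine):

```c
#include <stddef.h>

/* Sketch: periodically walk a buffer with prefetch hints, hoping to
 * keep it cache-resident before the peripheral reads it. The builtin
 * is advisory only: the memory unit may ignore it, and nothing
 * guarantees the lines actually stay in cache. */
void keep_warm(const char *buf, size_t len) {
    for (size_t i = 0; i < len; i += 64)        /* assume 64B lines */
        __builtin_prefetch(&buf[i], /*rw=*/0, /*locality=*/3);
}
```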
Close, but not fully. Volatile forces the compiler to emit an actual load instruction (it can't reuse a value cached in a register), but it implies no memory barriers. In ye (really) olden days that seemed enough, but then, you know, backwards compatibility...
Well, memory ordering is what Java's "volatile" actually does enforce.
C's "volatile" isn't enough, but more recent C compilers have atomics and memory models.
The memory-model problem wasn't solved until the 00s (that late!!!), when multicore CPUs became more commonplace. Java (with JSR-133 in 2004), C++11, and other languages / language updates after that point quickly adopted formal memory models. And the rest is history.