You're either talking about cache latency, or still talking about first-gen EPYC/Threadripper rather than the current generation codenamed Rome. On a cache miss, all chiplets on a single-socket Rome system have roughly equal latency for a DRAM fetch, regardless of which physical address is being fetched. Any differences are insignificant compared to inter-socket memory access or fetching from a different chiplet's DRAM on first-gen EPYC. And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't the right number for Rome.
"And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't the right number for Rome."
You can configure Rome systems with 1, 2, or 4 NUMA domains per socket (NPS1, NPS2, or NPS4, where NPS stands for "NUMA nodes per socket"). Memory bandwidth is higher if you configure as NPS4, but that setting exposes different memory latencies depending on where the memory is relative to the accessing core.
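The domain/channel split those three settings imply can be sketched like this (an illustrative model of the documented NPS options, not anything queried from real hardware):

```python
# Model of Rome's NPS ("NUMA nodes per socket") options: a single socket has
# 8 DDR4 channels, and each NPS setting splits them into equal NUMA domains.
# Illustrative model only, not a hardware query.
NPS_MODES = {
    "NPS1": 1,  # one domain, all 8 channels interleaved together
    "NPS2": 2,  # two domains of 4 channels each
    "NPS4": 4,  # four domains of 2 channels each
}

TOTAL_CHANNELS = 8

def channels_per_domain(mode: str) -> int:
    """Channels interleaved within each NUMA domain for a given NPS mode."""
    return TOTAL_CHANNELS // NPS_MODES[mode]

for mode, domains in NPS_MODES.items():
    print(f"{mode}: {domains} NUMA domain(s), {channels_per_domain(mode)} channels each")
```

The tradeoff in the thread follows directly from this: fewer domains means wider interleaving (more uniform, more channels behind one node), more domains means each node only covers its nearest channels.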
It's really impressive that you can get uniform latency to memory for 64 cores on the 7702 chips (when configured as NPS1).
The underlying hardware reality is that the IO die is organized into quadrants instead of being a full crossbar switch between the 8 CCDs and an 8-channel DRAM controller. Whether to enumerate it as 1, 2, or 4 NUMA domains per socket depends very much on what kind of software you plan to run.
Saying that memory bandwidth is higher when configured as NPS4 probably isn't universally true, because that setting constrains the bandwidth a single core can use to what is effectively a dual-channel slice. For a benchmark with the appropriate thread count and sufficiently low core-to-core communication, NPS4 probably makes it easiest to maximize aggregate memory bandwidth utilization (this seems to be what Dell's STREAM Triad results show, with NPS4 and 1 thread per CCX as the optimal settings for that benchmark).
I was responding to your claim that "And even if you wanted to treat each chiplet as a separate NUMA node, 4 isn't the right number for Rome", which was incorrect. 4 is one of the three possible options for the number of NUMA domains.
Your comments about Rome are completely incorrect. The memory controllers in this architecture are organized into four quadrants, and some quadrants are further from some CCDs than others. In the worst case, accessing the furthest-away quadrant adds 25ns to main memory latency.
You can put this part in "NPS1" mode, which interleaves all channels into an apparently uniform memory region; however, it is still the case that 1/4 of memory takes an extra 25ns to access and 1/2 of it takes an extra 10ns, compared to the remainder. Putting the part into NPS1 mode just reports a single node in the ACPI SRAT table, so the OS isn't aware of the difference.
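Taking those figures at face value, the blended penalty a core sees under NPS1 interleaving works out as a simple weighted average (back-of-the-envelope arithmetic using only the latency numbers above):

```python
# Back-of-the-envelope: if 1/4 of interleaved memory costs +25 ns, 1/2 costs
# +10 ns, and the remaining 1/4 is baseline, the average added latency is
# the fraction-weighted sum of the penalties.
fractions_and_penalties_ns = [
    (0.25, 25.0),  # furthest quadrant
    (0.50, 10.0),  # middle-distance quadrants
    (0.25, 0.0),   # nearest quadrant (baseline)
]

avg_extra_ns = sum(frac * penalty for frac, penalty in fractions_and_penalties_ns)
print(f"average extra latency under NPS1 interleaving: {avg_extra_ns:.2f} ns")
# -> 11.25 ns on average, even though the OS sees one uniform node
```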
But don't take it from me. AMD's developer docs clearly state, and I am quoting, "The EPYC 7002 Series processors use a Non-Uniform Memory Access (NUMA) Microarchitecture."
> AMD's developer docs clearly state, and I am quoting,
Please quote something that unambiguously supports your claims. What you've quoted is insufficient.
What I said about a single-socket Rome processor is not "completely incorrect" under any reasonable interpretation. The latency and bandwidth limitations in moving data from one side of the IO die to the other are much smaller than those of the inter-socket connections traditionally implied by NUMA, or of the inter-chiplet communication in first-gen EPYC/Threadripper.
If you want to insist that NUMA apply to even the slightest measurable memory performance asymmetry between cores, please say so, so that we may know ahead of time whether the discussion is also going to lead to esoteric details like the ring vs mesh interconnects on Intel's processors.
If you're not sensitive to main memory latency, just say that. Don't try to tell me that 25ns is not relevant. It's ~100 CPU cycles, and it's also about a 25% swing from fastest to slowest.
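Those two figures check out arithmetically, given some assumed baselines (the ~4 GHz clock and ~100 ns baseline DRAM latency below are my assumptions for the sanity check, not numbers stated in the thread):

```python
# Sanity-check the "~100 CPU cycles" and "~25% swing" claims.
# Assumed values (not from the thread): ~4 GHz core clock, ~100 ns baseline latency.
clock_ghz = 4.0
baseline_latency_ns = 100.0
extra_latency_ns = 25.0

cycles_lost = extra_latency_ns * clock_ghz  # ns * (cycles per ns)
swing_pct = 100.0 * extra_latency_ns / baseline_latency_ns

print(f"{extra_latency_ns} ns at {clock_ghz} GHz = {cycles_lost:.0f} cycles")
print(f"swing from fastest to slowest: {swing_pct:.0f}%")
```

At a lower sustained clock the cycle count shrinks proportionally, but it stays in the same rough ballpark.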
Intel's server/workstation CPUs have had 2 memory controllers for the last several generations, so even if the whole CPU is seen as a single NUMA node by the software, the actual memory access latency has always differed from core to core, depending on the core's position on the interconnect mesh or ring.
So what???
The initial posting was about whether the CPU is seen as a single NUMA node or as multiple NUMA nodes by the software, not about having equal memory access latency for all cores, which hasn't been true for any server/workstation CPU, from any vendor, in many years.