I bought a Gigabyte X870E board with 3 PCIe slots (PCIe5 16x, PCIe4 4x, PCIe3 4x) and 4 M.2 slots (3x PCIe5, 1x PCIe 4). Three of the M.2 slots are connected to the CPU, and one is connected to the chipset. Using the 2nd and 3rd M.2 CPU-connected slots causes the board to bifurcate the lanes assigned to the GPU's PCIe slot, so you get 8x GPU, 4x M.2, 4x M.2.
I wish you didn't have to buy Xeon or Threadripper to get considerably more PCIe lanes, but for most people I suspect this split is acceptable. The penalty for gaming going from 16x to 8x is pretty small.
IIRC, X870 boards are required to spend some of their PCIe lanes on providing USB4/Thunderbolt ports. If you don't want those, you can get an X670 board that uses the same chipset silicon but provides a better allocation of PCIe lanes to internal M.2 and PCIe slots.
Even with a Threadripper you're at the mercy of the motherboard design.
I use a ROG board that has 4 PCIe slots. While each can physically seat an x16 card, only one of them has 16 lanes -- the rest are x4. I had to demote my GPU to a slower slot in order to get full throughput from my 100GbE card. All this despite having a CPU with 64 lanes available.
I don't think the Threadripper platform is to blame for you buying a board with potentially the worst possible PCIe lane routing. The latest generation has 88 usable lanes at minimum, most boards have 4x x16 slots, and Pro supports 7x Gen 5.0 x16 links, an absolutely insane amount of IO. "At the mercy of motherboard design": do the absolute minimum amount of research and pick any other board?
Okay, but then I need to ask what kind of use case doesn't mind the extra latency from ethernet but does care about the difference between 40Gbps and 70Gbps.
Though for the most part the performance cost of going down to 8x PCIe is pretty tiny: only a couple of percent at most.
[0] shows a pretty "worst case" impact of 1-4%. That's on the absolute highest-end card possible (a GeForce RTX 5090), pushed down to 16x PCIe 3.0. A lower-end card would likely show an even smaller difference. They even showed zero impact from 16x PCIe 4.0, which is the same bandwidth as 8x of the PCIe 5.0 lanes supported on X870E boards like the one you mentioned.
If you're not on a gaming use case and know you're already PCIe limited, it could be larger. But people with that sort of use case likely already know what to look for, and have systems tuned to it rather than a "generic consumer gamer board".
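For anyone who wants to check the "16x PCIe 4.0 is the same bandwidth as 8x PCIe 5.0" point, here is a rough back-of-the-envelope sketch. It assumes the standard per-lane signalling rates (2.5/5/8/16/32 GT/s for Gen 1 through 5) and line encodings (8b/10b for Gen 1-2, 128b/130b from Gen 3 on), and it ignores packet and protocol overhead, so real-world throughput will be somewhat lower. The function and table names are made up for illustration.

```python
# Per-lane raw signalling rate in GT/s for each PCIe generation.
GT_PER_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}


def encoding_efficiency(gen):
    """Fraction of transferred bits that carry data (line-encoding overhead only)."""
    # Gen 1-2 use 8b/10b encoding; Gen 3 and later use 128b/130b.
    return 8 / 10 if gen <= 2 else 128 / 130


def link_gbytes_per_s(gen, lanes):
    """Approximate one-direction link bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * encoding_efficiency(gen) * lanes / 8


if __name__ == "__main__":
    for gen, lanes in [(5, 16), (5, 8), (4, 16), (3, 16)]:
        print(f"PCIe {gen}.0 x{lanes}: {link_gbytes_per_s(gen, lanes):.2f} GB/s")
```

Each generation doubles the per-lane rate, so halving the lane count while moving up one generation is a wash, which is why Gen 5 x8 and Gen 4 x16 come out identical here.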
For Skylake, Intel ran 16 lanes of pci-e to the CPU, and ran DMI to the chipset, which had pci-e lanes behind it. Depending on the chipset, there would be anywhere from 6 lanes at pci-e 2.0 to 20 lanes at pci-e 3.0. My wild guess is that a board from back then would have put m.2 behind the chipset and no cpu attached ssd for you; that fits with your report of the GPU having all 16 lanes.
But, if you had the nicer chipsets, Wikipedia says your board could split the 16 cpu lanes into two x8 slots or one x8 and 2 x4 slots, which would fit. This would usually be dynamic at boot time, not at runtime; the firmware would typically look if anything is in the x4 slots and if so, set bifurcation, otherwise the x16 gets all the lanes. Some motherboards do have PCI-e switches to use the bandwidth more flexibly, but those got really expensive; I think at the transition to pci-e 4.0, but maybe 3.0?
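To make the boot-time decision described above concrete, here is a toy sketch of the logic. This is not real firmware code; the function name, the three-slot layout, and the exact rules are assumptions made for illustration, and actual boards differ in which slots share lanes.

```python
def choose_bifurcation(second_slot_used, third_slot_used):
    """Return lane widths for (primary, second, third) CPU-attached slots.

    Hypothetical model: 16 CPU lanes shared between one primary slot
    and two secondary slots, decided once at boot.
    """
    if third_slot_used:
        # Anything in the third slot forces the x8/x4/x4 split.
        return (8, 4, 4)
    if second_slot_used:
        # Only the second slot populated: split into two x8 links.
        return (8, 8, 0)
    # Nothing else populated: the primary slot keeps all 16 lanes.
    return (16, 0, 0)


if __name__ == "__main__":
    for used in [(False, False), (True, False), (True, True)]:
        print(used, "->", choose_bifurcation(*used))
```

The point is that the split is decided by what is plugged in when the machine boots, so adding a card later means the GPU link width changes on the next boot, not on the fly.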
Indeed. I dug out the manual (MSI H170 Gaming M3), which has a block diagram showing the M2 port behind the chipset, which is connected via DMI 3 to the CPU. In my mind, the chipset was connected via actual PCIe, but apparently, it's counted separately from the "actual" PCIe lanes.
Intel's DMI connection between the CPU and the chipset is little more than another PCIe x4 link. For consumer CPUs, they don't usually include it in the total lane count, but they have sometimes done so for Xeon parts based off the consumer silicon, giving the false impression that those Xeons have more PCIe lanes.