I am still using an FX-8xxx (and probably will until RISC-V beats it), and haven't even bothered overclocking it yet.
Does anyone know what kind of issues having "The frontend, FPU, and L2 cache [] shared by two threads" causes compared to 2 real cores / 2 hyperthreads on 1 core / 4 hyperthreads on 2 cores?
Read the "bottlenecks in Bulldozer" section, particularly the parts about instruction decode (never say x86 doesn't have any overhead) and instruction fetch. Or see the excerpt I posted below.
(author here) With hyperthreading/SMT, all of the above are shared, along with OoO execution buffers, integer execution units, and load/store unit.
I wouldn't say there are issues with sharing those components. More that the non-shared parts were too small, meaning Bulldozer came up quite short in single threaded performance.
My ELI5 understanding is it's a problem of throughput and resource availability.
Long story short, running two threads with only enough hardware to run just one at full speed means any thread that tries to run at full speed will not be able to.
It would work fine with two integer-heavy loads, or mostly integer work with the occasional FPU need. The problem wasn't throughput so much as availability: it didn't have enough FPU resources, so threads would stall waiting on the shared FPU, and the module would behave in the worst case (like a single core) more often than not.
The problem isn't execution unit throughput at all, it's decode: running a second thread on the module bottlenecks the decoder and forces it to alternate between servicing the two threads. And it doesn't matter what's running on the other core; the decoder simply cannot service both cores in the same cycle, regardless of instruction type, even if the first core isn't using all of its decode width. If it has to decode a macro-op, it can even stall the other thread for multiple cycles.
> Each decoder can handle four instructions per clock cycle. The Bulldozer, Piledriver and Excavator have one decoder in each unit, which is shared between two cores. When both cores are active, the decoders serve each core every second clock cycle, so that the maximum decode rate is two instructions per clock cycle per core. Instructions that belong to different cores cannot be decoded in the same clock cycle. The decode rate is four instructions per clock cycle when only one thread is running in each execution unit.
...
> On Bulldozer, Piledriver and Excavator, the shared decode unit can handle four instructions per clock cycle. It is alternating between the two threads so that each thread gets up to four instructions every second clock cycle, or two instructions per clock cycle on average. This is a serious bottleneck because the rest of the pipeline can handle up to four instructions per clock.
> The situation gets even worse for instructions that generate more than one macro-op each. All instructions that generate more than two macro-ops are handled with microcode. The microcode sequencer blocks the decoders for several clock cycles so that the other thread is stalled in the meantime.
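To make the alternation concrete, here's a toy Python sketch (my own simplification, not the actual hardware; the function name and 4-wide width are just taken from the excerpt above): a 4-wide decoder served round-robin halves each core's decode rate as soon as both cores in the module are active.

```python
def decode_throughput(cycles, two_threads):
    """Toy model of a shared 4-wide decoder alternating between two cores."""
    decoded = [0, 0]  # instructions decoded per core
    for c in range(cycles):
        if two_threads:
            decoded[c % 2] += 4  # each core is served only every other cycle
        else:
            decoded[0] += 4      # a lone core gets the decoder every cycle
    return decoded

print(decode_throughput(1000, two_threads=False))  # [4000, 0]    -> 4 IPC
print(decode_throughput(1000, two_threads=True))   # [2000, 2000] -> 2 IPC each
```

So even though the rest of the pipeline can retire four instructions per clock, each core tops out at an average of two per clock the moment its sibling wakes up, before you even account for microcode stalls.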