For a single user prompting with one or a few prompts at a time, compute is not the bottleneck; memory bandwidth is. This is because the entire model's weights must be read from memory for every generated token, so they stream through many times per prompt. This is also why multiplexing many prompts at the same time is relatively easy and effective: many matrix multiplications can happen in the time it takes to do a single fetch from memory, so one pass over the weights can serve the whole batch.
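A back-of-envelope sketch of the bandwidth ceiling this implies. The specific model size and bandwidth figures here are illustrative assumptions, not measurements:

```python
# Rough ceiling on decode speed when inference is memory-bandwidth-bound:
# each generated token streams the full weight set from memory once.
# All numbers below are illustrative assumptions, not benchmarks.

def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound: bandwidth divided by the size of the weight set."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / weight_gb

# e.g. a 70B model at fp16 (2 bytes/param) on ~1000 GB/s of bandwidth:
print(max_tokens_per_sec(70, 2, 1000))  # ~7 tokens/s ceiling
```

The actual rate is lower once compute, KV-cache reads, and overhead are counted; this is only the memory-traffic bound.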
> This is because the entire model's weights must be read from memory for every generated token, so they stream through many times per prompt.
And this is why I'm so excited about MoE models! qwen3:30b-a3b runs at the speed of a 3B-parameter model. It's completely realistic to run it on a plain CPU with 20 GB of RAM for the model.
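A quick footprint check on the "20 GB of RAM" figure. Note that 30B parameters at fp16 would be 60 GB, so ~20 GB implies a quantized format around 4-5 bits per weight; the exact format is my assumption, not stated above:

```python
# Rough weight-memory footprint of a 30B-parameter model at different
# quantization levels. The ~20 GB figure quoted above implies roughly
# 4-5 bits/param; the specific format is an assumption here.

def weight_gb(params_billion, bits_per_param):
    return params_billion * bits_per_param / 8  # 1e9 params * bits / 8 = GB

for bits in (16, 8, 5, 4):
    print(f"{bits:2d}-bit: {weight_gb(30, bits):5.2f} GB")
# 16-bit: 60.00 GB, 8-bit: 30.00 GB, 5-bit: 18.75 GB, 4-bit: 15.00 GB
```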
Yes, but with a 400B-parameter model at fp16, that's 800GB, right? So with 800GB/s of memory bandwidth, you'd still only be able to bring the weights in once per second, i.e. about one token per second.
Edit: actually forgot the MoE part, so that makes sense.
Approximately, yes. MoE models need less bandwidth because each token only activates the weights of one or a few experts rather than the whole model. Which experts are active can change from token to token, though, so it's best if all of them fit in RAM. The sort of machines hyperscalers use to run these things have essentially 8x APUs, each with about that much bandwidth, connected to other similar boxes via InfiniBand or 800 Gbps Ethernet. Since it's relatively straightforward to split up the matrix math for parallel computation, segmenting the memory in this way allows for near-linear increases in aggregate memory bandwidth and inference performance. And it's effectively the same thing you're doing when adding GPUs.
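Putting the two sub-threads together, a sketch of dense vs. MoE bandwidth ceilings. The 400B and ~3B-active parameter counts come from the thread; the 800 GB/s figure is the round number used above:

```python
# Bandwidth ceiling for a dense model vs. an MoE model.
# Only the weights actually read per token count against bandwidth,
# which is why MoE decode is so much faster at the same total size.
# Figures are the round numbers from the discussion, not benchmarks.

def tokens_per_sec(active_params_billion, bytes_per_param, bandwidth_gb_s):
    return bandwidth_gb_s / (active_params_billion * bytes_per_param)

BW = 800  # GB/s, as in the example above

dense_400b = tokens_per_sec(400, 2, BW)  # fp16 dense: all 400B params per token
moe_3b     = tokens_per_sec(3, 2, BW)    # MoE with ~3B active params per token

print(f"dense 400B:    ~{dense_400b:.1f} tok/s")  # ~1.0 tok/s
print(f"MoE 3B active: ~{moe_3b:.0f} tok/s")      # ~133 tok/s
```

Same total parameter budget in RAM, but roughly two orders of magnitude difference in the per-token memory traffic.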