For a single user prompting with one or a few prompts at a time, compute is not the bottleneck; memory bandwidth is. This is because the entire model's weights must be read from memory for every generated token, so they stream through many times per prompt. This is also why multiplexing many prompts at the same time is relatively easy and effective: many matrix multiplications can happen in the time it takes to do a single fetch from memory, so one pass over the weights can serve the whole batch.
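A back-of-envelope sketch of the bandwidth ceiling this implies. The specific model size and bandwidth figures here are illustrative assumptions, not measurements:

```python
# Rough ceiling on decode speed when inference is memory-bandwidth-bound:
# each generated token streams the full weight set from memory once.
# All numbers below are illustrative assumptions, not benchmarks.

def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound: bandwidth divided by the size of the weight set."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / weight_gb

# e.g. a 70B model at fp16 (2 bytes/param) on ~1000 GB/s of bandwidth:
print(max_tokens_per_sec(70, 2, 1000))  # ~7 tokens/s ceiling
```

The actual rate is lower once compute, KV-cache reads, and overhead are counted; this is only the memory-traffic bound.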
> This is because the entire model's weights must be read from memory for every generated token, so they stream through many times per prompt.
And this is why I'm so excited about MoE models! qwen3:30b-a3b runs at the speed of a 3B-parameter model. It's completely realistic to run it on a plain CPU with 20 GB of RAM for the model.
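A quick footprint check on the "20 GB of RAM" figure. Note that 30B parameters at fp16 would be 60 GB, so ~20 GB implies a quantized format around 4-5 bits per weight; the exact format is my assumption, not stated above:

```python
# Rough weight-memory footprint of a 30B-parameter model at different
# quantization levels. The ~20 GB figure quoted above implies roughly
# 4-5 bits/param; the specific format is an assumption here.

def weight_gb(params_billion, bits_per_param):
    return params_billion * bits_per_param / 8  # 1e9 params * bits / 8 = GB

for bits in (16, 8, 5, 4):
    print(f"{bits:2d}-bit: {weight_gb(30, bits):5.2f} GB")
# 16-bit: 60.00 GB, 8-bit: 30.00 GB, 5-bit: 18.75 GB, 4-bit: 15.00 GB
```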
Yes, but with a 400B-parameter model at fp16, that's 800GB, right? So with 800GB/s of memory bandwidth, you'd still only be able to bring the weights in once per second, i.e. about one token per second.
Edit: actually forgot the MoE part, so that makes sense.
Approximately, yes. MoE models need less bandwidth because each token only activates the weights of one or a few experts rather than the whole model. Which experts are active can change from token to token, though, so it's best if all of them fit in RAM. The sort of machines hyperscalers use to run these things have essentially 8x APUs, each with about that much bandwidth, connected to other similar boxes via InfiniBand or 800 Gbps Ethernet. Since it's relatively straightforward to split up the matrix math for parallel computation, segmenting the memory in this way allows for near-linear increases in aggregate memory bandwidth and inference performance. And it's effectively the same thing you're doing when adding GPUs.
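Putting the two sub-threads together, a sketch of dense vs. MoE bandwidth ceilings. The 400B and ~3B-active parameter counts come from the thread; the 800 GB/s figure is the round number used above:

```python
# Bandwidth ceiling for a dense model vs. an MoE model.
# Only the weights actually read per token count against bandwidth,
# which is why MoE decode is so much faster at the same total size.
# Figures are the round numbers from the discussion, not benchmarks.

def tokens_per_sec(active_params_billion, bytes_per_param, bandwidth_gb_s):
    return bandwidth_gb_s / (active_params_billion * bytes_per_param)

BW = 800  # GB/s, as in the example above

dense_400b = tokens_per_sec(400, 2, BW)  # fp16 dense: all 400B params per token
moe_3b     = tokens_per_sec(3, 2, BW)    # MoE with ~3B active params per token

print(f"dense 400B:    ~{dense_400b:.1f} tok/s")  # ~1.0 tok/s
print(f"MoE 3B active: ~{moe_3b:.0f} tok/s")      # ~133 tok/s
```

Same total parameter budget in RAM, but roughly two orders of magnitude difference in the per-token memory traffic.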