Maybe I'm missing something but I don't think I've ever seen quants lower memory...

root_axis · 2025-04-05T19:11:09 1743880269

Quantizing definitely lowers memory requirements, it's a pretty direct effect because you're straight up using less bits per parameter across the board - thus the representation of the weights in memory is smaller, at the cost of precision.

jsnell · 2025-04-05T19:12:10 1743880330

Needing less memory for inference is the entire point of quantization. Saving the disk space or having a smaller download could not justify any level of quality degradation.

anoncareer0212 · 2025-04-05T20:10:49 1743883849

Small point of order:

> entire point...smaller download could not justify...

Q4_K_M has layers and layers of consensus and polling and surveying and A/B testing and benchmarking to show there's ~0 quality degradation. Built over a couple years.

acchow · 2025-04-06T20:11:13 1743970273

> Q4_K_M has ~0 quality degradation

Llama 3.3 already shows a degradation from Q5 to Q4.

As compression improves over the years, the effects of even Q5 quantization will begin to appear

vlovich123 · 2025-04-05T19:14:16 1743880456

Quantization by definition lower memory requirements - instead of using f16 for weights, you are using q8, q6, q4, or q2 which means the weights are smaller by 2x, ~2.7x, 4x or 8x respectively.

That doesn’t necessarily translate to the full memory reduction because of interim compute tensors and KV cache, but those can also be quantized.

acchow · 2025-04-05T19:47:48 1743882468

Nvidia GPUs can natively operate in FP8, FP6, FP4, etc so naturally they have reduced memory requirements when running quantized.

As for CPUs, Intel can only go down to FP16, so you’ll be doing some “unpacking”. But hopefully that is “on the fly” and not when you load the model into memory?

terhechte · 2025-04-05T19:08:42 1743880122

I just loaded two models of different quants into LM Studio:

qwen 2.5 coder 1.5b @ q4_k_m: 1.21 GB memory

qwen 2.5 coder 1.5b @ q8: 1.83 GB memory

I always assumed this to be the case (also because of the smaller download sizes) but never really thought about it.

michaelt · 2025-04-05T19:11:30 1743880290

No need to unpack for inference. As things like CUDA kernels are fully programmable, you can code them to work with 4 bit integers, no problems at all.