I'm curious about the quantization quality claims in the table there. Is this a Gemma 2-specific thing (more subtlety in the weights somehow)? In my testing, and in testing I've seen elsewhere, at least for llama3 8B (plus some less rigorous testing with other models), everything from q8_0 down to q4_K_M is basically indistinguishable.
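If it helps build intuition for the "subtlety in the weights" guess, here's a rough NumPy toy (not llama.cpp's actual Q4_K_M math, just naive absmax block quantization; the function name and test tensors are made up for illustration). The point it sketches: if a model's weight blocks contain more outliers, a 4-bit round-trip loses noticeably more precision, which is one plausible way a given model family could be hurt more than llama3 was.

```python
import numpy as np

def q4_roundtrip_error(w, block=32):
    """Quantize to a symmetric 4-bit grid per block, dequantize,
    and return the relative RMS error. A toy stand-in for real Q4 schemes."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map each block to -7..7
    scale[scale == 0] = 1.0
    w_hat = np.clip(np.round(w / scale), -7, 7) * scale
    return np.sqrt(np.mean((w - w_hat) ** 2)) / np.sqrt(np.mean(w ** 2))

rng = np.random.default_rng(0)
smooth = rng.normal(0, 0.02, size=(1024 * 1024,))  # well-behaved Gaussian weights
spiky = smooth.copy()
spiky[::512] *= 20  # occasional outliers inflate the per-block scale

print("smooth weights:", q4_roundtrip_error(smooth))
print("spiky weights: ", q4_roundtrip_error(spiky))
```

On my reading, the spiky tensor should show roughly double the relative error of the smooth one, purely because the outlier in a block steals most of the 4-bit range from its neighbours.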
The first paper is a good critique of the performance of quantised models: it points out that 40-50% 'compression' typically results in only a slight loss on RAG tasks that rely on in-context learning, but on factual tasks relying on stored knowledge, performance drops off very quickly. They looked at Vicuna, one of the earlier models, so I wonder how applicable it is to recent models like the Phi 3 range. I don't think deliberate, clever adversarial attacks like those in the 2nd paper are a sensible worry for most, but it is fun. Thanks for the links @janwas.
I know the safetensors are there, but I said GGUF 4-bit quantised, which is kinda the standard for useful local applications, the typical balanced sweet spot of performance and quality. It makes things much easier to use and works in more places, be it personal devices or a server, etc.
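To make the "easier to use" point concrete, a minimal sketch with llama-cpp-python (the file name is just a placeholder for whichever Q4_K_M GGUF you've pulled down):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-2-9b-it-Q4_K_M.gguf",  # placeholder path to a 4-bit GGUF
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available, otherwise CPU
)

out = llm("Explain GGUF quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Same single file, same few lines, whether it's running on a laptop or a server, which is most of the appeal.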