LLM Technical Note: Picking the Right Quantization Method for Local Inference

Q4_K_M, Q5_K, Q8_0—what do these cryptic codes mean? A quick reference for picking the right quantization method when running LLMs locally.
Published: September 10, 2023

Thanks to innovative quantization strategies and the commendable work of community members like TheBloke, running powerful models locally on devices such as my M1 Max is now possible. However, a first visit to a Hugging Face model page can be overwhelming and confusing: which variation should you pick among “Q4_K_M”, “Q5_K”, “Q8_0”, and so on? What are all these things? This guide aims to quickly clear it up.

GGML / GGUF vs. GPTQ

The first decision to make when choosing a quantized model is the model type. In general:

- GGUF (the successor format to GGML) models are optimized for CPU inference and are the go-to option for Mac users.
- GPTQ models are optimized for GPU inference, ideal when the whole model fits in VRAM.

GGUF models can be run with llama.cpp or GUIs built on it, such as LM Studio; GPTQ models need a GPU backend such as AutoGPTQ or ExLlama (both available through text-generation-webui).
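As a concrete example, here is a minimal sketch of running a GGUF model through the llama-cpp-python bindings. The model filename and generation parameters are illustrative, not prescriptive:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a GGUF model from disk; the path below is a placeholder.
# n_gpu_layers=-1 asks the backend to offload all layers to the
# GPU (Metal on Apple Silicon) when the build supports it.
llm = Llama(
    model_path="./llama-2-13b-chat.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,
)

out = llm("Q: What is quantization in one sentence? A:", max_tokens=64)
print(out["choices"][0]["text"])
```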

Quantization Methods at a Glance

Model variations are usually named according to the following convention:

<model_name>_Q<quantization_bits>_<variant>
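For instance, TheBloke's files follow this pattern (e.g. llama-2-13b-chat.Q4_K_M.gguf), so the quantization spec can be pulled out of a filename with a short regex. This parser is just an illustrative sketch:

```python
import re

# Matches the Q<bits>_<variant> component of a quantized model filename,
# e.g. "Q4_K_M" -> bits 4, variant "K_M"; "Q8_0" -> bits 8, variant "0".
QUANT_RE = re.compile(r"Q(?P<bits>\d)_(?P<variant>[01]|K(?:_[SML])?)")

def parse_quant(filename: str) -> tuple[int, str] | None:
    match = QUANT_RE.search(filename)
    if match is None:
        return None
    return int(match.group("bits")), match.group("variant")

print(parse_quant("llama-2-13b-chat.Q4_K_M.gguf"))  # (4, 'K_M')
print(parse_quant("llama-2-13b-chat.Q8_0.gguf"))    # (8, '0')
```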

At a high level:

| Variant | Description | Key Differences |
|---------|-------------|-----------------|
| _0 | Legacy quant method with uniform precision. | Uniform precision across all tensors. |
| _1 | Larger legacy quant method with uniform precision. | Uniform precision but with a larger model size. |
| _K | Designation for the k-quants. | Improved size/quality tradeoff compared to legacy methods. |
| _K_L | Large size k-quant. | Uses higher-precision types on select tensors for better quality at a larger size. |
| _K_M | Medium size k-quant. | Uses higher precision (Q6_K) for half of the attention.wv and feed_forward.w2 tensors, and default precision (Q4_K) for the rest. |
| _K_S | Small size k-quant. | Uses default precision (Q4_K) for all tensors. |

Here is a summary of how these methods compare in practice on a 13B model, based on my own limited testing and what I have read online:

| Quant Method | Bits | Approx. Relative Size | Quality Impact on 13B Model |
|--------------|------|-----------------------|-----------------------------|
| Q2_K | 2 | ~39% | Significant loss - not recommended |
| Q3_K_S | 3 | ~40% | Very high loss |
| Q3_K_M | 3 | ~45% | Very high loss |
| Q3_K_L | 3 | ~49% | Substantial loss |
| Q4_0 | 4 | ~53% | Very high loss (legacy) |
| Q4_K_S | 4 | ~53% | Greater loss |
| Q4_K_M | 4 | ~56% | Balanced - recommended |
| Q5_0 | 5 | ~64% | Balanced (legacy) |
| Q5_K_S | 5 | ~64% | Low loss - recommended |
| Q5_K_M | 5 | ~66% | Very low loss - recommended |
| Q6_K | 6 | ~76% | Extremely low loss |
| Q8_0 | 8 | ~99% | Extremely low loss - not recommended |

Note that Q8_0, despite being nearly lossless, is generally not recommended: it is considerably larger than Q6_K for only a marginal quality gain.
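To sanity-check whether a given file will fit in memory, a back-of-the-envelope estimate helps: file size ≈ parameter count × bits per weight / 8. The bits-per-weight figures below are approximations I derived from typical 13B GGUF file sizes, not official llama.cpp constants:

```python
# Approximate effective bits per weight for common quant methods.
# Derived from typical 13B GGUF file sizes; real files vary slightly
# because k-quants mix precisions across tensors.
APPROX_BPW = {
    "Q2_K": 3.35, "Q3_K_S": 3.5,  "Q3_K_M": 3.9,  "Q3_K_L": 4.3,
    "Q4_0": 4.55, "Q4_K_S": 4.6,  "Q4_K_M": 4.85,
    "Q5_0": 5.5,  "Q5_K_S": 5.5,  "Q5_K_M": 5.7,
    "Q6_K": 6.6,  "Q8_0": 8.5,
}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Rough on-disk size in GB for a model with n_params weights."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9

print(f"{estimated_size_gb(13e9, 'Q4_K_M'):.1f} GB")  # ~7.9 GB
```

Keep in mind that runtime memory use is somewhat higher than file size, since the KV cache and compute buffers come on top.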

Final Recommendations

| Model Size | Best for Quality | Best Trade-off | Avoid |
|------------|------------------|----------------|-------|
| 7B, 13B | Q5_K_S, Q5_K_M | Q4_K_M | Q2_K, Q4_0 |
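To close, here is one way to mechanize the table: try the recommended quants from highest quality downward and take the first whose estimated size fits your memory budget. The preference order simply restates the table above; the helper itself is a hypothetical sketch, not an established heuristic:

```python
# Quants in preference order from the table above, paired with the
# approximate bits-per-weight used in the earlier size estimate.
PREFERENCE = [("Q5_K_M", 5.7), ("Q5_K_S", 5.5), ("Q4_K_M", 4.85)]

def pick_quant(n_params: float, mem_budget_gb: float) -> str | None:
    """Return the best-quality recommended quant that fits the budget."""
    for quant, bpw in PREFERENCE:
        if n_params * bpw / 8 / 1e9 <= mem_budget_gb:
            return quant
    return None  # nothing fits; consider a smaller base model

print(pick_quant(13e9, 10.0))  # 'Q5_K_M' (~9.3 GB)
```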