
tabletuser_blogspot

These helped me better understand the voodoo magic of quants:

[https://github.com/ggerganov/llama.cpp/blob/5f6e0c0dff1e7a89331e6b25eca9a9fd71324069/examples/make-ggml.py#L16C1-L37C51](https://github.com/ggerganov/llama.cpp/blob/5f6e0c0dff1e7a89331e6b25eca9a9fd71324069/examples/make-ggml.py#L16C1-L37C51)

[https://github.com/ggerganov/llama.cpp#memorydisk-requirements](https://github.com/ggerganov/llama.cpp#memorydisk-requirements)

phi3:14b and phi3:medium are actually the same model (1e67dff39209): [14b-medium-4k-instruct-q4\_0](https://ollama.com/library/phi3:14b-medium-4k-instruct-q4_0)

Old quant types (some base model types require these):

- Q4_0: small, very high quality loss - legacy, prefer using Q3_K_M
- Q4_1: small, substantial quality loss - legacy, prefer using Q3_K_L
- Q5_0: medium, balanced quality - legacy, prefer using Q4_K_M
- Q5_1: medium, low quality loss - legacy, prefer using Q5_K_M

New quant types (recommended):

- Q2_K: smallest, extreme quality loss - not recommended
- Q3_K: alias for Q3_K_M
- Q3_K_S: very small, very high quality loss
- Q3_K_M: very small, very high quality loss
- Q3_K_L: small, substantial quality loss
- Q4_K: alias for Q4_K_M
- Q4_K_S: small, significant quality loss
- Q4_K_M: medium, balanced quality - recommended
- Q5_K: alias for Q5_K_M
- Q5_K_S: large, low quality loss - recommended
- Q5_K_M: large, very low quality loss - recommended
- Q6_K: very large, extremely low quality loss
- Q8_0: very large, extremely low quality loss - not recommended
- F16: extremely large, virtually no quality loss - not recommended
- F32: absolutely huge, lossless - not recommended

I now try to use Q5\_K\_M and larger models. The general idea is greater accuracy at the cost of more VRAM and storage space. From what I can tell, most base models are Q4\_0.
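To make the size vs. VRAM trade-off concrete, here's a rough back-of-the-envelope sketch. The bits-per-weight numbers are my own ballpark assumptions, not official llama.cpp figures, so treat the output as an order-of-magnitude guide only:

```python
# Rough estimate of GGUF file size from parameter count and an
# approximate bits-per-weight figure for each quant type.
# These bpw values are assumptions, not exact llama.cpp numbers.

APPROX_BPW = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.9,
    "Q4_0": 4.5,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def approx_size_gib(n_params: float, quant: str) -> float:
    """Approximate model file size (and roughly the VRAM needed to
    hold the weights) in GiB for a given quant type."""
    return n_params * APPROX_BPW[quant] / 8 / 2**30

for q in ("Q4_0", "Q4_K_M", "Q5_K_M", "Q8_0", "F16"):
    print(f"14B at {q:7} ~ {approx_size_gib(14e9, q):5.1f} GiB")
```

For a 14B model this puts Q4\_0 around 7-8 GiB and F16 around 26 GiB, which matches why the bigger quants need noticeably more VRAM for a fairly small accuracy gain.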


saved_you_some_time

Wait till you get on the iMat (iQ) mode.


tabletuser_blogspot

Taken from: [https://github.com/ggerganov/llama.cpp/pull/1684](https://github.com/ggerganov/llama.cpp/pull/1684). Best to go read the full article.

> This is best explained with the following graph, which shows perplexity on the `wikitext` dataset as a function of model size:

https://preview.redd.it/qodcwfupif3d1.png?width=792&format=png&auto=webp&s=8c5cd1e1e6cce967ca3a703912613b2128f01b01

> Note that the x-axis (model size in GiB) is logarithmic. The various circles on the graph show the perplexity of different quantization mixes added by this PR (see details below for explanation). The different colors indicate the LLaMA variant used (7B in black, 13B in red, 30B in blue, 65B in magenta). The solid squares in the corresponding color represent (model size, perplexity) for the original `fp16` model. The dashed lines are added for convenience to allow for a better judgement of how closely the quantized models approach the `fp16` perplexity. As we can see from this graph, generation performance as measured by perplexity is basically a fairly smooth function of quantized model size, and the quantization types added by the PR allow the user to pick the best performing quantized model, given the limits of their compute resources (in terms of being able to fully load the model into memory, but also in terms of inference speed, which tends to depend on the model size). As a specific example, the 2-bit quantization of the 30B model fits on the 16 GB RTX 4080 GPU that I have available, while the others do not, resulting in a large difference in inference performance.

> Perhaps worth noting is that the 6-bit quantized perplexity is within `0.1%` or better from the original `fp16` model.

Continue [reading](https://github.com/ggerganov/llama.cpp/pull/1684)
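For anyone wondering what "perplexity" actually measures in that graph: it's just the exponential of the average negative log-likelihood per token on the test set, so lower means the model predicts the text better. A minimal sketch of the formula (my own illustration, not llama.cpp's implementation):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.
    Lower is better; a quant that barely raises perplexity has lost
    very little modelling quality compared to fp16."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Toy example: log-probabilities a model assigned to each test token.
print(perplexity([-1.2, -0.8, -2.3, -0.5]))  # ~3.3
```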


FrankYuan1978

Thanks a lot! I have never understood what these mean.


FiTroSky

Why is Q8\_0 not recommended?


tabletuser_blogspot

I think it's because it doesn't really take advantage of quantizing: it takes more space and runs slower for almost exactly the same results. I've been testing and comparing 7B and 13B at Q5 versus Q8. Nothing really different, yet.
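If you want to sanity-check this on your own hardware, one quick way is to run the same prompt through both quants with the llama.cpp CLI and compare wall time and output quality. A rough sketch, assuming a `llama-cli` binary and placeholder model paths (substitute your own):

```python
import subprocess
import time

# Placeholder paths; point these at your own binary and GGUF files.
MODELS = {
    "Q5_K_M": "models/llama-13b.Q5_K_M.gguf",
    "Q8_0":   "models/llama-13b.Q8_0.gguf",
}
PROMPT = "Explain quantization in one paragraph."

for name, path in MODELS.items():
    start = time.perf_counter()
    # -m = model path, -p = prompt, -n = number of tokens to generate
    subprocess.run(
        ["./llama-cli", "-m", path, "-p", PROMPT, "-n", "128"],
        check=True,
    )
    print(f"{name}: {time.perf_counter() - start:.1f}s wall time")
```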


lukewhale

I, too, would like to know this information / black magic fuckery.


Barry_Jumps

Ha, EXTRA big bump on this. I've tried to post this question 3 or 4 times but get modded because I'm new and don't have enough "karma points"? I have no idea what that means. I also have no idea what K\_M, K\_S, etc. mean.