quantize: add option to automatically choose optimal quant types to reach a file/bpw target size at lowest error #15550
EAddario wants to merge 260 commits into ggml-org:master
Conversation
I can't believe we've been working on the same thing for nearly a year. I saw your quant KLD results today and was surprised by how well optimised they are compared to others, so decided to check your work. For my part, it's a tool suite that I've created independently of the llama.cpp code. If you need to brainstorm, please let me know; maybe there are some aspects I've already resolved, and vice-versa. You can check the tool suite in action at https://gguf.thireus.com/quant_assign.html, I'm sure you'll notice a lot of similarities. Cheers.
Hi, I don't know the proper protocol for "intruding" on another's PR … 😉 I've been working on inference speed-aware quantization. This is especially useful for NPUs, drafting models or devices that have abundant memory but lack compute. It's based on this PR's knapsack solver infrastructure. Since this PR is already hard to review, I plan to submit a separate PR for it once it is a little more polished. As I am somewhat familiar with the source code by now, I could give a little feedback, if helpful. PS: Nice algorithmic work.
Hi @ivy-42 and thank you! No protocol expected / needed. Please feel free to use the code in any way you deem fit. I'll definitely check your fork but in the meantime, any questions/suggestions are always welcome |
```cpp
};

// Quality metrics
struct quant_error {
```
It's not clear to me what the fields of this struct are scaled by. I think it approximates a weighted sum over all elements of the tensor (approximate because not all are sampled), right? Maybe rename the fields or add a comment? E.g. `weighted_error`, `wse`, and `wce`, plus a comment clarifying that those are scaled by the tensor element count.
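For illustration only, the suggestion might render like this (the field set and comments are my reading of the review, not the PR's actual code):

```cpp
// Hypothetical: each field is a weighted sum accumulated over the sampled
// tensor elements, i.e. scaled by the sampled element count rather than
// being a normalized mean.
struct quant_error {
    double weighted_error; // imatrix-weighted error, summed over sampled elements
    double wse;            // weighted squared error
    double wce;            // weighted cross-entropy error (my guess at the abbreviation)
};
```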
```cpp
constexpr double INFINITE = std::numeric_limits<double>::infinity();
constexpr uint64_t STATE_MAGIC = 0x4250572d5631; // "BPW-V1"
constexpr uint64_t HASH_MAGIC = 0xeabada55cafed00d;
constexpr float penalty = 2.0f;
```
`penalty` can mean a lot of things in the context of an optimization problem. Maybe `boost_factor` is clearer?
In my fork, I added a CLI flag. If you think this fits the scope of this PR, feel free to include my commit.
This PR adds `target_bpw_type()`, a function to determine an optimal per-tensor quantization mix to achieve a user-specified total file size (e.g., `--target-size 1.5g`) or a global bits-per-weight (bpw) target (e.g., `--target-bpw 4.5678`).

The function solves a constrained optimization problem to minimize quantization error, subject to a global size budget. It estimates per-tensor error for each layer and dynamically allocates the bit budget where it matters most, as sketched below.
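A minimal sketch of the budgeted selection this implies, assuming per-tensor candidate lists of (size, error) pairs (all names here are illustrative, not the PR's actual code): start every tensor at its smallest candidate type, then repeatedly spend the remaining byte budget on whichever single-tensor upgrade buys the largest error reduction per extra byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct candidate {
    int64_t bytes; // on-disk size of the tensor at this quant type
    double  error; // estimated quantization error at this quant type
};

// Per tensor: candidates sorted by ascending size (and usually descending error).
using tensor_opts = std::vector<candidate>;

// Returns, for each tensor, the index of the chosen candidate.
static std::vector<size_t> allocate_budget(const std::vector<tensor_opts> & tensors, int64_t budget) {
    std::vector<size_t> choice(tensors.size(), 0);
    int64_t used = 0;
    for (const auto & t : tensors) {
        used += t[0].bytes; // baseline: smallest type everywhere
    }
    for (;;) {
        double best_gain = 0.0;
        size_t best_i    = 0;
        for (size_t i = 0; i < tensors.size(); ++i) {
            const size_t c = choice[i];
            if (c + 1 >= tensors[i].size()) {
                continue; // already at the largest candidate
            }
            const int64_t extra = tensors[i][c + 1].bytes - tensors[i][c].bytes;
            if (extra <= 0 || used + extra > budget) {
                continue; // upgrade does not fit in the remaining budget
            }
            // error reduction per additional byte spent
            const double gain = (tensors[i][c].error - tensors[i][c + 1].error) / (double) extra;
            if (gain > best_gain) {
                best_gain = gain;
                best_i    = i;
            }
        }
        if (best_gain <= 0.0) {
            break; // budget exhausted or no upgrade reduces error
        }
        used += tensors[best_i][choice[best_i] + 1].bytes - tensors[best_i][choice[best_i]].bytes;
        choice[best_i]++;
    }
    return choice;
}
```

A greedy pass like this is only an approximation of the knapsack solve the PR actually performs, but it shows where the bit budget goes: to the tensors whose error drops fastest per byte.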
High-level flow:

- If `--state-file` is set, target computations are saved to a file. If the quantization is interrupted (e.g., Ctrl+C), the next run resumes the error calculation from where it left off.
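To support that resume behaviour, the state file presumably needs a version marker, a fingerprint of the inputs, and the per-tensor results computed so far. A purely illustrative writer, reusing the `STATE_MAGIC` constant that appears in the diff (everything else is my assumption):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

constexpr uint64_t STATE_MAGIC = 0x4250572d5631; // "BPW-V1", from the PR's constants

// Hypothetical per-tensor record: the error estimates computed so far.
struct tensor_state {
    std::string         name;
    std::vector<double> errors; // one estimate per candidate quant type
};

// Writes magic + input hash + completed records. On resume, a reader would
// verify the magic and hash, then skip tensors that already have a record.
static bool save_state(const char * path, uint64_t input_hash, const std::vector<tensor_state> & done) {
    FILE * f = std::fopen(path, "wb");
    if (!f) {
        return false;
    }
    std::fwrite(&STATE_MAGIC, sizeof(STATE_MAGIC), 1, f);
    std::fwrite(&input_hash, sizeof(input_hash), 1, f);
    const uint64_t n_tensors = done.size();
    std::fwrite(&n_tensors, sizeof(n_tensors), 1, f);
    for (const auto & t : done) {
        const uint64_t name_len = t.name.size();
        std::fwrite(&name_len, sizeof(name_len), 1, f);
        std::fwrite(t.name.data(), 1, name_len, f);
        const uint64_t n_errors = t.errors.size();
        std::fwrite(&n_errors, sizeof(n_errors), 1, f);
        std::fwrite(t.errors.data(), sizeof(double), n_errors, f);
    }
    std::fclose(f);
    return true;
}
```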
**Advantages**

**Target arbitrary size models**

`--target-size 23.85g` will generate a 23.85 GiB file to utilize the hardware fully.

**Data-driven mixed precision can often improve quality at fixed size**
**Allows better like-for-like comparisons between models and families**
Standard quantization uses hardcoded rules like "use `Q4_K_M`, except bump some tensors up/down, except fall back if incompatible, except keep some tensors unquantized...", and consequently two different models quantized at the same `Q4_K_M` level can end up with very different bpw (e.g. 4.75 and 4.30).
Model performance generally scales with size; larger models typically outperform smaller ones, and a model quantized with more bits will usually perform better (lower perplexity, better evaluation scores) than a smaller version, even when the same underlying quantization method is used. As a result, performance comparisons between models are not a controlled experiment: the models being compared have different effective compression ratios.
`--target-bpw` helps to normalize experiments by forcing models to be quantized to a roughly equal overall byte budget. This standardization allows performance variations between models to be more accurately linked to underlying factors such as architectural or training differences, the effect of quantization error at the same compression level, or the decisions made by the optimizer regarding allocation.
**Disadvantages**

**Quantization process is significantly slower than standard**
This approach can take 5x-10x longer as it quantizes a sample of most tensors into 15 different formats, dequantizes them back, computes error diffs, and selects the best size/error option that fits the target file size or global bpw budget.
However, the `--state-file` option will save the above-mentioned computations to disk so that future quantizations can be generated at normal speed. It also allows the computation to be interrupted and resumed at a later time.
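For reference, the quantize/dequantize/measure cycle described above could look roughly like this with public ggml calls (heavily simplified: the PR samples rows rather than whole tensors, evaluates many candidate types per tensor, and weights errors more carefully; treat this as a sketch under those assumptions, not the PR's code):

```cpp
#include "ggml.h"

#include <cstdint>
#include <vector>

// Quantize `data` to `type`, dequantize it back, and return a weighted
// squared-error estimate. `imatrix` holds one importance weight per column
// and may be null, in which case all elements are weighted equally.
static double estimate_error(ggml_type type, const float * data, int64_t nrows, int64_t n_per_row, const float * imatrix) {
    const int64_t n = nrows * n_per_row;

    std::vector<uint8_t> quantized(ggml_row_size(type, n_per_row) * nrows);
    std::vector<float>   dequantized(n);

    ggml_quantize_chunk(type, data, quantized.data(), 0, nrows, n_per_row, imatrix);
    ggml_get_type_traits(type)->to_float(quantized.data(), dequantized.data(), n);

    double err = 0.0;
    for (int64_t i = 0; i < n; ++i) {
        const double diff   = data[i] - dequantized[i];
        const double weight = imatrix ? imatrix[i % n_per_row] : 1.0;
        err += weight * diff * diff;
    }
    return err;
}
```

Running something like this for every candidate type over every tensor is what dominates the extra runtime, and is exactly the work the state file caches.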
**The optimization target is only a proxy for the model's performance quality**

**An imatrix with activations data is required for best results**

Without an imatrix, `--target-bpw` and `--target-file` will refuse to run.
**Design considerations**

The `target_bpw_type()` function is implemented as a container for several lambdas providing all the logic for serialization, multithreading, math/stats, and optimization. Although there are clear downsides to this approach (i.e. cognitive load, testability, and maintainability), a self-contained "God function" seemed a better choice to prevent polluting `llama-quantize`'s global scope: the structs and helper lambdas are highly specific to this exact algorithm and have no reuse value elsewhere in the library.
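In outline, the shape is something like this (illustrative only; the real signature and helpers differ):

```cpp
// Everything the algorithm needs lives inside the function body, so no
// algorithm-specific types or helpers escape into the translation unit.
static void target_bpw_type(/* model, imatrix, size or bpw target, ... */) {
    auto load_state     = [&]() { /* deserialize cached per-tensor error estimates */ };
    auto save_state     = [&]() { /* persist progress for --state-file resume */ };
    auto estimate_error = [&]() { /* quantize/dequantize a sample, measure error */ };
    auto solve_budget   = [&]() { /* pick per-tensor quant types under the size budget */ };
    // ... orchestration of the above ...
}
```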
**Test results**

Based on 132 tests with models from 11 different families, the `target_bpw_type()` optimization routine generated better quality models in 96 cases (~70%) and matched standard quantization in 10 (~8%). However, even though the method often produced better quality, it lost in surprising cases: naive quants performed better in the remaining 25 tests (~20%), sometimes by a significant margin (e.g. `ERNIE-4.5-21B-A3B-PT-IQ1_M`, `granite-4.0-h-tiny-IQ2_M`, `granite-4.0-h-tiny-IQ1_M`).

Of the 96 cases where it performed better, about 1/3 achieved higher scores when using the `--ignore-tensor-importance` option, forcing the algorithm to treat each tensor equally instead of prioritising some (e.g. `attn_output`, `ffn_down`, etc.).

**Target BPW test results**
Using `Cor(ln(PPL(Q)), ln(PPL(base)))` as the discriminant metric.
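As I read it, this is the Pearson correlation between the log-perplexities of the quantized and base models across matched evaluation chunks; a sketch (the function and variable names are mine):

```cpp
#include <cmath>
#include <vector>

// Pearson correlation of ln(PPL) between a quantized model and its base,
// computed over matched evaluation chunks.
static double log_ppl_correlation(const std::vector<double> & ppl_q, const std::vector<double> & ppl_base) {
    const size_t n = ppl_q.size();
    double mean_q = 0.0;
    double mean_b = 0.0;
    for (size_t i = 0; i < n; ++i) {
        mean_q += std::log(ppl_q[i]);
        mean_b += std::log(ppl_base[i]);
    }
    mean_q /= n;
    mean_b /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double x = std::log(ppl_q[i])    - mean_q;
        const double y = std::log(ppl_base[i]) - mean_b;
        sxy += x * y;
        sxx += x * x;
        syy += y * y;
    }
    return sxy / std::sqrt(sxx * syy);
}
```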
**AI usage disclosure**

AI was used to validate the mathematical approach and calculations, and to optimize and debug the code.
Special thanks to @AesSedai, @compilade and @ddh0 for their contributions during the development of this PR.