New quant types, activation quantization #1034
Merged
dxqb merged 101 commits into Nerogar:merge on Nov 27, 2025
Conversation
Disty0 reviewed Oct 4, 2025
dxqb (Collaborator, Author) commented:
Known issues:

- increase offloading alignment to 16
- disable grads for SVDQuant to save vram (see the sketch below)
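On the "disable grads for SVDQuant" point, here is a minimal sketch of what disabling gradient tracking for quantized/frozen layers can look like in plain PyTorch. The `SVDQuantLinear` class name is an assumption for illustration only, not OneTrainer's actual module.

```python
import torch.nn as nn

def freeze_svdquant_layers(model: nn.Module) -> None:
    # Hypothetical sketch: walk the model and turn off gradient tracking for
    # the parameters of quantized layers, so autograd never allocates grad
    # buffers for them during LoRA training (saving VRAM).
    for module in model.modules():
        # "SVDQuantLinear" is an assumed class name, used only for this example.
        if module.__class__.__name__ == "SVDQuantLinear":
            for param in module.parameters(recurse=False):
                param.requires_grad_(False)
```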
dxqb (Collaborator, Author) commented:
torch.compile bug workaround for Chroma and Qwen pushed, but
… loss
- explicit torch.no_grad() on forwards, to avoid torch.compile problems (illustrated in the sketch below)
- don't cast x to compute_dtype before matmul, if x is quantized anyway
- set GGUF compute_dtype to train_dtype
- various fixes
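A rough sketch of the idea behind the torch.no_grad() and "don't cast x" items above, assuming a dequantize-then-matmul linear layer. `QuantizedLinear`, its buffers, and `dequantize()` are placeholder names for illustration, not the classes in this PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    # Placeholder module illustrating the pattern, not OneTrainer's implementation.
    def __init__(self, qweight: torch.Tensor, scale: torch.Tensor,
                 compute_dtype: torch.dtype = torch.bfloat16):
        super().__init__()
        self.register_buffer("qweight", qweight)  # int8 weight, shape (out, in)
        self.register_buffer("scale", scale)      # per-output-channel scale, shape (out, 1)
        self.compute_dtype = compute_dtype

    def dequantize(self) -> torch.Tensor:
        # Explicit no_grad around the dequantization of frozen weights, so
        # torch.compile never has to reason about autograd state here.
        with torch.no_grad():
            return self.qweight.to(self.compute_dtype) * self.scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not x.is_floating_point():
            # If x is already quantized, a real implementation would dispatch to
            # an integer matmul kernel here rather than casting x to compute_dtype.
            raise NotImplementedError("quantized-activation path not shown in this sketch")
        # Only cast when the dtype actually differs, avoiding a redundant copy.
        if x.dtype != self.compute_dtype:
            x = x.to(self.compute_dtype)
        return F.linear(x, self.dequantize())
```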
tldr: train LoRA twice as fast as before
Switch to this PR:
and then run update.sh / update.bat
- Compiled blocks
- W8A8 quant types (see the sketch below)
- SVDQuant
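For context on the W8A8 item above: "W8A8" means both the weights and the activations are quantized to 8 bits before the matmul. The following is a minimal, self-contained sketch of symmetric int8 W8A8 linear math (per-token activation scales, per-output-channel weight scales); it is illustrative only and not the kernels used in this PR.

```python
import torch
import torch.nn.functional as F

def quantize_sym_int8(t: torch.Tensor, dim: int):
    # Symmetric int8 quantization: scale = max|t| / 127 along the given dim.
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # x: (tokens, in_features), weight: (out_features, in_features)
    qx, sx = quantize_sym_int8(x, dim=1)       # per-token activation scales
    qw, sw = quantize_sym_int8(weight, dim=1)  # per-output-channel weight scales
    # Integer matmul accumulated in int32, then rescaled back to x's dtype.
    acc = qx.to(torch.int32) @ qw.to(torch.int32).t()
    return acc.to(x.dtype) * sx * sw.t()

# Quick sanity check against the unquantized result (CPU, float32).
x = torch.randn(4, 64)
w = torch.randn(32, 64)
print(F.linear(x, w)[0, :4])
print(w8a8_linear(x, w)[0, :4])
```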
Why no int4 / fp4?
Nunchaku (https://github.com/nunchaku-tech/nunchaku) has shown that 4-bit quants are possible with no noticeable quality loss. Even though their matrix math is faster than 8-bit quants, my benchmarks show no overall performance gain for training from 4-bit matrix math alone. The 8-bit linear math is already fast enough; most of the remaining time in a training step is spent on bf16 attention and other layers.
I believe the additional performance improvement achieved by Nunchaku comes from their manual kernel fusion, not from the 4-bit matmuls, but I am happy to be proven wrong.
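That kind of claim is straightforward to check with the PyTorch profiler. This is not the benchmark used for this PR, just a generic sketch of how one might look at the per-op breakdown of a training step:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_training_step(model, batch, loss_fn, optimizer):
    # Per-operator breakdown of a single training step. Sorting by GPU time
    # shows whether attention / other bf16 ops or the (quantized) linear
    # kernels dominate the step.
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True) as prof:
        loss = loss_fn(model(batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
    print(prof.key_averages().table(sort_by=sort_key, row_limit=20))
```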
includes #1091