
New quant types, activation quantization#1034

Merged
dxqb merged 101 commits into Nerogar:merge from dxqb:compile_int8_svd
Nov 27, 2025

Conversation

dxqb (Collaborator) commented Oct 4, 2025

tl;dr: train LoRAs twice as fast as before


Switch to this PR:

git fetch origin pull/1034/head:pr-1034
git switch pr-1034

and then run update.sh / update.bat


Compiled blocks

W8A8 quant types

  • Instead of only quantizing model weights, these new quant types also quantize activations (see the sketch after this list)
  • Performance improvement from 8-bit matrix multiplications
  • Loss of precision in theory, but no noticeable quality difference between Float8 W8A8 and the current Float8 type
  • Int8 might have a slight quality loss compared to fp8, but is also faster
  • Uses Triton for additional performance - untested on Windows
  • Performance may differ depending on the NVIDIA architecture. Is fp8 faster than int8 on Blackwell? Untested
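
For illustration, here is a minimal Python sketch of the W8A8 idea for a single linear layer. It is not the actual OneTrainer/Triton kernel: per-tensor symmetric scaling is an assumption, and the int8 product is emulated in float so it runs anywhere:

import torch

def quantize_int8(t: torch.Tensor):
    # symmetric per-tensor quantization: int8 values plus one float scale
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def linear_w8a8(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    x_q, x_s = quantize_int8(x)       # activations are quantized at runtime
    w_q, w_s = quantize_int8(weight)  # weights would normally be quantized once, ahead of time
    # a real kernel does the matmul in int8 with int32 accumulation;
    # here the quantized values are cast to float so plain torch.matmul works
    acc = x_q.float() @ w_q.float().t()
    return acc * (x_s * w_s)          # rescale the result back to the original range

x, w = torch.randn(4, 64), torch.randn(32, 64)
print((linear_w8a8(x, w) - x @ w.t()).abs().max())  # small quantization error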

SVDQuant

  • Implementation of https://arxiv.org/abs/2411.05007, but applied to all OneTrainer quant types for better quality (see the sketch after this list)
  • Improved sampling quality, possibly even compared to the current Float8 quant type (opinions differ)
  • Effects on training yet to be determined
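
As a rough sketch of the SVDQuant decomposition (my reading of the paper, not OneTrainer's implementation; the rank of 32 and the int8 residual quantizer are illustrative assumptions):

import torch

def svdquant_decompose(weight: torch.Tensor, rank: int = 32):
    # keep a small low-rank part of the weight in high precision...
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    lora_a = U[:, :rank] * S[:rank]   # (out_features, rank)
    lora_b = Vh[:rank]                # (rank, in_features)
    # ...and quantize only the residual, which has a smaller dynamic range
    residual = weight.float() - lora_a @ lora_b
    scale = residual.abs().max().clamp(min=1e-8) / 127.0
    residual_q = torch.clamp(torch.round(residual / scale), -127, 127).to(torch.int8)
    return lora_a, lora_b, residual_q, scale

def forward(x, lora_a, lora_b, residual_q, scale):
    # full-precision low-rank branch + dequantized residual branch
    return (x @ lora_b.t()) @ lora_a.t() + x @ (residual_q.float() * scale).t()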

Why no int4 / fp4?
Nunchaku (https://github.com/nunchaku-tech/nunchaku) has shown that 4-bit quants are possible with no noticeable quality loss. Even though their matrix math is faster than 8-bit quants, my benchmarks show that there is no overall performance gain for training just from employing 4-bit matrix math. The 8-bit linear math is already fast enough: most of the remaining time in a training step is spent on bf16 attention and other layers.
I believe the additional performance achieved by Nunchaku comes from their manual kernel fusion, not from 4-bit matmuls - but I'm happy to be proven wrong.
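
To put the argument in Amdahl's law terms (the numbers below are made up for illustration, not measurements): if only a fraction p of a training step is spent in the quantized linear layers, even an infinitely fast 4-bit matmul barely moves the overall step time.

def max_step_speedup(p: float, matmul_speedup: float) -> float:
    # Amdahl's law: only the fraction p of the step benefits from the faster matmul
    return 1.0 / ((1.0 - p) + p / matmul_speedup)

print(max_step_speedup(p=0.25, matmul_speedup=2.0))           # ~1.14x
print(max_step_speedup(p=0.25, matmul_speedup=float("inf")))  # ~1.33x upper bound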

includes #1091

dxqb (Collaborator, Author) commented Oct 5, 2025

Known issues:

  • Misaligned address errors like [Bug]: Offloading with NF4 weights #827. Changing the offloading alignment to 16 seemed to fix it again
  • Combining SVDQuant with offloading leads to NaN after a while (on Flux)
  • creating the quantization cache directory fails if the cache directory doesn't exist yet
  • resulting LoRA files currently don't work externally if torch.compile is enabled, because it changes the LoRA keys
  • resulting full finetunes cannot be loaded externally, and LoRA loading has issues for the same reason
  • numerical bug in float W8A8
  • test all models, because the checkpointing code has changed

dxqb (Collaborator, Author) commented Oct 13, 2025

  • all checkpointing issues should be fixed now (LoRA loading, LoRA saving, full finetune saving)
  • float W8A8 bug fixed
  • SVD offloading should work now
  • various other fixes

dxqb (Collaborator, Author) commented Oct 14, 2025

A torch.compile bug workaround for Chroma and Qwen has been pushed, but:

  • check other models
  • try torch.compile with dynamic=True, which might avoid this bug altogether (see the sketch below). How much slower is it?
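
For reference, dynamic shapes are just a flag on the standard torch.compile API (toy module below, not OneTrainer code):

import torch

model = torch.nn.Linear(64, 64)
compiled = torch.compile(model, dynamic=True)  # compile with dynamic shapes from the start
# without dynamic=True, a changing input shape can trigger shape-specialized recompiles;
# dynamic=True avoids those at the cost of some per-call overhead
for batch_size in (4, 8, 16):
    compiled(torch.randn(batch_size, 64))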

dxqb linked an issue on Oct 15, 2025 that may be closed by this pull request
dxqb added the merging (last steps before merge) label on Nov 27, 2025
dxqb mentioned this pull request on Nov 27, 2025
dxqb changed the base branch from master to merge on November 27, 2025 at 20:43
dxqb merged commit 7c255ac into Nerogar:merge on Nov 27, 2025
1 check passed
dxqb added a commit that referenced this pull request on Nov 28, 2025
dxqb deleted the compile_int8_svd branch on November 28, 2025 at 21:12