
New quant types, activation quantization#1034

Merged
dxqb merged 101 commits into Nerogar:merge from dxqb:compile_int8_svd
Nov 27, 2025

Conversation

dxqb (Collaborator) commented Oct 4, 2025

tl;dr: train LoRAs twice as fast as before


Switch to this PR:

git fetch origin pull/1034/head:pr-1034
git switch pr-1034

and then run update.sh / update.bat


Compiled blocks

W8A8 quant types

  • Instead of only quantizing model weights, these new quant types also quantize activations (see the sketch after this list)
  • Performance improvement from 8-bit matrix multiplications
  • Loss of precision in theory, but no noticeable quality difference between Float8 W8A8 and the current Float8 type
  • Int8 might have a slight quality loss compared to fp8, but is also faster
  • Uses Triton for additional performance - untested on Windows
  • Performance may differ depending on the NVIDIA architecture. Is fp8 faster than int8 on Blackwell? Untested
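
For illustration, here is a minimal Python sketch of the W8A8 idea for a single linear layer. It is not the actual OneTrainer/Triton kernel: per-tensor symmetric scaling is an assumption, and the int8 product is emulated in float so it runs anywhere:

import torch

def quantize_int8(t: torch.Tensor):
    # symmetric per-tensor quantization: int8 values plus one float scale
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def linear_w8a8(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    x_q, x_s = quantize_int8(x)       # activations are quantized at runtime
    w_q, w_s = quantize_int8(weight)  # weights would normally be quantized once, ahead of time
    # a real kernel does the matmul in int8 with int32 accumulation;
    # here the quantized values are cast to float so plain torch.matmul works
    acc = x_q.float() @ w_q.float().t()
    return acc * (x_s * w_s)          # rescale the result back to the original range

x, w = torch.randn(4, 64), torch.randn(32, 64)
print((linear_w8a8(x, w) - x @ w.t()).abs().max())  # small quantization error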

SVDQuant

  • Implementation of https://arxiv.org/abs/2411.05007, but applied to all OneTrainer quant types for better quality (see the sketch after this list)
  • Improved sampling quality, possibly even compared to the current Float8 quant type (opinions differ)
  • Effects on training yet to be determined
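
As a rough sketch of the SVDQuant decomposition (my reading of the paper, not OneTrainer's implementation; the rank of 32 and the int8 residual quantizer are illustrative assumptions):

import torch

def svdquant_decompose(weight: torch.Tensor, rank: int = 32):
    # keep a small low-rank part of the weight in high precision...
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    lora_a = U[:, :rank] * S[:rank]   # (out_features, rank)
    lora_b = Vh[:rank]                # (rank, in_features)
    # ...and quantize only the residual, which has a smaller dynamic range
    residual = weight.float() - lora_a @ lora_b
    scale = residual.abs().max().clamp(min=1e-8) / 127.0
    residual_q = torch.clamp(torch.round(residual / scale), -127, 127).to(torch.int8)
    return lora_a, lora_b, residual_q, scale

def forward(x, lora_a, lora_b, residual_q, scale):
    # full-precision low-rank branch + dequantized residual branch
    return (x @ lora_b.t()) @ lora_a.t() + x @ (residual_q.float() * scale).t()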

Why no int4 / fp4?
Nunchaku (https://github.com/nunchaku-tech/nunchaku) has shown that 4-bit quants are possible with no noticeable quality loss. Even though their matrix math is faster than 8-bit quants, my benchmarks show that there is no overall performance gain for training just from employing 4-bit matrix math. The 8-bit linear math is already fast enough: most of the remaining time in a training step is spent on bf16 attention and other layers.
I believe the additional performance achieved by Nunchaku comes from their manual kernel fusion, not from 4-bit matmuls - but I'm happy to be proven wrong.
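
To put the argument in Amdahl's law terms (the numbers below are made up for illustration, not measurements): if only a fraction p of a training step is spent in the quantized linear layers, even an infinitely fast 4-bit matmul barely moves the overall step time.

def max_step_speedup(p: float, matmul_speedup: float) -> float:
    # Amdahl's law: only the fraction p of the step benefits from the faster matmul
    return 1.0 / ((1.0 - p) + p / matmul_speedup)

print(max_step_speedup(p=0.25, matmul_speedup=2.0))           # ~1.14x
print(max_step_speedup(p=0.25, matmul_speedup=float("inf")))  # ~1.33x upper bound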

includes #1091

dxqb (Collaborator, Author) commented Oct 5, 2025

Known issues:

  • Misaligned address errors like [Bug]: Offloading with NF4 weights #827. Changing the offloading alignment to 16 seemed to fix it again
  • Combining SVDQuant with offloading leads to NaN after a while (on Flux)
  • creating the quantization cache directory fails if the cache directory doesn't exist yet
  • resulting LoRA files currently don't work externally if torch.compile is enabled, because it changes the LoRA keys
  • resulting full finetunes cannot be loaded externally, and LoRA loading has issues for the same reason
  • numerical bug in float W8A8
  • test all models, because the checkpointing code has changed

dxqb (Collaborator, Author) commented Oct 13, 2025

  • all checkpointing issues should be fixed now (LoRA loading, LoRA saving, full finetune saving)
  • float W8A8 bug fixed
  • SVD offloading should work now
  • various other fixes

dxqb (Collaborator, Author) commented Oct 14, 2025

A torch.compile bug workaround for Chroma and Qwen has been pushed, but:

  • check other models
  • try torch.compile with dynamic=True, which might avoid this bug altogether (see the sketch below). How much slower is it?
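
For reference, dynamic shapes are just a flag on the standard torch.compile API (toy module below, not OneTrainer code):

import torch

model = torch.nn.Linear(64, 64)
compiled = torch.compile(model, dynamic=True)  # compile with dynamic shapes from the start
# without dynamic=True, a changing input shape can trigger shape-specialized recompiles;
# dynamic=True avoids those at the cost of some per-call overhead
for batch_size in (4, 8, 16):
    compiled(torch.randn(batch_size, 64))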

dxqb linked an issue on Oct 15, 2025 that may be closed by this pull request
dxqb added the merging (last steps before merge) label on Nov 27, 2025
dxqb mentioned this pull request on Nov 27, 2025
dxqb changed the base branch from master to merge on November 27, 2025 at 20:43
dxqb merged commit 7c255ac into Nerogar:merge on Nov 27, 2025
1 check passed
dxqb added a commit that referenced this pull request on Nov 28, 2025
dxqb deleted the compile_int8_svd branch on November 28, 2025 at 21:12