
Add FP8 kernel acceleration for compressed-tensors quantized models #45699

Draft

jiqing-feng wants to merge 4 commits into huggingface:main from jiqing-feng:fp8

Conversation

@jiqing-feng (Contributor) commented Apr 29, 2026

What does this PR do?

This PR adds native FP8 matmul kernel support for compressed-tensors FP8 quantized models in transformers. Previously, compressed-tensors FP8 models were loaded via the compressed-tensors library and dequantized back to FP16/BF16 for inference. With this change, FP8 weights are kept in FP8 format and inference uses hardware-accelerated FP8 matmul kernels (torch._scaled_mm on XPU, fbgemm.f8f8bf16_rowwise on CUDA).

Key changes:

New file: src/transformers/integrations/compressed_tensors_fp8.py

  • CTFP8Linear: FP8 linear layer that stores weights in FP8 and uses row-wise FP8 matmul kernels. Activations are dynamically quantized per row via quantize_fp8_per_row (a sketch of this follows the list).
  • Weight converters (CompressedTensorsScaleConvert, CompressedTensorsFp8Dequantize) to handle the checkpoint format conversion (e.g. weight_scale → weight_scale_inv).
  • CTFP8PerRowQuantize: Online quantization support — quantize BF16 weights to FP8 per-row on-the-fly during model loading.
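
The core of the per-row scheme is how quantize_fp8_per_row maps each row onto the FP8 range. The snippet below is a minimal sketch of that idea, assuming the e4m3 format; the actual helper in this PR may differ in details such as clamping and scale dtype.

import torch

def quantize_fp8_per_row(x: torch.Tensor):
    # One scale per row: map the row's max magnitude onto the FP8 e4m3 range.
    finfo = torch.finfo(torch.float8_e4m3fn)
    row_max = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = row_max / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    # Return FP8 values plus the per-row scale needed to undo the scaling.
    return x_fp8, scale.to(torch.float32)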

Modified: src/transformers/quantizers/quantizer_compressed_tensors.py

  • CompressedTensorsHfQuantizer now detects FP8 quantization configs (float type, num_bits=8) and automatically routes to the FP8 kernel path when a GPU/XPU is available; it falls back to the default compressed-tensors dequantize path on CPU (a sketch of this check follows the list).
  • Added get_weight_conversions() and get_quantize_ops() to support both pre-quantized loading and online quantization.
  • No changes to the non-FP8 code path — existing INT8/INT4 compressed-tensors models are unaffected.
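
As a rough illustration of the routing check (not the exact code in this PR), the decision boils down to inspecting the config groups and the available devices; the helper name below is made up for illustration:

import torch
from compressed_tensors.quantization import QuantizationType

def _should_use_fp8_kernels(quantization_config) -> bool:
    # All weight groups must be 8-bit float for the FP8 kernel path.
    for scheme in quantization_config.config_groups.values():
        weights = scheme.weights
        if weights is None or weights.num_bits != 8 or weights.type != QuantizationType.FLOAT:
            return False
    # Only take the kernel path on CUDA/XPU; otherwise use the default
    # compressed-tensors dequantize path on CPU.
    return torch.cuda.is_available() or (hasattr(torch, "xpu") and torch.xpu.is_available())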

Modified: src/transformers/quantizers/auto.py

  • Minor formatting change (no functional change).

Supported models

Usage

Pre-quantized model (no config needed)

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
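
A quick way to check that the FP8 kernel path was taken is to inspect a linear layer's weight dtype after loading (the module path below assumes the Llama architecture and is only illustrative):

layer = model.model.layers[0].mlp.gate_proj
print(type(layer).__name__, layer.weight.dtype)
# On the FP8 kernel path the weight should stay in torch.float8_e4m3fn;
# on the CPU fallback it is dequantized to bfloat16.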

Online quantization

import torch
from transformers import AutoModelForCausalLM, CompressedTensorsConfig
from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs, QuantizationType, QuantizationStrategy

ct_config = CompressedTensorsConfig(
    config_groups={
        "group_0": QuantizationScheme(
            weights=QuantizationArgs(
                num_bits=8, type=QuantizationType.FLOAT, strategy=QuantizationStrategy.CHANNEL,
            ),
        ),
    },
    run_compressed=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=ct_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
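
Once loaded, the model is used like any other transformers model; a standard generation call (shown only for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
inputs = tokenizer("FP8 inference test:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))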

Devices

  • XPU (Intel Data Center Max / Arc): uses torch._scaled_mm
  • CUDA (SM89+): uses fbgemm.f8f8bf16_rowwise
  • CPU: falls back to default compressed-tensors dequantize path
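
For illustration, a rough sketch of how the forward pass could dispatch between these backends, assuming the quantize_fp8_per_row helper sketched earlier, a recent PyTorch where torch._scaled_mm accepts keyword row-wise scales, and fbgemm-gpu for the CUDA op; the PR's CTFP8Linear may differ in layout and argument details:

import torch

def fp8_forward(x_bf16, weight_fp8, weight_scale, bias=None):
    # Dynamically quantize activations per row, as CTFP8Linear does.
    x_fp8, x_scale = quantize_fp8_per_row(x_bf16)
    if x_fp8.device.type == "xpu":
        # XPU: torch._scaled_mm with per-row activation scales (M, 1) and
        # per-output-channel weight scales transposed to (1, N).
        return torch._scaled_mm(
            x_fp8, weight_fp8.t(),
            scale_a=x_scale, scale_b=weight_scale.t(),
            bias=bias, out_dtype=torch.bfloat16,
        )
    # CUDA (SM89+): row-wise FP8 GEMM from fbgemm-gpu; it expects 1-D scales.
    out = torch.ops.fbgemm.f8f8bf16_rowwise(
        x_fp8, weight_fp8, x_scale.squeeze(-1), weight_scale.squeeze(-1)
    )
    return out if bias is None else out + bias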

@sywangyi

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng jiqing-feng changed the title from Fp8 to Add FP8 kernel acceleration for compressed-tensors quantized models on Apr 29, 2026
@Rocketknight1 (Member) commented:

cc @SunMarc

@SunMarc (Member) left a comment:

Thanks, left a comment!

Comment on lines +54 to +55
- XPU: torch._scaled_mm
- CUDA: fbgemm.f8f8bf16_rowwise

Indeed, the model is dequantized on the fly even if run_compressed=True (this worked before, but compressed-tensors preferred to delegate this work to vLLM since it was duplicate work). We could add support for the most used methods here in transformers, but it would be nice to use kernels from vLLM if possible. Also, I don't know if it is worth using fbgemm: torchao doesn't use this lib anymore and it was quite a struggle to get it installed. The best would be to use kernels hosted on kernels-community, as we do for our current FP8 support.
