
Fix broken HQQ support #45147

Open
mobicham wants to merge 7 commits into huggingface:main from mobicham:main

Conversation

@mobicham
Contributor

What does this PR do?

This PR fixes HQQ support, which has been broken for a couple of months following a refactoring:

  • Online (on-the-fly) quantization works again.
  • Serialization, i.e. saving and loading HQQ models, is fixed as well (see the sketch below).
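
A minimal sketch of both fixed paths, assuming the usual HqqConfig options (the model id and output directory below are placeholders):

from transformers import AutoModelForCausalLM, HqqConfig

model_id = "meta-llama/Llama-3.2-1B"  # placeholder model id
quant_config = HqqConfig(nbits=4, group_size=64)

# On-the-fly quantization at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    quantization_config=quant_config,
)

# Serialization round-trip: save the quantized model, then reload it.
model.save_pretrained("llama-hqq-4bit")  # placeholder output directory
reloaded = AutoModelForCausalLM.from_pretrained("llama-hqq-4bit", device_map="cuda")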

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @SunMarc

@mobicham
Contributor Author

mobicham commented Apr 7, 2026

@ArthurZucker @SunMarc a little bump on this, should be an easy fix

@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: hqq

Member

@SunMarc SunMarc left a comment


Thanks for looking into that @mobicham! Since we are adding back the support, can we try to do something a bit more maintainable? This is also one of the reasons we didn't add the support back earlier: it was too complicated. I know that SINQ based a lot of their code on hqq / gemlite, and the first PR they opened did something similar to this one, but in the end they managed to clean up the integration a lot and it now looks much better. Would you be up for doing that? https://github.com/huggingface/transformers/blob/main/src/transformers/quantizers/quantizer_sinq.py
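
For context, the leaner integration style being suggested mostly comes down to implementing the standard HfQuantizer hooks and little else. A rough skeleton, assuming the hook names from recent transformers releases (the class name is hypothetical and the linked SINQ quantizer may differ in detail):

from transformers.quantizers.base import HfQuantizer


class SlimHqqQuantizer(HfQuantizer):  # hypothetical name, for illustration only
    requires_calibration = False

    def validate_environment(self, *args, **kwargs):
        # Check that the hqq package is importable; raise a clear error otherwise.
        ...

    def _process_model_before_weight_loading(self, model, **kwargs):
        # Swap the target nn.Linear modules for their quantized counterparts up front.
        ...

    def _process_model_after_weight_loading(self, model, **kwargs):
        # Finalize quantization once the checkpoint weights have been loaded.
        ...

    def is_serializable(self, safe_serialization=None):
        return True

    @property
    def is_trainable(self):
        return True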

Comment on lines 165 to 168
    # TODO: to remove
    # def create_quantized_param(
    #     self,
    #     model: "PreTrainedModel",
Member


remove that

Comment on lines +162 to +163
    def get_weight_conversions(self):
        return []
Member


we should use deserialize

Comment on lines 91 to 94
    # TODO: to remove
    # Kept here in case we see some interest in adding support for it
    # # Adds missing keys for HQQLinear modules that are loaded but the model was initialized with torch.nn.Linear
    # def update_expected_keys(
Member


remove

Comment on lines 73 to 78
@@ -72,6 +78,7 @@
logger.info("Setting dtype to torch.float32 as the default value since it was not specified.")
Member


remove that

Comment on lines +67 to +70
    def update_dtype(self, dtype):
        if dtype is not None:
            self.dtype = dtype
        return dtype
Member


Do we really need that? The tensors should already be in the right dtype, so we should be able to access it directly.
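
A minimal sketch of the alternative being suggested here, i.e. reading the compute dtype off an already-materialized tensor instead of caching it on the quantizer (plain PyTorch, independent of the transformers internals; the helper name is made up):

import torch


def infer_compute_dtype(module: torch.nn.Module, fallback: torch.dtype = torch.float16) -> torch.dtype:
    # Return the dtype of the first floating-point parameter found, or a fallback.
    for param in module.parameters():
        if param.is_floating_point():
            return param.dtype
    return fallback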

Comment on lines +335 to +347
    def _load_hqq_from_checkpoint(self, model: "PreTrainedModel"):
        """Load pre-quantized HQQ weights directly from checkpoint files."""
        from collections import defaultdict

        from safetensors import safe_open

        from ..integrations.hqq import autoname_modules, name_to_linear_tag

        # Determine target device from stored device_map
        device_map = getattr(self, "device_map", None)
        if isinstance(device_map, dict):
            # Use the first non-cpu device from the map (values can be str, int, or torch.device)
            devices = [torch.device(v) for v in device_map.values()]
Member


Why do we need that?
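
For context, the helper above groups pre-quantized tensors per module before rebuilding the HQQ layers. A minimal sketch of that grouping step with safetensors (the shard path and the key layout shown in the comment are assumptions):

from collections import defaultdict

from safetensors import safe_open

shard_path = "model.safetensors"  # placeholder path to a checkpoint shard
per_module = defaultdict(dict)

with safe_open(shard_path, framework="pt", device="cpu") as f:
    for key in f.keys():
        # Assumed key layout: "model.layers.0.self_attn.q_proj.W_q" and similar;
        # split off the last component to group tensors belonging to one module.
        module_name, _, tensor_name = key.rpartition(".")
        per_module[module_name][tensor_name] = f.get_tensor(key)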

Comment on lines +248 to +261
    def _setup_missing_key_filters(self, model, checkpoint_files):
        """Scan checkpoint files to find HQQ-quantized modules.

        For those modules:
        1. Suppress their .weight missing key warnings in the load report.
        2. Replace their weight parameter with a scalar meta tensor so that
           ``_move_missing_keys_from_meta_to_device`` does not allocate
           full-size fp16 tensors on GPU (which would cause OOM).
        """
        import re

        from safetensors import safe_open

        quantized_modules = set()
Member


Can we not do these types of changes?
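
For reference, the meta-tensor trick described in the docstring amounts to swapping the real weight for an empty parameter on the meta device, so the later device-placement pass has nothing to allocate. A minimal standalone sketch (the helper name is made up):

import torch


def park_weight_on_meta(module: torch.nn.Module) -> None:
    # Replace the weight with a zero-element parameter on the meta device so that
    # moving "missing" keys onto the target device does not allocate real memory.
    module._parameters["weight"] = torch.nn.Parameter(
        torch.empty(0, device="meta"), requires_grad=False
    )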

Comment on lines +161 to +182
        quant_config = getattr(module, "quant_config", None)
        if quant_config is None:
            # Module is skipped from quantization, just return the weight as-is
            return {full_layer_name: value}

        # Determine target device and compute dtype
        target_device = value.device
        compute_dtype = self.hf_quantizer.dtype

        # Create HQQLinear from the nn.Linear
        hqq_layer = HQQLinear(
            module,
            quant_config=quant_config,
            compute_dtype=compute_dtype,
            device=target_device,
            del_orig=True,
        )

        if hqq_layer.bias is not None and isinstance(hqq_layer.bias, torch.Tensor):
            hqq_layer.bias = torch.nn.Parameter(hqq_layer.bias)

        if self.hf_quantizer.using_multi_gpu:
Member


It would be nice if we didn't have to do that here
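
For context, the conversion above is a thin wrapper around hqq's own layer constructor. A minimal standalone sketch of that API, assuming the hqq package is installed and a CUDA device is available (layer sizes and quantization settings are placeholders):

import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Placeholder fp16 linear layer to quantize.
linear = torch.nn.Linear(4096, 4096, bias=False).half().cuda()
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# del_orig=True drops the original fp16 weight once the quantized tensors exist.
hqq_layer = HQQLinear(
    linear,
    quant_config=quant_config,
    compute_dtype=torch.float16,
    device="cuda",
    del_orig=True,
)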
