
Add VLM base model support for auto_quantize in hf_ptq #1214

Open
yueshen2016 wants to merge 2 commits into main from yueshen/gemma-auto-quant

Conversation

@yueshen2016
Contributor

@yueshen2016 yueshen2016 commented Apr 9, 2026

For VLMs like Gemma4 where the extracted language_model lacks lm_head, use the full_model's lm_head to compute logits/loss from hidden states.

How to run:

cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
  --pyt_ckpt_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it \
  --qformat nvfp4,fp8 \
  --auto_quantize_bits 6.0 \
  --calib_size 512 \
  --dataset cnn_dailymail \
  --export_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it-autoquant-6.0

What does this PR do?

Type of change: New feature

For VLMs like Gemma4, the extracted language_model is a base text model without lm_head, so it cannot produce logits or loss directly. This PR updates auto_quantize() to accept a full_model parameter and use its lm_head to compute logits/loss from the language model's hidden states, enabling auto-quantization for such architectures.

Usage

# In hf_ptq.py, auto_quantize now accepts full_model for VLMs:
auto_quantize(
    language_model,
    ...,
    full_model=model,  # pass the full VLM so lm_head can be used
)
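The fallback logic itself is not shown on this page; as a rough, torch-free sketch of the control flow (all class and function names below are hypothetical illustrations, not the actual hf_ptq.py code):

```python
class ToyBase:
    """Stands in for the extracted language_model: hidden states only, no lm_head."""

    def forward(self, tokens):
        # Fake hidden states: one value per token.
        return [float(t) for t in tokens]


class ToyVLM:
    """Stands in for the full VLM, which owns the lm_head."""

    def __init__(self):
        self.language_model = ToyBase()
        self.lm_head = lambda hidden: [h * 2.0 for h in hidden]  # fake projection


def logits_with_fallback(language_model, tokens, full_model=None):
    """Compute logits, borrowing full_model.lm_head if the base model lacks one."""
    hidden = language_model.forward(tokens)
    head = getattr(language_model, "lm_head", None)
    if head is None:
        # Base text model has no head of its own: fall back to the full VLM's.
        if full_model is None or getattr(full_model, "lm_head", None) is None:
            raise ValueError("base model has no lm_head and no usable full_model given")
        head = full_model.lm_head
    return head(hidden)


vlm = ToyVLM()
print(logits_with_fallback(vlm.language_model, [1, 2, 3], full_model=vlm))  # [2.0, 4.0, 6.0]
```

The real implementation operates on hidden-state tensors and a `torch.nn.Linear` head, but the branching is the same: use the language model's own head when present, otherwise reach into the full model.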

Testing

Tested with Gemma-4-31B-it using --qformat nvfp4,fp8 --auto_quantize_bits 6.0.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ❌
  • Did you update Changelog?: ❌

Additional Information

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved the LLM post-training quantization example to better support Vision–Language variants that separate base text models from heads. Loss and logit computations now correctly use the provided full model's head when needed, and the quantization flow passes the full model through when auto-quantization is enabled, increasing compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
@yueshen2016 yueshen2016 requested a review from a team as a code owner April 9, 2026 01:54
@yueshen2016 yueshen2016 requested a review from sugunav14 April 9, 2026 01:54
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 792fc9ff-14c3-4f8c-a6b8-1fd037ad8b8b

📥 Commits

Reviewing files that changed from the base of the PR and between 26242e3 and 3022a34.

📒 Files selected for processing (1)
  • examples/llm_ptq/hf_ptq.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm_ptq/hf_ptq.py

📝 Walkthrough

Walkthrough

The auto_quantize() function in examples/llm_ptq/hf_ptq.py now accepts an optional full_model parameter. When provided and the extracted language_model lacks an lm_head, the function adapts loss and logits computation to use full_model.lm_head. quantize_main() passes full_model into auto_quantize() when auto quantization is enabled.

Changes

Cohort / File(s): VLM / full_model handling — examples/llm_ptq/hf_ptq.py
Summary: Added an optional `full_model: torch.nn.Module | None = None` parameter to `auto_quantize()`. When the extracted `language_model` lacks an `lm_head`, loss and logits are computed via `full_model.lm_head`; `quantize_main()` forwards `full_model` when auto quantization is enabled.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 12.50%, which is insufficient; the required threshold is 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: adding VLM base model support for auto_quantize in hf_ptq.
Security Anti-Patterns ✅ Passed No security anti-patterns (torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, nosec comments, non-permissive licenses) detected in the PR changes.




@github-actions
Contributor

github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1214/

Built to branch gh-pages at 2026-04-09 03:51 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
examples/llm_ptq/hf_ptq.py (3)

1082-1087: The full_model parameter addition looks correct.

However, I notice that auto_quantize_method, auto_quantize_score_size, and auto_quantize_checkpoint from args are not being forwarded to the function. The function uses parameter defaults ("gradient", 128, None) instead of the user's command-line arguments. This appears to be a pre-existing issue.

♻️ Consider forwarding all command-line arguments
         auto_quantize(
             args,
             language_model,
             calib_dataloader,
+            auto_quantize_method=args.auto_quantize_method,
+            auto_quantize_score_size=args.auto_quantize_score_size,
+            auto_quantize_checkpoint=args.auto_quantize_checkpoint,
             full_model=full_model,
         )

342-350: Minor edge case: Consider checking that lm_head is not None.

The hasattr check on line 349 doesn't verify that lm_head is actually a valid module rather than None. If a model has lm_head = None, this would pass the check but fail at line 356 when attempting to call it.

🛡️ More defensive check
     is_base_model = (
         full_model is not None
         and language_model is not full_model
         and not hasattr(language_model, "lm_head")
-        and hasattr(full_model, "lm_head")
+        and getattr(full_model, "lm_head", None) is not None
     )
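The difference is easy to demonstrate standalone: `hasattr` only checks that the attribute exists, not that it holds anything usable.

```python
class ModelWithNoneHead:
    lm_head = None  # attribute exists, but holds no module


m = ModelWithNoneHead()
print(hasattr(m, "lm_head"))                    # True: the attribute exists
print(getattr(m, "lm_head", None) is not None)  # False: it is None, so calling it would fail
```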

297-306: Consider documenting the new full_model parameter.

The function docstring doesn't describe the new full_model parameter. Adding documentation would help clarify its purpose for VLM support.

📝 Suggested docstring update
 def auto_quantize(
     args: argparse.Namespace,
     language_model: torch.nn.Module,
     calib_dataloader: DataLoader,
     auto_quantize_method="gradient",
     auto_quantize_score_size=128,
     auto_quantize_checkpoint=None,
     full_model: torch.nn.Module | None = None,
 ):
-    """Auto search quantization of multiple formats."""
+    """Auto search quantization of multiple formats.
+
+    Args:
+        full_model: Optional full VLM model. When provided and the extracted
+            language_model lacks an lm_head (e.g., Gemma4), the function uses
+            full_model's lm_head to compute logits/loss from hidden states.
+    """

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e4da40c1-cdab-4c5e-94bd-7ca3624fb2d8

📥 Commits

Reviewing files that changed from the base of the PR and between cccfded and 26242e3.

📒 Files selected for processing (1)
  • examples/llm_ptq/hf_ptq.py

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.21%. Comparing base (cccfded) to head (3022a34).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
+ Coverage   75.56%   77.21%   +1.64%     
==========================================
  Files         353      353              
  Lines       40430    40430              
==========================================
+ Hits        30551    31218     +667     
+ Misses       9879     9212     -667     
Flag Coverage Δ
examples 44.42% <ø> (+2.69%) ⬆️
unit 55.17% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.


Add assert for full_model to satisfy mypy union-attr check, and add
blank lines before nested def statements per ruff formatting rules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
