
Add VLM base model support for auto_quantize in hf_ptq #1214

Open
yueshen2016 wants to merge 2 commits into main from yueshen/gemma-auto-quant

Conversation

@yueshen2016
Contributor

@yueshen2016 yueshen2016 commented Apr 9, 2026

For VLMs like Gemma4 where the extracted language_model lacks lm_head, use the full_model's lm_head to compute logits/loss from hidden states.

How to run:

cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
  --pyt_ckpt_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it \
  --qformat nvfp4,fp8 \
  --auto_quantize_bits 6.0 \
  --calib_size 512 \
  --dataset cnn_dailymail \
  --export_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it-autoquant-6.0

What does this PR do?

Type of change: New feature

For VLMs like Gemma4, the extracted language_model is a base text model without lm_head, so it cannot produce logits or loss directly. This PR updates auto_quantize() to accept a full_model parameter and use its lm_head to compute logits/loss from the language model's hidden states, enabling auto-quantization for such architectures.

Usage

# In hf_ptq.py, auto_quantize now accepts full_model for VLMs:
auto_quantize(
    language_model,
    ...,
    full_model=model,  # pass the full VLM so lm_head can be used
)
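The fallback logic itself is not shown on this page; as a rough, torch-free sketch of the control flow (all class and function names below are hypothetical illustrations, not the actual hf_ptq.py code):

```python
class ToyBase:
    """Stands in for the extracted language_model: hidden states only, no lm_head."""

    def forward(self, tokens):
        # Fake hidden states: one value per token.
        return [float(t) for t in tokens]


class ToyVLM:
    """Stands in for the full VLM, which owns the lm_head."""

    def __init__(self):
        self.language_model = ToyBase()
        self.lm_head = lambda hidden: [h * 2.0 for h in hidden]  # fake projection


def logits_with_fallback(language_model, tokens, full_model=None):
    """Compute logits, borrowing full_model.lm_head if the base model lacks one."""
    hidden = language_model.forward(tokens)
    head = getattr(language_model, "lm_head", None)
    if head is None:
        # Base text model has no head of its own: fall back to the full VLM's.
        if full_model is None or getattr(full_model, "lm_head", None) is None:
            raise ValueError("base model has no lm_head and no usable full_model given")
        head = full_model.lm_head
    return head(hidden)


vlm = ToyVLM()
print(logits_with_fallback(vlm.language_model, [1, 2, 3], full_model=vlm))  # [2.0, 4.0, 6.0]
```

The real implementation operates on hidden-state tensors and a `torch.nn.Linear` head, but the branching is the same: use the language model's own head when present, otherwise reach into the full model.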

Testing

Tested with Gemma-4-31B-it using --qformat nvfp4,fp8 --auto_quantize_bits 6.0.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ❌
  • Did you update Changelog?: ❌

Additional Information

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved the LLM post-training quantization example to better support Vision–Language variants that separate base text models from heads. Loss and logit computations now correctly use the provided full model's head when needed, and the quantization flow passes the full model through when auto-quantization is enabled, increasing compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
@yueshen2016 yueshen2016 requested a review from a team as a code owner April 9, 2026 01:54
@yueshen2016 yueshen2016 requested a review from sugunav14 April 9, 2026 01:54
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 792fc9ff-14c3-4f8c-a6b8-1fd037ad8b8b

📥 Commits

Reviewing files that changed from the base of the PR and between 26242e3 and 3022a34.

📒 Files selected for processing (1)
  • examples/llm_ptq/hf_ptq.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm_ptq/hf_ptq.py

📝 Walkthrough

Walkthrough

The auto_quantize() function in examples/llm_ptq/hf_ptq.py now accepts an optional full_model parameter. When provided and the extracted language_model lacks an lm_head, the function adapts loss and logits computation to use full_model.lm_head. quantize_main() passes full_model into auto_quantize() when auto quantization is enabled.

Changes

Cohort / File(s): VLM / full_model handling — examples/llm_ptq/hf_ptq.py
Summary: Added an optional `full_model: torch.nn.Module | None = None` parameter to `auto_quantize()`. When the extracted `language_model` lacks an `lm_head`, loss and logits are computed via `full_model.lm_head`; `quantize_main()` forwards `full_model` when auto quantization is enabled.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 12.50%, which is insufficient; the required threshold is 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: adding VLM base model support for auto_quantize in hf_ptq.
Security Anti-Patterns ✅ Passed No security anti-patterns (torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, nosec comments, non-permissive licenses) detected in the PR changes.




@github-actions
Contributor

github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1214/

Built to branch gh-pages at 2026-04-09 03:51 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
examples/llm_ptq/hf_ptq.py (3)

1082-1087: The full_model parameter addition looks correct.

However, I notice that auto_quantize_method, auto_quantize_score_size, and auto_quantize_checkpoint from args are not being forwarded to the function. The function uses parameter defaults ("gradient", 128, None) instead of the user's command-line arguments. This appears to be a pre-existing issue.

♻️ Consider forwarding all command-line arguments
         auto_quantize(
             args,
             language_model,
             calib_dataloader,
+            auto_quantize_method=args.auto_quantize_method,
+            auto_quantize_score_size=args.auto_quantize_score_size,
+            auto_quantize_checkpoint=args.auto_quantize_checkpoint,
             full_model=full_model,
         )

342-350: Minor edge case: Consider checking that lm_head is not None.

The hasattr check on line 349 doesn't verify that lm_head is actually a valid module rather than None. If a model has lm_head = None, this would pass the check but fail at line 356 when attempting to call it.

🛡️ More defensive check
     is_base_model = (
         full_model is not None
         and language_model is not full_model
         and not hasattr(language_model, "lm_head")
-        and hasattr(full_model, "lm_head")
+        and getattr(full_model, "lm_head", None) is not None
     )
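The difference is easy to demonstrate standalone: `hasattr` only checks that the attribute exists, not that it holds anything usable.

```python
class ModelWithNoneHead:
    lm_head = None  # attribute exists, but holds no module


m = ModelWithNoneHead()
print(hasattr(m, "lm_head"))                    # True: the attribute exists
print(getattr(m, "lm_head", None) is not None)  # False: it is None, so calling it would fail
```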

297-306: Consider documenting the new full_model parameter.

The function docstring doesn't describe the new full_model parameter. Adding documentation would help clarify its purpose for VLM support.

📝 Suggested docstring update
 def auto_quantize(
     args: argparse.Namespace,
     language_model: torch.nn.Module,
     calib_dataloader: DataLoader,
     auto_quantize_method="gradient",
     auto_quantize_score_size=128,
     auto_quantize_checkpoint=None,
     full_model: torch.nn.Module | None = None,
 ):
-    """Auto search quantization of multiple formats."""
+    """Auto search quantization of multiple formats.
+
+    Args:
+        full_model: Optional full VLM model. When provided and the extracted
+            language_model lacks an lm_head (e.g., Gemma4), the function uses
+            full_model's lm_head to compute logits/loss from hidden states.
+    """

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e4da40c1-cdab-4c5e-94bd-7ca3624fb2d8

📥 Commits

Reviewing files that changed from the base of the PR and between cccfded and 26242e3.

📒 Files selected for processing (1)
  • examples/llm_ptq/hf_ptq.py

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.21%. Comparing base (cccfded) to head (3022a34).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
+ Coverage   75.56%   77.21%   +1.64%     
==========================================
  Files         353      353              
  Lines       40430    40430              
==========================================
+ Hits        30551    31218     +667     
+ Misses       9879     9212     -667     
Flag Coverage Δ
examples 44.42% <ø> (+2.69%) ⬆️
unit 55.17% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.


Add assert for full_model to satisfy mypy union-attr check, and add
blank lines before nested def statements per ruff formatting rules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
