Add VLM base model support for auto_quantize in hf_ptq#1214
yueshen2016 wants to merge 2 commits into main from
Conversation
For VLMs like Gemma4 where the extracted language_model lacks lm_head, use the full_model's lm_head to compute logits/loss from hidden states.

How to run:

```shell
cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
    --pyt_ckpt_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it \
    --qformat nvfp4,fp8 \
    --auto_quantize_bits 6.0 \
    --calib_size 512 \
    --dataset cnn_dailymail \
    --export_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it-autoquant-6.0
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
No actionable comments were generated in the recent review. 🎉
🚧 Files skipped from review as they are similar to previous changes (1)
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
✅ Passed checks (3 passed)
🧹 Nitpick comments (3)
examples/llm_ptq/hf_ptq.py (3)
**1082-1087:** The `full_model` parameter addition looks correct. However, I notice that `auto_quantize_method`, `auto_quantize_score_size`, and `auto_quantize_checkpoint` from `args` are not being forwarded to the function. The function uses parameter defaults (`"gradient"`, `128`, `None`) instead of the user's command-line arguments. This appears to be a pre-existing issue.

♻️ Consider forwarding all command-line arguments
```diff
     auto_quantize(
         args,
         language_model,
         calib_dataloader,
+        auto_quantize_method=args.auto_quantize_method,
+        auto_quantize_score_size=args.auto_quantize_score_size,
+        auto_quantize_checkpoint=args.auto_quantize_checkpoint,
         full_model=full_model,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 1082 - 1087, The auto_quantize call is not forwarding the user's CLI choices for method/score_size/checkpoint; update the call to pass args.auto_quantize_method, args.auto_quantize_score_size, and args.auto_quantize_checkpoint into auto_quantize (in addition to args, language_model, calib_dataloader, full_model) so the function receives the user's selections instead of defaulting to "gradient", 128, or None; look for the auto_quantize invocation and add these three arguments by name.
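The pitfall this comment points at — a call that omits keyword arguments silently falling back to the function's defaults — can be reproduced with a hypothetical, stripped-down signature. The names below mirror `hf_ptq.auto_quantize` but this is a toy, and the argument values are illustrative, not real option names:

```python
import argparse

def auto_quantize_sketch(args, method="gradient", score_size=128):
    # Hypothetical stand-in for hf_ptq.auto_quantize: returns the values
    # it would actually use for the quantization search.
    return method, score_size

cli = argparse.Namespace(auto_quantize_method="other_method",
                         auto_quantize_score_size=256)

# Without forwarding, the user's CLI choices are ignored in favor of defaults:
assert auto_quantize_sketch(cli) == ("gradient", 128)

# Forwarding by name makes the function see the CLI selections:
assert auto_quantize_sketch(
    cli,
    method=cli.auto_quantize_method,
    score_size=cli.auto_quantize_score_size,
) == ("other_method", 256)
```

This is why the review suggests adding the three arguments by name at the call site rather than relying on defaults.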
**342-350:** Minor edge case: Consider checking that `lm_head` is not None. The `hasattr` check on line 349 doesn't verify that `lm_head` is actually a valid module rather than `None`. If a model has `lm_head = None`, this would pass the check but fail at line 356 when attempting to call it.

🛡️ More defensive check
```diff
     is_base_model = (
         full_model is not None
         and language_model is not full_model
         and not hasattr(language_model, "lm_head")
-        and hasattr(full_model, "lm_head")
+        and getattr(full_model, "lm_head", None) is not None
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 342 - 350, The is_base_model check can falsely pass if lm_head exists but is None; instead of using hasattr(full_model, "lm_head") use a defensive check that verifies getattr(full_model, "lm_head", None) is not None (and likewise ensure getattr(language_model, "lm_head", None) is None) so the boolean is_base_model only becomes true when full_model.lm_head is an actual module callable; update the condition where is_base_model is computed and any subsequent code that calls full_model.lm_head to rely on this non-None guarantee.
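The hasattr pitfall described above can be reproduced in a few lines (the class is a toy stand-in, not a real model):

```python
class ToyModel:
    lm_head = None  # attribute exists, but there is no usable head module

m = ToyModel()

# hasattr passes even though calling m.lm_head(...) would raise a TypeError:
assert hasattr(m, "lm_head")

# The defensive variant suggested in the review correctly rejects it:
assert getattr(m, "lm_head", None) is None
```

In other words, `hasattr` only tests attribute existence, while the `getattr(..., None) is not None` form also guards against a None-valued attribute.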
**297-306:** Consider documenting the new `full_model` parameter. The function docstring doesn't describe the new `full_model` parameter. Adding documentation would help clarify its purpose for VLM support.

📝 Suggested docstring update
```diff
 def auto_quantize(
     args: argparse.Namespace,
     language_model: torch.nn.Module,
     calib_dataloader: DataLoader,
     auto_quantize_method="gradient",
     auto_quantize_score_size=128,
     auto_quantize_checkpoint=None,
     full_model: torch.nn.Module | None = None,
 ):
-    """Auto search quantization of multiple formats."""
+    """Auto search quantization of multiple formats.
+
+    Args:
+        full_model: Optional full VLM model. When provided and the extracted
+            language_model lacks an lm_head (e.g., Gemma4), the function uses
+            full_model's lm_head to compute logits/loss from hidden states.
+    """
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 297 - 306, The docstring for auto_quantize is missing documentation for the new full_model parameter; update the function docstring to describe full_model (type: torch.nn.Module | None), explain it is an optional full vision-language model used for VLM support during calibration/quantization when the provided language_model is only a partial component, note the default is None and when callers should pass a full_model (e.g., to include visual encoder layers in calibration), and mention any behavioral differences when full_model is provided versus omitted.
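The behavior the suggested docstring describes — running a headless base model and borrowing the full model's lm_head to obtain logits and a loss — can be sketched with toy modules. All shapes, names, and modules below are illustrative stand-ins, not the actual hf_ptq code:

```python
import torch
import torch.nn as nn

embed, hidden, vocab = 8, 16, 32

# Toy stand-ins: the extracted base language model stops at hidden states,
# while the full VLM still owns the vocabulary projection (lm_head).
base_language_model = nn.Linear(embed, hidden)
full_model_lm_head = nn.Linear(hidden, vocab, bias=False)

inputs = torch.randn(2, 4, embed)           # (batch, seq, embed)
labels = torch.randint(0, vocab, (2, 4))    # (batch, seq)

hidden_states = base_language_model(inputs)    # base model output: no logits
logits = full_model_lm_head(hidden_states)     # borrow lm_head from full model
loss = nn.functional.cross_entropy(logits.view(-1, vocab), labels.view(-1))

assert logits.shape == (2, 4, vocab)
assert loss.item() > 0.0
```

The same chain (hidden states → borrowed lm_head → cross-entropy loss) is what allows the gradient-based auto-quantization search to score layers even when `language_model` alone cannot produce logits.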
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 1082-1087: The auto_quantize call is not forwarding the user's CLI
choices for method/score_size/checkpoint; update the call to pass
args.auto_quantize_method, args.auto_quantize_score_size, and
args.auto_quantize_checkpoint into auto_quantize (in addition to args,
language_model, calib_dataloader, full_model) so the function receives the
user's selections instead of defaulting to "gradient", 128, or None; look for
the auto_quantize invocation and add these three arguments by name.
- Around line 342-350: The is_base_model check can falsely pass if lm_head
exists but is None; instead of using hasattr(full_model, "lm_head") use a
defensive check that verifies getattr(full_model, "lm_head", None) is not None
(and likewise ensure getattr(language_model, "lm_head", None) is None) so the
boolean is_base_model only becomes true when full_model.lm_head is an actual
module callable; update the condition where is_base_model is computed and any
subsequent code that calls full_model.lm_head to rely on this non-None
guarantee.
- Around line 297-306: The docstring for auto_quantize is missing documentation
for the new full_model parameter; update the function docstring to describe
full_model (type: torch.nn.Module | None), explain it is an optional full
vision-language model used for VLM support during calibration/quantization when
the provided language_model is only a partial component, note the default is
None and when callers should pass a full_model (e.g., to include visual encoder
layers in calibration), and mention any behavioral differences when full_model
is provided versus omitted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e4da40c1-cdab-4c5e-94bd-7ca3624fb2d8
📒 Files selected for processing (1)
examples/llm_ptq/hf_ptq.py
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
+ Coverage   75.56%   77.21%   +1.64%
==========================================
  Files         353      353
  Lines       40430    40430
==========================================
+ Hits        30551    31218     +667
+ Misses       9879     9212     -667
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Add assert for full_model to satisfy mypy union-attr check, and add blank lines before nested def statements per ruff formatting rules. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: James Shen <yueshen@nvidia.com>
For VLMs like Gemma4 where the extracted language_model lacks lm_head, use the full_model's lm_head to compute logits/loss from hidden states.
How to run:

```shell
cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
    --pyt_ckpt_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it \
    --qformat nvfp4,fp8 \
    --auto_quantize_bits 6.0 \
    --calib_size 512 \
    --dataset cnn_dailymail \
    --export_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it-autoquant-6.0
```
What does this PR do?
Type of change: New feature
For VLMs like Gemma4, the extracted `language_model` is a base text model without `lm_head`, so it cannot produce logits or loss directly. This PR updates `auto_quantize()` to accept a `full_model` parameter and use its `lm_head` to compute logits/loss from the language model's hidden states, enabling auto-quantization for such architectures.

Usage
Testing
Tested with Gemma-4-31B-it using `--qformat nvfp4,fp8 --auto_quantize_bits 6.0`.

Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: N/A

Additional Information
🤖 Generated with Claude Code
Summary by CodeRabbit