Add Qwen3-VL support to Minitron pruning#919

Draft
eagle705 wants to merge 2 commits into NVIDIA:main from eagle705:add-vlm-pruning
Conversation

@eagle705

What does this PR do?

Type of change: New feature

Overview: This PR adds VLM support to the Minitron pruning flow (including Qwen3-VL paths), and fixes pipeline-parallel runtime issues specific to mRoPE models.

Key updates:

  • Added VLM wrapper-aware pruning by resolving the prunable language backbone (language_model) while preserving non-language components.
  • Improved HF export compatibility for VLM checkpoints by using robust dummy-model creation and architecture suffix normalization for AutoBridge.
  • Fixed PP + mRoPE runtime failures (position_ids=None on non-first pipeline stages) by ensuring position_ids are synthesized from decoder input shape (with safe kwargs fallback).
  • Updated generation utilities to:
    • provide explicit position_ids for mRoPE models,
    • send vision tensors only during prefill (step 0), not decode steps.
  • Extended dynamic conversion coverage for VLM/VLM-MoE paths:
    • TE linear compatibility for Megatron NAS modules,
    • grouped MoE expert handling,
    • IdentityOp-safe conversion/export paths,
    • auto-registration of forward-overriding subclasses used by VLM modules,
    • preserved original runtime class behavior via dynamic MRO conversion for QKV/proj wrappers.

Usage

torchrun --nproc_per_node 2 prune_minitron.py \
    --pp_size 2 \
    --hf_model_name_or_path /work/checkpoints/hf/Qwen3-VL-8B-Instruct \
    --hparams_to_skip num_attention_heads \
    --prune_target_params 6e9 \
    --output_hf_path /work/checkpoints/compressor/Qwen3-VL-8B-Instruct-Pruned-6B

torchrun --nproc_per_node 2 prune_minitron.py \
    --pp_size 2 \
    --hf_model_name_or_path /work/checkpoints/hf/Qwen3-VL-30B-A3B-Instruct \
    --prune_target_params 26e9 \
    --hparams_to_skip num_attention_heads \
    --output_hf_path /work/checkpoints/compressor/Qwen3-VL-30B-A3B-Instruct-Pruned-6B

Testing

Manual multi-GPU validation on Megatron-Bridge pruning flows:

  • Qwen3-VL-8B (pp_size=2) now runs calibration/evaluation through full iterations (the previous NoneType.ndim mRoPE crash is resolved).
  • Qwen3-VL-30B-A3B (pp_size=2) proceeds through NAS search and candidate evaluation without the PP + mRoPE runtime crash.
  • Verified that very low target-params constraints may still be infeasible ("No subnets found fitting the constraints!"); this is a search-space/constraint outcome, not a PP runtime failure.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

This PR focuses on enabling VLM/VLM-MoE pruning paths in Megatron-Bridge + ModelOpt, with mRoPE pipeline-parallel runtime stability improvements.

Signed-off-by: joosungy <joosungy@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
