
cp: fix: Qwen3.5 dense CP support (1710) into r0.4.0 #1737

Merged

akoumpa merged 1 commit into r0.4.0 from cherry-pick-1710-r0.4.0 on Apr 9, 2026

Conversation

@svcnvidia-nemo-ci
Contributor

beep boop [🤖]: Hi @HuiyingLi 👋,

we've cherry-picked #1710 into r0.4.0 for you! 🚀

Please review and approve this cherry pick at your convenience!

* feat: add neat packing (greedy knapsack) for LLM and VLM datasets

Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
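
For readers unfamiliar with the approach, here is a minimal sketch of min-heap first-fit-decreasing packing under one common reading of that phrase; `pack_sequences` and its signature are illustrative, not the PR's actual API.

```python
import heapq

def pack_sequences(lengths: list[int], pack_size: int) -> list[list[int]]:
    """Group sample indices into packs of at most pack_size total tokens.

    Visit samples longest-first (first-fit decreasing); a min-heap keyed on
    tokens used per pack makes the least-full open pack the first candidate.
    Samples longer than pack_size end up alone in an oversized pack.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    heap: list[tuple[int, int]] = []   # (tokens_used, pack_id)
    packs: list[list[int]] = []
    for idx in order:
        length = lengths[idx]
        # If the least-full pack cannot fit this sample, no open pack can.
        if heap and heap[0][0] + length <= pack_size:
            used, pack_id = heapq.heappop(heap)
            packs[pack_id].append(idx)
            heapq.heappush(heap, (used + length, pack_id))
        else:
            packs.append([idx])
            heapq.heappush(heap, (length, len(packs) - 1))
    return packs
```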

* feat: add LengthGroupedSampler for token-aware distributed sampling

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
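
A sketch of the sort-then-shuffle-within-buckets idea only; `bucket_len` is a purely illustrative parameter (a later commit in this log notes the real sampler exposes no bucket_size knob), and the smart_resize-based media token estimation is elided.

```python
import random

def length_grouped_indices(lengths: list[int], bucket_len: int, seed: int = 0) -> list[int]:
    """Order sample indices so nearby indices have similar estimated token
    counts, shuffling inside each bucket so the order is not fully fixed."""
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    out: list[int] = []
    for start in range(0, len(order), bucket_len):
        bucket = order[start:start + bucket_len]
        rng.shuffle(bucket)   # similar lengths stay adjacent, order still varies
        out.extend(bucket)
    return out
```

Batches drawn sequentially from this order contain similarly sized samples, which is what cuts the padding waste.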

* feat: integrate neat packing strategy into LLM finetune recipe

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
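
To make the new field concrete, a hypothetical recipe fragment in Python dict form; the key placement follows the later packed_sequence refactor in this log, and the actual YAML keys and defaults in the repo may differ.

```python
# Hypothetical recipe fragment; real recipes in the repo are YAML.
cfg = {
    "packed_sequence": {
        "packing_strategy": "neat",  # "neat" = greedy knapsack, "thd" = existing path
        "pack_size": 8192,
    }
}

def get_packing_strategy(cfg: dict) -> str:
    strategy = cfg.get("packed_sequence", {}).get("packing_strategy", "thd")
    if strategy not in ("neat", "thd"):
        raise ValueError(f"unknown packing_strategy: {strategy!r}")
    return strategy
```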

* chore: remove benchmark scripts not needed for this PR

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and broken sampler tests

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix all ruff lint errors across changed files

Sort imports, remove unused imports and variables, fix f-strings
without placeholders, and rename an ambiguous variable.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: run ruff format on all changed source and test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: add missing copyright headers to test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add meta-dataset loading system with ShareGPT format support

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add RobustDatasetWrapper with retry and fake image injection

RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
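
The retry part of that wrapper might look like the following; this `RetryingDataset` sketch shows only the retry idea, not the media preloading or fake-image injection the real class adds.

```python
from torch.utils.data import Dataset

class RetryingDataset(Dataset):
    """Retry neighboring indices when a sample fails to load, instead of
    crashing the whole epoch on one bad file."""

    def __init__(self, base: Dataset, max_retries: int = 3):
        self.base = base
        self.max_retries = max_retries

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, idx: int):
        for attempt in range(self.max_retries + 1):
            try:
                return self.base[(idx + attempt) % len(self.base)]
            except Exception:
                if attempt == self.max_retries:
                    raise  # exhausted retries; surface the real error
```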

* enhance: refactor label building with template-based approach

Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
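
A minimal sketch of token-ID-level assistant-turn scanning, assuming the template's start/stop token sequences are known up front; the PR's build_labels_from_template is richer than this.

```python
IGNORE_INDEX = -100  # conventional ignore index for cross-entropy

def build_labels(input_ids: list[int], start: list[int], stop: list[int]) -> list[int]:
    """Supervise only tokens between each assistant-start and assistant-stop
    template sequence; everything else is masked with IGNORE_INDEX."""
    labels = [IGNORE_INDEX] * len(input_ids)
    i, in_turn = 0, False
    while i < len(input_ids):
        pat = stop if in_turn else start
        if pat and input_ids[i:i + len(pat)] == pat:
            if in_turn:  # keep stop tokens supervised so EOS/turn-end is learned
                for j in range(i, i + len(pat)):
                    labels[j] = input_ids[j]
            i += len(pat)
            in_turn = not in_turn
            continue
        if in_turn:
            labels[i] = input_ids[i]
        i += 1
    return labels
```

Because the scan operates on token IDs rather than decoded text, it avoids the BPE context-sensitivity that made pattern matching unreliable.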

* refactor: simplify video timestamp handling with VideoMetadata

Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add precompute_tokens script for offline tokenization

Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
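
Conceptually the offline pass boils down to something like this serial JSONL version; the real script is parallel and owns its I/O format, and the conversations/value fields assume the ShareGPT layout mentioned earlier. Only the `_text_tokens` field name comes from the commit.

```python
import json

def precompute_text_tokens(in_path: str, out_path: str, tokenizer) -> None:
    """Write an exact `_text_tokens` count into each JSONL sample so the
    sampler can group by real lengths instead of heuristic estimates."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            text = " ".join(t.get("value", "") for t in sample.get("conversations", []))
            sample["_text_tokens"] = len(tokenizer.encode(text))
            fout.write(json.dumps(sample) + "\n")
```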

* feat: wire up configure_packing and attn-aware collaters for neat packing

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
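
The first bullet above names two mask layouts; here is a hedged sketch of both, with an illustrative helper name rather than the PR's collater.

```python
import torch

def packed_attention_mask(doc_lengths: list[int], attn_implementation: str):
    """2D indexed mask (document id 1, 2, 3, ... per token) for
    flash_attention_2; 4D boolean block-causal mask for SDPA."""
    total = sum(doc_lengths)
    indexed = torch.zeros(total, dtype=torch.long)
    pos = 0
    for doc_id, n in enumerate(doc_lengths, start=1):
        indexed[pos:pos + n] = doc_id
        pos += n
    if attn_implementation == "flash_attention_2":
        return indexed.unsqueeze(0)  # (1, total): flash unpads per document
    # SDPA needs an explicit mask: attend only within the same document,
    # and only to earlier positions (block-diagonal AND causal).
    same_doc = indexed.unsqueeze(0) == indexed.unsqueeze(1)
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    return (same_doc & causal).unsqueeze(0).unsqueeze(0)  # (1, 1, T, T)
```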

* refactor: move VLM packing config to top-level packed_sequence section

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* Switch VLM neat packing example to MedPix-VQA dataset with 8k seqlen

Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: deduplicate robust_collate into make_robust_collate

Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
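
A sketch of what such a shared helper can look like; the real make_robust_collate signature and its resampling policy are the PR's own.

```python
import random

def make_robust_collate(collate_fn, dataset, max_retries: int = 3):
    """Wrap a collate function so a failing batch is retried with randomly
    resampled items instead of killing the DataLoader worker."""
    def robust_collate(batch):
        for attempt in range(max_retries + 1):
            try:
                return collate_fn(batch)
            except Exception:
                if attempt == max_retries:
                    raise
                # Replace the whole batch with fresh random samples and retry.
                batch = [dataset[random.randrange(len(dataset))] for _ in batch]
    return robust_collate
```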

* refactor: move media I/O helpers from datasets.py to utils.py

Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: move random import to module level, allow pretokenize without packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: remove verbose comments from packing recipe yamls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add Qwen3.5-4B VLM neat packing recipe

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add qwen3_5 to packing patch modules and fix missing import

- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* remove LLM recipe from VLM data pipeline PR

This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update test imports after media helpers move to utils.py

Update test_datasets.py to import _read_video_frames and
_preload_media from vlm/utils.py instead of vlm/datasets.py.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for packing, utils, and collate changes

New test files:
- test_utils.py: _resolve_lmdb_image (cache, missing key, RGB),
  _build_video_metadata (empty, no video, preserved fields)
- test_packing.py: get_seqlens_in_batch, get_unpad_data,
  _passthrough_create_causal_mask (both HF signatures),
  get_attn_implementation (backend vs HF config),
  configure_packing (noop for sdpa, patches FA2 modules)

Extended test_collate_fns.py:
- make_robust_collate (success, retry, max_retries exhausted)
- neat_packed_vlm_collater attn_implementation variants
  (2D mask for FA2, 4D for sdpa, fixed max_length, pixel_values concat)

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and missing copyright headers

- ruff fix: remove unused imports (copy, BaseVideoProcessor, load_video,
  as_completed), unused variables (grid_idx, total_text_tokens,
  total_media_tokens), fix import ordering
- Add copyright headers to scripts/precompute_tokens.py and
  tests/test_meta_dataset_all.py

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format on all changed files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename test_utils.py to avoid pytest collection conflict

tests/unit_tests/datasets/test_utils.py already exists; having
test_utils.py in the vlm/ subdirectory causes a module name collision.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: configure cfg_ds.get defaults in build_dataloader tests

MagicMock().get() returns a truthy MagicMock by default, which
incorrectly triggers the pretokenize path. Configure side_effect
to return proper defaults for packing-related keys.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: make packing mask patch safe for non-packed forward passes

_passthrough_create_causal_mask now checks whether the attention mask
is actually a packed mask (4D or indexed with values > 1) before
returning it as-is. For normal 2D masks (standard training), it
delegates to the original HF create_causal_mask, preventing test
pollution where the monkey-patch breaks non-packed Qwen2 tests.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: passthrough causal mask for FA2 to avoid breaking validation

The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address code review feedback from claude[bot]

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: log actual processor type before falling back to default

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unused sum() variables flagged by ruff F841

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add validation and max_steps to VLM packing recipes

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enable checkpoint with safetensors in qwen3_vl_4b recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: enable Qwen3.5 dense CP support and fix FSDP mixed-dtype wrapping

PR #1631 added _restore_loaded_model_dtype which restores checkpoint
dtypes after loading. For Qwen3.5 dense, this puts A_log and norm back
to float32 while everything else is bfloat16, breaking FSDP2 which
requires uniform dtype per group.

Fix by adding Qwen3_5ParallelizationStrategy that:
- Moves float32 bare params (A_log) into a _fp32_params submodule so
  fully_shard_by_dtype can wrap them in a separate FSDP group
- When CP>1, swaps HF Qwen3_5GatedDeltaNet to CPAwareGatedDeltaNet
  (reusing the existing MoE CP implementation) and sets _cp_mesh
- Adds a decoder-layer pre-hook to pass position_ids to linear_attn
  (HF decoder layers don't forward it, but CP needs it)

Tested: CP=1 and CP=2 losses match (2.8484 vs 2.8484 step 0,
2.4787 vs 2.4766 step 1, val 2.9792 vs 2.9786).

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
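
The dtype regrouping described in the first bullet reduces to roughly the following sketch; it assumes NeMo's fully_shard_by_dtype behaves as described above, and it elides how the forward path still finds the moved parameter (the PR handles that via its module swap).

```python
import torch
import torch.nn as nn

def move_fp32_params_to_holder(module: nn.Module) -> None:
    """Move bare float32 parameters (e.g. A_log on an otherwise-bf16 layer)
    into a child `_fp32_params` module, so each FSDP2 group sees one dtype."""
    holder = nn.Module()
    for name, param in list(module.named_parameters(recurse=False)):
        if param.dtype == torch.float32:
            delattr(module, name)  # nn.Module.__delattr__ drops the parameter
            holder.register_parameter(name, param)
    if any(True for _ in holder.parameters()):
        module.add_module("_fp32_params", holder)
```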

* style: ruff format + add unit tests for Qwen3.5 CP/FSDP patching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address Claude review - test calls real patch_hf_model, clarify thread-safety comment

- Rewrote test_fp32_params_moved_to_holder to call the real patch_hf_model
  function instead of replicating its logic, by monkeypatching the
  transformers module stubs so isinstance() matches _FakeGatedDeltaNet.
- Clarified the thread-safety comment on the globals() swap in
  Qwen3_5ParallelizationStrategy.parallelize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address round-2 review - test_no_class_swap calls real patch_hf_model, add defensive assertion

- Updated test_no_class_swap_when_cp_disabled to call patch_hf_model
  with stubs instead of trivially asserting the fake type.
- Added defensive assertion that apply_fsdp2_sharding_recursively exists
  before the globals() swap in Qwen3_5ParallelizationStrategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add test_class_swap_when_cp_enabled for patch_hf_model cp_enabled=True path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for Qwen3_5ParallelizationStrategy.parallelize()

Add tests for the parallelize() method covering:
- patch_hf_model call and delegation to super()
- globals swap and restore of apply_fsdp2_sharding_recursively
- global restore on error (try/finally)
- CP mesh assignment when cp_enabled=True
- _fsdp_by_dtype ModuleList iteration with fully_shard_by_dtype

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci
Contributor Author

/ok to test f515dc8

@copy-pr-bot

copy-pr-bot Bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa changed the title from "cp: fix: Qwen3.5 dense CP support and FSDP mixed-dtype fix (1710) into r0.4.0" to "cp: fix: Qwen3.5 dense CP support (1710) into r0.4.0" on Apr 8, 2026
@akoumpa merged commit 58af018 into r0.4.0 on Apr 9, 2026
54 of 56 checks passed
@akoumpa deleted the cherry-pick-1710-r0.4.0 branch on April 9, 2026 at 02:32

Labels

cherry-pick · Run CICD · Trigger Testing CICD

3 participants