
fix: Qwen3.5 dense CP support and FSDP mixed-dtype fix #1710

Merged
akoumpa merged 46 commits into main from huiyingl/qwen3_5_cp_aware_fsdp
Apr 8, 2026
Conversation

HuiyingLi (Contributor) commented Apr 7, 2026

Summary

  • FSDP mixed-dtype fix: PR #1631 (fix: tied embedding v4 to v5) added _restore_loaded_model_dtype, which restores A_log to float32 in Qwen3.5 dense and breaks FSDP2's uniform-dtype requirement. Fixed by moving bare float32 params into a _fp32_params submodule so fully_shard_by_dtype can wrap them separately (a sketch follows this list).
  • CP support for Qwen3.5 dense: when CP>1, swaps the HF Qwen3_5GatedDeltaNet to the existing CPAwareGatedDeltaNet (from the MoE codebase) via a __class__ swap, reusing the FLA-based CP implementation.
  • Position ID forwarding: HF decoder layers don't pass position_ids to linear_attn, but CP needs them. Added a decoder-layer pre-hook to cache and forward them (see the second sketch below).
  • Qwen3_5ParallelizationStrategy: registered for Qwen3_5ForCausalLM and Qwen3_5ForConditionalGeneration; uses fully_shard_by_dtype per layer.
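
A minimal sketch of the dtype-split idea, assuming hypothetical helper names (`_FP32ParamHolder`, `move_fp32_params_to_holder`); it only illustrates how bare float32 parameters such as `A_log` can be re-homed onto a child module so each FSDP group sees a uniform dtype:

```python
import torch
import torch.nn as nn

class _FP32ParamHolder(nn.Module):
    """Carries bare float32 parameters as their own submodule, so an FSDP
    group wrapping it sees a single uniform dtype."""

def move_fp32_params_to_holder(module: nn.Module) -> None:
    # Collect direct (non-recursive) float32 parameters, e.g. A_log.
    fp32 = {name: p for name, p in module.named_parameters(recurse=False)
            if p.dtype == torch.float32}
    if not fp32:
        return
    holder = _FP32ParamHolder()
    module._fp32_params = holder              # registered as a submodule
    for name, param in fp32.items():
        delattr(module, name)                 # drop from module._parameters
        holder.register_parameter(name, param)
        # Plain attribute alias so forward code reading `module.A_log`
        # still sees the same tensor (bypasses nn.Module bookkeeping).
        object.__setattr__(module, name, param)
```

And a sketch of the CP enablement, assuming `CPAwareGatedDeltaNet` is importable and parameter-compatible with `Qwen3_5GatedDeltaNet` (the `__class__` swap reuses the module's existing weights):

```python
def patch_gated_delta_net_for_cp(model, cp_mesh):
    # Assumption: CPAwareGatedDeltaNet is the existing MoE CP implementation.
    for layer in model.model.layers:
        attn = getattr(layer, "linear_attn", None)
        if attn is None:
            continue
        attn.__class__ = CPAwareGatedDeltaNet  # keep weights, swap forward
        attn._cp_mesh = cp_mesh

        # HF decoder layers don't forward position_ids to linear_attn,
        # but the CP forward needs them: cache them with a pre-hook.
        def cache_position_ids(mod, args, kwargs, _attn=attn):
            if "position_ids" in kwargs:
                _attn._position_ids = kwargs["position_ids"]

        layer.register_forward_pre_hook(cache_position_ids, with_kwargs=True)
```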

Test plan

  • ruff check + format clean

🤖 Generated with Claude Code

zhiqil and others added 30 commits March 7, 2026 01:37
akoumpa added the r0.4.0 (Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.) label Apr 8, 2026
akoumpa merged commit 6ba4074 into main Apr 8, 2026
54 checks passed
akoumpa deleted the huiyingl/qwen3_5_cp_aware_fsdp branch April 8, 2026 21:08
svcnvidia-nemo-ci pushed a commit that referenced this pull request Apr 8, 2026
* feat: add neat packing (greedy knapsack) for LLM and VLM datasets

Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
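
Illustratively, the min-heap variant of first-fit-decreasing can be sketched like this (a heap keyed on used tokens keeps the emptiest open bin on top; this is a sketch of the idea, not the repo's code):

```python
import heapq

def pack_sequences(lengths, pack_size):
    """Place items longest-first into the emptiest bin that still fits,
    opening a new bin otherwise. Assumes every length <= pack_size."""
    order = sorted(range(len(lengths)), key=lengths.__getitem__, reverse=True)
    heap, bins = [], []  # heap holds (used_tokens, bin_id)
    for i in order:
        if heap and heap[0][0] + lengths[i] <= pack_size:
            used, b = heapq.heappop(heap)
            bins[b].append(i)
            heapq.heappush(heap, (used + lengths[i], b))
        else:
            bins.append([i])
            heapq.heappush(heap, (lengths[i], len(bins) - 1))
    return bins  # lists of sample indices, one list per pack
```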

* feat: add LengthGroupedSampler for token-aware distributed sampling

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
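
The bucketing idea, as a rough sketch (the `group_size` parameter here is hypothetical, not the sampler's actual API):

```python
import random

def length_grouped_order(lengths, group_size, seed=0):
    """Sort indices by estimated token length, then shuffle within
    fixed-size groups: batches stay length-homogeneous but not ordered."""
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        rng.shuffle(group)
        order[start:start + group_size] = group
    return order
```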

* feat: integrate neat packing strategy into LLM finetune recipe

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* chore: remove benchmark scripts not needed for this PR

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: lint errors and broken sampler tests

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: fix all ruff lint errors across changed files

Sort imports, remove unused imports/variables, fix f-strings
without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: run ruff format on all changed source and test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: add missing copyright headers to test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add meta-dataset loading system with ShareGPT format support

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
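
For reference, ShareGPT-style conversion is a small transform; a sketch assuming the common `{"conversations": [{"from": ..., "value": ...}]}` layout:

```python
_ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(sample):
    """Convert ShareGPT turns to chat-template {"role", "content"} dicts."""
    return [
        {"role": _ROLE_MAP.get(t["from"], t["from"]), "content": t["value"]}
        for t in sample["conversations"]
    ]
```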

* feat: add RobustDatasetWrapper with retry and fake image injection

RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
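
The retry behaviour can be sketched as a thin dataset wrapper (names hypothetical; the real class also preloads media and injects fake images):

```python
import random
from torch.utils.data import Dataset

class RetryingDataset(Dataset):
    """On a bad sample, retry with a random other index rather than
    crashing the DataLoader worker."""
    def __init__(self, base, max_retries=5):
        self.base, self.max_retries = base, max_retries

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        for _ in range(self.max_retries):
            try:
                return self.base[idx]
            except Exception:
                idx = random.randrange(len(self.base))  # pick a substitute
        raise RuntimeError("too many consecutive bad samples")
```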

* enhance: refactor label building with template-based approach

Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
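
Token-ID-level scanning is sketched below, assuming the assistant start/end marker token sequences are known (the real `build_labels_from_template` signature may differ):

```python
IGNORE_INDEX = -100

def build_labels_from_template(input_ids, assistant_start, assistant_end):
    """Labels are IGNORE_INDEX everywhere except inside assistant turns,
    found by scanning for the marker token-ID sequences."""
    labels = [IGNORE_INDEX] * len(input_ids)
    i, n = 0, len(input_ids)
    while i < n:
        if input_ids[i:i + len(assistant_start)] == assistant_start:
            j = i + len(assistant_start)
            while j < n and input_ids[j:j + len(assistant_end)] != assistant_end:
                labels[j] = input_ids[j]  # train on assistant tokens only
                j += 1
            i = j
        else:
            i += 1
    return labels
```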

* refactor: simplify video timestamp handling with VideoMetadata

Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add precompute_tokens script for offline tokenization

Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: wire up configure_packing and attn-aware collaters for neat packing

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
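
The two mask layouts can be sketched from per-document lengths inside one pack (a sketch under assumed conventions: document indices start at 1 so values > 1 mark a packed batch; the SDPA mask uses True = attend):

```python
import torch

def packed_masks(doc_lens, attn_implementation):
    total = sum(doc_lens)
    if attn_implementation == "flash_attention_2":
        # 2D indexed mask: one document index per token; the FA2 path can
        # recover per-document cu_seqlens from it.
        mask = torch.zeros(total, dtype=torch.long)
        pos = 0
        for doc_id, n in enumerate(doc_lens, start=1):
            mask[pos:pos + n] = doc_id
            pos += n
        return mask.unsqueeze(0)                      # (1, seq)
    # 4D bool block-causal mask for SDPA: causal AND same-document.
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    doc_ids = torch.repeat_interleave(
        torch.arange(len(doc_lens)), torch.tensor(doc_lens))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return (causal & same_doc).unsqueeze(0).unsqueeze(0)  # (1, 1, seq, seq)
```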

* refactor: move VLM packing config to top-level packed_sequence section

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* Switch VLM neat packing example to MedPix-VQA dataset with 8k seqlen

Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: deduplicate robust_collate into make_robust_collate

Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move media I/O helpers from datasets.py to utils.py

Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: move random import to module level, allow pretokenize without packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: remove verbose comments from packing recipe yamls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add Qwen3.5-4B VLM neat packing recipe

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add qwen3_5 to packing patch modules and fix missing import

- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
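
One tolerant way to absorb the keyword rename (a sketch, not the actual patch; the rest of the HF signature is elided into **kwargs):

```python
def _passthrough_create_causal_mask(config=None, input_embeds=None,
                                    inputs_embeds=None, attention_mask=None,
                                    **kwargs):
    """Accept both the pre-5.3.0 `input_embeds` and the 5.3.0 `inputs_embeds`
    spellings and return the collater-built mask untouched."""
    return attention_mask
```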

* remove LLM recipe from VLM data pipeline PR

This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update test imports after media helpers move to utils.py

Update test_datasets.py to import _read_video_frames and
_preload_media from vlm/utils.py instead of vlm/datasets.py.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for packing, utils, and collate changes

New test files:
- test_utils.py: _resolve_lmdb_image (cache, missing key, RGB),
  _build_video_metadata (empty, no video, preserved fields)
- test_packing.py: get_seqlens_in_batch, get_unpad_data,
  _passthrough_create_causal_mask (both HF signatures),
  get_attn_implementation (backend vs HF config),
  configure_packing (noop for sdpa, patches FA2 modules)

Extended test_collate_fns.py:
- make_robust_collate (success, retry, max_retries exhausted)
- neat_packed_vlm_collater attn_implementation variants
  (2D mask for FA2, 4D for sdpa, fixed max_length, pixel_values concat)

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and missing copyright headers

- ruff fix: remove unused imports (copy, BaseVideoProcessor, load_video,
  as_completed), unused variables (grid_idx, total_text_tokens,
  total_media_tokens), fix import ordering
- Add copyright headers to scripts/precompute_tokens.py and
  tests/test_meta_dataset_all.py

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format on all changed files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename test_utils.py to avoid pytest collection conflict

tests/unit_tests/datasets/test_utils.py already exists; having
test_utils.py in the vlm/ subdirectory causes a module name collision.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: configure cfg_ds.get defaults in build_dataloader tests

MagicMock().get() returns a truthy MagicMock by default, which
incorrectly triggers the pretokenize path. Configure side_effect
to return proper defaults for packing-related keys.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
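
The MagicMock pitfall and its fix, sketched (the key names are assumptions):

```python
from unittest.mock import MagicMock

cfg_ds = MagicMock()
# MagicMock().get(...) returns a truthy MagicMock by default, which would
# send the dataloader down the pretokenize path; pin the inspected keys.
_defaults = {"pretokenize": False, "pack_size": None, "max_length": None}
cfg_ds.get.side_effect = lambda key, default=None: _defaults.get(key, default)
```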

* fix: make packing mask patch safe for non-packed forward passes

_passthrough_create_causal_mask now checks whether the attention mask
is actually a packed mask (4D or indexed with values > 1) before
returning it as-is. For normal 2D masks (standard training), it
delegates to the original HF create_causal_mask, preventing test
pollution where the monkey-patch breaks non-packed Qwen2 tests.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: passthrough causal mask for FA2 to avoid breaking validation

The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
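
Combining the last two fixes, the patched mask builder roughly behaves as below (a sketch; the module-level placeholders stand in for state that configure_packing would set):

```python
_ATTN_IMPLEMENTATION = "flash_attention_2"  # assumed: set by configure_packing
_orig_create_causal_mask = None             # assumed: the saved HF function

def _patched_create_causal_mask(*args, attention_mask=None, **kwargs):
    """Packed masks (4D, or indexed with values > 1) and all FA2 calls pass
    through; non-packed sdpa/eager calls delegate to the HF builder."""
    is_packed = attention_mask is not None and (
        attention_mask.dim() == 4 or int(attention_mask.max()) > 1)
    if is_packed or _ATTN_IMPLEMENTATION == "flash_attention_2":
        return attention_mask
    return _orig_create_causal_mask(*args, attention_mask=attention_mask, **kwargs)
```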

* fix: address code review feedback from claude[bot]

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
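
The kept-indices contract, sketched (the real helper's signature may differ):

```python
def _drop_overlong_samples(conversations, lengths, max_length):
    """Filter overlong samples and also return the kept indices so any
    parallel list (examples, media) can be filtered in sync."""
    kept = [i for i, n in enumerate(lengths) if n <= max_length]
    return [conversations[i] for i in kept], kept
```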

* fix: log actual processor type before falling back to default

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unused sum() variables flagged by ruff F841

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add validation and max_steps to VLM packing recipes

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enable checkpoint with safetensors in qwen3_vl_4b recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: enable Qwen3.5 dense CP support and fix FSDP mixed-dtype wrapping

PR #1631 added _restore_loaded_model_dtype which restores checkpoint
dtypes after loading. For Qwen3.5 dense, this puts A_log and norm back
to float32 while everything else is bfloat16, breaking FSDP2 which
requires uniform dtype per group.

Fix by adding Qwen3_5ParallelizationStrategy that:
- Moves float32 bare params (A_log) into a _fp32_params submodule so
  fully_shard_by_dtype can wrap them in a separate FSDP group
- When CP>1, swaps HF Qwen3_5GatedDeltaNet to CPAwareGatedDeltaNet
  (reusing the existing MoE CP implementation) and sets _cp_mesh
- Adds a decoder-layer pre-hook to pass position_ids to linear_attn
  (HF decoder layers don't forward it, but CP needs it)

Tested: CP=1 and CP=2 losses match (2.8484 vs 2.8484 step 0,
2.4787 vs 2.4766 step 1, val 2.9792 vs 2.9786).

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format + add unit tests for Qwen3.5 CP/FSDP patching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address Claude review - test calls real patch_hf_model, clarify thread-safety comment

- Rewrote test_fp32_params_moved_to_holder to call the real patch_hf_model
  function instead of replicating its logic, by monkeypatching the
  transformers module stubs so isinstance() matches _FakeGatedDeltaNet.
- Clarified the thread-safety comment on the globals() swap in
  Qwen3_5ParallelizationStrategy.parallelize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address round-2 review - test_no_class_swap calls real patch_hf_model, add defensive assertion

- Updated test_no_class_swap_when_cp_disabled to call patch_hf_model
  with stubs instead of trivially asserting the fake type.
- Added defensive assertion that apply_fsdp2_sharding_recursively exists
  before the globals() swap in Qwen3_5ParallelizationStrategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add test_class_swap_when_cp_enabled for patch_hf_model cp_enabled=True path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for Qwen3_5ParallelizationStrategy.parallelize()

Add tests for the parallelize() method covering:
- patch_hf_model call and delegation to super()
- globals swap and restore of apply_fsdp2_sharding_recursively
- global restore on error (try/finally)
- CP mesh assignment when cp_enabled=True
- _fsdp_by_dtype ModuleList iteration with fully_shard_by_dtype

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
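
The swap-and-restore pattern these tests cover, in a self-contained sketch (a SimpleNamespace stands in for the module globals that the real code swaps):

```python
import types

helpers = types.SimpleNamespace(
    apply_fsdp2_sharding_recursively=lambda model: ("default", model))

def fully_shard_by_dtype(model):
    return ("by-dtype", model)

def parallelize(model):
    """Temporarily swap the sharding helper the base class will call,
    restoring it even if parallelization raises (try/finally)."""
    assert hasattr(helpers, "apply_fsdp2_sharding_recursively")  # defensive
    original = helpers.apply_fsdp2_sharding_recursively
    helpers.apply_fsdp2_sharding_recursively = fully_shard_by_dtype
    try:
        return helpers.apply_fsdp2_sharding_recursively(model)
    finally:
        helpers.apply_fsdp2_sharding_recursively = original
```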

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
akoumpa pushed a commit that referenced this pull request Apr 9, 2026
fix: Qwen3.5 dense CP support and FSDP mixed-dtype fix (#1710)

akoumpa pushed a commit that referenced this pull request Apr 10, 2026
* feat: add neat packing (greedy knapsack) for LLM and VLM datasets

Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add LengthGroupedSampler for token-aware distributed sampling

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: integrate neat packing strategy into LLM finetune recipe

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* chore: remove benchmark scripts not needed for this PR

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: lint errors and broken sampler tests

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: fix all ruff lint errors across changed files

Sort imports, remove unused imports/variables, fix f-strings
without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: run ruff format on all changed source and test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: add missing copyright headers to test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add meta-dataset loading system with ShareGPT format support

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add RobustDatasetWrapper with retry and fake image injection

RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* enhance: refactor label building with template-based approach

Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: simplify video timestamp handling with VideoMetadata

Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add precompute_tokens script for offline tokenization

Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: wire up configure_packing and attn-aware collaters for neat packing

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move VLM packing config to top-level packed_sequence section

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* Switch VLM neat packing example to MedPix-VQA dataset with 8k seqlen

Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: deduplicate robust_collate into make_robust_collate

Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move media I/O helpers from datasets.py to utils.py

Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: move random import to module level, allow pretokenize without packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: remove verbose comments from packing recipe yamls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add Qwen3.5-4B VLM neat packing recipe

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add qwen3_5 to packing patch modules and fix missing import

- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* remove LLM recipe from VLM data pipeline PR

This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update test imports after media helpers move to utils.py

Update test_datasets.py to import _read_video_frames and
_preload_media from vlm/utils.py instead of vlm/datasets.py.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for packing, utils, and collate changes

New test files:
- test_utils.py: _resolve_lmdb_image (cache, missing key, RGB),
  _build_video_metadata (empty, no video, preserved fields)
- test_packing.py: get_seqlens_in_batch, get_unpad_data,
  _passthrough_create_causal_mask (both HF signatures),
  get_attn_implementation (backend vs HF config),
  configure_packing (noop for sdpa, patches FA2 modules)

Extended test_collate_fns.py:
- make_robust_collate (success, retry, max_retries exhausted)
- neat_packed_vlm_collater attn_implementation variants
  (2D mask for FA2, 4D for sdpa, fixed max_length, pixel_values concat)

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and missing copyright headers

- ruff fix: remove unused imports (copy, BaseVideoProcessor, load_video,
  as_completed), unused variables (grid_idx, total_text_tokens,
  total_media_tokens), fix import ordering
- Add copyright headers to scripts/precompute_tokens.py and
  tests/test_meta_dataset_all.py

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format on all changed files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename test_utils.py to avoid pytest collection conflict

tests/unit_tests/datasets/test_utils.py already exists; having
test_utils.py in the vlm/ subdirectory causes a module name collision.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: configure cfg_ds.get defaults in build_dataloader tests

MagicMock().get() returns a truthy MagicMock by default, which
incorrectly triggers the pretokenize path. Configure side_effect
to return proper defaults for packing-related keys.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: make packing mask patch safe for non-packed forward passes

_passthrough_create_causal_mask now checks whether the attention mask
is actually a packed mask (4D or indexed with values > 1) before
returning it as-is. For normal 2D masks (standard training), it
delegates to the original HF create_causal_mask, preventing test
pollution where the monkey-patch breaks non-packed Qwen2 tests.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: passthrough causal mask for FA2 to avoid breaking validation

The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address code review feedback from claude[bot]

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: log actual processor type before falling back to default

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unused sum() variables flagged by ruff F841

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add validation and max_steps to VLM packing recipes

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enable checkpoint with safetensors in qwen3_vl_4b recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: enable Qwen3.5 dense CP support and fix FSDP mixed-dtype wrapping

PR #1631 added _restore_loaded_model_dtype which restores checkpoint
dtypes after loading. For Qwen3.5 dense, this puts A_log and norm back
to float32 while everything else is bfloat16, breaking FSDP2 which
requires uniform dtype per group.

Fix by adding Qwen3_5ParallelizationStrategy that:
- Moves float32 bare params (A_log) into a _fp32_params submodule so
  fully_shard_by_dtype can wrap them in a separate FSDP group
- When CP>1, swaps HF Qwen3_5GatedDeltaNet to CPAwareGatedDeltaNet
  (reusing the existing MoE CP implementation) and sets _cp_mesh
- Adds a decoder-layer pre-hook to pass position_ids to linear_attn
  (HF decoder layers don't forward it, but CP needs it)

Tested: CP=1 and CP=2 losses match (2.8484 vs 2.8484 step 0,
2.4787 vs 2.4766 step 1, val 2.9792 vs 2.9786).

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
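
A minimal sketch of the two mechanical pieces, with hypothetical helper names (the real strategy lives in the Qwen3.5 parallelizer):

```python
import torch
import torch.nn as nn


class _Fp32ParamHolder(nn.Module):
    """Holds the bare float32 params (e.g. A_log) so fully_shard_by_dtype
    can wrap them in their own uniform-dtype FSDP group."""

    def __init__(self, params):
        super().__init__()
        for name, p in params.items():
            self.register_parameter(name, p)


def move_fp32_params(linear_attn: nn.Module) -> None:
    fp32 = {
        name: p
        for name, p in list(linear_attn.named_parameters(recurse=False))
        if p.dtype == torch.float32
    }
    for name in fp32:
        del linear_attn._parameters[name]  # unregister from the bf16 module
    linear_attn._fp32_params = _Fp32ParamHolder(fp32)


def position_ids_prehook(decoder_layer, args, kwargs):
    # HF decoder layers don't pass position_ids down to linear_attn; cache
    # them on the attention module so the CP path can read them.
    decoder_layer.linear_attn._position_ids = kwargs.get("position_ids")
    return args, kwargs

# for layer in model.model.layers:
#     layer.register_forward_pre_hook(position_ids_prehook, with_kwargs=True)
```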

* style: ruff format + add unit tests for Qwen3.5 CP/FSDP patching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address Claude review - test calls real patch_hf_model, clarify thread-safety comment

- Rewrote test_fp32_params_moved_to_holder to call the real patch_hf_model
  function instead of replicating its logic, by monkeypatching the
  transformers module stubs so isinstance() matches _FakeGatedDeltaNet.
- Clarified the thread-safety comment on the globals() swap in
  Qwen3_5ParallelizationStrategy.parallelize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address round-2 review - test_no_class_swap calls real patch_hf_model, add defensive assertion

- Updated test_no_class_swap_when_cp_disabled to call patch_hf_model
  with stubs instead of trivially asserting the fake type.
- Added defensive assertion that apply_fsdp2_sharding_recursively exists
  before the globals() swap in Qwen3_5ParallelizationStrategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
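
A sketch of the guarded swap; the stub base class and helpers stand in for the real ones:

```python
class _BaseStrategy:
    def parallelize(self, model):
        # The base path calls apply_fsdp2_sharding_recursively internally.
        return apply_fsdp2_sharding_recursively(model)


def apply_fsdp2_sharding_recursively(model):  # stand-in for the real helper
    return model


def fully_shard_by_dtype(model):  # stand-in dtype-aware variant
    return model


class Qwen3_5ParallelizationStrategy(_BaseStrategy):
    def parallelize(self, model):
        g = globals()
        # Defensive assertion: fail loudly if the patched symbol disappears.
        assert "apply_fsdp2_sharding_recursively" in g
        original = g["apply_fsdp2_sharding_recursively"]
        g["apply_fsdp2_sharding_recursively"] = fully_shard_by_dtype
        try:
            return super().parallelize(model)
        finally:
            # Restore even on error; the swap is process-global for its
            # duration, hence the thread-safety caveat in the comment.
            g["apply_fsdp2_sharding_recursively"] = original
```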

* test: add test_class_swap_when_cp_enabled for patch_hf_model cp_enabled=True path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for Qwen3_5ParallelizationStrategy.parallelize()

Add tests for the parallelize() method covering:
- patch_hf_model call and delegation to super()
- globals swap and restore of apply_fsdp2_sharding_recursively
- global restore on error (try/finally)
- CP mesh assignment when cp_enabled=True
- _fsdp_by_dtype ModuleList iteration with fully_shard_by_dtype

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HuiyingLi added a commit that referenced this pull request Apr 13, 2026
PR #1711 changed _should_load_before_shard to return False for multi-GPU
DP, so models stay on meta device through FSDP wrapping. This broke the
__dict__ trick in PR #1710's patch_hf_model.

Move the gate computation into _Fp32ParamHolder.forward() so FSDP's
unshard/reshard lifecycle fires naturally. Override CPAwareGatedDeltaNet
forward for both CP and non-CP paths to route through the holder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
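
A sketch of the structural move, extending the holder idea above; the gate formula is illustrative only, the real math is the upstream GatedDeltaNet gate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class _Fp32ParamHolder(nn.Module):
    """Owning the gate math means FSDP2's pre-forward unshard hook fires on
    this submodule, materializing the fp32 params even when the outer model
    was built on the meta device."""

    def __init__(self, a_log: nn.Parameter, dt_bias: nn.Parameter):
        super().__init__()
        self.A_log = a_log
        self.dt_bias = dt_bias

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # Gate computed in float32 for stability, then cast back to the
        # activation dtype (formula shown for illustration only).
        g = -self.A_log.float().exp() * F.softplus(a.float() + self.dt_bias)
        return g.to(a.dtype)
```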
HuiyingLi added a commit that referenced this pull request Apr 15, 2026
…1813)

* fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params

PR #1711 changed _should_load_before_shard to return False for multi-GPU
DP, so models stay on meta device through FSDP wrapping. This broke the
__dict__ trick in PR #1710's patch_hf_model.

Move the gate computation into _Fp32ParamHolder.forward() so FSDP's
unshard/reshard lifecycle fires naturally. Override CPAwareGatedDeltaNet
forward for both CP and non-CP paths to route through the holder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* chore: remove test yaml not intended for PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add sentinel to prevent __getattr__ re-wrapping

Address Claude review: guard against re-wrapping __getattr__ on
repeated patch_hf_model calls by checking a class-level sentinel
attribute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
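
A sketch of the sentinel guard (names are hypothetical; it assumes the class defines or inherits __getattr__, as nn.Module subclasses do):

```python
_WRAP_SENTINEL = "_fp32_getattr_wrapped"


def wrap_getattr_once(cls):
    """Wrap cls.__getattr__ at most once, even if patch_hf_model runs again."""
    if getattr(cls, _WRAP_SENTINEL, False):
        return  # already wrapped; wrapping again would chain wrappers

    original = cls.__getattr__

    def wrapped(self, name):
        # ...first try to resolve fp32 params through the holder (elided)...
        return original(self, name)

    cls.__getattr__ = wrapped
    setattr(cls, _WRAP_SENTINEL, True)
```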

* fix: add upstream version comment to _forward_no_cp

Address Claude review: note the transformers version the forward was
copied from to ease future upstream diffing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update MoE test expectations for _forward_no_cp path

TestForwardFastPath tests expected super().forward() to be called,
but the non-CP path now uses _forward_no_cp(). Update mocks to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for _Fp32ParamHolder, _compute_gate, and sentinel guard

Add unit tests for:
- _Fp32ParamHolder.forward gate computation and dtype preservation
- _compute_gate routing through holder vs inline fallback
- patch_hf_model sentinel preventing __getattr__ re-wrapping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for _forward_no_cp and forward() dispatch paths

Add 14 new tests covering the critical _forward_no_cp method (lines
91-193) and forward() dispatch logic (lines 207-213) to satisfy
codecov/patch requirements for PR #1813:

- _forward_no_cp basic forward, cache_params=None, causal_conv1d_fn
  fallback, causal_conv1d_fn set, attention_mask, GQA repeat-interleave,
  _compute_gate delegation, and output dtype
- forward() dispatch when _cp_mesh is None or size <= 1, parameter
  pass-through, and extra CP kwargs
- _make_fp32_getattr fallback to AttributeError and real attr resolution

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HuiyingLi added a commit that referenced this pull request Apr 16, 2026
…1813)

akoumpa pushed a commit that referenced this pull request Apr 16, 2026
…params (#1869)

* fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params (#1813)

* fix: update MoE test_no_cp_does_not_forward_cache_position to use _forward_no_cp

The fast path in CPAwareGatedDeltaNet.forward was refactored to call
self._forward_no_cp() instead of super().forward(), but this test still
mocked the base-class forward, so the mock was called 0 times. Update the
mock target to match the new dispatch, and apply ruff format to the
two test files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
edjson pushed a commit to edjson/Automodel that referenced this pull request Apr 17, 2026
edjson pushed a commit to edjson/Automodel that referenced this pull request Apr 18, 2026
HuiyingLi added a commit that referenced this pull request Apr 20, 2026
…1813)

akoumpa added a commit that referenced this pull request Apr 21, 2026
…er knob (#1859)

* add qwen3_5

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params (#1813)

* fix: Qwen3.5 VLM pipeline parallelism support

- parallelizer.py: handle nn.ModuleDict in _fsdp_by_dtype, walk attributes
  safely in _extract_model_layers, and use a string key for
  Qwen3_5ForConditionalGeneration
- hf_utils.py: route the pipeline forward through get_text_module so nested
  VLM text modules (model.language_model.{embed_tokens,layers,norm}) work
- finetune.py: call update_seq_len per batch to precompute pipeline stage
  shapes analytically (needed for GatedDeltaNet and VLM variable-length
  batches)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: Qwen3.5 VLM TP plan + per-microbatch grad reduce-scatter knob

Qwen3.5 VLM TP support:
- Register Qwen3_5ForConditionalGeneration in PARALLELIZE_FUNCTIONS; plan
  delegates to get_hf_tp_shard_plan which reads transformers'
  base_model_tp_plan with prefix model.language_model. GatedDeltaNet layers
  stay un-sharded (no stock TP plan exists for linear_attn).
- Extend get_hf_tp_shard_plan dispatch to handle Qwen3.5 VLM nesting.
- Treat transformers' "replicated_with_grad_allreduce" style as a no-op
  (skip it in the plan); under FSDP+TP, the TP-mesh replication already
  behaves correctly for norm weights.

Per-microbatch grad reduce-scatter (PipelineConfig.reduce_grad_per_microbatch):
- When True, FSDP reduce-scatters grads every microbatch instead of
  accumulating full-size grads under no_sync until the last one. Keeps
  grads sharded across DP, saving roughly 2 bytes per trainable parameter
  per rank (~27 GB for a 13B-trainable-param stage in bf16). Defaults to False.
- Plumbed through PipelineConfig -> AutoPipeline -> pipeline_model; patches
  stages via types.MethodType and stores the flag on each stage; the
  patched backward_maybe_with_nosync branches on stage._reduce_grad_per_microbatch.
- kd.py teacher pipeline config propagates the field.

Validated locally on 8 GPUs (pp=2, dp=4, lbs=4): peak memory drops from
66 GB (OOM at step 1 on 80 GB GPUs) to 32-41 GB across 3 steps with the
knob enabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
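
A rough sketch of the mechanism via FSDP2's set_requires_gradient_sync; the
repo instead patches backward_maybe_with_nosync through types.MethodType, so
this helper is illustrative only:

```python
import torch

def backward_with_grad_sync_knob(stage_module, loss: torch.Tensor,
                                 is_last_microbatch: bool,
                                 reduce_grad_per_microbatch: bool) -> None:
    # stage_module is assumed to be a fully_shard-wrapped FSDP2 module.
    if reduce_grad_per_microbatch:
        # Reduce-scatter after every microbatch: grads stay sharded across
        # DP ranks, at the cost of extra communication per microbatch.
        stage_module.set_requires_gradient_sync(True)
    else:
        # Default accumulation: skip gradient sync (no_sync) until the last
        # microbatch, holding full-size unsharded grads on each rank.
        stage_module.set_requires_gradient_sync(is_last_microbatch)
    loss.backward()
```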

* feat: add Qwen3.5-27B VLM TP4+PP4 recipe

2-node (16 GPUs) tp=4, pp=4, dp=1 config. Uses the new
reduce_grad_per_microbatch knob to keep grads sharded across microbatches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* chore: Qwen3.5-27B tp4pp4 recipe — HF model id + wandb stub

- Use Qwen/Qwen3.5-27B instead of a local checkpoint path
- Add commented-out wandb section so users know how to enable it

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: drop redundant Qwen3.5 entries

- Remove duplicate self.pp.update_seq_len call in vlm/finetune.py (line
  940 already covers it every batch; update_seq_len short-circuits when
  seq_len is unchanged).
- Drop string-keyed Qwen3_5ForConditionalGeneration entry from
  VLM_MODEL_CLS_TO_LAYERS; the class-keyed entry is sufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: reuse defer_fsdp_grad_sync for PP; restore Qwen3.5 string fallback

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: cover Qwen3.5 VLM TP plan + grad-sync knob additions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: drop unused functools.reduce import

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix(pipelining): get_text_module skips non-Module attrs; stub test mocks

The helper previously returned any attr named 'language_model' /
'text_model' / 'text_decoder' — including auto-generated unittest
Mocks — which broke pipeline_forward tests that passed a plain Mock
model. The helper now descends only into real nn.Module instances.

Also explicitly set embed_tokens / layers / norm to None on the
mocked text module in the two get_text_module rotary tests so the
now-routed pipeline_forward skips those branches cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
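
A hedged sketch of the fixed descent rule (the repo's helper may differ in
details):

```python
import torch.nn as nn

def get_text_module(model: nn.Module) -> nn.Module:
    # Descend only into attrs that are real nn.Module instances, so unittest
    # Mocks (which auto-create any attribute) are never mistaken for a
    # nested text decoder.
    for name in ("language_model", "text_model", "text_decoder"):
        child = getattr(model, name, None)
        if isinstance(child, nn.Module):
            return get_text_module(child)
    return model
```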

* fix: lazy-load transformers.models.qwen3_5 to unblock test stubs

Eagerly importing Qwen3_5ForConditionalGeneration at module load was
pre-loading transformers.models.qwen3_5 into sys.modules, defeating
test_cp_linear_attn_patch.py's module stubbing. Switch to string-based
class qualname lookup + __name__ comparison instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
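
A sketch of the lazy, string-based check (the constant name is assumed):

```python
# Comparing type(model).__name__ avoids importing
# transformers.models.qwen3_5 at module load time, so sys.modules stubs
# installed by tests keep working.
_QWEN3_5_VLM_NAME = "Qwen3_5ForConditionalGeneration"

def _is_qwen3_5_vlm(model) -> bool:
    return type(model).__name__ == _QWEN3_5_VLM_NAME
```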

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
* feat: add neat packing (greedy knapsack) for LLM and VLM datasets

Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
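
For reference, a self-contained sketch of heap-based greedy packing, where
each bin is one packed sequence of capacity pack_size; this is a generic
worst-fit-flavored variant of the greedy knapsack, not the repo's exact
implementation:

```python
import heapq

def greedy_pack(lengths: list[int], pack_size: int) -> list[list[int]]:
    # Longest-first ordering (the "decreasing" in first-fit-decreasing).
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    heap: list[tuple[int, int]] = []  # (tokens_used, bin_id), min-heap
    bins: list[list[int]] = []
    for i in order:
        if heap and heap[0][0] + lengths[i] <= pack_size:
            used, b = heapq.heappop(heap)  # least-full open bin that fits
            bins[b].append(i)
            heapq.heappush(heap, (used + lengths[i], b))
        else:
            bins.append([i])  # nothing fits: open a new bin
            heapq.heappush(heap, (lengths[i], len(bins) - 1))
    return bins  # each inner list holds sample indices packed together
```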

* feat: add LengthGroupedSampler for token-aware distributed sampling

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
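
A hypothetical sketch of the sort-then-bucket-shuffle idea; the bucket
parameter here is illustrative, not part of the sampler's actual API:

```python
import random

def length_grouped_indices(lengths: list[int], bucket: int,
                           seed: int = 0) -> list[int]:
    # Sort by estimated token length, then shuffle within fixed-size
    # buckets: batches draw similar lengths, yet keep some randomness.
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    for start in range(0, len(order), bucket):
        chunk = order[start:start + bucket]
        rng.shuffle(chunk)
        order[start:start + bucket] = chunk
    return order
```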

* feat: integrate neat packing strategy into LLM finetune recipe

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove benchmark scripts not needed for this PR

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and broken sampler tests

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix all ruff lint errors across changed files

Sort imports, remove unused imports/variables, fix f-strings
without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: run ruff format on all changed source and test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: add missing copyright headers to test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add meta-dataset loading system with ShareGPT format support

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
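
A hedged sketch of the ShareGPT conversion step, using the commonly seen
field names (conversations/from/value); the repo's loader may use different
keys or role mappings:

```python
_ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(sample: dict) -> list[dict]:
    # Map ShareGPT turns to the chat-template message format.
    return [
        {"role": _ROLE_MAP.get(turn["from"], turn["from"]),
         "content": turn["value"]}
        for turn in sample["conversations"]
    ]
```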

* feat: add RobustDatasetWrapper with retry and fake image injection

RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* enhance: refactor label building with template-based approach

Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
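
A minimal sketch of token-ID-level label scanning, assuming placeholder
template token sequences; the real build_labels_from_template differs in
specifics:

```python
IGNORE_INDEX = -100  # conventional ignore label for cross-entropy

def build_labels_sketch(input_ids: list[int], asst_prefix: list[int],
                        turn_end: list[int]) -> list[int]:
    # Unmask only tokens inside assistant turns, found by exact token-ID
    # matching against the chat template (no BPE-sensitive string search).
    labels = [IGNORE_INDEX] * len(input_ids)
    i, n = 0, len(input_ids)
    while i < n:
        if input_ids[i:i + len(asst_prefix)] == asst_prefix:
            start = i + len(asst_prefix)
            j = start
            while j < n and input_ids[j:j + len(turn_end)] != turn_end:
                j += 1
            end = min(j + len(turn_end), n)
            labels[start:end] = input_ids[start:end]
            i = end
        else:
            i += 1
    return labels
```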

* refactor: simplify video timestamp handling with VideoMetadata

Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add precompute_tokens script for offline tokenization

Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: wire up configure_packing and attn-aware collaters for neat packing

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
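
For reference, a sketch of the two mask layouts described above; shapes and
conventions are assumed, not taken from the repo:

```python
import torch

def indexed_doc_mask(doc_lens: list[int], pack_size: int) -> torch.Tensor:
    # 2D "indexed" mask for flash attention: 0 = padding, 1..N = document
    # id, from which per-document cu_seqlens can be recovered.
    mask = torch.zeros(pack_size, dtype=torch.long)
    pos = 0
    for doc_id, n in enumerate(doc_lens, start=1):
        mask[pos:pos + n] = doc_id
        pos += n
    return mask

def block_causal_from_indexed(indexed: torch.Tensor) -> torch.Tensor:
    # 4D boolean block-causal mask for SDPA: attend only within the same
    # document, causally, and never to padding.
    same_doc = indexed.unsqueeze(0) == indexed.unsqueeze(1)
    causal = torch.tril(torch.ones_like(same_doc))
    not_pad = (indexed > 0).unsqueeze(0)
    return (same_doc & causal & not_pad)[None, None]  # [1, 1, S, S]
```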

* refactor: move VLM packing config to top-level packed_sequence section

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* Switch VLM neat packing example to MedPix-VQA dataset with 8k seqlen

Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: deduplicate robust_collate into make_robust_collate

Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move media I/O helpers from datasets.py to utils.py

Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: move random import to module level, allow pretokenize without packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: remove verbose comments from packing recipe yamls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add Qwen3.5-4B VLM neat packing recipe

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add qwen3_5 to packing patch modules and fix missing import

- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* remove LLM recipe from VLM data pipeline PR

This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update test imports after media helpers move to utils.py

Update test_datasets.py to import _read_video_frames and
_preload_media from vlm/utils.py instead of vlm/datasets.py.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for packing, utils, and collate changes

New test files:
- test_utils.py: _resolve_lmdb_image (cache, missing key, RGB),
  _build_video_metadata (empty, no video, preserved fields)
- test_packing.py: get_seqlens_in_batch, get_unpad_data,
  _passthrough_create_causal_mask (both HF signatures),
  get_attn_implementation (backend vs HF config),
  configure_packing (noop for sdpa, patches FA2 modules)

Extended test_collate_fns.py:
- make_robust_collate (success, retry, max_retries exhausted)
- neat_packed_vlm_collater attn_implementation variants
  (2D mask for FA2, 4D for sdpa, fixed max_length, pixel_values concat)

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and missing copyright headers

- ruff fix: remove unused imports (copy, BaseVideoProcessor, load_video,
  as_completed), unused variables (grid_idx, total_text_tokens,
  total_media_tokens), fix import ordering
- Add copyright headers to scripts/precompute_tokens.py and
  tests/test_meta_dataset_all.py

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format on all changed files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename test_utils.py to avoid pytest collection conflict

tests/unit_tests/datasets/test_utils.py already exists; having
test_utils.py in the vlm/ subdirectory causes a module name collision.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: configure cfg_ds.get defaults in build_dataloader tests

MagicMock().get() returns a truthy MagicMock by default, which
incorrectly triggers the pretokenize path. Configure side_effect
to return proper defaults for packing-related keys.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: make packing mask patch safe for non-packed forward passes

_passthrough_create_causal_mask now checks whether the attention mask
is actually a packed mask (4D or indexed with values > 1) before
returning it as-is. For normal 2D masks (standard training), it
delegates to the original HF create_causal_mask, preventing test
pollution where the monkey-patch breaks non-packed Qwen2 tests.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: passthrough causal mask for FA2 to avoid breaking validation

The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
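
Combining this with the previous commit, a hedged sketch of the final
patched behavior; it assumes attention_mask arrives as a keyword and simply
forwards everything else:

```python
def make_passthrough_create_causal_mask(orig_create_causal_mask,
                                        attn_impl: str):
    def patched(*args, attention_mask=None, **kwargs):
        packed = attention_mask is not None and (
            attention_mask.dim() == 4 or attention_mask.max().item() > 1
        )
        if packed or attn_impl == "flash_attention_2":
            # FA2 masks causally inside the kernel; packed masks must
            # reach the attention layer untouched.
            return attention_mask
        # Plain 2D masks on sdpa/eager still need HF's causal-mask logic.
        return orig_create_causal_mask(*args, attention_mask=attention_mask,
                                       **kwargs)

    return patched
```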

* fix: address code review feedback from claude[bot]

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: log actual processor type before falling back to default

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unused sum() variables flagged by ruff F841

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add validation and max_steps to VLM packing recipes

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enable checkpoint with safetensors in qwen3_vl_4b recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: enable Qwen3.5 dense CP support and fix FSDP mixed-dtype wrapping

PR #1631 added _restore_loaded_model_dtype which restores checkpoint
dtypes after loading. For Qwen3.5 dense, this puts A_log and norm back
to float32 while everything else is bfloat16, breaking FSDP2 which
requires uniform dtype per group.

Fix by adding Qwen3_5ParallelizationStrategy that:
- Moves float32 bare params (A_log) into a _fp32_params submodule so
  fully_shard_by_dtype can wrap them in a separate FSDP group
- When CP>1, swaps HF Qwen3_5GatedDeltaNet to CPAwareGatedDeltaNet
  (reusing the existing MoE CP implementation) and sets _cp_mesh
- Adds a decoder-layer pre-hook to pass position_ids to linear_attn
  (HF decoder layers don't forward it, but CP needs it)

Tested: CP=1 and CP=2 losses match (2.8484 vs 2.8484 step 0,
2.4787 vs 2.4766 step 1, val 2.9792 vs 2.9786).

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
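
A minimal sketch of the two techniques the commit names, the __class__ swap
and the decoder-layer pre-hook; cp_aware_cls stands in for
CPAwareGatedDeltaNet, and the attribute names are illustrative:

```python
import torch.nn as nn

def swap_to_cp_aware(layer: nn.Module, cp_aware_cls: type, cp_mesh) -> None:
    # Rebinding __class__ keeps the module's existing parameters and
    # buffers while routing forward() through the CP-aware implementation.
    layer.linear_attn.__class__ = cp_aware_cls
    layer.linear_attn._cp_mesh = cp_mesh

    def pre_hook(module, args, kwargs):
        # HF decoder layers do not pass position_ids down to linear_attn;
        # cache them on the submodule so the CP forward path can read them.
        module.linear_attn._position_ids = kwargs.get("position_ids")

    layer.register_forward_pre_hook(pre_hook, with_kwargs=True)
```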

* style: ruff format + add unit tests for Qwen3.5 CP/FSDP patching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address Claude review - test calls real patch_hf_model, clarify thread-safety comment

- Rewrote test_fp32_params_moved_to_holder to call the real patch_hf_model
  function instead of replicating its logic, by monkeypatching the
  transformers module stubs so isinstance() matches _FakeGatedDeltaNet.
- Clarified the thread-safety comment on the globals() swap in
  Qwen3_5ParallelizationStrategy.parallelize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address round-2 review - test_no_class_swap calls real patch_hf_model, add defensive assertion

- Updated test_no_class_swap_when_cp_disabled to call patch_hf_model
  with stubs instead of trivially asserting the fake type.
- Added defensive assertion that apply_fsdp2_sharding_recursively exists
  before the globals() swap in Qwen3_5ParallelizationStrategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add test_class_swap_when_cp_enabled for patch_hf_model cp_enabled=True path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for Qwen3_5ParallelizationStrategy.parallelize()

Add tests for the parallelize() method covering:
- patch_hf_model call and delegation to super()
- globals swap and restore of apply_fsdp2_sharding_recursively
- global restore on error (try/finally)
- CP mesh assignment when cp_enabled=True
- _fsdp_by_dtype ModuleList iteration with fully_shard_by_dtype

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
