
fix: Qwen3.5 dense CP support and FSDP mixed-dtype fix #1710

Merged
akoumpa merged 46 commits into main from huiyingl/qwen3_5_cp_aware_fsdp
Apr 8, 2026
Conversation

HuiyingLi (Contributor) commented Apr 7, 2026

Summary

  • FSDP mixed-dtype fix: PR #1631 (fix: tied embedding v4 to v5) added _restore_loaded_model_dtype, which restores A_log to float32 in Qwen3.5 dense and breaks FSDP2's uniform-dtype requirement. Fixed by moving bare float32 params into a _fp32_params submodule so fully_shard_by_dtype can wrap them separately (a sketch follows this list).
  • CP support for Qwen3.5 dense: when CP>1, swaps the HF Qwen3_5GatedDeltaNet to the existing CPAwareGatedDeltaNet (from the MoE codebase) via a __class__ swap, reusing the FLA-based CP implementation.
  • Position ID forwarding: HF decoder layers don't pass position_ids to linear_attn, but CP needs them. Added a decoder-layer pre-hook to cache and forward them (see the second sketch below).
  • Qwen3_5ParallelizationStrategy: registered for Qwen3_5ForCausalLM and Qwen3_5ForConditionalGeneration; uses fully_shard_by_dtype per layer.
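
A minimal sketch of the dtype-split idea, assuming hypothetical helper names (`_FP32ParamHolder`, `move_fp32_params_to_holder`); it only illustrates how bare float32 parameters such as `A_log` can be re-homed onto a child module so each FSDP group sees a uniform dtype:

```python
import torch
import torch.nn as nn

class _FP32ParamHolder(nn.Module):
    """Carries bare float32 parameters as their own submodule, so an FSDP
    group wrapping it sees a single uniform dtype."""

def move_fp32_params_to_holder(module: nn.Module) -> None:
    # Collect direct (non-recursive) float32 parameters, e.g. A_log.
    fp32 = {name: p for name, p in module.named_parameters(recurse=False)
            if p.dtype == torch.float32}
    if not fp32:
        return
    holder = _FP32ParamHolder()
    module._fp32_params = holder              # registered as a submodule
    for name, param in fp32.items():
        delattr(module, name)                 # drop from module._parameters
        holder.register_parameter(name, param)
        # Plain attribute alias so forward code reading `module.A_log`
        # still sees the same tensor (bypasses nn.Module bookkeeping).
        object.__setattr__(module, name, param)
```

And a sketch of the CP enablement, assuming `CPAwareGatedDeltaNet` is importable and parameter-compatible with `Qwen3_5GatedDeltaNet` (the `__class__` swap reuses the module's existing weights):

```python
def patch_gated_delta_net_for_cp(model, cp_mesh):
    # Assumption: CPAwareGatedDeltaNet is the existing MoE CP implementation.
    for layer in model.model.layers:
        attn = getattr(layer, "linear_attn", None)
        if attn is None:
            continue
        attn.__class__ = CPAwareGatedDeltaNet  # keep weights, swap forward
        attn._cp_mesh = cp_mesh

        # HF decoder layers don't forward position_ids to linear_attn,
        # but the CP forward needs them: cache them with a pre-hook.
        def cache_position_ids(mod, args, kwargs, _attn=attn):
            if "position_ids" in kwargs:
                _attn._position_ids = kwargs["position_ids"]

        layer.register_forward_pre_hook(cache_position_ids, with_kwargs=True)
```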

Test plan

  • ruff check + format clean

🤖 Generated with Claude Code

zhiqil and others added 30 commits March 7, 2026 01:37
akoumpa added the r0.4.0 (Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.) label Apr 8, 2026
akoumpa merged commit 6ba4074 into main Apr 8, 2026
54 checks passed
akoumpa deleted the huiyingl/qwen3_5_cp_aware_fsdp branch April 8, 2026 21:08
svcnvidia-nemo-ci pushed a commit that referenced this pull request Apr 8, 2026
* feat: add neat packing (greedy knapsack) for LLM and VLM datasets

Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
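
Illustratively, the min-heap variant of first-fit-decreasing can be sketched like this (a heap keyed on used tokens keeps the emptiest open bin on top; this is a sketch of the idea, not the repo's code):

```python
import heapq

def pack_sequences(lengths, pack_size):
    """Place items longest-first into the emptiest bin that still fits,
    opening a new bin otherwise. Assumes every length <= pack_size."""
    order = sorted(range(len(lengths)), key=lengths.__getitem__, reverse=True)
    heap, bins = [], []  # heap holds (used_tokens, bin_id)
    for i in order:
        if heap and heap[0][0] + lengths[i] <= pack_size:
            used, b = heapq.heappop(heap)
            bins[b].append(i)
            heapq.heappush(heap, (used + lengths[i], b))
        else:
            bins.append([i])
            heapq.heappush(heap, (lengths[i], len(bins) - 1))
    return bins  # lists of sample indices, one list per pack
```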

* feat: add LengthGroupedSampler for token-aware distributed sampling

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
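
The bucketing idea, as a rough sketch (the `group_size` parameter here is hypothetical, not the sampler's actual API):

```python
import random

def length_grouped_order(lengths, group_size, seed=0):
    """Sort indices by estimated token length, then shuffle within
    fixed-size groups: batches stay length-homogeneous but not ordered."""
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        rng.shuffle(group)
        order[start:start + group_size] = group
    return order
```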

* feat: integrate neat packing strategy into LLM finetune recipe

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* chore: remove benchmark scripts not needed for this PR

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: lint errors and broken sampler tests

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: fix all ruff lint errors across changed files

Sort imports, remove unused imports/variables, fix f-strings
without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: run ruff format on all changed source and test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: add missing copyright headers to test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add meta-dataset loading system with ShareGPT format support

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
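
For reference, ShareGPT-style conversion is a small transform; a sketch assuming the common `{"conversations": [{"from": ..., "value": ...}]}` layout:

```python
_ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(sample):
    """Convert ShareGPT turns to chat-template {"role", "content"} dicts."""
    return [
        {"role": _ROLE_MAP.get(t["from"], t["from"]), "content": t["value"]}
        for t in sample["conversations"]
    ]
```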

* feat: add RobustDatasetWrapper with retry and fake image injection

RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
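
The retry behaviour can be sketched as a thin dataset wrapper (names hypothetical; the real class also preloads media and injects fake images):

```python
import random
from torch.utils.data import Dataset

class RetryingDataset(Dataset):
    """On a bad sample, retry with a random other index rather than
    crashing the DataLoader worker."""
    def __init__(self, base, max_retries=5):
        self.base, self.max_retries = base, max_retries

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        for _ in range(self.max_retries):
            try:
                return self.base[idx]
            except Exception:
                idx = random.randrange(len(self.base))  # pick a substitute
        raise RuntimeError("too many consecutive bad samples")
```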

* enhance: refactor label building with template-based approach

Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
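
Token-ID-level scanning is sketched below, assuming the assistant start/end marker token sequences are known (the real `build_labels_from_template` signature may differ):

```python
IGNORE_INDEX = -100

def build_labels_from_template(input_ids, assistant_start, assistant_end):
    """Labels are IGNORE_INDEX everywhere except inside assistant turns,
    found by scanning for the marker token-ID sequences."""
    labels = [IGNORE_INDEX] * len(input_ids)
    i, n = 0, len(input_ids)
    while i < n:
        if input_ids[i:i + len(assistant_start)] == assistant_start:
            j = i + len(assistant_start)
            while j < n and input_ids[j:j + len(assistant_end)] != assistant_end:
                labels[j] = input_ids[j]  # train on assistant tokens only
                j += 1
            i = j
        else:
            i += 1
    return labels
```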

* refactor: simplify video timestamp handling with VideoMetadata

Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add precompute_tokens script for offline tokenization

Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: wire up configure_packing and attn-aware collaters for neat packing

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
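
The two mask layouts can be sketched from per-document lengths inside one pack (a sketch under assumed conventions: document indices start at 1 so values > 1 mark a packed batch; the SDPA mask uses True = attend):

```python
import torch

def packed_masks(doc_lens, attn_implementation):
    total = sum(doc_lens)
    if attn_implementation == "flash_attention_2":
        # 2D indexed mask: one document index per token; the FA2 path can
        # recover per-document cu_seqlens from it.
        mask = torch.zeros(total, dtype=torch.long)
        pos = 0
        for doc_id, n in enumerate(doc_lens, start=1):
            mask[pos:pos + n] = doc_id
            pos += n
        return mask.unsqueeze(0)                      # (1, seq)
    # 4D bool block-causal mask for SDPA: causal AND same-document.
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    doc_ids = torch.repeat_interleave(
        torch.arange(len(doc_lens)), torch.tensor(doc_lens))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return (causal & same_doc).unsqueeze(0).unsqueeze(0)  # (1, 1, seq, seq)
```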

* refactor: move VLM packing config to top-level packed_sequence section

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* Switch VLM neat packing example to MedPix-VQA dataset with 8k seqlen

Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: deduplicate robust_collate into make_robust_collate

Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move media I/O helpers from datasets.py to utils.py

Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: move random import to module level, allow pretokenize without packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: remove verbose comments from packing recipe yamls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add Qwen3.5-4B VLM neat packing recipe

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add qwen3_5 to packing patch modules and fix missing import

- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
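
One tolerant way to absorb the keyword rename (a sketch, not the actual patch; the rest of the HF signature is elided into **kwargs):

```python
def _passthrough_create_causal_mask(config=None, input_embeds=None,
                                    inputs_embeds=None, attention_mask=None,
                                    **kwargs):
    """Accept both the pre-5.3.0 `input_embeds` and the 5.3.0 `inputs_embeds`
    spellings and return the collater-built mask untouched."""
    return attention_mask
```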

* remove LLM recipe from VLM data pipeline PR

This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update test imports after media helpers move to utils.py

Update test_datasets.py to import _read_video_frames and
_preload_media from vlm/utils.py instead of vlm/datasets.py.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for packing, utils, and collate changes

New test files:
- test_utils.py: _resolve_lmdb_image (cache, missing key, RGB),
  _build_video_metadata (empty, no video, preserved fields)
- test_packing.py: get_seqlens_in_batch, get_unpad_data,
  _passthrough_create_causal_mask (both HF signatures),
  get_attn_implementation (backend vs HF config),
  configure_packing (noop for sdpa, patches FA2 modules)

Extended test_collate_fns.py:
- make_robust_collate (success, retry, max_retries exhausted)
- neat_packed_vlm_collater attn_implementation variants
  (2D mask for FA2, 4D for sdpa, fixed max_length, pixel_values concat)

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and missing copyright headers

- ruff fix: remove unused imports (copy, BaseVideoProcessor, load_video,
  as_completed), unused variables (grid_idx, total_text_tokens,
  total_media_tokens), fix import ordering
- Add copyright headers to scripts/precompute_tokens.py and
  tests/test_meta_dataset_all.py

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format on all changed files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename test_utils.py to avoid pytest collection conflict

tests/unit_tests/datasets/test_utils.py already exists; having
test_utils.py in the vlm/ subdirectory causes a module name collision.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: configure cfg_ds.get defaults in build_dataloader tests

MagicMock().get() returns a truthy MagicMock by default, which
incorrectly triggers the pretokenize path. Configure side_effect
to return proper defaults for packing-related keys.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
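
The MagicMock pitfall and its fix, sketched (the key names are assumptions):

```python
from unittest.mock import MagicMock

cfg_ds = MagicMock()
# MagicMock().get(...) returns a truthy MagicMock by default, which would
# send the dataloader down the pretokenize path; pin the inspected keys.
_defaults = {"pretokenize": False, "pack_size": None, "max_length": None}
cfg_ds.get.side_effect = lambda key, default=None: _defaults.get(key, default)
```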

* fix: make packing mask patch safe for non-packed forward passes

_passthrough_create_causal_mask now checks whether the attention mask
is actually a packed mask (4D or indexed with values > 1) before
returning it as-is. For normal 2D masks (standard training), it
delegates to the original HF create_causal_mask, preventing test
pollution where the monkey-patch breaks non-packed Qwen2 tests.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: passthrough causal mask for FA2 to avoid breaking validation

The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
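
Combining the last two fixes, the patched mask builder roughly behaves as below (a sketch; the module-level placeholders stand in for state that configure_packing would set):

```python
_ATTN_IMPLEMENTATION = "flash_attention_2"  # assumed: set by configure_packing
_orig_create_causal_mask = None             # assumed: the saved HF function

def _patched_create_causal_mask(*args, attention_mask=None, **kwargs):
    """Packed masks (4D, or indexed with values > 1) and all FA2 calls pass
    through; non-packed sdpa/eager calls delegate to the HF builder."""
    is_packed = attention_mask is not None and (
        attention_mask.dim() == 4 or int(attention_mask.max()) > 1)
    if is_packed or _ATTN_IMPLEMENTATION == "flash_attention_2":
        return attention_mask
    return _orig_create_causal_mask(*args, attention_mask=attention_mask, **kwargs)
```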

* fix: address code review feedback from claude[bot]

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
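
The kept-indices contract, sketched (the real helper's signature may differ):

```python
def _drop_overlong_samples(conversations, lengths, max_length):
    """Filter overlong samples and also return the kept indices so any
    parallel list (examples, media) can be filtered in sync."""
    kept = [i for i, n in enumerate(lengths) if n <= max_length]
    return [conversations[i] for i in kept], kept
```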

* fix: log actual processor type before falling back to default

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unused sum() variables flagged by ruff F841

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add validation and max_steps to VLM packing recipes

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enable checkpoint with safetensors in qwen3_vl_4b recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: enable Qwen3.5 dense CP support and fix FSDP mixed-dtype wrapping

PR #1631 added _restore_loaded_model_dtype which restores checkpoint
dtypes after loading. For Qwen3.5 dense, this puts A_log and norm back
to float32 while everything else is bfloat16, breaking FSDP2 which
requires uniform dtype per group.

Fix by adding Qwen3_5ParallelizationStrategy that:
- Moves float32 bare params (A_log) into a _fp32_params submodule so
  fully_shard_by_dtype can wrap them in a separate FSDP group
- When CP>1, swaps HF Qwen3_5GatedDeltaNet to CPAwareGatedDeltaNet
  (reusing the existing MoE CP implementation) and sets _cp_mesh
- Adds a decoder-layer pre-hook to pass position_ids to linear_attn
  (HF decoder layers don't forward it, but CP needs it)

Tested: CP=1 and CP=2 losses match (2.8484 vs 2.8484 step 0,
2.4787 vs 2.4766 step 1, val 2.9792 vs 2.9786).

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format + add unit tests for Qwen3.5 CP/FSDP patching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address Claude review - test calls real patch_hf_model, clarify thread-safety comment

- Rewrote test_fp32_params_moved_to_holder to call the real patch_hf_model
  function instead of replicating its logic, by monkeypatching the
  transformers module stubs so isinstance() matches _FakeGatedDeltaNet.
- Clarified the thread-safety comment on the globals() swap in
  Qwen3_5ParallelizationStrategy.parallelize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address round-2 review - test_no_class_swap calls real patch_hf_model, add defensive assertion

- Updated test_no_class_swap_when_cp_disabled to call patch_hf_model
  with stubs instead of trivially asserting the fake type.
- Added defensive assertion that apply_fsdp2_sharding_recursively exists
  before the globals() swap in Qwen3_5ParallelizationStrategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add test_class_swap_when_cp_enabled for patch_hf_model cp_enabled=True path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for Qwen3_5ParallelizationStrategy.parallelize()

Add tests for the parallelize() method covering:
- patch_hf_model call and delegation to super()
- globals swap and restore of apply_fsdp2_sharding_recursively
- global restore on error (try/finally)
- CP mesh assignment when cp_enabled=True
- _fsdp_by_dtype ModuleList iteration with fully_shard_by_dtype

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
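
The swap-and-restore pattern these tests cover, in a self-contained sketch (a SimpleNamespace stands in for the module globals that the real code swaps):

```python
import types

helpers = types.SimpleNamespace(
    apply_fsdp2_sharding_recursively=lambda model: ("default", model))

def fully_shard_by_dtype(model):
    return ("by-dtype", model)

def parallelize(model):
    """Temporarily swap the sharding helper the base class will call,
    restoring it even if parallelization raises (try/finally)."""
    assert hasattr(helpers, "apply_fsdp2_sharding_recursively")  # defensive
    original = helpers.apply_fsdp2_sharding_recursively
    helpers.apply_fsdp2_sharding_recursively = fully_shard_by_dtype
    try:
        return helpers.apply_fsdp2_sharding_recursively(model)
    finally:
        helpers.apply_fsdp2_sharding_recursively = original
```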

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
akoumpa pushed a commit that referenced this pull request Apr 9, 2026
fix: Qwen3.5 dense CP support and FSDP mixed-dtype fix (#1710)

akoumpa pushed a commit that referenced this pull request Apr 10, 2026
* feat: add neat packing (greedy knapsack) for LLM and VLM datasets

Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add LengthGroupedSampler for token-aware distributed sampling

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: integrate neat packing strategy into LLM finetune recipe

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* chore: remove benchmark scripts not needed for this PR

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: lint errors and broken sampler tests

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: fix all ruff lint errors across changed files

Sort imports, remove unused imports/variables, fix f-strings
without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: run ruff format on all changed source and test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: add missing copyright headers to test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add meta-dataset loading system with ShareGPT format support

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add RobustDatasetWrapper with retry and fake image injection

RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* enhance: refactor label building with template-based approach

Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: simplify video timestamp handling with VideoMetadata

Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add precompute_tokens script for offline tokenization

Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: wire up configure_packing and attn-aware collaters for neat packing

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move VLM packing config to top-level packed_sequence section

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* Switch VLM neat packing example to MedPix-VQA dataset with 8k seqlen

Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: deduplicate robust_collate into make_robust_collate

Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move media I/O helpers from datasets.py to utils.py

Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: move random import to module level, allow pretokenize without packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: remove verbose comments from packing recipe yamls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add Qwen3.5-4B VLM neat packing recipe

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add qwen3_5 to packing patch modules and fix missing import

- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* remove LLM recipe from VLM data pipeline PR

This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update test imports after media helpers move to utils.py

Update test_datasets.py to import _read_video_frames and
_preload_media from vlm/utils.py instead of vlm/datasets.py.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for packing, utils, and collate changes

New test files:
- test_utils.py: _resolve_lmdb_image (cache, missing key, RGB),
  _build_video_metadata (empty, no video, preserved fields)
- test_packing.py: get_seqlens_in_batch, get_unpad_data,
  _passthrough_create_causal_mask (both HF signatures),
  get_attn_implementation (backend vs HF config),
  configure_packing (noop for sdpa, patches FA2 modules)

Extended test_collate_fns.py:
- make_robust_collate (success, retry, max_retries exhausted)
- neat_packed_vlm_collater attn_implementation variants
  (2D mask for FA2, 4D for sdpa, fixed max_length, pixel_values concat)

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and missing copyright headers

- ruff fix: remove unused imports (copy, BaseVideoProcessor, load_video,
  as_completed), unused variables (grid_idx, total_text_tokens,
  total_media_tokens), fix import ordering
- Add copyright headers to scripts/precompute_tokens.py and
  tests/test_meta_dataset_all.py

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format on all changed files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename test_utils.py to avoid pytest collection conflict

tests/unit_tests/datasets/test_utils.py already exists; having
test_utils.py in the vlm/ subdirectory causes a module name collision.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: configure cfg_ds.get defaults in build_dataloader tests

MagicMock().get() returns a truthy MagicMock by default, which
incorrectly triggers the pretokenize path. Configure side_effect
to return proper defaults for packing-related keys.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: make packing mask patch safe for non-packed forward passes

_passthrough_create_causal_mask now checks whether the attention mask
is actually a packed mask (4D or indexed with values > 1) before
returning it as-is. For normal 2D masks (standard training), it
delegates to the original HF create_causal_mask, preventing test
pollution where the monkey-patch breaks non-packed Qwen2 tests.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: passthrough causal mask for FA2 to avoid breaking validation

The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address code review feedback from claude[bot]

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: log actual processor type before falling back to default

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unused sum() variables flagged by ruff F841

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add validation and max_steps to VLM packing recipes

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enable checkpoint with safetensors in qwen3_vl_4b recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: enable Qwen3.5 dense CP support and fix FSDP mixed-dtype wrapping

PR #1631 added _restore_loaded_model_dtype which restores checkpoint
dtypes after loading. For Qwen3.5 dense, this puts A_log and norm back
to float32 while everything else is bfloat16, breaking FSDP2 which
requires uniform dtype per group.

Fix by adding Qwen3_5ParallelizationStrategy that:
- Moves float32 bare params (A_log) into a _fp32_params submodule so
  fully_shard_by_dtype can wrap them in a separate FSDP group
- When CP>1, swaps HF Qwen3_5GatedDeltaNet to CPAwareGatedDeltaNet
  (reusing the existing MoE CP implementation) and sets _cp_mesh
- Adds a decoder-layer pre-hook to pass position_ids to linear_attn
  (HF decoder layers don't forward it, but CP needs it)

Tested: CP=1 and CP=2 losses match (2.8484 vs 2.8484 step 0,
2.4787 vs 2.4766 step 1, val 2.9792 vs 2.9786).

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
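
A minimal sketch of the two mechanical pieces, with hypothetical helper names (the real strategy lives in the Qwen3.5 parallelizer):

```python
import torch
import torch.nn as nn


class _Fp32ParamHolder(nn.Module):
    """Holds the bare float32 params (e.g. A_log) so fully_shard_by_dtype
    can wrap them in their own uniform-dtype FSDP group."""

    def __init__(self, params):
        super().__init__()
        for name, p in params.items():
            self.register_parameter(name, p)


def move_fp32_params(linear_attn: nn.Module) -> None:
    fp32 = {
        name: p
        for name, p in list(linear_attn.named_parameters(recurse=False))
        if p.dtype == torch.float32
    }
    for name in fp32:
        del linear_attn._parameters[name]  # unregister from the bf16 module
    linear_attn._fp32_params = _Fp32ParamHolder(fp32)


def position_ids_prehook(decoder_layer, args, kwargs):
    # HF decoder layers don't pass position_ids down to linear_attn; cache
    # them on the attention module so the CP path can read them.
    decoder_layer.linear_attn._position_ids = kwargs.get("position_ids")
    return args, kwargs

# for layer in model.model.layers:
#     layer.register_forward_pre_hook(position_ids_prehook, with_kwargs=True)
```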

* style: ruff format + add unit tests for Qwen3.5 CP/FSDP patching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address Claude review - test calls real patch_hf_model, clarify thread-safety comment

- Rewrote test_fp32_params_moved_to_holder to call the real patch_hf_model
  function instead of replicating its logic, by monkeypatching the
  transformers module stubs so isinstance() matches _FakeGatedDeltaNet.
- Clarified the thread-safety comment on the globals() swap in
  Qwen3_5ParallelizationStrategy.parallelize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address round-2 review - test_no_class_swap calls real patch_hf_model, add defensive assertion

- Updated test_no_class_swap_when_cp_disabled to call patch_hf_model
  with stubs instead of trivially asserting the fake type.
- Added defensive assertion that apply_fsdp2_sharding_recursively exists
  before the globals() swap in Qwen3_5ParallelizationStrategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
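
A sketch of the guarded swap; the stub base class and helpers stand in for the real ones:

```python
class _BaseStrategy:
    def parallelize(self, model):
        # The base path calls apply_fsdp2_sharding_recursively internally.
        return apply_fsdp2_sharding_recursively(model)


def apply_fsdp2_sharding_recursively(model):  # stand-in for the real helper
    return model


def fully_shard_by_dtype(model):  # stand-in dtype-aware variant
    return model


class Qwen3_5ParallelizationStrategy(_BaseStrategy):
    def parallelize(self, model):
        g = globals()
        # Defensive assertion: fail loudly if the patched symbol disappears.
        assert "apply_fsdp2_sharding_recursively" in g
        original = g["apply_fsdp2_sharding_recursively"]
        g["apply_fsdp2_sharding_recursively"] = fully_shard_by_dtype
        try:
            return super().parallelize(model)
        finally:
            # Restore even on error; the swap is process-global for its
            # duration, hence the thread-safety caveat in the comment.
            g["apply_fsdp2_sharding_recursively"] = original
```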

* test: add test_class_swap_when_cp_enabled for patch_hf_model cp_enabled=True path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for Qwen3_5ParallelizationStrategy.parallelize()

Add tests for the parallelize() method covering:
- patch_hf_model call and delegation to super()
- globals swap and restore of apply_fsdp2_sharding_recursively
- global restore on error (try/finally)
- CP mesh assignment when cp_enabled=True
- _fsdp_by_dtype ModuleList iteration with fully_shard_by_dtype

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HuiyingLi added a commit that referenced this pull request Apr 13, 2026
PR #1711 changed _should_load_before_shard to return False for multi-GPU
DP, so models stay on meta device through FSDP wrapping. This broke the
__dict__ trick in PR #1710's patch_hf_model.

Move the gate computation into _Fp32ParamHolder.forward() so FSDP's
unshard/reshard lifecycle fires naturally. Override CPAwareGatedDeltaNet
forward for both CP and non-CP paths to route through the holder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
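
A sketch of the structural move, extending the holder idea above; the gate formula is illustrative only, the real math is the upstream GatedDeltaNet gate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class _Fp32ParamHolder(nn.Module):
    """Owning the gate math means FSDP2's pre-forward unshard hook fires on
    this submodule, materializing the fp32 params even when the outer model
    was built on the meta device."""

    def __init__(self, a_log: nn.Parameter, dt_bias: nn.Parameter):
        super().__init__()
        self.A_log = a_log
        self.dt_bias = dt_bias

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # Gate computed in float32 for stability, then cast back to the
        # activation dtype (formula shown for illustration only).
        g = -self.A_log.float().exp() * F.softplus(a.float() + self.dt_bias)
        return g.to(a.dtype)
```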
HuiyingLi added a commit that referenced this pull request Apr 15, 2026
…1813)

* fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params

PR #1711 changed _should_load_before_shard to return False for multi-GPU
DP, so models stay on meta device through FSDP wrapping. This broke the
__dict__ trick in PR #1710's patch_hf_model.

Move the gate computation into _Fp32ParamHolder.forward() so FSDP's
unshard/reshard lifecycle fires naturally. Override CPAwareGatedDeltaNet
forward for both CP and non-CP paths to route through the holder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* chore: remove test yaml not intended for PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add sentinel to prevent __getattr__ re-wrapping

Address Claude review: guard against re-wrapping __getattr__ on
repeated patch_hf_model calls by checking a class-level sentinel
attribute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
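
A sketch of the sentinel guard (names are hypothetical; it assumes the class defines or inherits __getattr__, as nn.Module subclasses do):

```python
_WRAP_SENTINEL = "_fp32_getattr_wrapped"


def wrap_getattr_once(cls):
    """Wrap cls.__getattr__ at most once, even if patch_hf_model runs again."""
    if getattr(cls, _WRAP_SENTINEL, False):
        return  # already wrapped; wrapping again would chain wrappers

    original = cls.__getattr__

    def wrapped(self, name):
        # ...first try to resolve fp32 params through the holder (elided)...
        return original(self, name)

    cls.__getattr__ = wrapped
    setattr(cls, _WRAP_SENTINEL, True)
```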

* fix: add upstream version comment to _forward_no_cp

Address Claude review: note the transformers version the forward was
copied from to ease future upstream diffing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update MoE test expectations for _forward_no_cp path

TestForwardFastPath tests expected super().forward() to be called,
but the non-CP path now uses _forward_no_cp(). Update mocks to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for _Fp32ParamHolder, _compute_gate, and sentinel guard

Add unit tests for:
- _Fp32ParamHolder.forward gate computation and dtype preservation
- _compute_gate routing through holder vs inline fallback
- patch_hf_model sentinel preventing __getattr__ re-wrapping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for _forward_no_cp and forward() dispatch paths

Add 14 new tests covering the critical _forward_no_cp method (lines
91-193) and forward() dispatch logic (lines 207-213) to satisfy
codecov/patch requirements for PR #1813:

- _forward_no_cp basic forward, cache_params=None, causal_conv1d_fn
  fallback, causal_conv1d_fn set, attention_mask, GQA repeat-interleave,
  _compute_gate delegation, and output dtype
- forward() dispatch when _cp_mesh is None or size <= 1, parameter
  pass-through, and extra CP kwargs
- _make_fp32_getattr fallback to AttributeError and real attr resolution

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HuiyingLi added a commit that referenced this pull request Apr 16, 2026
…1813)

akoumpa pushed a commit that referenced this pull request Apr 16, 2026
…params (#1869)

* fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params (#1813)

* fix: update MoE test_no_cp_does_not_forward_cache_position to use _forward_no_cp

The fast path in CPAwareGatedDeltaNet.forward was refactored to call
self._forward_no_cp() instead of super().forward(), but this test still
mocked the base-class forward, so the mock was called 0 times. Update the
mock target to match the new dispatch, and apply ruff format to the
two test files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
edjson pushed a commit to edjson/Automodel that referenced this pull request Apr 17, 2026
edjson pushed a commit to edjson/Automodel that referenced this pull request Apr 18, 2026
HuiyingLi added a commit that referenced this pull request Apr 20, 2026
…1813)

akoumpa added a commit that referenced this pull request Apr 21, 2026
…er knob (#1859)

* add qwen3_5

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params (#1813)

* fix: Qwen3.5 VLM pipeline parallelism support

- parallelizer.py: handle nn.ModuleDict in _fsdp_by_dtype, walk attributes
  safely in _extract_model_layers, and use a string key for
  Qwen3_5ForConditionalGeneration
- hf_utils.py: route the pipeline forward through get_text_module so nested
  VLM text modules (model.language_model.{embed_tokens,layers,norm}) work
- finetune.py: call update_seq_len per batch to precompute pipeline stage
  shapes analytically (needed for GatedDeltaNet and VLM variable-length
  batches)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: Qwen3.5 VLM TP plan + per-microbatch grad reduce-scatter knob

Qwen3.5 VLM TP support:
- Register Qwen3_5ForConditionalGeneration in PARALLELIZE_FUNCTIONS; plan
  delegates to get_hf_tp_shard_plan which reads transformers'
  base_model_tp_plan with prefix model.language_model. GatedDeltaNet layers
  stay un-sharded (no stock TP plan exists for linear_attn).
- Extend get_hf_tp_shard_plan dispatch to handle Qwen3.5 VLM nesting.
- Treat transformers' "replicated_with_grad_allreduce" style as a no-op
  (skip it in the plan); under FSDP+TP, the TP-mesh replication already
  behaves correctly for norm weights.

Per-microbatch grad reduce-scatter (PipelineConfig.reduce_grad_per_microbatch):
- When True, FSDP reduce-scatters grads every microbatch instead of
  accumulating full-size grads under no_sync until the last one. Keeps
  grads sharded across DP, saving roughly 2 bytes per trainable parameter
  per rank (~27 GB for a 13B-trainable-param stage in bf16). Defaults to False.
- Plumbed through PipelineConfig -> AutoPipeline -> pipeline_model; patches
  stages via types.MethodType and stores the flag on each stage; the
  patched backward_maybe_with_nosync branches on stage._reduce_grad_per_microbatch.
- kd.py teacher pipeline config propagates the field.

Validated locally on 8 GPUs (pp=2, dp=4, lbs=4): peak memory drops from
66 GB (OOM at step 1 on 80 GB GPUs) to 32-41 GB across 3 steps with the
knob enabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
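
A rough sketch of the mechanism via FSDP2's set_requires_gradient_sync; the
repo instead patches backward_maybe_with_nosync through types.MethodType, so
this helper is illustrative only:

```python
import torch

def backward_with_grad_sync_knob(stage_module, loss: torch.Tensor,
                                 is_last_microbatch: bool,
                                 reduce_grad_per_microbatch: bool) -> None:
    # stage_module is assumed to be a fully_shard-wrapped FSDP2 module.
    if reduce_grad_per_microbatch:
        # Reduce-scatter after every microbatch: grads stay sharded across
        # DP ranks, at the cost of extra communication per microbatch.
        stage_module.set_requires_gradient_sync(True)
    else:
        # Default accumulation: skip gradient sync (no_sync) until the last
        # microbatch, holding full-size unsharded grads on each rank.
        stage_module.set_requires_gradient_sync(is_last_microbatch)
    loss.backward()
```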

* feat: add Qwen3.5-27B VLM TP4+PP4 recipe

2-node (16 GPUs) tp=4, pp=4, dp=1 config. Uses the new
reduce_grad_per_microbatch knob to keep grads sharded across microbatches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* chore: Qwen3.5-27B tp4pp4 recipe — HF model id + wandb stub

- Use Qwen/Qwen3.5-27B instead of a local checkpoint path
- Add commented-out wandb section so users know how to enable it

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: drop redundant Qwen3.5 entries

- Remove duplicate self.pp.update_seq_len call in vlm/finetune.py (line
  940 already covers it every batch; update_seq_len short-circuits when
  seq_len is unchanged).
- Drop string-keyed Qwen3_5ForConditionalGeneration entry from
  VLM_MODEL_CLS_TO_LAYERS; the class-keyed entry is sufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: reuse defer_fsdp_grad_sync for PP; restore Qwen3.5 string fallback

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: cover Qwen3.5 VLM TP plan + grad-sync knob additions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: drop unused functools.reduce import

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix(pipelining): get_text_module skips non-Module attrs; stub test mocks

The helper previously returned any attr named 'language_model' /
'text_model' / 'text_decoder' — including auto-generated unittest
Mocks — which broke pipeline_forward tests that passed a plain Mock
model. The helper now descends only into real nn.Module instances.

Also explicitly set embed_tokens / layers / norm to None on the
mocked text module in the two get_text_module rotary tests so the
now-routed pipeline_forward skips those branches cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
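
A hedged sketch of the fixed descent rule (the repo's helper may differ in
details):

```python
import torch.nn as nn

def get_text_module(model: nn.Module) -> nn.Module:
    # Descend only into attrs that are real nn.Module instances, so unittest
    # Mocks (which auto-create any attribute) are never mistaken for a
    # nested text decoder.
    for name in ("language_model", "text_model", "text_decoder"):
        child = getattr(model, name, None)
        if isinstance(child, nn.Module):
            return get_text_module(child)
    return model
```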

* fix: lazy-load transformers.models.qwen3_5 to unblock test stubs

Eagerly importing Qwen3_5ForConditionalGeneration at module load was
pre-loading transformers.models.qwen3_5 into sys.modules, defeating
test_cp_linear_attn_patch.py's module stubbing. Switch to string-based
class qualname lookup + __name__ comparison instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
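
A sketch of the lazy, string-based check (the constant name is assumed):

```python
# Comparing type(model).__name__ avoids importing
# transformers.models.qwen3_5 at module load time, so sys.modules stubs
# installed by tests keep working.
_QWEN3_5_VLM_NAME = "Qwen3_5ForConditionalGeneration"

def _is_qwen3_5_vlm(model) -> bool:
    return type(model).__name__ == _QWEN3_5_VLM_NAME
```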

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
* feat: add neat packing (greedy knapsack) for LLM and VLM datasets

Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
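
For reference, a self-contained sketch of heap-based greedy packing, where
each bin is one packed sequence of capacity pack_size; this is a generic
worst-fit-flavored variant of the greedy knapsack, not the repo's exact
implementation:

```python
import heapq

def greedy_pack(lengths: list[int], pack_size: int) -> list[list[int]]:
    # Longest-first ordering (the "decreasing" in first-fit-decreasing).
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    heap: list[tuple[int, int]] = []  # (tokens_used, bin_id), min-heap
    bins: list[list[int]] = []
    for i in order:
        if heap and heap[0][0] + lengths[i] <= pack_size:
            used, b = heapq.heappop(heap)  # least-full open bin that fits
            bins[b].append(i)
            heapq.heappush(heap, (used + lengths[i], b))
        else:
            bins.append([i])  # nothing fits: open a new bin
            heapq.heappush(heap, (lengths[i], len(bins) - 1))
    return bins  # each inner list holds sample indices packed together
```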

* feat: add LengthGroupedSampler for token-aware distributed sampling

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
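
A hypothetical sketch of the sort-then-bucket-shuffle idea; the bucket
parameter here is illustrative, not part of the sampler's actual API:

```python
import random

def length_grouped_indices(lengths: list[int], bucket: int,
                           seed: int = 0) -> list[int]:
    # Sort by estimated token length, then shuffle within fixed-size
    # buckets: batches draw similar lengths, yet keep some randomness.
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    for start in range(0, len(order), bucket):
        chunk = order[start:start + bucket]
        rng.shuffle(chunk)
        order[start:start + bucket] = chunk
    return order
```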

* feat: integrate neat packing strategy into LLM finetune recipe

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove benchmark scripts not needed for this PR

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and broken sampler tests

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix all ruff lint errors across changed files

Sort imports, remove unused imports/variables, fix f-strings
without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: run ruff format on all changed source and test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: add missing copyright headers to test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add meta-dataset loading system with ShareGPT format support

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
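
A hedged sketch of the ShareGPT conversion step, using the commonly seen
field names (conversations/from/value); the repo's loader may use different
keys or role mappings:

```python
_ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(sample: dict) -> list[dict]:
    # Map ShareGPT turns to the chat-template message format.
    return [
        {"role": _ROLE_MAP.get(turn["from"], turn["from"]),
         "content": turn["value"]}
        for turn in sample["conversations"]
    ]
```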

* feat: add RobustDatasetWrapper with retry and fake image injection

RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* enhance: refactor label building with template-based approach

Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
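
A minimal sketch of token-ID-level label scanning, assuming placeholder
template token sequences; the real build_labels_from_template differs in
specifics:

```python
IGNORE_INDEX = -100  # conventional ignore label for cross-entropy

def build_labels_sketch(input_ids: list[int], asst_prefix: list[int],
                        turn_end: list[int]) -> list[int]:
    # Unmask only tokens inside assistant turns, found by exact token-ID
    # matching against the chat template (no BPE-sensitive string search).
    labels = [IGNORE_INDEX] * len(input_ids)
    i, n = 0, len(input_ids)
    while i < n:
        if input_ids[i:i + len(asst_prefix)] == asst_prefix:
            start = i + len(asst_prefix)
            j = start
            while j < n and input_ids[j:j + len(turn_end)] != turn_end:
                j += 1
            end = min(j + len(turn_end), n)
            labels[start:end] = input_ids[start:end]
            i = end
        else:
            i += 1
    return labels
```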

* refactor: simplify video timestamp handling with VideoMetadata

Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add precompute_tokens script for offline tokenization

Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: wire up configure_packing and attn-aware collaters for neat packing

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
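
For reference, a sketch of the two mask layouts described above; shapes and
conventions are assumed, not taken from the repo:

```python
import torch

def indexed_doc_mask(doc_lens: list[int], pack_size: int) -> torch.Tensor:
    # 2D "indexed" mask for flash attention: 0 = padding, 1..N = document
    # id, from which per-document cu_seqlens can be recovered.
    mask = torch.zeros(pack_size, dtype=torch.long)
    pos = 0
    for doc_id, n in enumerate(doc_lens, start=1):
        mask[pos:pos + n] = doc_id
        pos += n
    return mask

def block_causal_from_indexed(indexed: torch.Tensor) -> torch.Tensor:
    # 4D boolean block-causal mask for SDPA: attend only within the same
    # document, causally, and never to padding.
    same_doc = indexed.unsqueeze(0) == indexed.unsqueeze(1)
    causal = torch.tril(torch.ones_like(same_doc))
    not_pad = (indexed > 0).unsqueeze(0)
    return (same_doc & causal & not_pad)[None, None]  # [1, 1, S, S]
```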

* refactor: move VLM packing config to top-level packed_sequence section

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* Switch VLM neat packing example to MedPix-VQA dataset with 8k seqlen

Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: deduplicate robust_collate into make_robust_collate

Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move media I/O helpers from datasets.py to utils.py

Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: move random import to module level, allow pretokenize without packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: remove verbose comments from packing recipe yamls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add Qwen3.5-4B VLM neat packing recipe

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add qwen3_5 to packing patch modules and fix missing import

- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* remove LLM recipe from VLM data pipeline PR

This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update test imports after media helpers move to utils.py

Update test_datasets.py to import _read_video_frames and
_preload_media from vlm/utils.py instead of vlm/datasets.py.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for packing, utils, and collate changes

New test files:
- test_utils.py: _resolve_lmdb_image (cache, missing key, RGB),
  _build_video_metadata (empty, no video, preserved fields)
- test_packing.py: get_seqlens_in_batch, get_unpad_data,
  _passthrough_create_causal_mask (both HF signatures),
  get_attn_implementation (backend vs HF config),
  configure_packing (noop for sdpa, patches FA2 modules)

Extended test_collate_fns.py:
- make_robust_collate (success, retry, max_retries exhausted)
- neat_packed_vlm_collater attn_implementation variants
  (2D mask for FA2, 4D for sdpa, fixed max_length, pixel_values concat)

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and missing copyright headers

- ruff fix: remove unused imports (copy, BaseVideoProcessor, load_video,
  as_completed), unused variables (grid_idx, total_text_tokens,
  total_media_tokens), fix import ordering
- Add copyright headers to scripts/precompute_tokens.py and
  tests/test_meta_dataset_all.py

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format on all changed files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename test_utils.py to avoid pytest collection conflict

tests/unit_tests/datasets/test_utils.py already exists; having
test_utils.py in the vlm/ subdirectory causes a module name collision.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: configure cfg_ds.get defaults in build_dataloader tests

MagicMock().get() returns a truthy MagicMock by default, which
incorrectly triggers the pretokenize path. Configure side_effect
to return proper defaults for packing-related keys.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: make packing mask patch safe for non-packed forward passes

_passthrough_create_causal_mask now checks whether the attention mask
is actually a packed mask (4D or indexed with values > 1) before
returning it as-is. For normal 2D masks (standard training), it
delegates to the original HF create_causal_mask, preventing test
pollution where the monkey-patch breaks non-packed Qwen2 tests.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: passthrough causal mask for FA2 to avoid breaking validation

The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
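
Combining this with the previous commit, a hedged sketch of the final
patched behavior; it assumes attention_mask arrives as a keyword and simply
forwards everything else:

```python
def make_passthrough_create_causal_mask(orig_create_causal_mask,
                                        attn_impl: str):
    def patched(*args, attention_mask=None, **kwargs):
        packed = attention_mask is not None and (
            attention_mask.dim() == 4 or attention_mask.max().item() > 1
        )
        if packed or attn_impl == "flash_attention_2":
            # FA2 masks causally inside the kernel; packed masks must
            # reach the attention layer untouched.
            return attention_mask
        # Plain 2D masks on sdpa/eager still need HF's causal-mask logic.
        return orig_create_causal_mask(*args, attention_mask=attention_mask,
                                       **kwargs)

    return patched
```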

* fix: address code review feedback from claude[bot]

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: log actual processor type before falling back to default

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unused sum() variables flagged by ruff F841

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add validation and max_steps to VLM packing recipes

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enable checkpoint with safetensors in qwen3_vl_4b recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: enable Qwen3.5 dense CP support and fix FSDP mixed-dtype wrapping

PR #1631 added _restore_loaded_model_dtype which restores checkpoint
dtypes after loading. For Qwen3.5 dense, this puts A_log and norm back
to float32 while everything else is bfloat16, breaking FSDP2 which
requires uniform dtype per group.

Fix by adding Qwen3_5ParallelizationStrategy that:
- Moves float32 bare params (A_log) into a _fp32_params submodule so
  fully_shard_by_dtype can wrap them in a separate FSDP group
- When CP>1, swaps HF Qwen3_5GatedDeltaNet to CPAwareGatedDeltaNet
  (reusing the existing MoE CP implementation) and sets _cp_mesh
- Adds a decoder-layer pre-hook to pass position_ids to linear_attn
  (HF decoder layers don't forward it, but CP needs it)

Tested: CP=1 and CP=2 losses match (2.8484 vs 2.8484 step 0,
2.4787 vs 2.4766 step 1, val 2.9792 vs 2.9786).

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
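
A minimal sketch of the two techniques the commit names, the __class__ swap
and the decoder-layer pre-hook; cp_aware_cls stands in for
CPAwareGatedDeltaNet, and the attribute names are illustrative:

```python
import torch.nn as nn

def swap_to_cp_aware(layer: nn.Module, cp_aware_cls: type, cp_mesh) -> None:
    # Rebinding __class__ keeps the module's existing parameters and
    # buffers while routing forward() through the CP-aware implementation.
    layer.linear_attn.__class__ = cp_aware_cls
    layer.linear_attn._cp_mesh = cp_mesh

    def pre_hook(module, args, kwargs):
        # HF decoder layers do not pass position_ids down to linear_attn;
        # cache them on the submodule so the CP forward path can read them.
        module.linear_attn._position_ids = kwargs.get("position_ids")

    layer.register_forward_pre_hook(pre_hook, with_kwargs=True)
```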

* style: ruff format + add unit tests for Qwen3.5 CP/FSDP patching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address Claude review - test calls real patch_hf_model, clarify thread-safety comment

- Rewrote test_fp32_params_moved_to_holder to call the real patch_hf_model
  function instead of replicating its logic, by monkeypatching the
  transformers module stubs so isinstance() matches _FakeGatedDeltaNet.
- Clarified the thread-safety comment on the globals() swap in
  Qwen3_5ParallelizationStrategy.parallelize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: address round-2 review - test_no_class_swap calls real patch_hf_model, add defensive assertion

- Updated test_no_class_swap_when_cp_disabled to call patch_hf_model
  with stubs instead of trivially asserting the fake type.
- Added defensive assertion that apply_fsdp2_sharding_recursively exists
  before the globals() swap in Qwen3_5ParallelizationStrategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add test_class_swap_when_cp_enabled for patch_hf_model cp_enabled=True path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* test: add coverage for Qwen3_5ParallelizationStrategy.parallelize()

Add tests for the parallelize() method covering:
- patch_hf_model call and delegation to super()
- globals swap and restore of apply_fsdp2_sharding_recursively
- global restore on error (try/finally)
- CP mesh assignment when cp_enabled=True
- _fsdp_by_dtype ModuleList iteration with fully_shard_by_dtype

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
