cp: feat: VLM pretokenized data pipeline with neat packing #1618
## Conversation
Implement sequence packing via min-heap first-fit-decreasing knapsack for both LLM and VLM datasets, with indexed attention masks and flash attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
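The min-heap first-fit-decreasing idea is compact enough to sketch. A minimal illustration of the technique named in the commit, not the PR's actual implementation; the function name and index-based return format are assumptions:

```python
import heapq

def ffd_pack(lengths: list[int], pack_size: int) -> list[list[int]]:
    """Illustrative sketch: pack sample indices into bins of at most
    pack_size tokens, first-fit-decreasing with a min-heap keyed on
    tokens already used per bin."""
    # "Decreasing": seat the longest samples first.
    order = sorted(range(len(lengths)), key=lengths.__getitem__, reverse=True)
    heap: list[tuple[int, int]] = []   # (tokens_used, bin_id)
    bins: list[list[int]] = []
    for idx in order:
        need = lengths[idx]
        # The heap top is the emptiest bin; if the sample does not fit
        # there, it fits nowhere, so open a fresh bin.
        if heap and heap[0][0] + need <= pack_size:
            used, bin_id = heapq.heappop(heap)
            bins[bin_id].append(idx)
            heapq.heappush(heap, (used + need, bin_id))
        else:
            bins.append([idx])
            heapq.heappush(heap, (need, len(bins) - 1))
    return bins
```

For example, `ffd_pack([5, 3, 3, 2], pack_size=8)` yields `[[0, 1], [2, 3]]`: the longest sample is seated first and shorter ones fill the remaining capacity.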
Sort samples by estimated token length (text + media) and shuffle within buckets to keep batch-internal lengths similar, reducing padding waste. Includes accurate image/video token count estimation via smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
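The bucketed shuffle is easy to picture in isolation. A minimal sketch assuming a precomputed list of estimated lengths (text plus media tokens); `bucket_size` and the seeding scheme are illustrative, not the sampler's real knobs:

```python
import random

def length_grouped_indices(lengths: list[int], bucket_size: int,
                           seed: int = 0) -> list[int]:
    """Illustrative sketch: order sample indices so neighbours have similar
    token counts, shuffling inside and across buckets to avoid a fixed
    easy-to-hard curriculum."""
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    buckets = [order[i : i + bucket_size]
               for i in range(0, len(order), bucket_size)]
    for bucket in buckets:
        rng.shuffle(bucket)   # randomize within a bucket of near-equal lengths
    rng.shuffle(buckets)      # and randomize which bucket comes when
    return [i for bucket in buckets for i in bucket]
```

Per the summary below, the real LengthGroupedSampler is also distributed-aware, so it additionally shards these indices across ranks.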
Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused import and variable in neat_packing_vlm.py. Fix 13 sampler tests that referenced non-existent bucket_size and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sort imports, remove unused imports/variables, fix f-strings without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement LLaMA-Factory style meta JSON dataset loading with support for multiple dataset composition, sampling ratios, ShareGPT format conversion, LMDB image storage, video frame reading via decord, media preloading, and cross-rank data sharding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
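For flavor, a LLaMA-Factory style meta JSON maps dataset names to files plus options. A rough sketch of the composition and sampling step under an assumed schema (`file_name`, `sampling_ratio` keys are placeholders); the PR's loader also handles ShareGPT conversion, LMDB, video frames, and sharding:

```python
import json
import random

def load_from_meta(meta_path: str, seed: int = 0) -> list[dict]:
    """Illustrative sketch: compose several datasets from one meta JSON,
    downsampling each by its ratio.
    Assumed schema: {name: {"file_name": ..., "sampling_ratio": ...}}."""
    with open(meta_path) as f:
        meta = json.load(f)
    rng = random.Random(seed)
    samples: list[dict] = []
    for name, spec in meta.items():
        path = spec["file_name"]
        with open(path) as f:
            rows = ([json.loads(line) for line in f]
                    if path.endswith(".jsonl") else json.load(f))
        keep = min(len(rows), int(len(rows) * spec.get("sampling_ratio", 1.0)))
        samples.extend(rng.sample(rows, keep))
    return samples
```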
RobustDatasetWrapper provides data loading error retry, media preloading, and fake image injection to prevent FSDP/Zero3 hangs on pure-text batches. PreTokenizedDatasetWrapper supports per-sample tokenization in DataLoader workers with overlong sample detection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
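Why fake images matter: under FSDP/ZeRO-3, a rank whose batch is pure text never runs the vision tower while its peers do, and the collectives deadlock. A minimal sketch of the injection idea; the image size and key names are assumptions, though the `_injected_fake` flag is referenced later in review:

```python
from PIL import Image

def maybe_inject_fake_image(example: dict) -> dict:
    """Illustrative sketch: attach a tiny black image to text-only samples
    so every rank exercises the vision tower and FSDP/ZeRO-3 collectives
    stay in lockstep; the flag lets downstream code discount the fake."""
    if not example.get("images") and not example.get("videos"):
        example["images"] = [Image.new("RGB", (28, 28))]  # size is illustrative
        example["_injected_fake"] = True
    return example
```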
Replace BPE context-sensitive pattern matching with token ID-level scanning (build_labels_from_template) for reliable assistant turn detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample media counts (n_images_per_sample/n_videos_per_sample) to collate output for precise PP chunking. Replace truncation with pre-filtering via _drop_overlong_samples. Use decord as video backend globally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
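Token ID-level scanning avoids the classic failure where a BPE merge across a template boundary breaks string matching. A sketch of the idea, assuming the assistant-start marker has been tokenized once up front into `start_ids` and turns end at `end_id`; the real build_labels_from_template will differ in detail:

```python
def build_labels_from_template(input_ids: list[int], start_ids: list[int],
                               end_id: int, ignore_index: int = -100) -> list[int]:
    """Illustrative sketch: supervise only assistant turns, located by
    scanning token IDs for the template's assistant-start sequence
    (e.g. '<|im_start|>assistant' pre-tokenized); no string matching."""
    labels = [ignore_index] * len(input_ids)
    i, n = 0, len(start_ids)
    while i <= len(input_ids) - n:
        if input_ids[i : i + n] == start_ids:
            j = i + n
            while j < len(input_ids) and input_ids[j] != end_id:
                labels[j] = input_ids[j]
                j += 1
            if j < len(input_ids):
                labels[j] = input_ids[j]   # train on the end-of-turn token too
            i = j + 1
        else:
            i += 1
    return labels
```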
Replace the manual _fix_video_timestamps regex approach with _build_video_metadata that passes metadata directly to the processor. Also adds second_per_grid_ts to output keys. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Offline parallel tokenization tool that writes _text_tokens counts to dataset samples, enabling LengthGroupedSampler to use exact token counts instead of heuristic estimation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
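A stripped-down picture of what such an offline pass can look like with `multiprocessing`; the model name, file paths, and `conversations` schema below are placeholders, and the PR's precompute_tokens.py is more involved:

```python
import json
from multiprocessing import Pool

from transformers import AutoTokenizer

MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"  # placeholder model name

def _init():
    # One tokenizer per worker process, created by the pool initializer.
    _count_text_tokens.tok = AutoTokenizer.from_pretrained(MODEL)

def _count_text_tokens(sample: dict) -> dict:
    tok = _count_text_tokens.tok
    text = "\n".join(turn["value"] for turn in sample["conversations"])
    sample["_text_tokens"] = len(tok(text, add_special_tokens=False).input_ids)
    return sample

if __name__ == "__main__":
    with open("train.json") as f:          # placeholder paths
        data = json.load(f)
    with Pool(processes=8, initializer=_init) as pool:
        data = pool.map(_count_text_tokens, data)
    with open("train.pretok.json", "w") as f:
        json.dump(data, f)
```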
…king

Wire up configure_packing and attn-aware collaters into both LLM and VLM recipes so neat packing correctly enforces per-document attention boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing, VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
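The two mask forms the collater produces can be sketched from first principles: flash-attention varlen kernels want `cu_seqlens` boundaries, while SDPA wants an explicit 4D boolean block-causal mask. Illustrative helpers assuming a per-token document-id tensor (function names are not the PR's):

```python
import torch

def block_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Illustrative 4D bool mask for SDPA: token i may attend to token j
    only when j <= i and both positions belong to the same packed document.
    doc_ids: [batch, seq] integer document ids."""
    same_doc = doc_ids.unsqueeze(-1) == doc_ids.unsqueeze(-2)   # [B, S, S]
    pos = torch.arange(doc_ids.size(-1), device=doc_ids.device)
    causal = pos.unsqueeze(-1) >= pos.unsqueeze(0)              # [S, S], j <= i
    return (same_doc & causal).unsqueeze(1)                     # [B, 1, S, S]

def cu_seqlens_from_doc_ids(doc_ids_1d: torch.Tensor) -> torch.Tensor:
    """Illustrative cumulative sequence lengths for flash-attn varlen:
    a boundary sits wherever the document id changes."""
    change = torch.nonzero(doc_ids_1d[1:] != doc_ids_1d[:-1]).flatten() + 1
    zero = torch.zeros(1, dtype=torch.long, device=doc_ids_1d.device)
    end = torch.tensor([doc_ids_1d.numel()], device=doc_ids_1d.device)
    return torch.cat([zero, change, end]).to(torch.int32)
```

For a packed row with doc ids `[0, 0, 0, 1, 1, 2]`, `cu_seqlens_from_doc_ids` returns `[0, 3, 5, 6]`, which is the boundary format per-document varlen attention consumes.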
Move packing configuration from nested dataset.packing to a top-level packed_sequence: section, matching the LLM recipe pattern. This decouples dataset definition from packing strategy. The VLM recipe's build_dataloader now accepts cfg_ps and reads packing config from there first, falling back to legacy dataset.packing for backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
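The fallback order reads naturally as plain config plumbing. A sketch of the lookup described above; the config object shape and defaults here are assumptions:

```python
def resolve_packing_cfg(cfg) -> dict:
    """Illustrative sketch: prefer the top-level packed_sequence section,
    falling back to the legacy nested dataset.packing for old recipes."""
    ps = getattr(cfg, "packed_sequence", None)
    if ps is None:
        ps = getattr(getattr(cfg, "dataset", None), "packing", None)  # legacy
    return {
        "pack_size": getattr(ps, "pack_size", 0) if ps is not None else 0,
        "pretokenize": getattr(ps, "pretokenize", False) if ps is not None else False,
        "max_length": getattr(ps, "max_length", None) if ps is not None else None,
    }
```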
Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to demonstrate packed_sequence working with standard HF datasets. Increase pack_size/max_length to 8192 for real image samples. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper and RobustDatasetWrapper into a shared make_robust_collate() function in collate_fns.py. Both classes now delegate to it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
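The shared helper amounts to a closure over the base collate and the dataset. A hedged sketch of the shape such a make_robust_collate can take; the real signature and resampling policy may differ:

```python
import logging
import random

def make_robust_collate(base_collate, dataset, max_retries: int = 3):
    """Illustrative sketch: wrap a collate_fn so a failing batch (corrupt
    image, decode error) is rebuilt from resampled examples instead of
    crashing the epoch."""
    def robust_collate(examples):
        for attempt in range(max_retries + 1):
            try:
                return base_collate(examples)
            except Exception as err:  # deliberately broad: any bad sample
                if attempt == max_retries:
                    raise
                logging.warning("collate failed (%s); resampling batch", err)
                examples = [dataset[random.randrange(len(dataset))]
                            for _ in examples]
    return robust_collate
```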
Move _resolve_lmdb_image, _read_video_frames, _preload_media, and _build_video_metadata to vlm/utils.py. These are generic media utilities not tied to any specific dataset. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
…t packing

- Move `import random` from inside make_robust_collate to module-level import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size, enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset. Requires transformers >= 5.3.0 for Qwen3.5 support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
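One way to tolerate the keyword rename is to absorb arguments generically; the commit instead names both spellings explicitly, so this is only a sketch of the compatibility idea, and whether the real patch returns the incoming mask or None is not shown here:

```python
def _passthrough_create_causal_mask(*args, **kwargs):
    # Sketch: swallow any calling convention, including the input_embeds ->
    # inputs_embeds rename across transformers releases, and hand back
    # whatever mask the packed collater already attached (None if absent).
    return kwargs.get("attention_mask")
```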
This LLM recipe doesn't belong in the VLM packing PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The previous logic delegated all non-packed 2D masks to HF's create_causal_mask, which produced a mask incompatible with flash_attention_2 during validation. FA2 handles causal masking internally, so always pass through. Delegation to HF is now limited to non-FA2 backends (sdpa/eager) where it is needed. Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
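The decision logic reads naturally as a small wrapper. A sketch of the behavior described above, assuming the original mask builder is captured at patch time; names and the 4D-mask check are illustrative:

```python
def patch_create_causal_mask(original_fn, attn_implementation: str):
    """Illustrative sketch of the fix: FA2 always passes through (it applies
    causality internally, and a rebuilt dense mask breaks it); ready-made
    packed 4D masks pass through for SDPA; only plain 2D masks on
    sdpa/eager delegate to HF's original builder."""
    def patched(*args, **kwargs):
        mask = kwargs.get("attention_mask")
        if attn_implementation == "flash_attention_2":
            return mask                        # FA2: never rebuild a dense mask
        if mask is not None and mask.dim() == 4:
            return mask                        # already a packed block-causal mask
        return original_fn(*args, **kwargs)    # sdpa/eager non-packed: delegate
    return patched
```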
/ok to test 8597ffa

/claude review
```python
from nemo_automodel.components.datasets.vlm.datasets import (
    _convert_sharegpt_to_conversation,
    _load_json_or_jsonl,
    _resolve_lmdb_image,
)
```
Bug: `_resolve_lmdb_image` is defined in `nemo_automodel.components.datasets.vlm.utils`, not in `datasets`. This import will raise `ImportError` at runtime.
Suggested change:

```diff
-from nemo_automodel.components.datasets.vlm.datasets import (
-    _convert_sharegpt_to_conversation,
-    _load_json_or_jsonl,
-    _resolve_lmdb_image,
-)
+from nemo_automodel.components.datasets.vlm.datasets import (
+    _convert_sharegpt_to_conversation,
+    _load_json_or_jsonl,
+)
+from nemo_automodel.components.datasets.vlm.utils import _resolve_lmdb_image
```
```diff
@@ -364,6 +625,19 @@ def kimi_vl_collate_fn(
     for key, value in list(batch.items()):
```
Bug: `fake_indices` is computed from the original `examples` list, but `conversations` may have been shortened by `_drop_overlong_samples` above. After dropping, the batch tensors have fewer rows than `len(examples)`, so these indices can be out-of-bounds or point to the wrong samples.

Same issue exists in `default_collate_fn` (around line 1115) and `qwen2_5_collate_fn` (around line 467).

You need to either:
- Track which examples survived filtering and remap indices, or
- Move fake-image injection to happen before `_drop_overlong_samples` so it's already baked into the conversation, and track the `_injected_fake` flag on the conversation (not on `examples`).
```python
if has_stats:
    total_images = sum(t.get("n_images", 0) for t in timings.values())
```
These two sum(...) expressions compute values but discard them (not assigned to any variable). If they're needed for the summary, assign them; otherwise remove them.
Suggested change:

```python
if has_stats:
    total_images = sum(t.get("n_images", 0) for t in timings.values())
    total_text_tokens = sum(t.get("n_text_tokens", 0) for t in timings.values())
    total_media_tokens = sum(t.get("n_media_tokens", 0) for t in timings.values())
```
- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
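The fix's contract is easiest to see as a signature change. A sketch of a `_drop_overlong_samples`-style helper that returns the surviving indices so `fake_indices` can be remapped; argument names here are assumptions:

```python
def drop_overlong_samples(conversations, examples, token_counts, max_length):
    """Illustrative sketch of the fix: filter overlong samples, keeping
    conversations and examples in sync, and return the surviving original
    indices so bookkeeping built on old positions (like fake_indices)
    can be remapped."""
    kept = [i for i, n in enumerate(token_counts) if n <= max_length]
    return ([conversations[i] for i in kept],
            [examples[i] for i in kept],
            kept)
```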
/claude review
```python
if processor_type not in COLLATE_FNS:
    processor_type = "default"
    logging.warning(f"You are using {processor_type} with default collate function.")
```
Bug: `processor_type` is reassigned to `"default"` on line 375 before the warning on line 376, so the warning always logs "You are using default with default collate function" instead of the actual processor type name.
Suggested change:

```python
if processor_type not in COLLATE_FNS:
    logging.warning(f"You are using {processor_type} with default collate function.")
    processor_type = "default"
```
Light review complete. One bug found (inline comment posted): the warning message in finetune.py always logs "default" instead of the actual processor type because the variable is reassigned before the log call.
The rest of the PR — packing engine, collate functions, dataset loading, retry logic, and sharding — looks correct. The autoregressive label shift is safe across all collate functions (shapes diverge before the input-trimming loop). Tests cover the key paths well.
Nice work on the knapsack packing and robust dataset wrapper.
Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ok to test 358b5f7
Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ok to test d1e450a
- Add validation_dataset (MedPix-VQA) and validation_dataloader to qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ok to test eb01df7
Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ok to test 65d42d6

/ok to test 849ba79
## Summary
VLM pretokenized data pipeline with greedy knapsack packing, cherry-picked and adapted from @ZhiqiLi-Nvidia's `zhiqi-dev` branch. This PR lands the data pipeline components needed for efficient packed-sequence VLM fine-tuning.

## What's included
### Packing engine (CP from zhiqi-dev, zhiqil@nvidia.com)

- Knapsack packing with `packing_ratio` and `pack_size`
- `LengthGroupedSampler` for token-aware distributed sampling
- `configure_packing` monkey-patch for `flash_attention_2` and `sdpa` packed masks
- `_get_unpad_data` patch for `flash_attn_varlen_func` with per-document `cu_seqlens`

### Data loading (CP from zhiqi-dev, zhiqil@nvidia.com)
- `RobustDatasetWrapper` with retry logic and fake image injection for text-only samples
- `VideoMetadata` refactor for timestamp handling
- `precompute_tokens.py` offline tokenization script

### Collate functions & integration (CP zhiqi-dev, zhiqil@nvidia.com + new)
- `packed_sequence` top-level YAML config section (replaces nested `dataset.packing`)
- `finetune.py`: pretokenize → pack → train with FA2

### Recipes
- `qwen3_5/qwen3_5_4b_neat_packing.yaml` — Qwen3.5-4B with train + validation
- `qwen3/qwen3_vl_4b_neat_packing.yaml` — Qwen3-VL-4B-Thinking
- `qwen3/qwen3_vl_moe_30b_neat_packing.yaml` — Qwen3-VL-30B-A3B MoE with EP=8

## Cherry-pick origin
Most feature commits are cherry-picked from `zhiqi-dev` (authored by @ZhiqiLi-Nvidia). Key mappings:

63eaf11a5aa050237295924126fe58a9f6936222b6f5f3df3d3e440be6ca6d7e45a7cbe7769cb8b78f18dc5e060f02d78e7b59b29

Post-CP commits (lint, tests, recipes, fixes) are by @HuiyingLi.
## Validation

- `qwen3_5_4b_neat_packing.yaml`
- `qwen3_vl_4b_neat_packing.yaml`
- `qwen3_vl_moe_30b_neat_packing.yaml`

## Test plan
🤖 Generated with Claude Code