
cp: feat: VLM pretokenized data pipeline with neat packing #1618

Merged
HuiyingLi merged 39 commits into main from vlm_data_pipeline_v2 on Apr 1, 2026

Conversation

@HuiyingLi (Contributor) commented Mar 29, 2026

Summary

VLM pretokenized data pipeline with greedy knapsack packing, cherry-picked and adapted from @ZhiqiLi-Nvidia's zhiqi-dev branch. This PR lands the data pipeline components needed for efficient packed-sequence VLM fine-tuning.

What's included

Packing engine (cherry-picked from zhiqi-dev, zhiqil@nvidia.com)

  • Greedy knapsack bin-packing with configurable packing_ratio and pack_size
  • LengthGroupedSampler for token-aware distributed sampling
  • configure_packing monkey-patch for flash_attention_2 and sdpa packed masks
  • _get_unpad_data patch for flash_attn_varlen_func with per-document cu_seqlens
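
To make the boundary format concrete, here is a minimal sketch of the cu_seqlens tensor that the patched _get_unpad_data hands to flash_attn_varlen_func; the helper name and shapes are illustrative, not the PR's actual code:

```python
import torch

# Illustrative only: cumulative per-document boundaries let flash attention
# confine each query to its own document inside a packed sequence.
def cu_seqlens_from_doc_lengths(doc_lengths):
    # [0, l0, l0+l1, ...] as int32, the format flash_attn_varlen_func expects
    return torch.tensor([0] + list(doc_lengths), dtype=torch.int32).cumsum(0, dtype=torch.int32)

# A pack holding documents of 5, 3, and 8 tokens:
print(cu_seqlens_from_doc_lengths([5, 3, 8]))  # tensor([ 0,  5,  8, 16], dtype=torch.int32)
```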

Data loading (cherry-picked from zhiqi-dev, zhiqil@nvidia.com)

  • Meta-dataset loader: ShareGPT format, LMDB images, video support
  • RobustDatasetWrapper with retry logic and fake image injection for text-only samples
  • Template-based label building (works across processor variants)
  • VideoMetadata refactor for timestamp handling
  • precompute_tokens.py offline tokenization script

Collate functions & integration (cherry-picked from zhiqi-dev, zhiqil@nvidia.com, plus new work)

  • Attention-aware collaters for Qwen2.5-VL, Qwen3-VL, Qwen3.5, Kimi-VL, Kimi-K2.5-VL
  • packed_sequence top-level YAML config section (replaces nested dataset.packing); see the example below
  • Recipe integration in finetune.py: pretokenize → pack → train with FA2
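
For orientation, a packed_sequence section could look like the sketch below. Only the section name and the pack_size, packing_ratio, pretokenize, and max_length knobs appear in this PR; the values and comments are assumptions, so treat the shipped recipes as the authoritative schema.

```yaml
# Hypothetical sketch; see qwen3_vl_4b_neat_packing.yaml for the real layout.
packed_sequence:
  pretokenize: true   # tokenize up front so sample lengths are exact
  pack_size: 8192     # token budget per packed sequence
  packing_ratio: 8    # assumed semantics: upper bound on samples per pack
  max_length: 8192    # overlong samples are pre-filtered, not truncated
```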

Recipes

  • qwen3_5/qwen3_5_4b_neat_packing.yaml — Qwen3.5-4B with train + validation
  • qwen3/qwen3_vl_4b_neat_packing.yaml — Qwen3-VL-4B-Thinking
  • qwen3/qwen3_vl_moe_30b_neat_packing.yaml — Qwen3-VL-30B-A3B MoE with EP=8

Cherry-pick origin

Most feature commits are cherry-picked from zhiqi-dev (authored by @ZhiqiLi-Nvidia). Key mappings:

| This branch | zhiqi-dev | Description |
|---|---|---|
| 63eaf11 | a5aa050 | Greedy knapsack packing |
| 2372959 | 24126fe | LengthGroupedSampler |
| 58a9f69 | 36222b6 | Meta-dataset loading |
| f5f3df3 | d3e440b | RobustDatasetWrapper + fake image |
| e6ca6d7 | e45a7cb | Template-based label building |
| e7769cb | 8b78f18 | VideoMetadata refactor |
| dc5e060 | f02d78e | precompute_tokens script |
| 7b59b29 | multiple | Wire up packing + collaters |

Post-CP commits (lint, tests, recipes, fixes) are by @HuiyingLi.

Validation

| Recipe | Config | Result |
|---|---|---|
| Qwen3.5-4B | qwen3_5_4b_neat_packing.yaml | https://wandb.ai/Nemo-automodel/vlm-pack/runs/bgt2dw6n?nw=nwuserhuiyingl |
| Qwen3-VL-4B | qwen3_vl_4b_neat_packing.yaml | https://wandb.ai/Nemo-automodel/vlm-pack/runs/j0pm5m2o?nw=nwuserhuiyingl |
| Qwen3-VL-30B MoE EP=8 | qwen3_vl_moe_30b_neat_packing.yaml | https://wandb.ai/Nemo-automodel/vlm-pack/runs/2ja5q4ut?nw=nwuserhuiyingl |

Test plan

  • Full convergence run with wandb logging

🤖 Generated with Claude Code

zhiqil and others added 25 commits March 7, 2026 01:37
Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
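
As a rough illustration of the technique this commit names (not the repository's code), first-fit-decreasing with a min-heap sorts samples longest-first and drops each one into the least-full open bin that still has room, opening a new bin otherwise:

```python
import heapq

def greedy_knapsack(lengths, pack_size):
    """Pack sample indices into bins holding at most pack_size tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    heap = []  # (used_tokens, bin_id), min-heap keyed on bin usage
    bins = []
    for i in order:
        if heap and heap[0][0] + lengths[i] <= pack_size:
            used, b = heapq.heappop(heap)
            bins[b].append(i)
            heapq.heappush(heap, (used + lengths[i], b))
        else:
            bins.append([i])
            heapq.heappush(heap, (lengths[i], len(bins) - 1))
    return bins

print(greedy_knapsack([7, 5, 4, 3, 1], pack_size=8))  # [[0], [1, 4], [2, 3]]
```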

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
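
The sampling idea, sketched below with assumed parameter names (the actual sampler's knobs differ; a later commit even fixes tests that referenced a non-existent bucket_size), is to sort indices by estimated length and shuffle only within buckets, so neighbouring batches see similar lengths without a fully deterministic order:

```python
import random

def length_grouped_indices(lengths, bucket_size, seed=0):
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    out = []
    for start in range(0, len(order), bucket_size):
        bucket = order[start : start + bucket_size]
        rng.shuffle(bucket)  # similar lengths stay adjacent, order stays random
        out.extend(bucket)
    return out
```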

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Sort imports, remove unused imports/variables, fix f-strings
without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
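
For readers unfamiliar with the format: ShareGPT stores turns as from/value pairs, and conversion to role/content messages looks roughly like the sketch below. Field handling follows the common ShareGPT convention; the repo's _convert_sharegpt_to_conversation may differ in detail.

```python
_ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def convert_sharegpt(sample):
    return [
        {"role": _ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in sample["conversations"]
    ]

convert_sharegpt({"conversations": [
    {"from": "human", "value": "What does the scan show?"},
    {"from": "gpt", "value": "A chest X-ray."},
]})
```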
RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
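
Why fake images matter: under FSDP/ZeRO-3 every rank must execute the same modules, so a text-only batch that skips the vision tower on one rank can hang collectives. A minimal sketch of the injection idea, with assumed names and image size:

```python
from PIL import Image

def inject_fake_image(example):
    if not example.get("images"):
        # tiny black image; its tokens are later masked out of the loss
        example["images"] = [Image.new("RGB", (28, 28))]
        example["_injected_fake"] = True  # lets the collater find and mask it
    return example
```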
Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
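
A minimal sketch of token ID-level span scanning; the marker IDs below are placeholders, whereas build_labels_from_template derives them from the processor's chat template:

```python
IGNORE_INDEX = -100  # ignored by PyTorch cross-entropy

def scan_labels(input_ids, assistant_start_id, assistant_end_id):
    """Supervise only tokens inside assistant turns (end marker included)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    inside = False
    for i, tok in enumerate(input_ids):
        if tok == assistant_start_id:
            inside = True
            continue  # the start marker itself stays unsupervised
        if inside:
            labels[i] = tok
        if tok == assistant_end_id:
            inside = False
    return labels
```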
Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
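
Conceptually the offline pass just attaches exact counts so the sampler can skip estimation. A toy version under assumed file layout and model name (the real tool is precompute_tokens.py and also accounts for media):

```python
import json
from multiprocessing import Pool

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

def add_text_token_count(sample):
    text = " ".join(turn["value"] for turn in sample["conversations"])
    sample["_text_tokens"] = len(tokenizer(text).input_ids)
    return sample

if __name__ == "__main__":
    with open("train.json") as f:
        samples = json.load(f)
    with Pool() as pool:  # parallel tokenization across CPU workers
        samples = pool.map(add_text_token_count, samples)
    with open("train.pretokenized.json", "w") as f:
        json.dump(samples, f)
```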
…king

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
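
The shared wrapper's shape, sketched with an assumed signature: on a failing collate, log and resample replacement examples rather than killing the DataLoader worker.

```python
import logging
import random

def make_robust_collate(dataset, collate_fn, max_retries=10):
    def robust_collate(batch):
        for _ in range(max_retries):
            try:
                return collate_fn(batch)
            except Exception as err:  # deliberately broad: any bad sample
                logging.warning("collate failed (%s); resampling batch", err)
                batch = [dataset[random.randrange(len(dataset))] for _ in batch]
        raise RuntimeError("collate still failing after retries")
    return robust_collate
```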
Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
…t packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
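
The dispatch this commit describes, reduced to a sketch (names and signature are assumptions; the real patch lives in configure_packing):

```python
def make_create_causal_mask_patch(hf_create_causal_mask, attn_implementation):
    def patched(*args, attention_mask=None, **kwargs):
        if attn_implementation == "flash_attention_2":
            # FA2 applies causality internally; pass the (possibly packed)
            # 2D mask straight through instead of expanding it to 4D.
            return attention_mask
        # sdpa/eager still need HF to build the causal mask
        return hf_create_causal_mask(*args, attention_mask=attention_mask, **kwargs)
    return patched
```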
@HuiyingLi (Contributor, Author)

/ok to test 8597ffa

@HuiyingLi (Contributor, Author)

/claude review

Comment on lines +36 to +40
```python
from nemo_automodel.components.datasets.vlm.datasets import (
    _convert_sharegpt_to_conversation,
    _load_json_or_jsonl,
    _resolve_lmdb_image,
)
```

Bug: _resolve_lmdb_image is defined in nemo_automodel.components.datasets.vlm.utils, not in datasets. This import will raise ImportError at runtime.

Suggested change:

```python
from nemo_automodel.components.datasets.vlm.datasets import (
    _convert_sharegpt_to_conversation,
    _load_json_or_jsonl,
)
from nemo_automodel.components.datasets.vlm.utils import _resolve_lmdb_image
```

@@ -364,6 +625,19 @@ def kimi_vl_collate_fn(
for key, value in list(batch.items()):

Bug: fake_indices is computed from the original examples list, but conversations may have been shortened by _drop_overlong_samples above. After dropping, the batch tensors have fewer rows than len(examples), so these indices can be out-of-bounds or point to the wrong samples.

Same issue exists in default_collate_fn (around line 1115) and qwen2_5_collate_fn (around line 467).

You need to either:

  1. Track which examples survived filtering and remap indices, or
  2. Move fake-image injection to happen before _drop_overlong_samples so it's already baked into the conversation, and track the _injected_fake flag on the conversation (not on examples).
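
A sketch of option 1 with assumed names; the follow-up fix commit takes this route, with _drop_overlong_samples returning the kept positions:

```python
def drop_overlong_samples(conversations, lengths, max_length):
    """Return surviving conversations plus their original positions."""
    kept = [i for i, n in enumerate(lengths) if n <= max_length]
    return [conversations[i] for i in kept], kept

# Caller remaps fake-image indices in lockstep with the filtering:
# conversations, kept = drop_overlong_samples(conversations, lengths, max_len)
# old_to_new = {old: new for new, old in enumerate(kept)}
# fake_indices = [old_to_new[i] for i in fake_indices if i in old_to_new]
```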

Comment on lines +573 to +574
```python
if has_stats:
    total_images = sum(t.get("n_images", 0) for t in timings.values())
```

These two sum(...) expressions compute values but discard them (not assigned to any variable). If they're needed for the summary, assign them; otherwise remove them.

Suggested change:

```python
if has_stats:
    total_images = sum(t.get("n_images", 0) for t in timings.values())
    total_text_tokens = sum(t.get("n_text_tokens", 0) for t in timings.values())
    total_media_tokens = sum(t.get("n_media_tokens", 0) for t in timings.values())
```

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

HuiyingLi commented Mar 30, 2026

/claude review


Comment on lines +374 to +376
```python
if processor_type not in COLLATE_FNS:
    processor_type = "default"
    logging.warning(f"You are using {processor_type} with default collate function.")
```

Bug: processor_type is reassigned to "default" on line 375 before the warning on line 376, so the warning always logs "You are using default with default collate function" instead of the actual processor type name.

Suggested change:

```python
if processor_type not in COLLATE_FNS:
    logging.warning(f"You are using {processor_type} with default collate function.")
    processor_type = "default"
```

@claude (Bot) left a comment

Light review complete. One bug found (inline comment posted): the warning message in finetune.py always logs "default" instead of the actual processor type because the variable is reassigned before the log call.

The rest of the PR — packing engine, collate functions, dataset loading, retry logic, and sharding — looks correct. The autoregressive label shift is safe across all collate functions (shapes diverge before the input-trimming loop). Tests cover the key paths well.

Nice work on the knapsack packing and robust dataset wrapper.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

/ok to test 358b5f7

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

/ok to test d1e450a

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

/ok to test eb01df7

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

/ok to test 65d42d6

@HuiyingLi (Contributor, Author)

/ok to test 849ba79
