
cp: feat: VLM pretokenized data pipeline with neat packing #1618

Merged
HuiyingLi merged 39 commits into main from vlm_data_pipeline_v2 on Apr 1, 2026

Conversation

@HuiyingLi (Contributor) commented Mar 29, 2026

Summary

VLM pretokenized data pipeline with greedy knapsack packing, cherry-picked and adapted from @ZhiqiLi-Nvidia's zhiqi-dev branch. This PR lands the data pipeline components needed for efficient packed-sequence VLM fine-tuning.

What's included

Packing engine (cherry-picked from zhiqi-dev, zhiqil@nvidia.com)

  • Greedy knapsack bin-packing with configurable packing_ratio and pack_size
  • LengthGroupedSampler for token-aware distributed sampling
  • configure_packing monkey-patch for flash_attention_2 and sdpa packed masks
  • _get_unpad_data patch for flash_attn_varlen_func with per-document cu_seqlens
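
To make the boundary format concrete, here is a minimal sketch of the cu_seqlens tensor that the patched _get_unpad_data hands to flash_attn_varlen_func; the helper name and shapes are illustrative, not the PR's actual code:

```python
import torch

# Illustrative only: cumulative per-document boundaries let flash attention
# confine each query to its own document inside a packed sequence.
def cu_seqlens_from_doc_lengths(doc_lengths):
    # [0, l0, l0+l1, ...] as int32, the format flash_attn_varlen_func expects
    return torch.tensor([0] + list(doc_lengths), dtype=torch.int32).cumsum(0, dtype=torch.int32)

# A pack holding documents of 5, 3, and 8 tokens:
print(cu_seqlens_from_doc_lengths([5, 3, 8]))  # tensor([ 0,  5,  8, 16], dtype=torch.int32)
```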

Data loading (cherry-picked from zhiqi-dev, zhiqil@nvidia.com)

  • Meta-dataset loader: ShareGPT format, LMDB images, video support
  • RobustDatasetWrapper with retry logic and fake image injection for text-only samples
  • Template-based label building (works across processor variants)
  • VideoMetadata refactor for timestamp handling
  • precompute_tokens.py offline tokenization script

Collate functions & integration (cherry-picked from zhiqi-dev, zhiqil@nvidia.com, plus new work)

  • Attention-aware collaters for Qwen2.5-VL, Qwen3-VL, Qwen3.5, Kimi-VL, Kimi-K2.5-VL
  • packed_sequence top-level YAML config section (replaces nested dataset.packing); see the example below
  • Recipe integration in finetune.py: pretokenize → pack → train with FA2
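
For orientation, a packed_sequence section could look like the sketch below. Only the section name and the pack_size, packing_ratio, pretokenize, and max_length knobs appear in this PR; the values and comments are assumptions, so treat the shipped recipes as the authoritative schema.

```yaml
# Hypothetical sketch; see qwen3_vl_4b_neat_packing.yaml for the real layout.
packed_sequence:
  pretokenize: true   # tokenize up front so sample lengths are exact
  pack_size: 8192     # token budget per packed sequence
  packing_ratio: 8    # assumed semantics: upper bound on samples per pack
  max_length: 8192    # overlong samples are pre-filtered, not truncated
```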

Recipes

  • qwen3_5/qwen3_5_4b_neat_packing.yaml — Qwen3.5-4B with train + validation
  • qwen3/qwen3_vl_4b_neat_packing.yaml — Qwen3-VL-4B-Thinking
  • qwen3/qwen3_vl_moe_30b_neat_packing.yaml — Qwen3-VL-30B-A3B MoE with EP=8

Cherry-pick origin

Most feature commits are cherry-picked from zhiqi-dev (authored by @ZhiqiLi-Nvidia). Key mappings:

| This branch | zhiqi-dev | Description |
|---|---|---|
| 63eaf11 | a5aa050 | Greedy knapsack packing |
| 2372959 | 24126fe | LengthGroupedSampler |
| 58a9f69 | 36222b6 | Meta-dataset loading |
| f5f3df3 | d3e440b | RobustDatasetWrapper + fake image |
| e6ca6d7 | e45a7cb | Template-based label building |
| e7769cb | 8b78f18 | VideoMetadata refactor |
| dc5e060 | f02d78e | precompute_tokens script |
| 7b59b29 | multiple | Wire up packing + collaters |

Post-CP commits (lint, tests, recipes, fixes) are by @HuiyingLi.

Validation

| Recipe | Config | Result |
|---|---|---|
| Qwen3.5-4B | qwen3_5_4b_neat_packing.yaml | https://wandb.ai/Nemo-automodel/vlm-pack/runs/bgt2dw6n?nw=nwuserhuiyingl |
| Qwen3-VL-4B | qwen3_vl_4b_neat_packing.yaml | https://wandb.ai/Nemo-automodel/vlm-pack/runs/j0pm5m2o?nw=nwuserhuiyingl |
| Qwen3-VL-30B MoE EP=8 | qwen3_vl_moe_30b_neat_packing.yaml | https://wandb.ai/Nemo-automodel/vlm-pack/runs/2ja5q4ut?nw=nwuserhuiyingl |

Test plan

  • Full convergence run with wandb logging

🤖 Generated with Claude Code

zhiqil and others added 25 commits March 7, 2026 01:37
Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
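
As a rough illustration of the technique this commit names (not the repository's code), first-fit-decreasing with a min-heap sorts samples longest-first and drops each one into the least-full open bin that still has room, opening a new bin otherwise:

```python
import heapq

def greedy_knapsack(lengths, pack_size):
    """Pack sample indices into bins holding at most pack_size tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    heap = []  # (used_tokens, bin_id), min-heap keyed on bin usage
    bins = []
    for i in order:
        if heap and heap[0][0] + lengths[i] <= pack_size:
            used, b = heapq.heappop(heap)
            bins[b].append(i)
            heapq.heappush(heap, (used + lengths[i], b))
        else:
            bins.append([i])
            heapq.heappush(heap, (lengths[i], len(bins) - 1))
    return bins

print(greedy_knapsack([7, 5, 4, 3, 1], pack_size=8))  # [[0], [1, 4], [2, 3]]
```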

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
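
The sampling idea, sketched below with assumed parameter names (the actual sampler's knobs differ; a later commit even fixes tests that referenced a non-existent bucket_size), is to sort indices by estimated length and shuffle only within buckets, so neighbouring batches see similar lengths without a fully deterministic order:

```python
import random

def length_grouped_indices(lengths, bucket_size, seed=0):
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    out = []
    for start in range(0, len(order), bucket_size):
        bucket = order[start : start + bucket_size]
        rng.shuffle(bucket)  # similar lengths stay adjacent, order stays random
        out.extend(bucket)
    return out
```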

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Sort imports, remove unused imports/variables, fix f-strings
without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
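
For readers unfamiliar with the format: ShareGPT stores turns as from/value pairs, and conversion to role/content messages looks roughly like the sketch below. Field handling follows the common ShareGPT convention; the repo's _convert_sharegpt_to_conversation may differ in detail.

```python
_ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def convert_sharegpt(sample):
    return [
        {"role": _ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in sample["conversations"]
    ]

convert_sharegpt({"conversations": [
    {"from": "human", "value": "What does the scan show?"},
    {"from": "gpt", "value": "A chest X-ray."},
]})
```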
RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
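
Why fake images matter: under FSDP/ZeRO-3 every rank must execute the same modules, so a text-only batch that skips the vision tower on one rank can hang collectives. A minimal sketch of the injection idea, with assumed names and image size:

```python
from PIL import Image

def inject_fake_image(example):
    if not example.get("images"):
        # tiny black image; its tokens are later masked out of the loss
        example["images"] = [Image.new("RGB", (28, 28))]
        example["_injected_fake"] = True  # lets the collater find and mask it
    return example
```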
Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
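
A minimal sketch of token ID-level span scanning; the marker IDs below are placeholders, whereas build_labels_from_template derives them from the processor's chat template:

```python
IGNORE_INDEX = -100  # ignored by PyTorch cross-entropy

def scan_labels(input_ids, assistant_start_id, assistant_end_id):
    """Supervise only tokens inside assistant turns (end marker included)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    inside = False
    for i, tok in enumerate(input_ids):
        if tok == assistant_start_id:
            inside = True
            continue  # the start marker itself stays unsupervised
        if inside:
            labels[i] = tok
        if tok == assistant_end_id:
            inside = False
    return labels
```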
Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
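
Conceptually the offline pass just attaches exact counts so the sampler can skip estimation. A toy version under assumed file layout and model name (the real tool is precompute_tokens.py and also accounts for media):

```python
import json
from multiprocessing import Pool

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

def add_text_token_count(sample):
    text = " ".join(turn["value"] for turn in sample["conversations"])
    sample["_text_tokens"] = len(tokenizer(text).input_ids)
    return sample

if __name__ == "__main__":
    with open("train.json") as f:
        samples = json.load(f)
    with Pool() as pool:  # parallel tokenization across CPU workers
        samples = pool.map(add_text_token_count, samples)
    with open("train.pretokenized.json", "w") as f:
        json.dump(samples, f)
```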
…king

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
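
The shared wrapper's shape, sketched with an assumed signature: on a failing collate, log and resample replacement examples rather than killing the DataLoader worker.

```python
import logging
import random

def make_robust_collate(dataset, collate_fn, max_retries=10):
    def robust_collate(batch):
        for _ in range(max_retries):
            try:
                return collate_fn(batch)
            except Exception as err:  # deliberately broad: any bad sample
                logging.warning("collate failed (%s); resampling batch", err)
                batch = [dataset[random.randrange(len(dataset))] for _ in batch]
        raise RuntimeError("collate still failing after retries")
    return robust_collate
```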
Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
…t packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
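
The dispatch this commit describes, reduced to a sketch (names and signature are assumptions; the real patch lives in configure_packing):

```python
def make_create_causal_mask_patch(hf_create_causal_mask, attn_implementation):
    def patched(*args, attention_mask=None, **kwargs):
        if attn_implementation == "flash_attention_2":
            # FA2 applies causality internally; pass the (possibly packed)
            # 2D mask straight through instead of expanding it to 4D.
            return attention_mask
        # sdpa/eager still need HF to build the causal mask
        return hf_create_causal_mask(*args, attention_mask=attention_mask, **kwargs)
    return patched
```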
@HuiyingLi (Contributor, Author)

/ok to test 8597ffa

@HuiyingLi (Contributor, Author)

/claude review

Comment on lines +36 to +40
```python
from nemo_automodel.components.datasets.vlm.datasets import (
    _convert_sharegpt_to_conversation,
    _load_json_or_jsonl,
    _resolve_lmdb_image,
)
```

Bug: _resolve_lmdb_image is defined in nemo_automodel.components.datasets.vlm.utils, not in datasets. This import will raise ImportError at runtime.

Suggested change:

```python
from nemo_automodel.components.datasets.vlm.datasets import (
    _convert_sharegpt_to_conversation,
    _load_json_or_jsonl,
)
from nemo_automodel.components.datasets.vlm.utils import _resolve_lmdb_image
```

@@ -364,6 +625,19 @@ def kimi_vl_collate_fn(
for key, value in list(batch.items()):

Bug: fake_indices is computed from the original examples list, but conversations may have been shortened by _drop_overlong_samples above. After dropping, the batch tensors have fewer rows than len(examples), so these indices can be out-of-bounds or point to the wrong samples.

Same issue exists in default_collate_fn (around line 1115) and qwen2_5_collate_fn (around line 467).

You need to either:

  1. Track which examples survived filtering and remap indices, or
  2. Move fake-image injection to happen before _drop_overlong_samples so it's already baked into the conversation, and track the _injected_fake flag on the conversation (not on examples).
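
A sketch of option 1 with assumed names; the follow-up fix commit takes this route, with _drop_overlong_samples returning the kept positions:

```python
def drop_overlong_samples(conversations, lengths, max_length):
    """Return surviving conversations plus their original positions."""
    kept = [i for i, n in enumerate(lengths) if n <= max_length]
    return [conversations[i] for i in kept], kept

# Caller remaps fake-image indices in lockstep with the filtering:
# conversations, kept = drop_overlong_samples(conversations, lengths, max_len)
# old_to_new = {old: new for new, old in enumerate(kept)}
# fake_indices = [old_to_new[i] for i in fake_indices if i in old_to_new]
```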

Comment on lines +573 to +574
```python
if has_stats:
    total_images = sum(t.get("n_images", 0) for t in timings.values())
```

These two sum(...) expressions compute values but discard them (not assigned to any variable). If they're needed for the summary, assign them; otherwise remove them.

Suggested change:

```python
if has_stats:
    total_images = sum(t.get("n_images", 0) for t in timings.values())
    total_text_tokens = sum(t.get("n_text_tokens", 0) for t in timings.values())
    total_media_tokens = sum(t.get("n_media_tokens", 0) for t in timings.values())
```

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

HuiyingLi commented Mar 30, 2026

/claude review


Comment on lines +374 to +376
```python
if processor_type not in COLLATE_FNS:
    processor_type = "default"
    logging.warning(f"You are using {processor_type} with default collate function.")
```

Bug: processor_type is reassigned to "default" on line 375 before the warning on line 376, so the warning always logs "You are using default with default collate function" instead of the actual processor type name.

Suggested change:

```python
if processor_type not in COLLATE_FNS:
    logging.warning(f"You are using {processor_type} with default collate function.")
    processor_type = "default"
```

@claude (Bot) left a comment

Light review complete. One bug found (inline comment posted): the warning message in finetune.py always logs "default" instead of the actual processor type because the variable is reassigned before the log call.

The rest of the PR — packing engine, collate functions, dataset loading, retry logic, and sharding — looks correct. The autoregressive label shift is safe across all collate functions (shapes diverge before the input-trimming loop). Tests cover the key paths well.

Nice work on the knapsack packing and robust dataset wrapper.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

/ok to test 358b5f7

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

/ok to test d1e450a

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

/ok to test eb01df7

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi (Contributor, Author)

/ok to test 65d42d6

@HuiyingLi (Contributor, Author)

/ok to test 849ba79
