
feat: Omni dataloader for HF models #2016

Merged
yuki-97 merged 3 commits into main from yuanhangs_dev
Feb 26, 2026

Conversation

Contributor

@yuanhangsu1986 yuanhangsu1986 commented Feb 23, 2026

What does this PR do ?

Add video and audio dataloading support for HF models

Usage

export NRL_FORCE_REBUILD_VENVS=true
uv venv
uv run python ./examples/run_vlm_sft.py cluster.gpus_per_node=8 --config ./examples/configs/sft_avlm.yaml

Additional Information

Major changes:

  • nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py: base class GeneralConversationsJsonlDataset for loading conversation-style data in the structure described in the module comments.
  • nemo_rl/data/datasets/response_datasets/daily_omni.py: base class and functions for the Daily Omni dataset (a public video benchmark dataset).
  • nemo_rl/data/datasets/processed_dataset.py: adds preprocessor logic to AllTaskProcessedDataset, which converts raw dataset records into OpenAI-compatible structures. Putting the logic here avoids repeating the preprocessing in every downstream processor.
  • nemo_rl/data/datasets/raw_dataset.py: adds a preprocessor attribute to RawDataset so each dataset can optionally define its own preprocessor. If left undefined, the default is no preprocessing.
  • nemo_rl/data/multimodal_utils.py: functions for loading omni data
  • nemo_rl/data/llm_message_utils.py: loads omni data for HF-based LLM models
  • examples/configs/sft_avlm.yaml: example for loading a video conversation dataset for Qwen3-VL with 16 frames per video.
  • nemo_rl/algorithms/utils.py: reads audio and video configs from the tokenizer config in the YAML file.
  • tests/unit/data/datasets/test_response_dataset.py: unit test for daily omni dataset
  • tests/unit/data/datasets/test_general_conversations_dataset.py: unit test for GeneralConversationsJsonlDataset
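The per-task preprocessor flow described above can be sketched roughly as follows. This is a minimal illustration only: the exact TaskDataPreProcessFnCallable protocol, the daily_omni_preprocessor helper, and the field names shown are assumptions, not the PR's actual code.

```python
# Hypothetical sketch of per-task preprocessing; names and signatures are
# illustrative and may differ from the PR's implementation.
from typing import Any, Callable

# A preprocessor maps a raw datum to an OpenAI-compatible conversation dict.
TaskDataPreProcessFnCallable = Callable[[dict[str, Any]], dict[str, Any]]


def daily_omni_preprocessor(raw_datum: dict[str, Any]) -> dict[str, Any]:
    """Convert a raw Daily Omni record into an OpenAI-style conversation."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video", "video": raw_datum["video_path"]},
                    {"type": "text", "text": raw_datum["question"]},
                ],
            },
            {"role": "assistant", "content": raw_datum["answer"]},
        ]
    }


# Datasets that define no preprocessor are passed through unchanged.
task_data_preprocessors: dict[str, TaskDataPreProcessFnCallable] = {
    "daily_omni": daily_omni_preprocessor,
}


def apply_preprocessor(task_name: str, raw_datum: dict[str, Any]) -> dict[str, Any]:
    """Apply the task's preprocessor if one is registered, else pass through."""
    preprocessor = task_data_preprocessors.get(task_name)
    return preprocessor(raw_datum) if preprocessor else raw_datum
```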

This is a resubmission of the following PR:
#1639

Dependencies

  • Added decord library for video processing support.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for Daily-Omni and General Conversations JSONL datasets
    • Enhanced multimodal data processing for audio, video, and image content
    • New supervised fine-tuning configuration example for audio-visual models
    • Introduced per-task data preprocessing pipeline support
  • Tests

    • Added comprehensive unit tests for multimodal dataset handling
  • Chores

    • Added decord library dependency
    • Updated Docker build configuration

@yuanhangsu1986 yuanhangsu1986 requested review from a team as code owners February 23, 2026 23:18
@github-actions github-actions Bot added the CI label Feb 23, 2026
@yuanhangsu1986 yuanhangsu1986 added the CI:L1 and community-request labels Feb 23, 2026
@yuanhangsu1986 yuanhangsu1986 added the needs-follow-up label Feb 23, 2026

coderabbitai Bot commented Feb 23, 2026

📝 Walkthrough

This PR introduces per-task preprocessing support and multimodal dataset infrastructure to NeMo RL. It adds new dataset classes (DailyOmniDataset, GeneralConversationsJsonlDataset), multimodal utilities for media loading, a TaskDataPreProcessFnCallable interface, and propagates preprocessor mappings through the training and validation pipelines. Includes Docker build optimization and comprehensive test coverage.

Changes

Cohort / File(s) Summary
Multimodal Preprocessing Infrastructure
nemo_rl/data/interfaces.py, nemo_rl/data/multimodal_utils.py
Introduces TaskDataPreProcessFnCallable protocol and comprehensive multimodal utilities including media tagging constants, functions to extract/load media from messages (images, audio, video), and default settings extraction from processors.
Dataset Base Classes & Dataset Updates
nemo_rl/data/datasets/raw_dataset.py, nemo_rl/data/datasets/processed_dataset.py, nemo_rl/data/datasets/utils.py
Adds preprocessor attribute to RawDataset, extends AllTaskProcessedDataset to accept and apply per-task preprocessors, and updates dataset setup utilities to track and propagate preprocessor mappings during training/validation.
New Dataset Implementations
nemo_rl/data/datasets/response_datasets/daily_omni.py, nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py, nemo_rl/data/datasets/response_datasets/__init__.py
Adds DailyOmniDataset for video QA tasks and GeneralConversationsJsonlDataset for multimodal conversations with message processing, metadata conversion, and media extraction utilities.
Message and Data Processing
nemo_rl/data/llm_message_utils.py
Replaces image-specific processing with generalized media handling using multimodal utils; removes get_images_from_message and updates to load/process any media type dynamically.
Training Pipeline Integration
examples/run_sft.py, nemo_rl/algorithms/utils.py
Extends SFT training to pass task_data_preprocessors to datasets; adds runtime overrides for audio/video processor configurations in tokenizer initialization.
Configuration & Dependencies
examples/configs/sft_avlm.yaml, nemo_rl/models/policy/__init__.py, pyproject.toml, .github/workflows/cicd-main.yml
Adds audio/video fields to TokenizerConfig, adds decord dependency, introduces AVLM config example, and optimizes Docker build with SKIP_SGLANG_BUILD flag.
Test Coverage
tests/unit/data/datasets/test_general_conversations_dataset.py, tests/unit/data/datasets/test_response_dataset.py
Adds comprehensive unit tests for general-conversation-jsonl preprocessing and DailyOmniDataset dataset loading with multimodal validation.

Sequence Diagram

sequenceDiagram
    participant DataLoader as Data Loader
    participant RawDataset as RawDataset<br/>(+ preprocessor)
    participant Preprocessor as Task Preprocessor
    participant AllTaskProcessedDataset as AllTaskProcessedDataset<br/>(+ task_data_preprocessors)
    participant Processor as Task Processor
    participant MessageUtils as Message Utils<br/>(+ Media Loading)

    DataLoader->>RawDataset: Load dataset<br/>(e.g., DailyOmni)
    RawDataset-->>DataLoader: Return data + preprocessor
    DataLoader->>AllTaskProcessedDataset: Pass task_data_preprocessors<br/>mapping
    
    loop For each data sample
        AllTaskProcessedDataset->>Preprocessor: __call__(raw_datum)
        Preprocessor-->>AllTaskProcessedDataset: preprocessed_datum
        AllTaskProcessedDataset->>Processor: __call__(preprocessed_datum)
        Processor->>MessageUtils: load_media_from_message()
        MessageUtils-->>Processor: extracted media dict<br/>(images, audio, video)
        Processor-->>AllTaskProcessedDataset: formatted output
        AllTaskProcessedDataset-->>DataLoader: processed item
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #977: Introduces the AllTaskProcessedDataset refactor and base dataset infrastructure that this PR extends with per-task preprocessing hooks and new dataset implementations.
  • PR #1807: Modifies setup_response_data in nemo_rl/data/utils.py; this PR further extends it to propagate per-task preprocessor mappings through the training pipeline.
  • PR #1649: Involves RawDataset and response-dataset pipeline modifications; this PR adds complementary dataset implementations and preprocessing API enhancements to the same infrastructure.

Suggested labels

ci/cd, data, multimodal, datasets, preprocessing

Suggested reviewers

  • yuki-97
  • terrykong
  • odelalleau
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name | Status | Explanation / Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.33%, below the required 80.00% threshold. Write docstrings for the functions missing them.
Test Results For Major Changes | ⚠️ Warning | The PR includes major changes (new dataset classes, multimodal utilities, preprocessor infrastructure) but the description lacks test results. Update the description with test execution status (e.g., "all unit tests pass"), the specific scenarios run, and any performance validation.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title "feat: Omni dataloader for HF models" accurately captures the main change: adding video/audio dataloading support for Hugging Face models via the Omni dataset infrastructure.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
nemo_rl/models/policy/__init__.py (1)

201-208: ⚠️ Potential issue | 🟡 Minor

Document new audio/video config keys in TokenizerConfig.

Please add a Google-style class docstring (or expand an existing one) to document purpose, valid values/types, and recommended defaults for audio and video, and ensure exemplar YAMLs reflect those defaults.

As per coding guidelines "Use Google style docstrings for classes and functions" and "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml".
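A hedged sketch of what such a docstring could look like. The real TokenizerConfig has more fields than shown, and the subkeys named here (sampling_rate, fps, num_frames) are illustrative assumptions rather than confirmed defaults.

```python
# Illustrative only: a TypedDict with a Google-style docstring documenting the
# new audio/video keys. Field set and subkeys are assumptions.
from typing import Any, Optional, TypedDict


class TokenizerConfig(TypedDict, total=False):
    """Tokenizer/processor configuration.

    Attributes:
        audio: Optional audio-processor overrides (e.g. ``sampling_rate``).
            ``None`` or omitted leaves the processor defaults untouched.
        video: Optional video-processor overrides (e.g. ``fps`` or
            ``num_frames``; set at most one of the two). ``None`` or omitted
            leaves the processor defaults untouched.
    """

    audio: Optional[dict[str, Any]]
    video: Optional[dict[str, Any]]
```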

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/policy/__init__.py` around lines 201 - 208, Add a Google-style
class docstring to the TokenizerConfig TypedDict describing the purpose of the
TypedDict and documenting the new audio and video keys: state that audio and
video are optional multimodal config dicts, list valid value types (dict[str,
Any] or None), describe recommended defaults (e.g., audio: {} or None to
disable, video: {} or None to disable, and any recommended subkeys like
sample_rate, channels for audio or frame_rate, resolution for video), and
briefly mention chat_template_kwargs usage; then update the exemplar YAML files
under examples/configs/*.yaml to include the audio and video keys with the
recommended default values so the examples reflect the docstring defaults.
nemo_rl/data/utils.py (1)

221-248: ⚠️ Potential issue | 🟠 Major

Reset val_task_data_preprocessors per validation dataset.

In the val_data_paths loop, val_task_data_preprocessors is reused across iterations. If a later dataset has no preprocessor, the previous one leaks into its AllTaskProcessedDataset. Reinitialize per iteration.

Suggested fix
-    val_task_data_preprocessors = {}
     if "val_data_paths" in data_config and data_config["val_data_paths"]:
         ...
         for val_dataset_name, val_dataset_path in val_data_paths.items():
             ...
+            val_task_data_preprocessors = {}
             if hasattr(val_data, "preprocessor") and val_data.preprocessor is not None:
-                val_task_data_preprocessors = {
-                    val_data.task_name: val_data.preprocessor
-                }
+                val_task_data_preprocessors[val_data.task_name] = val_data.preprocessor
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/utils.py` around lines 221 - 248, The bug is that
val_task_data_preprocessors is declared once outside the val_data_paths loop and
may leak preprocessors across datasets; fix it by moving/reinitializing
val_task_data_preprocessors inside the for val_dataset_name, val_dataset_path in
val_data_paths.items() loop (set to {} at start of each iteration), then
populate it only when hasattr(val_data, "preprocessor") and
val_data.preprocessor is not None before passing it to AllTaskProcessedDataset;
ensure val_task_data_processors is also built per-iteration using
load_preference_dataset and val_data.task_name so each AllTaskProcessedDataset
gets only its own processors and preprocessors.
pyproject.toml (1)

18-53: ⚠️ Potential issue | 🔴 Critical

Pin decord to a specific version, with awareness of Python 3.12 compatibility.

The latest stable decord release is 0.6.0 (June 2021), which lacks pre-built wheels for Python 3.12. Installing decord on Python 3.12 will require building from source, which may fail depending on your FFmpeg setup and build toolchain. Pin decord==0.6.0 explicitly, and document any required build steps for Python 3.12, or verify that source builds work reliably in your environment before merging.
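For illustration, the pinned entry might look like this in pyproject.toml (surrounding dependencies elided; a sketch, not the file's actual contents):

```toml
dependencies = [
    # Pinned: 0.6.0 is the latest stable decord release. No prebuilt wheels
    # exist for Python 3.12, so source builds need FFmpeg and a build toolchain.
    "decord==0.6.0",
]
```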

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 18 - 53, Pin the decord dependency to a specific
version by changing the "decord" entry to "decord==0.6.0" and add a short
comment noting Python 3.12 has no prebuilt wheels (so building from source may
require FFmpeg and a proper toolchain), or alternatively verify/source-build in
CI and document required build steps for Python 3.12 to avoid breakage; update
the dependency line for "decord" and include the explanatory comment nearby.
nemo_rl/data/multimodal_utils.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Copyright year is 2025; current year is 2026.

The file has substantial new code. Consider updating the copyright header to 2026.

As per coding guidelines, "Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` at line 1, Update the copyright header at
the top of the file to use the current year 2026: replace the 2025 year in the
existing NVIDIA copyright header comment (the file-level header at the very top
of nemo_rl/data/multimodal_utils.py) so the header reads 2026 instead of 2025,
ensuring it matches the project's required header format for Python files.
🧹 Nitpick comments (12)
nemo_rl/data/datasets/raw_dataset.py (1)

26-35: Initialize preprocessor in the constructor to satisfy class-member init guideline.

RawDataset exposes preprocessor publicly; per guidelines, it should be initialized in __init__ alongside other members. Consider adding an explicit constructor (or a dataclass) to set these fields.

As per coding guidelines "Initialize all externally visible members of a class in the constructor".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/raw_dataset.py` around lines 26 - 35, Add an explicit
constructor for class RawDataset that initializes all externally visible members
(data_config, dataset, val_dataset, processor, task_spec) and sets preprocessor
to None (or a provided value) so preprocessor is not left uninitialized;
implement __init__ on RawDataset to accept and assign these fields (or convert
the class to a dataclass with defaults) to satisfy the "initialize members in
constructor" guideline.
nemo_rl/algorithms/utils.py (2)

323-355: Add stacklevel=2 to warnings.warn calls so warnings point to the caller.

All three warnings.warn calls (lines 330, 341, 352) default to stacklevel=1, which causes the warning to reference this utility function rather than the calling code that configured the override.

Proposed fix (example for line 330; apply similarly to lines 341 and 352)
                 warnings.warn(
-                    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}"
+                    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}",
+                    stacklevel=2,
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/utils.py` around lines 323 - 355, The three warnings.warn
calls in the function that adjusts processor feature/video settings (the calls
that override processor.feature_extractor.sampling_rate,
processor.video_processor.fps, and processor.video_processor.num_frames) should
include stacklevel=2 so the warning points at the caller; update each
warnings.warn invocation (the ones emitting "Overriding audio sampling rate...",
"Overriding video fps...", and "Overriding video num_frames...") to pass
stacklevel=2 while keeping the existing message and variable usage
(tokenizer_config checks and assignments to
processor.feature_extractor.sampling_rate, processor.video_processor.fps,
processor.video_processor.num_frames).

345-355: Consider validating mutual exclusivity of fps and num_frames upfront.

The comment on line 345 acknowledges the conflict but defers it. An explicit check with a clear error message here would be more user-friendly than a cryptic failure later in the video processor.
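A minimal sketch of the suggested upfront check. validate_video_config is a hypothetical helper name, and the flat dict layout is an assumption; the actual tokenizer_config structure in the PR may differ.

```python
# Hypothetical validation helper: reject contradictory frame-sampling settings
# before mutating the video processor, instead of failing cryptically later.
from typing import Any


def validate_video_config(video_cfg: dict[str, Any]) -> None:
    """Raise a clear error if both 'fps' and 'num_frames' are configured."""
    if "fps" in video_cfg and "num_frames" in video_cfg:
        raise ValueError(
            "Cannot set both 'fps' and 'num_frames' in tokenizer_config['video']; "
            "choose one frame-sampling strategy."
        )
```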

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/utils.py` around lines 345 - 355, Check for mutual
exclusivity of fps and num_frames in tokenizer_config["video"] before mutating
processor.video_processor.num_frames: if both "fps" and "num_frames" are
present, validate they are not contradictory and raise a clear ValueError (or
choose one deterministically) instead of letting the video processor fail later.
Specifically, in the block handling tokenizer_config["video"] (referencing
tokenizer_config, processor.video_processor.num_frames, "fps", and
"num_frames"), add an upfront check that raises a descriptive error like "Cannot
set both fps and num_frames in tokenizer_config['video']" (or compare values and
only warn/override when consistent) before assigning
processor.video_processor.num_frames.
nemo_rl/data/datasets/processed_dataset.py (1)

33-58: Document task_data_preprocessors in the class docstring.

The Args section (lines 36–44) documents task_data_processors and max_seq_length but omits the new task_data_preprocessors parameter. Adding a brief entry keeps the docstring consistent with the constructor signature.

📝 Proposed docstring addition
         task_data_processors: Either a single TaskDataProcessFnCallable for single-task,
             or a dict mapping task names to (TaskDataSpec, TaskDataProcessFnCallable) for multi-task
+        task_data_preprocessors: Optional preprocessing hook applied before task-specific processing.
+            Either a single TaskDataPreProcessFnCallable for all tasks,
+            or a dict mapping task names to TaskDataPreProcessFnCallable.
         max_seq_length: Maximum sequence length for tokenized outputs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/processed_dataset.py` around lines 33 - 58, The class
docstring for AllTaskProcessedDataset is missing documentation for the
constructor parameter task_data_preprocessors; update the Args section to add a
short entry describing task_data_preprocessors (type: Optional[Union[dict[str,
TaskDataPreProcessFnCallable], TaskDataPreProcessFnCallable]], default: None),
explaining it can be a single preprocessor applied to all examples or a dict
mapping task names to task-specific preprocessors and that missing tasks fall
back to default behavior; reference the parameter name task_data_preprocessors
and keep the wording consistent with the existing entries for
task_data_processors and default_task_data_spec.
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (2)

38-67: convert_metadata returns None when return_inplace=True — confusing API.

When return_inplace=True, the function mutates metadata in place and returns None implicitly. When False, it returns a new dict. The parameter name is also inverted from the typical inplace convention. Consider renaming to inplace (default False) or always returning the result to avoid caller confusion.
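One possible shape for the renamed API, sketched with the tag-mapping body stubbed out (the real convert_metadata delegates to multimodal_utils, which is omitted here):

```python
# Sketch of the suggested API: conventional `inplace` naming, and the processed
# dict is always returned so callers never receive None.
from typing import Any


def convert_metadata(metadata: dict[str, Any], inplace: bool = False) -> dict[str, Any]:
    """Normalize metadata, always returning the processed dict.

    When ``inplace`` is True the input dict is mutated and returned;
    otherwise a shallow copy is processed and returned.
    """
    data = metadata if inplace else dict(metadata)
    # ... tag-mapping logic (via multimodal_utils) would go here ...
    return data
```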

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 38 - 67, The function convert_metadata currently uses
return_inplace (inverted naming) and returns None when return_inplace=True,
causing a confusing API; rename the parameter to inplace: bool = False (or keep
name but invert semantics) and ensure convert_metadata always returns the
processed dict (variable data) even when mutating the input, so callers get the
result; update logic that chooses data = metadata or metadata.copy(), keep the
mapping loops that reference multimodal_utils.MEDIA_TAGS_TO_ALLOWED and
multimodal_utils.MEDIA_TAGS unchanged, and update any callers to the new
parameter name if renamed.

26-26: Unused _DEBUG variable — remove before merging.

_DEBUG = True is never referenced in this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py` at
line 26, The file defines an unused module-level variable named _DEBUG which is
never referenced; remove the _DEBUG = True declaration from
general_conversations_dataset (delete the unused symbol) to eliminate dead code
and avoid misleading debug flags.
nemo_rl/data/llm_message_utils.py (1)

607-613: Consider making the media-key-to-kwarg mapping explicit and extensible.

The hardcoded if/elif chain for mapping "image"→"images", "audio"→"audio", "video"→"videos" will need updating when new modalities are added. A small mapping dict would be more maintainable.

♻️ Example
-            media_kwargs = {}
-            if "image" in media_cur_message:
-                media_kwargs["images"] = media_cur_message["image"]
-            if "audio" in media_cur_message:
-                media_kwargs["audio"] = media_cur_message["audio"]
-            if "video" in media_cur_message:
-                media_kwargs["videos"] = media_cur_message["video"]
+            MEDIA_KEY_TO_KWARG = {"image": "images", "audio": "audio", "video": "videos"}
+            media_kwargs = {
+                MEDIA_KEY_TO_KWARG[k]: v
+                for k, v in media_cur_message.items()
+                if k in MEDIA_KEY_TO_KWARG
+            }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/llm_message_utils.py` around lines 607 - 613, Replace the
hardcoded if-chain that builds media_kwargs from media_cur_message with an
explicit mapping dict (e.g., MEDIA_KEY_TO_KWARG) and iterate over its items to
populate media_kwargs; locate the code block that references media_cur_message
and media_kwargs in nemo_rl.data.llm_message_utils (the snippet that currently
checks "image", "audio", "video") and change it to consult the mapping so new
modalities can be added by updating the dict rather than editing conditional
logic.
examples/run_sft.py (1)

104-153: Validation preprocessor wiring is correct.

Both paths (split-from-train and explicit validation config) properly collect and propagate preprocessors.

Minor note: since RawDataset declares preprocessor as a class attribute (line 33 of raw_dataset.py), the hasattr checks on lines 89 and 142 are always True for RawDataset subclasses. You could simplify to just data.preprocessor is not None, but the current form is safely defensive.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/run_sft.py` around lines 104 - 153, The hasattr(...) checks for
"preprocessor" are unnecessary because RawDataset declares preprocessor as a
class attribute; replace the two occurrences where you do hasattr(data,
"preprocessor") and hasattr(val_data, "preprocessor") with direct checks that
the attribute is not None (e.g., data.preprocessor is not None and
val_data.preprocessor is not None) so the code uses the actual None test on the
preprocessor before wiring it into val_task_data_preprocessors (references:
RawDataset, preprocessor, variables data and val_data, and
val_task_data_preprocessors).
nemo_rl/data/datasets/response_datasets/daily_omni.py (1)

73-78: Add extraction filter to prevent path-traversal risks (Ruff S202).

tarfile.extractall() without a filter can be exploited for path traversal attacks. Python 3.12+ supports filter='data' to block absolute paths and parent-directory traversals. Even though the tar is downloaded from a trusted HuggingFace repository, add the filter for defense in depth.

Apply filter parameter
                 with tarfile.open(archive_filename, "r:*") as tar:
-                    # Extract all contents to the specified path
-                    tar.extractall(path=self.hf_cache_dir)
+                    tar.extractall(path=self.hf_cache_dir, filter="data")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 73 - 78,
The tar extraction in the try block uses
tarfile.extractall(path=self.hf_cache_dir) which is vulnerable to
path-traversal; update the call to pass the filter parameter to block absolute
and parent-directory paths (use filter='data' on Python 3.12+ or equivalent safe
extraction logic) so extraction of archive_filename into self.hf_cache_dir is
validated before writing (affects the block referencing archive_filename,
self.hf_cache_dir and files_folder).
nemo_rl/data/multimodal_utils.py (3)

321-325: Missing docstring for load_media_from_message.

This function is part of the public API consumed by llm_message_utils.py. A Google-style docstring explaining the parameters, return value, and the fallback behavior would help maintainability.

As per coding guidelines, "For interfaces that may be used outside a file, prefer docstrings over comments" and "Use Google style docstrings for classes and functions, which can be parsed by Sphinx."
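A possible Google-style docstring, sketched with assumed parameter names and a stub body; the real signature in multimodal_utils.py may differ.

```python
# Illustrative docstring only; the stub body stands in for the real loading
# logic, and the parameter names mirror the review comment above.
from typing import Any, Optional


def load_media_from_message(
    message: dict[str, Any],
    processor: Any = None,
    multimodal_load_kwargs: Optional[dict[str, dict[str, Any]]] = None,
) -> dict[str, list[Any]]:
    """Extract and load media referenced by a chat message.

    Args:
        message: An OpenAI-style message whose content list may contain
            ``image``, ``audio``, or ``video`` items.
        processor: Optional HF processor used to derive default load settings.
        multimodal_load_kwargs: Per-media-type loader kwargs; falls back to
            processor-derived defaults when omitted.

    Returns:
        Mapping from media type to the list of loaded media objects; media
        types absent from the message map to empty lists.
    """
    return {"image": [], "audio": [], "video": []}  # stub body for illustration
```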

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 321 - 325, Add a Google-style
docstring to the public function load_media_from_message describing parameters,
return value, and fallback behavior: document message (expected keys/structure),
processor (type and when used), multimodal_load_kwargs (shape: mapping of media
type to kwargs and defaults), and the returned dict[str, list[Any]] format;
include behavior when media is missing or when processor is None, examples of
supported media types, and any exceptions raised or swallowed. Place the
docstring immediately under the def load_media_from_message(...) signature so
external modules like llm_message_utils.py can rely on the documented API.

219-261: Missing docstring for a public function used outside this file; consider replacing function-attribute caching.

get_multimodal_default_settings_from_processor is referenced by llm_message_utils.py per the summary, so it should have a docstring per coding guidelines.

Also, the function-attribute caching pattern (lines 237–242, 249–254) is unconventional and not thread-safe. A module-level functools.lru_cache or a simple module-level variable would be more idiomatic. Additionally, the list comprehensions [param for param in ...] are redundant — list(...) suffices.

♻️ Simplify signature caching
+import functools
+
+
+@functools.lru_cache(maxsize=1)
+def _load_video_param_names() -> list[str]:
+    return list(inspect.signature(load_video).parameters)
+
+
+@functools.lru_cache(maxsize=1)
+def _load_audio_param_names() -> list[str]:
+    return list(inspect.signature(load_audio).parameters)
+
+
 def get_multimodal_default_settings_from_processor(
     processor,
 ) -> dict[str, dict[str, Any]]:
+    """Extract default video/audio loading kwargs from a processor's sub-components."""
     ...
-        if not hasattr(
-            get_multimodal_default_settings_from_processor, "load_video_kwargs"
-        ):
-            get_multimodal_default_settings_from_processor.load_video_kwargs = [
-                param for param in inspect.signature(load_video).parameters
-            ]
         default_settings["video"] = {
             arg: video_settings_dict[arg]
-            for arg in get_multimodal_default_settings_from_processor.load_video_kwargs
+            for arg in _load_video_param_names()
             if arg in video_settings_dict
         }

(Apply analogous change for load_audio_kwargs.)

As per coding guidelines, "For interfaces that may be used outside a file, prefer docstrings over comments."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 219 - 261, The public function
get_multimodal_default_settings_from_processor lacks a docstring and uses unsafe
function-attribute caching and redundant list comprehensions; add a concise
docstring describing purpose, args, and return value, and replace the
function-attribute caches
get_multimodal_default_settings_from_processor.load_video_kwargs and
.load_audio_kwargs with a module-level cache (either simple module-level
variables or a `@functools.lru_cache-decorated` helper) that computes
list(inspect.signature(load_video).parameters) and
list(inspect.signature(load_audio).parameters) once, and update the list
comprehensions to use list(...) instead of [param for param in ...]; ensure the
rest of the function (video_settings_dict/feature_extractor usage and
default_settings keys) remains unchanged.

315-317: Conditional expression used solely for side effects hurts readability.

The ternary media[tag].extend(...) if isinstance(...) else media[tag].append(...) is used as a statement for its side effects only — the return value is discarded. This is a known anti-pattern in Python. A regular if/else block is clearer here.

♻️ Replace ternary statement with explicit if/else
         tag = item["type"]
         if tag in MEDIA_TAGS:
-            media[tag].extend(list(item[tag])) if isinstance(
-                item[tag], (list, tuple)
-            ) else media[tag].append(item[tag])
+            value = item[tag]
+            if isinstance(value, (list, tuple)):
+                media[tag].extend(value)
+            else:
+                media[tag].append(value)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 315 - 317, Replace the ternary
used for side effects with a clear if/else: locate the statement that currently
does "media[tag].extend(list(item[tag])) if isinstance(item[tag], (list, tuple))
else media[tag].append(item[tag])" and change it to an explicit if
isinstance(item[tag], (list, tuple)): media[tag].extend(list(item[tag])) else:
media[tag].append(item[tag]). This preserves the same behavior for
media[tag].extend/append and improves readability; keep the same variable names
(media, tag, item) and surrounding logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_rl/data/datasets/response_datasets/daily_omni.py`:
- Line 1: The file daily_omni.py has an incorrect copyright header starting with
"##" instead of the repo-standard single "#" header; open
nemo_rl/data/datasets/response_datasets/daily_omni.py and replace the top header
so it matches the other Python files (use "# Copyright (c) 2025, NVIDIA
CORPORATION.  All rights reserved." with a single leading '#' and same
spacing/phrasing), ensuring the header sits at the very top of the file.
- Around line 27-32: The class docstring for DailyOmniDataset incorrectly
mentions "CLEVR-CoGenT" due to a copy-paste error; update the docstring in class
DailyOmniDataset to describe the Daily Omni dataset (e.g., replace "Simple
wrapper around the CLEVR-CoGenT dataset." with a concise description referencing
the Daily Omni dataset) and keep the Args section (split) unchanged.
- Around line 85-90: Fix the typo and enable exception chaining in the tar
handling block: change the misspelled tarfile.ReadErro to tarfile.ReadError in
the except block and re-raise the ReadError with the new message using exception
chaining (raise tarfile.ReadError("...") from e). Likewise, in the generic
except Exception as e block, re-raise the new Exception with the formatted
message using "from e" so the original traceback is preserved; locate the
tarfile handling where tarfile.ReadError and the variable e are used in
daily_omni.py.
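A minimal sketch of this fix (the function name `extract_archive` and its signature are illustrative, not the actual code in daily_omni.py): the `ReadError` spelling is corrected and both re-raises use `from e` so the original traceback is preserved.

```python
import tarfile


def extract_archive(archive_path: str, dest: str) -> None:
    # Hypothetical wrapper around the tar-handling block described above.
    try:
        with tarfile.open(archive_path) as tf:
            tf.extractall(path=dest)
    except tarfile.ReadError as e:
        # Correct spelling, and chain with "from e" to keep the traceback.
        raise tarfile.ReadError(
            f"Failed to read tar archive {archive_path!r}; the file may be corrupt."
        ) from e
    except Exception as e:
        raise Exception(
            f"Unexpected error while extracting {archive_path!r}: {e}"
        ) from e
```
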

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`:
- Around line 218-234: The return type annotation for process_message_fragment
is wrong: the function builds and returns a list (ret) of dicts, not a single
dict. Update the signature of process_message_fragment to return list[dict[str,
Any]] (or Sequence[Mapping[str, Any]] if you prefer an abstract type) and ensure
any callers expecting dict are adjusted; reference the function name
process_message_fragment and the local variable ret and the loop over
tag.split("-") to locate the code to change.
- Around line 148-153: Fix the typos in the class docstring for
GeneralConversationsJsonlDataset (subclass of RawDataset): change
"converstaions" to "conversations" and "requiement" to "requirement" so the
documentation reads correctly about jsonl datasets and media tag placement.
- Around line 107-118: The code is adding the file extension string to
tried_default_extensions and also assumes filenames always contain a '.',
causing an IndexError; update the logic in general_conversations_dataset.py
where ext is computed from metadata[tag][media_index[tag]] to safely extract
the extension (use os.path.splitext or check for '.' and handle missing
extension by setting ext to an empty string) and change the set insertion to
tried_default_extensions.add(tag) (not ext) so the guard `tag not in
tried_default_extensions` works as intended; keep the loop over
multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag] unchanged but ensure you only add
the media type tag to tried_default_extensions when you decide to skip trying
defaults.
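A sketch of the suggested logic (the helper name and argument names here are illustrative, mirroring the review comment rather than the actual code): `os.path.splitext` avoids the IndexError on extension-less filenames, and the media *tag*, not the extension string, is recorded in `tried_default_extensions`.

```python
import os


def resolve_candidate_extensions(filename, tag, tried_default_extensions, default_extensions):
    """Hypothetical helper: pick file extensions to try for a media item.

    Returns the filename's own extension when present; otherwise falls back
    to the per-tag defaults exactly once per media type.
    """
    # splitext never raises, unlike filename.rsplit(".", 1)[1].
    ext = os.path.splitext(filename)[1].lstrip(".")
    if ext:
        return [ext]
    if tag not in tried_default_extensions:
        tried_default_extensions.add(tag)  # add the tag, not ext
        return list(default_extensions.get(tag, []))
    return []
```
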

In `@nemo_rl/data/datasets/utils.py`:
- Around line 18-24: The function get_huggingface_cache_path should be updated
to use the public API and improve typing and error handling: replace the private
_scan_cached_repo import/use with huggingface_hub.scan_cache_dir (update the
import), add a Google-style docstring describing parameters/return and behavior,
add precise type hints for parameters and return (Python 3.12+ style), narrow
the broad except Exception to specific exceptions such as OSError and
ValueError, and guard against calling max() on an empty revs mapping by
returning an appropriate fallback or raising a clear ValueError; locate these
changes around the get_huggingface_cache_path function and any imports
referencing _scan_cached_repo.

In `@nemo_rl/data/multimodal_utils.py`:
- Around line 344-347: Replace the runtime assertion that checks for "audio" and
"sampling_rate" in multimodal_load_kwargs with explicit validation and a clear
exception: verify that "audio" is a key in multimodal_load_kwargs and that
"sampling_rate" is present in multimodal_load_kwargs["audio"], and if either
check fails raise a ValueError with a descriptive message (e.g., referencing
multimodal_load_kwargs, "audio", and "sampling_rate") instead of using assert so
the check cannot be stripped under python -O; update the location where this
change is made around the multimodal_load_kwargs handling in multimodal_utils.py
(the block containing the current assert).
- Around line 348-364: Narrow the broad except in the load_audio fallback: catch
only expected errors from load_audio (e.g., RuntimeError, FileNotFoundError,
ValueError) instead of bare Exception; either log the caught error (use the
module logger via logger.warning with the exception info) or drop the unused "as
e" binding; replace the print("audio loading failed...") with logger.warning
including the error details and context (e.g., which file/path and that we are
falling back to decord). Also ensure the call to
get_dim_to_pack_along(processor, "audio") is safe when processor is None by
adding an explicit guard or documenting that processor may be None (or
defaulting to 0) so the fallback code's slicing behavior remains clear.
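The assert-to-ValueError change could look like this (a standalone sketch; in the real code the checks are inline rather than in a helper named `validate_audio_load_kwargs`):

```python
def validate_audio_load_kwargs(multimodal_load_kwargs: dict) -> None:
    # Explicit checks instead of assert, so validation survives `python -O`.
    if "audio" not in multimodal_load_kwargs:
        raise ValueError("multimodal_load_kwargs must contain an 'audio' entry.")
    if "sampling_rate" not in multimodal_load_kwargs["audio"]:
        raise ValueError("multimodal_load_kwargs['audio'] must define 'sampling_rate'.")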
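The narrowed-except and logging pattern from the second comment, sketched as a hypothetical wrapper (the real code inlines this around its `load_audio` call; `decord_fallback` stands in for the decord path):

```python
import logging

logger = logging.getLogger(__name__)


def load_audio_with_fallback(path, load_audio, decord_fallback):
    # Catch only the errors load_audio is expected to raise; anything else
    # (e.g. a programming error) propagates instead of being swallowed.
    try:
        return load_audio(path)
    except (RuntimeError, FileNotFoundError, ValueError) as e:
        logger.warning(
            "audio loading failed for %s (%s); falling back to decord", path, e
        )
        return decord_fallback(path)
```
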

---

Outside diff comments:
In `@nemo_rl/data/multimodal_utils.py`:
- Line 1: Update the copyright header at the top of the file to use the current
year 2026: replace the 2025 year in the existing NVIDIA copyright header comment
(the file-level header at the very top of nemo_rl/data/multimodal_utils.py) so
the header reads 2026 instead of 2025, ensuring it matches the project's
required header format for Python files.

In `@nemo_rl/data/utils.py`:
- Around line 221-248: The bug is that val_task_data_preprocessors is declared
once outside the val_data_paths loop and may leak preprocessors across datasets;
fix it by moving/reinitializing val_task_data_preprocessors inside the for
val_dataset_name, val_dataset_path in val_data_paths.items() loop (set to {} at
start of each iteration), then populate it only when hasattr(val_data,
"preprocessor") and val_data.preprocessor is not None before passing it to
AllTaskProcessedDataset; ensure val_task_data_processors is also built
per-iteration using load_preference_dataset and val_data.task_name so each
AllTaskProcessedDataset gets only its own processors and preprocessors.
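The per-iteration reinitialization can be sketched like this (function and variable names beyond `val_task_data_preprocessors` are hypothetical; `load_dataset` stands in for the actual loader):

```python
def build_val_datasets(val_data_paths, load_dataset):
    # Reinitialize the preprocessor dict inside the loop so preprocessors
    # never leak from one validation dataset into the next.
    processed = {}
    for name, path in val_data_paths.items():
        val_task_data_preprocessors = {}  # fresh per iteration
        val_data = load_dataset(path)
        if getattr(val_data, "preprocessor", None) is not None:
            val_task_data_preprocessors[val_data.task_name] = val_data.preprocessor
        processed[name] = (val_data, val_task_data_preprocessors)
    return processed
```
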

In `@nemo_rl/models/policy/__init__.py`:
- Around line 201-208: Add a Google-style class docstring to the TokenizerConfig
TypedDict describing the purpose of the TypedDict and documenting the new audio
and video keys: state that audio and video are optional multimodal config dicts,
list valid value types (dict[str, Any] or None), describe recommended defaults
(e.g., audio: {} or None to disable, video: {} or None to disable, and any
recommended subkeys like sample_rate, channels for audio or frame_rate,
resolution for video), and briefly mention chat_template_kwargs usage; then
update the exemplar YAML files under examples/configs/*.yaml to include the
audio and video keys with the recommended default values so the examples reflect
the docstring defaults.
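A sketch of what the documented TypedDict could look like (the field set and example subkeys here are illustrative; the real `TokenizerConfig` in `nemo_rl/models/policy/__init__.py` may carry different keys):

```python
from typing import Any, Optional, TypedDict


class TokenizerConfig(TypedDict, total=False):
    """Tokenizer/processor settings, including optional multimodal overrides.

    Attributes:
        name_or_path: Model or tokenizer identifier.
        chat_template_kwargs: Extra kwargs forwarded to the chat template.
        audio: Optional audio config, e.g. {"sampling_rate": 16000};
            None or {} disables audio overrides.
        video: Optional video config, e.g. {"fps": 2} or {"num_frames": 16};
            None or {} disables video overrides.
    """

    name_or_path: str
    chat_template_kwargs: Optional[dict[str, Any]]
    audio: Optional[dict[str, Any]]
    video: Optional[dict[str, Any]]
```
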

In `@pyproject.toml`:
- Around line 18-53: Pin the decord dependency to a specific version by changing
the "decord" entry to "decord==0.6.0" and add a short comment noting Python 3.12
has no prebuilt wheels (so building from source may require FFmpeg and a proper
toolchain), or alternatively verify/source-build in CI and document required
build steps for Python 3.12 to avoid breakage; update the dependency line for
"decord" and include the explanatory comment nearby.

---

Nitpick comments:
In `@examples/run_sft.py`:
- Around line 104-153: The hasattr(...) checks for "preprocessor" are
unnecessary because RawDataset declares preprocessor as a class attribute;
replace the two occurrences where you do hasattr(data, "preprocessor") and
hasattr(val_data, "preprocessor") with direct checks that the attribute is not
None (e.g., data.preprocessor is not None and val_data.preprocessor is not None)
so the code uses the actual None test on the preprocessor before wiring it into
val_task_data_preprocessors (references: RawDataset, preprocessor, variables
data and val_data, and val_task_data_preprocessors).

In `@nemo_rl/algorithms/utils.py`:
- Around line 323-355: The three warnings.warn calls in the function that
adjusts processor feature/video settings (the calls that override
processor.feature_extractor.sampling_rate, processor.video_processor.fps, and
processor.video_processor.num_frames) should include stacklevel=2 so the warning
points at the caller; update each warnings.warn invocation (the ones emitting
"Overriding audio sampling rate...", "Overriding video fps...", and "Overriding
video num_frames...") to pass stacklevel=2 while keeping the existing message
and variable usage (tokenizer_config checks and assignments to
processor.feature_extractor.sampling_rate, processor.video_processor.fps,
processor.video_processor.num_frames).
- Around line 345-355: Check for mutual exclusivity of fps and num_frames in
tokenizer_config["video"] before mutating processor.video_processor.num_frames:
if both "fps" and "num_frames" are present, validate they are not contradictory
and raise a clear ValueError (or choose one deterministically) instead of
letting the video processor fail later. Specifically, in the block handling
tokenizer_config["video"] (referencing tokenizer_config,
processor.video_processor.num_frames, "fps", and "num_frames"), add an upfront
check that raises a descriptive error like "Cannot set both fps and num_frames
in tokenizer_config['video']" (or compare values and only warn/override when
consistent) before assigning processor.video_processor.num_frames.

In `@nemo_rl/data/datasets/processed_dataset.py`:
- Around line 33-58: The class docstring for AllTaskProcessedDataset is missing
documentation for the constructor parameter task_data_preprocessors; update the
Args section to add a short entry describing task_data_preprocessors (type:
Optional[Union[dict[str, TaskDataPreProcessFnCallable],
TaskDataPreProcessFnCallable]], default: None), explaining it can be a single
preprocessor applied to all examples or a dict mapping task names to
task-specific preprocessors and that missing tasks fall back to default
behavior; reference the parameter name task_data_preprocessors and keep the
wording consistent with the existing entries for task_data_processors and
default_task_data_spec.

In `@nemo_rl/data/datasets/raw_dataset.py`:
- Around line 26-35: Add an explicit constructor for class RawDataset that
initializes all externally visible members (data_config, dataset, val_dataset,
processor, task_spec) and sets preprocessor to None (or a provided value) so
preprocessor is not left uninitialized; implement __init__ on RawDataset to
accept and assign these fields (or convert the class to a dataclass with
defaults) to satisfy the "initialize members in constructor" guideline.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py`:
- Around line 73-78: The tar extraction in the try block uses
tarfile.extractall(path=self.hf_cache_dir) which is vulnerable to
path-traversal; update the call to pass the filter parameter to block absolute
and parent-directory paths (use filter='data' on Python 3.12+ or equivalent safe
extraction logic) so extraction of archive_filename into self.hf_cache_dir is
validated before writing (affects the block referencing archive_filename,
self.hf_cache_dir and files_folder).
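A sketch of the safe extraction (the helper name `safe_extractall` is illustrative): use `filter="data"` where available, with a manual path check as a fallback on interpreters that predate the `filter` parameter.

```python
import os
import tarfile


def safe_extractall(archive_path: str, dest_dir: str) -> None:
    """Extract a tar archive while rejecting absolute/parent-directory members."""
    with tarfile.open(archive_path) as tf:
        try:
            # Python 3.12+ (also backported to recent 3.9-3.11 patch releases):
            # the 'data' filter blocks path traversal and special files.
            tf.extractall(path=dest_dir, filter="data")
        except TypeError:
            # Older interpreters: validate each member stays inside dest_dir.
            dest = os.path.realpath(dest_dir)
            for member in tf.getmembers():
                target = os.path.realpath(os.path.join(dest, member.name))
                if os.path.commonpath([dest, target]) != dest:
                    raise tarfile.ReadError(f"unsafe member path: {member.name}")
            tf.extractall(path=dest_dir)
```
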

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`:
- Around line 38-67: The function convert_metadata currently uses return_inplace
(inverted naming) and returns None when return_inplace=True, causing a confusing
API; rename the parameter to inplace: bool = False (or keep name but invert
semantics) and ensure convert_metadata always returns the processed dict
(variable data) even when mutating the input, so callers get the result; update
logic that chooses data = metadata or metadata.copy(), keep the mapping loops
that reference multimodal_utils.MEDIA_TAGS_TO_ALLOWED and
multimodal_utils.MEDIA_TAGS unchanged, and update any callers to the new
parameter name if renamed.
- Line 26: The file defines an unused module-level variable named _DEBUG which
is never referenced; remove the _DEBUG = True declaration from
general_conversations_dataset (delete the unused symbol) to eliminate dead code
and avoid misleading debug flags.
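The `convert_metadata` API change can be sketched as follows (the body here is a placeholder normalization, not the real mapping logic over MEDIA_TAGS): the parameter is renamed to `inplace`, and the processed dict is always returned, even when the input is mutated.

```python
def convert_metadata(metadata: dict, inplace: bool = False) -> dict:
    # Always return the processed dict so callers get the result either way.
    data = metadata if inplace else dict(metadata)
    for tag in list(data):
        if isinstance(data[tag], str):
            data[tag] = [data[tag]]  # placeholder for the real tag mapping
    return data
```
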

In `@nemo_rl/data/llm_message_utils.py`:
- Around line 607-613: Replace the hardcoded if-chain that builds media_kwargs
from media_cur_message with an explicit mapping dict (e.g., MEDIA_KEY_TO_KWARG)
and iterate over its items to populate media_kwargs; locate the code block that
references media_cur_message and media_kwargs in nemo_rl.data.llm_message_utils
(the snippet that currently checks "image", "audio", "video") and change it to
consult the mapping so new modalities can be added by updating the dict rather
than editing conditional logic.
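The mapping-driven version might look like this (the kwarg names "images"/"videos" follow common HF processor conventions but are assumptions here, as is the helper name `build_media_kwargs`):

```python
# New modalities are added by extending this dict, not by editing an if-chain.
MEDIA_KEY_TO_KWARG = {
    "image": "images",
    "audio": "audio",
    "video": "videos",
}


def build_media_kwargs(media_cur_message: dict) -> dict:
    media_kwargs = {}
    for media_key, kwarg_name in MEDIA_KEY_TO_KWARG.items():
        if media_cur_message.get(media_key):
            media_kwargs[kwarg_name] = media_cur_message[media_key]
    return media_kwargs
```
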

In `@nemo_rl/data/multimodal_utils.py`:
- Around line 321-325: Add a Google-style docstring to the public function
load_media_from_message describing parameters, return value, and fallback
behavior: document message (expected keys/structure), processor (type and when
used), multimodal_load_kwargs (shape: mapping of media type to kwargs and
defaults), and the returned dict[str, list[Any]] format; include behavior when
media is missing or when processor is None, examples of supported media types,
and any exceptions raised or swallowed. Place the docstring immediately under
the def load_media_from_message(...) signature so external modules like
llm_message_utils.py can rely on the documented API.
- Around line 219-261: The public function
get_multimodal_default_settings_from_processor lacks a docstring and uses unsafe
function-attribute caching and redundant list comprehensions; add a concise
docstring describing purpose, args, and return value, and replace the
function-attribute caches
get_multimodal_default_settings_from_processor.load_video_kwargs and
.load_audio_kwargs with a module-level cache (either simple module-level
variables or a `@functools.lru_cache`-decorated helper) that computes
list(inspect.signature(load_video).parameters) and
list(inspect.signature(load_audio).parameters) once, and update the list
comprehensions to use list(...) instead of [param for param in ...]; ensure the
rest of the function (video_settings_dict/feature_extractor usage and
default_settings keys) remains unchanged.
- Around line 315-317: Replace the ternary used for side effects with a clear
if/else: locate the statement that currently does
"media[tag].extend(list(item[tag])) if isinstance(item[tag], (list, tuple)) else
media[tag].append(item[tag])" and change it to an explicit if
isinstance(item[tag], (list, tuple)): media[tag].extend(list(item[tag])) else:
media[tag].append(item[tag]). This preserves the same behavior for
media[tag].extend/append and improves readability; keep the same variable names
(media, tag, item) and surrounding logic.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9148186 and ee953ca.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • .github/workflows/cicd-main.yml
  • examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n4g.yaml.swp
  • examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n8g.yaml.swp
  • examples/configs/sft_avlm.yaml
  • examples/run_sft.py
  • nemo_rl/algorithms/utils.py
  • nemo_rl/data/datasets/processed_dataset.py
  • nemo_rl/data/datasets/raw_dataset.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/daily_omni.py
  • nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py
  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/interfaces.py
  • nemo_rl/data/llm_message_utils.py
  • nemo_rl/data/multimodal_utils.py
  • nemo_rl/data/utils.py
  • nemo_rl/models/policy/__init__.py
  • pyproject.toml
  • tests/unit/data/datasets/test_general_conversations_dataset.py
  • tests/unit/data/datasets/test_response_dataset.py

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Feb 24, 2026
@yuanhangsu1986 yuanhangsu1986 changed the title Yuanhangs dev feat: Omni dataloader for HF models Feb 24, 2026
@yuanhangsu1986 yuanhangsu1986 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 24, 2026
@github-actions github-actions Bot removed the CI Relating to CI label Feb 24, 2026
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 24, 2026
yuki-97
yuki-97 previously approved these changes Feb 25, 2026
Contributor

@yuki-97 yuki-97 left a comment


lgtm, and better to add a functional test to guard the feature.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Feb 26, 2026
terrykong
terrykong previously approved these changes Feb 26, 2026
Collaborator

@terrykong terrykong left a comment


lgtm. thanks @yuanhangsu1986 !

Signed-off-by: root <root@batch-block1-0090.cm.cluster>
@yuanhangsu1986 yuanhangsu1986 dismissed stale reviews from terrykong and yuki-97 via cbc65a7 February 26, 2026 14:45
@yuanhangsu1986 yuanhangsu1986 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 26, 2026
@yuki-97 yuki-97 enabled auto-merge (squash) February 26, 2026 16:01
@yfw
Contributor

yfw commented Feb 26, 2026

Thanks @yuanhangsu1986 for the contribution! @terrykong this should be good to go assuming the CI passes

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Feb 26, 2026
@yuki-97 yuki-97 merged commit f7556d7 into main Feb 26, 2026
41 of 42 checks passed
@yuki-97 yuki-97 deleted the yuanhangs_dev branch February 26, 2026 23:37
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: root <root@batch-block1-0090.cm.cluster>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: root <root@batch-block1-0090.cm.cluster>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: root <root@batch-block1-0090.cm.cluster>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: root <root@batch-block1-0090.cm.cluster>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: root <root@batch-block1-0090.cm.cluster>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: root <root@batch-block1-0090.cm.cluster>

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants