Cleanup megatron dataset preprocessing script and update docs #918
kevalmorabia97 wants to merge 1 commit into main from
Conversation
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough
The PR refactors the Megatron preprocessing API from a single
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #918 +/- ##
==========================================
- Coverage 73.11% 73.10% -0.02%
==========================================
Files 205 205
Lines 22281 22281
==========================================
- Hits 16291 16288 -3
- Misses 5990 5993 +3
☔ View full report in Codecov by Sentry.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
172-179: ⚠️ Potential issue | 🟠 Major
`UnboundLocalError` on `i` when processing an empty JSONL file.
`i` is assigned only inside the `enumerate` loop at line 172. If `encoded_docs` yields no items (empty file), `i` is never defined, and the new `force_print` call on line 179 raises `UnboundLocalError`.
🐛 Proposed fix
- total_doc_len, total_enc_len, final_enc_len = 0, 0, 0
+ total_doc_len, total_enc_len, final_enc_len, i = 0, 0, 0, 0
  for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(encoded_docs, start=1):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 172 - 179, The loop variable i is undefined when encoded_docs is empty, causing UnboundLocalError at the final self._print_processing_stats(i, ...) call; fix by initializing i (e.g., i = 0) before the for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(...) loop or by using a safe fallback when calling self._print_processing_stats (e.g., pass i if defined else 0); update the code around the enumerate block in megatron_preprocess_data.py so that i is always defined before the final force_print call.
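The fix above can be sketched in isolation; `count_docs` below is a simplified stand-in for the real encoding loop, not the actual function in `megatron_preprocess_data.py`:

```python
def count_docs(encoded_docs):
    """Simplified stand-in for the encoding loop in the review comment."""
    # Initialize i so the final stats call is safe even when encoded_docs
    # yields nothing; without this, an empty input leaves i undefined.
    i = 0
    total_enc_len = 0
    for i, enc_len in enumerate(encoded_docs, start=1):
        total_enc_len += enc_len
    # This line would raise UnboundLocalError on empty input if i were
    # only ever assigned inside the loop.
    return i, total_enc_len
```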
🧹 Nitpick comments (2)
modelopt/torch/prune/plugins/mcore_minitron.py (1)
320-320: LGTM — consider opening a tracking issue for this rename.
The TODO comment clearly documents the pending rename of `hybrid_override_pattern` → `hybrid_layer_pattern` once Megatron-LM#3377 merges. No logic or behavior change.
Would you like me to draft a tracking issue for this rename so it doesn't get lost?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/prune/plugins/mcore_minitron.py` at line 320, The comment TODO notes that the name hybrid_override_pattern should be renamed to hybrid_layer_pattern after Megatron-LM#3377; create a tracking issue in your repo (or project board) documenting this rename, include the file and symbol names (hybrid_override_pattern in modelopt/torch/prune/plugins/mcore_minitron.py), mention the PR that triggers it (NVIDIA/Megatron-LM#3377), and list required follow-ups (update the TODO comment, rename the symbol across codebase, adjust any docs/tests that reference hybrid_override_pattern) so the change won't be lost.
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
271-271: Mutable default argument for `json_keys`.
`json_keys: list[str] = ["text"]` shares a single list object across all callers that omit the argument. While the list is currently only read (not mutated) inside the function, this is a Python antipattern that could cause subtle bugs if the implementation changes.
♻️ Proposed fix
- json_keys: list[str] = ["text"],
+ json_keys: list[str] | None = None,
Then at the top of the function body:
if json_keys is None:
    json_keys = ["text"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` at line 271, The parameter json_keys is defined with a mutable default list (`json_keys: list[str] = ["text"]`), so change the function signature to use None as the default (e.g., `json_keys: list[str] | None = None`) and add a guard at the top of the function (in modelopt/torch/utils/plugins/megatron_preprocess_data.py) that sets `json_keys = ["text"]` when `json_keys is None`; reference the parameter name `json_keys` to locate and update the code and ensure any type hints are adjusted accordingly.
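The antipattern and the suggested fix can be contrasted side by side; both functions below are illustrative sketches, not code from the PR:

```python
def mutable_default(keys=[]):
    # Antipattern: the default list is created once at definition time and
    # shared across every call that omits the argument.
    keys.append("seen")
    return keys

def safe_default(json_keys=None):
    # Recommended pattern: use None as a sentinel and build a fresh list
    # inside the function body on each call.
    if json_keys is None:
        json_keys = ["text"]
    return json_keys
```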
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/megatron_bridge/README.md`:
- Around line 103-111: The bash example uses inline comments after a
backslash-continued line (e.g., the lines containing `--json_keys text \ #
change to your JSON key if needed` and `--tokenizer Qwen/Qwen3-0.6B \ # fyi,
all models...`), which breaks line continuation; remove those trailing inline
comments or move them to their own lines above the command, or place comments
only on the final non-continued line, and ensure every `\` is the last character
on the line so the `python -m
modelopt.torch.utils.plugins.megatron_preprocess_data` command and flags
(`--jsonl_paths`, `--json_keys`, `--tokenizer`, `--output_dir`, `--workers`,
`--max_sequence_length`) are valid when copy-pasted.
In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 301-315: Move the creation of the output directory to before
calling _download_hf_dataset so the downloader can write files; specifically,
call Path(output_dir).mkdir(parents=True, exist_ok=True) prior to invoking
_download_hf_dataset(dataset, output_dir, json_keys, subset=subset, split=split)
in the code block that contains the dataset/download logic, leaving the rest of
the file discovery logic for input_dir/jsonl_paths unchanged.
- Around line 229-255: _download_hf_dataset currently builds jsonl_file_path
using ds_path which may contain slashes and never ensures parent directories
exist, causing FileNotFoundError when ds.to_json is called; fix by sanitizing or
using only the dataset name (not the HF org prefix) when composing
jsonl_file_path and ensure the parent directory is created before writing:
derive a safe filename from ds_path (or replace '/' with '_' or use
Path(ds_path).name), compute parent = Path(output_dir)/"raw"/<safe_name> and
call parent.mkdir(parents=True, exist_ok=True) before checking/creating
jsonl_file_path and before calling ds.to_json; also ensure output_dir is created
earlier (or rely on the same mkdir(parents=True)) so writes never fail.
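A minimal sketch of the directory-creation fix; `jsonl_output_path` is a hypothetical helper name, and the flattening of the HF org prefix via `replace("/", "_")` is one of the options the comment suggests:

```python
from pathlib import Path

def jsonl_output_path(output_dir: str, ds_path: str) -> Path:
    # A dataset path such as "HuggingFaceFW/fineweb" contains a slash, so
    # flatten it into a safe single-component file name.
    safe_name = ds_path.replace("/", "_")
    parent = Path(output_dir) / "raw"
    # Create all missing directories before any write, so a later
    # ds.to_json(...) cannot fail with FileNotFoundError.
    parent.mkdir(parents=True, exist_ok=True)
    return parent / f"{safe_name}.jsonl"
```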
In `@tests/_test_utils/torch/megatron/utils.py`:
- Line 23: The shared utils module uses skip_if_no_megatron(te_required=False)
which allows import to run TE-dependent megatron.core model imports (e.g.,
GPTModel, MambaModel, GPTInferenceWrapper) and causes ImportError when TE is
missing; update the guard so the module is only imported when TE is available by
changing the skip invocation to skip_if_no_megatron(te_required=True) (or, if
utility functions truly don't need TE, move the skip_if_no_megatron call to
after the TE-free helper definitions and keep TE imports behind a separate
guarded block); ensure initialize_for_megatron and any TE-dependent symbols are
only loaded under the TE-required guard to keep imports consistent with
tests/models.py.
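One way to keep TE-dependent imports behind a guard is to probe for the package without importing it eagerly; `has_transformer_engine` is a hypothetical helper, not an existing ModelOpt utility:

```python
import importlib.util

def has_transformer_engine() -> bool:
    # find_spec only locates the package on sys.path; it does not execute
    # the package's import-time code, so this is safe on TE-free machines.
    return importlib.util.find_spec("transformer_engine") is not None
```

A shared utils module could then define TE-free helpers unconditionally and place `GPTModel`/`MambaModel` imports inside an `if has_transformer_engine():` block (or behind `skip_if_no_megatron(te_required=True)`).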
In `@tests/unit/torch/utils/test_megatron_preprocess_data.py`:
- Around line 98-121: The test test_megatron_preprocess_data_with_hf_dataset
makes live network calls and also writes JSON into a non-existent raw/
subdirectory in _download_hf_dataset; update the test to skip in offline CI by
adding a guard (e.g., pytest.mark.skipif or an env-check like
SKIP_NETWORK_TESTS) before running megatron_preprocess_data, and in
_download_hf_dataset ensure the target directory (the constructed raw/ path) is
created (os.makedirs(..., exist_ok=True)) before calling ds.to_json(); reference
the test function name test_megatron_preprocess_data_with_hf_dataset, the helper
_download_hf_dataset, and the call sites around load_dataset and ds.to_json when
implementing these changes.
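The env-check guard mentioned above could look like the sketch below; the `SKIP_NETWORK_TESTS` variable name is an assumption from the comment, not an established convention in this repo:

```python
import os

def should_skip_network_tests(env=None) -> bool:
    # Hypothetical guard: offline CI exports SKIP_NETWORK_TESTS=1 to disable
    # tests that download datasets from the Hugging Face Hub.
    env = os.environ if env is None else env
    return env.get("SKIP_NETWORK_TESTS") == "1"
```

With pytest, this would back a marker such as `pytest.mark.skipif(should_skip_network_tests(), reason="network disabled in CI")` applied to `test_megatron_preprocess_data_with_hf_dataset`.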
---
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
examples/megatron_bridge/README.md
examples/megatron_bridge/distill.py
examples/nemo_run/common/process_climbmix.py
modelopt/torch/prune/plugins/mcore_minitron.py
modelopt/torch/utils/plugins/megatron_preprocess_data.py
tests/_test_utils/torch/megatron/utils.py
tests/unit/torch/utils/test_megatron_preprocess_data.py
What does this PR do?
Cleanup megatron dataset preprocessing script and update docs
Usage
Testing
Summary by CodeRabbit
Documentation
New Features
Configuration Updates