docs: vdr feedback by lbliii · Pull Request #1477 · NVIDIA-NeMo/Curator

lbliii · 2026-02-09T19:13:53Z

bulk vdr feedback in progress for docs

Installation Feedback

| # | Feedback | Fix Applied | Files Modified |
|---|----------|-------------|----------------|
| 1 | The recommended central pip installation for all modalities fails due to an invalid package URL format - `https://pypi.nvidia.com` | Not a doc fix: this is a `pyproject.toml` configuration issue in `[[tool.uv.index]]`. Requires investigation into whether the URL format needs a `simple/` suffix for pip compatibility. | N/A (code change needed) |
| 2 | We recommend using Docker as the preferred installation method, as it includes FFMPEG and InternVideo2 preconfigured in the environment. | Added tip block recommending Docker for video/audio workflows. Renamed container tab to "Container Installation (Recommended for Video/Audio)." Listed FFmpeg, InternVideo2, and CUDA libraries in benefits. | `docs/admin/installation.md` |
| 3 | The Docker container does not include pip by default, and the virtual environment in `/opt/venv` is not automatically activated upon entering the container, resulting in a "No module named nemo-curator" error. | Added `{important}` block with `source /opt/venv/env.sh` activation instructions. Updated container-environments reference to remove "activated by default" claim. | `docs/admin/installation.md`, `docs/reference/infrastructure/container-environments.md` |
| 4 | Both the audio and video workflows rely on CUDA 12; if this is a required dependency, it should be listed in the prerequisites. | Added "CUDA 12 (required for `audio_cuda12`, `video_cuda12`, `image_cuda12`, and `text_cuda12` extras)" to Quick Start Requirements. | `docs/admin/installation.md` |
| 5 | The pip installation for video curation fails during the FFMPEG installation step, producing the following error: `ERROR: failed checking for nvcc`. | Added `{note}` block documenting that the FFmpeg build requires the CUDA toolkit (`nvcc`) on `PATH`, with verification command. | `docs/admin/installation.md` |
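
Item 5 hinges on `nvcc` being discoverable on `PATH` before the FFmpeg build starts. As a minimal sketch (not the verification command added to `docs/admin/installation.md`, which this description does not reproduce), a preflight check could look like the following. `missing_tools` is a hypothetical helper name used only for illustration.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Check for the CUDA compiler that the FFmpeg build step looks for.
    missing = missing_tools(["nvcc"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("nvcc found at", shutil.which("nvcc"))
```

Running a check like this before the install step surfaces the missing toolkit earlier than the `failed checking for nvcc` build error would.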

Text Curation Feedback

| # | Feedback | Fix Applied | Files Modified |
|---|----------|-------------|----------------|
| 6 | The `fuzzy_e2e.ipynb` notebook fails to run in the Docker container, producing `AttributeError: 'CUDARuntimeError' object has no attribute 'msg'`. Ray actors spawned within the container cannot access the GPU. | Added `{note}` block explaining Docker must be started with `--gpus all` for Ray GPU access, plus venv activation instructions. | `docs/curate-text/process-data/deduplication/fuzzy.md` |
| 7 | The `semantic_e2e.ipynb` and `semantic_step_by_step.ipynb` notebooks fail with `RuntimeError: No CUDA GPUs are available`. | Added `{note}` block explaining Docker `--gpus all` requirement and venv activation. | `docs/curate-text/process-data/deduplication/semdedup.md` |
| 8 | The quickstart example returned a 429 Client Error because Hugging Face rate-limited our IP. Setting `HF_TOKEN` is recommended. | Added `{tip}` block with `export HF_TOKEN` instructions and link to token settings page. | `docs/get-started/text.md` |
| 9 | The VRAM allocated for each task is reported as zero, and memory usage recorded by NeMo Curator does not match system specifications. | Not a doc fix: this is a code bug in resource tracking logic. | N/A (code change needed) |
| 10 | When running the quickstart script, "I am neutral about this product" is classified as negative. Updating the script could improve the initial user experience. | Changed sample sentences to unambiguous examples: "I love this product, it works great", "I hate this product, it broke immediately", "This product is okay but nothing special." | `tutorials/quickstart.py` |
| 11 | Including a brief one-line description for each classifier would help users quickly identify the most appropriate one. | Expanded all 10 classifier rows in the comparison table with specific descriptions (label counts, output categories, model type). | `docs/curate-text/process-data/quality-assessment/distributed-classifier.md` |
| 12 | It should be clarified whether users can integrate their own models into the text curation workflow as classifiers, and if so, documented with an example. | Added "Custom Model Integration" section showing how to extend `DistributedDataClassifier` with a subclass template. | `docs/curate-text/process-data/quality-assessment/distributed-classifier.md` |
| 13 | It is recommended to include a `requirements.txt` or add a cell to install all packages needed by the notebooks (Aegis example: "No module named pandas"). | Added `{tip}` block documenting that notebooks require additional packages (such as `pandas`) with `uv pip install` command, plus `HF_TOKEN` guidance. | `docs/curate-text/process-data/quality-assessment/distributed-classifier.md` |
| 14 | The LLaMA Nemotron tutorial crashed mid-way due to CPU out-of-memory error despite 128 GB RAM. Include prerequisites or guidance for adjusting `num_cpus`. | Added "System Requirements" section with 128 GB+ RAM recommendation and `--num-cpus` guidance. Rewrote OOM debugging section with three concrete steps. | `tutorials/text/llama-nemotron-data-curation/README.md` |
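
Item 8 recommends setting `HF_TOKEN` to avoid anonymous rate limits. The sketch below shows a guard that a quickstart script could run before downloading models. `hf_token_is_set` is a hypothetical helper, not something this PR adds, and the only assumption about Hugging Face tooling is that it reads the `HF_TOKEN` environment variable.

```python
import os


def hf_token_is_set(env=None):
    """Return True if HF_TOKEN is present and non-empty in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if not hf_token_is_set():
    # Warn early instead of failing later with a 429 response.
    print("HF_TOKEN is not set; anonymous Hugging Face requests may be rate-limited (HTTP 429).")
```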

Video Curation Feedback

| # | Feedback | Fix Applied | Files Modified |
|---|----------|-------------|----------------|
| 15 | InternVideo2 must be installed prior to running the quickstart example, but the GitHub repository does not currently provide instructions for doing so. | Added `{important}` block with prerequisite notice and link to InternVideo2 installation instructions. | `docs/get-started/video.md` |
| 16 | `video_split_clip_example.py` has so many command-line arguments that it is easier to tune them through a config file instead of passing everything on the command line. | Added `{tip}` block showing argparse `@config.txt` pattern for storing arguments in a file. | `docs/get-started/video.md` |
| 17 | Running `video_split_clip_example.py` fails with: `the following arguments are required: --output-clip-path`. We recommend replacing `--output-path` with `--output-clip-path`. | Fixed documentation to use the correct CLI argument `--output-path` (matching the actual script). Fixed all three doc files that used `--output-clip-path`. | `docs/get-started/video.md`, `docs/curate-video/tutorials/beginner.md`, `docs/curate-video/tutorials/split-dedup.md` |
| 18 | An incorrect file path in the documentation causes `ModuleNotFoundError: No module named 'nemo_curator.examples'` when attempting to run `video_split_clip_example`. | Replaced all `python -m nemo_curator.examples.video.video_split_clip_example` references with `python tutorials/video/getting-started/video_split_clip_example.py`. Fixed the same pattern in audio docs. | `docs/curate-video/tutorials/beginner.md`, `docs/curate-video/tutorials/split-dedup.md`, `docs/curate-video/process-data/captions-preview.md`, `docs/curate-video/process-data/clipping.md`, `docs/curate-video/process-data/embeddings.md`, `docs/curate-video/process-data/filtering.md`, `docs/curate-video/process-data/frame-extraction.md`, `docs/get-started/audio.md` |
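
Item 18 touched many files for the same stale invocation, so a scan for leftovers may be useful. The following is a rough sketch of such a check, not tooling that exists in the repository. `find_stale_invocations` and its regular expression are illustrative assumptions based on the `python -m nemo_curator.examples...` pattern quoted above.

```python
import re

# Matches the outdated module-style invocation quoted in item 18.
STALE = re.compile(r"python -m nemo_curator\.examples\.[\w.]+")


def find_stale_invocations(text):
    """Return (line_number, matched_text) pairs for each stale invocation."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in STALE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = (
    "Run the example:\n"
    "python -m nemo_curator.examples.video.video_split_clip_example --help\n"
)
print(find_stale_invocations(sample))
```

Applied to the contents of each file under `docs/`, an empty result would suggest no references to the removed module path remain.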

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii · 2026-02-11T16:07:11Z

Hi @lbliii what do you think about incorporating these:

The Image Curation “Getting Started” tutorial also experienced crashes due to CPU out-of-memory errors during execution

The setup and deployment instructions should be positioned before the Getting Started section in the documentation, as they are currently listed toward the end of the guide.

?

On it!

Signed-off-by: Lawrence Lane <llane@nvidia.com>

greptile-apps

20 files reviewed, 1 comment

greptile-apps · 2026-02-11T16:19:05Z

Additional Comments (1)

docs/get-started/video.md
Verify InternVideo2 prerequisite block was added. Feedback item #15 states "InternVideo2 must be installed prior to running the quickstart example" with an {important} block and link to installation instructions, but this doesn't appear in the current changes.

sarahyurick

Left a few minor comments, thanks!

sarahyurick · 2026-02-11T16:55:41Z

 Here's a simple example to get started with NeMo Curator's image curation pipeline:

+:::{note}
+**CPU Memory Considerations**


Maybe also add a note about lowering `num_cpus` during Ray Client setup.
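
To illustrate why lowering `num_cpus` helps with CPU out-of-memory crashes, here is a rough sizing sketch. The helper name `suggest_num_cpus` and the per-worker memory figure are hypothetical; real per-worker usage would need to be measured for the pipeline in question.

```python
def suggest_num_cpus(total_ram_gb, per_worker_gb, reserve_gb, available_cores):
    """Cap the worker count so estimated memory use stays within RAM.

    All inputs are rough estimates supplied by the user; none come from
    NeMo Curator itself.
    """
    if per_worker_gb <= 0:
        raise ValueError("per_worker_gb must be positive")
    usable = max(total_ram_gb - reserve_gb, 0)
    by_memory = int(usable // per_worker_gb)
    # Never exceed the physical cores, and always allow at least one worker.
    return max(1, min(available_cores, by_memory))


# Example: 128 GB host, ~8 GB per worker, 16 GB held back for the OS.
print(suggest_num_cpus(128, 8, 16, 64))
```

The resulting value would then be supplied wherever the pipeline sets its Ray CPU count.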

sarahyurick · 2026-02-11T16:57:43Z

 5. Writes output clips and metadata to `$OUT_DIR`

+```{tip}
+**Using a config file**: The example script accepts many command-line arguments. For complex configurations, you can store arguments in a file and pass them with the `@` prefix:


This is nice, thanks. I think eventually we should add it as a YAML file like the text examples here: https://github.com/NVIDIA-NeMo/Curator/tree/main/nemo_curator/config.
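
For context on the `@` prefix discussed in this thread: Python's `argparse` supports it through `fromfile_prefix_chars`. The sketch below is a standalone illustration, not the example script's actual parser. `--output-path` appears in this PR, while `--example-option` is a made-up argument.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser():
    # fromfile_prefix_chars="@" makes argparse expand "@file" into the
    # arguments listed in that file, one argument per line.
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    parser.add_argument("--output-path", required=True)
    parser.add_argument("--example-option", type=float, default=1.0)  # hypothetical
    return parser


def parse_from_lines(lines):
    """Write `lines` to a temporary config file and parse it via '@'."""
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "config.txt"
        config.write_text("\n".join(lines) + "\n")
        return build_parser().parse_args([f"@{config}"])


args = parse_from_lines(["--output-path=/tmp/out", "--example-option=2.5"])
print(args.output_path, args.example_option)
```

By default each line of the file is treated as exactly one argument, which is why the sketch uses the `--name=value` form.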

sarahyurick · 2026-02-11T16:58:40Z

 The primary container includes comprehensive support for all curation modalities:

-**Container registry:** `nvcr.io/nvidia/nemo-curator:25.09`
+**Container registry:** `nvcr.io/nvidia/nemo-curator:26.02`


Should this use the number from docs/project.json instead of hardcoding?

Signed-off-by: Lawrence Lane <llane@nvidia.com>

sarahyurick · 2026-02-11T17:07:35Z

Oh one last request, can you remove this lingering reference to InternVideo here: https://github.com/NVIDIA-NeMo/Curator/blob/main/docs/curate-video/process-data/dedup.md?plain=1 ?

greptile-apps

21 files reviewed, no comments

Signed-off-by: Lawrence Lane <llane@nvidia.com>

greptile-apps

25 files reviewed, 2 comments

greptile-apps · 2026-02-11T18:04:54Z

@@ -175,7 +175,7 @@ Video-specific pointers:
 - Use `ClipWriterStage` path helpers to locate outputs: `nemo_curator/stages/video/io/clip_writer.py`.


Redundant path option

This line reads as a typo after the iv2→ce1 updates: it says `${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`), which is identical on both sides. It should either list an actual alternative output (if one exists) or just state the single expected path to avoid confusing users.

greptile-apps · 2026-02-11T18:04:55Z

@@ -0,0 +1 @@
+{"filename": "get-started/text.md", "lineno": 119, "status": "broken", "code": 0, "uri": "https://huggingface.co/settings/tokens", "info": "unauthorized"}


Invalid JSON structure

broken_links_false_positives.json is being written as a single JSON object, but the name suggests this file is consumed as a collection of entries. If the docs link-checker expects an array (common for “false positives” allowlists), this will break parsing or silently ignore the entry. Consider making this file a JSON array (even for one item), e.g. [ { ... } ], to match typical consumption patterns.
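
How the link checker actually reads this file is not shown in the PR, so the concern above depends on the consumer. As a hedged sketch of a reader that would tolerate every shape under discussion (a single object, an array, or one object per line), consider the following. `load_entries` is a hypothetical function, not part of the docs tooling.

```python
import json


def load_entries(text):
    """Return a list of entry dicts from an object, an array, or JSON Lines."""
    text = text.strip()
    if not text:
        return []
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # More than one top-level value: fall back to one JSON object per line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return parsed if isinstance(parsed, list) else [parsed]


single = '{"filename": "get-started/text.md", "status": "broken"}'
print(len(load_entries(single)))
print(len(load_entries("[" + single + "]")))
print(len(load_entries(single + "\n" + single)))
```

A strict consumer that calls `json.load` and iterates the result would behave differently for each of these inputs, which is the risk the comment raises.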

Signed-off-by: Lawrence Lane <llane@nvidia.com>

greptile-apps

25 files reviewed, 1 comment

greptile-apps · 2026-02-11T18:25:44Z

  - Processed videos: `get_output_path_processed_videos(OUT_DIR)`
  - Clip chunks and previews: `get_output_path_processed_clip_chunks(OUT_DIR)`, `get_output_path_previews(OUT_DIR)`
-  - Embeddings parquet: `${OUT_DIR}/iv2_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)
+  - Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)


Redundant path information - both options show the same path ${OUT_DIR}/ce1_embd_parquet. Either remove the parenthetical "(or ${OUT_DIR}/ce1_embd_parquet)" or update it to show an actual alternative path if one exists.

Suggested change

- Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)

- Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet`

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

sarahyurick

Thank you!

greptile-apps

25 files reviewed, 1 comment

greptile-apps · 2026-02-11T18:35:49Z

 ```

-The pipeline writes embeddings under `$OUT_DIR/iv2_embd_parquet/` (or `ce1_embd_parquet/` if you use Cosmos-Embed1).
+The pipeline writes embeddings under `$OUT_DIR/ce1_embd_parquet/` when using Cosmos-Embed1.


Redundant path information - the comment says "when using Cosmos-Embed1" but doesn't show an alternative path for other embedding models. If there's only one output location, simplify to just state the path without the qualifying clause.

docs: vdr feedback

7b8c122

Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii self-assigned this Feb 9, 2026

copy-pr-bot Bot temporarily deployed to test February 9, 2026 19:14 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci February 9, 2026 19:14 Error

copy-pr-bot Bot temporarily deployed to nemo-ci February 9, 2026 19:14 Inactive

copy-pr-bot Bot temporarily deployed to test February 11, 2026 16:07 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci February 11, 2026 16:07 Error

copy-pr-bot Bot temporarily deployed to nemo-ci February 11, 2026 16:07 Inactive

lbliii added 2 commits February 11, 2026 11:15

feedback

dc1109e

Signed-off-by: Lawrence Lane <llane@nvidia.com>

Merge branch 'llane/26.02-bulk-vdr-doc-feedback' of https://github.com/lbliii/NeMo-Curator into llane/26.02-bulk-vdr-doc-feedback

79ad10c

greptile-apps Bot reviewed Feb 11, 2026

sarahyurick reviewed Feb 11, 2026

re order sidebar

ddb5c6e

Signed-off-by: Lawrence Lane <llane@nvidia.com>

Merge branch 'main' into llane/26.02-bulk-vdr-doc-feedback

aa5c213

greptile-apps Bot reviewed Feb 11, 2026

lbliii added 3 commits February 11, 2026 12:42

release notes draft

a2a340d

Signed-off-by: Lawrence Lane <llane@nvidia.com>

feedback

5798cd0

Signed-off-by: Lawrence Lane <llane@nvidia.com>

remove more internvid content

e5ca35d

Signed-off-by: Lawrence Lane <llane@nvidia.com>

greptile-apps Bot reviewed Feb 11, 2026

sarahyurick reviewed Feb 11, 2026

Comment thread docs/curate-video/tutorials/split-dedup.md Outdated

release note fix

64d0424

Signed-off-by: Lawrence Lane <llane@nvidia.com>

greptile-apps Bot reviewed Feb 11, 2026

Update docs/curate-video/tutorials/split-dedup.md

923e568

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

sarahyurick approved these changes Feb 11, 2026

greptile-apps Bot reviewed Feb 11, 2026

This was referenced Feb 11, 2026

Cherry pick tutorials changes from #1477 #1491

Merged

Add relevant 26.02 docs to r1.1.0 #1493

Merged
