docs: vdr feedback #1477

Merged
thomasdhc merged 17 commits into NVIDIA-NeMo:main from lbliii:llane/26.02-bulk-vdr-doc-feedback
Feb 11, 2026

@lbliii lbliii commented Feb 9, 2026

bulk vdr feedback in progress for docs

Installation Feedback

| # | Feedback | Fix Applied | Files Modified |
|---|----------|-------------|----------------|
| 1 | The recommended central pip installation for all modalities fails due to an invalid package URL format: `https://pypi.nvidia.com` | Not a doc fix; this is a `pyproject.toml` configuration issue in `[[tool.uv.index]]`. Requires investigation into whether the URL format needs a `simple/` suffix for pip compatibility. | N/A (code change needed) |
| 2 | We recommend using Docker as the preferred installation method, as it includes FFMPEG and InternVideo2 preconfigured in the environment. | Added tip block recommending Docker for video/audio workflows. Renamed container tab to "Container Installation (Recommended for Video/Audio)." Listed FFmpeg, InternVideo2, and CUDA libraries in benefits. | `docs/admin/installation.md` |
| 3 | The Docker container does not include pip by default, and the virtual environment in `/opt/venv` is not automatically activated upon entering the container, resulting in a "No module named nemo-curator" error. | Added `{important}` block with `source /opt/venv/env.sh` activation instructions. Updated container-environments reference to remove "activated by default" claim. | `docs/admin/installation.md`, `docs/reference/infrastructure/container-environments.md` |
| 4 | Both the audio and video workflows rely on CUDA 12; if this is a required dependency, it should be listed in the prerequisites. | Added "CUDA 12 (required for `audio_cuda12`, `video_cuda12`, `image_cuda12`, and `text_cuda12` extras)" to Quick Start Requirements. | `docs/admin/installation.md` |
| 5 | The pip installation for video curation fails during the FFMPEG installation step, producing the following error: `ERROR: failed checking for nvcc`. | Added `{note}` block documenting that the FFmpeg build requires the CUDA toolkit (`nvcc`) on `PATH`, with verification command. | `docs/admin/installation.md` |
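
Item 5 above notes that the FFmpeg build needs `nvcc` on `PATH`. As a rough illustration only (this is not the verification command added to the docs), a preflight check along these lines could catch the problem before the install starts. The function names are hypothetical.

```python
import shutil
import subprocess


def find_nvcc():
    """Return the full path to nvcc if it is on PATH, otherwise None."""
    return shutil.which("nvcc")


def nvcc_version(nvcc_path):
    """Return the raw `nvcc --version` output for the given binary."""
    result = subprocess.run(
        [nvcc_path, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    path = find_nvcc()
    if path is None:
        # Assumption: a missing nvcc is what triggers
        # "ERROR: failed checking for nvcc" during the FFmpeg build.
        print("nvcc not found on PATH; install the CUDA toolkit first")
    else:
        print(nvcc_version(path))
```

Running a check like this before the pip install would surface a missing CUDA toolkit with a clearer message than the build error.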

Text Curation Feedback

| # | Feedback | Fix Applied | Files Modified |
|---|----------|-------------|----------------|
| 6 | The `fuzzy_e2e.ipynb` notebook fails to run in the Docker container, producing `AttributeError: 'CUDARuntimeError' object has no attribute 'msg'`. Ray actors spawned within the container cannot access the GPU. | Added `{note}` block explaining Docker must be started with `--gpus all` for Ray GPU access, plus venv activation instructions. | `docs/curate-text/process-data/deduplication/fuzzy.md` |
| 7 | The `semantic_e2e.ipynb` and `semantic_step_by_step.ipynb` notebooks fail with `RuntimeError: No CUDA GPUs are available`. | Added `{note}` block explaining Docker `--gpus all` requirement and venv activation. | `docs/curate-text/process-data/deduplication/semdedup.md` |
| 8 | The quickstart example returned a 429 Client Error because Hugging Face rate-limited our IP. Setting `HF_TOKEN` is recommended. | Added `{tip}` block with `export HF_TOKEN` instructions and link to token settings page. | `docs/get-started/text.md` |
| 9 | The VRAM allocated for each task is reported as zero, and memory usage recorded by NeMo Curator does not match system specifications. | Not a doc fix; this is a code bug in resource tracking logic. | N/A (code change needed) |
| 10 | When running the quickstart script, "I am neutral about this product" is classified as negative. Updating the script could improve the initial user experience. | Changed sample sentences to unambiguous examples: "I love this product, it works great", "I hate this product, it broke immediately", "This product is okay but nothing special." | `tutorials/quickstart.py` |
| 11 | Including a brief one-line description for each classifier would help users quickly identify the most appropriate one. | Expanded all 10 classifier rows in the comparison table with specific descriptions (label counts, output categories, model type). | `docs/curate-text/process-data/quality-assessment/distributed-classifier.md` |
| 12 | It should be clarified whether users can integrate their own models into the text curation workflow as classifiers, and if so, documented with an example. | Added "Custom Model Integration" section showing how to extend `DistributedDataClassifier` with a subclass template. | `docs/curate-text/process-data/quality-assessment/distributed-classifier.md` |
| 13 | It is recommended to include a `requirements.txt` or add a cell to install all packages needed by the notebooks (Aegis example: "No module named pandas"). | Added `{tip}` block documenting that notebooks require additional packages (such as pandas) with `uv pip install` command, plus `HF_TOKEN` guidance. | `docs/curate-text/process-data/quality-assessment/distributed-classifier.md` |
| 14 | The LLaMA Nemotron tutorial crashed mid-way due to CPU out-of-memory error despite 128 GB RAM. Include prerequisites or guidance for adjusting `num_cpus`. | Added "System Requirements" section with 128 GB+ RAM recommendation and `--num-cpus` guidance. Rewrote OOM debugging section with three concrete steps. | `tutorials/text/llama-nemotron-data-curation/README.md` |
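
Items 8 and 13 both come down to missing prerequisites (an unset `HF_TOKEN`, an uninstalled package such as pandas). A minimal sketch of a notebook preflight cell, assuming nothing beyond the standard library, could look like this. The helper names are hypothetical and are not part of NeMo Curator.

```python
import importlib.util
import os


def hf_token_is_set(env=None):
    """Return True if HF_TOKEN is present and non-empty in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not hf_token_is_set():
        print("HF_TOKEN is not set; Hugging Face downloads may be rate-limited")
    # "pandas" is the package named in the feedback; extend the list as needed.
    for name in missing_packages(["pandas"]):
        print(f"Missing package: {name}")
```

A cell like this reports every missing prerequisite up front instead of failing partway through the notebook.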

Video Curation Feedback

| # | Feedback | Fix Applied | Files Modified |
|---|----------|-------------|----------------|
| 15 | InternVideo2 must be installed prior to running the quickstart example, but the GitHub repository does not currently provide instructions for doing so. | Added `{important}` block with prerequisite notice and link to InternVideo2 installation instructions. | `docs/get-started/video.md` |
| 16 | `video_split_clip_example.py` has so many command-line arguments that it is easier to tune them through a config file instead of passing everything on the command line. | Added `{tip}` block showing argparse `@config.txt` pattern for storing arguments in a file. | `docs/get-started/video.md` |
| 17 | Running `video_split_clip_example.py` fails with: `the following arguments are required: --output-clip-path`. We recommend replacing `--output-path` with `--output-clip-path`. | Fixed documentation to use the correct CLI argument `--output-path` (matching the actual script). Fixed all three doc files that used `--output-clip-path`. | `docs/get-started/video.md`, `docs/curate-video/tutorials/beginner.md`, `docs/curate-video/tutorials/split-dedup.md` |
| 18 | An incorrect file path in the documentation causes `ModuleNotFoundError: No module named 'nemo_curator.examples'` when attempting to run `video_split_clip_example`. | Replaced all `python -m nemo_curator.examples.video.video_split_clip_example` references with `python tutorials/video/getting-started/video_split_clip_example.py`. Fixed the same pattern in audio docs. | `docs/curate-video/tutorials/beginner.md`, `docs/curate-video/tutorials/split-dedup.md`, `docs/curate-video/process-data/captions-preview.md`, `docs/curate-video/process-data/clipping.md`, `docs/curate-video/process-data/embeddings.md`, `docs/curate-video/process-data/filtering.md`, `docs/curate-video/process-data/frame-extraction.md`, `docs/get-started/audio.md` |
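
For context on item 16, the `@config.txt` pattern relies on argparse's `fromfile_prefix_chars` option, which only takes effect if the parser is constructed with it. The sketch below is a standalone illustration rather than the example script's actual parser. `--output-path` is the flag named in this PR, while `--video-dir` is a hypothetical stand-in.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser():
    # fromfile_prefix_chars="@" makes argparse expand "@file" into the
    # arguments listed in that file, one argument per line.
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    parser.add_argument("--video-dir", required=True)    # hypothetical flag
    parser.add_argument("--output-path", required=True)  # flag named in this PR
    return parser


def parse_from_config(config_text):
    """Write config_text to a temporary file and parse it via @file syntax."""
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "config.txt"
        config.write_text(config_text)
        return build_parser().parse_args([f"@{config}"])


# Each line is a single argument, so the --flag=value form keeps
# a flag and its value together on one line.
args = parse_from_config("--video-dir=/data/videos\n--output-path=/data/out\n")
print(args.video_dir, args.output_path)
```

Because argparse reads one argument per line by default, a flag written as `--output-path /data/out` on a single line would be treated as one unrecognized token, so either use `--flag=value` or put the flag and its value on separate lines.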

Signed-off-by: Lawrence Lane <llane@nvidia.com>
@lbliii lbliii self-assigned this Feb 9, 2026
lbliii commented Feb 11, 2026

Hi @lbliii what do you think about incorporating these:

  • The Image Curation “Getting Started” tutorial also experienced crashes due to CPU out-of-memory errors during execution
  • The setup and deployment instructions should be positioned before the Getting Started section in the documentation, as they are currently listed toward the end of the guide.

?

On it!

Signed-off-by: Lawrence Lane <llane@nvidia.com>

@greptile-apps greptile-apps Bot left a comment

20 files reviewed, 1 comment

greptile-apps Bot commented Feb 11, 2026

Additional Comments (1)

docs/get-started/video.md
Verify InternVideo2 prerequisite block was added. Feedback item #15 states "InternVideo2 must be installed prior to running the quickstart example" with an {important} block and link to installation instructions, but this doesn't appear in the current changes.

@sarahyurick sarahyurick left a comment

Left a few minor comments, thanks!

Comment thread docs/get-started/image.md
Here's a simple example to get started with NeMo Curator's image curation pipeline:

:::{note}
**CPU Memory Considerations**

Maybe also add a note about lowering num_cpus during Ray Client set up.
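
One way to reason about a lower `num_cpus` value is to cap the worker count by available memory. The sketch below is a heuristic under stated assumptions, not guidance from the NeMo Curator docs: the per-worker memory figure is something you would have to measure for your own pipeline, and `suggest_num_cpus` is a hypothetical helper.

```python
import os


def suggest_num_cpus(total_ram_gb, ram_per_worker_gb, cpu_count=None):
    """Cap the worker count so workers * ram_per_worker_gb fits in total RAM."""
    if ram_per_worker_gb <= 0:
        raise ValueError("ram_per_worker_gb must be positive")
    cpu_count = cpu_count or os.cpu_count() or 1
    by_memory = int(total_ram_gb // ram_per_worker_gb)
    # Never suggest fewer than one worker, and never more than the CPU count.
    return max(1, min(cpu_count, by_memory))


# Example: 128 GB of RAM and an assumed 8 GB per worker on a 64-core machine.
print(suggest_num_cpus(128, 8, cpu_count=64))
```

The resulting number would then be passed to whichever `num_cpus` setting the Ray setup exposes.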

Comment thread docs/get-started/video.md
5. Writes output clips and metadata to `$OUT_DIR`

```{tip}
**Using a config file**: The example script accepts many command-line arguments. For complex configurations, you can store arguments in a file and pass them with the `@` prefix:

This is nice, thanks. I think eventually we should add it as a YAML file like the text examples here: https://github.com/NVIDIA-NeMo/Curator/tree/main/nemo_curator/config.

The primary container includes comprehensive support for all curation modalities:

- **Container registry:** `nvcr.io/nvidia/nemo-curator:25.09`
+ **Container registry:** `nvcr.io/nvidia/nemo-curator:26.02`

Should this use the number from docs/project.json instead of hardcoding?

Signed-off-by: Lawrence Lane <llane@nvidia.com>
@sarahyurick

Oh one last request, can you remove this lingering reference to InternVideo here: https://github.com/NVIDIA-NeMo/Curator/blob/main/docs/curate-video/process-data/dedup.md?plain=1 ?

@greptile-apps greptile-apps Bot left a comment

21 files reviewed, no comments

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@greptile-apps greptile-apps Bot left a comment

25 files reviewed, 2 comments

@@ -175,7 +175,7 @@ Video-specific pointers:
- Use `ClipWriterStage` path helpers to locate outputs: `nemo_curator/stages/video/io/clip_writer.py`.

Redundant path option

This line reads as a typo after the iv2→ce1 updates: it says `${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`), which is identical on both sides. It should either list an actual alternative output (if one exists) or just state the single expected path to avoid confusing users.

@@ -0,0 +1 @@
{"filename": "get-started/text.md", "lineno": 119, "status": "broken", "code": 0, "uri": "https://huggingface.co/settings/tokens", "info": "unauthorized"}

Invalid JSON structure

broken_links_false_positives.json is being written as a single JSON object, but the name suggests this file is consumed as a collection of entries. If the docs link-checker expects an array (common for “false positives” allowlists), this will break parsing or silently ignore the entry. Consider making this file a JSON array (even for one item), e.g. [ { ... } ], to match typical consumption patterns.

Comment thread docs/curate-video/tutorials/split-dedup.md Outdated
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@greptile-apps greptile-apps Bot left a comment

25 files reviewed, 1 comment

- Processed videos: `get_output_path_processed_videos(OUT_DIR)`
- Clip chunks and previews: `get_output_path_processed_clip_chunks(OUT_DIR)`, `get_output_path_previews(OUT_DIR)`
- Embeddings parquet: `${OUT_DIR}/iv2_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)
- Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)

Redundant path information - both options show the same path ${OUT_DIR}/ce1_embd_parquet. Either remove the parenthetical "(or ${OUT_DIR}/ce1_embd_parquet)" or update it to show an actual alternative path if one exists.

Suggested change
- Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`)
- Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet`

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
@sarahyurick sarahyurick left a comment

Thank you!

@greptile-apps greptile-apps Bot left a comment

25 files reviewed, 1 comment

```

- The pipeline writes embeddings under `$OUT_DIR/iv2_embd_parquet/` (or `ce1_embd_parquet/` if you use Cosmos-Embed1).
+ The pipeline writes embeddings under `$OUT_DIR/ce1_embd_parquet/` when using Cosmos-Embed1.

Redundant path information - the comment says "when using Cosmos-Embed1" but doesn't show an alternative path for other embedding models. If there's only one output location, simplify to just state the path without the qualifying clause.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Labels

r1.1.0 Pick this label for auto cherry-picking into r1.1.0

3 participants