docs: vdr feedback#1477
Conversation
Signed-off-by: Lawrence Lane <llane@nvidia.com>
On it! |
…m/lbliii/NeMo-Curator into llane/26.02-bulk-vdr-doc-feedback
Additional Comments (1)
|
sarahyurick
left a comment
There was a problem hiding this comment.
Left a few minor comments, thanks!
| Here's a simple example to get started with NeMo Curator's image curation pipeline: | ||
|
|
||
| :::{note} | ||
| **CPU Memory Considerations** |
There was a problem hiding this comment.
Maybe also add a note about lowering num_cpus during Ray Client set up.
| 5. Writes output clips and metadata to `$OUT_DIR` | ||
|
|
||
| ```{tip} | ||
| **Using a config file**: The example script accepts many command-line arguments. For complex configurations, you can store arguments in a file and pass them with the `@` prefix: |
There was a problem hiding this comment.
This is nice, thanks. I think eventually we should add it as a YAML file like the text examples here: https://github.com/NVIDIA-NeMo/Curator/tree/main/nemo_curator/config.
| The primary container includes comprehensive support for all curation modalities: | ||
|
|
||
| **Container registry:** `nvcr.io/nvidia/nemo-curator:25.09` | ||
| **Container registry:** `nvcr.io/nvidia/nemo-curator:26.02` |
There was a problem hiding this comment.
Should this use the number from docs/project.json instead of hardcoding?
Signed-off-by: Lawrence Lane <llane@nvidia.com>
|
Oh one last request, can you remove this lingering reference to InternVideo here: https://github.com/NVIDIA-NeMo/Curator/blob/main/docs/curate-video/process-data/dedup.md?plain=1 ? |
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
| @@ -175,7 +175,7 @@ Video-specific pointers: | |||
| - Use `ClipWriterStage` path helpers to locate outputs: `nemo_curator/stages/video/io/clip_writer.py`. | |||
There was a problem hiding this comment.
Redundant path option
This line reads as a typo after the iv2→ce1 updates: it says ${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`) which is identical on both sides. It should either list an actual alternative output (if one exists) or just state the single expected path avoid confusing users.
| @@ -0,0 +1 @@ | |||
| {"filename": "get-started/text.md", "lineno": 119, "status": "broken", "code": 0, "uri": "https://huggingface.co/settings/tokens", "info": "unauthorized"} | |||
There was a problem hiding this comment.
Invalid JSON structure
broken_links_false_positives.json is being written as a single JSON object, but the name suggests this file is consumed as a collection of entries. If the docs link-checker expects an array (common for “false positives” allowlists), this will break parsing or silently ignore the entry. Consider making this file a JSON array (even for one item), e.g. [ { ... } ], to match typical consumption patterns.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
| - Processed videos: `get_output_path_processed_videos(OUT_DIR)` | ||
| - Clip chunks and previews: `get_output_path_processed_clip_chunks(OUT_DIR)`, `get_output_path_previews(OUT_DIR)` | ||
| - Embeddings parquet: `${OUT_DIR}/iv2_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`) | ||
| - Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`) |
There was a problem hiding this comment.
Redundant path information - both options show the same path ${OUT_DIR}/ce1_embd_parquet. Either remove the parenthetical "(or ${OUT_DIR}/ce1_embd_parquet)" or update it to show an actual alternative path if one exists.
| - Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet` (or `${OUT_DIR}/ce1_embd_parquet`) | |
| - Embeddings parquet: `${OUT_DIR}/ce1_embd_parquet` |
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
| ``` | ||
|
|
||
| The pipeline writes embeddings under `$OUT_DIR/iv2_embd_parquet/` (or `ce1_embd_parquet/` if you use Cosmos-Embed1). | ||
| The pipeline writes embeddings under `$OUT_DIR/ce1_embd_parquet/` when using Cosmos-Embed1. |
There was a problem hiding this comment.
Redundant path information - the comment says "when using Cosmos-Embed1" but doesn't show an alternative path for other embedding models. If there's only one output location, simplify to just state the path without the qualifying clause.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
bulk vdr feedback in progress for docs
Installation Feedback
https://pypi.nvidia.compyproject.tomlconfiguration issue in[[tool.uv.index]]. Requires investigation into whether the URL format needs asimple/suffix for pip compatibility.docs/admin/installation.md/opt/venvis not automatically activated upon entering the container, resulting in a "No module named nemo-curator" error.{important}block withsource /opt/venv/env.shactivation instructions. Updated container-environments reference to remove "activated by default" claim.docs/admin/installation.md,docs/reference/infrastructure/container-environments.mdaudio_cuda12,video_cuda12,image_cuda12, andtext_cuda12extras)" to Quick Start Requirements.docs/admin/installation.mdERROR: failed checking for nvcc.{note}block documenting that the FFmpeg build requires the CUDA toolkit (nvcc) onPATH, with verification command.docs/admin/installation.mdText Curation Feedback
fuzzy_e2e.ipynbnotebook fails to run in the Docker container, producingAttributeError: 'CUDARuntimeError' object has no attribute 'msg'. Ray actors spawned within the container cannot access the GPU.{note}block explaining Docker must be started with--gpus allfor Ray GPU access, plus venv activation instructions.docs/curate-text/process-data/deduplication/fuzzy.mdsemantic_e2e.ipynbandsemantic_step_by_step.ipynbnotebooks fail withRuntimeError: No CUDA GPUs are available.{note}block explaining Docker--gpus allrequirement and venv activation.docs/curate-text/process-data/deduplication/semdedup.mdHF_TOKENis recommended.{tip}block withexport HF_TOKENinstructions and link to token settings page.docs/get-started/text.mdtutorials/quickstart.pydocs/curate-text/process-data/quality-assessment/distributed-classifier.mdDistributedDataClassifierwith a subclass template.docs/curate-text/process-data/quality-assessment/distributed-classifier.mdrequirements.txtor add a cell to install all packages needed by the notebooks (Aegis example: "No module named pandas").{tip}block documenting that notebooks require additional packages (such aspandas) withuv pip installcommand, plusHF_TOKENguidance.docs/curate-text/process-data/quality-assessment/distributed-classifier.mdnum_cpus.--num-cpusguidance. Rewrote OOM debugging section with three concrete steps.tutorials/text/llama-nemotron-data-curation/README.mdVideo Curation Feedback
{important}block with prerequisite notice and link to InternVideo2 installation instructions.docs/get-started/video.mdvideo_split_clip_example.pyhas so many command-line arguments that it is easier to tune them through a config file instead of passing everything on the command line.{tip}block showing argparse@config.txtpattern for storing arguments in a file.docs/get-started/video.mdvideo_split_clip_example.pyfails with:the following arguments are required: --output-clip-path. We recommend replacing--output-pathwith--output-clip-path.--output-path(matching the actual script). Fixed all three doc files that used--output-clip-path.docs/get-started/video.md,docs/curate-video/tutorials/beginner.md,docs/curate-video/tutorials/split-dedup.mdModuleNotFoundError: No module named 'nemo_curator.examples'when attempting to runvideo_split_clip_example.python -m nemo_curator.examples.video.video_split_clip_examplereferences withpython tutorials/video/getting-started/video_split_clip_example.py. Fixed the same pattern in audio docs.docs/curate-video/tutorials/beginner.md,docs/curate-video/tutorials/split-dedup.md,docs/curate-video/process-data/captions-preview.md,docs/curate-video/process-data/clipping.md,docs/curate-video/process-data/embeddings.md,docs/curate-video/process-data/filtering.md,docs/curate-video/process-data/frame-extraction.md,docs/get-started/audio.md