[benchmarking] Adds image curation benchmark to nightly by rlratzel · Pull Request #1341 · NVIDIA-NeMo/Curator

rlratzel · 2025-12-21T06:37:52Z

Adds image curation benchmark to nightly run. This uses the image curation "getting started" tutorial.

copy-pr-bot · 2025-12-21T06:37:56Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

greptile-apps · 2026-01-07T17:50:40Z

Greptile Summary

This PR adds an image curation benchmark to the nightly benchmark suite, along with refactoring the placeholder substitution logic in the benchmark runner to support the new {curator_repo_dir} placeholder.

Key Changes:

- Added image_curation benchmark entry that runs the image curation tutorial script
- Added mscoco and mscoco_model_weights dataset definitions
- Refactored Entry.substitute_paths_in_cmd() into two separate methods: substitute_reserved_placeholders() (for {curator_repo_dir}, {session_entry_dir}, {dataset:...}) and substitute_container_or_host_paths() (for PathResolver paths)
- Added support for {curator_repo_dir} placeholder to reference scripts outside the benchmarking/scripts directory
- Simplified get_obj_for_json() by removing unused conversion cases
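The reserved-placeholder step described above can be sketched roughly as follows. This is not the code from benchmarking/runner/entry.py: the function name mirrors the summary, but its signature, the regular expression, the `DATASETS` lookup table, and every path are assumptions made for illustration. The sketch also treats any other `{...}` token as an error, in the spirit of the commit that fixes invalid placeholders being silently ignored; the real runner may handle other tokens differently.

```python
import re

# Hypothetical dataset table; the real runner resolves datasets from its session config.
DATASETS = {("mscoco", "wds"): "/datasets/mscoco/wds"}

# Matches a single {token} with no nested braces.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def substitute_reserved_placeholders(cmd: str, curator_repo_dir: str, session_entry_dir: str) -> str:
    """Replace {curator_repo_dir}, {session_entry_dir}, and {dataset:name,format}."""

    def resolve(match: re.Match) -> str:
        key = match.group(1)
        if key == "curator_repo_dir":
            return curator_repo_dir
        if key == "session_entry_dir":
            return session_entry_dir
        if key.startswith("dataset:"):
            name, _, fmt = key[len("dataset:"):].partition(",")
            try:
                return DATASETS[(name.strip(), fmt.strip())]
            except KeyError:
                raise ValueError(f"unknown dataset placeholder: {{{key}}}") from None
        # Fail loudly instead of leaving an unrecognized placeholder in the command.
        raise ValueError(f"unknown placeholder: {{{key}}}")

    return PLACEHOLDER.sub(resolve, cmd)


cmd = (
    "python {curator_repo_dir}/tutorials/image/getting-started/image_curation_example.py"
    " --in {dataset:mscoco,wds} --out {session_entry_dir}/results"
)
resolved = substitute_reserved_placeholders(cmd, "/opt/Curator", "/results/session/image_curation")
print(resolved)
```

Raising on an unknown token is a design choice of this sketch: a typo in a benchmark entry then surfaces when the command is built rather than as a confusing failure inside the script.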

Critical Issue:

The image_curation benchmark entry is missing the ray: configuration block to allocate GPUs. The image curation pipeline requires GPUs for 4 stages (each using 0.25 GPUs per worker), but without this config, the benchmark will default to 0 GPUs (benchmarking/run.py:161) and fail or run incorrectly.
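To make the reported failure mode concrete, here is a small sketch of how a missing ray: block could turn into a zero-GPU cluster. The `ray_resources` and `schedulable_stages` helpers are hypothetical and do not come from benchmarking/run.py; only the 0-GPU default and the 0.25 GPUs-per-worker figures are taken from the review.

```python
# Per-worker GPU fractions quoted in the review for the four GPU stages.
STAGE_GPUS = {
    "ImageReaderStage": 0.25,
    "ImageEmbeddingStage": 0.25,
    "ImageAestheticFilterStage": 0.25,
    "ImageNSFWFilterStage": 0.25,
}


def ray_resources(entry: dict) -> dict:
    """Hypothetical helper: read an entry's ray config, defaulting to 0 GPUs."""
    ray_cfg = entry.get("ray", {})
    return {"num_gpus": ray_cfg.get("num_gpus", 0)}


def schedulable_stages(entry: dict) -> list[str]:
    """Return the stages whose per-worker GPU request fits in the cluster."""
    available = ray_resources(entry)["num_gpus"]
    return [name for name, need in STAGE_GPUS.items() if need <= available]


without_ray = {"name": "image_curation"}
with_ray = {"name": "image_curation", "ray": {"num_cpus": 64, "num_gpus": 4}}

print(schedulable_stages(without_ray))    # empty: no stage fits in a 0-GPU cluster
print(len(schedulable_stages(with_ray)))  # all four stages fit when 4 GPUs are requested
```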

Confidence Score: 2/5

- This PR has a critical configuration issue that will cause the benchmark to fail.
- The refactoring in benchmarking/runner/entry.py is solid and the code changes are clean. However, the missing GPU configuration in the new image_curation benchmark entry is a critical issue that will cause the benchmark to fail or run with 0 GPUs when it requires GPUs for multiple pipeline stages. This must be fixed before merging.
- benchmarking/nightly-benchmark.yaml requires GPU configuration for the image_curation entry.

Important Files Changed

| Filename | Overview |
|---|---|
| benchmarking/nightly-benchmark.yaml | Adds image_curation benchmark entry and mscoco datasets; missing ray GPU configuration will cause the GPU-dependent pipeline to fail or run with 0 GPUs |
| benchmarking/runner/entry.py | Refactors placeholder substitution logic by splitting it into separate methods for reserved placeholders and path resolution; adds support for {curator_repo_dir} placeholder |

Sequence Diagram

sequenceDiagram
    participant Runner as Benchmark Runner
    participant Session as Session
    participant Entry as Entry
    participant PathRes as PathResolver
    participant DataRes as DatasetResolver
    participant Ray as Ray Cluster
    participant Script as Image Curation Script

    Runner->>Runner: Load YAML config
    Runner->>Session: create_from_dict(config)
    Session->>PathRes: Create PathResolver
    Session->>DataRes: Create DatasetResolver
    Session->>Session: Create Entry objects
    
    Runner->>Entry: get_command_to_run()
    Entry->>Entry: substitute_reserved_placeholders()<br/>{curator_repo_dir}, {session_entry_dir}, {dataset:...}
    Entry->>PathRes: substitute_container_or_host_paths()<br/>resolve paths for container/host mapping
    Entry-->>Runner: Return resolved command
    
    Runner->>Ray: setup_ray_cluster_and_env()<br/>with num_gpus from entry.ray config
    Note over Ray: Defaults to 0 GPUs if not specified
    
    Runner->>Script: Execute python command
    Script->>Script: create_image_curation_pipeline()
    Note over Script: Pipeline stages use num_gpus_per_worker<br/>ImageReaderStage: 0.25<br/>ImageEmbeddingStage: 0.25<br/>ImageAestheticFilterStage: 0.25<br/>ImageNSFWFilterStage: 0.25
    Script-->>Runner: Return exit code
    
    Runner->>Runner: get_entry_script_persisted_data()<br/>Read metrics.json, params.json, tasks.pkl
    Runner->>Runner: check_requirements_update_results()
    Runner->>Ray: teardown_ray_cluster_and_env()
    Runner->>Runner: Write results.json

greptile-apps

1 file reviewed, 1 comment

greptile-apps · 2026-01-13T18:35:30Z

+  - name: image_curation
+    enabled: true
+    script: "{curator_repo_dir}/tutorials/image/getting-started/image_curation_example.py"
+    args: >-
+      --input-wds-dataset-dir {dataset:mscoco,wds}
+      --output-dataset-dir {session_entry_dir}/results_truncated_100K_mscoco
+      --model-dir {dataset:mscoco_model_weights,files}
+      --batch-size 100
+      --embedding-batch-size 100
+      --aesthetic-batch-size 100
+      --nsfw-batch-size 100
+      --tar-files-per-partition 10
+      --aesthetic-threshold 0.9
+      --nsfw-threshold 0.9
+      --skip-download
+      --verbose


missing ray: configuration block to allocate GPUs

the script requires GPUs for multiple stages (ImageReaderStage uses 0.25 GPUs, ImageEmbeddingStage/ImageAestheticFilterStage/ImageNSFWFilterStage each default to 0.25 GPUs per worker)

other GPU benchmarks like domain_classification_raydata (lines 75-78) include:

    ray:
      num_cpus: 64
      num_gpus: 4
      enable_object_spilling: false
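A check along the following lines could catch this kind of omission before a nightly run. It is only a sketch: the top-level `entries` key, the `entries_missing_gpus` helper, and the idea of passing in a set of GPU benchmark names are assumptions for illustration, not part of the benchmark runner.

```python
def entries_missing_gpus(config: dict, gpu_entry_names: set[str]) -> list[str]:
    """Return enabled GPU entries that do not request at least one GPU via `ray:`."""
    missing = []
    for entry in config.get("entries", []):
        if not entry.get("enabled", True) or entry["name"] not in gpu_entry_names:
            continue
        if entry.get("ray", {}).get("num_gpus", 0) <= 0:
            missing.append(entry["name"])
    return missing


# Dict form of what a YAML loader might return; the "entries" key is assumed.
config = {
    "entries": [
        {
            "name": "domain_classification_raydata",
            "enabled": True,
            "ray": {"num_cpus": 64, "num_gpus": 4, "enable_object_spilling": False},
        },
        {"name": "image_curation", "enabled": True},
    ]
}

missing = entries_missing_gpus(config, {"domain_classification_raydata", "image_curation"})
print(missing)
```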

greptile-apps

5 files reviewed, 1 comment

greptile-apps · 2026-01-13T20:06:50Z

+  - name: image_curation
+    enabled: true
+    script: "{curator_repo_dir}/tutorials/image/getting-started/image_curation_example.py"
+    args: >-
+      --input-wds-dataset-dir {dataset:mscoco,wds}
+      --output-dataset-dir {session_entry_dir}/results_truncated_100K_mscoco
+      --model-dir {dataset:mscoco_model_weights,files}
+      --batch-size 100
+      --embedding-batch-size 100
+      --aesthetic-batch-size 100
+      --nsfw-batch-size 100
+      --tar-files-per-partition 10
+      --aesthetic-threshold 0.9
+      --nsfw-threshold 0.9
+      --skip-download
+      --verbose


logic: missing ray: configuration block for GPU allocation

the image curation script requires GPUs for multiple stages (ImageReaderStage, ImageEmbeddingStage, ImageAestheticFilterStage, ImageNSFWFilterStage each use 0.25 GPUs per worker by default)

add configuration like other GPU benchmarks:

Suggested change

   - name: image_curation
     enabled: true
     script: "{curator_repo_dir}/tutorials/image/getting-started/image_curation_example.py"
     args: >-
       --input-wds-dataset-dir {dataset:mscoco,wds}
       --output-dataset-dir {session_entry_dir}/results_truncated_100K_mscoco
       --model-dir {dataset:mscoco_model_weights,files}
       --batch-size 100
       --embedding-batch-size 100
       --aesthetic-batch-size 100
       --nsfw-batch-size 100
       --tar-files-per-partition 10
       --aesthetic-threshold 0.9
       --nsfw-threshold 0.9
       --skip-download
       --verbose
+    ray:
+      num_cpus: 64
+      num_gpus: 4
+      enable_object_spilling: false

greptile-apps

4 files reviewed, 1 comment

greptile-apps · 2026-01-13T20:34:45Z

+  - name: image_curation
+    enabled: true
+    script: "{curator_repo_dir}/tutorials/image/getting-started/image_curation_example.py"
+    args: >-
+      --input-wds-dataset-dir {dataset:mscoco,wds}
+      --output-dataset-dir {session_entry_dir}/results_truncated_100K_mscoco
+      --model-dir {dataset:mscoco_model_weights,files}
+      --batch-size 100
+      --embedding-batch-size 100
+      --aesthetic-batch-size 100
+      --nsfw-batch-size 100
+      --tar-files-per-partition 10
+      --aesthetic-threshold 0.9
+      --nsfw-threshold 0.9
+      --skip-download
+      --verbose


logic: missing ray: configuration block for GPU allocation

the image curation pipeline requires GPUs (4 stages use 0.25 GPUs per worker by default: ImageReaderStage, ImageEmbeddingStage, ImageAestheticFilterStage, ImageNSFWFilterStage)

without this config, the benchmark will use 0 GPUs (default from benchmarking/run.py:161) and likely fail or run very slowly

add GPU config like other benchmarks (e.g., lines 75-78):

Suggested change

   - name: image_curation
     enabled: true
     script: "{curator_repo_dir}/tutorials/image/getting-started/image_curation_example.py"
     args: >-
       --input-wds-dataset-dir {dataset:mscoco,wds}
       --output-dataset-dir {session_entry_dir}/results_truncated_100K_mscoco
       --model-dir {dataset:mscoco_model_weights,files}
       --batch-size 100
       --embedding-batch-size 100
       --aesthetic-batch-size 100
       --nsfw-batch-size 100
       --tar-files-per-partition 10
       --aesthetic-threshold 0.9
       --nsfw-threshold 0.9
       --skip-download
       --verbose
+    ray:
+      num_cpus: 64
+      num_gpus: 4
+      enable_object_spilling: false

greptile-apps

4 files reviewed, 1 comment

greptile-apps · 2026-01-14T00:27:29Z

+  - name: image_curation
+    enabled: true
+    script: "{curator_repo_dir}/tutorials/image/getting-started/image_curation_example.py"
+    args: >-
+      --input-wds-dataset-dir {dataset:mscoco,wds}
+      --output-dataset-dir {session_entry_dir}/results_truncated_100K_mscoco
+      --model-dir {dataset:mscoco_model_weights,files}
+      --batch-size 100
+      --embedding-batch-size 100
+      --aesthetic-batch-size 100
+      --nsfw-batch-size 100
+      --tar-files-per-partition 10
+      --aesthetic-threshold 0.9
+      --nsfw-threshold 0.9
+      --skip-download
+      --verbose


logic: missing ray: configuration to allocate GPUs

the image curation pipeline uses GPUs in 4 stages: ImageReaderStage (0.25), ImageEmbeddingStage (0.25), ImageAestheticFilterStage (0.25), and ImageNSFWFilterStage (0.25) - see tutorials/image/getting-started/image_curation_example.py:50,56,65,74

without this config, benchmarking/run.py:161 defaults to 0 GPUs, causing the pipeline to fail or run incorrectly

add GPU allocation like other GPU benchmarks (e.g., lines 75-78):

Suggested change

   - name: image_curation
     enabled: true
     script: "{curator_repo_dir}/tutorials/image/getting-started/image_curation_example.py"
     args: >-
       --input-wds-dataset-dir {dataset:mscoco,wds}
       --output-dataset-dir {session_entry_dir}/results_truncated_100K_mscoco
       --model-dir {dataset:mscoco_model_weights,files}
       --batch-size 100
       --embedding-batch-size 100
       --aesthetic-batch-size 100
       --nsfw-batch-size 100
       --tar-files-per-partition 10
       --aesthetic-threshold 0.9
       --nsfw-threshold 0.9
       --skip-download
       --verbose
+    ray:
+      num_cpus: 64
+      num_gpus: 4
+      enable_object_spilling: false

rlratzel and others added 14 commits December 12, 2025 21:46

Updates env var names to match other top-level scripts, does not tag images with :latest by default, adds session name to slack report.

89bc074

Signed-off-by: rlratzel <rratzel@nvidia.com>

Updates env var for consistency

9b64594

Signed-off-by: rlratzel <rratzel@nvidia.com>

Fixes formatting of help message.

34564d4

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge branch 'main' into 2602_benchmark_infra_updates

7ef2bf9

Merge remote-tracking branch 'upstream/main' into 2602_benchmark_infra_updates

0692fbb

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge branch '2602_benchmark_infra_updates' of https://github.com/rlratzel/curator into 2602_benchmark_infra_updates

82383a0

Signed-off-by: rlratzel <rratzel@nvidia.com>

Removes unused support for an artifacts dir.

3ec5b30

Signed-off-by: rlratzel <rratzel@nvidia.com>

Removes unconditional use of --benchmarks-results-dir arg when running script to allow for more flexibility.

ed1b95e

Signed-off-by: rlratzel <rratzel@nvidia.com>

Fixes warning condition about not converting to number when only human-readable output is needed, updates paths to benchmark output dir.

59de0d9

Signed-off-by: rlratzel <rratzel@nvidia.com>

Updates results path to be session_entry_dir so framework can find results

53ef91d

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge branch '2602_benchmark_infra_updates' into 26.02-add_image_bench

af646d8

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge remote-tracking branch 'upstream/main' into 26.02-add_image_bench

eba1928

Signed-off-by: rlratzel <rratzel@nvidia.com>

Adds initial entry for image curation benchmark

cc2bf93

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge remote-tracking branch 'upstream/main' into 26.02-add_image_bench

026d79c

Signed-off-by: rlratzel <rratzel@nvidia.com>

rlratzel added 5 commits January 6, 2026 08:52

Merge remote-tracking branch 'upstream/main' into 26.02-add_image_bench

fc61726

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge remote-tracking branch 'upstream/main' into 26.02-add_image_bench

8a7f0b2

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge remote-tracking branch 'upstream' into 26.02-add_image_bench

6e90da1

Signed-off-by: rlratzel <rratzel@nvidia.com>

Adds curator_repo_dir reserved placeholder, fixes bug where invalid placeholders were silently ignored, comment cleanup.

69b1d06

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge remote-tracking branch 'upstream' into 26.02-add_image_bench

58741c4

Signed-off-by: rlratzel <rratzel@nvidia.com>

rlratzel marked this pull request as ready for review January 7, 2026 17:44

sarahyurick approved these changes Jan 7, 2026

rlratzel added 2 commits January 7, 2026 16:53

Merge remote-tracking branch 'upstream' into 26.02-add_image_bench

323c346

Signed-off-by: rlratzel <rratzel@nvidia.com>

Merge remote-tracking branch 'upstream' into 26.02-add_image_bench

a991a07

Signed-off-by: rlratzel <rratzel@nvidia.com>

rlratzel mentioned this pull request Jan 10, 2026

[benchmarking] Adds audio curation benchmark to nightly #1360

Merged

Removes unneeded get_obj_for_json utility.

12931e8

Signed-off-by: rlratzel <rratzel@nvidia.com>

copy-pr-bot Bot temporarily deployed to test January 13, 2026 00:02 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 13, 2026 00:02 Error

copy-pr-bot Bot temporarily deployed to nemo-ci January 13, 2026 18:32 Inactive

greptile-apps Bot reviewed Jan 13, 2026

Merge remote-tracking branch 'upstream/main' into 26.02-add_image_bench

3e556f7

Signed-off-by: rlratzel <rratzel@nvidia.com>

greptile-apps Bot reviewed Jan 13, 2026

Adds JSON util back

fa311ed

Signed-off-by: rlratzel <rratzel@nvidia.com>

greptile-apps Bot reviewed Jan 13, 2026

Merge branch 'main' into 26.02-add_image_bench

4aa467d

praateekmahajan enabled auto-merge (squash) January 14, 2026 00:25

praateekmahajan merged commit d9ade75 into NVIDIA-NeMo:main Jan 14, 2026
18 checks passed

greptile-apps Bot reviewed Jan 14, 2026

copy-pr-bot Bot pushed a commit that referenced this pull request Feb 19, 2026

[benchmarking] Adds image curation benchmark to nightly (#1341)

e22dbc4

praateekmahajan merged 27 commits into NVIDIA-NeMo:main from rlratzel:26.02-add_image_bench

4 participants
