Revert "Remove nvenc/dec for xenna 0.1.6" #1374

Merged
thomasdhc merged 6 commits into main from revert-1202-aot/remove-nvencs
Jan 20, 2026

Conversation

@ayushdg ayushdg commented Jan 14, 2026

Reverts #1202 because of issues we see with Xenna's GPU usage versus the resources allocated via Ray.

greptile-apps Bot commented Jan 14, 2026

Greptile Summary

This PR reverts changes from #1202 to address GPU allocation issues observed with Xenna's GPU usage vs Ray-allocated resources.

Key changes:

  • Downgrades cosmos-xenna from 0.1.8 to 0.1.2
  • Re-introduces nvdec/nvenc resource tracking in the Resources dataclass
  • Updates ClipTranscodingStage to partition GPU resources based on nvenc counts instead of fractional GPU shares
  • Updates import paths in Xenna adapter from pipelines.private.resources to ray_utils.resources
  • Adds explicit error handling in Ray Data adapter when nvdecs/nvencs are requested without GPU support

Impact:

  • Video transcoding stages now request nvenc/nvdec units directly rather than fractional GPU allocations
  • Ray Data backend explicitly rejects nvdec/nvenc resource requests (with a TODO for future support)
  • Xenna backend now passes nvdecs, nvencs, and entire_gpu parameters to resource allocation
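
As a rough illustration of the resource fields and the Ray Data rejection described above, here is a minimal sketch. It is not the code from this PR: the `Resources` class is a simplified stand-in whose field names follow the summary, and `to_ray_data_options` is a hypothetical helper name chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Simplified stand-in for the stage resource spec (field names from the summary)."""

    cpus: float = 1.0
    gpus: float = 0.0
    gpu_memory_gb: float = 0.0
    nvdecs: int = 0
    nvencs: int = 0
    entire_gpu: bool = False


def to_ray_data_options(resources: Resources) -> dict[str, float]:
    """Hypothetical translation step for a backend without NVDEC/NVENC accounting."""
    if resources.nvdecs or resources.nvencs:
        # Assumption: the backend rejects these units outright until it can track them.
        msg = "nvdecs/nvencs are not supported by this backend yet."
        raise ValueError(msg)
    return {"num_cpus": resources.cpus, "num_gpus": resources.gpus}
```

Failing loudly here keeps a stage that asks for encoder units from being scheduled silently on a backend that cannot account for them.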

Confidence Score: 4/5

  • This revert PR is safe to merge as it addresses known GPU allocation issues with Xenna.
  • The PR is a clean revert addressing a known issue. The changes are straightforward and align with the linked cosmos-curate implementation. Minor concerns exist around edge cases (empty GPU list, zero division) but these represent scenarios where the stage wouldn't function anyway.
  • Files to review closely: nemo_curator/stages/video/clipping/clip_extraction_stages.py (GPU info access), tests/stages/video/clipping/test_clip_transcoding_stage.py (unused mock class)

Important Files Changed

| Filename | Overview |
|---|---|
| nemo_curator/stages/video/clipping/clip_extraction_stages.py | Reverts GPU resource allocation to use nvdec/nvenc units instead of fractional GPU shares, but accessing [0] on GPU info list without validation could raise IndexError if no GPUs are detected. |
| nemo_curator/stages/resources.py | Adds nvdecs/nvencs fields to Resources dataclass and removes automatic gpus=1.0 assignment when entire_gpu=True. This is intentional as Xenna handles entire_gpu allocation differently. |
| nemo_curator/backends/xenna/adapter.py | Updates import paths from pipelines.private.resources to ray_utils.resources to match cosmos-xenna 0.1.2 API, and passes nvdecs/nvencs/entire_gpu to XennaResources. |
| nemo_curator/backends/experimental/ray_data/adapter.py | Adds explicit error when nvdecs/nvencs are requested without GPU allocation in Ray Data backend, which does not yet support these resources. |
| pyproject.toml | Downgrades cosmos-xenna from 0.1.8 to 0.1.2 to address GPU usage vs allocated resources issues via Ray. |
| tests/stages/video/clipping/test_clip_transcoding_stage.py | Removes tests for old GPU allocation behavior but adds unused MockGpuResources class. Existing tests need updating to cover new nvenc resource logic. |

Sequence Diagram

sequenceDiagram
    participant User
    participant ClipTranscodingStage
    participant cosmos_xenna
    participant Resources
    participant XennaAdapter
    participant XennaResources

    User->>ClipTranscodingStage: __post_init__(encoder="h264_nvenc")
    ClipTranscodingStage->>cosmos_xenna: _get_local_gpu_info()
    cosmos_xenna-->>ClipTranscodingStage: [GpuInfo(name="...")]
    ClipTranscodingStage->>cosmos_xenna: _make_gpu_resources_from_gpu_name(name)
    cosmos_xenna-->>ClipTranscodingStage: GpuResources(num_nvencs=N)
    ClipTranscodingStage->>Resources: Resources(nvencs=N/streams, gpu_memory_gb=M/streams)
    Resources-->>ClipTranscodingStage: resources

    Note over XennaAdapter: During pipeline execution
    XennaAdapter->>Resources: processing_stage.resources
    Resources-->>XennaAdapter: {cpus, gpus, nvdecs, nvencs, entire_gpu}
    XennaAdapter->>XennaResources: XennaResources(cpus, gpus, nvdecs, nvencs, entire_gpu)
    XennaResources-->>XennaAdapter: resource allocation for Ray

@greptile-apps greptile-apps Bot left a comment

Additional Comments (1)

  1. tests/stages/video/clipping/test_clip_transcoding_stage.py, lines 31-42

    style: Both MockGpuInfo and MockGpuResources classes are defined but never used in any tests. Consider removing them or adding tests that utilize these mocks.

7 files reviewed, 1 comment

@greptile-apps greptile-apps Bot left a comment

7 files reviewed, 2 comments

Comment on lines +82 to +83
gpu_info = _get_local_gpu_info()[0]
nvencs = _make_gpu_resources_from_gpu_name(gpu_info.name).num_nvencs

logic: Accessing [0] assumes at least one GPU is available. This will raise an IndexError if no GPUs are detected. Should this handle the case where no GPUs are available or add validation?
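
One possible way to address this is to fail early with a descriptive error instead of an IndexError. The sketch below is illustrative only: `GpuInfo` is a stand-in for the object returned by the helper in the diff (only its `name` attribute appears there), and `first_gpu_or_raise` is a hypothetical helper name, not part of cosmos-xenna or this repository.

```python
from dataclasses import dataclass


@dataclass
class GpuInfo:
    """Stand-in for the GPU info object; only `name` is used in the diff."""

    name: str


def first_gpu_or_raise(gpus: list[GpuInfo]) -> GpuInfo:
    """Return the first detected GPU, or raise a clear error if there is none."""
    if not gpus:
        msg = "An NVENC encoder was requested, but no GPUs were detected on this node."
        raise RuntimeError(msg)
    return gpus[0]
```

In the stage, the result of the GPU query would be passed through such a guard before reading `.name`.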

Comment on lines +86 to +87
nvencs=nvencs // self.nb_streams_per_gpu, gpu_memory_gb=gpu_memory_gb // self.nb_streams_per_gpu
)

logic: Integer division by nb_streams_per_gpu could result in zero resources if the divisor is larger than the dividend. What should happen when nb_streams_per_gpu exceeds available nvencs or gpu_memory_gb?
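
To make the concern concrete: with 3 NVENC units and 4 streams per GPU, `3 // 4` evaluates to `0`, so the stage would request no encoder units at all. A minimal sketch of one way to guard against this follows; `per_stream_share` is a hypothetical helper name, and raising an error (rather than, say, clamping to 1) is an assumption about the desired behavior.

```python
def per_stream_share(total: int, nb_streams_per_gpu: int, what: str) -> int:
    """Split an integer resource across streams, refusing a zero-sized share."""
    if nb_streams_per_gpu <= 0:
        msg = f"nb_streams_per_gpu must be positive, got {nb_streams_per_gpu}"
        raise ValueError(msg)
    share = total // nb_streams_per_gpu
    if share == 0:
        # Floor division silently yields 0 when the divisor exceeds the total.
        msg = f"nb_streams_per_gpu={nb_streams_per_gpu} exceeds available {what} ({total})"
        raise ValueError(msg)
    return share
```

The same check would apply to both `nvencs` and `gpu_memory_gb`, since both are divided by `nb_streams_per_gpu` in the diff.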

greptile-apps Bot commented Jan 16, 2026

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@greptile-apps greptile-apps Bot left a comment

7 files reviewed, 1 comment

Comment on lines +37 to +42
# Mock GPU resources class to simulate GPU resources
class MockGpuResources:
    def __init__(self, num_nvencs: int = 3, num_nvdecs: int = 3):
        self.num_nvencs = num_nvencs
        self.num_nvdecs = num_nvdecs

style: MockGpuResources class is defined but never used in any test. Either add tests for the new nvenc-based resource allocation in ClipTranscodingStage.__post_init__ that use this mock, or remove the dead code.
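
A test that exercises the mocks could look roughly like the sketch below. It is self-contained and does not import the real stage: `gpu_helpers` and `nvencs_per_stream` are hypothetical stand-ins for the two helper calls and the division shown in the diff, and the `MockGpuInfo` constructor is assumed, since only `MockGpuResources` is quoted above.

```python
from types import SimpleNamespace
from unittest.mock import patch


class MockGpuInfo:
    def __init__(self, name: str = "mock-gpu"):  # assumed constructor
        self.name = name


class MockGpuResources:
    def __init__(self, num_nvencs: int = 3, num_nvdecs: int = 3):
        self.num_nvencs = num_nvencs
        self.num_nvdecs = num_nvdecs


# Hypothetical stand-in for the module that provides the two GPU helpers.
gpu_helpers = SimpleNamespace(
    get_local_gpu_info=lambda: [],
    make_gpu_resources_from_gpu_name=lambda name: None,
)


def nvencs_per_stream(nb_streams_per_gpu: int) -> int:
    """Mirror of the logic in the diff: look up NVENC count, then split it."""
    gpu_info = gpu_helpers.get_local_gpu_info()[0]
    nvencs = gpu_helpers.make_gpu_resources_from_gpu_name(gpu_info.name).num_nvencs
    return nvencs // nb_streams_per_gpu


def test_nvencs_split_across_streams() -> None:
    with patch.object(gpu_helpers, "get_local_gpu_info", return_value=[MockGpuInfo()]), \
         patch.object(
             gpu_helpers,
             "make_gpu_resources_from_gpu_name",
             return_value=MockGpuResources(num_nvencs=3),
         ):
        assert nvencs_per_stream(3) == 1
        assert nvencs_per_stream(1) == 3
```

In the real test module, the patch targets would be the helper names as imported by clip_extraction_stages.py, and the assertion would be made on the resources set by ClipTranscodingStage.__post_init__.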

Labels

r1.1.0 Pick this label for auto cherry-picking into r1.1.0

4 participants