Add OrderScheme partitioning metadata #853

Open
rjzamora wants to merge 16 commits into rapidsai:main from rjzamora:orderscheme-metadata

Conversation

@rjzamora
Member

Adds a mechanism to track ordering within ChannelMetadata. This is needed in cudf-polars to keep track of sorted data.

@rjzamora rjzamora self-assigned this Feb 10, 2026
@rjzamora rjzamora added the feature request New feature or request label Feb 10, 2026
@copy-pr-bot

copy-pr-bot bot commented Feb 10, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rjzamora rjzamora added the non-breaking Introduces a non-breaking change label Feb 12, 2026
@rjzamora
Member Author

\okay to test

@rjzamora rjzamora marked this pull request as ready for review March 5, 2026 20:56
@rjzamora rjzamora requested review from a team as code owners March 5, 2026 20:56
@rjzamora rjzamora changed the title [WIP] Add OrderScheme partitioning metadata Add OrderScheme partitioning metadata Apr 13, 2026
@rjzamora
Member Author

@TomAugspurger - I'd be interested to get your thoughts on the specific contract proposed here.

Contributor

@TomAugspurger TomAugspurger left a comment

High-level, I think this makes sense as a way to support what we need out of cudf-polars. Left some comments and questions on the implementation.

Comment thread cpp/include/rapidsmpf/streaming/cudf/channel_metadata.hpp Outdated
Comment thread cpp/include/rapidsmpf/streaming/cudf/channel_metadata.hpp Outdated
*
* @note Two OrderSchemes are equal if they have the same column indices,
* orders, null_orders, strict_boundary flag, and boundary values. Boundary
* comparison currently uses table shape only (full content comparison TBD).
Contributor

Looks like a todo here. Solving that here, or file a separate issue to track it?

Member Author

I updated this comment a bit. I think it's "okay" for us to do a "shallow" equality and document that this is the case. We can implement a separate API for strict equality later (or just let cudf-polars do this with pylibcudf).

Comment on lines +42 to +45
// Note: Full content comparison would require device-side comparison.
// For now, we consider tables with same dimensions as potentially equal.
// A more complete implementation would use cudf utilities for comparison.
return true;
Contributor

This seems thorny.

Do we want to avoid overloading == here entirely? I suspect that if we do ever need to compare two OrderScheme structs in a way that includes a comparison of the boundary values, then we'll need an API that provides a stream and mr.

Member Author

Agree that this is thorny. I don't think we should ever require == to enforce "deep" equality here. We need a separate API for that.
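
To make the shallow-versus-deep split concrete, here is a minimal Python sketch of the contract described in this thread. It is only an illustration: `OrderSchemeSketch` and `deep_equal` are hypothetical names, and the boundaries are modeled as host-side tuples rather than a device table. A real deep comparison would presumably also need a stream and a memory resource, as noted above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-in for a device table: a tuple of rows held on the host.
Rows = Tuple[Tuple[int, ...], ...]


@dataclass(eq=False)
class OrderSchemeSketch:
    """Toy model of the metadata discussed here; not the rapidsmpf class."""

    column_indices: Tuple[int, ...]
    orders: Tuple[str, ...]
    null_orders: Tuple[str, ...]
    strict_boundary: bool = False
    boundaries: Optional[Rows] = None

    def _boundary_shape(self) -> Optional[Tuple[int, int]]:
        # (num_rows, num_columns), or None when no boundaries are attached.
        if self.boundaries is None:
            return None
        ncols = len(self.boundaries[0]) if self.boundaries else 0
        return (len(self.boundaries), ncols)

    def __eq__(self, other: object) -> bool:
        # Shallow equality: compares boundary shape, never boundary contents.
        if not isinstance(other, OrderSchemeSketch):
            return NotImplemented
        return (
            self.column_indices == other.column_indices
            and self.orders == other.orders
            and self.null_orders == other.null_orders
            and self.strict_boundary == other.strict_boundary
            and self._boundary_shape() == other._boundary_shape()
        )


def deep_equal(a: OrderSchemeSketch, b: OrderSchemeSketch) -> bool:
    """Explicit deep comparison, kept out of ``==`` on purpose.

    Here the contents are host tuples, so a plain comparison works. A
    device-backed version would need extra arguments (assumed: stream, mr).
    """
    return a == b and a.boundaries == b.boundaries
```

With this split, two schemes whose boundary tables have the same shape but different values compare equal under `==` and unequal under `deep_equal`, which is exactly the behavior that needs to be documented.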

Comment thread cpp/tests/streaming/test_channel_metadata.cpp Outdated
Comment on lines +39 to +40
@property
def has_boundaries(self) -> bool: ...
Contributor

Do we need this? Is it just order_scheme.boundaries is not None?
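
As a small sketch of the redundancy in question, assuming `boundaries` returns `None` when no boundary table is attached (the class below is a hypothetical stand-in, not the actual binding):

```python
from typing import Optional, Tuple


class SchemeWithProperty:
    """Hypothetical holder that mirrors only the two attribute names."""

    def __init__(self, boundaries: Optional[Tuple[int, ...]] = None) -> None:
        self.boundaries = boundaries

    @property
    def has_boundaries(self) -> bool:
        # Pure convenience: carries no information beyond the None check.
        return self.boundaries is not None
```

Under that assumption, callers can always write `scheme.boundaries is not None` instead. Note that an empty boundary table would still count as "has boundaries" either way.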

Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyi Outdated
# SPDX-License-Identifier: Apache-2.0
"""Channel metadata types for streaming pipelines."""

from cuda.bindings.cyruntime cimport cudaStream_t
Contributor

I don't see this import elsewhere. Do we need it?

In from rmm.pylibrmm.stream cimport Stream, we use Stream._from_cudaStream_t. But I'm not sure about the type.

@rjzamora
Member Author

Update:

I revised this PR a few times, and I think it's ready for review. I'd like it reviewed in two phases:

  1. Does the public-facing OrderScheme API make sense?
  2. Do the internal details/implementation make sense? It may make sense for someone with stronger C++ and/or Cython skills to take over this PR or replace it if there are many issues. My primary goal here was to prototype what I think we need in cudf-polars.

One open question I have about the internal implementation is: Should the sort "boundaries" be tracked as a spillable unique_ptr<TableChunk> internally? I decided the answer is "yes" for now. However, since ChannelMetadata is typically copied when it is pushed into a Channel, this means we need to copy this underlying TableChunk often as it moves through the pipeline.
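
To illustrate the copy cost, here is a toy Python model, not the rapidsmpf implementation. `CountingChunk`, `MetadataSketch`, and `push_through` are hypothetical names, and `copy.deepcopy` stands in for the copy that happens when metadata is pushed into a Channel.

```python
import copy
from dataclasses import dataclass
from typing import List


class CountingChunk:
    """Hypothetical stand-in for a uniquely owned TableChunk."""

    copies_made = 0  # class-level counter of deep copies performed

    def __init__(self, values: List[int]) -> None:
        self.values = values

    def __deepcopy__(self, memo: dict) -> "CountingChunk":
        # Unique ownership means every metadata copy duplicates the data.
        CountingChunk.copies_made += 1
        return CountingChunk(list(self.values))


@dataclass
class MetadataSketch:
    boundaries: CountingChunk


def push_through(metadata: MetadataSketch, n_channels: int) -> MetadataSketch:
    """Model each Channel push as one deep copy of the metadata."""
    for _ in range(n_channels):
        metadata = copy.deepcopy(metadata)
    return metadata
```

In this model, pushing one metadata object through three channels copies the boundary data three times, so the cost grows with pipeline depth. Whether that matters in practice depends on how large the boundary tables are.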

Labels

feature request New feature or request non-breaking Introduces a non-breaking change

3 participants