Skip to content

[Python] Add OrderScheme partitioning metadata#853

Merged
rapids-bot[bot] merged 56 commits intorapidsai:mainfrom
rjzamora:orderscheme-metadata
May 6, 2026
Merged

[Python] Add OrderScheme partitioning metadata#853
rapids-bot[bot] merged 56 commits intorapidsai:mainfrom
rjzamora:orderscheme-metadata

Conversation

@rjzamora
Copy link
Copy Markdown
Member

@rjzamora rjzamora commented Feb 10, 2026

Depends on #993 (adds the C++ API)
Adds mechanism to track ordering within ChannelMetadata. Needed in cudf-polars to keep track of sorted data.

Proposed API

@dataclass(frozen=True, slots=True)
class OrderKey:
    """Sort key: column index, direction, and null ordering (``pylibcudf`` enums)."""

    column_index: int
    order: plc.types.Order
    null_order: plc.types.NullOrder

class OrderScheme:
    def __init__(
        self,
        keys: Sequence[OrderKey],
        boundaries: TableChunk,
        *,
        strict_boundaries: bool = False,
    ) -> None: ...
    @property
    def keys(self) -> tuple[OrderKey, ...]: ...
    @property
    def strict_boundaries(self) -> bool: ...
    @property
    def num_boundaries(self) -> int: ...
    def replace_keys(self, new_keys: Sequence[OrderKey]) -> OrderScheme: ...
    def boundaries_aligned_with(self, other: OrderScheme) -> bool: ...
    # Shallow equality: keys + strict_boundaries + boundary shape only.
    def __eq__(self, other: object) -> bool: ...
    def __repr__(self) -> str: ...

@rjzamora rjzamora self-assigned this Feb 10, 2026
@rjzamora rjzamora added the feature request New feature or request label Feb 10, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Feb 10, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rjzamora rjzamora added the non-breaking Introduces a non-breaking change label Feb 12, 2026
@rjzamora
Copy link
Copy Markdown
Member Author

\okay to test

@rjzamora rjzamora marked this pull request as ready for review March 5, 2026 20:56
@rjzamora rjzamora requested review from a team as code owners March 5, 2026 20:56
@rjzamora rjzamora changed the title [WIP] Add OrderScheme partitioning metadata Add OrderScheme partitioning metadata Apr 13, 2026
@rjzamora
Copy link
Copy Markdown
Member Author

@TomAugspurger - I'd be interested to get your thoughts on the specific contract proposed here.

Copy link
Copy Markdown
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level, I think this makes sense as a way to support what we need out of cudf-polars. Left some comments and questions on the implementation.

Comment thread cpp/include/rapidsmpf/streaming/cudf/channel_metadata.hpp Outdated
Comment thread cpp/include/rapidsmpf/streaming/cudf/channel_metadata.hpp Outdated
*
* @note Two OrderSchemes are equal if they have the same column indices,
* orders, null_orders, strict_boundary flag, and boundary values. Boundary
* comparison currently uses table shape only (full content comparison TBD).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a todo here. Solving that here, or file a separate issue to track it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this comment a bit. I think its "okay" for us to do a "shallow" equality and document that this is the case. We can implement a separate API for strict equality later (or just let cudf-polars do this pylibcudf).

Comment on lines +42 to +45
// Note: Full content comparison would require device-side comparison.
// For now, we consider tables with same dimensions as potentially equal.
// A more complete implementation would use cudf utilities for comparison.
return true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems thorny.

Do we want to avoid overloading == here entirely? I suspect that if we do ever need to compare two OrderScheme structs in a way that includes a comparison of the boundary values, then we'll need an API that provides a stream and mr.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that this is thorny. I don't think we should ever require == to enforce "deep" equality here. We need a separate API for that.

Comment thread cpp/tests/streaming/test_channel_metadata.cpp Outdated
Comment on lines +39 to +40
@property
def has_boundaries(self) -> bool: ...
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Is it just order_scheme.boundaries is not None?

Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyi Outdated
# SPDX-License-Identifier: Apache-2.0
"""Channel metadata types for streaming pipelines."""

from cuda.bindings.cyruntime cimport cudaStream_t
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this import elsewhere. Do we need it?

In

from rmm.pylibrmm.stream cimport Stream
, we use Stream._from_cudaStream_t. But I'm not sure about the type.

rapids-bot Bot pushed a commit that referenced this pull request May 1, 2026
**NOTE**: This PR is the C++ component of #853

Adds mechanism to track ordering within ``ChannelMetadata``. Needed in cudf-polars to keep track of sorted data.

### Proposed API

```c++
struct OrderKey {
    cudf::size_type column_index;  ///< Column to sort on.
    cudf::order order;  ///< ASCENDING or DESCENDING.
    cudf::null_order null_order;  ///< BEFORE or AFTER.

    /**
     * @brief Equality comparison.
     * @return True if all fields are equal.
     */
    bool operator==(OrderKey const&) const = default;
};

struct OrderScheme {
    std::vector<OrderKey> keys;  ///< Sort keys (column, order, null_order per entry).
    std::unique_ptr<TableChunk> boundaries;  ///< N-1 boundary rows for N partitions.
    bool strict_boundaries{false};  ///< Whether unique keys must be mapped to a distinct chunk
```

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #993
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pxd Outdated
Copy link
Copy Markdown
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the cython bindings want a bit of a tidy up now we've changed how the C++ works.

Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyi Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyi Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment on lines +311 to +317
cdef vector[cpp_OrderKey] cpp_keys
for key in keys:
if not isinstance(key, OrderKey):
raise TypeError(
f"keys must contain OrderKey objects, got {type(key).__name__}"
)
cpp_keys.push_back((<OrderKey>key)._key)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, how about:

Suggested change
cdef vector[cpp_OrderKey] cpp_keys
for key in keys:
if not isinstance(key, OrderKey):
raise TypeError(
f"keys must contain OrderKey objects, got {type(key).__name__}"
)
cpp_keys.push_back((<OrderKey>key)._key)
cdef vector[cpp_OrderKey] cpp_keys
for key in keys:
cpp_keys.push_back((<OrderKey?>key)._key)

Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
@rjzamora
Copy link
Copy Markdown
Member Author

rjzamora commented May 1, 2026

I feel like the cython bindings want a bit of a tidy up now we've changed how the C++ works.

Thanks for the quick review @wence- ! I addressed your suggestions, but let me know if you're looking for a larger overhaul here.

Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pxd
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pxd Outdated
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pxd
Comment thread python/rapidsmpf/rapidsmpf/streaming/cudf/channel_metadata.pyx Outdated
Comment thread python/rapidsmpf/rapidsmpf/tests/streaming/test_channel_metadata.py Outdated
Comment thread python/rapidsmpf/rapidsmpf/tests/streaming/test_channel_metadata.py Outdated
Comment thread python/rapidsmpf/rapidsmpf/tests/streaming/test_channel_metadata.py Outdated
Comment thread python/rapidsmpf/rapidsmpf/tests/streaming/test_channel_metadata.py Outdated
@rjzamora
Copy link
Copy Markdown
Member Author

rjzamora commented May 5, 2026

Thanks (again) for the review @wence- !

@rjzamora
Copy link
Copy Markdown
Member Author

rjzamora commented May 6, 2026

/merge

@rapids-bot rapids-bot Bot merged commit e637ad4 into rapidsai:main May 6, 2026
58 checks passed
@rjzamora rjzamora deleted the orderscheme-metadata branch May 6, 2026 14:22
rapids-bot Bot pushed a commit to rapidsai/cudf that referenced this pull request May 7, 2026
- Follow up to #22315 - Further revises `sort_actor` in preparation for rapidsai/rapidsmpf#853
- Part of #22128
- Breaks apart `sort_actor` logic into modular steps, so we can avoid collecting boundaries when we already know the boundaries (future work).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #22350
galipremsagar pushed a commit to galipremsagar/cudf that referenced this pull request May 8, 2026
…sai#22350)

- Follow up to rapidsai#22315 - Further revises `sort_actor` in preparation for rapidsai/rapidsmpf#853
- Part of rapidsai#22128
- Breaks apart `sort_actor` logic into modular steps, so we can avoid collecting boundaries when we already know the boundaries (future work).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai#22350
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature request New feature or request non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants