[Python] Add OrderScheme partitioning metadata#853
[Python] Add OrderScheme partitioning metadata#853rapids-bot[bot] merged 56 commits intorapidsai:mainfrom
OrderScheme partitioning metadata#853Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
\okay to test |
OrderScheme partitioning metadataOrderScheme partitioning metadata
|
@TomAugspurger - I'd be interested to get your thoughts on the specific contract proposed here. |
TomAugspurger
left a comment
There was a problem hiding this comment.
High-level, I think this makes sense as a way to support what we need out of cudf-polars. Left some comments and questions on the implementation.
| * | ||
| * @note Two OrderSchemes are equal if they have the same column indices, | ||
| * orders, null_orders, strict_boundary flag, and boundary values. Boundary | ||
| * comparison currently uses table shape only (full content comparison TBD). |
There was a problem hiding this comment.
Looks like a todo here. Solving that here, or file a separate issue to track it?
There was a problem hiding this comment.
I updated this comment a bit. I think its "okay" for us to do a "shallow" equality and document that this is the case. We can implement a separate API for strict equality later (or just let cudf-polars do this pylibcudf).
| // Note: Full content comparison would require device-side comparison. | ||
| // For now, we consider tables with same dimensions as potentially equal. | ||
| // A more complete implementation would use cudf utilities for comparison. | ||
| return true; |
There was a problem hiding this comment.
This seems thorny.
Do we want to avoid overloading == here entirely? I suspect that if we do ever need to compare two OrderScheme structs in a way that includes a comparison of the boundary values, then we'll need an API that provides a stream and mr.
There was a problem hiding this comment.
Agree that this is thorny. I don't think we should ever require == to enforce "deep" equality here. We need a separate API for that.
| @property | ||
| def has_boundaries(self) -> bool: ... |
There was a problem hiding this comment.
Do we need this? Is it just order_scheme.boundaries is not None?
| # SPDX-License-Identifier: Apache-2.0 | ||
| """Channel metadata types for streaming pipelines.""" | ||
|
|
||
| from cuda.bindings.cyruntime cimport cudaStream_t |
There was a problem hiding this comment.
I don't see this import elsewhere. Do we need it?
In
, we useStream._from_cudaStream_t. But I'm not sure about the type.
**NOTE**: This PR is the C++ component of #853 Adds mechanism to track ordering within ``ChannelMetadata``. Needed in cudf-polars to keep track of sorted data. ### Proposed API ```c++ struct OrderKey { cudf::size_type column_index; ///< Column to sort on. cudf::order order; ///< ASCENDING or DESCENDING. cudf::null_order null_order; ///< BEFORE or AFTER. /** * @brief Equality comparison. * @return True if all fields are equal. */ bool operator==(OrderKey const&) const = default; }; struct OrderScheme { std::vector<OrderKey> keys; ///< Sort keys (column, order, null_order per entry). std::unique_ptr<TableChunk> boundaries; ///< N-1 boundary rows for N partitions. bool strict_boundaries{false}; ///< Whether unique keys must be mapped to a distinct chunk ``` Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #993
wence-
left a comment
There was a problem hiding this comment.
I feel like the cython bindings want a bit of a tidy up now we've changed how the C++ works.
| cdef vector[cpp_OrderKey] cpp_keys | ||
| for key in keys: | ||
| if not isinstance(key, OrderKey): | ||
| raise TypeError( | ||
| f"keys must contain OrderKey objects, got {type(key).__name__}" | ||
| ) | ||
| cpp_keys.push_back((<OrderKey>key)._key) |
There was a problem hiding this comment.
Oh, how about:
| cdef vector[cpp_OrderKey] cpp_keys | |
| for key in keys: | |
| if not isinstance(key, OrderKey): | |
| raise TypeError( | |
| f"keys must contain OrderKey objects, got {type(key).__name__}" | |
| ) | |
| cpp_keys.push_back((<OrderKey>key)._key) | |
| cdef vector[cpp_OrderKey] cpp_keys | |
| for key in keys: | |
| cpp_keys.push_back((<OrderKey?>key)._key) |
Thanks for the quick review @wence- ! I addressed your suggestions, but let me know if you're looking for a larger overhaul here. |
|
Thanks (again) for the review @wence- ! |
|
/merge |
- Follow up to #22315 - Further revises `sort_actor` in preparation for rapidsai/rapidsmpf#853 - Part of #22128 - Breaks apart `sort_actor` logic into modular steps, so we can avoid collecting boundaries when we already know the boundaries (future work). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Matthew Murray (https://github.com/Matt711) - Matthew Roeschke (https://github.com/mroeschke) URL: #22350
…sai#22350) - Follow up to rapidsai#22315 - Further revises `sort_actor` in preparation for rapidsai/rapidsmpf#853 - Part of rapidsai#22128 - Breaks apart `sort_actor` logic into modular steps, so we can avoid collecting boundaries when we already know the boundaries (future work). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Matthew Murray (https://github.com/Matt711) - Matthew Roeschke (https://github.com/mroeschke) URL: rapidsai#22350
Depends on #993 (adds the C++ API)
Adds mechanism to track ordering within
ChannelMetadata. Needed in cudf-polars to keep track of sorted data.Proposed API