Adopt ``OrderScheme`` metadata in cudf-polars by rjzamora · Pull Request #22291 · rapidsai/cudf

rjzamora · 2026-04-24T20:36:22Z

Description

Follow up to Generalize NormalizedPartitioning class #22246
Blocked by [Python] Add OrderScheme partitioning metadata rapidsmpf#853
Part of [Story][FEA] Timeseries-oriented cuDF-Polars: OrderScheme metadata, strict sort boundaries, and sort-aware execution #22128

These changes lay the foundation for tracking and using metadata about the "sortedness" of the data we are passing through the channel in cudf-polars. The next step after this is to attach an OrderScheme to the ChannelMetadata within a sort_actor.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-04-24T20:36:26Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

- Closes #22050 - **Updates `Sort` lowering**: - The "rapidsmpf" runtime no longer uses `Sort(ShuffleSorted(Sort(...)))` - We leave most `Sort` nodes alone. - We map `Sort` directly to `sort_actor` - The `sort_actor` performs it's own top/bottom-k optimization and performs chunk-wise sorting itself. **Motivation**: The `Sort(ShuffleSorted(Sort(...)))` pattern will be too messy when we start utilizing `OrderScheme` metadata (after #22291). The current assumption that the child IR node is always a `Sort` is also not true/safe. **Note**: This PR technically adds code, but it will allow us to *remove* more code once "tasks" is removed (everything `ShuffleSorted` related). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #22315

- Closes rapidsai#22050 - **Updates `Sort` lowering**: - The "rapidsmpf" runtime no longer uses `Sort(ShuffleSorted(Sort(...)))` - We leave most `Sort` nodes alone. - We map `Sort` directly to `sort_actor` - The `sort_actor` performs it's own top/bottom-k optimization and performs chunk-wise sorting itself. **Motivation**: The `Sort(ShuffleSorted(Sort(...)))` pattern will be too messy when we start utilizing `OrderScheme` metadata (after rapidsai#22291). The current assumption that the child IR node is always a `Sort` is also not true/safe. **Note**: This PR technically adds code, but it will allow us to *remove* more code once "tasks" is removed (everything `ShuffleSorted` related). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: rapidsai#22315

Matt711

One nitpick and a comment: I think we should be able to use the OrderScheme metadata to avoid the per-group sort (in GroupedWindow) when the input is already sorted. xref #22191

coderabbitai · 2026-05-07T17:36:23Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Partitioning normalization now supports both hash- and order-based schemes, with improved detection of strict partitioning and alignment.
Bug Fixes
- More context-aware alignment checks for groupby and join flows, improving strategy selection and reducing incorrect shuffle/broadcast decisions.

Walkthrough

This PR extends RapidsMPF's partitioning normalization to support order-based schemes alongside hash-based ones. The core NormalizedPartitioning dataclass fields are generalized to hold InterRankScheme (hash or order) and PartitioningScheme types. The API parameter indices= is renamed to keys= to accommodate both integer column indices and OrderKey descriptors. New alignment and strictness-checking methods are introduced, and comprehensive test coverage validates order-scheme behavior throughout the normalization and remapping pipeline.

Changes

Order-Aware Partitioning Normalization

Layer / File(s)	Summary
Imports and Type Aliases `python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py`, `python/cudf_polars/tests/experimental/test_metadata.py`	Add imports for `OrderKey`, `OrderScheme`, `TypeAlias`, `cast`, and `Sequence`. Introduce public type aliases `InterRankScheme` (hash or order scheme) and `PartitioningScheme` (inter-rank or `"inherit"`).
NormalizedPartitioning Field Types `python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py`	Update dataclass fields from hash-only `HashScheme
Partitioning Remap Helpers `python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py`	Add helper functions to remap scheme indices through IR transformations: index derivation for hash vs order schemes, creation of updated schemes with remapped indices, selective remapping for `Select`, and passthrough for `inherit`/`None`.
Strictness and Alignment Methods `python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py`	Add `is_strictly_partitioned()` method and update `is_aligned_with(other, br)` to check order boundary alignment via `boundaries_aligned_with` and hash modulus/key-count compatibility.
from_keys and _normalize_schemes `python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py`	Rewrite `from_keys` to accept `keys: Sequence[int
API Call Site Updates `python/cudf_polars/cudf_polars/experimental/rapidsmpf/groupby.py`, `python/cudf_polars/cudf_polars/experimental/rapidsmpf/join.py`, `python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py`	Update all `NormalizedPartitioning.from_keys` invocations to use the `keys=` parameter. Update join alignment check to call `is_aligned_with(..., context.br())`.
Test Updates and New Order-Scheme Tests `python/cudf_polars/tests/experimental/test_metadata.py`	Update existing test calls to use `keys=` and pass `br` to alignment checks. Add `_make_order_scheme()` helper and new parametrized tests covering `from_keys` with `OrderKey`, `is_strictly_partitioned`, `is_aligned_with` (using `br`), single-rank promotion, and `maybe_remap_partitioning` behavior for selects that remap or drop order keys.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Adopt OrderScheme metadata in cudf-polars' clearly and directly describes the main change: adopting OrderScheme metadata support.
Description check	✅ Passed	The description provides relevant context about the changes, including follow-up PRs, blocking issues, and the purpose of adopting OrderScheme metadata for tracking data sortedness.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

python/cudf_polars/tests/experimental/test_metadata.py (1)

485-489: ⚡ Quick win

Use the existing spmd_engine fixture instead of creating a new SPMDEngine.

Per previous review feedback, you should use the existing spmd_engine fixture from conftest.py rather than creating your own SPMDEngine. This ensures consistent test configuration and avoids duplicating engine setup logic.

♻️ Proposed fix

 `@pytest.fixture`(scope="module")
-def rapidsmpf_context():
-    # Use SPMDEngine to generatate a simple Context for testing
-    with SPMDEngine() as engine:
-        yield engine.context
+def rapidsmpf_context(spmd_engine):
+    # Use the spmd_engine fixture to get a Context for testing
+    yield spmd_engine.context

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf_polars/tests/experimental/test_metadata.py` around lines 485 -
489, Replace the custom rapidsmpf_context fixture that constructs a new
SPMDEngine with one that reuses the existing spmd_engine fixture: change the
fixture to accept the spmd_engine parameter (instead of instantiating
SPMDEngine) and yield spmd_engine.context so tests use the shared
engine/configuration; update the fixture name rapidsmpf_context to depend on
spmd_engine and return spmd_engine.context rather than creating a new SPMDEngine
instance.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@python/cudf_polars/tests/experimental/test_metadata.py`:
- Around line 485-489: Replace the custom rapidsmpf_context fixture that
constructs a new SPMDEngine with one that reuses the existing spmd_engine
fixture: change the fixture to accept the spmd_engine parameter (instead of
instantiating SPMDEngine) and yield spmd_engine.context so tests use the shared
engine/configuration; update the fixture name rapidsmpf_context to depend on
spmd_engine and return spmd_engine.context rather than creating a new SPMDEngine
instance.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ba35b4ba-7599-44c7-9acd-260c08c82978

📥 Commits

Reviewing files that changed from the base of the PR and between c9ad1c5 and cf2b900.

📒 Files selected for processing (4)

python/cudf_polars/cudf_polars/experimental/rapidsmpf/groupby.py
python/cudf_polars/cudf_polars/experimental/rapidsmpf/join.py
python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
python/cudf_polars/tests/experimental/test_metadata.py

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

python/cudf_polars/tests/experimental/test_metadata.py (1)

494-585: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add edge-case coverage for order-scheme tests.

The new order-scheme cases only exercise non-empty, non-null multi-row data. Please add empty, all-null, and single-element key scenarios across from_keys, is_aligned_with, and remap tests to catch boundary regressions.

As per coding guidelines, python/**/test*.py: "Missing edge case coverage in tests - Include tests for empty, all-null, single-element, and mixed type cases".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf_polars/tests/experimental/test_metadata.py` around lines 494 -
585, Add edge-case test variants for empty, all-null, and single-element key
data to the order-scheme tests: update test_from_keys_order_scheme and
test_from_keys_order_scheme_single_rank to include parameterized cases where
keys map to empty datasets, all-null key columns, and single-row datasets and
assert expected inter_rank_scheme behavior; extend
test_is_aligned_with_order_scheme to check alignment when schemes are built from
empty, all-null, and single-element values (use _make_order_scheme with
values=() or values=(None,) or values=(x,)) and assert is_aligned_with
accordingly; and add remapping edge cases in
test_remap_partitioning_order_scheme_select and
test_remap_partitioning_order_scheme_drops_key using _make_select_ir to simulate
empty selection, selection producing null-only key column, and single-row
selection, then assert maybe_remap_partitioning returns the correct inter_rank
(OrderScheme / None) for each. Ensure you use the helper _make_order_scheme,
NormalizedPartitioning.from_keys, and maybe_remap_partitioning identifiers to
locate where to add these cases.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cudf_polars/tests/experimental/test_metadata.py`:
- Around line 520-527: The test test_from_keys_order_scheme must assert explicit
states instead of comparing a boolean to isinstance; update it so after
computing result = NormalizedPartitioning.from_keys(...) you check: if
should_match is True assert isinstance(result.inter_rank_scheme, OrderScheme)
else assert result.inter_rank_scheme is None (or the exact expected non-matching
sentinel) to ensure the negative path doesn't accidentally accept some other
scheme type; keep references to test_from_keys_order_scheme,
NormalizedPartitioning.from_keys, and result.inter_rank_scheme when making the
change.

---

Outside diff comments:
In `@python/cudf_polars/tests/experimental/test_metadata.py`:
- Around line 494-585: Add edge-case test variants for empty, all-null, and
single-element key data to the order-scheme tests: update
test_from_keys_order_scheme and test_from_keys_order_scheme_single_rank to
include parameterized cases where keys map to empty datasets, all-null key
columns, and single-row datasets and assert expected inter_rank_scheme behavior;
extend test_is_aligned_with_order_scheme to check alignment when schemes are
built from empty, all-null, and single-element values (use _make_order_scheme
with values=() or values=(None,) or values=(x,)) and assert is_aligned_with
accordingly; and add remapping edge cases in
test_remap_partitioning_order_scheme_select and
test_remap_partitioning_order_scheme_drops_key using _make_select_ir to simulate
empty selection, selection producing null-only key column, and single-row
selection, then assert maybe_remap_partitioning returns the correct inter_rank
(OrderScheme / None) for each. Ensure you use the helper _make_order_scheme,
NormalizedPartitioning.from_keys, and maybe_remap_partitioning identifiers to
locate where to add these cases.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ad0bd369-054e-452d-8d45-80ee46db3790

📥 Commits

Reviewing files that changed from the base of the PR and between cf2b900 and d6baf06.

📒 Files selected for processing (1)

python/cudf_polars/tests/experimental/test_metadata.py

coderabbitai · 2026-05-08T13:23:34Z

+def test_from_keys_order_scheme(spmd_engine, keys, strict, should_match):
+    part = Partitioning(
+        inter_rank=_make_order_scheme(spmd_engine.context, strict=strict),
+        local="inherit",
+    )
+    result = NormalizedPartitioning.from_keys(part, nranks=4, keys=keys)
+    assert isinstance(result.inter_rank_scheme, OrderScheme) == should_match
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten the negative-path assertion in test_from_keys_order_scheme.

assert isinstance(result.inter_rank_scheme, OrderScheme) == should_match can pass even if the non-matching path returns an unexpected non-None scheme type. Assert explicit expected state for should_match=False.

Suggested assertion hardening

def test_from_keys_order_scheme(spmd_engine, keys, strict, should_match): part = Partitioning( inter_rank=_make_order_scheme(spmd_engine.context, strict=strict), local="inherit", ) result = NormalizedPartitioning.from_keys(part, nranks=4, keys=keys) - assert isinstance(result.inter_rank_scheme, OrderScheme) == should_match + if should_match: + assert isinstance(result.inter_rank_scheme, OrderScheme) + else: + assert result.inter_rank_scheme is None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf_polars/tests/experimental/test_metadata.py` around lines 520 - 527, The test test_from_keys_order_scheme must assert explicit states instead of comparing a boolean to isinstance; update it so after computing result = NormalizedPartitioning.from_keys(...) you check: if should_match is True assert isinstance(result.inter_rank_scheme, OrderScheme) else assert result.inter_rank_scheme is None (or the exact expected non-matching sentinel) to ensure the negative path doesn't accidentally accept some other scheme type; keep references to test_from_keys_order_scheme, NormalizedPartitioning.from_keys, and result.inter_rank_scheme when making the change.

rjzamora added 2 commits April 24, 2026 13:30

adopt OrderScheme in NormalizedPartitioning utility

01ddbdb

Merge remote-tracking branch 'upstream/main' into adopt-orderscheme

93301ad

rjzamora self-assigned this Apr 24, 2026

rjzamora added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 24, 2026

github-actions Bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Apr 24, 2026

github-project-automation Bot added this to cuDF Python Apr 24, 2026

GPUtester moved this to In Progress in cuDF Python Apr 24, 2026

Merge branch 'main' into adopt-orderscheme

6e43f61

rjzamora mentioned this pull request Apr 28, 2026

Revise Sort lowering and sort_actor assumptions #22315

Merged

3 tasks

rjzamora added 3 commits April 28, 2026 06:53

Merge remote-tracking branch 'upstream/main' into adopt-orderscheme

c6c870f

align with rapidsmpf changes

43a971a

Merge branch 'main' into adopt-orderscheme

66faf0a

rjzamora added 2 commits April 29, 2026 12:25

Merge remote-tracking branch 'upstream/main' into adopt-orderscheme

dcc5c95

Merge remote-tracking branch 'upstream/main' into adopt-orderscheme

cf45ff6

rjzamora added 2 commits May 6, 2026 07:42

Merge branch 'main' into adopt-orderscheme

69a57a6

Merge remote-tracking branch 'upstream/main' into adopt-orderscheme

3406ebf

rjzamora marked this pull request as ready for review May 6, 2026 14:55

rjzamora requested a review from a team as a code owner May 6, 2026 14:55

rjzamora requested a review from TomAugspurger May 6, 2026 14:55

TomAugspurger reviewed May 6, 2026

View reviewed changes

rjzamora added 2 commits May 6, 2026 12:51

partial code review

f10ddc4

Merge remote-tracking branch 'upstream/main' into adopt-orderscheme

471c5eb

Matt711 approved these changes May 6, 2026

View reviewed changes

Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py Outdated

revise further

170f71c

rjzamora added 3 commits May 6, 2026 14:33

fix groupby test failure

92b0c15

fix groupby test failure better

f94b4ab

Merge remote-tracking branch 'upstream/main' into adopt-orderscheme

d6af09f

mroeschke reviewed May 7, 2026

View reviewed changes

Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py

Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py

Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py

Merge branch 'main' into adopt-orderscheme

4c64737

rjzamora commented May 7, 2026

View reviewed changes

Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py Outdated

rjzamora added 3 commits May 7, 2026 07:40

Apply suggestion from @rjzamora

5f60e80

Merge remote-tracking branch 'upstream/main' into adopt-orderscheme

a7c3654

pre-commit

cf2b900

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

rjzamora added 2 commits May 8, 2026 06:11

Merge remote-tracking branch 'upstream/main' into adopt-orderscheme

3558eed

use spmd_engine fixture

d6baf06

rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 8, 2026

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

Conversation

rjzamora commented Apr 24, 2026

Description

Checklist

Uh oh!

copy-pr-bot Bot commented Apr 24, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Matt711 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

coderabbitai Bot commented May 7, 2026 •

edited

Loading