Skip to content

Adopt OrderScheme metadata in cudf-polars#22291

Open
rjzamora wants to merge 22 commits intorapidsai:mainfrom
rjzamora:adopt-orderscheme
Open

Adopt OrderScheme metadata in cudf-polars#22291
rjzamora wants to merge 22 commits intorapidsai:mainfrom
rjzamora:adopt-orderscheme

Conversation

@rjzamora
Copy link
Copy Markdown
Member

Description

These changes lay the foundation for tracking and using metadata about the "sortedness" of the data we are passing through the channel in cudf-polars. The next step after this is to attach an OrderScheme to the ChannelMetadata within a sort_actor.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora self-assigned this Apr 24, 2026
@rjzamora rjzamora added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 24, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 24, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Apr 24, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Apr 24, 2026
rapids-bot Bot pushed a commit that referenced this pull request Apr 29, 2026
- Closes #22050
- **Updates `Sort` lowering**:
  - The "rapidsmpf" runtime no longer uses `Sort(ShuffleSorted(Sort(...)))` - We leave most `Sort` nodes alone.
  - We map `Sort` directly to `sort_actor` - The `sort_actor` performs it's own top/bottom-k optimization and performs chunk-wise sorting itself.

**Motivation**: The `Sort(ShuffleSorted(Sort(...)))` pattern will be too messy when we start utilizing `OrderScheme` metadata (after #22291). The current assumption that the child IR node is always a `Sort` is also not true/safe.

**Note**: This PR technically adds code, but it will allow us to *remove* more code once "tasks" is removed (everything `ShuffleSorted` related).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #22315
vyasr pushed a commit to vyasr/cudf that referenced this pull request May 4, 2026
- Closes rapidsai#22050
- **Updates `Sort` lowering**:
  - The "rapidsmpf" runtime no longer uses `Sort(ShuffleSorted(Sort(...)))` - We leave most `Sort` nodes alone.
  - We map `Sort` directly to `sort_actor` - The `sort_actor` performs it's own top/bottom-k optimization and performs chunk-wise sorting itself.

**Motivation**: The `Sort(ShuffleSorted(Sort(...)))` pattern will be too messy when we start utilizing `OrderScheme` metadata (after rapidsai#22291). The current assumption that the child IR node is always a `Sort` is also not true/safe.

**Note**: This PR technically adds code, but it will allow us to *remove* more code once "tasks" is removed (everything `ShuffleSorted` related).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai#22315
@rjzamora rjzamora marked this pull request as ready for review May 6, 2026 14:55
@rjzamora rjzamora requested a review from a team as a code owner May 6, 2026 14:55
@rjzamora rjzamora requested a review from TomAugspurger May 6, 2026 14:55
Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py Outdated
Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py Outdated
Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py Outdated
Comment thread python/cudf_polars/tests/experimental/test_metadata.py Outdated
Copy link
Copy Markdown
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nitpick and a comment: I think we should be able to use the OrderScheme metadata to avoid the per-group sort (in GroupedWindow) when the input is already sorted. xref #22191

Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py Outdated
Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
Comment thread python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py Outdated
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 7, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Partitioning normalization now supports both hash- and order-based schemes, with improved detection of strict partitioning and alignment.
  • Bug Fixes

    • More context-aware alignment checks for groupby and join flows, improving strategy selection and reducing incorrect shuffle/broadcast decisions.

Walkthrough

This PR extends RapidsMPF's partitioning normalization to support order-based schemes alongside hash-based ones. The core NormalizedPartitioning dataclass fields are generalized to hold InterRankScheme (hash or order) and PartitioningScheme types. The API parameter indices= is renamed to keys= to accommodate both integer column indices and OrderKey descriptors. New alignment and strictness-checking methods are introduced, and comprehensive test coverage validates order-scheme behavior throughout the normalization and remapping pipeline.

Changes

Order-Aware Partitioning Normalization

Layer / File(s) Summary
Imports and Type Aliases
python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py, python/cudf_polars/tests/experimental/test_metadata.py
Add imports for OrderKey, OrderScheme, TypeAlias, cast, and Sequence. Introduce public type aliases InterRankScheme (hash or order scheme) and PartitioningScheme (inter-rank or "inherit").
NormalizedPartitioning Field Types
python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
Update dataclass fields from hash-only `HashScheme
Partitioning Remap Helpers
python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
Add helper functions to remap scheme indices through IR transformations: index derivation for hash vs order schemes, creation of updated schemes with remapped indices, selective remapping for Select, and passthrough for inherit/None.
Strictness and Alignment Methods
python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
Add is_strictly_partitioned() method and update is_aligned_with(other, br) to check order boundary alignment via boundaries_aligned_with and hash modulus/key-count compatibility.
from_keys and _normalize_schemes
python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
Rewrite from_keys to accept `keys: Sequence[int
API Call Site Updates
python/cudf_polars/cudf_polars/experimental/rapidsmpf/groupby.py, python/cudf_polars/cudf_polars/experimental/rapidsmpf/join.py, python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
Update all NormalizedPartitioning.from_keys invocations to use the keys= parameter. Update join alignment check to call is_aligned_with(..., context.br()).
Test Updates and New Order-Scheme Tests
python/cudf_polars/tests/experimental/test_metadata.py
Update existing test calls to use keys= and pass br to alignment checks. Add _make_order_scheme() helper and new parametrized tests covering from_keys with OrderKey, is_strictly_partitioned, is_aligned_with (using br), single-rank promotion, and maybe_remap_partitioning behavior for selects that remap or drop order keys.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Adopt OrderScheme metadata in cudf-polars' clearly and directly describes the main change: adopting OrderScheme metadata support.
Description check ✅ Passed The description provides relevant context about the changes, including follow-up PRs, blocking issues, and the purpose of adopting OrderScheme metadata for tracking data sortedness.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
python/cudf_polars/tests/experimental/test_metadata.py (1)

485-489: ⚡ Quick win

Use the existing spmd_engine fixture instead of creating a new SPMDEngine.

Per previous review feedback, you should use the existing spmd_engine fixture from conftest.py rather than creating your own SPMDEngine. This ensures consistent test configuration and avoids duplicating engine setup logic.

♻️ Proposed fix
 `@pytest.fixture`(scope="module")
-def rapidsmpf_context():
-    # Use SPMDEngine to generatate a simple Context for testing
-    with SPMDEngine() as engine:
-        yield engine.context
+def rapidsmpf_context(spmd_engine):
+    # Use the spmd_engine fixture to get a Context for testing
+    yield spmd_engine.context
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf_polars/tests/experimental/test_metadata.py` around lines 485 -
489, Replace the custom rapidsmpf_context fixture that constructs a new
SPMDEngine with one that reuses the existing spmd_engine fixture: change the
fixture to accept the spmd_engine parameter (instead of instantiating
SPMDEngine) and yield spmd_engine.context so tests use the shared
engine/configuration; update the fixture name rapidsmpf_context to depend on
spmd_engine and return spmd_engine.context rather than creating a new SPMDEngine
instance.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@python/cudf_polars/tests/experimental/test_metadata.py`:
- Around line 485-489: Replace the custom rapidsmpf_context fixture that
constructs a new SPMDEngine with one that reuses the existing spmd_engine
fixture: change the fixture to accept the spmd_engine parameter (instead of
instantiating SPMDEngine) and yield spmd_engine.context so tests use the shared
engine/configuration; update the fixture name rapidsmpf_context to depend on
spmd_engine and return spmd_engine.context rather than creating a new SPMDEngine
instance.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ba35b4ba-7599-44c7-9acd-260c08c82978

📥 Commits

Reviewing files that changed from the base of the PR and between c9ad1c5 and cf2b900.

📒 Files selected for processing (4)
  • python/cudf_polars/cudf_polars/experimental/rapidsmpf/groupby.py
  • python/cudf_polars/cudf_polars/experimental/rapidsmpf/join.py
  • python/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.py
  • python/cudf_polars/tests/experimental/test_metadata.py

@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 8, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cudf_polars/tests/experimental/test_metadata.py (1)

494-585: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add edge-case coverage for order-scheme tests.

The new order-scheme cases only exercise non-empty, non-null multi-row data. Please add empty, all-null, and single-element key scenarios across from_keys, is_aligned_with, and remap tests to catch boundary regressions.

As per coding guidelines, python/**/test*.py: "Missing edge case coverage in tests - Include tests for empty, all-null, single-element, and mixed type cases".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf_polars/tests/experimental/test_metadata.py` around lines 494 -
585, Add edge-case test variants for empty, all-null, and single-element key
data to the order-scheme tests: update test_from_keys_order_scheme and
test_from_keys_order_scheme_single_rank to include parameterized cases where
keys map to empty datasets, all-null key columns, and single-row datasets and
assert expected inter_rank_scheme behavior; extend
test_is_aligned_with_order_scheme to check alignment when schemes are built from
empty, all-null, and single-element values (use _make_order_scheme with
values=() or values=(None,) or values=(x,)) and assert is_aligned_with
accordingly; and add remapping edge cases in
test_remap_partitioning_order_scheme_select and
test_remap_partitioning_order_scheme_drops_key using _make_select_ir to simulate
empty selection, selection producing null-only key column, and single-row
selection, then assert maybe_remap_partitioning returns the correct inter_rank
(OrderScheme / None) for each. Ensure you use the helper _make_order_scheme,
NormalizedPartitioning.from_keys, and maybe_remap_partitioning identifiers to
locate where to add these cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cudf_polars/tests/experimental/test_metadata.py`:
- Around line 520-527: The test test_from_keys_order_scheme must assert explicit
states instead of comparing a boolean to isinstance; update it so after
computing result = NormalizedPartitioning.from_keys(...) you check: if
should_match is True assert isinstance(result.inter_rank_scheme, OrderScheme)
else assert result.inter_rank_scheme is None (or the exact expected non-matching
sentinel) to ensure the negative path doesn't accidentally accept some other
scheme type; keep references to test_from_keys_order_scheme,
NormalizedPartitioning.from_keys, and result.inter_rank_scheme when making the
change.

---

Outside diff comments:
In `@python/cudf_polars/tests/experimental/test_metadata.py`:
- Around line 494-585: Add edge-case test variants for empty, all-null, and
single-element key data to the order-scheme tests: update
test_from_keys_order_scheme and test_from_keys_order_scheme_single_rank to
include parameterized cases where keys map to empty datasets, all-null key
columns, and single-row datasets and assert expected inter_rank_scheme behavior;
extend test_is_aligned_with_order_scheme to check alignment when schemes are
built from empty, all-null, and single-element values (use _make_order_scheme
with values=() or values=(None,) or values=(x,)) and assert is_aligned_with
accordingly; and add remapping edge cases in
test_remap_partitioning_order_scheme_select and
test_remap_partitioning_order_scheme_drops_key using _make_select_ir to simulate
empty selection, selection producing null-only key column, and single-row
selection, then assert maybe_remap_partitioning returns the correct inter_rank
(OrderScheme / None) for each. Ensure you use the helper _make_order_scheme,
NormalizedPartitioning.from_keys, and maybe_remap_partitioning identifiers to
locate where to add these cases.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ad0bd369-054e-452d-8d45-80ee46db3790

📥 Commits

Reviewing files that changed from the base of the PR and between cf2b900 and d6baf06.

📒 Files selected for processing (1)
  • python/cudf_polars/tests/experimental/test_metadata.py

Comment on lines +520 to +527
def test_from_keys_order_scheme(spmd_engine, keys, strict, should_match):
part = Partitioning(
inter_rank=_make_order_scheme(spmd_engine.context, strict=strict),
local="inherit",
)
result = NormalizedPartitioning.from_keys(part, nranks=4, keys=keys)
assert isinstance(result.inter_rank_scheme, OrderScheme) == should_match

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten the negative-path assertion in test_from_keys_order_scheme.

assert isinstance(result.inter_rank_scheme, OrderScheme) == should_match can pass even if the non-matching path returns an unexpected non-None scheme type. Assert explicit expected state for should_match=False.

Suggested assertion hardening
 def test_from_keys_order_scheme(spmd_engine, keys, strict, should_match):
     part = Partitioning(
         inter_rank=_make_order_scheme(spmd_engine.context, strict=strict),
         local="inherit",
     )
     result = NormalizedPartitioning.from_keys(part, nranks=4, keys=keys)
-    assert isinstance(result.inter_rank_scheme, OrderScheme) == should_match
+    if should_match:
+        assert isinstance(result.inter_rank_scheme, OrderScheme)
+    else:
+        assert result.inter_rank_scheme is None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf_polars/tests/experimental/test_metadata.py` around lines 520 -
527, The test test_from_keys_order_scheme must assert explicit states instead of
comparing a boolean to isinstance; update it so after computing result =
NormalizedPartitioning.from_keys(...) you check: if should_match is True assert
isinstance(result.inter_rank_scheme, OrderScheme) else assert
result.inter_rank_scheme is None (or the exact expected non-matching sentinel)
to ensure the negative path doesn't accidentally accept some other scheme type;
keep references to test_from_keys_order_scheme,
NormalizedPartitioning.from_keys, and result.inter_rank_scheme when making the
change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team cudf-polars Issues specific to cudf-polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

6 participants