Adopt OrderScheme metadata in cudf-polars#22291
Adopt OrderScheme metadata in cudf-polars#22291rjzamora wants to merge 22 commits intorapidsai:mainfrom
OrderScheme metadata in cudf-polars#22291Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
- Closes #22050 - **Updates `Sort` lowering**: - The "rapidsmpf" runtime no longer uses `Sort(ShuffleSorted(Sort(...)))` - We leave most `Sort` nodes alone. - We map `Sort` directly to `sort_actor` - The `sort_actor` performs it's own top/bottom-k optimization and performs chunk-wise sorting itself. **Motivation**: The `Sort(ShuffleSorted(Sort(...)))` pattern will be too messy when we start utilizing `OrderScheme` metadata (after #22291). The current assumption that the child IR node is always a `Sort` is also not true/safe. **Note**: This PR technically adds code, but it will allow us to *remove* more code once "tasks" is removed (everything `ShuffleSorted` related). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #22315
- Closes rapidsai#22050 - **Updates `Sort` lowering**: - The "rapidsmpf" runtime no longer uses `Sort(ShuffleSorted(Sort(...)))` - We leave most `Sort` nodes alone. - We map `Sort` directly to `sort_actor` - The `sort_actor` performs it's own top/bottom-k optimization and performs chunk-wise sorting itself. **Motivation**: The `Sort(ShuffleSorted(Sort(...)))` pattern will be too messy when we start utilizing `OrderScheme` metadata (after rapidsai#22291). The current assumption that the child IR node is always a `Sort` is also not true/safe. **Note**: This PR technically adds code, but it will allow us to *remove* more code once "tasks" is removed (everything `ShuffleSorted` related). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: rapidsai#22315
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR extends RapidsMPF's partitioning normalization to support order-based schemes alongside hash-based ones. The core ChangesOrder-Aware Partitioning Normalization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
python/cudf_polars/tests/experimental/test_metadata.py (1)
485-489: ⚡ Quick winUse the existing
spmd_enginefixture instead of creating a newSPMDEngine.Per previous review feedback, you should use the existing
spmd_enginefixture fromconftest.pyrather than creating your ownSPMDEngine. This ensures consistent test configuration and avoids duplicating engine setup logic.♻️ Proposed fix
`@pytest.fixture`(scope="module") -def rapidsmpf_context(): - # Use SPMDEngine to generatate a simple Context for testing - with SPMDEngine() as engine: - yield engine.context +def rapidsmpf_context(spmd_engine): + # Use the spmd_engine fixture to get a Context for testing + yield spmd_engine.context🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf_polars/tests/experimental/test_metadata.py` around lines 485 - 489, Replace the custom rapidsmpf_context fixture that constructs a new SPMDEngine with one that reuses the existing spmd_engine fixture: change the fixture to accept the spmd_engine parameter (instead of instantiating SPMDEngine) and yield spmd_engine.context so tests use the shared engine/configuration; update the fixture name rapidsmpf_context to depend on spmd_engine and return spmd_engine.context rather than creating a new SPMDEngine instance.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@python/cudf_polars/tests/experimental/test_metadata.py`:
- Around line 485-489: Replace the custom rapidsmpf_context fixture that
constructs a new SPMDEngine with one that reuses the existing spmd_engine
fixture: change the fixture to accept the spmd_engine parameter (instead of
instantiating SPMDEngine) and yield spmd_engine.context so tests use the shared
engine/configuration; update the fixture name rapidsmpf_context to depend on
spmd_engine and return spmd_engine.context rather than creating a new SPMDEngine
instance.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ba35b4ba-7599-44c7-9acd-260c08c82978
📒 Files selected for processing (4)
python/cudf_polars/cudf_polars/experimental/rapidsmpf/groupby.pypython/cudf_polars/cudf_polars/experimental/rapidsmpf/join.pypython/cudf_polars/cudf_polars/experimental/rapidsmpf/utils.pypython/cudf_polars/tests/experimental/test_metadata.py
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cudf_polars/tests/experimental/test_metadata.py (1)
494-585:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd edge-case coverage for order-scheme tests.
The new order-scheme cases only exercise non-empty, non-null multi-row data. Please add empty, all-null, and single-element key scenarios across
from_keys,is_aligned_with, and remap tests to catch boundary regressions.As per coding guidelines,
python/**/test*.py: "Missing edge case coverage in tests - Include tests for empty, all-null, single-element, and mixed type cases".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf_polars/tests/experimental/test_metadata.py` around lines 494 - 585, Add edge-case test variants for empty, all-null, and single-element key data to the order-scheme tests: update test_from_keys_order_scheme and test_from_keys_order_scheme_single_rank to include parameterized cases where keys map to empty datasets, all-null key columns, and single-row datasets and assert expected inter_rank_scheme behavior; extend test_is_aligned_with_order_scheme to check alignment when schemes are built from empty, all-null, and single-element values (use _make_order_scheme with values=() or values=(None,) or values=(x,)) and assert is_aligned_with accordingly; and add remapping edge cases in test_remap_partitioning_order_scheme_select and test_remap_partitioning_order_scheme_drops_key using _make_select_ir to simulate empty selection, selection producing null-only key column, and single-row selection, then assert maybe_remap_partitioning returns the correct inter_rank (OrderScheme / None) for each. Ensure you use the helper _make_order_scheme, NormalizedPartitioning.from_keys, and maybe_remap_partitioning identifiers to locate where to add these cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/cudf_polars/tests/experimental/test_metadata.py`:
- Around line 520-527: The test test_from_keys_order_scheme must assert explicit
states instead of comparing a boolean to isinstance; update it so after
computing result = NormalizedPartitioning.from_keys(...) you check: if
should_match is True assert isinstance(result.inter_rank_scheme, OrderScheme)
else assert result.inter_rank_scheme is None (or the exact expected non-matching
sentinel) to ensure the negative path doesn't accidentally accept some other
scheme type; keep references to test_from_keys_order_scheme,
NormalizedPartitioning.from_keys, and result.inter_rank_scheme when making the
change.
---
Outside diff comments:
In `@python/cudf_polars/tests/experimental/test_metadata.py`:
- Around line 494-585: Add edge-case test variants for empty, all-null, and
single-element key data to the order-scheme tests: update
test_from_keys_order_scheme and test_from_keys_order_scheme_single_rank to
include parameterized cases where keys map to empty datasets, all-null key
columns, and single-row datasets and assert expected inter_rank_scheme behavior;
extend test_is_aligned_with_order_scheme to check alignment when schemes are
built from empty, all-null, and single-element values (use _make_order_scheme
with values=() or values=(None,) or values=(x,)) and assert is_aligned_with
accordingly; and add remapping edge cases in
test_remap_partitioning_order_scheme_select and
test_remap_partitioning_order_scheme_drops_key using _make_select_ir to simulate
empty selection, selection producing null-only key column, and single-row
selection, then assert maybe_remap_partitioning returns the correct inter_rank
(OrderScheme / None) for each. Ensure you use the helper _make_order_scheme,
NormalizedPartitioning.from_keys, and maybe_remap_partitioning identifiers to
locate where to add these cases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ad0bd369-054e-452d-8d45-80ee46db3790
📒 Files selected for processing (1)
python/cudf_polars/tests/experimental/test_metadata.py
| def test_from_keys_order_scheme(spmd_engine, keys, strict, should_match): | ||
| part = Partitioning( | ||
| inter_rank=_make_order_scheme(spmd_engine.context, strict=strict), | ||
| local="inherit", | ||
| ) | ||
| result = NormalizedPartitioning.from_keys(part, nranks=4, keys=keys) | ||
| assert isinstance(result.inter_rank_scheme, OrderScheme) == should_match | ||
|
|
There was a problem hiding this comment.
Tighten the negative-path assertion in test_from_keys_order_scheme.
assert isinstance(result.inter_rank_scheme, OrderScheme) == should_match can pass even if the non-matching path returns an unexpected non-None scheme type. Assert explicit expected state for should_match=False.
Suggested assertion hardening
def test_from_keys_order_scheme(spmd_engine, keys, strict, should_match):
part = Partitioning(
inter_rank=_make_order_scheme(spmd_engine.context, strict=strict),
local="inherit",
)
result = NormalizedPartitioning.from_keys(part, nranks=4, keys=keys)
- assert isinstance(result.inter_rank_scheme, OrderScheme) == should_match
+ if should_match:
+ assert isinstance(result.inter_rank_scheme, OrderScheme)
+ else:
+ assert result.inter_rank_scheme is None🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cudf_polars/tests/experimental/test_metadata.py` around lines 520 -
527, The test test_from_keys_order_scheme must assert explicit states instead of
comparing a boolean to isinstance; update it so after computing result =
NormalizedPartitioning.from_keys(...) you check: if should_match is True assert
isinstance(result.inter_rank_scheme, OrderScheme) else assert
result.inter_rank_scheme is None (or the exact expected non-matching sentinel)
to ensure the negative path doesn't accidentally accept some other scheme type;
keep references to test_from_keys_order_scheme,
NormalizedPartitioning.from_keys, and result.inter_rank_scheme when making the
change.
Description
NormalizedPartitioningclass #22246OrderSchemepartitioning metadata rapidsmpf#853OrderSchememetadata, strict sort boundaries, and sort-aware execution #22128These changes lay the foundation for tracking and using metadata about the "sortedness" of the data we are passing through the channel in cudf-polars. The next step after this is to attach an
OrderSchemeto theChannelMetadatawithin asort_actor.Checklist