Implement streaming window functions in cudf-polars #22191
Matt711 wants to merge 65 commits into rapidsai:main from
Conversation
Matt711
left a comment
Review guide: I recommend reading the PR description first, then looking over the tests, then the three execution strategies (and their corresponding tests), and finally the full-order-preservation logic.
If you think we should split up the PR, that's okay. Additionally, if you think logic should be shared (especially in the scalar-aggs/groupby case), we can discuss the specifics in your review. I abstracted some logic into helper functions like `_make_hash_shuffle_metadata`, but in general I avoided it (in the groupby case) because it made the code more difficult to understand, IMO.
    [False, True],
    ids=["same_rank", "cross_rank"],
)
def test_over_multirank(
I tested this using rrun
(rapids) coder ➜ ~/cudf $ rrun -n 2 python -m pytest python/cudf_polars/tests/experimental/test_spmd.py::test_over_multirank -x -v
[rrun] All ranks launched. Waiting for completion...
============================= test session starts ==============================
platform linux -- Python 3.14.4, pytest-9.0.3, pluggy-1.6.0 -- /home/coder/.conda/envs/rapids/bin/python
============================= test session starts ==============================
platform linux -- Python 3.14.4, pytest-9.0.3, pluggy-1.6.0 -- /home/coder/.conda/envs/rapids/bin/python
cachedir: .pytest_cache
hypothesis profile 'default'
benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/coder/cudf/python/cudf_polars
configfile: pyproject.toml
plugins: cases-3.10.1, anyio-4.13.0, hypothesis-6.151.13, cov-7.1.0, xdist-3.8.0, benchmark-5.2.3, pytest_httpserver-1.1.5, rerunfailures-16.1
cachedir: .pytest_cache
hypothesis profile 'default'
benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/coder/cudf/python/cudf_polars
configfile: pyproject.toml
plugins: cases-3.10.1, anyio-4.13.0, hypothesis-6.151.13, cov-7.1.0, xdist-3.8.0, benchmark-5.2.3, pytest_httpserver-1.1.5, rerunfailures-16.1
collecting ... collected 4 items
collecting ... collected 4 items
python/cudf_polars/tests/experimental/test_spmd.py::test_over_multirank[same_rank-scalar_sum] PASSED [ 25%]
python/cudf_polars/tests/experimental/test_spmd.py::test_over_multirank[same_rank-scalar_sum] PASSED [ 25%]
python/cudf_polars/tests/experimental/test_spmd.py::test_over_multirank[same_rank-nonscalar_rank] PASSED [ 50%]
python/cudf_polars/tests/experimental/test_spmd.py::test_over_multirank[same_rank-nonscalar_rank] PASSED [ 50%]
python/cudf_polars/tests/experimental/test_spmd.py::test_over_multirank[cross_rank-scalar_sum] PASSED [ 75%]
python/cudf_polars/tests/experimental/test_spmd.py::test_over_multirank[cross_rank-scalar_sum] PASSED [ 75%]
python/cudf_polars/tests/experimental/test_spmd.py::test_over_multirank[cross_rank-nonscalar_rank] XFAIL [100%]
python/cudf_polars/tests/experimental/test_spmd.py::test_over_multirank[cross_rank-nonscalar_rank] XFAIL [100%]
========================= 3 passed, 1 xfailed in 3.45s =========================
========================= 3 passed, 1 xfailed in 3.46s =========================
594810d to 696bbc9
Moved this to WIP while I work through the rapidsmpf test failures: https://github.com/rapidsai/cudf/actions/runs/24570156132/job/71845610400?pr=22191

/ok to test 0d6d6b1
9d48123 to bc584d7
/ok to test bc584d7
- This is a follow-up to rapidsai#21796
- This (hopefully) simplifies some code in rapidsai#22191

**Problem statement**: We currently translate `HStack` nodes with non-pointwise expressions to the equivalent `Select` node at lowering time. This is because all our non-pointwise `Expr`-decomposition logic is specific to `Select`. Before this PR, this translation was skipped whenever the underlying `HStack` was completely overwriting its original columns. The problem with this case is that we lose "anchor" columns that tell the `Select` how to broadcast scalar-aggregation results.

**Proposed solution**: We add a temporary "anchor" column to the translated `HStack` so that broadcasting works correctly in the `Select` node.

**Motivation**:
- We can handle all `over()` expression decomposition within `Select` if we know **all** non-pointwise `HStack` operations are lowered to `Select` anyway.
- We don't "fall back" for other non-`over` `HStack` corner cases either.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Matthew Murray (https://github.com/Matt711)

URL: rapidsai#22353
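The anchor-column idea can be illustrated with a toy model. This is a hypothetical plain-Python sketch, not the cudf-polars API (`select_with_broadcast` and the dict-of-lists "table" are invented for illustration): a `Select` must broadcast scalar aggregation results to the height of its input, and that height is exactly the information an all-overwriting `HStack` would otherwise destroy.

```python
# Hypothetical sketch, not the cudf-polars implementation: a "table" is a
# dict mapping column names to equal-length lists.

def select_with_broadcast(table, exprs):
    """Evaluate each expression against `table`, broadcasting scalars.

    The broadcast height comes from the input table's columns -- the
    "anchor". If every original column had been overwritten before the
    HStack -> Select translation, this height would be unrecoverable
    for a purely scalar-aggregation expression.
    """
    height = len(next(iter(table.values())))
    out = {}
    for name, fn in exprs.items():
        result = fn(table)
        if not isinstance(result, list):
            # Scalar aggregation result: broadcast to the anchor's height.
            result = [result] * height
        out[name] = result
    return out

# A temporary anchor column keeps the input height visible even though the
# output only contains the new (overwriting) column.
table = {"anchor": [0, 0, 0], "x": [1.0, 2.0, 3.0]}
result = select_with_broadcast(table, {"x": lambda t: sum(t["x"])})
assert result == {"x": [6.0, 6.0, 6.0]}
```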
/ok to test 338757a

/ok to test 5b25eea
wence-
left a comment
A first go. I think there is opportunity to find some more abstraction here, since it seems we are recreating many concepts sui generis for this particular implementation. I did not get to the shuffle-based implementation yet.
I think a useful signpost would be a module-level docstring that describes the algorithmic aspects of what is going on, without reaching into the implementation details.
@dataclass(frozen=True)
class OriginStamps:
I think all of this is sui generis sort_by_key with the key being a tuple of (rank, chunk_index, position). Can we reuse the infrastructure for sort to do that for us?
I guess that doesn't necessarily give you everything with a given rank as its key on the input rank?
Yeah the sort infrastructure doesn't give us "everything with a given rank as its key on the input rank"
…ts, clean up Over.do_evaluate signature
9cb0834 to 397d50f
/ok to test f7687f1
Description
`over()` is, at its heart, a grouped aggregation followed by a broadcast back to the shape of the input. For each group `g` defined by the partition-by keys, evaluate the expression, then map the result back to every row that belongs to `g`. Polars represents this with a `WindowMapping` enum. This PR adds support for the `group_to_rows` mapping in the RapidsMPF streaming executor (the variant where the output has the same number of rows as the input and each row receives the value computed for its group). The entry point is a new `over_actor` that selects one of three execution strategies at runtime based on the incoming channel metadata and expression shape.

The `over_actor`: three strategies

1. Chunkwise (already partitioned)
If the incoming channel metadata shows the data is already hash-partitioned on the `over()` keys (or any prefix of them; being partitioned on `('a',)` is sufficient for `over('a', 'b')`, since every group is contained within one rank), the window function is trivially correct on each chunk in isolation. We evaluate chunkwise with no coordination at all.

2. Scalar aggregations: AllGather + broadcast
When every `GroupedWindow` in the expression is a scalar aggregation (`sum`, `mean`, `count`, etc.), we exploit the fact that these are decomposable: each worker computes partial aggregates chunkwise, an AllGather collects all workers' partial results, a single reduction produces the global aggregate per group, and then each original chunk has those results broadcast back into its row positions via a hash join on the partition keys.

3. Non-scalar aggregations: forward-shuffle + return-shuffle
Functions like `rank` are not decomposable; they require every row in the group to be visible at once. We hash-shuffle by the partition keys so that all rows belonging to group `g` land in the same rank for evaluation. The challenge is then twofold: putting rows back in the right order, and getting them back to the rank that owns the corresponding output chunk in the first place. Output channels are rank-local, so only the rank that received an input chunk is wired up to emit it, and the hash shuffle scatters rows by group with no regard for where they originated. We need an explicit return trip.

Preserving full order
A lot of the implementation exists purely to put output chunks back in the same sequence-number order as the input. Getting this right across both strategies is where most of the complexity lives.
Scalar aggregation path. We can't produce any output until the global aggregate is known, so we buffer incoming chunks while simultaneously computing partial aggregates over them. Once the AllGather + final reduction completes, we iterate over the buffer and evaluate each chunk against the global aggregate, emitting results with their original sequence numbers. Order preservation falls out naturally: the buffer is in receive order and we never reorder it.
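As a minimal sketch of this decomposable path (plain Python with invented names, not the actual actor code), a grouped `sum` reduces to: per-chunk partial aggregates, a simulated AllGather of every rank's partials, one global reduction, and a broadcast back over the buffered chunks in receive order.

```python
from collections import defaultdict

def partial_sums(chunk):
    """Per-chunk partial aggregate: group key -> partial sum."""
    acc = defaultdict(float)
    for key, value in chunk:
        acc[key] += value
    return dict(acc)

def reduce_partials(gathered):
    """Final reduction over all ranks' partials (post-AllGather)."""
    total = defaultdict(float)
    for partials in gathered:
        for key, s in partials.items():
            total[key] += s
    return dict(total)

# Chunks of (group_key, value) rows buffered on one rank, in receive order.
buffered = [[("a", 1.0), ("b", 2.0)], [("a", 3.0)]]
# Pretend the AllGather also delivered another rank's partial results.
gathered = [partial_sums(c) for c in buffered] + [{"b": 4.0}]
global_agg = reduce_partials(gathered)  # {"a": 4.0, "b": 6.0}

# Broadcast back: iterate the buffer in receive order, so output order
# matches input order with no explicit reordering step.
output = [[global_agg[key] for key, _ in chunk] for chunk in buffered]
assert output == [[4.0, 6.0], [4.0]]
```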
Non-scalar shuffle path. Each row is stamped with three pieces of origin metadata before it enters the forward shuffle: an `origin_rank` (which rank ingested it), a `chunk_index` (a rank-local 0-based counter, not the upstream message sequence number, which can collide when the input is the output of a prior shuffle), and a `position` within that input chunk. After the forward shuffle, each rank holds a mix of rows from every origin, but each row knows where it came from. We evaluate the window function on each local forward partition (so `rank` sees every row in the group), then route the results through a return shuffle keyed on `origin_rank`. The return shuffle uses `num_partitions = nranks` and `PartitionAssignment.CONTIGUOUS`, so partition `i` lives on rank `i`, and every row goes back to the rank that originally received it. Each rank then sorts the returned rows by `(chunk_index, position)`, splits at chunk-index transitions, drops the stamp columns, and emits one output chunk per input chunk in input order.

To avoid buffering every input chunk just to size the forward shuffle, the actor samples a small number of chunks up front (
`_choose_modulus`), AllGathers a size estimate, picks the modulus, and then replays the sampled chunks back through a fresh channel via `replay_buffered_channel`. The forward-insert phase reads from that replay channel and streams rows into the shuffle as they arrive, never holding more than the shuffle's own internal buffering.

.over() in streaming cuDF-Polars #22047

Checklist
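The return-trip reassembly described in this PR can be sketched in plain Python (`reassemble` is a hypothetical helper, not the actual actor code): rows come back from the return shuffle carrying their `(chunk_index, position)` stamps; sorting on that pair and splitting at chunk-index transitions reproduces one output chunk per input chunk, in input order.

```python
from itertools import groupby
from operator import itemgetter

def reassemble(returned_rows):
    """Rebuild per-chunk outputs from shuffled-back rows.

    `returned_rows` is an iterable of (chunk_index, position, value)
    triples for a single rank; order of arrival is arbitrary.
    """
    ordered = sorted(returned_rows, key=itemgetter(0, 1))
    # Split at chunk-index transitions and drop the stamp columns.
    return [[value for _, _, value in group]
            for _, group in groupby(ordered, key=itemgetter(0))]

# Rows arrive interleaved after the return shuffle:
rows = [(1, 0, "d"), (0, 2, "c"), (0, 0, "a"), (0, 1, "b"), (1, 1, "e")]
assert reassemble(rows) == [["a", "b", "c"], ["d", "e"]]
```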