Implement groupby sum on StringDtype columns as per-group concatenation by galipremsagar · Pull Request #22370 · rapidsai/cudf

galipremsagar · 2026-05-04T20:05:24Z

Summary

Split out from #22289. Pandas 3 makes DataFrame.groupby(...).sum() on StringDtype columns return a per-group string concatenation rather than raise TypeError. This PR implements that path for cuDF.

Implementation (`python/cudf/cudf/core/groupby/groupby.py`)

GroupBy._reduce dispatches to a new _string_sum helper whenever the value column dtype is pd.StringDtype (and op == "sum"). The dispatch happens before the pre-existing min_count != 0 guard so that string sum supports min_count > 0 independently of the general min_count work in the sibling PR.
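The ordering matters, so here is a minimal toy model of it. This is not cuDF's actual code: `reduce_sketch`, its arguments, and the `NotImplementedError` raised by the stand-in guard are all assumptions made for illustration. It only shows that routing string sum first keeps it out of the `min_count` guard.

```python
def reduce_sketch(op, is_string_column, min_count=0):
    """Toy model of the dispatch order (hypothetical, not cuDF's code)."""
    # String sum is routed first, so it never reaches the min_count guard.
    if op == "sum" and is_string_column:
        return "string_sum"
    # Stand-in for the pre-existing guard; the exception type is assumed.
    if min_count != 0:
        raise NotImplementedError("min_count is not supported for this op")
    return "generic_reduce"


# String sum with min_count > 0 takes the string path.
print(reduce_sketch("sum", True, min_count=2))
# A non-string reduction with the default min_count takes the generic path.
print(reduce_sketch("sum", False))
```

With this ordering, only non-string reductions with a nonzero `min_count` reach the guard.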

_string_sum:

- Collects per-group values with plc.aggregation.collect_list.
- Joins each list with plc.strings.combine.join_list_elements, using OutputIfEmptyList.EMPTY_STRING (skipna=True) or NULL_ELEMENT (skipna=False) and the matching per-element narep.
- Applies min_count by counting per-group non-nulls (plc.aggregation.count) and using ColumnBase.copy_if_else with a null scalar where count < min_count.
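The three steps above can be sketched as a pure-Python reference model. This is an illustration only, not the pylibcudf implementation: `string_sum_model` is a hypothetical name, `None` stands in for a null, and the reading that `skipna=False` nulls out any group containing a null is an assumption based on the null narep described above.

```python
from collections import defaultdict


def string_sum_model(keys, values, skipna=True, min_count=0):
    """Pure-Python model of per-group string concatenation (hypothetical)."""
    # Step 1: collect the values of each group into a list.
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)

    result = {}
    for key, items in groups.items():
        non_null = [v for v in items if v is not None]
        # Step 2: join each list. With skipna=True nulls are dropped, so an
        # all-null group yields ""; with skipna=False (assumed semantics)
        # any null in the group makes the whole result null.
        if skipna or len(non_null) == len(items):
            joined = "".join(non_null)
        else:
            joined = None
        # Step 3: null out groups with fewer than min_count non-null values.
        if len(non_null) < min_count:
            joined = None
        result[key] = joined
    return result


keys = ["a", "a", "b", "b", "c"]
values = ["x", "y", None, "z", None]
print(string_sum_model(keys, values))
print(string_sum_model(keys, values, skipna=False))
print(string_sum_model(keys, values, min_count=2))
```

Under these assumptions the default call gives `"xy"`, `"z"`, and `""` for groups `a`, `b`, and `c`; `skipna=False` nulls out `b` and `c`; and `min_count=2` keeps only `a`.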

The test_group_by_empty_reduction xfail is updated since str + sum no longer raises TypeError.

Tests

test_groupby_string_sum covers all four StringDtype storage/na_value combinations.

Conftest

Removes 16 test_string_dtype_all_na[*-sum-*] entries.

Relationship to #22289

This is one of the four split PRs requested in the review on #22289. The DataFrame-case parametrizations in test_string_dtype_all_na[*-sum-*] (df.groupby(df["a"]).sum()) also rely on identity-based grouping-key column exclusion, which lands in #22369. Both PRs must merge before those 16 conftest removals stop xpassing.

Pandas 3 makes ``DataFrame.groupby(...).sum()`` on StringDtype columns return a per-group string concatenation rather than raise. Implement that by dispatching to a new ``_string_sum`` helper from ``GroupBy._reduce`` whenever the value column dtype is ``pd.StringDtype``.

The implementation:

- collects values per group with ``plc.aggregation.collect_list``
- joins each list with ``plc.strings.combine.join_list_elements``, using ``OutputIfEmptyList.EMPTY_STRING`` (skipna=True) or ``NULL_ELEMENT`` (skipna=False) and a null/empty string as the per-element narep to match pandas' all-NA group semantics
- applies ``min_count`` by counting per-group non-nulls and using ``copy_if_else`` with a null scalar where ``count < min_count``

The dispatch happens before the pre-existing ``min_count`` guard so that string sum works with ``min_count > 0`` even before general ``min_count`` support is wired up for non-string ops.

Conftest update for ``test_string_dtype_all_na[*-sum-*]``: those parametrizations exercise ``df.groupby(df["a"]).sum()``, which also relies on identity-based grouping-key column exclusion. The xfail entries are removed here in anticipation of the grouping-key exclusion change landing as a sibling PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

copy-pr-bot · 2026-05-04T20:05:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

galipremsagar · 2026-05-04T20:33:26Z

/okay to test a7b08d8

galipremsagar requested a review from a team as a code owner May 4, 2026 20:05

galipremsagar requested review from rjzamora and wence- and removed request for a team May 4, 2026 20:05

github-actions Bot assigned galipremsagar May 4, 2026

github-actions Bot added the Python (Affects Python cuDF API) and cudf.pandas (Issues specific to cudf.pandas) labels May 4, 2026

github-project-automation Bot added this to cuDF Python May 4, 2026

GPUtester moved this to In Progress in cuDF Python May 4, 2026

galipremsagar mentioned this pull request May 4, 2026

Preserve StringDtype storage and na_value in get_dtype_of_same_kind #22289

Open

galipremsagar requested a review from mroeschke May 4, 2026 20:30

galipremsagar added the bug (Something isn't working) and non-breaking (Non-breaking change) labels May 4, 2026

galipremsagar wants to merge 1 commit into rapidsai:pandas3 from galipremsagar:groupby_string_sum
