Implement groupby sum on StringDtype columns as per-group concatenation#22370

Open
galipremsagar wants to merge 1 commit into rapidsai:pandas3 from galipremsagar:groupby_string_sum

@galipremsagar galipremsagar commented May 4, 2026

Summary

Split out from #22289. Pandas 3 makes DataFrame.groupby(...).sum() on StringDtype columns return a per-group string concatenation rather than raising TypeError. This PR implements that path for cuDF.

Implementation (python/cudf/cudf/core/groupby/groupby.py)

GroupBy._reduce dispatches to a new _string_sum helper when op == "sum" and the value column dtype is pd.StringDtype. The dispatch happens before the pre-existing min_count != 0 guard, so string sum supports min_count > 0 independently of the general min_count work in the sibling PR.

_string_sum:

  • Collects per-group values with plc.aggregation.collect_list.
  • Joins each list with plc.strings.combine.join_list_elements, using OutputIfEmptyList.EMPTY_STRING (skipna=True) or OutputIfEmptyList.NULL_ELEMENT (skipna=False) and the matching per-element narep.
  • Applies min_count by counting per-group non-nulls (plc.aggregation.count) and using ColumnBase.copy_if_else with a null scalar where count < min_count.
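
The three steps above can be modeled in plain Python as a rough reference for the intended semantics. This is only a sketch: string_group_sum is a hypothetical helper, not the cuDF implementation, and it assumes that None stands in for a null, that skipna=True drops nulls before joining, and that skipna=False turns any group containing a null into a null result.

```python
from collections import defaultdict


def string_group_sum(keys, values, skipna=True, min_count=0):
    """Pure-Python reference model (hypothetical helper, not cuDF code)."""
    # Step 1: gather each group's values, nulls (None) included,
    # similar in spirit to a collect_list aggregation.
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)

    result = {}
    for key, vals in groups.items():
        non_null = [v for v in vals if v is not None]
        if len(non_null) < min_count:
            # Step 3: too few non-null values, so the group becomes null.
            result[key] = None
        elif skipna:
            # Step 2 (skipna=True): nulls are skipped; an all-null group
            # joins to the empty string.
            result[key] = "".join(non_null)
        elif len(non_null) < len(vals):
            # Step 2 (skipna=False, assumed): any null poisons the group.
            result[key] = None
        else:
            result[key] = "".join(vals)
    return result


keys = ["a", "a", "b", "b", "c"]
values = ["x", "y", None, "z", None]

default = string_group_sum(keys, values)
strict = string_group_sum(keys, values, skipna=False)
counted = string_group_sum(keys, values, min_count=1)
```

With this model, the all-null group "c" yields an empty string by default and a null once min_count=1 is requested, which is the case the count-and-mask step is meant to handle.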

The test_group_by_empty_reduction xfail is updated since str + sum no longer raises TypeError.

Tests

test_groupby_string_sum covers all four StringDtype storage/na_value combinations.

Conftest

Removes the 16 test_string_dtype_all_na[*-sum-*] xfail entries.

Relationship to #22289

One of the four split PRs requested in the review on #22289. The DataFrame-case parametrizations in test_string_dtype_all_na[*-sum-*] (df.groupby(df["a"]).sum()) also rely on identity-based grouping-key column exclusion, which lands in #22369. Both PRs must merge before those 16 conftest removals stop xpassing.

Pandas 3 makes ``DataFrame.groupby(...).sum()`` on StringDtype columns
return a per-group string concatenation rather than raise. Implement
that by dispatching to a new ``_string_sum`` helper from
``GroupBy._reduce`` whenever the value column dtype is ``pd.StringDtype``.

The implementation:
- collects values per group with ``plc.aggregation.collect_list``
- joins each list with ``plc.strings.combine.join_list_elements``,
  using ``OutputIfEmptyList.EMPTY_STRING`` (skipna=True) or
  ``NULL_ELEMENT`` (skipna=False) and a null/empty string as the
  per-element narep to match pandas' all-NA group semantics
- applies ``min_count`` by counting per-group non-nulls and using
  ``copy_if_else`` with a null scalar where ``count < min_count``

The dispatch happens before the pre-existing ``min_count`` guard so
that string sum works with ``min_count > 0`` even before general
``min_count`` support is wired up for non-string ops.

Conftest update for ``test_string_dtype_all_na[*-sum-*]``: those
parametrizations exercise ``df.groupby(df["a"]).sum()``, which also
relies on identity-based grouping-key column exclusion. The xfail
entries are removed here in anticipation of the grouping-key
exclusion change landing as a sibling PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@galipremsagar galipremsagar requested a review from a team as a code owner May 4, 2026 20:05
@galipremsagar galipremsagar requested review from rjzamora and wence- and removed request for a team May 4, 2026 20:05
copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Python (Affects Python cuDF API.) and cudf.pandas (Issues specific to cudf.pandas) labels May 4, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 4, 2026
@galipremsagar galipremsagar requested a review from mroeschke May 4, 2026 20:30
@galipremsagar galipremsagar added the bug (Something isn't working) and non-breaking (Non-breaking change) labels May 4, 2026
@galipremsagar
Contributor Author

/okay to test a7b08d8

Labels

bug (Something isn't working), cudf.pandas (Issues specific to cudf.pandas), non-breaking (Non-breaking change), Python (Affects Python cuDF API.)

Projects

Status: In Progress

2 participants