Preserve extension dtypes in groupby reductions and exclude grouping-key columns by identity by galipremsagar · Pull Request #22369 · rapidsai/cudf

galipremsagar · 2026-05-04T20:05:05Z

Summary

Split out from #22289. Two related fixes that together restore pandas-3 behavior for groupby reductions on extension-typed value columns:

Output dtype for int-returning aggregations on `StringDtype`

COUNT/SIZE/ARGMIN/ARGMAX now return np.int64 (matching pandas 3) instead of Int64Dtype/int64[pyarrow].
NUNIQUE always casts to np.int64.
Series.groupby.size() on string[pyarrow] with na_value=pd.NA now returns Int64Dtype to match pandas 3's specific behavior for that storage/na_value combination.

Identity-based exclusion of grouping-key columns

Pandas excludes a value column whose underlying object is the same as the grouping Series' column (i.e., df.groupby(df["a"]) drops "a" from the aggregated values). This was missing in cuDF.
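The pandas behavior described above can be seen with a small frame. This is a minimal sketch using plain pandas (the column names and values are made up for illustration); it only inspects which columns survive the aggregation, and it assumes the exclusion rule behaves the same in the pandas version installed.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})

# Grouping by the frame's own column: pandas recognizes that the grouping
# Series is backed by df["a"] and drops "a" from the aggregated values.
same = df.groupby(df["a"]).sum()

# Grouping by a copy: the copy no longer shares data with df["a"], so "a"
# is treated as an ordinary value column and is aggregated too.
copied = df.groupby(df["a"].copy()).sum()

print(list(same.columns))
print(list(copied.columns))
```

The two cases correspond to the matched and non-matched paths exercised by test_groupby_series_identity_column_exclusion and test_groupby_series_copy_no_column_exclusion.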

A new _collect_series_key_column_names helper captures this identity information before nans_to_nulls() breaks it under mode.pandas_compatible, and threads matched column names through _Grouping to populate _named_columns. The check is restricted to DataFrame inputs so that Series.groupby(self) (used internally by Series.value_counts, Series.mode, etc.) doesn't falsely match the Series against itself and empty the aggregation result.
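The identity check can be sketched with toy classes. This is not the cuDF implementation: `Column`, `ToySeries`, `ToyFrame`, and `collect_key_column_names` are hypothetical stand-ins, and matching by name first and then by object identity is an assumption modeled on the pandas behavior described above.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Column:
    """Hypothetical stand-in for a column; only object identity matters here."""
    values: list


@dataclass
class ToySeries:
    name: str
    column: Column


@dataclass
class ToyFrame:
    columns: dict  # maps column name -> Column

    def __getitem__(self, name):
        # Returns a new Series wrapper around the *same* Column object.
        return ToySeries(name, self.columns[name])


def collect_key_column_names(obj, by):
    """Names of columns in `obj` that are the same object as a grouping key's column."""
    # Restrict to frame inputs so that grouping a Series by itself never
    # matches the Series against itself.
    if not isinstance(obj, ToyFrame):
        return set()
    keys = by if isinstance(by, (list, tuple)) else [by]
    matched = set()
    for key in keys:
        if isinstance(key, ToySeries):
            candidate = obj.columns.get(key.name)
            if candidate is not None and candidate is key.column:
                matched.add(key.name)
    return matched


frame = ToyFrame({"a": Column([1, 1, 2]), "b": Column([3, 4, 5])})
series = ToySeries("a", Column([1, 1, 2]))

print(collect_key_column_names(frame, frame["a"]))  # key shares the frame's column
print(collect_key_column_names(frame, series))      # equal values, different object
print(collect_key_column_names(series, series))     # Series input is never matched
```

Collecting the names up front matters because any later step that replaces the column object (as nans_to_nulls() does under mode.pandas_compatible, per the description above) would make an `is` comparison fail.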

Tests

python/cudf/cudf/tests/groupby/test_reductions.py:

test_groupby_string_int_returning_aggs_dtype covers count/nunique/size across the four StringDtype storage/na_value combinations.
test_groupby_series_identity_column_exclusion and test_groupby_series_copy_no_column_exclusion exercise the matched/non-matched paths.
test_groupby_series_self_does_not_exclude guards against the regression where Series.groupby(self) empties the aggregation.

Conftest

Removes 75 NODEIDS_THAT_FAIL entries that now pass on the regular path:

test_string_dtype_all_na[*-{count,size,nunique}-*] (60 entries) — fixed by the int-returning dtype change.
15 nunique- and identity-related entries: test_size_strings[string=string[pyarrow]], test_groupby_column_index_in_references, test_groupby_nonstring_columns, test_groupby_series_with_name, several test_nunique_* and test_duplicate_columns[nunique-*], etc.

Relationship to #22289

This is one of the four split PRs requested in the review on #22289. #22289 retains only the get_dtype_of_same_kind change; the remaining three split PRs are #string-sum / #bool-any-all / #min-count.

Some test_string_dtype_all_na[*-{sum,all,any,min,max,first,last}-*] parametrizations exercise both this PR's grouping-key-exclusion logic and another split PR's reduction logic, so the corresponding xfail entries stay until both halves merge.

…key columns by identity

- Output dtype of COUNT/SIZE/ARGMIN/ARGMAX is np.int64 when the value column is a StringDtype; NUNIQUE always returns np.int64.
- Series.groupby.size() on string[pyarrow] with na_value=pd.NA now returns Int64Dtype to match pandas 3.
- Add identity-based exclusion of grouping-key columns from value columns via _collect_series_key_column_names. _Grouping accepts series_key_column_names and _handle_series consumes it to populate _named_columns, mirroring pandas' behavior of dropping a column named in obj when it is also passed as a grouping Series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

copy-pr-bot · 2026-05-04T20:05:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

galipremsagar · 2026-05-04T20:32:50Z

/okay to test 3e7e9af

galipremsagar requested a review from a team as a code owner May 4, 2026 20:05

galipremsagar requested review from bdice and vyasr and removed request for a team May 4, 2026 20:05

github-actions Bot assigned galipremsagar May 4, 2026

github-actions Bot added the Python and cudf.pandas labels May 4, 2026

github-project-automation Bot added this to cuDF Python May 4, 2026

galipremsagar mentioned this pull request May 4, 2026

Implement groupby sum on StringDtype columns as per-group concatenation #22370

Open

GPUtester moved this to In Progress in cuDF Python May 4, 2026

This was referenced May 4, 2026

Implement groupby all/any via bool-coercion + min/max #22371

Open

Preserve StringDtype storage and na_value in get_dtype_of_same_kind #22289

Open

galipremsagar added the bug, 3 - Ready for Review, and non-breaking labels May 4, 2026

galipremsagar requested a review from mroeschke May 4, 2026 20:33

galipremsagar wants to merge 1 commit into rapidsai:pandas3 from galipremsagar:groupby_ext_preservation
