Preserve extension dtypes in groupby reductions and exclude grouping-key columns by identity #22369

Open: galipremsagar wants to merge 1 commit into rapidsai:pandas3 from galipremsagar:groupby_ext_preservation
@galipremsagar galipremsagar commented May 4, 2026

Summary

Split out from #22289. Two related fixes that together restore pandas-3 behavior for groupby reductions on extension-typed value columns:

Output dtype for int-returning aggregations on StringDtype

  • COUNT/SIZE/ARGMIN/ARGMAX now return np.int64 (matching pandas 3) instead of Int64Dtype/int64[pyarrow].
  • NUNIQUE always casts to np.int64.
  • Series.groupby.size() on string[pyarrow] with na_value=pd.NA now returns Int64Dtype to match pandas 3's specific behavior for that storage/na_value combination.
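
The three bullets above can be read as a small decision table. The following is only a sketch of that reading for string-typed value columns: `expected_result_dtype` and its parameters are hypothetical names chosen for illustration, not functions in cuDF or pandas, and the returned strings stand in for `np.int64` and `Int64Dtype`.

```python
from __future__ import annotations

# Aggregations named in this PR as returning integers.
INT_RETURNING_AGGS = {"count", "size", "argmin", "argmax", "nunique"}


def expected_result_dtype(
    agg: str, *, storage: str, na_value_is_pd_na: bool, is_series: bool
) -> str:
    """Hypothetical summary of the rules above for StringDtype values.

    Returns "int64" (standing in for np.int64) or "Int64" (Int64Dtype).
    """
    if agg not in INT_RETURNING_AGGS:
        raise ValueError(f"{agg!r} is not an int-returning aggregation")
    # The one stated exception: Series.groupby.size() on string[pyarrow]
    # with na_value=pd.NA.
    if agg == "size" and is_series and storage == "pyarrow" and na_value_is_pd_na:
        return "Int64"
    # Everything else listed in the bullets resolves to np.int64.
    return "int64"


count_dtype = expected_result_dtype(
    "count", storage="pyarrow", na_value_is_pd_na=True, is_series=False
)
size_dtype = expected_result_dtype(
    "size", storage="pyarrow", na_value_is_pd_na=True, is_series=True
)
print(count_dtype, size_dtype)
```

The exception is checked first so that the general `np.int64` rule stays a single fall-through case.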

Identity-based exclusion of grouping-key columns

pandas excludes a value column whose underlying object is the same object as the grouping Series' column (for example, df.groupby(df["a"]) drops "a" from the aggregated values). cuDF previously lacked this behavior.

A new _collect_series_key_column_names helper captures this identity information before nans_to_nulls() breaks it under mode.pandas_compatible, and threads matched column names through _Grouping to populate _named_columns. The check is restricted to DataFrame inputs so that Series.groupby(self) (used internally by Series.value_counts, Series.mode, etc.) doesn't falsely match the Series against itself and empty the aggregation result.
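
To illustrate why the capture has to happen before any identity-breaking transform, here is a minimal pure-Python sketch. It is not the actual `_collect_series_key_column_names` implementation: `collect_key_column_names`, its signature, and the `is_dataframe` flag are hypothetical, plain lists stand in for columns, and a list copy stands in for what `nans_to_nulls()` does to object identity.

```python
from __future__ import annotations

from typing import Any, Hashable


def collect_key_column_names(
    columns: dict[Hashable, Any], keys: list[Any], *, is_dataframe: bool
) -> set[Hashable]:
    """Hypothetical stand-in: names of columns that ARE one of the keys.

    Matching uses object identity (`is`), not equality, so an equal-valued
    copy of a column does not count as a match.
    """
    if not is_dataframe:
        # Series.groupby(self) would otherwise match the Series against itself.
        return set()
    return {
        name
        for name, col in columns.items()
        if any(col is key for key in keys)
    }


col_a, col_b = [1, 1, 2], [10, 20, 30]
frame = {"a": col_a, "b": col_b}

# Same object as frame["a"]: "a" is recorded for exclusion.
matched = collect_key_column_names(frame, [col_a], is_dataframe=True)

# Equal values but a different object: nothing is excluded.
copied = collect_key_column_names(frame, [list(col_a)], is_dataframe=True)

# An identity-breaking transform (standing in for nans_to_nulls()) must run
# after the capture; checking afterwards finds no match.
transformed = {name: list(col) for name, col in frame.items()}
too_late = collect_key_column_names(transformed, [col_a], is_dataframe=True)

# Series case: the check is skipped entirely.
series_case = collect_key_column_names({"a": col_a}, [col_a], is_dataframe=False)

print(matched, copied, too_late, series_case)
```

The `matched` and `copied` cases correspond to the matched/non-matched paths, and `series_case` corresponds to the guard against emptying the aggregation for `Series.groupby(self)`.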

Tests

python/cudf/cudf/tests/groupby/test_reductions.py:

  • test_groupby_string_int_returning_aggs_dtype covers count/nunique/size across the four StringDtype storage/na_value combinations.
  • test_groupby_series_identity_column_exclusion and test_groupby_series_copy_no_column_exclusion exercise the matched/non-matched paths.
  • test_groupby_series_self_does_not_exclude guards against the regression where Series.groupby(self) empties the aggregation.

Conftest

Removes 75 NODEIDS_THAT_FAIL entries that now pass on the regular path:

  • test_string_dtype_all_na[*-{count,size,nunique}-*] (60 entries) — fixed by the int-returning dtype change.
  • 15 nunique- and identity-related entries: test_size_strings[string=string[pyarrow]], test_groupby_column_index_in_references, test_groupby_nonstring_columns, test_groupby_series_with_name, several test_nunique_* and test_duplicate_columns[nunique-*], etc.

Relationship to #22289

This is one of the four split PRs requested in the review on #22289. #22289 retains only the get_dtype_of_same_kind change; the remaining three split PRs are #string-sum / #bool-any-all / #min-count.

Some test_string_dtype_all_na[*-{sum,all,any,min,max,first,last}-*] parametrizations exercise both this PR's grouping-key-exclusion logic and another split PR's reduction logic, so the corresponding xfail entries stay until both halves merge.

…key columns by identity

- Output dtype of COUNT/SIZE/ARGMIN/ARGMAX is np.int64 when the value
  column is a StringDtype; NUNIQUE always returns np.int64.
- Series.groupby.size() on string[pyarrow] with na_value=pd.NA now
  returns Int64Dtype to match pandas 3.
- Add identity-based exclusion of grouping-key columns from value
  columns via _collect_series_key_column_names. _Grouping accepts
  series_key_column_names and _handle_series consumes it to populate
  _named_columns, mirroring pandas' behavior of dropping a column
  named in obj when it is also passed as a grouping Series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@galipremsagar galipremsagar requested a review from a team as a code owner May 4, 2026 20:05
@galipremsagar galipremsagar requested review from bdice and vyasr and removed request for a team May 4, 2026 20:05
copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@galipremsagar (Contributor, Author)

/okay to test 3e7e9af

@galipremsagar galipremsagar added the bug (Something isn't working), 3 - Ready for Review (Ready for review by team), and non-breaking (Non-breaking change) labels May 4, 2026
@galipremsagar galipremsagar requested a review from mroeschke May 4, 2026 20:33
Labels

  • 3 - Ready for Review: Ready for review by team
  • bug: Something isn't working
  • cudf.pandas: Issues specific to cudf.pandas
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.

Projects

Status: In Progress

2 participants