Preserve extension dtypes in groupby reductions and exclude grouping-key columns by identity #22369
Open
galipremsagar wants to merge 1 commit into rapidsai:pandas3 from
…key columns by identity

- Output dtype of COUNT/SIZE/ARGMIN/ARGMAX is np.int64 when the value column is a StringDtype; NUNIQUE always returns np.int64.
- Series.groupby.size() on string[pyarrow] with na_value=pd.NA now returns Int64Dtype to match pandas 3.
- Add identity-based exclusion of grouping-key columns from value columns via _collect_series_key_column_names. _Grouping accepts series_key_column_names and _handle_series consumes it to populate _named_columns, mirroring pandas' behavior of dropping a column named in obj when it is also passed as a grouping Series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 4, 2026
/okay to test 3e7e9af
Summary
Split out from #22289. Two related fixes that together restore pandas-3 behavior for groupby reductions on extension-typed value columns:

Output dtype for int-returning aggregations on StringDtype

- COUNT/SIZE/ARGMIN/ARGMAX now return np.int64 (matching pandas 3) instead of Int64Dtype/int64[pyarrow].
- NUNIQUE always casts to np.int64.
- Series.groupby.size() on string[pyarrow] with na_value=pd.NA now returns Int64Dtype to match pandas 3's specific behavior for that storage/na_value combination.

Identity-based exclusion of grouping-key columns

Pandas excludes a value column whose underlying object is the same as the grouping Series' column (i.e., df.groupby(df["a"]) drops "a" from the aggregated values). This was missing in cuDF.

A new _collect_series_key_column_names helper captures this identity information before nans_to_nulls() breaks it under mode.pandas_compatible, and threads matched column names through _Grouping to populate _named_columns. The check is restricted to DataFrame inputs so that Series.groupby(self) (used internally by Series.value_counts, Series.mode, etc.) doesn't falsely match the Series against itself and empty the aggregation result.

Tests

python/cudf/cudf/tests/groupby/test_reductions.py:

- test_groupby_string_int_returning_aggs_dtype covers count/nunique/size across the four StringDtype storage/na_value combinations.
- test_groupby_series_identity_column_exclusion and test_groupby_series_copy_no_column_exclusion exercise the matched/non-matched paths.
- test_groupby_series_self_does_not_exclude guards against the regression where Series.groupby(self) empties the aggregation.

Conftest

Removes 75 NODEIDS_THAT_FAIL entries that now pass on the regular path:

- test_string_dtype_all_na[*-{count,size,nunique}-*] (60 entries): fixed by the int-returning dtype change.
- test_size_strings[string=string[pyarrow]], test_groupby_column_index_in_references, test_groupby_nonstring_columns, test_groupby_series_with_name, several test_nunique_* and test_duplicate_columns[nunique-*], etc.

Relationship to #22289

This is one of the four split PRs requested in the review on #22289. #22289 retains only the get_dtype_of_same_kind change; the remaining three split PRs are #string-sum / #bool-any-all / #min-count.

Some test_string_dtype_all_na[*-{sum,all,any,min,max,first,last}-*] parametrizations exercise both this PR's grouping-key-exclusion logic and another split PR's reduction logic, so the corresponding xfail entries stay until both halves merge.
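
For reviewers who want to see the pandas behavior that the identity-based exclusion mirrors, here is a minimal sketch. The frame and column names are illustrative and not taken from this PR's tests, and the snippet uses only public pandas calls; it assumes cuDF is meant to produce the same column sets once this change is in.

```python
import pandas as pd

# Illustrative data; the column names are arbitrary.
df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})

# Grouping by the frame's own column object: pandas recognizes "a" as the
# grouping key and drops it from the aggregated value columns.
same_object = df.groupby(df["a"]).sum()

# Grouping by a copy with equal values: the identity link is gone, so "a"
# is aggregated like any other value column.
copied_key = df.groupby(df["a"].copy()).sum()

print(list(same_object.columns))
print(list(copied_key.columns))
```

The first result keeps only b, while the second keeps both a and b. These are the matched and non-matched paths that test_groupby_series_identity_column_exclusion and test_groupby_series_copy_no_column_exclusion are described as covering.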