Implement groupby sum on StringDtype columns as per-group concatenation #22370
Open
galipremsagar wants to merge 1 commit into rapidsai:pandas3 from
Pandas 3 makes ``DataFrame.groupby(...).sum()`` on StringDtype columns return a per-group string concatenation rather than raise. Implement that by dispatching to a new ``_string_sum`` helper from ``GroupBy._reduce`` whenever the value column dtype is ``pd.StringDtype``.

The implementation:
- collects values per group with ``plc.aggregation.collect_list``
- joins each list with ``plc.strings.combine.join_list_elements``, using ``OutputIfEmptyList.EMPTY_STRING`` (skipna=True) or ``NULL_ELEMENT`` (skipna=False) and a null/empty string as the per-element narep to match pandas' all-NA group semantics
- applies ``min_count`` by counting per-group non-nulls and using ``copy_if_else`` with a null scalar where ``count < min_count``

The dispatch happens before the pre-existing ``min_count`` guard so that string sum works with ``min_count > 0`` even before general ``min_count`` support is wired up for non-string ops.

Conftest update for ``test_string_dtype_all_na[*-sum-*]``: those parametrizations exercise ``df.groupby(df["a"]).sum()``, which also relies on identity-based grouping-key column exclusion. The xfail entries are removed here in anticipation of the grouping-key exclusion change landing as a sibling PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
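The three steps listed in the commit message (collect, join, apply ``min_count``) can be modelled in plain Python to make the intended semantics concrete. This is a minimal sketch and not the cuDF code: ``string_group_sum`` is a hypothetical name, ``None`` stands in for a null, and the ``skipna=False`` rule (any null in a group makes that group's result null) is an assumption read from the narep description above, not verified against pandas or pylibcudf.

```python
from collections import defaultdict


def string_group_sum(keys, values, skipna=True, min_count=0):
    """Per-group string concatenation; None stands in for a null value."""
    # Step 1: collect each group's values into a list (cf. collect_list).
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)

    result = {}
    for key, collected in groups.items():
        non_null = [v for v in collected if v is not None]
        if len(non_null) < min_count:
            # Step 3: too few non-null values, so the group result is null.
            result[key] = None
        elif not skipna and len(non_null) < len(collected):
            # Assumed skipna=False behaviour: any null makes the result null.
            result[key] = None
        else:
            # Step 2: join the surviving strings with an empty separator.
            result[key] = "".join(non_null)
    return result


keys = ["a", "b", "a", "b", "c"]
values = ["x", "y", None, "z", None]
print(string_group_sum(keys, values))
print(string_group_sum(keys, values, skipna=False))
print(string_group_sum(keys, values, min_count=1))
```

In this model the all-null group ``"c"`` yields an empty string under the defaults and a null once ``skipna=False`` or ``min_count=1`` is passed, which is the distinction the empty-list policy and the ``min_count`` mask are described as handling.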
/okay to test a7b08d8
Summary
Split out from #22289. Pandas 3 makes ``DataFrame.groupby(...).sum()`` on ``StringDtype`` columns return a per-group string concatenation rather than raise ``TypeError``. This PR implements that path for cuDF.

Implementation (``python/cudf/cudf/core/groupby/groupby.py``)

``GroupBy._reduce`` dispatches to a new ``_string_sum`` helper whenever the value column dtype is ``pd.StringDtype`` (and ``op == "sum"``). The dispatch happens before the pre-existing ``min_count != 0`` guard so that string sum supports ``min_count > 0`` independently of the general ``min_count`` work in the sibling PR.

``_string_sum``:
- Collects values per group with ``plc.aggregation.collect_list``.
- Joins each list with ``plc.strings.combine.join_list_elements``, using ``OutputIfEmptyList.EMPTY_STRING`` (skipna=True) or ``NULL_ELEMENT`` (skipna=False) and the matching per-element narep.
- Applies ``min_count`` by counting per-group non-nulls (``plc.aggregation.count``) and using ``ColumnBase.copy_if_else`` with a null scalar where ``count < min_count``.

The ``test_group_by_empty_reduction`` xfail is updated since ``str`` + ``sum`` no longer raises ``TypeError``.

Tests

``test_groupby_string_sum`` covers all four ``StringDtype`` storage/na_value combinations.

Conftest

Removes 16 ``test_string_dtype_all_na[*-sum-*]`` entries.

Relationship to #22289

One of the four split PRs requested in the review on #22289. The DataFrame-case parametrizations in ``test_string_dtype_all_na[*-sum-*]`` (``df.groupby(df["a"]).sum()``) also rely on identity-based grouping-key column exclusion, which lands in #22369. Both PRs must merge before those 16 conftest removals stop xpassing.
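The dispatch ordering described under Implementation (string sum is routed before the ``min_count != 0`` guard) can be illustrated with a small stand-alone model. This is a sketch under stated assumptions: ``reduce_sketch`` and its return labels are hypothetical, and the guard is assumed to raise ``NotImplementedError``; the real ``GroupBy._reduce`` signature and error are not shown in this PR text.

```python
def reduce_sketch(op, is_string_column, min_count=0):
    """Return a label for the (hypothetical) code path handling the reduction."""
    # String sum is dispatched first, so it never reaches the guard below.
    if op == "sum" and is_string_column:
        return "string_sum"
    # Stand-in for the pre-existing guard (assumed to raise NotImplementedError).
    if min_count != 0:
        raise NotImplementedError("min_count is not supported for this op")
    return "generic"


print(reduce_sketch("sum", is_string_column=True, min_count=2))
print(reduce_sketch("sum", is_string_column=False))
```

Placing the string check first means a string ``sum`` with ``min_count=2`` is handled, while the same ``min_count`` on a non-string column still hits the guard in this model.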