Implement groupby all/any via bool-coercion + min/max#22371
Open
galipremsagar wants to merge 1 commit intorapidsai:pandas3from
Open
Implement groupby all/any via bool-coercion + min/max#22371galipremsagar wants to merge 1 commit intorapidsai:pandas3from
galipremsagar wants to merge 1 commit intorapidsai:pandas3from
Conversation
Both methods previously raised ``NotImplementedError``. Reduce ``all``/ ``any`` to ``min``/``max`` on a bool-coerced copy of the value columns: - Strings coerce as ``count_characters > 0`` so empty strings become ``False`` and nulls remain null (preserving them through the agg). - Numerics coerce as ``!= 0`` with the same null preservation. - ``skipna=False`` replaces nulls with ``True`` before the aggregation so that nulls don't flip ``all`` to ``False`` and trivially make ``any`` ``True``. - Empty groups (all-NA values, skipna=True) yield NA from min/max; pandas treats those as vacuously ``True`` for ``all`` and ``False`` for ``any``, so the result is filled accordingly. - ``min_count`` masks groups whose non-null count is below the threshold. Conftest update for ``test_string_dtype_all_na[*-all-*]`` and ``[*-any-*]`` (32 entries). The string-key DataFrame cases additionally rely on identity-based grouping-key column exclusion, which lands in a sibling PR; both must merge before the entries can be removed without xpassing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
/okay to test b288bbc |
mroeschke
reviewed
May 5, 2026
Comment on lines
+3024
to
+3026
| if isinstance(col.dtype, pd.StringDtype) or is_dtype_obj_string( | ||
| col.dtype | ||
| ): |
Contributor
There was a problem hiding this comment.
Suggested change
| if isinstance(col.dtype, pd.StringDtype) or is_dtype_obj_string( | |
| col.dtype | |
| ): | |
| if is_dtype_obj_string(col.dtype): |
Comment on lines
+3027
to
+3030
| counts_plc = plc.strings.attributes.count_characters( | ||
| col.plc_column | ||
| ) | ||
| gt_plc = plc.binaryop.binary_operation( |
Contributor
There was a problem hiding this comment.
@vyasr was there every a way to use PylibcudfFunction on back-to-back pylibcudf functions?
| # Empty groups (skipna=True with all-NA values) yield NA from | ||
| # min/max — pandas treats these as ``True`` for ``all`` and | ||
| # ``False`` for ``any``. | ||
| bool_np = np.dtype(np.bool_) |
Contributor
There was a problem hiding this comment.
Just confirming, is np.dtype(np.bool_) return regardless of the pandas string type?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Split out from #22289.
GroupBy.allandGroupBy.anypreviously raisedNotImplementedError. This PR implements them by reducing tomin/maxon a bool-coerced copy of the value columns.Implementation (
python/cudf/cudf/core/groupby/groupby.py)A new
_bool_reducehelper:count_characters > 0so empty strings becomeFalseand nulls remain null (preserved through the aggregation).!= 0with the same null preservation.skipna=False, fills nulls withTruebefore aggregation so they don't flipalltoFalseand trivially makeanyTrue.TrueforallandFalseforany, so the result is filled accordingly.min_countby counting per-group non-nulls and masking groups whose count is below the threshold.The new GroupBy is constructed with
by=self.grouping(passing the existing_Groupingobject) so key columns match the bool-coerced value columns exactly, avoiding label-based lookup when the original key column was excluded.Tests
python/cudf/cudf/tests/groupby/test_reductions.py:test_groupby_all_anyover bool/int/float data.test_groupby_all_any_stringfor string columns.test_groupby_all_any_emptyfor empty-group behavior.Conftest
Removes 32
test_string_dtype_all_na[*-all-*]and[*-any-*]entries.Relationship to #22289
One of the four split PRs requested in the review on #22289. The DataFrame-case
test_string_dtype_all_na[*-{all,any}-*]parametrizations (df.groupby(df["a"]).all()) also rely on identity-based grouping-key column exclusion in #22369; both must merge before the 32 conftest removals stop xpassing.