Implement groupby all/any via bool-coercion + min/max by galipremsagar · Pull Request #22371 · rapidsai/cudf

galipremsagar · 2026-05-04T20:05:51Z

Summary

Split out from #22289. GroupBy.all and GroupBy.any previously raised NotImplementedError. This PR implements them by reducing to min/max on a bool-coerced copy of the value columns.

Implementation (`python/cudf/cudf/core/groupby/groupby.py`)

A new _bool_reduce helper:

Coerces strings as count_characters > 0 so empty strings become False and nulls remain null (preserved through the aggregation).
Coerces numerics as != 0 with the same null preservation.
For skipna=False, fills nulls with True before aggregation so they don't flip all to False and trivially make any True.
Empty groups (skipna=True with all-NA values) yield NA from min/max; pandas treats those as vacuously True for all and False for any, so the result is filled accordingly.
Applies min_count by counting per-group non-nulls and masking groups whose count is below the threshold.

The new GroupBy is constructed with by=self.grouping (passing the existing _Grouping object) so key columns match the bool-coerced value columns exactly, avoiding label-based lookup when the original key column was excluded.

Tests

python/cudf/cudf/tests/groupby/test_reductions.py:

test_groupby_all_any over bool/int/float data.
test_groupby_all_any_string for string columns.
test_groupby_all_any_empty for empty-group behavior.

Conftest

Removes 32 test_string_dtype_all_na[*-all-*] and [*-any-*] entries.

Relationship to #22289

One of the four split PRs requested in the review on #22289. The DataFrame-case test_string_dtype_all_na[*-{all,any}-*] parametrizations (df.groupby(df["a"]).all()) also rely on identity-based grouping-key column exclusion in #22369; both must merge before the 32 conftest removals stop xpassing.

Both methods previously raised ``NotImplementedError``. Reduce ``all``/ ``any`` to ``min``/``max`` on a bool-coerced copy of the value columns: - Strings coerce as ``count_characters > 0`` so empty strings become ``False`` and nulls remain null (preserving them through the agg). - Numerics coerce as ``!= 0`` with the same null preservation. - ``skipna=False`` replaces nulls with ``True`` before the aggregation so that nulls don't flip ``all`` to ``False`` and trivially make ``any`` ``True``. - Empty groups (all-NA values, skipna=True) yield NA from min/max; pandas treats those as vacuously ``True`` for ``all`` and ``False`` for ``any``, so the result is filled accordingly. - ``min_count`` masks groups whose non-null count is below the threshold. Conftest update for ``test_string_dtype_all_na[*-all-*]`` and ``[*-any-*]`` (32 entries). The string-key DataFrame cases additionally rely on identity-based grouping-key column exclusion, which lands in a sibling PR; both must merge before the entries can be removed without xpassing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

copy-pr-bot · 2026-05-04T20:05:55Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

galipremsagar · 2026-05-04T20:33:49Z

/okay to test b288bbc

mroeschke · 2026-05-05T02:24:19Z

+            if isinstance(col.dtype, pd.StringDtype) or is_dtype_obj_string(
+                col.dtype
+            ):


Suggested change

if isinstance(col.dtype, pd.StringDtype) or is_dtype_obj_string(

col.dtype

):

if is_dtype_obj_string(col.dtype):

mroeschke · 2026-05-05T02:28:21Z

+                counts_plc = plc.strings.attributes.count_characters(
+                    col.plc_column
+                )
+                gt_plc = plc.binaryop.binary_operation(


@vyasr was there every a way to use PylibcudfFunction on back-to-back pylibcudf functions?

mroeschke · 2026-05-05T02:30:02Z

+        # Empty groups (skipna=True with all-NA values) yield NA from
+        # min/max — pandas treats these as ``True`` for ``all`` and
+        # ``False`` for ``any``.
+        bool_np = np.dtype(np.bool_)


Just confirming, is np.dtype(np.bool_) return regardless of the pandas string type?

galipremsagar requested a review from a team as a code owner May 4, 2026 20:05

galipremsagar requested review from TomAugspurger and brandon-b-miller and removed request for a team May 4, 2026 20:05

github-actions Bot assigned galipremsagar May 4, 2026

github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels May 4, 2026

github-project-automation Bot added this to cuDF Python May 4, 2026

galipremsagar mentioned this pull request May 4, 2026

Preserve StringDtype storage and na_value in get_dtype_of_same_kind #22289

Open

GPUtester moved this to In Progress in cuDF Python May 4, 2026

galipremsagar added bug Something isn't working non-breaking Non-breaking change labels May 4, 2026

galipremsagar requested a review from mroeschke May 4, 2026 20:33

mroeschke reviewed May 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement groupby all/any via bool-coercion + min/max#22371

Implement groupby all/any via bool-coercion + min/max#22371
galipremsagar wants to merge 1 commit intorapidsai:pandas3from
galipremsagar:groupby_bool_reduce

galipremsagar commented May 4, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 4, 2026

Uh oh!

galipremsagar commented May 4, 2026

Uh oh!

mroeschke May 5, 2026

Uh oh!

mroeschke May 5, 2026

Uh oh!

mroeschke May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

galipremsagar commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Implementation (python/cudf/cudf/core/groupby/groupby.py)

Tests

Conftest

Relationship to #22289

Uh oh!

copy-pr-bot Bot commented May 4, 2026

Uh oh!

galipremsagar commented May 4, 2026

Uh oh!

mroeschke May 5, 2026

Choose a reason for hiding this comment

Uh oh!

mroeschke May 5, 2026

Choose a reason for hiding this comment

Uh oh!

mroeschke May 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

galipremsagar commented May 4, 2026 •

edited

Loading

Implementation (`python/cudf/cudf/core/groupby/groupby.py`)