Skip to content

Implement groupby all/any via bool-coercion + min/max#22371

Open
galipremsagar wants to merge 1 commit intorapidsai:pandas3from
galipremsagar:groupby_bool_reduce
Open

Implement groupby all/any via bool-coercion + min/max#22371
galipremsagar wants to merge 1 commit intorapidsai:pandas3from
galipremsagar:groupby_bool_reduce

Conversation

@galipremsagar
Copy link
Copy Markdown
Contributor

@galipremsagar galipremsagar commented May 4, 2026

Summary

Split out from #22289. GroupBy.all and GroupBy.any previously raised NotImplementedError. This PR implements them by reducing to min/max on a bool-coerced copy of the value columns.

Implementation (python/cudf/cudf/core/groupby/groupby.py)

A new _bool_reduce helper:

  • Coerces strings as count_characters > 0 so empty strings become False and nulls remain null (preserved through the aggregation).
  • Coerces numerics as != 0 with the same null preservation.
  • For skipna=False, fills nulls with True before aggregation so they don't flip all to False and trivially make any True.
  • Empty groups (skipna=True with all-NA values) yield NA from min/max; pandas treats those as vacuously True for all and False for any, so the result is filled accordingly.
  • Applies min_count by counting per-group non-nulls and masking groups whose count is below the threshold.

The new GroupBy is constructed with by=self.grouping (passing the existing _Grouping object) so key columns match the bool-coerced value columns exactly, avoiding label-based lookup when the original key column was excluded.

Tests

python/cudf/cudf/tests/groupby/test_reductions.py:

  • test_groupby_all_any over bool/int/float data.
  • test_groupby_all_any_string for string columns.
  • test_groupby_all_any_empty for empty-group behavior.

Conftest

Removes 32 test_string_dtype_all_na[*-all-*] and [*-any-*] entries.

Relationship to #22289

One of the four split PRs requested in the review on #22289. The DataFrame-case test_string_dtype_all_na[*-{all,any}-*] parametrizations (df.groupby(df["a"]).all()) also rely on identity-based grouping-key column exclusion in #22369; both must merge before the 32 conftest removals stop xpassing.

Both methods previously raised ``NotImplementedError``. Reduce ``all``/
``any`` to ``min``/``max`` on a bool-coerced copy of the value columns:

- Strings coerce as ``count_characters > 0`` so empty strings become
  ``False`` and nulls remain null (preserving them through the agg).
- Numerics coerce as ``!= 0`` with the same null preservation.
- ``skipna=False`` replaces nulls with ``True`` before the aggregation
  so that nulls don't flip ``all`` to ``False`` and trivially make
  ``any`` ``True``.
- Empty groups (all-NA values, skipna=True) yield NA from min/max;
  pandas treats those as vacuously ``True`` for ``all`` and ``False``
  for ``any``, so the result is filled accordingly.
- ``min_count`` masks groups whose non-null count is below the
  threshold.

Conftest update for ``test_string_dtype_all_na[*-all-*]`` and
``[*-any-*]`` (32 entries). The string-key DataFrame cases additionally
rely on identity-based grouping-key column exclusion, which lands in
a sibling PR; both must merge before the entries can be removed
without xpassing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@galipremsagar galipremsagar requested a review from a team as a code owner May 4, 2026 20:05
@galipremsagar galipremsagar requested review from TomAugspurger and brandon-b-miller and removed request for a team May 4, 2026 20:05
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels May 4, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 4, 2026
@galipremsagar galipremsagar added bug Something isn't working non-breaking Non-breaking change labels May 4, 2026
@galipremsagar galipremsagar requested a review from mroeschke May 4, 2026 20:33
@galipremsagar
Copy link
Copy Markdown
Contributor Author

/okay to test b288bbc

Comment on lines +3024 to +3026
if isinstance(col.dtype, pd.StringDtype) or is_dtype_obj_string(
col.dtype
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if isinstance(col.dtype, pd.StringDtype) or is_dtype_obj_string(
col.dtype
):
if is_dtype_obj_string(col.dtype):

Comment on lines +3027 to +3030
counts_plc = plc.strings.attributes.count_characters(
col.plc_column
)
gt_plc = plc.binaryop.binary_operation(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vyasr was there every a way to use PylibcudfFunction on back-to-back pylibcudf functions?

# Empty groups (skipna=True with all-NA values) yield NA from
# min/max — pandas treats these as ``True`` for ``all`` and
# ``False`` for ``any``.
bool_np = np.dtype(np.bool_)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming, is np.dtype(np.bool_) return regardless of the pandas string type?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cudf.pandas Issues specific to cudf.pandas non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

3 participants