Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types #1347

Open

kosiew wants to merge 11 commits into apache:main from kosiew:typeconversion-issue-1339

Conversation

@kosiew (Contributor) commented Jan 20, 2026

Which issue does this PR close?


Rationale for this change

Users creating Python user-defined aggregate functions (UDAFs) in DataFusion were unable to reliably return list-valued results, such as a list of timestamps per group. Attempting to do so produced confusing Arrow type conversion errors (e.g. trying to coerce a TimestampArray into an integer).

This limitation made it impossible to implement common aggregation patterns such as collecting events, timestamps, or values into arrays. The underlying issue was that DataFusion expected scalar values from evaluate and state, but Python UDAFs could inadvertently return PyArrow arrays without proper conversion.

This PR improves both correctness and ergonomics by explicitly supporting list-valued scalars returned from Python UDAFs and documenting the correct usage pattern for users.
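
To illustrate the documented pattern, the key distinction is between a list-valued scalar and an array (the values below are hypothetical):

    import pyarrow as pa
    from datetime import datetime

    # Hypothetical collected values for one group:
    timestamps = [datetime(2026, 1, 20, 12, 0), datetime(2026, 1, 20, 13, 30)]

    # Supported: evaluate() returns a single list-valued scalar per group.
    ok = pa.scalar(timestamps, type=pa.list_(pa.timestamp("ns")))

    # Not supported from evaluate(): a pyarrow.Array is not a per-group scalar.
    # not_ok = pa.array(timestamps, type=pa.timestamp("ns"))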


What changes are included in this PR?

  • Python API documentation updates

    • Added an FAQ entry explaining how to return lists from a UDAF.
    • Clarified that evaluate must return a list-valued pyarrow.Scalar, not a pyarrow.Array.
  • Improved Python-side UDAF guidance

    • Expanded the Accumulator.evaluate docstring with a concrete example of returning a list-valued scalar.
  • Rust ↔ Python interop enhancements

    • Updated Rust UDAF bindings to gracefully convert Python objects (including PyArrow arrays and chunked arrays) into ScalarValue::List when appropriate.
    • Added a robust fallback conversion path using py_obj_to_scalar_value for both state and evaluate.
  • New test coverage

    • Added a Python test validating that a UDAF can successfully return a list of timestamps without errors.

Are these changes tested?

Yes.

  • A new test (test_udaf_list_timestamp_return) verifies that a Python UDAF can collect and return a list of timestamps.
  • The test exercises update, merge, state, and evaluate paths to ensure end-to-end correctness; a rough sketch of its shape follows below.
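
For reference, the test has roughly the following shape. Class, column, and variable names are illustrative, and the exact Accumulator method signatures and udaf() registration arguments may differ from the code in this PR:

    import pyarrow as pa
    from datetime import datetime
    from datafusion import Accumulator, SessionContext, column, udaf

    class CollectTimestamps(Accumulator):
        """Collect every timestamp seen in a group into one list."""

        def __init__(self) -> None:
            self._values: list[datetime] = []

        def update(self, values: pa.Array) -> None:
            self._values.extend(values.to_pylist())

        def merge(self, states: list[pa.Array]) -> None:
            # One state field; each element is the list another partition collected.
            for partition_values in states[0].to_pylist():
                self._values.extend(partition_values)

        def state(self) -> list[pa.Scalar]:
            return [pa.scalar(self._values, type=pa.list_(pa.timestamp("us")))]

        def evaluate(self) -> pa.Scalar:
            return pa.scalar(self._values, type=pa.list_(pa.timestamp("us")))

    collect_ts = udaf(
        CollectTimestamps,
        pa.timestamp("us"),              # input type
        pa.list_(pa.timestamp("us")),    # return type
        [pa.list_(pa.timestamp("us"))],  # state type(s)
        "stable",
    )

    ctx = SessionContext()
    df = ctx.from_pydict(
        {
            "g": ["a", "a", "b"],
            "ts": [datetime(2026, 1, 1), datetime(2026, 1, 2), datetime(2026, 1, 3)],
        }
    )
    result = df.aggregate([column("g")], [collect_ts(column("ts"))]).collect()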

Are there any user-facing changes?

Yes.

  • Python UDAF authors can now return list-valued results (e.g. list[timestamp]) in a supported and documented way.
  • Documentation now clearly explains the correct pattern and avoids common pitfalls.
  • This is a backward-compatible enhancement; existing UDAFs are unaffected.

LLM-generated code disclosure

This PR includes code and comments generated with assistance from an LLM. All LLM-generated content has been manually reviewed and tested.

Store UDAF return type in Rust accumulator and wrap pyarrow Array/ChunkedArray returns into list scalars for list-like return types. Add a UDAF test to return a list of timestamps via a pyarrow array, validating the aggregate output for correctness.

Add documented list-valued scalar returns for UDAF accumulators, including an example with pa.scalar and a note about unsupported pyarrow.Array returns from evaluate(). Also, introduce a UDAF FAQ entry detailing list-returning patterns and required return_type/state_type definitions.

…nbinding and binding fresh copies when checking array-likeness, eliminating the Bound reference error
@kosiew kosiew marked this pull request as ready for review January 27, 2026 03:08
@timsaucer (Member) commented Feb 4, 2026

Sorry it's taken me a while to get around to this PR. It feels like we are doing two different things:

  1. telling users that they need to return pyarrow scalars from evaluate
  2. detecting when the value is a list and then converting it to a Python value and back into a pyarrow scalar

It feels like this isn't the best option. I think we want to avoid doing any kind of to_pylist() calls.

I think a more general solution would be something like

  1. Determine if they have passed in a pyarrow scalar value. If so, use it.
  2. If they have not passed in a pyarrow scalar value, use py_obj_to_scalar_value to convert to a scalar value
  3. Update py_obj_to_scalar_value to detect pyarrow arrays and convert them to ScalarValue::List

For the last part we could do something like

    if obj.hasattr("__arrow_c_array__")? {
        let array_data = ArrayData::from_pyarrow_bound(&obj)?;

        let array = make_array(array_data);

        // ScalarValue::ListArray must be a list of length 1
        let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0, array.len() as i32]));
        let list_array = Arc::new(ListArray::new(
            Arc::new(Field::new_list_field(array.data_type().clone(), true)),
            offsets,
            array,
            None,
        ));

        return Ok(ScalarValue::List(list_array));
    }

Additionally, if we're going to go down this route I think we would want to handle both state() and evaluate() this way, since both of them should be returning scalars.

An advantage of the approach described above is that I think it adds more flexibility for users, because their Python functions can just return Python integers and such without having to convert them to pyarrow scalars.
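
As a rough illustration of that flexibility (hypothetical classes and attributes, not code from this PR):

    import pyarrow as pa

    class CountingAcc:
        def evaluate(self):
            return self._count            # plain Python int, via py_obj_to_scalar_value

    class CollectingAcc:
        def evaluate(self):
            return pa.array(self._items)  # pyarrow array, converted to ScalarValue::List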

What do you think?

@timsaucer (Member)

One problem I see with my answer above ^ is that some libraries like nanoarrow DO implement __arrow_c_array__ on a scalar value, and we wouldn't want to accidentally turn that into a ScalarValue::List.

…ling and conversion from Python objects to Arrow types
@kosiew (Contributor, Author) commented Feb 5, 2026

@timsaucer,

Thanks for your suggestions.

I agree on both points and have refactored the implementation to align more closely with your approach, while also taking care to address the nanoarrow concern in a clear and safe way.

Implementation Details

  1. Direct scalar handling
    When the user returns a PyArrow scalar, we simply use it as-is via the existing PyScalarValue extraction—no extra work required.

  2. Fallback to py_obj_to_scalar_value
    For anything that isn’t already a scalar (native Python values, arrays, etc.), we route through py_obj_to_scalar_value, which now cleanly handles the conversion.

  3. Extended py_obj_to_scalar_value
    This function now:

    • Checks whether the object is already a pyarrow.Scalar using an explicit isinstance() check and extracts it directly.
    • Detects pyarrow.Array or pyarrow.ChunkedArray (also via isinstance()) and converts them into ScalarValue::List using the Arrow C data interface—no to_pylist() calls involved.
    • Falls back to the original behavior for native Python values (ints, floats, strings, etc.), converting them via pyarrow.scalar().
  4. Applied consistently to state() and evaluate()
    Both methods now share this unified conversion path, ensuring consistent and predictable behavior.

Why isinstance() instead of __arrow_c_array__?

I intentionally avoided checking the __arrow_c_array__ protocol and opted for explicit isinstance() checks against pyarrow.Scalar, pyarrow.Array, and pyarrow.ChunkedArray (a conceptual sketch follows the list below). This keeps things clear and robust:

  • Scalar objects from libraries like nanoarrow that implement __arrow_c_array__ are still correctly treated as scalars, rather than being misclassified as lists.
  • Only true PyArrow array types are converted into ScalarValue::List.
  • The type checks remain explicit, readable, and safe.
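
A conceptual Python-level sketch of that ordering (illustrative only; the actual dispatch and the ScalarValue::List conversion live in Rust and use the Arrow C data interface):

    import pyarrow as pa

    def classify_return_value(obj) -> str:
        if isinstance(obj, pa.Scalar):
            return "pyarrow scalar: extracted directly"
        if isinstance(obj, (pa.Array, pa.ChunkedArray)):
            return "pyarrow array / chunked array: wrapped into ScalarValue::List"
        # nanoarrow-style objects expose __arrow_c_array__ but are not pyarrow
        # types, so they fall through to the original fallback conversion.
        return "anything else: converted via pyarrow.scalar()"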

Benefits

  • No performance penalty: Avoids to_pylist() entirely by relying on the Arrow C data interface.
  • Flexible: Users can return native Python values, PyArrow scalars, or PyArrow arrays—everything is handled gracefully.
  • Consistent: Both state() and evaluate() now follow the same conversion logic.
  • Safe: Clear type discrimination prevents nanoarrow scalars from being misclassified.


Development

Successfully merging this pull request may close these issues.

Cannot do udaf that returns list of timestamps
