Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types#1347
Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types#1347kosiew wants to merge 11 commits intoapache:mainfrom
Conversation
Store UDAF return type in Rust accumulator and wrap pyarrow Array/ChunkedArray returns into list scalars for list-like return types. Add a UDAF test to return a list of timestamps via a pyarrow array, validating the aggregate output for correctness.
Add documented list-valued scalar returns for UDAF accumulators, including an example with pa.scalar and a note about unsupported pyarrow.Array returns from evaluate(). Also, introduce a UDAF FAQ entry detailing list-returning patterns and required return_type/state_type definitions.
…ype checking for list types
…nbinding and binding fresh copies when checking array-likeness, eliminating the Bound reference error
|
Sorry it's taken me a while to get around to this PR. It feels like we are doing two different things
It feels like this isn't the best option. I think we want to avoid doing any kind of I think a more general solution would be something like
For the last part we could do something like Additionally, if we're going to go down this route I think we would want to treat both the An advantage of the point described above is that I think it adds more flexibility to the users because their python functions can just return python integers and such without having to convert them to pyarrow scalars. What do you think? |
|
One problem I see with my answer above ^ is that some libraries like nanoarrow DO implement |
…ling and conversion from Python objects to Arrow types
…in RustAccumulator and utility functions
|
Thanks for your suggestions. I agree on both points and have refactored the implementation to align more closely with your approach, while also taking care to address the nanoarrow concern in a clear and safe way. Implementation Details
Why
|
Which issue does this PR close?
Rationale for this change
Users creating Python user-defined aggregate functions (UDAFs) in DataFusion were unable to reliably return list-valued results, such as a list of timestamps per group. Attempting to do so resulted in confusing Arrow type conversion errors (e.g. attempting to coerce a
TimestampArrayinto an integer).This limitation made it impossible to implement common aggregation patterns such as collecting events, timestamps, or values into arrays. The underlying issue was that DataFusion expected scalar values from
evaluateandstate, but Python UDAFs could inadvertently return PyArrow arrays without proper conversion.This PR improves both correctness and ergonomics by explicitly supporting list-valued scalars returned from Python UDAFs and documenting the correct usage pattern for users.
What changes are included in this PR?
Python API documentation updates
evaluatemust return a list-valuedpyarrow.Scalar, not apyarrow.Array.Improved Python-side UDAF guidance
Accumulator.evaluatedocstring with a concrete example of returning a list-valued scalar.Rust ↔ Python interop enhancements
ScalarValue::Listwhen appropriate.py_obj_to_scalar_valuefor bothstateandevaluate.New test coverage
Are these changes tested?
Yes.
test_udaf_list_timestamp_return) verifies that a Python UDAF can collect and return a list of timestamps.update,merge,state, andevaluatepaths to ensure end-to-end correctness.Are there any user-facing changes?
Yes.
list[timestamp]) in a supported and documented way.LLM-generated code disclosure
This PR includes code and comments generated with assistance from an LLM. All LLM-generated content has been manually reviewed and tested.