Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types #1347

Open

kosiew wants to merge 11 commits into apache:main from kosiew:typeconversion-issue-1339

Conversation

@kosiew (Contributor) commented Jan 20, 2026

Which issue does this PR close?


Rationale for this change

Users creating Python user-defined aggregate functions (UDAFs) in DataFusion were unable to reliably return list-valued results, such as a list of timestamps per group. Attempting to do so produced confusing Arrow type conversion errors (e.g. trying to coerce a TimestampArray into an integer).

This limitation made it impossible to implement common aggregation patterns such as collecting events, timestamps, or values into arrays. The underlying issue was that DataFusion expected scalar values from evaluate and state, but Python UDAFs could inadvertently return PyArrow arrays without proper conversion.

This PR improves both correctness and ergonomics by explicitly supporting list-valued scalars returned from Python UDAFs and documenting the correct usage pattern for users.
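
To illustrate the documented pattern, the key distinction is between a list-valued scalar and an array (the values below are hypothetical):

    import pyarrow as pa
    from datetime import datetime

    # Hypothetical collected values for one group:
    timestamps = [datetime(2026, 1, 20, 12, 0), datetime(2026, 1, 20, 13, 30)]

    # Supported: evaluate() returns a single list-valued scalar per group.
    ok = pa.scalar(timestamps, type=pa.list_(pa.timestamp("ns")))

    # Not supported from evaluate(): a pyarrow.Array is not a per-group scalar.
    # not_ok = pa.array(timestamps, type=pa.timestamp("ns"))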


What changes are included in this PR?

  • Python API documentation updates

    • Added an FAQ entry explaining how to return lists from a UDAF.
    • Clarified that evaluate must return a list-valued pyarrow.Scalar, not a pyarrow.Array.
  • Improved Python-side UDAF guidance

    • Expanded the Accumulator.evaluate docstring with a concrete example of returning a list-valued scalar.
  • Rust ↔ Python interop enhancements

    • Updated Rust UDAF bindings to gracefully convert Python objects (including PyArrow arrays and chunked arrays) into ScalarValue::List when appropriate.
    • Added a robust fallback conversion path using py_obj_to_scalar_value for both state and evaluate.
  • New test coverage

    • Added a Python test validating that a UDAF can successfully return a list of timestamps without errors.

Are these changes tested?

Yes.

  • A new test (test_udaf_list_timestamp_return) verifies that a Python UDAF can collect and return a list of timestamps.
  • The test exercises update, merge, state, and evaluate paths to ensure end-to-end correctness; a rough sketch of its shape follows below.
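
For reference, the test has roughly the following shape. Class, column, and variable names are illustrative, and the exact Accumulator method signatures and udaf() registration arguments may differ from the code in this PR:

    import pyarrow as pa
    from datetime import datetime
    from datafusion import Accumulator, SessionContext, column, udaf

    class CollectTimestamps(Accumulator):
        """Collect every timestamp seen in a group into one list."""

        def __init__(self) -> None:
            self._values: list[datetime] = []

        def update(self, values: pa.Array) -> None:
            self._values.extend(values.to_pylist())

        def merge(self, states: list[pa.Array]) -> None:
            # One state field; each element is the list another partition collected.
            for partition_values in states[0].to_pylist():
                self._values.extend(partition_values)

        def state(self) -> list[pa.Scalar]:
            return [pa.scalar(self._values, type=pa.list_(pa.timestamp("us")))]

        def evaluate(self) -> pa.Scalar:
            return pa.scalar(self._values, type=pa.list_(pa.timestamp("us")))

    collect_ts = udaf(
        CollectTimestamps,
        pa.timestamp("us"),              # input type
        pa.list_(pa.timestamp("us")),    # return type
        [pa.list_(pa.timestamp("us"))],  # state type(s)
        "stable",
    )

    ctx = SessionContext()
    df = ctx.from_pydict(
        {
            "g": ["a", "a", "b"],
            "ts": [datetime(2026, 1, 1), datetime(2026, 1, 2), datetime(2026, 1, 3)],
        }
    )
    result = df.aggregate([column("g")], [collect_ts(column("ts"))]).collect()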

Are there any user-facing changes?

Yes.

  • Python UDAF authors can now return list-valued results (e.g. list[timestamp]) in a supported and documented way.
  • Documentation now clearly explains the correct pattern and avoids common pitfalls.
  • This is a backward-compatible enhancement; existing UDAFs are unaffected.

LLM-generated code disclosure

This PR includes code and comments generated with assistance from an LLM. All LLM-generated content has been manually reviewed and tested.

Store UDAF return type in Rust accumulator and wrap pyarrow Array/ChunkedArray returns into list scalars for list-like return types. Add a UDAF test to return a list of timestamps via a pyarrow array, validating the aggregate output for correctness.

Add documented list-valued scalar returns for UDAF accumulators, including an example with pa.scalar and a note about unsupported pyarrow.Array returns from evaluate(). Also, introduce a UDAF FAQ entry detailing list-returning patterns and required return_type/state_type definitions.

…nbinding and binding fresh copies when checking array-likeness, eliminating the Bound reference error
@kosiew kosiew marked this pull request as ready for review January 27, 2026 03:08
@timsaucer (Member) commented Feb 4, 2026

Sorry it's taken me a while to get around to this PR. It feels like we are doing two different things:

  1. telling users that they need to return pyarrow scalars from evaluate
  2. detecting when the value is a list and then converting it to a Python value and back into a pyarrow scalar

It feels like this isn't the best option. I think we want to avoid doing any kind of to_pylist() calls.

I think a more general solution would be something like

  1. Determine if they have passed in a pyarrow scalar value. If so, use it.
  2. If they have not passed in a pyarrow scalar value, use py_obj_to_scalar_value to convert to a scalar value
  3. Update py_obj_to_scalar_value to detect pyarrow arrays and convert them to ScalarValue::List

For the last part we could do something like

    if obj.hasattr("__arrow_c_array__")? {
        let array_data = ArrayData::from_pyarrow_bound(&obj)?;

        let array = make_array(array_data);

        // ScalarValue::ListArray must be a list of length 1
        let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0, array.len() as i32]));
        let list_array = Arc::new(ListArray::new(
            Arc::new(Field::new_list_field(array.data_type().clone(), true)),
            offsets,
            array,
            None,
        ));

        return Ok(ScalarValue::List(list_array));
    }

Additionally, if we're going to go down this route I think we would want to handle both state() and evaluate() this way, since both of them should be returning scalars.

An advantage of the approach described above is that I think it adds more flexibility for users, because their Python functions can just return Python integers and such without having to convert them to pyarrow scalars.
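
As a rough illustration of that flexibility (hypothetical classes and attributes, not code from this PR):

    import pyarrow as pa

    class CountingAcc:
        def evaluate(self):
            return self._count            # plain Python int, via py_obj_to_scalar_value

    class CollectingAcc:
        def evaluate(self):
            return pa.array(self._items)  # pyarrow array, converted to ScalarValue::List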

What do you think?

@timsaucer (Member)

One problem I see with my answer above ^ is that some libraries like nanoarrow DO implement __arrow_c_array__ on a scalar value, and we wouldn't want to accidentally turn that into a ScalarValue::List.

…ling and conversion from Python objects to Arrow types
@kosiew (Contributor, Author) commented Feb 5, 2026

@timsaucer,

Thanks for your suggestions.

I agree on both points and have refactored the implementation to align more closely with your approach, while also taking care to address the nanoarrow concern in a clear and safe way.

Implementation Details

  1. Direct scalar handling
    When the user returns a PyArrow scalar, we simply use it as-is via the existing PyScalarValue extraction—no extra work required.

  2. Fallback to py_obj_to_scalar_value
    For anything that isn’t already a scalar (native Python values, arrays, etc.), we route through py_obj_to_scalar_value, which now cleanly handles the conversion.

  3. Extended py_obj_to_scalar_value
    This function now:

    • Checks whether the object is already a pyarrow.Scalar using an explicit isinstance() check and extracts it directly.
    • Detects pyarrow.Array or pyarrow.ChunkedArray (also via isinstance()) and converts them into ScalarValue::List using the Arrow C data interface—no to_pylist() calls involved.
    • Falls back to the original behavior for native Python values (ints, floats, strings, etc.), converting them via pyarrow.scalar().
  4. Applied consistently to state() and evaluate()
    Both methods now share this unified conversion path, ensuring consistent and predictable behavior.

Why isinstance() instead of __arrow_c_array__?

I intentionally avoided checking the __arrow_c_array__ protocol and opted for explicit isinstance() checks against pyarrow.Scalar, pyarrow.Array, and pyarrow.ChunkedArray (a conceptual sketch follows the list below). This keeps things clear and robust:

  • Scalar objects from libraries like nanoarrow that implement __arrow_c_array__ are still correctly treated as scalars, rather than being misclassified as lists.
  • Only true PyArrow array types are converted into ScalarValue::List.
  • The type checks remain explicit, readable, and safe.
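
A conceptual Python-level sketch of that ordering (illustrative only; the actual dispatch and the ScalarValue::List conversion live in Rust and use the Arrow C data interface):

    import pyarrow as pa

    def classify_return_value(obj) -> str:
        if isinstance(obj, pa.Scalar):
            return "pyarrow scalar: extracted directly"
        if isinstance(obj, (pa.Array, pa.ChunkedArray)):
            return "pyarrow array / chunked array: wrapped into ScalarValue::List"
        # nanoarrow-style objects expose __arrow_c_array__ but are not pyarrow
        # types, so they fall through to the original fallback conversion.
        return "anything else: converted via pyarrow.scalar()"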

Benefits

  • No performance penalty: Avoids to_pylist() entirely by relying on the Arrow C data interface.
  • Flexible: Users can return native Python values, PyArrow scalars, or PyArrow arrays—everything is handled gracefully.
  • Consistent: Both state() and evaluate() now follow the same conversion logic.
  • Safe: Clear type discrimination prevents nanoarrow scalars from being misclassified.


Development

Successfully merging this pull request may close these issues.

Cannot do udaf that returns list of timestamps
