Add `StructArray` and `RunArray` benchmark tests to `with_hashes` by notashes · Pull Request #20182 · apache/datafusion

notashes · 2026-02-06T08:55:20Z

Which issue does this PR close?

Closes #20181 (Add StructArray and RunArray benchmarks to with_hashes suite in datafusion-common)

Rationale for this change

Issue #20152 identifies potential optimizations for RunArray and StructArray hashing, but the existing with_hashes benchmarks do not cover these types.

What changes are included in this PR?

Added benchmarks to with_hashes.rs:

StructArray: 4-column struct (bool, int32, int64, string)
RunArray: Int32 run-encoded array
Both include single/multiple columns and with/without nulls
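
For readers skimming the sample run below, the benchmark matrix is 2 x 2 per array kind. A minimal sketch of how those labels compose, assuming a hypothetical helper named `bench_labels` (not a function in with_hashes.rs; the label format is copied from the sample output):

```rust
/// Build the four benchmark labels for one array kind:
/// {single, multiple} columns x {no nulls, nulls}.
/// `bench_labels` is a hypothetical helper for illustration only.
fn bench_labels(kind: &str) -> Vec<String> {
    let mut labels = Vec::with_capacity(4);
    for columns in ["single", "multiple"] {
        for nulls in ["no nulls", "nulls"] {
            // Matches the "<kind>: <columns>, <nulls>" shape seen in the run output.
            labels.push(format!("{}: {}, {}", kind, columns, nulls));
        }
    }
    labels
}
```

Calling `bench_labels("struct_array")` and `bench_labels("run_array_int32")` yields the eight labels that appear in the sample run.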

Are these changes tested?

No additional tests were added, but both benchmarks compile and run.

A sample run:

❯ cargo bench --features=parquet --bench with_hashes -- array
   Compiling datafusion-common v52.1.0 (/Users/notashes/dev/datafusion/datafusion/common)
    Finished `bench` profile [optimized] target(s) in 34.49s
     Running benches/with_hashes.rs (target/release/deps/with_hashes-2f180744d22084f3)
Gnuplot not found, using plotters backend
struct_array: single, no nulls
                        time:   [38.389 µs 38.437 µs 38.485 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

struct_array: single, nulls
                        time:   [46.108 µs 46.197 µs 46.291 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

struct_array: multiple, no nulls
                        time:   [114.64 µs 114.79 µs 114.93 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild

struct_array: multiple, nulls
                        time:   [138.29 µs 138.62 µs 139.07 µs]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

run_array_int32: single, no nulls
                        time:   [1.8777 µs 1.9098 µs 1.9457 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

run_array_int32: single, nulls
                        time:   [2.0110 µs 2.0417 µs 2.0751 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

run_array_int32: multiple, no nulls
                        time:   [5.0511 µs 5.0603 µs 5.0693 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild

run_array_int32: multiple, nulls
                        time:   [5.6052 µs 5.6201 µs 5.6353 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Are there any user-facing changes?

Jefffrey · 2026-02-07T16:09:54Z

datafusion/common/benches/with_hashes.rs

        // with_hash has different code paths for single vs multiple arrays and nulls vs no nulls
-        let nullable_array = add_nulls(&array);
+        // RunArray encodes nulls in the values array, not at the array level
+        let nullable_array = if name.starts_with("run_array") {


Maybe we should take the approach from #20179 and make this an explicit property instead of checking by name
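
A rough sketch of what an explicit property could look like, as opposed to the `name.starts_with("run_array")` check in the diff. This is not taken from #20179; `NullPlacement`, `BenchInput`, and both functions are hypothetical names invented for illustration:

```rust
/// Where a benchmark input keeps its nulls (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NullPlacement {
    /// In the array's own validity bitmap.
    ArrayLevel,
    /// In a child values array, as the diff comment describes for RunArray.
    ValuesLevel,
}

/// One benchmark input kind (hypothetical struct for this sketch).
struct BenchInput {
    name: &'static str,
    null_placement: NullPlacement,
}

/// The name-based check from the diff, kept only for comparison.
fn placement_by_name(name: &str) -> NullPlacement {
    if name.starts_with("run_array") {
        NullPlacement::ValuesLevel
    } else {
        NullPlacement::ArrayLevel
    }
}

/// The property-based alternative: read the flag and ignore the name.
fn placement_by_property(input: &BenchInput) -> NullPlacement {
    input.null_placement
}
```

The practical difference shows up when a benchmark is renamed or a new run-encoded variant is added under a different prefix: the name check silently falls back to array-level nulls, while the property travels with the input.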

Jefffrey · 2026-02-07T16:11:37Z

datafusion/common/benches/with_hashes.rs

+            .collect::<arrow::array::BooleanArray>(),
+    );
+
+    let int32_array: ArrayRef = Arc::new(


Maybe we could reuse the existing functions above to create these random arrays? e.g. primitive_array()

The only difference I see is that it uses its own rng; is this a big concern, considering each of these functions currently uses its own rng anyway?

Jefffrey · 2026-02-07T16:13:51Z

datafusion/common/benches/with_hashes.rs

+}
+
+/// Create a RunArray with null values
+fn create_run_array_with_null_values<T>(array_len: usize) -> ArrayRef


I feel we can reduce the duplication with the function above if the only difference is nulls 🤔

It would just be a matter of calling add_nulls() on the values array
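
To make the suggestion concrete, here is a toy model of the idea using only the standard library. It does not use the arrow `RunArray` API, and `ToyRunArray`, `toy_add_nulls`, and `create_run_array` are hypothetical stand-ins; in particular, the real `add_nulls()` in the benchmark may choose null positions differently:

```rust
/// Toy stand-in for a run-end encoded Int32 array: `run_ends[i]` is the
/// exclusive logical end of run `i`, and `values[i]` is that run's value.
#[derive(Debug, PartialEq)]
struct ToyRunArray {
    run_ends: Vec<i32>,
    values: Vec<Option<i32>>,
}

/// Toy stand-in for the benchmark's `add_nulls`: null out every other value.
fn toy_add_nulls(values: &[Option<i32>]) -> Vec<Option<i32>> {
    values
        .iter()
        .enumerate()
        .map(|(i, v)| if i % 2 == 0 { None } else { *v })
        .collect()
}

/// Single constructor; `with_nulls` only changes the values array,
/// so the run ends are built once and shared by both variants.
fn create_run_array(array_len: usize, run_len: usize, with_nulls: bool) -> ToyRunArray {
    assert!(run_len > 0, "run_len must be positive");
    // Number of runs, rounding up so the last run may be shorter.
    let num_runs = (array_len + run_len - 1) / run_len;
    let run_ends: Vec<i32> = (1..=num_runs)
        .map(|i| (i * run_len).min(array_len) as i32)
        .collect();
    let values: Vec<Option<i32>> = (0..num_runs).map(|i| Some(i as i32)).collect();
    let values = if with_nulls { toy_add_nulls(&values) } else { values };
    ToyRunArray { run_ends, values }
}
```

Under this shape the separate `create_run_array_with_null_values` function collapses into a flag (or a post-processing step on the values), and the null handling lives in one place.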

bench: adds benchmark tests for StructArray and RunArray

2adccac

github-actions bot added the common Related to common crate label Feb 6, 2026

Merge branch 'main' into with_hashes

6dac5a6

notashes mentioned this pull request Feb 6, 2026

perf: various optimizations to eliminate branch misprediction in hash_utils #20168

Open

Jefffrey reviewed Feb 7, 2026

Merge branch 'main' into with_hashes

caa31e6

notashes wants to merge 3 commits into apache:main from notashes:with_hashes

2 participants