Add StructArray and RunArray benchmark tests to with_hashes #20182

Open
notashes wants to merge 3 commits into apache:main from notashes:with_hashes
notashes commented Feb 6, 2026

Which issue does this PR close?

Rationale for this change

Issue #20152 identifies potential optimizations for RunArray and StructArray hashing, but the existing with_hashes benchmarks don't cover these types.

What changes are included in this PR?

Added benchmarks to with_hashes.rs:

  • StructArray: 4-column struct (bool, int32, int64, string)
  • RunArray: Int32 run-encoded array
  • Both include single/multiple columns and with/without nulls
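The four variants per array type can be sketched as a small matrix of benchmark labels. This is a minimal illustration that mirrors the labels printed in the sample run; `bench_ids` is a hypothetical helper, not a function from with_hashes.rs.

```rust
/// Build the four benchmark ids for one array type, mirroring the labels
/// shown in the sample run ("<name>: single, no nulls", ...).
/// `bench_ids` is a hypothetical helper, not part of with_hashes.rs.
fn bench_ids(name: &str) -> Vec<String> {
    let mut ids = Vec::with_capacity(4);
    // Outer loop: one array vs. several arrays hashed together.
    for arity in ["single", "multiple"] {
        // Inner loop: input without nulls vs. input with nulls.
        for nulls in ["no nulls", "nulls"] {
            ids.push(format!("{}: {}, {}", name, arity, nulls));
        }
    }
    ids
}
```

Calling `bench_ids("struct_array")` yields the four labels in the same order as the sample output; the same call with `"run_array_int32"` covers the RunArray cases.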

Are these changes tested?

No additional tests were added, but both benchmarks compile and run.

A sample run:
❯ cargo bench --features=parquet --bench with_hashes -- array
   Compiling datafusion-common v52.1.0 (/Users/notashes/dev/datafusion/datafusion/common)
    Finished `bench` profile [optimized] target(s) in 34.49s
     Running benches/with_hashes.rs (target/release/deps/with_hashes-2f180744d22084f3)
Gnuplot not found, using plotters backend
struct_array: single, no nulls
                        time:   [38.389 µs 38.437 µs 38.485 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

struct_array: single, nulls
                        time:   [46.108 µs 46.197 µs 46.291 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

struct_array: multiple, no nulls
                        time:   [114.64 µs 114.79 µs 114.93 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild

struct_array: multiple, nulls
                        time:   [138.29 µs 138.62 µs 139.07 µs]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

run_array_int32: single, no nulls
                        time:   [1.8777 µs 1.9098 µs 1.9457 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

run_array_int32: single, nulls
                        time:   [2.0110 µs 2.0417 µs 2.0751 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

run_array_int32: multiple, no nulls
                        time:   [5.0511 µs 5.0603 µs 5.0693 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild

run_array_int32: multiple, nulls
                        time:   [5.6052 µs 5.6201 µs 5.6353 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
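To read the numbers above, it can help to express the cost of nulls as a relative overhead. The sketch below does that arithmetic on the median timings from this one sample run; the constants are copied from the output, and `overhead_percent` is a hypothetical helper. A single run on one machine is only indicative.

```rust
// Median timings in microseconds, copied from the sample run above.
const STRUCT_SINGLE_NO_NULLS: f64 = 38.437;
const STRUCT_SINGLE_NULLS: f64 = 46.197;
const RUN_SINGLE_NO_NULLS: f64 = 1.9098;
const RUN_SINGLE_NULLS: f64 = 2.0417;

/// Relative slowdown of `with` over `without`, in percent.
/// Hypothetical helper for illustration only.
fn overhead_percent(without: f64, with: f64) -> f64 {
    (with / without - 1.0) * 100.0
}
```

With these inputs, the single-array case with nulls is roughly 20% slower for struct_array and roughly 7% slower for run_array_int32.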

Are there any user-facing changes?

github-actions bot added the common (Related to common crate) label Feb 6, 2026
// with_hash has different code paths for single vs multiple arrays and nulls vs no nulls
let nullable_array = add_nulls(&array);
// RunArray encodes nulls in the values array, not at the array level
let nullable_array = if name.starts_with("run_array") {

Maybe we should take the approach from #20179 and make this an explicit property instead of checking by name.
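One way to read this suggestion, as a sketch only: carry the null placement as data on each benchmark case and branch on that, so renaming a benchmark cannot silently change its behavior. The names `NullPlacement`, `BenchCase`, and `null_strategy` are hypothetical, and this does not reproduce whatever #20179 actually does.

```rust
/// Where a benchmark input keeps its nulls. Hypothetical enum for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NullPlacement {
    /// Nulls live at the array level (e.g. StructArray).
    ArrayLevel,
    /// Nulls live in the child values array (e.g. RunArray).
    ValuesArray,
}

/// One benchmark input: a label plus an explicit null-placement property.
struct BenchCase {
    name: &'static str,
    null_placement: NullPlacement,
}

/// Pick the null-injection strategy from the property, not from the name.
/// The returned strings are placeholders describing the strategy.
fn null_strategy(case: &BenchCase) -> &'static str {
    match case.null_placement {
        NullPlacement::ArrayLevel => "add_nulls(array)",
        NullPlacement::ValuesArray => "rebuild run array with nullable values",
    }
}
```

A case named anything at all still gets the right treatment as long as its `null_placement` is set correctly, which is the point of replacing the `name.starts_with("run_array")` check.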

.collect::<arrow::array::BooleanArray>(),
);

let int32_array: ArrayRef = Arc::new(

Maybe we could reuse the existing functions above to create these random arrays? e.g. primitive_array()

The only difference I see is that it uses its own rng; is that a big concern, considering each of these functions currently uses its own rng anyway?

}

/// Create a RunArray with null values
fn create_run_array_with_null_values<T>(array_len: usize) -> ArrayRef

I feel we can reduce the duplication with the function above if the only difference is nulls 🤔

It would just be a matter of calling add_nulls() on the values array
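A rough sketch of that idea, using a std-only stand-in rather than the real arrow types: one constructor takes a `with_nulls` flag and applies a null-injecting helper to the values only, leaving the run ends untouched. `RunModel`, `create_run_model`, the fixed run length, and the every-other-value null pattern are all assumptions for illustration; the benchmark's actual `add_nulls` may behave differently.

```rust
/// Simplified stand-in for a run-end encoded Int32 array:
/// `run_ends[i]` is the exclusive logical end index of run `i`,
/// `values[i]` is the value of that run (None = null).
struct RunModel {
    run_ends: Vec<i32>,
    values: Vec<Option<i32>>,
}

/// Stand-in for the benchmark's add_nulls(): nulls out every other value.
/// The pattern is an assumption, chosen only to keep the sketch deterministic.
fn add_nulls(values: &mut [Option<i32>]) {
    for v in values.iter_mut().step_by(2) {
        *v = None;
    }
}

/// Single constructor; the only difference between variants is `with_nulls`.
fn create_run_model(array_len: usize, run_len: usize, with_nulls: bool) -> RunModel {
    assert!(run_len > 0, "run_len must be positive");
    // Number of runs needed to cover `array_len` logical rows.
    let num_runs = (array_len + run_len - 1) / run_len;
    // The last run is clamped so it ends exactly at `array_len`.
    let run_ends: Vec<i32> = (1..=num_runs)
        .map(|i| (i * run_len).min(array_len) as i32)
        .collect();
    let mut values: Vec<Option<i32>> = (0..num_runs).map(|i| Some(i as i32)).collect();
    if with_nulls {
        // Nulls go into the values, not into an array-level validity mask.
        add_nulls(&mut values);
    }
    RunModel { run_ends, values }
}
```

Both variants share the same run ends, so the nullable and non-nullable benchmarks differ only in the values, which keeps the comparison between them fair.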

Development

Successfully merging this pull request may close these issues.

Add StructArray and RunArray benchmarks to with_hashes suite in datafusion-common