Benchmarking Column Statistics for Analytical Query Pruning

This is the source code used in the paper "Benchmarking Column Statistics for Analytical Query Pruning", submitted to the DBTest 2026 workshop.

Structure

This repository contains two folders, duckdb/ and experiments/.

The duckdb/ folder contains the full source code of DuckDB, along with the changes made to support richer column statistics alongside min/max indexes.

The experiments/ folder contains the bash scripts and SQL files used to run the experiments that evaluate the pruning and query-time performance of DuckDB with the additional statistics.

Additions made to DuckDB

For a full list of changes, see the section "Full list of changed files" at the bottom of the page.

We created the class AdditionalStats; each type of additional statistics (e.g. cluster, Bloom filter, dictionary) is a subclass of it.

AdditionalStats objects are attached to partitions through BaseStats objects, which contain the default min/max index.

Every AdditionalStats subclass must provide an initialisation method that takes a vector of data and constructs the statistics from it. It must also provide a query method that takes a value and a comparison type and answers whether the corresponding partition can be pruned. A few other methods are required as well, including a range query method and a method that reports the size of the statistics object.
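The interface described above can be sketched as follows. This is a minimal illustration written in Python for brevity; the actual classes are C++ and live in the DuckDB source. All names here (AdditionalStatsSketch, DictionaryStatsSketch, Comparison, can_prune, can_prune_range, size_in_bytes) are hypothetical and are not taken from the repository.

```python
from abc import ABC, abstractmethod
from bisect import bisect_left
from enum import Enum


class Comparison(Enum):
    # Hypothetical comparison kinds; the real code uses DuckDB's own types.
    EQUAL = "="
    LESS_THAN = "<"
    GREATER_THAN = ">"


class AdditionalStatsSketch(ABC):
    """Python stand-in for the interface described in the text."""

    @abstractmethod
    def __init__(self, values):
        """Build the statistics from the values of one partition."""

    @abstractmethod
    def can_prune(self, value, comparison):
        """True if no row can satisfy `column <comparison> value`."""

    @abstractmethod
    def can_prune_range(self, low, high):
        """True if no row can satisfy `low <= column <= high`."""

    @abstractmethod
    def size_in_bytes(self):
        """Size of the statistics object."""


class DictionaryStatsSketch(AdditionalStatsSketch):
    """Toy dictionary statistics: the sorted distinct values of a partition."""

    def __init__(self, values):
        self.distinct = sorted(set(values))

    def can_prune(self, value, comparison):
        d = self.distinct
        if not d:
            return True  # an empty partition never matches
        if comparison is Comparison.EQUAL:
            i = bisect_left(d, value)
            return not (i < len(d) and d[i] == value)
        if comparison is Comparison.LESS_THAN:
            return d[0] >= value
        if comparison is Comparison.GREATER_THAN:
            return d[-1] <= value
        return False  # unknown comparison: never prune

    def can_prune_range(self, low, high):
        i = bisect_left(self.distinct, low)
        return i == len(self.distinct) or self.distinct[i] > high

    def size_in_bytes(self):
        # Assumes 8 bytes per entry (unsigned 64-bit values).
        return 8 * len(self.distinct)
```

For a partition holding the values 5, 1, 9, 5, this sketch prunes the predicate "= 7" (7 is not in the dictionary) even though 7 lies between the minimum and the maximum, which is the case min/max alone cannot handle.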

AdditionalStats is defined in additional_stats.hpp, with the concrete implementations provided as header-only files in the same folder.

To use the new statistics, we added logic to constant_filter.cpp that calls the query method of the AdditionalStats object when min/max alone is not enough to determine that the partition can be pruned.
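The two-step check described above, for an equality filter, might look roughly like the sketch below. It is written in Python for illustration only and does not reproduce the code in constant_filter.cpp; the function name partition_can_be_pruned and its parameters are hypothetical.

```python
def partition_can_be_pruned(min_value, max_value, constant, additional_query=None):
    """Decide whether a partition can be skipped for `column = constant`.

    additional_query is a hypothetical callable standing in for the query
    method of the additional statistics: it returns True when the statistics
    prove that the constant does not occur in the partition.
    """
    # Step 1: the default min/max index.
    if constant < min_value or constant > max_value:
        return True
    # Step 2: min/max was inconclusive, so consult the additional statistics.
    if additional_query is not None:
        return additional_query(constant)
    # No further information: the partition has to be scanned.
    return False


# A partition holding 10, 20 and 30. With min/max alone, the constant 15
# cannot be ruled out; a dictionary of the distinct values can rule it out.
values = {10, 20, 30}
not_in_dictionary = lambda constant: constant not in values
```

The additional statistics are consulted only after min/max fails to prune, so a partition that min/max already excludes is never passed to the more expensive check.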

The remaining changes mainly facilitate the creation and use of these AdditionalStats objects.

Compiling Source-code

The root of the repository contains a bash script named compile_all.sh. Running it creates 10 separate DuckDB binaries, each with a different configuration of additional statistics. To compile a single binary, navigate to the duckdb folder and run:

make STATS=<AdditionalStats> ADDITIONAL_STATS_SCALE_LEVEL=<Level>

where <AdditionalStats> is one of EmptyAdditionalStats (regular min/max only), ClusterAdditionalStats (min/max clusters), BloomAdditionalStats (Bloom filters), or DictionaryAdditionalStats (dictionaries).

<Level> should be one of 0, 1, or 2: 0 is the smallest metadata size (10x the size of min/max), 1 is the middle configuration at 100x, and 2 is the largest configuration, with 1000x the metadata size used by min/max indexes.
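As a rough illustration of how the parameters combine, the sketch below enumerates make invocations in Python. It rests on an assumption not stated in this README: that the 10 binaries are one min/max-only baseline plus the three statistics types at each of the three levels. compile_all.sh may be organised differently, and the helper names are hypothetical.

```python
# Statistics types that take a scale level (names as given in the text).
SCALED_STATS = [
    "ClusterAdditionalStats",
    "BloomAdditionalStats",
    "DictionaryAdditionalStats",
]

# Level -> metadata size relative to plain min/max, as described in the text.
LEVEL_TO_SCALE = {0: 10, 1: 100, 2: 1000}


def make_command(stats, level):
    """Build the make invocation for one configuration."""
    return f"make STATS={stats} ADDITIONAL_STATS_SCALE_LEVEL={level}"


def all_configurations():
    # Assumption: the baseline is built once and its level is irrelevant.
    configs = [("EmptyAdditionalStats", 0)]
    for stats in SCALED_STATS:
        for level in sorted(LEVEL_TO_SCALE):
            configs.append((stats, level))
    return configs


if __name__ == "__main__":
    for stats, level in all_configurations():
        print(make_command(stats, level))
```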

Generating Datasets

To generate the datasets, run the bash script generate_datasets.sh. The script first generates an unsorted dataset with 1,000,000,000 rows and two columns, each of type unsigned 64-bit integer. From the unsorted dataset it generates the sorted dataset, which is sorted by the first column. It then uses the sorted dataset to generate the outlier dataset and, finally, four datasets in which the second column has varying cardinality (100, 1,000, 10,000, and 100,000).
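The order of the steps can be sketched at a much smaller scale as follows. This is not the code of the generator scripts in the repository: the uniform random distribution, the seeds, and the function names are assumptions made for illustration, and the outlier step is left out because this README does not describe how outliers are produced.

```python
import random

U64_MAX = 2**64 - 1


def generate_unsorted(num_rows, seed=0):
    """Two columns of unsigned 64-bit integers (uniform draws are an assumption)."""
    rng = random.Random(seed)
    return [(rng.randint(0, U64_MAX), rng.randint(0, U64_MAX)) for _ in range(num_rows)]


def generate_sorted(rows):
    """Sort the unsorted dataset by its first column."""
    return sorted(rows, key=lambda row: row[0])


def generate_cardinality(sorted_rows, cardinality, seed=0):
    """Keep the first column and redraw the second from `cardinality` distinct values."""
    rng = random.Random(seed)
    return [(first, rng.randrange(cardinality)) for first, _ in sorted_rows]


if __name__ == "__main__":
    unsorted_rows = generate_unsorted(1_000)  # the real dataset has 1,000,000,000 rows
    sorted_rows = generate_sorted(unsorted_rows)
    for cardinality in (100, 1_000, 10_000, 100_000):
        dataset = generate_cardinality(sorted_rows, cardinality)
        print(cardinality, len({second for _, second in dataset}))
```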

Running Experiments

To run the experiments, use the bash scripts in the experiments/ folder. Scripts with the _performance suffix measure query running time, whereas those without it measure pruning efficiency.

To see which SQL queries are used, see the folder experiments/sql/.

Full list of changed files

Changes have been made to the following files:

The following files have been added:

