ITU-DASYALab/pruning-stats-study

Benchmarking Column Statistics for Analytical Query Pruning

This is the source code used in the paper "Benchmarking Column Statistics for Analytical Query Pruning", submitted to the DBTest 2026 workshop.

Structure

This repository contains two folders, duckdb/ and experiments/.

The duckdb/ folder contains the full source code of DuckDB, along with the changes made to support richer column statistics alongside min/max indexes.

The experiments/ folder contains the bash scripts and SQL files used to run the experiments that evaluate the pruning and query-time performance of DuckDB with the additional statistics.

Additions made to DuckDB

For a full list of changes, see the bottom of the page.

We created the class AdditionalStats; the different types of additional statistics (e.g. cluster, Bloom filter, dictionary) are subclasses of this class.

AdditionalStats objects are attached to partitions through BaseStats objects, which contain the default min/max index.

An AdditionalStats object must have an initialisation method that takes a vector of data and uses it to construct the statistics. It must also have a query method that takes a value and a comparison type and answers whether the corresponding partition can be pruned. A few other methods must also be present, including a range query method and one for measuring the size of the statistics object.
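The interface described above could be sketched roughly as follows. This is an illustration only: the class names, method names, signatures, and the Comparison enum are all assumptions and are not taken from additional_stats.hpp. The toy dictionary-style subclass simply stores the distinct values of a partition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical comparison kinds; the enum used in the repository may differ.
enum class Comparison { Equal, LessThan, GreaterThan };

// Sketch of the described interface. All names and signatures are assumptions.
class AdditionalStatsSketch {
public:
    virtual ~AdditionalStatsSketch() = default;
    // Build the statistics from the values stored in one partition.
    virtual void Init(const std::vector<uint64_t> &values) = 0;
    // True if no row in the partition can satisfy `column <cmp> value`.
    virtual bool CanPrune(uint64_t value, Comparison cmp) const = 0;
    // True if no row in the partition lies in the closed range [low, high].
    virtual bool CanPruneRange(uint64_t low, uint64_t high) const = 0;
    // Approximate size of the stored statistics in bytes.
    virtual std::size_t SizeInBytes() const = 0;
};

// Toy dictionary-style statistics: keeps the set of distinct values.
class DictionaryStatsSketch : public AdditionalStatsSketch {
public:
    void Init(const std::vector<uint64_t> &values) override {
        distinct_.clear();
        distinct_.insert(values.begin(), values.end());
    }

    bool CanPrune(uint64_t value, Comparison cmp) const override {
        switch (cmp) {
        case Comparison::Equal:
            return distinct_.count(value) == 0;
        case Comparison::LessThan:
            for (uint64_t v : distinct_) {
                if (v < value) return false; // some row matches
            }
            return true;
        case Comparison::GreaterThan:
            for (uint64_t v : distinct_) {
                if (v > value) return false; // some row matches
            }
            return true;
        }
        return false; // unknown comparison: never prune
    }

    bool CanPruneRange(uint64_t low, uint64_t high) const override {
        for (uint64_t v : distinct_) {
            if (v >= low && v <= high) return false;
        }
        return true;
    }

    std::size_t SizeInBytes() const override {
        // Counts only the payload, not the hash-table overhead.
        return distinct_.size() * sizeof(uint64_t);
    }

private:
    std::unordered_set<uint64_t> distinct_;
};
```

With a partition holding the values 10, 20, 20, 30, this sketch would prune an equality filter on 15 but not one on 20, even though both values lie inside the min/max range.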

AdditionalStats is defined in additional_stats.hpp, with header-only concrete implementations in the same folder.

To use these new statistics, we also added logic to constant_filter.cpp that calls the query method of the AdditionalStats object when min/max alone is not enough to determine that the partition can be pruned.
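The two-step check can be illustrated with a minimal sketch for an equality filter. It is not the code in constant_filter.cpp: PruneResult, MinMax, and CheckEqualityFilter are hypothetical names, and the callback merely stands in for the query method of an AdditionalStats object.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Outcome of checking a filter against one partition (hypothetical names).
enum class PruneResult { Prune, Scan };

// Default min/max index of a partition.
struct MinMax {
    uint64_t min;
    uint64_t max;
};

// Check the filter `column = value` against one partition.
// `extra_can_prune` stands in for the query method of the additional
// statistics; pass an empty std::function when none are attached.
PruneResult CheckEqualityFilter(const MinMax &mm, uint64_t value,
                                const std::function<bool(uint64_t)> &extra_can_prune) {
    // Step 1: min/max alone prunes when the value is outside [min, max].
    if (value < mm.min || value > mm.max) {
        return PruneResult::Prune;
    }
    // Step 2: min/max is inconclusive, so consult the additional statistics.
    if (extra_can_prune && extra_can_prune(value)) {
        return PruneResult::Prune;
    }
    // Neither source of statistics rules the partition out.
    return PruneResult::Scan;
}
```

The additional statistics are consulted only after min/max fails to prune, so a filter that min/max already rules out never pays for the extra lookup.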

The remaining changes mainly facilitate the creation and use of these AdditionalStats objects.

Compiling Source-code

The root of the repository contains a bash script named compile_all.sh. Running it creates 10 separate DuckDB binaries, each with a different configuration of additional statistics. To compile a single binary, navigate to the duckdb folder and run:

make STATS=<AdditionalStats> ADDITIONAL_STATS_SCALE_LEVEL=<Level>

where <AdditionalStats> is one of EmptyAdditionalStats, ClusterAdditionalStats, BloomAdditionalStats, or DictionaryAdditionalStats, for regular min/max, min/max clusters, Bloom filters, or dictionaries respectively.

<Level> should be one of 0, 1, or 2, where 0 corresponds to the smallest metadata size (10x the size of min/max), 1 is the middle configuration at 100x the size, and 2 is the largest configuration, with 1000x the metadata size used by min/max indexes.

Generating Datasets

To generate datasets, use the bash script generate_datasets.sh. This script first generates an unsorted dataset with 1,000,000,000 rows and 2 columns, both of type unsigned 64-bit integer. It then uses the unsorted dataset to generate the sorted dataset, which is sorted by the first column. After that, it uses the sorted dataset to generate the outlier dataset. Finally, it uses the sorted dataset to generate 4 datasets with varying cardinalities of the second column (100, 1,000, 10,000, and 100,000).
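The derivation chain (unsorted, then sorted, then reduced cardinality) can be illustrated at small scale with the sketch below. It does not reproduce generate_datasets.sh: the uniform random distribution, the seed, and the use of a modulo to limit cardinality are assumptions, and the outlier dataset is left out because its construction is not described here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One row with two unsigned 64-bit columns.
struct Row {
    uint64_t a;
    uint64_t b;
};

// Unsorted dataset: both columns drawn uniformly at random
// (the distribution is an assumption).
std::vector<Row> MakeUnsorted(std::size_t rows, uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::vector<Row> data(rows);
    for (Row &r : data) {
        r.a = rng();
        r.b = rng();
    }
    return data;
}

// Sorted dataset: the same rows ordered by the first column.
std::vector<Row> MakeSorted(std::vector<Row> data) {
    std::sort(data.begin(), data.end(),
              [](const Row &x, const Row &y) { return x.a < y.a; });
    return data;
}

// Reduced-cardinality dataset: the second column is mapped into
// [0, cardinality), so it holds at most `cardinality` distinct values.
// Using a modulo for this is an assumption.
std::vector<Row> WithCardinality(std::vector<Row> data, uint64_t cardinality) {
    for (Row &r : data) {
        r.b %= cardinality;
    }
    return data;
}
```

At the full scale described above, two 8-byte columns over 1,000,000,000 rows amount to 16 GB of raw values per dataset, so a sketch like this is only practical with a much smaller row count.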

Running Experiments

To run experiments, use the bash scripts in the experiments/ folder. The scripts suffixed with _performance measure query running time, whereas the ones without this suffix measure pruning efficiency.

To see which SQL queries are used, navigate to the folder experiments/sql/.

Full list of changed files

Changes have been made to the following files:

The following files have been added:
