This project implements the benchmark comparison between DuckDB and Spark as described in the DataExpert.io newsletter article "DuckDB benchmarked against Spark - You Don't Always Need A Sledgehammer".
The benchmark compares the performance of DuckDB and Apache Spark on count-distinct operations across datasets of varying sizes (500 rows to 500 million rows). The results show that DuckDB consistently outperforms Spark on these single-machine workloads, by roughly 5x to 64x depending on dataset size.
bench/
├── main.py # Main script to run complete benchmark
├── data_generator.py # Generate test datasets using DuckDB
├── benchmark.py # Benchmark functions for DuckDB and Spark
├── visualize_results.py # Create performance charts and summaries
├── requirements.txt # Python dependencies
└── README.md # This file
- Python 3.8 or higher
- Java 8 or higher (required for Spark)
- At least 25GB of free disk space (for the largest dataset)
- Clone or download this project
- Install Python dependencies:
pip install -r requirements.txt
To run the entire benchmark process (data generation + benchmarking + visualization):

python main.py

Generate test data only:

python data_generator.py

Run benchmarks only:

python benchmark.py

Create visualizations only:

python visualize_results.py

To see the available options:

python main.py --help

Available options:

- --skip-data-generation: Skip generating test datasets
- --skip-benchmark: Skip running benchmarks
- --skip-visualization: Skip creating visualizations
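The skip flags above could be wired with argparse along these lines (a hypothetical sketch; main.py's actual argument handling may differ):

```python
import argparse

# Hypothetical sketch of main.py's command-line interface: three boolean
# flags that let you skip any stage of the pipeline.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="DuckDB vs Spark benchmark")
    parser.add_argument("--skip-data-generation", action="store_true",
                        help="Skip generating test datasets")
    parser.add_argument("--skip-benchmark", action="store_true",
                        help="Skip running benchmarks")
    parser.add_argument("--skip-visualization", action="store_true",
                        help="Skip creating visualizations")
    return parser.parse_args(argv)

# Example: re-run only visualization after a completed benchmark run.
args = parse_args(["--skip-data-generation", "--skip-benchmark"])
```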
The benchmark creates the following files:

Test Datasets (~/dummy_data/):

- ds_500_rows.parquet (~33 KB)
- ds_5000_rows.parquet (~289 KB)
- ds_50000_rows.parquet (~2.3 MB)
- ds_500000_rows.parquet (~23.2 MB)
- ds_5000000_rows.parquet (~232.2 MB)
- ds_50000000_rows.parquet (~2.32 GB)
- ds_500000000_rows.parquet (~23.22 GB)

Results:

- benchmark_results.csv: Detailed benchmark results
- benchmark_comparison.png: Performance comparison chart
Both DuckDB and Spark execute the same query:
SELECT rand_dt, COUNT(DISTINCT rand_str)
FROM dataset
GROUP BY rand_dt
ORDER BY COUNT(DISTINCT rand_str) DESC

Each dataset contains:

- row_id: Sequential row identifier
- txn_key: Random UUID (varchar)
- rand_dt: Random date between 2020-01-01 and 2025-01-01
- rand_val: Random float between 0-100
- rand_str: Random string (1-26 characters)
Based on the original benchmark, you should see DuckDB significantly outperforming Spark across all dataset sizes:
| Row Count | DuckDB (ms) | Spark (ms) | Speedup |
|---|---|---|---|
| 500 | ~3.5 | ~224 | ~64x |
| 5k | ~5.5 | ~212 | ~38x |
| 50k | ~6.0 | ~256 | ~43x |
| 500k | ~14 | ~286 | ~20x |
| 5M | ~55 | ~647 | ~12x |
| 50M | ~309 | ~1564 | ~5x |
| 500M | ~2623 | ~13055 | ~5x |
If you encounter memory issues with the largest dataset (500M rows), you can modify data_generator.py to generate smaller datasets or skip the largest one.
If Spark fails to start, ensure Java is properly installed and the JAVA_HOME environment variable is set.
Ensure you have write permissions to your home directory for the ~/dummy_data/ folder.
This project is for educational and benchmarking purposes. Please refer to the original article for proper attribution.