This project implements the benchmark comparison between DuckDB and Spark as described in the DataExpert.io newsletter article "DuckDB benchmarked against Spark - You Don't Always Need A Sledgehammer".
The benchmark compares the performance of DuckDB and Apache Spark on count-distinct operations across datasets of varying sizes (500 rows to 500 million rows). The results show that DuckDB consistently outperforms Spark on these single-machine workloads, by roughly 5x to 64x depending on dataset size.
bench/
├── main.py # Main script to run complete benchmark
├── data_generator.py # Generate test datasets using DuckDB
├── benchmark.py # Benchmark functions for DuckDB and Spark
├── visualize_results.py # Create performance charts and summaries
├── requirements.txt # Python dependencies
└── README.md # This file
- Python 3.8 or higher
- Java 8 or higher (required for Spark)
- At least 25GB of free disk space (for the largest dataset)
- Clone or download this project
- Install Python dependencies:
pip install -r requirements.txt
To run the entire benchmark process (data generation + benchmarking + visualization):

python main.py

Generate test data only:

python data_generator.py

Run benchmarks only:

python benchmark.py

Create visualizations only:

python visualize_results.py

To see the available options:

python main.py --help

Available options:

- --skip-data-generation: Skip generating test datasets
- --skip-benchmark: Skip running benchmarks
- --skip-visualization: Skip creating visualizations
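The skip flags above could be wired with argparse along these lines (a hypothetical sketch; main.py's actual argument handling may differ):

```python
import argparse

# Hypothetical sketch of main.py's command-line interface: three boolean
# flags that let you skip any stage of the pipeline.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="DuckDB vs Spark benchmark")
    parser.add_argument("--skip-data-generation", action="store_true",
                        help="Skip generating test datasets")
    parser.add_argument("--skip-benchmark", action="store_true",
                        help="Skip running benchmarks")
    parser.add_argument("--skip-visualization", action="store_true",
                        help="Skip creating visualizations")
    return parser.parse_args(argv)

# Example: re-run only visualization after a completed benchmark run.
args = parse_args(["--skip-data-generation", "--skip-benchmark"])
```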
The benchmark creates the following files:

Test Datasets (~/dummy_data/):

- ds_500_rows.parquet (~33 KB)
- ds_5000_rows.parquet (~289 KB)
- ds_50000_rows.parquet (~2.3 MB)
- ds_500000_rows.parquet (~23.2 MB)
- ds_5000000_rows.parquet (~232.2 MB)
- ds_50000000_rows.parquet (~2.32 GB)
- ds_500000000_rows.parquet (~23.22 GB)

Results:

- benchmark_results.csv: Detailed benchmark results
- benchmark_comparison.png: Performance comparison chart
Both DuckDB and Spark execute the same query:
SELECT rand_dt, COUNT(DISTINCT rand_str)
FROM dataset
GROUP BY rand_dt
ORDER BY COUNT(DISTINCT rand_str) DESC

Each dataset contains:

- row_id: Sequential row identifier
- txn_key: Random UUID (varchar)
- rand_dt: Random date between 2020-01-01 and 2025-01-01
- rand_val: Random float between 0-100
- rand_str: Random string (1-26 characters)
Based on the original benchmark, you should see DuckDB significantly outperforming Spark across all dataset sizes:
| Row Count | DuckDB (ms) | Spark (ms) | Speedup |
|---|---|---|---|
| 500 | ~3.5 | ~224 | ~64x |
| 5k | ~5.5 | ~212 | ~38x |
| 50k | ~6.0 | ~256 | ~43x |
| 500k | ~14 | ~286 | ~20x |
| 5M | ~55 | ~647 | ~12x |
| 50M | ~309 | ~1564 | ~5x |
| 500M | ~2623 | ~13055 | ~5x |
If you encounter memory issues with the largest dataset (500M rows), you can modify data_generator.py to generate smaller datasets or skip the largest one.
If Spark fails to start, ensure Java is properly installed and the JAVA_HOME environment variable is set.
Ensure you have write permissions to your home directory for the ~/dummy_data/ folder.
This project is for educational and benchmarking purposes. Please refer to the original article for proper attribution.