DuckDB vs Spark Benchmark

This project implements the benchmark comparison between DuckDB and Spark as described in the DataExpert.io newsletter article "DuckDB benchmarked against Spark - You Don't Always Need A Sledgehammer".

Overview

The benchmark compares the performance of DuckDB and Apache Spark on count distinct operations across datasets of varying sizes (500 rows to 500 million rows). The results show DuckDB outperforming Spark at every size tested, from roughly 64x on the smallest dataset to about 5x at 500 million rows, making it a strong choice for single-machine workloads.

Project Structure

bench/
├── main.py                 # Main script to run complete benchmark
├── data_generator.py       # Generate test datasets using DuckDB
├── benchmark.py           # Benchmark functions for DuckDB and Spark
├── visualize_results.py   # Create performance charts and summaries
├── requirements.txt       # Python dependencies
└── README.md             # This file

Prerequisites

  • Python 3.8 or higher
  • Java 8 or higher (required for Spark)
  • At least 25GB of free disk space (for the largest dataset)

Installation

  1. Clone or download this project
  2. Install Python dependencies:
    pip install -r requirements.txt

Usage

Run Complete Benchmark

To run the entire benchmark process (data generation + benchmarking + visualization):

python main.py

Run Individual Components

Generate test data only:

python data_generator.py

Run benchmarks only:

python benchmark.py

Create visualizations only:

python visualize_results.py

Command Line Options

python main.py --help

Available options:

  • --skip-data-generation: Skip generating test datasets
  • --skip-benchmark: Skip running benchmarks
  • --skip-visualization: Skip creating visualizations
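
The skip flags above could be wired up with argparse roughly as follows. This is a sketch of how main.py might parse them, not the actual implementation:

```python
import argparse

def parse_args(argv=None):
    """Parse the benchmark's skip flags (illustrative; main.py may differ)."""
    parser = argparse.ArgumentParser(description="DuckDB vs Spark benchmark")
    parser.add_argument("--skip-data-generation", action="store_true",
                        help="Skip generating test datasets")
    parser.add_argument("--skip-benchmark", action="store_true",
                        help="Skip running benchmarks")
    parser.add_argument("--skip-visualization", action="store_true",
                        help="Skip creating visualizations")
    return parser.parse_args(argv)

args = parse_args(["--skip-benchmark"])
print(args.skip_benchmark)  # True when the flag is passed
```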

Generated Files

The benchmark creates the following files:

  1. Test Datasets (~/dummy_data/):

    • ds_500_rows.parquet (~33 KB)
    • ds_5000_rows.parquet (~289 KB)
    • ds_50000_rows.parquet (~2.3 MB)
    • ds_500000_rows.parquet (~23.2 MB)
    • ds_5000000_rows.parquet (~232.2 MB)
    • ds_50000000_rows.parquet (~2.32 GB)
    • ds_500000000_rows.parquet (~23.22 GB)
  2. Results:

    • benchmark_results.csv: Detailed benchmark results
    • benchmark_comparison.png: Performance comparison chart

Benchmark Details

Test Query

Both DuckDB and Spark execute the same query:

SELECT rand_dt, COUNT(DISTINCT rand_str) 
FROM dataset 
GROUP BY rand_dt 
ORDER BY COUNT(DISTINCT rand_str) DESC

Dataset Schema

Each dataset contains:

  • row_id: Sequential row identifier
  • txn_key: Random UUID (varchar)
  • rand_dt: Random date between 2020-01-01 and 2025-01-01
  • rand_val: Random float between 0-100
  • rand_str: Random string (1-26 characters)

Expected Results

Based on the original benchmark, you should see DuckDB significantly outperforming Spark across all dataset sizes:

Row Count   DuckDB (ms)   Spark (ms)   Speedup
500         ~3.5          ~224         ~64x
5k          ~5.5          ~212         ~38x
50k         ~6.0          ~256         ~43x
500k        ~14           ~286         ~20x
5M          ~55           ~647         ~12x
50M         ~309          ~1564        ~5x
500M        ~2623         ~13055       ~5x
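
The Speedup column is simply the Spark time divided by the DuckDB time, and can be recomputed from benchmark_results.csv. The header names ("rows", "duckdb_ms", "spark_ms") below are assumptions; adjust them to match the actual file:

```python
import csv
import io

def speedups(fh):
    """Return {row_count: spark_ms / duckdb_ms} for each CSV row."""
    return {r["rows"]: float(r["spark_ms"]) / float(r["duckdb_ms"])
            for r in csv.DictReader(fh)}

# Stand-in for benchmark_results.csv with two of the rows above.
sample = io.StringIO(
    "rows,duckdb_ms,spark_ms\n"
    "500,3.5,224\n"
    "500000000,2623,13055\n"
)
for rows, s in speedups(sample).items():
    print(f"{rows}: ~{s:.0f}x")  # e.g. "500: ~64x"
```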

Troubleshooting

Memory Issues

If you encounter memory issues with the largest dataset (500M rows), you can modify data_generator.py to generate smaller datasets or skip the largest one.

Spark Setup Issues

If Spark fails to start, ensure Java is properly installed and the JAVA_HOME environment variable is set.

File Permissions

Ensure you have write permissions to your home directory for the ~/dummy_data/ folder.

License

This project is for educational and benchmarking purposes. Please refer to the original article for proper attribution.
