JW-Corelight/Splunk-db-connect-benchmark
Database Benchmark: Multi-Engine Analytics with Apache Iceberg


A comprehensive benchmarking environment for cybersecurity analytics, featuring native database formats, Splunk DB Connect overhead testing, and Apache Iceberg multi-engine architecture. Designed for Apple Silicon (M3/M3 Pro/M3 Max) with AWS deployment options for fair production comparisons.


🎯 Purpose

This project benchmarks three key architectural patterns for cybersecurity log analysis:

1. Native Performance Baseline

Compare raw query performance across database engines on their native formats:

  • PostgreSQL 16 - Traditional relational (ARM64 native)
  • ClickHouse 24.1 - Columnar OLAP with NEON SIMD (ARM64 native)
  • StarRocks 3.2 - MPP analytics (Rosetta 2)

2. Splunk DB Connect Overhead

Measure performance impact when querying databases via Splunk's dbxquery command:

  • Direct query: Database → Application (baseline)
  • Proxy query: Database → Splunk → Application (measure overhead)
  • Expected overhead: 100-200ms added latency

3. Apache Iceberg Multi-Engine

Test open lakehouse pattern where multiple engines query shared data:

  • Trino - Create and manage Iceberg tables (full read/write)
  • ClickHouse - Query Iceberg tables (read-only, ~20x slower than native)
  • StarRocks - Query and modify Iceberg tables (read/write, ~4x slower)
  • Trade-off: Flexibility vs Performance

🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                          Query Engines                              │
├────────────┬────────────┬────────────┬────────────┬─────────────────┤
│ PostgreSQL │ ClickHouse │ StarRocks  │   Trino    │     Splunk      │
│  (Native)  │  (Native)  │  (Native)  │ (Iceberg)  │  (DB Connect)   │
└──────┬─────┴─────┬──────┴──────┬─────┴──────┬─────┴─────────┬───────┘
       │           │             │            │               │
       │           └─────────────┼────────────┘               │
       │                         │                            │
       │           ┌─────────────▼──────────────┐             │
       │           │   Hive Metastore Catalog   │             │
       │           │ (Iceberg metadata: schemas,│             │
       │           │  snapshots, partitions)    │             │
       │           └─────────────┬──────────────┘             │
       │                         │                            │
       │           ┌─────────────▼──────────────┐             │
       │           │  MinIO (S3-compatible)     │             │
       │           │  /warehouse/cybersecurity/ │             │
       │           │    - Parquet data files    │             │
       │           │    - Avro metadata files   │             │
       │           └────────────────────────────┘             │
       │                                                      │
       └──────────────────────────────────────────────────────┘
                   Native Table Storage + Splunk DB Connect

Key Components

| Component | Role | Port | Architecture |
|---|---|---|---|
| PostgreSQL | Relational database | 5432 | ARM64 native |
| ClickHouse | Columnar OLAP + Iceberg engine | 8123 | ARM64 native |
| StarRocks | MPP analytics + Iceberg catalog | 9030 | Rosetta 2 |
| Splunk | SIEM + DB Connect proxy | 8000 | Rosetta 2 |
| Trino | Federated SQL + Iceberg coordinator | 8080 | ARM64 native |
| MinIO | S3-compatible object storage | 9000/9001 | ARM64 native |
| Hive Metastore | Iceberg catalog backend | 9083 | Rosetta 2 |

🚀 Quick Start

Prerequisites

  • Hardware: MacBook Pro M3/M3 Pro/M3 Max
  • Memory: 24GB recommended (18GB minimum)
  • Storage: 150GB free space
  • macOS: 14.0 Sonoma or later
  • Docker Desktop: 24.0+ with Rosetta 2 enabled
  • Python: 3.10+ with pip

Installation

# Clone the repository
git clone https://github.com/yourusername/splunk-db-connect-benchmark.git
cd splunk-db-connect-benchmark

# Copy environment template
cp .env.example .env

# Start all services
docker-compose -f docker-compose.m3.yml up -d

# Wait for all services to be healthy (5-10 minutes)
docker-compose -f docker-compose.m3.yml ps

Initialize Data and Iceberg Tables

# Step 1: Load data into native databases (PostgreSQL, ClickHouse, StarRocks)
bash scripts/phase4_load_data.sh

# Step 2: Initialize MinIO and create Iceberg tables
bash scripts/setup_iceberg.sh

# Step 3: Configure ClickHouse Iceberg engine
bash scripts/configure_clickhouse_iceberg.sh

# Step 4: Configure StarRocks Iceberg catalog
bash scripts/configure_starrocks_iceberg.sh

# Step 5: Set up Splunk DB Connect (requires manual DB Connect app installation)
bash scripts/setup_splunk_dbconnect.sh

Run Benchmarks

# Install Python dependencies
pip3 install psycopg2-binary clickhouse-connect pymysql

# Run all 3 benchmarks
cd benchmarks
./run_all.sh

# Or run individually:
python3 01_native_baseline.py          # Native performance baseline
python3 02_splunk_dbxquery_overhead.py # Splunk proxy overhead
python3 03_iceberg_multi_engine.py     # Iceberg multi-engine performance

# View results
ls -lh results/
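The three scripts share the same measurement pattern: warm up, repeat each query, and report robust latency statistics. A minimal sketch of such a harness (the `time_query` helper is illustrative, not code from the repo; swap the lambda for a real call such as a `psycopg2` `cursor.execute`):

```python
import statistics
import time

def time_query(run_query, iterations=10, warmup=2):
    """Time a zero-arg callable that executes one query; return latency stats in ms."""
    for _ in range(warmup):  # warm caches so steady-state latency is measured
        run_query()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[min(len(samples) - 1, int(len(samples) * 0.95))],
        "min_ms": samples[0],
        "max_ms": samples[-1],
    }

# Stand-in workload; replace with a real database round trip:
stats = time_query(lambda: sum(range(10_000)))
print(stats)
```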

📊 Benchmark Suite

Test 1: Native Performance Baseline

Purpose: Establish baseline query performance on native database formats

Queries:

  • Count all records
  • Aggregate by event type (GROUP BY + ORDER BY)
  • Filter failed login events (WHERE + HAVING)
  • Time range aggregation (last 7 days)
  • Top data transfer events (ORDER BY + LIMIT)

Expected Results:

| Database | Simple Query | Complex Aggregation |
|---|---|---|
| ClickHouse | 10-20 ms | 30-50 ms |
| PostgreSQL | 50-100 ms | 150-300 ms |
| StarRocks | 30-50 ms | 70-120 ms |

Script: benchmarks/01_native_baseline.py


Test 2: Splunk DB Connect Overhead

Purpose: Measure latency added by Splunk's dbxquery proxy layer

Method:

  1. Query database directly (baseline)
  2. Query same database via Splunk dbxquery
  3. Calculate overhead: splunk_latency - direct_latency
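Step 3 is just a difference of medians (medians resist GC pauses and cold-cache outliers better than means). A sketch with illustrative sample values, not measured results:

```python
import statistics

def dbxquery_overhead_ms(direct_samples, proxy_samples):
    """Latency added by the Splunk proxy path, comparing medians of repeated runs."""
    return statistics.median(proxy_samples) - statistics.median(direct_samples)

# Illustrative numbers in the ranges the tables below predict (not real measurements):
direct = [52.0, 55.1, 49.8, 60.2, 51.3]        # ms, direct PostgreSQL queries
proxied = [198.0, 210.5, 189.9, 224.1, 201.7]  # ms, same query via dbxquery
print(round(dbxquery_overhead_ms(direct, proxied), 1))  # → 149.7
```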

Expected Overhead:

| Database | Direct Query | via dbxquery | Overhead |
|---|---|---|---|
| PostgreSQL | 50-100 ms | 150-300 ms | +100-200 ms |
| ClickHouse | 10-20 ms | 110-220 ms | +100-200 ms |
| StarRocks | 30-50 ms | 130-250 ms | +100-200 ms |

Key Finding: Splunk dbxquery adds a consistent 100-200 ms of overhead regardless of database speed.

Script: benchmarks/02_splunk_dbxquery_overhead.py

Documentation: docs/SPLUNK_DBXQUERY_LIMITATIONS.md


Test 3: Iceberg Multi-Engine Performance

Purpose: Test Apache Iceberg table format with multiple query engines

Architecture:

  • Shared Data: Iceberg tables stored in MinIO (Parquet format)
  • Metadata: Hive Metastore catalog (tracks schemas, snapshots)
  • Engines: Trino (read/write), ClickHouse (read-only), StarRocks (read/write)

Comparison: Native format vs Iceberg format for same query

Expected Results:

| Engine | Native Format | Iceberg Format | Slowdown |
|---|---|---|---|
| ClickHouse | 10-20 ms | 200-500 ms | 20-25x slower |
| StarRocks | 30-50 ms | 120-200 ms | 4-6x slower |

Trade-off:

  • Benefit: Multi-engine access, ACID transactions, schema evolution, time travel
  • Cost: 4-25x slower query performance

Script: benchmarks/03_iceberg_multi_engine.py

Documentation: docs/ICEBERG_MULTI_ENGINE.md


📁 Project Structure

splunk-db-connect-benchmark/
├── README.md                        # This file
├── docker-compose.m3.yml            # Complete M3 deployment
├── .env.example                     # Environment variables
│
├── configs/                         # Database configurations
│   ├── postgresql/                  # PostgreSQL configs
│   ├── clickhouse/                  # ClickHouse configs
│   ├── starrocks/                   # StarRocks FE/BE configs
│   ├── splunk/                      # Splunk configs
│   └── trino/                       # Trino configs (NEW)
│       ├── config.properties        # Trino coordinator settings
│       ├── jvm.config               # JVM tuning for M3
│       ├── node.properties          # Node identification
│       ├── log.properties           # Logging configuration
│       └── catalog/
│           └── iceberg.properties   # Iceberg catalog connector
│
├── sql/                             # Schema definitions
│   ├── postgresql_schema.sql        # PostgreSQL tables
│   ├── clickhouse_schema.sql        # ClickHouse tables
│   ├── starrocks_schema.sql         # StarRocks tables
│   └── iceberg_schema.sql           # Iceberg tables (NEW)
│
├── scripts/                         # Setup and utility scripts
│   ├── setup_all.sh                 # Master setup script
│   ├── setup_iceberg.sh             # Initialize Iceberg (NEW)
│   ├── configure_clickhouse_iceberg.sh  # ClickHouse Iceberg engine (NEW)
│   ├── configure_starrocks_iceberg.sh   # StarRocks Iceberg catalog (NEW)
│   ├── setup_splunk_dbconnect.sh    # Splunk DB Connect (NEW)
│   ├── phase1_verify_system.sh      # System verification
│   ├── phase2_configure_docker.sh   # Docker configuration
│   ├── phase3_deploy_containers.sh  # Container deployment
│   ├── phase4_load_data.sh          # Data loading
│   └── cleanup.sh                   # Environment cleanup
│
├── benchmarks/                      # Benchmark scripts (NEW)
│   ├── 01_native_baseline.py        # Native performance baseline
│   ├── 02_splunk_dbxquery_overhead.py  # Splunk proxy overhead
│   ├── 03_iceberg_multi_engine.py   # Iceberg multi-engine
│   ├── run_all.sh                   # Execute all benchmarks
│   └── results/                     # Benchmark results (JSON)
│
├── docs/                            # Documentation
│   ├── SPLUNK_DBXQUERY_LIMITATIONS.md  # Splunk dbxquery analysis (NEW)
│   ├── ICEBERG_MULTI_ENGINE.md      # Iceberg architecture guide (NEW)
│   ├── TROUBLESHOOTING.md           # Common issues
│   └── ARCHITECTURE.md              # System architecture
│
└── data/                            # Persistent data (gitignored)
    ├── postgresql/                  # PostgreSQL data
    ├── clickhouse/                  # ClickHouse data
    ├── starrocks-fe/                # StarRocks frontend
    ├── starrocks-be/                # StarRocks backend
    ├── splunk/                      # Splunk data
    ├── minio/                       # MinIO object storage (NEW)
    └── hive-metastore/              # Hive Metastore metadata (NEW)

🎮 Using the Environment

Access Services

| Service | Endpoint | Credentials | Purpose |
|---|---|---|---|
| PostgreSQL | localhost:5432 | postgres / postgres123 | Relational database |
| ClickHouse HTTP | http://localhost:8123 | default / (none) | Columnar OLAP |
| ClickHouse Native | localhost:9000 | default / (none) | Native protocol |
| StarRocks FE | http://localhost:8030 | root / (none) | Web UI |
| StarRocks MySQL | localhost:9030 | root / (none) | Query interface |
| Splunk Web | http://localhost:8000 | admin / changeme | SIEM UI |
| Splunk API | https://localhost:8089 | admin / changeme | Management API |
| Trino Web | http://localhost:8080 | (none) | Query UI |
| MinIO Console | http://localhost:9001 | admin / password123 | Object storage UI |
| MinIO API | http://localhost:9000 | admin / password123 | S3-compatible API |

Query Examples

PostgreSQL (Native)

docker exec -it benchmark-postgres psql -U postgres -d cybersecurity -c \
  "SELECT event_type, COUNT(*) FROM security_logs GROUP BY event_type ORDER BY COUNT(*) DESC LIMIT 10;"

ClickHouse (Native MergeTree)

curl 'http://localhost:8123/' --data-binary \
  "SELECT event_type, COUNT() as count FROM cybersecurity.security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10 FORMAT Pretty"

ClickHouse (Iceberg Table Engine)

docker exec -it benchmark-clickhouse clickhouse-client --query \
  "SELECT event_type, COUNT() as count FROM iceberg_db.security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10"

StarRocks (Native)

docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -D cybersecurity -e \
  "SELECT event_type, COUNT(*) as count FROM security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10;"

StarRocks (Iceberg External Catalog)

docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e \
  "SELECT event_type, COUNT(*) as count FROM iceberg_catalog.cybersecurity.security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10;"

Trino (Iceberg)

docker exec -it benchmark-trino trino --server localhost:8080 --catalog iceberg --schema cybersecurity --execute \
  "SELECT event_type, COUNT(*) as count FROM security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10"

Splunk DB Connect (dbxquery)

docker exec benchmark-splunk /opt/splunk/bin/splunk search \
  '| dbxquery connection="postgresql_conn" query="SELECT event_type, COUNT(*) as count FROM security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10"' \
  -auth admin:changeme

📈 Performance Expectations

Native Database Performance (M3 Platform)

| Database | Simple Query | Aggregation | Join Query | Architecture |
|---|---|---|---|---|
| ClickHouse | 10-20 ms | 30-50 ms | 50-100 ms | ARM64 native (optimal) |
| PostgreSQL | 50-100 ms | 150-300 ms | 300-500 ms | ARM64 native (optimal) |
| StarRocks | 40-60 ms | 80-150 ms | 150-250 ms | Rosetta 2 (15-20% overhead) |
| Splunk | 200-500 ms | 500-2000 ms | 1000-5000 ms | Rosetta 2 (30-40% overhead) |

Splunk DB Connect Overhead

  • Additional Latency: +100-200 ms per query
  • Reason: Splunk search pipeline + JDBC driver + data serialization
  • Acceptable For: Ad-hoc queries, dashboards (< 100 queries/hour)
  • Not Suitable For: Real-time analytics (> 1000 queries/hour)
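The two thresholds above can be folded into a simple workload check (a sketch; the function name and the "borderline" middle band are assumptions, not project policy):

```python
def dbxquery_suitable(queries_per_hour):
    """Judge whether the dbxquery proxy path fits a workload, per the thresholds above.

    The 100-200 ms proxy overhead is tolerable for ad-hoc queries and dashboards
    (< 100 queries/hour) and unsuitable for real-time analytics (> 1000 queries/hour).
    """
    if queries_per_hour < 100:
        return "acceptable"
    if queries_per_hour > 1000:
        return "not suitable"
    return "borderline: benchmark your workload"

print(dbxquery_suitable(40))    # → acceptable
print(dbxquery_suitable(5000))  # → not suitable
```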

Iceberg Multi-Engine Performance

| Engine | Native | Iceberg | Slowdown | Notes |
|---|---|---|---|---|
| ClickHouse | 10-20 ms | 200-500 ms | 20-25x | Read-only, high overhead |
| StarRocks | 30-50 ms | 120-200 ms | 4-6x | Read-write, moderate overhead |
| Trino | — | 300-800 ms | (baseline) | Full Iceberg control |

Trade-offs:

  • Benefits: Multi-engine access, ACID transactions, schema evolution, time travel, unified governance
  • Costs: 4-25x slower than native formats, more complex infrastructure

🔄 Iceberg Use Cases

When to Use Iceberg

Good Use Cases:

  1. Data Governance: Centralized access control across multiple query engines
  2. Schema Evolution: Frequent schema changes without data rewrites
  3. Time Travel: Audit trails, debugging, regulatory compliance
  4. Multi-Engine Federation: Trino for ETL, ClickHouse for dashboards, StarRocks for analytics
  5. Open Architecture: Avoid vendor lock-in, switch engines without data migration

Not Suitable For:

  1. Real-Time Dashboards: Sub-second query latency required
  2. High-Frequency Queries: > 1000 queries/second
  3. OLTP Workloads: Transactional applications with frequent updates
  4. Single Engine: If only using one query engine, native format is faster

Decision Matrix

| Requirement | Native Format | Iceberg Format |
|---|---|---|
| Performance | ✅ Best | ❌ 4-25x slower |
| Multi-engine | ❌ Requires ETL | ✅ Native support |
| Schema evolution | ❌ Rewrite data | ✅ Metadata-only |
| Time travel | ❌ Not supported | ✅ Full support |
| ACID across engines | ❌ No | ✅ Yes |
| Setup complexity | ✅ Simple | ❌ Complex |
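The matrix collapses into a rough rule of thumb; the function below is a sketch of that reasoning (flag names and priorities are assumptions, not project policy):

```python
def recommend_format(multi_engine, needs_time_travel, needs_schema_evolution,
                     latency_sensitive):
    """Rough rule of thumb distilled from the decision matrix above.

    A sub-second latency requirement dominates; otherwise any lakehouse-style
    need (multi-engine access, time travel, cheap schema evolution) favors Iceberg.
    """
    if latency_sensitive:
        return "native"
    if multi_engine or needs_time_travel or needs_schema_evolution:
        return "iceberg"
    return "native"

print(recommend_format(multi_engine=True, needs_time_travel=False,
                       needs_schema_evolution=False, latency_sensitive=False))  # → iceberg
```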

📚 Documentation

Core Documentation

  • docs/ARCHITECTURE.md - System architecture
  • docs/ICEBERG_MULTI_ENGINE.md - Iceberg architecture guide
  • docs/SPLUNK_DBXQUERY_LIMITATIONS.md - Splunk dbxquery analysis
  • docs/TROUBLESHOOTING.md - Common issues

Benchmark Scripts

  • benchmarks/01_native_baseline.py - Native performance baseline
  • benchmarks/02_splunk_dbxquery_overhead.py - Splunk proxy overhead
  • benchmarks/03_iceberg_multi_engine.py - Iceberg multi-engine performance


🛠️ Advanced Configuration

Increase Iceberg Query Performance

-- Trino: Compact small files (reduces metadata overhead)
ALTER TABLE iceberg.cybersecurity.security_logs EXECUTE optimize;

-- Trino: Expire old snapshots (cleanup history)
ALTER TABLE iceberg.cybersecurity.security_logs EXECUTE expire_snapshots(retention_threshold => '7d');

-- Trino: Remove orphan files (cleanup unused data)
ALTER TABLE iceberg.cybersecurity.security_logs EXECUTE remove_orphan_files(older_than => '30d');

Configure Splunk DB Connect Connections

See scripts/setup_splunk_dbconnect.sh for automated configuration.

Manual Configuration:

  1. Install Splunk DB Connect app from Splunkbase
  2. Create database identities (credentials)
  3. Create database connections (PostgreSQL, ClickHouse, StarRocks)
  4. Test connections via Splunk UI

Tune Database Memory (M3)

docker-compose.m3.yml:

services:
  postgresql:
    deploy:
      resources:
        limits:
          memory: 4G    # Increase if needed

  clickhouse:
    deploy:
      resources:
        limits:
          memory: 8G    # High memory for fast queries

  trino:
    deploy:
      resources:
        limits:
          memory: 6G    # JVM heap for Iceberg metadata

🌐 AWS Deployment (Fair Comparison)

Why AWS?

Problem on M3:

  • StarRocks (Rosetta 2): 15-20% slower than native
  • Splunk (Rosetta 2): 30-40% slower than native
  • Unfair comparison between ARM64-native (ClickHouse) and Rosetta (others)

Solution:

  • Deploy on AWS with native architectures for all databases
  • ClickHouse: c7g.4xlarge (ARM64 Graviton3)
  • StarRocks: m6i.4xlarge (x86_64 Intel)
  • Splunk: m6i.4xlarge (x86_64 Intel)
  • Fair apples-to-apples comparison

Cost-Optimized AWS Deployment

Use spot instances for short-term benchmarking:

| Component | Instance | Spot Cost/Day |
|---|---|---|
| PostgreSQL | r7g.xlarge | $3.84 |
| ClickHouse | c7g.2xlarge | $5.02 |
| StarRocks | m6i.2xlarge | $6.67 |
| Splunk | m6i.2xlarge | $6.67 |
| Trino | m7g.2xlarge | $4.70 |
| MinIO | t4g.large | $0.54 |
| Hive Metastore | t3a.medium | $0.27 |
| **Total** | | **~$28/day** |

Run for 3 days: ~$85 total for comprehensive benchmarking
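The daily total is easy to sanity-check by summing the table (prices copied from above; spot prices fluctuate, so treat these as ballpark figures):

```python
# Spot prices per day from the table above (USD)
spot_cost_per_day = {
    "PostgreSQL (r7g.xlarge)": 3.84,
    "ClickHouse (c7g.2xlarge)": 5.02,
    "StarRocks (m6i.2xlarge)": 6.67,
    "Splunk (m6i.2xlarge)": 6.67,
    "Trino (m7g.2xlarge)": 4.70,
    "MinIO (t4g.large)": 0.54,
    "Hive Metastore (t3a.medium)": 0.27,
}
daily = sum(spot_cost_per_day.values())
print(round(daily, 2), round(daily * 3, 2))  # → 27.71 83.13
```

Three days of runtime comes to about $83, in line with the ~$85 estimate above once spot-price variation is allowed for.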

Terraform Deployment: Coming soon (see GitHub Issues)


⚠️ Troubleshooting

Iceberg Tables Not Visible

ClickHouse:

# Verify MinIO connectivity
curl http://localhost:9000/minio/health/live

# Check Iceberg table engine
docker exec -it benchmark-clickhouse clickhouse-client --query \
  "SHOW TABLES FROM iceberg_db"

StarRocks:

# Refresh Iceberg catalog
docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e \
  "REFRESH EXTERNAL TABLE iceberg_catalog.cybersecurity.security_logs;"

# Verify catalog
docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e \
  "SHOW CATALOGS;"

Splunk DB Connect Not Working

Error: "dbxquery command not found"

Solution:

  1. Install DB Connect app from Splunkbase: https://splunkbase.splunk.com/app/2686
  2. Place .spl file in ./data/splunk/dbconnect/
  3. Restart Splunk: docker-compose -f docker-compose.m3.yml restart splunk
  4. Run setup script: bash scripts/setup_splunk_dbconnect.sh

Trino Queries Timeout

Error: "Query exceeded maximum time limit"

Solution:

# configs/trino/config.properties
query.max-execution-time=30m
query.max-memory-per-node=4GB

Out of Memory on M3

Symptom: Docker containers crash, system freezes

Solution:

  1. Increase Docker Desktop memory: Settings → Resources → Memory → 20GB (for 24GB Mac)
  2. Reduce concurrent services: Stop Splunk/Trino if not actively testing
  3. Lower database memory limits in docker-compose.m3.yml

🤝 Contributing

Contributions are welcome! Areas for improvement:

  1. Benchmark Enhancements:

    • Add more query types (window functions, CTEs, subqueries)
    • Test with larger datasets (1M, 10M, 100M rows)
    • Measure concurrent query performance
  2. Iceberg Features:

    • Time travel query benchmarks
    • Schema evolution performance
    • ACID transaction testing
  3. AWS Deployment:

    • Terraform scripts for automated deployment
    • CloudWatch monitoring dashboards
    • Cost optimization strategies
  4. Documentation:

    • Video tutorials
    • Best practices guide
    • Case studies

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • PostgreSQL, ClickHouse, StarRocks, Splunk, Trino teams for excellent database systems
  • Apache Iceberg community for open table format
  • Docker for consistent cross-platform development
  • Apple for M-series ARM64 architecture
  • AWS for Graviton processors

📞 Support


Built for the cybersecurity and database communities

Tested on: MacBook Pro M3 Pro (24GB) | macOS 15.0 Sequoia | Docker Desktop 4.26
