JW-Corelight/Splunk-db-connect-benchmark
Database Benchmark: Multi-Engine Analytics with Apache Iceberg


A comprehensive benchmarking environment for cybersecurity analytics, featuring native database formats, Splunk DB Connect overhead testing, and Apache Iceberg multi-engine architecture. Designed for Apple Silicon (M3/M3 Pro/M3 Max) with AWS deployment options for fair production comparisons.


🎯 Purpose

This project benchmarks three key architectural patterns for cybersecurity log analysis:

1. Native Performance Baseline

Compare raw query performance across database engines on their native formats:

  • PostgreSQL 16 - Traditional relational (ARM64 native)
  • ClickHouse 24.1 - Columnar OLAP with NEON SIMD (ARM64 native)
  • StarRocks 3.2 - MPP analytics (Rosetta 2)

2. Splunk DB Connect Overhead

Measure performance impact when querying databases via Splunk's dbxquery command:

  • Direct query: Database → Application (baseline)
  • Proxy query: Database → Splunk → Application (measure overhead)
  • Expected overhead: 100-200ms added latency

3. Apache Iceberg Multi-Engine

Test open lakehouse pattern where multiple engines query shared data:

  • Trino - Create and manage Iceberg tables (full read/write)
  • ClickHouse - Query Iceberg tables (read-only, ~20x slower than native)
  • StarRocks - Query and modify Iceberg tables (read/write, ~4x slower)
  • Trade-off: Flexibility vs Performance

🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                          Query Engines                              │
├────────────┬────────────┬────────────┬────────────┬─────────────────┤
│ PostgreSQL │ ClickHouse │ StarRocks  │   Trino    │     Splunk      │
│  (Native)  │  (Native)  │  (Native)  │ (Iceberg)  │  (DB Connect)   │
└──────┬─────┴─────┬──────┴──────┬─────┴──────┬─────┴─────────┬───────┘
       │           │             │            │               │
       │           └─────────────┼────────────┘               │
       │                         │                            │
       │           ┌─────────────▼──────────────┐             │
       │           │   Hive Metastore Catalog   │             │
       │           │ (Iceberg metadata: schemas,│             │
       │           │  snapshots, partitions)    │             │
       │           └─────────────┬──────────────┘             │
       │                         │                            │
       │           ┌─────────────▼──────────────┐             │
       │           │  MinIO (S3-compatible)     │             │
       │           │  /warehouse/cybersecurity/ │             │
       │           │    - Parquet data files    │             │
       │           │    - Avro metadata files   │             │
       │           └────────────────────────────┘             │
       │                                                      │
       └──────────────────────────────────────────────────────┘
                   Native Table Storage + Splunk DB Connect

Key Components

| Component | Role | Port | Architecture |
|---|---|---|---|
| PostgreSQL | Relational database | 5432 | ARM64 native |
| ClickHouse | Columnar OLAP + Iceberg engine | 8123 | ARM64 native |
| StarRocks | MPP analytics + Iceberg catalog | 9030 | Rosetta 2 |
| Splunk | SIEM + DB Connect proxy | 8000 | Rosetta 2 |
| Trino | Federated SQL + Iceberg coordinator | 8080 | ARM64 native |
| MinIO | S3-compatible object storage | 9000/9001 | ARM64 native |
| Hive Metastore | Iceberg catalog backend | 9083 | Rosetta 2 |

🚀 Quick Start

Prerequisites

  • Hardware: MacBook Pro M3/M3 Pro/M3 Max
  • Memory: 24GB recommended (18GB minimum)
  • Storage: 150GB free space
  • macOS: 14.0 Sonoma or later
  • Docker Desktop: 24.0+ with Rosetta 2 enabled
  • Python: 3.10+ with pip

Installation

# Clone the repository
git clone https://github.com/yourusername/splunk-db-connect-benchmark.git
cd splunk-db-connect-benchmark

# Copy environment template
cp .env.example .env

# Start all services
docker-compose -f docker-compose.m3.yml up -d

# Wait for all services to be healthy (5-10 minutes)
docker-compose -f docker-compose.m3.yml ps

Initialize Data and Iceberg Tables

# Step 1: Load data into native databases (PostgreSQL, ClickHouse, StarRocks)
bash scripts/phase4_load_data.sh

# Step 2: Initialize MinIO and create Iceberg tables
bash scripts/setup_iceberg.sh

# Step 3: Configure ClickHouse Iceberg engine
bash scripts/configure_clickhouse_iceberg.sh

# Step 4: Configure StarRocks Iceberg catalog
bash scripts/configure_starrocks_iceberg.sh

# Step 5: Set up Splunk DB Connect (requires manual DB Connect app installation)
bash scripts/setup_splunk_dbconnect.sh

Run Benchmarks

# Install Python dependencies
pip3 install psycopg2-binary clickhouse-connect pymysql

# Run all 3 benchmarks
cd benchmarks
./run_all.sh

# Or run individually:
python3 01_native_baseline.py          # Native performance baseline
python3 02_splunk_dbxquery_overhead.py # Splunk proxy overhead
python3 03_iceberg_multi_engine.py     # Iceberg multi-engine performance

# View results
ls -lh results/
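The three scripts share the same measurement pattern: warm up, repeat each query, and report robust latency statistics. A minimal sketch of such a harness (the `time_query` helper is illustrative, not code from the repo; swap the lambda for a real call such as a `psycopg2` `cursor.execute`):

```python
import statistics
import time

def time_query(run_query, iterations=10, warmup=2):
    """Time a zero-arg callable that executes one query; return latency stats in ms."""
    for _ in range(warmup):  # warm caches so steady-state latency is measured
        run_query()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[min(len(samples) - 1, int(len(samples) * 0.95))],
        "min_ms": samples[0],
        "max_ms": samples[-1],
    }

# Stand-in workload; replace with a real database round trip:
stats = time_query(lambda: sum(range(10_000)))
print(stats)
```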

📊 Benchmark Suite

Test 1: Native Performance Baseline

Purpose: Establish baseline query performance on native database formats

Queries:

  • Count all records
  • Aggregate by event type (GROUP BY + ORDER BY)
  • Filter failed login events (WHERE + HAVING)
  • Time range aggregation (last 7 days)
  • Top data transfer events (ORDER BY + LIMIT)

Expected Results:

| Database | Simple Query | Complex Aggregation |
|---|---|---|
| ClickHouse | 10-20 ms | 30-50 ms |
| PostgreSQL | 50-100 ms | 150-300 ms |
| StarRocks | 30-50 ms | 70-120 ms |

Script: benchmarks/01_native_baseline.py


Test 2: Splunk DB Connect Overhead

Purpose: Measure latency added by Splunk's dbxquery proxy layer

Method:

  1. Query database directly (baseline)
  2. Query same database via Splunk dbxquery
  3. Calculate overhead: splunk_latency - direct_latency
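Step 3 is just a difference of medians (medians resist GC pauses and cold-cache outliers better than means). A sketch with illustrative sample values, not measured results:

```python
import statistics

def dbxquery_overhead_ms(direct_samples, proxy_samples):
    """Latency added by the Splunk proxy path, comparing medians of repeated runs."""
    return statistics.median(proxy_samples) - statistics.median(direct_samples)

# Illustrative numbers in the ranges the tables below predict (not real measurements):
direct = [52.0, 55.1, 49.8, 60.2, 51.3]        # ms, direct PostgreSQL queries
proxied = [198.0, 210.5, 189.9, 224.1, 201.7]  # ms, same query via dbxquery
print(round(dbxquery_overhead_ms(direct, proxied), 1))  # → 149.7
```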

Expected Overhead:

| Database | Direct Query | via dbxquery | Overhead |
|---|---|---|---|
| PostgreSQL | 50-100 ms | 150-300 ms | +100-200 ms |
| ClickHouse | 10-20 ms | 110-220 ms | +100-200 ms |
| StarRocks | 30-50 ms | 130-250 ms | +100-200 ms |

Key Finding: Splunk dbxquery adds a consistent 100-200 ms of overhead regardless of database speed.

Script: benchmarks/02_splunk_dbxquery_overhead.py

Documentation: docs/SPLUNK_DBXQUERY_LIMITATIONS.md


Test 3: Iceberg Multi-Engine Performance

Purpose: Test Apache Iceberg table format with multiple query engines

Architecture:

  • Shared Data: Iceberg tables stored in MinIO (Parquet format)
  • Metadata: Hive Metastore catalog (tracks schemas, snapshots)
  • Engines: Trino (read/write), ClickHouse (read-only), StarRocks (read/write)

Comparison: Native format vs Iceberg format for same query

Expected Results:

| Engine | Native Format | Iceberg Format | Slowdown |
|---|---|---|---|
| ClickHouse | 10-20 ms | 200-500 ms | 20-25x slower |
| StarRocks | 30-50 ms | 120-200 ms | 4-6x slower |

Trade-off:

  • Benefit: Multi-engine access, ACID transactions, schema evolution, time travel
  • Cost: 4-25x slower query performance

Script: benchmarks/03_iceberg_multi_engine.py

Documentation: docs/ICEBERG_MULTI_ENGINE.md


📁 Project Structure

splunk-db-connect-benchmark/
├── README.md                        # This file
├── docker-compose.m3.yml            # Complete M3 deployment
├── .env.example                     # Environment variables
│
├── configs/                         # Database configurations
│   ├── postgresql/                  # PostgreSQL configs
│   ├── clickhouse/                  # ClickHouse configs
│   ├── starrocks/                   # StarRocks FE/BE configs
│   ├── splunk/                      # Splunk configs
│   └── trino/                       # Trino configs (NEW)
│       ├── config.properties        # Trino coordinator settings
│       ├── jvm.config               # JVM tuning for M3
│       ├── node.properties          # Node identification
│       ├── log.properties           # Logging configuration
│       └── catalog/
│           └── iceberg.properties   # Iceberg catalog connector
│
├── sql/                             # Schema definitions
│   ├── postgresql_schema.sql        # PostgreSQL tables
│   ├── clickhouse_schema.sql        # ClickHouse tables
│   ├── starrocks_schema.sql         # StarRocks tables
│   └── iceberg_schema.sql           # Iceberg tables (NEW)
│
├── scripts/                         # Setup and utility scripts
│   ├── setup_all.sh                 # Master setup script
│   ├── setup_iceberg.sh             # Initialize Iceberg (NEW)
│   ├── configure_clickhouse_iceberg.sh  # ClickHouse Iceberg engine (NEW)
│   ├── configure_starrocks_iceberg.sh   # StarRocks Iceberg catalog (NEW)
│   ├── setup_splunk_dbconnect.sh    # Splunk DB Connect (NEW)
│   ├── phase1_verify_system.sh      # System verification
│   ├── phase2_configure_docker.sh   # Docker configuration
│   ├── phase3_deploy_containers.sh  # Container deployment
│   ├── phase4_load_data.sh          # Data loading
│   └── cleanup.sh                   # Environment cleanup
│
├── benchmarks/                      # Benchmark scripts (NEW)
│   ├── 01_native_baseline.py        # Native performance baseline
│   ├── 02_splunk_dbxquery_overhead.py  # Splunk proxy overhead
│   ├── 03_iceberg_multi_engine.py   # Iceberg multi-engine
│   ├── run_all.sh                   # Execute all benchmarks
│   └── results/                     # Benchmark results (JSON)
│
├── docs/                            # Documentation
│   ├── SPLUNK_DBXQUERY_LIMITATIONS.md  # Splunk dbxquery analysis (NEW)
│   ├── ICEBERG_MULTI_ENGINE.md      # Iceberg architecture guide (NEW)
│   ├── TROUBLESHOOTING.md           # Common issues
│   └── ARCHITECTURE.md              # System architecture
│
└── data/                            # Persistent data (gitignored)
    ├── postgresql/                  # PostgreSQL data
    ├── clickhouse/                  # ClickHouse data
    ├── starrocks-fe/                # StarRocks frontend
    ├── starrocks-be/                # StarRocks backend
    ├── splunk/                      # Splunk data
    ├── minio/                       # MinIO object storage (NEW)
    └── hive-metastore/              # Hive Metastore metadata (NEW)

🎮 Using the Environment

Access Services

| Service | Endpoint | Credentials | Purpose |
|---|---|---|---|
| PostgreSQL | localhost:5432 | postgres / postgres123 | Relational database |
| ClickHouse HTTP | http://localhost:8123 | default / (none) | Columnar OLAP |
| ClickHouse Native | localhost:9000 | default / (none) | Native protocol |
| StarRocks FE | http://localhost:8030 | root / (none) | Web UI |
| StarRocks MySQL | localhost:9030 | root / (none) | Query interface |
| Splunk Web | http://localhost:8000 | admin / changeme | SIEM UI |
| Splunk API | https://localhost:8089 | admin / changeme | Management API |
| Trino Web | http://localhost:8080 | (none) | Query UI |
| MinIO Console | http://localhost:9001 | admin / password123 | Object storage UI |
| MinIO API | http://localhost:9000 | admin / password123 | S3-compatible API |

Query Examples

PostgreSQL (Native)

docker exec -it benchmark-postgres psql -U postgres -d cybersecurity -c \
  "SELECT event_type, COUNT(*) FROM security_logs GROUP BY event_type ORDER BY COUNT(*) DESC LIMIT 10;"

ClickHouse (Native MergeTree)

curl 'http://localhost:8123/' --data-binary \
  "SELECT event_type, COUNT() as count FROM cybersecurity.security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10 FORMAT Pretty"

ClickHouse (Iceberg Table Engine)

docker exec -it benchmark-clickhouse clickhouse-client --query \
  "SELECT event_type, COUNT() as count FROM iceberg_db.security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10"

StarRocks (Native)

docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -D cybersecurity -e \
  "SELECT event_type, COUNT(*) as count FROM security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10;"

StarRocks (Iceberg External Catalog)

docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e \
  "SELECT event_type, COUNT(*) as count FROM iceberg_catalog.cybersecurity.security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10;"

Trino (Iceberg)

docker exec -it benchmark-trino trino --server localhost:8080 --catalog iceberg --schema cybersecurity --execute \
  "SELECT event_type, COUNT(*) as count FROM security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10"

Splunk DB Connect (dbxquery)

docker exec benchmark-splunk /opt/splunk/bin/splunk search \
  '| dbxquery connection="postgresql_conn" query="SELECT event_type, COUNT(*) as count FROM security_logs GROUP BY event_type ORDER BY count DESC LIMIT 10"' \
  -auth admin:changeme

📈 Performance Expectations

Native Database Performance (M3 Platform)

| Database | Simple Query | Aggregation | Join Query | Architecture |
|---|---|---|---|---|
| ClickHouse | 10-20 ms | 30-50 ms | 50-100 ms | ARM64 native (optimal) |
| PostgreSQL | 50-100 ms | 150-300 ms | 300-500 ms | ARM64 native (optimal) |
| StarRocks | 40-60 ms | 80-150 ms | 150-250 ms | Rosetta 2 (15-20% overhead) |
| Splunk | 200-500 ms | 500-2000 ms | 1000-5000 ms | Rosetta 2 (30-40% overhead) |

Splunk DB Connect Overhead

  • Additional Latency: +100-200 ms per query
  • Reason: Splunk search pipeline + JDBC driver + data serialization
  • Acceptable For: Ad-hoc queries, dashboards (< 100 queries/hour)
  • Not Suitable For: Real-time analytics (> 1000 queries/hour)
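The two thresholds above can be folded into a simple workload check (a sketch; the function name and the "borderline" middle band are assumptions, not project policy):

```python
def dbxquery_suitable(queries_per_hour):
    """Judge whether the dbxquery proxy path fits a workload, per the thresholds above.

    The 100-200 ms proxy overhead is tolerable for ad-hoc queries and dashboards
    (< 100 queries/hour) and unsuitable for real-time analytics (> 1000 queries/hour).
    """
    if queries_per_hour < 100:
        return "acceptable"
    if queries_per_hour > 1000:
        return "not suitable"
    return "borderline: benchmark your workload"

print(dbxquery_suitable(40))    # → acceptable
print(dbxquery_suitable(5000))  # → not suitable
```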

Iceberg Multi-Engine Performance

| Engine | Native | Iceberg | Slowdown | Notes |
|---|---|---|---|---|
| ClickHouse | 10-20 ms | 200-500 ms | 20-25x | Read-only, high overhead |
| StarRocks | 30-50 ms | 120-200 ms | 4-6x | Read-write, moderate overhead |
| Trino | — | 300-800 ms | (baseline) | Full Iceberg control |

Trade-offs:

  • Benefits: Multi-engine access, ACID transactions, schema evolution, time travel, unified governance
  • Costs: 4-25x slower than native formats, more complex infrastructure

🔄 Iceberg Use Cases

When to Use Iceberg

Good Use Cases:

  1. Data Governance: Centralized access control across multiple query engines
  2. Schema Evolution: Frequent schema changes without data rewrites
  3. Time Travel: Audit trails, debugging, regulatory compliance
  4. Multi-Engine Federation: Trino for ETL, ClickHouse for dashboards, StarRocks for analytics
  5. Open Architecture: Avoid vendor lock-in, switch engines without data migration

Not Suitable For:

  1. Real-Time Dashboards: Sub-second query latency required
  2. High-Frequency Queries: > 1000 queries/second
  3. OLTP Workloads: Transactional applications with frequent updates
  4. Single Engine: If only using one query engine, native format is faster

Decision Matrix

| Requirement | Native Format | Iceberg Format |
|---|---|---|
| Performance | ✅ Best | ❌ 4-25x slower |
| Multi-engine | ❌ Requires ETL | ✅ Native support |
| Schema evolution | ❌ Rewrite data | ✅ Metadata-only |
| Time travel | ❌ Not supported | ✅ Full support |
| ACID across engines | ❌ No | ✅ Yes |
| Setup complexity | ✅ Simple | ❌ Complex |
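The matrix collapses into a rough rule of thumb; the function below is a sketch of that reasoning (flag names and priorities are assumptions, not project policy):

```python
def recommend_format(multi_engine, needs_time_travel, needs_schema_evolution,
                     latency_sensitive):
    """Rough rule of thumb distilled from the decision matrix above.

    A sub-second latency requirement dominates; otherwise any lakehouse-style
    need (multi-engine access, time travel, cheap schema evolution) favors Iceberg.
    """
    if latency_sensitive:
        return "native"
    if multi_engine or needs_time_travel or needs_schema_evolution:
        return "iceberg"
    return "native"

print(recommend_format(multi_engine=True, needs_time_travel=False,
                       needs_schema_evolution=False, latency_sensitive=False))  # → iceberg
```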

📚 Documentation

Core Documentation

  • docs/ARCHITECTURE.md - System architecture
  • docs/ICEBERG_MULTI_ENGINE.md - Iceberg architecture guide
  • docs/SPLUNK_DBXQUERY_LIMITATIONS.md - Splunk dbxquery analysis
  • docs/TROUBLESHOOTING.md - Common issues

Benchmark Scripts

  • benchmarks/01_native_baseline.py - Native performance baseline
  • benchmarks/02_splunk_dbxquery_overhead.py - Splunk proxy overhead
  • benchmarks/03_iceberg_multi_engine.py - Iceberg multi-engine performance


🛠️ Advanced Configuration

Increase Iceberg Query Performance

-- Trino: Compact small files (reduces metadata overhead)
ALTER TABLE iceberg.cybersecurity.security_logs EXECUTE optimize;

-- Trino: Expire old snapshots (cleanup history)
ALTER TABLE iceberg.cybersecurity.security_logs EXECUTE expire_snapshots(retention_threshold => '7d');

-- Trino: Remove orphan files (cleanup unused data)
ALTER TABLE iceberg.cybersecurity.security_logs EXECUTE remove_orphan_files(older_than => '30d');

Configure Splunk DB Connect Connections

See scripts/setup_splunk_dbconnect.sh for automated configuration.

Manual Configuration:

  1. Install Splunk DB Connect app from Splunkbase
  2. Create database identities (credentials)
  3. Create database connections (PostgreSQL, ClickHouse, StarRocks)
  4. Test connections via Splunk UI

Tune Database Memory (M3)

docker-compose.m3.yml:

services:
  postgresql:
    deploy:
      resources:
        limits:
          memory: 4G    # Increase if needed

  clickhouse:
    deploy:
      resources:
        limits:
          memory: 8G    # High memory for fast queries

  trino:
    deploy:
      resources:
        limits:
          memory: 6G    # JVM heap for Iceberg metadata

🌐 AWS Deployment (Fair Comparison)

Why AWS?

Problem on M3:

  • StarRocks (Rosetta 2): 15-20% slower than native
  • Splunk (Rosetta 2): 30-40% slower than native
  • Unfair comparison between ARM64-native (ClickHouse) and Rosetta (others)

Solution:

  • Deploy on AWS with native architectures for all databases
  • ClickHouse: c7g.4xlarge (ARM64 Graviton3)
  • StarRocks: m6i.4xlarge (x86_64 Intel)
  • Splunk: m6i.4xlarge (x86_64 Intel)
  • Fair apples-to-apples comparison

Cost-Optimized AWS Deployment

Use spot instances for short-term benchmarking:

| Component | Instance | Spot Cost/Day |
|---|---|---|
| PostgreSQL | r7g.xlarge | $3.84 |
| ClickHouse | c7g.2xlarge | $5.02 |
| StarRocks | m6i.2xlarge | $6.67 |
| Splunk | m6i.2xlarge | $6.67 |
| Trino | m7g.2xlarge | $4.70 |
| MinIO | t4g.large | $0.54 |
| Hive Metastore | t3a.medium | $0.27 |
| **Total** | | **~$28/day** |

Run for 3 days: ~$85 total for comprehensive benchmarking
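The daily total is easy to sanity-check by summing the table (prices copied from above; spot prices fluctuate, so treat these as ballpark figures):

```python
# Spot prices per day from the table above (USD)
spot_cost_per_day = {
    "PostgreSQL (r7g.xlarge)": 3.84,
    "ClickHouse (c7g.2xlarge)": 5.02,
    "StarRocks (m6i.2xlarge)": 6.67,
    "Splunk (m6i.2xlarge)": 6.67,
    "Trino (m7g.2xlarge)": 4.70,
    "MinIO (t4g.large)": 0.54,
    "Hive Metastore (t3a.medium)": 0.27,
}
daily = sum(spot_cost_per_day.values())
print(round(daily, 2), round(daily * 3, 2))  # → 27.71 83.13
```

Three days of runtime comes to about $83, in line with the ~$85 estimate above once spot-price variation is allowed for.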

Terraform Deployment: Coming soon (see GitHub Issues)


⚠️ Troubleshooting

Iceberg Tables Not Visible

ClickHouse:

# Verify MinIO connectivity
curl http://localhost:9000/minio/health/live

# Check Iceberg table engine
docker exec -it benchmark-clickhouse clickhouse-client --query \
  "SHOW TABLES FROM iceberg_db"

StarRocks:

# Refresh Iceberg catalog
docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e \
  "REFRESH EXTERNAL TABLE iceberg_catalog.cybersecurity.security_logs;"

# Verify catalog
docker exec -it benchmark-starrocks-fe mysql -h127.0.0.1 -P9030 -uroot -e \
  "SHOW CATALOGS;"

Splunk DB Connect Not Working

Error: "dbxquery command not found"

Solution:

  1. Install DB Connect app from Splunkbase: https://splunkbase.splunk.com/app/2686
  2. Place .spl file in ./data/splunk/dbconnect/
  3. Restart Splunk: docker-compose -f docker-compose.m3.yml restart splunk
  4. Run setup script: bash scripts/setup_splunk_dbconnect.sh

Trino Queries Timeout

Error: "Query exceeded maximum time limit"

Solution:

# configs/trino/config.properties
query.max-execution-time=30m
query.max-memory-per-node=4GB

Out of Memory on M3

Symptom: Docker containers crash, system freezes

Solution:

  1. Increase Docker Desktop memory: Settings → Resources → Memory → 20GB (for 24GB Mac)
  2. Reduce concurrent services: Stop Splunk/Trino if not actively testing
  3. Lower database memory limits in docker-compose.m3.yml

🤝 Contributing

Contributions are welcome! Areas for improvement:

  1. Benchmark Enhancements:

    • Add more query types (window functions, CTEs, subqueries)
    • Test with larger datasets (1M, 10M, 100M rows)
    • Measure concurrent query performance
  2. Iceberg Features:

    • Time travel query benchmarks
    • Schema evolution performance
    • ACID transaction testing
  3. AWS Deployment:

    • Terraform scripts for automated deployment
    • CloudWatch monitoring dashboards
    • Cost optimization strategies
  4. Documentation:

    • Video tutorials
    • Best practices guide
    • Case studies

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • PostgreSQL, ClickHouse, StarRocks, Splunk, Trino teams for excellent database systems
  • Apache Iceberg community for open table format
  • Docker for consistent cross-platform development
  • Apple for M-series ARM64 architecture
  • AWS for Graviton processors

📞 Support


Built for the cybersecurity and database communities

Tested on: MacBook Pro M3 Pro (24GB) | macOS 15.0 Sequoia | Docker Desktop 4.26
