This document analyzes the query patterns in HackerBook and explains why SQLite sharding is optimal for this use case.
After analyzing all SQL queries in the codebase, we identified the following distribution:
- Point Lookups (40%) - `WHERE id=?`, `WHERE username=?`
- Top-K Sorted (30%) - `ORDER BY score/time DESC LIMIT 50`
- Aggregations (20%) - `GROUP BY`, `COUNT()`, `SUM()`
- Range Scans (10%) - `WHERE time BETWEEN`, `WHERE by=? AND time BETWEEN`
Core navigation queries that fetch specific records by primary key:
- `SELECT * FROM items WHERE id=? LIMIT 1` - Single item by ID
- `SELECT * FROM items WHERE id IN (...)` - Batch items by ID list
- `SELECT first_time FROM users WHERE username=? COLLATE NOCASE LIMIT 1` - Single user lookup
- `SELECT * FROM users WHERE username=? COLLATE NOCASE LIMIT 1` - Full user record
- `SELECT child_id FROM edges WHERE parent_id=?` - Get comment children
Frequency: Very High - Core navigation flow
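These lookups go straight to SQLite's rowid B-tree. A minimal sketch using Python's built-in `sqlite3` module (the simplified `items` schema here is an assumption for illustration, not HackerBook's real one) shows the planner choosing an index search rather than a table scan:

```python
import sqlite3

# Hypothetical, simplified items table - not the real HackerBook schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, type TEXT, title TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?, ?)",
                [(i, "story", f"Title {i}") for i in range(1000)])

# EXPLAIN QUERY PLAN confirms a B-tree SEARCH on the primary key,
# not a full-table SCAN.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM items WHERE id=? LIMIT 1", (123,)
).fetchall()
print(plan[0][3])  # detail line typically mentions "SEARCH items USING INTEGER PRIMARY KEY"
```

The same plan shape holds for `WHERE username=?` once a (case-insensitive) index exists on that column.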
Used in the Query view and user statistics pages:
- `SELECT by, COUNT(*) as stories FROM items WHERE type='story' GROUP BY by ORDER BY stories DESC LIMIT 50`
- `SELECT domain, COUNT(DISTINCT username) as users FROM user_domains GROUP BY domain ORDER BY users DESC`
- `SELECT month, SUM(count) as active FROM user_months GROUP BY month ORDER BY month DESC`
- `SELECT parent_id, COUNT(*) as cnt FROM edges WHERE parent_id IN (...) GROUP BY parent_id`
- `SELECT e.parent_id as id, i.title, COUNT(*) as comments FROM edges e JOIN items i ON i.id = e.parent_id GROUP BY e.parent_id`
Frequency: High - Query view, user stats
Used for "Me" view daily activity and time-based filters:
- `SELECT id,type,time,title,text,url,score FROM items WHERE by=? AND time BETWEEN X AND Y ORDER BY time DESC`
- `SELECT id,type,by,time,title,url,score FROM items WHERE type='story' AND time >= (SELECT MAX(time) FROM items) - 86400`
Frequency: Medium - "Me" view daily activity, relative time queries
Front page and list views with filtering and sorting:
- `SELECT id,title,by,time,score FROM items WHERE type='story' ORDER BY score DESC LIMIT 30`
- `SELECT id,title,by,time,score FROM items WHERE type='story' ORDER BY time DESC LIMIT 50`
- `SELECT username, max_score FROM users ORDER BY max_score DESC LIMIT 30`
Frequency: Very High - Front page, query templates
Used for Ask HN, Show HN, and Jobs views:
- `SELECT id,title,by,score FROM items WHERE type='story' AND title LIKE 'Ask HN:%' ORDER BY score DESC`
- `SELECT id,title,by,score FROM items WHERE type='story' AND title LIKE 'Show HN:%'`
Frequency: Medium - Ask/Show/Jobs views
User profile detail queries:
- `SELECT domain, count FROM user_domains WHERE username=? ORDER BY count DESC LIMIT 8`
- `SELECT month, count FROM user_months WHERE username=? ORDER BY count DESC LIMIT 8`
Frequency: Medium - User profile views
✅ Point lookups - Your bread and butter. Parquet would be 10-100x slower.
- `SELECT * FROM items WHERE id=123` → O(log n) with B-tree index
- Parquet: Must scan row groups, no native indexes
✅ Top-K with WHERE filters - SQLite's query optimizer shines
- `WHERE type='story' ORDER BY score DESC LIMIT 30` - Uses indexes efficiently, stops at 30 rows
- Parquet: Must scan entire column groups
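A sketch of why the `LIMIT` matters, again with Python's `sqlite3` (the `(type, score)` index is an assumption, not necessarily HackerBook's actual indexing): with a suitable index, SQLite walks the index in score order and stops after 30 rows, with no separate sort pass.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, type TEXT, score INT)")
# Assumed index for illustration. SQLite can scan it backwards in
# score order and emit the first 30 matches - no sort step needed.
con.execute("CREATE INDEX idx_type_score ON items(type, score)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, score FROM items WHERE type='story' "
    "ORDER BY score DESC LIMIT 30"
).fetchall()
details = " ".join(row[3] for row in plan)
print(details)  # index is used; no "USE TEMP B-TREE FOR ORDER BY" step appears
```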
✅ JOIN queries - Critical for comment threads
- `edges e JOIN items i ON i.id = e.parent_id` - SQLite has optimized join algorithms
- Parquet itself has no JOIN support (joins must be done in application code with an engine such as DuckDB)
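A runnable sketch of the comment-count join with Python's `sqlite3` (toy data and a minimal assumed schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE edges (parent_id INT, child_id INT);
    CREATE INDEX idx_edges_parent ON edges(parent_id);
""")
con.execute("INSERT INTO items VALUES (1, 'Show HN: demo')")
con.executemany("INSERT INTO edges VALUES (?, ?)", [(1, 2), (1, 3)])

# Same join shape as the comment-thread query above: one row per
# parent item with its comment count.
rows = con.execute("""
    SELECT e.parent_id AS id, i.title, COUNT(*) AS comments
    FROM edges e JOIN items i ON i.id = e.parent_id
    GROUP BY e.parent_id
""").fetchall()
print(rows)  # [(1, 'Show HN: demo', 2)]
```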
✅ Browser compatibility
- sqlite3-wasm: ~1MB
- DuckDB-WASM: ~10MB
🟡 Full-shard aggregations - Query view performs these
- `SELECT by, COUNT(*) as stories FROM items GROUP BY by`
- `SELECT month, SUM(count) as active FROM user_months GROUP BY month`
- Column-oriented storage is faster for these
However:
- Results are limited (`LIMIT 50`)
- Users select specific shards to query
- 15MB shard = 1-2s query in SQLite (already fast)
- Cross-shard aggregations rare (users typically query 1-3 shards)
🟡 Time-range scans on "Me" view
- `WHERE by=? AND time BETWEEN X AND Y` - Could benefit from columnar compression and selective column reads
- But: Already using cross-shard index for efficient targeting
We considered maintaining both formats:
```
/static-shards/          # SQLite (current) - for point lookups & navigation
/static-shards-parquet/  # Parquet copies - for query view only
```
Pros:
- Query view aggregations could be faster
- Parquet files smaller due to better compression
- HTTP range requests could skip unneeded columns
Cons:
- 2x storage on server (double hosting costs)
- Increased build complexity and time
- DuckDB-WASM is 10MB vs 1MB for SQLite
- Most queries still need SQLite for point lookups and JOINs
- Browser must load both SQLite and DuckDB
- Cloudflare Pages caches entire files on first range request anyway
Verdict: Stick with SQLite
HackerBook's query patterns are:
- Heavily indexed (id, type, time, username)
- Mixed workload (OLTP point lookups + light OLAP analytics)
- Small result sets (LIMIT 30-50 everywhere)
- Require JOINs (edges + items for comment threads)
- Interactive latency requirements (sub-second)
Parquet excels at:
- Full table scans (we rarely do these)
- Column-heavy analytics (SELECT few columns from millions of rows)
- Index-free workloads (we rely heavily on indexes)
- Large result sets (we always LIMIT)
- Batch processing (not interactive queries)
Current architecture is optimal:
- 15MB SQLite shards download in <1s on decent connections
- Users download once, browser caches forever
- Sub-millisecond query latency after initial load
- Content-hash filenames ensure fresh data on updates
- Cross-shard index enables efficient multi-shard targeting
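One way the content-hash naming can work, as a hypothetical helper rather than the actual build script:

```python
import hashlib

def hashed_shard_name(name: str, data: bytes, digest_len: int = 8) -> str:
    """Hypothetical sketch: insert a short content digest before the
    extension (e.g. 'shard-0001.sqlite' -> 'shard-0001.<hash>.sqlite')
    so any change to the shard's bytes yields a new URL, busting caches
    automatically while unchanged shards stay cached forever."""
    digest = hashlib.sha256(data).hexdigest()[:digest_len]
    stem, dot, ext = name.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{name}.{digest}"

print(hashed_shard_name("shard-0001.sqlite", b"shard bytes"))
```

Because the name is derived from the bytes, the files can be served with an immutable, long-lived `Cache-Control` header.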
If we add a "Data Export" feature for data scientists:
- `/static-export/hn-archive.parquet` (single 500MB+ file, all items)
- Users download it with the DuckDB CLI for local analysis
- Ideal for: "Analyze all HN posts mentioning 'AI' by year"
- Supplementary to main UI, not replacement
But for the interactive web application, SQLite + small shards + aggressive caching is the right architecture.
- Small shards (~15MB) = fast initial load
- Content-hash filenames = automatic cache busting on updates
- Browser caching = users typically only need 1-2 shards
- Efficient indexes = sub-millisecond point lookups
- sqlite3-wasm (1MB) = minimal overhead
- Proven technology = works great in browsers
- Simple deployment = static files on Cloudflare Pages
- Query patterns extracted from: `docs/*.html`
- Build process: `etl-hn.cjs`, `build-user-stats.mjs`
- Shard management: `tools/predeploy-checks.sh`