A multi-layer LLM pipeline that answers business questions against a relational database by writing SQL, verifying its own answers, and reporting a calibrated confidence label. Built on the Olist Brazilian e-commerce dataset (9 tables, ~100K orders).
Generating SQL with an LLM is the easy part. Making the answer trustworthy — knowing when to be confident, when to hedge, and when to refuse — is what this project works on.
Given a natural-language question, the system routes it through six layers — planner, SQL generator, sanity checker, reconciler, confidence deriver, presenter — and emits a four-section analyst report labeled HIGH, MEDIUM, LOW, or UNABLE. The label is derived deterministically from structured signals (did the SQL execute, did sanity rules pass, did two independent SQL paths agree); it is not an LLM grading another LLM.
$ python -m main "What was Q3 2017 revenue?"
Confidence: HIGH
Reconciliation: items+freight vs payments agreed within 0.02%
Sanity: all rules passed
Cost: $0.16 Wall time: 28.4s
Q3 2017 revenue was approximately $1.90M. Two independent calculations
(items+freight sum vs payment-value sum) reconciled to within 0.02% on
the same filter ruleset, and both passed null-rate and rate-range
sanity checks.
For questions the dataset cannot support (e.g., "What is customer acquisition cost?" — no marketing spend data), the planner refuses and returns UNABLE rather than fabricating a proxy answer.
Six layers, each with a narrow responsibility:
| Layer | Model | Responsibility |
|---|---|---|
| 2. Planner | Sonnet 4.6 | Decompose the question into one or more sub-questions. Decide answerability. Fan out cross-validatable metrics into sibling SQL paths. |
| 3. SQL Generator | Sonnet 4.6 | Per sub-question, generate SQL, execute it, return a DataFrame. Retries on execution errors with structured error context. |
| 4. Sanity | Haiku 4.5 + rules | Deterministic DataFrame rules (negative monetary, null-rate, rate-range, etc.) plus Haiku judgment on result shape. A failed sanity check triggers one regenerate+re-execute up to MAX_RETRIES. |
| 5. Reconciliation | Haiku 4.5 + rules | When the planner fanned out, compare sibling results deterministically (symmetric percentage delta on scalars; outer-merged per-row delta on grouped results). Haiku writes a one-sentence explanation, never decides pass/fail. |
| 6. Confidence | No LLM | Derive HIGH / MEDIUM / LOW / UNABLE from the preceding layers' structured outputs. Pure function. |
| 6. Presenter | Sonnet 4.6 | Narrate a four-section report (answer, method, caveats, sources) grounded in the full audit trail. |
Retries live at the generator level only. Reconciliation failures lower confidence rather than trigger a retry — the system is honest about disagreement instead of hiding it.
Latest N=3 run across 15 curated questions (5 straightforward, 7 tricky, 3 unanswerable):
- 45 / 45 trial label hits against expected confidence bands.
- 0 crashes across 45 trials.
- ~$1.60 per full N=3 run ($0.036 per trial average).
- 15 minutes wall time for the full run.
- $0.17 to re-run a single question in isolation (for focused debugging).
Progression over the last ~10 commits on this branch:
| Commit | Scoreboard | Notable change |
|---|---|---|
| Baseline (pre-Q8 fix) | 40 / 45 | Q8 pinned to outdated [MEDIUM, LOW] expectation. |
7f9ec90 |
43 / 45 | Q8 expected updated to reflect generator improvement (uses customer_unique_id correctly). |
254de18 |
43 / 45 | Q12 reconciler gains a cross-sibling common-column heuristic; planner prompt adds grain-alignment rules. Q12 trial 3 lands HIGH. |
c1f2b6d |
45 / 45 | Q12 expected admits LOW — the reconciler correctly flags a genuine semantic ambiguity (per-payment_type items+freight has no natural decomposition). |
Every commit message carries the N=3 evidence that justified it. The eval harness is the system's own observability.
- Designing multi-stage LLM pipelines with narrow per-layer responsibilities and structured hand-offs between them.
- Deterministic verification of LLM output. The confidence layer derives its label from typed signals — SQL success, sanity rule outcomes, reconciliation delta, planner answerability — rather than asking another LLM "are you sure?".
- Cross-validation via planner fan-out. When a metric has multiple legitimate definitions (revenue-from-items-and-freight vs revenue-from-payments), the planner emits both as sibling sub-questions and the reconciler compares them. Disagreement surfaces as lower confidence, not as a wrong answer.
- Building eval harnesses for LLM systems. N-trial runner with per-trial error isolation, cost tracking, wall-time measurement, drift analysis over SQL text, and
--n-trials/--question-idsflags for cheap focused debugging. - Handling real failure modes. SQL execution errors with structured retry context, metric-definition ambiguity, key-alignment failures in multi-row reconciliation, generator drift on
customer_idvscustomer_unique_id, shape-mismatch in cross-validation. - Measure-first engineering discipline. Every behavioral change lands with a commit that carries the N=3 scoreboard before and after. Fixes that sound good but don't move the numbers don't ship.
Deterministic confidence, not LLM-as-judge. Asking an LLM to rate its own answer correlates with output length and fluency more than with correctness. The confidence layer here is a pure function of the structured signals already captured by earlier layers. It can be unit-tested; the thresholds can be tuned; the label is reproducible.
Reconciler operates on DataFrames, not text. Sibling results are compared by symmetric percentage delta on scalars and by outer-merged per-row delta on grouped results. Key-alignment failures are severity="high" on their own, independent of delta magnitude. The LLM's only role is writing the one-sentence narration; it cannot override the deterministic verdict.
Planner skip taxonomy. Not every plan needs reconciliation. The planner distinguishes single_sub_question (legitimately one SQL path), complementary_views (two sub-questions that slice one population differently — not a cross-validation), and insufficient_siblings (planner wanted reconciliation but an upstream sub-question failed). Each category maps to a specific confidence treatment downstream.
Adversarial test coverage for the reconciler. 14 rule-based cases in eval/test_reconciliation.py covering key misalignment, shape mismatch, ambiguous inference, and the recent cross-sibling common-column heuristic — all running without an API call so they're free to run in CI.
Honest refusal on unanswerable questions. Planner returns UNABLE with a reason for questions the dataset cannot support (CAC without marketing spend, profit margin without COGS, future revenue). The pipeline short-circuits — no SQL is generated, no confidence is fabricated.
A system like this breaks in characteristic ways, and most of the engineering work in this repo is about recognizing each failure mode and deciding how to respond to it honestly. Four of them are worth calling out.
Metric ambiguity without a natural decomposition. A canonical question like "revenue by payment method, and does it differ by calculation approach?" fans out into two sibling SQL paths — items-and-freight sum vs. payment-value sum. At the aggregate level these two definitions agree to within 0.03%. When sliced per payment_type, they can disagree by 33% or more, because the items-and-freight definition has no natural per-payment_type decomposition: attributing the full order revenue to every payment method used on that order double-counts multi-method orders. The generator picked exactly this pathology. The reconciler flagged it deterministically, the confidence layer returned LOW, and the report surfaces the disagreement rather than picking a winner. The right fix here is not to suppress the divergence; it's to admit that LOW is the honest label when two valid definitions genuinely disagree, and document the architectural follow-up (proportional allocation, or a planner rule against fanning out at incompatible grains) as future work.
Generator drift on look-alike keys. The Olist dataset has two customer identifiers that are one letter apart and mean very different things: customer_id (per-order identity, 99,441 distinct) and customer_unique_id (person-level identity, 96,096 distinct). Early in the project the SQL generator occasionally used the wrong one for person-level aggregations, which inflates repeat-purchase rates by several percentage points. The evaluation harness pinned Q8's expected confidence to [MEDIUM, LOW] as a guardrail against this — a honest admission that the system couldn't be trusted on that question yet. Over time the generator stabilized to the correct key, and the expected label was promoted to [HIGH, MEDIUM] to reflect real behavior. Expected outcomes in an LLM eval are a policy choice, not ground truth; the useful thing is keeping them honest, and having a deferred ticket for a sanity rule ready for the day the regression returns.
Shape mismatch in cross-validation. When the planner fanned out a metric, the two resulting SQL paths sometimes returned incompatible result shapes: one grouped by payment_type with a single metric column, the other a scalar grand total; or one with two numeric columns (a metric plus an incidental aid column like order_count) and the other with just the metric. The reconciler can't compare row-by-row when the shapes don't align, so every such trial landed LOW for a mechanical reason rather than a semantic one. Two complementary fixes: a planner-prompt rule requiring reconciliation sub-questions to share grain and filters, and a deterministic reconciler heuristic that picks a common numeric column name across siblings when one of them has extra columns. The planner rule alone wasn't enough — prompt guidance to the planner doesn't reliably reach the downstream SQL generator — which is why the deterministic fallback in the reconciler mattered. That's the general lesson: when behavior has to cross prompt boundaries, pair probabilistic guidance with a deterministic check.
Honest refusal over fluent fabrication. Three of the 15 evaluation questions are deliberately unanswerable from the Olist data alone: customer acquisition cost (no marketing spend), profit margin (no COGS), next-quarter revenue (no data past 2018). The temptation with LLM systems is to always produce something — a proxy answer, an extrapolation, a best-guess caveat. The planner here refuses cleanly with UNABLE and a reason, and the pipeline short-circuits before any SQL runs. All nine trials on those three questions landed UNABLE in the final N=3 run. The discipline is producing nothing when nothing is true, and making the refusal a first-class output of the system rather than a failure of it.
Prerequisites:
- Python 3.10+
- PostgreSQL with the Olist dataset loaded (see
db/load_olist.py) ANTHROPIC_API_KEYin.env
pip install -r requirements.txt
python db/load_olist.py # one-time: loads Olist CSVs into Postgres
python -m main "What was Q3 2017 revenue?"Full eval harness:
python -m eval.runner # N=3 across all 15 questions (~$1.60, 15 min)
python -m eval.runner --n-trials 1 # fast smoke, ~$0.55
python -m eval.runner --n-trials 1 --question-ids 12 # single-question debug, ~$0.17Results are written to eval/results/<timestamp>.json with per-trial SQL text, sanity outcomes, reconciliation deltas, confidence labels, and per-layer token usage.
agents/ Layer 2, 3, 6 (planner, sql_generator, presenter)
scrutiny/ Layer 4, 5, 6 (sanity, reconciliation, confidence)
prompts/ System prompts for each LLM-backed layer
db/ Postgres connection + Olist loader
eval/
questions.jsonl 15 curated questions with expected confidence bands
runner.py N-trial harness
results/ Timestamped N=3 JSON blobs
test_*.py Adversarial rule-based tests for each layer
main.py End-to-end orchestrator + CLI entry point
- Olist only. The system has been developed and measured against one dataset. The claim of domain reuse is untested until Phase 2 runs it on a genuinely different schema.
- CLI only. No UI has been built. A Streamlit progressive-render UI is planned but not part of Phase 1.
- One sanity rule deferred. A sanity check for
customer_idvscustomer_unique_iddrift is in the backlog (ticket #46), deliberately unimplemented until a generator regression gives it something real to catch. - Per-payment_type items+freight is semantically ambiguous. Q12's reconciler correctly flags that
revenue_from_items_and_freighthas no natural decomposition by payment method; a deeper architectural fix (planner rule for grain-compatible fan-out, or SQL rule for proportional allocation) is documented but deferred.
Dataset: Olist Brazilian E-Commerce (public, CC BY-NC-SA). Built with Anthropic's Claude Sonnet 4.6 and Haiku 4.5.