Supplementary bundle accompanying the paper. Everything needed to reproduce the empirical findings: fixtures, both reference runtimes, all LLM call logs, and the scoring code.
.
├── README.md # this file
├── LIMITATIONS.md # engineering-process register (10 entries)
├── fixtures/
│ ├── INDEX.md # per-fixture index (199 entries, 25 gap)
│ ├── README.md # fixture schema spec
│ ├── snap/ (44 fixtures, 7 gap)
│ ├── medicaid/ (40 fixtures, 4 gap)
│ ├── ssi/ (36 fixtures, 5 gap)
│ ├── va_pension/ (25 fixtures, 3 gap)
│ ├── eitc/ (29 fixtures, 4 gap)
│ └── wic/ (25 fixtures, 2 gap)
├── rust-runtime/ # Rust reference oracle (source snapshot)
│ ├── rac_oracle.rs # dispatcher binary
│ ├── statutory_math.rs # per-program math (~4,500 LOC)
│ └── fpl.rs # FPL tables 2025 + 2026
├── runners/
│ ├── py_oracle/ # Python cross-runtime (bit-identical to Rust on 199/199)
│ │ ├── statutory_math.py
│ │ ├── fpl.py
│ │ ├── run.py # fixture → decision
│ │ └── quarterly_diff.py # 2025 vs 2026 diff
│ ├── llm/ # schema-constrained SLM harness
│ │ ├── harness.py # vLLM OpenAI-compat async client
│ │ ├── prompts.py # 4 conditions × 3 variants, program-aware
│ │ ├── rules_text.py # distilled rule summaries per program
│ │ ├── rules_text_verbatim.py # verbatim CFR excerpts per program
│ │ ├── scoring.py # 3-bucket scoring, kappa, bootstrap, McNemar
│ │ ├── outlines_probe.py # backend-sensitivity cross-check
│ │ ├── models.yaml # model endpoints
│ │ ├── HARNESS_CONTRACT.md # pre-registered design (hash b8bc676b...)
│ │ └── requirements.txt
│ ├── complexity/ # rule-complexity metric extractor
│ └── fixture_gen/ # template → fixture generators (per program)
├── scripts/
│ └── run-vllm.sh # vast.ai-side vLLM serve script
└── results/
├── metadata.json # pre-registration hash + model revisions
├── oracle-{snap,medicaid,ssi,va_pension,eitc,wic}.json
├── complexity-metrics.{csv,json} # G3 output
├── quarterly-diff.json # G4 output (33/199 change 2025→2026)
├── llm-expanded-main.jsonl # 21,492 main-pass LLM calls
├── llm-expanded-nojson.jsonl # 597 secondary-pass (no guided JSON)
├── llm-outlines-probe.jsonl # 199 calls, outlines backend cross-check
├── llm-verbatim-statute.jsonl # 5,373 calls, verbatim-CFR statute_ctx
└── llm-snap-scored.json # pre-computed scoring output
Pre-registration hash (frozen before first LLM call): b8bc676b...
(full hash in results/metadata.json).
Python oracle only (no Rust build needed):
cd runners/py_oracle
python3 run.py --fixtures-root ../../fixtures --out ../../results/py-oracle-all.json
# Compare against the Rust oracle outputs in results/oracle-*.json;
# every field in every fixture's {eligible, net_income_monthly, grant_monthly, etc.}
# matches to the integer cent.

Rust oracle (requires Rust toolchain + source tree beyond what is bundled; the
rust-runtime/*.rs files are snapshots, not a standalone crate). See
rust-runtime/README.md.
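The cross-runtime comparison mentioned above amounts to a field-by-field diff of the two oracles' decisions. The following is a minimal sketch, not the bundle's own tooling. It assumes each oracle output has already been loaded into a mapping from fixture ID to a flat decision dict, which is a guess about the JSON layout, and that money amounts are stored as integer cents. The field names eligible, net_income_monthly, and grant_monthly come from the comment above; diff_decisions and the sample fixture ID are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Decision = Dict[str, Any]
MISSING = "<missing>"  # placeholder for a field present on only one side


def diff_decisions(
    py_out: Dict[str, Decision], rust_out: Dict[str, Decision]
) -> List[Tuple[str, str, Any, Any]]:
    """Return (fixture_id, field, python_value, rust_value) for every mismatch."""
    mismatches = []
    for fixture_id in sorted(set(py_out) | set(rust_out)):
        a = py_out.get(fixture_id, {})
        b = rust_out.get(fixture_id, {})
        for field in sorted(set(a) | set(b)):
            left, right = a.get(field, MISSING), b.get(field, MISSING)
            if left != right:
                mismatches.append((fixture_id, field, left, right))
    return mismatches


# Hypothetical sample data (amounts assumed to be integer cents).
py_out = {"snap-001": {"eligible": True, "net_income_monthly": 123456, "grant_monthly": 29100}}
rust_ok = {"snap-001": {"eligible": True, "net_income_monthly": 123456, "grant_monthly": 29100}}
rust_bad = {"snap-001": {"eligible": True, "net_income_monthly": 123456, "grant_monthly": 29200}}

print(diff_decisions(py_out, rust_ok))   # []
print(diff_decisions(py_out, rust_bad))  # [('snap-001', 'grant_monthly', 29100, 29200)]
```

An empty list corresponds to the bit-identical agreement reported for the bundled outputs; any non-empty result names the exact fixture and field to inspect.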
cd runners/llm
pip install -r requirements.txt
python3 -m scoring \
--main ../../results/llm-expanded-main.jsonl \
--oracle-glob '../../results/oracle-*.json' \
--fixtures-root ../../fixtures \
--out /tmp/rescored.json

cd runners/py_oracle
python3 quarterly_diff.py --fixtures-root ../../fixtures --out /tmp/qdiff.json
# 33/199 outputs change; 3 eligibility flips, 30 numeric-only changes.

cd runners/complexity
python3 extract.py --rust-src ../../rust-runtime/statutory_math.rs \
--out-csv ../../results/complexity-metrics.csv

Full re-run requires GPU + vLLM. See scripts/run-vllm.sh for the server side
and runners/llm/HARNESS_CONTRACT.md for the frozen experimental design
(4 conditions × 3 prompt variants × 3 seeds × 199 fixtures = 7,164 main calls
per model; the design names 4 models, of which we ran 3, for 21,492 total).
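As a quick sanity check on the design size, the factor counts can be multiplied out. The sketch below uses only numbers stated in this README (the variable names are illustrative): the per-program fixture counts from the directory tree sum to 199, and the per-model product times the three models run matches the 21,492 main-pass calls logged in results/llm-expanded-main.jsonl.

```python
# Fixture counts per program, copied from the directory tree in this README.
fixtures_per_program = {
    "snap": 44,
    "medicaid": 40,
    "ssi": 36,
    "va_pension": 25,
    "eitc": 29,
    "wic": 25,
}
fixtures = sum(fixtures_per_program.values())

conditions = 4
prompt_variants = 3
seeds = 3
models_run = 3

calls_per_model = conditions * prompt_variants * seeds * fixtures
total_main_calls = calls_per_model * models_run

print(fixtures)          # 199
print(calls_per_model)   # 7164
print(total_main_calls)  # 21492
```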
- Fixtures (fixtures/): CC-BY-SA 4.0
- Code (runners/, rust-runtime/, scripts/): Apache-2.0
Every number cited in the paper derives from a file in this bundle. The main
LLM table's cells can be re-derived by re-running scoring.py on
llm-expanded-main.jsonl.
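For readers who want to check a paired comparison by hand without running the harness, the following is a minimal sketch of an exact two-sided McNemar test on discordant-pair counts. It is not taken from scoring.py, whose implementation may differ (for example, it may use a chi-square approximation or a continuity correction). The function name mcnemar_exact and the counts in the usage lines are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: pairs where only condition A is correct.
    c: pairs where only condition B is correct.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one (one tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(mcnemar_exact(1, 9))  # 0.021484375
print(mcnemar_exact(5, 5))  # 1.0
```

Only the discordant pairs enter the test; fixtures on which both conditions are right or both are wrong do not affect the p-value.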