Paper 08 — Rules-as-Code vs SLMs for US Benefits Eligibility

Supplementary bundle accompanying the paper. It contains everything needed to reproduce the empirical findings: fixtures, both reference runtimes, all LLM call logs, and the scoring code.

Directory map

.
├── README.md                       # this file
├── LIMITATIONS.md                  # engineering-process register (10 entries)
├── fixtures/
│   ├── INDEX.md                    # per-fixture index (199 entries, 25 gap)
│   ├── README.md                   # fixture schema spec
│   ├── snap/         (44 fixtures,  7 gap)
│   ├── medicaid/     (40 fixtures,  4 gap)
│   ├── ssi/          (36 fixtures,  5 gap)
│   ├── va_pension/   (25 fixtures,  3 gap)
│   ├── eitc/         (29 fixtures,  4 gap)
│   └── wic/          (25 fixtures,  2 gap)
├── rust-runtime/                   # Rust reference oracle (source snapshot)
│   ├── rac_oracle.rs               # dispatcher binary
│   ├── statutory_math.rs           # per-program math (~4,500 LOC)
│   └── fpl.rs                      # FPL tables 2025 + 2026
├── runners/
│   ├── py_oracle/                  # Python cross-runtime (bit-identical to Rust on 199/199)
│   │   ├── statutory_math.py
│   │   ├── fpl.py
│   │   ├── run.py                  # fixture → decision
│   │   └── quarterly_diff.py       # 2025 vs 2026 diff
│   ├── llm/                        # schema-constrained SLM harness
│   │   ├── harness.py              # vLLM OpenAI-compat async client
│   │   ├── prompts.py              # 4 conditions × 3 variants, program-aware
│   │   ├── rules_text.py           # distilled rule summaries per program
│   │   ├── rules_text_verbatim.py  # verbatim CFR excerpts per program
│   │   ├── scoring.py              # 3-bucket scoring, kappa, bootstrap, McNemar
│   │   ├── outlines_probe.py       # backend-sensitivity cross-check
│   │   ├── models.yaml             # model endpoints
│   │   ├── HARNESS_CONTRACT.md     # pre-registered design (hash b8bc676b...)
│   │   └── requirements.txt
│   ├── complexity/                 # rule-complexity metric extractor
│   └── fixture_gen/                # template → fixture generators (per program)
├── scripts/
│   └── run-vllm.sh                 # vast.ai-side vLLM serve script
└── results/
    ├── metadata.json                         # pre-registration hash + model revisions
    ├── oracle-{snap,medicaid,ssi,va_pension,eitc,wic}.json
    ├── complexity-metrics.{csv,json}         # G3 output
    ├── quarterly-diff.json                   # G4 output (33/199 change 2025→2026)
    ├── llm-expanded-main.jsonl               # 21,492 main-pass LLM calls
    ├── llm-expanded-nojson.jsonl             # 597 secondary-pass (no guided JSON)
    ├── llm-outlines-probe.jsonl              # 199 calls, outlines backend cross-check
    ├── llm-verbatim-statute.jsonl            # 5,373 calls, verbatim-CFR statute_ctx
    └── llm-snap-scored.json                  # pre-computed scoring output

Pre-registration hash (frozen before first LLM call): b8bc676b... (full hash in results/metadata.json).

Reproducing the key findings

1. Cross-runtime agreement (199/199 bit-identical)

Python oracle only (no Rust build needed):

cd runners/py_oracle
python3 run.py --fixtures-root ../../fixtures --out ../../results/py-oracle-all.json
# Compare against the Rust oracle outputs in results/oracle-*.json;
# every field in every fixture's {eligible, net_income_monthly, grant_monthly, etc.}
# matches to the integer cent.
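
The field-by-field comparison described above could be scripted roughly as follows. This is a minimal sketch, not code from the bundle: it assumes each oracle output can be loaded into a mapping from fixture id to a flat dict of decision fields, with money already in integer cents. That layout, the function name diff_decisions, and the fixture id "snap-001" are hypothetical; check the actual structure of results/oracle-*.json before adapting it.

```python
from typing import Any

_MISSING = object()  # marks a field or fixture present on one side only


def diff_decisions(
    rust: dict[str, dict[str, Any]],
    python: dict[str, dict[str, Any]],
) -> list[tuple[str, str, Any, Any]]:
    """Return (fixture_id, field, rust_value, python_value) per mismatch.

    Assumed (hypothetical) layout: fixture id -> flat dict of decision
    fields. Values must agree in both type and value, so True vs 1 or
    100 vs 100.0 are reported as mismatches.
    """
    mismatches = []
    for fixture_id in sorted(rust.keys() | python.keys()):
        r = rust.get(fixture_id, {})
        p = python.get(fixture_id, {})
        for field in sorted(r.keys() | p.keys()):
            rv = r.get(field, _MISSING)
            pv = p.get(field, _MISSING)
            if type(rv) is not type(pv) or rv != pv:
                mismatches.append((fixture_id, field, rv, pv))
    return mismatches


# Toy data for illustration only, not taken from the bundle.
rust_out = {"snap-001": {"eligible": True, "grant_monthly": 29100}}
py_out = {"snap-001": {"eligible": True, "grant_monthly": 29100}}
print(diff_decisions(rust_out, py_out))
```

An empty list for every program corresponds to the 199/199 agreement claim; any non-empty entry names the fixture and field to inspect.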

Rust oracle: rebuilding it requires a Rust toolchain plus a source tree beyond what is bundled. The rust-runtime/*.rs files are snapshots, not a standalone crate. See rust-runtime/README.md.

2. LLM scoring tables (re-score the 21,492 captured calls)

cd runners/llm
pip install -r requirements.txt
python3 -m scoring \
  --main ../../results/llm-expanded-main.jsonl \
  --oracle-glob '../../results/oracle-*.json' \
  --fixtures-root ../../fixtures \
  --out /tmp/rescored.json
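
For readers who want to sanity-check individual statistics by hand, two of the measures listed for scoring.py (kappa and McNemar) can be sketched with the standard library alone. This is an illustrative sketch under assumptions, not the bundle's implementation: it uses Cohen's kappa on two label sequences and the exact two-sided binomial form of McNemar's test on paired correct/incorrect outcomes. Which pairs scoring.py compares, and which McNemar variant it uses, should be read from scoring.py itself. The function names here are hypothetical.

```python
from collections import Counter
from math import comb


def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use the same single label throughout
    return (observed - expected) / (1.0 - expected)


def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Exact two-sided McNemar p-value from paired correctness flags."""
    if len(correct_a) != len(correct_b):
        raise ValueError("inputs must be of equal length")
    # Only discordant pairs carry information about a difference.
    only_a = sum(x and not y for x, y in zip(correct_a, correct_b))
    only_b = sum(y and not x for x, y in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1))
    return min(1.0, 2.0 * tail / 2**n)


# Toy inputs for illustration only.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
print(mcnemar_exact([False] * 5, [True] * 5))
```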

3. Statutory-update diff (G4)

cd runners/py_oracle
python3 quarterly_diff.py --fixtures-root ../../fixtures --out /tmp/qdiff.json
# 33/199 outputs change; 3 eligibility flips, 30 numeric-only changes.
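
The split into eligibility flips and numeric-only changes can be illustrated with a small classifier. This is a sketch under assumptions rather than the logic of quarterly_diff.py: it assumes each decision is a flat dict with an eligible field, and it counts any change to eligible as a flip even when amounts change too. The function name and the toy decisions are hypothetical.

```python
from collections import Counter


def classify_change(before: dict, after: dict) -> str:
    """Label one fixture's 2025 -> 2026 change.

    Assumption: a change in `eligible` takes precedence over any
    accompanying numeric change.
    """
    if before == after:
        return "unchanged"
    if before.get("eligible") != after.get("eligible"):
        return "eligibility_flip"
    return "numeric_only"


# Toy decisions for illustration only, not taken from the bundle.
pairs = [
    ({"eligible": True, "grant_monthly": 29100}, {"eligible": True, "grant_monthly": 29800}),
    ({"eligible": True, "grant_monthly": 1200}, {"eligible": False, "grant_monthly": 0}),
    ({"eligible": False, "grant_monthly": 0}, {"eligible": False, "grant_monthly": 0}),
]
tally = Counter(classify_change(b, a) for b, a in pairs)
print(dict(tally))
```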

4. Complexity metrics (G3)

cd runners/complexity
python3 extract.py --rust-src ../../rust-runtime/statutory_math.rs \
  --out-csv ../../results/complexity-metrics.csv

Re-running the LLM experiment (optional)

Full re-run requires GPU + vLLM. See scripts/run-vllm.sh for the server side and runners/llm/HARNESS_CONTRACT.md for the frozen experimental design (4 conditions × 3 prompt variants × 3 seeds × 199 fixtures = 7,164 main calls per model; we ran 3 models for 21,492 total).
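
The main-pass call count can be checked by enumerating the design grid: 4 conditions, 3 prompt variants, 3 seeds and 199 fixtures per model, for the 3 models that were run. The sketch below only does that arithmetic; the constant names are illustrative and do not come from the harness.

```python
from itertools import product

# Grid dimensions as stated in this README; names are illustrative.
CONDITIONS = 4
PROMPT_VARIANTS = 3
SEEDS = 3
FIXTURES = 199
MODELS_RUN = 3

# One main-pass call per (condition, variant, seed, fixture) cell.
grid = product(
    range(CONDITIONS), range(PROMPT_VARIANTS), range(SEEDS), range(FIXTURES)
)
calls_per_model = sum(1 for _ in grid)
total = calls_per_model * MODELS_RUN

print(calls_per_model, total)  # 7164 21492
```

The total matches the 21,492 rows expected in results/llm-expanded-main.jsonl.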

Licensing

Fixtures (fixtures/) — CC-BY-SA 4.0
Code (runners/, rust-runtime/, scripts/) — Apache-2.0

Integrity note

Every number cited in the paper derives from a file in this bundle. The main LLM table's cells can be re-derived by re-running scoring.py on llm-expanded-main.jsonl.
