jcatster7/benefits-as-code

Paper 08 — Rules-as-Code vs SLMs for US Benefits Eligibility

This supplementary bundle accompanies the paper and contains everything needed to reproduce its empirical findings: fixtures, both reference runtimes, all LLM call logs, and the scoring code.

Directory map

.
├── README.md                       # this file
├── LIMITATIONS.md                  # engineering-process register (10 entries)
├── fixtures/
│   ├── INDEX.md                    # per-fixture index (199 entries, 25 gap)
│   ├── README.md                   # fixture schema spec
│   ├── snap/         (44 fixtures,  7 gap)
│   ├── medicaid/     (40 fixtures,  4 gap)
│   ├── ssi/          (36 fixtures,  5 gap)
│   ├── va_pension/   (25 fixtures,  3 gap)
│   ├── eitc/         (29 fixtures,  4 gap)
│   └── wic/          (25 fixtures,  2 gap)
├── rust-runtime/                   # Rust reference oracle (source snapshot)
│   ├── rac_oracle.rs               # dispatcher binary
│   ├── statutory_math.rs           # per-program math (~4,500 LOC)
│   └── fpl.rs                      # FPL tables 2025 + 2026
├── runners/
│   ├── py_oracle/                  # Python cross-runtime (bit-identical to Rust on 199/199)
│   │   ├── statutory_math.py
│   │   ├── fpl.py
│   │   ├── run.py                  # fixture → decision
│   │   └── quarterly_diff.py       # 2025 vs 2026 diff
│   ├── llm/                        # schema-constrained SLM harness
│   │   ├── harness.py              # vLLM OpenAI-compat async client
│   │   ├── prompts.py              # 4 conditions × 3 variants, program-aware
│   │   ├── rules_text.py           # distilled rule summaries per program
│   │   ├── rules_text_verbatim.py  # verbatim CFR excerpts per program
│   │   ├── scoring.py              # 3-bucket scoring, kappa, bootstrap, McNemar
│   │   ├── outlines_probe.py       # backend-sensitivity cross-check
│   │   ├── models.yaml             # model endpoints
│   │   ├── HARNESS_CONTRACT.md     # pre-registered design (hash b8bc676b...)
│   │   └── requirements.txt
│   ├── complexity/                 # rule-complexity metric extractor
│   └── fixture_gen/                # template → fixture generators (per program)
├── scripts/
│   └── run-vllm.sh                 # vast.ai-side vLLM serve script
└── results/
    ├── metadata.json                         # pre-registration hash + model revisions
    ├── oracle-{snap,medicaid,ssi,va_pension,eitc,wic}.json
    ├── complexity-metrics.{csv,json}         # G3 output
    ├── quarterly-diff.json                   # G4 output (33/199 change 2025→2026)
    ├── llm-expanded-main.jsonl               # 21,492 main-pass LLM calls
    ├── llm-expanded-nojson.jsonl             # 597 secondary-pass (no guided JSON)
    ├── llm-outlines-probe.jsonl              # 199 calls, outlines backend cross-check
    ├── llm-verbatim-statute.jsonl            # 5,373 calls, verbatim-CFR statute_ctx
    └── llm-snap-scored.json                  # pre-computed scoring output

Pre-registration hash (frozen before first LLM call): b8bc676b... (full hash in results/metadata.json).

Reproducing the key findings

1. Cross-runtime agreement (199/199 bit-identical)

Python oracle only (no Rust build needed):

cd runners/py_oracle
python3 run.py --fixtures-root ../../fixtures --out ../../results/py-oracle-all.json
# Compare against the Rust oracle outputs in results/oracle-*.json:
# every field of every fixture's decision (eligible, net_income_monthly,
# grant_monthly, etc.) matches to the integer cent.
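
The comparison step can be sketched as a field-by-field diff. This is a minimal sketch, not the bundle's own tooling: it assumes each oracle output has already been loaded into a dict mapping a fixture id to that fixture's decision dict, which may not match the actual layout of results/oracle-*.json. The function name `diff_decisions` and the fixture id in the usage note are hypothetical.

```python
from typing import Any

_MISSING = "<missing>"  # marker for a field present in one runtime only


def diff_decisions(
    rust: dict[str, dict[str, Any]],
    py: dict[str, dict[str, Any]],
) -> list[tuple[str, str, Any, Any]]:
    """Return (fixture_id, field, rust_value, py_value) for every mismatch.

    ASSUMPTION: both arguments map fixture id -> decision dict. Adapt the
    loading code to the real JSON layout before calling this.
    """
    mismatches = []
    for fixture_id in sorted(set(rust) | set(py)):
        r = rust.get(fixture_id, {})
        p = py.get(fixture_id, {})
        for field in sorted(set(r) | set(p)):
            rv = r.get(field, _MISSING)
            pv = p.get(field, _MISSING)
            # Strict comparison: 100 (int) and 100.0 (float) count as different.
            if type(rv) is not type(pv) or rv != pv:
                mismatches.append((fixture_id, field, rv, pv))
    return mismatches
```

An empty return value means the two runtimes agree on every field. For example, `diff_decisions({"snap-001": {"eligible": True}}, {"snap-001": {"eligible": True}})` returns `[]`.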

Rust oracle: requires a Rust toolchain plus source tree beyond what is bundled (the rust-runtime/*.rs files are snapshots, not a standalone crate). See rust-runtime/README.md.

2. LLM scoring tables (re-score the 21,492 captured calls)

cd runners/llm
pip install -r requirements.txt
python3 -m scoring \
  --main ../../results/llm-expanded-main.jsonl \
  --oracle-glob '../../results/oracle-*.json' \
  --fixtures-root ../../fixtures \
  --out /tmp/rescored.json
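
The directory map lists kappa and McNemar among the statistics scoring.py computes. For readers unfamiliar with them, here is a minimal sketch of both. It is not the bundle's implementation: the choice of Cohen's kappa and of the exact (binomial) McNemar variant are assumptions, since the README does not say which variants scoring.py uses, and the function names are illustrative.

```python
from collections import Counter
from math import comb


def cohens_kappa(a: list, b: list) -> float:
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from the two marginal label distributions.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both sequences use one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b = pairs where only the first system is correct,
    c = pairs where only the second system is correct.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

Both functions use only the standard library, so they can be run against any paired labels extracted from the call logs without installing the harness requirements.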

3. Statutory-update diff (G4)

cd runners/py_oracle
python3 quarterly_diff.py --fixtures-root ../../fixtures --out /tmp/qdiff.json
# 33/199 outputs change; 3 eligibility flips, 30 numeric-only changes.
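
The split between eligibility flips and numeric-only changes can be sketched as follows. This is an illustration under assumptions, not the logic of quarterly_diff.py: it assumes each fixture's decision is a dict with an `eligible` field (the field name appears in the oracle outputs described above), and the function names and category labels are hypothetical.

```python
from collections import Counter


def classify_change(before: dict, after: dict) -> str:
    """Label how one fixture's decision differs between two rule years."""
    if before == after:
        return "unchanged"
    if before.get("eligible") != after.get("eligible"):
        return "eligibility_flip"
    # Same eligibility outcome, but at least one other field differs.
    return "numeric_only"


def tally_changes(decisions_2025: dict, decisions_2026: dict) -> Counter:
    """Count change categories across fixtures.

    ASSUMPTION: both arguments map fixture id -> decision dict and cover
    the same set of fixture ids.
    """
    counts = Counter()
    for fixture_id, before in decisions_2025.items():
        counts[classify_change(before, decisions_2026[fixture_id])] += 1
    return counts
```

Run over all 199 fixtures, a tally of this kind would be expected to reproduce the counts quoted above if the real diff uses the same definitions.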

4. Complexity metrics (G3)

cd runners/complexity
python3 extract.py --rust-src ../../rust-runtime/statutory_math.rs \
  --out-csv ../../results/complexity-metrics.csv

Re-running the LLM experiment (optional)

A full re-run requires a GPU and vLLM. See scripts/run-vllm.sh for the server side and runners/llm/HARNESS_CONTRACT.md for the frozen experimental design (4 conditions × 3 prompt variants × 3 seeds × 199 fixtures = 7,164 main calls per model; we ran 3 models for 21,492 total).
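
The main-pass call count can be checked directly from the design factors. The snippet below is plain arithmetic on the numbers stated in this README; the variable names are illustrative.

```python
# Design factors as stated in this README.
conditions = 4
prompt_variants = 3
seeds = 3
fixtures = 199
models_run = 3

calls_per_model = conditions * prompt_variants * seeds * fixtures
total_main_calls = calls_per_model * models_run

print(calls_per_model)   # 7164
print(total_main_calls)  # 21492
```

The total matches the 21,492 rows expected in results/llm-expanded-main.jsonl.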

Licensing

  • Fixtures (fixtures/) — CC-BY-SA 4.0
  • Code (runners/, rust-runtime/, scripts/) — Apache-2.0

Integrity note

Every number cited in the paper derives from a file in this bundle. The main LLM table's cells can be re-derived by re-running scoring.py on llm-expanded-main.jsonl.

About

Rules-as-Code test vectors and reference runtimes for US federal benefits eligibility (SNAP, Medicaid, SSI, VA Pension, EITC, WIC). Companion artifact to the EAAMO 2026 submission.
