Evaluation engine core refactor: leakage guard, shared rules, structured warnings (#62–#68) by dgenio · Pull Request #136 · dgenio/agent-routing-eval-lab

dgenio · 2026-06-24T07:42:50Z

Pull Request

Summary

Coherent refactor of the evaluation/scoring engine internals, implementing the architecture/refactor issue group (#62–#68) as one PR. These issues all touch the same core — OfflineEvaluator._score_decision and its immediate plumbing — so making the engine's hidden rules explicit, single-sourced, and testable is cleaner together than as seven separate PRs editing the same function. Stdlib-only; no new runtime dependencies. Behavior is preserved (metrics are numerically unchanged); the only intentional output change is the JSON warnings shape (schema bump).

Linked issue

Fixes #62, #63, #64, #65, #66, #67, #68

What changed

- #62 (Stop passing oracle_tool to routers through metadata): evaluation-leakage guard (evaluation/evaluator.py). oracle_tool is no longer placed in the metadata handed to routers, so no router (custom or accidental) can read the ground-truth answer at decision time. Added a regression test asserting routers never receive oracle_tool.
- #63 (Consolidate the unsafe-action rules into one shared module used by both the generator and the evaluator): shared unsafe-action rule (data/safety_rules.py new; generate_synthetic_logs.py, evaluator.py). Extracted is_unsafe_action(tool, intent, requires_approval, approval_granted) as the single source of truth consumed by both the generator (ground-truth labels) and the evaluator (candidate scoring), so they can never diverge.
- #64 (Move "resolved" semantics from a hardcoded tool-name set into ToolSpec): catalog-driven "resolved" (data/schemas.py, evaluator.py). Added ToolSpec.resolves_without_success (set on support.create_task, email.draft_reply, docs.search_policy); the unresolved_request_rate metric is now driven by this flag instead of a hardcoded tool-name set buried in scoring code.
- #65 (Replace string-matched warnings with structured warning objects (code, severity, message)): structured warnings (warnings.py new; evaluator.py, metrics.py, adapters/skdr_eval_adapter.py, serialization.py). Introduced EvalWarning(code, severity, message) in place of free-form strings, and replaced the fragile "low support" in metrics.support_coverage_warning.lower() substring check with a PolicyMetrics.low_support boolean. JSON warnings are now objects, so schema_version is bumped to "2" (documented).
- #66 (Inject adapters into OfflineEvaluator instead of hard-instantiating SkdrEvalAdapter): adapter injection (evaluator.py). OfflineEvaluator(..., skdr_adapter=...) is keyword-only and defaults to SkdrEvalAdapter(), so tests and integrations can substitute deterministic/native implementations.
- #67 (Name and document the utility-model coefficients in _score_decision): named utility coefficients (evaluator.py, docs/evaluation_methodology.md). Extracted SUCCESS_REWARD/FAILURE_PENALTY/COST_WEIGHT/LATENCY_WEIGHT_PER_SECOND/UNSAFE_PENALTY constants and documented the utility/regret formula.
- #68 (Consolidate ranking/winner selection into one helper with deterministic tie-breaking): single ranking helper (evaluator.py; report.py, charts.py, serialization.py, cli.py). Added rank_results() with deterministic tie-breaking (score desc, then policy name asc), replacing four ad-hoc sorted(...) sites so the "winner" no longer depends on policy-registry insertion order.
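
The tie-breaking rule for #68 (score descending, then policy name ascending) can be illustrated with a small sketch. This is not the PR's code: the `PolicyResult` record and its field names are hypothetical, and the `rank_results` shown here is a stand-in that only shares the helper's name and ordering rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyResult:
    # Hypothetical result record; the real result type may carry more fields.
    policy_name: str
    score: float


def rank_results(results):
    """Order by score (descending), then policy name (ascending)."""
    # Negating the score sorts high scores first; the name breaks ties,
    # so the outcome does not depend on the input order.
    return sorted(results, key=lambda r: (-r.score, r.policy_name))


results = [
    PolicyResult("zeta_router", 0.80),
    PolicyResult("alpha_router", 0.80),
    PolicyResult("beta_router", 0.65),
]
ranked = rank_results(results)
winner = ranked[0]
print([r.policy_name for r in ranked])
```

Because the name is part of the sort key, two policies with equal scores always rank the same way regardless of the order in which they were registered.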

Layering note: EvalWarning lives at the package top level (agent_routing_eval_lab/warnings.py) rather than under evaluation/ so the adapters package does not import evaluation — this avoids an import cycle (adapters → evaluation → evaluator → adapters).
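
As a rough picture of the structured-warning shape, the sketch below models EvalWarning as a frozen dataclass and serializes it as a JSON object. Only the three field names, the low_support boolean, and schema_version "2" come from this PR; the code string, severity value, payload layout, and the trimmed-down PolicyMetrics are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalWarning:
    code: str
    severity: str
    message: str


@dataclass
class PolicyMetrics:
    # Trimmed-down stand-in: only the fields this sketch needs.
    policy_name: str
    low_support: bool = False


def warnings_to_json(warnings):
    # Assumed payload layout; the documented schema lives in docs/json-schema.md.
    payload = {"schema_version": "2", "warnings": [asdict(w) for w in warnings]}
    return json.dumps(payload, sort_keys=True)


# Hypothetical code and severity values.
warning = EvalWarning(
    code="low_support",
    severity="warning",
    message="Few logged decisions support this policy.",
)
metrics = PolicyMetrics("alpha_router", low_support=True)

doc = json.loads(warnings_to_json([warning]))

# Callers branch on the boolean instead of searching message text.
if metrics.low_support:
    print(doc["warnings"][0]["code"])
```

Keeping the type free of imports from evaluation or adapters is what lets both packages depend on it without forming a cycle.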

How to verify

make install
make test          # 85 passed (17 new)
python -m agent_routing_eval_lab.cli evaluate --input examples/logged_decisions.sample.csv --format json   # schema_version "2", warnings as {code,severity,message}
python -m agent_routing_eval_lab.cli demo --output-dir /tmp/demo

Verified locally: pytest → 85 passed (was 68); adapter-first import works (cycle resolved); evaluate/compare/gate/demo smoke-tested. The #62 leakage test was mutation-checked (re-adding oracle_tool to metadata makes it fail). No linter/type-checker is configured in this repo, so none was run (CI runs make install + make test on Python 3.10–3.14).
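
The leakage test and its mutation check can be pictured with the following sketch. It is an illustration under assumptions, not the test in this PR: SpyRouter, build_router_metadata, leaky_metadata, and the row layout are all hypothetical, and approval_granted is used as the only allowed metadata key, as stated in the follow-up commit message.

```python
class SpyRouter:
    """Test double that records every metadata key it is handed."""

    def __init__(self):
        self.seen_keys = set()

    def route(self, request, metadata):
        self.seen_keys.update(metadata)
        return "docs.search_policy"


def build_router_metadata(row):
    # Stand-in for the evaluator plumbing: forward only the allowed key.
    return {"approval_granted": row["approval_granted"]}


def leaky_metadata(row):
    # The mutation: re-adding oracle_tool should make the test fail.
    return {"approval_granted": row["approval_granted"], "oracle_tool": row["oracle_tool"]}


ROWS = [
    {"request": "refund status", "approval_granted": False, "oracle_tool": "docs.search_policy"},
    {"request": "reply to customer", "approval_granted": True, "oracle_tool": "email.draft_reply"},
]


def keys_seen_by_router(rows, build_metadata):
    router = SpyRouter()
    for row in rows:
        router.route(row["request"], build_metadata(row))
    return router.seen_keys


def test_router_never_receives_oracle_tool():
    assert "oracle_tool" not in keys_seen_by_router(ROWS, build_router_metadata)


test_router_never_receives_oracle_tool()
# Mutation check: the same assertion would fail against the leaky builder.
print("oracle_tool" in keys_seen_by_router(ROWS, leaky_metadata))
```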

Pull Request Checklist

Keep the change scoped to one contribution area when possible. — All seven issues are the evaluation-engine core (_score_decision + immediate plumbing).
Update docs when behavior, metrics, command output, or terminology changes. — docs/evaluation_methodology.md (utility model), docs/json-schema.md (warnings shape + low_support + schema "2").
Keep examples reproducible and aligned with committed reports. — Metrics are numerically unchanged; the committed reports/example_report.md was already stale on main and is intentionally left untouched (regenerating it is out of scope).
Run the relevant local checks and mention any skipped checks in the PR. — pytest (85 passed); no linter/type-checker configured.
Preserve the limitations documented in the README and methodology docs.
Avoid overclaiming safety, governance, or production readiness.

Honesty and claim discipline

Does this change any claim the README makes? (yes/no): no — README claims are unchanged. Internal methodology/JSON-contract docs were updated to match the refactor.

README location: n/a
Docs/report location(s): docs/evaluation_methodology.md (utility/regret formula), docs/json-schema.md (structured warnings, low_support, schema_version "2").

🤖 Generated with Claude Code

https://claude.ai/code/session_01YCm5Qcb55RqDr98EgroAgP


…red warnings

Coherent refactor of the evaluation/scoring engine internals (issues #62–#68), making its hidden rules explicit, single-sourced, and testable.

- #62: stop passing `oracle_tool` to routers via metadata; add a leakage regression test so no router can ever see the ground-truth answer.
- #63: extract the unsafe-action predicate into `data/safety_rules.py` (`is_unsafe_action`), shared by the generator and the evaluator so their labels can never silently diverge.
- #64: drive "resolved-without-success" from a new `ToolSpec.resolves_without_success` catalog flag instead of a hardcoded tool-name set in the evaluator.
- #65: introduce `EvalWarning(code, severity, message)`; replace free-form warning strings and the fragile `"low support" in ...` substring check with a structured `PolicyMetrics.low_support` flag. JSON warnings become objects (schema_version bumped to "2").
- #66: inject the skdr adapter into `OfflineEvaluator` (keyword-only) so tests and integrations can substitute deterministic/native implementations.
- #67: name the utility-model coefficients as module-level constants and document the utility/regret formula in docs/evaluation_methodology.md.
- #68: add a single `rank_results()` helper with deterministic tie-breaking (score desc, then policy name asc), replacing four ad-hoc sort sites.

EvalWarning lives at the package top level to keep `adapters` from depending on `evaluation` (avoids an import cycle).

Behavior is preserved: metrics are numerically unchanged; built-in routers never read `oracle_tool`.

Tests: 85 passed (17 new across safety rules, warnings, ranking, leakage, adapter injection, and the utility model).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01YCm5Qcb55RqDr98EgroAgP
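
The adapter injection from #66 follows a common keyword-only pattern, sketched below. The class bodies, the `evaluate` and `run` method names, and `FixedScoreAdapter` are hypothetical; only the `skdr_adapter` keyword and the `SkdrEvalAdapter()` default are taken from the description above.

```python
class SkdrEvalAdapter:
    # Placeholder: the real adapter's interface is not shown in this PR text.
    def evaluate(self, decisions):
        raise NotImplementedError("stand-in for the real skdr-eval integration")


class FixedScoreAdapter:
    """Deterministic substitute, e.g. for unit tests."""

    def __init__(self, score):
        self.score = score

    def evaluate(self, decisions):
        return {"score": self.score, "n": len(decisions)}


class OfflineEvaluator:
    def __init__(self, decisions, *, skdr_adapter=None):
        self.decisions = list(decisions)
        # Build the default per instance rather than in the signature,
        # so evaluators never share one adapter object by accident.
        self.skdr_adapter = skdr_adapter if skdr_adapter is not None else SkdrEvalAdapter()

    def run(self):
        return self.skdr_adapter.evaluate(self.decisions)


evaluator = OfflineEvaluator(
    [{"tool": "email.draft_reply"}],
    skdr_adapter=FixedScoreAdapter(0.5),
)
result = evaluator.run()
print(result)
```

The bare `*` in the signature means the adapter cannot be passed positionally, which keeps call sites explicit if more injectable collaborators are added later.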

Addresses audit findings on the #62–#68 evaluation-engine refactor:

- #62 acceptance criteria: document the router metadata contract (allowed keys: approval_granted; oracle_tool deliberately withheld) in the routing package, and add a "Decision-time information (leakage guard)" note to docs/evaluation_methodology.md. The leak was already closed and tested; this completes the issue's stated documentation criteria.
- #65 follow-up: centralize the structured-warning codes as a WarningCode registry in warnings.py and reference it at the emit sites (evaluator, skdr adapter) instead of repeating string literals. Behavior-preserving: the emitted code strings and JSON schema ("2") are unchanged.

No behavior change; 85 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01YK3TZqzLPmDk9ivpri792g
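
One way to picture a code registry of this kind is a string-valued enum that emit sites reference instead of literals. This is a sketch under assumptions: the commit does not say whether WarningCode is an enum or a class of constants, and the member names, code strings, and `make_warning` helper here are invented for illustration.

```python
from enum import Enum


class WarningCode(str, Enum):
    # Hypothetical members; the real registry's codes are not listed in this PR text.
    LOW_SUPPORT = "low_support"
    ADAPTER_FALLBACK = "adapter_fallback"


def make_warning(code, severity, message):
    # WarningCode(...) rejects unknown codes with ValueError, so a typo at an
    # emit site fails loudly instead of producing an unrecognized string.
    return {"code": WarningCode(code).value, "severity": severity, "message": message}


emitted = make_warning(WarningCode.LOW_SUPPORT, "warning", "Few logged decisions.")
print(emitted["code"])
```

Since the serialized value is still the plain string, a registry like this can be introduced without changing the emitted JSON.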

claude added 2 commits June 24, 2026 07:42

dgenio wants to merge 2 commits into main from claude/issue-triage-grouping-k2wo7j
