Evaluation engine core refactor: leakage guard, shared rules, structured warnings (#62–#68) #136

Open
dgenio wants to merge 2 commits into main from claude/issue-triage-grouping-k2wo7j

dgenio commented Jun 24, 2026

Owner

Pull Request

Summary

Coherent refactor of the evaluation/scoring engine internals, implementing the architecture/refactor issue group (#62–#68) as one PR. These issues all touch the same core (OfflineEvaluator._score_decision and its immediate plumbing), so making the engine's hidden rules explicit, single-sourced, and testable is cleaner in one PR than as seven separate PRs editing the same function. Stdlib-only; no new runtime dependencies. Behavior is preserved (metrics are numerically unchanged); the only intentional output change is the JSON warnings shape, which moves from free-form strings to {code, severity, message} objects, with schema_version bumped to "2".

Linked issue

Fixes #62, #63, #64, #65, #66, #67, #68

What changed

Layering note: EvalWarning lives at the package top level (agent_routing_eval_lab/warnings.py) rather than under evaluation/ so the adapters package does not import evaluation — this avoids an import cycle (adapters → evaluation → evaluator → adapters).

How to verify

make install
make test          # 85 passed (17 new)
python -m agent_routing_eval_lab.cli evaluate --input examples/logged_decisions.sample.csv --format json   # schema_version "2", warnings as {code,severity,message}
python -m agent_routing_eval_lab.cli demo --output-dir /tmp/demo

Verified locally: pytest reports 85 passed (was 68); adapter-first import works (cycle resolved); evaluate/compare/gate/demo smoke-tested. The #62 leakage test was mutation-checked (re-adding oracle_tool to metadata makes it fail). No linter/type-checker is configured in this repo, so none was run (CI runs make install + make test on Python 3.10–3.14).
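
To make the new warnings shape concrete, here is a minimal sketch of how a structured warning could serialize under schema "2". The field names code, severity, and message and the schema_version value come from this PR; the dataclass layout, the to_dict helper, the render_report wrapper, and the "low_support" code string are assumptions for illustration, not the actual contents of agent_routing_eval_lab/warnings.py.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalWarning:
    """Structured warning; field names follow the PR, the layout is assumed."""

    code: str
    severity: str
    message: str

    def to_dict(self) -> dict:
        # Hypothetical helper: a plain dict that json.dumps can serialize.
        return asdict(self)


def render_report(warnings: list) -> str:
    # Hypothetical payload wrapper. Only schema_version "2" and the
    # object-shaped warnings are taken from the PR description.
    payload = {
        "schema_version": "2",
        "warnings": [w.to_dict() for w in warnings],
    }
    return json.dumps(payload, sort_keys=True)


report = render_report(
    [EvalWarning(code="low_support", severity="warning", message="only 3 decisions")]
)
print(report)
```

Consumers that previously matched on warning text can switch to the code field, which is the reason for the schema bump.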

Pull Request Checklist

  • Keep the change scoped to one contribution area when possible. — All seven issues are the evaluation-engine core (_score_decision + immediate plumbing).
  • Update docs when behavior, metrics, command output, or terminology changes. — docs/evaluation_methodology.md (utility model), docs/json-schema.md (warnings shape + low_support + schema "2").
  • Keep examples reproducible and aligned with committed reports. — Metrics are numerically unchanged; the committed reports/example_report.md was already stale on main and is intentionally left untouched (regenerating it is out of scope).
  • Run the relevant local checks and mention any skipped checks in the PR. — pytest (85 passed); no linter/type-checker configured.
  • Preserve the limitations documented in the README and methodology docs.
  • Avoid overclaiming safety, governance, or production readiness.

Honesty and claim discipline

Does this change any claim the README makes? (yes/no): no — README claims are unchanged. Internal methodology/JSON-contract docs were updated to match the refactor.

  • README location: n/a
  • Docs/report location(s): docs/evaluation_methodology.md (utility/regret formula), docs/json-schema.md (structured warnings, low_support, schema_version "2").

🤖 Generated with Claude Code

https://claude.ai/code/session_01YCm5Qcb55RqDr98EgroAgP



claude added 2 commits June 24, 2026 07:42
…red warnings

Coherent refactor of the evaluation/scoring engine internals (issues #62–#68),
making its hidden rules explicit, single-sourced, and testable.

- #62: stop passing `oracle_tool` to routers via metadata; add a leakage
  regression test so no router can ever see the ground-truth answer.
- #63: extract the unsafe-action predicate into `data/safety_rules.py`
  (`is_unsafe_action`), shared by the generator and the evaluator so their
  labels can never silently diverge.
- #64: drive "resolved-without-success" from a new `ToolSpec.resolves_without_success`
  catalog flag instead of a hardcoded tool-name set in the evaluator.
- #65: introduce `EvalWarning(code, severity, message)`; replace free-form
  warning strings and the fragile `"low support" in ...` substring check with a
  structured `PolicyMetrics.low_support` flag. JSON warnings become objects
  (schema_version bumped to "2").
- #66: inject the skdr adapter into `OfflineEvaluator` (keyword-only) so tests
  and integrations can substitute deterministic/native implementations.
- #67: name the utility-model coefficients as module-level constants and
  document the utility/regret formula in docs/evaluation_methodology.md.
- #68: add a single `rank_results()` helper with deterministic tie-breaking
  (score desc, then policy name asc), replacing four ad-hoc sort sites.
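
The tie-breaking rule in the #68 item can be sketched as a single sort key. This is an illustration under assumptions: PolicyResult and its field names are hypothetical stand-ins, and the real rank_results() signature is not shown in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyResult:
    # Hypothetical stand-in for whatever result type the evaluator ranks.
    policy_name: str
    score: float


def rank_results(results):
    """Sort by score descending, then by policy name ascending on ties."""
    # Negating the score gives descending order while the name stays ascending,
    # so equal scores always come out in the same order.
    return sorted(results, key=lambda r: (-r.score, r.policy_name))


ranked = rank_results(
    [
        PolicyResult("zeta", 0.8),
        PolicyResult("alpha", 0.8),
        PolicyResult("beta", 0.9),
    ]
)
names = [r.policy_name for r in ranked]
print(names)  # ['beta', 'alpha', 'zeta']
```

Because the key is total over (score, name), the output no longer depends on input order, which is what makes reports reproducible across runs.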

EvalWarning lives at the package top level to keep `adapters` from depending on
`evaluation` (avoids an import cycle). Behavior is preserved: metrics are
numerically unchanged; built-in routers never read `oracle_tool`.
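
A leakage regression test along the lines of #62 might look like the sketch below. It assumes an allow-list approach and a router with a route(query, metadata) method; build_router_metadata, SpyRouter, and the sample decision are hypothetical names, not the repository's actual test code.

```python
# Assumed allow-list; the PR states approval_granted is permitted and
# oracle_tool is deliberately withheld.
ALLOWED_METADATA_KEYS = frozenset({"approval_granted"})


def build_router_metadata(decision: dict) -> dict:
    """Copy only allow-listed keys so ground truth never reaches a router."""
    return {k: decision[k] for k in ALLOWED_METADATA_KEYS if k in decision}


class SpyRouter:
    """Hypothetical test double that records the metadata it receives."""

    def __init__(self):
        self.seen = []

    def route(self, query: str, metadata: dict) -> str:
        self.seen.append(dict(metadata))
        return "search_docs"  # arbitrary tool name for the sketch


def test_oracle_tool_is_withheld():
    decision = {
        "query": "reset my password",
        "approval_granted": True,
        "oracle_tool": "reset_password",  # ground truth, must not leak
    }
    router = SpyRouter()
    router.route(decision["query"], build_router_metadata(decision))
    assert all("oracle_tool" not in m for m in router.seen)
    assert router.seen == [{"approval_granted": True}]


test_oracle_tool_is_withheld()
```

An allow-list fails closed: a new ground-truth field added to the decision record stays hidden from routers unless someone explicitly permits it.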

Tests: 85 passed (17 new across safety rules, warnings, ranking, leakage,
adapter injection, and the utility model).
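
For readers unfamiliar with the injection pattern from #66, the following sketch shows a keyword-only adapter parameter with a default. Everything here is illustrative: OfflineEvaluatorSketch is a simplified stand-in for OfflineEvaluator, and the adapter's score method, DefaultAdapter, and FixedAdapter are assumptions, since the real skdr adapter interface is not shown in this PR.

```python
from typing import Optional, Protocol


class SkdrAdapter(Protocol):
    # Hypothetical interface for the sketch only.
    def score(self, query: str) -> float: ...


class DefaultAdapter:
    """Placeholder for the production adapter."""

    def score(self, query: str) -> float:
        return len(query) / 100.0


class FixedAdapter:
    """Deterministic test double that always returns the same value."""

    def __init__(self, value: float):
        self.value = value

    def score(self, query: str) -> float:
        return self.value


class OfflineEvaluatorSketch:
    # The bare * makes skdr_adapter keyword-only, so callers cannot pass it
    # positionally by accident.
    def __init__(self, decisions, *, skdr_adapter: Optional[SkdrAdapter] = None):
        self.decisions = list(decisions)
        self.skdr_adapter = skdr_adapter if skdr_adapter is not None else DefaultAdapter()

    def mean_score(self) -> float:
        scores = [self.skdr_adapter.score(d) for d in self.decisions]
        return sum(scores) / len(scores)


evaluator = OfflineEvaluatorSketch(["a", "bb"], skdr_adapter=FixedAdapter(0.25))
print(evaluator.mean_score())  # 0.25
```

Passing the adapter positionally raises TypeError, which keeps the constructor signature free to grow without breaking existing call sites.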

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01YCm5Qcb55RqDr98EgroAgP
Addresses audit findings on the #62–#68 evaluation-engine refactor:

- #62 acceptance criteria: document the router metadata contract (allowed
  keys: approval_granted; oracle_tool deliberately withheld) in the routing
  package, and add a "Decision-time information (leakage guard)" note to
  docs/evaluation_methodology.md. The leak was already closed and tested; this
  completes the issue's stated documentation criteria.
- #65 follow-up: centralize the structured-warning codes as a WarningCode
  registry in warnings.py and reference it at the emit sites (evaluator,
  skdr adapter) instead of repeating string literals. Behavior-preserving —
  the emitted code strings and JSON schema ("2") are unchanged.
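
The registry idea from the #65 follow-up, together with the low_support flag, can be sketched as below. The code strings, the MIN_SUPPORT threshold, the annotate helper, and the reduced PolicyMetrics fields are assumptions for illustration; the actual registry entries in warnings.py are not listed in this PR.

```python
from dataclasses import dataclass, field


class WarningCode:
    """Hypothetical registry: one place that names every emitted code."""

    LOW_SUPPORT = "low_support"
    ADAPTER_FALLBACK = "adapter_fallback"


@dataclass(frozen=True)
class EvalWarning:
    code: str
    severity: str
    message: str


@dataclass
class PolicyMetrics:
    # Simplified stand-in carrying only the fields this sketch needs.
    support: int
    warnings: list = field(default_factory=list)
    low_support: bool = False


MIN_SUPPORT = 5  # assumed threshold, not taken from the repository


def annotate(metrics: PolicyMetrics) -> PolicyMetrics:
    """Set the structured flag and emit a warning via the registry constant."""
    if metrics.support < MIN_SUPPORT:
        metrics.low_support = True
        metrics.warnings.append(
            EvalWarning(
                code=WarningCode.LOW_SUPPORT,
                severity="warning",
                message=f"Only {metrics.support} samples; estimates are noisy.",
            )
        )
    return metrics


m = annotate(PolicyMetrics(support=3))
print(m.low_support, m.warnings[0].code)  # True low_support
```

Note that the message above never contains the phrase "low support", yet callers can still detect the condition through the flag or the code, which is the failure mode the old substring check could not survive.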

No behavior change; 85 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01YK3TZqzLPmDk9ivpri792g
Development

Successfully merging this pull request may close these issues.

Stop passing oracle_tool to routers through metadata (evaluation-leakage guard)

2 participants