Evaluation engine core refactor: leakage guard, shared rules, structured warnings (#62–#68)#136
Open
dgenio wants to merge 2 commits into
Open
Evaluation engine core refactor: leakage guard, shared rules, structured warnings (#62–#68)#136dgenio wants to merge 2 commits into
dgenio wants to merge 2 commits into
Conversation
…red warnings Coherent refactor of the evaluation/scoring engine internals (issues #62–#68), making its hidden rules explicit, single-sourced, and testable. - #62: stop passing `oracle_tool` to routers via metadata; add a leakage regression test so no router can ever see the ground-truth answer. - #63: extract the unsafe-action predicate into `data/safety_rules.py` (`is_unsafe_action`), shared by the generator and the evaluator so their labels can never silently diverge. - #64: drive "resolved-without-success" from a new `ToolSpec.resolves_without_success` catalog flag instead of a hardcoded tool-name set in the evaluator. - #65: introduce `EvalWarning(code, severity, message)`; replace free-form warning strings and the fragile `"low support" in ...` substring check with a structured `PolicyMetrics.low_support` flag. JSON warnings become objects (schema_version bumped to "2"). - #66: inject the skdr adapter into `OfflineEvaluator` (keyword-only) so tests and integrations can substitute deterministic/native implementations. - #67: name the utility-model coefficients as module-level constants and document the utility/regret formula in docs/evaluation_methodology.md. - #68: add a single `rank_results()` helper with deterministic tie-breaking (score desc, then policy name asc), replacing four ad-hoc sort sites. EvalWarning lives at the package top level to keep `adapters` from depending on `evaluation` (avoids an import cycle). Behavior is preserved: metrics are numerically unchanged; built-in routers never read `oracle_tool`. Tests: 85 passed (17 new across safety rules, warnings, ranking, leakage, adapter injection, and the utility model). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01YCm5Qcb55RqDr98EgroAgP
Addresses audit findings on the #62–#68 evaluation-engine refactor: - #62 acceptance criteria: document the router metadata contract (allowed keys: approval_granted; oracle_tool deliberately withheld) in the routing package, and add a "Decision-time information (leakage guard)" note to docs/evaluation_methodology.md. The leak was already closed and tested; this completes the issue's stated documentation criteria. - #65 follow-up: centralize the structured-warning codes as a WarningCode registry in warnings.py and reference it at the emit sites (evaluator, skdr adapter) instead of repeating string literals. Behavior-preserving — the emitted code strings and JSON schema ("2") are unchanged. No behavior change; 85 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01YK3TZqzLPmDk9ivpri792g
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request
Summary
Coherent refactor of the evaluation/scoring engine internals, implementing the architecture/refactor issue group (#62–#68) as one PR. These issues all touch the same core —
OfflineEvaluator._score_decisionand its immediate plumbing — so making the engine's hidden rules explicit, single-sourced, and testable is cleaner together than as seven separate PRs editing the same function. Stdlib-only; no new runtime dependencies. Behavior is preserved (metrics are numerically unchanged); the only intentional output change is the JSON warnings shape (schema bump).Linked issue
Fixes #62, #63, #64, #65, #66, #67, #68
What changed
oracle_toolto routers through metadata (evaluation-leakage guard) #62 — evaluation-leakage guard (evaluation/evaluator.py):oracle_toolis no longer placed in the metadata handed to routers, so no router (custom or accidental) can read the ground-truth answer at decision time. Added a regression test asserting routers never receiveoracle_tool.data/safety_rules.pynew;generate_synthetic_logs.py,evaluator.py): extractedis_unsafe_action(tool, intent, requires_approval, approval_granted)as the single source of truth consumed by both the generator (ground-truth labels) and the evaluator (candidate scoring), so they can never diverge.ToolSpec#64 — catalog-driven "resolved" (data/schemas.py,evaluator.py): addedToolSpec.resolves_without_success(set onsupport.create_task,email.draft_reply,docs.search_policy) and drive theunresolved_request_ratemetric from it instead of a hardcoded tool-name set buried in scoring code.warnings.pynew;evaluator.py,metrics.py,adapters/skdr_eval_adapter.py,serialization.py): introducedEvalWarning(code, severity, message), replacing free-form strings and the fragile"low support" in metrics.support_coverage_warning.lower()substring check with aPolicyMetrics.low_supportboolean. JSONwarningsare now objects →schema_versionbumped to"2"(documented).OfflineEvaluatorinstead of hard-instantiatingSkdrEvalAdapter#66 — adapter injection (evaluator.py):OfflineEvaluator(..., skdr_adapter=...)(keyword-only, defaults toSkdrEvalAdapter()) so tests and integrations can substitute deterministic/native implementations._score_decision#67 — named utility coefficients (evaluator.py,docs/evaluation_methodology.md): extractedSUCCESS_REWARD/FAILURE_PENALTY/COST_WEIGHT/LATENCY_WEIGHT_PER_SECOND/UNSAFE_PENALTYconstants and documented the utility/regret formula.evaluator.py;report.py,charts.py,serialization.py,cli.py): addedrank_results()with deterministic tie-breaking (score desc, then policy name asc), replacing four ad-hocsorted(...)sites so the "winner" no longer depends on policy-registry insertion order.Layering note:
EvalWarninglives at the package top level (agent_routing_eval_lab/warnings.py) rather than underevaluation/so theadapterspackage does not importevaluation— this avoids an import cycle (adapters → evaluation → evaluator → adapters).How to verify
Verified locally:
pytest→ 85 passed (was 68); adapter-first import works (cycle resolved);evaluate/compare/gate/demosmoke-tested. The #62 leakage test was mutation-checked (re-addingoracle_toolto metadata makes it fail). No linter/type-checker is configured in this repo, so none was run (CI runsmake install+make teston Python 3.10–3.14).Pull Request Checklist
_score_decision+ immediate plumbing).docs/evaluation_methodology.md(utility model),docs/json-schema.md(warnings shape +low_support+ schema "2").reports/example_report.mdwas already stale onmainand is intentionally left untouched (regenerating it is out of scope).pytest(85 passed); no linter/type-checker configured.Honesty and claim discipline
Does this change any claim the README makes? (yes/no): no — README claims are unchanged. Internal methodology/JSON-contract docs were updated to match the refactor.
docs/evaluation_methodology.md(utility/regret formula),docs/json-schema.md(structured warnings,low_support,schema_version"2").🤖 Generated with Claude Code
https://claude.ai/code/session_01YCm5Qcb55RqDr98EgroAgP
Generated by Claude Code