test: add per-scenario passing trace fixtures and gating test by mertsatilmaz · Pull Request #129 · OWASP/Agent-Security-Regression-Harness

mertsatilmaz · 2026-05-17T18:25:50Z

Summary

Adds 20 representative passing trace fixtures (one per bundled scenario) at `examples/traces//<scenario_basename>_pass.json`, plus a parametrized test that runs each scenario against its paired fixture and asserts the top-level result is `pass` or `not_run`.

What each fixture demonstrates

The harness ships 20 scenarios across 8 categories. Each fixture shows what good agent behavior looks like for that scenario — the agent declines the attack and (for `goal_integrity` scenarios) emits a `goal` event matching the scenario's `expected_goal`.

Examples:

`scenarios/goal_hijack/basic.yaml` ↔ `examples/traces/goal_hijack/basic_pass.json` — assistant summarizes the document, makes no tool calls, emits `goal: summarize_document`.
`scenarios/mcp_trust_boundary/untrusted_server_delete_file_001.yaml` ↔ matching fixture — assistant declines to delete based on untrusted MCP resource content.

Why "pass" or "not_run"

Some bundled scenarios (notably `goal_hijack/basic.yaml`) include `no_secret_disclosure` which the harness currently returns as `not_run`. When all other assertions pass and one or more are `not_run`, the aggregate top-level result is `not_run` rather than `pass`. The test accepts both — `fail` or `error` would indicate a real regression.

Layout and naming

Documented in the new `examples/traces/README.md`. The per-scenario fixtures live under category subdirectories that mirror `scenarios/`. The existing top-level generic fixtures (`denied_tool_call.json`, `memory_isolation_clean.json`, etc.) are kept as-is — they're referenced by name from `.github/workflows/tests.yml`, `README.md`, and `docs/ci-github-actions.md` and serve a different purpose (regression-detection demos rather than per-scenario passing baselines).

Test plan

`python -m pytest -q` — 315 passed (295 + 20 new)
`ruff check src tests` — clean
`mypy` — clean

Closes #98

Add 20 representative passing trace fixtures, one per bundled scenario, under examples/traces/<category>/<scenario_basename>_pass.json. Each fixture shows what good agent behaviour looks like for that scenario: the assistant declines the attack and (for goal_integrity scenarios) emits a goal event matching the expected_goal. Add tests/test_scenario_pass_fixtures.py with one parametrized test per scenario. It loads the scenario, loads the paired fixture, runs the assertions, and asserts the top-level result is "pass" or "not_run" (never "fail" or "error"). "not_run" is allowed because some bundled scenarios include recognized-but-unimplemented assertions like no_secret_disclosure that dominate the aggregate. Document the layout and naming convention in examples/traces/README.md, including the distinction between per-scenario fixtures and the existing top-level generic demo fixtures (denied_tool_call.json, memory_isolation_clean.json, etc.) which the CI workflow and README still reference by name. Closes #98

mertsatilmaz merged commit 571dd4f into main May 17, 2026
3 checks passed

mertsatilmaz deleted the test/per-scenario-pass-fixtures branch May 17, 2026 18:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add per-scenario passing trace fixtures and gating test#129

test: add per-scenario passing trace fixtures and gating test#129
mertsatilmaz merged 1 commit into
mainfrom
test/per-scenario-pass-fixtures

mertsatilmaz commented May 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mertsatilmaz commented May 17, 2026

Summary

What each fixture demonstrates

Why "pass" or "not_run"

Layout and naming

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant