Merged
18 changes: 9 additions & 9 deletions README.md
@@ -134,7 +134,7 @@ ScopeTrail permission drift: CRITICAL

## How it works

ScopeTrail is **local-only**. It reads the checked-out repository, materializes the two git refs into temp directories, runs detectors over them, and prints the result. It uploads nothing, calls no external services, and has no required API keys.
ScopeTrail is **local-only**. It reads the checked-out repository, materializes the two git refs into temp directories, runs detectors over them, and prints the result. The scanner uploads nothing: no repository contents, findings, or telemetry leave your machine, and it needs no API keys. (The GitHub Action's setup step installs ScopeTrail's one runtime dependency with `npm ci` from the npm registry; the analysis itself makes no network calls.)

The detectors cover the surfaces an AI agent can actually escalate through:

@@ -146,20 +146,20 @@ Findings carry a `severity` (`low` / `medium` / `high` / `critical`) and the rep

## How well it catches it

ScopeTrail ships a labeled precision/recall benchmark over **28 fixture PRs** (22 with planted drift, 6 benign) spanning **19 detector kinds**. Each fixture is an `old/`+`new/` config snapshot pair; ground truth is fixed by fixture design and the harness diffs the pair and scores the drift engine against it. Reproduce with `npm run build && node benchmark/run-benchmark.mjs`.
ScopeTrail ships a labeled precision/recall benchmark over **35 fixture PRs** (27 with planted drift, 8 benign) spanning **21 detector kinds**. Each fixture is an `old/`+`new/` config snapshot pair; ground truth is fixed by fixture design and the harness diffs the pair and scores the drift engine against it. Reproduce with `npm run build && node benchmark/run-benchmark.mjs`. These figures score the engine against its own labeled fixtures: they guard against regressions and are not a claim of real-world accuracy across every config a PR might contain.

| Metric | Result |
| --- | --- |
| Detection (any finding) — recall | **100%** (22/22 rogue PRs flagged) |
| Detection — false-positive rate | **0%** (0/6 benign PRs flagged) |
| Detection (any finding) — recall | **100%** (27/27 rogue PRs flagged) |
| Detection — false-positive rate | **0%** (0/8 benign PRs flagged) |
| Detection — precision | **100%** |
| Correct primary finding kind | **22/22** rogue PRs |
| All expected finding kinds | **22/22** rogue PRs |
| Exact consolidated rating | **28/28** PRs |
| Correct primary finding kind | **27/27** rogue PRs |
| All expected finding kinds | **27/27** rogue PRs |
| Exact consolidated rating | **35/35** PRs |

The 6 benign cases include five engineered **false-positive traps** — narrowly-scoped Claude grants (a textual diff sees new `allow` lines), an all-tightening Codex posture, network access that was *already* on, a removed MCP server, and a `.mcp.json` with reordered keys but an identical launch command — plus one byte-identical snapshot. None produce a finding, because the detectors compare semantics and flag only *widening*.
The 8 benign cases include seven engineered **false-positive traps** — narrowly-scoped Claude grants (a textual diff sees new `allow` lines), an all-tightening Codex posture, network access that was *already* on, a brand-new Codex config pinned to the narrowest posture, a dropped MCP `env` var, a removed MCP server, and a `.mcp.json` with reordered keys but an identical launch command — plus one byte-identical snapshot. None produce a finding, because the detectors compare semantics and flag only *widening*.
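The reordered-keys trap can be illustrated with an order-insensitive comparison. This is a minimal sketch, not ScopeTrail's detector code: `canonicalize` and `sameLaunchConfig` are hypothetical names, and the only assumption is that two server entries count as identical when they match after object keys are sorted.

```typescript
// Hypothetical helper, not taken from the ScopeTrail source.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize a JSON value with object keys sorted, so key order never matters.
function canonicalize(value: Json): string {
  if (Array.isArray(value)) {
    // Array order is preserved: argument order in a launch command is meaningful.
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function sameLaunchConfig(a: Json, b: Json): boolean {
  return canonicalize(a) === canonicalize(b);
}

const before: Json = { command: "npx", args: ["-y", "@vendor/billing-mcp@1.4.0"] };
const reordered: Json = { args: ["-y", "@vendor/billing-mcp@1.4.0"], command: "npx" };
const changed: Json = { command: "npx", args: ["-y", "@vendor/billing-mcp@latest"] };

console.log(sameLaunchConfig(before, reordered)); // true
console.log(sameLaunchConfig(before, changed)); // false
```

A textual diff would report the `reordered` entry as changed; comparing canonical forms does not.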

**Severity is calibrated, not maximized.** At a strict `fail-on: high` gate, recall is 82% — by design: sample/template MCP additions, pinned version bumps, broad `Read` allows, and newly-enabled Codex network access sit at `low`/`medium` because they widen the surface without being directly exploitable. The `high`/`critical` band is reserved for executable or secret-facing changes — a bare `Bash` grant, a removed `Read(.env)` deny, a `danger-full-access` sandbox, an unencrypted remote MCP endpoint. Full confusion matrix at every gate, per-category and per-case breakdowns: [benchmark/RESULTS.md](benchmark/RESULTS.md). Methodology and labels: [benchmark/labels.json](benchmark/labels.json).
**Severity is calibrated, not maximized.** At a strict `fail-on: high` gate, recall is 85% — by design: sample/template MCP additions, pinned version bumps, broad `Read` allows, and newly-enabled Codex network access sit at `low`/`medium` because they widen the surface without being directly exploitable. The `high`/`critical` band is reserved for executable or secret-facing changes — a bare `Bash` grant, a removed `Read(.env)` deny, a `danger-full-access` sandbox, an unencrypted remote MCP endpoint. Full confusion matrix at every gate, per-category and per-case breakdowns: [benchmark/RESULTS.md](benchmark/RESULTS.md). Methodology and labels: [benchmark/labels.json](benchmark/labels.json).
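The gate arithmetic can be checked with a short sketch. The function names are hypothetical and the comparison rule (a rating trips the gate when it is at least as severe as `fail-on`) is an assumption based on the description in benchmark/RESULTS.md. The per-rating counts are tallied from the rogue rows of the per-case table in that file.

```typescript
type Severity = "low" | "medium" | "high" | "critical";

// Ordered from least to most severe, in the order the README lists them.
const SEVERITY_ORDER: Severity[] = ["low", "medium", "high", "critical"];

// Hypothetical gate check: a PR fails when its rating is at or above `failOn`.
function meetsGate(rating: Severity | "none", failOn: Severity): boolean {
  if (rating === "none") return false; // a PR with no findings never trips a gate
  return SEVERITY_ORDER.indexOf(rating) >= SEVERITY_ORDER.indexOf(failOn);
}

// Rogue fixtures per consolidated rating, tallied from the per-case table.
const rogueRatingCounts: Record<Severity, number> = { low: 1, medium: 3, high: 18, critical: 5 };

function rogueFlaggedAt(failOn: Severity): number {
  return SEVERITY_ORDER
    .filter((rating) => meetsGate(rating, failOn))
    .reduce((sum, rating) => sum + rogueRatingCounts[rating], 0);
}

for (const gate of SEVERITY_ORDER) {
  console.log(`fail-on ${gate}: ${rogueFlaggedAt(gate)}/27 rogue PRs flagged`);
}
// fail-on low: 27/27, medium: 26/27, high: 23/27, critical: 5/27
```

At `fail-on: high`, the four unflagged rogue PRs are the one `low` and three `medium` cases, which gives 23/27, or about 85% recall.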

## Design choices worth flagging

39 changes: 23 additions & 16 deletions benchmark/RESULTS.md
@@ -8,42 +8,42 @@ the pair and scores the drift engine against it. Benign cases include deliberate
false-positive traps a naive textual diff would flag (tightened posture, removed
servers, reordered JSON keys).

- Cases: **28** (22 rogue, 6 benign) across **19** detector kinds
- Cases: **35** (27 rogue, 8 benign) across **21** detector kinds
- Detection (any finding): recall **100.0%**, false-positive rate **0.0%**, precision **100.0%**
- At a `fail-on: high` CI gate: recall **81.8%**, false-positive rate **0.0%**, precision **100.0%**
- Correct primary finding kind identified on **22/22** rogue cases; all expected kinds on **22/22**
- Exact rating match where the label pins one: **28/28**
- At a `fail-on: high` CI gate: recall **85.2%**, false-positive rate **0.0%**, precision **100.0%**
- Correct primary finding kind identified on **27/27** rogue cases; all expected kinds on **27/27**
- Exact rating match where the label pins one: **35/35**

## Confusion matrix by CI gate threshold

A PR is predicted "drift" when its overall rating meets the threshold. `low` = any finding at all.

| Gate (`fail-on`) | TP | FP | FN | TN | Precision | Recall | FP rate | F1 | Accuracy |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| low | 22 | 0 | 0 | 6 | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% |
| medium | 21 | 0 | 1 | 6 | 100.0% | 95.5% | 0.0% | 97.7% | 96.4% |
| high | 18 | 0 | 4 | 6 | 100.0% | 81.8% | 0.0% | 90.0% | 85.7% |
| critical | 4 | 0 | 18 | 6 | 100.0% | 18.2% | 0.0% | 30.8% | 35.7% |
| low | 27 | 0 | 0 | 8 | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% |
| medium | 26 | 0 | 1 | 8 | 100.0% | 96.3% | 0.0% | 98.1% | 97.1% |
| high | 23 | 0 | 4 | 8 | 100.0% | 85.2% | 0.0% | 92.0% | 88.6% |
| critical | 5 | 0 | 22 | 8 | 100.0% | 18.5% | 0.0% | 31.3% | 37.1% |
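The derived columns follow from the four counts by the standard definitions. The sketch below recomputes the `high` row; `score`, `ratio`, and `pct` are illustrative names, not functions from the benchmark harness.

```typescript
interface Confusion {
  tp: number;
  fp: number;
  fn: number;
  tn: number;
}

// Guard against empty denominators instead of silently reporting 0 or 1.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? NaN : numerator / denominator;
}

function score(c: Confusion) {
  const precision = ratio(c.tp, c.tp + c.fp);
  const recall = ratio(c.tp, c.tp + c.fn);
  const fpRate = ratio(c.fp, c.fp + c.tn);
  const f1 = ratio(2 * precision * recall, precision + recall);
  const accuracy = ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn);
  return { precision, recall, fpRate, f1, accuracy };
}

const pct = (x: number): string => `${(x * 100).toFixed(1)}%`;

// Counts from the `high` row of the table above.
const highGate = score({ tp: 23, fp: 0, fn: 4, tn: 8 });
console.log(pct(highGate.recall), pct(highGate.f1), pct(highGate.accuracy));
// 85.2% 92.0% 88.6%
```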

## Detector-kind identification (rogue cases)

- Primary expected kind detected: **22/22**
- All expected kinds detected: **22/22**
- Distinct detector kinds exercised: **19**
- Primary expected kind detected: **27/27**
- All expected kinds detected: **27/27**
- Distinct detector kinds exercised: **21**

## Rating agreement

Of the **28** cases whose label pins an exact consolidated rating, the diff matched **28**.
Of the **35** cases whose label pins an exact consolidated rating, the diff matched **35**.

## Results by category

| Category | Cases | Rogue detected | Benign clean |
| --- | ---: | :---: | :---: |
| claude-drift | 5 | 5/5 | — |
| claude-drift | 7 | 7/7 | — |
| clean | 1 | — | 1/1 |
| codex-drift | 6 | 6/6 | — |
| fp-guard | 5 | — | 5/5 |
| mcp-drift | 5 | 5/5 | — |
| codex-drift | 7 | 7/7 | — |
| fp-guard | 7 | — | 7/7 |
| mcp-drift | 7 | 7/7 | — |
| mcp-sample | 2 | 2/2 | — |
| multi-class | 1 | 1/1 | — |
| robustness | 3 | 3/3 | — |
@@ -58,30 +58,37 @@ None. Every rogue case produced all of its expected finding kinds, no benign cas
| --- | --- | --- | --- | ---: | :---: | :---: | :---: |
| mcp-server-added | rogue | mcp-drift | high | 1 | yes | yes | yes |
| mcp-unpinned-added | rogue | mcp-drift | high | 2 | yes | yes | yes |
| mcp-env-secret-added | rogue | mcp-drift | high | 1 | yes | yes | yes |
| mcp-command-changed | rogue | mcp-drift | medium | 1 | yes | yes | yes |
| mcp-remote-http | rogue | mcp-drift | critical | 2 | yes | yes | yes |
| mcp-remote-https | rogue | mcp-drift | high | 2 | yes | yes | yes |
| mcp-syntax-error | rogue | robustness | high | 1 | yes | yes | yes |
| mcp-second-map-server-added | rogue | mcp-drift | high | 2 | yes | yes | yes |
| mcp-sample-added | rogue | mcp-sample | low | 1 | yes | yes | yes |
| mcp-sample-pipe-to-shell | rogue | mcp-sample | high | 2 | yes | yes | yes |
| claude-broad-allow-bash | rogue | claude-drift | high | 1 | yes | yes | yes |
| claude-broad-allow-read | rogue | claude-drift | medium | 1 | yes | yes | yes |
| claude-deny-removed-env | rogue | claude-drift | critical | 1 | yes | yes | yes |
| claude-deny-removed-ssh | rogue | claude-drift | critical | 1 | yes | yes | yes |
| claude-hook-removed | rogue | claude-drift | high | 1 | yes | yes | yes |
| claude-hook-command-changed | rogue | claude-drift | high | 1 | yes | yes | yes |
| claude-hook-matcher-rebound | rogue | claude-drift | high | 1 | yes | yes | yes |
| claude-syntax-error | rogue | robustness | high | 1 | yes | yes | yes |
| codex-sandbox-danger | rogue | codex-drift | critical | 1 | yes | yes | yes |
| codex-sandbox-workspace | rogue | codex-drift | high | 1 | yes | yes | yes |
| codex-approval-never | rogue | codex-drift | high | 1 | yes | yes | yes |
| codex-network-enabled | rogue | codex-drift | medium | 1 | yes | yes | yes |
| codex-project-trusted | rogue | codex-drift | high | 1 | yes | yes | yes |
| codex-mcp-added | rogue | codex-drift | high | 2 | yes | yes | yes |
| codex-mcp-env-added | rogue | codex-drift | high | 1 | yes | yes | yes |
| codex-syntax-error | rogue | robustness | high | 1 | yes | yes | yes |
| multi-class-drift | rogue | multi-class | critical | 5 | yes | yes | yes |
| clean-identical | benign | clean | none | 0 | — | — | yes |
| claude-scoped-grants | benign | fp-guard | none | 0 | — | — | yes |
| codex-narrowed | benign | fp-guard | none | 0 | — | — | yes |
| codex-network-already-on | benign | fp-guard | none | 0 | — | — | yes |
| codex-baseline-narrowest | benign | fp-guard | none | 0 | — | — | yes |
| mcp-env-removed | benign | fp-guard | none | 0 | — | — | yes |
| mcp-server-removed | benign | fp-guard | none | 0 | — | — | yes |
| mcp-reformatted | benign | fp-guard | none | 0 | — | — | yes |

@@ -0,0 +1,8 @@
{
"permissions": {
"allow": [
"mcp__github__get_issue"
],
"deny": []
}
}
@@ -0,0 +1,10 @@
{
"permissions": {
"allow": [
"mcp__github__get_issue"
],
"deny": [
"Read(~/.ssh/**)"
]
}
}
@@ -0,0 +1,23 @@
{
"permissions": {
"allow": [
"mcp__github__get_issue"
],
"deny": [
"Read(.env)"
]
},
"hooks": {
"PreToolUse": [
{
"matcher": "Read",
"hooks": [
{
"type": "command",
"command": "./guard.sh"
}
]
}
]
}
}
@@ -0,0 +1,23 @@
{
"permissions": {
"allow": [
"mcp__github__get_issue"
],
"deny": [
"Read(.env)"
]
},
"hooks": {
"PreToolUse": [
{
"matcher": "Bash",
"hooks": [
{
"type": "command",
"command": "./guard.sh"
}
]
}
]
}
}
@@ -0,0 +1,2 @@
sandbox_mode = "read-only"
approval_policy = "untrusted"
Empty file.
6 changes: 6 additions & 0 deletions benchmark/fixtures/codex-mcp-env-added/new/.codex/config.toml
@@ -0,0 +1,6 @@
[mcp_servers.billing]
command = "npx"
args = ["-y", "@vendor/billing-mcp@1.4.0"]

[mcp_servers.billing.env]
STRIPE_SECRET_KEY = "${STRIPE_SECRET_KEY}"
3 changes: 3 additions & 0 deletions benchmark/fixtures/codex-mcp-env-added/old/.codex/config.toml
@@ -0,0 +1,3 @@
[mcp_servers.billing]
command = "npx"
args = ["-y", "@vendor/billing-mcp@1.4.0"]
8 changes: 8 additions & 0 deletions benchmark/fixtures/mcp-env-removed/new/.mcp.json
@@ -0,0 +1,8 @@
{
"mcpServers": {
"billing": {
"command": "npx",
"args": ["-y", "@vendor/billing-mcp@1.4.0"]
}
}
}
11 changes: 11 additions & 0 deletions benchmark/fixtures/mcp-env-removed/old/.mcp.json
@@ -0,0 +1,11 @@
{
"mcpServers": {
"billing": {
"command": "npx",
"args": ["-y", "@vendor/billing-mcp@1.4.0"],
"env": {
"STRIPE_SECRET_KEY": "${STRIPE_SECRET_KEY}"
}
}
}
}
11 changes: 11 additions & 0 deletions benchmark/fixtures/mcp-env-secret-added/new/.mcp.json
@@ -0,0 +1,11 @@
{
"mcpServers": {
"billing": {
"command": "npx",
"args": ["-y", "@vendor/billing-mcp@1.4.0"],
"env": {
"STRIPE_SECRET_KEY": "${STRIPE_SECRET_KEY}"
}
}
}
}
8 changes: 8 additions & 0 deletions benchmark/fixtures/mcp-env-secret-added/old/.mcp.json
@@ -0,0 +1,8 @@
{
"mcpServers": {
"billing": {
"command": "npx",
"args": ["-y", "@vendor/billing-mcp@1.4.0"]
}
}
}
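The `mcp-env-secret-added` and `mcp-env-removed` fixtures are mirror images: the same `env` entry appears in one and disappears in the other. A widening-only check might look like the sketch below. The interfaces and `addedEnvVars` are hypothetical, and skipping servers that exist only in the new snapshot is an assumption (a new server would presumably be reported by a separate check).

```typescript
interface McpServer {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

interface McpConfig {
  mcpServers?: Record<string, McpServer>;
}

interface EnvAddition {
  server: string;
  name: string;
}

// Report env var names that a server gains between snapshots; removals are ignored.
function addedEnvVars(oldCfg: McpConfig, newCfg: McpConfig): EnvAddition[] {
  const additions: EnvAddition[] = [];
  for (const [server, next] of Object.entries(newCfg.mcpServers ?? {})) {
    const prev = oldCfg.mcpServers?.[server];
    if (!prev) continue; // assumption: brand-new servers are handled elsewhere
    const before = new Set(Object.keys(prev.env ?? {}));
    for (const name of Object.keys(next.env ?? {})) {
      if (!before.has(name)) additions.push({ server, name });
    }
  }
  return additions;
}

const withoutEnv: McpConfig = {
  mcpServers: { billing: { command: "npx", args: ["-y", "@vendor/billing-mcp@1.4.0"] } },
};
const withEnv: McpConfig = {
  mcpServers: {
    billing: {
      command: "npx",
      args: ["-y", "@vendor/billing-mcp@1.4.0"],
      env: { STRIPE_SECRET_KEY: "${STRIPE_SECRET_KEY}" },
    },
  },
};

console.log(addedEnvVars(withoutEnv, withEnv)); // one addition: billing / STRIPE_SECRET_KEY
console.log(addedEnvVars(withEnv, withoutEnv)); // no additions
```

Only variable names are compared, so the sketch never needs to read the secret's value.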
@@ -0,0 +1,9 @@
{
"mcpServers": {},
"servers": {
"new-risky-server": {
"command": "npx",
"args": ["@bad/server@latest"]
}
}
}
@@ -0,0 +1,4 @@
{
"mcpServers": {},
"servers": {}
}
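The last two snippets declare servers under a second top-level map, `servers`, alongside `mcpServers`. Their file names are missing from this view, so two things are assumptions here: that they belong to the `mcp-second-map-server-added` case, and that the snippet with the empty `servers` map is the old snapshot. Under those assumptions, a check that looks at both maps could be sketched as follows (`serverNames` and `addedServers` are hypothetical names).

```typescript
type ServerMap = Record<string, unknown>;

interface McpFile {
  mcpServers?: ServerMap;
  servers?: ServerMap;
}

// Both top-level keys seen in the fixtures; treating them alike is an assumption.
const SERVER_MAP_KEYS = ["mcpServers", "servers"] as const;

function serverNames(cfg: McpFile): Set<string> {
  const names = new Set<string>();
  for (const key of SERVER_MAP_KEYS) {
    for (const name of Object.keys(cfg[key] ?? {})) names.add(name);
  }
  return names;
}

// Names present in the new snapshot under either map but absent from the old one.
function addedServers(oldCfg: McpFile, newCfg: McpFile): string[] {
  const before = serverNames(oldCfg);
  return Array.from(serverNames(newCfg)).filter((name) => !before.has(name));
}

const oldSnapshot: McpFile = { mcpServers: {}, servers: {} };
const newSnapshot: McpFile = {
  mcpServers: {},
  servers: { "new-risky-server": { command: "npx", args: ["@bad/server@latest"] } },
};

console.log(addedServers(oldSnapshot, newSnapshot)); // [ 'new-risky-server' ]
console.log(addedServers(newSnapshot, oldSnapshot)); // []
```

A check that read only `mcpServers` would see two empty maps and report nothing.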