Conalh · May 29, 2026
diff --git a/README.md b/README.md
@@ -134,7 +134,7 @@ ScopeTrail permission drift: CRITICAL
 
 ## How it works
 
-ScopeTrail is **local-only**. It reads the checked-out repository, materializes the two git refs into temp directories, runs detectors over them, and prints the result. It uploads nothing, calls no external services, and has no required API keys.
+ScopeTrail is **local-only**. It reads the checked-out repository, materializes the two git refs into temp directories, runs detectors over them, and prints the result. The scanner uploads nothing — no repository contents, findings, or telemetry leave your machine — and it needs no API keys. (The GitHub Action's setup step installs ScopeTrail's one runtime dependency with `npm ci` from the npm registry; the analysis itself makes no network calls.)
 
 The detectors cover the surfaces an AI agent can actually escalate through:
 
@@ -146,20 +146,20 @@ Findings carry a `severity` (`low` / `medium` / `high` / `critical`) and the rep
 
 ## How well it catches it
 
-ScopeTrail ships a labeled precision/recall benchmark over **28 fixture PRs** (22 with planted drift, 6 benign) spanning **19 detector kinds**. Each fixture is an `old/`+`new/` config snapshot pair; ground truth is fixed by fixture design and the harness diffs the pair and scores the drift engine against it. Reproduce with `npm run build && node benchmark/run-benchmark.mjs`.
+ScopeTrail ships a labeled precision/recall benchmark over **35 fixture PRs** (27 with planted drift, 8 benign) spanning **21 detector kinds**. Each fixture is an `old/`+`new/` config snapshot pair; ground truth is fixed by fixture design and the harness diffs the pair and scores the drift engine against it. Reproduce with `npm run build && node benchmark/run-benchmark.mjs`. These figures score the engine against its own labeled fixtures; they guard against regressions and are not a claim of real-world field accuracy across every config a PR might contain.
 
 | Metric | Result |
 | --- | --- |
-| Detection (any finding) — recall | **100%** (22/22 rogue PRs flagged) |
-| Detection — false-positive rate | **0%** (0/6 benign PRs flagged) |
+| Detection (any finding) — recall | **100%** (27/27 rogue PRs flagged) |
+| Detection — false-positive rate | **0%** (0/8 benign PRs flagged) |
 | Detection — precision | **100%** |
-| Correct primary finding kind | **22/22** rogue PRs |
-| All expected finding kinds | **22/22** rogue PRs |
-| Exact consolidated rating | **28/28** PRs |
+| Correct primary finding kind | **27/27** rogue PRs |
+| All expected finding kinds | **27/27** rogue PRs |
+| Exact consolidated rating | **35/35** PRs |
 
-The 6 benign cases include five engineered **false-positive traps** — narrowly-scoped Claude grants (a textual diff sees new `allow` lines), an all-tightening Codex posture, network access that was *already* on, a removed MCP server, and a `.mcp.json` with reordered keys but an identical launch command — plus one byte-identical snapshot. None produce a finding, because the detectors compare semantics and flag only *widening*.
+The 8 benign cases include seven engineered **false-positive traps** — narrowly-scoped Claude grants (a textual diff sees new `allow` lines), an all-tightening Codex posture, network access that was *already* on, a brand-new Codex config pinned to the narrowest posture, a dropped MCP `env` var, a removed MCP server, and a `.mcp.json` with reordered keys but an identical launch command — plus one byte-identical snapshot. None produce a finding, because the detectors compare semantics and flag only *widening*.
 
-**Severity is calibrated, not maximized.** At a strict `fail-on: high` gate, recall is 82% — by design: sample/template MCP additions, pinned version bumps, broad `Read` allows, and newly-enabled Codex network access sit at `low`/`medium` because they widen the surface without being directly exploitable. The `high`/`critical` band is reserved for executable or secret-facing changes — a bare `Bash` grant, a removed `Read(.env)` deny, a `danger-full-access` sandbox, an unencrypted remote MCP endpoint. Full confusion matrix at every gate, per-category and per-case breakdowns: [benchmark/RESULTS.md](benchmark/RESULTS.md). Methodology and labels: [benchmark/labels.json](benchmark/labels.json).
+**Severity is calibrated, not maximized.** At a strict `fail-on: high` gate, recall is 85% — by design: sample/template MCP additions, pinned version bumps, broad `Read` allows, and newly-enabled Codex network access sit at `low`/`medium` because they widen the surface without being directly exploitable. The `high`/`critical` band is reserved for executable or secret-facing changes — a bare `Bash` grant, a removed `Read(.env)` deny, a `danger-full-access` sandbox, an unencrypted remote MCP endpoint. Full confusion matrix at every gate, per-category and per-case breakdowns: [benchmark/RESULTS.md](benchmark/RESULTS.md). Methodology and labels: [benchmark/labels.json](benchmark/labels.json).
 
 ## Design choices worth flagging
 

diff --git a/benchmark/RESULTS.md b/benchmark/RESULTS.md
@@ -8,42 +8,42 @@ the pair and scores the drift engine against it. Benign cases include deliberate
 false-positive traps a naive textual diff would flag (tightened posture, removed
 servers, reordered JSON keys).
 
-- Cases: **28** (22 rogue, 6 benign) across **19** detector kinds
+- Cases: **35** (27 rogue, 8 benign) across **21** detector kinds
 - Detection (any finding): recall **100.0%**, false-positive rate **0.0%**, precision **100.0%**
-- At a `fail-on: high` CI gate: recall **81.8%**, false-positive rate **0.0%**, precision **100.0%**
-- Correct primary finding kind identified on **22/22** rogue cases; all expected kinds on **22/22**
-- Exact rating match where the label pins one: **28/28**
+- At a `fail-on: high` CI gate: recall **85.2%**, false-positive rate **0.0%**, precision **100.0%**
+- Correct primary finding kind identified on **27/27** rogue cases; all expected kinds on **27/27**
+- Exact rating match where the label pins one: **35/35**
 
 ## Confusion matrix by CI gate threshold
 
 A PR is predicted "drift" when its overall rating meets the threshold. `low` = any finding at all.
 
 | Gate (`fail-on`) | TP | FP | FN | TN | Precision | Recall | FP rate | F1 | Accuracy |
 | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
-| low | 22 | 0 | 0 | 6 | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% |
-| medium | 21 | 0 | 1 | 6 | 100.0% | 95.5% | 0.0% | 97.7% | 96.4% |
-| high | 18 | 0 | 4 | 6 | 100.0% | 81.8% | 0.0% | 90.0% | 85.7% |
-| critical | 4 | 0 | 18 | 6 | 100.0% | 18.2% | 0.0% | 30.8% | 35.7% |
+| low | 27 | 0 | 0 | 8 | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% |
+| medium | 26 | 0 | 1 | 8 | 100.0% | 96.3% | 0.0% | 98.1% | 97.1% |
+| high | 23 | 0 | 4 | 8 | 100.0% | 85.2% | 0.0% | 92.0% | 88.6% |
+| critical | 5 | 0 | 22 | 8 | 100.0% | 18.5% | 0.0% | 31.3% | 37.1% |
 
 ## Detector-kind identification (rogue cases)
 
-- Primary expected kind detected: **22/22**
-- All expected kinds detected: **22/22**
-- Distinct detector kinds exercised: **19**
+- Primary expected kind detected: **27/27**
+- All expected kinds detected: **27/27**
+- Distinct detector kinds exercised: **21**
 
 ## Rating agreement
 
-Of the **28** cases whose label pins an exact consolidated rating, the diff matched **28**.
+Of the **35** cases whose label pins an exact consolidated rating, the diff matched **35**.
 
 ## Results by category
 
 | Category | Cases | Rogue detected | Benign clean |
 | --- | ---: | :---: | :---: |
-| claude-drift | 5 | 5/5 | — |
+| claude-drift | 7 | 7/7 | — |
 | clean | 1 | — | 1/1 |
-| codex-drift | 6 | 6/6 | — |
-| fp-guard | 5 | — | 5/5 |
-| mcp-drift | 5 | 5/5 | — |
+| codex-drift | 7 | 7/7 | — |
+| fp-guard | 7 | — | 7/7 |
+| mcp-drift | 7 | 7/7 | — |
 | mcp-sample | 2 | 2/2 | — |
 | multi-class | 1 | 1/1 | — |
 | robustness | 3 | 3/3 | — |
@@ -58,30 +58,37 @@ None. Every rogue case produced all of its expected finding kinds, no benign cas
 | --- | --- | --- | --- | ---: | :---: | :---: | :---: |
 | mcp-server-added | rogue | mcp-drift | high | 1 | yes | yes | yes |
 | mcp-unpinned-added | rogue | mcp-drift | high | 2 | yes | yes | yes |
+| mcp-env-secret-added | rogue | mcp-drift | high | 1 | yes | yes | yes |
 | mcp-command-changed | rogue | mcp-drift | medium | 1 | yes | yes | yes |
 | mcp-remote-http | rogue | mcp-drift | critical | 2 | yes | yes | yes |
 | mcp-remote-https | rogue | mcp-drift | high | 2 | yes | yes | yes |
 | mcp-syntax-error | rogue | robustness | high | 1 | yes | yes | yes |
+| mcp-second-map-server-added | rogue | mcp-drift | high | 2 | yes | yes | yes |
 | mcp-sample-added | rogue | mcp-sample | low | 1 | yes | yes | yes |
 | mcp-sample-pipe-to-shell | rogue | mcp-sample | high | 2 | yes | yes | yes |
 | claude-broad-allow-bash | rogue | claude-drift | high | 1 | yes | yes | yes |
 | claude-broad-allow-read | rogue | claude-drift | medium | 1 | yes | yes | yes |
 | claude-deny-removed-env | rogue | claude-drift | critical | 1 | yes | yes | yes |
+| claude-deny-removed-ssh | rogue | claude-drift | critical | 1 | yes | yes | yes |
 | claude-hook-removed | rogue | claude-drift | high | 1 | yes | yes | yes |
 | claude-hook-command-changed | rogue | claude-drift | high | 1 | yes | yes | yes |
+| claude-hook-matcher-rebound | rogue | claude-drift | high | 1 | yes | yes | yes |
 | claude-syntax-error | rogue | robustness | high | 1 | yes | yes | yes |
 | codex-sandbox-danger | rogue | codex-drift | critical | 1 | yes | yes | yes |
 | codex-sandbox-workspace | rogue | codex-drift | high | 1 | yes | yes | yes |
 | codex-approval-never | rogue | codex-drift | high | 1 | yes | yes | yes |
 | codex-network-enabled | rogue | codex-drift | medium | 1 | yes | yes | yes |
 | codex-project-trusted | rogue | codex-drift | high | 1 | yes | yes | yes |
 | codex-mcp-added | rogue | codex-drift | high | 2 | yes | yes | yes |
+| codex-mcp-env-added | rogue | codex-drift | high | 1 | yes | yes | yes |
 | codex-syntax-error | rogue | robustness | high | 1 | yes | yes | yes |
 | multi-class-drift | rogue | multi-class | critical | 5 | yes | yes | yes |
 | clean-identical | benign | clean | none | 0 | — | — | yes |
 | claude-scoped-grants | benign | fp-guard | none | 0 | — | — | yes |
 | codex-narrowed | benign | fp-guard | none | 0 | — | — | yes |
 | codex-network-already-on | benign | fp-guard | none | 0 | — | — | yes |
+| codex-baseline-narrowest | benign | fp-guard | none | 0 | — | — | yes |
+| mcp-env-removed | benign | fp-guard | none | 0 | — | — | yes |
 | mcp-server-removed | benign | fp-guard | none | 0 | — | — | yes |
 | mcp-reformatted | benign | fp-guard | none | 0 | — | — | yes |
 

diff --git a/benchmark/fixtures/claude-deny-removed-ssh/new/.claude/settings.json b/benchmark/fixtures/claude-deny-removed-ssh/new/.claude/settings.json
@@ -0,0 +1,8 @@
+{
+  "permissions": {
+    "allow": [
+      "mcp__github__get_issue"
+    ],
+    "deny": []
+  }
+}
diff --git a/benchmark/fixtures/claude-deny-removed-ssh/old/.claude/settings.json b/benchmark/fixtures/claude-deny-removed-ssh/old/.claude/settings.json
@@ -0,0 +1,10 @@
+{
+  "permissions": {
+    "allow": [
+      "mcp__github__get_issue"
+    ],
+    "deny": [
+      "Read(~/.ssh/**)"
+    ]
+  }
+}
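The `claude-deny-removed-ssh` pair above differs only in its `deny` list: the old snapshot denies `Read(~/.ssh/**)` and the new one denies nothing. A minimal sketch of how a semantic comparison might surface that follows. It assumes nothing about ScopeTrail's actual detector; `removed_denies` is a hypothetical helper, and the inline JSON mirrors the two fixture files.

```python
import json

# Inline copies of the old/ and new/ fixture contents shown above.
OLD = json.loads(
    '{"permissions": {"allow": ["mcp__github__get_issue"],'
    ' "deny": ["Read(~/.ssh/**)"]}}'
)
NEW = json.loads(
    '{"permissions": {"allow": ["mcp__github__get_issue"], "deny": []}}'
)


def removed_denies(old, new):
    """Return deny rules present in `old` but missing from `new`.

    Hypothetical helper: a dropped deny rule widens what the agent may do,
    so it is the only direction reported here.
    """
    old_deny = set(old.get("permissions", {}).get("deny", []))
    new_deny = set(new.get("permissions", {}).get("deny", []))
    return sorted(old_deny - new_deny)


print(removed_denies(OLD, NEW))
```

Swapping the arguments models a PR that adds the deny rule; the helper then returns an empty list, which matches the benchmark's stated rule of flagging only widening.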
diff --git a/benchmark/fixtures/claude-hook-matcher-rebound/new/.claude/settings.json b/benchmark/fixtures/claude-hook-matcher-rebound/new/.claude/settings.json
@@ -0,0 +1,23 @@
+{
+  "permissions": {
+    "allow": [
+      "mcp__github__get_issue"
+    ],
+    "deny": [
+      "Read(.env)"
+    ]
+  },
+  "hooks": {
+    "PreToolUse": [
+      {
+        "matcher": "Read",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "./guard.sh"
+          }
+        ]
+      }
+    ]
+  }
+}
diff --git a/benchmark/fixtures/claude-hook-matcher-rebound/old/.claude/settings.json b/benchmark/fixtures/claude-hook-matcher-rebound/old/.claude/settings.json
@@ -0,0 +1,23 @@
+{
+  "permissions": {
+    "allow": [
+      "mcp__github__get_issue"
+    ],
+    "deny": [
+      "Read(.env)"
+    ]
+  },
+  "hooks": {
+    "PreToolUse": [
+      {
+        "matcher": "Bash",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "./guard.sh"
+          }
+        ]
+      }
+    ]
+  }
+}
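In the `claude-hook-matcher-rebound` pair, the hook command `./guard.sh` is unchanged but its `matcher` moves from `Bash` to `Read`, so the guard no longer runs on `Bash` calls. One way to catch this, sketched under the assumption that a hook is identified by its event and command, is to compare the set of matchers bound to each command. `hook_bindings` and `rebound_hooks` are hypothetical names, not ScopeTrail functions.

```python
def _settings(matcher):
    """Build the hooks portion of the fixture with the given matcher."""
    return {
        "hooks": {
            "PreToolUse": [
                {
                    "matcher": matcher,
                    "hooks": [{"type": "command", "command": "./guard.sh"}],
                }
            ]
        }
    }


OLD = _settings("Bash")  # old/ snapshot: guard runs before Bash
NEW = _settings("Read")  # new/ snapshot: same guard, rebound to Read


def hook_bindings(settings):
    """Map (event, command) to the set of matchers that trigger it."""
    bindings = {}
    for event, entries in settings.get("hooks", {}).items():
        for entry in entries:
            for hook in entry.get("hooks", []):
                key = (event, hook.get("command"))
                bindings.setdefault(key, set()).add(entry.get("matcher"))
    return bindings


def rebound_hooks(old, new):
    """List hooks that survive in `new` but no longer cover an old matcher."""
    old_b, new_b = hook_bindings(old), hook_bindings(new)
    findings = []
    for key in sorted(old_b.keys() & new_b.keys()):
        lost = sorted(old_b[key] - new_b[key])
        if lost:
            findings.append((key[0], key[1], lost))
    return findings


print(rebound_hooks(OLD, NEW))
```

A hook that disappears entirely is not reported by this sketch; in the benchmark that situation is a separate case (`claude-hook-removed`).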
diff --git a/benchmark/fixtures/codex-baseline-narrowest/new/.codex/config.toml b/benchmark/fixtures/codex-baseline-narrowest/new/.codex/config.toml
@@ -0,0 +1,2 @@
+sandbox_mode = "read-only"
+approval_policy = "untrusted"
diff --git a/benchmark/fixtures/codex-baseline-narrowest/old/.gitkeep b/benchmark/fixtures/codex-baseline-narrowest/old/.gitkeep
diff --git a/benchmark/fixtures/codex-mcp-env-added/new/.codex/config.toml b/benchmark/fixtures/codex-mcp-env-added/new/.codex/config.toml
@@ -0,0 +1,6 @@
+[mcp_servers.billing]
+command = "npx"
+args = ["-y", "@vendor/billing-mcp@1.4.0"]
+
+[mcp_servers.billing.env]
+STRIPE_SECRET_KEY = "${STRIPE_SECRET_KEY}"
diff --git a/benchmark/fixtures/codex-mcp-env-added/old/.codex/config.toml b/benchmark/fixtures/codex-mcp-env-added/old/.codex/config.toml
@@ -0,0 +1,3 @@
+[mcp_servers.billing]
+command = "npx"
+args = ["-y", "@vendor/billing-mcp@1.4.0"]
diff --git a/benchmark/fixtures/mcp-env-removed/new/.mcp.json b/benchmark/fixtures/mcp-env-removed/new/.mcp.json
@@ -0,0 +1,8 @@
+{
+  "mcpServers": {
+    "billing": {
+      "command": "npx",
+      "args": ["-y", "@vendor/billing-mcp@1.4.0"]
+    }
+  }
+}
diff --git a/benchmark/fixtures/mcp-env-removed/old/.mcp.json b/benchmark/fixtures/mcp-env-removed/old/.mcp.json
@@ -0,0 +1,11 @@
+{
+  "mcpServers": {
+    "billing": {
+      "command": "npx",
+      "args": ["-y", "@vendor/billing-mcp@1.4.0"],
+      "env": {
+        "STRIPE_SECRET_KEY": "${STRIPE_SECRET_KEY}"
+      }
+    }
+  }
+}
diff --git a/benchmark/fixtures/mcp-env-secret-added/new/.mcp.json b/benchmark/fixtures/mcp-env-secret-added/new/.mcp.json
@@ -0,0 +1,11 @@
+{
+  "mcpServers": {
+    "billing": {
+      "command": "npx",
+      "args": ["-y", "@vendor/billing-mcp@1.4.0"],
+      "env": {
+        "STRIPE_SECRET_KEY": "${STRIPE_SECRET_KEY}"
+      }
+    }
+  }
+}
diff --git a/benchmark/fixtures/mcp-env-secret-added/old/.mcp.json b/benchmark/fixtures/mcp-env-secret-added/old/.mcp.json
@@ -0,0 +1,8 @@
+{
+  "mcpServers": {
+    "billing": {
+      "command": "npx",
+      "args": ["-y", "@vendor/billing-mcp@1.4.0"]
+    }
+  }
+}
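The `mcp-env-removed` (benign) and `mcp-env-secret-added` (rogue) fixtures are mirror images: the same `billing` server with and without a `STRIPE_SECRET_KEY` entry in `env`. The sketch below illustrates the direction-sensitive comparison that makes one a finding and the other not. `added_env_keys` is a hypothetical helper and only inspects the `mcpServers` map; the real detector may look at more.

```python
# The two snapshot shapes shared by both fixture pairs.
WITHOUT_ENV = {
    "mcpServers": {
        "billing": {
            "command": "npx",
            "args": ["-y", "@vendor/billing-mcp@1.4.0"],
        }
    }
}
WITH_ENV = {
    "mcpServers": {
        "billing": {
            "command": "npx",
            "args": ["-y", "@vendor/billing-mcp@1.4.0"],
            "env": {"STRIPE_SECRET_KEY": "${STRIPE_SECRET_KEY}"},
        }
    }
}


def added_env_keys(old, new):
    """Return {server: [env var names]} for variables that appear only in `new`.

    Removed variables are ignored on purpose: dropping an env var narrows
    what the server can see, so it is not treated as drift in this sketch.
    """
    old_servers = old.get("mcpServers", {})
    added = {}
    for name, server in new.get("mcpServers", {}).items():
        old_env = old_servers.get(name, {}).get("env", {})
        extra = sorted(set(server.get("env", {})) - set(old_env))
        if extra:
            added[name] = extra
    return added


print(added_env_keys(WITHOUT_ENV, WITH_ENV))  # rogue direction
print(added_env_keys(WITH_ENV, WITHOUT_ENV))  # benign direction
```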
diff --git a/benchmark/fixtures/mcp-second-map-server-added/new/.cursor/mcp.json b/benchmark/fixtures/mcp-second-map-server-added/new/.cursor/mcp.json
@@ -0,0 +1,9 @@
+{
+  "mcpServers": {},
+  "servers": {
+    "new-risky-server": {
+      "command": "npx",
+      "args": ["@bad/server@latest"]
+    }
+  }
+}
diff --git a/benchmark/fixtures/mcp-second-map-server-added/old/.cursor/mcp.json b/benchmark/fixtures/mcp-second-map-server-added/old/.cursor/mcp.json
@@ -0,0 +1,4 @@
+{
+  "mcpServers": {},
+  "servers": {}
+}
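The updated confusion-matrix rows in `benchmark/RESULTS.md` can be sanity-checked from their TP/FP/FN/TN counts. The sketch below assumes the standard definitions of precision, recall, false-positive rate, F1, and accuracy; the harness's own formulas are not shown in this diff, and `gate_metrics` is a hypothetical name.

```python
def gate_metrics(tp, fp, fn, tn):
    """Compute standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "fp_rate": fp_rate,
        "f1": f1,
        "accuracy": accuracy,
    }


# Counts taken from the new `high` and `medium` rows of the table.
high = gate_metrics(tp=23, fp=0, fn=4, tn=8)
medium = gate_metrics(tp=26, fp=0, fn=1, tn=8)

for gate, metrics in (("medium", medium), ("high", high)):
    print(gate, {name: f"{value:.1%}" for name, value in metrics.items()})
```

With 27 rogue and 8 benign cases, the four misses at the `high` gate correspond to the four rogue fixtures rated `low` or `medium` in the per-case table.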