[feat] Injection screening: hard rules + confidence gate (R-417 + R-418)#7
100yenadmin wants to merge 2 commits into
Pull request overview
Adds injection-screening safeguards to Cortex memory injection to reduce prompt-injection/stale-context risks, with configurable hard rules and mode-based confidence thresholds (R-417/R-418).
Changes:
- Extend plugin config with injection screening toggles and per-mode thresholds.
- Add mode detection + two-layer screening function to drop low-confidence/stale/contradictory memories before formatting injection context.
- Wire the screening step into the before_agent_start recall injection path and update published dist/ artifacts accordingly.
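As a rough illustration of the mode-based confidence gate described in the changes above, a minimal sketch might look like the following. The field names (`enabled`, `thresholds`), the `MemoryCandidate` shape, the helper `applyConfidenceGate`, and the threshold values are all assumptions made for this example; they are not taken from the PR's actual config.

```typescript
// The three mode names come from the PR's own doc comment (critical > technical > personal).
type InjectionMode = "critical" | "technical" | "personal";

// Hypothetical config shape: one toggle plus one minimum confidence per mode.
interface ScreeningConfig {
  enabled: boolean;
  thresholds: Record<InjectionMode, number>;
}

// Hypothetical candidate shape: recalled text plus a confidence score in [0, 1].
interface MemoryCandidate {
  content: string;
  confidence: number;
}

// Illustrative values only; the real defaults are not shown on this page.
const exampleConfig: ScreeningConfig = {
  enabled: true,
  thresholds: { critical: 0.8, technical: 0.6, personal: 0.4 },
};

// Keep only candidates whose confidence meets the threshold for the current mode.
function applyConfidenceGate(
  candidates: MemoryCandidate[],
  mode: InjectionMode,
  config: ScreeningConfig,
): MemoryCandidate[] {
  if (!config.enabled) return candidates;
  const minConfidence = config.thresholds[mode];
  return candidates.filter((c) => c.confidence >= minConfidence);
}
```

With these example values, a candidate scored 0.5 would be dropped in critical mode but kept in personal mode, which is the intent of a stricter gate for higher-stakes turns.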
Reviewed changes
Copilot reviewed 1 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/index.ts | Adds new config fields, introduces detectInjectionMode + screenInjectionCandidates, and applies screening before formatMemoryContext. |
| dist/index.js | Compiled output reflecting the new screening logic and new named exports. |
| dist/index.js.map | Updated sourcemap for the compiled JS. |
| dist/index.d.ts | Updated type declarations to include new config fields and new exported functions. |
| dist/index.d.ts.map | Updated sourcemap for the declaration file. |
    const RUN_ID_RE = /bench-\d{8}-\d{6}/g;
    const GIT_TOKENS = /\b(git|PR #|commit|branch)\b/i;
    const FILE_PATH_RE = /[./\\][a-zA-Z0-9_\-./\\]{2,}/;
    const LIVENESS_CLAIM = /\b(still active|still running|is running|is active|is alive|currently running)\b/i;
    const DEATH_CLAIM = /\b(was killed|is dead|died|crashed|no listener|restarted|dead\b|killed\b|stalled)\b/i;

    /**
     * Classify the current turn into an injection mode.
     * critical > technical > personal (first match wins).
     */
    export function detectInjectionMode(promptText: string): InjectionMode {
      if (CRITICAL_KEYWORDS.test(promptText) || RUN_ID_RE.test(promptText)) return "critical";
      if (
RUN_ID_RE is declared with the global (/g) flag but is used with RegExp.test() in detectInjectionMode. Global regexes are stateful (lastIndex is advanced), so subsequent calls (including the later matchAll(RUN_ID_RE) in screenInjectionCandidates) can miss matches depending on call order. Consider using a non-global regex for test() (or resetting lastIndex / cloning the regex) and keep a separate global instance only for matchAll().
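The statefulness described in this comment can be reproduced in isolation. The sketch below reuses the run-ID pattern from the diff; the constant names `RUN_ID_GLOBAL` and `RUN_ID_TEST`, the helper `hasRunId`, and the sample prompt are hypothetical and only serve the demonstration.

```typescript
// Same pattern as RUN_ID_RE in the diff, once with /g and once without.
const RUN_ID_GLOBAL = /bench-\d{8}-\d{6}/g;
const RUN_ID_TEST = /bench-\d{8}-\d{6}/;

// Hypothetical prompt containing exactly one run ID.
const samplePrompt = "status of bench-20240101-120000?";

// First call matches and advances lastIndex past the end of the match.
const firstCall = RUN_ID_GLOBAL.test(samplePrompt);
const lastIndexAfterFirst = RUN_ID_GLOBAL.lastIndex;

// Second call resumes from lastIndex, finds nothing, and resets lastIndex to 0.
const secondCall = RUN_ID_GLOBAL.test(samplePrompt);

// A non-global regex has no such state, so repeated test() calls agree.
function hasRunId(text: string): boolean {
  return RUN_ID_TEST.test(text);
}

console.log(firstCall, lastIndexAfterFirst, secondCall);
console.log(hasRunId(samplePrompt), hasRunId(samplePrompt));
```

Keeping one non-global instance for `test()` and a separate global instance for `matchAll()`, as the comment suggests, avoids having to reset `lastIndex` by hand.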
    // If the prompt already contains a death claim, or the run ID isn’t a live process,
    // drop this memory (it was captured when the run was alive, now stale).
    let isStale = false;
    if (promptHasDeathClaim) {
      isStale = true;
    } else {
      // Check if prompt explicitly references this run as dead / a different run took over
      for (const runId of contentRunIds) {
        if (promptRunIds.has(runId) && DEATH_CLAIM.test(promptText)) {
          isStale = true;
          break;
        }
        // Also stale if prompt never mentions this run ID at all but does mention death
        if (!promptRunIds.has(runId) && promptHasDeathClaim) {
          isStale = true;
          break;
        }
The stale run-state branch’s run-id specific checks are currently ineffective: the else block only runs when promptHasDeathClaim is false, but inside it you test DEATH_CLAIM.test(promptText) and promptHasDeathClaim again, which will always be false in that branch. As a result, this filter drops all liveness-claim memories whenever the prompt contains any death claim (even unrelated), and never drops based on matching run IDs. Restructure the conditions so run-id matching is evaluated when a death claim is present (or remove the redundant tests).
Suggested change:

    - // If the prompt already contains a death claim, or the run ID isn’t a live process,
    - // drop this memory (it was captured when the run was alive, now stale).
    - let isStale = false;
    - if (promptHasDeathClaim) {
    -   isStale = true;
    - } else {
    -   // Check if prompt explicitly references this run as dead / a different run took over
    -   for (const runId of contentRunIds) {
    -     if (promptRunIds.has(runId) && DEATH_CLAIM.test(promptText)) {
    -       isStale = true;
    -       break;
    -     }
    -     // Also stale if prompt never mentions this run ID at all but does mention death
    -     if (!promptRunIds.has(runId) && promptHasDeathClaim) {
    -       isStale = true;
    -       break;
    -     }
    + // If the prompt contains a death claim for this run (or contains an unscoped death
    + // claim with no run IDs at all), drop this memory as stale.
    + let isStale = false;
    + if (promptHasDeathClaim) {
    +   if (promptRunIds.size === 0) {
    +     isStale = true;
    +   } else {
    +     // Check if prompt explicitly references this run as dead / a different run took over.
    +     for (const runId of contentRunIds) {
    +       if (promptRunIds.has(runId)) {
    +         isStale = true;
    +         break;
    +       }
    +     }
Closes 100yenadmin/electric-sheep#1902, closes 100yenadmin/electric-sheep#1903