
feat(skills-guard): wire up LLM security audit behind --llm-audit flag#3234

Open
edmundman wants to merge 1 commit into NousResearch:main from edmundman:feature/wire-llm-audit-skill

Conversation

@edmundman

llm_audit_skill() in tools/skills_guard.py was fully implemented but never called — a dead function since it was written. This surfaces it as an opt-in second pass on top of the existing regex scanner.

Problem

The static regex scan in scan_skill() is fast and reliable for known-bad patterns (exfiltration regexes, reverse shells, credential leakage, etc.), but it cannot catch threats expressed in natural language prose — subtle social engineering, multi-step exfiltration described across sentences, or jailbreak framing that avoids any keyword the regexes target. llm_audit_skill() was written to fill exactly this gap, calling the user's configured LLM as a second opinion after the static scan, but nothing ever invoked it.

Fix

Add `--llm-audit` / `dest=llm_audit` to both `hermes skills install` and `hermes skills audit` in hermes_cli/main.py.

Wire the flag through skills_command() and handle_skills_slash() in hermes_cli/skills_hub.py into the do_install() and do_audit() functions, which now accept llm_audit: bool = False.
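The flag wiring described above can be sketched with argparse; the parser layout and argument names here are illustrative, not the exact structure of hermes_cli/main.py:

```python
# Hypothetical sketch of the --llm-audit flag wiring. The subparser
# layout is an assumption; only the flag name and its False default
# come from this PR.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hermes")
    sub = parser.add_subparsers(dest="command")
    skills = sub.add_parser("skills").add_subparsers(dest="skills_command")
    for name in ("install", "audit"):
        cmd = skills.add_parser(name)
        # install requires a skill spec; audit's name filter is optional
        cmd.add_argument("skill", nargs=None if name == "install" else "?")
        # Opt-in LLM second pass; defaults to False so the fast,
        # offline, regex-only behaviour is unchanged for existing users.
        cmd.add_argument("--llm-audit", dest="llm_audit", action="store_true")
    return parser

args = build_parser().parse_args(
    ["skills", "install", "owner/repo/skill", "--llm-audit"]
)
print(args.llm_audit)  # True
```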

When llm_audit=True, after scan_skill() completes, llm_audit_skill() is called on the same path with the static ScanResult. Its findings are merged in before the install-policy decision is made, so an LLM-detected critical finding can block a community skill just as a regex finding would. The LLM verdict can only raise severity, never lower it, and a failed LLM call is best-effort — install is never blocked due to an API error.

The flag defaults to False so existing behaviour (fast, offline, regex-only scan) is unchanged for all current users.

Usage

    hermes skills install owner/repo/skill --llm-audit
    hermes skills audit --llm-audit
    hermes skills audit my-skill --llm-audit
    /skills install owner/repo/skill --llm-audit
    /skills audit --llm-audit

Tests

40 new tests across six classes in tests/tools/test_skills_guard.py:

  TestParseLlmResponse — JSON parsing, markdown unwrapping, severity
    normalisation, truncation, malformed inputs
  TestLlmAuditSkill — dangerous skip, no-content skip, no-model skip, API
    failure passthrough, finding merge, verdict raise (safe→caution,
    caution→dangerous), verdict cannot-lower invariant, explicit model arg,
    single-file path, content truncation
  TestDoInstallLlmAuditWiring — llm_audit=False never calls llm_audit_skill,
    llm_audit=True calls it, LLM-raised dangerous verdict blocks install
  TestDoAuditLlmAuditWiring — flag off/on, once-per-skill count, name
    filter respected
  TestCliArgLlmAudit — argparse defaults, flag parsing, router dispatch
  TestSlashCommandLlmAuditWiring — slash parse for install and audit,
    name+flag combination

Full suite: 93/93 passed in test_skills_guard.py; 6197 passing across the broader suite with zero new failures introduced.

  • [x] 🐛 Bug fix (non-breaking change that fixes an issue)
  • [ ] ✨ New feature (non-breaking change that adds functionality)
  • [x] 🔒 Security fix
  • [ ] 📝 Documentation update
  • [ ] ✅ Tests (adding or improving test coverage)
  • [ ] ♻️ Refactor (no behavior change)
  • [ ] 🎯 New skill (bundled or hub)

@alt-glitch added the type/feature, P3, comp/cli, and tool/skills labels on May 2, 2026