Skill Harness

Skill Harness refuses to produce a confidence number when no admissible evidence exists for the axis being claimed. UNMEASURED is the first-class verdict that makes this refusal legible.

What is different from existing LLM eval frameworks

Pairwise judges, scalar rubrics, and holistic LLM-as-judge approaches all produce numbers. Skill Harness produces UNMEASURED when the framework has no instrument to verify the clause being claimed. Where G-Eval asks a judge to score an output, Skill Harness asks whether removing a specific clause produces a measurable directional change on that clause's claimed axis. When no mechanical scorer exists for the axis, the result is UNMEASURED, not an estimated score. The case study below shows this distinction on a real, widely-used skill.
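The decision rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the harness's implementation: the `Verdict` enum, the `SCORERS` registry, the `ablate_clause` function, the `word_count` axis, and the SUPPORTED/UNSUPPORTED verdict names are all hypothetical. Only UNMEASURED comes from this README.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Verdict(Enum):
    SUPPORTED = "SUPPORTED"      # hypothetical: axis moved in the claimed direction
    UNSUPPORTED = "UNSUPPORTED"  # hypothetical: no change in the claimed direction
    UNMEASURED = "UNMEASURED"    # no mechanical scorer exists for the axis


# Hypothetical registry of mechanical scorers, keyed by axis name.
SCORERS: Dict[str, Callable[[str], float]] = {
    "word_count": lambda text: float(len(text.split())),
}


def ablate_clause(axis: str, with_clause: str, without_clause: str,
                  expected_direction: int) -> Verdict:
    """Compare outputs produced with and without a clause on one axis.

    expected_direction is +1 if the clause is claimed to raise the score
    and -1 if it is claimed to lower it.
    """
    scorer: Optional[Callable[[str], float]] = SCORERS.get(axis)
    if scorer is None:
        # Refuse to estimate: there is no instrument for this axis.
        return Verdict.UNMEASURED
    delta = scorer(with_clause) - scorer(without_clause)
    if delta * expected_direction > 0:
        return Verdict.SUPPORTED
    return Verdict.UNSUPPORTED
```

The point of the sketch is the early return: an axis without a scorer never reaches the arithmetic, so no number can be produced for it.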

Reproduce the case study (Windows)

git clone https://github.com/MrBinnacle/skill-harness
cd skill-harness
git checkout main   # see "Why not `git checkout v0.1.0`?" below
python -m venv .venv
.venv\Scripts\pip install -e ".[dev]"

# Required on Windows to avoid encoding errors and non-deterministic hashes
$env:PYTHONUTF8 = "1"; $env:PYTHONHASHSEED = "0"; $env:PYTHONPATH = "src"

$py = ".venv\Scripts\python.exe"
& $py -m skill_harness skill init <path-to-ai-slop-sentinel-SKILL.md> --execute
& $py -m skill_harness run ablation <skill_id> --execute
& $py -m skill_harness run evaluate-skill <skill_id>

PYTHONUTF8=1 enables Python's UTF-8 mode, which prevents cp1252 encoding errors on Windows terminals. PYTHONHASHSEED=0 disables hash randomization, which keeps JSON output byte-stable across re-runs. See docs/concepts/why-pythonutf8-on-windows.md for detail. For <path-to-ai-slop-sentinel-SKILL.md> and a one-shot reproduction script, see examples/.
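Before running the recipe, it can help to confirm that both variables are actually set in the current shell. The following is a small sketch of such a check; `preflight_problems` is a hypothetical helper written for this example and is not part of the skill_harness CLI.

```python
import os
from typing import List, Mapping


def preflight_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems with the given environment mapping."""
    problems = []
    if env.get("PYTHONUTF8") != "1":
        problems.append("PYTHONUTF8 is not set to 1")
    if env.get("PYTHONHASHSEED") != "0":
        problems.append("PYTHONHASHSEED is not set to 0")
    return problems


if __name__ == "__main__":
    # Print one warning per missing or mismatched variable.
    for problem in preflight_problems(os.environ):
        print("warning:", problem)
```

Taking the environment as a parameter rather than reading `os.environ` directly keeps the check easy to test without touching the real shell state.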

API-key requirements (current state, honest)

Two API surfaces, two requirements:

- `skill init` calls the Claude API to extract clauses from your skill artifact. It currently requires ANTHROPIC_API_KEY to be set in the environment. There is no OpenRouter fallback for the extractor yet, so operators on Claude Code subscription auth or other environments without a direct key cannot run `skill init` end-to-end against the current main. An OpenRouter fallback for the extractor is a v0.2 backlog candidate.
- `run ablation --execute` calls the subject model. It accepts EITHER ANTHROPIC_API_KEY (direct Anthropic) OR OPENROUTER_API_KEY (auto-routed via OpenRouter with a stderr warning). The --subject-model flag selects the model id; see --help for the matrix of direct vs OpenRouter forms.
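The asymmetry between the two steps can be summarized as a small lookup. This is a sketch for illustration only: `provider_for` is a hypothetical function, not the harness's code, and the choice to prefer the direct Anthropic key when both keys are set is an assumption that this README does not confirm.

```python
from typing import Mapping, Optional


def provider_for(step: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the provider a step could use, or None if no usable key is set.

    The extractor step accepts only a direct Anthropic key, while the
    ablation step accepts either key.
    """
    has_anthropic = bool(env.get("ANTHROPIC_API_KEY"))
    has_openrouter = bool(env.get("OPENROUTER_API_KEY"))
    if step == "skill init":
        return "anthropic" if has_anthropic else None
    if step == "run ablation":
        if has_anthropic:
            # Assumption: the direct key wins when both are present.
            return "anthropic"
        return "openrouter" if has_openrouter else None
    raise ValueError(f"unknown step: {step}")
```

With only OPENROUTER_API_KEY set, the sketch returns None for `skill init` and "openrouter" for `run ablation`, which is the gap described above.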

The case study's own author hit this exact asymmetry in real time — see the case study's HALT 2 narrative for the audit trail.

Why not `git checkout v0.1.0`?

The case-study reproduction recipe used to pin v0.1.0 (commit fd782b1). The v0.1.0 tag is the harness state the case study was written against, but it predates the W2 CLI engineering work (commits a9bdacc + f6201a8) that added --subject-model and the OpenRouter fallback for run ablation. Operators on the direct Anthropic API can reproduce at either v0.1.0 or main; operators on OpenRouter-only environments need main (or a future v0.1.1 tag) for the run ablation step.

Case study

docs/case-studies/ai-slop-sentinel-under-ablation.md — a real audit trail of the discipline catching its own author across three classes of inconsistency (documentation drift, operational state, orchestrator precondition gap) before any contaminated result shipped. The deliverable is the chain of refusals, not a number.

Why UNMEASURED is not a failure: docs/concepts/why-unmeasured.md

Full specification

PRD.md — wire format, oracle tiering, aggregation rules, CLI surface, and the invariants the framework is built around.

Architecture and internals

docs/internals/README.md — track layout, database partition, discipline rationale, and development setup.
