The self-improving harness for AI coding agents.
Drop-in agent harness that makes AI agents govern themselves. The agent builds its own rules, enforces them mechanically, and gets better every time it fails. No framework. No runtime. Just structure that actually sticks.
AI agents drift. The longer the task, the worse it gets.
- They ignore system prompts after a few thousand tokens
- They skip steps when the work gets complex
- They make the same mistake twice because nothing enforces learning
- They produce inconsistent output across sessions
- Memory and RAG give them context — but no obligation to use it
Every existing solution is advisory. Give the agent instructions, context, or memory — and hope it follows them. It doesn't. Not reliably. Not on long tasks. Not under pressure.
The Isagawa Kernel introduces Spec-Driven Development (SDD) — a framework where the agent builds, enforces, and improves its own governance from a domain spec. The agent internalizes the spec, builds its own protocol and enforcement, and from then on mechanically cannot violate it.
The agent physically cannot skip a quality check, ignore a failure, or proceed without recording what it learned. It's not "please follow this spec" — it's "you are blocked until you do."
You (drop kernel in) → Agent scans repo → Builds own protocol
→ Builds own enforcement
→ Builds own task queue
→ Starts working (governed)
→ Fails? Learns. Gets harder to break.
You don't write the rules. The agent does. And then it can't break them.
SDD flips the traditional relationship between developer and agent:
| Traditional | SDD + Kernel |
|---|---|
| You write the spec and hope | Agent internalizes the spec and enforces it |
| You enforce quality | Agent enforces its own quality |
| You update after failures | Agent updates itself after failures |
| Agent follows instructions (sometimes) | Agent follows its own rules (always — mechanically enforced) |
| Quality depends on prompt engineering | Quality compounds automatically over time |
The kernel is the runtime that makes SDD possible. Domain specs are the knowledge that makes it domain-aware.
| Approach | What happens when the agent drifts |
|---|---|
| System prompts | Agent ignores them after 10K tokens |
| RAG / memory | Agent has context but no obligation to use it |
| Static specs | Agent skips steps when they're hard |
| Fine-tuning | Expensive, brittle, doesn't generalize |
| Isagawa Kernel | Agent is blocked. Cannot proceed until compliant. |
Other tools help agents remember. This one makes them behave.
Without a domain spec (kernel only):
- Drop the kernel into any repo
- Agent scans your codebase, builds a protocol from what it finds
- Governance loop enforces the protocol as the agent works
- Every failure becomes a permanent lesson
- Cross-session persistence — the agent picks up where it left off
With a domain spec (kernel + spec):
- Same as above, plus domain-specific knowledge
- Agent builds to your industry's patterns, conventions, and quality gates
- Specs are portable — drop one into any project with the kernel installed
Session 1: Agent sets up. Builds protocol. Learns the codebase.
Session 5: Agent catches its own mistakes. Protocol is tightening.
Session 20: Governance is battle-tested. Same patterns, every time.
Session 50+: One-shot quality. The system has seen every failure mode.
You have a web app. You want governed test automation. Here's what happens:
# 1. Clone a release repo (kernel + Playwright spec bundled)
git clone https://github.com/isagawa-qa/platform-playwright.git my-qa-project

- Open `my-qa-project` in VS Code → open the Claude Code panel
- Invoke `/qa-workflow` — the spec's slash command that starts governed test codegen
What the agent does (governed, not guessing):
Session Start → Restores state from last session (or fresh start)
Domain Setup → Scans your repo + reads the Playwright QA spec
→ Reads reference files: page objects, tasks, roles, test specs
→ Builds protocol: 5-layer architecture, selector priority, interface-first rules
→ Creates enforcement: quality gates for every layer
→ Builds task queue from your request
QA Workflow begins (5-step, gate-enforced):
Step 1: Input → Captures URL, persona, test type (UI/API/hybrid)
Step 2: Pre-flight → Configures credentials, auth strategy
Step 3: Processing → Generates BDD scenarios and expected states
Step 4: Construction (builds layer by layer, spec-governed):
→ Page Objects → framework/pages/login/login-page.ts
(role-based selectors, BrowserInterface methods, atomic actions)
→ Tasks → framework/tasks/login/login-tasks.ts
(composes page objects, @autologger decorator, void returns)
→ Roles → framework/roles/login/standard-user-role.ts
(composes tasks, multi-step workflows)
→ Tests → tests/login/test-login.spec.ts
(AAA pattern, one role call per test, POM state assertions)
Step 5: Execution → npx playwright test --headed
→ Pass? Done.
→ Fail? Agent STOPS. Reports to user. Waits for decision.
→ User picks: AI fix / manual fix / investigate / skip
→ If fix → /kernel/learn records lesson → retry
Failure example:
Agent uses CSS class selector (.btn-primary) in page object
→ Spec enforces: role-based selectors first (role=button[name="Log in"])
→ Blocked until lesson recorded
→ Learns: "use role selectors, CSS is last resort"
→ Lesson encoded permanently → same mistake can't recur
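To make the learn step concrete, a recorded lesson can be as simple as a structured entry appended to a JSON file that the governance loop re-reads on every anchor. This is an illustrative sketch only — the file path and field names here are hypothetical, not the kernel's actual schema:

```python
import json
from pathlib import Path

LESSONS_FILE = Path(".claude/state/lessons.json")  # hypothetical location

def record_lesson(mistake: str, rule: str, trigger: str) -> dict:
    """Append a permanent lesson entry (illustrative schema)."""
    lessons = json.loads(LESSONS_FILE.read_text()) if LESSONS_FILE.exists() else []
    entry = {
        "id": len(lessons) + 1,
        "mistake": mistake,   # what went wrong
        "rule": rule,         # the rule that now prevents it
        "trigger": trigger,   # pattern the enforcement can watch for
    }
    lessons.append(entry)
    LESSONS_FILE.parent.mkdir(parents=True, exist_ok=True)
    LESSONS_FILE.write_text(json.dumps(lessons, indent=2))
    return entry

# The selector failure above, encoded as a lesson:
entry = record_lesson(
    mistake="Used CSS class selector .btn-primary in a page object",
    rule="Prefer role-based selectors; CSS classes are a last resort",
    trigger="page.locator('.",
)
```

Because the lesson lives in a state file rather than the conversation, it survives context loss and session restarts — which is what makes "can't recur" mechanical rather than aspirational.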
What you did: Cloned a repo, opened it, said one sentence.
What you got: A governed test script built to the spec's 5-layer architecture — page objects, tasks, roles, and test — with every failure captured as a permanent lesson.
The fastest way to see Spec-Driven Development in action:
git clone https://github.com/isagawa-qa/platform-playwright.git

Open platform-playwright in VS Code with the Claude Code extension. Tell the agent to build tests for any web app. Watch it set up its own protocol, create enforcement, and start cycling through tasks — governed the entire time. That's the kernel + a domain spec working together.
- VS Code with the Claude Code extension
- Python 3.8+
- That's it. No database, no Docker, no cloud.
# Clone the kernel
git clone https://github.com/isagawa-co/isagawa-kernel.git
# Copy into your project
cp -r isagawa-kernel/.claude/ your-project/.claude/
cp isagawa-kernel/CLAUDE.md your-project/CLAUDE.md

- Open `your-project` in VS Code
- Open the Claude Code panel (sidebar or `Ctrl+Shift+P` → "Claude Code")
- Give the agent any task
The agent will:
- Detect no domain exists
- Scan your repo structure, code patterns, and conventions
- Build its own protocol and enforcement
- Ask you to restart the extension (enforcement hooks need a reload)
- Pick up where it left off — fully governed
That's it. Five minutes from clone to governed agent.
session-start → anchor → WORK ─────────────────→ complete
                  ^        │
                  └── every N actions ←────────────┘
                           │
                           failure? → fix → learn (MANDATORY)
| Step | What happens |
|---|---|
| Session Start | Checks state, restores context, resumes from where it left off |
| Anchor | Re-reads protocol, audits recent work against it |
| Work | Builds — every action tracked automatically |
| Every N actions | Forced to stop, re-read protocol, check quality |
| On failure | Must diagnose, fix, and record what it learned |
| Complete | Final quality gate before marking done |
Skipped a quality check? Blocked. Test failed and didn't record the lesson? Blocked. Went too long without re-reading the protocol? Blocked.
The agent can't skip steps. If it tries, it's stopped until it complies.
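The blocking above happens at the tool-call level. As a rough sketch of the idea — class name, tool names, and the anchor interval are all assumptions for illustration, not the kernel's actual enforcement script — a gate only has to answer one question before each action: allowed or blocked, and why:

```python
class GovernanceGate:
    """Illustrative sketch: gate tool calls, block until compliant."""

    def __init__(self, anchor_interval=10):
        self.anchor_interval = anchor_interval
        self.actions_since_anchor = 0
        self.pending_lesson = False  # set when a failure hasn't been recorded yet

    def check(self, tool):
        """Return (allowed, reason); write/edit/execute calls are gated."""
        if self.pending_lesson and tool in ("write", "edit", "execute"):
            return (False, "Blocked: record the lesson from the last failure first.")
        if self.actions_since_anchor >= self.anchor_interval:
            return (False, "Blocked: re-read the protocol and audit recent work (anchor).")
        self.actions_since_anchor += 1
        return (True, "ok")

    def anchor(self):
        """Agent re-read the protocol; reset the action counter."""
        self.actions_since_anchor = 0

gate = GovernanceGate(anchor_interval=2)
print(gate.check("write"))  # allowed
print(gate.check("edit"))   # allowed
print(gate.check("write"))  # blocked: anchor is due
gate.anchor()
print(gate.check("write"))  # allowed again
```

Note the second half of the contract: every block comes with a reason, so the agent always has a concrete path back to compliance rather than a bare refusal.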
Three layers. The kernel governs, the spec teaches, the agent builds everything else.
┌─────────────────────────────────────────────┐
│ Agent-Generated │
│ Protocol, lessons, enforcement, tasks │
│ (the agent builds this — you don't) │
├─────────────────────────────────────────────┤
│ Domain Spec (optional) │
│ Industry knowledge for a specific domain │
├─────────────────────────────────────────────┤
│ Kernel (this repo) │
│ Governance loop — always present │
└─────────────────────────────────────────────┘
- Kernel governs how the agent works. Ships with every project.
- Domain Spec teaches what to build. Drop one in, or let the agent figure it out from your repo.
- Agent-Generated is everything the agent creates — protocols, lessons, task queues. All self-built, all self-improving.
The kernel governs. Domain specs teach. Together, you get a governed agent that knows your industry.
A domain spec is a portable skill folder — markdown files that encode domain knowledge, patterns, conventions, and quality gates for a specific vertical. Drop one into any project with the kernel installed, and the agent builds to that domain's standards.
# You already have the kernel installed in your project.
# Now add a domain spec:
# Option 1: Clone a published spec
git clone https://github.com/isagawa-qa/selenium-spec.git
cp -r selenium-spec/ your-project/.claude/skills/selenium-spec/
# Option 2: Already bundled (kernel + spec ship together)
git clone https://github.com/isagawa-qa/platform-selenium.git
# Open platform-selenium in VS Code → open Claude Code panel. Just start.

The agent reads the domain spec during setup, merges it with what it discovers in your repo, and builds a protocol that covers both your codebase and your domain's requirements.
One kernel. Many specs. Each project gets domain-aware governance.
| Spec | What the agent builds | Status |
|---|---|---|
| QA Platform (Selenium) | Selenium test automation with Page Object patterns | Live |
| QA Platform (Playwright) | Playwright test suites with fixture-based architecture | Live |
| DevOps (CI/CD) | GitHub Actions, GitLab CI, Azure DevOps pipelines | Coming soon |
| Health Insurance | EDI testing, claims processing, benefits configuration | Coming soon |
| Real Estate | Lease-option deal management, buyer matching | Coming soon |
The kernel works without a domain spec. It scans your repo, builds a protocol from what it finds, and governs the agent's work. The spec just makes it domain-aware — it's the difference between a governed agent and a governed agent that knows your industry.
Domain specs are just markdown skill folders. No code. No API. No schema to learn.
A spec typically contains:
- Workflow — how the agent should approach work in this domain
- Reference patterns — conventions, naming, architecture the agent should follow
- Seeded lessons — known pitfalls in this domain, pre-loaded so the agent doesn't hit them
- Task templates — common work items the agent can cycle through
See the Selenium spec for a working example. Build one for your vertical — the kernel handles enforcement.
Give the agent a queue of tasks. It picks the next one, implements it, verifies it, marks it complete, and moves to the next — autonomously. No human in the loop between tasks.
The governance loop runs the entire time. Every task gets the same quality enforcement. The agent can cycle through dozens of tasks across multiple sessions, picking up exactly where it left off.
Task 001 → implement → verify → complete ✓
Task 002 → implement → verify → complete ✓
Task 003 → implement → fail → fix → learn → verify → complete ✓
Task 004 → implement → verify → complete ✓
...
Task 071 → done. Full project built, governed end to end.
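The cycle above reduces to a simple loop: take the next task, implement, verify, and on failure run the mandatory learn step before retrying. A minimal sketch, assuming illustrative callbacks (`implement`, `verify`, `learn` are hypothetical stand-ins, not the kernel's interface):

```python
def cycle(tasks, implement, verify, learn):
    """Work a queue autonomously: implement, verify, learn on failure, retry."""
    completed = []
    for task in tasks:
        while True:
            result = implement(task)
            if verify(result):
                completed.append(task)
                break
            learn(task, result)  # mandatory lesson before any retry
    return completed

# Toy demo: task-003 fails once, triggers a lesson, then passes on retry.
attempts = {}
def implement(task):
    attempts[task] = attempts.get(task, 0) + 1
    return (task, attempts[task])

def verify(result):
    task, attempt = result
    return task != "task-003" or attempt >= 2

lessons = []
def learn(task, result):
    lessons.append(f"lesson for {task}")

done = cycle(["task-001", "task-002", "task-003"], implement, verify, learn)
```

The design choice worth noting: a task cannot leave the inner loop without either passing verification or producing a lesson, which is exactly the no-silent-failure guarantee the governance loop enforces.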
Point the agent at any codebase. It scans the repo — file structure, code patterns, naming conventions, test frameworks, dependencies — and builds its own protocol from scratch. No templates. No configuration. The agent figures out what matters and writes the rules itself.
Add a domain spec and the setup gets smarter — the agent merges domain knowledge with what it discovers in your repo.
When something breaks, the agent can't just fix it and move on. It must record what went wrong, why it went wrong, and what changed to prevent it. That lesson becomes part of the system permanently.
Session 1 failure → lesson recorded → session 50, same mistake is impossible. The system has antibodies.
The agent saves its full context — what it was working on, decisions made, progress through task queues — to state files. Next session, it restores everything and picks up mid-task. No re-explaining. No context loss.
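Since the doc describes state as plain files ("Markdown and JSON. No build step."), persistence can be sketched as a save/restore pair around a JSON state file. The path and keys below are hypothetical, chosen only to illustrate the shape:

```python
import json
from pathlib import Path

STATE_FILE = Path(".claude/state/session.json")  # hypothetical path

def save_state(state):
    """Persist full working context at any point in a session."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2))

def restore_state():
    """Restore context at session start, or begin fresh."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"current_task": None, "completed": [], "decisions": []}

# End of session 1: persist progress mid-queue.
save_state({
    "current_task": "task-003",
    "completed": ["task-001", "task-002"],
    "decisions": ["use role-based selectors"],
})

# Start of session 2: pick up exactly where session 1 stopped.
state = restore_state()
```

Because restore runs at session start (the first step of the governance loop), the agent resumes mid-task with its decisions intact instead of re-deriving them from a fresh prompt.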
Agents drift over long tasks. The kernel forces periodic stops where the agent must re-read its own protocol and audit its recent work against it. If the work doesn't match the protocol, the agent self-corrects before continuing.
This is the difference between "instructions at the top of the conversation" and "instructions enforced continuously throughout."
The enforcement knows what's wrong and tells the agent exactly how to fix it. Not just "blocked" — but "blocked because X, do Y to proceed." The agent always has a clear path forward.
- Self-Building — The agent internalizes the spec and creates its own protocols and enforcement from it.
- Self-Improving — Every failure updates the system. The same mistake can't happen twice.
- Safety-First — Enforcement is mechanical, not advisory. Can't be bypassed or ignored.
- Autonomous — The agent reports what it did rather than asking what to do.
- Portable — Markdown and JSON. No build step. No vendor lock-in. Works anywhere Claude Code runs.
Solo developers — Give your agent structure it actually follows. Stop re-explaining conventions every session. The kernel remembers and enforces.
Teams — Share a kernel + domain spec across the team. Every developer's agent follows the same patterns, conventions, and quality gates. Consistency without code review bottlenecks.
Consultants / Agencies — Build domain specs for your clients' verticals. Drop kernel + spec into their repo. Their agent builds to your standards, governed, from day one.
QA / Test Automation — Pair with a QA domain spec. Agent builds test suites that follow your framework's patterns. Every test failure becomes a lesson that prevents the next one.
What is SDD? Spec-Driven Development. A framework where the agent internalizes a domain spec, builds its own governance from it, and enforces compliance mechanically. The kernel is the runtime that makes SDD work.
Is this a framework? SDD is the framework. The kernel is the implementation. It's a set of markdown files and one enforcement script. No imports, no APIs, no build step. Copy it in, start working.
Does it work without a domain spec? Yes. The kernel scans your repo and builds a protocol from what it finds. The domain spec adds industry-specific knowledge, but the governance loop works either way.
Does the agent really improve over time? Yes. Every failure triggers a mandatory learn cycle. The lesson gets encoded into the system so the same failure can't happen again. This is mechanical, not aspirational.
Can the agent bypass the enforcement? No. The enforcement operates at the tool-call level. The agent cannot write, edit, or execute when blocked. It must comply first.
What AI agents does it work with? Currently Claude Code. The architecture is agent-agnostic — any agent runtime that supports tool-call interception can run it.
How is this different from giving the agent a CLAUDE.md file? CLAUDE.md is advisory — the agent reads it once and drifts. The kernel re-reads the protocol every N actions, audits work against it, and blocks non-compliance. Structure that persists, not instructions that decay.
Can I use my own domain spec with it?
Yes. Domain specs are just markdown skill folders. Build one for your vertical, drop it into .claude/skills/, and the kernel handles the rest. See the Selenium spec for the pattern.
How do domain specs work with teams? Everyone clones the same repo with kernel + spec installed. Every developer's agent builds to the same protocol. Same patterns, same quality gates, same enforcement. New team member? Clone and go.
- Core governance loop (session-start, anchor, learn, complete)
- Self-building domain setup (agent creates its own protocol)
- Autonomous task cycling (agent works through task queues)
- Cross-session persistence (state survives restarts)
- Domain spec system (portable, drop-in skill folders)
- Multi-agent coordination (governed agents working together)
- Spec marketplace (discover and install community domain specs)
- Additional agent runtime support
See CONTRIBUTING.md for guidelines on submitting domain specs and kernel improvements.