🇺🇸 English | 🇨🇳 中文
The hypothesis: AI coding agents fail because they don't follow rules carefully enough. Give them more rules, make the rules stick harder, and quality goes up.
I tested it. Three months, two frameworks, three controlled experiments.
The data said the hypothesis was wrong.
This repo is what survived.
I wrote `results-driven`: a 30-line cognitive protocol. Five rules:
- Done means "the goal is met, here's evidence." Not "I changed the code."
- Every conclusion needs data.
- You haven't failed until you've tried three fundamentally different approaches.
- After the primary task is done, circle back to the secondary requirements.
- Internal self-check before every response.
Same agent, same task, same prompt — and sometimes the rules worked, sometimes they didn't. Thirty lines in the system prompt is a safety notice on a wall. Written, not read.
`cognitive-kernel` was the doubling down. Six intervention layers, weakest to strongest:
- L6 — on-demand loading (read when needed)
- L5 — core rules pinned to context
- L3 — IF-THEN triggers (e.g., detect "should work" in draft → require specific verification step)
- L2 — platform hooks (inject reminders before Edit/Write tool calls)
- L4 — spawn an adversarial-review sub-agent for high-risk changes
- L1 — output template (response must include `## Completion Evidence` and `## What This Breaks` sections; missing structure = not done)
Internally consistent. Multi-layered. Theoretically complete. I was proud of it.
Then I ran experiments.
Three head-to-head trials. Same task each time. Bare agent vs Kernel-augmented agent.
Experiment 1 — Simple task, Sonnet. Three Python utility functions with tests.
Kernel version produced a beautiful `## Completion Evidence` section. Walked through every requirement. Then the tests failed — it imported pytest and the environment didn't have it. Bare agent used unittest from the standard library. 30/30 pass.
My runtime made the report prettier. The code got worse.
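For reference, the stdlib-only pattern the bare agent reached for looks like this. The utility function here is hypothetical, not the experiment's actual code; the point is that `unittest` ships with Python, so there is no dependency to miss:

```python
import unittest

def chunk(seq, size):
    """Split a sequence into fixed-size pieces (hypothetical utility)."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [seq[i:i + size] for i in range(0, len(seq), size)]

class ChunkTest(unittest.TestCase):
    # Runs with `python -m unittest`; no pytest install needed.
    def test_even_split(self):
        self.assertEqual(chunk([1, 2, 3, 4], 2), [[1, 2], [3, 4]])

    def test_remainder(self):
        self.assertEqual(chunk([1, 2, 3], 2), [[1, 2], [3]])
```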
Experiment 2 — Medium task, Opus. A cache + config system. Three files, six requirements, one hidden "thread safety" requirement buried in the spec.
Both versions completed every requirement. Both passed 43/43 tests. Code quality essentially identical.
Opus alone was careful enough. My six-layer edifice: zero marginal benefit.
Experiment 3 — Harder task, Opus. An order-processing system. Seven files, 800 lines, twelve explicit requirements plus two implicit ones.
| | Bare Opus | Kernel Opus |
|---|---|---|
| Tests pass | 31/31 | 29/29 |
| Requirements (incl. implicit) | All | All |
| Tool calls | 67 | 28 |
| Time | 292s | 195s |
Code quality: no difference. But Kernel used 58% fewer tool calls and finished 33% faster. The one place where Control earned its keep — not better code, more deliberate execution.
Three experiments. The honest conclusion: three months on Control design, zero contribution to code quality.
There was a fatal limitation in those experiments. The biggest task was 800 lines. My real projects start at 100k lines.
On 100k-line codebases, where do agents actually fail? I listed the modes:
- Insufficient search — changed an interface, didn't find every caller.
- Doesn't know the architecture — made the change at the wrong layer.
- Context window exhausted — read too much, forgot the earlier info.
- Didn't run the right tests — ran units, missed integration.
- Doesn't know project conventions — correct feature, wrong style.
Writing the list, I realized: none of these are attitude problems. They're all information problems.
"Check every requirement before claiming done" doesn't fix insufficient search. Under the ## What This Breaks heading, the agent dutifully writes "checked, no impact on other modules." But it didn't search everywhere. What's that sentence based on?
I'd spent months designing intervention layers to make agents more careful. But agents weren't being careless. They were flying blind.
They don't skip impact checks — they don't know what to search for. They don't skip integration tests — they don't know they exist. They don't violate the style guide — they don't know the style guide exists.
| What I was doing (Control) | What the agent actually needs (Context) |
|---|---|
| "Before claiming done, check each requirement" | "Before editing models/order.py, grep -r 'Order\\.' src/" |
| "Think: what will this break?" | "Order status changes must go through state_machine.transition()" |
| "Give completion evidence" | pytest tests/unit/ && pytest tests/integration/ -x -q |
Left side: the agent obeys, then produces confident conclusions from incomplete information.
Right side: hands the agent the correct information directly.
One line — "don't mutate order.status directly" — closes one bug permanently. Six layers of "think about what this might break" just produces more confident "won't break anything" — because the agent still doesn't know a state machine exists.
You can't constrain ignorance. Three months, one sentence.
Knowing what I know now, I rebuilt the framework. Context is the main thing. Control stays, lightweight. A 5:1 ratio.
CLAUDE.md:
```
@.better-work/shared/index.md       ← Context: project knowledge (≤150 lines)
@.better-work/code/protocol.md      ← Control: discipline constraints (≤15 lines)
@~/.claude/CLAUDE.md has @protocol  ← Control: execution floor (≤30 lines)
```
Context side — the 150 lines. Each line plugs a specific failure mode:
- Architecture overview → "doesn't know the architecture"
- Danger zones → "insufficient search"
- Coding conventions → "doesn't know project conventions"
- Test commands → "didn't run the right tests"
- Fast file locator → "context window exhausted"
Not written by hand. Signal-driven: git log picks hot files, import analysis finds danger zones, agent behavior during actual sessions seeds conventions and locator entries.
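The "git log picks hot files" step reduces to a change-frequency count. A minimal sketch, with hypothetical per-commit file lists standing in for real `git log --format= --name-only` output:

```python
from collections import Counter

def hot_files(commits, top=5):
    """Rank files by how many commits touched them (a rough hot-file signal)."""
    counts = Counter(path for files in commits for path in files)
    return counts.most_common(top)

# In practice the per-commit file lists would come from:
#   git log --format= --name-only
commits = [
    ["src/order.py", "src/cache.py"],
    ["src/order.py"],
    ["src/order.py", "src/config.py"],
]
print(hot_files(commits, top=2))  # src/order.py leads with 3 touches
```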
Control side — the 30 lines. Kept the pieces that survived the experiments:
- Completion standards — re-read the request, check every part, give observable evidence.
- Behavior triggers — catch "should work" → require verification; catch "I can't" → require three attempts; catch activity-reporting → restructure as outcome.
- Planning hints — what produced the 58% tool-call reduction in Experiment 3.
Six intervention layers compressed into thirty lines. Not because the six layers were wrong in theory — because in a codebase where the agent doesn't know a state machine exists, no amount of Control will stop it from editing order.status.
better-work is the series entry point. It implements the Lite Control side and routes to discipline-specific Context sources.
| Repo | What it owns |
|---|---|
| better-work (this repo) | Lite Control: 30-line execution protocol + Cognitive Layer + project init + series routing |
| better-code | Full Context for coding: architecture, conventions, danger zones |
| better-test | Full Context for testing: test groups, impact maps, known failures |
Install any subset. Install better-code alone and you get coding Context without the Lite Control. Install better-work alone and you get Lite Control without the Context. The series is designed so subskills interoperate — but each stands on its own.
Three paths depending on your context. Path A is the recommended default for new users.
Path A: tell your Claude Code agent to install the series. It will run these commands:
```bash
mkdir -p ~/.better-work-series ~/.claude/skills

# Main skill (always install this)
git clone https://github.com/d-wwei/better-work.git ~/.better-work-series/better-work
ln -sfn ~/.better-work-series/better-work/skills/better-work ~/.claude/skills/better-work

# Optional: better-code (development knowledge)
git clone https://github.com/d-wwei/better-code.git ~/.better-work-series/better-code
ln -sfn ~/.better-work-series/better-code ~/.claude/skills/better-code

# Optional: better-test (testing knowledge)
git clone https://github.com/d-wwei/better-test.git ~/.better-work-series/better-test
ln -sfn ~/.better-work-series/better-test ~/.claude/skills/better-test

# Inject always-on protocol into global CLAUDE.md (idempotent)
grep -q "better-work/protocol.md" ~/.claude/CLAUDE.md || \
  echo "@~/.claude/skills/better-work/protocol.md" >> ~/.claude/CLAUDE.md
```

Open a new Claude Code session to load the newly installed skills.
Path B: run the install script directly.

```bash
# Main skill only
curl -fsSL https://raw.githubusercontent.com/d-wwei/better-work/main/install.sh | bash

# Full series
curl -fsSL https://raw.githubusercontent.com/d-wwei/better-work/main/install.sh | bash -s -- --with code --with test

# Update installed skills later
~/.better-work-series/better-work/install.sh --update

# Remove symlinks (source repos preserved)
~/.better-work-series/better-work/install.sh --uninstall
```

Run `./install.sh --help` for all options (`--install-dir`, `--version`, `--skip-protocol`, `--protocol-scope=project`).
Path C: adapter files live in `codex/`, `cursor/`, and `vscode/`. See each adapter directory's README for platform-specific install commands.
Both paths default to `~/.better-work-series/<skill>/` as the physical clone location. The `~/.claude/skills/<skill>/` symlink resolves transparently regardless of the physical path, so:

- Path B's `--install-dir <path>` flag overrides the default clone location; symlinks still land in `~/.claude/skills/` and keep working
- Moving the series directory later only requires re-running `install.sh --update` (which re-creates symlinks) or manually `ln -sfn` to the new location
- `~/.claude/skills/better-work/` always resolves to `<install-dir>/better-work/skills/better-work/` (note the nested `skills/better-work/` — `better-work` as a repo contains other platform adapters at its root)
Inside a project:
```
/better-work init
```
This creates .better-work/ in the project (a symlink to ~/.better-work/<project>/), writes protocol.md, injects @ references into ~/.claude/CLAUDE.md, and offers to initialize installed subskills (better-code, better-test) to fill the Context side.
On multi-step work:
```
/better-work rounds   # PDCA-style bounded rounds for locally complex tasks
/better-work waves    # project-scale staged delivery
/better-work verify   # verification-heavy mode
/better-work handoff  # produce a resumable handoff document
```
Series infrastructure:
| Command | What it does |
|---|---|
| `/better-work init` | First-time setup — symlinks, CLAUDE.md injection, subskill init |
| `/better-work status` | Diagnose the current project's better-work setup |
| `/better-work code <cmd>` | Route to better-code subskill |
| `/better-work test <cmd>` | Route to better-test subskill |
Execution modes:
| Command | When to use |
|---|---|
| `/better-work` | Default — small-to-medium bounded tasks |
| `/better-work rounds` | Current slice needs PDCA loops with quality gates |
| `/better-work waves` | Project-scale, staged delivery across sessions |
| `/better-work verify` | Testing-heavy changes |
| `/better-work unstick` | Recovery when the agent is looping |
| `/better-work handoff` | Produce a handoff doc for another session |
Thirty lines. Three sections:
## Operating Standards (4 principles, compressed)
- Exhaust reasonable paths before declaring a blocker
- Investigate before asking
- Close the loop (no "done" without verification)
- Use just enough structure
## Cognitive Rules (always-on, 5 IF/THEN)
1. IF claiming done → re-read the request; every part addressed? Evidence?
2. IF two paths and one is simpler → explain why simple isn't avoidance
3. IF writing "should work" → replace with a specific verification step
4. IF stating a fact → mark investigated vs speculation ("not verified")
5. IF using vague wording → investigate or rewrite as explicit uncertainty
Full framework (Output Protocol, Behavior Triggers, Adversarial Review):
SKILL.md § Cognitive Layer
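Rule 3 can be sketched mechanically: scan a draft response for hedging phrases and flag each one before the response goes out. This is an illustration only; the hedge list and names below are hypothetical, not the skill's actual implementation:

```python
import re

# Illustrative hedge list; the skill's real trigger set lives in SKILL.md.
HEDGES = [r"\bshould work\b", r"\bprobably works\b", r"\blikely fine\b"]

def unverified_claims(draft):
    """Return every hedging phrase found in a draft response."""
    return [m.group(0)
            for pat in HEDGES
            for m in re.finditer(pat, draft, re.IGNORECASE)]

draft = "Refactored the cache layer; it should work with the old config."
for hit in unverified_claims(draft):
    # Rule 3: replace the hedge with a concrete verification step.
    print("replace before sending:", hit)
```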
The Cognitive Layer — loaded on demand from SKILL.md — absorbed the useful pieces of cognitive-kernel:
- Output Protocol — field templates for proposing / claiming-done responses
- Behavior Triggers — 8 IF/THEN rules including the results-driven originals
- Adversarial Review — spawn a reviewer sub-agent when ≥5 files change or a public interface shifts
- The experiments were small-scale. The 58% tool-call reduction is from 800-line projects. No controlled data yet on 100k-line codebases — the Full Context framework is the bet that it'll scale where Control alone didn't.
- Context quality depends on subskills. Better-work alone is just the Lite Control layer. Without better-code (or an equivalent Context source) the agent still flies blind on project knowledge.
- `protocol.md` auto-injects globally by default. `/better-work init` appends a line to `~/.claude/CLAUDE.md`. Opt out with `--skip-protocol`, or scope to the project with `--protocol-scope=project`.
- Signal-driven updates aren't automatic. `/better-code update` and `/better-test update` still need explicit invocation; no filesystem watcher.
- Adapter files (cursor/codex/vscode) may lag the main SKILL.md. Native Claude Code reflects the current state; other platforms use embedded copies that are refreshed manually.
MIT.
Questions, issues, discussion: GitHub issues.