🇺🇸 English | 🇨🇳 中文

better-work

Three months on a hypothesis. The data came back the opposite way.

The hypothesis: AI coding agents fail because they don't follow rules carefully enough. Give them more rules, make the rules stick harder, and quality goes up.

I tested it. Three months, two frameworks, three controlled experiments.

The data said the hypothesis was wrong.

This repo is what survived.

What I tried first

Attempt 1 — Rules in the system prompt (didn't stick)

I wrote results-driven: a 30-line cognitive protocol. Five rules:

  • Done means "the goal is met, here's evidence." Not "I changed the code."
  • Every conclusion needs data.
  • You haven't failed until you've tried three fundamentally different approaches.
  • After the primary task is done, circle back to the secondary requirements.
  • Internal self-check before every response.

Same agent, same task, same prompt — and sometimes the rules worked, sometimes they didn't. Thirty lines in the system prompt is a safety notice on a wall. Written, not read.

Attempt 2 — A 6-layer intervention runtime (over-engineered)

cognitive-kernel was the doubling down. Six intervention layers, weakest to strongest:

  • L6 — on-demand loading (read when needed)
  • L5 — core rules pinned to context
  • L4 — spawn an adversarial-review sub-agent for high-risk changes
  • L3 — IF-THEN triggers (e.g., detect "should work" in a draft → require a specific verification step)
  • L2 — platform hooks (inject reminders before Edit/Write tool calls)
  • L1 — output template (response must include ## Completion Evidence and ## What This Breaks sections; missing structure = not done)
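The L1 layer is mechanically checkable. A minimal sketch of such a structural gate (the file name and section list are illustrative, not cognitive-kernel's actual implementation):

```shell
# Hypothetical L1 gate: a draft response is "not done" unless every
# required section header is present.
draft="response.md"
status=0
for section in "## Completion Evidence" "## What This Breaks"; do
  grep -qF "$section" "$draft" || { echo "missing: $section"; status=1; }
done
[ "$status" -eq 0 ] && echo "structure ok" || echo "not done"
```

As the experiments below show, a gate like this guarantees the report's shape, not the code's correctness.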

Internally consistent. Multi-layered. Theoretically complete. I was proud of it.

Then I ran experiments.

What the experiments showed

Three head-to-head trials. Same task each time. Bare agent vs Kernel-augmented agent.

Experiment 1 — Simple task, Sonnet. Three Python utility functions with tests.

Kernel version produced a beautiful ## Completion Evidence section. Walked through every requirement. Then the tests failed — it imported pytest and the environment didn't have it. Bare agent used unittest from the standard library. 30/30 pass.

My runtime made the report prettier. The code got worse.

Experiment 2 — Medium task, Opus. A cache + config system. Three files, six requirements, one hidden "thread safety" requirement buried in the spec.

Both versions completed every requirement. Both passed 43/43 tests. Code quality essentially identical.

Opus alone was careful enough. My six-layer edifice: zero marginal benefit.

Experiment 3 — Harder task, Opus. An order-processing system. Seven files, 800 lines, twelve explicit requirements plus two implicit ones.

|                               | Bare Opus | Kernel Opus |
|-------------------------------|-----------|-------------|
| Tests pass                    | 31/31     | 29/29       |
| Requirements (incl. implicit) | All       | All         |
| Tool calls                    | 67        | 28          |
| Time                          | 292s      | 195s        |

Code quality: no difference. But Kernel used 58% fewer tool calls and finished 33% faster. The one signal where Control earned its keep: not better code, but more deliberate execution.

Three experiments. The honest conclusion: three months on Control design, zero contribution to code quality.

The first-principles check

There was a fatal limitation in those experiments. The biggest task was 800 lines. My real projects start at 100k lines.

On 100k-line codebases, where do agents actually fail? I listed the modes:

  1. Insufficient search — changed an interface, didn't find every caller.
  2. Doesn't know the architecture — made the change at the wrong layer.
  3. Context window exhausted — read too much, forgot the earlier info.
  4. Didn't run the right tests — ran units, missed integration.
  5. Doesn't know project conventions — correct feature, wrong style.

Writing the list, I realized: none of these are attitude problems. They're all information problems.

"Check every requirement before claiming done" doesn't fix insufficient search. Under the ## What This Breaks heading, the agent dutifully writes "checked, no impact on other modules." But it didn't search everywhere. What's that sentence based on?

The insight I paid for

I'd spent months designing intervention layers to make agents more careful. But agents weren't being careless. They were flying blind.

They don't skip impact checks — they don't know what to search for. They don't skip integration tests — they don't know they exist. They don't violate the style guide — they don't know the style guide exists.

| What I was doing (Control)                     | What the agent actually needs (Context)                          |
|------------------------------------------------|------------------------------------------------------------------|
| "Before claiming done, check each requirement" | "Before editing models/order.py, run grep -r 'Order\.' src/"     |
| "Think: what will this break?"                 | "Order status changes must go through state_machine.transition()" |
| "Give completion evidence"                     | pytest tests/unit/ && pytest tests/integration/ -x -q            |

Left side: the agent obeys, then produces confident conclusions from incomplete information.

Right side: hands the agent the correct information directly.
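The right-hand entries are literally runnable. A sketch of the pre-edit impact search, using the hypothetical paths and names from the comparison above:

```shell
# Before touching models/order.py, enumerate every caller of Order.*
# so "no impact on other modules" is a grep result, not a guess.
grep -rn "Order\." src/ --include='*.py' || echo "no callers found"
```

Each hit is a file and line the change can reach; an empty result is evidence, while a skipped search is not.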

One line — "don't mutate order.status directly" — closes one bug permanently. Six layers of "think about what this might break" just produce a more confident "won't break anything", because the agent still doesn't know a state machine exists.

You can't constrain ignorance. Three months, one sentence.

The principle: Full Context, Lite Control

Knowing what I know now, I rebuilt the framework. Context is the main thing. Control stays, lightweight. A 5:1 ratio.

CLAUDE.md:
  @.better-work/shared/index.md       ← Context: project knowledge (≤150 lines)
  @.better-work/code/protocol.md      ← Control: discipline constraints (≤15 lines)
  @~/.claude/CLAUDE.md has @protocol  ← Control: execution floor (≤30 lines)

Context side — the 150 lines. Each line plugs a specific failure mode:

  • Architecture overview → "doesn't know the architecture"
  • Danger zones → "insufficient search"
  • Coding conventions → "doesn't know project conventions"
  • Test commands → "didn't run the right tests"
  • Fast file locator → "context window exhausted"

Not written by hand. Signal-driven: git log picks hot files, import analysis finds danger zones, agent behavior during actual sessions seeds conventions and locator entries.
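The git-log signal is essentially a one-liner. A sketch of the idea (the time window and cutoff are arbitrary choices here, not what better-work actually ships):

```shell
# Files that churn the most in recent history are candidates for
# locator and danger-zone entries.
git log --since='3 months ago' --name-only --pretty=format: \
  | grep -v '^$' | sort | uniq -c | sort -rn | head -20
```

Import analysis and session observation fill in what churn alone cannot, but frequency is a cheap first signal.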

Control side — the 30 lines. Kept the pieces that survived the experiments:

  • Completion standards — re-read the request, check every part, give observable evidence.
  • Behavior triggers — catch "should work" → require verification; catch "I can't" → require three attempts; catch activity-reporting → restructure as outcome.
  • Planning hints — what produced the 58% tool-call reduction in Experiment 3.

Six intervention layers compressed into thirty lines. Not because the six layers were wrong in theory — because in a codebase where the agent doesn't know a state machine exists, no amount of Control will stop it from editing order.status.

This repo: the Lite Control layer

better-work is the series entry point. It implements the Lite Control side and routes to discipline-specific Context sources.

| Repo                    | What it owns                                                                        |
|-------------------------|-------------------------------------------------------------------------------------|
| better-work (this repo) | Lite Control: 30-line execution protocol + Cognitive Layer + project init + series routing |
| better-code             | Full Context for coding: architecture, conventions, danger zones                    |
| better-test             | Full Context for testing: test groups, impact maps, known failures                  |

Install any subset. Install better-code alone and you get coding Context without the Lite Control. Install better-work alone and you get Lite Control without the Context. The series is designed so subskills interoperate — but each stands on its own.

Installation

Three paths depending on your context. Path A is the recommended default for new users.

Path A — Agent-driven install (recommended)

Tell your Claude Code agent to install the series. It will run these commands:

mkdir -p ~/.better-work-series ~/.claude/skills

# Main skill (always install this)
git clone https://github.com/d-wwei/better-work.git ~/.better-work-series/better-work
ln -sfn ~/.better-work-series/better-work/skills/better-work ~/.claude/skills/better-work

# Optional: better-code (development knowledge)
git clone https://github.com/d-wwei/better-code.git ~/.better-work-series/better-code
ln -sfn ~/.better-work-series/better-code ~/.claude/skills/better-code

# Optional: better-test (testing knowledge)
git clone https://github.com/d-wwei/better-test.git ~/.better-work-series/better-test
ln -sfn ~/.better-work-series/better-test ~/.claude/skills/better-test

# Inject always-on protocol into global CLAUDE.md (idempotent)
grep -q "better-work/protocol.md" ~/.claude/CLAUDE.md || \
  echo "@~/.claude/skills/better-work/protocol.md" >> ~/.claude/CLAUDE.md

Open a new Claude Code session to load the newly installed skills.

Path B — Installer script

# Main skill only
curl -fsSL https://raw.githubusercontent.com/d-wwei/better-work/main/install.sh | bash

# Full series
curl -fsSL https://raw.githubusercontent.com/d-wwei/better-work/main/install.sh | bash -s -- --with code --with test

# Update installed skills later
~/.better-work-series/better-work/install.sh --update

# Remove symlinks (source repos preserved)
~/.better-work-series/better-work/install.sh --uninstall

Run ./install.sh --help for all options (--install-dir, --version, --skip-protocol, --protocol-scope=project).

Path C — Other platforms (Codex, Cursor, VS Code)

Adapter files live in codex/, cursor/, and vscode/. See each adapter directory's README for platform-specific install commands.

Path A vs Path B — path semantics

Both paths default to ~/.better-work-series/<skill>/ as the physical clone location. The ~/.claude/skills/<skill>/ symlink resolves transparently regardless of the physical path, so:

  • Path B's --install-dir <path> flag overrides the default clone location; symlinks still land in ~/.claude/skills/ and keep working
  • Moving the series directory later only requires re-running install.sh --update (which re-creates symlinks) or manually ln -sfn to the new location
  • ~/.claude/skills/better-work/ always resolves to <install-dir>/better-work/skills/better-work/ (note the nested skills/better-work path: the repo keeps platform adapters at its root, so the skill definition lives one level down)
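The resolution chain is easy to demonstrate. A sketch using a scratch layout that mirrors the real one (paths under $demo stand in for ~ and ~/.claude):

```shell
# Build the install layout: physical clone plus the skills/ symlink.
demo=$(mktemp -d)
mkdir -p "$demo/.better-work-series/better-work/skills/better-work" \
         "$demo/.claude/skills"
ln -sfn "$demo/.better-work-series/better-work/skills/better-work" \
        "$demo/.claude/skills/better-work"
readlink -f "$demo/.claude/skills/better-work"   # resolves into the series dir

# Moving the series only needs the symlink re-pointed:
mkdir -p "$demo/moved/better-work/skills/better-work"
ln -sfn "$demo/moved/better-work/skills/better-work" \
        "$demo/.claude/skills/better-work"
readlink -f "$demo/.claude/skills/better-work"   # now resolves to the new home
```

ln -sfn replaces the existing link atomically rather than descending into it, which is why re-pointing is a single command.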

Quick Start

Inside a project:

/better-work init

This creates .better-work/ in the project (a symlink to ~/.better-work/<project>/), writes protocol.md, injects @ references into ~/.claude/CLAUDE.md, and offers to initialize installed subskills (better-code, better-test) to fill the Context side.
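Mechanically, that wiring amounts to roughly the following sketch (scratch paths and the project name are stand-ins; the real init command does more than this):

```shell
# Per-project store under the home directory, surfaced in the repo
# via a symlink ($home stands in for ~).
home=$(mktemp -d)
proj="$home/my-project"
mkdir -p "$proj" "$home/.better-work/my-project" "$home/.claude"
ln -sfn "$home/.better-work/my-project" "$proj/.better-work"

# Idempotent protocol injection, same pattern as the installer:
grep -q "better-work/protocol.md" "$home/.claude/CLAUDE.md" 2>/dev/null || \
  echo "@~/.claude/skills/better-work/protocol.md" >> "$home/.claude/CLAUDE.md"
```

The grep guard makes re-running init harmless: the @ reference is appended at most once.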

On multi-step work:

/better-work rounds     # PDCA-style bounded rounds for locally complex tasks
/better-work waves      # project-scale staged delivery
/better-work verify     # verification-heavy mode
/better-work handoff    # produce a resumable handoff document

Commands

Series infrastructure:

| Command                 | What it does                                               |
|-------------------------|------------------------------------------------------------|
| /better-work init       | First-time setup — symlinks, CLAUDE.md injection, subskill init |
| /better-work status     | Diagnose the current project's better-work setup           |
| /better-work code <cmd> | Route to better-code subskill                              |
| /better-work test <cmd> | Route to better-test subskill                              |

Execution modes:

| Command              | When to use                                      |
|----------------------|--------------------------------------------------|
| /better-work         | Default — small-to-medium bounded tasks          |
| /better-work rounds  | Current slice needs PDCA loops with quality gates |
| /better-work waves   | Project-scale, staged delivery across sessions   |
| /better-work verify  | Testing-heavy changes                            |
| /better-work unstick | Recovery when the agent is looping               |
| /better-work handoff | Produce a handoff doc for another session        |

What's actually in protocol.md

Thirty lines. Three sections:

## Operating Standards (4 principles, compressed)
- Exhaust reasonable paths before declaring a blocker
- Investigate before asking
- Close the loop (no "done" without verification)
- Use just enough structure

## Cognitive Rules (always-on, 5 IF/THEN)
1. IF claiming done → re-read the request; every part addressed? Evidence?
2. IF two paths and one is simpler → explain why simple isn't avoidance
3. IF writing "should work" → replace with a specific verification step
4. IF stating a fact → mark investigated vs speculation ("not verified")
5. IF using vague wording → investigate or rewrite as explicit uncertainty

Full framework (Output Protocol, Behavior Triggers, Adversarial Review):
SKILL.md § Cognitive Layer

The Cognitive Layer — loaded on demand from SKILL.md — absorbed the useful pieces of cognitive-kernel:

  • Output Protocol — field templates for proposing / claiming-done responses
  • Behavior Triggers — 8 IF/THEN rules including the results-driven originals
  • Adversarial Review — spawn a reviewer sub-agent when ≥5 files change or a public interface shifts
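The file-count trigger is cheap to compute. A sketch of that gate (the git invocation and the message are assumptions about the mechanism, not better-work's actual code):

```shell
# Flag for adversarial review when five or more files changed
# relative to HEAD.
changed=$(git diff --name-only HEAD | wc -l)
if [ "$changed" -ge 5 ]; then
  echo "adversarial review required: $changed files changed"
fi
```

Public-interface shifts need a semantic check on top, but the file count alone catches the broad, risky diffs.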

Limitations

  • The experiments were small-scale. The 58% tool-call reduction is from 800-line projects. No controlled data yet on 100k-line codebases — the Full Context framework is the bet that it'll scale where Control alone didn't.
  • Context quality depends on subskills. Better-work alone is just the Lite Control layer. Without better-code (or an equivalent Context source) the agent still flies blind on project knowledge.
  • protocol.md auto-injects globally by default. /better-work init appends a line to ~/.claude/CLAUDE.md. Opt out with --skip-protocol, or scope to the project with --protocol-scope=project.
  • Signal-driven updates aren't automatic. /better-code update and /better-test update still need explicit invocation; no filesystem watcher.
  • Adapter files (cursor/codex/vscode) may lag the main SKILL.md. Native Claude Code reflects the current state; other platforms use embedded copies that are refreshed manually.

License

MIT.


Questions, issues, discussion: GitHub issues.
