🇺🇸 English | 🇨🇳 中文

better-work

Three months on a hypothesis. The data came back the opposite way.

The hypothesis: AI coding agents fail because they don't follow rules carefully enough. Give them more rules, make the rules stick harder, and quality goes up.

I tested it. Three months, two frameworks, three controlled experiments.

The data said the hypothesis was wrong.

This repo is what survived.

What I tried first

Attempt 1 — Rules in the system prompt (didn't stick)

I wrote results-driven: a 30-line cognitive protocol. Five rules:

  • Done means "the goal is met, here's evidence." Not "I changed the code."
  • Every conclusion needs data.
  • You haven't failed until you've tried three fundamentally different approaches.
  • After the primary task is done, circle back to the secondary requirements.
  • Internal self-check before every response.

Same agent, same task, same prompt — and sometimes the rules worked, sometimes they didn't. Thirty lines in the system prompt is a safety notice on a wall. Written, not read.

Attempt 2 — A 6-layer intervention runtime (over-engineered)

cognitive-kernel was the doubling down. Six intervention layers, weakest to strongest:

  • L6 — on-demand loading (read when needed)
  • L5 — core rules pinned to context
  • L4 — spawn an adversarial-review sub-agent for high-risk changes
  • L3 — IF-THEN triggers (e.g., detect "should work" in a draft → require a specific verification step)
  • L2 — platform hooks (inject reminders before Edit/Write tool calls)
  • L1 — output template (response must include ## Completion Evidence and ## What This Breaks sections; missing structure = not done)
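The L1 layer is mechanically checkable. A minimal sketch of such a structural gate (the file name and section list are illustrative, not cognitive-kernel's actual implementation):

```shell
# Hypothetical L1 gate: a draft response is "not done" unless every
# required section header is present.
draft="response.md"
status=0
for section in "## Completion Evidence" "## What This Breaks"; do
  grep -qF "$section" "$draft" || { echo "missing: $section"; status=1; }
done
[ "$status" -eq 0 ] && echo "structure ok" || echo "not done"
```

As the experiments below show, a gate like this guarantees the report's shape, not the code's correctness.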

Internally consistent. Multi-layered. Theoretically complete. I was proud of it.

Then I ran experiments.

What the experiments showed

Three head-to-head trials. Same task each time. Bare agent vs Kernel-augmented agent.

Experiment 1 — Simple task, Sonnet. Three Python utility functions with tests.

Kernel version produced a beautiful ## Completion Evidence section. Walked through every requirement. Then the tests failed — it imported pytest and the environment didn't have it. Bare agent used unittest from the standard library. 30/30 pass.

My runtime made the report prettier. The code got worse.

Experiment 2 — Medium task, Opus. A cache + config system. Three files, six requirements, one hidden "thread safety" requirement buried in the spec.

Both versions completed every requirement. Both passed 43/43 tests. Code quality essentially identical.

Opus alone was careful enough. My six-layer edifice: zero marginal benefit.

Experiment 3 — Harder task, Opus. An order-processing system. Seven files, 800 lines, twelve explicit requirements plus two implicit ones.

|                               | Bare Opus | Kernel Opus |
|-------------------------------|-----------|-------------|
| Tests pass                    | 31/31     | 29/29       |
| Requirements (incl. implicit) | All       | All         |
| Tool calls                    | 67        | 28          |
| Time                          | 292s      | 195s        |

Code quality: no difference. But Kernel used 58% fewer tool calls and finished 33% faster. The one signal where Control earned its keep: not better code, but more deliberate execution.

Three experiments. The honest conclusion: three months on Control design, zero contribution to code quality.

The first-principles check

There was a fatal limitation in those experiments. The biggest task was 800 lines. My real projects start at 100k lines.

On 100k-line codebases, where do agents actually fail? I listed the modes:

  1. Insufficient search — changed an interface, didn't find every caller.
  2. Doesn't know the architecture — made the change at the wrong layer.
  3. Context window exhausted — read too much, forgot the earlier info.
  4. Didn't run the right tests — ran units, missed integration.
  5. Doesn't know project conventions — correct feature, wrong style.

Writing the list, I realized: none of these are attitude problems. They're all information problems.

"Check every requirement before claiming done" doesn't fix insufficient search. Under the ## What This Breaks heading, the agent dutifully writes "checked, no impact on other modules." But it didn't search everywhere. What's that sentence based on?

The insight I paid for

I'd spent months designing intervention layers to make agents more careful. But agents weren't being careless. They were flying blind.

They don't skip impact checks — they don't know what to search for. They don't skip integration tests — they don't know they exist. They don't violate the style guide — they don't know the style guide exists.

| What I was doing (Control)                     | What the agent actually needs (Context)                          |
|------------------------------------------------|------------------------------------------------------------------|
| "Before claiming done, check each requirement" | "Before editing models/order.py, run grep -r 'Order\.' src/"     |
| "Think: what will this break?"                 | "Order status changes must go through state_machine.transition()" |
| "Give completion evidence"                     | pytest tests/unit/ && pytest tests/integration/ -x -q            |

Left side: the agent obeys, then produces confident conclusions from incomplete information.

Right side: hands the agent the correct information directly.
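The right-hand entries are literally runnable. A sketch of the pre-edit impact search, using the hypothetical paths and names from the comparison above:

```shell
# Before touching models/order.py, enumerate every caller of Order.*
# so "no impact on other modules" is a grep result, not a guess.
grep -rn "Order\." src/ --include='*.py' || echo "no callers found"
```

Each hit is a file and line the change can reach; an empty result is evidence, while a skipped search is not.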

One line — "don't mutate order.status directly" — closes one bug permanently. Six layers of "think about what this might break" just produce a more confident "won't break anything", because the agent still doesn't know a state machine exists.

You can't constrain ignorance. Three months, one sentence.

The principle: Full Context, Lite Control

Knowing what I know now, I rebuilt the framework. Context is the main thing. Control stays, lightweight. A 5:1 ratio.

CLAUDE.md:
  @.better-work/shared/index.md       ← Context: project knowledge (≤150 lines)
  @.better-work/code/protocol.md      ← Control: discipline constraints (≤15 lines)
  @~/.claude/CLAUDE.md has @protocol  ← Control: execution floor (≤30 lines)

Context side — the 150 lines. Each line plugs a specific failure mode:

  • Architecture overview → "doesn't know the architecture"
  • Danger zones → "insufficient search"
  • Coding conventions → "doesn't know project conventions"
  • Test commands → "didn't run the right tests"
  • Fast file locator → "context window exhausted"

Not written by hand. Signal-driven: git log picks hot files, import analysis finds danger zones, agent behavior during actual sessions seeds conventions and locator entries.
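The git-log signal is essentially a one-liner. A sketch of the idea (the time window and cutoff are arbitrary choices here, not what better-work actually ships):

```shell
# Files that churn the most in recent history are candidates for
# locator and danger-zone entries.
git log --since='3 months ago' --name-only --pretty=format: \
  | grep -v '^$' | sort | uniq -c | sort -rn | head -20
```

Import analysis and session observation fill in what churn alone cannot, but frequency is a cheap first signal.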

Control side — the 30 lines. Kept the pieces that survived the experiments:

  • Completion standards — re-read the request, check every part, give observable evidence.
  • Behavior triggers — catch "should work" → require verification; catch "I can't" → require three attempts; catch activity-reporting → restructure as outcome.
  • Planning hints — what produced the 58% tool-call reduction in Experiment 3.

Six intervention layers compressed into thirty lines. Not because the six layers were wrong in theory — because in a codebase where the agent doesn't know a state machine exists, no amount of Control will stop it from editing order.status.

This repo: the Lite Control layer

better-work is the series entry point. It implements the Lite Control side and routes to discipline-specific Context sources.

| Repo                    | What it owns                                                                        |
|-------------------------|-------------------------------------------------------------------------------------|
| better-work (this repo) | Lite Control: 30-line execution protocol + Cognitive Layer + project init + series routing |
| better-code             | Full Context for coding: architecture, conventions, danger zones                    |
| better-test             | Full Context for testing: test groups, impact maps, known failures                  |

Install any subset. Install better-code alone and you get coding Context without the Lite Control. Install better-work alone and you get Lite Control without the Context. The series is designed so subskills interoperate — but each stands on its own.

Installation

Three paths depending on your context. Path A is the recommended default for new users.

Path A — Agent-driven install (recommended)

Tell your Claude Code agent to install the series. It will run these commands:

mkdir -p ~/.better-work-series ~/.claude/skills

# Main skill (always install this)
git clone https://github.com/d-wwei/better-work.git ~/.better-work-series/better-work
ln -sfn ~/.better-work-series/better-work/skills/better-work ~/.claude/skills/better-work

# Optional: better-code (development knowledge)
git clone https://github.com/d-wwei/better-code.git ~/.better-work-series/better-code
ln -sfn ~/.better-work-series/better-code ~/.claude/skills/better-code

# Optional: better-test (testing knowledge)
git clone https://github.com/d-wwei/better-test.git ~/.better-work-series/better-test
ln -sfn ~/.better-work-series/better-test ~/.claude/skills/better-test

# Inject always-on protocol into global CLAUDE.md (idempotent)
grep -q "better-work/protocol.md" ~/.claude/CLAUDE.md || \
  echo "@~/.claude/skills/better-work/protocol.md" >> ~/.claude/CLAUDE.md

Open a new Claude Code session to load the newly installed skills.

Path B — Installer script

# Main skill only
curl -fsSL https://raw.githubusercontent.com/d-wwei/better-work/main/install.sh | bash

# Full series
curl -fsSL https://raw.githubusercontent.com/d-wwei/better-work/main/install.sh | bash -s -- --with code --with test

# Update installed skills later
~/.better-work-series/better-work/install.sh --update

# Remove symlinks (source repos preserved)
~/.better-work-series/better-work/install.sh --uninstall

Run ./install.sh --help for all options (--install-dir, --version, --skip-protocol, --protocol-scope=project).

Path C — Other platforms (Codex, Cursor, VS Code)

Adapter files live in codex/, cursor/, and vscode/. See each adapter directory's README for platform-specific install commands.

Path A vs Path B — path semantics

Both paths default to ~/.better-work-series/<skill>/ as the physical clone location. The ~/.claude/skills/<skill>/ symlink resolves transparently regardless of the physical path, so:

  • Path B's --install-dir <path> flag overrides the default clone location; symlinks still land in ~/.claude/skills/ and keep working
  • Moving the series directory later only requires re-running install.sh --update (which re-creates symlinks) or manually ln -sfn to the new location
  • ~/.claude/skills/better-work/ always resolves to <install-dir>/better-work/skills/better-work/ (note the nested skills/better-work path: the repo keeps platform adapters at its root, so the skill definition lives one level down)
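The resolution chain is easy to demonstrate. A sketch using a scratch layout that mirrors the real one (paths under $demo stand in for ~ and ~/.claude):

```shell
# Build the install layout: physical clone plus the skills/ symlink.
demo=$(mktemp -d)
mkdir -p "$demo/.better-work-series/better-work/skills/better-work" \
         "$demo/.claude/skills"
ln -sfn "$demo/.better-work-series/better-work/skills/better-work" \
        "$demo/.claude/skills/better-work"
readlink -f "$demo/.claude/skills/better-work"   # resolves into the series dir

# Moving the series only needs the symlink re-pointed:
mkdir -p "$demo/moved/better-work/skills/better-work"
ln -sfn "$demo/moved/better-work/skills/better-work" \
        "$demo/.claude/skills/better-work"
readlink -f "$demo/.claude/skills/better-work"   # now resolves to the new home
```

ln -sfn replaces the existing link atomically rather than descending into it, which is why re-pointing is a single command.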

Quick Start

Inside a project:

/better-work init

This creates .better-work/ in the project (a symlink to ~/.better-work/<project>/), writes protocol.md, injects @ references into ~/.claude/CLAUDE.md, and offers to initialize installed subskills (better-code, better-test) to fill the Context side.
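Mechanically, that wiring amounts to roughly the following sketch (scratch paths and the project name are stand-ins; the real init command does more than this):

```shell
# Per-project store under the home directory, surfaced in the repo
# via a symlink ($home stands in for ~).
home=$(mktemp -d)
proj="$home/my-project"
mkdir -p "$proj" "$home/.better-work/my-project" "$home/.claude"
ln -sfn "$home/.better-work/my-project" "$proj/.better-work"

# Idempotent protocol injection, same pattern as the installer:
grep -q "better-work/protocol.md" "$home/.claude/CLAUDE.md" 2>/dev/null || \
  echo "@~/.claude/skills/better-work/protocol.md" >> "$home/.claude/CLAUDE.md"
```

The grep guard makes re-running init harmless: the @ reference is appended at most once.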

On multi-step work:

/better-work rounds     # PDCA-style bounded rounds for locally complex tasks
/better-work waves      # project-scale staged delivery
/better-work verify     # verification-heavy mode
/better-work handoff    # produce a resumable handoff document

Commands

Series infrastructure:

| Command                 | What it does                                               |
|-------------------------|------------------------------------------------------------|
| /better-work init       | First-time setup — symlinks, CLAUDE.md injection, subskill init |
| /better-work status     | Diagnose the current project's better-work setup           |
| /better-work code <cmd> | Route to better-code subskill                              |
| /better-work test <cmd> | Route to better-test subskill                              |

Execution modes:

| Command              | When to use                                      |
|----------------------|--------------------------------------------------|
| /better-work         | Default — small-to-medium bounded tasks          |
| /better-work rounds  | Current slice needs PDCA loops with quality gates |
| /better-work waves   | Project-scale, staged delivery across sessions   |
| /better-work verify  | Testing-heavy changes                            |
| /better-work unstick | Recovery when the agent is looping               |
| /better-work handoff | Produce a handoff doc for another session        |

What's actually in protocol.md

Thirty lines. Three sections:

## Operating Standards (4 principles, compressed)
- Exhaust reasonable paths before declaring a blocker
- Investigate before asking
- Close the loop (no "done" without verification)
- Use just enough structure

## Cognitive Rules (always-on, 5 IF/THEN)
1. IF claiming done → re-read the request; every part addressed? Evidence?
2. IF two paths and one is simpler → explain why simple isn't avoidance
3. IF writing "should work" → replace with a specific verification step
4. IF stating a fact → mark investigated vs speculation ("not verified")
5. IF using vague wording → investigate or rewrite as explicit uncertainty

Full framework (Output Protocol, Behavior Triggers, Adversarial Review):
SKILL.md § Cognitive Layer

The Cognitive Layer — loaded on demand from SKILL.md — absorbed the useful pieces of cognitive-kernel:

  • Output Protocol — field templates for proposing / claiming-done responses
  • Behavior Triggers — 8 IF/THEN rules including the results-driven originals
  • Adversarial Review — spawn a reviewer sub-agent when ≥5 files change or a public interface shifts
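The file-count trigger is cheap to compute. A sketch of that gate (the git invocation and the message are assumptions about the mechanism, not better-work's actual code):

```shell
# Flag for adversarial review when five or more files changed
# relative to HEAD.
changed=$(git diff --name-only HEAD | wc -l)
if [ "$changed" -ge 5 ]; then
  echo "adversarial review required: $changed files changed"
fi
```

Public-interface shifts need a semantic check on top, but the file count alone catches the broad, risky diffs.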

Limitations

  • The experiments were small-scale. The 58% tool-call reduction is from 800-line projects. No controlled data yet on 100k-line codebases — the Full Context framework is the bet that it'll scale where Control alone didn't.
  • Context quality depends on subskills. Better-work alone is just the Lite Control layer. Without better-code (or an equivalent Context source) the agent still flies blind on project knowledge.
  • protocol.md auto-injects globally by default. /better-work init appends a line to ~/.claude/CLAUDE.md. Opt out with --skip-protocol, or scope to the project with --protocol-scope=project.
  • Signal-driven updates aren't automatic. /better-code update and /better-test update still need explicit invocation; no filesystem watcher.
  • Adapter files (cursor/codex/vscode) may lag the main SKILL.md. Native Claude Code reflects the current state; other platforms use embedded copies that are refreshed manually.

License

MIT.


Questions, issues, discussion: GitHub issues.
