Kaizen User Guide

A practical guide to using Kaizen for evaluating AI agent quality.

Getting Started
Core Concepts
Common Workflows
Command Reference
CI/CD Integration
Troubleshooting

Getting Started

Installation

# Install via go install
go install github.com/srstomp/kaizen/cmd/kaizen@latest

# Or build from source
git clone https://github.com/srstomp/kaizen.git
cd kaizen
go build -o bin/kaizen ./cmd/kaizen

Verify Installation

kaizen

You should see the available commands:

Usage: kaizen <command> [options]

Commands:
  grade               Run a single grader on a single input
  grade-skills        Grade all pokayokay skills and generate report
  grade-task          Run code-based graders on task changes
  grade-task-quality  Evaluate task quality based on metadata
  meta                Run meta-evaluations on agents or skills
  eval                Run eval suite against failure cases
  report              View and analyze evaluation reports
  gate                Check if eval/meta results pass threshold (for CI)
  dashboard           Generate HTML dashboard from eval/meta results

Core Concepts

What Kaizen Measures

Kaizen evaluates AI agents at three stages:

┌─────────────────────────────────────────────────────────────────┐
│                    Agent Evaluation Pipeline                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  BEFORE WORK           DURING WORK           AFTER WORK         │
│  ───────────           ───────────           ──────────         │
│  grade-task-quality    (agent runs)          grade-task         │
│  "Is this task                               "Did it work?"     │
│   well-defined?"                                                │
│                                                                 │
│                   ACROSS SESSIONS                               │
│                   ───────────────                               │
│                   meta / eval                                   │
│                   "Is the agent                                 │
│                    consistent?"                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Metrics

Pass@k - What percentage of attempts succeed?

Higher is better
Example: Pass@5 = 80% means 4 out of 5 attempts succeeded

Pass^k (Pass-caret-k) - Do ALL attempts succeed?

Measures consistency/reliability
Example: Pass^5 = 60% means the agent succeeds all 5 times in 60% of test cases
This is the metric that matters for production reliability
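
The two metrics can be sketched in a few lines of Python. This illustrates the definitions as worded in this guide and is not Kaizen's implementation; the function names and the data layout (one list of pass/fail booleans per test case) are assumptions made for the example.

```python
def attempt_pass_rate(attempts):
    """Pass@k as described above: share of attempts that succeeded, in percent."""
    return 100.0 * sum(attempts) / len(attempts)


def pass_caret_k(results):
    """Pass^k: share of test cases whose k attempts ALL succeeded, in percent."""
    all_passed = [all(attempts) for attempts in results]
    return 100.0 * sum(all_passed) / len(all_passed)


# One test case, 5 attempts, 4 successes.
single_case = [True, True, False, True, True]
print(attempt_pass_rate(single_case))  # 80.0

# Five test cases with 5 attempts each; three of them pass every time.
suite = [
    [True] * 5,
    [True] * 5,
    [True] * 5,
    [True, True, False, True, True],
    [False, True, True, True, True],
]
print(pass_caret_k(suite))  # 60.0
```

Note that the last two test cases each pass 4 of 5 attempts, yet contribute nothing to Pass^5. A single flaky run is enough to fail the consistency metric for that case.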

Grader Types

Type	Speed	Cost	Use For
Code-based	Fast (<1ms)	Free	File existence, test presence, patterns
Model-based	Slow (1-5s)	API costs	Semantic quality, clarity, compliance

Common Workflows

Workflow 1: Pre-Task Quality Gate

Use case: Validate task definitions before an agent starts working.

kaizen grade-task-quality \
  --task-id "TASK-123" \
  --task-title "Add user authentication" \
  --task-type "feature" \
  --description "Implement login and logout functionality with JWT tokens. Users should be able to sign in with email/password and receive a token that expires after 24 hours." \
  --acceptance-criteria "Users can log in,Users can log out,Tokens expire after 24h,Invalid credentials show error"

What it checks:

Description length (minimum 100 characters)
Acceptance criteria present (required for feature/test/spike)
No ambiguous keywords ("investigate", "explore", "figure out")

Example output:

{
  "task_id": "TASK-123",
  "passed": true,
  "score": 100,
  "issues": [],
  "suggestion": ""
}

If it fails:

{
  "task_id": "TASK-456",
  "passed": false,
  "score": 50,
  "issues": [
    {"check": "description_length", "message": "Description too short (45 chars, minimum 100)"},
    {"check": "ambiguous_keywords", "message": "Contains ambiguous keyword 'investigate' - task may be too vague"}
  ],
  "suggestion": "Run /pokayokay:brainstorm to refine task requirements"
}
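
The three checks listed above could look roughly like the following sketch. It is not Kaizen's source: the function name, the `acceptance_criteria` check name, and the use of case-insensitive substring matching are assumptions, and no score is computed because this guide does not describe the scoring formula.

```python
AMBIGUOUS_KEYWORDS = ("investigate", "explore", "figure out")
TYPES_REQUIRING_CRITERIA = {"feature", "test", "spike"}


def check_task(task_type, description, acceptance_criteria, min_length=100):
    """Return a list of issue dicts shaped like the "issues" array above."""
    issues = []
    if len(description) < min_length:
        issues.append({
            "check": "description_length",
            "message": f"Description too short ({len(description)} chars, minimum {min_length})",
        })
    # Hypothetical check name; the guide only says criteria are required for these types.
    if task_type in TYPES_REQUIRING_CRITERIA and not acceptance_criteria:
        issues.append({
            "check": "acceptance_criteria",
            "message": f"Acceptance criteria required for {task_type} tasks",
        })
    lowered = description.lower()
    for keyword in AMBIGUOUS_KEYWORDS:
        if keyword in lowered:
            issues.append({
                "check": "ambiguous_keywords",
                "message": f"Contains ambiguous keyword '{keyword}' - task may be too vague",
            })
    return issues


issues = check_task("feature", "Investigate the login bug", acceptance_criteria=[])
print([issue["check"] for issue in issues])
# ['description_length', 'acceptance_criteria', 'ambiguous_keywords']
```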

Workflow 2: Post-Task Grading

Use case: Verify an agent's implementation after it claims to be done.

kaizen grade-task \
  --task-id "TASK-123" \
  --task-type "feature" \
  --changed-files "src/auth/login.go,src/auth/logout.go,src/auth/token.go" \
  --work-dir "/path/to/project" \
  --format json

What it checks:

file-exists: Do all claimed files actually exist?
test-exists: Do code files have corresponding test files?

Example output:

{
  "task_id": "TASK-123",
  "timestamp": "2026-01-28T10:30:00Z",
  "results": [
    {
      "grader_name": "file-exists",
      "passed": true,
      "score": 100,
      "details": "All 3 files exist"
    },
    {
      "grader_name": "test-exists",
      "passed": false,
      "score": 33.33,
      "details": "Missing tests: src/auth/logout.go, src/auth/token.go"
    }
  ],
  "overall_passed": false,
  "overall_score": 66.67
}
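
Because the output is JSON, a script can pick out the failing graders directly. The sketch below assumes only the field names shown in the example output above; the helper name `failed_graders` is made up for illustration, and the report is embedded as a string rather than read from a real Kaizen run.

```python
import json

# Abbreviated copy of the example output above.
report_json = """
{
  "task_id": "TASK-123",
  "results": [
    {"grader_name": "file-exists", "passed": true, "score": 100,
     "details": "All 3 files exist"},
    {"grader_name": "test-exists", "passed": false, "score": 33.33,
     "details": "Missing tests: src/auth/logout.go, src/auth/token.go"}
  ],
  "overall_passed": false,
  "overall_score": 66.67
}
"""


def failed_graders(report):
    """Return (grader_name, details) pairs for every grader that did not pass."""
    return [(r["grader_name"], r["details"]) for r in report["results"] if not r["passed"]]


report = json.loads(report_json)
for name, details in failed_graders(report):
    print(f"{name}: {details}")
# test-exists: Missing tests: src/auth/logout.go, src/auth/token.go
```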

Workflow 3: Agent Consistency Testing

Use case: Measure how reliably an agent performs the same task.

# Run meta-evaluations on all agents, 5 times each
kaizen meta --suite agents --k 5

# Run on a specific agent
kaizen meta --suite agents --agent yokay-spec-reviewer --k 10

What it does:

Loads test cases from meta/agents/<agent-name>/eval.yaml
Runs each test case k times
Calculates accuracy and consistency (Pass^k)

Example output:

Meta-Evaluation Results
=======================

Agent: yokay-spec-reviewer
Test Cases: 15
Runs per test: 5

Results:
  - SPEC-001: 5/5 passed (Pass^5: 100%)
  - SPEC-002: 4/5 passed (Pass^5: 0%)
  - SPEC-003: 5/5 passed (Pass^5: 100%)
  ...

Summary:
  Accuracy: 93.3% (14/15 tests passed at least once)
  Consistency (Pass^5): 86.7% (13/15 tests passed all 5 times)

Workflow 4: Failure Case Evaluation

Use case: Test agents against documented failure patterns.

# Run all failure case evaluations
kaizen eval --failures-dir failures --k 3

# Filter to specific category
kaizen eval --failures-dir failures --category missing-tests --k 3 --format json

Failure categories:

Category	Prefix	What it catches
missed-tasks	MT	Requirements not implemented
missing-tests	WT	No tests for implementation
wrong-product	WP	Misunderstood requirements
regression	RG	Broke existing functionality
premature-completion	PC	Claimed done before complete
scope-creep	SC	Extra work beyond spec
integration-failure	IF	Integration issues
session-amnesia	SA	Lost context between sessions
hallucinated-deps	HD	Used non-existent dependencies
security-flaw	SF	Security vulnerability
tool-misuse	TM	Incorrect tool/API usage
task-quality	TQ	Poor task execution quality
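
If failure cases are named with their category prefix, the table above can be turned into a lookup. The ID format used here (`<PREFIX>-<number>`, for example `WT-003`) is an assumption for illustration; this guide does not specify how failure case files or IDs are named.

```python
# Prefix-to-category mapping copied from the table above.
CATEGORY_BY_PREFIX = {
    "MT": "missed-tasks",
    "WT": "missing-tests",
    "WP": "wrong-product",
    "RG": "regression",
    "PC": "premature-completion",
    "SC": "scope-creep",
    "IF": "integration-failure",
    "SA": "session-amnesia",
    "HD": "hallucinated-deps",
    "SF": "security-flaw",
    "TM": "tool-misuse",
    "TQ": "task-quality",
}


def category_for(case_id):
    """Map a hypothetical case ID such as 'WT-003' to its category, or None."""
    prefix = case_id.split("-", 1)[0].upper()
    return CATEGORY_BY_PREFIX.get(prefix)


print(category_for("WT-003"))  # missing-tests
print(category_for("XX-001"))  # None
```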

Workflow 5: Skill Documentation Grading

Use case: Evaluate the quality of skill documentation.

kaizen grade-skills \
  --skills-dir /path/to/skills \
  --output reports/skill-clarity-report.md

Grading criteria:

Clear Instructions (30%): Are instructions unambiguous?
Actionable Steps (25%): Can users follow step-by-step?
Good Examples (25%): Are there helpful examples?
Appropriate Scope (20%): Is the skill focused?

Passing threshold: 70/100
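
As a worked example of the weighting, the sketch below combines four sub-scores into a total. It assumes each criterion is scored on a 0-100 scale and that a total equal to the threshold passes; neither detail is stated in this guide, and the dictionary keys are invented names.

```python
# Weights from the grading criteria above.
WEIGHTS = {
    "clear_instructions": 0.30,
    "actionable_steps": 0.25,
    "good_examples": 0.25,
    "appropriate_scope": 0.20,
}
PASSING_THRESHOLD = 70


def skill_score(sub_scores):
    """Combine 0-100 sub-scores into a weighted total on the same scale."""
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


scores = {
    "clear_instructions": 80,
    "actionable_steps": 60,
    "good_examples": 70,
    "appropriate_scope": 90,
}
total = skill_score(scores)  # 24 + 15 + 17.5 + 18
print(round(total, 2), total >= PASSING_THRESHOLD)  # 74.5 True
```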

Workflow 6: Report Aggregation

Use case: View trends and analyze evaluation history.

# List available reports
kaizen report --type all --list

# Generate aggregated markdown report
kaizen report --type grade --format markdown --output analysis.md

# View as JSON
kaizen report --type eval --format json

Command Reference

grade

Run a single grader on a single input file.

kaizen grade --grader <name> --input <path> [--spec <text>] [--format text|json]

Flag	Required	Description
`--grader`	Yes	Grader name (file-exists, test-exists, skill-clarity, etc.)
`--input`	Yes	Path to JSON input file
`--spec`	No	Specification text (for model-based graders)
`--format`	No	Output format: text (default) or json

grade-skills

Grade skill documentation for clarity.

kaizen grade-skills --skills-dir <path> [--output <path>]

Flag	Required	Description
`--skills-dir`	Yes	Directory containing SKILL.md files
`--output`	No	Report output path (default: reports/skill-clarity-YYYY-MM-DD.md)

grade-task

Run code-based graders on changed files.

kaizen grade-task --task-id <id> --changed-files <files> [options]

Flag	Required	Description
`--task-id`	Yes	Task identifier
`--task-type`	No	Type: feature, bug, test, spike, chore (default: feature)
`--changed-files`	Yes	Comma-separated list of changed files
`--work-dir`	No	Working directory (default: .)
`--format`	No	Output format: json (default) or text

grade-task-quality

Evaluate task definition quality (pre-task gate).

kaizen grade-task-quality --task-id <id> --task-title <title> --task-type <type> --description <desc> [options]

Flag	Required	Description
`--task-id`	Yes	Task identifier
`--task-title`	Yes	Task title
`--task-type`	Yes	Type: feature, bug, test, spike, chore
`--description`	Yes	Task description
`--acceptance-criteria`	No	Comma-separated or JSON criteria
`--min-description-length`	No	Minimum description length (default: 100)
`--format`	No	Output format: json (default) or text

meta

Run meta-evaluations on agents or skills.

kaizen meta --suite <agents|skills> [options]

Flag	Required	Description
`--suite`	Yes	Suite to run: agents or skills
`--agent`	No	Specific agent to test
`--k`	No	Runs per test case (default: 5)
`--meta-dir`	No	Path to meta directory (default: meta)
`--confirm`	No	Skip confirmation prompt

eval

Run evaluation suite against failure cases.

kaizen eval [options]

Flag	Required	Description
`--failures-dir`	No	Path to failures directory (default: failures)
`--category`	No	Filter to specific category
`--k`	No	Number of runs (default: 1)
`--format`	No	Output format: table (default) or json

report

View and analyze evaluation reports.

kaizen report [options]

Flag	Required	Description
`--type`	No	Report type: grade, eval, or all (default: grade)
`--format`	No	Output format: markdown (default) or json
`--list`	No	List reports without aggregating
`--output`	No	Write to file instead of stdout
`--reports-dir`	No	Reports directory (default: reports/)
`--no-trends`	No	Disable trend analysis

gate

Quality gate for CI/CD pipelines.

kaizen gate [options]

Flag	Required	Description
`--type`	No	Check type: eval, meta, or all (default: all)
`--threshold`	No	Pass threshold 0-100 (default: 95.0)
`--reports-dir`	No	Reports directory (default: reports/)

Returns exit code 0 if passing, 1 if failing.
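
Scripts outside a CI system can rely on the same exit-code contract. The helper below is a minimal sketch with an invented name, `gate_passed`; it uses stand-in commands so it runs without Kaizen installed.

```python
import subprocess
import sys


def gate_passed(command):
    """Run a gate command; True when it exits with code 0, False otherwise."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Stand-in commands that exit with 0 and 1, mimicking a passing and failing gate.
print(gate_passed([sys.executable, "-c", "raise SystemExit(0)"]))  # True
print(gate_passed([sys.executable, "-c", "raise SystemExit(1)"]))  # False
```

With Kaizen on the PATH, the same helper could be called as `gate_passed(["kaizen", "gate", "--type", "meta", "--threshold", "90.0"])`.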

dashboard

Generate HTML dashboard.

kaizen dashboard [options]

Flag	Required	Description
`--reports-dir`	No	Reports directory (default: reports/)
`--output`	No	Output file (default: dashboard.html)

CI/CD Integration

GitHub Actions Example

name: Agent Quality Gate

on:
  pull_request:
    branches: [main]

jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.23'

      - name: Install Kaizen
        run: go install github.com/srstomp/kaizen/cmd/kaizen@latest

      - name: Run Meta Evaluations
        run: kaizen meta --suite agents --k 3 --confirm

      - name: Quality Gate Check
        run: kaizen gate --type meta --threshold 90.0

Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

# Collect staged files (added, copied, modified) as a comma-separated list.
# paste joins the lines without leaving a trailing comma.
CHANGED_FILES=$(git diff --cached --name-only --diff-filter=ACM | paste -sd ',' -)

if [ -n "$CHANGED_FILES" ]; then
  # Block the commit if grade-task exits non-zero.
  if ! kaizen grade-task \
    --task-id "pre-commit" \
    --changed-files "$CHANGED_FILES" \
    --format text; then
    echo "Quality gate failed. Please fix issues before committing."
    exit 1
  fi
fi

Troubleshooting

Common Issues

"No skill files found"

Ensure your skills directory contains SKILL.md files:

find /path/to/skills -name "SKILL.md"

"Invalid task type"

Task type must be one of: feature, bug, test, spike, chore

"Description too short"

Provide more detail in your task description. Minimum is 100 characters by default.

Meta-evaluation returns 0% Pass^k

This means no test case passed on every one of its k runs. Either the agent fails outright, or it is inconsistent and gives different results on repeated runs of the same test. Check:

Is the test case deterministic?
Does the agent have non-deterministic behavior?
Are external dependencies causing variance?
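
It also helps to know how quickly an all-runs-pass metric falls as k grows. The sketch below assumes runs are independent and share one per-run pass probability `p`, a simplification that real agents may not satisfy; the function name is invented for the example.

```python
def expected_all_pass_rate(p, k):
    """Probability that k independent runs all pass, given per-run pass rate p."""
    return p ** k


# Compare a few per-run pass rates at k=5 and k=10.
for p in (0.99, 0.9, 0.5):
    at_5 = expected_all_pass_rate(p, 5)
    at_10 = expected_all_pass_rate(p, 10)
    print(f"p={p:.2f}  k=5: {at_5:.3f}  k=10: {at_10:.3f}")
```

Under these assumptions, an agent that passes 90% of individual runs passes all five runs only about 59% of the time, so a low Pass^k at high k does not necessarily indicate a broken test.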

Getting Help

GitHub Issues: https://github.com/srstomp/kaizen/issues
Documentation: See docs/ directory
ADRs: See docs/adr/ for architectural decisions

Next Steps

Start small: Use grade-task-quality on your next task
Add to CI: Set up the quality gate in your pipeline
Track failures: Document failure cases in failures/ directory
Measure consistency: Run meta evaluations weekly
