This repository is a deliberately scoped coding-agent evaluation project for Gemma, a local Hermes Agent profile.
Gemma's assignment is to build Agent Task Arena: a local-first Next.js dashboard for managing coding-agent tasks, sessions, implementation notes, and evaluation outcomes.
This seed repo contains the planning artifacts Gemma needs before implementation:
- docs/PRD.md: product requirements document
- docs/TASKS.md: sequential implementation plan
- docs/GEMMA_WORKFLOW.md: exact instructions for running Gemma with one fresh session per task
- docs/EVALUATION_RUBRIC.md: scoring criteria for each task/session
- docs/task-packets/: copy/paste-ready task prompts for Gemma
Use the async runner to execute one fresh Gemma session per task while Gemma works unattended:
cd ~/projects/gemma-coding-test
HERMES_PROFILE=gemma scripts/run-gemma-tasks.sh
The runner commits and pushes after each task and stops on the first failure. This preserves clean evaluation boundaries without requiring Jake to manually restart Gemma each time.
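The stop-on-first-failure behaviour described above can be sketched as a small loop. This is only an illustration of the control flow, not the contents of scripts/run-gemma-tasks.sh: the names `runTasksInOrder`, `runSession`, and `commitAndPush` are hypothetical, and the sketch assumes a commit happens only after a task succeeds, which the real runner may or may not do.

```typescript
// Hypothetical sketch of the runner's control flow; not the actual script.
type TaskResult = { taskId: string; ok: boolean };

interface RunnerHooks {
  // Runs one fresh agent session for the task; returns true on success.
  runSession: (taskId: string) => boolean;
  // Commits and pushes the work produced by the task.
  commitAndPush: (taskId: string) => void;
}

function runTasksInOrder(taskIds: string[], hooks: RunnerHooks): TaskResult[] {
  const results: TaskResult[] = [];
  for (const taskId of taskIds) {
    const ok = hooks.runSession(taskId);
    results.push({ taskId, ok });
    if (!ok) break; // stop on the first failure; later tasks never start
    hooks.commitAndPush(taskId); // assumption: commit only after a successful task
  }
  return results;
}
```

Because the session and git steps are injected, the loop can be exercised in a unit test without starting an agent or touching a repository.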
This is intentionally more controlled than running every task in a single session, because it produces a cleaner evaluation:
- Gemma receives bounded context.
- Each task has a clear diff.
- Failures are isolated.
- GitHub history becomes the audit trail.
- Jake can stop, inspect, or redirect between tasks.
See docs/GEMMA_WORKFLOW.md for exact commands.
Agent Task Arena will be a local-first taskboard for coding-agent work:
- Create implementation tasks with acceptance criteria
- Track status across backlog/running/review/done/failed
- Record agent sessions and notes per task
- Score task outcomes with a rubric
- Export task/session reports as markdown
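As a rough illustration of how the features above might fit together, the sketch below models a task, computes a rubric score, and renders a markdown report. Every field and function name here is hypothetical; the actual schema belongs to docs/PRD.md and the actual criteria to docs/EVALUATION_RUBRIC.md, neither of which is reproduced here.

```typescript
// Hypothetical data model; names are illustrative, not taken from docs/PRD.md.
type TaskStatus = "backlog" | "running" | "review" | "done" | "failed";

interface AgentSession {
  id: string;
  notes: string;
}

interface RubricScore {
  criterion: string; // placeholder; real criteria live in docs/EVALUATION_RUBRIC.md
  points: number;
  maxPoints: number;
}

interface Task {
  id: string;
  title: string;
  status: TaskStatus;
  acceptanceCriteria: string[];
  sessions: AgentSession[];
  scores: RubricScore[];
}

// Earned points as a rounded percentage of available points; 0 when nothing is scored.
function scorePercent(scores: RubricScore[]): number {
  const max = scores.reduce((sum, s) => sum + s.maxPoints, 0);
  if (max === 0) return 0;
  const earned = scores.reduce((sum, s) => sum + s.points, 0);
  return Math.round((earned / max) * 100);
}

// Renders one task, with its sessions and score, as a markdown report.
function taskToMarkdown(task: Task): string {
  const lines = [
    `# ${task.title}`,
    "",
    `Status: ${task.status}`,
    `Score: ${scorePercent(task.scores)}%`,
    "",
    "## Acceptance criteria",
    ...task.acceptanceCriteria.map((c) => `- ${c}`),
    "",
    "## Sessions",
    ...task.sessions.map((s) => `- ${s.id}: ${s.notes}`),
  ];
  return lines.join("\n");
}
```

Modelling the status as a string union keeps the five board columns in one place, so an unsupported status fails at compile time rather than at render time.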
Planned stack:
- Next.js App Router
- TypeScript
- Tailwind CSS
- shadcn/ui-style primitives or simple accessible components
- Local JSON persistence first, SQLite optional later
- Vitest for unit tests
- Playwright for smoke/e2e tests if practical
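For the "local JSON persistence first" item, one possible shape is a pair of pure functions that serialize tasks to a JSON string and validate them on the way back in. This is a sketch under assumptions: the `{ version, tasks }` file layout and the names `StoredTask`, `serializeTasks`, and `parseTasks` are hypothetical, and reading or writing the file itself is left to the caller.

```typescript
// Hypothetical sketch of the JSON round trip; the file layout is an assumption.
const STATUSES = ["backlog", "running", "review", "done", "failed"] as const;
type Status = (typeof STATUSES)[number];

interface StoredTask {
  id: string;
  title: string;
  status: Status;
}

// Runtime check that an arbitrary parsed value has the StoredTask shape.
function isStoredTask(value: unknown): value is StoredTask {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === "string" &&
    typeof v.title === "string" &&
    typeof v.status === "string" &&
    (STATUSES as readonly string[]).includes(v.status)
  );
}

function serializeTasks(tasks: StoredTask[]): string {
  return JSON.stringify({ version: 1, tasks }, null, 2);
}

// Throws on malformed JSON or on any entry that fails validation.
function parseTasks(json: string): StoredTask[] {
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new Error("expected a JSON object");
  }
  const tasks = (data as { tasks?: unknown }).tasks;
  if (!Array.isArray(tasks) || !tasks.every(isStoredTask)) {
    throw new Error("invalid tasks file");
  }
  return tasks;
}
```

Keeping these functions free of file I/O makes them straightforward to cover with Vitest unit tests, and the same validation could be reused if SQLite is added later.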
Current status: seed docs only. Gemma should implement the application by following docs/TASKS.md in order.