125 changes: 125 additions & 0 deletions .github/skills/copilot-benchmark/SKILL.md
@@ -0,0 +1,125 @@
---
name: copilot-benchmark
description: Run the Copilot agent benchmark suite against a target repo. Use when asked to run benchmarks, benchmark Lore, measure Copilot performance, compare control vs lore-enabled, or evaluate tool effectiveness.
---

# Copilot Agent Benchmark

## Purpose

Run Lore's Copilot agent benchmark harness, which evaluates how the Copilot CLI answers codebase questions with and without Lore MCP tools, comparing the two arms on correctness, coverage, and efficiency.

## Prerequisites

- **`copilot` CLI** installed and authenticated (`copilot --version` must work).
- **Node.js 22** (use `nvm use 22`).
- **Lore built** (`npm run build`) — the test `beforeAll` also runs this.
- The suite makes real API calls, so every run consumes tokens.

## Quick start

```sh
source ~/.nvm/nvm.sh && nvm use 22
npm run build
BENCHMARK_COPILOT=1 npx vitest run tests/benchmark/copilot-agent.test.ts
```

## Environment variables

| Variable | Default | Description |
|---|---|---|
| `BENCHMARK_COPILOT` | _(unset)_ | **Required.** Set to `1` to enable the suite (skipped otherwise). |
| `BENCHMARK_REPO` | `lore-self` | Target repo. Options: `lore-self`, `zod`, `fastapi`, `esbuild`, `postgres`, `gson`. |
| `BENCHMARK_MODEL` | `claude-opus-4.6` | LLM model passed to copilot CLI `--model`. |
| `BENCHMARK_INDEX_MODE` | `scip` | Lore indexing mode: `tree-sitter`, `scip`, or `full`. |
| `BENCHMARK_ITERATIONS` | `1` | Runs per task. Use `3` or more to average out run-to-run variance. |
| `BENCHMARK_EMBEDDING_MODEL` | _(empty)_ | Embedding model, e.g. `nomic-ai/nomic-embed-text-v1.5`. |
| `BENCHMARK_LSP` | _(unset)_ | Set to `1` to enable LSP enrichment during indexing. |
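
As a rough illustration of how these variables and their defaults fit together, the sketch below resolves a configuration object from an environment map. `readBenchmarkConfig` and `BenchmarkConfig` are hypothetical names made up for this example, not the harness's actual API; only the variable names and default values come from the table above.

```typescript
// Hypothetical sketch: resolve benchmark settings from environment variables.
// Only the variable names and defaults are taken from the table above.
interface BenchmarkConfig {
  enabled: boolean;
  repo: string;
  model: string;
  indexMode: string;
  iterations: number;
  embeddingModel: string;
  lsp: boolean;
}

function readBenchmarkConfig(
  env: Record<string, string | undefined>,
): BenchmarkConfig {
  const iterations = Number.parseInt(env.BENCHMARK_ITERATIONS ?? '1', 10);
  return {
    enabled: env.BENCHMARK_COPILOT === '1',
    repo: env.BENCHMARK_REPO ?? 'lore-self',
    model: env.BENCHMARK_MODEL ?? 'claude-opus-4.6',
    indexMode: env.BENCHMARK_INDEX_MODE ?? 'scip',
    // Fall back to a single run if the value is not a positive integer.
    iterations: Number.isInteger(iterations) && iterations > 0 ? iterations : 1,
    embeddingModel: env.BENCHMARK_EMBEDDING_MODEL ?? '',
    lsp: env.BENCHMARK_LSP === '1',
  };
}
```

In a Node.js script this would typically be called as `readBenchmarkConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test.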

## Instructions

When the user asks to run, execute, or launch a Copilot benchmark:

1. **Pre-flight checks**
- Ensure Node.js 22 is active: `source ~/.nvm/nvm.sh && nvm use 22`.
- Build Lore: `npm run build`.
- Verify `copilot --version` works.

2. **Determine configuration from user request**
- Pick a repo from the available list. Default is `lore-self`.
- Pick an index mode. Default is `scip`.
- Pick an iteration count. Default is `1` for quick runs; use `3` or more to average out run-to-run variance.
- Pick model. Default is `claude-opus-4.6`.

3. **Run the benchmark**
- Launch as a background process since it runs for 10–20 minutes:
```sh
BENCHMARK_COPILOT=1 \
BENCHMARK_REPO=lore-self \
BENCHMARK_INDEX_MODE=scip \
BENCHMARK_ITERATIONS=1 \
npx vitest run tests/benchmark/copilot-agent.test.ts
```

4. **Monitor progress**
- The test outputs per-task results as they complete (e.g. `[control] lore-self-1.1-openDb: success=1 correctness=0.85 ...`).
- Each of the 16 tasks runs its two arms (control + lore-enabled) concurrently, so results arrive in batches.
- Check the terminal periodically for `N/17` progress.

5. **Interpret results**
- Per-task output shows: `success`, `correctness`, `ans_cov`, `file_cov`, `sym_cov`, `tokens`, `wall` time.
- `lore calls:` shows which Lore MCP tools were invoked (or `(none)` if the model chose not to use them).
- `MISSED parts:` and `MISSED answer lines:` show expected answers that were not covered.
- The aggregate report at the end compares control vs lore-enabled across all metrics.

6. **Report to the user**
- Summarize the total number of tasks completed and the overall success rate for each arm.
- Highlight tasks where lore-enabled outperformed control (or vice versa).
- Note Lore tool usage patterns.
- Report any tasks that timed out or failed.
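
When summarizing a run, the per-task lines described in step 4 can be turned into structured data. The sketch below is one possible way to do that; `parseResultLine` is a hypothetical helper, and the log format is assumed from the single example line in step 4, so the real output may need a different pattern.

```typescript
// Hypothetical parser for a per-task result line such as:
//   [control] lore-self-1.1-openDb: success=1 correctness=0.85
// The exact log format is an assumption based on the example in step 4.
interface TaskResultLine {
  arm: string;
  taskId: string;
  metrics: Record<string, number>;
}

function parseResultLine(line: string): TaskResultLine | null {
  const match = /^\[([^\]]+)\]\s+([^:]+):\s*(.*)$/.exec(line.trim());
  if (!match) return null;
  const [, arm, taskId, rest] = match;
  const metrics: Record<string, number> = {};
  // Collect every key=value pair whose value starts with a number.
  for (const pair of rest.matchAll(/(\w+)=(-?\d+(?:\.\d+)?)/g)) {
    metrics[pair[1]] = Number(pair[2]);
  }
  return { arm, taskId, metrics };
}
```

Lines that do not start with an `[arm]` prefix return `null`, so the helper can be mapped over raw terminal output and the misses filtered out.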

## How it works

Each task runs **two concurrent arms**:
- **Control**: Copilot CLI without the Lore MCP server registered, so no Lore tools are available.
- **Lore-enabled**: Copilot CLI with Lore MCP server registered via `--additional-mcp-config`.

Both arms answer the same question about the target codebase, then results are scored against ground-truth expected answers.
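
The two-arm pattern can be sketched as follows. This is a minimal illustration, not the harness's code: `runBothArms` is a hypothetical name, and the `runArm` callback stands in for whatever actually invokes the Copilot CLI.

```typescript
// Hypothetical sketch of the two-arm pattern: both arms get the same
// question and run concurrently. `runArm` stands in for the real CLI call.
type Arm = 'control' | 'lore-enabled';

interface ArmResult {
  arm: Arm;
  answer: string;
}

async function runBothArms(
  question: string,
  runArm: (arm: Arm, question: string) => Promise<ArmResult>,
): Promise<{ control: ArmResult; loreEnabled: ArmResult }> {
  // Both calls are started before either one is awaited.
  const [control, loreEnabled] = await Promise.all([
    runArm('control', question),
    runArm('lore-enabled', question),
  ]);
  return { control, loreEnabled };
}
```

Running the arms side by side keeps wall-clock time close to that of the slower arm instead of the sum of both.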

## Scoring metrics

- **taskSuccess**: 0 / 0.5 / 1 composite score
- **correctness**: 0–1 line-by-line match against expected answer
- **answerCoverage**: fraction of expected answer parts found
- **fileCoverage**: fraction of expected files referenced
- **symbolCoverage**: fraction of expected symbols mentioned
- **tokensUsed**: estimated token consumption
- **wallTimeMs**: end-to-end wall-clock time
- **loreToolCallCount**: number of `lore_*` tool invocations
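
The coverage metrics are all "fraction of expected items found" measures. A minimal sketch of that idea is shown below; `coverage` is a hypothetical helper, and the case-sensitive substring match is an assumption made for illustration, so the real scorer in `scorer.ts` may match differently.

```typescript
// Hypothetical coverage helper: the fraction of expected items that appear
// in the answer. Case-sensitive substring matching is an assumption here,
// not necessarily how the real scorer matches.
function coverage(expected: string[], answer: string): number {
  if (expected.length === 0) return 1; // nothing expected, so nothing missed
  const found = expected.filter((item) => answer.includes(item));
  return found.length / expected.length;
}
```

For example, `coverage(['openDb', 'closeDb'], 'calls openDb first')` evaluates to `0.5`, because one of the two expected symbols is mentioned.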

## Available repos with ground truth

| Repo | Language | Size | Tasks |
|---|---|---|---|
| `lore-self` | TypeScript | medium | 16 |
| `zod` | TypeScript | small | partial |
| `fastapi` | Python | medium | partial |
| `esbuild` | Go/TypeScript | large | partial |
| `postgres` | C | very-large | partial |

## Key files

- `tests/benchmark/copilot-agent.test.ts` — main test file
- `tests/benchmark/util/copilot-agent.ts` — copilot CLI invocation
- `tests/benchmark/util/tasks.ts` — ground truth answer tables
- `tests/benchmark/util/repos.ts` — repo specifications
- `tests/benchmark/util/scorer.ts` — scoring and report formatting
- `tests/benchmark/util/questions.ts` — question catalog and templates
- `tests/benchmark/util/types.ts` — shared types

## Troubleshooting

- **All tests skipped**: `BENCHMARK_COPILOT=1` is not set.
- **`lore calls: (none)` on all tasks**: The Lore MCP server may not be starting. Check that `dist/server/server.js` exists and the `realpathSync` fix is present (commit `ee708f8`). On macOS, symlink mismatches under `/var` can cause silent failures.
- **Timeouts**: Each arm has a 360s timeout, and complex tasks on large repos may exceed it. Try a faster model or raise the timeout in `CopilotAgentOptions`.
- **`copilot` not found**: Install the Copilot CLI and authenticate first.
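
For readers unfamiliar with per-arm timeouts, the general pattern looks roughly like the sketch below. `withTimeout` is a hypothetical helper written for this note; how the harness actually enforces its 360s limit is not shown here and may differ.

```typescript
// Hypothetical timeout wrapper; the harness's real mechanism may differ.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

Note that racing a promise only stops waiting for it. A real harness would also need to terminate the underlying CLI process when the timeout fires.
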
56 changes: 20 additions & 36 deletions tests/benchmark/util/copilot-agent.ts
@@ -225,23 +225,13 @@ export async function runCopilotAgent(
   // For lore-enabled arm, register the Lore MCP server
   let mcpConfigPath: string | undefined;
   if (arm === 'lore-enabled' && dbPath) {
-    // Find the Lore project root (where dist/ lives)
-    // For self-benchmarks this is the Lore repo itself
-    const loreProjectRoot = findLoreProjectRoot(repoPath);
+    // Always use the current Lore build, not the cloned repo's (which may
+    // be at an older SHA without recent fixes like the realpathSync guard).
+    const loreProjectRoot = findLoreProjectRoot();
     mcpConfigPath = writeLoreMcpConfig(dbPath, loreProjectRoot);
     args.push('--additional-mcp-config', `@${mcpConfigPath}`);
   }

-  // For control arm, deny Lore tools explicitly
-  if (arm === 'control') {
-    args.push(
-      '--deny-tool', 'lore_lookup', 'lore_search', 'lore_graph',
-      'lore_docs',
-      'lore_test_map', 'lore_snippet', 'lore_blame',
-      'lore_metrics', 'lore_history',
-    );
-  }
-
   if (options.extraFlags) {
     args.push(...options.extraFlags);
   }
@@ -261,7 +251,7 @@ export async function runCopilotAgent(
       toolCalls: result.toolCalls,
       filesRead: result.filesRead,
       finalAnswer: result.answer,
-      totalTokensEstimate: result.outputTokens || estimateTokensFromCalls(result.toolCalls, result.answer),
+      totalTokensEstimate: result.outputTokens,
       loreToolsCalled: extractLoreToolsCalled(result.toolCalls),
       rawOutput: output,
     };
@@ -295,29 +285,23 @@ export async function runCopilotAgent(

 // ─── Helpers ────────────────────────────────────────────────────────────────

-function estimateTokensFromCalls(calls: ToolCallRecord[], answer: string): number {
-  let totalChars = answer.length;
-  for (const call of calls) {
-    totalChars += JSON.stringify(call.args).length;
-    totalChars += call.result.length;
-  }
-  return Math.ceil(totalChars / 4);
-}
-
 /**
- * Walk up from repoPath to find the Lore project root (directory containing
- * dist/server/server.js). Falls back to __dirname-based resolution.
+ * Resolve the Lore project root that contains `dist/server/server.js`.
+ *
+ * The MCP server must always come from the **current checkout** — never from
+ * a cloned target repo, even when that target is Lore itself. A pinned-SHA
+ * clone may lack fixes present in the running build (e.g. the realpathSync
+ * entry-guard fix), which would cause the server to silently fail.
  */
-function findLoreProjectRoot(repoPath: string): string {
-  // Check if the repo itself is Lore (has a built server)
-  if (existsSync(join(repoPath, 'dist', 'server', 'server.js'))) {
-    return repoPath;
-  }
-  // Fall back to the Lore project root relative to this file
-  const pkgRoot = join(import.meta.dirname, '..', '..', '..');
-  if (existsSync(join(pkgRoot, 'dist', 'server', 'server.js'))) {
-    return pkgRoot;
+function findLoreProjectRoot(): string {
+  // Relative to this file: tests/benchmark/util/ → project root
+  const root = join(import.meta.dirname, '..', '..', '..');
+  if (!existsSync(join(root, 'dist', 'server', 'server.js'))) {
+    throw new Error(
+      'Cannot find dist/server/server.js — run `npm run build` before the benchmark.',
+    );
   }
-  // Last resort: assume Lore is built in the current working directory
-  return process.cwd();
+  return root;
 }

