Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion docs/operate/configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ Server mode only. If `ORCHESTRATOR_URL` is set, the process runs in daemon mode
| `CLAUDE_CODE_PATH` | resolved from `node_modules` | Absolute path to the Claude Code CLI `cli.js`. |
| `CLONE_BASE_DIR` | `/tmp/bot-workspaces` | Parent directory for per-delivery clones. |
| `CLONE_DEPTH` | `50` | Shallow-clone depth. Increase for deeply-diverged PRs. |
| `WORKSPACE_STALE_TTL_MS` | `3600000` | TTL before an orphaned per-job workspace triple (clone dir + `.cred.sh` + `-artifacts`) under `CLONE_BASE_DIR` is swept at startup. Reclaims SIGKILL/OOM/eviction orphans. Lower only if you understand the risk. |
| `CONTEXT7_API_KEY` | unset | Lifts Context7 MCP rate limiting. No other effect. |

## Postgres
Expand Down Expand Up @@ -189,7 +190,7 @@ before committing:

## Prompt cache layout

Selects the system/user prompt split the agent executor passes to the Claude Agent SDK. See `src/config.ts:573#promptCacheLayout` for the Zod definition and `src/core/executor.ts:208` for the runtime guard.
Selects the system/user prompt split the agent executor passes to the Claude Agent SDK. See `src/config.ts:582#promptCacheLayout` for the Zod definition and `src/core/executor.ts:208` for the runtime guard.

| Variable | Default | Notes |
| --------------------- | -------- | ------------------------------------------------------------------------------------------------------------ |
Expand Down
8 changes: 8 additions & 0 deletions docs/operate/observability.md
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,14 @@ cache_read_input_tokens / (input_tokens + cache_read_input_tokens + cache_creati

(The ratio is undefined when the denominator is zero, e.g. a dry-run that never called the model; guard against that in the query.) A high ratio on the second+ run of the same prompt shape confirms `PROMPT_CACHE_LAYOUT=cacheable` is working; a sudden drop in the per-installation cache-read share is the signature of a prompt-cache stability regression. Alert on `sum(cache_read_input_tokens) / sum(input_tokens + cache_read_input_tokens + cache_creation_input_tokens)` falling below its established baseline.

### Workspace sweep

`sweepStaleWorkspaces` (`src/core/workspace-sweep.ts`) emits one line per startup sweep when the daemon and the webhook server reclaim stale per-job workspace triples under `CLONE_BASE_DIR` (orphans left by SIGKILL/OOM/eviction). A `swept` count climbing across restarts means jobs are being killed mid-run before their own cleanup runs.

| `event` | Level | Fields |
| ----------------- | ----- | ------------------------------------------------------------------------------------------------------------ |
| `workspace.sweep` | info | `swept` (entries removed), `retained` (entries kept as fresh), `durationMs` (wall-clock time for the sweep). |

## GitHub API rate-limit fields

The `App` is constructed with `ObservableOctokit` (`src/utils/octokit-observability.ts`), an `Octokit.plugin` subclass shared by `app.octokit` and every installation octokit. It logs GitHub's per-installation rate-limit headers via `octokit.hook.after` / `hook.error`. The `pipeline.stage`-style strict Zod schema (`GithubApiLogFieldsSchema`) pins the field shape.
Expand Down
2 changes: 2 additions & 0 deletions docs/operate/runbooks/daemon-fleet.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,8 @@ flowchart LR
classDef done fill:#2a6f2a,stroke:#1a4d1a,color:#ffffff
```

At boot, before connecting to the orchestrator, the daemon sweeps stale workspace triples (clone dir + `.cred.sh` token helper + `-artifacts`) older than `WORKSPACE_STALE_TTL_MS` under `CLONE_BASE_DIR`, reclaiming SIGKILL/OOM/eviction orphans left behind when a prior run skipped its own cleanup. Each sweep emits a single `workspace.sweep` log line with `swept` / `retained` / `durationMs`.

## Operational knobs

The full list lives at [`../configuration.md`](../configuration.md#orchestrator-and-daemon). The handful you'll actually touch:
Expand Down
37 changes: 7 additions & 30 deletions src/app.ts
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import { randomUUID } from "node:crypto";
import { access, constants, mkdir, readdir, rm, stat } from "node:fs/promises";
import { access, constants } from "node:fs/promises";
import http from "node:http";
import { join } from "node:path";

Expand All @@ -17,6 +17,7 @@ import type {
import { App, Octokit } from "octokit";

import { config } from "./config";
import { sweepStaleWorkspaces } from "./core/workspace-sweep";
import { closeDb, getDb } from "./db";
import { runMigrations } from "./db/migrate";
import { installFatalHandlers, logger } from "./logger";
Expand Down Expand Up @@ -368,35 +369,11 @@ async function runStartupChecks(): Promise<void> {
}
}

// *.cred.sh files accumulate in cloneBaseDir when the pod is SIGKILL-ed mid-checkout.
// Remove files older than 1 hour to avoid leaking installation tokens across restarts.
const STALE_CRED_TTL_MS = 60 * 60 * 1000;
const staleCutoff = Date.now() - STALE_CRED_TTL_MS;

try {
// eslint-disable-next-line security/detect-non-literal-fs-filename
await mkdir(config.cloneBaseDir, { recursive: true });
// eslint-disable-next-line security/detect-non-literal-fs-filename
const entries = await readdir(config.cloneBaseDir);
const credFiles = entries.filter((f) => f.endsWith(".cred.sh"));

for (const credFile of credFiles) {
const fullPath = join(config.cloneBaseDir, credFile);
try {
// eslint-disable-next-line no-await-in-loop, security/detect-non-literal-fs-filename
const { mtimeMs } = await stat(fullPath);
if (mtimeMs < staleCutoff) {
// eslint-disable-next-line no-await-in-loop -- fullPath is join()-constructed, not user input
await rm(fullPath, { force: true });
logger.info({ credFile }, "Removed stale credential helper script");
}
} catch {
// Non-fatal: file may have been removed concurrently
}
}
} catch {
// Non-fatal: cloneBaseDir may not exist yet on a fresh pod
}
// Reclaim workspace triples (clone dir + `.cred.sh` token helper +
// `-artifacts`) orphaned in cloneBaseDir when a pod was SIGKILL-ed
// mid-run, to avoid leaking installation tokens and disk across restarts
// (issue #221).
await sweepStaleWorkspaces(config.cloneBaseDir, config.workspaceStaleTtlMs, logger);

const db = getDb();
if (db !== null) {
Expand Down
10 changes: 10 additions & 0 deletions src/config.ts
Original file line number Diff line number Diff line change
Expand Up @@ -238,6 +238,15 @@ const configSchema = z
// for testing or smaller-scope deployments.
agentTimeoutMs: z.coerce.number().int().positive().default(3_600_000),

// TTL before an orphaned per-job workspace triple (clone dir +
// `<workDir>.cred.sh` + `<workDir>-artifacts`) under cloneBaseDir is swept
// at daemon/server startup. Orphans are left behind when a pod is
// SIGKILL-ed / OOM-killed / evicted mid-run, skipping the pipeline's own
// cleanup. Default 1 hour, long enough that an in-flight job's fresh
// workspace is never reaped, short enough that a leaked install token does
// not linger across restarts. Set via WORKSPACE_STALE_TTL_MS (issue #221).
workspaceStaleTtlMs: z.coerce.number().int().positive().default(3_600_000),

// Override max turns for the Claude Agent SDK, used as a FALLBACK ONLY on
// src/core/executor.ts when invoked without an explicit `maxTurns`
// argument. Since the dispatch-collapse, the orchestrator always passes
Expand Down Expand Up @@ -913,6 +922,7 @@ function loadConfig(): Config {
// Group 5, App runtime / behaviour
context7ApiKey: process.env["CONTEXT7_API_KEY"],
cloneBaseDir: process.env["CLONE_BASE_DIR"],
workspaceStaleTtlMs: process.env["WORKSPACE_STALE_TTL_MS"],
cloneDepth: process.env["CLONE_DEPTH"],
triggerPhrase: process.env["TRIGGER_PHRASE"],
botAppLogin: process.env["BOT_APP_LOGIN"],
Expand Down
108 changes: 108 additions & 0 deletions src/core/workspace-sweep.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
/**
* Per-job workspace cleanup helpers (issue #221).
*
* A daemon job owns a "workspace triple" rooted at its workDir:
* - `<workDir>` the cloned repo
* - `<workDir>.cred.sh` the git credential helper holding the install token
* - `<workDir>-artifacts` the sibling summary dir (IMPLEMENT.md / REVIEW.md / ...)
*
* The pipeline removes all three on a clean run. A SIGKILL / OOM / eviction
* skips that path and orphans the triple, leaking disk and a short-lived token.
* `removeWorkspaceTripleSync` is the synchronous last-resort path used by the
* daemon's exit and cancel handlers; `sweepStaleWorkspaces` is the TTL reaper
* run once at daemon startup to reclaim orphans from a prior lifetime.
*/

import { rmSync } from "node:fs";
import { mkdir, readdir, rm, stat } from "node:fs/promises";
import { join } from "node:path";

/** Minimal structural logger; only `.info(obj, msg)` is used. */
interface Logger {
info: (obj: object, msg: string) => void;
}

/**
* Synchronously remove a workspace triple, best-effort. Each removal is
* isolated so one failure (e.g. a busy file) does not skip the others. Safe
* to call when paths are absent (`force: true`). Used on the process-exit and
* cancel paths where async cleanup is not an option.
*/
export function removeWorkspaceTripleSync(workDir: string): void {
// Self-enforce the scoped-job invariant: an empty workDir would make the
// calls below target CWD-relative `.cred.sh` / `-artifacts`. Callers also
// guard, but a force-rm helper must not depend on every caller remembering.
if (workDir === "") return;
try {
rmSync(workDir, { recursive: true, force: true });
} catch {
// Best effort: leak is acceptable, blocking exit is not.
}
try {
rmSync(`${workDir}.cred.sh`, { force: true });
} catch {
// Best effort.
}
try {
rmSync(`${workDir}-artifacts`, { recursive: true, force: true });
} catch {
// Best effort.
}
}

/**
* Sweep stale workspace entries under `cloneBaseDir` older than `ttlMs`,
* by entry mtime. Tolerant of a missing base dir and of concurrent removal
* (a live job deleting its own workspace mid-sweep). Emits one structured
* log line and returns counts for observability.
*
* Each entry is removed via `rm(..., { recursive: true })`, so a stale clone
* dir, its sibling `.cred.sh`, and its `-artifacts` dir are reaped as three
* independent entries (each carries its own mtime). This per-entry signal is
* safe because the sweep runs once at process startup, before any job of this
* lifetime exists, over a process-local `cloneBaseDir` (pod-local ephemeral
* storage, never a shared volume), so no in-flight job's workspace is reaped.
*/
export async function sweepStaleWorkspaces(
cloneBaseDir: string,
ttlMs: number,
log: Logger,
): Promise<{ swept: number; retained: number; durationMs: number }> {
const startedAt = Date.now();
let swept = 0;
let retained = 0;

let entries: string[];
try {
// eslint-disable-next-line security/detect-non-literal-fs-filename -- cloneBaseDir is config, not user input
await mkdir(cloneBaseDir, { recursive: true });
// eslint-disable-next-line security/detect-non-literal-fs-filename -- cloneBaseDir is config, not user input
entries = await readdir(cloneBaseDir);
} catch {
// Base dir unreadable / uncreatable: nothing to sweep.
return { swept: 0, retained: 0, durationMs: Date.now() - startedAt };
}
Comment on lines +81 to +84

const cutoff = Date.now() - ttlMs;

for (const entry of entries) {
const full = join(cloneBaseDir, entry);
try {
// eslint-disable-next-line no-await-in-loop, security/detect-non-literal-fs-filename -- full is join()-constructed
const { mtimeMs } = await stat(full);
if (mtimeMs < cutoff) {
// eslint-disable-next-line no-await-in-loop -- sequential rm bounds peak fd/IO; full is join()-constructed
await rm(full, { recursive: true, force: true });
swept++;
} else {
retained++;
}
} catch {
// Concurrent removal or transient stat/rm error: skip this entry.
}
}

const durationMs = Date.now() - startedAt;
log.info({ event: "workspace.sweep", swept, retained, durationMs }, "Swept stale workspaces");
return { swept, retained, durationMs };
}
32 changes: 7 additions & 25 deletions src/daemon/job-executor.ts
Original file line number Diff line number Diff line change
@@ -1,9 +1,8 @@
import { rmSync } from "node:fs";

import { Octokit } from "octokit";

import { config } from "../config";
import { runPipeline } from "../core/pipeline";
import { removeWorkspaceTripleSync } from "../core/workspace-sweep";
import { createChildLogger, logger } from "../logger";
import type { ActiveJob, DaemonCapabilities, SerializableBotContext } from "../shared/daemon-types";
import {
Expand Down Expand Up @@ -42,19 +41,9 @@ export function registerExitCleanup(): void {
// Scoped jobs never own a workspace path; skip the rm so we do not
// accidentally target `.cred.sh` in the daemon's CWD.
if (job.workDir === "") continue;
try {
rmSync(job.workDir, { recursive: true, force: true });
} catch {
// Best effort on exit
}
// Credential helper (${workDir}.cred.sh) is written by checkoutRepo beside
// the workspace and contains the installation token, remove it too so a
// SIGKILL / crash does not leak it for the app.ts stale-cred sweep window.
try {
rmSync(`${job.workDir}.cred.sh`, { force: true });
} catch {
// Best effort on exit
}
// Removes the clone, the `.cred.sh` token helper, and the `-artifacts`
// sibling. The artifacts dir was previously leaked on crash exit.
removeWorkspaceTripleSync(job.workDir);
}
});
}
Expand Down Expand Up @@ -732,16 +721,9 @@ export function handleJobCancel(cancel: JobCancelMessage, send: (msg: unknown) =
}

if (job.workDir !== "") {
try {
rmSync(job.workDir, { recursive: true, force: true });
} catch {
// Best effort
}
try {
rmSync(`${job.workDir}.cred.sh`, { force: true });
} catch {
// Best effort
}
// Clone + `.cred.sh` + `-artifacts` sibling; the artifacts dir was
// previously leaked on cancel.
removeWorkspaceTripleSync(job.workDir);
}

activeJobs.delete(offerId);
Expand Down
6 changes: 6 additions & 0 deletions src/daemon/main.ts
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
import { platform } from "node:os";

import { config } from "../config";
import { sweepStaleWorkspaces } from "../core/workspace-sweep";
import { installFatalHandlers, logger } from "../logger";
import type { DaemonCapabilities } from "../shared/daemon-types";
import {
Expand Down Expand Up @@ -366,6 +367,11 @@ async function main(): Promise<void> {

registerExitCleanup();

// Reclaim workspace triples orphaned by a prior SIGKILL/OOM/eviction (issue
// #221). Best-effort, runs before connecting so a cluttered base dir does
// not accumulate clones + install-token `.cred.sh` files across restarts.
await sweepStaleWorkspaces(config.cloneBaseDir, config.workspaceStaleTtlMs, logger);

capabilities = await discoverCapabilities(config.cloneBaseDir);
logger.info(
{
Expand Down
Loading
Loading