feat(observability): add 12 structured Pino event families with Zod-strict schemas#251
feat(observability): add 12 structured Pino event families with Zod-strict schemas#251chrisleekr wants to merge 3 commits into
Conversation
…trict schemas Bundles 12 area:observability issues into one change. Each family ships a Zod-strict *-log-fields.ts module with a co-located drift-prevention test, following the #242/#244 pattern: structured_output, circuit, digest, github.api.slow, github.app.token.mint, http, scheduler.scan, workflow.run, workspace, agent.tool, daemon.connection, k8s.spawn. Behavioral changes beyond logging: #236 routes 6 mint sites through a new mintInstallationToken helper (exact cache_hit via scoped hook.before); #218 removes unused onConnected/onDisconnected WsClientOptions callbacks; #223 replaces octokit hook.after/hook.error with one hook.wrap timing closure (new GITHUB_API_SLOW_REQUEST_MS env, default 3000). Security: errSerializer now strips octokit event/payload/signature carriers so the raw webhook body + HMAC signature cannot leak through err: log lines (regression-tested). Docs: 12 observability.md sections + alerts, configuration.md env var, daemon-fleet.md k8s.spawn diagnostics. Closes #216 #217 #218 #223 #228 #233 #234 #235 #236 #237 #243 #247 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Wjb51MuzTGGiJ2ggGX2DhH
|
Warning Review limit reached
More reviews will be available in 43 minutes and 35 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (61)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR expands the codebase’s structured observability surface by introducing multiple new event: "<family>.<action>" log families with strict Zod schemas and drift-prevention tests, and it threads those events through key runtime paths (scheduler scans, daemon WS lifecycle, circuit breaker, octokit requests, workspace lifecycle, token minting, etc.).
Changes:
- Add new structured log-field modules (Zod-strict schemas + tests) for multiple event families (e.g.,
digest.*,circuit.*,scheduler.scan.*,k8s.spawn.*,daemon.connection.*,structured_output.*,http.*,agent.tool.*). - Introduce/extend behavioral helpers for observability: centralized installation-token minting (
mintInstallationToken), octokit request timing viahook.wrap(duration_ms+github.api.slow), scheduler scan bracketing + counters, per-tool-call agent events, and workspace lifecycle events. - Harden log redaction by stripping
@octokit/webhookserror “carrier” fields (event/payload/signature) from serialized errors.
Reviewed changes
Copilot reviewed 60 out of 60 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ai/structured-output.ts | Adds optional per-call emission of structured_output.* events from the parse chokepoint. |
| src/ai/structured-output-log-fields.ts | Defines strict schema + event constants for structured_output.*. |
| src/app-log-fields.ts | Defines strict schema + event constants for inbound http.* boundary events. |
| src/config.ts | Adds githubApiSlowRequestMs / GITHUB_API_SLOW_REQUEST_MS config. |
| src/core/checkout.ts | Emits workspace.* clone/base-branch events with redacted errors. |
| src/core/executor.ts | Adds per-tool-call agent.tool.* emission via tool_use/tool_result pairing. |
| src/core/log-fields.ts | Adds strict schemas/constants for agent.tool.* and pending-tool metadata type. |
| src/core/workspace-events.ts | Defines strict schema + event constants for workspace.* lifecycle. |
| src/core/workspace-events.test.ts | Drift-prevention tests for workspace.* schema. |
| src/core/workspace-sweep.ts | Adds optional structured logging for cleanup success/failure inside removeWorkspaceTripleSync. |
| src/daemon/job-executor.ts | Emits workspace cleanup events for exit/cancel; adds daemon.job.cancelled line. |
| src/daemon/log-fields.ts | Defines strict schema + event constants for daemon.connection.*. |
| src/daemon/main.ts | Removes ws-client connect/disconnect callbacks; relies on daemon.connection.*. |
| src/daemon/ws-client.ts | Emits daemon.connection.* events with attempt/backoff/duration fields; redacts error text. |
| src/daemon/workflow-executor.ts | Adds workflow.run.* logging helpers at lifecycle transitions (running/succeeded/failed/etc.). |
| src/k8s/ephemeral-daemon-spawner.ts | Emits k8s.spawn.* attempted/succeeded/failed events with api_call_ms + kind. |
| src/orchestrator/connection-handler.ts | Routes multiple token-mint sites through mintInstallationToken. |
| src/orchestrator/ephemeral-daemon-scaler.ts | Emits k8s.spawn.decision_skipped with decision-time signals when provided an obs handle. |
| src/orchestrator/installation-token.ts | New helper to mint tokens with exact cache_hit probing + duration_ms. |
| src/orchestrator/k8s-spawn-log-fields.ts | Defines strict schema + event constants for k8s.spawn.*. |
| src/orchestrator/log-fields.ts | Extends orchestrator log-field schemas with github.app.token.mint.* schema/keys. |
| src/orchestrator/triage.ts | Emits new circuit.* events via breaker onEvent; threads latency_tripped. |
| src/scheduler/log-fields.ts | Defines strict schema + event constants and counters for scheduler.scan.*. |
| src/scheduler/log-fields.test.ts | Drift-prevention tests for scheduler.scan.* schema/counters. |
| src/scheduler/scheduler.ts | Adds scheduler.scan.* bracketing + counters and uses mintInstallationToken in runAction. |
| src/utils/circuit-breaker.ts | Adds onEvent and richer execution result (latencyTripped), plus skip/failure events. |
| src/utils/circuit-breaker-log-fields.ts | Defines strict schema + event constants for circuit.*. |
| src/utils/github-output-guard.ts | Threads optional logger into LLM output scanner options. |
| src/utils/llm-output-scanner.ts | Allows passing logger into parseStructuredResponse for structured_output.* coverage. |
| src/utils/log-redaction.ts | Adds stripping of octokit webhook error carriers (event/payload/signature) in errSerializer. |
| src/utils/octokit-observability.ts | Switches to hook.wrap to add duration_ms to github.api.* and emit github.api.slow. |
| src/webhook/router.ts | Threads obs into decideEphemeralSpawn and emits structured k8s.spawn.failed on config-incomplete path. |
| src/workflows/digest-log-fields.ts | Defines strict schema + event constants for digest.*. |
| src/workflows/dispatcher.ts | Adds workflow.run.* logging at queue/refusal/enqueue-failure points. |
| src/workflows/handlers/triage.ts | Routes verdict parsing through parseStructuredResponse(..., {site, log}). |
| src/workflows/intent-classifier.ts | Routes parsing through parseStructuredResponse(..., {site, log}). |
| src/workflows/orchestrator.ts | Emits workflow.run.enqueue_failed via helper on post-commit enqueue compensation path. |
| src/workflows/ship/iteration.ts | Logs workflow.run.queued after inserting queued ship runs. |
| src/workflows/ship/nl-classifier.ts | Routes parsing through parseStructuredResponse(..., {site, log}). |
| src/workflows/ship/scoped/chat-thread.ts | Routes parsing through parseStructuredResponse(..., {site, log}) and removes bespoke tolerant logs. |
| test/ai/structured-output.test.ts | Adds tests asserting structured_output.* emission behavior and schema validity. |
| test/ai/structured-output-log-fields.test.ts | Drift-prevention tests for structured_output.* schema. |
| test/app-log-fields.test.ts | Drift-prevention tests for http.* schema/constants. |
| test/core/executor.test.ts | Adds tests for per-tool-call agent.tool.* pairing and timeout draining. |
| test/core/log-fields.test.ts | Adds tests for agent.tool.* schemas/constants. |
| test/daemon/log-fields.test.ts | Drift-prevention tests for daemon.connection.* schema/constants. |
| test/daemon/ws-client.test.ts | Updates ws-client tests for removal of callbacks. |
| test/orchestrator/k8s-spawn-log-fields.test.ts | Drift-prevention tests for k8s.spawn.* schema/constants. |
| test/orchestrator/log-fields.test.ts | Adds tests for github.app.token.mint.* schema/constants. |
| test/utils/circuit-breaker-log-fields.test.ts | Drift-prevention tests for circuit.* schema/constants. |
| test/utils/log-redaction.test.ts | Adds regression tests for stripping webhook error carriers via errSerializer. |
| test/utils/octokit-observability.test.ts | Extends tests for duration_ms threading and github.api.slow. |
| test/workflows/digest-log-fields.test.ts | Drift-prevention tests for digest.* schema/constants. |
| docs/operate/configuration.md | Documents GITHUB_API_SLOW_REQUEST_MS. |
| docs/operate/runbooks/daemon-fleet.md | Adds operator guidance for diagnosing k8s.spawn.* failures by kind. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…lity Satisfies the FR-019 docs-sync guard for the src/workflows/ logging changes in this PR and points readers to the new event-family contracts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Wjb51MuzTGGiJ2ggGX2DhH
- ws-client: omit `message` from daemon.connection.error when redaction yields an empty string (schema pins message: z.string().min(1).optional()). - job-executor: derive workspace.cleanup.exit `jobIds` from the same `workspaceJobs` filter as `count`, so length and ids agree (scoped jobs with no workDir are excluded). - pin the daemon.job.cancelled event in DaemonJobCancelledLogSchema with a co-located drift test, matching the schema-per-event convention of this PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Wjb51MuzTGGiJ2ggGX2DhH
Summary
Bundles 12
area: observabilityissues (closes #216 #217 #218 #223 #228 #233 #234 #235 #236 #237 #243 #247) into one change. Each family follows the repo's established pattern from PRs #242/#244: a new*-log-fields.tsmodule holds a Zod-strict discriminated-union schema pinning every field name and type, co-located with a drift-prevention test that validates a representative payload against the schema. Emitters log plainlog.info/log.warnobjects; the schema is the contract, not a runtime validator on the hot path.Beyond logging, three behavioral changes ship in this PR: (1) #236 routes six installation-token mint sites through a new
mintInstallationTokenhelper (src/orchestrator/installation-token.ts) that probes the@octokit/auth-appcache via a scopedhook.beforeto report an exactcache_hitboolean and per-callduration_ms; (2) #218 removes the unusedonConnected/onDisconnectedcallbacks fromWsClientOptions, replacing them with structureddaemon.connection.*log lines emitted inside the ws-client itself; (3) #223 replaces the separatehook.after+hook.errorpair on octokit with a singlehook.wraptiming closure soduration_msis threaded onto every line and agithub.api.slowwarn fires when the call exceedsGITHUB_API_SLOW_REQUEST_MS(default 3000 ms, new env var).Security hardening included (surfaced by the pre-merge review panel):
errSerializerinsrc/utils/log-redaction.tsnow strips theevent,payload, andsignaturecarriers that@octokit/webhooksattaches to its errors, preventing the raw webhook body and HMAC signature from appearing inerr:log lines. A regression test is co-located intest/utils/log-redaction.test.ts.Diagram
flowchart TD classDef old fill:#7f1d1d,color:#ffffff classDef new fill:#14532d,color:#ffffff subgraph BEFORE B_octokit["octokit hook.after<br/>plus hook.error<br/>no duration_ms<br/>no slow alert"]:::old B_ws["DaemonWsClient<br/>onConnected / onDisconnected<br/>callbacks, no structured events"]:::old B_mint["6 inline mint sites<br/>no cache_hit, no duration_ms"]:::old B_err["errSerializer<br/>no octokit event/payload<br/>/signature strip"]:::old end subgraph AFTER A_octokit["octokit hook.wrap<br/>duration_ms on every line<br/>github.api.slow warn"]:::new A_ws["DaemonWsClient<br/>daemon.connection.* events<br/>attempt plus downtime_ms"]:::new A_mint["mintInstallationToken<br/>cache_hit probe via hook.before<br/>github.app.token.mint.*"]:::new A_err["errSerializer<br/>strips event/payload/signature<br/>regression-tested"]:::new A_families["10 new *-log-fields.ts modules<br/>structured_output / circuit<br/>digest / http / scheduler<br/>workflow_run / workspace<br/>agent.tool / k8s.spawn<br/>daemon.connection"]:::new end B_octokit --> A_octokit B_ws --> A_ws B_mint --> A_mint B_err --> A_errChanges
New log-fields modules (each with co-located Zod-strict schema + drift-prevention test)
src/ai/structured-output-log-fields.ts,structured_output.*(feat(observability): add structured_output.* event family to make strategy field observable across 8 LLM JSON sites #233): strategy (strict/tolerant) + latency across the 7parseStructuredResponsecall sitessrc/utils/circuit-breaker-log-fields.ts,circuit.*(feat(observability): emit structured circuit.* events for triage breaker with counters and openMs #216):opened/half_open/closed/skipped/failurewithconsecutive_failures,latency_tripped,open_ms,skips_since_openedsrc/workflows/digest-log-fields.ts,digest.*(feat(observability): add structured digest.* event family to discussion-digest, the unobserved hot-path LLM call #228): skip / per-call tokens / completed / failed on the discussion-digest LLM hot pathsrc/app-log-fields.ts,http.*(feat(observability): add structured http.* event family for inbound HTTP boundary (webhook entry, HMAC failure, scheduler endpoint) #247): inbound HTTP boundary (webhook entry, HMAC failure kind, readiness probe, scheduler endpoint)src/orchestrator/log-fields.ts(extended),github.app.token.mint.*(feat(observability): add github.app.token.mint.* event family with cache_hit + duration_ms across 6 unobserved mint sites #236) withinstallation_id,via,cache_hit,duration_mssrc/scheduler/log-fields.ts,scheduler.scan.*(feat(observability): add scheduler.scan.* lifecycle events for tick heartbeat, duration, and reentrancy #217): tick heartbeat, duration, overlap guardsrc/workflows/log-fields.ts,workflow.run.*(feat(observability): add structured workflow.run.* event family for workflow_runs lifecycle transitions #235):workflow_runslifecycle transitionssrc/core/workspace-events.ts,workspace.*(feat(observability): structured workspace.* event family for cleanup, cancel, exit-handler, and clone paths #243): clone / cleanup / cancel / exit-handler pathssrc/core/log-fields.ts(extended),agent.tool.*(feat(observability): emit per-tool-call agent.tool.{started,completed,timed_out} events with tool_use_id pairing in executor SDK loop #237): per-tool-call started/completed/timed_out paired bytool_use_idin the executor SDK loopsrc/daemon/log-fields.ts,daemon.connection.*(feat(observability): add structured daemon.connection.* events with attempt + downtime fields to ws-client #218): connect-attempt/connected/disconnected/error withattempt,downtime_ms,time_to_connect_ms,connected_duration_mssrc/orchestrator/k8s-spawn-log-fields.ts,k8s.spawn.*(feat(observability): add structured k8s.spawn.* events with EphemeralSpawnErrorKind + api_call_ms to ephemeral-daemon scaler #234): ephemeral-daemon spawn lifecycle withEphemeralSpawnErrorKindandapi_call_mssrc/utils/octokit-observability.ts(extended),github.api.slow(feat(observability): add duration_ms + github.api.slow to octokit hooks for per-request GitHub latency visibility #223) +duration_mson allgithub.api.*linesBehavioral / interface changes
src/orchestrator/installation-token.ts(new):mintInstallationTokencentralises the 6 inline mint sites; scopedapp.octokit.hook.before("request", probe)detects a cache miss exactly, with afinallyguard that always removes the hooksrc/daemon/ws-client.ts: removedonConnected/onDisconnectedfromWsClientOptions; lifecycle observability is now self-contained viadaemon.connection.*eventssrc/utils/octokit-observability.ts: replacedhook.after+hook.errorwith a singlehook.wrapclosure; new threshold-injectedslowRequestFields()pure helper; newGITHUB_API_SLOW_REQUEST_MSconfig key (default 3000 ms) insrc/config.tssrc/utils/log-redaction.ts:errSerializernow callsstripSecretCarriers(out)to dropevent/payload/signature(and recurseaggregateErrors) before the error is loggedDocs
docs/operate/observability.md: 12 new event-family sections with field tables + alertsdocs/operate/configuration.md:GITHUB_API_SLOW_REQUEST_MSenv vardocs/operate/runbooks/daemon-fleet.md:k8s.spawn.*diagnostics sectionTest plan
*-log-fields.tsmodule has a co-located drift-prevention test (pins event strings, accepts valid records, rejects drift)log-redaction.test.ts:errSerializerregression test for the octokit carrier strip (bare + AggregateError)octokit-observability.test.ts:slowRequestFields+hook.wrapduration;ws-client.test.ts: callback removal +daemon.connection.*;executor.test.ts:agent.tool.*pairing (no input/output bodies logged)check:test-globs,check:no-destructive,check:no-em-dashespass; DB-gated suites run in CI;check:docs-versions,check:docs-citations,mkdocs build --strictpassCloses #216, closes #217, closes #218, closes #223, closes #228, closes #233, closes #234, closes #235, closes #236, closes #237, closes #243, closes #247
🤖 Generated with Claude Code