Skip to content

feat(observability): emit failed_stage and failed_stage_delta_ms on pipeline.failed#244

Merged
chrisleekr merged 2 commits into
mainfrom
fix/issue-226
Jun 20, 2026
Merged

feat(observability): emit failed_stage and failed_stage_delta_ms on pipeline.failed#244
chrisleekr merged 2 commits into
mainfrom
fix/issue-226

Conversation

@chrisleekr

@chrisleekr chrisleekr commented Jun 20, 2026

Copy link
Copy Markdown
Owner

What changed and why

pipeline.failed previously carried only pipeline_wall_clock_ms + err, giving operators no signal about which stage threw or how long it had been running — they had to scroll back through pipeline.stage lines and mentally reconstruct the static stage order to diagnose a failure.

This adds a StageTracker cursor to runPipeline. Each of the seven timeStage calls and the inline prompt.build site receive it as an optional fourth argument: on stage entry the tracker is set, on success it is cleared, and on throw it is left intact. The outer catch reads stageTracker.active and conditionally spreads failed_stage + failed_stage_delta_ms into the pipeline.failed line. Both fields are omitted when no timed stage was in flight (an early failure before any stage, or after all stages succeeded).

One correctness detail: the trackingComment.finalize inner catch swallows its error and continues, so stageTracker.active is explicitly cleared there — otherwise a subsequent throw would be mis-attributed to finalize.

The change is additive: success path, control flow, and return values are unchanged. timeStage's tracker argument is optional, so every existing caller stays valid. New strict PipelineFailedLogSchema in src/core/log-fields.ts mirrors PipelineCompletedLogSchema, pinning the new fields so a rename or stray field trips the co-located test.

Flow

flowchart TD
  RP["runPipeline creates a StageTracker"]:::new
  STG["timeStage stage fn tracker"]:::keep
  SUCC["on success: tracker cleared<br/>pipeline.completed unchanged"]:::keep
  THROW["on throw: tracker keeps stage and startedAt"]:::new
  FAIL["pipeline.failed<br/>err + pipeline_wall_clock_ms<br/>+ failed_stage + failed_stage_delta_ms"]:::new
  RP --> STG
  STG --> SUCC
  STG --> THROW
  THROW --> FAIL
  classDef new fill:#196f3d,color:#ffffff
  classDef keep fill:#2c3e50,color:#ffffff
Loading

Changes

  • src/core/log-fields.tsStageTracker interface + createStageTracker(); timeStage gains an optional tracker param (set on entry, clear on success, retain on throw, re-throw unchanged); strict PipelineFailedLogSchema + PipelineFailedLog.
  • src/core/pipeline.ts — create the tracker, thread it to the 7 timeStage calls + prompt.build, clear it in the finalize inner catch, and spread the two fields in the outer catch.
  • test/core/log-fields.test.tsPipelineFailedLogSchema drift suite (valid; camelCase/extra-field/negative/wrong-event-literal/missing-required rejected) + timeStage tracker suite (clear-on-success, retain-on-throw, null-initial, no-tracker backward-compat).
  • docs/operate/observability.md — pipeline.failed row documents the two new fields.

Test plan

  • 11 new log-fields tests pass; each runPipeline-exercising handler test (review/implement/resolve) passes in isolation
  • typecheck / lint / format / docs:build clean
  • Full bun test failure count is baseline-identical (no new failures); the only combined-run failures are pre-existing cross-file mock pollution the isolated CI runner avoids

Closes #226

Summary by CodeRabbit

  • Documentation

    • Updated pipeline observability documentation with expanded failure event schemas and detailed payload field descriptions.
  • New Features

    • Enhanced pipeline failure diagnostics: failures now record which stage was active at time of failure and the elapsed duration, enabling faster root cause analysis and debugging.

…ipeline.failed

pipeline.failed carried only err + pipeline_wall_clock_ms, so an operator
had to scan back for the last pipeline.stage line and know the static stage
order to guess which stage threw. Add a StageTracker threaded as an optional
4th arg through the 7 timeStage calls + the prompt.build site: timeStage sets
it on entry, clears on success, and leaves it set on throw, so the outer
catch attributes failed_stage + failed_stage_delta_ms. The finalize inner
catch clears the tracker so a later throw is not mis-attributed. New strict
PipelineFailedLogSchema mirrors PipelineCompletedLogSchema. Additive only.

Closes #226

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015v2bspPF1ZTTG9ZZrbBgA1
Copilot AI review requested due to automatic review settings June 20, 2026 22:07
@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@chrisleekr, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 49 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9327e70b-38f0-416a-82c2-8fd099a99db7

📥 Commits

Reviewing files that changed from the base of the PR and between b91f1c9 and 7760b9c.

📒 Files selected for processing (2)
  • src/core/log-fields.ts
  • test/core/log-fields.test.ts
📝 Walkthrough

Walkthrough

Adds failed_stage and failed_stage_delta_ms fields to pipeline.failed log events. Introduces PipelineFailedLogSchema, StageTracker interface, and createStageTracker() in log-fields.ts, extends timeStage with an optional tracker parameter, wires the tracker through all runPipeline timed phases, and updates tests and observability documentation.

Changes

Stage Attribution for Pipeline Failures

Layer / File(s) Summary
PipelineFailedLogSchema, StageTracker, and timeStage contract
src/core/log-fields.ts
Adds PipelineFailedLogSchema/PipelineFailedLog Zod contract with optional failed_stage/failed_stage_delta_ms fields, introduces StageTracker interface and createStageTracker() helper, and extends timeStage to accept an optional tracker that is set before awaiting fn and cleared only on success, leaving it populated for the outer catch on failure.
runPipeline stageTracker wiring and failure attribution
src/core/pipeline.ts
Expands log-fields imports, initializes stageTracker at runPipeline entry, wraps all timed phases (trackingComment.create, token.resolve, github.fetch, checkoutRepo, executor.invoke, trackingComment.finalize, workspace.cleanup) with timeStage(..., stageTracker), manually sets/clears stageTracker.active around the prompt.build phase, clears the tracker on finalization error, and emits failed_stage/failed_stage_delta_ms from stageTracker.active in the outer catch.
Schema tests, timeStage tracker tests, and observability docs
test/core/log-fields.test.ts, docs/operate/observability.md
Adds PipelineFailedLogSchema validation tests (field acceptance, rejection of negative values, strict mode checks) and timeStage tracker behavior tests (success-path clear, failure-path preservation, null initialization, no-tracker compatibility); updates observability.md table with failed_stage/failed_stage_delta_ms fields and schema pinning notes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • chrisleekr/github-app-playground#182: Modifies the same src/core/log-fields.ts and src/core/pipeline.ts files to introduce timeStage-based pipeline.stage timing instrumentation, which this PR extends with tracker-based failure attribution.
  • chrisleekr/github-app-playground#225: Also modifies runPipeline in src/core/pipeline.ts around the same retry-wrapped pipeline phases where this PR adds stageTracker attribution.

Suggested labels

type: feature ✨, type: docs 📋

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly describes the primary feature being added: emitting failed_stage and failed_stage_delta_ms fields on pipeline.failed logs for observability.
Linked Issues check ✅ Passed The implementation fully addresses all objectives from issue #226: structured error-attribution via failed_stage and failed_stage_delta_ms, RED-method gap bridging, cardinality-safe analytics, minimal bounded scope with StageTracker, PipelineFailedLogSchema validation, and backward compatibility preservation.
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #226 objectives: log schema additions, StageTracker interface/function, timeStage signature update, runPipeline integration, tests, and documentation—no unrelated modifications present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves failure-path observability for the core execution pipeline by emitting stage attribution on the terminal pipeline.failed log line, so operators can immediately see which stage threw and how long it was running before failing.

Changes:

  • Introduces a StageTracker and threads it through runPipeline + timeStage to capture the currently active stage at the moment of failure.
  • Adds failed_stage and failed_stage_delta_ms to pipeline.failed (when applicable) and documents the new fields.
  • Adds a strict PipelineFailedLogSchema and corresponding tests to guard against future log-field drift.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
src/core/log-fields.ts Adds StageTracker + createStageTracker(), extends timeStage() with optional tracking, and introduces PipelineFailedLogSchema.
src/core/pipeline.ts Creates and threads a stage tracker through stages; emits failed_stage + failed_stage_delta_ms on pipeline.failed.
test/core/log-fields.test.ts Adds schema drift tests for pipeline.failed and behavior tests for timeStage tracking behavior.
docs/operate/observability.md Documents the new pipeline.failed fields.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/core/log-fields.ts Outdated
Comment thread test/core/log-fields.test.ts
Comment thread test/core/log-fields.test.ts
Comment thread test/core/log-fields.test.ts
Comment thread test/core/log-fields.test.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/core/log-fields.ts`:
- Around line 74-81: The PipelineFailedLogSchema object currently allows
failed_stage and failed_stage_delta_ms to be independently optional, but they
must always appear together or not at all according to the documented contract.
Add a validation constraint to the PipelineFailedLogSchema using Zod's refine
method to enforce that either both fields are present (defined) or both are
absent (undefined), rejecting any state where only one field is provided.

In `@src/core/pipeline.ts`:
- Around line 592-595: In the catch block for the cleanup error (the block
starting at line 593 that catches cleanupError), after logging the error
message, clear the stageTracker.active field similar to the pattern already used
in the trackingComment.finalize catch block around lines 510-513. This ensures
that when cleanup fails and the error is suppressed, the stageTracker state is
reset so the outer catch handler doesn't misattribute the original failure to
the "workspace.cleanup" stage.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9f4cdf94-e07f-4578-9bee-d568ca675527

📥 Commits

Reviewing files that changed from the base of the PR and between e1e7f9e and b91f1c9.

📒 Files selected for processing (4)
  • docs/operate/observability.md
  • src/core/log-fields.ts
  • src/core/pipeline.ts
  • test/core/log-fields.test.ts

Comment thread src/core/log-fields.ts Outdated
Comment thread src/core/pipeline.ts
…Schema

Address review: failed_stage and failed_stage_delta_ms are emitted together
or not at all, so a .refine rejects a record carrying only one. Add paired-
contract regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015v2bspPF1ZTTG9ZZrbBgA1
@chrisleekr chrisleekr merged commit 4f2483c into main Jun 20, 2026
15 checks passed
@chrisleekr chrisleekr deleted the fix/issue-226 branch June 20, 2026 22:18
chrisleekr added a commit that referenced this pull request Jun 25, 2026
…trict schemas

Bundles 12 area:observability issues into one change. Each family ships a
Zod-strict *-log-fields.ts module with a co-located drift-prevention test,
following the #242/#244 pattern: structured_output, circuit, digest,
github.api.slow, github.app.token.mint, http, scheduler.scan, workflow.run,
workspace, agent.tool, daemon.connection, k8s.spawn.

Behavioral changes beyond logging: #236 routes 6 mint sites through a new
mintInstallationToken helper (exact cache_hit via scoped hook.before); #218
removes unused onConnected/onDisconnected WsClientOptions callbacks; #223
replaces octokit hook.after/hook.error with one hook.wrap timing closure
(new GITHUB_API_SLOW_REQUEST_MS env, default 3000).

Security: errSerializer now strips octokit event/payload/signature carriers so
the raw webhook body + HMAC signature cannot leak through err: log lines
(regression-tested).

Docs: 12 observability.md sections + alerts, configuration.md env var,
daemon-fleet.md k8s.spawn diagnostics.

Closes #216 #217 #218 #223 #228 #233 #234 #235 #236 #237 #243 #247

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Wjb51MuzTGGiJ2ggGX2DhH
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(observability): add failed_stage + failed_stage_delta_ms to pipeline.failed for error-attribution parity with #166

2 participants