Skip to content

fix(watchdog): prevent task kill during active conversations#29

Closed
tobigs wants to merge 2 commits into
serithemage:mainfrom
tobigs:fix/watchdog-lastactivity-refresh
Closed

fix(watchdog): prevent task kill during active conversations#29
tobigs wants to merge 2 commits into
serithemage:mainfrom
tobigs:fix/watchdog-lastactivity-refresh

Conversation

@tobigs
Copy link
Copy Markdown
Contributor

@tobigs tobigs commented Apr 22, 2026

Summary

Fix the Fargate watchdog killing active containers mid-conversation. Combines a root-cause fix with a defense-in-depth timeout bump.

The bug

When a user sends a message to an already-running Fargate container (the "happy path" in routeFargate: taskStatus === "Running" && publicIp), the gateway forwards the message to Bridge and returns "sent" — but never writes the current timestamp back to TaskState.lastActivity in DynamoDB. The only places lastActivity gets updated are task-start and prewarm claim.

Result: an active user chatting with a running container leaves lastActivity frozen at the task's startedAt time. Once the watchdog's 30-minute cutoff elapses from startedAt (or less, if CloudWatch history sparseness triggered the 10-minute INACTIVE_TIMEOUT_MS branch pre-bump), the task gets killed with stoppedReason="Watchdog: inactivity timeout" despite recent traffic.

Production evidence

Reproduced on 2026-04-22 (before fix):

Time (UTC) Event
14:57:37 Task 669e54e2 started; lastActivity written once
15:00:30 User pinged; webhook log: taskStatus: Running, needsColdStart: false, message routed successfully
15:03:xx DynamoDB lastActivity still 14:57:37 — not advanced by the ping
15:27:49 Task killed: stopCode=UserInitiated, stoppedReason="Watchdog: inactivity timeout", exitCode=137

And after fix, same user, three successive happy-path pings advanced lastActivity atomically each time:

Ping Webhook received lastActivity after
1 (cold start) 15:34:06 15:37:10 (task-start path)
2 (happy path) 15:46:25 15:46:26 ← new write
3 (happy path) 15:50:57 15:50:58 ← new write

ResponseLength metric stayed non-zero on every reply (17, 17, 2 chars) — no regression in the reply pipeline.

The fix

  • Add updateLastActivity(send, userId) helper in task-state.ts using UpdateCommand with UpdateExpression: "SET lastActivity = :la" — a single-attribute atomic update, not a full-item replace.
  • Call it in routeFargate immediately after a successful sendToBridge, wrapped in a try/catch so a DDB hiccup can't fail the message delivery.
  • Wire the dep through ws-message.ts and telegram-webhook.ts callers.

Why UpdateCommand and not PutCommand

An earlier internal attempt used PutCommand({ ...taskState, lastActivity: now }) and produced empty replies under a mechanism that couldn't be reproduced locally. A single-attribute UpdateCommand can't clobber concurrent writes from container/lifecycle.updateTaskState() (which does its own PutCommand with publicIp on boot), and can't replace the whole item. This eliminates the previously-suspected whole-item-replace race.

Secondary commit: 30-minute timeout bump

INACTIVE_TIMEOUT_MS (10 min) and INACTIVITY_TIMEOUT_MS (15 min) are raised to match ACTIVE_TIMEOUT_MS (30 min). This is belt-and-braces: when CloudWatch MessageLatency has fewer than ACTIVE_HOUR_THRESHOLD=2 samples at the current KST hour, the watchdog falls through to the shorter timeouts even during legitimately active hours. A uniform 30-minute cutoff makes the sparse-history misclassification harmless. Tests updated to match.

Files changed

  • packages/gateway/src/services/task-state.ts — new updateLastActivity helper
  • packages/gateway/src/services/message.ts — happy-path call + RouteDeps interface
  • packages/gateway/src/handlers/ws-message.ts, telegram-webhook.ts — wire the dep
  • packages/gateway/__tests__/services/task-state.test.ts — test for the helper
  • packages/gateway/__tests__/services/message.test.ts — assertion on the happy path
  • packages/shared/src/constants.ts — 30-min timeouts
  • packages/gateway/__tests__/handlers/watchdog.test.ts — updated test expectations

Checks

  • npm run build — clean
  • npm run lint — clean
  • npm run test (gateway package) — 135 passed, 1 pre-existing unrelated failure (/link test asserting agentRuntime: undefined)
  • Live verified on eu-central-1 production deploy

Tobias added 2 commits April 22, 2026 17:55
Both INACTIVE_TIMEOUT_MS (10 min) and INACTIVITY_TIMEOUT_MS (15 min)
were killing Fargate tasks mid-conversation when the CloudWatch
MessageLatency history for the current KST hour had fewer than
ACTIVE_HOUR_THRESHOLD=2 matching datapoints. Raising both to match
ACTIVE_TIMEOUT_MS (30 min) makes the sparse-history misclassification
harmless.

Investigation details in docs/fargate-watchdog-investigation.md.
Tests updated to reflect the new uniform 30-min cutoff.
The routeFargate happy path (task already Running with publicIp) called
sendToBridge and returned 'sent' without writing lastActivity back to
DynamoDB. lastActivity was only advanced on task start or prewarm claim,
so an active conversation with many messages to an already-running
container looked idle to the watchdog and got killed mid-session.

Reproduced in production on 2026-04-22: task started at 14:57:37 UTC,
user pinged at 15:00:30 UTC (gateway logs show 'taskStatus: Running,
needsColdStart: false, message routed successfully'), but DynamoDB
lastActivity stayed pinned to 14:57:37. Watchdog killed the task at
15:27:49 UTC for 'inactivity timeout' (exit 137) despite a ping 27 min
earlier.

Fix: add updateLastActivity() helper using UpdateCommand (single-attribute,
atomic) and call it after every successful bridge delivery on the happy
path. UpdateCommand avoids the whole-item-replace race a prior PutCommand
attempt hit against the container's concurrent lifecycle.updateTaskState
writes — it only touches the lastActivity attribute.

Live verified post-deploy: three successive pings to the same running
task all advanced lastActivity in DynamoDB (14:57:37 baseline -> 15:37:10
-> 15:46:26 -> 15:50:58), and ResponseLength stayed non-zero on every
reply.
@tobigs tobigs requested a review from serithemage as a code owner April 22, 2026 16:05
@github-actions github-actions Bot added needs-template PR body is missing required evidence sections size: S Small pull request labels Apr 22, 2026
@github-actions
Copy link
Copy Markdown

Gatekeeping summary:

  • The PR body is missing one or more required sections from the repository template.

@github-actions github-actions Bot added area: gateway Gateway lambdas and services area: shared Shared contracts and constants area: tests Tests or fixtures labels Apr 22, 2026
@tobigs
Copy link
Copy Markdown
Contributor Author

tobigs commented Apr 22, 2026

Post-deploy A/B live confirmation

Verified on production eu-central-1 at 18:22 UTC+2, same Telegram user, no other changes between tasks:

Metric Baseline (bug) After fix
Task start 16:55:01 17:34:38
Last user ping 17:00:30 (gap to task start: 5m 29s) 17:52:57 (gap: 18m 19s)
lastActivity at kill 14:57:37 UTC (frozen at task-start) 15:50:58 UTC (advanced by happy-path UpdateCommand)
Killed at 17:27:49 18:22:50
Reason Watchdog: inactivity timeout Watchdog: inactivity timeout
Exit code 137 137
Time from last ping to kill 27m 19s (still mid-active) 29m 53s (= 30-min cutoff elapsed)

Before fix: task dies relative to startedAt, ignoring subsequent activity.
After fix: task dies relative to lastActivity, which advances on every successful message delivery.

No empty replies, no crashes, no new errors in gateway or watchdog logs across three successive happy-path pings. The a572cf3 regression mechanism did not recur — UpdateCommand on a single attribute appears to be the right guardrail.

Copy link
Copy Markdown
Owner

@serithemage serithemage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comprehensive review summary

Thanks for the thorough investigation, evidence table, and live verification — the diagnosis is precise and the UpdateCommand over PutCommand choice is exactly right. Below is a multi-aspect review (general code, error handling, test coverage, comments). Marked Critical items would benefit from a follow-up before merge; Important items are strong recommendations; Suggestions can be deferred to follow-up issues.


🔴 Critical

C1. updateLastActivity upserts a partial row when no TaskState exists

File: packages/gateway/src/services/task-state.ts:43-52

DynamoDB UpdateCommand is upsert by default. Without a ConditionExpression, this race becomes possible:

  1. routeFargate reads taskState.status === "Running".
  2. The watchdog concurrently deletes the row (e.g., ECS task disappeared).
  3. updateLastActivity writes a new row with only PK and lastActivitytaskArn/status/startedAt are absent.
  4. The next watchdog sweep computes now - new Date(undefined).getTime()NaN, the row is never reaped, and every subsequent message routes through cold start because getTaskState returns a row whose status is undefined.

Suggested fix:

new UpdateCommand({
  TableName: TABLE_NAMES.TASK_STATE,
  Key: { PK: `${KEY_PREFIX.USER}${userId}` },
  UpdateExpression: "SET lastActivity = :la",
  ConditionExpression: "attribute_exists(PK)",
  ExpressionAttributeValues: { ":la": new Date().toISOString() },
})

A ConditionalCheckFailedException then becomes an explicit, distinguishable signal ("task is no longer Running") that the caller can swallow cleanly, while real DDB errors still surface.

C2. Six stale "15-min" comments + one dead test in watchdog.test.ts

File: packages/gateway/__tests__/handlers/watchdog.test.ts

The fallback is now 30 min, but several comments still claim 15 min:

  • :358-359 — comparison comment "would be stopped with 15-min default"
  • :434it("should fall back to 15-min timeout on CloudWatch error", ...)
  • :439, :459, :539, :559 — inline assertions reference "15-min fallback"

The test at line 434 is more concerning: it asserts a 12-min-inactive task is not stopped — but that holds true whether the fallback is 15 min or 30 min, so the assertion silently degraded into a tautology. A regression that flips the fallback back to 15 min would not be caught.

Suggested fix:

  • Update the six comments and the it() title to "30-min".
  • Strengthen the line-434 test by raising the inactivity to ~35 min (or adding a second case at 32 min) so the threshold is exercised.

🟡 Important

I1. Silent failure of updateLastActivity is invisible to operators

File: packages/gateway/src/services/message.ts:99-103

The try/catch is correct (the user-facing message has already succeeded), but console.warn alone makes this PR's own bug recurrence undetectable. Throttles, IAM denials (note: dynamodb:UpdateItem is a new permission requirement), and code bugs all collapse into the same warn line.

Suggested fix:

  • Promote to console.error with structured context (userId, taskArn, errorName).
  • Emit a CloudWatch metric LastActivityRefreshFailure (the ServerlessOpenClaw namespace and emitMetric helper already exist in prewarm.ts).
  • Use the [message] log prefix to match other modules.

I2. Missing investigation document

The first commit references docs/fargate-watchdog-investigation.md, but the file is not in the diff. The 30-min decision rationale (KST sparse-history misclassification) lives only in the commit body and PR description. Per repo convention (Documentation language: English), an docs/ write-up would preserve traceability.

Suggested fix: Add the doc, or remove the reference from the commit message.

I3. Test gaps on the safety guarantees the PR claims

File: packages/gateway/__tests__/services/message.test.ts:155

The new assertion verifies that updateLastActivity is called, but not the two safety claims that motivated the try/catch:

  1. A rejecting updateLastActivity must not flip the result away from "sent".
  2. When sendToBridge fails (the three existing fallback paths: ECONNREFUSED, AbortError, 502), updateLastActivity must not be called.

Suggested test additions:

it("returns 'sent' even when updateLastActivity rejects", async () => {
  const deps = makeDeps({
    updateLastActivity: vi.fn().mockRejectedValue(new Error("ThrottlingException")),
  });
  await expect(routeMessage(deps)).resolves.toBe("sent");
});

And add expect(deps.updateLastActivity).not.toHaveBeenCalled(); to each existing bridge-failure fallback test.

I4. Comment violates the repo guideline against dated/issue references

File: packages/shared/src/constants.ts:49

export const INACTIVE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min — inactive hours (raised to match ACTIVE after investigation 2026-04-22)

CLAUDE.md explicitly says: "Don't reference the current task, fix, or callers ... those belong in the PR description and rot as the codebase evolves." The "investigation 2026-04-22" phrasing is the textbook case — by Q3 the date will be noise without a link.

Suggested fix: Replace with the domain reason, e.g. // 30 min — matches ACTIVE so off-peak users aren't killed mid-conversation when CloudWatch history is sparse.

I5. updateLastActivity JSDoc overstates concurrency safety

File: packages/gateway/src/services/task-state.ts:38-42

The container's lifecycle.updateTaskState (packages/container/src/lifecycle.ts:41-60) is itself a PutCommand writing the full row. So the real benefit of UpdateCommand here is field preservation, not race safety — last-writer-wins still applies if the two commands overlap.

Suggested rewording:

/**
 * Refresh `lastActivity` via a single-attribute UpdateCommand so the
 * container's full-row PutCommand (publicIp, taskArn, startedAt) isn't
 * clobbered by the gateway. The two writers can still race on
 * `lastActivity` itself (last-writer-wins), but field preservation is
 * what matters for the watchdog.
 *
 * Caller must ensure a row already exists; without ConditionExpression
 * this would otherwise create a partial row.
 */

(The "caller must ensure" line goes away once C1 is applied.)


🟢 Suggestions (deferrable)

  • S1. routeLambda and agentRuntime="both" lambda-decision branches don't refresh lastActivity for an active Fargate container that exists alongside a Lambda invocation. Edge case, but worth a follow-up issue.
  • S2. With INACTIVE_TIMEOUT_MS == ACTIVE_TIMEOUT_MS == INACTIVITY_TIMEOUT_MS == 30 min, getActiveTimeout()'s KST-based branching is effectively dead code while still incurring the CloudWatch GetMetricStatistics call each sweep. Consider collapsing to a single IDLE_TIMEOUT_MS in a follow-up.
  • S3. cost-optimization.md could note the +15–20 min average idle wind-down impact on Fargate Spot (small but visible against the $1/month target).
  • S4. console.warn("Failed to refresh lastActivity, continuing", err) would benefit from including userId in the structured payload (multi-tenant debuggability).
  • S5. Watchdog's StopTaskCommand has no metric or user notification — operators can't tell when a 30-min cutoff cut off a task that was still making progress. Worth a separate observability PR.
  • S6. Telegram-webhook handler test could assert updateLastActivity is wired into RouteDeps (one-line guard against future binding typos).
  • S7. The prewarm-claim path in message.ts:127-134 still uses PutCommand to overwrite the row. Same race the PR identified, but materially harmless because the PK changes during ownership transfer. Worth noting in a follow-up.

💪 Strengths

  • The race diagnosis (PutCommand vs PutCommand) and the UpdateCommand choice are exactly right.
  • Live production verification with three successive ping timestamps is convincing.
  • TDD discipline is preserved — unit test on the new helper, integration assertion on the call site, watchdog test values updated consistently.
  • DI pattern (RouteDeps) extended cleanly without breaking existing call sites.
  • PR description is unusually well-structured (timeline, evidence, root cause, race rationale).

Recommended action

Before merge: C1 (ConditionExpression), C2 (stale comments + dead test).
Strongly suggested at this PR: I1, I3 — small additions that lock in the safety guarantees the change already provides. I4 is a one-line cleanup.
Follow-up issues are fine: I2, I5 wording, all S-items.

Happy to help land any of these as a follow-up if it's easier than asking for changes here.

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Add an update or refreshed evidence to keep it open.

@github-actions github-actions Bot added the stale label May 11, 2026
@github-actions
Copy link
Copy Markdown

Closing due to inactivity. Re-open when the branch is refreshed and the evidence section is up to date.

@github-actions github-actions Bot closed this May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: gateway Gateway lambdas and services area: shared Shared contracts and constants area: tests Tests or fixtures needs-template PR body is missing required evidence sections size: S Small pull request stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants