fix(watchdog): prevent task kill during active conversations#29
Conversation
Both INACTIVE_TIMEOUT_MS (10 min) and INACTIVITY_TIMEOUT_MS (15 min) were killing Fargate tasks mid-conversation when the CloudWatch MessageLatency history for the current KST hour had fewer than ACTIVE_HOUR_THRESHOLD=2 matching datapoints. Raising both to match ACTIVE_TIMEOUT_MS (30 min) makes the sparse-history misclassification harmless. Investigation details in docs/fargate-watchdog-investigation.md. Tests updated to reflect the new uniform 30-min cutoff.
The routeFargate happy path (task already Running with publicIp) called sendToBridge and returned 'sent' without writing lastActivity back to DynamoDB. lastActivity was only advanced on task start or prewarm claim, so an active conversation with many messages to an already-running container looked idle to the watchdog and got killed mid-session. Reproduced in production on 2026-04-22: task started at 14:57:37 UTC, user pinged at 15:00:30 UTC (gateway logs show 'taskStatus: Running, needsColdStart: false, message routed successfully'), but DynamoDB lastActivity stayed pinned to 14:57:37. Watchdog killed the task at 15:27:49 UTC for 'inactivity timeout' (exit 137) despite a ping 27 min earlier. Fix: add updateLastActivity() helper using UpdateCommand (single-attribute, atomic) and call it after every successful bridge delivery on the happy path. UpdateCommand avoids the whole-item-replace race a prior PutCommand attempt hit against the container's concurrent lifecycle.updateTaskState writes — it only touches the lastActivity attribute. Live verified post-deploy: three successive pings to the same running task all advanced lastActivity in DynamoDB (14:57:37 baseline -> 15:37:10 -> 15:46:26 -> 15:50:58), and ResponseLength stayed non-zero on every reply.
|
Gatekeeping summary:
|
Post-deploy A/B live confirmationVerified on production eu-central-1 at 18:22 UTC+2, same Telegram user, no other changes between tasks:
Before fix: task dies relative to No empty replies, no crashes, no new errors in gateway or watchdog logs across three successive happy-path pings. The |
serithemage
left a comment
There was a problem hiding this comment.
Comprehensive review summary
Thanks for the thorough investigation, evidence table, and live verification — the diagnosis is precise and the UpdateCommand over PutCommand choice is exactly right. Below is a multi-aspect review (general code, error handling, test coverage, comments). Marked Critical items would benefit from a follow-up before merge; Important items are strong recommendations; Suggestions can be deferred to follow-up issues.
🔴 Critical
C1. updateLastActivity upserts a partial row when no TaskState exists
File: packages/gateway/src/services/task-state.ts:43-52
DynamoDB UpdateCommand is upsert by default. Without a ConditionExpression, this race becomes possible:
routeFargatereadstaskState.status === "Running".- The watchdog concurrently deletes the row (e.g., ECS task disappeared).
updateLastActivitywrites a new row with onlyPKandlastActivity—taskArn/status/startedAtare absent.- The next watchdog sweep computes
now - new Date(undefined).getTime()→NaN, the row is never reaped, and every subsequent message routes through cold start becausegetTaskStatereturns a row whosestatusisundefined.
Suggested fix:
new UpdateCommand({
TableName: TABLE_NAMES.TASK_STATE,
Key: { PK: `${KEY_PREFIX.USER}${userId}` },
UpdateExpression: "SET lastActivity = :la",
ConditionExpression: "attribute_exists(PK)",
ExpressionAttributeValues: { ":la": new Date().toISOString() },
})A ConditionalCheckFailedException then becomes an explicit, distinguishable signal ("task is no longer Running") that the caller can swallow cleanly, while real DDB errors still surface.
C2. Six stale "15-min" comments + one dead test in watchdog.test.ts
File: packages/gateway/__tests__/handlers/watchdog.test.ts
The fallback is now 30 min, but several comments still claim 15 min:
:358-359— comparison comment "would be stopped with 15-min default":434—it("should fall back to 15-min timeout on CloudWatch error", ...):439,:459,:539,:559— inline assertions reference "15-min fallback"
The test at line 434 is more concerning: it asserts a 12-min-inactive task is not stopped — but that holds true whether the fallback is 15 min or 30 min, so the assertion silently degraded into a tautology. A regression that flips the fallback back to 15 min would not be caught.
Suggested fix:
- Update the six comments and the
it()title to "30-min". - Strengthen the line-434 test by raising the inactivity to ~35 min (or adding a second case at 32 min) so the threshold is exercised.
🟡 Important
I1. Silent failure of updateLastActivity is invisible to operators
File: packages/gateway/src/services/message.ts:99-103
The try/catch is correct (the user-facing message has already succeeded), but console.warn alone makes this PR's own bug recurrence undetectable. Throttles, IAM denials (note: dynamodb:UpdateItem is a new permission requirement), and code bugs all collapse into the same warn line.
Suggested fix:
- Promote to
console.errorwith structured context (userId,taskArn,errorName). - Emit a CloudWatch metric
LastActivityRefreshFailure(theServerlessOpenClawnamespace andemitMetrichelper already exist inprewarm.ts). - Use the
[message]log prefix to match other modules.
I2. Missing investigation document
The first commit references docs/fargate-watchdog-investigation.md, but the file is not in the diff. The 30-min decision rationale (KST sparse-history misclassification) lives only in the commit body and PR description. Per repo convention (Documentation language: English), an docs/ write-up would preserve traceability.
Suggested fix: Add the doc, or remove the reference from the commit message.
I3. Test gaps on the safety guarantees the PR claims
File: packages/gateway/__tests__/services/message.test.ts:155
The new assertion verifies that updateLastActivity is called, but not the two safety claims that motivated the try/catch:
- A rejecting
updateLastActivitymust not flip the result away from"sent". - When
sendToBridgefails (the three existing fallback paths:ECONNREFUSED,AbortError, 502),updateLastActivitymust not be called.
Suggested test additions:
it("returns 'sent' even when updateLastActivity rejects", async () => {
const deps = makeDeps({
updateLastActivity: vi.fn().mockRejectedValue(new Error("ThrottlingException")),
});
await expect(routeMessage(deps)).resolves.toBe("sent");
});And add expect(deps.updateLastActivity).not.toHaveBeenCalled(); to each existing bridge-failure fallback test.
I4. Comment violates the repo guideline against dated/issue references
File: packages/shared/src/constants.ts:49
export const INACTIVE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min — inactive hours (raised to match ACTIVE after investigation 2026-04-22)CLAUDE.md explicitly says: "Don't reference the current task, fix, or callers ... those belong in the PR description and rot as the codebase evolves." The "investigation 2026-04-22" phrasing is the textbook case — by Q3 the date will be noise without a link.
Suggested fix: Replace with the domain reason, e.g. // 30 min — matches ACTIVE so off-peak users aren't killed mid-conversation when CloudWatch history is sparse.
I5. updateLastActivity JSDoc overstates concurrency safety
File: packages/gateway/src/services/task-state.ts:38-42
The container's lifecycle.updateTaskState (packages/container/src/lifecycle.ts:41-60) is itself a PutCommand writing the full row. So the real benefit of UpdateCommand here is field preservation, not race safety — last-writer-wins still applies if the two commands overlap.
Suggested rewording:
/**
* Refresh `lastActivity` via a single-attribute UpdateCommand so the
* container's full-row PutCommand (publicIp, taskArn, startedAt) isn't
* clobbered by the gateway. The two writers can still race on
* `lastActivity` itself (last-writer-wins), but field preservation is
* what matters for the watchdog.
*
* Caller must ensure a row already exists; without ConditionExpression
* this would otherwise create a partial row.
*/
(The "caller must ensure" line goes away once C1 is applied.)
🟢 Suggestions (deferrable)
- S1.
routeLambdaandagentRuntime="both"lambda-decision branches don't refreshlastActivityfor an active Fargate container that exists alongside a Lambda invocation. Edge case, but worth a follow-up issue. - S2. With
INACTIVE_TIMEOUT_MS == ACTIVE_TIMEOUT_MS == INACTIVITY_TIMEOUT_MS == 30 min,getActiveTimeout()'s KST-based branching is effectively dead code while still incurring the CloudWatch GetMetricStatistics call each sweep. Consider collapsing to a singleIDLE_TIMEOUT_MSin a follow-up. - S3.
cost-optimization.mdcould note the +15–20 min average idle wind-down impact on Fargate Spot (small but visible against the $1/month target). - S4.
console.warn("Failed to refresh lastActivity, continuing", err)would benefit from includinguserIdin the structured payload (multi-tenant debuggability). - S5. Watchdog's
StopTaskCommandhas no metric or user notification — operators can't tell when a 30-min cutoff cut off a task that was still making progress. Worth a separate observability PR. - S6. Telegram-webhook handler test could assert
updateLastActivityis wired intoRouteDeps(one-line guard against future binding typos). - S7. The prewarm-claim path in
message.ts:127-134still usesPutCommandto overwrite the row. Same race the PR identified, but materially harmless because the PK changes during ownership transfer. Worth noting in a follow-up.
💪 Strengths
- The race diagnosis (
PutCommandvsPutCommand) and theUpdateCommandchoice are exactly right. - Live production verification with three successive ping timestamps is convincing.
- TDD discipline is preserved — unit test on the new helper, integration assertion on the call site, watchdog test values updated consistently.
- DI pattern (
RouteDeps) extended cleanly without breaking existing call sites. - PR description is unusually well-structured (timeline, evidence, root cause, race rationale).
Recommended action
Before merge: C1 (ConditionExpression), C2 (stale comments + dead test).
Strongly suggested at this PR: I1, I3 — small additions that lock in the safety guarantees the change already provides. I4 is a one-line cleanup.
Follow-up issues are fine: I2, I5 wording, all S-items.
Happy to help land any of these as a follow-up if it's easier than asking for changes here.
|
This pull request has been automatically marked as stale due to inactivity. |
|
Closing due to inactivity. Re-open when the branch is refreshed and the evidence section is up to date. |
Summary
Fix the Fargate watchdog killing active containers mid-conversation. Combines a root-cause fix with a defense-in-depth timeout bump.
The bug
When a user sends a message to an already-running Fargate container (the "happy path" in
routeFargate:taskStatus === "Running" && publicIp), the gateway forwards the message to Bridge and returns"sent"— but never writes the current timestamp back toTaskState.lastActivityin DynamoDB. The only placeslastActivitygets updated are task-start and prewarm claim.Result: an active user chatting with a running container leaves
lastActivityfrozen at the task'sstartedAttime. Once the watchdog's 30-minute cutoff elapses fromstartedAt(or less, if CloudWatch history sparseness triggered the 10-minuteINACTIVE_TIMEOUT_MSbranch pre-bump), the task gets killed withstoppedReason="Watchdog: inactivity timeout"despite recent traffic.Production evidence
Reproduced on 2026-04-22 (before fix):
669e54e2started;lastActivitywritten oncetaskStatus: Running, needsColdStart: false, message routed successfullylastActivitystill 14:57:37 — not advanced by the pingstopCode=UserInitiated, stoppedReason="Watchdog: inactivity timeout", exitCode=137And after fix, same user, three successive happy-path pings advanced
lastActivityatomically each time:lastActivityafterResponseLengthmetric stayed non-zero on every reply (17, 17, 2 chars) — no regression in the reply pipeline.The fix
updateLastActivity(send, userId)helper intask-state.tsusingUpdateCommandwithUpdateExpression: "SET lastActivity = :la"— a single-attribute atomic update, not a full-item replace.routeFargateimmediately after a successfulsendToBridge, wrapped in a try/catch so a DDB hiccup can't fail the message delivery.ws-message.tsandtelegram-webhook.tscallers.Why
UpdateCommandand notPutCommandAn earlier internal attempt used
PutCommand({ ...taskState, lastActivity: now })and produced empty replies under a mechanism that couldn't be reproduced locally. A single-attributeUpdateCommandcan't clobber concurrent writes fromcontainer/lifecycle.updateTaskState()(which does its ownPutCommandwithpublicIpon boot), and can't replace the whole item. This eliminates the previously-suspected whole-item-replace race.Secondary commit: 30-minute timeout bump
INACTIVE_TIMEOUT_MS(10 min) andINACTIVITY_TIMEOUT_MS(15 min) are raised to matchACTIVE_TIMEOUT_MS(30 min). This is belt-and-braces: when CloudWatchMessageLatencyhas fewer thanACTIVE_HOUR_THRESHOLD=2samples at the current KST hour, the watchdog falls through to the shorter timeouts even during legitimately active hours. A uniform 30-minute cutoff makes the sparse-history misclassification harmless. Tests updated to match.Files changed
packages/gateway/src/services/task-state.ts— newupdateLastActivityhelperpackages/gateway/src/services/message.ts— happy-path call +RouteDepsinterfacepackages/gateway/src/handlers/ws-message.ts,telegram-webhook.ts— wire the deppackages/gateway/__tests__/services/task-state.test.ts— test for the helperpackages/gateway/__tests__/services/message.test.ts— assertion on the happy pathpackages/shared/src/constants.ts— 30-min timeoutspackages/gateway/__tests__/handlers/watchdog.test.ts— updated test expectationsChecks
npm run build— cleannpm run lint— cleannpm run test(gateway package) — 135 passed, 1 pre-existing unrelated failure (/linktest assertingagentRuntime: undefined)