fix(watchdog): prevent task kill during active conversations by tobigs · Pull Request #29 · serithemage/serverless-openclaw

tobigs · 2026-04-22T16:05:24Z

Summary

Fix the Fargate watchdog killing active containers mid-conversation. Combines a root-cause fix with a defense-in-depth timeout bump.

The bug

When a user sends a message to an already-running Fargate container (the "happy path" in routeFargate: taskStatus === "Running" && publicIp), the gateway forwards the message to Bridge and returns "sent" — but never writes the current timestamp back to TaskState.lastActivity in DynamoDB. The only places lastActivity gets updated are task-start and prewarm claim.

Result: an active user chatting with a running container leaves lastActivity frozen at the task's startedAt time. Once the watchdog's 30-minute cutoff elapses from startedAt (or less, if CloudWatch history sparseness triggered the 10-minute INACTIVE_TIMEOUT_MS branch pre-bump), the task gets killed with stoppedReason="Watchdog: inactivity timeout" despite recent traffic.

Production evidence

Reproduced on 2026-04-22 (before fix):

Time (UTC)	Event
14:57:37	Task `669e54e2` started; `lastActivity` written once
15:00:30	User pinged; webhook log: `taskStatus: Running, needsColdStart: false, message routed successfully`
15:03:xx	DynamoDB `lastActivity` still 14:57:37 — not advanced by the ping
15:27:49	Task killed: `stopCode=UserInitiated, stoppedReason="Watchdog: inactivity timeout", exitCode=137`

And after fix, same user, three successive happy-path pings advanced lastActivity atomically each time:

Ping	Webhook received	`lastActivity` after
1 (cold start)	15:34:06	15:37:10 (task-start path)
2 (happy path)	15:46:25	15:46:26 ← new write
3 (happy path)	15:50:57	15:50:58 ← new write

ResponseLength metric stayed non-zero on every reply (17, 17, 2 chars) — no regression in the reply pipeline.

The fix

Add updateLastActivity(send, userId) helper in task-state.ts using UpdateCommand with UpdateExpression: "SET lastActivity = :la" — a single-attribute atomic update, not a full-item replace.
Call it in routeFargate immediately after a successful sendToBridge, wrapped in a try/catch so a DDB hiccup can't fail the message delivery.
Wire the dep through ws-message.ts and telegram-webhook.ts callers.

Why `UpdateCommand` and not `PutCommand`

An earlier internal attempt used PutCommand({ ...taskState, lastActivity: now }) and produced empty replies under a mechanism that couldn't be reproduced locally. A single-attribute UpdateCommand can't clobber concurrent writes from container/lifecycle.updateTaskState() (which does its own PutCommand with publicIp on boot), and can't replace the whole item. This eliminates the previously-suspected whole-item-replace race.

Secondary commit: 30-minute timeout bump

INACTIVE_TIMEOUT_MS (10 min) and INACTIVITY_TIMEOUT_MS (15 min) are raised to match ACTIVE_TIMEOUT_MS (30 min). This is belt-and-braces: when CloudWatch MessageLatency has fewer than ACTIVE_HOUR_THRESHOLD=2 samples at the current KST hour, the watchdog falls through to the shorter timeouts even during legitimately active hours. A uniform 30-minute cutoff makes the sparse-history misclassification harmless. Tests updated to match.

Files changed

packages/gateway/src/services/task-state.ts — new updateLastActivity helper
packages/gateway/src/services/message.ts — happy-path call + RouteDeps interface
packages/gateway/src/handlers/ws-message.ts, telegram-webhook.ts — wire the dep
packages/gateway/__tests__/services/task-state.test.ts — test for the helper
packages/gateway/__tests__/services/message.test.ts — assertion on the happy path
packages/shared/src/constants.ts — 30-min timeouts
packages/gateway/__tests__/handlers/watchdog.test.ts — updated test expectations

Checks

npm run build — clean
npm run lint — clean
npm run test (gateway package) — 135 passed, 1 pre-existing unrelated failure (/link test asserting agentRuntime: undefined)
Live verified on eu-central-1 production deploy

Both INACTIVE_TIMEOUT_MS (10 min) and INACTIVITY_TIMEOUT_MS (15 min) were killing Fargate tasks mid-conversation when the CloudWatch MessageLatency history for the current KST hour had fewer than ACTIVE_HOUR_THRESHOLD=2 matching datapoints. Raising both to match ACTIVE_TIMEOUT_MS (30 min) makes the sparse-history misclassification harmless. Investigation details in docs/fargate-watchdog-investigation.md. Tests updated to reflect the new uniform 30-min cutoff.

The routeFargate happy path (task already Running with publicIp) called sendToBridge and returned 'sent' without writing lastActivity back to DynamoDB. lastActivity was only advanced on task start or prewarm claim, so an active conversation with many messages to an already-running container looked idle to the watchdog and got killed mid-session. Reproduced in production on 2026-04-22: task started at 14:57:37 UTC, user pinged at 15:00:30 UTC (gateway logs show 'taskStatus: Running, needsColdStart: false, message routed successfully'), but DynamoDB lastActivity stayed pinned to 14:57:37. Watchdog killed the task at 15:27:49 UTC for 'inactivity timeout' (exit 137) despite a ping 27 min earlier. Fix: add updateLastActivity() helper using UpdateCommand (single-attribute, atomic) and call it after every successful bridge delivery on the happy path. UpdateCommand avoids the whole-item-replace race a prior PutCommand attempt hit against the container's concurrent lifecycle.updateTaskState writes — it only touches the lastActivity attribute. Live verified post-deploy: three successive pings to the same running task all advanced lastActivity in DynamoDB (14:57:37 baseline -> 15:37:10 -> 15:46:26 -> 15:50:58), and ResponseLength stayed non-zero on every reply.

github-actions · 2026-04-22T16:05:35Z

Gatekeeping summary:

The PR body is missing one or more required sections from the repository template.

tobigs · 2026-04-22T16:26:54Z

Post-deploy A/B live confirmation

Verified on production eu-central-1 at 18:22 UTC+2, same Telegram user, no other changes between tasks:

Metric	Baseline (bug)	After fix
Task start	16:55:01	17:34:38
Last user ping	17:00:30 (gap to task start: 5m 29s)	17:52:57 (gap: 18m 19s)
`lastActivity` at kill	14:57:37 UTC (frozen at task-start)	15:50:58 UTC (advanced by happy-path UpdateCommand)
Killed at	17:27:49	18:22:50
Reason	`Watchdog: inactivity timeout`	`Watchdog: inactivity timeout`
Exit code	137	137
Time from last ping to kill	27m 19s (still mid-active)	29m 53s (= 30-min cutoff elapsed)

Before fix: task dies relative to startedAt, ignoring subsequent activity.
After fix: task dies relative to lastActivity, which advances on every successful message delivery.

No empty replies, no crashes, no new errors in gateway or watchdog logs across three successive happy-path pings. The a572cf3 regression mechanism did not recur — UpdateCommand on a single attribute appears to be the right guardrail.

serithemage

Comprehensive review summary

Thanks for the thorough investigation, evidence table, and live verification — the diagnosis is precise and the UpdateCommand over PutCommand choice is exactly right. Below is a multi-aspect review (general code, error handling, test coverage, comments). Marked Critical items would benefit from a follow-up before merge; Important items are strong recommendations; Suggestions can be deferred to follow-up issues.

🔴 Critical

C1. `updateLastActivity` upserts a partial row when no TaskState exists

File: packages/gateway/src/services/task-state.ts:43-52

DynamoDB UpdateCommand is upsert by default. Without a ConditionExpression, this race becomes possible:

routeFargate reads taskState.status === "Running".
The watchdog concurrently deletes the row (e.g., ECS task disappeared).
updateLastActivity writes a new row with only PK and lastActivity — taskArn/status/startedAt are absent.
The next watchdog sweep computes now - new Date(undefined).getTime() → NaN, the row is never reaped, and every subsequent message routes through cold start because getTaskState returns a row whose status is undefined.

Suggested fix:

new UpdateCommand({
  TableName: TABLE_NAMES.TASK_STATE,
  Key: { PK: `${KEY_PREFIX.USER}${userId}` },
  UpdateExpression: "SET lastActivity = :la",
  ConditionExpression: "attribute_exists(PK)",
  ExpressionAttributeValues: { ":la": new Date().toISOString() },
})

A ConditionalCheckFailedException then becomes an explicit, distinguishable signal ("task is no longer Running") that the caller can swallow cleanly, while real DDB errors still surface.

C2. Six stale "15-min" comments + one dead test in `watchdog.test.ts`

File: packages/gateway/__tests__/handlers/watchdog.test.ts

The fallback is now 30 min, but several comments still claim 15 min:

:358-359 — comparison comment "would be stopped with 15-min default"
:434 — it("should fall back to 15-min timeout on CloudWatch error", ...)
:439, :459, :539, :559 — inline assertions reference "15-min fallback"

The test at line 434 is more concerning: it asserts a 12-min-inactive task is not stopped — but that holds true whether the fallback is 15 min or 30 min, so the assertion silently degraded into a tautology. A regression that flips the fallback back to 15 min would not be caught.

Suggested fix:

Update the six comments and the it() title to "30-min".
Strengthen the line-434 test by raising the inactivity to ~35 min (or adding a second case at 32 min) so the threshold is exercised.

🟡 Important

I1. Silent failure of `updateLastActivity` is invisible to operators

File: packages/gateway/src/services/message.ts:99-103

The try/catch is correct (the user-facing message has already succeeded), but console.warn alone makes this PR's own bug recurrence undetectable. Throttles, IAM denials (note: dynamodb:UpdateItem is a new permission requirement), and code bugs all collapse into the same warn line.

Suggested fix:

Promote to console.error with structured context (userId, taskArn, errorName).
Emit a CloudWatch metric LastActivityRefreshFailure (the ServerlessOpenClaw namespace and emitMetric helper already exist in prewarm.ts).
Use the [message] log prefix to match other modules.

I2. Missing investigation document

The first commit references docs/fargate-watchdog-investigation.md, but the file is not in the diff. The 30-min decision rationale (KST sparse-history misclassification) lives only in the commit body and PR description. Per repo convention (Documentation language: English), an docs/ write-up would preserve traceability.

Suggested fix: Add the doc, or remove the reference from the commit message.

I3. Test gaps on the safety guarantees the PR claims

File: packages/gateway/__tests__/services/message.test.ts:155

The new assertion verifies that updateLastActivity is called, but not the two safety claims that motivated the try/catch:

A rejecting updateLastActivity must not flip the result away from "sent".
When sendToBridge fails (the three existing fallback paths: ECONNREFUSED, AbortError, 502), updateLastActivity must not be called.

Suggested test additions:

it("returns 'sent' even when updateLastActivity rejects", async () => {
  const deps = makeDeps({
    updateLastActivity: vi.fn().mockRejectedValue(new Error("ThrottlingException")),
  });
  await expect(routeMessage(deps)).resolves.toBe("sent");
});

And add expect(deps.updateLastActivity).not.toHaveBeenCalled(); to each existing bridge-failure fallback test.

I4. Comment violates the repo guideline against dated/issue references

File: packages/shared/src/constants.ts:49

export const INACTIVE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min — inactive hours (raised to match ACTIVE after investigation 2026-04-22)

CLAUDE.md explicitly says: "Don't reference the current task, fix, or callers ... those belong in the PR description and rot as the codebase evolves." The "investigation 2026-04-22" phrasing is the textbook case — by Q3 the date will be noise without a link.

Suggested fix: Replace with the domain reason, e.g. // 30 min — matches ACTIVE so off-peak users aren't killed mid-conversation when CloudWatch history is sparse.

I5. `updateLastActivity` JSDoc overstates concurrency safety

File: packages/gateway/src/services/task-state.ts:38-42

The container's lifecycle.updateTaskState (packages/container/src/lifecycle.ts:41-60) is itself a PutCommand writing the full row. So the real benefit of UpdateCommand here is field preservation, not race safety — last-writer-wins still applies if the two commands overlap.

Suggested rewording:

/**
 * Refresh `lastActivity` via a single-attribute UpdateCommand so the
 * container's full-row PutCommand (publicIp, taskArn, startedAt) isn't
 * clobbered by the gateway. The two writers can still race on
 * `lastActivity` itself (last-writer-wins), but field preservation is
 * what matters for the watchdog.
 *
 * Caller must ensure a row already exists; without ConditionExpression
 * this would otherwise create a partial row.
 */

(The "caller must ensure" line goes away once C1 is applied.)

🟢 Suggestions (deferrable)

S1. routeLambda and agentRuntime="both" lambda-decision branches don't refresh lastActivity for an active Fargate container that exists alongside a Lambda invocation. Edge case, but worth a follow-up issue.
S2. With INACTIVE_TIMEOUT_MS == ACTIVE_TIMEOUT_MS == INACTIVITY_TIMEOUT_MS == 30 min, getActiveTimeout()'s KST-based branching is effectively dead code while still incurring the CloudWatch GetMetricStatistics call each sweep. Consider collapsing to a single IDLE_TIMEOUT_MS in a follow-up.
S3. cost-optimization.md could note the +15–20 min average idle wind-down impact on Fargate Spot (small but visible against the $1/month target).
S4. console.warn("Failed to refresh lastActivity, continuing", err) would benefit from including userId in the structured payload (multi-tenant debuggability).
S5. Watchdog's StopTaskCommand has no metric or user notification — operators can't tell when a 30-min cutoff cut off a task that was still making progress. Worth a separate observability PR.
S6. Telegram-webhook handler test could assert updateLastActivity is wired into RouteDeps (one-line guard against future binding typos).
S7. The prewarm-claim path in message.ts:127-134 still uses PutCommand to overwrite the row. Same race the PR identified, but materially harmless because the PK changes during ownership transfer. Worth noting in a follow-up.

💪 Strengths

The race diagnosis (PutCommand vs PutCommand) and the UpdateCommand choice are exactly right.
Live production verification with three successive ping timestamps is convincing.
TDD discipline is preserved — unit test on the new helper, integration assertion on the call site, watchdog test values updated consistently.
DI pattern (RouteDeps) extended cleanly without breaking existing call sites.
PR description is unusually well-structured (timeline, evidence, root cause, race rationale).

Recommended action

Before merge: C1 (ConditionExpression), C2 (stale comments + dead test).
Strongly suggested at this PR: I1, I3 — small additions that lock in the safety guarantees the change already provides. I4 is a one-line cleanup.
Follow-up issues are fine: I2, I5 wording, all S-items.

Happy to help land any of these as a follow-up if it's easier than asking for changes here.

github-actions · 2026-05-11T04:42:34Z

This pull request has been automatically marked as stale due to inactivity.
Add an update or refreshed evidence to keep it open.

github-actions · 2026-05-17T04:40:23Z

Closing due to inactivity. Re-open when the branch is refreshed and the evidence section is up to date.

Tobias added 2 commits April 22, 2026 17:55

tobigs requested a review from serithemage as a code owner April 22, 2026 16:05

github-actions Bot added needs-template PR body is missing required evidence sections size: S Small pull request labels Apr 22, 2026

github-actions Bot added area: gateway Gateway lambdas and services area: shared Shared contracts and constants area: tests Tests or fixtures labels Apr 22, 2026

serithemage reviewed Apr 30, 2026

View reviewed changes

github-actions Bot added the stale label May 11, 2026

github-actions Bot closed this May 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(watchdog): prevent task kill during active conversations#29

fix(watchdog): prevent task kill during active conversations#29
tobigs wants to merge 2 commits into
serithemage:mainfrom
tobigs:fix/watchdog-lastactivity-refresh

tobigs commented Apr 22, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

tobigs commented Apr 22, 2026

Uh oh!

serithemage left a comment

Uh oh!

github-actions Bot commented May 11, 2026

Uh oh!

github-actions Bot commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tobigs commented Apr 22, 2026

Summary

The bug

Production evidence

The fix

Why UpdateCommand and not PutCommand

Secondary commit: 30-minute timeout bump

Files changed

Checks

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

tobigs commented Apr 22, 2026

Post-deploy A/B live confirmation

Uh oh!

serithemage left a comment

Choose a reason for hiding this comment

Comprehensive review summary

🔴 Critical

C1. updateLastActivity upserts a partial row when no TaskState exists

C2. Six stale "15-min" comments + one dead test in watchdog.test.ts

🟡 Important

I1. Silent failure of updateLastActivity is invisible to operators

I2. Missing investigation document

I3. Test gaps on the safety guarantees the PR claims

I4. Comment violates the repo guideline against dated/issue references

I5. updateLastActivity JSDoc overstates concurrency safety

🟢 Suggestions (deferrable)

💪 Strengths

Recommended action

Uh oh!

github-actions Bot commented May 11, 2026

Uh oh!

github-actions Bot commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Why `UpdateCommand` and not `PutCommand`

C1. `updateLastActivity` upserts a partial row when no TaskState exists

C2. Six stale "15-min" comments + one dead test in `watchdog.test.ts`

I1. Silent failure of `updateLastActivity` is invisible to operators

I5. `updateLastActivity` JSDoc overstates concurrency safety