fix: resolve open bugs #456-#460 across scheduler/retry pipelines (#462)
- #459 (low): `waitForActiveChecks` poll uses `trackTimeout` so the 100ms recursive timer is cancelled by `stopScheduler()`, eliminating the leaked handles that surfaced during graceful shutdown.
- #460 (low): webhook-URL backfill now skips only the corrupted field (`url` OR `secret`) instead of the whole row, so a malformed `url` no longer blocks encryption of a valid `whsec_` secret (and vice versa).
- #457 (med): if any Drizzle-referenced schema (monitor health cols, `pending_retry_at`, `notification_queue` cols, `automation_subscriptions`) fails to provision, install a global `/api` 503 `SCHEMA_NOT_READY` gate so clients get a clear signal instead of silent 500s per route.
- #458 (med): webhook retries are now atomically claimed via `UPDATE ... SET status='processing', claimed_at=NOW() WHERE id=? AND attempt=? RETURNING`, so replicas under Replit autoscale never double-deliver. A new `claimed_at` column enables stalled-row recovery (rows stuck >5min re-queued to 'pending'). The same infrastructure is generalised via `storage.claimDeliveryForRetry` / `recoverStalledDeliveries`.
- #456 (high): automation (Zapier REST Hook) deliveries now classify transient (5xx/429/408/network) vs persistent (4xx) failures. On transient failure we enqueue a `delivery_log` row and a new automation retry cron re-attempts up to 3 times with cumulative backoff. Only persistent failures count toward `consecutive_failures` / auto-deactivation, so a flaky upstream can't silently kill a subscription after one dropped event.

https://claude.ai/code/session_016C3aevHcGMqJL7MaEHdqN9
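The #458 claim semantics can be made concrete without a database. A hedged TypeScript sketch (illustrative, not the PR's code): an in-memory `Map` stands in for the `delivery_log` row, and a compare-and-set stands in for the single `UPDATE ... RETURNING` statement, so only one replica's claim can succeed.

```typescript
// Hedged model of the #458 atomic claim. The Map stands in for delivery_log;
// the compare-and-set below stands in for:
//   UPDATE ... SET status='processing', claimed_at=NOW()
//   WHERE id=? AND status='pending' AND attempt=? RETURNING *
type DeliveryRow = {
  id: number;
  attempt: number;
  status: "pending" | "processing";
  claimedAt: Date | null;
};

const rows = new Map<number, DeliveryRow>();
rows.set(1, { id: 1, attempt: 0, status: "pending", claimedAt: null });

function claimDeliveryForRetry(id: number, expectedAttempt: number): DeliveryRow | null {
  const row = rows.get(id);
  // The WHERE clause: only a pending row at the expected attempt can be claimed.
  if (!row || row.status !== "pending" || row.attempt !== expectedAttempt) return null;
  row.status = "processing";
  row.claimedAt = new Date();
  return row;
}

const replicaA = claimDeliveryForRetry(1, 0); // wins the claim
const replicaB = claimDeliveryForRetry(1, 0); // row is already 'processing', gets null
console.log(replicaA !== null, replicaB === null); // true true
```

Under real concurrency the database serializes the `UPDATE`, which is what makes the claim safe across autoscale replicas; the in-memory version only illustrates the state machine.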
Caution: Review failed — pull request was closed or merged during review.

📝 Walkthrough

Adds delivery-retry and recovery for automation/webhook channels, including a nullable `claimed_at` column on `delivery_log`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Scheduler
    participant Storage
    participant Automation as Automation API
    participant DB as Database
    Scheduler->>Storage: recoverStalledDeliveries("automation", cutoff)
    Storage->>DB: UPDATE delivery_log SET status='pending', claimed_at=NULL WHERE status='processing' AND claimed_at < cutoff
    DB-->>Storage: recovered count
    Scheduler->>Storage: getPendingAutomationRetries()
    Storage->>DB: SELECT * FROM delivery_log WHERE channel='automation' AND status='pending' AND attempt < max
    DB-->>Storage: pending rows
    Storage-->>Scheduler: pending entries
    Scheduler->>Storage: claimDeliveryForRetry(id, expectedAttempt, "automation")
    Storage->>DB: UPDATE ... SET status='processing', claimed_at=NOW() WHERE id=? AND status='pending' AND attempt=?
    DB-->>Storage: claimed entry or null
    alt claimed
        Scheduler->>Automation: performAutomationDelivery(hookUrl, payload)
        Automation-->>Scheduler: {kind: success|transient|persistent, statusCode?, error?}
        alt success
            Scheduler->>Storage: updateDeliveryLog(id, {status:'success', deliveredAt:NOW(), claimedAt:null})
            Scheduler->>Storage: touchAndResetAutomationSubscription(...)
        else transient
            Scheduler->>Storage: addDeliveryLog({...status:'pending', attempt:1, response:{transient:true,...}})
            Scheduler->>Storage: updateDeliveryLog(id, {status:'failed' or other, claimedAt:null})
        else persistent
            Scheduler->>Storage: handleDeliveryFailure(...) (increment failures/deactivate)
            Scheduler->>Storage: updateDeliveryLog(id, {claimedAt:null})
        end
    else not claimed
        Note over Scheduler: skip (another replica claimed it)
    end
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
shared/schema.ts (1)
327-331: 🧹 Nitpick | 🔵 Trivial

Index `claimed_at` for stalled delivery recovery.

The retry recovery path filters by `channel`, `status`, and stale `claimedAt` every minute, but the current index uses `createdAt` instead. Add a matching index here and in `ensureChannelTables()` before `delivery_log` grows enough for recovery scans to become expensive.

Suggested index addition:

```diff
 }, (table) => ({
   monitorCreatedIdx: index("delivery_log_monitor_created_idx").on(table.monitorId, table.createdAt),
   channelStatusAttemptIdx: index("delivery_log_channel_status_attempt_idx").on(table.channel, table.status, table.createdAt, table.attempt),
+  channelStatusClaimedIdx: index("delivery_log_channel_status_claimed_idx").on(table.channel, table.status, table.claimedAt),
 }));
```

Also mirror this in `server/services/ensureTables.ts`:

```diff
 await db.execute(sql`CREATE INDEX IF NOT EXISTS delivery_log_channel_status_attempt_idx ON delivery_log(channel, status, created_at, attempt)`);
+await db.execute(sql`CREATE INDEX IF NOT EXISTS delivery_log_channel_status_claimed_idx ON delivery_log(channel, status, claimed_at)`);
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@shared/schema.ts` around lines 327 - 331, The delivery retry recovery queries filter by channel, status and stale claimedAt but the schema only indexes createdAt; add a new index on the delivery_log table covering (channel, status, claimedAt) so recovery scans use the index (in the schema where claimedAt is defined alongside monitorCreatedIdx and channelStatusAttemptIdx create e.g. a channelStatusClaimedIdx index on table.channel, table.status, table.claimedAt), and make the same change inside the ensureChannelTables()/ensureTables.ts logic that creates/ensures delivery_log so the physical DB gets the identical index before the table grows large.

server/storage.ts (1)
10-76: 🛠️ Refactor suggestion | 🟠 Major

Add the new retry primitives to `IStorage`.

The new retry APIs only exist on `DatabaseStorage`. Any test double or alternate implementation typed as `IStorage` will miss the scheduler's new contract. Add the new delivery-log methods to the interface so storage remains a complete boundary.

Proposed interface additions:

```diff
   cleanupStaleAutomationSubscriptions(olderThan: Date): Promise<number>;
   getRecentChangesForMonitors(monitorIds: number[], limit: number): Promise<MonitorChange[]>;
+
+  // Delivery log retry primitives
+  updateDeliveryLog(
+    id: number,
+    updates: Partial<Pick<DeliveryLogEntry, "status" | "attempt" | "response" | "deliveredAt" | "claimedAt">>,
+  ): Promise<void>;
+  getPendingWebhookRetries(): Promise<DeliveryLogEntry[]>;
+  claimDeliveryForRetry(id: number, expectedAttempt: number, channel: string): Promise<DeliveryLogEntry | null>;
+  claimWebhookDeliveryForRetry(id: number, expectedAttempt: number): Promise<DeliveryLogEntry | null>;
+  recoverStalledDeliveries(channel: string, olderThan: Date): Promise<number>;
+  recoverStalledWebhookDeliveries(olderThan: Date): Promise<number>;
+  getPendingAutomationRetries(): Promise<DeliveryLogEntry[]>;
+  getAutomationSubscriptionById(id: number): Promise<AutomationSubscription | undefined>;
 }
```

As per coding guidelines: “Never put database queries or Drizzle ORM calls directly in route handlers — all database access must go through methods on the `IStorage` interface implemented in `server/storage.ts`.”

Also applies to: 443-524
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/storage.ts` around lines 10 - 76, The IStorage interface is missing the new retry/delivery-log primitives implemented on DatabaseStorage; add the same method signatures to IStorage so all storage implementations conform to the scheduler contract. Specifically, in the IStorage declaration add the delivery-retry methods (for example) createDeliveryLog(monitorId: number, payload: string, nextAttemptAt: Date): Promise<DeliveryLog>, getDeliveryLog(id: number): Promise<DeliveryLog | undefined>, listPendingDeliveryLogs(limit: number): Promise<DeliveryLog[]>, incrementDeliveryAttempt(id: number, nextAttemptAt: Date): Promise<number>, markDeliveryDelivered(id: number): Promise<void>, and deleteDeliveryLog(id: number): Promise<void> (or the exact method names/signatures used on DatabaseStorage) so DatabaseStorage and any test doubles satisfy the new scheduler contract.

server/services/scheduler.ts (1)
73-85: ⚠️ Potential issue | 🟠 Major

Resolve active-check waiters when stopping the scheduler.

Line 80 now tracks the shutdown polling timer, but `stopScheduler()` clears tracked timers at lines 584-585. If `activeChecks > 0`, the next poll is cancelled and the `waitForActiveChecks()` promise never reaches either the active-check condition or its timeout. That can hang shutdown despite the documented timeout guarantee.

Proposed fix:

```diff
 const cronTasks: ReturnType<typeof cron.schedule>[] = [];
 const pendingTimeouts = new Set<ReturnType<typeof setTimeout>>();
+const activeCheckWaiters = new Set<() => void>();
@@
 export function waitForActiveChecks(timeoutMs: number): Promise<void> {
   return new Promise((resolve) => {
+    let settled = false;
+    const finish = () => {
+      if (settled) return;
+      settled = true;
+      activeCheckWaiters.delete(finish);
+      resolve();
+    };
+    activeCheckWaiters.add(finish);
     const start = Date.now();
     const check = () => {
       if (activeChecks === 0 || Date.now() - start > timeoutMs) {
-        resolve();
+        finish();
       } else {
         trackTimeout(check, 100);
       }
@@
   pendingTimeouts.forEach((handle) => {
     clearTimeout(handle);
   });
   pendingTimeouts.clear();
+  for (const finish of activeCheckWaiters) finish();
+  activeCheckWaiters.clear();
   retryBackoff.clear();
```

Also applies to: 579-585
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/services/scheduler.ts` around lines 73 - 85, The waitForActiveChecks Promise can hang when stopScheduler clears tracked timers; to fix, add a small shutdown-waiter mechanism: in waitForActiveChecks (function waitForActiveChecks) push the Promise's resolve into a global array (e.g., shutdownWaiters) in addition to scheduling poll timers with trackTimeout, and when stopScheduler runs (function stopScheduler) iterate shutdownWaiters and call each resolver before/while clearing tracked timers so any outstanding waiters are unblocked; ensure you remove the resolver from shutdownWaiters when the promise resolves normally and keep using activeChecks, trackTimeout, and the existing timeout logic unchanged.
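The waiter mechanism the comment describes can be sketched end-to-end. A minimal, self-contained TypeScript model; the names `activeChecks`, `trackTimeout`, and `stopScheduler` follow the review, but the bodies here are illustrative, not the project's actual implementation:

```typescript
// Illustrative model of the shutdown-waiter fix: waiters register their resolver
// so stopScheduler() can unblock them after cancelling the poll timers.
let activeChecks = 1;
const pendingTimeouts = new Set<ReturnType<typeof setTimeout>>();
const activeCheckWaiters = new Set<() => void>();

function trackTimeout(fn: () => void, ms: number): void {
  const handle = setTimeout(() => {
    pendingTimeouts.delete(handle);
    fn();
  }, ms);
  pendingTimeouts.add(handle);
}

function waitForActiveChecks(timeoutMs: number): Promise<void> {
  return new Promise((resolve) => {
    let settled = false;
    const finish = () => {
      if (settled) return;
      settled = true;
      activeCheckWaiters.delete(finish);
      resolve();
    };
    activeCheckWaiters.add(finish); // registered so shutdown can unblock us
    const start = Date.now();
    const check = () => {
      if (activeChecks === 0 || Date.now() - start > timeoutMs) finish();
      else trackTimeout(check, 100);
    };
    check();
  });
}

function stopScheduler(): void {
  pendingTimeouts.forEach((handle) => clearTimeout(handle));
  pendingTimeouts.clear();
  // Without this loop, a waiter whose poll timer was just cancelled would hang forever.
  for (const finish of activeCheckWaiters) finish();
  activeCheckWaiters.clear();
}

// activeChecks is still 1, so the waiter is polling; stopScheduler() must resolve it.
const done = waitForActiveChecks(60_000).then(() => "resolved");
stopScheduler();
done.then((msg) => console.log(msg)); // prints "resolved"
```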
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/routes.ts`:
- Around line 79-91: The ensureChannelTables() function currently swallows
errors during its bootstrap (so an ALTER that fails for delivery_log.claimed_at
can leave the API live but broken); change ensureChannelTables() to return a
boolean: true on success and false from its catch block (or rethrow and let
caller handle), then in the routes initialization replace the current
fire-and-forget call with const channelsReady = await ensureChannelTables(); and
if (!channelsReady) push "channel tables" into criticalSchemaFailures so the
readiness gate will block traffic until the schema fix is applied; reference
ensureChannelTables(), delivery_log.claimed_at, and criticalSchemaFailures when
making the changes.
In `@server/services/automationDelivery.test.ts`:
- Around line 145-173: Add tests in server/services/automationDelivery.test.ts
that mirror the existing "queues durable retry" cases for 500 and network errors
but use HTTP 429 and 408 responses: call deliverToAutomationSubscriptions with
mockGetActiveAutomationSubscriptions returning a subscription and
mockSsrfSafeFetch resolved with { ok: false, status: 429 } and then with status:
408; in each case assert mockAddDeliveryLog was called with an objectContaining
channel: "automation", status: "pending", response containing transient: true
(and include subscriptionId/platform when applicable) and assert
mockIncrementAutomationSubscriptionFailures and
mockTouchAndResetAutomationSubscription were not called. Use the same helper
functions (makeSub, makeMonitor, makeChange) and matchers as the existing tests
to keep style consistent.
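The transient/persistent split those tests exercise can be captured by a small classifier. A hedged sketch: `classifyStatus` is a hypothetical helper name, and it shows only the status-code half (network errors, i.e. fetch rejections, would map to transient in a separate catch path):

```typescript
// Hedged sketch of the #456 failure classification by HTTP status code.
type Outcome =
  | { kind: "success"; statusCode: number }
  | { kind: "transient"; statusCode: number; error: string }
  | { kind: "persistent"; statusCode: number; error: string };

function classifyStatus(statusCode: number): Outcome {
  if (statusCode >= 200 && statusCode < 300) return { kind: "success", statusCode };
  // Retryable: server errors, rate limiting (429), request timeout (408).
  if (statusCode >= 500 || statusCode === 429 || statusCode === 408) {
    return { kind: "transient", statusCode, error: `HTTP ${statusCode}` };
  }
  // Remaining 4xx (bad hook URL, gone subscription, ...) should not be retried.
  return { kind: "persistent", statusCode, error: `HTTP ${statusCode}` };
}

console.log(classifyStatus(503).kind); // transient
console.log(classifyStatus(429).kind); // transient
console.log(classifyStatus(404).kind); // persistent
```

Only the persistent branch would feed `consecutive_failures`; the transient branch enqueues a `delivery_log` retry row instead.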
In `@server/services/automationDelivery.ts`:
- Around line 71-116: The code currently uses Promise.allSettled on the array
deliveries (created around performAutomationDelivery, storage.addDeliveryLog and
handleDeliveryFailure) but ignores rejected results, allowing failures from
addDeliveryLog fallback or handleDeliveryFailure to be swallowed; change the
completion handling to inspect the results: await Promise.allSettled(deliveries)
then iterate the settled results and for any result.status === "rejected" log a
clear error (including the subscription id, monitor.id and the rejection reason)
and rethrow or return a non-zero failure (e.g., throw the first rejection) so
the caller sees the failure; alternatively use await Promise.all(deliveries) if
you want immediate bubbling of the first rejection—ensure logging uses the same
contextual fields passed to storage.addDeliveryLog/handleDeliveryFailure
(subscriptionId/sub.id, monitorId/monitor.id, platform/sub.platform) so failures
aren’t silent.
In `@server/services/scheduler.test.ts`:
- Around line 36-43: The test currently mocks automation retry hooks but never
exercises the automation retry loop because getPendingAutomationRetries returns
an empty array; update the test to return one or more pending retry records from
getPendingAutomationRetries and add assertions that claimDeliveryForRetry is
invoked with channel="automation", that when claimDeliveryForRetry returns null
the delivery is skipped, that when getAutomationSubscriptionById returns
undefined the code clears claimedAt on the pending retry record, and that a
successful delivery path invokes the delivery finalizer (the function your code
uses to finalize deliveries); also add focused assertions for edge cases (empty
arrays, null/undefined fields), error paths (invalid IDs, failed claim), and a
security/ownership case to satisfy the guidelines.
In `@server/services/scheduler.ts`:
- Around line 478-490: The delivery row is being marked failed via
withDbRetry(storage.updateDeliveryLog(..., entry.id)) before calling
finalizeAutomationRetry(sub.id,...), which can leave the row irrecoverable if
finalization fails; change this so the terminal delivery-log update and the
subscription failure/deactivation side effects are executed atomically in one
storage transaction (e.g., add a new storage method like
finalizeDeliveryAndSubscriptionFailure that accepts entry.id, sub.id,
outcome/platform/monitor and performs updateDeliveryLog plus the
finalizeAutomationRetry work inside a single DB transaction, or call both
updates inside an existing storage.transaction/withDbRetry wrapper) so that
either both the delivery row and subscription counters are committed or none
are.
---
Outside diff comments:
In `@server/services/scheduler.ts`:
- Around line 73-85: The waitForActiveChecks Promise can hang when stopScheduler
clears tracked timers; to fix, add a small shutdown-waiter mechanism: in
waitForActiveChecks (function waitForActiveChecks) push the Promise's resolve
into a global array (e.g., shutdownWaiters) in addition to scheduling poll
timers with trackTimeout, and when stopScheduler runs (function stopScheduler)
iterate shutdownWaiters and call each resolver before/while clearing tracked
timers so any outstanding waiters are unblocked; ensure you remove the resolver
from shutdownWaiters when the promise resolves normally and keep using
activeChecks, trackTimeout, and the existing timeout logic unchanged.
In `@server/storage.ts`:
- Around line 10-76: The IStorage interface is missing the new
retry/delivery-log primitives implemented on DatabaseStorage; add the same
method signatures to IStorage so all storage implementations conform to the
scheduler contract. Specifically, in the IStorage declaration add the
delivery-retry methods (for example) createDeliveryLog(monitorId: number,
payload: string, nextAttemptAt: Date): Promise<DeliveryLog>, getDeliveryLog(id:
number): Promise<DeliveryLog | undefined>, listPendingDeliveryLogs(limit:
number): Promise<DeliveryLog[]>, incrementDeliveryAttempt(id: number,
nextAttemptAt: Date): Promise<number>, markDeliveryDelivered(id: number):
Promise<void>, and deleteDeliveryLog(id: number): Promise<void> (or the exact
method names/signatures used on DatabaseStorage) so DatabaseStorage and any test
doubles satisfy the new scheduler contract.
In `@shared/schema.ts`:
- Around line 327-331: The delivery retry recovery queries filter by channel,
status and stale claimedAt but the schema only indexes createdAt; add a new
index on the delivery_log table covering (channel, status, claimedAt) so
recovery scans use the index (in the schema where claimedAt is defined alongside
monitorCreatedIdx and channelStatusAttemptIdx create e.g. a
channelStatusClaimedIdx index on table.channel, table.status, table.claimedAt),
and make the same change inside the ensureChannelTables()/ensureTables.ts logic
that creates/ensures delivery_log so the physical DB gets the identical index
before the table grows large.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6ab6ad94-bba6-42ea-8f7f-f1f2676307d0
📒 Files selected for processing (10)
- server/routes.conditions.test.ts
- server/routes.ts
- server/services/automationDelivery.test.ts
- server/services/automationDelivery.ts
- server/services/ensureTables.test.ts
- server/services/ensureTables.ts
- server/services/scheduler.test.ts
- server/services/scheduler.ts
- server/storage.ts
- shared/schema.ts
```ts
const apiKeysReady = await ensureApiKeysTable();
await ensureChannelTables();
await ensureMonitorChangesIndexes();
await ensureCampaignPartialIndexes();
const notificationQueueReady = await ensureNotificationQueueColumns();
if (!notificationQueueReady) {
  console.error("CRITICAL: notification_queue columns missing — notification cron queries will fail");
  criticalSchemaFailures.push("notification_queue columns");
}
await ensureTagTables();
const automationSubsReady = await ensureAutomationSubscriptionsTable();
if (!automationSubsReady) {
  console.error("CRITICAL: automation_subscriptions table missing — Zapier endpoints and automation delivery will fail");
  criticalSchemaFailures.push("automation_subscriptions table");
}
```
Gate traffic when delivery_log.claimed_at bootstrap fails.
The new schema-not-ready gate checks boolean ensure calls, but ensureChannelTables() still swallows failures. If the new claimed_at ALTER fails, retry/delivery-log code can hit column claimed_at does not exist while /api stays live.
Suggested readiness wiring:

```diff
 const apiKeysReady = await ensureApiKeysTable();
-await ensureChannelTables();
+const channelTablesReady = await ensureChannelTables();
+if (!channelTablesReady) {
+  criticalSchemaFailures.push("notification channel / delivery_log columns");
+}
 await ensureMonitorChangesIndexes();
```

And change `ensureChannelTables()` to return `true` on success and `false` in its catch block.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/routes.ts` around lines 79 - 91, The ensureChannelTables() function
currently swallows errors during its bootstrap (so an ALTER that fails for
delivery_log.claimed_at can leave the API live but broken); change
ensureChannelTables() to return a boolean: true on success and false from its
catch block (or rethrow and let caller handle), then in the routes
initialization replace the current fire-and-forget call with const channelsReady
= await ensureChannelTables(); and if (!channelsReady) push "channel tables"
into criticalSchemaFailures so the readiness gate will block traffic until the
schema fix is applied; reference ensureChannelTables(), delivery_log.claimed_at,
and criticalSchemaFailures when making the changes.
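The readiness gate this prompt wires up can be reduced to a pure predicate that an `/api` middleware would consult. A hedged sketch; `schemaGate` is a hypothetical helper name:

```typescript
// Hedged sketch: the SCHEMA_NOT_READY gate as a pure function over the list of
// critical bootstrap failures collected during startup.
type GateResult =
  | { status: 200 }
  | { status: 503; body: { code: "SCHEMA_NOT_READY"; missing: string[] } };

function schemaGate(criticalSchemaFailures: string[]): GateResult {
  if (criticalSchemaFailures.length > 0) {
    // One clear 503 for every /api route instead of scattered per-route 500s.
    return { status: 503, body: { code: "SCHEMA_NOT_READY", missing: criticalSchemaFailures } };
  }
  return { status: 200 };
}

console.log(schemaGate([]).status); // 200
console.log(schemaGate(["channel tables"]).status); // 503
```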
```ts
const deliveries = subscriptions.map(async (sub) => {
  try {
    // Automation hook URLs (e.g. Zapier) are unguessable bearer tokens.
    // No HMAC signature header — the hookUrl itself authenticates the request.
    // Webhook channels use X-FTC-Signature-256 with a per-channel secret.
    const response = await ssrfSafeFetch(sub.hookUrl, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "User-Agent": "FetchTheChange-Zapier/1.0",
      },
      body,
      signal: AbortSignal.timeout(5000),
    const outcome = await performAutomationDelivery(sub.hookUrl, body);

    if (outcome.kind === "success") {
      console.log(`[Automation] Delivered successfully (monitorId=${monitor.id}, platform=${sub.platform}, subId=${sub.id}, status=${outcome.statusCode})`);
      storage.touchAndResetAutomationSubscription(sub.id).catch((err) => {
        console.warn(`[Automation] Failed to reset failure counter for subscription ${sub.id}`, err);
      });
      return;
    }

    if (response.ok) {
      console.log(`[Automation] Delivered successfully (monitorId=${monitor.id}, platform=${sub.platform}, subId=${sub.id}, status=${response.status})`);
      // Atomic touch + failure counter reset in a single UPDATE
      storage.touchAndResetAutomationSubscription(sub.id).catch((err) => {
        console.warn(`[Automation] Failed to reset failure counter for subscription ${sub.id}`, err);
    if (outcome.kind === "transient") {
      // Enqueue a durable retry — cron will re-attempt up to 3 times total.
      try {
        await storage.addDeliveryLog({
          monitorId: monitor.id,
          changeId: change.id,
          channel: "automation",
          status: "pending",
          attempt: 1,
          response: {
            subscriptionId: sub.id,
            platform: sub.platform,
            error: outcome.error,
            transient: true,
          },
        });
        console.warn(`[scheduler] Automation delivery transient failure — queued for retry for monitor "${monitor.name}"`, {
          subscriptionId: sub.id,
          monitorId: monitor.id,
          platform: sub.platform,
          error: outcome.error,
        });
    } else {
      await handleDeliveryFailure(sub.id, monitor, sub.platform, `HTTP ${response.status}`);
      } catch (logErr) {
        // If we can't enqueue the retry, fall back to the old behavior to avoid losing all signal.
        console.error(`[Automation] Failed to enqueue retry delivery_log row — falling back to failure counter bump`, logErr);
        await handleDeliveryFailure(sub.id, monitor, sub.platform, outcome.error);
      }
  } catch (err) {
    // Sanitize error to avoid leaking hookUrl secrets in logs
    const rawMsg = err instanceof Error ? err.message : String(err);
    const safeError = rawMsg.replace(/https?:\/\/[^\s)]+/g, "[redacted-url]");
    await handleDeliveryFailure(sub.id, monitor, sub.platform, safeError);
    return;
  }

    // Persistent (4xx) — no retry; bump failure counter as before.
    await handleDeliveryFailure(sub.id, monitor, sub.platform, outcome.error);
});

await Promise.allSettled(deliveries);
```
Log rejected delivery tasks instead of swallowing them.
Promise.allSettled() prevents rethrowing, but the results are ignored. Any rejection from addDeliveryLog() fallback or handleDeliveryFailure() disappears, so failure counters/deactivation can fail silently.
Proposed fix:

```diff
-  await Promise.allSettled(deliveries);
+  const results = await Promise.allSettled(deliveries);
+  for (const [index, result] of results.entries()) {
+    if (result.status === "rejected") {
+      const sub = subscriptions[index];
+      console.error("[Automation] Delivery handling failed", {
+        subscriptionId: sub?.id,
+        monitorId: monitor.id,
+        platform: sub?.platform,
+        errorMessage: result.reason instanceof Error ? result.reason.message : String(result.reason),
+      });
+    }
+  }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/services/automationDelivery.ts` around lines 71 - 116, The code
currently uses Promise.allSettled on the array deliveries (created around
performAutomationDelivery, storage.addDeliveryLog and handleDeliveryFailure) but
ignores rejected results, allowing failures from addDeliveryLog fallback or
handleDeliveryFailure to be swallowed; change the completion handling to inspect
the results: await Promise.allSettled(deliveries) then iterate the settled
results and for any result.status === "rejected" log a clear error (including
the subscription id, monitor.id and the rejection reason) and rethrow or return
a non-zero failure (e.g., throw the first rejection) so the caller sees the
failure; alternatively use await Promise.all(deliveries) if you want immediate
bubbling of the first rejection—ensure logging uses the same contextual fields
passed to storage.addDeliveryLog/handleDeliveryFailure (subscriptionId/sub.id,
monitorId/monitor.id, platform/sub.platform) so failures aren’t silent.
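The settled-results inspection this prompt asks for is easy to verify in isolation. A minimal runnable TypeScript model, independent of the project's storage layer:

```typescript
// Minimal model of the fix: inspect settled results instead of dropping them.
const deliveries: Promise<string>[] = [
  Promise.resolve("delivered"),
  Promise.reject(new Error("handleDeliveryFailure blew up")),
];

const results = await Promise.allSettled(deliveries);
const failures = results.filter(
  (r): r is PromiseRejectedResult => r.status === "rejected",
);
for (const f of failures) {
  // In the real code this log line would carry subscriptionId / monitorId / platform.
  console.error("[Automation] Delivery handling failed", {
    errorMessage: f.reason instanceof Error ? f.reason.message : String(f.reason),
  });
}
console.log(failures.length); // 1
```

`Promise.allSettled` never rejects, so without this loop a rejection from the fallback path would vanish silently.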
```ts
claimDeliveryForRetry: vi.fn().mockImplementation(async (id: number, attempt: number, channel: string) => ({
  id, attempt, channel, status: "processing", createdAt: new Date(),
  monitorId: id, changeId: id, response: null, deliveredAt: null, claimedAt: new Date(),
})),
recoverStalledWebhookDeliveries: vi.fn().mockResolvedValue(0),
recoverStalledDeliveries: vi.fn().mockResolvedValue(0),
getPendingAutomationRetries: vi.fn().mockResolvedValue([]),
getAutomationSubscriptionById: vi.fn().mockResolvedValue(undefined),
```
Add assertions for the automation retry cron path.
These mocks wire the new automation retry dependencies, but the default getPendingAutomationRetries: [] means no test exercises the loop. Add coverage for: due retry is claimed with channel="automation", claim miss skips delivery, missing subscriptionId/subscription clears claimedAt, and successful delivery calls the finalizer.
As per coding guidelines, Tests must cover edge cases and error paths - include assertions for edge cases (empty inputs, boundary values, null/undefined), error paths (invalid input, unauthorized access, not found), and security-relevant scenarios (ownership violations, SSRF attempts).
Also applies to: 68-70
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/services/scheduler.test.ts` around lines 36 - 43, The test currently
mocks automation retry hooks but never exercises the automation retry loop
because getPendingAutomationRetries returns an empty array; update the test to
return one or more pending retry records from getPendingAutomationRetries and
add assertions that claimDeliveryForRetry is invoked with channel="automation",
that when claimDeliveryForRetry returns null the delivery is skipped, that when
getAutomationSubscriptionById returns undefined the code clears claimedAt on the
pending retry record, and that a successful delivery path invokes the delivery
finalizer (the function your code uses to finalize deliveries); also add focused
assertions for edge cases (empty arrays, null/undefined fields), error paths
(invalid IDs, failed claim), and a security/ownership case to satisfy the
guidelines.
```ts
} else if (outcome.kind === "persistent" || nextAttempt >= 3) {
  await withDbRetry(() => storage.updateDeliveryLog(entry.id, {
    status: "failed",
    attempt: nextAttempt,
    claimedAt: null,
    response: {
      subscriptionId: sub.id,
      platform,
      error: outcome.error,
      transient: outcome.kind === "transient",
    } as Record<string, unknown>,
  }));
  await finalizeAutomationRetry(sub.id, monitor, platform, outcome);
```
Make terminal retry finalization atomic with the subscription failure update.
This marks the delivery row failed and clears claimedAt before finalizeAutomationRetry() increments/deactivates the subscription. If that finalization storage call fails, the row is no longer recoverable and the subscription failure counter stays stale. Commit the delivery-log terminal status and subscription failure/deactivation side effects in one storage transaction, or leave the row recoverable until finalization succeeds.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/services/scheduler.ts` around lines 478 - 490, The delivery row is
being marked failed via withDbRetry(storage.updateDeliveryLog(..., entry.id))
before calling finalizeAutomationRetry(sub.id,...), which can leave the row
irrecoverable if finalization fails; change this so the terminal delivery-log
update and the subscription failure/deactivation side effects are executed
atomically in one storage transaction (e.g., add a new storage method like
finalizeDeliveryAndSubscriptionFailure that accepts entry.id, sub.id,
outcome/platform/monitor and performs updateDeliveryLog plus the
finalizeAutomationRetry work inside a single DB transaction, or call both
updates inside an existing storage.transaction/withDbRetry wrapper) so that
either both the delivery row and subscription counters are committed or none
are.
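The atomicity argument can be demonstrated without a database. A hedged in-memory model: mutate a copy and commit only at a single point, so a failed finalization leaves the delivery row in its recoverable state. `finalizeDeliveryAndSubscriptionFailure` is the hypothetical method name the comment proposes:

```typescript
// Hedged model of the atomicity concern: a copied object plus a single commit
// point stands in for a DB transaction wrapping both side effects.
type State = { deliveryStatus: "processing" | "failed"; failureCount: number };

function finalizeDeliveryAndSubscriptionFailure(state: State, finalizeThrows: boolean): State {
  const next: State = { ...state, deliveryStatus: "failed" }; // work on a copy
  if (finalizeThrows) throw new Error("finalization failed: nothing committed");
  next.failureCount += 1; // subscription failure bump
  return next; // commit: both effects land together
}

let current: State = { deliveryStatus: "processing", failureCount: 0 };
try {
  current = finalizeDeliveryAndSubscriptionFailure(current, true);
} catch {
  // current is unchanged: the delivery row stays 'processing' and recoverable
}
console.log(current.deliveryStatus, current.failureCount); // processing 0

current = finalizeDeliveryAndSubscriptionFailure(current, false);
console.log(current.deliveryStatus, current.failureCount); // failed 1
```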
- scheduler.ts: waitForActiveChecks now tracks its resolver in activeCheckWaiters; stopScheduler() resolves pending waiters after cancelling the poll timer so shutdown cannot hang.
- automationDelivery.ts: inspect Promise.allSettled results and log rejected delivery tasks with subscription / monitor context instead of swallowing them (failures from the addDeliveryLog fallback or handleDeliveryFailure are no longer silent).
- ensureTables.ts: ensureChannelTables() now returns a boolean and its result is added to criticalSchemaFailures in registerRoutes, so a failed `ALTER TABLE delivery_log ADD COLUMN claimed_at` triggers the /api SCHEMA_NOT_READY gate instead of leaving the API live with broken retry queries.
- ensureTables.ts + shared/schema.ts: new delivery_log_channel_status_claimed_idx (channel, status, claimed_at) so the per-tick stalled-delivery recovery UPDATE is index-backed.
- storage.ts: added the new delivery-log retry primitives (addDeliveryLog, updateDeliveryLog, claimDeliveryForRetry, recoverStalledDeliveries, getPendingAutomationRetries, getAutomationSubscriptionById, and deprecated webhook-specific wrappers) to the IStorage interface so test doubles and alternate implementations conform to the scheduler contract.
- automationDelivery.test.ts: added parametrised 408 / 429 coverage to prove those statuses take the transient-retry path and do not bump consecutive_failures.

https://claude.ai/code/session_016C3aevHcGMqJL7MaEHdqN9
…atch
Phase 5 skeptic discoveries addressed:
browserlessTracker (H1/H2): Same direct-insert pattern resendTracker had —
raw db.insert(errorLogs).values({level:"info",...}) bypassed the dedup
upsert. After the 6h cooldown elapsed on an unresolved alert row, the
next INSERT would hit the partial unique index and throw, silently
swallowing the alert (no email sent during the threshold event).
Converted to ErrorLogger.info, matching the resendTracker fix.
scraper.ts transient console.warn paths (M1/M2): Two paths on this
branch use console.warn for transient failures. console.warn bypasses
ErrorLogger.sanitizeString/sanitizeContext entirely:
- Line ~1460 (save-failure retry): was logging extractedValue/previousValue
which contain user-scraped page content — could include API tokens,
PII, or internal URLs. Moved those fields to the non-transient
ErrorLogger.error branch only; transient stdout log now contains
only triage metadata. Test updated to assert the redaction.
- Line ~1590 (outer-catch transient): was interpolating raw error.message
into the stdout line. undici/fetch errors echo the full requested URL
(userinfo creds, ?api_key=... query). Now uses the classified
logContext only; the ErrorLogger.error branch keeps the full message
since it runs through the sanitizer.
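The kind of URL redaction the sanitizer performs can be sketched like this. This is a hypothetical helper, not the project's actual sanitizeString; it assumes the goal stated above, keeping only scheme and host so userinfo credentials and query-string secrets never reach stdout.

```typescript
// Hypothetical sketch of URL redaction of the kind ErrorLogger's
// sanitizer applies; not the project's actual sanitizeString.
const URL_PATTERN = /https?:\/\/[^\s"'<>]+/g;

function redactUrls(message: string): string {
  return message.replace(URL_PATTERN, (raw) => {
    try {
      const u = new URL(raw);
      // Keep only scheme + host: this drops userinfo credentials,
      // the path, and query-string secrets like ?api_key=...
      return `${u.protocol}//${u.hostname}`;
    } catch {
      return "<invalid-url>";
    }
  });
}
```

Routing the raw error.message through a pass like this (or logging only classified metadata, as the fix does) closes the leak without losing the host needed for triage.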
Count endpoint / UI dropdown mismatch (M4): The nav badge's count
endpoint (GET /api/admin/error-logs/count) counted all unresolved rows
regardless of level, while the UI filter dropdown only offers
"error"/"info" after the warning-level narrowing. Legacy warning rows
inflated the badge but were unreachable from the filter — admin would
see "7 unresolved" then find 4 after clicking through. Narrowed count
query with inArray(level, ["error","info"]) so badge and filter agree.
Browserless-outage stdout flooding (M5): The
"Browserless service unavailable" console.warn used to dedup at DB
level via ErrorLogger.warning's partial-index upsert. After the
warning-level removal, every monitor check during an outage emits a
fresh stdout line — N monitors x M replicas. Added a module-level
per-process throttle (60s cooldown per message key) for the two
infra-wide warnings ("service unavailable" and "circuit breaker open").
Exported _resetInfraWarnThrottleForTests hook; scraper.test.ts
beforeEach resets it so tests observe the first emit deterministically.
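The per-process throttle can be sketched as below. The 60s cooldown matches the description above; the function and variable names are illustrative, not the module's actual exports.

```typescript
// Sketch of a per-process warn throttle: one emit per message key
// per cooldown window. Names are illustrative.
const WARN_COOLDOWN_MS = 60_000;
const lastWarnAt = new Map<string, number>();

function throttledWarn(key: string, message: string, now = Date.now()): boolean {
  const last = lastWarnAt.get(key);
  if (last !== undefined && now - last < WARN_COOLDOWN_MS) {
    return false; // suppressed: still inside the cooldown window
  }
  lastWarnAt.set(key, now);
  console.warn(message);
  return true;
}

// Test hook in the spirit of _resetInfraWarnThrottleForTests:
// clears state so a test's first emit is deterministic.
function resetThrottleForTests(): void {
  lastWarnAt.clear();
}
```

Because the map is module-level, the throttle is per-process: each replica still emits once per window, which bounds the outage flood to M lines per minute instead of N×M.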
Verified #462 fixes survived the revert: SCHEMA_NOT_READY gate,
activeCheckWaiters, claimed_at / claimDeliveryForRetry /
recoverStalledDeliveries all present.
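The claimDeliveryForRetry / recoverStalledDeliveries primitives verified above can be sketched against an in-memory stand-in for delivery_log. The real versions are single SQL `UPDATE ... RETURNING` statements; this sketch only illustrates the optimistic-concurrency rule and the stalled-row cutoff, with assumed field names.

```typescript
// In-memory stand-in for delivery_log rows (field names assumed).
interface DeliveryRow {
  id: number;
  channel: "webhook" | "automation";
  status: "pending" | "processing" | "success" | "failed";
  attempt: number;
  claimedAt: number | null; // epoch ms
}

const rows: DeliveryRow[] = [
  { id: 1, channel: "webhook", status: "pending", attempt: 1, claimedAt: null },
];

// Atomic claim: only one caller can flip pending -> processing for a
// given (id, attempt); a second replica finds no matching row.
function claimDeliveryForRetry(
  id: number,
  expectedAttempt: number,
  channel: string,
  now = Date.now(),
): DeliveryRow | null {
  const row = rows.find(
    (r) => r.id === id && r.channel === channel && r.status === "pending" && r.attempt === expectedAttempt,
  );
  if (!row) return null;
  row.status = "processing";
  row.claimedAt = now;
  return row;
}

// Crashed-replica recovery: rows stuck in processing past the cutoff
// go back to pending so a later tick can pick them up.
function recoverStalledDeliveries(channel: string, olderThanMs: number, now = Date.now()): number {
  let recovered = 0;
  for (const r of rows) {
    if (r.channel === channel && r.status === "processing" && r.claimedAt !== null && now - r.claimedAt > olderThanMs) {
      r.status = "pending";
      r.claimedAt = null;
      recovered++;
    }
  }
  return recovered;
}
```

In SQL the same guarantee comes from the WHERE clause of a single UPDATE, so no explicit row locking is needed across replicas.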
Deferred to Phase 6 bug report (pre-existing / not regressions):
- safeHostname "unknown" fallback ambiguity
- errorLogs.level has no DB CHECK constraint
- Dual warning/info rows during deploy (operator migration note for
PR body: legacy resend_alert_* warning rows should be resolved or
deleted at deploy to avoid apparent duplicate alerts)
All 2475 tests pass; npm run check clean.
…ise (#469)

* Revert "Remove admin error logging system (#461)"

This reverts commit 1593135.

* Remove warning-level from ErrorLogger; keep error/info in Event Log

User intent on #461 was to silence the Browserless warning-noise flood only; reverting #461 brought the Event Log back but also re-enabled warning-level writes that produced the original noise. This drops ErrorLogger.warning() entirely and rewrites every call site to console.warn with the same `[source] message` prefix, so:
- Transient/noisy warnings still surface in server logs for ops triage
- The admin Event Log only shows error- and info-level events
- Dedup/aggregation still applies to the remaining error+info rows

Test mocks updated to spy on console.warn instead of ErrorLogger.warning.

* test: regression guards for ErrorLogger.warning() removal

Adds two assertions to logger.test.ts:
- ErrorLogger.warning is undefined (prevents accidental reintroduction)
- ErrorLogger.info routes only to console.log, never console.warn/.error

* security: redact user URLs in error_logs context; drop warning-level API filter

Phase 2 security findings from /magicwand review:

H1/H2 (high): Replace `url: monitor.url` with `hostname: safeHostname(monitor.url)` at every log site on the branch. Monitor URLs are user-supplied and can carry basic-auth credentials (`https://user:pw@host`) or query-string secrets (`?api_key=…`) that the ErrorLogger's sanitizeString regex list does not catch. This re-applies PR #461's CodeRabbit hardening on top of the Event Log revert. Sites patched:
- server/services/scraper.ts (5 call sites — error, console.warn, circuit-breaker)
- server/services/scheduler.ts (1 — scheduled-check failed)
- server/services/email.ts (1 — notification email failed)

Also drops `rawBrowserlessMsg` from the site-specific console.warn at scraper.ts:1283 — upstream errors can echo back the requested URL with credentials. classifiedReason already captures enough triage detail. Updates the stale comment claiming logger sanitization applies.

M1 (medium): API `level` filter in GET /api/admin/error-logs still accepted `"warning"` after commit 3541155 removed the warning-level write path. Narrows the allowlist (and batch-delete Zod enum) to ["error", "info"] and removes the "Warnings" dropdown option in AdminErrors.tsx.

safeHostname helper inlined in email.ts to avoid pulling the full monitorValidation→storage→schema chain into email.test.ts's drizzle-orm mock.

All 2475 tests pass; npm run check clean.

* refactor: consolidate safeHostname; stop direct warning-level writes

Phase 3 architecture findings:

resendTracker: two direct `db.insert(errorLogs).values({level:"warning",...})` sites bypassed ErrorLogger entirely — producing new warning rows despite the branch's stated "no more warning writes" invariant. Threshold alerts (80/90/95% Resend usage) are legitimately triage-worthy, so route them through ErrorLogger.info instead (preserves admin visibility, sanitization, and dedup-upsert). The in-DB 6-hour cooldown SELECT is retained for rate-limiting.

safeHostname consolidation: extracted the hostname helper to dependency-free server/utils/urlUtils.ts. scraper.ts and email.ts dropped their local copies; scheduler.ts and monitorValidation.ts updated to import from the new location (monitorValidation re-exports for back-compat with any external callers).

Doc and test cleanup:
- shared/schema.ts:93 level comment now flags 'warning' as historical-only
- Removed twelve dead `warning: vi.fn()` mocks from route test files — they masked that the ErrorLogger API no longer exposes .warning()

All 2475 tests pass; npm run check clean.

* chore: code-review cleanup — direct safeHostname imports, valid test fixture

Phase 4 code-review minor findings:
- adminErrorsUtils.test.ts: swap stale `levelFilter: "warning"` fixture for "info" so the test mirrors real UI traffic (the dropdown no longer offers "warning" and the server Zod schema would 400 it).
- monitorValidation.ts: drop the `safeHostname` re-export; routes.ts, routes/extension.ts, routes/v1.ts now import directly from ../utils/urlUtils. extension.test.ts mock moved accordingly. One import path per symbol.

All 2475 tests pass; npm run check clean.

* harden: skeptic-phase fixes to console.warn sanitization + badge mismatch

(Commit body identical to the "Phase 5 skeptic discoveries addressed" message quoted earlier in this thread.)

* fix: safeHostname returns <invalid-url> instead of 'unknown' on parse failure

Phase 6 bug-reporter reclassified the Phase-5 "unknown" fallback as in-scope (urlUtils.ts is new code on this branch). A monitor URL of http://unknown/path would collide with the fallback, making "unknown" ambiguous in triage. Swap to `<invalid-url>` — angle brackets are not valid in a real hostname, so the sentinel is guaranteed distinguishable from parsed hostnames.

* test: add urlUtils.test.ts covering safeHostname contract

New-code coverage gap noted during review: server/utils/urlUtils.ts was added on this branch but had no direct test. Asserts:
- well-formed URL → hostname
- userinfo credentials stripped from output
- query-string secrets not reflected in hostname
- parse failures return the `<invalid-url>` sentinel (guard against regression of the previous "unknown" fallback)
- real host named `unknown` round-trips cleanly (collision guard)

---------

Co-authored-by: Claude <noreply@anthropic.com>
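A minimal implementation satisfying the safeHostname contract above might look like the following. This is a sketch; the actual server/utils/urlUtils.ts may differ in detail.

```typescript
// Sketch of the safeHostname contract described above; the actual
// server/utils/urlUtils.ts implementation may differ.
function safeHostname(rawUrl: string): string {
  try {
    // URL.hostname never includes userinfo credentials or the query
    // string, so secrets in either cannot leak into log context.
    return new URL(rawUrl).hostname;
  } catch {
    // Angle brackets cannot appear in a real hostname, so this
    // sentinel is unambiguous in triage logs.
    return "<invalid-url>";
  }
}
```

The sentinel choice matters: a monitor whose real host is literally `unknown` still round-trips to `unknown`, while only unparseable input produces `<invalid-url>`.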
Summary
Fixes the five open bugs triaged during /magicwand Phase 6: one high-severity data-loss bug in the Zapier/automation delivery path, two medium-severity multi-replica / boot-hardening bugs in the webhook retry cron and the `ensure*` startup sequence, and two low-severity hygiene bugs in scheduler shutdown and webhook-URL backfill encryption. All fixes are in `server/services/*`, `server/storage.ts`, `server/routes.ts`, and `shared/schema.ts`; `npm run check`, `npm run test` (2334/2334), and `npm run build` all pass.

Changes
#456 (high) — Durable retry queue for automation (Zapier REST Hook) deliveries
server/services/automationDelivery.ts: classify delivery failures as transient (5xx / 429 / 408 / network) vs persistent (4xx). Transient failures enqueue a `delivery_log` row (`channel='automation'`, `status='pending'`, `response={subscriptionId, platform, error, transient:true}`) and do not bump `consecutive_failures`. Persistent 4xx keeps the old behaviour (bump counter, potentially auto-deactivate). Adds exported `performAutomationDelivery` + `finalizeAutomationRetry` so the retry cron reuses the exact request/classification logic.

server/services/scheduler.ts: new `*/1 * * * *` automation retry cron mirroring the webhook cron — atomic claim, cumulative backoff (5s / 35s / 155s), 3 attempts, stalled-row recovery. Caps at 10 retries per tick.

server/storage.ts: new `getPendingAutomationRetries`, `getAutomationSubscriptionById` (with hookUrl decryption).

server/services/automationDelivery.test.ts: replaced the two "increments failures on 500 / network" tests with new ones asserting transient failures queue a retry without bumping the counter; persistent 4xx still bumps it.

#457 (medium) — 503 SCHEMA_NOT_READY gate on critical ensure* failures

server/routes.ts: `ensureMonitorHealthColumns`, `ensureMonitorPendingRetryColumn`, `ensureNotificationQueueColumns`, and `ensureAutomationSubscriptionsTable` each contribute to a `criticalSchemaFailures` array. If any fail, log `CRITICAL:` and install a global `app.use("/api", ...)` handler that returns `503 { code: "SCHEMA_NOT_READY" }` for every API route until the server restarts. Avoids the silent per-route 500 storm that the original comment described as "health alerts will be disabled" but was actually a full outage. Chosen over `process.exit(1)` because it doesn't churn every route-test file's `ensureTables` mocks.

#458 (medium) — Atomic cross-replica claim for webhook retries
shared/schema.ts: new `claimed_at` column on `delivery_log` and expanded `status` doc comment (pending/processing/success/failed).

server/services/ensureTables.ts: `ALTER TABLE delivery_log ADD COLUMN IF NOT EXISTS claimed_at TIMESTAMP` (idempotent, added to `ensureChannelTables`).

server/storage.ts: new `claimDeliveryForRetry(id, expectedAttempt, channel)`, an atomic UPDATE using `attempt` as an optimistic concurrency key — if another replica already claimed the row, zero rows match and we return `null`. Companion `recoverStalledDeliveries(channel, olderThan)` re-queues rows stuck in `processing` for >5 minutes (crashed-replica recovery). Legacy `claimWebhookDeliveryForRetry` / `recoverStalledWebhookDeliveries` kept as thin deprecated wrappers.

server/services/scheduler.ts: the webhook retry cron now (1) runs stalled-row recovery at the start of each tick, (2) atomically claims each row before delivering (skips if already claimed), (3) clears `claimed_at` in every final-state update. This prevents duplicate deliveries (the same `X-FTC-Signature-256` + payload sent from two replicas at the same cron tick).

#459 (low) — Cancellable shutdown poll
server/services/scheduler.ts: the `waitForActiveChecks` poll now goes through `trackTimeout` so `stopScheduler()` cancels the recursive 100 ms `setTimeout`, matching every other timer in the file.

#460 (low) — Per-field skip in webhook-URL backfill encryption
server/services/ensureTables.ts: `backfillNotificationChannelWebhookUrls` no longer `continue`s the whole row when one of `config.url` / `config.secret` fails validation. Instead, it sets per-field `urlSkipped` / `secretSkipped` flags so a broken URL doesn't block encryption of a valid `whsec_…` secret (and vice versa).

Test wiring
server/services/ensureTables.test.ts: expect 9 execute() calls (added `ALTER TABLE delivery_log ADD COLUMN claimed_at`); added `claimed_at` to the `delivery_log` column parity list.

server/services/scheduler.test.ts: mock the new storage methods (`claimDeliveryForRetry`, `recoverStalledDeliveries`, `getPendingAutomationRetries`, `getAutomationSubscriptionById`) + mock the `automationDelivery` module. Added a microtask-flush in the "skips webhook cron when previous iteration is still running" test so the assertion runs after the new recovery-sweep step completes and the cron is actually parked on the hanging `getPendingWebhookRetries`.

server/routes.conditions.test.ts: the `ensureMonitorHealthColumns` mock now returns `true` so the 503 gate isn't installed in this mock app.

How to test
Type check + unit tests
npm run check
npm run test

All 2334 tests should pass.
#459 — shutdown poll cancellation
1. `curl` an endpoint that triggers a long-running monitor check so `activeChecks > 0`.
2. `kill -SIGTERM <pid>` and confirm the process exits within `SHUTDOWN_TIMEOUT_MS` without "unhandled error after resolution" warnings in the log.

#460 — per-field skip
1. `INSERT INTO notification_channels (monitor_id, channel, config) VALUES (<id>, 'webhook', '{"url":"not-a-url","secret":"whsec_abc123plaintext","events":[]}'::jsonb);`
2. On boot, the backfill logs `Skipping URL encryption for notification channel <id>: …`.
3. `SELECT config FROM notification_channels WHERE id = <id>;` — `secret` should now be encrypted (no longer the plaintext `whsec_abc123plaintext`); `url` remains plaintext.

#457 — SCHEMA_NOT_READY gate
1. `ALTER TABLE monitors DROP COLUMN health_alert_sent_at;` and revoke `ALTER` privilege on `monitors` from the app role.
2. On boot, the log shows `CRITICAL: required schema missing — monitor health columns. …`.
3. `curl -i http://localhost:5000/api/monitors` — expect `503 {"message":"Schema not ready: …","code":"SCHEMA_NOT_READY"}` instead of a 500.

#458 — cross-replica webhook retry claim
1. `INSERT INTO delivery_log (monitor_id, change_id, channel, status, attempt, created_at) VALUES (<id>, <id>, 'webhook', 'pending', 1, NOW() - INTERVAL '10 seconds');`
2. With two replicas running, one should log `[Webhook] Delivered …`; the other should see 0 rows returned from `claimDeliveryForRetry` and skip. `delivery_log` shows a single final status transition (success/pending/failed), no double-writes.
3. Stalled-row recovery: `UPDATE delivery_log SET status='processing', claimed_at=NOW() - INTERVAL '6 minutes' WHERE id=<id>;`. The next cron tick should log `Recovered 1 stalled delivery row(s)` and re-process.

#456 — automation retry queue
1. `POST /api/zapier/v1/subscribe` pointing to a test endpoint that returns HTTP 503.
2. Trigger a change event (e.g. `POST /api/monitors/:id/check`). Expect `Automation delivery transient failure — queued for retry …` and a new `delivery_log` row with `channel='automation'`, `status='pending'`, `response.transient=true`.
3. `consecutive_failures` is still 0 (no auto-deactivation from one flaky event).
4. Switch the test endpoint to return 200: the retry cron logs `[Automation] Delivered successfully` and the delivery_log row transitions to `status='success'`.
5. Point the subscription at an endpoint returning a 4xx instead: expect `consecutive_failures` to increment and no retry row in `delivery_log`.

https://claude.ai/code/session_016C3aevHcGMqJL7MaEHdqN9
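The #456 transient/persistent split and retry schedule can be sketched as follows. The per-attempt waits of 5s, 30s, and 120s are an assumption chosen to reproduce the stated cumulative 5s / 35s / 155s backoff; the function names are illustrative, not the module's actual exports.

```typescript
// Sketch of the failure classification and retry schedule. The
// per-attempt waits (5s, 30s, 120s) are an assumption that yields
// the cumulative 5s / 35s / 155s backoff described above.
const RETRY_DELAYS_MS = [5_000, 30_000, 120_000];
const MAX_ATTEMPTS = RETRY_DELAYS_MS.length;

// null status = network-level failure (DNS, timeout, connection reset).
function isTransientFailure(httpStatus: number | null): boolean {
  if (httpStatus === null) return true;
  if (httpStatus === 408 || httpStatus === 429) return true;
  return httpStatus >= 500; // all 5xx retryable; other 4xx persistent
}

// Cumulative delay from the original enqueue time for a zero-based
// attempt index; null once the 3 attempts are exhausted.
function nextRetryDelayMs(attempt: number): number | null {
  if (attempt >= MAX_ATTEMPTS) return null;
  return RETRY_DELAYS_MS.slice(0, attempt + 1).reduce((a, b) => a + b, 0);
}
```

Only the persistent branch touches `consecutive_failures`, which is what keeps one flaky upstream event from deactivating a subscription.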