
fix: resolve open bugs #456-#460 across scheduler/retry pipelines#462

Merged
bd73-com merged 2 commits into main from claude/fix-github-issues-Xumuy
Apr 21, 2026

Conversation


@bd73-com bd73-com commented Apr 20, 2026

Summary

Fixes the five open bugs triaged during /magicwand Phase 6: one high-severity data-loss bug in the Zapier/automation delivery path, two medium-severity multi-replica / boot-hardening bugs in the webhook retry cron and the ensure* startup sequence, and two low-severity hygiene bugs in scheduler shutdown and webhook-URL backfill encryption. All fixes live in server/services/*, server/storage.ts, server/routes.ts, and shared/schema.ts; npm run check, npm run test (2334/2334), and npm run build all pass.

Changes

#456 (high) — Durable retry queue for automation (Zapier REST Hook) deliveries

  • server/services/automationDelivery.ts: classify delivery failures as transient (5xx / 429 / 408 / network) vs persistent (4xx). Transient failures enqueue a delivery_log row (channel='automation', status='pending', response={subscriptionId, platform, error, transient:true}) and do not bump consecutive_failures. Persistent 4xx keeps the old behaviour (bump counter, potentially auto-deactivate). Adds exported performAutomationDelivery + finalizeAutomationRetry so the retry cron reuses the exact request/classification logic.
  • server/services/scheduler.ts: new */1 * * * * automation retry cron mirroring the webhook cron — atomic claim, cumulative backoff (5s / 35s / 155s), 3 attempts, stalled-row recovery. Caps at 10 retries per tick.
  • server/storage.ts: new getPendingAutomationRetries, getAutomationSubscriptionById (with hookUrl decryption).
  • server/services/automationDelivery.test.ts: replaced the two "increments failures on 500 / network" tests with new ones asserting transient failures queue a retry without bumping the counter; persistent 4xx still bumps it.
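The transient/persistent split and backoff table above can be sketched roughly like this. This is a minimal illustration of the classification rules as described in the bullets; classifyDeliveryFailure and backoffDelayMs are hypothetical names, not the actual exports of server/services/automationDelivery.ts.

```typescript
// Sketch only: names are illustrative, not the module's real exports.
type DeliveryOutcome =
  | { kind: "success" }
  | { kind: "transient"; statusCode?: number; error: string }
  | { kind: "persistent"; statusCode: number; error: string };

function classifyDeliveryFailure(
  statusCode: number | undefined,
  error: string,
): DeliveryOutcome {
  // Network-level failures carry no HTTP status and are worth retrying.
  if (statusCode === undefined) return { kind: "transient", error };
  // 5xx, 429 (rate limited), and 408 (request timeout) are transient.
  if (statusCode >= 500 || statusCode === 429 || statusCode === 408) {
    return { kind: "transient", statusCode, error };
  }
  // Remaining 4xx responses are persistent: bump consecutive_failures and
  // potentially auto-deactivate the subscription.
  return { kind: "persistent", statusCode, error };
}

// Cumulative backoff table from the description above: retries fire roughly
// 5s, 35s, and 155s after the original attempt, capped at 3 attempts.
function backoffDelayMs(attempt: number): number {
  const steps = [5_000, 35_000, 155_000];
  return steps[Math.min(Math.max(attempt, 1), steps.length) - 1];
}
```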

#457 (medium) — 503 SCHEMA_NOT_READY gate on critical ensure* failures

  • server/routes.ts: ensureMonitorHealthColumns, ensureMonitorPendingRetryColumn, ensureNotificationQueueColumns, and ensureAutomationSubscriptionsTable each contribute to a criticalSchemaFailures array. If any fail, log CRITICAL: and install a global app.use("/api", ...) handler that returns 503 { code: "SCHEMA_NOT_READY" } for every API route until the server restarts. This avoids the silent per-route 500 storm that the original code comment downplayed as "health alerts will be disabled" but which was in fact a full outage.
  • Chose the 503 middleware approach over process.exit(1) because it doesn't churn every route-test file's ensureTables mocks.
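The gate decision itself is simple enough to sketch framework-agnostically. schemaGate below is a hypothetical helper for illustration; the real code installs the equivalent logic directly as an Express app.use("/api", ...) handler in server/routes.ts.

```typescript
// Hypothetical helper modelling the SCHEMA_NOT_READY gate decision.
type GateResponse = {
  status: number;
  body: { message: string; code: string };
} | null;

function schemaGate(criticalSchemaFailures: string[]): GateResponse {
  // No failures: no gate is installed and routes behave normally.
  if (criticalSchemaFailures.length === 0) return null;
  // Any failure: every /api request gets a clear 503 until restart.
  return {
    status: 503,
    body: {
      message: `Schema not ready: ${criticalSchemaFailures.join(", ")}`,
      code: "SCHEMA_NOT_READY",
    },
  };
}
```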

#458 (medium) — Atomic cross-replica claim for webhook retries

  • shared/schema.ts: new claimed_at column on delivery_log and expanded status doc comment (pending / processing / success / failed).
  • server/services/ensureTables.ts: ALTER TABLE delivery_log ADD COLUMN IF NOT EXISTS claimed_at TIMESTAMP (idempotent, added to ensureChannelTables).
  • server/storage.ts: new claimDeliveryForRetry(id, expectedAttempt, channel) atomic UPDATE using attempt as an optimistic concurrency key — if another replica already claimed the row, zero rows match and we return null. Companion recoverStalledDeliveries(channel, olderThan) re-queues rows stuck in processing for >5 minutes (crashed-replica recovery). Legacy claimWebhookDeliveryForRetry / recoverStalledWebhookDeliveries kept as thin deprecated wrappers.
  • server/services/scheduler.ts: webhook retry cron now (1) runs stalled-row recovery at the start of each tick, (2) atomically claims each row before delivering (skips if already claimed), (3) clears claimed_at in every final-state update.
  • Fixes the autoscale double-delivery case described in the issue (same X-FTC-Signature-256 + payload sent from two replicas at the same cron tick).
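The claim semantics can be modelled in memory. claimDeliveryForRetry and recoverStalledDeliveries below mirror the storage method names above, but this is only a sketch of the optimistic-concurrency behaviour (attempt as the concurrency key), not the actual Drizzle SQL in server/storage.ts.

```typescript
// In-memory model of the atomic claim: at most one caller per (id, attempt)
// wins; everyone else matches zero rows and gets null.
type DeliveryRow = {
  id: number;
  channel: string;
  status: "pending" | "processing" | "success" | "failed";
  attempt: number;
  claimedAt: Date | null;
};

function claimDeliveryForRetry(
  rows: DeliveryRow[],
  id: number,
  expectedAttempt: number,
  channel: string,
): DeliveryRow | null {
  const row = rows.find(
    (r) =>
      r.id === id &&
      r.channel === channel &&
      r.status === "pending" &&
      r.attempt === expectedAttempt,
  );
  if (!row) return null; // another replica already claimed or finalized it
  row.status = "processing";
  row.claimedAt = new Date();
  return row;
}

// Crashed-replica recovery: re-queue rows stuck in 'processing' too long.
function recoverStalledDeliveries(
  rows: DeliveryRow[],
  channel: string,
  olderThan: Date,
): number {
  let recovered = 0;
  for (const r of rows) {
    if (r.channel === channel && r.status === "processing" && r.claimedAt !== null && r.claimedAt < olderThan) {
      r.status = "pending";
      r.claimedAt = null;
      recovered++;
    }
  }
  return recovered;
}
```

In the real implementation both operations are single conditional UPDATE statements, so the claim is atomic at the database level rather than in application memory.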

#459 (low) — Cancellable shutdown poll

  • server/services/scheduler.ts: waitForActiveChecks poll now goes through trackTimeout so stopScheduler() cancels the recursive 100 ms setTimeout, matching every other timer in the file.

#460 (low) — Per-field skip in webhook-URL backfill encryption

  • server/services/ensureTables.ts: backfillNotificationChannelWebhookUrls no longer continues the whole row when one of config.url / config.secret fails validation. Instead, it sets per-field urlSkipped / secretSkipped flags so a broken URL doesn't block encryption of a valid whsec_… secret (and vice versa).
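The per-field behaviour can be sketched as below. isValidUrl and encrypt stand in for the real validation and encryption helpers in server/services/ensureTables.ts, and backfillFields is a hypothetical name; only the skip-per-field shape is taken from the description above.

```typescript
// Illustrative per-field backfill for a notification-channel config.
type ChannelConfig = { url?: string; secret?: string };

function backfillFields(
  config: ChannelConfig,
  isValidUrl: (u: string) => boolean,
  encrypt: (v: string) => string,
): { config: ChannelConfig; urlSkipped: boolean; secretSkipped: boolean } {
  const out: ChannelConfig = { ...config };
  let urlSkipped = false;
  let secretSkipped = false;

  // Each field is validated and encrypted independently, so a broken URL no
  // longer blocks encryption of a valid whsec_ secret (and vice versa).
  if (out.url !== undefined) {
    if (isValidUrl(out.url)) out.url = encrypt(out.url);
    else urlSkipped = true; // leave plaintext; caller logs and continues
  }
  if (out.secret !== undefined) {
    if (out.secret.startsWith("whsec_")) out.secret = encrypt(out.secret);
    else secretSkipped = true;
  }
  return { config: out, urlSkipped, secretSkipped };
}
```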

Test wiring

  • server/services/ensureTables.test.ts: expect 9 execute() calls (added ALTER TABLE delivery_log ADD COLUMN claimed_at); added claimed_at to the delivery_log column parity list.
  • server/services/scheduler.test.ts: mock the new storage methods (claimDeliveryForRetry, recoverStalledDeliveries, getPendingAutomationRetries, getAutomationSubscriptionById) + mock automationDelivery module. Added a microtask-flush in the "skips webhook cron when previous iteration is still running" test so the assertion runs after the new recovery-sweep step completes and the cron is actually parked on the hanging getPendingWebhookRetries.
  • server/routes.conditions.test.ts: ensureMonitorHealthColumns mock now returns true so the 503 gate isn't installed in this mock app.

How to test

Type check + unit tests

npm run check
npm run test

All 2334 tests should pass.

#459 — shutdown poll cancellation

  1. Start dev server, curl an endpoint that triggers a long-running monitor check so activeChecks > 0.
  2. kill -SIGTERM <pid> and confirm the process exits within SHUTDOWN_TIMEOUT_MS without "unhandled error after resolution" warnings in the log.

#460 — per-field skip

  1. Seed a row: INSERT INTO notification_channels (monitor_id, channel, config) VALUES (<id>, 'webhook', '{"url":"not-a-url","secret":"whsec_abc123plaintext","events":[]}'::jsonb);
  2. Restart the server. Expect log: Skipping URL encryption for notification channel <id>: ….
  3. SELECT config FROM notification_channels WHERE id = <id>; the secret should now be encrypted (no longer the plaintext whsec_abc123plaintext), while url remains plaintext.

#457 — SCHEMA_NOT_READY gate

  1. Before boot: ALTER TABLE monitors DROP COLUMN health_alert_sent_at; and revoke ALTER privilege on monitors from the app role.
  2. Start the server. Expect log: CRITICAL: required schema missing — monitor health columns. ….
  3. curl -i http://localhost:5000/api/monitors — expect 503 {"message":"Schema not ready: …","code":"SCHEMA_NOT_READY"} instead of a 500.
  4. Re-grant privilege, restart, and confirm normal 200s.

#458 — cross-replica webhook retry claim

  1. Point two local processes at the same Postgres (simulate autoscale).
  2. Insert a pending webhook delivery: INSERT INTO delivery_log (monitor_id, change_id, channel, status, attempt, created_at) VALUES (<id>,<id>,'webhook','pending',1, NOW() - INTERVAL '10 seconds');
  3. Wait for both replicas' cron tick. Exactly one should log [Webhook] Delivered …; the other should see 0 rows returned from claimDeliveryForRetry and skip.
  4. Verify delivery_log shows a single final status transition (success / pending / failed), no double-writes.
  5. Stalled-row recovery: UPDATE delivery_log SET status='processing', claimed_at=NOW() - INTERVAL '6 minutes' WHERE id=<id>;. Next cron tick should log Recovered 1 stalled delivery row(s) and re-process.

#456 — automation retry queue

  1. Subscribe a Zapier-style hook via POST /api/zapier/v1/subscribe pointing to a test endpoint that returns HTTP 503.
  2. Trigger a change on the monitored URL (or POST /api/monitors/:id/check).
  3. Expect log: Automation delivery transient failure — queued for retry … and a new delivery_log row with channel='automation', status='pending', response.transient=true.
  4. Verify the subscription's consecutive_failures is still 0 (no auto-deactivation from one flaky event).
  5. Switch the endpoint to return HTTP 200. Wait for the next automation-retry cron tick. Expect [Automation] Delivered successfully and the delivery_log row transitions to status='success'.
  6. For persistent failure: switch the endpoint to HTTP 404 before triggering a change. Expect consecutive_failures to increment and no retry row in delivery_log.

https://claude.ai/code/session_016C3aevHcGMqJL7MaEHdqN9

Summary by CodeRabbit

  • New Features

    • API now returns a clear 503 "Schema not ready" response when required database schema pieces are missing, preventing partial service behavior.
    • Automation deliveries gain retry handling with durable delivery logs and improved retry scheduling.
  • Bug Fixes

    • Added recovery for stalled webhook and automation delivery attempts to resume pending work.
    • Improved distinction between transient and persistent delivery failures for more reliable retry and deactivation behavior.

- #459 (low): waitForActiveChecks poll uses trackTimeout so the 100ms
  recursive timer is cancelled by stopScheduler(), eliminating the
  leaked handles that surfaced during graceful shutdown.
- #460 (low): webhook-URL backfill now skips only the corrupted field
  (url OR secret) instead of the whole row, so a malformed url no
  longer blocks encryption of a valid whsec_ secret (and vice versa).
- #457 (med): if any Drizzle-referenced schema (monitor health cols,
  pending_retry_at, notification_queue cols, automation_subscriptions)
  fails to provision, install a global /api 503 SCHEMA_NOT_READY gate
  so clients get a clear signal instead of silent 500s per route.
- #458 (med): webhook retries are now atomically claimed via
  UPDATE ... SET status='processing', claimed_at=NOW() WHERE id=? AND
  attempt=? RETURNING, so replicas under Replit autoscale never
  double-deliver. A new claimed_at column enables stalled-row recovery
  (rows stuck >5min re-queued to 'pending'). Same infra generalised
  via storage.claimDeliveryForRetry / recoverStalledDeliveries.
- #456 (high): automation (Zapier REST Hook) deliveries now classify
  transient (5xx/429/408/network) vs persistent (4xx) failures. On
  transient failure we enqueue a delivery_log row and a new automation
  retry cron re-attempts up to 3 times with cumulative backoff. Only
  persistent failures count toward consecutive_failures / auto-deact,
  so a flaky upstream can't silently kill a subscription after one
  dropped event.


coderabbitai Bot commented Apr 20, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough


Adds delivery-retry and recovery for automation/webhook channels: a nullable claimedAt column and index, storage APIs for claiming/recovering delivery_log rows, scheduler loops for claiming and retrying deliveries (including automation cron), automation delivery outcome classification and enqueueing of transient failures, and a startup middleware that returns 503 if critical schema readiness checks fail.

Changes

  • Schema (shared/schema.ts): Add claimedAt (claimed_at) nullable timestamp to delivery_log, add composite index on (channel, status, claimed_at), extend documented enums to include automation and processing.
  • Ensure tables & backfill (server/services/ensureTables.ts, server/services/ensureTables.test.ts): DDL adds delivery_log.claimed_at and index; ensureChannelTables() now returns Promise<boolean> (true on success, false on failure); backfill logic changed to independently evaluate/encrypt URL and secret per field. Tests updated to reflect the new DDL and return-value expectations.
  • Automation delivery (server/services/automationDelivery.ts, server/services/automationDelivery.test.ts): Introduce AutomationDeliveryOutcome type, add performAutomationDelivery() and finalizeAutomationRetry(); refactor deliverToAutomationSubscriptions() to classify outcomes (success/transient/persistent), enqueue transient failures via storage.addDeliveryLog, and handle failures/fallbacks accordingly. Tests updated to assert pending delivery_log entries for transient failures and adjusted deactivation/threshold expectations.
  • Scheduler & retry loops (server/services/scheduler.ts, server/services/scheduler.test.ts): Use trackTimeout + activeCheckWaiters for shutdown; add stalled-recovery (recoverStalledDeliveries/deprecated webhook aliases) and atomic claim APIs (claimDeliveryForRetry) to the webhook retry flow; add a new automation retry cron mirroring the webhook flow (claim → perform → finalize); clear claimedAt on status updates; tests expanded with additional mocks and relaxed assertions.
  • Storage layer (server/storage.ts): Expand IStorage/DatabaseStorage with delivery-log APIs: addDeliveryLog, claimDeliveryForRetry, recoverStalledDeliveries, getPending*Retries, getAutomationSubscriptionById (with decryption warning behavior); updateDeliveryLog accepts claimedAt updates; deprecated webhook-specific aliases added.
  • Routes / startup gating (server/routes.ts, server/routes.conditions.test.ts): Aggregate critical schema readiness failures and install global "/api" middleware that returns HTTP 503 { message: "Schema not ready: ...", code: "SCHEMA_NOT_READY" } if any critical checks fail. Test mock adjusted for ensureMonitorHealthColumns to resolve true.

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler
    participant Storage
    participant Automation as Automation API
    participant DB as Database

    Scheduler->>Storage: recoverStalledDeliveries("automation", cutoff)
    Storage->>DB: UPDATE delivery_log SET status='pending', claimed_at=NULL WHERE status='processing' AND claimed_at < cutoff
    DB-->>Storage: recovered count

    Scheduler->>Storage: getPendingAutomationRetries()
    Storage->>DB: SELECT * FROM delivery_log WHERE channel='automation' AND status='pending' AND attempt < max
    DB-->>Storage: pending rows
    Storage-->>Scheduler: pending entries

    Scheduler->>Storage: claimDeliveryForRetry(id, expectedAttempt, "automation")
    Storage->>DB: UPDATE ... SET status='processing', claimed_at=NOW() WHERE id=? AND status='pending' AND attempt=?
    DB-->>Storage: claimed entry or null
    alt claimed
        Scheduler->>Automation: performAutomationDelivery(hookUrl, payload)
        Automation-->>Scheduler: {kind: success|transient|persistent, statusCode?, error?}
        alt success
            Scheduler->>Storage: updateDeliveryLog(id, {status:'success', deliveredAt:NOW(), claimedAt:null})
            Scheduler->>Storage: touchAndResetAutomationSubscription(...)
        else transient
            Scheduler->>Storage: addDeliveryLog({...status:'pending', attempt:1, response:{transient:true,...}})
            Scheduler->>Storage: updateDeliveryLog(id, {status:'failed' or other, claimedAt:null})
        else persistent
            Scheduler->>Storage: handleDeliveryFailure(...) (increment failures/deactivate)
            Scheduler->>Storage: updateDeliveryLog(id, {claimedAt:null})
        end
    else not claimed
        Note over Scheduler: skip (another replica claimed it)
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Security & Correctness Observations

  • ⚠️ Critical: If storage.addDeliveryLog() fails while handling a transient delivery outcome, current code falls back to handleDeliveryFailure() which treats it like a delivery failure. That can mask storage-layer outages as persistent delivery failures — log and alert explicitly on addDeliveryLog errors before fallback to avoid silent loss of retryability.
  • ⚠️ Concurrency: The optimistic-claim predicate (id, channel, status='pending', attempt=expectedAttempt) depends on consistent attempt increment semantics across paths. Verify attempt increments and finalization paths to avoid duplicate processing or lost retries.
  • ✓ Data leakage: Error strings are redacted for URLs in transient outcomes per tests; ensure redaction covers all fetch/network error paths.
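For the first observation, a minimal sketch of making the fallback explicit, under the assumption that the caller wants a distinguishable log line before degrading; all names here are illustrative, not the PR's actual code.

```typescript
// Hypothetical guard: if addDeliveryLog itself fails (a storage problem),
// log it loudly before falling back to handleDeliveryFailure, so a DB
// outage is not misread as a persistent delivery failure.
async function enqueueTransientRetry(
  addDeliveryLog: (row: Record<string, unknown>) => Promise<void>,
  handleDeliveryFailure: () => Promise<void>,
  ctx: { subscriptionId: number; monitorId: number },
): Promise<void> {
  try {
    await addDeliveryLog({ ...ctx, channel: "automation", status: "pending" });
  } catch (err) {
    // Distinguishable log line: storage outage, not an upstream 4xx/5xx.
    console.error(
      `addDeliveryLog failed for subscription ${ctx.subscriptionId}, monitor ${ctx.monitorId}; falling back to failure handling`,
      err,
    );
    await handleDeliveryFailure();
  }
}
```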
🚥 Pre-merge checks: ✅ 5 passed
  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'fix: resolve open bugs #456-#460 across scheduler/retry pipelines' directly matches the PR's primary objective of addressing five identified issues across automation delivery, scheduler retry logic, and related infrastructure.
  • Docstring Coverage (✅ Passed): Docstring coverage is 83.33%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.


@github-actions github-actions Bot added the fix label Apr 20, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
shared/schema.ts (1)

327-331: 🧹 Nitpick | 🔵 Trivial

Index claimed_at for stalled delivery recovery.

The retry recovery path filters by channel, status, and stale claimedAt every minute, but the current index uses createdAt instead. Add a matching index here and in ensureChannelTables() before delivery_log grows enough for recovery scans to become expensive.

Suggested index addition
 }, (table) => ({
   monitorCreatedIdx: index("delivery_log_monitor_created_idx").on(table.monitorId, table.createdAt),
   channelStatusAttemptIdx: index("delivery_log_channel_status_attempt_idx").on(table.channel, table.status, table.createdAt, table.attempt),
+  channelStatusClaimedIdx: index("delivery_log_channel_status_claimed_idx").on(table.channel, table.status, table.claimedAt),
 }));

Also mirror this in server/services/ensureTables.ts:

 await db.execute(sql`CREATE INDEX IF NOT EXISTS delivery_log_channel_status_attempt_idx ON delivery_log(channel, status, created_at, attempt)`);
+await db.execute(sql`CREATE INDEX IF NOT EXISTS delivery_log_channel_status_claimed_idx ON delivery_log(channel, status, claimed_at)`);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/schema.ts` around lines 327 - 331, The delivery retry recovery queries
filter by channel, status and stale claimedAt but the schema only indexes
createdAt; add a new index on the delivery_log table covering (channel, status,
claimedAt) so recovery scans use the index (in the schema where claimedAt is
defined alongside monitorCreatedIdx and channelStatusAttemptIdx create e.g. a
channelStatusClaimedIdx index on table.channel, table.status, table.claimedAt),
and make the same change inside the ensureChannelTables()/ensureTables.ts logic
that creates/ensures delivery_log so the physical DB gets the identical index
before the table grows large.
server/storage.ts (1)

10-76: 🛠️ Refactor suggestion | 🟠 Major

Add the new retry primitives to IStorage.

The new retry APIs only exist on DatabaseStorage. Any test double or alternate implementation typed as IStorage will miss the scheduler’s new contract. Add the new delivery-log methods to the interface so storage remains a complete boundary.

Proposed interface additions
   cleanupStaleAutomationSubscriptions(olderThan: Date): Promise<number>;
   getRecentChangesForMonitors(monitorIds: number[], limit: number): Promise<MonitorChange[]>;
+
+  // Delivery log retry primitives
+  updateDeliveryLog(
+    id: number,
+    updates: Partial<Pick<DeliveryLogEntry, "status" | "attempt" | "response" | "deliveredAt" | "claimedAt">>,
+  ): Promise<void>;
+  getPendingWebhookRetries(): Promise<DeliveryLogEntry[]>;
+  claimDeliveryForRetry(id: number, expectedAttempt: number, channel: string): Promise<DeliveryLogEntry | null>;
+  claimWebhookDeliveryForRetry(id: number, expectedAttempt: number): Promise<DeliveryLogEntry | null>;
+  recoverStalledDeliveries(channel: string, olderThan: Date): Promise<number>;
+  recoverStalledWebhookDeliveries(olderThan: Date): Promise<number>;
+  getPendingAutomationRetries(): Promise<DeliveryLogEntry[]>;
+  getAutomationSubscriptionById(id: number): Promise<AutomationSubscription | undefined>;
 }

As per coding guidelines, “Never put database queries or Drizzle ORM calls directly in route handlers — all database access must go through methods on the IStorage interface implemented in server/storage.ts.”

Also applies to: 443-524

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/storage.ts` around lines 10 - 76, The IStorage interface is missing
the new retry/delivery-log primitives implemented on DatabaseStorage; add the
same method signatures to IStorage so all storage implementations conform to the
scheduler contract. Specifically, in the IStorage declaration add the
delivery-retry methods (for example) createDeliveryLog(monitorId: number,
payload: string, nextAttemptAt: Date): Promise<DeliveryLog>, getDeliveryLog(id:
number): Promise<DeliveryLog | undefined>, listPendingDeliveryLogs(limit:
number): Promise<DeliveryLog[]>, incrementDeliveryAttempt(id: number,
nextAttemptAt: Date): Promise<number>, markDeliveryDelivered(id: number):
Promise<void>, and deleteDeliveryLog(id: number): Promise<void> (or the exact
method names/signatures used on DatabaseStorage) so DatabaseStorage and any test
doubles satisfy the new scheduler contract.
server/services/scheduler.ts (1)

73-85: ⚠️ Potential issue | 🟠 Major

Resolve active-check waiters when stopping the scheduler.

Line 80 now tracks the shutdown polling timer, but stopScheduler() clears tracked timers at Lines 584-585. If activeChecks > 0, the next poll is cancelled and the waitForActiveChecks() promise never reaches either the active-check condition or its timeout. That can hang shutdown despite the documented timeout guarantee.

Proposed fix
 const cronTasks: ReturnType<typeof cron.schedule>[] = [];
 const pendingTimeouts = new Set<ReturnType<typeof setTimeout>>();
+const activeCheckWaiters = new Set<() => void>();
@@
 export function waitForActiveChecks(timeoutMs: number): Promise<void> {
   return new Promise((resolve) => {
+    let settled = false;
+    const finish = () => {
+      if (settled) return;
+      settled = true;
+      activeCheckWaiters.delete(finish);
+      resolve();
+    };
+    activeCheckWaiters.add(finish);
     const start = Date.now();
     const check = () => {
       if (activeChecks === 0 || Date.now() - start > timeoutMs) {
-        resolve();
+        finish();
       } else {
         trackTimeout(check, 100);
       }
@@
   pendingTimeouts.forEach((handle) => { clearTimeout(handle); });
   pendingTimeouts.clear();
+  for (const finish of activeCheckWaiters) finish();
+  activeCheckWaiters.clear();
   retryBackoff.clear();

Also applies to: 579-585

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/scheduler.ts` around lines 73 - 85, The waitForActiveChecks
Promise can hang when stopScheduler clears tracked timers; to fix, add a small
shutdown-waiter mechanism: in waitForActiveChecks (function waitForActiveChecks)
push the Promise's resolve into a global array (e.g., shutdownWaiters) in
addition to scheduling poll timers with trackTimeout, and when stopScheduler
runs (function stopScheduler) iterate shutdownWaiters and call each resolver
before/while clearing tracked timers so any outstanding waiters are unblocked;
ensure you remove the resolver from shutdownWaiters when the promise resolves
normally and keep using activeChecks, trackTimeout, and the existing timeout
logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/routes.ts`:
- Around line 79-91: The ensureChannelTables() function currently swallows
errors during its bootstrap (so an ALTER that fails for delivery_log.claimed_at
can leave the API live but broken); change ensureChannelTables() to return a
boolean: true on success and false from its catch block (or rethrow and let
caller handle), then in the routes initialization replace the current
fire-and-forget call with const channelsReady = await ensureChannelTables(); and
if (!channelsReady) push "channel tables" into criticalSchemaFailures so the
readiness gate will block traffic until the schema fix is applied; reference
ensureChannelTables(), delivery_log.claimed_at, and criticalSchemaFailures when
making the changes.

In `@server/services/automationDelivery.test.ts`:
- Around line 145-173: Add tests in server/services/automationDelivery.test.ts
that mirror the existing "queues durable retry" cases for 500 and network errors
but use HTTP 429 and 408 responses: call deliverToAutomationSubscriptions with
mockGetActiveAutomationSubscriptions returning a subscription and
mockSsrfSafeFetch resolved with { ok: false, status: 429 } and then with status:
408; in each case assert mockAddDeliveryLog was called with an objectContaining
channel: "automation", status: "pending", response containing transient: true
(and include subscriptionId/platform when applicable) and assert
mockIncrementAutomationSubscriptionFailures and
mockTouchAndResetAutomationSubscription were not called. Use the same helper
functions (makeSub, makeMonitor, makeChange) and matchers as the existing tests
to keep style consistent.

In `@server/services/automationDelivery.ts`:
- Around line 71-116: The code currently uses Promise.allSettled on the array
deliveries (created around performAutomationDelivery, storage.addDeliveryLog and
handleDeliveryFailure) but ignores rejected results, allowing failures from
addDeliveryLog fallback or handleDeliveryFailure to be swallowed; change the
completion handling to inspect the results: await Promise.allSettled(deliveries)
then iterate the settled results and for any result.status === "rejected" log a
clear error (including the subscription id, monitor.id and the rejection reason)
and rethrow or return a non-zero failure (e.g., throw the first rejection) so
the caller sees the failure; alternatively use await Promise.all(deliveries) if
you want immediate bubbling of the first rejection—ensure logging uses the same
contextual fields passed to storage.addDeliveryLog/handleDeliveryFailure
(subscriptionId/sub.id, monitorId/monitor.id, platform/sub.platform) so failures
aren’t silent.

In `@server/services/scheduler.test.ts`:
- Around line 36-43: The test currently mocks automation retry hooks but never
exercises the automation retry loop because getPendingAutomationRetries returns
an empty array; update the test to return one or more pending retry records from
getPendingAutomationRetries and add assertions that claimDeliveryForRetry is
invoked with channel="automation", that when claimDeliveryForRetry returns null
the delivery is skipped, that when getAutomationSubscriptionById returns
undefined the code clears claimedAt on the pending retry record, and that a
successful delivery path invokes the delivery finalizer (the function your code
uses to finalize deliveries); also add focused assertions for edge cases (empty
arrays, null/undefined fields), error paths (invalid IDs, failed claim), and a
security/ownership case to satisfy the guidelines.

In `@server/services/scheduler.ts`:
- Around line 478-490: The delivery row is being marked failed via
withDbRetry(storage.updateDeliveryLog(..., entry.id)) before calling
finalizeAutomationRetry(sub.id,...), which can leave the row irrecoverable if
finalization fails; change this so the terminal delivery-log update and the
subscription failure/deactivation side effects are executed atomically in one
storage transaction (e.g., add a new storage method like
finalizeDeliveryAndSubscriptionFailure that accepts entry.id, sub.id,
outcome/platform/monitor and performs updateDeliveryLog plus the
finalizeAutomationRetry work inside a single DB transaction, or call both
updates inside an existing storage.transaction/withDbRetry wrapper) so that
either both the delivery row and subscription counters are committed or none
are.

---

Outside diff comments:
In `@server/services/scheduler.ts`:
- Around line 73-85: The waitForActiveChecks Promise can hang when stopScheduler
clears tracked timers; to fix, add a small shutdown-waiter mechanism: in
waitForActiveChecks (function waitForActiveChecks) push the Promise's resolve
into a global array (e.g., shutdownWaiters) in addition to scheduling poll
timers with trackTimeout, and when stopScheduler runs (function stopScheduler)
iterate shutdownWaiters and call each resolver before/while clearing tracked
timers so any outstanding waiters are unblocked; ensure you remove the resolver
from shutdownWaiters when the promise resolves normally and keep using
activeChecks, trackTimeout, and the existing timeout logic unchanged.

In `@server/storage.ts`:
- Around line 10-76: The IStorage interface is missing the new
retry/delivery-log primitives implemented on DatabaseStorage; add the same
method signatures to IStorage so all storage implementations conform to the
scheduler contract. Specifically, in the IStorage declaration add the
delivery-retry methods (for example) createDeliveryLog(monitorId: number,
payload: string, nextAttemptAt: Date): Promise<DeliveryLog>, getDeliveryLog(id:
number): Promise<DeliveryLog | undefined>, listPendingDeliveryLogs(limit:
number): Promise<DeliveryLog[]>, incrementDeliveryAttempt(id: number,
nextAttemptAt: Date): Promise<number>, markDeliveryDelivered(id: number):
Promise<void>, and deleteDeliveryLog(id: number): Promise<void> (or the exact
method names/signatures used on DatabaseStorage) so DatabaseStorage and any test
doubles satisfy the new scheduler contract.

In `@shared/schema.ts`:
- Around line 327-331: The delivery retry recovery queries filter by channel,
status and stale claimedAt but the schema only indexes createdAt; add a new
index on the delivery_log table covering (channel, status, claimedAt) so
recovery scans use the index (in the schema where claimedAt is defined alongside
monitorCreatedIdx and channelStatusAttemptIdx create e.g. a
channelStatusClaimedIdx index on table.channel, table.status, table.claimedAt),
and make the same change inside the ensureChannelTables()/ensureTables.ts logic
that creates/ensures delivery_log so the physical DB gets the identical index
before the table grows large.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6ab6ad94-bba6-42ea-8f7f-f1f2676307d0

📥 Commits

Reviewing files that changed from the base of the PR and between 1593135 and d582fc0.

📒 Files selected for processing (10)
  • server/routes.conditions.test.ts
  • server/routes.ts
  • server/services/automationDelivery.test.ts
  • server/services/automationDelivery.ts
  • server/services/ensureTables.test.ts
  • server/services/ensureTables.ts
  • server/services/scheduler.test.ts
  • server/services/scheduler.ts
  • server/storage.ts
  • shared/schema.ts

Comment thread server/routes.ts
Comment on lines 79 to +91
  const apiKeysReady = await ensureApiKeysTable();
  await ensureChannelTables();
  await ensureMonitorChangesIndexes();
  await ensureCampaignPartialIndexes();
  const notificationQueueReady = await ensureNotificationQueueColumns();
  if (!notificationQueueReady) {
    console.error("CRITICAL: notification_queue columns missing — notification cron queries will fail");
    criticalSchemaFailures.push("notification_queue columns");
  }
  await ensureTagTables();
  const automationSubsReady = await ensureAutomationSubscriptionsTable();
  if (!automationSubsReady) {
    console.error("CRITICAL: automation_subscriptions table missing — Zapier endpoints and automation delivery will fail");
    criticalSchemaFailures.push("automation_subscriptions table");
  }

⚠️ Potential issue | 🟠 Major

Gate traffic when delivery_log.claimed_at bootstrap fails.

The new schema-not-ready gate checks boolean ensure calls, but ensureChannelTables() still swallows failures. If the new claimed_at ALTER fails, retry/delivery-log code can hit `column "claimed_at" does not exist` while /api stays live.

Suggested readiness wiring
   const apiKeysReady = await ensureApiKeysTable();
-  await ensureChannelTables();
+  const channelTablesReady = await ensureChannelTables();
+  if (!channelTablesReady) {
+    criticalSchemaFailures.push("notification channel / delivery_log columns");
+  }
   await ensureMonitorChangesIndexes();

And change ensureChannelTables() to return true on success and false in its catch block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/routes.ts` around lines 79 - 91, The ensureChannelTables() function
currently swallows errors during its bootstrap (so an ALTER that fails for
delivery_log.claimed_at can leave the API live but broken); change
ensureChannelTables() to return a boolean: true on success and false from its
catch block (or rethrow and let caller handle), then in the routes
initialization replace the current fire-and-forget call with const channelsReady
= await ensureChannelTables(); and if (!channelsReady) push "channel tables"
into criticalSchemaFailures so the readiness gate will block traffic until the
schema fix is applied; reference ensureChannelTables(), delivery_log.claimed_at,
and criticalSchemaFailures when making the changes.

Comment thread server/services/automationDelivery.test.ts
Comment thread server/services/automationDelivery.ts Outdated
Comment on lines 71 to 116
 const deliveries = subscriptions.map(async (sub) => {
-  try {
-    // Automation hook URLs (e.g. Zapier) are unguessable bearer tokens.
-    // No HMAC signature header — the hookUrl itself authenticates the request.
-    // Webhook channels use X-FTC-Signature-256 with a per-channel secret.
-    const response = await ssrfSafeFetch(sub.hookUrl, {
-      method: "POST",
-      headers: {
-        "Content-Type": "application/json",
-        "User-Agent": "FetchTheChange-Zapier/1.0",
-      },
-      body,
-      signal: AbortSignal.timeout(5000),
-    });
+  const outcome = await performAutomationDelivery(sub.hookUrl, body);
 
-    if (response.ok) {
-      console.log(`[Automation] Delivered successfully (monitorId=${monitor.id}, platform=${sub.platform}, subId=${sub.id}, status=${response.status})`);
-      // Atomic touch + failure counter reset in a single UPDATE
-      storage.touchAndResetAutomationSubscription(sub.id).catch((err) => {
-        console.warn(`[Automation] Failed to reset failure counter for subscription ${sub.id}`, err);
-      });
-    } else {
-      await handleDeliveryFailure(sub.id, monitor, sub.platform, `HTTP ${response.status}`);
-    }
-  } catch (err) {
-    // Sanitize error to avoid leaking hookUrl secrets in logs
-    const rawMsg = err instanceof Error ? err.message : String(err);
-    const safeError = rawMsg.replace(/https?:\/\/[^\s)]+/g, "[redacted-url]");
-    await handleDeliveryFailure(sub.id, monitor, sub.platform, safeError);
-  }
+  if (outcome.kind === "success") {
+    console.log(`[Automation] Delivered successfully (monitorId=${monitor.id}, platform=${sub.platform}, subId=${sub.id}, status=${outcome.statusCode})`);
+    storage.touchAndResetAutomationSubscription(sub.id).catch((err) => {
+      console.warn(`[Automation] Failed to reset failure counter for subscription ${sub.id}`, err);
+    });
+    return;
+  }
+
+  if (outcome.kind === "transient") {
+    // Enqueue a durable retry — cron will re-attempt up to 3 times total.
+    try {
+      await storage.addDeliveryLog({
+        monitorId: monitor.id,
+        changeId: change.id,
+        channel: "automation",
+        status: "pending",
+        attempt: 1,
+        response: {
+          subscriptionId: sub.id,
+          platform: sub.platform,
+          error: outcome.error,
+          transient: true,
+        },
+      });
+      console.warn(`[scheduler] Automation delivery transient failure — queued for retry for monitor "${monitor.name}"`, {
+        subscriptionId: sub.id,
+        monitorId: monitor.id,
+        platform: sub.platform,
+        error: outcome.error,
+      });
+    } catch (logErr) {
+      // If we can't enqueue the retry, fall back to the old behavior to avoid losing all signal.
+      console.error(`[Automation] Failed to enqueue retry delivery_log row — falling back to failure counter bump`, logErr);
+      await handleDeliveryFailure(sub.id, monitor, sub.platform, outcome.error);
+    }
+    return;
+  }
+
+  // Persistent (4xx) — no retry; bump failure counter as before.
+  await handleDeliveryFailure(sub.id, monitor, sub.platform, outcome.error);
 });

await Promise.allSettled(deliveries);

⚠️ Potential issue | 🟠 Major

Log rejected delivery tasks instead of swallowing them.

Promise.allSettled() prevents rethrowing, but the results are ignored. Any rejection from addDeliveryLog() fallback or handleDeliveryFailure() disappears, so failure counters/deactivation can fail silently.

Proposed fix
-  await Promise.allSettled(deliveries);
+  const results = await Promise.allSettled(deliveries);
+  for (const [index, result] of results.entries()) {
+    if (result.status === "rejected") {
+      const sub = subscriptions[index];
+      console.error("[Automation] Delivery handling failed", {
+        subscriptionId: sub?.id,
+        monitorId: monitor.id,
+        platform: sub?.platform,
+        errorMessage: result.reason instanceof Error ? result.reason.message : String(result.reason),
+      });
+    }
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/automationDelivery.ts` around lines 71 - 116, The code
currently uses Promise.allSettled on the array deliveries (created around
performAutomationDelivery, storage.addDeliveryLog and handleDeliveryFailure) but
ignores rejected results, allowing failures from addDeliveryLog fallback or
handleDeliveryFailure to be swallowed; change the completion handling to inspect
the results: await Promise.allSettled(deliveries) then iterate the settled
results and for any result.status === "rejected" log a clear error (including
the subscription id, monitor.id and the rejection reason) and rethrow or return
a non-zero failure (e.g., throw the first rejection) so the caller sees the
failure; alternatively use await Promise.all(deliveries) if you want immediate
bubbling of the first rejection—ensure logging uses the same contextual fields
passed to storage.addDeliveryLog/handleDeliveryFailure (subscriptionId/sub.id,
monitorId/monitor.id, platform/sub.platform) so failures aren’t silent.
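Stripped of the surrounding service, the settled-results pattern the comment describes can be sketched standalone (Sub and runDeliveries are illustrative stand-ins, not the module's real types):

```typescript
// Inspect Promise.allSettled results and surface rejected delivery tasks with
// their subscription context instead of swallowing them. Returns the error
// summaries so the caller can see (or rethrow on) failures.
type Sub = { id: number; platform: string };

async function runDeliveries(
  subscriptions: Sub[],
  monitorId: number,
  deliver: (sub: Sub) => Promise<void>,
): Promise<string[]> {
  const errors: string[] = [];
  const results = await Promise.allSettled(subscriptions.map((sub) => deliver(sub)));
  for (const [index, result] of results.entries()) {
    if (result.status === "rejected") {
      const sub = subscriptions[index];
      const msg = result.reason instanceof Error ? result.reason.message : String(result.reason);
      errors.push(`subId=${sub?.id} monitorId=${monitorId} platform=${sub?.platform}: ${msg}`);
      console.error("[Automation] Delivery handling failed", errors[errors.length - 1]);
    }
  }
  return errors;
}
```

Unlike `Promise.all`, this keeps one subscription's failure from aborting the other deliveries while still leaving an auditable trail per rejection.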

Comment on lines +36 to +43
claimDeliveryForRetry: vi.fn().mockImplementation(async (id: number, attempt: number, channel: string) => ({
id, attempt, channel, status: "processing", createdAt: new Date(),
monitorId: id, changeId: id, response: null, deliveredAt: null, claimedAt: new Date(),
})),
recoverStalledWebhookDeliveries: vi.fn().mockResolvedValue(0),
recoverStalledDeliveries: vi.fn().mockResolvedValue(0),
getPendingAutomationRetries: vi.fn().mockResolvedValue([]),
getAutomationSubscriptionById: vi.fn().mockResolvedValue(undefined),

⚠️ Potential issue | 🟡 Minor

Add assertions for the automation retry cron path.

These mocks wire the new automation retry dependencies, but the default getPendingAutomationRetries: [] means no test exercises the loop. Add coverage for: due retry is claimed with channel="automation", claim miss skips delivery, missing subscriptionId/subscription clears claimedAt, and successful delivery calls the finalizer.

As per coding guidelines, Tests must cover edge cases and error paths - include assertions for edge cases (empty inputs, boundary values, null/undefined), error paths (invalid input, unauthorized access, not found), and security-relevant scenarios (ownership violations, SSRF attempts).

Also applies to: 68-70

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/scheduler.test.ts` around lines 36 - 43, The test currently
mocks automation retry hooks but never exercises the automation retry loop
because getPendingAutomationRetries returns an empty array; update the test to
return one or more pending retry records from getPendingAutomationRetries and
add assertions that claimDeliveryForRetry is invoked with channel="automation",
that when claimDeliveryForRetry returns null the delivery is skipped, that when
getAutomationSubscriptionById returns undefined the code clears claimedAt on the
pending retry record, and that a successful delivery path invokes the delivery
finalizer (the function your code uses to finalize deliveries); also add focused
assertions for edge cases (empty arrays, null/undefined fields), error paths
(invalid IDs, failed claim), and a security/ownership case to satisfy the
guidelines.

Comment on lines +478 to +490
} else if (outcome.kind === "persistent" || nextAttempt >= 3) {
await withDbRetry(() => storage.updateDeliveryLog(entry.id, {
status: "failed",
attempt: nextAttempt,
claimedAt: null,
response: {
subscriptionId: sub.id,
platform,
error: outcome.error,
transient: outcome.kind === "transient",
} as Record<string, unknown>,
}));
await finalizeAutomationRetry(sub.id, monitor, platform, outcome);

⚠️ Potential issue | 🟠 Major

Make terminal retry finalization atomic with the subscription failure update.

This marks the delivery row as failed and clears claimedAt before finalizeAutomationRetry() increments/deactivates the subscription. If that finalization storage call fails, the row is no longer recoverable and the subscription failure counter stays stale. Commit the delivery-log terminal status and the subscription failure/deactivation side effects in one storage transaction, or leave the row recoverable until finalization succeeds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/scheduler.ts` around lines 478 - 490, The delivery row is
being marked failed via withDbRetry(storage.updateDeliveryLog(..., entry.id))
before calling finalizeAutomationRetry(sub.id,...), which can leave the row
irrecoverable if finalization fails; change this so the terminal delivery-log
update and the subscription failure/deactivation side effects are executed
atomically in one storage transaction (e.g., add a new storage method like
finalizeDeliveryAndSubscriptionFailure that accepts entry.id, sub.id,
outcome/platform/monitor and performs updateDeliveryLog plus the
finalizeAutomationRetry work inside a single DB transaction, or call both
updates inside an existing storage.transaction/withDbRetry wrapper) so that
either both the delivery row and subscription counters are committed or none
are.
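The suggested "one transaction" shape can be reasoned about with a self-contained toy (FakeStore, the staged-updates Tx, and finalizeDeliveryAndSubscriptionFailure are all hypothetical names; a real implementation would use the storage layer's actual DB transaction):

```typescript
// Toy model of "terminal delivery-log update + subscription failure bump in
// one transaction": staged updates apply only if every step succeeds, so a
// failure during finalization leaves the delivery row untouched (recoverable).
type Tx = { staged: Array<() => void> };

class FakeStore {
  deliveryStatus = new Map<number, string>();
  failureCount = new Map<number, number>();

  async transaction(work: (tx: Tx) => Promise<void>): Promise<void> {
    const tx: Tx = { staged: [] };
    await work(tx);                         // a throw here aborts: nothing applied
    for (const apply of tx.staged) apply(); // "commit"
  }

  async finalizeDeliveryAndSubscriptionFailure(
    entryId: number,
    subId: number,
    failDuringFinalize = false,
  ): Promise<void> {
    await this.transaction(async (tx) => {
      tx.staged.push(() => this.deliveryStatus.set(entryId, "failed"));
      if (failDuringFinalize) throw new Error("subscription update failed");
      tx.staged.push(() => this.failureCount.set(subId, (this.failureCount.get(subId) ?? 0) + 1));
    });
  }
}
```

Either both the delivery row and the subscription counter commit, or neither does — which is exactly the invariant the review asks the storage method to guarantee.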

- scheduler.ts: waitForActiveChecks now tracks its resolver in
  activeCheckWaiters; stopScheduler() resolves pending waiters after
  cancelling the poll timer so shutdown cannot hang.
- automationDelivery.ts: inspect Promise.allSettled results and log
  rejected delivery tasks with subscription / monitor context instead
  of swallowing them (failures from the addDeliveryLog fallback or
  handleDeliveryFailure are no longer silent).
- ensureTables.ts: ensureChannelTables() now returns boolean and its
  result is added to criticalSchemaFailures in registerRoutes, so a
  failed `ALTER TABLE delivery_log ADD COLUMN claimed_at` triggers the
  /api SCHEMA_NOT_READY gate instead of leaving the API live with
  broken retry queries.
- ensureTables.ts + shared/schema.ts: new
  delivery_log_channel_status_claimed_idx (channel, status, claimed_at)
  so the per-tick stalled-delivery recovery UPDATE is index-backed.
- storage.ts: added the new delivery-log retry primitives
  (addDeliveryLog, updateDeliveryLog, claimDeliveryForRetry,
  recoverStalledDeliveries, getPendingAutomationRetries,
  getAutomationSubscriptionById, and deprecated webhook-specific
  wrappers) to the IStorage interface so test doubles and alternate
  implementations conform to the scheduler contract.
- automationDelivery.test.ts: added parametrised 408 / 429 coverage
  to prove those statuses take the transient-retry path and do not
  bump consecutive_failures.

https://claude.ai/code/session_016C3aevHcGMqJL7MaEHdqN9
@bd73-com bd73-com merged commit 933260d into main Apr 21, 2026
2 checks passed
@github-actions github-actions Bot deleted the claude/fix-github-issues-Xumuy branch April 21, 2026 04:42
bd73-com pushed a commit that referenced this pull request Apr 24, 2026
…atch

Phase 5 skeptic discoveries addressed:

browserlessTracker (H1/H2): Same direct-insert pattern resendTracker had —
raw db.insert(errorLogs).values({level:"info",...}) bypassed the dedup
upsert. After the 6h cooldown elapsed on an unresolved alert row, the
next INSERT would hit the partial unique index and throw, silently
swallowing the alert (no email sent during the threshold event).
Converted to ErrorLogger.info, matching the resendTracker fix.

scraper.ts transient console.warn paths (M1/M2): Two paths on this
branch use console.warn for transient failures. console.warn bypasses
ErrorLogger.sanitizeString/sanitizeContext entirely:
- Line ~1460 (save-failure retry): was logging extractedValue/previousValue
  which contain user-scraped page content — could include API tokens,
  PII, or internal URLs. Moved those fields to the non-transient
  ErrorLogger.error branch only; transient stdout log now contains
  only triage metadata. Test updated to assert the redaction.
- Line ~1590 (outer-catch transient): was interpolating raw error.message
  into the stdout line. undici/fetch errors echo the full requested URL
  (userinfo creds, ?api_key=... query). Now uses the classified
  logContext only; the ErrorLogger.error branch keeps the full message
  since it runs through the sanitizer.

Count endpoint / UI dropdown mismatch (M4): The nav badge's count
endpoint (GET /api/admin/error-logs/count) counted all unresolved rows
regardless of level, while the UI filter dropdown only offers
"error"/"info" after the warning-level narrowing. Legacy warning rows
inflated the badge but were unreachable from the filter — admin would
see "7 unresolved" then find 4 after clicking through. Narrowed count
query with inArray(level, ["error","info"]) so badge and filter agree.

Browserless-outage stdout flooding (M5): The
"Browserless service unavailable" console.warn used to dedup at DB
level via ErrorLogger.warning's partial-index upsert. After the
warning-level removal, every monitor check during an outage emits a
fresh stdout line — N monitors x M replicas. Added a module-level
per-process throttle (60s cooldown per message key) for the two
infra-wide warnings ("service unavailable" and "circuit breaker open").
Exported _resetInfraWarnThrottleForTests hook; scraper.test.ts
beforeEach resets it so tests observe the first emit deterministically.

Verified #462 fixes survived the revert: SCHEMA_NOT_READY gate,
activeCheckWaiters, claimed_at / claimDeliveryForRetry /
recoverStalledDeliveries all present.

Deferred to Phase 6 bug report (pre-existing / not regressions):
- safeHostname "unknown" fallback ambiguity
- errorLogs.level has no DB CHECK constraint
- Dual warning/info rows during deploy (operator migration note for
  PR body: legacy resend_alert_* warning rows should be resolved or
  deleted at deploy to avoid apparent duplicate alerts)

All 2475 tests pass; npm run check clean.
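The per-process throttle the M5 fix describes can be sketched in isolation (function and parameter names are assumptions; the real module keys on the two infra-wide warning messages):

```typescript
// Illustrative 60s-per-key stdout throttle: the first emit for a key goes
// through, repeats inside the cooldown are dropped, and a test hook clears
// the state so tests observe the first emit deterministically.
const lastEmitAt = new Map<string, number>();

function warnThrottled(key: string, msg: string, cooldownMs = 60_000, now = Date.now()): boolean {
  const prev = lastEmitAt.get(key);
  if (prev !== undefined && now - prev < cooldownMs) return false; // suppressed
  lastEmitAt.set(key, now);
  console.warn(msg);
  return true;
}

function _resetInfraWarnThrottleForTests(): void {
  lastEmitAt.clear();
}
```

Because the map is module-level, the cooldown is per process, not per monitor check — during an outage each replica emits at most one line per key per window instead of N-monitors-worth.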
bd73-com added a commit that referenced this pull request Apr 24, 2026
…ise (#469)

* Revert "Remove admin error logging system (#461)"

This reverts commit 1593135.

* Remove warning-level from ErrorLogger; keep error/info in Event Log

User intent on #461 was to silence the Browserless warning-noise flood
only; reverting #461 brought the Event Log back but also re-enabled
warning-level writes that produced the original noise.

This drops ErrorLogger.warning() entirely and rewrites every call site
to console.warn with the same `[source] message` prefix, so:
- Transient/noisy warnings still surface in server logs for ops triage
- The admin Event Log only shows error- and info-level events
- Dedup/aggregation still applies to the remaining error+info rows

Test mocks updated to spy on console.warn instead of ErrorLogger.warning.

* test: regression guards for ErrorLogger.warning() removal

Adds two assertions to logger.test.ts:
- ErrorLogger.warning is undefined (prevents accidental reintroduction)
- ErrorLogger.info routes only to console.log, never console.warn/.error

* security: redact user URLs in error_logs context; drop warning-level API filter

Phase 2 security findings from /magicwand review:

H1/H2 (high): Replace `url: monitor.url` with `hostname: safeHostname(monitor.url)`
at every log site on the branch. Monitor URLs are user-supplied and can carry
basic-auth credentials (`https://user:pw@host`) or query-string secrets
(`?api_key=…`) that the ErrorLogger's sanitizeString regex list does not catch.
This re-applies PR #461's CodeRabbit hardening on top of the Event Log revert.

Sites patched:
- server/services/scraper.ts (5 call sites — error, console.warn, circuit-breaker)
- server/services/scheduler.ts (1 — scheduled-check failed)
- server/services/email.ts (1 — notification email failed)

Also drops `rawBrowserlessMsg` from the site-specific console.warn at
scraper.ts:1283 — upstream errors can echo back the requested URL with
credentials. classifiedReason already captures enough triage detail.
Updates the stale comment claiming logger sanitization applies.

M1 (medium): API `level` filter in GET /api/admin/error-logs still accepted
`"warning"` after commit 3541155 removed the warning-level write path. Narrows
the allowlist (and batch-delete Zod enum) to ["error", "info"] and removes the
"Warnings" dropdown option in AdminErrors.tsx.

safeHostname helper inlined in email.ts to avoid pulling the full
monitorValidation→storage→schema chain into email.test.ts's drizzle-orm mock.

All 2475 tests pass; npm run check clean.

* refactor: consolidate safeHostname; stop direct warning-level writes

Phase 3 architecture findings:

resendTracker: two direct `db.insert(errorLogs).values({level:"warning",...})`
sites bypassed ErrorLogger entirely — producing new warning rows despite the
branch's stated "no more warning writes" invariant. Threshold alerts (80/90/95%
Resend usage) are legitimately triage-worthy, so route them through
ErrorLogger.info instead (preserves admin visibility, sanitization, and
dedup-upsert). The in-DB 6-hour cooldown SELECT is retained for rate-limiting.

safeHostname consolidation: extracted the hostname helper to dependency-free
server/utils/urlUtils.ts. scraper.ts and email.ts dropped their local copies;
scheduler.ts and monitorValidation.ts updated to import from the new location
(monitorValidation re-exports for back-compat with any external callers).

Doc and test cleanup:
- shared/schema.ts:93 level comment now flags 'warning' as historical-only
- Removed twelve dead `warning: vi.fn()` mocks from route test files —
  they masked that the ErrorLogger API no longer exposes .warning()

All 2475 tests pass; npm run check clean.

* chore: code-review cleanup — direct safeHostname imports, valid test fixture

Phase 4 code-review minor findings:
- adminErrorsUtils.test.ts: swap stale `levelFilter: "warning"` fixture for
  "info" so the test mirrors real UI traffic (the dropdown no longer offers
  "warning" and the server Zod schema would 400 it).
- monitorValidation.ts: drop the `safeHostname` re-export; routes.ts,
  routes/extension.ts, routes/v1.ts now import directly from
  ../utils/urlUtils. extension.test.ts mock moved accordingly. One import
  path per symbol.

All 2475 tests pass; npm run check clean.

* harden: skeptic-phase fixes to console.warn sanitization + badge mismatch

* fix: safeHostname returns <invalid-url> instead of 'unknown' on parse failure

Phase 6 bug-reporter reclassified the Phase-5 "unknown" fallback as in-scope
(urlUtils.ts is new code on this branch). A monitor URL of http://unknown/path
would collide with the fallback, making "unknown" ambiguous in triage.

Swap to `<invalid-url>` — angle brackets are not valid in a real hostname, so
the sentinel is guaranteed distinguishable from parsed hostnames.
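A minimal sketch of that contract (the real helper lives in server/utils/urlUtils.ts; this version assumes only the behavior described here):

```typescript
// WHATWG URL parsing keeps the hostname and drops userinfo credentials, port,
// path, and query; parse failures return the "<invalid-url>" sentinel, which
// cannot collide with a real hostname (angle brackets are not legal in one).
function safeHostname(raw: string): string {
  try {
    return new URL(raw).hostname;
  } catch {
    return "<invalid-url>";
  }
}
```

The collision guard is the whole point of the swap: a monitor pointed at http://unknown/path round-trips to "unknown" and stays distinguishable from a parse failure.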

* test: add urlUtils.test.ts covering safeHostname contract

New-code coverage gap noted during review: server/utils/urlUtils.ts was
added on this branch but had no direct test. Asserts:
- well-formed URL → hostname
- userinfo credentials stripped from output
- query-string secrets not reflected in hostname
- parse failures return the `<invalid-url>` sentinel (guard against
  regression of the previous "unknown" fallback)
- real host named `unknown` round-trips cleanly (collision guard)

---------

Co-authored-by: Claude <noreply@anthropic.com>