Skip to content

feat: configure BullMQ retry strategy and dead letter handling#580

Open
chrispaskvan wants to merge 5 commits into
mainfrom
feature/issue-569-bullmq-retry-dead-letter
Open

feat: configure BullMQ retry strategy and dead letter handling#580
chrispaskvan wants to merge 5 commits into
mainfrom
feature/issue-569-bullmq-retry-dead-letter

Conversation

@chrispaskvan
Copy link
Copy Markdown
Owner

Summary

Fixes two compounding bugs that cause permanent notification loss from transient failures:

  • No retry config: BullMQ Queue had no defaultJobOptions, defaulting to 1 attempt (0 retries). The worker re-threw errors but retries never fired.
  • Bare catch swallowed all errors: #send() caught every exception and sent a fallback SMS, so jobs always appeared successful to BullMQ — even if retries were configured, they'd never trigger.

Changes

  • Add defaultJobOptions to Queue: 3 attempts with exponential backoff (5s, 10s, 20s), completed job cleanup (24h/1000 max), failed job retention (7d/500 max)
  • Refactor #send() error handling to classify errors:
    • Transient (5xx, timeouts, network errors) → re-throw for BullMQ retry
    • Permanent (bad credentials, account issues) → UnrecoverableError (no retries)
    • Business-logic (Xur not available) → XurUnavailableError → send fallback SMS, job succeeds
  • Emit Application Insights metric (notification-job-exhausted) when retries are exhausted
  • Enhance subscriber failed event logging with attempt counts
  • Add XurUnavailableError class for Xur business-logic errors

Test plan

  • All 446 existing + new tests pass
  • Publisher spec: verifies defaultJobOptions shape, metric emission on exhaustion, no metric when retries remain
  • Controller spec: transient errors re-throw for retry, permanent errors wrap in UnrecoverableError, only business-logic getXur failures trigger fallback SMS
  • ESLint passes via pre-commit hook

Closes #569

🤖 Generated with Claude Code

Add retry configuration, error classification, and job lifecycle
management for notification jobs to prevent permanent notification loss
from transient failures.

- Add defaultJobOptions to Queue: 3 attempts with exponential backoff
  (5s, 10s, 20s), completed job cleanup (24h), failed job retention (7d)
- Refactor #send() error handling to classify errors as transient
  (retry via BullMQ), permanent (UnrecoverableError), or business-logic
  (XurUnavailableError sends fallback SMS)
- Emit Application Insights metric on retry exhaustion
- Enhance subscriber failed-event logging with attempt counts

Closes #569

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 21, 2026 17:58
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR configures BullMQ retries/cleanup for notification jobs and refactors notification sending so transient failures can actually trigger BullMQ retries, while business-logic “Xur unavailable” failures still send a fallback message.

Changes:

  • Configure BullMQ Queue defaultJobOptions (attempts/backoff, remove-on-complete/fail retention) and add a “retries exhausted” metric emission on failed events.
  • Refactor NotificationController.#send() error handling to rethrow transient errors (retry), mark permanent errors as UnrecoverableError (no retry), and introduce XurUnavailableError for business-logic fallback.
  • Expand unit tests around retry/permanent/fallback behavior and job exhaustion metric behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
notifications/xur-unavailable.error.js Adds a dedicated business-logic error type for Xur inventory unavailability.
notifications/notification.controller.js Implements transient/permanent/business-logic error classification for Xur notifications.
notifications/notification.controller.spec.js Adds test coverage for transient rethrow, permanent unrecoverable failures, and business-logic fallback.
helpers/publisher.js Adds BullMQ default retry/cleanup options and emits an App Insights metric when retries are exhausted.
helpers/publisher.spec.js Tests queue configuration and the exhausted-retries metric emission logic.
helpers/subscriber.js Enhances worker failed logging with attempt counts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread notifications/notification.controller.js
Comment thread helpers/publisher.js Outdated
Comment thread notifications/notification.controller.js Outdated
Comment thread notifications/notification.controller.js
- Add ClaimCheck.updatePhoneNumber in XurUnavailableError fallback path
- Wrap publisher failed handler in try/catch with defensive opts access
- Preserve original error as cause on UnrecoverableError, guard re-wrapping
- Narrow getXur error handling: only DestinyError becomes XurUnavailableError

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This comment was marked as outdated.

… cause

- Only convert DestinyError with status 'DestinyVendorNotFound' to
  XurUnavailableError; rethrow other DestinyErrors for outer handling
- Preserve original error as cause on XurUnavailableError
- Fix DestinyError constructor arg order in tests (code, message, status)
- Add test for non-VendorNotFound DestinyError becoming UnrecoverableError

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This comment was marked as outdated.

- Fix resolves.not.toThrow() assertions to use resolves.toBeUndefined()
- Use fallback of 1 (BullMQ default) instead of Infinity for missing opts.attempts
- Update test expectation for missing opts to reflect new fallback behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Configure BullMQ retry strategy and dead letter handling for notification jobs

2 participants