Skip to content

obs/perf: /api/trpc 5xx breakdown + add error detail to trpc.call logs #1572

@joelhooks

Description

@joelhooks

What

We are still throwing a lot of 5xx on /api/trpc/[trpc] in prod.

This issue is to:

  • break down the failures by tRPC procedure path
  • add structured error detail to trpc.call logs so we can actually fix root causes (right now we mostly have ok=false with no message)

Why it matters

tRPC is a load amplifier. When it 500s we get:

  • broken UX
  • retries
  • more backend pressure

Since current prod deploy

Deploy: dpl_84MfaWnriWoj37gJ5b62WZ5ohe1W (created 2026-02-07T23:35:22Z)

Route-level 5xx:

  • bun tools/me.ts logs route-errors "/api/trpc/[trpc]" --since-deploy https://vercel.com/eggheadio/egghead-io-nextjs/84MfaWnriWoj37gJ5b62WZ5ohe1W --json
  • Result: 997 HTTP 500s out of 233,945 requests (all on the current deployment)

Procedure-level error breakdown:

  • bun tools/me.ts logs trpc-errors --since-deploy https://vercel.com/eggheadio/egghead-io-nextjs/84MfaWnriWoj37gJ5b62WZ5ohe1W --json
  • Top failing procedures (errors are per-procedure call, not per-request):
    • course.getCourse: 1381
    • progress.forLesson: 170 (all authed)
    • lesson.getLessonbySlug: 152
    • topics.top: 124
    • lesson.getAssociatedLessonsByTag: 74

Current instrumentation gap

logs trpc-errors shows has_error_message=0 and has_error=0 across the board.
So we can identify which procedures fail, but not why.

Tasks

  1. Improve trpc.call structured logs:
  • Add error_code (TRPCError code)
  • Add error_message (sanitized, no PII)
  • Add error_name / cause_name if available
  • Add http_status
  • Add input_shape booleans (ex: has_slug, slug_len) instead of raw inputs
  1. Add an agent query pack:
  • Keep bun tools/me.ts logs trpc-errors --since-deploy ... current
  • Add a second query (optional) for topMessages once error_message coverage exists
  1. Fix the top offenders first:
  • course.getCourse
  • progress.forLesson
  • lesson.getLessonbySlug

Definition of done

  • We can answer: "which procedure is failing and what error is it" from Axiom in <60s.
  • /api/trpc 500s trend down after fixes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    agent/readyCan be picked up by an AI agentbugSomething isn't workingobservabilityStructured logging/metrics/telemetryperfPerformance work

    Type

    No type

    Projects

    Status

    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions