What
We are still throwing a lot of 5xx on /api/trpc/[trpc] in prod.
This issue is to:
- break down the failures by tRPC procedure path
- add structured error detail to
trpc.call logs so we can actually fix root causes (right now we mostly have ok=false with no message)
Why it matters
tRPC is a load amplifier. When it 500s we get:
- broken UX
- retries
- more backend pressure
Since current prod deploy
Deploy: dpl_84MfaWnriWoj37gJ5b62WZ5ohe1W (created 2026-02-07T23:35:22Z)
Route-level 5xx:
bun tools/me.ts logs route-errors "/api/trpc/[trpc]" --since-deploy https://vercel.com/eggheadio/egghead-io-nextjs/84MfaWnriWoj37gJ5b62WZ5ohe1W --json
- Result:
997 HTTP 500s out of 233,945 requests (all on the current deployment)
Procedure-level error breakdown:
bun tools/me.ts logs trpc-errors --since-deploy https://vercel.com/eggheadio/egghead-io-nextjs/84MfaWnriWoj37gJ5b62WZ5ohe1W --json
- Top failing procedures (errors are per-procedure call, not per-request):
course.getCourse: 1381
progress.forLesson: 170 (all authed)
lesson.getLessonbySlug: 152
topics.top: 124
lesson.getAssociatedLessonsByTag: 74
Current instrumentation gap
logs trpc-errors shows has_error_message=0 and has_error=0 across the board.
So we can identify which procedures fail, but not why.
Tasks
- Improve
trpc.call structured logs:
- Add
error_code (TRPCError code)
- Add
error_message (sanitized, no PII)
- Add
error_name / cause_name if available
- Add
http_status
- Add
input_shape booleans (ex: has_slug, slug_len) instead of raw inputs
- Add an agent query pack:
- Keep
bun tools/me.ts logs trpc-errors --since-deploy ... current
- Add a second query (optional) for
topMessages once error_message coverage exists
- Fix the top offenders first:
course.getCourse
progress.forLesson
lesson.getLessonbySlug
Definition of done
- We can answer: "which procedure is failing and what error is it" from Axiom in <60s.
/api/trpc 500s trend down after fixes.
What
We are still throwing a lot of 5xx on
/api/trpc/[trpc]in prod.This issue is to:
trpc.calllogs so we can actually fix root causes (right now we mostly haveok=falsewith no message)Why it matters
tRPC is a load amplifier. When it 500s we get:
Since current prod deploy
Deploy:
dpl_84MfaWnriWoj37gJ5b62WZ5ohe1W(created2026-02-07T23:35:22Z)Route-level 5xx:
bun tools/me.ts logs route-errors "/api/trpc/[trpc]" --since-deploy https://vercel.com/eggheadio/egghead-io-nextjs/84MfaWnriWoj37gJ5b62WZ5ohe1W --json997HTTP 500s out of233,945requests (all on the current deployment)Procedure-level error breakdown:
bun tools/me.ts logs trpc-errors --since-deploy https://vercel.com/eggheadio/egghead-io-nextjs/84MfaWnriWoj37gJ5b62WZ5ohe1W --jsoncourse.getCourse:1381progress.forLesson:170(all authed)lesson.getLessonbySlug:152topics.top:124lesson.getAssociatedLessonsByTag:74Current instrumentation gap
logs trpc-errorsshowshas_error_message=0andhas_error=0across the board.So we can identify which procedures fail, but not why.
Tasks
trpc.callstructured logs:error_code(TRPCError code)error_message(sanitized, no PII)error_name/cause_nameif availablehttp_statusinput_shapebooleans (ex: has_slug, slug_len) instead of raw inputsbun tools/me.ts logs trpc-errors --since-deploy ...currenttopMessagesonce error_message coverage existscourse.getCourseprogress.forLessonlesson.getLessonbySlugDefinition of done
/api/trpc500s trend down after fixes.