Adds is_user_error as a third label on the DB-sourced keeperhub_workflow_executions_total gauge, so SLO PromQL can distinguish KeeperHub-system failures from user-caused failures on the same authoritative data source.
Why
The managed-client dashboard currently shows inconsistent numbers:
Managed Client System Error = errors_created_total{org_slug=~"techops|ajna", is_user_error="false"} → 0
Managed Client User Error = errors_created_total{org_slug=~"techops|ajna", is_user_error="true"} → 0
Platform SLO = success / (success + error) from executions_total → 99.6%
Root cause: the two panels read different metrics with different label sets.
executions_total (DB-sourced gauge) only has status and org_slug, so SLO PromQL cannot filter on is_user_error. Every status="error" row is treated as a failure regardless of cause.
errors_created_total (API-process counter) has is_user_error but is fired by explicit code calls. Finalization paths that bypass the call (or pre-instrumentation rows) leave the counter under-reporting relative to the DB.
Adding is_user_error to the gauge unblocks a true platform SLO query (success / (success + error AND is_user_error="false")) sourced entirely from the DB scan, removing both the data-source-mismatch and the under-reporting risk.
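Assuming the metric and label names above, the unblocked platform SLO query could be sketched roughly as:

```promql
sum(keeperhub_workflow_executions_total{status="success"})
/
(
  sum(keeperhub_workflow_executions_total{status="success"})
  +
  sum(keeperhub_workflow_executions_total{status="error", is_user_error="false"})
)
```

Because both the numerator and denominator come from the same DB-sourced gauge, the SLO and the error panels can no longer disagree about the underlying counts.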
Changes
lib/metrics/db-metrics.ts
SQL groups by workflow_executions.is_user_error and emits a CASE expression that normalizes the value into four buckets: true, false, unknown (errored row predating classification), na (non-error status).
WorkflowStats.executionsByStatusAndOrgSlug carries the new isUserError field.
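A sketch of the normalization described above (column and table names assumed from the description; the actual query lives in lib/metrics/db-metrics.ts):

```sql
SELECT
  status,
  org_slug,
  CASE
    WHEN status <> 'error'         THEN 'na'      -- non-error rows: classification not applicable
    WHEN is_user_error IS TRUE     THEN 'true'
    WHEN is_user_error IS FALSE    THEN 'false'
    ELSE 'unknown'                                -- errored rows predating classification (NULL)
  END AS is_user_error_bucket,
  COUNT(*) AS executions
FROM workflow_executions
GROUP BY 1, 2, 3;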
lib/metrics/collectors/prometheus.ts
keeperhub_workflow_executions_total gauge gains is_user_error as a third label.
.set() call in updateDbMetrics passes the new label.
lib/metrics/METRICS_REFERENCE.md
Label column updated for workflow.executions.total.
Backward compatibility
PromQL queries that do not reference is_user_error continue to work — Prometheus auto-aggregates across all values of an unfiltered label. Verified no local alert rules or dashboards reference the gauge with non-aggregated queries (only METRICS_REFERENCE.md references it in this repo).
Cardinality
is_user_error takes at most 4 distinct values (true / false / unknown / na). Combined with status (~5 values) and org_slug (~tens of values), the worst case is a ~4x expansion of an already small series set, well within Prometheus health limits.
Old data handling
Errored rows with is_user_error = NULL (pre-classification) are emitted under is_user_error="unknown" so backfill gaps are visible on the dashboard rather than hidden. Backfill can happen in a follow-up if the bucket grows large enough to matter.
Test plan
CI green
In Grafana, plot existing queries (status="success" / status="error" without is_user_error filter) and verify totals unchanged
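The Grafana check can be done by comparing the unfiltered totals against the new breakdown, roughly:

```promql
# Unfiltered totals: should match pre-change values exactly
sum by (status) (keeperhub_workflow_executions_total)

# New breakdown: available only after this change
sum by (status, is_user_error) (keeperhub_workflow_executions_total)
```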
Follow-ups (not in this PR)
Update managed-client dashboard to add Platform SLO panel and align Error / SLO data sources
Decide on backfill of pre-classification is_user_error NULL rows if "unknown" bucket is non-trivial
⚠️ MUST UPDATE — unsafe max by patterns
Dashboard JSON: infra repo -> grafana/keeperhub-dashboards/values/keeperhub.json
Alert rules: infra repo -> grafana/keeperhub-dashboards/keeperhub_metrics_alerts.tf
Fix for all 5: add is_user_error to each max by (...) list, e.g. max by (status, org_slug) → max by (status, org_slug, is_user_error). The outer sum(...) collapses the buckets back into a single total.
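Assuming the alert expressions follow the pattern above, the fix looks like:

```promql
# Before: once is_user_error exists, the inner max keeps only the largest
# bucket per (status, org_slug) instead of their sum, undercounting errors
sum(max by (status, org_slug) (keeperhub_workflow_executions_total{status="error"}))

# After: carry the new label through the inner max; the outer sum
# collapses the buckets back into a single total
sum(max by (status, org_slug, is_user_error) (keeperhub_workflow_executions_total{status="error"}))
```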
I don't love errors_created and is_user_error.
First, errors don't get created.
Second, it feels like a useless label and should probably be an appropriate metric instead.
Let's talk on Monday.
Fair point and happy to talk Monday. Quick context:
Both errors_created_total and the is_user_error label already exist. This PR does not introduce either of them — it only propagates the existing is_user_error label onto the DB-sourced gauge keeperhub_workflow_executions_total. The goal of this PR is to get the Error count and the SLO to align with each other; currently showing 0 errors alongside a less-than-100% SLO is very confusing.
errors_created_total and is_user_error are fair to revisit. Let me know if you'd prefer to hold the merge or land this and track the broader design separately.