Skip to content

feat: is_user_error label on the db side#1271

Open
nickkounz wants to merge 1 commit into
stagingfrom
feat/metrics-is-user-error-gauge-label
Open

feat: is_user_error label on the db side#1271
nickkounz wants to merge 1 commit into
stagingfrom
feat/metrics-is-user-error-gauge-label

Conversation

@nickkounz
Copy link
Copy Markdown

@nickkounz nickkounz commented May 15, 2026

Summary

Adds is_user_error as a third label on the DB-sourced keeperhub_workflow_executions_total gauge, so SLO PromQL can distinguish KeeperHub-system failures from user-caused failures on the same authoritative data source.

Why

The managed-client dashboard currently shows inconsistent numbers:

  • Managed Client System Error = errors_created_total{org_slug=~"techops|ajna", is_user_error="false"} → 0
  • Managed Client User Error = errors_created_total{org_slug=~"techops|ajna", is_user_error="true"} → 0
  • Managed Client SLO = success / (success + error) from executions_total → 99.6%

Root cause: the two panels read different metrics with different label sets.

  • executions_total (DB-sourced gauge) only has status and org_slug, so SLO PromQL cannot filter on is_user_error. Every status="error" row is treated as a failure regardless of cause.
  • errors_created_total (API-process counter) has is_user_error but is fired by explicit code calls. Finalization paths that bypass the call (or pre-instrumentation rows) leave the counter under-reporting relative to the DB.

Adding is_user_error to the gauge unblocks a true platform SLO query (success / (success + error AND is_user_error="false")) sourced entirely from the DB scan, removing both the data-source-mismatch and the under-reporting risk.

Changes

  • lib/metrics/db-metrics.ts
  • SQL groups by workflow_executions.is_user_error and emits a CASE expression that normalizes the value into four buckets: true, false, unknown (errored row predating classification), na (non-error status).
  • WorkflowStats.executionsByStatusAndOrgSlug carries the new isUserError field.
  • lib/metrics/collectors/prometheus.ts
  • keeperhub_workflow_executions_total gauge gains is_user_error as a third label.
  • .set() call in updateDbMetrics passes the new label.
  • lib/metrics/METRICS_REFERENCE.md
  • Label column updated for workflow.executions.total.

Backward compatibility

PromQL queries that do not reference is_user_error continue to work — Prometheus auto-aggregates across all values of an un-filtered label. Verified no local alert rules or dashboards reference the gauge with non-aggregated queries (only
METRICS_REFERENCE.md references it in this repo).

Cardinality

is_user_error takes at most 4 distinct values (true / false / unknown / na), bounded against status (~5 values) and org_slug (~tens). Worst-case ~4x expansion of an already small series set; well within Prometheus health limits.

Old data handling

Errored rows with is_user_error = NULL (pre-classification) are emitted under is_user_error="unknown" so backfill gaps are visible on the dashboard rather than hidden. Backfill can happen in a follow-up if the bucket grows large enough to matter.

Test plan

  • CI green
  • In Grafana, plot existing queries (status="success" / status="error" without is_user_error filter) and verify totals unchanged

Follow-ups (not in this PR)

  • Update managed-client dashboard to add Platform SLO panel and align Error / SLO data sources
  • Decide on backfill of pre-classification is_user_error NULL rows if "unknown" bucket is non-trivial

⚠️ MUST UPDATE — unsafe max by patterns

Dashboard JSON: infra repo -> grafana/keeperhub-dashboards/values/keeperhub.json
Alert rules: infra repo -> grafana/keeperhub-dashboards/keeperhub_metrics_alerts.tf

Fix for all 5: add is_user_error to each max by (...) list. E.g., max by (status, org_slug) → max by (status, org_slug,
is_user_error). Outer sum(...) collapses the buckets back into a single total.

@nickkounz nickkounz changed the title add is_user_error label on the db side add: is_user_error label on the db side May 15, 2026
@nickkounz nickkounz changed the title add: is_user_error label on the db side feat: is_user_error label on the db side May 15, 2026
Copy link
Copy Markdown

@eskp eskp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont' love errors_created and is_user_error.
First errors don't get created
Second feels like a useless label and should probably be an appropriate metric instead
Let's talk on Monday

@nickkounz
Copy link
Copy Markdown
Author

i dont' love errors_created and is_user_error. First errors don't get created Second feels like a useless label and should probably be an appropriate metric instead Let's talk on Monday

Fair point and happy to talk Monday. Quick context:

Both errors_created_total and the is_user_error label already exist. This PR does not introduce either of them — it only propagates the existing is_user_error label onto the DB-sourced gauge keeperhub_workflow_executions_total. The goal of thie PR is to get the Error count and SLO align with each other. currently having 0 error and less than 100% SLO is very confusing.

errors_created_total and is_user_error are fair to revisit. Let me know if you'd prefer to hold the merge or land this and track the broader design separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants