Adds is_user_error as a third label on the DB-sourced keeperhub_workflow_executions_total gauge, so SLO PromQL can distinguish KeeperHub-system failures from user-caused failures on the same authoritative data source.
Why
The managed-client dashboard currently shows inconsistent numbers:
Managed Client System Error = errors_created_total{org_slug=~"techops|ajna", is_user_error="false"} → 0
Managed Client User Error = errors_created_total{org_slug=~"techops|ajna", is_user_error="true"} → 0
Platform SLO = success / (success + error) from executions_total → 99.6%
Root cause: the two panels read different metrics with different label sets.
executions_total (DB-sourced gauge) only has status and org_slug, so SLO PromQL cannot filter on is_user_error. Every status="error" row is treated as a failure regardless of cause.
errors_created_total (API-process counter) has is_user_error but is fired by explicit code calls. Finalization paths that bypass the call (or pre-instrumentation rows) leave the counter under-reporting relative to the DB.
Adding is_user_error to the gauge unblocks a true platform SLO query (success / (success + error AND is_user_error="false")) sourced entirely from the DB scan, removing both the data-source-mismatch and the under-reporting risk.
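Assuming the metric and label names above, the unblocked platform SLO query could be sketched roughly as:

```promql
sum(keeperhub_workflow_executions_total{status="success"})
/
(
  sum(keeperhub_workflow_executions_total{status="success"})
  +
  sum(keeperhub_workflow_executions_total{status="error", is_user_error="false"})
)
```

Because both the numerator and denominator come from the same DB-sourced gauge, the SLO and the error panels can no longer disagree about the underlying counts.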
Changes
lib/metrics/db-metrics.ts
SQL groups by workflow_executions.is_user_error and emits a CASE expression that normalizes the value into four buckets: true, false, unknown (errored row predating classification), na (non-error status).
WorkflowStats.executionsByStatusAndOrgSlug carries the new isUserError field.
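A sketch of the normalization described above (column and table names assumed from the description; the actual query lives in lib/metrics/db-metrics.ts):

```sql
SELECT
  status,
  org_slug,
  CASE
    WHEN status <> 'error'         THEN 'na'      -- non-error rows: classification not applicable
    WHEN is_user_error IS TRUE     THEN 'true'
    WHEN is_user_error IS FALSE    THEN 'false'
    ELSE 'unknown'                                -- errored rows predating classification (NULL)
  END AS is_user_error_bucket,
  COUNT(*) AS executions
FROM workflow_executions
GROUP BY 1, 2, 3;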
lib/metrics/collectors/prometheus.ts
keeperhub_workflow_executions_total gauge gains is_user_error as a third label.
.set() call in updateDbMetrics passes the new label.
lib/metrics/METRICS_REFERENCE.md
Label column updated for workflow.executions.total.
Backward compatibility
PromQL queries that do not reference is_user_error continue to work — Prometheus auto-aggregates across all values of an unfiltered label. Verified no local alert rules or dashboards reference the gauge with non-aggregated queries (only METRICS_REFERENCE.md references it in this repo).
Cardinality
is_user_error takes at most 4 distinct values (true / false / unknown / na). Combined with status (~5 values) and org_slug (~tens of values), the worst case is a ~4x expansion of an already small series set, well within Prometheus health limits.
Old data handling
Errored rows with is_user_error = NULL (pre-classification) are emitted under is_user_error="unknown" so backfill gaps are visible on the dashboard rather than hidden. Backfill can happen in a follow-up if the bucket grows large enough to matter.
Test plan
CI green
In Grafana, plot existing queries (status="success" / status="error" without is_user_error filter) and verify totals unchanged
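The Grafana check can be done by comparing the unfiltered totals against the new breakdown, roughly:

```promql
# Unfiltered totals: should match pre-change values exactly
sum by (status) (keeperhub_workflow_executions_total)

# New breakdown: available only after this change
sum by (status, is_user_error) (keeperhub_workflow_executions_total)
```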
Follow-ups (not in this PR)
Update managed-client dashboard to add Platform SLO panel and align Error / SLO data sources
Decide on backfill of pre-classification is_user_error NULL rows if "unknown" bucket is non-trivial
⚠️ MUST UPDATE — unsafe max by patterns
Dashboard JSON: infra repo -> grafana/keeperhub-dashboards/values/keeperhub.json
Alert rules: infra repo -> grafana/keeperhub-dashboards/keeperhub_metrics_alerts.tf
Fix for all 5: add is_user_error to each max by (...) list, e.g. max by (status, org_slug) → max by (status, org_slug, is_user_error). The outer sum(...) collapses the buckets back into a single total.
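Assuming the alert expressions follow the pattern above, the fix looks like:

```promql
# Before: once is_user_error exists, the inner max keeps only the largest
# bucket per (status, org_slug) instead of their sum, undercounting errors
sum(max by (status, org_slug) (keeperhub_workflow_executions_total{status="error"}))

# After: carry the new label through the inner max; the outer sum
# collapses the buckets back into a single total
sum(max by (status, org_slug, is_user_error) (keeperhub_workflow_executions_total{status="error"}))
```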
I don't love errors_created and is_user_error.
First, errors don't get created.
Second, it feels like a useless label and should probably be an appropriate metric instead.
Let's talk on Monday.
Fair point and happy to talk Monday. Quick context:
Both errors_created_total and the is_user_error label already exist. This PR does not introduce either of them — it only propagates the existing is_user_error label onto the DB-sourced gauge keeperhub_workflow_executions_total. The goal of this PR is to get the Error count and the SLO to align with each other; currently showing 0 errors alongside a less-than-100% SLO is very confusing.
errors_created_total and is_user_error are fair to revisit. Let me know if you'd prefer to hold the merge or land this and track the broader design separately.