2 changes: 1 addition & 1 deletion lib/metrics/METRICS_REFERENCE.md
@@ -68,7 +68,7 @@ Counter/Gauge metrics tracking request/event counts.

| Metric Name | Description | Labels | Unit | Source |
|-------------|-------------|--------|------|--------|
- | `workflow.executions.total` | Total workflow executions by status (all-time) | `status` | gauge | DB |
+ | `workflow.executions.total` | Total workflow executions by status (all-time) | `status`, `org_slug`, `is_user_error` (`true`/`false`/`unknown`/`na`) | gauge | DB |
| `workflow.execution.errors.total` | Total failed workflow executions (all-time) | - | gauge | DB |
| `plugin.invocations.total` | Plugin action invocations | `plugin_name`, `action_name` | count | API |
| `user.active.daily` | Daily active users (24h) | - | gauge | DB |
29 changes: 23 additions & 6 deletions lib/metrics/collectors/prometheus.ts
@@ -94,14 +94,27 @@ function getOrCreateGauge(
// All metrics are GAUGES (point-in-time snapshots). Use max() aggregation across pods.
// For rate/delta queries, use PromQL delta() function: max(delta(metric[1h]))

- // Workflow execution counts by status and org_slug. Personal/anonymous
- // workflows are emitted under org_slug="_anonymous" so the sum across
- // org_slug for a given status equals the global per-status total.
+ // Workflow execution counts by status, org_slug, and is_user_error. Personal/
+ // anonymous workflows are emitted under org_slug="_anonymous" so the sum
+ // across org_slug for a given status equals the global per-status total.
+ //
+ // is_user_error label values:
+ // "true" - error caused by user input/config/external service
+ // "false" - error caused by KeeperHub system/infrastructure
+ // "unknown" - errored row predating classification (NULL in DB)
+ // "na" - non-error status (success/running/pending/cancelled)
+ //
+ // PromQL queries that ignore is_user_error still work (Prometheus aggregates
+ // across all values by default). The label unlocks "platform-side SLO"
+ // queries that exclude user-caused failures from the denominator, since the
+ // counter `keeperhub_workflow_execution_errors_created_total` can miss
+ // errors when finalization paths bypass it; the gauge is sourced from the DB
+ // directly and stays authoritative.
const workflowExecutionsTotal = getOrCreateGauge(
dbRegistry,
"keeperhub_workflow_executions_total",
- "Total workflow executions by status, broken down by org_slug (all-time)",
- ["status", "org_slug"]
+ "Total workflow executions by status, broken down by org_slug and is_user_error (all-time)",
+ ["status", "org_slug", "is_user_error"]
);
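
As a hedged illustration of the "platform-side SLO" idea in the comment above, the sketch below computes a platform error ratio from samples shaped like the gauge's label set. The `Sample` type and `platformErrorRatio` function are hypothetical and not part of this PR; counting `"unknown"` errors as platform-side and keeping every non-user-caused execution in the denominator are both assumptions.

```typescript
// Hypothetical sample shape mirroring the labels of
// keeperhub_workflow_executions_total (org_slug omitted for brevity).
type Sample = {
  status: string;
  isUserError: "true" | "false" | "unknown" | "na";
  count: number;
};

// Share of executions that failed for platform-side reasons, with
// user-caused errors excluded from both numerator and denominator.
// Assumption: "unknown" errors are counted as platform-side.
function platformErrorRatio(samples: Sample[]): number {
  let platformErrors = 0;
  let denominator = 0;
  for (const s of samples) {
    const isUserCaused = s.status === "error" && s.isUserError === "true";
    if (isUserCaused) {
      continue; // user-caused failures are left out entirely
    }
    denominator += s.count;
    if (s.status === "error") {
      platformErrors += s.count;
    }
  }
  return denominator === 0 ? 0 : platformErrors / denominator;
}

const exampleSamples: Sample[] = [
  { status: "success", isUserError: "na", count: 90 },
  { status: "error", isUserError: "true", count: 6 },
  { status: "error", isUserError: "false", count: 3 },
  { status: "error", isUserError: "unknown", count: 1 },
];
const exampleRatio = platformErrorRatio(exampleSamples); // 4 platform errors out of 94
```

In a dashboard the same ratio would be written in PromQL with a label matcher on `is_user_error`; queries that aggregate with `sum by (status)` or similar collapse the new label and return the same totals as before.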

// KEEP-545: the previous DB-sourced gauge `keeperhub_workflow_execution_errors_total`
@@ -1233,7 +1246,11 @@ export async function updateDbMetrics(): Promise<void> {
workflowExecutionsTotal.reset();
for (const row of workflowStats.executionsByStatusAndOrgSlug) {
workflowExecutionsTotal.set(
- { status: row.status, org_slug: row.orgSlug },
+ {
+ status: row.status,
+ org_slug: row.orgSlug,
+ is_user_error: row.isUserError,
+ },
row.count
);
}
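
One point that may be worth checking in review: the query in `lib/metrics/db-metrics.ts` groups by the raw `is_user_error` column but emits the `CASE` result. If a non-error row ever carries a non-NULL `is_user_error`, two result rows could map to the same (`status`, `org_slug`, `"na"`) label set, and the later `set()` call would overwrite the earlier one. Whether that can happen depends on data this diff does not show. A minimal sketch of merging rows by label set before writing them, using a hypothetical helper `mergeByLabels`:

```typescript
// Row shape taken from WorkflowStats.executionsByStatusAndOrgSlug in this PR.
type BreakdownRow = {
  status: string;
  orgSlug: string;
  isUserError: string;
  count: number;
};

// Hypothetical helper: sums counts of rows that share the same label set,
// so each label combination is written to the gauge exactly once.
function mergeByLabels(rows: BreakdownRow[]): BreakdownRow[] {
  const merged = new Map<string, BreakdownRow>();
  for (const row of rows) {
    // JSON-encoding the tuple gives an unambiguous composite key.
    const key = JSON.stringify([row.status, row.orgSlug, row.isUserError]);
    const existing = merged.get(key);
    if (existing) {
      existing.count += row.count;
    } else {
      merged.set(key, { ...row }); // copy so the input is not mutated
    }
  }
  return Array.from(merged.values());
}
```

The loop above could then iterate over `mergeByLabels(workflowStats.executionsByStatusAndOrgSlug)` instead of the raw array; the alternative is to group by the `CASE` expression on the SQL side.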
47 changes: 35 additions & 12 deletions lib/metrics/db-metrics.ts
@@ -58,12 +58,23 @@ export type WorkflowStats = {
totalPending: number;
totalCancelled: number;

- // Per-(status, org_slug) execution counts. Personal/anonymous workflows
- // are bucketed under ANONYMOUS_ORG_SLUG so the sum of counts for a given
- // status across all orgs matches the corresponding total* above.
+ // Per-(status, org_slug, is_user_error) execution counts. Personal/anonymous
+ // workflows are bucketed under ANONYMOUS_ORG_SLUG so the sum of counts for a
+ // given status across all orgs matches the corresponding total* above.
+ //
+ // isUserError values:
+ // "true" - error caused by user input/config/external service
+ // "false" - error caused by KeeperHub system/infrastructure
+ // "unknown" - errored row predating classification (NULL in DB)
+ // "na" - non-error status (success/running/pending/cancelled)
+ //
+ // Encoding NULL as "unknown" rather than dropping the row keeps the gauge
+ // total equal to the all-up execution count and surfaces backfill gaps in
+ // the dashboard rather than hiding them.
executionsByStatusAndOrgSlug: Array<{
status: string;
orgSlug: string;
+ isUserError: string;
count: number;
}>;
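
The four `isUserError` values listed in the comment can be expressed as a small pure function. This is a sketch that mirrors the SQL `CASE` used in `getWorkflowStatsFromDb`, not code from the PR; the names `IsUserErrorLabel` and `toIsUserErrorLabel` are hypothetical.

```typescript
// Hypothetical union of the label values documented in the comment.
type IsUserErrorLabel = "true" | "false" | "unknown" | "na";

// Mirrors the CASE expression: non-error statuses are "na", a NULL
// classification on an errored row is "unknown", otherwise the boolean
// is rendered as a string.
function toIsUserErrorLabel(
  status: string,
  isUserError: boolean | null
): IsUserErrorLabel {
  if (status !== "error") {
    return "na";
  }
  if (isUserError === null) {
    return "unknown";
  }
  return isUserError ? "true" : "false";
}
```

Typing the field as a union like this instead of `string` would let the compiler catch a misspelled label value, at the cost of a cast on the raw SQL result.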

@@ -93,30 +104,42 @@ export async function getWorkflowStatsFromDb(): Promise<WorkflowStats> {
durationCount: 0,
};

- // Per-(status, org_slug) execution breakdown: JOIN workflows + organization,
- // LEFT JOIN so anonymous workflows still contribute (under ANONYMOUS_ORG_SLUG).
- // GROUP BY uses the organization.slug column reference (not the COALESCE
- // expression): Drizzle would otherwise bind ANONYMOUS_ORG_SLUG as separate
- // parameters in SELECT and GROUP BY clauses, and Postgres rejects the query
- // because the two COALESCE expressions are not textually identical. Postgres
- // groups all NULL slugs into one group (NULLs are equal in GROUP BY), and
- // the SELECT-side COALESCE renders that group as ANONYMOUS_ORG_SLUG.
+ // Per-(status, org_slug, is_user_error) execution breakdown: JOIN workflows
+ // + organization, LEFT JOIN so anonymous workflows still contribute (under
+ // ANONYMOUS_ORG_SLUG). GROUP BY uses the underlying columns (not the
+ // COALESCE/CASE expressions): Drizzle would otherwise bind constants as
+ // separate parameters in SELECT and GROUP BY clauses, and Postgres rejects
+ // the query because the two expressions are not textually identical.
+ // Postgres groups NULLs together (NULLs are equal in GROUP BY), and the
+ // SELECT-side expressions render those groups as ANONYMOUS_ORG_SLUG /
+ // "unknown" / "na" as appropriate.
const breakdown = await db
.select({
status: workflowExecutions.status,
orgSlug: sql<string>`COALESCE(${organization.slug}, ${ANONYMOUS_ORG_SLUG})`,
+ isUserError: sql<string>`CASE
+ WHEN ${workflowExecutions.status} <> 'error' THEN 'na'
+ WHEN ${workflowExecutions.isUserError} IS NULL THEN 'unknown'
+ WHEN ${workflowExecutions.isUserError} = TRUE THEN 'true'
+ ELSE 'false'
+ END`,
count: count(),
})
.from(workflowExecutions)
.innerJoin(workflows, eq(workflowExecutions.workflowId, workflows.id))
.leftJoin(organization, eq(workflows.organizationId, organization.id))
- .groupBy(workflowExecutions.status, organization.slug);
+ .groupBy(
+ workflowExecutions.status,
+ organization.slug,
+ workflowExecutions.isUserError
+ );

for (const row of breakdown) {
const c = Number(row.count) || 0;
stats.executionsByStatusAndOrgSlug.push({
status: row.status,
orgSlug: row.orgSlug,
+ isUserError: row.isUserError,
count: c,
});
switch (row.status) {
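
The comments in this file state that summing the breakdown over `org_slug` (and now `is_user_error`) for a given status should reproduce the per-status totals. A sketch of how that invariant might be checked in a unit test follows; `StatusRow` and `sumByStatus` are hypothetical names, and the sample rows are made up for illustration.

```typescript
// Row shape taken from WorkflowStats.executionsByStatusAndOrgSlug in this PR.
type StatusRow = {
  status: string;
  orgSlug: string;
  isUserError: string;
  count: number;
};

// Hypothetical helper: collapses the breakdown to one total per status,
// summing across org_slug and is_user_error.
function sumByStatus(rows: StatusRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.status, (totals.get(row.status) ?? 0) + row.count);
  }
  return totals;
}

// Illustrative data only.
const sampleBreakdown: StatusRow[] = [
  { status: "error", orgSlug: "acme", isUserError: "true", count: 2 },
  { status: "error", orgSlug: "acme", isUserError: "unknown", count: 1 },
  { status: "error", orgSlug: "_anonymous", isUserError: "false", count: 4 },
  { status: "success", orgSlug: "acme", isUserError: "na", count: 10 },
];
const perStatusTotals = sumByStatus(sampleBreakdown);
```

A test could then compare `perStatusTotals.get("error")` against the corresponding `total*` field of `WorkflowStats`.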