diff --git a/docs/cloud/high-availability/monitoring.mdx b/docs/cloud/high-availability/monitoring.mdx index fb28a81fe2..4e1b79b5ed 100644 --- a/docs/cloud/high-availability/monitoring.mdx +++ b/docs/cloud/high-availability/monitoring.mdx @@ -29,19 +29,19 @@ Temporal Cloud offers several ways for you to track the health and performance o ## Replication status You can monitor your replica status with the Temporal Cloud UI. If the replica is unhealthy, Temporal Cloud disables the -“Trigger a failover” option to prevent failing over to an unhealthy replica. An unhealthy replica might be due to: +"Trigger a failover" option to prevent failing over to an unhealthy replica. An unhealthy replica might be due to: - **Data synchronization issues:** The replica fails to remain in sync with the primary due to network or performance problems. - **Replication lag:** The replica falls behind the primary, causing it to be out of sync. - **Network issues:** Loss of communication between the replica and the primary causes problems. -- **Failed health checks:** If the replica fails health checks, it’s marked as unhealthy. +- **Failed health checks:** If the replica fails health checks, it's marked as unhealthy. These issues prevent the replica from being used during a failover, ensuring system stability and consistency. -## Replication lag metric +## Monitoring replication -Temporal Cloud’s High Availability features use asynchronous replication between the primary and the replica. Workflow +Temporal Cloud's High Availability features use asynchronous replication between the primary and the replica. Workflow updates in the primary, along with associated History Events, are transmitted to the replica. Replication lag refers to the transmission delay of Workflow updates and history events from the primary to the replica. @@ -55,25 +55,10 @@ P95 means 95% of updates are processed faster than this limit. 
A forced failover, when there is significant replication lag, increases the likelihood of rolling back Workflow progress. Always check the replication lag metrics before initiating a failover. -Temporal Cloud emits three replication lag-specific -[metrics](/cloud/metrics/reference#replication-lag). The following samples demonstrate how you can -use these metrics to monitor and explore replication lag: +Temporal Cloud emits replication lag [metrics](/cloud/metrics/openmetrics/metrics-reference#replication-metrics) +as pre-computed percentiles (p50, p95, p99) that are labeled with `temporal_namespace`. -**P99 replication lag histogram**: - -``` -histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le)) -``` - -**Average replication lag**: - -``` -sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace) -/ -sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace) -``` - -When a Namespace is using a replica, you may notice that the Action count in `temporal_cloud_v0_total_action_count` is +When a Namespace is using a replica, you may notice that the Action count in `temporal_cloud_v1_total_action_count` is 2x what it was before adding a replica. This happens because Actions are replicated; they occur on both the primary and the replica. 
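Because the v1 lag metrics arrive as pre-computed percentile series, a pre-failover check can read them directly rather than computing `histogram_quantile`. A minimal PromQL sketch, assuming the p99 percentile is published as a series named `temporal_cloud_v1_replication_lag_p99` in seconds (both the series name and unit are assumptions; confirm them against the replication metrics reference):

```
# Hypothetical alert: fire when p99 replication lag exceeds 60 seconds.
# Series name and seconds unit are assumptions; verify in the metrics reference.
max by (temporal_namespace) (
  temporal_cloud_v1_replication_lag_p99{temporal_namespace=~"$namespace"}
) > 60
```

Checking an expression like this before triggering a failover helps avoid rolling back Workflow progress.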
diff --git a/docs/cloud/service-health.mdx b/docs/cloud/service-health.mdx index 7438b4b0ec..b12e6492d9 100644 --- a/docs/cloud/service-health.mdx +++ b/docs/cloud/service-health.mdx @@ -36,36 +36,35 @@ Note that Service API errors are not equivalent to guarantees mentioned in the [ ### Reference Metrics -- [temporal\_cloud\_v1\_frontend\_service\_error\_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_service_error_count) -- [temporal\_cloud\_v1\_frontend\_service\_request\_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_service_request_count) +- [temporal\_cloud\_v1\_service\_error\_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_service_error_count) +- [temporal\_cloud\_v1\_service\_request\_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_service_request_count) ### Prometheus Query for this Metric -Measure your daily average errors over 10-minute windows: +Measure your daily average success rate over a one-day window, sampled each minute. + +OpenMetrics v1 metrics are pre-computed rates. Use `sum()` to aggregate across dimensions rather than `increase()` or `rate()`. 
``` avg_over_time(( ( - ( - sum(increase(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m])) + sum(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}) - - sum(increase(temporal_cloud_v1_service_error_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m])) + sum(temporal_cloud_v1_service_error_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}) ) / - sum(increase(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m])) + sum(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}) ) or vector(1) - )[1d:10m]) + )[1d:1m]) ``` ## Detecting Activity and Workflow Failures -The metrics `temporal_activity_execution_failed` and `temporal_cloud_v1_workflow_failed_count` together provide failure detection for Temporal applications. These metrics work in tandem to give you both granular component-level visibility and high-level workflow health insights. - -Note that `temporal_activity_execution_failed` is an SDK metric that must be collected from the Worker. 
+The metrics `temporal_cloud_v1_activity_fail_count` and `temporal_cloud_v1_workflow_failed_count` together provide failure detection for Temporal applications. These metrics work in tandem to give you both granular component-level visibility and high-level workflow health insights. ### Activity failure cascade @@ -86,7 +85,7 @@ Generally Temporal recommends that Workflows should be designed to always succee Monitor the ratio of workflow failures to activity failures: ``` -workflow_failure_rate = temporal_cloud_v1_workflow_failed_count / temporal_activity_execution_failed +workflow_failure_rate = temporal_cloud_v1_workflow_failed_count / temporal_cloud_v1_activity_fail_count ``` What to watch for: @@ -97,7 +96,7 @@ What to watch for: #### Activity success rate ``` -activity_success_rate = (total_activities - temporal_activity_execution_failed) / total_activities +activity_success_rate = temporal_cloud_v1_activity_success_count / (temporal_cloud_v1_activity_success_count + temporal_cloud_v1_activity_fail_count) ``` Target: >95% for most applications. Lower success rate can be a sign of system troubles. @@ -138,13 +137,13 @@ See [operations and metrics](/cloud/high-availability) for Namespaces with High ## Detecting Resource Exhaustion The Cloud metric `temporal_cloud_v1_resource_exhausted_error_count` is the primary indicator for Cloud-side throttling, signaling system limits -are exceeded and `ResourceExhausted` gRPC errors are occurring. This generally does not break workflow processing due to how resources are prioritized. +are exceeded and `ResourceExhausted` gRPC errors are occurring. This generally does not break workflow processing due to how resources are prioritized. Persistent non-zero values of this metric are unexpected. ## Monitoring Trends Against Limits {#rps-aps-rate-limits} -The set of [limit metrics](/cloud/metrics/openmetrics/metrics-reference#limit-metrics) provide a time series of values for limits. 
Use these +The set of [limit metrics](/cloud/metrics/openmetrics/metrics-reference#limit-metrics) provide a time series of values for limits. Use these metrics with their corresponding count metrics to monitor general trends against limits and set alerts when limits are exceeded. Use the corresponding throttle metrics to determine the severity of any active rate limiting. | Limit Metric | Count Metric | Throttle Metric | @@ -156,8 +155,8 @@ to determine the severity of any active rate limiting. The [Grafana dashboard example](https://github.com/grafana/jsonnet-libs/blob/master/temporal-mixin/dashboards/temporal-overview.json) includes a Usage & Quotas section that creates demo charts for these limits and count metrics respectively. -The limit metrics, throttle metrics, and count metrics are already directly comparable as per second rates. Keep in mind that each `count` metric is represented as a per second rate averaged -over each minute. For example, to get the total count of Actions, you must multiply this metric by 60. +The limit metrics, throttle metrics, and count metrics are already directly comparable as per second rates. Keep in mind that each `count` metric is represented as a per second rate averaged +over each minute. For example, to get the total count of Actions, you must multiply this metric by 60. When setting alerts against limits, consider if your workload is spiky or sensitive to throttling (e.g. does latency matter?). If your workload is sensitive, consider alerting for `temporal_cloud_v1_total_action_count` at a 50% threshold of the `temporal_cloud_v1_action_limit`. If your workload is not sensitive, consider an alert at 90% of this threshold or directly when throttling is detected as a value greater than zero for `temporal_cloud_v1_total_action_throttled_count`. 
This logic can also be used to automatically scale [Temporal diff --git a/docs/cloud/worker-health.mdx b/docs/cloud/worker-health.mdx index c3085b5a15..edef8eccae 100644 --- a/docs/cloud/worker-health.mdx +++ b/docs/cloud/worker-health.mdx @@ -79,25 +79,31 @@ The following alerts build on the above to dive deeper into specific potential c - Alert at >\{predetermined_high_number\} +4. Create monitors for the [temporal_cloud_v1_approximate_backlog_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_approximate_backlog_count) Cloud metric. + This metric provides a server-side view of how many Tasks are waiting in a Task Queue and complements the SDK Schedule To Start latency metrics. + +- Alert when the value is growing over time for a given Task Queue + ## Detect Task Backlog {#detect-task-backlog} ### Symptoms of high Task backlog If the Task backlog is too high, you will find that tasks are waiting to find Workers to run on. This can cause a delay in -Workflow execution. Detecting a growing Task backlog is possible by watching the Schedule To Start latency and sync match rate. +Workflow execution. Detecting a growing Task backlog is possible by watching the Schedule To Start latency, sync match rate, and approximate backlog count. 
Metrics to monitor: - **SDK metric**: [workflow_task_schedule_to_start_latency](/references/sdk-metrics#workflow_task_schedule_to_start_latency) - **SDK metric**: [activity_schedule_to_start_latency](/references/sdk-metrics#activity_schedule_to_start_latency) -- **Temporal Cloud metric**: [temporal_cloud_v0_poll_success_count](/cloud/metrics/reference#temporal_cloud_v0_poll_success_count) -- **Temporal Cloud metric**: [temporal_cloud_v0_poll_success_sync_count](/cloud/metrics/reference#temporal_cloud_v0_poll_success_sync_count) +- **Temporal Cloud metric**: [temporal_cloud_v1_poll_success_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_success_count) +- **Temporal Cloud metric**: [temporal_cloud_v1_poll_success_sync_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_success_sync_count) +- **Temporal Cloud metric**: [temporal_cloud_v1_approximate_backlog_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_approximate_backlog_count) ### Schedule To Start latency The Schedule To Start metric represents how long Tasks are staying unprocessed in the Task Queues. It is the time between when a Task is enqueued and when it is started by a Worker. -This time being long (likely) means that your Workers can't keep up — either increase the number of Workers (if the host load is already high) or increase the number of pollers per Worker. +If this latency is high, your Workers likely can't keep up: either increase the number of Workers (if the host load is already high) or increase the number of pollers per Worker. If your Schedule To Start latency alert triggers or is high, check the [Sync Match Rate](#sync-match-rate) to decide if you need to adjust your Worker or fleet, or contact Temporal Cloud support. If your Sync Match Rate is low, contact [Temporal Cloud support](/cloud/support#support-ticket). 
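If you export SDK metrics through Prometheus, Schedule To Start latency can be watched as a percentile per Task Queue. A sketch, assuming the SDK publishes a histogram whose bucket series is named `temporal_workflow_task_schedule_to_start_latency_seconds_bucket` (exact series and label names vary by SDK and exporter configuration):

```
# Hypothetical: p95 Workflow Task Schedule To Start latency per Task Queue.
# The histogram name varies by SDK and exporter; adjust to your setup.
histogram_quantile(
  0.95,
  sum by (le, task_queue) (
    rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m])
  )
)
```

A sustained rise in this percentile, together with a falling sync match rate, is the signal to add Workers or pollers.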
@@ -151,7 +157,7 @@ An async match is when a Task cannot be matched to the Sticky Queue for a Worker **Calculate Sync Match Rate** ``` -temporal_cloud_v0_poll_success_sync_count ÷ temporal_cloud_v0_poll_success_count = N +temporal_cloud_v1_poll_success_sync_count / temporal_cloud_v1_poll_success_count = N ``` #### Prometheus query samples @@ -160,15 +166,11 @@ temporal_cloud_v0_poll_success_sync_count ÷ temporal_cloud_v0_poll_success_coun ``` sum by(temporal_namespace) ( - rate( - temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m] - ) + temporal_cloud_v1_poll_success_sync_count{temporal_namespace=~"$namespace"} ) / sum by(temporal_namespace) ( - rate( - temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m] - ) + temporal_cloud_v1_poll_success_count{temporal_namespace=~"$namespace"} ) ``` @@ -207,16 +209,24 @@ In this case it's also important to understand the fill and drain rates of the a Successful async polls ``` -temporal_cloud_v0_poll_success_count - temporal_cloud_v0_poll_success_sync_count = N +temporal_cloud_v1_poll_success_count - temporal_cloud_v1_poll_success_sync_count = N ``` ``` -sum(rate(temporal_cloud_v0_poll_success_count{temporal_namespace=~"$temporal_namespace"}[5m])) by (temporal_namespace, task_type) +sum by(temporal_namespace, task_type) ( + temporal_cloud_v1_poll_success_count{temporal_namespace=~"$namespace"} +) - -sum(rate(temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$temporal_namespace"}[5m])) by (temporal_namespace, task_type) +sum by(temporal_namespace, task_type) ( + temporal_cloud_v1_poll_success_sync_count{temporal_namespace=~"$namespace"} +) ``` -[//]: # (add `temporal_cloud_v1_approximate_backlog_count` once the v2 metrics has been GA'd) +You can also monitor the [approximate backlog count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_approximate_backlog_count) to observe Task Queue depth directly: + +``` 
+temporal_cloud_v1_approximate_backlog_count{temporal_namespace=~"$namespace", temporal_task_queue=~"$task_queue"} +``` **Actions** @@ -244,18 +254,18 @@ If you see the Poll Success Rate showing low numbers, you might have too many re Metrics to monitor: -- **Temporal Cloud metric**: [temporal_cloud_v0_poll_success_count](/cloud/metrics/reference#temporal_cloud_v0_poll_success_count) -- **Temporal Cloud metric**: [temporal_cloud_v0_poll_success_sync_count](/cloud/metrics/reference#temporal_cloud_v0_poll_success_sync_count) -- **Temporal Cloud metric**: [temporal_cloud_v0_poll_timeout_count](/cloud/metrics/reference#temporal_cloud_v0_poll_timeout_count) +- **Temporal Cloud metric**: [temporal_cloud_v1_poll_success_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_success_count) +- **Temporal Cloud metric**: [temporal_cloud_v1_poll_success_sync_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_success_sync_count) +- **Temporal Cloud metric**: [temporal_cloud_v1_poll_timeout_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_timeout_count) - **SDK metric**: [temporal_workflow_task_schedule_to_start_latency](/references/sdk-metrics#workflow_task_schedule_to_start_latency) - **SDK metric**: [temporal_activity_schedule_to_start_latency](/references/sdk-metrics#activity_schedule_to_start_latency) **Calculate Poll Success Rate** ``` -(temporal_cloud_v0_poll_success_count + temporal_cloud_v0_poll_success_sync_count) +(temporal_cloud_v1_poll_success_count) / -(temporal_cloud_v0_poll_success_count + temporal_cloud_v0_poll_success_sync_count + temporal_cloud_v0_poll_timeout_count) +(temporal_cloud_v1_poll_success_count + temporal_cloud_v1_poll_timeout_count) ``` **Target** @@ -281,50 +291,6 @@ Consider sizing down your Workers by either: - Reducing the concurrent pollers per Worker, OR - Both of the above -#### Prometheus query samples - -**poll_success_rate query** - -``` -( - ( - sum 
by(temporal_namespace) ( - rate( - temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m] - ) - ) - + - sum by(temporal_namespace) ( - rate( - temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m] - ) - ) - ) - / - ( - ( - sum by(temporal_namespace) ( - rate( - temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m] - ) - ) - + - sum by(temporal_namespace) ( - rate( - temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m] - ) - ) - ) - + - sum by(temporal_namespace) ( - rate( - temporal_cloud_v0_poll_timeout_count{temporal_namespace=~"$namespace"}[5m] - ) - ) - ) -) -``` - ## Detect misconfigured Workers {#detect-misconfigured-workers} **How to detect misconfigured Workers.** @@ -425,4 +391,3 @@ Use `TelemetryConfig()` to adjust heartbeat settings. See the [Python SDK docume Add configurations to `Runtime()` to adjust heartbeat settings. See the [Ruby SDK documentation](https://ruby.temporal.io/Temporalio/Runtime.html) for more details. -
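The v0 `poll_success_rate` Prometheus sample removed above can be rewritten against the v1 metrics. A sketch of the equivalent query, assuming the v1 counters carry the same `temporal_namespace` label; since the v1 series are pre-computed rates, `sum()` replaces the v0 `rate()` calls:

```
# Sketch: Poll Success Rate from v1 metrics (success / (success + timeout)).
# v1 counters are pre-computed rates, so sum() replaces rate().
sum by (temporal_namespace) (
  temporal_cloud_v1_poll_success_count{temporal_namespace=~"$namespace"}
)
/
(
  sum by (temporal_namespace) (
    temporal_cloud_v1_poll_success_count{temporal_namespace=~"$namespace"}
  )
  +
  sum by (temporal_namespace) (
    temporal_cloud_v1_poll_timeout_count{temporal_namespace=~"$namespace"}
  )
)
```

This mirrors the "Calculate Poll Success Rate" formula given earlier; values at or above 0.95 generally indicate a healthy polling setup.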