29 changes: 7 additions & 22 deletions docs/cloud/high-availability/monitoring.mdx
@@ -29,19 +29,19 @@ Temporal Cloud offers several ways for you to track the health and performance o
## Replication status

You can monitor your replica status with the Temporal Cloud UI. If the replica is unhealthy, Temporal Cloud disables the
Trigger a failover option to prevent failing over to an unhealthy replica. An unhealthy replica might be due to:
"Trigger a failover" option to prevent failing over to an unhealthy replica. An unhealthy replica might be due to:

- **Data synchronization issues:** The replica fails to remain in sync with the primary due to network or performance
problems.
- **Replication lag:** The replica falls behind the primary, causing it to be out of sync.
- **Network issues:** Loss of communication between the replica and the primary causes problems.
- **Failed health checks:** If the replica fails health checks, its marked as unhealthy.
- **Failed health checks:** If the replica fails health checks, it's marked as unhealthy.

These issues prevent the replica from being used during a failover, ensuring system stability and consistency.

## Replication lag metric
## Monitoring replication

Temporal Clouds High Availability features use asynchronous replication between the primary and the replica. Workflow
Temporal Cloud's High Availability features use asynchronous replication between the primary and the replica. Workflow
updates in the primary, along with associated History Events, are transmitted to the replica. Replication lag refers to
the transmission delay of Workflow updates and history events from the primary to the replica.

@@ -55,25 +55,10 @@ P95 means 95% of updates are processed faster than this limit.
A forced failover, when there is significant replication lag, increases the likelihood of rolling back Workflow
progress. Always check the replication lag metrics before initiating a failover.

Temporal Cloud emits three replication lag-specific
[metrics](/cloud/metrics/reference#replication-lag). The following samples demonstrate how you can
use these metrics to monitor and explore replication lag:
Temporal Cloud emits replication lag [metrics](/cloud/metrics/openmetrics/metrics-reference#replication-metrics)
as pre-computed percentiles (p50, p95, p99) that are labeled with `temporal_namespace`.

**P99 replication lag histogram**:

```
histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le))
```

**Average replication lag**:

```
sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace)
/
sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace)
```

When a Namespace is using a replica, you may notice that the Action count in `temporal_cloud_v0_total_action_count` is
When a Namespace is using a replica, you may notice that the Action count in `temporal_cloud_v1_total_action_count` is
2x what it was before adding a replica. This happens because Actions are replicated; they occur on both the primary and
the replica.
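
Because the v1 replication series are published as pre-computed percentiles, a per-Namespace panel needs no `histogram_quantile`. A minimal sketch, assuming the p99 series is exposed as `temporal_cloud_v1_replication_lag_p99` (check the metrics reference for the exact name):

```
max by(temporal_namespace) (
  temporal_cloud_v1_replication_lag_p99{temporal_namespace=~"$namespace"}
)
```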

33 changes: 16 additions & 17 deletions docs/cloud/service-health.mdx
@@ -36,36 +36,35 @@ Note that Service API errors are not equivalent to guarantees mentioned in the [

### Reference Metrics

- [temporal\_cloud\_v1\_frontend\_service\_error\_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_service_error_count)
- [temporal\_cloud\_v1\_frontend\_service\_request\_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_service_request_count)
- [temporal\_cloud\_v1\_service\_error\_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_service_error_count)
- [temporal\_cloud\_v1\_service\_request\_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_service_request_count)

### Prometheus Query for this Metric

Measure your daily average errors over 10-minute windows:
Measure your daily average success rate, sampled at one-minute resolution over a one-day window.

OpenMetrics v1 metrics are pre-computed rates. Use `sum()` to aggregate across dimensions rather than `increase()` or `rate()`.

```
avg_over_time((
(

(
sum(increase(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m]))
sum(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"})
-
sum(increase(temporal_cloud_v1_service_error_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m]))
sum(temporal_cloud_v1_service_error_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"})
)
/
sum(increase(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"}[10m]))
sum(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"})
)

or vector(1)

)[1d:10m])
)[1d:1m])
```

## Detecting Activity and Workflow Failures

The metrics `temporal_activity_execution_failed` and `temporal_cloud_v1_workflow_failed_count` together provide failure detection for Temporal applications. These metrics work in tandem to give you both granular component-level visibility and high-level workflow health insights.

Note that `temporal_activity_execution_failed` is an SDK metric that must be collected from the Worker.
The metrics `temporal_cloud_v1_activity_fail_count` and `temporal_cloud_v1_workflow_failed_count` together provide failure detection for Temporal applications. These metrics work in tandem to give you both granular component-level visibility and high-level workflow health insights.

### Activity failure cascade

@@ -86,7 +85,7 @@ Generally Temporal recommends that Workflows should be designed to always succee
Monitor the ratio of workflow failures to activity failures:

```
workflow_failure_rate = temporal_cloud_v1_workflow_failed_count / temporal_activity_execution_failed
workflow_failure_rate = temporal_cloud_v1_workflow_failed_count / temporal_cloud_v1_activity_fail_count
```
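
The same ratio as a per-Namespace PromQL sketch (assuming both v1 series carry the `temporal_namespace` label and, as pre-computed rates, need no `rate()`):

```
sum by(temporal_namespace) (
  temporal_cloud_v1_workflow_failed_count{temporal_namespace=~"$namespace"}
)
/
sum by(temporal_namespace) (
  temporal_cloud_v1_activity_fail_count{temporal_namespace=~"$namespace"}
)
```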

What to watch for:
@@ -97,7 +96,7 @@ What to watch for:
#### Activity success rate

```
activity_success_rate = (total_activities - temporal_activity_execution_failed) / total_activities
activity_success_rate = temporal_cloud_v1_activity_success_count / (temporal_cloud_v1_activity_success_count + temporal_cloud_v1_activity_fail_count)
```

Target: >95% for most applications. A lower success rate can be a sign of system trouble.
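
As a PromQL sketch per Namespace (assuming both v1 series carry the `temporal_namespace` label):

```
sum by(temporal_namespace) (
  temporal_cloud_v1_activity_success_count{temporal_namespace=~"$namespace"}
)
/
(
  sum by(temporal_namespace) (
    temporal_cloud_v1_activity_success_count{temporal_namespace=~"$namespace"}
  )
  +
  sum by(temporal_namespace) (
    temporal_cloud_v1_activity_fail_count{temporal_namespace=~"$namespace"}
  )
)
```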
@@ -138,13 +137,13 @@ See [operations and metrics](/cloud/high-availability) for Namespaces with High
## Detecting Resource Exhaustion

The Cloud metric `temporal_cloud_v1_resource_exhausted_error_count` is the primary indicator for Cloud-side throttling, signaling system limits
are exceeded and `ResourceExhausted` gRPC errors are occurring. This generally does not break workflow processing due to how resources are prioritized.

Persistent non-zero values of this metric are unexpected.
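
A minimal alert sketch for this condition, firing whenever the error count is non-zero for a Namespace (thresholds and evaluation windows are assumptions to tune):

```
sum by(temporal_namespace) (
  temporal_cloud_v1_resource_exhausted_error_count{temporal_namespace=~"$namespace"}
) > 0
```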

## Monitoring Trends Against Limits {#rps-aps-rate-limits}

The set of [limit metrics](/cloud/metrics/openmetrics/metrics-reference#limit-metrics) provides a time series of values for limits. Use these
metrics with their corresponding count metrics to monitor general trends against limits and set alerts when limits are exceeded. Use the corresponding throttle metrics
to determine the severity of any active rate limiting.
| Limit Metric | Count Metric | Throttle Metric |
@@ -156,8 +155,8 @@ to determine the severity of any active rate limiting.
The [Grafana dashboard example](https://github.com/grafana/jsonnet-libs/blob/master/temporal-mixin/dashboards/temporal-overview.json) includes a Usage & Quotas section
that creates demo charts for these limits and count metrics respectively.

The limit metrics, throttle metrics, and count metrics are already directly comparable as per second rates. Keep in mind that each `count` metric is represented as a per second rate averaged
over each minute. For example, to get the total count of Actions, you must multiply this metric by 60.
When setting alerts against limits, consider if your workload is spiky or sensitive to throttling (e.g. does latency matter?). If your workload is sensitive, consider alerting
for `temporal_cloud_v1_total_action_count` at a 50% threshold of the `temporal_cloud_v1_action_limit`. If your workload is not sensitive, consider an alert at 90% of this threshold
or directly when throttling is detected as a value greater than zero for `temporal_cloud_v1_total_action_throttled_count`. This logic can also be used to automatically scale [Temporal
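
As a sketch, the 50% alert described above could look like this in PromQL (assuming both series share the `temporal_namespace` label; swap the factor to 0.9 for workloads that are not latency-sensitive):

```
sum by(temporal_namespace) (temporal_cloud_v1_total_action_count)
>
0.5 * max by(temporal_namespace) (temporal_cloud_v1_action_limit)
```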
95 changes: 30 additions & 65 deletions docs/cloud/worker-health.mdx
@@ -79,25 +79,31 @@ The following alerts build on the above to dive deeper into specific potential c

- Alert at >\{predetermined_high_number\}

4. Create monitors for the [temporal_cloud_v1_approximate_backlog_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_approximate_backlog_count) Cloud metric.
This metric provides a server-side view of how many Tasks are waiting in a Task Queue and complements the SDK Schedule To Start latency metrics.

- Alert when the value is growing over time for a given Task Queue
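
One way to express "growing over time" is the derivative of the backlog gauge; a sketch (the 15-minute window and zero threshold are assumptions to tune per workload):

```
deriv(
  temporal_cloud_v1_approximate_backlog_count{temporal_task_queue=~"$task_queue"}[15m]
) > 0
```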

## Detect Task Backlog {#detect-task-backlog}

### Symptoms of high Task backlog

If the Task backlog is too high, you will find that tasks are waiting to find Workers to run on. This can cause a delay in
Workflow execution. Detecting a growing Task backlog is possible by watching the Schedule To Start latency and sync match rate.
Workflow execution. Detecting a growing Task backlog is possible by watching the Schedule To Start latency, sync match rate, and approximate backlog count.

Metrics to monitor:

- **SDK metric**: [workflow_task_schedule_to_start_latency](/references/sdk-metrics#workflow_task_schedule_to_start_latency)
- **SDK metric**: [activity_schedule_to_start_latency](/references/sdk-metrics#activity_schedule_to_start_latency)
- **Temporal Cloud metric**: [temporal_cloud_v0_poll_success_count](/cloud/metrics/reference#temporal_cloud_v0_poll_success_count)
- **Temporal Cloud metric**: [temporal_cloud_v0_poll_success_sync_count](/cloud/metrics/reference#temporal_cloud_v0_poll_success_sync_count)
- **Temporal Cloud metric**: [temporal_cloud_v1_poll_success_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_success_count)
- **Temporal Cloud metric**: [temporal_cloud_v1_poll_success_sync_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_success_sync_count)
- **Temporal Cloud metric**: [temporal_cloud_v1_approximate_backlog_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_approximate_backlog_count)

### Schedule To Start latency

The Schedule To Start metric represents how long Tasks are staying unprocessed in the Task Queues.
It is the time between when a Task is enqueued and when it is started by a Worker.
This time being long (likely) means that your Workers can't keep up either increase the number of Workers (if the host load is already high) or increase the number of pollers per Worker.
This time being long (likely) means that your Workers can't keep up - either increase the number of Workers (if the host load is already high) or increase the number of pollers per Worker.

If your Schedule To Start latency alert triggers or is high, check the [Sync Match Rate](#sync-match-rate) to decide if you need to adjust your Worker or fleet, or contact Temporal Cloud support.
If your Sync Match Rate is low, contact [Temporal Cloud support](/cloud/support#support-ticket).
@@ -151,7 +157,7 @@ An async match is when a Task cannot be matched to the Sticky Queue for a Worker
**Calculate Sync Match Rate**

```
temporal_cloud_v0_poll_success_sync_count ÷ temporal_cloud_v0_poll_success_count = N
temporal_cloud_v1_poll_success_sync_count / temporal_cloud_v1_poll_success_count = N
```

#### Prometheus query samples
@@ -160,15 +166,11 @@ temporal_cloud_v0_poll_success_sync_count ÷ temporal_cloud_v0_poll_success_coun

```
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m]
)
temporal_cloud_v1_poll_success_sync_count{temporal_namespace=~"$namespace"}
)
/
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m]
)
temporal_cloud_v1_poll_success_count{temporal_namespace=~"$namespace"}
)
```

@@ -207,16 +209,24 @@ In this case it's also important to understand the fill and drain rates of the a
Successful async polls

```
temporal_cloud_v0_poll_success_count - temporal_cloud_v0_poll_success_sync_count = N
temporal_cloud_v1_poll_success_count - temporal_cloud_v1_poll_success_sync_count = N
```

```
sum(rate(temporal_cloud_v0_poll_success_count{temporal_namespace=~"$temporal_namespace"}[5m])) by (temporal_namespace, task_type)
sum by(temporal_namespace, task_type) (
temporal_cloud_v1_poll_success_count{temporal_namespace=~"$namespace"}
)
-
sum(rate(temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$temporal_namespace"}[5m])) by (temporal_namespace, task_type)
sum by(temporal_namespace, task_type) (
temporal_cloud_v1_poll_success_sync_count{temporal_namespace=~"$namespace"}
)
```

[//]: # (add `temporal_cloud_v1_approximate_backlog_count` once the v2 metrics has been GA'd)
You can also monitor the [approximate backlog count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_approximate_backlog_count) to observe Task Queue depth directly:

```
temporal_cloud_v1_approximate_backlog_count{temporal_namespace=~"$namespace", temporal_task_queue=~"$task_queue"}
```

**Actions**

@@ -244,18 +254,18 @@ If you see the Poll Success Rate showing low numbers, you might have too many re

Metrics to monitor:

- **Temporal Cloud metric**: [temporal_cloud_v0_poll_success_count](/cloud/metrics/reference#temporal_cloud_v0_poll_success_count)
- **Temporal Cloud metric**: [temporal_cloud_v0_poll_success_sync_count](/cloud/metrics/reference#temporal_cloud_v0_poll_success_sync_count)
- **Temporal Cloud metric**: [temporal_cloud_v0_poll_timeout_count](/cloud/metrics/reference#temporal_cloud_v0_poll_timeout_count)
- **Temporal Cloud metric**: [temporal_cloud_v1_poll_success_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_success_count)
- **Temporal Cloud metric**: [temporal_cloud_v1_poll_success_sync_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_success_sync_count)
- **Temporal Cloud metric**: [temporal_cloud_v1_poll_timeout_count](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_poll_timeout_count)
- **SDK metric**: [temporal_workflow_task_schedule_to_start_latency](/references/sdk-metrics#workflow_task_schedule_to_start_latency)
- **SDK metric**: [temporal_activity_schedule_to_start_latency](/references/sdk-metrics#activity_schedule_to_start_latency)

**Calculate Poll Success Rate**

```
(temporal_cloud_v0_poll_success_count + temporal_cloud_v0_poll_success_sync_count)
(temporal_cloud_v1_poll_success_count)
/
(temporal_cloud_v0_poll_success_count + temporal_cloud_v0_poll_success_sync_count + temporal_cloud_v0_poll_timeout_count)
(temporal_cloud_v1_poll_success_count + temporal_cloud_v1_poll_timeout_count)
```
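
With the v1 metrics already exposed as per-second rates, a per-Namespace PromQL sketch of the Poll Success Rate needs no `rate()` (the `$namespace` variable follows the earlier samples):

```
sum by(temporal_namespace) (
  temporal_cloud_v1_poll_success_count{temporal_namespace=~"$namespace"}
)
/
(
  sum by(temporal_namespace) (
    temporal_cloud_v1_poll_success_count{temporal_namespace=~"$namespace"}
  )
  +
  sum by(temporal_namespace) (
    temporal_cloud_v1_poll_timeout_count{temporal_namespace=~"$namespace"}
  )
)
```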

**Target**
@@ -281,50 +291,6 @@ Consider sizing down your Workers by either:
- Reducing the concurrent pollers per Worker, OR
- Both of the above

#### Prometheus query samples

**poll_success_rate query**

```
(
(
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m]
)
)
+
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m]
)
)
)
/
(
(
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_count{temporal_namespace=~"$namespace"}[5m]
)
)
+
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_success_sync_count{temporal_namespace=~"$namespace"}[5m]
)
)
)
+
sum by(temporal_namespace) (
rate(
temporal_cloud_v0_poll_timeout_count{temporal_namespace=~"$namespace"}[5m]
)
)
)
)
```

## Detect misconfigured Workers {#detect-misconfigured-workers}

**How to detect misconfigured Workers.**
@@ -425,4 +391,3 @@ Use `TelemetryConfig()` to adjust heartbeat settings. See the [Python SDK docume
Add configurations to `Runtime()` to adjust heartbeat settings. See the [Ruby SDK documentation](https://ruby.temporal.io/Temporalio/Runtime.html) for more details.
</SdkTabs.Ruby>
</SdkTabs>