@@ -0,0 +1,27 @@
# apiVersion: runwhen.com/v1
# kind: GenerationRules
# spec:
# platform: azure
# generationRules:
# - resourceTypes:
# - azure_containerapps_container_apps
# matchRules:
# - type: pattern
# pattern: ".+"
# properties: [name]
# mode: substring
# - type: pattern
# pattern: "^app(?:,.*)?$"
# properties: [kind]
# mode: substring
# slxs:
# - baseName: az-containerapp-health
# qualifiers: ["resource", "resource_group"]
# baseTemplateName: azure-containerapps-health
# levelOfDetail: detailed
# outputItems:
# - type: slx
# - type: sli
# - type: runbook
# templateName: azure-containerapps-health-taskset.yaml
# - type: workflow
@@ -0,0 +1,65 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelIndicator
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
displayUnitsLong: OK
displayUnitsShort: ok
locations:
- {{default_location}}
description: Measures the health of Azure Container App {{match_resource.resource.name}} in resource group {{resource_group.name}}
codeBundle:
{% if repo_url %}
repoUrl: {{repo_url}}
{% else %}
repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
{% endif %}
{% if ref %}
ref: {{ref}}
{% else %}
ref: main
{% endif %}
pathToRobot: codebundles/azure-containerapps-health/sli.robot
intervalStrategy: intermezzo
intervalSeconds: 600
configProvided:
- name: AZ_RESOURCE_GROUP
value: {{resource_group.name}}
- name: CONTAINER_APP_NAME
value: {{match_resource.resource.name}}
- name: AZURE_RESOURCE_SUBSCRIPTION_ID
value: "{{ match_resource.subscription_id }}"
- name: SLI_TIME_PERIOD_MINUTES
value: "60"
- name: AZURE_SUBSCRIPTION_NAME
value: "{{ match_resource.subscription_name }}"
- name: AVAILABILITY_TARGET
value: "99.5"
- name: PERFORMANCE_TARGET_MS
value: "1000"
- name: ERROR_RATE_TARGET
value: "1.0"
secretsProvided:
{% if wb_version %}
{% include "azure-auth.yaml" ignore missing %}
{% else %}
- name: azure_credentials
workspaceKey: AUTH DETAILS NOT FOUND
{% endif %}
alerts:
warning:
operator: <
threshold: '1'
for: '20m'
ticket:
operator: <
threshold: '1'
for: '30m'
page:
operator: '=='
threshold: '0'
for: ''
@@ -0,0 +1,32 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelX
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/azure/container%20instances/10023-icon-service-Container-Instances.svg
alias: {{match_resource.resource.name}} Container App Health
asMeasuredBy: An aggregate score that is evaluated from configuration, metric, event or log issues.
configProvided:
- name: AZ_RESOURCE_GROUP
value: {{resource_group.name}}
- name: CONTAINER_APP_NAME
value: {{match_resource.resource.name}}
owners:
- {{workspace.owner_email}}
statement: Container App {{match_resource.resource.name}} in resource group {{resource_group.name}} should be available and healthy.
additionalContext:
{% include "azure-hierarchy.yaml" ignore missing %}
qualified_name: "{{ match_resource.qualified_name }}"
containerapp_kind: "{{match_resource.resource.kind}}"
tags:
{% include "azure-tags.yaml" ignore missing %}
- name: service
value: container_apps
- name: type
value: containerapp
- name: access
value: read-only
@@ -0,0 +1,53 @@
apiVersion: runwhen.com/v1
kind: Runbook
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
location: {{default_location}}
description: Generates a report for Azure Container App {{match_resource.resource.name}} in resource group {{resource_group.name}}
codeBundle:
{% if repo_url %}
repoUrl: {{repo_url}}
{% else %}
repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
{% endif %}
{% if ref %}
ref: {{ref}}
{% else %}
ref: main
{% endif %}
pathToRobot: codebundles/azure-containerapps-health/runbook.robot
configProvided:
- name: AZ_RESOURCE_GROUP
value: {{resource_group.name}}
- name: CONTAINER_APP_NAME
value: {{match_resource.resource.name}}
- name: AZURE_RESOURCE_SUBSCRIPTION_ID
value: "{{ match_resource.subscription_id }}"
- name: TIME_PERIOD_MINUTES
value: "10"
- name: AZURE_SUBSCRIPTION_NAME
value: "{{ match_resource.subscription_name }}"
- name: CPU_THRESHOLD
value: "80"
- name: MEMORY_THRESHOLD
value: "80"
- name: REPLICA_COUNT_MIN
value: "1"
- name: RESTART_COUNT_THRESHOLD
value: "5"
- name: REQUEST_COUNT_THRESHOLD
value: "1000"
- name: HTTP_ERROR_RATE_THRESHOLD
value: "5"
secretsProvided:
{% if wb_version %}
{% include "azure-auth.yaml" ignore missing %}
{% else %}
- name: azure_credentials
workspaceKey: AUTH DETAILS NOT FOUND
{% endif %}
@@ -0,0 +1,25 @@
apiVersion: runwhen.com/v1
kind: Workflow
metadata:
name: {{slx_name}}-{{ "Container App SLI Alert Workflow" | replace(" ", "-") | lower }}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
fromActivities:
- displayName: {{match_resource.resource.name}} Container App SLI Alert Workflow
description: Start RunSession with Eager Edgar when SLI is alerting for {{match_resource.resource.name}} Container App health
actions:
- tasks:
slx: {{slx_name.split('--')[1]}}
persona: eager-edgar
titles:
- '*'
sessionTTL: 20m
match:
activityVerbs:
- SLI_ALERTS_STARTED
slxs:
- {{slx_name.split('--')[1]}}
name: {{match_resource.resource.name}}-{{ "Container App SLI Alert Workflow" | replace(" ", "-") | lower }}
88 changes: 88 additions & 0 deletions codebundles/azure-containerapps-health/README.md
@@ -0,0 +1,88 @@
# Azure Container Apps Health Monitoring

Comprehensive health monitoring and triage for Azure Container Apps, checking application status, metrics, logs, configuration, and environment health to identify and resolve issues.

## Features

This codebundle monitors Azure Container Apps across the following areas:

- **Resource Health**: Azure platform-level health status monitoring
- **Replica Health**: Container replica status and scaling analysis
- **Performance Metrics**: CPU, memory, request volume, and error rate monitoring
- **Configuration Analysis**: Best practices validation and security assessment
- **Revision Management**: Deployment health and traffic distribution monitoring
- **Environment Health**: Infrastructure and networking status
- **Log Analysis**: Intelligent error detection and pattern analysis

## Configuration

The TaskSet requires initialization to import necessary secrets, services, and user variables. The following variables should be set:

### Required Variables
- `CONTAINER_APP_NAME`: The name of the Azure Container App to monitor
- `AZ_RESOURCE_GROUP`: The resource group containing the Container App
- `azure_credentials`: Secret containing Azure service principal credentials
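
A preflight check can catch a missing required variable before any task runs. The sketch below is illustrative and not part of the codebundle: `require_vars` is a hypothetical helper, and it covers only the two environment variables, since `azure_credentials` is supplied as a secret rather than exported.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of the codebundle): report every
# required variable that is unset or empty.
require_vars() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example values; both names come from the Required Variables list.
export CONTAINER_APP_NAME="my-app"
export AZ_RESOURCE_GROUP="my-rg"
require_vars CONTAINER_APP_NAME AZ_RESOURCE_GROUP && echo "required variables are set"
```

The helper returns a non-zero status when anything is missing, so it can gate a later step with `&&`.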

### Optional Variables
- `CONTAINER_APP_ENV_NAME`: Container Apps Environment name (auto-discovered if not provided)
- `AZURE_RESOURCE_SUBSCRIPTION_ID`: Azure subscription ID (uses current subscription if not provided)
- `TIME_PERIOD_MINUTES`: Time period for metrics and logs (default: 10 minutes)
- `CPU_THRESHOLD`: CPU utilization threshold percentage (default: 80%)
- `MEMORY_THRESHOLD`: Memory utilization threshold percentage (default: 80%)
- `REPLICA_COUNT_MIN`: Minimum expected replica count (default: 1)
- `RESTART_COUNT_THRESHOLD`: Restart count threshold for alerts (default: 5)
- `REQUEST_COUNT_THRESHOLD`: Request count per minute threshold (default: 1000)
- `HTTP_ERROR_RATE_THRESHOLD`: HTTP error rate percentage threshold (default: 5%)
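
The documented defaults can be reproduced in a wrapper script with bash parameter expansion. This is a minimal sketch under the assumption that an unset or empty variable should fall back to the default listed above; `apply_defaults` is a hypothetical name, and the codebundle's own default handling may differ.

```bash
#!/usr/bin/env bash
# Illustrative only: assign each documented default when the variable is
# unset or empty. Values that are already set are left untouched.
apply_defaults() {
  : "${TIME_PERIOD_MINUTES:=10}"
  : "${CPU_THRESHOLD:=80}"
  : "${MEMORY_THRESHOLD:=80}"
  : "${REPLICA_COUNT_MIN:=1}"
  : "${RESTART_COUNT_THRESHOLD:=5}"
  : "${REQUEST_COUNT_THRESHOLD:=1000}"
  : "${HTTP_ERROR_RATE_THRESHOLD:=5}"
}

CPU_THRESHOLD="90"   # an explicit override is kept
apply_defaults
echo "cpu=${CPU_THRESHOLD} memory=${MEMORY_THRESHOLD} window=${TIME_PERIOD_MINUTES}m"
```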

## Prerequisites

- Azure CLI installed and authenticated
- Service principal with appropriate permissions:
  - `Container Apps Reader` role for Container Apps resources
  - `Reader` role for metrics and monitoring
  - `Log Analytics Reader` role for log analysis (if using Log Analytics)
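
Role assignments like these are typically scoped to the resource group. The sketch below only prints the `az role assignment create` commands rather than running them. `rg_scope` and `print_role_assignments` are hypothetical helpers, and `SP_APP_ID`, the all-zero subscription ID, and `my-rg` are placeholders. It covers the `Reader` and `Log Analytics Reader` roles only; confirm the exact name of the Container Apps role against Azure's built-in role list before assigning it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the ARM scope ID for a resource group.
rg_scope() {
  printf '/subscriptions/%s/resourceGroups/%s\n' "$1" "$2"
}

# Print (do not run) one role assignment command per role.
print_role_assignments() {
  local assignee="$1" scope="$2" role
  for role in "Reader" "Log Analytics Reader"; do
    printf 'az role assignment create --assignee %s --role "%s" --scope %s\n' \
      "$assignee" "$role" "$scope"
  done
}

# SP_APP_ID and the subscription ID are placeholders, not real values.
print_role_assignments "SP_APP_ID" \
  "$(rg_scope "00000000-0000-0000-0000-000000000000" "my-rg")"
```

Printing the commands first makes it easy to review the scope before granting anything.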

## Usage Examples

### Basic Health Check
```bash
export CONTAINER_APP_NAME="my-app"
export AZ_RESOURCE_GROUP="my-rg"
# Run the health monitoring tasks
```

### Advanced Configuration
```bash
export CONTAINER_APP_NAME="production-app"
export AZ_RESOURCE_GROUP="prod-rg"
export CONTAINER_APP_ENV_NAME="prod-env"
export CPU_THRESHOLD="90"
export MEMORY_THRESHOLD="85"
export TIME_PERIOD_MINUTES="30"
```
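
To see what the overridden thresholds mean in practice, the sketch below compares sample readings against them. `exceeds_threshold` is a hypothetical helper, the readings are made-up numbers, and the strict "greater than" comparison is an assumption; the codebundle's actual comparison logic is not shown in this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when a reading is strictly above its
# threshold. awk handles fractional readings, which bash integer tests cannot.
exceeds_threshold() {
  awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value > limit) }'
}

CPU_THRESHOLD="90"
MEMORY_THRESHOLD="85"

# Sample readings (invented for illustration).
if exceeds_threshold 88 "$CPU_THRESHOLD"; then
  echo "CPU above threshold"
else
  echo "CPU within threshold"
fi

if exceeds_threshold 86.5 "$MEMORY_THRESHOLD"; then
  echo "memory above threshold"
else
  echo "memory within threshold"
fi
```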

## Tasks Included

1. **Resource Health Check**: Validates Azure platform health status
2. **Replica Health Analysis**: Monitors replica count and status
3. **Metrics Collection**: Gathers performance and utilization metrics
4. **Log Retrieval**: Collects recent application logs
5. **Configuration Review**: Analyzes Container App configuration
6. **Revision Health**: Monitors deployment and traffic distribution
7. **Environment Health**: Checks Container Apps Environment status
8. **Log Analysis**: Performs intelligent error pattern detection

## Notes

- This codebundle assumes the Azure service principal authentication flow
- Requires Container Apps and related Azure resources to be properly configured
- Log analysis works best when Log Analytics is configured for the Container Apps Environment
- Some metrics may not be available for newly deployed Container Apps
- Environment health checks will auto-discover the environment if not specified

## Troubleshooting

- Ensure Azure CLI is authenticated and has access to the subscription
- Verify service principal has necessary permissions for Container Apps resources
- Check that the Container App and resource group names are correct
- For log analysis issues, verify Log Analytics configuration in the Container Apps Environment
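
The first three checks can be scripted. This is a minimal sketch, not part of the codebundle: `check_access` is a hypothetical helper, the `AZ` override exists only so the function can be dry-run with a stub, and `az containerapp show` assumes the Azure CLI `containerapp` extension is installed.

```bash
#!/usr/bin/env bash
# Hypothetical triage helper. AZ can be overridden for dry runs; it defaults
# to the Azure CLI.
check_access() {
  local az="${AZ:-az}"
  # Fails when the CLI has no active login or subscription.
  if ! "$az" account show --output none; then
    echo "Azure CLI is not authenticated"
    return 1
  fi
  # Fails when the name or resource group is wrong, or permissions are missing.
  if ! "$az" containerapp show --name "$CONTAINER_APP_NAME" \
      --resource-group "$AZ_RESOURCE_GROUP" --output none; then
    echo "cannot read Container App $CONTAINER_APP_NAME in $AZ_RESOURCE_GROUP"
    return 2
  fi
  echo "access ok"
}

# Usage (requires a logged-in Azure CLI):
#   export CONTAINER_APP_NAME="my-app" AZ_RESOURCE_GROUP="my-rg"
#   check_access
```

The distinct return codes separate an authentication problem (1) from a naming or permission problem (2).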