Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
apiVersion: runwhen.com/v1
kind: GenerationRules
spec:
  generationRules:
    - resourceTypes:
        - deployment

Generation excludes non-deployment workloads

High Severity

k8s-applog-health only generates for deployment resources, while log tasks were removed from the statefulset and daemonset healthcheck runbooks. This leaves statefulset and daemonset workloads without generated applog SLIs/runbooks, so the consolidation does not actually apply to all supported workload types.

Contributor Author
@stewartshea @Rohit-Ekbote
To include the other two, i.e. statefulset and daemonset, would this be correct:

- deployment
- statefulset
- daemonset

Also, do we have any other codebundle with generation Rules applicable to more than 1 resource type?

Contributor

Yes, that looks correct (you can validate against the other generation rules for those resource types).
And yes, we have other examples, e.g. the Azure codebundles.

      matchRules:
        - type: pattern
          pattern: ".+"
          properties: [name]
          mode: substring
      slxs:
        - baseName: applog-health
          levelOfDetail: detailed
          qualifiers: ["resource", "namespace", "cluster"]
          baseTemplateName: k8s-applog-health
          outputItems:
            - type: slx
            - type: sli
            - type: runbook
              templateName: k8s-applog-health-taskset.yaml
@@ -0,0 +1,65 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelIndicator
metadata:
  name: {{slx_name}}
  labels:
    {% include "common-labels.yaml" %}
  annotations:
    {% include "common-annotations.yaml" %}
    runwhen.com/sli: "true"
spec:
  displayUnitsLong: OK
  displayUnitsShort: ok
  locations:
    - {{ default_location }}
  codeBundle:
    {% if repo_url %}
    repoUrl: {{repo_url}}
    {% else %}
    repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
    {% endif %}
    {% if ref %}
    ref: {{ref}}
    {% else %}
    ref: main
    {% endif %}
    pathToRobot: codebundles/k8s-applog-health/sli.robot
  intervalStrategy: intermezzo
  intervalSeconds: 600
  description: Measures the health of the application logs for the {{match_resource.resource.metadata.name}} {{match_resource.kind | lower}}.
  configProvided:
    - name: NAMESPACE
      value: {{match_resource.resource.metadata.namespace}}
    - name: CONTEXT
      value: {{context}}
    - name: KUBERNETES_DISTRIBUTION_BINARY
      value: {{custom.kubernetes_distribution_binary | default("kubectl")}}
    - name: WORKLOAD_NAME
      value: {{match_resource.resource.metadata.name}}
    - name: WORKLOAD_TYPE
      value: {{match_resource.kind | lower}}
    - name: CONTAINER_RESTART_AGE
      value: "10m"
    - name: CONTAINER_RESTART_THRESHOLD
      value: "2"
    - name: EVENT_AGE
      value: "10m"
    - name: EVENT_THRESHOLD
      value: "2"
    - name: CHECK_SERVICE_ENDPOINTS
      value: "true"
    - name: MAX_LOG_LINES
      value: "1000"
    - name: MAX_LOG_BYTES
      value: "2097152"
  secretsProvided:
  {% if wb_version %}
    {% include "kubernetes-auth.yaml" ignore missing %}
  {% else %}
    - name: kubeconfig
      workspaceKey: {{custom.kubeconfig_secret_name}}
  {% endif %}
  alertConfig:
    tasks:
      persona: eager-edgar
      sessionTTL: 10m
@@ -0,0 +1,25 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelX
metadata:
  name: {{slx_name}}
  labels:
    {% include "common-labels.yaml" %}
  annotations:
    {% include "common-annotations.yaml" %}
spec:
  imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/kubernetes/resources/labeled/deploy.svg
  alias: {{match_resource.resource.metadata.name}} {{match_resource.kind}} Application Log Health
  asMeasuredBy: The presence of application-level errors/issues/stacktraces in the application logs indicating runtime errors or exceptions in {{match_resource.resource.metadata.name}}.
  configProvided:
    - name: OBJECT_NAME
      value: {{match_resource.resource.metadata.name}}
  owners:
    - {{workspace.owner_email}}
  statement: Application logs for {{match_resource.resource.metadata.name}} {{match_resource.kind | lower}} should be free of critical errors/issues/stacktraces indicating runtime errors or exceptions.
  additionalContext:
    {% include "kubernetes-hierarchy.yaml" ignore missing %}
    qualified_name: "{{ match_resource.qualified_name }}"
  tags:
    {% include "kubernetes-tags.yaml" ignore missing %}
    - name: access
      value: read-only
@@ -0,0 +1,50 @@
apiVersion: runwhen.com/v1
kind: Runbook
metadata:
  name: {{slx_name}}
  labels:
    {% include "common-labels.yaml" %}
  annotations:
    {% include "common-annotations.yaml" %}
spec:
  location: {{default_location}}
  codeBundle:
    {% if repo_url %}
    repoUrl: {{repo_url}}
    {% else %}
    repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
    {% endif %}
    {% if ref %}
    ref: {{ref}}
    {% else %}
    ref: main
    {% endif %}
    pathToRobot: codebundles/k8s-applog-health/runbook.robot
  configProvided:
    - name: NAMESPACE
      value: {{match_resource.resource.metadata.namespace}}
    - name: CONTEXT
      value: {{context}}
    - name: KUBERNETES_DISTRIBUTION_BINARY
      value: {{custom.kubernetes_distribution_binary}}
    - name: WORKLOAD_NAME
      value: {{match_resource.resource.metadata.name}}
    - name: WORKLOAD_TYPE
      value: {{match_resource.kind | lower}}
    - name: CONTAINER_RESTART_AGE
      value: "30m"
    - name: CONTAINER_RESTART_THRESHOLD
      value: "4"
    - name: LOG_AGE
      value: "10m"
    - name: LOG_SIZE
      value: "2097152"
    - name: LOG_LINES
      value: "1000"
  secretsProvided:
  {% if wb_version %}
    {% include "kubernetes-auth.yaml" ignore missing %}
  {% else %}
    - name: kubeconfig
      workspaceKey: {{custom.kubeconfig_secret_name}}
  {% endif %}
56 changes: 56 additions & 0 deletions codebundles/k8s-applog-health/README.md
@@ -0,0 +1,56 @@
# Kubernetes Application Log Health

This codebundle provides tasks for triaging application log health of Kubernetes workloads (deployments, statefulsets, or daemonsets). It fetches pod logs, scans for error patterns, and reports issues with severity and next steps.

## Tasks

**Runbook**
- `Analyze Application Log Patterns for ${WORKLOAD_TYPE} ${WORKLOAD_NAME} in Namespace ${NAMESPACE}` — Fetches workload logs, scans for configurable error/exception patterns, creates issues for matches above the severity threshold, and reports a log health score and summary.
- `Fetch Workload Logs for ${WORKLOAD_TYPE} ${WORKLOAD_NAME} in Namespace ${NAMESPACE}` — Fetches and attaches workload logs to the report for manual review (no issue creation).

**SLI**
- `Get Critical Log Errors and Score for ${WORKLOAD_TYPE} ${WORKLOAD_NAME}` — Fetches logs and scores health based on critical error patterns (e.g. GenericError, AppFailure) and container restarts; pushes a metric for SLI scoring.
- `Generate Application Health Score for ${WORKLOAD_TYPE} ${WORKLOAD_NAME}` — Computes the final applog health score and report details (e.g. scaled-to-zero vs healthy vs issues).

### Log pattern categories

Analysis uses pattern categories (configurable via `runbook_patterns.json` or `sli_critical_patterns.json`). Examples:

- **GenericError** — exception, fatal, panic, crash, failed, failure (severity 1)
- **AppFailure** — application failed, service unavailable, connection refused, timeout, OOM, disk full, auth failures (severity 1)
- **StackTrace** — stack trace, exception in thread, java.lang., traceback, panic (severity 1)
- **Connection** — connection reset/timeout, network unreachable, socket error, DNS resolution failed (severity 2)
- **Timeout** — request/operation timeout, deadline exceeded, read/write timeout (severity 2)
- **Auth** — unauthorized, authentication error, invalid credentials, forbidden, token expired (severity 2)
- **Exceptions** — NullPointerException, IllegalArgumentException, SQLException, IOException, etc. (severity 2)
- **Resource** — resource exhausted, memory leak, CPU throttled, quota/rate limit exceeded (severity 2)
- **HealthyRecovery** — recovered from error, connection restored, retry successful (severity 4, informational)

Exclude patterns (e.g. INFO/DEBUG/TRACE, health checks, heartbeats) reduce false positives.

## Configuration

The TaskSet/SLI requires initialization with secrets and user variables. Key variables:

- `kubeconfig` — Secret containing cluster access (kubeconfig YAML).
- `KUBERNETES_DISTRIBUTION_BINARY` — CLI binary for Kubernetes (`kubectl` or `oc`). Default: `kubectl`.
- `CONTEXT` — Kubernetes context to use.
- `NAMESPACE` — Namespace of the workload. Leave blank to search all namespaces.
- `WORKLOAD_NAME` — Name of the deployment, statefulset, or daemonset to analyze.
- `WORKLOAD_TYPE` — Type of workload: `deployment`, `statefulset`, or `daemonset`. Default: `deployment`.
- `LOG_AGE` — Age of logs to fetch (e.g. `10m`). Default: `10m`.
- `LOG_LINES` / `LOG_SIZE` — Max lines or bytes per container for runbook log fetch. Defaults: 1000 lines, 2MB.
- `LOG_SEVERITY_THRESHOLD` — Minimum severity to create issues (1=critical … 5=info). Default: 3.
- `LOG_PATTERN_CATEGORIES` — Comma-separated categories to scan (e.g. `GenericError,AppFailure,Connection`). Default includes GenericError, AppFailure, Connection, Timeout, Auth, Exceptions, Resource, HealthyRecovery.
- `LOGS_EXCLUDE_PATTERN` — Regex to exclude lines from analysis (e.g. INFO/DEBUG, health checks).
- `EXCLUDED_CONTAINER_NAMES` — Comma-separated container names to skip (e.g. `linkerd-proxy,istio-proxy`). Default: `linkerd-proxy,istio-proxy,vault-agent`.
- `CONTAINER_RESTART_AGE` / `CONTAINER_RESTART_THRESHOLD` — Time window and threshold for container restarts (SLI). Defaults: e.g. `10m`, `1`.
- `LOG_SCAN_TIMEOUT` — Timeout in seconds for log scanning. Default: 300.
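
As a rough illustration of how these variables might be supplied, the fragment below follows the `configProvided` name/value shape used by the templates in this codebundle. The specific values are hypothetical, and it assumes that a lower `LOG_SEVERITY_THRESHOLD` restricts issue creation to more severe findings, which the README implies but does not state outright.

```yaml
# Sketch only: values are illustrative, not recommended defaults.
configProvided:
  - name: WORKLOAD_TYPE
    value: statefulset            # one of: deployment, statefulset, daemonset
  - name: LOG_AGE
    value: "10m"
  - name: LOG_SEVERITY_THRESHOLD
    value: "2"                    # assumption: limits issues to severity 1 and 2
  - name: LOG_PATTERN_CATEGORIES
    value: "GenericError,AppFailure,StackTrace"
  - name: EXCLUDED_CONTAINER_NAMES
    value: "linkerd-proxy,istio-proxy"
```

Values are quoted as strings, matching how the generated SLI and runbook templates pass numeric settings.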

## Requirements

- A kubeconfig with RBAC permissions to list pods and read logs for the target workload and namespace.
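
A namespaced Role along these lines would likely cover the permissions described above. The Role name and namespace are placeholders, and the rule granting read access to the workload objects is an assumption (the codebundle may need it to resolve a workload to its pods), not something this README confirms.

```yaml
# Hypothetical Role: name and namespace are placeholders, not part of the codebundle.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: applog-health-reader
  namespace: my-namespace
rules:
  # List pods belonging to the target workload.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  # Read container logs.
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
  # Assumption: read access to the workload object itself.
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets"]
    verbs: ["get", "list"]
```

The Role still has to be bound to the identity in the kubeconfig with a RoleBinding. If `NAMESPACE` is left blank to search all namespaces, a ClusterRole with the same rules would be needed instead.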

## TODO

- [ ] Add additional documentation.