Application Log Health: Shift all Logs-related tasks from deployment + statefulset + daemonset healthcheck codebundles #606
Open
akshayrw25 wants to merge 10 commits into main from RWENGG-1350
97ffeb3  akshayrw25  RWENGG-1350: initial writeup of the applog health codebundle
abf88d1  akshayrw25  Improve log analysis task naming and report formatting
8f8d391  akshayrw25  Update display name from Kubernetes Deployment Triage to Kubernetes A…
532ff8e  akshayrw25  Add log size limits and clean up unused SLI variables
2591933  akshayrw25  shifted "Analyze applog " task from healthcheck to applog-health CB; …
24f4d74  akshayrw25  \k8s-applog-health: generalize to workload type, drop stacktrace SLI,…
4573ff4  akshayrw25  - shift the "Fetch Deployment Logs" task to applog codebundle
cc77008  akshayrw25  added README for k8s-applog-health(the new application log codebundle…
284c11e  akshayrw25  add next_action kwarg to distinguish applog issues in platform
34619c1  akshayrw25  - added the stacktrace task and issue creation to applog-health
21 changes: 21 additions & 0 deletions
codebundles/k8s-applog-health/.runwhen/generation-rules/k8s-applog-health.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| apiVersion: runwhen.com/v1 | ||
| kind: GenerationRules | ||
| spec: | ||
| generationRules: | ||
| - resourceTypes: | ||
| - deployment | ||
| matchRules: | ||
| - type: pattern | ||
| pattern: ".+" | ||
| properties: [name] | ||
| mode: substring | ||
| slxs: | ||
| - baseName: applog-health | ||
| levelOfDetail: detailed | ||
| qualifiers: ["resource", "namespace", "cluster"] | ||
| baseTemplateName: k8s-applog-health | ||
| outputItems: | ||
| - type: slx | ||
| - type: sli | ||
| - type: runbook | ||
| templateName: k8s-applog-health-taskset.yaml | ||
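The match rule above pairs the pattern `.+` with `mode: substring` on the `name` property, which in effect selects every deployment that has a non-empty name. A minimal Python sketch of that behaviour follows. It assumes that `substring` means an unanchored regex search and that an exact mode would require a full match; the `matches` helper and that reading of the modes are illustrative assumptions, not the generator's actual code.

```python
import re


def matches(pattern: str, value: str, mode: str = "substring") -> bool:
    """Hypothetical matcher: 'substring' searches anywhere, 'exact' must match the whole value."""
    if mode == "substring":
        return re.search(pattern, value) is not None
    if mode == "exact":
        return re.fullmatch(pattern, value) is not None
    raise ValueError(f"unknown mode: {mode}")


# Illustrative resource names; the empty string stands in for a missing name.
names = ["checkout", "cart-api", ""]
selected = [n for n in names if matches(".+", n)]
print(selected)
```

Under these assumptions a narrower pattern such as `cart` would select only `cart-api` in substring mode and nothing in exact mode.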
65 changes: 65 additions & 0 deletions
codebundles/k8s-applog-health/.runwhen/templates/k8s-applog-health-sli.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| apiVersion: runwhen.com/v1 | ||
| kind: ServiceLevelIndicator | ||
| metadata: | ||
| name: {{slx_name}} | ||
| labels: | ||
| {% include "common-labels.yaml" %} | ||
| annotations: | ||
| {% include "common-annotations.yaml" %} | ||
| runwhen.com/sli: "true" | ||
| spec: | ||
| displayUnitsLong: OK | ||
| displayUnitsShort: ok | ||
| locations: | ||
| - {{ default_location }} | ||
| codeBundle: | ||
| {% if repo_url %} | ||
| repoUrl: {{repo_url}} | ||
| {% else %} | ||
| repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git | ||
| {% endif %} | ||
| {% if ref %} | ||
| ref: {{ref}} | ||
| {% else %} | ||
| ref: main | ||
| {% endif %} | ||
| pathToRobot: codebundles/k8s-applog-health/sli.robot | ||
| intervalStrategy: intermezzo | ||
| intervalSeconds: 600 | ||
| description: Measures the health of the application logs for the {{match_resource.resource.metadata.name}} {{match_resource.kind | lower}}. | ||
| configProvided: | ||
| - name: NAMESPACE | ||
| value: {{match_resource.resource.metadata.namespace}} | ||
| - name: CONTEXT | ||
| value: {{context}} | ||
| - name: KUBERNETES_DISTRIBUTION_BINARY | ||
| value: {{custom.kubernetes_distribution_binary | default("kubectl")}} | ||
| - name: WORKLOAD_NAME | ||
| value: {{match_resource.resource.metadata.name}} | ||
| - name: WORKLOAD_TYPE | ||
| value: {{match_resource.kind | lower}} | ||
| - name: CONTAINER_RESTART_AGE | ||
| value: "10m" | ||
| - name: CONTAINER_RESTART_THRESHOLD | ||
| value: "2" | ||
| - name: EVENT_AGE | ||
| value: "10m" | ||
| - name: EVENT_THRESHOLD | ||
| value: "2" | ||
| - name: CHECK_SERVICE_ENDPOINTS | ||
| value: "true" | ||
| - name: MAX_LOG_LINES | ||
| value: "1000" | ||
| - name: MAX_LOG_BYTES | ||
| value: "2097152" | ||
| secretsProvided: | ||
| {% if wb_version %} | ||
| {% include "kubernetes-auth.yaml" ignore missing %} | ||
| {% else %} | ||
| - name: kubeconfig | ||
| workspaceKey: {{custom.kubeconfig_secret_name}} | ||
| {% endif %} | ||
| alertConfig: | ||
| tasks: | ||
| persona: eager-edgar | ||
| sessionTTL: 10m | ||
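The SLI template caps log retrieval with `MAX_LOG_LINES` (1000) and `MAX_LOG_BYTES` (2097152, i.e. 2 MiB). As a rough illustration of what applying both caps could look like, the sketch below keeps the newest lines of a log until either limit is reached. The `cap_log` function is hypothetical, and the codebundle may enforce these limits differently, for example when it invokes the CLI.

```python
def cap_log(text: str, max_lines: int = 1000, max_bytes: int = 2_097_152) -> str:
    """Keep the newest lines: at most max_lines of them and at most max_bytes when UTF-8 encoded."""
    if max_lines <= 0 or max_bytes <= 0:
        return ""
    lines = text.splitlines()[-max_lines:]
    kept = []
    size = 0
    for line in reversed(lines):  # walk from newest to oldest
        cost = len(line.encode("utf-8")) + 1  # +1 accounts for the newline
        if size + cost > max_bytes:
            break
        kept.append(line)
        size += cost
    return "\n".join(reversed(kept))


sample = "\n".join(f"line {i}" for i in range(5))
print(cap_log(sample, max_lines=2))
```

Trimming from the tail reflects the assumption that the most recent log lines matter most for a health check.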
25 changes: 25 additions & 0 deletions
codebundles/k8s-applog-health/.runwhen/templates/k8s-applog-health-slx.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| apiVersion: runwhen.com/v1 | ||
| kind: ServiceLevelX | ||
| metadata: | ||
| name: {{slx_name}} | ||
| labels: | ||
| {% include "common-labels.yaml" %} | ||
| annotations: | ||
| {% include "common-annotations.yaml" %} | ||
| spec: | ||
| imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/kubernetes/resources/labeled/deploy.svg | ||
| alias: {{match_resource.resource.metadata.name}} {{match_resource.kind}} Application Log Health | ||
| asMeasuredBy: The presence of application-level errors/issues/stacktraces in the application logs indicating runtime errors or exceptions in {{match_resource.resource.metadata.name}}. | ||
| configProvided: | ||
| - name: OBJECT_NAME | ||
| value: {{match_resource.resource.metadata.name}} | ||
| owners: | ||
| - {{workspace.owner_email}} | ||
| statement: Application logs for {{match_resource.resource.metadata.name}} {{match_resource.kind | lower}} should be free of critical errors/issues/stacktraces indicating runtime errors or exceptions. | ||
| additionalContext: | ||
| {% include "kubernetes-hierarchy.yaml" ignore missing %} | ||
| qualified_name: "{{ match_resource.qualified_name }}" | ||
| tags: | ||
| {% include "kubernetes-tags.yaml" ignore missing %} | ||
| - name: access | ||
| value: read-only |
50 changes: 50 additions & 0 deletions
codebundles/k8s-applog-health/.runwhen/templates/k8s-applog-health-taskset.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| apiVersion: runwhen.com/v1 | ||
| kind: Runbook | ||
| metadata: | ||
| name: {{slx_name}} | ||
| labels: | ||
| {% include "common-labels.yaml" %} | ||
| annotations: | ||
| {% include "common-annotations.yaml" %} | ||
| spec: | ||
| location: {{default_location}} | ||
| codeBundle: | ||
| {% if repo_url %} | ||
| repoUrl: {{repo_url}} | ||
| {% else %} | ||
| repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git | ||
| {% endif %} | ||
| {% if ref %} | ||
| ref: {{ref}} | ||
| {% else %} | ||
| ref: main | ||
| {% endif %} | ||
| pathToRobot: codebundles/k8s-applog-health/runbook.robot | ||
| configProvided: | ||
| - name: NAMESPACE | ||
| value: {{match_resource.resource.metadata.namespace}} | ||
| - name: CONTEXT | ||
| value: {{context}} | ||
| - name: KUBERNETES_DISTRIBUTION_BINARY | ||
| value: {{custom.kubernetes_distribution_binary}} | ||
| - name: WORKLOAD_NAME | ||
| value: {{match_resource.resource.metadata.name}} | ||
| - name: WORKLOAD_TYPE | ||
| value: {{match_resource.kind | lower}} | ||
| - name: CONTAINER_RESTART_AGE | ||
| value: "30m" | ||
| - name: CONTAINER_RESTART_THRESHOLD | ||
| value: "4" | ||
| - name: LOG_AGE | ||
| value: "10m" | ||
| - name: LOG_SIZE | ||
| value: "2097152" | ||
| - name: LOG_LINES | ||
| value: "1000" | ||
| secretsProvided: | ||
| {% if wb_version %} | ||
| {% include "kubernetes-auth.yaml" ignore missing %} | ||
| {% else %} | ||
| - name: kubeconfig | ||
| workspaceKey: {{custom.kubeconfig_secret_name}} | ||
| {% endif %} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| # Kubernetes Application Log Health | ||
| | ||
| This codebundle provides tasks for triaging application log health of Kubernetes workloads (deployments, statefulsets, or daemonsets). It fetches pod logs, scans for error patterns, and reports issues with severity and next steps. | ||
| | ||
| ## Tasks | ||
| | ||
| **Runbook** | ||
| - `Analyze Application Log Patterns for ${WORKLOAD_TYPE} ${WORKLOAD_NAME} in Namespace ${NAMESPACE}` — Fetches workload logs, scans for configurable error/exception patterns, creates issues for matches above the severity threshold, and reports a log health score and summary. | ||
| - `Fetch Workload Logs for ${WORKLOAD_TYPE} ${WORKLOAD_NAME} in Namespace ${NAMESPACE}` — Fetches and attaches workload logs to the report for manual review (no issue creation). | ||
| | ||
| **SLI** | ||
| - `Get Critical Log Errors and Score for ${WORKLOAD_TYPE} ${WORKLOAD_NAME}` — Fetches logs and scores health based on critical error patterns (e.g. GenericError, AppFailure) and container restarts; pushes a metric for SLI scoring. | ||
| - `Generate Application Health Score for ${WORKLOAD_TYPE} ${WORKLOAD_NAME}` — Computes the final applog health score and report details (e.g. scaled-to-zero vs healthy vs issues). | ||
| | ||
| ### Log pattern categories | ||
| | ||
| Analysis uses pattern categories (configurable via `runbook_patterns.json` or `sli_critical_patterns.json`). Examples: | ||
| | ||
| - **GenericError** — exception, fatal, panic, crash, failed, failure (severity 1) | ||
| - **AppFailure** — application failed, service unavailable, connection refused, timeout, OOM, disk full, auth failures (severity 1) | ||
| - **StackTrace** — stack trace, exception in thread, java.lang., traceback, panic (severity 1) | ||
| - **Connection** — connection reset/timeout, network unreachable, socket error, DNS resolution failed (severity 2) | ||
| - **Timeout** — request/operation timeout, deadline exceeded, read/write timeout (severity 2) | ||
| - **Auth** — unauthorized, authentication error, invalid credentials, forbidden, token expired (severity 2) | ||
| - **Exceptions** — NullPointerException, IllegalArgumentException, SQLException, IOException, etc. (severity 2) | ||
| - **Resource** — resource exhausted, memory leak, CPU throttled, quota/rate limit exceeded (severity 2) | ||
| - **HealthyRecovery** — recovered from error, connection restored, retry successful (severity 4, informational) | ||
| | ||
| Exclude patterns (e.g. INFO/DEBUG/TRACE, health checks, heartbeats) reduce false positives. | ||
| | ||
| ## Configuration | ||
| | ||
| The TaskSet/SLI requires initialization with secrets and user variables. Key variables: | ||
| | ||
| - `kubeconfig` — Secret containing cluster access (kubeconfig YAML). | ||
| - `KUBERNETES_DISTRIBUTION_BINARY` — CLI binary for Kubernetes (`kubectl` or `oc`). Default: `kubectl`. | ||
| - `CONTEXT` — Kubernetes context to use. | ||
| - `NAMESPACE` — Namespace of the workload. Leave blank to search all namespaces. | ||
| - `WORKLOAD_NAME` — Name of the deployment, statefulset, or daemonset to analyze. | ||
| - `WORKLOAD_TYPE` — Type of workload: `deployment`, `statefulset`, or `daemonset`. Default: `deployment`. | ||
| - `LOG_AGE` — Age of logs to fetch (e.g. `10m`). Default: `10m`. | ||
| - `LOG_LINES` / `LOG_SIZE` — Max lines or bytes per container for runbook log fetch. Defaults: 1000 lines, 2MB. | ||
| - `LOG_SEVERITY_THRESHOLD` — Minimum severity to create issues (1=critical … 5=info). Default: 3. | ||
| - `LOG_PATTERN_CATEGORIES` — Comma-separated categories to scan (e.g. `GenericError,AppFailure,Connection`). Default includes GenericError, AppFailure, Connection, Timeout, Auth, Exceptions, Resource, HealthyRecovery. | ||
| - `LOGS_EXCLUDE_PATTERN` — Regex to exclude lines from analysis (e.g. INFO/DEBUG, health checks). | ||
| - `EXCLUDED_CONTAINER_NAMES` — Comma-separated container names to skip (e.g. `linkerd-proxy,istio-proxy`). Default: `linkerd-proxy,istio-proxy,vault-agent`. | ||
| - `CONTAINER_RESTART_AGE` / `CONTAINER_RESTART_THRESHOLD` — Time window and threshold for container restarts (SLI). Defaults: e.g. `10m`, `1`. | ||
| - `LOG_SCAN_TIMEOUT` — Timeout in seconds for log scanning. Default: 300. | ||
| | ||
| ## Requirements | ||
| | ||
| - A kubeconfig with RBAC permissions to list pods and read logs for the target workload and namespace. | ||
| | ||
| ## TODO | ||
| | ||
| - [ ] Add additional documentation. |
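The README describes scanning logs against pattern categories, skipping lines that match an exclude pattern, and creating issues only above a severity threshold. A small Python sketch of that flow is given below, under stated assumptions: the category regexes are abbreviated from the README's bullet list, the dictionary layout is not the schema of `runbook_patterns.json`, the exclude regex is made up for illustration, and the threshold is read as "report severities numerically at or below `LOG_SEVERITY_THRESHOLD`" since 1 is the most critical level.

```python
import re

# Abbreviated from the README's category list; NOT the real runbook_patterns.json schema.
CATEGORIES = {
    "GenericError": (1, r"exception|fatal|panic|crash|failed|failure"),
    "Connection": (2, r"connection reset|network unreachable|socket error"),
    "HealthyRecovery": (4, r"recovered from error|connection restored|retry successful"),
}

# Illustrative exclude pattern: log levels and health-check noise.
EXCLUDE = re.compile(r"\b(INFO|DEBUG|TRACE)\b|health ?check|heartbeat", re.IGNORECASE)


def scan(lines, severity_threshold=3):
    """Return (category, severity, line) for matches at or above the threshold (1 = most severe)."""
    findings = []
    for line in lines:
        if EXCLUDE.search(line):
            continue  # excluded lines never produce findings
        for name, (severity, pattern) in CATEGORIES.items():
            if severity <= severity_threshold and re.search(pattern, line, re.IGNORECASE):
                findings.append((name, severity, line))
    return findings


logs = [
    "INFO healthcheck ok",
    "ERROR fatal: database connection reset by peer",
    "WARN retry successful after 3 attempts",
]
for finding in scan(logs):
    print(finding)
```

With the default threshold of 3, the second line is reported under both GenericError and Connection, while the recovery message is only reported once the threshold is raised to 4 or higher.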
Generation excludes non-deployment workloads
High Severity
`k8s-applog-health` only generates for `deployment` resources, while log tasks were removed from the statefulset and daemonset healthcheck runbooks. This leaves `statefulset` and `daemonset` workloads without generated applog SLIs/runbooks, so the consolidation does not actually apply to all supported workload types.
Additional Locations (2)
- codebundles/k8s-statefulset-healthcheck/runbook.robot#L21-L22
- codebundles/k8s-daemonset-healthcheck/runbook.robot#L21-L22
@stewartshea @Rohit-Ekbote
To include the other two, i.e. statefulset and daemonset, will this be correct:
Also, do we have any other codebundle with generation rules applicable to more than one resource type?
yes, that looks correct (you can validate against the other generation rules for those resource types)
and yes, we have other examples with azure codebundles etc.
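The coverage gap raised in this thread can be checked mechanically. The sketch below is a hypothetical helper, not part of the codebundle: it represents generation rules as plain dictionaries (only the `resourceTypes` field from the YAML in this PR) and lists the supported workload types that no rule covers. The `extended` rule is an assumed shape for the proposed fix, since the exact snippet is not shown in the thread, and the type names are taken from the README's `WORKLOAD_TYPE` values.

```python
SUPPORTED_WORKLOADS = {"deployment", "statefulset", "daemonset"}


def uncovered_types(rules, supported=SUPPORTED_WORKLOADS):
    """Return supported workload types that no generation rule lists under resourceTypes."""
    covered = {t.lower() for rule in rules for t in rule.get("resourceTypes", [])}
    return sorted(supported - covered)


# Dict form of the rule in this PR (only the field relevant here).
current = [{"resourceTypes": ["deployment"]}]

# Hypothetical extension along the lines discussed in the thread.
extended = [{"resourceTypes": ["deployment", "statefulset", "daemonset"]}]

print(uncovered_types(current))
print(uncovered_types(extended))
```

A check like this could run against the parsed generation rules before merging, although validating against the existing rules for those resource types, as suggested above, remains the authoritative test.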