IncidentPilot — Production Incident Autopilot Agent on Qwen Cloud

IncidentPilot is a production-style incident response autopilot agent. It receives alerts, creates incidents, investigates logs/metrics/deployments/runbooks, uses Qwen Cloud for reasoning, proposes safe remediations, requests human approval, executes approved actions, verifies recovery, and generates postmortems.

Stack

Backend: FastAPI + SQLAlchemy + PostgreSQL
Frontend: Vite + React + TypeScript
Agent: Supervisor workflow with Qwen Cloud-compatible client and deterministic tools
DB: PostgreSQL schema + seed data
Deployment: Docker Compose, Alibaba Cloud ECS helper files

Quick Start

cp .env.example .env
cp backend/.env.example backend/.env
cp frontend/.env.example frontend/.env

docker compose up --build

Open:

Demo Flow

Open the dashboard.
Click Trigger Demo Alert.
Open the created incident.
Watch IncidentPilot investigate.
Approve rollback.
See remediation, verification, and postmortem.

Qwen Cloud

Set these in backend/.env:

QWEN_API_KEY=your_key
QWEN_BASE_URL=https://dashscope-intl.aliyuncs.com/compatible-mode/v1
QWEN_MODEL=qwen-plus

If no key is configured, IncidentPilot uses a deterministic local demo fallback so the product remains runnable.
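The fallback decision could be made once at startup, when settings are loaded. The following is a minimal sketch, not the project's actual code: `QwenSettings` and `load_qwen_settings` are hypothetical names, and using the example values above as defaults is an assumption. Only the three environment variable names come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class QwenSettings:
    """Hypothetical settings holder; not part of the repository."""
    api_key: Optional[str]
    base_url: str
    model: str

    @property
    def use_fallback(self) -> bool:
        # No key configured: take the deterministic local demo path.
        return not self.api_key


def load_qwen_settings(env: Mapping[str, str] = os.environ) -> QwenSettings:
    key = env.get("QWEN_API_KEY", "").strip()
    # Assumption: the placeholder from the example counts as "not configured".
    if key == "your_key":
        key = ""
    return QwenSettings(
        api_key=key or None,
        # Assumption: fall back to the example values shown above.
        base_url=env.get(
            "QWEN_BASE_URL",
            "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        ),
        model=env.get("QWEN_MODEL", "qwen-plus"),
    )
```

Passing the environment as a mapping keeps the function easy to test without touching real process state.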

Investigation Agent

IncidentPilot includes an Investigation Agent that gathers evidence before root-cause analysis.

The agent calls read-only diagnostic tools:

get_service_metrics
get_error_logs
get_recent_deployments
get_service_health

Every tool call is stored in the database and shown in the incident timeline/dashboard. This makes the system auditable and ties each root-cause claim to recorded evidence rather than to an unsupported guess.
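One way to guarantee that every tool call is recorded is to route all calls through a single wrapper. This is a sketch under stated assumptions: `AuditedToolbox` and `ToolCallRecord` are hypothetical names, an in-memory list stands in for the database table, and the stub tool returns made-up values for a made-up `checkout` service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


@dataclass
class ToolCallRecord:
    incident_id: int
    tool: str
    arguments: Dict[str, Any]
    result: Any
    called_at: datetime


@dataclass
class AuditedToolbox:
    incident_id: int
    tools: Dict[str, Callable[..., Any]]
    # Stand-in for a database table of tool calls.
    records: List[ToolCallRecord] = field(default_factory=list)

    def call(self, tool: str, **arguments: Any) -> Any:
        if tool not in self.tools:
            raise KeyError(f"unknown tool: {tool}")
        result = self.tools[tool](**arguments)
        # Record the call after it succeeds, so the timeline can show it.
        self.records.append(
            ToolCallRecord(
                self.incident_id, tool, arguments, result,
                datetime.now(timezone.utc),
            )
        )
        return result


def get_service_health(service: str) -> Dict[str, Any]:
    # Stub with illustrative values; a real tool would query monitoring.
    return {"service": service, "status": "degraded"}


toolbox = AuditedToolbox(
    incident_id=1,
    tools={"get_service_health": get_service_health},
)
health = toolbox.call("get_service_health", service="checkout")
```

Because the agent only sees `toolbox.call`, it has no path to a tool that skips the audit record.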

Runbook and Memory Retrieval

Before generating root cause, IncidentPilot retrieves relevant operational knowledge:

Runbooks: Standard operating procedures (e.g., rollback procedures)
Prior Incident Memory: Known fixes from previous incidents

This retrieval is logged in the timeline as a context_retrieved event, making the reasoning process transparent and grounded in operational knowledge.
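The README does not say how retrieval is implemented. As an illustration only, the sketch below ranks runbooks and prior incidents by simple keyword overlap and wraps the hits in a `context_retrieved` event. `KnowledgeItem`, `retrieve_context`, the event payload shape, and the sample entries are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class KnowledgeItem:
    kind: str   # "runbook" or "prior_incident"
    title: str
    text: str


def _tokens(text: str) -> Set[str]:
    # Lowercased words longer than two characters, punctuation trimmed.
    return {w.strip(".,:;()").lower() for w in text.split() if len(w) > 2}


def retrieve_context(
    query: str, items: List[KnowledgeItem], top_k: int = 2
) -> Dict[str, object]:
    q = _tokens(query)
    scored = [(len(q & _tokens(i.title + " " + i.text)), i) for i in items]
    # Highest overlap first; items with no shared words are dropped.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = [item for score, item in ranked if score > 0][:top_k]
    return {
        "event_type": "context_retrieved",
        "items": [{"kind": h.kind, "title": h.title} for h in hits],
    }


items = [
    KnowledgeItem(
        "runbook", "Rollback procedure",
        "Roll back the last deployment when error rate spikes after a release.",
    ),
    KnowledgeItem(
        "prior_incident", "Cache eviction storm",
        "Redis memory pressure caused latency.",
    ),
]
event = retrieve_context("error rate spike after deployment", items)
```

A production system would more likely use embeddings or full-text search; keyword overlap is used here only because it is deterministic and easy to follow.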

Root Cause Agent

IncidentPilot includes a Root Cause Agent that analyzes evidence after investigation and context retrieval.

The agent receives metrics, logs, deployment history, service health, retrieved runbooks, and prior incident memory.

It produces:

Root cause with confidence score
Ranked hypotheses with evidence and counter-evidence
Recommended safe remediation

This makes IncidentPilot's RCA explainable and auditable instead of a black-box LLM answer.
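The output described above maps naturally onto a small typed structure. The sketch below is one possible shape, assuming confidence is a float in [0, 1] and that the top-ranked hypothesis is treated as the root cause. The class and field names, and the sample hypotheses, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    statement: str
    confidence: float
    evidence: List[str] = field(default_factory=list)
    counter_evidence: List[str] = field(default_factory=list)


@dataclass
class RootCauseResult:
    hypotheses: List[Hypothesis]
    recommended_remediation: str

    def __post_init__(self) -> None:
        for h in self.hypotheses:
            if not 0.0 <= h.confidence <= 1.0:
                raise ValueError("confidence must be in [0, 1]")
        # Keep hypotheses ranked, most confident first.
        self.hypotheses.sort(key=lambda h: h.confidence, reverse=True)

    @property
    def root_cause(self) -> Hypothesis:
        return self.hypotheses[0]


# Illustrative values only.
result = RootCauseResult(
    hypotheses=[
        Hypothesis(
            "Database connection pool exhausted", 0.10,
            counter_evidence=["Pool usage stayed flat"],
        ),
        Hypothesis(
            "Latest deployment introduced a regression", 0.86,
            evidence=["Error rate rose right after the deployment"],
        ),
    ],
    recommended_remediation="rollback",
)
```

Validating the model's output into a structure like this before it reaches the timeline means a malformed answer fails loudly instead of being displayed as a finding.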

Remediation Planner Agent

IncidentPilot includes a Remediation Planner Agent that converts root-cause analysis into a safe action proposal.

The planner returns:

Recommended action with parameters
Risk level and confidence score
Expected impact and reversibility
Alternative actions
Safety notes

The planner does not execute actions. Production actions are passed through the policy engine and human approval workflow before execution.

Triage Agent

When an alert arrives, the Triage Agent classifies it before investigation begins.

The Triage Agent produces:

Incident title
Severity (critical, high, medium, low)
Affected service
Alert category
Business impact
Recommended next step
Triage confidence score

The triage result is recorded in the timeline as a triage_completed event.
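A triage result with these fields, and its conversion to a timeline event, might look like the sketch below. The four severity values and the `triage_completed` event name come from this README; the class names, field names, payload layout, and sample values are assumptions.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Dict


class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class TriageResult:
    title: str
    severity: Severity
    affected_service: str
    alert_category: str
    business_impact: str
    recommended_next_step: str
    confidence: float


def to_timeline_event(result: TriageResult) -> Dict[str, Any]:
    payload = asdict(result)
    # Store the plain string so the payload is JSON-friendly.
    payload["severity"] = result.severity.value
    return {"event_type": "triage_completed", "payload": payload}


# Illustrative values only; Severity("high") rejects unknown labels.
triage = TriageResult(
    title="Elevated 5xx rate on checkout",
    severity=Severity("high"),
    affected_service="checkout",
    alert_category="error_rate",
    business_impact="Some customers cannot complete orders",
    recommended_next_step="investigate",
    confidence=0.9,
)
event = to_timeline_event(triage)
```

Parsing the severity through the enum means a label outside the four allowed values raises a `ValueError` instead of entering the timeline.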

Policy and Risk Engine

IncidentPilot does not blindly execute AI recommendations. Every action passes through a deterministic Policy Engine.

Safety Rules

Production rollbacks are classified as medium risk
Any medium, high, or critical risk action in production requires human approval
The LLM cannot bypass the policy engine

This ensures "controlled autonomy": a hallucinated recommendation cannot reach production without passing the policy checks and, where required, human approval.
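The safety rules above are simple enough to express as a pure function. This is a minimal sketch, not the project's policy engine: `evaluate_action` and `PolicyDecision` are hypothetical names, the action names other than rollback are made up, and treating "medium" as a floor for production rollbacks (rather than a fixed value) is an assumption.

```python
from dataclasses import dataclass

RISK_LEVELS = ("low", "medium", "high", "critical")


@dataclass(frozen=True)
class PolicyDecision:
    risk_level: str
    requires_approval: bool


def evaluate_action(
    action: str, environment: str, proposed_risk: str
) -> PolicyDecision:
    if proposed_risk not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {proposed_risk}")
    risk = proposed_risk
    # Rule 1: a production rollback is at least medium risk, whatever
    # risk level the model proposed.
    if environment == "production" and action == "rollback":
        if RISK_LEVELS.index(risk) < RISK_LEVELS.index("medium"):
            risk = "medium"
    # Rule 2: medium, high, or critical risk in production needs a human.
    requires_approval = environment == "production" and risk != "low"
    return PolicyDecision(risk, requires_approval)
```

Because the function takes the model's risk label as input and recomputes the decision itself, the model can suggest a risk level but cannot set `requires_approval` directly.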

Human Approval Workflow

For production changes, IncidentPilot enforces a human-in-the-loop checkpoint.

Flow

Remediation Planner proposes a rollback
Policy Engine flags it as requires_approval: true
Frontend displays an Approval Card with Risk Level and Expected Impact
SRE clicks "Approve Remediation"
Action Executor runs the deterministic rollback tool
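The checkpoint in this flow can be enforced in the executor itself, so an unapproved action fails even if the frontend is bypassed. The sketch below assumes a simple in-memory record; `PendingAction`, `ApprovalRequired`, `execute`, and the reviewer address are hypothetical, and the lambda stands in for the deterministic rollback tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ApprovalRequired(Exception):
    """Raised when an action is run before a human has approved it."""


@dataclass
class PendingAction:
    action: str
    requires_approval: bool
    approved_by: Optional[str] = None
    executed: bool = False

    def approve(self, reviewer: str) -> None:
        self.approved_by = reviewer


def execute(pending: PendingAction, tool: Callable[[], str]) -> str:
    # Refuse to run gated actions that nobody has approved yet.
    if pending.requires_approval and pending.approved_by is None:
        raise ApprovalRequired(f"{pending.action} is awaiting approval")
    result = tool()
    pending.executed = True
    return result


pending = PendingAction("rollback", requires_approval=True)

blocked = False
try:
    execute(pending, lambda: "rolled back")
except ApprovalRequired:
    blocked = True  # the first attempt is rejected

pending.approve("sre@example.com")
outcome = execute(pending, lambda: "rolled back")
```

Recording who approved the action, not just that it was approved, keeps the approval step visible in the audit trail.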

Postmortem Agent

Once an incident is resolved and verified, the Postmortem Agent automatically drafts a report.

Evaluation Metrics

Metric	Manual Baseline	IncidentPilot
Time to triage	10 min	60-90 sec
Manual steps	~12	~4
Postmortem draft	30 min	<30 sec
Audit coverage	Incomplete notes	Full event timeline
Root cause confidence	Subjective	0.86 with evidence

Demo Line

"IncidentPilot uses controlled autonomy: Qwen Cloud and agents propose a remediation, the policy engine checks risk, a human approves production changes, deterministic tools execute the rollback, verification confirms recovery, and the postmortem agent documents the incident."

License

MIT
