IncidentPilot is a production-style incident response autopilot agent. It receives alerts, creates incidents, investigates logs/metrics/deployments/runbooks, uses Qwen Cloud for reasoning, proposes safe remediations, requests human approval, executes approved actions, verifies recovery, and generates postmortems.
- Backend: FastAPI + SQLAlchemy + PostgreSQL
- Frontend: Vite + React + TypeScript
- Agent: Supervisor workflow with Qwen Cloud-compatible client and deterministic tools
- DB: PostgreSQL schema + seed data
- Deployment: Docker Compose, Alibaba Cloud ECS helper files
cp .env.example .env
cp backend/.env.example backend/.env
cp frontend/.env.example frontend/.env
docker compose up --build

Open:
- Frontend: http://localhost:5173
- Backend API: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Open dashboard.
- Click Trigger Demo Alert.
- Open the created incident.
- Watch IncidentPilot investigate.
- Approve rollback.
- See remediation, verification, and postmortem.
Set these in backend/.env:
QWEN_API_KEY=your_key
QWEN_BASE_URL=https://dashscope-intl.aliyuncs.com/compatible-mode/v1
QWEN_MODEL=qwen-plus

If no key is configured, IncidentPilot uses a deterministic local demo fallback so the product remains runnable.
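The fallback decision described here could be sketched as follows. This is a minimal illustration, not IncidentPilot's actual code: `QwenSettings` and `load_qwen_settings` are hypothetical names, and using the values shown above as defaults is an assumption. Only the three environment variable names come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults mirror the example values above (an assumption for this sketch).
DEFAULT_BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
DEFAULT_MODEL = "qwen-plus"


@dataclass(frozen=True)
class QwenSettings:
    api_key: Optional[str]
    base_url: str
    model: str

    @property
    def use_fallback(self) -> bool:
        # Without a key, the deterministic local demo path is used.
        return self.api_key is None


def load_qwen_settings(env: Mapping[str, str] = os.environ) -> QwenSettings:
    """Read Qwen settings from a mapping of environment variables."""
    key = env.get("QWEN_API_KEY", "").strip()
    return QwenSettings(
        api_key=key or None,  # empty or missing key -> fallback mode
        base_url=env.get("QWEN_BASE_URL", DEFAULT_BASE_URL),
        model=env.get("QWEN_MODEL", DEFAULT_MODEL),
    )
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.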
IncidentPilot includes an Investigation Agent that gathers evidence before root-cause analysis.
The agent calls read-only diagnostic tools:
- `get_service_metrics`
- `get_error_logs`
- `get_recent_deployments`
- `get_service_health`
Every tool call is stored in the database and shown in the incident timeline/dashboard. This makes the system auditable and prevents unsupported root-cause guesses.
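One way to make every tool call auditable is to route calls through a single wrapper that persists the call before returning the result. The sketch below is illustrative only: it uses the standard-library `sqlite3` module as a stand-in for the PostgreSQL/SQLAlchemy backend, and the `tool_calls` table, `call_tool` wrapper, and canned `get_service_health` body are all hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def get_service_health(service: str) -> Dict[str, Any]:
    # Stand-in for a read-only diagnostic tool; returns canned demo data.
    return {"service": service, "status": "degraded"}


# Registry of read-only tools the agent is allowed to call.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_service_health": get_service_health,
}


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tool_calls ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "incident_id INTEGER NOT NULL, "
        "tool_name TEXT NOT NULL, "
        "arguments TEXT NOT NULL, "
        "result TEXT NOT NULL, "
        "called_at TEXT NOT NULL)"
    )


def call_tool(conn: sqlite3.Connection, incident_id: int,
              tool_name: str, **kwargs: Any) -> Dict[str, Any]:
    """Run a registered tool and record the call for the incident timeline."""
    result = TOOLS[tool_name](**kwargs)  # KeyError for unregistered tools
    conn.execute(
        "INSERT INTO tool_calls "
        "(incident_id, tool_name, arguments, result, called_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            incident_id,
            tool_name,
            json.dumps(kwargs, sort_keys=True),
            json.dumps(result, sort_keys=True),
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return result
```

Because the agent can only reach tools through `call_tool`, any evidence cited in a root-cause analysis can be traced back to a stored row.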
Before generating root cause, IncidentPilot retrieves relevant operational knowledge:
- Runbooks: Standard operating procedures (e.g., rollback procedures)
- Prior Incident Memory: Known fixes from previous incidents
This retrieval is logged in the timeline as a `context_retrieved` event, making the reasoning process transparent and grounded in operational knowledge.
IncidentPilot includes a Root Cause Agent that analyzes evidence after investigation and context retrieval.
The agent receives metrics, logs, deployment history, service health, retrieved runbooks, and prior incident memory.
It produces:
- Root cause with confidence score
- Ranked hypotheses with evidence and counter-evidence
- Recommended safe remediation
This makes IncidentPilot's RCA explainable and auditable instead of a black-box LLM answer.
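The shape of such an RCA result might look like the following sketch. The class and field names (`Hypothesis`, `RootCauseAnalysis`, `counter_evidence`, and so on) are assumptions chosen to mirror the list above, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    description: str
    confidence: float  # expected to be in [0, 1]
    evidence: List[str] = field(default_factory=list)
    counter_evidence: List[str] = field(default_factory=list)


@dataclass
class RootCauseAnalysis:
    hypotheses: List[Hypothesis]
    recommended_remediation: str

    def __post_init__(self) -> None:
        if not self.hypotheses:
            raise ValueError("at least one hypothesis is required")
        for h in self.hypotheses:
            if not 0.0 <= h.confidence <= 1.0:
                raise ValueError("confidence must be between 0 and 1")

    def ranked(self) -> List[Hypothesis]:
        # Highest-confidence hypothesis first.
        return sorted(self.hypotheses, key=lambda h: h.confidence, reverse=True)

    @property
    def root_cause(self) -> Hypothesis:
        return self.ranked()[0]
```

Keeping evidence and counter-evidence on each hypothesis lets a reviewer see why lower-ranked explanations were rejected, not just which one won.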
IncidentPilot includes a Remediation Planner Agent that converts root-cause analysis into a safe action proposal.
The planner returns:
- Recommended action with parameters
- Risk level and confidence score
- Expected impact and reversibility
- Alternative actions
- Safety notes
The planner does not execute actions. Production actions are passed through the policy engine and human approval workflow before execution.
When an alert arrives, the Triage Agent classifies it before investigation begins.
The Triage Agent produces:
- Incident title
- Severity (critical, high, medium, low)
- Affected service
- Alert category
- Business impact
- Recommended next step
- Triage confidence score
The triage result is recorded in the timeline as a triage_completed event.
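As a rough sketch, the triage output and its timeline event could be modeled like this. The severity values and the `triage_completed` event name come from this README; `TriageResult`, `to_timeline_event`, and the event dictionary layout are hypothetical.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Dict


class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class TriageResult:
    title: str
    severity: Severity
    affected_service: str
    alert_category: str
    business_impact: str
    recommended_next_step: str
    confidence: float

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def to_timeline_event(incident_id: int, triage: TriageResult) -> Dict[str, Any]:
    """Wrap a triage result as a timeline event (layout is an assumption)."""
    payload = asdict(triage)
    payload["severity"] = triage.severity.value  # store the plain string
    return {
        "incident_id": incident_id,
        "event_type": "triage_completed",
        "payload": payload,
    }
```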
IncidentPilot does not blindly execute AI recommendations. Every action passes through a deterministic Policy Engine.
- Production rollbacks are classified as `medium` risk
- Any `medium`, `high`, or `critical` risk action in production requires human approval
- The LLM cannot bypass the policy engine
This ensures "controlled autonomy" and prevents AI hallucinations from breaking production systems.
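The policy rules above can be expressed as a small pure function. The following is a sketch under stated assumptions: `Risk`, `ProposedAction`, `PolicyDecision`, and `evaluate` are hypothetical names, and treating `medium` as a floor for production rollbacks (so a lower model-supplied rating is raised rather than trusted) is one interpretation of the first rule.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class ProposedAction:
    action_type: str     # e.g. "rollback"
    environment: str     # e.g. "production" or "staging"
    proposed_risk: Risk  # risk level suggested by the planner


@dataclass(frozen=True)
class PolicyDecision:
    risk: Risk
    requires_approval: bool


def evaluate(action: ProposedAction) -> PolicyDecision:
    """Apply deterministic rules; the model's own rating is only an input."""
    risk = action.proposed_risk
    in_production = action.environment == "production"

    # Rule 1: production rollbacks are at least medium risk.
    if in_production and action.action_type == "rollback":
        risk = max(risk, Risk.MEDIUM)

    # Rule 2: medium or higher risk in production needs human approval.
    requires_approval = in_production and risk >= Risk.MEDIUM
    return PolicyDecision(risk=risk, requires_approval=requires_approval)
```

Since `evaluate` has no LLM call and no side effects, the same proposal always yields the same decision, which is what makes the gate auditable.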
For production changes, IncidentPilot enforces a human-in-the-loop checkpoint.
- Remediation Planner proposes a rollback
- Policy Engine flags it as `requires_approval: true`
- Frontend displays an Approval Card with Risk Level and Expected Impact
- SRE clicks "Approve Remediation"
- Action Executor runs the deterministic rollback tool
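The checkpoint in the steps above amounts to a small state machine in which execution is refused until approval is recorded. This sketch is illustrative; `RemediationRequest`, `ApprovalState`, and `ApprovalRequired` are hypothetical names, and the rollback tool is passed in as a plain callable.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ApprovalState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


class ApprovalRequired(Exception):
    """Raised when execution is attempted without the required approval."""


@dataclass
class RemediationRequest:
    action: str
    requires_approval: bool
    state: ApprovalState = ApprovalState.PENDING
    approved_by: Optional[str] = None

    def approve(self, user: str) -> None:
        if self.state is not ApprovalState.PENDING:
            raise ValueError(f"cannot approve a {self.state.value} request")
        self.state = ApprovalState.APPROVED
        self.approved_by = user

    def reject(self) -> None:
        if self.state is not ApprovalState.PENDING:
            raise ValueError(f"cannot reject a {self.state.value} request")
        self.state = ApprovalState.REJECTED

    def execute(self, tool: Callable[[], str]) -> str:
        if self.state in (ApprovalState.REJECTED, ApprovalState.EXECUTED):
            raise ValueError(f"cannot execute a {self.state.value} request")
        if self.requires_approval and self.state is not ApprovalState.APPROVED:
            raise ApprovalRequired(f"{self.action} needs human approval")
        result = tool()  # deterministic tool, e.g. the rollback
        self.state = ApprovalState.EXECUTED
        return result
```

A request moves from pending to approved to executed exactly once, so a repeated click or a retried API call cannot run the rollback twice.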
Once an incident is resolved and verified, the Postmortem Agent automatically drafts a report containing:
- Impact summary
- Root cause (from RCA Agent)
- Resolution steps (from Execution logs)
- Prevention items (to avoid future incidents)
This replaces most of the manual documentation work that SREs would otherwise do after each incident.
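Assembling the four sections into a draft can be done with plain string formatting once the upstream agents have produced their outputs. The function below is a hypothetical sketch (`render_postmortem` and its parameters are not from the project), and the plain-text layout is an assumption.

```python
from typing import List


def render_postmortem(title: str, impact: str, root_cause: str,
                      resolution_steps: List[str],
                      prevention_items: List[str]) -> str:
    """Build a plain-text postmortem draft from already-collected data."""
    lines = [
        f"Postmortem: {title}",
        "",
        "Impact summary",
        impact,
        "",
        "Root cause",
        root_cause,
        "",
        "Resolution steps",
    ]
    # Number the steps in the order they were executed.
    lines += [f"{i}. {step}" for i, step in enumerate(resolution_steps, start=1)]
    lines += ["", "Prevention items"]
    lines += [f"- {item}" for item in prevention_items]
    return "\n".join(lines)
```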
| Metric | Manual Baseline | IncidentPilot |
|---|---|---|
| Time to triage | 10 min | 60-90 sec |
| Manual steps | ~12 | ~4 |
| Postmortem draft | 30 min | <30 sec |
| Audit coverage | Incomplete notes | Full event timeline |
| Root cause confidence | Subjective | 0.86 with evidence |
"IncidentPilot uses controlled autonomy: Qwen Cloud and agents propose a remediation, the policy engine checks risk, a human approves production changes, deterministic tools execute the rollback, verification confirms recovery, and the postmortem agent documents the incident."
MIT