Note
Judge TL;DR: Point this at Windows evidence and it runs a full autonomous DFIR investigation with zero human steering and zero model shell access, validating every single claim against real tool output before a judge ever sees it.
Autonomous agentic DFIR for the SANS SIFT Workstation. Point it at Windows evidence (memory image, disk image, event logs) and it investigates end-to-end - zero human steering, zero model shell access - then hands you an investigative report where every single claim is validated against real tool output before you ever see it.
Find Evil! AI Hackathon 2026 · Adil Eskintan · MIT License

Internal Python package name: `sift_sentinel` (stable import path; the product/repo name is Sentinel Ensemble).
Cross-links: README · JUDGE-QUICKSTART · ARCHITECTURE · ACCURACY · SELF-CORRECTION-PROOF
Tip
🎯 Judges: the For Judges - score this in 15 minutes map points each of the 6 criteria straight to its 5-star evidence, every row ending at a 3-step trace recipe. New to the depth proof? See the memory + disk + log worked example.
Important
One command, ~8.5 minutes, fully autonomous: 49 findings dispositioned with SHA256 MATCH evidence integrity (pre == post), and every surviving claim traceable to the tool execution that proved it.
| Metric | Value |
|---|---|
| Confirmed | 2 |
| Suspicious - needs review | 42 |
| Benign | 5 |
| Total findings | 49 |
| Evidence integrity | SHA256 MATCH (unmodified, pre == post) |
| Wall-clock | ~509s (~8.5 min, 8-core box) |
| Cost | ~$15.45 |
| Model | claude-opus 4-model ensemble |
The confirmed SET is model-reasoning-sensitive and can vary run-to-run (0-2
confirmed on the identical rd01 image) because the analysis invocation is a
nondeterministic LLM ensemble; what reproduces DETERMINISTICALLY is the gate
behavior - code, not the model, decides promotion - and full traceability of
every surviving claim to its tool execution. The reference figures here come from the
run-rd01 (2026-06-11) run in artifacts/run-rd01/.
| Requirement | Status | Location |
|---|---|---|
| Open-source license (MIT) | ✓ | /LICENSE - detected by GitHub, visible in About |
| README with setup instructions | ✓ | this file - Start from zero + Install |
| Run instructions for judges | ✓ | Quick Start + JUDGE-QUICKSTART.md |
| Text description | ✓ | What it does |
| Demonstration video | ✓ | ▶ YouTube demo |
| Architecture diagram | ✓ | ARCHITECTURE.md (16-step + MCP diagrams; PNG at submission) |
| Evidence Dataset Documentation | ✓ | docs/DATASET.md |
| Accuracy Report | ✓ | docs/ACCURACY.md |
| Agent Execution Logs | ✓ | artifacts/run-rd01/ (report + full step-by-step execution log + interactive HTML + summary) |
| Self-correction demonstrated | ✓ | SELF-CORRECTION-PROOF.md - every correction listed, before→after, with log line refs · FP-sweep + ReAct cross-check |
| Accuracy validation demonstrated | ✓ | deterministic validator - every finding traces to tool output (src/sift_sentinel/validation/) |
| Analytical reasoning demonstrated | ✓ | structured investigative narrative report (not a raw log) |
All hackathon submission requirements are met.
Four things, in order. If you already have SIFT + an API key, jump to Install.
SIFT is the platform this tool runs on - a free, ready-made forensic Linux VM from SANS that ships Volatility 3, Sleuth Kit, EWF tools and Plaso pre-installed, so you install almost nothing. It runs on any computer - Windows, Mac, or Linux - inside free virtualization software, so your normal laptop is fine. (SIFT is the environment this hackathon assumes - see the resources page: https://findevil.devpost.com/resources.)
- Download the pre-built VM appliance (`.ova`) from the official SANS page: https://www.sans.org/tools/sift-workstation/ - or run the installer on a clean Ubuntu 22.04 system.
- Install a free hypervisor and import the appliance: VMware Workstation Player (Windows / Linux), VMware Fusion (Mac), or VirtualBox (all platforms) - File → Open/Import Appliance → select the `.ova` → Import.
- Give the VM at least 8 GB RAM (16 GB recommended for large memory images) and ≥ 80 GB disk. More cores and RAM mean faster Step-6 tool runs - the worker count is RAM-aware, so a high-RAM VM is used to the full. See `docs/HARDWARE-AND-PERFORMANCE.md` for the per-machine runtime table, the env knobs, and the honest performance floors.
- Start the VM and log in - default SIFT credentials: user `sansforensics`, password `forensics`. Open a terminal (you'll live here from now on).
💡 Prefer no VM? Any Ubuntu 22.04 machine works too - bare metal, Windows WSL2, or a cloud Linux box - just install the SIFT toolchain (or run the SIFT installer). The VM is simply the easiest path.
- Go to https://console.anthropic.com → sign up / log in.
- API keys → Create key → copy the `sk-ant-…` string.
- Give it to Sentinel Ensemble in any one of three ways - pick whatever's easiest (you genuinely cannot get stuck: a real key always wins, and a bad one falls through to the next option):
| | Option | How | Notes |
|---|---|---|---|
| ① | 🚀 Just run it & paste (recommended) | Run the launcher - at the 🔑 API key step it asks you at a hidden prompt. Paste, press Enter. | Verified live · this session only · never echoed, logged, or written to disk. Nothing to find or edit. |
| ② | 📄 A visible file (set once) | Open `API_KEY.txt` in the repo root, replace the placeholder on the last line with your key, save. | Created for you on first run · gitignored, so your key is never committed · no prompt next time. |
| ③ | 🌐 Environment variable | `export ANTHROPIC_API_KEY=sk-ant-…` (a hidden `.env` with `ANTHROPIC_API_KEY=…` works too). | For CI / power users. |
🔓 Order & self-healing. The launcher checks env var → `.env` → `API_KEY.txt`. A real key always beats a leftover placeholder, and if the environment key is rejected - e.g. a stale `export` left in your shell - it automatically falls back to a valid key in your file before asking, so the file you just edited always works.
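The lookup order above can be sketched in a few lines of Python. This is a minimal illustration, not the launcher's actual code: the function names, the `PLACEHOLDER_MARKERS` heuristic, and the `is_accepted` callback (standing in for a live key check) are all assumptions made for the example.

```python
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical markers for spotting a leftover placeholder; not taken from the project.
PLACEHOLDER_MARKERS = ("paste", "your-key", "...", "…")


def looks_real(candidate: Optional[str]) -> bool:
    """Treat a value as a real key only if it has the sk-ant- prefix and no placeholder text."""
    if not candidate:
        return False
    key = candidate.strip().lower()
    if not key.startswith("sk-ant-"):
        return False
    return not any(marker in key for marker in PLACEHOLDER_MARKERS)


def last_nonblank_line(text: str) -> Optional[str]:
    """API_KEY.txt keeps the key on its last line, so read the last non-blank line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else None


def resolve_api_key(
    env: Mapping[str, str],
    dotenv: Mapping[str, str],
    key_file_text: str,
    is_accepted: Callable[[str], bool] = lambda key: True,
) -> Optional[Tuple[str, str]]:
    """Walk env var -> .env -> API_KEY.txt and return (source, key) for the first usable key."""
    candidates = [
        ("env", env.get("ANTHROPIC_API_KEY")),
        (".env", dotenv.get("ANTHROPIC_API_KEY")),
        ("API_KEY.txt", last_nonblank_line(key_file_text)),
    ]
    for source, value in candidates:
        # A placeholder or a rejected key falls through to the next source.
        if looks_real(value) and is_accepted(value.strip()):
            return source, value.strip()
    return None  # nothing usable: the caller would fall back to the hidden prompt
```

With a stale exported key that the (hypothetical) check rejects, the sketch falls through to the file, which matches the self-healing behavior described above.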
⚠️ API tier matters. The analysis stage runs a 4-model ensemble in parallel (4 concurrent API calls), so a Tier-1 account ($5) is likely to hit rate limits (HTTP 429) on a live run. Use at least Tier-2 ($40) - Tier-3 ($200) for the smoothest run. Your tier auto-increases with account age + spend; check / raise it at https://platform.claude.com/settings/limits. The `--demo` mode needs no key and no tier (a typical full investigation costs a few dollars; pick depth 2 / Haiku for the cheapest live run).
Any of these work - the pipeline auto-detects what you give it (memory-only, disk-only, or both together):
| Source | What you get |
|---|---|
| Official hackathon starter case data - download (also posted on the Protocol SIFT Slack, per the official rules) | ready-made disk + memory case data |
| Your own captures | .E01/.raw disk images, .raw/.vmem/.img memory, exported .evtx logs |
Put everything for one case in one folder (example: /cases/evidence/).
A typical strong pair: one memory image + one disk image from the same machine.
🔒 Evidence uses a read-only mount + SHA256: the raw memory/image is hashed before and after the run; the mounted-E01 disk is protected by a read-only mount, not re-hashed (chain of custody by math where it applies, read-only where it doesn't - the running program prints this same honest distinction).
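The pre/post hash comparison can be illustrated with the standard library alone. This is a sketch of the general technique, assuming a plain file on disk; `sha256_file` and `integrity_verdict` are illustrative names, not the project's functions.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so multi-GB images never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def integrity_verdict(pre_hash: str, post_hash: str) -> str:
    """Compare the hash taken before the run with the one taken after it."""
    return "SHA256 MATCH" if pre_hash == post_hash else "SHA256 MISMATCH"
```

In use, the first hash would be taken before any tool touches the image and the second after the report is written; any difference means the evidence changed during the run.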
```bash
git clone https://github.com/3sk1nt4n/Sentinel-Ensemble.git
cd Sentinel-Ensemble
./setup.sh            # ONE command: installs + verifies everything
./findevil.sh --demo  # smoke test - no evidence, no API key
```

✅ It worked when the demo ends with a case card reading "Everything verified and ready." 🎉
./setup.sh sets up an isolated environment, installs the Python deps and forensic tools, then verifies the whole toolchain and prints anything missing. Re-check anytime with ./setup.sh --check.
Under the hood - venv, offline boxes, Volatility symbols
- Isolated by default. `setup.sh` builds a `.venv` (`--system-site-packages`, so it still sees the SIFT bindings `pytsk3`/`pyewf`/`pyesedb`); `./findevil.sh` uses it automatically, no `source` needed. No `python3-venv`? It falls back to your system Python cleanly.
- No venv at all? `pip install -r requirements.txt` (add `--break-system-packages` if pip refuses) covers a full `--live` run.
- Volatility 3 symbols download once to a user-writable cache (no sudo), so even a root-owned `/opt/volatility3` works.
```bash
./findevil.sh                            # asks ONE question: where is the evidence
./findevil.sh /cases/evidence            # or pass the path directly
./findevil.sh --demo                     # synthetic walkthrough - no evidence, no API key
./findevil.sh --dry-run /cases/evidence  # full onboarding + printed plan, pipeline NOT executed
```

A real run, start to finish - one command, two prompts:
- Type `./findevil.sh`
- It asks where the evidence is - type your case folder path (example: `/cases/evidence/my-case` - the folder holding your memory/disk images).
- It scans the evidence and shows a case card (what it found, sizes, SHA256). Just read it.
- It asks the analysis depth - `1` (or Enter) = ⚡ HEAVY (Claude Opus 4.8, ~$8-15/case) or `2` = 🪶 LIGHT (Claude Haiku 4.5, ~$2-3/case). Choosing the depth launches the run.
- The API key step - if you set it already (visible `API_KEY.txt`, `.env`, or `ANTHROPIC_API_KEY`) it's used automatically; otherwise paste it at the hidden prompt (blank screen while pasting is normal; never echoed, logged, or saved).
- Wait minutes, not hours. Touch nothing.
- Read the report - every finding links to the exact tool execution that proved it.
findevil.sh checks dependencies, then delegates to the conversational
onboarding (python3 step0_onboard.py - same flags, same behavior).
Sentinel Ensemble investigates Windows evidence (memory images, disk images, event logs) end-to-end with zero human steering and zero model shell access:
- A deterministic 16-step conductor (`run_pipeline.py`) drives everything; Claude is invoked exactly 5 times (tool selection, analysis, investigation threads, the Step-13AA self-correction finalize, and the report). The 2nd invocation (analysis) is itself a 4-model ensemble, which is why a live run shows 4 concurrent API calls at that step.
- Architectural pattern: Custom MCP Server - every forensic tool is a typed MCP function - the model never constructs command syntax and never touches bash.
- Every AI claim is checked against a paired reference set built from real tool output during the run; unsupported claims are blocked, then self-corrected or honestly reported as UNRESOLVED (honest failure beats a wrong answer).
- A 4-model ensemble + deterministic cross-checks disposition findings into confirmed / needs-review / benign / false-positive, with confidence earned by independent artifact types (memory + disk + logs) - not model feeling.
- A report-integrity layer keeps the story honest end-to-end: the executive summary can never name a finding "confirmed" that the evidence pipeline didn't confirm (any mismatch is auto-annotated with the finding's true status), benign rows always explain why they were cleared, and duplicate findings about the same artifact (same file, same registry key, same Windows service) are merged before you read them.
- Output: a structured investigative narrative with WHO/WHEN context, a network IOC roll-up, and a finding-by-finding audit trail to tool executions.
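The "code checks AI" gate in the list above can be pictured as a set-membership test against facts parsed from tool output. The sketch below is an assumption-laden illustration, not the validator in `src/sift_sentinel/validation/`: the `Claim` shape, the `VALIDATED` / `SELF-CORRECTED` labels, and the single correction pass are hypothetical; only `UNRESOLVED` is a term from this README.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

Reference = Set[Tuple[str, str]]


@dataclass(frozen=True)
class Claim:
    finding_id: str
    kind: str   # e.g. "pid", "path", "ip"
    value: str


def build_reference_set(tool_facts: Iterable[Tuple[str, str]]) -> Reference:
    """Reference set of (kind, value) pairs taken from parsed tool output only."""
    return {(kind, value.lower()) for kind, value in tool_facts}


def validate(claims: Iterable[Claim], reference: Reference) -> Dict[str, bool]:
    """A finding passes only if every one of its claims appears in the reference set."""
    verdicts: Dict[str, bool] = {}
    for claim in claims:
        supported = (claim.kind, claim.value.lower()) in reference
        verdicts[claim.finding_id] = verdicts.get(claim.finding_id, True) and supported
    return verdicts


def gate(claims, reference: Reference, corrected_claims=()) -> Dict[str, str]:
    """Block unsupported findings, accept a supported correction, else report UNRESOLVED."""
    first = validate(claims, reference)
    retry = validate(corrected_claims, reference)
    final = {}
    for finding_id, ok in first.items():
        if ok:
            final[finding_id] = "VALIDATED"
        elif retry.get(finding_id, False):
            final[finding_id] = "SELF-CORRECTED"
        else:
            final[finding_id] = "UNRESOLVED"
    return final
```

The design point is that the final status is computed by code: the model can propose a corrected claim, but it cannot mark its own finding as validated.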
```mermaid
flowchart LR
A[🔒 Evidence\nread-only mount + SHA256\nraw memory/image hashed pre and post;\nmounted-E01 disk protected by read-only mount, not re-hashed] --> B[🧰 Typed forensic tools\nno shell, ever]
B --> C[(EvidenceDB\ntyped facts + provenance)]
C --> D[🤖 AI analysis\n5 AI calls only]
D --> E{🧪 Deterministic validator\ncode checks AI}
E -- unsupported --> F[♻️ Self-correction\nor honest UNRESOLVED]
F --> E
E -- validated --> G[📋 Report\nnarrative + WHO/WHEN + IOC\n+ audit trail + SHA256 re-check]
```
- Step-0 onboarding - finds and profiles the evidence, mounts read-only, SHA256-fingerprints it (chain of custody).
- Tool sweep + EvidenceDB - runs the forensic tools via typed functions, parses every output into typed facts with provenance.
- AI analysis - the model selects tools and writes candidate findings from parsed facts only.
- Validation + cross-check - deterministic validator, ReAct investigation threads, self-correction, disposition. Code checks AI; AI never grades itself.
- Reporting - investigative narrative + audit log; SHA256 verified again (spoliation check).
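To make "typed facts with provenance" concrete, here is a minimal in-memory sketch. The real EvidenceDB schema is not shown in this README, so the `Fact` fields, the method names, and the tool names in the usage note are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Fact:
    artifact_type: str  # "memory", "disk", or "logs"
    kind: str           # e.g. "path", "pid", "ip"
    value: str
    tool: str           # typed tool function that produced the fact (hypothetical names)
    execution_id: str   # key into the execution log


class EvidenceDB:
    """Toy store: every fact keeps a pointer to the tool execution that produced it."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def add(self, fact: Fact) -> None:
        self._facts.append(fact)

    def provenance(self, kind: str, value: str) -> List[Fact]:
        """All facts backing a (kind, value) pair, in insertion order."""
        wanted = value.lower()
        return [f for f in self._facts if f.kind == kind and f.value.lower() == wanted]

    def artifact_types(self, kind: str, value: str) -> Set[str]:
        """Independent artifact types that corroborate the same value."""
        return {f.artifact_type for f in self.provenance(kind, value)}
```

For example, a path reported by both a memory tool and a disk tool would return `{"memory", "disk"}` from `artifact_types`, which is the kind of independent corroboration the disposition step relies on.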
| Artifact | What it is |
|---|---|
| `report.md` | the investigative narrative - findings first, plain-English "why it matters" per finding (the per-finding customer table renders into its sections) |
| `run_summary.md` | tools · dispositions · cost · tokens at a glance |
| `agent_execution_log.txt` | append-only execution log - every tool call, timestamps, token usage |
| `finding_disposition_buckets.json` | confirmed / needs-review / benign / false-positive buckets, each with its reasoning - written to the run directory; `report.md` renders from it |
| Symptom | Fix |
|---|---|
| Dependency install refused (PEP 668) | re-run ./setup.sh (it creates a venv / handles PEP 668 for you) |
| `ERROR: Missing dependencies` from `findevil.sh` | re-run `./setup.sh` (it creates a venv / handles PEP 668 for you), then retry |
| The run doesn't start after you pick depth | you ran step0_onboard.py directly (staged / dev mode) - use ./findevil.sh, which is live by default |
| No prompt appears in CI/scripts | that's by design: headless + no path → usage + exit 2 (no hang) |
| Slow run, "low validations", `Step6 workers: 4` on a big box, many HTTP 529, OOM | These are all hardware/dependency-tuning items - see the symptom table in `docs/HARDWARE-AND-PERFORMANCE.md` (the fast evtx wheel, RAM-aware workers, and the 529-rate-limit knobs). |
No case-specific indicators (hostnames, usernames, IPs, tool-name lists, PIDs,
hashes) are embedded in code, prompts, or fixtures - detection is behavioral
and structural only (process ancestry, RWX anomalies, Event-ID grammar,
egress outliers). Guard tests enforce it, a commit-time audit
(audit/nocheat.py) bans answer-key vocabulary, and the release pipeline
hard-fails if a case token would ever ship.
Two examples of the principle in practice:
- Domains by standard, not by list - a token counts as a domain only if its final label is a registered IANA TLD (vendored from the Public Suffix List, identical for every case on earth); ambiguous TLD/file-extension collisions additionally require the run to have seen the token as a URL host. No domain or extension blocklist decides anything.
- IOCs by correlation, not by lookup - a network indicator is reported as malicious only because a validator-backed finding in this run proved it (verdict inherited from the finding's disposition, related finding IDs cited). The confirmed tier doubles as a copy-pasteable block/hunt list.
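The domain rule in the first example can be sketched as follows. This is an illustration under stated assumptions: `tlds` would be the full vendored TLD set (only a tiny sample appears in the usage note), and the README does not say how ambiguous TLD/file-extension labels are identified, so the sketch takes them as a parameter.

```python
from typing import Set


def is_domain(token: str, tlds: Set[str], ambiguous: Set[str], seen_url_hosts: Set[str]) -> bool:
    """Accept a token as a domain only if its final label is a known TLD.

    Labels that are both a TLD and a common file extension additionally
    require that the exact token was observed as a URL host in this run.
    """
    candidate = token.strip().strip(".").lower()
    labels = candidate.split(".")
    if len(labels) < 2 or not all(labels):
        return False  # a single label or an empty label is not a domain
    final_label = labels[-1]
    if final_label not in tlds:
        return False
    if final_label in ambiguous:
        return candidate in seen_url_hosts
    return True
```

With a sample `tlds = {"com", "zip"}` and `ambiguous = {"zip"}`, `example.com` is accepted, `payload.dll` is rejected, and `archive.zip` is accepted only if the run saw it as a URL host.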
Defaults are tuned for zero-regression. For the strongest adjudication layer:
```bash
SIFT_INV3A_ENRICH=1 SIFT_MODEL_INV3A=claude-opus-4-8 \
SIFT_INV3A_JIT_RWX_GUARD=1 SIFT_USER_8DOT3_CANON=1 python3 findevil.py
```

`SIFT_INV3A_ENRICH` gives the final false-positive sweep a deterministic
cross-reference per finding; SIFT_MODEL_INV3A routes that single call to a
stronger model; the guard suppresses classic JIT/.NET RWX false-positive
promotions structurally (no process-name allowlist); the 8.3 flag folds
short-name user identities into their long form. Every flag has a kill-switch
and fails closed.
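As a rough picture of the 8.3 folding, the sketch below maps a short name such as `ADMINI~1` back to a long user name when exactly one known name matches. The matching heuristic (first six characters, spaces and dots stripped) and the function name are assumptions; the project's actual canonicalization is not described here.

```python
import re
from typing import Iterable

# 8.3 short names look like up to six characters, a tilde, then a number.
SHORT_NAME = re.compile(r"^([^~\\/]{1,6})~(\d+)$")


def canonicalize_user(name: str, known_long_names: Iterable[str]) -> str:
    """Fold an 8.3 short name into its long form; leave the input unchanged if unsure."""
    match = SHORT_NAME.match(name.strip())
    if not match:
        return name  # not a short name
    prefix = match.group(1).upper()
    candidates = [
        long_name
        for long_name in known_long_names
        if re.sub(r"[ .]", "", long_name).upper().startswith(prefix)
    ]
    # Only fold when the match is unambiguous; otherwise keep the original token.
    return candidates[0] if len(candidates) == 1 else name
```

Keeping the original token when zero or several long names match avoids merging two distinct identities by accident.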
See ARCHITECTURE.md and docs/ for the full design · JUDGE-QUICKSTART.md for the judge path · EXTENDING.md to add your own forensic tool.
MIT © Adil Eskintan
SIFT-Sentinel - Adil Eskintan / Solvent CyberSecurity LLC - Find Evil! 2026. Reference run: run-rd01 (artifacts/run-rd01/). The live Official Rules govern.
