🛡️ Sentinel Ensemble

Note

Judge TL;DR: Point this at Windows evidence and it runs a full autonomous DFIR investigation with zero human steering and zero model shell access, validating every single claim against real tool output before a judge ever sees it.

Autonomous agentic DFIR for the SANS SIFT Workstation. Point it at Windows evidence (memory image, disk image, event logs) and it investigates end-to-end - zero human steering, zero model shell access - then hands you an investigative report where every single claim is validated against real tool output before you ever see it.

Find Evil! AI Hackathon 2026 · Adil Eskintan · MIT License

Internal Python package name: sift_sentinel (stable import path; the product/repo name is Sentinel Ensemble).

Cross-links: README · JUDGE-QUICKSTART · ARCHITECTURE · ACCURACY · SELF-CORRECTION-PROOF

Tip

🎯 Judges: the "For Judges - score this in 15 minutes" map points each of the 6 criteria straight to its 5-star evidence, with every row ending in a 3-step trace recipe. New to the depth proof? See the memory + disk + log worked example.

📊 Headline result (run-rd01)

Important

One command, ~8.5 minutes, fully autonomous: 49 findings dispositioned with SHA256 MATCH evidence integrity (pre == post), and every surviving claim traceable to the tool execution that proved it.

| Metric | Value |
|---|---|
| Confirmed | 2 |
| Suspicious - needs review | 42 |
| Benign | 5 |
| Total findings | 49 |
| Evidence integrity | SHA256 MATCH (unmodified, pre == post) |
| Wall-clock | ~509s (~8.5 min, 8-core box) |
| Cost | ~$15.45 |
| Model | claude-opus 4-model ensemble |

The confirmed set is sensitive to model reasoning and can vary from run to run (0-2 confirmed on the identical rd01 image) because the analysis invocation is a nondeterministic LLM ensemble. What reproduces deterministically is the gate behavior (code, not the model, decides promotion) and the full traceability of every surviving claim to its tool execution. The reference figures here come from the run-rd01 run (2026-06-11), stored in artifacts/run-rd01/.

✅ Submission Compliance Checklist

| Requirement | Status | Location |
|---|---|---|
| Open-source license (MIT) | ✓ | `/LICENSE` - detected by GitHub, visible in About |
| README with setup instructions | ✓ | this file - Start from zero + Install |
| Run instructions for judges | ✓ | Quick Start + `JUDGE-QUICKSTART.md` |
| Text description | ✓ | What it does |
| Demonstration video | ✓ | ▶ YouTube demo |
| Architecture diagram | ✓ | `ARCHITECTURE.md` (16-step + MCP diagrams; PNG at submission) |
| Evidence Dataset Documentation | ✓ | `docs/DATASET.md` |
| Accuracy Report | ✓ | `docs/ACCURACY.md` |
| Agent Execution Logs | ✓ | `artifacts/run-rd01/` (report + full step-by-step execution log + interactive HTML + summary) |
| Self-correction demonstrated | ✓ | `SELF-CORRECTION-PROOF.md` - every correction listed, before→after, with log line refs · FP-sweep + ReAct cross-check |
| Accuracy validation demonstrated | ✓ | deterministic validator - every finding traces to tool output (`src/sift_sentinel/validation/`) |
| Analytical reasoning demonstrated | ✓ | structured investigative narrative report (not a raw log) |

All hackathon submission requirements are met.

🧭 Start from zero (never used SIFT before?)

Four things, in order. If you already have SIFT + an API key, jump to Install.

1️⃣ Get the SANS SIFT Workstation (the free forensic VM)

SIFT is the platform this tool runs on - a free, ready-made forensic Linux VM from SANS that ships Volatility 3, Sleuth Kit, EWF tools and Plaso pre-installed, so you install almost nothing. It runs on any computer - Windows, Mac, or Linux - inside free virtualization software, so your normal laptop is fine. (SIFT is the environment this hackathon assumes - see the resources page: https://findevil.devpost.com/resources.)

Download the pre-built VM appliance (.ova) from the official SANS page: https://www.sans.org/tools/sift-workstation/ - or run the installer on a clean Ubuntu 22.04 system.
Install a free hypervisor and import the appliance: VMware Workstation Player (Windows / Linux), VMware Fusion (Mac), or VirtualBox (all platforms) - File → Open/Import Appliance → select the .ova → Import.
Give the VM at least 8 GB RAM (16 GB recommended for large memory images) and ≥ 80 GB disk. More cores and RAM mean faster Step-6 tool runs - the worker count is RAM-aware, so a high-RAM VM is used to the full. See docs/HARDWARE-AND-PERFORMANCE.md for the per-machine runtime table, the env knobs, and the honest performance floors.
Start the VM and log in - default SIFT credentials: user sansforensics, password forensics. Open a terminal (you'll live here from now on).

💡 Prefer no VM? Any Ubuntu 22.04 machine works too - bare metal, Windows WSL2, or a cloud Linux box - just install the SIFT toolchain (or run the SIFT installer). The VM is simply the easiest path.

2️⃣ Get an Anthropic API key (the AI brain)

Go to https://console.anthropic.com → sign up / log in.
API keys → Create key → copy the sk-ant-… string.
Give it to Sentinel Ensemble in any one of three ways - pick whatever's easiest (you genuinely cannot get stuck: a real key always wins, and a bad one falls through to the next option):

| | Option | How | Notes |
|---|---|---|---|
| ① | 🚀 Just run it & paste (recommended) | Run the launcher - at the `🔑 API key` step it asks you at a hidden prompt. Paste, press Enter. | Verified live · this session only · never echoed, logged, or written to disk. Nothing to find or edit. |
| ② | 📄 A visible file (set once) | Open `API_KEY.txt` in the repo root, replace the placeholder on the last line with your key, save. | Created for you on first run · gitignored, so your key is never committed · no prompt next time. |
| ③ | 🌐 Environment variable | `export ANTHROPIC_API_KEY=sk-ant-…` (a hidden `.env` with `ANTHROPIC_API_KEY=…` works too). | For CI / power users. |

🔓 Order & self-healing. The launcher checks env var → .env → API_KEY.txt. A real key always beats a leftover placeholder, and if the environment key is rejected - e.g. a stale export left in your shell - it automatically falls back to a valid key in your file before asking, so the file you just edited always works.
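
The lookup order above can be sketched in a few lines of Python. This is a minimal illustration, not the launcher's actual code: the helper names (`looks_real`, `read_dotenv`, `read_key_file`, `resolve_api_key`), the placeholder heuristic, and the `verify` callback standing in for the live key check are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional, Tuple


def looks_real(key: Optional[str]) -> bool:
    """Assumed heuristic: real keys start with sk-ant- and are not elided placeholders."""
    if not key:
        return False
    key = key.strip()
    return (key.startswith("sk-ant-") and len(key) >= 20
            and "…" not in key and "..." not in key)


def read_dotenv(path: Path) -> Optional[str]:
    """Return the ANTHROPIC_API_KEY value from a .env file, if the file and entry exist."""
    if not path.is_file():
        return None
    for line in path.read_text().splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == "ANTHROPIC_API_KEY":
            return value.strip().strip('"').strip("'")
    return None


def read_key_file(path: Path) -> Optional[str]:
    """Return the last non-empty line of API_KEY.txt (where the README says the key goes)."""
    if not path.is_file():
        return None
    lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    return lines[-1] if lines else None


def resolve_api_key(
    env: Mapping[str, str],
    repo_root: Path,
    verify: Callable[[str], bool] = lambda key: True,  # hypothetical live check
) -> Optional[Tuple[str, str]]:
    """Try env var, then .env, then API_KEY.txt; skip placeholders and rejected keys."""
    candidates = [
        ("env", env.get("ANTHROPIC_API_KEY")),
        (".env", read_dotenv(repo_root / ".env")),
        ("API_KEY.txt", read_key_file(repo_root / "API_KEY.txt")),
    ]
    for source, key in candidates:
        if looks_real(key) and verify(key.strip()):
            return source, key.strip()
    return None  # nothing usable: the caller would fall back to the hidden prompt
```

Calling `resolve_api_key(os.environ, Path("."))` would return the winning source and key, or `None` when the launcher should prompt. Passing a `verify` function that rejects a stale exported key is what lets a valid file key win.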

⚠️ API tier matters. The analysis stage runs a 4-model ensemble in parallel (4 concurrent API calls), so a Tier-1 account ($5) is likely to hit rate limits (HTTP 429) on a live run. Use at least Tier-2 ($40) - Tier-3 ($200) for the smoothest run. Your tier auto-increases with account age + spend; check / raise it at https://platform.claude.com/settings/limits. The --demo mode needs no key and no tier. A live run costs roughly $2-3 per case at depth 2 (LIGHT, Haiku) and roughly $8-15 per case at depth 1 (HEAVY, Opus); pick depth 2 for the cheapest live run.

3️⃣ Get evidence to investigate

Any of these work - the pipeline auto-detects what you give it (memory-only, disk-only, or both together):

| Source | What you get |
|---|---|
| Official hackathon starter case data - download (also posted on the Protocol SIFT Slack, per the official rules) | ready-made disk + memory case data |
| Your own captures | `.E01`/`.raw` disk images, `.raw`/`.vmem`/`.img` memory, exported `.evtx` logs |

Put everything for one case in one folder (example: /cases/evidence/). A typical strong pair: one memory image + one disk image from the same machine.

🔒 Evidence is protected by a read-only mount + SHA256: the raw memory/image is hashed before and after the run; the mounted-E01 disk is protected by its read-only mount and is not re-hashed (chain of custody by math where it applies, read-only where it doesn't - the running program prints this same honest distinction).
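
To make the pre/post hashing idea concrete, here is a minimal sketch using Python's standard `hashlib`. It is not the project's own implementation: the function names `sha256_file` and `integrity_verdict` and the 1 MiB chunk size are assumptions chosen for the example.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream a (possibly multi-GB) evidence file through SHA256 without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading fixed-size chunks until EOF (b"").
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def integrity_verdict(pre: str, post: str) -> str:
    """Compare the hash taken before the run with the one taken after it."""
    return "SHA256 MATCH" if pre == post else "SHA256 MISMATCH"
```

In use, `pre = sha256_file(image)` is recorded before any tool touches the evidence and `post = sha256_file(image)` after the report is written; `integrity_verdict(pre, post)` then yields the line shown in the headline table.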

4️⃣ Install & run - see the next two sections. That's it.

📦 Install

git clone https://github.com/3sk1nt4n/Sentinel-Ensemble.git
cd Sentinel-Ensemble
./setup.sh             # ONE command: installs + verifies everything
./findevil.sh --demo   # smoke test - no evidence, no API key

✅ It worked when the demo ends with a case card reading "Everything verified and ready." 🎉

./setup.sh sets up an isolated environment, installs the Python deps and forensic tools, then verifies the whole toolchain and prints anything missing. Re-check anytime with ./setup.sh --check.

Under the hood - venv, offline boxes, Volatility symbols

Isolated by default. setup.sh builds a .venv (--system-site-packages, so it still sees the SIFT bindings pytsk3 / pyewf / pyesedb); ./findevil.sh uses it automatically, no source needed. No python3-venv? It falls back to your system Python cleanly.
No venv at all? pip install -r requirements.txt (add --break-system-packages if pip refuses) covers a full --live run.
Volatility 3 symbols download once to a user-writable cache (no sudo), so even a root-owned /opt/volatility3 works.

🚀 Quick Start

./findevil.sh                      # asks ONE question: where is the evidence
./findevil.sh /cases/evidence      # or pass the path directly
./findevil.sh --demo               # synthetic walkthrough - no evidence, no API key
./findevil.sh --dry-run /cases/evidence   # full onboarding + printed plan, pipeline NOT executed

A real run, start to finish - one command, two prompts:

Type ./findevil.sh
It asks where the evidence is - type your case folder path (example: /cases/evidence/my-case - the folder holding your memory/disk images).
It scans the evidence and shows a case card (what it found, sizes, SHA256). Just read it.
It asks the analysis depth - 1 (or Enter) = ⚡ HEAVY (Claude Opus 4.8, ~$8-15/case) or 2 = 🪶 LIGHT (Claude Haiku 4.5, ~$2-3/case). Choosing the depth launches the run.
The API key step - if you set it already (visible API_KEY.txt, .env, or ANTHROPIC_API_KEY) it's used automatically; otherwise paste it at the hidden prompt (blank screen while pasting is normal; never echoed, logged, or saved).
Wait minutes, not hours. Touch nothing.
Read the report - every finding links to the exact tool execution that proved it.

findevil.sh checks dependencies, then delegates to the conversational onboarding (python3 step0_onboard.py - same flags, same behavior).

🔍 What it does

Sentinel Ensemble investigates Windows evidence (memory images, disk images, event logs) end-to-end with zero human steering and zero model shell access:

A deterministic 16-step conductor (run_pipeline.py) drives everything; Claude is invoked exactly 5 times (tool selection, analysis, investigation threads, the Step-13AA self-correction finalize, and the report). The 2nd invocation (analysis) is itself a 4-model ensemble, which is why a live run shows 4 concurrent API calls at that step.
Architectural pattern: Custom MCP Server - every forensic tool is a typed MCP function - the model never constructs command syntax and never touches bash.
Every AI claim is checked against a paired reference set built from real tool output during the run; unsupported claims are blocked, then self-corrected or honestly reported as UNRESOLVED (honest failure beats a wrong answer).
A 4-model ensemble + deterministic cross-checks disposition findings into confirmed / needs-review / benign / false-positive, with confidence earned by independent artifact types (memory + disk + logs) - not model feeling.
A report-integrity layer keeps the story honest end-to-end: the executive summary can never name a finding "confirmed" that the evidence pipeline didn't confirm (any mismatch is auto-annotated with the finding's true status), benign rows always explain why they were cleared, and duplicate findings about the same artifact (same file, same registry key, same Windows service) are merged before you read them.
Output: a structured investigative narrative with WHO/WHEN context, a network IOC roll-up, and a finding-by-finding audit trail to tool executions.
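
The point about confidence being "earned by independent artifact types" can be illustrated with a small sketch. The threshold of two independent types, the `Finding` shape, and the function name `corroboration_tier` are assumptions made for this example; the real gate's rules live in the codebase and also cover the benign and false-positive buckets, which are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Set

# The three artifact families named in this README.
ARTIFACT_TYPES = {"memory", "disk", "logs"}


@dataclass
class Finding:
    finding_id: str
    # Artifact families whose tool output independently supports the finding.
    artifact_types: Set[str] = field(default_factory=set)


def corroboration_tier(finding: Finding, min_independent: int = 2) -> str:
    """Promote only when enough independent artifact types agree (assumed rule)."""
    independent = finding.artifact_types & ARTIFACT_TYPES  # ignore unknown labels
    if len(independent) >= min_independent:
        return "confirmed"
    return "needs-review"
```

Because the rule is plain code over recorded evidence types, the same inputs always give the same tier, regardless of how confident any model sounded.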

flowchart LR
    A[🔒 Evidence\nread-only mount + SHA256\nraw memory/image hashed pre and post;\nmounted-E01 disk protected by read-only mount, not re-hashed] --> B[🧰 Typed forensic tools\nno shell, ever]
    B --> C[(EvidenceDB\ntyped facts + provenance)]
    C --> D[🤖 AI analysis\n5 AI calls only]
    D --> E{🧪 Deterministic validator\ncode checks AI}
    E -- unsupported --> F[♻️ Self-correction\nor honest UNRESOLVED]
    F --> E
    E -- validated --> G[📋 Report\nnarrative + WHO/WHEN + IOC\n+ audit trail + SHA256 re-check]
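
The validator loop in the diagram (validate, self-correct, re-validate, or give up honestly) might look roughly like the sketch below. Everything here is illustrative: the `(field, value)` claim shape, the reference dictionary keyed by claim, the `correct` callback standing in for the model's self-correction call, and the round limit are assumptions, not the project's API.

```python
from typing import Callable, Dict, Optional, Tuple

Claim = Tuple[str, str]          # (field, value), e.g. ("pid_of:svchost.exe", "1324")
Reference = Dict[Claim, str]     # claim -> id of the tool execution that produced it


def validate_with_correction(
    claim: Claim,
    reference: Reference,
    correct: Callable[[Claim], Optional[Claim]],
    max_rounds: int = 2,
) -> dict:
    """Accept a claim only if tool output backs it; otherwise retry, then report UNRESOLVED."""
    current = claim
    for _ in range(max_rounds + 1):
        execution_id = reference.get(current)
        if execution_id is not None:
            # Code, not the model, decides: the claim matches a recorded fact.
            return {"status": "validated", "claim": current, "proved_by": execution_id}
        revised = correct(current)
        if revised is None or revised == current:
            break  # no new proposal, so stop instead of looping forever
        current = revised
    return {"status": "UNRESOLVED", "claim": claim, "proved_by": None}
```

The design choice worth noting is that the loop can only end in two ways: a claim tied to a tool execution, or an explicit UNRESOLVED. There is no path where an unsupported claim is passed through.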

🪜 The five stages

Step-0 onboarding - finds and profiles the evidence, mounts read-only, SHA256-fingerprints it (chain of custody).
Tool sweep + EvidenceDB - runs the forensic tools via typed functions, parses every output into typed facts with provenance.
AI analysis - the model selects tools and writes candidate findings from parsed facts only.
Validation + cross-check - deterministic validator, ReAct investigation threads, self-correction, disposition. Code checks AI; AI never grades itself.
Reporting - investigative narrative + audit log; SHA256 verified again (spoliation check).
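
As an illustration of the "typed functions, no shell" idea in stage 2, the sketch below shows one way a tool call could be constrained so that free-form text never reaches a command line. The `VolPlugin` enum, `ToolCall` class, and `run_tool` helper are hypothetical names for this example; only the basic Volatility 3 invocation form (`vol -f <image> <plugin>`) is taken as given.

```python
import subprocess
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import List


class VolPlugin(str, Enum):
    """Closed set of plugins a caller may request (illustrative subset)."""
    PSLIST = "windows.pslist"
    NETSCAN = "windows.netscan"
    MALFIND = "windows.malfind"


@dataclass(frozen=True)
class ToolCall:
    plugin: VolPlugin
    image: Path

    def argv(self) -> List[str]:
        # Reject anything that is not a member of the enum, including raw strings.
        if not isinstance(self.plugin, VolPlugin):
            raise TypeError("plugin must be a VolPlugin member")
        # A fixed argv list: no string interpolation, no shell metacharacters.
        return ["vol", "-f", str(self.image), self.plugin.value]


def run_tool(call: ToolCall, timeout: int = 600) -> dict:
    """Execute without a shell and keep the argv alongside the output as provenance."""
    argv = call.argv()
    proc = subprocess.run(argv, capture_output=True, text=True,
                          timeout=timeout, check=False)
    return {"argv": argv, "returncode": proc.returncode, "stdout": proc.stdout}
```

The model's only degree of freedom in such a design is choosing an enum member and an evidence path; the command syntax itself is owned by code.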

📄 What you get after a run

| Artifact | What it is |
|---|---|
| `report.md` | the investigative narrative - findings first, plain-English "why it matters" per finding (the per-finding customer table renders into its sections) |
| `run_summary.md` | tools · dispositions · cost · tokens at a glance |
| `agent_execution_log.txt` | append-only execution log - every tool call, timestamps, token usage |
| `finding_disposition_buckets.json` | confirmed / needs-review / benign / false-positive buckets, each with its reasoning - written to the run directory; `report.md` renders from it |
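
If you want a quick tally from `finding_disposition_buckets.json`, something like the following could work. The file's schema is not documented in this README, so the layout assumed here (a top-level object mapping each bucket name to a list of findings) and the key spellings in the usage example are guesses; adjust to what the file actually contains.

```python
import json
from pathlib import Path
from typing import Dict


def load_buckets(path: str) -> dict:
    """Read the buckets file from a run directory."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_buckets(buckets: dict) -> Dict[str, int]:
    """Count findings per bucket, assuming each bucket is a list, and add a total."""
    counts = {name: len(items) for name, items in buckets.items()
              if isinstance(items, list)}
    counts["total"] = sum(counts.values())
    return counts
```

For a file shaped like `{"confirmed": [...2 items...], "needs_review": [...42...], "benign": [...5...], "false_positive": []}`, `summarize_buckets` would report the same 2 / 42 / 5 split and total of 49 shown in the headline table.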

🧯 Troubleshooting

| Symptom | Fix |
|---|---|
| Dependency install refused (PEP 668) | re-run `./setup.sh` (it creates a venv / handles PEP 668 for you) |
| `ERROR: Missing dependencies` from findevil.sh | re-run `./setup.sh` (it creates a venv / handles PEP 668 for you), then retry |
| The run doesn't start after you pick depth | you ran `step0_onboard.py` directly (staged / dev mode) - use `./findevil.sh`, which is live by default |
| No prompt appears in CI/scripts | that's by design: headless + no path → usage + exit 2 (no hang) |
| Slow run, "low validations", `Step6 workers: 4` on a big box, many `HTTP 529`, OOM | These are all hardware/dependency-tuning items - see the symptom table in `docs/HARDWARE-AND-PERFORMANCE.md` (the fast `evtx` wheel, RAM-aware workers, and the 529-rate-limit knobs). |

🌍 Dataset-agnostic by construction

No case-specific indicators (hostnames, usernames, IPs, tool-name lists, PIDs, hashes) are embedded in code, prompts, or fixtures - detection is behavioral and structural only (process ancestry, RWX anomalies, Event-ID grammar, egress outliers). Guard tests enforce it, a commit-time audit (audit/nocheat.py) bans answer-key vocabulary, and the release pipeline hard-fails if a case token would ever ship.

Two examples of the principle in practice:

Domains by standard, not by list - a token counts as a domain only if its final label is a registered IANA TLD (vendored from the Public Suffix List, identical for every case on earth); ambiguous TLD/file-extension collisions additionally require the run to have seen the token as a URL host. No domain or extension blocklist decides anything.
IOCs by correlation, not by lookup - a network indicator is reported as malicious only because a validator-backed finding in this run proved it (verdict inherited from the finding's disposition, related finding IDs cited). The confirmed tier doubles as a copy-pasteable block/hunt list.
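
The "domains by standard" rule can be sketched as below. This is a simplified illustration under stated assumptions: `KNOWN_TLDS` is a tiny hand-picked subset standing in for the vendored list, `AMBIGUOUS_TLDS` is a hypothetical set of TLDs that collide with file extensions, and `is_domain` is a made-up name.

```python
from typing import Set

# Illustrative subset only; the real check would use the full vendored TLD list.
KNOWN_TLDS = {"com", "net", "org", "zip", "mov"}
# Assumed examples of TLDs that are also common file extensions.
AMBIGUOUS_TLDS = {"zip", "mov"}


def is_domain(token: str, seen_url_hosts: Set[str]) -> bool:
    """Treat a token as a domain only if its final label is a known TLD.

    For TLD/file-extension collisions, additionally require that this run
    actually observed the token as the host part of a URL.
    """
    token = token.strip().lower().rstrip(".")
    labels = token.split(".")
    if len(labels) < 2 or not all(labels):
        return False  # need at least name + TLD, and no empty labels
    tld = labels[-1]
    if tld not in KNOWN_TLDS:
        return False
    if tld in AMBIGUOUS_TLDS:
        return token in seen_url_hosts
    return True
```

With this rule, `payload.exe` is never a domain because `exe` is not a TLD, while `report.zip` counts only when the run saw it as a URL host. No case-specific blocklist is involved at any point.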

🎛️ Deepest-accuracy run (optional flags)

Defaults are tuned for zero-regression. For the strongest adjudication layer:

SIFT_INV3A_ENRICH=1 SIFT_MODEL_INV3A=claude-opus-4-8 \
SIFT_INV3A_JIT_RWX_GUARD=1 SIFT_USER_8DOT3_CANON=1 python3 findevil.py

SIFT_INV3A_ENRICH gives the final false-positive sweep a deterministic cross-reference per finding; SIFT_MODEL_INV3A routes that single call to a stronger model; the guard suppresses classic JIT/.NET RWX false-positive promotions structurally (no process-name allowlist); the 8.3 flag folds short-name user identities into their long form. Every flag has a kill-switch and fails closed.
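
One plausible reading of "fails closed" for these opt-in flags is that only the exact value `1` turns a feature on, and anything missing or malformed leaves it off. The sketch below shows that interpretation; the helper names and the strict-`1` rule are assumptions, and the project's actual parsing may differ.

```python
from typing import Dict, Mapping

# Boolean flags named in this section (the model override is a string, not a toggle).
OPT_IN_FLAGS = ("SIFT_INV3A_ENRICH", "SIFT_INV3A_JIT_RWX_GUARD", "SIFT_USER_8DOT3_CANON")


def flag_enabled(name: str, env: Mapping[str, str]) -> bool:
    """Enabled only for the exact value "1"; unset, empty, or odd values stay off."""
    return env.get(name, "").strip() == "1"


def active_flags(env: Mapping[str, str]) -> Dict[str, bool]:
    """Snapshot of which opt-in flags are on for this run."""
    return {name: flag_enabled(name, env) for name in OPT_IN_FLAGS}
```

Passing `os.environ` to `active_flags` at startup and logging the result would give a run a clear record of which optional layers were active.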

See ARCHITECTURE.md and docs/ for the full design · JUDGE-QUICKSTART.md for the judge path · EXTENDING.md to add your own forensic tool · MIT © Adil Eskintan

SIFT-Sentinel - Adil Eskintan / Solvent CyberSecurity LLC - Find Evil! 2026. Reference run: run-rd01 (artifacts/run-rd01/). The live Official Rules govern.
