ForensIQ is an end-to-end digital forensics pipeline that mounts forensic disk images, extracts artifacts across multiple streams, classifies them with a BERT-based ML model, explains every decision with SHAP values, and synthesises a UTC-normalised timeline in DFXML format. It ships with a FastAPI backend and a React frontend for real-time analysis via the browser.
- Features
- Architecture
- Project Structure
- Module Reference
- Backend API
- Frontend
- Installation
- Configuration
- Running the Web App
- Running the CLI
- Running the Demo
- Running Tests
- Supported Formats
- Case Type Profiles
- Output Formats
- Platform Notes
- Forensic image parsing — mounts E01 (libewf), DD/raw, and AFF4 images; traverses NTFS, ext4, and APFS filesystems using pytsk3; recovers deleted files and reads MFT records
- Multi-stream artifact extraction — file metadata with MFT timestomping detection, Windows registry hives (persistence keys, USB history, UserAssist), browser SQLite databases (Chrome/Firefox history, cookies, downloads), and pcap network logs
- BERT-based classification — two-stage pipeline: BERT [CLS] embedding + logistic regression head; heuristic fallback when no labelled data is available
- Case-type re-weighting — investigator selects a case type (financial fraud, data theft, malware) and the classifier dynamically re-weights artifact streams and file extensions
- SHAP explainability — KernelExplainer surfaces the top features driving each suspicion score with a human-readable narrative ("because: created at 3AM, timestomped, accessed financial website")
- Timeline synthesis — correlates events across all artifact streams, normalises timestamps to UTC accounting for timezone offsets, outputs DFXML
- Real-time web UI — React frontend with live SSE progress stream, findings table, SHAP bar charts, interactive timeline, and one-click DFXML/JSON download
- Rich terminal dashboard — alternative CLI interface using the Rich library
- 41-test suite — all heavy dependencies (pytsk3, pyewf, torch, transformers) mocked; runs without forensic hardware or GPU
```
Browser
  │  POST /analyze (case_type, tz_offset, file?)
  │  GET  /stream/{job_id}   ← Server-Sent Events
  ▼
backend/api.py  (FastAPI)
  │
  ├── backend/pipeline_runner.py  (daemon thread + queue)
  │     │
  │     ├── image_parser.py          mount image → FileEntry stream
  │     ├── artifact_extractors.py   metadata / registry / browser / network
  │     ├── classifier.py            BERT embed + logistic head + case weights
  │     ├── explainability.py        SHAP KernelExplainer + narrative builder
  │     └── timeline.py              UTC normalise + DFXML serialiser
  │
  └── outputs/{job_id}.dfxml
      outputs/{job_id}.json
```
Events streamed to the browser in order:
| SSE Event | Payload |
|---|---|
| `progress` | `{step, total, msg}` |
| `artifacts` | list of all extracted artifacts |
| `classifications` | scored + labelled artifact list |
| `explanations` | flagged artifacts with SHAP narratives |
| `timeline` | UTC-sorted event list |
| `done` | job summary + download URLs |
| `error` | traceback string |
```
ForensIQ/
│
├── backend/
│   ├── __init__.py
│   ├── api.py                  FastAPI app — HTTP endpoints + SSE
│   └── pipeline_runner.py      Background thread, queue, synthetic dataset
│
├── frontend/
│   └── index.html              Single-file React + Tailwind UI
│
├── ui/
│   └── dashboard.py            Rich terminal dashboard (alternative to web UI)
│
├── tests/
│   └── test_forensiq.py        41 unit tests (all deps mocked)
│
├── outputs/                    Generated DFXML and JSON reports
│
├── image_parser.py             E01/DD/AFF4 mounting, filesystem traversal
├── artifact_extractors.py      Metadata, registry, browser, network extractors
├── classifier.py               BERT classifier + case-type profiles
├── explainability.py           SHAP engine + narrative builder
├── timeline.py                 UTC normaliser + DFXML serialiser
├── config.py                   Lazy env-var config singleton
├── logger.py                   Structured logging setup
├── main.py                     CLI entry point
├── demo.py                     Standalone demo on synthetic data
├── server.py                   Uvicorn server entry point
├── setup.py                    pip-installable package definition
├── requirements.txt            All Python dependencies
└── .env.example                All tunable environment variables
```
Mounts a forensic image and yields FileEntry objects for every file.
```python
from image_parser import parse_image

for entry in parse_image("evidence.E01"):
    print(entry.path, entry.size, entry.is_deleted)
```

`FileEntry` fields: `path`, `size`, `inode`, `created`, `modified`, `accessed`, `is_deleted`, `fs_type`, `data`

- E01 images are opened via `pyewf` and bridged to pytsk3 through `EWFImgInfo`
- Partition tables are detected automatically; falls back to a whole-image filesystem if none is found
- Files are capped at `cfg.max_file_read_bytes` (default 10 MB) to prevent OOM
- Filesystem type is detected per partition: NTFS, ext4, APFS
Four extractors, all returning `Artifact` objects.

`Artifact` fields: `artifact_type`, `source_path`, `timestamp`, `features`, `raw`
Extracts file size, timestamps, extension, and the deleted flag. On NTFS files starting with the MFT magic `FILE`, it compares the `$STANDARD_INFORMATION` and `$FILE_NAME` timestamps — a difference greater than 1 second flags possible timestomping.
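That comparison can be sketched in isolation — the helper below is illustrative, not the project's actual function, but it captures the check: attackers who backdate a file typically rewrite only `$STANDARD_INFORMATION`, leaving `$FILE_NAME` untouched.

```python
from datetime import datetime, timedelta

def timestomp_suspected(si_time: datetime, fn_time: datetime,
                        tolerance: timedelta = timedelta(seconds=1)) -> bool:
    """Flag when the $STANDARD_INFORMATION and $FILE_NAME timestamps
    diverge by more than the tolerance."""
    return abs(si_time - fn_time) > tolerance

# $SI rolled back five years relative to $FN — a classic timestomping sign
si = datetime(2019, 1, 1, 12, 0, 0)
fn = datetime(2024, 3, 15, 3, 0, 0)
print(timestomp_suspected(si, fn))  # True
```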
Parses Windows registry hives using regipy. Extracts:

- Persistence keys: `Run`, `RunOnce`, `Services`
- USB mount history: `USBSTOR`
- User activity: `UserAssist`
Reads Chrome/Firefox SQLite databases and queries the `urls`, `cookies`, and `downloads` tables. Converts Chrome's WebKit epoch (microseconds since 1601-01-01) to the Unix epoch. Flags visits to financial domains (PayPal, Coinbase, Chase, etc.).
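The WebKit-to-Unix conversion is a fixed offset; a minimal sketch (the constant and helper name here are illustrative):

```python
from datetime import datetime, timezone

# Seconds between the WebKit epoch (1601-01-01) and the Unix epoch (1970-01-01)
WEBKIT_UNIX_OFFSET_S = 11_644_473_600

def webkit_to_unix(webkit_us: int) -> float:
    """Chrome stores visit times as microseconds since 1601-01-01 UTC."""
    return webkit_us / 1_000_000 - WEBKIT_UNIX_OFFSET_S

# The Unix epoch itself, expressed in WebKit microseconds:
print(webkit_to_unix(11_644_473_600_000_000))  # 0.0

# A plausible Chrome history value lands in 2024:
ts = webkit_to_unix(13_350_000_000_000_000)
print(datetime.fromtimestamp(ts, tz=timezone.utc).year)  # 2024
```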
Reads a pcap file with Scapy, aggregates flows by `(src, dst, dport, proto)`, then runs IsolationForest on `[bytes, packets, duration]` to flag anomalous flows.
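The anomaly step can be sketched with scikit-learn directly. The toy flow features below stand in for the real extractor's pcap aggregation:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# One row per flow: [total_bytes, packet_count, duration_seconds]
flows = np.array([
    [1_200, 10, 0.5],
    [2_300, 18, 0.7],
    [1_800, 14, 0.6],
    [2_100, 16, 0.6],
    [524_000_000, 40_000, 600.0],   # huge outbound transfer
])

model = IsolationForest(contamination=0.2, random_state=0).fit(flows)
labels = model.predict(flows)       # -1 = anomalous, 1 = normal
anomalous = [i for i, lab in enumerate(labels) if lab == -1]
print(anomalous)  # [4]  — the large transfer is flagged
```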
```python
from classifier import ForensicClassifier

clf = ForensicClassifier(case_type="financial_fraud")
results = clf.classify_batch(artifacts)
```

Two-stage pipeline:

- BERT tokenises `artifact.raw` (truncated to 512 chars) → [CLS] embedding (768-dim)
- Numeric features appended: `[file_size_mb, is_deleted, anomaly, is_financial, timestomp, visit_count_norm]`
- Logistic regression head maps the combined vector to a suspicion probability
- Case-type multiplier applied: `score = min(prob × weight, 1.0)`
- Extension boost: high-value extensions for the case type multiply the weight by 1.5

Labels: benign (< 0.40), suspicious (0.40–0.75), malicious (> 0.75). Thresholds are configurable via `FORENSIQ_THRESH_SUSPICIOUS` and `FORENSIQ_THRESH_MALICIOUS`.

When no labelled training data is available, `_heuristic_score()` is used as a zero-shot fallback based on rule weights.
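The score-to-label mapping is a simple threshold ladder. A sketch using the documented defaults (the exact boundary handling at 0.40/0.75 is an assumption):

```python
def label_for(score: float,
              suspicious_thresh: float = 0.40,
              malicious_thresh: float = 0.75) -> str:
    """Map a suspicion probability onto the three-way label."""
    if score > malicious_thresh:
        return "malicious"
    if score >= suspicious_thresh:
        return "suspicious"
    return "benign"

print(label_for(0.72))  # suspicious
print(label_for(0.91))  # malicious
print(label_for(0.10))  # benign
```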
```python
from explainability import ExplainabilityEngine

engine = ExplainabilityEngine(clf, background_results)
explanation = engine.explain(result)
print(explanation.narrative)
```

- Uses `shap.KernelExplainer` — model-agnostic, works with both the fitted logistic head and the heuristic fallback
- Background is compressed to ≤ k unique centroids via `shap.kmeans` to keep SHAP tractable on 768-dim BERT embeddings
- `_build_narrative()` maps SHAP-identified features to investigator-readable language:
  - Off-hours activity (before 05:00 or after 22:00 UTC)
  - Deleted files, timestomping, network anomalies
  - Persistence keys, USB mounts, financial site visits

`Explanation` fields: `artifact_path`, `suspicion_score`, `label`, `top_features`, `narrative`
```python
from timeline import TimelineSynthesizer

synth = TimelineSynthesizer(tz_offset_hours=-5)
events = synth.build_from_results(pairs)  # pairs = [(ClassificationResult, Explanation)]
dfxml = synth.to_dfxml(events)
```

- `tz_offset_hours` is subtracted from stored timestamps to convert local time to UTC
- Events are sorted ascending by UTC time
- DFXML output follows the forensicswiki.org schema, compatible with Autopsy and log2timeline
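The offset arithmetic can be shown in isolation (a hypothetical helper; `TimelineSynthesizer` performs this internally per the description above):

```python
from datetime import datetime, timedelta, timezone

def normalise_to_utc(local_ts: datetime, tz_offset_hours: float) -> datetime:
    """Subtract the investigator-supplied offset so stored local times become UTC."""
    return (local_ts - timedelta(hours=tz_offset_hours)).replace(tzinfo=timezone.utc)

# 03:00 in UTC-5 (tz_offset_hours = -5) is 08:00 UTC
print(normalise_to_utc(datetime(2024, 3, 15, 3, 0), -5.0).isoformat())
# 2024-03-15T08:00:00+00:00
```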
All settings are read lazily from `os.environ` (populated from `.env` if present). This means `patch.dict(os.environ, ...)` works correctly in tests.

```python
from config import cfg

print(cfg.bert_model)           # nlpaueb/legal-bert-base-uncased
print(cfg.threshold_malicious)  # 0.75
print(cfg.shap_nsamples)        # 100
```

```python
from logger import setup_logging

setup_logging(level="DEBUG", log_file="forensiq.log")
```

Configures the root logger with a timestamped formatter. Silences noisy third-party loggers (transformers, scapy.runtime).
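The lazy-read pattern can be sketched as follows. This is illustrative only, mirroring the `config.py` behaviour described above rather than its actual code:

```python
import os

class LazyConfig:
    """Every attribute access re-reads os.environ, so unittest's
    patch.dict(os.environ, ...) takes effect without reloading modules."""

    @property
    def threshold_malicious(self) -> float:
        return float(os.environ.get("FORENSIQ_THRESH_MALICIOUS", "0.75"))

cfg = LazyConfig()
print(cfg.threshold_malicious)   # 0.75 (default, assuming the variable is unset)
os.environ["FORENSIQ_THRESH_MALICIOUS"] = "0.9"
print(cfg.threshold_malicious)   # 0.9 — picked up without any reload
```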
Start the server:

```bash
python server.py
# or
uvicorn backend.api:app --host 0.0.0.0 --port 8000
```

Serves `frontend/index.html`.
Start a new analysis job.
| Field | Type | Default | Description |
|---|---|---|---|
| `case_type` | form string | `financial_fraud` | One of `financial_fraud`, `data_theft`, `malware`, `default` |
| `tz_offset` | form float | `-5.0` | Local timezone offset in hours (e.g. -5 for EST) |
| `file` | file upload | optional | Evidence file (`.E01`, `.dd`, `.db`, `.pcap`, etc.) |
Response:

```json
{ "job_id": "a1b2c3d4", "stream_url": "/stream/a1b2c3d4" }
```

Server-Sent Events stream. Connect with `EventSource` in the browser or curl:

```bash
curl -N http://localhost:8000/stream/a1b2c3d4
```

Events: `progress`, `artifacts`, `classifications`, `explanations`, `timeline`, `done`, `error`
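Outside the browser, the stream is plain text framed by `event:`/`data:` lines with a blank line terminating each event. A minimal parser sketch for that standard SSE framing (the helper is hypothetical, not part of ForensIQ):

```python
def parse_sse(raw: str) -> list[tuple[str, str]]:
    """Split a raw Server-Sent Events payload into (event, data) pairs.
    The default event name is 'message'; a blank line ends an event."""
    events, name, data = [], "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            name = line[6:].strip()
        elif line.startswith("data:"):
            data.append(line[5:].strip())
        elif line == "" and data:
            events.append((name, "\n".join(data)))
            name, data = "message", []
    return events

raw = 'event: progress\ndata: {"step": 1, "total": 6, "msg": "mounting"}\n\n'
print(parse_sse(raw))
# [('progress', '{"step": 1, "total": 6, "msg": "mounting"}')]
```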
Download the DFXML timeline for a completed job.
Download the JSON findings report for a completed job.
Auto-generated Swagger UI (provided by FastAPI).
Single-file React 18 + Tailwind CSS application served directly from frontend/index.html. No build step required.
Panels:
- Configuration — case type selector, timezone offset input, optional file upload, Run button
- Progress bar — live step indicator fed by SSE `progress` events
- Stats row — total artifacts, flagged count, malicious count, suspicious count
- Findings tab — expandable cards per flagged artifact showing score bar, narrative, and SHAP feature impact chart
- Timeline tab — chronological event list with colour-coded dots (red = malicious, yellow = suspicious, grey = benign); click any event to expand its narrative
- Artifacts tab — full table of all extracted artifacts with type, path, and UTC timestamp
- Classifications tab — all artifacts ranked by suspicion score with score bars
- Download buttons — one-click DFXML and JSON download after analysis completes
- Python 3.10+
- On Linux: `sudo apt install libewf-dev` (required for pyewf/pytsk3)
- On Windows: pytsk3 and pyewf are stubbed automatically when not available
```bash
# Clone or download the project
cd ForensIQ

# Install dependencies
pip install -r requirements.txt

# Optional: install as a package
pip install -e .
```

Dependencies (`requirements.txt`):

```
pytsk3>=20230125       # forensic image parsing
pyewf>=20230119        # libewf Python bindings
regipy>=3.1.0          # Windows registry hive parsing
scapy>=2.5.0           # pcap network analysis
torch>=2.0.0           # BERT inference
transformers>=4.38.0   # BERT tokeniser + model
scikit-learn>=1.4.0    # logistic regression + IsolationForest
numpy>=1.26.0
shap>=0.44.0           # SHAP explainability
rich>=13.7.0           # terminal dashboard
fastapi                # web backend
uvicorn                # ASGI server
python-multipart       # file upload support
```
Copy `.env.example` to `.env` and edit as needed:

```bash
cp .env.example .env
```

| Variable | Default | Description |
|---|---|---|
| `FORENSIQ_BERT_MODEL` | `nlpaueb/legal-bert-base-uncased` | HuggingFace model name or local path |
| `FORENSIQ_BERT_MAX_LEN` | `128` | Max token length for BERT |
| `FORENSIQ_THRESH_MALICIOUS` | `0.75` | Score threshold for malicious label |
| `FORENSIQ_THRESH_SUSPICIOUS` | `0.40` | Score threshold for suspicious label |
| `FORENSIQ_MAX_FILE_BYTES` | `10485760` | Max bytes read per file (10 MB) |
| `FORENSIQ_SHAP_NSAMPLES` | `100` | SHAP KernelExplainer sample count |
| `FORENSIQ_SHAP_BG_K` | `10` | SHAP k-means background centroids |
| `FORENSIQ_IF_CONTAMINATION` | `0.05` | IsolationForest contamination fraction |
| `FORENSIQ_OUT_DFXML` | `timeline.dfxml` | Default DFXML output path (CLI) |
| `FORENSIQ_OUT_JSON` | `report.json` | Default JSON output path (CLI) |
| `FORENSIQ_LOG_LEVEL` | `INFO` | Logging level |
| `FORENSIQ_LOG_FILE` | (none) | Optional log file path |
```bash
python server.py
```

Then open http://localhost:8000 in your browser.

- Select a case type and timezone offset
- Optionally upload an evidence file (`.E01`, `.dd`, `.db`, `.pcap`)
- Click Run Analysis
- Watch the live progress stream populate findings, timeline, and SHAP charts in real time
- Download the DFXML timeline or JSON report when complete
```bash
# Basic run on a real image
python main.py --image evidence.E01 --case financial_fraud --tz -5

# With a pcap file
python main.py --image evidence.E01 --case data_theft --tz 0 --pcap traffic.pcap

# Custom output paths
python main.py --image evidence.E01 --out-dfxml my_timeline.dfxml --out-json my_report.json

# Verbose logging
python main.py --image evidence.E01 --verbose
```

CLI flags:

| Flag | Default | Description |
|---|---|---|
| `--image` | required | Path to forensic image |
| `--case` | `default` | Case type profile |
| `--tz` | `0.0` | Timezone offset in hours |
| `--pcap` | none | Optional pcap file |
| `--out-dfxml` | `timeline.dfxml` | DFXML output path |
| `--out-json` | `report.json` | JSON output path |
| `--verbose` | off | Enable DEBUG logging |
Runs the full pipeline on a built-in synthetic forensic scenario (no image file needed):
```bash
# financial_fraud case, UTC-5 timezone
set PYTHONIOENCODING=utf-8 && python demo.py financial_fraud -5

# data_theft case, UTC+0
set PYTHONIOENCODING=utf-8 && python demo.py data_theft 0

# malware case
set PYTHONIOENCODING=utf-8 && python demo.py malware 0
```

The synthetic dataset includes:

- A spreadsheet modified at 3 AM
- A deleted 500 MB archive
- A timestomped executable (`svchost32.exe`)
- Chrome browser history with PayPal and Coinbase visits
- A Windows registry persistence key (`Run\updater`)
- A USB mount event (SanDisk, 2 minutes before the 3 AM activity)
- A 524 MB outbound network transfer flagged as anomalous
- A benign system DLL for contrast

Outputs are written to `demo_timeline.dfxml` and `demo_report.json`.
```bash
python -m ui.dashboard --image evidence.E01 --case financial_fraud --tz -5
```

Displays a live progress bar during analysis, then renders a colour-coded findings table and the top-10 suspicious timeline events in the terminal.
```bash
pip install pytest
python -m pytest tests/ -v
```

All 41 tests pass without any forensic hardware, GPU, or network access. Heavy dependencies are stubbed at the module level:

- `pytsk3`/`pyewf` — no Windows wheels; replaced with `MagicMock` stubs
- `torch`/`transformers` — replaced with stubs; `torch.Tensor` is a real class to satisfy scipy's `issubclass` check
- `scapy` — installed; network tests use `MagicMock` packets
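The stubbing technique is standard: install a `MagicMock` into `sys.modules` before the code under test imports the heavy library. A sketch of the idea (module names taken from the list above; this is not the test suite's exact code):

```python
import sys
from unittest.mock import MagicMock

# Register stubs before any `import pytsk3` in the code under test runs.
# setdefault leaves a real installation untouched if one exists.
for name in ("pytsk3", "pyewf"):
    sys.modules.setdefault(name, MagicMock())

import pytsk3  # resolves to the stub when the real wheel is absent

# MagicMock fabricates attributes on demand, so downstream code that
# calls e.g. pytsk3.FS_Info(...) still imports and runs:
print(callable(pytsk3.FS_Info))  # True
```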
Test coverage by module:
| Class | Tests |
|---|---|
| `TestImageParserHelpers` | `_ts()`, `_open_image()` for E01/DD/unsupported |
| `TestExtractMetadata` | basic fields, deleted flag, timestomp positive/negative, non-NTFS |
| `TestExtractBrowser` | financial domain flagging, empty data, Chrome epoch conversion |
| `TestExtractNetwork` | empty pcap, read error, artifact list returned |
| `TestForensicClassifier` | heuristic scores, case weight profiles, ext boost, unknown case fallback |
| `TestTimeline` | sort order, TZ offset, None timestamp skipped, DFXML structure/label/precision |
| `TestConfig` | defaults, env override, log_file default |
| `TestNarrativeBuilder` | deleted, timestomp, persistence, off-hours, financial, USB, label |
| Format | Extension | Library |
|---|---|---|
| Expert Witness Format | `.E01`, `.Ex01` | pyewf + pytsk3 |
| Raw / DD | `.dd`, `.raw`, `.img` | pytsk3 |
| AFF4 | `.aff4` | pytsk3 (raw block device) |
| Filesystem | OS |
|---|---|
| NTFS | Windows |
| ext2 / ext3 / ext4 | Linux |
| APFS | macOS |
| File | Extractor |
|---|---|
| Chrome/Firefox `History`, `Cookies`, `Web Data` | `extract_browser()` |
| Windows registry hives (NTUSER.DAT, SAM, SOFTWARE, SYSTEM, etc.) | `extract_registry()` |
| pcap / pcapng | `extract_network()` |
Each case type applies multipliers to artifact stream scores and boosts specific file extensions.
Prioritises browser artifacts (financial site visits, cookies) and document files.
| Stream | Multiplier |
|---|---|
| browser | 2.5× |
| metadata | 1.8× |
| network | 1.5× |
| registry | 1.0× |
Boosted extensions: .xlsx, .xls, .csv, .pst, .ost, .mbox, .pdf
Prioritises large file transfers, USB mounts, and cloud upload anomalies.
| Stream | Multiplier |
|---|---|
| registry | 2.5× |
| network | 2.5× |
| metadata | 2.0× |
| browser | 1.0× |
Boosted extensions: .zip, .rar, .7z, .tar, .gz, .db, .sql
Prioritises persistence keys, dropped executables, and C2 beaconing.
| Stream | Multiplier |
|---|---|
| registry | 2.5× |
| metadata | 2.0× |
| network | 2.0× |
| browser | 1.0× |
Boosted extensions: .exe, .dll, .bat, .ps1, .vbs, .js, .lnk
All streams weighted equally at 1.0×.
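Putting the profile tables together, the re-weighting step from the classifier section reduces to a small function. The helper below is hypothetical; the weights and boosted extensions are copied from the financial-fraud profile above:

```python
def apply_case_profile(prob: float, stream: str, ext: str,
                       stream_weights: dict[str, float],
                       boosted_exts: set[str]) -> float:
    """score = min(prob × weight, 1.0), with a 1.5× boost for
    high-value extensions in the active case-type profile."""
    weight = stream_weights.get(stream, 1.0)
    if ext.lower() in boosted_exts:
        weight *= 1.5
    return min(prob * weight, 1.0)

financial_fraud = {"browser": 2.5, "metadata": 1.8, "network": 1.5, "registry": 1.0}
boosted = {".xlsx", ".xls", ".csv", ".pst", ".ost", ".mbox", ".pdf"}

print(apply_case_profile(0.30, "browser", ".xlsx", financial_fraud, boosted))  # 1.0 (capped)
print(apply_case_profile(0.30, "registry", ".txt", financial_fraud, boosted))  # 0.3
```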
Array of flagged artifacts:

```json
[
  {
    "path": "/Windows/Temp/svchost32.exe",
    "label": "suspicious",
    "score": 0.72,
    "narrative": "[SUSPICIOUS] score=0.72 — because: created/modified at 03:00 UTC (off-hours); $SI/$FN timestamp mismatch (possible timestomping)",
    "top_features": [
      ["timestomp_detected", 0.267],
      ["network_anomaly", -0.039],
      ["financial_site", -0.033]
    ]
  }
]
```

Digital Forensics XML following the forensicswiki.org schema. Compatible with Autopsy, log2timeline/plaso, and other DFXML-aware tools.
```xml
<?xml version='1.0' encoding='utf-8'?>
<dfxml version="1.0" xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML">
  <metadata>
    <generator>ForensIQ</generator>
    <created>2024-03-15T08:00:00+00:00</created>
  </metadata>
  <fileobject>
    <filename>/Windows/Temp/svchost32.exe</filename>
    <source_type>metadata</source_type>
    <label>suspicious</label>
    <suspicion_score>0.7200</suspicion_score>
    <times><mtime>2024-03-15T08:00:00+00:00</mtime></times>
    <narrative>[SUSPICIOUS] score=0.72 — because: ...</narrative>
  </fileobject>
</dfxml>
```

- `pytsk3` and `pyewf` have no official Windows wheels. They are automatically stubbed when not importable, so the classifier, explainability, timeline, web UI, and demo all work fully.
- To analyse real disk images on Windows, use WSL2 with `libewf-dev` installed, or use a pre-built pytsk3 wheel from the libewf releases page.
- Set `PYTHONIOENCODING=utf-8` before running the CLI or demo to ensure Unicode characters print correctly.
```bash
# Debian/Ubuntu
sudo apt install libewf-dev

# macOS
brew install libewf
```

Then `pip install pytsk3 pyewf` will compile successfully.
- BERT inference runs on CPU by default. To use a GPU, install the CUDA-enabled torch wheel and the model will automatically use `cuda:0`.
- For production use, replace `nlpaueb/legal-bert-base-uncased` with a checkpoint fine-tuned on labelled forensic artifact datasets.