ForensIQ — AI-Powered Forensic Image Analyser

ForensIQ is an end-to-end digital forensics pipeline that mounts forensic disk images, extracts artifacts across multiple streams, classifies them with a BERT-based ML model, explains every decision with SHAP values, and synthesises a UTC-normalised timeline in DFXML format. It ships with a FastAPI backend and a React frontend for real-time analysis via the browser.


Table of Contents

  1. Features
  2. Architecture
  3. Project Structure
  4. Module Reference
  5. Backend API
  6. Frontend
  7. Installation
  8. Configuration
  9. Running the Web App
  10. Running the CLI
  11. Running the Demo
  12. Running Tests
  13. Supported Formats
  14. Case Type Profiles
  15. Output Formats
  16. Platform Notes

Features

  • Forensic image parsing — mounts E01 (libewf), DD/raw, and AFF4 images; traverses NTFS, ext4, and APFS filesystems using pytsk3; recovers deleted files and reads MFT records
  • Multi-stream artifact extraction — file metadata with MFT timestomping detection, Windows registry hives (persistence keys, USB history, UserAssist), browser SQLite databases (Chrome/Firefox history, cookies, downloads), and pcap network logs
  • BERT-based classification — two-stage pipeline: BERT [CLS] embedding + logistic regression head; heuristic fallback when no labelled data is available
  • Case-type re-weighting — investigator selects a case type (financial fraud, data theft, malware) and the classifier dynamically re-weights artifact streams and file extensions
  • SHAP explainability — KernelExplainer surfaces the top features driving each suspicion score with a human-readable narrative ("because: created at 3AM, timestomped, accessed financial website")
  • Timeline synthesis — correlates events across all artifact streams, normalises timestamps to UTC accounting for timezone offsets, outputs DFXML
  • Real-time web UI — React frontend with live SSE progress stream, findings table, SHAP bar charts, interactive timeline, and one-click DFXML/JSON download
  • Rich terminal dashboard — alternative CLI interface using the Rich library
  • 41-test suite — all heavy dependencies (pytsk3, pyewf, torch, transformers) mocked; runs without forensic hardware or GPU

Architecture

Browser
  │  POST /analyze  (case_type, tz_offset, file?)
  │  GET  /stream/{job_id}  ← Server-Sent Events
  ▼
backend/api.py  (FastAPI)
  │
  ├── backend/pipeline_runner.py  (daemon thread + queue)
  │     │
  │     ├── image_parser.py        mount image → FileEntry stream
  │     ├── artifact_extractors.py metadata / registry / browser / network
  │     ├── classifier.py          BERT embed + logistic head + case weights
  │     ├── explainability.py      SHAP KernelExplainer + narrative builder
  │     └── timeline.py            UTC normalise + DFXML serialiser
  │
  └── outputs/{job_id}.dfxml
      outputs/{job_id}.json

Events streamed to the browser in order:

SSE Event         Payload
progress          {step, total, msg}
artifacts         list of all extracted artifacts
classifications   scored + labelled artifact list
explanations      flagged artifacts with SHAP narratives
timeline          UTC-sorted event list
done              job summary + download URLs
error             traceback string
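
Consuming this stream outside the browser only requires splitting the SSE frames. A minimal parser sketch in plain Python, assuming the standard `event:` / `data:` framing with blank-line frame separators and JSON payloads:

```python
import json

def parse_sse(text: str):
    """Split raw Server-Sent Events text into (event, payload) pairs.

    Assumes the standard framing: `event:` and `data:` lines, with a
    blank line terminating each frame, and JSON payloads as emitted
    by /stream/{job_id}.
    """
    events, name, data = [], None, []
    for line in text.splitlines():
        if line.startswith("event:"):
            name = line[6:].strip()
        elif line.startswith("data:"):
            data.append(line[5:].strip())
        elif line == "" and name is not None:
            events.append((name, json.loads("\n".join(data))))
            name, data = None, []
    return events
```

For a live connection, curl -N or the browser's EventSource (as the frontend does) handles this framing automatically.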

Project Structure

ForensIQ/
│
├── backend/
│   ├── __init__.py
│   ├── api.py                 FastAPI app — HTTP endpoints + SSE
│   └── pipeline_runner.py     Background thread, queue, synthetic dataset
│
├── frontend/
│   └── index.html             Single-file React + Tailwind UI
│
├── ui/
│   └── dashboard.py           Rich terminal dashboard (alternative to web UI)
│
├── tests/
│   └── test_forensiq.py       41 unit tests (all deps mocked)
│
├── outputs/                   Generated DFXML and JSON reports
│
├── image_parser.py            E01/DD/AFF4 mounting, filesystem traversal
├── artifact_extractors.py     Metadata, registry, browser, network extractors
├── classifier.py              BERT classifier + case-type profiles
├── explainability.py          SHAP engine + narrative builder
├── timeline.py                UTC normaliser + DFXML serialiser
├── config.py                  Lazy env-var config singleton
├── logger.py                  Structured logging setup
├── main.py                    CLI entry point
├── demo.py                    Standalone demo on synthetic data
├── server.py                  Uvicorn server entry point
├── setup.py                   pip-installable package definition
├── requirements.txt           All Python dependencies
└── .env.example               All tunable environment variables

Module Reference

image_parser.py

Mounts a forensic image and yields FileEntry objects for every file.

from image_parser import parse_image

for entry in parse_image("evidence.E01"):
    print(entry.path, entry.size, entry.is_deleted)

FileEntry fields: path, size, inode, created, modified, accessed, is_deleted, fs_type, data

  • E01 images are opened via pyewf and bridged to pytsk3 through EWFImgInfo
  • Partition tables are detected automatically; falls back to whole-image filesystem if none found
  • Files are capped at cfg.max_file_read_bytes (default 10 MB) to prevent OOM
  • Filesystem type is detected per-partition: NTFS, ext4, APFS

artifact_extractors.py

Four extractors, all returning Artifact objects.

Artifact fields: artifact_type, source_path, timestamp, features, raw

extract_metadata(entry: FileEntry) -> Artifact

Extracts file size, timestamps, extension, deleted flag. On NTFS files starting with the MFT magic FILE, compares $STANDARD_INFORMATION and $FILE_NAME timestamps — a difference > 1 second flags possible timestomping.
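
The comparison itself is a single subtraction once both attributes are decoded. A sketch with hypothetical names (the real extractor parses these out of the raw MFT record):

```python
from datetime import datetime

def looks_timestomped(si_created: datetime, fn_created: datetime,
                      tolerance_s: float = 1.0) -> bool:
    """Flag a possible timestomp when the $STANDARD_INFORMATION and
    $FILE_NAME creation times disagree by more than tolerance_s.
    Tools that rewrite timestamps usually touch only $SI, so a stale
    $FN value is the tell."""
    return abs((si_created - fn_created).total_seconds()) > tolerance_s

si = datetime(2024, 3, 15, 3, 0, 0)   # rewritten by the attacker
fn = datetime(2023, 1, 2, 12, 0, 0)   # original creation time survives in $FN
print(looks_timestomped(si, fn))      # True
```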

extract_registry(entry: FileEntry) -> list[Artifact]

Parses Windows registry hives using regipy. Extracts:

  • Persistence keys: Run, RunOnce, Services
  • USB mount history: USBSTOR
  • User activity: UserAssist

extract_browser(entry: FileEntry) -> list[Artifact]

Reads Chrome/Firefox SQLite databases. Queries urls, cookies, and downloads tables. Converts Chrome's WebKit epoch (microseconds since 1601-01-01) to Unix epoch. Flags visits to financial domains (PayPal, Coinbase, Chase, etc.).
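
The epoch conversion is a fixed offset; a sketch of the arithmetic:

```python
# Seconds between the WebKit epoch (1601-01-01) and the Unix epoch (1970-01-01)
WEBKIT_UNIX_OFFSET_S = 11_644_473_600

def webkit_to_unix(webkit_us: int) -> float:
    """Convert a Chrome WebKit timestamp (microseconds since 1601-01-01)
    to Unix seconds."""
    return webkit_us / 1_000_000 - WEBKIT_UNIX_OFFSET_S

# The WebKit value for the Unix epoch itself round-trips to zero:
print(webkit_to_unix(11_644_473_600 * 1_000_000))  # 0.0
```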

extract_network(pcap_path: str) -> list[Artifact]

Reads a pcap file with Scapy, aggregates flows by (src, dst, dport, proto), then runs IsolationForest on [bytes, packets, duration] to flag anomalous flows.
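
The aggregation step can be sketched in plain Python (dict-shaped packets stand in for Scapy layers; only byte and packet counts are shown):

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets into flows keyed by (src, dst, dport, proto) and
    accumulate per-flow byte and packet counts -- two of the three
    features extract_network() feeds to IsolationForest."""
    flows = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["dport"], pkt["proto"])
        flows[key]["bytes"] += pkt["len"]
        flows[key]["packets"] += 1
    return dict(flows)
```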


classifier.py

from classifier import ForensicClassifier

clf = ForensicClassifier(case_type="financial_fraud")
results = clf.classify_batch(artifacts)

Two-stage pipeline:

  1. BERT tokenises artifact.raw (truncated to 512 chars) → [CLS] embedding (768-dim)
  2. Numeric features appended: [file_size_mb, is_deleted, anomaly, is_financial, timestomp, visit_count_norm]
  3. Logistic regression head maps the combined vector to a suspicion probability
  4. Case-type multiplier applied: score = min(prob × weight, 1.0)
  5. Extension boost: high-value extensions for the case type multiply weight by 1.5

Labels: benign (< 0.40), suspicious (0.40–0.75), malicious (> 0.75)

Thresholds are configurable via FORENSIQ_THRESH_SUSPICIOUS and FORENSIQ_THRESH_MALICIOUS.

When no labelled training data is available, _heuristic_score() is used as a zero-shot fallback based on rule weights.
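
The label boundaries reduce to a small function (defaults shown; the env vars above override them):

```python
def label_for(score: float,
              suspicious: float = 0.40,
              malicious: float = 0.75) -> str:
    """Map a suspicion score to its label: benign (< 0.40),
    suspicious (0.40-0.75), malicious (> 0.75)."""
    if score > malicious:
        return "malicious"
    if score >= suspicious:
        return "suspicious"
    return "benign"

print(label_for(0.72))  # suspicious
```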


explainability.py

from explainability import ExplainabilityEngine

engine = ExplainabilityEngine(clf, background_results)
explanation = engine.explain(result)
print(explanation.narrative)

  • Uses shap.KernelExplainer — model-agnostic, works with both the fitted logistic head and the heuristic fallback
  • Background is compressed to ≤ k unique centroids via shap.kmeans to keep SHAP tractable on 768-dim BERT embeddings
  • _build_narrative() maps SHAP-identified features to investigator-readable language:
    • Off-hours activity (before 05:00 or after 22:00 UTC)
    • Deleted files, timestomping, network anomalies
    • Persistence keys, USB mounts, financial site visits

Explanation fields: artifact_path, suspicion_score, label, top_features, narrative
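
The narrative step is essentially a phrase table over SHAP's top positive contributors. A sketch (feature names and phrases are illustrative, patterned on the report output):

```python
PHRASES = {
    "timestomp_detected": "$SI/$FN timestamp mismatch (possible timestomping)",
    "is_deleted": "file was deleted",
    "network_anomaly": "anomalous outbound network flow",
    "financial_site": "accessed financial website",
    "usb_mount": "USB device mounted",
}

def build_narrative(label, score, top_features):
    """Turn SHAP's ranked (feature, impact) pairs into a sentence,
    keeping only positive contributions (features pushing the score up)."""
    reasons = [PHRASES[f] for f, impact in top_features
               if impact > 0 and f in PHRASES]
    return f"[{label.upper()}] score={score:.2f} — because: " + "; ".join(reasons)
```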


timeline.py

from timeline import TimelineSynthesizer

synth = TimelineSynthesizer(tz_offset_hours=-5)
events = synth.build_from_results(pairs)   # pairs = [(ClassificationResult, Explanation)]
dfxml  = synth.to_dfxml(events)

  • tz_offset_hours is subtracted from stored timestamps to convert local time to UTC
  • Events are sorted ascending by UTC time
  • DFXML output follows the forensicswiki.org schema, compatible with Autopsy and log2timeline
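
The normalisation is a single subtraction; for example, 03:00 local at UTC-5 becomes 08:00 UTC:

```python
from datetime import datetime, timedelta, timezone

def to_utc(local_ts: datetime, tz_offset_hours: float) -> datetime:
    """Normalise a naive local timestamp to UTC by subtracting the
    investigator-supplied offset (e.g. -5 for EST)."""
    return (local_ts - timedelta(hours=tz_offset_hours)).replace(tzinfo=timezone.utc)

print(to_utc(datetime(2024, 3, 15, 3, 0), -5).isoformat())
# 2024-03-15T08:00:00+00:00
```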

config.py

All settings are read lazily from os.environ (populated from .env if present). This means patch.dict(os.environ, ...) works correctly in tests.

from config import cfg

print(cfg.bert_model)             # nlpaueb/legal-bert-base-uncased
print(cfg.threshold_malicious)    # 0.75
print(cfg.shap_nsamples)          # 100
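
The laziness is what makes env patching work: properties re-read os.environ on every access instead of caching at import time. A sketch of the pattern with two representative keys (the real config.py covers the full variable table under Configuration):

```python
import os

class _Config:
    """Every attribute re-reads os.environ at access time, so
    patch.dict(os.environ, ...) in a test changes what cfg returns
    with no module reload needed."""

    @property
    def threshold_malicious(self) -> float:
        return float(os.environ.get("FORENSIQ_THRESH_MALICIOUS", "0.75"))

    @property
    def bert_model(self) -> str:
        return os.environ.get("FORENSIQ_BERT_MODEL",
                              "nlpaueb/legal-bert-base-uncased")

cfg = _Config()
```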

logger.py

from logger import setup_logging
setup_logging(level="DEBUG", log_file="forensiq.log")

Configures the root logger with a timestamped formatter. Silences noisy third-party loggers (transformers, scapy.runtime).


Backend API

Start the server:

python server.py
# or
uvicorn backend.api:app --host 0.0.0.0 --port 8000

Endpoints

GET /

Serves frontend/index.html.

POST /analyze

Start a new analysis job.

Field      Type         Default          Description
case_type  form string  financial_fraud  One of financial_fraud, data_theft, malware, default
tz_offset  form float   -5.0             Local timezone offset in hours (e.g. -5 for EST)
file       file upload  optional         Evidence file (.E01, .dd, .db, .pcap, etc.)

Response:

{ "job_id": "a1b2c3d4", "stream_url": "/stream/a1b2c3d4" }

GET /stream/{job_id}

Server-Sent Events stream. Connect with EventSource in the browser or curl:

curl -N http://localhost:8000/stream/a1b2c3d4

Events: progress, artifacts, classifications, explanations, timeline, done, error

GET /download/{job_id}/dfxml

Download the DFXML timeline for a completed job.

GET /download/{job_id}/json

Download the JSON findings report for a completed job.

GET /docs

Auto-generated Swagger UI (provided by FastAPI).


Frontend

Single-file React 18 + Tailwind CSS application served directly from frontend/index.html. No build step required.

Panels:

  • Configuration — case type selector, timezone offset input, optional file upload, Run button
  • Progress bar — live step indicator fed by SSE progress events
  • Stats row — total artifacts, flagged count, malicious count, suspicious count
  • Findings tab — expandable cards per flagged artifact showing score bar, narrative, and SHAP feature impact chart
  • Timeline tab — chronological event list with colour-coded dots (red = malicious, yellow = suspicious, grey = benign); click any event to expand its narrative
  • Artifacts tab — full table of all extracted artifacts with type, path, and UTC timestamp
  • Classifications tab — all artifacts ranked by suspicion score with score bars
  • Download buttons — one-click DFXML and JSON download after analysis completes

Installation

Prerequisites

  • Python 3.10+
  • On Linux: sudo apt install libewf-dev (required for pyewf/pytsk3)
  • On Windows: pytsk3 and pyewf are stubbed automatically when not available

Steps

# Clone or download the project
cd ForensIQ

# Install dependencies
pip install -r requirements.txt

# Optional: install as a package
pip install -e .

requirements.txt

pytsk3>=20230125        # forensic image parsing
pyewf>=20230119         # libewf Python bindings
regipy>=3.1.0           # Windows registry hive parsing
scapy>=2.5.0            # pcap network analysis
torch>=2.0.0            # BERT inference
transformers>=4.38.0    # BERT tokeniser + model
scikit-learn>=1.4.0     # logistic regression + IsolationForest
numpy>=1.26.0
shap>=0.44.0            # SHAP explainability
rich>=13.7.0            # terminal dashboard
fastapi                 # web backend
uvicorn                 # ASGI server
python-multipart        # file upload support

Configuration

Copy .env.example to .env and edit as needed:

cp .env.example .env

Variable                    Default                          Description
FORENSIQ_BERT_MODEL         nlpaueb/legal-bert-base-uncased  HuggingFace model name or local path
FORENSIQ_BERT_MAX_LEN       128                              Max token length for BERT
FORENSIQ_THRESH_MALICIOUS   0.75                             Score threshold for malicious label
FORENSIQ_THRESH_SUSPICIOUS  0.40                             Score threshold for suspicious label
FORENSIQ_MAX_FILE_BYTES     10485760                         Max bytes read per file (10 MB)
FORENSIQ_SHAP_NSAMPLES      100                              SHAP KernelExplainer sample count
FORENSIQ_SHAP_BG_K          10                               SHAP k-means background centroids
FORENSIQ_IF_CONTAMINATION   0.05                             IsolationForest contamination fraction
FORENSIQ_OUT_DFXML          timeline.dfxml                   Default DFXML output path (CLI)
FORENSIQ_OUT_JSON           report.json                      Default JSON output path (CLI)
FORENSIQ_LOG_LEVEL          INFO                             Logging level
FORENSIQ_LOG_FILE           (none)                           Optional log file path

Running the Web App

python server.py

Then open http://localhost:8000 in your browser.

  1. Select a case type and timezone offset
  2. Optionally upload an evidence file (.E01, .dd, .db, .pcap)
  3. Click Run Analysis
  4. Watch the live progress stream populate findings, timeline, and SHAP charts in real time
  5. Download the DFXML timeline or JSON report when complete

Running the CLI

# Basic run on a real image
python main.py --image evidence.E01 --case financial_fraud --tz -5

# With a pcap file
python main.py --image evidence.E01 --case data_theft --tz 0 --pcap traffic.pcap

# Custom output paths
python main.py --image evidence.E01 --out-dfxml my_timeline.dfxml --out-json my_report.json

# Verbose logging
python main.py --image evidence.E01 --verbose

CLI flags:

Flag         Default         Description
--image      required        Path to forensic image
--case       default         Case type profile
--tz         0.0             Timezone offset in hours
--pcap       none            Optional pcap file
--out-dfxml  timeline.dfxml  DFXML output path
--out-json   report.json     JSON output path
--verbose    off             Enable DEBUG logging

Running the Demo

Runs the full pipeline on a built-in synthetic forensic scenario (no image file needed):

# financial_fraud case, UTC-5 timezone
set PYTHONIOENCODING=utf-8 && python demo.py financial_fraud -5

# data_theft case, UTC+0
set PYTHONIOENCODING=utf-8 && python demo.py data_theft 0

# malware case
set PYTHONIOENCODING=utf-8 && python demo.py malware 0

The synthetic dataset includes:

  • A spreadsheet modified at 3 AM
  • A deleted 500 MB archive
  • A timestomped executable (svchost32.exe)
  • Chrome browser history with PayPal and Coinbase visits
  • A Windows registry persistence key (Run\updater)
  • A USB mount event (SanDisk, 2 minutes before the 3 AM activity)
  • A 524 MB outbound network transfer flagged as anomalous
  • A benign system DLL for contrast

Outputs written to demo_timeline.dfxml and demo_report.json.


Running the Rich Terminal Dashboard

python -m ui.dashboard --image evidence.E01 --case financial_fraud --tz -5

Displays a live progress bar during analysis, then renders a colour-coded findings table and top-10 suspicious timeline events in the terminal.


Running Tests

pip install pytest
python -m pytest tests/ -v

All 41 tests pass without any forensic hardware, GPU, or network access. Heavy dependencies are stubbed at the module level:

  • pytsk3 / pyewf — no Windows wheels; replaced with MagicMock stubs
  • torch / transformers — replaced with stubs; torch.Tensor is a real class to satisfy scipy's issubclass check
  • scapy — installed; network tests use MagicMock packets

Test coverage by module:

Class                   Tests
TestImageParserHelpers  _ts(), _open_image() for E01/DD/unsupported
TestExtractMetadata     basic fields, deleted flag, timestomp positive/negative, non-NTFS
TestExtractBrowser      financial domain flagging, empty data, Chrome epoch conversion
TestExtractNetwork      empty pcap, read error, artifact list returned
TestForensicClassifier  heuristic scores, case weight profiles, ext boost, unknown case fallback
TestTimeline            sort order, TZ offset, None timestamp skipped, DFXML structure/label/precision
TestConfig              defaults, env override, log_file default
TestNarrativeBuilder    deleted, timestomp, persistence, off-hours, financial, USB, label

Supported Formats

Image Formats

Format                 Extension        Library
Expert Witness Format  .E01, .Ex01      pyewf + pytsk3
Raw / DD               .dd, .raw, .img  pytsk3
AFF4                   .aff4            pytsk3 (raw block device)

Filesystems

Filesystem          OS
NTFS                Windows
ext2 / ext3 / ext4  Linux
APFS                macOS

Other Input Files

File                                                                Extractor
Chrome/Firefox History, Cookies, Web Data                           extract_browser()
Windows registry hives (NTUSER.DAT, SAM, SOFTWARE, SYSTEM, etc.)    extract_registry()
pcap / pcapng                                                       extract_network()

Case Type Profiles

Each case type applies multipliers to artifact stream scores and boosts specific file extensions.

financial_fraud

Prioritises browser artifacts (financial site visits, cookies) and document files.

Stream Multiplier
browser 2.5×
metadata 1.8×
network 1.5×
registry 1.0×

Boosted extensions: .xlsx, .xls, .csv, .pst, .ost, .mbox, .pdf

data_theft

Prioritises large file transfers, USB mounts, and cloud upload anomalies.

Stream Multiplier
registry 2.5×
network 2.5×
metadata 2.0×
browser 1.0×

Boosted extensions: .zip, .rar, .7z, .tar, .gz, .db, .sql

malware

Prioritises persistence keys, dropped executables, and C2 beaconing.

Stream Multiplier
registry 2.5×
metadata 2.0×
network 2.0×
browser 1.0×

Boosted extensions: .exe, .dll, .bat, .ps1, .vbs, .js, .lnk

default

All streams weighted equally at 1.0×.
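
Putting a profile to work is one multiply and a clamp. A sketch using the financial_fraud numbers above:

```python
# Multipliers and boosted extensions from the financial_fraud profile above.
STREAM_WEIGHTS = {"browser": 2.5, "metadata": 1.8, "network": 1.5, "registry": 1.0}
BOOSTED_EXTS = {".xlsx", ".xls", ".csv", ".pst", ".ost", ".mbox", ".pdf"}

def weighted_score(prob: float, stream: str, ext: str) -> float:
    """score = min(prob x stream multiplier x optional 1.5 ext boost, 1.0)"""
    weight = STREAM_WEIGHTS.get(stream, 1.0)
    if ext in BOOSTED_EXTS:
        weight *= 1.5
    return min(prob * weight, 1.0)

print(weighted_score(0.3, "browser", ".txt"))   # 0.75  (0.3 x 2.5)
```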


Output Formats

JSON Report (report.json)

Array of flagged artifacts:

[
  {
    "path": "/Windows/Temp/svchost32.exe",
    "label": "suspicious",
    "score": 0.72,
    "narrative": "[SUSPICIOUS] score=0.72 — because: created/modified at 03:00 UTC (off-hours); $SI/$FN timestamp mismatch (possible timestomping)",
    "top_features": [
      ["timestomp_detected", 0.267],
      ["network_anomaly", -0.039],
      ["financial_site", -0.033]
    ]
  }
]

DFXML Timeline (timeline.dfxml)

Digital Forensics XML following the forensicswiki.org schema. Compatible with Autopsy, log2timeline/plaso, and other DFXML-aware tools.

<?xml version='1.0' encoding='utf-8'?>
<dfxml version="1.0" xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML">
  <metadata>
    <generator>ForensIQ</generator>
    <created>2024-03-15T08:00:00+00:00</created>
  </metadata>
  <fileobject>
    <filename>/Windows/Temp/svchost32.exe</filename>
    <source_type>metadata</source_type>
    <label>suspicious</label>
    <suspicion_score>0.7200</suspicion_score>
    <times><mtime>2024-03-15T08:00:00+00:00</mtime></times>
    <narrative>[SUSPICIOUS] score=0.72 — because: ...</narrative>
  </fileobject>
</dfxml>
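
Generating that layout needs nothing beyond the standard library. A sketch with xml.etree (dict-shaped events stand in for the real Event objects; only a subset of fields is shown):

```python
import xml.etree.ElementTree as ET

DFXML_NS = "http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML"

def events_to_dfxml(events) -> str:
    """Serialise timeline events into the fileobject layout shown above."""
    root = ET.Element("dfxml", version="1.0", xmlns=DFXML_NS)
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "generator").text = "ForensIQ"
    for ev in events:
        fo = ET.SubElement(root, "fileobject")
        ET.SubElement(fo, "filename").text = ev["path"]
        ET.SubElement(fo, "label").text = ev["label"]
        ET.SubElement(fo, "suspicion_score").text = f"{ev['score']:.4f}"
        times = ET.SubElement(fo, "times")
        ET.SubElement(times, "mtime").text = ev["mtime"]
    return ET.tostring(root, encoding="unicode")
```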

Platform Notes

Windows

  • pytsk3 and pyewf have no official Windows wheels. They are automatically stubbed when not importable, so the classifier, explainability, timeline, web UI, and demo all work fully.
  • To analyse real disk images on Windows, use WSL2 with libewf-dev installed, or use a pre-built pytsk3 wheel from the libewf releases page.
  • Set PYTHONIOENCODING=utf-8 before running the CLI or demo to ensure Unicode characters print correctly.

Linux / macOS

# Debian/Ubuntu
sudo apt install libewf-dev

# macOS
brew install libewf

Then pip install pytsk3 pyewf will compile successfully.

GPU / CPU

  • BERT inference runs on CPU by default. To use a GPU, install the CUDA-enabled torch wheel and the model will automatically use cuda:0.
  • For production use, replace nlpaueb/legal-bert-base-uncased with a checkpoint fine-tuned on labelled forensic artifact datasets.
