ForensIQ — AI-Powered Forensic Image Analyser

ForensIQ is an end-to-end digital forensics pipeline that mounts forensic disk images, extracts artifacts across multiple streams, classifies them with a BERT-based ML model, explains every decision with SHAP values, and synthesises a UTC-normalised timeline in DFXML format. It ships with a FastAPI backend and a React frontend for real-time analysis via the browser.


Table of Contents

  1. Features
  2. Architecture
  3. Project Structure
  4. Module Reference
  5. Backend API
  6. Frontend
  7. Installation
  8. Configuration
  9. Running the Web App
  10. Running the CLI
  11. Running the Demo
  12. Running Tests
  13. Supported Formats
  14. Case Type Profiles
  15. Output Formats
  16. Platform Notes

Features

  • Forensic image parsing — mounts E01 (libewf), DD/raw, and AFF4 images; traverses NTFS, ext4, and APFS filesystems using pytsk3; recovers deleted files and reads MFT records
  • Multi-stream artifact extraction — file metadata with MFT timestomping detection, Windows registry hives (persistence keys, USB history, UserAssist), browser SQLite databases (Chrome/Firefox history, cookies, downloads), and pcap network logs
  • BERT-based classification — two-stage pipeline: BERT [CLS] embedding + logistic regression head; heuristic fallback when no labelled data is available
  • Case-type re-weighting — investigator selects a case type (financial fraud, data theft, malware) and the classifier dynamically re-weights artifact streams and file extensions
  • SHAP explainability — KernelExplainer surfaces the top features driving each suspicion score with a human-readable narrative ("because: created at 3AM, timestomped, accessed financial website")
  • Timeline synthesis — correlates events across all artifact streams, normalises timestamps to UTC accounting for timezone offsets, outputs DFXML
  • Real-time web UI — React frontend with live SSE progress stream, findings table, SHAP bar charts, interactive timeline, and one-click DFXML/JSON download
  • Rich terminal dashboard — alternative CLI interface using the Rich library
  • 41-test suite — all heavy dependencies (pytsk3, pyewf, torch, transformers) mocked; runs without forensic hardware or GPU

Architecture

Browser
  │  POST /analyze  (case_type, tz_offset, file?)
  │  GET  /stream/{job_id}  ← Server-Sent Events
  ▼
backend/api.py  (FastAPI)
  │
  ├── backend/pipeline_runner.py  (daemon thread + queue)
  │     │
  │     ├── image_parser.py        mount image → FileEntry stream
  │     ├── artifact_extractors.py metadata / registry / browser / network
  │     ├── classifier.py          BERT embed + logistic head + case weights
  │     ├── explainability.py      SHAP KernelExplainer + narrative builder
  │     └── timeline.py            UTC normalise + DFXML serialiser
  │
  └── outputs/{job_id}.dfxml
      outputs/{job_id}.json

Events streamed to the browser in order:

SSE Event         Payload
progress          {step, total, msg}
artifacts         list of all extracted artifacts
classifications   scored + labelled artifact list
explanations      flagged artifacts with SHAP narratives
timeline          UTC-sorted event list
done              job summary + download URLs
error             traceback string
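
Consuming this stream outside the browser only requires splitting the SSE frames. A minimal parser sketch in plain Python, assuming the standard `event:` / `data:` framing with blank-line frame separators and JSON payloads:

```python
import json

def parse_sse(text: str):
    """Split raw Server-Sent Events text into (event, payload) pairs.

    Assumes the standard framing: `event:` and `data:` lines, with a
    blank line terminating each frame, and JSON payloads as emitted
    by /stream/{job_id}.
    """
    events, name, data = [], None, []
    for line in text.splitlines():
        if line.startswith("event:"):
            name = line[6:].strip()
        elif line.startswith("data:"):
            data.append(line[5:].strip())
        elif line == "" and name is not None:
            events.append((name, json.loads("\n".join(data))))
            name, data = None, []
    return events
```

For a live connection, curl -N or the browser's EventSource (as the frontend does) handles this framing automatically.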

Project Structure

ForensIQ/
│
├── backend/
│   ├── __init__.py
│   ├── api.py                 FastAPI app — HTTP endpoints + SSE
│   └── pipeline_runner.py     Background thread, queue, synthetic dataset
│
├── frontend/
│   └── index.html             Single-file React + Tailwind UI
│
├── ui/
│   └── dashboard.py           Rich terminal dashboard (alternative to web UI)
│
├── tests/
│   └── test_forensiq.py       41 unit tests (all deps mocked)
│
├── outputs/                   Generated DFXML and JSON reports
│
├── image_parser.py            E01/DD/AFF4 mounting, filesystem traversal
├── artifact_extractors.py     Metadata, registry, browser, network extractors
├── classifier.py              BERT classifier + case-type profiles
├── explainability.py          SHAP engine + narrative builder
├── timeline.py                UTC normaliser + DFXML serialiser
├── config.py                  Lazy env-var config singleton
├── logger.py                  Structured logging setup
├── main.py                    CLI entry point
├── demo.py                    Standalone demo on synthetic data
├── server.py                  Uvicorn server entry point
├── setup.py                   pip-installable package definition
├── requirements.txt           All Python dependencies
└── .env.example               All tunable environment variables

Module Reference

image_parser.py

Mounts a forensic image and yields FileEntry objects for every file.

from image_parser import parse_image

for entry in parse_image("evidence.E01"):
    print(entry.path, entry.size, entry.is_deleted)

FileEntry fields: path, size, inode, created, modified, accessed, is_deleted, fs_type, data

  • E01 images are opened via pyewf and bridged to pytsk3 through EWFImgInfo
  • Partition tables are detected automatically; falls back to whole-image filesystem if none found
  • Files are capped at cfg.max_file_read_bytes (default 10 MB) to prevent OOM
  • Filesystem type is detected per-partition: NTFS, ext4, APFS

artifact_extractors.py

Four extractors, all returning Artifact objects.

Artifact fields: artifact_type, source_path, timestamp, features, raw

extract_metadata(entry: FileEntry) -> Artifact

Extracts file size, timestamps, extension, deleted flag. On NTFS files starting with the MFT magic FILE, compares $STANDARD_INFORMATION and $FILE_NAME timestamps — a difference > 1 second flags possible timestomping.
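
The comparison itself is a single subtraction once both attributes are decoded. A sketch with hypothetical names (the real extractor parses these out of the raw MFT record):

```python
from datetime import datetime

def looks_timestomped(si_created: datetime, fn_created: datetime,
                      tolerance_s: float = 1.0) -> bool:
    """Flag a possible timestomp when the $STANDARD_INFORMATION and
    $FILE_NAME creation times disagree by more than tolerance_s.
    Tools that rewrite timestamps usually touch only $SI, so a stale
    $FN value is the tell."""
    return abs((si_created - fn_created).total_seconds()) > tolerance_s

si = datetime(2024, 3, 15, 3, 0, 0)   # rewritten by the attacker
fn = datetime(2023, 1, 2, 12, 0, 0)   # original creation time survives in $FN
print(looks_timestomped(si, fn))      # True
```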

extract_registry(entry: FileEntry) -> list[Artifact]

Parses Windows registry hives using regipy. Extracts:

  • Persistence keys: Run, RunOnce, Services
  • USB mount history: USBSTOR
  • User activity: UserAssist

extract_browser(entry: FileEntry) -> list[Artifact]

Reads Chrome/Firefox SQLite databases. Queries urls, cookies, and downloads tables. Converts Chrome's WebKit epoch (microseconds since 1601-01-01) to Unix epoch. Flags visits to financial domains (PayPal, Coinbase, Chase, etc.).
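
The epoch conversion is a fixed offset; a sketch of the arithmetic:

```python
# Seconds between the WebKit epoch (1601-01-01) and the Unix epoch (1970-01-01)
WEBKIT_UNIX_OFFSET_S = 11_644_473_600

def webkit_to_unix(webkit_us: int) -> float:
    """Convert a Chrome WebKit timestamp (microseconds since 1601-01-01)
    to Unix seconds."""
    return webkit_us / 1_000_000 - WEBKIT_UNIX_OFFSET_S

# The WebKit value for the Unix epoch itself round-trips to zero:
print(webkit_to_unix(11_644_473_600 * 1_000_000))  # 0.0
```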

extract_network(pcap_path: str) -> list[Artifact]

Reads a pcap file with Scapy, aggregates flows by (src, dst, dport, proto), then runs IsolationForest on [bytes, packets, duration] to flag anomalous flows.
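
The aggregation step can be sketched in plain Python (dict-shaped packets stand in for Scapy layers; only byte and packet counts are shown):

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets into flows keyed by (src, dst, dport, proto) and
    accumulate per-flow byte and packet counts -- two of the three
    features extract_network() feeds to IsolationForest."""
    flows = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["dport"], pkt["proto"])
        flows[key]["bytes"] += pkt["len"]
        flows[key]["packets"] += 1
    return dict(flows)
```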


classifier.py

from classifier import ForensicClassifier

clf = ForensicClassifier(case_type="financial_fraud")
results = clf.classify_batch(artifacts)

Two-stage pipeline:

  1. BERT tokenises artifact.raw (truncated to 512 chars) → [CLS] embedding (768-dim)
  2. Numeric features appended: [file_size_mb, is_deleted, anomaly, is_financial, timestomp, visit_count_norm]
  3. Logistic regression head maps the combined vector to a suspicion probability
  4. Case-type multiplier applied: score = min(prob × weight, 1.0)
  5. Extension boost: high-value extensions for the case type multiply weight by 1.5

Labels: benign (< 0.40), suspicious (0.40–0.75), malicious (> 0.75)

Thresholds are configurable via FORENSIQ_THRESH_SUSPICIOUS and FORENSIQ_THRESH_MALICIOUS.

When no labelled training data is available, _heuristic_score() is used as a zero-shot fallback based on rule weights.
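
The label boundaries reduce to a small function (defaults shown; the env vars above override them):

```python
def label_for(score: float,
              suspicious: float = 0.40,
              malicious: float = 0.75) -> str:
    """Map a suspicion score to its label: benign (< 0.40),
    suspicious (0.40-0.75), malicious (> 0.75)."""
    if score > malicious:
        return "malicious"
    if score >= suspicious:
        return "suspicious"
    return "benign"

print(label_for(0.72))  # suspicious
```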


explainability.py

from explainability import ExplainabilityEngine

engine = ExplainabilityEngine(clf, background_results)
explanation = engine.explain(result)
print(explanation.narrative)

  • Uses shap.KernelExplainer — model-agnostic, works with both the fitted logistic head and the heuristic fallback
  • Background is compressed to ≤ k unique centroids via shap.kmeans to keep SHAP tractable on 768-dim BERT embeddings
  • _build_narrative() maps SHAP-identified features to investigator-readable language:
    • Off-hours activity (before 05:00 or after 22:00 UTC)
    • Deleted files, timestomping, network anomalies
    • Persistence keys, USB mounts, financial site visits

Explanation fields: artifact_path, suspicion_score, label, top_features, narrative
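
The narrative step is essentially a phrase table over SHAP's top positive contributors. A sketch (feature names and phrases are illustrative, patterned on the report output):

```python
PHRASES = {
    "timestomp_detected": "$SI/$FN timestamp mismatch (possible timestomping)",
    "is_deleted": "file was deleted",
    "network_anomaly": "anomalous outbound network flow",
    "financial_site": "accessed financial website",
    "usb_mount": "USB device mounted",
}

def build_narrative(label, score, top_features):
    """Turn SHAP's ranked (feature, impact) pairs into a sentence,
    keeping only positive contributions (features pushing the score up)."""
    reasons = [PHRASES[f] for f, impact in top_features
               if impact > 0 and f in PHRASES]
    return f"[{label.upper()}] score={score:.2f} — because: " + "; ".join(reasons)
```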


timeline.py

from timeline import TimelineSynthesizer

synth = TimelineSynthesizer(tz_offset_hours=-5)
events = synth.build_from_results(pairs)   # pairs = [(ClassificationResult, Explanation)]
dfxml  = synth.to_dfxml(events)

  • tz_offset_hours is subtracted from stored timestamps to convert local time to UTC
  • Events are sorted ascending by UTC time
  • DFXML output follows the forensicswiki.org schema, compatible with Autopsy and log2timeline
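
The normalisation is a single subtraction; for example, 03:00 local at UTC-5 becomes 08:00 UTC:

```python
from datetime import datetime, timedelta, timezone

def to_utc(local_ts: datetime, tz_offset_hours: float) -> datetime:
    """Normalise a naive local timestamp to UTC by subtracting the
    investigator-supplied offset (e.g. -5 for EST)."""
    return (local_ts - timedelta(hours=tz_offset_hours)).replace(tzinfo=timezone.utc)

print(to_utc(datetime(2024, 3, 15, 3, 0), -5).isoformat())
# 2024-03-15T08:00:00+00:00
```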

config.py

All settings are read lazily from os.environ (populated from .env if present). This means patch.dict(os.environ, ...) works correctly in tests.

from config import cfg

print(cfg.bert_model)             # nlpaueb/legal-bert-base-uncased
print(cfg.threshold_malicious)    # 0.75
print(cfg.shap_nsamples)          # 100
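
The laziness is what makes env patching work: properties re-read os.environ on every access instead of caching at import time. A sketch of the pattern with two representative keys (the real config.py covers the full variable table under Configuration):

```python
import os

class _Config:
    """Every attribute re-reads os.environ at access time, so
    patch.dict(os.environ, ...) in a test changes what cfg returns
    with no module reload needed."""

    @property
    def threshold_malicious(self) -> float:
        return float(os.environ.get("FORENSIQ_THRESH_MALICIOUS", "0.75"))

    @property
    def bert_model(self) -> str:
        return os.environ.get("FORENSIQ_BERT_MODEL",
                              "nlpaueb/legal-bert-base-uncased")

cfg = _Config()
```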

logger.py

from logger import setup_logging
setup_logging(level="DEBUG", log_file="forensiq.log")

Configures the root logger with a timestamped formatter. Silences noisy third-party loggers (transformers, scapy.runtime).


Backend API

Start the server:

python server.py
# or
uvicorn backend.api:app --host 0.0.0.0 --port 8000

Endpoints

GET /

Serves frontend/index.html.

POST /analyze

Start a new analysis job.

Field      Type         Default          Description
case_type  form string  financial_fraud  One of financial_fraud, data_theft, malware, default
tz_offset  form float   -5.0             Local timezone offset in hours (e.g. -5 for EST)
file       file upload  optional         Evidence file (.E01, .dd, .db, .pcap, etc.)

Response:

{ "job_id": "a1b2c3d4", "stream_url": "/stream/a1b2c3d4" }

GET /stream/{job_id}

Server-Sent Events stream. Connect with EventSource in the browser or curl:

curl -N http://localhost:8000/stream/a1b2c3d4

Events: progress, artifacts, classifications, explanations, timeline, done, error

GET /download/{job_id}/dfxml

Download the DFXML timeline for a completed job.

GET /download/{job_id}/json

Download the JSON findings report for a completed job.

GET /docs

Auto-generated Swagger UI (provided by FastAPI).


Frontend

Single-file React 18 + Tailwind CSS application served directly from frontend/index.html. No build step required.

Panels:

  • Configuration — case type selector, timezone offset input, optional file upload, Run button
  • Progress bar — live step indicator fed by SSE progress events
  • Stats row — total artifacts, flagged count, malicious count, suspicious count
  • Findings tab — expandable cards per flagged artifact showing score bar, narrative, and SHAP feature impact chart
  • Timeline tab — chronological event list with colour-coded dots (red = malicious, yellow = suspicious, grey = benign); click any event to expand its narrative
  • Artifacts tab — full table of all extracted artifacts with type, path, and UTC timestamp
  • Classifications tab — all artifacts ranked by suspicion score with score bars
  • Download buttons — one-click DFXML and JSON download after analysis completes

Installation

Prerequisites

  • Python 3.10+
  • On Linux: sudo apt install libewf-dev (required for pyewf/pytsk3)
  • On Windows: pytsk3 and pyewf are stubbed automatically when not available

Steps

# Clone or download the project
cd ForensIQ

# Install dependencies
pip install -r requirements.txt

# Optional: install as a package
pip install -e .

requirements.txt

pytsk3>=20230125        # forensic image parsing
pyewf>=20230119         # libewf Python bindings
regipy>=3.1.0           # Windows registry hive parsing
scapy>=2.5.0            # pcap network analysis
torch>=2.0.0            # BERT inference
transformers>=4.38.0    # BERT tokeniser + model
scikit-learn>=1.4.0     # logistic regression + IsolationForest
numpy>=1.26.0
shap>=0.44.0            # SHAP explainability
rich>=13.7.0            # terminal dashboard
fastapi                 # web backend
uvicorn                 # ASGI server
python-multipart        # file upload support

Configuration

Copy .env.example to .env and edit as needed:

cp .env.example .env

Variable                    Default                          Description
FORENSIQ_BERT_MODEL         nlpaueb/legal-bert-base-uncased  HuggingFace model name or local path
FORENSIQ_BERT_MAX_LEN       128                              Max token length for BERT
FORENSIQ_THRESH_MALICIOUS   0.75                             Score threshold for malicious label
FORENSIQ_THRESH_SUSPICIOUS  0.40                             Score threshold for suspicious label
FORENSIQ_MAX_FILE_BYTES     10485760                         Max bytes read per file (10 MB)
FORENSIQ_SHAP_NSAMPLES      100                              SHAP KernelExplainer sample count
FORENSIQ_SHAP_BG_K          10                               SHAP k-means background centroids
FORENSIQ_IF_CONTAMINATION   0.05                             IsolationForest contamination fraction
FORENSIQ_OUT_DFXML          timeline.dfxml                   Default DFXML output path (CLI)
FORENSIQ_OUT_JSON           report.json                      Default JSON output path (CLI)
FORENSIQ_LOG_LEVEL          INFO                             Logging level
FORENSIQ_LOG_FILE           (none)                           Optional log file path

Running the Web App

python server.py

Then open http://localhost:8000 in your browser.

  1. Select a case type and timezone offset
  2. Optionally upload an evidence file (.E01, .dd, .db, .pcap)
  3. Click Run Analysis
  4. Watch the live progress stream populate findings, timeline, and SHAP charts in real time
  5. Download the DFXML timeline or JSON report when complete

Running the CLI

# Basic run on a real image
python main.py --image evidence.E01 --case financial_fraud --tz -5

# With a pcap file
python main.py --image evidence.E01 --case data_theft --tz 0 --pcap traffic.pcap

# Custom output paths
python main.py --image evidence.E01 --out-dfxml my_timeline.dfxml --out-json my_report.json

# Verbose logging
python main.py --image evidence.E01 --verbose

CLI flags:

Flag         Default         Description
--image      required        Path to forensic image
--case       default         Case type profile
--tz         0.0             Timezone offset in hours
--pcap       none            Optional pcap file
--out-dfxml  timeline.dfxml  DFXML output path
--out-json   report.json     JSON output path
--verbose    off             Enable DEBUG logging

Running the Demo

Runs the full pipeline on a built-in synthetic forensic scenario (no image file needed):

# financial_fraud case, UTC-5 timezone
set PYTHONIOENCODING=utf-8 && python demo.py financial_fraud -5

# data_theft case, UTC+0
set PYTHONIOENCODING=utf-8 && python demo.py data_theft 0

# malware case
set PYTHONIOENCODING=utf-8 && python demo.py malware 0

The synthetic dataset includes:

  • A spreadsheet modified at 3 AM
  • A deleted 500 MB archive
  • A timestomped executable (svchost32.exe)
  • Chrome browser history with PayPal and Coinbase visits
  • A Windows registry persistence key (Run\updater)
  • A USB mount event (SanDisk, 2 minutes before the 3 AM activity)
  • A 524 MB outbound network transfer flagged as anomalous
  • A benign system DLL for contrast

Outputs written to demo_timeline.dfxml and demo_report.json.


Running the Rich Terminal Dashboard

python -m ui.dashboard --image evidence.E01 --case financial_fraud --tz -5

Displays a live progress bar during analysis, then renders a colour-coded findings table and top-10 suspicious timeline events in the terminal.


Running Tests

pip install pytest
python -m pytest tests/ -v

All 41 tests pass without any forensic hardware, GPU, or network access. Heavy dependencies are stubbed at the module level:

  • pytsk3 / pyewf — no Windows wheels; replaced with MagicMock stubs
  • torch / transformers — replaced with stubs; torch.Tensor is a real class to satisfy scipy's issubclass check
  • scapy — installed; network tests use MagicMock packets

Test coverage by module:

Class                   Tests
TestImageParserHelpers  _ts(), _open_image() for E01/DD/unsupported
TestExtractMetadata     basic fields, deleted flag, timestomp positive/negative, non-NTFS
TestExtractBrowser      financial domain flagging, empty data, Chrome epoch conversion
TestExtractNetwork      empty pcap, read error, artifact list returned
TestForensicClassifier  heuristic scores, case weight profiles, ext boost, unknown case fallback
TestTimeline            sort order, TZ offset, None timestamp skipped, DFXML structure/label/precision
TestConfig              defaults, env override, log_file default
TestNarrativeBuilder    deleted, timestomp, persistence, off-hours, financial, USB, label

Supported Formats

Image Formats

Format                 Extension        Library
Expert Witness Format  .E01, .Ex01      pyewf + pytsk3
Raw / DD               .dd, .raw, .img  pytsk3
AFF4                   .aff4            pytsk3 (raw block device)

Filesystems

Filesystem          OS
NTFS                Windows
ext2 / ext3 / ext4  Linux
APFS                macOS

Other Input Files

File                                                                Extractor
Chrome/Firefox History, Cookies, Web Data                           extract_browser()
Windows registry hives (NTUSER.DAT, SAM, SOFTWARE, SYSTEM, etc.)    extract_registry()
pcap / pcapng                                                       extract_network()

Case Type Profiles

Each case type applies multipliers to artifact stream scores and boosts specific file extensions.

financial_fraud

Prioritises browser artifacts (financial site visits, cookies) and document files.

Stream Multiplier
browser 2.5×
metadata 1.8×
network 1.5×
registry 1.0×

Boosted extensions: .xlsx, .xls, .csv, .pst, .ost, .mbox, .pdf

data_theft

Prioritises large file transfers, USB mounts, and cloud upload anomalies.

Stream Multiplier
registry 2.5×
network 2.5×
metadata 2.0×
browser 1.0×

Boosted extensions: .zip, .rar, .7z, .tar, .gz, .db, .sql

malware

Prioritises persistence keys, dropped executables, and C2 beaconing.

Stream Multiplier
registry 2.5×
metadata 2.0×
network 2.0×
browser 1.0×

Boosted extensions: .exe, .dll, .bat, .ps1, .vbs, .js, .lnk

default

All streams weighted equally at 1.0×.
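
Putting a profile to work is one multiply and a clamp. A sketch using the financial_fraud numbers above:

```python
# Multipliers and boosted extensions from the financial_fraud profile above.
STREAM_WEIGHTS = {"browser": 2.5, "metadata": 1.8, "network": 1.5, "registry": 1.0}
BOOSTED_EXTS = {".xlsx", ".xls", ".csv", ".pst", ".ost", ".mbox", ".pdf"}

def weighted_score(prob: float, stream: str, ext: str) -> float:
    """score = min(prob x stream multiplier x optional 1.5 ext boost, 1.0)"""
    weight = STREAM_WEIGHTS.get(stream, 1.0)
    if ext in BOOSTED_EXTS:
        weight *= 1.5
    return min(prob * weight, 1.0)

print(weighted_score(0.3, "browser", ".txt"))   # 0.75  (0.3 x 2.5)
```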


Output Formats

JSON Report (report.json)

Array of flagged artifacts:

[
  {
    "path": "/Windows/Temp/svchost32.exe",
    "label": "suspicious",
    "score": 0.72,
    "narrative": "[SUSPICIOUS] score=0.72 — because: created/modified at 03:00 UTC (off-hours); $SI/$FN timestamp mismatch (possible timestomping)",
    "top_features": [
      ["timestomp_detected", 0.267],
      ["network_anomaly", -0.039],
      ["financial_site", -0.033]
    ]
  }
]

DFXML Timeline (timeline.dfxml)

Digital Forensics XML following the forensicswiki.org schema. Compatible with Autopsy, log2timeline/plaso, and other DFXML-aware tools.

<?xml version='1.0' encoding='utf-8'?>
<dfxml version="1.0" xmlns="http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML">
  <metadata>
    <generator>ForensIQ</generator>
    <created>2024-03-15T08:00:00+00:00</created>
  </metadata>
  <fileobject>
    <filename>/Windows/Temp/svchost32.exe</filename>
    <source_type>metadata</source_type>
    <label>suspicious</label>
    <suspicion_score>0.7200</suspicion_score>
    <times><mtime>2024-03-15T08:00:00+00:00</mtime></times>
    <narrative>[SUSPICIOUS] score=0.72 — because: ...</narrative>
  </fileobject>
</dfxml>
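
Generating that layout needs nothing beyond the standard library. A sketch with xml.etree (dict-shaped events stand in for the real Event objects; only a subset of fields is shown):

```python
import xml.etree.ElementTree as ET

DFXML_NS = "http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML"

def events_to_dfxml(events) -> str:
    """Serialise timeline events into the fileobject layout shown above."""
    root = ET.Element("dfxml", version="1.0", xmlns=DFXML_NS)
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "generator").text = "ForensIQ"
    for ev in events:
        fo = ET.SubElement(root, "fileobject")
        ET.SubElement(fo, "filename").text = ev["path"]
        ET.SubElement(fo, "label").text = ev["label"]
        ET.SubElement(fo, "suspicion_score").text = f"{ev['score']:.4f}"
        times = ET.SubElement(fo, "times")
        ET.SubElement(times, "mtime").text = ev["mtime"]
    return ET.tostring(root, encoding="unicode")
```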

Platform Notes

Windows

  • pytsk3 and pyewf have no official Windows wheels. They are automatically stubbed when not importable, so the classifier, explainability, timeline, web UI, and demo all work fully.
  • To analyse real disk images on Windows, use WSL2 with libewf-dev installed, or use a pre-built pytsk3 wheel from the libewf releases page.
  • Set PYTHONIOENCODING=utf-8 before running the CLI or demo to ensure Unicode characters print correctly.

Linux / macOS

# Debian/Ubuntu
sudo apt install libewf-dev

# macOS
brew install libewf

Then pip install pytsk3 pyewf will compile successfully.

GPU / CPU

  • BERT inference runs on CPU by default. To use a GPU, install the CUDA-enabled torch wheel and the model will automatically use cuda:0.
  • For production use, replace nlpaueb/legal-bert-base-uncased with a checkpoint fine-tuned on labelled forensic artifact datasets.
