Cybersecurity Agent Pipeline - COS30049

A comprehensive LLM-based multi-agent cybersecurity analysis system that processes security logs and correlates them with MITRE ATT&CK techniques to provide detailed threat intelligence and mitigation recommendations.

Project Overview

This project combines traditional log analysis with LLM-based reasoning. It processes security logs from various sources through three cooperating agents:

Log Analysis Agent: Detects suspicious activities and anomalies in security logs
Retrieval Supervisor: Intelligently searches MITRE ATT&CK knowledge base for relevant techniques
Response Agent: Synthesizes findings into comprehensive threat intelligence reports

Key Features

Multi-Agent Architecture: Hierarchical agent system using LangGraph for orchestration
MITRE ATT&CK Integration: Semantic search over comprehensive technique database
Real-time Analysis: Process security logs and generate immediate threat assessments
Web Interface: User-friendly Streamlit application for easy interaction
Comprehensive Evaluation: Built-in evaluation framework for system performance testing

System Architecture

Current System Architecture

Note: The full hierarchical design is still under development. Currently, the system uses a linear pipeline with separate reflection on each module.

Linear Pipeline (Current Implementation)
├── Log Analysis Agent
│   ├── Event ID Validation
│   ├── Timeline Analysis
│   ├── Command Decoding
│   └── Anomaly Detection
├── Retrieval Supervisor
│   ├── MITRE Database Agent
│   ├── Retrieval Grader Agent
│   └── CTI Agent (Under Development - Temporarily Disabled)
│       └── Online CTI Report Analysis
└── Response Agent
    ├── Threat Correlation
    ├── Attack Chain Reconstruction
    └── Mitigation Recommendations

Figure 1: Current Multi-Agent System Architecture

Current Workflow

Input: User uploads JSON log file (Mordor dataset format)
Log Analysis: Agent analyzes logs for suspicious activities and IOCs with self-reflection
Intelligence Retrieval: Retrieval supervisor searches MITRE ATT&CK database for relevant techniques with quality assessment
Correlation: Response agent correlates findings to reconstruct potential attack chains with report refinement
Output: Comprehensive threat intelligence report with mitigation suggestions

Note: Each module operates independently with its own reflection mechanism. The full hierarchical coordination is planned for future development.
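
The linear flow with per-module reflection can be sketched in plain Python. This is a minimal illustration, not the project's LangGraph implementation: the stage functions, the `with_reflection` wrapper, the critic, and the Event ID to technique mapping are all hypothetical placeholders, and no LLM is called.

```python
from typing import Callable

# Hypothetical stage signature: takes the shared state dict, returns an updated copy.
Stage = Callable[[dict], dict]
# Hypothetical critic: returns True when a stage's output is acceptable.
Critic = Callable[[dict], bool]


def with_reflection(stage: Stage, critic: Critic, max_rounds: int = 2) -> Stage:
    """Wrap a stage so it is re-run until its critic accepts or rounds run out."""
    def wrapped(state: dict) -> dict:
        result = stage(state)
        for _ in range(max_rounds):
            if critic(result):
                break
            # Re-run the stage, letting it see its previous attempt.
            result = stage({**state, "previous_attempt": result})
        return result
    return wrapped


def run_linear_pipeline(logs: list, stages: list) -> dict:
    """Run the stages in order, threading one state dict through them."""
    state = {"logs": logs}
    for stage in stages:
        state = stage(state)
    return state


# Placeholder stages standing in for the three modules.
def log_analysis(state: dict) -> dict:
    # Illustrative rule only: flag process-creation events (Event ID 4688).
    suspicious = [e for e in state["logs"] if e.get("EventID") == 4688]
    return {**state, "abnormal_events": suspicious}


def retrieval(state: dict) -> dict:
    # A real implementation would query the MITRE ATT&CK knowledge base.
    techniques = ["T1059"] if state["abnormal_events"] else []
    return {**state, "techniques": techniques}


def response(state: dict) -> dict:
    report = (
        f"{len(state['abnormal_events'])} abnormal event(s); "
        f"techniques: {state['techniques']}"
    )
    return {**state, "report": report}


def always_ok(result: dict) -> bool:
    return True


pipeline = [
    with_reflection(log_analysis, always_ok),
    with_reflection(retrieval, always_ok),
    with_reflection(response, always_ok),
]
final_state = run_linear_pipeline([{"EventID": 4688}, {"EventID": 4624}], pipeline)
print(final_state["report"])
```

Each stage only reads and extends the shared state, which is what lets every module carry its own reflection loop without a global supervisor.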

Planned Hierarchical Architecture (future plan)

Global Supervisor Agent (Future)
├── Log Analysis Agent
├── Retrieval Supervisor (Sub-supervisor)
│   ├── MITRE Database Agent
│   ├── Retrieval Grader Agent
│   └── CTI Agent (Web-facing CTI analysis)
└── Response Agent

File Structure

Cyber-Agent/
├── app.py                              # Streamlit web application
├── requirements.txt                    # Python dependencies
│
├── src/                                # Source code modules
│   ├── agents/                         # Multi-agent system components
│   │   ├── log_analysis_agent/        # Log analysis and anomaly detection
│   │   ├── retrieval_supervisor/      # CTI knowledge retrieval coordination
│   │   ├── database_agent/            # Knowledge base search agent
│   │   ├── grader_agent/              # Retrieval quality assessment
│   │   ├── cti_agent/                 # CTI report analysis (under development)
│   │   └── response_agent/            # Final report generation
│   │
│   ├── full_pipeline/                 # Complete pipeline orchestration
│   │   └── simple_pipeline.py         # Main pipeline implementation
│   │
│   ├── knowledge_base/                # MITRE ATT&CK knowledge base
│   │   └── cyber_knowledge_base.py    # Vector database management
│   │
│   ├── scripts/                       # Utility and evaluation scripts
│   │   ├── extract_mitre_techniques.py    # MITRE data extraction
│   │   ├── build_cyber_database.py        # Knowledge base construction
│   │   ├── cti_bench_evaluation.py        # CTI Bench evaluation
│   │   ├── execute_pipeline_all_datasets.py # Full dataset processing
│   │   └── run_evaluation.py              # Evaluation pipeline
│   │
│   └── evaluation/                    # Evaluation framework
│       ├── cti_bench/                 # CTI Bench evaluation tools
│       └── full_pipeline/             # Pipeline evaluation metrics
│
├── mordor_dataset/                    # Sample security logs
│   ├── datasets/                      # JSON log files by attack type
│   └── eval_output/                   # Evaluation results
│
├── mitre_data/                        # MITRE ATT&CK data
│   ├── enterprise-attack.json         # Full ATT&CK dataset
│   └── techniques.json                # Processed techniques
│
├── cyber_knowledge_base/              # Vector database storage
│   ├── chroma/                        # ChromaDB vector store
│   └── bm25_retriever.pkl            # BM25 keyword search index
│
└── cti_bench/                         # CTI Bench evaluation data
    ├── datasets/                      # CTI Bench datasets
    └── eval_output/                   # Evaluation results

Quick Start Guide

Option 1: Online Demo (Recommended)

Visit the live demo on our Hugging Face Space.

Option 2: Local Web Application

Environment Setup

Prerequisites

Python 3.11 or higher
Git
CUDA-compatible GPU (optional, for faster processing)

Step-by-Step Installation

Clone the Repository

```
git clone https://github.com/minhan6559/Cyber-Agent.git
cd Cyber-Agent
```

Create Virtual Environment

Option A: Using Conda (Recommended)

```
conda create -n cyber_agent python=3.11
conda activate cyber_agent
```

Option B: Using Python venv

```
python -m venv cyber_agent
source cyber_agent/bin/activate  # On Windows: cyber_agent\Scripts\activate
```

Install Dependencies
```
pip install -r requirements.txt
```

Optional: GPU Support

```
# For CUDA 12.6 (recommended)
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu126

# For CPU-only (if no GPU available)
pip install torch==2.9.0 torchvision==0.24.0
```

Create .env File

Create a .env file in the root directory with the following API keys required for running the local web app:

```
# LLM provider keys: only needed when running scripts; the web app asks for them in the interface
GOOGLE_API_KEY=
OPENAI_API_KEY=
GROQ_API_KEY=

# Required for tool calling
TAVILY_API_KEY=
SHODAN_API_KEY=
# VirusTotal
VT_API_KEY=

# Hugging Face token with access to Google's EmbeddingGemma 300M model
HF_TOKEN=

# LangSmith
LANGSMITH_API_KEY=
LANGSMITH_PROJECT=
LANGSMITH_TRACING=
```

Note: The LLM API keys (GOOGLE_API_KEY, OPENAI_API_KEY, GROQ_API_KEY) are optional for the web app because you can enter them directly in the interface. TAVILY_API_KEY, SHODAN_API_KEY, VT_API_KEY, HF_TOKEN, and the LangSmith keys are required for tool-calling functionality.
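
Before launching the app, it can help to check which keys are still empty. The sketch below is a hypothetical helper, not part of the repository: it uses a simplified KEY=VALUE parser written for this example (it does not replace whatever loader the project uses), and the required/optional grouping simply follows the note above.

```python
# Key names come from the .env template; the grouping follows the note above.
REQUIRED_KEYS = [
    "TAVILY_API_KEY", "SHODAN_API_KEY", "VT_API_KEY", "HF_TOKEN",
    "LANGSMITH_API_KEY", "LANGSMITH_PROJECT", "LANGSMITH_TRACING",
]
LLM_KEYS = ["GOOGLE_API_KEY", "OPENAI_API_KEY", "GROQ_API_KEY"]


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments (simplified)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as `VT_API_KEY=abc  # note`.
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_required(values: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def has_llm_key(values: dict) -> bool:
    """True when at least one LLM provider key is set (needed for scripts only)."""
    return any(values.get(key) for key in LLM_KEYS)


sample = """
# Required for tool calling
TAVILY_API_KEY=tvly-example
SHODAN_API_KEY=
VT_API_KEY=  # left empty
GROQ_API_KEY=gsk-example
"""
values = parse_env_text(sample)
print("Missing required keys:", missing_required(values))
print("At least one LLM key set:", has_llm_key(values))
```

To check a real file, pass its contents, for example `parse_env_text(open(".env").read())`.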

Verify Installation

```
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import streamlit; import langchain; print('All packages installed successfully')"
```

Running the Application

Start the Streamlit App
```
streamlit run app.py
```
Access the Web Interface
- Open your browser to http://localhost:8501
- Select your preferred LLM model (Google GenAI, Groq, or OpenAI)
- Enter your API key for the selected provider
- Upload a JSON log file from mordor_dataset/datasets/
- Click "Run Analysis" to generate a comprehensive threat report

Figure 2: Streamlit Web Application - Model Selection and Configuration

View Results
- The system will display real-time progress
- Results include threat assessment, abnormal events, and mitigation recommendations
- Download detailed reports in JSON and Markdown formats

Required Dependencies

The requirements.txt includes these key packages:

streamlit: Web application framework
langchain: LLM framework and tools
langgraph: Multi-agent orchestration
chromadb: Vector database for semantic search
sentence-transformers: Embedding models
mitreattack-python: MITRE ATT&CK data processing
pandas: Data manipulation
numpy: Numerical computing

Figure 3: Streamlit Web Application - Analysis Results and Threat Intelligence Report

Advanced Usage

Model Name Format

The system uses the LangChain init_chat_model format for model specification. Models are specified using the '{model_provider}:{model}' format as required by the LangChain documentation.

Supported Model Providers and Examples

Google GenAI:

"google_genai:gemini-2.0-flash"
"google_genai:gemini-2.0-flash-lite"
"google_genai:gemini-2.5-flash-lite"

Groq:

"groq:openai/gpt-oss-120b"
"groq:openai/gpt-oss-20b"
"groq:llama-3.1-8b-instant"
"groq:llama-3.3-70b-versatile"
"groq:moonshotai/kimi-k2-instruct-0905"

OpenAI:

"openai:gpt-4o"
"openai:gpt-4.1"

Other Providers:

"anthropic:claude-3-5-sonnet-latest"
"ollama:llama3.1:8b"
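
Because some model names contain their own colon or slash (for example "ollama:llama3.1:8b" and "groq:openai/gpt-oss-120b"), only the first colon separates the provider from the model. The helper below is a small illustration of that convention; `split_model_name` is a hypothetical function written for this example, not LangChain's own parser.

```python
def split_model_name(name: str) -> tuple:
    """Split '{model_provider}:{model}' on the first colon only."""
    provider, separator, model = name.partition(":")
    if not separator or not provider or not model:
        raise ValueError(f"Expected '{{model_provider}}:{{model}}', got {name!r}")
    return provider, model


for example in [
    "google_genai:gemini-2.0-flash",
    "groq:openai/gpt-oss-120b",
    "ollama:llama3.1:8b",
]:
    provider, model = split_model_name(example)
    print(f"provider={provider!r} model={model!r}")
```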

Usage in Code

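```
from langchain.chat_models import init_chat_model

# Initialize a model using the '{model_provider}:{model}' format
model = init_chat_model("google_genai:gemini-2.0-flash", temperature=0.1)

# Use in the pipeline.
# analyze_log_file is the project's pipeline entry point; its import is not shown
# in this snippet (presumably it lives in src/full_pipeline/simple_pipeline.py).
result = analyze_log_file(
    log_file="sample.json",
    model_name="groq:openai/gpt-oss-120b",
    temperature=0.1
)
```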

Building the Knowledge Base from Scratch

If you want to reproduce the entire system and evaluation from scratch:

Extract MITRE ATT&CK Techniques

```
python src/scripts/extract_mitre_techniques.py
```

Build the Vector Database

```
python src/scripts/build_cyber_database.py ingest --techniques-json ./mitre_data/techniques.json
```

Test the Knowledge Base

```
python src/scripts/build_cyber_database.py test --interactive
```

Running CTI Bench Evaluation

Test the retrieval supervisor system using the CTI Bench dataset. This evaluation tests the system's ability to extract and map MITRE ATT&CK techniques from cybersecurity threat intelligence reports.

Note: Currently, only the CTI-ATE (Attack Technique Extraction) dataset is fully supported and tested. The CTI-MCQ (Multiple Choice Questions) dataset is not fully developed yet.

Prerequisites

Ensure you have the CTI Bench dataset files in the correct location:

```
cti_bench/datasets/
├── cti-ate.tsv      # Attack Technique Extraction dataset
└── cti-mcq.tsv      # Multiple Choice Questions dataset
```

Evaluation Modes

1. Quick Test (Recommended for first-time users)

```
# Test with 2 samples from both datasets
python src/scripts/cti_bench_evaluation.py --mode quick --num-samples 2

# Test with 5 samples from ATE dataset only
python src/scripts/cti_bench_evaluation.py --mode quick --num-samples 5 --datasets ate

# Test with 3 samples from MCQ dataset only
python src/scripts/cti_bench_evaluation.py --mode quick --num-samples 3 --datasets mcq
```

2. Connection Test

```
# Test if the supervisor can connect to the knowledge base
python src/scripts/cti_bench_evaluation.py --mode test --llm-model google_genai:gemini-2.0-flash
```

3. Full Evaluation

```
# Full evaluation on ATE dataset (recommended)
python src/scripts/cti_bench_evaluation.py --mode full --datasets ate --llm-model google_genai:gemini-2.0-flash

# Full evaluation on MCQ dataset
python src/scripts/cti_bench_evaluation.py --mode full --datasets mcq --llm-model groq:openai/gpt-oss-120b

# Full evaluation on both datasets
python src/scripts/cti_bench_evaluation.py --mode full --datasets all --llm-model openai:gpt-4o
```

Getting Help

For detailed parameter information and advanced configuration options:

```
python src/scripts/cti_bench_evaluation.py --help
```

Understanding Results

The evaluation generates several output files:

cti-ate_{model_name}_{timestamp}.csv: Detailed results for ATE dataset
cti-mcq_{model_name}_{timestamp}.csv: Detailed results for MCQ dataset
evaluation_summary_ate_{model_name}_{timestamp}.json: Summary metrics for ATE
evaluation_summary_mcq_{model_name}_{timestamp}.json: Summary metrics for MCQ

Key Metrics:

Macro F1: Overall technique extraction performance
Success Rate: Percentage of successfully processed samples
Accuracy: Correct technique identification rate
Precision/Recall: Detailed performance breakdown
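
To make these metrics concrete, the sketch below scores technique extraction by comparing predicted and expected sets of ATT&CK technique IDs. It assumes that macro F1 means averaging per-sample F1 scores; the evaluation script may define it differently, and the function names and sample data here are illustrative only.

```python
def precision_recall_f1(predicted: set, expected: set) -> tuple:
    """Set-based precision, recall, and F1 for one sample."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_f1(samples: list) -> float:
    """Average the per-sample F1 scores (assumed definition of macro F1)."""
    scores = [precision_recall_f1(pred, exp)[2] for pred, exp in samples]
    return sum(scores) / len(scores) if scores else 0.0


# Each pair is (predicted technique IDs, expected technique IDs); made-up data.
samples = [
    ({"T1059", "T1071"}, {"T1059"}),  # one extra prediction
    ({"T1003"}, {"T1003"}),           # exact match
]
for predicted, expected in samples:
    print(precision_recall_f1(predicted, expected))
print("Macro F1:", round(macro_f1(samples), 4))
```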

Full Pipeline Evaluation

Run the complete evaluation on the Mordor dataset:

Execute Pipeline on All Datasets

```
python src/scripts/execute_pipeline_all_datasets.py --model google_genai:gemini-2.0-flash
```

Run Evaluation Metrics
```
python src/scripts/run_evaluation.py
```
View Results
- Check mordor_dataset/eval_output/evaluation_results/ for detailed metrics
- Review model_metrics.csv for performance comparisons

Evaluation and Testing

CTI Bench Evaluation

The retrieval supervisor is evaluated on the CTI Bench dataset. CTI Bench is a suite of benchmark tasks and datasets designed to evaluate Large Language Models (LLMs) in the field of Cyber Threat Intelligence (CTI), as described in the CTI-Bench repository.

ATE (Attack Technique Extraction): Tests technique identification and retrieval accuracy
MCQ (Multiple Choice Questions): Tests knowledge base retrieval quality
Metrics: F1-score, accuracy, precision, recall, and success rates

The figure below shows the evaluation results of the retrieval supervisor design across multiple language models on the CTI-ATE benchmark.

Figure 4: CTI-ATE Evaluation Results - Performance Comparison Across Multiple Language Models

Mordor Dataset Evaluation

Full pipeline evaluation on security logs from the Mordor dataset. The Mordor project provides pre-recorded security events, generated by simulating adversarial techniques, as JSON files categorized by platform, adversary group, tactic, and technique as defined by the MITRE ATT&CK Framework.

Evaluation Results Summary

Three LLM models were evaluated across 86 log files covering 7 MITRE ATT&CK tactics. All models achieved "GOOD" grades with effectiveness scores of 61.93-67.92%.

Figure 5: Pipeline Evaluation Results

Evaluation Metrics:

Detection Rate: Percentage of log files with successfully identified threats
Coverage: Percentage of attack tactics the model can detect (breadth)
Accuracy: Per-tactic detection rate averaged across all tactics
Effectiveness Score: Weighted composite (40% detection + 30% coverage + 30% accuracy)
Standard Metrics: Precision, Recall, F1-score for classification performance
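
The composite score can be worked through on a small example. The sketch below follows the definitions listed above (40% detection, 30% coverage, 30% accuracy); the per-tactic input layout, the function names, and the sample data are assumptions made for illustration and do not reflect the evaluation script's actual data structures.

```python
def summarize(results: dict) -> dict:
    """Compute the metrics from {tactic: [detected?, ...]} (one flag per log file)."""
    all_flags = [hit for hits in results.values() for hit in hits]
    # Detection rate: share of all log files with a detected threat.
    detection_rate = 100 * sum(all_flags) / len(all_flags)
    # Coverage: share of tactics with at least one detection.
    coverage = 100 * sum(any(hits) for hits in results.values()) / len(results)
    # Accuracy: per-tactic detection rate, averaged across tactics.
    accuracy = 100 * sum(sum(hits) / len(hits) for hits in results.values()) / len(results)
    effectiveness = 0.4 * detection_rate + 0.3 * coverage + 0.3 * accuracy
    return {
        "detection_rate": detection_rate,
        "coverage": coverage,
        "accuracy": accuracy,
        "effectiveness": effectiveness,
    }


# Made-up detections for three tactics.
example = {
    "execution": [True, True, False, True],
    "persistence": [False, False],
    "discovery": [True, False],
}
metrics = summarize(example)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

Here 4 of 8 files are detected (50%), 2 of 3 tactics are covered (66.67%), and the per-tactic rates average to 41.67%, giving 0.4 × 50 + 0.3 × 66.67 + 0.3 × 41.67 = 52.5%.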

Available Models

The system supports multiple LLM providers:

Google GenAI: gemini-2.0-flash, gemini-2.5-flash-lite
Groq: gpt-oss-120b, llama-3.1-8b-instant, llama-3.3-70b-versatile
OpenAI: gpt-5-mini, gpt-5, gpt-4.1-mini

Project Architecture

Design Principles

Modular Design: Each agent operates independently with clear interfaces
LLM-Driven Reasoning: Prefers AI reasoning over hard-coded rules
Hierarchical Coordination: Supervisor agents manage specialized sub-agents
Comprehensive Evaluation: Built-in testing and validation frameworks
Scalable Processing: Efficient handling of large log datasets

Key Components

Log Analysis Agent

Purpose: Detects suspicious activities in security logs
Capabilities: Event validation, timeline analysis, command decoding
Tools: Field reduction, event ID extraction, timeline building, base64 decoding
Reflection: Self-critique and iterative improvement
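
As an illustration of what command decoding involves, the sketch below decodes a PowerShell `-EncodedCommand` argument, which is Base64 over UTF-16LE text. It is a standalone example and not the agent's actual tool: the function name and the regular expression (which covers only a few common abbreviations of the flag) are assumptions.

```python
import base64
import re
from typing import Optional

# Matches -e, -ec, -enc, or -EncodedCommand followed by a Base64 payload.
ENCODED_FLAG = re.compile(
    r"(?:^|\s)-(?:e|ec|enc|encodedcommand)\s+([A-Za-z0-9+/=]+)",
    re.IGNORECASE,
)


def decode_powershell_command(command_line: str) -> Optional[str]:
    """Return the decoded script, or None if no valid encoded payload is found."""
    match = ENCODED_FLAG.search(command_line)
    if match is None:
        return None
    try:
        raw = base64.b64decode(match.group(1), validate=True)
        # PowerShell encodes the script as UTF-16LE before Base64.
        return raw.decode("utf-16-le")
    except (ValueError, UnicodeDecodeError):
        return None


# Build a sample command line so the example is self-checking.
payload = base64.b64encode("whoami".encode("utf-16-le")).decode("ascii")
command_line = f"powershell.exe -NoProfile -enc {payload}"
print(decode_powershell_command(command_line))
```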

Retrieval Supervisor

Purpose: Coordinates MITRE ATT&CK technique retrieval
Sub-agents:
- Database agent (semantic search over MITRE knowledge base)
- Grader agent (quality assessment and iterative refinement)
- CTI Agent (under development - web-facing CTI report analysis)
Features: Iterative refinement, quality control, multi-query search
Reflection: Quality assessment and retrieval improvement
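
The retrieve, grade, and refine loop can be sketched as follows. This is a toy model of the idea, assuming a keyword search over a two-entry dictionary in place of the vector database and simple functions in place of the LLM-based grader; `retrieve_with_grading`, the threshold, and all helper functions are hypothetical.

```python
from typing import Callable

# Tiny stand-in for the MITRE ATT&CK knowledge base.
KNOWLEDGE_BASE = {
    "T1059.001": "PowerShell command and scripting interpreter",
    "T1003": "OS credential dumping",
}


def retrieve_with_grading(
    query: str,
    search: Callable,
    grade: Callable,
    rewrite: Callable,
    threshold: float = 0.7,
    max_rounds: int = 3,
) -> tuple:
    """Search, grade the results, and rewrite the query until the grade passes."""
    best_docs, best_score = [], -1.0
    for _ in range(max_rounds):
        docs = search(query)
        score = grade(query, docs)
        if score > best_score:
            best_docs, best_score = docs, score
        if score >= threshold:
            break
        query = rewrite(query, docs)
    return best_docs, best_score


def keyword_search(query: str) -> list:
    """Return technique IDs whose description shares a word with the query."""
    terms = set(query.lower().split())
    return [
        technique_id
        for technique_id, text in KNOWLEDGE_BASE.items()
        if terms & set(text.lower().split())
    ]


def grade_results(query: str, docs: list) -> float:
    # Placeholder grader: any hit counts as a good retrieval.
    return 1.0 if docs else 0.0


def rewrite_query(query: str, docs: list) -> str:
    # Placeholder refinement: add a more specific term.
    return query + " powershell"


docs, score = retrieve_with_grading(
    "encoded script", keyword_search, grade_results, rewrite_query
)
print(docs, score)
```

In this toy run the first query finds nothing, the grader rejects it, and the rewritten query succeeds on the second round.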

Response Agent

Purpose: Synthesizes findings into comprehensive reports
Capabilities: Threat correlation, attack chain reconstruction, mitigation recommendations
Outputs: Threat assessments, attack chains, mitigation recommendations
Formats: JSON, Markdown, structured data
Reflection: Report quality assessment and improvement

Extensibility

The modular architecture supports:

New Data Sources: Implement custom log processors
Additional Agents: Add specialized analysis components
Custom Evaluations: Extend evaluation frameworks
Model Integration: Support for new LLM providers

Contributing

This is a university project demonstrating LLM-based multi-agent systems for cybersecurity. The system is designed for educational purposes and showcases the power of modern AI in security analysis.

License

This project is developed for educational purposes as part of COS30049 coursework.

Acknowledgments

MITRE ATT&CK: For the comprehensive attack technique database
Mordor Dataset: For realistic security log samples
LangChain & LangGraph: For the multi-agent framework
Streamlit: For the web interface
Hugging Face: For model hosting and deployment
