A comprehensive LLM-based multi-agent cybersecurity analysis system that processes security logs and correlates them with MITRE ATT&CK techniques to provide detailed threat intelligence and mitigation recommendations.
- Project Overview
- System Architecture
- File Structure
- Quick Start Guide
- Advanced Usage
- Evaluation and Testing
- Model Performance
- Project Architecture
This project implements a multi-agent cybersecurity analysis system that combines traditional log analysis with LLM-based reasoning. The system processes security logs from various sources through three agents:
- Log Analysis Agent: Detects suspicious activities and anomalies in security logs
- Retrieval Supervisor: Intelligently searches MITRE ATT&CK knowledge base for relevant techniques
- Response Agent: Synthesizes findings into comprehensive threat intelligence reports
- Multi-Agent Architecture: Hierarchical agent system using LangGraph for orchestration
- MITRE ATT&CK Integration: Semantic search over comprehensive technique database
- Real-time Analysis: Process security logs and generate immediate threat assessments
- Web Interface: User-friendly Streamlit application for easy interaction
- Comprehensive Evaluation: Built-in evaluation framework for system performance testing
Note: The full hierarchical design is still under development. Currently, the system uses a linear pipeline with separate reflection on each module.
Linear Pipeline (Current Implementation)
├── Log Analysis Agent
│ ├── Event ID Validation
│ ├── Timeline Analysis
│ ├── Command Decoding
│ └── Anomaly Detection
├── Retrieval Supervisor
│ ├── MITRE Database Agent
│ ├── Retrieval Grader Agent
│ └── CTI Agent (Under Development - Temporarily Disabled)
│     └── Online CTI Report Analysis
└── Response Agent
    ├── Threat Correlation
    ├── Attack Chain Reconstruction
    └── Mitigation Recommendations
- Input: User uploads JSON log file (Mordor dataset format)
- Log Analysis: Agent analyzes logs for suspicious activities and IOCs with self-reflection
- Intelligence Retrieval: Retrieval supervisor searches MITRE ATT&CK database for relevant techniques with quality assessment
- Correlation: Response agent correlates findings to reconstruct potential attack chains with report refinement
- Output: Comprehensive threat intelligence report with mitigation suggestions
Note: Each module operates independently with its own reflection mechanism. The full hierarchical coordination is planned for future development.
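The data flow above can be sketched as a linear chain of stages, each wrapped in its own reflection loop. This is a minimal illustration under stated assumptions, not the code in simple_pipeline.py: `Stage`, `run_with_reflection`, `run_pipeline`, and the toy stage functions are hypothetical names, and the critique step is a stub where the real system would call an LLM.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Stage:
    """One pipeline module plus its own critique function (hypothetical)."""
    name: str
    run: Callable[[Any, Optional[str]], Any]   # (input, feedback) -> output
    critique: Callable[[Any], Optional[str]]   # output -> feedback, or None if acceptable


def run_with_reflection(stage: Stage, payload: Any, max_rounds: int = 2) -> Any:
    """Run a stage, then re-run it with feedback until the critique passes."""
    output = stage.run(payload, None)
    for _ in range(max_rounds):
        feedback = stage.critique(output)
        if feedback is None:
            break
        output = stage.run(payload, feedback)
    return output


def run_pipeline(stages: list[Stage], logs: Any) -> Any:
    """Feed each stage's output into the next: a linear chain, no supervisor."""
    payload = logs
    for stage in stages:
        payload = run_with_reflection(stage, payload)
    return payload


# Toy stand-ins for the three modules; the technique ID is a placeholder.
def analyze(logs, feedback):
    return {"abnormal_events": [e for e in logs if e.get("suspicious")]}


def retrieve(analysis, feedback):
    techniques = ["T1059"] if analysis["abnormal_events"] else []
    return {**analysis, "techniques": techniques}


def respond(findings, feedback):
    count = len(findings["abnormal_events"])
    return {**findings, "report": f"{count} abnormal event(s)"}


def accept(output):
    return None  # stub critique: always satisfied


stages = [
    Stage("log_analysis", analyze, accept),
    Stage("retrieval", retrieve, accept),
    Stage("response", respond, accept),
]
result = run_pipeline(stages, [{"EventID": 4688, "suspicious": True}, {"EventID": 4624}])
print(result["report"])
```

Because each stage owns its critique, a weak retrieval result is retried inside the retrieval stage only; nothing upstream is re-run, which matches the "independent reflection" behaviour described in the note.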
Global Supervisor Agent (Future)
├── Log Analysis Agent
├── Retrieval Supervisor (Sub-supervisor)
│ ├── MITRE Database Agent
│ ├── Retrieval Grader Agent
│ └── CTI Agent (Web-facing CTI analysis)
└── Response Agent
Cyber-Agent/
├── app.py # Streamlit web application
├── requirements.txt # Python dependencies
│
├── src/ # Source code modules
│ ├── agents/ # Multi-agent system components
│ │ ├── log_analysis_agent/ # Log analysis and anomaly detection
│ │ ├── retrieval_supervisor/ # CTI knowledge retrieval coordination
│ │ ├── database_agent/ # Knowledge base search agent
│ │ ├── grader_agent/ # Retrieval quality assessment
│ │ ├── cti_agent/ # CTI report analysis (under development)
│ │ └── response_agent/ # Final report generation
│ │
│ ├── full_pipeline/ # Complete pipeline orchestration
│ │ └── simple_pipeline.py # Main pipeline implementation
│ │
│ ├── knowledge_base/ # MITRE ATT&CK knowledge base
│ │ └── cyber_knowledge_base.py # Vector database management
│ │
│ ├── scripts/ # Utility and evaluation scripts
│ │ ├── extract_mitre_techniques.py # MITRE data extraction
│ │ ├── build_cyber_database.py # Knowledge base construction
│ │ ├── cti_bench_evaluation.py # CTI Bench evaluation
│ │ ├── execute_pipeline_all_datasets.py # Full dataset processing
│ │ └── run_evaluation.py # Evaluation pipeline
│ │
│ └── evaluation/ # Evaluation framework
│     ├── cti_bench/ # CTI Bench evaluation tools
│     └── full_pipeline/ # Pipeline evaluation metrics
│
├── mordor_dataset/ # Sample security logs
│ ├── datasets/ # JSON log files by attack type
│ └── eval_output/ # Evaluation results
│
├── mitre_data/ # MITRE ATT&CK data
│ ├── enterprise-attack.json # Full ATT&CK dataset
│ └── techniques.json # Processed techniques
│
├── cyber_knowledge_base/ # Vector database storage
│ ├── chroma/ # ChromaDB vector store
│ └── bm25_retriever.pkl # BM25 keyword search index
│
└── cti_bench/ # CTI Bench evaluation data
    ├── datasets/ # CTI Bench datasets
    └── eval_output/ # Evaluation results
Visit the live demo at: Our Hugging Face Space
Prerequisites
- Python 3.11 or higher
- Git
- CUDA-compatible GPU (optional, for faster processing)
Step-by-Step Installation
- Clone the Repository
  git clone https://github.com/minhan6559/Cyber-Agent.git
  cd Cyber-Agent
- Create Virtual Environment
  Option A: Using Conda (Recommended)
  conda create -n cyber_agent python=3.11
  conda activate cyber_agent
  Option B: Using Python venv
  python -m venv cyber_agent
  source cyber_agent/bin/activate # On Windows: cyber_agent\Scripts\activate
- Install Dependencies
  pip install -r requirements.txt
- Optional: GPU Support
  # For CUDA 12.6 (recommended)
  pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu126
  # For CPU-only (if no GPU available)
  pip install torch==2.9.0 torchvision==0.24.0
- Create .env File
  Create a .env file in the root directory with the following API keys required for running the local web app:
  # LLM provider keys: needed only when running scripts; the web app asks for them in the interface
  GOOGLE_API_KEY=
  OPENAI_API_KEY=
  GROQ_API_KEY=
  # Required for tool calling
  TAVILY_API_KEY=
  SHODAN_API_KEY=
  # VirusTotal
  VT_API_KEY=
  # Hugging Face token with access to Google's EmbeddingGemma 300M model
  HF_TOKEN=
  # LangSmith
  LANGSMITH_API_KEY=
  LANGSMITH_PROJECT=
  LANGSMITH_TRACING=
  Note: The LLM API keys (GOOGLE_API_KEY, OPENAI_API_KEY, GROQ_API_KEY) are optional for the web app, because you can enter them directly in the interface. However, TAVILY_API_KEY, SHODAN_API_KEY, VT_API_KEY, HF_TOKEN, and the LangSmith keys are required for tool calling functionality.
- Verify Installation
  python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
  python -c "import streamlit; import langchain; print('All packages installed successfully')"
- Start the Streamlit App
  streamlit run app.py
- Access the Web Interface
  - Open your browser to http://localhost:8501
  - Select your preferred LLM model (Google GenAI, Groq, or OpenAI)
  - Enter your API key for the selected provider
  - Upload a JSON log file from mordor_dataset/datasets/
  - Click "Run Analysis" to generate a comprehensive threat report
- View Results
  - The system will display real-time progress
  - Results include threat assessment, abnormal events, and mitigation recommendations
  - Download detailed reports in JSON and Markdown formats
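Before launching the app or the scripts, it can help to confirm that the keys from the .env step are actually visible to the process. The following is a small sketch, not part of the repository: `check_env` is a hypothetical helper, and it assumes the variables have already been exported or loaded into the environment by whatever mechanism the project uses.

```python
import os

# Keys the installation note lists as required for tool calling.
REQUIRED_KEYS = [
    "TAVILY_API_KEY",
    "SHODAN_API_KEY",
    "VT_API_KEY",
    "HF_TOKEN",
    "LANGSMITH_API_KEY",
    "LANGSMITH_PROJECT",
    "LANGSMITH_TRACING",
]
# LLM provider keys: at least one is needed when running scripts outside the web app.
LLM_KEYS = ["GOOGLE_API_KEY", "OPENAI_API_KEY", "GROQ_API_KEY"]


def check_env(env=None):
    """Return a list of problems; an empty list means the environment looks complete."""
    env = os.environ if env is None else env
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not env.get(key)]
    if not any(env.get(key) for key in LLM_KEYS):
        problems.append("no LLM provider key set (" + ", ".join(LLM_KEYS) + ")")
    return problems


for problem in check_env():
    print("WARNING:", problem)
```

Treating an empty value the same as a missing one is deliberate, since the .env template ships with every key blank.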
Required Dependencies
The requirements.txt includes these key packages:
- streamlit: Web application framework
- langchain: LLM framework and tools
- langgraph: Multi-agent orchestration
- chromadb: Vector database for semantic search
- sentence-transformers: Embedding models
- mitreattack-python: MITRE ATT&CK data processing
- pandas: Data manipulation
- numpy: Numerical computing
The system uses the LangChain init_chat_model format for model specification. Models are specified using the '{model_provider}:{model}' format as required by the LangChain documentation.
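Some of the model names listed in this section contain a colon or a slash themselves (for example "ollama:llama3.1:8b" and "groq:openai/gpt-oss-120b"), so a specification string should only be split on its first colon. The helper below is a hypothetical sketch for validating user input before it reaches the pipeline; it is not taken from the project and makes no claim about how LangChain parses the string internally.

```python
def split_model_spec(spec: str) -> tuple[str, str]:
    """Split '{model_provider}:{model}' on the first colon only."""
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected '{{model_provider}}:{{model}}', got {spec!r}")
    return provider, model


for spec in ("google_genai:gemini-2.0-flash", "groq:openai/gpt-oss-120b", "ollama:llama3.1:8b"):
    provider, model = split_model_spec(spec)
    print(f"{provider:>12} | {model}")
```

Rejecting a bare model name early gives a clearer error than letting a malformed string fail somewhere deeper in the agent code.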
Google GenAI:
"google_genai:gemini-2.0-flash"
"google_genai:gemini-2.0-flash-lite"
"google_genai:gemini-2.5-flash-lite"

Groq:
"groq:openai/gpt-oss-120b"
"groq:openai/gpt-oss-20b"
"groq:llama-3.1-8b-instant"
"groq:llama-3.3-70b-versatile"
"groq:moonshotai/kimi-k2-instruct-0905"

OpenAI:
"openai:gpt-4o"
"openai:gpt-4.1"

Other Providers:
"anthropic:claude-3-5-sonnet-latest"
"ollama:llama3.1:8b"

from langchain.chat_models import init_chat_model
# Initialize model using the format
model = init_chat_model("google_genai:gemini-2.0-flash", temperature=0.1)
# Use in pipeline
result = analyze_log_file(
    log_file="sample.json",
    model_name="groq:openai/gpt-oss-120b",
    temperature=0.1
)

If you want to reproduce the entire system and evaluation from scratch:
- Extract MITRE ATT&CK Techniques
  python src/scripts/extract_mitre_techniques.py
- Build the Vector Database
  python src/scripts/build_cyber_database.py ingest --techniques-json ./mitre_data/techniques.json
- Test the Knowledge Base
  python src/scripts/build_cyber_database.py test --interactive
Test the retrieval supervisor system using the CTI Bench dataset. This evaluation tests the system's ability to extract and map MITRE ATT&CK techniques from cybersecurity threat intelligence reports.
Note: Currently, only the CTI-ATE (Attack Technique Extraction) dataset is fully supported and tested. The CTI-MCQ (Multiple Choice Questions) dataset is not fully developed yet.
Ensure you have the CTI Bench dataset files in the correct location:
cti_bench/datasets/
├── cti-ate.tsv # Attack Technique Extraction dataset
└── cti-mcq.tsv # Multiple Choice Questions dataset

1. Quick Test (Recommended for first-time users)
# Test with 2 samples from both datasets
python src/scripts/cti_bench_evaluation.py --mode quick --num-samples 2
# Test with 5 samples from ATE dataset only
python src/scripts/cti_bench_evaluation.py --mode quick --num-samples 5 --datasets ate
# Test with 3 samples from MCQ dataset only
python src/scripts/cti_bench_evaluation.py --mode quick --num-samples 3 --datasets mcq

2. Connection Test
# Test if the supervisor can connect to the knowledge base
python src/scripts/cti_bench_evaluation.py --mode test --llm-model google_genai:gemini-2.0-flash

3. Full Evaluation
# Full evaluation on ATE dataset (recommended)
python src/scripts/cti_bench_evaluation.py --mode full --datasets ate --llm-model google_genai:gemini-2.0-flash
# Full evaluation on MCQ dataset
python src/scripts/cti_bench_evaluation.py --mode full --datasets mcq --llm-model groq:openai/gpt-oss-120b
# Full evaluation on both datasets
python src/scripts/cti_bench_evaluation.py --mode full --datasets all --llm-model openai:gpt-4o

For detailed parameter information and advanced configuration options:
python src/scripts/cti_bench_evaluation.py --help

The evaluation generates several output files:
- cti-ate_{model_name}_{timestamp}.csv: Detailed results for ATE dataset
- cti-mcq_{model_name}_{timestamp}.csv: Detailed results for MCQ dataset
- evaluation_summary_ate_{model_name}_{timestamp}.json: Summary metrics for ATE
- evaluation_summary_mcq_{model_name}_{timestamp}.json: Summary metrics for MCQ
Key Metrics:
- Macro F1: Overall technique extraction performance
- Success Rate: Percentage of successfully processed samples
- Accuracy: Correct technique identification rate
- Precision/Recall: Detailed performance breakdown
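To make the metrics above concrete, here is one common way to score technique extraction: compute precision, recall, and F1 per report over sets of technique IDs, then average the per-report F1 values. This is an illustrative sketch only. The function names are hypothetical, the technique IDs are placeholder data, and cti_bench_evaluation.py may aggregate its "Macro F1" differently.

```python
def sample_prf(predicted: set[str], expected: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one report's predicted technique IDs."""
    true_pos = len(predicted & expected)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(samples) -> float:
    """Average the per-sample F1 over (predicted, expected) pairs."""
    scores = [sample_prf(set(pred), set(gold))[2] for pred, gold in samples]
    return sum(scores) / len(scores) if scores else 0.0


samples = [
    ({"T1059", "T1027"}, {"T1059"}),   # one extra technique: precision 0.5, recall 1.0
    ({"T1003"}, {"T1003", "T1055"}),   # one missed technique: precision 1.0, recall 0.5
]
print(round(macro_f1(samples), 4))
```

Both toy samples score the same F1 because precision and recall are swapped between them, which shows that F1 penalises over-prediction and under-prediction symmetrically.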
Run the complete evaluation on the Mordor dataset:
- Execute Pipeline on All Datasets
  python src/scripts/execute_pipeline_all_datasets.py --model google_genai:gemini-2.0-flash
- Run Evaluation Metrics
  python src/scripts/run_evaluation.py
- View Results
  - Check mordor_dataset/eval_output/evaluation_results/ for detailed metrics
  - Review model_metrics.csv for performance comparisons
The retrieval supervisor is evaluated on the CTI Bench dataset. CTI Bench is a suite of benchmark tasks and datasets designed to evaluate Large Language Models (LLMs) in the field of Cyber Threat Intelligence (CTI), as described in the CTI-Bench repository.
- ATE (Attack Technique Extraction): Tests technique identification and retrieval accuracy
- MCQ (Multiple Choice Questions): Tests knowledge base retrieval quality
- Metrics: F1-score, accuracy, precision, recall, and success rates
Below are the evaluation results of the retrieval supervisor design across multiple language models on the CTI-ATE benchmark.
Full evaluation on real-world security logs using the Mordor dataset. The Mordor project provides pre-recorded security events generated after simulating adversarial techniques in the form of JSON files, categorized by platforms, adversary groups, tactics and techniques defined by the MITRE ATT&CK Framework.
Three LLM models were evaluated across 86 log files covering 7 MITRE ATT&CK tactics. All models achieved "GOOD" grades with effectiveness scores of 61.93-67.92%.
Evaluation Metrics:
- Detection Rate: Percentage of log files with successfully identified threats
- Coverage: Percentage of attack tactics the model can detect (breadth)
- Accuracy: Per-tactic detection rate averaged across all tactics
- Effectiveness Score: Weighted composite (40% detection + 30% coverage + 30% accuracy)
- Standard Metrics: Precision, Recall, F1-score for classification performance
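The composite can be worked through on a small example. The sketch below applies the 40/30/30 weighting from the list above to per-tactic detection outcomes. The input layout (tactic name mapped to one boolean per log file), the `summarize` name, and the reading of coverage as "at least one detection in the tactic" are assumptions for illustration; run_evaluation.py may define these terms differently.

```python
def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Compute the four headline metrics, in percent, from per-tactic outcomes.

    Assumes every tactic has at least one log file.
    """
    outcomes = [hit for hits in results.values() for hit in hits]
    detection = sum(outcomes) / len(outcomes)                            # files with a detected threat
    coverage = sum(any(hits) for hits in results.values()) / len(results)  # tactics detected at least once
    accuracy = sum(sum(h) / len(h) for h in results.values()) / len(results)  # mean per-tactic rate
    effectiveness = 0.40 * detection + 0.30 * coverage + 0.30 * accuracy
    return {
        "detection_rate": round(100 * detection, 2),
        "coverage": round(100 * coverage, 2),
        "accuracy": round(100 * accuracy, 2),
        "effectiveness": round(100 * effectiveness, 2),
    }


# Placeholder outcomes: True means the threat in that log file was identified.
toy_results = {
    "execution": [True, True, False, True],
    "persistence": [False, False],
    "discovery": [True, False],
}
summary = summarize(toy_results)
print(summary)
```

Here detection is 4 of 8 files (50%), coverage is 2 of 3 tactics (66.67%), and accuracy is the mean of 75%, 0%, and 50% (41.67%), giving 0.4 × 50 + 0.3 × 66.67 + 0.3 × 41.67 = 52.5.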
The system supports multiple LLM providers:
- Google GenAI: gemini-2.0-flash, gemini-2.5-flash-lite
- Groq: gpt-oss-120b, llama-3.1-8b-instant, llama-3.3-70b-versatile
- OpenAI: gpt-5-mini, gpt-5, gpt-4.1-mini
- Modular Design: Each agent operates independently with clear interfaces
- LLM-Driven Reasoning: Prefers AI reasoning over hard-coded rules
- Hierarchical Coordination: Supervisor agents manage specialized sub-agents
- Comprehensive Evaluation: Built-in testing and validation frameworks
- Scalable Processing: Efficient handling of large log datasets
- Purpose: Detects suspicious activities in security logs
- Capabilities: Event validation, timeline analysis, command decoding
- Tools: Field reduction, event ID extraction, timeline building, base64 decoding
- Reflection: Self-critique and iterative improvement
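As an illustration of the command decoding capability, the sketch below decodes a PowerShell encoded command, which is base64 over UTF-16LE text. It is not the project's tool: `decode_powershell_command` is a hypothetical helper, it assumes the command line is available as a plain string, and its pattern only recognises the `-e`, `-enc`, and `-EncodedCommand` spellings.

```python
import base64
import binascii
import re
from typing import Optional

# Matches -e / -enc / -EncodedCommand followed by a base64 payload.
ENCODED_FLAG = re.compile(
    r"-e(?:nc(?:odedcommand)?)?\s+([A-Za-z0-9+/=]{8,})", re.IGNORECASE
)


def decode_powershell_command(command_line: str) -> Optional[str]:
    """Return the decoded script text, or None if nothing decodable is found."""
    match = ENCODED_FLAG.search(command_line)
    if match is None:
        return None
    try:
        raw = base64.b64decode(match.group(1), validate=True)
        return raw.decode("utf-16-le")  # PowerShell encodes the script as UTF-16LE
    except (binascii.Error, UnicodeDecodeError):
        return None


# Build a sample command line the same way an attacker's launcher would.
payload = base64.b64encode("whoami".encode("utf-16-le")).decode("ascii")
command_line = f"powershell.exe -NoProfile -enc {payload}"
print(decode_powershell_command(command_line))
```

Returning None on malformed input lets a caller keep the original command line in the timeline instead of aborting the analysis.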
- Purpose: Coordinates MITRE ATT&CK technique retrieval
- Sub-agents:
- Database agent (semantic search over MITRE knowledge base)
- Grader agent (quality assessment and iterative refinement)
- CTI Agent (under development - web-facing CTI report analysis)
- Features: Iterative refinement, quality control, multi-query search
- Reflection: Quality assessment and retrieval improvement
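The interplay between the database agent and the grader agent can be pictured as a search, grade, rewrite loop. The sketch below is a simplified stand-in under stated assumptions: `retrieve_with_grading` and its three callables are hypothetical, the toy knowledge base holds two hand-written entries, and the real agents use an LLM and the vector store rather than keyword overlap.

```python
from typing import Callable


def retrieve_with_grading(
    need: str,
    search: Callable[[str], list[str]],
    grade: Callable[[str, list[str]], float],
    rewrite: Callable[[str, list[str]], str],
    min_score: float = 0.7,
    max_rounds: int = 3,
) -> tuple[list[str], float]:
    """Search, grade the hits against the original need, and refine the query."""
    query = need
    best_hits: list[str] = []
    best_score = -1.0
    for _ in range(max_rounds):
        hits = search(query)
        score = grade(need, hits)
        if score > best_score:
            best_hits, best_score = hits, score
        if score >= min_score:
            break  # good enough: stop refining
        query = rewrite(query, hits)
    return best_hits, best_score


# Toy knowledge base with abbreviated, hand-written descriptions.
KB = {
    "T1059.001": "powershell command and scripting interpreter",
    "T1003": "os credential dumping lsass",
}


def keyword_search(query: str) -> list[str]:
    words = set(query.lower().split())
    return [tid for tid, text in KB.items() if words & set(text.split())]


def grade_hits(need: str, hits: list[str]) -> float:
    return 1.0 if hits else 0.0  # stub: a real grader would judge relevance


def add_synonym(query: str, hits: list[str]) -> str:
    return query + " powershell"  # stub: a real rewriter would be LLM-driven


hits, score = retrieve_with_grading("encoded ps1 launcher", keyword_search, grade_hits, add_synonym)
print(hits, score)
```

Grading against the original need rather than the rewritten query keeps successive rewrites from drifting away from what the log analysis actually asked for.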
- Purpose: Synthesizes findings into comprehensive reports
- Capabilities: Threat correlation, attack chain reconstruction, mitigation recommendations
- Outputs: Threat assessments, attack chains, mitigation recommendations
- Formats: JSON, Markdown, structured data
- Reflection: Report quality assessment and improvement
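Producing both JSON and Markdown from one structure might look like the sketch below. The report schema (title, severity, attack_chain, mitigations) and the `render_markdown` helper are invented for illustration and are not the Response Agent's actual output format.

```python
import json


def render_markdown(report: dict) -> str:
    """Render a report dictionary as a short Markdown document."""
    lines = [
        f"# Threat Report: {report['title']}",
        "",
        f"Severity: {report['severity']}",
        "",
        "## Attack Chain",
    ]
    lines += [
        f"{i}. {step['technique_id']}: {step['summary']}"
        for i, step in enumerate(report["attack_chain"], start=1)
    ]
    lines += ["", "## Mitigations"]
    lines += [f"- {item}" for item in report["mitigations"]]
    return "\n".join(lines)


# Placeholder report; the values are illustrative only.
report = {
    "title": "Suspicious PowerShell activity",
    "severity": "high",
    "attack_chain": [
        {"technique_id": "T1059.001", "summary": "Encoded PowerShell command executed"},
        {"technique_id": "T1003", "summary": "Credential dumping attempt"},
    ],
    "mitigations": ["Enable script block logging", "Restrict access to LSASS memory"],
}

markdown_text = render_markdown(report)
json_text = json.dumps(report, indent=2)
print(markdown_text)
```

Keeping a single dictionary as the source of truth means the JSON download and the Markdown download cannot disagree with each other.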
The modular architecture supports:
- New Data Sources: Implement custom log processors
- Additional Agents: Add specialized analysis components
- Custom Evaluations: Extend evaluation frameworks
- Model Integration: Support for new LLM providers
This is a university project demonstrating LLM-based multi-agent systems for cybersecurity. The system is designed for educational purposes and showcases the power of modern AI in security analysis.
This project is developed for educational purposes as part of COS30049 coursework.
- MITRE ATT&CK: For the comprehensive attack technique database
- Mordor Dataset: For realistic security log samples
- LangChain & LangGraph: For the multi-agent framework
- Streamlit: For the web interface
- Hugging Face: For model hosting and deployment




