SympScan is an intelligent medical assistant designed to provide clinical facts, diagnostic insights, and treatment protocols by leveraging a Hybrid Dual-Indexing RAG (Retrieval-Augmented Generation) pipeline and a Knowledge Graph. Rather than acting as a simple retriever, the system works as a "Medical Knowledge Engine" that synthesizes information from both unstructured document chunks and structured entity relationships.
The pipeline integrates:
- Hybrid Search: Combining BM25 keyword search with FAISS-based semantic vector embeddings.
- Knowledge Graph: A Neo4j-powered graph database to capture explicit relationships between diseases, symptoms, medications, and precautions.
- Agentic Workflow: Pre-retrieval query rewriting, expansion, and HyDE (Hypothetical Document Embeddings), followed by post-retrieval extractive compression and reranking.
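
As a rough illustration of the post-retrieval step, extractive compression can be sketched as keeping only the sentences of a retrieved chunk that share terms with the query. This is a minimal sketch under assumptions: `compress_chunk`, `tokenize`, the stop-word list, and the term-overlap scoring are hypothetical and not taken from PreRetrival_and_PostRetrieval.py, which may use a different method.

```python
import re

# Hypothetical stop-word list; a real pipeline would likely use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "for", "is", "are", "what", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}

def compress_chunk(query, chunk, max_sentences=2):
    """Keep the sentences of `chunk` that overlap most with the query terms."""
    query_terms = tokenize(query)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    # Score each sentence by how many query terms it contains.
    scored = [(len(query_terms & tokenize(s)), i, s) for i, s in enumerate(sentences)]
    # Take the top-scoring sentences with non-zero overlap, then restore document order.
    best = sorted((x for x in scored if x[0] > 0), key=lambda x: (-x[0], x[1]))[:max_sentences]
    return " ".join(s for _, _, s in sorted(best, key=lambda x: x[1]))

chunk = (
    "Hypertension is a chronic condition. "
    "Common symptoms of hypertension include headaches and dizziness. "
    "The clinic opens at nine."
)
compressed = compress_chunk("What are the symptoms of hypertension?", chunk)
print(compressed)
```

Sentences with no query overlap are dropped entirely, which shortens the context passed to the LLM without rewriting any of the retrieved text.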
| Component | Technology | Purpose |
|---|---|---|
| Orchestration | LangChain / Ollama | Managing LLM chains and tool integration. |
| Vector Database | FAISS | High-performance semantic similarity search. |
| Graph Database | Neo4j | Retrieving structured medical entities and relationships. |
| LLM Interface | Qwen-2.5 / Llama 3 | Intent detection, entity extraction, and final response synthesis. |
| Data Processing | PySpark | Efficient transformation and Parquet storage of medical datasets. |
[Image of a RAG pipeline architecture including Vector DB and Knowledge Graph]
| File Name | Description |
|---|---|
| Raw_Dataset_PreProcess.py | Uses PySpark to clean and transform raw medical CSVs into a structured Parquet dataset. |
| Hybrid_Dual_Indexing.py | Implements semantic chunking and BM25 indexing for dual-path retrieval. |
| Knowledge_Graph.py | Constructs and queries a Neo4j graph to map Disease |
| Vector_Database.py | Manages the FAISS HNSW index for efficient vector storage and retrieval. |
| PreRetrival_and_PostRetrieval.py | Handles query rewriting, HyDE generation, and extractive context compression. |
| Retrieval.py | The main engine that merges Hybrid and Graph results using RRF and Cross-Encoders. |
| Augmented_Generation.py | The core RAG logic; handles prompt engineering, JSON validation, and chat history summarization. |
| Inference.py | A Streamlit dashboard providing a professional UI for real-time medical analysis. |
The system uses a "Dual-Path" approach. The Hybrid_Dual_Indexing.py script ensures that both exact technical medical terms (matched by BM25) and contextual meaning (captured by all-MiniLM-L6-v2 embeddings) are considered. Results are then reranked with a Cross-Encoder (ms-marco-MiniLM-L-6-v2) so that the most relevant passages appear first.
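
The file list names RRF (Reciprocal Rank Fusion) as the way Retrieval.py merges result lists. The sketch below shows the general technique on two rankings, one from keyword search and one from vector search. It is an illustration under assumptions, not the project's code: the function name, the document IDs, and the constant `k=60` (a commonly used default) are all hypothetical here.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword results
vector_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Because RRF works on ranks rather than raw scores, BM25 scores and cosine similarities never need to be normalized against each other. A Cross-Encoder can then rescore only the fused top results.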
While the vector database provides descriptive context, the Knowledge_Graph.py component provides hard clinical links. For example, if "Hypertension" is detected, the graph immediately pulls associated "Medications" and "Precautions" as verified facts, which are prioritized in the final prompt.
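
One way to picture this prioritization is a prompt builder that places graph facts ahead of retrieved passages. The sketch below is hypothetical: `build_prompt`, the triple layout, the section labels, and the placeholder values are assumptions for illustration and do not reflect the actual prompt used in Augmented_Generation.py.

```python
def build_prompt(question, graph_facts, vector_chunks):
    """Assemble a prompt that lists graph facts before retrieved passages."""
    lines = ["Verified facts (knowledge graph):"]
    for disease, relation, values in graph_facts:
        lines.append(f"- {disease} {relation}: {', '.join(values)}")
    lines.append("")
    lines.append("Supporting context (vector search):")
    lines.extend(f"- {chunk}" for chunk in vector_chunks)
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

# Placeholder values only; real facts would come from the Neo4j graph.
graph_facts = [
    ("Hypertension", "Medications", ["Medication A", "Medication B"]),
    ("Hypertension", "Precautions", ["Precaution A"]),
]
vector_chunks = ["Descriptive passage retrieved from the vector index."]
prompt = build_prompt("How is hypertension managed?", graph_facts, vector_chunks)
print(prompt)
```

Putting the structured facts first and labelling them separately gives the LLM a clear signal about which statements to treat as authoritative.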
- Format Guardrails: The system enforces strict JSON outputs for consistent UI rendering.
- Self-Correction: If the LLM produces an invalid format, the Generation loop triggers a rectification prompt.
- Evaluation Scores: Includes internal metrics for Response Confidence and Retrieval Helpfulness.
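
The format guardrail and self-correction behaviour described above can be sketched as a validate-and-retry loop. This is a minimal sketch under assumptions: the required keys, the wording of the rectification prompt, the retry limit, and the `call_llm` callable (a stand-in for the real model call) are all hypothetical.

```python
import json

REQUIRED_KEYS = {"answer", "confidence"}  # hypothetical output schema

def generate_with_correction(call_llm, prompt, max_attempts=3):
    """Call the model until it returns a JSON object with the required keys."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed):
                return parsed
            problem = "missing required keys"
        except json.JSONDecodeError as exc:
            problem = f"invalid JSON ({exc.msg})"
        # Rectification prompt: show the model its own output and the problem.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected: {problem}.\n"
            f"Previous reply: {raw}\n"
            f"Reply again with only a JSON object containing the keys {sorted(REQUIRED_KEYS)}."
        )
    raise ValueError(f"No valid JSON after {max_attempts} attempts")

# Stub standing in for the real model: fails once, then returns valid JSON.
replies = iter(["Sure! Here is the answer.", '{"answer": "placeholder answer", "confidence": 0.8}'])
result = generate_with_correction(lambda p: next(replies), "Summarise the advice.")
print(result)
```

Capping the number of attempts keeps a persistently malformed reply from stalling the UI; the caller can catch the `ValueError` and show a fallback message.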
- Database: Neo4j Desktop or AuraDB instance (configured with APOC).
- Local LLM: Ollama installed and running.
- Environment: Python 3.10+
- Clone the Repository:

      cd "Your Directory"
      git clone https://github.com/Dochikhoa2006/SympScan-Advanced-Medical-RAG-Knowledge-Graph-System.git

- Docker:
  - Container Instantiation / Orchestration:

        docker-compose up --build

  - Chatbot UI (access one suitable URL):

        Local URL: http://localhost:8501
        Network URL: http://xxx.xx.x.x:xxxx
        External URL: http://yyy.yy.y.y:yyyy
This project is licensed under the CC-BY (Creative Commons Attribution) license.
Do, Chi Khoa (2026). SympScan: Advanced Medical RAG & Knowledge Graph System.
This README structure is inspired by data documentation guidelines from:
This project utilizes the SympScan - Symptomps to Disease Dataset, available on Kaggle:
For inquiries regarding the architecture or medical dataset integration, contact dochikhoa2006@gmail.com.