AI/ML-Powered RAG Pipeline for Banking Regulatory Document Analysis
An advanced Retrieval-Augmented Generation (RAG) system that ingests, indexes, and analyzes large corpora of regulatory documents from the Bank for International Settlements (BIS). The system performs automated governance analysis, regulatory anomaly detection, and compliance gap identification across thousands of pages of financial policy documents.
| Analysis Module | Description |
|---|---|
| Governance & Compliance Analysis | Identifies governance structures, oversight mechanisms, and compliance patterns |
| Regulatory Anomaly Detection | Flags inconsistencies, contradictions, and deviations across the document corpus |
| Best Practices Identification | Extracts and ranks regulatory best practices from the full document set |
| Comprehensive Intelligence Report | Generates structured DOCX reports with findings and recommendations |
| Legal & Procedural Gap Analysis | Identifies gaps between stated procedures and regulatory requirements |
| Market Irregularity Assessment | Detects references to market irregularities and systemic risk indicators |
- 156 BIS papers analyzed (March 2001 – April 2025)
- 22,296 pages fully indexed, processed, and vectorized
- Documents stored in SQLite with full-text search and TF-IDF vectorization
- NLP Pipeline: spaCy, NLTK (tokenization, lemmatization, stopword removal), TextBlob (sentiment analysis), textstat (readability scoring)
- ML/Vectorization: scikit-learn (TF-IDF, K-Means clustering, Latent Dirichlet Allocation for topic modeling)
- Document Processing: PyMuPDF (fitz), PyPDF2 for robust multi-format PDF extraction
- Report Generation: python-docx for automated DOCX report creation
- GUI: Tkinter-based desktop interface for interactive analysis
- Storage: SQLite for document indexing and metadata management
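
The storage layer is described as SQLite with full-text search. A minimal sketch of what such an index could look like, assuming the local SQLite build includes the FTS5 extension. The table name `pages`, its columns, the helper names, and the sample rows are hypothetical and are not taken from `bisv3.py`.

```python
import sqlite3


def build_index(conn, pages):
    """Create an FTS5 table and load (doc_id, page_no, body) rows into it."""
    # doc_id and page_no are stored but not tokenized; only body is searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
        "USING fts5(doc_id UNINDEXED, page_no UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO pages (doc_id, page_no, body) VALUES (?, ?, ?)", pages
    )
    conn.commit()


def search(conn, query, limit=5):
    """Return (doc_id, page_no) pairs, best match first (FTS5 built-in ranking)."""
    rows = conn.execute(
        "SELECT doc_id, page_no FROM pages WHERE pages MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


# Hypothetical sample rows, for illustration only.
demo_conn = sqlite3.connect(":memory:")
build_index(
    demo_conn,
    [
        ("paper_a", 1, "Banks must hold minimum capital buffers."),
        ("paper_b", 3, "Digital currencies raise settlement questions."),
    ],
)
print(search(demo_conn, "capital"))
```

Keeping the metadata columns `UNINDEXED` means a keyword hit can be traced back to a document and page without those values polluting the search index.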
bisv3.py # Latest version — full RAG pipeline with GUI and all 6 analysis modules
bisv2.py # Prior iteration with core analysis functionality
bisv1.py # Initial implementation
BIS10/ # Sample subset of BIS papers (PDF) and generated reports (DOCX)
PyPDF2
PyMuPDF
pandas
numpy
scikit-learn
nltk
spacy
textstat
textblob
python-docx
pip install PyPDF2 PyMuPDF pandas numpy scikit-learn nltk spacy textstat textblob python-docx
python -m spacy download en_core_web_sm
python bisv3.py

The GUI will launch. Load PDF documents via the file browser, wait for indexing to complete, then select any of the six analysis modules to generate findings.
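
Indexing starts with text extraction, which this project performs with PyMuPDF and PyPDF2. A minimal sketch of one way to arrange an extraction fallback between the two, assuming PyMuPDF's `get_text()` and the `PdfReader` class from PyPDF2 2.x or later. The function names are hypothetical, and the actual logic in `bisv3.py` may differ.

```python
from typing import Callable, Optional, Sequence


def extract_with_pymupdf(path: str) -> str:
    """Extract text page by page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pypdf2(path: str) -> str:
    """Extract text page by page with PyPDF2 (imported lazily)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def extract_text(
    path: str,
    extractors: Sequence[Callable[[str], str]] = (
        extract_with_pymupdf,
        extract_with_pypdf2,
    ),
) -> Optional[str]:
    """Try each extractor in order; return the first non-empty result."""
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            # A missing library or a malformed PDF moves on to the next extractor.
            continue
        if text.strip():
            return text
    return None  # No extractor produced usable text.
```

Treating an empty result the same as an exception matters for scanned PDFs, where a library can succeed technically yet return no text.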
- Document Ingestion: Batch PDF extraction with fallback between PyMuPDF and PyPDF2 for maximum compatibility
- Text Processing: Sentence/word tokenization, lemmatization, stopword removal, named entity recognition via spaCy
- Vectorization: TF-IDF vectorization of document segments for similarity search and retrieval
- Topic Modeling: Latent Dirichlet Allocation (LDA) for unsupervised topic discovery across the corpus
- Clustering: K-Means clustering of document segments by semantic similarity
- Analysis & Reporting: Module-specific analytical pipelines generating structured DOCX reports
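
The vectorization and clustering steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code in `bisv3.py`: the four sample segments, the `retrieve` helper, and all parameter values (`top_k`, `n_clusters`, `random_state`) are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical document segments standing in for indexed BIS text.
segments = [
    "Banks must hold capital buffers against credit risk exposures.",
    "Capital requirements rise with the credit risk of bank exposures.",
    "Central bank digital currency design affects retail payment systems.",
    "Retail payment systems may settle in central bank digital currency.",
]

# Fit TF-IDF once over all segments; rows of `matrix` are segment vectors.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(segments)


def retrieve(query, top_k=2):
    """Return (segment_index, cosine_score) pairs for the best-matching segments."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in ranked]


# Group the same TF-IDF vectors into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)

for index, score in retrieve("capital buffers credit risk"):
    print(index, round(score, 3), segments[index])
print(labels.tolist())
```

Reusing one fitted vectorizer for both queries and segments keeps them in the same vocabulary space, which is what makes the cosine scores comparable. Topic modeling could take the same approach with `sklearn.decomposition.LatentDirichletAllocation`, although LDA is usually fitted on raw term counts rather than TF-IDF weights.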
Dr. Mosab Hawarey

PhD, Geodetic & Photogrammetric Engineering (ITU) | MSc, Geomatics (Purdue) | MBA (Wales) | BSc, MSc (METU)
- GitHub: https://github.com/mhawarey
- Personal: https://hawarey.org/mosab
- ORCID: https://orcid.org/0000-0001-7846-951X
MIT License