A production-ready Retrieval-Augmented Generation (RAG) assistant for customer support. The system answers customer questions accurately with inline citations and enforces company policies.
- Hybrid Search: Combines semantic search (sentence transformers + FAISS) with BM25 for robust retrieval
- Cross-Encoder Reranking: Improves precision with neural reranking
- LLM-Based Generation: Natural language answers using OpenAI/Anthropic (with offline fallback)
- Inline Citations: Every fact is cited in the format `[source_file#identifier]`
- Confidence Scoring: Quality assessment for each answer
- PII Masking: Automatically redacts card numbers, national IDs, masks phone numbers
- Escalation Detection: Identifies P0/P1/P2 severity levels and provides guidance
- SLA Compliance: Differentiates enterprise (4hr) vs standard (24hr) response times
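The escalation and SLA features above can be sketched as simple keyword and account-tier checks. This is an illustrative sketch only; the keyword lists and function names are assumptions, not the actual `policy_enforcer.py` logic.

```python
from typing import Optional

# Hypothetical severity keywords; the real rule set lives in kb.yaml.
SEVERITY_KEYWORDS = {
    "P0": ["outage", "data breach", "all users down"],
    "P1": ["cannot log in", "payment failed", "major bug"],
    "P2": ["slow", "minor issue", "question"],
}

def detect_severity(text: str) -> Optional[str]:
    """Return the most severe level whose keywords appear in the text."""
    lowered = text.lower()
    for level in ("P0", "P1", "P2"):  # check from most to least severe
        if any(kw in lowered for kw in SEVERITY_KEYWORDS[level]):
            return level
    return None

def sla_hours(is_enterprise: bool) -> int:
    """Enterprise accounts get a 4-hour SLA; standard accounts get 24 hours."""
    return 4 if is_enterprise else 24
```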
```
.
├── config.py              # Configuration management
├── data_loader.py         # Data loading with intelligent chunking
├── retriever.py           # Hybrid search + reranking
├── generator.py           # LLM-based answer generation
├── policy_enforcer.py     # Policy enforcement
├── assistant.py           # Main orchestrator
├── evaluate.py            # Evaluation script
├── main.py                # Entry point (interactive/eval modes)
├── indexer.py             # Legacy indexer (backward compatibility)
├── data/
│   ├── plans.csv          # Telecom plans
│   ├── kb.yaml            # Knowledge base (policies, SLAs, discounts)
│   ├── transcripts.json   # Customer support examples
│   ├── faq.jsonl          # FAQ pairs
│   └── eval_prompts.txt   # Evaluation questions
├── requirements.txt       # Python dependencies
├── ARCHITECTURE.md        # Design documentation
└── README.md              # This file
```
- Python 3.8 or higher
- pip package manager
- Clone or download this repository
- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- (Optional) Set up an LLM API key for enhanced generation:

  ```
  # For OpenAI
  export OPENAI_API_KEY="your-api-key-here"

  # Or for Anthropic
  export ANTHROPIC_API_KEY="your-api-key-here"
  ```

  Note: The system works fully without an API key, using offline template-based generation.
The first run will download the embedding models (~90MB), which may take a few minutes.
Evaluate the system on `eval_prompts.txt`:

```
python evaluate.py
```

This will:
- Load all data sources
- Build the hybrid search index
- Run evaluation on all prompts
- Save results to `evaluation_outputs.json` and `evaluation_outputs.txt`
Run the assistant interactively:

```
python main.py
```

Commands in interactive mode:
- Type customer questions to get answers
- `help` - Show available commands
- `eval` - Run evaluation
- `explain` - Toggle explanation mode (shows retrieval details)
- `status` - Show assistant status
- `clear` - Clear screen
- `exit` - Quit
```
# Run in offline mode (no LLM API needed)
python main.py --offline

# Disable reranking (faster but less precise)
python main.py --no-rerank

# Run evaluation directly
python main.py --eval
```

```python
from assistant import RAGAssistant, create_assistant

# Create and initialize assistant
assistant = create_assistant(use_llm=True, use_reranker=True)
assistant.initialize()

# Generate answer
response = assistant.generate_answer(
    "Which plan includes call recording?",
    top_k=5,
    is_enterprise=False,
)
print(response["answer"])
print("Citations:", response["citations"])
print("Confidence:", f"{response['confidence']:.2%}")

# Get detailed explanation
explanation = assistant.explain_answer("Customer wants SSO support")
print(explanation["breakdown"]["retrieved_documents"])
```

Evaluation produces two files:
- `evaluation_outputs.json`: Structured JSON with all results
- `evaluation_outputs.txt`: Human-readable text format
Each result includes:
- Query and answer with inline citations
- List of all citations used
- Confidence score
- Escalation information (if triggered)
- SLA guidance
- Generation method used
Citations follow the format `[source_file#identifier]`. Examples:
- `[plans.csv#row=4]` - Row 4 in plans.csv
- `[kb.yaml#features_matrix]` - Features matrix section
- `[kb.yaml#discounts]` - Discount rules section
- `[transcripts.json#t-002]` - Transcript t-002
- `[faq.jsonl#entry=2]` - FAQ entry 2
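A citation string in this format can be pulled out of an answer with a short regular expression. This is an illustrative sketch; `extract_citations` and the exact pattern are assumptions, not part of the project's API.

```python
import re

# Matches citations of the form [source_file#identifier], e.g. [plans.csv#row=4].
CITATION_RE = re.compile(r"\[([\w.\-]+)#([\w=\-]+)\]")

def extract_citations(answer: str):
    """Return (source_file, identifier) pairs found in an answer string."""
    return CITATION_RE.findall(answer)

pairs = extract_citations(
    "Call recording is in the Pro plan [plans.csv#row=4] "
    "per the features matrix [kb.yaml#features_matrix]."
)
```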
Configuration can be set via environment variables:
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | None |
| `ANTHROPIC_API_KEY` | Anthropic API key | None |
| `LLM_PROVIDER` | `"openai"`, `"anthropic"`, or `"offline"` | `"openai"` |
| `LLM_MODEL` | Model name | `"gpt-4o-mini"` |
| `USE_RERANKER` | Enable cross-encoder reranking | `"true"` |
| `SEMANTIC_WEIGHT` | Weight for semantic search (0-1) | 0.6 |
| `BM25_WEIGHT` | Weight for BM25 search (0-1) | 0.4 |
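Reading these variables boils down to `os.getenv` calls with the defaults above. A minimal sketch, assuming a dict-shaped config; the actual structure in `config.py` may differ.

```python
import os

def load_config() -> dict:
    """Read configuration from environment variables, with documented defaults."""
    return {
        "llm_provider": os.getenv("LLM_PROVIDER", "openai"),
        "llm_model": os.getenv("LLM_MODEL", "gpt-4o-mini"),
        # Any value other than "true" (case-insensitive) disables reranking.
        "use_reranker": os.getenv("USE_RERANKER", "true").lower() == "true",
        "semantic_weight": float(os.getenv("SEMANTIC_WEIGHT", "0.6")),
        "bm25_weight": float(os.getenv("BM25_WEIGHT", "0.4")),
    }
```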
- Data Loading: All sources are loaded into `Document` objects with metadata and keywords
- Indexing:
  - Semantic: Sentence transformer embeddings → FAISS index
  - BM25: Custom tokenization → inverted index
- Query Processing:
  - Query expansion with domain synonyms
  - Parallel hybrid search (semantic + BM25)
  - Score fusion with configurable weights
  - Cross-encoder reranking for top results
- Answer Generation:
  - LLM mode: GPT-4o-mini with grounding constraints
  - Offline mode: Template-based with intent detection
- Policy Enforcement: PII masking, escalation detection, SLA guidance
- Response: Answer with inline citations, confidence score, policy info
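The PII masking step can be illustrated with two regex substitutions: fully redact card numbers, and keep only the last two digits of phone numbers. This is a hedged sketch; the real `policy_enforcer.py` patterns (and which identifiers it covers) may differ.

```python
import re

def mask_pii(text: str) -> str:
    """Redact card numbers entirely; keep the last two digits of phone numbers."""
    # Card numbers: 16 digits, optionally separated by spaces or dashes.
    text = re.sub(r"\b(?:\d[ -]?){15}\d\b", "[CARD REDACTED]", text)
    # Phone numbers: 10+ digits, keep the last two for reference.
    text = re.sub(
        r"\b(?:\d[ -]?){8,}(\d\d)\b",
        lambda m: "***-**" + m.group(1),
        text,
    )
    return text
```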
- Hybrid Search: `final_score = 0.6 * semantic_score + 0.4 * bm25_score`
- Reranking: Cross-encoder (ms-marco-MiniLM-L-6-v2) for precision
- Query Expansion: Domain-specific synonyms (e.g., "SSO" → "single sign-on")
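The weighted fusion above only makes sense once both retrievers' scores are on a common scale. A minimal sketch with min-max normalization, assuming dicts of per-document scores; `retriever.py` may normalize differently.

```python
def fuse_scores(semantic: dict, bm25: dict,
                w_sem: float = 0.6, w_bm25: float = 0.4) -> dict:
    """Combine per-document scores from both retrievers into one ranking."""
    def normalize(scores: dict) -> dict:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
        return {doc: (s - lo) / span for doc, s in scores.items()}

    sem_n, bm_n = normalize(semantic), normalize(bm25)
    # Documents missing from one retriever contribute 0 from that side.
    docs = set(sem_n) | set(bm_n)
    return {d: w_sem * sem_n.get(d, 0.0) + w_bm25 * bm_n.get(d, 0.0) for d in docs}
```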
See ARCHITECTURE.md for detailed design decisions, trade-offs, and improvement roadmap.
Key highlights:
- Hybrid search provides robustness against both semantic and keyword misses
- Offline mode ensures reproducibility and works without API dependencies
- Inline citations enable precise fact traceability
- Modular architecture allows easy extension and testing
The system is evaluated on:
- Grounding: Answers traceable to the dataset
- Citations: Proper format and accuracy
- Policy Enforcement: PII masking, escalation, SLA compliance
- Answer Quality: Relevance, completeness, natural language
- Confidence: Score reflects retrieval quality
If the sentence transformer model fails to download:
- Check internet connection (first run only)
- Models are cached locally after first download
- Try: `pip install --upgrade sentence-transformers`
If you encounter memory issues:
- Reduce the `top_k` parameter in `generate_answer()`
- Disable reranking with `--no-rerank`
- Use a smaller embedding model
If LLM generation fails:
- Check API key is set correctly
- Verify API quota/billing
- System automatically falls back to offline mode
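The automatic fall-back amounts to a try/except around the LLM call. A sketch under assumed names: `call_llm` and `generate_offline` are placeholders, not the project's actual functions.

```python
def generate_with_fallback(prompt: str, call_llm, generate_offline):
    """Try the LLM first; on any API error, use template-based generation."""
    try:
        return call_llm(prompt), "llm"
    except Exception:
        # Quota, network, or auth failures all degrade to offline mode.
        return generate_offline(prompt), "offline"
```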
Ensure all dependencies are installed:

```
pip install -r requirements.txt
```

| Metric | Value |
|---|---|
| Index Build | ~3-5 seconds |
| Query (offline) | ~200ms |
| Query (LLM) | ~1-2 seconds |
| Memory | ~200MB |