A professional FAQ chatbot application that uses Natural Language Processing (NLP) and TF-IDF-based similarity matching to provide instant answers from a predefined FAQ dataset.
✅ NLP Text Preprocessing - Tokenization, lowercasing, stopword removal, lemmatization
✅ TF-IDF Vectorization - Converts text to numerical vectors for comparison
✅ Cosine Similarity Matching - Finds the best matching FAQ question
✅ Confidence Scoring - Shows how confident the system is about each answer
✅ Modular Architecture - Clean separation of concerns (preprocessing, similarity, API)
✅ REST API - JSON-based endpoints for query processing
✅ Web UI - Modern, responsive chat interface
✅ 25 Sample FAQs - Covering Tech, ML, DL, Java, C, Python topics
AI faq ChatBot/
├── app.py                # Flask application (main server)
├── preprocessing.py      # NLP preprocessing module
├── similarity.py         # TF-IDF and similarity matching module
├── faq_dataset.json      # FAQ database
├── requirements.txt      # Python dependencies
├── README.md             # This file
└── static/
    ├── index.html        # Chat UI
    ├── style.css         # UI styling
    └── script.js         # Frontend logic
- Backend: Flask 3.0.0, Python 3.x
- NLP: NLTK 3.8.1 (Natural Language Toolkit)
- Similarity: TF-IDF vectorization with cosine similarity
- API Integration: Google APIs (Knowledge Graph, Custom Search)
- Security: python-dotenv for secure credential management
- Frontend: HTML5, CSS3, JavaScript
- API: REST with JSON
This project includes secure API integration without exposing credentials:
✅ Credentials Stored Locally - API keys in .env file (never committed)
✅ Environment Variables - All secrets loaded from environment
✅ Safe for GitHub - .env is in .gitignore and never uploaded
✅ Easy Setup - Copy .env.example and add your own key
- .env - Local credentials (NOT on GitHub)
  - Contains: GOOGLE_API_KEY=your_key_here
  - This file is in .gitignore and stays private
- .env.example - Template for GitHub
  - Shows structure: GOOGLE_API_KEY=your_google_api_key_here
  - Others copy this to .env and add their key
- config.py - Reads from environment variables
  - Never hardcodes secrets
  - Loads from .env file using python-dotenv
- api_client.py - Secure API client
  - Uses credentials from config.py
  - Handles API requests safely
- Create a .env file in the project root (copy from .env.example): GOOGLE_API_KEY=your_actual_api_key_here
- Keep .env private - it will NOT be committed to GitHub
- For collaboration:
  - Share .env.example (no secrets)
  - Each developer adds their own .env with their API key
  - .gitignore prevents accidental commits
- Python 3.8 or higher (Flask 3.0.0 does not support Python 3.7)
- pip (Python package manager)
Windows:
python -m venv venv
venv\Scripts\activate
macOS/Linux:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
This installs:
- Flask 3.0.0 - Web framework
- Flask-CORS 4.0.0 - Cross-Origin Resource Sharing
- NLTK 3.8.1 - Natural Language Processing
python -c "import flask; import nltk; print('All dependencies installed!')"
python app.py
You should see output like:
FAQ Chatbot initialized with 25 FAQs
Similarity threshold: 0.2
Starting FAQ Chatbot on http://localhost:5000
Press CTRL+C to stop
Open your web browser and navigate to:
http://localhost:5000
To stop the server, press CTRL+C in the terminal.
- Open http://localhost:5000 in your browser
- Type a question in the input field
- Click "Send" or press Enter
- View the chatbot's response with confidence score
- "What is deep learning?"
- "Difference between lists and tuples in Python"
- "Explain polymorphism in Java"
- "What are pointers in C?"
- "What is machine learning?"
- "How does a CNN work?"
- User Input → User types a question in the chat interface
- Text Preprocessing (preprocessing.py):
- Lowercasing
- Tokenization (word splitting)
- Stopword removal (remove common words like "the", "is")
- Lemmatization (convert to base form: "running" → "run")
- TF-IDF Vectorization (similarity.py):
- Term Frequency (TF) = how often a word appears in a document
- Inverse Document Frequency (IDF) = how rare the word is across all documents
- TF-IDF = TF × IDF (higher score for important words)
- Similarity Matching (similarity.py):
- Calculate cosine similarity between query vector and all FAQ vectors
- Find FAQ with highest similarity
- Threshold Check:
- If similarity ≥ 0.2 (20%) → Return FAQ answer
- If similarity < 0.2 → Return "not confident" message
- Display Response (frontend):
- Show matched FAQ question
- Show answer
- Show confidence score (0.0 to 1.0)
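The preprocessing step in the flow above can be sketched with the standard library alone. This is a simplified stand-in, not the project's TextPreprocessor: the tiny STOPWORDS set and the LEMMAS lookup table are hypothetical placeholders for NLTK's stopword corpus and WordNet lemmatizer, and the function name preprocess is chosen only for illustration.

```python
import re

# Toy stand-ins for NLTK's stopword list and WordNet lemmatizer.
# Both tables are assumptions for illustration; the project itself uses NLTK.
STOPWORDS = {"the", "is", "a", "an", "in", "of", "what", "are",
             "how", "does", "do", "i", "my", "can"}
LEMMAS = {"running": "run", "pointers": "pointer",
          "lists": "list", "tuples": "tuple"}


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and map tokens to base forms."""
    text = text.lower()                               # lowercasing
    tokens = re.findall(r"[a-z0-9]+", text)           # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [LEMMAS.get(t, t) for t in tokens]         # lookup-based lemmatization


print(preprocess("What are pointers in C?"))
```

A real lemmatizer handles far more word forms than a lookup table, so treat this only as a picture of the order in which the steps run.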
Input: "Can I change my password?"
↓
Preprocessed: [change, password] ("can" is removed as a stopword)
↓
TF-IDF Vector: {change: 0.25, password: 0.42, ...}
↓
Compare with FAQ: "How do I reset my account password?"
Preprocessed: [reset, account, password]
TF-IDF Vector: {reset: 0.18, account: 0.20, password: 0.42, ...}
↓
Cosine Similarity: 0.78 (78%)
↓
0.78 > 0.2 threshold ✓
↓
Return: "To reset your password, go to the login page..."
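The matching step in this walkthrough can be reproduced in a few lines of plain Python. The sketch below assumes a smoothed IDF formula, log((1 + N) / (1 + df)) + 1, which may differ from what TFIDFVectorizer in similarity.py actually uses, and the second and third FAQ token lists are invented purely so the corpus has more than one document. The scores it produces are therefore illustrative and will not equal the 0.78 shown above.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute a smoothed IDF weight for every term in a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(tokens, idf):
    """Weight each known term by its relative frequency times its IDF."""
    counts = Counter(t for t in tokens if t in idf)  # unseen terms are ignored
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Preprocessed FAQ questions; only the first comes from the walkthrough,
# the other two are hypothetical.
faqs = [
    ["reset", "account", "password"],
    ["delete", "account"],
    ["update", "email", "address"],
]
idf = fit_idf(faqs)
faq_vectors = [tfidf(doc, idf) for doc in faqs]

query = tfidf(["change", "password"], idf)
scores = [cosine(query, vec) for vec in faq_vectors]
best = max(range(len(faqs)), key=scores.__getitem__)
print(best, round(scores[best], 2))
```

Because "change" never appears in the FAQ corpus, only "password" contributes to the query vector, which is why the password FAQ wins and the other two score zero.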
Edit app.py to adjust the confidence threshold:
SIMILARITY_THRESHOLD = 0.2 # Range: 0.0 to 1.0
- Lower threshold (0.1): More responses but less accurate
- Higher threshold (0.5): Fewer responses but more confident
- Default (0.2): Balanced approach
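To make the effect of the threshold concrete, here is a minimal sketch of the decision it controls. The function name choose_response, the fallback wording, and the dictionary keys are assumptions for illustration; the actual logic and response shape in app.py may differ.

```python
SIMILARITY_THRESHOLD = 0.2  # same default value as described above


def choose_response(answer, score, threshold=SIMILARITY_THRESHOLD):
    """Return the FAQ answer if the score clears the threshold, else a fallback."""
    if score >= threshold:
        return {"answer": answer, "confidence": round(score, 2), "matched": True}
    # Hypothetical fallback text for low-confidence matches.
    return {
        "answer": "I'm not confident I have an answer to that.",
        "confidence": round(score, 2),
        "matched": False,
    }


print(choose_response("Pointers store memory addresses.", 0.35))
print(choose_response("Pointers store memory addresses.", 0.35, threshold=0.5))
```

The same 0.35 score is accepted at the default threshold and rejected at 0.5, which is the trade-off the three bullet points describe.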
Handles all NLP text preprocessing:
- TextPreprocessor.clean_text() - Remove special characters
- TextPreprocessor.tokenize() - Split into words
- TextPreprocessor.remove_stopwords() - Remove common words
- TextPreprocessor.lemmatize() - Convert to base forms
- TextPreprocessor.preprocess() - Complete pipeline
Implements TF-IDF and similarity matching:
- TFIDFVectorizer.fit() - Learn from FAQ questions
- TFIDFVectorizer.transform() - Convert query to vector
- SimilarityMatcher.find_best_match() - Find best FAQ
- SimilarityMatcher.get_answer() - Return answer with confidence
Manages API keys and settings securely:
- Loads environment variables from .env file using python-dotenv
- Never exposes credentials in code
- validate_config() - Checks if all required variables are set
- Provides centralized config access for entire app
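A check like the one validate_config() is described as performing might look roughly as follows. This is a sketch under assumptions, not the contents of config.py: the helper name missing_settings and the REQUIRED_VARS tuple are hypothetical, and it reads os.environ directly, whereas the project first populates the environment from .env with python-dotenv's load_dotenv().

```python
import os

# Hypothetical list of required settings; the README only names GOOGLE_API_KEY.
REQUIRED_VARS = ("GOOGLE_API_KEY",)


def missing_settings(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_settings()
if missing:
    # Report names only; never print the secret values themselves.
    print("Missing configuration:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Accepting the environment as a parameter keeps the check easy to test without touching real credentials.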
Integrates with Google APIs to enhance FAQ answers:
- GoogleAPIClient - Secure client for Google Knowledge Graph API
- search_knowledge_graph(query) - Query Google Knowledge Graph
- extract_knowledge_info(query) - Extract additional information
- validate_api_key() - Test if API key is valid
- get_enhanced_answer(answer, question) - Enhance FAQ with additional info
Flask server with REST API and API integration:
- Loads configuration securely via config.py
- Uses SimilarityMatcher for core FAQ matching
- Optionally enhances answers using api_client.py
- POST /api/query - Query endpoint (with optional enhancement)
- GET /api/faqs - Get all FAQs
- GET /api/health - Health check with API status
- GET /api/config/check - Configuration check (debug only)
- GET / - Serve web UI
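For scripted access to the query endpoint, a client could be sketched with the standard library as shown here. The JSON field name "query" and the helper names build_query_request and ask are assumptions, since this README does not document the request body; check app.py for the field the server actually expects.

```python
import json
import urllib.request


def build_query_request(question, base_url="http://localhost:5000"):
    """Build a POST request for /api/query (the "query" field name is assumed)."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send the request and decode the JSON reply; needs the server running."""
    with urllib.request.urlopen(build_query_request(question), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server started via python app.py, calling ask("What is deep learning?") would return the decoded JSON response as a Python dictionary.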
Install dependencies again:
pip install -r requirements.txt
Edit app.py and change the port:
app.run(debug=True, host='0.0.0.0', port=5001) # Use 5001 instead
Ensure faq_dataset.json is in the same directory as app.py
Make sure the Flask server is running with python app.py
- Check that the .env file exists in the project root
- Verify GOOGLE_API_KEY is set in .env
- Make sure .env is listed in .gitignore so it stays private
- Restart Flask app: python app.py
- Test with: curl http://localhost:5000/api/health (in debug mode)
The app will auto-download NLTK data on first run. If it fails:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
The chatbot includes 25 sample FAQs covering:
AI/ML Topics:
- AI vs Machine Learning
- Supervised vs Unsupervised Learning
- Overfitting in ML
- Neural Networks
- Deep Learning vs ML
- CNNs and RNNs
Python:
- Python basics
- Lists vs Tuples
- Lambda functions
- List comprehension
- == vs is operators
Java:
- Garbage Collection
- String/StringBuilder/StringBuffer
- Access Modifiers
- Polymorphism
- Exception Handling
- Abstract Classes vs Interfaces
C:
- Pointers
- Dynamic Memory Allocation
- Structures vs Unions
- Function Pointers
- Stack vs Heap
- Linked Lists vs Arrays
- Binary Search Trees
- Query processing: ~100-500ms (depends on query length)
- Suitable for: 1,000 or more FAQs without significant slowdown
- Concurrent users: Limited by server resources
- Database support (PostgreSQL/MongoDB)
- User feedback loop to improve matching
- Advanced NLP (Word2Vec, BERT embeddings)
- Analytics dashboard
- Multi-language support
- Authentication and rate limiting
- Admin dashboard for FAQ management
- Live chat escalation
This project is provided as an educational resource for building FAQ chatbots.
For detailed information:
- Read the inline code comments in app.py, preprocessing.py, similarity.py
- Check API endpoint examples in this README
- Review the FAQ dataset format in faq_dataset.json
Author - Priyanshu Kumar
GitHub - @Priyanshu7439
LinkedIn - https://www.linkedin.com/in/priyanshu-kumar-8a51382b4/?skipRedirect=true
**Version:** 1.0
**Last Updated:** April 2026
**Technology:** Flask + NLTK + TF-IDF + Cosine Similarity