This project builds an NLP-based system that detects whether two questions are duplicates, using a hybrid machine learning approach.
The model combines:
- Custom engineered similarity features
- TF-IDF text representations
- Machine Learning classifiers
Online platforms like Quora and Stack Overflow contain many semantically duplicate questions. Detecting these duplicates improves search quality and reduces redundancy.
Engineered features:
- Question length
- Word count
- Common word count
- Total words
- Word share ratio
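The features listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function name `similarity_features`, the whitespace tokenization, and the exact definitions (length in characters, total words as the sum of unique words in each question, word share as common words divided by total words) are all assumptions.

```python
def similarity_features(q1: str, q2: str) -> dict:
    """Compute simple pairwise features for two question strings (illustrative only)."""
    # Assumption: lowercase whitespace tokens; a real pipeline may tokenize differently.
    words1 = set(q1.lower().split())
    words2 = set(q2.lower().split())

    common = len(words1 & words2)       # unique words appearing in both questions
    total = len(words1) + len(words2)   # unique words in q1 plus unique words in q2

    return {
        "q1_len": len(q1),              # question length in characters
        "q2_len": len(q2),
        "q1_words": len(q1.split()),    # word count
        "q2_words": len(q2.split()),
        "common_words": common,
        "total_words": total,
        # Guard against division by zero when both questions are empty.
        "word_share": round(common / total, 2) if total else 0.0,
    }


features = similarity_features(
    "How to learn machine learning?",
    "What is the best way to study machine learning?",
)
print(features)
```

Because punctuation stays attached to tokens here, "learning?" and "learning" would count as different words; stripping punctuation during preprocessing would likely change the counts.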
Text representation:
- TF-IDF Vectorization
Classifiers:
- Random Forest Classifier
- Support Vector Machine (SVM)
- XGBoost Classifier
Evaluation metrics:
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix
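For reference, the metrics above can all be derived from the four confusion-matrix counts. The sketch below assumes binary labels where 1 means "duplicate"; the function name `evaluate` and the toy labels are made up for illustration and are not the project's results.

```python
def evaluate(y_true, y_pred):
    """Return the confusion matrix and derived metrics for binary labels (1 = duplicate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, invented purely to exercise the function.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
results = evaluate(y_true, y_pred)
print(results)
```

Precision and recall are worth reporting alongside accuracy because duplicate and non-duplicate pairs are often imbalanced, in which case accuracy alone can be misleading.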
The best model achieved ~71% accuracy on the test dataset.
Pipeline: Preprocessing → Feature Engineering → TF-IDF → Model Training → Evaluation → Prediction
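One way this flow could be wired together with scikit-learn is sketched below. It is only an illustration under several assumptions: the toy question pairs are invented, `word_share` and `build_matrix` are hypothetical helpers, a single hand-made feature stands in for the full feature set, and Random Forest is used as one of the listed classifiers. The project's real preprocessing, data, and hyperparameters may differ.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training pairs, invented for illustration only (not the project's dataset).
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How do I learn Python?", "How do I cook rice?", 0),
    ("What is machine learning?", "Can you explain machine learning?", 1),
    ("What is machine learning?", "Where is Paris located?", 0),
]
q1 = [p[0] for p in pairs]
q2 = [p[1] for p in pairs]
y = np.array([p[2] for p in pairs])


def word_share(a: str, b: str) -> float:
    """Hypothetical helper: shared unique words divided by total unique words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    total = len(wa) + len(wb)
    return len(wa & wb) / total if total else 0.0


def build_matrix(q1, q2, vectorizer):
    """Stack the hand-made feature next to the TF-IDF vectors of both questions."""
    handmade = csr_matrix([[word_share(a, b)] for a, b in zip(q1, q2)])
    return hstack(
        [handmade, vectorizer.transform(q1), vectorizer.transform(q2)]
    ).tocsr()


# Fit one shared TF-IDF vocabulary over all questions, then train the classifier.
vectorizer = TfidfVectorizer().fit(q1 + q2)
X = build_matrix(q1, q2, vectorizer)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Prediction on a new pair reuses the same fitted vectorizer.
new_q1 = ["How to learn machine learning?"]
new_q2 = ["What is the best way to study machine learning?"]
prediction = model.predict(build_matrix(new_q1, new_q2, vectorizer))
print("Duplicate" if prediction[0] == 1 else "Not duplicate")
```

With only four training pairs the prediction is not meaningful; the point is the shape of the flow, in particular that the vectorizer is fitted once on training text and reused unchanged at prediction time.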
Example:
Input:
Q1: How to learn machine learning?
Q2: What is the best way to study machine learning?
Output: Duplicate Questions ✅
Possible future improvements:
- Semantic embeddings (BERT)
- Ensemble learning
- Deployment using Streamlit
Author: Aditya Sharma, B.Tech CSE | AI Enthusiast