This project implements a duplicate question classifier using the Quora Questions dataset.
This classifier can help Q&A websites like Stack Overflow and Quora automatically detect and merge duplicate question threads. By consolidating similar questions, users can find answers more efficiently without getting lost in multiple threads, improving the overall user experience and reducing redundancy.
This project demonstrates several approaches, leveraging both classical ML and DL techniques. In particular, it covers:
- Feature Engineering & Classical ML: Extracting custom text features (common words, stopwords, fuzzy metrics, etc.), generating Word2Vec embeddings, and training an XGBoost classifier on these engineered features.
- Deep Learning: Building and training LSTM and Transformer-based architectures in PyTorch for sequence modeling.
- Transfer Learning: Fine-tuning a pre-trained BERT model for robust semantic understanding.
I consulted the following research papers to identify the optimal architecture for the DL approach:
- Source: Quora duplicate questions dataset (`quora_questions.csv`)
- Columns: `id`, `qid1`, `qid2`, `question1`, `question2`, `is_duplicate`
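
As a rough illustration of the column layout, the sketch below reads question pairs with the standard `csv` module. The two sample rows and the `load_pairs` helper are hypothetical and stand in for the real file.

```python
import csv
import io

# Hypothetical two-row sample in the documented column layout;
# the real data lives in quora_questions.csv.
SAMPLE_CSV = """id,qid1,qid2,question1,question2,is_duplicate
0,1,2,How do I learn Python?,What is the best way to learn Python?,1
1,3,4,What is the capital of France?,How do I bake bread?,0
"""


def load_pairs(handle):
    """Yield (question1, question2, label) tuples from an open CSV handle."""
    for row in csv.DictReader(handle):
        yield row["question1"], row["question2"], int(row["is_duplicate"])


pairs = list(load_pairs(io.StringIO(SAMPLE_CSV)))
# Share of pairs labelled as duplicates in the sample.
duplicate_rate = sum(label for _, _, label in pairs) / len(pairs)
print(len(pairs), duplicate_rate)
```

For the real dataset, the same helper could be given a file handle opened with `newline=""` and `encoding="utf-8"` instead of the in-memory sample.
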
- Preprocessing: HTML tag removal, punctuation removal, URL removal, stopword filtering, tokenization.
- Feature Extraction: Custom features (common words, stopwords, fuzzy metrics, etc.) and Word2Vec embeddings.
- Model: XGBClassifier
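
The preprocessing and custom-feature steps above might look roughly like the sketch below. It is not the notebook's code: the stopword list is a tiny illustrative one, `difflib.SequenceMatcher` stands in for whatever fuzzy-matching library the project uses, and the function and feature names are assumptions. Word2Vec embeddings and the `XGBClassifier` training step are left out.

```python
import re
import string
from difflib import SequenceMatcher

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "do", "does", "i", "to", "of", "in", "what", "how"}


def preprocess(text):
    """Lowercase, strip HTML tags, URLs and punctuation, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def pair_features(q1, q2):
    """Return a dict of simple overlap and similarity features for one question pair."""
    t1, t2 = preprocess(q1), preprocess(q2)
    w1 = {t for t in t1 if t not in STOPWORDS}
    w2 = {t for t in t2 if t not in STOPWORDS}
    s1 = {t for t in t1 if t in STOPWORDS}
    s2 = {t for t in t2 if t in STOPWORDS}
    common_words = len(w1 & w2)
    eps = 1e-9  # avoids division by zero when a question has no content words
    return {
        "common_words": common_words,
        "common_stopwords": len(s1 & s2),
        "word_share": common_words / (min(len(w1), len(w2)) + eps),
        "len_diff": abs(len(t1) - len(t2)),
        # Stand-in for a fuzzy ratio: character-level similarity in [0, 1].
        "fuzzy_ratio": SequenceMatcher(None, " ".join(t1), " ".join(t2)).ratio(),
    }


features = pair_features(
    "How do I learn <b>Python</b>?",
    "How to learn Python? See https://example.com",
)
print(features)
```

A feature dict like this, concatenated with embedding vectors for each question, is the kind of input a gradient-boosted classifier can be trained on.
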
- Vocabulary: Built from tokenized corpus.
- Data Preparation: Questions converted to index sequences, padded for batching.
- Model: LSTM and Transformer-based classifiers (`MyNN`)
- Training: BCEWithLogitsLoss, Adam optimizer
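
The vocabulary and data-preparation steps can be sketched in plain Python as follows. The special tokens `<pad>` and `<unk>`, the `min_freq` parameter, and the helper names are assumptions for illustration; the notebook's own implementation may differ.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens


def build_vocab(tokenized_questions, min_freq=1):
    """Map each token seen at least min_freq times to an integer id."""
    counts = Counter(tok for question in tokenized_questions for tok in question)
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in sorted(counts.items()):  # sorted for deterministic ids
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Convert tokens to ids, falling back to the unknown-token id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


corpus = [["how", "to", "learn", "python"], ["learn", "java"]]
vocab = build_vocab(corpus)
batch = pad_batch([encode(question, vocab) for question in corpus])
print(batch)
```

Padded id lists of equal length can then be converted to tensors and passed through an embedding layer before the LSTM or Transformer encoder.
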
- Tokenizer: `bert-base-uncased`
- Model: `BertForSequenceClassification` (HuggingFace Transformers)
- Training: Trainer API, custom metrics (accuracy, F1, precision, recall)
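
The custom metrics listed above can be computed from raw logits and labels as in this dependency-free sketch. The function name and the sample values are hypothetical. With the Trainer API, equivalent logic would typically sit inside a `compute_metrics` callable that receives the predictions and label ids.

```python
def compute_binary_metrics(logits, labels):
    """Compute accuracy, precision, recall and F1 for the positive (duplicate) class.

    logits: list of [score_not_duplicate, score_duplicate] rows.
    labels: list of 0/1 ground-truth labels.
    """
    # Argmax over each row gives the predicted class.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical logits for four question pairs.
sample_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
sample_labels = [1, 0, 0, 1]
metrics = compute_binary_metrics(sample_logits, sample_labels)
print(metrics)
```
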
| Model | Training Rows | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| XGBoost (Word2Vec + Features) | 40k | 0.7957 | 0.7217 | 0.7300 | 0.7258 |
| PyTorch Model | ~400k | 0.7578 | 0.7339 | 0.5377 | 0.6207 |
| BERT Fine-tuned (Best so far ✅) | 200k | 0.8821 | 0.8184 | 0.8728 | 0.8447 |
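
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so each reported F1-Score can be recomputed from the other two columns. The short sketch below does this with the values copied from the table; the dictionary keys are shortened labels, not names used in the project.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "XGBoost": (0.7217, 0.7300, 0.7258),
    "PyTorch": (0.7339, 0.5377, 0.6207),
    "BERT": (0.8184, 0.8728, 0.8447),
}

recomputed = {name: round(f1_from(p, r), 4) for name, (p, r, _) in reported.items()}
for name, (_, _, f1) in reported.items():
    print(name, "reported:", f1, "recomputed:", recomputed[name])
```
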
A comprehensive test suite is available to evaluate the model on various question pairs, including edge cases and tricky scenarios.
Run `python tests.py`. This will:
- Load the latest trained BERT checkpoint
- Run 30 diverse test cases from `testing/test_cases.json`
- Generate a detailed report in `testing/outputs.txt`
The test suite covers:
- Clear Duplicates: Paraphrased questions, synonyms, same intent
- Clear Non-Duplicates: Unrelated topics, different domains
- Tricky Cases: Negation, similar words with different intent, opposite questions, context variations
- Edge Cases: Identical questions, single word differences, number variations
- Semantic Cases: Technical vs layman terms, opposite actions
- Complex Cases: Comparison questions, multi-concept questions, process-oriented
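
A test runner of this kind could be organised roughly as below. This is a sketch, not the contents of `tests.py`: it assumes each case carries `category`, `q1`, `q2`, and `expected` fields, and that `expected` is either `"Duplicate"` or `"Not Duplicate"`. The `exact_match_predictor` is a hypothetical stand-in for the BERT checkpoint.

```python
import json
from collections import defaultdict


def run_cases(cases, predict_fn):
    """Return per-category [passed, total] counts and one report line per case."""
    summary = defaultdict(lambda: [0, 0])
    lines = []
    for case in cases:
        predicted = predict_fn(case["q1"], case["q2"])
        ok = predicted == case["expected"]
        summary[case["category"]][0] += int(ok)
        summary[case["category"]][1] += 1
        status = "PASS" if ok else "FAIL"
        lines.append(f"[{status}] {case['category']}: {case['q1']!r} vs {case['q2']!r} -> {predicted}")
    return dict(summary), lines


def exact_match_predictor(q1, q2):
    """Hypothetical baseline: duplicate only if the lowercased texts are identical."""
    return "Duplicate" if q1.strip().lower() == q2.strip().lower() else "Not Duplicate"


cases = json.loads("""
[
  {"category": "Edge Cases", "q1": "What is AI?", "q2": "what is ai?", "expected": "Duplicate"},
  {"category": "Clear Duplicates", "q1": "How do I learn Python?",
   "q2": "What is the best way to learn Python?", "expected": "Duplicate"}
]
""")

summary, report = run_cases(cases, exact_match_predictor)
print("\n".join(report))
print(summary)
```

Swapping `exact_match_predictor` for a function that wraps the trained model leaves the rest of the loop unchanged, which keeps the per-category pass counts comparable across models.
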
Add your own test cases by editing `testing/test_cases.json`:

```json
{
  "category": "Your Category",
  "q1": "First question",
  "q2": "Second question",
  "expected": "Duplicate/Not Duplicate"
}
```

- See `requirements.txt` for project dependencies.
- `duplicate_classifier.ipynb`: Main notebook with all code and experiments
- `csvs/quora_questions.csv`: Dataset (link provided in Dataset Link)
- `models/`: Saved models (Word2Vec, XGBoost, BERT checkpoints)
Note: Model files are not uploaded due to large size. Please train and save models locally as needed.
Contributions are welcome!
If you have suggestions, improvements, or new models to add, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.