akashsinghiitr/Duplicate-Question-Classifier

Duplicate Question Classifier

This project implements a duplicate question classifier using the Quora Questions dataset.

Use Case

This classifier can help Q&A websites like Stack Overflow and Quora automatically detect and merge duplicate question threads. By consolidating similar questions, users can find answers more efficiently without getting lost in multiple threads, improving the overall user experience and reducing redundancy.

Summary

This project demonstrates three approaches, spanning classical machine learning (ML) and deep learning (DL):

  • Feature Engineering & Classical ML: Extracting custom text features (common words, stopwords, fuzzy metrics, etc.), generating Word2Vec embeddings, and training an XGBoost classifier on these engineered features.
  • Deep Learning: Building and training LSTM and Transformer-based architectures in PyTorch for sequence modeling.
  • Transfer Learning: Fine-tuning a pre-trained BERT model for robust semantic understanding.

Credit

I drew on the following research papers to choose the architecture for the DL approach:

Dataset

  • Source: Quora duplicate questions dataset (quora_questions.csv)
  • Columns: id, qid1, qid2, question1, question2, is_duplicate
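As a rough illustration of the column layout above, the sketch below reads question pairs with Python's standard csv module and checks that the expected columns are present. The function name load_pairs and the two sample rows are made up for this example; the notebook may well load the file differently (for instance with pandas).

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "qid1", "qid2", "question1", "question2", "is_duplicate"]

def load_pairs(handle):
    """Read (question1, question2, label) tuples from an open CSV handle."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    pairs = []
    for row in reader:
        pairs.append((row["question1"], row["question2"], int(row["is_duplicate"])))
    return pairs

# Tiny in-memory stand-in for csvs/quora_questions.csv (rows are invented).
sample_csv = io.StringIO(
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    "0,1,2,How do I learn Python?,What is the best way to learn Python?,1\n"
    "1,3,4,What is the capital of France?,How do I bake bread?,0\n"
)
sample_pairs = load_pairs(sample_csv)
```

For the real file, the same function could be called on open("csvs/quora_questions.csv", newline="", encoding="utf-8").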

Approaches

1. Feature Engineering + XGBoost

  • Preprocessing: HTML tag removal, punctuation removal, URL removal, stopword filtering, tokenization.
  • Feature Extraction: Custom features (common words, stopwords, fuzzy metrics, etc.) and Word2Vec embeddings.
  • Model: XGBClassifier
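The preprocessing and feature-extraction steps above can be sketched roughly as follows. This is a minimal standard-library illustration, not the notebook's code: the stopword list is a small made-up sample, the feature names are hypothetical, and difflib.SequenceMatcher stands in for whichever fuzzy-matching library the project actually uses. Word2Vec embeddings and the XGBClassifier itself are left out.

```python
import re
from difflib import SequenceMatcher

# Small illustrative stopword list (assumption; the real list is not shown here).
STOPWORDS = {"a", "an", "the", "is", "are", "do", "does", "i", "to", "of", "in", "what", "how"}

def preprocess(text):
    """Lowercase, strip HTML tags, URLs and punctuation, then tokenize on whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation
    return text.split()

def pair_features(q1, q2):
    """Compute a few hand-crafted similarity features for one question pair."""
    t1, t2 = preprocess(q1), preprocess(q2)
    w1 = {t for t in t1 if t not in STOPWORDS}   # content words
    w2 = {t for t in t2 if t not in STOPWORDS}
    s1 = {t for t in t1 if t in STOPWORDS}       # stopwords
    s2 = {t for t in t2 if t in STOPWORDS}
    common_words = len(w1 & w2)
    total_words = len(w1) + len(w2)
    return {
        "common_words": common_words,
        "common_stopwords": len(s1 & s2),
        "word_share": common_words / total_words if total_words else 0.0,
        "len_diff": abs(len(t1) - len(t2)),
        # Character-level similarity in [0, 1]; a stand-in for a fuzzy metric.
        "fuzzy_ratio": SequenceMatcher(None, " ".join(t1), " ".join(t2)).ratio(),
    }

features = pair_features(
    "How do I learn <b>Python</b> quickly?",
    "What is the fastest way to learn Python? https://example.com",
)
```

Each resulting dictionary would become one row of the feature matrix, typically concatenated with the pair's embedding vectors before training the classifier.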

2. PyTorch Deep Learning

  • Vocabulary: Built from tokenized corpus.
  • Data Preparation: Questions converted to index sequences, padded for batching.
  • Model: LSTM and Transformer-based classifiers (MyNN)
  • Training: BCEWithLogitsLoss, Adam optimizer
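The vocabulary and data-preparation steps above might look something like the sketch below. The special tokens, function names, and min_freq parameter are assumptions made for illustration; the notebook's actual implementation is not shown in this README. Plain Python lists are used so the example runs without PyTorch; in practice the padded batch would be converted to a tensor before being passed to the model.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def build_vocab(tokenized_corpus, min_freq=1):
    """Map each token seen at least min_freq times to an integer index."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Convert tokens to indices, sending unknown tokens to the UNK index."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

corpus = [["how", "to", "learn", "python"], ["learn", "python", "fast"]]
vocab = build_vocab(corpus)
batch = pad_batch([encode(sent, vocab) for sent in corpus])
unseen = encode(["learn", "rust"], vocab)  # "rust" is not in the vocabulary
```

Reserving index 0 for padding lets an embedding layer ignore padded positions, for example via the padding_idx argument of torch.nn.Embedding.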

3. BERT Fine-Tuning (most accurate so far ✅)

  • Tokenizer: bert-base-uncased
  • Model: BertForSequenceClassification (HuggingFace Transformers)
  • Training: Trainer API, custom metrics (accuracy, F1, precision, recall)
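To make the custom metrics above concrete, here is a dependency-free sketch that derives accuracy, precision, recall, and F1 from two-class logits. The function name binary_metrics and the sample logits are invented for this example. With the Trainer API, similar logic would normally live in a compute_metrics callback that receives the predictions and label ids; the notebook's own callback is not reproduced here.

```python
def binary_metrics(logits, labels):
    """Compute accuracy, precision, recall and F1 for the positive class (label 1)."""
    # Predicted class = index of the largest logit in each row (argmax).
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up logits for four question pairs: [not-duplicate score, duplicate score].
example_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
example_labels = [1, 0, 0, 1]
metrics = binary_metrics(example_logits, example_labels)
```

The zero-division guards matter early in training, when a model may predict only one class.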

Model Performance

| Model | Training Rows | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| XGBoost (Word2Vec + Features) | 40k | 0.7957 | 0.7217 | 0.7300 | 0.7258 |
| PyTorch Model | ~400k | 0.7578 | 0.7339 | 0.5377 | 0.6207 |
| BERT Fine-tuned (Best so far ✅) | 200k | 0.8821 | 0.8184 | 0.8728 | 0.8447 |
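As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall, so each reported F1-Score can be recomputed from the other two columns. The short sketch below does that with the reported figures; the dictionary keys are shortened labels chosen for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the performance table.
reported = {
    "XGBoost": (0.7217, 0.7300, 0.7258),
    "PyTorch": (0.7339, 0.5377, 0.6207),
    "BERT": (0.8184, 0.8728, 0.8447),
}

recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1_reported) in reported.items():
    print(f"{name}: reported {f1_reported:.4f}, recomputed {recomputed[name]:.4f}")
```

The recomputed values agree with the reported ones to within rounding. The PyTorch model's lower F1 comes mainly from its recall of 0.5377, since its precision is comparable to the other two models.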

Custom Testing (For the BERT approach)

A comprehensive test suite is available to evaluate the model on various question pairs, including edge cases and tricky scenarios.

Running Tests

python tests.py

This will:

  • Load the latest trained BERT checkpoint
  • Run 30 diverse test cases from testing/test_cases.json
  • Generate a detailed report in testing/outputs.txt

Test Categories

The test suite covers:

  • Clear Duplicates: Paraphrased questions, synonyms, same intent
  • Clear Non-Duplicates: Unrelated topics, different domains
  • Tricky Cases: Negation, similar words with different intent, opposite questions, context variations
  • Edge Cases: Identical questions, single word differences, number variations
  • Semantic Cases: Technical vs layman terms, opposite actions
  • Complex Cases: Comparison questions, multi-concept questions, process-oriented

Custom Test Cases

Add your own test cases by editing testing/test_cases.json:

{
  "category": "Your Category",
  "q1": "First question",
  "q2": "Second question",
  "expected": "Duplicate/Not Duplicate"
}
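Before adding entries, it can help to validate their shape. The sketch below assumes, without confirmation from this README, that testing/test_cases.json holds a list of objects like the one above and that "expected" is exactly "Duplicate" or "Not Duplicate". The names validate_case, run_cases, and exact_match_predict are hypothetical, and the stub predictor only stands in for the BERT checkpoint that tests.py loads.

```python
import json

REQUIRED_KEYS = {"category", "q1", "q2", "expected"}
VALID_LABELS = {"Duplicate", "Not Duplicate"}  # assumed label strings

def validate_case(case):
    """Raise ValueError if a test case is missing keys or has an unknown label."""
    missing = REQUIRED_KEYS - case.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if case["expected"] not in VALID_LABELS:
        raise ValueError(f"unknown expected label: {case['expected']!r}")

def run_cases(cases, predict):
    """Run predict(q1, q2) on every case and record whether it matched."""
    results = []
    for case in cases:
        validate_case(case)
        predicted = predict(case["q1"], case["q2"])
        results.append({
            "category": case["category"],
            "expected": case["expected"],
            "predicted": predicted,
            "passed": predicted == case["expected"],
        })
    return results

def exact_match_predict(q1, q2):
    """Stub predictor (not the real model): duplicates only if the text is identical."""
    same = q1.strip().lower() == q2.strip().lower()
    return "Duplicate" if same else "Not Duplicate"

cases = json.loads("""
[
  {"category": "Edge Cases", "q1": "What is AI?", "q2": "what is ai?", "expected": "Duplicate"},
  {"category": "Clear Non-Duplicates", "q1": "What is AI?", "q2": "How do I bake bread?", "expected": "Not Duplicate"}
]
""")
results = run_cases(cases, exact_match_predict)
```

Swapping exact_match_predict for a function that wraps the fine-tuned model would give a per-category pass/fail summary for new cases.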

Requirements

  • See requirements.txt for project dependencies.

Structure

  • duplicate_classifier.ipynb: Main notebook with all code and experiments
  • csvs/quora_questions.csv: Dataset (link provided in Dataset Link)
  • models/: Saved models (Word2Vec, XGBoost, BERT checkpoints)
    Note: Model files are not uploaded due to large size. Please train and save models locally as needed.

Dataset Link

Contributions

Contributions are welcome!
If you have suggestions, improvements, or new models to add, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

Multi-model duplicate question classifier for Quora dataset. Implements three architectures: (1) XGBoost with Word2Vec + fuzzy matching features, (2) PyTorch BiLSTM with early stopping, and (3) Fine-tuned BERT for sequence classification. Complete pipeline from preprocessing to inference.
