Duplicate Question Detection System

Overview

This project builds an NLP-based system that classifies a pair of questions as duplicate or not, using a hybrid of hand-engineered features and text vectorization.

The model combines:

Custom engineered similarity features
TF-IDF text representations
Machine Learning classifiers

Problem Statement

Online platforms like Quora and Stack Overflow contain many semantically duplicate questions: different wordings of the same underlying question. Detecting them improves search quality and reduces redundancy.

Features Used

Custom Features

Question length
Word count
Common word count
Total words
Word share ratio
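
The custom features above can be sketched as a small function. This is an illustration rather than the project's actual code: the function name `custom_features`, the dictionary keys, the lowercase whitespace tokenization, and the definition of word share as common words divided by total words are all assumptions.

```python
def custom_features(q1: str, q2: str) -> dict:
    """Compute simple length and word-overlap features for a question pair."""
    # Assumption: words are lowercase, whitespace-separated tokens,
    # and overlap is measured on the set of unique words per question.
    words1 = set(q1.lower().split())
    words2 = set(q2.lower().split())
    common = len(words1 & words2)
    total = len(words1) + len(words2)
    return {
        "q1_len": len(q1),                 # question length in characters
        "q2_len": len(q2),
        "q1_num_words": len(q1.split()),   # word count
        "q2_num_words": len(q2.split()),
        "word_common": common,             # common word count
        "word_total": total,               # total (unique) words in both
        "word_share": common / total if total else 0.0,
    }


features = custom_features(
    "How to learn machine learning?",
    "What is the best way to study machine learning?",
)
```

Because punctuation is not stripped here, "learning?" only matches "learning?"; cleaning the text first would make the overlap counts more robust.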

Text Features

TF-IDF Vectorization
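
To make the idea of TF-IDF concrete, here is a from-scratch sketch using only the standard library. It is not the project's implementation, which may well rely on a library vectorizer; the helper names `tfidf_vectors` and `cosine`, the whitespace tokenization, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, so a term found in every document keeps a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)  # raw term frequency; missing terms count as 0
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab, vectors = tfidf_vectors(["how to learn python", "how to learn java"])
similarity = cosine(vectors[0], vectors[1])
```

Terms shared by both questions ("how", "to", "learn") receive a lower IDF weight than the distinguishing terms ("python", "java"), which is what lets TF-IDF separate questions that differ only in a key word.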

Models Implemented

Random Forest Classifier
Support Vector Machine (SVM)
XGBoost Classifier

Evaluation Metrics

Accuracy
Precision
Recall
F1 Score
Confusion Matrix
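
For reference, all five metrics can be derived from the four confusion-matrix counts. The sketch below assumes binary labels where 1 means "duplicate"; the function name `classification_metrics` and the toy labels are made up for illustration and are not results from this project.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics; label 1 means 'duplicate'."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true labels, columns are predicted labels.
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy labels for illustration only.
metrics = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Reporting precision and recall alongside accuracy matters here because duplicate pairs are often the minority class, so accuracy alone can look acceptable while many duplicates are missed.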

Results

The best model achieved roughly 71% accuracy on the test dataset.

Project Pipeline

Preprocessing → Feature Engineering → TF-IDF → Model Training → Evaluation → Prediction
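
The first stage of the pipeline, preprocessing, might look like the sketch below. The exact cleaning steps used in this project are not documented here, so the choices shown (lowercasing, replacing punctuation with spaces, collapsing whitespace) and the function name `preprocess` are assumptions.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    # Assumption: keep only ASCII letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess("  How to learn Machine-Learning?? ")
```

Cleaning both questions the same way before feature engineering and vectorization keeps superficial differences such as casing or trailing question marks from counting as distinct words.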

Example Prediction

Input:
Q1: How to learn machine learning?
Q2: What is the best way to study machine learning?

Output: Duplicate Questions ✅

Future Improvements

Semantic embeddings (BERT)
Ensemble learning
Deployment using Streamlit

Author

Aditya Sharma
B.Tech CSE | AI Enthusiast
