AdityaSharma2007/Duplicate-Question-Detection

Duplicate Question Detection System

Overview

This project builds an NLP-based system that predicts whether two questions are duplicates of each other, using a hybrid machine learning approach.

The model combines:

  • Custom engineered similarity features
  • TF-IDF text representations
  • Machine Learning classifiers

Problem Statement

Online platforms such as Quora and Stack Overflow contain many questions that are worded differently but ask the same thing. Detecting these semantic duplicates improves search quality and reduces redundancy.


Features Used

Custom Features

  • Question length
  • Word count
  • Common word count
  • Total words
  • Word share ratio
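The custom features listed above can be sketched in a few lines of plain Python. The README does not define them precisely, so the definitions here are assumptions: length is counted in characters, words are split on whitespace after lowercasing, "common word count" is the number of distinct words shared by both questions, "total words" is the sum of distinct words in each question, and the word share ratio is common divided by total. The function name `custom_features` and the dictionary keys are hypothetical.

```python
def custom_features(q1: str, q2: str) -> dict:
    """Compute hand-engineered similarity features for one question pair.

    The exact definitions are assumptions; the README only names the features.
    """
    # Lowercase and split on whitespace; a real pipeline may tokenize differently.
    words1 = q1.lower().split()
    words2 = q2.lower().split()
    set1, set2 = set(words1), set(words2)

    common = len(set1 & set2)       # distinct words present in both questions
    total = len(set1) + len(set2)   # distinct words in q1 plus distinct words in q2

    return {
        "q1_len": len(q1),                  # question length in characters
        "q2_len": len(q2),
        "q1_num_words": len(words1),        # word count
        "q2_num_words": len(words2),
        "word_common": common,
        "word_total": total,
        "word_share": common / total if total else 0.0,
    }


features = custom_features(
    "How to learn machine learning?",
    "What is the best way to study machine learning?",
)
print(features)
```

Because punctuation is not stripped in this sketch, "learning?" counts as a word of its own; cleaning the text during preprocessing would change the counts.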

Text Features

  • TF-IDF Vectorization
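To illustrate what TF-IDF vectorization produces, here is a minimal pure-Python sketch of the textbook weighting, term frequency times log(N / document frequency). It is not the project's code: the README does not say which library is used, and libraries such as scikit-learn apply a smoothed variant of the formula by default. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, one TF-IDF row per document) using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [
            (counts[word] / len(tokens)) * math.log(n_docs / doc_freq[word])
            if tokens else 0.0
            for word in vocabulary
        ]
        rows.append(row)
    return vocabulary, rows


vocab, rows = tfidf_vectors([
    "how to learn machine learning",
    "what is the best way to study machine learning",
])
for word, weight in zip(vocab, rows[0]):
    print(f"{word:10s} {weight:.3f}")
```

With only two documents, every word that appears in both gets a weight of zero, since log(2 / 2) = 0. In practice the vocabulary and document frequencies would be fitted on the full set of training questions, so shared words keep a nonzero weight unless they occur everywhere.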

Models Implemented

  • Random Forest Classifier
  • Support Vector Machine (SVM)
  • XGBoost Classifier

Evaluation Metrics

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • Confusion Matrix
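As a reminder of how these metrics relate to one another, the sketch below derives all of them from the four confusion-matrix counts for a binary duplicate / not-duplicate task. The labels are made-up toy data, not project results, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    """Compute accuracy, precision, recall, F1 and the confusion matrix (1 = duplicate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # duplicates correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-duplicates correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-duplicates wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # duplicates that were missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]].
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can mislead when duplicates are much rarer than non-duplicates, which is why precision, recall and F1 are reported alongside it.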

Results

The best model achieved ~71% accuracy on the test dataset.


Project Pipeline

Preprocessing → Feature Engineering → TF-IDF → Model Training → Evaluation → Prediction
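The pipeline above might be wired together roughly as follows. This is a sketch under stated assumptions, not the repository's code: it assumes scikit-learn, SciPy and NumPy are available, uses a Random Forest as the classifier, trains on four made-up question pairs, and skips the evaluation step because the toy data is too small to split. The helper names `preprocess`, `pair_features` and `build_matrix` are hypothetical.

```python
import re

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: (question1, question2, is_duplicate).
PAIRS = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How can I lose weight?", "What is the capital of France?", 0),
    ("How to cook rice?", "How do I cook rice?", 1),
    ("Who wrote Hamlet?", "How do I install Linux?", 0),
]


def preprocess(text):
    """Preprocessing: lowercase and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).strip()


def pair_features(q1, q2):
    """Feature engineering: lengths, word counts and word-overlap features."""
    w1, w2 = q1.split(), q2.split()
    common = len(set(w1) & set(w2))
    total = len(set(w1)) + len(set(w2))
    share = common / total if total else 0.0
    return [len(q1), len(q2), len(w1), len(w2), common, total, share]


def build_matrix(q1_list, q2_list, vectorizer):
    """Stack the custom features next to the TF-IDF vectors of both questions."""
    custom = csr_matrix(
        np.array([pair_features(a, b) for a, b in zip(q1_list, q2_list)], dtype=float)
    )
    return hstack(
        [custom, vectorizer.transform(q1_list), vectorizer.transform(q2_list)]
    ).tocsr()


q1_train = [preprocess(q1) for q1, _, _ in PAIRS]
q2_train = [preprocess(q2) for _, q2, _ in PAIRS]
y_train = [label for _, _, label in PAIRS]

# TF-IDF: fit one shared vocabulary on every question seen in training.
vectorizer = TfidfVectorizer().fit(q1_train + q2_train)
X_train = build_matrix(q1_train, q2_train, vectorizer)

# Model training.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Prediction on a new pair, reusing the fitted vectorizer.
new_q1 = preprocess("How do I learn Python quickly?")
new_q2 = preprocess("What is the best way to learn Python?")
X_new = build_matrix([new_q1], [new_q2], vectorizer)
prediction = int(model.predict(X_new)[0])
print("Duplicate" if prediction == 1 else "Not duplicate")
```

Fitting the vectorizer once and reusing it at prediction time keeps the feature columns aligned between training and inference. With a real dataset, a held-out test split and the evaluation metrics would sit between training and prediction.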


Example Prediction

Input:

  • Q1: How to learn machine learning?
  • Q2: What is the best way to study machine learning?

Output: Duplicate Questions ✅


Future Improvements

  • Semantic embeddings (BERT)
  • Ensemble learning
  • Deployment using Streamlit

Author

Aditya Sharma
B.Tech CSE | AI Enthusiast
