A machine learning model that classifies SMS messages as spam or not spam, deployed as an interactive web app using Streamlit.
- Source: SMS Spam Collection Dataset — Kaggle
- 5,572 labeled SMS messages (ham / spam)
- Lowercasing, punctuation and stopword removal
- Tokenization using `nltk`
- Stemming with `PorterStemmer`
- Text vectorized using TF-IDF (`TfidfVectorizer`)
- Random Forest Classifier
  - `n_estimators = 200`
  - `random_state = 2`
- Tuned splitting criteria for better generalization
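The preprocessing and modeling steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `transform_text`, the toy messages, and the label encoding (1 = spam, 0 = ham) are assumptions, and a small hard-coded stopword set with `wordpunct_tokenize` stands in for the NLTK stopword corpus and tokenizer so the example runs without downloading NLTK data. The tuned splitting criteria are not specified here, so scikit-learn defaults are used.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: a tiny stand-in stopword list so the sketch needs no NLTK downloads.
STOPWORDS = {"a", "an", "the", "is", "it", "to", "you", "your", "now", "at", "for", "are", "we", "i"}

stemmer = PorterStemmer()


def transform_text(text: str) -> str:
    """Lowercase, tokenize, drop punctuation and stopwords, then stem."""
    tokens = wordpunct_tokenize(text.lower())
    kept = [t for t in tokens if t.isalnum() and t not in STOPWORDS]
    return " ".join(stemmer.stem(t) for t in kept)


# Hypothetical toy data for illustration only (1 = spam, 0 = ham).
messages = [
    "WINNER!! Claim your free prize now",
    "Free entry to win cash, call today",
    "Urgent! Your account won a cash reward",
    "Are we still meeting for lunch?",
    "I will call you when I get home",
    "Can you send the notes from class",
]
labels = [1, 1, 1, 0, 0, 0]

# Vectorize the cleaned text with TF-IDF.
cleaned = [transform_text(m) for m in messages]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)

# Hyperparameters taken from the list above; everything else is left at defaults.
model = RandomForestClassifier(n_estimators=200, random_state=2)
model.fit(X, labels)

# New messages must pass through the same preprocessing and the fitted vectorizer.
new_message = transform_text("Claim your FREE prize now!!!")
prediction = model.predict(vectorizer.transform([new_message]))
print(new_message, prediction)
```

Note that the vectorizer is fitted once on the training text and only `transform` is called on new messages, which is why the fitted vectorizer has to be saved alongside the model.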
| Metric | Score |
|---|---|
| Accuracy | ~97% |
| Precision | 100% |
| False Positives | 0 |
Confusion Matrix:
| | Predicted: Not Spam | Predicted: Spam |
|---|---|---|
| Actually Not Spam | 896 | 0 |
| Actually Spam | 29 | 109 |
These scores are measured on the test set (1,034 messages). Real-world results may vary due to class imbalance and evolving spam patterns.
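As a quick check, the reported metrics can be recomputed from the confusion matrix counts. The sketch below uses only the four numbers from the matrix. Recall is not reported in the metrics table and is derived here purely from those counts.

```python
# Counts copied from the confusion matrix.
tn, fp = 896, 0    # actually not spam: predicted not spam / predicted spam
fn, tp = 29, 109   # actually spam:     predicted not spam / predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all messages classified correctly
precision = tp / (tp + fp)     # share of predicted spam that really is spam
recall = tp / (tp + fn)        # share of real spam that was caught (derived, not in the table)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With zero false positives, no legitimate message is flagged as spam, while the 29 false negatives are spam messages that the model lets through.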
| Tool | Purpose |
|---|---|
| `pandas`, `numpy` | Data manipulation |
| `nltk` | NLP preprocessing |
| `scikit-learn` | Modeling & evaluation |
| `matplotlib`, `seaborn` | EDA & visualization |
| `streamlit` | Web app deployment |
| `pickle` | Model serialization |
git clone https://github.com/maheen8q/spam-detector.git
cd spam-detector
pip install -r requirements.txt
streamlit run app.py

spam-detector/
│
├── spam-detection.ipynb # EDA, preprocessing, modeling
├── app.py # Streamlit app
├── model.pkl # Trained Random Forest model
├── vectorizer.pkl # Fitted TF-IDF vectorizer
└── README.md
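
To illustrate how `app.py` might tie `model.pkl` and `vectorizer.pkl` together, here is a hedged sketch rather than the repository's actual code. The helper names `load_artifacts`, `classify`, and `run_app`, the page layout, and the assumption that the model encodes spam as `1` are all hypothetical.

```python
import pickle


def load_artifacts(model_path="model.pkl", vectorizer_path="vectorizer.pkl"):
    """Load the pickled classifier and the fitted TF-IDF vectorizer from disk."""
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    return vectorizer, model


def classify(message, vectorizer, model):
    """Return "Spam" or "Not Spam" for a single message.

    Assumption: the model encodes spam as 1. In practice the message should
    first go through the same preprocessing that was used during training.
    """
    features = vectorizer.transform([message])
    return "Spam" if model.predict(features)[0] == 1 else "Not Spam"


def run_app():
    """Render a minimal Streamlit page (hypothetical layout)."""
    # Imported here so the helpers above can be used without Streamlit installed.
    import streamlit as st

    vectorizer, model = load_artifacts()
    st.title("SMS Spam Detector")
    message = st.text_area("Enter a message")
    if st.button("Predict") and message.strip():
        st.write(classify(message, vectorizer, model))
```

In a script laid out like this, calling `run_app()` at the bottom of `app.py` would render the page when started with `streamlit run app.py`. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.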