Spam Mail Prediction using Machine Learning
A machine learning project that classifies emails as Spam or Ham (Not Spam) using TF-IDF vectorization and Logistic Regression. It demonstrates a full ML pipeline: preprocessing, vectorization, model training, and evaluation.
Project Files
SpamFilter.ipynb # Jupyter Notebook containing the full implementation
README.md # Project documentation
Mail.csv # Email dataset
Features
Clean and simple ML pipeline
TF-IDF text vectorization
Logistic Regression classifier
High accuracy for spam detection
Easy to run on Jupyter Notebook or Google Colab
Technologies Used
Python 3
NumPy
Pandas
Scikit-learn (train_test_split, TfidfVectorizer, LogisticRegression, accuracy_score)
Jupyter Notebook / Google Colab
Workflow
- Import Dependencies
Basic ML and data preprocessing libraries.
- Load Dataset
Emails containing text + labels (spam / ham).
- Data Preprocessing
Clean text
Convert labels to numerical values
Handle missing data
- Feature Extraction
Using TF-IDF Vectorizer:
TfidfVectorizer(min_df=1, stop_words='english')
- Train–Test Split
Split dataset into 80% training and 20% testing.
- Model Training
model = LogisticRegression()
model.fit(X_train, Y_train)
- Evaluation
Accuracy measured using:
accuracy_score(Y_test, predictions)
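The workflow steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the column names "Category" and "Message", the label mapping (ham = 0, spam = 1), the random_state value, and the tiny inline dataset standing in for Mail.csv are all hypothetical. The sketch also fits the vectorizer on the training split only, which is one common way to keep test data out of feature extraction; the notebook may order these steps differently.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for Mail.csv (assumed columns: "Category", "Message").
ham_texts = [
    "Are we still meeting for lunch today",
    "Please send me the report by Monday",
    "Thanks for your help with the move",
    "Call me when you get home",
    "See you at the meeting tomorrow",
]
spam_texts = [
    "Win a free prize now click here",
    "Congratulations you won free cash claim now",
    "Free entry to win a prize call now",
    "Urgent claim your free reward now",
    "You won a free holiday click to claim",
]
raw = pd.DataFrame({
    "Category": ["ham"] * 20 + ["spam"] * 20,
    "Message": ham_texts * 4 + spam_texts * 4,
})
raw.loc[0, "Message"] = None  # simulate a missing value

# Data preprocessing: handle missing data, clean text, encode labels.
mail = raw.fillna("")
mail["Message"] = mail["Message"].str.lower().str.strip()
mail["Label"] = mail["Category"].map({"ham": 0, "spam": 1})  # assumed mapping

X = mail["Message"]
Y = mail["Label"]

# Train-test split: 80% training, 20% testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3, stratify=Y
)

# Feature extraction: learn the TF-IDF vocabulary from the training texts,
# then reuse it to transform the test texts.
vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# Model training.
model = LogisticRegression()
model.fit(X_train_features, Y_train)

# Evaluation on both splits; a large gap would hint at overfitting.
train_accuracy = accuracy_score(Y_train, model.predict(X_train_features))
predictions = model.predict(X_test_features)
test_accuracy = accuracy_score(Y_test, predictions)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

To run this against the real dataset, replace the inline DataFrame with pd.read_csv("Mail.csv") and adjust the column names to match the file.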
Results
The model classifies spam with high accuracy on the held-out 20% test set, which suggests it generalizes beyond the training data. Run the notebook to see the exact accuracy figures.
How to Run the Project
Option 1: Google Colab (Recommended)
Upload the .ipynb file or open directly via Drive
Upload your dataset
Run all cells
Option 2: Local Machine
pip install numpy pandas scikit-learn jupyter
jupyter notebook SpamFilter.ipynb
Future Enhancements
Add Naive Bayes, SVM, Random Forest models
Compare model accuracy
Build a web interface using Streamlit/Flask
Deploy as an API
Add real-time email classification UI
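As a rough idea of how the planned model comparison might look, the sketch below trains several scikit-learn classifiers on the same TF-IDF features and reports test accuracy for each. It is only an illustration: compare_models is a hypothetical helper, the toy texts and labels (0 = ham, 1 = spam) are made up, and LinearSVC is used as one possible choice of SVM.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def compare_models(texts, labels, seed=3):
    """Hypothetical helper: return {model name: test accuracy}."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels
    )

    # Same vectorizer settings as the workflow; fitted on training texts only.
    vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
    train_features = vectorizer.fit_transform(X_train)
    test_features = vectorizer.transform(X_test)

    candidates = {
        "Logistic Regression": LogisticRegression(),
        "Naive Bayes": MultinomialNB(),
        "SVM (linear)": LinearSVC(),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }

    scores = {}
    for name, model in candidates.items():
        model.fit(train_features, y_train)
        scores[name] = accuracy_score(y_test, model.predict(test_features))
    return scores


# Toy data for illustration only (0 = ham, 1 = spam).
ham = ["see you at lunch today", "please send the report", "call me tonight",
       "thanks for your help", "meeting moved to monday"]
spam = ["win a free prize now", "claim your free cash reward", "free entry click here",
        "urgent offer claim now", "you won a free holiday"]
texts = ham * 4 + spam * 4
labels = [0] * 20 + [1] * 20

scores = compare_models(texts, labels)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

On a dataset this small the numbers carry no meaning; the point is only the structure of training each candidate on identical features and the same split.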
Author
Banoth Vikas
Machine Learning & Software Engineering Enthusiast