This project focuses on predicting whether water is safe for drinking based on its physicochemical properties using machine learning classification techniques.
Access to safe drinking water is a critical public health concern. The goal of this project is to build and compare machine learning models that can classify water samples as potable or non-potable based on measured water quality parameters.
- Source: Kaggle – Water Potability Dataset
- Link: https://www.kaggle.com/datasets/adityakadiwal/water-potability
- Description: The dataset contains water quality metrics such as pH, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes, and turbidity, along with a binary target variable indicating potability.
Note: The dataset is not included in this repository. Please download it directly from Kaggle using the link above.
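As a rough sketch of a first look at the data once it is downloaded, the snippet below loads a CSV with pandas and counts missing values per column. The file name `water_potability.csv`, the column names, and the helper names `load_water_data` and `missing_summary` are assumptions for illustration; the in-memory rows are made-up placeholders so the snippet runs without the Kaggle file.

```python
import io

import pandas as pd


def load_water_data(source):
    """Read the CSV (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


def missing_summary(df):
    """Return the number of missing values per column, largest first."""
    return df.isna().sum().sort_values(ascending=False)


# Placeholder rows (not real measurements) so the sketch runs standalone.
# Column names are assumed; check them against the downloaded file.
sample_csv = io.StringIO(
    "ph,Hardness,Sulfate,Potability\n"
    "7.1,204.9,,0\n"
    ",129.4,356.9,0\n"
    "8.3,224.2,,1\n"
)

df = load_water_data(sample_csv)
# With the real file this would be, for example:
# df = load_water_data("water_potability.csv")  # assumed file name

summary = missing_summary(df)
print(df.shape)
print(summary)
```

Columns that show missing values here are the ones the imputation step would need to handle.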
- Data loading and initial exploration
- Handling missing values using statistical imputation
- Feature scaling where required
- Training and evaluation of multiple classification models
- Performance comparison across models
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine (SVM)
Model performance was evaluated using classification metrics such as accuracy and confusion matrix analysis. Comparing the models on the same data shows how each performs on real-world environmental measurements and where its strengths and limitations lie.
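The workflow and models described above can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: synthetic data from `make_classification` stands in for the Kaggle dataset, the split ratio and hyperparameters are arbitrary, and scaling is applied only to the models that are sensitive to feature scale.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the nine water quality features (assumption).
X, y = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=0
)

# Knock out about 5% of the values to mimic missing measurements.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Mean imputation lives inside each pipeline, so the means are learned
# from the training split only. Random forests do not need scaling.
models = {
    "Logistic Regression": make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    ),
    "Random Forest": make_pipeline(
        SimpleImputer(strategy="mean"),
        RandomForestClassifier(n_estimators=100, random_state=0),
    ),
    "SVM": make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
        SVC(),
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }
    print(name, round(results[name]["accuracy"], 3))
    print(results[name]["confusion_matrix"])
```

Keeping the imputer and scaler inside each pipeline is a design choice that avoids leaking test-set statistics into training; the numbers printed for synthetic data say nothing about performance on the real dataset.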
- Python
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
- Google Colab
water-potability-prediction/
├── notebooks/
│ └── water_potability.ipynb
├── requirements.txt
├── .gitignore
└── README.md
- Clone the repository
- Install dependencies:
pip install -r requirements.txt
- Download the dataset from Kaggle
- Open and run the notebook in Jupyter Notebook or Google Colab
- Mean imputation may reduce feature variance
- Accuracy alone may not fully capture real-world risk, especially when unsafe water is predicted as potable (a false negative if non-potable is treated as the positive class)
- Further tuning and additional evaluation metrics could improve model reliability
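To illustrate the point about accuracy above, the sketch below computes recall for the non-potable class alongside accuracy. The label arrays are hand-made placeholders, not project results, and the encoding (0 = non-potable, 1 = potable) is an assumption about the target column.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Placeholder labels (assumed encoding: 0 = non-potable, 1 = potable).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Unsafe samples that the model labeled as potable: the costly error.
unsafe_as_potable = int(cm[0, 1])

# Share of truly non-potable samples that were caught.
non_potable_recall = recall_score(y_true, y_pred, pos_label=0)

print("accuracy:", accuracy)
print("unsafe predicted as potable:", unsafe_as_potable)
print("recall for non-potable class:", round(non_potable_recall, 3))
```

Reporting the non-potable recall next to accuracy makes the costly error visible even when overall accuracy looks acceptable.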
This project was completed as part of the IBM SkillsBuild – Mastering Data with Machine Learning program in collaboration with CSRBOX.
This project is licensed under the MIT License.