datacleanx is a fast, CLI-first data cleaning engine for tabular datasets. It's designed for machine learning practitioners and data engineers who want to automate cleaning workflows efficiently using a single command-line interface.
- 🔁 Automates repetitive cleaning steps
- 📦 Works out-of-the-box with CSV files
- 📁 Outputs timestamped cleaned files and reports
- 🐳 Docker-ready for CI/CD and containerized workflows
- 🧪 Includes tests and reports for reproducibility
- ✅ Imputation: `mean`, `median`, `mode`
- ✅ Encoding: `label`, `onehot`
- ✅ Outlier removal using IQR
- ✅ Feature scaling: `standard`, `minmax`, `robust`
- ✅ Auto-saves cleaned data to `outputs/`
- ✅ Saves reports as structured JSON
- ✅ CLI-first design, easily scriptable
- ✅ Docker and Poetry integration
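
To make the cleaning steps listed above concrete, here is a minimal standard-library sketch of median imputation, IQR-based outlier removal, and min-max scaling on a single numeric column. It is not datacleanx's actual implementation: the function names, the 1.5 × IQR fence multiplier, and the quartile method are all assumptions chosen for illustration.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional multiplier; whether datacleanx uses it
    is an assumption.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


def minmax_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical "age" column with one missing value and one extreme value.
ages = [22, 25, None, 27, 24, 95]
filled = impute_median(ages)
kept = remove_outliers(filled)
scaled = minmax_scale(kept)
print(filled, kept, scaled)
```

The order (impute, then remove outliers, then scale) mirrors the order of the flags in the usage example; scaling after outlier removal keeps a single extreme value from compressing the rest of the range.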
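Install from PyPI into a virtual environment:

    python3 -m venv venv
    source venv/bin/activate
    pip install datacleanx

Or install it as a standalone CLI with pipx (the `apt` step applies to Debian/Ubuntu):

    sudo apt install pipx
    pipx ensurepath
    pipx install datacleanx

Or run it from source with Poetry:

    git clone https://github.com/essiebx/datacleanx.git
    cd datacleanx
    poetry install
    poetry run datacleanx sample_input.csv --impute median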
### Example Usage

    datacleanx sample_input.csv --impute median --encode onehot --remove-outliers --scale minmax

This run produces two files:

- ✅ Cleaned data, saved to outputs/impute_encode_outliers_scale_.csv
- 🧾 A cleaning report, saved to outputs/impute_encode_outliers_scale_report_.json

To name the output yourself, pass `--output-name`:

    datacleanx sample_input.csv --output-name marketing_cleaned --impute mean --scale robust
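
Because the tool is driven entirely by flags, it can be scripted from Python as well as from the shell. The sketch below assembles the argument list for the commands shown above and runs it over every CSV in a folder. It uses only flags that appear in this README; the helper names `build_command` and `clean_all` and the `<stem>_cleaned` naming scheme are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(csv_path, impute=None, encode=None, remove_outliers=False,
                  scale=None, output_name=None):
    """Assemble a datacleanx argument list from the flags shown in this README."""
    cmd = ["datacleanx", str(csv_path)]
    if output_name:
        cmd += ["--output-name", output_name]
    if impute:
        cmd += ["--impute", impute]
    if encode:
        cmd += ["--encode", encode]
    if remove_outliers:
        cmd.append("--remove-outliers")
    if scale:
        cmd += ["--scale", scale]
    return cmd


def clean_all(folder, **options):
    """Run datacleanx once per CSV file in `folder` (requires datacleanx on PATH)."""
    for csv_path in sorted(Path(folder).glob("*.csv")):
        cmd = build_command(csv_path, output_name=f"{csv_path.stem}_cleaned", **options)
        # check=True raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(cmd, check=True)


# Building the list does not execute anything; it only shows what would be run.
example_cmd = build_command("sample_input.csv", impute="median", encode="onehot",
                            remove_outliers=True, scale="minmax")
print(" ".join(example_cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces. A call such as `clean_all("data", impute="median", scale="minmax")` would then process a hypothetical `data/` folder.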
# 📁 Project Structure

    datacleanx/
    ├── datacleanx/
    │   ├── cleaner.py         # Core cleaning logic
    │   ├── cli.py             # CLI interface
    │   ├── report.py          # JSON report logic
    │   └── __init__.py
    ├── tests/
    │   ├── test_cleaner.py    # Unit tests
    │   └── tests_output/      # Test outputs
    ├── outputs/               # Auto-saved results
    ├── sample_input.csv       # Example CSV
    ├── Dockerfile
    ├── README.md
    └── pyproject.toml
### 🧾 Example Report Output

    {
      "shape_before": [4, 3],
      "age_imputed_with": "median",
      "income_imputed_with": "median",
      "gender_imputed_with": "median",
      "age_outliers_removed": 1,
      "income_outliers_removed": 0,
      "categorical_encoding": "onehot",
      "scaling": "minmax",
      "shape_after": [3, 3]
    }
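
Since the report is plain JSON, it can be checked programmatically, for example in a CI step. The following sketch summarizes a report of the shape shown above. It assumes the `<column>_imputed_with` and `<column>_outliers_removed` key patterns seen in this one example hold in general; `load_report` and `summarize_report` are hypothetical helper names, not part of datacleanx.

```python
import json
from pathlib import Path

IMPUTE_SUFFIX = "_imputed_with"
OUTLIER_SUFFIX = "_outliers_removed"


def load_report(path):
    """Read a report file from disk and parse it as JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_report(report):
    """Collect row counts, per-column imputation, and per-column outlier counts."""
    rows_before = report["shape_before"][0]
    rows_after = report["shape_after"][0]
    # Strip the suffix from each matching key to recover the column name.
    imputed = {key[: -len(IMPUTE_SUFFIX)]: value
               for key, value in report.items() if key.endswith(IMPUTE_SUFFIX)}
    outliers = {key[: -len(OUTLIER_SUFFIX)]: value
                for key, value in report.items() if key.endswith(OUTLIER_SUFFIX)}
    return {
        "rows_dropped": rows_before - rows_after,
        "imputed": imputed,
        "outliers_removed": outliers,
    }


# The example report from this README, inlined so the sketch runs on its own.
example = json.loads("""
{ "shape_before": [4, 3], "age_imputed_with": "median",
  "income_imputed_with": "median", "gender_imputed_with": "median",
  "age_outliers_removed": 1, "income_outliers_removed": 0,
  "categorical_encoding": "onehot", "scaling": "minmax",
  "shape_after": [3, 3] }
""")
summary = summarize_report(example)
print(summary)
```

For a real run, `load_report` would be pointed at the report file written under `outputs/`.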
Run the tests:

    poetry run pytest

Build the Docker image and run it against a CSV in the current directory:

    docker build -t datacleanx .
    docker run -v "$(pwd)":/app datacleanx sample_input.csv --impute median --scale standard
- 🐍 PyPI: https://pypi.org/project/datacleanx/
- 🐳 DockerHub: https://hub.docker.com/r/essiebx/datacleanx
- 🗂 GitHub: https://github.com/essiebx/datacleanx