🧼 datacleanx

datacleanx is a fast, CLI-first data cleaning engine for tabular datasets. It's designed for machine learning practitioners and data engineers who want to automate cleaning workflows efficiently using a single command-line interface.


🚀 Why datacleanx?

  • 🔁 Automates repetitive cleaning steps
  • 📦 Works out-of-the-box with CSV files
  • 📁 Outputs timestamped cleaned files and reports
  • 🐳 Docker-ready for CI/CD and containerized workflows
  • 🧪 Includes tests and reports for reproducibility

🔧 Features

  • ✅ Imputation: mean, median, mode
  • ✅ Encoding: label, onehot
  • ✅ Outlier removal using IQR
  • ✅ Feature scaling: standard, minmax, robust
  • ✅ Auto-saves cleaned data to outputs/
  • ✅ Saves reports as structured JSON
  • ✅ CLI-first design, easily scriptable
  • ✅ Docker and Poetry integration
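For readers unfamiliar with the techniques listed above, here is a minimal standard-library sketch of median imputation, IQR-based outlier removal, and min-max scaling on a single numeric column. It is not taken from the datacleanx source: the function names are hypothetical, and the 1.5 × IQR fence and the quartile method are conventional choices assumed for illustration, so the tool's own results may differ.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional multiplier (an assumption, not a documented
    datacleanx default).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


def minmax_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical "age" column with one missing value and one extreme value.
ages = [25, None, 31, 29, 27, 120]
filled = impute_median(ages)
kept = remove_outliers(filled)
scaled = minmax_scale(kept)
print(filled)
print(kept)
print(scaled)
```

The order used here (impute, then remove outliers, then scale) mirrors the order of the flags in the usage example; whether datacleanx applies the steps in that order is an assumption.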

📦 Installation

✅ Option 1: Install via pip in a Virtual Environment (Recommended)

python3 -m venv venv
source venv/bin/activate
pip install datacleanx

Option 2: Install via pipx (Best for CLI tools)

sudo apt install pipx
pipx ensurepath
pipx install datacleanx

Option 3: From Source (Developer Mode)

git clone https://github.com/essiebx/datacleanx.git
cd datacleanx
poetry install
poetry run datacleanx sample_input.csv --impute median

⚠️ Note for Ubuntu/Debian users: if you see an error like externally-managed-environment, avoid using --break-system-packages. Use a venv or pipx as shown above instead.

Example Usage

datacleanx sample_input.csv --impute median --encode onehot --remove-outliers --scale minmax

This will:

  • ✅ Save cleaned data to outputs/impute_encode_outliers_scale_.csv
  • 🧾 Save a cleaning report to outputs/impute_encode_outliers_scale_report_.json
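Because the tool is CLI-first, it can be driven from a Python script. The sketch below assembles the command from keyword arguments and runs it with subprocess. It uses only the flags that appear in this README; the helper names build_command and run_cleaning are hypothetical, and the use of check=True assumes the CLI signals failure with a non-zero exit code.

```python
import subprocess


def build_command(input_csv, impute=None, encode=None,
                  remove_outliers=False, scale=None, output_name=None):
    """Assemble a datacleanx argv list from the flags shown in this README."""
    cmd = ["datacleanx", input_csv]
    if output_name:
        cmd += ["--output-name", output_name]
    if impute:
        cmd += ["--impute", impute]
    if encode:
        cmd += ["--encode", encode]
    if remove_outliers:
        cmd.append("--remove-outliers")
    if scale:
        cmd += ["--scale", scale]
    return cmd


def run_cleaning(input_csv, **options):
    """Run datacleanx; raises CalledProcessError on a non-zero exit.

    Assumes datacleanx is installed and on PATH.
    """
    return subprocess.run(build_command(input_csv, **options), check=True)


cmd = build_command("sample_input.csv", impute="median", encode="onehot",
                    remove_outliers=True, scale="minmax")
print(" ".join(cmd))
```

Passing the command as a list rather than a single string avoids shell quoting issues when file names contain spaces.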

Custom Output Name

datacleanx sample_input.csv --output-name marketing_cleaned --impute mean --scale robust

This will generate:

outputs/marketing_cleaned_.csv

outputs/marketing_cleaned_report_.json
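Since output files are timestamped, a script that consumes them usually needs to find the most recent one. The sketch below does this by file modification time. It assumes that the timestamp follows the trailing underscore of the output name and that cleaned files end in .csv; the helper name latest_output is hypothetical.

```python
from pathlib import Path


def latest_output(output_name, outputs_dir="outputs"):
    """Return the newest CSV whose name starts with '<output_name>_', or None.

    The '<output_name>_*.csv' pattern is an assumption based on the file
    names shown above, not a documented guarantee.
    """
    candidates = sorted(
        Path(outputs_dir).glob(f"{output_name}_*.csv"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    return candidates[-1] if candidates else None


print(latest_output("marketing_cleaned"))
```

Reports end in .json, so the .csv pattern does not pick them up.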

📁 Project Structure

datacleanx/
├── datacleanx/
│   ├── cleaner.py        # Core cleaning logic
│   ├── cli.py            # CLI interface
│   ├── report.py         # JSON report logic
│   └── __init__.py
├── tests/
│   ├── test_cleaner.py   # Unit tests
│   └── tests_output/     # Test outputs
├── outputs/              # Auto-saved results
├── sample_input.csv      # Example CSV
├── Dockerfile
├── README.md
└── pyproject.toml

🧾 Example Report Output:

{
  "shape_before": [4, 3],
  "age_imputed_with": "median",
  "income_imputed_with": "median",
  "gender_imputed_with": "median",
  "age_outliers_removed": 1,
  "income_outliers_removed": 0,
  "categorical_encoding": "onehot",
  "scaling": "minmax",
  "shape_after": [3, 3]
}
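Because the report is structured JSON, it can be checked programmatically, for example in a CI step. The sketch below parses the example report above and summarizes it per column. It assumes that the keys always follow the `<column>_imputed_with` and `<column>_outliers_removed` pattern seen in the example; the summarize helper is hypothetical.

```python
import json

# The example report from this README, embedded so the sketch is self-contained.
REPORT_TEXT = """
{
  "shape_before": [4, 3],
  "age_imputed_with": "median",
  "income_imputed_with": "median",
  "gender_imputed_with": "median",
  "age_outliers_removed": 1,
  "income_outliers_removed": 0,
  "categorical_encoding": "onehot",
  "scaling": "minmax",
  "shape_after": [3, 3]
}
"""


def columns_by_suffix(report, suffix):
    """Map column name -> value for every key that ends with the suffix."""
    return {
        key[: -len(suffix)]: value
        for key, value in report.items()
        if key.endswith(suffix)
    }


def summarize(report):
    """Condense a report dict into row counts and per-column details."""
    rows_before = report["shape_before"][0]
    rows_after = report["shape_after"][0]
    return {
        "rows_dropped": rows_before - rows_after,
        "outliers_by_column": columns_by_suffix(report, "_outliers_removed"),
        "imputation_by_column": columns_by_suffix(report, "_imputed_with"),
    }


summary = summarize(json.loads(REPORT_TEXT))
print(summary)
```

In a real workflow, REPORT_TEXT would be replaced by the contents of a report file read from outputs/.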

Running Tests

poetry run pytest

Test output artifacts are saved in the tests/tests_output/ folder.

Docker Usage

Build Docker Image

docker build -t datacleanx .

Run Cleaning Task Inside Docker

docker run -v "$(pwd)":/app datacleanx sample_input.csv --impute median --scale standard

📡 Project Links

🐍 PyPI: https://pypi.org/project/datacleanx/

🐳 DockerHub: https://hub.docker.com/r/essiebx/datacleanx

🗂 GitHub: https://github.com/essiebx/datacleanx
