🧼 datacleanx

datacleanx is a fast, CLI-first data cleaning engine for tabular datasets. It is designed for machine learning practitioners and data engineers who want to automate cleaning workflows from a single command.

🚀 Why datacleanx?

🔁 Automates repetitive cleaning steps
📦 Works out-of-the-box with CSV files
📁 Outputs timestamped cleaned files and reports
🐳 Docker-ready for CI/CD and containerized workflows
🧪 Includes tests and reports for reproducibility

🔧 Features

✅ Imputation: mean, median, mode
✅ Encoding: label, onehot
✅ Outlier removal using IQR
✅ Feature scaling: standard, minmax, robust
✅ Auto-saves cleaned data to outputs/
✅ Saves reports as structured JSON
✅ CLI-first design, easily scriptable
✅ Docker and Poetry integration
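
To make the feature list concrete, here is a minimal standard-library sketch of three of the listed steps: median imputation, IQR-based outlier removal, and min-max scaling. It is not datacleanx's actual implementation; the function names, the `k = 1.5` fence multiplier, and the sample values are assumptions chosen for illustration.

```python
from statistics import median, quantiles
from typing import Optional


def impute_median(values: list[Optional[float]]) -> list[float]:
    """Replace missing entries (None) with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def remove_iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (k is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def minmax_scale(values: list[float]) -> list[float]:
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical "age" column with one missing value and one extreme value.
ages = [25.0, 30.0, None, 35.0, 120.0]
imputed = impute_median(ages)        # None becomes the median, 32.5
kept = remove_iqr_outliers(imputed)  # 120.0 falls outside the IQR fences
scaled = minmax_scale(kept)
print(scaled)  # [0.0, 0.5, 0.75, 1.0]
```

The order of the steps here (impute, then remove outliers, then scale) mirrors the order of the flags in the usage example, but the order the tool applies internally is not stated in this README.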

📦 Installation

✅ Option 1: Install via `pip` in a Virtual Environment (Recommended)

    python3 -m venv venv
    source venv/bin/activate
    pip install datacleanx

Option 2: Install via pipx (Best for CLI tools)

    sudo apt install pipx
    pipx ensurepath
    pipx install datacleanx

Option 3: From Source (Developer Mode)

    git clone https://github.com/essiebx/datacleanx.git
    cd datacleanx
    poetry install
    poetry run datacleanx sample_input.csv --impute median

⚠️ Note for Ubuntu/Debian users: if you see an error like `externally-managed-environment`, avoid using `--break-system-packages`. Use a venv or pipx as shown above instead.

### Example Usage

    datacleanx sample_input.csv --impute median --encode onehot --remove-outliers --scale minmax

This will:

✅ Save cleaned data to outputs/impute_encode_outliers_scale_.csv

🧾 Save a cleaning report to outputs/impute_encode_outliers_scale_report_.json
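
Because the tool is driven entirely by flags, it is straightforward to call from a script. The sketch below assembles the argument list in Python using only the flags that appear in this README; the `build_command` helper and its parameter names are hypothetical and not part of datacleanx.

```python
import shlex
from typing import Optional


def build_command(
    input_csv: str,
    impute: Optional[str] = None,
    encode: Optional[str] = None,
    remove_outliers: bool = False,
    scale: Optional[str] = None,
    output_name: Optional[str] = None,
) -> list[str]:
    """Assemble a datacleanx argv list from the options shown in this README."""
    cmd = ["datacleanx", input_csv]
    if output_name is not None:
        cmd += ["--output-name", output_name]
    if impute is not None:
        cmd += ["--impute", impute]
    if encode is not None:
        cmd += ["--encode", encode]
    if remove_outliers:
        cmd.append("--remove-outliers")
    if scale is not None:
        cmd += ["--scale", scale]
    return cmd


cmd = build_command(
    "sample_input.csv",
    impute="median",
    encode="onehot",
    remove_outliers=True,
    scale="minmax",
)
# Print the shell-quoted command; to actually run it, pass the list to
# subprocess.run(cmd, check=True) in an environment where datacleanx is installed.
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when file names contain spaces.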

Custom Output Name

    datacleanx sample_input.csv --output-name marketing_cleaned --impute mean --scale robust

This will generate:

outputs/marketing_cleaned_.csv

outputs/marketing_cleaned_report_.json
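
In a script that runs the tool repeatedly, it can be handy to pick up the newest cleaned file for a given output name. The following is a minimal sketch, assuming cleaned files land in `outputs/` and start with the chosen name as shown above; `latest_output` is a hypothetical helper, and it sorts by file modification time rather than parsing the file name.

```python
from pathlib import Path
from typing import Optional


def latest_output(directory: str, prefix: str, suffix: str = ".csv") -> Optional[Path]:
    """Return the most recently modified file named <prefix>*<suffix>, or None."""
    candidates = [
        p for p in Path(directory).glob(f"{prefix}*{suffix}") if p.is_file()
    ]
    # max() with default=None handles the "no matching files" case.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Example call (assumes a previous run used --output-name marketing_cleaned):
# newest = latest_output("outputs", "marketing_cleaned_")
```

Filtering on the `.csv` suffix keeps the matching JSON report out of the results; pass `suffix=".json"` to look up the report instead.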

## 📁 Project Structure

    datacleanx/
    ├── datacleanx/
    │   ├── cleaner.py          # Core cleaning logic
    │   ├── cli.py              # CLI interface
    │   ├── report.py           # JSON report logic
    │   └── __init__.py
    ├── tests/
    │   ├── test_cleaner.py     # Unit tests
    │   └── tests_output/       # Test outputs
    ├── outputs/                # Auto-saved results
    ├── sample_input.csv        # Example CSV
    ├── Dockerfile
    ├── README.md
    └── pyproject.toml

### 🧾 Example Report Output

    {
      "shape_before": [4, 3],
      "age_imputed_with": "median",
      "income_imputed_with": "median",
      "gender_imputed_with": "median",
      "age_outliers_removed": 1,
      "income_outliers_removed": 0,
      "categorical_encoding": "onehot",
      "scaling": "minmax",
      "shape_after": [3, 3]
    }
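
Since the report is plain JSON, it can be checked programmatically, for example in a CI step. The sketch below parses the sample report shown above and summarizes it. The key patterns (`*_imputed_with`, `*_outliers_removed`) are inferred from this single sample, so treat them as assumptions rather than a documented schema.

```python
import json

# The sample report from this README, embedded as a string for illustration.
report_text = """
{
  "shape_before": [4, 3],
  "age_imputed_with": "median",
  "income_imputed_with": "median",
  "gender_imputed_with": "median",
  "age_outliers_removed": 1,
  "income_outliers_removed": 0,
  "categorical_encoding": "onehot",
  "scaling": "minmax",
  "shape_after": [3, 3]
}
"""

report = json.loads(report_text)

# Row counts before and after cleaning.
rows_before, _ = report["shape_before"]
rows_after, _ = report["shape_after"]
rows_dropped = rows_before - rows_after

# Total outliers removed across all columns (assumed key suffix).
outliers_removed = sum(
    count for key, count in report.items() if key.endswith("_outliers_removed")
)

# Map each column to its imputation strategy (assumed key suffix).
suffix = "_imputed_with"
imputation = {
    key[: -len(suffix)]: strategy
    for key, strategy in report.items()
    if key.endswith(suffix)
}

print(f"rows dropped: {rows_dropped}, outliers removed: {outliers_removed}")
print(f"imputation: {imputation}")
```

To read a real report, replace the embedded string with the contents of the JSON file written to `outputs/`.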

Running Tests

    poetry run pytest

Test output artifacts are saved in the tests/tests_output/ folder.

Docker Usage

Build Docker Image

    docker build -t datacleanx .

Run Cleaning Task Inside Docker

    docker run -v "$(pwd)":/app datacleanx sample_input.csv --impute median --scale standard

📡 Project Links

🐍 PyPI: https://pypi.org/project/datacleanx/

🐳 DockerHub: https://hub.docker.com/r/essiebx/datacleanx

🗂 GitHub: https://github.com/essiebx/datacleanx
