
data_complexity_analysis

Python library for analyzing data complexity metrics in classification datasets. Wraps PyCol (Python Class Overlap Library) and adds an experiment framework for studying how complexity metrics correlate with ML classifier performance.

What it does

  • Computes 33+ complexity metrics across four categories: Feature Overlap, Instance Overlap, Structural Overlap, and Multiresolution Overlap
  • Provides a modular ML evaluation module (8 classifiers, 17 performance metrics, cross-validation and train/test evaluators)
  • Includes a configurable experiment framework for parameter sweeps over synthetic datasets (Gaussian, Moons, Circles, Blobs)
  • Supports parallel execution, result saving/loading, and a range of visualizations
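To give a feel for what a feature-overlap metric measures, here is a minimal NumPy sketch of Fisher's discriminant ratio, one of the classic measures in that family. This is a simplified reimplementation for intuition only, not the library's (or PyCol's) own code:

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-feature Fisher discriminant ratio for a binary problem:
    f_i = (mu0_i - mu1_i)^2 / (var0_i + var1_i).
    Higher values mean the classes separate more easily along that feature,
    i.e. lower feature overlap."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    return num / den

# Two classes well separated on feature 0, identical on feature 1
rng = np.random.default_rng(0)
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)]),
    rng.normal(0, 1, 200),
])
y = np.array([0] * 100 + [1] * 100)
f = fisher_ratio(X, y)
print(f)  # feature 0 scores far higher than feature 1
```

The library's own metrics report aggregated versions of quantities like this across all features and categories.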

Installation

pdm install

Quick start

from data_complexity.metrics import complexity_metrics
import numpy as np

dataset = {"X": np.random.randn(200, 2), "y": np.array([0] * 100 + [1] * 100)}
complexity = complexity_metrics(dataset=dataset)

print(complexity.get_all_metrics_scalar())
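Any data in the same `{"X", "y"}` dict format can be analyzed, so you can build your own synthetic datasets too. Below is an illustrative two-moons generator in plain NumPy (the generator itself is a sketch for this README, not one of the library's built-in generators):

```python
import numpy as np

def make_moons_dataset(n_per_class=100, noise=0.1, seed=0):
    """Illustrative two-moons generator returning the {"X", "y"} dict format."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, np.pi, n_per_class)
    upper = np.column_stack([np.cos(t), np.sin(t)])            # class 0 arc
    lower = np.column_stack([1 - np.cos(t), 0.5 - np.sin(t)])  # class 1 arc
    X = np.vstack([upper, lower])
    X += rng.normal(scale=noise, size=X.shape)                 # add label noise
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return {"X": X, "y": y}

dataset = make_moons_dataset()
print(dataset["X"].shape, dataset["y"].shape)  # (200, 2) (200,)
# dataset can then be passed to complexity_metrics(dataset=dataset)
```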

Run a pre-defined experiment:

from data_complexity.experiments.pipeline import run_experiment

exp = run_experiment("moons_noise")   # runs, saves plots and CSVs

Further reading

  • CLAUDE.md — full API reference for contributors and AI assistants
  • data_complexity/experiments/pipeline/README.md — detailed experiment framework docs
