unsupervised-learning-from-scratch

Five clustering algorithms built from the ground up with NumPy — K-Means, K-Medoids, AGNES, DIANA & DBSCAN applied to real retail data.

Overview

Implementations of five clustering algorithms written from scratch using only NumPy, applied to a retail transaction dataset of 51,289 records. Each algorithm follows its original specification — no sklearn wrappers, no abstractions hiding the math.

The goal is to make the internals of these algorithms readable and executable, not to compete with optimized library implementations.

Algorithms

Algorithm	Type	Complexity	Sample Size	Notes
K-Means	Partition	O(n · k · i)	51,289	Fully vectorized with NumPy broadcasting
K-Medoids (PAM)	Partition	O(n² · k · i)	1,000	PAM swap strategy
AGNES	Hierarchical (bottom-up)	O(n³)	1000	single / complete / average linkage
DIANA	Hierarchical (top-down)	O(n²/split)	1,000	Stops at target k, avoids singleton explosion
DBSCAN	Density-based	O(n²)	1,000	Epsilon selected via k-distance elbow

Results

Algorithm	Sample	Clusters	Silhouette	Davies-Bouldin
K-Means	51,289	4	0.452	0.865
K-Medoids	1,000	3	0.334	1.126
AGNES	1000	3	0.707	0.420
DIANA	1,000	3	0.714	0.524
DBSCAN	1,000	3	0.548	0.615

Silhouette: higher is better. Davies-Bouldin: lower is better.

AGNES produces the best-separated clusters (DB=0.420). DIANA scores highest on intra-cluster cohesion (Silhouette=0.714) with a negligible difference. The comparison between algorithms is directional — sample sizes differ by design (see below).

Why Different Sample Sizes

Sample sizes are constrained by algorithmic complexity, not by choice.

Algorithm      Complexity       1,000 pts      10,000 pts
---------------------------------------------------------
K-Means        O(n·k·i)         < 1 sec        seconds
DBSCAN         O(n²)            ~10 sec        ~15 min
K-Medoids      O(n²·k·i)        ~5 min         hours
DIANA          O(n²/split)      ~10 min        hours
AGNES          O(n³)            hours          days

K-Means runs on the full 51k dataset because the distance computation is vectorized — no Python-level loops. The remaining algorithms rely on nested loops consistent with a readable from-scratch implementation. Replacing those loops with KD-Trees or ball trees would make the code faster but harder to follow.

Repository Structure

unsupervised-learning-from-scratch/
│
├── Clustering_From_Scratch.ipynb
├── client_clean3.csv               # place in root before running
├── requirements.txt
└── README.md

Setup

git clone https://github.com/imenei/unsupervised-learning-from-scratch.git
cd unsupervised-learning-from-scratch
pip install -r requirements.txt

Place client_clean3.csv in the root directory. The notebook expects these columns: Sales, Quantity, Discount, Profit, Shipping Cost.

jupyter notebook Clustering_From_Scratch.ipynb

Run all cells sequentially — each section is self-contained.

Implementation Notes

K-Means — distance matrix computed as a single NumPy broadcasting operation (n, 1, d) - (k, d). Convergence criterion: L2 norm of centroid shift below tol=1e-4.

K-Medoids — standard PAM: for each medoid, iterate over all non-medoid candidates and accept the swap that most reduces total cost. Terminates when no improving swap exists.

AGNES — merger history is stored at each step. Dendrogram visualization delegates to scipy.cluster.hierarchy using the same linkage method for consistency.

DIANA — the cluster with the largest diameter is selected for splitting at each step. The most dissimilar point seeds the splinter group; remaining points are reassigned iteratively. The diana_k(X, k) variant stops at k clusters to avoid reducing every point to a singleton.

DBSCAN — precomputed n×n distance matrix. Core, border, and noise points classified per the original Ester et al. 1996 definition. Epsilon chosen from the elbow of the k-distance plot (k = min_pts = 5).

Requirements

numpy>=1.24
pandas>=1.5
matplotlib>=3.7
scikit-learn>=1.2
scipy>=1.10
jupyter>=1.0

References

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics.
Kaufman, L. & Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.
Ester, M., Kriegel, H-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-96.
Rousseeuw, P.J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics.

Author

imenei — github.com/imenei

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Clustering_From_Scratch.ipynb		Clustering_From_Scratch.ipynb
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

unsupervised-learning-from-scratch

Overview

Algorithms

Results

Why Different Sample Sizes

Repository Structure

Setup

Implementation Notes

Requirements

References

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

unsupervised-learning-from-scratch

Overview

Algorithms

Results

Why Different Sample Sizes

Repository Structure

Setup

Implementation Notes

Requirements

References

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages