Skip to content

imenei/unsupervised-learning-from-scratch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

unsupervised-learning-from-scratch

Five clustering algorithms built from the ground up with NumPy — K-Means, K-Medoids, AGNES, DIANA & DBSCAN applied to real retail data.

Python NumPy Jupyter


Overview

Implementations of five clustering algorithms written from scratch using only NumPy, applied to a retail transaction dataset of 51,289 records. Each algorithm follows its original specification — no sklearn wrappers, no abstractions hiding the math.

The goal is to make the internals of these algorithms readable and executable, not to compete with optimized library implementations.


Algorithms

Algorithm Type Complexity Sample Size Notes
K-Means Partition O(n · k · i) 51,289 Fully vectorized with NumPy broadcasting
K-Medoids (PAM) Partition O(n² · k · i) 1,000 PAM swap strategy
AGNES Hierarchical (bottom-up) O(n³) 1000 single / complete / average linkage
DIANA Hierarchical (top-down) O(n²/split) 1,000 Stops at target k, avoids singleton explosion
DBSCAN Density-based O(n²) 1,000 Epsilon selected via k-distance elbow

Results

Algorithm Sample Clusters Silhouette Davies-Bouldin
K-Means 51,289 4 0.452 0.865
K-Medoids 1,000 3 0.334 1.126
AGNES 1000 3 0.707 0.420
DIANA 1,000 3 0.714 0.524
DBSCAN 1,000 3 0.548 0.615

Silhouette: higher is better. Davies-Bouldin: lower is better.

AGNES produces the best-separated clusters (DB=0.420). DIANA scores highest on intra-cluster cohesion (Silhouette=0.714) with a negligible difference. The comparison between algorithms is directional — sample sizes differ by design (see below).


Why Different Sample Sizes

Sample sizes are constrained by algorithmic complexity, not by choice.

Algorithm      Complexity       1,000 pts      10,000 pts
---------------------------------------------------------
K-Means        O(n·k·i)         < 1 sec        seconds
DBSCAN         O(n²)            ~10 sec        ~15 min
K-Medoids      O(n²·k·i)        ~5 min         hours
DIANA          O(n²/split)      ~10 min        hours
AGNES          O(n³)            hours          days

K-Means runs on the full 51k dataset because the distance computation is vectorized — no Python-level loops. The remaining algorithms rely on nested loops consistent with a readable from-scratch implementation. Replacing those loops with KD-Trees or ball trees would make the code faster but harder to follow.


Repository Structure

unsupervised-learning-from-scratch/
│
├── Clustering_From_Scratch.ipynb
├── client_clean3.csv               # place in root before running
├── requirements.txt
└── README.md

Setup

git clone https://github.com/imenei/unsupervised-learning-from-scratch.git
cd unsupervised-learning-from-scratch
pip install -r requirements.txt

Place client_clean3.csv in the root directory. The notebook expects these columns: Sales, Quantity, Discount, Profit, Shipping Cost.

jupyter notebook Clustering_From_Scratch.ipynb

Run all cells sequentially — each section is self-contained.


Implementation Notes

K-Means — distance matrix computed as a single NumPy broadcasting operation (n, 1, d) - (k, d). Convergence criterion: L2 norm of centroid shift below tol=1e-4.

K-Medoids — standard PAM: for each medoid, iterate over all non-medoid candidates and accept the swap that most reduces total cost. Terminates when no improving swap exists.

AGNES — merger history is stored at each step. Dendrogram visualization delegates to scipy.cluster.hierarchy using the same linkage method for consistency.

DIANA — the cluster with the largest diameter is selected for splitting at each step. The most dissimilar point seeds the splinter group; remaining points are reassigned iteratively. The diana_k(X, k) variant stops at k clusters to avoid reducing every point to a singleton.

DBSCAN — precomputed n×n distance matrix. Core, border, and noise points classified per the original Ester et al. 1996 definition. Epsilon chosen from the elbow of the k-distance plot (k = min_pts = 5).


Requirements

numpy>=1.24
pandas>=1.5
matplotlib>=3.7
scikit-learn>=1.2
scipy>=1.10
jupyter>=1.0

References

  • MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics.
  • Kaufman, L. & Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.
  • Ester, M., Kriegel, H-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-96.
  • Rousseeuw, P.J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics.

Author

imenei — github.com/imenei

About

Five clustering algorithms built from the ground up with NumPy — K-Means, K-Medoids, AGNES, DIANA & DBSCAN applied to real retail data.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors