Five clustering algorithms built from the ground up with NumPy — K-Means, K-Medoids, AGNES, DIANA & DBSCAN applied to real retail data.
Implementations of five clustering algorithms written from scratch using only NumPy, applied to a retail transaction dataset of 51,289 records. Each algorithm follows its original specification — no sklearn wrappers, no abstractions hiding the math.
The goal is to make the internals of these algorithms readable and executable, not to compete with optimized library implementations.
| Algorithm | Type | Complexity | Sample Size | Notes |
|---|---|---|---|---|
| K-Means | Partition | O(n · k · i) | 51,289 | Fully vectorized with NumPy broadcasting |
| K-Medoids (PAM) | Partition | O(n² · k · i) | 1,000 | PAM swap strategy |
| AGNES | Hierarchical (bottom-up) | O(n³) | 1000 | single / complete / average linkage |
| DIANA | Hierarchical (top-down) | O(n²/split) | 1,000 | Stops at target k, avoids singleton explosion |
| DBSCAN | Density-based | O(n²) | 1,000 | Epsilon selected via k-distance elbow |
| Algorithm | Sample | Clusters | Silhouette | Davies-Bouldin |
|---|---|---|---|---|
| K-Means | 51,289 | 4 | 0.452 | 0.865 |
| K-Medoids | 1,000 | 3 | 0.334 | 1.126 |
| AGNES | 1000 | 3 | 0.707 | 0.420 |
| DIANA | 1,000 | 3 | 0.714 | 0.524 |
| DBSCAN | 1,000 | 3 | 0.548 | 0.615 |
Silhouette: higher is better. Davies-Bouldin: lower is better.
AGNES produces the best-separated clusters (DB=0.420). DIANA scores highest on intra-cluster cohesion (Silhouette=0.714) with a negligible difference. The comparison between algorithms is directional — sample sizes differ by design (see below).
Sample sizes are constrained by algorithmic complexity, not by choice.
Algorithm Complexity 1,000 pts 10,000 pts
---------------------------------------------------------
K-Means O(n·k·i) < 1 sec seconds
DBSCAN O(n²) ~10 sec ~15 min
K-Medoids O(n²·k·i) ~5 min hours
DIANA O(n²/split) ~10 min hours
AGNES O(n³) hours days
K-Means runs on the full 51k dataset because the distance computation is vectorized — no Python-level loops. The remaining algorithms rely on nested loops consistent with a readable from-scratch implementation. Replacing those loops with KD-Trees or ball trees would make the code faster but harder to follow.
unsupervised-learning-from-scratch/
│
├── Clustering_From_Scratch.ipynb
├── client_clean3.csv # place in root before running
├── requirements.txt
└── README.md
git clone https://github.com/imenei/unsupervised-learning-from-scratch.git
cd unsupervised-learning-from-scratch
pip install -r requirements.txtPlace client_clean3.csv in the root directory. The notebook expects these columns: Sales, Quantity, Discount, Profit, Shipping Cost.
jupyter notebook Clustering_From_Scratch.ipynbRun all cells sequentially — each section is self-contained.
K-Means — distance matrix computed as a single NumPy broadcasting operation (n, 1, d) - (k, d). Convergence criterion: L2 norm of centroid shift below tol=1e-4.
K-Medoids — standard PAM: for each medoid, iterate over all non-medoid candidates and accept the swap that most reduces total cost. Terminates when no improving swap exists.
AGNES — merger history is stored at each step. Dendrogram visualization delegates to scipy.cluster.hierarchy using the same linkage method for consistency.
DIANA — the cluster with the largest diameter is selected for splitting at each step. The most dissimilar point seeds the splinter group; remaining points are reassigned iteratively. The diana_k(X, k) variant stops at k clusters to avoid reducing every point to a singleton.
DBSCAN — precomputed n×n distance matrix. Core, border, and noise points classified per the original Ester et al. 1996 definition. Epsilon chosen from the elbow of the k-distance plot (k = min_pts = 5).
numpy>=1.24
pandas>=1.5
matplotlib>=3.7
scikit-learn>=1.2
scipy>=1.10
jupyter>=1.0
- MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematical Statistics.
- Kaufman, L. & Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.
- Ester, M., Kriegel, H-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-96.
- Rousseeuw, P.J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics.
imenei — github.com/imenei