Implemented K-Means and K-Medoids for vectors and curves with C++
- Random Selection: With random selection we select uniformly at random k points of the dataset to start as centroids
- K-Means++: With K-Means++ we select k points of the dataset, spread very far away from the data
- Loyd's: For every point compute distances to all centroids, and select the one that is closer. It seems as a very good practice, but in reality it converges really slow, because the decisions affect the clusters very little, so there are not big differences in centroid update.
- Inverse Assignment: We use LSH nearest neighbor for this assigner. Apply Range Search with LSH for the R-nearest neighbors, to get for every centroid the R-closest points of the dataset and assign to those points as centroid id. For points who came as nearest neighbors of more the one centroid, we calculate the distance, and select the one that is closer. That way, the algorithm converges faster because some changes affect the clustering score in a certain amount, where the centroids are changing rapidly.
- PAM - K Medoids: K-Medoids is a very good algorithm, insensitive to outliers providing very good results. But its complexity is very high O(n^2 * K * i) since it has to calculate for every combination of points their distances. In order to make this algorithm faster, we implemented a database, where the distances are pre calculated exhaustively.
- K-Means: K-Means is sensitive to outliers, and worse than K-Medoids for small datasets, where the time of completion is not a factor. K-Means works very well for very big datasets, where K-Medoids is extremely slow. It gives very good results in O(n * k * i) where k * i << n
We used Silhouette for clustering evaluation and validation of consistency.
- Vectors ~0.98
- Curves ~0.4
- Compilation
make cluster - Run
./build/cluster.x –i <input_file> -c <configuration_file> -o <output_file>