Skip to content

taslanidis/Clustering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 

Repository files navigation

Clustering with K-Means

Implemented K-Means and K-Medoids for vectors and curves with C++

  • Random Selection: With random selection we select uniformly at random k points of the dataset to start as centroids
  • K-Means++: With K-Means++ we select k points of the dataset, spread very far away from the data
  • Loyd's: For every point compute distances to all centroids, and select the one that is closer. It seems as a very good practice, but in reality it converges really slow, because the decisions affect the clusters very little, so there are not big differences in centroid update.
  • Inverse Assignment: We use LSH nearest neighbor for this assigner. Apply Range Search with LSH for the R-nearest neighbors, to get for every centroid the R-closest points of the dataset and assign to those points as centroid id. For points who came as nearest neighbors of more the one centroid, we calculate the distance, and select the one that is closer. That way, the algorithm converges faster because some changes affect the clustering score in a certain amount, where the centroids are changing rapidly.
  • PAM - K Medoids: K-Medoids is a very good algorithm, insensitive to outliers providing very good results. But its complexity is very high O(n^2 * K * i) since it has to calculate for every combination of points their distances. In order to make this algorithm faster, we implemented a database, where the distances are pre calculated exhaustively.
  • K-Means: K-Means is sensitive to outliers, and worse than K-Medoids for small datasets, where the time of completion is not a factor. K-Means works very well for very big datasets, where K-Medoids is extremely slow. It gives very good results in O(n * k * i) where k * i << n

Metric for cluster score:

We used Silhouette for clustering evaluation and validation of consistency.

  • Vectors ~0.98
  • Curves ~0.4

Usage:

  • Compilation
    make cluster
  • Run
    ./build/cluster.x –i <input_file> -c <configuration_file> -o <output_file>

Collaborators

Konstantinos Athinaios
Theofanis Aslanidis

About

K-Means and K-Medoids for vectors and curves with C++. Comparing different initialization, assignment and update methods.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors