ClusterTools contains two groups of commands for clustering sets of objects: feature-based and dissimilarity-based routines. These, as well as some back-end routines and some PRTools cluster routines, are listed below. It is assumed that readers are familiar with the basics of ClusterTools. There is a separate page discussing the relation between PRTools and ClusterTools.

Other pages introducing user commands of ClusterTools: classification, reclustering, evaluation, support

Feature based cluster routines

The below routines all operate on a dataset A of M objects represented in an L-dimensional vector space. This can be an M*L double array or a PRTools dataset. They all result in an M*N multilevel clustering LABC, i.e. N clusterings of the M objects. The sizes of these clusterings (the desired numbers of clusters) can be set by a vector K. All commands are called in a similar way:

LABC = cluste(A,K);
LABC = A*cluste(K);
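
For example, supplying K as a vector returns a multilevel clustering with one column of labels per requested number of clusters. A minimal sketch, assuming the PRTools data generator gendatb is available (any M*L double array or PRTools dataset will do):

A    = gendatb([100 100]);   % 200 objects in a 2D feature space (PRTools)
K    = [2 5 10];             % desired numbers of clusters
LABC = clustk(A,K);          % 200*3 label array: one clustering per column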

cluste Exemplar clustering by message passing between objects (affinity propagation, Frey and Dueck).
clustf Clustering by the Farthest First Traversal algorithm
clustw Clustering by the Worst First Traversal algorithm
clusth Hierarchical agglomerative clustering. The two nearest clusters are repeatedly merged. Various types are possible (a call sketch follows after this list):

  • single. The distance between two clusters is defined as the minimum distance between any two objects of the two clusters.
  • average. The distance between two clusters is defined as the average distance between all objects of the merged cluster.
  • central. The distance between two clusters is defined as the distance between the two cluster centers.
  • complete. The distance between two clusters is defined as the maximum distance between any two objects of the two clusters.
clustk EM-based clustering; various types are possible:

  • kmeans. Objects are iteratively assigned to the nearest cluster mean. The final cluster prototype is the object closest to the cluster mean.
  • kcentres. Objects are iteratively assigned to the nearest cluster center.
  • kmedoids. Objects are iteratively assigned to the nearest cluster medoid.
clustkh Hierarchical clustering on top of a kmeans clustering by clustk, making clusth feasible for large datasets.
clustm KNN mode seeking clustering
clusts Mean shift, kernel based mode seeking clustering
clustr Random clustering
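
The linkage and cluster types listed above are presumably selected by an additional argument in the call. The exact option syntax is not documented on this page, so the sketch below is an assumption; consult help clusth and help clustk for the exact syntax.

A     = gendatb([100 100]);              % 200 objects in a 2D feature space
LABC1 = clusth(A,[2 5 10],'single');     % assumed: linkage type as third argument
LABC2 = clustk(A,[2 5 10],'kmedoids');   % assumed: cluster type as third argument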

The use of these procedures and some of their differences are shown and discussed in a 2D example.

Distance based cluster routines

Cluster analysis often depends on the set of all distances between a given set of objects. Such distances may be based on a non-Euclidean distance measure, resulting, in PRTools terminology, in a dissimilarity matrix. The below routines all operate on a square dissimilarity matrix D of size M*M, where M is the number of objects. They all result in an M*N multilevel clustering LABC, i.e. N clusterings of the M objects. The sizes of these clusterings are always determined by a second parameter K. As Matlab matrices of size M*M have to fit in memory, dissimilarity matrices for about M = 5000 objects are about the limit (a 5000*5000 double matrix already takes 200 MB).

LABC = dcluste(D,K);
LABC = D*dcluste(K);

These distance-based routines each have a unique relation with one of the above feature-based routines: applying them to the Euclidean distance matrix of a set of objects corresponds to applying the related feature-based routine to the objects themselves. The reverse is not true for the mean shift procedure (clusts), nor for kmeans (an option of clustk), as these necessarily need the feature space. Moreover, the exemplar clustering (dcluste) as well as hierarchical agglomerative clustering (dclusth) are in fact defined on given (dis)similarity matrices. Thereby they suffer from the dataset size restrictions mentioned above.
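
This relation can be illustrated by converting a feature set into a Euclidean dissimilarity matrix and calling the corresponding distance-based routine. A minimal sketch, assuming the PRTools routine distm (squared Euclidean distances) is available:

A     = gendatb([100 100]);   % 200 objects in a 2D feature space
D     = sqrt(+distm(A));      % Euclidean dissimilarity matrix (+ converts to doubles)
LABC1 = clusth(A,5);          % feature-based call
LABC2 = dclusth(D,5);         % distance-based call on the precomputed matrix
% For Euclidean distances the two results are expected to represent the same clustering.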

dcluste Exemplar clustering by message passing between objects (affinity propagation, Frey and Dueck).
dclustf Clustering by the Farthest First Traversal algorithm
dclusth Hierarchical agglomerative clustering. The two nearest clusters are repeatedly merged. Various types are possible:

  • single. The distance between two clusters is defined as the minimum distance between any two objects of the two clusters.
  • average. The distance between two clusters is defined as the average distance between all objects of the merged cluster.
  • central. The distance between two clusters is defined as the distance between the two cluster centers.
  • complete. The distance between two clusters is defined as the maximum distance between any two objects of the two clusters.
dclustk EM-based clustering; two types are possible:

  • kcentres. Objects are iteratively assigned to the nearest cluster center.
  • kmedoids. Objects are iteratively assigned to the nearest cluster medoid.
dclustm KNN mode seeking clustering
dclustr Random clustering

The use of these procedures and some of their differences are shown and discussed in a 2D example.

Back-end routines

The above routines offer a standard user call. Some of them are based on other, more general routines in which more specific parameters can be set. These, as well as some PRTools cluster routines, are listed in the table below.

emclust A PRTools clustering routine that uses an arbitrary untrained classifier in an EM algorithm.
hclust The PRTools routine for hierarchical clustering. It is called by dclusth, which is also the basis for clusth, reclusth and reclustk.
kcentres The PRTools version of the K-Centres procedure. It is based on a different implementation than clustk and dclustk.
meanshift This is the ClusterTools back-end routine for clusts; it offers some additional parameters.
modeclust This is the full, most recent version of the ClusterTools KNN mode seeking procedure. It offers some additional options, among others the possibility of a user-defined distance measure. The routine is called by modeclustf and thereby by clustm for small datasets.
modeclustf The fast version of modeclust. It approximates the results of that routine depending on a complexity parameter.
modeseek The original PRTools version of KNN mode seeking clustering. It is not used by ClusterTools and operates on dissimilarity matrices only. The code is very simple and thereby makes clear how the algorithm works.
prkmeans The PRTools K-Means algorithm. It is based on a different implementation than clustk and is not used by ClusterTools.
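
As an illustration, the PRTools back-end hclust can be called directly on a dissimilarity matrix. This is a sketch under the assumption that hclust follows the classical PRTools signature hclust(D,TYPE,K); check help hclust in your installation.

A   = gendatb([50 50]);          % 100 objects in a 2D feature space
D   = sqrt(+distm(A));           % Euclidean dissimilarity matrix
LAB = hclust(D,'complete',3);    % assumed signature: complete linkage, 3 clusters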

 
