Cluster routines
ClusterTools contains two groups of commands for clustering sets of objects: feature based and dissimilarity based routines. These as well as some back-end routines and some PRTools cluster routines are listed below. It is assumed that readers are familiar with the basics of ClusterTools. There is a separate page discussing the relation between PRTools and ClusterTools.
Other pages introducing user commands of ClusterTools: classification, reclustering, evaluation, support
Feature based cluster routines
The below routines all operate on a dataset A
of M
objects represented in a L
-dimensional vector space. This can be an M*L
double array or a PRTools dataset. They all result in a M*N
multilevel clustering LABC
. These are N
clusterings of the M
objects. The sizes of these clusterings (desired number of clusters) can be set by a vector K
. Commands are thereby similar to:
LABC = cluste(A,K);
LABC = A*cluste(K);
cluste |
Exemplar clustering by passing messages between objects, Frey and Dueck. | |
clustf |
Clustering by the Farthest First Traversal algorithm | |
clustw |
Clustering by the Worst First Traversal algorithm | |
clusth |
Hierarchical agglomerative clustering. Repeatedly the two nearest clusters are merged. Various types are possible:
|
|
clustk |
EM based clustering, various types are possible:
|
|
clustkh |
Hierarchical clustering on top of kMeans by clustk in order to make clusth for large datasets feasible. |
|
clustm |
KNN mode seeking clustering | |
clusts |
Mean shift, kernel based mode seeking clustering | |
clustr |
Random clustering |
In a 2D example the use and some differences between these procedures will be shown and discussed.
Distance based cluster routines
Cluster analysis often depends on the set of all distances between a given set of objects. Such distances may often be based on a non-Euclidean distance measure, resulting, in PRTools terminology, in a dissimilarity matrix. The below routines all operate on a square dissimilarity matrix D
of size M*M
, if M
is the number of objects. They all result in a M*N
multilevel clustering LABC
. These are N
clusterings of the M
objects. The sizes of these clusterings are always determined by a second parameter K
. As Matlab matrices of size M*M
should fit in memory, dissimilarity matrices of about M = 5000
objects is about the limit.
LABC = cluste(A,K);
LABC = A*cluste(K);
These distance based routines have a unique relation with one of the above feature based routines by computing the Euclidean distances matrix of the set of objects. The reverse is not true for the mean shift procedure (clusts
), nor for kmeans (an option of clustk
) as they necessarily need the feature space. Moreover, the exemplar clustering (dcluste
) as well as hierarchical agglomerative clustering (dclusth
) are in fact defined on given (dis)similarity matrices. Thereby they suffer from the dataset size restrictions, as mentioned above.
dcluste |
Exemplar clustering by passing messages between objects, Frey and Dueck. | |
dclustf |
Clustering by the Farthest First Traversal algorithm | |
dclusth |
Hierarchical agglomerative clustering. Repeatedly the two nearest clusters are merged. Various types are possible:
|
|
dclustk |
EM based clustering, two types are possible:
|
|
dclustm |
KNN mode seeking clustering | |
dclustr |
Random clustering |
In a 2D example the use and some differences between these procedures will be shown and discussed.
Back-end routines
The above routines offer a standard user call. Some of them are based on other, more general routines in which more specific parameters can be set. These on some PRTools cluster routines are listed in the below table.
emclust |
A PRTools clustering routine the uses an arbitrary untrained classifier in an EM algorithm. | |
hclust |
The PRTools routine for hierarchical clustering. It is called by dclusth , which is also the basis for clusth , reclusth and reclustk . |
|
kcentres |
The PRTools version of the K-Centres procedure. It is based on a different implementation than clustk and dclustk . |
|
meanshift |
This is the ClusterTools back-end routine for clusts and It offers some additional parameters. |
|
modeclust |
This is the full, most recent version of the ClusterTools KNN mode seeking procedure. It offers some additional options, a.o. the possibility of a user defined distance measure. The routine is called by modeclustf and thereby by clustm for small datasets. |
|
modeclustf |
The fast version of modeclust. It approximates the results of that routine depending of a complexity parameter. |
|
modeseek |
The original PRTools version of KNN mode seeking clustering. It is not used by ClusterTools. It operates on dissimilarity matrices only. The code is very simple and makes thereby clear how the algorithm works. | |
prkmeans |
The PRTools K-Means algorithm. It is based on a different implementation than clustk and not used by ClusterTools. |
Other pages introducing user commands of ClusterTools: classification, reclustering, evaluation, support