ClusterTools contains two groups of commands for clustering sets of objects: feature-based and dissimilarity-based routines. These, as well as some back-end routines and some PRTools cluster routines, are listed below. It is assumed that readers are familiar with the basics of ClusterTools. There is a separate page discussing the relation between PRTools and ClusterTools.

Other pages introducing user commands of ClusterTools: classification, reclustering, evaluation, support

Feature based cluster routines

The below routines all operate on a dataset A of M objects represented in an L-dimensional vector space. This can be an M*L double array or a PRTools dataset. They all result in an M*N multilevel clustering LABC, i.e. N clusterings of the M objects. The sizes of these clusterings (the desired numbers of clusters) can be set by a vector K. All commands are called in a similar way:

LABC = cluste(A,K);
LABC = A*cluste(K);
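
For example, supplying K as a vector returns a multilevel clustering with one column of labels per requested number of clusters. A minimal sketch, assuming the PRTools data generator gendatb is available (any M*L double array or PRTools dataset will do):

A    = gendatb([100 100]);   % 200 objects in a 2D feature space (PRTools)
K    = [2 5 10];             % desired numbers of clusters
LABC = clustk(A,K);          % 200*3 label array: one clustering per column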

cluste Exemplar clustering by message passing between objects (affinity propagation, Frey and Dueck).
clustf Clustering by the Farthest First Traversal algorithm
clustw Clustering by the Worst First Traversal algorithm
clusth Hierarchical agglomerative clustering. The two nearest clusters are repeatedly merged. Various types are possible (a call sketch follows after this list):

  • single. The distance between two clusters is defined as the minimum distance between any two objects of the two clusters.
  • average. The distance between two clusters is defined as the average distance between all objects of the merged cluster.
  • central. The distance between two clusters is defined as the distance between the two cluster centers.
  • complete. The distance between two clusters is defined as the maximum distance between any two objects of the two clusters.
clustk EM-based clustering; various types are possible:

  • kmeans. Objects are iteratively assigned to the nearest cluster mean. The final cluster prototype is the object closest to the cluster mean.
  • kcentres. Objects are iteratively assigned to the nearest cluster center.
  • kmedoids. Objects are iteratively assigned to the nearest cluster medoid.
clustkh Hierarchical clustering on top of a kmeans clustering by clustk, making clusth feasible for large datasets.
clustm KNN mode seeking clustering
clusts Mean shift, kernel based mode seeking clustering
clustr Random clustering
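
The linkage and cluster types listed above are presumably selected by an additional argument in the call. The exact option syntax is not documented on this page, so the sketch below is an assumption; consult help clusth and help clustk for the exact syntax.

A     = gendatb([100 100]);              % 200 objects in a 2D feature space
LABC1 = clusth(A,[2 5 10],'single');     % assumed: linkage type as third argument
LABC2 = clustk(A,[2 5 10],'kmedoids');   % assumed: cluster type as third argument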

The use of these procedures and some of their differences are shown and discussed in a 2D example.

Distance based cluster routines

Cluster analysis often depends on the set of all distances between a given set of objects. Such distances may be based on a non-Euclidean distance measure, resulting, in PRTools terminology, in a dissimilarity matrix. The below routines all operate on a square dissimilarity matrix D of size M*M, where M is the number of objects. They all result in an M*N multilevel clustering LABC, i.e. N clusterings of the M objects. The sizes of these clusterings are always determined by a second parameter K. As Matlab matrices of size M*M have to fit in memory, dissimilarity matrices for about M = 5000 objects are about the limit (a 5000*5000 double matrix already takes 200 MB).

LABC = dcluste(D,K);
LABC = D*dcluste(K);

These distance-based routines each have a unique relation with one of the above feature-based routines: applying them to the Euclidean distance matrix of a set of objects corresponds to applying the related feature-based routine to the objects themselves. The reverse is not true for the mean shift procedure (clusts), nor for kmeans (an option of clustk), as these necessarily need the feature space. Moreover, the exemplar clustering (dcluste) as well as hierarchical agglomerative clustering (dclusth) are in fact defined on given (dis)similarity matrices. Thereby they suffer from the dataset size restrictions mentioned above.
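
This relation can be illustrated by converting a feature set into a Euclidean dissimilarity matrix and calling the corresponding distance-based routine. A minimal sketch, assuming the PRTools routine distm (squared Euclidean distances) is available:

A     = gendatb([100 100]);   % 200 objects in a 2D feature space
D     = sqrt(+distm(A));      % Euclidean dissimilarity matrix (+ converts to doubles)
LABC1 = clusth(A,5);          % feature-based call
LABC2 = dclusth(D,5);         % distance-based call on the precomputed matrix
% For Euclidean distances the two results are expected to represent the same clustering.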

dcluste Exemplar clustering by message passing between objects (affinity propagation, Frey and Dueck).
dclustf Clustering by the Farthest First Traversal algorithm
dclusth Hierarchical agglomerative clustering. The two nearest clusters are repeatedly merged. Various types are possible:

  • single. The distance between two clusters is defined as the minimum distance between any two objects of the two clusters.
  • average. The distance between two clusters is defined as the average distance between all objects of the merged cluster.
  • central. The distance between two clusters is defined as the distance between the two cluster centers.
  • complete. The distance between two clusters is defined as the maximum distance between any two objects of the two clusters.
dclustk EM-based clustering; two types are possible:

  • kcentres. Objects are iteratively assigned to the nearest cluster center.
  • kmedoids. Objects are iteratively assigned to the nearest cluster medoid.
dclustm KNN mode seeking clustering
dclustr Random clustering

The use of these procedures and some of their differences are shown and discussed in a 2D example.

Back-end routines

The above routines offer a standard user call. Some of them are based on other, more general routines in which more specific parameters can be set. These, as well as some PRTools cluster routines, are listed in the table below.

emclust A PRTools clustering routine that uses an arbitrary untrained classifier in an EM algorithm.
hclust The PRTools routine for hierarchical clustering. It is called by dclusth, which is also the basis for clusth, reclusth and reclustk.
kcentres The PRTools version of the K-Centres procedure. It is based on a different implementation than clustk and dclustk.
meanshift This is the ClusterTools back-end routine for clusts; it offers some additional parameters.
modeclust This is the full, most recent version of the ClusterTools KNN mode seeking procedure. It offers some additional options, among others the possibility of a user-defined distance measure. The routine is called by modeclustf and thereby by clustm for small datasets.
modeclustf The fast version of modeclust. It approximates the results of that routine depending on a complexity parameter.
modeseek The original PRTools version of KNN mode seeking clustering. It is not used by ClusterTools and operates on dissimilarity matrices only. The code is very simple and thereby makes clear how the algorithm works.
prkmeans The PRTools K-Means algorithm. It is based on a different implementation than clustk and is not used by ClusterTools.
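
As an illustration, the PRTools back-end hclust can be called directly on a dissimilarity matrix. This is a sketch under the assumption that hclust follows the classical PRTools signature hclust(D,TYPE,K); check help hclust in your installation.

A   = gendatb([50 50]);          % 100 objects in a 2D feature space
D   = sqrt(+distm(A));           % Euclidean dissimilarity matrix
LAB = hclust(D,'complete',3);    % assumed signature: complete linkage, 3 clusters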

 
