Although ClusterTools is based on PRTools, users only need to be familiar with a few elementary facts about PRTools. That package is specifically designed for training and executing classifiers in multi-dimensional feature spaces. In order to simplify the handling of datasets and procedures, two programming classes are defined in PRTools:

  • Datasets. To avoid clashes with datasets defined in Matlab, the constructor is called prdataset. Among others, it includes the storage of multi-dimensional vectors (doubles), object labels (characters or numbers) and a dataset name (character string). Datasets have a size that is the same as the size of the raw data (e.g. [m,k] for m objects given in a k-dimensional feature space).
  • Mappings convert a dataset from one space to another. The constructor in PRTools is called prmapping. The three main types are untrained, trained and fixed. An untrained mapping can be trained by a labeled dataset and yields a classifier, which is thereby a trained mapping. Such a mapping, like a fixed mapping, can be applied to new datasets. A classifier outputs class confidences or labels. These are estimates, optimized during training. A fixed mapping transforms a dataset into another dataset or a set of numbers according to a unique, fixed procedure.
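As an illustration of these two classes, a labeled dataset may be generated and a mapping trained and applied as follows. This is a minimal sketch using the standard PRTools routines gendatb, ldc and labeld; the exact behavior may differ per PRTools version:

```matlab
a = gendatb(100);    % labeled 2D prdataset of 100 objects, size [100,2]
w = ldc(a);          % train the untrained mapping ldc: yields a trained mapping
d = a*w;             % apply the trained mapping (classifier) to a dataset
lab = d*labeld;      % retrieve the estimated class labels
```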

Cluster routines in ClusterTools are programmed as fixed mappings. They can operate either on PRTools datasets, neglecting possible labels, or on a plain dataset defined by a double array, both of size [m,k]. For every object they output a pointer, a number between 1 and m, pointing to the prototype of the cluster it is assigned to.
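The pointer representation can be illustrated by a small sketch; clustk is the ClusterTools kmeans routine discussed below, and the prototype retrieval shown here is an assumption about how the pointers may be used:

```matlab
x = rand(100,2);     % plain double array of size [m,k], here m = 100
labc = clustk(x,3);  % [100,1] vector of pointers, each between 1 and 100
p = unique(labc);    % indices of the (at most) 3 prototype objects
protos = x(p,:);     % the prototype objects themselves
```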

All cluster routines have a parameter that either smooths the data or defines the number of desired clusters. This parameter can also be a vector of length n. The cluster routines will then output an [m,n] array of pointers. Each of the n columns defines a clustering related to a specific parameter value. The whole output is called a multilevel or multi-scale clustering.
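For instance, a single call with a parameter vector might produce a multilevel clustering like this (a sketch, assuming clustk as in the examples below):

```matlab
x = rand(500,2);            % 500 objects in a 2D space
labc = clustk(x,[2 5 10]);  % [500,3] array: one clustering per column
lab5 = labc(:,2);           % the column of the clustering into 5 clusters
```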

The table below contains a number of examples of calls to cluster routines using PRTools coding. It is assumed that a is a dataset of m objects in some multi-dimensional space.

a = gendatclust(1000) The ClusterTools command gendatclust generates 1000 objects of a 2D cluster toy problem with 10 clusters.
scattern(a) 2D datasets can be plotted by scattern (and by scatterd).
prdatasets This lists the public domain datasets directly accessible from PRTools.
x = +a; This PRTools statement transforms a labeled PRTools dataset into a double array of the same size. Effectively it removes a possible labeling, thereby guaranteeing that known object labels are not used during clustering. Usually this statement is not needed by users as it is executed internally in every cluster routine.
labels = getlabels(a); The objects stored in a dataset can be labeled. Labels can be strings or arbitrary numbers. By getlabels they can be retrieved.
labt = getnlab(a); The dataset labels are internally stored in a table. For every object a pointer to its label in the table is stored. In ClusterTools these indices are used in evaluation routines comparing clusterings with true labels. getnlab retrieves these indices from a dataset.
labc = clustk(x,5); clustk runs kmeans clustering, resulting in an [m,1] column vector of pointers to 5 cluster prototypes.
labc = x*clustk(5); This is the same, using the PRTools way to call a mapping using the *-operator. It is overloaded to be a piping operator.
labc = x*clustk([2 3 5 10 20]); Runs five kmeans clusterings yielding 2, 3, 5, 10 and 20 clusters, resulting in an [m,5] multilevel clustering labc.
scattern(prdataset(x,labc(:,5))) Uses the PRTools command scattern to make a scatterplot of the kmeans clustering result obtained above for k = 20.
scatn(labc,x,'KMeans for K = 20') The same, now by the ClusterTools command scatn.

Users of ClusterTools should be aware of some global settings defining the PRTools environment. In particular they should realize that there is a time limit, controlled by prtime, that will stop any iterative optimization prematurely, possibly causing a sub-optimal result. This has been introduced to make the interactive study of cluster routines feasible.
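The limit can be inspected or raised when a fully converged result is preferred. The exact interface of prtime may differ between PRTools versions, so the following is just a sketch:

```matlab
prtime        % show the current maximum optimization time (in seconds)
prtime(60);   % allow iterative optimizations to run for up to 60 seconds
```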
