ClusterTools and PRTools
Although ClusterTools is based on PRTools, users need to be familiar with only a few elementary facts about PRTools. That package is specifically designed for training and executing classifiers in multi-dimensional feature spaces. To simplify the handling of datasets and procedures, two programming classes are defined in PRTools:
- Datasets. To avoid clashes with datasets defined in Matlab, the constructor is called prdataset. Among other things, a dataset stores multi-dimensional vectors (doubles), object labels (characters or numbers) and a dataset name (character string). A dataset has the same size as its raw data, e.g. [m,k] for m objects given in a k-dimensional feature space.
- Mappings. A mapping converts a dataset from one space to another. Its constructor in PRTools is called prmapping. The three main types are untrained, trained and fixed. An untrained mapping can be trained by a labeled dataset and yields a classifier, which is thereby a trained mapping. Such a mapping, like a fixed mapping, can be applied to new datasets. A classifier outputs class confidences or labels. These are estimates, optimized during training. A fixed mapping transforms a dataset into another dataset or a set of numbers according to a unique, fixed procedure.
Cluster routines in ClusterTools are programmed as fixed mappings. They can operate either on PRTools datasets, neglecting possible labels, or on plain data given as a double array, both of size [m,k]. For every object they output a pointer, a number between 1 and m, pointing to the prototype of the cluster the object is assigned to.
All cluster routines have a parameter that either controls the smoothing of the data or defines the desired number of clusters. This parameter can also be a vector of length n. The cluster routine then outputs an [m,n] array of pointers. Each of the n columns defines a clustering, related to a specific parameter value. The whole output is called a multilevel or multi-scale clustering.
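To make the pointer representation concrete, the following sketch builds a small multilevel result by hand and inspects it with base Matlab functions only. The array labc and its values are invented for illustration. They are not the output of any ClusterTools routine.

```matlab
% Hypothetical multilevel clustering of m = 6 objects at n = 2 levels.
% Each entry is the index (1..m) of the prototype object of the cluster
% the object in that row is assigned to.
labc = [1 1
        1 1
        4 1
        4 1
        6 6
        6 6];

[m, n] = size(labc);

% The number of clusters at a level equals the number of distinct
% prototypes in the corresponding column.
nclust = zeros(1, n);
for j = 1:n
    nclust(j) = numel(unique(labc(:, j)));
end

% Relabel the first level with consecutive cluster numbers 1..nclust(1).
% protos holds the prototype indices, lab1 the cluster number per object.
[protos, ~, lab1] = unique(labc(:, 1));

disp(nclust)   % 3 2
```

Here the first level has three clusters with prototypes 1, 4 and 6, and the second level merges the first two into one cluster.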
The table below contains a number of examples of calls to cluster routines using PRTools coding. It is assumed that a is a dataset of m objects in some multi-dimensional feature space.
| Command | Description |
| --- | --- |
| a = gendatclust(1000) | The ClusterTools command gendatclust generates 1000 objects of a 2D cluster toy problem with 10 clusters. |
| scattern(a) | 2D datasets can be plotted by scattern (and by scatterd). |
| prdatasets | Lists the public domain datasets directly accessible from PRTools. |
| x = +a; | This PRTools statement transforms a labeled PRTools dataset into a double array of the same size. Effectively it removes a possible labeling, thereby guaranteeing that known object labels are not used during clustering. Users usually do not need this statement, as it is executed internally in every cluster routine. |
| labels = getlabels(a); | The objects stored in a dataset can be labeled. Labels can be strings or arbitrary numbers. They can be retrieved by getlabels. |
| labt = getnlab(a); | The dataset labels are internally stored in a table. For every object a pointer to its label in the table is stored. In ClusterTools these indices are used in evaluation routines that compare clusters with true labels. getnlab retrieves these indices from a dataset. |
| labc = clustk(x,5); | clustk runs kmeans clustering, resulting in an [m,1] column vector of pointers to 5 cluster prototypes. |
| labc = x*clustk(5); | This is the same, using the PRTools way to call a mapping with the *-operator, which is overloaded to be a piping operator. |
| labc = x*clustk([2 3 5 10 20]); | Runs five kmeans clusterings yielding 2, 3, 5, 10 and 20 clusters, resulting in an [m,5] multilevel clustering labc. |
| scattern(prdataset(x,labc(:,5))) | Uses the PRTools command scattern to make a scatterplot of the kmeans clustering result obtained above for k = 20. |
| scatn(labc,x, ... 'KMeans for K = 20') | The same, now by the ClusterTools command scatn. |
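The comparison of cluster pointers with true label indices, mentioned above for getnlab, can be sketched in base Matlab as a cross-tabulation. This is an illustration under assumed data only: the vectors labt and labc are made up, and the purity score used here is a common generic measure chosen for the example, not necessarily what the ClusterTools evaluation routines compute.

```matlab
% Hypothetical true label indices for 6 objects, in the form that
% getnlab returns them (one index into the label table per object).
labt = [1 1 1 2 2 2]';

% Hypothetical cluster pointers (prototype object indices) for the
% same 6 objects, in the form of a single-level clustering.
labc = [2 2 5 5 5 5]';

% Map prototype indices to consecutive cluster numbers 1, 2, ...
[~, ~, clu] = unique(labc);

% Contingency table: rows are true classes, columns are clusters.
ct = accumarray([labt clu], 1);

% Purity: fraction of objects that belong to the majority class
% of the cluster they are assigned to.
purity = sum(max(ct, [], 1)) / numel(labt);
```

For this toy input the contingency table is [2 1; 0 3], so five of the six objects fall in the majority class of their cluster.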
Users of ClusterTools should be aware of some global settings that define the PRTools environment. In particular, they should realize that there is a time limit, controlled by prtime, that will stop any iterative optimization prematurely, possibly causing a sub-optimal result. This limit has been introduced to make the interactive study of cluster routines feasible.