# ClusterTools and PRTools

Although ClusterTools is based on PRTools, users need to be familiar with only a few elementary facts of PRTools. That package is specifically designed for training and executing classifiers in multi-dimensional feature spaces. To simplify the handling of datasets and procedures, two programming classes are defined in PRTools:

**Datasets**. To avoid clashes with datasets defined in Matlab, the constructor is called `prdataset`. Among other things, a dataset stores multi-dimensional vectors (doubles), object labels (characters or numbers) and a dataset name (character string). A dataset has the same size as its raw data, e.g. `[m,k]` for `m` objects given in a `k`-dimensional feature space.
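As a minimal sketch of these facts, assuming PRTools is on the Matlab path, the lines below build a small labeled dataset from a double array and inspect it. The data values, labels and variable names are made up for illustration.

```matlab
% Illustrative data: 6 objects in a 2-dimensional feature space (made-up values)
x = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
labels = [1 1 1 2 2 2]';       % one numeric label per object

a = prdataset(x, labels);      % PRTools dataset: data plus labels

[m, k] = size(a);              % same size as the raw data: m objects, k features
lab  = getlabels(a);           % the labels as they were supplied
nlab = getnlab(a);             % per-object indices into the internal label table
y    = +a;                     % back to a plain double array, labels removed
```

Here `y` holds the same numbers as `x`; only the labeling has been dropped.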

**Mappings**. Mappings convert a dataset from one space to another. The constructor in PRTools is called `prmapping`. The three main types are untrained, trained and fixed. An untrained mapping can be trained by a labeled dataset and yields a classifier, which is thereby a trained mapping. Such a mapping, like a fixed mapping, can be applied to new datasets. A classifier outputs class confidences or labels. These are estimates, optimized during training. A fixed mapping transforms a dataset into another dataset or a set of numbers according to a unique, fixed procedure.
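The three mapping types can be illustrated with a short sketch. It assumes PRTools is on the Matlab path and uses the PRTools routines `gendatb`, `gendat`, `ldc`, `labeld` and `testc`, none of which are mentioned in the text above. They are chosen here only as an example, and any other classifier could take the place of `ldc`.

```matlab
% Assumes PRTools is on the path; routine choice is illustrative only.
a = gendatb([50 50]);          % two-class toy dataset, 50 objects per class
[t, s] = gendat(a, 0.5);       % random split: t for training, s for testing

u = ldc;                       % untrained mapping (a linear classifier)
w = t*u;                       % training yields a trained mapping, same as ldc(t)

d   = s*w;                     % apply it to new data: class confidences per object
lab = d*labeld;                % the fixed mapping labeld converts confidences to labels
e   = s*w*testc;               % fraction of test objects classified incorrectly
```

The `*` operator pipes a dataset through a mapping, so the same notation covers training (`t*u`) and application (`s*w`).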

Cluster routines in ClusterTools are programmed as fixed mappings. They can operate either on PRTools datasets, neglecting possible labels, or on a plain dataset defined by a double array, both of size `[m,k]`. For every object they output a pointer, a number between `1` and `m`, pointing to the prototype of the cluster it is assigned to.

All cluster routines have a parameter that either controls the smoothing of the data or sets the desired number of clusters. This parameter can also be a vector of length `n`. The cluster routine will then output an `[m,n]` array of pointers. Each of the `n` columns defines a clustering, related to a specific parameter value. The whole output is called a multilevel or multi-scale clustering.
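A multilevel clustering can be inspected column by column. The sketch below assumes PRTools and ClusterTools are on the Matlab path and that `clustk` accepts a vector of cluster numbers as its second argument in the direct call form as well. The values in `kvals` are an arbitrary illustrative choice.

```matlab
a = gendatclust(1000);         % 2D toy problem from ClusterTools
x = +a;                        % plain double array, labels removed

kvals = [2 3 5];               % n = 3 parameter values (illustrative choice)
labc  = clustk(x, kvals);      % multilevel clustering: one column per value

m = size(x,1);
for j = 1:numel(kvals)
    protos = unique(labc(:,j));   % distinct pointers = prototypes in use at level j
    fprintf('k = %d: %d clusters found\n', kvals(j), numel(protos));
end
```

Because every entry of `labc` is an index into the `m` objects, `x(protos,:)` gives the feature vectors of the prototypes at a given level.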

The table below contains a number of examples of calls to cluster routines using PRTools coding. It is assumed that `a` is a dataset of `m` objects in some multi-dimensional space.

| Command | Description |
|---|---|
| `a = gendatclust(1000)` | The ClusterTools command `gendatclust` generates 1000 objects of a 2D cluster toy problem with 10 clusters. |
| `scattern(a)` | 2D datasets can be plotted by `scattern` (and by `scatterd`). |
| `prdatasets` | This lists the public domain datasets directly accessible from PRTools. |
| `x = +a;` | This PRTools statement transforms a labeled PRTools dataset into a double array of the same size. Effectively it removes a possible labeling, thereby guaranteeing that known object labels are not used during clustering. Users usually do not need this statement, as it is executed internally in every cluster routine. |
| `labels = getlabels(a);` | The objects stored in a dataset can be labeled. Labels can be strings or arbitrary numbers. They can be retrieved by `getlabels`. |
| `labt = getnlab(a);` | The dataset labels are internally stored in a table. For every object a pointer to its label in the table is stored. In ClusterTools these indices are used in evaluation routines comparing clusters with true labels. `getnlab` retrieves these indices from a dataset. |
| `labc = clustk(x,5);` | `clustk` runs k-means clustering, resulting in an `[m,1]` column vector of pointers to 5 cluster prototypes. |
| `labc = x*clustk(5);` | This is the same, using the PRTools way to call a mapping with the `*` operator, which is overloaded to be a piping operator. |
| `labc = x*clustk([2 3 5 10 20]);` | Runs five k-means clusterings yielding 2, 3, 5, 10 and 20 clusters, resulting in an `[m,5]` multilevel clustering `labc`. |
| `scattern(prdataset(x,labc(:,5)))` | Uses the PRTools command `scattern` to make a scatterplot of the k-means clustering result obtained above for k = 20. |
| `scatn(labc,x,'KMeans for K = 20')` | The same, now by the ClusterTools command `scatn`. |

Users of ClusterTools should be aware of some global settings that define the PRTools environment. In particular, they should realize that there is a time limit, controlled by `prtime`, that will stop any iterative optimization prematurely, possibly causing a sub-optimal result. This has been introduced to make the interactive study of cluster routines feasible.
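When a result looks sub-optimal, it may help to raise the time limit temporarily. The sketch below rests on an assumption that is not confirmed by the text above: that `prtime` returns the current limit when called without arguments and sets a new limit, in seconds, when called with a number. Check `help prtime` before relying on it. The value 60 is arbitrary.

```matlab
% Assumption: prtime follows a get/set convention; verify with "help prtime".
told = prtime;                 % assumed: returns the current time limit
prtime(60);                    % assumed: sets a new limit (seconds assumed)

a    = gendatclust(1000);      % toy data from ClusterTools
labc = clustk(a, 10);          % the optimization now has more time to converge

prtime(told);                  % restore the previous limit
```

Restoring the old value afterwards keeps the rest of an interactive session responsive.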