PRTools examples: Dissimilarities
Dissimilarities offer an alternative to features for representing objects in a vector space. They are similar to kernels, but more general. They can be used on top of a given feature representation but, more importantly, they are often computed directly from raw data: images, spectra, time signals. Some ways to compute dissimilarities in a given feature space are given here. There is a separate set of examples on handling dissimilarities. It is assumed that the reader is familiar with the introductory sections of the user guide.
The main routine for computing similarities and dissimilarities is proxm. Formally it is a trainable mapping. During training, the training set is simply stored. During execution, the proximities between a ‘test’ set and the ‘training’ set are computed. Let us take the example of the L1 distance, which is identical to the Minkowski-1 metric.
A = gendatb([10 10]) % given set of 20 objects in a 2D space
B = gendatb([50 50]) % new set of 100 objects in the same space
W = A*proxm('m',1) % define proximity mapping, 2 to 20
D = B*W % resulting dissimilarity matrix 100 x 20
The ‘trained’ mapping W (it is trained only in PRTools terms: nothing is optimized, the data is just stored) maps data from a 2D space (the number of features of the input space) to a 20D space (the number of objects in A). After applying this mapping to a dataset of 100 objects, a 100*20 dissimilarity matrix D is obtained, which can also be considered as a set of 100 objects in a 20D space.
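What such a mapping computes can be sketched in plain MATLAB, without PRTools. The matrices R and T below are hypothetical stand-ins for the stored (‘training’) objects and the new (‘test’) objects. This only illustrates the L1 distance itself and is not meant to show how proxm is implemented.

```matlab
% Plain-MATLAB sketch of an L1 (city-block) proximity computation.
% R and T are hypothetical example data; rows are objects, columns are features.
R = [0 0; 1 2; 3 1];      % 3 stored ('training') objects in a 2D space
T = [1 1; 2 0];           % 2 new ('test') objects in the same space

% One row per test object, one column per stored object.
D = zeros(size(T,1), size(R,1));
for i = 1:size(T,1)
    for j = 1:size(R,1)
        D(i,j) = sum(abs(T(i,:) - R(j,:)));   % L1 distance between T(i,:) and R(j,:)
    end
end
disp(D)
```

Each test object ends up described by as many numbers as there are stored objects, which is why the mapping goes from 2 to 3 dimensions in this small case and from 2 to 20 in the example above.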
A special case is the computation of Euclidean distances. This is the same as an L2 distance, a Minkowski-2 distance (proxm('m',2)), or a Euclidean distance to the first power (proxm('d',1)). These yield the same results, which, moreover, can be obtained by the routine distm, which is especially designed for Euclidean distances.
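A possible way to compute the three matrices is sketched here. The names D1, D2 and D3 follow the check used in this section, but the assignments themselves are an assumption. The sketch requires PRTools on the MATLAB path and relies on distm returning squared Euclidean distances, so a square root is taken.

```matlab
% Sketch only: requires PRTools. The assignments to D1, D2, D3 are assumed.
A = gendatb([10 10]);        % 'training' set: 20 objects in a 2D space
B = gendatb([50 50]);        % 'test' set: 100 objects in the same space

D1 = B*(A*proxm('m',2));     % Minkowski-2 distances
D2 = B*(A*proxm('d',1));     % Euclidean distances to the first power
D3 = sqrt(+distm(B,A));      % distm yields squared distances; + converts to doubles
```

Here D1 and D2 are PRTools datasets, while D3 is a plain double array because of the + conversion. The unary + in the check is harmless for a double array.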
The results are identical, except for very small differences due to digital noise. This may be checked by:
max(max(abs(+D1 - +D2)))
max(max(abs(+D1 - +D3)))
max(max(abs(+D2 - +D3)))
The last statement shows that D2 and D3 are fully identical, because proxm internally uses distm for the 'd' option. Note, however, the huge difference in computation times.
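The timing difference can be measured with tic and toc, as in the following sketch. The set size is an arbitrary choice, the variable names t1, t2 and t3 are hypothetical, and the measured times depend on the machine and the PRTools version. PRTools is assumed to be on the MATLAB path.

```matlab
% Sketch only: requires PRTools; the set size is an arbitrary assumption.
A = gendatb([500 500]);                    % 1000 objects in a 2D space

tic; D1 = A*(A*proxm('m',2)); t1 = toc;    % Minkowski-2 route
tic; D2 = A*(A*proxm('d',1)); t2 = toc;    % 'd' option (Euclidean, power 1)
tic; D3 = sqrt(+distm(A,A));  t3 = toc;    % direct call of distm, then square root

fprintf('m,2: %.3f s   d,1: %.3f s   distm: %.3f s\n', t1, t2, t3);
```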
Proximity matrices obtained by distm store the class labels of the ‘train’ set as feature labels. They thereby have the same format as a classification matrix. However, the values in a classification matrix are oriented such that higher values mean that the object is closer (more similar) to the class corresponding to the column of the considered matrix element. Dissimilarities thereby have the wrong orientation for direct classification. This can easily be corrected.
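The idea behind the correction can be sketched in plain MATLAB: negate the dissimilarities, so that the highest value in a row points to the nearest ‘train’ object, and take the label of that column. The matrix D and the label vector trainLab below are hypothetical values. These are not the PRTools commands of the original example.

```matlab
% Plain-MATLAB sketch: nearest neighbour labelling from a dissimilarity matrix.
D = [0.2 1.5 0.9; 1.1 0.3 0.4];   % hypothetical 2 x 3 dissimilarities (test x train)
trainLab = [1; 2; 2];             % hypothetical class labels of the 3 'train' objects

S = -D;                           % flip orientation: higher now means more similar
[~, J] = max(S, [], 2);           % per row: column with the highest similarity
lab = trainLab(J);                % assigned label = label of the nearest 'train' object

[~, K] = min(D, [], 2);           % equivalent: smallest dissimilarity per row
disp([J K lab])
```

Both routes select the same columns, so flipping the sign changes only the orientation of the values and not the resulting nearest neighbour assignment.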
The two classification results are identical. The first road, by using proxm, offers, however, the possibility to use other distance measures: e.g. with proxm('m',1) the NN classification result is based on the L1 distance.