DisTools
Here the main commands of the DisTools toolbox for dissimilarity-based pattern recognition are described. It is a Matlab toolbox built on top of PRTools, which should be in the path.
- PRTools introduction page
- PRTools Table of Contents
- DisTools Table of Contents
- DisTools download
- DisTools Introductory Examples
- Dissimilarity Representation Course
DisTools is primarily meant for the analysis of a given set of dissimilarities (possibly non-Euclidean, non-metric or asymmetric) between labeled objects. There are a few commands for computing such dissimilarities. Its main use is, however, to help researchers who are studying dissimilarity measures between objects like images, time signals and spectra for classification purposes. It offers tools that are more advanced than those based on nearest neighbor relations.
There are three sets of tools based on three different interpretations of the dissimilarity matrix. In addition there are some tools for visualization, evaluation and the computation of dissimilarities.
- Given dissimilarities: manipulation and classification
- The dissimilarity space
- Pseudo-Euclidean embedding
- Visualization
- Evaluation
- Computing dissimilarities
In our terminology we distinguish
- training set: the set of labeled objects to be used for training a classifier
- test set: the set of labeled objects to be used for evaluating a classifier
- representation set: the set of objects used for representing other objects in a dissimilarity space or by embedding
In some documents, help files and software comments the following abbreviations may be used: trainset for training set, testset for test set, repset for representation set, dismat for dissimilarity matrix, disspace for dissimilarity space and pe_space for the space obtained by pseudo-Euclidean embedding.
In the documentation below D, D1, D2, etc. are used for dissimilarity matrices. U stands for an untrained classifier, W for a trained classifier and V for a mapping between spaces or representations. Readers not familiar with PRTools should realize that the * operator is overloaded for feeding the left operand to the right (like piping in Unix). Some very simple examples:
U = svc([],1);    % defines the untrained support vector classifier
                  % with regularization parameter C=1
W = D*U;          % trains U by D in a dissimilarity space
W = D*svc([],1);  % alternative notation 1
W = svc(D,1);     % alternative notation 2
e = D*W*testc;    % evaluates W by the training set using testc
See PRTools documentation for more details.
For consistency we will use alternative notation 1, as illustrated above, throughout this documentation. What is used in practice depends on the further software and the coding style of the programmer.
Manipulation of dissimilarity matrices
makesym | Make a dissimilarity matrix symmetric
D2 = D*makesym | default is averaging d and d'
D2 = D* | use min(d,d')
genddat | Generate random training and test sets for dissimilarity data
[DT,DS] = genddat(D,0.5) | use 50% of the data for the trainset (DT), the remaining objects for the testset (DS); the repset is the trainset
[DT,DS] = genddat(D,[10 20],5) | use 10 objects of the first class and 20 of the second for training; the repset is the first 5 objects of the trainset
seldclass | Select a class subset from a square dissimilarity dataset
D2 = seldclass(D,[3 4]) | reduce the square dissimilarity matrix such that only the rows and columns of classes 3 and 4 are preserved
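As a minimal sketch (assuming a square, possibly asymmetric, labeled dissimilarity dataset D; the 50% split is an arbitrary choice), these routines can be chained as follows:
D2 = D*makesym;            % symmetrize D by averaging D and D'
[DT,DS] = genddat(D2,0.5); % 50% for training, repset equals the trainset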
Classification of a given dissimilarity matrix
In the analysis of such matrices it is assumed that the training set and the representation set coincide. The rows usually refer to the test objects and the columns to the training objects. The dissimilarity matrix between the training objects themselves is square. Occasionally a dissimilarity measure can give non-zero values on the diagonal. In some routines these may be replaced by zeros. If in the documentation and in the text below a square matrix is mentioned, it refers to a dissimilarity matrix between a set of objects and itself. D(i,j) is the distance from object i to object j, D(j,i) is the distance from object j to object i.
A dissimilarity matrix constructed as a PRTools dataset should have object labels for the rows and feature labels for the columns. The labels of the two data sets used for constructing the dissimilarity matrix should be used for that. PRTools commands computing dissimilarity matrices take care of this. If a dissimilarity matrix given as a PRTools dataset is transposed (E = D'), then the two labelings are switched automatically by PRTools.
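A minimal sketch of how such a labeled dissimilarity dataset arises (assuming the PRTools routines gendatb and proxm; the Euclidean distance is just an illustrative choice):
a = gendatb([20 20]);  % 2-class dataset, 20 objects per class
d = a*proxm(a,'d',1);  % Euclidean distances; rows and columns inherit the labels of a
getlabels(d)           % object labels (rows)
getfeatlab(d)          % feature labels (columns)
e = d';                % transposing switches the two labelings automatically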
nne | Leave-one-out nearest neighbor error estimated from a square labeled dissimilarity matrix
e = D*nne | Compute the leave-one-out 1NN error of D
nnerror1 | Exact expected NN error from a dissimilarity matrix for a given trainset size per class. The exact expected error is obtained by a full integration.
e = D*nnerror1 | Compute and plot the learning curve based on the 1NN rule
e = D*nnerror1([],20) | Compute and show the exact expected 1NN error for 20 objects per class
nnerror2 | Exact expected NN error from a dissimilarity matrix for a given overall trainset size. The exact expected error is obtained by a full integration.
e = D*nnerror2 | Compute and plot the learning curve based on the 1NN rule
e = D*nnerror2([],20) | Compute and show the exact expected 1NN error for 20 objects
testkd | Test the k-NN classifier for dissimilarity data
e = D*testkd([],3) | Compute for a square dismat the classification error of the kNN rule with k=3 using leave-one-out crossvalidation
e = D*testkd | Compute for a dissimilarity dataset of test data (dissimilarities with a trainset) the classification error using the 1NN rule
knndc | k-nearest neighbor classifier for dissimilarity data
[DT,DS] = genddat(D,0.5); W = DT*knndc; e = DS*W*testc | Generate a trainset and a testset from the given dismat D. The repset equals the training set. Train the k-nearest neighbor classifier (optimizing k) by DT and test it on DS. DT and DS should be dissimilarity datasets based on the same representation set.
testpd | Test the Parzen classifier for dissimilarity data
e = D*testpd([],h) | Compute for a square dismat the classification error of the Parzen classifier for a given value of the smoothing parameter (kernel width) h, using leave-one-out crossvalidation
e = D*testpd | Compute for a dissimilarity dataset of test data (dissimilarities with a trainset) the classification error using the Parzen classifier. The smoothing parameter is optimized by parzenddc.
parzenddc | Parzen classifier for dissimilarity data
[DT,DS] = genddat(D,0.5); W = DT*parzenddc; e = DS*W*testc | Generate a trainset and a testset from the given dismat D. The repset equals the training set. Train the Parzen classifier (optimizing h) by DT and test it on DS. DT and DS should be dissimilarity datasets based on the same representation set.
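As a sketch, the two trainable classifiers above can be compared on a single split (assuming a labeled square dissimilarity dataset D; the 50% split is arbitrary):
[DT,DS] = genddat(D,0.5);  % 50% for training, repset is the trainset
W = DT*{knndc,parzenddc};  % train both dissimilarity-based classifiers
testc(DS,W)                % test errors on the testset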
Dissimilarity space
The dissimilarity space is postulated as a Euclidean vector space. Objects are represented in this space as vectors of given dissimilarities to a representation set of other (or the same) objects. It can be treated in the same way as a traditional feature space in which the features are the dissimilarities to the representation set.
Users have to decide whether they want to use a separate representation set, different from the training and test set. Alternatively they can use (a part of) the training set for representation and may even include test objects, as there is no need for the representation objects to be labeled. The main routine to organize this is genddat. Here are a few examples based on a square dissimilarity matrix D of size 100 x 100 with two classes of 40 and 60 objects respectively and class priors 0.4 and 0.6. Dissimilarity trainsets DT and testsets DS with the same repset are generated.
% 60% for training, repset is trainset
[DT,DS] = genddat(D,0.6);
% DT is a 60 by 60 dataset with 2 classes: [24 36]
% DS is a 40 by 60 dataset with 2 classes: [16 24]

% 60% for training, 10% of the trainset for the repset
[DT,DS] = genddat(D,0.6,0.1);
% DT is a 60 by 7 dataset with 2 classes: [24 36]
% DS is a 40 by 7 dataset with 2 classes: [16 24]
% Note that fractions in numbers of objects are rounded up.

% 60% for training, 0% of the trainset, 10% of the testset for the repset
[DT,DS] = genddat(D,0.6,0,0.1);
% DT is a 60 by 5 dataset with 2 classes: [24 36]
% DS is a 40 by 5 dataset with 2 classes: [16 24]

% similar, but exclude repset from testset
[DT,DS] = genddat(D,0.6,0,0.1,'ex');
% DT is a 60 by 5 dataset with 2 classes: [24 36]
% DS is a 35 by 5 dataset with 2 classes: [14 21]
Once a training set is available as a dissimilarity matrix stored in a dataset DT, it is interpreted in the dissimilarity space approach as a set of training vectors. In principle any classifier suitable for a traditional feature space can be used here as well. An independent test set represented in the same space by DS can be used for evaluation.
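For instance (a sketch assuming DT and DS as generated by genddat above; the choice of fisherc is just an illustration):
w = DT*fisherc;  % train a linear classifier in the dissimilarity space
e = DS*w*testc   % error on the independent test set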
The strength of this approach is that it can be used for any dissimilarity measure, Euclidean, non-Euclidean, metric, non-metric, symmetric or non-symmetric. A drawback is that by this generality the characteristic of the dissimilarity values themselves (positive numbers with the meaning that smaller values indicate more similar objects) is not used. A second drawback is that the dimensionalities can be large, e.g. as large as the training set or even larger (possibly causing high correlations as well). This should be taken into account in the training of classifiers.
A solution for both drawbacks is to reduce the representation set significantly. This can be done by a random selection, as shown above with the genddat function. In addition there are various approaches to do this systematically, e.g. the feature selection routines of PRTools. These do not use the characteristics of dissimilarities but are based on the class separability in the dissimilarity space.
A faster solution is to use the object properties as offered by the individual dissimilarities. The DisTools routine protselfd offers some supervised and some unsupervised possibilities.
Here are some examples. They all compute a selection routine W that may be used by [D1,D2] = D*W, in which D1 contains the selected prototypes for the representation set and D2 contains the other ones.
w = d*featseli | Individual prototype selection using the 1NN LOO criterion
w = d*featself | Forward selection of k prototypes based on the Mahalanobis distance
help feateval | Lists the possible criteria for feature selection
w = d*protselfd | Rank the prototypes (see the help for the criteria); use the first 10
[dt,ds] = genddat(d,0.5) | Split the dataset in a training set and a test set. Compute a selection of 10 prototypes on the training set. Apply it to the trainset and compute the complementary set as well. Do the same for the testset.
Once a proper repset has been selected and applied to the testset as well as to the trainset, any PRTools classifier can be trained and tested using the test routines testc and testd.
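A sketch of a complete dissimilarity-space experiment with prototype selection (the featseli call with the 'NN' criterion and 10 prototypes is an illustrative choice; any selection mapping from the table above can be substituted):
[dt,ds] = genddat(d,0.5);     % split; repset is the trainset
w = dt*featseli([],'NN',10);  % rank prototypes, keep the best 10
v = dt*w*fisherc;             % train a classifier in the reduced disspace
e = ds*w*v*testc              % apply the same selection to the testset and test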
Pseudo-Euclidean embedding
By embedding a dissimilarity matrix $$D$$ a vector space $$X$$ with a given norm $$||.||$$ is constructed such that $$D_{ij} = ||x_i-x_j||$$. Obviously, by definition, the dissimilarity matrix has to be Euclidean for a perfect embedding into a Euclidean vector space. A more general space is the pseudo-Euclidean space, in which any symmetric dissimilarity matrix with a zero diagonal can be perfectly embedded. Even off-diagonal zeros and negative numbers are allowed.
The pseudo-Euclidean space (PE space) consists of two orthogonal Euclidean subspaces, called the positive space $$P$$ and the negative space $$Q$$. By embedding, vectors $$x$$ are constructed in which the first $$p$$ components $$x_p$$ refer to $$P$$ and the following $$q$$ components $$x_q$$ to $$Q$$, so $$x = (x_p, x_q)$$. Usually, for non-degenerate (full rank) dissimilarity matrices it holds that $$p+q = m-1$$ if $$D$$ describes the dissimilarities between $$m$$ objects. The pseudo-Euclidean norm $$||\cdot||_{pe}$$ is now defined as $$||x||_{pe} = \sqrt{||x_p||^2 - ||x_q||^2}$$.
The PE space can be interpreted as a Euclidean space $$P$$ offering an approximate solution for the embedding and a space $$Q$$ in which all corrections are collected. Distances (or the lengths of vectors) are now computed, in the squared sense, by subtracting the corrections from the Euclidean norm (distances). Consequently, squared distances may be zero or even negative, resulting in a zero or imaginary norm.
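As a small numeric illustration of this norm (assuming a signature of (2,1), i.e. a 2-dimensional positive and a 1-dimensional negative subspace):
x   = [3 4 2];                  % vector in the PE space: xp = [3 4], xq = 2
xp  = x(1:2); xq = x(3);
pe2 = sum(xp.^2) - sum(xq.^2);  % squared PE norm: 9 + 16 - 4 = 21
pe  = sqrt(pe2)                 % becomes imaginary whenever pe2 < 0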
The main DisTools routine for PE embedding is called pe_em. It contains various options for returning reduced spaces of lower dimensionality. The mapping produced by the routine may be used to project the original training data onto the PE space, or, alternatively, any other dissimilarity matrix based on the same representation set. It should be realized that this can be very inaccurate for non-Euclidean dissimilarities and high-dimensional spaces, resulting in imaginary distances in the PE space in cases for which the given distances are all positive.
In addition to pe_em there are various routines computing quantities that describe the non-Euclidean characteristics of a dissimilarity matrix.
- The signature is the pair (p,q) describing the dimensionalities of the positive and the negative space.
- The spectrum is the set of ranked eigenvalues of a Gram matrix computed from the dissimilarity matrix. Some eigenvalues are negative in case of non-Euclidean dissimilarities. The eigenvectors related to the positive eigenvalues correspond to the axes of the positive space and those related to the negative eigenvalues to the axes of the negative space.
- The Non-Euclidean Fraction, NEF, is based on the absolute values of the eigenvalues in the spectrum: the sum of the absolute eigenvalues belonging to the negative space divided by the total sum of the absolute values of all eigenvalues (a plain-Matlab sketch of this computation is given below this list).
- The Non-Metricity Fraction, NMF, is the fraction of triplets in the dissimilarity matrix (corresponding to the dissimilarities between three objects) that violates the triangle inequality: the largest of the three is larger than the sum of the other two.
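The following plain-Matlab sketch (no DisTools calls; assuming a square symmetric dissimilarity matrix D with a zero diagonal) shows how the spectrum and the NEF follow from the Gram matrix:
D2 = (+D).^2;                             % squared dissimilarities
n  = size(D2,1);
J  = eye(n) - ones(n)/n;                  % centering matrix
G  = -0.5*J*D2*J;                         % Gram matrix derived from D
e  = eig((G+G')/2);                       % its eigenvalues: the spectrum
nef_value = sum(abs(e(e<0)))/sum(abs(e))  % non-Euclidean fraction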
The table below lists a number of DisTools commands with small examples. They may be applied to a simple non-Euclidean dissimilarity matrix, e.g. as generated in the first two rows:
a = gendatb; d = a*proxm(a,'m',1); | Generate a 2-dimensional dataset, 2 classes, 100 objects. Compute the 100 x 100 dismat based on the Minkowski-1 metric. This dissimilarity is metric but non-Euclidean.
a = gendatb; d = (a*proxm(a,'m',1)).^5; | Generate a 2-dimensional dataset, 2 classes, 100 objects. Compute the 100 x 100 dismat based on the Minkowski-1 metric. Take the 5th power, creating a dissimilarity which is non-metric.
w = d*pe_em; x = d*w; | Compute the PE mapping and apply it to the dissimilarity matrix
[dt,ds] = genddat(d,0.5); w = dt*pe_em; xt = dt*w; xs = ds*w; | Split the given dataset in a trainset and a testset; compute the PE mapping using the trainset; map the trainset; map the testset
w = d*pe_em | Compute the PE mapping
x = d*w | Apply the mapping to the dissimilarity matrix
w*signature | Retrieve the signature of w
x*signature | Retrieve the signature from the PE space
xa = x*euspace | Convert x to the associated space
xp = x* | Convert x to the positive space
xn = x* | Convert x to the negative space
x*nef | Compute the negative eigenfraction from x
xa*nef | Compute the negative eigenfraction from xa
w*nef | Compute the negative eigenfraction from w
d*nmf | Compute the non-metric fraction of d
d.^3*nmf | Compute the non-metric fraction of d.^3
w = d*pe_em | Compute a PE mapping, get its spectrum and plot it
In the next table a number of routines and classifiers are listed that make use of the specific norm defined for a PE space. This norm follows from the signature and is stored in a dataset x that is projected into a PE space by a mapping computed by pe_em: x = d*(d*pe_em); or x = d*pe_em(d);
d = pe_distm(x,y) | Square PE dismat between two datasets |
z = pe_mtimes(x,y) | Matrix multiplication (inner product) in PE space |
k = x*pe_kernelm | Compute linear kernel in PE space |
w = x*pe_nmc | Nearest mean classifier in PE space |
w = x*pe_knnc | KNN classifier in PE space |
w = x*pe_parzenc | Parzen classifier in PE space |
w = x*pe_libsvc | SVM in PE space |
Here is a small example that splits a dataset in a trainset and a testset and computes the performances of four PE classifiers:
a = gendatb;
d = a*proxm(a,'m',1);
w = d*pe_em;
x = d*w;
[xt,xs] = gendat(x,0.5);
w = xt*{pe_nmc,pe_knnc,pe_parzenc,pe_libsvc};
testc(xs,w)
Visualization
Some ways to visualize a dissimilarity matrix d:
imagesc(+d) | Show the data as an image
[lab,den] = hclust(d,'s'); imagesc(+d(lab,lab)) | Perform some clustering that ranks the data; show the dismat as an image in the ranked order
scatterd(d*pca(d,2)) | Show a PCA space of the dissimilarity space
scatterd(d*pe_em(d,[2 0])) | Show the first two dimensions of the positive space after PE embedding
scatterd(d*pe_em(d,[1 1])) | Show the first positive and the first negative dimension after PE embedding
mds | Multi-dimensional scaling
Evaluation
Here some examples are given illustrating how classifiers can be computed and tested for each of the three ways in which a dissimilarity representation can be handled.
[dt,ds] = genddat(d,0.5); w = knndc(dt); e_dm = ds*w*testc | Split the dissimilarity matrix; ds contains dissimilarities to the trainset. Compute the kNN classifier (optimizing k). Classification error on the testset.
[dt,ds] = gendat(d,0.5); w = knnc(dt); e_ds = ds*w*testc | Split the dataset in disspace (based on all data!) in a trainset and a testset. Compute the classifier on the trainset. Classification error on the testset.
[dt,ds] = genddat(d,0.5); w = knnc(dt); e_ds = ds*w*testc | Split the dataset in disspace (based on the trainset) in a trainset and a testset. Compute the classifier on the trainset. Classification error on the testset.
x = d*pe_em; [xt,xs] = gendat(x,0.5); w = xt*pe_knnc; e_pe = xs*w*testc | Compute the PE space from all data. Split in a trainset and a testset. Compute the classifier on the trainset. Classification error on the testset.
[dt,ds] = genddat(d,0.5); w = dt*pe_em; xt = dt*w; xs = ds*w; v = xt*pe_knnc; e_pe = xs*v*testc | Split the dataset in disspace (based on the trainset) in a trainset and a testset. Find the PE space. Map the trainset on the PE space. Map the testset on the PE space. Compute the classifier on the trainset. Classification error on the testset.
In the second and the fourth example the spaces are defined on all data as the gendat
function just selects objects in a given space. In the construction of that space the labels are not used. The test results are thereby still fair, but give an estimate of the performance when test sets of a similar size are used and classifiers are recomputed. This is also called transductive learning: the classifiers are adapted to the test data.
In the third and the fifth example just the training sets are used for building the spaces as the genddat
function uses by default a representation set (columns) that is equal to the trainset. Thereby classifiers can be computed that are valid for all future data. For some small, severely non-Euclidean datasets the projection of the test data on a PE space that has been computed on just the training data yields bad, e.g. complex, results. Therefore many studies are based on transductive learning.
PRTools contains special routines for crossvalidation, learning curves and feature curves: crossval, cleval and clevalf. Sometimes special versions, crossvald and clevald, are needed for dissimilarity data in order to use genddat instead of gendat for splitting datasets. In the transductive approaches this is not needed. Here are some examples.
- Transductive learning: 10-fold crossvalidation of nmc (5 times) in disspace; embedding in the PE space and running a PE classifier.
- Use of crossvald builds the spaces from the trainset only: disspaces with 20 training objects chosen at random for representation; PE spaces recomputed for every new trainset.
- nmc learning curve in disspace, repset is all data, 5 repeats; plot the learning curves of the test and apparent error.
- Compare the learning curves of 3 classifiers in disspace with a repset of 1 training object per class.
- Learning curves of knndc (on the dismat) and knnc (in disspace), repset is the trainset; plot without the apparent error curve.
- Compare the learning curves of kNN in the PE space, disspace and the dismat; the representation set equals the training set.
- Feature curve in disspace using all objects for representation and 50% of the objects for training.
- Feature curve in the PE space using all objects for representation and 50% of the objects for training.
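As a sketch of the transductive variants (the choices of nmc, 10 folds, 5 repeats and the trainset sizes are arbitrary; routine names follow the PRTools conventions used above and may differ between PRTools versions):
e_cv = crossval(d,nmc,10,5);           % 10-fold crossvalidation of nmc in disspace, 5 repeats
e_lc = cleval(d,nmc,[2 3 5 10 20],5);  % nmc learning curve in disspace, repset is all data
plote(e_lc)                            % plot the test and apparent error curves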
Computing dissimilarities
DisTools focuses on the analysis of dissimilarities. The computation of dissimilarities should arise from the application, e.g. dissimilarities between images, shapes, spectra or time signals. However, special distances between objects in vector spaces can also be of interest. In addition to the PRTools routine proxm, DisTools offers a similar routine with an extended set of distances: proxxm. The table below lists all possibilities for its parameter type.
'POLYNOMIAL'   | 'P':   SIGN(A*B').*(A*B').^P
'HOMOGENEOUS'  | 'H':   SIGN(A*B').*(A*B').^P
'EXP'          | 'E':   EXP(-(||A-B||)/P)
'EXP-SUM'      | 'ES':  SUM_Z SIGN(P(Z)) * EXP(-(||A-B||)/P(Z))
'RBF'          | 'R':   EXP(-(||A-B||.^2)/(P*P))
'RBF-SUM'      | 'RS':  SUM_Z SIGN(P(Z)) * EXP(-(||A-B||.^2)/(P(Z)^2))
'SIGMOID'      | 'S':   SIGM(A*B'/P)
'DSIGMOID'     | 'DS':  SIGM(||A-B||^(2P(1))/P(2))
'DISTANCE'     | 'D':   ||A-B||.^P
'MULTIQUADRIC' | 'MQ':  SQRT(||A-B||.^2/P(1) + P(2))
'THIN-PLATE'   | 'TP':  ||A-B||.^(2*P(1))/P(2) * LOG(1+||A-B||.^(2*P(1))/P(2))
'N-THIN-PLATE' | 'NTP': -||A-B||.^(2*P(1))/P(2) * LOG(1+||A-B||.^(2*P(1))/P(2))
'MINKOWSKI'    | 'M':   SUM(|A-B|^P).^(1/P)
'CITY-BLOCK'   | 'C':   SUM(|A-B|)
'COSINE'       | 'O':   1 - (A*B')/||A||*||B||
'FOURIER'      | 'F'
'TANH'         | 'T':   TANH(P*(A*B')/LENGTH(A) + P)
'ANOVA'        | 'A':   ANOVA MODEL
'BSPLINE'      | 'B':   BSPLINE MODEL, ORDER P
'ANOVABSPLINE' | 'AB':  ANOVA-BSPLINE MODEL, ORDER P
'ANOVASPLINE1' | 'AS1': ANOVA-SPLINE MODEL, ORDER 1
'ANOVASPLINE2' | 'AS2': ANOVA-SPLINE MODEL, ORDER 2
'ANOVASPLINE3' | 'AS3': ANOVA-SPLINE MODEL, ORDER 3
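A short usage sketch (assuming a dataset a; the Minkowski parameter 0.8 is just an illustrative non-metric choice):
a = gendatb;              % 2-class dataset, 100 objects
d = a*proxxm(a,'m',0.8);  % 100 x 100 dissimilarity matrix, Minkowski-0.8
d*nmf                     % its non-metric fraction (non-zero: p < 1 violates the triangle inequality)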