PRTools examples: Evaluation

Evaluation is the proof of the pudding in the design of pattern recognition systems. It is assumed that the reader is familiar with the introductory sections of the user guide.

There is a specific set of evaluation routines. Here some examples will be shown for three topics and the related routines:

Classification error

The main evaluation routine is testc. It estimates the performance of a trained classifier on a given dataset. For a good estimate this test set should be labeled, randomly drawn from the problem of interest, independent of the training set, and as large as possible. Let us take a small training set from the Highleyman problem:

T = gendath([50 50]);
W = T*fisherc;
[e,N] = T*W*testc

When these statements are repeated a few times, the classification error deviates around 0.13 and the numbers of erroneously classified objects for the two classes will be something like 11 for one class and 2 for the other. We will take a single trained classifier W and inspect its error on a very large test set, e.g. one million objects per class:

[e,N] = gendath([1000000 1000000])*W*testc

Repeat this statement a few times. In spite of the very large test set, the results still show some variability. The error in the error follows from the standard deviation of a binomial distribution:

$$\sigma_{\hat{\epsilon}} = \sqrt{\frac{\epsilon (1-\epsilon)}{n}}$$

in which $$\epsilon$$ is the true error, $$\hat{\epsilon}$$ its estimate, and $$n$$ the number of objects in the test set.
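As a quick numerical check of this formula, the standard deviation can be computed for several test set sizes. The value 0.13 below is simply the error level observed in the experiment above:

    epsilon = 0.13;                   % assumed true error (observed above)
    n = [100 10000 1000000];          % test set sizes
    sigma = sqrt(epsilon*(1-epsilon)./n)
    % roughly 0.034, 0.0034 and 0.00034: every factor 100 in the
    % test set size reduces the error in the error by a factor 10

This also explains the residual variability seen with one million objects per class: a standard deviation in the order of 0.0002 on a two-million-object test set is small, but not zero.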

Exercise: the error in the error

  1. Verify the above formula for $$n = 100, 10000, 1000000$$.
  2. Estimate the standard deviation of the variability in the performance of a classifier based on just 50 objects per class.

Note that testc uses the class prior probabilities stored in the test set. The errors for the separate classes are weighted by the class priors. It is thereby not needed that the classes are equally represented in the test set. There is another test routine, testd, that just reports the fraction of erroneously classified objects, unweighted by class priors. In short: testc evaluates the classifier by a test set, testd evaluates the dataset by a classifier.
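The effect of the prior weighting can be made visible with an unbalanced test set. The following sketch stores equal class priors in the test set with setprior; the class sizes [80 20] are arbitrary, and it is assumed here that testd accepts a classification dataset in the same way as testc:

    T = gendath([50 50]);        % training set
    W = T*fisherc;               % trained classifier
    S = gendath([80 20]);        % unbalanced test set
    S = setprior(S,[0.5 0.5]);   % store equal class priors
    e_c = S*W*testc              % class errors weighted by the priors
    e_d = S*W*testd              % plain fraction of errors, unweighted

Because testc averages the per-class error fractions according to the stored priors, e_c is not distorted by the unequal class sizes, while e_d is dominated by the larger class.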

testc is a very general routine and can be used in various ways and for different purposes. It can compute many performance measures other than just the classification error, e.g. a soft error estimate, false positives, false negatives, sensitivity, etcetera. It can also report the errors for a set of classifiers over a set of datasets. Here is an example:

U = {nmc, fisherc, qdc, knnc};   %  a set of classifiers
T = {gendatb, gendath, gendatd}; % a set of training sets
% generate test sets for each problem
n = 1000;                        % test set size per class

S = {gendatb([n n]), gendath([n n]), gendatd([n n])};
% train the classifiers, results in a 3x4 cell array.

W = prmap(T,U);
% prmap is used as W = T*U doesn't work for cell arrays             

testc(S,W)                       % test the trained classifier

Experiments like this investigate a set of classifiers over a set of problems in order to find which classifiers are to be preferred for which types of problems.

Exercise: stability of a small training set; size of the test set

  1. Repeat the above experiment a few times and observe which classifiers are best for which problem. Realize that for two dimensions 50 objects per class for training and 1000 for testing is a lot compared with many real world problems.
  2. Are your results consistent with results obtained by using just 50 objects per class for testing?

Estimated class labels

The classification dataset is found by applying a dataset to a trained classifier:

T = gendatb   % generate a training set
S = gendatb   % generate a test set
W = T*fisherc % train a classifier
D = S*W       % compute a classification dataset

D contains true labels (copied from S) and has confidences for all objects for every class. The labels of the classes with the highest confidences (the predicted or estimated labels) can be found by:

labels_pred = D*labeld;
labels_true = getlabels(D);

The classifier test routines testc and testd (see above) are based on counting the number of differences between these two label lists. If this is done class-wise, a so-called confusion matrix is found (see below). A list of all wrongly labeled objects can be obtained by

L = labcmp(labels_pred,labels_true)

Confusion matrix

A confusion matrix $$C$$ reports the fractions or numbers $$C_{ij}$$ of objects of class $$i$$ that are assigned to class $$j$$. Let us take the iris dataset and compute the confusion matrix on the training set for the nearest mean classifier:

A = iris;
W = A*nmc;
D = A*W;
confmat(D)

This results in the following output:

  True           | Estimated Labels
  Labels         | Iris-s Iris-v Iris-v| Totals
  Iris-setosa    |   50      0      0  |   50
  Iris-versicolor|    0     46      4  |   50
  Iris-virginica |    0      7     43  |   50
  Totals         |   50     53     47  |  150

The Iris-setosa class is perfectly separable from the other two classes. Iris-versicolor and Iris-virginica have an overlap that cannot be solved by nmc. Usually one wants to see the confusion matrix for a test set. Inspecting it for a training set, however, can be informative as well, as long as one realizes that the results obtained in this way are optimistically biased.

Exercise: find invariant features

The six datasets mfeat_fou, mfeat_kar, mfeat_zer, mfeat_fac, mfeat_pix and mfeat_mor are based on six different feature sets for the same set of hand-printed digits. Some of these feature sets are rotation invariant: they measure the same feature even if the digits are rotated. The following experiment may be used to find out which ones.

Repeat for each dataset:

  1. Load the dataset, e.g. by A = mfeat_fou
  2. Split it 50-50 in sets for training and testing
  3. Train a simple classifier, e.g. ldc
  4. Find the classification dataset D for the test set
  5. Show the confusion matrix

Study the six confusion matrices and try to reason which feature sets are rotation invariant.
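The steps above can be sketched for one of the datasets as follows (repeat for the other five; gendat with a scalar fraction splits a dataset at random per class, here 50-50):

    A = mfeat_fou;           % load one of the six feature sets
    [T,S] = gendat(A,0.5);   % split 50-50 into training and test sets
    W = T*ldc;               % train a simple classifier
    D = S*W;                 % classification dataset for the test set
    confmat(D)               % show the confusion matrix

For a rotation invariant feature set, digits that are rotated versions of each other (such as 6 and 9) can be expected to show up as confusions in the matrix.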
