PRTools examples: Dimension Reduction

Dimension reduction is used and sometimes necessary to decrease the computational complexity and to increase the classification performance. Dimension reduction itself, however, may be time consuming. We will present some examples. It is assumed that the reader is familiar with the introductory sections of the user guide:

Dimension reduction can be realized by feature selection or by extraction. Extraction is traditionally based on finding linear subspaces. Non-linear procedures may be based on neural networks, kernels or dissimilarities.

There is usual a criterion and a strategy to optimize this criterion.

In case of feature selection one distinguishes filters and wrappers. The criteria for filters are different from the classifier to be used in the final space. The wrapper approach simply minimizes the final classification error over the family of considered spaces. To avoid that the wrapper approach overtrains, cross-validation may be applied. The filter approach may be faster. As they are more simple than classifiers there is a smaller danger of overtraining.

Finally criteria may be supervised, i.e. depending on class labels, or unsupervised.

In PRTools terms, procedures for dimension reduction are trainable mappings. Here are some examples for selection and extraction.

Feature Selection

There are routines for different strategies like individual, forward, backward and floating selection. They are based on criteria computed by feateval. Some are supervised, others are unsupervised. The criteria can be computed on the training set, by cross-validation or on a fixed test set. This is important is the classification error in a wrapper approach is used as a criterion. Experiments are usually very time consuming, so we will restrict to very simple examples based on finding 3 features for the Malaysia dataset, which has in total 8 features and 291 objects in 20 classes.

A = malaysia;
A = setprior(A,getprior(A));    % avoid heaps of warnings

[T,S] = gendat(A,0.5);          % generate train / test set
W1 = A*featseli([],'NN',4);     W1 = setname(W1,'ISel NN');
W2 = A*featseli([],'maha-s',4); W2 = setname(W2,'ISel maha-s');
W3 = A*featseli([],'maha-m',4); W3 = setname(W3,'ISel maha-m');
W4 = A*featseli([],'in-in',4);  W4 = setname(W4,'ISel in-in');
W5 = A*featseli([],ldc,4,5);    W5 = setname(W5,'ISel wrapper');
disp([+W1;+W2;+W3;+W4;+W5])     % show selected features

V = T*({W1,W2,W3,W4,W5}*ldc);   % train all selectors and classifiers
S*V*testc                       % show test result

Note that the criteria maha-s and in-in behave very similar. Moreover, they match good with the classifier ldc as all are based on maximizing the between scatter while minimizing the within scatter. We will use this criterion to compare several strategies.

U1 = featseli([],'maha-s',4); U1 = setname(U1,'ISel maha-s');
U2 = featself([],'maha-s',4); U2 = setname(U2,'FSel maha-s');
U3 = featselb([],'maha-s',4); U3 = setname(U3,'BSel maha-s');
U4 = featselo([],'maha-s',4); U4 = setname(U4,'OSel maha-s');
W = T*{U1,U2,U3,U4}; % use train set for feature selection
disp([+W{1};+W{2};+W{3};+W{4}]) % show selected featurs

V = T*(W*ldc);       % use train set also for classifier training
S*V*testc            % show test results

These results are based on a single split of the data. A full 8-fold cross-validation can be executed by:

prcrossval(A,{U1,U2,U3,U4}*ldc,8,'DPS')

Exercises

  1. Compare in the last example one of the filter feature selectors with a equivalent wrapper.
  2. Compare the learning curves for this wrapper and the filter (selecting features for the same classifier ldc).

Feature extraction

By feature extraction from existing vector representations (non)linear functions are found as a basis for a new representation. Three simple linear feature extractors offered by PRTools are:

  1. Principal component analysis (pcam) based on the eigenvectors found for the entire dataset. This approach is unsupervised. However, if applied to a labeled dataset class priors are used to weight the object contributions to the overall covariance matrix. If class priors are equal to class frequencies objects have equal weights.
  2. Fisher mapping (fisherm), also known as LDA (Linear Discriminant Analysis). It finds the linear subspace that maximizes the class between scatter w.r.t. the class within scatter.
  3. Karhunen Loéve Mapping (klm), it computes the eigenvectors of the averaged class covariance matrices.

In case of images the axes representing a linear subspace point into a direction of the original feature space based on the image pixels. These directions consitute also images. They are the so-called eigen-images.

delfigs
prdatafiles;      % make sure that prdatafiles in in the path
a = faces;        % the ORL database of faces.
a = prdataset(a)  % convert datafile into a dataset
w1 = pcam(a,10);
figure; show(w1); % show first 10 eigenfaces
w2 = fisherm(a,10);
figure; show(w2); % show first 10 fisherfaces
w3 = klm(a,10);
figure; show(w3); % show first 10 KL-faces
showfigs

Exercises

  1. Show the three scatterplots based on the first two features of the three feature extractors discussed above. What is remarkable? Can you find an explanation?
  2. Compute the learning curves for the ldc as well as the qdc classifier in 5-dimensional spaces of the three feature representations.
  3. Compute feature curves for the three feature representations using 5 training objects per class for ldc.
  4. Compute feature curves for the PCA eigenspaces using 1-9 training objects per class for ldc.

elements: datasets datafiles cells and doubles mappings classifiers mapping types.
operations: datasets datafiles cells and doubles mappings classifiers stacked parallel sequential dyadic.
user commands: datasets representation classifiers evaluation clustering examples support routines.
introductory examples: Introduction Scatterplots Datasets Datafiles Mappings Classifiers Evaluation Learning curves Feature curves Dimension reduction Combining classifiers Dissimilarities.
advanced examples.