PRTools examples: Scatterplots

The main purpose of these examples is to show the possibilities of the scatter plot.

PRTools 2D datasets

PRTools offers a series of 2D problems. A one-liner to generate a dataset, produce a scatter plot and show a classifier is:

A = gendatb; scatterd(A); plotc(A*parzenc)

Copy and paste this line into the MATLAB command window and repeat it a number of times (use the ‘up’ arrow key). The classifier, parzenc, estimates an optimal density kernel and uses the Bayes rule to compute the classifier. The plot command plotc needs an existing scatter plot to obtain the right axes.

Repeat the above for the following 2D examples:

  • gendatb, the banana set
  • gendats, a simple problem, two identical uncorrelated Gaussian distributions
  • gendatd, the difficult classes, two identical, correlated Gaussian distributions
  • gendath, the Highleyman classes, two different Gaussian distributions
  • gendatc, the circular classes
  • gendatl, the Lithuanian classes
  • spirals
  • gendatm, a multi-class problem

Why does the classifier perform so badly on the spirals problem?
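The repetition over the listed generators can also be scripted. The following is a minimal sketch, assuming that every generator in the list can be called without arguments, as the one-liner does for gendatb. The variable names gens and gen are only illustrative.

```matlab
% Generators from the list above, each called with its default settings.
gens = {@gendatb, @gendats, @gendatd, @gendath, ...
        @gendatc, @gendatl, @spirals, @gendatm};
delfigs                            % PRTools routine for deleting all figures
for i = 1:numel(gens)
    gen = gens{i};
    A = gen();                     % generate the dataset
    figure; scatterd(A);           % scatter plot in a new figure
    plotc(A*parzenc);              % train parzenc and draw its boundary
    title(func2str(gen));          % label the figure with the generator name
end
showfigs                           % PRTools routine for showing all figures
```

Running the script several times shows the same variability as repeating the one-liner by hand.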

Sample size

We repeat the above experiment for one case but with different sample sizes. First, again:

A = gendatb; scatterd(A); plotc(A*parzenc)

Repeat it a few times and observe the variability. Here 50 samples per class are used. Now for 1000 samples per class:

A = gendatb([1000,1000]); scatterd(A); plotc(A*parzenc)

Note that it takes longer to produce the scatter plot and much longer to compute the classifier. The computing time for training as well as for executing the Parzen classifier is highly sensitive to the size of the training set. Note in addition that even for 1000 samples per class in this 2D problem the classifier is not stable!
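The growth in computing time can be made visible with the standard MATLAB timers tic and toc. This is a minimal sketch. The intermediate size of 200 samples per class is an assumption added for illustration, and the measured times depend on the machine.

```matlab
sizes = [50 200 1000];             % samples per class (200 is illustrative)
for n = sizes
    A = gendatb([n n]);            % banana set with n samples per class
    tic; W = A*parzenc; t = toc;   % time the training of the Parzen classifier
    fprintf('%4d samples per class: training took %.2f s\n', n, t);
end
```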

Multiple classifiers

Scatter plots can show a set of classifiers, e.g.:

A = gendatb;
W1 = A*fisherc;
W2 = A*qdc;
W3 = A*knnc;
W4 = A*dtc;
scatterd(A);
plotc({W1,W2,W3,W4});

One-liners are sometimes nice:

A = gendatb; scatterd(A); plotc(A*{fisherc,qdc,knnc,dtc})

Note that classifier names are shown in the legend of the plot. They may be changed by the user, e.g.:

A = gendatb;
W1 = A*fisherc;
W2 = A*qdc; W2 = setname(W2,'qdc');
W3 = A*knnc;
W4 = A*dtc; W4 = setname(W4,'dtc');
scatterd(A);
plotc({W1,W2,W3,W4});

Why are the classifiers fisherc and qdc almost identical in these experiments?

Densities

Some classifiers, like qdc and parzenc, are based on estimated densities. In addition, there are some specific routines to estimate densities; see the page in the user guide. Let us take a single class:

A = gendatgauss(100,[0 0]); % generate a 2D Gaussian distribution
W = A*gaussm;               % estimate the density from the data
scatterd(A);                % scatter plot of the data
plotm(W);                   % plot the estimated density

The routine plotm is similar to plotc. It draws density contours instead of a separation boundary. A corresponding scatter plot is always needed for a proper result. The routine gaussm always estimates a Gaussian density, even when that is not appropriate. In the next examples the class densities related to the classifiers qdc and parzenc are shown.

delfigs          % PRTools routine for deleting all figures
A = gendatb; scatterd(A);
plotm(A*qdc); title('Gaussian densities')
figure; scatterd(A);
plotm(A*parzenc); title('Parzen densities')
showfigs         % PRTools routine for showing all figures
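The remark that gaussm fits a Gaussian even when that is not appropriate can be illustrated on a single banana class. This is a hedged sketch. It assumes that the PRTools routine seldat(A,1) selects the objects of the first class, and the title text is only illustrative.

```matlab
A = gendatb;                       % banana set, two curved classes
B = seldat(A,1);                   % assumed usage: keep only the first class
figure; scatterd(B);               % scatter plot of that single class
plotm(B*gaussm);                   % Gaussian density estimated from the class
title('Gaussian fit to one banana class');
```

The elliptical contours do not follow the curved shape of the class, which is what the Parzen estimate in the figures above handles better.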

Resolution

A special option of plotc is that it can show the class regions filled with colors:

delfigs
A = gendatm;

scatterd(A);
plotc(A*qdc,'col')

Usually some white, unassigned areas are shown. This is the result of a low resolution of the plotting grid. PRTools uses a low resolution to speed up plotting. The resolution can be increased by the command gridsize. By default it is 30. Increasing it to 100 is often sufficient:

figure; scatterd(A)
gridsize(100);
plotc(A*qdc,'col')
showfigs

If this is not sufficient, increase the gridsize further to 300.
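The effect of the three grid sizes mentioned above can be compared side by side. This is a minimal sketch that trains the classifier once and redraws it for each resolution. The final call restores the default of 30 stated above.

```matlab
delfigs
A = gendatm;                       % multi-class 2D problem
W = A*qdc;                         % train once, plot several times
for g = [30 100 300]
    gridsize(g);                   % set the plotting resolution
    figure; scatterd(A);
    plotc(W,'col');                % class regions filled with colors
    title(sprintf('gridsize = %d', g));
end
gridsize(30);                      % restore the default resolution
showfigs
```

Plotting becomes noticeably slower as the grid size grows, so the higher settings are best reserved for final figures.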

High dimensional feature spaces

Scatter plots are good for getting a feeling for the distribution of datasets and the properties of classifiers. However, most pattern recognition problems are multidimensional and cannot be shown in a 2D space. Projections of this space can be shown but may give the wrong impression. Here are two examples. We use the Iris dataset. It becomes available in PRTools after the command prdatasets, which facilitates the loading of a set of standard real-world problems from the PRTools website.

delfigs
prdatasets;
A = iris
scatterd(A*pcam(A,2)); title('PCA')

This shows a 2D PCA space. Another option, for not so high-dimensional spaces, is to inspect some, or in this case all, combinations of two features:

delfigs
figure; scatterd(A(:,[1 2])); title('1 2');
figure; scatterd(A(:,[1 3])); title('1 3');
figure; scatterd(A(:,[1 4])); title('1 4');
figure; scatterd(A(:,[2 3])); title('2 3');
figure; scatterd(A(:,[2 4])); title('2 4');
figure; scatterd(A(:,[3 4])); title('3 4');
showfigs

Note that PRTools shows the feature names (called feature labels in PRTools terminology) along the axes (it may be necessary to enlarge a figure to see them).
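For datasets with more features, writing out every pair by hand becomes tedious. The same figures can be produced in a loop. This is a minimal sketch using the standard MATLAB function nchoosek; the variable name pairs is only illustrative.

```matlab
prdatasets;
A = iris;                                  % 4 features
pairs = nchoosek(1:size(A,2), 2);          % all combinations of two features
delfigs
for k = 1:size(pairs,1)
    figure; scatterd(A(:,pairs(k,:)));     % scatter plot of one feature pair
    title(sprintf('%d %d', pairs(k,1), pairs(k,2)));
end
showfigs
```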

A much more convenient way to inspect 2D subspaces is the routine scatterdui. It has clickable axes, which make it possible to select subspaces interactively.

delfigs; scatterdui(A)

Of course, PCA spaces can be inspected as well:

scatterdui(A*pcam(A))

Play around with the arrows to adjust the selected dimensions.

An important point to realize is that multi-dimensional classifiers cannot be shown in a 2D plot. See the FAQ about this.
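What can be plotted is a classifier that is itself trained in a 2D subspace. The sketch below does this for two Iris features. The choice of features 3 and 4 and of qdc is an assumption made for illustration, and the resulting classifier is not the same as one trained on all four features.

```matlab
prdatasets;
A = iris;
B = A(:,[3 4]);                    % keep two features only (illustrative choice)
W = B*qdc;                         % classifier trained in this 2D subspace
figure; scatterd(B);               % scatter plot of the 2D subspace
plotc(W);                          % boundary of the 2D classifier
```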
