PRTools examples: Scatterplots
The main purpose of these examples is to show the possibilities of the scatter plot.
- PRTools 2D datasets
- Sample size
- Multiple classifiers
- Densities
- Resolution
- High dimensional feature spaces
PRTools 2D datasets
PRTools offers a series of 2D problems. A one-liner to generate a dataset, produce a scatter plot and show a classifier is:
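A minimal sketch of such a one-liner, assuming the banana set gendatb as the dataset (any of the 2D generators listed further on could take its place):

```matlab
% Sketch: generate a 2D dataset, draw its scatter plot, and overlay
% the decision boundary of a Parzen classifier trained on the same data.
% The choice of gendatb is an assumption made for illustration.
A = gendatb; scatterd(A); plotc(A*parzenc)
```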
Copy and paste this line into the MATLAB command window and repeat it a number of times (use the ‘up’ arrow of the keyboard). The classifier, parzenc, estimates an optimal density kernel and uses the Bayes rule to compute the classifier. The plot command plotc needs a corresponding scatter plot to be present so that it draws in the right axes.
Repeat the above for the following 2D examples:
- gendatb, the banana set
- gendats, a simple problem, two identical uncorrelated Gaussian distributions
- gendatd, the difficult classes, two identical, correlated Gaussian distributions
- gendath, the Highleyman classes, two different Gaussian distributions
- gendatc, the circular classes
- gendatl, the Lithuanian classes
- spirals
- gendatm, a multi-class problem
Why does the classifier perform so badly on the spirals problem?
Sample size
We repeat the above experiment for one case but with different sample sizes. First, again:
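A minimal sketch of that line with the sample size written out, assuming the banana set is the chosen case and that gendatb accepts a vector with the number of objects per class:

```matlab
% Sketch: 50 objects per class, stated explicitly.
% The dataset choice (gendatb) is an assumption made for illustration.
A = gendatb([50 50]); scatterd(A); plotc(A*parzenc)
```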
Repeat it a few times and observe the variability. Here 50 samples per class are used. Now for 1000 samples per class:
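A sketch of the larger experiment, again assuming gendatb. The tic/toc calls are not part of the original experiment; they are added here only to make the computing times visible:

```matlab
% Sketch: 1000 objects per class (dataset choice is an assumption).
A = gendatb([1000 1000]);
tic; scatterd(A);  toc   % time needed for the scatter plot
tic; W = A*parzenc; toc  % time needed to train the Parzen classifier
tic; plotc(W);     toc   % time needed to evaluate it on the plotting grid
```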
Note that it takes longer to produce the scatter plot and much longer to compute the classifier. The computing time for training as well as for executing the Parzen classifier is highly sensitive to the size of the training set. Note in addition that even with 1000 samples per class in this 2D problem the classifier is not stable!
Multiple classifiers
Scatter plots can show a set of classifiers, e.g.:
A = gendatb;
W1 = A*fisherc;
W2 = A*qdc;
W3 = A*knnc;
W4 = A*dtc;
scatterd(A);
plotc({W1,W2,W3,W4});
One-liners are sometimes nice:
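A minimal sketch of such a one-liner, assuming that multiplying a dataset by a cell array of untrained classifiers returns a cell array of trained classifiers that plotc can draw:

```matlab
% Sketch: train four classifiers at once and plot them in one scatter plot.
% The cell-array training shorthand A*{...} is assumed to be supported.
A = gendatb; scatterd(A); plotc(A*{fisherc,qdc,knnc,dtc})
```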
Note that classifier names are shown in the legend of the plot. They may be changed by the user, e.g.:
A = gendatb;
W1 = A*fisherc;
W2 = A*qdc; W2 = setname(W2,'qdc');
W3 = A*knnc;
W4 = A*dtc; W4 = setname(W4,'dtc');
scatterd(A);
plotc({W1,W2,W3,W4});
Why are the classifiers fisherc and qdc almost identical in these experiments?
Densities
Some classifiers like qdc and parzenc are based on estimated densities. In addition there are some specific routines to estimate densities, see the page in the user guide. Let us take a single class:
A = gendatgauss(100,[0 0]); % generate a 2D Gaussian distribution
W = A*gaussm; % estimate the density from the data
scatterd(A); % scatter plot of the data
plotm(W); % plot the estimated density
The routine plotm is similar to plotc. It draws density contours instead of a separation boundary. The corresponding scatter plot is always needed for a proper result. The routine gaussm always estimates a Gaussian density, even when this is not appropriate. In the next examples the class densities related to the classifiers qdc and parzenc are shown.
delfigs % PRTools routine for deleting all figures
A = gendatb;
scatterd(A);
plotm(A*qdc); title('Gaussian densities')
figure;
scatterd(A);
plotm(A*parzenc); title('Parzen densities')
showfigs % PRTools routine for showing all figures
Resolution
A special option of plotc is that it can show the class regions filled by colors:
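A minimal sketch of this option, assuming the banana set and the qdc classifier, with the plotting grid left at its default resolution:

```matlab
% Sketch: fill the class regions with colors via the 'col' option of plotc.
% Dataset and classifier choice are assumptions made for illustration.
A = gendatb; figure; scatterd(A); plotc(A*qdc,'col')
```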
Usually some white, unassigned areas are shown. This is the result of the low resolution of the plotting grid. PRTools uses a low resolution to speed up plotting. The resolution can be increased by the command gridsize. By default it is 30. Increasing it to 100 is often sufficient:
figure;
scatterd(A)
gridsize(100);
plotc(A*qdc,'col')
showfigs
If this is not sufficient, increase the gridsize further to 300.
High dimensional feature spaces
Scatter plots are good for getting some feeling for the distribution of datasets and the properties of classifiers. However, most pattern recognition problems are multidimensional and cannot be shown in a 2D space. Projections of this space can be shown but may give the wrong impression. Here are two examples. We use the Iris dataset. It becomes available in PRTools after running the command prdatasets, which facilitates the loading of a set of standard real-world problems from the PRTools website.
delfigs
prdatasets;
A = iris;
scatterd(A*pcam(A,2)); title('PCA')
This shows a 2D PCA space. Another option, for spaces of not too high a dimension, is to inspect some, or in this case all, combinations of two features:
delfigs
figure; scatterd(A(:,[1 2])); title('1 2');
figure; scatterd(A(:,[1 3])); title('1 3');
figure; scatterd(A(:,[1 4])); title('1 4');
figure; scatterd(A(:,[2 3])); title('2 3');
figure; scatterd(A(:,[2 4])); title('2 4');
figure; scatterd(A(:,[3 4])); title('3 4');
showfigs
Note that PRTools shows the feature names (called feature labels in PRTools terminology) along the axes (you may need to enlarge a figure to see them).
A much more convenient way to inspect 2D subspaces is by the routine scatterdui. It has clickable axes which makes it possible to interactively select subspaces.
delfigs; scatterdui(A)
Of course, PCA spaces can be inspected as well:
scatterdui(A*pcam(A))
Play around with arrows adjusting the selected dimensions.
An important point to realize is that multi-dimensional classifiers cannot be shown in a 2D plot. See the FAQ about this.
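What can be shown is a classifier trained in a 2D subspace. The following is a minimal sketch, assuming features 3 and 4 of the Iris dataset and the qdc classifier. The boundary drawn belongs to this 2D classifier only; it is not a projection of a classifier trained on all four features:

```matlab
% Sketch: train and plot a classifier in a 2D feature subspace.
% The feature pair [3 4] and the classifier qdc are illustrative choices.
prdatasets;
A = iris;
B = A(:,[3 4]);        % keep two of the four features
figure; scatterd(B);   % scatter plot of the 2D subspace
plotc(B*qdc)           % classifier trained on these two features only
```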