PRTools examples: Datasets

Datasets are the key element of PRTools. It is assumed that the reader is familiar with the introductory sections of the user guide.

Here we present a small example of creating a dataset. We will use the kimia dataset, a set of images:

A = kimia
show(A,12);
classnames(A)

It is an 18-class dataset with 12 black-and-white images per class. Every image has 64x64 pixels. We select just two classes for a simple experiment.

B = selclass(A,{'elephant','turtle'})
show(B,6);
% Compute the total number of pixels in the blob
% (they have value 1) as an estimate of the area
feat1 = sum(B,2)
% Compute the number of 0-1 and 1-0 changes in the
% unfolded image as an estimate of the perimeter
feat2 = sum(abs(B(:,1:end-1)-B(:,2:end)),2)

Try to understand how feat2 estimates the perimeter of a blob in B. Now the two features feat1 and feat2 are used to construct a new dataset X, using the same labels as B.
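As a hint, the two estimates can be tried on a single toy image in plain MATLAB, without PRTools. The sketch below uses a hypothetical 5x5 image with a 3x3 square blob and assumes the image is unfolded column by column (MATLAB's default ordering); whether PRTools stores images in exactly this order is an assumption here.

```matlab
% Hypothetical toy image: a 3x3 square of ones inside a 5x5 frame
im = zeros(5,5);
im(2:4,2:4) = 1;

% Unfold the image into a single row vector (column by column),
% mimicking one object (row) of an image dataset
v = im(:)';

% Area estimate: the number of pixels with value 1
area_est = sum(v,2);

% Perimeter estimate: the number of 0-1 and 1-0 changes
% between consecutive pixels of the unfolded image
perim_est = sum(abs(v(1:end-1)-v(2:end)),2);

fprintf('area estimate: %d, perimeter estimate: %d\n', area_est, perim_est);
```

Every image column that crosses the blob contributes two changes (entering and leaving the blob), so only boundary crossings along the unfolding direction are counted. Boundary segments running parallel to that direction are missed.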

% Construct a dataset, use the original labels of B
X = prdataset([feat1 feat2],getlabels(B));
scatterd(X,'legend');
testk(X,1) % the leave-one-out error of the 1-NN classifier

The nearest neighbors are found in the given feature space, so the result depends on how the features are scaled. The scatter plot is misleading because its two axes use different scales. Look at a properly scaled version:

axis equal

Feature 2 (vertical) is almost insignificant: its range is small compared with that of feature 1, so it hardly influences the distances. Let us scale the features so that each has variance 1:

Y = X*mapex(scalem,'variance');
figure; scatterd(Y,'legend'); axis equal
showfigs

testk(Y,1)

The routine mapex takes care of training a mapping (here scalem) and applying it to the same data. The scatter plot and the result of testk show an improvement.
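The effect of variance scaling can be sketched in plain MATLAB on a small hypothetical feature matrix. This is an illustration of the idea only, assuming that the scaling subtracts the mean of each feature and divides by its standard deviation; the exact normalization used by scalem is not stated here. The code relies on implicit expansion (MATLAB R2016b or later, or Octave).

```matlab
% Hypothetical feature matrix: rows are objects, columns are features.
% The first feature has a much larger range than the second one.
F = [100 2; 200 4; 300 6; 400 8];

% "Training": estimate the mean and standard deviation per feature
mu = mean(F,1);
sd = std(F,0,1);

% "Execution": shift and scale every feature
Fs = (F - mu) ./ sd;

% After scaling, both features have zero mean and unit variance
disp(var(Fs,0,1));
```

After scaling, both features contribute comparably to the Euclidean distances used by the nearest neighbor rule.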

In order to find a better estimate for the perimeter, the objects should be considered as images. This is possible because the dataset preserves the image size information. See the conversion of the dataset to a structure:

struct(B)

The feature size stores the image size:

getfeatsize(B)

A routine like show makes use of this fact. PRTools offers a general routine for computing arbitrary image operations on all objects stored in a dataset:

C = B*filtim('bwperim');
show(C)

Counting the number of pixels in the contour image gives a better estimate of the contour length, as it takes the vertical contributions into account.
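The idea of a contour image can be illustrated on one toy image in plain MATLAB. The sketch below is not the implementation of bwperim; it assumes that a contour pixel is a blob pixel with at least one of its four direct neighbors outside the blob, and it uses a hypothetical 5x5 image.

```matlab
% Hypothetical toy image: a 3x3 square of ones inside a 5x5 frame
im = zeros(5,5);
im(2:4,2:4) = 1;

% Pad with a border of zeros so that every pixel has four neighbors
p = zeros(size(im)+2);
p(2:end-1,2:end-1) = im;

% A pixel is interior if its upper, lower, left and right
% neighbors all belong to the blob
inner = p(1:end-2,2:end-1) & p(3:end,2:end-1) & ...
        p(2:end-1,1:end-2) & p(2:end-1,3:end);

% Contour pixels: blob pixels that are not interior
contour = im & ~inner;

% Contour length estimate: the number of contour pixels
n_contour = sum(contour(:));
fprintf('contour pixels: %d\n', n_contour);
```

For this square all blob pixels except the central one lie on the contour, so boundary segments in both directions contribute to the count.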

Exercises:

  1. compute feat3, the feature based on the improved contour length estimate.
  2. construct a dataset based on feat1 and feat3 and test its performance.
  3. create a scatter plot of the two contour length estimators.
  4. test how good feat3 alone is. Explain the result on the basis of the object images.
