Datafiles

On this page the datafile is introduced as an essential extension of the dataset.

Datafile background

Datasets as offered by PRTools represent objects by vectors, traditionally interpreted as features, in a vector space. Defining such a representation and the procedures to compute it is an important aspect of pattern recognition. This may be done on the basis of other, larger feature vectors by feature selection or feature extraction. In practice, however, it may be necessary to start with raw data such as images, time signals or spectra of varying sizes. These do not fit the PRTools definition of a dataset, which assumes vectors of a fixed and well-defined size.

The datafile variable type was created to integrate raw data in PRTools. A datafile (in PRTools5 called prdatafile) is programmatically a child of prdataset and inherits many of its properties. The difference is that datafiles do not store the data itself but just keep links to data directories on disk. Typically, every file relates to an object. In addition, pre-processing commands may be stored in a datafile. When a datafile is converted to a dataset, all these commands are executed on all objects. They should therefore result in feature vectors of the same size.

Besides integrating raw data in PRTools, datafiles offer the possibility to handle very large datasets. A number of mapping routines can handle datafiles where datasets are expected. These are typically the routines that process the data sequentially.

Datafile definition

Constructor

The datafile constructor looks like

a = prdatafile(directory,type,readcommand,par1,par2,... )

  • directory is the name of the directory on disk where the data is stored.
  • type is the type of datafile, see below. It may be omitted for the default, standard, raw datafile.
  • readcommand is a string with the name of the command used to read individual files (objects). Default is imread.
  • par1, par2, ... are additional parameters needed for readcommand.
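
As an illustration of the argument order, the following minimal sketch writes out the defaults. It assumes a directory named faces containing image files, as in the example at the end of this page, and it assumes that the type is passed as the string 'raw', matching the type names listed on this page.

```matlab
% Minimal sketch; 'faces' is assumed to be an existing directory of images.
% Short form: type and readcommand take their default values.
a = prdatafile('faces');
% Long form: the same datafile with the defaults written out
% (assumption: the type is given as the string 'raw').
b = prdatafile('faces','raw','imread');
```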

Datafile types

There are five types supported for datafiles:

  • raw: Every file is interpreted as a single object in the dataset. Files should be organized in subdirectories, one for each class, so that all objects in a subdirectory belong to the same class. Directory names will be the class names. Files in the main directory are interpreted as unlabeled objects. Files may, for instance, be images of different size. This is the standard, default type.
  • cell: All files should be mat-files containing just a single variable being a cell array. Its elements are interpreted as objects. The file names will be used as labels during construction. This may be changed by the user afterwards.
  • pre-cooked: In this case readcommand should be a user defined command that converts individual files into a dataset.
  • half-baked: All files should be mat-files, containing a single dataset.
  • mature: This is a datafile as constructed by PRTools, using the savedatafile command after execution of all preprocessing defined for the datafile.

The raw type is the original, standard type. It is useful for raw data like images, time signals and spectra. The user should organize the files class-wise in sub-directories before the datafile constructor is called.
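
The class-wise layout expected for the raw type might be prepared as in the sketch below. The root directory, class names and file names are hypothetical and chosen for illustration only. Only standard Matlab commands are used to create the files, and the datafile is constructed only if PRTools is found on the path.

```matlab
% Hypothetical layout: a root directory with one subdirectory per class.
root    = fullfile(tempdir,'toy_datafile');   % hypothetical location
classes = {'bright','dark'};                  % hypothetical class names
for c = 1:numel(classes)
  classdir = fullfile(root,classes{c});
  if ~exist(classdir,'dir'), mkdir(classdir); end
  for k = 1:3
    % images are allowed to differ in size: (7+k)-by-(9+k) pixels
    im = uint8(255*rand(7+k,9+k));
    if c == 2, im = im/4; end                 % darker images for class 'dark'
    imwrite(im,fullfile(classdir,sprintf('obj%d.png',k)));
  end
end
% Construct the raw datafile only when PRTools is available.
if ~isempty(which('prdatafile'))
  a = prdatafile(root);   % subdirectory names are used as class names
end
```

Because the images in this sketch differ in size, a pre-processing step that brings them to a common size would be needed before the datafile can be converted to a dataset.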

An important point to note is that the datafile constructor call stores a mat-file with all information in directory. This is done to speed up loading of an existing, earlier defined datafile, as the constructor may be time consuming because all directories, and for some types all files, have to be inspected.

Relation with datasets

The datafile variable type is a child of the dataset (formally prdatafile is a child of prdataset). According to the Matlab conventions, datafiles thereby inherit all properties of datasets, unless otherwise specified. All annotation defined for datasets (see the dataset structure), like object labeling, is thereby available for datafiles. The main exception is everything that is defined for features: feature labels, feature domains, feature sizes. Features are not yet defined for datafiles, and this type of annotation should be added after datafiles are converted to datasets.
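
The parent-child relation can be inspected with the standard Matlab isa command, as in the sketch below. It assumes the faces directory from the example at the end of this page. getlabels is the PRTools command for retrieving object labels from a dataset; that it applies unchanged to a datafile is assumed here on the basis of the inheritance described above.

```matlab
% Sketch, assuming the 'faces' directory from the example on this page.
a = prdatafile('faces');
is_datafile = isa(a,'prdatafile');   % a is constructed as a datafile
is_dataset  = isa(a,'prdataset');    % tests whether it also counts as a dataset
% Object annotation such as labels is stored in the datafile itself
% (assumption: getlabels works on datafiles as it does on datasets).
lab = getlabels(a);
```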

Datafile overload

Operators and commands for datafiles are described separately. It should be emphasized that all operations applied to datafiles are just stored and only executed when the content of a datafile is needed or when it is converted to a dataset by prdataset(A) or A*datasetm. See the datafile definition above.
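
The two conversion forms mentioned above can be sketched as follows, again assuming the faces directory from the example at the end of this page.

```matlab
% Sketch, assuming the 'faces' directory from the example on this page.
a = prdatafile('faces');   % only references to the files are kept here
% Either of the two calls below reads the files, executes any stored
% commands and returns a dataset.
b = prdataset(a);
c = a*datasetm;
```

The conversion requires that all objects result in vectors of the same size, which holds for this database as all its images have the same size.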

Examples

   % delete all figures
delfigs;
   % download and unpack famous 'faces' database
url = 'http://www.cl.cam.ac.uk/Research/DTG/attarchive/pub/data';
   % url is too long to fit on a single line
url = [url '/att_faces.zip'];
prdownload(url,'faces');
   % remove non-image files
delete faces/README
   % this database consists of a directory of one-person subdirectories
   % the prdatafile command interprets these as classes
a = prdatafile('faces')
   % faces, raw datafile with 400 objects in 40 crisp classes
   % show the first 10 classes, 10 images on a row.
show(selclass(a,[1:10]),10)
   % compute a PCA, convert to dataset first
   % (this is easy here as all images have the same size)
w = a*datasetm*pcam([],2);
   % show the first two eigenfaces
figure; show(w);
   % project all images on the eigenfaces and show the scatter
figure; scatterd(a*w); axis equal
   % show all
showfigs
   % correct eigenface image for equal axis
figure(2); axis equal

Figures produced by this example: the first 10 classes, the first two eigenfaces, and the scatterplot of the first two eigenfaces.