On this page the datafile is introduced as an essential extension of the dataset.
Datasets as offered by PRTools represent objects by vectors, traditionally interpreted as features, in a vector space. The definition of such a representation and the procedures to compute it are an important aspect of pattern recognition. A representation may be derived from other, larger feature vectors by feature selection or feature extraction. In practice, however, it may be necessary to start from raw data such as images, time signals or spectra of varying sizes. Such data do not fit the PRTools definition of a dataset, which assumes vectors of a fixed and well-defined size.
The datafile variable type is created to integrate raw data in PRTools. A datafile (in PRTools5 called prdatafile) is programmatically a child of prdataset and inherits many of its properties. The difference is that datafiles do not store the data itself but just keep links to data directories on disk. Typically, every file relates to an object. In addition, pre-processing commands may be stored in a datafile. When a datafile is converted to a dataset, all these commands are executed on all objects. They should therefore result in feature vectors of the same size.
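The idea of storing pre-processing commands and executing them only on conversion can be pictured with a small sketch in plain MATLAB. This is not the PRTools implementation: the struct `df` and its fields `objects` and `commands` are hypothetical names chosen for illustration, and in-memory matrices stand in for files on disk.

```matlab
% Conceptual sketch only: plain MATLAB, not the PRTools implementation.
% The struct fields (objects, commands) are hypothetical names.

% Three "raw objects" of different sizes stand in for files on disk.
df.objects  = {rand(20,30), rand(40,25), rand(15,15)};
df.commands = {};                       % pre-processing is stored, not run

% Storing commands is cheap: nothing is computed yet.
df.commands{end+1} = @(x) x(1:8,1:8);   % crop every object to 8x8
df.commands{end+1} = @(x) x(:)';        % flatten to a row (feature) vector

% Conversion: run all stored commands on all objects.
n = numel(df.objects);
features = cell(n,1);
for i = 1:n
    x = df.objects{i};
    for k = 1:numel(df.commands)
        f = df.commands{k};
        x = f(x);
    end
    features{i} = x;
end

% Stacking only works if every object ends up with the same length.
data = vertcat(features{:});
disp(size(data))                        % 3 objects by 64 features
```

Without the crop step the three objects would yield vectors of different lengths and the final stacking would fail, which is why the stored commands have to produce feature vectors of equal size.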
Besides integrating raw data in PRTools, datafiles offer the possibility to handle very large datasets. A number of mapping routines can handle datafiles where datasets are expected. These are typically the routines that process the data sequentially.
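Why sequential routines suit very large data can be sketched in plain MATLAB. The running mean below is an illustration chosen here, not a routine named by PRTools; each cell stands in for one object that would be read from disk, so only one object needs to be in memory at a time.

```matlab
% Conceptual sketch only: plain MATLAB, not a PRTools routine.
% Each cell stands in for one object that would be read from disk.
objects = {[1 2 3], [3 4 5], [5 6 7], [7 8 9]};

runningMean = zeros(1,3);
for i = 1:numel(objects)
    x = objects{i};                                   % one object at a time
    runningMean = runningMean + (x - runningMean)/i;  % incremental update
end
disp(runningMean)                                     % mean of all objects
```

The result equals the mean computed over all objects at once, but the full data matrix is never formed.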
The datafile constructor looks like
a = prdatafile(directory,type,readcommand,par1,par2,... )
- directory: the name of the directory on disk where the data is stored.
- type: the type of datafile, see below. It may be omitted for the default, standard, raw datafile.
- readcommand: a string with the name of the command used to read individual files (objects). A default read command is used when it is omitted.
- par1, par2, ...: additional parameters needed for readcommand.
There are five types supported for datafiles:
- raw: Every file is interpreted as a single object in the dataset. Files should be organized in directories of objects that belong to the same class. Directory names will be the class names. Files in the main directories are interpreted as unlabeled objects. Files may, for instance, be images of different size. This is the standard, default type.
- cell: All files should be mat-files containing just a single variable being a cell array. Its elements are interpreted as objects. The file names will be used as labels during construction. This may be changed by the user afterwards.
- pre-cooked: In this case readcommand should be a user-defined command that converts individual files into a dataset.
- half-baked: All files should be mat-files, containing a single dataset.
- mature: This is a datafile as constructed by PRTools, using the savedatafile command after execution of all preprocessing defined for the datafile.
The raw type is the original, standard type. It is useful for raw data like images, time signals and spectra. The user should organize the files class-wise in sub-directories before the datafile constructor is called.
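The class-wise directory organization for the raw type can be sketched in plain MATLAB as follows. The directory name, the class names 'apple' and 'pear', and the empty placeholder files are all made up for illustration. The second half only mimics the labeling convention described above (one label per file, taken from its sub-directory name); it is not how PRTools itself is implemented.

```matlab
% Sketch in plain MATLAB; the directory and class names are made up.
root = tempname;                         % hypothetical data directory
mkdir(root);
classes = {'apple','pear'};
for c = 1:numel(classes)
    mkdir(root, classes{c});             % one sub-directory per class
    for k = 1:3
        % empty placeholder files stand in for images or signals
        fid = fopen(fullfile(root, classes{c}, sprintf('obj%d.dat',k)), 'w');
        fclose(fid);
    end
end

% Mimic the raw convention: one label per file,
% taken from the name of the sub-directory that holds it.
labels = {};
d = dir(root);
for i = 1:numel(d)
    if d(i).isdir && ~any(strcmp(d(i).name, {'.','..'}))
        files = dir(fullfile(root, d(i).name, '*.dat'));
        labels = [labels; repmat({d(i).name}, numel(files), 1)];
    end
end
disp(labels)
rmdir(root, 's');                        % clean up the temporary tree
```

A tree organized this way, filled with real data files, is what the constructor call a = prdatafile(directory) expects for the raw type.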
An important point to note is that the datafile constructor call stores a mat-file with all information in directory. This is done to speed up loading of an existing, earlier defined datafile, as the constructor may be time consuming: all directories and, for some types, all files have to be inspected.
Relation with datasets
The datafile variable type is a child of the dataset (formally, prdatafile is a child of prdataset). According to the Matlab conventions, datafiles thereby inherit all properties of datasets, unless otherwise specified. All annotation defined for datasets (see the dataset structure), such as object labeling, is thereby available for datafiles. The main exception is everything that is defined for features: feature labels, feature domains, feature sizes. Features are not yet defined for datafiles, and this type of annotation should be added after datafiles are converted to datasets.
Operators and commands for datafiles are described separately. It should be emphasized that all operations applied to datafiles are just stored and only executed when the content of a datafile is needed or when it is converted to a dataset by A*datasetm. See the datafile definition above.
% delete all figures
delfigs;
% download and unpack famous 'faces' database
url = 'http://www.cl.cam.ac.uk/Research/DTG/attarchive/pub/data';
% url is too long to fit on a single line
url = [url '/att_faces.zip'];
prdownload(url,'faces');
% remove non-image files
delete faces/README
% this database consists of a directory of one-person subdirectories
% the prdatafile command interprets these as classes
a = prdatafile('faces')
% faces, raw datafile with 400 objects in 40 crisp classes
% show the first 10 classes, 10 images on a row.
show(selclass(a,[1:10]),10)
% compute a PCA, convert to dataset first
% (this is easy here as all images have the same size)
w = a*datasetm*pcam([],2);
% show the first two eigenfaces
figure; show(w);
% project all images on the eigenfaces and show the scatter
figure; scatterd(a*w); axis equal
% show all
showfigs
% correct eigenface image for equal axis
figure(2); axis equal
Figures: the first 10 classes; the first two eigenfaces.