This page is a description of the pattern recognition terminology as it is used on the PRTools website and in the literature on which it is based. General terms are presented in red. For specific PRTools terminology burnt orange is used.
Apparent error. Also called training error. It is the fraction of objects used for designing (training) a classifier that is incorrectly classified by that classifier. It is usually an optimistically biased estimate of the true classification error, i.e. it tends to be lower than the true error.
Attribute. A symbolic or numeric property of an object that might be useful to determine its class. The word feature is used for this as well. Different objects, however, may have different numbers of attributes, while usually the same set of features is measured for all objects in a single problem. Thereby, objects may be represented by a feature vector, or by a set of attributes.
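Bayes classifier. This is a classifier based on the Bayes rule. It combines given class prior probabilities with class probability densities for the object vector representation such that the classification error is minimum, under the assumption that these probabilities and densities hold for the objects to be classified. If this assumption is violated (e.g. by using bad density estimates) the classifier may be far from optimal.
For a two-class problem with classes $\omega_1$ and $\omega_2$ this classifier can be written as:
$S(x) = p(x|\omega_1)P(\omega_1) - p(x|\omega_2)P(\omega_2)$, if $S(x) > 0$ then $\omega_1$ else $\omega_2$
For a multi-class problem with classes $\omega_1, \ldots, \omega_c$ the classifier can be written as:
$\hat{\omega}(x) = \arg\max_{\omega_i} \frac{p(x|\omega_i)P(\omega_i)}{\sum_{j=1}^{c} p(x|\omega_j)P(\omega_j)}$
As the denominator does not depend on $i$, it can be omitted. The rule thereby simplifies to:
$\hat{\omega}(x) = \arg\max_{\omega_i} p(x|\omega_i)P(\omega_i)$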
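As an illustration of the Bayes classifier, the following Python sketch applies the decision rule to a one-dimensional problem. The Gaussian class densities, the priors and all names (`CLASSES`, `gaussian_pdf`, `bayes_classify`) are assumptions made for this example only. They are not part of PRTools.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical two-class problem: class name -> (prior, mean, std).
CLASSES = {
    "healthy": (0.8, 0.0, 1.0),
    "diseased": (0.2, 2.0, 1.0),
}

def bayes_classify(x, classes=CLASSES):
    """Return the class maximizing density * prior, plus all class posteriors."""
    scores = {name: prior * gaussian_pdf(x, mean, std)
              for name, (prior, mean, std) in classes.items()}
    total = sum(scores.values())  # the denominator; not needed for the decision
    posteriors = {name: s / total for name, s in scores.items()}
    return max(scores, key=scores.get), posteriors

label, posteriors = bayes_classify(0.5)
print(label, posteriors)
```

The decision uses only the products of density and prior. Dividing by their sum is needed only when the posteriors themselves are of interest.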
Bayes error. The classification error made by the Bayes classifier when its assumptions are correct. By definition this is the lowest classification error that can be achieved for the given problem.
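Bayes rule. For a class $\omega$ and an object representation vector $x$, this rule relates the class posterior probability $P(\omega|x)$ with the class probability density $p(x|\omega)$, the class prior probability $P(\omega)$ and the joint probability density function $p(x) = \sum_{\omega'} p(x|\omega')P(\omega')$ over the set of classes:
$P(\omega|x) = \frac{p(x|\omega)P(\omega)}{p(x)}$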
Class. A class is a set of objects that within a given context is recognized as similar. Such a class usually has a unique name, the class name. The individual objects within a class have a label that refers to this name. If the class is human defined it is sometimes also called a concept.
Class frequency. The frequency with which objects of a particular class occur in a dataset. It can be used as an estimate for the class prior in case the dataset is representative for the classification problem.
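A minimal Python sketch of estimating class priors by class frequencies. The function name `class_frequencies` and the example labels are hypothetical, and the estimate is only meaningful under the stated condition that the dataset is representative.

```python
from collections import Counter

def class_frequencies(labels):
    """Return the fraction of objects per class name."""
    counts = Counter(labels)
    n = len(labels)
    return {name: count / n for name, count in counts.items()}

labels = ["healthy", "healthy", "diseased", "healthy"]
priors = class_frequencies(labels)
print(priors)
```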
Class name. A name assigned to a class of objects. Sometimes it is purely symbolic, sometimes it has a very practical meaning like ‘healthy’ or ‘diseased’.
Classifier. A function that assigns a class label to an object on the basis of its representation. Some classifiers may reject some objects. Some classifiers may assign multiple class labels. Instead of, or in addition to, class labels, classifiers may output class posteriors, dissimilarities, distances, similarities or densities for all possible classes. In that case the most likely class or classes have to be determined in an additional step.
Classification accuracy. This is one minus the classification error. Sometimes the accuracy is multiplied by 100 and expressed as a percentage.
Classification error. The probability that an arbitrarily selected object from a set of classes is incorrectly classified by a given trained classifier. Alternatively, it is used for the fraction of objects of a given finite set of objects that is incorrectly classified. See also apparent error (or training error), test error and true error. Sometimes this error is multiplied by 100 and expressed as a percentage.
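For a finite set of objects, the classification error and accuracy can be computed as in the following Python sketch. The function name and the example labels are hypothetical.

```python
def classification_error(true_labels, predicted_labels):
    """Fraction of objects whose predicted label differs from the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty label lists of equal length")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)

error = classification_error(["A", "A", "B", "B"], ["A", "B", "B", "B"])
accuracy = 1.0 - error
print(f"error = {100 * error:.0f}%, accuracy = {100 * accuracy:.0f}%")
```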
Classification performance. A general expression referring to how well a classifier classifies unseen objects. There is no sharp mathematical definition of the performance. In general, the higher the performance, the higher the classification accuracy and thereby the lower the classification error.
Cluster. A set of objects that are similar in their representation. It depends on the problem and the representation whether classes and clusters coincide. In a single class sometimes several clusters can be distinguished.
Combiner. One of the types of a PRTools mapping or classifier. A combiner is a mapping that is able to handle another mapping as input, e.g. V = W1*W2, in which W2 is the combiner, W1 is another mapping and V is the combined mapping. This construct is called sequential combining. Other combining possibilities are stacked combining by V = [W1 W2], and parallel combining by V = [W1; W2].
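The three combining constructs can be mimicked outside PRTools with plain functions. The Python sketch below is an analogy only: mappings are modelled as functions from a list of numbers to a list of numbers, and the helper names (`sequential`, `stacked`, `parallel`, `scale`, `total`) are hypothetical. It only concatenates the outputs for the stacked and parallel cases. A combining rule would still have to be applied to them.

```python
def sequential(w1, w2):
    """Mimics V = W1*W2: apply w1 first, then w2 to its output."""
    return lambda x: w2(w1(x))

def stacked(w1, w2):
    """Mimics V = [W1 W2]: both mappings get the same input, outputs are concatenated."""
    return lambda x: w1(x) + w2(x)

def parallel(w1, w2, split):
    """Mimics V = [W1; W2]: w1 gets the first `split` features, w2 the remaining ones."""
    return lambda x: w1(x[:split]) + w2(x[split:])

# Two toy mappings.
scale = lambda x: [2.0 * v for v in x]
total = lambda x: [sum(x)]

x = [1.0, 2.0, 3.0]
print(sequential(scale, total)(x))
print(stacked(scale, total)(x))
print(parallel(scale, total, 1)(x))
```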
Concept. A concept is a general idea, or something conceived in the mind (Wikipedia). In pattern recognition it is sometimes used for a class or a cluster for which the underlying set of objects is the complete set of specific examples for which the concept is applicable (occasionally a single representative or typical object is also called a concept). The concept is the step that consciousness makes after recognizing patterns in observations and before a name (a word) has been assigned to it.
Confidence. A value between 0 and 1 that indicates the likelihood that a particular outcome (e.g. class membership) is correct. This term is used instead of posterior probability when probabilities in the strict sense are not well defined. Class confidences may, like class posteriors, be interpreted as soft labels.
Covariance matrix. This matrix generalizes the notion of variance to multiple dimensions. See the Wikipedia article. In $k$ dimensions it describes the second-order central moment of the probability distribution of a random vector $x$ with mean $\mu$ by $C = E[(x-\mu)(x-\mu)^T]$, or of a set of $n$ objects $x_1, \ldots, x_n$ with sample mean $m$ by $\hat{C} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-m)(x_i-m)^T$. Multi-dimensional Gaussian distributions are fully described by their mean and covariance matrix.
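As a small illustration, the covariance matrix of a set of objects can be estimated as in the Python sketch below. The function name is hypothetical, objects are stored row-wise, and the normalization by n - 1 is an assumption of this example (dividing by n is also common).

```python
def covariance_matrix(objects):
    """Sample covariance of row vectors (objects), normalized by n - 1."""
    n = len(objects)
    k = len(objects[0])
    mean = [sum(row[j] for row in objects) / n for j in range(k)]
    centered = [[row[j] - mean[j] for j in range(k)] for row in objects]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
            for a in range(k)]

data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = covariance_matrix(data)
print(cov)
```

The diagonal holds the variances of the individual features. The off-diagonal terms hold the covariances between pairs of features.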
Crisp label. Labels point from an object to a class. They are assigned by an expert or estimated during automatic classification. If there is no uncertainty or ambiguity such a label is unique and symbolic. It just contains the name of the class. See also soft labels.
Curse of dimensionality. See overtraining.
Datafile. A PRTools programmatic class type. Variables of this type refer to directories (folders) of objects stored on disk. A datafile contains the definition of all preprocessing (e.g. feature extraction) needed to convert it to a proper dataset. The interesting properties of datafiles are that they have an unlimited size and that the objects can still have different dimensions (images of different sizes, time signals of different lengths). A datafile is a child (subclass) of a dataset and thereby a dataset itself. See the PRTools help information on
Dataset. A PRTools programmatic class type. Variables of this type refer to a set of vector representations of objects, the feature vectors. These vectors should have the same length and are stored row-wise, so the columns refer to features and the rows to objects. The set of feature vectors is also referred to as a feature matrix. Various types of additional information can be stored in a variable of the dataset type, e.g. object labels, class priors, feature names, feature domains, dates and descriptions. See the PRTools help information on
Dimension reduction. Transformation of a given representation vector space, usually a feature space, into another one with a lower dimension than the original one, while maintaining the information content. This content is either expressed in the object variability (unsupervised) or the class separability (supervised).
Dissimilarity representation. In this representation objects are represented by pairwise dissimilarities to other objects, either directly computed from the real world observations or from another representation, e.g. features or graphs. The next step may be the postulation of a dissimilarity space or embedding.
Dissimilarity space. The vector space defined by the dissimilarities to a representation set of objects. The vector length (space dimensionality) is thereby equal to the size of the representation set. This set can be the entire training set, a selection of these objects, but possibly also a set of entirely different objects. Classifiers in this space can be trained in a similar way as in a feature space.
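A minimal Python sketch of constructing a dissimilarity space. The Euclidean distance is used here only as an example of a dissimilarity measure, and the function names and data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def to_dissimilarity_space(objects, representation_set, dissimilarity=euclidean):
    """Represent every object by its dissimilarities to the representation set."""
    return [[dissimilarity(obj, rep) for rep in representation_set]
            for obj in objects]

representation_set = [[0.0, 0.0], [3.0, 4.0]]
objects = [[0.0, 0.0], [3.0, 0.0], [6.0, 8.0]]
vectors = to_dissimilarity_space(objects, representation_set)
print(vectors)
```

Every resulting vector has as many elements as there are objects in the representation set, in line with the definition above.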
Embedding. The computation of a set of vectors in a Euclidean space for which the distances are equal to a given set of dissimilarities or the inner products are equal to a given set of similarities.
Feature. A symbolic or numeric property of a real world object that might be useful to determine its class. The word ‘attribute’ is used for this as well. Different objects, however, may have different numbers of attributes, while usually the same features can be measured for all objects in the same problem. Thereby, objects may be represented by a feature vector, or by a set of attributes.
Feature domain. A feature property stored in a dataset which refers to the set of values the particular feature may have. In operations or during the addition of new objects to a dataset the feature values may be checked for the defined domain.
Feature extraction. The process of determining good features for a feature representation of objects. This may refer to raw data like images or time signals, but also to an already given representation. In the latter case the aim is to simplify the representation, e.g. by dimension reduction. Examples are PCA and LDA.
Feature label. The feature name as it is stored in a dataset. Feature labels may be empty or just numbers. In a classification matrix stored in a dataset the feature labels are the class names, as every column refers here to a class instead of a feature.
Feature space. The vector space defined by the feature vectors. This concept is so familiar in statistical pattern recognition and machine learning that vector spaces used for object representation but created otherwise are sometimes also called feature spaces. Examples are the kernel space and the dissimilarity space, which are defined by functions of input features or distances between objects.
Feature vector. The vector storing all relevant properties of a real world object in a particular, well-defined order. It may also be the unfolded set of image pixels or a sampling of a spectrum or a time signal. If feature vectors are used to represent a set of objects in a vector space it is necessary that the images, spectra or signals are converted to the same size before sampling.
Fisher mapping. The orthonormal transformation of a feature space that maximizes the between-class scatter w.r.t. the average within-class scatter. It is sometimes also called LDA. In PRTools it is implemented by fisherm.
Generalization. Generalization is the step from a set of object observations to their common properties or concept. In logic this is called induction. In pattern recognition it is the process of finding clusters, classes or typical objects. If for a new object, that was not used in the generalization, its most similar cluster, class or typical object can be determined, properties may be predicted that have not been measured.
Image. A one-, two-, or multi-dimensional set of pixels. For many pattern recognition applications it is relevant that the assumption holds that neighboring pixels have a higher correlation than more remote ones.
Image object. PRTools terminology for an unfolded image stored as a feature vector (row-wise) in a dataset. It thereby represents an object. The original image size (before unfolding) should be stored in the dataset in order to retrieve pixel neighborhoods.
Image feature. PRTools terminology for an unfolded image stored as a feature (column-wise) in a dataset. It thereby represents a feature. The original image size (before unfolding) should be stored in the dataset in order to retrieve pixel neighborhoods. As the objects (the row vectors) are thereby pixels, they are represented by the corresponding pixel values in a set of images. This is mainly relevant for storing an RGB image or a (hyper)spectral image in a single dataset for segmentation purposes (i.e. classification or clustering of pixels).
Input space. Machine learning terminology for the space of the given (vector) representation. It is usually used in relation with kernel representations as it refers to the input space for which kernels are computed. The input space is in this case identical to what is called feature space in pattern recognition.
KLM. The PRTools routine klm is called Karhunen-Loève mapping and is effectively a PCA applied to the mean class covariance matrix. It thereby neglects the scatter of the class means and focusses on the average shapes of the class distributions.
Kernel. A function $K(x,y)$ of two objects $x$ and $y$. It is either computed from a representation like a feature space, or from the raw data of $x$ and $y$ directly. A training set of $n$ objects is converted by the kernel into an $n \times n$ matrix. This may be used directly as a representation, but much more often the kernel trick is used (when possible) to train classifiers in kernel space. See also dissimilarity representation and dissimilarity space.
Kernel space. If a kernel is separable, i.e. if $K(x,y) = \phi(x)^T\phi(y)$, then the space in which $\phi(x)$ and $\phi(y)$ are defined is the kernel space. In case of a Gaussian kernel this space is a Hilbert space. In support vector machines the kernel space is only used implicitly by the kernel trick.
Kernel trick. If a kernel is separable, i.e. if $K(x,y) = \phi(x)^T\phi(y)$, which is the case if $K$ fulfills the Mercer conditions, then the kernel may be interpreted as the inner product in the kernel space. As some transformations and classifiers, in particular the support vector machines, can be written in terms of inner products only, they implicitly base their result on a virtual kernel space, which might be of infinite dimensionality. See also the Wikipedia article on this topic.
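The relation between a kernel and the inner product in its kernel space can be checked numerically for a simple case. The Python sketch below uses the homogeneous second-degree polynomial kernel on two-dimensional vectors, chosen for this example because its kernel space is known explicitly and has just three dimensions. The function names are hypothetical.

```python
import math

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2 on 2-D vectors."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit map of a 2-D vector into the 3-D kernel space of poly_kernel."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def inner(a, b):
    """Inner product of two equally long vectors."""
    return sum(p * q for p, q in zip(a, b))

x, y = [1.0, 2.0], [3.0, -1.0]
# Both numbers agree up to rounding: the kernel is an inner product in kernel space.
print(poly_kernel(x, y), inner(phi(x), phi(y)))
```

A classifier that only needs inner products can therefore call the kernel and never compute the map into the kernel space explicitly.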
Label. In general this is a tag or a pointer assigned to some entity in order to link it to some other entity. In a pattern recognition context it is often short for class label.
LDA. Linear Discriminant Analysis (LDA) stands both for the linear transformation to the sub-space of a vector space under consideration for which the between-class scatter is maximized w.r.t. the average within-class scatter, and for the classifier that assigns new objects to the class with the highest posterior probability under the assumption of Gaussian class distributions with equal covariance matrices. The first, also called Fisher mapping, is realized in PRTools by the mapping fisherm, the second by the classifier
Mapping. A PRTools programmatic class type for variables in which a transformation between object representations is stored. These representations can be raw data, as stored in datafiles, vector representations, as stored in datasets, class confidences, or just the class label. PRTools classifiers are implemented as mappings. Like classifiers, mappings can be untrained, trained or a combiner. In addition they can also be fixed. See the PRTools help information on
Object. A real world entity that can be physically observed. Also used for a representation of such an entity. Examples are 2-dimensional items like photos, characters, plotted curves; 3-dimensional items like chairs, tables, cells, airplanes; but also time dependent events like speech, gestures, movies. More abstract entities like a voice, a person's identity, a style of writing or a composition are sometimes called objects as well. Another way to phrase it is that an object is a realisation of a concept, which corresponds to the pattern recognition terminology: a member (element) of a class (set).
Object representation. A formal description of an object that facilitates the comparison of objects and the generalisation of sets of similar objects to a class. Examples are the feature representation, pixel representation, dissimilarities, kernels and graphs.
Object variability. A measure that expresses how much the objects in a given set differ in general.
Observation. A single measurement or a set of measurements made from an object.
Overtraining. Originally overtraining refers to training a neural network too long, by which it adapts to the noise in the data. Now overtraining is also used for any classifier that adapts to the noise due to bad parameter settings, too small training sets, too high dimensions or too many parameters to be optimized. Other terms for (about) the same phenomenon are curse of dimensionality, peaking and Rao’s paradox. There is a set of posts on overtraining.
Parallel combining. This is a PRTools term for combining a set of mappings or classifiers defined for different vector representations into a single mapping or classifier. The resulting space might be the class label derived from a classifier output. Like stacked combining, and in contrast to sequential combining, parallel combining depends on a combining rule.
Pattern. The word ‘pattern’ is overly used for slightly different but related concepts. The main one is the subset of similar objects in a larger set (a class or a cluster). It is however also used for the entire similarity structure in a collection of objects as well as for a single object which is typical for a set of similar objects.
Pattern recognition. This is the process or ability of finding patterns in a set of objects. It also refers to the scientific domain that studies such processes as well as to the technology of creating artificial systems that can do this. The main sub-domains are representation and generalization. Pattern recognition is related to but slightly different from the fields of artificial intelligence and machine learning. As pattern recognition refers to both, a human ability as well as a research domain, it may be labeled as an art as well as a science.
PCA. Principal Component Analysis finds an orthogonal transformation for a vector space that decorrelates a specified covariance matrix $C$. It results in a set of eigenvectors $E$ (the new axes) and the variances (diagonal terms) of the transformed covariance matrix $\Lambda$, also called eigenvalues. Thereby $CE = E\Lambda$. The orthonormal transformation is $y = E^T x$, as the covariance of the transformed space is $E^T C E = \Lambda$. It is assumed without loss of generality that $x$ has zero mean. More in the Wikipedia article.
In practical applications often just the eigenvectors corresponding to the largest eigenvalues are used, as they stand for the main directions of variance. Other directions are then neglected under the assumption that they correspond to noise. As this is effectively a type of averaging it may improve accuracy.
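For two dimensions the eigenvectors and eigenvalues of a covariance matrix can be written in closed form, which allows a compact Python sketch of PCA. The function name `pca_2d` and the example matrix are hypothetical. For higher dimensions a numerical eigenvalue routine would be used instead.

```python
import math

def pca_2d(cov):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    half_trace = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    eigenvalues = [half_trace + radius, half_trace - radius]
    if b == 0.0:
        # Already decorrelated: the original axes are the eigenvectors.
        eigenvectors = [[1.0, 0.0], [0.0, 1.0]] if a >= c else [[0.0, 1.0], [1.0, 0.0]]
    else:
        eigenvectors = []
        for lam in eigenvalues:
            v = [b, lam - a]  # solves (a - lam) * v0 + b * v1 = 0
            norm = math.hypot(v[0], v[1])
            eigenvectors.append([v[0] / norm, v[1] / norm])
    return eigenvalues, eigenvectors

eigenvalues, eigenvectors = pca_2d([[2.0, 1.0], [1.0, 2.0]])
first = eigenvectors[0]  # main direction of variance
# Keep only the first principal component of two zero-mean example vectors.
projected = [sum(xi * vi for xi, vi in zip(x, first))
             for x in [[1.0, 1.0], [1.0, -1.0]]]
print(eigenvalues, eigenvectors, projected)
```

The second example vector is orthogonal to the first eigenvector, so its projection is zero. All its variation lies in the neglected direction.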
Peaking phenomenon. See overtraining.
Pixel Representation. Images can be represented by their pixels. The unfolded 1-dimensional vector constructed from 2-dimensional or even multi-dimensional images is then used for building a feature space. There are two problems with this approach: it results in very high-dimensional spaces, and small shifts or rotations of the image may result in a large jump of the representing vector. A good property of the pixel representation is that all information available in the image is preserved, in contrast to features measured from the image.
Representation. The representation of an object is a description of that object in terms of measurable observations that enables a numeric or logic comparison with other objects in the same problem. See also an introductory post on this topic.
Scatter. The scatter is a synonym for the variance or covariance matrix, emphasizing that it is derived from an estimate based on observed data points.
Sequential combining. This is a PRTools term for combining consecutive mappings (transformations) between vector spaces into a single mapping. The resulting space might be the class label derived from a classifier output.
Soft label. Labels point from an object to a class. They are assigned by an expert or estimated during automatic classification. In both cases there might be uncertainty or ambiguity. This can be solved by replacing the traditional crisp labels by soft labels. In PRTools these are numbers between 0 and 1. Every object has a soft label value for every class. They don’t necessarily sum to one. They might be interpreted in several ways: uncertainties, probabilities or fuzzy class membership. See also crisp label.
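A small Python sketch of handling soft labels. The dictionary layout and the function names are assumptions of this example, not the PRTools storage format. Normalizing the values so that they sum to one is only appropriate when they are interpreted as probabilities.

```python
def to_crisp(soft_label):
    """Return the class name with the largest soft label value."""
    return max(soft_label, key=soft_label.get)

def normalize(soft_label):
    """Rescale the soft label values so that they sum to one."""
    total = sum(soft_label.values())
    return {name: value / total for name, value in soft_label.items()}

soft = {"healthy": 0.6, "diseased": 0.9}  # values need not sum to one
print(to_crisp(soft), normalize(soft))
```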
Stacked combining. This is a PRTools term for combining a set of mappings or classifiers between the same two vector spaces into a single mapping or classifier. The resulting space might be the class label derived from a classifier output. Like parallel combining, and in contrast to sequential combining, stacked combining depends on a combining rule.
Support vector machine. A linear classifier or regression function that is optimised for a distance based criterion as a linear function of a subset of the given vectors (training objects), the support vectors. Thanks to the kernel trick also non-linear classifiers can be trained. The result depends on the choice of the kernel. Usually abbreviated as SVM.
Trained classifier. A classifier trained by a training set, ready for evaluation by a test set or to be applied in practice. A trained classifier might be just a linear function. The way it is trained (the untrained classifier) may not be visible anymore. This is a special case of a trained mapping.
True classification error. The classification error as made by the classifier on the objects of the problem for which it was trained. This error can be estimated by a test set randomly sampled from these objects.
Untrained classifier. A classifier for which the functional form and a criterion are given but for which the parameters have not yet been specified. They need to be optimized by the use of a training set. This is a special case of an untrained mapping.
Untrained mapping. A mapping for which the functional form and a criterion are given but for which the parameters have not yet been specified. They need to be optimized by the use of a training set. An untrained classifier is a special case of an untrained mapping.
Vector representation. An object representation by a vector space. This is the most popular representation as there are many tools to analyze sets of vectors.