Glossary

This page describes the pattern recognition terminology as it is used on the PRTools website and in the literature on which it is based. General terms are presented in red; burnt orange is used for PRTools-specific terminology.

Apparent error. Also called training error. It is the fraction of objects used for designing (training) a classifier that is incorrectly classified by that classifier. It is usually an optimistically biased estimator of the true classification error, i.e. it is usually too low.

Attribute. A symbolic or numeric property of an object that might be useful to determine its class. The word feature is used for this as well. Different objects, however, may have different numbers of attributes, whereas usually the same set of features is measured for all objects in a single problem. Thereby, objects may be represented by a feature vector, or by a set of attributes.

Bayes classifier. This is a classifier based on the Bayes rule. It combines given class prior probabilities P(\omega) with class probability densities P(x|\omega) for the object vector representation x such that the classification error \epsilon is minimum under the assumption that these probabilities and densities hold for the objects to be classified. If this assumption is violated (e.g. by using bad density estimates) the classifier may be far from optimal.

For a two-class problem with \omega \in \{A,B\} this classifier can be written as:

S(x) = P(x|A) P(A) - P(x|B)P(B), if S(x) \ge 0 then A else B

For a multi-class problem with classes \omega \in \Omega the classifier can be written as:

\hat{\omega}(x) = \arg\max_{\omega \in \Omega} \frac{P(x|\omega)P(\omega)}{P(x)}

As the denominator P(x) does not depend on \omega, it can be omitted. The rule thereby simplifies to:

\hat{\omega}(x) = \arg\max_{\omega \in \Omega} P(x|\omega)P(\omega)
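
As a small illustration, the following MATLAB sketch evaluates the two-class rule for univariate Gaussian class densities; all priors, means and standard deviations are made-up values for the example.

    % Two-class Bayes rule with assumed univariate Gaussian class densities.
    pA = 0.6;  pB = 0.4;                  % class priors P(A), P(B)
    muA = 0;  sA = 1;  muB = 2;  sB = 1.5;
    gauss = @(x,mu,s) exp(-0.5*((x-mu)./s).^2) ./ (s*sqrt(2*pi));
    x = 1.2;                              % object to be classified
    S = gauss(x,muA,sA)*pA - gauss(x,muB,sB)*pB;
    if S >= 0
       label = 'A';
    else
       label = 'B';
    end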

Bayes error. The classification error made by the Bayes classifier in case its assumptions are correct. By definition this is the lowest classification error that can be achieved for the given problem.

Bayes rule. For a class \omega and an object representation vector x, this rule relates the class posterior probability P(\omega|x) to the class probability density P(x|\omega), the class prior probability P(\omega) and the probability density P(x) of x, obtained by summing P(x|\omega)P(\omega) over the set of classes:

P(\omega|x) = \frac{P(x|\omega) P(\omega)}{P(x)}
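
A small numeric illustration (with made-up values): if P(A) = 0.3, P(B) = 0.7, P(x|A) = 2.0 and P(x|B) = 0.5, then P(x) = 2.0 \cdot 0.3 + 0.5 \cdot 0.7 = 0.95 and P(A|x) = 0.6 / 0.95 \approx 0.63.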

Class. A class is a set of objects that within a given context is recognized as similar. Such a class usually has a unique name, the class name. The individual objects within a class have a label that refers to this name. If the class is human defined it is sometimes also called a concept.

Class frequency. The frequency with which objects of a particular class occur in a dataset. It can be used as an estimate of the class prior in case the dataset is representative of the classification problem.

Class label. A pointer assigned to an object that refers to its class or class name. Note the inconsistency in terminology: an object label is the same pointer, from an object to a class.

Class membership. The class a class label points to, in case of crisp labels, or the value(s) of the soft labels.

Class name. A name assigned to a class of objects. Sometimes very symbolic like A or B, sometimes with a very practical meaning like ‘healthy’ or ‘diseased’.

Class separability. Some general measure related to the possible classification performance given an object representation and a training set.

Class posterior. The probability P(\omega|x) of a particular class \omega for an object x. The difference with the class prior is that the posterior depends on the observed object. See also confidence.

Class prior. The probability P(\omega) that an arbitrary object belongs to class \omega. The difference with the class posterior is that the prior is independent of an observed object.

Classifier. A classifier is a rule that assigns a class label to any object in a particular object representation.

Some classifiers may reject some objects. Some classifiers may assign multiple class labels. Instead of, or in addition to, class labels, classifiers may output class posteriors, dissimilarities, distances, similarities or densities for all possible classes. In an additional step, the most likely class or classes then have to be determined.

Classifier. The word ‘classifier’ is used for trained classifiers as well as for untrained classifiers.

Classification. The assignment of a class (in fact a class name) to an object by evaluating a trained classifier for that object.

Classification accuracy. This is one minus the classification error. Sometimes the accuracy is also multiplied by 100 and presented as a percentage.

Classification error. The probability that an arbitrary object of the problem is incorrectly classified by a given trained classifier. Alternatively it is used for the fraction of objects of a given finite set of objects that is incorrectly classified. See also apparent error (or training error), test error and true error. Sometimes this error is also multiplied by 100 and presented as a percentage.

Classification matrix. The M x C matrix D of class confidences resulting from applying a trained C-class classifier W to the M objects in a dataset A, so D = A*W.
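
As an illustration, a minimal PRTools sketch, assuming the standard PRTools routines gendatb, ldc, labeld and testc:

    a = gendatb([50 50]);     % training set, 2 classes
    b = gendatb([20 20]);     % test set of M = 40 objects
    w = ldc(a);               % trained classifier for C = 2 classes
    d = b*w;                  % M x C classification matrix of class confidences
    lab = labeld(d);          % most confident class label per object
    e = testc(d);             % estimated classification error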

Classification performance. A general expression referring to how well a classifier classifies unseen objects. There is no sharp mathematical definition of the performance. In general, the higher the performance, the higher the classification accuracy and thereby the lower the classification error.

Cluster. A set of objects that are similar in their representation. It depends on the problem and the representation whether classes and clusters coincide. In a single class sometimes several clusters can be distinguished.

Clustering. The process of finding clusters. It is a way of unsupervised learning.

Cluster center. The object in a cluster for which the maximum distance to all other objects in the cluster is minimum.

Cluster medoid. The object in a cluster for which the mean distance to all other objects in the cluster is minimum.

Combiner. One of the types of a PRTools mapping or classifier. A combiner is a mapping that is able to handle another mapping as input, e.g. V = W1*W2, in which W2 is the combiner, W1 is another mapping and V is the combined mapping. This construct is called sequential combining. Other combining possibilities are stacked combining by V = [W1 W2], and parallel combining by V = [W1; W2].
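
A small sketch of these constructs, assuming the standard PRTools routines gendatb, scalem, ldc, knnc and the max combining rule maxc:

    a  = gendatb([50 50]);         % training set
    w1 = scalem(a,'variance');     % mapping: variance scaling
    w2 = ldc(a*w1);                % classifier trained in the scaled space
    v  = w1*w2;                    % sequential combining
    u  = [ldc(a) knnc(a,3)]*maxc;  % stacked combining followed by a combining rule
    % parallel combining, [W1; W2], applies to mappings defined on different spaces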

Combining rule. The way a set of mappings or classifiers is combined into a single one in case of parallel or stacked combining.

Concept. A concept is a general idea, or something conceived in the mind (Wikipedia). In pattern recognition it is sometimes used for a class or a cluster for which the underlying set of objects is the complete set of specific examples for which the concept is applicable (occasionally a single representative or typical object is also called a concept). The concept is the step that consciousness makes after recognizing patterns in observations and before a name (a word) has been assigned to it.

Confidence. A value between 0 and 1 that indicates the likelihood that a particular outcome (e.g. class membership) is correct. This term is used instead of posterior probability when probabilities in the strict sense are not well defined. Class confidences may, like class posteriors, be interpreted as soft labels.

Consciousness. The ability of a thinker, observer or actor to observe his own state of mind.

Covariance matrix. This matrix generalizes the notion of variance to multiple dimensions. See the Wikipedia article. In k dimensions it describes the second-order central moment of a probability distribution of a random vector \boldsymbol{x} with mean \boldsymbol{\mu} by C_{i,j}=E[(x_i-\mu_i)(x_j-\mu_j)], or of a set of m objects X=\{\boldsymbol{x_p}, p=1,2,\ldots,m\} by C_{i,j}=\frac{1}{m-1}\sum_p (x_{p,i}-\mu_i)(x_{p,j}-\mu_j). Multi-dimensional Gaussian distributions are fully described by their mean and covariance matrix.
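
A plain MATLAB sketch of the sample estimate (random data, for illustration only):

    X  = randn(100,3);                   % m = 100 objects with k = 3 features
    mu = mean(X,1);
    Xc = X - repmat(mu,size(X,1),1);     % centered data
    C  = (Xc'*Xc)/(size(X,1)-1);         % sample covariance matrix
    % up to rounding, C equals cov(X)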

Crisp label. Labels point from an object to a class. They are assigned by an expert or estimated during automatic classification. If there is no uncertainty or ambiguity such a label is unique and symbolic. It just contains the name of the class. See also soft labels.

Curse of dimensionality. See overtraining.

Datafile. A PRTools programmatic class type. Variables of this type refer to directories (folders) of objects stored on disk. A datafile contains the definition of all preprocessing (e.g. feature extraction) needed to convert it to a proper dataset. The interesting properties of datafiles are that they have an unlimited size and that the objects may still have different dimensions (images of different sizes, time signals of different lengths). A datafile is a child (subclass) of a dataset and thereby a dataset itself. See the PRTools help information on datafiles.

Dataset. A PRTools programmatic class type. Variables of this type refer to a set of vector representations of objects, feature vectors. These vectors should have the same length and are stored row-wise, so the columns refer to features and the rows to objects. The set of feature vectors is also referred to as a feature matrix. Various types of additional information can be stored in a variable of the dataset type, e.g. object labels, class priors, feature names, feature domains, dates and descriptions. See the PRTools help information on datasets.
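
A minimal sketch of constructing such a variable, assuming PRTools 5 (the prdataset constructor; older versions use dataset) and the helpers setfeatlab and setprior:

    X    = [randn(50,2); randn(50,2)+3];           % 100 objects, 2 features
    labs = [repmat('A',50,1); repmat('B',50,1)];   % crisp object labels
    a    = prdataset(X,labs);                      % feature matrix plus labels
    a    = setfeatlab(a,{'length' 'area'});        % feature names
    a    = setprior(a,[0.5 0.5]);                  % class priors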

Dimension reduction. Transformation of a given representation vector space, usually a feature space, into another one with a lower dimension than the original, while maintaining the information content. This content is either expressed in the object variability (unsupervised) or the class separability (supervised).

Dissimilarity. A measure for the difference between two objects. This can be a proper metric but may also violate the triangle inequality or be asymmetric.

Dissimilarity representation. In this representation objects are represented by pairwise dissimilarities to other objects, either directly computed from the real world observations or from another representation, e.g. features or graphs. The next step may be the postulation of a dissimilarity space or embedding.

Dissimilarity space. The vector space defined by the dissimilarities to a representation set of objects. The vector length (space dimensionality) is thereby equal to the size of the representation set. This set can be the entire training set, a selection of these objects, but possibly also a set of entirely different objects. Classifiers in this space can be trained in a similar way as in a feature space.
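
A small sketch, assuming the PRTools routines gendatb, gendat, distm (squared Euclidean distances), fisherc and testc:

    a = gendatb([100 100]);        % training set
    r = gendat(a,0.1);             % representation set: 10% of the training objects
    d = sqrt(distm(a,r));          % dissimilarity space, one dimension per
                                   % representation object
    w = fisherc(d);                % classifier trained in this space
    t = gendatb([50 50]);          % test set
    e = sqrt(distm(t,r))*w*testc;  % map the test set to the same space and test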

Embedding. The computation of a set of vectors in a Euclidean space for which the distances are equal to a given set of dissimilarities or the inner products are equal to a given set of similarities.

Feature. A symbolic or numeric property of a real world object that might be useful to determine its class. The word ‘attribute’ is used for this as well. Different objects, however, may have different numbers of attributes, while usually the same features can be measured for all objects in the same problem. Thereby objects may be represented by a feature vector, or by a set of attributes.

Feature domain. A feature property stored in a dataset which refers to the set of values the particular feature may have. In operations or during the addition of new objects to a dataset the feature values may be checked against the defined domain.

Feature extraction. The process of determining good features for a feature representation of objects. This may refer to raw data like images or time signals, but also to an already given representation. In the latter case the aim is to simplify the representation, e.g. by dimension reduction. Examples are PCA and LDA.

Feature label. The feature name as it is stored in a dataset. Feature labels may be empty or just numbers. In a classification matrix stored in a dataset the feature labels are the class names, as every column refers here to a class instead of a feature.

Feature matrix. A matrix containing the feature vectors for a set of objects. It is the basic item stored in a dataset.

Feature name. In case a feature is defined as a specific property (e.g. area or length) it is useful for annotating plots to give the feature that name as well. See also feature label.

Feature representation. The representation of objects by features resulting in a feature space.

Feature space. The vector space defined by the feature vectors. This concept is so familiar in statistical pattern recognition and machine learning that vector spaces used for object representation but created otherwise are sometimes also called feature spaces. Examples are the kernel space and the dissimilarity space, which are defined by functions of input features or distances between objects.

Feature vector. The vector storing all relevant properties of a real world object in a particular, well-defined order. It may also be the unfolded set of image pixels or a sampling of a spectrum or a time signal. If feature vectors are used to represent a set of objects in a vector space it is necessary that the images, spectra or signals are converted to the same size before sampling.

Fisher mapping. The orthonormal transformation of a feature space that maximizes the between-class scatter w.r.t. the average within-class scatter. It is sometimes also called LDA. In PRTools it is implemented by fisherm.

Fixed mapping. This is a mapping that cannot be trained but is defined by its parameters, e.g. sigm, which maps the real axis onto the (0,1) interval.

Generalization. Generalization is the step from a set of object observations to their common properties or concept. In logic this is called induction. In pattern recognition it is the process of finding clusters, classes or typical objects. If for a new object, one that was not used in the generalization, its most similar cluster, class or typical object can be determined, properties may be predicted that have not been measured.

Graph representation. A structural object representation by a set of nodes partially connected by edges. Nodes and edges may be attributed.

Image. A one-, two-, or multi-dimensional set of pixels. For many pattern recognition applications it is relevant that the assumption holds that neighboring pixels have a higher correlation than more remote ones.

Image object. PRTools terminology for an unfolded image stored as a feature vector (row-wise) in a dataset. It thereby represents an object. The original image size (before unfolding) should be stored in the dataset in order to retrieve pixel neighborhoods.

Image feature. PRTools terminology for an unfolded image stored as a feature (column-wise) in a dataset. It thereby represents a feature. The original image size (before unfolding) should be stored in the dataset in order to retrieve pixel neighborhoods. As the objects (the row vectors) are thereby pixels, these are represented by the corresponding pixel values in a set of images. This is mainly relevant for storing an RGB image or a (hyper)spectral image in a single dataset for segmentation purposes (i.e. classification or clustering of pixels).

Input space. Machine learning terminology for the space of the given (vector) representation. It is usually used in relation with kernel representations as it refers to the input space for which kernels are computed. The input space is in this case identical to what is called feature space in pattern recognition.

KLM. The PRTools routine klm is called the Karhunen-Loève mapping and is effectively a PCA applied to the mean class covariance matrix. It thereby neglects the scatter of the class means and focuses on the average shapes of the class distributions.

KLT. The Karhunen-Loève Transform. It is sometimes used as an alternative name for PCA. The PRTools routine klm uses it for a slightly different orthonormal mapping.

Kernel. A function K(x,y) of two objects x and y. It is either computed from a representation like a feature space, or from the raw data of x and y directly. A training set of m objects is converted by the kernel into an m \times m matrix. This may be used directly as a representation, but much more often the kernel trick is used (when possible) to train classifiers in kernel space. See also dissimilarity representation and dissimilarity space.

Kernel space. If a kernel is separable, i.e. if K(x,y) = \phi(x)^T \phi(y), then the space that \phi(x) and \phi(y) refer to is the kernel space. In case of a Gaussian kernel this space is a Hilbert space. In support vector machines the kernel space is only used implicitly by the kernel trick.

Kernel trick. If a kernel is separable, i.e. if K(x,y) = \phi(x)^T \phi(y), which is the case if K(\cdot,\cdot) fulfills the Mercer conditions, then the kernel may be interpreted as the inner product in the kernel space. As some transformations and classifiers, in particular the support vector machines, can be written in terms of inner products only, they implicitly base their result on a virtual kernel space, which might be of an infinite dimensionality. See also the Wikipedia article on this topic.
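
A plain MATLAB illustration for the homogeneous quadratic kernel K(x,y) = (x'y)^2 in two dimensions, whose explicit kernel space map is \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2):

    x = [1; 2];  y = [3; -1];
    K   = (x'*y)^2;                                   % kernel value, no map needed
    phi = @(z) [z(1)^2; sqrt(2)*z(1)*z(2); z(2)^2];   % explicit kernel space map
    K2  = phi(x)'*phi(y);                             % identical value (here both equal 1)
    % the inner product in kernel space is obtained without constructing phi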

Label. In general this is a tag or a pointer assigned to some entity in order to link it to some other entity. In a pattern recognition context it is often short for class label.

LDA. Linear Discriminant Analysis (LDA) stands both for the linear transformation to the sub-space of a vector space under consideration for which the between-class scatter is maximized w.r.t. the average within-class scatter, and for the classifier that assigns new objects to the class with the highest posterior probability under the assumption of Gaussian class distributions with equal covariance matrices. The first, also called Fisher mapping, is realized in PRTools by the mapping fisherm, the second by the classifier ldc.
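
A minimal PRTools sketch of both uses, assuming the standard routines gendats, fisherm, ldc and testc:

    a = gendats([50 50],8,3);          % two Gaussian classes in 8 dimensions
    w = fisherm(a,1);                  % LDA as a mapping: reduction to 1 dimension
    b = a*w;                           % mapped training set
    v = ldc(a);                        % LDA as a classifier
    e = gendats([50 50],8,3)*v*testc;  % error on an independent test set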

Mapping. A PRTools programmatic class type for variables in which a transformation between object representations is stored. This can be raw data, as stored in datafiles, as well as vector representations as stored in datasets, as well as class confidences or just the class labels. PRTools classifiers are implemented as mappings. Like classifiers, mappings can be untrained, trained or combiners. In addition they can also be fixed. See the PRTools help information on mappings.

Object. A real world entity that can be physically observed. Also used for a representation of such an entity. Examples are 2-dimensional items like photos, characters, plotted curves; 3-dimensional items like chairs, tables, cells, airplanes; but also time dependent events like speech, gestures, movies. More abstract entities like a voice, a person’s identity, a style of writing or a composition are sometimes called objects as well. Another way to phrase it is that an object is a realisation of a concept, which corresponds to the pattern recognition terminology: a member (element) of a class (set).

Object label. A pointer to the class to which an object belongs or may belong, e.g. its name. Note the inconsistency in terminology: a class label is the same pointer, from an object to a class.

Object representation. A formal description of an object that facilitates the comparison of objects and the generalisation of sets of similar objects to a class. Examples are the feature representation, pixel representation, dissimilarities, kernels and graphs.

Object variability. A measure that expresses how much the objects in a given set differ in general.

Observation. A single measurement or a set of measurements made on an object.

Overtraining. Originally overtraining refers to training a neural network too long, by which it adapts to the noise in the data. Now overtraining is also used for any classifier that adapts to the noise through bad parameter settings, too small training sets, too high dimensions or too many parameters to be optimized. Other terms for (about) the same phenomenon are curse of dimensionality, peaking and Rao’s paradox. There is a set of posts on overtraining.

Parallel combining. This is a PRTools term for combining a set of mappings or classifiers defined for different vector representations into a single mapping or classifier. The final result might be just the class labels derived from the classifier outputs. Like stacked combining, and in contrast to sequential combining, parallel combining depends on a combining rule.

Pattern. The word ‘pattern’ is used rather loosely for slightly different but related concepts. The main one is a subset of similar objects in a larger set (a class or a cluster). It is however also used for the entire similarity structure in a collection of objects, as well as for a single object that is typical for a set of similar objects.

Pattern recognition. This is the process or ability of finding patterns in a set of objects. It also refers to the scientific domain that studies such processes as well as to the technology of creating artificial systems that can do this. The main sub-domains are representation and generalization. Pattern recognition is related to but slightly different from the fields of artificial intelligence and machine learning. As pattern recognition refers to both, a human ability as well as a research domain, it may be labeled as an art as well as a science.

PCA. Principal Component Analysis finds an orthogonal transformation for a vector space that decorrelates a specified covariance matrix C. It results in a set of eigenvectors (the new axes) V and the variances (diagonal terms) of the transformed covariance matrix D, also called eigenvalues. Thereby V^{-1} C V = D. The orthonormal transformation is V^T = V^{-1} as the covariance of the transformed space is E[V^T x (V^Tx)^T] = V^T E[xx^T] V = V^{-1} C V = D. It is assumed without loss of generality that x has zero mean. More in the Wikipedia article.

In practical applications often just the eigenvectors corresponding to the largest eigenvalues are used, as they stand for the main directions of variance. Other directions are then neglected under the assumption that they correspond to noise. As this is effectively a type of averaging it may improve accuracy.

PCA is related to the Karhunen-Loève Transform (KLT) and to LDA or the Fisher mapping.
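
A plain MATLAB sketch following the eigendecomposition described above (random data, for illustration only):

    X  = randn(200,3)*[3 0 0; 1 1 0; 0 0 0.1];   % correlated data, 3 features
    Xc = X - repmat(mean(X,1),size(X,1),1);      % remove the mean
    C  = (Xc'*Xc)/(size(Xc,1)-1);                % covariance matrix
    [V,D] = eig(C);                              % eigenvectors and eigenvalues
    [~,i] = sort(diag(D),'descend');             % order by decreasing variance
    Y  = Xc*V(:,i(1:2));                         % keep the two main directions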

Peaking phenomenon. See overtraining.

Pixel representation. Images can be represented by their pixels. The unfolded 1-dimensional vector constructed from 2-dimensional or even multi-dimensional images is then used for building a feature space. There are two problems with this approach: it results in very high-dimensional spaces, and small shifts or rotations of the image may result in a large jump of the representing vector. A good property of the pixel representation is that all information available in the image is preserved, in contrast to features measured from the image.

Posterior probability. In classification this is the probability P(\omega|x) for a particular class \omega of an object x given the observation of (measurements on) that object.

Prior knowledge. All knowledge of measurements, models and probabilities that may be helpful to classify a particular object x, except any measurement on x itself.

Prior probability. In classification this is the probability P(\omega) for a particular class \omega of an object x without taking into account any measurement on x itself.

Recognition. In the strict sense the recognition of an object is equivalent to its classification. In a broader sense it includes the (development of the) representation as well as the classifier.

Representation. The representation of an object is a description of that object in terms of measurable observations that enables a numeric or logic comparison with other objects in the same problem. See also an introductory post on this topic.

Representation set. The representation set is a set of objects used in a dissimilarity representation to represent other objects by their dissimilarities to the objects in the representation set.

Scatter. The scatter is a synonym for the variance or covariance matrix, emphasizing that it is derived from an estimate based on observed data points.

Sequential combining. This is a PRTools term for combining consecutive mappings (transformations) between vector spaces into a single mapping. The final result might be just the class labels derived from the classifier output.

Similarity. A measure expressing how much two objects are equal. Usually similarities are positive, sometimes even restricted to the [0,1] interval. See also dissimilarity.

Soft label. Labels point from an object to a class. They are assigned by an expert or estimated during automatic classification. In both cases there might be uncertainty or ambiguity. This can be solved by replacing the traditional crisp labels by soft labels. In PRTools these are numbers between 0 and 1. Every object has a soft label value for every class. They don’t necessarily sum to one. They might be interpreted in several ways: uncertainties, probabilities or fuzzy class membership. See also crisp label.

Stacked combining. This is a PRTools term for combining a set of mappings or classifiers between the same two vector spaces into a single mapping or classifier. The final result might be just the class labels derived from the classifier outputs. Like parallel combining, and in contrast to sequential combining, stacked combining depends on a combining rule.

Supervised learning. This is learning from examples under the supervision of a teacher. This almost always refers to a teacher who assigns desired class labels to the example objects.

Support vector machine. A linear classifier or regression function that is optimised for a distance based criterion as a linear function of a subset of the given vectors (training objects), the support vectors. Thanks to the kernel trick also non-linear classifiers can be trained. The result depends on the choice of the kernel. Usually abbreviated as SVM.
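
A minimal PRTools sketch (svc with its default, linear, kernel; how a non-linear kernel is passed differs somewhat between PRTools versions, so that part is only indicated in a comment):

    a = gendatb([50 50]);             % training set
    w = svc(a);                       % SVM with the default (linear) kernel
    % a non-linear kernel can be supplied as well, e.g. a polynomial
    % proximity mapping in PRTools 5
    e = gendatb([100 100])*w*testc;   % error on an independent test set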

Test error. The classification error as estimated by a test set.

Test set. A set of objects with given class labels used for evaluating a classifier. A good test set is representative of the set of objects to be classified later by the classifier.

Trained classifier. A classifier trained by a training set, ready for evaluation by a test set or to be applied in practice. A trained classifier might be just a linear function. The way it was trained (the untrained classifier) may not be visible anymore. This is a special case of a trained mapping.

Trained mapping. PRTools terminology for a transformation of a vector space optimized for some training set. A trained classifier is a special case of a trained mapping.

Training. The optimization of a mapping or a classifier for a given set of objects, the training set.

Training error. The classification error of a classifier based on the set of objects used for training the classifier, its training set.

Training set. The set of objects used for optimizing a mapping or a classifier.

True classification error. The classification error made by the classifier on the objects of the problem for which it was trained. This error can be estimated by a test set randomly sampled from these objects.

Unseen objects. This expression refers to objects that did not belong to the training set.

Unsupervised learning. This is learning from examples without the supervision of a teacher. This almost always refers to learning from objects for which no class labels have been given.

Untrained classifier. A classifier for which the functional form and a criterion are given but for which the parameters have not yet been specified. They need to be optimized by the use of a training set. This is a special case of an untrained mapping.

Untrained mapping. A mapping for which the functional form and a criterion are given but for which the parameters have not yet been specified. They need to be optimized by the use of a training set. An untrained classifier is a special case of an untrained mapping.

Vector representation. An object representation by a vector space. This is the most popular representation as there are many tools to analyze sets of vectors.
