The post The role of densities in pattern classification appeared first on Pattern Recognition Tools.

Is pattern recognition about statistics? Well, it depends. If you see its goal as understanding how new knowledge can be gained by learning from examples, the role of statistics is debatable. Knowledge lives in the human mind. It is born in the marriage between observations and reasoning. If we follow this process consciously by introspection, statistics is not there. Human knowledge was gained over thousands of years before we encountered the concept of probability. So, if we used probabilities in gaining knowledge, we did so, at least originally, unconsciously.

However, at the moment we start to analyze the process of gaining knowledge itself, and try to build a model of it, probabilities enter the scene. A good pattern recognition system, one that has gained some knowledge, is able to classify new objects accurately. Almost always this is defined by the probability of error, estimated by the fraction of wrong classifications in a test set of examples with known pattern classes.

A strong characteristic of the field of pattern recognition is that it has this very clear performance measure. It can be used to judge students in the field, to compare procedures and to organize competitions. Just using a single number that can be measured by anybody makes life easy and preferences indisputable. There are almost no studies by people who claim that their classifier might be worse in terms of the number of mistakes, but that in the end it is better as its mistakes are less stupid. It is all about quantities. Quality hardly plays a role.

How do we find the best classifier? It is the classifier that assigns the most probable class to new objects $x$, given the training set of examples. In mathematical terms, phrased by the posterior probabilities of two classes $A$ and $B$:

if $P(A|x) > P(B|x)$ then $A$ else $B$

Using Bayes rule, the corresponding classifier can be written as:

$S(x) = P(x|A)P(A) - P(x|B)P(B)$, if $S(x) > 0$ then $A$ else $B$

Here $P(A)$ and $P(B)$ are the class prior probabilities. For continuous vector representations the two other terms can be written as $p(x|A)$ and $p(x|B)$, the probability density functions of the two classes. If they reflect the class density functions in the application, and accordingly in the test set that measures the classifier performance, then this classifier is the best one possible. Because of the use of the Bayes rule in its derivation it is called the Bayes classifier and the corresponding minimum error the Bayes error.
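As a minimal sketch of this rule, assume two classes (here called A and B) with one-dimensional Gaussian densities whose parameters are fully known. The means, the standard deviation and the priors below are hypothetical values chosen for illustration only. The function `discriminant` evaluates $S(x)$, and for this symmetric case the Bayes error follows directly from the normal distribution function.

```python
import math

# Hypothetical class models: 1-D Gaussians with equal priors (assumed values).
PRIOR_A, PRIOR_B = 0.5, 0.5
MEAN_A, MEAN_B = 0.0, 2.0
SIGMA = 1.0  # shared standard deviation


def gauss_pdf(x, mean, sigma):
    """Normal probability density function."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def discriminant(x):
    """S(x) = p(x|A) P(A) - p(x|B) P(B)."""
    return gauss_pdf(x, MEAN_A, SIGMA) * PRIOR_A - gauss_pdf(x, MEAN_B, SIGMA) * PRIOR_B


def classify(x):
    """Assign x to the class with the largest posterior probability."""
    return "A" if discriminant(x) > 0 else "B"


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# For equal priors and equal variances the boundary lies halfway between the
# means, so the Bayes error is the normal tail mass beyond that midpoint.
half_distance = abs(MEAN_B - MEAN_A) / (2.0 * SIGMA)
bayes_error = normal_cdf(-half_distance)
```

With these assumed parameters the decision boundary is at $x = 1$. No classifier trained on samples from these two densities can have a lower expected error than `bayes_error`.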

Formally the pattern recognition problem has been solved now. So why do we find many other classifiers in the literature, like the support vector machine, adaboost, random forest, decision trees and neural networks? They should be worse, or, at most, if trained very carefully, they may asymptotically approximate the Bayes classifier, under specific conditions. They are not directly based on densities, but apply them at most in a very indirect way. So why are they so popular?

The answer is straightforward, but it is not often discussed explicitly, so it is worth stating here. Proper density estimation in high-dimensional spaces is very difficult. The dimensions of feature spaces may range from a few to many thousands. As long as the size of the training set per class is not a multiple of the dimension, the feature space is almost empty. The ‘density’ of training points in the neighborhood of a test point cannot be estimated reliably. So classifiers that are not based on densities are of interest here.
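One rough way to see how empty a high-dimensional space is: divide every feature axis into just two intervals. The number of cells then grows as $2^d$, so a training set of fixed size can occupy only a vanishing fraction of them. The grid and the sample size of 1000 are illustrative assumptions, not part of the argument above.

```python
def max_occupied_fraction(n_objects, dimension, bins_per_axis=2):
    """Upper bound on the fraction of grid cells that can contain a training object."""
    n_cells = bins_per_axis ** dimension
    return min(1.0, n_objects / n_cells)


# With 1000 training objects the bound collapses quickly as the dimension grows.
for d in (2, 10, 20, 50):
    print(d, max_occupied_fraction(1000, d))
```

In 20 dimensions fewer than one in a thousand cells can contain a training object, even with this very coarse grid.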

In conclusion, classifier design is important if we just have a small training set (or want to use one for computational reasons). For larger and larger training sets the performances of such classifiers are expected to become at most as good as classification based on density estimation. For this reason training set sizes should be taken into account in the study of classifier performances. In many comparative studies on classifier performances this is, unfortunately, neglected.

The post Discovered by accident appeared first on Pattern Recognition Tools.

Some discoveries are made by accident. The wrong road brought a beautiful view. An arbitrary book from the library gave a great, new insight. A procedure was suddenly understood in a discussion with colleagues during a poster session. In a physical experiment a failure in controlling the circumstances showed a surprising phenomenon.

Children playing with simple blocks may suddenly discover that a castle or a ship is hidden in the heap. They combine the blocks randomly. A surprising result may suddenly appear, depending on their imagination. When they are sufficiently skilled they may start to beautify it. Perhaps several trials are needed, but then it is there: a beauty.

During one of the lab courses students were investigating some pattern recognition issues using PRTools. It was explained that in some situations the performance of a classifier could be improved by a feature reduction to remove some noise and to run the classifier in a space of lower dimension. The sequence of feature extraction and classifier training can be combined into a single operation. So they compared

`classf1 = fisherc; % straightforward Fisher classifier`

`classf2 = pcam(0.95)*fisherc; % 95% PCA reduction followed by Fisher`

by

`testset*(trainset*classf1)*testc`

`testset*(trainset*classf2)*testc`

Note that in PRTools the `*` operator is overloaded as a pipe: it feeds the data stream from left to right into functions (called mappings) that handle data and may output new data or trained classifiers. The routine `testc` evaluates the trained classifiers. The examples used in the lab course showed that this is useful for high-dimensional data, but not for low-dimensional problems, as PCA may then remove useful directions in the feature space. For instance, try the `k`-dimensional 8-class problem generated by `genmdat('gendatm',k,m*ones(1,8))`, in which `m` is the number of training objects per class. One of the students tried by accident:

`classf3 = fisherc*fisherc;`

instead of `classf2`. It turned out that this is also helpful, and in lower-dimensional situations even much more effective than `classf2`. The feature curve on the right shows this. It is based on an 8-class problem. The entire experiment can be found here. How can the performance of `classf3` be explained?

It should be noted that in the above procedures `classf2` and `classf3` the same data is used twice: once for the dimension reduction and once for training the classifier. In `classf3` the first `fisherc` reduces the dimension of the feature space to 8, the number of class confidences. As these sum to one, the intrinsic dimension is in fact at most 7. This reduction is based on a supervised procedure. For training sets that are sufficiently large in relation to the feature size, this training result is not overtrained, so there is room for a second supervised generalization. For large feature sizes, however, the first `fisherc` is overtrained and the dataset used for training is biased when submitted to the second `fisherc`. For high-dimensional feature spaces (compared with the size of the training set) an unsupervised procedure is needed to enable a useful training procedure based on the same data.

There is a second point to realize. The Fisher classifier is a linear classifier. The separation boundaries between classes are linear. The outputs of the classifier, the class confidences, however, are a non-linear function of the features due to the use of a sigmoid function that transforms distances to classifiers into 0-1 confidences. When these outputs are used as input for a second classifier, the resulting overall class boundaries are non-linear. Below this is illustrated in a two-dimensional problem. The error measured by a large independent test set is given between brackets.
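Independent of the figures, the effect can be checked numerically. The sketch below is plain Python, not PRTools: two linear discriminants are squashed by a sigmoid, and a second linear classifier operates on the resulting confidences. All weights are hypothetical and chosen only to make the effect visible.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def linear(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Stage 1: two linear discriminants whose outputs are squashed to 0-1 confidences.
# The weights are hypothetical, chosen only to make the effect visible.
STAGE1 = [((4.0, 0.0), 4.0), ((-4.0, 0.0), 4.0)]

# Stage 2: a linear classifier operating on the stage-1 confidences.
STAGE2_WEIGHTS, STAGE2_BIAS = (1.0, 1.0), -1.5


def confidences(x):
    return [sigmoid(linear(w, b, x)) for w, b in STAGE1]


def stacked_label(x):
    return 1 if linear(STAGE2_WEIGHTS, STAGE2_BIAS, confidences(x)) > 0 else 0


# Three collinear points: a linear boundary in the original space cannot label
# the middle one differently from both of its neighbours.
labels = [stacked_label(p) for p in [(-3.0, 0.0), (0.0, 0.0), (3.0, 0.0)]]
```

The middle point receives a different label than the two outer points on the same line, so the combined decision boundary cannot be a single hyperplane in the original feature space.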

The procedure resulting in the right plot can be understood as generating an 8-dimensional space, because it is an 8-class problem, and in this space the Fisher classifier is trained. This yields the observation that any classifier applied to this problem generates such an 8-dimensional space and may in the same way be followed by a second classifier. If the first classifier, producing the space, is not overtrained, more advanced classifiers may be used as the second one. Below some examples are presented using QDA (the PRTools routine `qdc`).

The series starts with just QDA in the original 2-dimensional space. This classifier is improved by first transforming this space to 8 dimensions. The second example shows that the multi-dimensional LIBSVM classifier is non-linear. Apparently it is not overtrained, as its outputs can be used to train QDA such that it improves the original classifier. These figures are taken from a more extensive example.

The accidental discovery made by the student is not yet fully exploited. The sequential combination of classifiers in multi-class problems is an interesting area of research.

The post Cross-validation appeared first on Pattern Recognition Tools.

A recurring question from students and colleagues is how to define a proper cross-validation: How many folds? Do we need repeats? How to determine the significance? Here are some considerations.

Cross-validation is a procedure for obtaining an error estimate of a trainable system like a classifier. The resulting estimate is specific for the training procedure. There are two main reasons why we want to have such an estimate. First, it is used for reporting the expected performance to our customer or project partners when we deliver the classifier. This classifier should be as good as possible, and so all available data should be used for training. Consequently, there is no separate test set left for an independent evaluation. Cross-validation offers a solution.

A second application is the comparison of different training procedures. The performance itself is not of primary importance here, but the comparison of different procedures by ranking their performances.

Cross-validation solves the problem of delivering a classifier based on all available objects and obtaining an independent estimate of its performance at the same time. This is done by splitting the dataset into $k$ parts, in this context called folds. $k-1$ folds are used for training and the fold that has been left out is used for testing. This is repeated $k$ times, using every fold once for testing. The error estimates are averaged.

At the end all objects have been used for testing a classifier that was trained by different objects. So the tests are independent. The fold classifiers are all trained by a fraction $(k-1)/k$ of the design set. For $k = 10$, every classifier is trained by 90% of the data. It is expected that each of them is almost as good as the final classifier. The error estimate obtained in this way is thereby slightly pessimistically biased.
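The procedure can be sketched in a few lines of plain Python. This is an illustration under simple assumptions, not the PRTools implementation: the function names, the nearest-mean toy classifier and the one-dimensional data are all hypothetical. Errors are pooled over the folds, which equals averaging the fold errors when the folds have equal size.

```python
import random


def k_fold_indices(n_objects, k, seed=0):
    """Randomly split the indices 0..n_objects-1 into k folds of nearly equal size."""
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(objects, labels, k, train, n_errors, seed=0):
    """Return the k-fold cross-validation error estimate.

    `train(objects, labels)` must return a classifier and
    `n_errors(classifier, objects, labels)` the number of wrong classifications.
    """
    folds = k_fold_indices(len(objects), k, seed)
    wrong = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(objects)) if i not in held_out]
        classifier = train([objects[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        wrong += n_errors(classifier,
                          [objects[i] for i in test_idx],
                          [labels[i] for i in test_idx])
    return wrong / len(objects)


def train_nearest_mean(objects, labels):
    """Toy classifier: store the mean of every class (1-D data)."""
    means = {}
    for c in set(labels):
        members = [x for x, y in zip(objects, labels) if y == c]
        means[c] = sum(members) / len(members)
    return means


def count_errors(means, objects, labels):
    predicted = [min(means, key=lambda c: abs(x - means[c])) for x in objects]
    return sum(p != y for p, y in zip(predicted, labels))


# Hypothetical, well-separated two-class data.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
true_labels = [0] * 5 + [1] * 5
cv_error = cross_validate(data, true_labels, 5, train_nearest_mean, count_errors)
```

Every object is tested exactly once, by a classifier that did not see it during training. The classifier that is finally delivered would be trained separately on all objects.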

It should be realized that just the finite available dataset is used for testing. It has a limited size, otherwise we wouldn’t have a problem. If a test set of $n$ objects is applied to a classifier with true error $\epsilon$, the obtained estimate has a standard deviation of $\sqrt{\epsilon(1-\epsilon)/n}$. Whatever further tricks (see below) are included in the $k$-fold cross-validation procedure, it will never get better than that. The uncertainty in the error estimate is fundamentally limited by the finite sample size.
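To get a feeling for the size of this limit, the binomial standard deviation $\sqrt{\epsilon(1-\epsilon)/n}$ of an error estimate based on $n$ independent test objects can be evaluated for a few test set sizes. The true error of 0.1 is an assumed value for illustration.

```python
import math


def error_std(true_error, n_test):
    """Standard deviation of an error estimate based on n_test independent test objects."""
    return math.sqrt(true_error * (1.0 - true_error) / n_test)


# Assumed true error of 10%, evaluated for growing test set sizes.
for n in (100, 1000, 10000):
    print(n, round(error_std(0.1, n), 4))
```

With 100 test objects a true error of 10% is estimated with a standard deviation of 3%, so differences of one or two percent between classifiers cannot be resolved on such a set.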

The finite size of the test set is not the only source of inaccuracy in the error estimate. There are $k$ fold classifiers tested and each of them is most likely different from the final classifier that is delivered. Sometimes it is suggested to combine the fold classifiers in one way or another. It remains to be investigated, however, to what extent the cross-validation error, based on an average of the errors of the fold classifiers, is still an estimate of the error of the combiner.

In contrast to this effort of making use of the diversity of the set of classifiers, some attempts have been made to make the classifiers as equal as possible. By improving their similarity it is expected that they also become more similar to the final classifier based on the entire design set. The first idea is stratification. In stratified sampling it is not the design set as a whole that is sampled to obtain subsets, but the individual classes. This makes the class sizes between the folds more equal. Problems may arise when the classes have very different sizes and some have sizes close to the number of folds $k$.
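A stratified split can be sketched as follows, again in plain Python and with a hypothetical function name. The objects of every class are shuffled and dealt over the folds separately, so every fold receives nearly the same number of objects from each class.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split object indices into k folds, sampling every class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the objects of this class over the folds like cards; the running
        # offset keeps the total fold sizes balanced across classes.
        for position, index in enumerate(members):
            folds[(offset + position) % k].append(index)
        offset += len(members)
    return folds


# Hypothetical design set with unequal class sizes: 20 and 10 objects.
example_labels = [0] * 20 + [1] * 10
example_folds = stratified_folds(example_labels, 5)
```

In this example every fold contains four objects of the first class and two of the second. A class with fewer objects than folds would necessarily be absent from some folds, which is the problem mentioned above.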

One step further is density preserving sampling [4]. This strategy aims to make the probability distributions of the samples in the folds as equal as possible. This will make the classifiers even more alike than by stratification.

The split into folds, by unstratified as well as by stratified sampling, is random. If it is repeated, a different set of folds is obtained. The resulting classifiers are thereby different, which may result in a different error estimate. In order to decrease the inaccuracy caused by the variability of the fold classifiers, the entire process may be repeated a number of times. This will stabilize the final result, in the sense that the error estimate obtained after $r$ times repeated $k$-fold cross-validation is expected to change less and less for larger values of $r$.

A stable performance estimate implies that the standard deviation in the average of the cross-validation estimates shrinks and approaches zero for large $r$. Consequently, differences between the performances of different classifiers become significant for increasing numbers of repetitions $r$. A proper statistical test is not straightforward and should take into account that the classifiers are tested by the same objects, see [3] and [5]. A fast rule of thumb is that the difference between the means of the performances of two procedures is significant if they differ by more than the sum of the estimated standard deviations of the performances in a single $k$-fold cross-validation.
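The rule of thumb is easily written down. In this sketch each list holds the error estimates of one procedure, one value per repetition of the cross-validation, so the sample standard deviation of a list estimates the standard deviation of a single cross-validation result. The numbers are made up for illustration and the function name is hypothetical.

```python
import statistics


def differ_significantly(errors_a, errors_b):
    """Rule of thumb: compare the difference in mean error with the summed standard deviations.

    errors_a and errors_b hold the error estimates of two procedures, one value
    per repetition of the k-fold cross-validation.
    """
    mean_gap = abs(statistics.mean(errors_a) - statistics.mean(errors_b))
    spread = statistics.stdev(errors_a) + statistics.stdev(errors_b)
    return mean_gap > spread


# Made-up error estimates from five repetitions for three procedures.
procedure_a = [0.10, 0.12, 0.11, 0.09, 0.13]
procedure_b = [0.20, 0.22, 0.19, 0.21, 0.18]
procedure_c = [0.11, 0.13, 0.10, 0.12, 0.14]
```

Here `differ_significantly(procedure_a, procedure_b)` holds, while procedures a and c cannot be distinguished by this rule. It is a quick check only, not a replacement for the tests in the cited references.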

Repeated cross-validation removes the variability in the obtained error estimate. This estimate, however, is still a prediction of the performance of the final classifier based on the entire design set. This prediction has a standard deviation of at least $\sqrt{\epsilon(1-\epsilon)/n}$ (for $n$ objects and a true error $\epsilon$), as explained above. Repeating the cross-validation will not remove this uncertainty as long as it is based on the same set of objects.

The most popular cross-validation procedures are the following:

- 10-fold cross-validation, following a study by Kohavi [1]. This became very popular and is now a standard procedure in many papers.
- 10 times 10-fold cross-validation. This may be useful for comparing classifiers as the 10 repeats shrink the standard deviations in the means of the estimated classifier errors.
- 5 times 2-fold cross-validation, as proposed by Dietterich [2]. In this paper it is shown how to compute the significance of differences in classification error means. This was later improved by Alpaydin [3].
- 8-fold density preserving sampling, as proposed by Budka and Gabrys [4]. The number of folds in this procedure should be a power of 2. Repeating makes no difference as the sampling is deterministic.
- Leave-one-out (LOO) cross-validation. In this case the number of folds is equal to the number of objects $n$ in the design set. For large datasets this is very time-consuming. The sampling procedure is deterministic, as every object is left out exactly once. So repeating makes no difference.

The first four of these are compared in a large experiment in which every procedure is applied to seven classifiers. The cross-validation error predictions are compared with the performances of the classifiers based on the entire design set. This results, for every procedure, in a difference between the estimated ranking of the seven classifiers (based on the cross-validation of the design set) and the true ranking (based on the performance of the classifier trained on the full design set and estimated on a large, independent test set). It appeared that 10 times 10-fold cross-validation is usually to be preferred, and that 8-fold density preserving sampling is never significantly worse but is much faster. The full experiment (in Matlab) may be downloaded.

[1] R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, IJCAI, pp. 1137-1145, Morgan Kaufmann, 1995.

[2] T.G. Dietterich, Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation, 10(7), pp. 1895-1924, 1998.

[3] E. Alpaydin, Combined 5 x 2 cv F Test for Comparing Supervised Classification Learning Algorithms, Neural Computation, 11(8), pp. 1885-1892, 1999.

[4] M. Budka, B. Gabrys, Correntropy-based density-preserving data sampling as an alternative to standard cross-validation, IJCNN2010, 1-8.

[5] A. Webb, Statistical Pattern Recognition, 2nd edition, Wiley, 2002, page 267.

The post Adaboost and the Random Fisher Combiner appeared first on Pattern Recognition Tools.

As in most areas, pattern classification and machine learning have their hypes. In the early 1990s neural networks awoke and enlarged the community significantly. This was followed by the support vector machine, reviving the applicability of kernels. Then, from the turn of the century, the combining of classifiers became popular, with significant fruits like adaboost and random forest. Now we have neural networks again, and deep learning.

Due to the hypes, interest shifts to new areas before the old ones are fully exploited. Here we will take a breath, return to adaboost and compare it with similar but less complex alternatives.

Adaboost is a great, beautiful system for incremental learning. It generates simple base classifiers, see [1]. Classifiers found later in the process are trained on subsets of the data that appear to be difficult to classify correctly. Every base classifier directly receives a classifier weight based on its success in performing its local task. Finally, all base classifiers are combined by a weighted voting scheme based on these weights.
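For readers who want to see the mechanics, here is a compact sketch of discrete adaboost as it is commonly described. It is not the code of the experiments discussed here: it uses object reweighting instead of subset selection, threshold classifiers on one-dimensional data as weak learners instead of perceptrons, and a tiny made-up dataset that no single threshold can separate.

```python
import math


def best_stump(xs, ys, weights):
    """Weighted-error-optimal threshold classifier on 1-D data (a weak learner)."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (+1, -1):
            predictions = [sign if x >= threshold else -sign for x in xs]
            error = sum(w for w, p, y in zip(weights, predictions, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best


def adaboost(xs, ys, n_rounds):
    """Discrete AdaBoost; labels must be +1 or -1. Returns (alpha, threshold, sign) triples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, sign = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1.0 - 1e-12)    # avoid log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)  # classifier weight
        ensemble.append((alpha, threshold, sign))
        # Emphasize the objects this base classifier got wrong.
        weights = [w * math.exp(-alpha * y * (sign if x >= threshold else -sign))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all base classifiers."""
    vote = sum(alpha * (sign if x >= threshold else -sign)
               for alpha, threshold, sign in ensemble)
    return 1 if vote >= 0 else -1


# Made-up data: the middle objects belong to the other class.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]
```

Note that every classifier weight `alpha` is computed from the weighted error of that base classifier alone. The weights are never optimized jointly, which is the point the alternatives discussed below depart from.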

The base classifiers should be simple, so-called weak classifiers. They are expected to show some performance, but also to have a large variability. In the above example we used linear perceptrons (a single neural network neuron), randomly initialized and trained only briefly. In the left scatter plot the first 20 are shown for one of the training sessions.

The blue line in the right plot shows the averaged adaboost performance (over 10 experiments) as a function of the number of base classifiers, up to 300. It is monotonically decreasing due to the clever choice in the adaboost algorithm to estimate the classifier weights individually and not to optimize them in relation to each other.

Is the adaboost combiner really the best one? Let us study three alternatives. They are computed for every collection generated in the adaboost procedure. First, the decision tree. It starts with finding the best classifier for the entire training set and then gradually splits it according to the classifier decisions. If a pure node is reached (a cell in the feature space with training objects of a single class) a final decision is made. For an increasing number of base classifiers this system may overtrain. This is countered by pruning the decision tree. The dark green line in the right plot shows that this is not entirely successful. The results are best for about 100 base classifiers and are slightly worse than the final adaboost result. For small numbers of base classifiers, however, the decision tree is significantly better than the adaboost combiner.

The fisher classifier optimizes the combination of the weights of the base classifiers for the training set. Again this is done for each considered set separately. The number of weights to be optimized increases, resulting in overtraining. This is maximal (red line) if the number of weights equals the size of the training set, a well-known phenomenon. For this number of objects the optimization is maximally sensitive to noise [3]. For small numbers of base classifiers the performance is similar to that of the decision tree.

The above-mentioned fisher procedure (red line) is called binary fisher here, as it is based on just the binary (0-1) outcomes of the base classifiers. There is, however, more information: the distance to the classifier. It can be used as input for the fisher procedure instead. This results in the other green line in the plot. It is especially good for very small numbers of base classifiers. In this example it is even slightly better than the final adaboost result, 0.027 against 0.031.

The advantage of this fisher combiner over the adaboost combiner is that its best performance is based on just 20 base classifiers instead of 300. What it combines are still the base classifiers generated in the adaboost scheme. Is this really significant, or can other, simpler procedures be used as well? We tried another scheme.

Instead of generating weak classifiers according to the adaboost scheme, in which the resulting classifiers are dependent, we generated a set of independent classifiers by using the 1-nearest neighbor classifier based on just two randomly selected objects, one from each class. Using the weighted voting combiner [2], a set of weights can be found very similar to adaboost, but now based on the entire training set in which each object is equally important. The blue line in the right plot shows that this procedure is much worse than adaboost. The decision tree and the fisher combiners, however, show results similar to those in the first experiment. For small numbers of base classifiers the results are even significantly better, 0.021 for the fisher combiner. This suggests that the adaboost scheme of generating base classifiers is good for the adaboost combiner, but not for other combiners.

As a result we have a much simpler classifier than adaboost, similar in its architecture of a weighted combination of base classifiers. In this example it has a similar performance. A larger set of experiments should judge the value of the random fisher combiner as presented here.

The source of the experiments as well as an implementation of the random fisher combiner, both based on PRTools, are available.

[1] J. Friedman, T. Hastie, R. Tibshirani, Special Invited Paper. Additive Logistic Regression: A Statistical View of Boosting, The Annals of Statistics, 28(2), pp. 337-374, 2000.

[2] L.I. Kuncheva and J. J. Rodriguez, A weighted voting framework for classifiers ensembles, Knowledge and Information Systems 38, 2014, 259-275

[3] R.P.W. Duin, Classifiers in Almost Empty Spaces, Proc. ICPR15, vol. 2, 2000, 1-7.

The post Using the test set for training appeared first on Pattern Recognition Tools.

Never use the test set for training. It is meant for independent validation of the training result. If it has been somewhere included in the training stage it is not independent anymore and the evaluation result will be positively biased.

This has been my guideline for a long time. Some students were shocked when I started to experiment in the opposite direction, partially triggered by the literature. Why do it differently? Does the above not hold?

The first, very good reason has been recognized all along. At some moment, after analyzing a dataset extensively, a final classifier has to be delivered. From various train set/test set splits and cross-validation experiments it should already be clear what performance can be reached. The classifier to be delivered to the customer or supervisor should be as good as possible, and so it has to be trained by all data that is available. There is no data left for testing, but previous results obtained from a part of the dataset may give a fair estimate. The final classifier might be slightly better than our estimate, but in practice it may even be worse if the analysis took many rounds in which train set and test set have been used repeatedly.

There is a second reason to use the test set for training. Every object to be classified brings its own information about the problem at hand. This is most clear for high-dimensional problems given by a small dataset. Additional objects can be helpful to determine a relevant subspace with a higher accuracy, e.g. by principal component analysis. The true labels of test objects that are included in such an analysis should not be used, of course. In this way a better classifier may be found, which can still be evaluated by the same test set, now using its labels. The price to be paid is a repeated training for every (set of) test object(s) to which the procedure has to be applied. There is no final classifier that can be used directly. Depending on the new test objects, a new classifier has to be computed.
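The idea can be sketched on a toy scale in plain Python. The principal direction of two-dimensional data is estimated from all objects, labeled and unlabeled together, and only the labeled training objects are used to train a classifier in the resulting one-dimensional subspace. The data, the nearest-mean rule and the function names are hypothetical stand-ins, far simpler than any realistic high-dimensional experiment.

```python
import math


def principal_direction(points):
    """Unit vector along the largest-variance direction of 2-D points (PCA to one dimension)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    syy = sum((p[1] - mean_y) ** 2 for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(angle), math.sin(angle))


def project(points, direction):
    return [p[0] * direction[0] + p[1] * direction[1] for p in points]


# Hypothetical data: a few labeled training objects and many unlabeled test objects.
train_objects = [(0.0, 0.2), (1.0, 0.9), (4.0, 4.1), (5.0, 4.8)]
train_labels = [0, 0, 1, 1]
test_objects = [(float(i), float(i)) for i in range(-5, 11)]

# The subspace is estimated from ALL objects; the test labels are never touched.
direction = principal_direction(train_objects + test_objects)

# Only the labeled training objects are used to train (here: a nearest-mean rule).
projected_train = project(train_objects, direction)
class_means = {}
for label in set(train_labels):
    values = [v for v, y in zip(projected_train, train_labels) if y == label]
    class_means[label] = sum(values) / len(values)

predicted = [min(class_means, key=lambda c: abs(v - class_means[c]))
             for v in project(test_objects, direction)]
```

For a different set of test objects the subspace, and thereby the classifier, has to be computed again.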

Such a procedure can be called transductive learning. There is no induction, as there is no classifier computed to be applied to all future objects. Instead, a classifier has been derived that should be applied to just the objects at hand. Here is an example.

The figure shows the result of an experiment using the quadratic classifier QDA on the Sonar dataset. This two-class dataset has 208 objects in 60 dimensions. Small training sets of 2 to 40 objects per class are used. All other objects, without their labels, are added to the training set to compute, by PCA, low-dimensional subspaces of 4, 5 or 7 dimensions. In these subspaces only the labeled training set is used for computing a classifier, which is applied to all test objects. The resulting classification errors are shown in the vertical direction. Classification errors for small training sets are very noisy. For that reason the averages over 250 (!) repetitions are shown. In each run the dataset was randomly split into a training set of the desired size and a test set.

The results might seem promising. The use of the test set for building an appropriate feature reduction needed for the small training set sizes is of significant help. However, the following points have to be considered before a final judgment on the applicability should be made.

- It took some effort to find an example like this. Several datasets, classifiers and subspace dimensionalities had to be examined.
- Training sets should be small, but not very small, in order to profit from an additional unlabeled set. If there is no performance, it cannot be improved. Also nothing can be gained in case of sufficiently large training sets.
- A single additional object is hardly of any help. The additional unlabeled set should preferably be an order of magnitude larger than the labeled set.
- An additional unlabeled set might be counterproductive. If used for linear feature extraction, like in the above example, the extracted subspace might be bad for the small training set.

As an additional unlabeled set may improve as well as deteriorate the result, a cross-validation procedure is needed to determine whether it should be used or not. However, as it is usually only helpful for small training sets, the cross-validation results may not be significant enough for a final judgement. Given this state of affairs, research focuses on procedures that will certainly not hurt.

The post My classifier scores 50% error. How bad is that? appeared first on Pattern Recognition Tools.

What error rates can we expect for a trained classifier? How good or bad is a 50% error? Well, if classes are separable, a zero-error classifier is possible. But a very bad classifier may assign every object to the wrong class. Generally, all errors between zero and one are possible: $0 \le \epsilon \le 1$.

Much more can be said for given class prior probabilities $P(\omega_i)$, number of classes $c$ and class overlap $\epsilon^*$.

Class overlap arises if objects belonging to different classes may have the same representation $x$. Assume that the class probabilities or densities $p(x|\omega_i)$ as well as the class prior probabilities $P(\omega_i)$ (the probability that an arbitrary object belongs to class $\omega_i$) are known for all classes $i = 1, \dots, c$. Then we will assign every object $x$ to the most probable class given $x$. This is the class for which the posterior probability $P(\omega_i|x)$ is maximum. Let this be class $\omega_m$. The error contribution of a vector $x$ is the probability that the object originates from another class than this one.

$$\epsilon^*(x) = 1 - P(\omega_m|x) = 1 - \max_i P(\omega_i|x)$$

The * indicates that this is the minimum error. The overall error for an arbitrary $x$ is thereby

$$\epsilon^* = E_x[\epsilon^*(x)] = \int \left(1 - \max_i P(\omega_i|x)\right) p(x)\,dx$$

which is the best possible result, the so-called Bayes error. It can only be reached for known probability functions and class priors. What happens if we don’t know them and sometimes assign $x$ to a class $\omega_j$ for which $P(\omega_j|x)$ is not the largest? Then the error will increase, so $\epsilon \ge \epsilon^*$. In the worst case every $x$ will be assigned to the class for which the posterior probability is minimum instead of maximum. So

$$\epsilon(x) \le 1 - \min_i P(\omega_i|x)$$

For two-class problems it holds that $\min_i P(\omega_i|x) = 1 - \max_i P(\omega_i|x)$, as the sum of all probabilities for a given $x$ should be one and there are just these two. Consequently

$$\epsilon(x) \le \max_i P(\omega_i|x) = 1 - \epsilon^*(x)$$

by which we find, after taking the expectation, that

$$\epsilon^* \le \epsilon \le 1 - \epsilon^*$$

for any classifier with $c = 2$.

However, for $c > 2$ the minimum posterior for every point $x$ may be 0, as it is possible that the other posteriors sum to one. So

$$\epsilon(x) \le 1$$

and thereby we find for the multi-class case

$$\epsilon^* \le \epsilon \le 1$$

for any classifier with $c > 2$.

What will happen if we have chosen by accident a bad classifier, e.g. one that uses random assignment, or a classifier like the nearest neighbor rule while the representation is random? With a probability $P(\omega_i)$ an object of class $\omega_i$ will be encountered and with a probability $1 - P(\omega_i)$ it will not be assigned to its own class but to a wrong class. This results in an error

$$\epsilon = \sum_{i=1}^{c} P(\omega_i)\,(1 - P(\omega_i))$$

This will be maximum in case of equally probable classes. They will all have a prior probability of $1/c$, by which

$$\epsilon \le 1 - 1/c$$

for a random, prior sensitive classifier.

From the above it is clear that bad classifiers may yield large errors. The error of two-class classifiers is bounded by the class overlap. Multi-class classifiers are unbounded from above and may result in an error of one. If one has some doubts about the representation, the features or the classifier, e.g. after inspecting the results of some cross-validation experiments, a better result can be realized on the basis of the prior probabilities. If they are known (or if the estimate on the basis of the training set is reliable) then assigning all objects to the most probable class will classify all objects of this class correctly. All objects of the other classes will be classified wrongly, resulting in the following error:

$$\epsilon = 1 - \max_i P(\omega_i)$$

for a classifier based on class priors only.

For two-class classifiers with equally probable classes this results in $\epsilon = 0.5$. If the classes are not equally probable, $\epsilon < 0.5$. A 50% error is thereby the worst thing that may happen if $c = 2$ and no proper representation is available.

$$\epsilon \le 0.5$$

for two-class problems and a classifier based on class priors only.
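The two reference errors discussed above depend on the class priors only, so they are easily computed. The sketch below assumes that the random classifier assigns class $i$ with a probability equal to its prior, as in the argument above. The function names and the example priors are hypothetical.

```python
def random_assignment_error(priors):
    """Error of a classifier that assigns class i with probability P(i), ignoring the object."""
    return sum(p * (1.0 - p) for p in priors)


def prior_only_error(priors):
    """Error when every object is assigned to the most probable class."""
    return 1.0 - max(priors)


# Example priors: four equally probable classes, and an unbalanced two-class problem.
equal_priors = [0.25, 0.25, 0.25, 0.25]
skewed_priors = [0.7, 0.3]
```

For four equally probable classes both errors are 0.75. For priors 0.7 and 0.3 random assignment gives 0.42, while assigning everything to the most probable class gives 0.3, which is below the 50% mark.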

A 50% error is the worst that should be found for two-class problems. Only in case of well-separable classes and the use of a very stupid classifier larger error rates may result. This can easily be avoided by assigning all objects to the most probable class. However, in case of multi-class problems much larger errors may be found. How good is a 50% error for a multi-class classifier? Can we compare it in one way or the other with a two-class performance?

To answer this question we have a look at the confusion matrix $C$ resulting from a $c$-class classifier. It is a $c \times c$ matrix in which element $C_{ij}$ lists the number of objects with true class $\omega_i$ that are assigned to class $\omega_j$. The numbers on the diagonal refer to correctly classified objects. The classification error is thereby

$$\epsilon_c = \frac{1}{n} \sum_{i \ne j} C_{ij}$$

Let the total number of objects be $n$. The error contribution of an arbitrary two-class problem taken out of this multi-class problem is $(C_{ij} + C_{ji})/n$. In general these contributions will not be equal, but in order to get a feeling for the above question we will assume they are. As there are $c(c-1)/2$ two-class problems in a $c$-class problem, the average contribution of a two-class confusion to the overall classification error is $2\epsilon_c/(c(c-1))$. In such a two-class problem just $2n/c$ objects are involved instead of $n$. Its two-class error probability for these objects is thereby a factor $c/2$ larger. Consequently, the average two-class error $\epsilon_2$ relates to the $c$-class error $\epsilon_c$ as follows:

$$\epsilon_2 = \frac{c}{2} \cdot \frac{2\epsilon_c}{c(c-1)} = \frac{\epsilon_c}{c-1}$$

To answer the question: a 50% error in a 50-class problem is about as good as a 1% error in a two-class problem, as every two-class problem in the 50-class problem has, on average, to be solved with this performance to reach an overall error of 50%.
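
The comparison can be reproduced with a few lines of code. This is a sketch under the same simplifying assumption as in the text, namely that all pairwise confusions contribute equally and that the classes are about equally large. In that case the average two-class error is the multi-class error divided by the number of classes minus one. The function names and the small confusion matrix are made up for illustration.

```python
def classification_error(confusion):
    """Fraction of off-diagonal counts in a square confusion matrix.

    confusion[i][j] is the number of objects of true class i
    that were assigned to class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total


def average_two_class_error(error, n_classes):
    """Average pairwise error under the equal-confusion assumption."""
    return error / (n_classes - 1)


# Hypothetical 3-class confusion matrix with 10 objects per class.
confusion = [[8, 1, 1],
             [1, 8, 1],
             [1, 1, 8]]
error = classification_error(confusion)            # 6 of 30 objects are wrong
print(error, average_two_class_error(error, 3))

# The case discussed in the text: 50% error in a 50-class problem.
print(average_two_class_error(0.5, 50))             # roughly 0.01
```

In the 3-class example every pair of classes confuses 2 of its 20 objects, so the pairwise error of 0.1 agrees with the overall error of 0.2 divided by 2.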

The post My classifier scores 50% error. How bad is that? appeared first on Pattern Recognition Tools.

]]>The post Is every pattern recognition problem a small sample size problem? appeared first on Pattern Recognition Tools.

]]>If we want to learn a new concept we may ask for a definition. This might be good in mathematics, but in real life it is often better to get some examples. Let us, for instance, try to understand what *despair* means. The dictionary tells us that it means ‘loss of hope’. This is just rephrasing. We didn’t learn anything new if we already know the words ‘loss’ and ‘hope’.

In order to learn something new, some examples are needed instead of a definition. Asking the internet for related paintings very soon produces the famous Scream by Munch. You may also ask for some music and find Mahler 10. In literature, of course, much can be found, e.g. in the novels of John Steinbeck, as well as in his biography.

How many real world examples do we need to understand what *despair* means? The first picture is already very helpful, and after listening to some pieces of music and reading a few short stories or (parts of) some novels we may have a good understanding of the concept. Is it possible to feed a pattern recognition system with the same small set of pictures, music and texts such that it will recognize similar real world examples of the occurrence of ‘despair’? This might be doubted.

It is more realistic to expect that thousands of real world, observable expressions of *despair* are needed, and perhaps a set of complicated neural networks, to have any success. We are dealing with a complex concept which cannot easily be learned from a few examples. So maybe for humans, having the experience of life, this can be done, but for training a pattern recognition system that lacks this context, thousands or many more examples will be needed.

Is this not true for many more problems, even for seemingly simple ones like character recognition? This application could only be solved after large training sets became available as well as computers to handle them. So not just for complex concepts, but also for simple ones, large datasets may be needed.

In daily pattern recognition practice many problems still arise, e.g. in medical applications, in which just a few tens of objects are available. The collection of more examples is very expensive or virtually impossible. These are truly small sample size problems. The study of the general aspects of such problems has generated many papers handling overtraining and the curse of dimensionality. See the 1991 paper by Jain and Raudys for a classic study.

Overtraining happens if the complexity of the system we have chosen for training is too large for the given dataset. Consequently the system will adapt to the noise instead of to the class differences. The solution is to enlarge the dataset or, if that is not possible, to simplify the system. A big problem is, however, that in practice we do not know whether overtraining happens, unless a sufficiently large test set is available. If we had it, it would be better to add it to the training set, as this will generally improve the classifier. Consequently, the optimal complexity of the system cannot be determined and has to be chosen blindly. Of course, past experience and prior studies will be very helpful in making an appropriate choice.

Overtraining might not be a problem for large training sets. However, the classifier complexity can always be enlarged, e.g. by a higher non-linearity, or by more layers or units in a neural network. This will cause a more detailed description of the dataset. Initially, a better classification performance can be expected, but by increasing the complexity too far, the system may start to suffer from overtraining again.

This may be phrased again as having a small sample size problem: the dataset is too small for the chosen complexity of the model. The problem of adapting the complexity of the generalization procedure (dimension, number of parameters, regularization) to the given dataset size will always arise in the attempt to optimize the classification performance. In this sense, every pattern recognition problem has to face the small sample size problem in searching for the best performance.

]]>The post Aristotle and the ugly duckling theorem appeared first on Pattern Recognition Tools.

]]>We already discussed several times the significance of understanding the Platonic and Aristotelian ways of gaining knowledge. It can be of great help to researchers in the field of pattern recognition in the appreciation of contributions by others, in discussions with colleagues and in supervising students. This may hold for science in general, but this understanding becomes particularly crucial if one studies the design of machines that try to integrate knowledge and observations.

The Platonic thinker may have a great, intuitive vision of his area of research. Put to the extremes, he may think that he knows it all, but he is not able to concretize it into words or mathematics. Colleagues may accuse him of building castles in the air. The true Platonic, however, has a keen interest in reality. He wonders where he can find or realize his ideas in the world he lives in.

The Aristotelian researcher starts from his observations and tries to combine them logically. In the extreme he is mainly a collector of observations and at most comes to empty abstractions. The true Aristotelian, however, is not a nominalist, but wants to infer meaningful concepts from his observations. Some talks with a Platonic colleague may be inspiring for both.

When the designer of a pattern recognition system has nobody around with a larger picture of the problem and when he is also not able himself to gain a wider view, he might be in great trouble. We will illustrate this with the problem of feature extraction.

Assume that in some recognition problem the objects are given by a small set of features. We may realize that these features might not be appropriate as they are not specifically selected for the problem at hand. For that reason we consider extending the original representation with various non-linear combinations of the given ones. This set can be very large. The feature space may become high-dimensional and we realize that our classifiers will overtrain. Consequently we keep the original feature set, or at most extend it with some polynomial combinations. Effectively we are biased towards the given feature set. So what is needed is somebody with a good idea about the features to choose.

This well known reasoning has been put to an extreme by Satosi Watanabe, one of the fundamental thinkers on pattern recognition. He used the “ugly duckling theorem” to illustrate what happens in the case of a given set of binary features. Suppose that k features are given. There are 2^k different feature vectors possible. Consider all possible logic functions of the k features. Different logic functions will map at least one of the 2^k feature vectors differently. Consequently there are 2^(2^k) different logic functions. They are now used instead of the original features.

Let us now consider two objects that differ at least in one of the k original features. We compute for both of these feature vectors all the 2^(2^k) logic functions. It can be shown, e.g. see the lecture notes by Costa and Backofen, that independent of the difference of the original features, exactly half of the logic functions will be equal and half will be different (unless the original feature vectors are identical, of course). In conclusion, all objects are equally different if we represent them by all possible combinations of the original features.
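
For a small number of features the counting argument can be verified by brute force. The sketch below is an illustration, not taken from the cited lecture notes. It enumerates every logic function of k binary features as a truth table and counts on how many of them two given feature vectors agree. The function name is hypothetical.

```python
from itertools import product


def count_agreeing_functions(x, y, k):
    """Count the logic functions of k binary features with f(x) == f(y).

    Returns (number of agreeing functions, total number of functions).
    """
    inputs = list(product((0, 1), repeat=k))       # all 2^k feature vectors
    ix, iy = inputs.index(tuple(x)), inputs.index(tuple(y))
    agree = 0
    total = 0
    # A logic function is one truth table: an output bit for each input,
    # so there are 2^(2^k) of them.
    for table in product((0, 1), repeat=len(inputs)):
        total += 1
        if table[ix] == table[iy]:
            agree += 1
    return agree, total


# Two objects that differ in one feature, and two that differ in all three.
print(count_agreeing_functions((0, 0, 0), (0, 0, 1), 3))
print(count_agreeing_functions((0, 0, 0), (1, 1, 1), 3))
```

For k = 3 there are 256 logic functions, and both pairs agree on 128 of them, whether they differ in one feature or in all three. The enumeration grows as 2^(2^k), so this check is only practical for very small k.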

In the ugly duckling story the ugly duckling is, in the generalized function representation, equally different from all other ducklings as the mutual differences between the other ducklings, independent of how small these are in terms of the original features. Consequently, if we don’t have any preference how to represent the objects in features, then, in expectation, by an arbitrary choice or by using everything and all possible combinations, objects are equally different.

The observation that some objects are more similar than others is subjective. It depends on the choice of the features. How does the Aristotelian researcher know what feature to use? He needs a Platonic insight, either by himself or by a colleague, to know what to observe to see the difference between the ugly ducklings and the other ducklings.

]]>The post Why is the nearest neighbor rule so good? appeared first on Pattern Recognition Tools.

]]>Just compare the new observations with the ones stored in memory. Take the most similar one and use its label. What is wrong with that? It is simple and intuitive, the implementation is straightforward (everybody will get the same result), there is no training involved and it has asymptotically a very nice guaranteed performance, the Cover and Hart bound.

The nearest neighbor (NN) rule is perhaps the oldest classification rule, much older than Fisher’s LDA (1936), which according to many is the natural standard. Marcello Pelillo dates it back to Alhazen (965 – 1040), which is not fully accurate as Alhazen described template matching as he had no way to store the observed past, see a previous post. Anyway, it has a long history and interesting properties, but it is not very popular in machine learning studies. Moreover, it was neglected for a long time in the neural network community. Why is that?

Is it that researchers judge its most remarkable feature, no training, as a drawback? Maybe, maybe not, but for sure, this property positions the rule outside many studies. It is very common in machine learning to describe statistical learning in terms of an optimization problem. This implies the postulation of a criterion and a way to minimize it. Some rules are directly defined in this way, e.g. the support vector classifier. Other rules can be reformulated as such. However, the NN rule is not based on a criterion and there is no optimization. It directly relies on the data and a given distance measure.

One way to judge classifiers is by their complexity. Intuitively this can be understood as the flexibility of shaping the classifier in the vector space during the above optimization procedure. Complexity should be in line with the size of the dataset to avoid overtraining. The VC complexity of the NN rule, however, is infinite. According to statistical learning theory this is very dangerous. It will result in a zero empirical error (on the training set), by which there is the danger of an almost random result in classifying test sets. We are warned: do not use such rules. And yes, artificial examples can be constructed that show this for the NN rule.

On the left, a 2D scatter plot of almost separable classes for which the NN rule performs badly. The distances of nearest neighbors of different classes are similar to those of the same class. There exists, however, a linear classifier that separates the classes well. The scatter plot on the right shows classes for which the NN rule performs reasonably well. For this case advanced classifiers like a neural network or a non-linear support vector classifier are needed to obtain a similar result.

Here is a key point. High complexity classifiers are dangerous as they can adjust the decision boundary in a very flexible way that adapts to the given training set, but does not generalize. This also holds for the NN rule. However, the performance of that rule can very easily be estimated by a leave-one-out cross-validation. As there is no training needed, it is as fast as computing the apparent error (which will be zero). On the other hand, estimating the leave-one-out error for other high-complexity classifiers can be computationally very expensive as retraining is needed. Moreover, the estimated performance should ideally be included in the training procedure, which is prohibitive. Consequently, the danger that comes with a high complexity is not present for the NN rule. It can be safely applied after judging the leave-one-out error.
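
A minimal sketch of this comparison is given below, assuming a squared Euclidean distance and a tiny made-up dataset without duplicate points. It is not the PRTools implementation. The leave-one-out estimate is obtained by simply skipping the object itself when searching for its nearest neighbor, so no retraining is involved.

```python
def nearest_neighbor_label(x, data, labels, skip=None):
    """Label of the stored object closest to x (squared Euclidean distance).

    If skip is given, the stored object with that index is ignored.
    """
    best_label, best_dist = None, float("inf")
    for i, (point, label) in enumerate(zip(data, labels)):
        if i == skip:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(x, point))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def apparent_error(data, labels):
    """Error on the training set itself: every object finds itself."""
    wrong = sum(nearest_neighbor_label(x, data, labels) != y
                for x, y in zip(data, labels))
    return wrong / len(data)


def leave_one_out_error(data, labels):
    """Error when each object is classified by all the other objects."""
    wrong = sum(nearest_neighbor_label(x, data, labels, skip=i) != y
                for i, (x, y) in enumerate(zip(data, labels)))
    return wrong / len(data)


# Hypothetical 2D dataset: the last object of class "b" lies among class "a".
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.3, 0.3)]
labels = ["a", "a", "a", "b", "b", "b"]

print(apparent_error(data, labels), leave_one_out_error(data, labels))
```

On this toy set the apparent error is zero, as expected, while the leave-one-out estimate is 1/6 because the isolated object of class "b" is assigned to class "a". Both estimates need the same number of distance computations.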

It might be argued that the NN rule does not perform a proper generalization, as it just stores the data. The generalization is in the given distance measure. This might be true, but almost all classifiers depend in one way or the other on a given distance measure. Next, the size of the storage might be criticized. Accepted, but some neural networks store a comparable collection of weights and need considerable training.

Again, why is the NN rule so good in spite of its high complexity? There is the following evolutionary argument. Problems for which nearest neighbors have different labels will not be considered. They are directly neglected as wrongly formulated or too difficult. Recognition problems arise in a human context, in scientific research or in applications in which human decision making has to be supported. Thereby, they can directly, after looking at a few examples, be considered as too difficult. They will be put aside and the effort will be focused in another direction.

The NN rule is good as it reflects human decision making, because it is based only on a distance measure designed or accepted by the analyst. Secondly, and most prominently, because no training is needed, only problems in which this rule shows a promising result will survive for further investigation. In short, the NN rule is good because it matches the problems of interest and vice versa.

]]>The post There is no best classifier appeared first on Pattern Recognition Tools.

]]>Every problem has its own best classifier. Every classifier has at least one dataset for which it is the best. So there is no end to pattern recognition research as long as there are problems that are at least slightly different from all other ones that have been studied so far.

The reason for this is that every training rule is based on some explicit or implicit assumptions. For datasets that exactly fulfill these, the classifier is the best one. Classifiers that are more general, i.e. make fewer assumptions, need more training examples to compensate for the lack of prior knowledge. Classifiers that make assumptions that are not true perform worse.

The above is illustrated by one of the solutions submitted to the 2010 classification competition of the ICPR. About 20 standard PRTools classifiers are studied over 300 datasets. It appeared that every classifier in the set was the best one for at least one of the datasets. This showed that the collection of datasets defined for this competition reflected sufficiently well the set of problems PRTools was designed for.

In short: there is no such thing as the best classifier. The notion only has a meaning for a particular problem or for a sharply defined set of problems.

The above may seem obvious and is accepted by many researchers in the field. However, it is often not recognized by the present culture of writing and reviewing papers on proposals for new or modified classifiers. As every classifier has its own problems for which it is the best, it is no surprise that they exist. The interesting point is to characterize such problems, which is not always easy. One way to do this is to create a well defined artificial problem for which the classifier is optimal. This problem should exactly fulfill the appropriate conditions and assumptions.

In addition it is of interest to the community to present at least a single real world example for which it is the best as well. This will prove that such problems exist and that the proposal has practical significance. Showing the performance over a large set of public domain datasets is not really needed, but might be of interest. The average performance over such a set has no meaning, as it is based on an arbitrary collection of problems that will not coincide with any real world environment.

Finally, the characteristics of the classifier may be illustrated in addition by showing examples for which it definitely fails. Such examples should exist, as in some applications the classifier assumptions are not fulfilled at all. Finding or creating them further clarifies how to position the proposed classifier.

All comparisons should be made with published classifiers that in nature are close to the one under study, and preferably with some general standard classifier too.

- Make assumptions and conditions for the proposal as explicit as possible.
- Illustrate the superior performance with at least one well defined artificial example.
- Find a real world problem in which the classifier performs best.
- Find a well defined problem in which the classifier performs badly.
- Base all comparisons on classifiers that, in design, are close to the proposal.

]]>