What error rates can we expect for a trained classifier? How good or bad is a 50% error? Well, if the classes are separable, a zero-error classifier is possible. But a very bad classifier may assign every object to the wrong class. In general, all errors between zero and one are possible: \epsilon \in [0,1].

Much more can be said when the class prior probabilities p(x \in \omega_i), i = 1, \ldots, c, the number of classes c and the class overlap \epsilon^* are given.

The classification error and the Bayes error

Class overlap arises if objects belonging to different classes may have the same representation. Assume that the class-conditional probabilities or densities p(x | x \in \omega), as well as the class prior probabilities p(x \in \omega) (the probability that an arbitrary object belongs to class \omega), are known for all classes \omega. Then we assign every object x to the most probable class given x, i.e. the class for which the posterior probability p(\omega | x) is maximum. Let this be class \omega_i. The error contribution of a vector x is the probability that the object x originates from another class than \omega_i:

\epsilon^*(x) = \textrm{Prob}(\textrm{error} | x) = 1 - p(\omega_i | x) = 1 - \max_j p(\omega_j | x)

The * indicates that this is the minimum error. Taking the expectation over x gives the overall error

\epsilon^* = E_x \epsilon^*(x)

which is the best possible result, the so-called Bayes error. It can only be reached for known probability functions p(x | x \in \omega) and class priors. What happens if we don't know them and sometimes assign x to a class \omega_i for which p(\omega_i | x) is not the largest? Then the error will increase, so \epsilon \ge \epsilon^*. In the worst case every x is assigned to the class \omega for which the posterior probability p(\omega | x) is minimum instead of maximum. So
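To make this concrete, the Bayes error of a simple two-class problem can be approximated numerically. The sketch below assumes two equally probable classes with one-dimensional Gaussian class-conditional densities; all parameter values are illustrative:

```python
import numpy as np

# Assumed setup: two equally probable classes with Gaussian
# class-conditional densities N(-1, 1) and N(+1, 1).
xs = np.linspace(-10.0, 10.0, 200001)
dx = xs[1] - xs[0]
joint1 = 0.5 * np.exp(-0.5 * (xs + 1.0) ** 2) / np.sqrt(2.0 * np.pi)
joint2 = 0.5 * np.exp(-0.5 * (xs - 1.0) ** 2) / np.sqrt(2.0 * np.pi)

# epsilon* = E_x [1 - max_j p(omega_j | x)]: integrating the smaller
# of the two joint densities over x gives the Bayes error.
bayes_error = np.minimum(joint1, joint2).sum() * dx
print(bayes_error)  # about 0.159 for these densities
```

Any classifier for this problem, however well trained, has an error of at least this value.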

\epsilon^*(x) \le \epsilon(x) \le 1 - \min_j p(\omega_j | x)

For two-class problems it holds that \min_j p(\omega_j | x) + \max_j p(\omega_j | x) = 1, as the posteriors for a given x should sum to one and there are just these two. Consequently

\epsilon(x) \le 1 - \min_j p(\omega_j | x) = 1 - (1 - \max_j p(\omega_j | x)) = 1 - \epsilon^*(x)

by which we find, after taking the expectation, that

\epsilon^* \le \epsilon \le 1- \epsilon^* for any classifier with c = 2

However, for c > 2 the minimum posterior at a point x may be 0, as the remaining posteriors can sum to one by themselves. So

\epsilon(x) \le 1 - \min_j p(\omega_j | x) \le 1 - 0 = 1

and thereby we find for the multi-class case

\epsilon^* \le \epsilon \le 1 for any classifier with c > 2
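A minimal sketch of why the multi-class bound is loose. With hypothetical posteriors at some point x in which one class gets probability zero, a classifier that always picks the least probable class is wrong with certainty at that point:

```python
# Hypothetical posteriors p(omega_j | x) at some x for c = 3 classes.
posteriors = [0.5, 0.5, 0.0]

eps_star_x = 1.0 - max(posteriors)  # best case (Bayes rule): 0.5
worst_x = 1.0 - min(posteriors)     # worst case: 1.0
print(eps_star_x, worst_x)
```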

Using a random, prior sensitive classifier

What will happen if we have by accident chosen a bad classifier, e.g. one that assigns labels at random, or a classifier like the nearest neighbor rule applied to a random representation? With probability p(x \in \omega_i) an object of class \omega_i is encountered, and with probability 1 - p(x \in \omega_i) it is not assigned to its own class but to a wrong one. This results in an error

\epsilon =\sum_i p(x \in \omega_i) (1- p(x \in \omega_i)) = 1- \sum_i p(x \in \omega_i)^2

This error is maximum in the case of c equally probable classes, which then all have a prior probability of 1/c, by which

0 \le \epsilon \le 1-1/c for a random, prior sensitive classifier
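As a sketch, the error of the random, prior sensitive classifier can be checked by simulation; the prior values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed class priors for a 3-class problem.
priors = np.array([0.5, 0.3, 0.2])

# Prior sensitive random classifier: the true label and the assigned
# label are drawn independently from the same prior distribution.
n = 200000
true = rng.choice(3, size=n, p=priors)
assigned = rng.choice(3, size=n, p=priors)
empirical_error = np.mean(true != assigned)

expected_error = 1.0 - np.sum(priors ** 2)  # 1 - sum_i p_i^2 = 0.62
print(empirical_error, expected_error)
```

For equal priors (1/c each) the expected error becomes 1 - 1/c, the upper bound above.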

Classification using class priors only

From the above it is clear that bad classifiers may yield large errors. The error of two-class classifiers is bounded in terms of the class overlap (\epsilon \le 1 - \epsilon^*). For multi-class classifiers there is no such bound and the error may reach one. If one has doubts about the representation, the features or the classifier, e.g. after inspecting the results of some cross-validation experiments, a better result can be obtained on the basis of the prior probabilities alone. If they are known (or if the estimate based on the training set is reliable), then assigning all objects to the most probable class classifies all objects of that class correctly. All objects of the other classes are classified wrongly, resulting in the following error:

\epsilon = 1-\max_i {p(x \in \omega_i)} for a classifier based on class priors only

For two-class classifiers with equally probable classes this results in \epsilon = 0.5. If the classes are not equally probable, \epsilon < 0.5. A 50% error is thereby the worst that may happen if c = 2 and no proper representation is available.

\epsilon \le 0.5 for two-class problems and a classifier based on class priors only
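A small sketch comparing the prior-only rule with the random, prior sensitive classifier from the previous section, using assumed prior values:

```python
# Assumed class priors for a 4-class problem.
priors = [0.4, 0.3, 0.2, 0.1]

# Assigning every object to the most probable class.
prior_only_error = 1.0 - max(priors)              # 1 - max_i p_i
# The random, prior sensitive classifier, for comparison.
random_error = 1.0 - sum(p ** 2 for p in priors)  # 1 - sum_i p_i^2
print(prior_only_error, random_error)
```

For these priors the prior-only rule gives an error of 0.6 versus 0.7 for random assignment, illustrating that it is the better fallback.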

How bad is 50% error for a multi-class problem?

A 50% error is the worst that should be found for two-class problems. Larger error rates can only result from the combination of well-separable classes and a very stupid classifier, and can easily be avoided by assigning all objects to the most probable class. In multi-class problems, however, much larger errors may be found. How good is a 50% error for a multi-class classifier? Can we compare it in one way or another with a two-class performance?

To answer this question we have a look at the confusion matrix N resulting from a c-class classifier. It is a c \times c matrix in which element n_{ij} counts the objects with true class \omega_i that are assigned to class \omega_j. The numbers on the diagonal refer to correctly classified objects. The classification error is thereby

\epsilon = {1 \over n} \sum_{i \ne j} n_{ij}

Let the total number of objects be n = \sum_{ij} n_{ij}. The error contribution of an arbitrary two-class problem taken out of this multi-class problem is \epsilon_{ij} = (n_{ij} + n_{ji})/n. In general these contributions will not be equal, but in order to get a feeling for the above question we will assume they are. As there are c(c-1)/2 two-class problems in a c-class problem, the average contribution of a two-class confusion to the overall classification error is 2 \epsilon /(c(c-1)). In such a two-class problem just 2n/c objects are involved instead of n (assuming roughly equal class sizes), so its error probability over these 2n/c objects is a factor c/2 larger. Consequently, the average two-class error \epsilon_2 relates to the c-class error \epsilon_c as follows:

\epsilon_2 = (c/2) (2 \epsilon_c /(c(c-1))) = \epsilon_c/(c-1)

To answer the question: a 50% error in a 50-class problem is about as good as a 1% error in a two-class problem (0.5/49 \approx 0.01), as on average every two-class subproblem inside the 50-class problem has to be solved with this performance to reach an overall error of 50%.
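The computation can be sketched for a hypothetical confusion matrix (the counts below are invented for illustration):

```python
import numpy as np

# Hypothetical confusion matrix N for a c = 3 class problem:
# rows are true classes, columns are assigned classes.
N = np.array([[40,  5,  5],
              [ 6, 38,  6],
              [ 4,  6, 40]])

n = N.sum()                    # total number of objects
eps_c = (n - np.trace(N)) / n  # epsilon = (1/n) * sum over i != j of n_ij
c = N.shape[0]
eps_2 = eps_c / (c - 1)        # average equivalent two-class error
print(eps_c, eps_2)

# The example from the text: a 50% error in a 50-class problem
print(0.5 / (50 - 1))  # about 0.0102, roughly a 1% two-class error
```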
