The goal of representation in pattern recognition is to map objects into a domain in which they can be compared. Usually, this domain is a vector space, but it might also be a graph, a symbolic sequence or any other modality that allows the computation of distances between the represented objects. The quality of a representation is determined by the performance of the recognition schemes that can be applied to it.

Often, a second level of representation is considered to improve the original one. Subspaces of a vector space are used to obtain a lower dimensionality, e.g. by a principal component analysis (PCA), a Kohonen map or an auto-encoder neural network. If kernels or dissimilarities are used, the dimensionality of the representation may also grow significantly; in this case the purpose is to introduce some non-linearity.
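
As a small illustration of both directions, the sketch below reduces the dimensionality by PCA and enlarges it by a kernel, i.e. by non-linear similarities to all training objects. It assumes NumPy and scikit-learn; the data sizes and the kernel width are illustrative choices only.

```python
# A minimal sketch of second-level representations, assuming NumPy and
# scikit-learn; data sizes and the kernel width are illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                # 100 objects, 50 original features

X_low = PCA(n_components=5).fit_transform(X)  # lower dimensionality: 100 x 5
X_high = rbf_kernel(X, X, gamma=0.01)         # non-linear, higher dimensionality: 100 x 100

print(X_low.shape, X_high.shape)
```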

The number of possibilities to construct representations in this way is virtually unlimited, and optimization procedures may thereby become infeasible. To overcome this, proposals have been made to introduce some randomness in making choices or in setting parameters. When and why will this be good?

Coverage versus dimensionality

There are two phenomena to be aware of in creating a random representation: the coverage of the input and the dimensionality of the resulting output representation. In taking a random representation there is the danger that not all information is preserved: some aspects, groups of objects or subspaces might not be taken into account. The probability that this happens may be reduced by generating more random functions. This will, however, result in a large output dimensionality. Some classifiers may suffer from that, requiring a dimension reduction on top of the representation. A good alternative is to use a sparse classifier that automatically selects the appropriate random functions. In both cases the danger of overtraining has to be faced.
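
The interplay between coverage, dimensionality and sparseness can be made concrete with a small sketch. It assumes NumPy and scikit-learn; the toy data, the number of random functions and the regularization strength are illustrative choices, not prescriptions.

```python
# A minimal sketch, assuming NumPy and scikit-learn; the toy data, the
# number of random functions (n_rand) and the L1 strength C are
# illustrative choices only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Many random non-linear functions of the input improve the coverage,
# but enlarge the output dimensionality.
n_rand = 300
rng = np.random.default_rng(0)
W = rng.normal(size=(X.shape[1], n_rand))
Z_tr, Z_te = np.tanh(X_tr @ W), np.tanh(X_te @ W)

# A sparse (L1-regularized) classifier selects the useful random
# functions and silences the rest, limiting the overtraining risk.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(Z_tr, y_tr)
print("test accuracy:", clf.score(Z_te, y_te))
print("selected random functions:", int(np.sum(clf.coef_ != 0)))
```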

Some examples

  • Random neural networks. Classifiers constructed by neural networks may contain many weights to be optimized. The advantage is that they may result in non-linear functions that really follow the data. The problem, however, is that optimizing them may be very time consuming and, if not designed carefully, the result may be overtrained. A fast, reasonably well performing solution [1] is to make a random choice for the weights in the input layer and to optimize the output layer only. This can be understood as the construction of a random representation on top of the input space (see the sketch after this list). This solution may also be used as a smart initialization for a full optimization.
  • Extreme neural networks. The above idea has been exploited for the construction of so-called extreme learning machines [2], which attract some attention at the moment.
  • Random subspaces. Applications with many features may suffer from long computing times and overtraining. Both problems may be attacked by splitting the problem into a set of sub-problems defined by a set of subspaces. A random split of the features may be used to find these subspaces, and the resulting classifiers should be combined. The original idea is by Tin Kam Ho [3]. Skurichina developed it further for linear classifiers [4]. Van Breukelen used it for initializing neural networks [5].
  • Bagging and boosting. Instead of defining a representation by generating random feature subsets, a random representation may also be constructed by training classifiers on randomly selected subsets of the training set [4]. This is not fully random anymore, as the training step adapts the classifiers to the data.
  • Dissimilarity space. The dissimilarity space is defined by vectors of distances to a representation set of objects (prototypes). There are various ways to optimize this set. A random selection often works remarkably well, because an arbitrary subset of objects may already be a good representation of the entire set [6] (also illustrated in the sketch after this list). This is in contrast to features: a randomly selected set of features may miss the few significant ones.
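
The sketch below illustrates two of these examples: a random-weights network in the spirit of [1], [2], where only the output layer is trained, and a dissimilarity space built on a randomly chosen representation set as in [6]. It assumes NumPy and scikit-learn; the toy data and all parameter values (hidden-layer size, number of prototypes, regularization) are illustrative, and the trained output layer is a ridge classifier rather than any particular fit used in the references.

```python
# A minimal sketch of two of the examples above, assuming NumPy and
# scikit-learn; the toy data and all parameter values are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rng = np.random.default_rng(1)

# 1) Random neural network in the spirit of [1], [2]: the hidden layer
#    is a fixed random representation, only the output layer is trained
#    (here by a linear ridge classifier).
W = rng.normal(size=(20, 200))       # random input-layer weights, never trained
b = rng.normal(size=200)
H_tr, H_te = np.tanh(X_tr @ W + b), np.tanh(X_te @ W + b)
net = RidgeClassifier(alpha=1.0).fit(H_tr, y_tr)
print("random-weights net:", net.score(H_te, y_te))

# 2) Dissimilarity space with a random representation set [6]: every
#    object is represented by its distances to randomly chosen prototypes.
protos = X_tr[rng.choice(len(X_tr), size=30, replace=False)]
D_tr, D_te = euclidean_distances(X_tr, protos), euclidean_distances(X_te, protos)
dis = RidgeClassifier(alpha=1.0).fit(D_tr, y_tr)
print("dissimilarity space:", dis.score(D_te, y_te))
```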

Why and when does it work?

Random choices in building a representation may perform well if the original information is sampled well. They may thereby reduce the dimensionality as well as enlarge it, if (non-)linear functions of the inputs are generated. In the generalization step that follows the representation, the possibly redundant or insignificant elements of the representation can be removed, provided the size of the training set is sufficiently large. This gives a key to the choice of the random representation: it should generate a space with a dimensionality that is lower (e.g. by a factor 10) than the size of the training set.
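
A minimal illustration of this rule of thumb, with an assumed training set size:

```python
# Illustrative only: choose the output dimensionality of the random
# representation well below the training set size (here a factor 10).
n_train = 500                        # assumed number of training objects
n_random_functions = n_train // 10   # -> 50 random functions
```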

A great advantage of a random representation is that it is fast: no optimization is needed. If sufficient time is available, optimization may be better, but this is not guaranteed, as it also includes adaptation to the data and will thereby overtrain more easily. Optimized representations should therefore result in lower-dimensional spaces to avoid this.

References

  1. W.F. Schmidt, M.A. Kraaijveld, and R.P.W. Duin, Feed forward neural networks with random weights, Proc. 11th IAPR Int. Conf. on Pattern Recognition, vol. II, 1992, 1-4.
  2. G.-B. Huang and D. Wang, Advances in Extreme Learning Machines (ELM2011), Neurocomputing, vol. 102, 2013, 1-2.
  3. T.K. Ho, The Random Subspace Method for Constructing Decision Forests, IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, 1998, 832-844.
  4. M. Skurichina and R.P.W. Duin, Bagging, Boosting and the Random Subspace Method for Linear Classifiers, Pattern Analysis and Applications, vol. 5, no. 2, 2002, 121-135.
  5. M. van Breukelen and R.P.W. Duin, Neural Network Initialization by Combined Classifiers, Proc. 14th Int. Conf. on Pattern Recognition (ICPR'98), 1998, 215-218.
  6. E. Pekalska, R.P.W. Duin, and P. Paclik, Prototype selection for dissimilarity-based classification, Pattern Recognition, vol. 39, no. 2, 2006, 189-208.