Personal and historical notes
The previous post briefly explained the arguments for the steps we took between 1995 and 2005. From today’s perspective, it has become much clearer what we did in those years. A few historical remarks follow below, as they sketch how research may proceed.
It all started when I spent a sabbatical leave with Josef Kittler in Guildford, UK. I had finished my research on the miracles and pitfalls of neural classifiers and I finally understood when and why they would work. Neural networks offer a very general framework for building classifiers. Almost any classifier can be imitated by them, given the right architecture, transition functions and training parameter values. Consequently, the results depend heavily on user-supplied settings and, thereby, on the skills of the analyst. The contribution of the neural architecture itself to generalization is minor, as other classifiers not based on this architecture may yield similar or better results.
I wanted to present these observations during a workshop in Edinburgh, but I entirely failed to make my points clear against giants such as Brian Ripley and Chris Bishop. Luckily, Leo Breiman also attended the workshop and supported me with his statement: “Just a short look at the architecture of a neural network is sufficient to understand that the thing doesn’t even have the moral right to show any reasonable performance”. As somebody with the stature of Breiman shared my observation, it was sufficient for me to leave this research domain. There was no reason to proceed with repeating the same arguments over and over if this was done more convincingly by others.
Josef Kittler proposed to work on the newly arising domain of combining classifiers. I liked this idea because combined classifiers can be understood as more consciously designed and trained than neural networks. Like neural networks, they offer a flexible platform for designing non-linear classifiers. Overtraining, however, can be avoided much more effectively. Although this resulted in years of interesting research, and it became clear that classifier combining is the way to go for solving complicated problems, it did not shed more light on the fundamental question of pattern recognition: how does learning start?
During my stay in Guildford another visitor came along: Lev Goldfarb. I had read some of his papers in the eighties. They pointed to new directions for pattern recognition, but his reasoning and proposals were hard to grasp. His visit was an opportunity to catch up. I shared time with him during a one-day visit to Royal Holloway where he gave a speech on his ideas. Again, I didn’t understand him, but it was sufficiently clear to me that his present ideas were different from what I had read in the past.
On the way back I asked for clarification. According to him, the basis of patterns is constituted by the structure of objects. He argued that we should respect this by finding proper, generative, structural representations from the very beginning. There is no other way to respect objects than through their structural development. Any type of stochastic reasoning should be avoided.
At that time I could not understand how somebody with a physical background could fail to see that any sensor input is hampered by noise. Structures that are defined directly on the measured signals would include noise in the representation, making them almost useless. In the discussion I could not make my arguments clear to Lev, nor could he make his clear to me. We argued, quarreled and became angry as a result. There was an essential point about the primary representation that we failed to clarify but which I felt was significant. So, we ended up at opposite ends of the platform, waiting for the train to Guildford 😉
Nevertheless, I had great respect for Lev. I (re)collected all his papers and browsed them, but didn’t get his point. I put them in a binder for later study. At the same time I found Vapnik’s first paper on the Support Vector Machine (SVM). He had visited the Netherlands in the early eighties and I had studied his book “Estimation of Dependencies Based on Empirical Data” together with Martin Kraaijveld.
I found SVMs much better than neural networks because I sensed we were touching the foundation. The SVM arises as a consequence of the complexity discussion in that book. I am still not convinced that the discussion there is sound, as it mixes different types of probabilities. These are based on different (imaginary) stochastic events, and I am not sure that they can be combined. Anyway, this was the first time I saw a representation by objects instead of by features, the maximum margin strategy, the concept of support objects and the kernel trick. All are beautiful.
The criterion according to which the SVM is trained is a mixture of distances and densities. In my view, this criterion also mixes entities that should not be combined in this way. But the worst thing, the price we had to pay for all this beauty, is the set of Mercer conditions needed for a feasible optimization strategy. Here we are restricted in the way we can compare objects, and the restriction has nothing to do with the objects themselves or with the pattern recognition problem we want to solve; it comes just from the limitations of the mathematics that has to follow. The foundation is spoiled because we want to live in a palace. The consequence is that we have to stay in a walled garden and cannot go out anymore. We became prisoners tempted by the beauty.
I decided to follow the steps into the darkness that might take me outside the garden, into an open field. No Mercer conditions as the price to pay, no kernel trick and no support objects. As a result we arrive at a representation based on arbitrary (not restricted!) object comparisons. I used distances. They construct a dissimilarity space in which statistical recognition procedures can be used. Distances and densities are now clearly given their own domains: representation and generalization. When applied to structural object descriptions, this may build a bridge between structural and statistical pattern recognition, as Horst Bunke pointed out to me later. Moreover, if dissimilarities are suitably defined, representations with non-overlapping classes might be constructed!
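To make the idea of a dissimilarity space concrete, here is a minimal sketch in Python. It is an illustration under assumptions of my choosing, not the original 1996 implementation: the objects are short strings, the dissimilarity is the edit distance, the prototypes are two hand-picked training objects, and a nearest-mean rule stands in for any statistical classifier. All function names are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: an example of a dissimilarity on structural objects."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def to_dissimilarity_space(objects: Sequence[str],
                           prototypes: Sequence[str],
                           dissim: Callable[[str, str], float]) -> List[List[float]]:
    """Represent every object by its dissimilarities to the prototypes."""
    return [[float(dissim(x, p)) for p in prototypes] for x in objects]


def class_means(vectors: Sequence[Sequence[float]],
                labels: Sequence[Hashable]) -> Dict[Hashable, List[float]]:
    """Mean vector per class, computed in the dissimilarity space."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for v, y in zip(vectors, labels):
        acc = sums.setdefault(y, [0.0] * len(v))
        for k, value in enumerate(v):
            acc[k] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def nearest_mean_predict(vector: Sequence[float],
                         means: Dict[Hashable, List[float]]) -> Hashable:
    """Assign the class whose mean is closest (squared Euclidean distance)."""
    def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(means, key=lambda y: sq_dist(vector, means[y]))


# Toy data (assumed for illustration only).
train = ["aaaa", "aaab", "abaa", "bbbb", "bbba", "babb"]
labels = ["A", "A", "A", "B", "B", "B"]
prototypes = ["aaaa", "bbbb"]

D = to_dissimilarity_space(train, prototypes, edit_distance)
means = class_means(D, labels)
test_vec = to_dissimilarity_space(["aaba"], prototypes, edit_distance)[0]
prediction = nearest_mean_predict(test_vec, means)
```

Nothing in this sketch requires the dissimilarity to be a metric or to satisfy Mercer-type conditions: any function that compares two objects and returns a number could be passed in place of `edit_distance`. The dissimilarity handles the representation, and the classifier trained on the resulting vectors handles the generalization.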
But … back then, in 1996, my first dissimilarity-based and, thereby, featureless classifier worked. It was published in 1997 and I enthusiastically wrote a project proposal on dissimilarities.
It was approved!
This was in a period in which many of my proposals were approved, but I had never written one with so much personal commitment and scientific eagerness. Now the right PhD student had to come along. But he/she didn’t show up. There was nobody who shared my enthusiasm or to whom I could make my perspective clear. Especially when I described the possibility of an error-free classifier, the eyes of a prospective PhD student used to mirror a crazy scientist back to me.
At that time Ela Pekalska had already been working on another project with me for more than a year. I had to nominate somebody or I would lose the dissimilarity project. As she was in need of a next project and nothing else popped up for her, we agreed to continue with this one, both with the feeling that this might not be the best thing to do. But … there was no escape.
After about a year, in 2000, Ela gradually became enthusiastic. She fell in love with the concept of dissimilarities and took over. She renamed the approach from featureless pattern recognition to dissimilarity representation and asked me for my binder with the Goldfarb papers. There she found that Goldfarb had already been where we were, 15 years before us. He saw the limitations of feature-based vector spaces and proposed the use of dissimilarities and pseudo-Euclidean embedding (but not the dissimilarity space). Soon after, he became disappointed about the possibilities of the pseudo-Euclidean space for generalization. As a result, he turned to a fully structural approach, the evolving transformation system (ETS).
We both visited Lev Goldfarb in 2002. Now, we were friends. We were impressed by his insights and his determination to construct a fully structural representation as a basis for pattern recognition. We didn’t understand, however, how generalization from examples had to be defined, as there was no concept of any numerical distance measure. We decided to continue with the dissimilarity space.
In 2004 Elzbieta finished her thesis. It is based on the observation that dissimilarities are primal for human recognition. Similarities or sameness set up the background (for comparisons), but it is the differences that are meaningful and call for individuality and conscious perception. The quest for features comes after it has been concluded that some differences are significant. Some belong to each other’s neighborhoods (nearness) and others do not. This constructs a topology. Dissimilarity measures have to be postulated consistently with the nearness. The thesis continues with many examples based on pseudo-Euclidean embedding and dissimilarity spaces. In 2005 we rewrote the thesis into a book.
The dissimilarity approach has been applied in many studies on shape analysis, computer vision, medical imaging, digital pathology, seismics, remote sensing and chemometrics. It works!