The recent results in the round of 16 of the football world championship in Brazil showed a remarkable statistic. Each of the eight group winners had to play against the runner-up of another group, and all eight group winners won. Is that significant? Does it show that the result of a match is not random? Watching the matches sometimes strongly suggests that it is. Let us do some statistics.

The eight groups have been carefully constructed in such a way that supposedly strong countries are in different groups. Each group plays a half competition, in which the teams meet each other once. The group winners and the runners-up go to the round of 16. Here a knock-out system starts in which group winners meet the runners-up of other groups. Suppose that there is no significant difference in strength between winners and runners-up, or, even stronger, suppose that the result of a football match at this level is random anyway. Then for each match there is a probability of 0.5 that it is won by an original group winner (draws are no longer possible). The probability that this happens all eight times, assuming that match results are independent of each other, is 0.5^8 = 1/2^8 = 1/256 ≈ 0.0039.
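The arithmetic above can be checked with a few lines of Python. This is a minimal sketch under the same assumptions as the text (eight independent matches, each won by the group winner with probability 0.5); the function name `prob_at_least` is just an illustrative choice.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of at least k successes in n independent trials (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# All eight group winners win: the one-sided p-value under the "coin flip" null.
p_all_eight = prob_at_least(8, 8)
print(p_all_eight)  # 0.00390625, i.e. 1/256

# For comparison: at least seven of the eight group winners win.
print(prob_at_least(7, 8))
```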

Does this really prove something? The p-value is much smaller than 0.05, so it seems that, according to this experiment, the null hypothesis “football matches at this level end in a random result” has to be rejected with a high degree of significance. Is this all according to scientific standards?

No, certainly not. Taking the match winners of the present championship for our test is only one of the many experiments that could have been chosen. The results were in the news, and only after they were obtained did we compute the statistical significance. In a proper scientific experiment the hypothesis and the experiment should have been defined beforehand. The abuse of the p-value is a recurring concern in the scientific community; see, for example, a discussion in a recent issue of Nature.
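To get a feeling for why "one of the many experiments" matters, here is a rough sketch. It assumes, purely for illustration, that there are `n_patterns` different striking patterns we might have noticed in the news, that they are independent, and that each has the same probability 1/256 under the null hypothesis. These numbers are hypothetical and not derived from real tournament data.

```python
def prob_any_striking(p_single, n_patterns):
    """Chance that at least one of n independent patterns occurs under the null."""
    return 1.0 - (1.0 - p_single) ** n_patterns

p_single = 0.5 ** 8  # probability of one specific pattern, as computed in the text

# The more patterns we are prepared to be surprised by afterwards,
# the less surprising it is that one of them shows up.
for m in (1, 10, 50, 200):
    print(m, round(prob_any_striking(p_single, m), 3))
```

With a single pattern fixed in advance the chance equals the original p-value. If the pattern is picked after looking at the results, the relevant probability grows with the number of patterns that could have been picked.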

It is not strictly necessary that an experiment is fully defined before it is performed. It is possible to use existing data. The problem lies in formulating the hypothesis. In our example it was suggested by the data. Computing the statistical significance assumes that the hypothesis and the data are independent. Here are some examples to illustrate where it goes wrong.

#### Selection is training

You have to analyze a dataset with about 100 features. Following the suggestion of your supervisor you inspect the scatter plots of some feature pairs. There is a feature that looks promising. You fix it and try to find a second feature to construct an interesting pair. Great, there is one that really complements the first. You select these two features and, as a check, you run a cross-validation with your favorite classifier. That looks good!

Is it? Well, it depends. If you used the entire dataset for the selection, then the data needed for evaluation was also used in the selection. As a result, the outcome is positively biased. Selection is training! For a correct procedure, a test set should be excluded from the selection process. So don’t even look at the scatter plots of the test data!
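The bias can be made visible with a small simulation. The sketch below is an illustration under assumptions of my own choosing, not the procedure of any particular toolbox: the data are pure noise (so the true accuracy is 0.5), a single feature is selected by the largest gap between class means, and the classifier is a nearest-mean rule on that feature. All function names are hypothetical.

```python
import random
from statistics import mean

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise data: features are Gaussian and unrelated to the labels."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced, alternating labels
    return X, y

def class_means(X, y, idx, f):
    """Means of feature f for class 0 and class 1, using only samples in idx."""
    m0 = mean(X[i][f] for i in idx if y[i] == 0)
    m1 = mean(X[i][f] for i in idx if y[i] == 1)
    return m0, m1

def select_feature(X, y, idx):
    """Pick the feature whose class means differ most on the samples in idx."""
    def gap(f):
        m0, m1 = class_means(X, y, idx, f)
        return abs(m0 - m1)
    return max(range(len(X[0])), key=gap)

def cv_accuracy(X, y, n_folds, select_inside):
    """k-fold cross-validation of a nearest-mean classifier on one selected feature."""
    all_idx = list(range(len(X)))
    # Wrong way: select once, using all data (including future test samples).
    fixed = None if select_inside else select_feature(X, y, all_idx)
    correct = 0
    for k in range(n_folds):
        test = all_idx[k::n_folds]
        train = [i for i in all_idx if i % n_folds != k]
        # Right way: repeat the selection inside each fold, on training data only.
        f = select_feature(X, y, train) if select_inside else fixed
        m0, m1 = class_means(X, y, train, f)
        for i in test:
            pred = 0 if abs(X[i][f] - m0) <= abs(X[i][f] - m1) else 1
            correct += pred == y[i]
    return correct / len(X)

def compare(n_repeats=20, n_samples=40, n_features=100, n_folds=5, seed=0):
    """Average both estimates over several independent noise datasets."""
    rng = random.Random(seed)
    biased, honest = [], []
    for _ in range(n_repeats):
        X, y = make_noise_data(n_samples, n_features, rng)
        biased.append(cv_accuracy(X, y, n_folds, select_inside=False))
        honest.append(cv_accuracy(X, y, n_folds, select_inside=True))
    return mean(biased), mean(honest)

biased_acc, honest_acc = compare()
print(f"selection on all data:  {biased_acc:.2f}")
print(f"selection inside folds: {honest_acc:.2f}")
```

Since the features carry no information at all, any estimate clearly above 0.5 is an artifact. The first number should typically come out too optimistic, while the second should stay close to chance level.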

#### Forensics

Forensics is full of doubts, and doubts may be transformed into statistics. Is this really the fingerprint of the suspect? Is his alibi sufficiently strong? A witness picked him out of a line-up of six; is that significant? Is it possible to combine these uncertainties? Here the problem of selecting data before the hypothesis is formulated is very evident, as the case of Lucia de Berk shows.

Lucia de Berk was a nurse on duty in a hospital when 6 patients died over a period of 3 years. None of these 6 deaths was suspicious. In the 3 years before, 7 patients died in that hospital. So everything looks normal. The only thing that seemed wrong was that she was on duty during all 6 deaths. Somebody computed that this was very improbable. So she was convicted and, although innocent, spent 6 years in prison. Thanks to the tireless efforts of the statistician Richard Gill, among others, she was finally released.

#### The improbable happens all the time

Almost any event, if sufficiently complicated and isolated in the right way, is improbable in retrospect. The probability is very low that you first buy a loaf of bread, five minutes later stand for three minutes in front of the window of the bookshop, arrive within two minutes at the take-away, meet a friend, walk with her to the park and, after 10 minutes, throw your cup into the first dustbin next to the gate. In fact, it will not happen, unless it happened.
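The same point can be made with a deck of cards, an illustration that is not part of the story above: every shuffle produces an order that was extremely improbable beforehand, yet some order always comes out. A minimal sketch, assuming a perfectly uniform shuffle:

```python
import random
from math import factorial

# Shuffle a 52-card deck; the cards are simply numbered 0..51.
deck = list(range(52))
random.shuffle(deck)

# Under a uniform shuffle, every specific order has probability 1/52!.
p_this_order = 1 / factorial(52)

print(deck[:5], "...")
print(f"probability of exactly this order: {p_this_order:.2e}")
```

Computing such a probability after the fact says nothing about whether the shuffle was fair. It only becomes evidence if the specific order was predicted before the cards were shuffled.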

Computing statistics is in itself not simple; drawing conclusions for science and real life is very difficult. Going from correlation to causality is a famous problem. The formulation of a hypothesis is another one. The essential rule is that a hypothesis should not be formulated from the data that will be used for testing it. See also “The cult of statistical significance” for discussions on the abuse of hypothesis testing in science.
