Quintessa mathematicians and scientists enjoy analysing data. Recently, Simon Rookyard presented predictions of the results in the knock-out stages of the UEFA EURO 2020 Football Championship, based on an algorithm developed within Quintessa. Now that the competition is complete, we can appraise the algorithm’s performance in these matches.

We recently used our “N-Estimates” algorithm, created to rate sports teams, to predict the results of the 15 matches in the knock-out rounds of UEFA EURO 2020. Each of these predictions was accompanied by a graph that displayed the most likely match result in green and match results that would reasonably be expected given the algorithm (those falling within a one standard deviation confidence interval, 1σ) in amber. Figure 1 presents these again with the observed results highlighted.

When assessing the performance of the algorithm during the group stages, we considered four metrics of algorithm performance:

- the percentage of matches in which the correct outcome (correct winner or draw) was predicted;
- the percentage of predictions with the correct goal difference;
- the percentage of predictions with the correct exact scoreline; and
- the percentage of matches inside the approximate 1σ region.
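
As a minimal sketch of how these four metrics could be computed, the Python below scores a list of predicted scorelines against observed ones. The function names, the representation of the 1σ region as a set of scorelines per match, and the four example matches are all assumptions made for illustration; they are not the N-Estimates implementation, and the example values are not EURO 2020 results.

```python
from typing import Dict, List, Set, Tuple

Score = Tuple[int, int]  # (goals for team A, goals for team B) at the end of normal time


def outcome(score: Score) -> int:
    """Return 1 if team A wins, 0 for a draw, -1 if team B wins."""
    a, b = score
    return (a > b) - (a < b)


def evaluate(predicted: List[Score],
             observed: List[Score],
             one_sigma_regions: List[Set[Score]]) -> Dict[str, float]:
    """Return the four metrics as percentages of the matches played."""
    n = len(observed)
    pairs = list(zip(predicted, observed))
    hits = {
        # Correct winner, or a correctly predicted draw.
        "outcome": sum(outcome(p) == outcome(o) for p, o in pairs),
        # Correct signed goal difference.
        "goal_difference": sum(p[0] - p[1] == o[0] - o[1] for p, o in pairs),
        # Correct exact scoreline.
        "scoreline": sum(p == o for p, o in pairs),
        # Observed result lies inside the (hypothetical) 1-sigma set of scorelines.
        "within_1_sigma": sum(o in region
                              for o, region in zip(observed, one_sigma_regions)),
    }
    return {name: 100.0 * count / n for name, count in hits.items()}


# Made-up example data, for illustration only.
predicted = [(1, 0), (1, 1), (2, 1), (0, 0)]
observed = [(2, 0), (1, 1), (0, 1), (3, 3)]
regions = [
    {(1, 0), (2, 0), (1, 1)},
    {(1, 1), (0, 0)},
    {(2, 1), (1, 1)},
    {(0, 0), (1, 1)},
]

metrics = evaluate(predicted, observed, regions)
for name, value in metrics.items():
    print(f"{name}: {value:.0f}%")
```

Note that a correct scoreline implies a correct goal difference, which in turn implies a correct outcome, so the first three percentages can only decrease in that order.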

Here, we consider the same metrics (which we evaluate at the end of normal time, to be consistent with the basis for the predictions). Figure 2 compares the algorithm’s performance across the knock-out stages in each of these metrics against a benchmark value (the expectation if winners/goal differences/scorelines were selected randomly, or the 68% of the 15 matches which would be expected to fall within a 1σ confidence interval).

We can see that the algorithm performed well in terms of predicting the correct outcome. The correct outcome (at the end of normal time) was predicted in 8 of the 15 matches (53%), around 50% more often than we would expect by chance. The algorithm also predicted the goal difference and the exact scoreline correctly more often than expected by chance, although only by one match in each case, so it would be unsafe to conclude that the algorithm genuinely outperformed chance on these two metrics. As in the group stages, the algorithm underpredicted the variability of the results: only 8 (53%) of the knock-out matches fell within the predicted 1σ confidence interval, compared with the expected 10 (corresponding to 68.2%).

A closer investigation (Figure 3) reveals that the predictions were poorer in the Last 16 round than in the later knock-out matches: five of the eight correctly predicted outcomes came in the quarter-finals or later, even though those rounds accounted for only seven of the fifteen matches. All three matches in the semi-final and final rounds were correctly predicted to be draws after normal time. Furthermore, the algorithm-calculated ratings of England and Italy were very close before the final, with a predicted 51% chance of an England victory and a 49% chance of an Italy victory. This similarity between the teams was reflected in the result, with the scores still level after extra time.
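
To put the 1σ coverage figure noted above in context, the sketch below asks how likely it would be to see 8 or fewer of 15 matches inside the 1σ region if each match independently had a 68.2% chance of falling inside it. The independence assumption and the use of a simple binomial model are ours, for illustration; they are not part of the published analysis.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k: int, n: int, p: float) -> float:
    """Probability of k or fewer successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


n_matches = 15        # knock-out matches
coverage = 0.682      # nominal 1-sigma coverage
observed_inside = 8   # matches observed inside the 1-sigma region

# Expected number of matches inside the region (about 10 of 15).
expected_inside = n_matches * coverage

# Chance of a result at least as low as the one observed,
# assuming (hypothetically) that matches are independent.
p_low = prob_at_most(observed_inside, n_matches, coverage)

print(f"Expected inside 1-sigma: {expected_inside:.1f} of {n_matches}")
print(f"P(at most {observed_inside} inside) = {p_low:.3f}")
```

With only 15 matches, a tail probability of this kind is a rough guide at best; it is the repetition of the same shortfall in the group stages that makes the underprediction of variability more convincing.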

As in the summary of the algorithm’s performance in the group stages, we have compared the performance with the expert judgement of the winner of the BBC’s competition for pundits’ predictions. (Conveniently, the winning pundit is the same person who was considered after the group stages.) It can be seen in Figure 2 that the pundit showed a similar level of performance to the algorithm (as was the case during the group stages), although in this case the algorithm performed marginally better.

Over the course of the EURO 2020 competition, the N-Estimates algorithm has been shown to perform well, predicting the correct winner (or correctly predicting a draw) in around 55% of matches, the correct goal difference in around 27% of matches and the correct scoreline in around 14% of matches (compared with 34%, 18% and 8% expected respectively for random predictions).

However, there is scope for further improvement, as there appears to be a source of variability in the results that has not been captured by the algorithm. As the algorithm was tuned predominantly to qualifying matches for the European Championships and World Cups, this implies a source of variability that is significant only at a major competition such as the European Championship. An obvious possible cause is the high frequency of matches at EURO 2020: it is known that teams can become fatigued after several matches in a short period, and differences in the magnitude of the fatigue between teams could be an additional source of variability in match results. Excessive travelling between matches could further contribute to fatigue levels (and to differences in fatigue), causing further variability. This may have been particularly relevant at EURO 2020, which was played at stadia across the continent, with some teams spending the group stages in one place and others covering thousands of kilometres. Considering such sources of variability could help the algorithm to better capture the likelihoods of different scorelines in a given match.
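
For reference, the tournament-wide hit rates quoted above can be expressed as ratios over the random benchmark. The short sketch below does only that arithmetic; the input percentages are the rounded figures from the text, so the ratios are approximate, and the function and variable names are ours.

```python
# Tournament-wide hit rates quoted in the text (percent of matches, rounded).
algorithm = {"outcome": 55, "goal_difference": 27, "scoreline": 14}
random_benchmark = {"outcome": 34, "goal_difference": 18, "scoreline": 8}


def lift_over_chance(hit_rate: float, benchmark: float) -> float:
    """Ratio of the algorithm's hit rate to the random-prediction benchmark."""
    return hit_rate / benchmark


lifts = {
    metric: lift_over_chance(algorithm[metric], random_benchmark[metric])
    for metric in algorithm
}

for metric, lift in lifts.items():
    print(f"{metric}: {lift:.2f} times the random benchmark")
```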

*Quintessa is not affiliated in any way with UEFA or the BBC. Its application of the N-Estimates algorithm to the UEFA EURO 2020 competition is an independent and non-commercial endeavour. The UEFA EURO 2020 logo is copyright of UEFA.*