Tuesday, 21 June 2011

Statistical inference and mathematical modelling in sport


Since the origin of organised sport, players have always searched for the secret of success. In individual sports like athletics, for instance, the route to improvement is more apparent than in more complex team activities like football. But, as time has gone on, the increased sophistication which we bring to life in general has revealed hidden truths in them all.

At first, players relied on developing an instinctive approach by trying different methods and experiencing different outcomes. This was enough to develop a 'feel' for the activity, which was then passed down the generations as received wisdom.

Some of these instinctive beliefs duly persisted in individual sports, while others were shown to be false. Eventually, players and coaches in many sports became so incentivised by the financial rewards of success that they started to employ quantitative methods - and so began the march of statistics.

Now, those of an academic bent within each sport - and eventually academics themselves - began to study how success was achieved by scientific method. They isolated variables by which performance in each sport could be measured, then developed quantitative relationships between these markers and victory.

Some of these findings immediately threw doubt on old approaches; some merely confirmed long-held beliefs and turned them into convictions. The academics then dug deeper and realised that the existing variables themselves were now insufficient to capture elusive truths.

In this series, I will start from the first principles of quantitative analysis and slowly proceed towards highly complex mathematical modelling of those sports for which sufficient data exists to form predictive tools. By that stage, you will have gained the advanced knowledge to reach a completely new understanding of whatever your favourite sport is. I guarantee you will never think about sport the same way again.

I will then look at the limitations of these models, but also show how we can account for these using real-time examples which will throw up projections sometimes greatly at variance with those of the bookmakers.

Anyone can follow the entire series, which will end in early September, provided they have either the motivation to learn or the good nature to improve my own understanding.

As ever, I will answer any questions, or receive your views, directly through Twitter, a medium which facilitates constructive and respectful debate between parties, for all that it also limits it to 140 characters per offering. Hit me up on @Prof_Hindsight.

1.0 Correlation

Consider the following end of season standings for the English Premier League season just completed:

Team         GF  GA   GD  Pts
Man Utd      78  37  +41   80
Chelsea      69  33  +36   71
Man City     60  33  +27   71
Arsenal      72  43  +29   68
Spurs        55  46   +9   62
Liverpool    59  44  +15   58
Everton      51  45   +6   54
Fulham       49  43   +6   49
Aston Villa  48  59  -11   48
Sunderland   45  56  -11   47
West Brom    56  71  -15   47
Newcastle    56  57   -1   46
Stoke        46  48   -2   46
Bolton       52  56   -4   46
Blackburn    46  59  -13   43
Wigan        40  61  -21   42
Wolves       46  66  -20   40
Birmingham   37  58  -21   39
Blackpool    55  78  -23   39
West Ham     43  70  -27   33
(GF=goals scored; GA=goals conceded; GD=goal difference; Pts=points.)

It is obvious that the more goals a team scores and the fewer it concedes, the greater its points total at the end of the season. But is there a way of expressing that a relationship exists between these variables?

Yes. It is called correlation, and its mathematical measure is called a correlation coefficient.

When we are considering quantities such as how goals scored or conceded relate to points won, the coefficient is technically referred to as the Pearson product-moment correlation, or 'r'.

The strength of the correlation between two variables is expressed by a decimal whose magnitude lies between 0 and 1. The stronger the relationship, the larger that magnitude. I will get into the scaling of r in a moment, but note also that r can be either positive or negative, so its full range runs from -1 to +1.

A positive value of r refers to a relationship in which the greater the value of one variable, the greater the value of the other - a positive correlation, we say. The more goals a team scores during the season, the greater its total points, so the r value between points and goals scored is positive.

But when one variable tends to increase as the other decreases, we say there is a negative correlation between the two. Now a -ve sign is placed in front of the r value, while the strength of the relationship is still expressed by the magnitude of the decimal. Patently, goals conceded bears a negative correlation with total points earned in our example.

Calculating r is a laborious process but can be done quickly by means of a spreadsheet such as Microsoft Excel or the freeware OpenOffice.org. During this series, it is not necessary to do any of the calculations yourself, though hopefully you will have the urge to do so.

In this case, the r value or correlation between goals scored and points won is 0.867 - indicating a strong relationship. This is no surprise, but remember this example is very much entry level statistical inference. There are three months still to go!

The correlation between goals conceded and points won is -0.848. It is of a similar magnitude to that governing goals scored, but the negative sign indicates the opposite relationship.

1.1 The meaning of r-squared

As you become familiar with the correlation between two variables, you will find it easier to recognise the significance of its strength. But what does r actually express in a quantitative sense?

Technically speaking, r is the covariance of the two variables divided by the product of their standard deviations. Don't worry about this, though: think of the correlation of two variables as the degree to which the variance of one presages the variance of the other.

By squaring r (r times itself) we produce the "coefficient of determination" or 'r-squared'. This expresses what percentage of the variance of one variable is related to the variance of the other.  

The r-squared of goals scored and points earned is 0.751 - so 75.1% of the variance in a team's points total is accounted for by the variance in goals scored.

The r-squared of goals conceded and points earned is 0.719 - so 71.9% of the variance in a team's points total is accounted for by the variance in goals conceded. Notice how the -ve sign disappears with the r-squared value of negative correlations. Now it is the magnitude of the relationship only in which we are interested.
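The definition of r given earlier in this section, and the step from r to r-squared, can be followed by hand in a few lines of Python. This is only an illustrative sketch: the variable names are my own, and the data are the goals scored and points columns from the 2010-2011 table.

```python
from math import sqrt

# Goals scored and points for the 20 teams, in table order.
gf  = [78, 69, 60, 72, 55, 59, 51, 49, 48, 45, 56, 56, 46, 52, 46, 40, 46, 37, 55, 43]
pts = [80, 71, 71, 68, 62, 58, 54, 49, 48, 47, 47, 46, 46, 46, 43, 42, 40, 39, 39, 33]

n = len(gf)
mean_gf = sum(gf) / n
mean_pts = sum(pts) / n

# Covariance: the average product of each team's deviations from the two means.
cov = sum((x - mean_gf) * (y - mean_pts) for x, y in zip(gf, pts)) / n

# Standard deviations: the root of the average squared deviation.
# (Dividing by n or by n - 1 makes no difference to r, as the factor cancels.)
sd_gf = sqrt(sum((x - mean_gf) ** 2 for x in gf) / n)
sd_pts = sqrt(sum((y - mean_pts) ** 2 for y in pts) / n)

r = cov / (sd_gf * sd_pts)  # covariance divided by the product of the deviations
r_squared = r * r           # coefficient of determination

print(f"r = {r:.3f}, r-squared = {r_squared:.3f} ({r_squared:.1%} of the variance)")
```

Swapping the goals conceded column in for gf gives a negative r, but squaring removes the sign, as described above.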

1.2 Correlation does not imply causation

According to the limited sample provided by the 2010-2011 Premier League season - we will examine the meaning of sample-size in a later episode - the r-squared values suggest there is a stronger relationship between goals scored and points earned than between goals conceded and points earned.

This is just the first of many interesting results that an example as simple as the one above can provide. Does it therefore mean that goals scored are more important than goals conceded - that attack is more important than defence?

Not necessarily. When we infer a correlation between two variables, it does not mean that a change in one is the cause of a change in the other. Instead, it implies only that - within the limitations of the sample-size - a connection between the two has existed in the past. If the sample-size is small, this connection may even be just a coincidence.

We might find, for instance, that the number of points earned by Liverpool is actually correlated with the production of wine in Portugal between the years 1979 and 1985. But this does not imply a causal link between the two: if Portugal were to increase its wine production in the future, Liverpool would not earn more points as a result.

In our example, it is however reasonable to infer that the stronger correlation between goals scored and points earned is a significant result. The three-points-for-a-win system means that a team that scores and concedes more goals but has the same goal difference as a rival is likely to earn more points.
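A toy illustration of the three-points-for-a-win argument, using two invented teams and made-up scorelines (nothing here comes from the real table):

```python
def season_summary(results):
    """results: list of (goals_for, goals_against) tuples, one per match.
    Returns (goal_difference, points) with 3 for a win and 1 for a draw."""
    goal_difference = sum(gf - ga for gf, ga in results)
    points = sum(3 if gf > ga else 1 if gf == ga else 0 for gf, ga in results)
    return goal_difference, points

# Hypothetical scorelines over two matches each.
open_team = [(2, 0), (0, 2)]      # one win and one defeat
cautious_team = [(0, 0), (0, 0)]  # two goalless draws

print(season_summary(open_team))      # (0, 3)
print(season_summary(cautious_team))  # (0, 2)
```

Both teams finish with a goal difference of zero, but the one involved in more goals takes three points to its rival's two, because a win plus a defeat is worth more than two draws.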

"Correlation does not imply causation" is an important maxim of statistical inference and should always be remembered.

1.3 Co-dependency

The value of r-squared which we calculated for goals scored and points earned suggested that 75.1% of the variance in a team's points total is a product of the variance in goals scored. That only leaves 24.9% for other factors, yet we found that the r-squared value for goals conceded and points earned implied that 71.9% of the variance of one was due to the other.

75.1% and 71.9% add up to a lot more than 100%, so how can this be? The answer is that goals scored and goals conceded are not independent variables themselves.

In fact, the correlation between goals scored and conceded is -0.570 - a weaker correlation but one which still suggests that the more goals a team scores, the fewer it tends to concede.

This so-called co-dependency between goals scored and goals conceded is another example of the maxim we have just established, that "correlation does not imply causation".

Scoring goals does not directly cause a team to concede fewer. (Incidentally, there may be a positive game-theoretical implication of a strategy biased towards attacking - "attack is the best form of defence" - but that is a different consideration.)

Instead, these co-dependent variables are both some function of a football team's merit or efficiency. In other words, good teams tend to score lots of goals and concede few, while bad teams do the opposite.

It follows from this that goal difference (goals scored minus goals conceded) is itself strongly correlated with this team efficiency, and therefore with points earned.

Indeed, goal difference is a very powerful measure of efficiency, more so even than points earned, as I will prove in a future episode.

For now, just consider that the correlation between goal difference and points earned is a very strong 0.965. And the r-squared of 0.931 suggests that 93.1% of the variance in points earned is a result of the variance in goal difference. And, yes, it is safe to say that the first causes the second.

1.4 Sample-size and statistical significance

It is very important to be aware that the values of 'r' and 'r-squared' cited in this example are only a function of a limited sample of data. Forging universal relationships between goals scored and conceded and points earned is not possible from just the latest Premiership season.

We must always remember to state the limitations of data when expressing statistical relationships formally. And there are various measures of so-called statistical significance we can use to imply the reliability of the data in a quantitative way. Those must wait while we uncover some more interesting relationships from Premiership history next week.

1.5 Coming up

Next time, I will examine how well goals scored and conceded in one season predict performance in the next. In other words, how identifiable and persistent is a team's ability to score goals and to prevent them being scored?

And to what degree can it survive the inevitable changes in personnel from one season to the next? I will be considering data only from the Premiership itself, so teams who were relegated in an adjacent season will be excluded from the sample.

As a matter of interest, what do you think will be the result? Will there be no difference in the predictive power of goals scored and conceded from one season to the next? Or will one persist more than the other, perhaps suggesting something about the underlying skills associated with attack and defence and their importance?

I will also introduce the topic of linear regression, deriving an equation which links goals scored and conceded to points earned in a 38-game Premiership season - assuming the relationship is linear and not exponential for now.