Sunday, 26 June 2011

Statistical inference and mathematical modelling in sport (part 2)

Summary of part 1

In the first part, we established the first major weapon of statistical inference - correlation. 

We saw that the correlation between goals scored and points earned for teams during the 2010-11 Premiership season was 0.867, while the correlation between goals conceded and points was -0.848.

Just in case you aren't with me, part 1 of the series has a full explanation of those terms and is here.  

2.0 Auto-correlation

As well as observing correlations between two distinct variables - such as goals and points - it is also extremely useful when modelling sports to determine how a single variable persists or changes over time.

How a variable correlates with itself across different periods of time is called auto-correlation. I often refer to it by the more descriptive term of "forward correlation".

To see what forward correlation can teach us about the way a sport works, let's now expand our data sample from a single season of the Premiership - from which any inference is not statistically significant - to the entire history of the competition back to 1992/3 - from which it is.

(For more on statistical significance, please wait for an upcoming episode in which I will highlight a shocking example for 10-year trends fans.)

The forward correlation we are concerned with here is how well the basic statistics of a team's performance in one season persist into - or predict - the same statistics in a year's time.

In Premiership history, there are 311 data-points consisting of teams who have played consecutive seasons in the Premiership. Consider their r values in various categories:

wins: 0.739
draws: 0.109
losses: 0.632
goals scored: 0.666
goals conceded: 0.599
goal difference: 0.785
total goals per game: 0.185
points: 0.679
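The pairing behind figures like these can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the team names and goal differences below are invented, and the helper names `forward_pairs` and `pearson_r` are my own labels, not part of any library. The sketch simply pairs each team's value in one season with its value in the next, skips teams who were absent from either season, and correlates the two columns.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def forward_pairs(values):
    """Pair each team's value in season s with its value in season s + 1.

    `values` maps (team, season_start_year) -> statistic. A team missing
    from either season (promoted or relegated) contributes no pair.
    """
    pairs = []
    for (team, season), value in values.items():
        following = values.get((team, season + 1))
        if following is not None:
            pairs.append((value, following))
    return pairs


# Hypothetical goal differences, invented purely for illustration.
goal_difference = {
    ("A", 2007): 30, ("A", 2008): 25, ("A", 2009): 35,
    ("B", 2007): -5, ("B", 2008): 2, ("B", 2009): -8,
    ("C", 2007): -20, ("C", 2009): -15,  # absent in 2008, so no pairs
    ("D", 2008): 10, ("D", 2009): 4,
}

pairs = forward_pairs(goal_difference)
this_season = [a for a, _ in pairs]
next_season = [b for _, b in pairs]
forward_r = pearson_r(this_season, next_season)

print(len(pairs), "consecutive-season pairs")
print(f"forward correlation r = {forward_r:.3f}")
```

With real data, the dictionary would hold one entry per team per Premiership season, and the same function could be run once for each category in the list above.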

This is statistical inference at a baby level, but already you have enough information to be ahead of 99% of all football fans, punters and journalists:

The number of drawn games in which a Premiership team is involved shows little correlation from one season to the next

The r-squared value of the forward correlation of draws is just 0.012, so only around 1% of the variance in draws persists from season to season. We say that Premiership teams in general do not control the rate at which they draw games in future seasons.

It is important to understand what this does and does not imply.

 - It DOES NOT imply that draws are random, nor that the percentage chance of a draw in a Premiership game should always be the same.

In fact, as might be expected, teams who are historically around the middle third of the Premiership draw games at a slightly higher rate than those at the extremes.

 - It DOES imply that the facets of a team which are connected to its chance of drawing are not strong enough to survive the vagaries of time. Many things can change about a Premiership team from season to season, and they clearly have a radical effect on its tendency to draw games.

Put another way, drawing games is not a skill which persists for the average Premiership team.

This is a simple but powerful result. Because three points are awarded to a team for a win and only one for a draw, a team which plays at exactly the same level of efficiency from one season to the next is likely to show a variance in points gained caused by the inability to control the rate at which it draws.

Taken to the extreme, a team of league-average ability which won half its 38 Premiership games and lost the other half would earn 57 points, but a team of the same efficiency which happened to draw all its games would earn 19 fewer points.

But the following season, both these teams could be expected to earn a similar number of points, for the reason we have just established: the forward correlation of draws by season is only 0.109.
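The arithmetic of that extreme example is easy to check. The sketch below assumes only the standard three-points-for-a-win scoring described above; the function name `season_points` is a label of my own.

```python
WIN_POINTS = 3
DRAW_POINTS = 1
LOSS_POINTS = 0


def season_points(wins, draws, losses):
    """Total league points for a given record of wins, draws and losses."""
    return wins * WIN_POINTS + draws * DRAW_POINTS + losses * LOSS_POINTS


# A league-average team over 38 games: half won and half lost...
win_or_lose = season_points(wins=19, draws=0, losses=19)
# ...versus a team of the same efficiency which draws every game.
always_draw = season_points(wins=0, draws=38, losses=0)

print(win_or_lose, always_draw, win_or_lose - always_draw)
```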

2.1 Inferring dominant and recessive team-skills

The forward correlation of goals scored is 0.666 but the forward correlation of goals conceded is just 0.599. The difference between the r-squared values for these two relationships is a statistically significant 8.6 percentage points (0.444 - 0.358).

This is really important. It implies that scoring goals is a more influential skill than preventing them: that every goal scored in the Premiership has more to do with the efficiency of the attacking force than the defensive one.

This result is consistent for every sport that I have modelled.

For instance, the average yards gained by an NFL team on any play shows a much greater correlation with the offense's seasonal average than the defense's.

In NBA basketball, shooting percentage in any game is much more strongly correlated with the team taking the shot than the one defending against it.

And in baseball, the result of any single interaction between pitcher and hitter is more strongly correlated to the batter's statistics than the pitcher's.

The way I put this is that in football, scoring goals is a dominant skill while preventing them is a recessive one.

Again, it is important to outline what can and cannot be inferred from this.

 - It DOES NOT imply that attack is more important than defence. As we saw in the last part, the rate at which a team scores goals is itself negatively correlated with the rate at which it concedes them. (Teams who score a lot of goals tend to concede fewer, and vice versa.)

In other words, both scoring goals and defending them are just related facets of a more dominant skill - team efficiency. 

(The technical reasons for this are manifold. Most obviously, all players on a team are involved in attacking and defending to some extent.)

 - It DOES NOT imply that attacking football is a better idea than defensive football for all teams and in all situations. We will fully explore the fascinating subject of game theory in due course, but for now consider that the payoff associated with different strategies for a football team is measured most easily in expected goal difference which is affected equally and oppositely by goals scored and conceded.  

2.2 Metagames and underlying skills

Earning points is the most important consideration for a Premiership team. But we can think of the units of points as wins, draws and losses, and the units of wins, draws and losses as goals scored and conceded.

And so we can go to deeper and deeper levels of so-called metagames. One unit of goals scored, for instance, is shots on target %, while a unit of goals conceded (or prevented in this case) may be tackles won %.

Another way of putting this is that shooting and tackling are some of the underlying skills of team efficiency. 

Measuring and quantifying the impact of underlying skills is one of the major challenges of modelling sports because of sample-size considerations. Because goals scored and conceded are relatively rare events, they are subject to greater variance than other more frequent expressions of team efficiency, such as tackling, passing, shooting accurately or whatever.
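To give a feel for the scale of this effect, here is a rough sketch. It assumes, purely for illustration, that event counts behave like Poisson counts, in which case the standard deviation is the square root of the mean and the relative noise is one over that square root. The per-season counts are hypothetical round numbers, not measured values, and `relative_noise` is my own helper name.

```python
import math


def relative_noise(expected_count):
    """Standard deviation divided by mean for a Poisson-distributed count."""
    return 1 / math.sqrt(expected_count)


# Hypothetical per-season counts, chosen only to show the scale of the effect.
season_counts = [("goals", 50), ("shots", 500), ("passes", 15000)]

for label, count in season_counts:
    print(f"{label:>6}: about {relative_noise(count):.1%} noise from chance alone")
```

Under that assumption, the rarer the event, the larger the share of its season total that is down to chance rather than underlying efficiency.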

Of course, goals scored and conceded are much more important to the result than the many more frequent and less variable outcomes of metagames - or "games within a game" - between ball-carrier and tackler, shooter and saver.

This is a statistical paradox of mathematical modelling of sports which is going to be a recurring theme of this series.

Coming up

I had planned to get into the second major weapon of statistical inference - regression. But I chose to expand this section and do not want to present too much too soon.

I hope you enjoyed the second part. As ever, any questions or discussion points arising from it can be put to me on Twitter @Prof_Hindsight.

Thanks for your effort.  

Tuesday, 21 June 2011

Statistical inference and mathematical modelling in sport


Since the origin of organised sport, players have always searched for the secret of success. In individual sports like athletics, for instance, the route to improvement is more apparent than in more complex team activities like football. But, as time has gone on, the increased sophistication which we bring to life in general has revealed hidden truths in them all.

At first, players relied on developing an instinctive approach by trying different methods and experiencing different outcomes. This was enough to develop a 'feel' for the activity which was then passed down the generations as received wisdom.

Some of these instinctive beliefs duly persisted in individual sports, while others were shown to be false. Eventually, players and coaches in many sports became so incentivised by the financial rewards of success that they started to employ quantitative methods - and so began the march of statistics.

Now, those of an academic approach within each sport - and eventually academics themselves - began to study how success was achieved by scientific method. They isolated variables, by which performance in each sport could be measured, then developed quantitative relationships between these markers and victory.

Some of these findings immediately threw doubt on old approaches; some merely confirmed long-held beliefs and turned them into convictions. The academics then dug deeper and realised that the existing variables themselves were now insufficient to capture elusive truths.

In this series, I will start from the first principles of quantitative analysis and slowly proceed towards highly complex mathematical modelling of those sports for which sufficient data exists to form predictive tools. By that stage, you will have gained the advanced knowledge to reach a completely new understanding of whatever your favourite sport is. I guarantee you will never think about sport the same way again.

I will then look at the limitations of these models, but also show how we can account for these using real-time examples which will throw up projections sometimes greatly at variance with those of the bookmakers.

Anyone motivated to follow the entire series which will end in early September will be able to do so, assuming of course they have either the motivation to learn, or else the good nature to improve my own understanding.

As ever, I will answer any questions, or receive your views, directly through Twitter, a medium which facilitates constructive and respectful debate between parties, for all that it also limits it to 140 characters per offering. Hit me up on @Prof_Hindsight.

1.0 Correlation

Consider the following end of season standings for the English Premier League season just completed:

Team         GF  GA   GD  Pts
Man Utd      78  37  +41   80
Chelsea      69  33  +36   71
Man City     60  33  +27   71
Arsenal      72  43  +29   68
Spurs        55  46   +9   62
Liverpool    59  44  +15   58
Everton      51  45   +6   54
Fulham       49  43   +6   49
Aston Villa  48  59  -11   48
Sunderland   45  56  -11   47
West Brom    56  71  -15   47
Newcastle    56  57   -1   46
Stoke        46  48   -2   46
Bolton       52  56   -4   46
Blackburn    46  59  -13   43
Wigan        40  61  -21   42
Wolves       46  66  -20   40
Birmingham   37  58  -21   39
Blackpool    55  78  -23   39
West Ham     43  70  -27   33
(GF=goals scored; GA=goals conceded; GD=goal difference; Pts=points.)

It is obvious that the more goals a team scores and the fewer it concedes, the greater its points total at the end of the season. But is there a way of expressing that a relationship exists between these variables?

Yes.  It is called correlation. And its mathematical measure is called a correlation coefficient. 

When we are considering quantities such as how goals scored or conceded relate to points won, the coefficient is technically referred to as the Pearson product-moment correlation, or 'r'.

The correlation between two variables is expressed as a decimal between -1 and 1. The stronger the relationship, the greater the magnitude of r, from 0 (no linear relationship) to 1 (a perfect one). I will get into the scaling of r in a moment, but first consider that r can be either a positive or a negative value.

A positive value of r refers to a relationship in which the greater the value of one variable, the greater the value of the other - a positive correlation, we say. The more goals a team scores during the season, the greater its total points, so the r value between points and goals scored is positive.

But when one variable tends to increase as the other decreases, we say there is a negative correlation between the two. Now a -ve sign is placed in front of the r value, while the strength of the relationship is still expressed by the magnitude of the decimal. Patently, goals conceded bears a negative correlation with total points earned in our example.

Calculating r by hand is a laborious process, but it can be done quickly with a spreadsheet such as Microsoft Excel or a freeware alternative. During this series, it is not necessary to do any of the calculations yourself, though hopefully you will have the urge to do so.

In this case, the r value or correlation between goals scored and points won is 0.867 - indicating a strong relationship. This is no surprise, but remember this example is very much entry level statistical inference. There are three months still to go!

The correlation between goals conceded and points won is -0.848. It is of a similar magnitude to that governing goals scored, but the negative sign indicates the opposite relationship.
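For anyone who would rather check these figures in code than in a spreadsheet, here is a minimal sketch in plain Python. The data is the table above; the function name `pearson_r` and the variable names are my own choices. Run against that table, it should land on the two r values just quoted, give or take rounding.

```python
import math

# 2010-11 Premier League table from the text: (team, GF, GA, points).
TABLE = [
    ("Man Utd", 78, 37, 80), ("Chelsea", 69, 33, 71),
    ("Man City", 60, 33, 71), ("Arsenal", 72, 43, 68),
    ("Spurs", 55, 46, 62), ("Liverpool", 59, 44, 58),
    ("Everton", 51, 45, 54), ("Fulham", 49, 43, 49),
    ("Aston Villa", 48, 59, 48), ("Sunderland", 45, 56, 47),
    ("West Brom", 56, 71, 47), ("Newcastle", 56, 57, 46),
    ("Stoke", 46, 48, 46), ("Bolton", 52, 56, 46),
    ("Blackburn", 46, 59, 43), ("Wigan", 40, 61, 42),
    ("Wolves", 46, 66, 40), ("Birmingham", 37, 58, 39),
    ("Blackpool", 55, 78, 39), ("West Ham", 43, 70, 33),
]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


goals_for = [row[1] for row in TABLE]
goals_against = [row[2] for row in TABLE]
points = [row[3] for row in TABLE]

r_scored = pearson_r(goals_for, points)
r_conceded = pearson_r(goals_against, points)

print(f"goals scored vs points:   r = {r_scored:+.3f}")
print(f"goals conceded vs points: r = {r_conceded:+.3f}")
```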

1.1 The meaning of r-squared

As you become familiar with the correlation between two variables, you will find it easier to recognise the significance of its strength. But what does r actually express in a quantitative sense?

Technically speaking, r is the covariance of the two variables divided by the product of their standard deviations. Don't worry about this, though: think of the correlation of two variables as the degree to which the variance of one presages the variance of the other.
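For those who do want to see it, that definition can be written out as a formula. The notation is my own choice for illustration: x_i and y_i are the two values for team i (say, goals scored and points), n is the number of teams, and the barred symbols are the averages across all teams.

```latex
r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The second form follows because the factors of n in the covariance and in the two standard deviations cancel out.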

By squaring r (r times itself) we produce the "coefficient of determination" or 'r-squared'. This expresses what percentage of the variance of one variable is related to the variance of the other.  

The r-squared of goals scored and points earned is 0.751 - so 75.1% of the variance in a team's points total is a product of the variance in goals scored.

The r-squared of goals conceded and points earned is 0.719 - so 71.9% of the variance in a team's points total is a product of the variance in goals conceded. Notice how the -ve sign disappears with the r-squared value of negative correlations. Now it is the magnitude of the relationship only in which we are interested.
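The squaring step can be sketched in a couple of lines. Note that this works from the r values as quoted, which are already rounded to three decimals, so the last digit of the result may differ slightly from figures calculated on the unrounded data. The helper name `r_squared` is my own.

```python
def r_squared(r):
    """Coefficient of determination for a simple two-variable correlation."""
    return r * r


# r values as quoted in the text (rounded to three decimals).
quoted = {
    "goals scored vs points": 0.867,
    "goals conceded vs points": -0.848,
}

for label, r in quoted.items():
    # Squaring removes the sign, leaving only the strength of the relationship.
    print(f"{label}: r = {r:+.3f}, r-squared = {r_squared(r):.3f}")
```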

1.2 Correlation does not imply causation

According to the limited sample provided by the 2010-2011 Premier League season - we will examine the meaning of sample-size in a later episode - the r-squared values suggest there is a stronger relationship between goals scored and points earned than between goals conceded and points earned.

This is just the first of many interesting results that an example as simple as the one above can provide. Does it therefore mean that goals scored are more important than goals conceded - that attack is more important than defence?

Not necessarily. When we infer a correlation between two variables, it does not mean that a change in one is the cause of a change in another. Instead, it implies only that - within the limitations of the sample-size - a connection between the two has existed in the past. If the sample-size is small, this connection may even be just a coincidence.

We might find, for instance, that the number of points earned by Liverpool is actually correlated with the production of wine in Portugal between the years 1979 and 1985. But this does not imply a causal link between the two: if Portugal were to increase its wine production in the future, Liverpool would not earn more points as a result.

In our example, it is however reasonable to infer that the stronger correlation between goals scored and points earned is a significant result. The three-points-for-a-win system means that a team who scores and concedes more goals but has the same goal difference as a rival is likely to earn more points.

"Correlation does not imply causation" is an important maxim of statistical inference and should always be remembered.
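The danger of coincidence in small samples can be shown with a simple simulation. This is a sketch under assumptions of my own: it draws pairs of completely unrelated random series (normally distributed, fixed seed) and counts how often they happen to show a correlation of 0.7 or stronger in either direction. A sample size of 7 matches the seven years in the wine example; 311 is included as a much larger sample for contrast. The function name `share_of_strong_flukes` is a label of my own.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def share_of_strong_flukes(sample_size, trials=2000, threshold=0.7, seed=1):
    """Fraction of trials where two unrelated series show |r| >= threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        if abs(pearson_r(xs, ys)) >= threshold:
            hits += 1
    return hits / trials


results = {n: share_of_strong_flukes(n) for n in (7, 20, 311)}
for n, share in results.items():
    print(f"sample size {n:>3}: {share:.1%} of unrelated pairs show |r| >= 0.7")
```

The smaller the sample, the more often pure chance produces what looks like a strong relationship.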

1.3 Co-dependency

The value of r-squared which we calculated for goals scored and points earned suggested that 75.1% of the variance in a team's points total is a product of the variance in goals scored. That only leaves 24.9% for other factors, yet we found that the r-squared value for goals conceded and points earned implied that 71.9% of the variance of one was due to the other.

75.1% and 71.9% add up to a lot more than 100%, so how can this be? The answer is that goals scored and goals conceded are not independent variables themselves.

In fact, the correlation between goals scored and conceded is -0.570 - a weaker correlation but one which still suggests that the more goals a team scores, the fewer it tends to concede.

This so-called co-dependency between goals scored and goals conceded is another example of the maxim we have just established, that "correlation does not imply causation".

Scoring goals does not directly cause a team to concede fewer. (Incidentally, there may be a positive game-theoretical implication of a strategy biased towards attacking - "attack is the best form of defence" - but that is a different consideration.)

Instead, these co-dependent variables are both some function of a football team's merit or efficiency. In other words, good teams tend to score lots of goals and concede few, while bad teams do the opposite.

It follows from this that goal-difference (goals scored minus goals conceded) is itself strongly correlated with this team efficiency, and therefore with points earned.

Of course. In fact, goal difference is a very powerful measure of efficiency, more so than even points earned, as I will prove in a future episode.

For now, just consider that the correlation between goal difference and points earned is a very strong 0.965. And the r-squared of 0.931 suggests that 93.1% of the variance in points earned is a result of the variance in goal difference. And, yes, it is safe to say that the first causes the second.
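The co-dependency can be checked against the same 2010-11 table. The sketch below is plain Python with names of my own choosing. It correlates goals scored with goals conceded, then goal difference with points, and shows that the two single-variable r-squared values sum to more than 1 precisely because the two variables overlap. The results should come out close to the figures quoted in this section, though the third decimal may differ slightly.

```python
import math

# 2010-11 Premier League table from the text: (GF, GA, points) per team.
ROWS = [
    (78, 37, 80), (69, 33, 71), (60, 33, 71), (72, 43, 68), (55, 46, 62),
    (59, 44, 58), (51, 45, 54), (49, 43, 49), (48, 59, 48), (45, 56, 47),
    (56, 71, 47), (56, 57, 46), (46, 48, 46), (52, 56, 46), (46, 59, 43),
    (40, 61, 42), (46, 66, 40), (37, 58, 39), (55, 78, 39), (43, 70, 33),
]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


scored = [gf for gf, _, _ in ROWS]
conceded = [ga for _, ga, _ in ROWS]
points = [pts for _, _, pts in ROWS]
goal_diff = [gf - ga for gf, ga, _ in ROWS]

r_scored_conceded = pearson_r(scored, conceded)
r_gd_points = pearson_r(goal_diff, points)
r2_sum = pearson_r(scored, points) ** 2 + pearson_r(conceded, points) ** 2

print(f"goals scored vs goals conceded: r = {r_scored_conceded:+.3f}")
print(f"goal difference vs points:      r = {r_gd_points:+.3f}")
print(f"sum of the two single-variable r-squared values = {r2_sum:.3f}")
```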

1.4 Sample-size and statistical significance

It is very important to be aware that the values of 'r' and 'r-squared'  cited in this example are only a function of a limited sample of data. Forging universal relationships between goals scored and conceded and points earned is not possible from just the latest Premiership season.

We must always remember to state the limitations of data when expressing statistical relationships formally. And there are various measures of so-called statistical significance we can use to imply the reliability of the data in a quantitative way. Those must wait while we uncover some more interesting relationships from Premiership history next week.

1.5 Coming up

Next time, I will examine how well goals scored and conceded in one season predict performance in the next. In other words, how identifiable and persistent is the ability a team has to score goals and prevent them being scored?

And to what degree can it survive the inevitable changes in personnel from one season to the next? I will be considering data only from the Premiership itself, so teams who were relegated in an adjacent season will be excluded from the sample.

As a matter of interest, what do you think will be the result? Will there be no difference in the predictive power of goals scored and conceded from one season to the next? Or will one persist more than the other, perhaps suggesting something about the underlying skills associated with attack and defence and their importance?

I will also introduce the topic of linear regression, deriving an equation which links goals scored and conceded to points earned in a 38-game Premiership season - assuming the relationship is linear and not exponential for now.