Sunday, 26 June 2011

Statistical inference and mathematical modelling in sport (part 2)

Summary of part 1

In the first part, we established the first major weapon of statistical inference - correlation. 

We saw that the correlation between goals scored and points earned for teams during the 2010-11 Premiership season was 0.867, while the correlation between goals conceded and points was -0.848.

Just in case you aren't with me, part 1 of the series has a full explanation of those terms and is here.  

2.0 Auto-correlation

As well as observing correlations between two distinct variables - such as goals and points - it is also extremely useful when modelling sports to determine how a single variable persists or changes over time.

How a variable correlates with itself across different periods of time is called auto-correlation. I often refer to it by the more descriptive term of "forward correlation".

To see what forward correlation can teach us about the way a sport works, let's now expand our data sample from a single season of the Premiership - from which any inference is not statistically significant - to the entire history of the competition back to 1992/3 - from which it is.

(For more on statistical significance, please wait for an upcoming episode in which I will highlight a shocking example for 10-year trends fans.)

The forward correlation we are concerned with here is how well the basic statistics of a team's performance in one season persist into - or predict - the same statistics a year later.

In Premiership history, there are 311 data-points consisting of teams who have played consecutive seasons in the Premiership. Consider their r values in various categories:

wins: 0.739
draws: 0.109
losses: 0.632
goals scored: 0.666
goals conceded: 0.599
goal difference: 0.785
total goals per game: 0.185
points: 0.679
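
For readers who want to try this themselves, here is a minimal sketch of how a forward correlation could be computed. The team names and points totals are invented purely for illustration (they are not the Premiership data), and the helper names pearson_r and forward_pairs are my own, not part of any library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def forward_pairs(table):
    """Pair each (team, season) value with the same team's value one season later.

    table maps (team, season_start_year) -> value. Teams absent from the
    following season (e.g. relegated) contribute no pair for that year.
    """
    pairs = []
    for (team, year), value in table.items():
        following = table.get((team, year + 1))
        if following is not None:
            pairs.append((value, following))
    return pairs

# Hypothetical points totals, invented for illustration only.
points = {
    ("Team A", 2008): 80, ("Team A", 2009): 75, ("Team A", 2010): 78,
    ("Team B", 2008): 50, ("Team B", 2009): 55, ("Team B", 2010): 47,
    ("Team C", 2009): 38, ("Team C", 2010): 41,
}

pairs = forward_pairs(points)
this_season = [a for a, _ in pairs]
next_season = [b for _, b in pairs]
forward_r = pearson_r(this_season, next_season)
print(len(pairs), "data-points, forward r =", round(forward_r, 3))
```

Each data-point is one team in two consecutive seasons, which is why a team with three seasons in the table contributes two pairs.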

This is statistical inference at a baby level, but already you have enough information to be ahead of 99% of all football fans, punters and journalists:

The number of drawn games in which a Premiership team is involved shows little correlation from one season to the next

The r-squared value of the forward correlation of draws is just 0.012, so only around 1% of the variance in draws persists from season to season. We say that Premiership teams in general do not control the rate at which they draw games in future seasons.
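
The r-squared figure is simply the r value multiplied by itself. As a quick sketch, this squares each of the forward correlations listed above; the dictionary name forward_r is just my own label for that list.

```python
# Forward correlations (r) by category, as listed in the text.
forward_r = {
    "wins": 0.739,
    "draws": 0.109,
    "losses": 0.632,
    "goals scored": 0.666,
    "goals conceded": 0.599,
    "goal difference": 0.785,
    "total goals per game": 0.185,
    "points": 0.679,
}

# r-squared: the share of variance in next season's figure that is
# accounted for by this season's figure.
r_squared = {name: r ** 2 for name, r in forward_r.items()}

for name, value in sorted(r_squared.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.3f}")
```

Sorting by r-squared puts the least persistent categories first.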

It is important to understand what this does and does not imply.

 - It DOES NOT imply that draws are random, nor that the percentage chance of a draw in a Premiership game should always be the same.

In fact, as might be expected, teams who are historically around the middle third of the Premiership draw games at a slightly higher rate than those at the extremes.

 - It DOES imply that the facets of a team which are connected to its chance of drawing are not strong enough to survive the vagaries of time. Many things can change about a Premiership team from season to season, and they clearly have a radical effect on its tendency to draw games.

Put another way, drawing games is not a skill which persists for the average Premiership team.

This is a simple but powerful result. Because three points are awarded to a team for a win and only one for a draw, a team which plays at exactly the same level of efficiency from one season to the next is likely to show a variance in points gained caused by the inability to control the rate at which it draws.

Taken to the extreme, a team of league-average ability which won half its 38 Premiership games and lost the other half would earn 57 points, but a team of the same efficiency which happened to draw all its games would earn 19 fewer points.

But the following season, both these teams could be expected to earn a similar number of points, for the reason we have just established: the forward correlation of draws by season is only 0.109.
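
The arithmetic behind that extreme case can be checked in a few lines. This is only a sketch of the standard three-points-for-a-win scoring; the function name season_points is my own.

```python
POINTS_PER_WIN = 3
POINTS_PER_DRAW = 1

def season_points(wins, draws, losses):
    """League points for a season; losses earn nothing but are kept for clarity."""
    return wins * POINTS_PER_WIN + draws * POINTS_PER_DRAW

# Two teams of identical overall efficiency over 38 games.
win_or_lose = season_points(wins=19, draws=0, losses=19)
all_draws = season_points(wins=0, draws=38, losses=0)

print(win_or_lose, all_draws, win_or_lose - all_draws)
```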

2.1 Inferring dominant and recessive team-skills
  
The forward correlation of goals scored is 0.666 but the forward correlation of goals conceded is just 0.599. The difference between the r-squared values for these two relationships is a statistically significant 8.5 percentage points (0.444 - 0.359).

This is really important. It implies that scoring goals is a more influential skill than preventing them: that every goal scored in the Premiership has more to do with the efficiency of the attacking force than the defensive one.

This result is consistent for every sport that I have modelled.

For instance, the average yards gained by an NFL team on any play shows a much greater correlation with the offense's seasonal average than the defense's.

In NBA basketball, shooting percentage in any game is much more strongly correlated with the team taking the shot than the one defending against it.

And in baseball, the result of any single interaction between pitcher and hitter is more strongly correlated to the batter's statistics than the pitcher's.

The way I put this is that in football, scoring goals is a dominant skill while preventing them is a recessive one.

Again, it is important to outline what can and cannot be inferred from this.

 - It DOES NOT imply that attack is more important than defence. As we saw in the last part, the rate at which a team scores goals is itself negatively correlated with the rate at which it concedes them. (Teams who score a lot of goals tend to concede fewer, and vice versa.)

In other words, both scoring goals and preventing them are just related facets of a more dominant skill - team efficiency.

(The technical reasons for this are manifold. Most obviously, all players on a team are involved in attacking and defending to some extent.)

 - It DOES NOT imply that attacking football is a better idea than defensive football for all teams and in all situations. We will fully explore the fascinating subject of game theory in due course. For now, consider that the payoff associated with different strategies for a football team is measured most easily in expected goal difference, which is affected equally and oppositely by goals scored and conceded.

2.2 Metagames and underlying skills

Earning points is the most important consideration for a Premiership team. But we can think of the units of points as wins, draws and losses, and the units of wins, draws and losses as goals scored and conceded.

And so we can go to deeper and deeper levels of so-called metagames. One unit of goals scored, for instance, is shots on target %, while a unit of goals conceded (or prevented in this case) may be tackles won %.

Another way of putting this is that shooting and tackling are some of the underlying skills of team efficiency. 

Measuring and quantifying the impact of underlying skills is one of the major challenges of modelling sports because of sample-size considerations. Because goals scored and conceded are relatively rare events, they are subject to greater variance than other more frequent expressions of team efficiency, such as tackling, passing, shooting accurately or whatever.
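
One rough way to see why rare events are noisier is sketched below. It assumes that event counts behave approximately like a Poisson process, which is my simplifying assumption and not something established above, and the per-season counts are hypothetical round numbers. Under that assumption the spread of a count relative to its average is 1 divided by the square root of the average.

```python
import math

def relative_spread(expected_count):
    """Standard deviation divided by mean for a Poisson count: 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(expected_count)

# Hypothetical per-season averages, chosen only to contrast rare and frequent events.
expected_per_season = {
    "goals scored": 50,
    "shots": 500,
    "passes": 15000,
}

for event, count in expected_per_season.items():
    print(f"{event}: relative spread of about {relative_spread(count):.1%}")
```

The rarer the event, the larger its season-to-season wobble relative to its average, even if the team's underlying ability has not changed at all.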

Of course, goals scored and conceded are much more important to the result than the many more frequent and less variable outcomes of metagames - or "games within a game" - between ball-carrier and tackler, shooter and saver.

This is a statistical paradox of mathematical modelling of sports which is going to be a recurring theme of this series.

Coming up

I had planned to get into the second major weapon of statistical inference - regression. But I chose to expand this section and do not want to present too much too soon.

I hope you enjoyed the second part. As ever, any questions or comments arising from it can be discussed with me on Twitter @Prof_Hindsight.

Thanks for your effort.