Sunday, 26 June 2011

Statistical inference and mathematical modelling in sport (part 2)

Summary of part 1

In the first part, we established the first major weapon of statistical inference - correlation. 

We saw that the correlation between goals scored and points earned for teams during the 2010-11 Premiership season was 0.867, while the correlation between goals conceded and points was -0.848.

Just in case you aren't with me, part 1 of the series has a full explanation of those terms and is here.  

2.0 Auto-correlation

As well as observing correlations between two distinct variables - such as goals and points - it is also extremely useful when modelling sports to determine how a single variable persists or changes over time.

How a variable correlates with itself across different periods of time is called auto-correlation. I often refer to it by the more descriptive term of "forward correlation".

To see what forward correlation can teach us about the way a sport works, let's now expand our data sample from a single season of the Premiership - from which any inference is not statistically significant - to the entire history of the competition back to 1992/3 - from which it is.

(For more on statistical significance, please wait for an upcoming episode in which I will highlight a shocking example for 10-year trends fans.)

The forward correlation we are concerned with here is how well the basic statistics of a team's performance in one season persist into - or predict - the same statistics in a year's time.

In Premiership history, there are 311 data-points consisting of teams who have played consecutive seasons in the Premiership. Consider their r values in various categories:

wins: 0.739
draws: 0.109
losses: 0.632
goals scored: 0.666
goals conceded: 0.599
goal difference: 0.785
total goals per game: 0.185
points: 0.679
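
For readers who want to try this themselves, here is a minimal sketch of how a forward correlation could be computed. The team names and points totals are hypothetical toy numbers, not real Premiership data, and the helper names `pearson_r` and `forward_pairs` are my own. The idea is simply to pair each team's value in one season with its value in the next, skipping any team that was not in the division in both years.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def forward_pairs(history):
    """Pair each team's value in season t with its value in season t+1.

    `history` maps team -> {season_start_year: value}. A team contributes a
    pair only when it appears in both of two consecutive seasons.
    """
    pairs = []
    for seasons in history.values():
        for year, value in seasons.items():
            if year + 1 in seasons:
                pairs.append((value, seasons[year + 1]))
    return pairs


# Hypothetical toy data (NOT real Premiership figures): points by season.
toy_points = {
    "Team A": {2007: 80, 2008: 76, 2009: 83},
    "Team B": {2007: 45, 2008: 51, 2009: 40},
    "Team C": {2008: 38, 2009: 44},   # only two seasons in the division
    "Team D": {2007: 36, 2009: 41},   # absent in 2008, so no consecutive pair
}

pairs = forward_pairs(toy_points)
this_season = [a for a, _ in pairs]
next_season = [b for _, b in pairs]
forward_r = pearson_r(this_season, next_season)
print(len(pairs), round(forward_r, 3))
```

With the full history in place of the toy dictionary, the same pairing would yield one data-point per team per pair of consecutive seasons.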

This is statistical inference at a baby level, but already you have enough information to be ahead of 99% of all football fans, punters and journalists:

The number of drawn games in which a Premiership team is involved shows little correlation from one season to the next

The r-squared value of the forward correlation of draws is just 0.012, so only around 1% of the variance in draws persists from season to season. We say that Premiership teams in general do not control the rate at which they draw games in future seasons.
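
The squaring can be checked for every category in the list, not just draws. This short sketch uses only the r values quoted above; the dictionary and variable names are my own.

```python
# Forward correlations (r) quoted in the list above.
forward_r = {
    "wins": 0.739,
    "draws": 0.109,
    "losses": 0.632,
    "goals scored": 0.666,
    "goals conceded": 0.599,
    "goal difference": 0.785,
    "total goals per game": 0.185,
    "points": 0.679,
}

# r-squared: the share of variance that carries over to the next season.
r_squared = {name: r * r for name, r in forward_r.items()}

# Print the categories from most to least persistent.
for name, value in sorted(r_squared.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} r = {forward_r[name]:.3f}   r-squared = {value:.3f}")
```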

It is important to understand what this does and does not imply.

 - It DOES NOT imply that draws are random, nor that the percentage chance of a draw in a Premiership game should always be the same.

In fact, as might be expected, teams who are historically around the middle third of the Premiership draw games at a slightly higher rate than those at the extremes.

 - It DOES imply that the facets of a team which are connected to its chance of drawing are not strong enough to survive the vagaries of time. Many things can change about a Premiership team from season to season, and they clearly have a radical effect on its tendency to draw games.

Put another way, drawing games is not a skill which persists for the average Premiership team.

This is a simple but powerful result. Because three points are awarded to a team for a win and only one for a draw, a team which plays at exactly the same level of efficiency from one season to the next is likely to show a variance in points gained caused by the inability to control the rate at which it draws.

Taken to the extreme, a team of league-average ability which won half its 38 Premiership games and lost the other half would earn 57 points, but a team of the same efficiency which happened to draw all its games would earn 19 fewer points.

But the following season, both these teams could be expected to earn a similar number of points, for the reason we have just established: the forward correlation of draws by season is only 0.109.
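
The arithmetic behind the extreme example is easy to verify. A small sketch, with a function name of my own choosing:

```python
GAMES = 38
POINTS_WIN, POINTS_DRAW, POINTS_LOSS = 3, 1, 0


def season_points(wins, draws, losses):
    """Points under three-for-a-win; checks the record fills a 38-game season."""
    assert wins + draws + losses == GAMES
    return wins * POINTS_WIN + draws * POINTS_DRAW + losses * POINTS_LOSS


half_and_half = season_points(wins=19, draws=0, losses=19)  # 19 * 3 = 57
all_draws = season_points(wins=0, draws=38, losses=0)       # 38 * 1 = 38
gap = half_and_half - all_draws
print(half_and_half, all_draws, gap)  # prints: 57 38 19
```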

2.1 Inferring dominant and recessive team-skills
  
The forward correlation of goals scored is 0.666 but the forward correlation of goals conceded is just 0.599. The difference between the r-squared values for these two relationships is a statistically significant 8.5 percentage points (0.444 - 0.359).

This is really important. It implies that scoring goals is a more influential skill than preventing them: that every goal scored in the Premiership has more to do with the efficiency of the attacking force than the defensive one.

This result is consistent for every sport that I have modelled.

For instance, the average yards gained by an NFL team on any play shows a much greater correlation with the offense's seasonal average than the defense's.

In NBA basketball, shooting percentage in any game is much more strongly correlated with the team taking the shot than the one defending against it.

And in baseball, the result of any single interaction between pitcher and hitter is more strongly correlated to the batter's statistics than the pitcher's.

The way I put this is that in football, scoring goals is a dominant skill while preventing them is a recessive one.

Again, it is important to outline what can and cannot be inferred from this.

 - It DOES NOT imply that attack is more important than defence. As we saw in the last part, the rate at which a team scores goals is itself negatively correlated with the rate at which it concedes them. (Teams who score a lot of goals tend to concede fewer, and vice versa.)

In other words, both scoring goals and defending them are just related facets of a more dominant skill - team efficiency. 

(The technical reasons for this are manifold. Most obviously, all players on a team are involved in attacking and defending to some extent.)

 - It DOES NOT imply that attacking football is a better idea than defensive football for all teams and in all situations. We will fully explore the fascinating subject of game theory in due course, but for now consider that the payoff associated with different strategies for a football team is measured most easily in expected goal difference which is affected equally and oppositely by goals scored and conceded.  

2.2 Metagames and underlying skills

Earning points is the most important consideration for a Premiership team. But we can think of the units of points as wins, draws and losses, and the units of wins, draws and losses as goals scored and conceded.

And so we can go to deeper and deeper levels of so-called metagames. One unit of goals scored, for instance, is shots on target %, while a unit of goals conceded (or prevented in this case) may be tackles won %.

Another way of putting this is that shooting and tackling are some of the underlying skills of team efficiency. 

Measuring and quantifying the impact of underlying skills is one of the major challenges of modelling sports because of sample-size considerations. Because goals scored and conceded are relatively rare events, they are subject to greater variance than other more frequent expressions of team efficiency, such as tackling, passing, shooting accurately or whatever.

Of course, goals scored and conceded are much more important towards the result than the many more frequent and less variable outcomes of metagames - or "games within a game" - between ball-carrier and tackler, shooter and saver.

This is a statistical paradox of mathematical modelling of sports which is going to be a recurring theme of this series.

Coming up

I had planned to get into the second major weapon of statistical inference - regression. But I chose to expand this section and do not want to present too much too soon.

I hope you enjoyed the second part. As ever, any questions or points arising from it can be discussed with me on Twitter @Prof_Hindsight.

Thanks for your effort.  

Tuesday, 21 June 2011

Statistical inference and mathematical modelling in sport

Introduction

Since the origin of organised sport, players have always searched for the secret of success. In individual sports like athletics, for instance, the route to improvement is more apparent than in more complex team activities like football. But, as time has gone on, the increased sophistication which we bring to life in general has revealed hidden truths in them all.

At first, players relied on developing an instinctive approach by trying different methods and experiencing different outcomes. This was enough to develop a 'feel' for the activity which was then passed down the generations as received wisdom.

Some of these instinctive beliefs duly persisted in individual sports, while others were shown to be false. Eventually, players and coaches in many sports became so incentivised by the financial rewards of success that they started to employ quantitative methods - and so began the march of statistics.

Now, those of an academic approach within each sport - and eventually academics themselves - began to study how success was achieved by scientific method. They isolated variables, by which performance in each sport could be measured, then developed quantitative relationships between these markers and victory.

Some of these findings immediately threw doubt on old approaches; some merely confirmed long-held beliefs and turned them into convictions. The academics then dug deeper and realised that the existing variables themselves were now insufficient to capture elusive truths.

In this series, I will start from the first principles of quantitative analysis and slowly proceed towards highly complex mathematical modelling of those sports for which sufficient data exists to form predictive tools. By that stage, you will have gained the advanced knowledge to reach a completely new understanding of whatever your favourite sport is. I guarantee you will never think about sport the same way again.

I will then look at the limitations of these models, but also show how we can account for these using real-time examples which will throw up projections sometimes greatly at variance with those of the bookmakers.

Anyone motivated to follow the entire series which will end in early September will be able to do so, assuming of course they have either the motivation to learn, or else the good nature to improve my own understanding.

As ever, I will answer any questions, or receive your views, directly through Twitter, a medium which facilitates constructive and respectful debate between parties, for all that it also limits it to 140 characters per offering. Hit me up on @Prof_Hindsight.


1.0 Correlation

Consider the following end of season standings for the English Premier League season just completed:

Team GF GA GD PTS
Man Utd 78 37 +41 80
Chelsea 69 33 +36 71
Man City 60 33 +27 71
Arsenal 72 43 +29 68
Spurs 55 46 +9 62
Liverpool 59 44 +15 58
Everton 51 45 +6 54
Fulham 49 43 +6 49
Aston Villa 48 59 -11 48
Sunderland 45 56 -11 47
West Brom 56 71 -15 47
Newcastle 56 57 -1 46
Stoke 46 48 -2 46
Bolton 52 56 -4 46
Blackburn 46 59 -13 43
Wigan 40 61 -21 42
Wolves 46 66 -20 40
Birmingham 37 58 -21 39
Blackpool 55 78 -23 39
West Ham 43 70 -27 33
(GF=goals scored; GA=goals conceded; GD=goal difference.)

It is obvious that the more goals a team scores and the fewer it concedes, the greater its point total at the end of the season. But is there a way of expressing that a relationship exists between these variables?

Yes.  It is called correlation. And its mathematical measure is called a correlation coefficient. 


When we are considering quantities such as how goals scored or conceded relate to points won, the coefficient is technically referred to as the Pearson product-moment correlation, or 'r'.

The strength of the correlation between two variables is expressed by a decimal whose magnitude lies between 0 and 1. The stronger the relationship, the higher the magnitude of r. I will get into the scaling of r in a moment, but also consider that r can be either a positive or a negative value.

A positive value of r refers to a relationship in which the greater the value of one variable, the greater the value of the other - a positive correlation, we say. The more goals a team scores during the season, the greater its total points, so the r value between points and goals scored is positive.

But when one variable tends to increase as the other decreases, we say there is a negative correlation between the two. Now a -ve sign is placed in front of the r value, while the strength of the relationship is still expressed by the magnitude of the decimal. Patently, goals conceded bears a negative correlation with total points earned in our example.

Calculating r is a laborious process but can be done quickly by means of a spreadsheet such as Microsoft Excel or the freeware OpenOffice.org. During this series, it is not necessary to do any of the calculations yourself, though hopefully you will have the urge to do so.

In this case, the r value or correlation between goals scored and points won is 0.867 - indicating a strong relationship. This is no surprise, but remember this example is very much entry level statistical inference. There are three months still to go!

The correlation between goals conceded and points won is -0.848. It is of a similar magnitude to that governing goals scored, but the negative sign indicates the opposite relationship.
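
If you would rather use code than a spreadsheet, the calculation can be sketched in a few lines of Python. The figures are copied from the table above; the function and variable names are my own.

```python
from math import sqrt

# (team, goals scored, goals conceded, points) copied from the table above.
TABLE = [
    ("Man Utd", 78, 37, 80), ("Chelsea", 69, 33, 71),
    ("Man City", 60, 33, 71), ("Arsenal", 72, 43, 68),
    ("Spurs", 55, 46, 62), ("Liverpool", 59, 44, 58),
    ("Everton", 51, 45, 54), ("Fulham", 49, 43, 49),
    ("Aston Villa", 48, 59, 48), ("Sunderland", 45, 56, 47),
    ("West Brom", 56, 71, 47), ("Newcastle", 56, 57, 46),
    ("Stoke", 46, 48, 46), ("Bolton", 52, 56, 46),
    ("Blackburn", 46, 59, 43), ("Wigan", 40, 61, 42),
    ("Wolves", 46, 66, 40), ("Birmingham", 37, 58, 39),
    ("Blackpool", 55, 78, 39), ("West Ham", 43, 70, 33),
]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


goals_scored = [row[1] for row in TABLE]
goals_conceded = [row[2] for row in TABLE]
points = [row[3] for row in TABLE]

r_scored = pearson_r(goals_scored, points)
r_conceded = pearson_r(goals_conceded, points)
print(f"goals scored vs points:   r = {r_scored:.3f}")
print(f"goals conceded vs points: r = {r_conceded:.3f}")
```

Run on the table as printed, the two values round to the 0.867 and -0.848 quoted in the text.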

1.1 The meaning of r-squared

As you become familiar with the correlation between two variables, you will find it easier to recognise the significance of its strength. But what does r actually express in a quantitative sense?

Technically speaking, r is the covariance of the two variables divided by the product of their standard deviations. Don't worry about this, though: think of the correlation of two variables as the degree to which the variance of one presages the variance of the other.
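
For those who do want to see it written out, that definition is as follows. The notation is my own: x and y are the two variables, n is the number of teams, and a bar denotes a mean.

```latex
r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```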

By squaring r (r times itself) we produce the "coefficient of determination" or 'r-squared'. This expresses what percentage of the variance of one variable is related to the variance of the other.  

The r-squared of goals scored and points earned is 0.751 - so 75.1% of the variance in a team's points total is a product of the variance in goals scored.

The r-squared of goals conceded and points earned is 0.719 - so 71.9% of the variance in a team's points total is a product of the variance in goals conceded. Notice how the -ve sign disappears with the r-squared value of negative correlations. Now it is only the magnitude of the relationship in which we are interested.

1.2 Correlation does not imply causation

According to the limited sample provided by the 2010-2011 Premier League season - we will examine the meaning of sample-size in a later episode - the r-squared values suggest there is a stronger relationship between goals scored and points earned than between goals conceded and points earned.

This is just the first of many interesting results that an example as simple as the one above can provide. Does it therefore mean that goals scored are more important than goals conceded - that attack is more important than defence?

Not necessarily. When we infer a correlation between two variables, it does not mean that a change in one is the cause of a change in another. Instead, it implies only that - within the limitations of the sample-size - a connection between the two has existed in the past. If the sample-size is small, this connection may even be just a coincidence.


We might find, for instance, that the number of points earned by Liverpool is actually correlated with the production of wine in Portugal between the years 1979 and 1985. But this does not imply a causal link between the two: if Portugal were to increase its wine production in the future, Liverpool would not earn more points as a result.


In our example, it is however reasonable to infer that the stronger correlation between goals scored and points earned is a significant result. The three-point-for-a-win system means that a team who scores and concedes more goals but has the same goal difference as a rival is likely to earn more points.

"Correlation does not imply causation" is an important maxim of statistical inference and should always be remembered.

1.3 Co-dependency

The value of r-squared which we calculated for goals scored and points earned suggested that 75.1% of the variance in a team's points total is a product of the variance in goals scored. That only leaves 24.9% for other factors, yet we found that the r-squared value for goals conceded and points earned implied that 71.9% of the variance of one was due to the other.

75.1% and 71.9% add up to a lot more than 100%, so how can this be? The answer is that goals scored and goals conceded are not independent variables themselves.

In fact, the correlation between goals scored and conceded is -0.570 - a weaker correlation but one which still suggests that the more goals a team scores, the fewer it tends to concede.

This so-called co-dependency between goals scored and goals conceded is another example of the maxim we have just established, that "correlation does not imply causation".

Scoring goals does not directly cause a team to concede fewer. (Incidentally, there may be a positive game-theoretical implication of a strategy biased towards attacking - "attack is the best form of defence" - but that is a different consideration.)


Instead, these co-dependent variables are both some function of a football team's merit or efficiency. In other words, good teams tend to score lots of goals and concede few, while bad teams do the opposite.

It follows from this that goal-difference (goals scored minus goals conceded) is itself strongly correlated with this team efficiency, and therefore with points earned.

Of course. In fact, goal difference is a very powerful measure of efficiency, more so than even points earned, as I will prove in a future episode.

For now, just consider that the correlation between goal difference and points earned is a very strong 0.965. And the r-squared of 0.931 suggests that 93.1% of the variance in points earned is a result of the variance in goal difference. And, yes, it is safe to say that the first causes the second.
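
The co-dependency can be checked from the same table. The sketch below derives goal difference from goals scored and conceded, then measures both the scored-conceded relationship and the goal difference-points relationship. The figures are from the table in section 1.0; the names are my own.

```python
from math import sqrt

# (goals scored, goals conceded, points), in the table's order from section 1.0.
ROWS = [
    (78, 37, 80), (69, 33, 71), (60, 33, 71), (72, 43, 68), (55, 46, 62),
    (59, 44, 58), (51, 45, 54), (49, 43, 49), (48, 59, 48), (45, 56, 47),
    (56, 71, 47), (56, 57, 46), (46, 48, 46), (52, 56, 46), (46, 59, 43),
    (40, 61, 42), (46, 66, 40), (37, 58, 39), (55, 78, 39), (43, 70, 33),
]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


scored = [gf for gf, _, _ in ROWS]
conceded = [ga for _, ga, _ in ROWS]
points = [pts for _, _, pts in ROWS]
goal_diff = [gf - ga for gf, ga, _ in ROWS]

# Scored and conceded are not independent of each other...
r_scored_conceded = pearson_r(scored, conceded)
# ...which is why their combination explains points better than either alone.
r_gd_points = pearson_r(goal_diff, points)

print(f"scored vs conceded:        r = {r_scored_conceded:.3f}")
print(f"goal difference vs points: r = {r_gd_points:.3f}, "
      f"r-squared = {r_gd_points ** 2:.3f}")
```

Both results should land within a few thousandths of the figures quoted above; any small gap is likely to be rounding.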

1.4 Sample-size and statistical significance

It is very important to be aware that the values of 'r' and 'r-squared' cited in this example are only a function of a limited sample of data. Forging universal relationships between goals scored and conceded and points earned is not possible from just the latest Premiership season.

We must always remember to state the limitations of data when expressing statistical relationships formally. And there are various measures of so-called statistical significance we can use to imply the reliability of the data in a quantitative way. Those must wait while we uncover some more interesting relationships from Premiership history next week.

1.5 Coming up

Next time, I will examine how well goals scored and conceded in one season predict performance in the next. In other words, how identifiable and persistent is the ability a team has to score goals and prevent them being scored?

And to what degree can it survive the inevitable changes in personnel from one season to the next? I will be considering data only from the Premiership itself, so teams who were relegated in an adjacent season will be excluded from the sample.

As a matter of interest, what do you think will be the result? Will there be no difference in the predictive power of goals scored and conceded from one season to the next? Or will one persist more than the other, perhaps suggesting something about the underlying skills associated with attack and defence and their importance?

I will also introduce the topic of linear regression, deriving an equation which links goals scored and conceded to points earned in a 38-game Premiership season - assuming the relationship is linear and not exponential for now.

Saturday, 7 May 2011

Crunching the numbers from the Kentucky Derby preps

If you read my two series on pace and its implications, you should be acutely aware of the relationship between even pace and optimal time.
With tonight's Kentucky Derby in mind, now is the time to put that learning into action. Let's look at the individual splits of several key horses and think about how they relate to the final times earned in each case.
I will be using Beyer speed figures strictly for this exercise, though I do feel they are less potent as an analytical tool than the new US Timeform ratings devised by Simon Rowlands.
Anyway, for your information and analysis, here we go with the number-crunching:


Dialed In (Florida Derby): 24.9 - 23.3 - 23.7 - 25.1 - 13.1 = Beyer 93


Shackleford (Florida Derby): 23.3 - 23.1 - 24.3 - 25.8 - 13.7 = Beyer 93


Soldat (Fountain of Youth): 24.3 - 23.7 - 24.4 - 24.8 - 13.0 = Beyer 96


Soldat (GP allowance): 24.3 - 24.0 - 23.9 - 24.5 - 12.6 = Beyer 103 


Archarcharch (Ark Derby): 24.2 - 23.5 - 24.2 - 24.7 - 12.8 = Beyer 98


Nehro (Ark Derby): 24.2 - 23.8 - 24.6 - 24.5 - 12.4 = Beyer 98


Brilliant Speed (Blue Grass): 26.7 - 25.0 - 24.6 - 23.0 - 11.6 = Beyer 93


Twinspired (Blue Grass): 25.5 - 25.2 - 24.3 - 24.0 - 12.0 = Beyer 93


Twice The Appeal (Sun Dy): 23.5 - 22.4 - 25.2 - 26.4 - 13.5 = Beyer 89


Animal Kingdom (Spiral): 24.7 - 22.6 - 25.7 - 26.0 - 13.4 = Beyer 94


Decisive Moment (Spiral): 23.8 - 22.8 - 25.9 - 26.4 - 13.9 = Beyer 89


Midnight Interlude (SA Derby): 23.2 - 24.4 - 24.3 - 24.4 - 12.3 = Beyer 97


Comma To The Top (SA Dy): 22.8 - 24.5 - 24.2 - 24.5 - 12.7 = Beyer 97


Pants On Fire (LA Derby): 23.6 - 24.1 - 24.4 - 25.2 - 12.7 = Beyer 94


Nehro (LA Derby): 24.0 - 24.0 - 24.8 - 24.9 - 12.3 = Beyer 94


Mucho Macho Man (LA Dy): 24.0 - 23.9 - 24.7 - 24.6 - 12.8 = Beyer 93


Mucho Macho Man (Remsen): 24.5 - 23.3 - 23.5 - 26.0 - 13.0 = Beyer 99


If it wasn't clear already, this must be one of the most open renewals of the Kentucky Derby ever. If you look at the numbers above, Animal Kingdom appeals as a horse who ran a final time competitive with the rest, while doing so more unevenly. 
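
One rough way to put a number on "unevenly" is to convert each section to seconds per furlong and look at the spread. The sketch below does this for three of the runs listed above. It assumes each line shows four quarter-mile sections followed by a final furlong, and it uses the standard deviation as a crude measure of unevenness of my own choosing; it is not a method set out in the pace series.

```python
from statistics import pstdev

# Assumption: four quarter-mile (2-furlong) sections plus a final furlong.
SECTION_FURLONGS = [2, 2, 2, 2, 1]

# Sectional times in seconds, copied from the list above.
splits = {
    "Animal Kingdom (Spiral)": [24.7, 22.6, 25.7, 26.0, 13.4],
    "Nehro (Ark Derby)": [24.2, 23.8, 24.6, 24.5, 12.4],
    "Brilliant Speed (Blue Grass)": [26.7, 25.0, 24.6, 23.0, 11.6],
}


def pace_profile(sections):
    """Return (final time, spread of seconds-per-furlong across the sections)."""
    per_furlong = [t / f for t, f in zip(sections, SECTION_FURLONGS)]
    return sum(sections), pstdev(per_furlong)


profiles = {name: pace_profile(s) for name, s in splits.items()}
for name, (final_time, unevenness) in profiles.items():
    print(f"{name:30s} time {final_time:6.1f}s  spread {unevenness:.2f}s per furlong")
```

A larger spread means a less even distribution of effort. The measure does not distinguish fast-slow from slow-fast, so the direction still has to be read off the splits themselves.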
At this stage, we need to take a look at the stamina profile of the contenders - mostly according to their pedigree, with a little interpretation of how the individual is turning out. Here is my assessment of each, classified by 'stay', 'not stay' or 'improve':


1 Archarcharch - stay
2 Brilliant Speed - improve
3 Twice The Appeal - not stay
4 Stay Thirsty - improve
5 Decisive Moment - not stay
6 Comma To The Top - not stay
7 Pants On Fire - stay
8 Dialed In - improve
9 Derby Kitten - stay
10 Twinspired - stay
11 Master of Hounds - stay
12 Santiva - not stay
13 Mucho Macho Man - stay
14 Shackleford - not stay
15 Midnight Interlude - improve
16 Animal Kingdom - improve
17 Soldat - stay
19 Nehro - improve
20 Watch Me Go - stay


With none of the runners having achieved either a form rating or speed figure within 7lb of that required on average, it seems likely that this year's race will go to a horse improving for the test - rather than merely handling it.


The shortlist is:
Brilliant Speed - thrashed on two starts on dirt but bred to handle it
Dialed In - vulnerable one-run plodder but will surely go well
Stay Thirsty - sweated freely in first-time blinkers latest, left off
Nehro - carries head high but no denying his promise
Midnight Interlude - unraced as 2yo and has improved a lot in short time
Animal Kingdom - turf-bred but won dirt-style on deep Turfway poly


Verdict:
Dialed In and Nehro are solid contenders whose chance has been captured by the market - and even a little overestimated.
Brilliant Speed and Animal Kingdom are highly intriguing synthetic horses who could go very well, if they can handle dirt.
Midnight Interlude has star quality and could easily prove this year's superstar 3yo colt - if he can overcome lack of seasoning and everything which the race demands.
Stay Thirsty has a ton of stamina in his pedigree, had the Wood Memorial winner behind him two starts ago and is easily the most interesting of those at a massive price.


Good luck to all and please join me from 8pm on Racing UK for the live broadcast. It is an interactive show - as always.

Sunday, 1 May 2011

Frankel dishes the dirt

It might have been run on a strip of England's green and pleasant land, but the 2000 Guineas followed an energy dynamic more in common with races on dirt.

Frankel's brazen display of early pace turned the generally expected slow-fast pattern of turf races on its head. And it thereby put the emphasis on conditioning - a factor usually discussed more with regard to the first colts' Classic in the US, the Kentucky Derby.

The placed horses all came into the Guineas with the edge of fitness. Frankel had maintained his unbeaten record in Newbury's Greenham Stakes; runner-up Dubawi Gold had won twice in 2011 for new trainer Richard Hannon; while Native Khan had landed the Craven Stakes over course and distance, having come to hand earlier than trainer Ed Dunlop expected.

At this early stage of the season, the strong pace took the rest of the field out of their comfort-zone. Neither Pathfork, Casamento, Roderic O'Connor nor Fury had been given a prep race - understandable, perhaps, as this is normally not of overwhelming importance in grass races run at less than full tilt. 

On this occasion, however, the physical demands of the race may have been too much for horses without the necessary foundation.

Of course, the order of merit expressed by the finish of the 2000 Guineas may well persist to the end of the Flat season. But the magnitude of the 11-length gap which separated Native Khan from the rest is likely explained by something other than the equivalent margin in ability. 

Immediately after the race, two conclusions seemed to be drawn on Channel 4 from the outlandish manner of Frankel's victory. I disagree with both.

First, that the winner had a punishing race which will limit his potential. 

Why? I find myself strongly opposed to this view. I think Frankel will get a lot out of the race and move forward off it.

This is a very talented horse who was asked to run outside his envelope for a far shorter period than the horses who finished behind. The inherent stress of this effort cannot be judged by his position relative to the rest of the field.

And, even when tiring, he showed clear signs of expressing some control over his exertion. (I wrote about this principle in my series on Equine Flow Psychology and believe it to be important.)

Second, that we should be more negative about Frankel's chances of winning the Derby. 

Frankel was allowed to run at a tempo suitable to the tactical demands of the Guineas - as conceptualised by his trainer and executed by his jockey.  In the event, jockey Tom Queally was somewhat generous in the freedom he allowed his mount.

Like many trainers, Henry Cecil believes in conditioning his horses to run hard, but he has the intuition to judge their psychological needs far better than the vast majority.

Queally made little attempt to restrain Frankel because Cecil believed that was the right course of action for the colt. And Frankel's single-furlong ability is miles better than the horses who took him on. It is easy to see that the corollary of these two statements would land the horse in a clear early lead - without it being the result of a headstrong tendency which cannot be curbed.

The race might have looked startling to the eye accustomed to watching only races on grass, but it would have been considered a perfectly natural performance on dirt. It would be understood that Frankel's style of running was an attempt to reduce the uncertainty of victory for the best horse in the race.

To my mind, the free-going tendencies which Frankel displayed during his two-year-old season were a response to his being restrained behind horses in tight quarters. And that view is evinced by his extreme reaction to catching a bump in the Dewhurst.

It appeals strongly that Frankel is a horse who wants or needs to ration his own effort and to have his own space - in other words, to stride on. But that does not mean he wants to run off.

Riding tactics are not binary; they are not just push or pull. If Frankel is buried at the back of the pack at Epsom, he would resent it; he would pull hard and may ruin his chance. 

But, if he is allowed to race close to the pace and his speed just a little more rationed, he may settle sufficiently well.

Derby candidates regarded as doubtful stayers are often compromised by the way they are ridden at Epsom. They allow more stamina-laden horses to be kicked on running down Tattenham Hill, while they idle this section of the race away. In other words, they give lengths away cheaply then try to buy them back more expensively in the closing stages.  

Frankel's two-year-old exploits told us that he tends to pull hard when restrained and in behind horses; the Greenham seemed to confirm this impression. But, if the Guineas threw any more light on his chance of winning the Derby, I don't think it was negative - and that's the point I am trying to make.

Frankel is a horse of monstrous talent who was asked - or at least allowed - to express his superiority over the Guineas field in the first part of the race, rather than the last.

When Hawk Wing's jockey Jamie Spencer employed the opposite tactical approach to achieve the same result in 2002, his mount ran on strongly at the end of the race. And, perversely, he was regarded as having run a much better Derby trial - even though he was beaten and Frankel won easily. 

But, the contrasting impressions created at the end of the Guineas by Hawk Wing and Frankel were only a function of riding tactics.

Hawk Wing went on to run a huge race when second at Epsom; he would have won the Derby in most years. He was beaten only by a great middle-distance stayer in High Chaparral.

I would bet a lot of money that Frankel could run at least as fast as Hawk Wing over the Derby course. Perhaps, if he did, he would also be defeated. But he is nothing like the forlorn hope some are portraying.

You, of course, may disagree. And, more to the point, so may Henry Cecil. There is always a doubt sending any horse over a trip 50 per cent further than before.

But the confidence in him to achieve the task could change as the race approaches. Do you really want to bet against easily the best horse at Epsom?