Using Regression to the Mean to Describe Corey Kluber and Cody Allen

In their past two outings, both Corey Kluber and Cody Allen have been less than stellar, especially considering the high stakes. However, this looks more like a matter of perception and regression to the mean than a sign of deteriorating talent.

The concept of regression towards the mean is important in fields far beyond baseball. Wikipedia provides this classic explanation:

A class of students takes two editions of the same test on two successive days. It has frequently been observed that the worst performers on the first day will tend to improve their scores on the second day, and the best performers on the first day will tend to do worse on the second day. The phenomenon occurs because student scores are determined in part by underlying ability and in part by chance.

The last sentence is the critical one to understand. Most measurements of human ability are partly achieved by skill and partly achieved by luck. This means that data cannot always be taken at face value. Since we cannot always be completely confident that we’ve measured what we want to measure, we can apply an expected regression to the mean to get a truer idea of talent. We all do this, whether we mean to or not. The rookie who comes up in September and gets a hit in his first at-bat? The numbers say he’s on pace for a career batting average of 1.000. Does anyone expect said rookie to never make an out in his life? Of course not. Cody Allen gives up a few late-inning home runs during a pennant race; does this raise our expectations that he will continue to struggle? Of course not.
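To see the skill-plus-luck idea in action, here is a minimal simulation sketch of the two-test example quoted above. Everything in it is an assumption made for illustration: the function names, the 1,000 students, the average ability of 70, and the choice to make luck as variable as ability. Each student keeps the same ability on both days but gets a fresh dose of luck each day.

```python
import random
import statistics


def simulate_two_tests(n_students=1000, ability_sd=10.0, luck_sd=10.0, seed=0):
    """Return (day1, day2) scores: fixed ability plus fresh luck each day.

    All parameter values are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    abilities = [rng.gauss(70.0, ability_sd) for _ in range(n_students)]
    day1 = [a + rng.gauss(0.0, luck_sd) for a in abilities]
    day2 = [a + rng.gauss(0.0, luck_sd) for a in abilities]
    return day1, day2


def group_means(day1, day2, fraction=0.1):
    """Average both days' scores for the day-one top and bottom groups."""
    # Rank students by their day-one score only.
    order = sorted(range(len(day1)), key=lambda i: day1[i])
    k = max(1, int(len(day1) * fraction))
    bottom, top = order[:k], order[-k:]
    return {
        "top_day1": statistics.fmean([day1[i] for i in top]),
        "top_day2": statistics.fmean([day2[i] for i in top]),
        "bottom_day1": statistics.fmean([day1[i] for i in bottom]),
        "bottom_day2": statistics.fmean([day2[i] for i in bottom]),
    }


if __name__ == "__main__":
    day1, day2 = simulate_two_tests()
    for name, value in group_means(day1, day2).items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions, the day-one stars should score lower on average on day two, and the day-one strugglers should score higher, even though nobody’s ability changed. The top group should still beat the bottom group on day two, because skill is real; it is only the luck that fails to repeat.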

The interesting question is which mean to apply our expected regression towards. What if our rookie is reckoned by scouts to be an excellent pure hitter? What if he’s a guy who swings from the heels and misses half the time he offers at a pitch? Clearly, we expect different batting averages from the two, and one at-bat isn’t going to influence our expectations either way. We’d regress the first player towards the ‘good-hitter’ population mean, and the second towards the ‘bad-hitter’ population mean. Eventually (given enough at-bats), we simply use their career numbers as the population mean for the player. This is a shortcut rather than an analytically rigorous approach, as some element of randomness always influences career numbers, meaning that barring other information players should always be expected to be slightly more average than they have been historically. It’s not a big effect, however; I merely highlight it to demonstrate the difficulty in choosing the mean towards which we expect a player to regress.
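One common way to put this into practice is to blend a player’s observed numbers with a chosen population mean, giving the population mean a fixed weight measured in at-bats. The sketch below is only an illustration of that idea: the function name, the .280 and .230 group means, and the 200 at-bat weight are all hypothetical values picked for the example, not figures from any study.

```python
def regressed_average(hits, at_bats, population_mean, prior_at_bats):
    """Blend an observed batting average with a population mean.

    The population mean is treated as if it were `prior_at_bats` extra
    at-bats of evidence. Both values are assumptions chosen by the analyst.
    """
    if at_bats < 0 or prior_at_bats <= 0:
        raise ValueError("at_bats must be >= 0 and prior_at_bats must be > 0")
    return (hits + population_mean * prior_at_bats) / (at_bats + prior_at_bats)


if __name__ == "__main__":
    # Hypothetical group means and weight, for illustration only.
    GOOD_HITTER_MEAN = 0.280
    BAD_HITTER_MEAN = 0.230
    PRIOR_AT_BATS = 200

    # The September rookie: one hit in one at-bat.
    print(f"{regressed_average(1, 1, GOOD_HITTER_MEAN, PRIOR_AT_BATS):.3f}")
    print(f"{regressed_average(1, 1, BAD_HITTER_MEAN, PRIOR_AT_BATS):.3f}")

    # A veteran with 5,000 at-bats: his own record dominates the estimate.
    print(f"{regressed_average(1500, 5000, GOOD_HITTER_MEAN, PRIOR_AT_BATS):.3f}")
```

With a single at-bat, the estimate barely moves off whichever group mean we chose, which matches the intuition that one hit tells us almost nothing. With thousands of at-bats, the career record swamps the population mean, though the estimate is still pulled very slightly towards it.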

We all know that numbers are manipulable, and that it’s possible to draw completely ludicrous conclusions from them that simply don’t bear up to even basic common sense. With a good grasp on the theory behind regression towards the mean, one can avoid the pitfalls of putting too much faith in a small sample. However, we remain unsure of what sample size is actually required for a given metric until regression’s close cousin correlation comes into play. Regression also does not protect us from statistical arguments based on irrational theories of value (e.g., over- or undervaluing a specific skill or statistic). The full-season sample shows us that we should have expected some regression from both Kluber and Allen at some point this season; we just weren’t expecting it to come at the same time.

For now, let’s chalk it up to the fact that they have been virtually flawless all season and our expectations are higher than they should be. Will their struggles continue? Only time will tell. The sample size from the struggle is simply too small as of this writing.

Let’s hope that their young teammates can pick them up so that the Tribe can stay in the playoff hunt.

 
