Snakeoil Statistics


Someone is selling snake oil to the Oilers and they are buying.

In a December 31st article over at Cult of Hockey, David Staples interviewed Brad Werenka, a retired NHL player and founder of the company TruPerformance, about the proprietary statistical measure it uses to rate individual player value; the results are then sold to NHL teams.

Staples’ article was primarily about the new metric and how it justifies the Oilers’ addition of Kris Russell. In it, Werenka made several statements defending Russell, including comparing him to Francois Beauchemin and Marc-Edouard Vlasic.

That comparison drew a lot of attention, and while it was a “tell”, it masked some of the other things worth noting in the article.

On the face of it, there is nothing inherently wrong with what TruPerformance is offering – a product they manufacture for sale to any interested customer. However, some questions arise here that demand serious scrutiny.

Now, I don’t have an issue with a company trying to make money by selling analytical services to willing customers. This can be intensive work that requires a very specific skill set and team to perform effectively. Most organizations can’t afford to staff and maintain a group for so specific a purpose, and so in steps the independent contractor to fill the gap. What worries me are the details that surround the company, and the lack of information in some key areas of their end-product.

Let me be clear, I don’t think TruPerformance believes that it is selling poor analysis. I think they are doing sincere work and analyzing game data in good faith and with the best of intentions. The things they are seeing and recording are actually occurring and do have an impact on the game.

However, I am concerned about the relative weight they are applying to those events and their process of then ranking players using a simple number-out-of-100 system.

There’s a logical gap here between what the system actually measures and the results the customer expects. Let me rephrase that in clearer language: the packaging sells something that allows the customer to make some fairly serious assumptions about its effectiveness.


To put it another way, picture a man selling a medicinal tonic. The label has the phrases “effective in most cases” and “famous the world over”. A customer becomes interested and asks if it will help him with his hair loss. The salesman introduces him to a man standing nearby and has that man explain to the customer how he began using the tonic and a few weeks later his hair began to get thicker.

The salesman hasn’t specifically stated that the tonic regrows hair. The label is vague enough to allow anyone to make their own assumptions. And the testimonial comes from a third party individual who only states two circumstances, letting the customer assume a causal link between the two.

Here’s what I mean…

TruPerformance describes hockey situations on their website, offering visual guides and accompanying descriptions of everyday hockey plays. These are the types of small, detailed, individual contributions a player makes to put his team in a better position to win a game. But from there we jump to assigning a number to a player, ranking him against the rest of the league, without being able to see inside the process. Here’s a look at the methodology explanation from their website (click to enlarge or visit the site here).

[Screenshots: TruPerformance methodology pages]

We are meant to draw the conclusion that TruPerformance’s data helps to describe these kinds of hockey plays and identify players who excel (or don’t) at these key plays.

There’s a psychological game going on here that I feel compelled to point out, because the approach isn’t altogether different from one used a very long time ago.

In Athens, and more or less around the greater part of the Greek world during the 5th century BC, through a combination of changes (societal, political, economic) it became an advantage for young men (women weren’t part of the body politic) to receive an education a little more detailed than the old model (literacy, math, music, wrestling). Essentially it became a necessity for them to learn oratorical skills (how to speak and debate) as well as apply critical thinking skills and other intellectual pursuits.

Basically a market appeared almost overnight for instruction in wisdom and the ways of the world and in stepped a varied collection of professional tutors on the art of wisdom and knowledge. They offered to instruct their students in politics, wisdom, justice, and particularly public speaking and the art of convincing conversation. They were called the Sophists, from the Greek sophia, meaning wisdom.

That title was descriptive but later became something of an insult as they were eventually associated with teaching their pupils not wisdom but the art of twisting the truth and language to suit one’s own ends.

They were not, it turns out, necessarily selling what their customers thought they were buying.

At their best, the Sophists taught young men some critical thinking, philosophy, oratorical skills, and passed on a little wisdom.

At their worst, they taught students how to manipulate people through language and twist the truth to convince the populace to agree with them.

TruPerformance isn’t teaching people to lie nor do I think they are intentionally selling untruths. But I do think they are collecting and packaging information that has serious drawbacks to it and leading their customers to believe that it provides a measure of a player’s ability that is not justified by their methods.

Let’s take a look at their assertions from the source article again:

“In Werneka’s [sic] multi-dimensional approach, the magnitude of the play (how it relates to winning game) and the magnitude of a player’s involvement in the play are the focus. According to Werenka’s findings, in every NHL game there are about 1,000 plays in total, with about 400 of them, on average, turning out to be significant. In the TruPerformance system, each play is given a percentile score in relation to the plays magnitude — how significant it is to winning. So a stretch pass that directly leads to a breakaway is given more weight than a rote defence-to-defence pass. Players are also given a score based on how much they contributed to the play. So the player making the stretch pass for the breakaway will be seen as deeply involved in the play but some other player who was merely in position but was not involved in the pass won’t be seen as a significant actor. In the end, all the players [sic] contributions to the attack are weighed against his mistakes on defence and each player as a final percentile score for the game. Further explanation comes from the TruPerformance website: ‘Each play is scored independently, based on the surrounding gameplay variables: time and space, situation, game state, sub-skills, and the degree of advantage created or denied. On average, it takes nine hours to analyze one game.’”

They are specifically laying out what aptitudes they reward and which they don’t. That’s fair and up front. Let’s continue on.

“The kinds of good defensive plays TruPerformance is looking for are a textbook for great defensive play and include, according to the company website: ‘Slides to block a shot in the crease to save a goal; Makes a diving play extending his stick to block a backdoor pass to a wide open player; Uses an active stick in the neutral zone to stop a pass and prevent a breakaway; Slides to break up a 3v1; Executes a stand up check in the neutral zone to prevent a 2v1 from developing; Fronts/blocks a puck at the net front with two opposition forwards at the net; Stands up an opposing player in the neutral zone to force a dump.’”

Let’s stop there for a moment.

Here are the specific pages from the site that describe their valuation of player actions for defensemen both positive and negative, respectively (click to enlarge or visit the site using the link posted above).

[Screenshots: TruPerformance valuation pages for defensemen, positive and negative]

Think about how this is being explained, specifically the outcomes: “…to save a goal, block a backdoor pass to a wide open player, uses active stick in the neutral zone to…prevent a breakaway and so on.” No one in their right mind would argue that these are not good hockey plays because the outcomes all favour the defending team. Value is not being placed on the intent, or even the action, of the player. It is being placed on the outcome, something over which the player has only partial control.

From the perspective of how one phrases a logical argument, this is stacking the deck in favour of the outcome you’re trying to measure.

There is another aspect being overlooked here: culpability for the situation that necessitated such a defensive reaction in the first place – who is responsible for the d-man having to make the play at all? But that is outside the scope of this discussion here today, so we’ll leave it for another time.

Let’s move on to the negative plays description now, again from David Staples’ article.

“As for negative plays by a defenceman? ‘Leaves his player wide open for a pass at the net front in Defensive Zone Coverage; Makes a line change at a bad time which allows a breakaway; Gets beat 1v1 allowing a breakaway; Allows an easy pass at the net front when defending a 2v1; Steps up in the neutral zone, but misses the puck and the body, allowing a 2v1; Allows an opposing forward access to the net front to screen and tip; Allows an opposing player to use time and space to carry and pass the puck.’”

Here’s where things become a little ambiguous.

For instance, in the positive category they include the description of a player who “slides to block a shot in the crease to save a goal”. However, in the negative category we see the following: “Leaves his player wide open for a pass at the net front in Defensive Zone Coverage… [a]llows an easy pass at the net front when defending a 2v1… [a]llows an opposing forward access to the net front to screen and tip”.

What if he slides to block a shot but does not save a goal? What if he slides to block a shot and ends up out of position, turning a 2 on 1 into a 2 on 0? These are reasonable and frequent consequences of the action described, yet they end in counter-productive results. At the core, the way the actions are scored does not necessarily hold true to the value one might put on the result. To put it plainly, you can’t give a guy a 10/10 for leaving his feet to block a shot every single time, because sometimes it isn’t the right play.

All three of these negative scores are also potential outcomes of the initial positive action. When a defenseman slides to block a shot, he sometimes leaves his opponent open for a pass if he mis-times the block or slide, or if the opponent simply waits for him to slide out of position. Again, the grade being assigned is not altogether within the player’s power to achieve. The language used in the description allows the reader to make a very subtle logical jump here, one which doesn’t stand out at first glance but in the end undermines the rational process: the slide from player intent to play outcome, irrespective of the opponent.
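The problem can be made concrete with a toy sketch in Python. The probabilities and grades below are entirely invented – they are not TruPerformance’s numbers – but they show how one identical decision earns wildly different grades under an outcome-based system:

```python
# Entirely hypothetical probabilities and grades -- not TruPerformance's
# actual numbers. One identical decision (sliding to block a shot),
# three plausible outcomes, three very different grades.
outcomes = {
    "blocks the shot, saves a goal":                 {"prob": 0.50, "grade": 10},
    "mistimes the slide, opponent passes around":    {"prob": 0.30, "grade": 2},
    "slides out of the play, 2-on-1 becomes 2-on-0": {"prob": 0.20, "grade": 0},
}

# Grade the decision itself: the probability-weighted average of its outcomes.
expected_grade = sum(o["prob"] * o["grade"] for o in outcomes.values())
print(f"expected grade of the decision: {expected_grade:.1f}")

# An outcome-based system instead records 10, 2 or 0 for the exact same
# decision, depending on factors partly outside the player's control.
```

The decision is the same every time; only the dice roll differs.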

Consider this for a moment: do you suppose that every time McDavid and Draisaitl have had a breakaway that the defenders have not at least attempted to execute the correct plays against them? Is there not a chance that a defender could do everything correctly, and still be scored upon because the opposition player made a better offensive play than there were defensive options for the defender (shy of taking a penalty, perhaps)?

Here’s my point in brief: TruPerformance is applying arbitrary values to plays whose outcomes are not necessarily determined by the player being evaluated, and then wrapping that analysis in language vague enough that the reader is led into a false or mistaken evaluation.

*David Staples has run his own analysis project in the past using a similar system of observing contributions by a player towards a scoring chance and penalizing those who contribute to scoring chances against. It is a system that is, at least nominally, based on that pioneered by Roger Neilson in the 1970s. The TruPerformance process has a similar methodology to that of Staples’ project in so far as it is based on visual observation and the application of relatively subjective values to player actions both positive and negative, and in that regard it is understandable why he would be drawn to TruPerformance’s outcomes as an analytical tool.

The final portion of the article takes aim at the Corsi metric. In particular, Staples and Werenka discuss the predictive consistency of using Corsi. Werenka says that the r value (the correlation coefficient – a measure of how closely the data points on a graph cluster around a trend line) for Corsi is somewhere between 0.33 and 0.4.

Werenka says that the TruPerformance system has an r value of 0.81, meaning that the data it collects conforms to a single line with fairly small variance.
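For reference, assuming the article’s “r rating” refers to the standard Pearson correlation coefficient, it is a straightforward calculation; here is a minimal sketch (the paired numbers are invented for illustration, not real NHL data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented numbers: some per-team "score" paired with wins.
scores = [55, 48, 60, 42, 51, 58]
wins   = [30, 22, 33, 25, 24, 31]
r = pearson_r(scores, wins)
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")
```

With these invented numbers r comes out around 0.83. Squaring r gives the share of variance “explained”: an r of 0.81 implies roughly 66%, while Corsi’s reported 0.33 to 0.4 corresponds to just 11 to 16%. That is the scale of the claim being made.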

Both Staples and Werenka use the term “correlation to winning” in describing the predictive power of Corsi or the TruPerformance metric.

I’d like to stop there for a moment to concentrate on the choice of words here.

Corsi is not associated directly with winning, rather it is used as a proxy for possession (because you can’t shoot the puck if you don’t have it, and if you have the puck it means your opponents don’t, and shy of a Steve Smith moment the opposition cannot score if the puck is on your stick).

Possession of the puck is a significant step in the process of winning, just as Gretzky said you miss 100% of the shots you don’t take, but there are about three steps of reasoning between “Corsi” and “winning”.

The statements of both Staples and Werenka conflate the two more closely than is advocated by those more familiar with the Corsi number.

Back to that r value though…

Remember, I’m only talking about how closely the dots on a chart sit compared to a line, but 0.81 is quite a phenomenal number to arrive at.

Remember my example of the miracle tonic label claims? This sort of thing should raise alarm bells as there are very few statistical models that study human behaviour in groups that can boast that sort of predictive power.

An article by SportingCharts back in October of 2015 examined the many statistical tools used in the analytics community at the time and evaluated their predictive power by r value. It looked at goal differential, hits, special teams, Corsi/Fenwick, save %, penalties, shooting %, offensive zone starts and faceoff wins. The highest ranked were Fenwick at 0.252, goal differential at 0.245 and Corsi For at 0.198.

All of those stats are ones that we can examine, test and compare because they are public and available.

Physicists aim for something greater than 0.9 when they collect data, but they are working at a level of refinement that is light years beyond human visual observation and have technical equipment that is designed to be sensitive enough to measure the movement of a single photon.

For the sake of clarity, here is what those values look like. The first is a correlation (r) value of 0.99, which is the correlation that physical scientists would insist upon when measuring specific, quantified matter.

[Scatter plot: r = 0.99]

Here’s a correlation value of 0.28, or about where most analytic tools in the sporting world weigh in.

[Scatter plot: r = 0.28]

Here’s a correlation value of 0.89, which is just a shade higher than what Werenka says TruPerformance’s methods provide.

[Scatter plot: r = 0.89]

While I would gladly welcome a system that could bring this kind of predictability to the chaotic nature of hockey, this sort of result just is not seen in any analytical models anywhere in the business. If it’s true, then congratulations, you’ve just punched your ticket to early retirement because you can predict sporting outcomes with an unheard of level of consistency. But how do we know if we are being given accurate measurements?

How does someone working in sports analytics arrive at a correlation value of 0.81? Or to rephrase that, how does someone come up with a predictive model that beats out the rest of the analytics field by an order of magnitude that flirts with levels of confidence found in hard science?


What this looks like is a problem referred to in analysis as “overfitting”, wherein someone designs a model using a certain range of data, then tests it on that same data to prove that the model works, and from there argues that the process is logically sound.

 

It isn’t.

You just designed it well enough so that it delivers your desired outcome. There’s a difference. Good craftsmanship in designing a process does not acquit you of using it for false pretences.

If you set out looking for something, you are liable to find it – like wondering if you’re popular, so you ask your mom.

(Here is an excellent video that quickly and easily explains what overfitting is – starts at about 0:48. I highly recommend giving it the few minutes it takes to watch, as you could apply it to any other charts you see and use it to help figure out when you’re being led along.)
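The effect is easy to reproduce with a few lines of synthetic data (invented for illustration, and not TruPerformance’s model): fit a wildly flexible model to a handful of noisy points, score it on those same points, and it looks phenomenal; score it on fresh data from the same process and it falls apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "truth": a weak linear signal buried in a lot of noise --
# roughly the situation for any single stat versus winning.
x_train = rng.uniform(0, 10, 15)
y_train = 0.3 * x_train + rng.normal(0, 2, 15)
x_test = rng.uniform(0, 10, 200)
y_test = 0.3 * x_test + rng.normal(0, 2, 200)

def r_squared(y, y_hat):
    """Share of variance explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# A sane model (degree 1) versus a wildly flexible one (degree 12).
for degree in (1, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    in_r2 = r_squared(y_train, np.polyval(coeffs, x_train))
    out_r2 = r_squared(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree:2d}: in-sample R^2 = {in_r2:.2f}, "
          f"out-of-sample R^2 = {out_r2:.2f}")
```

The flexible model “wins” only on the data it was built from – which is exactly the trap described above.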

This is why the examples TruPerformance points out are generally the kinds of plays you see on a typical hockey panel, because they’ve (perhaps unintentionally) rigged the system to effectively be self-perpetuating. Have a look at their list of top ten players at a variety of positions and you’ll see a list that you and six of your friends could probably come up with in 10 minutes.

Truth be told, we have no way of confirming TruPerformance’s claim about its extraordinarily high correlation value. That’s the proprietary privilege. It also raises some questions.

Let’s put it another way. If you were shopping for a new car and the salesman told you that a particular model got 400 miles to the gallon, you’d probably want to look into those claims before buying.


In the second-to-last paragraph, Staples and Werenka take the opportunity to use the 2013-2014 Edmonton Oilers’ adoption of analytical tools, including Corsi, as evidence that it is a fundamentally flawed statistic and a wrong-headed approach to analyzing the game.

I’m not going to debate the relative merits of Corsi today. If you want to read my thoughts on that, you can do so here. Suffice to say, Corsi is an improvement on Shots on Goal and stands within a kind of evolutionary tree that will see further, gradual developments as we continue to work on the idea of puck possession as one of many components used to identify successful traits.

What I want to discuss is the thought process behind this one paragraph as it relates to the theme of this article: logical argument and the skewed framing of a subject; in other words, sophistry.

Here is Staples’ statement, followed by Werenka’s comments:

“It didn’t work when the 2013-14 Edmonton Oilers embraced Corsi, and tried to create a system to get a better shots-at-net differential. ‘You had a front row seat there when (head coach Dallas) Eakins was trying to recreate Corsi,’ Werenka says. ‘It’s such an incomplete stat. Sure, you can create Corsi if you understand exactly how Corsi is done, but if you’re trying to increase it just by taking snapshots from outside at blueline, that’s not what Corsi is. Corsi is a very vague concept.’”

Okay. Let’s take this one apart to see if we can really get at everything being suggested here.

I’ll start with a look at how Corsi is arrived at.

Corsi is shots at the net. That’s all. Shots at the net. Not just those that hit the net – that’s the old shots on goal – but all shots directed at the net, including blocked shots. It’s usually represented as a percentage, because that way you get an idea of which team had the greater share of all the shots taken during the game. You don’t “recreate” Corsi. It’s a stat, like counting the number of apples in a bucket versus the number of oranges. It gives you a number. Then you turn that number into a percentage. 6 oranges and 4 apples in a bucket? You’ve got 60% oranges and 40% apples.

You don’t recreate that. You just count it. This isn’t some ethereal concept that is measured with arcane wisdom and crystal balls like “compete” or “effort”.
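The whole calculation fits in a few lines; here it is as a toy sketch (the team names and shot counts are invented, purely for illustration):

```python
# A toy Corsi tally -- the shot counts below are invented for illustration.
# Corsi counts every shot attempt: on goal, missed, or blocked.
attempts = {
    "EDM": {"on_goal": 28, "missed": 11, "blocked": 14},
    "CGY": {"on_goal": 24, "missed": 9, "blocked": 10},
}

# Corsi For is just the sum of all attempt types for each team.
corsi_for = {team: sum(kinds.values()) for team, kinds in attempts.items()}
total_attempts = sum(corsi_for.values())

# CF% is each team's share of all attempts in the game.
for team, cf in corsi_for.items():
    print(f"{team}: CF = {cf}, CF% = {100 * cf / total_attempts:.1f}%")
```

Counting and dividing. That’s the entire “concept”.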

Try this: count your fingers. Okay, now ask your friend to count his or her fingers. Absent a tragic industrial arts class accident, they’ll arrive at the same number. You’ve just “re-created” the number. And so long as your friend isn’t a completely mind-numbing idiot, he or she likely did it without too much trouble.

“Sure, you can create Corsi if you understand exactly how Corsi is done…”.

Do you know how to “create Corsi”? You take a shot at the net.

But maybe he was talking about “creating Corsi” as a beneficial statistical effect, like a shooting percentage?

You know how to do that? Take a lot of shots at the net. More shots on net than the other guys, to be exact.

What Werenka has done here is build on that conflated idea mentioned above – that Corsi = winning – and then confuse this misunderstanding with the idea that “re-creating Corsi” somehow translates to “re-creating winning”. We’re one step away from a Charlie Sheen hashtag rant at this stage.

Corsi is a measurement of a quantifiable event, like inches or days of the week. There appears to be a core, fundamental misunderstanding (or at the very least a disconcerting miscommunication) on Werenka’s part with regard to his concept of Corsi, and possession metrics in general.

To re-iterate: Corsi is counting shots at the net. Even Sportsnet does this now: when they show the period-by-period stats during the intermission, they include shots, hits, faceoff percentages and…shot attempts. That’s right – shot attempts are more or less the same thing as Corsi.

Does that sound like a particularly hard concept? Does counting shots at the net, whether they hit the net or not, sound like something that you aren’t able to do on your own? Because Werenka certainly seems to imply that it is.

“…but if you’re trying to increase it just by taking snapshots from outside the blueline, that’s not what Corsi is.”

It is.

I’ll say this again. Corsi is the count of shot attempts at the net.

 

From the blueline, the crease – heck, shoot from the players’ bench or the pressbox if you can. I don’t care. They count them all.

Doesn’t that support his argument that it is a vague stat? Yes, it does. Nobody who uses Corsi would tell you that it isn’t. Does this mean it is a useless stat that should be thrown away?

Let’s head back to our metaphor emporium for a moment.

Imagine you are trying to build a fence and all you are given is a hammer, nails and a tape measure. When it comes time to cut the boards you haven’t got the right tools to do it, but do you throw away the hammer because it isn’t sawing the boards the way you’d like? No, because it is still useful, just not for that task. You go buy the saw and add it to your assortment of tools.

Stats are tools. Use them well and appropriately, everything’s good. Use them incorrectly and you’ll be disappointed.

If you use Corsi data to tell you which team likely had the puck on their stick more often during a game, and then extend that across large portions of a season, it will tell you something about that team. If you use it to tell you how much a player is going to score on a per game basis, you will be disappointed. But don’t blame the statistic.

A note on Werenka’s first assertion, the statement that folks in Edmonton had a front-row seat to the failings of Corsi because of the 2013-2014 season under Dallas Eakins.

Let’s go back to the fruit in the bucket metaphor.

You’ve counted your fruit in that bucket and arrived at a conclusion. Let’s imagine that there are 29 other buckets alongside the one you counted.

Would you assume that what you learned from examining the first bucket would apply to the other 29?

Imagine that 16 of those buckets were then loaded onto a truck while the remaining 14, yours included, were taken out back to a dumpster.

Are you still confident that what you learned from your one bucket applies to the rest?

Now imagine that five of those 14 buckets were singled out, and the contents dumped into a barrel and lit on fire. Yours included.

Now how do you feel about your assumptions based on that one bucket?

I’d guess that by this point you’d have some very serious concerns about what was in those other 29 buckets compared to what you’d found in yours.

Those are the consequences of making an assumption based on a very small sample size. In this case, basing an opinion on a statistical measure because of the results of a single team over a single season.
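The bucket problem is easy to quantify with a quick simulation (purely synthetic, for illustration): a perfectly average team, with a true 50% chance of winning every game, still posts wildly different records from one 82-game season to the next.

```python
import random

random.seed(42)

# A perfectly average team: true 50% chance of winning any given game.
# Simulate 10,000 independent 82-game seasons and look at the spread.
seasons = [sum(random.random() < 0.5 for _ in range(82)) for _ in range(10_000)]

print("true talent: 41 wins per 82 games")
print(f"best simulated season:  {max(seasons)} wins")
print(f"worst simulated season: {min(seasons)} wins")
```

If you judged this team by its single worst season, you’d call it terrible; by its best, a contender. It is the same team every time – which is exactly why one team-season tells you very little about a statistic.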

Werenka is doing just that, but by making it a local connection – “you saw with your own eyes there in Edmonton what happens when you believe in this foolish statistic” – he’s leading you to discard any logical perspectives and “trust your gut” in so far as it reinforces his assertion.

It must be said at this point that any argument that aims to discredit a measurable outcome by focusing on a small sample size – coming, no less, from someone who purports to run a statistical model of their own – should be cause for alarm.

In the end, Werenka’s claims can almost all be questioned, if not dismissed, by observing the statements and measuring them against the larger weight of evidence to the contrary. If we apply a little bit of critical thinking to his assertions, they fall to dust ever more readily.

All this without getting a glimpse into his proprietary statistical model.

So if we can’t see how the math works, and everything else about it doesn’t jibe with what we know about the larger study of statistics and the most basic definitions of the terms he chooses to use, then I think we can safely say that something in the TruPerformance approach doesn’t add up.

 
