2015 NFL Combine: Value of Athleticism Metrics And How We Should Use Them

thorntonsolitaire

With the NFL combine coming up for potential draftees, the inevitable debates on the relevance of the combine crop up, with arguments about who did well with what scores and why workout warriors like Mike Mamula or Yamon Figurs amounted to nothing in the NFL while slow players like Jerry Rice and Larry Fitzgerald have flourished.

That’s all true; no amount of combine production can replace intelligent evaluation of how a player plays a game. But leaning on that crutch leads to faulty thinking, replacing the rule with the exception.

By: Arif Hasan

It should come as no surprise that, in general, more athletic football players play football better. Unless there is evidence of a clear, zero-sum tradeoff between athleticism and football intelligence or football technique, that should be intuitive.

To that end, I’ve been engaging in a massive project to determine to what degree combine scores can predict football performance. Unsurprisingly, the most basic statistical tests already show that good athletes play better.

Even a glance confirms it: there have been 922 non-quarterback, non-specialist selections for the Pro Bowl since 2005. The top 0.1% of combine performers since 1999 have made 5% of the Pro Bowl selections and 6% of the first-team All-Pro selections in that time (in other words, 50 to 60 times their share of the player pool, or 5,000 to 6,000 percent of what you would expect), despite the fact that many of those in the top 0.1% didn’t enter the NFL until after (or well after) 2005, including J.J. Watt, Jimmy Graham, Von Miller, Aaron Donald and Dontari Poe.
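The arithmetic behind that overrepresentation figure is just one share divided by another. A minimal sketch follows; the function name is mine, and it assumes the honors and the player pool are measured over the same group, which the numbers above only approximate.

```python
def representation_ratio(share_of_honors: float, share_of_players: float) -> float:
    """How many times more often a group appears than its size would suggest."""
    return share_of_honors / share_of_players

# Shares quoted in the text: the top 0.1% of combine performers
# hold 5% of Pro Bowl and 6% of first-team All-Pro selections.
pro_bowl = representation_ratio(0.05, 0.001)
all_pro = representation_ratio(0.06, 0.001)

print(f"Pro Bowl: {pro_bowl:.0f}x expected ({pro_bowl * 100:.0f} percent)")
print(f"All-Pro:  {all_pro:.0f}x expected ({all_pro * 100:.0f} percent)")
```

A ratio of 1 would mean the group makes exactly as many Pro Bowls as its size predicts.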

To me, that makes the project well worth pursuing.

Off the bat, there are some issues. In order to even begin a study, one must come to a conclusion on how to measure performance. While perhaps easy for quarterbacks or running backs, there’s not a clear metric of evaluation for offensive linemen. Even where there’s some data (edge rushers have sacks for example), it’s woefully incomplete—after all, sacks are a poor way to measure a player’s true ability.

It’s not a problem that has been adequately solved, but for an initial look, Pro-Football-Reference’s Approximate Value metric does a good job. It’s explained extremely well here, and it essentially finds all the relevant data about a position (for a quarterback, adjusted net yards per attempt, total games started, total attempts, Pro Bowl invites, MVP votes and so on) and then adjusts that position for the unit’s performance (a quarterback with a Pro Bowl invite and a high adjusted net yards per attempt will suffer a penalty if the offense isn’t good). The resulting number can then be used to compare across positions.

A specific player with an Approximate Value (AV) of 14 in a particular year may have been better than a player with an AV of 16 in the same year, but much more often than not, the group of 16 AV players will have been better than the group of 14 AV players.

With a large enough dataset, the specifics shouldn’t be an enormous issue.

An archive of 15,000 players who have attended the combine or held a pro day since 1999 should be sufficient, though it is perhaps better to restrict it to the 4,500 players who have been invited to the combine since then in order to really compare apples to apples.

It wouldn’t be useful to include a marginal player who blew up a super regional combine; he would only add noise. We’re not figuring out whether a generic athlete can magically be good at football, but whether a player who is good enough to be invited to the combine improves his prospects with a good combine performance. Essentially, among players good enough to be considered draftable, do those with the same pre-combine grade see it change radically after the combine? How much?

There are a few issues at stake. First, using combine data can be tricky. 40-yard dash times have become the most standardized measurement at the combine, but somehow remain the least consistent. Any fan with the NFL Network or access to NFL.com’s video database can themselves time many of the 40-yard dashes shown live and come up with different numbers than the “official” number used at the combine.

The official number doesn’t mean much, as teams don’t use it; they have their own team of scouts with their own stopwatches. As Chris Kouffman points out, those times are actually more reliable than the times issued by the NFL, in part because of how the NFL overstandardizes the starting timer (which is hand-timed). That explains why Taylor Mays beat Trindon Holliday on a simulcam of their 40-yard dashes, but had an official time that was .19 seconds slower.

In order to create a standard unit of measurement, I’m discarding the “official” times and measurements, instead using the data provided to CBS through NFLDraftScout.com.

Second, we have to figure out exactly what we’re trying to prove.

There are three ways to analyze combine data. The first is to take a look at which players did well and which did poorly in the NFL, and run their combine scores against how well they did at the next level. From that, we can start assigning weights to the different combine scores and see which ones are the most important. This is something that teams often do despite the reputation the NFL has for being anti-analytics, and it is very useful for team decision-makers.

On the other hand, it mostly produces trivia for NFL fans, because the scores aren’t useful by themselves. Again, nothing correlates better with Approximate Value than a player’s draft pick, and their draft pick is not merely the result of the combine. Without knowing where a player already grades, we can’t tell how far the combine may have moved that grade.

It’s still fun to know who the “best” performers at the combine were and it’s not completely useless: the winners of the combine have the best scores and the losers have the worst scores, and we know that will impact their player grade.

The second is even more trivial, but allows us to answer fun questions like “who was the best athlete there?”

Because we intuitively know that running a 4.43 40-yard dash is more impressive at 250 pounds than it is at 180 pounds, we can reward players who move more mass better. That means building, for each position, a formula that takes weight and height as inputs for each combine score and produces an expected outcome.

So, a 6’3” defensive tackle who weighs 306 pounds would be expected to put together the following numbers:

  • 96-second 40-yard dash
  • 73-second 10-yard split
  • 1 bench press reps of 225 pounds
  • 4” vertical leap
  • 8’10.2” broad jump
  • 59-second short shuttle
  • 39-second three-cone

Different positions, weights and heights would have different expected values for those measurements. As a result, we can measure the deviation from the expected value (measured by a statistical tool called Z-Score), and average the differences to create an “athleticism score”.
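As a rough sketch of how such an athleticism score could be assembled: each drill’s deviation from its expected value is divided by a typical spread to get a Z-Score, timed drills are sign-flipped so that faster counts as positive, and the results are averaged. Every number below (expected values, spreads, the sample player) is a hypothetical placeholder, not a figure from the dataset, and the real expected values would come from the height-and-weight formulas described above.

```python
from statistics import mean

# Drills where a smaller number is the better result (times, in seconds).
LOWER_IS_BETTER = {"forty", "ten_split", "shuttle", "three_cone"}

def drill_z(drill: str, measured: float, expected: float, spread: float) -> float:
    """Z-Score for one drill, oriented so that positive always means better."""
    z = (measured - expected) / spread
    return -z if drill in LOWER_IS_BETTER else z

def athleticism_score(measured: dict, expected: dict, spread: dict) -> float:
    """Average the per-drill Z-Scores; drills the player skipped are ignored."""
    return mean(
        drill_z(d, measured[d], expected[d], spread[d]) for d in measured
    )

# Hypothetical expectations for one position/height/weight combination.
expected = {"forty": 5.00, "vertical": 30.0, "bench": 28.0}
spread = {"forty": 0.10, "vertical": 3.0, "bench": 5.0}

# Hypothetical player: fast 40, good vertical, weak bench.
player = {"forty": 4.90, "vertical": 33.0, "bench": 23.0}

score = athleticism_score(player, expected, spread)
print(f"athleticism score: {score:+.2f}")
```

Averaging only the drills a player actually ran is one possible way to handle skipped events; penalizing a skipped drill would be a different, equally defensible choice.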

As an example, DaJohn Harris, Dusty Dvoracek and Brodrick Bunkley were all 6’3” DTs who weighed 306 pounds at the combine. Harris ran a faster 40-yard dash (4.97 seconds to the 5.01 from Dvoracek and Bunkley) but had a slower 10-yard split (1.77 to 1.75 and 1.71), fewer bench reps (28 to 44 and 34), a shorter vertical (29.5 inches to 34 and 32.5) and underperformed in his agility drills, declining to do the bench press.

As a result, Harris’ positive 40-time was dragged down by his other negative measurements, and he ended with a “negative” athleticism score (where 0 is the average). Bunkley ended up with a positive score (1.21, meaning he was that many standard deviations above average, or in the 88.7th percentile) and Dvoracek did too, but not by as much (0.61, or the 72.9th percentile).
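The percentiles quoted for Bunkley and Dvoracek are what a standard normal distribution gives for those scores. A short check follows, assuming athleticism scores are roughly normally distributed (the function name is mine).

```python
from statistics import NormalDist

def z_to_percentile(z: float) -> float:
    """Percentile rank of a Z-Score under a standard normal distribution."""
    return 100.0 * NormalDist().cdf(z)

bunkley = z_to_percentile(1.21)
dvoracek = z_to_percentile(0.61)

print(f"Bunkley:  {bunkley:.1f}")   # 88.7
print(f"Dvoracek: {dvoracek:.1f}")  # 72.9
```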

The final way to look at it, and one that may provide the most relevant information to fans, is to figure out which scores are well-accounted for by the NFL and which are not. In short, measure who underperformed or overperformed according to their draft slot and run the numbers of those players to see if the NFL overemphasizes some traits and underemphasizes others—basically identifying potential “steals” at their position (or conversely, those overvalued).

Some players may end up being excellent combine performers in a way the NFL already knows how to incorporate, while others may win in ways the NFL does not traditionally account for. As an example, Jeremiah Attaochu’s scores were not very good for projecting an edge rusher in the draft, but he was bad at things the NFL does a good job of accounting for and good at things it accounts for less well: his 40-yard dash time was very good for a player of his size, but his three-cone was not.

We’ll work with all three kinds of measurements as the combine continues to produce not just lists of the best athletes or biggest winners at the combine, but where they rank historically and which players they athletically compare to best.

For example, Brian Orakpo is extremely similar, athletically, to Cornelius Washington. Within an inch and two pounds of one another, both have obscene explosion scores (Orakpo’s vertical of 39.5 inches to Washington’s 39 inches, Orakpo’s broad jump of 10’10” to Washington’s 10’8”), good speed scores (Orakpo ran a 4.63-second 40 with a 1.56-second 10-yard split, compared to Washington’s 4.56 and 1.60-second efforts) and below average agility scores (Orakpo’s 4.45 shuttle and 7.26 three-cone combine for 0.04 seconds slower than expectation, while Washington’s 4.74- and 7.47-second times were even worse).

There have long been rules about how to draft players relative to their combine scores. Pat Kirwan introduced fans to the Explosion Number for defensive linemen, which adds bench press reps, vertical jump (in inches) and broad jump (in feet) into one number. Kirwan claimed a number of teams used this as a proxy for explosion overall, with 70 or higher being a pass and anything below a failure.
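The Explosion Number as described is simple enough to write out. In this sketch the broad jump is passed in inches and converted to feet, which is my own convenience, and both sample prospects are hypothetical.

```python
PASS_MARK = 70.0  # threshold quoted above: 70 or higher is a pass

def explosion_number(bench_reps: int, vertical_in: float, broad_jump_in: float) -> float:
    """Bench reps + vertical jump (inches) + broad jump (feet)."""
    return bench_reps + vertical_in + broad_jump_in / 12.0

def passes(score: float) -> bool:
    return score >= PASS_MARK

# Hypothetical prospects, not real combine results.
a = explosion_number(bench_reps=30, vertical_in=35.0, broad_jump_in=114)   # 9'6" jump
b = explosion_number(bench_reps=22, vertical_in=30.5, broad_jump_in=108)   # 9'0" jump

print(a, passes(a))  # 74.5 True
print(b, passes(b))  # 61.5 False
```

Note how the units make bench press and vertical dominate: a full extra foot of broad jump is worth the same as one bench rep.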

It’s not crazy, either. There are notable hits and misses with the metric (like there is for anything), but even a deep statistical look at a relatively simple and unscientific-sounding number produces something that correlates reasonably well with success, and does so over a broad span of time with a lot of data to back it up.

NFL teams also use these numbers smartly—they don’t organize the board solely due to these numbers or charts. Mike Kudla had the highest explosion score of any defensive edge rusher since 2004, and he went undrafted. We’ll never know if it was a mistake to leave him hanging—he suffered a hamstring injury while in camp with the Steelers and never had a great opportunity to turn his fantastic measurables and excellent college production into a real NFL try—but for the most part, draft slot remains one of the single-best predictors of NFL performance.

No metric is perfect and that’s fine: Vernon Gholston would also be a “miss” for the metric, as would Ryan LaCasse. But the second-highest explosion score belongs to Mario Williams, and both Justin Houston and Cameron Wake weren’t far behind. The lowest scores for highly drafted players belong to Jarvis Moss and Robert Ayers, neither of whom has lived up to his draft slot.

Kirwan goes on to say that many teams also use a “points” system for meeting certain criteria, giving a player 10 out of 10 points for hitting marks like height, length, 40-yard dash time and so on and giving deductions for underperformance in each category.

The CBS analyst and former NFL coach and director of player administration was writing from his experience at the time, and though he has dozens of contacts in the NFL, there’s a good chance many teams have adopted more advanced combine tracking systems that provide additional sophistication, if only because Take Your Eye Off the Ball is five years old and the introduction of analytics departments and the explosion of data-based analysis in football have moved the field forward faster than ever.

We’ll weight-adjust scores, combine them in traditional and non-traditional ways, and see if linear formulas predict success better than filter-formulas (i.e. “do you pass a certain threshold? Yes or no?”). We’re adopting different tools for different positions, and therefore have to be careful about identifying a player’s role. We don’t want to compare DeAndre Levy to Justin Houston, even though both are linebackers. Edge players will be evaluated separately from players that play off the line of scrimmage as linebackers, and those two will be distinct from those who play on the interior of the defensive line, from 5-technique to 0-technique.
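To make the distinction between linear formulas and filter-formulas concrete, here is a minimal sketch of both. The weights, thresholds and player are hypothetical; finding the real ones is the point of the project.

```python
def linear_score(z_scores: dict, weights: dict) -> float:
    """Linear formula: a weighted sum, so strength in one drill can offset another."""
    return sum(weights[drill] * z for drill, z in z_scores.items())

def passes_filter(measurements: dict, minimums: dict) -> bool:
    """Filter formula: every threshold must be met, with no offsetting."""
    return all(measurements[drill] >= floor for drill, floor in minimums.items())

# Hypothetical weights (on Z-Scores) and thresholds (on raw measurements).
weights = {"forty": 0.5, "vertical": 0.3, "bench": 0.2}
minimums = {"vertical": 32.0, "bench": 24}

# Hypothetical player: very fast, average jumper, light on bench reps.
player_z = {"forty": 1.0, "vertical": 0.0, "bench": -1.0}
player_raw = {"vertical": 33.0, "bench": 20}

print(f"linear score: {linear_score(player_z, weights):+.2f}")
print(f"passes filter: {passes_filter(player_raw, minimums)}")
```

This player grades out above average on the linear formula but fails the filter on bench press alone. Which kind of rule predicts NFL success better is exactly the comparison described above.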

Because receivers move around much more than linebackers or edge rushers do, they won’t be evaluated separately. Tight ends still match up against linebackers and safeties instead of cornerbacks despite many of them having almost exclusive receiving duties, so they won’t be split up either.

Along the way, there are some interesting findings—arm length and bench press reps aren’t as inversely related as many people think (though that’s not to say the extremes, like 29-inch arms or 37-inch arms, don’t matter). It makes sense to adjust some statistics for weight (broad jump does not provide a reasonable measure of explosion without taking into account the player’s weight) and some statistics provide power in their raw form (both weight-adjusted vertical leap and raw vertical leap are explanatory).

There are ways to combine these with other metrics that look at college production, which can be predictive (though not entirely so) for a lot of positions, including edge rusher and receiver. That will be an interesting project for later, but for now, expect a combine week with updated player athleticism rankings and which scores matter for what positions.
