Note: For those of you who are unfamiliar with regression analysis, or for whom it’s been a while, this website explains it as follows:

“Regression analysis is most often used for prediction. The goal in regression analysis is to create a mathematical model that can be used to predict the values of a dependent variable based upon the values of an independent variable. In other words, we use the model to predict the value of Y when we know the value of X.”

A couple of weeks ago Andrew began a series of articles about expectations for Chris Johnson going forward.  The articles got me thinking, so I decided to run a regression using the data he posted regarding the NFL rushing title winners dating back to 1978.

I posted the regression results in the comments section of Andrew’s second C.J. article.  Since then, I’ve given a bit of thought to the best way to specify the model.  For this article, I actually specified two models.  The first looks at yards in each of the five subsequent years as a percent of yards gained in the rushing title year.  The second model simply looks at yards per year for the five years following a rushing title.  In each model, I controlled for the following variables as of the year that the rushing title was won:

Years in the NFL (as a quadratic)

As expected, the coefficient on years in the NFL was negative for every year in both models.  Also as expected, the coefficient on the quadratic term was universally positive.  Taken together, these mean that the drop off in production is more pronounced for players who win rushing titles later in their careers than for those who win them early, though each additional year of experience adds a smaller penalty than the one before.  The coefficients were statistically significant at the five percent level in both models in the fourth and fifth years following rushing titles.
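To make the role of the squared term concrete, here is a small sketch of how a negative linear coefficient and a positive quadratic coefficient combine.  The two coefficient values are made up purely for illustration (they are not the fitted values from either model), and the function names are hypothetical.

```python
# Hypothetical coefficients for illustration only; NOT the fitted values.
B_YEARS = -12.0    # linear term on years in the NFL (negative)
B_YEARS_SQ = 0.5   # squared term on years in the NFL (positive)


def combined_effect(years):
    """Total contribution of the two experience terms to the prediction."""
    return B_YEARS * years + B_YEARS_SQ * years ** 2


def marginal_effect(years):
    """Change in the prediction per additional year (the derivative)."""
    return B_YEARS + 2 * B_YEARS_SQ * years


for years in (2, 4, 6, 8):
    print(years, combined_effect(years), marginal_effect(years))
```

With these made-up numbers the combined effect keeps getting more negative as experience rises, while the per-year penalty shrinks.  That is the general shape a negative linear term and a positive squared term produce until the curve bottoms out.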

Career Attempts (as a quadratic)

My initial thought was that the coefficients on both attempts and attempts squared would be negative.  The coefficient on career attempts, however, was positive.  It seems counterintuitive as a general rule that yards per year would increase as career attempts increase, but this seems to be an artifact of this particular data set.  George Rogers, Freeman McNeil, Marcus Allen, Christian Okoye, and Terrell Davis all saw their production drop dramatically following their respective rushing titles.  Edgerrin James would twice again rush for over 1500 yards, but his totals immediately following his back-to-back rushing titles were lower, in part due to injuries.  Ricky Williams’s year-to-year totals were erratic due to both suspensions and injuries.  These seven players won eight rushing titles in their first four years in the league, when their career attempts were still low, and didn’t maintain their level of productivity.

Career Yards/Att

This was calculated through the year in which the respective rushing title was won.  I thought that the sign on this variable might be positive (more yds/att = faster/quicker & less pounding).  On the other hand, it could be negative: more yds/att might mean smaller, quicker backs who can’t take the hits over time.  In both models, the coefficient is positive in each of the first three years and negative in the last two, so the variable didn’t turn out to be very informative.

Yards

The coefficient in the regression of yards as a percent of rushing title yards is negative, as expected.  I wasn’t sure what to expect in the second model, but the coefficient was similarly negative and consistently statistically significant.  It appears that very high rushing totals do have a non-negligible impact not only on relative future performance, but on absolute performance as well.

I also included controls for the strike-shortened 1982 season and for retirement.  Below is a synopsis of my results.

Model 1: Yards as Pct of Title Yards
Year T + 1  Year T + 2  Year T + 3  Year T + 4  Year T + 5
Years ** **
Years² ** **
Career Att * **
Career Att² * **
Career Yds/Att ** *
Yards ** ** ** ** **
Strike Season ** ** ** ** NA
Retired NA * ** **
Constant ** **
C.J. Proj. Pct 66.2 76.2 47.0 56.9 33.6
C.J. Proj Yds 1327 1528 944 1142 674
** statistically significant at 5% level
* statistically significant at 10% level
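As a quick check on how the projected percentages in Model 1 relate to the projected yards, the sketch below multiplies each percentage by a title-year total.  The base of 2,006 yards is an assumption here: it is Chris Johnson’s 2009 rushing total, which is not stated in this article.  The helper name is hypothetical.

```python
# Assumption: Chris Johnson's 2009 rushing total, not stated in the article.
TITLE_YEAR_YARDS = 2006

# Projected percentages for years T+1 through T+5, copied from the Model 1 synopsis.
projected_pct = [66.2, 76.2, 47.0, 56.9, 33.6]


def pct_to_yards(pct, base_yards=TITLE_YEAR_YARDS):
    """Convert a percent-of-title-yards projection into rounded yards."""
    return round(base_yards * pct / 100)


for offset, pct in enumerate(projected_pct, start=1):
    print(f"Year T + {offset}: {pct_to_yards(pct)} yards")
```

Under that assumption each result lands within a yard of the projected yards listed for Model 1.  The small differences are presumably because the published percentages are rounded to one decimal place.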

Model 2: Yards Per Year
Year T + 1  Year T + 2  Year T + 3  Year T + 4  Year T + 5
Years ** **
Years² ** **
Career Att * **
Career Att² ** **
Career Yds/Att
Yards
Strike Season ** ** ** NA
Retired NA ** ** **
Constant **
C.J. Projected 1347 1376 1211 1141 813
Sample Avg 1248 1009 935 796 702
** statistically significant at 5% level
* statistically significant at 10% level
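For readers who want to see what a specification like this looks like in practice, here is a minimal sketch of an ordinary least squares fit in plain Python.  It is not the software or code used for the results above.  The function names and the column order in `design_row` are hypothetical, and the data in the demo are toy numbers rather than the rushing title sample.  Presumably one such fit would be run for each of the five follow-up years.

```python
def design_row(years, career_att, yds_per_att, title_yards, strike, retired):
    """One row of regressors: constant, two quadratics, and the dummy controls."""
    return [1.0, years, years ** 2, career_att, career_att ** 2,
            yds_per_att, title_yards, float(strike), float(retired)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Toy demo: data generated from y = 5 - 2x + 0.5x^2, NOT real rushing data.
xs = [0, 1, 2, 3, 4, 5]
toy_rows = [[1.0, x, x ** 2] for x in xs]
toy_y = [5 - 2 * x + 0.5 * x ** 2 for x in xs]
print(ols(toy_rows, toy_y))
```

The toy fit should recover coefficients close to 5, -2, and 0.5.  With real data, a statistics package would be the safer choice, both for numerical stability and because it reports the standard errors needed for the significance markers in the tables.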

As you can see, the totals I came up with here are nowhere near those from the previous regression, but I do think this model is better specified.  What are your thoughts?  How would you feel if Chris Johnson posted these numbers over the next five years?  Also, what, if anything, should I have included in these models?