I built a college football model. Here’s how it works
Thomas Zwiller | Friday, August 26, 2022
For those of you who have read The Observer for the past few years, there is a good chance you have run into my NFL model, ZeLO, which is predicated on the idea that instead of simply measuring wins and losses to rate teams, we can observe what statistics tell us down to a micro level.
At the end of the spring semester, I was asked by our Sports Editor to try and create a version of ZeLO for college football. Eager to try my hand at a college football model, I quickly got to work, until I realized that rebuilding it the same way was impossible for two reasons.
For one, trying to manage the rosters of 130-plus college teams is a task I’m not crazy enough to try, and people who have watched me run the NFL model already think I’m crazy (depending on the week, it can take me over ten hours to update ZeLO).
For another, there isn’t the same level of data readily available. For my money, Pro Football Reference is the best stats website and the easiest to integrate into Excel. But as immaculate as PFR is, there are just some stats missing in the NCAA. The best example I can think of is passes targeted. The NFL has both receptions and targets, but college football lacks targets.
So, between the lack of data and the sheer size of the project, I decided to move to a team-based model. With that decision made, the pieces began to fall into place. I grabbed each team’s per-game stats, both for and against, ran them through a modified ZeLO formula and created a net metric.
The first thing that jumped out to me was that teams that had been good in recent years but had a down year last season were rated too low (think Penn State), and teams that had been poor the year before but were good last year (Baylor) were projected to be too strong. This made me want to go back to prior years, run more data and create a weighted average based on recency (for example, the most recent year is weighted at 50% and the oldest year at 5%). I now felt a lot more comfortable knowing that a team that had an outlier year in one direction or the other would still get credit for it, but it wouldn’t be the only factor.
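To make the recency weighting concrete, here is a minimal Python sketch of the idea. It is not the actual ZeLO spreadsheet: only the 50% and 5% endpoints come from the description above, while the four-season window, the middle weights (30% and 15%), the function name and the sample ratings are all assumptions for illustration.

```python
def recency_weighted_rating(ratings_newest_first, weights=(0.50, 0.30, 0.15, 0.05)):
    """Blend several seasons of ratings (newest first) into one number.

    Only the 0.50 and 0.05 endpoints are taken from the column; the
    middle weights and the four-season window are assumed.
    """
    if len(ratings_newest_first) != len(weights):
        raise ValueError("need exactly one rating per weight")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return sum(r * w for r, w in zip(ratings_newest_first, weights))


# Hypothetical ratings: three strong seasons, then a down year most recently.
down_year_team = recency_weighted_rating([40.0, 70.0, 72.0, 68.0])

# Hypothetical ratings: three poor seasons, then one big year most recently.
breakout_team = recency_weighted_rating([75.0, 45.0, 42.0, 44.0])

print(round(down_year_team, 1), round(breakout_team, 1))
```

In this made-up example the down-year team is pulled back up by its history and the breakout team is pulled back down, which is the behavior the weighted average is meant to produce.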
The next phase was adding a strength of schedule component. I created a scale based on the average schedule, and if a team had a more demanding schedule, its ZeLO for that year was multiplied by a decimal greater than one. If the schedule was easier, it was multiplied by a decimal less than one.
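A rough sketch of how a schedule multiplier like this could work is below. The exact scale is not spelled out here, so the simple ratio of a team's schedule strength to the average schedule is an assumption, as are the function names and the sample numbers.

```python
def sos_multiplier(team_sos, average_sos):
    """Return a multiplier above 1 for a harder-than-average schedule
    and below 1 for an easier one. The plain ratio is an assumed scale."""
    if average_sos <= 0:
        raise ValueError("average schedule strength must be positive")
    return team_sos / average_sos


def adjust_for_schedule(season_rating, team_sos, average_sos):
    """Scale one season's rating by the schedule multiplier."""
    return season_rating * sos_multiplier(team_sos, average_sos)


# Hypothetical numbers: same raw rating, different schedules.
hard_schedule = adjust_for_schedule(60.0, team_sos=55.0, average_sos=50.0)
easy_schedule = adjust_for_schedule(60.0, team_sos=45.0, average_sos=50.0)

print(round(hard_schedule, 1), round(easy_schedule, 1))
```

Two teams with identical raw ratings end up separated once the schedule is taken into account, with the team that faced the tougher slate rated higher.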
There was also the matter of roster turnover that needed to be addressed. With the NFL model, I manually address this by moving players as I follow free agency. But with a team model, I have to make more general assumptions about how much production leaves when players graduate or are drafted. So a team that was good last year, like Georgia, which lost more than half of its defensive production, will not return with as elite a defensive unit.
On the other hand, recruiting needs to be considered as well. According to 247Sports, over the past four years, Georgia has been a top-5 recruiting school, ranking first (2020) and second (2019) during that span. Suddenly, that reduction of their defense seems like an overcorrection.
My solution was simple: I took 247Sports’ average recruiting grade, multiplied it by .36 (the 36% was found here) and then added the returning production metric to the new recruiting metric.
So Georgia, which returned 73% of its offensive production and 44% of its defensive production, now has a return plus recruit score of 1.06 on offense (a slight improvement) and a .775 on defense (a relatively large regression). These two numbers should account for returning production and the new class of players joining the team.
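The arithmetic behind a return plus recruit score can be sketched in a few lines of Python. The .36 multiplier and Georgia's 73% and 44% figures come from the text above; the recruiting grade of 0.92 on a 0-to-1 scale is a hypothetical stand-in, not 247Sports' actual number, and the function name is made up for this example.

```python
RECRUITING_WEIGHT = 0.36  # the .36 multiplier quoted in the column


def return_plus_recruit(returning_share, recruiting_grade):
    """Returning production (as a 0-1 share) plus the weighted recruiting grade.

    The 0-1 scale for the recruiting grade is an assumption.
    """
    return returning_share + RECRUITING_WEIGHT * recruiting_grade


# Assumed grade for illustration only, not 247Sports' real figure.
HYPOTHETICAL_GRADE = 0.92

offense = return_plus_recruit(0.73, HYPOTHETICAL_GRADE)
defense = return_plus_recruit(0.44, HYPOTHETICAL_GRADE)

print(round(offense, 3), round(defense, 3))
```

With that assumed grade, the sketch gives roughly 1.06 on offense and 0.77 on defense, close to the figures quoted above. The small gap on defense presumably comes from rounding in the published percentages.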
The last thing I wanted to incorporate into the model was a home field adjustment. Two components were critical to me. The first was the team’s home win-loss record, and the second was their average home attendance (adjusted for capacity to help protect smaller schools). I used a five-year average for both stats and calculated a Z-Score for each (a metric that shows how many standard deviations a value sits above or below the mean).
I took these two numbers and added them to a base of 10. So, if a team has a poor win-loss record, they will have a negative Z-Score (because they are below the mean), which would lower their home adjustment from 10. The same is true if they have below-average attendance. But the worst that could happen is a team drops to plus-four — still an advantage, but minimal.
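Here is a minimal Python sketch of that home field calculation, under a few stated assumptions: the Z-Scores use the population standard deviation, the team names and five-year averages are invented, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standard deviations above or below the mean for each value.

    Uses the population standard deviation, which is an assumption.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def home_field_adjustments(teams, base=10.0):
    """teams maps name -> (home win pct, attendance as a share of capacity),
    each already averaged over five years."""
    names = list(teams)
    win_z = z_scores([teams[n][0] for n in names])
    att_z = z_scores([teams[n][1] for n in names])
    return {n: base + w + a for n, w, a in zip(names, win_z, att_z)}


# Invented five-year averages for three made-up teams.
example = {
    "Team A": (0.90, 1.00),
    "Team B": (0.60, 0.85),
    "Team C": (0.30, 0.70),
}
adjustments = home_field_adjustments(example)

for name, value in adjustments.items():
    print(name, round(value, 2))
```

A team that is exactly average in both stats stays at the base of 10, while above-average teams move up and below-average teams move down. The plus-four floor mentioned above is not enforced in this sketch, since it is not stated whether that is a hard limit or simply the lowest value that showed up in the data.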
I honestly have no idea how this project is going to go. When I first built my NFL model, I got to test it by myself first. I spent a week or two just tracking results and seeing if the player values made sense and didn’t jump around too much. It worked out well and I felt good about publishing it.
This college model is completely different; you and I are going to find out how good it is at the same time.
I built it, and every decision was intentional and well thought out. It backtested incredibly well on last season: on average, the model was about a game and a half off of the actual finish. By the model, Georgia won the title and Alabama finished second, mirroring the results of the actual season. Always a good sign.
So, I hope you follow along and enjoy watching the season unfold as we track ZeLO against both real results and the ESPN FPI predictions.