Wednesday, November 19, 2014

Ooten NCAAF Rankings Details and Explanation

This post is in progress. Please check back soon!

The coaches, the media, and now an all-powerful committee of 13 all publish their Top 25 NCAA Football rankings. When you have a field of over 100 Division I-A (FBS) teams and a relatively tiny number of games to judge them by, you will have controversy. Well, I couldn't help myself... so I have applied my rating system to college football. This took a little while longer to launch (compared to my NFL and MLB ratings) because of the sheer number of teams involved. It's not just the 100+ Div I-A teams: they play the 100+ Div I-AA teams, who in turn play the many Div II teams, who play the many Div III teams. Thanks to some very fast computations, I have rated well over 500 NCAAF teams, many of which I had never heard of before. You can see the latest rankings here: NCAAF Rankings

Behind the Rankings
For the most part, my NCAAF rankings use the same rating system as my NFL and MLB rankings. So if this text seems familiar, that's because it is! I don't want to get too deep into the details of the mathematics behind the ratings here, so I'll keep it relatively simple in this blog and refer you to a reference where you can dig in more if you'd like. I use a slightly modified version of Microsoft's TrueSkill rating system. Why slightly modified? Two reasons: 1) Microsoft wouldn't elaborate on the complex details of competitions involving three or more competitors. 2) Microsoft wouldn't give the specific equations of the 'v' and 'w' functions (check the details in the reference if you care), so I had to curve fit them.
Issue #2 isn't that big of a deal; my curve fit matches their published 'v' and 'w' function plots extremely closely. Issue #1 doesn't affect head-to-head competitions, which cover the vast majority of sporting events, so it does not apply to my NCAAF rankings. For the curious readers out there, I devised a fairly accurate way of simulating their more complex method for events with three or more competitors, and it tracks very closely with their results. With all that said, I'm satisfied with my Matlab version of the TrueSkill rating system.
I think we can all agree that a team isn't always as good or as bad as its record. Strength of schedule matters. A team can have a quality win against a strong opponent, or an embarrassing loss against a poor one. From a 30,000 ft view, my NCAAF ratings (again, based on Microsoft's TrueSkill) measure each team based on the quality of the opponents it competes against by tracking two parameters for each team: an average skill (mu) and a measure of the uncertainty in that skill (sigma). Many rating systems only track the "skill" term. By tracking both skill and uncertainty, you can converge to a more accurate representation of a player's (or team's) skill with a smaller sample size. The opponent's skill, the opponent's uncertainty, and the outcome of the game all affect a team's recalculated skill and uncertainty. The rating itself is the average skill minus three times the uncertainty (rating = mu - 3*sigma), which gives roughly 99% confidence that the team's true skill is at or above that rating.
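To make that concrete, here is a rough sketch of a single head-to-head update in Python (my actual implementation is in Matlab). It uses the standard closed-form 'v' and 'w' functions for a win/loss outcome rather than my curve fit, it ignores draws and the draw margin, and the beta value (the assumed per-game performance noise) and the starting mu/sigma are illustrative defaults, not my actual settings:

import math

def pdf(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v(t):
    """Mean-shift function for a win/loss outcome (no draws)."""
    return pdf(t) / cdf(t)

def w(t):
    """Variance-shrink function for a win/loss outcome (no draws)."""
    return v(t) * (v(t) + t)

def update(winner, loser, beta=25.0 / 6.0):
    """One head-to-head TrueSkill-style update.

    winner, loser: (mu, sigma) pairs. beta is the assumed per-game
    performance noise (illustrative value). Returns updated pairs.
    """
    mu_w, sig_w = winner
    mu_l, sig_l = loser
    c = math.sqrt(2.0 * beta ** 2 + sig_w ** 2 + sig_l ** 2)
    t = (mu_w - mu_l) / c

    # Winner's skill moves up, loser's moves down; both uncertainties shrink.
    mu_w_new = mu_w + (sig_w ** 2 / c) * v(t)
    mu_l_new = mu_l - (sig_l ** 2 / c) * v(t)
    sig_w_new = sig_w * math.sqrt(max(1.0 - (sig_w ** 2 / c ** 2) * w(t), 1e-6))
    sig_l_new = sig_l * math.sqrt(max(1.0 - (sig_l ** 2 / c ** 2) * w(t), 1e-6))
    return (mu_w_new, sig_w_new), (mu_l_new, sig_l_new)

def rating(mu, sigma):
    """Conservative rating: skill minus three times the uncertainty."""
    return mu - 3.0 * sigma

An unexpected result (a big mismatch in mu) moves the skills more than an expected one, and every game played shrinks both teams' sigma, which is exactly why the mu - 3*sigma rating rewards teams we have more evidence about.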

Everything I've mentioned to this point applies to my NFL and MLB ratings as well. Where my NCAAF ratings start to diverge is that I iterate the NCAAF ratings to convergence. I do this for NCAAF and not for the professional sports because in college sports there can be a huge variation in the quality of teams. Applying an iterative algorithm to the rating system allows what is learned about each team later in the season to be applied back to early-season results.

Each year, the teams all start with the same base rating (or a regression toward a base rating), so an early win or loss may not add or subtract the right number of points from a team. For example, during the 2014 NCAAF season, Ohio State suffered an early loss to Virginia Tech. At the beginning of the season those teams were similarly rated, so OSU was not penalized much for losing to a near-equal team. However, as the season has progressed, Virginia Tech has lost several more games, which indicates that the loss to Virginia Tech was worse than a loss to a near-equal team. The iteration process therefore subtracts more points from Ohio State than were originally subtracted. The opposite can occur for an early-season win over a team that proves to be high quality. The iterative process applies what is learned about each team's opponents back to that team's rating, which can cause a team's rating to fluctuate slightly even during a bye week. Currently, I iterate on the NCAAF ratings until the maximum change in any team's rating between iterations is less than 1%.
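For the curious, here is a rough Python sketch of one way such an iteration could look. To be clear, this is an illustration, not my actual Matlab code: the rate_season name, the (winner, loser) game-list format, the base values, and the specific feedback scheme (treating every opponent as having the strength they finished the previous pass with) are all assumptions for the sketch, and it relies on the update() and rating() functions sketched above:

def rate_season(games, mu0=25.0, sigma0=25.0 / 3.0, tol=0.01, max_iters=100):
    """Iteratively re-rate a season until no team's rating moves much.

    games: list of (winner_name, loser_name) pairs in chronological order.
    mu0, sigma0: assumed common starting skill/uncertainty for every team.
    tol: stop when no rating changes by more than about 1% between passes.
    """
    teams = {name for game in games for name in game}
    # Opponent strengths carried over from the previous pass;
    # everyone starts at the common base.
    final = {name: (mu0, sigma0) for name in teams}
    prev_ratings = {name: rating(mu0, sigma0) for name in teams}

    for _ in range(max_iters):
        # Re-rate every team from scratch, but treat each opponent as
        # having the strength it ended the previous pass with, so late-season
        # knowledge feeds back into early-season games.
        state = {name: (mu0, sigma0) for name in teams}
        for winner, loser in games:
            new_winner, _ = update(state[winner], final[loser])
            _, new_loser = update(final[winner], state[loser])
            state[winner], state[loser] = new_winner, new_loser

        final = state
        ratings = {name: rating(mu, sig) for name, (mu, sig) in final.items()}
        change = max(abs(ratings[n] - prev_ratings[n]) /
                     max(abs(prev_ratings[n]), 1.0) for n in teams)
        prev_ratings = ratings
        if change < tol:
            break
    return prev_ratings

Feed it the season's games in order as (winner, loser) pairs, e.g. ("Virginia Tech", "Ohio State") for that early upset, and it returns one rating per team once the biggest pass-to-pass change drops below roughly 1%.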
More details here if you are interested! Microsoft TrueSkill Rating
I'd also be happy to answer questions if you'd rather have someone translate that for you: Post your questions via email or comments on this post and I'll get back to you.