# Tag Archives: sports

## introducing Badger Bracketology, a tool for forecasting the NCAA football playoff

Today I am introducing Badger Bracketology:
http://bracketology.engr.wisc.edu/

I have long been interested in football analytics, and I enjoy crunching numbers while watching the games. This year is the first season for the NCAA football playoff, where four teams will play to determine the National Champion. It’s a small bracket, but it’s a start in the right direction.

The first step to being becoming the national champion is to make the playoff. To do so, a team must be one of the top four ranked teams at the end of the season. A selection committee manually ranks the teams, and they are given a slew of information and other rankings to make their decisions.

I wanted to see if I could forecast the playoff ahead of time by simulating the rest of the season rather than waiting until all season’s games have been played. Plus, it’s a fun project that I can share with my undergraduate simulation simulation that I teach in the spring.

Here is how my simulation model works. The most critical part is the ranking method, which uses the completed game results to rate and then rank the teams so that I can forecast who the top 4 teams will be at the end of the season. I need to do this solely using math (no humans in the loop!) in each of 10,000 replications. Here is how it works. I start with the outcomes of the games played so far, starting with at least 8 weeks of data. This is used to come up with a rating for each team that I then rank. The ranking methodology uses a connectivity matrix based on Google’s PageRank algorithm (similar to a Markov chain). So far, I’ve considered three variants of this model that take various bits of information account like who a team beats, who it loses to, and the additional value provided by home wins. I used data from the 2012 and 2013 seasons to tune the parameters needed for the models.

The ratings along with the impact of home field advantage are then used to determine a win probability for each game. From previous years, we found that the home team won 56.9% of games later in the season (week 9 or later), which accounts for an extra boost in win probability of ~6.9% for home teams. This is important since there are home/away games as well as games on neutral sites, and we need to take this into account. The simulation selects winners in the next week of games by essentially flipping a biased coin with. Then, the teams are re-ranked after each week of simulated game outcomes. This is repeated until we get to the end of the season. Finally, I identify and simulate the conference championship games played (these are the only games not scheduled in advance). And then we end up with a final ranking. Go here for more details.

There are many methods for predicting the outcome of a game in advance. Most of the sophisticated methods use additional information that we could not expect to obtain weeks ahead of time (like the point spread, point outcomes, yards allowed, etc.). Additionally, some of the methods simply return win probabilities and cannot be used to identify the top four teams at the end of the season. My method is simple, but it gives us everything we need without being so complex that I would be suspicious of overfitting. The college football season is pretty short, so our matrix is really sparse. At present, teams have played 8 weeks of football in sum, but many teams have played just 6-7 games. Additional information could be used to help make better predictions, and I hope to further refine and improve the model in coming years. Suggestions for improving the model will be well-received.

Our results for our first week of predictions are here. Check back each week for more predictions.

Your thoughts and feedback are welcome!

## underpowered statistical tests and the myth of the myth of the hot hand

In grad school, I learned about the hot hand fallacy in basketball. The so-called “hot hand” is the person whose scoring success probability is temporarily increased and therefore should shoot the ball more often (in the basketball context). I thought the myth of the hot hand effect was an amazing result: there is no such thing as a hot hand in sports, it’s just that humans are not good at evaluating streaks of successes (hot hand) or failures (slumps).

Flash forward years later. I read a headline about how hand sanitizer doesn’t “work” in terms of preventing illness. I looked at the abstract and read off the numbers. The group that used hand sanitizer (in addition to hand washing) got sick 15-20% less than the control group that only washed hands. The 15-20% difference wasn’t statistically significant so it was impossible to conclude that hand sanitizing helped, but it represented a lot of illnesses averted. I wondered if this difference would have been statistically significant if the number of participants was just a bit larger.

It turns out that I was onto something.

The hot hand fallacy is like the hand sanitizer study: the study design was underpowered, meaning that there is no way to reject the null hypothesis and draw the “correct” conclusion whether or not the hot hand effect or the hand sanitizer effect is real. In the case of the hand sanitizer, the number of participants needed to be large enough to detect a 15-20% improvement in the number of illnesses acquired. Undergraduates do this in probability and statistics courses where they estimate the sample size needed. But often researchers sometimes forget to design an experiment in a way that can detect real differences.

My UW-Madison colleague Jordan Ellenberg has a great article about the myth of the myth of the hot hand on Deadspin and it’s fantastic. He has more in his book How Not to Be Wrong, which I highly recommend.  He introduced me to a research paper by Kevin Korb and Michael Stillwell that compared statistical tests used to test for the hot hand effect on simulated data that did indeed have a hot hand. The “hot” data alternated between streaks with success probabilities of 50% and 90%. They demonstrated that the serial correlation and runs tests used in the ‘early “hot hand fallacy” paper were unable to identify a real hot hand, and therefore, these tests were underpowered and unable to reject the null hypothesis when it was indeed false. This is poor test design. If you want to answer a question using any kind of statistical test, it’s important to collect enough data and use the right tools so you can find the signal in the noise (if there is one) and reject the null hypothesis if it is false.

I learned that there appears to be no hot hand in sports where a defense can easily adapt to put greater defensive pressure on the “hot” player, like basketball and football. So the player may be hot but it doesn’t show up in the statistics only because the hot player is, say, double teamed. The hot hand is more apparent and measurable in sports where defenses are not flexible enough to put more pressure on the hot player, like in baseball and volleyball.

## Major League Baseball scheduling at the German OR Society Conference

Mike Trick talked about his experience setting the Major League Baseball (MLB) schedule at the 2014 German OR Conference in Aachen, Germany. Mike’s plenary talk had two major themes:
1. Getting the job with the MLB
2. Keeping the job with the MLB

The getting the job section summarized advances in computing power and integer programming solvers that have made solving large-scale integer programming (IP) models a reality. Mike talked about how he used to generate cuts for his models, but now the solvers (like CPLEX or Gurobi) add a lot of the cuts automatically as part of pre-processing. Over time, Mike’s approach has become popping his models into CPLEX and then figuring out what the solver is doing so he can exploit the tools that already exist.

Side note: I am amazed at how good the integer programming solvers have become. I recently worked on a variation to the set covering model for which a greedy approximation algorithm exists. The time complexity of the greedy algorithm isn’t great in theory. In practice, the greedy algorithm is slower than the solver (Gurobi, I think) and doesn’t guarantee optimality. I can’t believe we’ve come this far.

Mike also stressed the importance of finding better ways to formulate the problem to create a better structure for the IP solver.  Better formulations can be more complicated and less intuitive, but they can lead to markedly better linear programming bounds. Mike achieved this by replacing his model with binary variables that correspond to team-to-team games (does team i play team j on day t?) with another model whose variables correspond to series (a series is usually 3 games played between teams on consecutive days). Good bounds from the linear programming relaxations help the IP solver find an optimal solution much quicker. Another innovation focused on improving the schedule by “throwing away” much of the schedule (usually about a month) after making needed changes and resolving. Again, this is something that is possible due to advances in computing.

The keeping the job section addressed business analytics and its role in optimization. Mike defined business analytics as using data to make better decisions, something that OR has always done. What is new is using the power of data analytics and predictive modeling to guide prescriptive integer programming models in a meaningful way. The old way was to use point estimates in integer programming models, the new way uses more information (such as the output of a logistic regression) to guide optimization models. The application Mike used was estimating the value of scheduling home games at different times (day vs. night) and day of the week. When embedded in the optimization modeling framework, the end result was that creating a schedule using business analytics could add about \$50M to MLB in revenue.

Mike summed up his talk but talking about how educating the marketing folks is part of the job now. Marketing likes to measure “success” as the number of games that sell out. Operations researchers recognize that sold out games are lost revenue, so the goal has become to schedule games such that games are almost sold out, and making sure that marketing understands this approach.

Related post:

the craft of scheduling Major League Baseball games

## Markov chains for ranking sports teams

My favorite talk at ISERC 2014 (the IIE conference) was “A new approach to ranking using dual-level decisions” by Baback Vaziri, Yuehwern Yih, Mark Lehto, and Tom Morin (Purdue University) [Link]. They used a Markov chain to rank Big Ten football teams in their ability to recruit prospective players. Players would accept one of several offers. The team that got the player was the “winner” and the other teams were losers.  We end up with a matrix P where element (i,j) in P is the number of times team j beats team i.

The Markov chain is then normalized so that each row sums to 1 and solved for the limiting distribution. The probability of being in team j in the limit was interpreted as meaning the proportion of time that team j is the best. Therefore, the limiting distribution can be used to rank teams from best to worst.

They found that using this method with 2001 – 2012 data, Wisconsin was ranked fourth, which was much higher than it was ranked by experts and explains why they have been to 12 bowl games in a row. Illinois (my alma mater) was ranked second to last, only above lowly Indiana.

I used this method regular season 2014 Big Ten basketball wins and ended up with the following ranking. I also have the official ranking based on win-loss record for comparison.  We see large discrepancies for only two teams: Michigan State (which is over-ranked according to its win-loss record) and Indiana (which is under-ranked according to its win-loss record). The Markov chain method ranks these two teams differently because Indiana had high quality wins despite not winning so frequently and because Michigan State lost to a few bad teams when they were down a few players due to injuries.

 Ranking MC Ranking W-L record  Ranking 1 Michigan Michigan 2 Wisconsin Wisconsin 3 Indiana Michigan State 4 Iowa Nebraska 5 Nebraska Ohio State 6 Ohio St Iowa 7 Michigan St Minnesota 8 Minnesota Illinois 9 Illinois Indiana 10 Penn St Penn State 11 Northwestern Northwestern 12 Purdue Purdue

Sophisticated methods are a little more complex than this. Paul Kvam and Joel Sokol estimate conditional probabilities in the transition probability matrix for the logistic regression Markov chain (LRMC) model using logistic regression [Paper link here]. The logistic regression yields an estimate for the probability that a team with a margin of victory of x points at home is better than its opponent, and thus, looks at margin of victory not just wins and losses.

## subjective scoring in Olympic sports drives me a little crazy

The Olympics are beginning. When I think of the Olympic sports, I think of a lot of sports that scored subjectively. Not so much stronger, faster, and more goals, more of panels of judges picking winners amid controversy. I prefer number crunching and objective scoring. A New York Times article by John Branch [Link] overviews the changes to the winter Olympic sports in the last two decades. In summary, the new sports are mostly those with  subjective scoring (halfpipe, snowboard cross).

A good run early in the contest might receive an 80. A slightly better run might earn an 83. A brilliant run, one that seems unbeatable, might score 95. All of the others are slotted around them. It can frustrate athletes, who ask why their second-place score was 10 points below that of the winner. They struggle to understand that the value means nothing; what matters is how it ranks.

I’ve noticed this, too, and it’s frustrating. Some sports like figure skating and gymnastics have well-established rubrics for scoring, but they are not perfect. On the positive side, the judges do a fairly good job of recognizing the best performances.

Does subjective scoring bother you?

~

Look for more Olympics posts from me in the next couple of weeks.

I’ve been blogging for almost 7 years, so I have a few old posts about the Olympics. Here are a few that I recommend reading:

## will the New York Times Fourth Down Robot change football?

The New York times runs a twitter account for a “Fourth Down Bot” (@NYT4thDownBot) that analyzes every 4th down call in NFL football games. The bot gives advice and sometimes a short report summarizing the probability of success associated with each of the choices:

The bot has a lot of personality!

Brian Burke provides the methodology, which is here. The recommendations are based on which actions (going for it, punting, or going for a field goal) yields the most expected points. In the last 10 minutes of a game, the bot selects recommendations based on which yields the highest win probability. These concepts are not equivalent – going for it may maximize your points, but if time is running out and you are down by two, it might be better to go for a field goal than try for a touchdown.

The bot is useful because there is such a huge difference what is the best strategy and what coaches actually do. The picture below illustrates the difference. There are a number of explanations for the difference. One is that fans and owners only remember the times it doesn’t work–following the optimal policy may maximize the number of wins on average, but losing a game could mean losing your job. When the objective is to keep your job and not win games, everyone gets used to more conservative and suboptimal play calling.

Fourth Down Bot’s recommendations as compared to what most coaches do.