Recently, I found a few interesting articles/posts that all defend model simplicity.
An interview with Gregory Matthews and Michael Lopez about their winning entry in Kaggle’s NCAA tournament challenge “ML mania” suggests that it’s better to have a simple model with the right data than a complex model with the wrong data. This is my favorite quote from the interview:
John Foreman has a nice blog post defending simple models here. He argues for sometimes replacing a machine learning model for clustering with an IF statement or two. He links to a published paper entitled “Very simple classification rules perform well on most commonly used datasets” by Robert Holte in Machine Learning that demonstrates his point. You can watch John talk about modeling in his very informative and enjoyable hour-long seminar here.
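To give a flavor of what a "very simple classification rule" can look like, here is a minimal sketch in the spirit of Holte's one-attribute rules. It is not Holte's full procedure (it handles only categorical attributes and ignores missing values), and the function name `fit_one_rule` and the toy weather data are made up for illustration.

```python
from collections import Counter, defaultdict

def fit_one_rule(rows, labels):
    """Pick the single attribute whose value -> majority-class rule
    makes the fewest errors on the training data."""
    best = None
    for attr in range(len(rows[0])):
        # Count class labels for each value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # The rule predicts the majority class for each attribute value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Made-up toy data: (outlook, windy) -> play outside?
rows = [
    ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "no"), ("rainy", "yes"),
    ("overcast", "no"), ("overcast", "yes"),
]
labels = ["yes", "yes", "no", "no", "yes", "yes"]

attr, rule, errors = fit_one_rule(rows, labels)
print("best attribute index:", attr)
print("rule:", rule)
print("training errors:", errors)
```

The fitted rule is literally a lookup on one attribute, which is about as close to "an IF statement or two" as a learned model gets.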
A paper called “The Bias Bias” by Henry Brighton and Gerd Gigerenzer examines our tendency to build overly complex models. Do complex problems require complex solutions? Not always. Here is the abstract.
In marketing and finance, surprisingly simple models sometimes predict more accurately than more complex, sophisticated models. Why? Here, we address the question of when and why simple models succeed — or fail — by framing the forecasting problem in terms of the bias-variance dilemma. Controllable error in forecasting consists of two components, the “bias” and the “variance”. We argue that the benefits of simplicity are often overlooked by researchers because of a pervasive “bias bias”: The importance of the bias component of prediction error is inflated, and the variance component of prediction error, which reflects an oversensitivity of a model to different samples from the same population, is neglected. Using the study of cognitive heuristics, we discuss how individuals and organizations can reduce variance by ignoring weights, attributes, and dependencies between attributes, and thus make better decisions. We argue that bias and variance provide a more insightful perspective on the benefits of simplicity than common intuitions that typically appeal to Occam’s razor.
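A toy simulation can make the bias-variance trade-off in the abstract concrete. This sketch is not from the paper. It assumes a made-up linear signal, noisy training sets of only 10 points, and two illustrative predictors: a "simple" one that ignores the input and predicts the training mean, and a "complex" one that copies the nearest training point. All names and parameter values here are assumptions chosen for illustration.

```python
import random
import statistics

def true_signal(x):
    return x  # assumed ground truth for this illustration

def draw_training_set(rng, n=10, noise_sd=1.0):
    """Small, noisy sample from the assumed population."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_signal(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def fit_mean(train):
    """Simple model: ignore x, always predict the training mean."""
    mean_y = statistics.fmean([y for _, y in train])
    return lambda x: mean_y

def fit_nearest_neighbor(train):
    """Complex model: predict the y of the closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def bias_variance(fit, trials=2000, seed=0):
    """Estimate squared bias and variance, averaged over a grid of x values,
    by refitting the model on many independent training sets."""
    rng = random.Random(seed)
    grid = [i / 10 for i in range(11)]
    preds = [[] for _ in grid]
    for _ in range(trials):
        model = fit(draw_training_set(rng))
        for i, x in enumerate(grid):
            preds[i].append(model(x))
    bias2 = statistics.fmean(
        [(statistics.fmean(p) - true_signal(x)) ** 2 for p, x in zip(preds, grid)]
    )
    variance = statistics.fmean([statistics.pvariance(p) for p in preds])
    return bias2, variance

simple_bias2, simple_var = bias_variance(fit_mean)
complex_bias2, complex_var = bias_variance(fit_nearest_neighbor)
print(f"simple:  bias^2={simple_bias2:.3f}  variance={simple_var:.3f}")
print(f"complex: bias^2={complex_bias2:.3f}  variance={complex_var:.3f}")
```

Under these assumptions the simple predictor is clearly biased, but the complex one pays for its low bias with much higher variance, so the simple one ends up with the smaller total error. With more data or less noise the ranking could flip, which is the "when and why" question the abstract raises.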
What about discrete optimization models?
All of these links address data science problems, like classifying data or building a predictive model. Operations research models often try to solve complicated problems with a lot of constraints and requirements. They have a lot of pieces that need to play nicely together. But even then, it’s often incredibly useful to ask the right question and then answer it using a simple model.
I have one example that makes a great case for simple models. Armann Ingolfsson examined the impact of model simplifications in models used to locate ambulances in a recent paper (see citation below). Location problems like this one almost always use a coverage objective function, where locations are covered if an ambulance can respond to the location in a fixed amount of time (e.g., 9 minutes). The question is how to represent the coverage function and how to aggregate the locations, two sources of model error. The coverage objective function can reflect either deterministic or probabilistic travel times. Deterministic travel times lead to binary objective function coefficients (an ambulance covers a location or it doesn’t) whereas probabilistic travel times lead to real-valued objective coefficients that are a little “smoother” with respect to distances between stations and locations (an ambulance can reach 75% of calls at this location in 9 minutes).
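The difference between the two kinds of coefficients can be sketched in a few lines. This is only an illustration, not the paper's model: it assumes travel time is lognormally distributed around a median with a made-up spread `sigma`, and the function names are hypothetical.

```python
import math

def deterministic_coverage(median_minutes, threshold=9.0):
    """Binary coefficient: 1 if the typical travel time meets the threshold."""
    return 1.0 if median_minutes <= threshold else 0.0

def probabilistic_coverage(median_minutes, threshold=9.0, sigma=0.4):
    """P(travel time <= threshold), assuming a lognormal travel time
    with the given median and log-scale spread sigma (both assumptions)."""
    z = (math.log(threshold) - math.log(median_minutes)) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Compare the two coefficients for a few station-to-location travel times.
for minutes in (6.0, 8.9, 9.1, 12.0):
    binary = deterministic_coverage(minutes)
    prob = probabilistic_coverage(minutes)
    print(f"median {minutes:4.1f} min: binary={binary:.0f}  probabilistic={prob:.3f}")
```

Near the 9-minute threshold the binary coefficient jumps from 1 to 0 while the probabilistic coefficient barely moves. That may be one intuition for why small distance errors introduced by aggregation would matter more for a deterministic model.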
This paper examined which is worse: (a) a simple model with highly aggregated locations but realistic (probabilistic) travel times or (b) a more complex model with fine-grained locations but less realistic (deterministic) travel times.
It turns out that the simple but realistic model (choice (a)) is better by a long shot. Here is a figure from the paper that reflects the coverage loss (model error) from different models. The x-axis reflects aggregation, and the y-axis reflects coverage loss (model error, more is bad). The different curves reflect different models. The blue line is the model with probabilistic travel times; the rest have deterministic travel times with the binary value determined by different percentiles.
From the paper: “Figure 4 shows how relative coverage loss varies with aggregation level (on a log scale) for the five models, for a scenario with a budget for five stations, using network distances, and actual demand. This figure illustrates our two main findings: (1) If one uses the probabilistic model (THE BLUE LINE), then the aggregation error is negligible, even for extreme levels of aggregation and (2) all of the deterministic models (ALL OTHER LINES) result in large coverage losses that decrease inconsistently, if at all, when the level of aggregation is reduced.”
From the conclusion:
In this paper, we demonstrated that the use of coverage probabilities rather than deterministic coverage thresholds reduces the deleterious effects of demand point aggregation on solution quality for ambulance station site selection optimization models. We find that for the probabilistic version of the optimization model, the effects of demand-point aggregation are minimal, even for high levels of spatial aggregation.
Holmes, G., A. Ingolfsson, R. Patterson, E. Rolland. 2014. Model specification and data aggregation for emergency services facility location. [Supplement] [Submitted, last revision March 2014.]
What is your favorite simple model?