Is analysis necessarily data-driven? Does analytics include optimization?

A recent blog post by Jean Francois Puget and a tweet by INFORMS President-Elect Anne Robinson asked how to define analytics.

INFORMS defines analytics as:

“Analytics — the scientific process of transforming data into insight for making better decisions.”

I believe that optimization is a critical part of analytics. As Jean Francois Puget states, “Optimization seems covered by the ‘making better decisions’ part.” As an optimizer, I am completely biased in favor of this answer. The real question Jean Francois asks is not about optimization, but about whether analytics ultimately starts with data, even optimization. Is a data-centric view essential to analytics?

For the past few years, I thought “yes.” I did not have the foresight to take data-centric courses in graduate school, and I have been trying to make up for that ever since. We live in a world that collects increasingly massive and complex data sets. Being able to analyze data is generally a critical starting point for doing analytics.

Some of my research is in the area of emergency medical services. Locating ambulances, for example, is a classic application of facility location, p-center, and p-median problems. Ambulance location models existed before ambulance data was routinely collected. Building models without data necessitated assumptions about what the data would look like (e.g., assuming that calls arrive according to a Poisson process, although I often find that assumption to be realistic!). Other models were built with limited data that was painstakingly collected: I like this model of fire engine travel times that led to the so-called square root law. Having a simple model for fire engine travel times made it easier to build models that were not data-driven. Having access to oodles of data today is a huge help in building good models and understanding which assumptions are okay to make. The more data the merrier.
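The square root law lends itself to a quick numerical sketch. This is a rough illustration, not the actual fitted model: the constant c and all numbers below are invented for the example, since the true constant depends on the distance metric, dispatch policy, and unit availability.

```python
import math

def expected_travel_distance(area, n_units, c=0.6):
    """Square-root-law sketch: the expected travel distance to the
    nearest of n units covering a region of area A scales like
    c * sqrt(A / n). The constant c here is illustrative only."""
    return c * math.sqrt(area / n_units)

# Made-up numbers: a 60 square-mile city with 5 versus 10 fire engines.
# Doubling the fleet shrinks the expected travel distance by only ~29%
# (a factor of 1/sqrt(2)), which is the classic square-root insight.
d5 = expected_travel_distance(60, 5)
d10 = expected_travel_distance(60, 10)
print(d5, d10, 1 - d10 / d5)
```

The payoff of such a simple model is exactly what the post describes: you can reason about fleet-size tradeoffs before any travel-time data exists.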

Now, I think differently. Building a new model imposes a new and potentially useful way of looking at a problem. Data isn’t necessary to build a model (although it is surely helpful), nor should a data set be the ultimate starting point of every model. My students found it useful when I argued that bus accidents follow a Poisson process. Without any real data, we could gain useful insights into how to view a cluster of bus accidents that occurred in the Richmond area one week.
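As a sketch of the kind of no-data insight a Poisson model can give (the weekly accident rate below is invented, not an estimate from Richmond): a cluster that looks alarming in isolation may be unsurprising over a full year.

```python
import math

def poisson_tail(lam, k):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i)
                     for i in range(k))

# Hypothetical rate: 2 bus accidents per week on average.
p_week = poisson_tail(2.0, 6)        # a week with 6+ accidents is rare (~1.7%)
p_year = 1.0 - (1.0 - p_week) ** 52  # but one such week per year is likely (~58%)
print(p_week, p_year)
```

No accident data was needed to reach the qualitative conclusion: under a plausible Poisson model, an apparent cluster can simply be the law of large numbers at work.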

Most data that is collected is never analyzed. Much of the data that is analyzed is “analyzed” using summary statistics. In my experience, statistical models often provide useful explanations of what happened in the past. But analytics is not backward-looking; it is forward-looking. “How would you improve what you do as a result of this data analysis?” is a question that is not asked often enough. Today, it’s hard to imagine data not being part of the answer, but it could certainly be excluded from an “analytics” solution. It is certainly possible, and sometimes desirable, to use a non-data-driven approach to look to the future.

I realize that arguing for data not being included as part of analytics is going to be a tough sell. The “analytics” culture is aimed at those who are knee deep in data on any given day. Maybe people like me who have been trained more on the models/optimization side than on the data side just don’t want to be left behind. I hope there is some legitimate truth to this blog post.

A case study

A local fire and rescue station recently contacted me for feedback on their staffing levels. Yes, data was necessary for me to use analytical techniques in pursuit of some useful feedback, but it wasn’t the starting point. I started by interviewing the chief and other experts about the operations, constraints, and culture. At the end, I was able to look at the data, which was incredibly messy (lots of data was clearly not recorded). I was able to devise a clever way of at least bounding the amount of work that was performed, which yielded bounds for staffing levels. Here, data was essential, but it was not the starting point.

How essential is data for analytics? What is its role? Please leave a comment!


7 responses to “Is analysis necessarily data-driven? Does analytics include optimization?”

  • Barry Render

    Your question sort of surprises me, Laura. Data is absolutely essential in the real world. It’s fine to write a theoretical Ph.D. dissertation (although I much prefer to see ones that companies can apply), but I cannot remember one consulting job where collecting data was not essential. The data may have been messy, but that’s the real world.

  • Laura McLay

    Barry, I don’t doubt that data is absolutely essential (almost) all of the time. The question I had is slightly different: is analytics absolutely data-centric? Data-centric here meaning that the analysis revolves around the data. I like the way JF Puget put it in his blog post, “According to this view, everything starts with data, optimization being no exception. Are you OK with that? Are you, really?” My question was about the role of data not whether or not it should be used.

    I do understand what you meant by this: “I cannot remember one consulting job where collecting data was not essential.” I would agree! I would seriously wonder why someone would not have data in this day and age (although for certain application areas such as homeland security, some things are really hard/impossible to collect).

  • Don Buckshaw

    Like you, I felt that I was missing the data-centric classes in college, which is why I am enrolled in a 2nd masters, this time in predictive analytics and not Ops Research. I find the two skill sets both different and complementary. I have to say, it is a lot of fun to be given the data up front and then build a model! But there is also something satisfying about figuring out what data you need and then finding it if you can. Well, sometimes satisfying and sometimes frustrating. But sometimes, folks won’t budge from one position (model-centric or data-centric).

    One lesson I learned early on from Dennis Buede was this: Start with a small model – I am thinking of a probabilistic model of a system. Collect data for this small model. Then run a sensitivity analysis and find the key drivers. Then build out those areas in greater detail. In contrast, the data-centric analytics folks might build a large model using all the data, then determine which factors can be dropped off in the interest of parsimony and to remove collinearity concerns. Both valid, but opposite in thinking!
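    A minimal sketch of that small-model-first workflow. The model and every number are invented for illustration: a toy probability-of-detection model with three independent sensors, followed by one-at-a-time sensitivity analysis to find the key driver worth modeling in greater detail.

```python
def p_detect(p1, p2, p3):
    """Toy model: probability that at least one of three independent
    sensors detects an event. Invented for illustration."""
    return 1.0 - (1.0 - p1) * (1.0 - p2) * (1.0 - p3)

base = {"p1": 0.8, "p2": 0.5, "p3": 0.2}  # made-up base-case estimates

def sensitivity(model, base, swing=0.2):
    """One-at-a-time sensitivity: vary each parameter by +/- 20% of its
    base value and record the output range it induces, largest first."""
    ranges = {}
    for name, value in base.items():
        lo = dict(base, **{name: value * (1 - swing)})
        hi = dict(base, **{name: value * (1 + swing)})
        ranges[name] = abs(model(**hi) - model(**lo))
    return sorted(ranges.items(), key=lambda kv: kv[1], reverse=True)

# p1 dominates here: it is the parameter worth refining with better data.
print(sensitivity(p_detect, base))
```

    The data-centric version would run in the opposite direction, as described above: fit a large model on everything available, then prune.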

    This makes me think of a problem I had at work. We were trying to collect data on implementing a new security feature for a computer network. This feature had never been used before. We built the model to describe its possible effects. This defined the data that we would need to collect. But one person refused to participate because they needed the actual data. I said, “The only way to get the data is by trying it out on the system. Why don’t we ask experts for their opinions?” The person responded, “I don’t like that. Why don’t we implement the feature, and then observe to see if it worked or not?” I responded with two comments:
    #1 We want to predict the effects beforehand to see if this is a good decision and to compare it with other strategies.
    #2 What you suggest (implementing and observing) is the same strategy I suggested, except that I am drawing on the experience of someone who has made similar decisions over the past 15 years.
    My arguments failed. I could certainly see his point. But if I had taken that approach when I started this work 15 years ago, I would still be waiting for data to this day.

  • prubin73

    Laura,

    Given your response to Barry, and assuming that by “data” you mean numeric data, I somewhat cautiously agree with you. Starting by blocking out a conceptual model (after observing the system) helps sort out what data you actually need (as opposed to what data is handed to you), and helps avoid having the data drive the model. Even in regression analysis, it’s well known that if you stare at the data first and then pick a model that seems to fit it, with no a priori rationale, you can be led seriously astray.

    As to what I meant with the numeric data qualification, I consider facts acquired by asking questions, observing the existing system, doing ride-alongs etc. to be “data”, and I think it’s a good idea to have that in hand before beginning to write a model. Otherwise, you’re likely to wind up with one of those wonderfully elegant models that academics love to publish and real people cannot use. There’s a commercial running on TV right now where a little kid keeps asking “why” over and over. Assuming he learns some math down the road, I think he’s got a future as an OR analyst.

  • Laura McLay

    @Prubin73, Yes, I referred to numerical data here. I assumed that observing-the-system data/information that is collected to be a necessity. I suppose that assumption is worth a discussion, but like you, I tend to think it’s necessary.

    As for the rest of your comments, I somewhat cautiously agree with you, too. I see the writing on the wall: we live in the data age and will continue to have oodles of data for the foreseeable future. It’s hard to imagine projects that are not data driven. However, with optimization, I *can* envision some scenarios where the data is important but is not front and center. I think the reason is that optimizing the system doesn’t describe the system dynamics (captured in the data); it provides guidance for how the system *should* operate (which is then no longer captured in the data). The data collected are useful, but they may reflect suboptimal system operation. The data would not describe a system that is subsequently optimized.

    I am probably doing a horrible job describing this process. I am currently working on a time-series model, where the data are more centrally important than data have been for any optimization model I’ve worked on. Again, I wouldn’t draw the same conclusions for all types of OR/MS methodologies.

  • prubin73

    Laura, your description is not at all “horrible” (I followed it, and I’m only half awake at the moment). You’re also on the edge of something Milan Zeleny wrote about in his description of “De Novo Programming.” (Disclaimer: I’m not sure you should trust my interpretation of what he wrote, but that won’t stop me from giving it.) You point out that the data may describe suboptimal operation of the system. It may also describe optimal operation of a suboptimally designed system. For instance, the data will reflect historical limits on budget, capacity, etc., which we may think of as part of the system design. Arguably, the analyst should model some or all of those limits as variables rather than parameters; but in a data-centric approach, that may be even less likely than usual.

  • Marc-André Carle

    Sorry to comment on an old post, but I just read it when you linked to it in your Jan. 16th post. I was introduced late to hypothesis-driven models (as opposed to data-driven models). Carlos Daganzo’s book “Logistics Systems Analysis” is what convinced me to give credit to this approach. I would consider it a first step in some projects.
