On his economics blog, Greg Mankiw excerpts a review of Alan Greenspan’s new book:
Michael Kinsley reviews Alan Greenspan’s book (which I have not yet read). This passage from the review caught my eye:
You gotta love a guy whose idea of an important life lesson is: “I have always argued that an up-to-date set of the most detailed estimates for the latest available quarter is far more useful for forecasting accuracy than a more sophisticated model structure.” Words to live by.
Mike thinks Alan is being hopelessly geeky here. But I think Alan is talking to geeks like me, and also those who used to work for him on the staff of the Federal Reserve.
…
Better monetary policy, he suggests, is more likely to follow from better data than from better models. Relatively little modern macro has been directed at improving data sources. Perhaps that is a mistake.
This issue of data versus model is important in operations research. I must say that in my own consulting I am rather cavalier about data quality. On a project with the United States Postal Service, it was surprising how iffy much of the data was: we had to estimate, rather than measure, the amount of mail sent between LA and NY each year, for instance. My view was that the qualitative conclusions were robust to reasonable variations in the data. If the model suggested that the number of mail processing facilities could be cut by 25%, that conclusion held no matter what the exact data were. Further, the solutions we generated were going to be pretty good whatever the true data turned out to be: day-to-day operations could absorb any mis-estimates in our approach.
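Here is a minimal sketch of the kind of sensitivity check I mean. It is not the actual USPS model; the facility count, capacity, and flow volumes are all made up for illustration. The point is just that the qualitative recommendation (a cut of roughly 25%) survives plausible errors in the estimated data.

```python
# Toy sensitivity check: perturb estimated mail volumes and see whether the
# qualitative conclusion ("roughly 25% of facilities could be closed") holds.
# All numbers are invented; this is not the actual USPS model.
import math
import random

random.seed(1)

N_FACILITIES = 400           # hypothetical current facility count
CAPACITY = 1_000_000         # hypothetical annual capacity per facility (pieces)
flows = [random.uniform(50_000, 150_000) for _ in range(3000)]  # estimated O-D volumes

def facilities_needed(volumes):
    # Crude stand-in for the optimization model's recommendation.
    return math.ceil(sum(volumes) / CAPACITY)

base = facilities_needed(flows)
print(f"baseline: keep {base} of {N_FACILITIES} ({1 - base / N_FACILITIES:.0%} cut)")

# Re-solve under both systematic and per-flow estimation error.
for trial in range(5):
    bias = random.uniform(0.9, 1.1)                        # everyone over- or under-counted
    noisy = [v * bias * random.uniform(0.9, 1.1) for v in flows]
    n = facilities_needed(noisy)
    print(f"trial {trial}: keep {n} ({1 - n / N_FACILITIES:.0%} cut)")
```

Every run still recommends a substantial cut, which is the sense in which the conclusion is robust even if the exact percentage drifts a few points.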
I was not persuasive on this point: a tremendous amount of time and effort went into getting better data. In the end, that was wise, since the results stand up much better to detailed scrutiny. No group faced with the closure of a facility wants to hear that the data really doesn’t matter because Trick says so.
I do wonder how many OR projects actually end up with poor or wrong answers because the data was not of high enough quality. But it is a heck of a lot more fun to develop models than to hunt down missing data.
I agree with your point of view here. It would be much more rewarding (both for the researcher and for the practical outcome) to spend more time developing models that are robust to data quality than to try to correct the data. This is even more true in many modern fields where we have more data than we can handle.
Say a popular website generates many gigabytes of data per day. Trying to clean up that dataset would take many hours, even days, and in the meantime even more data piles up. It’s simply not practical. Better to focus on models that tolerate a moderate amount of dirty data.
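As a toy illustration (invented numbers, nothing from the post): a robust summary such as the median barely moves when a slice of the records is corrupted, while the plain mean is pulled far off.

```python
# Invented example: 5% of records are corrupted by a hypothetical logging bug.
# The median is nearly unaffected; the mean is badly distorted.
import random
import statistics

random.seed(0)

clean = [random.gauss(100, 10) for _ in range(100_000)]   # "true" measurements
dirty = clean.copy()
for i in random.sample(range(len(dirty)), k=5_000):       # corrupt 5% of records
    dirty[i] = random.uniform(1_000, 100_000)             # bogus values

for name, data in [("clean", clean), ("dirty", dirty)]:
    print(f"{name:>5}: mean={statistics.mean(data):10.1f}  "
          f"median={statistics.median(data):7.1f}")
```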
Of course, if most of the data is bad, then no amount of robust modeling would help.