Clustering is difficult only when it does not matter, Amit Daniely, Nati Linial, and Michael Saks, arXiv:1205.4891. […] this represents a move from worst-case complexity towards something more instance-based. The main idea here is that the only hard instances for clustering problems (under traditional worst-case algorithms) are ones in which the input is not actually clustered very well. Their definition of a “good clustering” seems very sensitive to outliers or noisy data, but perhaps that can be a subject for future work.
This paper really hit home for me. I have taught data mining quite often to the MBAs at the Tepper School and clustering is one topic I cover (in fact, research on clustering got me interested in data mining in the first place). I generally cover k-means clustering (easy to explain, nice graphics, pretty intuitive), and note that the clustering you end up with depends on the randomly-generated starting centroids. This is somewhat bothersome until you play with the method for a while and see that, generally, k-means works pretty well and pretty consistently as long as the data actually has a good clustering (with the correct number of clusters). It is only when the data doesn’t cluster well that k-means depends strongly on the starting clusters. This makes the starting centroid issue much less important: if it is important, then you shouldn’t be doing clustering anyway.
There are other operations research algorithms where I don’t think similar results occur. In my work in practice with integer programming, most practical integer programs turn out to be difficult to solve. There are at least a couple of reasons for this (in addition to the explanation “Trick is bad at solving integer programs”). Most obviously, easy problems typically don’t get the full “operations research” treatment. If the solution to a problem is obvious, it is less likely to require more advanced analytics (like integer programming and similar).
More subtly, there is a problem-solving dynamic at work. If an instance is easy to solve, then the decision maker will do something to make it harder. Constraints will be tightened (“What if we require at least 3 weeks between consecutive visits instead of just two?”) or details will be added (“Can we add the lunch breaks?”) until the result becomes a challenge to solve. I have not yet had a real-world situation where we exhausted the possibilities to add details to models or to further explore the possible sets of constraints. Eventually, we get to something that we can’t solve in a reasonable amount of time and we back up to something we can (just) solve. So we live on the boundary of what can be done. Fortunately, that boundary gets pushed back every year.
I am sure there is a lot of practical work in operations research that does not have this property. But I don’t think I will wake up one morning to see a preprint: “Integer programming is difficult only when it doesn’t matter”.