Seed Magazine has an excellent article entitled “Dirty Little Secret” with a subtitle of “Are most published research findings actually false? The case for reform.” (Thanks to the Complex Intelligent Systems blog for pointing to this). The article begins with some amazing statistics:
In a 2005 article in the Journal of the American Medical Association, epidemiologist John Ioannidis showed that among the 45 most highly cited clinical research findings of the past 15 years, 99 percent of molecular research had subsequently been refuted. Epidemiology findings had been contradicted in four-fifths of the cases he looked at, and the usually robust outcomes of clinical trials had a refutation rate of one in four.
Why has this happened?
The culprits appear to be the proverbial suspects: lies, damn lies, and statistics. Jonathan Sterne and George Smith, a statistician and an epidemiologist from the University of Bristol in the UK, point out in a study in the British Medical Journal that “the widespread misunderstanding of statistical significance is a fundamental problem” in medical research. What’s more, the scientist’s bias may distort statistics. Pressure to publish can lead to “selective reporting;” the implication is that attention-seeking scientists are exaggerating their results far more often than the occasional, spectacular science fraud would suggest.
Ioannidis’ paper “Why Most Published Research Findings Are False” offers a more in-depth examination of why this happens. This is no surprise to those who understand statistical significance. If 20 groups do similar research, it is pretty likely that at least one group will have a finding “at the 95% significance level” just by blind luck. And if that is the group that gets to publish (since proof of significance is much more interesting than non-proof), then the scientific literature will, by its nature, publish false findings. This is made worse by issues of bias, conflict of interest, and so on.
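To put a number on the “blind luck” effect: if 20 independent groups each test a true null hypothesis at the 5% level, the chance that at least one of them sees a “significant” result is 1 - 0.95^20, or roughly 64%. A quick back-of-the-envelope calculation (my own illustration, not from Ioannidis’ paper):

```python
# Chance that at least one of 20 independent groups, each testing a true null
# hypothesis at the 5% significance level, reports a "significant" finding
# purely by luck.
groups = 20
alpha = 0.05
p_at_least_one_false_positive = 1 - (1 - alpha) ** groups
print(f"P(at least one false positive) = {p_at_least_one_false_positive:.2f}")  # about 0.64
```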
Ioannidis continues with some corollaries on when it is more likely that published research is false:
Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
This got me thinking about the situation in operations research. How much of the published operations research literature is false? One key way of avoiding false results is replicability. If a referee (or better, any reader of the paper) can replicate the result, then many of the causes for false results go away. The referee may be wrong sometimes (even for a true result tested at the 95% level, a replication will come out “wrong” about 5% of the time by chance), but the number of false publications decreases enormously.
Most major medical trials are not replicable: the cost of large trials is enormous, and the time required is too long. So you can expect false publications.
Much of OR is replicable. For more mathematical work, the purpose of a proof is to allow the reader to replicate the theorem. For most papers, this is straightforward enough that the field can have confidence in the theorems. Of course, sometimes subtle mistakes are made (or not so subtle if the referees did not do their jobs), and false theorems are published. And sometimes the proof of a theorem is so complex (or reliant on computer checking) that the level of replicability decreases. But I think it is safe to say that the majority (perhaps even the vast majority) of theorems in operations research are “true” (they may, of course, be irrelevant or uninteresting or useless, but that is another matter).
For computational work, the situation is less rosy. Some types of claims (“This is a feasible solution with this value to this model”, for example) should be easy to check, but generally are not. This leads to problems and frustrations in the literature. In my own areas of research, there are competing claims of a coloring with k colors and a clique of size k+1 on the same graph, which is an inconsistency: a clique of size k+1 forces any coloring to use at least k+1 colors. I am putting together a system to try to clear out those sorts of problems. But other claims are harder to check. A claim that X is the optimal solution for this instance of a hard combinatorial problem is, by its nature, difficult to show. Worse are claims that “this heuristic is the best method for this problem,” which run into many of the same issues medical research does (with bias being perhaps even more of an issue).
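To make the first kind of claim concrete, here is a tiny, hypothetical check of a “feasible solution with this value” claim for a toy knapsack model; the instance data and the claimed solution are invented purely for illustration:

```python
# Hypothetical check of a "feasible solution with this value" claim for a toy
# 0-1 knapsack model.  The instance data and claimed solution are made up.
values   = [10, 7, 4, 9]
weights  = [ 5, 4, 3, 6]
capacity = 12
claimed_selection = [1, 0, 0, 1]   # items the claim says are packed
claimed_value     = 19             # objective value the claim reports

total_weight = sum(w * x for w, x in zip(weights, claimed_selection))
total_value  = sum(v * x for v, x in zip(values,  claimed_selection))

assert total_weight <= capacity, "claimed solution violates the capacity constraint"
assert total_value == claimed_value, "claimed objective value does not match"
print("claim verified: feasible with value", total_value)
```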
I think this is an issue that deserves much more thought. Very little of the computational work in this field is replicable (contrast this to my review of the work of Applegate, Bixby, Cook, and Chvatal on the Traveling Salesman Problem, which I think is much more open than 99.9% of the work in the field), and this leads to confusion, false paths, and wasted effort. We are much more able to replicate work than many fields, but I do not think we do so. So we have many more false results in our field than we should have.
I’d like to expand a bit on the difficulty of verifying and replicating computational results. Some computational results are extremely easy to verify because there are specific certificates that can be checked, while others are very difficult to verify because no such certificate exists.
For example, if someone claims to have found a coloring of a particular graph with 24 colors, then it’s easy to verify that the claimed coloring is in fact a proper coloring of the graph. Furthermore, if someone else finds a clique of size 24 in the same graph, it’s easy to verify that this is a clique. The verified clique of size 24 and the verified coloring with 24 colors together establish that the chromatic number of this graph is 24.
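To illustrate, here is a small sketch of how those two certificates could be checked mechanically; the graph, coloring, and clique below are tiny stand-ins rather than the actual 24-color instance:

```python
# Sketch of checking the two certificates: a proper coloring (upper bound on
# the chromatic number) and a clique (lower bound).  The graph, coloring, and
# clique below are tiny stand-ins for illustration only.
edges    = {(0, 1), (0, 2), (1, 2), (2, 3)}     # undirected edge list
coloring = {0: "red", 1: "green", 2: "blue", 3: "red"}
clique   = [0, 1, 2]

def adjacent(u, v):
    return (u, v) in edges or (v, u) in edges

# Every edge must join differently colored vertices.
assert all(coloring[u] != coloring[v] for u, v in edges), "not a proper coloring"

# Every pair of clique vertices must be adjacent.
assert all(adjacent(u, v) for i, u in enumerate(clique) for v in clique[i + 1:]), "not a clique"

num_colors = len(set(coloring.values()))
if num_colors == len(clique):
    print(f"chromatic number is exactly {num_colors}")
```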
Now, consider another graph where the best known coloring uses 57 colors and the largest clique found has size 54. These results are certainly not in disagreement, but they leave room for a smaller coloring with 54, 55, or 56 colors. Someone might use branch and bound to prove that the coloring with 57 colors is optimal, but can this result be trusted?
If the researcher who ran the branch and bound procedure makes his executable code available, then others can repeat the run, but all this does is eliminate the possibility that the first author lied about the result or that his computer malfunctioned. It doesn’t do anything to eliminate the much more likely possibility of a bug in the branch and bound code.
If the researcher who ran the branch and bound procedure makes his source code available, then others can verify that the code produces the same solution and they can also check the code for bugs. Thus there is a major advantage to making source code freely available.
An alternative approach that can help to ensure the correctness of the result is to solve the same problem with an independently developed branch and bound code. However, it may be too much to expect other researchers to put this much effort into duplicating previously published research. It seems unlikely that journals would want to publish such independent confirmations of computational results.
In many cases, computational work in OR makes use of proprietary software for which source code may not be available. In such cases it’s very hard to validate a computational result.
All of this causes particular problems for referees. Should they be expected to replicate and validate all of the computational results in a paper? What if the computations used expensive proprietary software? What if the computations took months on a supercomputer? The general consensus seems to be that it’s not the referee’s job to do this.
In my own refereeing, I’ve insisted that a paper include enough information for the results to be checked by a sufficiently skilled, motivated, and financed reader. At the very least, I expect the authors to provide enough detail about their algorithm that others could implement it without having to make lots of choices about unspecified parts. Better yet, I prefer to see the authors make their source code available. This is rapidly becoming the norm in many areas of computational OR.
A common (and somewhat reasonable) exception involves code that is readily available as commercial software. In my opinion this is still undesirable, because it causes problems for anyone who might want to replicate the computational results but can’t afford the software.
Another issue is that as software is upgraded, it may no longer be possible to purchase (or even run) older versions. If a researcher publishes a result obtained with version 2.3 of a commercial software package, and the vendor is now up to version 3.0 and no longer licenses version 2.3, then it’s no longer possible to replicate the earlier result.
In addition to code, there’s also a need to make test problems available. For research involving randomly generated instances this really isn’t a problem, as long as the generator and the random seeds are published. However, for research involving problem instances that contain proprietary information about actual businesses, it can be problematic.
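For what it’s worth, one way to keep randomly generated instances reproducible is to publish the generator and the seed along with the paper. A minimal sketch, with a hypothetical generator and made-up parameters:

```python
import random

# A published generator plus a published seed makes a "random" test instance
# reproducible.  The generator and parameters here are purely illustrative.
def random_graph(n, p, seed):
    rng = random.Random(seed)          # fixed seed: everyone gets the same graph
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

instance = random_graph(n=100, p=0.1, seed=20070213)
print(len(instance), "edges")          # identical on every run
```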
Interesting article, indeed.
Looking forward to the day someone finds that Ioannidis’ statistics were wrong 🙂
Wow, this is amazing, I’ll never look at a scientific publication the same again!
A says “All Researchers lie”. What if A is a researcher 🙂 ?