Referees considered harmful

When doing empirical work, researchers often mess up either in the design of the experiment or in the analysis of data.  In operations research, much of our “empirical work” is in computational testing of algorithms.  Is algorithm A faster than algorithm B?  “It depends” is generally the only honest answer.  It depends on the instance selection, it depends on the computing environment, it depends on the settings, etc. etc.   If we are careful enough, we can say things that are (within the limits of the disclaimers) true.   But even a careful experiment can fall prey to issues.  For instance, throwing away “easy” instances can bias the results against whatever algorithm is used to determine easiness.  And don’t get me started on empirical approaches that test dozens of possibilities and miraculously find something “statistically significant”, to be duly marked with an asterisk in the table of results.    It is very difficult to truly do trustworthy empirical work.  And it is even harder to do such work when researchers cheat or reviewers don’t do their job.
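To see why the “test dozens of possibilities” trap works so reliably, here is a minimal Python sketch (my own illustration, not tied to any particular study): it compares two identical random “algorithms” across 30 hypothetical instance classes, and by chance alone a comparison or two will usually clear a nominal 5% threshold and earn its asterisk.

    import random

    # Sketch of the multiple-comparisons problem: both "algorithms" draw
    # runtimes from the same distribution, so no real difference exists.
    random.seed(0)

    def fake_runtimes(n=25):
        # 25 runtimes per instance class, identical distribution for A and B
        return [random.gauss(100.0, 10.0) for _ in range(n)]

    def mean(xs):
        return sum(xs) / len(xs)

    def welch_t(a, b):
        # Welch's t statistic for two independent samples
        va = sum((x - mean(a)) ** 2 for x in a) / (len(a) - 1)
        vb = sum((x - mean(b)) ** 2 for x in b) / (len(b) - 1)
        return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

    spurious = 0
    for trial in range(30):            # 30 "instance classes" tested
        t = welch_t(fake_runtimes(), fake_runtimes())
        if abs(t) > 2.01:              # ~5% two-sided cutoff at these sample sizes
            spurious += 1              # would get an asterisk in a results table

    print(f"'Significant' differences found by pure chance: {spurious} of 30")

With 30 comparisons at the 5% level, one or two “significant” results are expected from noise alone, which is exactly what a selective results table will happily report.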

For some fields, these issues are even more critical.  Operations research generally has some theoretical grounding:  we know about polytopes and complexity, and so on, and can prove theorems that help guide our empirical work.  In fields like Social Psychology (the study of people in their interactions with others), practically all that is known is due to the results of experiments.   The fundamental structure in this field is a mental state, something that can only be imprecisely observed.

Social psychology is in a bit of a crisis.  In a very real sense, the field no longer knows what is true.  Some of that crisis is due to academic malfeasance, particularly that of the influential researcher Diederik Stapel.  Stapel was found to have invented data for dozens of papers, as described in a “Science Insider” column.

Due to data fraud by Stapel and others, the field has to reexamine much of what it thought was true.  Are meat eaters more selfish than vegetarians?  We thought so for a while, but now we don’t know.  A Dutch report goes into great detail on this affair.

But overt fraud is not the only issue, as outlined in the report.  I was particularly struck by the role paper reviewers played in this deceit:

It is almost inconceivable that co-authors who analysed the data intensively, or reviewers of the international “leading journals”, who are deemed to be experts in their field, could have failed to see that a reported experiment would have been almost infeasible in practice, did not notice the reporting of impossible statistical results, … and did not spot values identical to many decimal places in entire series of means in the published tables. Virtually nothing of all the impossibilities, peculiarities and sloppiness mentioned in this report was observed by all these local, national and international members of the field, and no suspicion of fraud whatsoever arose.

And the role of reviewers goes beyond that of negligence:

Reviewers have also requested that not all executed analyses be reported, for example by simply leaving unmentioned any conditions for which no effects had been found, although effects were originally expected. Sometimes reviewers insisted on retrospective pilot studies, which were then reported as having been performed in advance. In this way the experiments and choices of items are justified with the benefit of hindsight.

Not infrequently reviews were strongly in favour of telling an interesting, elegant, concise and compelling story, possibly at the expense of the necessary scientific diligence.

I think it is safe to say that these issues are not unique to social psychology.  I too have, as a reviewer, pushed authors toward telling an interesting story, though I hope not at the expense of scientific diligence.   And perhaps I could have worked harder to replicate some results during the reviewing process.

I don’t think we in operations research are in crisis over empirical issues.  I am pretty confident that CPLEX 12.4 is faster than CPLEX 4.0 for practically any instance you can throw at it.  And some journals, like Mathematical Programming Computation, have attempted to seriously address these issues.  But I am also pretty sure that some things I think are true are not true, whether due to fraud by authors or to negligence by reviewers.

One important role of a reviewer is to be on the lookout for malfeasance or bias and to avoid allowing (or, worse, forcing) authors to present data in an untruthful way.  And I think many of us are not doing a great job in this regard.  I would hate to have to rethink the foundations of my field due to these issues.

3 thoughts on “Referees considered harmful”

  1. Reviewers and associate editors (and editors?) in OR and related disciplines tend not to want to publish “non-results” (something _didn’t_ happen), which I think are publishable in some scientific disciplines. Good luck finding a paper that says “these N metaheuristics pretty much all tied for fastest on problem class P”.

    I’ve also seen (more than once) demonstrably erroneous results get published. They’ve tended to be relatively harmless, but of course they end up getting cited (as if correct). So we have lack of diligence on the part of the reviewers compounded by lack of diligence on the part of subsequent authors (who cite the published results without verification/critical analysis).

  2. Good article. For the credibility of our field, we need to work on this.

    Personally, I trust the results of OR competitions far more than any paper claiming that algo A is better than B. Good OR competitions make it very hard to cheat: real-life problem, same hardware/time for all contestants, 20+ public datasets, 10+ hidden datasets (contestants don’t get them), all communication about the problem description is public, … The Google ROADEF competition was a beautiful example of that.
