> When faked data are uncovered, they are very often crude and "obvious" in their construction. Are we not catching smarter fakers because most fakers are not smart, or because the data they fake is too realistic for us to spot them?
>
> — Nick Brown (@sTeamTraen) January 30, 2020
My answer is that we are not spotting the competent frauds. This becomes obvious when we think about all the steps that are necessary to catch a fraud:
- The fraudulent work must be suspicious enough to get a closer look.
- Somebody must be motivated to take it upon themselves to take that closer look.
- That motivated person must have the necessary skill to detect the fraud.
- The research records available to that motivated and skilled person must be complete and transparent enough to detect the fraud.
- That motivated and skilled person must then be brave enough (foolish enough? equipped with lawyers enough?) to contact the research institution.
- That research institution must be motivated enough to investigate.
- That research institution must also be skilled enough to find and interpret the evidence for fraud.
Considering all these stages at which one could fail to detect or pursue misconduct, it seems immediately obvious to me that we are finding only the most obvious and least protected frauds.
Consider the "Boom, Headshot!" affair. I had read this paper several times and never suspected a thing; nothing in the summary statistics indicated any cause for concern. The only reason anybody discovered the deception was that Pat Markey was curious enough about the effect of skewness on the results to spend months asking the authors and journal for the data, and in doing so happened to discover values that had been edited by the grad student.
Are all frauds stupid?

Some of the replies to Nick's question imply that faking data convincingly is too much hassle compared to actually collecting data. If you know a lot about data and simulation, why would you bother faking data? This perspective assumes that fraud is difficult and requires skills that could be more profitably used for good. I don't think either of those assumptions is true.
Being good at data doesn't remove temptations for fraud

When news of the LaCour scandal hit, the first thing that struck me was how good this guy was at fancy graphics. Michael LaCour really knew his way around analyzing and presenting statistics in an exciting and accessible way.
But that's not enough to get LaCour's job offer at Princeton. You need to show that you can collect exciting data and get exciting results! When hundreds of quant-ninja, tech-savvy grad students are scrambling for a scant handful of jobs, you need a result that lands you on This American Life. And those of us on the tenure track have our own temptations: bigger grants, bigger salaries, nicer positions, and respect.
Some might even be tempted by the prospect of triumphing over their scientific rivals. Cyril Burt, once president of the British Psychological Society, was alleged to have made up extra twin pairs in order to silence critics of the link he had reported between genetics and IQ. Hans Eysenck, the most-cited psychologist of his time, published and defended dozens of papers supporting his views on the causes of cancer, using likely-fabricated data from his collaborator.
Skill and intellect and fame and power do not seem to be vaccines against misconduct. And it doesn't take a lot of skill to commit misconduct, either, because...
Frauds don't need to be clever

A fraud does not need a deep understanding of data to make a convincing enough forgery. A crude fake might get some of the complicated multivariate relationships wrong, sure. But will those be detected and prosecuted? Probably not.
[Image: You don't need to be the Icy Black Hand of Death to get away with data fakery. (img source fbi.gov)]
Why not? Those complicated relationships don't need to be reported in the paper. Nobody will think to check them. If they want to check them, they'll need to send you an email requesting the raw data. You can ignore them for some months, then tell them your dog ate the raw data, then demand they sign an oath of fealty to you if they're going to look at your raw data.
Getting the complicated covariation bits a little wrong is not likely to reveal a fraud, anyway. Can a psychologist predict even the first digit of simple correlations? A complicated relationship that we know less about will be harder to predict, and it will be harder to persuade co-authors, editors, and institutions that any misspecification is evidence of wrongdoing. Maybe the weird covariation can be explained away as an unusual feature of the specific task or study population. The evidence is merely circumstantial.
...because data forensics can rarely stop them.

Direct evidence requires some manner of internal whistleblower who notices and reports research misconduct. Again, one would need to actually see the misconduct, which is especially unlikely in today's projects, in which data and reports come from distant collaborators. Then one would need to actually blow the whistle, after which they might expect to lose their career and get stuck in a years-long court case. Even so, most frauds in psychology are caught this way (Stroebe, Postmes, & Spears, 2012).
In data forensics, by contrast, most evidence for misconduct is merely circumstantial. Noticing in the data very similar means and standard deviations or duplicated data points or duplicated images might be suggestive, but requires assumptions, and is open to alternative explanations. Maybe there was an error in data preprocessing, or the research assistants managed the data wrong, or someone used file IMG4015.png instead of IMG4016.png.
This circumstantial evidence means that nonspecific screw-ups are often a plausible alternative hypothesis. It seems entirely possible to me that a just-competent fraud could falsify a bunch of reports, plead incompetence, issue corrections as necessary, and refine their approach to data falsification for quite a long time.
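To make the flavor of such circumstantial checks concrete, here is a rough sketch of the logic behind the GRIM test (Brown & Heathers): a mean of integer-valued responses from n participants must equal some integer total divided by n, so certain reported means are arithmetically impossible. The sample sizes below are invented for illustration; and note that even a failed check has innocent explanations, exactly as above.

```python
import math

def grim_consistent(reported_mean, n, decimals=2):
    """Can `reported_mean`, rounded to `decimals` places, arise as
    (integer total) / n for a sample of n integer-valued responses?"""
    target = round(reported_mean, decimals)
    approx_total = reported_mean * n
    # Check integer totals near mean * n (a small window guards
    # against floating-point wobble).
    for total in range(math.floor(approx_total) - 1, math.ceil(approx_total) + 2):
        if round(total / n, decimals) == target:
            return True
    return False

# With a hypothetical n = 20, means must be multiples of 0.05:
grim_consistent(2.50, 20)   # → True  (50 / 20)
grim_consistent(2.53, 20)   # → False (no integer total works)
```

Even when the arithmetic flatly fails, the check only shows that the reported mean and the reported n cannot both be right; a typo in either one remains the boring, plausible explanation.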
A play in one act:
FRAUDSTER
The means were 2.50, 2.50, 2.35, 2.15, 2.80, 2.40, and 2.67.

DATA THUG
It is exceedingly unlikely that you would receive such consistent means. I suspect you have fabricated these summary statistics.

FRAUDSTER
Oops, haha, oh shit, did I say those were the means? Major typo! The means were actually, uh, 2.53, 3.12, 2.07, 1.89...

EDITOR
Ahh, nice to see this quickly resolved with a corrigendum. Bye everyone. We are fully committed to upholding the highest ethical standards etc. any concerns are thoroughly etc. etc.

FRAUDSTER (sotto voce)
That was close! Next time I fake data I will avoid this error.
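The data thug's objection in this skit can be made quantitative with a simple Monte Carlo check: under honest, independent sampling, group means should scatter around the grand mean with standard error roughly sd/√n. A sketch follows; the per-group n and population SD are invented here, and published forensic analyses of "too consistent" statistics are far more careful than this.

```python
import math
import random
import statistics

def p_too_consistent(observed_means, n_per_group, pop_sd, reps=20000, seed=42):
    """Estimate how often honest sampling would yield group means at least
    as tightly clustered as the observed ones.  A value near zero is
    circumstantial evidence that the means are 'too consistent'."""
    rng = random.Random(seed)
    grand_mean = statistics.fmean(observed_means)
    se = pop_sd / math.sqrt(n_per_group)   # standard error of each group mean
    observed_spread = statistics.stdev(observed_means)
    k = len(observed_means)
    hits = sum(
        statistics.stdev([rng.gauss(grand_mean, se) for _ in range(k)])
        <= observed_spread
        for _ in range(reps)
    )
    return hits / reps

# The fraudster's first set of means, with made-up n and SD:
reported = [2.50, 2.50, 2.35, 2.15, 2.80, 2.40, 2.67]
p = p_too_consistent(reported, n_per_group=20, pop_sd=1.0)
```

Notably, under these made-up parameters the skit's fabricated means come out unremarkable: the verdict of such a check depends entirely on assumptions about sample size and population variability, which is exactly why this kind of evidence stays circumstantial.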
The field isn't particularly trying to catch frauds, either.

Trying to prosecute fraud sounds terrible. It takes a very long time, it requires a very high standard of evidence, and lawyers get involved. It is for these reasons, I think, that the self-stated goal of many data thugs is to "correct the literature" rather than to "find and punish frauds".
But I worry about this blameless approach, because there's no guarantee that the data that appear in a corrigendum are any closer to the truth. If the original data were a fabrication, chances are good the corrigendum is just a report of slightly-better-fabricated data. And even if the paper is retracted, the perpetrator may learn from the experience, refine their fabrications, and enjoy a long, prosperous life of polluting the scientific literature.