Wednesday, June 1, 2016

Extraordinary evidence

Everyone seems to agree with the saying "extraordinary claims require extraordinary evidence." But what exactly do we mean by it?

In previous years, I'd taken this to mean that an improbable claim requires a dataset with strong probative value, e.g. a very small p-value or a very large Bayes factor. Extraordinary claims have small prior probability and need strong evidence if they are to be considered probable a posteriori.
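The arithmetic behind that reading is just Bayes' rule in odds form: posterior odds equal prior odds times the Bayes factor. A minimal sketch (the specific numbers are invented for illustration):

```python
def posterior_prob(prior, bayes_factor):
    """Posterior probability via Bayes' rule in odds form:
    posterior odds = prior odds * Bayes factor."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)

# A claim with a 1-in-1000 prior stays improbable under modest evidence,
# and only reaches roughly even odds under a Bayes factor near 1000.
print(round(posterior_prob(0.001, 10), 4))    # ~0.0099
print(round(posterior_prob(0.001, 1000), 4))  # ~0.5003
```

In other words, the smaller the prior probability, the larger the Bayes factor must be before the claim is probable a posteriori.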

However, this is not the only variety of extraordinary claim. Suppose that someone tells you that he has discovered that astrological signs determine Big Five personality scores. You scoff, expecting that he has run a dozen tests and wrestled out a p = .048 here or there. But no, he reports strong effects on every outcome: all are p < .001, with correlations in the r = .7 range. If you take the results at face value, they clearly constitute strong evidence of an effect.

Is this extraordinary evidence? In a sense, yes. The Bayes factor or likelihood ratio or whatever is very strong. But nested within this extraordinary evidence is another extraordinary claim: that his study found these powerful results. These effects are unusually strong for personality psychology in general, much less for astrology and personality in particular.

What kind of extraordinary evidence is needed to support that claim? In this post-LaCour-fraud, post-Reinhart-Rogoff-Excel-error world, I would suggest that more is needed than a screenshot of some SPSS output.

In ascending order of rigor, authors can support their extraordinary evidence by providing the following:

  1. The post-processed data necessary to recreate the result.
  2. The pre-processed data (e.g., single-subject E-Prime files; single-trial data).
  3. All processing scripts that turn the pre-processed data into the post-processed data.
  4. Born-open data: data that an automated script commits with Git and uploads to the cloud as it is collected. This extends the above -- it places the pre-processed data on a central, third-party GitHub server, where each upload is timestamped.

Providing access to the above gives greater evidence that:

  1. The data are real, 
  2. The results match the data, 
  3. The processed data are an appropriate function of the preprocessed data, 
  4. The data were collected and uploaded over time, rather than cooked up in Excel overnight, and
  5. The data were not tampered with between initial collection and final report.
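The born-open step (point 4 above) can be automated with a small script run by a scheduler after each session. This is a hedged sketch, not any particular lab's implementation; the repository path and commit message are placeholders:

```python
import subprocess

def born_open_commands(message):
    """The git commands a nightly archival job would run, in order:
    stage everything, commit with a descriptive message, push."""
    return [
        ["git", "add", "-A"],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]

def archive_data(repo_dir, message):
    """Run the commands inside the data repository. check=False lets the
    job finish quietly on days when there is nothing new to commit."""
    for cmd in born_open_commands(message):
        subprocess.run(cmd, cwd=repo_dir, check=False)
```

Scheduled with cron or Task Scheduler, every session's data reaches the third-party server the day it is collected, so the commit history itself documents that the data accumulated over time rather than appearing all at once.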

If people do not encourage data archival, a frustrating pattern may emerge: Researchers report huge effect sizes with high precision. These whopping results have considerable influence on the literature, meta-analyses, and policy decisions. However, when the data are requested, it is discovered that the data were hit by a meteor, or stolen by Chechen insurgents, or chewed up by a slobbery old bulldog, or something. Nobody is willing to discard the outrageous effect size from meta-analysis for fear of bias, or of appearing biased. Techniques to detect and adjust for publication bias and p-hacking, such as P-curve and PET-PEESE, would be powerless so long as a few high-effect-size farces remain in the dataset.
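To see why, consider PET, which regresses observed effect sizes on their standard errors and takes the intercept (the predicted effect of an infinitely precise study) as the bias-adjusted estimate. A fabricated study that is both large-effect and high-precision does not look like small-study bias at all, so it drags the intercept up instead of being adjusted away. A toy sketch with simulated numbers (not real studies):

```python
import numpy as np

def pet_estimate(effects, ses):
    """PET: least-squares regression of effect size on standard error;
    the intercept is the bias-adjusted effect estimate."""
    X = np.column_stack([np.ones_like(ses), ses])
    coef, *_ = np.linalg.lstsq(X, effects, rcond=None)
    return coef[0]

# Simulated honest literature: effects shrink as precision grows,
# the classic small-study-bias pattern. PET's intercept is near zero.
ses = np.array([0.30, 0.25, 0.20, 0.15])
effects = np.array([0.60, 0.50, 0.40, 0.30])
print(pet_estimate(effects, ses))  # ~0.0

# Add one "study" with a huge effect and a tiny standard error --
# exactly what a fabricated dataset looks like. The intercept jumps.
ses2 = np.append(ses, 0.05)
effects2 = np.append(effects, 0.70)
print(pet_estimate(effects2, ses2))  # ~0.55
```

The adjustment method treats precision as a proxy for trustworthiness, so a farce that fakes precision is rewarded rather than corrected.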

[Image: The inevitable fate of many suspiciously successful datasets.]
As Nick Brown points out, this may be the safest strategy for fraudsters. At present, psychologists are not expected to be competent custodians of their own data. Little of graduate training concerns data archival. It is not unusual for data to go missing, and I have yet to find anybody who has been censured for failing to preserve their data. In contrast, accusations of fraud or wrongdoing require strong evidence -- the kind that can only be obtained by looking at the raw data, or perhaps by finding the same mistake made repeatedly across a lifetime of fraudulent research. Somebody could go far by making up rubbish and saying the data were stolen by soccer hooligans, or whatever.

For a stronger, more replicable science, we must do more to train scientists in data management and incentivize data storage and sharing. Open science badges are nice. They let honest researchers signal their honesty. But they are not going to save the literature so long as meta-analysis and public policy statements must tiptoe around closed-data (or the-dog-ate-my-data) studies with big, influential results.
