Thursday, March 3, 2016

A reading list for the Replicability Crisis

This is a partial reading list meant to be shared with those just learning about the replication crisis. Entries will be added over time.

Origins of the Crisis.

Cohen (1962) "The statistical power of abnormal-social psychological research." Cohen estimates that the typical psychology study has approximately 50% power, which is a little odd when you consider that nearly all published studies report significant effects.

Ioannidis (2005) "Why most published research findings are false." In this classic paper, Ioannidis presents a simple analysis. He demonstrates that when the base rate of discoveries is low (true findings are rare), the false positive rate is high (worse than the nominal 5%), and the false negative rate is high (most studies have <50% power), more than half of significant test results will be false positives: cases in which the null hypothesis is actually true.

The false positive rate is high because researchers have flexibility in how they analyze their data and will sometimes use questionable research practices to attain p < .05. The false negative rate is also high because samples are too small to reliably detect true effects. Ergo the conditional probability that a finding is true, given that it was published with p < .05, is much lower than we'd like.
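To make the arithmetic concrete, here is a minimal sketch of that conditional probability (often called the positive predictive value). The function name and the three scenarios are my own illustrative assumptions; none of the input numbers are taken from Ioannidis's paper.

```python
def positive_predictive_value(prior, power, alpha):
    """Share of significant results that reflect a true effect.

    prior: proportion of tested hypotheses that are actually true (assumed)
    power: probability of a significant result when the effect is real
    alpha: probability of a significant result when the null is true
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Illustrative inputs only (hypothetical, not from the paper).
scenarios = {
    "ideal (nominal alpha, high power)": (0.50, 0.80, 0.05),
    "rare true effects, low power": (0.10, 0.35, 0.05),
    "same, with inflated alpha": (0.10, 0.35, 0.30),
}
for label, (prior, power, alpha) in scenarios.items():
    ppv = positive_predictive_value(prior, power, alpha)
    print(f"{label}: PPV = {ppv:.2f}")
```

Under these assumed inputs, the second scenario already puts fewer than half of significant results on the side of a true effect, and inflating the false positive rate pushes the share lower still.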

Prinz, Schlange, & Asadullah (2011), "Believe it or not." Drug companies are often looking for ways to apply the remarkable biomedical discoveries announced in journals such as Science, Cell, or Nature. In this paper, the authors announce that in-house replication attempts at two major drug companies routinely failed to yield the results claimed in the journals. 

The above two papers are digested in a very readable Economist article, "Trouble at the Lab."

Fanelli (2012), "Negative results are disappearing." Fanelli looks at hypothesis tests published between 1990 and 2007. More than 80% of published psychology studies claim support for their primary hypothesis, which is again odd given that the average study has rather less than 80% power.

Preposterous Results.

Bem (2011). "Feeling the Future." Psychologist Daryl Bem reports a series of 9 experiments demonstrating that college undergraduates have precognitive, ESP-like abilities. The manuscript is published in the highly esteemed Journal of Personality and Social Psychology. Psychologists are shaken to find that typical statistical methods can support an impossible hypothesis as being true.

Simmons, Nelson, & Simonsohn (2011) "False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant". Inspired by Bem's demonstration of ESP, Simmons et al. demonstrate that, with enough flexibility, you can find anything in a study. In their demonstration, they torture a dataset until it yields evidence that listening to a song makes participants younger.

Their SSRN upload, "A 21-word solution" is a brief and effective remedy. They suggest scientists be encouraged to report everything they did that might have influenced the Type I error rate.
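The inflation of the Type I error rate that Simmons et al. describe can be sketched with a small simulation. This is not their procedure: it assumes a hypothetical researcher with three independent outcome measures and one round of optional stopping, and it uses a z-test with a known standard deviation of 1 to stay within the Python standard library. All function names and sample sizes are illustrative.

```python
import math
import random


def z_test_p(group_a, group_b):
    """Two-sided p-value for a difference in means, assuming known SD of 1."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(z) / math.sqrt(2))


def run_study(rng, flexible, n_start=20, n_extra=10, n_outcomes=3):
    """Simulate one two-group study in which no true effect exists.

    Returns True if the researcher can report p < .05.
    """
    outcomes = n_outcomes if flexible else 1
    # data[k] holds (group_a, group_b) scores for outcome measure k
    data = [([rng.gauss(0, 1) for _ in range(n_start)],
             [rng.gauss(0, 1) for _ in range(n_start)])
            for _ in range(outcomes)]
    if any(z_test_p(a, b) < 0.05 for a, b in data):
        return True
    if not flexible:
        return False
    # Optional stopping: nothing significant yet, so collect more and retest.
    for a, b in data:
        a.extend(rng.gauss(0, 1) for _ in range(n_extra))
        b.extend(rng.gauss(0, 1) for _ in range(n_extra))
    return any(z_test_p(a, b) < 0.05 for a, b in data)


def false_positive_rate(flexible, n_studies=5000, seed=1):
    """Proportion of null studies that end with a reportable p < .05."""
    rng = random.Random(seed)
    hits = sum(run_study(rng, flexible) for _ in range(n_studies))
    return hits / n_studies


print("strict analysis:  ", false_positive_rate(flexible=False))
print("flexible analysis:", false_positive_rate(flexible=True))
```

The strict analyst should land near the nominal 5%, while the flexible analyst, who reports whichever outcome and stopping point worked, should land well above it even though every simulated effect is zero.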

Failures to Replicate. 

Open Science Collaboration, 2015. "Estimating the Reproducibility of Psychological Science." The Center for Open Science organized a massive replication effort, 100 studies in 100 laboratories. Whereas all of the original studies had reported significant results, only ~39% of replications found similar results. This ~39% estimate is still the subject of some debate: See criticism from Gilbert et al. and replies from Simonsohn and Srivastava. The most effective summary seems to be "40% replicate, 30% did not replicate, 30% were inconclusive."

Many Labs Replication Project. In this study, several labs replicated each of several experiments. Again, some replicated, but others did not. There have been, to date, three different Many Labs projects.

Donnellan, Lucas, and Cesario (2015). A study by Bargh & Shalev (2012) reported a relationship between physical warmth and emotional warmth, manifested as a relationship between hot showers and feelings of loneliness. Donnellan et al. attempted, in nine studies, to replicate the Bargh & Shalev result. None succeeded.

Meta-analytic Signs of Bias.

Carter & McCullough (2015) "Publication bias and the limited strength model of self-control". A 2010 meta-analysis (Hagger et al.) concluded that "ego depletion," a form of fatigue in self-control, was a real and robust phenomenon, d = .6. Carter and McCullough find strong indications of publication and analytic bias; so much so that it was not clear whether the true effect was any different from zero. 

In response to Carter & McCullough, psychologists joined together to each perform a preregistered replication of a standard ego-depletion paradigm. Although the manuscript is not yet public, it has been announced that the project found zero evidence of ego depletion. An independent preregistered replication also finds no evidence for the phenomenon. Simine Vazire notes that an effect can be informally replicated in dozens of experiments but still falter in a strict, pre-registered replication.

Landy & Goodwin (2015). Reports claim that feelings of incidental disgust (e.g., smelling a noxious odor) can influence moral judgments. This meta-analysis finds signs of publication or analytic bias.

Flore & Wicherts (2015). "Does stereotype threat influence performance of girls in stereotyped domains?" The authors examine the hypothesis that reminding girls that they are expected to be bad at math harms their ability on a math test. This effect is one instance of "stereotype threat," thought to harm minorities' ability to succeed. The authors find signs of publication bias, and are not certain that there is a true effect.
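A rough sketch of why these meta-analyses worry about bias: if only significant results in the predicted direction get published, the published average can look substantial even when the true effect is zero. The filter, sample sizes, and function name below are assumptions for illustration, not the method of any paper listed here.

```python
import math
import random


def simulate_literature(true_d, n_per_group, n_studies, seed=1):
    """Return (all estimates, published estimates) of a standardized effect."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)  # standard error of d, assuming SD of 1
    all_d = [rng.gauss(true_d, se) for _ in range(n_studies)]
    # Assumed filter: only significant results in the predicted direction appear.
    published = [d for d in all_d if d / se > 1.96]
    return all_d, published


all_d, published = simulate_literature(true_d=0.0, n_per_group=20, n_studies=20000)
print("mean of all studies:      ", round(sum(all_d) / len(all_d), 2))
print("mean of published studies:", round(sum(published) / len(published), 2))
print("share published:          ", round(len(published) / len(all_d), 3))
```

With 20 participants per group and a true effect of zero, the full set of studies averages out near zero, but the small published subset averages a sizable positive d. Bias-detection methods try to recover the first number when only the second is visible.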

Pressures on Researchers.

Bakker, van Dijk, and Wicherts (2012). "Rules of the game called psychological science." Currently, scientists are chiefly evaluated by the degree to which they manage to publish. Publishing generally requires finding a p-value less than .05, a significant result. Bakker et al. perform simulations to compare the relative success of two scientists. One scientist is slow and careful, running well-powered experiments and not torturing the data. The other scientist is sloppy, running many small experiments and doing all the wrong things to get p < .05. Naturally, the good, careful scientist finds many fewer significant results than the bad, sloppy scientist. The implied long-term effects on hiring, funding, and tenure decisions are chilling.
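The incentive problem can be sketched analytically rather than by simulation. This is a simplified stand-in for Bakker et al.'s simulations, not a reproduction of them: it assumes a hypothetical budget of 200 participants, a z-test with known SD of 1, and a sloppy scientist whose only questionable practice is running five small studies and reporting any that reach significance.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))


def power_two_group(d, n_per_group, z_crit=1.959964):
    """Chance of two-sided p < .05 in a two-group z-test with known SD of 1."""
    shift = d * math.sqrt(n_per_group / 2)
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)


def chance_of_a_publishable_result(d, n_per_group, n_studies):
    """Chance that at least one of several independent studies is significant."""
    return 1 - (1 - power_two_group(d, n_per_group)) ** n_studies


# Hypothetical budget: 200 participants, spent two different ways.
for d in (0.0, 0.2):
    careful = chance_of_a_publishable_result(d, n_per_group=100, n_studies=1)
    sloppy = chance_of_a_publishable_result(d, n_per_group=20, n_studies=5)
    print(f"true d = {d}: careful {careful:.2f}, sloppy {sloppy:.2f}")
```

Under these assumptions the sloppy strategy is more likely to produce something publishable whether the effect is zero or small, which is the perverse incentive the paper describes.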


Bones (2012) "We knew the future all along." Brian Nosek's satirical alter-ego Arina K. Bones argues that Bem's (2011) ESP experiments should not have been published. But not because the results are wrong -- because the results are not new. Bones points out that an estimated 97% of psychology studies find exactly what they predicted, whereas Bem's college undergrads could only muster ~60% prediction. Bones concludes that social psychologists have unparalleled powers of second sight.


Finkel, Eastwick, & Reis (2015). "Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science." Finkel et al. suggest that the proposed reforms in response to the crisis do too much to reduce Type I error at the risk of inflating Type II error. They also argue that one-size-fits-all reforms will be counterproductive.

My own work.

A famous 2010 meta-analysis (Anderson et al.) summarizes the research on effects of violent video games on aggressive thoughts, feelings, and behaviors. The authors argue that there is minimal evidence for publication bias in the literature. They shared their data with me, and I performed my own analysis for publication bias. I find that there seems to be quite strong bias in experiments studying effects of violent games on aggressive behavior; so much so that there may not be an underlying effect.

My dissertation tested the hypothesis that brief exposure to a violent video game could increase aggressive behavior. Despite the decent sample size (N = 223), I could not detect such a main effect. The data are still being entered and analyzed, so results may change as more data are available or errors are found. At present, the effective sample size is N = 295, and the main effect of game violence is r = .06 [-.06, .17]. (For reference, Anderson et al. argue the effect is r = .21, larger than the upper bound of my confidence interval.)

Another of my papers explores common statistical mistakes made by both sides of the violent-games debate. Those who believe in the effect claim that their games are identical except for violent content. Our analyses indicate very little evidence of equivalence. Those who doubt the effect claim that their studies provide strong evidence against a violent-game effect. Our analyses indicate that some "failures to replicate" provide very little evidence against the effect. One or two even provide a little evidence for the effect.


  1. Thanks for this list! I hope graduate students will be taught these things during their education.

    "Whereas all of the original studies had reported significant results, only ~39% of replications found similar results. This ~39% estimate is still the subject of some debate: See criticism from Gilbert et al. and replies from Simonsohn and Srivastava. The most effective summary seems to be '40% replicate, 30% did not replicate, 30% were inconclusive.'"

    I wonder if it would be possible for researchers to come up with a universal manner of deciding whether a study replicated or not. It seems that different researchers use different criteria, which strikes me as problematic.

    1. With respect to your last question, you might be interested in this paper:
      The Replication Recipe: What makes for a convincing replication?

  2. May I add to your excellent list:

    Teaching Replication & How to do a replication study

    Janz, N. (2015) Bringing the Gold Standard Into the
    Class Room: Replication in University Teaching, International Studies Perspectives, Open access copy at:

    Brandt et al. (2014) The Replication Recipe: What makes for a
    convincing replication? Journal of Experimental Social Psychology, Vol 50, pp. 217-224. Copy at: http://tinyurl.com/poe474k

    King, Gary. (2006). How to Write a Publishable Paper as a Class
    Project, copy at: http://gking.harvard.edu/papers (with updates)

  3. Thank you for putting together this good list. As a psychologist, it reminds me to take the replication crisis in our field more seriously :)