The reliability of scientific knowledge can be threatened by a number of bad behaviors. The problems of p-hacking and publication bias are now well understood, but there is a third problem that has received relatively little attention. This third problem currently cannot be detected by any statistical test, and its effects on theory may be stronger than those of p-hacking.
I call this problem curiously strong effects.
The Problem of Curiously Strong
Has this ever happened to you? You come across a paper with a preposterous-sounding hypothesis and a method that sounds like it would produce only the tiniest change, if any. You skim down to the results, expecting to see a bunch of barely significant results. But instead of p = .04, d = 0.46 [0.01, 0.91], you see p < .001, d = 2.35 [1.90, 2.80]. This unlikely effect is apparently not only real but four or five times stronger than most effects in psychology, and it has a p-value that borders on impregnable. It is curiously strong.
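To put those two d values in perspective, one can convert d into the probability that a randomly chosen treated participant outscores a randomly chosen control participant, using the standard common-language conversion Φ(d/√2) and assuming two normal populations with equal variance. A minimal sketch in Python:

```python
from scipy.stats import norm

def prob_superiority(d):
    """P(random treatment score > random control score) for a given Cohen's d,
    assuming two normal populations with equal variance."""
    return norm.cdf(d / 2 ** 0.5)

for d in (0.46, 2.35):
    print(f"d = {d:.2f}: P(superiority) ~ {prob_superiority(d):.2f}")
# d = 0.46 -> about .63 (a modest edge); d = 2.35 -> about .95 (near-total separation)
```

A d of 2.35 implies the two groups barely overlap at all, which is exactly the kind of claim that should give a reader pause.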
The result is so curiously strong that it is hard to believe the effect is actually that big. If you are feeling uncharitable, you may begin to wonder whether there has been some mistake in the data analysis. Worse, you might suspect that the data have been tampered with or falsified.
Spuriously strong results can have lasting effects on future research. Naive researchers are likely to accept the results at face value, cite them uncritically, and attempt to expand upon them. Less naive researchers may still be reassured by the highly significant p-values and cite the work uncritically. Curiously strong results can enter meta-analyses, heavily influencing the mean effect size, Type I error rate, and any adjustments for publication bias.
Curiously strong results might, in this way, be more harmful than p-hacked results. With p-hacking, the results are often just barely significant, yielding the smallest effect size that is still statistically significant. Curiously strong results are much larger and have greater leverage on meta-analysis, especially when they have large sample sizes. Curiously strong results are also harder to detect and criticize: We can recognize p-hacking, and we can address it by asking authors to provide all their conditions, manipulations, and outcomes. We don't have such a contingency plan for curiously strong results.
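To illustrate that leverage with entirely made-up numbers, here is a quick fixed-effect (inverse-variance weighted) meta-analysis sketch in Python. The study values are hypothetical, chosen only to show how a single large-N, curiously strong result can dominate the pooled estimate:

```python
import numpy as np

# Hypothetical studies: five modest effects plus one curiously strong, large-N result.
d = np.array([0.15, 0.20, 0.25, 0.10, 0.30, 2.35])  # observed Cohen's d
n = np.array([50, 60, 55, 40, 45, 400])              # per-group sample size

# Approximate sampling variance of d for a two-group design with equal n per group
var_d = 2.0 / n + d ** 2 / (4.0 * n)
w = 1.0 / var_d                                      # inverse-variance weights

with_outlier = np.sum(w * d) / np.sum(w)
without_outlier = np.sum(w[:-1] * d[:-1]) / np.sum(w[:-1])
print(f"pooled d with the curiously strong study:    {with_outlier:.2f}")    # ~1.25
print(f"pooled d without the curiously strong study: {without_outlier:.2f}") # ~0.20
```

In this toy example, one well-powered but implausible study pulls the pooled estimate from roughly 0.2 to well over 1.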
What should be done?
My question to the community is this: What can or should be done about such implausible, curiously strong results?
This is complicated, because there are a number of viable responses and explanations for such results:
1) The effect really is that big.
2) Okay, maybe the effect is overestimated because of demand effects. But the effect is probably still real, so there's no reason to correct or retract the report.
3) Here are the data, which show that the effect is this big. You're not insinuating somebody made the data up, are you?
In general, there's no clear policy on how to handle curiously strong effects, which leaves the field poorly equipped to deal with them. Peer reviewers know to raise objections when they see p = .034, p = .048, p = .041. They don't know to raise objections when they see d = 2.1 or r = 0.83 or η² = .88.
Nor is it clear that curiously strong effects should be a concern in peer review. One could imagine the problems that would ensue if one started rejecting papers or flinging accusations because the effects seem too large. Our minds and our journals should be open to the possibility of large effects.
The only solution I can see, barring some corroborating evidence that leads to retraction, is to try to replicate the curiously strong effect. Unfortunately, that costs time and money, especially since replications are often expected to collect substantially more data than the original studies did. And even after a failure to replicate, one has to spend another three to five years arguing about why the effect was found in the original study but not in the replication. ("It's not like we p-hacked this initial result -- look at how good the p-value is!")
It would be nice if the whole mess could be nipped in the bud. But I'm not sure how it can.
A future without the curiously strong?
This may be naive of me, but it seems that in other sciences it is easier to criticize curiously strong effects, because prior expectations about effect sizes are more precise.
In physics, theory and measurement are well-developed enough that it is a relatively simple matter to say "You did not observe the speed of light to be 10 mph." But in psychology, one can still insist with a straight face that (to make up an example) subliminal luck priming led to a two-standard-deviation improvement in health.
In the future, we may be able to approach this enviable state of physics. Richard, Bond Jr., and Stokes-Zoota (2003) gathered up 322 meta-analyses and concluded that the modal effect size in social psych is r = .21, approximately d = 0.42. (Note that even this is probably an overestimate considering publication bias.) Simmons, Nelson, and Simonsohn (2013) collected data on obvious-sounding effects to provide benchmark effect sizes. Together, these reports show that an effect of d > 2 is several times stronger than most effects in social psychology and stronger even than obvious effects like "men are taller than women (d = 1.85)" or "liberals see social equality as more important than conservatives (d = 0.69)".
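For reference, the conversion behind that approximation is the standard two-group formula d = 2r / √(1 − r²). A one-liner makes it easy to see how quickly correlations translate into startling d values:

```python
def r_to_d(r):
    """Convert a correlation r to Cohen's d (standard two-group, equal-variance conversion)."""
    return 2 * r / (1 - r ** 2) ** 0.5

print(round(r_to_d(0.21), 2))  # ~0.43, i.e., roughly the d = 0.42 cited above
print(round(r_to_d(0.83), 2))  # ~2.98, the kind of value that should make a reviewer pause
```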
By using our prior knowledge to describe what is within the bounds of psychological science, we could tell which effects need scrutiny. Even then, one is likely to need corroborating evidence to garner a correction, expression of concern, or retraction, and such evidence may be hard to find.
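As a purely illustrative sketch of what "flagging for scrutiny" could mean, the benchmarks below come from the figures cited above, but the triage categories and thresholds are my own invention, not an established policy:

```python
# Benchmarks from the sources cited above; the triage categories are hypothetical.
TYPICAL_D = 0.42   # modal social-psychology effect (Richard et al., 2003)
OBVIOUS_D = 1.85   # "men are taller than women" benchmark (Simmons et al., 2013)

def scrutiny_level(d):
    """Crude triage for a reported effect size against field benchmarks."""
    if abs(d) <= TYPICAL_D:
        return "typical for the field"
    if abs(d) <= OBVIOUS_D:
        return "large; check the design, measures, and demand characteristics"
    return "curiously strong; ask for the data and plan a replication"

print(scrutiny_level(2.35))  # curiously strong; ask for the data and plan a replication
```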
In the meantime, I don't know what to do when I see d = 2.50 other than to groan. Is there something that should be done about curiously strong effects, or is this just another way for me to indulge my motivated reasoning?