Tuesday, May 30, 2017

Trim-and-fill just doesn't work

The last couple of years have seen an exciting explosion in new techniques for detecting and adjusting for publication bias. If you're on the cutting edge of meta-analysis, you can now choose between p-curve, p-uniform, PET, PEESE, PET-PEESE, Top-10, and selection-weight models. If you're not on the cutting edge, you're probably just running trim-and-fill and calling it a day.

Looking at all these methods, my colleagues and I got to wondering: Which of these methods works best? Are some always better than others, or are there certain conditions under which each works best? Should we use p-curve or PET-PEESE? Does trim-and-fill work at all?

Today Evan Carter, Felix Schönbrodt, Will Gervais, and I have finished an exciting project in which we simulated hundreds of thousands of research literatures, then held a contest between the methods to see which does best at recovering the true effect size.

You can read the full paper here. For this blog post, I want to highlight one finding: that the widely-used trim-and-fill technique seems to be wholly inadequate for dealing with publication bias.

https://www.youtube.com/watch?v=8DlJUrPtm3I
One of the outcomes we evaluated in our simulations was mean error, or the bias. When statistically significant results are published and non-significant results are censored, doing a plain-vanilla meta-analysis is gonna give you an estimate that's much too high. To try to handle this, people use trim-and-fill, hoping that it will give a less-biased estimate.

Unfortunately, trim-and-fill is not nearly strong enough to recover an estimate of zero when the null hypothesis is true. In terms of hypothesis testing, then, meta-analysis and trim-and-fill seem hopeless -- given any amount of publication bias, you will conclude that there is a true effect.

In the figure here I've plotted the average estimate from plain-vanilla random-effects meta-analysis (reMA) and the average estimate from trim-and-fill (TF). I've limited it to meta-analyses of 100 studies with no heterogeneity or p-hacking. Each facet represents a different true effect size, marked by the horizontal line. As you go from left to right, the percentage of studies forced to be statistically significant goes from 0% to 60% to 90%.

As you can see, when the null is true and there is moderate publication bias, the effect is estimated as d = 0.3. Trim-and-fill nudges that down to about d = 0.25, which is still not enough to prevent a Type I error rate of roughly 100%.

Indeed, trim-and-fill tends to nudge the estimate down by about 0.05 regardless of how big the true effect is or how strong the publication bias is. Null, small, and medium effects will all be estimated as medium effects, and the null hypothesis will always be rejected.
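
To get a feel for why the naive estimate ends up so far from zero, here is a toy simulation. It is only a sketch under my own simplifying assumptions, not the code from our paper: each observed d is drawn from a normal approximation to its sampling distribution, every study has 30 participants per group (a made-up number), "forced to be significant" means redrawing the study until it comes out significant in the positive direction, and the pooled estimate is a simple inverse-variance average (which is where a random-effects model lands when there is no heterogeneity). It does not implement trim-and-fill.

```python
import math
import random


def simulate_literature(true_d, k, prop_forced_sig, n_per_group, rng):
    """Simulate k published studies as (observed d, standard error) pairs.

    Assumption: d is approximately normal with SE = sqrt(2 / n_per_group).
    A study flagged as 'forced' is redrawn until it is significantly positive.
    """
    se = math.sqrt(2.0 / n_per_group)
    studies = []
    for _ in range(k):
        forced = rng.random() < prop_forced_sig
        while True:
            d = rng.gauss(true_d, se)
            if not forced or d / se > 1.96:
                studies.append((d, se))
                break
    return studies


def pooled_estimate(studies):
    """Inverse-variance weighted mean of d and its standard error."""
    weights = [1.0 / se ** 2 for _, se in studies]
    total = sum(weights)
    estimate = sum(w * d for w, (d, _) in zip(weights, studies)) / total
    return estimate, math.sqrt(1.0 / total)


if __name__ == "__main__":
    rng = random.Random(2017)
    for bias in (0.0, 0.6, 0.9):
        literature = simulate_literature(0.0, 100, bias, 30, rng)
        est, se = pooled_estimate(literature)
        print(f"forced significant: {bias:.0%}  estimate: {est:.2f}  z: {est / se:.1f}")
```

With a true effect of zero, the unbiased literature should hover around zero, while the biased literatures should give a clearly positive estimate with a z-statistic far past 1.96. The exact numbers depend on the assumed sample size and will not match the figure.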

Our report joins the chorus of similar simulations from Moreno et al. (2009) and Simonsohn, Nelson, and Simmons (2014) indicating that trim-and-fill just isn't up to the job.

I ask editors and peer reviewers everywhere to stop accepting trim-and-fill and fail-safe N as publication bias analyses. These two techniques are quite popular, but trim-and-fill is too weak to adjust for any serious amount of bias, and fail-safe N doesn't even tell you whether there is bias.

For what you should use, read our preprint!!

Sunday, May 14, 2017

Curiously strong effects

The reliability of scientific knowledge can be threatened by a number of bad behaviors. The problems of p-hacking and publication bias are now well understood, but there is a third problem that has received relatively little attention. This third problem currently cannot be detected through any statistical test, and its effects on theory may be stronger than those of p-hacking.

I call this problem curiously strong effects.

The Problem of Curiously Strong

Has this ever happened to you? You come across a paper with a preposterous-sounding hypothesis and a method that sounds like it would produce only the tiniest change, if any. You skim down to the results, expecting to see a bunch of barely-significant results. But instead of p = .04, d = 0.46 [0.01, 0.91], you see p < .001, d = 2.35 [1.90, 2.80]. This unlikely effect is apparently not only real, but it is four or five times stronger than most effects in psychology, and it has a p-value that borders on impregnable. It is curiously strong.



The result is so curiously strong that it is hard to believe that the effect is actually that big. In these cases, if you are feeling uncharitable, you may begin to wonder if there hasn't been some mistake in the data analysis. Worse, you might suspect that perhaps the data have been tampered with or falsified.

Curiously strong results can have lasting effects on future research. Naive researchers are likely to accept the results at face value, cite them uncritically, and attempt to expand upon them. Less naive researchers may still be reassured by the highly significant p-values and cite the work uncritically. Curiously strong results can enter meta-analyses, heavily influencing the mean effect size, Type I error rate, and any adjustments for publication bias.

Curiously strong results might, in this way, be more harmful than p-hacked results. With p-hacking, the results are often just barely significant, yielding the smallest effect size that is still statistically significant. Curiously strong results are much larger and have greater leverage on meta-analysis, especially when they have large sample sizes. Curiously strong results are also harder to detect and criticize: We can recognize p-hacking, and we can address it by asking authors to provide all their conditions, manipulations, and outcomes. We don't have such a contingency plan for curiously strong results.
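
To illustrate the leverage point, here is a minimal sketch with made-up numbers. Suppose a literature of ten studies that each report d = 0.20 with a standard error of 0.20 (hypothetical values), and then add one result like the d = 2.35 [1.90, 2.80] from the example above, whose standard error I back out from the confidence interval assuming it is a symmetric 95% normal interval. The pooling is a plain inverse-variance average, not any particular meta-analysis package.

```python
def pooled_d(studies):
    """Inverse-variance weighted mean of (d, standard error) pairs."""
    weights = [1.0 / se ** 2 for _, se in studies]
    return sum(w * d for w, (d, _) in zip(weights, studies)) / sum(weights)


# Ten hypothetical studies, each with d = 0.20 and SE = 0.20.
ordinary = [(0.20, 0.20)] * 10

# One curiously strong result; SE recovered from the 95% CI half-width.
strong_se = (2.80 - 1.90) / (2 * 1.96)
curious = (2.35, strong_se)

before = pooled_d(ordinary)
after = pooled_d(ordinary + [curious])
print(f"pooled d without the strong result: {before:.2f}")
print(f"pooled d with the strong result:    {after:.2f}")
```

Under these assumptions a single study drags the pooled estimate well above what the other ten studies agree on, and the pull grows as its standard error shrinks, that is, as its sample size grows.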

What should be done?

My question to the community is this: What can or should be done about such implausible, curiously strong results?

This is complicated, because there are a number of viable responses and explanations for such results:

1) The effect really is that big.
2) Okay, maybe the effect is overestimated because of demand effects. But the effect is probably still real, so there's no reason to correct or retract the report.
3) Here are the data, which show that the effect is this big. You're not insinuating somebody made the data up, are you?

In general, there's no clear policy on how to handle curiously strong effects, which leaves the field poorly equipped to deal with them. Peer reviewers know to raise objections when they see p = .034, p = .048, p = .041. They don't know to raise objections when they see d = 2.1 or r = 0.83 or η² = .88.

Nor is it clear that curiously strong effects should be a concern in peer review. One could imagine the problems that ensue when one starts rejecting papers or flinging accusations because the effects seem too large. Our minds and our journals should be open to the possibility of large effects.

The only solution I can see, barring some corroborating evidence that leads to retraction, is to try to replicate the curiously strong effect. Unfortunately, that takes time and expense, especially considering how replications are often expected to collect substantially more data than original studies. Even after the failure to replicate, one has to spend another 3 or 5 years arguing about why the effect was found in the original study but not in the replication. ("It's not like we p-hacked this initial result -- look at how good the p-value is!")

It would be nice if the whole mess could be nipped in the bud. But I'm not sure how it can.

A future without the curiously strong?

This may be naive of me, but it seems that in other sciences it is easier to criticize curiously strong effects, because the prior expectations on effects are more precise.

In physics, theory and measurement are well-developed enough that it is a relatively simple matter to say "You did not observe the speed of light to be 10 mph." But in psychology, one can still insist with a straight face that (to make up an example) subliminal luck priming leads to a 2 standard deviation improvement in health.

In the future, we may be able to approach this enviable state of physics. Richard, Bond Jr., and Stokes-Zoota (2003) gathered up 322 meta-analyses and concluded that the modal effect size in social psych is r = .21, approximately d = 0.42. (Note that even this is probably an overestimate considering publication bias.) Simmons, Nelson, and Simonsohn (2013) collected data on obvious-sounding effects to provide benchmark effect sizes. Together, these reports show that an effect of d > 2 is several times stronger than most effects in social psychology and stronger even than obvious effects like "men are taller than women (d = 1.85)" or "liberals see social equality as more important than conservatives do (d = 0.69)".
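
For readers who want to move between r and d themselves, here is a small sketch of the textbook conversion, d = 2r / sqrt(1 - r^2), which assumes two groups of equal size. The function name is my own. For small correlations this is close to simply doubling r, which is where the rough d = 0.42 above comes from.

```python
import math


def r_to_d(r):
    """Convert a correlation r to Cohen's d, assuming two equal-sized groups."""
    return 2.0 * r / math.sqrt(1.0 - r ** 2)


print(f"r = .21 corresponds to d of about {r_to_d(0.21):.2f}")
print(f"r = .83 corresponds to d of about {r_to_d(0.83):.2f}")
```

The conversion is nonlinear, so large correlations translate into very large values of d: an r in the .80s corresponds to a d of about 3.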

By using our prior knowledge to describe what is within the bounds of psychological science, we could tell what effects need scrutiny. Even then, one is likely to need corroborating evidence to garner a correction, expression of concern, or retraction, and such evidence may be hard to find.

In the meantime, I don't know what to do when I see d = 2.50 other than to groan. Is there something that should be done about curiously strong effects, or is this just another way for me to indulge my motivated reasoning?