Tuesday, May 30, 2017

Trim-and-fill just doesn't work

The last couple of years have seen an exciting explosion in new techniques for detecting and adjusting for publication bias. If you're on the cutting edge of meta-analysis, you can now choose between p-curve, p-uniform, PET, PEESE, PET-PEESE, Top-10, and selection-weight models. If you're not on the cutting edge, you're probably just running trim-and-fill and calling it a day.

Looking at all these methods, my colleagues and I got to wondering: Which of these methods works best? Are some always better than others, or are there certain conditions under which each works best? Should we use p-curve or PET-PEESE? Does trim-and-fill work at all?

Today Evan Carter, Felix Schonbrodt, Will Gervais, and I have finished an exciting project in which we simulated hundreds of thousands of research literatures, then held a contest between the methods to see which does best at recovering the true effect size.

You can read the full paper here. For this blog post, I want to highlight one finding: that the widely-used trim-and-fill technique seems to be wholly inadequate for dealing with publication bias.

https://www.youtube.com/watch?v=8DlJUrPtm3I
One of the outcomes we evaluated in our simulations was mean error, or the bias. When statistically significant results are published and non-significant results are censored, doing a plain-vanilla meta-analysis is gonna give you an estimate that's much too high. To try to handle this, people use trim-and-fill, hoping that it will give a less-biased estimate.
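To make the mechanism concrete, here is a minimal sketch of that kind of censoring. It is not the simulation code from our paper: the function names, the 20 participants per group, and the rule that a fixed share of published studies is redrawn until it comes out significant in the positive direction are all assumptions made for illustration. It also pools with a simple inverse-variance (fixed-effect) average rather than a full random-effects model, which is a reasonable stand-in only because no heterogeneity is simulated.

```python
import math
import random


def simulate_study(true_d, n_per_group, rng):
    """Simulate one two-group study and return (d, var_d, significant)."""
    treat = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    ctrl = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    m_t = sum(treat) / n_per_group
    m_c = sum(ctrl) / n_per_group
    ss_t = sum((x - m_t) ** 2 for x in treat)
    ss_c = sum((x - m_c) ** 2 for x in ctrl)
    sd_pooled = math.sqrt((ss_t + ss_c) / (2 * n_per_group - 2))
    d = (m_t - m_c) / sd_pooled
    # Large-sample variance of Cohen's d with equal group sizes.
    var_d = 2.0 / n_per_group + d ** 2 / (4.0 * n_per_group)
    # Normal approximation: "significant" means positive with z > 1.96.
    significant = d / math.sqrt(var_d) > 1.96
    return d, var_d, significant


def simulate_literature(true_d, k, n_per_group, forced_sig, rng):
    """Build a literature of k published studies.

    Assumed censoring rule: each published slot is, with probability
    forced_sig, redrawn until the study is significant.
    """
    studies = []
    for _ in range(k):
        must_be_sig = rng.random() < forced_sig
        while True:
            d, var_d, sig = simulate_study(true_d, n_per_group, rng)
            if sig or not must_be_sig:
                studies.append((d, var_d))
                break
    return studies


def pooled_estimate(studies):
    """Inverse-variance weighted mean of the study effect sizes."""
    weights = [1.0 / var_d for _, var_d in studies]
    weighted_sum = sum(w * d for w, (d, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    rng = random.Random(2017)
    # True effect is zero; vary the share of studies forced to be significant.
    for forced_sig in (0.0, 0.6, 0.9):
        literature = simulate_literature(0.0, 100, 20, forced_sig, rng)
        print(forced_sig, round(pooled_estimate(literature), 2))
```

The exact numbers depend on the assumed sample size and censoring rule, so they won't match our figure, but the pattern should be the same: the pooled estimate climbs away from zero as more of the literature is forced to be significant.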

Unfortunately, trim-and-fill is not nearly strong enough to recover an estimate of zero when the null hypothesis is true. In terms of hypothesis testing, then, meta-analysis and trim-and-fill seem hopeless -- given any amount of publication bias, you will conclude that there is a true effect.

In the figure here I've plotted the average estimate from plain-vanilla random-effects meta-analysis (reMA) and the average estimate from trim-and-fill (TF). I've limited it to meta-analyses of 100 studies with no heterogeneity or p-hacking. Each facet represents a different true effect size, marked by the horizontal line. As you go from left to right, the percentage of studies forced to be statistically significant goes from 0% to 60% to 90%.

As you can see, when the null is true and there is moderate publication bias, the effect is estimated as d = 0.3. Trim-and-fill nudges that down to about d = 0.25, which is still not enough to prevent a Type I error rate of roughly 100%.
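Why does a leftover d = 0.25 guarantee a false positive? With 100 studies the pooled standard error is tiny, so even a modest estimate sits many standard errors away from zero. Here's a rough back-of-the-envelope check. The 20 participants per group is an assumed value for illustration, not a number from our simulations, and pooled_z is a hypothetical helper that treats every study as having the same variance.

```python
import math


def pooled_z(d_bar, k, n_per_group):
    """Approximate z statistic for a pooled estimate d_bar from k equal-sized studies."""
    # Large-sample variance of Cohen's d for one study with equal group sizes.
    var_single = 2.0 / n_per_group + d_bar ** 2 / (4.0 * n_per_group)
    # Pooling k equally precise studies divides the variance by k.
    se_pooled = math.sqrt(var_single / k)
    return d_bar / se_pooled


if __name__ == "__main__":
    z = pooled_z(d_bar=0.25, k=100, n_per_group=20)
    print(round(z, 2), z > 1.96)
```

Under these assumptions the z statistic lands far above the 1.96 cutoff, so shaving 0.05 off the estimate does nothing for the hypothesis test.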

Indeed, trim-and-fill tends to nudge the estimate down by about 0.05 regardless of how big the true effect is or how strong the publication bias is. Null, small, and medium effects will all be estimated as medium effects, and the null hypothesis will always be rejected.

Our report joins the chorus of similar simulations from Moreno et al. (2009) and Simonsohn, Nelson, and Simmons (2014) indicating that trim-and-fill just isn't up to the job.

I ask editors and peer reviewers everywhere to stop accepting trim-and-fill and fail-safe N as publication bias analyses. These two techniques are quite popular, but trim-and-fill is too weak to adjust for any serious amount of bias, and fail-safe N doesn't even tell you whether there is bias.

For what you should use, read our preprint!!
