A psychologist's thoughts on how and why we play games

Monday, June 29, 2015

Putting PET-PEESE to the Test, Part 1A

The Problem with PET-PEESE?

Will Gervais has a very interesting criticism of PET-PEESE, a meta-analytic technique for correcting for publication bias, up at his blog. In it, he tests PET-PEESE's bias by simulating many meta-analyses, each of many studies, using historically-accurate effect sizes and sample sizes from social psychology. He finds that, under these conditions and assuming some true effect, PET-PEESE performs very poorly at detecting the true effect, underestimating it by a median 0.2 units of Cohen's d.
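For readers who haven't met PET-PEESE, here is a minimal sketch of the conditional estimator as it is usually described: regress the effect sizes on their standard errors (PET) or on their squared standard errors (PEESE), weighting by inverse variance, and take the intercept as the bias-corrected estimate. PEESE is used only if the PET intercept is significantly greater than zero. This is not Will's simulation code. The function names are hypothetical, and the cutoff of 1.96 is a normal-approximation assumption standing in for a proper t-test on k - 2 degrees of freedom.

```python
from math import sqrt

def wls_intercept(x, y, w):
    """Weighted least squares of y on x; returns (intercept, SE of intercept)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    resid_ss = sum(wi * (yi - intercept - slope * xi) ** 2
                   for wi, xi, yi in zip(w, x, y))
    s2 = resid_ss / (len(x) - 2)  # residual variance on k - 2 df
    se_intercept = sqrt(s2 * (1 / sw + xbar ** 2 / sxx))
    return intercept, se_intercept

def pet_peese(d, se, crit=1.96):
    """Conditional PET-PEESE estimate from effect sizes d and standard errors se.

    crit is an assumed normal-approximation cutoff, not a t critical value.
    """
    w = [1 / s ** 2 for s in se]                        # inverse-variance weights
    pet, pet_se = wls_intercept(se, d, w)               # PET: d ~ SE
    peese, _ = wls_intercept([s ** 2 for s in se], d, w)  # PEESE: d ~ SE^2
    use_peese = pet > crit * pet_se                     # PET intercept clearly above zero?
    return {
        "pet": pet,
        "peese": peese,
        "used": "PEESE" if use_peese else "PET",
        "estimate": peese if use_peese else pet,
    }

# Toy data: six studies whose effect sizes track their standard errors,
# which is the funnel-plot asymmetry that PET-PEESE tries to correct for.
se = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
noise = [0.05, -0.05, 0.05, -0.05, 0.05, -0.05]
d = [2 * s + e for s, e in zip(se, noise)]
print(pet_peese(d, se))
```

In the toy data the effect sizes are pure small-study effect plus noise, so the PET intercept should land near zero and the estimator should stay with PET.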

When I saw this, I was flattened. I knew PET-PEESE had its problems, but I also thought it represented a great deal of promise compared to other rotten old ways of inspecting for publication bias, such as trim-and-fill or (shudder) Fail-Safe N. In the spirit of full disclosure, I'll tell you that I'm 65 commits deep into a PET-PEESE manuscript with some provocative conclusions, so I may be a little bit motivated to defend PET-PEESE. But I saw some simulation parameters that could be tweaked to possibly give PET-PEESE a better shot at the true effect.

My Tweaks to Will's Simulation

One problem is that, in this simulation, the sample sizes are quite small. The per-cell sample sizes are distributed according to a truncated normal, ~N(30, 50), bounded by 20 and 200. So the minimum experiment has just 40 subjects across two cells, the modal experiment has just 60 subjects across two cells, and no study will ever exceed 400 subjects across the two cells.
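To get a feel for that distribution, here is a rough sketch that draws per-cell sample sizes by rejection sampling. It assumes the 50 in N(30, 50) is a standard deviation rather than a variance, and the function name and seed are made up for illustration; the original simulation is in R and may draw these values differently.

```python
import random
import statistics

def sample_n_per_cell(rng, mean=30.0, sd=50.0, low=20, high=200):
    """Draw one per-cell sample size from a truncated normal by rejection.

    Assumes sd=50 is a standard deviation (an assumption, not confirmed
    by the post).
    """
    while True:
        n = rng.gauss(mean, sd)
        if low <= n <= high:
            return round(n)

rng = random.Random(2015)  # arbitrary seed
draws = [sample_n_per_cell(rng) for _ in range(100_000)]
print("min per cell:", min(draws))
print("max per cell:", max(draws))
print("median per cell:", statistics.median(draws))
print("mean per cell:", round(statistics.mean(draws), 1))
```

Because the lower bound of 20 cuts off the left side of the distribution, the typical study ends up noticeably larger than the nominal mean of 30 per cell, but still small.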

These small sample sizes, combined with the small true effect (delta = .275), mean that the studies meta-analyzed have miserable power. The median power is only 36%. The maximum power is 78%, but you'll see that in fewer than one in ten thousand studies.
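Those power figures can be roughly checked with a normal approximation to the two-sample test. The sketch below is only an approximation (an exact calculation would use the noncentral t distribution), and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(delta, n_per_cell, alpha=0.05):
    """Approximate two-tailed power for a two-cell design with equal n.

    Uses a normal approximation; exact power would come from the
    noncentral t distribution.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = delta * sqrt(n_per_cell / 2)  # expected z statistic
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

for n in (20, 30, 60, 200):
    print(n, round(power_two_sample(0.275, n), 3))
```

At the ceiling of n = 200 per cell this gives power of about 0.78, in line with the maximum quoted above.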

The problem, then, is one of signal and noise. The signal is weak: delta = .275 is a small effect by most standards. The noise is enormous: at n = 60-70, the sampling error is devastating. But what's worse, there's another signal superimposed on top of all this: publication bias! The effect is something like trying to hear your one friend whisper a secret in your ear, but the two of you are in a crowded bar, and your other friend is shouting in your other ear about the Entourage movie.

So as I saw it, the issue wasn't that PET-PEESE was cruelly biased in favor of the null or that it had terrible power to detect true effects. The issue was small effects, impotent sample sizes, and withering publication bias. In these cases, it's very hard to tell true effects from null effects. Does this situation sound familiar to you? It should -- Will's simulation uses distributions of sample sizes and effect sizes that are very representative of the norms in social psychology!
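Here is a toy sketch of how those three ingredients combine. It is not Will's simulation. The publication rule (a study is published only if it is significant in the positive direction, unless it gets through with probability p_publish_null) is an assumption for illustration, as are the function names, the 20 studies per literature, and the normal approximation to the sampling distribution of d.

```python
import random
from math import sqrt

def draw_n(rng, mean=30.0, sd=50.0, low=20, high=200):
    """Per-cell n from a truncated normal (rejection sampling; sd assumed)."""
    while True:
        n = rng.gauss(mean, sd)
        if low <= n <= high:
            return round(n)

def simulate_study(rng, delta, n):
    """Return (d, se) for a two-cell study with n subjects per cell."""
    true_se = sqrt(2 / n + delta ** 2 / (4 * n))
    d = rng.gauss(delta, true_se)        # normal approximation to sampling error
    se = sqrt(2 / n + d ** 2 / (4 * n))  # SE as estimated from the observed d
    return d, se

def simulate_literature(rng, delta=0.275, k=20, p_publish_null=0.0):
    """Collect k published studies under an assumed publication rule."""
    published = []
    while len(published) < k:
        d, se = simulate_study(rng, delta, draw_n(rng))
        significant = d / se > 1.96
        if significant or rng.random() < p_publish_null:
            published.append((d, se))
    return published

def naive_estimate(studies):
    """Inverse-variance weighted mean of the published effect sizes."""
    weights = [1 / se ** 2 for _, se in studies]
    return sum(w * d for w, (d, _) in zip(weights, studies)) / sum(weights)

def average_naive(p_publish_null, reps=200, seed=1):
    """Average the naive estimate over many simulated literatures."""
    rng = random.Random(seed)
    total = sum(
        naive_estimate(simulate_literature(rng, p_publish_null=p_publish_null))
        for _ in range(reps)
    )
    return total / reps

biased = average_naive(0.0)    # only significant results get published
unbiased = average_naive(1.0)  # everything gets published
print("naive estimate, full publication bias:", round(biased, 3))
print("naive estimate, no publication bias:", round(unbiased, 3))
```

With the filter switched off, the naive average sits near the true delta = .275. With it switched on, the same studies overshoot badly, which is the distortion any correction method has to undo.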

But social psychology is changing. The new generation of researchers is becoming acutely aware of the importance of sample size and of publishing null results. New journals like Frontiers or PLOS (and even Psych Science) are making it easier to publish null results. In this exciting new world of social psychology, might we have an easier time of arriving at the truth?


To test my intuition, I made one tweak to Will's simulation: Suppose that, in each meta-analysis, there is one ambitious grad student who decides she's had enough talk. She wants some damn data, and when she gets it, she will publish it come hell or high water, regardless of the result.

In each simulated meta-analysis, I guarantee a single study with n = 209/cell (80% power, two-tailed, to detect the true homogeneous effect delta = 0.275). Moreover, this single well-powered study is made immune to publication bias. Could a single, well-powered study help PET-PEESE?

Well, it doesn't. One 80%-powered study isn't enough. You might be better off using the "Top Ten" estimator, which looks only at the 10 largest studies, or even just interpreting the single largest study.

What if the grad student runs her dissertation at 90% power, collecting n = 280 per cell?

Maybe we're getting somewhere now. The PEESE spike is coming up a little bit and the PET spike is going down. But maybe we're asking too much of our poor grad student. Nobody should have to determine the difference between delta = 0 and delta = 0.275 all by themselves. (Note that, even still, you're probably better off throwing all the other studies and meta-analysis and meta-regressions in the garbage and just using this single pre-registered experiment as your estimate!)
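As a sanity check on what we're asking of the grad student, the per-cell sample sizes for 80% and 90% power can be approximated from the usual normal-theory formula, n = 2 * ((z_alpha + z_power) / delta)^2. This is only a sketch with a hypothetical function name. A t-based calculation adds a subject or two, so expect it to land slightly below the 209 and 280 used in the simulations.

```python
from math import ceil
from statistics import NormalDist

def n_per_cell_for_power(delta, power, alpha=0.05):
    """Approximate per-cell n for a two-tailed, two-cell comparison.

    Normal approximation; exact t-based answers are slightly larger.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / delta) ** 2)

for target in (0.80, 0.90):
    print(target, n_per_cell_for_power(0.275, target))
```

Either way, that is several times the size of the typical study in the simulated literature.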

Here's the next scenario: Suppose somebody looked at the funnel plot from the original set of roughly 20 studies and found it to be badly asymmetrical. Moreover, they saw that the PET-PEESE estimate was not significantly different from zero. Rather than pronounce the PET-PEESE estimate as the true effect size, they instead suggested that the literature was badly biased and that a replication effort was needed. So three laboratories each agreed to rerun the experiment at 80% power and publish the results in a Registered Report. Afterwards, they reran the meta-analysis and PET-PEESE.

Even with these three unbiased, decently-powered studies, PET-PEESE is still flubbing it badly, going to PET more often than it should. Again, you might be better off just looking at the three trustworthy studies in the Registered Report than trying to fix the publication bias with meta-regression.

I'm feeling pretty exhausted by now, so let's just drop the hammer on this. The Center for Open Science decides to step in and run a Registered Report with 10 studies, each powered at 80%. Does this give PET-PEESE what it needs to perform well?

No dice. Again, you'd be better off just looking at the 10 preregistered studies and giving up on the rest of the literature. Even with these 10 healthy studies in the dataset, we're missing delta = .275 by quite a bit in one direction or the other: PET-PEESE is estimating delta = 0.10, while naive meta-analysis is estimating delta = .42.


I am reminded of a blog post by Michele Nuijten, in which she explains how more information can actually make your estimates worse. If your original estimates are contaminated by publication bias, and your replication estimates are also contaminated by publication bias, adding the replication data to your original data only makes things worse. In the cases above, we gain very little from meta-analysis and meta-regression. It would be better to look only at the large-sample Registered Reports and dump all the biased, underpowered studies in the garbage.

The simple lesson is this: There is no statistical replacement for good research practice. Publication bias is nothing short of toxic, particularly when sample sizes and effect sizes are small.

So what can we do? Maybe this is my bias as a young scientist with few publications to my name, but if we really want to know what is true and what is false, we might be better off disregarding the past literature of biased, small-sample studies entirely and only interpreting data we can trust.

The lesson I take is this: For both researchers and the journals that publish them, Registered Report or STFU.

(Now, how am I gonna salvage this meta-analysis???)

Code is available at my GitHub. The bulk of the original code was written by Will Gervais, with edits and tweaks by Evan Carter and Felix Schonbrodt. You can recreate my analyses by loading packages and the meta() function on lines 1-132, then skipping down to the section "Hilgard is going ham" on line 303.