Wednesday, March 22, 2017

Comment on Data Colada [58]: Funnel plots, done correctly, are extremely useful

In Data Colada [58], Simonsohn argues that funnel plots are not useful. The argument is, for true effect size δ and sample size n:
  • Funnel plots are based on the assumption that r(δ, n) = 0.
  • Under some potentially common circumstances, r(δ, n) != 0. 
  • When r(δ, n) != 0, there is the risk of mistaking benign funnel plot asymmetry (small-study effects) for publication bias.

I do not think that any of this is controversial. It is always challenging to determine how to interpret small-study effects. They can be caused by publication bias, or they can be caused by, as Simonsohn argues, researchers planning their sample sizes in anticipation of some large and some small true effects.

There is a simple solution to this that preserves the validity and utility of funnel plots. If your research literature is expected to contain some large and some small effects, and these are reflected by clear differences in experimental methodology and/or subject population, then analyze those separate methods and populations separately. 

For this post, I will call this making homogeneous subgroups. 

Once you have made homogeneous subgroups, r(δ, n) = 0 is not a crazy assumption at all. Indeed, it can be a more sensible assumption than r(δ, δguess) = .6.

Making homogeneous subgroups

Suppose we are interested in the efficacy of a new psychotherapeutic technique for depression and wish to meta-analyze the available literature. 

It would be silly to combine studies looking at the efficacy of this technique for reducing depression and improving IQ and reducing aggression and reducing racial bias and losing weight. These are all different effects and different hypotheses. It would be much more informative to test each of these separately.

[Image caption: In keeping with the longest-running cliche in meta-analysis, here's an "apples to oranges" metaphor.]

For example, when we investigated the funnel plots from Anderson et al.'s (2010) meta-analysis of violent video game effects, we preserved the original authors' decision to separate studies by design (experiment, cross-section, longitudinal) and by classes of outcome (behavior, cognition, affect). When Carter & McCullough (2014) inspected the effects of ego depletion, they separated their analysis by classes of outcome.

In short, combine studies of similar methods and similar outcomes. Studies of dissimilar methods and dissimilar outcomes should probably be analyzed separately.

The bilingual advantage example

I think the de Bruin, Treccani, and Della Sala (2014) paper that serves as the post's motivating example is a little too laissez-faire about combining dissimilar studies. The hypothesis "bilingualism is good for you" seems much too broad, encompassing far too many heterogeneous studies.

Simonsohn's criticism here has less to do with a fatal flaw in funnel plots and more to do with a suboptimal application of the technique. Let's talk about why this is suboptimal and how it could have been improved.

To ask whether bilingualism improves working memory among young adults is one question. To ask whether bilingualism delays the onset of Alzheimer's disease is another. To combine the two is of questionable value. 

It would be more informative to restrict the analysis to a more limited, homogeneous hypothesis such as "bilingualism improves working memory". Even after that, it might be useful to explore different working memory tasks separately.

When r(δ, n) = 0 is reasonable

Once you have parsed the studies out into homogeneous subsamples, the assumption that r(δ, n) = 0 becomes quite reasonable. This is because:
  • Choosing homogeneous studies minimizes the variance in δ across studies.
  • Given homogeneous methods, outcomes, and populations, researchers cannot plan for variance in δ.

Let's look at each in turn.

Minimizing variance in δ

Our concern is that the true effect size δ varies from study to study -- sometimes it is large, and sometimes it is small. This variance may covary with study design and with sample size, leading to a small-study effect. Because study design is confounded with sample size, there is a risk of mistaking this for publication bias.

Partitioning into homogeneous subsets addresses this concern. As methods and populations become more similar, the variance in δ shrinks. With its range restricted, correlations between δ and confounds shrink too, leading us towards the desirable case that r(δ, n) = 0.
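To make this concrete, here is a minimal simulation sketch in Python (NumPy). Everything in it is an assumption chosen for illustration, not taken from Data Colada [58]: two hypothetical subgroups (a large effect, δ = 0.8, studied with small samples, and a small effect, δ = 0.2, studied with large samples), no publication bias at all, and a rough large-sample standard error for Cohen's d. Because δ is constant within a subgroup, the sketch reports the correlation between the observed effect size d and n.

```python
import numpy as np

rng = np.random.default_rng(2017)

def simulate_subgroup(delta, n_low, n_high, k, rng):
    """Simulate k two-group studies of one true effect, with no publication bias."""
    n = rng.integers(n_low, n_high + 1, size=k)  # per-group n, drawn independently of delta
    se = np.sqrt(2.0 / n)                        # rough large-sample SE of Cohen's d
    d = rng.normal(delta, se)                    # observed d = true effect + sampling error
    return d, n

def r(x, y):
    """Pearson correlation between two arrays."""
    return np.corrcoef(x, y)[0, 1]

# Hypothetical literature: large effect / small studies, small effect / large studies.
d_large, n_large = simulate_subgroup(0.8, 12, 38, 500, rng)
d_small, n_small = simulate_subgroup(0.2, 200, 600, 500, rng)

# Pooling the two subgroups confounds effect size with sample size.
r_pooled = r(np.concatenate([d_large, d_small]),
             np.concatenate([n_large, n_small]))

# Within each subgroup, delta is constant and n carries no information about it.
r_within_large = r(d_large, n_large)
r_within_small = r(d_small, n_small)

print("pooled r(d, n):         ", round(r_pooled, 2))
print("within large-effect set:", round(r_within_large, 2))
print("within small-effect set:", round(r_within_small, 2))
```

Under these assumptions the pooled correlation should come out strongly negative even though nothing was censored, while the within-subgroup correlations should hover near zero.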

Researchers cannot plan for the true effect size within a homogeneous subgroup

Simonsohn assumes that researchers have some intuition for the true effect size -- that they are able to guess it with some accuracy such that r(δ, δguess) = .6.

[Figure caption: True and guessed effect sizes in Data Colada [58]. r = .6 is a pretty strong estimate of researcher intuition, although Simonsohn's concern still applies (albeit less so) at lower levels of intuition.]

This may be a reasonable assumption when we are considering a wide array of heterogeneous studies. I can guess that the Stroop effect is large, that the contrast mapping effect is medium in size, and that the effect of elderly primes is zero.

However, once we have made homogeneous subsamples, this assumption becomes much less tenable. Can we predict when and for whom the Stroop effect is larger or smaller? Do we know under which conditions the effect of elderly primes is nonzero?

Indeed, you are probably performing a meta-analysis exactly because researchers have poor intuition for the true effect size. You want to know whether the effect is δ = 0, 0.5, or 1. You are performing moderator analyses to see if you can learn what makes the effect larger or smaller. 

Presuming you are the first to do this, it is unclear how researchers could have powered their studies accordingly. Within this homogeneous subset, nobody can predict when the effect should be large or small. To make this correlation between sample size and effect size, researchers would need access to knowledge that does not yet exist.

Once you have made a homogeneous subgroup, r(δ, n) = 0 can be a more reasonable assumption than r(δ, δguess) = .6.
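A rough sketch of how researcher intuition turns into r(δ, n), loosely in the spirit of Simonsohn's setup rather than a reproduction of it. The assumptions are mine: true and guessed effects are bivariate normal on a hypothetical scale (mean 0.5, SD 0.1), and every researcher picks the per-group n that gives roughly 80% power for their guess in a two-group design (normal approximation).

```python
import numpy as np

def planned_n(delta_guess):
    """Per-group n for ~80% power, two-sided alpha = .05 (normal approximation)."""
    return np.ceil(2 * (1.96 + 0.84) ** 2 / delta_guess ** 2)

def corr_delta_n(r_guess, k=5000, seed=58):
    """Correlation between true effect and planned n when r(delta, delta_guess) = r_guess."""
    rng = np.random.default_rng(seed)
    z_true = rng.standard_normal(k)
    # Build a guess that correlates r_guess with the truth.
    z_guess = r_guess * z_true + np.sqrt(1 - r_guess ** 2) * rng.standard_normal(k)
    # Hypothetical effect-size scale; clipped so that guesses stay positive.
    delta = np.clip(0.5 + 0.1 * z_true, 0.1, None)
    delta_guess = np.clip(0.5 + 0.1 * z_guess, 0.1, None)
    n = planned_n(delta_guess)
    return np.corrcoef(delta, n)[0, 1]

r_informed = corr_delta_n(0.6)  # researchers can partly guess which effects are larger
r_blind = corr_delta_n(0.0)     # within a homogeneous subgroup: guesses carry no information

print("r(delta, n) with r_guess = .6:", round(r_informed, 2))
print("r(delta, n) with r_guess = 0: ", round(r_blind, 2))
```

When the guesses are informative, larger true effects get smaller planned samples and r(δ, n) should be clearly negative. When the guesses carry no information, sample size planning cannot induce the correlation.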

Meta-regression is just regression

Meta-analysis seems intimidating, but the funnel plot is just a regression equation. Confounds are a hazard in regression, but we still use regression because we can mitigate the hazard and the resulting information is often useful. The same is true of meta-regression.  

Because this is regression, all the old strategies apply. Can you find a third variable that explains the relationship between sample size and effect size? Moderator analyses and inspection of the funnel plots can help to look for, and test, such potential confounds.
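As an illustration of the third-variable strategy, the sketch below fits two weighted regressions to simulated data: effect size on standard error alone, and effect size on standard error plus a subgroup indicator. The data-generating values (two hypothetical subgroups with δ = 0.8 and δ = 0.2, no publication bias) and the helper name wls_coefs are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(58)

def wls_coefs(X, y, w):
    """Weighted least squares coefficients, via rescaling rows by sqrt(weight)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

k = 500
# First k studies: large effect, small samples. Last k: small effect, large samples.
is_large = np.repeat([1.0, 0.0], k)
delta = np.where(is_large == 1.0, 0.8, 0.2)
n = np.concatenate([rng.integers(12, 39, size=k), rng.integers(200, 601, size=k)])
se = np.sqrt(2.0 / n)       # rough large-sample SE of Cohen's d
d = rng.normal(delta, se)   # no publication bias: every study is "published"
w = 1.0 / se ** 2           # inverse-variance weights

ones = np.ones_like(se)
# Naive funnel-plot regression: d ~ 1 + se
slope_naive = wls_coefs(np.column_stack([ones, se]), d, w)[1]
# Moderator-adjusted regression: d ~ 1 + se + subgroup
slope_adjusted = wls_coefs(np.column_stack([ones, se, is_large]), d, w)[1]

print("slope on se, naive:   ", round(slope_naive, 2))
print("slope on se, adjusted:", round(slope_adjusted, 2))
```

In this setup the naive slope should be large and positive, which looks like a small-study effect, and it should mostly vanish once the subgroup indicator is in the model. That is the confound being accounted for, not bias being corrected.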

I think that Simonsohn does not see this presented often in papers, and so he is under the impression that this sort of quality check is uncommon. In my experience, my reviewers were very careful to ask me to rule out confounds in my own funnel plot analysis.

That said, it's definitely possible that these steps don't make it to the published literature: perhaps they are performed internally, or shared with just the peer reviewers, or maybe studies where the funnel plot contains confounds are not interesting enough to publish. Maybe greater attention can be paid to this in our popular discourse.


Into every life some heterogeneity must fall. There is the risk that, even after these efforts, there is some confound that you mistake for publication bias. That's regression for you.

There is also the risk that, if you get carried away chasing after perfectly homogeneous subgroups, you may find yourself conducting a billion analyses of only one or two studies each. This is not helpful either, for reasons that will be obvious.

Simonsohn is concerned that we can never truly reach such homogeneity that r(δ, n) = 0 is true. This seems possible, but it is hard to say without access to 1) the true effect size and 2) the actual power analyses of researchers. I think that we can at least reach the limits of researchers' ability to plan for larger vs. smaller effects.


The funnel plot represents the relationship between effect size δ and the sample size n. These may be correlated because of publication bias, or they may be correlated because of genuine differences in δ that have been planned for in power analysis. By conditioning your analysis on homogeneous subsets, you reduce variance in δ and the potential influence of power analysis.

[Image caption: My favorite video game is The Legend of Zelda: Plot of the Funnel]

Within homogeneous subsets, researchers do not know when the effect is larger vs. smaller, and so cannot plan their sample sizes accordingly. Under these conditions, the assumption that r(δ, n) = 0 can be quite reasonable, and perhaps more reasonable than the assumption that r(δ, δguess) = .6.

Applied judiciously, funnel plots can be valid, informative, expressive, and useful. They encourage attention to effect size, reveal outliers, and demonstrate small-study effects that can often be attributed to publication bias.



I also disagree with Simonsohn that "It should be considered malpractice to publish papers with PET-PEESE." Simonsohn is generally soft-spoken, so I was a bit surprised to see such a stern admonishment.

PET and PEESE are definitely imperfect, and their weaknesses are well-documented: PET is biased downwards when δ != 0, and PEESE is biased upwards when δ = 0. This sucks if you want to know whether δ = 0. 

Still, I think PEESE has some promise; assuming there is an effect, how big is it likely to be? Yes, these methods depend heavily on the funnel plot, assuming that any small-study effect is attributable to publication bias, but again, this can be a reasonable assumption under the right conditions. Some simulations I'm working on with Felix Schonbrodt, Evan Carter, and Will Gervais indicate that it's at least no worse than trim-and-fill (low bar, I know).
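For readers who have not seen the estimators written out, here is a minimal sketch. PET takes the intercept of a weighted regression of effect size on standard error, and PEESE takes the intercept of a weighted regression on squared standard error, both with weights 1/se². The simulated scenario is an assumption picked to be extreme: δ = 0 and only significant positive results (z > 1.96) get published. The function names are mine.

```python
import numpy as np

def wls_intercept(x, d, se):
    """Intercept of a weighted regression of d on x, with weights 1 / se**2."""
    sw = 1.0 / se  # square root of the weight 1 / se**2
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], d * sw, rcond=None)
    return beta[0]

def pet(d, se):
    """PET: predicted effect size at se = 0 from d ~ 1 + se."""
    return wls_intercept(se, d, se)

def peese(d, se):
    """PEESE: predicted effect size at se = 0 from d ~ 1 + se**2."""
    return wls_intercept(se ** 2, d, se)

def simulate_published(delta, k_run=40000, seed=3):
    """Run k_run two-group studies and keep only significant positive results."""
    rng = np.random.default_rng(seed)
    n = rng.integers(20, 201, size=k_run)  # per-group sample size
    se = np.sqrt(2.0 / n)                  # rough large-sample SE of Cohen's d
    d = rng.normal(delta, se)
    published = d / se > 1.96              # z-test approximation to "p < .05, right direction"
    return d[published], se[published]

d_pub, se_pub = simulate_published(delta=0.0)
naive_mean = np.average(d_pub, weights=1.0 / se_pub ** 2)
pet_est = pet(d_pub, se_pub)
peese_est = peese(d_pub, se_pub)

print("studies published:", d_pub.size)
print("naive weighted mean:", round(naive_mean, 3))
print("PET estimate:       ", round(pet_est, 3))
print("PEESE estimate:     ", round(peese_est, 3))
```

In this scenario PET should land near zero, PEESE should sit above it, and the naive weighted mean should be worst of all, in line with the point above that PEESE is biased upwards when δ = 0. Calling simulate_published with a nonzero delta lets you look at the other case.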

Of course, no one technique is perfect. I would recommend using these methods in concert with other analyses such as the Hedges & Vevea 3-parameter selection model or, sure, p-curve or p-uniform.


  1. Nice post. You say that

    "There is a simple solution... If your research literature is expected to contain some large and some small effects, and these are reflected by clear differences in experimental methodology and/or subject population, then analyze those separate methods and populations separately."

    I would say it may not be so simple to do this.

    We may find that many of the studies are idiosyncratic and don't fit into any groups, or that many of the groups are too small to meta-analyze (you can hardly interpret a funnel plot with only 3 studies.)

    You do note this issue but I don't think that we can always avoid it by not "getting carried away chasing after perfectly homogeneous subgroups". In many cases the literature on a certain topic really is a mess with all kinds of different study designs and methods.

    I would say that in such a case, we shouldn't do a meta-analysis of that topic at all (rather we should use other approaches to summarize the literature).

    Even in cases where it is possible to define homogeneous subgroups of studies, these subgroups may not be representative of all of the literature (consider for instance if there was one lab churning out studies using the same methodology, a methodology that no one else in the field uses because it's flawed.)

    In such a case I would again say that meta-analysis is not the tool we're looking for to summarize the topic.

    1. I'd agree, to some extent. I feel it is a common misconception that every systematic review needs to include a meta-analysis, or that the core purpose of any meta-analysis is to provide a single effect size estimate. I think that there is room for systematic reviews to say "This literature, while rich with new ideas, is too heterogeneous to synthesize."