A psychologist's thoughts on how and why we play games

Tuesday, July 19, 2016

The Failure of Fail-safe N

Fail-Safe N is a statistic suggested as a way to address publication bias in meta-analysis. Fail-Safe N describes the robustness of a significant result by calculating how many studies with effect size zero could be added to the meta-analysis before the result lost statistical significance. The original formulation is provided by Rosenthal (1979), with modifications proposed by Orwin (1983) and Rosenberg (2005).

I would like to argue that, as a way to detect and account for bias in meta-analysis, Fail-Safe N is completely useless. Others have said this before (see the bottom of the post for some links), but I needed to explore it further for my own curiosity. Altogether, I have to say that Fail-Safe N has been made obsolete by subsequent techniques, and I do not recommend its use.

Fail-Safe N isn't for detecting bias

When we perform a meta-analysis, the question on our minds is usually "Looking at the gathered studies, how many null results were hidden from report?" Fail-Safe N does not answer that. Instead, it asks, "Looking at the gathered studies, how many more null results would you need before you'd no longer claim an effect?"

This isn't useful as a bias test. Indeed, Rosenthal never meant it as a way to test for the presence of bias -- he'd billed it as an estimate of tolerance for null results, an answer to the question "How bad would bias have to be before I changed my mind?" He used it to argue against the worst-case scenario, in which the published psych literature is just the 5% of studies that were Type I errors while the other 95%, the null results, languish in file drawers. Fail-Safe N was never meant to distinguish biased from unbiased literatures.

Fail-Safe N doesn't scale with bias

Although Fail-Safe N was never meant to test for bias, sometimes people will act as though a larger Fail-Safe N indicates the absence of bias. That won't work.

To see why it won't work, let's look briefly at the equation that defines FSN.

FSN = [(ΣZ)^2 / 2.706] - k

where ΣZ is the sum of z-scores from individual studies (small p-values mean large z-scores) and k is the number of studies.

This means that Fail-Safe N grows larger with each significant result. Even when each study is just barely significant (one-tailed p = .05, which is z = 1.645, and 1.645 squared is the 2.706 in the equation), Fail-Safe N will grow rapidly. After six p = .05 results, FSN is 30. After ten p = .05 results, FSN is 90. After twenty p = .05 results, FSN is 380. Fail-Safe N rapidly becomes huge, even when the individual studies just barely cross the significance threshold.
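To check that arithmetic, here is a minimal Python sketch of the equation above. The function name fail_safe_n is a hypothetical label, and the sketch assumes each "p = .05" result is one-tailed (z of about 1.645), since that is the value whose square gives 2.706.

```python
from statistics import NormalDist

CRIT = 2.706  # the constant from the equation above (1.645 squared)


def fail_safe_n(z_scores):
    """Fail-Safe N as written in the post: (sum of Z)^2 / 2.706 - k."""
    k = len(z_scores)
    return sum(z_scores) ** 2 / CRIT - k


# z-score of a one-tailed p = .05 result (about 1.645); an assumption, see above
z_05 = NormalDist().inv_cdf(0.95)

for k in (6, 10, 20):
    # k studies that each just barely reach significance
    print(k, round(fail_safe_n([z_05] * k)))
```

When every study sits at z = 1.645, the equation reduces to roughly k^2 - k, which is why FSN grows with the square of the number of studies.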

Worse, FSN can get bigger as the literature becomes more biased.

  • For each dropped study with an effect size of exactly zero, FSN grows by one. (That's what it says on the tin -- how many dropped zeroes would be required to make p > .05.) 
  • When dropped studies have positive but non-significant effect sizes, FSN falls. 
  • When dropped studies have negative effect sizes, FSN rises.

If all the studies with estimated effect sizes less than zero are censored, FSN will quickly rise.
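The three cases can be illustrated with a small Python sketch. The z-scores below are made up for illustration, not taken from any real literature, and fail_safe_n and change_when_dropped are hypothetical helper names.

```python
CRIT = 2.706  # constant from the FSN equation


def fail_safe_n(z_scores):
    """Fail-Safe N: (sum of Z)^2 / 2.706 - k."""
    return sum(z_scores) ** 2 / CRIT - len(z_scores)


# Hypothetical z-scores of the studies that made it into print
published = [2.0, 2.5, 1.8, 2.2]


def change_when_dropped(z_dropped):
    """FSN with one study censored, minus FSN with that study included."""
    return fail_safe_n(published) - fail_safe_n(published + [z_dropped])


for label, z in [("zero", 0.0), ("positive, n.s.", 1.0), ("negative", -1.0)]:
    print(f"{label:>15}: change in FSN = {change_when_dropped(z):+.2f}")
```

In this toy literature, censoring the zero study raises FSN by one, censoring the positive non-significant study lowers it, and censoring the negative study raises it. One caveat follows from the equation: a dropped study with a very small positive z can still nudge FSN upward, because removing it lowers k by one while barely changing ΣZ.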

Because Fail-Safe N doesn't behave in any particular way with bias, the following scenarios could all have the same Fail-Safe N:

  • A few honestly-reported studies on a moderate effect.
  • A lot of honest studies on a teeny-tiny effect.
  • A single study with a whopping effect size.
  • A dozen p-hacked studies on a null effect.
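As a rough illustration, the sketch below solves the FSN equation for the per-study z-score that yields one fixed FSN at several literature sizes. The target of 132 is an arbitrary choice (it is about what twelve studies at z = 1.645 produce), the helper names are hypothetical, and z-scores mix effect size with sample size, so mapping each row onto the scenarios above is only suggestive.

```python
from math import sqrt

CRIT = 2.706  # constant from the FSN equation


def fail_safe_n(z_scores):
    """Fail-Safe N: (sum of Z)^2 / 2.706 - k."""
    return sum(z_scores) ** 2 / CRIT - len(z_scores)


def z_needed(target_fsn, k):
    """Per-study z that gives target_fsn when all k studies share the same z."""
    return sqrt(CRIT * (target_fsn + k)) / k


TARGET = 132  # arbitrary illustrative FSN

for k in (1, 5, 12, 100):
    z = z_needed(TARGET, k)
    print(f"k = {k:>3}  z per study = {z:5.2f}  FSN = {fail_safe_n([z] * k):.1f}")
```

A single study with an enormous z, a handful with strong results, a dozen hovering right at the significance threshold, and a hundred with tiny z-scores all land on the same FSN.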

Fail-Safe N is often huge, even when it looks like the null is true

Publication bias and flexible analysis being what they are in social psychology, Fail-Safe N tends to return whopping huge numbers. The original Rosenthal paper provides two demonstrations. In one, he synthesizes 94 experiments examining the effects of interpersonal self-fulfilling prophecies, and concludes that 3,263 studies averaging null effects would be necessary to make the effect go away. In another analysis of k = 311 studies, he says nearly 50,000 studies would be needed.

Similarly, in the Hagger et al. meta-analysis of ego depletion, Fail-Safe N reported that 50,000 null studies would be needed to reduce the effect to non-significance. By comparison, the Egger test indicated that the literature was badly biased, and PET-PEESE indicated that the effect size was likely zero. The registered replication report also indicated that the effect size was likely zero. Even a Fail-Safe N of 50,000 does not indicate a robust result.


Fail-Safe N is not a useful bias test because:

  1. It does not tell you whether there is bias.
  2. Greater bias can lead to a greater Fail-Safe N.
  3. Hypotheses that appear to be false have nonetheless obtained very large values of FSN.

FSN is just another way to describe the p-value at the end of your meta-analysis. If your p-value is very small, FSN will be very large; if your p-value is just barely under .05, FSN will be small.
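A short Python sketch makes the link concrete, under the assumption that the meta-analytic p-value comes from summing z-scores (Stouffer's method), which is what the ΣZ in the FSN equation implies. The function names and the example z-scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

CRIT = 2.706  # constant from the FSN equation


def fail_safe_n(z_scores):
    """Fail-Safe N: (sum of Z)^2 / 2.706 - k."""
    return sum(z_scores) ** 2 / CRIT - len(z_scores)


def stouffer_p(z_scores):
    """One-tailed p-value of the combined Z = sum(Z) / sqrt(k)."""
    z_combined = sum(z_scores) / sqrt(len(z_scores))
    return 1 - NormalDist().cdf(z_combined)


k = 10  # hypothetical number of studies
for z in (0.4, 0.6, 1.0, 2.0):
    zs = [z] * k
    print(f"z = {z:.1f}  combined p = {stouffer_p(zs):.2g}  FSN = {fail_safe_n(zs):.1f}")
```

For a fixed k, FSN is a one-to-one transformation of the combined z: it turns positive once the combined one-tailed p drops below about .05 and climbs as the p-value shrinks.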

In no case does Fail-Safe N indicate the presence or absence of bias. It only places a number on how bad publication bias would have to be, in a world without p-hacking, for the result to be a function of publication bias alone. Unfortunately, we know well that we live in a world with p-hacking. Perhaps this is why Fail-Safe N is sometimes so inappropriately large.

If you need to test for bias, I would recommend instead Begg's test, Egger's test, or p-uniform. If you want to adjust for bias, PET, PEESE, p-curve, p-uniform, or selection models might work. But don't ever try to interpret the Fail-Safe N in a way it was never meant to be used.

Related reading:
Becker (2005) recommends "abandoning Fail-Safe N in favor of other, more informative analyses."
Here Coyne agrees that Fail-Safe N is not a function of bias and does not check for bias.
The Cochrane Collaboration agrees that Fail-Safe N is mostly a function of the net effect size, and criticizes the emphasis on statistical significance over effect size.
Moritz Heene refers me to three other articles pointing out that the average Z-score of unpublished studies is probably not zero, as Fail-Safe N assumes, but rather, less than zero. Thus, the Fail-Safe N is too large. (Westfall's comment below makes a similar point.) This criticism is worth bearing in mind, but I think the larger problem is that Fail-Safe N does not answer the user's question regarding bias.