I would like to argue that, as a way to detect and account for bias in meta-analysis, Fail-Safe N is completely useless. Others have said this before (see the bottom of the post for some links), but I needed to explore it further for my own curiosity. Altogether, I have to say that Fail-Safe N appears to have been made obsolete by subsequent techniques, and thus is not recommended for use.
Fail-Safe N isn't for detecting bias
When we perform a meta-analysis, the question on our minds is usually "Looking at the gathered studies, how many null results were hidden from report?" Fail-Safe N does not answer that. Instead, it asks, "Looking at the gathered studies, how many more null results would you need before you'd no longer claim an effect?"
This isn't useful as a bias test. Indeed, Rosenthal never meant it as a way to test for the presence of bias -- he'd billed it as an estimate of tolerance for null results, an answer to the question "How bad would bias have to be before I changed my mind?" He used it to argue that the published psychology literature was not merely the 5% of studies that made Type I errors while the other 95%, all null results, languished in file drawers. Fail-Safe N was never meant to distinguish biased from unbiased literatures.
Fail-Safe N doesn't scale with bias
Although Fail-Safe N was never meant to test for bias, sometimes people will act as though a larger Fail-Safe N indicates the absence of bias. That won't work.
To see why it won't work, let's look briefly at the equation that defines FSN.
FSN = [(ΣZ)^2 / 2.706] - k
where ΣZ is the sum of z-scores from the individual studies (small p-values mean large z-scores), k is the number of studies, and 2.706 is the squared critical value for a one-tailed test at α = .05 (1.645^2).
This means that Fail-Safe N grows larger with each significant result. Even when each study is just barely significant (p = .050), Fail-Safe N will grow rapidly: after six p = .05 results, FSN is 30; after ten, 90; after twenty, 380. In general, k just-barely-significant studies give an FSN of about k(k - 1), so Fail-Safe N rapidly becomes huge even when the individual studies barely cross the significance threshold.
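To make that concrete, here is a minimal sketch (the function and variable names are my own, and it assumes the conventional one-tailed α = .05, so the critical z squared is about 2.706) that reproduces the numbers above:

```python
# A minimal sketch of Rosenthal's Fail-Safe N from the formula above.
# Assumes a one-tailed alpha = .05, so z_crit ~= 1.645 and z_crit^2 ~= 2.706.
from scipy.stats import norm

def fail_safe_n(z_scores, alpha=0.05):
    """How many z = 0 studies would push the combined test past alpha."""
    z_crit = norm.isf(alpha)                      # one-tailed critical z
    return sum(z_scores)**2 / z_crit**2 - len(z_scores)

# Every study just barely significant at one-tailed p = .05 (z = 1.645):
for k in (6, 10, 20):
    barely = [norm.isf(0.05)] * k
    print(k, round(fail_safe_n(barely)))          # 30, 90, 380 -- i.e., k * (k - 1)
```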
Worse, FSN can get bigger as the literature becomes more biased.
- For each dropped study with an effect size of exactly zero, FSN grows by one. (That's what it says on the tin -- how many dropped zeroes would be required to make p > .05.)
- When dropped studies have positive but non-significant effect sizes, FSN falls.
- When dropped studies have negative effect sizes, FSN rises.
If all the studies with estimated effect sizes less than zero are censored, FSN will quickly rise.
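A small illustration of the three points above, using the same sketch function and some hypothetical z-scores of my own choosing:

```python
# Hypothetical z-scores; how censoring them changes Fail-Safe N.
from scipy.stats import norm

def fail_safe_n(z_scores, alpha=0.05):
    z_crit = norm.isf(alpha)                      # ~1.645, so z_crit^2 ~= 2.706
    return sum(z_scores)**2 / z_crit**2 - len(z_scores)

full = [2.5, 2.2, 2.0, 1.0, 0.0, -0.5, -1.2]      # everything honestly reported

print(round(fail_safe_n(full), 1))                            # ~6.3  (all studies)
print(round(fail_safe_n([z for z in full if z != 0.0]), 1))   # ~7.3  drop the z = 0 study: FSN rises by one
print(round(fail_safe_n([z for z in full if z != 1.0]), 1))   # ~3.2  drop a positive, non-significant study: FSN falls
print(round(fail_safe_n([z for z in full if z >= 0.0]), 1))   # ~16.9 drop the negative studies: FSN rises sharply
```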
Fail-Safe N cannot tell very different literatures apart. Each of the following can produce a large FSN:
- A few honestly-reported studies on a moderate effect.
- A lot of honest studies on a teeny-tiny effect.
- A single study with a whopping effect size.
- A dozen p-hacked studies on a null effect.
Fail-Safe N is often huge, even when it looks like the null is true
Publication bias and flexible analysis being what they are in social psychology, Fail-Safe N tends to return whopping huge numbers. The original Rosenthal paper provides two demonstrations. In one, he synthesizes 94 experiments examining the effects of interpersonal self-fulfilling prophecies and concludes that 3,263 studies averaging null effects would be necessary to make the effect go away. In another analysis of k = 311 studies, he says nearly 50,000 studies would be needed.
Similarly, in the Hagger et al. meta-analysis of ego depletion, Fail-Safe N suggested that 50,000 null studies would be needed to reduce the effect to non-significance. By contrast, the Egger test indicated that the literature was badly biased, and PET-PEESE indicated that the effect size was likely zero. The registered replication report also indicated that the effect size was likely zero. Even a Fail-Safe N of 50,000 does not indicate a robust result.
Summary
Fail-Safe N is not a useful bias test because:
- It does not tell you whether there is bias.
- Greater bias can lead to a greater Fail-Safe N.
- Hypotheses that appear to be false have nonetheless obtained very large values of FSN.
FSN is just another way to describe the p-value at the end of your meta-analysis. If your p-value is very small, FSN will be very large; if your p-value is just barely under .05, FSN will be small.
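To make that equivalence explicit: under Rosenthal's Stouffer-style combination, the combined z is ΣZ/√k, so the formula above can be rearranged to FSN = k(Z_combined^2/2.706 - 1) -- a function of the meta-analytic p-value and the number of studies, and nothing else. A quick sketch (the function name is mine):

```python
# FSN computed directly from the combined (Stouffer) p-value and k.
# This is just an algebraic rearrangement of FSN = (sum Z)^2 / 2.706 - k.
from scipy.stats import norm

def fsn_from_p(p_combined, k, alpha=0.05):
    z_combined = norm.isf(p_combined)             # combined z for the meta-analysis
    z_crit = norm.isf(alpha)
    return k * (z_combined**2 / z_crit**2 - 1)

print(round(fsn_from_p(0.049, k=20)))   # barely-significant meta-analysis: FSN near zero
print(round(fsn_from_p(1e-10, k=20)))   # very small p-value: FSN in the hundreds
```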
In no case does Fail-Safe N indicate the presence or absence of bias. It only places a number on how bad publication bias would have to be, in a world without p-hacking, for the result to be a function of publication bias alone. Unfortunately, we know well that we live in a world with p-hacking. Perhaps this is why Fail-Safe N is sometimes so inappropriately large.
If you need to test for bias, I would recommend instead Begg's test, Egger's test, or p-uniform. If you want to adjust for bias, PET, PEESE, p-curve, p-uniform, or selection models might work. But don't ever interpret Fail-Safe N in ways it was never meant to be used.
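For what it's worth, here is a minimal sketch of Egger's regression test -- not taken from this post, and run on hypothetical data of my own -- in which each study's standard normal deviate is regressed on its precision; an intercept far from zero signals funnel-plot asymmetry:

```python
# A minimal sketch of Egger's regression test (Egger et al., 1997) with
# hypothetical effect sizes and standard errors; names are my own.
import numpy as np
import statsmodels.api as sm

yi = np.array([0.42, 0.35, 0.61, 0.28, 0.55])    # effect sizes (hypothetical)
sei = np.array([0.20, 0.15, 0.30, 0.12, 0.25])   # standard errors (hypothetical)

snd = yi / sei                                   # standard normal deviates
precision = 1.0 / sei
fit = sm.OLS(snd, sm.add_constant(precision)).fit()

print(fit.params[0], fit.pvalues[0])             # intercept and its p-value
```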
Related reading:
Becker (2005) recommends "abandoning Fail-Safe N in favor of other, more informative analyses."
Here Coyne agrees that Fail-Safe N is not a function of bias and does not check for bias.
The Cochrane Collaboration agrees that Fail-Safe N is mostly a function of the net effect size, and criticizes the emphasis on statistical significance over effect size.
Moritz Heene refers me to three other articles pointing out that the average Z-score of unpublished studies is probably not zero, as Fail-Safe N assumes, but rather, less than zero. Thus, the Fail-Safe N is too large. (Westfall's comment below makes a similar point.) This criticism is worth bearing in mind, but I think the larger problem is that Fail-Safe N does not answer the user's question regarding bias.