I would like to argue that, as a way to *detect and account* for bias in meta-analysis, Fail-Safe N is completely useless. Others have said this before (see the bottom of the post for some links), but I needed to explore it further for my own curiosity. Altogether, I have to say that Fail-Safe N appears to be completely obsoleted by subsequent techniques, and thus is not recommended for use.

### Fail-Safe N isn't for detecting bias

When we perform a meta-analysis, the question on our minds is usually "Looking at the gathered studies, how many null results were hidden from report?" Fail-Safe N does not answer that. Instead, it asks, "Looking at the gathered studies, how many more null results would you need before you'd no longer claim an effect?"

This isn't useful as a bias test. Indeed, Rosenthal never meant it as a way to test for the presence of bias -- he'd billed it as an estimate of *tolerance for null results*, an answer to the question "How bad would bias have to be before I changed my mind?" He used it to argue that the published psych literature was not the 5% of Type I errors, while the 95% of null results languished in file drawers. Fail-Safe N was never meant to distinguish biased from unbiased literatures.

### Fail-Safe N doesn't scale with bias

Although Fail-Safe N was never meant to test for bias, sometimes people will act as though a larger Fail-Safe N indicates the absence of bias. That won't work. To see why it won't work, let's look briefly at the equation that defines FSN.

$$\mathrm{FSN} = \frac{\left(\sum Z\right)^2}{2.706} - k$$

where $\sum Z$ is the sum of z-scores from individual studies (small *p*-values mean large *z*-scores) and *k* is the number of studies.
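
For readers wondering where the constant 2.706 comes from, here is a short derivation sketch. It assumes, as I understand Rosenthal's formulation, that the studies are combined with Stouffer's method and tested one-tailed at α = .05, and it introduces two symbols not used elsewhere in this post: $Z_S$ for the combined z-score and $N$ for the number of added studies averaging $z = 0$.

$$Z_S = \frac{\sum Z}{\sqrt{k}}, \qquad Z_S' = \frac{\sum Z}{\sqrt{k + N}}$$

Setting $Z_S'$ equal to the one-tailed critical value 1.645 and solving for $N$:

$$\frac{\sum Z}{\sqrt{k + N}} = 1.645 \;\Longrightarrow\; k + N = \frac{\left(\sum Z\right)^2}{1.645^2} \;\Longrightarrow\; N = \frac{\left(\sum Z\right)^2}{2.706} - k$$

So 2.706 is just $1.645^2$, and FSN is the $N$ at which the combined result slips to exactly $p = .05$.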

This means that Fail-Safe N grows larger with each significant result. Even when each study is just barely significant (*p* = .050), Fail-Safe N will grow rapidly. After six *p* = .05 results, FSN is 30. After ten *p* = .05 results, FSN is 90. After twenty *p* = .05 results, FSN is 380. Fail-Safe N rapidly becomes huge, even when the individual studies just barely cross the significance threshold.
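
If you want to check those numbers yourself, here is a minimal Python sketch. The function name `fail_safe_n` is my own, and I am assuming each "barely significant" study has a one-tailed *p* of .05 (z of about 1.645), which is what makes the arithmetic come out to roughly k² − k.

```python
from statistics import NormalDist

def fail_safe_n(z_scores):
    """Fail-Safe N as defined above: (sum of z)^2 / 2.706 - k."""
    k = len(z_scores)
    return sum(z_scores) ** 2 / 2.706 - k

# z-score of a study that is just barely significant (one-tailed p = .05)
z_at_p05 = NormalDist().inv_cdf(1 - 0.05)  # about 1.645

for k in (6, 10, 20):
    fsn = fail_safe_n([z_at_p05] * k)
    print(f"{k} studies at p = .05 -> FSN of about {round(fsn)}")
```

Because 2.706 is 1.645 squared, k identical studies at z = 1.645 give FSN ≈ k² − k, so the number grows with the square of the study count.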

Worse, FSN can get *bigger* as the literature becomes more biased.

- For each dropped study with an effect size of exactly zero, FSN grows by one. (That's what it says on the tin -- how many dropped zeroes would be required to make p > .05.)
- When dropped studies have positive but non-significant effect sizes, FSN falls.
- When dropped studies have negative effect sizes, FSN rises.

If all the studies with estimated effect sizes less than zero are censored, FSN will quickly rise.
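
Here is a small sketch of that behavior. The five z-scores are made up purely for illustration, and `fail_safe_n` and `without` are hypothetical helper names of my own; the point is only to compare FSN before and after one study is hidden.

```python
def fail_safe_n(z_scores):
    """Fail-Safe N: (sum of z)^2 / 2.706 - k."""
    return sum(z_scores) ** 2 / 2.706 - len(z_scores)

# Hypothetical z-scores for a complete, uncensored literature.
full = [2.5, 2.0, 1.5, 0.0, -1.0]

def without(z_scores, dropped):
    """Copy of the literature with one study (matched by z-score) censored."""
    kept = list(z_scores)
    kept.remove(dropped)
    return kept

baseline = fail_safe_n(full)
drop_zero = fail_safe_n(without(full, 0.0))       # exactly-zero study hidden
drop_positive = fail_safe_n(without(full, 1.5))   # positive, non-significant study hidden
drop_negative = fail_safe_n(without(full, -1.0))  # negative study hidden

print(f"baseline:              {baseline:.2f}")
print(f"zero study dropped:    {drop_zero:.2f}")
print(f"positive n.s. dropped: {drop_positive:.2f}")
print(f"negative dropped:      {drop_negative:.2f}")
```

In this toy literature, hiding the zero study raises FSN by exactly one, hiding the negative study raises it by more, and only hiding the positive study lowers it.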

The same Fail-Safe N could come from any of the following:

- A few honestly-reported studies on a moderate effect.
- A lot of honest studies on a teeny-tiny effect.
- A single study with a whopping effect size.
- A dozen *p*-hacked studies on a null effect.

### Fail-Safe N is often huge, even when it looks like the null is true

Publication bias and flexible analysis being what they are in social psychology, Fail-Safe N tends to return whopping huge numbers. The original Rosenthal paper provides two demonstrations. In one, he synthesizes 94 experiments examining the effects of interpersonal self-fulfilling prophecies, and concludes that 3,263 studies averaging null effects would be necessary to make the effect go away. In another analysis of k = 311 studies, he says nearly 50,000 studies would be needed.

Similarly, in the Hagger et al. meta-analysis of ego depletion, Fail-Safe N reported that 50,000 null studies would be needed to reduce the effect to non-significance. By comparison, the Egger test indicated that the literature was badly biased, and PET-PEESE indicated that the effect size was likely zero. The registered replication report also indicated that the effect size was likely zero. Even a Fail-Safe N of 50,000 does not indicate a robust result.

### Summary

Fail-Safe N is not a useful bias test because:

- It does not tell you whether there is bias.
- Greater bias can lead to a greater Fail-Safe N.
- Hypotheses that appear to be false have nonetheless obtained very large values of FSN.

FSN is just another way to describe the p-value at the end of your meta-analysis. If your p-value is very small, FSN will be very large; if your p-value is just barely under .05, FSN will be small.
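
To make that concrete, here is a rough sketch that computes FSN directly from the meta-analytic p-value. It assumes the studies are pooled with Stouffer's method (combined z equals the sum of z-scores divided by √k) and a one-tailed test; the function name `fsn_from_combined_p` and the choice of k = 10 are mine, for illustration only.

```python
from statistics import NormalDist

def fsn_from_combined_p(p_combined, k):
    """Fail-Safe N implied by a one-tailed Stouffer p-value from k studies."""
    z_combined = NormalDist().inv_cdf(1 - p_combined)  # Stouffer Z = sum(z) / sqrt(k)
    sum_z = z_combined * k ** 0.5                      # recover the sum of z-scores
    return sum_z ** 2 / 2.706 - k

k = 10  # hypothetical number of studies
for p in (0.04, 0.01, 0.001, 0.000001):
    print(f"combined p = {p:g} -> FSN = {fsn_from_combined_p(p, k):.1f}")
```

For a fixed number of studies, FSN is a one-to-one function of the combined p-value: it sits at about zero when p is exactly .05 and climbs as p shrinks. Nothing in the calculation looks at how the studies were selected.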

In no case does Fail-Safe N indicate the presence or absence of bias. It only places a number on how bad publication bias would have to be, in a world without *p*-hacking, for the result to be a function of publication bias alone. Unfortunately, we know well that we live in a world with *p*-hacking. Perhaps this is why Fail-Safe N is sometimes so inappropriately large.

If you need to *test* for bias, I would recommend instead Begg's test, Egger's test, or *p*-uniform. If you want to *adjust* for bias, PET, PEESE, *p*-curve, *p*-uniform, or selection models might work. But don't ever try to interpret the Fail-Safe N in a way it was never meant to be used.

Related reading:

Becker (2005) recommends "abandoning Fail-Safe N in favor of other, more informative analyses."

Here Coyne agrees that Fail-Safe N is not a function of bias and does not check for bias.

The Cochrane Collaboration agrees that Fail-Safe N is mostly a function of the net effect size, and criticizes the emphasis on statistical significance over effect size.

Moritz Heene refers me to three other articles pointing out that the average Z-score of unpublished studies is probably not zero, as Fail-Safe N assumes, but rather, less than zero. Thus, the Fail-Safe N is too large. (Westfall's comment below makes a similar point.) This criticism is worth bearing in mind, but I think the larger problem is that Fail-Safe N does not answer the user's question regarding bias.

Another criticism of fail-safe N, this one just of the basic assumptions of the procedure, can be found on pages 557-558 of Ferguson & Heene (2012), here: https://www.researchgate.net/profile/Christopher_Ferguson/publication/258180082_A_Vast_Graveyard_of_Undead_Theories_Publication_Bias_and_Psychological_Science's_Aversion_to_the_Null/links/0c96053041198978c6000000.pdf

Basically they point out that Rosenthal's method assumes that the average z-statistic for the file-drawered studies is 0. But really, if studies with (say) z > 2 are getting published and the rest are not, then the mean of the unpublished studies must necessarily be negative, not 0. In which case, adding the hypothetical null file-drawered studies to the mix lowers the overall z-statistic much faster, so that the fail-safe N is not nearly as high. They conclude, "Hence, the true fail-safe N is almost never as large as Rosenthal’s fail-safe N."