Monday, May 4, 2015

Bayes Factor: Asking the Right Questions, pt. 2

There has recently been some discussion as to whether Bayes factor is biased in favor of the null. I am particularly sensitive to these concerns as somebody who sometimes uses Bayes factor to argue in favor of the null. I do not want Reviewer 2 to think that I am overstating my evidence.

I would like to address two specific criticisms of Bayes factor, each arguing that the choice of an alternative hypothesis makes it too easy for researchers to argue for the null. 


In a recent blog post, Dr. Simonsohn writes “Because I am not interested in the distribution designated as the alternative hypothesis, I am not interested in how well the data support it.”

Of course, if one does not like one alternative hypothesis, one can choose another. Bayes factor is just the tool, and it's up to the analyst to make the tool answer a valuable question.

I asked Dr. Simonsohn for clarification on what he thought might make a good alternative hypothesis. He suggested a point alternative describing the minimum effect size of interest; that way, the resulting Bayes factor would not lean toward the null too hastily.

That smallest effect size of interest will vary across contexts. For gender discrimination I may have one standard for "too small to care," for psi a much lower standard, and for time travel a tiny one (even a few seconds of time travel would be a wonderful discovery).

Personally, I do not think this makes a good alternative hypothesis. It makes the null and alternative hypotheses so similar that their predictions are nigh-indistinguishable, and it becomes nearly impossible to find evidence one way or the other.
[Figure. Left panel: the null hypothesis (δ = 0) and the "minimum effect of interest" alternative (δ = 0.01). Right panel: the probability of the data under each hypothesis with 200 observations in a between-subjects design. The hypotheses are so similar as to be indistinguishable from each other.]
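To see just how weakly such data discriminate these two hypotheses, here is a minimal sketch with hypothetical numbers. It approximates the sampling distribution of Cohen's d as normal; the observed effect is chosen to fall midway between the two hypotheses.

```python
import math
from scipy import stats

# Hypothetical between-subjects design: 100 observations per group, 200 total.
n_per_group = 100
se = math.sqrt(1 / n_per_group + 1 / n_per_group)  # approximate SE of d, ~0.14

d_obs = 0.005  # a hypothetical observed effect, halfway between the hypotheses

lik_null = stats.norm.pdf(d_obs, loc=0.00, scale=se)  # H0: delta = 0
lik_alt = stats.norm.pdf(d_obs, loc=0.01, scale=se)   # H1: delta = 0.01

print(lik_null / lik_alt)  # ratio of about 1: the data barely discriminate
```

The likelihood ratio sits at essentially 1:1, so 200 observations move our beliefs about these two hypotheses almost not at all.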

Imagine doing an a priori power analysis with this alternative hypothesis for conventional null hypothesis significance testing. Power analysis would tell us we need hundreds of thousands of observations to have adequate power. With fewer than that, significant results could well be flukes (Type I errors), and nonsignificant results could be Type II errors. It's the Sisyphean Decimal Sequence from last post.
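As a rough check on that claim, the standard normal-approximation sample-size formula for a two-sample t-test (a textbook approximation, not anything from the post) gives:

```python
from scipy import stats

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sample t-test:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2  (normal approximation)."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return 2 * (z_a + z_b) ** 2 / d ** 2

# The "minimum effect of interest" alternative from above:
print(round(n_per_group(0.01)))  # roughly 157,000 per group, >300,000 total
```

For δ = 0.01, 80% power demands on the order of 150,000 participants per group; for comparison, a medium effect of δ = 0.5 needs only about 60 per group.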

At some point, you have to live with error. The conventional testing framework assumes an effect size and establishes Type I and Type II error rates from there. But what justifies your a priori power assumption? Dr. Simonsohn's newest paper suggests a negative replication should indicate that the previous study had less than 33% power to detect its effect. But why would we necessarily care about the effect as it was observed in a previous study?

Every choice of alternative hypothesis is, at some level, arbitrary. No effect can be measured to arbitrary precision. Of all the inferential techniques I know, however, Bayes factor states this alternative hypothesis most transparently and reports the evidence in the most finely-grained units.

In practice, we don't power studies to detect the minimum interesting effect. We power them to the effect size we expect given the theory. The alternative hypothesis in Bayesian model comparison should be chosen the same way, representing our best guess about the effect. Morey et al. (submitted) call this a "consensus prior": the prior a "reasonable, but somewhat-removed researcher would have [when trying to quantify evidence for or against the theory]."


Dr. Schimmack also thinks that Bayes factor is prejudiced against small effects and that it makes it too easy to land a prestigious JEP:G publication ruling in favor of the null. In his complaint, he examines an adversarial collaboration between Matzke, Nieuwenhuis, and colleagues. Nieuwenhuis et al. argue that horizontal eye movements improve memory, while Matzke et al. argue that they have no such effect. Data are collected, and we ask: whose hypothesis do they support, Nieuwenhuis's or Matzke's?

In the data, the effect of horizontal eye movements was actually negative. This is unusual given Matzke’s hypothesis, but very unusual given Nieuwenhuis’ hypothesis. Because the results are 10 times more likely given Matzke’s hypothesis than Nieuwenhuis’, we rule in favor of Matzke’s null hypothesis.
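That comparison can be sketched numerically. This is not the collaboration's actual analysis; the observed effect, standard error, and half-Cauchy prior scale below are hypothetical stand-ins chosen only to show the mechanics.

```python
import numpy as np
from scipy import stats, integrate

d_obs = -0.81  # hypothetical observed standardized effect (negative direction)
se = 0.35      # hypothetical standard error of d

# Matzke et al.'s hypothesis: delta = 0
lik_null = stats.norm.pdf(d_obs, loc=0, scale=se)

# Nieuwenhuis et al.'s hypothesis: delta > 0, half-Cauchy(scale = 0.707) prior
def integrand(delta):
    prior = 2 * stats.cauchy.pdf(delta, loc=0, scale=0.707)  # folded to delta > 0
    return prior * stats.norm.pdf(d_obs, loc=delta, scale=se)

lik_alt, _ = integrate.quad(integrand, 0, np.inf)

bf01 = lik_null / lik_alt
print(f"BF01 = {bf01:.1f}")  # a negative effect favors the null over a positive-effect alternative
```

Because the positive-effect alternative places all its prior mass on δ > 0, an observed negative effect is far less likely under it than under the point null, and the Bayes factor comes out well above 1 in the null's favor.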

Dr. Schimmack is dissatisfied with the obtained result and wants more power:
[T]his design has 21% power to reject the null-hypothesis with a small effect size (d = .2). Power for a moderate effect size (d = .5) is 68% and power for a large effect size (d = .8) is 95%.
Thus, the decisive study that was designed to solve the dispute only has adequate power (95%) to test Drs. Matzke et al.’s hypothesis d = 0 against the alternative hypothesis that d = .8. For all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis.
Dr. Schimmack is concerned that the sample size is too small to distinguish the null from the alternative. The rules of the collaboration, however, were to collect data until the Bayes factor reached 10 in favor of one hypothesis or the other. The amount of data collected was indeed enough to distinguish between the two hypotheses: the support is quite strong for the no-effect hypothesis relative to the improvement hypothesis. Everybody goes to the pub to celebrate, having increased their belief in the null relative to this alternative by a factor of 10.

But suppose we tried to interpret the results in terms of power and significance. What would we infer if the result was not significant? Dr. Schimmack’s unusual comment above that “for all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis” leads me to worry that he intends to interpret p > .05 as demonstrating the truth of the null – a definite faux pas in null-hypothesis significance testing.

But what can we infer from p > .05? That the results have no evidentiary value, since we failed to reject the null hypothesis? That the obtained result would occur with probability (1 − power) if the alternative hypothesis δ = 0.5 were true? But why would we care about power against the alternative δ = 0.5, rather than δ = 0.1, or δ = 1.0, or any other point alternative?

Dr. Nieuwenhuis understands his theory, formulated a fair hypothesis, and agreed that a test of that hypothesis would constitute a fair test of the theory. I can see no better or more judicious choice of alternative hypothesis. In a well-designed experiment with a fair hypothesis, the Bayesian test is fair.

Dr. Schimmack further argues that “[The] empirical data actually showed a strong effect in the opposite direction, in that participants in the no-eye-movement condition had better performance than in the horizontal-eye-movement condition (d = -.81). A Bayes Factor for a two-tailed hypothesis or the reverse hypothesis would not have favored the null-hypothesis.” This is an interesting phenomenon, but beside the point of the experiment. Remember the question being asked: is there a positive effect, or no effect? The obtained data support the hypothesis of no effect over the hypothesis of a positive effect.

If one wishes to pursue the new hypothesis of a negative effect in a future experiment, one can certainly do so. If one thinks that the negative effect indicates some failure of the experiment then that is a methodological, not statistical, concern. Keep in mind that both researchers agreed to the validity of the method before the data were collected, so again, we expect that this is a fair test.


Bayes factor provides an effective summary of evidence. A Cauchy or half-Cauchy distribution on the effect size often makes for a fair and reasonable description of the alternative hypothesis. Scientists who routinely read papers with attention to effect size and sample size will quickly find themselves capable of describing a reasonable "consensus prior." 

Having to describe this alternative hypothesis sometimes makes researchers uneasy, but it is also necessary for the interpretation of results in conventional testing. If a test of a subtle effect is statistically significant in a sample of 20, we suspect a Type I error rather than a true effect. If that subtle effect is not statistically significant in a sample of 20, we suspect a Type II error rather than a truly null effect. Specification of the alternative hypothesis makes these judgments transparent and explicit and yields the desired summary of evidence.


  1. I second the argument for decisions based on posteriors and not Bayes factors. See here for a quote from LJ Savage explaining why it must be so. https://twitter.com/AlxEtz/status/589072169683918848

  2. Hi Uli,

    I'm leaving a comment on your blog right now. (By the way, the current "default" default is r = .707, meaning 50% of effects greater than |d| = .707, not |d| = 1.)

    Naturally, I agree with Etz, Morey, Savage, and others in abstaining from accept/reject decisions or even verbal categories in describing evidence. The whole beauty of Bayes factor is that it is already in interpretable, continuous units representing the change in belief. I'm not giving that up.

    This is doubly important because the necessary amount of evidence will vary from application to application depending on my prior skepticism. For reasonable hypotheses, I may need >3:1 evidence to mostly believe in it; for unusual hypotheses, >8:1; for the nigh-impossible, 10^6:1 or more.
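    As arithmetic, that point is just Bayes' rule in odds form; the numbers below are illustrative, not from any analysis in the post.

    ```python
    # Posterior odds = prior odds * Bayes factor (illustrative numbers only)
    prior_odds = 1 / 100      # strong prior skepticism: 100:1 against the hypothesis
    bf = 8                    # the data favor the hypothesis 8:1
    posterior_odds = prior_odds * bf
    print(posterior_odds)     # 0.08: still 12.5:1 against, despite "strong" evidence
    ```

    The same 8:1 evidence that would mostly convince me of a reasonable hypothesis leaves an implausible one still heavily disfavored.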