I would
like to address two specific criticisms of Bayes factor, each arguing that the choice of an alternative hypothesis makes it too easy for researchers to argue for the null.
Simonsohn
In a recent blog post, Dr. Simonsohn writes “Because I am not
interested in the distribution designated as the alternative hypothesis, I am
not interested in how well the data support it.”
Of course, if one does not like one alternative hypothesis, one can choose another. Bayes factor is just the tool, and it's up to the analyst to make the tool answer a valuable question.
I asked Dr. Simonsohn for clarification on what he thought might make a good alternative hypothesis. He suggested a point-alternative hypothesis describing the minimum effect size of interest. That way, the Bayes factor yielded would not be too hasty to lean in favor of the null.
That smallest effect size of interest will vary across contexts. For gender discrimination I may have one standard for what is too small to care about, for PSI I will have a much lower standard, and for time travel a tiny one (even a few seconds of time travel would be a wonderful discovery).
Personally, I do not think this makes a good alternative hypothesis. It makes the null and alternative hypotheses so similar that their predictions are nigh-indiscriminable, making it nearly impossible to find evidence one way or the other.
Imagine if we did an a priori power analysis with this alternative hypothesis for conventional null hypothesis significance testing. Power analysis would tell us we would need hundreds of thousands of observations to have adequate power. With less than that, any significant result could well be a fluke (a Type I error), and any nonsignificant result could just as well be a miss (a Type II error). It's the Sisyphean Decimal Sequence from the last post.
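As a rough sketch of what that calculation looks like, assume a hypothetical "minimum interesting" effect of d = 0.01 in a two-sample design (the effect size and power target here are placeholders, not anyone's actual standard):

```python
# Sketch: sample size needed to detect a hypothetical "minimum interesting"
# effect (d = 0.01, an assumed value) with conventional 80% power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.01, alpha=0.05, power=0.80,
                                    alternative='two-sided')
print(f"n per group: {n_per_group:,.0f}")      # roughly 157,000 per group
print(f"total N:     {2 * n_per_group:,.0f}")  # over 300,000 observations in total
```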
At some point, you have to live with error. The conventional testing framework assumes an effect size and establishes Type I and Type II error rates from there. But what justifies your a priori power assumption? Dr. Simonsohn's newest paper suggests a negative replication should indicate that the previous study had less than 33% power to detect its effect. But why would we necessarily care about the effect as it was observed in a previous study?
Every choice of alternative hypothesis is, at some level, arbitrary. No effect can be measured to arbitrary precision. Of all the inferential techniques I know, however, Bayes factor states this alternative hypothesis most transparently and reports the evidence in the most finely-grained units.
In practice, we don't power studies to the minimum interesting effect. We power studies to what we expect the effect size to be given the theory. The alternative hypothesis in Bayesian model comparison should be the same way, representing our best guess about the effect. Morey et al. (submitted) call this a "consensus prior", the prior a "reasonable, but somewhat-removed researcher would have [when trying to quantify evidence for or against the theory]."
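To make the idea concrete, here is a minimal sketch of a Bayes factor computed under a hypothetical consensus prior, a Normal(0.5, 0.2) distribution on the standardized effect size, using a normal approximation to the sampling distribution of the observed effect. The observed effect and sample size below are placeholders, not anyone's actual data, and this is my illustration rather than Morey et al.'s procedure:

```python
# Sketch: Bayes factor comparing a point null (delta = 0) against a hypothetical
# "consensus" alternative, Normal(0.5, 0.2), on the standardized effect size.
# Uses a normal approximation to the sampling distribution of the observed d_hat.
import numpy as np
from scipy import stats
from scipy.integrate import quad

d_hat = 0.30                       # hypothetical observed effect size
n_per_group = 40                   # hypothetical sample size per group
se = np.sqrt(2.0 / n_per_group)    # approximate standard error of d_hat

# Marginal likelihood under H0: delta fixed at 0
m0 = stats.norm.pdf(d_hat, loc=0.0, scale=se)

# Marginal likelihood under H1: average the likelihood over the consensus prior
prior = lambda delta: stats.norm.pdf(delta, loc=0.5, scale=0.2)
m1, _ = quad(lambda delta: stats.norm.pdf(d_hat, loc=delta, scale=se) * prior(delta),
             -np.inf, np.inf)

print("BF10 =", m1 / m0)   # evidence for the consensus alternative over the null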
Schimmack
Dr. Schimmack
also thinks that Bayes factor is prejudiced
against small effects
and that it makes it too easy to land a prestigious JEP:G publication ruling in
favor of the null. In his complaint, he examines an antagonistic collaboration among
Matzke, Nieuwenhuis, and colleagues. Nieuwenhuis et al. argue that horizontal
eye movements improve memory, while Matzke et al. argue that they have no such
effect. Data is collected, and we ask questions of it: Whose hypothesis is
supported, Nieuwenhuis’ or Matzke’s?
In the
data, the effect of horizontal eye movements was actually negative. This is
unusual given Matzke’s hypothesis, but far more
unusual given Nieuwenhuis’ hypothesis. Because the results are 10 times
more likely given Matzke’s hypothesis than Nieuwenhuis’, we rule in favor of Matzke’s
null hypothesis.
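As a sketch of that comparison, one can pit a directional positive-effect alternative against the point null for an observed effect of d = -.81, the value reported later in Dr. Schimmack's critique. The sample size below is a placeholder, since I am only illustrating the logic, not reproducing the study's actual analysis:

```python
# Sketch: point null (delta = 0) versus a directional alternative in which the
# effect is positive, modeled here as a half-Cauchy prior on delta > 0 with
# scale 0.707 (a common default choice). The observed effect is negative
# (d = -0.81, as reported); the sample size is assumed for illustration only.
import numpy as np
from scipy import stats
from scipy.integrate import quad

d_hat = -0.81
n_per_group = 25                    # hypothetical, for illustration
se = np.sqrt(2.0 / n_per_group)     # approximate standard error of d_hat

m0 = stats.norm.pdf(d_hat, loc=0.0, scale=se)   # likelihood under the null

# Half-Cauchy prior on positive effects (density doubled on delta > 0)
half_cauchy = lambda delta: 2.0 * stats.cauchy.pdf(delta, loc=0.0, scale=0.707)
m1, _ = quad(lambda delta: stats.norm.pdf(d_hat, loc=delta, scale=se) * half_cauchy(delta),
             0.0, np.inf)

print("BF01 =", m0 / m1)   # evidence for the null over the positive-effect alternative
```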
Dr. Schimmack
is dissatisfied with the obtained result and wants more power:
“[T]his design has 21% power to reject the null-hypothesis with a small effect size (d = .2). Power for a moderate effect size (d = .5) is 68% and power for a large effect size (d = .8) is 95%.
Thus, the decisive study that was designed to solve the dispute only has adequate power (95%) to test Drs. Matzke et al.’s hypothesis d = 0 against the alternative hypothesis that d = .8. For all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis.”
Dr. Schimmack is concerned that the sample size is too small to distinguish the null from the alternative. The rules of the collaboration, however, were to collect data until the Bayes factor was 10 for one or the other hypothesis. The amount of data collected was indeed enough to distinguish between the two hypotheses, as the
support is quite strong for the no-effect-hypothesis relative to the
improvement-hypothesis. Everybody goes to the pub to celebrate, having
increased their belief in the null relative to this alternative by a factor of
10.
But
suppose we tried to interpret the results in terms of power and significance. What
would we infer if the result were not significant? Dr. Schimmack’s unusual
comment above that “for all effect sizes between 0 and .8, the study was biased
in favor of the null-hypothesis” leads me to worry that he intends to interpret
p > .05 as demonstrating the truth
of the null – a definite faux pas in null-hypothesis
significance testing.
But what
can we infer from p > .05? That the results have no
evidentiary value, being unable to reject the null hypothesis? That the
obtained result had probability 1 – power of occurring if the alternative hypothesis δ = 0.5
were true? But why would we care about the power based on the alternative
hypothesis δ = 0.5, and not δ = 0.1, or δ = 1.0, or any other point-alternative
hypothesis?
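Here is a small sketch of that arbitrariness, using a hypothetical sample size of 26 per group: the same nonsignificant result carries whatever "power" the chosen point alternative dictates.

```python
# Sketch: power of the same two-sample design (hypothetical n = 26 per group)
# under several point alternatives. The interpretation of p > .05 shifts
# depending entirely on which delta we decide to care about.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for delta in (0.1, 0.5, 1.0):
    power = analysis.power(effect_size=delta, nobs1=26, alpha=0.05,
                           alternative='two-sided')
    print(f"delta = {delta:.1f}: power = {power:.2f}")
```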
Dr. Nieuwenhuis
understands his theory, formulated a fair hypothesis, and agreed that a test of
that hypothesis would constitute a fair test of the theory. I can see no better
or more judicious choice of alternative hypothesis. In a well-designed
experiment with a fair hypothesis, the Bayesian test is fair.
Dr.
Schimmack further argues that “[The]
empirical data actually showed a strong effect in the opposite direction, in
that participants in the no-eye-movement condition had better performance than
in the horizontal-eye-movement condition (d = -.81). A Bayes Factor
for a two-tailed hypothesis or the reverse hypothesis would not have favored
the null-hypothesis.” This is an interesting phenomenon, but beside the
point of the experiment. Remember the question being asked: Is there a positive
effect, or no effect? The obtained data support the hypothesis of no effect
over the hypothesis of a positive effect.
If one wishes to pursue the new hypothesis of a negative effect in a future experiment,
one can certainly do so. If one thinks that the negative effect
indicates some failure of the experiment then that is a methodological, not
statistical, concern. Keep in mind that both researchers agreed to the validity
of the method before the data were collected, so again, we expect that this is
a fair test.
Having to describe this alternative hypothesis sometimes makes researchers uneasy, but it is also necessary for the interpretation of results in conventional testing. If a test of a subtle effect is statistically significant in a sample of 20, we suspect a Type I error rather than a true effect. If that subtle effect is not statistically significant in a sample of 20, we suspect a Type II error rather than a truly absent effect. Specification of the alternative hypothesis makes these judgments transparent and explicit and yields the desired summary of evidence.
Summary
Bayes factor provides an effective summary of evidence. A Cauchy or half-Cauchy distribution on the effect size often makes for a fair and reasonable description of the alternative hypothesis. Scientists who routinely read papers with attention to effect size and sample size will quickly find themselves capable of describing a reasonable "consensus prior."