I love Bayesian model comparison. In my opinion, null hypothesis significance testing (NHST) falls short in two ways: 1) it gives dichotomous accept/reject outcomes when we all know that evidence is a continuous quantity, and 2) it can never provide evidence for the null, only fail to reject it. This latter point is important because, in my opinion, the null is often true, so we should be able to provide evidence and assign belief to it.
By comparison, Bayesian model comparison has neither weakness of NHST. First, it yields a "Bayes factor", the multiplicative and continuous change in beliefs effected by the data. Second, it can yield Bayes factors favoring the null hypothesis over a specified alternative hypothesis.
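Concretely, the Bayes factor is the ratio of how well each hypothesis predicted the observed data, BF10 = p(data | H1) / p(data | H0), and it converts prior odds into posterior odds: p(H1 | data) / p(H0 | data) = BF10 × p(H1) / p(H0). A BF10 of 10 means the data shift your relative belief toward the alternative by a factor of ten, whatever your starting odds.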
Despite my enthusiasm, one criticism I see now and again is that the obtained Bayes factor varies as a function of the hypothesis being tested. See, e.g., this Twitter thread or Simonsohn (2015):
When a default Bayesian test favors the null hypothesis, the correct interpretation of the result is that the data favor the null hypothesis more than that one specific alternative hypothesis. The Bayesian test could conclude against the same null hypothesis, using the same data, if a different alternative hypothesis were used, say, that the effect is distributed normal but with variance of 0.5 instead of 1, or that the distribution is skewed or has some other mean value.*
To some researchers, this may seem undesirable. Science and analysis are supposed to be "objective," so the subjectivity in Bayesian analysis may seem unappealing.
To a Bayesian, however, this is the behavior as intended: the Bayes factor is supposed to vary according to the hypotheses tested. The answer should depend on the question.
Asking the Right Question
The problem reminds me of the classic punchline in Douglas Adams' Hitchhiker's Guide to the Galaxy. An advanced civilization builds a massive supercomputer, at great expense, to run for millions of years and provide the answer to life, the universe, and everything. Eons later, as the calculations finally complete, the computer pronounces its answer: "Forty-two."
Everyone winces. They demand to know what the computer means by forty-two. The computer explains that forty-two is the correct answer, but that the question is still unknown. The programmers are mortified. In their haste to get an impressive answer, they did not stop to consider that every answer is valuable only in the context of its question.
Bayesian model comparison is a way to ask questions. When you ask different questions of your data, you get different answers. Any particular answer is only valuable insofar as the corresponding question is worth asking.
An Example from Psi Research
Let's suppose you're running a study on ESP. You collect a pretty decently-sized sample, and at the end of the day, you're looking at an effect size and confidence interval (ESCI) of d = 0.15 (-.05, .35). Based on this, what is your inference?
The NHST inference is that you didn't learn anything: you failed to reject the null, so the null stands for today, but maybe in the future with more data you'd reject the null with d = .03 (.01, .05) or something. You can never actually find evidence for the null so long as you use NHST. In the most generous case, you might argue that you've rejected some other null hypothesis, such as δ > .35.
The ESCI inference is that the true effect of ESP is somewhere in the interval.** Zero is in the interval, and we don't believe that ESP exists, so we're vaguely satisfied. But how narrow an interval around zero do we need before we're convinced that there's no ESP? How much evidence do we have for zero relative to some predicted effect?
Bayesian Inferences
Now you consult a Bayesian (Figure 1). You ask the Bayesian which she favors: the null hypothesis δ = 0, or the alternative hypothesis δ ≠ 0. She shakes her head. Your alternative hypothesis makes no predictions. The effect could be anywhere from negative infinity to positive infinity, or so close to zero as to be nearly equal to it. She urges you to be more specific.
To get an answer, you will have to provide a more specific question. Bayesian model comparison operates by comparing the predictions of two or more models and seeing which is best supported by the data. Because it is a daunting task to try to precisely predict the effect size (although we often attempt to do so in a priori power analysis), we can instead assign probability across a range of values.
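Formally, assigning probability across a range of values means each hypothesis is scored by its marginal likelihood -- the likelihood of the data averaged over that hypothesis's prior on the effect size, p(data | H1) = ∫ p(data | δ) p(δ | H1) dδ -- and the Bayes factor is the ratio of the competing marginal likelihoods. A hypothesis that spreads its probability over effect sizes the data contradict pays a price in this average.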
Figure 1. Ancient Roman depiction of a Bayesian.
Trying again, you ask her whether there is a large effect of ESP. Maybe the effect of ESP could be a standard deviation in either direction, and any nonzero effect between d = -1 and d = 1 would be considered consistent with the theory. That is, H1: δ ~ Uniform(-1, 1) (see Figure 2). The Bayesian tells you that you have excellent evidence for the null relative to this hypothesis.
Encouraged, you ask her whether there is a medium effect of ESP. Maybe ESP would change behavior by about half a standard deviation in either direction; small effects are more likely than large effects, but large effects are possible too. That is, H2: δ ~ Cauchy(0.5) (see Figure 3). The Bayesian tells you that you have pretty good evidence for the null against this hypothesis, but not overwhelming evidence.
Finally, you ask her whether you have evidence against even the tiniest effect of ESP. Between the null hypothesis H0: δ = 0 and the alternative H3: δ ~ Cauchy(1×10^-3), which does she prefer? She shrugs. These two hypotheses make nearly identical predictions about what you might see in your experiment (see Figure 4). Your data cannot distinguish between the two. You would need to spend several lifetimes collecting data before you were able to measurably shift belief from this alternative to the null.
Figure 2. Competing statements of belief about the effect size delta.
Figure 3. A Cauchy-distributed alternative hypothesis can be more conservative, placing more weight on smaller, more realistic effect sizes while maintaining the possibility of large effects.
Figure 4. The null and alternative hypotheses make nearly identical statements of belief.
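If you want to see the machinery, here is a minimal sketch in Python of how these three comparisons can be computed by direct numerical integration. The numbers are assumptions for illustration: the observed d = 0.15 is given a standard error of 0.102 (back-calculated from the 95% interval above), and the likelihood of the observed effect size is approximated as normal. An exact calculation, such as the JZS Bayes factor in Rouder and Morey's BayesFactor package, would work from the t statistic and sample sizes, so treat the magnitudes as illustrative.

```python
# A minimal sketch of the three comparisons above. Assumptions: the observed
# d = 0.15 has standard error ~0.102 (implied by the 95% CI of (-.05, .35)),
# and the likelihood of the observed effect size is approximately normal.
from scipy import stats, integrate

d_obs, se = 0.15, 0.102  # observed effect size and assumed standard error

def likelihood(delta):
    """Approximate likelihood of observing d_obs given a true effect delta."""
    return stats.norm.pdf(d_obs, loc=delta, scale=se)

# Marginal likelihood under the point null H0: delta = 0.
m0 = likelihood(0.0)

# Priors on delta under each alternative hypothesis from the text.
priors = {
    "H1: delta ~ Uniform(-1, 1)":   stats.uniform(loc=-1, scale=2).pdf,
    "H2: delta ~ Cauchy(0, 0.5)":   stats.cauchy(loc=0, scale=0.5).pdf,
    "H3: delta ~ Cauchy(0, 0.001)": stats.cauchy(loc=0, scale=0.001).pdf,
}

for name, prior_pdf in priors.items():
    # Marginal likelihood: average the likelihood over the prior on delta.
    # The breakpoint at 0 helps quad resolve the narrow Cauchy(0.001) spike.
    m1, _ = integrate.quad(lambda d: likelihood(d) * prior_pdf(d),
                           -5, 5, points=[0.0], limit=500)
    print(f"{name}: BF01 = {m0 / m1:.2f}")
```

The pattern, not the exact values, is the point: the diffuse Uniform(-1, 1) alternative is the easiest for the null to beat, the Cauchy(0.5) alternative is harder, and the Cauchy(0.001) alternative is nearly indistinguishable from the null, with BF01 ≈ 1.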
The Bayesian assures you that this is not a specific weakness of Bayesian model comparison. If you were a frequentist, your opponents could always complain that your study did not have enough power to detect δ = 1×10^-4. If you were into estimation, your opponents could complain that your ESCI did not exclude δ = 1×10^-4. You wonder if this is any way to spend your life, chasing eternally after your opponents' ever-shifting goalposts.
It is my opinion that these minimal-effect-size hypotheses are not questions worth asking in most psychological research. If the effect is truly so tiny, its real-world relevance is minimal. If the phenomenon requires several thousand observations to distinguish signal from noise, it is probably not practical to study it. I think these hypotheses are most often employed as last lines of epistemic defense, a retreat of the effect size from something meaningful to something essentially untestable.
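A little arithmetic shows how untestable: in a two-group design the standard error of d shrinks at roughly 2/√N, so merely getting the standard error down to the scale of δ = 0.001 requires on the order of N = (2/0.001)² = 4 million observations, and actually resolving such an effect from zero would take far more.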
At some point, you will have to draw a limit. You will have to make an alternative hypothesis and declare, "Here is the approximate effect size predicted by the theory." You won't have to select one specific point, because you can spread the probability judiciously across a range of plausible values. It may not be exactly the hypothesis every single researcher would choose, but it will be reasonable, because you will select it carefully. You are a scientist, and you are good at asking meaningful questions. When you ask that meaningful question, Bayesian model comparison will give you a meaningful answer.
In Summary
Bayesian model comparison is a reasonable and mathematically consistent way to get appropriate answers to whatever question you ask. As the question changes, so too should the answer. This is a feature, not a bug. If every question got the same answer, would we trust that answer?
We must remember that no form of statistics or measurement can hope to measure an effect to arbitrary precision, and so it is epistemically futile to try to prove the null hypothesis δ = 0 absolutely. However, in many cases, δ = 0 seems appropriate, and the data tend to support it relative to any reasonable alternative hypothesis. The argument that the null cannot be supported relative to Ha: δ = 1×10^-10 is trivially true, but scientifically unreasonable and unfair.
Asking good questions is a skill, and doing the appropriate mathematics and programming to model the questions is often no small task. I suggest that we appreciate those who ask good questions and help those who ask poor questions to try other, more informative models.
In my next post, I'll cover some recent, specific critiques of Bayesian model comparison that, in my opinion, hinge on asking the wrong questions for the desired answers.
---------------------------------------
Thanks to Jeff Rouder, Richard Morey, Chris Engelhardt, and Alexander Etz for feedback. Thanks to Uri Simonsohn for clarifications and thoughts.
* Simonsohn clarifies his point briefly in the second half of this blog post -- he is less alarmed by the Bayes factor's sensitivity to the alternative than dissatisfied with the choice of a particular alternative hypothesis. Still, it is my impression that some readers may find this subjectivity scary and therefore, unfortunately, avoid Bayesian model comparison.
** This isn't true either. It is a common misconception that the 95% ESCI contains the true effect with 95% probability. The Bayesian 95% highest posterior density interval (HPDI) does, but you need a prior. Even then, you still have to come to some sort of decision about whether that HPDI is narrow enough. So here we are again.