Monday, April 20, 2015

Bayes Factor: Asking the Right Questions

I love Bayesian model comparison. In my opinion, null hypothesis significance testing (NHST) is not great because 1) it gives dichotomous reject/fail-to-reject outcomes when we all know that evidence is a continuous quantity and 2) it can never provide evidence for the null, only fail to reject it. This latter point is important because I think the null is often true, so we should be able to provide evidence for it and assign belief to it.

By comparison, Bayesian model comparison has neither weakness of NHST. First, it yields a "Bayes factor", the multiplicative and continuous change in beliefs effected by the data. Second, it can yield Bayes factors favoring the null hypothesis over a specified alternative hypothesis.
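
For readers who want the arithmetic spelled out, here is what "multiplicative change in beliefs" means. The notation (D for the data, BF_01 for the Bayes factor favoring the null over the alternative) is shorthand introduced for this illustration only.

```latex
\[
\underbrace{\frac{p(H_0 \mid D)}{p(H_1 \mid D)}}_{\text{posterior odds}}
=
\underbrace{\frac{p(D \mid H_0)}{p(D \mid H_1)}}_{\mathrm{BF}_{01}}
\times
\underbrace{\frac{p(H_0)}{p(H_1)}}_{\text{prior odds}}
\]
```

So a BF_01 of 5 means the data should leave you five times more confident in the null, relative to that particular alternative, than you were beforehand, whatever your starting odds happened to be.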

Despite my enthusiasm, one criticism I see now and again is that the obtained Bayes factor varies as a function of the hypothesis being tested. See, e.g., this Twitter thread or Simonsohn (2015):
When a default Bayesian test favors the null hypothesis, the correct interpretation of the result is that the data favor the null hypothesis more than that one specific alternative hypothesis. The Bayesian test could conclude against the same null hypothesis, using the same data, if a different alternative hypothesis were used, say, that the effect is distributed normal but with variance of 0.5 instead of 1, or that the distribution is skewed or has some other mean value.*

To some researchers, this may seem undesirable. Science and analysis are supposed to be "objective," so the subjectivity in Bayesian analysis may seem unappealing.

To a Bayesian, however, this is the behavior as intended. The Bayes factor is supposed to vary according to the hypotheses tested. The answer should depend on the question.

Asking the Right Question

The problem reminds me of the classic punchline in Douglas Adams’ Hitchhiker’s Guide to the Galaxy. An advanced civilization builds a massive supercomputer at great expense to run for millions of years to provide an answer to life, the universe, and everything.

Eons later, as the calculations finally complete, the computer pronounces its answer: “Forty-two.”

Everyone winces. They demand to know what the computer means by forty-two. The computer explains that forty-two is the correct answer, but that the question is still unknown. The programmers are mortified. In their haste to get an impressive answer, they did not stop to consider that every answer is valuable only in the context of its question.

Bayesian model comparison is a way to ask questions. When you ask different questions of your data, you get different answers. Any particular answer is only valuable insofar as the corresponding question is worth asking.

An Example from PSI Research

Let’s suppose you’re running a study on ESP. You collect a decently sized sample, and at the end of the day, you’re looking at an effect size and confidence interval (ESCI) of d = 0.15 (-.05, .35). Based on this, what is your inference?

The NHST inference is that you didn't learn anything: you failed to reject the null, so the null stands for today, but maybe in the future with more data you’d reject the null with d = .03 (.01, .05) or something. You can never actually find evidence for the null so long as you use NHST. In the most generous case, you might argue that you've rejected some other null hypothesis such as δ > .35.

The ESCI inference is that the true effect of ESP is somewhere in the interval.** Zero is in the interval, and we don’t believe that ESP exists, so we’re vaguely satisfied. But how narrow an interval around zero do we need before we’re convinced that there’s no ESP? How much evidence do we have for zero relative to some predicted effect?

Bayesian Inferences

Now you consult a Bayesian (Figure 1). You ask the Bayesian which she favors: the null hypothesis δ = 0, or the alternative hypothesis δ ≠ 0. She shakes her head. Your alternative hypothesis makes no predictions. The effect could be anywhere from negative infinity to positive infinity, or so close to zero as to be nearly equal to it. She urges you to be more specific.


Figure 1. Ancient Roman depiction of a Bayesian.
To get an answer, you will have to provide a more specific question. Bayesian model comparison operates by comparing one or more model predictions and seeing which is best supported by the data. Because it is a daunting task to try to precisely predict the effect size (although we often attempt to do so in a priori power analysis), we can assign probability across a range of values.

Trying again, you ask her whether there is a large effect of ESP. Maybe the effect of ESP could be a standard deviation in either direction, and any nonzero effect between d = -1 and d = 1 would be considered evidence of the theory. That is, H1: δ ~ Uniform(-1, 1) (see Figure 2). The Bayesian tells you that you have excellent evidence for the null relative to this hypothesis.
Figure 2. Competing statements of belief about the effect size delta.
Encouraged, you ask her whether there is a medium effect of ESP. Maybe ESP would change behavior by about half a standard deviation in either direction; small effects are more likely than large effects, but large effects are possible too. That is, H2: δ ~ Cauchy(0.5) (see Figure 3). The Bayesian tells you that you have pretty good evidence for the null against this hypothesis, but not overwhelming evidence.
Figure 3. A Cauchy-distributed alternative hypothesis can be more conservative, placing more weight on smaller, more realistic effect sizes while maintaining the possibility of large effects.
Finally, you ask her whether you have evidence against even the tiniest effect of ESP. Between the null hypothesis H0: δ = 0 and the alternative H3: δ ~ Cauchy(1×10^-3), which does she prefer? She shrugs. These two hypotheses make nearly-identical predictions about what you might see in your experiment (see Figure 4). Your data cannot distinguish between the two. You would need to spend several lifetimes collecting data before you were able to measurably shift belief from this alternative to the null.


Figure 4. The null and alternative hypotheses make nearly-identical statements of belief.
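
To make the three conversations above concrete, here is a rough sketch of how one might compute the Bayes factor for each alternative. Everything about the likelihood is my assumption for illustration: I treat the observed d as normally distributed around the true δ, and I back the standard error out of the 95% interval by dividing its half-width by 1.96. The function names are made up for this sketch. The Cauchy integral uses the substitution δ = scale × tan(u), which turns the prior into a flat density over u and keeps the sum stable even for very narrow priors.

```python
import math

# Assumed summary of the data: d = 0.15 with 95% CI (-.05, .35).
D_OBS = 0.15
SE = (0.35 - (-0.05)) / (2 * 1.96)  # assumption: normal-theory interval


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def marginal_uniform(d_obs, se, lo, hi):
    """Average likelihood of d_obs when delta ~ Uniform(lo, hi)."""
    return (normal_cdf(hi, d_obs, se) - normal_cdf(lo, d_obs, se)) / (hi - lo)


def marginal_cauchy(d_obs, se, scale, n=100_000):
    """Average likelihood of d_obs when delta ~ Cauchy(0, scale).

    With delta = scale * tan(u), the prior is uniform over u in
    (-pi/2, pi/2), so the marginal is the mean likelihood over u.
    Midpoint rule, so the endpoints are never evaluated.
    """
    step = math.pi / n
    total = 0.0
    for i in range(n):
        u = -math.pi / 2.0 + (i + 0.5) * step
        total += normal_pdf(d_obs, scale * math.tan(u), se)
    return total / n


def bf01(marginal_alt, d_obs=D_OBS, se=SE):
    """Bayes factor for the point null delta = 0 over the alternative."""
    return normal_pdf(d_obs, 0.0, se) / marginal_alt


bf_uniform = bf01(marginal_uniform(D_OBS, SE, -1.0, 1.0))
bf_cauchy_half = bf01(marginal_cauchy(D_OBS, SE, 0.5))
bf_cauchy_tiny = bf01(marginal_cauchy(D_OBS, SE, 1e-3))

print(f"H1 Uniform(-1, 1): BF01 = {bf_uniform:.2f}")
print(f"H2 Cauchy(0.5):    BF01 = {bf_cauchy_half:.2f}")
print(f"H3 Cauchy(1e-3):   BF01 = {bf_cauchy_tiny:.2f}")
```

Under these assumptions the ordering matches the story: the wide uniform alternative fares worst against the null, the Cauchy(0.5) a bit less badly, and the tiny Cauchy comes out almost exactly even, with a Bayes factor near 1. The magnitudes come from my crude approximation and needn't line up with the verbal labels in the story; what carries over is that the same data give different answers to different questions.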
And after that, what’s next? Will you have to refute H4: δ ~ Cauchy(1×10^-4), H5: δ ~ Cauchy(1×10^-5), and so on? A chill falls over you as you consider the possibilities. Each time you defeat one decimal place, another will rise to take its place. The fate of Sisyphus seems pleasant by comparison.

The Bayesian assures you that this is not a specific weakness of Bayesian model comparison. If you were a frequentist, your opponents could always complain that your study did not have enough power to detect δ = 1×10^-4. If you were into estimation, your opponents could complain that your ESCI did not exclude δ = 1×10^-4. You wonder if this is any way to spend your life, chasing eternally after your opponents’ ever-shifting goalposts.

It is my opinion that these minimal-effect-size hypotheses are not questions worth asking in most psychological research. If the effect is truly so tiny, its real-world relevance is minimal. If the phenomenon requires several thousand observations to distinguish signal from noise, it is probably not practical to study it. I think these hypotheses are most often employed as last lines of epistemic defense, a retreat of the effect size from something meaningful to something essentially untestable.
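
To put a rough number on "not practical," here is a back-of-the-envelope sketch of how the required sample size grows as the hypothesized effect shrinks. The design is entirely my assumption: a two-group between-subjects comparison, a two-sided test at α = .05, 80% power, and the usual normal approximation. The function name is made up for this sketch.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants per group needed to detect effect size d.

    Normal-approximation formula for a two-sided, two-sample comparison:
    n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / d) ** 2)


for d in (0.5, 0.1, 1e-3, 1e-4):
    print(f"d = {d:g}: about {n_per_group(d):,} participants per group")
```

Because the required n scales with 1/d², every extra decimal place of smallness multiplies the sample by a hundred. Under these assumptions a medium effect of d = 0.5 needs about 63 people per group, while δ = 1×10^-4 needs on the order of 1.6 billion per group.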

At some point, you will have to draw a limit. You will have to make an alternative hypothesis and declare “Here is the approximate effect size predicted by the theory.” You won’t have to select the specific point, because you can spread the probability judiciously across a range of plausible values. It may not be exactly the hypothesis every single researcher would choose, but it will be reasonable and judicious, because you will select it carefully. You are a scientist, and you are good at asking meaningful questions. When you ask that meaningful question, Bayesian model comparison will give you a meaningful answer.

In Summary

Bayesian model comparison is a reasonable and mathematically consistent way to get appropriate answers to whatever question you ask. As the question changes, so too should the answer. This is a feature, not a bug. If every question got the same answer, would we trust that answer?

We must remember that no form of statistics or measurement can hope to measure an effect to arbitrary precision, and so it is epistemically futile to try to prove absolutely the null hypothesis δ = 0. However, in many cases, δ = 0 seems appropriate, and the data tend to support it relative to any reasonable alternative hypothesis. The argument that the null cannot be supported relative to Ha: δ = 1×10^-10 is trivially true, but scientifically unreasonable and unfair. 

Asking good questions is a skill, and doing the appropriate mathematics and programming to model the questions is often no small task. I suggest that we appreciate those who ask good questions and help those who ask poor questions to try other, more informative models.

In my next post, I'll cover some recent, specific critiques of Bayesian model comparison that, in my opinion, hinge on asking the wrong questions for the desired answers.

---------------------------------------

Thanks to Jeff Rouder, Richard Morey, Chris Engelhardt, and Alexander Etz for feedback. Thanks to Uri Simonsohn for clarifications and thoughts.

* Simonsohn clarifies his point briefly in the second half of this blog post: he is more dissatisfied with the choice of a particular alternative hypothesis than he is alarmed by the Bayes factor's sensitivity to the alternative. Still, it is my impression that some readers may find this subjectivity scary and therefore unfortunately avoid Bayesian model comparison.

** This isn't true either. It is a common misconception that the 95% ESCI contains the true effect with 95% probability. The Bayesian 95% highest-density posterior interval (HDPI) does, but you need a prior. Even then you still have to come to some sort of decision about whether that HDPI is narrow enough or not. So here we are again.