Crystal Prison Zone: Bayes Factor: Asking the Right Questions

Monday, April 20, 2015

Bayes Factor: Asking the Right Questions

I love Bayesian model comparison. It’s my opinion that null hypothesis testing is not great because 1) it gives dichotomous accept/reject outcomes when we all know that evidence is a continuous quantity and 2) it can never provide evidence for the null, only fail to reject it. This latter point is important because it’s my opinion that the null is often true, so we should be able to provide evidence and assign belief to it.

By comparison, Bayesian model comparison has neither weakness of NHST. First, it yields a "Bayes factor", the multiplicative and continuous change in beliefs effected by the data. Second, it can yield Bayes factors favoring the null hypothesis over a specified alternative hypothesis.

Despite my enthusiasm for Bayesian model comparison, one criticism I see now and again about Bayesian model comparison is that the obtained Bayes factor varies as a function of the hypothesis being tested. See e.g. this Twitter thread or Simonsohn (2015):

When a default Bayesian test favors the null hypothesis, the correct interpretation of the result is that the data favor the null hypothesis more than that one specific alternative hypothesis. The Bayesian test could conclude against the same null hypothesis, using the same data, if a different alternative hypothesis were used, say, that the effect is distributed normal but with variance of 0.5 instead of 1, or that the distribution is skewed or has some other mean value.*

To some researchers, this may seem undesirable. Science and analysis are supposed to be "objective," so the subjectivity in Bayesian analysis may seem unappealing.

To a Bayesian, however, this is the behavior as intended. The Bayes factor is supposed to vary according to the hypotheses tested. The answer should depend on the question.

Asking the Right Question

The problem reminds me of the classic punchline in Douglas Adams’ Hitchhiker’s Guide to the Galaxy. An advanced civilization builds a massive supercomputer at great expense to run for millions of years to provide an answer to life, the universe, and everything.

Eons later, as the calculations finally complete, the computer pronounces its answer: “Forty-two.”

Everyone winces. They demand to know what the computer means by forty-two. The computer explains that forty-two is the correct answer, but that the question is still unknown. The programmers are mortified. In their haste to get an impressive answer, they did not stop to consider that every answer is valuable only in the context of its question.

Bayesian model comparison is a way to ask questions. When you ask different questions of your data, you get different answers. Any particular answer is only valuable insofar as the corresponding question is worth asking.

An Example from PSI Research

Let’s suppose you’re running a study on ESP. You collect a pretty decently-sized sample, and at the end of the day, you’re looking at an effect size and confidence interval (ESCI) of d = 0.15 (-.05, .35). Based on this, what is your inference?

The NHST inference is that you didn't learn anything: you failed to reject the null, so the null stands for today, but maybe in the future with more data you’d reject the null with d = .03 (.01, .05) or something. You can never actually find evidence for the null so long as you use NHST. In the most generous case, you might argue that you've rejected some other null hypothesis such as δ > .35.

The ESCI inference is that the true effect of ESP is somewhere in the interval.** Zero is in the interval, and we don’t believe that ESP exists, so we’re vaguely satisfied. But how narrow an interval around zero do we need before we’re convinced that there’s no ESP? How much evidence do we have for zero relative to some predicted effect?

Bayesian Inferences

Now you consult a Bayesian (Figure 1). You ask the Bayesian which she favors: the null hypothesis δ = 0, or the alternative hypothesis δ ≠ 0. She shakes her head. Your alternative hypothesis makes no predictions. The effect could be anywhere from negative infinity to positive infinity, or so close to zero as to be nearly equal it. She urges you to be more specific.

Figure 1. Ancient Roman depiction of a Bayesian.

To get an answer, you will have to provide a more specific question. Bayesian model comparison operates by comparing one or more model predictions and seeing which is best supported by the data. Because it is a daunting task to try to precisely predict the effect size (although we often attempt to do so in a priori power analysis), we can assign probability across a range of values.

Trying again, you ask her whether there is a large effect of ESP. Maybe the effect of ESP could be a standard deviation in either direction, and any nonzero effect between d = -1 and d = 1 would be considered evidence of the theory. That is, H₁: δ ~ Uniform(-1, 1) (see Figure 2). The Bayesian tells you that you have excellent evidence for the null relative to this hypothesis.

Figure 2. Competing statements of belief about the effect size delta.

Encouraged, you ask her whether there is a medium effect of ESP. Maybe ESP would change behavior by about half a standard deviation in either direction; small effects are more likely than large effects, but large effects are possible too. That is, H₂: δ ~ Cauchy(0.5) (see Figure 3). The Bayesian tells you that you have pretty good evidence for the null against this hypothesis, but not overwhelming evidence.

Figure 3. A Cauchy-distributed alternative hypothesis can be more conservative, placing more weight on smaller, more realistic effect sizes while maintaining the possibility of large effects.

Finally, you ask her whether you have evidence against even the tiniest effect of ESP. Between the null hypothesis H₀: δ = 0 and the alternative H₃: δ ~ Cauchy(1x10^-3), which does she prefer? She shrugs. These two hypotheses make nearly-identical predictions about what you might see in your experiment (see Figure 4). Your data cannot distinguish between the two. You would need to spend several lifetimes collecting data before you were able to measurably shift belief from this alternative to the null.

Figure 4. The null and alternative hypotheses make nearly-identical statements of belief.

And after that, what’s next? Will you have to refute H₄: δ ~ Cauchy(1×10^-4), H₅: δ ~ Cauchy(1×10^-5), and so on? A chill falls over you as you consider the possibilities. Each time you defeat one decimal place, another will rise to take its place. The fate of Sisyphus seems pleasant by comparison.

The Bayesian assures you that this is not a specific weakness of Bayesian model comparison. If you were a frequentist, your opponents could always complain that your study did not have enough power to detect δ = 1×10^-4. If you were into estimation, your opponents could complain that your ESCI did not exclude δ = 1×10^-4. You wonder if this is any way to spend your life, chasing eternally after your opponents’ ever-shifting goalposts.

It is my opinion that these minimal-effect-size hypotheses are not questions worth asking in most psychological research. If the effect is truly so tiny, its real-world relevance is minimal. If the phenomenon requires several thousand observations to distinguish signal from noise, it is probably not practical to study it. I think these hypotheses are most often employed as last lines of epistemic defense, a retreat of the effect size from something meaningful to something essentially untestable.

At some point, you will have to draw a limit. You will have to make an alternative hypothesis and declare “Here is the approximate effect size predicted by the theory.” You won’t have to select the specific point, because you can spread the probability judiciously across a range of plausible values. It may not be exactly the hypothesis every single researcher would choose, but it will be reasonable and judicious, because you will select it carefully. You are a scientist, and you are good at asking meaningful questions. When you ask that meaningful question, Bayesian model comparison will give you a meaningful answer.

In Summary

Bayesian model comparison is a reasonable and mathematically-consistent way to get appropriate answers to whatever your question. As the question changes, so too should the answer. This is a feature, not a bug. If every question got the same answer, would we trust that answer?

We must remember that no form of statistics or measurement can hope to measure an effect to arbitrary precision, and so it is epistemically futile to try to prove absolutely the null hypothesis δ = 0. However, in many cases, δ = 0 seems appropriate, and the data tend to support it relative to any reasonable alternative hypothesis. The argument that the null cannot be supported relative to H_a: δ = 1×10^-10 is trivially true, but scientifically unreasonable and unfair.

Asking good questions is a skill, and doing the appropriate mathematics and programming to model the questions is often no small task. I suggest that we appreciate those who ask good questions and help those who ask poor questions to try other, more informative models.

In my next post, I'll cover some recent, specific critiques of Bayesian model comparison that, in my opinion, hinge on asking the wrong questions for the desired answers.

---------------------------------------

Thanks to Jeff Rouder, Richard Morey, Chris Engelhardt, and Alexander Etz for feedback. Thanks to Uri Simonsohn for clarifications and thoughts.

* Simonsohn clarifies his point briefly in the second half of this blog post -- he is moreso dissatisfied with the choice of a particular alternative hypothesis than he is alarmed by the Bayes factor's sensitivity to the alternative. Still, it is my impression that some readers may find this subjectivity scary and therefore unfortunately avoid Bayesian model comparison.

** This isn't true either. It is a common misconception that the 95% ESCI contains the true effect with 95% probability. The Bayesian 95% highest-density posterior interval (HDPI) does, but you need a prior. Even then you still have to come to some sort of decision about whether that HDPI is narrow enough or not. So here we are again.

13 comments:

David ColquhounApril 20, 2015 at 12:17 PM
The problem with this post is that it still doesn't answer the question that most experimenters want to ask, namely what is the chance that I'll be wrong if I claim that there is a real effect. In other words what experimenters want to know is the false discovery rate (a surprisingly large number of them make the mistake of thinking that's what the P value gives).

On the basis of simulated t tests, I maintain that if you observe P = 0.047 in a single test, the false discovery rate is at least 30%. See http://rsos.royalsocietypublishing.org/content/1/3/140216

This agrees quite well with the results of Sellke & Berger, and of Valen Johnson.

It's true that this result depends on assumption of a point null hypothesis. That seems to be the only sort of null hypothesis that makes much sense. We want to see whether or not our results are consistent with both groups being given exactly the same treatment. Of course you can get different answers if you are a subjective Bayesian who feels free to postulate weird sorts of prior for which there is no objective justification.
ReplyDelete
Replies
Jeff RouderApril 20, 2015 at 12:35 PM
David, Is it not true that most objective justifications are based on frequentist-ispired large-sample properties? What possible Bayesian justification could there be? As a subjectivist who follows Lindley and Goldstein, I think getting different answers across different analysts who specify different priors is no problem at all. Priors are part of the model because they affect the predicted distribution of data (that is, I view models as statements about data). We always get different answers when we posit different models, and this is true of frequentist and objective Bayesians, so I don't see how it is a proper critique of subjective Bayesians. Likewise, the inference from a model is conditional on the reasonableness of the model specification, so priors need to be defended of course, but that is a matter of rhetoric and context, and not a matter of long-run property of resulting posteriors.
ReplyDelete
Replies
Antiquated ToryApril 20, 2015 at 12:57 PM
No comment, just subscribing to this thread. Promises to be interesting.
ReplyDelete
Replies
Jonathan ThieleApril 20, 2015 at 1:22 PM
My take on calculating the "false discovery rate" is that it also requires a specification of an alternative model in order to calculate it correctly which, unless I'm mistaken, is problematic for the same reason that the "subjective Bayesian who feels free to postulate weird sorts of prior for which there is no objective justification" is problematic. You're still specifying something that isn't necessarily objectively justifiable but go on as if it does because it "makes sense."

That aside, focusing on false discovery rate itself misses the point of research because it removes the researcher from the analysis and then places them in the spotlight for making such a grand discovery instead of putting them behind the spotlight so that they can highlight the interesting parts of the data. And, last I checked, (good) science is supposed to be concerned with what the data say and not so much with the people who generated it.

Of course there will be some bias, some false discovery, and some parts of the data that were overlooked because of some weird prior specifications, but being so preoccupied with their existences that you obsess over removing them completely doesn't produce useful results. Acknowledging that your first estimates are off for those reasons or that your methods didn't consider some alternative and then re-calibrating and correcting for those errors in the next iteration does.
ReplyDelete
Replies
Richard MoreyApril 20, 2015 at 3:13 PM
"The conventional null hypothesis is that the difference between means is zero, and the alternative hypothesis is that it's not zero. They're precisely what we want to test."

What sort of hypothesis is "not zero"? It makes no predictions, and is completely unfalsifiable. It is unscientific. How do you test it? We want -- need, actually -- hypotheses that make connections with the data. Anything else is unacceptable. You can call that subjective if you like, but it is really just basic a scientific desiderata. Any scientist who says "I think that effect is not zero" is certainly in no risk of being proven wrong; that's because the hypothesis is a unconstrained nonsense.

Once one accepts the fact that any hypothesis must make predictions, then the question becomes how to make those predictions. That's what a prior does.
ReplyDelete
Replies
David ColquhounApril 21, 2015 at 12:14 AM
Richard Morey
If it's the case that the point null is nonsense, then several generations of statisticians who have taught null hypothesis testing (including RA Fisher) have been misleading us badly. Is that what you're saying?

In my opinion what scientists want is to be able to say whether or not an observed difference is consistent with random chance, or whether there is evidence for a real effect. Of course, in the latter case you would estimate the effect size and decide whether or not it was big enough to matter in practice.

The problem with P values is that they don't tell you what you want to know in practice. What you want to know is what your chances are of being wrong if you claim there's a real effect, i.e. the false discovery rate. The lack of a totally unambiguous way of calculating the FDR is a problem, but it's possible to put lower limits on the FDR, and that limit is high enough that it's important to change, at least, the words that are used to describe P values. My suggestions are at http://rsos.royalsocietypublishing.org/content/1/3/140216#comment-1889100957 Steven Goodman has made similar suggestions.

Perhaps, if you are still sceptical, you could point out to me what's wrong with the simulated t tests in my paper. They mimic exactly what's done usually in practice.
ReplyDelete
Replies

Add comment