## Monday, April 20, 2015

### Bayes Factor: Asking the Right Questions

I love Bayesian model comparison. In my opinion, null hypothesis significance testing falls short because 1) it gives dichotomous reject/fail-to-reject outcomes when we all know that evidence is a continuous quantity, and 2) it can never provide evidence for the null, only fail to reject it. This latter point is important because, in my opinion, the null is often true, so we should be able to find evidence for it and assign belief to it.

Bayesian model comparison has neither of NHST's weaknesses. First, it yields a "Bayes factor": the multiplicative, continuous change in beliefs effected by the data. Second, it can yield Bayes factors favoring the null hypothesis over a specified alternative hypothesis.
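As a toy sketch of that multiplicative updating (the helper function here is my own illustration, not standard software):

```python
def update_odds(prior_odds: float, bayes_factor: float) -> float:
    """Posterior odds for H1 over H0 = BF10 * prior odds for H1 over H0."""
    return bayes_factor * prior_odds

# Starting from even (1:1) odds, a BF10 of 4 leaves you at 4:1 for H1;
# a BF10 of 1/4 (equivalently, BF01 = 4) leaves you at 1:4, favoring the null.
print(update_odds(prior_odds=1.0, bayes_factor=4.0))
print(update_odds(prior_odds=1.0, bayes_factor=0.25))
```

Because the update is multiplicative, two readers with different prior odds shift their beliefs by the same factor when shown the same data.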

Despite my enthusiasm, one criticism I see now and again is that the obtained Bayes factor varies as a function of the alternative hypothesis being tested. See, e.g., this Twitter thread, or Simonsohn (2015):

> When a default Bayesian test favors the null hypothesis, the correct interpretation of the result is that the data favor the null hypothesis more than that one specific alternative hypothesis. The Bayesian test could conclude against the same null hypothesis, using the same data, if a different alternative hypothesis were used, say, that the effect is distributed normal but with variance of 0.5 instead of 1, or that the distribution is skewed or has some other mean value.*

To some researchers, this may seem undesirable. Science and analysis are supposed to be "objective," so the subjectivity in Bayesian analysis may seem unappealing.

To a Bayesian, however, this is the intended behavior. The Bayes factor is supposed to vary according to the hypotheses tested: the answer should depend on the question.

### Asking the Right Question

The problem reminds me of the classic punchline in Douglas Adams’ Hitchhiker’s Guide to the Galaxy. An advanced civilization builds a massive supercomputer at great expense to run for millions of years to provide an answer to life, the universe, and everything.

Eons later, as the calculations finally complete, the computer pronounces its answer: “Forty-two.”

Everyone winces. They demand to know what the computer means by forty-two. The computer explains that forty-two is the correct answer, but that the question is still unknown. The programmers are mortified. In their haste to get an impressive answer, they did not stop to consider that every answer is valuable only in the context of its question.

Bayesian model comparison is a way to ask questions. When you ask different questions of your data, you get different answers. Any particular answer is only valuable insofar as the corresponding question is worth asking.

### An Example from PSI Research

Let’s suppose you’re running a study on ESP. You collect a decently sized sample, and at the end of the day you’re looking at an effect size and confidence interval (ESCI) of d = 0.15 (-0.05, 0.35). Based on this, what is your inference?

The NHST inference is that you didn't learn anything: you failed to reject the null, so the null stands for today, but maybe in the future with more data you’d reject the null with d = .03 (.01, .05) or something. You can never actually find evidence for the null so long as you use NHST. In the most generous case, you might argue that you've rejected some other null hypothesis such as δ > .35.

The ESCI inference is that the true effect of ESP is somewhere in the interval.** Zero is in the interval, and we don’t believe that ESP exists, so we’re vaguely satisfied. But how narrow an interval around zero do we need before we’re convinced that there’s no ESP? How much evidence do we have for zero relative to some predicted effect?

### Bayesian Inferences

Now you consult a Bayesian (Figure 1). You ask the Bayesian which she favors: the null hypothesis δ = 0, or the alternative hypothesis δ ≠ 0. She shakes her head. Your alternative hypothesis makes no predictions: the effect could be anywhere from negative infinity to positive infinity, or so close to zero as to be nearly equal to it. She urges you to be more specific.

Figure 1. Ancient Roman depiction of a Bayesian.

To get an answer, you will have to ask a more specific question. Bayesian model comparison works by comparing the predictions of two or more models and seeing which is best supported by the data. Because precisely predicting the effect size is a daunting task (although we often attempt it in a priori power analysis), we can instead spread probability across a range of values.
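Formally, each hypothesis is scored by how well it predicted the data -- its marginal likelihood -- and the Bayes factor is the ratio of these:

```latex
\mathrm{BF}_{01}
  = \frac{p(D \mid H_0)}{p(D \mid H_1)}
  = \frac{p(D \mid \delta = 0)}{\int p(D \mid \delta)\, p(\delta \mid H_1)\, d\delta}
```

Spreading the prior p(δ | H1) over a range of values is exactly the "assign probability across a range" move described above.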

Trying again, you ask her whether there is a large effect of ESP. Maybe the effect of ESP could be a standard deviation in either direction, and any nonzero effect between d = -1 and d = 1 would be considered evidence of the theory. That is, H1: δ ~ Uniform(-1, 1) (see Figure 2). The Bayesian tells you that you have excellent evidence for the null relative to this hypothesis.

Figure 2. Competing statements of belief about the effect size delta.

Encouraged, you ask her whether there is a medium effect of ESP. Maybe ESP would change behavior by about half a standard deviation in either direction; small effects are more likely than large effects, but large effects are possible too. That is, H2: δ ~ Cauchy(0.5) (see Figure 3). The Bayesian tells you that you have pretty good evidence for the null against this hypothesis, but not overwhelming evidence.

Figure 3. A Cauchy-distributed alternative hypothesis can be more conservative, placing more weight on smaller, more realistic effect sizes while maintaining the possibility of large effects.

Finally, you ask her whether you have evidence against even the tiniest effect of ESP. Between the null hypothesis H0: δ = 0 and the alternative H3: δ ~ Cauchy(1×10^-3), which does she prefer? She shrugs. These two hypotheses make nearly identical predictions about what you might see in your experiment (see Figure 4). Your data cannot distinguish between them. You would need to spend several lifetimes collecting data before you could measurably shift belief from this alternative to the null.

Figure 4. The null and alternative hypotheses make nearly-identical statements of belief.

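The three comparisons above can be sketched numerically. As a simplification of a full Bayesian t-test, treat the observed d = 0.15 as approximately normal around the true δ, with the standard error implied by its confidence interval (roughly 0.20 / 1.96 ≈ 0.10); these numbers are the running example's, and the code is my illustration, not the default test's exact machinery:

```python
from scipy import stats, integrate

# Assumed numbers from the example: d = 0.15 with 95% CI (-0.05, 0.35),
# so the standard error is about half the CI width divided by 1.96.
d_obs = 0.15
se = 0.20 / 1.96

def likelihood(delta):
    """Approximate likelihood of the observed d, given true effect delta."""
    return stats.norm.pdf(d_obs, loc=delta, scale=se)

def marginal_likelihood(prior_pdf):
    """p(data | H) = integral of likelihood(delta) * prior(delta) over delta."""
    value, _ = integrate.quad(lambda d: likelihood(d) * prior_pdf(d),
                              -10, 10, points=[0.0], limit=200)
    return value

priors = {
    "H1: Uniform(-1, 1)":  lambda d: stats.uniform.pdf(d, loc=-1, scale=2),
    "H2: Cauchy(0, 0.5)":  lambda d: stats.cauchy.pdf(d, scale=0.5),
    "H3: Cauchy(0, 1e-3)": lambda d: stats.cauchy.pdf(d, scale=1e-3),
}

# BF01 > 1 favors the null H0: delta = 0 over the stated alternative.
bf01 = {name: likelihood(0.0) / marginal_likelihood(pdf)
        for name, pdf in priors.items()}
for name, bf in bf01.items():
    print(f"BF01 versus {name}: {bf:.2f}")
```

With these assumed numbers, the null is favored several-fold over the broad alternatives, while the Bayes factor against the near-point Cauchy(0, 10^-3) alternative hovers near 1 -- mirroring the Bayesian's shrug.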
And after that, what’s next? Will you have to refute H4: δ ~ Cauchy(1×10^-4), H5: δ ~ Cauchy(1×10^-5), and so on? A chill falls over you as you consider the possibilities. Each time you defeat one decimal place, another will rise to take its place. The fate of Sisyphus seems pleasant by comparison.

The Bayesian assures you that this is not a specific weakness of Bayesian model comparison. If you were a frequentist, your opponents could always complain that your study did not have enough power to detect δ = 1×10^-4. If you were into estimation, your opponents could complain that your ESCI did not exclude δ = 1×10^-4. You wonder if this is any way to spend your life, chasing eternally after your opponents’ ever-shifting goalposts.

It is my opinion that these minimal-effect-size hypotheses are not questions worth asking in most psychological research. If the effect is truly so tiny, its real-world relevance is minimal. If the phenomenon requires several thousand observations to distinguish signal from noise, it is probably not practical to study it. I think these hypotheses are most often employed as last lines of epistemic defense, a retreat of the effect size from something meaningful to something essentially untestable.

At some point, you will have to draw a line. You will have to state an alternative hypothesis and declare, “Here is the approximate effect size predicted by the theory.” You won’t have to select a single point, because you can spread the probability judiciously across a range of plausible values. It may not be exactly the hypothesis every single researcher would choose, but it will be reasonable, because you will select it carefully. You are a scientist, and you are good at asking meaningful questions. When you ask a meaningful question, Bayesian model comparison will give you a meaningful answer.

### In Summary

Bayesian model comparison is a reasonable and mathematically consistent way to get an appropriate answer to whatever question you ask. As the question changes, so too should the answer. This is a feature, not a bug. If every question got the same answer, would we trust that answer?

We must remember that no form of statistics or measurement can hope to measure an effect to arbitrary precision, so it is epistemically futile to try to prove the null hypothesis δ = 0 absolutely. Still, in many cases δ = 0 seems appropriate, and the data tend to support it relative to any reasonable alternative hypothesis. The argument that the null cannot be supported relative to Ha: δ = 1×10^-10 is trivially true, but scientifically unreasonable and unfair.

Asking good questions is a skill, and doing the appropriate mathematics and programming to model the questions is often no small task. I suggest that we appreciate those who ask good questions and help those who ask poor questions to try other, more informative models.

In my next post, I'll cover some recent, specific critiques of Bayesian model comparison that, in my opinion, hinge on asking the wrong questions for the desired answers.

---------------------------------------

Thanks to Jeff Rouder, Richard Morey, Chris Engelhardt, and Alexander Etz for feedback. Thanks to Uri Simonsohn for clarifications and thoughts.

* Simonsohn clarifies his point briefly in the second half of this blog post -- he is less alarmed by the Bayes factor's sensitivity to the alternative than dissatisfied with the choice of a particular alternative hypothesis. Still, it is my impression that some readers may find this subjectivity scary and therefore, unfortunately, avoid Bayesian model comparison.

** This isn't true either. It is a common misconception that the 95% ESCI contains the true effect with 95% probability. The Bayesian 95% highest posterior density interval (HPDI) does, but it requires a prior. Even then, you still have to decide whether that HPDI is narrow enough. So here we are again.

1. The problem with this post is that it still doesn't answer the question that most experimenters want to ask: namely, what is the chance that I'll be wrong if I claim that there is a real effect? In other words, what experimenters want to know is the false discovery rate (a surprisingly large number of them make the mistake of thinking that's what the P value gives).

On the basis of simulated t tests, I maintain that if you observe P = 0.047 in a single test, the false discovery rate is at least 30%. See http://rsos.royalsocietypublishing.org/content/1/3/140216

This agrees quite well with the results of Sellke & Berger, and of Valen Johnson.

It's true that this result depends on assumption of a point null hypothesis. That seems to be the only sort of null hypothesis that makes much sense. We want to see whether or not our results are consistent with both groups being given exactly the same treatment. Of course you can get different answers if you are a subjective Bayesian who feels free to postulate weird sorts of prior for which there is no objective justification.
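[For readers curious about the flavor of such a simulation, here is a rough sketch with assumed settings (10% prevalence of real effects, 16 observations per group, effect size of 1 SD when real); it is not Dr. Colquhoun's actual script, and his paper also conditions on p close to 0.047 rather than all p < .05. -- Joe]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)

# Assumed settings, chosen for illustration:
n_sims = 100_000     # simulated experiments
n = 16               # observations per group
prevalence = 0.10    # fraction of experiments where a real effect exists
effect = 1.0         # true effect size (in SD units) when it exists

real = rng.random(n_sims) < prevalence
group_a = rng.normal(0.0, 1.0, size=(n_sims, n))
group_b = rng.normal(np.where(real, effect, 0.0)[:, None], 1.0,
                     size=(n_sims, n))

# One two-sample t test per simulated experiment.
_, p = stats.ttest_ind(group_a, group_b, axis=1)

discoveries = p < 0.05
fdr = np.mean(~real[discoveries])  # share of "discoveries" where the null was true
print(f"False discovery rate among p < .05 results: {fdr:.2f}")
```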

2. David, is it not true that most objective justifications are based on frequentist-inspired large-sample properties? What possible Bayesian justification could there be? As a subjectivist who follows Lindley and Goldstein, I think getting different answers across different analysts who specify different priors is no problem at all. Priors are part of the model because they affect the predicted distribution of data (that is, I view models as statements about data). We always get different answers when we posit different models, and this is true of frequentists and objective Bayesians too, so I don't see how it is a proper critique of subjective Bayesians. Likewise, the inference from a model is conditional on the reasonableness of the model specification, so priors need to be defended, of course, but that is a matter of rhetoric and context, not a matter of the long-run properties of the resulting posteriors.

1. [Posting on behalf of Dr. Colquhoun due to problems with the comment software. -- Joe]

David Colquhoun says:
On the contrary, it's subjective Bayesians who put themselves ahead of the data. They give an answer that depends entirely on a subjective guess of the shape of the prior distributions. Or, still less usefully, they give a whole range of answers.

The conventional null hypothesis is that the difference between means is zero, and the alternative hypothesis is that it's not zero. They're precisely what we want to test. Within that framework, every reasonable prevalence of real effects (i.e. 0.5 or less) gives a false discovery rate of at least 26%.

David

2. David, This is inaccurate. Can you provide a citation or some argument for the claim that "depends entirely on the subjective guess of the shape of prior distributions?" It is true that if one chooses absolutely ridiculous, indefensible priors, then one can get ridiculous answers (I call these "Rush Limbaugh priors" because Rush is so fixated on his answers that data cannot shift his beliefs!). Yet, this argument carries no credibility as researchers need to justify the reasonableness of their priors. They can be neither too thin nor too fat to be justifiable. There will be some subjectivity on this determination, but in my experience and in my writings, this subjectivity has minimal impact. See for example http://pcl.missouri.edu/sites/default/files/p_8.pdf

Maybe we are talking past each other. I have not gotten into FDR too much; it seems to marginalize over things I am uninterested in. Maybe some blog posts about why FDR is so important and how subjectivity makes it unattainable are in order. You are welcome to guest-write on my blog, Invariances, http://jeffrouder.blogspot.com, if you don't have your own.

3. I have to say that I am not familiar with the "Objective Bayesian" school of thought, but it has always seemed to me a contradiction in terms. Placing expectations on effect sizes will always require some judicious but subjective decision-making, as I understand it. So on the face of it, a method of inference that is wholly without subjectivity in the allocation of priors seems impossible -- the inferential equivalent of perpetual motion.

That said, I'll do my best to read the cited literature, but it seems there's always some form of lurking, unstated assumption or prior (e.g. as above, "every reasonable prevalence") or equivocation (e.g. also as above, "at least 26%") that has been chosen to make the machinery work.

I lean toward subjective Bayes because 1) the assumptions, models, etc are all stated plainly so that limitations are obvious and 2) it's not difficult to make fair and reasonable subjective decisions that are broadly appropriate. So while there cannot always be objective justification, I don't think that makes the priors weird, and certainly does not prevent -subjective- justification.

4. [Again posting on behalf of Dr. Colquhoun.]

David Colquhoun says:
My thinking on this topic started out on my blog, See
http://www.dcscience.net/2014/03/10/on-the-hazards-of-significance-testing-part-1-screening/

and

http://www.dcscience.net/2014/03/24/on-the-hazards-of-significance-testing-part-2-the-false-discovery-rate-or-how-not-to-make-a-fool-of-yourself-with-p-values/

Eventually, after 4 months on arXiv it evolved into a proper paper
http://rsos.royalsocietypublishing.org/content/1/3/140216

and lastly a simplified version on YouTube.

3. No comment, just subscribing to this thread. Promises to be interesting.

4. My take on calculating the "false discovery rate" is that it also requires specifying an alternative model in order to calculate it correctly, which, unless I'm mistaken, is problematic for the same reason that the "subjective Bayesian who feels free to postulate weird sorts of prior for which there is no objective justification" is problematic. You're still specifying something that isn't necessarily objectively justifiable, but proceeding as if it were because it "makes sense."

That aside, focusing on false discovery rate itself misses the point of research because it removes the researcher from the analysis and then places them in the spotlight for making such a grand discovery instead of putting them behind the spotlight so that they can highlight the interesting parts of the data. And, last I checked, (good) science is supposed to be concerned with what the data say and not so much with the people who generated it.

Of course there will be some bias, some false discovery, and some parts of the data that were overlooked because of some weird prior specifications, but being so preoccupied with their existences that you obsess over removing them completely doesn't produce useful results. Acknowledging that your first estimates are off for those reasons or that your methods didn't consider some alternative and then re-calibrating and correcting for those errors in the next iteration does.

1. You say
"My take on calculating the "false discovery rate" is that it also requires a specification of an alternative model in order to calculate it correctly which, unless I'm mistaken, is problematic"

Perhaps you would be so kind as to point out what you think is wrong with the results on FDR that I got from simulated t tests?

And perhaps you could also point out the mistakes being made by Valen Johnson in his approach to the problem via uniformly most-powerful Bayesian tests. That approach gives similar results to mine.

2. You cut the statement off a bit too early there. They are only problematic in the same sense that the priors needed in Bayesian analysis are problematic -- they also require additional specification that many psychologists don't provide.

I have not yet seen your FDR analysis itself and did not wish to impugn those results, as I have no doubt that you have done extensive work supporting its effectiveness. I also did not intend to make this a long discussion; I was stating an initial impression, intended as a throw-away comment to be considered and then cast aside. I apologize for my mistakes and any over-reaching in my comment.

3. That's OK. But in general, it's a good idea to read a paper before commenting on it!

5. "The conventional null hypothesis is that the difference between means is zero, and the alternative hypothesis is that it's not zero. They're precisely what we want to test."

What sort of hypothesis is "not zero"? It makes no predictions and is completely unfalsifiable. It is unscientific. How do you test it? We want -- need, actually -- hypotheses that make connections with the data. Anything else is unacceptable. You can call that subjective if you like, but it is really just a basic scientific desideratum. Any scientist who says "I think that effect is not zero" is certainly in no danger of being proven wrong; that's because the hypothesis is unconstrained nonsense.

Once one accepts the fact that any hypothesis must make predictions, then the question becomes how to make those predictions. That's what a prior does.

6. Richard Morey
If it's the case that the point null is nonsense, then several generations of statisticians who have taught null hypothesis testing (including RA Fisher) have been misleading us badly. Is that what you're saying?

In my opinion what scientists want is to be able to say whether or not an observed difference is consistent with random chance, or whether there is evidence for a real effect. Of course, in the latter case you would estimate the effect size and decide whether or not it was big enough to matter in practice.

The problem with P values is that they don't tell you what you want to know in practice. What you want to know is what your chances are of being wrong if you claim there's a real effect, i.e. the false discovery rate. The lack of a totally unambiguous way of calculating the FDR is a problem, but it's possible to put lower limits on the FDR, and that limit is high enough that it's important to change, at least, the words that are used to describe P values. My suggestions are at http://rsos.royalsocietypublishing.org/content/1/3/140216#comment-1889100957 Steven Goodman has made similar suggestions.

Perhaps, if you are still sceptical, you could point out to me what's wrong with the simulated t tests in my paper. They mimic exactly what's done usually in practice.