A psychologist's thoughts on how and why we play games

Monday, May 4, 2015

Bayes Factor: Asking the Right Questions, pt. 2

There has recently been some discussion as to whether Bayes factor is biased in favor of the null. I am particularly sensitive to these concerns as somebody who sometimes uses Bayes factor to argue in favor of the null. I do not want Reviewer 2 to think that I am overstating my evidence.

I would like to address two specific criticisms of Bayes factor, each arguing that the choice of an alternative hypothesis makes it too easy for researchers to argue for the null. 


In a recent blog post, Dr. Simonsohn writes “Because I am not interested in the distribution designated as the alternative hypothesis, I am not interested in how well the data support it.”

Of course, if one does not like one alternative hypothesis, one can choose another. Bayes factor is just the tool, and it's up to the analyst to make the tool answer a valuable question.

I asked Dr. Simonsohn for clarification on what he thought might make a good alternative hypothesis. He suggested a point-alternative hypothesis describing the minimum effect size of interest. That way, the Bayes factor yielded would not be too hasty to lean in favor of the null. 

That smallest effect size will vary across context. For example, for gender discrimination I may have one standard of too small to care, and for PSI I will have a much lower standard, and for time travel a tiny standard (a few seconds of time travel are a wonderful discovery).

Personally, I do not think this is a good alternative hypothesis. It makes the null and alternative hypothesis too similar so that their predictions are nigh-indiscriminable. It makes it nearly impossible to find evidence one way or the other.
Left panel: Depiction of null hypothesis and "minimum effect of interest" alternative. Null hypothesis: δ = 0. Alternative hypothesis: δ = 0.01. 
Right panel: Probability of data given each hypothesis and 200 observations, between-subjects design. 
The hypotheses are so similar as to be indistinguishable from each other.

Imagine if we did a priori power analysis with this alternative hypothesis for conventional null hypothesis significance testing. Power analysis would tell us we would need hundreds of thousands of observations to have adequate power. Less than that, and any significant results could be flukes and Type I errors, and nonsignificant results would be Type II errors. It's the Sisyphean Decimal Sequence from last post.

At some point, you have to live with error. The conventional testing framework assumes an effect size and establishes Type I and Type II error rates from there. But what justifies your a priori power assumption? Dr. Simonsohn's newest paper suggests a negative replication should indicate that the previous study had less than 33% power to detect its effect. But why would we necessarily care about the effect as it was observed in a previous study?

Every choice of alternative hypothesis is, at some level, arbitrary. No effect can be measured to arbitrary precision. Of all the inferential techniques I know, however, Bayes factor states this alternative hypothesis most transparently and reports the evidence in the most finely-grained units.

In practice, we don't power studies to the minimum interesting effect. We power studies to what expect the effect size to be given the theory. The alternative hypothesis in Bayesian model comparison should be the same way, representing our best guess about the effect. Morey et al. (submitted) call this a "consensus prior", the prior a "reasonable, but somewhat-removed researcher would have [when trying to quantify evidence for or against the theory]."


Dr. Schimmack also thinks that Bayes factor is prejudiced against small effects and that it makes it too easy to land a prestigious JEP:G publication ruling in favor of the null. In his complaint, he examines an antagonistic collaboration among Matzke, Nieuwenhuis, and colleagues. Nieuwenhuis et al. argue that horizontal eye movements improve memory, while Matzke et al. argue that they have no such effect. Data is collected, and we ask questions of it: Whose hypothesis is supported, Nieuwenhuis’ or Matzke’s?

In the data, the effect of horizontal eye movements was actually negative. This is unusual given Matzke’s hypothesis, but very unusual given Nieuwenhuis’ hypothesis. Because the results are 10 times more likely given Matzke’s hypothesis than Nieuwenhuis’, we rule in favor of Matzke’s null hypothesis.

Dr. Schimmack is dissatisfied with the obtained result and wants more power:
[T]his design has 21% power to reject the null-hypothesis with a small effect size (d = .2). Power for a moderate effect size (d = .5) is 68% and power for a large effect size (d = .8) is 95%.
Thus, the decisive study that was designed to solve the dispute only has adequate power (95%) to test Drs. Matzke et al.’s hypothesis d = 0 against the alternative hypothesis that d = .8. For all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis.”
Dr. Schimmack is concerned that the sample size is too small to distinguish the null from the alternative. The rules of the collaboration, however, were to collect data until the Bayes factor was 10 for one or the other hypothesis. The amount of data collected was indeed enough to distinguish between the two hypotheses, as the support is quite strong for the no-effect-hypothesis relative to the improvement-hypothesis. Everybody goes to the pub to celebrate, having increased their belief in the null relative to this alternative by a factor of 10.

But suppose we tried to interpret the results in terms of power and significance. What would we infer if the result was not significant? Dr. Schimmack’s unusual comment above that “for all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis” leads me to worry that he intends to interpret p > .05 as demonstrating the truth of the null – a definite faux pas in null-hypothesis significance testing.

But what can we infer from p > .05? That the results have no evidentiary value, being unable to reject the null hypothesis? That the obtained result is (1 – Power)% unlikely if the alternative hypothesis δ = 0.5 were true? But why would we care about the power based on the alternative hypothesis δ = 0.5, and not δ = 0.1, or δ = 1.0, or any other point-alternative hypothesis?

Dr. Niewenhuis understands his theory, formulated a fair hypothesis, and agreed that a test of that hypothesis would constitute a fair test of the theory. I can see no better or more judicious choice of alternative hypothesis. In a well-designed experiment with a fair hypothesis, the Bayesian test is fair.

Dr. Schimmack further argues that “[The] empirical data actually showed a strong effect in the opposite direction, in that participants in the no-eye-movement condition had better performance than in the horizontal-eye-movement condition (d = -.81).   A Bayes Factor for a two-tailed hypothesis or the reverse hypothesis would not have favored the null-hypothesis.” This is an interesting phenomenon, but beside the point of the experiment. Remember the question being asked: Is there a positive effect, or no effect? The obtained data support the hypothesis of no effect over the hypothesis of a positive effect.

If one wishes to pursue the new hypothesis of a negative effect in a future experiment, one can certainly do so. If one thinks that the negative effect indicates some failure of the experiment then that is a methodological, not statistical, concern. Keep in mind that both researchers agreed to the validity of the method before the data were collected, so again, we expect that this is a fair test.


Bayes factor provides an effective summary of evidence. A Cauchy or half-Cauchy distribution on the effect size often makes for a fair and reasonable description of the alternative hypothesis. Scientists who routinely read papers with attention to effect size and sample size will quickly find themselves capable of describing a reasonable "consensus prior." 

Having to describe this alternative hypothesis sometimes makes researchers uneasy, but it is also necessary for the interpretation of results in conventional testing. If a test of a subtle effect is statistically significant in a sample of 20, we suspect a Type I error rather than a true effect. If that subtle effect is not statistically significant in a sample of 20, we suspect a Type II error rather than a true effect. Specification of the alternative hypothesis makes these judgments transparent and explicit and yields the desired summary of evidence.

Monday, April 20, 2015

Bayes Factor: Asking the Right Questions

I love Bayesian model comparison. It’s my opinion that null hypothesis testing is not great because 1) it gives dichotomous accept/reject outcomes when we all know that evidence is a continuous quantity and 2) it can never provide evidence for the null, only fail to reject it. This latter point is important because it’s my opinion that the null is often true, so we should be able to provide evidence and assign belief to it. 

By comparison, Bayesian model comparison has neither weakness of NHST. First, it yields a "Bayes factor", the multiplicative and continuous change in beliefs effected by the data. Second, it can yield Bayes factors favoring the null hypothesis over a specified alternative hypothesis.

Despite my enthusiasm for Bayesian model comparison, one criticism I see now and again about Bayesian model comparison is that the obtained Bayes factor varies as a function of the hypothesis being tested. See e.g. this Twitter thread or Simonsohn (2015)
When a default Bayesian test favors the null hypothesis, the correct interpretation of the result is that the data favor the null hypothesis more than that one specific alternative hypothesis. The Bayesian test could conclude against the same null hypothesis, using the same data, if a different alternative hypothesis were used, say, that the effect is distributed normal but with variance of 0.5 instead of 1, or that the distribution is skewed or has some other mean value.*

To some researchers, this may seem undesirable. Science and analysis are supposed to be "objective," so the subjectivity in Bayesian analysis may seem unappealing.

To a Bayesian, however, this is the behavior as intended. The Bayes factor is supposed to vary according to the hypotheses tested. The answer should depend on the question.

Asking the Right Question

The problem reminds me of the classic punchline in Douglas Adams’ Hitchhiker’s Guide to the Galaxy. An advanced civilization builds a massive supercomputer at great expense to run for millions of years to provide an answer to life, the universe, and everything.

Eons later, as the calculations finally complete, the computer pronounces its answer: “Forty-two.”
Everyone winces. They demand to know what the computer means by forty-two. The computer explains that forty-two is the correct answer, but that the question is still unknown. The programmers are mortified. In their haste to get an impressive answer, they did not stop to consider that every answer is valuable only in the context of its question.

Bayesian model comparison is a way to ask questions. When you ask different questions of your data, you get different answers. Any particular answer is only valuable insofar as the corresponding question is worth asking.

An Example from PSI Research

Let’s suppose you’re running a study on ESP. You collect a pretty decently-sized sample, and at the end of the day, you’re looking at an effect size and confidence interval (ESCI) of d = 0.15 (-.05, .35). Based on this, what is your inference?

The NHST inference is that you didn't learn anything: you failed to reject the null, so the null stands for today, but maybe in the future with more data you’d reject the null with d = .03 (.01, .05) or something. You can never actually find evidence for the null so long as you use NHST. In the most generous case, you might argue that you've rejected some other null hypothesis such as δ > .35.

The ESCI inference is that the true effect of ESP is somewhere in the interval.** Zero is in the interval, and we don’t believe that ESP exists, so we’re vaguely satisfied. But how narrow an interval around zero do we need before we’re convinced that there’s no ESP? How much evidence do we have for zero relative to some predicted effect?

Bayesian Inferences

Now you consult a Bayesian (Figure 1). You ask the Bayesian which she favors: the null hypothesis δ = 0, or the alternative hypothesis δ ≠ 0. She shakes her head. Your alternative hypothesis makes no predictions. The effect could be anywhere from negative infinity to positive infinity, or so close to zero as to be nearly equal it. She urges you to be more specific.

Figure 1. Ancient Roman depiction of a Bayesian.
To get an answer, you will have to provide a more specific question. Bayesian model comparison operates by comparing one or more model predictions and seeing which is best supported by the data. Because it is a daunting task to try to precisely predict the effect size (although we often attempt to do so in a priori power analysis), we can assign probability across a range of values.

Trying again, you ask her whether there is a large effect of ESP. Maybe the effect of ESP could be a standard deviation in either direction, and any nonzero effect between d = -1 and d = 1 would be considered evidence of the theory. That is, H1: δ ~ Uniform(-1, 1) (see Figure 2). The Bayesian tells you that you have excellent evidence for the null relative to this hypothesis.
Figure 2. Competing statements of belief about the effect size delta.
Encouraged, you ask her whether there is a medium effect of ESP. Maybe ESP would change behavior by about half a standard deviation in either direction; small effects are more likely than large effects, but large effects are possible too. That is, H2: δ ~ Cauchy(0.5) (see Figure 3). The Bayesian tells you that you have pretty good evidence for the null against this hypothesis, but not overwhelming evidence.
Figure 3. A Cauchy-distributed alternative hypothesis can be more conservative, placing more weight on smaller, more realistic effect sizes while maintaining the possibility of large effects.
Finally, you ask her whether you have evidence against even the tiniest effect of ESP. Between the null hypothesis H0: δ = 0 and the alternative H3: δ ~ Cauchy(1x10^-3), which does she prefer? She shrugs. These two hypotheses make nearly-identical predictions about what you might see in your experiment (see Figure 4). Your data cannot distinguish between the two. You would need to spend several lifetimes collecting data before you were able to measurably shift belief from this alternative to the null.

Figure 4. The null and alternative hypotheses make nearly-identical statements of belief.
And after that, what’s next? Will you have to refute H4: δ ~ Cauchy(1x10^-4), H5: δ ~ Cauchy(1x10^-5), and so on? A chill falls over you as you consider the possibilities. Each time you defeat one decimal place, another will rise to take its place. The fate of Sisyphus seems pleasant by comparison.

The Bayesian assures you that this is not a specific weakness of Bayesian model comparison. If you were a frequentist, your opponents could always complain that your study did not have enough power to detect δ = 1x10^-4. If you were into estimation, your opponents could complain that your ESCI did not exclude δ = 1x10^-4. You wonder if this is any way to spend your life, chasing eternally after your opponents’ ever-shifting goalposts.

It is my opinion that these minimal-effect-size hypotheses are not questions worth asking in most psychological research. If the effect is truly so tiny, its real-world relevance is minimal. If the phenomenon requires several thousand observations to distinguish signal from noise, it is probably not practical to study it. I think these hypotheses are most often employed as last lines of epistemic defense, a retreat of the effect size from something meaningful to something essentially untestable.

At some point, you will have to draw a limit. You will have to make an alternative hypothesis and declare “Here is the approximate effect size predicted by the theory.” You won’t have to select the specific point, because you can spread the probability judiciously across a range of plausible values. It may not be exactly the hypothesis every single researcher would choose, but it will be reasonable and judicious, because you will select it carefully. You are a scientist, and you are good at asking meaningful questions. When you ask that meaningful question, Bayesian model comparison will give you a meaningful answer.

In Summary

Bayesian model comparison is a reasonable and mathematically-consistent way to get appropriate answers to whatever your question. As the question changes, so too should the answer. This is a feature, not a bug. If every question got the same answer, would we trust that answer?

We must remember that no form of statistics or measurement can hope to measure an effect to arbitrary precision, and so it is epistemically futile to try to prove absolutely the null hypothesis δ = 0. However, in many cases, δ = 0 seems appropriate, and the data tend to support it relative to any reasonable alternative hypothesis. The argument that the null cannot be supported relative to Ha: δ = 1x10^-10 is trivially true, but scientifically unreasonable and unfair. 

Asking good questions is a skill, and doing the appropriate mathematics and programming to model the questions is often no small task. I suggest that we appreciate those who ask good questions and help those who ask poor questions to try other, more informative models.

In my next post, I'll cover some recent, specific critiques of Bayesian model comparison that, in my opinion, hinge on asking the wrong questions for the desired answers.


Thanks to Jeff Rouder, Richard Morey, Chris Engelhardt, and Alexander Etz for feedback. Thanks to Uri Simonsohn for clarifications and thoughts.

* Simonsohn clarifies his point briefly in the second half of this blog post -- he is moreso dissatisfied with the choice of a particular alternative hypothesis than he is alarmed by the Bayes factor's sensitivity to the alternative. Still, it is my impression that some readers may find this subjectivity scary and therefore unfortunately avoid Bayesian model comparison.

** This isn't true either. It is a common misconception that the 95% ESCI contains the true effect with 95% probability. The Bayesian 95% highest-density posterior interval (HDPI) does, but you need a prior. Even then you still have to come to some sort of decision about whether that HDPI is narrow enough or not. So here we are again.

Friday, November 28, 2014

Exciting New Misapplications of The New Statistics

This year's increased attention to effect sizes and confidence intervals (ESCI) has been great for psychological science. ESCI offers a number of improvements over null-hypothesis significance testing (NHST), such as an attention to practical significance and the elimination of dichotomous decision rules.

However, the problem of ESCI is that it is purely descriptive, not inferential. No hypotheses are named, and so ESCI doesn't report on the probability of a hypothesis given the data, or even the probability of the data given a null hypothesis. No process or statistic turns the ESCI into a decision, although we might make Geoff Cumming cringe by looking at whether the ESCI includes zero and making a decision based on that, thereby falling right back to using NHST.

The point is, there's no theoretical or even pragmatic method for turning an ESCI into an inference. At what point does a confidence interval become sufficiently narrow to make a decision? We know that values near the extremes of the interval are often less likely than the values near the middle, but how much less likely?

I'm not asking for a formal dichotomous decision rule (I'm a Bayesian, I have resigned my life to uncertainty), but I've already noticed the ways we can apply ESCI inconsistently to overstate the evidence. See a recent example from Boothby, Clark, and Bargh (PDF link), arguing that shared experiences are more intense in two studies of n = 23 women:
Indeed, our analyses indicated that participants liked the chocolate significantly less when the confederate was also eating the chocolate (M = 2.45, SD = 1.77) than when the confederate was reviewing the computational products (M = 3.16, SD = 2.32), t(21) = 2.42, p = .025, 95% CI for the difference between conditions = [0.10, 1.31], Cohen’s d = 0.34. Participants reported feeling more absorbed in the experience of eating the chocolate in the shared-experience condition (M = 6.11, SD = 2.27) than in the unshared-experience condition (M = 5.39, SD = 2.43), p = .14. Participants also felt like they were more “on the same wavelength” with the confederate during the shared-experience condition (M = 6.43, SD = 1.38) compared with the unshared-experience condition (M = 5.61, SD = 1.38), t(21) = 2.35, p = .03, 95% CI for the difference
between conditions = [0.10, 1.54], Cohen’s d = 0.59 (see Fig. 2). There were no significant differences in participants’ self-reported mood or any other feedback measures between the shared and the unshared-experience conditions (all ps > .10).
Normally one wouldn't be allowed to talk about that p = .14 as evidence for an effect, but we now live in a more enlightened ESCI period in which we're trying to get away from dichotomous decision making. Okay, that's great, although I'd question the wisdom of trying to make any inference based on such a small sample, even within-subjects. But notice that when p = .14 is in the direction of their expected effect, it is interpreted as evidence for the phenomenon, but when differences are in a direction that does not support the hypothesis, it is simply reported as "not significant, p > .10". If we're going to abandon NHST for ESCI, we should at least be consistent about reporting ALL the ESCIs, and not just the ones that support our hypotheses.

Or, better yet, use that ESCI to actually make a principled and consistent inference through Bayes Factor. Specify an alternative hypothesis of what the theory might suggest are likely effect sizes. In this example, one might say that the effect size is somewhere between d = 0 and d = 0.5, with smaller values more likely than large values. This would look like the upper half of a normal distribution with mean 0 and standard deviation .5. Then we'd see how probable the obtained effect is given this alternative hypothesis and compare it to how probable the effect would be given the null hypothesis. At 20 subjects, I'm going to guess that the evidence is a little less than 3:1 odds for the alternative for the significant items, and less than that for the other items.

ESCI's a good first step, but we need to be careful and consistent about how we use it before we send ourselves to a fresh new hell. But when Bayesian analysis is this easy for simple study designs, why stop at ESCI?

Tuesday, July 1, 2014

Can p-curve detect p-hacking through moderator trawling?

NOTE: Dr. Simonsohn has contacted me and indicated a possible error in my algorithm. The results presented here could be invalid. We are talking back in forth and I am trying to fix my code. Stay tuned!

Suppose a researcher were to conduct an experiment looking to see if Manipulation X had any effect on Outcome Y, but the result was not significant. Since nonsignificant results are harder to publish, the researcher might be motivated to find some sort of significant effect somehow. How might the researcher go about dredging up a significant p-value?

One possibility is "moderator trawling". The researcher could try potential moderating variables until one is found that provides a significant interaction. Maybe it only works for men but not women? Maybe the effect can be seen after error trials, but not after successful trials? In essence, this is slicing the data until one manages to find a subset of the data that does show the desired effect. Given the number of psychological findings that seem to depend on surprisingly nuanced moderating conditions (ESP is one of these, but there are others), I do not think this is an uncommon practice.

To demonstrate moderator trawling, here's a set of data that has no main effect.
However, when we slice up the data by one of our moderators, we do find an effect. Here the interaction is significant, and the simple slope in group 0 is also significant.

In the long run, testing the main effect and three moderators will cause the alpha error rate to increase from 5% to 18.5%. That's the chance that at least one of the four tests come up p>.05, (.95)^4.

Because I am intensely excited by the prospect of p-curve meta-analysis, I just had to program a simulation to see whether p-curve could detect this moderator trawling in the absence of a real effect. P-curve meta-analysis is a statistical technique which examines the distribution of reported significant p-values. It relies on the property of the p-value that, when the null is true, p is uniformly distributed between 0 and 1. When an effect exists, smaller p-values are more likely than larger p-values, even for small p: p < .01 is more likely than .04<p<.05 for a true effect. Thus, a flat p-curve indicates no effect and possible file-drawering of null findings, while a right-skewed p-curve indicates a true effect. More interesting yet, a left-skewed curve suggests p-hacking -- doing what you need to to achieve p < .05, alpha error be damned.

You can find the simulation hosted on Open Science Framework at https://osf.io/ydwef/. This is my first swipe at an algorithm; I'd be happy to hear other suggestions for algorithms and parameters that simulate moderator trawling. Here's what the script does.
1) Create independent x and y variables from a normal distribution
2) Create three moderator variables z1, z2, and z3, of which 10 random subjects make up each of two levels
3) Fit the main effect model y ~ x. If it's statistically significant, stop and report the main effect.
4) If that doesn't come out significant, try z1, z2, and z3 each as moderators (e.g. y ~ x*z1; y ~ x*z2; y ~ x*z3). If one of these is significant, stop and plan to report the interaction.
5) Simonsohn et al. recommend using the p-value of the interaction for an attenuation interaction (e.g. "There's an effect among men that is reduced or eliminated among women"), but the p-values of each of the simple slopes for a crossover interaction (e.g. "This makes men more angry but makes women less angry."). So, determine whether it's an attenuation or crossover interaction.
5a)  If just one simple slope is significant, we'll call it an interaction. There's an effect in one group that is significant that is eliminated or reduced in the other group. In this case, we report the interaction p-value.
5b) If neither simple slopes are significant, or both are significant with coefficients of opposite sign, we'll call it a crossover. Both slopes significant indicates opposite effects, while neither slope significant indicates that the simple slopes aren't strong enough on their own but their opposition is enough to power a significant interaction. In these cases, we'll report both simple slopes' p-values.

We repeat this for 10,000 hypothetical studies, export the t-tests, and put them into the p-curve app at www.p-curve.com. Can p-curve tell that these results are the product of p-hacking?

It cannot. In the limit, it seems that p-curve will conclude that the findings are very mildly informative, indicating that the p-curve is flatter than 33% power, but still right-skewed, suggesting a true effect measured at about 20% power. Worse yet, it cannot detect that these p-values come from post-hoc tomfoolery and p-hacking. A few of these sprinkled into a research literature could make an effect seem to bear slightly more evidence, and be less p-hacked, then it really is.

The problem would seem to be that the p-values are aggregated across heterogeneous statistical tests: some tests of the main effect, some tests of this interaction or that interaction. Heterogeneity seems like it would be a serious problem for p-curve analysis in other ways. What happens when the p-values come from a combination of well-powered studies of a true effect and some poorly-powered, p-hacked studies of that same effect? (As best I can tell from the manuscript draft, the resulting p-curve is flat!) How does one meta-analyze across studies of different phenomena or different operationalizations or different models?

I remain excited and optimistic for the future of p-curve meta-analysis as a way to consider the strength of research findings. However, I am concerned by the ambiguities of practice and interpretation in the above case. It would be a shame if these p-hacked interactions would be interpreted as evidence of a true effect. For now, I think it best to report the data with and without moderators, preregister analysis plans, and ask researchers to report all study variables. In this way, one can reduce the alpha-inflation and understand how badly the results seem to rely upon moderator trawling.

Tuesday, May 20, 2014

Psychology's Awkward Puberty

There's a theory of typical neural development I remember from my times as a neuroscience student. It goes like this: in the beginning of development, the brain's tissues rapidly grow in size and thickness. General-purpose cells are replaced with more specialized cells. Neurons proliferate, and rich interconnections bind them together.

Around the time of puberty, neurons start dying off and many of those connections are pruned. This isn't a bad thing, and in fact, seems to be good for typical neural development, since there seems to be an association between mental disorder and brains that failed to prune.

In the past half a century, psychological science has managed to publish an astonishing number of connections between concepts. For example, people experience physical warmth as interpersonal warmth, hot temperatures make them see more hostile behavior, eating granola with the ingredients all mixed up makes them more creative than eating granola ingredients separately, and seeing the national flag makes them more conservative. Can all of these fantastic connections be true, important, meaningful? Probably not.

Until now, psychology has been designed for the proliferation of effects. Our most common statistical procedure, null hypothesis significance testing, can only find effects, not prove their absence. Researchers are rewarded for finding effects, not performing good science, and the weirder the effect, the more excited the response. And so, we played the game, finding lots of real effects and lots of other somethings we could believe in, too.

It's now time for us to prune some connections. Psychology doesn't know too little, it knows too much -- so much that we can't tell truth from wistful thinking anymore. Even the most bizarre and sorcerous manipulations still manage to eke out p < .05 often enough to turn up in journals. "Everything correlates at r = .30!" we joke. "Everything! Isn't that funny?" One can't hear the truth, overpowered as it is by the neverending chorus of significance, significance, significance.

This pruning process makes researchers nervous, concerned that their effect which garnered them tenure, grants, and fame will be torn to shreds, leaving them naked and foolish. We must remember that the authors of unreplicable findings didn't necessarily do anything wrong -- even the most scrupulous researcher will get p < .05 one time in 20 in the absence of a true effect. That's how Type I error works. (Although one might still wonder how an effect could enjoy so many conceptual replications within a single lab yet fall to pieces the moment they leave the lab.)

Today, psychology finally enters puberty. It's bound to be awkward and painful, full of hurt feelings, awkwardness, and embarrassment, but it's a sign we're also gaining a little maturity. Let's look forward to the days ahead, in which we know more through knowing less.

Monday, March 24, 2014

Intuitions about p

Two of my labmates were given a practice assignment for a statistics class. Their assignment was to generate simulated data where there was no relationship between x and y. In R, this is easy, and can be done by the code below: x is just the numbers from 1:20, and y is twenty random pulls from a normal distribution.

m1 = lm(y ~ x, data=dat)

One of my labmates ran the above code, frowned, and asked me where he had gone wrong. His p-value was 0.06 -- "marginally significant"! Was x somehow predicting y? I looked at his code and confirmed that it had been written properly and that there was no relationship between x and y. He frowned again. "Maybe I didn't simulate enough subjects," he said. I assured him this was not the case.

It's a common, flawed intuition among researchers that p-values naturally gravitate towards 1 with increasing power or smaller (more nonexistent?) effects. This is an understandable fallacy. As sample size increases, power increases, reducing the Type II error rate. It might be mistakenly assumed, then, that Type I error rate also reduces with sample size. However, increasing sample size does nothing to p-value when the null is true. When there is no effect, p-values come from a uniform distribution: a p-value less than .05 is just as likely as a p-value greater than .95!

As we increase our statistical power, the likelihood of Type II error (failing to notice a present effect) approaches zero. However, Type I error remains constant at whatever we set it to, no matter how many observations we collect. (You could, of course, trade power for a reduction in Type I error by setting a more stringent cutoff for "significant" p-values like .01, but this is pretty rare in our field where p<.05 is good enough to publish.)

Because we don't realize that p is uniformly distributed when the null is true, we overinterpret all our p-values that are less than about .15. We've all had the experience of looking at our data and being taunted by a p-value of 0.11. "It's so low! It's tantalizingly close to marginal significance already. There must be something there, or else it would have a really meaningless p-value like p=.26. I just need to run a few more subjects, or throw out the outlier that's ruining it," we say to ourselves. "This isn't p-hacking -- my effect is really there, and I just need to reveal it."

We say hopelessly optimistic things like "p = .08 is approaching significance." The p-value is doing no such thing -- it is .08 for this data and analysis, and it is not moving anywhere. Of course, if you are in the habit of peeking at the data and adding subjects until you reach p < .05, it certainly could be "approaching" significance, but that says more about the flaws of your approach to research than the validity of your observed effects.

How about effect size? Effect size, unlike p, benefits from increasing sample size whether there's an effect or not. As sample size is added, estimates of true effects approach their real value, and estimates of null effects approach zero. Of course, after a certain point the benefits of even more samples starts to decrease: going from n=200 to n=400 yields a bigger benefit to precision than does going from n=1000 to n=1200.

Let's see what effect size estimates of type I errors look like at small and large N.

Here's a Type I error at n=20. Notice that the slope is pretty steep. Here we estimate the effect size to be a whopping |r| = .44! Armed with only a p-value and this point estimate, a naive reader might be inclined to believe that the effect is indeed huge, while a slightly skeptical reader might round down to about |r| = .20. They'd both be wrong, however, since the true effect size is zero. Random numbers are often more variable than we think!

Let's try that again. Here's a Type I error at n = 10,000. Even though the p-value is statistically significant (here, p = .02), the effect size is pathetically small: |r| = .02. This is one of the many benefits of reporting the effect size and confidence interval. Significance testing will always be wrong at least 5% of the time, while effect size estimates will always benefit from power.

This is how we got the silly story about the decline effect (http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer), in which scientific discoveries tend to "wear off" over time. Suppose you find a Type I error in your n=20 study. Now you go to replicate it, and since you have faith in your effect, you don't mind running additional subjects and re-analyzing until you find p < .05. This is p-hacking, but let's presume you don't care. Chances are it will take you more than 20 subjects before you "find" your Type I error again, because it's unlikely that you would be so lucky as to find the same Type I error within the first 20 subjects. By the point that you do find p < .05, you will probably have run rather more than 20 subjects, and so the effect size estimate will be a little more precise and be precipitously closer to zero. The truth doesn't "wear off." The truth always outs.

Of course, effect size estimates aren't immune to p-hacking, either. One of the serious consequences of p-hacking is that it biases effect sizes.

Collect big enough samples. Look at your effect sizes and confidence intervals. Report everything you've got in the way that makes the most sense. Don't trust p. Don't chase p.

Monday, December 9, 2013

Outrageous Fortune, pt. 1

When we sit down to play a game involving dice, we understand that the results are influenced by a combination of strategy and luck. However, it's not always clear which is more important. While we'd like to think our results are chiefly the result of good strategy, and that the role of luck was fair and minimal, it's often difficult to judge. How can we make games which incorporate luck while rewarding strategy?


In order to add some excitement and variety, many game developers like to add dice rolls to their games to introduce an element of randomness. Dice rolls, the argument goes, add an element of chance that keeps the game from becoming strictly deterministic, forcing players to adapt to good and bad fortune. While some dice rolls are objectively better than others, potentially causing one player to gain the upper hand over another through luck alone, developers claim that things will "average out" in the long run, with a given player eventually experiencing just as much good luck as bad luck.

Most outcomes are near the average, with equal amounts of "good luck" (area above green line) and "bad luck" (area below red line).

Luck should average out

With the effect of luck averaging out, the player with the better strategy (e.g., the player who obtained better modifiers on their rolls) should still be able to reliably perform better.  However, players and developers alike do not often realize just how many rolls are necessary before the effect of strategy can be reliably detected as something above and beyond the effect of luck.

Forums are full of players describing which build has the better average, which is often plain to see with some math. For many players, this is all they need to concern themselves with: they have done the math and determined which build is most effective. The question for the designer, however, is whether the players can expect to see a difference within a single game or session. As it turns out, many of these modifiers are so small compared to the massive variance of a pass-fail check that it takes surprisingly long for luck to "average out".

An example: Goofus and Gallant

For the following example, I'll use Dungeons & Dragons, since that's one most gamers are likely familiar with. D&D uses a 20-sided die (1d20) to check for success or failure, and by adjusting the necessary roll, probability of success ranges from 0% to 100% by intervals of 5%. (In future posts I hope to examine other systems of checks, like those used in 2d6 or 3d6-based RPGs or wargames.)

Consider two similar level-1 characters, Goofus and Gallant. Gallant, being a smart player, has chosen the Weapon Expertise feat, giving him +1 to-hit. Goofus copied all of Gallant's choices but instead chose the Coordinated Explosion feat because he's some kind of dingus. The result is we have two identical characters, one with a to-hit modifier that is +1 better than the other. So, we expect that, in an average session, Gallant should hit 5% more often than Goofus. But how many rolls do we need before we reliably see Gallant outperforming Goofus?

For now, let's assume a base accuracy of 50%. So, Goofus hits if he rolls an 11 or better on a 20-sided die (50% accuracy), and Gallant hits on a roll of 10 or better(55% accuracy). We'll return to this assumption later and see how it influences our results.

I used the statistical software package R to simulate the expected outcomes for sessions involving 1 to 500 rolls. For each number of rolls, I simulated 10,000 different D&D sessions. Using R for this stuff is easy and fun! Doing this lets us examine the proportion of sessions in which Gallant outperforms Goofus and vice-versa. So, how many trials are needed for Gallant to outperform Goofus?

Goofus hits on 11, Gallant hits on 10 thanks to his +1 bonus.

One intuitive guess would be that you need 20 rolls, since that 5% bonus is 1 in 20. It turns out, however, that even at 20 trials, Gallant only has a 56% probability of outperforming Goofus.

In order to see Gallant reliably (75%) outperform Goofus requires more than a hundred rolls. Even then, Goofus will still surpass him about 20% of the time. It's difficult to see the modifier make a reliable difference compared to the wild swings of fortune caused by a 50% success rate.

Reducing luck through a more reliable base rate

It turns out these probabilities depend a lot on the base probability of success. When the base probability is close to 50%, combat is "swingy" -- the number of successes may be centered at 50% times the number of trials, but it's also very probable that the number of successes may be rather more or rather less than the expected value. We call this range around the expected value variance. When the base probability is closer to 0% or 100%, the variance shrinks, and the number of successes tends to hang closer to the expected value.

This time, let's assume a base accuracy of 85%. Now, Goofus hits on 4 or better (85%), and Gallant hits on 3 or better (90%). How many trials are now necessary to see Gallant reliably outperform Goofus?

This time, things are more stable. For very small numbers of rolls, they're more likely to tie than before. More importantly, the probability of Gallant outperforming Goofus increases more rapidly than before, because successes are less variable at this probability.

Comparing these two graphs against each other, we see the advantages of a higher base rate. For sessions involving fewer than 10 rolls, it is rather less likely that Goofus will outperform Gallant -- they'll tie, if anything. For sessions involving more than 10 rolls, the difference between Goofus and Gallant also becomes more reliable when the base rate is high. Keep in mind that we haven't increased the size of the difference between Goofus and Gallant, which is still just a +1 bonus. Instead, by making a more reliable base rate, we've reduced the influence of luck somewhat. In either case, however, keep in mind that it takes at least 10 rolls before we see Gallant outperform Goofus in just half of sessions. If you're running a competitive strategy game, you'd probably want to see a more pronounced difference than that!

In conclusion

To sum it all up, the issue is that players and developers expect luck to "average out", but they may not realize how many rolls are needed for this to happen. It's one thing to do the math and determine which build has the better expected value; it's another to actually observe that benefit in the typical session. It's my opinion that developers should seek to make these bonuses as reliable and noticeable as possible, but your mileage may vary. This may be more important for certain games & groups than others, after all.

My advice is to center your probabilities of success closer to 100% than to 50%. When the base probability is high, combat is less variable, and it doesn't take as long for luck to average out. Thus, bonuses are more reliably noticed in the course of play, making players observe and enjoy their strategic decisions more.

Less variable checks also have the advantage of allowing players to make more involved plans, since individual actions are less likely to fail. However, when an action does fail, it is more surprising and dramatic than it would otherwise have been when failure is common. Finally, reduced variability allows the party to feel agentic and decisive, rather than being buffeted about by the whims of outrageous fortune.

Another option is to reduce the variance by dividing the result into more fine-grained categories than "success" and "failure" such as "partial success". Some tabletop systems already do this, and even D&D will try to reduce the magnitude of difference between success and failure by letting a powerful ability do half-damage on a miss, again making combat less variable. Upcoming Obsidian Software RPG Pillars of Eternity plans to replace most "misses" with "grazing attacks" that do half-damage instead of no damage, again reducing the role of chance -- a design decision we'll examine in greater detail in next week's post.

Future directions

Next time, we'll go one step further and see how hard it can be for that +1 to-hit bonus to actually translate into an increase in damage output. To do this, I made my work PC simulate forty million attack rolls. It was fun as heck. I hope to see you then!