A psychologist's thoughts on how and why we play games

Sunday, October 4, 2015

Poor Power at Decent Sample Sizes: Significance Under Duress

Last week, I got to meet Andrew Gelman as he outlined what he saw as several of the threats to validity in social science research. Among these was the fallacious idea of "significance under duress." The claim is that, when statistical significance is reached under less-than-ideal conditions, the underlying effect must be very powerful. While this sounds sensible, the claim does not follow.

Let's dissect the idea by considering the following scenario:

120 undergraduates participate in an experiment to examine the effect of mood on preferences for foods branded as "natural" relative to conventionally-branded foods. To manipulate mood, half of the participants write for 90 seconds about a time they felt bad, while the other half write for 90 seconds about a neutral control topic. The outcome is a single dichotomous choice between two products. Even though a manipulation check reveals the writing manipulation had only a small effect on mood, and even though a single-item outcome provides less power than would rating several forced choices, statistical significance is nevertheless found when comparing the negative-writing group to the neutral-writing group, p = .030. The authors argue that the relationship between mood and preferences for "natural" must be very strong indeed to have yielded significance despite the weak manipulation and imprecise outcome measure.

Even though the sample size is better than most, I would still be concerned that a study like this is underpowered. But why?

Remember that statistical power depends on the expected effect size. Effect size involves both signal and noise. Cohen's d is the difference in means divided by the standard deviation of scores. Pearson correlation is the covariance of x and y divided by the standard deviations of x and y. Noisier measures will mean larger standard deviations and hence, a smaller effect size.

The effect size is not a platonic distillation of the relationship between the two constructs you have in mind (say, mood and preference for the natural). Instead, it is a ratio of signal to noise between your measures -- here, condition assignment and product choice.

Let's imagine this through the lens of a structural equation model. The latent variables a and b represent the constructs of interest: mood and preference for the natural, respectively. Let's assume their relationship is rho = .4, a hearty effect. x and y are the condition assignment and the outcome, respectively. The path from x to a represents the effect of the manipulation. The path from b to y represents the measurement reliability of the outcome. To tell what the relationship will be between x and y, we multiply each path coefficient as we travel from x to a to b to y.

When the manipulation is strong and the measurement reliable, the relationship between x and y is strong, and power is good. When the manipulation is weak and the measurement unreliable, the relationship is small, and power falls dramatically.
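To make the arithmetic concrete, here's a quick sketch in Python. The path coefficients are made-up illustrative values, and the power calculation is a rough normal approximation via Fisher's z, not an exact computation:

```python
import math

# Hypothetical numbers for illustration: the observed x -> y correlation is
# the product of the path coefficients x -> a (manipulation strength),
# a -> b (latent correlation rho = .4), and b -> y (outcome reliability).
def observed_r(manipulation, rho, reliability):
    return manipulation * rho * reliability

strong = observed_r(0.9, 0.4, 0.9)  # strong manipulation, reliable outcome
weak = observed_r(0.3, 0.4, 0.5)    # weak manipulation, noisy outcome

def power(r, n, crit=1.96):
    # Rough two-tailed power for testing r != 0 via Fisher's z transformation
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    return 0.5 * (1 + math.erf((z - crit) / math.sqrt(2)))

print(round(strong, 3), round(power(strong, 120), 2))  # decent r, high power
print(round(weak, 3), round(power(weak, 120), 2))      # tiny r, poor power
```

Same latent rho = .4, same N = 120; only the manipulation strength and measurement reliability changed, and power collapses.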

Because weak manipulations and noisy measurements decrease the anticipated effect size, thereby decreasing power, studies can still have decent sample sizes and poor statistical power. Such examples of "significance under duress" should be regarded with the same skepticism as other underpowered studies.

Friday, July 3, 2015

Bayesian Perspectives on Publication Bias

I have two problems with statistics in psychological science. They are:
  1. Everybody speaks in categorical yes/no answers (statistical significance) rather than continuous, probabilistic answers (probably yes, probably no, not enough data to tell).
  2. There's a lot of bullshit going around. The life cycle of the bullshit is extended by publication bias (running many trials and just reporting the ones that work) and p-hacking (torturing the data until it gives you significance).
Meta-analysis is often suggested as one solution to these problems. If you average together everybody's answers, maybe you get closer to the true answer. Maybe you can winnow out truth from bullshit when looking at all the data instead of the tally of X significant results and Y nonsignificant results. 

That's a nice thought, but publication bias and p-hacking make it possible that the meta-analysis just reports the degree of bias in the literature rather than the true effect. So how do we account for bias in our estimates?

Bayesian Spike-and-Slab Shrinkage Estimates

One very simple approach would be to consider some sort of "bullshit factor". Suppose you believe, as John Ioannidis does, that half of published research findings are false. If that's all you know, then for any published result you believe that there's a 50% chance that there's an effect such as the authors report it (p(H1) = .5) and a 50% chance that the finding is false (p(H0) = .5). Just to be clear, I'm using H0 to refer to the null hypothesis, H1 to refer to the alternative hypothesis.

How might we summarize our beliefs if we wanted to estimate the effect with a single number? Let's say the authors report d = 0.60. We halfway believe in them, but we still halfway believe in the null. So on average, our belief in the true effect size delta is 

delta = (d | H0) * p(H0) + (d | H1) * p(H1)
delta = (0) * (0.5) + (0.6) * (0.5) = 0.3

So we've applied some shrinkage or regularization to our estimate. Because we believe that half of everything is crap, we improve our estimates by adjusting them accordingly.
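In code, this "bullshit-factor" shrinkage is a one-liner:

```python
def shrunk_estimate(d_reported, p_h1):
    """Model-averaged effect size: a spike at zero weighted by belief in H0,
    plus the reported estimate weighted by belief in H1."""
    return 0.0 * (1 - p_h1) + d_reported * p_h1

# The example from the text: d = 0.60 reported, 50% belief in H1
print(shrunk_estimate(0.60, 0.5))  # 0.3
```

As belief in H1 rises toward 1, the shrinkage disappears; as it falls toward 0, the estimate collapses onto the null.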

This is roughly a Bayesian spike-and-slab regularization model: the spike refers to our belief that delta is exactly zero, while the slab is the diffuse alternative hypothesis describing likely non-zero effects. As we believe more in the null, the spike rises and the slab shrinks; as we believe more in the alternative, the spike lowers and the slab rises. By averaging across the spike and the slab, we get a single value that describes our belief.

Bayesian Spike-and-Slab system. As evidence accumulates for a positive effect, the "spike" of belief in the null diminishes and the "slab" of belief in the alternative soaks up more probability. Moreover, the "slab" begins to take shape around the true effect.

So that's one really crude way of adjusting for meta-analytic bias as a Bayesian: just assume half of everything is crap and shrink your effect sizes accordingly. Every time a psychologist comes to you claiming that he can make you 40% more productive, estimate instead that it's probably more like 20%.

But what if you wanted to be more specific? Wouldn't it be better to shrink preposterous claims more than sensible claims? And wouldn't it be better to shrink fishy findings with small sample sizes and a lot of p = .041s more than a strong finding with a good sample size and p < .001?

Bayesian Meta-Analytic Thinking by Guan & Vandekerckhove

This is exactly the approach given in a recent paper by Guan and Vandekerckhove. For each meta-analysis or paper, you do the following steps:
  1. Ask yourself how plausible the null hypothesis is relative to a reasonable alternative hypothesis. For something like "violent media make people more aggressive," you might be on the fence and assign 1:1 odds. For something goofy like "wobbly chairs make people think their relationships are unstable" you might assign 20:1 odds in favor of the null.
  2. Ask yourself how plausible the various forms of publication bias are. The models they present are:
    1. M1: There is no publication bias. Every study is published.
    2. M2: There is absolute publication bias. Null results are never published.
    3. M3: There is flat probabilistic publication bias. All significant results are published, but only some percentage of null results are ever published.
    4. M4: There is tapered probabilistic publication bias: everything with p < .05 gets published, but the chances of publication decline as p gets farther above .05 (e.g., p = .07 gets published more often than p = .81).
  3. Look at the results and see which models of publication bias look likely. If there's even a single null result, you can scratch off M2, which says null results are never published. Roughly speaking, if the p-curve looks good, M1 starts looking pretty likely. If the p-curve is flat or bent the wrong way, M3 and M4 start looking pretty likely.
  4. Update your beliefs according to the evidence. If the evidence looks sound, belief in the unbiased model (M1) will rise and belief in the biased models (M2, M3, M4) will drop. If the evidence looks biased, belief in the publication bias models will rise and belief in the unbiased model will drop. If the evidence supports the hypothesis, belief in the alternative (H1) will rise and belief in the null (H0) will drop. Note that, under each publication bias model, you can still have evidence for or against the effect.
  5. Average the effect size across all the scenarios, weighting by the probability of each scenario.
If you want to look at the formula for this weighted average, it's:
delta = (d | M1, H1) * p(M1, H1) + (d | M1, H0) * p(M1, H0) + (d | M2, H1) * p(M2, H1) + (d | M2, H0) * p(M2, H0) + (d | M3, H1) * p(M3, H1) + (d | M3, H0) * p(M3, H0) + (d | M4, H1) * p(M4, H1) + (d | M4, H0) * p(M4, H0)
(d | Mx, H0) is "effect size d given that publication bias model X is true and there is no effect." We can go through and set all these to zero, because when the null is true, delta is zero. 

(d | Mx, H1) is "effect size d given that publication bias model X is true and there is a true effect." Each bias model makes a different guess at the underlying true effect. (d | M1, H1) is just the naive estimate. It assumes there's no pub bias, so it doesn't adjust at all. However, M2, M3, and M4 say there is pub bias, so they estimate delta as being smaller. Thus, (d | M2, H1), (d | M3, H1), and (d | M4, H1) are shrunk-down effect size estimates.

p(M1, H1) through p(M4, H0) reflect our beliefs in each (pub-bias x H0/H1) combo. If the evidence is strong and unbiased, p(M1, H1) will be high. If the evidence is fishy, p(M1, H1) will be low and we'll assign more belief to skeptical models like p(M3, H1), which says the effect size is overestimated, or even p(M3, H0), which says that the null is true.

Then to get our estimate, we make our weighted average. If the evidence looks good, p(M1, H1) will be large, and we'll shrink d very little according to publication bias and remaining belief in the null hypothesis. If the evidence is suspect, values like p(M3, H0) will be large, so we'll end up giving more weight to the possibility that d is overestimated or even zero.
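A toy version of the weighted average in Python. Every number below is invented for illustration; in the real method, the conditional estimates and posterior weights come from fitting Guan and Vandekerckhove's models to the data:

```python
# Hypothetical posterior effect estimates under each (bias model, hypothesis).
d_given = {
    ("M1", "H1"): 0.50,  # no pub bias: naive estimate, no shrinkage
    ("M2", "H1"): 0.30,  # absolute pub bias: heavily shrunk estimate
    ("M3", "H1"): 0.35,  # flat probabilistic bias
    ("M4", "H1"): 0.40,  # tapered probabilistic bias
}
# Under H0 the true effect is zero regardless of the bias model.
for m in ("M1", "M2", "M3", "M4"):
    d_given[(m, "H0")] = 0.0

# Hypothetical posterior probabilities of each scenario (they sum to 1).
weights = {
    ("M1", "H1"): 0.30, ("M1", "H0"): 0.10,
    ("M2", "H1"): 0.05, ("M2", "H0"): 0.05,
    ("M3", "H1"): 0.15, ("M3", "H0"): 0.15,
    ("M4", "H1"): 0.10, ("M4", "H0"): 0.10,
}

# The model-averaged estimate: sum of (estimate x weight) over all scenarios
delta = sum(d_given[k] * weights[k] for k in weights)
print(round(delta, 4))
```

Note how the H0 scenarios and the bias-model scenarios both drag the average below the naive d = 0.50.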


So at the end of the day, we have a process that:
  1. Takes into account how believable the hypothesis is before seeing data, gaining strength from our priors. Extraordinary claims require extraordinary evidence, while less wild claims require less evidence.
  2. Takes into account how likely publication bias is in psychology, gaining further strength from our priors. Data from a pre-registered prospective meta-analysis is more trustworthy than a look backwards at the prestige journals. We could take that into account by putting low probability in pub bias models in the pre-registered case, but higher probability in the latter case.
  3. Uses the available data to update beliefs about the hypothesis and publication bias both, improving our beliefs through data. If the data look unbiased, we trust it more. If the data looks like it's been through hell, we trust it less.
  4. Provides a weighted average estimate of the effect size given our updated beliefs. It thereby shrinks estimates a lot when the data are flimsy and there's strong evidence of bias, but shrinks estimates less when the data are strong and there's little evidence of bias.
It's a very nuanced and rational system. Bayesian systems usually are.

That's enough for one post. I'll write a follow-up post explaining some of the implications of this method, as well as the challenges of implementing it. 

Monday, June 29, 2015

Putting PET-PEESE to the Test, Part 1A

The Problem with PET-PEESE?

Will Gervais has a very interesting criticism of PET-PEESE, a meta-analytic technique for correcting for publication bias, up at his blog. In it, he tests PET-PEESE's bias by simulating many meta-analyses, each of many studies, using historically-accurate effect sizes and sample sizes from social psychology. He finds that, under these conditions and assuming some true effect, PET-PEESE performs very poorly at detecting the true effect, underestimating it by a median 0.2 units of Cohen's d.

When I saw this, I was flattened. I knew PET-PEESE had its problems, but I also thought it represented a great deal of promise compared to other rotten old ways of inspecting for publication bias, such as trim-and-fill or (shudder) Fail-Safe N. In the spirit of full disclosure, I'll tell you that I'm 65 commits deep into a PET-PEESE manuscript with some provocative conclusions, so I may be a little bit motivated to defend PET-PEESE. But I saw some simulation parameters that could be tweaked to possibly give PET-PEESE a better shot at the true effect.

My Tweaks to Will's Simulation

One problem is that, in this simulation, the sample sizes are quite small. The sample sizes per cell are distributed according to a truncated normal, ~N(30, 50), bounded at 20 and 200. So the smallest experiment has just 40 subjects across two cells, the modal experiment has just 60 subjects across two cells, and no study will ever exceed 400 subjects across the two cells.

These small sample sizes, combined with the small true effect (delta = .275), mean that the studies meta-analyzed have miserable power. The median power is only 36%. The maximum power is 78%, but you'll see that in fewer than one in ten thousand studies.
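You can reproduce the flavor of this with a quick simulation. This is a normal-approximation sketch in Python (the exact t-test power behind the 36% figure will differ slightly), drawing per-cell sample sizes from the same truncated normal:

```python
import math
import random

random.seed(1)

def draw_n_per_cell():
    # Per-cell n ~ truncated Normal(30, 50), bounded at [20, 200]
    while True:
        n = random.gauss(30, 50)
        if 20 <= n <= 200:
            return round(n)

def power_two_sample(d, n_per_cell, crit=1.96):
    # Normal-approximation power for a two-tailed two-sample test;
    # a rough sketch, not the exact noncentral-t computation
    z = d * math.sqrt(n_per_cell / 2)
    return 0.5 * (1 + math.erf((z - crit) / math.sqrt(2)))

powers = sorted(power_two_sample(0.275, draw_n_per_cell()) for _ in range(10000))
median_power = powers[len(powers) // 2]
print(round(median_power, 2))   # low -- most simulated studies are underpowered
print(round(powers[-1], 2))     # even the luckiest draws stay below ~80%
```

Even the ceiling is grim: at the maximum n = 200 per cell, this approximation tops out just under 80% power for delta = .275.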

The problem, then, is one of signal and noise. The signal is weak: delta = .275 is a small effect by most standards. The noise is enormous: at n = 60-70, the sampling error is devastating. But what's worse, there's another signal superimposed on top of all this: publication bias! The effect is something like trying to hear your one friend whisper a secret in your ear, but the two of you are in a crowded bar, and your other friend is shouting in your other ear about the Entourage movie.

So as I saw it, the issue wasn't that PET-PEESE was cruelly biased in favor of the null or that it had terrible power to detect true effects. The issue was small effects, impotent sample sizes, and withering publication bias. In these cases, it's very hard to tell true effects from null effects. Does this situation sound familiar to you? It should -- Will's simulation uses distributions of sample sizes and effect sizes that are very representative of the norms in social psychology!

But social psychology is changing. The new generation of researchers are becoming acutely aware of the importance of sample size and of publishing null results. New journals like Frontiers or PLOS (and even Psych Science) are making it easier to publish null results. In this exciting new world of social psychology, might we have an easier time of arriving at the truth?


To test my intuition, I made one tweak to Will's simulation: Suppose that, in each meta-analysis, there is one ambitious grad student who decides she's had enough talk. She wants some damn data, and when she gets it, she will publish it come hell or high water, regardless of the result.

In each simulated meta-analysis, I guarantee a single study with n = 209/cell (80% power, two-tailed, to detect the true homogeneous effect delta = 0.275). Moreover, this single well-powered study is made immune to publication bias. Could a single, well-powered study help PET-PEESE?

Well, it doesn't. One 80%-powered study isn't enough. You might be better off using the "Top Ten" estimator, which looks only at the 10 largest studies, or even just interpreting the single largest study.

What if the grad student runs her dissertation at 90% power, collecting n = 280 per cell?

Maybe we're getting somewhere now. The PEESE spike is coming up a little bit and the PET spike is going down. But maybe we're asking too much of our poor grad student. Nobody should have to determine the difference between delta = 0 and delta = 0.275 all by themselves. (Note that, even still, you're probably better off throwing all the other studies, meta-analyses, and meta-regressions in the garbage and just using this single pre-registered experiment as your estimate!)
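As a sanity check on those planned sample sizes, a rough normal-approximation power calculation (a sketch, not the exact t-test computation) recovers the 80% and 90% targets for n = 209 and n = 280 per cell:

```python
import math

def power_two_sample(d, n_per_cell, crit=1.96):
    # Normal-approximation power for a two-tailed two-sample test
    z = d * math.sqrt(n_per_cell / 2)
    return 0.5 * (1 + math.erf((z - crit) / math.sqrt(2)))

print(round(power_two_sample(0.275, 209), 2))  # ~0.80, the first scenario
print(round(power_two_sample(0.275, 280), 2))  # ~0.90, the dissertation
```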

Here's the next scenario: Suppose somebody looked at the funnel plot from the original n = ~20 studies and found it to be badly asymmetrical. Moreover, they saw the PET-PEESE estimate couldn't detect the effect as significantly different from zero. Rather than pronounce the PET-PEESE estimate as the true effect size, they instead suggested that the literature was badly biased and that a replication effort was needed. So three laboratories each agreed to rerun the experiment at 80% power and publish the results in a Registered Report. Afterwards, they reran the meta-analysis and PET-PEESE.

Even with these three unbiased, decently-powered studies, PET-PEESE is still flubbing it badly, going to PET more often than it should. Again, you might be better off just looking at the three trustworthy studies in the Registered Report than trying to fix the publication bias with meta-regression.

I'm feeling pretty exhausted by now, so let's just drop the hammer on this. The Center for Open Science decides to step in and run a Registered Report with 10 studies, each powered at 80%. Does this give PET-PEESE what it needs to perform well?

No dice. Again, you'd be better off just looking at the 10 preregistered studies and giving up on the rest of the literature. Even with these 10 healthy studies in the dataset, we're missing delta = .275 by quite a bit in one direction or the other: PET-PEESE is estimating delta = 0.10, while naive meta-analysis is estimating delta = .42.


I am reminded of a blog post by Michele Nuijten, in which she explains how more information can actually make your estimates worse. If your original estimates are contaminated by publication bias, and your replication estimates are also contaminated by publication bias, adding the replication data to your original data only makes things worse. In the cases above, we gain very little from meta-analysis and meta-regression. It would be better to look only at the large-sample Registered Reports and dump all the biased, underpowered studies in the garbage.

The simple lesson is this: There is no statistical replacement for good research practice. Publication bias is nothing short of toxic, particularly when sample sizes and effect sizes are small.

So what can we do? Maybe this is my bias as a young scientist with few publications to my name, but if we really want to know what is true and what is false, we might be better off disregarding the past literature of biased, small-sample studies entirely and only interpreting data we can trust.

The lesson I take is this: For both researchers and the journals that publish them, Registered Report or STFU.

(Now, how am I gonna salvage this meta-analysis???)

Code is available at my GitHub. The bulk of the original code was written by Will Gervais, with edits and tweaks by Evan Carter and Felix Schonbrodt. You can recreate my analyses by loading packages and the meta() function on lines 1-132, then skipping down to the section "Hilgard is going ham" on line 303.

Monday, May 4, 2015

Bayes Factor: Asking the Right Questions, pt. 2

There has recently been some discussion as to whether Bayes factor is biased in favor of the null. I am particularly sensitive to these concerns as somebody who sometimes uses Bayes factor to argue in favor of the null. I do not want Reviewer 2 to think that I am overstating my evidence.

I would like to address two specific criticisms of Bayes factor, each arguing that the choice of an alternative hypothesis makes it too easy for researchers to argue for the null. 


In a recent blog post, Dr. Simonsohn writes “Because I am not interested in the distribution designated as the alternative hypothesis, I am not interested in how well the data support it.”

Of course, if one does not like one alternative hypothesis, one can choose another. Bayes factor is just the tool, and it's up to the analyst to make the tool answer a valuable question.

I asked Dr. Simonsohn for clarification on what he thought might make a good alternative hypothesis. He suggested a point-alternative hypothesis describing the minimum effect size of interest. That way, the Bayes factor yielded would not be too hasty to lean in favor of the null. 

That smallest effect size will vary across context. For example, for gender discrimination I may have one standard of too small to care, and for PSI I will have a much lower standard, and for time travel a tiny standard (a few seconds of time travel are a wonderful discovery).

Personally, I do not think this is a good alternative hypothesis. It makes the null and alternative hypotheses so similar that their predictions are nigh-indiscriminable. It makes it nearly impossible to find evidence one way or the other.
Left panel: Depiction of null hypothesis and "minimum effect of interest" alternative. Null hypothesis: δ = 0. Alternative hypothesis: δ = 0.01. 
Right panel: Probability of data given each hypothesis and 200 observations, between-subjects design. 
The hypotheses are so similar as to be indistinguishable from each other.

Imagine if we did a priori power analysis with this alternative hypothesis for conventional null hypothesis significance testing. Power analysis would tell us we would need hundreds of thousands of observations to have adequate power. Less than that, and any significant results could be flukes and Type I errors, and nonsignificant results would be Type II errors. It's the Sisyphean Decimal Sequence from last post.
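To see how futile this comparison is, here's a toy likelihood-ratio calculation in Python, using the numbers from the figure caption above (null δ = 0 vs. point alternative δ = 0.01, 200 observations between-subjects); the observed d = 0.15 is a made-up illustrative value:

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Two-sample design, 200 observations total (100 per cell), so the
# standard error of d is roughly sqrt(1/100 + 1/100)
se = math.sqrt(0.02)
d_obs = 0.15  # hypothetical observed effect

# Likelihood ratio for the point alternative delta = .01 vs. the null
bf_10 = normal_pdf(d_obs, 0.01, se) / normal_pdf(d_obs, 0.0, se)
print(round(bf_10, 2))  # ~1.08: the data barely favor either hypothesis
```

No matter what d you observe at this sample size, the ratio stays pinned near 1 -- the two hypotheses make nearly identical predictions.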

At some point, you have to live with error. The conventional testing framework assumes an effect size and establishes Type I and Type II error rates from there. But what justifies your a priori power assumption? Dr. Simonsohn's newest paper suggests a negative replication should indicate that the previous study had less than 33% power to detect its effect. But why would we necessarily care about the effect as it was observed in a previous study?

Every choice of alternative hypothesis is, at some level, arbitrary. No effect can be measured to arbitrary precision. Of all the inferential techniques I know, however, Bayes factor states this alternative hypothesis most transparently and reports the evidence in the most finely-grained units.

In practice, we don't power studies to the minimum interesting effect. We power studies to what we expect the effect size to be given the theory. The alternative hypothesis in Bayesian model comparison should be the same way, representing our best guess about the effect. Morey et al. (submitted) call this a "consensus prior", the prior a "reasonable, but somewhat-removed researcher would have [when trying to quantify evidence for or against the theory]."


Dr. Schimmack also thinks that Bayes factor is prejudiced against small effects and that it makes it too easy to land a prestigious JEP:G publication ruling in favor of the null. In his complaint, he examines an antagonistic collaboration among Matzke, Nieuwenhuis, and colleagues. Nieuwenhuis et al. argue that horizontal eye movements improve memory, while Matzke et al. argue that they have no such effect. Data is collected, and we ask questions of it: Whose hypothesis is supported, Nieuwenhuis’ or Matzke’s?

In the data, the effect of horizontal eye movements was actually negative. This is unusual given Matzke’s hypothesis, but very unusual given Nieuwenhuis’ hypothesis. Because the results are 10 times more likely given Matzke’s hypothesis than Nieuwenhuis’, we rule in favor of Matzke’s null hypothesis.

Dr. Schimmack is dissatisfied with the obtained result and wants more power:
“[T]his design has 21% power to reject the null-hypothesis with a small effect size (d = .2). Power for a moderate effect size (d = .5) is 68% and power for a large effect size (d = .8) is 95%.
Thus, the decisive study that was designed to solve the dispute only has adequate power (95%) to test Drs. Matzke et al.’s hypothesis d = 0 against the alternative hypothesis that d = .8. For all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis.”
Dr. Schimmack is concerned that the sample size is too small to distinguish the null from the alternative. The rules of the collaboration, however, were to collect data until the Bayes factor was 10 for one or the other hypothesis. The amount of data collected was indeed enough to distinguish between the two hypotheses, as the support is quite strong for the no-effect-hypothesis relative to the improvement-hypothesis. Everybody goes to the pub to celebrate, having increased their belief in the null relative to this alternative by a factor of 10.

But suppose we tried to interpret the results in terms of power and significance. What would we infer if the result was not significant? Dr. Schimmack’s unusual comment above that “for all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis” leads me to worry that he intends to interpret p > .05 as demonstrating the truth of the null – a definite faux pas in null-hypothesis significance testing.

But what can we infer from p > .05? That the results have no evidentiary value, being unable to reject the null hypothesis? That the obtained result is (1 – Power)% unlikely if the alternative hypothesis δ = 0.5 were true? But why would we care about the power based on the alternative hypothesis δ = 0.5, and not δ = 0.1, or δ = 1.0, or any other point-alternative hypothesis?

Dr. Nieuwenhuis understands his theory, formulated a fair hypothesis, and agreed that a test of that hypothesis would constitute a fair test of the theory. I can see no better or more judicious choice of alternative hypothesis. In a well-designed experiment with a fair hypothesis, the Bayesian test is fair.

Dr. Schimmack further argues that “[The] empirical data actually showed a strong effect in the opposite direction, in that participants in the no-eye-movement condition had better performance than in the horizontal-eye-movement condition (d = -.81). A Bayes Factor for a two-tailed hypothesis or the reverse hypothesis would not have favored the null-hypothesis.” This is an interesting phenomenon, but beside the point of the experiment. Remember the question being asked: Is there a positive effect, or no effect? The obtained data support the hypothesis of no effect over the hypothesis of a positive effect.

If one wishes to pursue the new hypothesis of a negative effect in a future experiment, one can certainly do so. If one thinks that the negative effect indicates some failure of the experiment then that is a methodological, not statistical, concern. Keep in mind that both researchers agreed to the validity of the method before the data were collected, so again, we expect that this is a fair test.


Bayes factor provides an effective summary of evidence. A Cauchy or half-Cauchy distribution on the effect size often makes for a fair and reasonable description of the alternative hypothesis. Scientists who routinely read papers with attention to effect size and sample size will quickly find themselves capable of describing a reasonable "consensus prior." 

Having to describe this alternative hypothesis sometimes makes researchers uneasy, but it is also necessary for the interpretation of results in conventional testing. If a test of a subtle effect is statistically significant in a sample of 20, we suspect a Type I error rather than a true effect. If that subtle effect is not statistically significant in a sample of 20, we suspect a Type II error rather than a true effect. Specification of the alternative hypothesis makes these judgments transparent and explicit and yields the desired summary of evidence.

Monday, April 20, 2015

Bayes Factor: Asking the Right Questions

I love Bayesian model comparison. It’s my opinion that null hypothesis testing is not great because 1) it gives dichotomous accept/reject outcomes when we all know that evidence is a continuous quantity and 2) it can never provide evidence for the null, only fail to reject it. This latter point is important because it’s my opinion that the null is often true, so we should be able to provide evidence and assign belief to it. 

By comparison, Bayesian model comparison has neither weakness of NHST. First, it yields a "Bayes factor", the multiplicative and continuous change in beliefs effected by the data. Second, it can yield Bayes factors favoring the null hypothesis over a specified alternative hypothesis.

Despite my enthusiasm for Bayesian model comparison, one criticism I see now and again is that the obtained Bayes factor varies as a function of the hypothesis being tested. See e.g. this Twitter thread or Simonsohn (2015):
When a default Bayesian test favors the null hypothesis, the correct interpretation of the result is that the data favor the null hypothesis more than that one specific alternative hypothesis. The Bayesian test could conclude against the same null hypothesis, using the same data, if a different alternative hypothesis were used, say, that the effect is distributed normal but with variance of 0.5 instead of 1, or that the distribution is skewed or has some other mean value.*

To some researchers, this may seem undesirable. Science and analysis are supposed to be "objective," so the subjectivity in Bayesian analysis may seem unappealing.

To a Bayesian, however, this is exactly the intended behavior. The Bayes factor is supposed to vary according to the hypotheses tested. The answer should depend on the question.

Asking the Right Question

The problem reminds me of the classic punchline in Douglas Adams’ Hitchhiker’s Guide to the Galaxy. An advanced civilization builds a massive supercomputer at great expense to run for millions of years to provide an answer to life, the universe, and everything.

Eons later, as the calculations finally complete, the computer pronounces its answer: “Forty-two.”

Everyone winces. They demand to know what the computer means by forty-two. The computer explains that forty-two is the correct answer, but that the question is still unknown. The programmers are mortified. In their haste to get an impressive answer, they did not stop to consider that every answer is valuable only in the context of its question.

Bayesian model comparison is a way to ask questions. When you ask different questions of your data, you get different answers. Any particular answer is only valuable insofar as the corresponding question is worth asking.

An Example from PSI Research

Let’s suppose you’re running a study on ESP. You collect a pretty decently-sized sample, and at the end of the day, you’re looking at an effect size and confidence interval (ESCI) of d = 0.15 (-.05, .35). Based on this, what is your inference?

The NHST inference is that you didn't learn anything: you failed to reject the null, so the null stands for today, but maybe in the future with more data you’d reject the null with d = .03 (.01, .05) or something. You can never actually find evidence for the null so long as you use NHST. In the most generous case, you might argue that you've rejected some other null hypothesis such as δ > .35.

The ESCI inference is that the true effect of ESP is somewhere in the interval.** Zero is in the interval, and we don’t believe that ESP exists, so we’re vaguely satisfied. But how narrow an interval around zero do we need before we’re convinced that there’s no ESP? How much evidence do we have for zero relative to some predicted effect?

Bayesian Inferences

Now you consult a Bayesian (Figure 1). You ask the Bayesian which she favors: the null hypothesis δ = 0, or the alternative hypothesis δ ≠ 0. She shakes her head. Your alternative hypothesis makes no predictions. The effect could be anywhere from negative infinity to positive infinity, or so close to zero as to be nearly equal to it. She urges you to be more specific.

Figure 1. Ancient Roman depiction of a Bayesian.
To get an answer, you will have to provide a more specific question. Bayesian model comparison operates by comparing the predictions of two or more models and seeing which is best supported by the data. Because it is a daunting task to try to precisely predict the effect size (although we often attempt to do so in a priori power analysis), we can instead assign probability across a range of values.

Trying again, you ask her whether there is a large effect of ESP. Maybe the effect of ESP could be a standard deviation in either direction, and any nonzero effect between d = -1 and d = 1 would be considered evidence of the theory. That is, H1: δ ~ Uniform(-1, 1) (see Figure 2). The Bayesian tells you that you have excellent evidence for the null relative to this hypothesis.
Figure 2. Competing statements of belief about the effect size delta.
Encouraged, you ask her whether there is a medium effect of ESP. Maybe ESP would change behavior by about half a standard deviation in either direction; small effects are more likely than large effects, but large effects are possible too. That is, H2: δ ~ Cauchy(0.5) (see Figure 3). The Bayesian tells you that you have pretty good evidence for the null against this hypothesis, but not overwhelming evidence.
Figure 3. A Cauchy-distributed alternative hypothesis can be more conservative, placing more weight on smaller, more realistic effect sizes while maintaining the possibility of large effects.
Finally, you ask her whether you have evidence against even the tiniest effect of ESP. Between the null hypothesis H0: δ = 0 and the alternative H3: δ ~ Cauchy(1x10^-3), which does she prefer? She shrugs. These two hypotheses make nearly-identical predictions about what you might see in your experiment (see Figure 4). Your data cannot distinguish between the two. You would need to spend several lifetimes collecting data before you were able to measurably shift belief from this alternative to the null.

Figure 4. The null and alternative hypotheses make nearly-identical statements of belief.
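To make this concrete, here is a rough sketch in Python of all three comparisons. I'm using the common normal approximation to the sampling distribution of d, with the observed d = 0.15 and a standard error of about 0.10 as implied by the interval above; a full Bayesian analysis would give somewhat different numbers, but the ordering of the answers is the point.

```python
from scipy import stats, integrate

# Observed effect and its approximate standard error, back-solved from the
# interval d = 0.15 (-.05, .35). A shortcut, not a full Bayesian analysis.
d_obs, se = 0.15, 0.10

def likelihood(delta):
    # Approximate sampling distribution of the observed d given true effect delta
    return stats.norm.pdf(d_obs, loc=delta, scale=se)

def marginal(prior_pdf):
    # Marginal likelihood under an alternative: average the likelihood over the
    # prior on delta (points= helps quad resolve the very peaked Cauchy(0.001))
    val, _ = integrate.quad(lambda d: likelihood(d) * prior_pdf(d),
                            -10, 10, points=[-0.1, 0.0, 0.1, d_obs], limit=400)
    return val

priors = {
    "H1": stats.uniform(-1, 2),    # delta ~ Uniform(-1, 1)
    "H2": stats.cauchy(0, 0.5),    # delta ~ Cauchy(0.5)
    "H3": stats.cauchy(0, 1e-3),   # delta ~ Cauchy(0.001)
}

like_h0 = likelihood(0.0)          # H0: delta = 0

# BF01 > 1 favors the null; the answer changes as the question changes
bf01 = {name: like_h0 / marginal(p.pdf) for name, p in priors.items()}
```

The Bayes factors favoring the null shrink as the alternative creeps toward the null: strongest against the wide Uniform(-1, 1), weaker against Cauchy(0.5), and essentially 1 against Cauchy(0.001), whose predictions are nearly indistinguishable from H0's.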
And after that, what’s next? Will you have to refute H4: δ ~ Cauchy(1x10^-4), H5: δ ~ Cauchy(1x10^-5), and so on? A chill falls over you as you consider the possibilities. Each time you defeat one decimal place, another will rise to take its place. The fate of Sisyphus seems pleasant by comparison.

The Bayesian assures you that this is not a specific weakness of Bayesian model comparison. If you were a frequentist, your opponents could always complain that your study did not have enough power to detect δ = 1x10^-4. If you were into estimation, your opponents could complain that your ESCI did not exclude δ = 1x10^-4. You wonder if this is any way to spend your life, chasing eternally after your opponents’ ever-shifting goalposts.

It is my opinion that these minimal-effect-size hypotheses are not questions worth asking in most psychological research. If the effect is truly so tiny, its real-world relevance is minimal. If the phenomenon requires several thousand observations to distinguish signal from noise, it is probably not practical to study it. I think these hypotheses are most often employed as last lines of epistemic defense, a retreat of the effect size from something meaningful to something essentially untestable.

At some point, you will have to draw a limit. You will have to make an alternative hypothesis and declare “Here is the approximate effect size predicted by the theory.” You won’t have to commit to a specific point, because you can spread the probability across a range of plausible values. It may not be exactly the hypothesis every other researcher would choose, but it will be reasonable and judicious, because you will select it carefully. You are a scientist, and you are good at asking meaningful questions. When you ask a meaningful question, Bayesian model comparison will give you a meaningful answer.

In Summary

Bayesian model comparison is a reasonable and mathematically consistent way to get an appropriate answer to whatever question you ask. As the question changes, so too should the answer. This is a feature, not a bug. If every question got the same answer, would we trust that answer?

We must remember that no form of statistics or measurement can hope to measure an effect to arbitrary precision, and so it is epistemically futile to try to prove absolutely the null hypothesis δ = 0. However, in many cases, δ = 0 seems appropriate, and the data tend to support it relative to any reasonable alternative hypothesis. The argument that the null cannot be supported relative to Ha: δ = 1x10^-10 is trivially true, but scientifically unreasonable and unfair. 

Asking good questions is a skill, and doing the appropriate mathematics and programming to model the questions is often no small task. I suggest that we appreciate those who ask good questions and help those who ask poor questions to try other, more informative models.

In my next post, I'll cover some recent, specific critiques of Bayesian model comparison that, in my opinion, hinge on asking the wrong questions for the desired answers.


Thanks to Jeff Rouder, Richard Morey, Chris Engelhardt, and Alexander Etz for feedback. Thanks to Uri Simonsohn for clarifications and thoughts.

* Simonsohn clarifies his point briefly in the second half of this blog post -- he is dissatisfied with the choice of a particular alternative hypothesis more than he is alarmed by the Bayes factor's sensitivity to the alternative. Still, it is my impression that some readers may find this subjectivity scary and therefore, unfortunately, avoid Bayesian model comparison.

** This isn't true either. It is a common misconception that the 95% ESCI contains the true effect with 95% probability. The Bayesian 95% highest posterior density interval (HPDI) does, but you need a prior. Even then you still have to come to some sort of decision about whether that HPDI is narrow enough or not. So here we are again.

Friday, November 28, 2014

Exciting New Misapplications of The New Statistics

This year's increased attention to effect sizes and confidence intervals (ESCI) has been great for psychological science. ESCI offers a number of improvements over null-hypothesis significance testing (NHST), such as an attention to practical significance and the elimination of dichotomous decision rules.

However, the problem with ESCI is that it is purely descriptive, not inferential. No hypotheses are named, and so ESCI doesn't report on the probability of a hypothesis given the data, or even the probability of the data given a null hypothesis. No process or statistic turns the ESCI into a decision, although we might make Geoff Cumming cringe by looking at whether the ESCI includes zero and making a decision based on that, thereby falling right back to NHST.

The point is, there's no theoretical or even pragmatic method for turning an ESCI into an inference. At what point does a confidence interval become sufficiently narrow to make a decision? We know that values near the extremes of the interval are often less likely than the values near the middle, but how much less likely?

I'm not asking for a formal dichotomous decision rule (I'm a Bayesian, I have resigned my life to uncertainty), but I've already noticed the ways we can apply ESCI inconsistently to overstate the evidence. See a recent example from Boothby, Clark, and Bargh (PDF link), arguing that shared experiences are more intense in two studies of n = 23 women:
Indeed, our analyses indicated that participants liked the chocolate significantly less when the confederate was also eating the chocolate (M = 2.45, SD = 1.77) than when the confederate was reviewing the computational products (M = 3.16, SD = 2.32), t(21) = 2.42, p = .025, 95% CI for the difference between conditions = [0.10, 1.31], Cohen’s d = 0.34. Participants reported feeling more absorbed in the experience of eating the chocolate in the shared-experience condition (M = 6.11, SD = 2.27) than in the unshared-experience condition (M = 5.39, SD = 2.43), p = .14. Participants also felt like they were more “on the same wavelength” with the confederate during the shared-experience condition (M = 6.43, SD = 1.38) compared with the unshared-experience condition (M = 5.61, SD = 1.38), t(21) = 2.35, p = .03, 95% CI for the difference
between conditions = [0.10, 1.54], Cohen’s d = 0.59 (see Fig. 2). There were no significant differences in participants’ self-reported mood or any other feedback measures between the shared and the unshared-experience conditions (all ps > .10).
Normally one wouldn't be allowed to talk about that p = .14 as evidence for an effect, but we now live in a more enlightened ESCI period in which we're trying to get away from dichotomous decision making. Okay, that's great, although I'd question the wisdom of trying to make any inference based on such a small sample, even within-subjects. But notice that when p = .14 is in the direction of their expected effect, it is interpreted as evidence for the phenomenon, but when differences are in a direction that does not support the hypothesis, it is simply reported as "not significant, p > .10". If we're going to abandon NHST for ESCI, we should at least be consistent about reporting ALL the ESCIs, and not just the ones that support our hypotheses.

Or, better yet, use that ESCI to actually make a principled and consistent inference through a Bayes factor. Specify an alternative hypothesis of what the theory might suggest are likely effect sizes. In this example, one might say that the effect size is somewhere between d = 0 and d = 0.5, with smaller values more likely than large values. This would look like the upper half of a normal distribution with mean 0 and standard deviation 0.5. Then we'd see how probable the obtained effect is given this alternative hypothesis and compare it to how probable the effect would be given the null hypothesis. At 20 subjects, I'm going to guess that the evidence is a little less than 3:1 odds for the alternative for the significant items, and less than that for the other items.
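As a sketch of what that calculation might look like (not the authors' analysis, and using a crude normal approximation to the sampling distribution of d rather than a proper default Bayes factor), take the liking item above, t(21) = 2.42, d = 0.34:

```python
from scipy import stats, integrate

# Reported within-subjects result for the liking item: t(21) = 2.42, d = 0.34.
# Back out an approximate standard error for d from d / t -- a rough shortcut.
d_obs = 0.34
se = d_obs / 2.42

def likelihood(delta):
    # Approximate sampling distribution of the observed d given true effect delta
    return stats.norm.pdf(d_obs, loc=delta, scale=se)

# The alternative described above: effects from d = 0 up to about 0.5, smaller
# values more likely -- the upper half of a Normal(0, 0.5), i.e. a half-normal
prior = stats.halfnorm(scale=0.5)

# Marginal likelihood under the alternative, then the Bayes factor
marg, _ = integrate.quad(lambda d: likelihood(d) * prior.pdf(d), 0, 5)
bf10 = marg / likelihood(0.0)   # BF10 > 1 favors the alternative
```

The exact number depends heavily on the approximation and on the prior, so treat it as illustrative; the point is that the same machinery applies consistently to every item, significant or not.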

ESCI's a good first step, but we need to be careful and consistent about how we use it before we send ourselves to a fresh new hell. But when Bayesian analysis is this easy for simple study designs, why stop at ESCI?

Tuesday, July 1, 2014

Can p-curve detect p-hacking through moderator trawling?

NOTE: Dr. Simonsohn has contacted me and indicated a possible error in my algorithm. The results presented here could be invalid. We are talking back and forth and I am trying to fix my code. Stay tuned!

Suppose a researcher were to conduct an experiment looking to see if Manipulation X had any effect on Outcome Y, but the result was not significant. Since nonsignificant results are harder to publish, the researcher might be motivated to find some sort of significant effect somehow. How might the researcher go about dredging up a significant p-value?

One possibility is "moderator trawling". The researcher could try potential moderating variables until one is found that provides a significant interaction. Maybe it only works for men but not women? Maybe the effect can be seen after error trials, but not after successful trials? In essence, this is slicing the data until one manages to find a subset of the data that does show the desired effect. Given the number of psychological findings that seem to depend on surprisingly nuanced moderating conditions (ESP is one of these, but there are others), I do not think this is an uncommon practice.

To demonstrate moderator trawling, here's a set of data that has no main effect.
However, when we slice up the data by one of our moderators, we do find an effect. Here the interaction is significant, and the simple slope in group 0 is also significant.

In the long run, testing the main effect and three moderators will inflate the alpha error rate from 5% to 18.5%. That's the chance that at least one of the four tests comes up p < .05: 1 − (.95)^4 ≈ .185.
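That arithmetic is easy to verify, assuming (simplistically) that the four tests are independent:

```python
import numpy as np

rng = np.random.default_rng(3)

# Analytic: chance that at least one of four independent tests is significant
alpha_any = 1 - (1 - .05) ** 4

# Quick Monte Carlo check: four independent null p-values per "study"
p = rng.uniform(size=(100_000, 4))
mc = np.mean((p < .05).any(axis=1))
```

In the real procedure the four tests share the same data, so they are not fully independent and the inflation will differ a little, but the ballpark holds.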

Because I am intensely excited by the prospect of p-curve meta-analysis, I just had to program a simulation to see whether p-curve could detect this moderator trawling in the absence of a real effect. P-curve meta-analysis is a statistical technique that examines the distribution of reported significant p-values. It relies on the property of the p-value that, when the null is true, p is uniformly distributed between 0 and 1. When an effect exists, smaller p-values are more likely than larger p-values, even among the small values: p < .01 is more likely than .04 < p < .05 for a true effect. Thus, a flat p-curve indicates no effect and possible file-drawering of null findings, while a right-skewed p-curve indicates a true effect. More interesting yet, a left-skewed curve suggests p-hacking -- doing whatever you need to do to achieve p < .05, alpha error be damned.
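That distributional claim is easy to check by simulation. Here's a quick sketch (my own illustration, not Simonsohn et al.'s code) comparing the significant p-values from null studies against those from studies of a true effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sim_pvalues(delta, n=20, reps=4000):
    # p-values from two-sample t-tests with true standardized effect delta
    x = rng.normal(0.0, 1.0, size=(reps, n))
    y = rng.normal(delta, 1.0, size=(reps, n))
    return stats.ttest_ind(y, x, axis=1).pvalue

# Keep only the significant results, as p-curve does
p_null = sim_pvalues(0.0)
p_true = sim_pvalues(0.8)
sig_null = p_null[p_null < .05]
sig_true = p_true[p_true < .05]

# Under the null, significant p-values are uniform on (0, .05), so about a
# fifth fall below .01; under a true effect the curve is right-skewed and
# small p-values dominate
frac_null = np.mean(sig_null < .01)
frac_true = np.mean(sig_true < .01)
```
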

You can find the simulation hosted on the Open Science Framework at https://osf.io/ydwef/. This is my first stab at an algorithm; I'd be happy to hear other suggestions for algorithms and parameters that simulate moderator trawling. Here's what the script does.
1) Create independent x and y variables from a normal distribution
2) Create three moderator variables z1, z2, and z3, each randomly splitting the 20 subjects into two levels of 10
3) Fit the main effect model y ~ x. If it's statistically significant, stop and report the main effect.
4) If that doesn't come out significant, try z1, z2, and z3 each as moderators (e.g. y ~ x*z1; y ~ x*z2; y ~ x*z3). If one of these is significant, stop and plan to report the interaction.
5) Simonsohn et al. recommend using the p-value of the interaction for an attenuation interaction (e.g. "There's an effect among men that is reduced or eliminated among women"), but the p-values of each of the simple slopes for a crossover interaction (e.g. "This makes men more angry but makes women less angry."). So, determine whether it's an attenuation or crossover interaction.
5a) If just one simple slope is significant, we'll call it an attenuation interaction: there's a significant effect in one group that is reduced or eliminated in the other. In this case, we report the interaction p-value.
5b) If neither simple slope is significant, or both are significant with coefficients of opposite sign, we'll call it a crossover. Both slopes being significant indicates opposite effects, while neither being significant indicates that the simple slopes aren't strong enough on their own but their opposition is enough to power a significant interaction. In these cases, we report both simple slopes' p-values.
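The OSF script is the canonical implementation; here is a loose Python re-sketch of steps 1 through 5. (My step 5 judges attenuation versus crossover from the simple slopes alone, so treat it as an approximation of the logic above.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def ols_pvalues(X, y):
    """Ordinary least squares with classical t-test p-values per coefficient."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = len(y) - X.shape[1]
    se = np.sqrt(np.diag((resid @ resid / df) * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / se), df)

def trawl_once(n=20, alpha=.05):
    # 1) Independent x and y: the null is true everywhere
    x, y = rng.normal(size=n), rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    # 2) Three moderators, each randomly assigning 10 subjects per level
    mods = [rng.permutation(np.repeat([0.0, 1.0], n // 2)) for _ in range(3)]
    # 3) Main-effect model y ~ x: stop and report if significant
    p_main = ols_pvalues(X, y)[1]
    if p_main < alpha:
        return [p_main]
    # 4) Otherwise try each moderator in turn: y ~ x * z
    for z in mods:
        Xz = np.column_stack([np.ones(n), x, z, x * z])
        p_int = ols_pvalues(Xz, y)[3]
        if p_int < alpha:
            # 5) Attenuation vs. crossover, judged from the simple slopes
            p0 = ols_pvalues(X[z == 0], y[z == 0])[1]
            p1 = ols_pvalues(X[z == 1], y[z == 1])[1]
            if (p0 < alpha) != (p1 < alpha):
                return [p_int]      # 5a) attenuation: report the interaction p
            return [p0, p1]         # 5b) crossover: report both simple slopes
    return []                       # nothing significant: into the file drawer

results = [trawl_once() for _ in range(2000)]
reported = [p for r in results for p in r]
```

In the long run, roughly a fifth of these null "studies" end up reporting something, which is the alpha inflation described above.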

We repeat this for 10,000 hypothetical studies, export the t-tests, and put them into the p-curve app at www.p-curve.com. Can p-curve tell that these results are the product of p-hacking?

It cannot. In the limit, it seems that p-curve will conclude that the findings are very mildly informative: the p-curve is flatter than 33% power, but still right-skewed, suggesting a true effect measured at about 20% power. Worse yet, it cannot detect that these p-values come from post-hoc tomfoolery and p-hacking. A few of these sprinkled into a research literature could make an effect seem to bear slightly more evidence, and be less p-hacked, than it really is.

The problem would seem to be that the p-values are aggregated across heterogeneous statistical tests: some tests of the main effect, some tests of this interaction or that interaction. Heterogeneity seems like it would be a serious problem for p-curve analysis in other ways. What happens when the p-values come from a combination of well-powered studies of a true effect and some poorly-powered, p-hacked studies of that same effect? (As best I can tell from the manuscript draft, the resulting p-curve is flat!) How does one meta-analyze across studies of different phenomena or different operationalizations or different models?

I remain excited and optimistic for the future of p-curve meta-analysis as a way to consider the strength of research findings. However, I am concerned by the ambiguities of practice and interpretation in the above case. It would be a shame if these p-hacked interactions would be interpreted as evidence of a true effect. For now, I think it best to report the data with and without moderators, preregister analysis plans, and ask researchers to report all study variables. In this way, one can reduce the alpha-inflation and understand how badly the results seem to rely upon moderator trawling.