Friday, November 28, 2014

Exciting New Misapplications of The New Statistics

This year's increased attention to effect sizes and confidence intervals (ESCI) has been great for psychological science. ESCI offers a number of improvements over null-hypothesis significance testing (NHST), such as attention to practical significance and the elimination of dichotomous decision rules.

However, the problem with ESCI is that it is purely descriptive, not inferential. No hypotheses are named, so ESCI reports neither the probability of a hypothesis given the data nor even the probability of the data given a null hypothesis. No process or statistic turns the ESCI into a decision, although we might make Geoff Cumming cringe by checking whether the interval includes zero and deciding on that basis, thereby falling right back into NHST.

The point is, there's no theoretical or even pragmatic method for turning an ESCI into an inference. At what point does a confidence interval become sufficiently narrow to make a decision? We know that values near the extremes of the interval are often less likely than the values near the middle, but how much less likely?
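
To put a rough number on that last question: under the usual normal model behind a t-based interval, the relative likelihood of a value at the edge of a 95% CI versus the point estimate at its center is easy to compute. Here's a quick sketch (my own illustration, not anything from the papers discussed; df = 21 matches the paired tests quoted below):

```python
# A quick illustration: under the normal model behind a t-based interval,
# how plausible is a parameter value at the edge of a 95% CI relative to
# the point estimate at its center? (df = 21 matches the tests quoted below.)
from scipy import stats

df = 21
t_crit = stats.t.ppf(0.975, df)   # CI half-width in standard-error units
ratio = stats.t.pdf(t_crit, df) / stats.t.pdf(0, df)
print(f"endpoint is {ratio:.2f} times as likely as the center")
```

For df = 21 this ratio comes out to roughly 0.13, so a value at the endpoint is about eight times less likely than the value at the center, a distinction a bare interval never conveys.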

I'm not asking for a formal dichotomous decision rule (I'm a Bayesian; I have resigned myself to a life of uncertainty), but I've already noticed ways we can apply ESCI inconsistently to overstate the evidence. See a recent example from Boothby, Clark, and Bargh (PDF link), arguing across two studies of n = 23 women that shared experiences are more intense:
Indeed, our analyses indicated that participants liked the chocolate significantly less when the confederate was also eating the chocolate (M = 2.45, SD = 1.77) than when the confederate was reviewing the computational products (M = 3.16, SD = 2.32), t(21) = 2.42, p = .025, 95% CI for the difference between conditions = [0.10, 1.31], Cohen’s d = 0.34. Participants reported feeling more absorbed in the experience of eating the chocolate in the shared-experience condition (M = 6.11, SD = 2.27) than in the unshared-experience condition (M = 5.39, SD = 2.43), p = .14. Participants also felt like they were more “on the same wavelength” with the confederate during the shared-experience condition (M = 6.43, SD = 1.38) compared with the unshared-experience condition (M = 5.61, SD = 1.38), t(21) = 2.35, p = .03, 95% CI for the difference between conditions = [0.10, 1.54], Cohen’s d = 0.59 (see Fig. 2). There were no significant differences in participants’ self-reported mood or any other feedback measures between the shared and the unshared-experience conditions (all ps > .10).
Normally one wouldn't be allowed to talk about that p = .14 as evidence for an effect, but we now live in a more enlightened ESCI era in which we're trying to get away from dichotomous decision making. Okay, that's great, although I'd question the wisdom of making any inference from such a small sample, even within-subjects. But notice the asymmetry: when the p = .14 difference falls in the predicted direction, it is interpreted as evidence for the phenomenon, yet when differences fall in a direction that does not support the hypothesis, they are simply reported as "not significant, all ps > .10," with no effect sizes or intervals given. If we're going to abandon NHST for ESCI, we should at least be consistent about reporting ALL the ESCIs, not just the ones that support our hypotheses.
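
In fact, you can back out the interval they didn't report for that favored p = .14 absorption item. A minimal sketch (my own, assuming the reported p is exact and two-tailed, with df = 21 as in the study's other paired tests):

```python
# A sketch of recovering the unreported interval for the absorption item:
# given only the quoted means and a two-tailed p value from a paired t test,
# back out |t|, the standard error, and the 95% CI for the difference.
# Assumes p = .14 is exact; real values would shift a bit with rounding.
from scipy import stats

df, p, diff = 21, 0.14, 6.11 - 5.39   # means quoted above
t_obs = stats.t.ppf(1 - p / 2, df)    # |t| implied by the two-tailed p
se = diff / t_obs                     # standard error of the difference
half_width = stats.t.ppf(0.975, df) * se
print(f"t = {t_obs:.2f}, 95% CI = [{diff - half_width:.2f}, {diff + half_width:.2f}]")
```

The recovered interval comes out to roughly [-0.26, 1.70]: wide, straddling zero, and hard to distinguish from the "not significant" results that went unreported.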

Or, better yet, use that ESCI to make a principled and consistent inference through a Bayes factor. Specify an alternative hypothesis describing the effect sizes the theory considers likely. In this example, one might say that the effect size lies somewhere between d = 0 and d = 0.5, with smaller values more likely than larger ones; this would look like the upper half of a normal distribution with mean 0 and standard deviation 0.5. Then we'd see how probable the obtained effect is given this alternative hypothesis and compare that to how probable it would be given the null hypothesis. At 20-odd subjects, I'm going to guess that the evidence is a little less than 3:1 odds in favor of the alternative for the significant items, and less than that for the others.
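
Here's a minimal sketch of that comparison (my own code, not the authors' analysis): a Bayes factor for a paired t test, with H1 putting the half-normal(0, 0.5) prior described above on the standardized effect d, computed by averaging the noncentral t likelihood over the prior. The t values and df come from the quoted results; everything else is the assumption just stated.

```python
# Bayes factor for a paired t test: H0 fixes d = 0; H1 puts a half-normal
# prior (the "upper half of a normal distribution" described above) on d.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bf10_half_normal(t_obs, n, prior_sd=0.5):
    """Bayes factor (H1 over H0) for a one-sample/paired t statistic."""
    df = n - 1
    # Likelihood of the data under H0: central t density at t_obs.
    like_h0 = stats.t.pdf(t_obs, df)

    # Likelihood under H1: average the noncentral t density over the prior;
    # the noncentrality parameter is d * sqrt(n) for a paired design.
    def integrand(d):
        prior = 2 * stats.norm.pdf(d, loc=0, scale=prior_sd)  # half-normal
        return stats.nct.pdf(t_obs, df, d * np.sqrt(n)) * prior

    like_h1, _ = quad(integrand, 0, np.inf)
    return like_h1 / like_h0

# The two significant paired tests quoted above: t(21) implies 22 pairs.
for label, t in [("liking", 2.42), ("same wavelength", 2.35)]:
    print(label, round(bf10_half_normal(t, n=22), 2))
```

Note that because the prior is one-sided, this alternative only bets on effects in the predicted direction, which is exactly the kind of explicit commitment that a bare interval never forces you to make.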

ESCI's a good first step, but we need to be careful and consistent about how we use it before we send ourselves to a fresh new hell. But when Bayesian analysis is this easy for simple study designs, why stop at ESCI?