Sunday, October 4, 2015

Poor Power at Decent Sample Sizes: Significance Under Duress

Last week, I got to meet Andrew Gelman as he outlined what he saw as several of the threats to validity in social science research. Among these was the fallacious idea of "significance under duress." The claim in "significance under duress" is that, when statistical significance is reached under less-than-ideal conditions, it implies that the underlying effect must be very powerful. While this sounds like it makes sense, this claim does not follow.

Let's dissect the idea by considering the following scenario:

120 undergraduates participate in an experiment to examine the effect of mood on preferences for foods branded as "natural" relative to conventionally-branded foods. To manipulate mood, half of the participants spend 90 seconds writing a paragraph about a time they felt bad, while the other half spend 90 seconds writing about a control topic. The outcome is a single dichotomous choice between two products. Even though a manipulation check reveals the writing manipulation had only a small effect on mood, and even though a single-item outcome provides less power than would a composite of several forced choices, statistical significance is nevertheless found when comparing the negative-writing group to the neutral-writing group, p = .030. The authors argue that the relationship between mood and preferences for "natural" must be very strong indeed to have yielded significance despite the weak manipulation and imprecise outcome measure.

Even though the sample size is better than most, I would still be concerned that a study like this is underpowered. But why?

Remember that statistical power depends on the expected effect size. Effect size involves both signal and noise. Cohen's d is the difference in means divided by the standard deviation of scores. The Pearson correlation is the covariance of x and y divided by the product of the standard deviations of x and y. Noisier measures mean larger standard deviations and hence a smaller effect size.
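To make the signal-to-noise point concrete, here is a minimal sketch of how Cohen's d shrinks when independent measurement noise is added to the scores. The numbers (a raw mean difference of 0.5, a true-score SD of 1, a noise SD of 1) and the function name are made up for illustration, and the sketch assumes the noise is uncorrelated with the true scores so that variances add.

```python
import math

def cohens_d(mean_diff, sd_true, sd_noise):
    """Observed Cohen's d when independent noise inflates the score SD.

    Assumes noise is uncorrelated with true scores, so variances add.
    """
    sd_observed = math.sqrt(sd_true ** 2 + sd_noise ** 2)
    return mean_diff / sd_observed

# Hypothetical values: same raw difference, with and without noise.
clean = cohens_d(mean_diff=0.5, sd_true=1.0, sd_noise=0.0)
noisy = cohens_d(mean_diff=0.5, sd_true=1.0, sd_noise=1.0)

print(f"d with a clean measure: {clean:.2f}")
print(f"d with a noisy measure: {noisy:.2f}")
```

The difference in means is identical in both cases; only the denominator changes, and the effect size drops with it.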

The effect size is not a platonic distillation of the relationship between the two constructs you have in mind (say, mood and preference for the natural). Instead, it is a ratio of signal to noise between your measures -- here, condition assignment and product choice.

Let's imagine this through the lens of a structural equation model. a and b represent the latent constructs of interest: mood and preference for the natural, respectively. Let's assume their relationship is rho = .4, a hearty effect. x and y are the condition assignment and the outcome, respectively. The path from x to a represents the effect of the manipulation. The path from b to y represents the measurement reliability of the outcome. To tell what the relationship will be between x and y, we multiply the path coefficients together as we travel from x to a to b to y.
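The path-tracing rule can be sketched in a few lines. Only rho = .4 comes from the scenario above; the manipulation and measurement path values below are hypothetical numbers chosen for illustration, and all paths are treated as standardized coefficients.

```python
def implied_xy_correlation(manipulation_path, rho, measurement_path):
    """Multiply standardized paths along the chain x -> a -> b -> y."""
    return manipulation_path * rho * measurement_path

RHO = 0.4  # latent relationship between mood (a) and preference (b)

# Hypothetical path values, not taken from any real study.
strong = implied_xy_correlation(manipulation_path=0.8, rho=RHO, measurement_path=0.8)
weak = implied_xy_correlation(manipulation_path=0.3, rho=RHO, measurement_path=0.5)

print(f"strong manipulation, reliable outcome: r_xy = {strong:.3f}")
print(f"weak manipulation, noisy outcome:      r_xy = {weak:.3f}")
```

Even with the latent relationship held fixed at .4, the observable relationship between condition and choice depends heavily on the two outer paths.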

When the manipulation is strong and the measurement reliable, the relationship between x and y is strong, and power is good. When the manipulation is weak and the measurement unreliable, the relationship is small, and power falls dramatically.
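To put rough numbers on this, here is a sketch of approximate power at N = 120 using the Fisher z approximation for a correlation. The two values of r_xy are hypothetical: they correspond to rho = .4 multiplied by made-up outer paths of .8 and .8 (strong case) or .3 and .5 (weak case). The approximation treats x and y as continuous, so with a binary condition and a dichotomous choice it should be read as a ballpark figure rather than an exact power analysis.

```python
import math
from statistics import NormalDist

def approx_power(r, n, alpha=0.05):
    """Approximate two-sided power to detect correlation r with n cases.

    Uses the Fisher z transformation: atanh(r) is roughly normal
    with standard error 1 / sqrt(n - 3).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = math.atanh(r) * math.sqrt(n - 3)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

N = 120
# Hypothetical observable correlations (see lead-in for assumptions).
power_strong = approx_power(r=0.256, n=N)
power_weak = approx_power(r=0.06, n=N)

print(f"power, strong paths: {power_strong:.2f}")
print(f"power, weak paths:   {power_weak:.2f}")
```

Under these assumptions the same 120 participants give power of around 0.8 in the strong case and around 0.1 in the weak case.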

Because weak manipulations and noisy measurements decrease the anticipated effect size, thereby decreasing power, studies can still have decent sample sizes and poor statistical power. Such examples of "significance under duress" should be regarded with the same skepticism as other underpowered studies.
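One way to see what a "decent" sample size means here is to ask how many participants would be needed for 80% power. The sketch below inverts the Fisher z approximation for a correlation; the two r_xy values are hypothetical (rho = .4 attenuated by illustrative outer paths), and the approximation is rough for a binary condition and a dichotomous outcome.

```python
import math
from statistics import NormalDist

def required_n(r, power=0.80, alpha=0.05):
    """Approximate N for a two-sided test of correlation r.

    Inverts the Fisher z approximation, ignoring the negligible
    probability of rejecting in the wrong direction.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(((z_crit + z_power) / math.atanh(r)) ** 2 + 3)

# Hypothetical observable correlations, not estimates from real data.
n_strong = required_n(0.256)
n_weak = required_n(0.06)

print(f"N needed when r_xy = .256: {n_strong}")
print(f"N needed when r_xy = .06:  {n_weak}")
```

Under these assumptions, a sample near 120 is adequate only when the manipulation and measurement are both strong; with the weak paths, the required sample is on the order of two thousand.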
