A few weeks ago, I was listening to a bit of point/counterpoint on the Mother Jones Inquiring Minds podcast. On one episode, Brad Bushman gave an interview about the causes of gun violence, emphasizing the Weapons Priming Effect and the effects of violent video games. (Apparently he and his co-authors have a new meta-analysis of the Weapons Priming Effect; I can't read it because it's still under revision and the authors have not sent me a copy.)
On the other, Inquiring Minds invited violent-media-effect skeptic Chris Ferguson, perhaps one of Bushman's most persistent detractors. Ferguson recounted all the reasons he has for skepticism of violent-game effects, some reasonable, some less reasonable. One of his more reasonable criticisms is that he's concerned about publication bias and p-hacking in the literature. Perhaps researchers are running several studies and only reporting the ones that find significance, or maybe researchers take their null results and wiggle them around until they reach significance. (I think this is happening to some degree in this literature.)
Surprisingly, this was the criticism that drew the most scoffing from the hosts. University scientists don't earn anything, they argued, so who in their right mind would go into science and twist their results in hope of grant funding? Anyone wanting to make money would have an easier time of it staying far away from academia and going into something more lucrative, like dog walking.
Clearly, the hosts are mistaken, because we know that research fraud happens, publication bias happens, and p-hacking happens. Andrew Gelman's blog today suggests that these things happen when researchers find themselves chasing null hypotheses: due to publish-or-perish pressures, researchers have to find statistical significance. But why does anybody bother?
If the choice is between publishing nonsense and "perishing" (e.g., leaving academia to take a significant pay raise at a real job), why don't we see more researchers choosing to perish?
Tuesday, February 9, 2016
Wednesday, January 27, 2016
Power analysis slants funnels but doesn't flatten curves.
I recently had the pleasure of receiving a very thorough and careful peer review, with invitation to resubmit, at Psychological Bulletin for my manuscript-in-progress, "Overestimated Effects of Violent Games on Aggressive Outcomes in Anderson et al. (2010)."
Although it was humbling to find I still have much to learn about meta-analysis, I was also grateful for what has been one of the most instructive peer reviews I have ever received. Sometimes one gets peer reviewers who simply don't like your paper, and perhaps never will. Sometimes reviewers can only offer the most nebulous of suggestions, leaving you fumbling for a way to appease everyone. This review, however, was full of blessedly concrete recommendations.
Anyway, the central thrust of my paper is that Anderson et al. (2010) assert that violent-game effects are not overestimated through publication, analytic, or other biases. However, they did not provide funnel plots to support this argument, relying chiefly on the trim-and-fill procedure instead. When you generate these funnel plots, you find that they are strikingly asymmetrical, suggesting that there may indeed be bias despite the trim-and-fill results.
We conducted two new statistical adjustments for bias, which suggest that the effect may actually be quite small. One, PET, uses the funnel plot's asymmetry to estimate what the effect size might be for a hypothetical perfectly-precise study. The other, p-curve, uses the p-values of significant results to estimate the underlying effect size.
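PET amounts to a precision-weighted meta-regression of observed effects on their standard errors, with the intercept estimating the effect at SE = 0. A minimal sketch in base R, using simulated study-level data of my own invention (not values from the manuscript):

```r
# PET sketch: regress observed effects on their standard errors, weighting
# by precision; the intercept estimates the effect of a hypothetical
# perfectly precise (SE = 0) study.
set.seed(1)
k <- 40
n <- sample(20:200, k, replace = TRUE)   # simulated per-group sample sizes
se <- sqrt(2 / n)                        # approximate standard error of d
d_obs <- rnorm(k, mean = 0.2, sd = se)   # true effect d = .2, no bias

pet <- lm(d_obs ~ se, weights = 1 / se^2)
coef(pet)[["(Intercept)"]]   # lands near .2 when there is no bias
```

With publication bias in the simulated data, the intercept would drop below the naive meta-analytic mean; the same model can be fit as a fixed-effects meta-regression in metafor with `rma(yi, sei, mods = ~ se)`.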
One peer reviewer commented that if the meta-analyzed effect sizes are heterogeneous, with some large and some small, and if researchers are using power analysis appropriately to plan their sample sizes, then true large effects will be studied with small samples and true small effects with large samples, leading to an asymmetrical funnel plot and the illusion of research bias.
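The reviewer's proposed mechanism is easy to demonstrate: under a fixed power target, the planned sample size shrinks as the true effect grows. A quick check with the pwr package (the specific d values are just illustrations):

```r
library(pwr)

# Per-group n needed for 80% one-tailed power at several true effects
res <- sapply(c(.2, .4, .6), function(d) {
  ceiling(pwr.t.test(d = d, sig.level = .05, power = .8,
                     type = "two.sample", alternative = "greater")$n)
})
res   # small true effects demand far larger samples than big ones
```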
I don't think that's what's going on here, of course. Power analysis is rare in social psychology, especially in the years covered by the Anderson et al. meta-analysis. I'm also not sure how researchers would somehow know, a priori, the effect size they were studying, but then Arina K. Bones does have ironclad evidence of the precognitive abilities of social psychologists.
But even if that were true, I had the hunch it should only affect the funnel plot, which relies on observed effect sizes and sample sizes, and not the p-curve, which depends only on the distribution of statistically significant p-values (itself a function of the studies' power). So I ran a simulation to see.
Sure enough, the funnel plot is very asymmetrical despite the inclusion of all studies. However, the p-curve still shows a clear right skew.
Of course, the meta-analyst should make every effort to divide the studies into homogeneous groups so that heterogeneity is minimized and one is not comparing apples and oranges. But I was compelled to test this, and then further compelled to write it down.
Code here:
# power analysis hypothesis:
# Reviewer 3 says:
# if the size of studies is chosen according to properly
# executed power analyses, we would in fact expect to see
# an inverse relationship between outcomes and sample sizes
# (and so if authors engage in the recommended practice of
# planning a study to achieve sufficient power, we are actually
# building small-study effects into our literature!).
# Let's simulate b/c I don't believe p-curve would work that way.
library(pwr)
library(metafor)

# lookup table: per-group n for 80% power at each true d
# (note: pwr.t.test returns n *per group*, so total N = 2n)
d = seq(.1, .6, .01)
n = NULL
for (i in 1:length(d)) {
  n[i] = pwr.t.test(d = d[i], sig.level = .05, power = .8,
                    type = "two.sample", alternative = "greater")$n
}
# round up b/c can't have fractional n
n = ceiling(n)

# pick a d, pick an n, run an experiment
simLength = 1e3
d_iter = NULL
n_iter = NULL
df_iter = NULL
t_iter = NULL
for (i in 1:simLength) {
  index = sample(1:length(d), 1)
  d_iter[i] = d[index]
  n_iter[i] = n[index]              # per-group n
  df_iter[i] = 2 * n_iter[i] - 2    # two-sample t-test: df = 2n - 2
  t_iter[i] = rt(1, df_iter[i],
                 ncp = d_iter[i] / sqrt(1/n_iter[i] + 1/n_iter[i]))
}
dat = data.frame(d_true = d_iter,
                 n = n_iter, df = df_iter, t = t_iter)
dat$d_obs = dat$t * sqrt(1/dat$n + 1/dat$n)  # observed d recovered from t
dat$p = pt(dat$t, dat$df, lower.tail = F)    # one-tailed p-value
dat$se_obs = sqrt(
  (2/dat$n + dat$d_obs^2/(2*dat$df)) * (2*dat$n/dat$df)
)

# funnel plot and p-curve, side by side
model = rma(yi = d_obs, sei = se_obs, data = dat)
par(mfrow = c(1, 2))
funnel(model, main = "Funnel plot w/ \npower analysis & \nheterogeneity")
hist(dat$p[dat$p < .05], main = "p-curve w/ \npower analysis & \nheterogeneity",
     xlab = "p-value")
[Figure: Effects of violent games on aggressive behavior in experiments selected as having "best-practices" methodology by Anderson et al. (2010).]
[Figure: Simulated meta-analysis of 1000 studies. True effect size varies uniformly between .1 and .5. Sample sizes selected for 80% one-tailed power. Simulation code at bottom of post.]
Tuesday, January 19, 2016
Two Quick HIBARs
I've posted a little post-publication peer review on ResearchGate these past few months on some studies of violent game effects. Doing this made me realize that ResearchGate is actually really weak for this task -- you can mark up particular chunks and comment on them, but most people are going to immediately download the PDF, and the PDF won't carry the comments. So putting up commentary on ResearchGate will mostly just piss off the authors, who get an email alerting them to the assault, but fail to inform the readers, who will probably not read or even notice the comments.
So here is a brief digest of two recent comments I've put on two recent papers. Consider these some quick Had I Been A Reviewer (HIBAR) posts.
Lishner, Groves, and Chobrak (2015): Are Violent Video Game-Aggression Researchers Biased? (Paper link)
Whether or not there is bias in violent-games and aggression research is the topic of some of my own research, which seems to indicate that, yes, there is some element of bias that is leading to the likely overestimation of violent-game effects.
The authors consider three potential forms of bias: Single-scholar bias, by which a single prominent scholar is able to unduly influence the field by overwhelming publishing; cabal bias, by which a group of scholars use their numbers or resources to again unduly influence the field by overwhelming publishing or by collusion in peer review; and systemic bias, by which there is some broad and systemic bias towards the finding of an effect.
They present some re-analyses of the Anderson et al. (2010) meta-analysis to suggest that there is not single-scholar bias (because Anderson's effect sizes aren't statistically significantly larger than everybody else's) and cabal bias (because those who publish repeatedly on these effects don't find statistically significantly larger effects than those who only ever run one study).
Of course, the absence of statistical significance does not necessarily imply that the null is true, but the confidence intervals suggest that Lishner et al. might be correct. Experiments done by Anderson find a mean effect size of r = .19 [.14, .24], while experiments done by the rest of the collection have a mean effect size of r = .18 [.13, .22]. That's a pretty close match. For their test of cabal bias, the experiments done by the potential cabal have a mean effect size of r = .20 (.15, .24), while the experiments done by the other groups have a mean effect size of r = .24 (.17, .30). The difference isn't in the right direction for cabal bias.
That leaves us with the possibility of systemic bias. Systemic bias would be a bigger concern for overestimation of the effect size -- instead of one particularly biased researcher or a subset of particularly biased researchers, the whole system might be overestimating the effect size.
My priors tell me there's likely some degree of systemic bias. We weren't aware of the problems of research flexibility until about 2010, and we weren't publishing many null results until PLOS ONE started changing things. With this in mind, I'd suspect null results are likely to be tortured (at least a little) into statistical significance, or else they'll go rot in file drawers.
What do Lishner et al. say about that?
The authors argue that there is no systematic bias because a single outspoken skeptic still manages to get published. I don't buy this. Ferguson is one determined guy. I would expect that most other researchers have not pressed so hard as him to get their null results published.
There are ways to check for systematic biases like publication bias in a meta-analytic dataset, but Lishner et al. do not explore any of them. There is no Egger's test, no search for unpublished or rejected materials, no estimation of research flexibility, no test of excess significance, or any other meta-analytic approach that would speak to the possibility of research practices that favor significant results.
The authors, in my opinion, overestimate the strength of their case against the possibility of systemic bias in violent-games research.
Again, I've conducted my own analysis of possible systematic bias in violent games research and come up with a rather different view of things than Lishner et al. Among the subset of experiments Anderson et al. selected as "best-practices" studies, there is substantial selection bias. In both that subset and the full sample of experiments, p-curve meta-analysis indicates there is little to no effect. This leads me to suspect that the effect size has been overestimated through some element of bias in this literature.
Ferguson et al. (2015) Digital Poison? Three studies examining the influence of violent video games on youth (Paper link)
As is typical for me, I skimmed directly to the sample size and the reported result. I recognize I'm a tremendous pain in the ass, and I'm sorry. So I haven't read the rest of the manuscript and cannot comment on the methods.
This paper summarizes two experiments and a survey. Experiment 1 has 70 subjects, while Experiment 2 has 53. Both use between-subjects designs.
These are pretty small sample sizes if one intends to detect an effect. Anderson et al. (2010) estimate the effect as r = .21, or d = .43, which is probably an overestimate (research bias), but we'll take it at face value for now.
If the effect to be detected is r = .21, the studies have 43% and 33.5% power. Assuming there is an effect, a Type II error seems likely.
However, the authors erroneously dismiss the possibility of Type II error (excerpted from Experiment 1 results, but Experiment 2 makes identical arguments):
They treat the observed effect size as though it were unquestionably the true effect size. That is, they ignore the possibility of sampling error, which, at a sample size of 70, is quite substantial.
The argument in the upper paragraph doesn't seem to follow even its own logic: it argues that the true effect is so tiny that detecting it would take 1,600 participants, which nobody could be expected to collect, and that Type II error can therefore somehow be ruled out as a counter-explanation.
The lower paragraph argues that because the observed effect size is in the wrong direction, the result is not a Type II error and the effect does not exist. Again, sampling error means that even a positive effect will sometimes be measured as having the wrong sign in small samples (some tests for pub bias use this to great effect), so this argument does not hold on the basis of just this p-value.
Remember also that Anderson et al. (2010) believe the effect of violent games on empathy to be smaller still than that of violent games on aggressive behavior: just r = -.14. So the power for this test is just 21.4% for a two-tailed test in Experiment 1, 17.2% in Experiment 2. Type II error is extremely likely.
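These power figures can be reproduced (to rounding) with the pwr package, converting r to d via d = 2r / √(1 − r²) and splitting each total N across the two groups:

```r
library(pwr)

# Convert a correlation to Cohen's d (equal-group case)
r_to_d <- function(r) 2 * r / sqrt(1 - r^2)

# Experiment 1: N = 70 (35 per group); Experiment 2: N = 53 (27 and 26).
# Aggressive behavior, r = .21 -> d ~ .43, two-tailed test:
pwr.t2n.test(n1 = 35, n2 = 35, d = r_to_d(.21))$power   # ~ .43
pwr.t2n.test(n1 = 27, n2 = 26, d = r_to_d(.21))$power   # ~ .34
# Empathy, r = .14 -> d ~ .28, two-tailed test:
pwr.t2n.test(n1 = 35, n2 = 35, d = r_to_d(.14))$power   # ~ .21
pwr.t2n.test(n1 = 27, n2 = 26, d = r_to_d(.14))$power   # ~ .17
```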
But let's see how much evidence we can squeeze out of this with Bayesian model comparison. We'll take the Anderson et al. (2010) estimates for our alternative hypotheses: r = .21 for aggressive behavior, r = -.14 for empathy. We'll transform everything to Fisher's Z for my convenience. This gives HA1: Z ~ N(.21, .02) for aggressive behavior and HA2: Z ~ N(-.14, .07) for empathy.
The empathy results are stronger. The observed negative sign does not necessarily rule out Type II error, but it does make the results less likely given some positive effect. Bayes factors are 4:1 odds for the null over HA2 in Experiment 1, 3.1:1 odds for the null over HA2 in Experiment 2.
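The comparison itself is simple: under each hypothesis, the observed Fisher Z is normal with variance equal to its sampling variance plus the hypothesis's variance, and the Bayes factor is the ratio of the two densities. A sketch with a hypothetical observed effect (the `r = .05` below is illustrative, not a value from the paper):

```r
# Bayes factor for H0: Z = 0 vs. HA: Z ~ N(mu, tau), given one observed Fisher Z.
# Marginalizing over HA, the observed Z is normal with sd sqrt(se^2 + tau^2).
bf01 <- function(z_obs, se, mu, tau) {
  dnorm(z_obs, 0, se) / dnorm(z_obs, mu, sqrt(se^2 + tau^2))
}

# Hypothetical example: observed r = .05 in N = 53, so se = 1/sqrt(N - 3)
z_obs <- atanh(.05)
se    <- 1 / sqrt(53 - 3)
bf01(z_obs, se, mu = -.14, tau = .07)   # odds for the null over the empathy HA
```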
My recent paper (free postprint link) catalogs several of these fallacious arguments for the null hypothesis as made by both proponents and skeptics of violent-game effects. It then demonstrates how Bayesian model comparison is really the only way to make these arguments for the null (or the alternative!) in a principled and effective way.
I recognize that it is difficult to collect samples, particularly when studying populations besides college undergraduates at big state universities. All the same, the conclusions have to be in line with the data in hand. Sympathy aside, the results cannot rule out the possibility of Type II error and provide little evidence for the null relative to what meta-analyses report.
I thank Drs. Lishner and Ferguson for supplying comment. Dr. Lishner suggested I better attend to the reported confidence intervals for single-scholar bias and cabal bias. Dr. Ferguson suggested I consider the challenges of collecting samples of teens in the absence of grant funding.
Figure code below.
# Get standard error of effect (d = 0 under the null); N = 53 split into two groups
n1 = ceiling(53/2)
n2 = floor(53/2)
se = sqrt(((n1+n2)/(n1*n2) + 0/(2*(n1+n2-2))) * ((n1+n2)/(n1+n2-2)))
null = rnorm(1e5, 0, sd = se)
alt = rnorm(1e5, .43, sd = se)
# Plot p(data | hypothesis) under the null and the alternative
plot(density(null), xlim = c(-1, 1.5), ylim = c(0, 1.5),
     main = "Hypothesis Comparison",
     ylab = "Probability Density of Data",
     xlab = "Effect size d",
     col = 'darkblue',
     lwd = 2)
lines(density(alt), col = 'darkred', lwd = 2)
# plot observed effect size
abline(v = .12, col = 'darkgreen', lwd = 1.5)
Wednesday, January 13, 2016
"Differences of Significance" fallacy is popular because it increases Type I error
We all recognize that a difference of significance is not the same as a significant difference. That is, if men show a statistically significant (p < .05) response to some manipulation while women do not (p > .05), that does not imply that there is evidence for a difference in how men and women respond to the manipulation.
The test that should be used, of course, is the Significant Difference Test. One estimates the interaction term and its standard error, then checks the p-value representing how unusual that estimate would be if the true interaction were zero. If p < .05, one concludes the two subgroups have different responses.
The incorrect test is the Differences of Significance Test. In that test, one checks the p-values for the manipulation in each subgroup and concludes a difference between subgroups if one has p < .05 and the other has p > .05.
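The contrast between the two tests is easy to see numerically. Take two subgroup estimates, one significant and one not; the proper test compares their difference to its pooled standard error (the numbers below are invented for illustration):

```r
# Subgroup 1: b = .50, se = .20  ->  z = 2.50, p ~ .012 (significant)
# Subgroup 2: b = .25, se = .20  ->  z = 1.25, p ~ .21  (not significant)
b1 <- .50; se1 <- .20
b2 <- .25; se2 <- .20

# Significant Difference Test: test the interaction (the difference) directly
z_diff <- (b1 - b2) / sqrt(se1^2 + se2^2)
p_diff <- 2 * pnorm(-abs(z_diff))
p_diff   # ~ .38: no evidence the subgroups differ, despite the difference in significance
```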
We've seen the scientific community taking a firmer stance on people mistaking the Difference of Significance for the Significant Difference. Last year we saw Psych Science retract a paper because its core argument relied upon a Difference of Significance Test.
Why do people make this mistake? Why do we still make it, 10 years after Gelman and Stern?
My suspicion is that the Differences of Significance Test gets (unknowingly) used because it suffers from much higher Type I error rates, which allows for greater publication rates and more nuanced story-telling than is appropriate.
Let's think about two subgroups of equal size. We purport to be testing for an interaction: the two subgroups are expected to have different responses to the manipulation. We should be reporting the manipulation × subgroup interaction which, when done properly, has the nominal Type I error rate of 5%. Instead, we will look to see if one group has a significant effect of manipulation while the other is not. If so, we'll call it a success.
Assuming the two subgroups have equal size and there is no interaction, each subgroup has the same chance (its statistical power) of showing a statistically significant effect of the manipulation. So the probability of getting one significant effect and one nonsignificant effect is simply the probability of exactly one success in two Bernoulli trials, each with success probability equal to that power: 2 × power × (1 − power).
As you can see, the Type I error rate of this improper test is very high, peaking at a whopping 50% when each subgroup has 50% power. And this doesn't even require any questionable research practices like optional stopping or flexible outlier treatment!
Of course, one can obtain even higher Type I error rates for this (again, improper) test by running unequal subgroups with unequal power. If group 1 is large and has 80% power to detect the effect, while group 2 is small and has only 20% power, then one will find a difference in significance 68% of the time.
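Both figures follow from one line of algebra: if the two subgroup tests are independent with powers p1 and p2, the chance that exactly one reaches significance is p1(1 − p2) + p2(1 − p1). A quick check:

```r
# P(exactly one of two independent tests is significant), given each test's power
p_diff_sig <- function(p1, p2) p1 * (1 - p2) + p2 * (1 - p1)

p_diff_sig(.5, .5)   # 0.50: the worst case with equal subgroups
p_diff_sig(.8, .2)   # 0.68: unequal subgroups can be even worse
```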
Obviously everybody knows the Difference of Significance Test is wrong and bad, and that they should be using the Significant Difference Test instead. But I wanted to illustrate just how bad the problem can actually be. As you can see, this isn't just nitpicking -- it can cause a tenfold increase in Type I error rates.
You might be tempted to call that.... a significant difference. (hyuk hyuk hyuk)
The test that should be used, of course, is the Significant Difference Test. One estimates the interaction term and its standard error, then checks the p-value representing how unusual the estimate would be if the true interaction were zero. If p < .05, one concludes the two subgroups have different responses.
The incorrect test is the Difference of Significance Test. In that test, one checks the p-value for the manipulation in each subgroup and concludes a difference between subgroups if one has p < .05 and the other has p > .05.
We've seen the scientific community taking a firmer stance on people mistaking the Difference of Significance for the Significant Difference. Last year we saw Psych Science retract a paper because its core argument relied upon a Difference of Significance Test.
Why do people make this mistake? Why do we still make it, 10 years after Gelman and Stern?
My suspicion is that the Difference of Significance Test gets (unknowingly) used because it has a much higher Type I error rate, which allows for greater publication rates and more nuanced storytelling than is appropriate.
Let's think about two subgroups of equal size. We purport to be testing for an interaction: the two subgroups are expected to respond differently to the manipulation. We should be reporting the manipulation × subgroup interaction, which, when done properly, has the nominal Type I error rate of 5%. Instead, we will look to see whether one group has a significant effect of the manipulation while the other does not. If so, we'll call it a success.
Assuming the two subgroups are of equal size and there is no interaction, each subgroup has the same chance of showing a statistically significant effect of the manipulation. So the probability of getting one significant effect and one nonsignificant effect is simply the probability of getting exactly one success in two Bernoulli trials, each with success probability equal to the subgroup's power.
[Figure: Type I error rate of the Difference of Significance Test as a function of subgroup power. The 5% Type I error rate of the correct test is shown as a blue line.]
As you can see, the Type I error rate of this improper test is very high, peaking at a whopping 50% when each subgroup has 50% power. And this doesn't even require any questionable research practices like optional stopping or flexible outlier treatment!
Of course, one can obtain other Type I error rates from this (again, improper) test by running unequal subgroups with unequal power. If group 1 is large and has 80% power to detect the effect, while group 2 is small and has only 20% power, then one will find a difference in significance 68% of the time.
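The false-positive rates above are easy to verify. Here's a minimal sketch (my own, not from any particular package): the probability that exactly one of two independent subgroup tests comes out significant when the true interaction is zero.

```python
def diff_of_significance_rate(p1, p2):
    """Probability that exactly one of two independent tests is
    significant, where p1 and p2 are each subgroup's chance of a
    significant result (i.e., its power)."""
    return p1 * (1 - p2) + p2 * (1 - p1)

# Equal subgroups at 50% power each: the worst case, a 50% "hit" rate.
print(round(diff_of_significance_rate(0.5, 0.5), 2))  # 0.5

# Unequal subgroups: 80% power vs. 20% power gives 68%.
print(round(diff_of_significance_rate(0.8, 0.2), 2))  # 0.68
```

Note that under a true null with no subgroup effects at all, each p is just alpha = .05, and the rate falls to 2 × .05 × .95 = 9.5% — still nearly double the nominal 5%.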
Obviously, everybody knows the Difference of Significance Test is wrong and bad, and that they should be using the Significant Difference Test instead. But I wanted to illustrate just how bad the problem can actually be. As you can see, this isn't just nitpicking -- it can cause a tenfold increase in the Type I error rate.
You might be tempted to call that.... a significant difference. (hyuk hyuk hyuk)
Saturday, November 28, 2015
Minimal Manipulations
[Figure: What's the weak link in your path diagram?]
Psychology is the study of relationships between intangible constructs as seen through the lens of our measures and manipulations. We use manipulation A to push on construct X, then look at the resulting changes in construct Y, as estimated by measurement B.
Sometimes it's not clear how one should manipulate construct X. How would we make participants feel self-affirmed? Or if we wanted participants to slow down and really think about a problem? Or conversely, how would we get them to think less and go with their gut feeling? While we have a whole subfield dedicated to measurement (psychometrics), methods and manipulations have historically received less attention and less journal space.
So what can we do when we don't know how to manipulate something? One lowest-common-denominator manipulation of these complicated constructs is to ask participants to think about (or, if we're feeling ambitious, to write about) a time when they exhibited Construct X. That, it's assumed, will lead them to feel more Construct X and lead them to exhibit behaviors consistent with greater levels of Construct X.
I wonder, though, at the statistical power of such experiments. Will remembering a time your hunch was correct lead you to substantially greater levels of intuition use for the next 15 minutes? Will writing about a time you felt good about yourself lead you to achieve a peaceful state of self-affirmation where you can accept evidence that conflicts with your views?
Effect-Size Trickle-Down
If we think about an experiment as a path diagram, it becomes clear that a strong manipulation is necessary. When we examine the relationship between constructs X and Y, what we're really looking at is the relationship between manipulation A and measurement B.
Although path b2 is what we want to test, we don't get to see it directly. X and Y are latent and not observable. Instead, the path that we see is the relationship between Manipulation A and Measurement B. This relationship has to go through all three paths, and so it has strength = b1 × b2 × b3. Since each path is a correlation between -1 and +1, the magnitude of b1 × b2 × b3 must be equal to or less than that of each individual path.
This means that the effect on the dependent variable is almost certain to be smaller than the effect on the manipulation check. Things start with the manipulation and trickle down from there. If the manipulation can only barely nudge the manipulated construct, you're all but certain not to see effects of the manipulation on the downstream outcome.
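The trickle-down argument can be made concrete with a short sketch (the path values here are hypothetical, not from any study): the observed manipulation-to-measurement correlation is the product of the three path coefficients, so it can never exceed the weakest single path.

```python
def observed_effect(b1, b2, b3):
    """Correlation between manipulation A and measurement B,
    given the three path coefficients b1, b2, b3."""
    return b1 * b2 * b3

b1 = 0.5  # manipulation A -> construct X (manipulation strength)
b2 = 0.4  # construct X -> construct Y (the path we actually care about)
b3 = 0.7  # construct Y -> measurement B (measurement reliability)

r = observed_effect(b1, b2, b3)
print(f"observed A-B correlation: {r:.2f}")  # 0.14, well below b2 = .4
assert abs(r) <= min(abs(b1), abs(b2), abs(b3))
```

Even with a hearty true effect of .4, a mediocre manipulation and an imperfect measure leave only a .14 correlation between what we manipulate and what we measure.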
Minimal Manipulations in the Journals
I wonder if these writing manipulations are effective. One time I reviewed a paper using such a manipulation. Experimental assignment had only a marginally significant effect on the manipulation check. Nevertheless, the authors managed to find significant differences in the outcome across experimental conditions. Is that plausible?
I've since found another published (!) paper with such a manipulation. In Experiment 1, the manipulation check was not significant, but the anticipated effect was. In Experiment 2, the authors didn't bother to check the manipulation any further.
This might be another reason to be skeptical about social priming: manipulations such as briefly holding a warm cup of coffee are by nature minimal manipulations. Even if one expected a strong relationship between feelings of bodily warmth and feelings of interpersonal warmth, the brief exposure to warm coffee might not be enough to create strong feelings of bodily warmth.
(As an aside, it occurs to me that these minimal manipulations might be why, in part, college undergraduates think the mind is such a brittle thing. Their social psychology courses have taught them that the brief recounting of an unpleasant experience has pronounced effects on subsequent behavior.)
Ways Forward
Creating powerful and reliable manipulations is challenging. Going forward, we should be:
1) Skeptical of experiments using weak manipulations, as their statistical power is likely poor, but
2) Understanding and patient about the complexities and challenges of manipulations,
3) Careful to share methodological details, including effect sizes on manipulation checks, so that researchers can share what manipulations do and do not work, and
4) Grateful for methods papers that carefully outline the efficacy and validity of manipulations.
Sunday, November 22, 2015
The p-value would have been lower if...
One is often asked, it seems, to extend someone a p-value on credit. "The p-value would be lower if we'd had more subjects." "The p-value would have been lower if we'd had a stronger manipulation." "The p-value would have been lower with a cleaner measurement, a continuous instead of a dichotomous outcome, the absence of a ceiling effect..."
These claims could be true, or they could be false, conditional on one thing: Whether the null hypothesis is true or false. This is, of course, a tricky thing to condition on. The experiment itself should be telling us the evidence for or against the null hypothesis.
So now we see that these statements are very clearly begging the question. Perhaps the most accurate formulation would be, "I would have stronger evidence that the null were false if the null were false and I had stronger evidence." It is perfectly circular.
When I see a claim like this, I imagine a cockney ragamuffin pleading, "I'll have the p-value next week, bruv, sware on me mum." But one can't issue an IOU for evidence.
Sunday, October 4, 2015
Poor Power at Decent Sample Sizes: Significance Under Duress
Last week, I got to meet Andrew Gelman as he outlined what he saw as several of the threats to validity in social science research. Among these was the fallacious idea of "significance under duress": the claim that, when statistical significance is reached under less-than-ideal conditions, the underlying effect must be very powerful. While this sounds like it makes sense, the claim does not follow.
Let's dissect the idea by considering the following scenario:
120 undergraduates participate in an experiment to examine the effect of mood on preferences for foods branded as "natural" relative to conventionally-branded foods. To manipulate mood, half of the participants write a 90-second paragraph about a time they felt bad, while the other half write a 90-second essay about a control topic. The outcome is a single dichotomous choice between two products. Even though a manipulation check reveals the writing manipulation had only a small effect on mood, and even though a single-item outcome provides less power than would rating several forced choices, statistical significance is nevertheless found when comparing the negative-writing group to the neutral-writing group, p = .030. The authors argue that the relationship between mood and preferences for "natural" must be very strong indeed to have yielded significance despite the weak manipulation and imprecise outcome measure.
Even though the sample size is better than most, I would still be concerned that a study like this is underpowered. But why?
Remember that statistical power depends on the expected effect size. Effect size involves both signal and noise. Cohen's d is the difference in means divided by the standard deviation of scores. Pearson correlation is the covariance of x and y divided by the standard deviations of x and y. Noisier measures will mean larger standard deviations and hence, a smaller effect size.
The effect size is not a platonic distillation of the relationship between the two constructs you have in mind (say, mood and preference for the natural). Instead, it is a ratio of signal to noise between your measures -- here, condition assignment and product choice.
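To make the noise penalty concrete, here's a minimal sketch (the numbers are invented): adding measurement noise leaves the mean difference untouched but inflates the standard deviation, shrinking Cohen's d = (mean difference) / SD.

```python
import math

def cohens_d(mean_diff, sd_true, sd_noise):
    """Observed d when independent measurement noise is added:
    the denominator grows from sd_true to sqrt(sd_true^2 + sd_noise^2)."""
    return mean_diff / math.sqrt(sd_true**2 + sd_noise**2)

# True mean difference of 0.5 SD, with increasing measurement noise.
for sd_noise in (0.0, 0.5, 1.0):
    d = cohens_d(0.5, 1.0, sd_noise)
    print(f"noise SD = {sd_noise}: observed d = {d:.2f}")
```

Noise equal in magnitude to the true score variability cuts the observed effect size from 0.50 to 0.35 -- and statistical power falls with it.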
Let's imagine this through the lens of a structural equation model. Italicized a and b represent the latent constructs of interest: mood and preference for the natural, respectively. Let's assume their relationship is rho = .4, a hearty effect. x and y are the condition assignment and the outcome, respectively. The path from x to a represents the effect of the manipulation. The path from b to y represents the measurement reliability of the outcome. To tell what the relationship will be between x and y, we multiply each path coefficient as we travel from x to a to b to y.
When the manipulation is strong and the measurement reliable, the relationship between x and y is strong, and power is good. When the manipulation is weak and the measurement unreliable, the relationship is small, and power falls dramatically.
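Here's a rough sketch of those two cases. The latent rho = .4 follows the example above, but the manipulation and reliability path values are my own, and power for a two-sided test of r = 0 at alpha = .05 is approximated via the Fisher z transformation.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_for_r(r, n, z_crit=1.959964):
    """Approximate power to detect correlation r with n subjects,
    using the Fisher z approximation for a two-sided test."""
    return normal_cdf(math.sqrt(n - 3) * math.atanh(r) - z_crit)

rho = 0.4  # latent a-b correlation
n = 120

strong = 0.9 * rho * 0.9  # strong manipulation, reliable measure
weak = 0.5 * rho * 0.5    # weak manipulation, unreliable measure

for r in (strong, weak):
    print(f"observed r = {r:.2f}: power = {power_for_r(r, n):.2f}")
```

With the same 120 subjects and the same true rho = .4, power is around .95 in the strong case and under .20 in the weak one: a decent sample size is no guarantee of decent power.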
Because weak manipulations and noisy measurements decrease the anticipated effect size, thereby decreasing power, studies can still have decent sample sizes and poor statistical power. Such examples of "significance under duress" should be regarded with the same skepticism as other underpowered studies.