Wednesday, January 27, 2016

Power analysis slants funnels but doesn't flatten curves.

I recently had the pleasure of receiving a very thorough and careful peer review, with an invitation to resubmit, at Psychological Bulletin for my manuscript-in-progress, "Overestimated Effects of Violent Games on Aggressive Outcomes in Anderson et al. (2010)."

Although it was humbling to find I still have much to learn about meta-analysis, I was also grateful for what has been one of the most instructive peer reviews I have ever received. Sometimes you get peer reviewers who simply don't like your paper, and perhaps never will. Sometimes reviewers can only offer the most nebulous of suggestions, leaving you fumbling for a way to appease everyone. This review, however, was full of blessedly concrete recommendations.

Anyway, the central thrust of my paper is a challenge to Anderson et al.'s (2010) assertion that violent-game effects are not overestimated through publication, analytic, or other biases. They did not provide funnel plots to support this argument, relying chiefly on the trim-and-fill procedure instead. When you generate these funnel plots, you find that they are strikingly asymmetrical, suggesting that there may indeed be bias despite the trim-and-fill results.

Effects of violent games on aggressive behavior in experiments selected as having "best-practices" methodology by Anderson et al. (2010).

We conducted two new statistical adjustments for bias, which suggest that the effect may actually be quite small. One, PET, uses the funnel plot's asymmetry to estimate what the effect size might be for a hypothetical perfectly-precise study. The other, p-curve, uses the p-values of significant results to estimate the underlying effect size.
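To make the PET idea concrete, here is a minimal sketch in base R. It is not the analysis from the manuscript: the data are hypothetical and noise-free (the names `se_toy` and `d_toy` are made up for illustration), constructed so that observed effects grow with the standard error. PET is fit here as a weighted least-squares regression of effect size on standard error, and the intercept is read off as the predicted effect for a study with a standard error of zero.

```r
# Hypothetical, noise-free toy data: observed effects grow with standard error,
# as they would under small-study effects. The intercept is 0.1 by construction.
se_toy <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)
d_toy  <- 0.1 + 2 * se_toy

# PET: weighted least-squares regression of effect size on standard error,
# weighting each study by its inverse variance
pet_fit <- lm(d_toy ~ se_toy, weights = 1 / se_toy^2)

# The intercept is the predicted effect for a perfectly precise study (SE = 0)
pet_estimate <- unname(coef(pet_fit)[1])

# Naive inverse-variance weighted mean, for comparison
naive_estimate <- weighted.mean(d_toy, w = 1 / se_toy^2)

print(c(PET = pet_estimate, naive = naive_estimate))
```

In this toy case the naive weighted mean sits above the intercept because it averages over the small, imprecise studies rather than extrapolating past them.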

One peer reviewer commented that if the meta-analyzed effect sizes are heterogeneous, with some large and some small, and if researchers are using power analysis appropriately to plan their sample sizes, then true large effects will be studied with small samples and true small effects will be studied with large samples. That would lead to an asymmetrical funnel plot and the illusion of research bias.
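The reviewer's premise is easy to see with a quick calculation. The sketch below uses base R's `power.t.test` and assumes a two-sample t-test, one-tailed, at alpha = .05 with 80% power; the vector of effect sizes is just an illustrative grid.

```r
# Per-group sample size needed for 80% one-tailed power at alpha = .05,
# for a range of assumed true effect sizes (Cohen's d)
d_assumed <- c(0.1, 0.2, 0.3, 0.4, 0.5)

n_per_group <- sapply(d_assumed, function(d) {
  ceiling(power.t.test(delta = d, sd = 1, sig.level = .05, power = .8,
                       type = "two.sample", alternative = "one.sided")$n)
})

print(data.frame(d = d_assumed, n_per_group = n_per_group))
```

The required sample size falls steeply as the assumed effect grows, which is exactly the inverse relationship between effect size and sample size that the reviewer describes.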

I don't think that's what's going on here, of course. Power analysis is rare in social psychology, especially in the years covered by the Anderson et al. meta-analysis. I'm also not sure how researchers would somehow know, a priori, the effect size they were studying, but then Arina K. Bones does have ironclad evidence of the precognitive abilities of social psychologists.

But even if that were true, I had the hunch it should only affect the funnel plot, which relies on observed effect sizes and sample sizes, and not the p-curve, which relies on the statistical power of studies. So I ran a simulation to see.

Simulated meta-analysis of 1000 studies. True effect size varies uniformly between .1 and .5. Sample sizes selected for 80% one-tailed power. Simulation code at bottom of post.

Sure enough, the funnel plot is very asymmetrical despite the inclusion of all studies. However, the p-curve still shows a clear right skew.

Of course, the meta-analyst should make efforts to divide the studies into homogeneous groups so that heterogeneity is minimized and one is not comparing apples and oranges. But I was compelled to test this and then further compelled to write it down.

Code here:

# power analysis hypothesis:
# Reviewer 3 says:
  # if the size of studies is chosen according to properly 
  # executed power analyses, we would in fact expect to see 
  # an inverse relationship between outcomes and sample sizes 
  # (and so if authors engage in the recommended practice of 
  # planning a study to achieve sufficient power, we are actually 
  # building small-study effects into our literature!). 

# Let's simulate b/c I don't believe p-curve would work that way.


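# packages: pwr for pwr.t.test(), metafor for rma() and funnel()
library(pwr)
library(metafor)

# lookup table for power
d = seq(.1, .6, .01) #seq(.1, 1, .05)
n = NULL
for (i in 1:length(d)) {
  # per-group n needed for 80% power, one-tailed two-sample t-test
  n[i] = pwr.t.test(d = d[i], sig.level = .05, power = .8, 
                 type = "two.sample", alternative = "greater")$n
}
# round up b/c can't have fractional n
n = ceiling(n)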

# pick a d, pick an n, run an experiment
simLength = 1e3
d_iter = NULL
n_iter = NULL
df_iter = NULL
t_iter = NULL
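for (i in 1:simLength) {
  index = sample(1:length(d), 1)
  d_iter[i] = d[index]
  n_iter[i] = n[index]
  # n is per group (as returned by pwr.t.test), so total df is 2n - 2
  df_iter[i] = 2 * n_iter[i] - 2
  # draw the observed t from a noncentral t; ncp = d / sqrt(1/n1 + 1/n2)
  t_iter[i] = rt(1, df_iter[i], 
                 ncp = d_iter[i] / sqrt(2 / n_iter[i]))
}

dat = data.frame(d_true = d_iter,
                 n = n_iter, df = df_iter, t = t_iter)
dat$d_obs = 2*dat$t/sqrt(dat$df)
dat$p = pt(dat$t, dat$df, lower.tail = F)
# NOTE: this line was cut off in the original. The expression below is an
# assumed completion: the usual standard error of d with n1 = n2 = n
dat$se_obs = sqrt((2/dat$n + dat$d_obs^2 / (2*dat$df)) * (2*dat$n / dat$df))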

# funnel plot
model = rma(yi = d_obs, sei = se_obs, data = dat)


par(mfrow=c(1, 2))
funnel(model, main = "Funnel plot w/ \npower analysis & \nheterogeneity")
hist(dat$p[dat$p<.05], main = "p-curve w/ \npower analysis & \nheterogeneity", xlab = "p-value")

Tuesday, January 19, 2016

Two Quick HIBARs

I've posted a little post-publication peer review on ResearchGate these past few months on some studies of violent game effects. Doing this made me realize that ResearchGate is actually really weak for this task -- you can mark up particular chunks and comment on them, but most people are going to immediately download the PDF, and the PDF won't carry the comments. So putting up commentary on ResearchGate will mostly just piss off the authors, who get an email alerting them to the assault, but fail to inform the readers, who will probably not read or even notice the comments.

So here is a brief digest of two recent comments I've put on two recent papers. Consider these some quick Had I Been A Reviewer (HIBAR) posts.

Lishner, Groves, and Chobrak (2015): Are Violent Video Game-Aggression Researchers Biased? (Paper link)

Whether or not there is bias in violent-games and aggression research is the topic of some of my own research, which seems to indicate that, yes, there is some element of bias that is leading to the likely overestimation of violent-game effects.

The authors consider three potential forms of bias: single-scholar bias, by which a single prominent scholar is able to unduly influence the field through sheer volume of publication; cabal bias, by which a group of scholars use their numbers or resources to unduly influence the field, again through volume of publication or through collusion in peer review; and systemic bias, by which there is some broad bias across the field towards the finding of an effect.

They present some re-analyses of the Anderson et al. (2010) meta-analysis to suggest that there is neither single-scholar bias (because Anderson's effect sizes aren't statistically significantly larger than everybody else's) nor cabal bias (because those who publish repeatedly on these effects don't find statistically significantly larger effects than those who only ever run one study).

Of course, the absence of statistical significance does not necessarily imply that the null is true, but the confidence intervals suggest that Lishner et al. might be correct. Experiments done by Anderson find a mean effect size of r = .19 [.14, .24], while experiments done by the rest of the collection have a mean effect size of r = .18 [.13, .22]. That's a pretty close match. For their test of cabal bias, the experiments done by the potential cabal have a mean effect size of r = .20 [.15, .24], while the experiments done by the other groups have a mean effect size of r = .24 [.17, .30]. The difference isn't in the right direction for cabal bias.

That leaves us with the possibility of systemic bias. Systemic bias would be a bigger concern for overestimation of the effect size -- instead of one particularly biased researcher or a subset of particularly biased researchers, the whole system might be overestimating the effect size.

My priors tell me there's likely some degree of systemic bias. We weren't aware of the problems of research flexibility until about 2010, and we weren't publishing many null results until PLOS ONE started changing things. With this in mind, I'd suspect null results are likely to be tortured (at least a little) into statistical significance, or else they'll go rot in file drawers.

What do Lishner et al. say about that?

The authors argue that there is no systemic bias because a single outspoken skeptic still manages to get published. I don't buy this. Ferguson is one determined guy. I would expect that most other researchers have not pressed as hard as he has to get their null results published.

There are ways to check for systematic biases like publication bias in a meta-analytic dataset, but Lishner et al. do not explore any of them. There is no Egger's test, no search for unpublished or rejected materials, no estimation of research flexibility, no test of excess significance, or any other meta-analytic approach that would speak to the possibility of research practices that favor significant results.

The authors, in my opinion, overestimate the strength of their case against the possibility of systemic bias in violent-games research.

Again, I've conducted my own analysis of possible systemic bias in violent-games research and come up with a rather different view of things than Lishner et al. Among the subset of experiments that Anderson et al. (2010) selected as "best-practices" studies, there is substantial selection bias. Among that subset or the full sample of experiments, p-curve meta-analysis indicates there is little to no effect. This leads me to suspect that the effect size has been overestimated through some element of bias in this literature.

Ferguson et al. (2015) Digital Poison? Three studies examining the influence of violent video games on youth (Paper link)

As is typical for me, I skimmed directly to the sample size and the reported result. I recognize I'm a tremendous pain in the ass, and I'm sorry. So I haven't read the rest of the manuscript and cannot comment on the methods.

This paper summarizes two experiments and a survey. Experiment 1 has 70 subjects, while Experiment 2 has 53. They use between-subject designs.

These are pretty small sample sizes if one intends to detect an effect. Anderson et al. (2010) estimate the effect as r = .21, or d = .43, which is probably an overestimate (research bias), but we'll take it at face value for now.

If the effect to be detected is r = .21, the studies have 43% and 33.5% power. Assuming there is an effect, a Type II error seems likely.
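For anyone who wants to check these figures, here is a rough sketch in base R. It assumes a two-tailed, two-sample t-test at alpha = .05 with each sample split as evenly as possible between two groups (the split is my assumption, and `power_two_sample` is a helper name made up for this example). It converts r to d and computes power from the noncentral t distribution.

```r
# Power of a two-tailed, two-sample t-test for a total sample of N,
# assuming the sample is split as evenly as possible between two groups
power_two_sample <- function(N, r, alpha = .05) {
  d   <- 2 * r / sqrt(1 - r^2)           # convert r to Cohen's d
  n1  <- ceiling(N / 2)
  n2  <- floor(N / 2)
  df  <- n1 + n2 - 2
  ncp <- d * sqrt(n1 * n2 / (n1 + n2))   # noncentrality parameter
  t_crit <- qt(1 - alpha / 2, df)
  # probability of landing in either rejection region
  pt(t_crit, df, ncp = ncp, lower.tail = FALSE) + pt(-t_crit, df, ncp = ncp)
}

power_exp1 <- power_two_sample(70, .21)
power_exp2 <- power_two_sample(53, .21)
print(round(c(Experiment1 = power_exp1, Experiment2 = power_exp2), 3))
```

Other power routines may differ slightly in the last digit depending on how they handle odd sample sizes.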

However, the authors erroneously dismiss the possibility of Type II error (excerpted from Experiment 1 results, but Experiment 2 makes identical arguments):

They treat the observed effect size as though it were unquestionably the true effect size. That is, they ignore the possibility of sampling error, which, at a sample size of 70, is quite substantial.

The argument in the upper paragraph doesn't seem to follow even its own logic: it argues that the true effect is very tiny, so it would take 1600 participants, which nobody can expect to collect, so of course it's Type II error, so Type II error can be ruled out as a counter-explanation.

The lower paragraph argues that because the observed effect size is in the wrong direction, the result is not a Type II error and the effect does not exist. Again, sampling error means that even a positive effect will sometimes be measured as having the wrong sign in small samples (some tests for pub bias use this to great effect), so this argument does not hold on the basis of just this p-value.

Remember also that Anderson et al. (2010) believe the effect of violent games on empathy to be smaller still than that of violent games on aggressive behavior: just r = -.14. So the power for this test is just 21.4% for a two-tailed test in Experiment 1, 17.2% in Experiment 2. Type II error is extremely likely.

But let's see how much evidence we can squeeze out of this with Bayesian model comparison. We'll take the Anderson et al. (2010) estimates for our alternative hypotheses: r = .21 for aggressive behavior, r = -.14 for empathy. We'll transform everything to Fisher's Z for my convenience. This gives HA1: Z ~ N(.21, .02) for aggressive behavior and HA2: Z ~ N(-.14, .07) for empathy.

Probability distribution of observed effect size given H0: d = 0, H1: d = .43, and a sample size of 53 (a la study 2). Observed effect size is indicated with the vertical line. The probability is comparable across hypotheses; the data support the null, but only slightly. (Sorry this is in terms of d rather than Z -- just the code I had lying around.)

When we use these alternative hypotheses, there isn't much in the way of evidence. We get the following Bayes factors: 1.25:1 odds for the null over HA1 in Experiment 1, 1.39:1 odds for the null over HA1 in Experiment 2. So whatever somebody was willing to bet that there is or isn't an effect of violent games wouldn't change much. The authors have really overstated the strength of their evidence by insisting that Type II error can't explain these results. At most, the r = .21 estimates might be a bit high, but if you had to choose between r = 0 or r = .21 you wouldn't really know which to choose.

The empathy results are stronger. The observed negative sign does not necessarily rule out Type II error, but it does make the results less likely given some positive effect. Bayes factors are 4:1 odds for the null over HA2 in Experiment 1, 3.1:1 odds for the null over HA2 in Experiment 2.

My recent paper (free postprint link) catalogs several of these fallacious arguments for the null hypothesis as made by both proponents and skeptics of violent-game effects. It then demonstrates how Bayesian model comparison is really the only way to make these arguments for the null (or the alternative!) in a principled and effective way.

I recognize that it is difficult to collect samples, particularly when studying populations besides college undergraduates at big state universities. All the same, the conclusions have to be in line with the data in hand. Sympathy aside, the results cannot rule out the possibility of Type II error and provide little evidence for the null relative to what meta-analyses report.

I thank Drs. Lishner and Ferguson for supplying comment. Dr. Lishner suggested I better attend to the reported confidence intervals for single-scholar bias and cabal bias. Dr. Ferguson suggested I consider the challenges of collecting samples of teens in the absence of grant funding.

Figure code below.

# Get standard error of effect
n1 = ceiling(53/2)
n2 = floor(53/2)
se = sqrt(((n1+n2)/(n1*n2) + 0 / (2*(n1+n2-2))) * ((n1+n2)/(n1+n2-2)))
null = rnorm(1e5, 0, sd = se)  
alt = rnorm(1e5, .43, sd = se) 

# Plot p(data | hypothesis)
plot(density(null), xlim = c(-1, 1.5), ylim = c(0, 1.5),
     main = "Hypothesis Comparison",
     ylab = "Probability Density of Data",
     xlab = "Effect size d",
     col = 'darkblue',
     lwd = 2)
lines(density(alt), col = 'darkred', lwd = 2)
# plot observed effect size
abline(v = .12, col = 'darkgreen', lwd = 1.5)
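The height of each curve at the vertical line can also be computed directly, without simulation. The sketch below is a simplification: it compares two point hypotheses in units of d (d = 0 versus d = .43), using the same standard error and observed effect (.12) as the figure code. It is not the Fisher's Z model described in the text, where the alternative is a distribution rather than a point, so the ratio will not match the Bayes factors reported there.

```r
# Standard error of d for a total sample of 53, as in the figure code
n1 <- ceiling(53 / 2)
n2 <- floor(53 / 2)
se <- sqrt(((n1 + n2) / (n1 * n2)) * ((n1 + n2) / (n1 + n2 - 2)))

d_obs <- .12  # observed effect size marked by the vertical line

# Likelihood of the observed effect under each point hypothesis
lik_null <- dnorm(d_obs, mean = 0,   sd = se)
lik_alt  <- dnorm(d_obs, mean = .43, sd = se)

# Ratio of the two curve heights: evidence for the null over the point alternative
ratio_null_over_alt <- lik_null / lik_alt
print(ratio_null_over_alt)
```

Under these simplified point hypotheses the ratio comes out at roughly 1.7 to 1 in favor of the null, which tells the same story as the figure: the data barely discriminate between the two.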

Wednesday, January 13, 2016

"Differences of Significance" fallacy is popular because it increases Type I error

We all recognize that a difference of significance is not the same as a significant difference. That is, if men show a statistically significant (p < .05) response to some manipulation while women do not (p > .05), that does not imply that there is evidence for a difference in how men and women respond to the manipulation.

The test that should be used, of course, is the Significant Difference Test. One estimates the interaction term and its standard error, then checks the p-value representing how unusual it would be if the true value were zero. If p < .05, one concludes the two subgroups have different responses.

The incorrect test is the Differences of Significance Test. In that test, one checks the p-values for the manipulation in each subgroup and concludes a difference between subgroups if one has p < .05 and the other has p > .05.

We've seen the scientific community taking a firmer stance on people mistaking the Difference of Significance for the Significant Difference. Last year we saw Psych Science retract a paper because its core argument relied upon a Difference of Significance Test.

Why do people make this mistake? Why do we still make it, 10 years after Gelman and Stern?

My suspicion is that the Differences of Significance Test gets (unknowingly) used because it suffers from much higher Type I error rates, which allows for greater publication rates and more nuanced story-telling than is appropriate.

Let's think about two subgroups of equal size. We purport to be testing for an interaction: the two subgroups are expected to have different responses to the manipulation. We should be reporting the manipulation × subgroup interaction which, when done properly, has the nominal Type I error rate of 5%. Instead, we will look to see if one group has a significant effect of manipulation while the other does not. If so, we'll call it a success.

Assuming the two subgroups have equal size and there is no interaction, each subgroup has the same chance of having a statistically significant effect of manipulation. So the probability of getting one significant effect and one nonsignificant effect is simply the probability of getting one success on two Bernoulli trials with (Power)% success rate.

5% Type I error rate of correct test shown as blue line.

As you can see, the Type I error rate of this improper test is very high, peaking at a whopping 50% when each subgroup has 50% power.  And this doesn't even require any questionable research practices like optional stopping or flexible outlier treatment!

Of course, one can obtain even higher Type I error rates for this (again, improper) test by running unequal subgroups for unequal power. If group 1 is large and has 80% power to detect the effect, while group 2 is small and has only 20% power to detect, then one will find a difference in significance 68% of the time.
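The arithmetic behind these numbers is short enough to write out. This is a sketch assuming the two subgroup tests are independent and that there is no true interaction; `diff_of_sig_rate` is a helper name made up for this example. The chance of exactly one significant result out of two is p1(1 - p2) + p2(1 - p1), where p1 and p2 are the powers in each subgroup.

```r
# Probability that exactly one of two independent subgroup tests is significant,
# given the power of each test
diff_of_sig_rate <- function(power1, power2 = power1) {
  power1 * (1 - power2) + power2 * (1 - power1)
}

# Equal subgroups: the rate is 2p(1 - p), which peaks when power is 50%
equal_power <- seq(0.05, 0.95, by = 0.05)
rates <- diff_of_sig_rate(equal_power)
peak_power <- equal_power[which.max(rates)]
peak_rate  <- max(rates)

# Unequal subgroups: 80% power in one group, 20% in the other
unequal_rate <- diff_of_sig_rate(.8, .2)

print(c(peak_power = peak_power, peak_rate = peak_rate, unequal = unequal_rate))
```

The unequal case exceeds the equal-power peak because the well-powered group is usually significant while the underpowered group usually is not.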

Obviously everybody knows the Difference of Significance Test is wrong and bad and they should be using and looking for the Significant Difference Test. But I wanted to illustrate just how bad the problem can actually be. As you can see, this isn't just nitpicking -- it can be the cause of a tenfold increase in Type I error rates.

You might be tempted to call that.... a significant difference. (hyuk hyuk hyuk)