A psychologist's thoughts on how and why we play games

Tuesday, January 19, 2016

Two Quick HIBARs

I've posted a little post-publication peer review on ResearchGate these past few months on some studies of violent game effects. Doing this made me realize that ResearchGate is actually really weak for this task -- you can mark up particular chunks and comment on them, but most people are going to immediately download the PDF, and the PDF won't carry the comments. So putting up commentary on ResearchGate will mostly just piss off the authors, who get an email alerting them to the assault, but fail to inform the readers, who will probably not read or even notice the comments.

So here is a brief digest of two comments I've recently posted on two recent papers. Consider these some quick Had I Been A Reviewer (HIBAR) posts.

Lishner, Groves, and Chobrak (2015): Are Violent Video Game-Aggression Researchers Biased? (Paper link)

Whether or not there is bias in violent-games and aggression research is the topic of some of my own research, which seems to indicate that, yes, there is some element of bias that is leading to the likely overestimation of violent-game effects.

The authors consider three potential forms of bias: single-scholar bias, by which a single prominent scholar unduly influences the field through sheer volume of publication; cabal bias, by which a group of scholars use their numbers or resources to unduly influence the field, again through volume of publication or through collusion in peer review; and systemic bias, by which the field as a whole is broadly biased towards finding an effect.

They present some re-analyses of the Anderson et al. (2010) meta-analysis to suggest that there is neither single-scholar bias (because Anderson's effect sizes aren't statistically significantly larger than everybody else's) nor cabal bias (because those who publish repeatedly on these effects don't find statistically significantly larger effects than those who only ever run one study).

Of course, the absence of statistical significance does not necessarily imply that the null is true, but the confidence intervals suggest that Lishner et al. might be correct. Experiments done by Anderson find a mean effect size of r = .19 [.14, .24], while experiments done by the rest of the collection have a mean effect size of r = .18 [.13, .22]. That's a pretty close match. For their test of cabal bias, the experiments done by the potential cabal have a mean effect size of r = .20 [.15, .24], while the experiments done by the other groups have a mean effect size of r = .24 [.17, .30]. The difference isn't in the right direction for cabal bias.
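
One rough way to see how close these estimates are is to compare them on the Fisher z scale. The sketch below assumes that the reported intervals are 95% intervals that are roughly symmetric on that scale, and that the two sets of experiments being compared are independent. Neither assumption is confirmed by the paper, and the helper names (se_from_ci, compare_r) are mine.

```r
# Approximate standard error on the Fisher z scale, recovered from a
# reported CI for a correlation. Assumes a normal-theory interval.
se_from_ci <- function(lower, upper, level = 0.95) {
  crit <- qnorm(1 - (1 - level) / 2)
  (atanh(upper) - atanh(lower)) / (2 * crit)
}

# Compare two independent correlations given point estimates and CIs.
compare_r <- function(r1, ci1, r2, ci2) {
  diff_z  <- atanh(r1) - atanh(r2)
  se_diff <- sqrt(se_from_ci(ci1[1], ci1[2])^2 +
                  se_from_ci(ci2[1], ci2[2])^2)
  z_stat  <- diff_z / se_diff
  list(diff_z  = diff_z,
       se_diff = se_diff,
       z       = z_stat,
       p       = 2 * pnorm(-abs(z_stat)),
       ci_diff = diff_z + c(-1, 1) * qnorm(0.975) * se_diff)
}

# Anderson vs. everybody else (single-scholar bias)
single_scholar <- compare_r(.19, c(.14, .24), .18, c(.13, .22))
# Potential cabal vs. other groups (cabal bias)
cabal <- compare_r(.20, c(.15, .24), .24, c(.17, .30))

print(single_scholar)
print(cabal)
```

Under these assumptions the interval for the single-scholar difference straddles zero and stays within about a tenth of a Fisher z unit in either direction, and the cabal difference comes out negative, which is the opposite of what cabal bias would predict.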

That leaves us with the possibility of systemic bias. Systemic bias would be a bigger concern for overestimation of the effect size -- instead of one particularly biased researcher or a subset of particularly biased researchers, the whole system might be overestimating the effect size.

My priors tell me there's likely some degree of systemic bias. We weren't aware of the problems of research flexibility until about 2010, and we weren't publishing many null results until PLOS ONE started changing things. With this in mind, I'd suspect null results are likely to be tortured (at least a little) into statistical significance, or else they'll go rot in file drawers.

What do Lishner et al. say about that?



The authors argue that there is no systemic bias because a single outspoken skeptic still manages to get published. I don't buy this. Ferguson is one determined guy. I would expect that most other researchers have not pressed as hard as he has to get their null results published.

There are ways to check for systematic biases like publication bias in a meta-analytic dataset, but Lishner et al. do not explore any of them. There is no Egger's test, no search for unpublished or rejected materials, no estimation of research flexibility, no test of excess significance, or any other meta-analytic approach that would speak to the possibility of research practices that favor significant results.

The authors, in my opinion, overestimate the strength of their case against the possibility of systemic bias in violent-games research.

Again, I've conducted my own analysis of possible systemic bias in violent-games research and come up with a rather different view of things than Lishner et al. Among the subset of experiments Anderson selected as "best-practices" studies, there is substantial selection bias. Among that subset or the full sample of experiments, p-curve meta-analysis indicates there is little to no effect. This leads me to suspect that the effect size has been overestimated through some element of bias in this literature.

Ferguson et al. (2015) Digital Poison? Three studies examining the influence of violent video games on youth (Paper link)

As is typical for me, I skipped directly to the sample size and the reported result. I recognize I'm a tremendous pain in the ass, and I'm sorry. So I haven't read the rest of the manuscript and cannot comment on the methods.

This paper summarizes two experiments and a survey. Experiment 1 has 70 subjects, while Experiment 2 has 53. They use between-subject designs.

These are pretty small sample sizes if one intends to detect an effect. Anderson et al. (2010) estimate the effect as r = .21, or d = .43, which is probably an overestimate (research bias), but we'll take it at face value for now.

If the effect to be detected is r = .21, the studies have 43% and 33.5% power. Assuming there is an effect, a Type II error seems likely.
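
These power figures can be approximated with base R's power.t.test. The sketch below assumes a two-tailed, two-group t-test at alpha = .05 with subjects split evenly across conditions, and converts r to d with d = 2r / sqrt(1 - r^2). The function names (r_to_d, power_for_r) are mine, and this may not be exactly the calculation behind the numbers above.

```r
# Convert a correlation to Cohen's d (equal group sizes assumed)
r_to_d <- function(r) 2 * r / sqrt(1 - r^2)

# Power of a two-sample, two-sided t-test to detect an effect of size r
# with n_total subjects split evenly between two conditions.
power_for_r <- function(r, n_total, sig.level = 0.05) {
  power.t.test(n = n_total / 2,          # per-group n (may be fractional)
               delta = abs(r_to_d(r)),   # standardized mean difference
               sd = 1,
               sig.level = sig.level,
               type = "two.sample",
               alternative = "two.sided")$power
}

power_exp1 <- power_for_r(.21, 70)  # Experiment 1
power_exp2 <- power_for_r(.21, 53)  # Experiment 2
print(round(c(exp1 = power_exp1, exp2 = power_exp2), 3))
```

Swapping in a smaller r (or a different n_total) shows how quickly power drops off at these sample sizes.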

However, the authors erroneously dismiss the possibility of Type II error (excerpted from Experiment 1 results, but Experiment 2 makes identical arguments):


They treat the observed effect size as though it were unquestionably the true effect size. That is, they ignore the possibility of sampling error, which, at a sample size of 70, is quite substantial.

The argument in the upper paragraph doesn't seem to follow even its own logic: it argues that the true effect is very tiny, so it would take 1600 participants, which nobody can expect to collect, so of course it's Type II error, so Type II error can be ruled out as a counter-explanation.

The lower paragraph argues that because the observed effect size is in the wrong direction, the result is not a Type II error and the effect does not exist. Again, sampling error means that even a positive effect will sometimes be measured as having the wrong sign in small samples (some tests for pub bias use this to great effect), so this argument does not hold on the basis of just this p-value.

Remember also that Anderson et al. (2010) believe the effect of violent games on empathy to be smaller still than that of violent games on aggressive behavior: just r = -.14. So the power for this test is just 21.4% for a two-tailed test in Experiment 1, 17.2% in Experiment 2. Type II error is extremely likely.
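
To put a rough number on the sign-flip point, the sketch below asks how often a study of this size would observe an effect with the wrong sign if the true effect really were r = -.14. It uses a normal approximation to the sampling distribution of d, with the same standard-error formula as the figure code at the end of this post. The function names (se_d, p_wrong_sign) are mine, and an even split of subjects across conditions is assumed.

```r
# Convert a correlation to Cohen's d (equal group sizes assumed)
r_to_d <- function(r) 2 * r / sqrt(1 - r^2)

# Approximate standard error of d for a two-group design
se_d <- function(n_total, d = 0) {
  n1 <- ceiling(n_total / 2)
  n2 <- floor(n_total / 2)
  sqrt(((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2 - 2))) *
         ((n1 + n2) / (n1 + n2 - 2)))
}

# Probability that the observed d falls on the opposite side of zero
# from the true effect (normal approximation).
p_wrong_sign <- function(r_true, n_total) {
  d_true <- abs(r_to_d(r_true))
  pnorm(0, mean = d_true, sd = se_d(n_total, d_true))
}

wrong_exp1 <- p_wrong_sign(-.14, 70)  # Experiment 1
wrong_exp2 <- p_wrong_sign(-.14, 53)  # Experiment 2
print(round(c(exp1 = wrong_exp1, exp2 = wrong_exp2), 3))
```

Under these assumptions a wrong-signed estimate would turn up in somewhere between one study in ten and one study in five, so a reversed sign on its own is weak grounds for ruling out Type II error.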

But let's see how much evidence we can squeeze out of this with Bayesian model comparison. We'll take the Anderson et al. (2010) estimates for our alternative hypotheses: r = .21 for aggressive behavior, r = -.14 for empathy. We'll transform everything to Fisher's Z for my convenience. This gives HA1: Z ~ N(.21, .02) for aggressive behavior and HA2: Z ~ N(-.14, .07) for empathy.

Probability distribution of observed effect size given H0: d = 0, H1: d = .43, and a sample size of 53 (a la study 2). Observed effect size is indicated with the vertical line. The probability is comparable across hypotheses; the data support the null, but only slightly. (Sorry this is in terms of d rather than Z -- just the code I had lying around.)

When we use these alternative hypotheses, there isn't much in the way of evidence. We get the following Bayes factors: 1.25:1 odds for the null over HA1 in Experiment 1, 1.39:1 odds for the null over HA1 in Experiment 2. So whatever somebody was willing to bet that there is or isn't an effect of violent games wouldn't change much. The authors have really overstated the strength of their evidence by insisting that Type II error can't explain these results. At most, the r = .21 estimates might be a bit high, but if you had to choose between r = 0 or r = .21 you wouldn't really know which to choose.
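
For a rough sense of where numbers like these come from: a Bayes factor between two point hypotheses is just the ratio of the two densities at the observed effect size. The sketch below does this on the d scale for Experiment 2, taking the observed d = .12 and the standard error from the figure code at the end of this post. It is a simplification (point hypotheses on d, normal approximation) and not the Fisher's Z calculation behind the odds reported above, so it should not be expected to match them exactly.

```r
# Group sizes and standard error of d for N = 53 (Experiment 2),
# following the figure code's formula with the d^2 term set to 0.
n1 <- ceiling(53 / 2)
n2 <- floor(53 / 2)
se <- sqrt(((n1 + n2) / (n1 * n2)) * ((n1 + n2) / (n1 + n2 - 2)))

d_obs <- 0.12  # observed effect size, as drawn in the figure

# Density of the observed effect under each point hypothesis
like_null <- dnorm(d_obs, mean = 0,    sd = se)  # H0: d = 0
like_alt  <- dnorm(d_obs, mean = 0.43, sd = se)  # H1: d = .43

# Bayes factor: odds for the null over the alternative
bf_null_over_alt <- like_null / like_alt
print(bf_null_over_alt)
```

The ratio lands between 1 and 2 in favor of the null, the same territory as the odds above: the data lean toward the null, but not by enough to move anybody's bet very far.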

The empathy results are stronger. The observed negative sign does not necessarily rule out Type II error, but it does make the results less likely given some positive effect. Bayes factors are 4:1 odds for the null over HA2 in Experiment 1, 3.1:1 odds for the null over HA2 in Experiment 2.

My recent paper (free postprint link) catalogs several of these fallacious arguments for the null hypothesis as made by both proponents and skeptics of violent-game effects. It then demonstrates how Bayesian model comparison is really the only way to make these arguments for the null (or the alternative!) in a principled and effective way.

I recognize that it is difficult to collect samples, particularly when studying populations besides college undergraduates at big state universities. All the same, the conclusions have to be in line with the data in hand. Sympathy aside, the results cannot rule out the possibility of Type II error and provide little evidence for the null relative to what meta-analyses report.

I thank Drs. Lishner and Ferguson for supplying comment. Dr. Lishner suggested I better attend to the reported confidence intervals for single-scholar bias and cabal bias. Dr. Ferguson suggested I consider the challenges of collecting samples of teens in the absence of grant funding.

Figure code below.

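# Get standard error of effect (d) for two groups, total N = 53.
# The d^2 term of the variance is set to 0, its value under the null;
# the same SE is reused for the alternative as an approximation.
set.seed(1)  # fix the seed so the simulated densities are reproducible
n1 <- ceiling(53/2)
n2 <- floor(53/2)
se <- sqrt(((n1+n2)/(n1*n2) + 0 / (2*(n1+n2-2))) * ((n1+n2)/(n1+n2-2)))
# Simulate the sampling distribution of d under each hypothesis
null <- rnorm(1e5, 0, sd = se)    # H0: d = 0
alt <- rnorm(1e5, .43, sd = se)   # H1: d = .43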

# Plot p(data | hypothesis)
plot(density(null), xlim = c(-1, 1.5), ylim = c(0, 1.5),
     main = "Hypothesis Comparison",
     ylab = "Probability Density of Data",
     xlab = "Effect size d",
     col = 'darkblue',
     lwd = 2)
lines(density(alt), col = 'darkred', lwd = 2)
# plot observed effect size
abline(v = .12, col = 'darkgreen', lwd = 1.5)
