Monday, June 17, 2019

Comment on Chang & Bushman (2019): Effects of outlier exclusion

Recent research by Chang & Bushman (2019) reports that playing violent video games may make children more likely to play with a real handgun. In this experiment, children participate in the study in pairs. They play one of three versions of Minecraft for 20 minutes. One version has no violence (control), another has monsters that they fight with swords (sword violence), and another has monsters that they fight with guns (gun violence).

The children are then left to play in a room in which, hidden in a drawer, are two very real 9mm handguns. The handguns are disabled -- their firing mechanism has been taken out and replaced with a clicker that counts the number of trigger pulls. But these guns look and feel like the real thing, so one would hope that a child would not touch them or pull their triggers.

The authors report four study outcomes: whether the kid touches the gun, how long they hold the gun, how many times they pull the trigger, and how many times they pull the trigger while the gun is pointed at somebody (themself or the other kid).

I think it's an interesting paradigm. The scenario has a certain plausibility about it, and the outcome is certainly important. It must have been a lot of work to get the ethics board approval.

However, the obtained results depend substantially on the authors' decision to exclude two participants from the control group for playing with the guns a lot. I feel that this is an inappropriate discarding of data. Without this discard, the results are not statistically significant.

Overinterpretation of marginal significance

The results section reports one significant outcome and three nonsignificant ones:
  • "The difference [in handgun touching] across conditions was nonsignificant [...]" (p = .09)
  • "The gun violence condition increased time spent holding a handgun, although the effect was nonsignificant [...]" (p = .080)
  • "Participants in the gun violence condition pulled the trigger more times than participants in other conditions, although the effect was nonsignificant [...]" (p = .097)
  • "Participants in the violent game conditions pulled the trigger at themselves or their partner more than participants in the nonviolent condition." (p = .007)
These nonsignificant differences are overinterpreted in the discussion section, which begins: "In this study, playing a violent video game increased the likelihood that children would touch a real handgun, increased time spent holding a handgun, and increased pulling the trigger at oneself and others." I found this very confusing; I thought I had read the wrong results section. One has to dig into Supplement 2 to see the exact p values.

Exclusion of outliers

The distribution of the data is both zero-inflated and strongly right-skewed. About half of the kids did not touch the gun at all, much less pull its trigger. The minority of kids who did pull the trigger tended to pull it many times. This is a noisy outcome, and difficult to model: you would need a zero-inflated negative binomial regression with cluster-adjusted variances. The authors present a negative binomial regression with cluster-adjusted variances, ignoring the zero-inflation, which is fine enough by me since I can't figure out how to do all that at once either.
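To make the modeling challenge concrete, here is a minimal sketch of the zero-inflated negative binomial distribution such an outcome would call for: extra probability mass at zero stacked on top of an overdispersed count distribution. The parameters below are invented for illustration, not the study's estimates.

```python
import math

def nb_pmf(k, mu, theta):
    """Negative binomial pmf with mean mu and dispersion theta."""
    log_p = (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
             + theta * math.log(theta / (theta + mu))
             + k * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, theta, pi):
    """Zero-inflated NB: extra mass pi at zero, (1 - pi) times NB elsewhere."""
    base = (1 - pi) * nb_pmf(k, mu, theta)
    return pi + base if k == 0 else base

# Hypothetical parameters: half the kids never pull the trigger (pi = 0.5),
# and the rest average mu = 8 pulls with heavy overdispersion (theta = 0.8).
mu, theta, pi = 8.0, 0.8, 0.5
print(zinb_pmf(0, mu, theta, pi), nb_pmf(0, mu, theta))
```

The zero-inflation term is what the authors' plain negative binomial model leaves out: it forces the model to absorb the excess zeros into the dispersion parameter instead.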

[Figure: Self-other trigger pulls outcome by condition. The pair in red were excluded because the coders commented that they were acting unusually wild; the pair in green were excluded for having too high a score on the outcomes.]

Noisy data affords many opportunities for subjectivity. The authors report: "We eliminated 1 pair who was more than 5 SDs from the mean for both time spent holding a handgun and trigger pulls [green pair].  The coders also recommended eliminating another pair because of unusual and extremely aggressive behavior [red pair]." The CONSORT flow diagram reveals that these four excluded subjects with very high scores on the dependent variables were all from the nonviolent control condition, in which participants were expected to spend the least time holding the gun and pulling its trigger. 

The authors tell me that the decision to eliminate the pair with unusual and extremely aggressive behavior was made on the coders' recommendation, blind to condition. That may be true, but the registration is generally rather vague and says nothing about excluding participants on coder recommendation.

The authors also tell me that the pair eliminated because of high scores were eliminated without looking at the results. That may be true as well, but I feel as though one could predict how this exclusion might affect the results.

This latter exclusion of the high-scoring pair is not acceptable to me. You can consider this decision in two ways: First, you can see that there are scores still more extreme in the other two conditions. With data this zero-inflated and skewed, it is no great feat to be more than 5 SDs from the mean. Second, you can look at the model diagnostics. The excluded outliers are not "outliers" in any model influence sense -- their Cook's distances are less than 0.2. (Thresholds of 0.5 or 1.0 are often suggested for Cook's distance.)
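A toy calculation shows why a 5-SD cutoff is so easy to exceed in zero-inflated count data. The counts below are invented for illustration, not the study's data:

```python
import math

# Hypothetical trigger-pull counts shaped like the study's outcome:
# most kids score zero, a few score low, and one kid pulls the trigger
# 20 times -- plausible behavior, not a recording error.
counts = [0] * 50 + [1] * 10 + [2] * 5 + [20]

n = len(counts)
mean = sum(counts) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in counts) / (n - 1))
z_max = (max(counts) - mean) / sd  # well past 5 SDs from the mean
print(round(mean, 2), round(sd, 2), round(z_max, 2))
```

Because the mass of zeros drags the mean and SD down, any kid in the tail lands many SDs out, so "more than 5 SDs from the mean" is a near-automatic property of the tail rather than evidence of anomaly.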

[Figure: nonzero outcome values plotted in log space, the scale on which the negative binomial model is fit.] On the log scale, the discarded data points still do not look at all like outliers to me.

Revised results

If the high-scoring pair is retained for analysis, none of the results are statistically significant:
  • Touching the gun: omnibus F(2, 79.5) = 1.04, p = .359; gun-vs-control contrast p = .148.
  • Time holding gun: omnibus F(2, 79.5) = .688, p = .506; gun-vs-control contrast p = .278.
  • Trigger pulls at self or other: omnibus F(2, 79.4) = 1.80, p = .172; gun-vs-control contrast p = .098.
From here, adding the coder-suggested pair to the analysis moves the results further still from statistical significance.

If you're worried about the influence of the zero inflation and the long tail, a simpler way to look at the data might be to ask "is the trigger pulled at all while the gun is pointed at somebody?" After all, the difference between not being shot and being shot once is a big deal; the difference between being shot four times and being shot five times less so. Think of this as winsorizing all the values in the tail to 1. Then you could just fit a logistic regression and not have to worry about influence.
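With a binary outcome and a binary predictor, the logistic fit even has a closed form: the slope is just the log odds ratio from the 2x2 table. The sketch below uses hypothetical cell counts (and, unlike the authors' model, ignores the pair-level clustering):

```python
import math

def dichotomize(counts):
    """Winsorize the whole tail to 1: did the kid pull the trigger at all?"""
    return [1 if c > 0 else 0 for c in counts]

def log_odds_ratio(events_a, n_a, events_b, n_b):
    """Closed-form logistic slope for one binary predictor (group B vs A),
    with its Wald standard error from the 2x2 cell counts."""
    a, b = events_a, n_a - events_a  # group A: events, non-events
    c, d = events_b, n_b - events_b  # group B: events, non-events
    beta = math.log(c / d) - math.log(a / b)       # log odds ratio
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Wald standard error
    return beta, se

# Invented raw counts for a few kids: zero-inflated with a long tail.
print(dichotomize([0, 0, 3, 7, 0, 1]))  # -> [0, 0, 1, 1, 0, 1]

# Hypothetical 2x2: 6 of 80 control kids vs 13 of 80 gun-condition kids.
beta, se = log_odds_ratio(6, 80, 13, 80)
wald_z = beta / se
print(round(beta, 3), round(se, 3), round(wald_z, 3))
```

Because every tail value is collapsed to 1, no single wild pair can dominate the fit, which is the point of this robustness check.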

Analyzed this way, there are 6 events in the control group, 10 in the sword-game group, and 13 in the gun-game group. The authors excluded four of these six control-group events as outliers. With these exclusions, there is a statistically significant effect, p = .029. If you return either pair to the control group, the effect is not statistically significant, p = .098. If you return both pairs to the control group, the effect is not statistically significant, p = .221.
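In the same spirit, a rough sensitivity check (not the authors' clustered logistic model) is to tabulate events by condition and apply a Pearson chi-square test. The group sizes below are hypothetical (80 kids per condition before exclusions); only the event counts come from the article:

```python
import math

def chisq_2x3(groups):
    """Pearson chi-square for a 2 x 3 table given (events, n) per group.
    With df = 2, the p-value has the closed form exp(-chi2 / 2)."""
    total_events = sum(e for e, n in groups)
    total_n = sum(n for e, n in groups)
    chi2 = 0.0
    for e, n in groups:
        exp_e = n * total_events / total_n               # expected events
        exp_ne = n * (total_n - total_events) / total_n  # expected non-events
        chi2 += (e - exp_e) ** 2 / exp_e + ((n - e) - exp_ne) ** 2 / exp_ne
    return chi2, math.exp(-chi2 / 2)

# (events, n) for control, sword-game, gun-game conditions.
full = [(6, 80), (10, 80), (13, 80)]      # all control-group events retained
excluded = [(2, 76), (10, 80), (13, 80)]  # four control kids dropped

chi2_f, p_f = chisq_2x3(full)
chi2_e, p_e = chisq_2x3(excluded)
print(round(p_f, 3), round(p_e, 3))
```

Under these made-up group sizes, the exclusion decides whether the test crosses p < .05, which mirrors the sensitivity reported above.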

I wish the authors and peer reviewers had considered the sensitivity of the results to the questionable exclusion of this pair. While these results are suggestive, they are much less decisive than the authors have presented them.

Journal response

I attempted to send JAMA Open a version of this comment, but their publication portal does not accept comment submissions. I asked to speak with an editor; the editor declined to discuss the article with me. The journal's stance is that, as an online-only journal, they don't consider letters to the editor. They invited me to post a comment in their YouTube-style comments field, which appears on a separate tab where it will likely go unread.

I am disturbed by the ease with which peer reviewers would accept ad hoc outlier exclusion and frustrated that the article and press release do little to present the uncertainty. It seems like one could get up to a lot of mischief at JAMA Open by excluding hypothesis-threatening datapoints.

Author response

I discussed these criticisms at length with the authors as I prepared my concerns for JAMA Open and for this blog post. Dr. Bushman replied:

We believe that [the coder-suggested pair] was removed completely legitimately, although you are correct this was not documented ahead of time on the clinicaltrials.gov site. We believe [the high-scoring pair] should also have been excluded, but you do not. We acknowledge there may be honest differences of opinion regarding [the high-scoring pair]. 
As stated in our comment on JAMA Open, “Importantly, both pairs were eliminated before we knew how they would impact our analyses and whether their results would support our hypotheses.”
Again, I disagree that the removal of the high-scoring pair is a mere difference of opinion. I don't see any justifiable criterion for throwing this data away, and one can readily anticipate how this removal would influence the analyses and results.


I was able to reproduce the results presented by Chang and Bushman (2019). However, those results seem to depend heavily on the exclusion of four of the six most aggressive participants in the nonviolent control group. The justification that these four participants are unusually aggressive does not seem tenable, given the low influence of these datapoints and the similarly aggressive participants retained in the other two conditions.

While I admire the researchers for their passion and their creative setup, I am also frustrated. I believe that researchers have an obligation to quantify uncertainty to the best of their ability. I feel that the exclusion of high-scoring participants from the control group serves to understate the uncertainty and facilitate the anticipated headlines. The sensitivity of the results to this questionable exclusion should be made clearer.

See my code at https://osf.io/8jgrp/. Analyses reproduced in R using MASS::glm.nb for negative binomial regression with log link and clubSandwich for cluster-robust variance estimation. Data available upon request from the authors. Thanks to James Pustejovsky for making clubSandwich. Thanks to Jeff Rouder for talking with me about all this when I needed to know I wasn't taking crazy pills.


  1. Nicely written, Joe. I am not technically educated and was still able to get the gist of it. Your last paragraph was restrained yet clear.

    One thing that strikes me is the effect of what the children were taught about firearms before the test. If you were to run an experiment with 8-12 year olds about crossing the street or dealing with snakes, what they had been taught about those things would have profound effects on the results. Firearms are ubiquitous in society, at least in the media, so I would expect it common that parents would have given the children at least some instruction as to how to handle any weapons they may find. I didn't read the research article and don't know if that was accounted for, but maybe it should have been.

    1. Thanks for reading, Anonymous, and thanks for your kind words. I think these are valid questions that could help to contextualize the results. The authors do report that children from households that did own a gun were less likely to play with the study's guns. Maybe that speaks a little to your perspective.

      However, I don't think there's much point in contextualizing the results until the results are accurate. The phenomenon the authors report seems to be attributable to their data cleaning process, not randomization to condition. I don't see much value in discussing what the effect *means* when the data seems to indicate that *there is no effect*.

  2. Forgive me, Joe, but what does "randomization to condition" mean?

    You are right of course. What good is trying to figure why something is out there when there is nothing out there?

    From a layman's point of view, the problems with this study seem common in psychological research. It won't change easily, if at all, will it?