Crystal Prison Zone: 2019

It's been a rich week of readings for wondering just what the hell we're doing. Loyka et al. (2019) present a framework for considering external validity, and this framework reminds us just how poorly we are doing at considering actual real-world human behavior. Tal Yarkoni has a preprint up that describes how implausible it is that the situations and stimuli we study will generalize to other situations and stimuli. Danielle Navarro has clarified her stance on preregistration by elaborating on how misguided she perceives hypothesis testing to be. Together, these articles remind us of the importance of studying the thing we actually care about rather than what's convenient. Chances are that our findings won't generalize as simply as we expect, and a significant p-value at best tells us that the null is wrong, not that the alternative is correct.

These readings reminded me of some thoughts I'd jotted down following APA 2019. I'd been invited to present some of my research on violent video games. While I had a great session and had a lot of fun talking to a receptive audience about issues like measurement validity and publication bias, the overall APA experience was personally challenging. This is because one of the major themes of APA 2019 was gun violence and what the APA can do about it.

I attended a number of interesting sessions with presenters who studied actual violence by working and serving in communities, doing ride-alongs with police, interviewing people who had suffered violence and had perpetrated violence. This was draining in two ways.

First, there's a lot of human suffering out there. One presenter had found that many felons serving prison sentences for gun violence had themselves been victims of gun violence, often as early as age 14. He further found that, when people knew who shot them, they were less likely to tell the police. They trust the police so little that they would prefer to settle the score themselves, and the police are just somebody you can dump your cold cases on as one last Hail Mary. A mother from Newtown was there. Both of her children had been shot in the massacre. One died. She described crying until the capillaries burst in both her eyes. One gets the feeling that tragedy cannot be prevented and that many people are doomed to poverty and violence from the moment they're born.

Second, it made me frustrated with how far removed we are from the actual societal problem we want to study. We want to prevent gang violence, child abuse, intimate partner violence, bullying, aggressive driving, and harassment. Instead of studying the community members of South- and West-Side Chicago, we study college undergraduates, a bunch of nerds who would rather read a book than fight somebody and generally have enough money and safety to be able to do just that. Instead of studying shootings or fights or abuse, we study how much hot sauce these undergrads pour for each other or whether they think a rude RA should be able to keep their job. We even use proxies of proxies -- when it's too much trouble to see how much hot sauce they'll pour for somebody, we give them KI__ and watch whether they fill it in as KILL or KISS.

One of the APA speakers closed by reference to the old joke about the drunkard looking for his keys. The drunk is looking for his keys under the streetlight. A friend joins him and helps him look for a while, with no progress. Eventually the friend, exasperated, says "Let's try something different. Where did you last see your keys?" The drunk says "I dropped my keys over there in the bushes." The bewildered friend asks "Well then, why are we searching over here by the streetlight?" To which the drunk replies "Well, the light's good over here, and I'm afraid of the dark."

The light's good over here playing parlor tricks with college undergraduates and hot sauce. And it's certainly less scary than trying to get out in the rough parts of Chicago!

It's possible that I'm not well read and that there's a lot of great aggression research going on that studies these real problems. But mostly I see us running little experiments with just-significant results, or running survey designs that tell us something obvious and hopelessly confounded. Interviews and ethnography and field work seem to be for sociologists or criminologists, not psychologists.

What am I doing about it? Not much. For now, I'm doing my part by trying to test the convergent validity of our lab measures and see whether they actually agree with each other (preliminary answer: they don't). I often worry about my career, because I've never "discovered" some effect. You could do a decent job summarizing my last ten years as digging a deeper and deeper hole in what we think we already know, hoping to find some sort of bedrock that we can build from. So far, I'm still shoveling, assessing publication bias, failing to replicate findings, criticizing too-good-to-be-true results, and trying to figure out if our measures are at all valid and reliable.

I like the work that I do, and I think it's the best work I can do given my skills and resources and timeframe. But that work could be much more valuable if I could get out into the actual populations and environments that we're worried about. I had an RA with a connection at a maximum-security prison, but I wasn't able to pursue the lead aggressively enough and it slipped through my fingers. I'm not particularly smooth or adventurous, so I'm not enthusiastic about going into communities to understand gun violence. I'm pre-tenure, so what makes the most sense for me career-wise is to stick to doing more of the same research with college undergrads and MTurk workers. Maybe try to find some sort of eyebrow-raising lab effect that I can wildly extrapolate from.

I'm not sure what to recommend. As a field, we probably need to recalibrate our expectations; we can't expect a scientist to make three or four noteworthy, generalizable discoveries a year. Getting actionable and generalizable psychological findings will probably require orders of magnitude more effort and investment. We can prepare psychological science for that investment by trying to improve the transparency and honesty of the research process.

I'm gonna try to read more sociology and criminology. Maybe they know something we don't?

Recent research by Chang & Bushman (2019) reports how video games may cause children to be more likely to play with a real handgun. In this experiment, children participate in the study in pairs. They play one of three versions of Minecraft for 20 minutes. One version has no violence (control), another has monsters that they fight with swords (sword violence), and another has monsters that they fight with guns (gun violence).

The children are then left to play in a room in which, hidden in a drawer, are two very real 9mm handguns. The handguns are disabled -- their firing mechanism has been taken out and replaced with a clicker that counts the number of trigger pulls. But these guns look and feel like the real thing, so one would hope that a child would not touch them or pull their triggers.

The authors report four study outcomes: whether the kid touches the gun, how long they hold the gun, how many times they pull the trigger, and how many times they pull the trigger while the gun is pointed at somebody (themself or the other kid).

I think it's an interesting paradigm. The scenario has a certain plausibility about it, and the outcome is certainly important. It must have been a lot of work to get the ethics board approval.

However, the obtained results depend substantially on the authors' decision to exclude two participants from the control group for playing with the guns a lot. I feel that this is an inappropriate discarding of data. Without this discard, the results are not statistically significant.

Overinterpretation of marginal significance

The results section reports one significant and three marginally significant outcomes:

"The difference [in handgun touching] across conditions was nonsignificant [...]" (p = .09)
"The gun violence condition increased time spent holding a handgun, although the effect was nonsignificant [...]" (p = .080)
"Participants in the gun violence condition pulled the trigger more times than participants in other conditions, although the effect was nonsignificant [...]" (p = .097)
"Participants in the violent game conditions pulled the trigger at themselves or their partner more than participants in the nonviolent condition." (p = .007)

These nonsignificant differences are overinterpreted in the discussion section, which begins: "In this study, playing a violent video game increased the likelihood that children would touch a real handgun, increased time spent holding a handgun, and increased pulling the trigger at oneself and others." I found this very confusing; I thought I had read the wrong results section. One has to dig into Supplement 2 to see the exact p values.

Exclusion of outliers

The distribution of the data is both zero-inflated and heavily right-skewed. About half of the kids did not touch the gun at all, much less pull its trigger. The minority of kids who did pull the trigger pulled it many times. This is a noisy outcome, and difficult to model: you would need a zero-inflated negative binomial regression with cluster-adjusted variances. The authors present a negative binomial regression with cluster-adjusted variances, ignoring the zero-inflation, which is fine enough by me since I can't figure out how to do all that at once either.
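To make the difference between a plain and a zero-inflated negative binomial concrete, here is a minimal sketch of the two probability mass functions in Python. It uses the mean/dispersion parameterization (mean mu, variance mu + mu^2/theta), which is the one MASS::glm.nb reports. The parameter values are hypothetical, picked only to show the shape; they are not estimates from the study, and this is not the authors' model or mine.

```python
import math

def nb_pmf(k, mu, theta):
    """Negative binomial pmf with mean mu and dispersion theta
    (variance = mu + mu**2 / theta)."""
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)

def zinb_pmf(k, mu, theta, pi_zero):
    """Zero-inflated NB: with probability pi_zero the count is a
    structural zero; otherwise it is drawn from the NB above."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, theta)
    return pi_zero + base if k == 0 else base

# Hypothetical values, for illustration only (not fitted to any data).
MU, THETA, PI_ZERO = 8.0, 0.5, 0.4

p_zero_nb = nb_pmf(0, MU, THETA)
p_zero_zinb = zinb_pmf(0, MU, THETA, PI_ZERO)
total = sum(zinb_pmf(k, MU, THETA, PI_ZERO) for k in range(5000))

print(f"P(0) under NB:   {p_zero_nb:.3f}")
print(f"P(0) under ZINB: {p_zero_zinb:.3f}")
print(f"ZINB pmf sums to {total:.6f}")
```

A small theta already puts a lot of mass at zero, which is part of why a plain negative binomial can be a tolerable approximation when a separate zero-inflation term is hard to fit.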

Self-other trigger pulls outcome. The pair in red were excluded because the coders commented that they were acting unusually wild. The pair in green were excluded for having too high a score on the outcomes.

Noisy data affords many opportunities for subjectivity. The authors report: "We eliminated 1 pair who was more than 5 SDs from the mean for both time spent holding a handgun and trigger pulls [green pair]. The coders also recommended eliminating another pair because of unusual and extremely aggressive behavior [red pair]." The CONSORT flow diagram reveals that these four excluded subjects with very high scores on the dependent variables were all from the nonviolent control condition, in which participants were expected to spend the least time holding the gun and pulling its trigger.

The authors tell me that the decision to eliminate the pair with unusual and extremely aggressive behavior was made on the coders' recommendation, blind to condition. That may be true, but the registration is generally rather vague and says nothing about excluding participants on coder recommendation.

The authors also tell me that the pair with high scores was eliminated without looking at the results. That may be true as well, but I feel as though one could predict how this exclusion might affect the results.

This latter exclusion of the high-scoring pair is not acceptable to me. You can consider this decision in two ways: First, you can see that there are scores still more extreme in the other two conditions. With data this zero-inflated and skewed, it is no great feat to be more than 5 SDs from the mean. Second, you can look at the model diagnostics. The excluded outliers are not "outliers" in any model influence sense -- their Cook's distances are less than 0.2. (Thresholds of 0.5 or 1.0 are often suggested for Cook's distance.)
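The first point, that a 5-SD rule is easy to trip with data like these, can be illustrated with a small simulation. This is only a sketch under loud assumptions: the counts come from a zero-inflated, rounded-up lognormal (a convenient stand-in, not the negative binomial used in the paper), and the sample size, zero probability, and lognormal parameters are all hypothetical. It says nothing about Cook's distance, which needs the fitted model.

```python
import math
import random
import statistics

def max_z_score(values):
    """Largest distance from the sample mean, in sample-SD units."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return max(abs(v - mean) for v in values) / sd

def simulate_counts(n, p_zero, mu_log, sigma_log, rng):
    """Zero with probability p_zero, otherwise a rounded-up lognormal
    draw. A stand-in for zero-inflated, right-skewed count data."""
    counts = []
    for _ in range(n):
        if rng.random() < p_zero:
            counts.append(0)
        else:
            counts.append(math.ceil(rng.lognormvariate(mu_log, sigma_log)))
    return counts

def share_with_5sd_point(n_sims=2000, n=80, seed=1):
    """Share of simulated samples containing a point more than 5 SDs
    from the mean. All settings are hypothetical."""
    rng = random.Random(seed)
    hits = sum(
        max_z_score(simulate_counts(n, 0.5, 1.5, 1.2, rng)) > 5
        for _ in range(n_sims)
    )
    return hits / n_sims

# Deterministic toy case: fifty zeros and one busy trigger finger.
one_extreme = [0] * 50 + [100]
print(round(max_z_score(one_extreme), 2))  # about 7 SDs from the mean

print(share_with_5sd_point())
```

The toy case shows the mechanism: when most values are zero, the mean and SD are both small, so one active participant lands far out in SD units without being unusual for the process that generated the data.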

Here are the nonzero values in log space, which is where the model fits the negative binomial. On a log scale, the discarded data points still do not look at all like outliers to me.

Revised results

If the high-scoring pair is retained for analysis, none of the results are statistically significant:

Touching the gun: omnibus F(2, 79.5) = 1.04, p = .359; gun-vs-control contrast p = .148.
Time holding gun: omnibus F(2, 79.5) = .688, p = .506; gun-vs-control contrast p = .278.
Trigger pulls at self or other: omnibus F(2, 79.4) = 1.80, p = .172; gun-vs-control contrast p = .098.

From here, adding the coder-suggested pair to the analysis moves the results further still from statistical significance.

If you're worried about the influence of the zero inflation and the long tail, a simpler way to look at the data might be to ask "is the trigger pulled at all while the gun is pointed at somebody?" After all, the difference between not being shot and being shot once is a big deal; the difference between being shot four times and being shot five times less so. Think of this as winsorizing all the values in the tail to 1. Then you could just fit a logistic regression and not have to worry about influence.
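As a rough sketch of that recoding, the Python below collapses counts to "any pull vs. none" and compares two groups with a log odds ratio. For a single group dummy, this log odds ratio and its Wald standard error match what an ordinary logistic regression would report. Two loud caveats: the counts are made up for illustration and are not the study data, and this version ignores the clustering of children within pairs, which the real analysis has to handle.

```python
import math

def binarize(counts):
    """Collapse counts to 1 (any trigger pull) or 0 (none)."""
    return [1 if c > 0 else 0 for c in counts]

def log_odds_ratio(events_a, n_a, events_b, n_b):
    """Log odds ratio of group A vs. group B and its Wald SE.
    No adjustment for clustering within pairs."""
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    if min(a, b, c, d) == 0:
        raise ValueError("a zero cell makes the odds ratio undefined")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

# Hypothetical trigger-pull counts, not the study data.
control = [0, 0, 0, 3, 0, 0, 12, 0, 0, 0]
gun_game = [0, 5, 0, 1, 0, 40, 0, 2, 0, 0]

control_bin = binarize(control)
gun_bin = binarize(gun_game)

log_or, se = log_odds_ratio(
    sum(gun_bin), len(gun_bin), sum(control_bin), len(control_bin)
)
z = log_or / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # normal approximation

print(f"log OR = {log_or:.3f}, SE = {se:.3f}, p = {p_two_sided:.3f}")
```

Note that the child with 40 pulls and the child with 1 pull count the same after recoding, which is exactly why no single participant can dominate the estimate.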

Analyzed this way, there are 6 events in the control group, 10 in the sword-game group, and 13 in the gun-game group. The authors excluded four of these six control-group events as outliers. With these exclusions, there is a statistically significant effect, p = .029. If you return either pair to the control group, the effect is not statistically significant, p = .098. If you return both pairs to the control group, the effect is not statistically significant, p = .221.
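The bookkeeping behind a sensitivity check like this is simple enough to sketch. The Python below tabulates the binary outcome under each exclusion scenario, so that the same model can then be refit to each table. Everything here is hypothetical: the pair IDs, the records, and the scenario names are invented for illustration, and the numbers are not the study data.

```python
from collections import Counter

# Hypothetical records: (pair_id, condition, trigger pulls at self/partner).
RECORDS = [
    ("p01", "control", 0), ("p01", "control", 0),
    ("p02", "control", 9), ("p02", "control", 14),   # stand-in flagged pair
    ("p03", "control", 21), ("p03", "control", 30),  # second flagged pair
    ("p04", "control", 0), ("p04", "control", 2),
    ("p05", "gun", 0), ("p05", "gun", 6),
    ("p06", "gun", 25), ("p06", "gun", 0),
    ("p07", "gun", 0), ("p07", "gun", 0),
    ("p08", "gun", 3), ("p08", "gun", 1),
]

# Each scenario names the pairs that are dropped before analysis.
SCENARIOS = {
    "exclude both flagged pairs": {"p02", "p03"},
    "return p02 only": {"p03"},
    "return p03 only": {"p02"},
    "return both pairs": set(),
}

def event_table(records, excluded_pairs):
    """Per condition: (children with at least one pull, children kept)."""
    events, totals = Counter(), Counter()
    for pair_id, condition, pulls in records:
        if pair_id in excluded_pairs:
            continue
        totals[condition] += 1
        events[condition] += int(pulls > 0)
    return {cond: (events[cond], totals[cond]) for cond in sorted(totals)}

for name, excluded in SCENARIOS.items():
    print(name, event_table(RECORDS, excluded))
```

Reporting the result under every scenario, rather than only the one chosen for the paper, lets readers see how much the conclusion leans on the exclusion decision.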

I wish the authors and peer reviewers had considered the sensitivity of the results to the questionable exclusion of this pair. While these results are suggestive, they are much less decisive than the authors have presented them.

Journal response

I attempted to send JAMA Open a version of this comment, but their publication portal does not accept comment submissions. I asked to speak with an editor; the editor declined to discuss the article with me. The journal's stance is that, as an online-only journal, they don't consider letters to the editor. They invited me to post a comment in their YouTube-style comments field, which appears on a separate tab where it will likely go unread.

I am disturbed by the ease with which peer reviewers would accept ad hoc outlier exclusion and frustrated that the article and press release do little to present the uncertainty. It seems like one could get up to a lot of mischief at JAMA Open by excluding hypothesis-threatening datapoints.

Author response

I discussed these criticisms intensely with the authors as I prepared my concerns for JAMA Open and for this blog post. Dr. Bushman replied:

We believe that [the coder-suggested pair] was removed completely legitimately, although you are correct this was not documented ahead of time on the clinicaltrials.gov site. We believe [the high-scoring pair] should also have been excluded, but you do not. We acknowledge there may be honest differences of opinion regarding [the high-scoring pair].

As stated in our comment on JAMA Open, “Importantly, both pairs were eliminated before we knew how they would impact our analyses and whether their results would support our hypotheses.”

Again, I disagree with the characterization of the removal of the high-scoring pair as a subjective decision. I don't see any justifiable criterion for throwing this data away, and one can anticipate how this removal would influence the analyses and results.

Conclusion

I was able to reproduce the results presented by Chang and Bushman (2019). However, those results seem to depend heavily on the exclusion of four of the six most aggressive participants in the nonviolent control group. The justification that these four participants are unusually aggressive does not seem tenable in light of the low influence of these datapoints and the similarly aggressive participants retained in the other two conditions.

While I admire the researchers for their passion and their creative setup, I am also frustrated. I believe that researchers have an obligation to quantify uncertainty to the best of their ability. I feel that the exclusion of high-scoring participants from the control group serves to understate the uncertainty and facilitate the anticipated headlines. The sensitivity of the results to this questionable exclusion should be made clearer.

Code
See my code at https://osf.io/8jgrp/. Analyses reproduced in R using MASS::glm.nb for negative binomial regression with log link and clubSandwich for cluster-robust variance estimation. Data available upon request from the authors. Thanks to James Pustejovsky for making clubSandwich. Thanks to Jeff Rouder for talking with me about all this when I needed to know I wasn't taking crazy pills.

Saturday, November 23, 2019

Weighing bullets, not hot sauce

Monday, June 17, 2019

Comment on Chang & Bushman (2019): Effects of outlier exclusion