Monday, March 28, 2016

Asking for advice re: causal inference in SEM

I'm repeatedly running into an issue in causal interpretation of SEM models. I'm not sure what to make of it, so I want to ask everybody what they think.

Suppose one knows A and B to be highly correlated in the world, but one doesn't know whether there is causality between them.

In an experiment, one stages an intervention. Manipulation X causes a difference in levels of A between the control and treatment groups.

Here's the tricky part. Suppose one analyses the data gleaned from this experiment using SEM. One makes an SEM with paths X -> A -> B. Each path is statistically significant. This is presented as a causal model indicating that manipulation X causes changes in A, which in turn cause changes in B. 

Paths X->A and A->B are significant, but X->B is not. Is a causal model warranted?

However, if one fits the linear models A = b1×X and B = b2×X, one finds that b1 is statistically significant, but b2 is not. (Note that I am not referring to the direct effect of X on B after controlling for A. Rather, the "raw" effect of X on B is not statistically significant.)

This causes my colleagues and me to wonder: Does the SEM support the argument that, by manipulating X, one can induce changes in A, causing downstream changes in B? Or does the manipulation merely inject new variance into A that is unrelated to B, with the SEM fitting only because of the preexisting large correlation between A and B?
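To make the second possibility concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the unmeasured common cause U, the effect sizes, and the sample size are all hypothetical, and simple regressions stand in for the SEM paths. In this simulated world A has no causal effect on B at all, yet the X -> A and A -> B paths both come out strongly significant while the raw X -> B effect hovers around zero.

```python
import numpy as np

def simulate(n=2000, seed=0):
    """Hypothetical world: U causes both A and B; X shifts A only."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n).astype(float)      # randomized manipulation (0/1)
    u = rng.normal(size=n)                            # unmeasured common cause (assumed)
    a = u + 1.0 * x + rng.normal(scale=0.5, size=n)   # X injects new variance into A
    b = u + rng.normal(scale=0.5, size=n)             # B depends on U, never on A
    return x, a, b

def slope_and_t(pred, out):
    """OLS slope of `out` on `pred` (with intercept) and its t statistic."""
    pred_c = pred - pred.mean()
    out_c = out - out.mean()
    slope = (pred_c @ out_c) / (pred_c @ pred_c)
    resid = out_c - slope * pred_c
    se = np.sqrt(resid @ resid / (len(out) - 2) / (pred_c @ pred_c))
    return slope, slope / se

x, a, b = simulate()
for label, pred, out in [("X -> A", x, a), ("A -> B", a, b), ("X -> B", x, b)]:
    slope, t = slope_and_t(pred, out)
    print(f"{label}: slope = {slope:.2f}, t = {t:.1f}")
```

The sketch does not show that any particular SEM is wrong. It only shows that significant X -> A and A -> B paths, combined with a null raw effect of X on B, are what one would expect to see when the A-B correlation comes entirely from a common cause.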

Can you refer me to any literature on this issue? What are your thoughts?

Thanks for any help you can give, readers.

Tuesday, March 22, 2016

Results-blinded Peer Review

The value of any experiment rests on the validity of its measurements and manipulations. If the manipulation doesn't have the intended effect, or the measurements are just noise, then the experiment's results will be uninformative.

This holds whether the results are statistically significant or not. A nonsignificant result, obviously, could be the consequence of an ineffective manipulation or a noisy outcome variable. But when the method is invalid, even a significant result is uninformative: it is either a Type I error or a reflection of bias in the measurement.

The problem I have is that the reader's (or at least the reviewer's) perception of a method's validity often hinges on the results obtained. Where a significant result might have been hailed as a successful conceptual replication, a nonsignificant result might be dismissed as a departure from appropriate methodology.

It makes me consider this puckish lesson from Archibald Cochrane, as quoted and summarized on Ben Goldacre's blog:
The results at that stage showed a slight numerical advantage for those who had been treated at home. I rather wickedly compiled two reports: one reversing the number of deaths on the two sides of the trial. As we were going into the committee, in the anteroom, I showed some cardiologists the results. They were vociferous in their abuse: “Archie,” they said “we always thought you were unethical. You must stop this trial at once.”
I let them have their say for some time, then apologized and gave them the true results, challenging them to say as vehemently, that coronary care units should be stopped immediately. There was dead silence and I felt rather sick because they were, after all, my medical colleagues.
Perhaps, just once in a while, such a results-blinded manuscript should be submitted to a journal. Once Reviewers 1, 2, and 3 have all had their say about the ingenuity of the method, the precision of the measurements, and the adequacy of the sample size, the true results could be revealed, and one could see how firmly the reviewers hold to their earlier arguments.

Thankfully, the increasing prevalence of Registered Reports may forestall the need for any such underhanded prank. Still, it is fun to think about.

Saturday, March 19, 2016

I Was Wrong!

Yesterday, ResearchGate suggested that I read a new article reporting that ego depletion can cause aggressive behavior. This was a surprise to me because word has it that ego depletion does not exist, so surely it cannot be a cause of aggressive behavior.

The paper in question looks about like you'd expect: an unusual measure of aggression, a complicated 3 (within) × 2 (between) × 2 (between) design, a covariate tossed into the mix just for kicks, a heap of measures collected and mentioned in a footnote but not otherwise analyzed. It didn't exactly change my mind about ego depletion, much less its role in aggressive behavior.

But it'd be hypocritical of me to criticize this ill-timed paper without mentioning the time I reported an ego-depletion effect through effect-seeking, exploratory analysis. I've also been meaning to change my blogging regimen up a bit. It's time I switched from withering criticism to withering self-criticism.

The paper is Engelhardt, Hilgard, and Bartholow (2015), "Acute exposure to difficult (but not violent) video games dysregulates cognitive control." In this study, we collected a hearty sample (N = 238) and had them play one of four modified versions of a first-person shooter game, a 2 (Violence: low, high) × 2 (Difficulty: low, high) between-subjects design.

To manipulate violence, I modified the game's graphics. The violent version had demons and gore and arms bouncing across the floor, whereas the less violent version had silly-looking aliens being warped home. We also manipulated difficulty: Some participants played a normal version of the game in which monsters fought back, while other participants played a dumb-as-rocks version where the monsters walked slowly towards them and waited patiently to be shot.

After the game, participants performed a Spatial Stroop task. We measured the magnitude of the compatibility effect, figuring that larger compatibility effects would imply poorer control. We also threw in some no-go trials, on which participants were supposed to withhold a response.

Our hypothesis was that playing a difficult game would lead to ego depletion, causing poorer performance on the Spatial Stroop. This might have been an interesting refinement on the claim that violent video games teach their players poorer self-control.

We looked at Stroop compatibility and found nothing. We looked at the no-go trials and found nothing. Effects of neither violence nor of difficulty. So what did we do?

We needed some kind of effect to publish, so we reported an exploratory analysis, finding a moderated-mediation model that sounded plausible enough.

We figured that maybe the difficult game was still too easy. Maybe participants who were more experienced with video games would find the game to be easy and so would not have experienced ego depletion. So we split the data again according to how much video game experience our participants had, figuring that maybe the effect would be there in the subgroup of inexperienced participants playing a difficult game.

The conditional indirect effect of game difficulty on Stroop compatibility as moderated by previous game experience wasn't even, strictly speaking, statistically significant: p = .0502. And as you can see from our Figure 1, the moderator is very lopsided: only 25 people out of the sample of 238 met the post-hoc definition of "experienced player."

And the no-go trials on the Stroop? Those were dropped from analysis: our footnote 1 says our manipulations failed to influence behavior on those trials, so we didn't bother talking about them in the text.

So to sum it all up, we ran a study, and the study told us nothing was going on. We shook the data a bit more until something slightly more newsworthy fell out of it. We dropped one outcome and presented a fancy PROCESS model of the other. (I remember at some point in the peer review process being scolded for finding nothing more interesting than ego depletion, which was accepted fact and old news!)

To our credit, we explicitly reported the exploratory analyses as being exploratory, and we reported p = .0502 instead of rounding it down to "statistically significant, p = .05." But at the same time, it's embarrassing that we structured the whole paper to be about the exploratory analysis, rather than the null results. 

In the end, I'm grateful that the RRR has set the record straight on ego depletion. It means our paper probably won't get cited much except as a methodological or rhetorical example, but it also means that our paper isn't going to clutter up the literature and confuse things in the future. 

In the meantime, it's shown me how easily one can pursue a reasonable post-hoc hypothesis and still land far from the truth. And I still don't trust PROCESS.

Wednesday, March 16, 2016

The Weapons Priming Effect, Pt. 2: Meta-analysis

Even in the 1970s the Weapons Priming Effect was considered hard to believe. A number of replications were conducted, failed to find an effect, and were published (Buss, Booker, & Buss, 1972; Ellis, Weiner, & Miller, 1971; Page & Scheidt, 1971).

Remarkable to think that in 1970 people could publish replications with null results, isn't it? What the hell happened between 1970 and 2010? Anyway...

To try to resolve the controversy, the results were aggregated in a meta-analysis (Carlson et al., 1990). To me, what makes this meta-analysis interesting is how small the studies are: the median cell size is about 11, and the largest is 52. 80% of the cells contain 15 or fewer subjects.

Carlson et al. concluded "strong support" for "the notion that incidentally-present negative or aggression cues generally enhance aggressiveness among individuals already experiencing negative affect." However, across all studies featuring only weapons as cues, "a nonsignificant, near-zero average effect-size value was obtained."

Carlson et al. argue that this is because of two equal but opposite forces (emphasis mine):
Among subjects whose hypothesis awareness or evaluation apprehension was specifically elevated by an experimental manipulation or as a natural occurrence, as determined by a post-session interview, the presence of weapons tended to inhibit aggression. In contrast, the presence of weapons enhanced the aggression of nonapprehensive or less suspicious individuals.

In short, Carlson et al. argue that when participants know they're being judged or evaluated, seeing a gun makes them kick into self-control mode and aggress less. But when participants are less aware, seeing a gun makes them about d = 0.3 more aggressive.

I’d wanted to take a quick look for potential publication bias. I took the tables out of the PDF and tried to wrangle them back into CSV. You can find that table and some code in a GitHub repo here.

So far, I've only been able to confirm the following results:

First, I confirm the overall analysis suggesting an effect of aggression cues in general (d = 0.26 [0.15, 0.36]). However, there's a lot of heterogeneity here (I^2 = 73.5%), so I wonder how helpful a conclusion that is.

Second, I can confirm the overall null effect of weapons primes on aggressive behavior (d = 0.05, [-0.21, 0.32]). Again, there's a lot of heterogeneity (I^2 = 71%).

However, I haven't been able to confirm the stuff about splitting by sophistication. Carlson et al. don't do a very good job of reporting these codings in their table. They'll mention in a cell sometimes "low sophistication." As best I can tell, unless the experimenter specifically reported subjects as being hypothesis- or evaluation-aware, Carlson et al. consider the subjects to be naive.

But splitting up the meta-analysis this way, I still don't get any significant results -- just a heap of heterogeneity. Among the Low Awareness/Sophistication group, I get d = 0.17 [-0.15, 0.49]. Among the High Awareness/Sophistication group, I get d = -0.30 [-0.77, 0.16]. Both are still highly contaminated by heterogeneity (Low Awareness: 76% I^2; High Awareness: 47% I^2), indicating that maybe these studies are too different to really be mashed together like this.
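For readers unfamiliar with the I^2 figures above, here is a minimal sketch of one common way to get a random-effects estimate and I^2: the DerSimonian-Laird method. This is not necessarily what the code in the repo does, and the effect sizes and variances in the example are made up for illustration. They are not Carlson et al.'s data.

```python
import numpy as np

def random_effects(d, v):
    """DerSimonian-Laird random-effects pooling.

    d: study effect sizes; v: their sampling variances.
    Returns (pooled estimate, 95% CI, I^2 in percent).
    """
    d, v = np.asarray(d, float), np.asarray(v, float)
    w = 1.0 / v                                   # fixed-effect (inverse-variance) weights
    fixed = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - fixed) ** 2)              # Cochran's Q
    df = len(d) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                 # between-study variance
    w_re = 1.0 / (v + tau2)                       # random-effects weights
    pooled = np.sum(w_re * d) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), i2

# Hypothetical effect sizes that disagree a lot relative to their precision.
pooled, ci, i2 = random_effects(d=[0.8, -0.4, 0.5, -0.1, 0.9], v=[0.05] * 5)
print(f"d = {pooled:.2f} [{ci[0]:.2f}, {ci[1]:.2f}], I^2 = {i2:.1f}%")
```

I^2 is the share of the observed variation in effect sizes that exceeds what sampling error alone would produce. When it is high, the pooled d is an average over studies that may not be estimating the same thing.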

There's probably something missing from the way I'm doing it vs. how Carlson et al. did it. Often, several effect sizes are entered from the same study. This causes some control groups to be double- or triple-counted, which overstates the precision of those studies. I'm not sure if that's how Carlson et al. handled it or not.

It goes to show how difficult it can be to replicate a meta-analysis even when you've got much of the data in hand. Without a full .csv file and the software syntax, reproducing a meta-analysis is awful.

A New Meta-Analysis
It'd be nice to see the Carlson et al. meta-analysis updated with a more modern review. Such a review could contain more studies. The studies could have bigger sample sizes. This would allow for better tests of the underlying effect, better adjustments for bias, and better explorations of causes of heterogeneity.

Arlin Benjamin Jr. and Brad Bushman are working on just such a meta-analysis, which seems to have inspired, in part, Bushman's appearance on Inquiring Minds. The manuscript is under revision, so it is not yet public. They've told me they'll send me a copy once it's accepted.

It's my hope that Benjamin and Bushman will be sure to include a full .csv file with clearly coded moderators. A meta-analysis that can't be reproduced, examined, and tested is of little use to anyone.

Wednesday, March 9, 2016

The Weapons Priming Effect

"Guns not only permit violence, they can stimulate it as well. The finger pulls the trigger, but the trigger may also be pulling the finger." - Leonard Berkowitz

There is a theory in social psychology that aggressive behaviors can be stimulated by simply seeing a weapon. I have been skeptical of this effect for a while, as it sounds suspiciously like Bargh-style social priming. The manipulations are very subtle and the outcomes are very strong, and sometimes opposite to the direction one might expect. This is the first of several posts describing my mixed and confused feelings about this priming effect and my ongoing struggle to sate my curiosity.

The original finding
First, let me describe the basic phenomenon. In 1967, two psychologists reported that simply seeing a gun was enough to stimulate aggressive behavior. This suggested a surprising new cause of aggression: mere exposure to aggression-related cues.

In their experiment, Berkowitz and LePage asked participants to perform a task in a room. The design was a 3 (Object) × 2 (Provocation) + 1 design. For the object manipulation, a piece of "sporting equipment" was sometimes left on a table in the room. In one condition, the equipment was a rifle and revolver combination; the participant was told the weapons belonged to the other participant. In another condition, the equipment was again the rifle and revolver, but the participant was told the weapons belonged to the previous experimenter. In a third condition, there were no objects on the table.

The provocation manipulation consisted of how many shocks the participant received from the other participant. Participants were provoked by receiving either 1 or 7 electrical shocks.

The extra cell consisted of participants in a room with squash racquets instead of guns. All of these participants were strongly provoked.

So that's 100 participants in a 3 (Object: Confederate's Guns, Experimenter's Guns, Nothing) × 2 (Provocation: Mild, Strong) + 1 (Squash Racquets, Strong Provocation) design. That's about 14 subjects per cell.

The researchers hypothesized that, because shotguns are weapons, they are associated with aggression and violence. Exposure to a shotgun, then, should increase the accessibility of aggressive thoughts. The accessibility of aggressive thoughts, in turn, should increase the likelihood of aggressive behavior.

Berkowitz and LePage found results consistent with their hypothesis. Participants who saw a shotgun (and were later provoked) were more aggressive than participants who saw nothing. They were also more aggressive than participants who had been heavily provoked but seen a squash racquet. These participants gave the confederate more and longer electrical shocks.

Extensions and Public Policy 
I'd been curious about this effect for a long time. I do some aggression research, and my PhD advisor conducted some elaborations on the Berkowitz and LePage study in his early career. But I really grew curious when I listened to Brad Bushman's appearance on Mother Jones' "Inquiring Minds" podcast.

Bushman joined the podcast to talk about the science of gun violence. About the first half of the episode is devoted to the Weapons Priming Effect. Bushman argues that one step to reducing gun violence would be to make guns less visible. For example, guns could be kept in opaque safes rather than in clear display cases. Reducing the opportunities for aggressive-object priming would be expected to reduce aggression and violence in society.

Would you mess with someone who had this in their rear window?
In the podcast, Bushman mentions one of the more bizarre and counterintuitive replications of the weapons priming effect. Turner, Layton, and Simons (1975) report an experiment in which an experimenter driving a pickup truck loitered at a traffic light. When the light turned green, the experimenter idled for a further 12 seconds, waiting to see if the driver trapped behind would honk. Honking, the researchers argued, would constitute a form of aggressive behavior.

The design was a 3 (Prime) × 2 (Visibility) design. For the Prime factor, the experimenter's truck featured either an empty gun rack (control), a gun rack with a fully-visible .303-caliber military rifle and a bumper sticker with the word "Friend" (Friendly Rifle), or a gun rack with a .303 rifle and a bumper sticker with the word "Vengeance" (Aggressive Rifle). The experimenter driving the pickup was made visible or invisible by the use of a curtain in the rear window.

There were 92 subjects, about 15/cell. The sample is restricted to males driving late-model privately-owned vehicles for some reason.

The authors reasoned that seeing the rifle would prime aggressive thoughts, which would inspire aggressive behavior, leading to more honking. They run five different planned complex contrasts and find that the Rifle/Vengeance combination inspired honking relative to the No Rifle and Rifle/Friend combo, but only when the curtain was closed, F(1, 86) = 5.98, p = .017. That seems like a very suspiciously post-hoc subgroup analysis to me.

A second study in Turner, Layton, and Simons (1975) collects a larger sample of men and women driving vehicles of all years. The design was a 2 (Rifle: present, absent) × 2 (Bumper Sticker: "Vengeance", absent) design with 200 subjects. They divide this further by driver's sex and by a median split on vehicle year. They find that the Rifle/Vengeance condition increased honking relative to the other three, but only among newer-vehicle male drivers, F(1, 129) = 4.03, p = .047. But then they report that the Rifle/Vengeance condition decreased honking among older-vehicle male drivers, F(1, 129) = 5.23, p = .024! No results were found among female drivers.

Overgeneralizing from Turner et al. (1975)
I was surprised to find that the results in Turner et al. (1975) depended so heavily on the analysis of subgroups. In the past, whenever people told me about this experiment, they'd always just mentioned an increase in honking among those who'd seen a rifle.

Take, for example, this piece from Bushman's Psychology Today blog. Reading it, one gets the impression that honking increased significantly across all groups. One would never guess that the increase appeared only in certain subgroups, or that honking significantly decreased in another:
The weapons effect occurs outside of the lab too. In one field experiment,[2] a confederate driving a pickup truck purposely remained stalled at a traffic light for 12 seconds to see whether the motorists trapped behind him would honk their horns (the measure of aggression). The truck contained either a .303-calibre military rifle in a gun rack mounted to the rear window, or no rifle. The results showed that motorists were more likely to honk their horns if the confederate was driving a truck with a gun visible in the rear window than if the confederate was driving the same truck but with no gun. What is amazing about this study is that you would have to be pretty stupid to honk your horn at a driver with a military rifle in his truck—if you were thinking, that is! But people were not thinking—they just naturally honked their horns after seeing the gun. The mere presence of a weapon automatically triggered aggression.
On Inquiring Minds, Bushman again acknowledges that the effect is, a priori, implausible. One should think twice before honking at an armed man, after all! In my estimation, counter-intuitive effects should be judged carefully, as they are less likely to be real. But this implausibility does not dampen Bushman's enthusiasm for the effect. If anything, it kindles it.

Next Posts
Naturally, the literature on weapon priming is not limited to these two papers. In subsequent posts, I hope to talk about meta-analyses of the effect. I also hope to talk about the role of science in generating and disseminating knowledge about the effect. But this post is long enough -- let's call it at this for now.

Thursday, March 3, 2016

A reading list for the Replicability Crisis

This is a partial reading list meant to be shared with those just learning about the replication crisis. Entries will be added over time.

Origins of the Crisis.

Cohen (1962) "The statistical power of abnormal-social psychological research." Cohen estimates that the typical psychology study has approximately 50% power, which is a little odd when you consider that all the published studies manage to find effects all the time.

Ioannidis (2005) "Why most published research findings are false." In this classic paper, Ioannidis presents a simple analysis. He demonstrates that when the base rate of discoveries is low (true findings are rare), the false positive rate is high (worse than the nominal 5%), and the false negative rate is high (most studies have <50% power), more than half of significant test results will be false; they will represent null hypotheses. 

The false positive rate is high because researchers are flexible in what they analyze. They will sometimes use questionable research practices to attain p < .05. The false negative rate is also high because samples are too small to reliably detect significant results. Ergo the conditional probability of something being true, given that it's published p < .05, is actually much lower than we'd like.
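The conditional probability in question can be worked out with Bayes' rule. The sketch below is a simplified version of the argument, not Ioannidis's exact formulation (his adds an explicit bias term), and the input values are hypothetical choices meant only to show how the three ingredients interact.

```python
def ppv(prior, power, alpha):
    """Probability that an effect is real given a significant result.

    prior: share of tested hypotheses that are true
    power: probability of p < .05 when the effect is real
    alpha: probability of p < .05 when the effect is null
    """
    true_pos = power * prior
    false_pos = alpha * (1.0 - prior)
    return true_pos / (true_pos + false_pos)

# Idealized case: half of hypotheses true, 80% power, nominal alpha.
print(f"{ppv(prior=0.5, power=0.80, alpha=0.05):.2f}")  # 0.94
# Rare true effects and low power, nominal alpha.
print(f"{ppv(prior=0.1, power=0.35, alpha=0.05):.2f}")  # 0.44
# Same, but flexible analysis inflates the false positive rate to 20%.
print(f"{ppv(prior=0.1, power=0.35, alpha=0.20):.2f}")  # 0.16
```

Under the second and third sets of assumptions, fewer than half of significant results reflect a real effect.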

Prinz, Schlange, & Asadullah (2011), "Believe it or not." Drug companies are often looking for ways to apply the remarkable biomedical discoveries announced in journals such as Science, Cell, or Nature. In this paper, the authors announce that in-house replication attempts at two major drug companies routinely failed to yield the results claimed in the journals. 

The above two papers are digested in a very readable Economist article, "Trouble at the Lab."

Fanelli (2012), "Negative results are disappearing." Fanelli looks at hypothesis tests published between 1990 and 2007. More than 80% of published psychology studies claim support for their primary hypothesis, which is again odd given that the average study has rather less than 80% power.

Preposterous Results.

Bem (2011). "Feeling the Future." Psychologist Daryl Bem reports a series of 9 experiments demonstrating that college undergraduates have precognitive, ESP-like abilities. The manuscript is published in the highly esteemed Journal of Personality and Social Psychology. Psychologists are shaken to find that typical statistical methods can support an impossible hypothesis as being true.

Simmons, Nelson, & Simonsohn (2011) "False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant". Inspired by Bem's demonstration of ESP, Simmons et al. demonstrate that, with enough flexibility, you can find anything in a study. In their demonstration, they torture a dataset until it yields evidence that listening to a song makes participants younger.

Their SSRN upload, "A 21-word solution" is a brief and effective remedy. They suggest scientists be encouraged to report everything they did that might have influenced the Type I error rate.

Failures to Replicate. 

Open Science Collaboration (2015). "Estimating the Reproducibility of Psychological Science." The Center for Open Science organized a massive replication effort, 100 studies in 100 laboratories. Whereas all of the original studies had reported significant results, only ~39% of replications found similar results. This ~39% estimate is still the subject of some debate: See criticism from Gilbert et al. and replies from Simonsohn and Srivastava. The most effective summary seems to be "40% replicate, 30% did not replicate, 30% were inconclusive."

Many Labs Replication Project. In this study, several labs replicated each of several experiments. Again, some replicated, but others did not. There have been, to date, three different Many Labs projects.

Donnellan, Lucas, and Cesario (2015). A study by Bargh & Shalev (2012) reported a relationship between physical warmth and emotional warmth, as manifested as a relationship between hot showers and feelings of loneliness. Donnellan et al. attempted, in nine studies, to replicate the Bargh & Shalev result. None succeeded.

Meta-analytic Signs of Bias.

Carter & McCullough (2015) "Publication bias and the limited strength model of self-control". A 2010 meta-analysis (Hagger et al.) concluded that "ego depletion," a form of fatigue in self-control, was a real and robust phenomenon, d = .6. Carter and McCullough find strong indications of publication and analytic bias; so much so that it was not clear whether the true effect was any different from zero. 

In response to Carter & McCullough, psychologists joined together to each perform a preregistered replication of a standard ego-depletion paradigm. Although the manuscript is not yet public, it has been announced that the project found zero evidence of ego depletion. An independent preregistered replication also finds no evidence for the phenomenon. Simine Vazire notes that an effect can be informally replicated in dozens of experiments but still falter in a strict, pre-registered replication.

Landy & Goodwin (2015). Reports claim that feelings of incidental disgust (e.g., smelling a noxious odor) can influence moral judgments. This meta-analysis finds signs of publication or analytic bias.

Flore & Wicherts (2015). "Does stereotype threat influence performance of girls in stereotyped domains?" The authors examine the hypothesis that reminding girls that they are expected to be bad at math harms their ability on a math test. This effect is one instance of "stereotype threat," thought to harm minorities' ability to succeed. The authors find signs of publication bias, and are not certain that there is a true effect.

Pressures on Researchers.

Bakker, van Dijk, and Wicherts (2012). "Rules of the game called psychological science." Currently, scientists are chiefly evaluated by the degree to which they manage to publish. Publishing generally requires finding a p-value less than .05, a significant result. Bakker et al. perform simulations to compare the relative success of two scientists. One scientist is slow and careful, running well-powered experiments and not torturing the data. The other scientist is sloppy, running many small experiments and doing all the wrong things to get p < .05. Naturally, the good, careful scientist finds many fewer significant results than the bad, sloppy scientist. The implied long-term effects on hiring, funding, and tenure decisions are chilling.
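A back-of-the-envelope sketch shows why the many-small-studies strategy pays off even before any data torture. The numbers here (a true effect of d = 0.2, one study with 100 per group versus five studies with 20 per group) are hypothetical and are not the parameters Bakker et al. used. Power is computed with a normal approximation to the two-group t-test, and no questionable research practices are modeled.

```python
import math

Z_CRIT = 1.96  # two-sided critical value for alpha = .05

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(d, n_per_group):
    """Approximate two-sided power for a two-group comparison."""
    shift = d * math.sqrt(n_per_group / 2.0)
    return phi(shift - Z_CRIT) + phi(-shift - Z_CRIT)

def p_any_significant(d, n_per_group, k):
    """Chance that at least one of k independent studies reaches p < .05."""
    return 1.0 - (1.0 - power(d, n_per_group)) ** k

d = 0.2  # hypothetical small true effect
careful = power(d, n_per_group=100)                  # one large study
sloppy = p_any_significant(d, n_per_group=20, k=5)   # five small studies, same total N
print(f"one large study:    {careful:.2f}")
print(f"five small studies: {sloppy:.2f}")
print(f"five small studies, true effect zero: {p_any_significant(0.0, 20, 5):.2f}")
```

With the same total number of participants, the five small studies are more likely to hand the researcher at least one publishable result, and they do so about 23% of the time even when the true effect is zero.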


Bones (2012) "We knew the future all along." Brian Nosek's satirical alter-ego Arina K. Bones argues that Bem's (2011) ESP experiments should not have been published. But not because the results are wrong -- because the results are not new. Bones points out that an estimated 97% of psychology studies find exactly what they predicted, whereas Bem's college undergrads could only muster ~60% prediction. Bones concludes that social psychologists have unparalleled powers of second sight.


Finkel, Eastwick, & Reis (2015). "Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science." Finkel et al. suggest that the proposed reforms in response to the crisis do too much to reduce Type I error at the risk of inflating Type II error. They also argue that one-size-fits-all reforms will be counterproductive.

My own work.

A famous 2010 meta-analysis (Anderson et al.) summarizes the research on effects of violent video games on aggressive thoughts, feelings, and behaviors. The authors argue that there is minimal evidence for publication bias in the literature. They shared their data with me, and I performed my own analysis for publication bias. I find that there seems to be quite strong bias in experiments studying effects of violent games on aggressive behavior; so much so that there may not be an underlying effect.

My dissertation tested the hypothesis that brief exposure to a violent video game could increase aggressive behavior. Despite the decent sample size (N = 223), I could not detect such a main effect. The data are still being entered and analyzed, so results may change as more data are available or errors are found. At present, the effective sample size is N = 295, and the main effect of game violence is r = .06 [-.06, .17]. (For reference, Anderson et al. argue the effect is r = .21, larger than the upper bound of my confidence interval.)

Another of my papers explores common statistical mistakes made by both sides of the violent-games debate. Those that believe in the effect claim that their games are identical except for violent content. Our analyses indicate very little evidence of equivalence. Those that doubt the effect claim that their studies provide strong evidence against a violent-game effect. Our analyses indicate that some "failures to replicate" provide very little evidence against the effect. One or two even provide a little evidence for the effect.