Sunday, October 23, 2016

Outrageous Fortune: 2. The variability of rare drops

Growing up, I played a lot of role-playing games for the Super Nintendo. One trope of late-game design for role-playing games is the rare drop -- a highly desirable item that has a low probability of appearing after a battle. These items are generally included as a way to let players kill an awful lot of time as they roll the dice again and again trying to get the desired item.

For example, in Final Fantasy 4, there is an item called the "pink tail" that grants you the best armor in the game. It has a 1/64 chance of dropping when you fight a particular monster. In Earthbound, the "Sword of Kings" has a 1/128 chance of dropping when you fight a particular monster. In Pokemon, there are "shiny" versions of normal Pokemon that have a very small chance of appearing (in the newest games, the chance is something like 1/4096).

Watching people try to get these items reveals an interesting misunderstanding about how probability works. Intuitively, it makes sense that if the item has a 1/64 chance of dropping, then by the time you've fought the monster 64 times, you should have a pretty good chance of having the item.

Although it's true that the average number of required pulls is 64, there's still a substantial role of chance. This is a recurring theme in gaming and probability -- yes, we know what the average experience is, but the amount of variability around that can be quite large. (See also my old post on how often a +5% chance to hit actually converts into more hits.)

It turns out that if your desired item has a drop rate of 1/64, then after 64 pulls, there's only about a 63.5% chance that you have the item.
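The figure above can be checked by hand: with independent pulls, the chance of at least one drop in n pulls is 1 - (1 - p)^n. A minimal sketch follows; the variable names are mine, and it relies on the fact that R's pgeom() and pnbinom() count failures before the first success, so "a drop within n pulls" corresponds to at most n - 1 failures.

```r
p <- 1/64   # drop rate per pull
n <- 64     # number of pulls

# Chance of at least one drop in n independent pulls
closed_form <- 1 - (1 - p)^n

# Same quantity from R's distribution functions, which count the
# failures that occur before the first success
via_pgeom   <- pgeom(n - 1, prob = p)
via_pnbinom <- pnbinom(n - 1, size = 1, prob = p)

c(closed_form, via_pgeom, via_pnbinom)
```

All three values should agree, which is a handy sanity check on the parameterization before trusting a longer table.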

To understand the variability around the drop chance, we can use the negative binomial distribution. The binomial distribution takes a number of attempts and a probability of success and tells us how many successes to expect. The negative binomial inverts this: it takes the number of desired successes and the probability of success and tells us how many failed attempts to expect along the way. (When we want only one success, this is the geometric distribution.)

In R, we can model this with the function pnbinom(), which gives us the cumulative distribution function. Because pnbinom() counts the failures that come before the success, the proportion of players who get a success within X pulls is the probability of at most X - 1 failures.
Let's look at some examples.

Earthbound: The Sword of Kings

The Sword of Kings has a drop rate of 1/128 when the player fights a Starman Super. In a normal game, a player probably fights five or ten Starman Supers. How many players can be expected to find a Sword of Kings in the course of normal play? Some players will decide they want a Sword of Kings and will hang out in the dungeon fighting Starman Supers ("Starmen Super"?) until they find one. How long will most players have to grind to get this item?

By the time the player finds a Sword of Kings, the party is probably so overleveled from killing dozens of Starman Supers that they don't really need the Sword anyway. (From http://starmendotnet.tumblr.com/post/96897409329/sijbrenschenkels-finally-found-sword-of)

We use pnbinom() to get the cumulative probabilities. We'll use the dplyr package too because I like piping and being able to use filter() later for a nice table.

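library(dplyr)

# pnbinom() counts failures before the first success, so a drop
# within x fights means at most x - 1 failures.
eb <- data.frame(x = 1:800) %>% 
  mutate(p = pnbinom(x - 1, size = 1, prob = 1/128))

# Cumulative probability at a few milestones
filter(eb, x %in% c(1, 10, 50, 100, 128, 200, 256, 400))

with(eb, plot(x, p, type = 'l',
              xlab = "Starman Supers defeated",
              ylab = "Probability of at least one drop",
              main = "Grinding for a Sword of Kings"))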

A lucky 8% of all players will get the item in their first ten fights, about as many as a player might fight in the course of normal play. 21% of players still won't have gotten one after two hundred combats, and 4% of players won't have gotten the Sword of Kings even after fighting four hundred Starman Supers!
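The question of how long most players will have to grind can also be answered directly with the quantile function. This is a sketch rather than part of the original analysis: qnbinom() returns a number of failures, so one is added to count the successful fight, and the chosen quantiles (50%, 80%, 95%) are my own picks.

```r
p_drop <- 1/128
quantiles <- c(0.50, 0.80, 0.95)

# qnbinom() gives the number of failed fights before the first drop;
# add 1 to include the fight in which the sword actually drops.
fights_needed <- qnbinom(quantiles, size = 1, prob = p_drop) + 1

data.frame(share_of_players = quantiles, fights_needed)
```

Reading the table: the first row is the number of fights by which half of all players have the sword, and the last row is the number by which all but the unluckiest 5% have it.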

Final Fantasy IV: The Pink Tail

There's a monster in the last dungeon of Final Fantasy IV. When you encounter monsters in a particular room, there is a 1/64 chance that you will find the "Pink Puff" monster. Every Pink Puff you kill has a 1/64 chance of dropping a pink tail.

# pnbinom() counts failures before the first success, so an encounter
# within x combats means at most x - 1 failures.
ff4 <- data.frame(x = 1:400) %>%
  mutate(p = pnbinom(x - 1, size = 1, prob = 1/64))

filter(ff4, x %in% c(1, 5, 10, 50, 64, 100, 200))

Just to find the Pink Puff monster is a major endeavor. A lucky 1.6% of players will find a Pink Puff on their first combat, and about 15% of players will run into one in the course of normal play (10 combats). But about 21% of players won't have found one even after a hundred combats, and 4% of players won't have found a Pink Puff even after two hundred combats.

Finding a pack of Pink Puffs is a 1/64 chance, and that's just the start of it.

After you find and kill the Pink Puff, it still has to drop the pink tail, which is a 1/64 chance per Pink Puff. So about 21% of players won't find a pink tail even after killing a hundred Pink Puffs. Consider next that one finds, on average, one group of Pink Puffs per 64 combats, and Pink Puffs come in groups of five. You could run through more than a thousand fights in order to find 100 Pink Puffs and still not get a pink tail. Ridiculous!
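The two stages can be chained into a single per-fight probability. The sketch below takes the numbers in the text at face value and adds assumptions of my own: every Pink Puff encounter contains exactly five Pink Puffs, the player kills all five, and each one rolls for the tail independently. The actual game mechanics may differ.

```r
p_encounter <- 1/64   # chance that a fight in the room contains Pink Puffs
p_tail      <- 1/64   # chance that one defeated Pink Puff drops a pink tail
group_size  <- 5      # assumed: five Pink Puffs per encounter, all defeated

# Chance that at least one of the five drops a tail, given an encounter
p_tail_given_group <- 1 - (1 - p_tail)^group_size

# Chance that any single fight in the room ends with a pink tail
p_per_fight <- p_encounter * p_tail_given_group

# Share of players still empty-handed after a given number of fights
fights <- c(100, 500, 1000, 2000)
p_no_tail <- (1 - p_per_fight)^fights

data.frame(fights, p_no_tail)
```

Under these assumptions the per-fight chance works out to a bit more than one in a thousand, so a sizable share of players are still empty-handed after a thousand fights.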

Here's a guy on the IGN forums saying "I've been trying for a week and I haven't gotten one."

Shiny Pokemon

A "shiny" pokemon is a rarer version of any other pokemon. In the newest pokemon game, any wild pokemon has a 1/4096 chance of being shiny.

This is so rare that we'll put things into log scale so that we're calling pnbinom() 100 times rather than 22000 times.

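# x is the log of the number of encounters. pnbinom() counts failures
# before the first success, hence the "- 1".
pkmn <- data.frame(x = seq(.1, 10, by = .1)) %>% 
  mutate(p = pnbinom(exp(x) - 1, size = 1, prob = 1/4096))
with(pkmn, plot(exp(x), p, type = 'l'))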

filter(pkmn, exp(x) <= 500) %>% tail(1)
filter(pkmn, exp(x) <= 2000) %>% tail(1)
filter(pkmn, exp(x) <= 10000) %>% tail(1)

11% of players will find one shiny pokemon within 500 encounters. About 39% will find one within 2000 encounters. 9% of players will grind through ten thousand encounters and still not find one.

There's a video on YouTube of a kid playing three or four Game Boys at once until he finds a shiny pokemon after about 26000 encounters. (I think this was in one of the earlier pokemon games where the shiny rate was about 1/8000.) There seems to be a whole genre of YouTube streamers showing off shiny pokemon that they went through thousands of encounters to find.

World of Warcraft

I don't know anything about World of Warcraft, but googling for its idea of a rare drop turns up a sword with a 1 in 1,500 chance of dropping as a quest reward.

# pnbinom() counts failures before the first success, hence the "- 1".
wow <- data.frame(x = 1:5e3) %>% 
  mutate(p = pnbinom(x - 1, size = 1, prob = 1/1500))
with(wow, plot(x, p, type = 'l'))

filter(wow, round(p, 3) == .5)
filter(wow, round(p, 3) == .8)
filter(wow, x == 2500)

Suppose your dedicated fanbase decides to try grinding for this item. Players can do 1000 quests and only about half of them will get this sword. Among players running 2500 quests, 19% still won't have gotten one.

My Thoughts

In general, I feel like rare drops aren't worth grinding for. There's a lot of chance involved in the negative binomial distribution, and you could get very lucky or very unlucky.

Sometimes single-player RPGs seem to include them as a somewhat cynical way to keep kids busy when they have too much time to kill. In this case, players seem to know they're going to have to grind a long time, but they may not realize just how long they could grind and still not get it. The draw for these items seems to be more about the spectacle of the rarity than about the actual utility of the item.

In massively multiplayer games, it seems like drops are made so rare that players aren't really expected to grind for them. An item with a 1/1500 drop chance isn't something any one player can hope to get, even if they are deliberately trying to farm the item. Thus, rare items in MMOs are more like Willy Wonka Golden Tickets that a few lucky players get, rather than something that one determined player works to get. One player could try a thousand times and still not get the item, but across a hundred thousand players trying once, a few will get it, and that's enough to keep the in-game economy interesting.
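The contrast between one determined player and a large crowd can be put in numbers. This is a rough sketch using the 1/1500 rate from the example above; the player counts are the round figures from the paragraph, and independence across players and attempts is assumed.

```r
p_drop <- 1/1500

# One determined player: chance of still having nothing after 1000 tries
p_fail_1000 <- (1 - p_drop)^1000

# A crowd of 100,000 players, each trying once
n_players <- 1e5
expected_winners <- n_players * p_drop

# Chance that nobody in the crowd gets the item
p_nobody <- dbinom(0, size = n_players, prob = p_drop)

c(p_fail_1000 = p_fail_1000,
  expected_winners = expected_winners,
  p_nobody = p_nobody)
```

The single grinder fails about half the time, while the crowd is all but guaranteed to produce some winners.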

My preference is that chance is something that encourages a shift in strategy rather than something that strictly makes your character better or worse. Maybe the player is guaranteed a nice item, but that nice item can be one of three things, each encouraging a different playstyle.

Still, it's fun sometimes to get something unexpected and get a boost from it. Imagine being one of the ~8% of players who gets a Sword of Kings by chance. Another approach is to provide a later way to ensure getting the desired item. In the roguelike Dungeon Crawl, some desirable items can be found early by luck, but they are also guaranteed to appear later in the game by skill.

Anyway, don't bet on getting that rare drop -- you may find yourself grinding much longer than you'd thought.

Tuesday, October 18, 2016

Publishing the Null Shows Sensitivity to Data

Some months ago, a paper argued for the validity of an unusual measurement of aggression. According to this paper, the number of pins a participant sticks into a paper voodoo doll representing their child seems to be a valid proxy for aggressive parenting.

Normally, I might be suspicious of such a paper because the measurement sounds kind of farfetched. Some of my friends in aggression research scoffed at the research, calling bullshit. But I felt I could trust the research.

Why? The first author has published null results before.

I cannot stress enough how much an author's published null results encourage my trust of a published significant result. With some authors, the moment you read the methods section, you know what the results section will say. When every paper supports the lab's theory, one is left wondering whether there are null results hiding in the file drawer. One starts to worry that the tested hypotheses are never in danger of falsification.

"Attached are ten stickers you can use to harm the child.
You can stick these onto the child to get out your bad feelings.
You could think of this like sticking pins into a Voodoo doll."

In the case of the voodoo doll paper, the first author is Randy McCarthy. Years ago, I became aware of Dr. McCarthy when he carefully tried to replicate the finding that heat-related word primes influence hostile perceptions (DeWall & Bushman, 2009) and reported null results (McCarthy, 2014).

The voodoo doll paper from McCarthy and colleagues is also a replication attempt of sorts. The measure was first presented by DeWall et al. (2013); McCarthy et al. perform conceptual replications testing the measure's validity. On the whole, the replication and extension is quite enthusiastic about the measure. And that means all the more to me given my hunch that McCarthy started this project by saying "I'm not sure I trust this voodoo doll task..."

Similar commendable frankness can be seen in work from Michael McCullough's lab. In 2012, McCullough et al. reported that religious thoughts influence men's stereotypically male behavior. In 2014, one of McCullough's grad students published that she couldn't replicate the 2012 result (Hone & McCullough, 2014).

I see it as something like a Receiver Operating Characteristic curve. If the classifier has only ever given positive responses, that's probably not a very useful classifier -- you can't tell if there's any specificity to the classifier. A classifier that gives a mixture of positive and negative responses is much more likely to be useful.

A researcher who publishes a null now and again is a researcher I trust to call the results as they are.

[Conflict of interest statement: In the spirit of full disclosure, Randy McCarthy once gave me a small Amazon gift card for delivering a lecture to the Bayesian Interest Group at Northern Illinois University.]

Friday, August 19, 2016

Comment on Strack (2016)

Yesterday, Perspectives on Psychological Science published a 17-laboratory Registered Replication Report, totaling nearly 1900 subjects. In this RRR, researchers replicated an influential study of the Facial Feedback Effect, showing that being surreptitiously made to smile or to pout could influence emotional reactions.

The results were null, indicating that there may not be much to this effect.

The first author of the original study, Fritz Strack, was invited to comment. In his comment, Strack makes four criticisms of the replication that, in his view, undermine the results of the RRR to some degree. I am not convinced by these arguments; below, I address each in sequence.

"Hypothesis-aware subjects eliminate the effect."

First, Strack says that participants may have learned of the effect in class and thus failed to demonstrate it. To support this argument, he performs a post-hoc analysis demonstrating that the 14 studies using psychology pools found an effect size of d = -0.03, whereas the three studies using non-psychology undergrad pools found an effect size of d = 0.16, p = .037.

However, the RRR took pains to exclude hypothesis-aware subjects. Psychology students were also, we are told, recruited prior to coverage of the Strack et al. study in their classes. Neither of these steps ensures that all hypothesis-aware subjects were removed, of course, but both certainly help. And as Sanjay Srivastava points out, why would hypothesis awareness necessarily shrink the effect? It could just as well enhance it by demand characteristics.

Also, d = 0.16 is quite small -- like, 480-per-group for a one-tailed 80% power test small. If Strack is correct, and the true effect size is indeed d = 0.16, this would seem to be a very thin success for the Facial Feedback Hypothesis, and still far from consistent with the original study's effect.
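The sample-size figure can be reproduced with power.t.test() from base R's stats package. A minimal sketch, assuming a two-sample t-test with equal group sizes, alpha = .05, and a standardized effect (sd = 1):

```r
# Per-group sample size needed to detect d = 0.16 with 80% power
# in a one-tailed, two-sample t-test at alpha = .05.
pw <- power.t.test(delta = 0.16, sd = 1,
                   sig.level = 0.05, power = 0.80,
                   type = "two.sample",
                   alternative = "one.sided")

n_per_group <- ceiling(pw$n)
n_per_group
```

The answer lands in the neighborhood of 480 per group, several times the sample size of any single lab in the RRR.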

"The Far Side isn't funny anymore."

Second, Strack suggests that, despite the stimulus testing data indicating otherwise, perhaps The Far Side is too 1980s to provide an effective stimulus.

I am not sure why he feels it necessary to disregard the data, which indicates that these cartoons sit nicely in the midpoint of the scale. I am also at a loss as to why the cartoons need to be unambiguously funny -- had the cartoons been too funny, one could have argued there was a ceiling effect.

"Cameras obliterate the facial feedback effect."

Third, Strack suggests that the "RRR labs deviated from the original study by directing a camera at the participants." He argues that research on objective self-awareness demonstrates that cameras induce subjective self-focus, tampering with the emotional response.

This argument would be more compelling if any studies were cited, but in any case, I feel the burden of proof rests with this novel hypothesis that the facial feedback effect is moderated by the presence of cameras.

"The RRR shows signs of small-study effects."

Finally, Strack closes by using a funnel plot to suggest that the RRR results are suffering from a statistical anomaly.

He shows a funnel plot that compares sample size and Cohen's d, arguing that it is not appropriately pyramidal. (Indeed, it looks rather frisbee-shaped.)

Further, he conducts a correlation test between sample size and Cohen's d. This result is not, strictly speaking, statistically significant (p = .069), but he interprets it all the same as a warning sign. (It bears mention here that an Egger test with an additive error term is a more appropriate test. Such a test yields p = .235, quite far from significance.)

Strack says that he does not mean to insinuate that there is "reverse p-hacking" at play, but I am not sure how else we are to interpret this criticism. In any case, he recommends that "the current anomaly needs to be further explored," which I do below.

Strack's funnel plot does not appear pyramidal because the studies are all of roughly equal size, and so the default scale of the axes is way off. Here I present a funnel plot with axes of more appropriate scale. Again, the datapoints do not form a pyramid shape, but we see now that this is because there is little variance in sample size or standard error with which to make a pyramid shape. You're used to seeing taller, more funnel-y funnels because sample sizes in social psych tend to range broadly from 40 to 400, whereas here they vary narrowly from 80 to 140.

You can also see that there's really only one of the 17 studies that contributes to the correlation, having a negative effect size and larger standard error. This study is still well within the range of all the other results, of course; together, the studies are very nicely homogeneous (I^2 = 0%, tau^2 = 0), indicating that there's no evidence this study's results measure a different true effect size.

Still, this study has influence on the funnel plot -- it has a Cook's distance of 0.46, whereas all the others have distances of 0.20 or less. Removing this one study abolishes the correlation between d and sample size (r(14) = .27, p = .304), and the resulting meta-analysis is still quite null (raw effect size = 0.04, [-0.09, 0.18]). Strack is interpreting a correlation that hinges upon one influential observation.

I am willing to bet that this purported small-study effect is a pattern detected in noise. (Not that it was ever statistically significant in the first place.)

Admittedly, I am sensitive to the suggestion that an RRR would somehow be marred by reverse p-hacking. If all the safeguards of an RRR can't stop psychologists from reaching whatever result they had predetermined, we are completely and utterly fucked, and it's time to pursue a more productive career in refrigerator maintenance.

Fortunately, that does not seem to be the case. The RRR does not show evidence of small-study effects or reverse p-hacking, and its null result is robust to exclusion of the most negative result.

Tuesday, July 19, 2016

The Failure of Fail-safe N

Fail-Safe N is a statistic suggested as a way to address publication bias in meta-analysis. Fail-Safe N describes the robustness of a significant result by calculating how many studies with effect size zero could be added to the meta-analysis before the result lost statistical significance. The original formulation is provided by Rosenthal (1979), with modifications proposed by Orwin (1983) and Rosenberg (2005).

I would like to argue that, as a way to detect and account for bias in meta-analysis, Fail-Safe N is completely useless. Others have said this before (see the bottom of the post for some links), but I needed to explore it further for my own curiosity. Altogether, I have to say that Fail-Safe N appears to be completely obsoleted by subsequent techniques, and thus is not recommended for use.

Fail-Safe N isn't for detecting bias

When we perform a meta-analysis, the question on our minds is usually "Looking at the gathered studies, how many null results were hidden from report?" Fail-Safe N does not answer that. Instead, it asks, "Looking at the gathered studies, how many more null results would you need before you'd no longer claim an effect?"

This isn't useful as a bias test. Indeed, Rosenthal never meant it as a way to test the presence of bias -- he'd billed it as an estimate of tolerance for null results, an answer to the question "How bad would bias have to be before I changed my mind?" He used it to argue that the published psych literature was not the 5% of Type I errors, while the 95% of null results languished in file drawers. Fail-Safe N was never meant to distinguish biased from unbiased literatures.

Fail-Safe N doesn't scale with bias

Although Fail-Safe N was never meant to test for bias, sometimes people will act as though a larger Fail-Safe N indicates the absence of bias. That won't work.

To see why it won't work, let's look briefly at the equation that defines FSN.

FSN = [(ΣZ)^2 / 2.706] - k

where ΣZ is the sum of z-scores from individual studies (small p-values mean large z-scores) and k is the number of studies.

This means that Fail-Safe N grows larger with each significant result. Even when each study is just barely significant (one-tailed p = .050, so z = 1.645), Fail-Safe N will grow rapidly: with k such studies, (ΣZ)^2 / 2.706 is k^2, so FSN = k^2 - k. After six p = .05 results, FSN is 30. After ten p = .05 results, FSN is 90. After twenty p = .05 results, FSN is 380. Fail-Safe N rapidly becomes huge, even when the individual studies just barely cross the significance threshold.
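The growth is easy to verify numerically. The function below is a small sketch of the equation above, not a library implementation; the name fail_safe_n and its arguments are my own, and it uses qnorm(0.95)^2 (about 2.706) as the critical value.

```r
# Rosenthal's Fail-Safe N: (sum of z)^2 / z_crit^2 - k
fail_safe_n <- function(z, z_crit = qnorm(0.95)) {
  sum(z)^2 / z_crit^2 - length(z)
}

# k studies, each just barely significant at one-tailed p = .05
k <- c(6, 10, 20)
fsn <- sapply(k, function(n) fail_safe_n(rep(qnorm(0.95), n)))

data.frame(k, fsn = round(fsn))
```

Because every z equals the critical value here, the result reduces to k^2 - k, which is why the number balloons so quickly.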

Worse, FSN can get bigger as the literature becomes more biased.

  • For each dropped study with an effect size of exactly zero, FSN grows by one. (That's what it says on the tin -- how many dropped zeroes would be required to make p > .05.) 
  • When dropped studies have positive but non-significant effect sizes, FSN falls. 
  • When dropped studies have negative effect sizes, FSN rises.

If all the studies with estimated effect sizes less than zero are censored, FSN will quickly rise.
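A toy example makes the three cases concrete. The z-scores below are invented purely for illustration (they are not from any real literature), and fail_safe_n is a hypothetical helper implementing the equation given earlier.

```r
# Rosenthal's Fail-Safe N: (sum of z)^2 / z_crit^2 - k
fail_safe_n <- function(z, z_crit = qnorm(0.95)) {
  sum(z)^2 / z_crit^2 - length(z)
}

# Hypothetical z-scores for six studies (made up for illustration)
z_all <- c(-1, 0, 0.5, 2, 2.5, 3)

fsn_full      <- fail_safe_n(z_all)
fsn_drop_zero <- fail_safe_n(z_all[z_all != 0])    # censor the exact zero
fsn_drop_pos  <- fail_safe_n(z_all[z_all != 0.5])  # censor the small positive
fsn_drop_neg  <- fail_safe_n(z_all[z_all >= 0])    # censor the negative

c(full = fsn_full, drop_zero = fsn_drop_zero,
  drop_positive = fsn_drop_pos, drop_negative = fsn_drop_neg)
```

Censoring the zero raises FSN by exactly one, censoring the small positive result lowers it, and censoring the negative result raises it the most, so the statistic can look better precisely when the literature is more selectively reported.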

Because Fail-Safe N doesn't behave in any particular way with bias, the following scenarios could all have the same Fail-Safe N:

  • A few honestly-reported studies on a moderate effect.
  • A lot of honest studies on a teeny-tiny effect.
  • A single study with a whopping effect size.
  • A dozen p-hacked studies on a null effect.

Fail-Safe N is often huge, even when it looks like the null is true

Publication bias and flexible analysis being what they are in social psychology, Fail-Safe N tends to return whopping huge numbers. The original Rosenthal paper provides two demonstrations. In one, he synthesizes 94 experiments examining the effects of interpersonal self-fulfilling prophecies, and concludes that 3,263 studies averaging null effects would be necessary to make the effect go away. In another analysis of k = 311 studies, he says nearly 50,000 studies would be needed.

Similarly, in the Hagger et al. meta-analysis of ego depletion, Fail-Safe N reported that 50,000 null studies would be needed to reduce the effect to non-significance. By comparison, the Egger test indicated that the literature was badly biased, and PET-PEESE indicated that the effect size was likely zero. The registered replication report also indicated that the effect size was likely zero. Even a Fail-Safe N of 50,000 does not indicate a robust result.


Fail-Safe N is not a useful bias test because:

  1. It does not tell you whether there is bias.
  2. Greater bias can lead to a greater Fail-Safe N.
  3. Hypotheses that would appear to be false have otherwise obtained very large values of FSN.

FSN is just another way to describe the p-value at the end of your meta-analysis. If your p-value is very small, FSN will be very large; if your p-value is just barely under .05, FSN will be small.

In no case does Fail-Safe N indicate the presence or absence of bias. It only places a number on how bad publication bias would have to be, in a world without p-hacking, for the result to be a function of publication bias alone. Unfortunately, we know well that we live in a world with p-hacking. Perhaps this is why Fail-Safe N is sometimes so inappropriately large.

If you need to test for bias, I would recommend instead Begg's test, Egger's test, or p-uniform. If you want to adjust for bias, PET, PEESE, p-curve, p-uniform, or selection models might work. But don't ever try to interpret the Fail-Safe N in a way it was never meant to be used.

Related reading:
Becker (2005) recommends "abandoning Fail-Safe N in favor of other, more informative analyses."
Here Coyne agrees that Fail-Safe N is not a function of bias and does not check for bias.
The Cochrane Collaboration agrees that Fail-Safe N is mostly a function of the net effect size, and criticizes the emphasis on statistical significance over effect size.
Moritz Heene refers me to three other articles pointing out that the average Z-score of unpublished studies is probably not zero, as Fail-Safe N assumes, but rather, less than zero. Thus, the Fail-Safe N is too large. (Westfall's comment below makes a similar point.) This criticism is worth bearing in mind, but I think the larger problem is that Fail-Safe N does not answer the user's question regarding bias.

Thursday, June 23, 2016

Derailment, or The Seeing-Thinking-Doing Model

Inspired by a recent excellent lecture by Nick Brown, I decided to finally sit down and read Diederik Stapel's confessional autobiography, Ontsporing. Brown translated it from Dutch into English; it is available for free here.

In this account, Stapel describes how he came to leave theater for social psychology, how he had some initial fledgling successes, and ultimately, how his weak results and personal greed drove him to fake his data. A common theme is the complete lack of scientific oversight -- Stapel refers to his sole custody of the data as being alone with a big jar of cookies.

Doomed from the start

Poor Stapel! He based his entire research program on a theory doomed to failure. So much of what he did was based on a very simple, very crude model: Seeing a stimulus "activates" thoughts related to the stimulus. Those "activated" thoughts then influence behavior, usually at sufficient magnitude and clarity that they can be detected in a between-subjects test of 15-30 subjects per cell.

Say what you will about the powerful effects of the situation, but in hindsight, it's little surprise that Stapel couldn't find significant results. The stimuli were too weak, the outcomes too multiply determined, and the sample sizes too small. It's like trying to study if meditation reduces anger by treating 10 subjects with one 5-minute session and then seeing if they ever get in a car crash. Gelman might say Stapel was "driven to cheat [...] because there was nothing there to find. [...] If there's nothing there, they'll start to eat dirt."

Remarkably, Stapel writes as though he never considered that his theories could be wrong and that he should have changed course. Instead, he seems to have taken every p < .05 as gospel truth. He talks about p-hacking two studies into shape (he refers to "gray methods" like dropping conditions or outcomes) only to be devastated when the third study comes up immovably null. He didn't listen to his null results.

However, theory seemed to play a role in his reluctance to listen to his data. Indeed, he says the way he got away with it for as long as he did was by carefully reading the literature and providing the result that theory would have obviously predicted. Maybe the strong support from theory is why he always assumed there was some signal he could find through enough hacking.

He similarly placed too much faith in the significant results of other labs. He alludes to strange ideas from other labs as though they were established facts: things like the size of one's signature being a valid measure of self-esteem, or thoughts of smart people making you better at Trivial Pursuit.

Seeing-Thinking-Doing Theory

Reading the book, I had to reflect upon social psychology's odd but popular theory, which grew to prominence some thirty years ago and is just now starting to wane. This theory is the seeing-thinking-doing theory: seeing something "activates thoughts" related to the stimulus, the activation of those thoughts leads to thinking those thoughts, and thinking those thoughts leads to doing some behavior.

Let's divide the seeing-thinking-doing theory into its component stages: seeing-thinking and thinking-doing. The seeing-thinking hypothesis seems pretty obvious. It's sensible enough to believe in and study some form of lexical priming, e.g. that some milliseconds after you've shown participants the word CAT, they are faster to say HAIR than BOAT. Some consider the seeing-thinking hypothesis so obvious as to be worthy of lampoon.

But it's the thinking-doing hypothesis that seems suspicious. If incidental thoughts are to direct behavior in powerful ways, it would suggest that cognition is asleep at the wheel. There seems to be this idea that the brain has no idea what to do from moment to moment, and so it goes rummaging about looking for whatever thoughts are accessible, and then it seizes upon one at random and acts on it.

The causal seeing-thinking-doing cascade starts to unravel when you think about the strength of the manipulation. Seeing probably causes some change in thinking, but there's a lot of thinking going on, so it can't account for that much variance in thinking. Thinking is probably related to doing, but then, one often thinks about something without acting on it.

The trickle-down cascade from minimal stimulus to changes in thoughts to changes in behavior would seem to amount to little more than a sneeze in a tornado. Yet this has been one of the most powerful ideas in social psychology, leading to arguments that we can reduce violence by keeping people from seeing toy guns, stimulate intellect through thoughts of professors, and promote prosocial behavior by putting eyes on the walls.


When I read Ontsporing, I saw a lot of troubling things: lax oversight, neurotic personalities, insufficient skepticism. But it's the historical perspective on social psychology that most jumped out to me. Stapel couldn't wrap his head around the idea that words and pictures aren't magic totems in the hands of social psychologists. He set out to study a field of null results. Rather than revise his theories, he chose a life of crime.

The continuing replicability crisis is finally providing some appropriately skeptical and clear tests of the seeing-thinking-doing hypothesis. In the meantime, I wonder: What exactly do we mean when we say "thoughts" are "activated"? How strong is the evidence that the activation of a thought can later influence behavior? And are there qualitative differences between the kind of thought associated with incidental primes and the kind of thought that typically guides behavior? The latter would seem much more substantial.

Thursday, June 2, 2016

Prior elicitation for directing replication efforts

Brent Roberts suggests the replication movement solicit federal funding to organize replication daisy chains. James Coyne suggests that the replication movement has already made a grave misstep by attempting to replicate findings that were always hopelessly preposterous. Who is in the right?

It seems to me that both are correct, but the challenge is in knowing when to replicate and when to dismiss outright. Coyne and the OSF seem to be after different things: the OSF has been very careful to make the RP:P about "estimating the replicability of psychology" in general rather than establishing the truth or falsity of particular effects of note. This motivated their decision to choose a random-ish sample of 100 studies rather than target specific controversial studies.

If, in contrast, we want to direct our replication efforts to where they will have the greatest probative value, we will need to first identify which phenomena we are collectively most ambivalent about. There's no point in replicating something that's obviously trivially true or blatantly false.

How do we figure that out? Prior elicitation! We gather a diverse group of experts and ask them to divide up their probability, indicating how big they think the effect size is in a certain experimental paradigm.

If most of the probability mass is away from zero, then we don't bother with the replication -- everybody believes in the effect already.

On the other hand, if the estimates are tightly clustered around zero, we don't bother with the replication -- it's obvious nobody believes it in the first place.

It's when the prior is diffuse, or evenly divided between the spike at zero and the slab outside zero, or bimodal, that we find the topic is controversial and in need of replication. That's the kind of thing that might benefit from a RRR or a federally-funded daisy chain.

Code below:
# Plot1
x = seq(-2, 2, .01)
plot(x, dcauchy(x, location = 1, scale = .3)*.9, type = 'l',
     ylim = c(0, 1),
     ylab = "Probability density",
     xlab = paste("Effect size (delta)"),
     main = "All-but-certain finding \n Little need for replication")
arrows(0, 0, 0, .1)

# Plot 2: narrow slab (weight .1) around zero; arrow marks a large spike (mass .9) at zero
plot(x, dcauchy(x, location = 0, scale = .25)*.1, type = 'l',
     ylim = c(0, 1),
     ylab = "Probability density",
     xlab = paste("Effect size (delta)"),
     main = "No one believes it \n Little need for replication")
arrows(0, 0, 0, .9)

# Plot 3: wide slab (weight .5) around zero; arrow marks a spike (mass .5) at zero
plot(x, dcauchy(x, location = 0, scale = 1)*.5, type = 'l',
     ylim = c(0, .75),
     ylab = "Probability density",
     xlab = paste("Effect size (delta)"),
     main = "No one knows what to think \n Great target for replication")
arrows(0, 0, 0, .5)

# Plot 4: equal mixture of two slabs, centered on delta = 1 and delta = -1
plot(x, dcauchy(x, location = 1, scale = 1)*.5, type = 'l',
     ylim = c(0, .75),
     ylab = "Probability density",
     xlab = paste("Effect size (delta)"),
     main = "Competing theories \n Great target for replication")
lines(x, dcauchy(x, location = -1, scale = 1)*.5)
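One rough way to put numbers on the first three pictures is to ask how much prior probability sits at or near zero. The sketch below is only an illustration: the helper `mass_near_zero()` and the ±0.1 "practically zero" band are my own assumptions, not part of any formal elicitation procedure.

```r
# Hypothetical helper: prior probability that |delta| < band under a
# spike-and-slab prior (point mass at zero plus a Cauchy slab).
mass_near_zero <- function(spike, location, scale, band = 0.1) {
  slab <- pcauchy(band, location, scale) - pcauchy(-band, location, scale)
  spike + (1 - spike) * slab
}

# Same spike weights and slab parameters as Plots 1-3 above.
near_zero <- c(
  all_but_certain = mass_near_zero(spike = .1, location = 1, scale = .3),
  no_one_believes = mass_near_zero(spike = .9, location = 0, scale = .25),
  no_one_knows    = mass_near_zero(spike = .5, location = 0, scale = 1)
)
round(near_zero, 2)
```

Values close to 0 or 1 suggest the field has already made up its mind; values near .5 mark the kind of topic that might repay a replication.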

Wednesday, June 1, 2016

Extraordinary evidence

Everyone seems to agree with the saying "extraordinary claims require extraordinary evidence." But what exactly do we mean by it?

In previous years, I'd taken this to mean that an improbable claim requires a dataset with strong probative value, e.g. a very small p-value or a very large Bayes factor. Extraordinary claims have small prior probability and need strong evidence if they are to be considered probable a posteriori.

However, this is not the only variety of extraordinary claim. Suppose that someone tells you that he has discovered that astrological signs determine Big Five personality scores. You scoff, expecting that he has run a dozen tests and wrestled out a p = .048 here or there. But no, he reports strong effects on every outcome: all are p < .001, with correlations in the r = .7 range. If you take the results at face value, it is clearly strong evidence of an effect.

Is this extraordinary evidence? In a sense, yes. The Bayes factor or likelihood ratio or whatever is very strong. But nested within this extraordinary evidence is another extraordinary claim: that his study found these powerful results. These effects are unusually strong for personality psychology in general, much less for astrology and personality in particular.

What kind of extraordinary evidence is needed to support that claim? In this post-Lacour-fraud, post-Reinhart-Rogoff-Excel-error world, I would suggest that more is needed than simply a screenshot of some SPSS output.

In ascending order of rigor, authors can support their extraordinary evidence by providing the following:

  1. The post-processed data necessary to recreate the result.
  2. The pre-processed data (e.g., single-subject e-prime files; single-trial data).
  3. All processing scripts that turn the pre-processed data into the post-processed data.
  4. Born-open data, data that is organized by Git to be saved and uploaded to the cloud in an automated script. This is an extension of the above -- it provides the pre-processed data, uploaded to the central, 3rd-party GitHub server, where it is timestamped.

Providing access to the above gives greater evidence that:

  1. The data are real, 
  2. The results match the data, 
  3. The processed data are an appropriate function of the preprocessed data, 
  4. The data were collected and uploaded over time, rather than cooked up in Excel overnight, and
  5. The data were not tampered with between initial collection and final report.

If people do not encourage data-archival, a frustrating pattern may emerge: Researchers report huge effect sizes with high precision. These whopping results have considerable influence on the literature, meta-analyses, and policy decisions. However, when the data are requested, it is discovered that the data were hit by a meteor, or stolen by Chechen insurgents, or chewed up by a slobbery old bulldog, or something. Nobody is willing to discard the outrageous effect size from meta-analysis for fear of bias, or appearing biased. Techniques to detect and adjust for publication bias and p-hacking, such as P-curve and PET-PEESE, would be powerless to detect and adjust for bias so long as a few high-effect-size farces remain in the dataset.

The inevitable fate of many suspiciously successful datasets.
As Nick Brown points out, this may be the safest strategy for fraudsters. At present, psychologists are not expected to be competent custodians of their own data. Little of graduate training concerns data archival. It is not unusual for data to go missing, and I have yet to find anybody who has been censured for failure to preserve their data. In contrast, accusations of fraud or wrongdoing require strong evidence -- the kind that can only be obtained by looking at the raw data, or perhaps by finding the same mistake, made repeatedly across a lifetime of fraudulent research. Somebody could go far by making up rubbish and saying the data were stolen by soccer hooligans, or whatever.

For a stronger, more replicable science, we must do more to train scientists in data management and incentivize data storage and sharing. Open science badges are nice. They let honest researchers signal their honesty. But they are not going to save the literature so long as meta-analysis and public policy statements must tiptoe around closed-data (or the-dog-ate-my-data) studies with big, influential results.

Monday, May 16, 2016

The value-added case for open peer reviews

Last post, I talked about the benefits a manuscript enjoys in the process of scientific publication. To me, it seems that the main benefits are that an editor and some number of peer reviewers read it and suggest edits. Somehow, despite this part coming from volunteer labor, it still manages to cost $1500 an article.

And yet, as researchers, we can't afford to try to do without the journals. When the paper appears with a sagepub.com URL on it, readers now assume it to be broadly correct. The journal publication is part of the scientific canon, whereas the preprint was not.

Since the peer reviews are what really elevates the research from preprint to publication, I think the peer reviews should be made public, as part of the article's record. This will open the black box and encourage readers to consider: Who thinks this article is sound? What do they think are the strengths and weaknesses of the research? Why?

By comparison, the current system provides only the stamp of approval. But we readers and researchers know that the stamp of approval is imperfect. The process is capricious. Sometimes duds get published. Sometimes worthy studies are discarded. If we're going to place our trust in the journals, we need to be able to check up on the content and process of peer review.

Neuroskeptic points out that, peer review being what it is, perhaps there should be fewer journals and more blogs. The only difference between the two, in Neuro's view, is that a journal implies peer review, which implies the assent of the community. If journal publication implies peer approval, shouldn't journals show the peer reviews to back that up? And if peer approval is all it takes to make something scientific canon, couldn't a blogpost supported by peer reviews and revisions be equivalent to a journal publication?

Since peer review is all that separates blogging from journal publishing, I often fantasize about sidestepping the journals and self-publishing my science. Ideally, I would just upload a preprint to OSF. Alongside the preprint there would be the traditional 2-5 uploaded peer reviews.

Arguably, this would provide an even higher standard of peer review, in that readers could see the reviews. This would compare favorably with the current system, in which howlers are met with unanswerable questions like "Who the heck reviewed this thing?" and "Did nobody ask about this serious flaw?"

Maybe one day we'll get there. In the meantime, so long as hiring committees, tenure committees, and granting agencies are willing to accept only journal publications as legitimate, scientists will remain powerless to self-publish. Until that changes, the peer reviews should really be open. The peer reviews are what separate preprint from article, and we pay millions of dollars a year to maintain that boundary, so we might as well place greater emphasis and transparency on that piece of the product.

Saturday, May 14, 2016

Be Your Own Publisher?

The problem with paying any 3rd party for academic publishing is that these 3rd parties are corporations. Corporations have the defining goal of making as much profit as possible by providing a service.

This goal is often at odds with what is best for science. Under the traditional publishing model, financial considerations favor the strategy of hoarding all the most exciting research and leasing it out for incredible subscription fees. Researchers stretch their data to try to get the most extraordinary story so that they can get published in the most exclusive journal. Under the Open Access publishing model, financial considerations favor the strategy of publishing as many papers as possible so long as the average paper quality is not so poor that it causes the journal's reputation to collapse.

Subscription journals apparently cost the educational system billions of dollars a year. Article processing fees at open-access journals tend to sit at a cool $1500. How can it be so expensive to throw a .PDF file up on the internet?

Let's consider the advantages a published article has relative to a preprint on my GitHub page. Relative to the preprint, the science in a published article has added value from:
1) Peer reviewers, who provide needed criticism and skepticism. (Cost: $0)
2) Editors, who provide needed criticism, skepticism and curation. (Cost: $0)
3) Publicity and dissemination for accepted articles (Cost: Marketing budget)
4) Typesetting and file hosting (Cost: $1500 an article, apparently)

The value-added to researchers comes from the following sources:
1) The perceived increase in legitimacy associated with making it past peer review (Value: Priceless)
2) Prestige associated with being picked out for curation. (Value: Priceless)

It leads me to wonder: What might be so wrong with universities, laboratories, and researchers simply using self-publishing? Websites like arXiv, SSRN, OSF, and GitHub provide free hosting for PDFs and supplementary files.

If the main thing that distinguishes a preprint from an article is that between two and five people have read it and okayed it, and if that part costs nothing, why not save a heap of money and just have people post peer reviews on your preprint? (Consider Tal Yarkoni's suggestion of a Reddit-like interface for discussion, curation, and ranking.)

Is it possible that we might one day cut out the middleman and allow ourselves to enjoy the benefits of peer review without the enormous financial burden? Or does institutional inertia make it impossible?

Maybe this fall my CV can have a section for "Peer-reviewed manuscripts not published in journals."

Wednesday, May 4, 2016

Post-pub peer review should be transparent too

A few weeks ago, I did a little post-publication peer review. It was a novel experience for me, and led me to consider the broader purpose of post-pub peer review.
In particular, I was reminded of the quarrel between Simone Schnall and Brent Donnellan (and others) back in 2014. Schnall et al. suggested an embodied cognition phenomenon wherein incidental cues of cleanliness influenced participants' ratings of moral disgust. Donnellan et al. ran replications and failed to detect the effect. An uproar ensued, goaded on by some vehement language by high-profile individuals on either side of the debate.

One thing about Schnall's experience stays with me today. In a blogpost, she summarizes her responses to a number of frequently asked questions. One answer is particularly important for anybody interested in post-publication peer review.
Question 10: “What has been your experience with replication attempts?”
My work has been targeted for multiple replication attempts; by now I have received so many such requests that I stopped counting. Further, data detectives have demanded the raw data of some of my studies, as they have done with other researchers in the area of embodied cognition because somehow this research area has been declared “suspect.” I stand by my methods and my findings and have nothing to hide and have always promptly complied with such requests. Unfortunately, there has been little reciprocation on the part of those who voiced the suspicions; replicators have not allowed me input on their data, nor have data detectives exonerated my analyses when they turned out to be accurate.
I invite the data detectives to publicly state that my findings lived up to their scrutiny, and more generally, share all their findings of secondary data analyses. Otherwise only errors get reported and highly publicized, when in fact the majority of research is solid and unproblematic.
[Note: Donnellan and colleagues were not among these data detectives. They did only the commendable job of performing replications and reporting the null results. I mention Donnellan et al. only to provide context -- it's my understanding that the failure to replicate led to third-party detectives' attempts to detect wrongdoing through analysis of the original Schnall et al. dataset. It is these attempts to detect wrongdoing that I refer to below.]

It is only fair that these data detectives report their analyses and how they failed to detect wrongdoing. I don't believe Schnall's phenomenon for a second, but the post-publication reviewers could at least report that they don't find evidence of fraud.

Data detectives themselves can run the risk of p-hacking and selective reporting. Imagine ten detectives run ten tests each. If all tests are independent, eventually one test will emerge with a very small p-value. If anyone is going to make accusations according to "trial by p-value," then we had damn well better consider the problems of multiple comparisons and the garden of forking paths.
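To put rough numbers on the ten-detectives example: if every one of the 100 tests is an independent test of a true null, the chance that at least one falls below a threshold is 1 - (1 - alpha)^100. The helper name `p_any_below()` is just mine for illustration.

```r
# Chance that at least one of k independent tests of a true null
# falls below a given p-value threshold.
p_any_below <- function(alpha, k) 1 - (1 - alpha)^k

k <- 10 * 10                          # ten detectives, ten tests each
any_05  <- p_any_below(.05, k)        # some p < .05 somewhere: about 0.994
any_001 <- p_any_below(.001, k)       # some p < .001 somewhere: about 0.095
round(c(p_below_.05 = any_05, p_below_.001 = any_001), 3)
```

So a stray p < .05 is all but guaranteed, and even a p < .001 turns up in roughly one search in ten when nothing is wrong with the data.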

Post-publication peer review is often viewed as a threat, but it can and should be a boon, when appropriate. A post-pub review that finds no serious problems is encouraging, and should be reported and shared.* By contrast, if every data request is a prelude to accusations of error (or worse), then it becomes upsetting to learn that somebody is looking at your data. But data inspection should not imply that there are suspicions or serious concerns. Data requests and data sharing should be the norm -- they cannot be a once-in-a-career disaster.

Post-pub peer review is too important to be just a form of witch-hunting.
It's important, then, that post-publication peer reviewers give the full story. If thirty models give the same result, but one does not, you had better report all thirty-one models.** If somebody spends the time and energy to deal with your questions, post the answers so that the authors need not answer the questions all over again.

I do post-publication peer review because I generally don't trust the literature. I don't believe results until I can crack them open and run my fingers through the goop. I'm a tremendous pain in the ass. But I also want to be fair. My credibility, and the value of my peer reviews, depends on it.

The Court of Salem reels in terror at the perfect linearity of Jens Forster's sample means.

* Sakaluk, Williams, and Biernat (2014) suggest that, during pre-publication peer review, one reviewer run the code to make sure they get the same statistics. This would cut down on the number of misreported statistics. Until that process is a common part of pre-publication peer review, it will always be a beneficial result of post-publication peer review.

** Simonsohn, Simmons, and Nelson suggest specification curve, which takes the brute-force approach to this by reporting every possible p-value from every possible model. It's cool, but I haven't tried to implement it yet.

Friday, April 15, 2016

The Undergraduate Thesis Banquet

An unusually lavish undergraduate honors banquet. (Image pinched from TheTimes.co.uk)
Some time ago, I got to attend a dinner for undergraduates who had completed an honors thesis in psychology. For each of these undergraduates, their faculty advisor would stand up and say some nice things about them.

The advisors would praise students for their motivation, their ideas, their brilliance, etc. etc. And then they would say something about the student's research results.

For some students, the advisor would say, with regret, that the student's idea hadn't borne fruit. "It was a great idea, but the data didn't work out..." they'd grimace, before concluding, "Anyway, I'm sure you'll do great." In these cases one knows that the research project is headed for the dustbin.

For other students, the advisor would gush, "So-and-so's an incredible student, they ran the best project, we got some really great data, and we're submitting an article to [Journal X]."

Somewhere in this, one gets the impression that the significance of results indicates the quality of a research assistant. Significant results are headed for the journals, while nonsignificant results are rewarded with a halfhearted, "Well, you tried."

I suspect that there is a heuristic at play that goes something like this: Effect size is a ratio of signal to noise. Good RAs collect clean data, while bad RAs collect noisy data. Therefore, a good RA will find significant results, while a bad RA might not.

But that, of course, assumes there is signal to be found. If the null is true, there's no signal, no matter how precise your RAs.
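A quick sketch of that point, using the noncentral t distribution to get the power of a two-sample t-test. The sample size, effect size, and noise levels are made-up values for illustration, and `t_test_power()` is a hypothetical helper, not a base R function.

```r
# Power of a two-sided, two-sample t-test via the noncentral t distribution.
# delta = true mean difference, sd = within-group noise, n = per-group size.
t_test_power <- function(delta, sd, n, alpha = .05) {
  df   <- 2 * n - 2
  ncp  <- (delta / sd) * sqrt(n / 2)
  crit <- qt(1 - alpha / 2, df)
  pt(crit, df, ncp = ncp, lower.tail = FALSE) + pt(-crit, df, ncp = ncp)
}

# Real effect: the careful RA (low noise) reaches significance far more often.
power_clean <- t_test_power(delta = .5, sd = 1, n = 50)
power_noisy <- t_test_power(delta = .5, sd = 2, n = 50)

# Null effect: both RAs reach significance at the alpha rate and no more.
null_clean <- t_test_power(delta = 0, sd = 1, n = 50)
null_noisy <- t_test_power(delta = 0, sd = 2, n = 50)

round(c(power_clean, power_noisy, null_clean, null_noisy), 3)
```

When there is a signal, precision buys power. When there isn't, the clean and noisy data sets are equally likely to come up "significant."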

In any case, as unfair as it is, it's probably good for the undergrads to learn how the system works. But I'm hoping that at the next such banquet, the statistical significance of an undergrad's research results will have little to do with their perceived competence.

Monday, March 28, 2016

Asking for advice re: causal inference in SEM

I'm repeatedly running into an issue in causal interpretation of SEM models. I'm not sure what to make of it, so I want to ask everybody what they think.

Suppose one knows A and B to be highly correlated in the world, but one doesn't know whether there is causality between them.

In an experiment, one stages an intervention. Manipulation X causes a difference in levels of A between the control and treatment groups.

Here's the tricky part. Suppose one analyses the data gleaned from this experiment using SEM. One makes an SEM with paths X -> A -> B. Each path is statistically significant. This is presented as a causal model indicating that manipulation X causes changes in A, which in turn cause changes in B. 

Paths X->A and A->B are significant, but X->B is not. Is a causal model warranted?

However, if one tests the linear models A = b1×X and B = b2×X, one finds that b1 is statistically significant, but b2 is not. (Note that I am not referring to the indirect effect of X on B after controlling for A. Rather, the "raw" effect of X on B is not statistically significant.)

This causes my colleagues and me to wonder: Does the SEM support the argument that, by manipulation of X, one can inflict changes in A, causing downstream changes in B? Or does this inject new variance in A that is unrelated to B, but the SEM fits because of the preexisting large correlation between A and B?
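To make the second possibility concrete, here is a small simulation of one hypothetical data-generating process in which it is true by construction: A and B share an unmeasured common cause U, X shifts A, and nothing causes B except U. The variable U, the sample size, and the coefficients are all assumptions for illustration; this is not a claim about any real dataset, and it does not settle which reading is right in general.

```r
set.seed(1)
n <- 10000
X <- rbinom(n, 1, .5)                # randomized manipulation
U <- rnorm(n)                        # unmeasured common cause of A and B
A <- U + 1 * X + rnorm(n, sd = .5)   # X shifts A
B <- U + rnorm(n, sd = .5)           # B depends on neither A nor X

b_XA <- coef(lm(A ~ X))[["X"]]       # X -> A path
b_AB <- coef(lm(B ~ A))[["A"]]       # A -> B path
b_XB <- coef(lm(B ~ X))[["X"]]       # "raw" effect of X on B
round(c(X_to_A = b_XA, A_to_B = b_AB, X_to_B = b_XB), 2)
```

In this setup both the X -> A and A -> B slopes are large while the raw X -> B slope hovers around zero, even though manipulating A does nothing to B.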

Can you refer me to any literature on this issue? What are your thoughts?

Thanks for any help you can give, readers.

Tuesday, March 22, 2016

Results-blinded Peer Review

The value of any experiment rests on the validity of its measurements and manipulations. If the manipulation doesn't have the intended effect, or the measurements are just noise, then the experiment's results will be uninformative.

This holds whether the results are statistically significant or not. A nonsignificant result, obviously, could be the consequence of an ineffective manipulation or a noisy outcome variable. But when the method is invalid, a significant result is uninformative too -- it is either a Type I error or a reflection of bias in the measurement.

The problem I have is that the reader's (or at least, the reviewer's) perception of the method's validity often hinges upon the results obtained. Where a significant result might have been hailed as a successful conceptual replication, a nonsignificant result might be dismissed as a departure from appropriate methodology.

It makes me consider this puckish lesson from Archibald Cochrane, as quoted and summarized on Ben Goldacre's blog:
The results at that stage showed a slight numerical advantage for those who had been treated at home. I rather wickedly compiled two reports: one reversing the number of deaths on the two sides of the trial. As we were going into the committee, in the anteroom, I showed some cardiologists the results. They were vociferous in their abuse: “Archie,” they said “we always thought you were unethical. You must stop this trial at once.”
I let them have their say for some time, then apologized and gave them the true results, challenging them to say as vehemently, that coronary care units should be stopped immediately. There was dead silence and I felt rather sick because they were, after all, my medical colleagues.
Perhaps, just once in a while, such a results-blinded manuscript should be submitted to a journal. Once Reviewers 1, 2, and 3 have all had their say about the ingenuity of the method, the precision of the measurements, and the adequacy of the sample size, the true results could be revealed, and one could see how firmly the reviewers hold to their earlier arguments.

Thankfully, the increasing prevalence of Registered Reports may forestall the need for any such underhanded prank. Still, it is fun to think about.

Saturday, March 19, 2016

I Was Wrong!

Yesterday, ResearchGate suggested that I read a new article reporting that ego depletion can cause aggressive behavior. This was a surprise to me because word has it that ego depletion does not exist, so surely it cannot be a cause of aggressive behavior.

The paper in question looks about like you'd expect: an unusual measure of aggression, a complicated 3 (within) × 2 (between) × 2 (between) design, a covariate tossed into the mix just for kicks, a heap of measures collected and mentioned in a footnote but not otherwise analyzed. It didn't exactly change my mind about ego depletion, much less its role in aggressive behavior.

But it'd be hypocritical of me to criticize this ill-timed paper without mentioning the time I reported an ego-depletion effect through effect-seeking, exploratory analysis. I've also been meaning to change my blogging regimen up a bit. It's time I switched from withering criticism to withering self-criticism.

The paper is Engelhardt, Hilgard, and Bartholow (2015), "Acute exposure to difficult (but not violent) video games dysregulates cognitive control." In this study, we collected a hearty sample (N = 238) and had them play one of four modified versions of a first-person shooter game, a 2 (Violence: low, high) × 2 (Difficulty: low, high) between-subjects design.

To manipulate violence, I modified the game's graphics. The violent version had demons and gore and arms bouncing across the floor, whereas the less violent version had silly-looking aliens being warped home. We also manipulated difficulty: Some participants played a normal version of the game in which monsters fought back, while other participants played a dumb-as-rocks version where the monsters walked slowly towards them and waited patiently to be shot.

After the game, participants performed a Spatial Stroop task. We measured the magnitude of the compatibility effect, figuring that larger compatibility effects would imply poorer control. We also threw in some no-go trials, on which participants were supposed to withhold a response.

Our hypothesis was that playing a difficult game would lead to ego depletion, causing poorer performance on the Spatial Stroop. This might have been an interesting refinement on the claim that violent video games teach their players poorer self-control.

We looked at Stroop compatibility and found nothing. We looked at the no-go trials and found nothing. Effects of neither violence nor of difficulty. So what did we do?

We needed some kind of effect to publish, so we reported an exploratory analysis, finding a moderated-mediation model that sounded plausible enough.

We figured that maybe the difficult game was still too easy. Maybe participants who were more experienced with video games would find the game to be easy and so would not have experienced ego depletion. So we split the data again according to how much video game experience our participants had, figuring that maybe the effect would be there in the subgroup of inexperienced participants playing a difficult game.

The conditional indirect effect of game difficulty on Stroop compatibility as moderated by previous game experience wasn't even, strictly speaking, statistically significant: p = .0502. And as you can see from our Figure 1, the moderator is very lopsided: only 25 people out of the sample of 238 met the post-hoc definition of "experienced player."

And the no-go trials on the Stroop? Those were dropped from analysis: our footnote 1 says our manipulations failed to influence behavior on those trials, so we didn't bother talking about them in the text.

So to sum it all up, we ran a study, and the study told us nothing was going on. We shook the data a bit more until something slightly more newsworthy fell out of it. We dropped one outcome and presented a fancy PROCESS model of the other. (I remember at some point in the peer review process being scolded for finding nothing more interesting than ego depletion, which was accepted fact and old news!)

To our credit, we explicitly reported the exploratory analyses as being exploratory, and we reported p = .0502 instead of rounding it down to "statistically significant, p = .05." But at the same time, it's embarrassing that we structured the whole paper to be about the exploratory analysis, rather than the null results. 

In the end, I'm grateful that the RRR has set the record straight on ego depletion. It means our paper probably won't get cited much except as a methodological or rhetorical example, but it also means that our paper isn't going to clutter up the literature and confuse things in the future. 

In the meantime, it's shown me how easily one can pursue a reasonable post-hoc hypothesis and still land far from the truth. And I still don't trust PROCESS.

Wednesday, March 16, 2016

The Weapons Priming Effect, Pt. 2: Meta-analysis

Even in the 1970s the Weapons Priming Effect was considered hard to believe. A number of replications were conducted, failed to find an effect, and were published (Buss, Booker, & Buss, 1972; Ellis, Weiner, & Miller, 1971; Page & Scheidt, 1971).

Remarkable to think that in 1970 people could publish replications with null results, isn't it? What the hell happened between 1970 and 2010? Anyway...

To try to resolve the controversy, the results were aggregated in a meta-analysis (Carlson et al., 1990). To me, this is an interesting meta-analysis. It is interesting because the median cell size is about 11, and the largest is 52. 80% of the cells are of size 15 or fewer.

Carlson et al. concluded "strong support" for "the notion that incidentally-present negative or aggression cues generally enhance aggressiveness among individuals already experiencing negative affect." However, across all studies featuring only weapons as cues, "a nonsignificant, near-zero average effect-size value was obtained."

Carlson et al. argue that this is because of two equal but opposite forces (emphasis mine):
Among subjects whose hypothesis awareness or evaluation apprehension was specifically elevated by an experimental manipulation or as a natural occurrence, as determined by a post-session interview, the presence of weapons tended to inhibit aggression. In contrast, the presence of weapons enhanced the aggression of nonapprehensive or less suspicious individuals.

In short, Carlson et al. argue that when participants know they're being judged or evaluated, seeing a gun makes them kick into self-control mode and aggress less. But when participants are less aware, seeing a gun makes them about d = 0.3 more aggressive.

I’d wanted to take a quick look for potential publication bias. I took the tables out of the PDF and tried to wrangle them back into CSV. You can find that table and some code in a GitHub repo here.

So far, I've only been able to confirm the following results:

First, I confirm the overall analysis suggesting an effect of aggression cues in general (d = 0.26 [0.15, 0.36]). However, there's a lot of heterogeneity here (I^2 = 73.5%), so I wonder how helpful a conclusion that is.

Second, I can confirm the overall null effect of weapons primes on aggressive behavior (d = 0.05, [-0.21, 0.32]). Again, there's a lot of heterogeneity (I^2 = 71%).

However, I haven't been able to confirm the stuff about splitting by sophistication. Carlson et al. don't do a very good job of reporting these codings in their table. They'll mention in a cell sometimes "low sophistication." As best I can tell, unless the experimenter specifically reported subjects as being hypothesis- or evaluation-aware, Carlson et al. consider the subjects to be naive.

But splitting up the meta-analysis this way, I still don't get any significant results -- just a heap of heterogeneity. Among the Low Awareness/Sophistication group, I get d = 0.17 [-0.15, 0.49]. Among the High Awareness/Sophistication group, I get d = -0.30 [-0.77, 0.16]. Both are still highly contaminated by heterogeneity (Low Awareness: 76% I^2; High Awareness: 47% I^2), indicating that maybe these studies are too different to really be mashed together like this.
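For readers unfamiliar with I^2, here is a minimal base-R sketch of how a random-effects estimate and I^2 can be computed. The effect sizes and variances are invented for illustration and are NOT the Carlson et al. data. I also don't know which estimator the original analysis used; DerSimonian-Laird is swapped in here as a common default.

```r
# Hypothetical effect sizes (d) and sampling variances; not real data.
yi <- c(0.10, 0.50, -0.20, 0.60)
vi <- rep(0.04, 4)

wi   <- 1 / vi                                    # fixed-effect weights
Q    <- sum(wi * (yi - weighted.mean(yi, wi))^2)  # Cochran's Q
df   <- length(yi) - 1
I2   <- max(0, (Q - df) / Q) * 100                # % of variation beyond sampling error
tau2 <- max(0, (Q - df) / (sum(wi) - sum(wi^2) / sum(wi)))  # DerSimonian-Laird

wi_re <- 1 / (vi + tau2)                          # random-effects weights
est   <- weighted.mean(yi, wi_re)
se    <- sqrt(1 / sum(wi_re))
round(c(d = est, lower = est - 1.96 * se, upper = est + 1.96 * se, I2 = I2), 2)
```

With these made-up numbers the pooled estimate is d = 0.25 [-0.11, 0.61] with I^2 of about 71%: a wide interval straddling zero, with most of the spread coming from disagreement between studies rather than sampling error.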

There's probably something missing from the way I'm doing it vs. how Carlson et al. did it. Often, several effect sizes are entered from the same study. This causes some control groups to be double- or triple-counted, overestimating the precision of the study. I'm not sure if that's how Carlson et al. handled it or not.

It goes to show how difficult it can be to replicate a meta-analysis even when you've got much of the data in hand. Without a full .csv file and the software syntax, reproducing a meta-analysis is awful.

A New Meta-Analysis
It'd be nice to see the Carlson et al. meta-analysis updated with a more modern review. Such a review could contain more studies. The studies could have bigger sample sizes. This would allow for better tests of the underlying effect, better adjustments for bias, and better explorations of causes of heterogeneity.

Arlin Benjamin Jr. and Brad Bushman are working on just such a meta-analysis, which seems to have inspired, in part, Bushman's appearance on Inquiring Minds. The manuscript is under revision, so it is not yet public. They've told me they'll send me a copy once it's accepted.

It's my hope that Benjamin and Bushman will be sure to include a full .csv file with clearly coded moderators. A meta-analysis that can't be reproduced, examined, and tested is of little use to anyone.

Wednesday, March 9, 2016

The Weapons Priming Effect

"Guns not only permit violence, they can stimulate it as well. The finger pulls the trigger, but the trigger may also be pulling the finger." - Leonard Berkowitz

There is a theory in social psychology that aggressive behaviors can be stimulated by simply seeing a weapon. I have been skeptical of this effect for a while, as it sounds suspiciously like Bargh-style social priming. The manipulations are very subtle and the outcomes are very strong, and sometimes opposite to the direction one might expect. This is the first of several posts describing my mixed and confused feelings about this priming effect and my ongoing struggle to sate my curiosity.

The original finding
First, let me describe the basic phenomenon. In 1967, two psychologists reported that simply seeing a gun was enough to stimulate aggressive behavior. This suggested a surprising new cause of aggressive behavior, in that simply seeing aggressive primes could provoke aggressive behavior.

In their experiment, Berkowitz and LePage asked participants to perform a task in a room. The design was a 3 (Object) × 2 (Provocation) + 1 design. The object manipulation varied what, if anything, was lying on a table in the room. In one condition, the objects were a shotgun and a revolver; the participant was told the weapons belonged to the other participant. In another condition, the objects were again the shotgun and revolver, but the participant was told the weapons belonged to the previous experimenter. In a third condition, there were no objects on the table.

The provocation manipulation consisted of how many shocks the participant received from the other participant. Participants were provoked by receiving either 1 or 7 electrical shocks.

The extra cell consisted of participants in a room with squash racquets instead of guns. All of these participants were strongly provoked.

So that's 100 participants in a 3 (Object: Confederate's Guns, Experimenter's Guns, Nothing) × 2 (Provocation: Mild, Strong) + 1 (Squash Racquets, Strong Provocation) design. That's about 14 subjects per cell.

The researchers hypothesized that, because shotguns are weapons, they are associated with aggression and violence. Exposure to a shotgun, then, should increase the accessibility of aggressive thoughts. The accessibility of aggressive thoughts, in turn, should increase the likelihood of aggressive behavior.

Berkowitz and LePage found results consistent with their hypothesis. Participants who saw a shotgun (and were later provoked) were more aggressive than participants who saw nothing. They were also more aggressive than participants who had been heavily provoked but had seen squash racquets. The participants who saw the shotgun gave the confederate more and longer electrical shocks.

Extensions and Public Policy 
I'd been curious about this effect for a long time. I do some aggression research, and my PhD advisor conducted some elaborations on the Berkowitz and LePage study in his early career. But I really grew curious when I listened to Brad Bushman's appearance on Mother Jones' "Inquiring Minds" podcast.

Bushman joined the podcast to talk about the science of gun violence. About the first half of the episode is devoted to the Weapons Priming Effect. Bushman argues that one step to reducing gun violence would be to make guns less visible. For example, guns could be kept in opaque safes rather than in clear display cases. Reducing the opportunities for aggressive-object priming would be expected to reduce aggression and violence in society.

[Image caption: Would you mess with someone who had this in their rear window?]
In the podcast, Bushman mentions one of the more bizarre and counterintuitive replications of the weapons priming effect. Turner, Layton, and Simons (1975) report an experiment in which an experimenter driving a pickup truck loitered at a traffic light. When the light turned green, the experimenter idled for a further 12 seconds, waiting to see if the driver trapped behind would honk. Honking, the researchers argued, would constitute a form of aggressive behavior.

The design was a 3 (Prime) × 2 (Visibility) design. For the Prime factor, the experimenter's truck featured either an empty gun rack (control), a gun rack with a fully-visible .303-caliber military rifle and a bumper sticker with the word "Friend" (Friendly Rifle), or a gun rack with a .303 rifle and a bumper sticker with the word "Vengeance" (Aggressive Rifle). The experimenter driving the pickup was made visible or invisible by the use of a curtain in the rear window.

There were 92 subjects, about 15/cell. The sample is restricted to males driving late-model privately-owned vehicles for some reason.

The authors reasoned that seeing the rifle would prime aggressive thoughts, which would inspire aggressive behavior, leading to more honking. They run five different planned complex contrasts and find that the Rifle/Vengeance combination increased honking relative to the No Rifle and Rifle/Friend combinations, but only when the curtain was closed, F(1, 86) = 5.98, p = .017. That seems like a very suspiciously post-hoc subgroup analysis to me.
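
To put a rough number on that worry, here is a small sketch of how quickly false positives pile up across several tests. It assumes each of the five contrasts is tested at alpha = .05 and that the contrasts are independent; both are my simplifying assumptions, not details taken from the paper, and the function name is made up for illustration.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests tests.

    Assumes every null hypothesis is true and the tests are independent,
    which is a simplification (contrasts on the same data usually overlap).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# Five contrasts, as in Study 1 of Turner et al. (1975).
rate = familywise_error_rate(5)
print(f"Chance of at least one p < .05 by luck alone: {rate:.1%}")
```

Under those assumptions, one significant contrast out of five is much less surprising than a lone p = .017 makes it sound.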

A second study in Turner, Layton, and Simons (1975) collects a larger sample of men and women driving vehicles of all years. The design was a 2 (Rifle: present, absent) × 2 (Bumper Sticker: "Vengeance", absent) design with 200 subjects. They divide this further by driver's sex and by a median split on vehicle year. They find that the Rifle/Vengeance condition increased honking relative to the other three, but only among newer-vehicle male drivers, F(1, 129) = 4.03, p = .047. But then they report that the Rifle/Vengeance condition decreased honking among older-vehicle male drivers, F(1, 129) = 5.23, p = .024! No results were found among female drivers.
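
For anyone who wants to check these statistics, the reported p-values can be recomputed from the F values and degrees of freedom alone. What follows is a minimal sketch using only the Python standard library, not the authors' analysis. It relies on the fact that an F test with 1 numerator degree of freedom is equivalent to a two-sided t test with t = sqrt(F), and it integrates the t density numerically; the function names are my own.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def p_from_f(f_stat, df_error, steps=2000):
    """Two-sided p-value for F(1, df_error), using t = sqrt(F).

    Integrates the t density from 0 to sqrt(F) with Simpson's rule
    (steps must be even), then doubles the remaining upper tail.
    """
    upper = math.sqrt(f_stat)
    h = upper / steps
    total = t_pdf(0.0, df_error) + t_pdf(upper, df_error)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df_error)
    area = total * h / 3  # P(0 < T < sqrt(F))
    return 1.0 - 2.0 * area


# (label, F, error df, reported p) as given in the text above.
reported = [
    ("Study 1, curtain closed", 5.98, 86, 0.017),
    ("Study 2, newer-vehicle males", 4.03, 129, 0.047),
    ("Study 2, older-vehicle males", 5.23, 129, 0.024),
]

for label, f_stat, df_error, p_reported in reported:
    p = p_from_f(f_stat, df_error)
    print(f"{label}: F(1, {df_error}) = {f_stat:.2f}, "
          f"recomputed p = {p:.3f} (reported {p_reported:.3f})")
```

The recomputed values agree with the reported ones to within rounding, so the arithmetic holds up; the concern is with how many subgroups were tested, not with the numbers themselves.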

Overgeneralizing from Turner et al. (1975)
I was surprised to find that the results in Turner et al. (1975) depended so heavily on the analysis of subgroups. In the past, whenever people told me about this experiment, they'd always just mentioned an increase in honking among those who'd seen a rifle.

Take, for example, this piece from Bushman's Psychology Today blog. Reading it, one gets the impression that a significant increase in honking was present across all groups, when in fact the increase appeared only in particular subgroups and another subgroup showed a significant decrease:
The weapons effect occurs outside of the lab too. In one field experiment,[2] a confederate driving a pickup truck purposely remained stalled at a traffic light for 12 seconds to see whether the motorists trapped behind him would honk their horns (the measure of aggression). The truck contained either a .303-calibre military rifle in a gun rack mounted to the rear window, or no rifle. The results showed that motorists were more likely to honk their horns if the confederate was driving a truck with a gun visible in the rear window than if the confederate was driving the same truck but with no gun. What is amazing about this study is that you would have to be pretty stupid to honk your horn at a driver with a military rifle in his truck—if you were thinking, that is! But people were not thinking—they just naturally honked their horns after seeing the gun. The mere presence of a weapon automatically triggered aggression.
On Inquiring Minds, Bushman again acknowledges that the effect is, a priori, implausible. One should think twice before honking at an armed man, after all! In my estimation, counter-intuitive effects should be judged carefully, as they are less likely to be real. But this implausibility does not dampen Bushman's enthusiasm for the effect. If anything, it kindles it.

Next Posts
Naturally, the literature on weapon priming is not limited to these two papers. In subsequent posts, I hope to talk about meta-analyses of the effect. I also hope to talk about the role of science in generating and disseminating knowledge about the effect. But this post is long enough -- let's call it at this for now.