
A psychologist's thoughts on how and why we play games

Sunday, October 23, 2016

Outrageous Fortune: 2. The variability of rare drops

Growing up, I played a lot of role-playing games for the Super Nintendo. One trope of late-game design for role-playing games is the rare drop -- a highly desirable item that has a low probability of appearing after a battle. These items are generally included as a way to let players kill an awful lot of time as they roll the dice again and again trying to get the desired item.

For example, in Final Fantasy 4, there is an item called the "pink tail" that grants you the best armor in the game. It has a 1/64 chance of dropping when you fight a particular monster. In Earthbound, the "Sword of Kings" has a 1/128 chance of dropping when you fight a particular monster. In Pokemon, there are "shiny" versions of normal Pokemon that have a very small chance of appearing (in the newest games, the chance is something like 1/4096).

Watching people try to get these items reveals an interesting misunderstanding about how probability works. Intuitively, it makes sense that if the item has a 1/64 chance of dropping, then by the time you've fought the monster 64 times, you should have a pretty good chance of having the item.

Although it's true that the average number of required pulls is 64, there's still a substantial role of chance. This is a recurring theme in gaming and probability -- yes, we know what the average experience is, but the amount of variability around that can be quite large. (See also my old post on how often a +5% chance to hit actually converts into more hits.)

It turns out that if your desired item has a drop rate of 1/64, then after 64 pulls, there's only about a 63.5% chance that you have the item.
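That figure follows from the complement rule: the chance of at least one drop in n independent pulls is 1 - (1 - p)^n. Here is a minimal sketch in R, with a quick simulation as a sanity check. The variable names, the seed, and the number of simulated players are my own choices for illustration.

```r
p <- 1/64   # drop rate per pull
n <- 64     # number of pulls

# Chance of at least one drop in n independent pulls
closed_form <- 1 - (1 - p)^n
closed_form   # about 0.635

# Sanity check by simulation: rgeom() draws the number of failed pulls
# before the first drop, so adding 1 gives the pull on which the item drops.
set.seed(1)
pulls_needed <- rgeom(1e5, prob = p) + 1
simulated <- mean(pulls_needed <= n)   # share of simulated players holding the item by pull 64
simulated
mean(pulls_needed)                     # average pulls needed, close to 64
```

The average really is about 64 pulls, yet more than a third of the simulated players are still empty-handed at that point.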

To understand the variability around the drop chance, we have to use the negative binomial distribution. The binomial distribution takes in a number of attempts and a probability of success to tell us how many successes to expect. The negative binomial inverts this: it takes in the number of desired successes and the probability of success to tell us how many attempts we'll need to make.

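In R, we can model this with the function pnbinom(), which gives us the cumulative distribution function. This tells us what proportion of players will get a success by X number of pulls. (Note that R's pnbinom() counts failures before the target success rather than total attempts.)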
Let's look at some examples.
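Before the examples, one detail is worth checking: pnbinom(q, size, prob) treats q as the number of failures before the size-th success, not the total number of attempts. With size = 1, "a drop within x pulls" therefore corresponds to at most x - 1 failures. A small sketch confirming this against the closed form (the variable names are mine):

```r
p <- 1/128
x <- c(1, 10, 64, 128, 400)   # number of pulls

by_hand   <- 1 - (1 - p)^x                       # at least one drop in x pulls
shifted   <- pnbinom(x - 1, size = 1, prob = p)  # at most x - 1 failures before the drop
unshifted <- pnbinom(x, size = 1, prob = p)      # really answers "within x + 1 pulls"

all.equal(by_hand, shifted)                  # TRUE
all.equal(unshifted, 1 - (1 - p)^(x + 1))    # TRUE
all.equal(shifted, pgeom(x - 1, prob = p))   # TRUE: size = 1 is the geometric distribution
```

The off-by-one matters most at small x: for a single pull, the unshifted call returns roughly double the true drop rate.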

Earthbound: The Sword of Kings

The Sword of Kings has a drop rate of 1/128 when the player fights a Starman Super. In a normal game, a player probably fights five or ten Starman Supers. How many players can be expected to find a Sword of Kings in the course of normal play? Some players will decide they want a Sword of Kings and will hang out in the dungeon fighting Starman Supers ("Starmen Super"?) until they find one. How long will most players have to grind to get this item?

By the time the player finds a Sword of Kings, the party is probably so overleveled from killing dozens of Starman Supers that they don't really need the Sword anyway. (From http://starmendotnet.tumblr.com/post/96897409329/sijbrenschenkels-finally-found-sword-of)


We use pnbinom() to get the cumulative probabilities. We'll use the dplyr package too because I like piping and being able to use filter() later for a nice table.

library(dplyr)
# pnbinom() counts failures before the first success, so "a drop within
# x fights" means at most x - 1 failed fights.
eb <- data.frame(x = 1:800) %>% 
  mutate(p = pnbinom(x - 1, size = 1, prob = 1/128))

filter(eb, x %in% c(1, 10, 50, 100, 128, 200, 256, 400))

with(eb, plot(x, p, type = 'l',
              xlab = "Starman Supers defeated",
              ylab = "Probability of at least one drop",
              main = "Grinding for a Sword of Kings"))




A lucky 8% of all players will get the item in their first ten fights, about as many Starman Supers as one might meet in the course of normal play. 21% of players still won't have gotten one after two hundred combats, and 4% of players won't have gotten the Sword of Kings even after fighting four hundred Starman Supers!
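To answer "how long will most players have to grind?" directly, the quantile function can be used instead of reading values off the curve. This sketch uses qgeom(), the size = 1 special case of the negative binomial. It returns the number of failures before the first drop, so I add 1 to count the successful fight. The quantile levels are my own picks.

```r
p <- 1/128
quantile_levels <- c(0.50, 0.90, 0.95)

# Smallest number of fights by which each share of players has the sword
fights_needed <- qgeom(quantile_levels, prob = p) + 1
data.frame(share_of_players = quantile_levels, fights = fights_needed)
```

The median lands well below the mean of 128 fights, while the 95th percentile is close to three times the mean: the distribution has a long right tail.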

Final Fantasy IV: The Pink Tail

There's a monster in the last dungeon of Final Fantasy IV. When you encounter monsters in a particular room, there is a 1/64 chance that you will find the "Pink Puff" monster. Every Pink Puff you kill has a 1/64 chance of dropping a pink tail.

# x - 1 because pnbinom() counts failed fights before the first success
ff4 <- data.frame(x = 1:400) %>%
  mutate(p = pnbinom(x - 1, size = 1, prob = 1/64))

filter(ff4, x %in% c(1, 5, 10, 50, 64, 100, 200))

Just to find the Pink Puff monster is a major endeavor. A lucky 1.6% of players will find a Pink Puff on their first combat, and about 15% of players will run into one in the course of normal play (10 combats). But 21% of players won't have found one even after a hundred combats, and 4% of players won't have found a Pink Puff even after two hundred combats.

Finding a pack of Pink Puffs is a 1/64 chance, and that's just the start of it.


After you find and kill the Pink Puff, it still has to drop the pink tail, which is a 1/64 chance per Pink Puff. So about 21% of players won't find a pink tail even after killing a hundred Pink Puffs. Consider next that one finds, on average, one group of Pink Puffs per 64 combats, and Pink Puffs come in groups of five. You could run through more than a thousand fights in order to find 100 Pink Puffs and still not get a pink tail. Ridiculous!
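The two stages can be folded into a single per-fight probability. This is a rough sketch under the simplifying assumptions in the paragraph above: every Pink Puff encounter contains exactly five Pink Puffs, all five are killed, and every roll is independent. The variable names are mine.

```r
p_encounter <- 1/64   # chance a fight in that room is a Pink Puff group
p_drop      <- 1/64   # chance each Pink Puff drops a pink tail
group_size  <- 5      # assumed number of Pink Puffs per encounter

# Chance that a single fight yields at least one pink tail
p_tail_per_fight <- p_encounter * (1 - (1 - p_drop)^group_size)

expected_fights <- 1 / p_tail_per_fight           # average fights per pink tail
p_no_tail_1000  <- (1 - p_tail_per_fight)^1000    # share of players with no tail after 1000 fights

c(expected_fights = expected_fights, p_no_tail_1000 = p_no_tail_1000)
```

Under these assumptions the average player needs several hundred fights, and a sizable minority are still tail-less after a thousand.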

Here's a guy on the IGN forums saying "I've been trying for a week and I haven't gotten one."

Shiny Pokemon

A "shiny" pokemon is a rarer version of any other pokemon. In the newest pokemon game, any wild pokemon has a 1/4096 chance of being shiny.

This is so rare that we'll put things into log scale so that we're calling pnbinom() 100 times rather than 22000 times.

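# x is log(encounters); subtract 1 because pnbinom() counts failed
# encounters before the first shiny
pkmn <- data.frame(x = seq(.1, 10, by = .1)) %>% 
  mutate(p = pnbinom(exp(x) - 1, size = 1, prob = 1/4096))
with(pkmn, plot(exp(x), p, type = 'l'))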

filter(pkmn, exp(x) <= 500) %>% tail(1)
filter(pkmn, exp(x) <= 2000) %>% tail(1)
filter(pkmn, exp(x) <= 10000) %>% tail(1)

11% of players will find one shiny pokemon within 500 encounters. About 39% will find one within 2000 encounters. 9% of players will grind through ten thousand encounters and still not find one.

There's a video on YouTube of a kid playing three or four gameboys at once until he finds a shiny pokemon after about 26000 encounters. (I think this was in one of the earlier pokemon games where the encounter rate was about 1/8000.) There seems to be a whole genre of YouTube streamers showing off the shiny pokemon that they went through thousands of encounters looking for.

World of Warcraft

I don't know anything about World of Warcraft, but googling for its idea of a rare drop turns up a sword with a 1 in 1,500 chance of dropping as a quest reward.

# x - 1 because pnbinom() counts failed quests before the first drop
wow <- data.frame(x = 1:5e3) %>% 
  mutate(p = pnbinom(x - 1, size = 1, prob = 1/1500))
with(wow, plot(x, p, type = 'l'))

filter(wow, round(p, 3) == .5)
filter(wow, round(p, 3) == .8)
filter(wow, x == 2500)

Suppose your dedicated fanbase decides to try grinding for this item. Players can do 1000 quests and only about half of them (49%) will get this sword. Among players running 2500 quests, 19% of players still won't have gotten one.

My Thoughts

In general, I feel like rare drops aren't worth grinding for. There's a lot of chance involved in the negative binomial distribution, and you could get very lucky or very unlucky.

Sometimes single-player RPGs seem to include them as a somewhat cynical way to keep kids busy when they have too much time to kill. In this case, players seem to know they're going to have to grind a long time, but they may not realize just how long they could grind and still not get it. The draw for these items seems to be more about the spectacle of the rarity than about the actual utility of the item.

In massively multiplayer games, it seems like drops are made so rare that players aren't really expected to grind for them. An item with a 1/1500 drop chance isn't something any one player can hope to get, even if they are deliberately trying to farm the item. Thus, rare items in MMOs are more like Willy Wonka Golden Tickets that a few lucky players get, rather than something that one determined player works to get. One player could try a thousand times and still not get the item, but across a hundred thousand players trying once, a few dozen will get it (about 67 on average), and that's enough to keep the in-game economy interesting.

My preference is that chance is something that encourages a shift in strategy rather than something that strictly makes your character better or worse. Maybe the player is guaranteed a nice item, but that nice item can be one of three things, each encouraging a different playstyle.

Still, it's fun sometimes to get something unexpected and get a boost from it. Imagine being one of the ~8% of players who gets a Sword of Kings by chance. Another approach is to provide a later way to ensure getting the desired item. In the roguelike Dungeon Crawl, some desirable items can be found early by luck, but they are also guaranteed to appear later in the game by skill.

Anyway, don't bet on getting that rare drop -- you may find yourself grinding much longer than you'd thought.

Tuesday, October 18, 2016

Publishing the Null Shows Sensitivity to Data

Some months ago, a paper argued for the validity of an unusual measurement of aggression. According to this paper, the number of pins a participant sticks into a paper voodoo doll representing their child seems to be a valid proxy for aggressive parenting.

Normally, I might be suspicious of such a paper because the measurement sounds kind of farfetched. Some of my friends in aggression research scoffed at the research, calling bullshit. But I felt I could trust the research.

Why? The first author has published null results before.

I cannot stress enough how much an author's published null results encourage my trust in a published significant result. With some authors, the moment you read the methods section, you know what the results section will say. When every paper supports the lab's theory, one is left wondering whether there are null results hiding in the wings. One starts to worry that the tested hypotheses are never in danger of falsification.

"Attached are ten stickers you can use to harm the child.
You can stick these onto the child to get out your bad feelings.
You could think of this like sticking pins into a Voodoo doll."

In the case of the voodoo doll paper, the first author is Randy McCarthy. Years ago, I became aware of Dr. McCarthy when he carefully tried to replicate the finding that heat-related word primes influence hostile perceptions (DeWall & Bushman, 2009) and reported null results (McCarthy, 2014).

The voodoo doll paper from McCarthy and colleagues is also a replication attempt of sorts. The measure was first presented by DeWall et al. (2013); McCarthy et al. perform conceptual replications testing the measure's validity. On the whole, the replication and extension is quite enthusiastic about the measure. And that means all the more to me given my hunch that McCarthy started this project by saying "I'm not sure I trust this voodoo doll task..."

Similar commendable frankness can be seen in work from Michael McCullough's lab. In 2012, McCullough et al. reported that religious thoughts influence men's stereotypically male behavior. In 2014, one of McCullough's grad students published that she couldn't replicate the 2012 result (Hone & McCullough, 2014).

I see it as something like a Receiver Operating Characteristic curve. If the classifier has only ever given positive responses, that's probably not a very useful classifier -- you can't tell if there's any specificity to the classifier. A classifier that gives a mixture of positive and negative responses is much more likely to be useful.

A researcher who publishes a null now and again is a researcher I trust to call the results as they are.

[Conflict of interest statement: In the spirit of full disclosure, Randy McCarthy once gave me a small Amazon gift card for delivering a lecture to the Bayesian Interest Group at Northern Illinois University.]

Friday, August 19, 2016

Comment on Strack (2016)

Yesterday, Perspectives on Psychological Science published a 17-laboratory Registered Replication Report, totaling nearly 1900 subjects. In this RRR, researchers replicated an influential study of the Facial Feedback Effect, showing that being surreptitiously made to smile or to pout could influence emotional reactions.

The results were null, indicating that there may not be much to this effect.

The first author of the original study, Fritz Strack, was invited to comment. In his comment, Strack makes four criticisms of the replication that, in his view, undermine the results of the RRR to some degree. I am not convinced by these arguments; below, I address each in sequence.


"Hypothesis-aware subjects eliminate the effect."

First, Strack says that participants may have learned of the effect in class and thus failed to demonstrate it. To support this argument, he performs a post-hoc analysis demonstrating that the 14 studies using psychology pools found an effect size of d = -0.03, whereas the three studies using non-psychology undergrad pools found an effect size of d = 0.16, p = .037.

However, the RRR took pains to exclude hypothesis-aware subjects. Psychology students were also, we are told, recruited prior to coverage of the Strack et al. study in their classes. Neither of these steps ensures that all hypothesis-aware subjects were removed, of course, but both certainly help. And as Sanjay Srivastava points out, why would hypothesis awareness necessarily shrink the effect? It could just as well enhance it by demand characteristics.

Also, d = 0.16 is quite small -- like, 480-per-group for a one-tailed 80% power test small. If Strack is correct, and the true effect size is indeed d = 0.16, this would seem to be a very thin success for the Facial Feedback Hypothesis, and still far from consistent with the original study's effect.
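The sample-size figure can be reproduced with power.t.test() from R's stats package. This sketch assumes a two-sample t-test with unit standard deviation (so that delta is Cohen's d) and a one-sided alpha of .05. Those settings are my reading of "one-tailed 80% power test," not something taken from Strack's comment.

```r
power_calc <- power.t.test(delta = 0.16, sd = 1,
                           sig.level = 0.05, power = 0.80,
                           type = "two.sample",
                           alternative = "one.sided")
ceiling(power_calc$n)   # subjects needed per group, in the neighborhood of 480
```

Switching alternative to "two.sided" pushes the required sample higher still.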

"The Far Side isn't funny anymore."

Second, Strack suggests that, despite the stimulus testing data indicating otherwise, perhaps The Far Side is too 1980s to provide an effective stimulus.

I am not sure why he feels it necessary to disregard the data, which indicates that these cartoons sit nicely in the midpoint of the scale. I am also at a loss as to why the cartoons need to be unambiguously funny -- had the cartoons been too funny, one could have argued there was a ceiling effect.


"Cameras obliterate the facial feedback effect."

Third, Strack suggests that the "RRR labs deviated from the original study by directing a camera at the participants." He argues that research on objective self-awareness demonstrates that cameras induce subjective self-focus, tampering with the emotional response.

This argument would be more compelling if any studies were cited, but in either case, I feel the burden of proof rests with this novel hypothesis that the facial feedback effect is moderated by the presence of cameras.

"The RRR shows signs of small-study effects."

Finally, Strack closes by using a funnel plot to suggest that the RRR results are suffering from a statistical anomaly.



He shows a funnel plot that compares sample size and Cohen's d, arguing that it is not appropriately pyramidal. (Indeed, it looks rather frisbee-shaped.)

Further, he conducts a correlation test between sample size and Cohen's d. This result is not, strictly speaking, statistically significant (p = .069), but he interprets it all the same as a warning sign. (It bears mention here that an Egger test with an additive error term is a more appropriate test. Such a test yields p = .235, quite far from significance.)

Strack says that he does not mean to insinuate that there is "reverse p-hacking" at play, but I am not sure how else we are to interpret this criticism. In any case, he recommends that "the current anomaly needs to be further explored," which I will do below.


Strack's funnel plot does not appear pyramidal because the studies are all of roughly equal size, and so the default scale of the axes is way off. Here I present a funnel plot with axes of more appropriate scale. Again, the datapoints do not form a pyramid shape, but we see now that this is because there is little variance in sample size or standard error with which to make a pyramid shape. You're used to seeing taller, more funnel-y funnels because sample sizes in social psych tend to range broadly from 40 to 400, whereas here they vary narrowly from 80 to 140.

You can also see that there's really only one of the 17 studies that contributes to the correlation, having a negative effect size and larger standard error. This study is still well within the range of all the other results, of course; together, the studies are very nicely homogeneous (I^2 = 0%, tau^2 = 0), indicating that there's no evidence this study's results measure a different true effect size.

Still, this study has influence on the funnel plot -- it has a Cook's distance of 0.46, whereas all the others have distances of 0.20 or less. Removing this one study abolishes the correlation between d and sample size (r(14) = .27, p = .304), and the resulting meta-analysis is still quite null (raw effect size = 0.04, [-0.09, 0.18]). Strack is interpreting a correlation that hinges upon one influential observation.

I am willing to bet that this purported small-study effect is a pattern detected in noise. (Not that it was ever statistically significant in the first place.)

Admittedly, I am sensitive to the suggestion that an RRR would somehow be marred by reverse p-hacking. If all the safeguards of an RRR can't stop psychologists from reaching whatever their predetermined result, we are completely and utterly fucked, and it's time to pursue a more productive career in refrigerator maintenance.

Fortunately, that does not seem to be the case. The RRR does not show evidence of small-study effects or reverse p-hacking, and its null result is robust to exclusion of the most negative result.

Tuesday, July 19, 2016

The Failure of Fail-safe N

Fail-Safe N is a statistic suggested as a way to address publication bias in meta-analysis. Fail-Safe N describes the robustness of a significant result by calculating how many studies with effect size zero could be added to the meta-analysis before the result lost statistical significance. The original formulation is provided by Rosenthal (1979), with modifications proposed by Orwin (1983) and Rosenberg (2005).

I would like to argue that, as a way to detect and account for bias in meta-analysis, Fail-Safe N is completely useless. Others have said this before (see the bottom of the post for some links), but I needed to explore it further for my own curiosity. Altogether, I have to say that Fail-Safe N appears to be completely obsoleted by subsequent techniques, and thus is not recommended for use.


Fail-Safe N isn't for detecting bias

When we perform a meta-analysis, the question on our minds is usually "Looking at the gathered studies, how many null results were hidden from report?" Fail-Safe N does not answer that. Instead, it asks, "Looking at the gathered studies, how many more null results would you need before you'd no longer claim an effect?"

This isn't useful as a bias test. Indeed, Rosenthal never meant it as a way to test for the presence of bias -- he'd billed it as an estimate of tolerance for null results, an answer to the question "How bad would bias have to be before I changed my mind?" He used it to argue against the worst-case scenario in which the published psych literature consists of the 5% of studies that are Type I errors while the other 95%, all null results, languish in file drawers. Fail-Safe N was never meant to distinguish biased from unbiased literatures.

Fail-Safe N doesn't scale with bias

Although Fail-Safe N was never meant to test for bias, sometimes people will act as though a larger Fail-Safe N indicates the absence of bias. That won't work.

To see why it won't work, let's look briefly at the equation that defines FSN.

FSN = [(ΣZ)^2 / 2.706] - k

where ΣZ is the sum of z-scores from individual studies (small p-values mean large z-scores) and k is the number of studies.

This means that Fail-Safe N grows larger with each significant result. Even when each study is just barely significant (p = .050), Fail-Safe N will grow rapidly. After six p = .05 results, FSN is 30. After ten p = .05 results, FSN is 90. After twenty p = .05 results, FSN is 380. Fail-safe N rapidly becomes huge, even when the individual studies just barely cross the significance threshold.
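The arithmetic above can be checked with a few lines of R. This is a sketch of the equation as written, treating each "p = .05" result as a one-tailed z of about 1.645 (the value whose square is 2.706). The function name and its arguments are my own.

```r
# Rosenthal's Fail-Safe N from a vector of study z-scores.
# z_crit defaults to the one-tailed .05 critical value; z_crit^2 is the 2.706 in the equation.
fail_safe_n <- function(z, z_crit = qnorm(0.95)) {
  sum(z)^2 / z_crit^2 - length(z)
}

k <- c(6, 10, 20)
fsn <- sapply(k, function(n_studies) fail_safe_n(rep(qnorm(0.95), n_studies)))
fsn   # 30, 90, 380
```

When every study sits exactly at the threshold, the equation collapses to FSN = k^2 - k, which is why it grows so quickly with the number of studies.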

Worse, FSN can get bigger as the literature becomes more biased.

  • For each dropped study with an effect size of exactly zero, FSN grows by one. (That's what it says on the tin -- how many dropped zeroes would be required to make p > .05.) 
  • When dropped studies have positive but non-significant effect sizes, FSN falls. 
  • When dropped studies have negative effect sizes, FSN rises.
If all the studies with estimated effect sizes less than zero are censored, FSN will quickly rise. 
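A toy example makes the three cases above concrete. The z-scores are made up purely for illustration, and fail_safe_n() is my own helper implementing the equation given earlier.

```r
fail_safe_n <- function(z, z_crit = qnorm(0.95)) {
  sum(z)^2 / z_crit^2 - length(z)
}

# Hypothetical z-scores for five studies (not real data)
z_all <- c(2.5, 2.0, 1.0, 0, -1.0)

fsn_all           <- fail_safe_n(z_all)
fsn_drop_zero     <- fail_safe_n(z_all[z_all != 0])     # censor the exact-zero study
fsn_drop_positive <- fail_safe_n(z_all[z_all != 1.0])   # censor the weak positive study
fsn_drop_negative <- fail_safe_n(z_all[z_all != -1.0])  # censor the negative study

c(all = fsn_all, drop_zero = fsn_drop_zero,
  drop_positive = fsn_drop_positive, drop_negative = fsn_drop_negative)
```

Censoring the zero raises FSN by exactly one, censoring the weak positive result lowers it, and censoring the negative result raises it the most.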

Because Fail-Safe N doesn't behave in any particular way with bias, the following scenarios could all have the same Fail-Safe N:

  • A few honestly-reported studies on a moderate effect.
  • A lot of honest studies on a teeny-tiny effect.
  • A single study with a whopping effect size.
  • A dozen p-hacked studies on a null effect.

Fail-Safe N is often huge, even when it looks like the null is true

Publication bias and flexible analysis being what they are in social psychology, Fail-Safe N tends to return whopping huge numbers. The original Rosenthal paper provides two demonstrations. In one, he synthesizes 94 experiments examining the effects of interpersonal self-fulfilling prophecies, and concludes that 3,263 studies averaging null effects would be necessary to make the effect go away. In another analysis of k = 311 studies, he says nearly 50,000 studies would be needed.

Similarly, in the Hagger et al. meta-analysis of ego depletion, Fail-Safe N reported that 50,000 null studies would be needed to reduce the effect to non-significance. By comparison, the Egger test indicated that the literature was badly biased, and PET-PEESE indicated that the effect size was likely zero. The registered replication report also indicated that the effect size was likely zero. Even a Fail-Safe N of 50,000 does not indicate a robust result.

Summary

Fail-Safe N is not a useful bias test because:

  1. It does not tell you whether there is bias.
  2. Greater bias can lead to a greater Fail-Safe N.
  3. Hypotheses that would appear to be false have nonetheless obtained very large values of FSN.


FSN is just another way to describe the p-value at the end of your meta-analysis. If your p-value is very small, FSN will be very large; if your p-value is just barely under .05, FSN will be small.
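To see the link between FSN and the p-value, here is a sketch that assumes the meta-analytic p-value comes from Stouffer's combined z, which is the test the FSN equation is built on. A modern random-effects meta-analysis would give somewhat different p-values. The helper names and the z-scores are hypothetical.

```r
fail_safe_n <- function(z, z_crit = qnorm(0.95)) {
  sum(z)^2 / z_crit^2 - length(z)
}

# Stouffer's combined one-tailed p-value for the same z-scores
stouffer_p <- function(z) pnorm(sum(z) / sqrt(length(z)), lower.tail = FALSE)

# Hypothetical literatures of ten identical studies, from weak to strong evidence
z_per_study <- c(0.3, 0.5, 0.6, 1.0, 2.0)
results <- data.frame(
  z_per_study = z_per_study,
  p   = sapply(z_per_study, function(z) stouffer_p(rep(z, 10))),
  fsn = sapply(z_per_study, function(z) fail_safe_n(rep(z, 10)))
)
results
```

In this setup FSN is positive exactly when the combined p-value is below .05, and it climbs as p shrinks. It carries no information beyond the combined test and the number of studies.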

In no case does Fail-Safe N indicate the presence or absence of bias. It only places a number on how bad publication bias would have to be, in a world without p-hacking, for the result to be a function of publication bias alone. Unfortunately, we know well that we live in a world with p-hacking. Perhaps this is why Fail-Safe N is sometimes so inappropriately large.

If you need to test for bias, I would recommend instead Begg's test, Egger's test, or p-uniform. If you want to adjust for bias, PET, PEESE, p-curve, p-uniform, or selection models might work. But don't ever try to interpret the Fail-Safe N in a way it was never meant to be used.




Related reading:
Becker (2005) recommends "abandoning Fail-Safe N in favor of other, more informative analyses."
Here Coyne agrees that Fail-Safe N is not a function of bias and does not check for bias.
The Cochrane Collaboration agrees that Fail-Safe N is mostly a function of the net effect size, and criticizes the emphasis on statistical significance over effect size.
Moritz Heene refers me to three other articles pointing out that the average Z-score of unpublished studies is probably not zero, as Fail-Safe N assumes, but rather, less than zero. Thus, the Fail-Safe N is too large. (Westfall's comment below makes a similar point.) This criticism is worth bearing in mind, but I think the larger problem is that Fail-Safe N does not answer the user's question regarding bias.

Thursday, June 23, 2016

Derailment, or The Seeing-Thinking-Doing Model

Inspired by a recent excellent lecture by Nick Brown, I decided to finally sit down and read Diederik Stapel's confessional autobiography, Ontsporing. Brown translated it from Dutch into English; it is available for free here.

In this account, Stapel describes how he came to leave theater for social psychology, how he had some initial fledgling successes, and ultimately, how his weak results and personal greed drove him to fake his data. A common theme is the complete lack of scientific oversight -- Stapel refers to his sole custody of the data as being alone with a big jar of cookies.

Doomed from the start

Poor Stapel! He based his entire research program on a theory doomed to failure. So much of what he did was based on a very simple, very crude model: Seeing a stimulus "activates" thoughts related to the stimulus. Those "activated" thoughts then influence behavior, usually at sufficient magnitude and clarity that they can be detected in a between-subjects test of 15-30 subjects per cell.

Say what you will about the powerful effects of the situation, but in hindsight, it's little surprise that Stapel couldn't find significant results. The stimuli were too weak, the outcomes too multiply determined, and the sample sizes too small. It's like trying to study if meditation reduces anger by treating 10 subjects with one 5-minute session and then seeing if they ever get in a car crash. Gelman might say Stapel was "driven to cheat [...] because there was nothing there to find. [...] If there's nothing there, they'll start to eat dirt."



Remarkably, Stapel writes as though he never considered that his theories could be wrong and that he should have changed course. Instead, he seems to have taken every p < .05 as gospel truth. He talks about p-hacking two studies into shape (he refers to "gray methods" like dropping conditions or outcomes) only to be devastated when the third study comes up immovably null. He didn't listen to his null results.

However, theory seemed to play a role in his reluctance to listen to his data. Indeed, he says the way he got away with it for as long as he did was by carefully reading the literature and providing the result that theory would have obviously predicted. Maybe the strong support from theory is why he always assumed there was some signal he could find through enough hacking.

He similarly placed too much faith in the significant results of other labs. He alludes to strange ideas from other labs as though they were established facts: things like the size of one's signature being a valid measure of self-esteem, or thoughts of smart people making you better at Trivial Pursuit.

Seeing-Thinking-Doing Theory

Reading the book, I had to reflect upon social psychology's odd but popular theory, which grew to prominence some thirty years ago and is just now starting to wane. This theory is the seeing-thinking-doing theory: seeing something "activates thoughts" related to the stimulus, the activation of those thoughts leads to thinking those thoughts, and thinking those thoughts leads to doing some behavior.

Let's divide the seeing-thinking-doing theory into its component stages: seeing-thinking and thinking-doing. The seeing-thinking hypothesis seems pretty obvious. It's sensible enough to believe in and study some form of lexical priming, e.g. that some milliseconds after you've shown somebody the word CAT, they are faster to say HAIR than BOAT. Some consider the seeing-thinking hypothesis so obvious as to be worthy of lampoon.

But it's the thinking-doing hypothesis that seems suspicious. If incidental thoughts are to direct behavior in powerful ways, it would suggest that cognition is asleep at the wheel. There seems to be this idea that the brain has no idea what to do from moment to moment, and so it goes rummaging about looking for whatever thoughts are accessible, and then it seizes upon one at random and acts on it.

The causal seeing-thinking-doing cascade starts to unravel when you think about the strength of the manipulation. Seeing probably causes some change in thinking, but there's a lot of thinking going on, so it can't account for that much variance in thinking. Thinking is probably related to doing, but then, one often thinks about something without acting on it.

The trickle-down cascade from minimal stimulus to changes in thoughts to changes in behavior would seem to amount to little more than a sneeze in a tornado. Yet this has been one of the most powerful ideas in social psychology, leading to arguments that we can reduce violence by keeping people from seeing toy guns, stimulate intellect through thoughts of professors, and promote prosocial behavior by putting eyes on the walls.

Reflections

When I read Ontsporing, I saw a lot of troubling things: lax oversight, neurotic personalities, insufficient skepticism. But it's the historical perspective on social psychology that most jumped out to me. Stapel couldn't wrap his head around the idea that words and pictures aren't magic totems in the hands of social psychologists. He set out to study a field of null results. Rather than revise his theories, he chose a life of crime.

The continuing replicability crisis is finally providing some appropriately skeptical and clear tests of the seeing-thinking-doing hypothesis. In the meantime, I wonder: What exactly do we mean when we say "thoughts" are "activated"? How strong is the evidence that the activation of a thought can later influence behavior? And are there qualitative differences between the kind of thought associated with incidental primes and the kind of thought that typically guides behavior? The latter would seem much more substantial.

Thursday, June 2, 2016

Prior elicitation for directing replication efforts

Brent Roberts suggests the replication movement solicit federal funding to organize replication daisy chains. James Coyne suggests that the replication movement has already made a grave misstep by attempting to replicate findings that were always hopelessly preposterous. Who is in the right?

It seems to me that both are correct, but the challenge is in knowing when to replicate and when to dismiss outright. Coyne and the OSF seem to be after different things: the OSF has been very careful to make the RP:P about "estimating the replicability of psychology" in general rather than establishing the truth or falsity of particular effects of note. This motivated their decision to choose a random-ish sample of 100 studies rather than target specific controversial studies.

If, in contrast, we want to direct our replication efforts to where they will have the greatest probative value, we will need to first identify which phenomena we are collectively most ambivalent about. There's no point in replicating something that's obviously trivially true or blatantly false.

How do we figure that out? Prior elicitation! We gather a diverse group of experts and ask them to divide up their probability, indicating how big they think the effect size is in a certain experimental paradigm.


If most of the probability mass is away from zero, then we don't bother with the replication -- everybody believes in the effect already.


On the other hand, if the estimates are tightly clustered around zero, we don't bother with the replication -- it's obvious nobody believes it in the first place.



It's when the prior is diffuse, or evenly divided between the spike at zero and the slab outside zero, or bimodal, that we find the topic is controversial and in need of replication. That's the kind of thing that might benefit from an RRR or a federally-funded daisy chain.


Code below:
# Each plot shows a spike-and-slab prior: the curve is the slab (a Cauchy
# density scaled by its share of the prior mass), and the arrow at zero is
# the spike (the remaining mass placed on an effect of exactly zero).

# Grid of effect sizes shared by all four plots
x <- seq(-2, 2, .01)

# Plot1: 90% of the mass in a slab centered on delta = 1, 10% on the spike
plot(x, dcauchy(x, location = 1, scale = .3)*.9, type = 'l',
     ylim = c(0, 1),
     ylab = "Probability density",
     xlab = "Effect size (delta)",
     main = "All-but-certain finding \n Little need for replication")
arrows(0, 0, 0, .1)

# Plot2: 90% of the mass on the spike, 10% in a narrow slab around zero
plot(x, dcauchy(x, location = 0, scale = .25)*.1, type = 'l',
     ylim = c(0, 1),
     ylab = "Probability density",
     xlab = "Effect size (delta)",
     main = "No one believes it \n Little need for replication")
arrows(0, 0, 0, .9)

# Plot3: mass split evenly between the spike and a wide slab
plot(x, dcauchy(x, location = 0, scale = 1)*.5, type = 'l',
     ylim = c(0, .75),
     ylab = "Probability density",
     xlab = "Effect size (delta)",
     main = "No one knows what to think \n Great target for replication")
arrows(0, 0, 0, .5)

# Plot4: no spike; two equally weighted slabs centered on delta = 1 and -1
plot(x, dcauchy(x, location = 1, scale = 1)*.5, type = 'l',
     ylim = c(0, .75),
     ylab = "Probability density",
     xlab = "Effect size (delta)",
     main = "Competing theories \n Great target for replication")
lines(x, dcauchy(x, location = -1, scale = 1)*.5)

Wednesday, June 1, 2016

Extraordinary evidence

Everyone seems to agree with the saying "extraordinary claims require extraordinary evidence." But what exactly do we mean by it?

In previous years, I'd taken this to mean that an improbable claim requires a dataset with strong probative value, e.g. a very small p-value or a very large Bayes factor. Extraordinary claims have small prior probability and need strong evidence if they are to be considered probable a posteriori.
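
As a quick illustration of that first reading, posterior odds are just prior odds multiplied by the Bayes factor. The sketch below uses a hypothetical helper, posterior_prob(), and made-up prior probabilities and Bayes factors chosen only to show the arithmetic.

```r
# Hypothetical helper: turn a prior probability and a Bayes factor
# (evidence for the claim relative to its negation) into a posterior probability.
posterior_prob <- function(prior_prob, bayes_factor) {
  prior_odds     <- prior_prob / (1 - prior_prob)
  posterior_odds <- prior_odds * bayes_factor
  posterior_odds / (1 + posterior_odds)
}

# Illustrative numbers only (not from any real study)
mundane       <- posterior_prob(prior_prob = 0.5,   bayes_factor = 10)
extraordinary <- posterior_prob(prior_prob = 0.001, bayes_factor = 10)
overwhelming  <- posterior_prob(prior_prob = 0.001, bayes_factor = 10000)

print(round(c(mundane = mundane,
              extraordinary = extraordinary,
              overwhelming = overwhelming), 3))
```

With these numbers, a Bayes factor of 10 makes the coin-flip claim quite probable but leaves the one-in-a-thousand claim at about 1%. It takes a Bayes factor on the order of 10,000 to bring the improbable claim up to where the mundane one landed.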

However, this is not the only variety of extraordinary claim. Suppose that someone tells you that he has discovered that astrological signs determine Big Five personality scores. You scoff, expecting that he has run a dozen tests and wrestled out a p = .048 here or there. But no, he reports strong effects on every outcome: all are p < .001, with correlations in the r = .7 range. If you take the results at face value, they are clearly strong evidence of an effect.

Is this extraordinary evidence? In a sense, yes. The Bayes factor or likelihood ratio or whatever is very strong. But nested within this extraordinary evidence is another extraordinary claim: that his study found these powerful results. These effects are unusually strong for personality psychology in general, much less for astrology and personality in particular.

What kind of extraordinary evidence is needed to support that claim? In this post-LaCour-fraud, post-Reinhart-Rogoff-Excel-error world, I would suggest that more is needed than simply a screenshot of some SPSS output.

In ascending order of rigor, authors can support their extraordinary evidence by providing the following:

  1. The post-processed data necessary to recreate the result.
  2. The pre-processed data (e.g., single-subject E-Prime files; single-trial data).
  3. All processing scripts that turn the pre-processed data into the post-processed data.
  4. Born-open data: data that an automated script commits to Git and uploads to the cloud as it is collected. This is an extension of the above -- it provides the pre-processed data, uploaded to the central, 3rd-party GitHub server, where it is timestamped.

Providing access to the above gives greater evidence that:

  1. The data are real, 
  2. The results match the data, 
  3. The processed data are an appropriate function of the preprocessed data, 
  4. The data were collected and uploaded over time, rather than cooked up in Excel overnight, and
  5. The data were not tampered with between initial collection and final report.

If people do not encourage data archival, a frustrating pattern may emerge: Researchers report huge effect sizes with high precision. These whopping results have considerable influence on the literature, meta-analyses, and policy decisions. However, when the data are requested, it is discovered that the data were hit by a meteor, or stolen by Chechen insurgents, or chewed up by a slobbery old bulldog, or something. Nobody is willing to discard the outrageous effect size from meta-analysis for fear of bias, or of appearing biased. Techniques to detect and adjust for publication bias and p-hacking, such as P-curve and PET-PEESE, would be powerless to detect and adjust for bias so long as a few high-effect-size farces remain in the dataset.

The inevitable fate of many suspiciously successful datasets.

As Nick Brown points out, this may be the safest strategy for fraudsters. At present, psychologists are not expected to be competent custodians of their own data. Little of graduate training concerns data archival. It is not unusual for data to go missing, and I have yet to find anybody who has been censured for failure to preserve their data. In contrast, accusations of fraud or wrongdoing require strong evidence -- the kind that can only be obtained by looking at the raw data, or perhaps by finding the same mistake, made repeatedly across a lifetime of fraudulent research. Somebody could go far by making up rubbish and saying the data were stolen by soccer hooligans, or whatever.

For a stronger, more replicable science, we must do more to train scientists in data management and incentivize data storage and sharing. Open science badges are nice. They let honest researchers signal their honesty. But they are not going to save the literature so long as meta-analysis and public policy statements must tiptoe around closed-data (or the-dog-ate-my-data) studies with big, influential results.