Thursday, June 23, 2016

Derailment, or The Seeing-Thinking-Doing Model

Inspired by a recent excellent lecture by Nick Brown, I decided to finally sit down and read Diederik Stapel's confessional autobiography, Ontsporing. Brown translated it from Dutch into English; it is available for free here.

In this account, Stapel describes how he came to leave theater for social psychology, how he had some initial fledgling successes, and ultimately, how his weak results and personal greed drove him to fake his data. A common theme is the complete lack of scientific oversight -- Stapel refers to his sole custody of the data as being alone with a big jar of cookies.

Doomed from the start

Poor Stapel! He based his entire research program on a theory doomed to failure. So much of what he did was based on a very simple, very crude model: Seeing a stimulus "activates" thoughts related to the stimulus. Those "activated" thoughts then influence behavior, usually at sufficient magnitude and clarity that they can be detected in a between-samples test of 15-30 samples per cell.

Say what you will about the powerful effects of the situation, but in hindsight, it's little surprise that Stapel couldn't find significant results. The stimuli were too weak, the outcomes too multiply determined, and the sample sizes too small. It's like trying to study if meditation reduces anger by treating 10 subjects with one 5-minute session and then seeing if they ever get in a car crash. Gelman might say Stapel was "driven to cheat [...] because there was nothing there to find. [...] If there's nothing there, they'll start to eat dirt."

Remarkably, Stapel writes as though he never considered that his theories could be wrong and that he should have changed course. Instead, he seems to have taken every p < .05 as gospel truth. He talks about p-hacking two studies into shape (he refers to "gray methods" like dropping conditions or outcomes) only to be devastated when the third study comes up immovably null. He didn't listen to his null results.

However, theory seemed to play a role in his reluctance to listen to his data. Indeed, he says the way he got away with it for as long as he did was by carefully reading the literature and providing the result that theory would have obviously predicted. Maybe the strong support from theory is why he always assumed there was some signal he could find through enough hacking.

He similarly placed too much faith in the significant results of other labs. He alludes to strange ideas from other labs as though they were established facts: things like the size of one's signature being a valid measure of self-esteem, or thoughts of smart people making you better at Trivial Pursuit.

Thinking-Seeing-Doing Theory

Reading the book, I had to reflect upon social psychology's odd but popular theory, which grew to prominence some thirty years ago and is just now starting to wane. This theory is the seeing-thinking-doing theory: seeing something "activates thoughts" related to the stimulus, the activation of those thoughts leads to thinking those thoughts, and thinking those thoughts leads to doing some behavior.

Let's divide the seeing-thinking-doing theory into its component stages: seeing-thinking and thinking-doing. The seeing-thinking hypothesis seems pretty obvious. It's sensible enough to believe in and study some form of lexical priming, e.g. that some milliseconds after you've just showed somebody the word CAT, participants are faster to say HAIR than BOAT. Some consider the seeing-thinking hypothesis so obvious as to be worthy of lampoon.

But it's the thinking-doing hypothesis that seems suspicious. If incidental thoughts are to direct behavior in powerful ways, it would suggest that cognition is asleep at the wheel. There seems to be this idea that the brain has no idea what to do from moment to moment, and so it goes rummaging about looking for whatever thoughts are accessible, and then it seizes upon one at random and acts on it.

The causal seeing-thinking-doing cascade starts to unravel when you think about the strength of the manipulation. Seeing probably causes some change in thinking, but there's a lot of thinking going on, so it can't account for that much variance in thinking. Thinking is probably related to doing, but then, one often thinks about something without acting on it.

The trickle-down cascade from minimal stimulus to changes in thoughts to changes in behavior would seem to amount to little more than a sneeze in a tornado. Yet this has been one of the most powerful ideas in social psychology, leading to arguments that we can reduce violence by keeping people from seeing toy guns, stimulate intellect through thoughts of professors, and promote prosocial behavior by putting eyes on the walls.


When I read Ontsporing, I saw a lot of troubling things: lax oversight, neurotic personalities, insufficient skepticism. But it's the historical perspective on social psychology that most jumped out to me. Stapel couldn't wrap his head around the idea that words and pictures aren't magic totems in the hands of social psychologists. He set out to study a field of null results. Rather than revise his theories, he chose a life of crime.

The continuing replicability crisis is finally providing some appropriately skeptical and clear tests of the seeing-thinking-doing hypothesis. In the meantime, I wonder: What exactly do we mean when we say "thoughts" are "activated"? How strong is the evidence is that the activation of a thought can later influence behavior? And are there qualitative differences between the kind of thought associated with incidental primes and the kind of thought that typically guides behavior? The latter would seem much more substantial.

Thursday, June 2, 2016

Prior elicitation for directing replication efforts

Brent Roberts suggests the replication movement solicit federal funding for the organization of federally-funded replication daisy chains. James Coyne suggests that the replication movement has already made a grave misstep by attempting to replicate findings that were always hopelessly preposterous. Who is in the right?

It seems to me that both are correct, but the challenge is in knowing when to replicate and when to dismiss outright. Coyne and the OSF seem to be after different things: the OSF has been very careful to make the RP:P about "estimating the replicability of psychology" in general rather than establishing the truth or falsity of particular effects of note. This motivated their decision to choose a random-ish sample of 100 studies rather than target specific controversial studies.

If in contrast, we want to direct our replication efforts to where they will have the greatest probative value, we will need to first identify which phenomena we are collectively most ambivalent about. There's no point in replicating something that's obviously trivially true or blatantly false.

How do we figure that out? Prior elicitation! We gather a diverse group of experts and ask them to divide up their probability, indicating how big they think the effect size is in a certain experimental paradigm.

If most the probability mass is away from zero, then we don't bother with the replication -- everybody believes in the effect already.

On the other hand, if the estimates are tightly clustered around zero, we don't bother with the replication -- it's obvious nobody believes it in the first place.

It's when the prior is diffuse, or evenly divided between the spike at zero and the slab outside zero, or bimodal, that we find the topic is controversial and in need of replication. That's the kind of thing that might benefit from a RRR or a federally-funded daisy chain.

Code below:
# Plot1
x = seq(-2, 2, .01)
plot(x, dcauchy(x, location = 1, scale = .3)*.9, type = 'l',
     ylim = c(0, 1),
     ylab = "Probability density",
     xlab = paste("Effect size (delta)"),
     main = "All-but-certain finding \n Little need for replication")
arrows(0, 0, 0, .1)

# Plot2
plot(x, dcauchy(x, location = 0, scale = .25)*.1, type = 'l',
     ylim = c(0, 1),
     ylab = "Probability density",
     xlab = paste("Effect size (delta)"),
     main = "No one believes it \n Little need for replication")
arrows(0, 0, 0, .9)

# Plot3
plot(x, dcauchy(x, location = 0, scale = 1)*.5, type = 'l',
     ylim = c(0, .75),
     ylab = "Probability density",
     xlab = paste("Effect size (delta)"),
     main = "No one knows what to think \n Great target for replication")
arrows(0, 0, 0, .5)

# Plot4
plot(x, dcauchy(x, location = 1, scale = 1)*.5, type = 'l',
     ylim = c(0, .75),
     ylab = "Probability density",
     xlab = paste("Effect size (delta)"),
     main = "Competing theories \n Great target for replication")
lines(x, dcauchy(x, location = -1, scale = 1)*.5)

Wednesday, June 1, 2016

Extraordinary evidence

Everyone seems to agree with the saying "extraordinary claims require extraordinary evidence." But what exactly do we mean by it?

In previous years, I'd taken this to mean that an improbable claim requires a dataset with strong probative value, e.g. a very small p-value or a very large Bayes factor. Extraordinary claims have small prior probability and need strong evidence if they are to be considered probable a posteriori.

However, this is not the only variety of extraordinary claim. Suppose that someone tells you that he has discovered that astrological signs determine Big Five personality scores. You scoff, expecting that he has run a dozen tests and wrestled out a p = .048 here or there. But no, he reports strong effects on every outcome: all are p < .001, with correlations in the r = .7 range. If you take the results at face value, it is clearly strong evidence of an effect.

Is this extraordinary evidence? In a sense, yes. The Bayes factor or likelihood ratio or whatever is very strong. But nested within this extraordinary evidence is another extraordinary claim: that his study found these powerful results. These effects are unusually strong for personality psychology in general, much less for astrology and personality in particular.

What kind of extraordinary evidence is needed to support that claim? In this post-Lacour-fraud, post-Reinhart-Rogoff-Excel-error world, I would suggest that more is needed than simply a screenshot of some SPSS output.

In ascending order of rigor, authors can support their extraordinary evidence by providing the following:

  1. The post-processed data necessary to recreate the result.
  2. The pre-processed data (e.g., single-subject e-prime files; single-trial data).
  3. All processing scripts that turn the pre-processed data into the post-processed data.
  4. Born-open data, data that is organized by Git to be saved and uploaded to the cloud in an automated script. This is an extension of the above -- it provides the pre-processed data, uploaded to the central, 3rd-party GitHub server, where it is timestamped.

Providing access to the above gives greater evidence that:

  1. The data are real, 
  2. The results match the data, 
  3. The processed data are an appropriate function of the preprocessed data, 
  4. The data were collected and uploaded over time, rather than cooked up in Excel overnight, and
  5. The data were not tampered with between initial collection and final report.

If people do not encourage data-archival, a frustrating pattern may emerge: Researchers report huge effect sizes with high precision. These whopping results have considerable influence on the literature, meta-analyses, and policy decisions. However, when the data are requested, it is discovered that the data were hit by a meteor, or stolen by Chechen insurgents, or chewed up by a slobbery old bulldog, or something. Nobody is willing to discard the outrageous effect size from meta-analysis for fear of bias, or appearing biased. Techniques to detect and adjust for publication bias and p-hacking, such as P-curve and PET-PEESE, would be powerless to detect and adjust for bias so long as a few high-effect-size farces remain in the dataset.

The inevitable fate of many suspiciously successful datasets.
Like Nick Brown points out, this may be the safest strategy for fraudsters. At present, psychologists are not expected to be competent custodians of their own data. Little of graduate training concerns data archival. It is not unusual for data to go missing, and so far I have yet to find anybody who has been censured for failure to preserve their data. In contrast, accusations of fraud or wrongdoing require strong evidence -- the kind that can only be obtained by looking at the raw data, or perhaps by finding the same mistake, made repeatedly across a lifetime of fraudulent research. Somebody could go far by making up rubbish and saying the data were stolen by soccer hooligans, or whatever.

For a stronger, more replicable science, we must do more to train scientists in data management and incentivize data storage and sharing. Open science badges are nice. They let honest researchers signal their honesty. But they are not going to save the literature so long as meta-analysis and public policy statements must tiptoe around closed-data (or the-dog-ate-my-data) studies with big, influential results.