Thursday, October 15, 2020

Fraud and Erroneous Judgment: Varieties of Deception in the Social Sciences (1995)

Killing time in the UChicago stacks last summer, I found a book from 1995 called Fraud and Erroneous Judgment in the Social Sciences. It's been an interesting read, because despite having been written nearly 25 years ago, much of it reads like it was written today. Specifically, there is very little substance about actually preventing, detecting, or prosecuting fraud, presumably because all these things are very difficult to do. 

Instead, a substantial portion of the text is dedicated to the easier task of fighting the culture war. Nearly half the book consists of polemics from scientists who think their ability to speak hard truths about sexual assault or intelligence or race or whatever has been suppressed by the bleeding hearts. This is particularly depressing and unhelpful when you see that two of the thirteen chapters are written by Linda Gottfredson and J. Philippe Rushton, scientists receiving funding from the Pioneer Fund, an organization founded to study and promote eugenics.


For a text that is notionally about fraud, there is very little substance about actual fraud. Instead, most of the chapters are dedicated to the latter topic of "fallible judgment". Only three instances of research misconduct in psychology are discussed. Two of them appear in brief bullet points in the first chapter: In one, a psychologist fabricated data to demonstrate the efficacy of a drug for preventing self-harm in the mentally disabled; in the other, a researcher may have massaged his data to overstate the potential harms of low levels of lead exposure.

The third case consists of the allegations surrounding Cyril Burt. Cyril Burt was an early behavior geneticist. He argued that intelligence was heritable, and he demonstrated this through studies of the similarity of identical twins raised apart.

Burt was unpopular at the time because the view that intelligence was heritable sounded to many like Nazi ideology. While he was alive, people protested him as a far-right ideologue. (Other hereditarians experienced similar treatment; Hans Eysenck reportedly needed bodyguards as a result of his 1971 views that some of the Black-White intelligence gap was genetic in nature.) 

Five years after his death, allegations arose that Burt had invented a number of his later samples. These allegations claimed that Burt, having found an initial sample that supported his hypothesis, and frustrated by the public resistance to his findings as well as the challenge of finding more identical twins raised apart, decided to help the process along by fabricating data from twin pairs. As evidence of this, his heritability coefficient remained .77 as the sample size increased from 15 twin pairs to 53 twin pairs. He was further alleged to have made up two research assistants, but these assistants were later found. Complicating matters further, his housekeeper burnt all his research records shortly after his death (!) purportedly on the advice of one of Burt's scientific rivals (?!?).
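To get a feel for why an unchanging coefficient looks suspicious, here is a minimal simulation sketch. It is not a reconstruction of Burt's analysis: it assumes bivariate-normal scores, a true correlation of .77, a plain Pearson correlation rather than whatever coefficient Burt used, and that the later 53 pairs include the original 15. All function names are my own, made up for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_pairs(n, rho, rng):
    """Draw n pairs of scores whose population correlation is rho (assumed normal)."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        xs.append(x)
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * noise)
    return xs, ys


def share_matching(rho=0.77, n_small=15, n_large=53, n_sims=2000, seed=1):
    """Share of simulated studies where r agrees to two decimals
    between the first n_small pairs and the full n_large pairs."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_sims):
        xs, ys = simulate_pairs(n_large, rho, rng)
        r_small = pearson_r(xs[:n_small], ys[:n_small])
        r_large = pearson_r(xs, ys)
        if round(r_small, 2) == round(r_large, 2):
            matches += 1
    return matches / n_sims


if __name__ == "__main__":
    print(f"Share of runs with matching r: {share_matching():.3f}")
```

Under these assumptions, a match to two decimals should be the exception rather than the rule, because a correlation estimated from 15 pairs bounces around quite a bit. Of course, an unlikely coincidence is still only circumstantial evidence.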

Burt sounds like a real horse's ass. In a separate book, Cyril Burt: Fraud or Framed?, Hans Eysenck reports that Burt would sometimes sock-puppet, writing articles according to his own views, then leaving his name off of the work and handing it off to a junior researcher, giving the impression that some independent scholar shared his view. Burt purportedly went one further by editing articles submitted to his journal, inserting his own stances and invective into others' work and publishing it without their approval.

Two chapters in this textbook are devoted to the Burt affair. The first chapter, written by Robert B. Joynson, argues that well, strictly speaking, you can't prove he committed fraud. Probably we will never know. Burt is dead and his records destroyed. Even if he made up the data, the potentially made-up data are at least consistent with what we believe today, so maybe it doesn't matter.

The other, written by the late J. Philippe Rushton, one-time head of the Pioneer Fund, argues more stridently that Burt was framed. According to his perspective, the various social justice warriors and bleeding hearts of the 1970s' hyper-liberal universities couldn't bear the uncomfortable truths Burt preached. Rather than refute Burt's ideas in the arena of logic and facts and science, they resorted to underhanded callout-culture tactics to smear him after his death and spoil his legacy.

So in the only involved discussion of an actual fraud allegation in this 181-page book, all that can be said is "maybe he did, or maybe he didn't."

Some material is useful. Chapter 3 recognizes that scientific fraud is a human behavior that is motivated by, and performed within, a social system. One author theorizes that fraud is most often committed under three conditions: 1) there is pressure to publish, whether to advance one's career or to refute critics, 2) the researcher thinks they know the answer already, so that actually doing the experiment is unnecessary, and 3) the research area involves an amount of stochastic variability, such that a failure to replicate can be shaken off as Type I error or hidden moderators. It certainly sounds plausible, but I wonder how useful it is. Most research fulfills all three conditions: all of us are under pressure to publish, all of us have a theory or two to suggest a "right" answer, and all of us experience sampling error and meta-uncertainty.

One thing that hasn't changed one bit is that demonstrating fraud requires demonstrating intent, which is basically impossible. Then and now, people instead have to couch concerns in the language of error, presuming sloppiness instead of malfeasance. Even then, it's not clear what level of sloppiness crosses the threshold between error and misconduct.

...and Erroneous Judgment

The other cases all concern "erroneous judgment". They reflect ideologically-biased interpretations of data, a lack of scientific rigor, or an excessive willingness to be fooled. These cases vary in their seriousness. At the extremely harmful end, there is a discussion of recovered-memory therapy; this therapy involves helping patients to recover memories of childhood abuse through a process indistinguishable from that one would use to create a false memory. Chillingly, recovered memories became permissible as court evidence in 15 states and led to a number of false accusations and possible convictions during the Satanic Panic of the 1980s. At the less harmful end, there's an argument about whether the Greeks made up their culture by copying off of the Egyptians. Fun to think about maybe, but nobody is going to jail for that.

Other examples include overexaggeration of societal problems in order to drum up support for research and advocacy. Neil Gilbert illustrates how moral entrepreneurs can extrapolate from sloppy statistical work, small samples, and bad question wording to estimate that 100 billion children are abducted every 3.7 seconds. This fine example is, however, paired with a criticism of feminism and research on sexual assault that has aged poorly; the author's argument boils down to "c'mon, sexual assault can't be that common, right?" Maybe it can be, Neil.

According to the authors, these cases of fallible judgment are caused by excessive enthusiasm rather than deliberate intention to deceive. Therapists dealing in recovered memories are too excited to root out satanic child-abuse cults, too ignorant of the basic science of memory, and too dependent on the perceived efficacy of their practice to know better. Critics of the heritability of IQ are blinded by political correctness and "the egalitarian hoax" of blank-slate models of human development. Political correctness is cited as influencing "fallible judgments" as diverse as the removal of homosexuality from the DSM (and its polite replacement in diagnosis of other disorders so that homosexual patients could continue billing their insurance), the estimation of the prevalence of sexual harassment, failures to test and report racial differences in outcomes, or the attribution of the accomplishments of the Greeks to the Egyptians.

Again, it seems revealing that so little is known about actual cases of fraud that the vast majority of the volume is dedicated to cases where it is unclear who is right. Unable to discover and discuss actual frauds, the discussion has to focus instead on ideological opponents whom the authors don't trust to interpret and represent their data fairly.

Have we made progress?

What's changed between 1995 and now? Today we have more examples to draw upon and more forensic tools. We can use GRIM and SPRITE to catch what are either honest people making typographical mistakes or fraudsters too stupid to make up raw data (good luck telling which is which!). The Data Colada boys keep coming up with new tests for detecting suspicious patterns in data. It's become a little less weird to ask for data and a little more weird to refuse to share data. So there's progress.
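For readers who haven't seen it, the idea behind GRIM is simple enough to sketch in a few lines: if responses are whole numbers, only certain means are possible for a given sample size. The function below is my own minimal illustration, not the published tool. It assumes integer-valued responses and a mean reported as a rounded decimal string, and it allows either rounding convention at the boundary.

```python
import math


def grim_consistent(reported_mean: str, n: int, items: int = 1) -> bool:
    """Return True if the reported mean could come from n integer responses.

    reported_mean is passed as a string (e.g. "5.19") so that trailing
    zeros tell us how many decimals were reported. items is the number of
    integer items averaged per participant (hypothetical convenience).
    """
    decimals = len(reported_mean.split(".")[1]) if "." in reported_mean else 0
    target = float(reported_mean)
    total_n = n * items

    # The true sum of integer responses must itself be an integer,
    # so only the integers nearest to mean * n need checking.
    approx_sum = target * total_n
    candidates = {math.floor(approx_sum), math.ceil(approx_sum)}

    # A candidate is acceptable if its mean rounds to the reported value.
    half_unit = 0.5 * 10 ** (-decimals)
    return any(abs(s / total_n - target) <= half_unit + 1e-9 for s in candidates)


if __name__ == "__main__":
    for mean in ("5.18", "5.19", "5.21"):
        print(mean, grim_consistent(mean, n=28))
```

With 28 participants, a sum of 145 gives a mean of 5.18 and a sum of 146 gives 5.21, so a reported 5.19 cannot be right. Whether that is a typo or a fabrication, the test cannot say.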

Even so, we're still a billion miles away from being able to detect most fraud and to demonstrate intent. Demonstration of intent generally requires a confession or someone on the inside. Personally, I suspect that fraud detection at scale is probably impossible unless we ask scientists to provide receipts, which would create an additional layer of paperwork and hassle for scientists.

One recurring theme is the absence of an actual science police. The discussion of the Burt affair complains that the Council of the British Psychological Society did little to examine Burt's case on its own, instead accepting the conclusions of a biographer. Chapters 1 and 2 discuss the political events that put "Science under Siege" and led to the creation of the Office of Research Integrity, an institution only grudgingly accepted in Chapter 2. Huffing that every great scientist from Mendel to Millikan had to massage their data a bit from time to time to make their point, David Goodstein cautions the ORI, "I can only hope that we won't arrange things in such a way as would have inhibited Newton or Millikan from doing his thing."

Can we ever know the truth?

Earlier, I mentioned that the book contains three cases of purported fraud: the self-harm study, Cyril Burt's 38 twin-pairs raised apart, and the researcher possibly massaging his data to overestimate the harms of lead. This last case appears to be a reference to the late Herbert Needleman, accused in 1990 of p-hacking his model, an offense Newsweek described at the time as "like bringing a felony indictment for jaywalking." Needleman was exonerated in 1992, and the New York Times ran an obituary honoring him following his death in 2017.

Would I be impressed by Needleman's work today, or would I count him out as another garden-variety noise-miner looking for evidence to support a foregone conclusion? Maybe it doesn't matter. In the Newsweek article, the EPA is quoted as saying "We don't even use Needleman's study anymore" because subsequent research recommended even lower safety thresholds than did Needleman's controversial work. The tempest has blown over. The winners write their history, and the losers get paid by the Cato Institute to go on Fox News and argue against "lead hysteria".

There's a lot that hasn't changed

We think that science has only been subjective, partisan, and politicized in our current "war on science" post-2016 world, but the 1990s also had "science under siege" (Time, Aug 26, 1991) and intractable debates between competing groups with vested interests in there being a crisis or not being a crisis. The tobacco wars reappear in every decade.

Similarly, the froth and stupidity of daytime TV lives on in today's Daily Mail and Facebook groups. In the 90s, people with more outrage than sense believed in vast networks of underground Satanist cults that tortured children and "programmed" them to become pawns in their world domination scheme. Today, those people believe that work continues in a Democrat-controlled child trafficking ring run through a pizza parlor and a furniture website and that Donald Trump is on a one-man mission to stop them.

Regarding fraud, we find that scientific self-policing only tends to emerge in response to crisis and scandal. NIH and NSF don't seem to have had formal recommendations regarding fraud until 1988; these were apparently motivated by pressure from Congress following the 1981 case of John Darsee, a Harvard cardiologist who had been faking his data. Those who do scientific self-policing aren't welcomed with open arms -- the book briefly stops to sneer at Walter Stewart and Ned Feder as "a kind of self-appointed truth squad. According to their critics, they had not been very productive scientists and were trying to find a way of holding on to their lab space."

Finally, each generation seems to suspect its successors of being fatally blinded by political correctness. This is clearest in the chapter dedicated to the defense of Cyril Burt, in which Rushton complains that academia will only become more corrupted by political correctness:
"Today, the campus radicals of earlier decades are the tenured radicals of the 1990s. Some are chairmen, deans, and presidents. The 1960s mentality of peace, love, and above all equality now constitutes a significant portion of the intellectual establishment in the Western world. The equalitarian dogma is more, not less, entrenched than ever before. Yet, it is based on the scientific hoax of the century."
Will every generation of academics forever consider their successors insufferably and disreputably woke? Should they? It seems that, despite Rushton's concerns, the hereditarian perspective has won out in the end. Today we have researchers who not only recognize heritability, but have given careful thought to the meaning, causality, and societal implications of the research. I see this as tremendous progress when compared to the way the book tends to frame the debate over heritability, which invites the reader to choose between two equally misguided perspectives of either ignorant blank-slate idealism or Rushton's inhumane "race realism."


Some things have changed since 1995, but much has stayed the same.

Compared to 25 years ago, I think we have a better set of tools for detecting fraud. We have new statistical tricks and stronger community norms around data sharing and editorial action. We have the Office of Research Integrity and Retraction Watch.

But some things haven't changed. Researchers checking each other's work are still, at times, regarded coldly: the "self-appointed truth squad" of 1995 is the "self-appointed data police" of 2016. Demonstrating intent to deceive remains a very high bar for those investigating misconduct; probably some number of fraudsters escape oversight by claiming mere incompetence. Because it is difficult to prove intent, it's easier to fight personal vendettas and culture war -- one can wave to an opponent's political bias without getting slapped with a libel suit. And we still don't know much about who commits fraud, why they commit fraud, and how we'll ever catch them.

Thursday, January 30, 2020

Are frauds incompetent?

Nick Brown asks:

My answer is that we are not spotting the competent frauds. This becomes obvious when we think about all the steps that are necessary to catch a fraud:
  1. The fraudulent work must be suspicious enough to get a closer look.
  2. Somebody must be motivated to take it upon themselves to take that closer look.
  3. That motivated person must have the necessary skill to detect the fraud.
  4. The research records available to that motivated and skilled person must be complete and transparent enough to detect the fraud.
  5. That motivated and skilled person must then be brave enough (foolish enough? equipped with lawyers enough?) to contact the research institution.
  6. That research institution must be motivated enough to investigate.
  7. That research institution must also be skilled enough to find and interpret the evidence for fraud.

Considering all these stages at which one could fail to detect or pursue misconduct, it seems immediately obvious to me that we are finding only the most obvious and least protected frauds.
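The arithmetic behind this intuition is just a product of probabilities. Here is a back-of-the-envelope sketch; the stage probabilities are numbers I made up for illustration, and treating the stages as independent is a simplifying assumption, not an empirical claim.

```python
import math

# Hypothetical chance of clearing each of the seven stages listed above.
# These values are invented for illustration only.
STAGE_PROBABILITIES = {
    "work looks suspicious": 0.5,
    "someone is motivated to look": 0.5,
    "that person has the skill": 0.5,
    "records are complete enough": 0.5,
    "that person contacts the institution": 0.5,
    "institution is motivated to investigate": 0.5,
    "institution is skilled enough": 0.5,
}


def detection_probability(stage_probabilities):
    """Probability that every stage succeeds, assuming independent stages."""
    return math.prod(stage_probabilities)


if __name__ == "__main__":
    p = detection_probability(STAGE_PROBABILITIES.values())
    print(f"Chance a fraud clears all seven stages: {p:.4f}")
```

Even with a generous coin flip at every stage, fewer than one fraud in a hundred makes it all the way through. A careful fraudster mostly needs to lower the odds at stage one.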

Consider the "Boom, Headshot!" affair. I had read this paper several times and never suspected a thing; nothing in the summary statistics indicates any cause for concern. The only reason anybody discovered the deception was because Pat Markey was curious enough about the effect of skewness on the results to spend months asking the authors and journal for the data and happened to discover values edited by the grad student.

Are all frauds stupid?

Some of the replies to Nick's question imply that faking data convincingly is too much hassle compared to actually collecting data. If you know a lot about data and simulation, why would you bother faking data? This perspective assumes that fraud is difficult and requires skills that could be more profitably used for good. But I don't think either of those is true.

Being good at data doesn't remove temptations for fraud

When news of the LaCour scandal hit, the first thing that struck me was how good this guy was at fancy graphics. Michael LaCour really knew his way around analyzing and presenting statistics in an exciting and accessible way.

But that's not enough to get LaCour's job offer at Princeton. You need to show that you can collect exciting data and get exciting results! When hundreds of quant-ninja, tech-savvy grad students are scrambling for a scant handful of jobs, you need a result that lands you on This American Life. And those of us on the tenure track have our own temptations: bigger grants, bigger salaries, nicer positions, and respect.

Some might even be tempted by the prospect of triumphing over their scientific rivals. Cyril Burt, once president of the British Psychological Society, was alleged to have made up extra twin pairs in order to silence critics of his discovered link between genetics and IQ. Hans Eysenck, the most-cited psychologist of his time, published and defended dozens of papers using likely-fabricated data from his collaborator that supported his views on the causes of cancer.

Skill and intellect and fame and power do not seem to be vaccines against misconduct. And it doesn't take a lot of skill to commit misconduct, either, because...

Frauds don't need to be clever

A fraud does not need a deep understanding of data to make a convincing enough forgery. A crude fake might get some of the complicated multivariate relationships wrong, sure. But will those be detected and prosecuted? Probably not.

You don't need to be the Icy Black Hand of Death to get away with data fakery.
(img source fbi.gov)

Why not? Those complicated relationships don't need to be reported in the paper. Nobody will think to check them. If they want to check them, they'll need to send you an email requesting the raw data. You can ignore them for some months, then tell them your dog ate the raw data, then demand they sign an oath of fealty to you if they're going to look at your raw data.

Getting the complicated covariation bits a little wrong is not likely to reveal a fraud, anyway. Can a psychologist predict even the first digit of simple correlations? A complicated relationship that we know less about will be harder to predict, and it will be harder to persuade co-authors, editors, and institutions that any misspecification is evidence of wrongdoing. Maybe the weird covariation can be explained away as an unusual feature of the specific task or study population. The evidence is merely circumstantial.

...because data forensics can rarely stop them.

Direct evidence requires some manner of internal whistleblower who notices and reports research misconduct. Again, one would need to actually see the misconduct, which is especially unlikely in today's projects in which data and reports come from distant collaborators. Then one would need to actually blow the whistle, after which they might expect to lose their career and get stuck in a years-long court case. Most frauds in psychology are caught this way (Stroebe, Postmes, & Spears, 2012).

In data forensics, by contrast, most evidence for misconduct is merely circumstantial. Noticing in the data very similar means and standard deviations or duplicated data points or duplicated images might be suggestive, but requires assumptions, and is open to alternative explanations. Maybe there was an error in data preprocessing, or the research assistants managed the data wrong, or someone used file IMG4015.png instead of IMG4016.png.

This circumstantial evidence means that nonspecific screw-ups are often a plausible alternative hypothesis. It seems possible to me that a just-competent fraud could falsify a bunch of reports, plead incompetence, issue corrections as necessary, and refine their approach to data falsification for quite a long time.

A play in one act:

The means were 2.50, 2.50, 2.35, 2.15, 2.80, 2.40, and 2.67.

It is exceedingly unlikely that you would receive such consistent means. I suspect you have fabricated these summary statistics.

Oops, haha, oh shit, did I say those were the means? Major typo! The means were actually, uh, 2.53, 3.12, 2.07, 1.89...

Ahh, nice to see this quickly resolved with a corrigendum. Bye everyone.

We are fully committed to upholding the highest ethical standards etc. any concerns are thoroughly etc. etc.

FRAUDSTER (sotto voce) 
That was close! Next time I fake data I will avoid this error.

The field isn't particularly trying to catch frauds, either.

Trying to prosecute fraud sounds terrible. It takes a very long time, it requires a very high standard of evidence, and lawyers get involved. It is for these reasons, I think, that the self-stated goal of many data thugs is to "correct the literature" rather than "find and punish frauds".

But I worry about this blameless approach, because there's no guarantee that the data that appears in a corrigendum is any closer to the truth. If the original data was a fabrication, chances are good the corrigendum is just a report of slightly-better-fabricated data. And even if the paper is retracted, the perpetrator may learn from the experience and find a way to refine his fabrications and enjoy a long, prosperous life of polluting the scientific literature.

In summary,

I don't think you have to be particularly clever to be a fraud. It seems to me that most discovered frauds involve either direct evidence from a whistleblower or overwhelming circumstantial evidence due to rampant sloppiness. I think that there are probably many more frauds with just a modicum of skill that have gone undiscovered. There are probably also a number of cases that are quietly resolved without the institution announcing the discovered fraud. I spend a lot of time thinking about what it would take to change this, and what the actual prevalence would be if we could uncover it.

Saturday, November 23, 2019

Weighing bullets, not hot sauce

It's been a rich week of readings for wondering just what the hell we're doing. Loyka et al. (2019) present a framework for considering external validity, and this framework reminds us just how poorly we are doing at considering actual real-world human behavior. Tal Yarkoni has a preprint up that describes how implausible it is that the situations and stimuli we study will generalize to other situations and stimuli. Danielle Navarro has clarified her stance on preregistration by elaborating on how misguided she perceives hypothesis testing to be. Together, these articles remind us of the importance of studying the thing we actually care about, rather than what's convenient, because chances are that our findings won't generalize as simply as we expect, because a significant p-value only means that the null is wrong, and not that the alternative is correct.

These readings reminded me of some thoughts I'd jotted down following APA 2019. I'd been invited to present some of my research on violent video games. While I had a great session and had a lot of fun talking to a receptive audience about issues like measurement validity and publication bias, the overall APA experience was personally challenging. This is because one of the major themes of APA 2019 was gun violence and what the APA can do about it.

I attended a number of interesting sessions with presenters who studied actual violence by working and serving in communities, doing ride-alongs with police, interviewing people who had suffered violence and had perpetrated violence. This was draining in two ways. 

First, there's a lot of human suffering out there. One presenter had found that many felons serving prison sentences for gun violence had themselves been victims of gun violence, often as early as age 14. He further found that, when people knew who shot them, they were less likely to tell the police. They trust the police so little that they would prefer to settle the score themselves, and the police are just somebody you can dump your cold cases on as one last hail mary. A mother from Newtown was there. Both of her children had been shot in the massacre. One died. She described crying until the capillaries burst in both her eyes. One gets the feeling that tragedy cannot be prevented and that many people are doomed to poverty and violence from the moment they're born.

Second, it made me frustrated with how far removed we are from the actual societal problem we want to study. We want to prevent gang violence, child abuse, intimate partner violence, bullying, aggressive driving, and harassment. Instead of studying the community members of South- and West-Side Chicago, we study college undergraduates, a bunch of nerds who would rather read a book than fight somebody and generally have enough money and safety to be able to do just that. Instead of studying shootings or fights or abuse, we study how much hot sauce these undergrads pour for each other or whether they think a rude RA should be able to keep their job. We even use proxies of proxies -- when it's too much trouble to see how much hot sauce they'll pour for somebody, we give them KI__ and watch whether they fill it in as KILL or KISS.

One of the APA speakers closed by reference to the old joke about the drunkard looking for his keys. The drunk is looking for his keys under the streetlight. A friend joins him and helps him look for a while, with no progress. Eventually the friend, exasperated, says "Let's try something different. Where did you last see your keys?" The drunk says "I dropped my keys over there in the bushes." The bewildered friend asks "Well then, why are we searching over here by the streetlight?" To which the drunk replies "Well, the light's good over here, and I'm afraid of the dark." 

The light's good over here playing parlor tricks with college undergraduates and hot sauce. And it's certainly less scary than trying to get out in the rough parts of Chicago!

It's possible that I'm not well read and that there's a lot of great aggression research going on that studies these real problems. But mostly I see us running little experiments with just-significant results, or running survey designs that tell us something obvious and hopelessly confounded. Interviews and ethnography and field work seem to be for sociologists or criminologists, not psychologists.

What am I doing about it? Not much. For now, I'm doing my part by trying to test the convergent validity of our lab measures and see whether they actually agree with each other (preliminary answer: they don't). I often worry about my career, because I've never "discovered" some effect. You could do a decent job summarizing my last ten years as digging a deeper and deeper hole in what we think we already know, hoping to find some sort of bedrock that we can build from. So far, I'm still shoveling, assessing publication bias, failing to replicate findings, criticizing too-good-to-be-true results, and trying to figure out if our measures are at all valid and reliable.

I like the work that I do, and I think it's the best work I can do given my skills and resources and timeframe. But that work could be much more valuable if I could get out into the actual populations and environments that we're worried about. I had an RA with a connection at a maximum-security prison, but I wasn't able to pursue the lead aggressively enough and it slipped through my fingers. I'm not particularly smooth or adventurous, so I'm not enthusiastic about going into communities to understand gun violence. I'm pre-tenure, so what makes the most sense for me career-wise is to stick to doing more of the same research with college undergrads and MTurk workers. Maybe try to find some sort of eyebrow-raising lab effect that I can wildly extrapolate from.

I'm not sure what to recommend. As a field, we probably need to recalibrate our expectations; we can't expect a scientist to make three or four noteworthy, generalizable discoveries a year. Getting actionable and generalizable psychological findings will probably require orders of magnitude more effort and investment. We can prepare psychological science for that investment by trying to improve the transparency and honesty of that process.

I'm gonna try to read more sociology and criminology. Maybe they know something we don't?

Monday, June 17, 2019

Comment on Chang & Bushman (2019): Effects of outlier exclusion

Recent research by Chang & Bushman (2019) reports how video games may cause children to be more likely to play with a real handgun. In this experiment, children participate in the study in pairs. They play one of three versions of Minecraft for 20 minutes. One version has no violence (control), another has monsters that they fight with swords (sword violence), and another has monsters that they fight with guns (gun violence). 

The children are then left to play in a room in which, hidden in a drawer, are two very real 9mm handguns. The handguns are disabled -- their firing mechanism has been taken out and replaced with a clicker that counts the number of trigger pulls. But these guns look and feel like the real thing, so one would hope that a child would not touch them or pull their triggers.

The authors report four study outcomes: whether the kid touches the gun, how long they hold the gun, how many times they pull the trigger, and how many times they pull the trigger while the gun is pointed at somebody (themself or the other kid).

I think it's an interesting paradigm. The scenario has a certain plausibility about it, and the outcome is certainly important. It must have been a lot of work to get the ethics board approval.

However, the obtained results depend substantially on the authors' decision to exclude two participants from the control group for playing with the guns a lot. I feel that this is an inappropriate discarding of data. Without this discard, the results are not statistically significant.

Overinterpretation of marginal significance

The results section reports one significant and three marginally significant outcomes:
  • "The difference [in handgun touching] across conditions was nonsignificant [...]" (p = .09)
  • "The gun violence condition increased time spent holding a handgun, although the effect was nonsignificant [...]" (p = .080)
  • "Participants in the gun violence condition pulled the trigger more times than participants in other conditions, although the effect was nonsignificant [...]" (p = .097)
  • "Participants in the violent game conditions pulled the trigger at themselves or their partner more than participants in the nonviolent condition." (p = .007)
These nonsignificant differences are overinterpreted in the discussion section, which begins: "In this study, playing a violent video game increased the likelihood that children would touch a real handgun, increased time spent holding a handgun, and increased pulling the trigger at oneself and others." I found this very confusing; I thought I had read the wrong results section. One has to dig into Supplement 2 to see the exact p values.

Exclusion of outliers

The distribution of the data is both zero-inflated and strongly right-skewed. About half of the kids did not touch the gun at all, much less pull its trigger. The minority of kids who did pull the trigger pulled it many times. This is a noisy outcome, and difficult to model: you would need a zero-inflated negative binomial regression with cluster-adjusted variances. The authors present a negative binomial regression with cluster-adjusted variances, ignoring the zero-inflation, which is fine enough by me since I can't figure out how to do all that at once either.

Self-other trigger pulls outcome. The pair in red were excluded because the coders commented that they were acting unusually wild. The pair in green were excluded for having too high a score on the outcomes.

Noisy data affords many opportunities for subjectivity. The authors report: "We eliminated 1 pair who was more than 5 SDs from the mean for both time spent holding a handgun and trigger pulls [green pair].  The coders also recommended eliminating another pair because of unusual and extremely aggressive behavior [red pair]." The CONSORT flow diagram reveals that these four excluded subjects with very high scores on the dependent variables were all from the nonviolent control condition, in which participants were expected to spend the least time holding the gun and pulling its trigger. 

The authors tell me that the pair showing unusual and extremely aggressive behavior was eliminated on the coders' recommendation, blind to condition. That may be true, but the registration is generally rather vague and says nothing about excluding participants on coder recommendation.

The authors also tell me that the high-scoring pair was eliminated without looking at the results. That may be true as well, but I feel as though one could predict how this exclusion might affect the results.

This latter exclusion of the high-scoring pair is not acceptable to me. You can consider this decision in two ways: First, you can see that there are scores still more extreme in the other two conditions. With data this zero-inflated and skewed, it is no great feat to be more than 5 SDs from the mean. Second, you can look at the model diagnostics. The excluded outliers are not "outliers" in any model influence sense -- their Cook's distances are less than 0.2. (Thresholds of 0.5 or 1.0 are often suggested for Cook's distance.)
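The first point can be illustrated with a rough simulation. The sketch below is not the study's data: the sample size, the share of zeros, and the shape of the tail are all hypothetical values chosen only to mimic "about half zeros, long right tail", and the function names are made up for the example. It asks how often a sample drawn from one and the same skewed distribution contains a point more than 5 SDs above its own mean.

```python
import random
from statistics import mean, stdev


def simulate_sample(rng, n=90, p_zero=0.5, tail_mean=20.0):
    """One hypothetical sample: roughly half zeros, the rest from a long right tail."""
    return [
        0 if rng.random() < p_zero else 1 + int(rng.expovariate(1 / tail_mean))
        for _ in range(n)
    ]


def has_extreme_point(sample, k=5.0):
    """True if the largest value sits more than k sample SDs above the sample mean."""
    m, s = mean(sample), stdev(sample)
    return s > 0 and max(sample) > m + k * s


def share_with_extreme_point(n_sims=2000, seed=1):
    """Fraction of simulated samples that contain at least one '5 SD' point."""
    rng = random.Random(seed)
    hits = sum(has_extreme_point(simulate_sample(rng)) for _ in range(n_sims))
    return hits / n_sims


print(f"Share of samples with a >5 SD point: {share_with_extreme_point():.3f}")
```

Every value in these simulated samples comes from the same process, so any point flagged by the 5-SD rule here is a legitimate observation rather than contamination. How often that happens depends entirely on the assumed parameters.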

Here are the nonzero values in log space, which is where the model fits the negative binomial. On a log scale, the discarded data points still do not look at all like outliers to me.

Revised results

If the high-scoring pair is retained for analysis, none of the results are statistically significant:
  • Touching the gun: omnibus F(2, 79.5) = 1.04, p = .359; gun-vs-control contrast p = .148.
  • Time holding gun: omnibus F(2, 79.5) = .688, p = .506; gun-vs-control contrast p = .278.
  • Trigger pulls at self or other: omnibus F(2, 79.4) = 1.80, p = .172; gun-vs-control contrast p = .098.
From here, adding the coder-suggested pair to the analysis moves the results further still from statistical significance.

If you're worried about the influence of the zero inflation and the long tail, a simpler way to look at the data might be to ask "is the trigger pulled at all while the gun is pointed at somebody?" After all, the difference between not being shot and being shot once is a big deal; the difference between being shot four times and being shot five times less so. Think of this as winsorizing all the values in the tail to 1. Then you could just fit a logistic regression and not have to worry about influence.

Analyzed this way, there are 6 events in the control group, 10 in the sword-game group, and 13 in the gun-game group. The authors excluded four of these six control-group events as outliers. With these exclusions, there is a statistically significant effect, p = .029. If you return either pair to the control group, the effect is not statistically significant, p = .098. If you return both pairs to the control group, the effect is not statistically significant, p = .221.

I wish the authors and peer reviewers had considered the sensitivity of the results to the questionable exclusion of this pair. While these results are suggestive, they are much less decisive than the authors have presented them.

Journal response

I attempted to send JAMA Open a version of this comment, but their publication portal does not accept comment submissions. I asked to speak with an editor; the editor declined to discuss the article with me. The journal's stance is that, as an online-only journal, they don't consider letters to the editor. They invited me to post a comment in their YouTube-style comments field, which appears on a separate tab where it will likely go unread.

I am disturbed by the ease with which peer reviewers would accept ad hoc outlier exclusion and frustrated that the article and press release do little to present the uncertainty. It seems like one could get up to a lot of mischief at JAMA Open by excluding hypothesis-threatening datapoints.

Author response

I discussed these criticisms intensely with the authors as I prepared my concerns for JAMA Open and for this blog post. Dr. Bushman replied:

We believe that [the coder-suggested pair] was removed completely legitimately, although you are correct this was not documented ahead of time on the clinicaltrials.gov site. We believe [the high-scoring pair] should also have been excluded, but you do not. We acknowledge there may be honest differences of opinion regarding [the high-scoring pair]. 
As stated in our comment on JAMA Open, “Importantly, both pairs were eliminated before we knew how they would impact our analyses and whether their results would support our hypotheses.”
Again, I disagree with the characterization of the removal of the high-scoring pair as a subjective decision. I don't see any justifiable criterion for throwing this data away, and one can anticipate how this removal would influence the analyses and results.


I was successfully able to reproduce the results presented by Chang and Bushman (2019). However, those results seem to depend heavily on the exclusion of four of the six most aggressive participants in the nonviolent control group. The justification that these four participants are unusually aggressive does not seem tenable in light of the low influence of these datapoints and similarly aggressive participants retained in the other two conditions. 

While I admire the researchers for their passion and their creative setup, I am also frustrated. I believe that researchers have an obligation to quantify uncertainty to the best of their ability. I feel that the exclusion of high-scoring participants from the control group serves to understate the uncertainty and facilitate the anticipated headlines. The sensitivity of the results to this questionable exclusion should be made clearer.

See my code at https://osf.io/8jgrp/. Analyses reproduced in R using MASS::glm.nb for negative binomial regression with log link and clubSandwich for cluster-robust variance estimation. Data available upon request from the authors. Thanks to James Pustejovsky for making clubSandwich. Thanks to Jeff Rouder for talking with me about all this when I needed to know I wasn't taking crazy pills.

Sunday, April 15, 2018

Why I hate teaching the classics

I’m approaching the end of my first semester teaching Intro to Social Psychology. As someone who came of age during the peak of the replication crisis (Bem, Stapel, Reproducibility Project), studies publication bias, and has had a hard time finding statistically significant results, I generally have a dim view of big chunks of the literature. I was worried that we would have very little to talk about given all the uncertainty, but we’ve made a good semester of it by talking about the general ideas, their strengths and weaknesses, and the opportunities for a young scientist to contribute by addressing these uncertainties.

But this semester’s teaching has taught me one thing: I hate teaching the classics.
What makes the classics, and why do I hate teaching them? The studies that my textbooks present as classics tend to have a few common attributes, some desirable and others undesirable.
The desirable:

  1. They provide a useful summary of some broader theory.
  2. They are catchy or sticky in a way that makes them easy to remember and fun to talk about.
  3. The outcome is provocative and interesting.

The undesirable:
  1. The sample size is tiny.
  2. The p-values are either marginal or bizarrely good. 
  3. The outcome has little evidence of validity.
  4. Data from the classic study tend to predate strong tests of the theory by several decades. The strongest evidence tends to come later (if at all) when people have cleaned up the methods and run more studies (often in response to criticism).
My concern is that these qualities of classics give students the wrong idea about what makes for good psychological science, leading them to embrace the desirable attributes of these classics without considering the undesirable attributes.

Some classics that I’ve struggled with this semester:
Fredrickson et al. (1998): In this classic study on the harms of self-objectification, wearing a swimsuit (vs. a sweater) caused women (but not men) to do worse on a math test, N = 82, p = .051.
Pennebaker & Beall (1986): In this classic study on the benefits of self-expression, students who wrote about a traumatic experience enjoyed better health, N = 46, p = .055 for health center visits, p = .10 for sick days, p = .040 for total health problems.
Rosenthal and Jacobson (1968): In this classic study on how expectations shape outcomes, students labeled as “about to bloom” gained more IQ than other students. Unfortunately, the data are insane, with many students scoring well outside of the range of the test, featuring pre-post scores on the scale of hundreds of points (see Snow, 1995; hat tip to Bob C-J).
Srull & Wyer (1979): In this classic study of how primes influence perceptions of others, primes influenced perceptions up to days later. Unfortunately, the data show an effect too insanely powerful to be true; in meta-analyzing this literature, DeCoster and Claypool (2004) estimate Srull & Wyer’s result as d = 5.7. (For reference, obvious effects like “men are taller than women” are in the range of d = 1.85; Simmons, Nelson, & Simonsohn, 2013.)
Festinger & Carlsmith (1959): In this classic study of cognitive dissonance, participants given a small bribe to say a boring task was fun changed their opinion of the boring task. Unfortunately, the published results contain a number of GRIM errors.
This isn’t to say that the classics are bad science, especially for their time. My concern is that their evidence is much weaker than one might expect given their status as classics. It makes me feel sometimes like I am teaching the history of psychology instead of the science of psychology; something where knowing about the peg-turning experiment is hoped to represent some greater knowledge.
Figure 1. Me and my fellow troublemakers (periphery) complaining about a classic study (center).
What’s the problem?
My concern is that these classics set a bad example for young scientists and do not prepare them to think about science according to modern standards. According to these classics, one collects a little data on a new, untested method, and so long as the p-value isn’t too far from significance, you can make an argument about how the mind works. If your idea is catchy enough, the citations will roll in forever, and few will talk about the weaknesses of the evidence. Like Daryl Bem said in his recent interview with Dan Engber, “I’m all for rigor, but I prefer other people do it. […] If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made.” 
This isn’t to say that the theories proposed by these classics are necessarily wrong. It’s just hard to teach these originals while talking about how weak that one study is. Discrediting one operationalization may unjustifiably discredit the broader idea. Maybe the whole Festinger & Carlsmith peg-turning, subject-bribing method is bunk, but cognitive dissonance is such a strong, broad idea that it seems impossible to discard it. In that light, is it really important how Festinger & Carlsmith did it? Couldn’t we cite instead something that demonstrates the core idea with a little more refinement or rigor?
In the “Creativity and Rigor” episode of The Black Goat, Sanjay, Simine, and Alexa talk about the problem of framing creativity and rigor as enemies. This framing sets science up as some sort of battle between the creative, idea-generating geniuses and the rigorous, pencil-pushing critics. It doesn’t have to be this way, they argue -- the goals of rigor and creativity are aligned. To test interesting ideas in useful ways will require both rigor and creativity.
It’s my concern that teaching these cool-idea, weak-evidence studies as the classics may lead students to value creativity without rigor. When we canonize these early studies, we honor them for their interesting ideas or provocative manipulations, but we overlook all their weaknesses in sample size and measurement validity.
Figure 2. A brilliant idea occurs to a psychologist in 1972. The psychologist will demonstrate its truth in a sample of 28 undergraduates with a p-value of .063, an event which will be remembered by textbooks forever.

What should we do?
I would like to see more textbooks credit both the original idea and some of the stronger methods and samples. In this way, we could teach both the origin of the theory and the best science involved in testing that theory. If newer, stronger data is not available, this should be made clear as a weakness of the literature and an opportunity for students to do their own studies.
This is probably not easy to do. The classics have a lot of momentum and citations, which makes them easy to discover. Finding these newer, more rigorous studies and writing them up for textbooks will be more work. I think it will be worth it. This will help communicate to students our values as members of the sciences. It will give more credit and more attention to psychology as an empirical science, not just a system for the generation of cool ideas.

Wednesday, December 13, 2017

How to Play a Prediction Market

The prediction market is a way to try to assign probabilities to events. Bettors buy YES bets on things they think are likely to happen (relative to the market price) and NO bets on things they think are unlikely to happen (relative to the market price). Market dynamics lead the market price to settle on what is, across the bettors, the best subjective probability of the event. This is useful if you are trying to assign probabilities to one-off future events.

In this post, I'll teach you how to place bets to most effectively get the largest payout possible. In so doing, you'll do more to calibrate the market to your predictions.

Let's get ready to corner the replication market!

How does a prediction market work?

A prediction market allows people to bet YES or NO on some outcome. As people bet that the outcome will happen, the price of YES shares increases. As people bet that the outcome won't happen, the price of YES shares falls.

The market price for a YES share is p, the probability of the outcome. The market price for a NO share is (1-p). If the event happens, all the YES shares pay out $1 each and the NO shares become worthless. If the event does not happen, all the NO shares pay out $1 each and the YES shares become worthless.

The probability of rolling a six is 1/6, so we should be willing to pay up to $1/6 for YES or $5/6 for NO.

Imagine we are betting that a roll of a six-sided die will yield a six. The probability of this is 1/6, or about 17 percent. YES shares will cost 17 cents and NO shares will cost 83 cents. With five dollars, you could buy 30 YES shares or 6 NO shares.

Your expected payout is the number of shares times the probability. In the die example, since the market price is correct, your expected value is five dollars whether you buy YES or NO. For YES shares, 30 shares * (1/6 payout chance) = $5. For NO shares, 6 shares * (5/6 payout chance) = $5.

If the market price is wrong, you have a chance to make a profit. Suppose we are still betting on the die, but for some reason the market price is set at 10 cents for a YES share. We know that the probability of the die rolling six is greater than this, so with our five dollars we can buy 50 shares with an expected value of 50 shares * (1/6 payout chance) = $8.33. This is a profit of about $3.33. Another way to look at this is that it's a profit of about seven cents per share, the difference between the wrong market price (.10) and the true probability (.17).

But if the market price is wrong, and we are wrong with it, we will lose money. Buying NO shares at this price (90 cents each) will turn our five dollars into 5.56 shares * (5/6 payout chance) = $4.63, a long-run loss of 37 cents.
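The die arithmetic above can be sketched in a few lines of Python. This is only an illustration of the share-and-payout bookkeeping described in this section; the function names (shares_bought, expected_payout) are made up for the example, and it assumes each winning share pays exactly $1 with no fees.

```python
def shares_bought(budget, price):
    """Number of shares a budget buys at a given per-share price (in dollars)."""
    return budget / price


def expected_payout(budget, price, win_prob):
    """Expected dollars returned: shares held times the chance each one pays $1."""
    return shares_bought(budget, price) * win_prob


P_SIX = 1 / 6  # true probability of rolling a six

# Fairly priced market: YES costs 1/6, NO costs 5/6.
fair_yes = expected_payout(5, P_SIX, P_SIX)
fair_no = expected_payout(5, 1 - P_SIX, 1 - P_SIX)

# Mispriced market: YES trades at 10 cents, so NO trades at 90 cents.
cheap_yes = expected_payout(5, 0.10, P_SIX)
pricey_no = expected_payout(5, 0.90, 1 - P_SIX)

for label, value in [
    ("fair YES", fair_yes),
    ("fair NO", fair_no),
    ("underpriced YES", cheap_yes),
    ("overpriced NO", pricey_no),
]:
    print(f"{label}: expected payout ${value:.2f} on a $5.00 stake")
```

When the price equals the true probability, both sides return the stake on average. Only the mispriced market separates them.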

The Big Picture of the Big Short

Like we covered above, playing the prediction market isn't simply about buying YES on things you think will replicate and NO on things you think won't. Otherwise, we would just buy NO shares on the die rolling six because we know it's unlikely relative to the die not rolling six. It's about evaluating the probability of those replications. Your strategy in a betting market should be to look for those opportunities where there is a difference between the market price and the probability that you'd assign to that event.

If the market is completely correct, it shouldn't matter what you buy -- your 50 tokens will have an expected value of $50. In our die example above, when the market price was right, YES and NO shares had the same expected value. But if the market is wrong, you have a chance to beat the market, turning your 50 tokens into several times their value.

In order to beat the market, you have to find places where the market price is miscalibrated. Maybe something is trading at 40% when it only has a 20% chance to replicate, in your view. If you are right, each NO share you buy will cost 60 cents but have an expected value of 80 cents. But if you are wrong, you will pay more for the shares than they are truly worth, getting a poorer return on your 50 tokens than had you just spread them across the market.

Below is my four-step process for turning your predictions into the largest possible payoff.

1. Evaluate your prices.

Before the market started, I wrote down my estimates of what would or wouldn't replicate. I assigned probabilities to these studies indicating what chance I thought they had of replicating.

Coming up with these estimates is the basis of the replication market. I ended up focusing on the things I thought wouldn't replicate. Some studies were a priori deeply implausible, others had weak p-values, some had previous failures to replicate, and some had a combination of factors. These were studies I felt pretty confident wouldn't replicate, and so I priced them at about 10% (2.5% chance of Type I error + 7.5% chance of true effect).

A peek at my spreadsheet, comparing my subjective probabilities to the market prices.

Some other studies seemed more likely to replicate, so I was willing to price them in the 50-80% range. I was less certain about these, so I saw these as riskier purchases, and tended to invest less in them.

It's also useful to remember the context of the last prediction market. In that market, the prices were much too high. Nothing below 40% replicated, and the highest-priced study (88%) also failed to replicate. The lowest price on a successful replication was about 42%.

2. Buy and sell to your prices.

To make profit on the replication market, you have to spend your money where you think the market price is most miscalibrated. Something that the market thinks is a sure thing (95%) that you think will flop (5%) would be a massive 90-cent profit per share. Something that seems reasonable (50%) that the market is afraid won't replicate (15%) could be a nice little profit of 35 cents per share.

I made a spreadsheet of my prices and the market's prices. I added a column representing the difference between those prices. The largest absolute difference indicates where I would expect the greatest profit per share.

If the difference is negative, then buy NO shares. Suppose something is trading at 50%, but you think it has only a 15% chance of replicating. You can buy NO shares for 50 cents that you think are worth 85 cents -- a 35 cent profit per share.

If the difference is positive, then buy YES shares. If something is trading at 50%, and you think it has a 75% chance of replicating, then every YES share costs 50 cents but is worth 75 cents.

Overly optimistic market prices meant that I placed most of my bets on certain studies not replicating.
Again, you only profit when you are right and the market is wrong. Look for where there is juice!
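The spreadsheet logic in this step can be sketched as follows. The rows are hypothetical and simply mirror the worked examples above; best_trade is a made-up helper name, and the "difference" is taken to be my probability minus the market price, which is an assumption about the sign convention.

```python
def best_trade(my_prob, market_price):
    """Return (side, edge), where edge is the expected profit per share in dollars.

    my_prob: your subjective probability that the study replicates.
    market_price: current price of a YES share (the market's probability).
    """
    difference = my_prob - market_price
    if difference > 0:
        return "YES", difference   # YES costs market_price, worth my_prob to you
    if difference < 0:
        return "NO", -difference   # NO costs 1 - market_price, worth 1 - my_prob
    return "HOLD", 0.0             # no disagreement, no expected profit


# Hypothetical rows: (label, my probability, market price).
sheet = [
    ("sure thing I expect to flop", 0.05, 0.95),
    ("reasonable study the market doubts", 0.50, 0.15),
    ("overpriced at 50%", 0.15, 0.50),
    ("underpriced at 50%", 0.75, 0.50),
]

# Sort by absolute gap so the biggest per-share opportunities come first.
ranked = sorted(sheet, key=lambda row: abs(row[1] - row[2]), reverse=True)
for name, mine, market in ranked:
    side, edge = best_trade(mine, market)
    print(f"{name}: buy {side}, about {edge * 100:.0f} cents per share")
```

The edge is only an expected profit under your own probabilities. If those are wrong, the same number is your expected loss.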

3. Diversify your portfolio

If you want to ensure a decent payout, it may make sense to spread your money around. Suppose there is a study priced at 50% chance of replicating, but you know the true chance of replication is 80%. If you're right, putting all 50 tokens on this one study has an 80% chance of earning you $100, but a 20% chance of earning you $0. Your expected value is $80, a nice $30 profit, but there's a lot of variability.

Payout     $100   $0
Frequency  80%    20%
EV = $80; SD = $40

By diversifying your bets, you can reduce the variability at the cost of reducing your expected value slightly. Consider if we divide your bets across two options, one with a slightly worse profit margin. Let's say Study 1 is priced at 50% but is worth 80%, and Study 2 is priced at 65% but is worth 75%. By putting half our money into Study 2, we reduce our average profit, but we also reduce the likelihood of suffering a blowout.

Payout     $88   $50   $38   $0
Frequency  60%   20%   15%   5%
EV = $69; SD = $26
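The two payout tables can be recomputed with a short sketch. It assumes the studies replicate independently of each other and that each winning share pays $1; payout_distribution and mean_sd are hypothetical helper names, not part of any market software.

```python
from itertools import product
from math import sqrt


def payout_distribution(bets):
    """Enumerate every win/lose combination for a set of YES positions.

    bets: list of (stake, market_price, true_prob) tuples.
    Returns a list of (payout, probability) pairs, assuming independent outcomes.
    """
    outcomes = []
    for wins in product([True, False], repeat=len(bets)):
        payout, prob = 0.0, 1.0
        for won, (stake, price, true_prob) in zip(wins, bets):
            if won:
                payout += stake / price   # each share bought pays $1
                prob *= true_prob
            else:
                prob *= 1 - true_prob
        outcomes.append((payout, prob))
    return outcomes


def mean_sd(outcomes):
    """Expected value and standard deviation of a payout distribution."""
    ev = sum(x * p for x, p in outcomes)
    var = sum(p * (x - ev) ** 2 for x, p in outcomes)
    return ev, sqrt(var)


all_in = payout_distribution([(50, 0.50, 0.80)])
split = payout_distribution([(25, 0.50, 0.80), (25, 0.65, 0.75)])

for label, dist in [("all on Study 1", all_in), ("split across two", split)]:
    ev, sd = mean_sd(dist)
    print(f"{label}: EV = ${ev:.2f}, SD = ${sd:.2f}")
```

If the studies' outcomes were correlated (say, both from the same lab or paradigm), the reduction in variability would be smaller than this independence assumption suggests.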

In the recent market, for example, Sparrow, Liu, and Wegner tended to trade at 55%, whereas I thought it was worth about 15%. Although this 40-cent gap would have been my biggest profit-per-dollar, I felt it was too risky to put everything on this study, so I balanced it against other studies with smaller profit margins.

4. Day trading

As other people show up to the market and start twiddling their bets around, the market prices will change. The market may move towards some of your predictions and away from others. If you like to procrastinate by watching the market, you can leverage out your bets for a higher potential payout.

Figure 1. You hold NO shares of Studies 1 and 2, which the market has evaluated at 35% (bars) but you think have only a 10% chance of replicating (dashed line). Each share represents 25 cents of profit to you.

Let's say you think Study 1 and Study 2 each have a 10% chance of replicating. You bought 30 shares of Study1 NO and Study2 NO for 65 cents a share each (35% chance to replicate). You see each of these shares as representing a 25 cent profit (Figure 1).

Figure 2. The market has shifted such that your Study1 NO shares are worth more and your Study2 NO shares are worth less. If you are ready to be aggressive, you can sell your Study1 NO shares to take advantage of cheaper Study2 NO shares.

Some time passes, and now the market has agreed with you on Study1, dropping its probability to 20%, but it disagrees with you on Study2, raising the probability to 45% (Figure 2). The shares of Study 1 you're holding have already realized 15 cents per share of profit. The shares of Study 2 you're holding have lost 10 cents a share, but if you are right, then you can keep buying these shares at 55 cents when you think they are worth 90 cents.

Since the Study 1 shares have already realized their value, you can sell the Study1 NO shares to buy more cheap shares of Study2 NO. If the market fluctuates again, you can sell your expensive shares to pick up more cheap shares and so on and so on.
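The swap described above works out roughly as follows. This sketch assumes 30 Study1 NO shares are being sold, that trades execute at the quoted prices with no fees, and that your own buying and selling does not move the price; swap_no_positions is a made-up helper name.

```python
def swap_no_positions(shares_held, sell_yes_price, buy_yes_price):
    """Sell NO shares of one study and reinvest the proceeds in NO shares of another.

    Prices are quoted as the YES price (the market's probability of replication),
    so a NO share costs 1 minus that price.
    Returns (cash raised by selling, number of new NO shares bought).
    """
    proceeds = shares_held * (1 - sell_yes_price)
    new_shares = proceeds / (1 - buy_yes_price)
    return proceeds, new_shares


MY_PROB = 0.10  # your belief that either study replicates

# Study 1 has dropped to 20%, Study 2 has risen to 45%.
proceeds, new_shares = swap_no_positions(30, sell_yes_price=0.20, buy_yes_price=0.45)

value_if_held = 30 * (1 - MY_PROB)             # expected payout of keeping Study1 NO
value_if_swapped = new_shares * (1 - MY_PROB)  # expected payout after moving to Study2 NO

print(f"Selling raises ${proceeds:.2f}, which buys {new_shares:.1f} Study2 NO shares")
print(f"Expected payout, by your own estimate: ${value_if_held:.2f} held "
      f"vs ${value_if_swapped:.2f} swapped")
```

The swap raises the expected payout only if your 10% estimate for Study 2 is right. It also concentrates more of your money on a single study, which works against the diversification in step 3.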

I watched the market and kept comparing the prices against my predictions. When one of my NO bets started to cap out (e.g., Gervais and Norenzayan reached 15%), I would sell my NO bets and reinvest them in another cheaper NO bet (e.g., buying NO on Kidd and Castano at 40%). Sometimes some poor credulous soul (or somebody fumbling with the GUI) would buy a bunch of YES bets on Ackerman, driving the price way up (e.g., to 45%). When this would happen, I'd sell all my current bets to take advantage of the opportunity of cheap Ackerman NO bets.

It can be tempting to try to play the market, moving your tokens around to try to catch where other people will move tokens. I don't think there's much use in that. There aren't news events to influence the prediction market prices. Just buy your positions and hold them. If the market disagrees with you, you may consider doubling down on your bets now that they are cheaper. If the market agrees with you, you can release those options to invest in places where the market disagrees with you.


To make the biggest profits, you have to beat the market. To do this, you must: (1) Make good estimates of the probability to replicate. (2) Find the places where the market price is most divergent from what probability you would assign the study. (3) Spread your bets out across a number of studies to manage your risk. (4) Use day trading to take advantage of underpriced shares and increase your total leverage.

Friday, December 1, 2017

Adventures programming a Word Pronunciation Task in PsychoPy

I'm a new assistant professor trying to set up my research laboratory. I thought I'd try making the jump to PsychoPy as a way to make my materials more shareable, since not everybody will have a $750+ E-Prime or DirectRT license or whatever. (I'm also a tightwad.)

My department has a shared research suite of cubicles. Those cubicles are equipped with Dell Optiplex 960s running Windows 7. I'm reluctant to try to upgrade them since, as shared computers, other members of the department have stuff running on them that I'm sure they don't want to set up all over again.

In this process, I ran into a couple of bugs on these machines that I hadn't encountered while developing the tasks on my Win10 Dell Optiplex 7050s. These really made life difficult. I spent a lot of time wrangling with these errors, and I experienced a lot of stress wondering whether I'd fix them in five minutes or five months.

Here for posterity are the two major bugs I'd encountered and how they were resolved. I don't know anything about Python, so I hope these are helpful to the equally clueless.

"Couldn't share context" error

Initially, PsychoPy tasks of all varieties were crashing on startup. Our group couldn't even get the demos to run. The error message said pyglet.gl.ContextException: Unable to share contexts.

Didn't fix it:

Apparently this can be an issue with graphics drivers on some machines. Updating my drivers didn't fix the problem, perhaps in part because the hardware is kind of old.


This error was resolved by specifying an option for pyglet. I used PsychoPy's Builder View to compile the task. This made a file called Task.py. I opened up the .py file with notepad / wordpad / coder view / code writer and added two lines to the top of the script (the pyglet lines below):

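from __future__ import absolute_import, division
# Trying to fix pyglet 'shared environment' error
import pyglet
# Assumption: the option line was lost from this listing; disabling the shadow
# window is the commonly reported pyglet setting for "Unable to share contexts"
pyglet.options['shadow_window'] = False
# script continues as normal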
from psychopy import locale_setup, sound, gui, visual, core, data, event, logging
from psychopy.constants import (NOT_STARTED, STARTED, PLAYING, PAUSED,
                                STOPPED, FINISHED, PRESSED, RELEASED, FOREVER)
This fixed my "Couldn't share context" error. If you're having trouble with "couldn't share context", consider opening up your .py file and adding these two lines just underneath from __future__ import.

Portaudio not initialized error

My Word Pronunciation Task requires the use of a microphone to detect reaction time. Apparently this was a simple task for my intellectual ancestors back in the 1990s -- they were able to handle this using HyperCard, of all things! But I have lost a lot of time and sleep and hair trying to get microphones to play nice with PsychoPy. It's not a major priority for the overworked developers, and it seems to rely on some other libraries that I don't understand.

Trying to launch my Word Pronunciation Task led to the following error: "PortAudio not initialized [...] The Server must be booted! [...] Need a running pyo server."

This was fixed by changing Windows' speaker playback frequency from 48000 Hz to 44100 Hz.

Right click on the Volume icon in the taskbar and open up "Playback devices."

Right click on your playback device and click "Properties."

Under the "Advanced" tab, switch the audio quality from a 48000Hz sampling rate (which Portaudio doesn't like) to a 44100 Hz sampling rate (which Portaudio does like, apparently).

This strangely oblique tweak was enough to fix my Portaudio problems.

Now that I can use all these computers, I'm looking forward to scaling up my data collection and getting this project really purring!

Thanks to Matt Craddock and Stephen Martin for help with the "shared context" bug. Thanks to Olivier Belanger for posting how to fix the Portaudio bug.