Saturday, May 29, 2021

Smell you later

A few weeks ago, I started a new job as a data scientist at the logistics division of a pharmaceutical company. This was primarily motivated by family reasons. My wife works at an art museum in Chicago and could not find satisfactory work in my small college town. We'd been living apart while I struggled to find a new professorship that might put us in a city she'd like. The academic job market is lousy enough that I was lucky to have a job anywhere -- I was not the kind of research superstar and granting rainmaker who could work wherever I pleased.

The school going online in 2020 made it possible for me to spend a fourth year in academia, but with a big naughty dog and our first child on the way, there was no way I could return to in-person classes in the fall. I flailed around on LinkedIn looking for a chance to get into data science, but later lucked out when a headhunter found my resume on Dice and scooped me up.

So now I'm working in industry. I am happy with this development: the pay is good, I don't have to drive two and a half hours to see my wife, and I get to turn my brain off at 5pm. 

My impact on the field

I'd like to think that I've managed to do a few useful things in my time in social psychology.

  • I've tried to point out that, contrary to some previous claims, there is reason to doubt that 15 minutes of violent video games causes a detectable increase in aggressive behavior. I've identified publication bias, and I've published null results.
  • I've tried to bring greater awareness to publication bias and meta-analytic ways to adjust for it. The paper I wrote with Evan Carter, Felix Schonbrodt, and Will Gervais has been pretty successful in this light and quickly became my most cited paper.
  • I've managed to get some really suspicious papers retracted. I'd like to think that this has helped shift the norms in post-publication review and retraction, if only slightly. I'm happy I could contribute a few more data points to the Retraction Watch database.
  • Paul Bloom once saw me order a Sazerac at SPSP, asked me what it was, then ordered one himself. So you might say I've influenced some of the big names.

The stuff we do doesn't matter

The thing that I'll probably most miss about academia is getting to research whatever I'm curious about. I'm not missing it just yet, however. While on the tenure track, I didn't think I was discovering hidden truths about the human mind by studying the American college undergraduate and Prolific.co participant. 

Honestly, I'd felt pretty discouraged about research for a while. The things we study tend to have small effects, and when we can't detect those small effects a second time, it can be hard to tell why. (Possible explanations include noise, publication bias, errors in methods, differences in populations due to culture or even the passage of time.) It's why we spend so much time arguing about the fidelity of replication methods and hidden moderators.

Because the things we study have small, purportedly delicate effects, it's rare that we expect to see them applied and working in the real world. It's unpleasant to say it, but I feel that a lot of the research that we do doesn't matter. It's because it doesn't matter that we were able to get all the way into the 2010s before having a replication crisis. If we had screwed up our basic science in physics or biology or chemistry, we would notice pretty quickly when the engineers told us their bridges were collapsing or the crops were dying or the soda pop was going flat. By comparison, very little in social psychology seems to be applied or expected to work in any routinely detectable way.

The lackadaisical response I've sometimes received when raising concerns about papers has further convinced me that most social psych research does not matter. When I email a journal to say "none of these statistics add up" or "these effect sizes are ridiculously big," I often get no reply. Compare this to the sort of all-hands-on-deck response we might get if we found poison in the dog food. It doesn't matter that the product is no good -- we produce it for the sake of producing it, quality irrelevant.

In comparison, the stuff I'm doing as a data scientist isn't glamorous, but it's useful. Some of our projects save the company millions of dollars a year in shipping costs. That's a lot of gasoline and traffic and cardboard and dry ice that we're able to save. Reducing the amount of oil and packaging that gets used up might be the most useful thing I've done in years.

I sometimes wonder if the future of social psychology isn't in industry. Academics are spending their little budgets on MTurk studies and undergraduate surveys while tech companies have terabytes of data on people's activities (real people! from the real world!) and can run A/B tests whenever they like. People sometimes complain that "the best minds of our generation are working to get people to click on ads," but it's also the case that the best datasets of our generation are dedicated to the same cause.

I also wasn't confident that I was doing anything useful as a teacher. At some level it broke my heart to know that students were paying money to attend my classes and take my exams. I probably expected too much, but I felt like my classes somehow had to lift my undergrads into a job. I could see a computer science course doing that, but not social psychology. I often felt like a failure for not being able to fix what generations of poverty and decades of underresourced primary schools had done to my students. Yes, there were a few exceptional students that I was able to help get into a job or a graduate program, but I feel like if it weren't me, it would have been another professor.

The Thumos Treadmill

Liam Kofi Bright has a really interesting philosophy article out on why academics are motivated to do fraud. In it, he examines academics through Plato's tripartite model of the soul: epithumia, the materialistic element, associated with the masses; thumos, the honor and esteem element, associated with the armed forces; and nous, the reason and wisdom element, associated with philosophers and scientists. Plato assumes that scientists are motivated by nous; Bright argues that they are also motivated by thumos, which has both good and bad consequences.

The neverending quest for thumos kind of drove me nuts. I didn't have a strategy for finding a new job that didn't revolve around just publishing as much as I could and hoping it would be enough. (I probably should have written a grant at some point, but I was too busy writing papers and too discouraged by terribly low funding rates.) It felt awful to stretch myself trying to build a better CV, knowing all the time that no matter how good my CV was, I still might not get what I wanted.

I liked writing papers, but they always felt like ten times more work than they should have been. And once finished, the interesting ones made for arguments and headaches and the quiet ones sank into the literature without a sound. I will admit that sometimes arguments tempted me with the promise of a possible comment or reply, yielding another line on the CV at lower cost than a three-study empirical paper. For all my high-mindedness, I was still subject to the same pressures as everybody else.

I'm done with thumos for now. I'm enjoying the comparatively relaxed pace of my new job. I no longer have to try to become a one-in-a-thousand genius so I can get hired at an urban university; I can just be a guy who does his job well enough to not get fired. Maybe in a few years I'll get hungry for promotion to senior data scientist or something like that, but it'll be a while, if it ever happens. I've always been willing to earn a little less money in exchange for working a little less.

Normalize leaving academia

Academia is nice if it works out and you like the work and flexibility enough to take the pay dip. However, academia isn't going to go out of its way to take care of you. It can barely take care of itself.

If you're good at coding and data analysis, you can probably increase your salary by 50-100% by going to industry. Money isn't everything, but it's not nothing, either. I hate to be so capitalistic, but money feels a little bit like respect. It puts a clear and concrete value on your skills in a way that the occasional citation does not. After so much time begging for a tenure track position, getting a single offer by incredible good fortune, and going through it all over again trying to move to the city, it's nice to feel wanted again.

Being prepared to leave academia has benefits beyond the materialistic. You might recognize some of my more pugilistic works pointing out effects that are artifacts of selective removal of outliers, or effects that are too consistent or too big to be true, or the last two years of the Zhang affair. It can hurt your academic career to make enemies or to be known as a trouble-maker. Even outside of that, these projects had an opportunity cost; while I was writing criticism, I was not doing my own primary research and making discoveries with my name on them. Being ready to leave gave me the freedom and power to criticize what I felt needed to be criticized.

Spending any amount of time on the Zhang affair would have been a career mistake, of course, had I planned to stay. People are grateful to you for cleaning up the mess, but getting some papers retracted isn't going to get an entire department to want to hire you and work with you for the rest of your lives. Error detection isn't yet sustainable as a primary research interest; if it was, Elisabeth Bik would be an endowed chair.

If the NSF ever wants to assemble people for a real data police with a real budget, let me know. Until then, I'm going to be over here, writing code, cashing checks, and raising my family.


Thanks

I'd like to thank:
  • Bruce Bartholow, my PhD advisor, for trusting me enough to do my thing as a graduate student, even when it involved making trouble.
  • Laura King, who ran some good classes and an excellent journal club, and who was one of my letter-writers when I was looking for a job.
  • Jeff Rouder, who was crucial to my development as a scientist and Bayesian, and who introduced me to R, which is now how I support my family.
  • Daniel Lakens, whose early blog posts provided code for meta-analysis and PET-PEESE meta-regression, showing me that meta-analysis was just a single line of code, and not an arcane ritual requiring several degrees in rocket science. This opened a whole primary research interest to me.
  • Illinois State University, for running a good psychology department where people generally get along. I'd like to particularly thank my department chair, Scott Jordan, for running a department where expectations are both reasonable and clear and professors aren't encouraged to eat each other alive.
  • The Society for the Improvement of Psychological Science, particularly its early founders, for putting open science and post-publication peer review on the agenda. It's the only reason I was able to spend four years on the tenure track, and the only reason the science was worth doing.


In the roguelike community, people would post their "morguefiles" at the end of their game to show the thrilling way their character won or the tragicomic way they died. Here's mine:

        Joe_Hilgard the Assistant Professor (level 22)
             Began as a Social Psychologist
             Was a friend of SIPS.
             Escaped with a job
             ... and 31 publications on May 10, 2021!
             
             The professorship lasted 4 years.

Joe_Hilgard the Social Psychologist (HuSP)         

+3,+1 RStudio IDE {Int+4, Enhance}
+4 robe of Git {Dex+4 Int+2, Version Control}
+2 mug {Coffee}
+2 visored helmet "FunnelPlot" {Dam+3}
+2 gauntlets of emailing
+2 boots of BayesFactor
+1 ring of PubPeer
+2 ring of Monte Carlo

@: tired, grumpy, collecting data on college undergrads and MTurk workers, resistant to enchantments
}: 5 runes: meta-analytic, experimental, silver, iron, abyssal

You escaped.
You attended 4 universities and accumulated 22 years of education.
You visited the Abyss 2 times.
You visited 1 Labyrinth.

Your h-index was 18.
You got other authors to retract 5 articles.

You had 261 unread emails.
You owed your collaborators 5 overdue action items.
You signed your peer reviews.


        #.#######.#     #.#######
        #.........#     #........
        #......##############.###
        #......#.8#.##.#### #.#
        #......#<.........# #.#
        ########...8..8..8###.#
               #<.........'...#
               #...8..8..8#####
               #@.........#
               #.8#.##.##.#
               ##########.#
                        #.#
                        #.#######
                        #......Wp
                        #^#######
                    #####.#     #
#####################.....#     #

Message History

You preprocess your data.
Ping! New email.
You hold Zoom meetings with students.
Ping! New email.
The fraudster publishes!
The fraudster publishes!
The fraudster publishes!
Publish which blog post? (* to show all)
People are talking about the blog post.
Ping! New email.
A fraudulent paper is retracted!
You hold Zoom meetings with students.
You hold Zoom meetings with students.
There is a job offer at a company here.
Are you sure you want to win?
You have escaped!

Tuesday, January 26, 2021

I tried to report scientific misconduct. How did it go?

This is the story of how I found what I believe to be scientific misconduct and what happened when I reported it.

Science is supposed to be self-correcting. To test whether science is indeed self-correcting, I tried reporting this misconduct via several mechanisms of scientific self-correction. The results have shown me that psychological science is largely defenseless against unreliable data.

I want to share this story with you so that you understand a few things. You should understand that there are probably a few people in your field producing work that is either fraudulent or so erroneous it may as well be fraudulent. You should understand that their work is cited in policy statements and included in meta-analyses. You should understand that, if you want to see the data or to report concerns, those things happen according to the inclinations of the editor-in-chief at the journal. You should understand that if the editor-in-chief is not inclined to help you, they are generally not accountable to anyone and they can always ignore you until the statute of limitations runs out.

Basically, it is very easy to generate unreliable data, and it is very difficult to get it retracted.

Qian Zhang

Two years ago, I read a journal article that appeared to have gibberish for all its statistics (Zhang, Espelage, & Zhang, 2018). None of the numbers in the tables added up: the p values didn't match the F values, the F values didn't match the means and SDs, and the degrees of freedom didn't match the sample size. This was distressing because the sample size was a formidable 3,000 participants. If these numbers were wrong, they were going to receive a lot of weight in future meta-analyses. I sent the editor a note saying "Hey, none of these numbers make sense." The editor said they'd ask the authors to correct, and I moved on with my life.

 


Figure 1. Table from Zhang, Espelage, & Zhang (2018). The means and SDs don’t make sense, and the significance asterisks are incorrect given the F values.

Then I read the rest of Dr. Zhang's first-authored articles and realized there was a broader, more serious problem – one that I am still spending time and energy trying to clean up, two years later.

 

Problems in Qian Zhang’s articles

Zhang’s papers would often report impossible statistics. Many papers had subgroup means that could not be combined to yield the grand mean. For example, one paper reported mean task scores of 8.98ms and 6.01ms for males and females, respectively, but a grand mean task score of 23ms.
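The grand-mean check is simple enough to sketch in a few lines. This is only an illustration of the arithmetic, not the procedure I used at the time: the function name and the rounding tolerance are my own assumptions. The idea is that a grand mean is a weighted average of the subgroup means, so it has to land somewhere between the smallest and the largest of them, whatever the subgroup sizes are.

```python
def grand_mean_is_possible(subgroup_means, grand_mean, tol=0.005):
    """Return True if grand_mean could be a weighted average of subgroup_means.

    A weighted average with non-negative weights always lies between the
    smallest and largest value being averaged. `tol` is an assumed allowance
    for rounding in the reported numbers.
    """
    lowest = min(subgroup_means) - tol
    highest = max(subgroup_means) + tol
    return lowest <= grand_mean <= highest


# Reported values: males 8.98 ms, females 6.01 ms, grand mean 23 ms.
print(grand_mean_is_possible([8.98, 6.01], 23.0))  # False
```

No split of males and females can turn 8.98 and 6.01 into 23, so at least one of the three numbers has to be wrong.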

Other papers had means and SDs that were impossible given the range. For example, one study reported a sample of 3,000 children with ages ranging from 10 to 20 years (M = 15.76, SD = 1.18), of which 1,506 were between ages 10 and 14 and 1,494 were between ages 15 and 20. If you put those numbers into SPRITE, you will find that, to meet the reported mean and SD of age, all the participants must be between the ages of 14 and 19, and only about 500 participants could be age 14.
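You don't need SPRITE to see that something is off here. The sketch below is not SPRITE and does not reconstruct possible samples; it is a cruder bound that I am adding for illustration, with a hypothetical function name. It asks: if 1,506 children are 14 or younger and the overall mean is 15.76, what is the smallest SD the sample could possibly have? The variance is smallest when the younger group all sit at their ceiling (age 14) and the older group all sit at whatever single value restores the overall mean.

```python
import math


def min_possible_sd(n_low, max_low, n_high, mean):
    """Lower bound on the SD given n_low values <= max_low and an overall mean.

    Assumes mean > max_low. The sum of squared deviations is minimized by
    putting every low-group value at max_low and every high-group value at
    the single mean that keeps the overall mean fixed.
    """
    n = n_low + n_high
    high_mean = (n * mean - n_low * max_low) / n_high
    sum_sq = n_low * (max_low - mean) ** 2 + n_high * (high_mean - mean) ** 2
    return math.sqrt(sum_sq / n)  # population SD; the n-1 version is slightly larger


bound = min_possible_sd(n_low=1506, max_low=14, n_high=1494, mean=15.76)
print(round(bound, 2))  # about 1.77, well above the reported SD of 1.18
```

Any real sample with that age split would have an SD larger than this bound, so the reported mean, SD, and age split cannot all be true together.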

More seriously still, tables of statistical output seemed to be recycled from paper to paper. Two different articles describing two different experiments on two different populations would come up with very similar cell means and F values. Even if one runs exactly the same experiment twice, sampling error means that the odds of getting all six cells of a 2 × 3 design to come up again within a few decimal points are quite low. The odds of getting them on an entirely different experiment years later in a different population would be smaller still.

As an example, consider this table, published in Zhang, Espelage, and Rost (2018, Youth and Society; Panel A), in which 2,000 children (4th-6th grade) perform a two-color emotion Stroop task. The means and F values closely match those from a sample of 74 high schoolers (Zhang, Xiong, & Tian, 2013, Scientific Research: Health; Panel B) and a sample of 190 high schoolers (Zhang, Zhang, & Wang, 2013, Scientific Research: Psychology; Panel C).



Figure 2. Three highly similar tables from three different experiments by Zhang and colleagues. The degree of similarity for all nine values of the table is suspiciously high.


Dr. Zhang publishes some corrigenda 

After my first quick note to Youth and Society that Zhang’s p values didn't match the F values, Dr. Zhang started submitting corrections to journals. What was remarkable about these corrections is that they would simply add an integer to the F values so that they would be statistically significant.

Consider, for example, this correction at Personality and Individual Differences (Zhang, Tian, Cao, Zhang, & Rodkin, 2016):


Figure 3. An uninterpretable ANOVA table is corrected by the addition or subtraction of an integer value from its F statistics.

The correction just adds 2 or 3 onto the nonsignificant values to make them match their asterisks, and it subtracts 5 from the significant F value to make it match its lack of asterisks.


Or this correction to Zhang, Espelage, and Zhang (2018, Youth and Society), now retracted:

 


Figure 4. Nonsignificant F values become statistically significant through the addition of a tens digit. Note that these should now have three asterisks rather than one and two, respectively.

Importantly, none of the other summary or inferential statistics had to be changed in these corrigenda, as one might expect if there was an error in analysis. Instead, it was a simple matter of clobbering the F values so that they’d match the significance asterisks.


Asking for raw data

While I was investigating Zhang’s work from 2018 and earlier, he published another massive 3,000-participant experiment in Aggressive Behavior (Zhang et al., 2019). Given the general sketchiness of the reports, I was getting anxious about the incredible volume of data Zhang was publishing. 

I asked Dr. Zhang if I could see the data from these studies to try to understand what had happened. He refused, saying only the study team could see the data. 

So, I decided I’d ask the study team. I asked Zhang’s American co-author if they had seen the data. They said they hadn't. I suggested they ask for the data. They said Zhang refused. I asked them if they thought that was odd. They said, no, "It's a China thing."

 

Reporting Misconduct to the Institution

Given the recycling of tables across studies, the impossible statistics, the massive sample sizes, the secrecy around the data, and the corrigenda which had simply bumped the F values into significance, I suspected I had found research misconduct.  In May 2019, I wrote up a report and sent it to the Chairman of the Academic Committee at his institution, Southwest University Chongqing. You can read that report here.

A month later, I was surprised to get an email from Dr. Zhang. It was the raw data from the Youth & Society article I had previously asked for and been refused.

Looking at the raw data revealed a host of suspicious issues. For starters, participants were supposed to be randomly assigned to movie, but girls and students with high trait aggression were dramatically more likely to be assigned to the nonviolent movie. 

There was something else about the reaction time data that is a little more technical but very serious. Basically, reaction time data on a task like the Stroop should show within-subject effects (some conditions have faster RTs than others) and between-subject effects (some people are faster than others). Consequently, even an incongruent trial from Quick Draw McGraw could be faster than a congruent trial from Slowpoke Steven.

Because of these between-subject effects, there should be a correlation between a subject’s reaction times in one condition and their reaction times in the other. If you look at color-Stroop data I grabbed from a reliable source on the OSF, you can see that correlation is very strong. 
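To make the logic concrete, here is a small simulation sketch. Everything in it is assumed for illustration: the baseline of 600 ms, the 50 ms Stroop effect, the noise levels, and the function names are invented, and none of it is the OSF data or Zhang's data. It shows that when people differ in overall speed, their mean RTs in the two conditions correlate strongly, and that the correlation vanishes when those between-subject differences are taken away.

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_subject_means(n_subjects, subject_sd, rng, n_trials=40):
    """Return per-subject mean RTs (ms) for congruent and incongruent trials."""
    congruent, incongruent = [], []
    for _ in range(n_subjects):
        speed = rng.gauss(600, subject_sd)  # this person's baseline RT
        con = [speed + rng.gauss(0, 100) for _ in range(n_trials)]
        inc = [speed + 50 + rng.gauss(0, 100) for _ in range(n_trials)]
        congruent.append(sum(con) / n_trials)
        incongruent.append(sum(inc) / n_trials)
    return congruent, incongruent


rng = random.Random(1)
# People differ in speed: condition means should be strongly correlated.
r_realistic = pearson_r(*simulate_subject_means(200, subject_sd=80, rng=rng))
# Everyone identical in speed: only trial noise remains, so no correlation.
r_no_subject_effect = pearson_r(*simulate_subject_means(200, subject_sd=0, rng=rng))
print(round(r_realistic, 2), round(r_no_subject_effect, 2))
```

The second case is the odd one. For real participants to produce it, they would all have to be equally fast.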


Figure 5. The correlation between subjects' mean congruent-word RT and mean incongruent-word RT in a color-word Stroop task. Data from Lin, Inzlicht, Saunders, & Friese (2019).

If you look at Zhang’s data, you see the correlation is completely absent. You might also notice that the distribution of subjects’ means is weirdly boxy, unlike the normal or log-normal distribution you might expect.

Figure 6. The correlation between subjects' mean aggressive-word RT and nonaggressive-word RT in an aggressive-emotion Stroop task. Data from Zhang, Espelage, and Rost (2018). The distribution of averages is odd, and the correlation unusually weak.

There was no way the study was randomized, and there was no way that the study data was reliable Stroop data. I wrote an additional letter to the institution detailing these oddities. You can read that additional letter here.

A month after that, Southwest University cleared Dr. Zhang of all charges.

The letter I received declared: "Dr. Zhang Qian was deficient in statistical knowledge and research methods, yet there is insufficient evidence to prove that data fraud [sic]." It explained that Dr. Zhang was just very, very bad at statistics and would be receiving remedial training and writing some corrigenda. The letter noted that, as I had pointed out, the ANOVA tables were gibberish and the degrees of freedom did not match the reported sample sizes. It also noted that the "description of the procedure and the object of study lacks logicality, and there is a suspicion of contradiction in the procedure and inconsistency in the sample," whatever that means.

However, the letter did not comment on the strongest pieces of evidence for misconduct: the recycled tables, the impossible statistics, and the unrealistic properties of the raw data. I pressed the Chairman for comment on these issues. 

After four months, the Chairman replied that the two experts they consulted determined that "these discussions belong to academic disputes." I asked to see the report from the experts. I did not receive a reply.

 

Reporting Misconduct to the Journals

The institution being unwilling to fix anything, I decided to approach the journals. In September and October 2019, I sent each journal a description of the problems in the specific article each had published, as well as a description of the broader evidence for misconduct across articles. 

I hoped that these letters would inspire some swift retractions, or at least, expressions of concern. I would be disappointed.

Some journals appeared to make good-faith attempts to investigate and retract. Other journals have been less helpful.


The Good Journals

Youth and Society reacted the most swiftly, retracting both articles two months later.

Personality and Individual Differences took 10 months to decide to retract. In July 2020, the editor showed me a retraction notice for the article. I am still waiting for the retraction notice to be published. It was apparently lost when changing journal managers; once recovered, it then had to be sent to the authors and publisher for another round of edits and approvals.

Computers in Human Behavior is still investigating. The editor received my concerns with an appropriate degree of attention, but confusion about whether the editor or the publisher is supposed to investigate seems to have slowed down the process.

I felt these journals generally did their best, and the slowness likely comes from the bureaucracy involved and from editors' inexperience with it. Other journals, I felt, did not make such an attempt.


Aggressive Behavior

In October 2019, Zhang sent me the data from his Aggressive Behavior article. I found the data had the same bizarre features that I had found when I received the raw data from Zhang's now-retracted Youth and Society article. I wrote a letter detailing my concerns and sent it to Aggressive Behavior's editor in chief, Craig Anderson. 

The letter, which you can read here, detailed four concerns. One was about the plausibility of the average Stroop effect reported, which was very large. Another was about failures of random assignment: chi-squared tests found the randomly-assigned conditions differed in sex and trait aggression, with p values of less than one in a trillion. The other two concerns regarded the properties of the raw data.
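For readers who have not run this kind of check, here is a minimal sketch of a chi-squared test of random assignment on a 2 × 2 table. The counts are hypothetical and chosen only to illustrate the calculation; they are not the counts from Zhang's dataset, and the function name is my own. With one degree of freedom, the p value can be computed from the complementary error function, so nothing beyond the standard library is needed.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns (chi2, p). With 1 degree of freedom the upper-tail probability
    of the chi-squared distribution equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# HYPOTHETICAL counts: rows are violent / nonviolent cartoon,
# columns are boys / girls.
chi2_skewed, p_skewed = chi2_2x2(900, 624, 600, 924)
chi2_balanced, p_balanced = chi2_2x2(750, 774, 750, 774)
print(p_skewed < 1e-12, p_balanced)
```

Under real random assignment, sex should be split about evenly across conditions, as in the balanced table. A skew like the first table is essentially never produced by chance.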

It took three months and two emails to the full editorial board to receive acknowledgement of my letter. Another four months after that, the journal notified me that it would investigate. 

Now, fifteen months after the submission of my complaint, the journal has made the disappointing decision to correct the article. The correction explains away the failures of randomization as an error in translation; the authors now claim that they let participants self-select their condition. This is difficult for me to believe. The original article stressed multiple times its use of random assignment and described the design as a "true experiment." They also had perfectly equal samples per condition ("n = 1,524 students watched a 'violent' cartoon and n = 1,524 students watched a 'nonviolent' cartoon."), which is exceedingly unlikely to happen without random assignment.
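A rough sense of how unlikely an exactly even split is can be had from the binomial distribution. The sketch below rests on a generous assumption that I am supplying for illustration: each of the 3,048 students independently picks a cartoon with a 50/50 preference. Any other preference makes an exact tie less likely still.

```python
from math import comb


def prob_exact_even_split(n):
    """Probability that n independent 50/50 choices split exactly n/2 vs n/2."""
    return comb(n, n // 2) / 2 ** n


p_even = prob_exact_even_split(3048)
print(p_even)  # roughly 0.014 under the assumed 50/50 preference
```

The same arithmetic would apply to coin-flip randomization. Exactly equal groups are what one expects when the experimenter controls the allocation, for example by assigning students in balanced blocks.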

The correction does not mention the multiple suspicious features of the raw data. 

This correction has done little to assuage my concerns. I feel it is closer to a cover-up. I will express my displeasure with the process at Aggressive Behavior in greater detail in a future post.

 

Zhang’s newest papers

Since I started contacting journals, Zhang has published four new journal articles and one ResearchSquare preprint. I also served as a peer reviewer on two of his other submissions: One was rejected, and the other Zhang withdrew when I repeatedly requested raw data and materials.

These newest papers all carefully avoid the causes of my previous complaints. I had complained it was unlikely that Zhang should collect 3,000 subjects every experiment; the sample sizes in the new studies range from 174 to 480. I had complained that the distribution of aggressive-trial and nonaggressive-trial RTs within a subject didn’t make sense; the new studies analyze and present only the aggressive-trial RTs, or they report a measure that does not require RTs.

Two papers include a public dataset as part of the online supplement, but the datasets contain only the aggressive-trial RTs. When I contacted Zhang, he refused to share the nonaggressive-trial RTs. He has also refused to share the accuracy data for any trials. This might be a strategy to avoid tough questions about the kind of issues I found in his Youth & Society and Aggressive Behavior articles. 

Because Zhang refused me access to the data, I had to try asking the editors at those journals to enforce the APA Code of Ethics section 8.14 which requires sharing of data for the purpose of verifying results.

At Journal of Experimental Child Psychology, I asked editor-in-chief David Bjorklund to intervene. Dr. Bjorklund has asked Dr. Zhang to provide the requested data. I thank him for upholding the Code of Ethics. A month and a half has passed since Dr. Bjorklund's intervention, and I have yet to receive the requested data and materials from Dr. Zhang.

At Children and Youth Services Review, I asked editor-in-chief Duncan Lindsey to intervene. Zhang claimed that the data consisted only of aggressive-trial RTs, and that he could not share the program because it “contained many private information of children and had copyrights.”

I explained my case to Lindsey. Lindsey sent me nine words — "You will need to solve this with the authors." — and never replied again.

Dr. Lindsey's failure to uphold the Code of Ethics at his journal is shameful. Scholars should be aware that Children and Youth Services Review has chosen not to enforce data-sharing standards, and research published in Children and Youth Services Review cannot be verified through inspection of the raw data.

I have not yet asked for the data behind Zhang’s new articles in Cyberpsychology, Behavior, and Social Networking or Journal of Aggression, Maltreatment, & Trauma.


Summary

I was curious to see how the self-correcting mechanisms of science would respond to what seemed to me a rather obvious case of unreliable data and possible research misconduct. It turns out Brandolini’s Law still holds: “The amount of energy needed to refute bullshit is an order of magnitude larger than to produce it.” However, I was not prepared to be resisted and hindered by the self-correcting institutions of science itself.

I was disappointed by the response from Southwest University. Their verdict has protected Zhang and enabled him to continue publishing suspicious research at great pace. However, this result does not seem particularly surprising given universities' general unwillingness to investigate their own and China's general eagerness to clear researchers of fraud charges.

I have also generally been disappointed by the response from journals. It turns out that a swift two-month process like the one at Youth and Society is the exception, not the norm.

In the cases that an editor in chief has been willing to act, the process has been very slow, moving only in fits and starts. I have read before that editors and journals have very little time or resources to investigate even a single case of misconduct. It is clear to me that the publishing system is not ready to handle misconduct at scale.

In the cases that an editor in chief has been unwilling to act, there is little room for appeal. Editors can act busy and ignore a complainant, and they can get indignant if one tries to go around them to the rest of the editorial board. It is not clear who would hold the editors accountable, or how. I have little leverage over Craig Anderson or Duncan Lindsey besides my ability to bad-mouth them and their journals in this report. At best, they might retire in another year or two and I could have a fresh editor with whom to plead my case.

The clearest consequence of my actions has been that Zhang has gotten better at publishing. Every time I reported an irregularity with his data, his next article would not feature that irregularity. In essence, each technique for pointing out the implausibility of the data can be used only once, because an editor’s or university’s investigation consists of showing the authors all the irregularities and asking for benign explanations. This is a serious problem when even weak explanations like “I didn’t understand what randomized assignment means” or “I’m just very bad at statistics” are considered acceptable.

Zhang has reported experiments with sample sizes totaling more than 11,000 participants (8,000 given the Aggressive Behavior correction). This is an amount of data that rivals entire meta-analyses and ManyLabs projects. If this data is flawed, it will have serious consequences for reviews and meta-analyses.

In total, trying to get these papers retracted has been much more difficult, and rather less rewarding, than I had expected. The experience has led me to despair for the quality and integrity of our science. If data this suspicious can’t get a swift retraction, it must be impossible to catch a fraud equipped with skills, funding, or social connections.

Thursday, October 15, 2020

Fraud and Erroneous Judgment: Varieties of Deception in the Social Sciences (1995)

Killing time in the UChicago stacks in the summer of 2019, I found a book from 1995 called Fraud and Erroneous Judgment in the Social Sciences. It's been an interesting read, because despite having been written nearly 25 years ago, much of it reads like it was written today. Specifically, there is very little substance about actually preventing, detecting, or prosecuting fraud, presumably because all these things are very difficult to do. 

Instead, a substantial portion of the text is dedicated to the easier task of fighting the culture war. Nearly half the book consists of polemics from scientists who think their ability to speak hard truths about sexual assault or intelligence or race or whatever has been suppressed by the bleeding hearts. This is particularly depressing and unhelpful when you see that two of the thirteen chapters are written by Linda Gottfredson and J. Philippe Rushton, scientists receiving funding from the Pioneer Fund, an organization founded to study and promote eugenics.



Fraud...

For a text that is notionally about fraud, there is very little substance about actual fraud. Instead, most of the chapters are dedicated to the latter topic of "fallible judgment". Only three instances of research misconduct in psychology are discussed. Two of them appear in brief bullet points in the first chapter: In one, a psychologist fabricated data to demonstrate the efficacy of a drug for preventing self-harm in the mentally disabled; in the other, a researcher may have massaged his data to overstate the potential harms of low levels of lead exposure.

The third case consists of the allegations surrounding Cyril Burt. Cyril Burt was an early behavior geneticist. He argued that intelligence was heritable, and he demonstrated this through studies of the similarity of identical twins raised apart.

Burt was unpopular at the time because the view that intelligence was heritable sounded to many like Nazi ideology. While he was alive, people protested him as a far-right ideologue. (Other hereditarians experienced similar treatment; Hans Eysenck reportedly needed bodyguards as a result of his 1971 views that some of the Black-White intelligence gap was genetic in nature.) 

Five years after his death, allegations arose that Burt had invented a number of his later samples. These allegations claimed that Burt, having found an initial sample that supported his hypothesis, and frustrated by the public resistance to his findings as well as the challenge of finding more identical twins raised apart, decided to help the process along by fabricating data from twin pairs. As evidence of this, his heritability coefficient remained .77 as the sample size increased from 15 twin pairs to 53 twin pairs. (Usually parameter estimates change a little bit as new data comes in.) He was further alleged to have made up two research assistants, but these assistants were later found. Complicating matters further, his housekeeper burnt all his research records shortly after his death (!) purportedly on the advice of one of Burt's scientific rivals (?!?).
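To get a feel for why an unchanging coefficient looks odd, here is a rough simulation sketch. It assumes twin scores are bivariate normal with a true correlation of .77 and uses an ordinary Pearson correlation; both are my simplifying assumptions, not a reconstruction of Burt's actual analysis. The function names are made up for illustration. It asks how often the estimate from the first 15 pairs matches, to two decimals, the estimate from all 53.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def draw_pair(rho, rng):
    """Draw one (x, y) pair from a standard bivariate normal with correlation rho."""
    x = rng.gauss(0, 1)
    y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    return x, y


def share_unchanged(rho=0.77, n_first=15, n_final=53, n_sims=10_000, seed=1):
    """Share of simulated studies whose r is identical (to 2 decimals)
    at n_first pairs and again at n_final pairs."""
    rng = random.Random(seed)
    unchanged = 0
    for _ in range(n_sims):
        pairs = [draw_pair(rho, rng) for _ in range(n_final)]
        xs, ys = zip(*pairs)
        r_first = pearson_r(xs[:n_first], ys[:n_first])
        r_final = pearson_r(xs, ys)
        if round(r_first, 2) == round(r_final, 2):
            unchanged += 1
    return unchanged / n_sims


if __name__ == "__main__":
    print(f"Share of runs with an unchanged r: {share_unchanged():.3f}")
```

Under these assumptions I would expect that share to be small, which is the intuition behind the allegation. A small share is still only circumstantial evidence, of course.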

Burt sounds like a real horse's ass. In a separate book, Cyril Burt: Fraud or Framed?, Hans Eysenck reports that Burt would sometimes sock-puppet, writing articles according to his own views, then leaving his name off of the work and handing it off to a junior researcher, giving the impression that some independent scholar shared his view. Burt purportedly went one further by editing articles submitted to his journal, inserting his own stances and invective into others' work and publishing it without their approval.

Two chapters in Fraud and Erroneous Judgment are devoted to the Burt affair. The first chapter, written by Robert B. Joynson, argues that, strictly speaking, you can't prove he committed fraud. Probably we will never know. Burt is dead and his records destroyed. Even if he made up the data, the potentially made-up data are at least consistent with what we believe today, so maybe it doesn't matter.

The other, written by the late J. Philippe Rushton, one-time head of the Pioneer Fund, argues more stridently that Burt was framed. According to his perspective, the various social justice warriors and bleeding hearts of the 1970s' hyper-liberal universities couldn't bear the uncomfortable truths Burt preached. Rather than refute Burt's ideas in the arena of logic and facts and science, they resorted to underhanded callout-culture tactics to smear him after his death and spoil his legacy.

So in the only involved discussion of an actual fraud allegation in this 181-page book, all that can be said is "maybe he did, or maybe he didn't."

Some material is useful. Chapter 3 recognizes that scientific fraud is a human behavior that is motivated by, and performed within, a social system. One author theorizes that fraud is most often committed under three conditions: 1) there is pressure to publish, whether to advance one's career or to refute critics, 2) the researcher thinks they know the answer already, so that actually doing the experiment is unnecessary, and 3) the research area involves an amount of stochastic variability, such that a failure to replicate can be shaken off as Type I error or hidden moderators. It certainly sounds plausible, but I wonder how useful it is. Most research fulfills all three conditions: all of us are under pressure to publish, all of us have a theory or two to suggest a "right" answer, and all of us experience sampling error and meta-uncertainty.

One thing that hasn't changed one bit is that demonstrating fraud requires demonstrating intent, which is basically impossible. Then and now, people instead have to couch concerns in the language of error, presuming sloppiness instead of malfeasance. Even then, it's not clear what level of sloppiness crosses the threshold between error and misconduct.

...and Erroneous Judgment

The other cases all concern "erroneous judgment". They reflect ideologically-biased interpretations of data, a lack of scientific rigor, or an excessive willingness to be fooled. These cases vary in their seriousness. At the extremely harmful end, there is a discussion of recovered-memory therapy; this therapy involves helping patients to recover memories of childhood abuse through a process indistinguishable from that one would use to create a false memory. Chillingly, recovered memories became permissible as court evidence in 15 states and led to a number of false accusations and possible convictions during the Satanic Panic of the 1980s. At the less harmful end, there's an argument about whether the Greeks made up their culture by copying off of the Egyptians. Fun to think about maybe, but nobody is going to jail over that.

Other examples include exaggeration of societal problems in order to drum up support for research and advocacy. Neil Gilbert illustrates how moral entrepreneurs can extrapolate from sloppy statistical work, small samples, and bad question wording to estimate that 100 billion children are abducted every 3.7 seconds. This fine example is, however, paired with a criticism of feminism and research on sexual assault that has aged poorly; the author's argument boils down to "c'mon, sexual assault can't be that common, right?" Maybe it can be, Neil.

According to the authors, these cases of fallible judgment are caused by excessive enthusiasm rather than deliberate intention to deceive. Therapists dealing in recovered memories are too excited to root out satanic child-abuse cults, too ignorant of the basic science of memory, and too dependent on the perceived efficacy of their practice to know better. Critics of the heritability of IQ are blinded by political correctness and "the egalitarian hoax" of blank-slate models of human development. Political correctness is cited as influencing "fallible judgments" as diverse as the removal of homosexuality from the DSM (and its polite replacement in diagnosis of other disorders so that homosexual patients could continue billing their insurance), the estimation of the prevalence of sexual harassment, failures to test and report racial differences in outcomes, or the attribution of the accomplishments of the Greeks to the Egyptians.

Again, it seems revealing that so little is known about actual cases of fraud that the vast majority of the volume is dedicated to cases where it is unclear who is right. Unable to discover and discuss actual frauds, the discussion has to focus instead on ideological opponents whom the authors don't trust to interpret and represent their data fairly.


Have we made progress?

What's changed between 1995 and now? Today we have more examples to draw upon and more forensic tools. We can use GRIM and SPRITE to catch what are either honest people making typographical mistakes or fraudsters too stupid to make up raw data (good luck telling which is which!). The Data Colada boys keep coming up with new tests for detecting suspicious patterns in data. It's become a little less weird to ask for data and a little more weird to refuse to share data. So there's progress.
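For readers who haven't seen GRIM, here is a minimal sketch of the idea. It assumes the underlying responses are whole numbers (say, single Likert items) and that the mean was reported to two decimals. The function name and tolerance handling are my own illustration, not the reference implementation. The test simply asks whether any integer total, divided by the sample size, could round to the reported mean.

```python
import math


def grim_consistent(reported_mean, n, decimals=2):
    """Return True if some integer sum of n whole-number responses
    could produce reported_mean after rounding to `decimals` places."""
    # Largest distance between a true mean and its rounded report.
    half_unit = 0.5 * 10 ** (-decimals)
    # Candidate integer totals that bracket the implied sum.
    lowest_total = math.floor((reported_mean - half_unit) * n)
    highest_total = math.ceil((reported_mean + half_unit) * n)
    for total in range(lowest_total, highest_total + 1):
        # Small epsilon so exact ties count as consistent under either
        # rounding convention (round-half-up or round-half-even).
        if abs(total / n - reported_mean) <= half_unit + 1e-9:
            return True
    return False


# A mean of 5.19 from 28 integer responses is impossible:
# 145/28 rounds to 5.18 and 146/28 rounds to 5.21.
print(grim_consistent(5.19, 28))  # False
print(grim_consistent(2.50, 20))  # True
```

The check only has teeth when the sample is small relative to the reported precision (roughly n under 100 for two decimals). A failure says nothing about whether the cause was a typo or something worse.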

Even so, we're still a billion miles away from being able to detect most fraud and to demonstrate intent. Demonstration of intent generally requires a confession or someone on the inside. Personally, I suspect that fraud detection at scale is probably impossible unless we ask scientists to provide receipts. I can't imagine researchers going for another layer of bureaucracy like that.

One recurring theme is the absence of an actual science police. The discussion of the Burt affair complains that the Council of the British Psychological Society did little to examine Burt's case on its own, instead accepting the conclusions of a biographer. Chapters 1 and 2 discuss the political events that put "Science under Siege" and led to the creation of the Office of Research Integrity, an institution only grudgingly accepted in Chapter 2. Huffing that every great scientist from Mendel to Millikan had to massage their data a bit from time to time to make their point, David Goodstein cautions the ORI, "I can only hope that we won't arrange things in such a way as would have inhibited Newton or Millikan from doing his thing."


Can we ever know the truth?

Earlier, I mentioned that the book contains three cases of purported fraud: the self-harm study, Cyril Burt's 38 twin-pairs raised apart, and the researcher possibly massaging his data to overestimate the harms of lead. This last case appears to be a reference to the late Herbert Needleman, accused in 1990 of p-hacking his model, an offense Newsweek described at the time as "like bringing a felony indictment for jaywalking." Needleman was exonerated in 1992, and the New York Times ran an obituary honoring him following his death in 2017.

Would I be impressed by Needleman's work today, or would I count him out as another garden-variety noise-miner looking for evidence to support a foregone conclusion? Maybe it doesn't matter. In the Newsweek article, the EPA is quoted as saying "We don't even use Needleman's study anymore" because subsequent research recommended even lower safety thresholds than did Needleman's controversial work. The tempest has blown over. The winners write their history, and the losers get paid by the Cato Institute to go on Fox News and argue against "lead hysteria".

There's a lot that hasn't changed

We think that science has only been subjective, partisan, and politicized in our current "war on science" post-2016 world, but the 1990s also had "science under siege" (Time, Aug 26, 1991) and intractable debates between competing groups with vested interests in there being a crisis or not being a crisis. The tobacco wars reappear in every decade.

Similarly, the froth and stupidity of daytime TV lives on in today's Daily Mail and Facebook groups. In the 90s, people with more outrage than sense believed in vast networks of underground Satanist cults that tortured children and "programmed" them to become pawns in their world domination scheme. Today, those people believe the Democratic party runs a child trafficking ring through a pizza parlor and a furniture website and that Donald Trump is on a one-man mission to stop them.

Regarding fraud, we find that scientific self-policing only tends to emerge in response to crisis and scandal. NIH and NSF don't seem to have had formal recommendations regarding fraud until 1988; these were apparently motivated by pressure from Congress following the 1981 case of John Darsee, a Harvard cardiologist who had been faking his data. Those who do scientific self-policing aren't welcomed with open arms -- the book briefly stops to sneer at Walter Stewart and Ned Feder as "a kind of self-appointed truth squad. According to their critics, they had not been very productive scientists and were trying to find a way of holding on to their lab space." Nobody likes having fraud oversight, and everybody does the minimum possible to maintain public respectability until the scandal blows over.

Finally, each generation seems to suspect its successors of being fatally blinded by political correctness. This is clearest in the chapter dedicated to the defense of Cyril Burt, in which Rushton complains that academia will only become more corrupted by political correctness:
"Today, the campus radicals of earlier decades are the tenured radicals of the 1990s. Some are chairmen, deans, and presidents. The 1960s mentality of peace, love, and above all equality now constitutes a significant portion of the intellectual establishment in the Western world. The equalitarian dogma is more, not less, entrenched than ever before. Yet, it is based on the scientific hoax of the century."
Will every generation of academics forever consider their successors insufferably and disreputably woke? Should they? It seems that, despite Rushton's concerns, the hereditarian perspective has won out in the end. Today we have researchers who not only recognize heritability, but have given careful thought to the meaning, causality, and societal implications of the research. I see this as tremendous progress when compared to the way the book tends to frame the debate over heritability, which invites the reader to choose between two equally misguided perspectives of either ignorant blank-slate idealism or Rushton's inhumane "race realism."

Summary

Some things have changed since 1995, but much has stayed the same.

Compared to 25 years ago, I think we have a better set of tools for detecting fraud. We have new statistical tricks and stronger community norms around data sharing and editorial action. We have the Office of Research Integrity and Retraction Watch.

But some things haven't changed. Researchers checking each other's work are still, at times, regarded coldly: the "self-appointed truth squad" of 1995 is the "self-appointed data police" of 2016. Demonstrating intent to deceive remains a very high bar for those investigating misconduct; probably some number of fraudsters escape oversight by claiming mere incompetence. Because it is difficult to prove intent to deceive, it's easier to fight the culture war -- one can wave to an opponent's political bias without getting slapped with a libel suit. And we still don't know much about who commits fraud, why they commit fraud, and how we'll ever catch them.




Thursday, January 30, 2020

Are frauds incompetent?

Nick Brown asks:

My answer is that we are not spotting the competent frauds. This becomes obvious when we think about all the steps that are necessary to catch a fraud:
  1. The fraudulent work must be suspicious enough to get a closer look.
  2. Somebody must be motivated to take it upon themselves to take that closer look.
  3. That motivated person must have the necessary skill to detect the fraud.
  4. The research records available to that motivated and skilled person must be complete and transparent enough to detect the fraud.
  5. That motivated and skilled person must then be brave enough (foolish enough? equipped with lawyers enough?) to contact the research institution.
  6. That research institution must be motivated enough to investigate.
  7. That research institution must also be skilled enough to find and interpret the evidence for fraud.

Considering all these stages at which one could fail to detect or pursue misconduct, it seems immediately obvious to me that we are finding only the most obvious and least protected frauds.
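A bit of arithmetic makes the point concrete. The sketch below assigns each of the seven steps a made-up probability of 0.5 and treats the steps as independent. Both the numbers and the independence are assumptions chosen purely for illustration, not estimates of anything.

```python
import math

# Hypothetical per-step probabilities; every value here is invented.
step_probabilities = {
    "work looks suspicious enough": 0.5,
    "someone is motivated to look": 0.5,
    "that person has the skill": 0.5,
    "records are complete enough": 0.5,
    "that person reports it": 0.5,
    "institution is motivated": 0.5,
    "institution is skilled": 0.5,
}

# If the steps are independent, the chance of clearing all of them
# is the product of the individual chances.
p_caught = math.prod(step_probabilities.values())
print(f"Chance a given fraud clears all seven steps: {p_caught:.4f}")  # 0.0078
```

Even with a coin flip at every stage, fewer than one fraud in a hundred makes it through the whole chain. The frauds that do get caught will tend to be the ones for which these per-step odds were unusually high.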

Consider the "Boom, Headshot!" affair. I had read this paper several times and never suspected a thing; nothing in the summary statistics indicates any cause for concern. The only reason anybody discovered the deception was that Pat Markey was curious enough about the effect of skewness on the results to spend months asking the authors and journal for the data and happened to discover values edited by the grad student.

Are all frauds stupid?

Some of the replies to Nick's question imply that faking data convincingly is too much hassle compared to actually collecting data. If you know a lot about data and simulation, why would you bother faking data? This perspective assumes that fraud is difficult and requires skills that could be more profitably used for good. But I don't think either of those is true.

Being good at data doesn't remove temptations for fraud

When news of the LaCour scandal hit, the first thing that struck me was how good this guy was at fancy graphics. Michael LaCour really knew his way around analyzing and presenting statistics in an exciting and accessible way.

But that's not enough to get LaCour's job offer at Princeton. You need to show that you can collect exciting data and get exciting results! When hundreds of quant-ninja, tech-savvy grad students are scrambling for a scant handful of jobs, you need a result that lands you on This American Life. And those of us on the tenure track have our own temptations: bigger grants, bigger salaries, nicer positions, and respect.

Some might even be tempted by the prospect of triumphing over their scientific rivals. Cyril Burt, once president of the British Psychological Society, was alleged to have made up extra twin pairs in order to silence critics of his discovered link between genetics and IQ. Hans Eysenck, the most-cited psychologist of his time, published and defended dozens of papers using likely-fabricated data from his collaborator that supported his views on the causes of cancer.

Skill and intellect and fame and power do not seem to be vaccines against misconduct. And it doesn't take a lot of skill to commit misconduct, either, because...

Frauds don't need to be clever

A fraud does not need a deep understanding of data to make a convincing enough forgery. A crude fake might get some of the complicated multivariate relationships wrong, sure. But will those be detected and prosecuted? Probably not.

You don't need to be the Icy Black Hand of Death to get away with data fakery.
(img source fbi.gov)


Why not? Those complicated relationships don't need to be reported in the paper. Nobody will think to check them. If they want to check them, they'll need to send you an email requesting the raw data. You can ignore them for some months, then tell them your dog ate the raw data, then demand they sign an oath of fealty to you if they're going to look at your raw data.

Getting the complicated covariation bits a little wrong is not likely to reveal a fraud, anyway. Can a psychologist predict even the first digit of simple correlations? A complicated relationship that we know less about will be harder to predict, and it will be harder to persuade co-authors, editors, and institutions that any misspecification is evidence of wrongdoing. Maybe the weird covariation can be explained away as an unusual feature of the specific task or study population. The evidence is merely circumstantial.


...because data forensics can rarely stop them.

Direct evidence requires some manner of internal whistleblower who notices and reports research misconduct. Again, one would need to actually see the misconduct, which is especially unlikely in today's projects in which data and reports come from distant collaborators. Then one would need to actually blow the whistle, after which they might expect to lose their career and get stuck in a years-long court case. Most frauds in psychology are caught this way (Stroebe, Postmes, & Spears, 2012).

In data forensics, by contrast, most evidence for misconduct is merely circumstantial. Noticing in the data very similar means and standard deviations or duplicated data points or duplicated images might be suggestive, but requires assumptions, and is open to alternative explanations. Maybe there was an error in data preprocessing, or the research assistants managed the data wrong, or someone used file IMG4015.png instead of IMG4016.png.

This circumstantial evidence means that nonspecific screw-ups are often a plausible alternative hypothesis. It seems possible to me that a just-competent fraud could falsify a bunch of reports, plead incompetence, issue corrections as necessary, and refine their approach to data falsification for quite a long time.

A play in one act:

FRAUDSTER
The means were 2.50, 2.50, 2.35, 2.15, 2.80, 2.40, and 2.67.


DATA THUG
It is exceedingly unlikely that you would receive such consistent means. I suspect you have fabricated these summary statistics.


FRAUDSTER
Oops, haha, oh shit, did I say those were the means? Major typo! The means were actually, uh, 2.53, 3.12, 2.07, 1.89...


EDITOR
Ahh, nice to see this quickly resolved with a corrigendum. Bye everyone.


UNIVERSITY
We are fully committed to upholding the highest ethical standards etc. any concerns are thoroughly etc. etc.


FRAUDSTER (sotto voce) 
That was close! Next time I fake data I will avoid this error.

The field isn't particularly trying to catch frauds, either.

Trying to prosecute fraud sounds terrible. It takes a very long time, it requires a very high standard of evidence, and lawyers get involved. It is for these reasons, I think, that the self-stated goal of many data thugs is to "correct the literature" rather than "find and punish frauds".

But I worry about this blameless approach, because there's no guarantee that the data that appears in a corrigendum is any closer to the truth. If the original data was a fabrication, chances are good the corrigendum is just a report of slightly-better-fabricated data. And even if the paper is retracted, the perpetrator may learn from the experience and find a way to refine his fabrications and enjoy a long, prosperous life of polluting the scientific literature.

In summary,

I don't think you have to be particularly clever to be a fraud. It seems to me that most discovered frauds involve either direct evidence from a whistleblower or overwhelming circumstantial evidence due to rampant sloppiness. I think that there are probably many more frauds with just a modicum of skill that have gone undiscovered. There are probably also a number of cases that are quietly resolved without the institution announcing the discovered fraud. I spend a lot of time thinking about what it would take to change this, and what the actual prevalence would be if we could uncover it.