This is the story of how I found what I believe to
be scientific misconduct and what happened when I reported it.
Science is supposed to be self-correcting. To test
whether it really is, I reported this misconduct through
several of science's mechanisms of self-correction. The results have shown me
that psychological science is largely defenseless against unreliable data.
I want to share this story with you so that you
understand a few things. You should understand that there are probably a few people
in your field producing work that is either fraudulent or so erroneous it may
as well be fraudulent. You should understand that their work is cited in policy
statements and included in meta-analyses. You should understand that, if you
want to see the data or to report concerns, those things happen according to
the inclinations of the editor-in-chief at the journal. You should understand
that, if the editor-in-chief is not inclined to help you, they are generally not
accountable to anyone, and they can always ignore you until the statute of
limitations runs out.
Basically, it is very easy to generate unreliable
data, and it is very difficult to get it retracted.
Qian Zhang
Two years ago, I read a journal
article that appeared to have gibberish for all its statistics (Zhang, Espelage, & Zhang, 2018). None of the numbers in the tables added up:
the p values didn't match the F values,
the F values didn't match the means and SDs, and the degrees
of freedom didn't match the sample size. This was distressing because the
sample size was a formidable 3,000 participants. If these numbers were wrong,
they were going to receive a lot of weight in future meta-analyses. I sent the
editor a note saying "Hey, none of these numbers make sense." The
editor said they'd ask the authors to correct, and I moved on with my life.
Figure 1. Table from Zhang, Espelage, & Zhang (2018). The means and SDs don’t make sense, and the
significance asterisks are incorrect given the F values.
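If you want to run this kind of check yourself, it is easy to automate: a reported F statistic and its degrees of freedom imply a particular p value, which you can recompute directly. Here is a minimal sketch in Python using scipy; the F value and degrees of freedom below are made-up numbers for illustration, not values from the paper.

```python
# Recompute the p value implied by a reported F statistic.
# These numbers are hypothetical, for illustration only.
from scipy import stats

F = 3.21          # reported F statistic (hypothetical)
df_between = 2    # numerator degrees of freedom
df_within = 2997  # denominator df; N - k for a one-way ANOVA

p = stats.f.sf(F, df_between, df_within)  # upper-tail p value
print(f"F({df_between}, {df_within}) = {F}, p = {p:.4f}")
# If the table marks this F with *** (p < .001), something is wrong:
# the recomputed p is about .04.
```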
Then I read the rest of
Dr. Zhang's first-authored articles and realized there was a broader, more
serious problem – one that I am still spending time and energy trying to clean
up, two years later.
Problems in Qian Zhang’s
articles
Zhang’s papers would often
report impossible statistics. Many papers had subgroup means that could not be
combined to yield the grand mean. For example, one paper reported mean task
scores of 8.98 ms and 6.01 ms for males and females, respectively, but a grand
mean task score of 23 ms.
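The underlying check is nothing fancier than a weighted average: the grand mean must fall somewhere between the subgroup means, no matter how the groups are sized. A quick sketch of that check, with hypothetical group sizes, since the point holds for any split:

```python
# The grand mean is a weighted average of the subgroup means,
# so it must lie between them regardless of group sizes.
# The group sizes here are hypothetical.
n_male, mean_male = 1500, 8.98
n_female, mean_female = 1500, 6.01

grand_mean = (n_male * mean_male + n_female * mean_female) / (n_male + n_female)
print(grand_mean)  # 7.495 -- nowhere near the reported 23 ms,
                   # and no choice of group sizes can push it above 8.98.
```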
Other papers had means
and SDs that were impossible given the range. For example, one study reported a
sample of 3,000 children with ages ranging from 10 to 20 years (M = 15.76, SD =
1.18), of which 1,506 were between ages 10 and 14 and 1,494 were between ages
15 and 20. If you put those numbers into SPRITE,
you will find that, to meet the reported mean and SD of age, all the
participants must be between the ages of 14 and 19, and only about 500
participants could be age 14.
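You do not even need SPRITE to see that something is off; a back-of-the-envelope sum-of-squares argument will do. The reported SD fixes the total squared deviation from the mean, and each of the 1,506 children aged 14 or younger must contribute at least (15.76 - 14)^2 to that total. A rough sketch of that argument (my own simplification, not SPRITE's algorithm):

```python
# Rough feasibility check: with N, mean, and SD fixed, the total sum of
# squared deviations from the mean is fixed too. Each of the 1,506
# children aged 10-14 deviates from the mean by at least (15.76 - 14),
# so their minimum contribution alone exceeds the total budget.
N, mean_age, sd_age = 3000, 15.76, 1.18
n_young = 1506  # reported number of children aged 10-14

total_ss = (N - 1) * sd_age**2                # ~4176, the whole "budget"
min_ss_young = n_young * (mean_age - 14)**2   # ~4665, already over budget

print(f"allowed sum of squares:             {total_ss:.0f}")
print(f"minimum from the 10-14 group alone: {min_ss_young:.0f}")
# 4665 > 4176: the reported mean, SD, and age breakdown cannot all be true.
```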
More seriously still,
tables of statistical output seemed to be recycled from paper to paper. Two
different articles describing two different experiments on two different
populations would come up with very similar cell means and F values.
Even if one runs exactly the same experiment twice, sampling error means that the
odds of all six cell means of a 2 × 3 design coming up again to within a couple of
decimal places are quite low. The odds of getting them in an entirely different
experiment, run years later in a different population, are smaller still.
As an example, consider this table, published in Zhang, Espelage, and Rost (2018), Youth and Society (Panel A), in which 2,000 children (4th-6th grade) performed a two-color emotion Stroop task. The means and F values closely match those from a sample of 74 high schoolers (Zhang, Xiong, & Tian, 2013, Scientific Research: Health, Panel B) and from a sample of 190 high schoolers (Zhang, Zhang, & Wang, 2013, Scientific Research: Psychology, Panel C).
Figure 2. Three highly similar tables from three different experiments by Zhang and colleagues. The degree of similarity for all nine values of the table is suspiciously high.
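To get a sense of just how low those odds are, here is a rough Monte Carlo sketch. The reaction-time spread and per-cell sample size are placeholder values I chose for illustration, not numbers from the papers; the sketch estimates only how often two independent replications of a single cell mean would agree to within 0.01, before even asking about all nine values at once.

```python
# Rough Monte Carlo: how often do two independent samples reproduce the
# same cell mean to within 0.01? The SD and n are placeholder values.
import numpy as np

rng = np.random.default_rng(1)
sd, n, reps = 50.0, 300, 100_000   # hypothetical RT spread and cell size

means_a = rng.normal(0, sd / np.sqrt(n), reps)  # sampling error of one cell mean
means_b = rng.normal(0, sd / np.sqrt(n), reps)
p_match = np.mean(np.abs(means_a - means_b) < 0.01)

print(f"P(one cell mean matches to within 0.01) ~ {p_match:.4f}")  # ~0.002
# For nine cells (and the F values too), raise that small number to the 9th power.
```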
Dr. Zhang publishes some corrigenda
After my first quick
note to Youth and Society that Zhang’s p values didn't
match the F values, Dr. Zhang started submitting corrections
to journals. What was remarkable about these corrections is that they would simply
add or subtract an integer so that the F values would match their significance
asterisks.
Consider, for example,
this correction at Personality and Individual Differences (Zhang, Tian, Cao, Zhang, & Rodkin, 2016):
Figure 3. An uninterpretable ANOVA table is corrected by the addition or subtraction of an integer value from its F statistics.
The correction just adds
2 or 3 onto the nonsignificant F values to make them match their
asterisks, and it subtracts 5 from the significant F value to
make it match its lack of asterisks.
Or this correction to Zhang, Espelage, and Zhang (2018), Youth and Society, now
retracted:
Figure 4. Nonsignificant F values become statistically significant through the addition of a tens digit. Note that these should now have three asterisks rather than one and two, respectively.
Importantly, none of the
other summary or inferential statistics had to be changed in these corrigenda, as one might expect if
there was an error in analysis. Instead, it was a simple matter of clobbering
the F values so that they’d match the significance asterisks.
Asking for raw data
While I was
investigating Zhang’s work from 2018 and earlier, he published another massive
3,000-participant experiment in Aggressive Behavior (Zhang et al., 2019). Given the
general sketchiness of the reports, I was getting anxious about the incredible
volume of data Zhang was publishing.
I asked Dr. Zhang if I
could see the data from these studies to try to understand what had happened.
He refused, saying only the study team could see the data.
So, I decided I’d ask
the study team. I asked Zhang’s American co-author if they had seen the data.
They said they hadn't. I suggested they ask for the data. They said Zhang
refused. I asked them if they thought that was odd. They said, no, "It's a
China thing."
Reporting Misconduct to the Institution
Given the recycling of
tables across studies, the impossible statistics, the massive sample sizes, the
secrecy around the data, and the corrigenda which had simply bumped the F values
into significance, I suspected I had found research misconduct. In May 2019, I
wrote up a report and sent it to the Chairman of the Academic Committee at his
institution, Southwest University in Chongqing. You can read that report here.
A month later, I was
surprised to get an email from Dr. Zhang. It was the raw data from the Youth
& Society article I had previously asked for and been refused.
Looking at the raw data
revealed a host of suspicious issues. For starters, participants were supposed
to be randomly assigned to a movie condition, but girls and students with high trait
aggression were dramatically more likely to be assigned to the nonviolent
movie.
There was something else
about the reaction time data that is a little more technical but very serious. Basically,
reaction time data on a task like the Stroop should show within-subject effects (some
conditions have faster RTs than others) and between-subject effects (some
people are faster than others). Consequently, even an incongruent trial from Quick
Draw McGraw could be faster than a congruent trial from Slowpoke Steven.
Because of these
between-subject effects, there should be a correlation between a subject’s
reaction times in one condition and their reaction times in the other. If you
look at color-Stroop data I grabbed from a reliable source on the OSF, you can
see that this correlation is very strong.
Figure 5. The correlation between subjects' mean congruent-word RT and mean incongruent-word RT in a color-word Stroop task. Data from Lin, Inzlicht, Saunders, & Friese (2019).
If you look at Zhang’s data, you see the
correlation is completely absent. You might also notice that the distribution
of subjects’ means is weirdly boxy, unlike the normal or log-normal
distribution you might expect.
Figure 6. The correlation between subjects' mean aggressive-word RT and nonaggressive-word RT in an aggressive-emotion Stroop task. Data from Zhang, Espelage, and Rost (2018). The distribution of averages is odd, and the correlation unusually weak.
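The check itself is mechanical: compute each subject's mean RT in each condition and correlate the two columns across subjects. A minimal sketch of that check on a generic trial-level file (the file name and column names are placeholders, not Zhang's variable names):

```python
# Correlate each subject's mean RT across the two trial types.
# In genuine Stroop data this correlation is typically very strong.
# The file name and column names below are placeholders.
import pandas as pd

trials = pd.read_csv("stroop_trials.csv")  # columns: subject, condition, rt

subject_means = (
    trials.groupby(["subject", "condition"])["rt"]
    .mean()
    .unstack("condition")                  # one column per condition
)
r = subject_means["congruent"].corr(subject_means["incongruent"])
print(f"between-subject correlation of condition means: r = {r:.2f}")
```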
There was no way the
study was randomized, and there was no way that the study data was reliable
Stroop data. I wrote an additional letter to the institution detailing these
oddities. You can read that additional letter here.
A month after that,
Southwest University cleared Dr. Zhang of all charges.
The letter I received
declared: "Dr. Zhang Qian was deficient in statistical knowledge and
research methods, yet there is insufficient evidence to prove that data fraud
[sic]." It explained that Dr. Zhang was just very, very bad at
statistics and would be receiving remedial training and writing some corrigenda.
The letter noted that, as I had pointed out, the ANOVA tables were gibberish
and the degrees of freedom did not match the reported sample sizes. It also
noted that the "description of the procedure and the object of study lacks
logicality, and there is a suspicion of contradiction in the procedure and
inconsistency in the sample," whatever that means.
However, the letter did
not comment on the strongest pieces of evidence for misconduct: the recycled
tables, the impossible statistics, and the unrealistic properties of the raw
data. I pressed the Chairman for comment on these issues.
After four months, the
Chairman replied that the two experts they consulted determined that "these
discussions belong to academic disputes." I asked to see the report from
the experts. I did not receive a reply.
Reporting Misconduct to the Journals
The institution being
unwilling to fix anything, I decided to approach the journals. In September and
October 2019, I sent each journal a description of the problems in the specific
article each had published, as well as a description of the broader evidence
for misconduct across articles.
I hoped that these
letters would inspire some swift retractions, or at least, expressions of
concern. I would be disappointed.
Some journals appeared
to make good-faith attempts to investigate and retract. Other journals have
been less helpful.
The Good Journals
Youth and Society reacted the most swiftly, retracting both articles two months later.
Personality and
Individual Differences took 10 months to decide to retract. In July 2020, the editor
showed me a retraction notice for the article. I am still waiting for the
retraction notice to be published. It was apparently lost when changing journal managers; once recovered, it then had to be sent to the authors and publisher for another round of edits and approvals.
Computers in Human
Behavior is still
investigating. The editor received my concerns with an appropriate degree of attention, but some confusion over whether the editor or the publisher is supposed to investigate seems to have slowed things down.
I felt these journals generally did their best; the slowness likely stems from the bureaucracy involved and from editors' inexperience with the process. Other journals, I felt, did not make such an attempt.
Aggressive Behavior
In October 2019, Zhang
sent me the data from his Aggressive Behavior article. I found the data
had the same bizarre features that I had found when I received the raw data
from Zhang's now-retracted Youth and Society article. I wrote
a letter detailing my concerns and sent it to Aggressive Behavior's
editor in chief, Craig Anderson.
The letter, which you can read here, detailed four concerns. One was about the plausibility of the average
Stroop effect reported, which was very large. Another was about failures of
random assignment: chi-squared tests found the randomly-assigned conditions
differed in sex and trait aggression, with p values of less than one in
a trillion. The other two concerns regarded the properties of the raw data.
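For readers who want to see how the randomization check works: cross-tabulate assigned condition against sex (or a median split on trait aggression) and run a chi-squared test; under genuine random assignment, imbalances that extreme should essentially never occur. A minimal sketch with made-up cell counts, not the actual counts from the paper:

```python
# Chi-squared test for failure of random assignment.
# Rows are conditions (violent, nonviolent); columns are sex (male, female).
# These cell counts are made up for illustration.
from scipy.stats import chi2_contingency

table = [[900, 624],
         [624, 900]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.2e}")
# An imbalance like this yields p far below one in a trillion,
# which should not happen if assignment were truly random.
```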
It took three months and
two emails to the full editorial board to receive acknowledgement of my letter.
Another four months after that, the journal notified me that it would
investigate.
Now, fifteen months
after the submission of my complaint, the journal has made the disappointing
decision to correct the article. The correction explains away the failures of randomization as an
error in translation; the authors now claim that they let participants
self-select their condition. This is difficult for me to believe. The original article stressed multiple times its use of random assignment and described the design as a "true experiment.” The authors also had perfectly equal samples per condition ("n = 1,524 students watched a 'violent' cartoon and n = 1,524 students watched a 'nonviolent' cartoon."), which is exceedingly unlikely to happen without random assignment.
The correction does not
mention the multiple suspicious features of the raw data.
This correction has done little to assuage my concerns. I feel it is closer to a cover-up. I will express my
displeasure with the process at Aggressive
Behavior in greater detail in a future post.
Zhang’s newest papers
Since I started
contacting journals, Zhang has published four new journal articles and one
ResearchSquare preprint. I also served as a peer reviewer on two of his other submissions:
One was rejected, and the other Zhang withdrew when I repeatedly requested raw
data and materials.
These newest papers all
carefully avoid the causes of my previous complaints. I had complained it was
unlikely that Zhang could collect 3,000 subjects for every experiment; the sample
sizes in the new studies range from 174 to 480. I had complained that the
distribution of aggressive-trial and nonaggressive-trial RTs within a subject
didn’t make sense; the new studies analyze and present only the aggressive-trial
RTs, or they report a measure that does not require RTs.
Two papers include a public
dataset as part of the online supplement, but the datasets contain only the
aggressive-trial RTs. When I contacted Zhang, he refused to share the nonaggressive-trial
RTs. He has also refused to share the accuracy data for any trials. This might
be a strategy to avoid tough questions about the kind of issues I found in his Youth
& Society and Aggressive Behavior articles.
Because Zhang refused me
access to the data, I had to try asking the editors at those journals to
enforce the APA Code of Ethics section 8.14, which requires sharing of data for
the purpose of verifying results.
At Journal of
Experimental Child Psychology, I asked editor-in-chief David Bjorklund
to intervene. Dr. Bjorklund has asked Dr. Zhang to provide the requested data. I thank him for upholding the Code of Ethics. A month and a half has passed since Dr. Bjorklund's intervention, and I have yet to receive the requested data and materials from Dr. Zhang.
At Children and
Youth Services Review, I asked editor-in-chief Duncan Lindsey to
intervene. Zhang claimed that the data consisted only of aggressive-trial RTs, and
that he could not share the program because it “contained many private
information of children and had copyrights.”
I explained my case to Lindsey.
Lindsey sent me nine words — "You will need to solve this with the
authors." — and never replied again.
Dr. Lindsey's failure to
uphold the Code of Ethics at his journal is shameful. Scholars should be aware
that Children and Youth Services Review has chosen not to enforce data-sharing
standards, and research published in Children and Youth Services Review cannot
be verified through inspection of the raw data.
I
have not yet asked for the data behind Zhang’s new articles in Cyberpsychology,
Behavior, and Social Networking or Journal of Aggression, Maltreatment,
& Trauma.
Summary
I was curious to see how the self-correcting
mechanisms of science would respond to what seemed to me a rather obvious case
of unreliable data and possible research misconduct. It turns out Brandolini’s
Law still holds: “The amount of energy needed to refute bullshit is an order of
magnitude larger than to produce it.” However, I was not prepared to be
resisted and hindered by the self-correcting institutions of science itself.
I was disappointed by the response from Southwest
University. Their verdict has protected Zhang and enabled him to continue
publishing suspicious research at a great pace. However, this result does not seem particularly surprising given universities' general unwillingness to investigate their own and China's general eagerness to clear researchers of fraud charges.
I have also generally been disappointed by the
response from journals. It turns out that a swift two-month process like the
one at Youth and Society is the exception, not the norm.
In the cases that an editor in chief has been
willing to act, the process has been very slow, moving only in fits and starts.
I have read before that editors and journals have very little time or resources
to investigate even a single case of misconduct. It is clear to me that the
publishing system is not ready to handle misconduct at scale.
In the cases that an editor in chief has been
unwilling to act, there is little room for appeal. Editors can act busy and
ignore a complainant, and they can get indignant if one tries to go around them
to the rest of the editorial board. It is not clear who would hold the editors
accountable, or how. I have little leverage over Craig Anderson or Duncan
Lindsey besides my ability to bad-mouth them and their journals in this report.
At best, they might retire in another year or two and I could have a fresh
editor with whom to plead my case.
The clearest consequence of my actions has been that
Zhang has gotten better at publishing. Every time I reported an irregularity
with his data, his next article would not feature that irregularity. In essence,
each technique for pointing out the implausibility of the data can be used only
once, because an editor’s or university’s investigation consists of showing the
authors all the irregularities and asking for benign explanations. This is a serious
problem when even weak explanations like “I didn’t understand what randomized
assignment means” or “I’m just very bad at statistics” are considered
acceptable.
Zhang has reported experiments with sample sizes totaling more than 11,000 participants (8,000 given the Aggressive Behavior correction). This is an amount of data that rivals entire meta-analyses and ManyLabs projects. If this data is flawed, it will have serious consequences for reviews and meta-analyses.
In total,
trying to get these papers retracted has been much more difficult, and rather
less rewarding, than I had expected. The experience has led me to despair for the
quality and integrity of our science. If data this suspicious can’t get a swift
retraction, it must be impossible to catch a fraud equipped with skills,
funding, or social connections.