A psychologist's thoughts on how and why we play games

Monday, May 16, 2016

The value-added case for open peer reviews

Last post, I talked about the benefits a manuscript enjoys in the process of scientific publication. To me, it seems that the main benefits are that an editor and some number of peer reviewers read it and suggest edits. Somehow, despite this part coming from volunteer labor, it still manages to cost $1500 an article.

And yet, as researchers, we can't afford to try to do without the journals. When the paper appears with a sagepub.com URL on it, readers now assume it to be broadly correct. The journal publication is part of the scientific canon, whereas the preprint was not.

Since the peer reviews are what really elevates the research from preprint to publication, I think the peer reviews should be made public, as part of the article's record. This will open the black box and encourage readers to consider: Who thinks this article is sound? What do they think are the strengths and weaknesses of the research? Why?

By comparison, the current system provides only the stamp of approval. But we readers and researchers know that the stamp of approval is imperfect. The process is capricious. Sometimes duds get published. Sometimes worthy studies are discarded. If we're going to place our trust in the journals, we need to be able to check up on the content and process of peer review.

Neuroskeptic points out that, peer review being what it is, perhaps there should be fewer journals and more blogs. The only difference between the two, in Neuro's view, is that a journal implies peer review, which implies the assent of the community. If journal publication implies peer approval, shouldn't journals show the peer reviews to back that up? And if peer approval is all it takes to make something scientific canon, couldn't a blogpost supported by peer reviews and revisions be equivalent to a journal publication?

Since peer review is all that separates blogging from journal publishing, I often fantasize about sidestepping the journals and self-publishing my science. Ideally, I would just upload a preprint to OSF. Alongside the preprint there would be the traditional 2-5 uploaded peer reviews.

Arguably, this would provide an even higher standard of peer review, in that readers could see the reviews. This would compare favorably with the current system, in which howlers are met with unanswerable questions like "Who the heck reviewed this thing?" and "Did nobody ask about this serious flaw?"

Maybe one day we'll get there. For now, so long as hiring committees, tenure committees, and granting agencies are willing to accept only journal publications as legitimate, scientists will remain powerless to self-publish. In the meantime, the peer reviews should really be open. The peer reviews are what separates preprint from article, and we pay millions of dollars a year to maintain that boundary, so we might as well place greater emphasis on that piece of the product and make it more transparent.

Saturday, May 14, 2016

Be Your Own Publisher?

The problem with paying any 3rd party for academic publishing is that these 3rd parties are corporations. Corporations have the defining goal of making as much profit as possible by providing a service.

This goal is often at odds with what is best for science. Under the traditional publishing model, financial considerations favor the strategy of hoarding all the most exciting research and leasing it out for incredible subscription fees. Researchers stretch their data to try to get the most extraordinary story so that they can get published in the most exclusive journal. Under the Open Access publishing model, financial considerations favor the strategy of publishing as many papers as possible so long as the average paper quality is not so poor that it causes the journal's reputation to collapse.

Subscription journals apparently cost the educational system billions of dollars a year. Article processing fees at open-access journals tend to sit at a cool $1500. How can it be so expensive to throw a .PDF file up on the internet?

Let's consider the advantages a published article has relative to a preprint on my GitHub page. Relative to the preprint, the science in a published article has added value from:
1) Peer reviewers, who provide needed criticism and skepticism. (Cost: $0)
2) Editors, who provide needed criticism, skepticism and curation. (Cost: $0)
3) Publicity and dissemination for accepted articles (Cost: Marketing budget)
4) Typesetting and file hosting (Cost: $1500 an article, apparently)

The value-added to researchers comes from the following sources:
1) The perceived increase in legitimacy associated with making it past peer review (Value: Priceless)
2) Prestige associated with being picked out for curation. (Value: Priceless)

It leads me to wonder: What might be so wrong with universities, laboratories, and researchers simply using self-publishing? Websites like arXiv, SSRN, OSF, and GitHub provide free hosting for PDFs and supplementary files.

If the main thing that distinguishes a preprint from an article is that between two and five people have read it and okayed it, and if that part costs nothing, why not save a heap of money and just have people post peer reviews on your preprint? (Consider Tal Yarkoni's suggestion of a Reddit-like interface for discussion, curation, and ranking.)

Is it possible that we might one day cut out the middleman and allow ourselves to enjoy the benefits of peer review without the enormous financial burden? Or does institutional inertia make it impossible?

Maybe this fall my CV can have a section for "Peer-reviewed manuscripts not published in journals."

Wednesday, May 4, 2016

Post-pub peer review should be transparent too

A few weeks ago, I did a little post-publication peer review. It was a novel experience for me, and led me to consider the broader purpose of post-pub peer review.
In particular, I was reminded of the quarrel between Simone Schnall and Brent Donnellan (and others) back in 2014. Schnall et al. suggested an embodied cognition phenomenon wherein incidental cues of cleanliness influenced participants' ratings of moral disgust. Donnellan et al. ran replications and failed to detect the effect. An uproar ensued, goaded on by some vehement language by high-profile individuals on either side of the debate.

One thing about Schnall's experience stays with me today. In a blogpost, she summarizes her responses to a number of frequently asked questions. One answer is particularly important for anybody interested in post-publication peer review.
Question 10: “What has been your experience with replication attempts?”
My work has been targeted for multiple replication attempts; by now I have received so many such requests that I stopped counting. Further, data detectives have demanded the raw data of some of my studies, as they have done with other researchers in the area of embodied cognition because somehow this research area has been declared “suspect.” I stand by my methods and my findings and have nothing to hide and have always promptly complied with such requests. Unfortunately, there has been little reciprocation on the part of those who voiced the suspicions; replicators have not allowed me input on their data, nor have data detectives exonerated my analyses when they turned out to be accurate.
I invite the data detectives to publicly state that my findings lived up to their scrutiny, and more generally, share all their findings of secondary data analyses. Otherwise only errors get reported and highly publicized, when in fact the majority of research is solid and unproblematic.
[Note: Donnellan and colleagues were not among these data detectives. They did only the commendable job of performing replications and reporting the null results. I mention Donnellan et al. only to provide context -- it's my understanding that the failure to replicate led to third-party detectives' attempts to detect wrongdoing through analysis of the original Schnall et al. dataset. It is these attempts to detect wrongdoing that I refer to below.]

It is only fair that these data detectives report their analyses and how they failed to detect wrongdoing. I don't believe Schnall's phenomenon for a second, but the post-publication reviewers could at least report that they don't find evidence of fraud.

Data detectives themselves can run the risk of p-hacking and selective reporting. Imagine ten detectives run ten tests each. If all tests are independent, then even with perfectly sound data, one of those hundred tests is likely to emerge with a very small p-value by chance alone. If anyone is going to make accusations according to "trial by p-value," then we had damn well better consider the problems of multiple comparisons and the garden of forking paths.
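To put a rough number on the ten-detectives scenario, here is a minimal Python sketch. It assumes every null hypothesis is true (the data are clean) and that the hundred tests are independent, so each p-value is just a uniform draw between 0 and 1. The function names and the p < .01 threshold for a "very small" p-value are my own choices for illustration.

```python
import random


def chance_of_false_alarm(n_tests, alpha):
    """P(at least one p < alpha) over independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests


def simulate_smallest_p(n_tests, n_sims, seed=1):
    """Smallest p-value found in each simulated investigation.

    Under a true null hypothesis a p-value is uniform on (0, 1),
    so each test is represented by one uniform random draw.
    """
    rng = random.Random(seed)
    return [min(rng.random() for _ in range(n_tests)) for _ in range(n_sims)]


n_tests = 10 * 10  # ten detectives, ten tests each
analytic = chance_of_false_alarm(n_tests, alpha=0.01)

smallest = simulate_smallest_p(n_tests, n_sims=5000)
simulated = sum(p < 0.01 for p in smallest) / len(smallest)

# A Bonferroni-style threshold that keeps the overall false-alarm rate below .05
bonferroni_alpha = 0.05 / n_tests
corrected = chance_of_false_alarm(n_tests, bonferroni_alpha)

print(f"P(some p < .01), analytic:  {analytic:.3f}")
print(f"P(some p < .01), simulated: {simulated:.3f}")
print(f"P(some p < {bonferroni_alpha}) after correction: {corrected:.3f}")
```

Under these assumptions, the chance that at least one of a hundred tests comes in under p = .01 is about 63%, even though nothing is wrong with the data. Real tests on one dataset are rarely independent, so treat this as an illustration of the problem rather than an exact figure.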

Post-publication peer review is often viewed as a threat, but it can and should be a boon, when appropriate. A post-pub review that finds no serious problems is encouraging, and should be reported and shared.* By contrast, if every data request is a prelude to accusations of error (or worse), then it becomes upsetting to learn that somebody is looking at your data. But data inspection should not imply that there are suspicions or serious concerns. Data requests and data sharing should be the norm -- they cannot be a once-in-a-career disaster.

Post-pub peer review is too important to be just a form of witch-hunting.
It's important, then, that post-publication peer reviewers give the full story. If thirty models give the same result, but one does not, you had better report all thirty-one models.** If somebody spends the time and energy to deal with your questions, post the answers so that the authors need not answer the questions all over again.

I do post-publication peer review because I generally don't trust the literature. I don't believe results until I can crack them open and run my fingers through the goop. I'm a tremendous pain in the ass. But I also want to be fair. My credibility, and the value of my peer reviews, depends on it.

The Court of Salem reels in terror at the perfect linearity of Jens Forster's sample means.

* Sakaluk, Williams, and Biernat (2014) suggest that, during pre-publication peer review, one reviewer run the code to make sure they get the same statistics. This would cut down on the number of misreported statistics. Until that process is a common part of pre-publication peer review, it will always be a beneficial result of post-publication peer review.

** Simonsohn, Simmons, and Nelson suggest the specification curve, which takes the brute-force approach to this by reporting every possible p-value from every possible model. It's cool, but I haven't tried to implement it yet.

Friday, April 15, 2016

The Undergraduate Thesis Banquet

An unusually lavish undergraduate honors banquet. (Image pinched from TheTimes.co.uk)
Some time ago, I got to attend a dinner for undergraduates who had completed an honors thesis in psychology. For each of these undergraduates, their faculty advisor would stand up and say some nice things about them.

The advisors would praise students for their motivation, their ideas, their brilliance, etc. etc. And then they would say something about the student's research results.

For some students, the advisor would say, with regret, that the student's idea hadn't borne fruit. "It was a great idea, but the data didn't work out..." they'd grimace, before concluding, "Anyway, I'm sure you'll do great." In these cases one knows that the research project is headed for the dustbin.

For other students, the advisor would gush, "So-and-so's an incredible student, they ran the best project, we got some really great data, and we're submitting an article to [Journal X]."

Somewhere in this, one gets the impression that the significance of results indicates the quality of a research assistant. Significant results are headed for the journals, while nonsignificant results are rewarded with a halfhearted, "Well, you tried."

I suspect that there is a heuristic at play that goes something like this: Effect size is a ratio of signal to noise. Good RAs collect clean data, while bad RAs collect noisy data. Therefore, a good RA will find significant results, while a bad RA might not.

But that, of course, assumes there is signal to be found. That assumption would seem to beg the question: is research for answering questions? Or is it for demonstrating what you already assume to be true? But I digress...
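To make the heuristic and its hidden assumption concrete, here is a small Python sketch. It assumes a standardized effect size (Cohen's d) and assumes that sloppy data collection adds independent noise on top of the true person-to-person variability. The function name and the numbers are hypothetical.

```python
import math


def observed_d(raw_difference, true_sd, noise_sd):
    """Cohen's d when independent measurement noise inflates the total SD."""
    total_sd = math.sqrt(true_sd ** 2 + noise_sd ** 2)
    return raw_difference / total_sd


# Compare a real half-SD effect against no effect at all,
# as the data collection gets noisier.
for noise_sd in (0.0, 1.0, 2.0):
    with_signal = observed_d(0.5, true_sd=1.0, noise_sd=noise_sd)
    without_signal = observed_d(0.0, true_sd=1.0, noise_sd=noise_sd)
    print(f"noise SD = {noise_sd}: d = {with_signal:.3f} (real effect), "
          f"d = {without_signal:.3f} (no effect)")
```

Noise shrinks the observed effect when there is an effect to shrink. When the true difference is zero, the cleanest RA in the world still gets zero.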

In any case, as unfair as it is, it's probably good for the undergrads to learn how the system works. But I'm hoping that at the next such banquet, the statistical significance of an undergrad's research results will have little to do with their perceived competence.

Monday, March 28, 2016

Asking for advice re: causal inference in SEM

I'm repeatedly running into an issue in causal interpretation of SEM models. I'm not sure what to make of it, so I want to ask everybody what they think.

Suppose one knows A and B to be highly correlated in the world, but one doesn't know whether there is causality between them.

In an experiment, one stages an intervention. Manipulation X causes a difference in levels of A between the control and treatment groups.

Here's the tricky part. Suppose one analyses the data gleaned from this experiment using SEM. One makes an SEM with paths X -> A -> B. Each path is statistically significant. This is presented as a causal model indicating that manipulation X causes changes in A, which in turn cause changes in B. 

Paths X->A and A->B are significant, but X->B is not. Is a causal model warranted?

However, if one tests the linear models A = b1×X and B = b2×X, one finds that b1 is statistically significant, but b2 is not. (Note that I am not referring to the indirect effect of X on B after controlling for A. Rather, the "raw" effect of X on B is not statistically significant.)

This causes my colleagues and me to wonder: Does the SEM support the argument that, by manipulation of X, one can inflict changes in A, causing downstream changes in B? Or does this inject new variance in A that is unrelated to B, but the SEM fits because of the preexisting large correlation between A and B?
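One way to think about the second possibility is to simulate it. The Python sketch below is a toy data-generating process that I made up, not an analysis of any real dataset: A and B share an unmeasured common cause U, the manipulation X shifts A, and B never depends on A or X at all. All names and coefficients are hypothetical, and simple regression slopes stand in for the SEM paths.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Ordinary least squares slope from regressing y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n=20000, seed=1):
    """Toy world: U causes both A and B; X shifts A only."""
    rng = random.Random(seed)
    X, A, B = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                    # unmeasured common cause
        x = 1 if rng.random() < 0.5 else 0     # random assignment to treatment
        a = u + 1.0 * x + rng.gauss(0, 0.3)    # the manipulation raises A
        b = u + rng.gauss(0, 0.3)              # B depends on U only, never on A or X
        X.append(x)
        A.append(a)
        B.append(b)
    return X, A, B


X, A, B = simulate()
a_on_x = slope(X, A)   # path X -> A
b_on_a = slope(A, B)   # path A -> B, with no direct X -> B path in the model
b_on_x = slope(X, B)   # "raw" effect of X on B

print(f"X -> A: {a_on_x:.2f}")
print(f"A -> B: {b_on_a:.2f}")
print(f"X -> B: {b_on_x:.2f}")
print(f"Implied indirect effect (product of paths): {a_on_x * b_on_a:.2f}")
```

In this made-up world both paths of X -> A -> B come out large, and their product suggests a sizable indirect effect, yet the raw effect of X on B hovers around zero because the manipulation never reaches B. That does not show which story holds in a real study. It only shows that significant X -> A and A -> B paths are compatible with there being no causal chain.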

Can you refer me to any literature on this issue? What are your thoughts?

Thanks for any help you can give, readers.

Tuesday, March 22, 2016

Results-blinded Peer Review

The value of any experiment rests on the validity of its measurements and manipulations. If the manipulation doesn't have the intended effect, or the measurements are just noise, then the experiment's results will be uninformative.

This holds whether the results are statistically significant or not. A nonsignificant result, obviously, could be the consequence of an ineffective manipulation or a noisy outcome variable. But when the methods are invalid, a significant result is uninformative too -- it is either a Type I error, or it reflects bias in the measurement.

The problem I have is that the reader's (or at least, the reviewer's) perception of the method's validity often hinges upon the results obtained. Where a significant result might have been hailed as a successful conceptual replication, a nonsignificant result might be dismissed as a departure from appropriate methodology.

It makes me consider this puckish lesson from Archibald Cochrane, as quoted and summarized on Ben Goldacre's blog:
The results at that stage showed a slight numerical advantage for those who had been treated at home. I rather wickedly compiled two reports: one reversing the number of deaths on the two sides of the trial. As we were going into the committee, in the anteroom, I showed some cardiologists the results. They were vociferous in their abuse: “Archie,” they said “we always thought you were unethical. You must stop this trial at once.”
I let them have their say for some time, then apologized and gave them the true results, challenging them to say as vehemently, that coronary care units should be stopped immediately. There was dead silence and I felt rather sick because they were, after all, my medical colleagues.
Perhaps, just once in a while, such a results-blinded manuscript should be submitted to a journal. Once Reviewers 1, 2, and 3 have all had their say about the ingenuity of the method, the precision of the measurements, and the adequacy of the sample size, the true results could be revealed, and one could see how firmly the reviewers hold to their earlier arguments.

Thankfully, the increasing prevalence of Registered Reports may forestall the need for any such underhanded prank. Still, it is fun to think about.

Saturday, March 19, 2016

I Was Wrong!

Yesterday, ResearchGate suggested that I read a new article reporting that ego depletion can cause aggressive behavior. This was a surprise to me because word has it that ego depletion does not exist, so surely it cannot be a cause of aggressive behavior.

The paper in question looks about like you'd expect: an unusual measure of aggression, a complicated 3 (within) × 2 (between) × 2 (between) design, a covariate tossed into the mix just for kicks, a heap of measures collected and mentioned in a footnote but not otherwise analyzed. It didn't exactly change my mind about ego depletion, much less its role in aggressive behavior.

But it'd be hypocritical of me to criticize this ill-timed paper without mentioning the time I reported an ego-depletion effect through effect-seeking, exploratory analysis. I've also been meaning to change my blogging regimen up a bit. It's time I switched from withering criticism to withering self-criticism.

The paper is Engelhardt, Hilgard, and Bartholow (2015), "Acute exposure to difficult (but not violent) video games dysregulates cognitive control." In this study, we collected a hearty sample (N = 238) and had them play one of four modified versions of a first-person shooter game, a 2 (Violence: low, high) × 2 (Difficulty: low, high) between-subjects design.

To manipulate violence, I modified the game's graphics. The violent version had demons and gore and arms bouncing across the floor, whereas the less violent version had silly-looking aliens being warped home. We also manipulated difficulty: Some participants played a normal version of the game in which monsters fought back, while other participants played a dumb-as-rocks version where the monsters walked slowly towards them and waited patiently to be shot.

After the game, participants performed a Spatial Stroop task. We measured the magnitude of the compatibility effect, figuring that larger compatibility effects would imply poorer control. We also threw in some no-go trials, on which participants were supposed to withhold a response.

Our hypothesis was that playing a difficult game would lead to ego depletion, causing poorer performance on the Spatial Stroop. This might have been an interesting refinement on the claim that violent video games teach their players poorer self-control.

We looked at Stroop compatibility and found nothing. We looked at the no-go trials and found nothing. No effects of violence, and none of difficulty. So what did we do?

We needed some kind of effect to publish, so we reported an exploratory analysis, finding a moderated-mediation model that sounded plausible enough.

We figured that maybe the difficult game was still too easy. Maybe participants who were more experienced with video games would find the game to be easy and so would not have experienced ego depletion. So we split the data again according to how much video game experience our participants had, figuring that maybe the effect would be there in the subgroup of inexperienced participants playing a difficult game.

The conditional indirect effect of game difficulty on Stroop compatibility as moderated by previous game experience wasn't even, strictly speaking, statistically significant: p = .0502. And as you can see from our Figure 1, the moderator is very lopsided: only 25 people out of the sample of 238 met the post-hoc definition of "experienced player."

And the no-go trials on the Stroop? Those were dropped from analysis: our footnote 1 says our manipulations failed to influence behavior on those trials, so we didn't bother talking about them in the text.

So to sum it all up, we ran a study, and the study told us nothing was going on. We shook the data a bit more until something slightly more newsworthy fell out of it. We dropped one outcome and presented a fancy PROCESS model of the other. (I remember at some point in the peer review process being scolded for finding nothing more interesting than ego depletion, which was accepted fact and old news!)
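How much does this kind of data-shaking cost in false positives? The Python sketch below is a simplified stand-in, not a reanalysis of our data: it simulates a hypothetical researcher with no true effect anywhere who tests the whole sample and then also tests the experienced and inexperienced subgroups separately. The sample of 238 with 25 experienced players mirrors the numbers above. The rest (the z-test with a known SD of 1, alternating assignment to condition, the function names) are assumptions made for illustration, and our actual analysis was a moderated-mediation model rather than subgroup tests.

```python
import math
import random


def z_test_p(group_a, group_b):
    """Two-sided p-value for a difference in means, treating the SD as known (1)."""
    diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    se = math.sqrt(1 / len(group_a) + 1 / len(group_b))
    return math.erfc(abs(diff / se) / math.sqrt(2))


def one_study(rng, n=238, n_experienced=25):
    """Simulate one null study; return p-values for the full sample and subgroups."""
    rows = []
    for i in range(n):
        hard_game = (i % 2 == 0)            # alternate assignment to difficulty
        experienced = (i < n_experienced)   # hypothetical post-hoc subgroup
        score = rng.gauss(0, 1)             # outcome is pure noise: no true effect
        rows.append((hard_game, experienced, score))

    def p_for(keep):
        hard = [s for h, e, s in rows if h and keep(e)]
        easy = [s for h, e, s in rows if not h and keep(e)]
        return z_test_p(hard, easy)

    return (p_for(lambda e: True),      # everyone
            p_for(lambda e: not e),     # inexperienced players only
            p_for(lambda e: e))         # experienced players only


rng = random.Random(1)
n_sims = 2000
results = [one_study(rng) for _ in range(n_sims)]

rate_full = sum(p[0] < 0.05 for p in results) / n_sims
rate_any = sum(min(p) < 0.05 for p in results) / n_sims

print(f"False-positive rate, planned test only:     {rate_full:.3f}")
print(f"False-positive rate, any of the three tests: {rate_any:.3f}")
```

The planned test alone should land near the nominal 5%. Allowing yourself the subgroup tests as well pushes the rate of finding "something" noticeably higher, even though there is nothing to find.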

To our credit, we explicitly reported the exploratory analyses as being exploratory, and we reported p = .0502 instead of rounding it down to "statistically significant, p = .05." But at the same time, it's embarrassing that we structured the whole paper to be about the exploratory analysis, rather than the null results. 

In the end, I'm grateful that the Registered Replication Report (RRR) has set the record straight on ego depletion. It means our paper probably won't get cited much except as a methodological or rhetorical example, but it also means that our paper isn't going to clutter up the literature and confuse things in the future.

In the meantime, it's shown me how easily one can pursue a reasonable post-hoc hypothesis and still land far from the truth. And I still don't trust PROCESS.