Wednesday, December 13, 2017

How to Play a Prediction Market

A prediction market is a way to assign probabilities to events. Bettors buy YES bets on things they think are likely to happen (relative to the market price) and NO bets on things they think are unlikely to happen (relative to the market price). Market dynamics push the market price toward the bettors' aggregate subjective probability of the event. This is useful if you are trying to assign probabilities to one-off future events.

In this post, I'll teach you how to place bets to most effectively get the largest payout possible. In so doing, you'll do more to calibrate the market to your predictions.

Let's get ready to corner the replication market!

How does a prediction market work?

A prediction market allows people to bet YES or NO on some outcome. As people bet that the outcome will happen, the price of YES shares increases. As people bet that the outcome won't happen, the price of YES shares falls.

The market price for a YES share is p, the probability of the outcome. The market price for a NO share is (1-p). If the event happens, all the YES shares pay out $1 each and the NO shares become worthless. If the event does not happen, all the NO shares pay out $1 each and the YES shares become worthless.

The probability of rolling a six is 1/6, so we should be willing to pay up to $1/6 for YES or $5/6 for NO.

Imagine we are betting that a roll of a six-sided die will yield a six. The probability of this is 1/6, or about 17 percent. YES shares will cost about 17 cents and NO shares about 83 cents. With five dollars, you could buy 30 YES shares (at exactly 1/6 of a dollar each) or 6 NO shares (at 5/6 of a dollar each).

Your expected payout is the number of shares times the probability. In the die example, since the market price is correct, your expected value is five dollars whether you buy YES or NO. For YES shares, 30 shares * (1/6 payout chance) = $5. For NO shares, 6 shares * (5/6 payout chance) = $5.

If the market price is wrong, you have a chance to make a profit. Suppose we are still betting on the die, but for some reason the market price is set at 10 cents for a YES share. We know that the probability of the die rolling six is greater than this, so with our five dollars we can buy 50 shares with an expected value of 50 shares * (1/6 payout chance) = $8.33. This is a profit of $3.33. Another way to look at this is that it's a profit of about seven cents per share, the difference between the wrong market price (.10) and the true probability (about .17).

But if the market price is wrong, and we are wrong with it, we will lose money. Buying NO shares at this price (90 cents each) will turn our five dollars into 5.56 shares * (5/6 payout chance) = $4.63, a long-run loss of 37 cents.
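The arithmetic in these examples is easy to check in a few lines of Python (a sketch; the prices and bankroll are just the numbers from the die example):

```python
def expected_value(bankroll, price, prob_payout):
    """Spend `bankroll` on shares at `price`; each share pays $1
    with probability `prob_payout`."""
    shares = bankroll / price
    return shares * prob_payout

p_six = 1 / 6  # true probability the die rolls a six

# Fairly priced market: YES and NO bets have the same expected value.
ev_yes_fair = expected_value(5.00, p_six, p_six)
ev_no_fair = expected_value(5.00, 1 - p_six, 1 - p_six)

# Mispriced market: YES trades at 10 cents though the truth is about 17 cents.
ev_yes_cheap = expected_value(5.00, 0.10, p_six)    # 50 shares * 1/6 = $8.33
ev_no_dear = expected_value(5.00, 0.90, 1 - p_six)  # a losing bet

print(ev_yes_fair, ev_no_fair, ev_yes_cheap, ev_no_dear)
```

When the price equals the true probability, both sides come out to your original five dollars; the only way to profit in expectation is to find a wrong price.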

The Big Picture of the Big Short

Like we covered above, playing the prediction market isn't simply about buying YES on things you think will replicate and NO on things you think won't. Otherwise, we would just buy NO shares on the die rolling six because we know it's unlikely relative to the die not rolling six. It's about evaluating the probability of those replications. Your strategy in a betting market should be to look for opportunities where there is a difference between the market price and the probability that you'd assign to that event.

If the market is completely correct, it shouldn't matter what you buy -- your 50 tokens will have an expected value of $50. In our die example above, when the market price was right, YES and NO shares had the same expected value. But if the market is wrong, you have a chance to beat the market, turning your 50 tokens into several times their value.

In order to beat the market, you have to find places where the market price is miscalibrated. Maybe something is trading at 40% when it only has a 20% chance to replicate, in your view. If you are right, each NO share you buy will cost 60 cents but have an expected value of 80 cents. But if you are wrong, you will pay more for the shares than they are truly worth, getting a poorer return on your 50 tokens than had you just spread them across the market.

Below is my four-step process for turning your predictions into the largest possible payoff.

1. Evaluate your prices.

Before the market started, I wrote down my estimates of what would or wouldn't replicate. I assigned probabilities to these studies indicating what chance I thought they had of replicating.

Coming up with these estimates is the basis of the replication market. I ended up focusing on the things I thought wouldn't replicate. Some studies were a priori deeply implausible, others had weak p-values, some had previous failures to replicate, and some had a combination of factors. These were studies I felt pretty confident wouldn't replicate, and so I priced them at about 10% (2.5% chance of Type I error + 7.5% chance of true effect).

A peek at my spreadsheet, comparing my subjective probabilities to the market prices.

Some other studies seemed more likely to replicate, so I was willing to price them in the 50-80% range. I was less certain about these, so I saw these as riskier purchases, and tended to invest less in them.

It's also useful to remember the context of the last prediction market. In that market, the prices were much too high. Nothing below 40% replicated, and the highest-priced study (88%) also failed to replicate. The lowest price on a successful replication was about 42%.

2. Buy and sell to your prices.

To make profit on the replication market, you have to spend your money where you think the market price is most miscalibrated. Something that the market thinks is a sure thing (95%) that you think will flop (5%) would be a massive 90-cent profit per share. Something that seems reasonable (50%) that the market is afraid won't replicate (15%) could be a nice little profit of 35 cents per share.

I made a spreadsheet of my prices and the market's prices. I added a column representing the difference between those prices. The largest absolute difference indicates where I would expect the greatest profit per share.

If the difference is negative, then buy NO shares. Suppose something is trading at 50%, but you think it has only a 15% chance of replicating. You can buy NO shares for 50 cents that you think are worth 85 cents -- a 35 cent profit per share.

If the difference is positive, then buy YES shares. If something is trading at 50%, and you think it has a 75% chance of replicating, then every YES share costs 50 cents but is worth 75 cents.
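My spreadsheet logic boils down to a few lines of code. A minimal sketch, with made-up study names and prices:

```python
# (my subjective probability, market price) for some hypothetical studies.
prices = {
    "Study A": (0.15, 0.50),   # market far too optimistic: buy NO
    "Study B": (0.75, 0.50),   # market too pessimistic: buy YES
    "Study C": (0.40, 0.42),   # close to fair: leave it alone
}

# Rank by absolute difference: the biggest gap is the biggest edge per share.
ranked = sorted(prices.items(), key=lambda kv: -abs(kv[1][0] - kv[1][1]))
for study, (mine, market) in ranked:
    side = "YES" if mine > market else "NO"
    print(f"{study}: buy {side}, edge {abs(mine - market):.2f} per share")
```

The sign of the difference tells you which side to buy; the magnitude tells you where to put your tokens first.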

Overly optimistic market prices meant that I placed most of my bets on certain studies not replicating.

Again, you only profit when you are right and the market is wrong. Look for where there is juice!

3. Diversify your portfolio

If you want to ensure a decent payout, it may make sense to spread your money around. Suppose there is a study priced at 50% chance of replicating, but you know the true chance of replication is 80%. If you're right, putting all 50 tokens on this one study has an 80% chance of earning you $100, but a 20% chance of earning you $0. Your expected value is $80, a nice $30 profit, but there's a lot of variability.

Payout:     $100    $0
Frequency:   80%   20%
EV = $80; SD = $40

By diversifying your bets, you can reduce the variability at the cost of slightly reducing your expected value. Consider dividing your bets across two options, one with a slightly worse profit margin. Let's say Study 1 is priced at 50% but is worth 80%, and Study 2 is priced at 65% but is worth 75%, with the two outcomes independent. By putting half our money into Study 2, we reduce our average profit, but we also reduce the likelihood of suffering a blowout.

Payout:      $88    $50    $38    $0
Frequency:   60%    20%    15%     5%
EV ≈ $69; SD ≈ $26
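Assuming the outcomes are independent, payout tables like these can be reproduced by enumerating every combination of outcomes. A sketch:

```python
from itertools import product
from math import sqrt

def portfolio_stats(bets):
    """bets: list of (tokens, price, prob_success); each share pays $1 on success.
    Enumerates every combination of outcomes, assuming independence."""
    outcomes = []
    for results in product([1, 0], repeat=len(bets)):
        prob = 1.0
        payout = 0.0
        for hit, (tokens, price, p) in zip(results, bets):
            prob *= p if hit else (1 - p)
            payout += (tokens / price) * hit
        outcomes.append((payout, prob))
    ev = sum(pay * pr for pay, pr in outcomes)
    var = sum((pay - ev) ** 2 * pr for pay, pr in outcomes)
    return ev, sqrt(var)

# All 50 tokens on Study 1 (priced 50%, worth 80%):
print(portfolio_stats([(50, 0.50, 0.80)]))
# Split across Study 1 and Study 2 (priced 65%, worth 75%):
print(portfolio_stats([(25, 0.50, 0.80), (25, 0.65, 0.75)]))
```

The split portfolio gives up some expected value but cuts the standard deviation substantially, which is the whole point of diversifying.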

In the recent market, for example, Sparrow, Liu, and Wegner tended to trade at 55%, whereas I thought it was worth about 15%. Although this 40-cent gap would have been my biggest profit-per-dollar, I felt it was too risky to put everything on this study, so I balanced it against other studies with smaller profit margins.

4. Day trading

As other people show up to the market and start twiddling their bets around, the market prices will change. The market may move toward some of your predictions and away from others. If you like to procrastinate by watching the market, you can leverage your bets into a higher potential payout.

Figure 1. You hold NO shares of Studies 1 and 2, which the market has evaluated at 35% (bars) but you think have only a 10% chance of replicating (dashed line). Each share represents 25 cents of profit to you.

Let's say you think Study 1 and Study 2 each have a 10% chance of replicating. You bought 30 shares each of Study1 NO and Study2 NO for 65 cents a share (35% chance to replicate). You see each of these shares as representing a 25-cent profit (Figure 1).

Figure 2. The market has shifted such that your Study1 NO shares are worth more and your Study2 NO shares are worth less. If you are ready to be aggressive, you can sell your Study1 NO shares to take advantage of cheaper Study2 NO shares.

Some time passes, and now the market has agreed with you on Study1, dropping its probability to 20%, but it disagrees with you on Study2, raising the probability to 45% (Figure 2). The shares of Study 1 you're holding have already realized 15 cents per share of profit. The shares of Study 2 you're holding have lost 10 cents a share, but if you are right, then you can keep buying these shares at 55 cents when you think they are worth 90 cents.

Since the Study 1 shares have already realized their value, you can sell the Study1 NO shares to buy more cheap shares of Study2 NO. If the market fluctuates again, you can sell your expensive shares to pick up more cheap shares and so on and so on.
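The arithmetic behind this rebalancing, using the prices from Figures 1 and 2, looks like this (a sketch that assumes you can trade at the quoted prices):

```python
# Figures 1 and 2 in numbers (hypothetical prices from the example).
shares = 30
buy_price = 0.65       # NO costs 65 cents when YES trades at 35%
my_no_value = 0.90     # a NO share is worth 90 cents if the replication chance is 10%

# Study 1: the market drops to 20% YES, so NO now sells for 80 cents.
study1_proceeds = shares * 0.80                       # $24.00
realized_profit = study1_proceeds - shares * buy_price  # 15 cents/share locked in

# Study 2: the market rises to 45% YES, so NO shares now cost only 55 cents.
study2_new_shares = study1_proceeds / 0.55
study2_expected = study2_new_shares * my_no_value     # what they're worth to you

print(study1_proceeds, realized_profit, study2_new_shares, study2_expected)
```

Selling the shares whose profit has already been realized buys more of the shares the market just put on sale.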

I watched the market and kept comparing the prices against my predictions. When one of my NO bets started to cap out (e.g., Gervais and Norenzayan reached 15%), I would sell my NO bets and reinvest them in another cheaper NO bet (e.g., buying NO on Kidd and Castano at 40%). Sometimes some poor credulous soul (or somebody fumbling with the GUI) would buy a bunch of YES bets on Ackerman, driving the price way up (e.g., to 45%). When this would happen, I'd sell all my current bets to take advantage of the opportunity of cheap Ackerman NO bets.

It can be tempting to try to play the market, moving your tokens around to try to catch where other people will move tokens. I don't think there's much use in that. There aren't news events to influence the prediction market prices. Just buy your positions and hold them. If the market disagrees with you, you may consider doubling down on your bets now that they are cheaper. If the market agrees with you, you can release those options to invest in places where the market disagrees with you.


To make the biggest profits, you have to beat the market. To do this, you must: (1) Make good estimates of the probability to replicate. (2) Find the places where the market price is most divergent from what probability you would assign the study. (3) Spread your bets out across a number of studies to manage your risk. (4) Use day trading to take advantage of underpriced shares and increase your total leverage.

Friday, December 1, 2017

Adventures programming a Word Pronunciation Task in PsychoPy

I'm a new assistant professor trying to set up my research laboratory. I thought I'd try making the jump to PsychoPy as a way to make my materials more shareable, since not everybody will have a $750+ E-Prime or DirectRT license or whatever. (I'm also a tightwad.)

My department has a shared research suite of cubicles. Those cubicles are equipped with Dell Optiplex 960s running Windows 7. I'm reluctant to try to upgrade them since, as shared computers, other members of the department have stuff running on them that I'm sure they don't want to set up all over again.

In this process, I ran into a couple of bugs on these machines that I hadn't encountered while developing the tasks on my Win10 Dell Optiplex 7050s. These really made life difficult. I spent a lot of time wrangling with these errors, and I experienced a lot of stress wondering whether I'd fix them in five minutes or five months.

Here for posterity are the two major bugs I'd encountered and how they were resolved. I don't know anything about Python, so I hope these are helpful to the equally clueless.

"Couldn't share context" error

Initially, PsychoPy tasks of all varieties were crashing on startup. Our group couldn't even get the demos to run. The error message said pyglet.gl.ContextException: Unable to share contexts.

Didn't fix it:

Apparently this can be an issue with graphics drivers on some machines. Updating my drivers didn't fix the problem, perhaps in part because the hardware is kind of old.


This error was resolved by specifying an option for pyglet. I used PsychoPy's Builder View to compile the task. This made a file called Task.py. I opened up the .py file with notepad / wordpad / coder view / code writer and added the two pyglet lines below to the top of the script:

from __future__ import absolute_import, division
# Trying to fix pyglet 'shared environment' error
import pyglet
pyglet.options['shadow_window'] = False
# script continues as normal
from psychopy import locale_setup, sound, gui, visual, core, data, event, logging
from psychopy.constants import (NOT_STARTED, STARTED, PLAYING, PAUSED,
                                STOPPED, FINISHED, PRESSED, RELEASED, FOREVER)

This fixed my "Couldn't share context" error. If you're having trouble with "couldn't share context", consider opening up your .py file and adding these two lines just underneath from __future__ import.

Portaudio not initialized error

My Word Pronunciation Task requires the use of a microphone to detect reaction time. Apparently this was a simple task for my intellectual ancestors back in the 1990s -- they were able to handle this using HyperCard, of all things! But I have lost a lot of time and sleep and hair trying to get microphones to play nice with PsychoPy. It's not a major priority for the overworked developers, and it seems to rely on some other libraries that I don't understand.

Trying to launch my Word Pronunciation Task led to the following error: "PortAudio not initialized [...] The Server must be booted! [...] Need a running pyo server."

This was fixed by changing Windows' speaker playback frequency from 48000 Hz to 44100 Hz.

Right click on the Volume icon in the taskbar and open up "Playback devices."

Right click on your playback device and click "Properties."

Under the "Advanced" tab, switch the audio quality from a 48000Hz sampling rate (which Portaudio doesn't like) to a 44100 Hz sampling rate (which Portaudio does like, apparently).

This strangely oblique tweak was enough to fix my Portaudio problems.

Now that I can use all these computers, I'm looking forward to scaling up my data collection and getting this project really purring!

Thanks to Matt Craddock and Stephen Martin for help with the "shared context" bug. Thanks to Olivier Belanger for posting how to fix the Portaudio bug.

Thursday, June 22, 2017

Overestimation of violent-game effects

At long last, our article "Overstated Evidence for Short-Term Effects of Violent Games on Affect and Behavior: A Reanalysis of Anderson et al. (2010)" is released from its embargo at Psychological Bulletin. (Paywalled version here.)

In this paper, Chris Engelhardt, Jeff Rouder, and I re-analyze the famous Anderson et al. (2010) meta-analysis on violent video game effects. At the time, this meta-analysis was hailed by some as "nailing the coffin shut on doubts that violent video games stimulate aggression" (Huesmann, 2010). It is perhaps the most comprehensive and most-cited systematic review of violent-game research.

The authors conclude that, across experimental, cross-sectional, and longitudinal research designs, the recovered literature indicates significant effects of violent games on aggressive thoughts, feelings, and behaviors. Effects are moderate in size (r = ~.2).

Our paper challenges some of the conclusions from that paper. Namely,

  • The original authors reported that there was "little evidence of selection (publication) bias." We found, among some sets of experiments, considerable evidence of selection bias.
  • The original authors reported that better experiments found larger effects. We found that it instead may be the case that selection bias is stronger among the "best" experiments.
  • The original authors reported short-term effects on behavior of r = .21, a highly significant result of medium size. We estimated that effect as being r = .15 at the most and possibly as small as r = .02.

We do not challenge the results from cross-sectional or longitudinal research. The cross-sectional evidence is clear: there is a correlation between violent videogames and aggressive outcomes, although this research cannot demonstrate causality. There is not enough longitudinal research to try to estimate the degree of publication bias, so we are willing to take that research at its word for now. (Besides, an effect of hundreds of hours of games over a year is more plausible than an effect of a single 15-minute game session.)

Signs of selection bias in aggressive behavior experiments

With regard to short-term effects on aggressive behavior, the funnel plot shows some worrying signs. Effect sizes seem to get smaller as the sample size gets larger. There is a cluster of studies that fall with unusual accuracy in the .01 < p < .05 region. And when filtering for the "best practices" experiments, nearly all the nonsignificant results are discarded, leaving a starkly asymmetrical funnel plot. See these funnel plots from experiments on aggressive behavior:

When filtering for what the original authors deemed "best-practices" experiments, most null results are discarded. Effect sizes are reported in Fisher's Z, with larger effects on the right side of the x-axis. The average effect size increases, but so does funnel plot asymmetry, indicating selection bias. Studies fall with unusual regularity in the .01 < p < .05 region, shaded in dark grey.

The p-curve doesn't look so hot either:
P-curve of experiments of aggressive behavior coded as "best-practices". The curve is generally flat. This suggests either (1) the null is true or (2) the null is false but there is p-hacking.

Where naive analysis suggests r = .21 and trim-and-fill suggests r = .18, p-curve estimates the effect as r = .08. Let's put that in practical terms. If Anderson and colleagues are right, a good experiment needs 140 participants for 80% power in a one-tailed test. If p-curve is right, you need 960 participants. 
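Those sample sizes come from the usual Fisher-z approximation for the power of a correlation test. A sketch of the calculation (normal approximation, not exact power):

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_power(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect correlation r in a one-tailed test,
    via the Fisher z transformation: N = ((z_alpha + z_beta) / atanh(r))^2 + 3."""
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha) + z(power)) / atanh(r)) ** 2 + 3)

print(n_for_power(0.21))  # ~140 if Anderson et al. are right
print(n_for_power(0.08))  # ~965 if the p-curve estimate is right
```

The formula lands on the figures quoted above: an order-of-magnitude jump in required sample size when the effect shrinks from r = .21 to r = .08.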

Given that 4 out of 5 "best-practices" studies have fewer than 140 participants, I suspect that we know very little about short-term causal effects of violent games on behavior.

Reply from Kepes, Bushman, and Anderson

You can find a reply by Kepes, Bushman, and Anderson here. They provide sensitivity analyses by identifying and removing outliers and by applying a number of other adjustments to the data: random-effects trim-and-fill, averaging the five most precise studies, and a form of selection modeling that assumes certain publication probabilities for null results.

They admit that "selective publishing seems to have adversely affected our cumulative knowledge regarding the effects of violent video games." However, they conclude that, because many of their adjustments are not far from the naive estimate, the true effects are probably only modestly overstated. In their view, the lab effect remains theoretically informative.

They do a fine job of it, but I must point out that several of their adjustments are unlikely to fully account for selection bias. We know that trim-and-fill doesn't get the job done. An average of the five most precise studies is also unlikely to fully eliminate bias. (In our preprint, we looked at an average of the ten most precise studies and later dropped it as uninteresting. You shed only a little bias but lose a lot of efficiency.)

I know less about the Vevea and Woods selection model they use. Still, because it uses a priori weights instead of estimating them from the data, I am concerned it may yet overestimate the true effect size if there is p-hacking or if the selection bias is very strong. But that's just my guess.


I am deeply grateful to Psychological Bulletin for publishing my criticism. It is my hope that this is the first of many similar re-analyses increasing the transparency, openness, and robustness of meta-analysis. Transparency opens the black box of meta-analysis and makes it easier to tell whether literature search, inclusion/exclusion, and analysis were performed correctly. Data sharing and archival also allows us to apply new tests as theory or methods are developed.

I am glad to see that we have made some progress as a field. Where once we might have debated whether or not there is publication bias, we can now agree that there is some publication bias. We can debate whether there is only a little bias and a medium effect, or whether there is a lot of bias and no effect. Your answer will depend somewhat on your choice of adjustment model, as Kepes et al. make clear.

To that end, I hope that we can start collecting and reporting data that does not require such adjustment. Iowa State's Douglas Gentile and I are preparing a Registered Replication Report together. If we find an effect, I'll have a lot to think about and a lot of crow to eat. If we don't find an effect, we will need to reevaluate what we know about violent-game effects on the basis of brief laboratory experiments.

Tuesday, May 30, 2017

Trim-and-fill just doesn't work

The last couple of years have seen an exciting explosion in new techniques for detecting and adjusting for publication bias. If you're on the cutting edge of meta-analysis, you now can choose between p-curve, p-uniform, PET, PEESE, PET-PEESE, Top-10, and selection-weight models. If you're not on the cutting edge, you're probably just running trim-and-fill and calling it a day.

Looking at all these methods, my colleagues and I got to wondering: Which of these methods work best? Are some always better than others, or are there certain conditions under which they work best? Should we use p-curve or PET-PEESE? Does trim-and-fill work at all?

Today Evan Carter, Felix Schonbrodt, Will Gervais, and I have finished an exciting project in which we simulated hundreds of thousands of research literatures, then held a contest between the methods to see which does the best at recovering the true effect size.

You can read the full paper here. For this blog post, I want to highlight one finding: that the widely-used trim-and-fill technique seems to be wholly inadequate for dealing with publication bias.

One of the outcomes we evaluated in our simulations was mean error, or the bias. When statistically significant results are published and non-significant results are censored, doing a plain-vanilla meta-analysis is gonna give you an estimate that's much too high. To try to handle this, people use trim-and-fill, hoping that it will give a less-biased estimate.

Unfortunately, trim-and-fill is not nearly strong enough to recover an estimate of zero when the null hypothesis is true. In terms of hypothesis testing, then, meta-analysis and trim-and-fill seem hopeless -- given any amount of publication bias, you will conclude that there is a true effect.
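A toy simulation makes the point: generate two-group experiments under the null, publish only the significant ones, and average what survives. This is not our actual simulation code, just an illustration of extreme (100%) selection:

```python
import random
from math import sqrt

random.seed(1)
n_per_group = 50
se = sqrt(2 / n_per_group)   # approximate standard error of d

published = []
while len(published) < 100:
    d = random.gauss(0, se)  # the true effect is exactly zero
    if d / se > 1.96:        # only positive, significant results survive
        published.append(d)

naive_estimate = sum(published) / len(published)
print(round(naive_estimate, 2))  # a "medium" effect out of pure noise
```

Under this kind of censoring the naive average lands near d = 0.5, and trim-and-fill's nudge of roughly 0.05 doesn't come close to recovering the true value of zero.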

In the figure here I've plotted the average estimate from plain-vanilla random-effects meta-analysis (reMA) and the average estimate from trim-and-fill (TF). I've limited it to meta-analyses of 100 studies with no heterogeneity or p-hacking. Each facet represents a different true effect size, marked by the horizontal line. As you go from left to right, the number of studies forced to be statistically significant ranges from 0% to 60% to 90%.

As you can see, when the null is true and there is moderate publication bias, the effect is estimated as d = 0.3. Trim-and-fill nudges that down to about d = 0.25, which is still not enough to prevent a Type I error rate of roughly 100%.

Indeed, trim-and-fill tends to nudge the estimate down by about 0.05 regardless of how big the true effect or how strong the publication bias. Null, small, and medium effects will all be estimated as medium effects, and the null hypothesis will always be rejected.

Our report joins the chorus of similar simulations from Moreno et al. (2009) and Simonsohn, Nelson, and Simmons (2014) indicating that trim-and-fill just isn't up to the job.

I ask editors and peer reviewers everywhere to stop accepting trim-and-fill and fail-safe N as publication bias analyses. These two techniques are quite popular, but trim-and-fill is too weak to adjust for any serious amount of bias, and fail-safe N doesn't even tell you whether there is bias.

For what you should use, read our preprint!!

Sunday, May 14, 2017

Curiously Strong effects

The reliability of scientific knowledge can be threatened by a number of bad behaviors. The problems of p-hacking and publication bias are now well understood, but there is a third problem that has received relatively little attention. This third problem currently cannot be detected through any statistical test, and its effects on theory may be stronger than that of p-hacking.

I call this problem curiously strong effects.

The Problem of Curiously Strong

Has this ever happened to you? You come across a paper with a preposterous-sounding hypothesis and a method that sounds like it would produce only the tiniest change, if any. You skim down to the results, expecting to see a bunch of barely-significant results. But instead of p = .04, d = 0.46 [0.01, 0.91], you see p < .001, d = 2.35 [1.90, 2.80]. This unlikely effect is apparently not only real, but it is four or five times stronger than most effects in psychology, and it has a p-value that borders on impregnable. It is curiously strong.

The result is so curiously strong that it is hard to believe that the effect is actually that big. In these cases, if you are feeling uncharitable, you may begin to wonder if there hasn't been some mistake in the data analysis. Worse, you might suspect that perhaps the data have been tampered with or falsified.

Spuriously strong results can have lasting effects on future research. Naive researchers are likely to accept the results at face value, cite them uncritically, and attempt to expand upon them. Less naive researchers may still be reassured by the highly significant p-values and cite the work uncritically. Curiously strong results can enter meta-analyses, heavily influencing the mean effect size, Type I error rate, and any adjustments for publication bias.

Curiously strong results might, in this way, be more harmful than p-hacked results. With p-hacking, the results are often just barely significant, yielding the smallest effect size that is still statistically significant. Curiously strong results are much larger and have greater leverage on meta-analysis, especially when they have large sample sizes. Curiously strong results are also harder to detect and criticize: We can recognize p-hacking, and we can address it by asking authors to provide all their conditions, manipulations, and outcomes. We don't have such a contingency plan for curiously strong results.

What should be done?

My question to the community is this: What can or should be done about such implausible, curiously strong results?

This is complicated, because there are a number of viable responses and explanations for such results:

1) The effect really is that big.
2) Okay, maybe the effect is overestimated because of demand effects. But the effect is probably still real, so there's no reason to correct or retract the report.
3) Here are the data, which show that the effect is this big. You're not insinuating somebody made the data up, are you?

In general, there's no clear policy on how to handle curiously strong effects, which leaves the field poorly equipped to deal with them. Peer reviewers know to raise objections when they see p = .034, p = .048, p = .041. They don't know to raise objections when they see d = 2.1 or r = 0.83 or η2 = .88.

Nor is it clear that curiously strong effects should be a concern in peer review. One could imagine the problems that ensue when one starts rejecting papers or flinging accusations because the effects seem too large. Our minds and our journals should be open to the possibility of large effects.

The only solution I can see, barring some corroborating evidence that leads to retraction, is to try to replicate the curiously strong effect. Unfortunately, that takes time and expense, especially considering how replications are often expected to collect substantially more data than original studies. Even after the failure to replicate, one has to spend another 3 or 5 years arguing about why the effect was found in the original study but not in the replication. ("It's not like we p-hacked this initial result -- look at how good the p-value is!")

It would be nice if the whole mess could be nipped in the bud. But I'm not sure how it can.

A future without the curiously strong?

This may be naive of me, but it seems that in other sciences it is easier to criticize curiously strong effects, because the prior expectations on effects are more precise.

In physics, theory and measurement are well-developed enough that it is a relatively simple matter to say "You did not observe the speed of light to be 10 mph." But in psychology, one can still insist with a straight face that (to make up an example) subliminal luck priming led to a 2-standard-deviation improvement in health.

In the future, we may be able to approach this enviable state of physics. Richard, Bond Jr., and Stokes-Zoota (2003) gathered up 322 meta-analyses and concluded that the modal effect size in social psych is r = .21, approximately d = 0.42. (Note that even this is probably an overestimate considering publication bias.) Simmons, Nelson, and Simonsohn (2013) collected data on obvious-sounding effects to provide benchmark effect sizes. Together, these reports show that an effect of d > 2 is several times stronger than most effects in social psychology and stronger even than obvious effects like "men are taller than women (d = 1.85)" or "liberals see social equality as more important than conservatives (d = 0.69)".

By using our prior knowledge to describe what is within the bounds of psychological science, we could tell what effects need scrutiny. Even then, one is likely to need corroborating evidence to garner a correction, expression of concern, or retraction, and such evidence may be hard to find.

In the meantime, I don't know what to do when I see d = 2.50 other than to groan. Is there something that should be done about curiously strong effects, or is this just another way for me to indulge my motivated reasoning?

Wednesday, March 22, 2017

Comment on Data Colada [58]: Funnel plots, done correctly, are extremely useful

In DataColada [58], Simonsohn argues that funnel plots are not useful. The argument is, for true effect size δ and sample size n:
  • Funnel plots are based on the assumption that r(δ, n) = 0.
  • Under some potentially common circumstances, r(δ, n) != 0. 
  • When r(δ, n) != 0, there is the risk of mistaking benign funnel plot asymmetry (small-study effects) for publication bias.

I do not think that any of this is controversial. It is always challenging to determine how to interpret small-study effects. They can be caused by publication bias, or they can be caused by, as Simonsohn argues, researchers planning their sample sizes in anticipation of some large and some small true effects.

There is a simple solution to this that preserves the validity and utility of funnel plots. If your research literature is expected to contain some large and some small effects, and these are reflected by clear differences in experimental methodology and/or subject population, then analyze those separate methods and populations separately. 

For this post, I will call this making homogeneous subgroups. 

Once you have made homogeneous subgroups, r(δ, n) = 0 is not a crazy assumption at all. Indeed, it can be a more sensible assumption than r(δ, δguess) = .6.

Making homogeneous subgroups

Suppose we are interested in the efficacy of a new psychotherapeutic technique for depression and wish to meta-analyze the available literature. 

It would be silly to combine studies looking at the efficacy of this technique for reducing depression and improving IQ and reducing aggression and reducing racial bias and losing weight. These are all different effects and different hypotheses. It would be much more informative to test each of these separately.

In keeping with the longest-running cliche in meta-analysis, here's an "apples to oranges" metaphor.

For example, when we investigated the funnel plots from Anderson et al.'s (2010) meta-analysis of violent video game effects, we preserved the original authors' decision to separate studies by design (experiment, cross-section, longitudinal) and by classes of outcome (behavior, cognition, affect). When Carter & McCullough (2014) inspected the effects of ego depletion, they separated their analysis by classes of outcome.

In short, combine studies of similar methods and similar outcomes. Studies of dissimilar methods and dissimilar outcomes should probably be analyzed separately.

The bilingual advantage example

I think the de Bruin, Treccani, and Della Sala (2014) paper that serves as the post's motivating example is a little too laissez-faire about combining dissimilar studies. The hypothesis "bilingualism is good for you" seems much too broad, encompassing far too many heterogeneous studies.

Simonsohn's criticism here has less to do with a fatal flaw in funnel plots and more to do with a suboptimal application of the technique. Let's talk about why this is suboptimal and how it could have been improved.

To ask whether bilingualism improves working memory among young adults is one question. To ask whether bilingualism delays the onset of Alzheimer's disease is another. To combine the two is of questionable value. 

It would be more informative to restrict the analysis to a more limited, homogeneous hypothesis such as "bilingualism improves working memory". Even after that, it might be useful to explore different working memory tasks separately.

When r(δ, n) = 0 is reasonable

Once you have parsed the studies out into homogeneous subsamples, the assumption that r(δ, n) = 0 becomes quite reasonable. This is because:
  • Choosing homogeneous studies minimizes the variance in delta across studies.
  • Given homogeneous methods, outcomes, and populations, researchers cannot plan for variance in delta.
Let's look at each in turn.

Minimizing variance in delta

Our concern is that the true effect size δ varies from study to study -- sometimes it is large, and sometimes it is small. This variance may covary with study design and with sample size, leading to a small-study effect. Because study design is confounded with sample size, there is a risk of mistaking this for publication bias.

Partitioning into homogeneous subsets addresses this concern. As methods and populations become more similar, we reduce the variance in delta. As we reduce the variance in delta, we restrict its range, and correlations between delta and confounds will shrink, leading us towards the desirable case that r(δ, n) = 0.
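The range-restriction logic can be sketched in a few lines of toy code (my own, with an assumed lognormal error on researchers' power analyses): when sample size tracks the anticipated effect across methods, restricting attention to a narrow band of true effects leaves mostly planning noise, and the δ–SE correlation collapses.

```python
import random, statistics

random.seed(2)

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# True effects span weak to strong across methods; planned sample size tracks
# the anticipated effect, with noise (lognormal error in the power analysis).
deltas = [random.uniform(0.1, 0.9) for _ in range(2000)]
ns = [16 / d ** 2 * random.lognormvariate(0, 0.5) for d in deltas]
ses = [(2 / n) ** 0.5 for n in ns]

full = pearson(deltas, ses)

# A "homogeneous subgroup": only the studies in one narrow band of true effect.
band = [(d, s) for d, s in zip(deltas, ses) if 0.45 < d < 0.55]
restricted = pearson([d for d, _ in band], [s for _, s in band])

print(round(full, 2), round(restricted, 2))  # restricted correlation is much smaller
```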

Researchers cannot plan for the true effect size within homogeneous subgroups

Simonsohn assumes that researchers have some intuition for the true effect size -- that they are able to guess it with some accuracy such that r(δ, δguess) = .6.

True and guessed effect sizes in Data Colada 58. r = .6 is a pretty strong estimate of researcher intuition, although Simonsohn's concern still applies (albeit less so) at lower levels of intuition.

This may be a reasonable assumption when we are considering a wide array of heterogeneous studies. I can guess that the Stroop effect is large, that the contrast mapping effect is medium in size, and that the effect of elderly primes is zero.

However, once we have made homogeneous subsamples, this assumption becomes much less tenable. Can we predict when and for whom the Stroop effect is larger or smaller? Do we know under which conditions the effect of elderly primes is nonzero?

Indeed, you are probably performing a meta-analysis exactly because researchers have poor intuition for the true effect size. You want to know whether the effect is δ = 0, 0.5, or 1. You are performing moderator analyses to see if you can learn what makes the effect larger or smaller. 

Presuming you are the first to do this, it is unclear how researchers could have powered their studies accordingly. Within this homogeneous subset, nobody can predict when the effect should be large or small. To produce a correlation between sample size and effect size, researchers would need access to knowledge that does not yet exist.

Once you have made a homogeneous subgroup, r(δ, n) = 0 can be a more reasonable assumption than r(δ, δguess) = .6.

Meta-regression is just regression

Meta-analysis seems intimidating, but the funnel plot is just a regression equation. Confounds are a hazard in regression, but we still use regression because we can mitigate the hazard and the resulting information is often useful. The same is true of meta-regression.  

Because this is regression, all the old strategies apply. Can you find a third variable that explains the relationship between sample size and effect size? Moderator analyses and inspection of the funnel plots can help to look for, and test, such potential confounds.
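To make the point concrete, here is a generic Egger-style test written as plain weighted regression of effect size on standard error -- a sketch of my own, not any package's implementation:

```python
import random

random.seed(3)

def wls_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return my - b * mx, b

# A fake literature: true delta = 0.2, but only significant results get published.
ds, ses = [], []
while len(ds) < 200:
    n = random.choice([20, 50, 100, 200])   # per-group sample size
    se = (2 / n) ** 0.5                     # approximate standard error of d
    d = random.gauss(0.2, se)
    if abs(d / se) > 1.96:                  # the publication filter
        ds.append(d)
        ses.append(se)

intercept, slope = wls_line(ses, ds, [1 / s ** 2 for s in ses])
print(round(slope, 2))  # clearly positive: the small-study effect Egger's test flags
```

The intercept of this same regression is PET's bias-adjusted estimate, which is why funnel-plot diagnostics and funnel-plot adjustments are so closely related.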

I think that Simonsohn does not see this presented often in papers, and so he is under the impression that this sort of quality check is uncommon. In my experience, reviewers were quite careful to ask me to rule out confounds in my own funnel plot analyses.

That said, it's definitely possible that these steps don't make it to the published literature: perhaps they are performed internally, or shared with just the peer reviewers, or maybe studies where the funnel plot contains confounds are not interesting enough to publish. Maybe greater attention can be paid to this in our popular discourse.


Into every life some heterogeneity must fall. There is the risk that, even after these efforts, there is some confound that you mistake for publication bias. That's regression for you.

There is also the risk that, if you get carried away chasing after perfectly homogeneous subgroups, you may find yourself conducting a billion analyses of only one or two studies each. This is not helpful either, for obvious reasons.

Simonsohn is concerned that we can never truly reach such homogeneity that r(δ, n) = 0 is true. This seems possible, but it is hard to say without access to 1) the true effect sizes and 2) the actual power analyses of researchers. I think that we can at least reach the point where we have exhausted researchers' ability to plan for larger vs. smaller effects.


The funnel plot represents the relationship between effect size δ and the sample size n. These may be correlated because of publication bias, or they may be correlated because of genuine differences in δ that have been planned for in power analysis. By conditioning your analysis on homogeneous subsets, you reduce variance in δ and the potential influence of power analysis.

My favorite video game is The Legend of Zelda: Plot of the Funnel

Within homogeneous subsets, researchers do not know when the effect is larger vs. smaller, and so cannot plan their sample sizes accordingly. Under these conditions, the assumption that r(δ, n) = 0 can be quite reasonable, and perhaps more reasonable than the assumption that r(δ, δguess) = .6.

Applied judiciously, funnel plots can be valid, informative, expressive, and useful. They encourage attention to effect size, reveal outliers, and demonstrate small-study effects that can often be attributed to publication bias.



I also disagree with Simonsohn that "It should be considered malpractice to publish papers with PET-PEESE." Simonsohn is generally soft-spoken, so I was a bit surprised to see such a stern admonishment.

PET and PEESE are definitely imperfect, and their weaknesses are well-documented: PET is biased downwards when δ != 0, and PEESE is biased upwards when δ = 0. This sucks if you want to know whether δ = 0. 

Still, I think PEESE has some promise: assuming there is an effect, how big is it likely to be? Yes, these methods depend heavily on the funnel plot, assuming that any small-study effect is attributable to publication bias, but again, this can be a reasonable assumption under the right conditions. Some simulations I'm working on with Felix Schönbrodt, Evan Carter, and Will Gervais indicate that it's at least no worse than trim-and-fill (low bar, I know).
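For the curious, here is a bare-bones PET-PEESE sketch of my own devising: PET regresses d on the standard error, PEESE regresses d on the squared standard error, and the Stanley & Doucouliagos conditional rule (a one-tailed test on the PET intercept) is simplified here to the intercept's sign.

```python
import random

random.seed(5)

def wls_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return my - b * mx, b

# A publication-biased literature with true delta = 0.3.
ds, ses = [], []
while len(ds) < 300:
    n = random.choice([20, 50, 100, 200])
    se = (2 / n) ** 0.5
    d = random.gauss(0.3, se)
    if abs(d / se) > 1.96:
        ds.append(d)
        ses.append(se)

w = [1 / s ** 2 for s in ses]
naive = sum(wi * di for wi, di in zip(w, ds)) / sum(w)      # fixed-effect estimate
pet, _ = wls_line(ses, ds, w)                               # PET: d on SE
peese, _ = wls_line([s ** 2 for s in ses], ds, w)           # PEESE: d on SE^2
# Simplified conditional rule: use PEESE when the PET intercept is positive.
estimate = peese if pet > 0 else pet
print(round(naive, 2), round(estimate, 2))  # naive overshoots 0.3; PET-PEESE adjusts downward
```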

Of course, no one technique is perfect. I would recommend using these methods in concert with other analyses such as the Hedges & Vevea 3-parameter selection model or, sure, p-curve or p-uniform.

Monday, February 27, 2017

Publication bias can hide your moderators

It is a common goal of meta-analysis to provide not only an overall average effect size, but also to test for moderators that cause the effect size to become larger or smaller. For example, researchers who study the effects of violent media would like to know who is most at risk for adverse effects. Researchers who study psychotherapy would like to recommend a particular therapy as being most helpful.

However, meta-analysis does not often generate these insights. For example, research has not found that violent-media effects are larger for children than for adults (Anderson et al. 2010). Similarly, it is often reported that all therapies are roughly equally effective (the "dodo bird verdict," Luborsky, Singer, & Luborsky, 1975; Wampold et al., 1997).

"Everybody has won, and all must have prizes. At least, that's what it looks like if you only look at what got published."

It seems to me that publication bias may obscure such patterns of moderation. Publication bias introduces a “small-study effect” in which the observed effect size is highly dependent on the sample size. Large-sample studies can reach statistical significance with smaller effect sizes. Small-sample studies can only reach statistical significance by reporting enormous effect sizes. The observed effect sizes gathered in meta-analysis, therefore, may be more a function of the sample size than they are a function of theoretically-important moderators such as age group or treatment type.

In this simulation, I compare the statistical power of meta-analysis to detect moderators when there is, or when there is not, publication bias.


Simulations cover 4 scenarios in a 2 (Effects: large or medium) × 2 (Pub bias: absent or present) design.

When effect sizes were large, the true effects were δ = 0 in the first population, δ = 0.3 in the second population, and δ = 0.6 in the third population. When effect sizes were medium, the true effects were δ = 0 in the first population, δ = 0.2 in the second population, and δ = 0.4 in the third population. Thus, each scenario represents one group with no effect, a group with a medium-small effect, and a group with an effect twice as large.

When studies were simulated without publication bias, twenty studies were conducted on each population, and all were reported. When studies were simulated with publication bias, studies were published or file-drawered such that at least 70% of the published effects were statistically significant; when a result was file-drawered, further studies were simulated until 20 published results were obtained. This keeps the number of studies k constant at 20, which prevents confounding the influence of publication bias with the influence of fewer observed studies.
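The censoring step can be sketched like this (a stylized re-implementation, not the original simulation code; the 10% file-drawer escape rate for nonsignificant results is an illustrative choice that happens to yield roughly 70% significant published effects):

```python
import random

random.seed(4)

def publish_biased(delta, n, k=20, escape=0.10):
    """Simulate studies until k are published. Significant results always
    publish; nonsignificant ones escape the file drawer 10% of the time
    (an illustrative rate, chosen so most published effects are significant)."""
    published = []
    while len(published) < k:
        se = (2 / n) ** 0.5               # approximate SE of d
        d = random.gauss(delta, se)       # observed effect
        significant = abs(d / se) > 1.96
        if significant or random.random() < escape:
            published.append((d, significant))
    return published

# 200 biased meta-analyses of a true delta = 0.3 effect, n = 30 per group.
metas = [publish_biased(0.3, 30) for _ in range(200)]
pub_d = [d for meta in metas for d, _ in meta]
sig_rate = sum(1 for meta in metas for _, s in meta if s) / len(pub_d)
print(round(sum(pub_d) / len(pub_d), 2), round(sig_rate, 2))
# the published mean overshoots the true 0.3, and most published effects are significant
```

With these settings, the published literature averages around d = .5 for a true δ = 0.3, the same overestimation pattern as in the results reported here.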

For each condition, I report the observed effect size for each group, the statistical power of the test for moderators, and the statistical power of the Egger test for publication bias. I simulated 500 meta-analyses within each condition in order to obtain stable estimates.


Large effects.

Without publication bias: 
  • In 100% of the metas, the difference between δ = 0 and δ = 0.6 was detected.
  • In 92% of the metas, the difference between δ = 0 and δ = 0.3 was detected. 
  • In only 4.2% of cases was the δ = 0 group mistaken as having a significant effect.
  • Effect sizes within each group were accurately estimated (in the long run) as δ = 0, 0.3, and 0.6.

With publication bias: 
  • Only 15% of the metas were able to tell the difference between δ = 0 and δ = 0.3.
  • 91% of meta-analyses were able to tell the difference between δ = 0 and δ = 0.6. 
  • 100% of the metas mistook the δ = 0 group as having a significant effect.  
  • Effect sizes within each group were overestimated: d = .45, .58, and .73 instead of 0, 0.3, and 0.6.  
Here's a plot of the moderator parameters across the 500 simulations without bias (bottom) and with bias (top).
Moderator values are dramatically underestimated in the context of publication bias.

Medium effects.  

Without publication bias:
  • 99% of metas detected the difference between δ = 0 and δ = 0.4. 
  • 60% of metas detected the difference between δ = 0 and δ = 0.2. 
  • The Type I error rate in the δ = 0 group was 5.6%. 
  • In the long run, effect sizes within each group were accurately recovered as d = 0, 0.2, and 0.4.

With publication bias:
  • Only 35% were able to detect the difference between δ = 0 and δ = 0.4.
  • Only 2.2% of the meta-analyses were able to detect the difference between δ = 0 and δ = 0.2.
  • 100% of meta-analyses mistook the δ = 0 group as reflecting a significant effect. 
  • Effect sizes within each group were overestimated: d = .46, .53, and .62 instead of δ = 0, 0.2, and 0.4.
Here's a plot of the moderator parameters across the 500 simulations without bias (bottom) and with bias (top).
Again, pub bias causes parameter estimates of the moderator to be biased downwards.


Publication bias can hurt statistical power for your moderators.  Obvious differences such as that between d = 0 and d = 0.6 may retain decent power, but power will fall dramatically for more modest differences such as that between d = 0 and d = 0.4. Meta-regression may be stymied by publication bias.

Monday, February 13, 2017

Why retractions are so slow

A few months ago, I had the opportunity to attend a symposium on research integrity. The timing was interesting because, on the same day, Retraction Watch ran a story on two retractions in my research area, the effects of violent media. Although one of these retractions had been quite swift, the other retraction had been three years in coming, which was a major source of heartache and frustration among all parties involved.

Insofar as some of us are concerned about the possible role of fraud as a contaminating influence in the scientific literature, I thought it might be helpful to share what I learned at the symposium. This concerns the multiple steps and stakeholders in a retraction, which may partly explain common frustrations about the process's opacity and slowness.

The Process

On paper, the process for handling concerns about a paper looks something like this:
  1. Somebody points out the concerns about the legitimacy of an article.
  2. The journal posts an expression of concern, summarizing the issues with the article.
  3. If misconduct is suspected, the university investigates for possible malfeasance.
  4. If malfeasance is discovered, the article is retracted.
We can see that an expression of concern can be posted quickly, whereas a retraction can take years of investigation. Because investigations cannot be rushed, scientific self-correction can be expected to be slow. The exception is that, when the authors voluntarily withdraw an article in response to concerns, a retraction no longer requires an investigation.

Multiple stakeholders in investigations

Regarding investigations, it is not always clear what is being done or how seriously concerns are being addressed. In the Retraction Watch story at the top of the article, the complainants spent about three years waiting for action on a data set with signs of tampering.

From the perspective of a scientist, one might wish for a system of retractions that acts swiftly and transparently. Through swiftness, the influence of fraudulent papers might be minimized, and through transparency, one might be apprised of the status of each concern.

Despite these goals, the accused must be presumed innocent until found guilty and so retains certain rights and protections. Because an ongoing investigation can harm one's reputation and career, oversight committees will not comment on the status, or even the existence, of an investigation.

Even when the accused is indeed guilty, they may recruit lawyers to apply legal pressure to universities, journals, or whistleblowers to avoid the career damage of a retraction. This can further complicate and frustrate scientific self-correction.

Should internal investigation really be necessary?

From a researcher's perspective, it's a shame that retraction seems to require a misconduct investigation. Such investigations are time-consuming, and it is difficult to prove intent absent some confession -- this may be why Diederik Stapel has 58 retractions, but only three of eight suspicious Jens Förster papers have been retracted.

Additionally, I'm not sure that a misconduct investigation is strictly necessary to find a paper worthy of retraction. When a paper's conclusions do not follow from the data, or the data are clearly mistaken, a speedy retraction would be nice.

Sometimes we are fortunate enough to see papers voluntarily withdrawn without a full-fledged investigation. Often this is possible only when there is some escape valve for blame: There is some honest mistake that can be offered up, or some collaborator can be offered as blameworthy. For example, this retraction could be lodged quickly because the data manipulation was performed by an unnamed graduate student. Imagine a different case where the PI was at fault -- it would have required years of investigation.


Whistleblowers are often upset that clearly suspicious papers are sometimes labeled only with an expression of concern. These frustrations are exacerbated by the opacity of investigations, in that it is often unclear whether there is an investigation at all, much less what progress has been made in the investigation.

Personally, I hope that journals will make effective use of expressions of concern as appropriate. I also appreciate the efforts of honest authors to voluntarily withdraw papers, as this allows for much faster self-correction than would be possible if a university investigation were necessary.

Unfortunately, detection of malfeasance will remain time-consuming and imperfect. Retraction is quick only when authors are either (1) honest and cooperative, issuing a voluntary withdrawal or (2) dishonest but with a guilty conscience, confessing quickly under scrutiny. However, science still has few tools against sophisticated and tenacious frauds with hefty legal war chests.