Comments on Crystal Prison Zone: "Curiously Strong effects" (16 comments, newest first)

John H Noble Jr (2017-05-20 08:12)
An interesting problem, but one that could be eliminated by tightening standards for publication. The sample size of submitted manuscripts should be large enough to support a split-half test of the hypothesis. If the finding in the first half is replicated in the second, then the manuscript, if otherwise acceptable, can be published. This will increase the cost of conducting research and likely delay the publication of findings. But what do we really have to lose? Replicability and replication are at the heart of empirical science. We face a replication crisis that is eroding public trust in the scientific enterprise. See: Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021. doi:10.1038/s41562-016-0021

Anonymous (2017-05-16 14:57)
Interesting post. I agree with those comments pointing the finger at study design, and would only add that failing to take into account the underlying structure of the data can generate unbelievably high effect sizes. Yule pointed this out almost a hundred years ago with regard to nonsense correlations in time series (https://www.jstor.org/stable/2341482?seq=1#page_scan_tab_contents). Dan Hruschka and I have a commentary coming out soon on a study that looked at the effect of the Ebola outbreak on voting intentions using time-series data without accounting for the autocorrelation (i.e., non-independence) of data points. This, combined with the fact that the original paper used smoothed data instead of raw data, greatly inflated the estimates of effect size. (http://leotiokhin.com/assets/uploads/Tiokhin_Hruschka_Ebola_Elections_2016_Merged.pdf)
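To make the time-series point concrete, here is a minimal simulation sketch (my own illustration, not the analysis from the commentary linked above; the AR(1) coefficient, smoothing window, and series length are arbitrary choices): two independent autocorrelated series regularly produce large spurious correlations, and smoothing them first makes the problem worse.

```python
import numpy as np

rng = np.random.default_rng(42)

def ar1(n, phi):
    """Generate an AR(1) series with lag-1 autocorrelation phi."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

def abs_corr(n=100, phi=0.0, smooth=1):
    x, y = ar1(n, phi), ar1(n, phi)      # independent by construction
    if smooth > 1:                       # simple moving-average smoothing
        k = np.ones(smooth) / smooth
        x = np.convolve(x, k, mode="valid")
        y = np.convolve(y, k, mode="valid")
    return abs(np.corrcoef(x, y)[0, 1])

for label, phi, smooth in [("white noise", 0.0, 1),
                           ("AR(1), phi = .8", 0.8, 1),
                           ("AR(1), phi = .8, smoothed", 0.8, 10)]:
    rs = np.array([abs_corr(phi=phi, smooth=smooth) for _ in range(2000)])
    print(f"{label:28s} share with |r| > .3: {np.mean(rs > 0.3):.2f}")
# The unrelated white-noise series almost never reach |r| > .3; the
# autocorrelated and smoothed series do so routinely, despite a true
# correlation of zero.
```

The particular numbers don't matter; the point is that serial dependence shrinks the effective sample size, so standard effect-size estimates and p-values become badly overconfident.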
Unknown (2017-05-16 11:17)
This is a great situation where Simonsohn's "small telescopes" approach (http://datacolada.org/wp-content/uploads/2015/05/Small-Telescopes-Published.pdf) is useful: you can adjust your sample size so that you can draw conclusions about whether the original study was adequately powered to detect any effect that could be there.
-Simine

Geoff Hammond (2017-05-15 18:41)
Surely, replication is the only way for science to proceed, irrespective of the size of the effect. There are three things we (should) want to know: (1) is the finding reliable (i.e., can it be replicated in an independent sample of subjects)? (2) how big is it? and (3) how general is it? P values don't address these questions; only doing the hard work does.

Anonymous (2017-05-15 15:12)
It might be constructive to transform the effect sizes into the r metric, where we have more benchmarks. A d of 2.5 is equivalent to a correlation of .78. That's a correlation that exceeds the average reliability of most measures used in the social sciences (.75), and is the kind of correlation used to argue that the IV and the DV are the same thing. That might make a more compelling argument.
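For anyone who wants to check the .78 figure, the conversion commonly used in the meta-analysis literature for a two-group design is r = d / sqrt(d^2 + a), with a = 4 for equal group sizes (and a = (n1 + n2)^2 / (n1 * n2) otherwise). A two-line sketch:

```python
import math

def d_to_r(d, a=4.0):
    # a = 4 assumes equal group sizes; otherwise a = (n1 + n2)**2 / (n1 * n2)
    return d / math.sqrt(d**2 + a)

print(round(d_to_r(2.5), 2))  # 0.78, the figure cited above
```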
Mark Rubin (2017-05-15 14:57)
To address "maybe the effect is overestimated because of demand effects," you could see if the curiously strong effect correlates with the Perceived Awareness of the Research Hypothesis scale: https://osf.io/preprints/psyarxiv/m2jgn/

thom (2017-05-15 14:32)
This is a really important point to keep emphasising to the research community.

There are some things worth checking if you see large d or r values. My initial thought on seeing such large standardised effects (usually a point estimate) is that the sample size is small and a large estimate was therefore all but inevitable. My second thought is to wonder how they have computed d or r: it is quite easy to use an incorrect conversion formula or some other approach that inflates d (e.g., computing d from t with a within-subjects design). Third, there are artefacts of the design that distort d or r and, combined with the choice of computational approach, produce large values. Ceiling and floor effects are an example, as they can dramatically shrink the sample variance; so can ecological correlations. Finally, there are study designs that produce large effects by increasing the strength of a manipulation, such as extreme-groups designs.

In most cases it really helps to get a measure of effect size that is unstandardised (or, equivalently, a raw data plot) to give more context.

Joe (2017-05-15 13:29)
Fair point!

Professor Keith R Laws (2017-05-15 12:36)
Then it's no longer a 'curiously large effect' to worry about.

Joe (2017-05-15 12:25)
It sounds like you and Dr. LeBel both see value in a replication attempt. The thought had occurred to me, but it seems a shame to do all the set-up only to collect N = 10. I'm also concerned about the possibility of a fighting retreat: "Well, no, so it's not d = 2.5, but maybe it's d = 0.5, which your replication wouldn't detect." Perhaps I'm overthinking it and a little replication would go a long way.

Professor Keith R Laws (2017-05-15 12:20)
I guess you could calculate the sample size needed for a 'curiously large' effect. It would be wonderfully small: if d = 2.5, then you would probably need only 5 per group (to get 95% power). Why not then get multiple labs to replicate using variations around these sample sizes, and Bob's your uncle: see how many replicate... then pool them all together in a meta-analysis... add a few moderator variables... ready for publication.
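Keith's back-of-the-envelope figure is easy to check with a standard power calculation for an independent-groups t-test; here is a quick sketch using the noncentral t distribution (two-sided alpha = .05 assumed):

```python
import numpy as np
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided, independent-groups t-test at effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return 1 - stats.nct.cdf(t_crit, df, ncp)   # lower rejection tail is negligible

for n in range(3, 8):
    print(f"n = {n} per group: power = {power_two_sample_t(2.5, n):.3f}")
# Power crosses ~.95 at roughly 5-6 per group, so the ballpark above holds.
```

statsmodels' TTestIndPower().solve_power should give essentially the same answer.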
Daniel Ozer (2017-05-15 10:48)
One might at least expect authors to point out their curiously strong effects and offer their thoughts on the matter.

Anonymous (2017-05-15 10:33)
Good post. BTW, a curiously strong effect (spotted by Richard Morey) played a role in a recent Psych Science retraction. Also, as others have noted, when power is low an effect will be significant only if it exaggerates the true effect size. Be deeply skeptical of low-powered, single-experiment studies with surprising results.
Steve

Etienne LeBel (2017-05-15 10:05)
Interesting and valuable blog post! A few quick comments on specific points:

>>> The only solution I can see, barring some corroborating evidence that leads to retraction, is to try to replicate the curiously strong effect. Unfortunately, that takes time and expense, especially considering how replications are often expected to collect substantially more data than original studies.

Yes! Independent corroboration via strong falsification attempts (i.e., replicability tests) **IS** the only foolproof way to increase one's confidence in a curiously strong, or any, published effect. Falsifiability is simply **NOT** optional for scientific progress to be possible: https://osf.io/preprints/psyarxiv/dv94b/

>>> Even after the failure to replicate, one has to spend another 3 or 5 years arguing about why the effect was found in the original study but not in the replication.

No, this is not necessary. Our job isn't to figure out **why** certain published findings are false (there could be a thousand different reasons why the original researchers got it wrong). Our job is to better understand reality by building upon each other's findings, which are assumed to be (in principle) replicable under the specified conditions. If independent labs, in good faith, cannot demonstrate to themselves that a published effect is replicable, then they must simply move on to investigating other effects that are indeed replicable, in the hope of increasing our understanding of the world.

Nick Brown (2017-05-15 09:52)
Sadly, I suspect that fabrication will very often turn out to be the most parsimonious explanation. Fraudulent researchers tend to be rather incompetent, including in their understanding of what a reasonable effect size might be.

Anonymous (2017-05-15 07:29)
Great post. This is one advantage that I see of results-blind reviewing. In my experience, many of these curiously strong effects happen with very small samples. If we were to evaluate the study without knowing the results, we would likely say it had too little power/precision to produce meaningful results. Focusing on the design takes the pressure off of having to address what produced those results.
Of course, this only applies to cases where we'd be likely to think the design is weak.
-Simine
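Steve's earlier point, that a low-powered study can only yield a significant result by overestimating the effect, and Simine's observation that these effects tend to come from very small samples, are easy to see in a simulation. A minimal sketch (the true effect size, group size, and number of runs are arbitrary illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n, sims = 0.3, 10, 20_000        # a modest true effect, tiny groups

sig_d = []
for _ in range(sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    res = stats.ttest_ind(b, a)
    if res.pvalue < 0.05:                # keep only the "publishable" runs
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        sig_d.append((b.mean() - a.mean()) / pooled_sd)

print("power:", round(len(sig_d) / sims, 2))                   # around .10
print("mean significant d:", round(float(np.mean(sig_d)), 2))  # around 1.0
# The estimates that survive the significance filter average more than three
# times the true effect, which is why a low-powered but significant result
# should not be taken at face value.
```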