
Replication Crisis in Psychology: Part Two


See Part 1 here.

On the questions: How should you evaluate whether a replication is successful? And how many replications should we expect to be successful?

This one goes far into the weeds, and I’ll climb back out of them after, but I really wanted to understand and evaluate the Technical Comment’s claims.

Gilbert and colleagues go on to claim that because you expect some replications to fail both by chance and because of these infidelities, the replication rate in the RPP doesn’t mean that the proportion of those findings that were real is nearly that low.  They then try to estimate how many replications we should expect to fail by chance.

Here’s how: a previous project, Many Labs 1, chose 15 findings in psychology and replicated them many times across 36 labs (so about 36 times each).  Many Labs, like RPP, allowed deviations from the original study protocol.  If you take every possible pair of studies in Many Labs, you can ask, did one study replicate the other?  Although RPP used five different measures of replication, Gilbert and his coauthors focus on one: was the original study in the confidence interval of the replication?  They argue that since most of the findings in Many Labs (13/15) successfully replicated, the Many Labs studies represent a world in which most effects are true.  So picking single pairs of studies simulates what the RPP did, in that it took one study and replicated it once.  If they count how many replications fail across all the pairs of Many Labs studies, they argue, that’s a good estimate of how many replications should fail due to chance + differences between the original and replication study.  They try to use these data both to answer “what percent of replications should fail?” and to answer  “how many replications in Many Labs would have failed if they’d used RPP’s confidence interval method for assessing replication?”  
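The pairing scheme is easy to simulate. Here’s a minimal sketch in Python (not their R code; the true effect, the per-lab sample size, and the toy standard-error model are all invented for illustration): generate one noisy estimate of the same real effect per lab, then treat every ordered pair as a one-shot replication and ask whether the “original” estimate lands inside the “replication” study’s 95% confidence interval.

```python
import math
import random

random.seed(0)

TRUE_EFFECT = 0.4   # hypothetical true standardized effect
N_LABS = 36         # Many Labs ran each study in about 36 labs
N_PER_LAB = 100     # hypothetical per-lab sample size

# Each lab's estimate: the true effect plus sampling noise.
# In this toy model the standard error shrinks with sqrt(n).
se = 1.0 / math.sqrt(N_PER_LAB)
estimates = [random.gauss(TRUE_EFFECT, se) for _ in range(N_LABS)]

def ci95(est, se):
    """95% confidence interval around a point estimate."""
    return (est - 1.96 * se, est + 1.96 * se)

# Treat every ordered pair (original, replication) as a one-shot
# replication: did the "original" estimate land inside the
# "replication" study's confidence interval?
pairs = 0
successes = 0
for i, orig in enumerate(estimates):
    for j, rep in enumerate(estimates):
        if i == j:
            continue
        lo, hi = ci95(rep, se)
        pairs += 1
        if lo <= orig <= hi:
            successes += 1

print(f"{successes}/{pairs} pairs 'replicate' ({successes / pairs:.0%}) "
      f"even though every effect is real")
```

Even with every effect real and every lab measuring the same thing, a noticeable fraction of pairs “fail,” which is the core of the Technical Comment’s argument.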

They found that when they constructed all the possible pairs of studies, in 66% of pairs one study’s estimate fell inside the confidence interval of the other.  They argue that this is a good estimate of how many replications we should expect to be successful in an environment where we’re testing real effects.  They also find that only 34% of the Many Labs replications had confidence intervals that contained the original published results.  If Many Labs had used RPP’s methods, all those successful replications would have looked like failures.  In their appendix they make the very strong claim that if the RPP “had analyzed only the high fidelity replications that were endorsed by the original authors, the percent of successful replications would have been statistically indistinguishable from 100%.”
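The mechanism behind that 34% is easy to see with a toy calculation (all numbers invented): as a replication’s sample grows, its confidence interval narrows, so even a modest gap between the original and replication estimates gets scored as a “failure” under the is-the-original-inside-the-replication’s-CI criterion.

```python
import math

ORIGINAL_EFFECT = 0.50     # hypothetical published effect size
REPLICATION_EFFECT = 0.40  # hypothetical, slightly smaller replication estimate

def contains_original(n_replication):
    """Does the replication's 95% CI contain the original point estimate?"""
    se = 1.0 / math.sqrt(n_replication)  # toy standard-error model
    half_width = 1.96 * se
    lo = REPLICATION_EFFECT - half_width
    hi = REPLICATION_EFFECT + half_width
    return lo <= ORIGINAL_EFFECT <= hi

# A single modest lab vs. samples the size of ~36 pooled labs.
for n in (50, 400, 3600):
    print(n, contains_original(n))
```

With n = 50 the interval is wide enough to cover the original; by the time the pooled sample reaches the thousands, the same 0.10 gap in effect size counts as a failure.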

This argument has some flaws.  Several are pointed out by the OSF authors’ “Response to the Comment”:

  • If you take all possible pairs of studies from Many Labs, roughly half of the “failures” occur because the effect in the “replication” is bigger than in the original.  This would also count as a “failure” in the RPP, if the replication effect was sufficiently larger (and the replication sample was big enough to have a narrow confidence interval).  But only 5% of the replications in the RPP have confidence intervals that fail to include the original effect because the replication effect was larger.  In other words, about 10% of the RPP’s replication failures were failures for this reason, compared to about 50% in the reanalysis of Many Labs.

Here are figures from the original RPP paper showing the distribution of p values (left) and effect sizes (right) in the original and replication studies.  Looking at these figures, it is hard to conclude that they don’t represent a systematic downward shift in effect sizes, a bigger shift than in Many Labs, even though Many Labs also allowed infidelities.



  • The Many Labs studies were chosen a different way: they were an “ad hoc sample of new and classic effects.”  The RPP studies attempted to be more representative.  It is probably fair to say that the RPP didn’t do that well enough, and in particular, if you’re going to say that your paper is “estimating the reproducibility of psychological science,” it would be a good idea to have a better definition of what psychological science is.  The RPP chose high-profile psychology journals, but it’s very possible that papers in higher-profile journals are *less* replicable.  For instance, one paper found that in genetics studies, journals with higher impact factors (a measure of prestige) tended to overestimate effect sizes more.
  • The bigger effects tended to vary more between labs in Many Labs, so pointing out that pairs of studies had different effect sizes in a replication project where most effects were true and some were very large does not generalize to a replication project in which effects may have been small, or may not really exist at all.

And I will also add:

  • Gilbert et al. say that they just happened to choose the confidence interval metric (when there were others reported in the RPP), and that any other metric gave roughly similar results.  Because I found their description of what they did very confusing, possibly just because of some idiosyncratic fogginess of my own, I wound up looking at their R code in order to write the preceding paragraphs.  And since I was already looking at the R code, I wanted to know how many of the individual Many Labs studies had confidence intervals that did not include zero.  This is very closely related to p < .05, which was another of the replication metrics in the RPP.  I get 388/574 total, or 67%.  This, of course, looks a lot like the replication rate you get from the method Gilbert and colleagues used, picking pairs of studies and seeing whether one falls outside the confidence interval of the other.  So in one sense they’re right: this other metric gives very similar results.  But if you exclude the two original Many Labs studies that didn’t wind up replicating even with a very large sample and extremely high power to detect an effect (in other words, the effects we now have no reason to think are real), you get 386/502, or 77%.  I’m not really trying to claim that’s the number the RPP should have found if all the effects were real; there are so many interpretive problems that you really just don’t know.  I just want to point out that if you’re using Many Labs as a model for what single-sample replications of real effects look like, there are ways to get a ceiling higher than 66%.
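As a quick arithmetic check on those counts (the per-study confidence intervals themselves come from the Many Labs data and the Technical Comment’s R code, which aren’t reproduced here):

```python
# Many Labs replications whose 95% CIs exclude zero, which tracks
# p < .05 for the individual studies (counts as reported above).
all_studies = (388, 574)         # including every Many Labs effect
excl_nonreplicated = (386, 502)  # dropping the two effects that failed
                                 # even with very large pooled samples

for hits, total in (all_studies, excl_nonreplicated):
    print(f"{hits}/{total} = {hits / total:.1%}")
```

Dropping just two effects removes 72 individual study results, which is what moves the ceiling from roughly two-thirds to about 77%.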

I think it’s unfair to characterize the RPP as containing statistical “errors” or as having been “overturned” in any sense. Because you really don’t know how many replications to expect, the RPP is fundamentally a work of descriptive, not inferential, statistics. In other words, there are two kinds of numbers you could present in a psychology paper: one kind consists of variations on counting — how many replications “worked” by these various standards? — and otherwise describing the quantitative characteristics of your sample.  Those are descriptive statistics.  Inferential statistics try to characterize the degree of error in the sample characteristics and use that to make stronger inferences about the larger population.

But we don’t have strong theory to characterize the error here. We know that publication bias is distorting the effect sizes, but we don’t know exactly how.  It’s possible that infidelities to the original study protocols are causing replication failures, but we don’t know exactly how that’s working, either.  Maybe a large-scale replication project like this, where each study gets replicated only once, by studies of similar size that are also insufficiently powered to detect effects, was destined to be underinformative relative to the effort.  Uri Simonsohn argues that a replication only truly fails if the data argue in favor of a zero effect (“accepting the null” in the language of hypothesis testing).  There are several ways to try to make the case that the data favor a zero effect.  Simonsohn compares two.  One is to ask: was the original study big enough to have detected the effect found in a preregistered replication?  That’s Simonsohn’s own “Small Telescopes” method.  Another is to ask: is the size of the replication effect a lot smaller than we’d expect, given the size of the original effect?  Simonsohn compares them both, and finds that by either method about 30% of the replication attempts were inconclusive — you don’t have enough information to say either that an effect is zero or that it’s different from zero.  He would have recommended a rewrite of the introduction to the RPP paper: “[…] the Open Science Collaboration observed that the original result was replicated in ~40 of 100 studies sampled, failed to replicate in ~30, and that the remaining ~30 replications were inconclusive.”
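Here is a rough sketch of the Small Telescopes logic, assuming a two-sample design and using a normal approximation in place of the t-based calculations in Simonsohn’s paper (the function names and the sample sizes at the bottom are mine, not his): first find d33, the effect size the original study had only 33% power to detect, then ask whether the replication’s estimate is significantly smaller than even that.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_sample(d, n_per_cell):
    """Approximate power to detect effect size d at two-sided alpha = .05
    (normal approximation; the negligible lower tail is ignored)."""
    z_crit = 1.959964
    return 1.0 - phi(z_crit - d * math.sqrt(n_per_cell / 2.0))

def d33(n_per_cell):
    """Effect size the original study had 33% power to detect,
    found by bisection on the (monotone) power curve."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if power_two_sample(mid, n_per_cell) < 1.0 / 3.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def small_telescopes_fails(d_rep, n_rep_per_cell, n_orig_per_cell):
    """One-sided test at alpha = .05: is the replication effect
    significantly smaller than the original's d33 benchmark?"""
    benchmark = d33(n_orig_per_cell)
    se_rep = math.sqrt(2.0 / n_rep_per_cell)
    z = (benchmark - d_rep) / se_rep
    return z > 1.645

# Hypothetical example: original ran 20 per cell, replication ran 100
# per cell and estimated d = 0.05.
print(d33(20))                                  # benchmark effect size
print(small_telescopes_fails(0.05, 100, 20))
```

Note the third verdict this framework allows: if the replication estimate is neither significantly different from zero nor significantly below d33, the replication is inconclusive rather than a success or a failure, which is where Simonsohn’s ~30% comes from.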

But that doesn’t mean it’s informative to make up an estimate of how many successful replications to expect using the Many Labs data, claiming it’s better than nothing, when really, maybe it isn’t.  

Whether you look over the numbers in the RPP and think they look good or bad depends a lot on both what your prior expectations were and what you want psychology to be (more on this last point in later posts).  Readers will recall that I saw them and didn’t think they looked that bad.  Many fewer than half of the effects flipped signs. The paper finds a correlation of .51 between the original effects and the replication effects — that’s an indication of a lot of noise, but also substantial signal.  The striations in the reliability of the effects accorded with my prior intuitions about which subfields of psychology would produce more reliable effects (cognitive studies had higher replication rates than social ones), but social psychology still looked to be producing real signal. If you are shocked at this point that psychology papers systematically overestimate effect sizes (or you’re not shocked but think that’s really bad), then the RPP should make you shake your darn head; it at the very least describes a big slump in effect sizes from the original to the replication (unless you want to attribute that all to infidelities).  But it also provides some evidence that psychology is making some progress.

