I have tried to be, well, not completist, but quite thorough in my reading about the latest iterations of the replication controversy. It’s been fairly overwhelming, but I’ve certainly learned a lot. So the coming post has six parts that address several issues.
- What is a replication? (This turns out to be controversial.)
- How should you evaluate whether a replication is successful?
- About how many replications should we expect to be successful?
These questions are all raised by the back and forth between some of the authors of the Science paper “Estimating the Reproducibility of Psychological Science,” which I blogged about back in August (a shorthand for this project is Reproducibility Project: Psychology, or RPP), and Daniel Gilbert and coauthors’ “Technical Comment” in response, which takes issue with a lot of the original paper’s conclusions and claims to find “statistical errors.”
Some broader questions are:
- Why do replications fail? How can scientists avoid misleading themselves?
- How should scientists respond to failed replications? What kinds of responses contribute to a progressive science?
- How should a lab regard its own “failures”?
- Is there a replication “crisis” in psychology? What do we want psychology (and other sciences) to be?
So, on the first question: What is a replication study?
If you’ll recall, the Open Science Collaboration (the OSC) chose 100 studies from three high-profile journals to try to replicate, once each. For each study, they looked at five different measures of replication success. The most widely reported result was that 47% of original effects fell within the 95% confidence interval of the replication study. (A 95% confidence interval means: if you repeated the same study procedure 100 times, about 95 of the resulting intervals should contain the true value.)
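One wrinkle worth keeping in mind about this criterion: even if every replication were perfectly faithful and every original effect were real, we shouldn’t expect a 95% “capture rate,” because the original estimate is noisy too. A minimal simulation sketches this, under simplifying assumptions I’m supplying (normally distributed estimates with known standard deviation, equal sample sizes; the effect size, n, and sigma are illustrative, not from the RPP):

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect, sigma, n = 0.5, 1.0, 50  # illustrative values, not RPP data
trials = 100_000

# Original and (exact) replication each estimate the same true effect
# from n observations, so both estimates carry sampling noise.
orig = rng.normal(true_effect, sigma / np.sqrt(n), trials)
rep = rng.normal(true_effect, sigma / np.sqrt(n), trials)

# Did the original estimate land inside the replication's 95% CI?
half_width = 1.96 * sigma / np.sqrt(n)
captured = np.abs(orig - rep) < half_width
print(round(captured.mean(), 2))  # ~0.83, noticeably below 0.95
```

Because the difference of the two estimates has variance 2σ²/n, the expected capture rate here works out to about 83%, not 95%. That doesn’t rescue a 47% rate, but it does show the benchmark for “success” under this measure is lower than the CI’s nominal coverage.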
Gilbert and coauthors claim that OSC’s replication studies are insufficiently faithful to the originals:
> For example, many of OSC’s replication studies drew their samples from different populations than the original studies did. An original study that measured Americans’ attitudes toward African Americans (3) was replicated with Italians, who do not share the same stereotypes; an original study that asked college students to imagine being called on by a professor (4) was replicated with participants who had never been to college; and an original study that asked students who commute to school to choose between apartments that were short and long drives from campus (5) was replicated with students who do not commute to school. What’s more, many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways: An original study that asked Israelis to imagine the consequences of military service (6) was replicated by asking Americans to imagine the consequences of a honeymoon; an original study that gave younger children the difficult task of locating targets on a large screen (7) was replicated by giving older children the easier task of locating targets on a small screen; an original study that showed how a change in the wording of a charitable appeal sent by mail to Koreans could boost response rates (8) was replicated by sending 771,408 e-mail messages to people all over the world (which produced a response rate of essentially zero in all conditions).
On the one hand, it’s true that the main text of the RPP report never gave an indication of these infidelities. On the other, are they really as bad as all that? Instructively, Brian Nosek and Elizabeth Gilbert defend the study involving the honeymoon/military service swap: the study was actually about perpetrators’ and victims’ needs for reconciliation, and its vignette asked you to imagine that you had been away from work and that a colleague had taken credit for something you had done (in another condition the roles are reversed, and you are the wrongdoer). “Military service” was changed to “honeymoon” just to make the absence something more Americans could relate to, since military service is not compulsory in the US. If I squint I can imagine that “military service” is necessary to feeling really wronged (I was serving my country and being so virtuous!). But the manipulation check (that’s the term for establishing that you’ve induced the intended state) was successful: participants *did* feel wronged. Daniel Lakens links to Jesse Eisenberg offering some further explanations. I am not so completist as to want to go through dozens of studies’ methods to establish their fidelity myself, but if this is what the critics come up with, it seems like they were overreaching in their description of the infidelities.
So there’s a fundamental disagreement: how different can a study be before it’s no longer a direct replication but a “conceptual” one? In the original RPP paper, the authors say “You can’t step in the same river twice,” arguing that there is no such thing as exact replication. Andrew Gelman agrees. I find this somewhat unsatisfying, because surely there is some line past which it’s not quite a replication anymore. It would be worth trying to define that line. At the same time, if psychologists are going to make explanatory statements about human behavior, there has to be some robustness to slight changes in context (at the ad absurdum end of the spectrum, if effects were only ever reproducible in one lab, you’re not doing any kind of convincing science). Lisa Feldman Barrett argues that effects in social psychology are expected to be highly context dependent, and that this explains failures to replicate. It’s questionable whether this is the best explanation for failures to replicate (more on this later), but even if it were, at some point an effect is so specific, fragile, and isolated as to be boring. It would be worth trying to define that point too. Nobody gives TED talks saying that “only when hypothetically contemplating someone taking credit for a colleague’s work while the colleague is away for military service do victims and perpetrators have different needs for reconciliation.” Maybe the slow progress of science requires looking for unexamined context differences that could explain why you get an effect or don’t, but before you have a good idea of what those moderators are, it doesn’t seem reasonable to ask for anyone else’s attention, or to claim, as some psychologists do in practice, that you’re offering data-backed insight into what to expect from other people, or how to behave in the world.
There’s also disagreement about whether the infidelities caused the observed replication rate. Gilbert and coauthors argue that because the replication rate was higher among replications endorsed by the original study authors, infidelities must have caused the failures. But the OSC authors respond that there are other potential explanations for that correlation; for example, authors with lower confidence in their original findings may have been less likely to endorse the replications. The OSC collaborators are planning another study to formally test the hypothesis that infidelities to the original studies caused the replication failures: they will retest the non-endorsed studies many times, both repeating the original replication studies and running improved replications vetted by peer review.
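The statistical logic behind retesting “many times” is worth spelling out: pooling k equally sized replication attempts shrinks the standard error of the effect estimate by a factor of √k, so a single noisy replication stops being the arbiter. A quick sketch, with purely illustrative numbers of my own choosing (known standard deviation, equal sample sizes, ten repetitions):

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect, sigma, n, k = 0.3, 1.0, 50, 10  # illustrative values

# One replication attempt vs. k repeated attempts, pooled by averaging.
single_run = rng.normal(true_effect, sigma / np.sqrt(n))
repeated_runs = rng.normal(true_effect, sigma / np.sqrt(n), k)
pooled_estimate = repeated_runs.mean()

# Pooling k runs of size n is equivalent to one run of size k*n,
# so the standard error shrinks by sqrt(k).
se_single = sigma / np.sqrt(n)
se_pooled = sigma / np.sqrt(k * n)
print(round(se_single / se_pooled, 2))  # 3.16: a ~3x narrower CI
```

With an estimate that much tighter, the planned study can distinguish “the effect is there but the one-shot replication missed it” from “the effect isn’t there,” which is exactly the question the infidelity debate leaves open.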