
Replication Crisis in Psychology: Part Three


See parts one and two.

On why replications fail (presuming that some of them fail not just because of variations in study protocols and sampling error):

I don’t know that the answers to this question are controversial at this point, exactly, but you do still see arguments that seem not to have fully absorbed them.

One I think is well-known to the general public at this point, and that is the problem of publication bias. Most of the effects psychology studies look for are pretty small. In other words, our ability to successfully predict (or push around) how people behave is small compared to what we don’t know in advance, or what’s random. When people talk about effect size, they’re talking about a ratio of what we can explain or predict to what we can’t. The smaller the effect, the bigger the study you need to detect it, and most psychology studies probably aren’t big enough. If a study that’s too small to reliably detect the real effect “succeeds” and achieves significance by the p < .05 standard, then by definition it’s overestimating how big the effect is. And in a publishing scheme with a strong bias toward positive findings, as the likelihood of individual studies detecting the effect goes down (because the effect is small and the study is too small to find it), the proportion of false positives in the whole population of published papers goes up.

Even though most people know this by now, it’s sometimes difficult (and I can testify to this from experience) to apply it to your own work. Your first efforts at experiments are so likely to be null results that it’s correspondingly tempting to see significant results as hard won and real. It’s very hard to correctly see yourself not as the one octopus who has correctly predicted the World Cup, but as one among a population of octopi who, if tested at random, will sometimes generate an improbable sequence that matches some arbitrary criterion.
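To make that concrete, here is a minimal simulation. It is entirely my own illustration, not anything from a published study, and the numbers in it are assumptions: a small true effect of 0.2 standard deviations and a sample of 20 people per group. It runs many such experiments, keeps only the ones that reach p < .05, and compares their average estimate to the truth.

```python
# A toy "publication filter" simulation (mine, with assumed but typical numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.2    # small true difference, in standard-deviation units (assumed)
n_per_group = 20     # a small-sample study (assumed)
n_studies = 10_000

significant_estimates = []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_effect, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:                                   # the only studies that get "published"
        significant_estimates.append(treatment.mean() - control.mean())

print(f"true effect:                       {true_effect:.2f}")
print(f"share of studies reaching p < .05: {len(significant_estimates) / n_studies:.1%}")
print(f"mean estimate among those studies: {np.mean(significant_estimates):.2f}")
```

In runs like this only a small fraction of the simulated studies reach significance, and the ones that do report an effect several times larger than the true one. That is exactly the filter a positive-results publishing bias applies to the literature.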

A somewhat less widely appreciated point (though it’s perhaps well-known to readers of this blog) is that data analysis in many study designs is extremely flexible; the universe of possible analyses is very large. Sometimes you have more than one choice of outcome measure because you’ve collected a bunch. Or you could decide after the fact to look for subgroups in the data where you see the effect and subgroups where you don’t. You could decide while you’re analyzing that maybe there is a gender difference. But then you’ve effectively divided your sample in half, leaving two small samples where you might find something just by chance. A quickly-becoming-classic article on this phenomenon is called “False Positive Psychology,” and it includes an experiment that successfully shows that listening to children’s music makes you younger. Gelman and Loken have argued that the problem is even worse than this, because sometimes the analyses you choose are actually a result of the data you see. That means the whole universe of tests really ought to include everything you would have done if the data were different:


We are hardly the first to express concern over the use of p-values to justify scientific claims, or to point out that multiple comparisons invalidate p-values. Our contribution is simply to note that because the justification for p-values lies in what would have happened across multiple data sets, it is relevant to consider whether any choices in analysis and interpretation are data dependent and would have been different given other possible data. If so, even in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons emerges because different choices about combining variables, inclusion and exclusion of cases, transformations of variables, tests for interactions in the absence of main effects, and many other steps in the analysis could well have occurred with different data. It’s also possible that different interpretations regarding confirmation of theories would have been invoked to explain different observed patterns of results.
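Both problems are easy to demonstrate by simulation. The sketch below is my own toy example, not taken from either paper, and every number in it is an assumption: there is no true effect at all, but the analyst has two outcome measures and the option of splitting by gender. One count treats every allowable analysis as actually run; the other runs only a single test per data set, chosen after peeking at which comparison looks most promising, which is the situation Gelman and Loken describe.

```python
# A toy flexible-analysis simulation (mine, with made-up numbers).
# There is no real effect anywhere in these data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_studies = 40, 10_000
run_everything_hits = 0
forking_paths_hits = 0

for _ in range(n_studies):
    condition = np.repeat([0, 1], n_per_group)            # 0 = control, 1 = treatment
    gender = rng.integers(0, 2, 2 * n_per_group)
    outcomes = rng.normal(size=(2, 2 * n_per_group))      # two outcome measures, pure noise

    # every analysis the flexibility allows: each outcome, overall and within each gender
    candidates = []                                       # (|mean difference|, p-value) pairs
    for outcome in outcomes:
        for mask in (np.full(2 * n_per_group, True), gender == 0, gender == 1):
            treat = outcome[mask & (condition == 1)]
            ctrl = outcome[mask & (condition == 0)]
            candidates.append((abs(treat.mean() - ctrl.mean()),
                               stats.ttest_ind(treat, ctrl).pvalue))

    # analyst A runs them all and reports anything with p < .05
    if min(p for _, p in candidates) < 0.05:
        run_everything_hits += 1

    # analyst B runs only ONE test, chosen after eyeballing which comparison
    # shows the biggest raw difference (the data-dependent choice in the quote)
    _, chosen_p = max(candidates, key=lambda c: c[0])
    if chosen_p < 0.05:
        forking_paths_hits += 1

print(f"try-every-analysis false-positive rate:         {run_everything_hits / n_studies:.1%}")
print(f"single data-dependent test false-positive rate: {forking_paths_hits / n_studies:.1%}")
```

Both rates come out well above the nominal 5%, the second even though only one p-value per data set is ever “reported.”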

On the one hand, I’m self-conscious even writing this, because part of me feels like I’m saying something obvious. On the other, there are many examples of veteran psychologists arguing that an effect has replicated, now with an interaction. “Interaction” is the statistical term for the general case of looking for subgroups in data; it refers to any way in which one variable alters the observed effect of another. “Moderator” is the term for the variable doing the altering. That link is to Gelman quoting John Bargh. Bargh’s highly cited study finding that college students walked more slowly if they’d recently seen words related to old age failed to replicate in one effort, and Bargh argued:

The original focus of replication failures of priming effects on behavior was Doyen et al.’s PLOS publication last January of their failure to replicate the elderly priming study of Bargh et al. 1996.  The Science News article does not mention that there are already at least two successful replications of that particular study by other, independent labs, published in a mainstream social psychology journal.  Here are links to these two replications.  Both appeared in the Journal of Personality and Social Psychology, the top and most rigorously reviewed journal in the field.  Both articles found the effect but with moderation by a second factor: Hull et al. 2002 showed the effect mainly for individuals high in self consciousness, and Cesario et al. 2006 showed the effect mainly for individuals who like (versus dislike) the elderly.

Of course, if you are collecting a lot of variables that could potentially be used as moderators, then you have a lot of opportunities to try different kinds of analyses.  If one paper finds the effect only with one moderator, another finds the effect only with another, and the first found the effect with no moderators, then none of them truly replicates the others.  (It’s not that they’re clear failures, either — they just don’t add a lot of confidence to the reality of the other effects.)
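Here, too, a small simulation shows how easily this happens. The sketch below is purely illustrative, with numbers I made up, and is not a reanalysis of any of the studies above: there is no true effect, but the analyst has recorded several candidate moderators and will report that the effect “replicates, with moderation” if any split produces p < .05.

```python
# A toy moderator-fishing simulation (mine; nothing here is real data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group = 50
n_moderators = 6       # e.g. self-consciousness, attitudes toward the elderly, ...
n_studies = 5_000
claimed_moderated_effect = 0

for _ in range(n_studies):
    condition = np.repeat([0, 1], n_per_group)
    outcome = rng.normal(size=2 * n_per_group)                        # no real effect
    moderators = rng.integers(0, 2, (n_moderators, 2 * n_per_group))  # binary splits

    hit = False
    for moderator in moderators:
        for level in (0, 1):                       # "high" vs. "low" on the moderator
            mask = moderator == level
            p = stats.ttest_ind(outcome[mask & (condition == 1)],
                                outcome[mask & (condition == 0)]).pvalue
            if p < 0.05:
                hit = True                         # "the effect held, for this subgroup"
    if hit:
        claimed_moderated_effect += 1

print(f"studies that could claim a moderated effect by chance alone: "
      f"{claimed_moderated_effect / n_studies:.1%}")
```

With half a dozen candidate moderators, a sizable fraction of pure-noise studies can report the effect “with moderation,” which is why a replication that works only conditional on a new moderator shouldn’t add much confidence.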

Maybe the past three years of discussion have altered Bargh’s thinking, and he wouldn’t make quite the same argument today. I think awareness of these issues in the field has really improved in the last few years of debate. But that may be due in part not to the minds of very senior psychologists changing, but to cohort effects. Earlier-career psychologists, who haven’t spent a whole career under an older set of standards and are instead coming of age during this discussion, are playing a larger part in forming the new norms. Maybe we’re seeing something like a Kuhnian paradigm shift, but for practice rather than theoretical content.
