Rebuilding After the Replication Crisis

Stuart Ritchie

Over a decade has passed since scientists realized many of their studies were failing to replicate. How well have their attempts to fix the problem actually worked?

An empty room with a large cardboard box in the center. A group of 102 undergrad students. They’re split into three groups, and asked to sit either in the box, beside the box or in the room with the box removed. They complete a task that’s supposed to measure creativity — coming up with words that link together three seemingly unrelated terms.

The results of this experiment? The students who sat beside the box had higher scores on the test than the ones in the box or those with no box present. That’s because — according to the researchers — sitting next to the box activated in the students’ minds the metaphor “thinking outside the box.” And this, through some unknown psychological mechanism, boosted their creativity.

You might be laughing at this absurd-sounding experiment. You might even think I just made it up. But I didn’t: It was published as part of a real study — one that the editors and reviewers at one of the top psychology journals, Psychological Science, deemed excellent enough to publish back in 2012.

To my knowledge, nobody has ever attempted to replicate this study — to repeat the same result in their own lab, with their own cardboard box. That’s perhaps no surprise: After all, psychology research is infamous for having undergone a “replication crisis.” That was the name that came to describe the realization — around the same time that the cardboard box study was published — that hardly any psychologists were bothering to do those all-important replication studies. Why check the validity of one another’s findings when, instead, we could be pushing on to make new and exciting discoveries?

Developments in the years 2011 and 2012 made this issue hard to ignore. A Dutch psychology professor, Diederik Stapel, was found to have faked dozens of studies across many years, and nobody had noticed, in part because barely anyone had tried to replicate his work (and in part because it’s really awkward to ask your boss if he’s made up all his data). Psychologists published a provocative paper that showed that they could find essentially any result they wished by using statistics in biased ways — ways that were almost certainly routinely used in the field. And one of those hen’s-teeth replication attempts found that a famous study from “social priming,” the same social psychology genre as the cardboard box study — in which merely seeing words relating to old people made participants walk more slowly out of the lab — might have been an illusion.

Similar stories followed. As psychologists got their act together and tried replicating one another’s work, sometimes in large collaborations where they chose many studies from prominent journals to try to repeat, they found that approximately half the time, the older study wouldn’t replicate (and even when it did, the effects were often a lot smaller than in the original claim). Confidence in the psychological literature started to waver. Many of those “exciting discoveries” psychologists thought they’d made were potentially just statistical flukes — products of digging through statistical noise and seeing illusory patterns, like the human face people claimed to see on the surface of Mars. Worse, some of the studies might even have been entirely made up.

The replication crisis, alas, applies to a lot more of science than just silly social psychology research. Research in all fields was affected by fraud, bias, negligence and hype, as I put it in the subtitle of my book Science Fictions. In that book, I argued that perverse incentives were the ultimate reason for all the bad science: Scientists are motivated by flashy new discoveries rather than “boring” replication studies — even though those replications might produce more solid knowledge. That’s because for scientists, so much hinges on getting their papers published — particularly getting published in prestigious journals, which are on the lookout for groundbreaking, boundary-pushing results. Unfortunately, standards are so low that many of the novel results in those papers are based on flimsy studies, poor statistics, sloppy mistakes or outright fraud.

I think it’s fair to predict with confidence that, were the cardboard box study to be repeated, the results would be different. It’s the kind of study—based on tenuous reasoning about how language affects thought, with statistical tests that, when looked at in detail, are right on the very edge of being considered “statistically significant”—that would be a prime candidate for a failed replication, should anyone ever try. It’s the kind of research that psychologists now look back on with embarrassment. Of course, a decade later we’ve learned our lesson, and definitely don’t do unreplicable studies like that any more.

Right?

The problems of fraud, bias, negligence and hype in science aren’t going away anytime soon. But we can still ask to what extent things have gotten better. Are researchers doing better studies — by any measure — than they were in 2012? Has anything about the perverse publishing dynamics changed? Have all the debates (what actually counts as a replication?), criticisms (are common statistical practices actually ruining science?), and reforms (should we change the way we publish research?) that have swirled around the idea of the replication crisis made science — in psychology, or indeed in any field — more reliable? Fundamentally, how much more can we trust a study published in 2022 compared to one from 2012?

If you jumped ten years forward in time from 2012, what would you notice that’s different about the way science is published? Certainly you’d see a lot of unfamiliar terms. For instance, unless you were a clinical trialist, you likely wouldn’t recognize the term “preregistration.” This involves scientists planning out their study in detail before they collect the data, and posting the plan online for everyone to see (the idea is that this stops them “mucking about” with the data and finding spurious results). And unless you were a physicist or an economist, you might be surprised by the rise of “preprints” — working papers shared with the community for comment, discussion and even citation before formal publication. These ideas come under the rubric of “open science,” a term that in 2012 you might have heard of (it’s been around since the 1980s), but that in 2022 is discussed almost everywhere.

You’d also notice a big rise in scientific papers discussing a “crisis,” as well as all sorts of special issues and debate pieces dedicated to the idea of replicability. Like never before, many scientists are looking inward and questioning the reliability of their work. There are also telling patterns in the tools they’re using. The Open Science Framework, a website where scientists can post their plans, share their data and generally make their whole research process more transparent, had somewhere near zero users in 2012, but by the end of 2021 had hit 400,000. The number of new files posted by those users, and the number of preregistrations, have also risen exponentially. In the past, a major barrier to being open and transparent with research was that it was really difficult to do so (how would you share your data, pre-internet?). It’s still far from super easy, but the technology has substantially improved, and a great many scientists are signing up to use it.

You’d also notice that scientific publishers are changing. One of my formative experiences as a PhD student, in 2011, was submitting a replication study to the Journal of Personality and Social Psychology, only to be told that the journal did not publish replications under any circumstances (you might be thinking, “WTF?” — and we were too). Now at that very same journal and a host of others, replications are encouraged, as is a set of other “open” practices — sharing data, code and materials, and pre-registering hypotheses and analyses before a study is carried out. Some journals now publish an article’s peer reviews alongside its online version, so the whole process is on view — hopefully encouraging reviewers to put in more effort, and allowing us to see where things went wrong in cases where reviewers missed important flaws.

Over 300 journals across a variety of fields now offer the ultimate form of preregistered research, the “Registered Report,” where it’s not just that a plan is posted and then the study goes ahead, it’s that peer reviewers review a study plan before the study happens. If the plan passes this quality control — and the reviewers might suggest all sorts of changes before they agree that it’s a good study design — the journal commits to publish it, regardless of whether the results are positive or negative. This is a brilliant way of making sure that decisions about publication are made on the basis of how solid the design of a study is — not on the perceived excitement levels of its results.

These are all encouraging developments, and represent impressive progress in and of themselves. A scientist from 2012 would find a lot to be optimistic about in 2022 — at least on the surface. But the number of people talking about the crisis, debating open science or signing up to a website isn’t what we really want to know. And journals offering ways to make science better isn’t much use if nobody takes them up. Have these changes actually made the science better?

***

Given its life-or-death importance, it’s no surprise that medicine has seen more intensive self-study, more meta-research, than any other field. Researchers at organizations like the Cochrane Collaboration have been beavering away, rating medical trials for their quality and how much they risk bias in their findings. To take just two examples, they check whether the participants in a trial could’ve found out if they were getting the real treatment or the placebo control — in which case the “blinding” of the study would’ve failed, expectation might play a role, and the results wouldn’t be reliable. They also check whether the randomization of the study worked properly: If people are randomly assigned to groups, then there shouldn’t be any big differences in health status, or background, between them before the study starts. If the randomization goes wrong, any results you find might be related to preexisting differences rather than to the treatment you’re testing, and you’ll draw the wrong conclusion.
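The randomization check described above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the "baseline health score" (mean 50, standard deviation 10), the group sizes and the function names are hypothetical, and it is not how Cochrane reviewers actually rate trials. It shows that when assignment really is random, the pre-study gap between groups is just chance noise, and it shrinks as trials get bigger.

```python
import random
from statistics import mean


def baseline_gap(n_participants, seed):
    """Randomly split participants into two arms and return the absolute
    difference in mean baseline score between the arms."""
    rng = random.Random(seed)
    # Hypothetical baseline health score for each participant (assumption).
    scores = [rng.gauss(50.0, 10.0) for _ in range(n_participants)]
    # Random assignment: shuffle, then cut the list in half.
    rng.shuffle(scores)
    half = n_participants // 2
    treatment, control = scores[:half], scores[half:]
    return abs(mean(treatment) - mean(control))


def average_gap(n_participants, n_trials=200):
    """Average the baseline gap over many simulated trials of the same size."""
    return mean(baseline_gap(n_participants, seed) for seed in range(n_trials))


for size in (40, 2000):
    print(f"{size} participants: average baseline gap ~ {average_gap(size):.2f}")
```

A real trial whose baseline gap is far larger than chance would produce at its sample size is the kind of warning sign the reviewers are looking for.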

If studies take extra care, they can reduce problems with blinding, randomization and a bunch of other bias-related problems that occur in medical trials,[1] and reduce the likelihood that they get spurious results. And according to a large-scale analysis of the overall trend in the quality of randomized medical trials from 1966 to 2018, the research has gotten better on average. Failures of randomization are rarer now; fewer studies have problems with blinding. Indeed, every metric the analysts looked at has improved over the years, though there’s still a long way to go: For example, 52% of trials in the period 2010–2018 still had problems with blinding. There also wasn’t evidence of any kind of acceleration in quality over the past decade in particular.

So if you’re reading a medical trial published recently — and many of us did this a lot during the pandemic — it is more likely to be better than one published in previous decades (though only a little better since 2012). A lot of that probably has to do with regulations on the way trials are planned and reported: Researchers in medical trials are forced to be transparent in a way that would be unrecognizable to scientists in other fields, whose research can effectively be entirely done in secret.

But nothing like this analysis has been done for any other field. Instead, we have to look at some proxies for quality. In psychology, one such proxy might be “adherence to open research”: How many of the new replication-crisis-inspired reforms do researchers follow? Sadly, for this, all we have so far is a starting point: Only 2% of psychology studies from 2014–2017 shared their data online, and just 3% had a preregistration. These numbers will undoubtedly rise in future surveys, but we don’t have those surveys yet, so we don’t know by how much. As for Registered Reports, uptake by scientists has been slow, regardless of how many journals offer them as an option. Changing a whole culture is hard, even if you have very good reason to do so: Like any culture, science has built up a great deal of inertia and skepticism about change.

Using adherence to open research as a proxy for research quality is complicated by the fact that it’s possible to post a preregistration and then simply not follow it, or write it so vaguely that it doesn’t constrain your “mucking about” in the intended way. Medical researchers might nod wearily here — it’s been clear to them for years that scientists often dishonestly “switch” the outcome of their experiment, which they’d written down in their registration, to something else if their main outcome — pain levels, blood pressure measurements, depression ratings — didn’t show the effect they wanted. It’s also possible to post your data set online and have it be poorly annotated, or at worst completely incomprehensible. That’s if the data is even present: A study from earlier this year found that, in studies where the authors wrote that they’d be happy to share their data on request, only 6.8% actually did so when emailed. In other words, it’s possible to go through the motions of “open science” without it really affecting your research or the way you behave — a problem that’s increasingly been spotted as more researchers sign up to these “open science” techniques. If you want to really make your research open, you have to actually mean it.

Another proxy for research quality is sample size. All else equal, bigger studies are usually better — so, are studies bigger nowadays? This is another way of asking about statistical power, the ability to detect effects if they’re really there in your data. Studies with low-powered analyses — and usually this means studies that are too small — risk missing true effects and picking up on false ones.
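To make the idea of statistical power concrete, here is a rough simulation sketch. Everything in it is an assumption chosen for illustration: a true effect of 0.3 standard deviations, normally distributed scores with a known standard deviation of 1, a simple two-sided z-test, and the hypothetical function name `simulate_studies`. It counts how often studies of different sizes detect the same real effect, and how large the estimates look in the studies that do reach significance.

```python
import random
from statistics import NormalDist, mean


def simulate_studies(n_per_group, true_effect, n_sims=2000, alpha=0.05, seed=0):
    """Simulate many two-group studies and summarize the significant ones.

    Returns (share of studies with p < alpha, mean absolute effect estimate
    among those significant studies, or None if there were none).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    standard_error = (2 / n_per_group) ** 0.5     # SD assumed known and equal to 1
    significant_estimates = []
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        estimate = mean(treated) - mean(control)
        if abs(estimate / standard_error) > z_crit:
            significant_estimates.append(abs(estimate))
    share = len(significant_estimates) / n_sims
    typical = mean(significant_estimates) if significant_estimates else None
    return share, typical


for n in (20, 200):
    power, typical_estimate = simulate_studies(n_per_group=n, true_effect=0.3)
    print(f"n per group = {n}: power ~ {power:.2f}, "
          f"mean significant estimate ~ {typical_estimate:.2f}")
```

Under these assumptions, a study with 20 people per group only reaches significance when its estimate exceeds roughly 0.62, more than double the true effect of 0.3. So the small studies that "work" are the ones that overestimate, which fits the earlier observation that replicated effects often turn out smaller than the original claims.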

In some fields, some types of analysis have undoubtedly become more powerful. Genetics is among the most obvious: After many years of failed “candidate gene” research, where small-sample research led the field badly astray, genetic studies now regularly reach sample sizes in the millions, and produce results that are replicable (even if their precise implications are still hotly debated). In brain-imaging research, too, there’s an increasing awareness that to say anything sensible about how the brain relates to various behaviors, traits or disorders, we usually need sample sizes in the thousands. Happily, we now have what we need: Studies published in recent years have used resources like the UK Biobank or the ENIGMA Consortium, both with tens of thousands of brain scans, to come to more reliable conclusions. Alas, that has reemphasized that much of what was done in the past, in small-sample neuroimaging studies, was next to worthless.

Meta-research does show increasing sample sizes over time in neuroimaging as a whole; I’m certain that such a study would find the same in genetics. In other fields, it’s less clear: There’s some evidence, for instance, of a modest increase in sample size in personality psychology over time, and a recent preprint “cautiously” suggested that studies in political science have gotten bigger in recent years.

In other fields, though, all we have are starting points but no data on long-term trends. Almost uniformly, the starting point is one of very low power. That’s true for psychology in general, clinical and sports and exercise psychology in particular, ecology, global change biology (the field that studies how ecosystems are impacted by climate change), economics, and political science. Other areas like geography have seen glimmers of a replication crisis but haven’t yet collected the relevant meta-scientific data on factors like statistical power to assess how bad things are. We’ll need a lot more meta-research in the future if we want to know whether things are getting better (or, whisper it, worse).

Even then, the mere knowledge that studies are, say, getting bigger shouldn’t reassure us unless those studies are also becoming more replicable — that is to say, a closer approximation to reality. And although areas like psychology and economics have attempted to replicate dozens of experiments, there hasn’t been time to make the same attempts to replicate newer studies or compare the replication rates over time. We likely won’t see meta-research like this for a long time — and for some fields, a very long time. Witness how long it took the Reproducibility Project: Cancer Biology, a heroic attempt to replicate a selection of findings in preclinical cancer research, to finish its research: It began in 2013, but only just reported its final mixed bag of results in December 2021.

What about those papers that aren’t just low quality, but are actively fraudulent or otherwise brought about by nefarious activities like plagiarism? A change from 2012 is that more papers, and a higher proportion of papers overall, are being retracted: removed from the scientific literature due to some major deficiency or error. Not all retractions are due to deliberate rule breaking, of course; some are due to the discovery of honest mistakes, among other reasons (the Retraction Watch website covers each new retraction as it arises, and tries to ferret out the story behind it). But we can see it as a good thing that mistakes and falsehoods are being actively dealt with more often than they were even a decade ago, even if, going by the number of papers that are flagged on error-checking websites like PubPeer, the frequency of retraction should be a lot higher.

***

The economist Michael Clemens famously described the potential economic benefits of changes to immigration policy — removing restrictions and barriers to the movement of people across borders — as “trillion-dollar bills on the sidewalk.” The benefits are just lying there, ready to be grabbed if politicians so choose. I think something similar is the case with science: Changes that would make dramatic improvements to the quality of research are right there — but, although they’re often available, most scientists haven’t even begun to pick them up.

And that’s what’s really different between now and a decade ago: We know a lot more about where science goes wrong, and we have a much longer list of potential tools to fix it. We’ve tried various reforms in several fields, producing useful lessons for other disciplines. We’ve developed technologies to improve the transparency of our research. And, in our more open, self-critical discussions (not to mention formal meta-research) about how science works, we’ve become much more aware of the hurdles — the incentives, the vested interests, the inertia and sometimes the sheer social awkwardness — that slow down the process of improving science.

But as you can see from my sorting through the scraps of evidence above, we have nowhere near the data we’d need to confidently argue that science is better now than a decade ago. Definitively answering this question will require substantially more meta-research across disciplines — and will likely require more reforms. The burst of meta-science that we have seen since the replication crisis mustn’t be squandered: Pushing for the funding of much more such research should be a major priority for anyone who wants to improve science, and wants to do so using hard evidence.

Perhaps that evidence will tell us that our incremental fixes are ticking along nicely, steadily improving the quality of science as more and more researchers take up preregistration and open science and the rest. But equally, it might tell us that something more basic has to change. It might tell us that only a hard core of interested scientists are truly invested in “open science,” and that the rest of the community needs to be incentivized into improving their own work. Perhaps, as with the regulations for medical trials, we simply need to require that researchers follow a set of minimal standards before they receive funding. And maybe we need to fundamentally change how we approach science: a radical rethinking of the peer-review system, for example. Some have even argued that scientists’ obsession with publication at all costs will only end if we get rid of scientific journals, or even scientific papers themselves.

We shouldn’t be afraid to trial and test new and creative ideas, even if they might make science look very different from the status quo a decade ago, or even today. That is, for science to become as trustworthy as we need it to be, it might — like those creative students back in 2012 — need to escape the cardboard box entirely.

  1. Aside from problems with blinding and randomization, these include factors like the quality of the measurement of the outcome (does the trial use good-quality instruments or well-validated questionnaires, or are they likely to produce noisy, hard-to-interpret data?) and the patterns of which data is missing (did people drop out of your trial in a “nonrandom” way — that is, did sicker people tend to quit the study faster? If so, your final results might be skewed). A fully detailed description of the huge number of ways that trials can be biased can be found on the website riskofbias.info.

Published November 2022

Stuart Ritchie is a senior lecturer in psychiatry at King’s College London. He is the author of a book, Science Fictions: Exposing Fraud, Bias, Negligence and Hype in Science, and blogs about metascience at stuartritchie.substack.com.
