The Replication Crisis: What Failed to Replicate and Why

nonacademicresearch.org Editorial

Submitted
May 9, 2026
Version
v1
License
CC-BY-4.0
Identifier
nar:qcetlew5tjfj0fyhgl

Abstract

Large-scale replication projects in psychology found that only about 36–62% of published findings replicated at conventional significance levels, depending on the project, with average effect sizes roughly half those originally reported. This is not a crisis unique to psychology: similar patterns have been documented across medicine, economics, and cancer biology. The causes are structural: publication bias, small samples, undisclosed analytical flexibility, and the statistical consequences of testing many hypotheses.

Manuscript


title: "The Replication Crisis in Psychology: What Failed and What Held" abstract: "In 2015, the Open Science Collaboration published the results of 100 replication attempts of published psychology experiments. Only 39% replicated by conventional standards — a finding that shook the discipline. The pattern of failures reveals not randomness but a structural bias: small samples, surprising findings, and p-values just below 0.05 are the strongest predictors of failure to replicate. A decade later, the field has changed in important ways, though unevenly." topic: science author: nonacademicresearch.org Editorial date: 2026-05-09

The Replication Crisis in Psychology: What Failed and What Held

Abstract

In 2015, the Open Science Collaboration published the results of 100 replication attempts of published psychology experiments. Only 39% replicated by conventional standards — a finding that shook the discipline and triggered a broader reckoning across empirical social science. The pattern of failures is not random: small samples, surprising findings, and p-values just below 0.05 are the strongest predictors of failure to replicate. A decade later, reform efforts have produced genuine improvements, but the crisis has not been resolved.

Background

For most of its history as an academic discipline, psychology operated under an assumption shared by nearly all of empirical science: that published findings, having cleared peer review and statistical significance thresholds, represent reliable knowledge. The standard significance threshold — p < 0.05, meaning a result that would occur by chance fewer than 1 in 20 times if the null hypothesis were true — was understood as a filter separating signal from noise.
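A minimal simulation (with illustrative parameters chosen for this sketch, not taken from any of the studies discussed here) makes that baseline concrete: when the null hypothesis is true by construction, about one comparison in twenty clears the threshold anyway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n = 10_000, 30       # 10,000 hypothetical studies, 30 participants per group
false_positives = 0

for _ in range(n_studies):
    # Both groups come from the same distribution, so any "effect" is pure noise.
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"False-positive rate under a true null: {false_positives / n_studies:.1%}")  # about 5%
```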

By the mid-2000s, a small number of methodologists had begun identifying warning signs. Studies tended to have very small samples (20–40 participants per condition was standard). Effect sizes were rarely estimated in advance. Researchers had considerable flexibility in how they analyzed and reported their data. Publication bias — the tendency for journals to publish positive results and reject null results — meant the literature accumulated an unknown but substantial proportion of false positives.

The problem crystallized in 2011 when Simmons, Nelson, and Simonsohn demonstrated in Psychological Science that standard "flexible" research practices (what they called "researcher degrees of freedom") could inflate false positive rates from 5% to 60% without any intentional misconduct. The same year, Daryl Bem published a paper in the flagship Journal of Personality and Social Psychology claiming to find evidence for precognition in humans — a result that, because it followed all standard methodological norms, exposed how far those norms had drifted from what was needed to distinguish real effects from chance.
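Their argument is easy to reproduce in miniature. The sketch below is a simplified toy version (not Simmons et al.'s own simulation code) that gives a hypothetical researcher just two of the freedoms they describe: choosing between two correlated outcome measures, and adding more participants when the first test misses significance. The false-positive rate climbs well above the nominal 5% even though no effect exists.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, hits = 5_000, 0

for _ in range(n_sims):
    # Two correlated outcome measures per participant; the true effect is zero for both.
    cov = [[1.0, 0.5], [0.5, 1.0]]
    group_a = rng.multivariate_normal([0.0, 0.0], cov, size=20)
    group_b = rng.multivariate_normal([0.0, 0.0], cov, size=20)

    significant = False
    for dv in (0, 1):                                   # freedom 1: report either outcome
        a, b = group_a[:, dv], group_b[:, dv]
        if stats.ttest_ind(a, b).pvalue < 0.05:
            significant = True
            continue
        # freedom 2: not significant yet, so collect 10 more participants and test again
        a = np.concatenate([a, rng.normal(0.0, 1.0, 10)])
        b = np.concatenate([b, rng.normal(0.0, 1.0, 10)])
        if stats.ttest_ind(a, b).pvalue < 0.05:
            significant = True
    if significant:
        hits += 1

print(f"Nominal alpha: 5.0%   observed false-positive rate: {hits / n_sims:.1%}")
```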

The Evidence

The Open Science Collaboration (2015)

The landmark test came from the Open Science Collaboration, a consortium of 270 researchers who replicated 100 published psychology experiments from three leading journals. Their results, published in Science, were unambiguous:

  • 36% of replications produced statistically significant results at the original p < .05 threshold; 39% were subjectively rated as successful replications
  • 47% of original effect sizes fell within the 95% confidence interval of the replication estimate (a more charitable criterion)
  • Mean effect size in replications was approximately half that of originals (the simulation sketch after this list shows one reason why)
  • Original p-values, effect sizes, and sample sizes were among the strongest predictors of replication success
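The effect-size shrinkage follows almost mechanically from publication bias. The simulation below is an illustrative model with assumed parameters, not the OSC's analysis: it posits a modest true effect, runs many underpowered studies, and "publishes" only those that reach p < .05. The surviving estimates are inflated, so an unbiased replication is expected to come in far lower.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n = 0.25, 20            # modest assumed true effect, 20 participants per group
published = []

for _ in range(20_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    if stats.ttest_ind(b, a).pvalue < 0.05:          # the "publication filter"
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published.append((b.mean() - a.mean()) / pooled_sd)

print(f"Assumed true effect size:     d = {true_d}")
print(f"Mean published effect size:   d = {np.mean(published):.2f}")
print("An unbiased replication targets the true value, so it looks like a 'failure'.")
```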

The studies that failed to replicate were not uniformly obscure or poorly regarded. They included studies that had been cited hundreds or thousands of times and had influenced textbooks, public policy, and popular science writing.

The Pattern of Failure

Subsequent analyses revealed that failures were not random but structured. Camerer et al. (2018), replicating 21 social science experiments published in Nature and Science, found that the strongest predictors of replication failure were:

  1. Effect size in the original. Large, surprising effect sizes replicated at lower rates. Smaller, more theoretically grounded effects replicated at higher rates.
  2. Sample size. Studies with fewer than 50 participants were dramatically more likely to fail (the power sketch after this list shows why).
  3. Prediction market judgment. When researchers were asked to bet on which studies would replicate, their collective prediction was highly accurate — suggesting expert intuition about which findings are "too good to be true" is often correct.
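The sample-size point is easy to quantify. The sketch below uses illustrative numbers, not figures drawn from Camerer et al.: it estimates by simulation the power of a two-group study to detect an effect of d = 0.4, a fairly typical size for social psychology, at several sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def simulated_power(n_per_group, d=0.4, sims=10_000, alpha=0.05):
    """Fraction of simulated two-group studies that reach p < alpha."""
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(d, 1.0, n_per_group)
        hits += stats.ttest_ind(a, b).pvalue < alpha
    return hits / sims

for n in (25, 50, 100):
    print(f"n = {n:>3} per group -> power ~ {simulated_power(n):.2f}")
# Roughly 0.28, 0.50, and 0.80: small studies usually miss a real effect of this size,
# and the ones that do reach significance tend to overestimate it.
```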

Gelman and Loken (2014) provided a statistical framework explaining the pattern. They described the "garden of forking paths": the many undisclosed analytical decisions researchers make during data collection and analysis that, even without any deliberate misconduct, effectively allow multiple testing and inflate false positive rates. The same dataset, handed to 20 different research teams, would not yield the same result, because each team would make slightly different analytical choices.
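A toy version of the forking-paths argument: take a single dataset in which no effect exists, run every combination of a few defensible analysis choices (which outcome measure to use, how aggressively to exclude outliers), and look at the spread of p-values. The forks in the sketch below are made up for illustration; real analyses offer many more.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# One null dataset: two groups of 40, two plausible outcome measures, no true effect.
n = 40
group = np.repeat([0, 1], n)
dv1 = rng.normal(0.0, 1.0, 2 * n)
dv2 = 0.5 * dv1 + rng.normal(0.0, 1.0, 2 * n)   # a second, correlated outcome measure

p_values = []
for dv in (dv1, dv2):                            # fork 1: which outcome to analyse
    for cutoff in (None, 2.0, 2.5, 3.0):         # fork 2: how to define and drop "outliers"
        keep = np.ones_like(dv, dtype=bool)
        if cutoff is not None:
            z = (dv - dv.mean()) / dv.std(ddof=1)
            keep = np.abs(z) < cutoff
        p = stats.ttest_ind(dv[keep & (group == 0)], dv[keep & (group == 1)]).pvalue
        p_values.append(p)

print(f"{len(p_values)} defensible analyses of the same data")
print(f"p-values range from {min(p_values):.3f} to {max(p_values):.3f}")
```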

What Did Replicate?

The crisis has sometimes been misread as evidence that psychology produces no reliable knowledge. This is not accurate. Many findings replicated well:

  • Basic cognitive effects (perceptual priming, anchoring effects in explicit contexts, Stroop interference) showed high replication rates
  • Studies with large samples, pre-registered hypotheses, and theoretically motivated designs replicated at substantially higher rates
  • Cross-national replications (Klein et al., 2018, Advances in Methods and Practices in Psychological Science) found many basic findings to be robust across cultures

The failures clustered in "social priming" (claims that subtle environmental cues change behavior dramatically), ego depletion (the claim that willpower is a depletable resource), power poses (Amy Cuddy's claim that expansive postures change testosterone and cortisol), and studies with implausibly large effect sizes.

Reforms and Their Effects

The replication crisis triggered substantial methodological reform. Pre-registration (committing to hypotheses, methods, and analysis plans before data collection) became common in leading journals. The Center for Open Science developed a transparency certification system. Several journals adopted Registered Reports, in which peer review and acceptance occur before data collection, decoupling publication from the results obtained.

These reforms appear to be working. A 2021 analysis by Scheel et al. in Advances in Methods and Practices in Psychological Science found that Registered Reports had a dramatically lower rate of "positive" results (44%) than the standard, non-registered literature (96%), suggesting the latter was substantially inflated by selective reporting.

Counterarguments

Some psychologists argue that replication as operationalized by the OSC was too rigid — that psychological phenomena are genuinely context-dependent, and that "failed replication" sometimes reflects a real moderating variable rather than a false positive. This is a legitimate point for specific cases, but it does not explain the systematic directional bias in the data (original effect sizes are almost always larger than replications, not sometimes larger and sometimes smaller).

Others argue that psychology is hardly alone in having replication problems, making its difficulties less exceptional than they appear. This is true; similar failure rates have since been documented in medicine, economics, neuroscience, and cancer biology. But psychology's problems appear more severe in some domains, partly due to historically smaller sample sizes and greater researcher flexibility.

What We Can Conclude

The replication crisis in psychology is real, substantial, and structured. Approximately half of the canonical findings from social psychology circa 2000–2015 do not replicate at their original effect sizes, with the most surprising and most cited findings showing the highest failure rates. The cause is not primarily fraud but a set of methodological practices — small samples, flexible analysis, publication bias — that allowed the literature to accumulate false positives at scale.

The good news is that the crisis also produced a clear diagnosis and substantial reform. Pre-registration, open data, and larger samples are improving the reliability of new research. The uncomfortable reality is that a substantial portion of the foundational literature underlying applied psychology — from clinical interventions to workplace training to public policy — rests on findings that have not survived rigorous replication.

References

  • Camerer, C.F., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644. https://doi.org/10.1038/s41562-018-0399-z
  • Gelman, A., & Loken, E. (2014). The statistical crisis in science. American Scientist, 102(6), 460. https://doi.org/10.1511/2014.111.460
  • Klein, R.A., et al. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. https://doi.org/10.1177/2515245918810225
  • Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
  • Scheel, A.M., et al. (2021). An excess of positive results: Comparing the standard Psychology literature with Registered Reports. Advances in Methods and Practices in Psychological Science, 4(2). https://doi.org/10.1177/25152459211007467
  • Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Versions (1)

  • v1, May 9, 2026: initial publication (md)

Cite this paper

APA

nonacademicresearch.org Editorial (2026). The Replication Crisis: What Failed to Replicate and Why. nonacademicresearch.org. nar:qcetlew5tjfj0fyhgl

BibTeX
@misc{a8vrrxy3,
  title = {The Replication Crisis: What Failed to Replicate and Why},
  author = {nonacademicresearch.org Editorial},
  year = {2026},
  howpublished = {nonacademicresearch.org},
  note = {nar:qcetlew5tjfj0fyhgl},
}

Temporary identifier. This paper carries a temporary nar:* identifier valid for citation within the independent research community. A permanent DOI will be minted via DataCite once the platform completes nonprofit registration.
