Paper Trail

The 865-Scientist Stress Test: Why Half of Social Science Fails to Replicate

April 20, 2026 · 19:08 · Paper Trail

This episode discusses the landmark SCORE study, which revealed that nearly half of social science findings fail to replicate and their reported impact is often significantly overstated. It explores why DARPA funded this extensive audit and clarifies the crucial distinction between reproducibility and replicability, helping listeners understand the challenges to scientific credibility and how research reliability is assessed.

Key Takeaways

* A landmark study, detailed at https://cordis.europa.eu/article/id/464710-will-world-s-largest-probe-make-us-lose-trust-in-research-findings, reveals that nearly half of social science findings fail to replicate when tested by independent researchers.
* The Systematizing Confidence in Open Research and Evidence (SCORE) project found that only 49.3% of social science papers successfully replicated, meaning new data confirmed the original phenomenon.
* Even when studies did replicate, their practical impact often shrank dramatically, with a median reduction of over 50% in effect size and more than 80% in explained variance.
* Mandatory open science practices, such as sharing data and analytical code, significantly boost research rigor, with exact reproducibility skyrocketing from 54% to 77% when authors were transparent.
* Human experts proved far more effective than current AI algorithms at predicting which studies would fail to replicate, highlighting the enduring value of critical human judgment in assessing scientific credibility.

Detailed Report

A monumental audit involving 865 scientists has revealed that nearly half of social science findings fail to replicate when re-examined by independent researchers. This extensive stress test, known as the Systematizing Confidence in Open Research and Evidence (SCORE) project, also found that even when findings do hold up, their actual impact often shrinks dramatically, sometimes by over 80 percent.

This sobering assessment forces a critical re-evaluation of how research is interpreted and applied, influencing everything from public policy decisions to daily news headlines.

## The SCORE Project's Ambitious Goal

The SCORE project was an unprecedented undertaking, spearheaded by the Center for Open Science with significant funding from the U.S. Defense Advanced Research Projects Agency (DARPA). While an unusual pairing, DARPA's investment stems from its heavy reliance on social and behavioral science insights for critical decision-making, including training protocols, intelligence analysis, and geopolitical forecasting. The agency's ultimate goal was to develop tools capable of assigning confidence scores to research results, proactively assessing reliability before findings are used to inform policy.

To achieve this, researchers extracted nearly 4,000 claims from papers published between 2009 and 2018, spanning 62 journals across 11 diverse disciplines. This comprehensive approach ensured a broad cross-section of the social science literature was put to the test.

## Understanding Reproducibility vs. Replicability

Crucial to understanding SCORE's findings is the distinction between two key terms:

* Reproducibility refers to running the exact same analysis on the original data. It's like auditing an accountant's spreadsheet: if you follow the same steps with the same inputs, do you get the same reported result? This tests the transparency and accuracy of the math.
* Replicability is a more demanding test, requiring the collection of brand new data to test the same underlying research question. This is akin to opening a second franchise of a successful business in a new location to see if the business model holds up with new customers. It verifies if the phenomenon actually holds true in the real world, outside the original study's specific context.

## The Transparency Roadblock

Before even reaching replicability, the SCORE team encountered a significant systemic issue: a lack of transparency. Data was available for only 24 percent of a sample of 600 assessed papers. For roughly three-quarters of the sampled papers, the raw data needed to verify findings simply wasn't shared, making independent assessment functionally impossible.

Even among the 143 papers for which a reproduction was attempted, only 54 percent could be precisely reproduced. In total, 74 percent could be at least *approximately* reproduced, meaning results were within a close margin of the original.

### The Transparency Dividend

The report highlighted a clear "transparency dividend": when original authors had proactively shared their data and analytical code, exact reproducibility skyrocketed to 77 percent, and approximate reproducibility jumped to 91 percent. This demonstrates that methodological sloppiness is often a consequence of closed systems. Fields like political science and economics, which have instituted strict data and code-sharing mandates, dramatically outperformed others in reproducibility, supporting the case that transparency directly improves scientific quality.

## The Replicability Stress Test Results

The most striking finding came from the replicability phase, where independent researchers gathered fresh data to test 164 previously published papers, encompassing 274 specific claims. The results were stark: only 49.3 percent of those 164 papers successfully replicated. If individual claims are considered, the success rate was 55.1 percent.

This means roughly half of what was published did not hold up when new data was collected. Breaking it down further:

* About a third of new analyses yielded results very close to the original.
* Approximately a quarter found no clear effects at all, with the phenomenon vanishing.
* Around two percent of cases found results pointing in the exact *opposite* direction of the original claim.

This has profound implications for policymakers, journalists, and anyone making decisions based on social science research. A failure to replicate does not automatically imply fraud; human behavior is complex and context-dependent. However, the roughly 50 percent failure rate reveals a systemic issue: many findings are treated as universal truths when they are often fragile, context-dependent observations.

## The Shrinking Effects: A Devastating Truth

Even for studies that *did* replicate, the SCORE project uncovered another unsettling truth: the magnitude of the effects often shrank dramatically. The median reduction in estimated effect size was more than 50 percent, and explained variance dropped by over 80 percent.

* Effect size measures *how much* of a difference an intervention makes. A 50% drop means an intervention is half as powerful as initially thought.
* Explained variance (R-squared) measures how much of the variation in an outcome can be attributed to the intervention. An 80% drop means the original "signal" was nearly drowned out by noise in the replication, making the finding far less useful for prediction or policy.

### Why Effects Shrink: Publication Bias

This phenomenon is largely driven by publication bias and the "Winner's Curse." Academic journals heavily favor publishing novel, statistically significant, and large effects. This creates an incentive for researchers, even subconsciously, to engage in "p-hacking" or exploit "researcher degrees of freedom," tweaking analyses to achieve the largest possible effect size for publication. When an independent team replicates the study without this incentive, the true, often much smaller, effect size reveals itself. This suggests that an initial finding should be viewed not as a definitive measurement, but as a *likely upper bound*.

## Predicting Credibility: Humans vs. Machines

Recognizing that empirical replication is slow and expensive, DARPA also explored whether tools could predict credibility. They set up a forecasting tournament pitting human expert judgment against artificial intelligence and machine-learning algorithms.

* Human forecasters performed remarkably well, achieving up to a 78 percent success rate in predicting replicability. This indicates that experienced researchers possess a finely tuned "BS detector," understanding experimental design nuances and spotting exaggerated claims.
* In stark contrast, initial machine-learning and algorithmic methods largely failed to effectively predict replicability. While AI excels at finding patterns, assessing scientific credibility requires deep contextual reasoning and semantic comprehension that current AI models lack.

This suggests that for the complex, high-stakes world of scientific validity, algorithmic fact-checking is not yet ready to replace human scientific skepticism or deep peer review.

## Key Takeaways for Society

The SCORE project offers critical insights for how society should approach scientific findings:

1. Distinguish Reproducibility from Replicability: One verifies the math, the other verifies the underlying truth of the phenomenon. Both are vital, but a finding's math being correct doesn't automatically mean the phenomenon is true.
2. Recalibrate Expectations of Breakthroughs: The massive shrinkage of effect sizes and explained variance indicates that early science is almost always "louder than reality." Initial findings should be seen as likely upper bounds, not definitive measurements.
3. Transparency is a Powerful Cure: The replication crisis can seem overwhelming, but open science mandates work. Requiring researchers to share their data and code demonstrably improves the structural integrity of the entire scientific enterprise.

While challenges remain in changing academic incentive structures to reward rigorous, transparent science over flashy breakthroughs, the SCORE project ultimately demonstrates that science is working as it should, by turning its exacting, skeptical lens upon itself.

Full Transcript

Host: A landmark study has just revealed that nearly half of social science findings fail to replicate when tested by independent researchers. And even when they do, the actual impact of those findings often shrinks dramatically, sometimes by over 80 percent.
Expert: That's right. It's a sobering assessment, but also one of the most rigorous and comprehensive audits of scientific credibility ever undertaken. It forces a reconsideration of how research is interpreted and applied, shaping everything from public policy to daily headlines.
Host: So, if findings are essentially a coin flip, and their true impact might be a fraction of what was originally claimed, how much of what is commonly understood is built on sand?
Expert: That is precisely the question this project, known as SCORE, set out to answer. It’s an unprecedented stress test of the social and behavioral sciences.
Host: The scale and ambition of this project are difficult to overstate. Eight hundred and sixty-five researchers, globally, mobilized to essentially check the homework of their peers. What was the driving force behind such an enormous undertaking?
Expert: The initiative, the Systematizing Confidence in Open Research and Evidence project, or SCORE, was spearheaded by the Center for Open Science, but it received significant funding from DARPA, the U.S. Defense Advanced Research Projects Agency. And that might seem counterintuitive at first glance. Why would the military be so invested in the reproducibility of social science?
Host: It does seem like an unusual pairing. One might expect DARPA to be focused on AI or advanced weaponry, not the nuances of psychology papers.
Expert: Exactly. But the Department of Defense, like any major policymaking body, relies heavily on insights from fields like behavioral economics, psychology, sociology, and political science. This research informs everything from training protocols and intelligence analysis to psychological operations and geopolitical forecasting. If the foundational research underpinning these decisions is flawed, the resulting policies and strategies are built on very shaky ground. DARPA’s ultimate goal was highly ambitious: to develop tools that could assign confidence scores to different social and behavioral science research results.
Host: So, it wasn't just about identifying problems, but about finding a way to proactively assess the reliability of studies before they're used to make critical decisions.
Expert: Precisely. They extracted nearly 4,000 claims from papers published between 2009 and 2018, spanning 62 journals across 11 diverse disciplines. This wasn't some niche corner of academia; it was a broad cross-section of the literature. The sheer administrative and scientific willpower required for this kind of retrospective audit is remarkable. Science is typically geared toward discovering the *next* new thing, not meticulously re-examining the old.
Host: It really underscores the idea of science as a self-correcting process, even if that correction is a massive, multi-year undertaking involving hundreds of institutions. Before delving into the specific findings, it’s important to clarify some key terms, because the SCORE project makes very precise distinctions that are critical to understanding its results. What's the difference between "reproducibility" and "replicability"?
Expert: That distinction is absolutely fundamental. In metascience, precision is everything, and these terms often get conflated in public discourse, but they measure entirely different aspects of scientific validity.
Host: To begin with reproducibility, what does that mean in the context of this study?
Expert: Reproducibility means running the exact same analysis on the original data. Think of it like auditing an accountant's spreadsheet. You have the exact same receipts (that's the data) and the exact same ledger, which is the code and analytical steps the accountant used. The question is: if you punch the numbers into a calculator using the same method, do you arrive at the same profit margin the accountant reported? Reproducibility tests the transparency and accuracy of the math.
Host: So, it's about whether you can get the same answer if you follow the same steps with the same inputs. What about replicability then?
Expert: Replicability is a much more demanding test. This requires gathering brand new data to test the same underlying research question. The analogy here is like opening a second franchise location of a successful business. You're using the same business model (the methodology), but you're testing it in a new town with new customers. You're collecting new data, in other words, to see if the phenomenon actually holds up in the real world, outside of the specific context of the original study. Does the business concept work in a new environment?
Host: That distinction is crucial. One is about verifying the calculation, the other is about verifying the phenomenon itself. Before they could even get to the replicability tests, the SCORE team ran into a significant hurdle, didn't they? The transparency roadblock.
Expert: They did, and it highlights a systemic issue. The report revealed that data was available for only 24 percent of a sample of 600 assessed papers. For roughly three-quarters of the sampled papers, the original authors had not shared the raw data required to even verify their findings. This lack of transparency makes independent assessment, even just reproducibility, functionally impossible in many cases.
Host: That's a staggering figure. It means you can't even check the math for most published studies.
Expert: It's a massive barrier to scientific progress. However, when the SCORE reviewers *could* get their hands on the data, the results were highly illuminating. For the 143 papers where they attempted a reproduction, only 54 percent could be precisely reproduced. In total, 74 percent could be at least *approximately* reproduced, meaning the result was within a close margin of the original.
Host: But there was a "transparency dividend," as the report calls it, when authors *did* share their data.
Expert: Absolutely. When the original authors had proactively shared their data and analytical code openly, exact reproducibility skyrocketed to 77 percent, and approximate reproducibility jumped to 91 percent. This is a highly actionable takeaway for listeners. The data shows that methodological sloppiness or bad math isn't an unchangeable law of nature; it is a direct consequence of closed systems.
Host: So, if journals just demand transparency, the rigor improves dramatically.
Expert: It's an infrastructure win. The SCORE project found that political science and economics, for example, dramatically outperformed other fields in reproducibility. This wasn't because economists are inherently better at math, but because their leading journals instituted strict data and code-sharing mandates years ago. When transparency is required, the quality of the science instantly improves.
Host: That's a powerful argument for open science practices. But the real headline-grabber, the "stress test" that everyone is talking about, is the replicability phase. What did they find when they went out and collected new data?
Expert: This is where we hit the coin-flip survival rate. In the replication phase, independent researchers gathered fresh data to empirically test 164 previously published papers, which encompassed 274 specific claims. The most striking finding was that only 49.3 percent of those 164 papers successfully replicated. If you look at individual claims, it was 55.1 percent.
Host: So, roughly half of what was published didn't hold up when someone else tried to find the same phenomenon with new data.
Expert: That's the reality. And "successful replication" here meant the new study achieved statistical significance with the same pattern and direction as the original study. To break it down further: only about a third of the new analyses yielded results that were *very close* to the original study's numbers. In about a quarter of the cases, no clear effects were found at all; the phenomenon completely vanished. And in around two percent of cases, the results actually pointed in the exact *opposite* direction of the original claim.
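The replication criterion described in the turn above (a statistically significant result in the same direction as the original) can be sketched in a few lines. This is only an illustration under stated assumptions, not SCORE's actual procedure: the `Result` type, the `classify` function, the example numbers, and the conventional 0.05 threshold are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Result:
    effect: float   # signed effect estimate (hypothetical units)
    p_value: float  # two-sided p-value reported for that estimate


def classify(original: Result, replication: Result, alpha: float = 0.05) -> str:
    """Label a replication attempt using the simple rule described above.

    alpha = 0.05 is an assumed conventional threshold, not a SCORE parameter.
    """
    if replication.p_value >= alpha:
        # The new data show no statistically clear effect.
        return "no clear effect"
    if original.effect * replication.effect > 0:
        # Significant and pointing the same way as the original claim.
        return "replicated"
    # Significant, but pointing the other way.
    return "opposite direction"


if __name__ == "__main__":
    original = Result(effect=0.40, p_value=0.01)
    for rep in (Result(0.18, 0.03), Result(0.05, 0.40), Result(-0.25, 0.02)):
        print(rep, "->", classify(original, rep))
```

Note that a replication can count as "replicated" under this rule even when its effect is much smaller than the original, which is why the effect-size comparison discussed next is a separate question.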
Host: That has profound implications for policymakers, journalists, and frankly, anyone who makes decisions based on social science research.
Expert: It absolutely does. Imagine a mayor deciding to fund a new multimillion-dollar anti-poverty intervention based on a single study published in a prestigious sociology journal. The SCORE data suggests there's roughly a 50 percent chance the intervention's foundational premise will not hold up when applied to a new population, or in a slightly different context. Every day, science journalists write definitive headlines like "Study proves X causes Y," and policymakers draft legislation based on single, newly published papers. This data suggests that half the time, that foundation is incredibly fragile.
Host: It's important to clarify here that a failure to replicate doesn't necessarily mean the original researchers were fraudulent or intentionally sloppy.
Expert: That's a critical point to emphasize. A failure to replicate does *not* automatically imply fraud. Human behavior is incredibly complex and context-dependent. A psychological intervention that worked on a cohort of university undergraduates in 2011 might genuinely not work on a diverse sample of adults in 2026. Context changes, populations change, and small differences in methodology can have large effects. But the roughly 50 percent failure rate highlights a systemic issue: the scientific literature is filled with findings that are treated as universal truths when, in reality, they are often fragile, context-dependent observations.
Host: And even when studies *did* replicate, the SCORE project uncovered another unsettling truth: the magnitude of the effects often shrank dramatically.
Expert: This finding is perhaps the most subtle but devastating for the practical utility of social science. Even when a study successfully replicated, meaning the new data confirmed the phenomenon actually exists, the SCORE project revealed a median reduction of more than 50 percent in the estimated effect size. And crucially, the explained variance dropped by over 80 percent.
Host: To break those terms down, because "effect size" and "explained variance" might not be immediately clear to everyone. What does a 50 percent reduction in effect size actually mean?
Expert: Effect size measures *how much* of a difference an intervention makes. So, if an original study claimed a new teaching method raised test scores by 20 points, a 50 percent drop in effect size means the replication found it only raised scores by 10 points. The effect is still there, but it is vastly less impressive and less practically significant. It means the intervention is half as powerful as initially thought.
Host: And the explained variance, dropping by over 80 percent, sounds catastrophic.
Expert: It is. Explained variance, often denoted as R-squared in statistics, measures how much of the variation in an outcome can be attributed to the intervention, as opposed to random noise or other factors. So, if an original paper claimed their intervention explained 15 percent of the variance in employee productivity, a drop of more than 80 percent means the replication found it actually explained less than 3 percent. The original "signal" was nearly drowned out by the noise in the replication. This makes the finding far less useful for prediction or policy intervention.
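The arithmetic behind the two examples above (20 points falling to 10, and 15 percent of variance falling to about 3 percent) can be checked with a short sketch. The helper names are hypothetical. The last two functions rest on an additional assumption that does not come from the episode: that the effect size is a correlation-type measure r with a single predictor, so that R-squared equals r squared. Under that assumption only, a drop of a little over 55 percent in r corresponds to an 80 percent drop in explained variance.

```python
import math


def shrink(original: float, reduction: float) -> float:
    """Value left after a fractional reduction (0.50 means a 50 percent drop)."""
    return original * (1.0 - reduction)


def variance_drop(effect_drop: float) -> float:
    """Fractional drop in R-squared implied by a fractional drop in r.

    Assumes a single predictor, so that R-squared = r**2 (an assumption of
    this sketch, not something stated in the episode).
    """
    return 1.0 - (1.0 - effect_drop) ** 2


def effect_drop_for(target_variance_drop: float) -> float:
    """Inverse of variance_drop under the same R-squared = r**2 assumption."""
    return 1.0 - math.sqrt(1.0 - target_variance_drop)


if __name__ == "__main__":
    print(f"test-score gain after a 50% drop: {shrink(20.0, 0.50):.1f} points")
    print(f"explained variance after an 80% drop: {shrink(0.15, 0.80):.3f}")
    print(f"R-squared drop implied by halving r: {variance_drop(0.50):.2f}")
    print(f"drop in r needed for an 80% R-squared drop: {effect_drop_for(0.80):.3f}")
```

Because explained variance scales with the square of a correlation-type effect, it shrinks faster than the effect itself, which is one way the two headline figures can be read together.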
Host: Why does this happen? Why do effect sizes shrink so dramatically in replications?
Expert: A primary driver is what's known as publication bias and the "Winner's Curse." Academic journals are highly competitive and heavily favor publishing novel, statistically significant, and large effects. Imagine ten different research teams testing a subtle psychological phenomenon. Eight might find nothing, one might find a tiny effect, and one, purely by statistical chance, might find a massive effect. That team with the massive effect gets published in a top-tier journal; the other nine often end up in a desk drawer.
Host: So, the initial published finding often represents a kind of "best-case statistical scenario."
Expert: Exactly. The original researchers might have, even subconsciously, engaged in "p-hacking" or exploited "researcher degrees of freedom," tweaking their analytical models or data exclusions until they achieved the largest possible effect size to secure publication. When an independent, dispassionate team runs the exact same experiment without that incentive to find a massive result, the true, much smaller effect size often reveals itself. This fundamentally changes how new studies should be viewed: the initial finding should be seen not as the definitive measurement of a phenomenon, but as a *likely upper bound*. The reality is almost always smaller, messier, and less impactful.
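The "ten teams" thought experiment above can be turned into a small simulation of publication bias. This is a sketch under stated assumptions, not an analysis of SCORE data: the true effect, sample size, number of studies, and the rule "publish only significant positive results" are all hypothetical, and each study's estimate is drawn from a normal approximation with unit outcome variance. It models selection on significance only; it does not model p-hacking.

```python
import random
import statistics


def simulate(true_effect: float = 0.1, n_per_group: int = 30,
             n_studies: int = 2000, seed: int = 0):
    """Return (mean of all estimates, mean of 'published' estimates, number published).

    Every parameter is a hypothetical choice made for illustration.
    """
    rng = random.Random(seed)
    # Standard error of a difference in two group means, unit variance per group.
    se = (2.0 / n_per_group) ** 0.5
    all_estimates, published = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)  # one study's noisy estimate
        all_estimates.append(estimate)
        if estimate / se > 1.96:  # significant and positive: gets "published"
            published.append(estimate)
    return statistics.mean(all_estimates), statistics.mean(published), len(published)


if __name__ == "__main__":
    all_mean, pub_mean, n_pub = simulate()
    print(f"true effect: 0.10, mean of all studies: {all_mean:.3f}")
    print(f"mean of the {n_pub} published studies: {pub_mean:.3f}")
```

With these settings a study is only "published" if its estimate exceeds roughly 0.5, about five times the true effect, so the published average necessarily overstates it while the average across all studies stays close to the truth. An unbiased replication of one of the published studies would therefore be expected to find a much smaller effect.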
Host: That's a profound shift in perspective for how to consume scientific news. Moving to the cutting edge of this project, SCORE also tried to answer a crucial question: can we predict which papers will fail to replicate *before* we spend all that time and money testing them? They set up a forecasting tournament, pitting humans against machines.
Expert: This was one of the most innovative elements of the project. Because empirical replication is incredibly slow and expensive, DARPA wanted to know if they could develop tools to predict credibility. So, they set up a massive tournament, comparing human expert judgment against artificial intelligence and machine-learning algorithms.
Host: And what did they find? Which came out on top?
Expert: The humans performed remarkably well. Independent teams of human forecasters, like those from repliCATS and Replication Markets, evaluated the papers. They looked at sample sizes, the plausibility of the claims, methodological rigor, and statistical reporting. The report shows that human forecasters achieved up to a 78 percent success rate in predicting replicability using the best-performing metric.
Host: Seventy-eight percent is quite high for predicting something so complex. What does that tell us about human expertise?
Expert: It suggests that experienced researchers possess a finely tuned "BS detector." They understand the nuances of experimental design, they know which sub-fields are prone to exaggeration, and they can spot when a claim is too neat, too perfect, or too large to be true. It's a testament to the value of deep, contextual knowledge and critical thinking.
Host: So, human skepticism triumphed. What about the algorithms? How did the AI fare in predicting bad science?
Expert: In stark contrast, the initial machine-learning and algorithmic methods, developed by teams at various universities and tech companies, largely failed to effectively predict replicability.
Host: That's surprising in an era of such rapid AI advancement, where algorithms are pitched as solutions to everything from content moderation to fact-checking. Why did AI struggle here?
Expert: Machine learning excels at finding patterns in data, but assessing scientific credibility requires deep contextual reasoning. An algorithm can easily check if a p-value is below 0.05, or scan the text for certain keywords. But evaluating whether an experimental design actually isolates the variable it claims to be testing, or whether the theoretical framework makes logical sense, requires a level of semantic and scientific comprehension that current AI models simply lack. As one of the project leaders, Brian Nosek, noted, "A lot more evidence is needed before we would be confident in a valid, scalable solution" regarding automated credibility scoring.
Host: So, for all the hype, when it comes to the complex, high-stakes world of scientific validity, algorithmic fact-checking isn't ready to replace human scientific skepticism or deep human peer review.
Expert: Not yet. The project demonstrates that human critical thinking remains irreplaceable for now.
Host: This SCORE project provides much to consider. What are the key takeaways for listeners, and for society in general, on how to approach scientific findings moving forward?
Expert: There are three critical insights. First, the distinction between reproducibility and replicability is not academic pedantry; it's vital. One tests the transparency of the math, while the other tests the underlying truth of the phenomenon. It cannot be assumed that a finding is true just because the math is correct.
Host: And second?
Expert: Expectations of scientific breakthroughs must be recalibrated. The massive 50 percent drop in effect sizes and 80 percent drop in explained variance when independent researchers repeat the work indicates that early science is almost always louder than reality. An initial finding is often a maximum, not a definitive measurement.
Host: And finally, a path forward, perhaps?
Expert: The third takeaway is that transparency is a powerful cure. The replication crisis can seem overwhelming, but the SCORE project offers a clear, actionable fix. Exact reproducibility skyrocketed from 54 percent to 77 percent simply when original authors shared their data and code. Open science mandates work. If funders and journals require researchers to show their work, the structural integrity of the entire scientific enterprise improves instantly.
Host: That makes a lot of sense. But if open data works so well, why is the data sharing rate still so low, around 24 percent? How do we change the academic tenure and grant-funding systems to reward rigorous, transparent, incremental science rather than just flashy, single-study breakthroughs?
Expert: That's a fundamental challenge to the incentive structures within academia. It's not just about what is scientifically sound, but what advances careers.
Host: And what about the future of AI in this space? While human forecasters dominated the machines in this project, AI is advancing rapidly. Will we eventually reach a point where an AI can ingest a PDF of a study and instantly flag methodological flaws that humans might miss? And if so, how will that change the peer-review process?
Expert: The potential is there, but the SCORE project serves as a crucial reminder that true understanding often requires more than pattern recognition; it requires interpretation and judgment.
Host: The SCORE project does not indicate that science is broken. It suggests that science is working exactly as it should, by finally turning its exacting, skeptical lens upon itself.