Contaminated corpora: How the retraction crisis is being silently encoded into AI scientific knowledge

Abstract

This paper argues that the retraction crisis in scholarly publishing is increasingly entering large language model (LLM) training datasets through commercial publisher–AI licensing agreements that provide entire journal archives without retraction filtering. Using a conceptual-analytical approach, the paper synthesizes three strands of empirical literature: studies on retraction growth and post-retraction citation, research examining LLM interactions with retracted papers, and documented publisher–AI licensing agreements. The paper advances three theoretical findings. First, it identifies a direct and commercially formalised pipeline through which retracted papers enter LLM training corpora: bulk licensing agreements in which publishers sell entire journal archives to AI developers without retraction-filtered exports, thereby bypassing the retraction notification infrastructure that already exists and functions for human readers. Second, it introduces the contamination-absence asymmetry: citation decay removes valid evidence from AI knowledge while retraction propagation inserts invalid evidence into it, producing a compound failure mode in which AI systems simultaneously know less than they should and believe more than they should. Third, it proposes the governance gap hypothesis: that retraction infrastructure — Retraction Watch, Crossref retraction flags, PubMed retraction notices — constitutes a functioning information system that stops at the human reader and was never designed to intercept AI ingestion, representing a gap in information governance whose closure is a library and information science responsibility. The study concludes that ensuring the integrity of AI-generated scientific knowledge requires stronger information governance and the active involvement of library and information science professionals in the oversight of AI training datasets.

Keywords

contaminated corpora; retraction crisis; large language models; AI training data; information governance; scientific integrity; publisher-AI licensing

Introduction

In July 2024, Informa, the parent company of Taylor & Francis, announced a $10 million data-access agreement with Microsoft, granting non-exclusive access to content from nearly 3000 academic journals for the purpose of training artificial intelligence models (Palmer, 2024). Informa publicly projected that AI-related revenue for the year would exceed $75 million (Palmer, 2024). The scholars whose work was included in the deal were given no prior notice and no opportunity to opt out. As Dave Hansen, Executive Director of Authors Alliance, observed, the deal was struck against a backdrop of standard academic publishing contracts that involve broad transfers of copyright from authors to publishers, effectively granting publishers control over sublicensing rights — including, in many cases, the right to license content to AI developers without the author’s knowledge or consent (Hansen, 2024). Informa was neither alone nor first: Wiley, Axel Springer, Le Monde, and other major publishers have entered comparable agreements, and the broader trend of scholarly publishers licensing their archives to train AI models is now well-documented and commercially established (Kwon, 2024).

What has not been documented, discussed, or governed is what these archives contain. A journal archive sold in bulk to an AI developer contains, by default, every paper that publisher has ever published — including every paper that has since been retracted. The retraction crisis is among the most discussed integrity problems in contemporary science. More than 10,000 research papers were retracted in 2023 alone, a figure that integrity experts described as likely representing only the visible portion of a substantially larger problem of invalid science circulating in the literature (Van Noorden, 2023). Analysis of European biomedical literature found that retraction rates increased fourfold between 2000 and 2021, with two-thirds of retractions attributable to research misconduct including data fabrication, image manipulation, and authorship fraud (Else, 2024). The retracted papers in these publisher archives represent formally invalidated claims about drug efficacy, disease mechanisms, biological processes, and social phenomena — claims determined, through editorial and investigative processes, to be unreliable, fabricated, or fraudulent.

The scholarly community has built, over decades, a functioning infrastructure to manage this problem: Retraction Watch, the world’s largest database of retractions; Crossref’s retraction flag system, which links DOIs to retraction notices; PubMed’s retraction labelling; and filtering capabilities in major academic databases. These systems protect human readers from unknowingly building on invalid foundations. What they do not do — what they were never designed to do — is intercept the ingestion of retracted papers into AI training corpora. In the absence of any contractual requirement, technical protocol, or professional standard mandating retraction filtering in publisher-AI licensing agreements, the pipeline from publisher archive to AI training dataset bypasses every node in the retraction notification infrastructure. The retraction crisis, which the scholarly community has been working to contain for two decades, is being silently re-encoded into the parametric scientific knowledge of AI systems. This paper argues that this constitutes a structural information governance failure, and that identifying, theorising, and closing the governance gap that enables it is a responsibility that falls squarely within the mandate of the library and information science profession.

The retraction crisis: Scale, causes, and structural concentration

The scale of the retraction problem in contemporary science is both large and, crucially, concentrated in ways that have direct implications for the AI training pipeline. More than 10,000 research papers were retracted in 2023 — a figure that represented a record high and that Van Noorden (2023) described as likely representing only the visible portion of a substantially larger problem of invalid science circulating in the literature. The causes of this surge are neither uniform nor randomly distributed. Petrou’s (2024) analysis in The Scholarly Kitchen provides the most granular publicly available decomposition of retraction drivers, identifying three primary factors: industrial-scale activity from paper mills, the exceptional growth of research output from China, and what he terms “journal breaches” — instances where a single journal issues more than 20 retractions in a calendar year, indicating severe compromise of its peer-review process.

The concentration of retractions within journal breaches is particularly significant for the AI contamination problem. In 2014, journal breaches accounted for approximately 10% of all retractions. By 2022, 34 journals met the breach criteria and accounted for 51% of all retractions globally (Petrou, 2024). More than half of all retracted papers are concentrated in a small number of severely compromised journals — journals that are, by definition, present in the full archives that publishers are licensing to AI developers. China’s contribution to the retraction landscape has similarly expanded, from 16% of worldwide retractions in 2014 to 54% by 2022, with particularly high retraction rates in biomedical disciplines: 1.1% of papers in Biochemistry, Genetics and Molecular Biology from Chinese institutions were retracted in the study period (Petrou, 2024). Petrou’s analysis also demonstrates that when journal breaches, serial offenders, and the Chinese biomedical surge are excluded, the global retraction rate has remained relatively flat since 2014 — a finding that underscores the structural, systemic nature of the problem rather than a generalised deterioration in research standards.
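The breach criterion can be made concrete with a short sketch. It is illustrative only: the journal names and counts are invented toy data rather than Petrou's figures, and the function name and threshold constant are assumptions introduced here. The sketch applies the definition given above (more than 20 retractions by a single journal in a calendar year) and computes the share of that year's retractions falling within breached journals.

```python
from collections import Counter

# Assumption: the threshold follows the definition quoted above,
# "more than 20 retractions in a calendar year".
BREACH_THRESHOLD = 20


def breach_share(retractions, year):
    """Return (breached journals, their share of that year's retractions).

    `retractions` is an iterable of (journal, year) pairs, one per retracted paper.
    """
    counts = Counter(journal for journal, y in retractions if y == year)
    total = sum(counts.values())
    breached = {journal for journal, n in counts.items() if n > BREACH_THRESHOLD}
    in_breach = sum(counts[journal] for journal in breached)
    share = in_breach / total if total else 0.0
    return breached, share


# Toy data, invented for illustration only.
toy = (
    [("Journal A", 2022)] * 30
    + [("Journal B", 2022)] * 20   # exactly 20: not a breach under "more than 20"
    + [("Journal C", 2022)] * 10
)
breached, share = breach_share(toy, 2022)
print(sorted(breached), share)  # ['Journal A'] 0.5
```

In this toy case a single journal out of three accounts for half of all retractions, which mirrors in miniature the kind of concentration that Petrou reports.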

The European picture provides a different but equally consequential perspective. Else (2024), in a study of over 2000 retracted biomedical papers with European corresponding authors published between 2000 and mid-2021, found that retraction rates increased fourfold over this period. Research misconduct accounted for nearly 67% of these retractions, while only around 16% were attributable to honest error. The leading causes of misconduct-related retractions shifted substantially over the two decades: ethical problems and authorship issues dominated in 2000, but by 2020 the leading categories were “unreliable data” — a classification strongly associated with paper-mill-generated research — and duplication (Else, 2024). As Else (2024) reports in Nature, research integrity specialists note that improved detection methods — the launch of PubPeer in 2012, the routine adoption of image-analysis and plagiarism-detection software by publishers, and greater institutional willingness to act on integrity concerns — account for a significant portion of the rising retraction count, suggesting that the total volume of invalid science circulating in the literature exceeds what retraction rates alone indicate.

A critical feature of the retraction landscape directly relevant to AI contamination is the well-documented continued citation of retracted papers after their retraction. Hsiao and Schneider (2022), in a database-wide analysis of 7813 retracted PubMed papers, found that retracted papers continued to be cited after retraction and that only 5.4% of 13,252 post-retraction citation contexts explicitly acknowledged the retraction. Woo and Walsh (2024) found that post-retraction citations are disproportionately concentrated among audiences unfamiliar with the retracted paper’s specific field — precisely the audiences most dependent on secondary knowledge sources, including AI systems, to navigate unfamiliar literature. Tang (2023), in a study of retracted papers from Nature and Science published between 2010 and 2018, found that post-retraction citations accounted for 47.7% and 40.9% of total citations (median values) for retracted papers in those journals respectively, demonstrating that even the most visible and well-resourced publication venues cannot prevent the continued circulation of invalidated work. The combination of continued post-retraction citation and the inadequate acknowledgement of retraction status means that retracted papers are embedded not only in publisher archives but in the citation networks of the surrounding literature, further multiplying their potential footprint in AI training corpora.

The publisher-AI licensing pipeline: How retracted papers enter AI training corpora

The pathway by which retracted papers enter LLM training corpora is not hypothetical. It is commercially formalised, contractually documented, and currently active. The Taylor & Francis-Microsoft deal, which grants Microsoft non-exclusive access to content from nearly 3000 journals for AI training, was executed without any publicly disclosed mechanism for retraction filtering (Palmer, 2024). The legal architecture enabling the deal is straightforward: standard academic publishing contracts involve broad copyright transfers from authors to publishers, giving publishers the right to sublicense content for purposes analogous to existing database subscriptions — a category that, as James Grimmelmann of Cornell Law School noted, “almost certainly” encompasses bulk AI training access (Palmer, 2024). Authors have, in most cases, already signed away the rights that would allow them to object. As Hansen (2024) observes, publishers may not have clear or documented rights to all works in their archives, and the validity of large-scale licensing schemes may be legally contestable in some cases, but the practical reality is that deals are being executed, archives are being transferred, and the retraction status of included papers is not part of the transaction.

The commercial logic driving these deals creates a structural incentive against retraction filtering. From the perspective of a publisher licensing its archive, a retraction-filtered dataset is a smaller and therefore less valuable dataset. Taylor & Francis’s parent company Informa projected over $75 million in AI-related revenue for 2024, revenue directly proportional to the volume of content transferred (Palmer, 2024). Retraction-filtered exports require additional technical infrastructure: linking the archive to retraction databases, maintaining the filter as new retractions are issued, and verifying coverage across Retraction Watch, Crossref, and PubMed. None of these costs is trivial, and none generates additional revenue. In the absence of any external requirement to incur them, publishers have no financial incentive to do so.
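What a retraction-filtered export involves can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not a description of any publisher's actual pipeline: the record layout (a dictionary with `doi` and `title` keys), the function names, and the DOIs are all hypothetical, and the retraction list is assumed to have already been compiled from sources such as Retraction Watch, Crossref, and PubMed.

```python
def normalise_doi(doi):
    """Lower-case a DOI and strip common prefixes so that variants compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def filter_export(archive, retracted_dois):
    """Split archive records into (kept, withheld) according to retraction status."""
    retracted = {normalise_doi(d) for d in retracted_dois}
    kept, withheld = [], []
    for record in archive:
        if normalise_doi(record["doi"]) in retracted:
            withheld.append(record)
        else:
            kept.append(record)
    return kept, withheld


# Hypothetical records and DOIs, invented for illustration only.
archive = [
    {"doi": "10.1234/example.001", "title": "Valid study"},
    {"doi": "https://doi.org/10.1234/EXAMPLE.002", "title": "Later retracted study"},
]
retraction_list = ["10.1234/example.002"]

kept, withheld = filter_export(archive, retraction_list)
print([r["title"] for r in kept])      # ['Valid study']
print([r["title"] for r in withheld])  # ['Later retracted study']
```

The filtering step itself is simple. The recurring cost lies in the parts the sketch takes as given: assembling the retraction list across sources, reconciling identifier variants, and re-running the filter as new retractions are issued.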

The author community most directly affected by these deals has had no voice in their design. Palmer (2024) reports scholars reacting with shock and dismay to the Taylor & Francis announcement, with academics noting not only the absence of prior notice but the deeper structural inequity: scholars already give away their labour and rights for free through the academic publishing system, and these deals represent publishers extracting further commercial value from that labour without compensation or consent. Heather Joseph of SPARC characterised the situation as a critical “balancing act” and called for the development of standards giving authors control over whether their work is included in AI training (Palmer, 2024). Those standards do not yet exist. In their absence, the default is complete inclusion — including the retracted portion of every archive.

The scale of the resulting contamination is estimable, if not yet precisely measured. The Amend platform and the Web of Science database catalogue over 55,000 retracted articles published between 2000 and 2024 (Zhou et al., 2025). These are distributed across the archives of publishers that are actively licensing their content for AI training. The retraction of any given paper does not remove it from the publisher’s archive; it appends a retraction notice to it. The original paper, with its original claims, remains in the archive and, in the absence of retraction filtering, in the licensed dataset. An LLM trained on a bulk archive will encode the retracted paper’s claims as part of its parametric scientific knowledge. The retraction notice, filed in Retraction Watch and flagged in Crossref, was never part of the data the model was given. The retraction updated the human record. It did not update the model.

Evidence that LLMs actively use retracted papers

The hypothesis that AI systems encode and reproduce retracted scientific claims is no longer speculative. Two peer-reviewed studies published in 2025 provide direct empirical evidence that LLMs actively use retracted papers to answer scientific questions, without acknowledging or in most cases knowing their retraction status.

Gu et al. (2025) conducted the most targeted investigation to date, asking ChatGPT (GPT-4o) questions derived from the specific content of 21 retracted papers about cancer imaging. The chatbot’s answers referenced retracted papers in five cases — approximately one in four — and advised caution in only three of those five instances, meaning that in at least two cases the system reproduced claims from formally invalidated papers without any acknowledgement of their retraction status (Gu et al., 2025). The study concludes that there is at least a 10% chance that ChatGPT will use retracted articles to answer questions without recognising their retracted status, a particularly significant risk in a medical domain where the stakes of relying on invalid evidence are directly clinical. Thelwall et al. (2025) conducted a broader assessment, identifying 217 retracted or otherwise problematic academic studies with high altmetric scores and asking ChatGPT to evaluate their quality 30 times each. Across 6510 evaluation reports, not one mentioned that the article was retracted or had relevant integrity concerns. ChatGPT assigned many of the papers relatively high scores based on parametric knowledge that treated the retracted work as valid. In a follow-up experiment, the team extracted 61 specific claims from retracted articles and asked ChatGPT whether each was true. Almost two-thirds of the time, ChatGPT reported that the claims were likely true, partially true, or consistent with research, even for claims demonstrably shown to be false (Thelwall et al., 2025).
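The proportions reported in these two studies can be checked with simple arithmetic. The sketch below uses only the counts quoted above; the variable names are introduced here for illustration, and how each study derived its own headline percentages is not restated in this paper, so the comparison is indicative only.

```python
# Gu et al. (2025): 21 retracted papers, 5 referenced, caution advised in 3 of the 5.
papers = 21
referenced = 5
cautioned = 3
uncautioned = referenced - cautioned

referenced_rate = referenced / papers
uncautioned_rate = uncautioned / papers
print(f"referenced a retracted paper: {referenced_rate:.1%}")   # 23.8%
print(f"referenced without caution:   {uncautioned_rate:.1%}")  # 9.5%

# Thelwall et al. (2025): 217 papers, each evaluated 30 times.
reports = 217 * 30
print(reports)  # 6510
```

Two uncautioned uses out of 21 questions is about 9.5%, which is consistent in magnitude with the roughly 10% risk the study reports, and 217 papers evaluated 30 times each yields the 6510 reports cited.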

These findings are not anomalies. They are the predictable consequence of a training process that cannot distinguish valid from retracted content in a corpus constructed without retraction filtering. An LLM generates outputs by drawing on parametric associations formed during training. If a retracted paper was present in the training corpus (which the bulk licensing pipeline makes highly probable for any model trained on publisher archives), its claims will be encoded alongside valid claims, with no metadata signal to distinguish them. The model learns from the retracted paper on the same terms as from any other paper. At inference time, it draws on those associations without any capacity for post-training retraction awareness. It is not that the model knows the paper has been retracted and ignores the information. It is that the model was never given that information. Either the retraction notice was absent from the licensed data, or it was issued after the model was trained. In neither case does a mechanism exist to update the model’s parametric knowledge accordingly.

The contamination-absence asymmetry: A theoretical framework

Two failure modes, one corrupted knowledge base

The corpus contamination problem described in this paper is structurally related to, but conceptually distinct from, the corpus incompleteness problem described in the dark citations literature (Jensen, 2016; Keralis et al., 2023). Both produce unreliable AI scientific knowledge. Both originate in failures of information infrastructure rather than failures of model architecture. Both are invisible to standard LLM evaluation benchmarks. But they operate in opposite directions, are caused by different institutional failures, and require different remediation strategies.

Citation decay — the progressive inaccessibility of web-cited scholarly sources through link rot, repository abandonment, and format obsolescence — produces corpus incompleteness: valid scientific evidence never ingested into the training corpus because it was inaccessible at crawl time. The AI system knows less than it should. Its errors, when they occur, tend toward fabrication: the generation of plausible-sounding claims to fill gaps in parametric knowledge that the training corpus could not supply. Retraction propagation, by contrast, produces corpus contamination: invalid scientific evidence actively ingested into the training corpus because it was present in bulk-licensed archives without retraction filtering. The AI system believes more than it should. Its errors tend toward confident assertion: the reproduction of formally invalidated claims as though they represent established knowledge. The empirical work of Thelwall et al. (2025) illustrates this precisely: ChatGPT did not hesitate or hedge when evaluating retracted papers. It assigned them relatively high scores.

This is the contamination-absence asymmetry: two information infrastructure failures that act on AI scientific knowledge in structurally opposite directions, together producing a compound failure mode that no single intervention can address. Addressing corpus incompleteness requires preservation — keeping valid knowledge accessible. Addressing corpus contamination requires filtration — keeping invalid knowledge out. An AI system subject to both failure modes simultaneously is in the worst possible epistemic configuration: it knows less than it should about what is true, and is more confident than it should be about what is false. Neither failure mode alone captures the full scale of the problem. The two must be theorised together, as complementary dimensions of a single AI knowledge integrity crisis, if the threat to scientific AI reliability is to be adequately understood.
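The asymmetry can be stated compactly in set terms. The sketch below is a toy formalisation introduced here for illustration, not a measurement: the identifiers are invented, and the three sets (valid evidence, valid evidence lost to citation decay, and retracted papers present in licensed archives) are assumptions standing in for quantities that have not yet been measured.

```python
# Toy identifiers, invented for illustration only.
valid = {"v1", "v2", "v3", "v4"}   # evidence an ideal corpus would contain
decayed = {"v4"}                   # valid, but inaccessible at crawl time
retracted = {"r1", "r2"}           # invalidated, but present in the bulk archive

# The corpus actually ingested: valid evidence minus what decayed,
# plus the retracted papers that were never filtered out.
corpus = (valid - decayed) | retracted

missing = valid - corpus           # incompleteness: knows less than it should
contaminating = corpus - valid     # contamination: believes more than it should
print(sorted(missing), sorted(contaminating))  # ['v4'] ['r1', 'r2']

# Each intervention applied alone leaves the other failure in place.
preservation_only = valid | retracted   # decay repaired, no filtering
filtration_only = valid - decayed       # filtering applied, decay unrepaired
both = valid                            # preservation and filtration together

print(preservation_only == valid, filtration_only == valid, both == valid)
# False False True
```

Only the combination of preservation and filtration recovers the intended corpus in this toy case, which is the sense in which no single intervention addresses the compound failure.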

The governance gap hypothesis

The contamination-absence asymmetry is not merely a technical description of two problems. It is a diagnosis of a governance failure. Both corpus incompleteness and corpus contamination are consequences of the same structural condition: the absence of information governance frameworks that extend to the AI training pipeline. Preservation infrastructure — digital archives, persistent identifiers, CLOCKSS, LOCKSS, Portico — was designed for the human reader. Retraction infrastructure — Retraction Watch, Crossref flags, PubMed notices, database-level filtering — was designed for the human reader. Neither was designed for, or extended to, the new class of reader represented by a machine learning system ingesting millions of papers at once through a commercial bulk transfer.

The governance gap hypothesis proposes that this omission is structural, not incidental. The institutions responsible for information governance — publishers, database operators, standards bodies, professional associations, and information professionals — have not yet articulated, let alone operationalised, a governance mandate covering the AI training pipeline. Publishers negotiate licensing deals without specifying retraction-filtered exports. Database operators provide retraction metadata to human searchers but do not mandate its provision to AI training partners. Professional standards bodies have not produced a standard for retraction-aware AI dataset construction. And the information science profession, which possesses the expertise and the institutional mandate to address all of these gaps, has not yet formally claimed this territory.

The urgency of the governance gap is compounded by its irreversibility at the model level. When a retracted paper is ingested into a training corpus and the model is trained, the contamination becomes part of the model’s parametric knowledge. There is no established post-training mechanism for reliably and selectively removing the influence of specific papers from an already-trained model short of complete retraining on a cleaned corpus. Each day that bulk licensing continues without retraction filtering adds new contamination to the corpora that future model generations will inherit. The retraction that occurs today will not update any model trained yesterday. The governance gap, left unclosed, widens with every new retraction and every new training run.
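The temporal dimension of this problem can be illustrated with a small audit sketch. It is a hypothetical example under stated assumptions: the function name, the DOIs, the dates, and the idea of a single corpus snapshot date are all introduced here for illustration. The sketch separates retracted papers that could have been filtered when the corpus was assembled from those retracted only afterwards, which no filter applied at assembly time could have caught.

```python
from datetime import date


def audit_snapshot(corpus_dois, retraction_dates, snapshot):
    """Classify retracted DOIs in a corpus relative to the snapshot date.

    Returns (filterable, post_hoc): papers already retracted when the corpus
    was assembled, and papers retracted only afterwards.
    """
    filterable, post_hoc = [], []
    for doi in corpus_dois:
        retracted_on = retraction_dates.get(doi)
        if retracted_on is None:
            continue  # no retraction on record for this paper
        if retracted_on <= snapshot:
            filterable.append(doi)
        else:
            post_hoc.append(doi)
    return filterable, post_hoc


# Toy data, invented for illustration only.
corpus_dois = ["10.1234/a", "10.1234/b", "10.1234/c"]
retraction_dates = {
    "10.1234/b": date(2023, 5, 1),   # retracted before the snapshot
    "10.1234/c": date(2025, 2, 1),   # retracted after the snapshot
}
filterable, post_hoc = audit_snapshot(corpus_dois, retraction_dates, date(2024, 7, 1))
print(filterable, post_hoc)  # ['10.1234/b'] ['10.1234/c']
```

The first list represents contamination that retraction-filtered exports would prevent. The second represents contamination that can only be addressed by re-auditing corpora and retraining as the retraction record grows.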

Conclusion

When Taylor & Francis announced its data-access agreement with Microsoft in July 2024, the academic community’s concerns centred on copyright, compensation, and consent. These are legitimate concerns. But they are not the only concerns, and they may not be the most consequential ones. Hidden within the commercial and legal debate over who owns scholarly content is a question of greater scientific moment: what is in the content being transferred? The archives of major publishers contain not only decades of valid scientific contribution but also the formally invalidated, fabricated, and fraudulent papers that the retraction system was designed to quarantine. In the absence of retraction filtering — which neither the Taylor & Francis deal nor any comparable agreement has publicly specified — both categories are transferred together, and both are encoded into the parametric knowledge of AI systems that will increasingly mediate how science is conducted, communicated, and understood.

The retraction infrastructure that exists to protect human readers from this invalid knowledge — Retraction Watch, Crossref flags, PubMed notices — has been bypassed not by malice but by design gap. It was built for a reader who searches, evaluates, and verifies. It was not built for a reader who ingests millions of papers at once through a bulk commercial transfer. The governance gap hypothesis proposed in this paper argues that closing this gap is a responsibility that belongs to the information governance community: to the publishers, standards bodies, and information professionals who collectively maintain the systems through which the validity of the scholarly record is assured.

The contamination-absence asymmetry provides the theoretical foundation for understanding why this problem must be addressed in tandem with the corpus incompleteness produced by citation decay. AI systems trained on the scholarly literature are not merely incomplete; they are contaminated. They do not merely fail to know some things they should know; they confidently assert some things that have been formally determined to be false. An AI system subject to both failure modes — knowing less than it should and believing more than it should — is an AI system whose scientific knowledge cannot be trusted without systematic auditing of the corpora it was trained on. The tools to conduct that auditing exist. The expertise to deploy them exists in the LIS profession. What remains is the governance will to require it, the professional standards to mandate it, and the contractual norms to enforce it. The contamination of AI scientific knowledge begins in the publisher’s archive. The response must begin in the library.

Implications

The contaminated corpora framework carries implications that extend across the professional and disciplinary communities engaged with AI development, scholarly publishing, and information governance.

For AI developers and researchers, the framework identifies a category of systematic error in LLM scientific outputs that is neither random nor addressable through architectural improvement alone: contamination errors arising from retracted content in training data. Current LLM evaluation benchmarks measure hallucination without distinguishing architectural errors from corpus contamination errors. The empirical findings of Gu et al. (2025) and Thelwall et al. (2025) provide starting points; systematic field-wide measurement of how retraction prevalence correlates with AI output reliability is urgently needed. The framework calls for the development of retraction-aware corpus construction protocols, post-training contamination auditing methodologies, and evaluation benchmarks specifically designed to test AI performance on topics where retracted papers are heavily concentrated.

For scholarly publishers, the framework makes explicit the integrity responsibility that bulk licensing creates. A publisher that sells its archive to an AI developer without retraction filtering is not merely licensing intellectual property. It is transferring its integrity failures alongside its scientific contributions, with no mechanism for correction once training is complete. Retraction-filtered exports should be treated as a standard deliverable in any AI licensing agreement. As Heather Joseph of SPARC has argued, the development of standards giving researchers greater control over AI uses of their work is overdue (Palmer, 2024). Retraction filtering is a concrete, technically achievable first step that publishers can take immediately, using the Crossref retraction metadata infrastructure that already exists.

For research funders and policy bodies, the framework provides a new and urgent argument for sustained investment in retraction infrastructure. The standard case for funding Retraction Watch, Crossref retraction flags, and institutional notification systems rests on their value to human researchers. The contaminated corpora framework adds a second, more pressing argument: retraction infrastructure that does not extend to AI training pipelines is infrastructure that is failing to protect the integrity of AI scientific knowledge at precisely the point where that knowledge has the most leverage over future scientific practice. Funding retraction infrastructure is no longer merely an act of scholarly housekeeping. It is an AI integrity intervention.

For the library and information science profession, the framework establishes the clearest possible statement of its stake in the AI knowledge integrity crisis. The LIS community has long established standards for the curation, provenance tracking, and quality assurance of information collections, applied systematically to print archives, digital repositories, and database construction. These standards have not been applied to AI training datasets. Information professionals should lead the development of a formal standard specifying that any AI training dataset derived from the scholarly literature must be constructed using retraction-filtered exports drawn from recognised sources including Retraction Watch and the Crossref retraction flag system. The organisations that represent information professionals — SPARC, IFLA, and national library associations — are well-positioned to advocate for the inclusion of retraction metadata requirements as a standard clause in publisher-AI licensing agreements. The contamination-absence asymmetry demonstrates that AI scientific knowledge is being degraded from two directions simultaneously. The response to both begins in the library.

Footnotes

ORCID iD

Adebowale Jeremy Adetayo

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author biography

Dr. Adebowale Jeremy Adetayo is a lecturer and researcher in the Department of Information Resource Management at Babcock University, Nigeria. His research interests encompass Information Science, Information Management, Library Science, and digital educational innovations. Dr. Adetayo has published numerous articles in highly reputable, peer-reviewed international journals, focusing on emerging technology integration, virtual learning environments, and institutional research capacity building.

References

Else (2024) Biomedical paper retractions have quadrupled in 20 years - why? Nature 630(8016): 280–281. https://doi.org/10.1038/d41586-024-01609-0

Feng, et al. (2025) Alarm: retracted articles on cancer imaging are not only continuously cited by publications but also used by ChatGPT to answer questions. Journal of Advanced Research 71: 1–3. https://doi.org/10.1016/j.jare.2025.03.020

Hansen (2024) What happens when your publisher licenses your work for AI training? – authors alliance. Available at: https://www.authorsalliance.org/2024/07/30/what-happens-when-your-publisher-licenses-your-work-for-ai-training/ (accessed 15 March 2026).

Hsiao, Schneider (2022) Continued use of retracted papers: temporal trends in citations and (lack of) awareness of retractions shown in citation contexts in biomedicine. Quantitative Science Studies 2(4): 1144–1169. https://doi.org/10.1162/qss_a_00155

Jensen (2016) Editorial. ACM Transactions on Database Systems 41(2): 1–3. Available at: https://doi.org/10.1145/2946798

Keralis, Albertorio-Díaz, Hoppe (2023) Dark citations to federal resources and their contribution to the public health literature. Frontiers in Research Metrics and Analytics 8: 1235208. https://doi.org/10.3389/frma.2023.1235208

Kwon (2024) Publishers are selling papers to train AIs - and making millions of dollars. Nature 636(8043): 529–530. https://doi.org/10.1038/d41586-024-04018-5

Palmer (2024) Taylor & Francis AI deal sets ‘worrying precedent’ for Academic Publishing. Available at: https://www.insidehighered.com/news/faculty-issues/research/2024/07/29/taylor-francis-ai-deal-sets-worrying-precedent (accessed 15 March 2026).

Petrou (2024) Guest post - making sense of retractions and tackling research misconduct - the scholarly kitchen. Available at: https://scholarlykitchen.sspnet.org/2024/04/18/guest-post-making-sense-of-retractions-and-tackling-research-misconduct/ (accessed 15 March 2026).

Tang (2023) Some insights into the factors influencing continuous citation of retracted scientific papers. Publications 11(4): 47. https://doi.org/10.3390/publications11040047

Thelwall, Lehtisaari, Katsirea, et al. (2025) Does ChatGPT ignore article retractions and other reliability concerns? Learned Publishing 38(4): e2018. https://doi.org/10.1002/leap.2018

Van Noorden (2023) More than 10,000 research papers were retracted in 2023 - a new record. Nature 624(7992): 479–481. https://doi.org/10.1038/d41586-023-03974-8

Woo, Walsh (2024) On the shoulders of fallen giants: what do references to retracted research tell us about citation behaviors? Quantitative Science Studies 5(1): 1–30. https://doi.org/10.1162/qss_a_00303

Zhou, Lou, Shen, et al. (2025) Prevalence and Trends in Global Retractions Explored Through a Topic Lens. arXiv. Epub ahead of print 26 November 2025.