Toxicologic Pathology Forum*: Opportunities and Challenges in the Use of Artificial Intelligence in Nonclinical Toxicologic Histopathology Evaluations

*Disclaimer

This is an opinion article submitted to the Toxicologic Pathology Forum. It represents the views of the authors. It does not constitute an official position of the Society of Toxicologic Pathology, British Society of Toxicological Pathology, or European Society of Toxicologic Pathology, and the views expressed might not reflect the best practices recommended by these Societies. This article should not be construed to represent the policies, positions, or opinions of their respective organizations, employers, or regulatory agencies. The Toxicologic Pathology Forum is designed to stimulate discussion of topics relevant to regulatory issues in toxicologic pathology. Readers of Toxicologic Pathology are encouraged to send their thoughts on TPF opinion articles or ideas for new discussion topics to the Editor.

Introduction

Recent progress in the development of artificial intelligence (AI) designed to assist pathologists conducting histopathology evaluations may create a new paradigm in the interpretation of nonclinical toxicology studies, bringing technical, social, and economic changes to the way nonclinical toxicologic pathologists work. We believe the impact of this burgeoning technology will come to be understood as a classic example of the introduction of automation into an industrial workflow, where machine work augments human effort and becomes a tool to increase the efficiency of routine, repetitive tasks.

Opportunities for Using Artificial Intelligence–Based Automation in Toxicologic Pathology

Histopathology evaluations in toxicologic pathology are well suited to AI-based automation because toxicology studies are typically designed as safety screens, and therefore, a complete set of tissues is usually evaluated.¹⁶ Given the resulting large tissue sets, a critical, labor-intensive first step in the workflow is identifying specimens that warrant further characterization by the Study Pathologist. Automation that performs histology quality control would be a labor-saving first step, and automation that sensitively and specifically identifies specimens with a high probability of having test article-related findings could reduce the time a pathologist spends evaluating toxicology studies by reducing the attention currently allocated to specimens, or regions of specimens, with no findings (ie, that are unremarkable).
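To make "sensitively and specifically" concrete, the following Python sketch computes sensitivity, specificity, and the fraction of slides a triage algorithm would deprioritize. It is an illustration only: the class, the function names, and the counts are hypothetical and are not drawn from any study or product.

```python
from dataclasses import dataclass


@dataclass
class TriageCounts:
    """Hypothetical slide counts comparing algorithm flags with pathologist diagnoses."""
    true_pos: int   # slides with findings that the algorithm flagged
    false_neg: int  # slides with findings that the algorithm did not flag
    true_neg: int   # unremarkable slides that the algorithm did not flag
    false_pos: int  # unremarkable slides that the algorithm flagged


def sensitivity(c: TriageCounts) -> float:
    """Fraction of slides with findings that were flagged."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: TriageCounts) -> float:
    """Fraction of unremarkable slides that were not flagged."""
    return c.true_neg / (c.true_neg + c.false_pos)


def fraction_deprioritized(c: TriageCounts) -> float:
    """Fraction of all slides the algorithm did not flag (candidate time savings)."""
    total = c.true_pos + c.false_neg + c.true_neg + c.false_pos
    return (c.true_neg + c.false_neg) / total


# Invented example: 1000 slides, 50 of which have findings.
example = TriageCounts(true_pos=45, false_neg=5, true_neg=800, false_pos=150)
print(f"sensitivity: {sensitivity(example):.3f}")
print(f"specificity: {specificity(example):.3f}")
print(f"deprioritized: {fraction_deprioritized(example):.3f}")
```

In this invented example the unflagged slides include five with findings, which is why the sensitivity of a triage tool, and not only the time it saves, would need to be weighed against the intended use.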

Moreover, in most scientific disciplines, observations are presented separately from interpretations.¹⁵ This separation of observation from interpretation is not always clear in histopathology evaluations, because a new interpretation often requires a return to the specimens for fresh observation, eg, in a retrospective peer review or a Pathology Working Group process. The uniqueness of histopathology diagnoses is acknowledged in Good Laboratory Practice regulations, where the raw histopathology data are defined as the signed and dated final report⁴ and interim notes and peer review discussions are not considered raw data. Therefore, histopathology observations and interpretations are intrinsically complex and can be difficult to fully contextualize in some situations without access to the specimens. AI-based analysis of histology slides may be standardized across studies and allow stakeholders without access to the specimens to make like-for-like comparisons based on AI data. The use of AI data from the analysis of histology slides may alter anatomic histopathology evaluations so that they are more like clinical pathology data generation, where data resulting from the assessment of specimens take the form of numbers generated by an instrument. However, at the current time, the methods and process for alignment on AI-score outputs, standardization, and interpretation for histomorphologic data generation remain to be determined.

Currently explored applications of AI in toxicologic pathology include automated tissue classification and segmentation, quantitative detection and enumeration of defined lesions, workflow efficiency (ie, quality control, data mining), and algorithmic screening/triaging of specimens.¹⁴ In our opinion, the most significant opportunity for AI in toxicologic pathology lies in aiding histological evaluations to reduce the time pathologists allocate to specimens that are not remarkable. This type of support will allow pathologists to focus more of their efforts on the interpretation of findings and contextual risk assessment, which will continue to necessitate human expertise. Our view is that embracing this technology and confronting the challenges it poses will allow our profession to adapt to and thrive in the coming era.

Challenges to Development and Use of Artificial Intelligence–Based Automation in Production

Despite the potential of AI to reduce repetitive tasks and to automate histopathology analysis, there are some challenges in the current trajectory for adoption in toxicologic pathology. One example is the use of the term “Artificial Intelligence” to describe this technology, which has the unfortunate effect of creating an association between algorithmic output and human intellect. Conflating algorithmic output and human intellect makes it easy to forget that AI output does not account for context that was not presented during algorithm training but may apply in a new use-case. As AI algorithms continue to evolve, it’s important for us to stay attentive and ensure they are used in appropriate contexts to generate accurate and reliable outputs. This proactive approach will help us harness the full potential of AI technology.

The absence of adequate context in AI training is manifest as bias, which can be present at multiple levels. The most obvious form of bias arises from the use of training data with a limited variety of pre-analytical variables, which can result in inaccurate outputs due to misrepresentation or underrepresentation of findings (data imbalance). To train an accurate yet fit-for-purpose model, AI-algorithm developers should prioritize data diversity and data augmentation from the outset or use strategies that otherwise explicitly address limited training data. Other examples of sources of bias include inexperienced users/developers defining the problem (cognitive bias), designing an experiment (framing bias), or selecting a sample, region of interest, or algorithm (selection bias).¹¹,²⁰ Bias introduced in AI training can be subtle and not manifest until the algorithm is used for problem-solving outside the training environment. Therefore, AI algorithms should be tested rigorously during qualification and validation in an assessment of data that includes the full spectrum of pre-analytical variables and use-case scenarios that will be encountered in production implementation. A real-world example of unanticipated AI bias having a negative impact is a case where a multi-national technology company reportedly developed an AI-based resume screening tool using the premise that they wanted to hire more people like their current employees. However, the existing employee population was not equally weighted for gender, and therefore, the AI learned to systematically eliminate resumes from consideration that contained clues that the applicant was female.³
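One simple way to look for the data imbalance described above is to tabulate how often each level of a pre-analytical variable appears in the training set. The Python sketch below is a hedged illustration: the record fields (`scanner`, `stain_batch`), the 10% threshold, and the data are assumptions chosen for the example, not recommended values.

```python
from collections import Counter


def underrepresented_levels(records, variable, min_fraction=0.10):
    """Return levels of `variable` whose share of the records is below `min_fraction`.

    `records` is a list of dicts describing training slides; the keys used
    here are hypothetical placeholders for pre-analytical variables.
    """
    counts = Counter(record[variable] for record in records)
    total = sum(counts.values())
    return {
        level: n / total
        for level, n in counts.items()
        if n / total < min_fraction
    }


# Invented training set: 19 slides from scanner "A", 1 slide from scanner "B".
training_slides = (
    [{"scanner": "A", "stain_batch": "batch-1"} for _ in range(19)]
    + [{"scanner": "B", "stain_batch": "batch-1"}]
)

print(underrepresented_levels(training_slides, "scanner"))
```

A tabulation like this only surfaces imbalance in variables that were recorded; it cannot reveal bias arising from variables that nobody thought to capture.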

The broader the potential application of an AI algorithm, the more likely it will be used in a context that was inadequately represented during training, which may lead to inaccurate or unhelpful output. Therefore, qualification⁹ is best conducted using broad datasets drawn from a variety of real-world scenarios, preferably using multiple sources that differ from the development data included in training the algorithm. A key objective in algorithm qualification is defining the limits of productive use, so that these limits can be articulated to end-users. Unfortunately, many presentations of algorithm performance do not include qualification packages of this type.¹⁰ Therefore, when developing models collaboratively or sourcing them externally, the limits of the intended use-case should be clearly defined and the span of the qualification data set should be assessed to ensure that it is suitably representative. Where the span of the qualification data set does not overlap with a potential use, consideration should be given to additional qualification testing, or the limits of use should be narrowed.
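The comparison between the span of a qualification data set and a potential use can be sketched as a set difference. In the Python example below, each condition is a hypothetical (species, tissue, stain) tuple; real qualification packages would likely track many more variables, so this is an assumption-laden simplification.

```python
def uncovered_conditions(qualification, intended_use):
    """Return intended-use conditions that are absent from the qualification set."""
    return sorted(set(intended_use) - set(qualification))


# Invented conditions, each a (species, tissue, stain) tuple.
qualification_set = {
    ("rat", "liver", "H&E"),
    ("rat", "kidney", "H&E"),
    ("mouse", "liver", "H&E"),
}
intended_use_set = {
    ("rat", "liver", "H&E"),
    ("dog", "liver", "H&E"),
}

gaps = uncovered_conditions(qualification_set, intended_use_set)
print(gaps)
```

Any condition listed in `gaps` would point to either additional qualification testing or a narrower statement of the limits of use.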

Likewise, AI algorithms should not be deployed in production without thorough validation, which includes consideration of use-case–specific analysis of the consequence of unrecognized bias. Most importantly, a well-considered plan for human oversight of algorithmic outputs should be in place for use-cases that could affect public health or have potential for other widespread impact. While successful validation across many organizations is a sound rationale for having increased confidence in algorithm outputs, subtle differences in pre-analytical variables or the specific context of use can limit the utility of algorithm outputs. Moreover, even for established, well-performing AI algorithms, slight shifts in test background variables that differ from the training data set or other difficult-to-recognize changes in the context of use can reduce the utility of algorithm outputs. Therefore, it is important to conduct periodic revalidation to monitor model performance over time; however, the frequency of such revalidation efforts will have to be defined and may depend on a risk-based assessment of the context of use.¹⁹
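Periodic revalidation can be pictured as repeating a performance measurement on new, pathologist-reviewed batches and comparing it with the value established at validation. The Python sketch below assumes sensitivity is the monitored metric and uses a hypothetical tolerance of 0.05; both the metric and the threshold are illustrative assumptions that an organization would set through its own risk-based assessment.

```python
def batches_needing_review(baseline_sensitivity, batch_results, tolerance=0.05):
    """Return labels of batches whose sensitivity fell more than `tolerance` below baseline.

    `batch_results` is a list of (label, true_positives, false_negatives)
    tuples from pathologist-reviewed revalidation batches (hypothetical format).
    """
    flagged = []
    for label, true_pos, false_neg in batch_results:
        batch_sensitivity = true_pos / (true_pos + false_neg)
        if baseline_sensitivity - batch_sensitivity > tolerance:
            flagged.append(label)
    return flagged


# Invented revalidation batches, each with 50 slides that have findings.
revalidation = [
    ("batch-1", 47, 3),  # sensitivity 0.94
    ("batch-2", 44, 6),  # sensitivity 0.88
]

print(batches_needing_review(0.95, revalidation))
```

A flagged batch would not by itself show that the model has failed; it would be a prompt for a pathologist to examine whether pre-analytical variables or the context of use have shifted.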

As described above, unintentional misuse of AI algorithms in toxicologic pathology can occur when developers or users are unaware of limitations on the context of use. In addition, problems related to transferring data between systems can arise within large organizations and between organizations due to insufficient standardization in models and data formats.⁵ Standardization issues can occur in model development during data generation (eg, slide preparation or scanning), data curation (eg, study selection), design (eg, selecting the AI architecture), training (eg, data splits, performance metrics, selection of hyperparameters), or testing (eg, in- and out-of-sample data). The need for standardization was highlighted in a recent industry survey where respondents had a wide range of responses to questions on the technical aspects of developing models for use in toxicologic pathology.¹³ As yet, formal standards have not been established for judging the suitability of AI-based analysis of, and data generation from, digital images of stained tissue sections as an aid to postmortem evaluations in nonclinical toxicology. Therefore, at this early stage of implementing AI models in nonclinical toxicology workflows, there is an opportunity for innovators to gather broad input from relevant stakeholder communities such as AI-algorithm developers, pathologists, and health authority representatives.

A variety of stakeholders, some of whom may have limited pathology training or experience, may use data generated from AI-based analysis of histology slides. For example, the development of computerized image analysis to generate quantitative data from histology images (eg, characterizing immunohistochemical staining) has facilitated the use of morphometric methods, but these projects are sometimes led by scientists who are not pathologists and who have not sought the help of pathologists in designing or interpreting their experiments. A similar phenomenon has occurred with the use of more advanced AI models. For example, AI algorithms have recently been incorporated into the evaluation process to further characterize experimentally induced histopathology endpoints in rodent models by scoring/grading the findings and even for diagnosis.⁷ Although many publications have at least one pathologist as a co-author, publications that use AI to interpret histology endpoints using special stains, apparently without a pathologist as a co-author, are also common. There are examples of publications containing potential misinterpretations due to the lack of a larger context that a pathologist might have provided.¹,²,⁶⁻⁸,¹²,¹⁷,¹⁸ As peer reviewers and consumers of scientific literature, we should keep in mind that AI outputs are not adequate substitutes for the professional judgment of subject matter expert pathologists, and we should not accept them as such.

In addition to problems that arise from the accessibility of these novel approaches to histology evaluation to non-pathologists, there may also be an impact on cohorts of early career toxicologic pathologists who begin their professional lives using AI algorithms routinely as an aid to histology evaluation. This group of toxicologic pathologists may need to develop new skill sets not currently emphasized in pathology training and may have less opportunity to develop the traditional skills of current mid- and late-career pathologists. For example, in the future, toxicologic pathologists may be called on to more routinely make evidence-based judgments about the suitability of AI-based histology analysis algorithms for specific use-cases. This could involve training related to developing expertise in establishing performance indicators specific to AI-based measurement. Conversely, developing the traditional observational skills required to identify histomorphology findings may take focused effort by pathologist trainees in a learning environment where many differences are routinely annotated on whole slide images by AI algorithms.

Widespread adoption of AI models as aids to histology evaluations may also increase the capital requirements for organizations providing services in this field. For example, histology workflows must be digitized before an AI analysis can be conducted. Digitization via whole slide scanning introduces an additional significant laboratory cost during the creation of the materials to be evaluated and creates post-evaluation costs related to the long-term retention of images and data analysis. Furthermore, the use of vendor-supplied analysis and viewing software entails payment of licensing fees. Based on the history of automation, the capital costs of implementation are typically offset by labor cost savings over time. However, when the capital costs of automation are high, organizations with smaller labor forces may struggle to fully leverage the potential return on investment. This effect will likely provide a competitive advantage to larger organizations.

Conclusion

Although we champion the use of AI-based pathologist aiding in nonclinical toxicology, we recognize that adoption presents opportunities and challenges. Our perspective is that the most important opportunity is the possibility of strengthening histopathology evaluations by introducing routine automated analysis and annotation of histology images. We welcome the potential for these automated analyses to reduce the effort needed to perform histopathology evaluations but believe the more important impact may be the creation of increased scientific context in the form of quantitative measurements of histopathology findings for study pathologists. Moreover, the adoption of AI-based pathologist aiding opens an opportunity to develop new research skill sets centered around assessing model predictions. Each of these opportunities creates a reciprocal challenge, either in terms of using these powerful tools wisely or in gaining the new skills needed to judge this new type of data critically. As we work through these challenges, we will need to ensure that our approaches are aligned with the input of our stakeholders, in particular health authorities, to ensure acceptance and full compliance with regulatory expectations. Likewise, our profession may face a period of social and economic change as we work to productively incorporate this technology into our daily work. Despite the adjustments that may be needed to productively use AI-based pathologist aiding, we believe that our profession should embrace this technology and the attendant changes, because, from a scientific perspective, we are uniquely well-suited to judge the adequacy and meaning of the data it will create.

Thomas Forest, Merck & Co., Inc., Rahway, New Jersey, USA

Gordon K. Wollenberg, Merck & Co., Inc., Rahway, New Jersey, USA

Jerrold M. Ward, Global VetPathology, Montgomery Village, Maryland, USA

Kyathanahalli S. Janardhan, Merck & Co., Inc., Rahway, New Jersey, USA

Bhupinder Bawa, AbbVie Inc., North Chicago, Illinois, USA

Footnotes

Author Contributions

Authors contributed to conceptualization (TF, GW, JMW, KSJ, BB); Writing—original draft (TF); and Writing—review & editing (TF, GW, JMW, KSJ, BB).

Declaration of Conflicting Interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: JMW is a senior advisor to the Editor of Toxicologic Pathology, and KSJ is an editorial board member for the journal, but neither author took part in the manuscript peer review or decision-making process for this submission. The other authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

References

1. Akatsuka, Horai, Akatsuka. Automated recognition of glomerular lesions in the kidneys of mice by using deep learning. J Pathol Inform. 2022;13:100129.

2. Arlova, Jin, Wong-Rolle, Chen, et al. Artificial intelligence-based tumor segmentation in mouse models of lung adenocarcinoma. J Pathol Inform. 2022;13:100007.

3. Dastin. Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. October 10, 2018.

4. United States Federal Register. 1987;52(172):33768-33782. Accessed May 18, 2026. https://www.govinfo.gov/app/details/FR-1987-09-04

5. Jarrahi, Davoudi, Haeri. The key to an effective AI-powered digital pathology: establishing a symbiotic workflow between pathologists and machine. J Pathol Inform. 2022;13:100156.

6. Kobayashi, Shieh, Ruiz, Sabando, et al. Deep learning-based approach to the characterization and quantification of histopathology in mouse models of colitis. PLoS ONE. 2022;17(8):e0268954.

7. LaFave, Kartha, et al. Epigenomic state transitions characterize tumor progression in mouse lung adenocarcinoma. Cancer Cell. 2020;38(2):212-228.

8. Lockhart, Ackerman, Lee, Abdalah, et al. Grading of lung adenocarcinomas with simultaneous segmentation by artificial intelligence (GLASS-AI). NPJ Precis Oncol. 2023;7(1):68.

9. Long, Smith, Machotka, et al. Scientific and Regulatory Policy Committee (SRPC) paper: validation of digital pathology systems in the regulated nonclinical environment. Toxicol Pathol. 2013;41(1):115-124.

10. McGenity, Clarke, Jennings, et al. Artificial intelligence in digital pathology: a systematic review and meta-analysis of diagnostic test accuracy. NPJ Digit Med. 2024;7(1):114.

11. Nakagawa, Moukheiber, Celi, et al. AI in pathology: what could possibly go wrong? Semin Diagn Pathol. 2023;40(2):100-108.

12. Nguyen TTU, Nguyen A-T, Kim, et al. Deep-learning model for evaluating histopathology of acute renal tubular injury. Sci Rep. 2024;14(1):9010.

13. Palazzi, Barale-Thomas, Bawa, et al. Results of the European society of toxicologic pathology survey on the use of artificial intelligence in toxicologic pathology. Toxicol Pathol. 2023;51(4):216-224.

14. Pohlmeyer-Esch, Halsey, Boisclair, et al. Digital pathology and artificial intelligence applied to nonclinical toxicology pathology—the current state, challenges, and future directions. Toxicol Pathol. 2025;53(6):516-535.

15. Popper KR. Conjectures and Refutations: The Growth of Scientific Knowledge. 4th ed. Routledge and Kegan Paul; 1972.

16. Rudmann, Bertrand, Zuraw, et al. Building a nonclinical pathology laboratory of the future for pharmaceutical research excellence. Drug Discov Today. 2023;28(10):103747.

17. Sheehan, Mawe, Cianciolo, Korstanje, Mahoney JM. Detection and classification of novel renal histologic phenotypes using deep neural networks. Am J Pathol. 2019;189(9):1786-1796.

18. Shimada, Tanimoto, Sasaki, Taga, Sasaki, Imagawa, Sasaki. Automated scoring of glomerular injury in TNS2-deficient nephropathy. Exp Anim. 2024;73:370-375.

19. US Food and Drug Administration and European Medicines Agency. Guiding principles of good AI practice in drug development. Updated January 14, 2026. Accessed April 8, 2026. https://www.fda.gov/about-fda/artificial-intelligence-drug-development/guiding-principles-good-ai-practice-drug-development

20. Zuraw, Aeffner. Whole-slide imaging, tissue image analysis, and artificial intelligence in veterinary pathology: an updated introduction and review. Vet Pathol. 2022;59(1):6-25.