Reporting transparency in veterinary pathology deep learning: A systematic review of reproducibility-critical details

Abstract

Although reproducibility of studies is a prerequisite for trustworthy deep learning (DL) in veterinary histopathology and microscopy, the actual degree of methodological transparency in the literature remains uncertain. We performed a Preferred Reporting Items for Systematic Reviews and Meta-Analyses-guided systematic review to quantify the degree to which supervised DL and supervised machine learning studies report reproducibility-critical details. Using a veterinary-journal-restricted Boolean search executed in PubMed and Scopus, we screened 181 unique records and included 50 primary research articles for full-text analysis. Based on a recently published guideline for the development of DL models in veterinary pathology, we extracted information for each study across 5 dimensions: (1) study and task characterization, (2) data transparency, (3) experimental design and data-leakage control, (4) model and training details, and (5) performance evaluation and reporting. Among the included studies, private data sets predominated, with 90% of studies relying on private data. Sharing of code was uncommon (3%). Key training details such as augmentation and hyperparameters were often incompletely reported; augmentation was not reported in 56% of studies, and key hyperparameters were absent in 40% of studies. It was often not clear whether patient-level stratification (necessary to avoid data leakage) was performed. In summary, these results highlight major deficits in the reporting of details and experimental design necessary for reproducing DL results in veterinary histopathology. This review provides a practical baseline and reporting roadmap to support more transparent and reproducible research in veterinary computational pathology.

Keywords

computational pathology, deep learning, machine learning, open data, reproducibility, systematic review, transparency, veterinary pathology

Techniques based on deep learning (DL) and machine learning (ML) have become a major trend in veterinary pathology over the past years, with use cases varying from mitotic figure detection² and tumor grading⁵⁸ to lesion segmentation.⁵³ The increasing adoption of these methods is driven by the time-consuming and tedious nature of many routine histologic tasks and interobserver variability,^5,72,75 for which algorithmic approaches are expected to help standardize diagnostic assessments.⁷ At the same time, the artificial intelligence (AI) ecosystem has undergone a significant transformation: more accessible software frameworks, low-code interfaces, and a broad range of online tutorials now allow researchers without any formal ML training to develop or modify DL pipelines for their tasks.^31,65 Consequently, DL has become an attractive approach for many practitioners and researchers in veterinary pathology.

Crucially, the widespread use and low barrier to entry of these techniques demand particular attention. Due to their expressiveness, DL models will almost invariably yield some result, even if the data are not suitable, the task is not well defined, or the training is incorrect. This can lead to situations where seemingly impressive performance conceals severe methodological problems. Moreover, model behavior depends not only on the data set and model architecture, but also on the exact details of the training, such as which data were used for training and testing, augmentation, loss functions, optimization strategy, learning rate schedules, and other hyperparameters. Without knowledge of these crucial elements of the experimental setup, reproducing reported results can become tedious or even impossible. This has contributed to the well-known reproducibility crisis in the broader ML community,³² and in particular in the medical imaging community,⁶¹ whereby the outcomes of studies cannot be consistently reproduced by others. These pitfalls may affect scientists without sufficient training in ML and without knowledge about DL best practices even more strongly.

The field of pathology is no exception to this: Training in ML techniques is not part of the regular curriculum in either human or veterinary pathology. Yet, the application of ML and DL methods is attractive to researchers with this background who have only recently started to utilize them and have not received any in-depth training. This gives rise to the potential for methodological mistakes that stem from insufficient knowledge about the details of the methods, their limitations, and boundary conditions. In particular, as not every part of a data pipeline is necessarily understood in full detail, we hypothesize that authors from these fields are more prone to omitting relevant parameters of the experimental setup, which would significantly impact the reproducibility of their approach.

The existing review literature about AI in veterinary diagnostics has mainly focused on identifying application areas and summarizing the usage of DL rather than systematically quantifying whether studies are reported in such a way that would enable them to be reliably reproduced. Broad overviews of veterinary DL applications further reveal that practical translation is commonly constrained due to small data sets, heterogeneity, and continuing needs for stronger standardization and validation practices.⁷⁴ A complementary synthesis of real-world adoption barriers highlights how issues related to access to limited data, poorly described methods, and scarce independent external testing can directly undermine confidence and reproducibility.¹⁶

Survey-based evidence, coming from professional communities, further strengthens these concerns and points out that readiness depends on more than benchmark performance. The reported barriers include annotation burden, skill gaps, uncertainty about validation requirements, and trust in model outputs, collectively shaping whether AI tools can be integrated into routine practice.⁴⁹

Reproducibility has also become a central theme in computational pathology and, more broadly, digital pathology. Reproduction-centered analyses show that published whole-slide image pipelines cannot be reproduced, or can be reproduced only with great difficulty, when crucial implementation details are not available, which motivates checklist-style reporting of data handling, preprocessing, training, and evaluation.¹⁷ Related work on reproducibility and reusability in computational pathology further argues that verification and downstream reuse are hindered when artifacts are not shared and methodological reporting is incomplete.⁶⁹

Apart from academic publishing, there is a growing interest in the transparency of the publicly available evidence base for pathology AI systems put into clinical use. Analyses of AI products in digital pathology reveal that there is considerable variation in public evidence and make a case for more standardized and comparable documentation to allow independent scrutiny.³⁹ Guidance on how to interpret ML studies in clinical contexts highlights the importance of careful appraisals of validation strategy, bias, overfitting, and generalizability, further underlining that trustworthy claims must explicitly report study design and evaluation.³⁷ Cumulatively, these analyses make the case for transparency at the levels of data, annotations, training procedures, and evaluation. Besides, they indicate that reproducibility and the documentation of evidence are active concerns in pathology AI. Yet, to date, there is no systematic analysis in the field of veterinary pathology that investigates the availability of methodological details required for the reproducibility of results when ML and, in particular, DL models are involved. This review addresses this gap by quantifying reporting practices in veterinary articles that evaluated DL for analysis of microscopic images and identifies recurring omissions that impede reproducibility and thereby limit robust assessment of the results reported. We performed a systematic, PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses)-guided approach to cover the field of veterinary pathology. The overarching aim was to characterize the extent to which published articles provide sufficient detail on data, code availability, model development, and model evaluation to enable rigorous assessment and reproducibility.

Methods

Search Strategy and Study Selection

To evaluate the degree of reproducibility in reported DL methods for microscopic image analysis within veterinary medicine, a PRISMA 2020-style systematic review⁴⁸ was conducted on 4 November 2025, covering publications from 1 January 2010 to the search date. This lower bound was chosen because the widespread adoption of DL methods in image analysis began in the early 2010s, making earlier literature unlikely to contain relevant supervised DL studies. After several rounds of refinement, a Boolean search query was developed and is provided in full in the Supplemental Materials due to its length. In brief, the query consisted of 4 major OR-blocks combined with AND, capturing (1) AI-related terms, (2) pathology-related terms, (3) imaging-related terms, and (4) journals (Fig. 1). The search was limited to veterinary science journals, according to the categorization in the Clarivate database, in an effort to meaningfully restrict the search scope to veterinary medicine and exclude studies from human medicine that use animal models and would otherwise show up in large quantities in our search results. Our search was intentionally restricted to veterinary journals, as our goal was to characterize the transparency standards of research conducted within and for the veterinary community, regardless of the disciplinary background of contributing authors. The identical search was executed in both PubMed and Scopus. Google Sheets was used to store and manage the extracted study characteristics, which were then exported as a comma-separated values (CSV) file, and the Python programming language was used to process the CSV file and derive aggregated results. The first author (SB) reviewed all papers in detail for eligibility according to the inclusion and exclusion criteria (Table 1) and then, for the included papers, extracted all information according to defined reproducibility criteria. To ensure extraction quality, all 50 included papers were first annotated by the first author (SB).
The 50 papers were then divided into 3 subsets (17, 17, and 16 papers), each independently re-reviewed by 1 additional co-author (CAB, MA, and JA, respectively), such that each paper was reviewed by exactly 2 annotators. Disagreements between the first author and the respective coreviewer were resolved by structured discussion until consensus was reached. The assessment items used for full-text extraction were aligned with recent minimum reporting guidelines for automated image analysis in veterinary pathology.⁹
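
The aggregation step described above (a CSV export processed in Python) can be sketched as follows. This is a minimal illustration, not the script used for this review; the column names and the four sample rows are hypothetical assumptions.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of an extraction sheet. Column names and rows are
# illustrative assumptions, not the actual data of this review.
SAMPLE_CSV = """study_id,dataset_availability,augmentation
S01,private,reported
S02,private,not reported
S03,public,reported
S04,mixed,not reported
"""


def tally(csv_text, column):
    """Count how often each category occurs in one column of the sheet."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column].strip().lower() for row in reader)


def as_percentages(counts):
    """Convert category counts into percentages of all studies."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


availability = tally(SAMPLE_CSV, "dataset_availability")
print(as_percentages(availability))
```

Normalizing the category strings (whitespace, case) before counting helps to avoid spurious extra categories caused by inconsistent manual entry.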

Figure 1.

Illustration of the final search string used in the systematic review.

Table 1.

Eligibility criteria used for study screening.

Decision Category	Inclusion Criteria	Exclusion Criteria
Modality	Histopathology, cytology, and other light microscopy-based modalities (eg, blood smear analysis, urine sediment microscopy, sperm morphology assessment)	Studies combining nonmicroscopic imaging modalities (eg, CT, MRI) with histopathology or cytology, rather than focusing exclusively on microscopic image analysis
Methods	Deep learning or machine learning methods applied to images	No deep learning/machine learning, or such methods not applied to images
Publication type	Peer-reviewed journal articles	Preprints, narrative reviews, systematic reviews, opinion pieces; editorials, commentaries
Language	English	Other languages
Scope relevance	Pathology microscopy imaging component within the review scope	Unrelated topic (eg, appeared due to broad query terms but has no pathology imaging component)

Abbreviations: CT, computed tomography; MRI, magnetic resonance imaging.

Data Extraction and Reproducibility Framework

For each of the included studies, information was extracted along 5 major dimensions according to recent minimum reporting guidelines for automated image analysis in veterinary pathology,⁹ which together formed the reproducibility framework. Extraction decisions were independently verified by coreviewer assessment across all 50 studies.

Study and task characterization

This dimension captured the pattern recognition task addressed by the model itself (eg, classification, object detection, semantic segmentation, or instance segmentation), not any downstream postprocessing steps, as well as whether the ML/DL models were trained as part of the reported work or used off-the-shelf without modification.

Data transparency

Data-related reporting was assessed in terms of transparency, completeness, and potential for reuse. We recorded whether the data set used for model development and evaluation was public, private, or mixed. A data set was considered public if both the images and their corresponding annotations were openly accessible without restriction. Studies that used images from a publicly available resource but did not release their annotations were treated as private, since reproducibility requires access to both. Data sets were coded as mixed if a portion of the images and their corresponding annotations were publicly accessible, but not the complete data set. Statements such as “available upon request” were treated as private, as conditional access does not guarantee reproducibility. We also recorded additional characteristics of the data. These included species, whether histochemical staining or immunochemical labeling was reported, and whether the digitization device was reported. We further extracted information on data set scale and structure, recording whether the number of whole-slide images or image patches was reported, and whether the image or patch size was reported when larger images were systematically tiled into smaller patches for model input. We also evaluated the description of the annotation process. Studies were categorized according to whether annotations were:

Manual. Labels or region annotations were created directly by one or more experts using dedicated software tools.

Automatic. Labels were derived by an existing algorithm (eg, a pretrained DL model) without direct per-instance human verification/correction.

Semi-automatic. Annotations were produced through a combination of algorithmic preannotation and subsequent review, correction, or refinement by pathologists.

Molecular/PCR. Labels were derived from molecular or polymerase chain reaction (PCR)-based assays rather than direct image annotation.
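
The coding rule for data set availability described in this section can be written as a small decision function. The following is a sketch under our own assumptions: the function name, argument names, and the boolean encoding are hypothetical and were not part of the extraction sheet.

```python
def code_availability(images_open, annotations_open, covers_full_dataset,
                      on_request_only=False):
    """Code a data set as 'public', 'private', or 'mixed'.

    Hypothetical encoding of the rules stated in the text:
    - "available upon request" counts as private,
    - images without openly accessible annotations count as private,
    - openly accessible images plus annotations count as public if they
      cover the complete data set, otherwise as mixed.
    """
    if on_request_only:
        return "private"
    if not (images_open and annotations_open):
        return "private"
    return "public" if covers_full_dataset else "mixed"


# Public images, but the study's own annotations were not released.
print(code_availability(True, False, True))
# Only a portion of images and annotations is openly accessible.
print(code_availability(True, True, False))
```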

Experimental design and data leakage

This dimension focused on how rigorously the experimental setup was planned and described, with particular attention to prevention of data leakage and overfitting. We recorded whether the data set was explicitly partitioned into distinct training, validation, and test sets. We further recorded whether data augmentation (eg, geometric transformations, color jittering, and stain normalization) was reported. To evaluate safeguards against data leakage, we examined how studies handled multiple samples from the same individual. In settings where more than 1 image, slide, or tile per patient was available, we recorded whether patient-level stratification was explicitly enforced, such that all data from a given individual were restricted to a single partition (training, validation, or test).
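
Patient-level stratification as defined above can be illustrated with a short sketch that assigns whole individuals, rather than single tiles, to partitions. The function name, the split fractions, and the toy identifiers are assumptions made for illustration; library routines for grouped splitting could be used instead.

```python
import random
from collections import defaultdict


def split_by_subject(tile_to_subject, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign every tile to train/validation/test at the subject level.

    All tiles of one individual end up in exactly one partition.
    """
    subjects = sorted(set(tile_to_subject.values()))
    random.Random(seed).shuffle(subjects)  # seeded for reproducibility
    n_train = round(fractions[0] * len(subjects))
    n_val = round(fractions[1] * len(subjects))

    assignment = {}
    for index, subject in enumerate(subjects):
        if index < n_train:
            assignment[subject] = "train"
        elif index < n_train + n_val:
            assignment[subject] = "validation"
        else:
            assignment[subject] = "test"

    partitions = defaultdict(list)
    for tile, subject in tile_to_subject.items():
        partitions[assignment[subject]].append(tile)
    return dict(partitions)


# Toy data: 10 animals with 3 tiles each (hypothetical identifiers).
tiles = {f"animal{a:02d}_tile{t}": f"animal{a:02d}"
         for a in range(10) for t in range(3)}
parts = split_by_subject(tiles)
print({name: len(members) for name, members in parts.items()})
```

Shuffling the list of subjects instead of the list of tiles is the essential design choice: a tile-wise shuffle would scatter tiles of the same animal across partitions.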

Model and training details

This dimension evaluated the transparency of model specification and training procedures. We first recorded whether the ML/DL models were trained as part of the reported work or used off-the-shelf without modification, the latter including commercial or proprietary tools. For studies that trained models, we assessed the reporting of key training hyperparameters, specifically the learning rate, loss function, number of epochs, and optimization algorithm. Studies were classified as having all hyperparameters mentioned, partially mentioned, or none mentioned. For studies relying on commercial or proprietary tools where training was not performed as part of the work, hyperparameter reporting was recorded as not applicable where no training details could be expected or assessed against the above criteria where the tool provided partial transparency about its underlying methodology.
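
The three-level classification of hyperparameter reporting can be expressed compactly. The sketch below is a simplified illustration with hypothetical names; the `assessable` flag stands in for the case-by-case judgment applied to commercial or proprietary tools.

```python
# The four key hyperparameters named in the text (identifiers are ours).
KEY_HYPERPARAMETERS = ("learning_rate", "loss_function", "epochs", "optimizer")


def classify_hyperparameter_reporting(reported_items, assessable=True):
    """Return 'all', 'partial', 'none', or 'not applicable'."""
    if not assessable:
        return "not applicable"
    n_reported = len(set(reported_items) & set(KEY_HYPERPARAMETERS))
    if n_reported == len(KEY_HYPERPARAMETERS):
        return "all"
    if n_reported == 0:
        return "none"
    return "partial"


print(classify_hyperparameter_reporting({"learning_rate", "optimizer"}))
```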

Evaluation and reporting of performance

The final dimension addressed how model performance was evaluated and reported. We recorded whether quantitative performance metrics were reported, specifically metrics evaluating the DL model itself, regardless of their appropriateness for the task at hand.

Results

Our search query resulted in 180 papers from the PubMed database. On the Scopus database, our query resulted in 63 papers, of which all but 1 were already found by the query on PubMed, resulting in a total of 181 papers to be screened after removal of duplicates. Following our inclusion/exclusion criteria, 98 papers were excluded as they were of unrelated topics (eg, studies not involving pathology images despite matching the search terms) or combined microscopic imaging with radiology-based experiments, and 33 as review or opinion articles. This left 50 primary research articles^{1,3,4,6–8,10,12–15,18–21,23–30,33–36,40–45,47,50–52,54–57,59,60,62–64,66–68,73} that applied ML or DL to veterinary medicine tasks as the final corpus for the systematic analysis. A PRISMA 2020 flow diagram summarizing the screening process is presented in Fig. 2. The assessment items were aligned with recent minimum reporting guidelines for automated image analysis in veterinary pathology.⁹
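
The screening numbers reported above can be checked with simple arithmetic. The variable names in this short sketch are ours; the values are taken directly from the paragraph.

```python
pubmed_records = 180
scopus_records = 63
scopus_not_in_pubmed = 1      # all but 1 Scopus record were also in PubMed

unique_records = pubmed_records + scopus_not_in_pubmed
duplicates_removed = scopus_records - scopus_not_in_pubmed

excluded_unrelated_or_radiology = 98
excluded_review_or_opinion = 33

included = (unique_records
            - excluded_unrelated_or_radiology
            - excluded_review_or_opinion)

print(unique_records, duplicates_removed, included)  # 181 62 50
```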

Figure 2.

PRISMA flow diagram, visualizing the final corpus (N = 50).

Study and Task Characterization

Out of the 50 evaluated studies, the majority of papers (n = 44) involved the training of some kind of DL/ML model, whereas in 6 papers, either no model was trained or a previously published model was used (Fig. 4a and Supplemental Table S1, Review and Analysis). Across the included studies, semantic segmentation was the most frequently applied pattern recognition task (n = 29), followed by classification (n = 19), object detection (n = 7), and instance segmentation (n = 5). We note that some papers developed multiple DL/ML models addressing different pattern recognition tasks, often to reflect a sequential processing pipeline in which, for example, 1 model first localizes or segments a region of interest and a subsequent model classifies it, as seen in spermatogenic staging combining semantic segmentation with classification¹⁴ and the automated diagnosis of canine skin tumors combining semantic segmentation with classification.¹⁸ This explains why the total count of tasks (n = 60) exceeded 50.

Data Transparency

In terms of data transparency (Fig. 3), data accessibility was limited. Of 50 papers, 45 used in-house private data sets, only 2 used fully public data sets, and 3 used a mix of public and private data (Fig. 3a). The reporting of model code or implementation details was similarly limited. Of the 50 papers, 19 used commercial software, for which code sharing is inherently not applicable; of the remaining 31 studies that developed custom pipelines, 29 reported no code repository and only 1 provided a link to an online code repository. For the remaining paper, code sharing was not applicable because a model from a previous publication was used. This makes almost all evaluated pipelines difficult to reproduce independently.

Figure 3.

Results of our analysis in the “data transparency” category, reporting about which data details were reported in the analyzed manuscripts, including data set availability (a), staining process (b), scanning device (c), image patch size (d), annotation process (e), and the number of images (f).

Of 50 studies, almost half used murine specimens (n = 24), followed by studies on canine specimens (n = 12). Research based on feline (n = 3), avian (n = 3), porcine (n = 4), simian (n = 2), and equine (n = 2) tissues contributed a smaller number of publications. One study each used bovine, ovine, and equine, and a combination of canine and feline tissues. In 1 study, blood parasites from unknown host animals were analyzed. A large number of application studies concentrated on the quantification of certain histologic features (eg, mitoses, necrosis, fibrosis, and inflammatory infiltrates) or on the derivation of prognostically relevant indices, whereas the remaining studies addressed topics such as lesion detection, grading, or cellular subtyping in immunohistochemically labeled or histochemically stained slides. Almost all papers mentioned the staining process (n = 47), with 2 papers reporting no staining and 1 using unstained specimens (Fig. 3b). Most papers also referred to the digitization device used for obtaining images (n = 48, Fig. 3c). The number of images/data set size was mostly provided (n = 47, Fig. 3f); however, a few papers failed to report this information (n = 2) or reported it partially (n = 1). In contrast, the patch size used as the model input was reported in only approximately half of the papers (n = 26, Fig. 3d). Finally, most of the studies had their annotation done manually (n = 34), and some were annotated semi-automatically (n = 8, Fig. 3e). For 2 papers, we were not able to determine the method of annotation, while in 3 papers, the images were not annotated. In 3 papers, labels were generated with the help of PCR.

Experimental Design and Data Leakage

With regard to the reporting of data splitting, just over half of studies used an explicit train/validation/test split (n = 26), while 7 used only a train/test split and 5 used only a train/validation split (Fig. 4b). A substantial portion did not mention any split strategy (n = 10), and for 2 studies, data splitting was not applicable as they applied a model from a previous publication. Potential data leakage control via patient-level stratification was also rare: only 13 studies explicitly or implicitly indicated patient-level stratification, while most reported no stratification or did not mention it (n = 29). In 5 studies, this criterion was not applicable, and only 2 explicitly reported that no patient-level stratification was performed (Fig. 4c).

Figure 4.

Results of our analysis in the “experimental design” category, including whether a deep learning (DL) or machine learning (ML) model was trained (a), the train/validation/test split strategy used (b), and patient-level stratification (c).

Model and Training Details

When it comes to the training of the DL models, the description of the architectures varied from very brief, high-level mentions of a specific backbone by name (eg, U-Net, ResNet, Mask R-CNN, and MIL frameworks) to more detailed descriptions of the individual layers, but this information was not consistent across the papers. Information on hyperparameters was similarly variable (Fig. 5). In terms of learning rate, loss function, number of epochs, and optimizer, only 9 papers provided details of all the essential hyperparameters, 19 papers provided only a partial list, 20 papers did not report any hyperparameters, and for 2 papers it was not applicable (Fig. 5b). Information on data augmentation was likewise insufficient, with only 18 papers reporting augmentation for all models, 1 paper reporting it partially, and 28 papers not mentioning augmentation. For 3 papers, augmentation was not applicable as no model was trained (Fig. 5a).

Figure 5.

Results of our analysis in the “training details” and “evaluation and reporting of performance” categories, including data augmentation (a), hyperparameter reporting (b), machine/deep learning metrics (c), and internal held-out test metrics (d).

Evaluation and Reporting of Performance

Among all the papers, 35 papers used ML/DL-related metrics, and 15 papers did not report any ML/DL metrics (only statistical metrics, Fig. 5c). Finally, in 33 of the papers, hold-out test metrics were reported (such as accuracy, F1-score, area under the curve, and Dice), whereas in 17 papers, no DL-based test metrics were reported (Fig. 5d).

Temporal Trends in Reporting Practices

A year-stratified analysis comparing earlier (2010–2021, n = 16) and more recent (2022–2025, n = 34) studies revealed only modest improvement in reporting practices over time. The proportion of studies reporting a full train/validation/test split increased from 50% to 56%, and patient-level stratification from 25% to 27%. Hyperparameter reporting remained unchanged at approximately 19% in both periods. The proportion of studies using private data sets decreased slightly from 100% to 85% in more recent work, suggesting a gradual but limited trend toward greater data accessibility.
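
A year-stratified comparison of this kind can be computed as sketched below. The records, field names, and helper functions are hypothetical and only illustrate the calculation; they do not reproduce the review's data.

```python
def period(year):
    """Map a publication year to one of the two periods used in the text."""
    return "2010-2021" if year <= 2021 else "2022-2025"


def proportion_by_period(records, field):
    """Percentage of studies per period for which `field` is True."""
    totals, hits = {}, {}
    for record in records:
        key = period(record["year"])
        totals[key] = totals.get(key, 0) + 1
        hits[key] = hits.get(key, 0) + int(bool(record[field]))
    return {key: round(100 * hits[key] / totals[key], 1) for key in totals}


# Hypothetical extraction records (not the data of this review).
records = [
    {"year": 2019, "full_split": True},
    {"year": 2020, "full_split": False},
    {"year": 2022, "full_split": True},
    {"year": 2023, "full_split": True},
    {"year": 2024, "full_split": True},
    {"year": 2025, "full_split": False},
]
print(proportion_by_period(records, "full_split"))
```

With group sizes as small as n = 16, a single study shifts a proportion by about 6 percentage points, so small differences between periods should be read with caution.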

Discussion

This systematic review gives a quantitative overview of how reproducibly DL and ML studies in veterinary pathology and microscopy are currently reported with respect to 5 aspects: task characterization, data transparency, experimental design and data leakage, model and training details, and evaluation. A consistent pattern emerged across the 50 primary studies: although many papers describe parts of the laboratory context reasonably well, key elements necessary to independently reproduce the work and to rigorously judge its validity are frequently missing or only partially reported.

A positive finding is that reporting of factors related to image acquisition was comparatively strong. The large majority of studies stated both the staining process and the scanning device, and a majority also reported either the data set size or number of images. These items help readers understand the data generation pipeline. However, reproducibility is fundamentally hindered by limited accessibility of data and code; there is a prevalence of private in-house data sets, which prevents independent verification and benchmarking. Similarly, a general lack of public code repositories restricts reproducibility of training and evaluation procedures. Notably, when commercial software is used, code sharing is inherently not possible; however, this places greater responsibility on authors to report all software versions, parameter settings, and configurations in full, which was not consistently observed in the commercial software studies in our corpus. While data sharing by private diagnostic companies and toxicologic laboratories might be restricted due to legal and data privacy constraints as well as to proprietary considerations, we want to emphasize the value of open data beyond transparency and reproducibility, which includes acceleration of methodological innovations by other research groups, broadening the availability of training data with real-world variability, and reduction of redundant efforts.⁷⁰ In cases where sharing of data sets or code is not feasible due to proprietary or legal constraints, the minimum reporting checklist can still be partially satisfied by sharing model weights, inference scripts, and fully specified configuration files. Specific recommendations for veterinary pathology, considering the individual disciplines’ restrictions, are not available yet. 
However, there are ongoing initiatives for large data repositories, such as BigPicture, that will be extremely valuable in overcoming the legal and proprietary restrictions.⁴⁶ The FAIR (findability, accessibility, interoperability, and reusability) principle should serve as a guideline to make future initiatives of open data most meaningful in advancing AI.⁷¹

The most worrying issues concern experimental design and data leakage. Only about half of the studies explicitly reported a train/validation/test split, while many omitted split information altogether. In our corpus, only 13 of 50 studies indicated patient/subject-level stratification, and almost none described any efforts taken to prevent leakage. Similarly to our findings, a survey of toxicologic pathologists revealed that almost a quarter of participants were not following a 3-fold split.⁴⁹ This has particular implications for veterinary pathology pipelines that typically output multiple images or tiles per animal; if samples from the same animal appear in both training and test sets, the resulting data leakage can produce misleading results. Because many veterinary data sets are small and include data from the same institution and similar preparation protocols, transparent reporting of splits and of leakage control should be treated as an important requirement for avoiding overoptimistic performance evaluation. Splitting data at the wrong level can considerably bias performance metrics. In digital pathology, an image-wise split that lets images from the same subject appear in both training and validation can inflate predictive scores by up to 41%.¹¹ Studies that verify subject separation show much lower accuracy compared to those likely to have data leaks. One study even showed a 28-percentage point drop (from 94 to 66%) after introducing a proper data split.⁷⁶ We therefore suggest that studies involving veterinary pathology and microscopy adopt established reporting guidelines⁹ that provide structured recommendations for documenting data partitioning strategies, cross-validation procedures, and measures taken to prevent data leakage.
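
Whether an existing split leaks at the subject level can be audited in a few lines before training. The following sketch uses hypothetical tile and animal identifiers; it lists every subject whose tiles occur in more than one partition.

```python
from collections import defaultdict


def find_leaking_subjects(partitions, tile_to_subject):
    """Return subjects whose tiles appear in more than one partition."""
    seen = defaultdict(set)
    for partition_name, tile_ids in partitions.items():
        for tile_id in tile_ids:
            seen[tile_to_subject[tile_id]].add(partition_name)
    return {subject: sorted(names)
            for subject, names in seen.items() if len(names) > 1}


# Hypothetical tile-wise split: tiles of dog1 are in both train and test.
tile_to_subject = {"t1": "dog1", "t2": "dog1", "t3": "dog2", "t4": "cat1"}
tile_wise_split = {"train": ["t1", "t3"], "test": ["t2", "t4"]}
print(find_leaking_subjects(tile_wise_split, tile_to_subject))
# {'dog1': ['test', 'train']}
```

An empty result is a necessary but not sufficient condition for a leakage-free design; other routes, such as selecting hyperparameters on the test set, are not detected by this check.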

Model and training descriptions were also often insufficient for replication. Although most studies trained a model, the reporting of model development details varied widely. In many cases, architectures were referenced only at a high level (naming U-Net, ResNet, Mask R-CNN, or MIL) without sufficient specification to reconstruct the pipeline.⁴⁹ Crucial information including model configuration, input tile size, magnification or resolution, normalization or stain processing, sampling strategy for class imbalance, loss configuration, optimizers, and training schedules was often absent. Reporting of hyperparameters and data augmentation was particularly scarce. All of these criteria are known to impact results and are commonly optimized in the development process; hence, their omission significantly impairs the reproducibility of results. To address this, studies should adopt a minimum reporting checklist for model development and training that allows the full pipeline to be reproduced.⁹ First, an explicit description of architecture and configuration should be provided, for example, backbone variant, depth/width, heads, and pretrained weights. If commercial software is used, the version and all settings and hyperparameters should be reported. Second, the formation of inputs, such as digitization device, magnification/resolution, image size, and staining process, should be reported. Third, training should be reported in enough detail to be reproduced, including loss functions and class-imbalance handling, optimizer and learning-rate schedule, batch size, epochs, regularization, augmentation with parameter ranges, early stopping/model selection, and random seeds. Where possible, these items should be complemented by sharing code, or at least a runnable configuration file, and releasing trained weights and inference scripts.
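
A machine-readable configuration file covering the three groups of items listed above might look like the following sketch. Every key name and value is a hypothetical placeholder chosen for illustration and does not describe any study in the corpus.

```python
import json

# All values are hypothetical placeholders, for illustration only.
config = {
    "architecture": {
        "backbone": "resnet50",
        "pretrained_weights": "imagenet",
        "heads": ["classification"],
    },
    "input": {
        "digitization_device": "EXAMPLE-SCANNER",
        "magnification": "40x",
        "tile_size_px": 512,
        "staining": "hematoxylin and eosin",
    },
    "training": {
        "loss_function": "cross_entropy",
        "class_imbalance_handling": "class_weights",
        "optimizer": "adam",
        "learning_rate": 1e-4,
        "lr_schedule": "cosine",
        "batch_size": 32,
        "epochs": 50,
        "regularization": {"weight_decay": 1e-5},
        "augmentation": {"rotation_deg": [0, 360], "brightness": [0.9, 1.1]},
        "model_selection": "lowest validation loss",
        "random_seed": 42,
    },
}

# Items treated as mandatory in this sketch (dotted paths into the config).
REQUIRED = (
    "architecture.backbone", "architecture.pretrained_weights",
    "input.digitization_device", "input.magnification",
    "input.tile_size_px", "input.staining",
    "training.loss_function", "training.optimizer",
    "training.learning_rate", "training.batch_size", "training.epochs",
    "training.augmentation", "training.random_seed",
)


def missing_items(cfg, required=REQUIRED):
    """List required dotted paths that are absent from the configuration."""
    missing = []
    for dotted in required:
        node = cfg
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing


serialized = json.dumps(config, indent=2, sort_keys=True)
print(missing_items(json.loads(serialized)))  # [] when nothing is missing
```

Storing such a file alongside the manuscript, for example as supplemental material, gives readers the exact settings even when the training code itself cannot be shared.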

Reporting of evaluation methods was inconsistent. Not all studies reported standard ML/DL performance metrics; some instead relied on statistical summaries or visual examination without clear reporting of model-centric metrics such as the area under the receiver operating characteristic curve, F1-score, or Dice coefficient. Better separation between model-performance evaluation and downstream statistical or biological interpretation would be advantageous for transparency and comparability across studies. In veterinary pathology, there are currently no clear recommendations on the extent to which the performance of DL models needs to be evaluated before they can be used in research, toxicologic studies, or diagnostic settings. An existing white paper for human diagnostics on patient samples can be used as an orientation for veterinary use cases.²² Furthermore, the Metrics Reloaded initiative provides a step-by-step decision workflow for selecting the most suitable performance metrics for a given evaluation.³⁸ These authors suggest that for many use cases, a combination of different performance metrics is necessary to account for the statistical limitations that an individual metric might have.
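
Why a single metric can be insufficient is easy to demonstrate with a class-imbalanced toy example. The confusion counts below are hypothetical; the formulas are the standard definitions, and for binary outputs the F1-score coincides with the Dice coefficient.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common model-centric metrics from binary confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0  # equals Dice
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical rare-object task: 20 positives among 1000 candidates.
metrics = binary_metrics(tp=10, fp=5, fn=10, tn=975)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example, accuracy is 0.985 although half of the positive objects are missed (recall 0.500, F1 approximately 0.571), which shows how a metric dominated by the majority class can mask poor detection performance.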

Considering these findings, we recommend that future veterinary pathology ML/DL studies standardize reporting based on a minimum set of items critical for reproducibility. At a minimum, studies should explicitly state the split strategy (including whether splitting was performed at the animal/patient level), describe how data leakage was prevented, and report the full training configuration (loss function, optimizer, learning rate and schedule, batch size, epochs, augmentation, and model selection criteria). Preprocessing and tiling decisions should be documented, including tile size, magnification, tissue filtering, and stain normalization, if used. Finally, authors are encouraged to share data, code, and weights whenever possible. Reporting guidelines, such as the one developed for the journal Veterinary Pathology, may help authors include all relevant information in their manuscripts.⁹
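The animal-level split recommended above can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the function names, the mapping from tile identifiers to animal identifiers, and the 20% test fraction are hypothetical, and a real study may additionally need stratification by class, institution, or scanner.

```python
import random


def split_by_animal(tile_to_animal, test_fraction=0.2, seed=0):
    """Assign whole animals (not individual tiles) to the train or test set."""
    animals = sorted(set(tile_to_animal.values()))
    rng = random.Random(seed)          # fixed seed so the split can be reported and rerun
    rng.shuffle(animals)
    n_test = max(1, round(len(animals) * test_fraction))
    test_animals = set(animals[:n_test])
    train = [t for t, a in tile_to_animal.items() if a not in test_animals]
    test = [t for t, a in tile_to_animal.items() if a in test_animals]
    return train, test


def leaked_animals(train, test, tile_to_animal):
    """Return animals that contribute tiles to both sets; empty means no leakage."""
    train_animals = {tile_to_animal[t] for t in train}
    test_animals = {tile_to_animal[t] for t in test}
    return train_animals & test_animals


# Hypothetical data set: 5 animals with 3 tiles each.
tiles = {f"animal{a}_tile{t}": f"animal{a}" for a in range(5) for t in range(3)}

train_tiles, test_tiles = split_by_animal(tiles, test_fraction=0.2, seed=42)
print(len(train_tiles), "training tiles;", len(test_tiles), "test tiles")
print("Animals in both sets:", leaked_animals(train_tiles, test_tiles, tiles))
```

Splitting at the tile level instead would allow tiles from the same animal to appear in both sets, which is the leakage scenario the recommendation is meant to prevent. Reporting the seed and the resulting animal identifiers per set lets readers verify the split.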

This review has some limitations. First, interdisciplinary work published in computer vision or biomedical imaging venues was out of scope by design; therefore, not all studies from the veterinary community were included in this review. This restriction to peer-reviewed articles indexed in veterinary-scoped journals was intentional, as our aim was to characterize reporting practices within the veterinary research community rather than the broader computational pathology field. Nevertheless, depending on the journal (eg, medical journals vs computer science journals), a variable level of computer science expertise may have been involved in manuscript preparation (authors) and review (editors and reviewers), which is likely to affect the completeness of reporting. To support veterinary researchers and reviewers in ensuring complete and transparent reporting, the journal Veterinary Pathology has established a reporting checklist for studies that use DL-based image analysis.⁹ Second, because we relied on what authors explicitly reported, our extraction likely underestimates the presence of good practices that were performed but not documented; likewise, we could not independently verify the correctness of reported practices (eg, whether patient-level splitting was truly enforced).

In conclusion, although ML and DL are increasingly used in veterinary histopathology, the reporting of methods is inconsistent. Too often, studies omit important details about experimental setup, data-leakage control, or model training, and most do not share their code. These practices limit reproducibility and the ability to compare studies. Closing these gaps would make studies more reliable, facilitate meaningful cross-study comparisons, and accelerate their safe translation into veterinary research and practice.

Supplemental Material

sj-xlsx-1-vet-10.1177_03009858261459452 – Supplemental material for Reporting transparency in veterinary pathology deep learning: A systematic review of reproducibility-critical details

Supplemental material, sj-xlsx-1-vet-10.1177_03009858261459452 for Reporting transparency in veterinary pathology deep learning: A systematic review of reproducibility-critical details by Sweta Banerjee, Christof A. Bertram, Viktoria Weiss, Jonas Ammeling, Thomas Conrad, Nils Porsche, Robert Klopfleisch, Christoph Stroblberger, Christopher Kaltenecker, Katharina Breininger and Marc Aubreville in Veterinary Pathology

sj-pdf-1-vet-10.1177_03009858261459452 – Supplemental material for Reporting transparency in veterinary pathology deep learning: A systematic review of reproducibility-critical details

Supplemental material, sj-pdf-1-vet-10.1177_03009858261459452 for Reporting transparency in veterinary pathology deep learning: A systematic review of reproducibility-critical details by Sweta Banerjee, Christof A. Bertram, Viktoria Weiss, Jonas Ammeling, Thomas Conrad, Nils Porsche, Robert Klopfleisch, Christoph Stroblberger, Christopher Kaltenecker, Katharina Breininger and Marc Aubreville in Veterinary Pathology

Footnotes

Supplemental material for this article is available online.

Author Contributions

SB assessed all manuscripts, performed analysis, and wrote the main manuscript. MA, CAB, and JA were involved in coreviewing subsets of the corpus as second reviewers. MA and CAB were involved in reviewing manuscripts where the decision was doubtful. CAB, KB, MA, and SB designed the inclusion and exclusion criteria and search terms and edited the manuscript. MA contributed one of the figures. All authors discussed the scope of the review and reviewed and contributed to the final manuscript.

Declaration of Conflicting Interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Authors Marc Aubreville and Christof Bertram are members of the Editorial Board of Veterinary Pathology but were not involved with handling the manuscript and have no further conflicts to declare. The remaining authors declare no conflicts of interest.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: CAB, VW, CS, and CK acknowledge the support from the Austrian Research Fund (FWF, project number: I 6555). SB, TC, RK, and MA acknowledge support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, project number: 520330054). KB acknowledges support by the DFG (project numbers: 405969122, 505539112, and 460333672 CRC1540 EBM).

References

1. Ancheta, Psifidi, Yale, et al. Deep-learning based morphological segmentation of canine diffuse large B-cell lymphoma. Front Vet Sci. 2025;12:1656976.

2. Aubreville, Bertram, Marzahl, et al. Deep learning algorithms out-perform veterinary pathologists in detecting the mitotically most active tumor region. 2020;10(1):16447.

3. Becker, Hansen, Nielsen, et al. Machine-learning for quantitative histopathology of piglet intestinal tissues: challenges with limited training data. Front Vet Sci. 2025;12:1620338.

4. Bédard, Westerling-Bui, Zuraw. Proof of concept for a deep learning algorithm for identification and quantification of key microscopic features in the murine model of DSS-induced colitis. Toxicol Pathol. 2021;49(4):897–904.

5. Belluco, Marano, Baiker, et al. Standardisation of canine meningioma grading: inter-observer agreement and recommendations for reproducible histopathologic criteria. Vet Comp Oncol. 2022;20(2):509–520.

6. Bertani, Blanck, Guignard, et al. Artificial intelligence in toxicological pathology: quantitative evaluation of compound-induced follicular cell hypertrophy in rat thyroid gland using deep learning models. Toxicol Pathol. 2022;50(1):23–34.

7. Bertram, Aubreville, Donovan, et al. Computer-assisted mitotic count using a deep learning-based algorithm improves interobserver reproducibility and accuracy. Vet Pathol. 2022;59(2):211–226.

8. Bertram, Marzahl, Bartel, et al. Cytologic scoring of equine exercise-induced pulmonary hemorrhage: performance of human experts and a deep learning-based algorithm. Vet Pathol. 2023;60(1):75–85.

9. Bertram, Schutten, Ressel, et al. Reporting guidelines for manuscripts that use artificial intelligence-based automated image analysis in veterinary pathology. Vet Pathol. 2025;62(5):615–617.

10. Busayakanon, Kaewthamasorn, Pinetsuksai, et al. Identification of veterinary and medically important blood parasites using contrastive loss-based self-supervised learning. Vet World. 2024;17(11):2619–2634.

11. Bussola, Marcolini, Maggio, et al. AI slipping on tiles: data leakage in digital pathology. In: Del Bimbo, Cucchiara, Sclaroff, et al., eds. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol. 12661. Cham: Springer; 2021:167–182.

12. Chanutin, Bauck, Roberts, et al. Comparison of two techniques to blind end jejunum and ileum for jejunocaecostomy in horses. Equine Vet J. 2025;57(6):1690–1702.

13. Chen, Zhang, Duan, et al. Lesion localization and pathological diagnosis of ovine pulmonary adenocarcinoma based on MASK R-CNN. Animals. 2024;14(17):2488.

14. Creasy, Panchal, Garg, et al. Deep learning-based spermatogenic staging assessment for hematoxylin and eosin-stained sections of rat testes. Toxicol Pathol. 2021;49(4):872–887.

15. De Vera Mudry, Martin, Schumacher, et al. Deep learning in toxicologic pathology: a new approach to evaluate rodent retinal atrophy. Toxicol Pathol. 2021;49(4):851–861.

16. Farhoodimoghadam, Brandt, Keller, et al. Adopting artificial intelligence in veterinary diagnostics: a scoping review of key challenges. MetaArXiv; 2025. DOI: 10.31222/osf.io/gy9pz_v3.

17. Fell, Mohammadi, Morrison, et al. Reproducibility of deep learning in digital pathology whole slide image analysis. PLOS Digital Health. 2022;1(12):e0000145.

18. Fragoso-Garcia, Wilm, Bertram, et al. Automated diagnosis of 7 canine skin tumors using machine learning on H&E-stained whole slide images. Vet Pathol. 2023;60(6):865–875.

19. Freyre CAC, Spiegel, Gubser Keller, et al. Biomarker-based classification and localization of renal lesions using learned representations of histology: a machine learning approach to histopathology. Toxicol Pathol. 2021;49(4):798–814.

20. Funk, Clement, Togninalli, et al. Comparison of an attention-based Multiple Instance Learning (MIL) with a visual transformer model: two weakly supervised Deep Learning (DL) algorithms for the detection of histopathologic lesions in the rat liver to distinguish normal from abnormal. Toxicol Pathol. 2025;53(5):456–478.

21. Gradner, Janssen, Oevermann, et al. Immunohistochemical staining properties of osteopontin and Ki-67 in feline meningiomas. Animals. 2024;14(23):3404.

22. Hanna, Olson, Zarella, et al. Recommendations for performance evaluation of machine learning in pathology: a concept paper from the College of American Pathologists. Arch Pathol Lab Med. 2024;148(10):e335–e361.

23. Heinemann, Lempp, Colbatzky, et al. Quantification of hepatocellular mitoses in a toxicological study in rats using a convolutional neural network. Toxicol Pathol. 2022;50(3):344–352.

24. Hernandez, Bilbrough GEA, DeNicola, et al. Comparison of the performance of the IDEXX SediVue Dx® with manual microscopy for the detection of cells and 2 crystal types in canine and feline urine. J Vet Intern Med. 2019;33(1):167–177.

25. Hoefling, Sing, Hossain, et al. HistoNet: a deep learning-based model of normal histology. Toxicol Pathol. 2021;49(4):784–797.

26. Horváth, Abonyi-Tóth, Papp, et al. Quantitative analysis of inflammatory uterine lesions of pregnant gilts with digital image analysis following experimental PRRSV-1 infection. Animals. 2023;13(5):830.

27. Hubbard-Perez, Luchian, Milford, et al. Use of deep learning for the classification of hyperplastic lymph node and common subtypes of canine lymphomas: a preliminary study. Front Vet Sci. 2023;10:1309877.

28. Schutt, Kozlowski, et al. Ovarian toxicity assessment in histopathological images using deep learning. Toxicol Pathol. 2020;48(2):350–361.

29. Hvid, Skydsgaard, Jensen, et al. Artificial intelligence-based quantification of epithelial proliferation in mammary glands of rats and oviducts of Göttingen minipigs. Toxicol Pathol. 2021;49(4):912–927.

30. Hwang, Kim, Park, et al. Implementation and practice of deep learning-based instance segmentation algorithm for quantification of hepatic fibrosis at whole slide level in Sprague-Dawley rats. Toxicol Pathol. 2022;50(2):186–196.

31. Kaliyugarasan, Lundervold AS. FastMONAI: a low-code deep learning library for medical image analysis. Software Impacts. 2023;18:100583.

32. Kapoor, Narayanan. Leakage and the reproducibility crisis in machine-learning-based science. Patterns. 2023;4(9):100804.

33. Kim, Baek, Hwang, et al. Application of convolutional neural network for analyzing hepatic fibrosis in mice. J Toxicol Pathol. 2023;36(1):21–30.

34. Küchler, Posthaus, Jäger, et al. Artificial intelligence to predict the BRAF V595E mutation in canine urinary bladder urothelial carcinomas. Animals. 2023;13(15):2404.

35. Kuklyte, Fitzgerald, Nelissen, et al. Evaluation of the use of single- and multi-magnification convolutional neural networks for the determination and quantitation of lesions in nonclinical pathology studies. Toxicol Pathol. 2021;49(4):815–842.

36. Liu, Zhang, et al. An analysis on efficacy of applying β-elemene intervention on chemically-induced tongue lesions using SAM algorithm. Anatom Histol Embryol. 2024;53(5):e13095.

37. Liu, Chen PHC, Krause, et al. How to read articles that use machine learning: users’ guides to the medical literature. JAMA. 2019;322(18):1806–1816.

38. Maier-Hein, Reinke, Godau, et al. Metrics reloaded: recommendations for image analysis validation. Nature Methods. 2024;21(2):195–212.

39. Matthews, McGenity, Bansal, et al. Public evidence on AI products for digital pathology. NPJ Digital Medicine. 2024;7(1):300.

40. Mecklenburg, Luetjens, Romeike, et al. Deep learning-based spermatogenic staging in tissue sections of cynomolgus macaque testes. Toxicol Pathol. 2024;52(1):4–12.

41. Mehrvar, Kambara. Morphologic features and deep learning-based analysis of canine spermatogenic stages. Toxicol Pathol. 2022;50(6):736–753.

42. Mehrvar, Maisonave, Buck, et al. Immunohistochemistry-free enhanced histopathology of the rat spleen using deep learning. Toxicol Pathol. 2025;53(1):83–94.

43. Mohammed, Ebrahim SK. Myofibril architecture in prediction and evaluation of broiler wooden breast: histopathological and immunohistochemical study. Iraqi J Vet Sci. 2024;38(3):663–670.

44. Moraes, Osório, Salle, et al. Evaluation of follicular lymphoid depletion in the Bursa of Fabricius: an alternative methodology using digital image analysis and artificial neural networks. Pesq Vet Bras. 2010;30(4):340–344.

45. Morisi, Rai, Bacon, et al. Detection of necrosis in digitised whole-slide images for better grading of canine soft-tissue sarcomas using machine-learning. Vet Sci. 2023;10(1):45.

46. Moulin, Grünberg, Barale-Thomas, et al. IMI—bigpicture: a central repository for digital pathology. Toxicol Pathol. 2021;49(4):711–713.

47. Pacholec, Xie, Curnin, et al. Impact of magnification, image type, and number on convolutional neural network performance in differentiating canine large cell lymphoma from non-lymphoma via lymph node cytology. Vet Clin Pathol. 2025;54:S82–S94.

48. Page, McKenzie, Bossuyt, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.

49. Palazzi, Barale-Thomas, Bawa, et al. Results of the European Society of Toxicologic Pathology survey on the use of artificial intelligence in toxicologic pathology. Toxicol Pathol. 2023;51(4):216–224.

50. Pischon, Mason, Lawrenz, et al. Artificial intelligence in toxicologic pathology: quantitative evaluation of compound-induced hepatocellular hypertrophy in rats. Toxicol Pathol. 2021;49(4):928–937.

51. Puget, Ganz, Bertram, et al. Artificial intelligence predicts c-KIT exon 11 genotype by phenotype in canine cutaneous mast cell tumors: can human observers learn it? Vet Pathol. 2026;63(2):369–379.

52. Puget, Ganz, Ostermaier, et al. Artificial intelligence can be trained to predict c-KIT-11 mutational status of canine mast cell tumors from hematoxylin and eosin-stained histological slides. Vet Pathol. 2025;62(2):152–160.

53. Rai, Morisi, Bacci, et al. Deep learning for necrosis detection using canine perivascular wall tumour whole slide images. Sci Rep. 2022;12(1):10634.

54. Ramot, Deshpande, Morello, et al. Microscope-based automated quantification of liver fibrosis in mice using a deep learning algorithm. Toxicol Pathol. 2021;49(5):1126–1133.

55. Ramot, Zandani, Madar, et al. Utilization of a deep learning algorithm for microscope-based fatty vacuole quantification in a fatty liver model in mice. Toxicol Pathol. 2020;48(5):702–707.

56. Rodriguez Barbon, Bentley, Drake, et al. Hepatic iron assessment using pinch liver biopsies in Asian glossy starlings (Aplonis panayensis). Vet Pathol. 2025;62(3):360–363.

57. Rudmann, Albretsen, Doolan, et al. Using deep learning artificial intelligence algorithms to verify N-Nitroso-N-Methylurea and Urethane positive control proliferative changes in Tg-RasH2 mouse carcinogenicity studies. Toxicol Pathol. 2021;49(4):938–949.

58. Salvi, Molinari, Iussich, et al. Histopathological classification of canine cutaneous round cell tumors using deep learning: a multi-center study. Front Vet Sci. 2021;8:640944.

59. Shimada, Tanimoto, Sasaki, et al. Automated scoring of glomerular injury in TNS2-deficient nephropathy. Exp Anim. 2024;73(4):370–375.

60. Shimazaki, Deshpande, Hajra, et al. Deep learning-based image-analysis algorithm for classification and quantification of multiple histopathological lesions in rat liver. J Toxicol Pathol. 2022;35(2):135–147.

61. Simkó, Garpebring, Jonsson, et al. Reproducibility of the methods in medical imaging with deep learning. Proc Mach Learn Res. 2023;227:95–106.

62. Singh, Mahore, Das, et al. Development of deep learning-based mobile application for the identification of Coccidia species in pigs using microscopic images. Vet Parasitol. 2025;334:110373.

63. Smith, Westerling-Bui, Wilcox, et al. Screening for bone marrow cellularity changes in cynomolgus macaques in toxicology safety studies using artificial intelligence models. Toxicol Pathol. 2021;49(4):905–911.

64. Steinbach, Tokarz, et al. Inter-rater and intra-rater agreement in scoring severity of rodent cardiomyopathy and relation to artificial intelligence-based scoring. Toxicol Pathol. 2024;52(5):258–265.

65. Sundberg, Holmström. Democratizing artificial intelligence: how no-code AI can leverage machine learning operations. Bus Horiz. 2023;66(6):777–788.

66. Tokarz, Steinbach, Lokhande, et al. Using artificial intelligence to detect, classify, and objectively score severity of rodent cardiomyopathy. Toxicol Pathol. 2021;49(4):888–896.

67. Urli, Corte Pause, Dreossi, et al. Evaluation of an artificial intelligence system for bull sperm morphology evaluation. Theriogenology. 2025;245:117504.

68. Vuorimaa, Kareinen, Toivanen, et al. Deep learning-based segmentation of morphologically distinct rat hippocampal reactive astrocytes after trimethyltin exposure. Toxicol Pathol. 2022;50(6):754–762.

69. Wagner, Matek, Shetab Boushehri, et al. Built to last? Reproducibility and reusability of deep learning algorithms in computational pathology. Modern Pathol. 2024;37(1):100350.

70. Walsh, Fishman, Garcia-Gasulla, et al. Reproducibility standards for machine learning in the life sciences. Nature Methods. 2021;18(10):1132–1135.

71. Wilkinson, Dumontier, Aalbersberg IJJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018.

72. Willard, Jergens, Duncan, et al. Interobserver variation among histopathologic evaluations of intestinal tissues from dogs and cats. J Am Vet Med Assoc. 2002;220(8):1177–1182.

73. Wulcan, Giaretta, Fingerhood, et al. Artificial intelligence-based quantification of lymphocytes in feline small intestinal biopsies. Vet Pathol. 2025;62(2):139–151.

74. Xiao, Dhand, Wang, et al. Review of applications of deep learning in veterinary diagnostics and animal health. Front Vet Sci. 2025;12:1511522.

75. Yap, Rasotto, Priestnall, et al. Intra- and inter-observer agreement in histological assessment of canine soft tissue sarcoma. Vet Comp Oncol. 2017;15(4):1553–1557.

76. Young, Gates, Garcia, et al. Data leakage in deep learning for Alzheimer’s disease diagnosis: a scoping review of methodological rigor and performance inflation. Diagnostics. 2025;15(18):2348.
