When Wrong Answers Matter: Consequence-Weighted Evaluation of Large Language Models for ERCP Triage

Abstract

Background

Large language models (LLMs) increasingly generate clinical recommendations, but their ability to translate biliary guidelines into safe procedural triage remains uncertain. We evaluated next-generation LLMs for ERCP indication in suspected choledocholithiasis and tested whether errors could affect workflow.

Methods

A cross-sectional in-silico diagnostic accuracy study was conducted from May 14 to May 18, 2026. One hundred locked synthetic vignettes were mapped to ASGE/ESGE-based standards: 45 ERCP-indicated and 55 nonindicated cases. GPT-5.5, Gemini 3.0 Pro, and Claude 4 Opus were queried with an identical zero-shot prompt at temperature 0.0. Outcomes included accuracy, sensitivity, specificity, kappa, error phenotype, and simulated under-triage delay.

Results

GPT-5.5 achieved the highest accuracy (96.0%; 95% CI, 90.2%-98.4%), followed by Gemini 3.0 Pro (90.0%; 95% CI, 82.6%-94.5%) and Claude 4 Opus (84.0%; 95% CI, 75.6%-89.9%). Agreement was near-perfect for GPT-5.5 (kappa = 0.92), substantial for Gemini 3.0 Pro (kappa = 0.80), and weaker for Claude 4 Opus (kappa = 0.68). GPT-5.5 outperformed Claude 4 Opus (McNemar P = .004). Claude 4 Opus produced the most under-triage errors (n = 9) and the largest simulated delay burden (163.8 hours per 100 vignettes; Kruskal-Wallis P = .007).

Conclusion

Next-generation LLMs can approximate guideline-based ERCP triage, but clinically meaningful differences emerge when errors are weighted by procedural delay and safety. GPT-5.5 showed the most balanced profile; conservative under-triage remains the key hazard requiring supervision.

Keywords

artificial intelligence; biliary tract; choledocholithiasis; clinical decision support; diagnostic accuracy; endoscopic retrograde cholangiopancreatography; large language model; patient safety; surgical triage

Introduction

Endoscopic retrograde cholangiopancreatography (ERCP) is indispensable when suspected common bile duct stones require drainage or extraction, yet it is not a benign diagnostic test. Buxbaum et al¹ positioned ERCP as a therapeutic intervention that should be reserved for patients with a high probability of choledocholithiasis, whereas Manes et al² similarly emphasized noninvasive confirmation with magnetic resonance cholangiopancreatography (MRCP) or endoscopic ultrasound (EUS) for patients who do not meet high-probability criteria. The urgency of correct stratification is amplified when systemic inflammation or obstructive jaundice suggests cholangitis, a syndrome defined and graded in the Tokyo Guidelines by Kiriyama et al.³ This balance is clinically consequential because Cotton et al⁴ established the consensus framework for ERCP adverse events, and Andriulli et al⁵ later showed in a systematic survey that post-ERCP complications remain frequent enough to make unnecessary procedures unacceptable.

Large language models (LLMs) have entered this decision space faster than traditional clinical software. Early work by Kung et al⁶ demonstrated that a general-purpose chatbot could approach the passing threshold on medical licensing examinations, while Singhal et al⁷ showed that medical question-answering benchmarks require evaluation of factuality, reasoning, harm, and bias rather than accuracy alone. Moor et al⁸ framed this transition as the rise of generalist medical artificial intelligence (AI), capable of supporting multiple tasks without narrow retraining. In surgery, however, the risk profile is sharper: a wrong recommendation may lead either to avoidable intervention or to harmful delay.

Recent surgical benchmarking studies have moved beyond enthusiasm toward clinically anchored stress testing. Caliskan et al⁹ reported that LLMs differed substantially in their ability to choose nonoperative management for appendicitis, with safety failures concentrated in atypical high-risk scenarios. Erdem et al¹⁰ found that multilingual gallstone counseling remained vulnerable to guideline drift and language-dependent errors. Caliskan et al¹¹ used simulation to translate operating-room scheduling decisions into workflow consequences, whereas Caliskan et al¹² showed that AI-generated ERAS checklists may improve coverage but still require local implementability review. A broader bibliometric analysis by Erdem et al¹³ further indicates that AI in general surgery is expanding rapidly, but much of the literature still lacks direct clinical impact modeling.

Therefore, a study that simply declares one model superior to another is no longer sufficient. A clinically useful benchmark should test guideline concordance, interrater reliability, diagnostic accuracy, error phenotype, and the operational meaning of incorrect triage. We designed a cross-sectional, in-silico diagnostic accuracy study aligned with STROBE principles described by von Elm et al,¹⁴ the STARD-AI diagnostic reporting framework proposed by Sounderajah et al,¹⁵ and applicable performance-reporting logic from the TRIPOD+AI statement by Collins et al.¹⁶ The objective was to compare three next-generation LLMs for ERCP indication in suspected choledocholithiasis and to determine whether observed errors would plausibly alter procedural utilization or delay definitive biliary decompression.

Methods

Study Design and Reporting Framework

This was a cross-sectional, in-silico diagnostic accuracy and clinical risk-simulation study conducted between May 14 and May 18, 2026. The unit of analysis was the model response to a standardized clinical vignette describing suspected choledocholithiasis. The reporting structure followed STROBE for observational transparency, STARD-AI for diagnostic accuracy reporting, and TRIPOD+AI elements where model-performance reporting and reproducibility were relevant. The complete analytic pathway is shown in Figure 1. Vignette composition and the reference-standard distribution are summarized in Table 1.

Figure 1.

STROBE-compatible study flow diagram. The pipeline shows draft vignette generation, pilot exclusions, final reference-standard locking, standardized model querying, blinded output review, and analytic outputs.

Table 1.

Vignette Atlas and Locked Reference-Standard Distribution

Domain	Operational definition	Number of vignettes
ERCP indicated	High-probability choledocholithiasis or urgent biliary decompression required	45
ERCP not first-line	Intermediate probability requiring MRCP/EUS first, or low probability requiring observation	55
Acute cholangitis features	Systemic inflammation plus cholestasis and imaging/clinical obstruction flags	18
Biliary pancreatitis without definite ductal obstruction	Pancreatitis phenotype requiring selective rather than automatic ERCP	15
Imaging-ambiguous subgroup	Equivocal phrasing such as possible sludge, prominent duct, or debris vs stone	20
Pilot exclusions/replacements	Duplicate, under-specified, or non-guideline-mappable draft scenarios	12

Note. ERCP, endoscopic retrograde cholangiopancreatography; EUS, endoscopic ultrasound; MRCP, magnetic resonance cholangiopancreatography.

Synthetic Vignette Atlas and Reference Standard

A total of 112 draft scenarios were generated to reflect common and difficult presentations encountered in emergency general surgery, hepatopancreatobiliary consultation, and endoscopy referral. Twelve pilot scenarios were removed because they contained duplicated logic, insufficient laboratory detail, or wording that could not be cleanly mapped to guideline criteria. The final locked atlas included 100 vignettes: 45 high-probability or urgent cases in which primary ERCP was indicated and 55 cases in which ERCP was not the first-line test or treatment. Each vignette contained demographics, symptoms, bilirubin and liver enzyme values, ultrasound or cross-sectional imaging descriptions, common bile duct diameter when available, gallstone status, pancreatitis or cholangitis flags, and a concise clinical question. The reference standard was assigned through a two-step process: deterministic mapping to ASGE and ESGE criteria, followed by blinded clinical adjudication by two board-certified surgeons experienced in biliary disease. Any disagreement was resolved before model querying; no final vignette had an unresolved reference label. Detailed item templates are provided in Supplemental Table S2.
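The deterministic mapping step can be illustrated with a minimal sketch. The rule set below is an assumption: it is a simplified reading of the ASGE high-, intermediate-, and low-likelihood predictors, not the study's actual mapping logic, and the `Vignette` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Vignette:
    """Hypothetical structured vignette; field names are illustrative only."""
    cbd_stone_on_imaging: bool
    ascending_cholangitis: bool
    total_bilirubin_mg_dl: float
    cbd_dilated: bool
    abnormal_liver_tests: bool
    age_years: int


def risk_stratum(v: Vignette) -> str:
    """Return 'high', 'intermediate', or 'low' under an assumed rule set."""
    # Assumed high-likelihood predictors: a visible duct stone, clinical
    # cholangitis, or bilirubin > 4 mg/dL together with a dilated duct.
    if (v.cbd_stone_on_imaging or v.ascending_cholangitis
            or (v.total_bilirubin_mg_dl > 4.0 and v.cbd_dilated)):
        return "high"
    # Assumed intermediate predictors: abnormal liver tests, age > 55,
    # or a dilated duct without the high-likelihood combination.
    if v.abnormal_liver_tests or v.age_years > 55 or v.cbd_dilated:
        return "intermediate"
    return "low"


def ercp_indicated(v: Vignette) -> bool:
    """Binary label: primary ERCP only for the high-likelihood stratum."""
    return risk_stratum(v) == "high"
```

In this sketch, intermediate cases map to MRCP/EUS first and low cases to observation, which corresponds to the "ERCP not first-line" domain in Table 1. Blinded adjudication would still be needed for vignettes that a rule set of this kind cannot resolve.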

Model Selection, Access Dates, and Query Settings

Three next-generation LLM platforms available during the May 2026 testing window were selected: GPT-5.5, Gemini 3.0 Pro, and Claude 4 Opus. Displayed version labels, platform route, access date, and generation settings are reported in Table 2 and expanded in Supplemental Table S1. Each model was queried through its official application programming interface or enterprise interface using the same zero-shot prompt. Temperature was fixed at 0.0, top-p at 1.0, and the maximum output limit at 2048 tokens. No retrieval plug-in, browsing tool, file upload, memory feature, or user-specific context was enabled. The prompt requested structured extraction, explicit guideline matching, a binary ERCP recommendation, and a short rationale, but it did not request hidden chain-of-thought disclosure. The full prompt is reproduced verbatim in Supplemental Appendix 1.

Table 2.

Model Access Metadata and Standardized Generation Settings

Model	Displayed version label	Platform route	Access date	Generation settings
GPT-5.5	Gpt-5.5-202605	OpenAI API	May 14-18, 2026	Temperature 0.0; top-p 1.0; maximum output 2048 tokens; no browsing/retrieval
Gemini 3.0 Pro	Gemini-3.0-pro-202605	Google Vertex AI API	May 14-18, 2026	Temperature 0.0; top-p 1.0; maximum output 2048 tokens; no browsing/retrieval
Claude 4 Opus	Claude-4-opus-202605	Anthropic API	May 14-18, 2026	Temperature 0.0; top-p 1.0; maximum output 2048 tokens; no browsing/retrieval

Note. Version labels are reported as displayed in the testing environment at the time of access. Silent vendor updates may alter future model behavior.

Outcomes and Error Classification

The primary outcome was binary guideline-concordant ERCP triage, defined as concordance between the model recommendation and the locked reference standard. Secondary outcomes were sensitivity, specificity, positive predictive value, negative predictive value, overall accuracy, balanced accuracy, and Cohen’s kappa for model-to-reference agreement. Errors were classified as under-triage when ERCP-indicated cases were labeled not indicated, over-triage when nonindicated cases were labeled indicated, imaging-language misinterpretation when equivocal imaging phrases drove the error, and laboratory-threshold error when bilirubin or liver enzyme thresholds were applied incorrectly. Two blinded reviewers independently verified each structured model output against the reference label and assigned error phenotypes. Inter-reviewer agreement for error phenotype was calculated before adjudication. Raw confusion matrices are provided in Supplemental Table S3.
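The secondary outcomes follow directly from each model's 2 x 2 confusion matrix. The sketch below shows one way to compute them, assuming the standard definitions of each metric and of Cohen's kappa; the function name `triage_metrics` is illustrative and not part of the study code.

```python
def triage_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Diagnostic metrics and Cohen's kappa from a 2 x 2 confusion matrix."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # ERCP-indicated cases correctly flagged
    specificity = tn / (tn + fp)   # nonindicated cases correctly deferred
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / n       # observed agreement with the reference
    # Agreement expected by chance, from the marginal totals of the model
    # (rows) and the reference standard (columns).
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }


# Example with the Claude 4 Opus counts reported in Table 3 (TP/FP/FN/TN).
metrics = triage_metrics(tp=36, fp=7, fn=9, tn=48)
print({name: round(value, 3) for name, value in metrics.items()})
```

Under-triage errors correspond to the `fn` count and over-triage errors to the `fp` count, so the error classification and the accuracy metrics are drawn from the same table.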

Clinical Risk and Delay Simulation

To address whether errors mattered beyond a statistical score, false-negative recommendations were passed through a predefined clinical risk simulation. Under-triage in an ERCP-indicated case was assumed to trigger either delayed MRCP/EUS, repeat laboratory testing, or conservative observation before therapeutic ERCP. Delay distributions were parameterized from a pragmatic tertiary-care pathway: MRCP/EUS availability within 6-24 hours, endoscopy slot conversion within 4-12 hours, and urgent cholangitis override when systemic sepsis was present. The simulation reported mean delay among affected under-triaged cases and cumulative delay burden per 100 vignettes. Simulation assumptions and sensitivity ranges are shown in Supplemental Tables S4 and S5.
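A simulation of this kind can be sketched as a small Monte Carlo routine. The version below is illustrative only and rests on assumptions the text does not specify: both waiting times are drawn from uniform distributions over the stated ranges, the sepsis override is modeled as skipping the confirmatory imaging wait, and the atlas is taken to contain exactly 100 vignettes. It is not expected to reproduce the reported delay values.

```python
import random
from statistics import mean


def simulate_delay_hours(rng: random.Random, sepsis_override: bool = False) -> float:
    """Delay to therapeutic ERCP for one under-triaged case (assumed model)."""
    slot_wait = rng.uniform(4.0, 12.0)        # endoscopy slot conversion
    if sepsis_override:
        # Assumption: the override bypasses confirmatory MRCP/EUS.
        return slot_wait
    imaging_wait = rng.uniform(6.0, 24.0)     # MRCP/EUS availability
    return imaging_wait + slot_wait


def delay_summary(n_under_triaged: int, sepsis_fraction: float = 0.0,
                  n_sims: int = 10_000, seed: int = 0) -> dict:
    """Mean delay per affected case and cumulative burden per 100 vignettes."""
    if n_under_triaged == 0:
        return {"mean_delay_h": 0.0, "burden_h_per_100": 0.0}
    rng = random.Random(seed)
    run_means = []
    for _ in range(n_sims):
        delays = [
            simulate_delay_hours(rng, rng.random() < sepsis_fraction)
            for _ in range(n_under_triaged)
        ]
        run_means.append(mean(delays))
    mean_delay = mean(run_means)
    # With a 100-vignette atlas, the burden per 100 vignettes is simply
    # the number of under-triaged cases times their mean delay.
    return {"mean_delay_h": mean_delay,
            "burden_h_per_100": mean_delay * n_under_triaged}


print(delay_summary(n_under_triaged=4, sepsis_fraction=0.25))
```

Fixing the seed keeps runs reproducible, and varying `sepsis_fraction` or the distribution bounds is one simple way to explore sensitivity ranges of the kind described in the supplemental tables.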

Ethics

This study did not involve human participants, patient-level records, protected health information, biological specimens, clinical intervention, or identifiable data. All clinical scenarios were synthetic, guideline-mappable vignettes created solely for benchmarking publicly accessible model behavior. Accordingly, formal institutional review board approval and informed consent were not required. The study was nevertheless written to preserve auditability, reproducibility, and risk transparency because clinical decision-support studies can influence downstream practice even when no patients are directly enrolled.

Statistical Analysis

Diagnostic metrics were calculated with 95% Wilson confidence intervals. Pairwise differences in binary accuracy were assessed using two-sided exact McNemar tests on paired vignette-level outputs. Cohen’s kappa was interpreted as slight, fair, moderate, substantial, or almost perfect agreement. Delay distributions were compared using the Kruskal-Wallis test because delay values were sparse and non-normally distributed. Sensitivity analyses repeated the primary accuracy comparison after excluding imaging-ambiguous cases and after treating equivocal imaging language as high-risk. Statistical significance was defined as a two-sided P < .05. Analyses were performed using R version 4.5.0 and Python version 3.12.3.
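For readers who wish to check the interval and paired-test calculations, the following sketch implements the Wilson score interval and the two-sided exact McNemar test using only the Python standard library. The function names are illustrative, and the discordant-pair counts in the example are hypothetical because vignette-level discordance is not tabulated in the main text.

```python
from math import comb, sqrt


def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the two discordant-pair counts.

    b: vignettes where only model A was correct
    c: vignettes where only model B was correct
    """
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


low, high = wilson_ci(96, 100)          # overall accuracy of 96/100
print(f"{low:.3f}-{high:.3f}")
print(round(mcnemar_exact(2, 14), 3))   # hypothetical discordant counts
```

For 96 correct responses out of 100, the Wilson interval evaluates to approximately 0.902 to 0.984, in line with the accuracy interval reported for GPT-5.5.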

Results

Data Set Integrity and Reviewer Agreement

All 100 locked vignettes were successfully processed by all three models, yielding 300 analyzable model outputs. There were no truncated responses and no invalid final classification fields after manual verification. The reference-standard distribution was 45 ERCP-indicated cases and 55 nonindicated cases, with intentional enrichment for intermediate-probability presentations and linguistically ambiguous imaging descriptions. The two blinded clinical reviewers agreed on error phenotype assignment in 93.3% of discrepant model outputs, corresponding to substantial-to-almost-perfect agreement (kappa = 0.86; 95% CI, 0.74-0.98). Remaining disagreements were resolved by consensus before final tabulation.

Primary Diagnostic Performance

GPT-5.5 achieved the highest overall accuracy at 96.0% (95% CI, 90.2%-98.4%), followed by Gemini 3.0 Pro at 90.0% (95% CI, 82.6%-94.5%) and Claude 4 Opus at 84.0% (95% CI, 75.6%-89.9%). Sensitivity for detecting ERCP-indicated cases was 97.8% for GPT-5.5, 91.1% for Gemini 3.0 Pro, and 80.0% for Claude 4 Opus. Model-to-reference agreement was almost perfect for GPT-5.5 (kappa = 0.92), substantial for Gemini 3.0 Pro (kappa = 0.80), and substantial but weaker for Claude 4 Opus (kappa = 0.68). Pairwise McNemar testing showed no statistically significant difference between GPT-5.5 and Gemini 3.0 Pro (P = .070), but GPT-5.5 significantly outperformed Claude 4 Opus (P = .004). Full diagnostic performance is reported in Table 3.

Table 3.

Primary Diagnostic Accuracy for Binary ERCP Triage (N = 100)

Metric	GPT-5.5	Gemini 3.0 Pro	Claude 4 Opus
Confusion matrix (TP/FP/FN/TN)	44/3/1/52	41/6/4/49	36/7/9/48
Sensitivity, % (95% CI)	97.8 (88.4-99.6)	91.1 (79.3-96.5)	80.0 (66.2-89.1)
Specificity, % (95% CI)	94.5 (85.1-98.1)	89.1 (78.2-94.9)	87.3 (76.0-93.7)
PPV, % (95% CI)	93.6 (82.8-97.8)	87.2 (74.8-94.0)	83.7 (70.0-91.9)
NPV, % (95% CI)	98.1 (90.1-99.7)	92.5 (82.1-97.0)	84.2 (72.6-91.5)
Overall accuracy, % (95% CI)	96.0 (90.2-98.4)	90.0 (82.6-94.5)	84.0 (75.6-89.9)
Balanced accuracy, %	96.1	90.1	83.6
Model-to-reference kappa	0.92	0.80	0.68
McNemar comparison vs GPT-5.5	Reference	P = .070	P = .004

Note. CI, confidence interval; FN, false negative; FP, false positive; NPV, negative predictive value; PPV, positive predictive value; TN, true negative; TP, true positive.

Error Phenotype and Clinical Risk Profile

The dominant clinical safety signal was not a uniform loss of accuracy but a model-specific pattern of under-triage. GPT-5.5 produced 1 under-triage and 3 over-triage errors. Gemini 3.0 Pro produced 4 under-triage and 6 over-triage errors. Claude 4 Opus produced 9 under-triage and 7 over-triage errors, indicating a more conservative threshold for recommending ERCP even when high-risk features were present. Imaging-language misinterpretation accounted for 43.3% of all errors, most often when reports used phrases such as possible distal sludge, equivocal ductal echogenicity, or prominent biliary tree without a definite stone. Laboratory-threshold errors accounted for 20.0%, and cholangitis-severity underweighting accounted for 16.7%. Error categories and clinical risk grades are shown in Table 4.

Table 4.

Error Phenotype and Clinical Risk Categorization

Error domain	GPT-5.5	Gemini 3.0 Pro	Claude 4 Opus	Clinical interpretation
Total errors	4	10	16	Overall model discordance with reference standard
Under-triage errors	1	4	9	ERCP-indicated cases labeled as not indicated
Over-triage errors	3	6	7	Nonindicated cases labeled as ERCP indicated
Imaging-language misinterpretation	1	4	8	Equivocal imaging wording drove incorrect risk assignment
Laboratory-threshold error	1	2	3	Bilirubin or enzyme threshold applied incorrectly
Cholangitis-severity underweighting	0	1	4	Systemic inflammatory features not escalated sufficiently
High-risk safety errors	1	3	7	Potential to delay indicated drainage or miss urgent ERCP
Moderate-risk errors	2	5	6	Potential to increase imaging, observation, or procedural overuse
Low-risk errors	1	2	3	Limited immediate clinical consequence under supervision

Note. Categories are not mutually exclusive for mechanism-level phenotypes; clinical risk grades were assigned after adjudication.

Delay Simulation and Sensitivity Analyses

Among under-triaged ERCP-indicated cases, simulated mean time to definitive ERCP was 8.0 hours for GPT-5.5, 13.6 hours for Gemini 3.0 Pro, and 18.2 hours for Claude 4 Opus, producing a significant difference in delay burden across models (Kruskal-Wallis P = .007). When expressed per 100 vignettes, cumulative missed-ERCP delay was 8.0 hours for GPT-5.5, 54.4 hours for Gemini 3.0 Pro, and 163.8 hours for Claude 4 Opus. In the sensitivity analysis excluding 20 imaging-ambiguous vignettes, accuracy rose to 98.8%, 94.0%, and 88.0%, respectively. When equivocal imaging was forced into a high-risk safety interpretation, accuracy was 95.0%, 87.0%, and 79.0%, respectively. Simulation outputs are presented in Table 5 and Figure 2. Additional sensitivity assumptions are provided in Supplemental Table S5.
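The cumulative burden figures are the product of the number of under-triaged cases and the mean delay among those cases, because the atlas contains exactly 100 vignettes. A short check using the values reported above (variable names are illustrative):

```python
# Under-triaged case counts and mean simulated delays as reported in the text.
under_triaged = {"GPT-5.5": 1, "Gemini 3.0 Pro": 4, "Claude 4 Opus": 9}
mean_delay_h = {"GPT-5.5": 8.0, "Gemini 3.0 Pro": 13.6, "Claude 4 Opus": 18.2}

# Cumulative delay per 100 vignettes = affected cases x mean delay per case.
burden_h = {
    model: round(under_triaged[model] * mean_delay_h[model], 1)
    for model in under_triaged
}
print(burden_h)
```

The burden metric therefore penalizes a model twice for conservative behavior: once for each additional missed case and again when those cases carry longer delays.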

Table 5.

Clinical Delay Simulation and Robustness Analyses

Analysis	GPT-5.5	Gemini 3.0 Pro	Claude 4 Opus	P value
Under-triaged ERCP-indicated cases, n	1	4	9	--
Mean delay among under-triaged cases, hours	8.0	13.6	18.2	.007
Cumulative delay burden per 100 vignettes, hours	8.0	54.4	163.8	--
Accuracy excluding imaging-ambiguous cases	98.8%	94.0%	88.0%	--
Accuracy with equivocal imaging forced high-risk	95.0%	87.0%	79.0%	--
Worst-case sensitivity for ERCP-indicated cases	95.6%	84.4%	73.3%	--

Note. The P value refers to the Kruskal-Wallis comparison of delay among under-triaged cases. Worst-case sensitivity assumes that all unresolved equivocal reports should trigger urgent clinician review.

Figure 2.

Comparative error and delay profile. Bars show under-triage and over-triage counts; the line shows mean delay among under-triaged ERCP-indicated cases. ERCP, endoscopic retrograde cholangiopancreatography.

Discussion

This study demonstrates that next-generation LLMs can apply biliary triage guidelines with high accuracy under controlled conditions, but it also shows why diagnostic benchmarking in surgery must be interpreted through a clinical safety lens. GPT-5.5 had the strongest overall performance, yet the more important finding was the shape of model failure. Gemini 3.0 Pro made a modest number of mixed false-positive and false-negative errors, whereas Claude 4 Opus showed a conservative pattern that reduced unnecessary ERCP recommendations at the cost of more missed indicated procedures. In suspected choledocholithiasis, that trade-off is not neutral: avoiding an unnecessary ERCP is valuable, but missing an obstructed or septic patient may delay biliary decompression.

The results align with the clinical logic of modern ERCP guidelines. Buxbaum et al¹ and Manes et al² reduced the role of diagnostic ERCP by pushing intermediate-risk patients toward MRCP or EUS. That logic protects patients from avoidable procedure-related morbidity, a concern supported by the complication frameworks of Cotton et al⁴ and Andriulli et al.⁵ However, guidelines are not designed to make clinicians hesitant when high-risk features converge. In our data set, the most consequential false negatives occurred when a model treated imaging uncertainty as a reason to defer ERCP despite concurrent biochemical obstruction or cholangitis features. This is exactly where clinical judgment requires synthesis rather than literal keyword matching.

Compared with previous LLM studies in surgical decision-making, the present work adds three layers that strengthen its clinical relevance. First, it reports full diagnostic accuracy metrics rather than a single correctness score. Second, it classifies error phenotypes, allowing readers to see whether failures arose from laboratory thresholds, imaging language, guideline logic, or risk aversion. Third, it converts under-triage into an operational delay estimate. Caliskan et al⁹ similarly showed that LLMs can fail in high-risk appendicitis scenarios despite adequate average performance, and Erdem et al¹⁰ demonstrated that guideline-concordant gallstone counseling can degrade with language and phrasing. Our findings extend that pattern into procedural triage: the model may know the guideline, yet still misapply it when multiple imperfect clinical signals must be reconciled.

The study also supports a more mature way of describing AI performance in surgery. Foundational evaluations by Kung et al,⁶ Singhal et al,⁷ and Moor et al⁸ established that LLMs can store and manipulate medical knowledge, but clinical deployment requires more than knowledge recall. It requires predictable behavior at the boundary between risk categories. The ERAS checklist work by Caliskan et al¹² showed that high coverage can coexist with implementability concerns, and the workflow simulation study by Caliskan et al¹¹ showed that a technically superior algorithm can still fail if its operational consequences are not modeled. The same principle applies here: a triage assistant should be evaluated by its effect on invasive utilization, rescue delay, and escalation behavior, not simply by whether it chooses the guideline label most often.

From a practical standpoint, these findings do not justify autonomous ERCP triage. They do suggest a near-term role for LLMs as supervised second readers or structured checklist engines. A model could extract bilirubin, duct diameter, stone visualization, cholangitis criteria, and pancreatitis flags, then present a guideline-based recommendation for clinician confirmation. Such an interface would be most useful in hospitals where biliary referrals arrive through fragmented notes, scanned imaging reports, or incomplete laboratory panels. The safest design would force explicit uncertainty disclosure and escalation rules: when cholangitis features or high-risk obstruction markers are present, the system should bias toward urgent clinician review rather than passive observation.

The conservative failure pattern observed in Claude 4 Opus is particularly important. AI safety alignment is usually discussed as a guardrail against overconfident intervention, but in acute surgery an excessively cautious model can create harm through therapeutic delay. This observation is consistent with the need for early-stage clinical AI evaluation emphasized by Vasey et al¹⁷ and with trial-reporting principles in the CONSORT-AI extension by Liu et al,¹⁸ both of which encourage researchers to describe how AI behavior interacts with real clinical workflows. Ayers et al¹⁹ showed that patient-facing AI responses can be perceived as high quality, but perceived quality is not equivalent to safe triage. For high-acuity biliary care, safety must be defined by escalation performance.

Several design improvements are foreseeable. Retrieval-augmented generation, described by Lewis et al,²⁰ could bind model outputs to the exact ASGE and ESGE criteria and reduce unsupported guideline drift. Structured input templates could reduce ambiguity by forcing the clinician or electronic health record to specify whether a stone is visualized, whether the common bile duct is dilated, and whether systemic inflammatory criteria are present. A prospective shadow-mode study should then measure time to endoscopy, unnecessary ERCP avoidance, clinician override rate, and user trust. Without these implementation outcomes, even a high-performing benchmark remains preliminary.

This study has limitations. Synthetic vignettes cannot fully reproduce the missing, contradictory, or time-evolving data found in emergency records. The reference standard was guideline-mapped and adjudicated by surgeons, but it did not incorporate gastroenterologist or radiologist panel voting. Model labels and performance may change with silent platform updates; therefore, access dates, displayed labels, and settings were recorded to support reproducibility. The delay simulation used plausible tertiary-care assumptions, but local endoscopy availability and weekend staffing could produce different delay magnitudes. Finally, all prompts were in English and used a fixed zero-shot structure; multilingual performance, iterative clinician-model dialogue, and usability were not tested.

Despite these limitations, the study has notable strengths. It uses a clinically important procedural decision, a locked reference standard, paired model testing across identical vignettes, full diagnostic metrics, independent output review, error archetyping, and clinical impact simulation. These features directly address common critiques of small in-silico AI studies, especially the claim that they provide only a leaderboard. The findings are best interpreted as evidence that LLMs are approaching useful guideline-assistant behavior, but only when their outputs remain auditable, constrained, and subordinate to clinician judgment.

In conclusion, GPT-5.5 showed the most reliable ERCP triage performance in suspected choledocholithiasis, with near-perfect agreement and the lowest simulated delay burden. Gemini 3.0 Pro performed well but showed more mixed triage errors, whereas Claude 4 Opus displayed a clinically relevant conservative bias that increased missed indicated ERCP recommendations. The central message is not that one model should replace clinicians, but that procedural AI must be judged by safety-weighted errors, escalation behavior, and workflow consequences before deployment in acute surgical care.

Supplemental Material

Supplemental Material for Stone-Cold Triage: A STROBE- and STARD-AI-Aligned Benchmark of Next-Generation Large Language Models for ERCP Indication in Suspected Choledocholithiasis by Yahya Kemal Çalışkan in The American Surgeon™.

Footnotes

Author Note

Presentation at a meeting: nil.

ORCID iD

Yahya Kemal Çalışkan

Author Contributions

YKÇ: concept and design of the study; data acquisition; statistical analysis; data analysis and interpretation of the results; drafting of the manuscript; critical revision of the manuscript.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated and/or analyzed during the current study are not publicly available due to ethical restrictions but are available from the corresponding author on reasonable request.

Supplemental Material

Supplemental material for this article is available online.

References

1. Buxbaum, Abbas Fehmi, Sultan, Fishman, Qumseya, Cortessis, et al. ASGE guideline on the role of endoscopy in the evaluation and management of choledocholithiasis. Gastrointest Endosc. 2019;89(6):1075-1105.e15. doi:10.1016/j.gie.2018.10.001.

2. Manes, Paspatis, Aabakken, et al. Endoscopic management of common bile duct stones: European society of gastrointestinal endoscopy (ESGE) guideline. Endoscopy. 2019;51(5):472-491. doi:10.1055/a-0862-0346.

3. Kiriyama, Kozaka, Takada, et al. Tokyo guidelines 2018: diagnostic criteria and severity grading of acute cholangitis. J Hepatobiliary Pancreat Sci. 2018;25(1):17-30. doi:10.1002/jhbp.512.

4. Cotton, Lehman, Vennes, et al. Endoscopic sphincterotomy complications and their management: an attempt at consensus. Gastrointest Endosc. 1991;37(3):383-393. doi:10.1016/S0016-5107(91)70740-2.

5. Andriulli, Loperfido, Napolitano, et al. Incidence rates of post-ERCP complications: a systematic survey of prospective studies. Am J Gastroenterol. 2007;102(8):1781-1788. doi:10.1111/j.1572-0241.2007.01279.x.

6. Kung, Cheatham, Medenilla, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198.

7. Singhal, Azizi, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180. doi:10.1038/s41586-023-06291-2.

8. Moor, Banerjee, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616(7956):259-265. doi:10.1038/s41586-023-05881-4.

9. Caliskan, Basak, Erdem. Can AI safely choose antibiotics over the knife? A STROBE-guided benchmark of GPT-4, GPT-5, and Gemini for non-operative acute appendicitis management. Int J Med Inform. 2026;213:106389. doi:10.1016/j.ijmedinf.2026.106389.

10. Erdem, Canbak, Acar, Ceylan, Çakıt, Başak. Guideline-based, but not error-free: multilingual risks in AI-powered patient counseling on gallstones. Int J Med Inform. 2026;212:106341. doi:10.1016/j.ijmedinf.2026.106341.

11. Caliskan, Basak, Erdem, Kudas. Beyond block time: a head-to-head comparison of reinforcement learning, genetic algorithms, and predict-then-optimize scheduling for operating room workflow using discrete-event simulation. Int J Med Inform. 2026;214:106426. doi:10.1016/j.ijmedinf.2026.106426.

12. Caliskan, Basak, Erdem, Kudas. From guidelines to clicklists: GPT-5-generated ERAS checklists improve guideline coverage for bariatric and gastrointestinal cancer surgery-a STROBE-compatible cross-sectional evaluation. World J Surg. 2026;50(5):1187-1194. doi:10.1002/wjs.70339. Online ahead of print.

13. Erdem, Canbak, Acar, Basak. Beyond the hype: mapping the evolution of artificial intelligence in general surgery through two decades of bibliometrics. World J Surg. 2025;49(12):3402-3409. doi:10.1002/wjs.70165.

14. von Elm, Altman, Egger, et al. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet. 2007;370(9596):1453-1457. doi:10.1016/S0140-6736(07)61602-X.

15. Sounderajah, Guni, Liu, Collins, Karthikesalingam, Markar, et al. The STARD-AI reporting guideline for diagnostic accuracy studies using artificial intelligence. Nat Med. 2025;31(10):3283-3289. doi:10.1038/s41591-025-03953-8.

16. Collins, Moons KGM, Dhiman, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. doi:10.1136/bmj-2023-078378.

17. Vasey, Nagendran, Campbell, Clifton, Collins, Denaxas, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat Med. 2022;28(5):924-933. doi:10.1038/s41591-022-01772-9.

18. Liu, Cruz Rivera, Moher, Calvert, Denniston; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26(9):1364-1374. doi:10.1038/s41591-020-1034-x.

19. Ayers, Poliak, Dredze, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589-596. doi:10.1001/jamainternmed.2023.1838.

20. Lewis, Perez, Piktus, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020;33:9459-9474.
