Large Language Models Triage of Retina Patient Emergency Telephone Calls: A Pilot Study

Abstract

Purpose: To compare the diagnostic and management accuracy of large language model chatbots vs that of humans in performing outpatient retina triage in on-call telephone emergencies. Methods: Four large language model chatbots, 3 vitreoretinal surgery fellows, and 3 certified ophthalmic technicians with on-call experience were presented with 10 simulated retina cases representing after-hours telephone calls from patients. Diagnosis and triage recommendations were obtained from chatbots and humans. Recommendations were graded for each chatbot and human respondent. Results: Human respondents were significantly more accurate than chatbots in diagnosis (95% vs 76.7%, respectively; P < .01) and follow-up recommendations (85% vs 70%, respectively; P = .03). However, chatbot performance varied. ChatGPT (OpenAI; 90%, P = .4) and Claude (Anthropic; 83.3%, P = .11) were noninferior to humans in diagnosis, while Meta (Meta Platforms Inc; 76.7%, P = .01) and Gemini (Google LLC; 56.7%, P < .001) performed significantly worse than humans. ChatGPT (93.3%, P = .32) and Claude (90%, P = .74) were also noninferior to humans in follow-up recommendations, but Gemini (50%, P < .001) and Meta (46.7%, P < .001) were worse than humans. Conclusions: The current pilot study found that overall, humans performed better than large language model–based chatbots in diagnosing and triaging retina-specific on-call telephone emergencies. However, chatbot accuracy was variable, with ChatGPT and Claude showing noninferior performance compared with humans. These findings suggest that with further validation, certain large language models could serve as useful aids for managing emergency telephone calls of varying medical urgency.

Keywords

large language models, artificial intelligence, retina triage, retina emergency

Introduction

Throughout the past decade, the capabilities of large language models have dramatically increased.¹ Such models are a type of artificial intelligence trained to ingest and generate humanlike text.² Prominent examples include ChatGPT (OpenAI), Google Gemini (Google LLC), Meta (Meta Platforms Inc), and Claude (Anthropic).

In medicine, large language models have been adopted to perform a wide range of tasks such as assisting with clinical documentation, performing medical image analysis, and supporting drug discovery.³⁻⁵ Multiple investigators have also assessed large language models that have been developed to answer medical questions (eg, Med-PaLM 2 by Google).⁶⁻⁸ An additional area where large language models may benefit patients and physicians is in triaging after-hours emergencies for outpatient specialties.⁹

Among outpatient specialties, the retina field has several common patient emergencies that benefit from early recognition and treatment, including retinal detachment and endophthalmitis.¹⁰ Importantly, mismanaged patient emergency telephone calls can lead to vision loss and blindness on 1 end of the spectrum or unnecessary clinic visits on the other. These calls also require meaningful human resources and represent a source of potential medicolegal risk.¹¹

The current analysis sought to determine how accurately 4 prominent large language model–based AI chatbots (ChatGPT-4o, Google Gemini, Meta, and Claude), which are programs designed to simulate conversation with users, diagnose and triage retina patient emergencies compared with humans.

Methods

The study was performed in accordance with the Declaration of Helsinki. Institutional review board approval was not required because no protected patient information was collected or used. The study was conducted between May 2024 and November 2024 at Retina Consultants of Texas (Houston, TX, USA).

Review of Emergency Call Logs

To determine the most common reasons for after-hours emergency telephone calls to Retina Consultants of Texas, de-identified call records were analyzed, including the date of the call, the anonymized patient message, and call center staff comments. The calls were categorized as either “medical emergency” or logistics related (eg, appointment confirmations and prescription refills). Calls categorized as “medical emergency” were subcategorized based on patient symptoms. For patients reporting multiple symptoms, each symptom was logged individually.

Simulated Emergency Scenarios

Based on a review of the patient emergency calls, 10 simulated patient telephone calls of varying medical urgency (Table 1) were created to mirror the most frequently reported retina emergencies. Cases consisted of 2 parts: 1) a first-person patient account of their symptoms and 2) a brief clinical history including the patient’s age, sex, and ocular history. The cases were reviewed by 3 retina specialists (H.A., K.C.F., and C.C.W.).

Table 1.

Diagnosis and Follow-Up Recommendation for Each Simulated Case.

Case Number	Diagnosis	Follow-Up Recommendation	Chatbot/Human Input
1	Subconjunctival hemorrhage	Routine follow-up	This is a 75-year-old woman receiving monthly injections for AMD: “I have been receiving injections for my AMD. I got one earlier this morning in my right eye, and now I notice a big red spot in the white part of the eye. It doesn’t hurt, but it looks pretty bad. This is the first time this has happened. My family also noticed it too.”
2	Betadine irritation	Routine follow-up	This is a 56-year-old man receiving injections for AMD in the right eye: “I just got my first injection yesterday. My doctor said the AMD went from dry to wet in my right eye. I have a gritty sensation in the right eye now. My vision is the same, and the pain is not severe, but it just feels like something is in the eye.”
3	Unilateral total transient vision loss	Immediate attention	This is a 67-year-old man followed-up in clinic for diabetic retinopathy: “I was watching TV an hour ago when I noticed the vision in my left eye go completely black for about a minute. There was no pain, but the vision went away entirely then just came back. Everything was black in that eye.”
4	Viral conjunctivitis	Routine follow-up	This is a 32-year-old woman with lattice, seen in clinic for a fundus examination 1 week ago: “I get checked about every 6 months because my doctor said I have thinning in my retina. I was in the office a week ago; they said everything looked stable. About 3 days ago though, my right eye got swollen and crusted shut. This morning, my left eye is also having the same issues. Both eyes are now very red, and it feels like something is in them. They are both mucousy, too.”
5	Endophthalmitis	Immediate attention	This is a 54-year-old woman receiving injections in clinic for diabetic macular edema: “I received an injection like I usually do in my right eye 2 days ago for my diabetes. Now, my eye is really painful and swollen, and my vision is much worse than normal. I can only see shadows.”
6	Postinjection air bubbles	Routine follow-up	This is a 62-year-old woman receiving injections for diabetic retinopathy in the right eye: “I just had an injection this morning for my diabetes. Right after the injection, I saw a couple of black circles in my vision. They’re right at the bottom. My vision is the same, and I don’t have any pain. It seems they have been getting smaller but are still there.”
7	Corneal epithelial defect	Urgent follow-up	This is a 49-year-old man receiving monthly injections for diabetic macular edema: “I just had my injection for my diabetes this morning in both eyes. Right after the injection, my left eye has been really hurting. It feels better when I close it. There is also a blurry part in the center of my vision that wasn’t there before.”
8	Elevated intraocular pressure	Immediate attention	This is a 73-year-old woman with a history of vitrectomy with gas tamponade for retinal detachment 6 hours ago: “I had surgery to repair the retinal detachment in my left eye this morning. Now my eye really hurts, and I have a severe headache. I was going to wait it out, but I just also had an episode of vomiting. I still have the patch on; I haven’t touched the eye like my doctor said.”
9	Acute posterior vitreous detachment	Urgent follow-up	This is a 65-year-old man with no history of ocular concerns: “I had sudden dark specks and bright flashes of light in my vision, just in the right eye. It started suddenly yesterday evening. At first, I thought it was flies and tried swatting them away, but I realized there was nothing there. The flashes would go away, but when I woke up this morning, I still saw them.”
10	Persistent dilation	Routine follow-up	This is a 40-year-old man with diabetes without retinopathy: “I went to the ophthalmologist for a diabetic eye check. They told me I need to get checked every year to make sure I have no problems in my retina. It’s been a few hours since my visit, but my wife mentioned my pupils are still really big. I feel fine otherwise. I didn’t even notice until she told me and I looked in the mirror.”

Abbreviation: AMD, age-related macular degeneration.

Assessment of Diagnostic and Triage Accuracy

The 10 scenarios were presented to 4 large language model chatbots (ChatGPT-4o, Google Gemini, Meta, and Claude), 3 vitreoretinal surgery fellows (F.B., first-year fellow at Wills Eye Hospital; V.C., first-year fellow at Associated Retina Consultants; L.K., first-year fellow at Bascom Palmer Eye Institute), and 3 certified ophthalmic technicians with on-call experience. All respondents were asked to assume the scenario was an on-call patient telephone call being received outside of standard working hours (7:00 pm). Respondents were required to provide the single best diagnosis as well as to classify follow-up as 1) “immediate attention” (same-day evaluation in clinic or emergency department), 2) “urgent follow-up” (next 1-2 days), or 3) “routine follow-up” (as scheduled/next available, assuming no worsening of symptoms).

Diagnosis and follow-up recommendations were graded for each chatbot and human respondent. Each scenario was graded for the correct diagnosis and follow-up recommendation; therefore, 20 total points could be achieved for the 10 cases, per respondent. Proportions of correct responses were analyzed using the Fisher exact test. Statistical analysis was performed using StataSE 17 (StataCorp LLC). A P value of <.05 was considered statistically significant.
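The comparison of proportions described above can be illustrated with a minimal sketch of a two-sided Fisher exact test. This is not the authors' analysis code (the study used StataSE 17); it is a pure-Python illustration. The counts are assumptions back-calculated from the reported percentages, treating every graded response as one observation: 6 human respondents × 10 cases = 60 diagnoses (95% ≈ 57 correct) and 4 chatbots × 3 repetitions × 10 cases = 120 diagnoses (76.7% ≈ 92 correct). The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (eg, humans, chatbots); columns are outcomes
    (correct, incorrect). Returns the sum of the probabilities of all
    tables with the same margins that are no more likely than the
    observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x "correct" responses in row 1,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so ties with the observed table are included.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Assumed counts (derived from reported percentages, not taken from raw data).
human_correct, human_incorrect = 57, 3        # 95% of 60 diagnoses
chatbot_correct, chatbot_incorrect = 92, 28   # 76.7% of 120 diagnoses

p = fisher_exact_two_sided(
    human_correct, human_incorrect, chatbot_correct, chatbot_incorrect
)
print(f"Pooled diagnosis comparison, two-sided P = {p:.4f}")
```

Under the same assumptions, per-chatbot comparisons would use 30 responses per chatbot (10 cases × 3 repetitions) against the 60 human responses.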

The same prompt was given to the large language models and the human respondents to limit bias (Supplement 1). Each scenario was presented 3 separate times to each chatbot to assess reliability and agreement between answers. For each repetition, a new chat was opened (within the same account). For each chatbot, “agreement” was defined as the same response given across the 3 repeated prompts (intraplatform agreement). For humans, “agreement” was defined as all respondents in a cohort providing the same response for a given diagnosis or follow-up recommendation (intracohort agreement).
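The agreement definition above can be sketched in a few lines. The sketch assumes that free-text responses have already been mapped to canonical labels before comparison, which is an assumption and not a step described in the study. The function name `agreement_rate` and the example responses are hypothetical and are not study data.

```python
def agreement_rate(responses_by_item):
    """Fraction of items on which every response is identical.

    Each element of responses_by_item holds the responses for one graded
    item: either the 3 repeated answers from one chatbot (intraplatform
    agreement) or the answers from all respondents in one human cohort
    (intracohort agreement).
    """
    if not responses_by_item:
        raise ValueError("at least one item is required")
    agreed = sum(1 for answers in responses_by_item if len(set(answers)) == 1)
    return agreed / len(responses_by_item)


# Hypothetical follow-up recommendations for 4 items, 3 repetitions each.
example = [
    ["routine follow-up", "routine follow-up", "routine follow-up"],
    ["immediate attention", "immediate attention", "immediate attention"],
    ["urgent follow-up", "immediate attention", "urgent follow-up"],
    ["routine follow-up", "routine follow-up", "routine follow-up"],
]
print(agreement_rate(example))  # 3 of the 4 items agree
```

With 10 cases and 2 graded items per case (diagnosis and follow-up recommendation), the denominator in the study would be 20 items per chatbot or cohort.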

Results

In total, 563 emergency on-call telephone calls were recorded between May 2024 and September 2024 (Figure 1). The 3 most common concerns were flashes and floaters (n=89, 15.8%), eye pain (n=88, 15.6%), and postintravitreal injection symptoms (n=62, 11.0%).

Figure 1.

The most common symptoms prompting patient emergency calls.

On average, the human cohort (fellows and technicians) was significantly more accurate than the chatbots for both diagnosis (95% vs 76.7%, respectively, P < .01) and follow-up recommendations (85% vs 70%, respectively, P = .03) (Table 2). However, there was meaningful variation in accuracy across the 4 chatbots. Diagnostically, the highest-performing chatbots were ChatGPT (90%, P = .4 compared with humans) and Claude (83.3%, P = .11 compared with humans), both of which were noninferior to human respondents. Meanwhile, Meta (76.7%, P = .01 compared with humans) and Gemini (56.7%, P < .001 compared with humans) performed significantly worse than the human cohort.

Table 2.

Diagnosis and Follow-Up Recommendation Accuracy Among Humans and Chatbots.

	Accuracy, %	P Value^a
Overall
Human	90
Technician	83
Fellow	97
Chatbot	73	<.001
ChatGPT	92	.79
Gemini	53	<.001
Meta	62	<.001
Claude	87	.62
Diagnosis
Human	95
Technician	90
Fellow	100
Chatbot	77	.002
ChatGPT	90	.4
Gemini	57	<.001
Meta	77	.01
Claude	83	.11
Follow-up recommendation
Human	85
Technician	77
Fellow	93
Chatbot	70	.03
ChatGPT	93	.32
Gemini	50	<.001
Meta	47	<.001
Claude	90	.74

^a The reference group for the P value comparison was human accuracy.

The divergent trend in chatbot performance was also seen for follow-up recommendation accuracy. ChatGPT (93.3%, P = .32 compared with humans) and Claude (90%, P = .74 compared with humans) were noninferior to human respondents, while Gemini (50%, P < .001 compared with humans) and Meta (46.7%, P < .001 compared with humans) again performed significantly worse than humans.

Within the human cohort, fellows had numerically higher accuracy than technicians for diagnoses (100% vs 90%, respectively) and follow-up recommendations (93.3% vs 76.7%, respectively), but these differences were not statistically significant (P = .234 for diagnoses and P = .15 for follow-up recommendations).

Fellows showed the highest rate of agreement (complete agreement on 18/20 diagnoses/follow-up recommendations, 90%), followed by ChatGPT, Gemini, and Claude (16/20, 80%), then Meta (15/20, 75%), and lastly, the technicians (13/20, 65%).

Conclusions

The current study found that the included large language model–enabled chatbots had widely varying abilities in the assessment and triage of emergency after-hours telephone calls from patients of retina specialists. In a pooled analysis, chatbots as a group were inferior to humans in both providing accurate diagnoses and management recommendations for patient-reported retina emergencies. However, subanalysis of the individual chatbots revealed that ChatGPT-4o and Claude were both noninferior to human respondents. This pilot study, therefore, provides initial evidence that with further validation, certain large language models may be useful in assisting physicians with fielding retina patient emergency telephone calls in an after-hours outpatient setting.

Many ocular conditions are time-sensitive and can cause permanent vision loss if not recognized and treated quickly.¹⁰ A large percentage of ocular emergencies are of retinal etiology.¹² As demonstrated by the 563 calls to 1 retina practice over 5 months, emergency calls can pose a substantive resource burden. Given the outpatient nature of retina practice, patients with after-hours emergencies must either be triaged by an on-call team or present to emergency departments that may not be adequately staffed or equipped for comprehensive and accurate ophthalmic diagnosis.¹³

The results of the present pilot study suggest that large language model–based applications may be leveraged to aid clinicians in responding to retina emergencies. However, many publicly available large language model chatbots have explicit restrictions on providing medical advice.¹⁴ For example, in the present study, the Gemini chatbot sometimes initially refused to respond to the cases unless given reassurance that the cases were simulated. Despite these restrictions, many prior studies have demonstrated high rates of accuracy among large language models in diagnosing ophthalmic conditions.¹⁵⁻¹⁷ As in this work, prior studies have shown significant variations between chatbots in their accuracy in diagnosing cases or answering questions in ophthalmology, with ChatGPT generally outperforming other large language models.¹⁸⁻²¹

While large language models show promise as clinical tools for both patients and providers, there are several concerns that must be addressed. First, large language models have been demonstrated to “hallucinate,” creating persuasive but incorrect responses.²² This is especially important to consider for patient-facing applications, where nonexperts such as patients might be unable to identify incorrect responses. Additionally, large language models can provide different answers to the same prompt when asked multiple times.²³ To assess consistency between responses in the present study, we prompted the large language model chatbots 3 times in separate chats. All 4 chatbots demonstrated relatively high rates of agreement (75%-80%), falling between the rates for fellows (90%) and technicians (65%).

The present study was limited by the simulated nature of the patient cases used. However, cases were created based on a systematic review of real emergency telephone calls to a large, urban retina specialty practice, and all cases were written in nonexpert language representing a typical patient. Another limitation is that case selection was restricted to the most common symptoms prompting emergency telephone calls. Future studies could evaluate a broader spectrum of patient-reported emergencies.

The large language models’ performance would likely benefit from training prior to deployment, though we opted to use the most common publicly available versions. Lastly, real-time interactions with patients present unique challenges for implementation, which were not addressed here and warrant further study. Nevertheless, the current study highlights the promise of large language models in assisting with the remote care of retina patients in the emergency setting.

Ultimately, this pilot study demonstrated that humans performed better as a cohort compared with current-generation chatbots in diagnosing and triaging retina emergencies. Importantly, the accuracy of these chatbots was variable, with ChatGPT and Claude achieving noninferior performance compared with human respondents. Further research is warranted on the highest-performing large language models to evaluate real-time implementation challenges as well as performance with a broader set of cases.

Supplemental Material

Supplemental material, sj-docx-1-vrd-10.1177_24741264251414097 for Large Language Models Triage of Retina Patient Emergency Telephone Calls: A Pilot Study by Rohini Chahal, Flavius Beca, Viet Q. Chau, Lauren Kiryakoza, Mya Abousy, Kenneth C. Fan, Charles C. Wykoff and Hasenin Al-khersan in Journal of VitreoRetinal Diseases

Supplemental material, sj-docx-2-vrd-10.1177_24741264251414097 for Large Language Models Triage of Retina Patient Emergency Telephone Calls: A Pilot Study by Rohini Chahal, Flavius Beca, Viet Q. Chau, Lauren Kiryakoza, Mya Abousy, Kenneth C. Fan, Charles C. Wykoff and Hasenin Al-khersan in Journal of VitreoRetinal Diseases

Footnotes

Ethical Approval

Institutional review board approval was not required for this study.

Statement of Informed Consent

Patient cases were simulated and therefore no consent was required.

Declaration of Conflicting Interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: HA reports Annexon (C), Apellis (C), Ocular Therapeutix (C), Genentech (C), and Regeneron (C); KCF reports Apellis (C), AbbVie/Regenxbio (C), Ocular Therapeutix (C), Genentech (C), and Regeneron (C); CCW reports 4DMT (C,R), AbbVie (C,R), ADARx (C), Adverum (C,R), AffaMed (R), AGTC (C), Alcon (C), Alexion (R), Alimera (C,R), Alkeus (C), Allgenesis (R), AMC Sciences (C), Amgen (R), Annexin (R), Annexon (C,R), Apellis (C,R), Ascidian (R), Asclepix (R), Aviceda (R), Bausch + Lomb (C), Bayer (C,R), Beacon (R), Biocryst (C), Bionic Vision (C), Boehringer Ingelheim (C,R), Chengdu Kanghong (C), Chengdu Origen (R), Clearside (R), Curacle (C,R), Eluminex (R), Emmecell (C), EyeBiotech (C,R), EyePoint (C,R), Genentech (C,R), Gyroscope (R), InGel (C,SO), IONIS (R), IVERIC Bio (C,R), Janssen (C,R), Kalaris (R), Kiora (C), Kodiak (C,R), Kowa (C), Kyoto DDD (R), Kyowa Kirin (R), Merck (C), Nanoscope (C, R), Neurotech (C, R), NGM (R), Novartis (C, R), Oak Bay Bio (C), Ocugen (R), Ocular Therapeutix (C, R), Oculis (R), Ocuphire (C), OcuTerra (C, R), Ollin (C), ONL (C, SO), Opthea (C,R), Osanni (C,SO), Outlook Therapeutics (R), Oxurion (R), Panther (SO), Perceive Bio (R), PolyPhotonix (SO), Ray (C), RecensMedical (SO), Regeneron (C,R), RegenXBio (C,R), RetinAI (C), Roche (C,R), Sandoz (C), Sanofi (C), Santen (C), Skyline (C, R), Stealth (C,R), Sylentis (C), THEA (C), Therini (C), TissueGen (SO), VH401 (C,R), Visgenx (C,SO), Vitranu (SO), Zeiss (C); RC, FB, VQC, LK, and MA have no financial disclosures to report.

C = Consultant; R = Research Support; SO = Stock Options

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Data Availability

Data can be requested from the corresponding author.

Supplemental Material

Supplemental material is available online with this article.

References

1. Patil, Gudivada. A review of current trends, techniques, and challenges in large language models (LLMs). Appl Sci. 2024;14(5):2074.

2. Omar, Nadkarni, Klang, Glicksberg BS. Large language models in medicine: a review of current clinical trials across healthcare applications. PLOS Digit Health. 2024;3(11):e0000662. doi:10.1371/journal.pdig.0000662

3. Tian, Jiang, Zhang. The role of large language models in medical image processing: a narrative review. Quant Imaging Med Surg. 2024;14(1):1108-1121. doi:10.21037/qims-23-892

4. Paul, Sanap, Shenoy, Kalyane, Kalia, Tekade RK. Artificial intelligence in drug discovery and development. Drug Discov Today. 2021;26(1):80-93. doi:10.1016/j.drudis.2020.10.010

5. Sallam. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare (Basel). 2023;11(6):887. doi:10.3390/healthcare11060887

6. Singhal, Azizi, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180. doi:10.1038/s41586-023-06291-2

7. Singhal, Gottweis, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025;31(3):943-950. doi:10.1038/s41591-024-03423-7

8. Clay, Da Custodia Steel, Jacobs. Human-computer interaction: a literature review of artificial intelligence and communication in healthcare. Cureus. 2024;16(11):e73763. doi:10.7759/cureus.73763

9. Pham, Thongprayoon, Miao, et al. Large language model triaging of simulated nephrology patient inbox messages. Front Artif Intell. 2024;7:1452469. doi:10.3389/frai.2024.1452469

10. Gelston, Deitz GA. Eye emergencies. Am Fam Physician. 2020;102(9):539-545.

11. Katz, Kaltsounis, Halloran, Mondor. Patient safety and telephone medicine: some lessons from closed claim case review. J Gen Intern Med. 2008;23(5):517-522. doi:10.1007/s11606-007-0491-y

12. McDonald, Iordanous. Ophthalmology on call: evaluating the volume, urgency, and type of pages received at a tertiary care center. Cureus. 2022;14(4):e23824. doi:10.7759/cureus.23824

13. Tan, Mickelsen, Villegas, et al. Evaluation of interventions targeting follow-up appointment scheduling after emergency department referral to ophthalmology clinics using A3 problem solving. JAMA Ophthalmol. 2022;140(6):561-567. doi:10.1001/jamaophthalmol.2022.0889

14. Mandalos, Tsouris. Artificial versus human intelligence in the diagnostic approach of ophthalmic case scenarios: a qualitative evaluation of performance and consistency. Cureus. 2024;16(6):e62471. doi:10.7759/cureus.62471

15. Ran, Nguyen, et al. What can GPT-4 do for diagnosing rare eye diseases? A pilot study. Ophthalmol Ther. 2023;12(6):3395-3402. doi:10.1007/s40123-023-00789-8

16. Shanmugam, Wilkinson. Allergic contact dermatitis caused by a cyanoacrylate-containing false eyelash glue. Contact Dermatitis. 2012;67(5):309-310. doi:10.1111/cod.12000

17. Tan DNH, Tham, Koh, et al. Evaluating chatbot responses to patient questions in the field of glaucoma. Front Med (Lausanne). 2024;11:1359073. doi:10.3389/fmed.2024.1359073

18. Pushpanathan, Lim, Er Yew, et al. Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries. iScience. 2023;26(11):108163. doi:10.1016/j.isci.2023.108163

19. Lim, Pushpanathan, Yew SME, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770. doi:10.1016/j.ebiom.2023.104770

20. Ichhpujani, Parmar UPS, Kumar. Appropriateness and readability of Google Bard and ChatGPT-3.5 generated responses for surgical treatment of glaucoma. Rom J Ophthalmol. 2024;68(3):243-248. doi:10.22336/rjo.2024.45

21. Carlà, Gambini, Baldascino, et al. Large language models as assistance for glaucoma surgical cases: a ChatGPT vs. Google Gemini comparison. Graefes Arch Clin Exp Ophthalmol. 2024;262(9):2945-2959. doi:10.1007/s00417-024-06470-5

22. Athaluri, Manthena, Kesapragada, Yarlagadda, Dave, Duddumpudi RTS. Exploring the boundaries of reality: investigating the phenomenon of artificial intelligence hallucination in scientific writing through ChatGPT references. Cureus. 2023;15(4):e37432. doi:10.7759/cureus.37432

23. Wang, Chen, Deng, et al. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med. 2024;7(1):41. doi:10.1038/s41746-024-01029-4
