Large language model chatbots as sources of pediatric anesthesia health advice: An evaluation of reliability and readability

Abstract

Background

Large language models are increasingly used to obtain health information, but their quality in pediatric anesthesia remains insufficiently evaluated. This study aimed to assess the reliability and readability of four widely used AI chatbots in this context.

Methods

This cross-sectional observational study developed 18 pediatric anesthesia-related questions using Medical Subject Headings terms, online search trend analysis, and commonly queried topics reflecting parental information needs. Each question was submitted under standardized conditions to four generative AI-driven chatbots: OpenAI’s GPT-5.1 Thinking, Google’s Gemini 3 Pro, Anthropic’s Claude Opus 4.5 Extended Thinking, and DeepSeek-V3.2-Speciale. Models were accessed in their vendor-deployed configurations without task-specific fine-tuning. The generated responses were evaluated for information reliability using the Ensuring Quality Information for Patients (EQIP) instrument, DISCERN tool, Global Quality Score (GQS), and Journal of the American Medical Association (JAMA) benchmark criteria. Readability was assessed using seven validated indices including Flesch Reading Ease Score, Flesch–Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook, Coleman–Liau Index, Automated Readability Index, and Linsear Write Formula.

Results

A total of 72 chatbot-generated responses were included for analysis. Significant between-model differences were observed in DISCERN, EQIP, and GQS, while JAMA benchmark scores were consistently low across all models. DeepSeek and Gemini showed higher median reliability scores across several instruments, although significant pairwise differences mainly involved ChatGPT. None of the evaluated models achieved the recommended sixth-grade readability level across any index. Correlations between reliability and readability were non-significant, suggesting that these represent independent dimensions of information quality.

Conclusions

Current LLM-based chatbots provided pediatric anesthesia information with variable reliability and consistently suboptimal readability. Although certain models demonstrated relatively higher information quality, limited transparency and excessive reading complexity may restrict their suitability for public-facing educational use. These findings highlight the need for improved quality control, enhanced transparency, and readability-focused optimization in pediatric perioperative education.

Keywords

pediatric anesthesia; large language models; digital health information; readability; generative artificial intelligence

1. Introduction

Pediatric anesthesia is a highly specialized domain of anesthesiology that provides perioperative care for neonates, infants, children, and adolescents undergoing surgical, diagnostic, or therapeutic procedures.¹ Owing to substantial differences in anatomical structure, physiological function, and pharmacological responses between pediatric and adult populations, as well as the ongoing maturation of organ systems in children, anesthesia management in this group presents unique risks and challenges.^2,3 In high-income countries such as the United States and the United Kingdom, millions of children undergo anesthesia annually for surgical procedures.^1,4 Notably, approximately 65% to 80% of pediatric patients experience significant perioperative anxiety.^5,6 This anxiety may not only increase psychological distress among parents but has also been associated with adverse postoperative outcomes in children, including emergence delirium, increased pain perception, and maladaptive behavioral changes.⁷

In response to these concerns, parents increasingly seek health-related information through digital platforms prior to hospital admission. A recent survey reported that 74.3% of caregivers of children undergoing surgery searched online for procedural information before their hospital visit.⁸ While digital information sources have the potential to support informed decision-making and improve perioperative preparation, the quality of publicly accessible online health content remains highly variable. Inaccurate, incomplete, or overly technical information, especially regarding the effects of anesthetic agents on pediatric neurodevelopment, may increase caregiver anxiety and lead to misconceptions.⁹ Such misinformation may reduce adherence to perioperative recommendations, delay necessary treatment, or undermine trust in clinical guidance, thereby posing potential risks to perioperative safety in children.¹⁰

Recent advances in artificial intelligence, particularly the development of large language models (LLMs), have introduced new forms of consumer-facing digital health information systems.¹¹ Leveraging natural language processing capabilities, LLM-based conversational agents are increasingly used by the public to obtain personalized health-related explanations and procedural guidance.¹² In the context of pediatric anesthesia, these tools have the potential to improve caregivers’ understanding of anesthesia-related risks, facilitate preoperative preparation, and support postoperative recovery through accessible, on-demand information delivery. However, LLMs generate responses based on probabilistic language modeling rather than explicit fact verification, which may result in factual inaccuracies or so-called “hallucinations”.¹³ In the perioperative context, inappropriate or misleading information regarding fasting protocols, anesthesia induction, or postoperative medication may adversely affect recovery trajectories and, in extreme cases, compromise patient safety. Moreover, excessive linguistic complexity in generated responses may hinder caregivers’ comprehension and implementation of recommended care strategies.

Previous evaluations across clinical domains including oncology, hepatology, musculoskeletal medicine, and urology suggest that LLM-generated health information may achieve moderate to high content quality but frequently demonstrates limited readability, insufficient transparency, and occasional factual inconsistency.^14–17 More recent studies have also begun to examine the use of LLMs for anesthesia-related patient education, including adult anesthesia, obstetric anesthesia, and pediatric dental sedation.^18–20 However, evidence remains limited on how multiple widely used chatbots perform when answering caregiver-facing pediatric anesthesia questions. This represents a critical gap, given the need for such information to be accurate, clear, and safe for perioperative guidance. Accordingly, the present study conducted a systematic comparative evaluation of four widely used large language models, namely ChatGPT, Claude, DeepSeek, and Gemini, in the context of pediatric anesthesia information provision. Using standardized quality and readability assessment frameworks, this study aimed to characterize the reliability and accessibility of LLM-generated health information for parents and caregivers, thereby informing the responsible integration of generative AI–driven tools into pediatric perioperative education and digital health communication strategies. No formal hypothesis was prespecified.

2. Materials and methods

2.1. Study design

This study was designed as a cross-sectional observational analysis and was reported in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines.²¹ The STROBE framework provides standardized guidance for the transparent reporting of observational research, including study objectives, methodology, results, and limitations. Because this study specifically evaluated generative AI-driven chatbots as sources of health advice, we also followed the Chatbot Assessment Reporting Tool (CHART) guideline.²² The completed STROBE and CHART checklists are provided in the Supplementary Materials.

The study systematically assessed a total of 72 responses generated by four widely used LLM-based chatbot systems—OpenAI’s GPT-5.1 Thinking, Google’s Gemini 3 Pro, Anthropic’s Claude Opus 4.5 Extended Thinking, and DeepSeek-V3.2-Speciale—in response to pediatric anesthesia scenario-based questions from caregivers. The number of responses was determined by the predefined set of prompts and the number of evaluated platforms, enabling comparative assessment across systems under standardized conditions. Given the observational and system-performance–oriented nature of the study, a formal statistical sample size calculation was not performed. The evaluation covered multiple dimensions, including readability, informational reliability, and overall quality of responses.

2.2. Search strategy and data collection

Standardized pediatric anesthesia–related search terms were initially identified using the Medical Subject Headings (MeSH) database and further explored via Google Trends to capture temporal search interest. To further reflect real-world information-seeking behavior, commonly searched terms were collected from major Chinese search engines (Baidu and Sogou) and health information platforms (e.g., Dingxiang Doctor), as well as consultation records from professional medical forums including Dingxiangyuan and Haodf Online. Through this process, 28 search terms were initially identified. After removing irrelevant or duplicate items, 18 unique and relevant search terms were finalized (eTable 1), covering common caregiver concerns such as modality selection, neurodevelopmental safety, preoperative preparation, postoperative management, and anesthesia-related risks.

All prompts were entered verbatim as single-turn queries to simulate typical user interactions with publicly available chatbot interfaces. Queries were conducted by a single researcher with a clinical background in anesthesiology to ensure consistency in data entry procedures. No iterative refinement, follow-up prompting, or patient/public involvement was applied.

Data collection was performed on December 1, 2025, from Nanjing, China, via publicly available web-based interfaces for OpenAI’s GPT-5.1 Thinking (released November 12, 2025; knowledge cutoff September 30, 2024), Google’s Gemini 3 Pro (released November 18, 2025; knowledge cutoff January 2025), Anthropic’s Claude Opus 4.5 Extended Thinking (released November 24, 2025; knowledge cutoff March 2025), and DeepSeek-V3.2-Speciale (released December 1, 2025; knowledge cutoff July 2025). Most of the evaluated systems were proprietary chatbot models accessed in their vendor-deployed form, while DeepSeek represented an open-source alternative and was assessed in its publicly released configuration. All models were evaluated in their default configurations as deployed through the official chatbot platforms. The investigators did not apply task-specific fine-tuning, parameter optimization, retrieval-augmented generation, or external post-processing. Each query was entered into a new chat session following browser cache clearance to minimize potential contextual carryover effects.

For proprietary models, detailed disclosures regarding training corpora and internal architectures were not publicly available, and performance evaluation was therefore based on the functionality described in official vendor documentation. DeepSeek was assessed in its publicly released form without modification, ensuring consistency with its standard community-distributed implementation.

As this study did not involve human participants, there was no recruitment, exposure period, or follow-up period. Each prompt was processed twice per model to assess response stability, resulting in 144 raw outputs (18 prompts × 4 models × 2 runs). Paired outputs were reviewed qualitatively for consistency in core medical content, key recommendations, safety-relevant information, and content relevant to qualitative error classification. Most paired outputs showed no substantive difference, and the remainder showed only minor differences in wording, organization, or level of detail. No clinically meaningful differences were identified that would alter reliability scoring or qualitative error classification. Therefore, one complete run comprising 72 responses was used as the formal dataset for scoring and analysis.

All chatbot responses were recorded verbatim and stored in their original form without modification. Before expert assessment, all chatbot responses were de-identified and assigned numerical codes to mask model identity. Evaluators were blinded to the chatbot systems during both quantitative scoring and qualitative error analysis. No personal or identifiable information was involved in the dataset. The complete set of chatbot responses is provided in the Supplementary Material. All responses obtained from the chatbots were subsequently subjected to reliability and readability assessments.

2.3. Inclusion and exclusion criteria

Responses were eligible for inclusion if they were generated by one of the four predefined chatbot systems (GPT-5.1 Thinking, Gemini 3 Pro, Claude Opus 4.5 Extended Thinking, and DeepSeek-V3.2-Speciale) in response to one of the 18 predefined pediatric anesthesia prompts and were provided in text format.

Responses were excluded if the model failed to generate a valid answer, such as returning an error message or experiencing a network interruption. Responses that were completely unrelated to the question were also excluded, as were non-text outputs, including responses consisting solely of images or hyperlinks.

2.4. Reliability assessment scales

Informational reliability was evaluated using four established health information quality assessment instruments: DISCERN, the Ensuring Quality Information for Patients (EQIP) tool, the Global Quality Score (GQS), and the Journal of the American Medical Association (JAMA) benchmark criteria.

DISCERN is a standardized instrument for assessing the quality of written health information.^14,23 It consists of 16 questions, divided into three sections: reliability of the publication (questions 1-8), quality of treatment information (questions 9-15), and overall assessment (question 16). Each question is scored on a 1-5 Likert scale, with a total score ranging from 16 to 80. A higher score indicates better information quality.

The EQIP tool is commonly used to assess the quality of written patient information available to the public.^24,25 In this study, we used the expanded 36-item EQIP scale, which evaluates three domains: content, identification data, and structure. Each applicable item was rated as “Yes” or “No,” and items deemed not applicable were excluded from the denominator. The EQIP score was calculated as the proportion of applicable items rated “Yes” and expressed as a percentage, with higher scores indicating better information quality.
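The EQIP scoring rule described above can be illustrated with a short sketch. The function name `eqip_score` and the "yes"/"no"/"na" labels are hypothetical and chosen only for illustration; they are not taken from the EQIP instrument or from the study's scoring files.

```python
def eqip_score(ratings):
    """Return the EQIP percentage: "yes" items divided by applicable items.

    `ratings` is a list of "yes", "no", or "na" labels (hypothetical encoding).
    Items marked "na" are excluded from the denominator.
    """
    applicable = [r for r in ratings if r != "na"]
    if not applicable:
        raise ValueError("no applicable items to score")
    yes_count = sum(r == "yes" for r in applicable)
    return 100.0 * yes_count / len(applicable)


# Illustrative 36-item rating sheet: 15 "yes", 15 "no", 6 not applicable.
example = ["yes"] * 15 + ["no"] * 15 + ["na"] * 6
print(eqip_score(example))  # 15 of 30 applicable items -> 50.0
```

Excluding non-applicable items from the denominator means that two responses with the same number of "Yes" ratings can receive different percentages if they differ in how many items apply.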

The GQS is a holistic quality assessment tool that requires evaluators to rate the content based on its comprehensiveness, accuracy, practicality, and objectivity.^26,27 This study employed a 5-point Likert scale: 1 = very poor, 2 = poor, 3 = fair, 4 = good, and 5 = excellent quality.

The JAMA benchmark criteria are commonly used to evaluate the transparency and reliability of health information sources.^28,29 The instrument assesses four key domains: authorship, attribution, disclosure, and timeliness. Each domain is scored as 0 or 1, with the total score ranging from 0 to 4, where higher scores indicate greater transparency and adherence to reporting standards.

These scales are shown in eTables 2-5.

2.5. Readability assessment tools

A readability application (https://readabilityformulas.com) was used to assess the reading ease of each response, utilizing seven indices: Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), Coleman-Liau Index (CL), Automated Readability Index (ARI), and Linsear Write Formula (LWF).³⁰

These indices evaluate text readability based on distinct criteria. FRES calculates readability from average sentence length and the average number of syllables per word, with higher scores indicating easier comprehension.¹⁶ FKGL determines the U.S. school grade level at which a text can be understood.³¹ GFI evaluates sentence length and multisyllabic word frequency, while SMOG assesses the educational level required to understand health-related documents.³² CL is based on sentence length and average letter count, and is commonly applied to medical texts.³³ The ARI and LWF provide readability scores that reflect the suitability of a text for the reader based on word and sentence structure.^34,35 Calculation formulas are presented in eTable 6.

A decrease in FRES and an increase in ARI, FKGL, GFI, SMOG, CL, and LWF indicate lower readability.^36,37 The readability threshold was set at 80.0 for FRES, while a grade level of 6 was the threshold for the remaining six formulas. Readability scores were calculated for all 72 chatbot-generated responses, and the mean and standard deviation were compared against the sixth-grade readability level recommended by the American Medical Association and the National Institutes of Health (NIH).^38,39
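As an illustration of how two of these indices behave, the sketch below applies the standard published Flesch Reading Ease and Flesch–Kincaid Grade Level formulas to pre-counted words, sentences, and syllables. The function names and the example counts are hypothetical, and the direction of the threshold check (FRES of at least 80, grade level of at most 6) is an assumption based on the thresholds stated above. The online calculator used in this study may count syllables and sentences differently, so its values need not match exactly.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula (higher = easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula (higher = harder to read)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def meets_sixth_grade_target(fres, fkgl):
    """Assumed check: FRES >= 80.0 and grade level <= 6.0."""
    return fres >= 80.0 and fkgl <= 6.0


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
fres = flesch_reading_ease(100, 5, 150)
fkgl = flesch_kincaid_grade(100, 5, 150)
print(round(fres, 1), round(fkgl, 1))        # about 59.6 and 9.9
print(meets_sixth_grade_target(fres, fkgl))  # False
```

Because both formulas depend only on sentence length and syllables per word, shorter sentences and shorter words move FRES up and FKGL down together, which is why the two indices are strongly and inversely related.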

2.6. Scoring process

All chatbot-generated responses were independently evaluated by two senior physicians with more than 10 years of clinical experience using de-identified, numerically coded response files. Prior to the formal assessment, both reviewers underwent standardized training in the application of DISCERN, EQIP, GQS and JAMA criteria, including calibration using a pilot sample of 10 responses.

Discrepancies of one point or greater between reviewers were resolved through consensus discussion. In cases where agreement could not be reached, a third senior clinician with more than 25 years of experience served as an adjudicator to determine the final score.

2.7. Statistical analysis

Descriptive statistics were used to summarize all variables. The normality of data distribution was assessed using the Shapiro–Wilk test. Data with a normal distribution were presented as mean ± standard deviation (SD), while data not following a normal distribution were presented as median (IQR), and categorical variables were presented as frequencies and percentages.

Between-group comparisons were conducted using one-way analysis of variance (ANOVA) with Tukey’s HSD post hoc test for normally distributed variables, and the Kruskal–Wallis test followed by Bonferroni-adjusted Mann–Whitney U tests for non-normally distributed variables. Interrater agreement for quantitative scoring instruments, including DISCERN, EQIP, GQS, and JAMA, was assessed using the intraclass correlation coefficient (ICC). ICC values were interpreted as follows: values below 0.50 indicated poor reliability, 0.50–0.75 indicated moderate reliability, 0.75–0.90 indicated good reliability, and values above 0.90 indicated excellent reliability. For qualitative error classification, Cohen’s kappa was calculated for each predefined error category and for overall category-level agreement.
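For readers unfamiliar with the non-parametric comparison, the following sketch shows one way the Kruskal–Wallis H statistic can be computed. It is not the authors' code (the analyses were run in R); it is a minimal pure-Python illustration that assumes tie-corrected midranks and the usual chi-square approximation with 3 degrees of freedom, which applies when four groups are compared. All function names and the example data are hypothetical.

```python
import math
from itertools import chain


def midranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of groups."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = midranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)


# Hypothetical scores for four groups of three responses each.
scores = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
h = kruskal_wallis_h(scores)
print(round(h, 3), chi2_sf_df3(h) < 0.05)
```

With four models there are six possible pairwise comparisons, so a Bonferroni adjustment in this setting would multiply each raw pairwise p-value by 6, for example `bonferroni(0.004, 6)`.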

Spearman’s correlation analysis was used to examine associations between reliability and readability metrics. A p-value < 0.05 was considered statistically significant. All statistical analyses were performed and all plots were generated using R version 4.5.2.

3. Results

A total of 72 chatbot-generated responses were obtained from 18 predefined pediatric anesthesia prompts across four chatbot systems, and all were included in the final analysis.

3.1. Interrater reliability

Interrater agreement was good to excellent across all evaluation instruments (Table 1). Agreement was excellent for DISCERN scores (ICC = 0.953, 95% CI: 0.930–0.969, p < 0.001) and JAMA scores (ICC = 0.964, 95% CI: 0.947–0.977, p < 0.001), and good for GQS (ICC = 0.854, 95% CI: 0.785–0.904, p < 0.001) and EQIP (ICC = 0.840, 95% CI: 0.660–0.915, p < 0.001), indicating robust concordance between reviewers.

Table 1.

Interrater reliability assessment by independent reviewers.

Metric	ICC	CI lower	CI upper	p
DISCERN	0.953	0.930	0.969	<0.001
GQS	0.854	0.785	0.904	<0.001
JAMA	0.964	0.947	0.977	<0.001
EQIP	0.840	0.660	0.915	<0.001
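The ICC interpretation bands stated in the statistical analysis section can be applied to the point estimates and confidence limits in Table 1 with a small helper. This is an illustrative sketch: the function name is hypothetical, and the handling of the boundary values (0.75 assigned to "good", 0.90 assigned to "good") is an assumption, because the stated ranges share their endpoints.

```python
def interpret_icc(icc):
    """Map an ICC value to the qualitative bands used in this study."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:  # assumption: exactly 0.90 is treated as "good"
        return "good"
    return "excellent"


# Point estimates and lower confidence limits copied from Table 1.
table1 = {
    "DISCERN": (0.953, 0.930),
    "GQS": (0.854, 0.785),
    "JAMA": (0.964, 0.947),
    "EQIP": (0.840, 0.660),
}
for metric, (estimate, ci_lower) in table1.items():
    print(metric, interpret_icc(estimate), interpret_icc(ci_lower))
```

Applied to the lower confidence limits, the same bands place EQIP (0.660) in the moderate range, so the precision of the EQIP agreement estimate is lower than that of the other instruments.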

3.2. Reliability of LLM-generated information

Substantial variability in informational reliability was observed across the evaluated chatbot systems (Table 2). Significant between-system differences were identified for DISCERN (H = 23.106, p < 0.001), EQIP (H = 27.872, p < 0.001), and GQS (H = 14.969, p = 0.002), whereas JAMA scores did not differ significantly (H = 4.476, p = 0.215).

Table 2.

Reliability scores across LLMs.

Program	DISCERN	EQIP	GQS	JAMA
ChatGPT	33.5 (31.0–36.75)	36.0 (33.0–36.0)	3.0 (3.0–3.0)	0.0 (0.0–0.0)
Claude	41.5 (37.25–44.5)	44.0 (42.0–50.0)	3.0 (3.0–4.0)	0.0 (0.0–0.0)
DeepSeek	49.0 (45.0–54.5)	48.5 (44.75–50.0)	4.0 (3.0–4.0)	0.0 (0.0–0.0)
Gemini	47.0 (39.25–51.75)	47.0 (44.75–54.5)	4.0 (4.0–4.0)	0.0 (0.0–0.0)
H	23.106	27.872	14.969	4.476
p	<0.001	<0.001	0.002	0.215

Note. Values are presented as median (IQR) for each model across evaluation metrics (DISCERN, EQIP, GQS, and JAMA).

DeepSeek and Gemini demonstrated higher median DISCERN scores, with median values of 49.0 (45.0–54.5) and 47.0 (39.25–51.75), whereas ChatGPT had a lower median score of 33.5 (31.0–36.75). A similar pattern was seen in EQIP scores, with Gemini and DeepSeek showing higher median scores of 47.0 (44.75–54.5) and 48.5 (44.75–50.0), respectively. For GQS, both Gemini and DeepSeek achieved median scores of 4.0, with IQRs of 4.0–4.0 and 3.0–4.0, respectively.

In contrast, JAMA benchmark scores remained uniformly low across all evaluated systems, with a median value of 0.0 (IQR: 0.0–0.0), suggesting limited transparency and authorship attribution.

Distributional analysis (Figure 1) further showed that ChatGPT had narrower distributions with consistently lower central values for both DISCERN and EQIP, whereas DeepSeek and Gemini had broader distributions with higher upper ranges, indicating better overall performance despite greater variability. Claude generally exhibited intermediate distributions that overlapped with those of the higher-performing models but had lower central values.

Figure 1.

Reliability scores across LLMs. Violin plots with embedded boxplots show score distributions for DISCERN, EQIP, GQS, and JAMA; brackets indicate pairwise comparisons.

For GQS, score variability was less pronounced across models; however, DeepSeek and Gemini remained concentrated toward higher values. Claude showed wider dispersion, suggesting greater inconsistency in global quality, while ChatGPT scores clustered at lower levels.

JAMA benchmark scores were uniformly low across all models, with minimal dispersion and no meaningful between-model differences, indicating consistently limited transparency and authorship disclosure.

3.3. Qualitative error analysis

Qualitative expert review identified multiple categories of content-related deficiencies in chatbot-generated responses (Table 3). Two experts independently reviewed all 72 de-identified, numerically coded responses and classified content-related deficiencies into four predefined categories: missing crucial information (responses lacking important perioperative or pediatric-specific content), factual inaccuracies (responses containing incorrect statements or misrepresented facts), hallucinations (fabricated or unsupported content not grounded in evidence or guidelines), and provision of outdated medical advice (information that is obsolete or no longer recommended). A single response could be assigned more than one error category when multiple distinct deficiencies were identified. Discrepancies were resolved through structured consensus discussion to generate the final classification. Cohen’s kappa values for each error category and overall category-level agreement are reported in eTable 8.

Table 3.

Final consensus classification of qualitative content-related deficiencies across 72 chatbot responses.

Category	Number of cases	Percentage of total responses (n=72)	Model-question identifiers
Factual inaccuracies	10	13.89%	ChatGPT-1/2/4/14/15/16/18; DeepSeek-2/16; Claude-18
Crucial information missing	9	12.50%	ChatGPT-5/6/9/15; DeepSeek-15/17; Gemini-3; Claude-6/9
Hallucination	3	4.17%	ChatGPT-3; DeepSeek-15; Claude-5
Provided outdated medical advice	1	1.39%	ChatGPT-9
Total	23	31.94%	—

Note. Percentages were calculated using the total number of analyzed chatbot responses as the denominator (n = 72). A single response could be assigned more than one deficiency category when multiple distinct deficiencies were identified.
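Because one response can carry more than one deficiency, the 23 category assignments in Table 3 do not correspond to 23 distinct responses. The sketch below re-derives the table percentages from the model-question identifiers and counts the distinct responses involved. The identifiers are copied from Table 3; the variable and function names are hypothetical.

```python
TOTAL_RESPONSES = 72

# Model-question identifiers copied from Table 3.
table3 = {
    "Factual inaccuracies": "ChatGPT-1/2/4/14/15/16/18; DeepSeek-2/16; Claude-18",
    "Crucial information missing": "ChatGPT-5/6/9/15; DeepSeek-15/17; Gemini-3; Claude-6/9",
    "Hallucination": "ChatGPT-3; DeepSeek-15; Claude-5",
    "Provided outdated medical advice": "ChatGPT-9",
}


def parse_identifiers(cell):
    """Turn 'Model-1/2; Other-3' into a set of (model, question) pairs."""
    pairs = set()
    for chunk in cell.split(";"):
        model, questions = chunk.strip().split("-", 1)
        for question in questions.split("/"):
            pairs.add((model, int(question)))
    return pairs


parsed = {category: parse_identifiers(cell) for category, cell in table3.items()}
for category, pairs in parsed.items():
    print(category, len(pairs), round(100 * len(pairs) / TOTAL_RESPONSES, 2))

total_assignments = sum(len(pairs) for pairs in parsed.values())
distinct_responses = set().union(*parsed.values())
print(total_assignments, round(100 * total_assignments / TOTAL_RESPONSES, 2))
print(len(distinct_responses), round(100 * len(distinct_responses) / TOTAL_RESPONSES, 2))
```

Under this reading of the identifiers, the 23 assignments involve 20 distinct responses, because ChatGPT-9, ChatGPT-15, and DeepSeek-15 each appear in two categories. The 31.94% total therefore describes deficiency assignments per 72 responses rather than the share of responses with at least one deficiency.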

The frequency and proportion of each error type are illustrated in Figure 2. After consensus discussion, factual inaccuracies were the most frequent error category, accounting for 10 responses (13.89%). These cases originated from ChatGPT (questions 1, 2, 4, 14, 15, 16, and 18), DeepSeek (questions 2 and 16), and Claude (question 18).

Figure 2.

Distribution of qualitative content-related deficiencies across 72 chatbot-generated responses. Bars show the number of deficiencies in each category, with percentages calculated using all analyzed responses as the denominator (n = 72). A single response could be assigned more than one deficiency category when multiple distinct deficiencies were identified.

Missing crucial information was identified in nine responses (12.50%). These cases were observed in outputs from ChatGPT (questions 5, 6, 9, and 15), DeepSeek (questions 15 and 17), Gemini (question 3), and Claude (questions 6 and 9).

Hallucinations were detected in three responses (4.17%), including one generated by ChatGPT (question 3), one by DeepSeek (question 15), and one by Claude (question 5).

Outdated medical advice was identified in one response (1.39%), generated by ChatGPT (question 9).

These findings suggest that although LLM-based chatbots are capable of generating contextually relevant responses, clinically important informational gaps and inaccuracies remain present across platforms. Detailed examples of these errors and expert annotations are provided in eTable 7.

3.4. Pairwise comparison of reliability metrics

Pairwise comparisons revealed that ChatGPT-generated responses scored significantly lower on DISCERN and EQIP than responses generated by Claude, DeepSeek, and Gemini, and significantly lower on GQS than responses generated by DeepSeek and Gemini (all p < 0.05), indicating comparatively reduced informational reliability (Table 4). No significant differences were found between Claude and either DeepSeek or Gemini across reliability measures (p > 0.05), or between DeepSeek and Gemini (p > 0.05), so the data provide no evidence of a difference in reliability among these three systems.

Table 4.

Pairwise comparison of reliability (p-values).

Program	DISCERN	EQIP	GQS	JAMA
ChatGPT-Claude	0.017	0.001	0.244	0.417
ChatGPT-DeepSeek	<0.001	<0.001	0.013	0.486
ChatGPT-Gemini	<0.001	<0.001	0.003	0.251
Claude-DeepSeek	0.123	0.491	0.168	0.622
Claude-Gemini	0.246	0.538	0.052	0.694
DeepSeek-Gemini	0.638	0.927	0.525	0.589

Note. Values represent p-values from pairwise statistical comparisons across evaluation metrics (DISCERN, EQIP, GQS, and JAMA).

3.5. Readability of LLM-generated responses

Readability assessment using seven widely accepted indices indicated that none of the evaluated chatbot systems achieved the recommended sixth-grade readability threshold for public-facing health information (Table 5).

Table 5.

Readability scores of LLMs.

Program	ARI	GFI	FKGL	CL	SMOG	LWF	FRES
ChatGPT	14.13±2.25	15.51±2.13	13.42±1.92	15.88±2.29	11.88±1.54	15.81±2.57	26.50±12.38
Claude	11.37±1.96	14.49±2.13	11.41±1.81	14.71±2.00	9.66±1.45	25.45±8.31	31.00±11.06
DeepSeek	13.65±2.29	14.37±2.02	12.71±2.14	14.66±1.95	11.67±1.71	13.33±2.18	33.50±12.48
Gemini	13.36±2.19	14.11±2.00	12.40±2.02	13.82±2.02	11.53±1.48	12.98±2.43	37.28±12.42
6th grade level	6	6	6	6	6	6	80

Note. Values are presented as mean ± SD across readability metrics (ARI, GFI, FKGL, CL, SMOG, LWF, and FRES).

FRES scores were markedly below the target range of 80–90 for texts intended for a sixth-grade reading level (Figure 3). The grade-level indices (ARI, GFI, FKGL, CL, SMOG, and LWF) similarly indicated that the linguistic complexity exceeded what is considered appropriate for a general audience.

Figure 3.

Readability scores of LLM-generated responses across seven indices. Bar charts show mean scores for each model. The dashed reference line (for FRES) or horizontal line at grade level 6 indicates the recommended readability threshold.

Collectively, these findings suggest that the responses generated by all four LLMs were written at a reading level well above the recommended sixth-grade standard, which may limit comprehension among caregivers without medical training.

3.6. Correlation between readability and reliability

Spearman correlation analysis demonstrated no strong associations between reliability indicators and readability measures (Figure 4), suggesting that information quality and textual accessibility represent largely independent dimensions of chatbot-generated health content.

Figure 4.

Spearman correlation between reliability and readability metrics for LLM-generated pediatric anesthesia responses. Circle size indicates correlation strength, with red for positive and blue for negative correlations. Selected correlations are reported in text.

Strong positive correlations were observed among the grade-level readability metrics, with FKGL demonstrating high correlations with ARI (r = 0.97), SMOG (r = 0.92), GFI (r = 0.83), and CL (r = 0.82). GFI was also strongly correlated with CL (r = 0.87). As expected, FRES was negatively correlated with the other readability indices, including CL (r = -0.97), GFI (r = -0.92), and FKGL (r = -0.84), reflecting inverse scale directions.

Regarding reliability measures, GQS showed a moderate positive correlation with DISCERN (r = 0.52) and EQIP (r = 0.61), while JAMA demonstrated weaker associations with readability metrics overall. EQIP displayed weak correlations with readability indices, including FKGL (r = -0.24), CL (r = -0.31), FRES (r = 0.30), SMOG (r = -0.14), and LWF (r = -0.16). No strong correlations in either direction were observed between reliability and readability metrics, suggesting largely independent constructs.
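Spearman's coefficient is the Pearson correlation computed on ranks, which is why it captures monotonic rather than strictly linear association. The sketch below illustrates this definition in pure Python. It is not the authors' R code, and the function names and example values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical grade-level scores and reading-ease scores for five texts.
grade_level = [9.8, 11.2, 12.5, 13.1, 14.0]
reading_ease = [48.0, 41.5, 33.0, 30.2, 22.7]
print(spearman(grade_level, reading_ease))  # perfectly inverse ordering
```

In the hypothetical example the two scores are ordered in exactly opposite directions, so the coefficient is at its negative extreme, mirroring the inverse scale direction of FRES relative to the grade-level indices.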

4. Discussion

The increasing deployment of large language model (LLM)–based conversational agents has expanded the ways in which caregivers access perioperative health information through digital platforms. These systems are progressively being integrated into routine health information–seeking behavior and may influence decision-making prior to clinical encounters.^40–42 However, evaluations of the quality of pediatric anesthesia information generated by LLMs remain limited. In this study, we evaluated the performance of four mainstream large language models, namely ChatGPT, Claude, DeepSeek, and Gemini, in the pediatric anesthesia domain. Overall, the quality and reliability of the generated information were suboptimal. DeepSeek and Gemini demonstrated relatively higher median reliability scores overall; however, pairwise comparisons showed that differences between Claude and either DeepSeek or Gemini did not reach statistical significance. None of the evaluated models achieved a readability level equivalent to the recommended sixth-grade standard, and no significant correlation was identified between information reliability and readability. These findings provide an overview of the current capabilities and limitations of widely used LLMs in addressing pediatric anesthesia–related questions and offer preliminary insights for improving both the reliability and accessibility of health information in this field.

4.1. Reliability of LLM-generated information

Using multiple established quality assessment instruments, this study found that the overall reliability of LLM-generated pediatric anesthesia information was moderate. This observation is consistent with recent research evaluating LLM responses to cancer-related queries, which reported acceptable accuracy but highlighted limitations such as poor readability, limited practical applicability, and a lack of cited sources.¹⁷ Notably, significant performance differences were observed among models, with DeepSeek and Gemini showing higher median scores across DISCERN, EQIP, and GQS metrics, although pairwise superiority over Claude was not statistically established. Similar patterns have been reported in prior studies, including research on exercise rehabilitation recommendations for ankylosing spondylitis, where DeepSeek-V3 demonstrated superior performance in modified DISCERN, reliability, and usefulness compared with ChatGPT-4,⁴³ as well as studies evaluating responses to frequently asked pain-related questions, in which Gemini achieved significantly higher GQS scores than ChatGPT.⁴⁴ Together, these findings reinforce the notion that model selection substantially influences the assessed quality of AI-generated medical information.

A notable and consistent finding in this study was the uniformly low JAMA scores across all evaluated models, indicating widespread deficiencies in transparency-related criteria. The JAMA benchmark emphasizes elements such as authorship, information sources, references, and content currency. In contrast, LLM-generated responses typically draw on aggregated knowledge from association websites, hospital pages, scientific literature, and general written materials intended for public dissemination, which rarely include explicit author attribution or publication dates. This pattern is consistent with previous research examining LLM responses to cardiopulmonary resuscitation–related questions.⁴⁵ In high-risk domains such as pediatric anesthesia, expert human review therefore remains essential. These limitations may reflect the fact that current LLMs primarily generate responses based on learned language patterns rather than direct information retrieval, resulting in challenges related to traceability and limited integration with high-quality, domain-specific medical knowledge bases. To address these issues, the adoption of retrieval-augmented generation (RAG) approaches is increasingly advocated, as they link generated content to verifiable sources and enable explicit disclosure of knowledge cutoffs and limitations. Accumulating evidence suggests that RAG can reduce hallucinations, improve factual accuracy, and enhance transparency.^46–48

4.2. Qualitative error analysis

Qualitative analysis of erroneous responses indicated that factual inaccuracies and omission of crucial information were the two dominant error patterns across models, accounting for most identified errors. Factual inaccuracies occurred predominantly in ChatGPT outputs, with occasional instances in DeepSeek and Claude. These inaccuracies mainly involved ambiguous descriptions of timing or dosing, oversimplified clinical recommendations, insufficient differentiation among related perioperative concepts, and use of nonstandard terminology, which may not be readily apparent to non-expert users. In parallel, omission of crucial information was observed across all four platforms (ChatGPT, DeepSeek, Gemini, and Claude), suggesting a systematic tendency of LLMs to generate incomplete yet superficially coherent responses to complex perioperative questions.

Missing key information is of particular concern in pediatric anesthesia. In this study, omissions involved pediatric-specific fasting distinctions, age-related pharmacological and physiological considerations, antiemetic strategies for postoperative nausea and vomiting, and risk differentiation in perioperative counseling. Such omissions may limit caregivers’ understanding of anesthesia-related risks and contribute to misjudgment.

Hallucinations represent another important challenge and refer to the generation of factually incorrect, fabricated, or unsupported content.⁴⁹ In this study, DeepSeek inappropriately extrapolated adult evidence on posterior neck cooling for postoperative nausea and vomiting to pediatric patients (Question 15), and Claude suggested a minimum interval of 2–4 weeks between anesthetic exposures despite the absence of authoritative guidance (Question 5).

Outdated medical advice represented the smallest proportion of identified errors and was observed exclusively in ChatGPT outputs. Specifically, in response to Question 9, ChatGPT provided outdated recommendations regarding anesthetic medication use. Such obsolete guidance may lead to inappropriate perioperative preparation and underscores the importance of continual data updating for LLM-based systems.

To further contextualize these errors from a patient-safety perspective, we examined two representative examples informed by the approach of Kuo et al.⁵⁰ DeepSeek-2 incorrectly described caudal block as applicable to upper limb surgery, which could mislead caregivers regarding regional anesthesia indications. DeepSeek-15 recommended applying a cool towel to the forehead or back of the neck for postoperative nausea and vomiting (PONV). This intervention is unsupported by established guidelines and could potentially delay evidence-based symptom management.^51,52 In terms of potential harm severity, both examples were judged to carry moderate clinical significance, as they could distort caregiver understanding or delay appropriate care, but neither was considered likely to cause death or severe harm in the context of caregiver education alone. These examples represent potential rather than realized harm, as no actual patients or clinical interventions were involved. The occurrence of such errors across multiple platforms suggests that this vulnerability is inherent to current LLM architectures rather than confined to a specific model.

4.3. Readability assessment

Our results indicate that the readability of content generated by all evaluated LLMs substantially exceeded the sixth-grade level recommended by the American Medical Association, thereby posing a barrier to public comprehension. Across grade-level–based indices (ARI, GFI, FKGL, CL, SMOG, and LWF) and reading ease metrics (FRES), the generated texts consistently corresponded to secondary school or higher reading levels. Although Gemini and DeepSeek exhibited relatively lower grade-level scores and greater reading ease than ChatGPT and Claude, none of the models produced content that met readability standards appropriate for general caregivers. These findings align with prior studies demonstrating that AI-generated medical information frequently exceeds recommended readability thresholds for public-facing health materials.^45,53 Collectively, these results suggest that current LLMs tend to rely heavily on technical terminology and complex sentence structures when conveying medical knowledge.
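
For reference, two of the indices named above can be computed directly from text counts. The sketch below uses the standard published Flesch Reading Ease and Flesch–Kincaid Grade Level formulas; the function names, the example counts, and the use of a grade level of 6.0 as a numeric cut-off are illustrative assumptions, and the software used in this study may count sentences and syllables differently.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease Score (FRES): higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level (FKGL): approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Assumed numeric reading of the "sixth-grade level" recommendation.
SIXTH_GRADE = 6.0


def meets_sixth_grade(words, sentences, syllables):
    """True if the FKGL estimate is at or below the assumed sixth-grade cut-off."""
    return flesch_kincaid_grade(words, sentences, syllables) <= SIXTH_GRADE


# Hypothetical passage: 100 words, 5 sentences, 150 syllables (not study data).
fres = flesch_reading_ease(100, 5, 150)
fkgl = flesch_kincaid_grade(100, 5, 150)
print(f"FRES = {fres:.1f}, FKGL = {fkgl:.1f}, "
      f"meets target: {meets_sixth_grade(100, 5, 150)}")
```

Both formulas depend only on average sentence length and average syllables per word, which is why shortening sentences and replacing polysyllabic terms are the usual levers for lowering grade level.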

Effective health communication depends on information being understandable and usable by its intended audience. Excessive linguistic complexity may reduce acceptability, limit usability, and adversely affect caregivers’ engagement and decision-making, potentially increasing anxiety or misinterpretation of risk. While between-model differences suggest that newer models such as Gemini may incorporate incremental improvements in linguistic accessibility, reliance on model selection alone is unlikely to resolve readability challenges. From a development perspective, simplifying language should be treated as a core design objective for health-focused LLMs, for example by incorporating high-quality science communication materials during training or enabling adaptive language-level adjustment based on user input. From a user perspective, parents or caregivers of pediatric patients may improve comprehension by explicitly specifying target reading levels within prompts when seeking pediatric anesthesia information.
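
As an illustration of the user-side strategy, a reading-level instruction can simply be placed ahead of the question. The helper below is a hypothetical sketch: the wording of the instruction, the function name `build_prompt`, the default grade, and the example question are all assumptions, and this study did not test whether such prompts lower measured grade levels.

```python
def build_prompt(question: str, grade_level: int = 6) -> str:
    """Prefix a caregiver question with an explicit target reading level."""
    if not 1 <= grade_level <= 12:
        raise ValueError("grade_level must be between 1 and 12")
    # Illustrative instruction wording; not taken from the study protocol.
    instruction = (
        f"Please answer at a grade {grade_level} reading level. "
        "Use short sentences and everyday words, "
        "and explain any medical term you use."
    )
    return f"{instruction}\n\nQuestion: {question.strip()}"


# Hypothetical caregiver question (not one of the 18 study questions).
example = build_prompt("What should I ask the anesthesia team before my child's surgery?")
print(example)
```

Any answer obtained this way would still need the professional verification discussed for clinically actionable information.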

4.4. Relationship between reliability and readability

This study found no strong associations between reliability metrics and readability indices, indicating that these dimensions function largely independently. This observation is consistent with prior multimetric evaluations showing that AI-generated patient education materials may score well for quality yet remain difficult to read, suggesting that quality and accessibility do not necessarily improve in parallel.⁵⁴ Reliability appears to be primarily influenced by the quality and scope of underlying knowledge exposure, whereas readability is shaped by learned linguistic patterns and stylistic preferences. In the context of pediatric anesthesia, where information is frequently accessed by parents and caregivers, a mismatch between content quality and readability may limit the educational value of LLM-generated materials. These findings carry important implications for both users and developers: ease of understanding should not be equated with informational accuracy, and health-oriented LLMs should be evaluated using integrated frameworks that simultaneously consider content quality and communicative clarity to achieve a balance between accuracy and interpretability.

4.5. Practical implications for caregivers

From a practical perspective, caregivers should regard LLM-generated pediatric anesthesia information as supplementary educational material rather than as a substitute for professional medical advice. Clinically actionable information should be verified with an anesthesiologist or perioperative care team, particularly when questions involve preoperative fasting, medication use, anesthesia-related risks, neurodevelopmental concerns, postoperative symptom management, or the timing of repeated anesthetic exposures. These question types may carry higher risk if incomplete or inaccurate responses are accepted without professional confirmation.

4.6. Limitations

Several limitations should be acknowledged. First, the number of questions differed across clinical domains. Although this distribution reflects areas of greater patient concern, it may influence domain-specific comparisons of statistical significance. Second, this study focused on commonly encountered questions, resulting in a relatively small overall sample size. To reduce subjectivity in the qualitative error analysis, all chatbot responses were independently reviewed by two experts using de-identified materials, and discrepancies were resolved through consensus discussion. Nevertheless, some residual expert judgment may remain; therefore, the qualitative findings should still be interpreted as exploratory in nature. Third, all responses were generated using a predefined, standardized single-turn question set, and neither patients nor caregivers were directly involved in the evaluation. While this approach enhances comparability across models, it does not capture the interactive characteristics of real-world caregiver-LLM communication, in which users may ask follow-up questions, seek clarification, or refine prompts over multiple turns. In addition, the relatively long responses generated from one-turn questions may be atypical of natural user interactions and may have influenced readability and perceived information quality. Fourth, all chatbot responses were collected on a single day to ensure synchronous comparison across models; however, LLMs may undergo updates or temporal drift, and their outputs may change over time. Therefore, the findings should be interpreted as a time-specific assessment of the evaluated model versions. Although paired duplicate outputs were reviewed qualitatively for response stability, formal quantitative variability analysis across repeated queries was not performed. Finally, readability was assessed using established formula-based indices. Although these metrics estimate textual complexity, they do not directly reflect actual comprehension, contextual interpretation, or the influence of cultural and educational factors on understanding.

Future studies should include larger and more diverse expert panels, incorporate direct participation from patients and caregivers, evaluate multi-turn interactions, and achieve a more balanced distribution of questions across clinical domains. In addition, including responses from human anesthesiologists as a comparator would provide a clinically meaningful benchmark for interpreting LLM performance and help determine whether LLM-generated information is comparable to, or potentially better than, clinician-generated caregiver education in specific scenarios.

5. Conclusion

In summary, widely used large language model-based chatbot systems currently demonstrate several limitations when applied as sources of pediatric anesthesia information for caregivers. Informational reliability remains variable, transparency regarding content sourcing is limited, and the linguistic complexity of generated responses frequently exceeds recommended readability thresholds for public-facing health communication. These factors may constrain the suitability of LLM-generated content for supporting caregiver understanding within digitally mediated perioperative care environments.

By systematically evaluating these performance dimensions, this study clarifies the current role and limitations of large language models as sources of pediatric anesthesia information. Our findings offer evidence-based insights to support the development of more reliable and accessible health information for the public and provide practical considerations for improving the application of large language models in the medical and health communication domains.

Supplemental material

Supplemental material - Large language model chatbots as sources of pediatric anesthesia health advice: An evaluation of reliability and readability

Supplemental material for Large language model chatbots as sources of pediatric anesthesia health advice: An evaluation of reliability and readability by Xue Zhang, Yuchen Dai, Xin Zhao, Lin Wu, Boming Shao, Xisheng Shan, Fuhai Ji, Runzhi Deng, Baojian Zhao in DIGITAL HEALTH.

Footnotes

Acknowledgements

The authors would like to thank the expert reviewers for their contributions to the assessment of chatbot-generated responses.

ORCID iD

Baojian Zhao

Ethical considerations

This study did not involve human participants, clinical datasets, laboratory animals, or histological specimens. All materials analyzed in this research were derived exclusively from publicly accessible large language models. No private, sensitive, or personally identifiable information was collected, accessed, or processed at any stage of the study, and there was no direct interaction with end users of these systems. Under these circumstances, formal ethical approval was not required.

Author contributions

Xue Zhang: Methodology; Formal analysis; Writing – Original Draft. Yuchen Dai: Investigation; Project administration; Review & Editing. Xin Zhao: Supervision; Project administration; Review. Lin Wu: Investigation; Supervision; Project administration. Xisheng Shan: Conceptualization; Investigation; Review & Editing. Boming Shao: Conceptualization; Investigation; Review & Editing. Fuhai Ji: Review & Editing. Runzhi Deng: Methodology; Project administration; Review. Baojian Zhao: Methodology; Review & Editing. All authors have read and approved the final version of the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The main data supporting the findings of this study, including the complete set of chatbot-generated responses and supplementary tables, are provided in the Supplementary Materials. Additional analysis files are available from the corresponding author upon reasonable request.

Use of assessment tools and permission statement

All tools, questionnaires, scales, and assessment instruments used in this study were reviewed for copyright and permission requirements. The DISCERN, EQIP, GQS, JAMA benchmark criteria, and standard readability indices were used as established academic evaluation tools for assessing chatbot-generated health information, and their original sources were appropriately cited. No proprietary patient questionnaire, restricted-use clinical scale, or copyrighted instrument requiring separate permission was administered to participants or reproduced in full as a primary data-collection instrument.

Supporting information

The Supplementary Materials include the completed STROBE and CHART checklists, supplementary tables, and the complete set of chatbot-generated responses analyzed in this study.

Patient and public involvement

Patients and/or the public were not involved in the design, conduct, reporting, or dissemination plans of this research.

Guarantor

Baojian Zhao is the corresponding author of this article. Runzhi Deng is the co-corresponding author. Both authors take full responsibility for the integrity of the research and data, had full access to all data, and had the final decision-making authority regarding publication.

Supplemental material

Supplemental material for this article is available online.

References

1. Walters. Pediatric Anesthesiology Special Issue. Children (Basel) 2021; 8: 20210307. https://doi.org/10.3390/children8030201

2. Cho, Lee, et al. Critical incidents associated with pediatric anesthesia: changes over 6 years at a tertiary children's hospital. Anesth Pain Med (Seoul) 2022; 17: 386–396. https://doi.org/10.17085/apm.22164

3. Marciniak. Growth and Development. In: Cote, Lerman, Anderson (eds). A Practice of Anesthesia for Infants and Children. 6 ed. Elsevier, 2019, pp. 8–24.e23.

4. Walkden, Pickering, Gill. Assessing Long-term Neurodevelopmental Outcome Following General Anesthesia in Early Childhood: Challenges and Opportunities. Anesth Analg 2019; 128: 681–694. https://doi.org/10.1213/ANE.0000000000004052

5. Liang, Huang, et al. Preoperative anxiety in children aged 2-7 years old: a cross-sectional analysis of the associated risk factors. Transl Pediatr 2021; 10: 2024–2034. https://doi.org/10.21037/tp-21-215

6. Mustafa, Shafique, Zaidi, et al. Preoperative anxiety management in pediatric patients: a systemic review and meta-analysis of randomized controlled trials on the efficacy of distraction techniques. Front Pediatr 2024; 12: 1353508. https://doi.org/10.3389/fped.2024.1353508

7. Han, Yan, et al. Pediatric Anesthesia, Psychology, and Interventions: A Narrative Review. Drug Des Devel Ther 2025; 19: 9779. https://doi.org/10.2147/DDDT.S481654

8. Russo, Campagna, Ferretti, et al. Online health information seeking behaviours of parents of children undergoing surgery in a pediatric hospital in Rome, Italy: a survey. Ital J Pediatr 2020; 46: 141. https://doi.org/10.1186/s13052-020-00884-7

9. McCann, Soriano. Does general anesthesia affect neurodevelopment in infants and children? BMJ 2019; 367: l6459. https://doi.org/10.1136/bmj.l6459

10. Keshtkar, Bennett-Weston, Khan, et al. Impacts of Communication Type and Quality on Patient Safety Incidents: A Systematic Review. Ann Intern Med 2025; 178: 687. https://doi.org/10.7326/ANNALS-24-02904

11. Shah, Entwistle, Pfeffer. Creation and Adoption of Large Language Models in Medicine. JAMA 2023; 330: 866–869. https://doi.org/10.1001/jama.2023.14217

12. Mendel, Singh, Mann, et al. Laypeople's Use of and Attitudes Toward Large Language Models and Search Engines for Health Queries: Survey Study. J Med Internet Res 2025; 27: e64290. https://doi.org/10.2196/64290

13. Anh-Hoang, Tran, Nguyen. Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior. Front Artif Intell 2025; 8: 1622292. https://doi.org/10.3389/frai.2025.1622292

14. Pan, Musheyev, Bockelman, et al. Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer. JAMA Oncol 2023; 9: 1437–1440. https://doi.org/10.1001/jamaoncol.2023.2947

15. Yeo, Samaan, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023; 29: 721. https://doi.org/10.3350/cmh.2023.0089

16. Scaff SPS, Reis FJJ, Ferreira, et al. Assessing the performance of AI chatbots in answering patients' common questions about low back pain. Ann Rheum Dis 2025; 84: 143. https://doi.org/10.1136/ard-2024-226202

17. Musheyev, Pan, Loeb, et al. How Well Do Artificial Intelligence Chatbots Respond to the Top Search Queries About Urological Malignancies? Eur Urol 2024; 85: 13–16. https://doi.org/10.1016/j.eururo.2023.07.004

18. Kocaoglu, Demirel, Kaya. Accuracy, quality, and readability analyses of responses from large language models to questions on pediatric dental sedation. BMC Oral Health 2026; 26: 304. https://doi.org/10.1186/s12903-026-08026-x

19. Sharma, Sidhu, Reddy, et al. Artificial intelligence in anesthesia: comparison of the utility of ChatGPT v/s google gemini large language models in pre-anesthetic education: content, readability and sentiment analysis. BMC Anesthesiol 2025; 25: 574. https://doi.org/10.1186/s12871-025-03451-x

20. Lee, Brown, Hammond, et al. Readability, quality and accuracy of generative artificial intelligence chatbots for commonly asked questions about labor epidurals: a comparison of ChatGPT and Bard. Int J Obstet Anesth 2025; 61: 104317. https://doi.org/10.1016/j.ijoa.2024.104317

21. von, Altman, Egger, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet 2007; 370: 1453–1457. https://doi.org/10.1016/S0140-6736(07)61602-X

22. Collaborative. Reporting guidelines for chatbot health advice studies: explanation and elaboration for the Chatbot Assessment Reporting Tool (CHART). BMJ 2025; 390: e083305. https://doi.org/10.1136/bmj-2024-083305

23. Charnock, Shepperd, Needham, et al. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health 1999; 53: 105–111. https://doi.org/10.1136/jech.53.2.105

24. Walker, Ghani, Kuemmerli, et al. Reliability of Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument. J Med Internet Res 2023; 25(20230630): e47479. https://doi.org/10.2196/47479

25. Charvet-Berard, Chopard, Perneger. Measuring quality of patient information documents with an expanded EQIP scale. Patient Educ Couns 2008; 70: 407–411. https://doi.org/10.1016/j.pec.2007.11.018

26. Bernard, Langille, Hughes, et al. A systematic review of patient inflammatory bowel disease information resources on the World Wide Web. Am J Gastroenterol 2007; 102: 2070–2077. https://doi.org/10.1111/j.1572-0241.2007.01325.x

27. Jafarnia, Haff, Moore, et al. Quality, Empathy, and Readability of AI Chatbot Responses to the Survivorship Needs of Adolescents and Young Adults With Melanoma: Evaluation Study. JMIR Cancer 2026; 12: e84234. https://doi.org/10.2196/84234

28. Silberg, Lundberg, Musacchio. Assessing, controlling, and assuring the quality of medical information on the Internet: Caveant lector et viewor--Let the reader and viewer beware. JAMA 1997; 277: 1244–1245.

29. Yildiz, Sogutdelen. AI Chatbots as Sources of STD Information: A Study on Reliability and Readability. J Med Syst 2025; 49: 43. https://doi.org/10.1007/s10916-025-02178-z

30. Hanci, Otlu, Biyikoglu. Assessment of the Readability of the Online Patient Education Materials of Intensive and Critical Care Societies. Crit Care Med 2024; 52: e47–e57. https://doi.org/10.1097/CCM.0000000000006121

31. Zaretsky, Kim, Baskharoun, et al. Generative Artificial Intelligence to Transform Inpatient Discharge Summaries to Patient-Friendly Language and Format. JAMA Netw Open 2024; 7: e240357. https://doi.org/10.1001/jamanetworkopen.2024.0357

32. Abreu, Murimwa, Farah, et al. Enhancing Readability of Online Patient-Facing Content: The Role of AI Chatbots in Improving Cancer Information Accessibility. J Natl Compr Canc Netw 2024; 22: 20240515. https://doi.org/10.6004/jnccn.2023.7334

33. Ali, Connolly, Tang, et al. Bridging the literacy gap for surgical consents: an AI-human expert collaborative approach. NPJ Digit Med 2024; 7: 63. https://doi.org/10.1038/s41746-024-01039-2

34. Smith, Kincaid. Derivation and Validation of the Automated Readability Index for Use with Technical Materials. Human Factors 1970; 12: 457–564. https://doi.org/10.1177/001872087001200505

35. Klare. A Second Look at the Validity of Readability Formulas. Journal of Reading Behavior 1976; 8: 129–152. https://doi.org/10.1080/10862967609547171

36. Onder, Koc, Gokbulut, et al. Evaluation of the reliability and readability of ChatGPT-4 responses regarding hypothyroidism during pregnancy. Sci Rep 2024; 14: 243. https://doi.org/10.1038/s41598-023-50884-w

37. Warrier, Singh, Haleem, et al. Readability of Hospital Online Patient Education Materials Across Otolaryngology Specialties. Laryngoscope Investig Otolaryngol 2025; 10: e70101. https://doi.org/10.1002/lio2.70101

38. Weiss. Help patients understand. Manual for Clinicians AMA Foundation, 2007.

39. Schmitt, Prestigiacomo. Readability of neurosurgery-related patient education materials provided by the American Association of Neurological Surgeons and the National Library of Medicine and National Institutes of Health. World Neurosurg 2013; 80: e33–e39. https://doi.org/10.1016/j.wneu.2011.09.007

40. Biswas. Role of Chat GPT in Public Health. Ann Biomed Eng 2023; 51: 868–869. https://doi.org/10.1007/s10439-023-03172-7

41. Ayers, Poliak, Dredze, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023; 183: 589–596. https://doi.org/10.1001/jamainternmed.2023.1838

42. Karako, Song, Chen, et al. New possibilities for medical support systems utilizing artificial intelligence (AI) and data platforms. Biosci Trends 2023; 17: 186–189. https://doi.org/10.5582/bst.2023.01138

43. Sari, Celik, Mirza. ChatGPT-4 vs. DeepSeek-V3: a comparative study of response quality, reliability, usefulness, and readability for exercise and rehabilitation strategies in patients with ankylosing spondylitis. Clin Rheumatol 2025; 45: 20251110. https://doi.org/10.1007/s10067-025-07789-y

44. Ozduran, Akkoc, Buyukcoban, et al. Readability, reliability and quality of responses generated by ChatGPT, gemini, and perplexity for the most frequently asked questions about pain. Medicine (Baltimore) 2025; 104: e41780. https://doi.org/10.1097/MD.0000000000041780

45. Omur Arca, Erdemir, Kara, et al. Assessing the readability, reliability, and quality of artificial intelligence chatbot responses to the 100 most searched queries about cardiopulmonary resuscitation: An observational study. Medicine (Baltimore) 2024; 103: e38352. https://doi.org/10.1097/MD.0000000000038352

46. Wang, Zhao, et al. LINS: A general medical Q&A framework for enhancing the quality and credibility of LLM-generated responses. Nat Commun 2025; 16: 9076. https://doi.org/10.1038/s41467-025-64142-2

47. Tayebi Arasteh, Lotfinia, Bressem, et al. RadioRAG: Online Retrieval-Augmented Generation for Radiology Question Answering. Radiol Artif Intell 2025; 7: e240476. https://doi.org/10.1148/ryai.240476

48. Weinert, Rauschecker. Enhancing Large Language Models with Retrieval-Augmented Generation: A Radiology-Specific Approach. Radiol Artif Intell 2025; 7: e240313. https://doi.org/10.1148/ryai.240313

49. Chelli, Descamps, Lavoue, et al. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis. J Med Internet Res 2024; 26: e53164. https://doi.org/10.2196/53164

50. Kuo, Fierstein, Tudor, et al. Comparing ChatGPT and a Single Anesthesiologist's Responses to Common Patient Questions: An Exploratory Cross-Sectional Survey of a Panel of Anesthesiologists. J Med Syst 2024; 48: 77. https://doi.org/10.1007/s10916-024-02100-z

51. Gan, Jin, Ayad, et al. Fifth Consensus Guidelines for the Management of Postoperative Nausea and Vomiting: Executive Summary. Anesth Analg 2025: 20251114. https://doi.org/10.1213/ANE.0000000000007816

52. Wang, et al. Clinical practice guidelines for the prevention and management of postoperative nausea and vomiting (2025 edition). J Anesth Transl Med 2025; 4: 286–302. https://doi.org/10.1016/j.jatmed.2025.12.002

53. Helvacioglu-Yigit, Demirturk, Ali, et al. Evaluating artificial intelligence chatbots for patient education in oral and maxillofacial radiology. Oral Surg Oral Med Oral Pathol Oral Radiol 2025; 139: 750–759. https://doi.org/10.1016/j.oooo.2025.01.001

54. Whittaker, Sun. Quality and readability of chatbot responses to patient questions: A systematic cross-sectional meta-synthesis. Health Informatics J 2025; 31. 1017. https://doi.org/10.1177/14604582251388879
