Pandora’s Box of Pediatric Sports Medicine: Assessing ChatGPT Responses to Patient-Centered Questions in Pediatric Sports Medicine

Abstract

Background:

The ubiquitous use of the internet to investigate health concerns and the popularity of Chat Generative Pre-trained Transformer (ChatGPT), the fastest-growing consumer app in history, make it a likely source of information for patients. ChatGPT has provided accurate orthopaedic differential diagnoses and satisfactory responses to patient questions about slipped capital femoral epiphysis and in-toeing, but its performance regarding management and outcomes of pediatric sports diagnoses remains to be seen. The purpose of this study was to assess ChatGPT’s ability to answer pediatric sports medicine questions.

Hypothesis:

ChatGPT’s responses to pediatric sports medicine questions will be considered generally accurate by orthopaedic experts regardless of question topic or rater experience level. Responses for more frequently asked questions will have better readability and accuracy.

Methods:

Five frequently asked questions (FAQ) from patient education webpages and eight complex questions composed from the authors’ clinical experience were selected based on frequency of appearance and importance in patient education and asked of ChatGPT, using a guest internet browser profile without stored history or cookies to limit bias. Four board-certified pediatric orthopaedic surgeons and 2 orthopaedic clinical fellows scored each ChatGPT response on a 0-5 scale in each of three domains: accuracy, relevance and clarity. Flesch-Kincaid Grade Level and Flesch Reading Ease were determined using Microsoft Word. Weighted Cohen kappa and Kendall’s coefficient of concordance (KCC) were used to assess rater agreement. All data were analyzed using Shapiro-Wilk test for normality, Mann Whitney U test, one-way ANOVA, Fisher’s exact test, and Kruskal-Wallis test as appropriate based on normality.

Results:

The mean FGKL and median FRE were 15.4±2.2 and 23.6 (13.1-27.8), respectively. With median FRE 3.0 (6.7-19.2), PFI responses demonstrated significantly lower readability than meniscus responses (p=0.032). Agreement among all raters was moderate (KCC, 0.57). When subdivided by experience level, agreement among raters was fair (weighted Cohen kappa coefficient, 0.298). Scores differed significantly by question topic (p<0.001) and question level (p=0.02), with highest median scores for PFI (4.0, 3.2-4.7) and FAQs (3.8, 3.0-4.6). Fellows awarded higher scores for clarity of ACL responses (p=0.004) and relevance of PFI responses (p=0.02). Median clarity scores across question topic and level were above 4.0 for both fellows and attendings.

Conclusion:

ChatGPT provided information regarding pediatric sports medicine at a college reading level. There was agreement between raters in scoring ChatGPT’s responses, with fellows awarding higher scores for certain domains. Across all domains, ChatGPT performed best on FAQs and PFI.