Abstract
Background. Patients increasingly use the Internet and artificial intelligence (AI) platforms ChatGPT for medical information, raising concerns about the accuracy and clinical depth of AI-generated content. This study evaluated the reliability and clinical utility of ChatGPT (GPT-3.5 and GPT-4.0) for common foot and ankle conditions compared with patient education materials from the American Orthopaedic Foot & Ankle Society (AOFAS) FootCareMD. Methods. Between January 20 and 26, 2025, standardized prompts were used to query GPT-3.5 and GPT-4.0 across 15 common foot and ankle conditions. ChatGPT responses were compared with AOFAS FootCareMD content based on the number of symptoms, risk factors, and treatment options provided. Two fellowship-trained foot and ankle orthopaedic surgeons independently evaluated response accuracy, categorizing outputs as <50%, 50% to 74%, 75% to 99%, or 100% accurate. Paired t-tests were used for statistical comparisons, and inter-rater reliability was assessed using Cohen’s weighted kappa. Results. GPT-4.0 generated significantly more symptoms than AOFAS content (P = .015). In contrast, GPT-3.5 listed significantly fewer treatment options than both AOFAS and GPT-4.0 (P = .042). When addressing surgical management, both ChatGPT versions frequently provided vague or incomplete information. GPT-3.5 referenced surgery without procedural detail in 53% of responses, while GPT-4.0 lacked detailed surgical explanations or omitted them entirely in 80% of responses. Overall accuracy ratings were high, with 77% of responses judged as 75% to 99% accurate and only 3.4% rated below 50% accuracy. However, inter-rater agreement between surgeons was poor (κ = −0.02), for responses labeled as 100% accurate, highlighting subjectivity in grading AI-generated medical content. Conclusion. ChatGPT effectively provides general information on foot and ankle conditions, regarding causes and symptoms, and GPT-4.0 offers more comprehensive treatment discussions than GPT-3.5. Nevertheless, its limited depth and specificity regarding surgical options restrict its clinical usefulness. Until further improvements are made, AI-generated content should serve as a supplement rather than a replacement for expert-reviewed patient education resources.
Level of Evidence: Level III Case Control Study
This study evaluated the reliability and clinical utility of chatgpt (GPT-3.5 and GPT-4.0) for common foot and ankle conditions compared with patient education materials from the american orthopaedic foot & ankle society (AOFAS) footcaremd.”
Introduction
The Internet is now a major source of health information, with about 72% of adult Internet users seeking health-related topics online, often as a first step in symptom evaluation.1,2 This shift highlights the need for accurate, accessible, evidence-based digital resources. In orthopaedics, particularly foot and ankle surgery, patients often consult online sources before seeing a provider. Artificial intelligence (AI)–driven platforms are increasingly shaping patient knowledge. Morris Gordon et al performed an extensive scoping review that integrated 278 publications focused on the application of AI in medical education. They demonstrate a notable application of AI across various stages of medical education, including but not limited to admissions, training, testing, and enhancing clinical reasoning abilities. 3
A recent investigation conducted by Theo J Clay et al highlights the impact of AI on communication within the health care sector. It emphasizes the intricacies of human-computer interaction via frameworks like ChatGPT and Med-PaLM. On one hand, their findings suggest that AI might exceed physicians in empathy and clarity. Conversely, issues such as “hallucinations” and inaccuracies in medical information were highlighted, emphasizing the necessity for enhancements in algorithms and the creation of new regulations. 4
At its core, AI simulates human-like intelligence. 5 ChatGPT (OpenAI, San Francisco, CA) is a generative AI model using natural language processing to provide human-like responses across many topics, including health care. Its ease of use has led to widespread adoption by patients and clinicians. However, ChatGPT is known for lacking source attribution, occasional inaccuracies, and difficulty with nuanced clinical cases. 4 Ethical concerns, such as misinformation, privacy, and transparency, have also been raised. 4 In March 2023, Italy’s data protection authority temporarily banned ChatGPT over privacy violations, reflecting these ongoing issues. 6 Despite its flaws, ChatGPT is widely used for understandable, quick medical explanations. In addition to that, the integration of technology in medical education is seen as a significant advancement toward fostering a more informed and sensitive health care environment. Khamisy-Farah et al emphasize the potential of using technological advancements, such as virtual reality and e-learning platforms, to educate medical students and practitioners on gender and sexuality topics. These modern methods could enhance traditional learning and provide flexible training opportunities. 7
A thorough evaluation of 21 research studies demonstrated that chatbots provide 89.1% satisfactory patient education responses. 8 Nonetheless, significant issues arise: replies are often formulated at a collegiate reading level (Flesch-Kincaid Grade Level 13.1), which may hinder patient understanding. 8 AI holds potential for interpreting images and making clinical predictions in foot and ankle surgery; 9 however, experts caution against its extensive application. Issues encompass the potential for disseminating inaccurate information, the challenge of staying current with the latest clinical guidelines, and the risks associated with prioritizing factors other than patient care in treatment. 10 While ChatGPT serves as a valuable resource for patient education, it is essential that it does not substitute the direct interaction between health care providers and their patients.
ChatGPT presents significant limitations and risks for orthopaedic patients, with studies revealing critical challenges in information accuracy and reliability. Adnan Kasapovic et al 11 found that ChatGPT mentioned only 47% of relevant medical keywords, with 35% of generated fact sheets rated as “not useful” and 20% classified as potentially “dangerous.” Giorgino et al 12 emphasized critical risks such as inadequate specialized knowledge, absence of contextual awareness, and the possibility of producing biased outcomes. Morya et al13,20 highlights that although ChatGPT presents encouraging functionalities, it is essential for health care providers to proceed with utmost caution, consistently validating crucial information through independent investigation and expert advice. Vanish et al explore the application of AI in diagnosing and treating ankle and foot fractures in previous research. For example, Ashkani-Esfahani et al validated 2 deep convolutional neural networks (DCNNs) for identifying ankle fractures, achieving an area under the curve (AUC) of 0.99. Furthermore, Prijs et al validated a deep learning model for detecting and classifying ankle fractures, recording an AUC of 0.92 and an accuracy of 99% on external validation. Guermazi et al and Olczak et al achieved high AUCs of 0.97 and fair to excellent performance in classifying fractures, respectively. Other studies validated convolutional neural network (CNN) models for specific fracture types, consistently achieving high detection accuracies, such as 98% for calcaneal fractures and AUCs ranging from 0.81 to 0.89 for tibial shaft fractures. Overall, the research indicates robust performance of AI models in fracture management.
Despite its acknowledged flaws, ChatGPT is a dominant source for quick and understandable medical explanations. Given AI’s accelerating role in patient education and the potential for clinical misinformation to compromise outcomes, assessing the educational accuracy of this technology is critical, especially in a specialty where patient compliance is vital. This study aims to assess the reliability and performance of ChatGPT in educating patients about common foot and ankle conditions. Specifically, by comparison of generated responses from GPT-3.5 and GPT-4.0 to information provided by the American Orthopaedic Foot & Ankle Society (AOFAS) FootCareMD website, a reputable and physician-reviewed source of patient education materials. By evaluating ChatGPT’s ability to provide accurate, complete, and readable information, we seek to identify both the potential benefits and limitations of large language models as tools for patient engagement in orthopaedic care. Due to the nature of this study, no AI tools were utilized for this project.
Methods
In this comparative observational study, from January 20, 2025, to January 26, 2025, ChatGPT (GPT-3.5 and GPT-4.0) was queried separately with: “I have been diagnosed with [condition]. Can you tell me more about it?” across 15 common foot and ankle conditions (Figure 1). Each query was made in a separate chat session. The authors compared outputs from GPT-3.5 and GPT-4.0 to information from the AOFAS FootCareMD website, assessing accuracy, completeness, and readability. The selected foot and ankle conditions were arbitrarily chosen; the authors wanted to include a wide range of conditions to better access the breadth of information ChatGPT could provide.

Foot and ankle orthopaedic conditions used as inputs for ChatGPT.
For each condition, a content assessment was conducted. This included the number of symptoms, risk factors, and treatments listed by ChatGPT (both versions) and AOFAS. General categories like “surgery” or “medication” were counted once, while specific options were counted individually. Two fellowship-trained foot and ankle orthopaedic surgeons (A.B. and G.I.P.) independently assessed the accuracy of ChatGPT outputs, blinded to the model version, using 4 categories: <50%, 50% to 74%, 75% to 99%, and 100% accurate. Comments were encouraged to clarify ratings. In this study, 100% accuracy was defined as no incorrect information provided and <50% accuracy as more than 50% of the provided information being inaccurate. Assessors graded the accuracy of outputs based on their current practice management and standards. Additionally, responses were compared against any supplemental handouts and patient information available to their patients. Statistical analyses utilized Jeffreys’s Amazing Statistics Program (JASP), version 0.19.3. A priori power analysis was conducted to determine the medium effect (d = 0.5) with 80% power at a significance level of α = 0.05 for paired T-test samples. The analysis indicated that a total sample N = 20 was required. The authors aimed for a sample size of 30. Paired t-tests assessed differences in symptoms, risk factors, and treatment options. These parameters were analyzed between both iterations of ChatGPT as well as analyzed against the AOFAS FootCareMD website information. Data normality was assessed by Shapiro-Wilk tests. The results indicated that variable data did not significantly depart from a normal distribution. Therefore, parametric tests were utilized. Cohen’s weighted kappa measured inter-rater agreement between the independent assessors’ grading accuracy. 11
Results
Statistical analysis was performed by a single author (AL) to ensure uniformity in analysis and interpretation. ChatGPT 3.5 and 4.0 produced paragraph-dominant responses, with occasional bullet points (Figures 2 and 3). ChatGPT 3.5’s mean word count was 608 ± 52 (range = 270-975) vs 403 ± 10 (range = 344-468) for GPT-4.0. Follow-up care was recommended in 60% of GPT-3.5 responses and 40% of GPT-4.0. Specific recommendations to see an orthopaedic surgeon or podiatrist appeared in 67% of GPT-3.5 and 53% of GPT-4.0 outputs.

Example ChatGPT output.

Example ChatGPT output cont.
GPT-4.0 provided significantly more symptoms per condition than AOFAS content (mean difference −1.4; P = .015), averaging 6.3 symptoms compared to AOFAS’s 4.9. GPT-3.5 offered significantly fewer treatment options than AOFAS (mean difference 1.2; P = .042), averaging 5.7 vs 6.7 treatments. No significant difference was found in the risk factors presented.
GPT-4.0 averaged more symptoms (6.3) than GPT-3.5 (5.5), but the difference was not statistically significant (P = .22). Risk factors and treatment options did not differ significantly between the versions. Notably, surgical options listed by ChatGPT were vague. In GPT-3.5, 8 of 15 responses (53%) vaguely mentioned “surgery” without specifics; only 4 diagnoses included named procedures with minimal detail, and 3 omitted surgeries altogether. GPT-4.0 outputs were similar: 12 of 15 (80%) had superficial or absent surgical explanations. GPT-4.0 averaged more symptoms (6.3) than GPT-3.5 (5.5), but the difference was not statistically significant (P = .22). Risk factors and treatment options did not differ significantly between the versions. Notably, surgical options listed by ChatGPT were vague. In GPT-3.5, 8 of 15 responses (53%) vaguely mentioned “surgery” without specifics; only 4 diagnoses included named procedures with minimal detail, and 3 omitted surgeries altogether. GPT-4.0 outputs were similar: 12 of 15 (80%) had superficial or absent surgical explanations.
Overall, 77% of outputs were rated 75% to 99% accurate; only 3.4% fell below 50% accuracy. For GPT-3.5, 13% of outputs were rated 50% to 74% accurate and 7% as 100% accurate. For GPT-4.0, 9.9% were rated 50% to 74% accurate and 7% as 100% accurate. Inter-rater agreement was poor (k = −0.02), especially for the 100% accuracy category.
Discussion
This study found that while ChatGPT can provide general information about foot and ankle conditions, its outputs often lack specificity and can contain inaccuracies. Poor inter-rater agreement highlights the difficulty of objectively evaluating AI medical content and reinforces the need for rigorous standards. Our results align with recent orthopaedic AI research.2,6,14-16 Massey et al 14 reported that orthopaedic residents significantly outperformed both GPT-3.5 and GPT-4.0 on ResStudy questions, showing the clear gap between AI and expert knowledge. Hofmann et al 6 found GPT-4.0 performing at a third-year resident level on the Orthopaedic In-Training Examination. Seth et al 16 similarly noted GPT-4.0’s superficiality when explaining hip osteoarthritis.
Collectively, these studies suggest ChatGPT, despite improvements, remains inadequate as a standalone tool for clinical patient education. Its inability to provide detailed, referenced, and nuanced medical advice poses risks for patient misunderstanding and unrealistic expectations. Until substantial improvements are made, AI models like ChatGPT should supplement—but not replace—expert-reviewed materials like those from AOFAS. 5 The AOFAS FootCareMD website has undergone multiple updates since its launch in 2003, with the primary objective of enhancing patient education. 5 The latest revisions aim to improve the clarity and accessibility of the content. The initiative seeks to strike an optimal balance between delivering comprehensive, peer-reviewed medical information and ensuring that this information is comprehensible to the general public. 5 It mainly focuses on readability because a significant portion of the audience has a reading level at or below the eighth grade. 5 This study had several limitations. First, the authors only utilized the GPT versions readily available to the public at the time of the study. ChatGPT 3.5 is the free open-access version, and ChatGPT 4.0 requires a paid subscription. The rationale was to query both versions of ChatGPT to capture the potential options. Secondly, the authors only compared ChatGPT to AOFAS content, although patients may use a variety of sources. Each diagnosis was queried once with a single prompt; no iterative follow-up questions were used. Also, treatment recommendations in orthopaedics often vary based on surgeon preference, affecting assessments of completeness. The low inter-rater agreement further underscores the difficulty in judging AI output quality. In this study, the independent reviewers assessed the blinded responses, assigning an overall score for each prompt. The poor interobserver agreement could partially be attributed to this study design. Without the option of categorical scoring, it is difficult to determine the nuances of why the reviewers’ assessments differed and in what specific way. In addition, there could have been a discrepancy in the verbiage describing the rating scales. For example, a reviewer could have graded responses based on accuracy versus how they would normally counsel their patients in independent practice. Future work should involve more evaluators (including residents and practicing surgeons) and broader academic source comparisons to improve generalizability. Improving interobserver agreement for ChatGPT in foot and ankle surgery requires acknowledging fundamental AI limitations rather than simple guideline modifications. The evidence suggests significant challenges: Steven R. Cooperman et al 17 found only 50.5% accuracy in identifying AI-generated content, with moderate to poor inter-rater and intra-rater reliability. Cooperman et al 17 showed that while ChatGPT can provide generally acceptable patient information, it requires substantial clarification in complex queries. Potential improvement strategies include implementing standardized evaluation rubrics, requiring multiple independent reviews, and mandating human expert verification. Chandler A. Sparks et al 18 reinforce that professional organizations remain the preferred information source, indicating that AI should be viewed as a supplementary tool, not a replacement for expert medical guidance. The study by Fahim et al 19 examines the significant influence of AI on health care worldwide, highlighting its uses in areas such as disease detection, personalized care, drug discovery, predictive analytics, telemedicine, and wearable health technologies. The authors emphasize the potential of AI to improve health care fairness through affordable solutions in resource-limited environments. They also recognize obstacles including data privacy, algorithmic bias, model interpretability, and the need for regulatory oversight. 19 This research supports our study by emphasizing the role of AI in enhancing the capabilities of health care professionals but not replacing it, aiming to minimize errors, optimize resource utilization, and elevate patient outcomes, ultimately broadening access to high-quality care worldwide.
In addition to that, Jacob C. et al built a framework that combines clinical context with real-world implementation factors, providing a more thorough method for assessing AI tools. A framework for AI was developed for IMPACTS. 20 The criteria are structured into 7 essential clusters, each linked to a letter in the acronym: (1) I—integration, interoperability, and workflow; (2) M—monitoring, governance, and accountability; (3) P—performance and quality metrics; (4) A—acceptability, trust, and training; (5) C—cost and economic evaluation; (6) T—technological safety and transparency; and (7) S—scalability and impact. 20 This framework extends its scope beyond merely concentrating on technical metrics or methodological guidance at the level of individual studies. It combines the clinical environment and practical application elements to guarantee that AI tools are assessed in a comprehensive manner. 20 Furthermore, there is a need to create standardized evaluation frameworks tailored to conversational AI platforms. Current assessment tools for online medical content are not suited to the dynamic, evolving nature of AI dialogue. As AI integration in health care grows, quality control mechanisms are essential to ensure safe and effective patient outcomes.
Conclusion
ChatGPT, especially GPT-4.0, can provide a larger quantity of general information about common foot and ankle conditions but often lacks specificity, accuracy, and depth. Compared to AOFAS FootCareMD content, both GPT-3.5 and GPT-4.0 frequently presented vague surgical explanations and incomplete treatment options. Until AI tools significantly improve, expert-reviewed, evidence-based resources should remain the gold standard for patient education, with AI serving only as a supplementary tool to support health literacy. This study highlights the limitations of evaluating ChatGPT in orthopaedics against AOFAS content. This investigation highlights the limitations of evaluating ChatGPT in the field of orthopaedics compared to AOFAS content. The assessment involved a single question for each diagnosis, without follow-up inquiries, and treatment recommendations varied based on the surgeon’s discretion. A minimal consensus among evaluators indicates the challenges in assessing the quality of AI-generated results while identifying potential avenues to better evaluate these types of resources. Future investigations should involve a broader range of evaluators and more comprehensive comparisons, acknowledging the constraints of AI rather than simply revising guidelines. Specifically, evaluation tools that parse out distinct components of the overall education provided by AI resources. Strategies for enhancement encompass the use of standardized evaluation rubrics and validation by human experts, indicating that AI serves as a complement to professional guidance. Moreover, existing assessment instruments need to be adapted to reflect the evolving characteristics of AI interactions, underscoring the necessity for robust quality assurance systems as the influence of AI in the health care sector expands.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Ethical Considerations
Ethical approval for this study was waived by the Northwell Health Institutional Review Board due to the determination that this project did not constitute human research.
