Abstract
Background:
Chat Generative Pre-Trained Transformer (ChatGPT), an artificial intelligence (AI) program, is widely used for information compilation. This study sought to analyze the quality and consistency of the information generated by ChatGPT regarding common procedures for wrist arthritis.
Methods:
Thirty-two standardized questions regarding wrist osteoarthritis and related procedures (4-corner-fusion [4CF], proximal row carpectomy [PRC], resurfacing capitate pyrocarbon implant, wrist denervation, and total wrist arthrodesis and arthroplasty) were presented to the ChatGPT-3.5 interface 3 separate times, without feedback. ChatGPT’s answers were evaluated for medical accuracy by 3 reviewers and rated as “appropriate,” “appropriate but incomplete,” or “inappropriate.” Ratings were then converted to numerical values to calculate an intraclass correlation coefficient (ICC). A DISCERN score was used to assess quality, and the Flesch-Kincaid Grade Level and Flesch Reading Ease Score were used to assess readability.
Results:
Overall, 75% of the responses were deemed “appropriate,” with 23 questions receiving unanimous appropriate ratings across all responses. The ICC was 0.97 (95% CI [0.46, 0.98]), indicating excellent reliability. The DISCERN score was 60 (good). The Flesch-Kincaid Grade Level was 14.6 ± 1.9, and the Flesch Reading Ease Score was 25.3 ± 6.7, implying a college reading level. The information that ChatGPT provided for PRC and total wrist arthrodesis and arthroplasty appeared to be more reliable than that for 4CF and denervation.
Conclusion:
ChatGPT’s reliability and accuracy of information varied across procedures, possibly due to unknown and diverse sources. Furthermore, while some answers were factually correct, many provided generic information across differing questions, limiting usefulness. ChatGPT must be used cautiously, and the limitations understood.
Introduction
Artificial intelligence (AI) has increasingly pervaded all facets of life, with an immense rise in use in the health care setting. 1 Furthermore, the AI-associated health care market is expected to continue to grow rapidly: estimates put the global AI in health care market at 22.6 billion US dollars in 2023, with expansion at a compound annual growth rate of 36.4% from 2024 to 2030. 2 AI has come to include various modalities, such as machine learning, deep learning, and natural language processing 3 (a specific type of AI algorithm). Large Language Models (LLMs) use deep learning techniques and very large data sets to understand, summarize, generate, and predict new text-based content.4,5 Chat Generative Pre-Trained Transformer (ChatGPT) is an LLM that uses generative learning to produce unique, human-like responses to each question asked of it, in real time. 6
Tools such as ChatGPT are relevant to clinicians not only because of their prevalence and ease of access and use, but also because most patients query the internet about their health before ever speaking with their doctor.7,8 Several recent studies have investigated the use of ChatGPT in hand and upper extremity surgery.9-12 Many of these studies have reported that ChatGPT provides answers of generally good quality, but it remains unclear where these answers came from or when the source material was written, as ChatGPT does not routinely provide source material. Furthermore, the consistency of its answers has been called into question. 12 As surgeries for wrist arthritis become more common,13,14 the available AI-provided data on wrist procedures (4-corner-fusion [4CF], proximal row carpectomy [PRC], resurfacing capitate pyrocarbon implant [RCPI], wrist denervation, and total wrist arthrodesis and arthroplasty) become an area of interest. Notably, from 2009 to 2019 in the United States, the total case volume of PRC, 4CF, total wrist arthrodesis, and total wrist arthroplasty increased by 3.4% (from 7467 to 7720). 14
This study sought to analyze the quality and consistency of the information generated by ChatGPT regarding common procedures for treatment of wrist arthritis. We hypothesized that there may exist a great amount of variability in the quality and reliability of the information provided.
Materials and Methods
The authors compiled a list of 32 questions regarding wrist osteoarthritis and subsequent surgeries (4CF, PRC, RCPI, wrist denervation, and total wrist arthrodesis and arthroplasty), covering basic definitions, risk factors, diagnostic and treatment modalities, and outcomes (Table 1). These questions were then individually presented to ChatGPT on 3 separate occasions. The questions were presented to ChatGPT unchanged and in the same order on each occasion. As ChatGPT provides unique responses to every question based on a probability-based model, all answers differed, and no feedback was given to the program. All ChatGPT answers were recorded and assembled for author review. Each author (n = 3) reviewed all iterations of answers provided by ChatGPT and individually graded the responses as “appropriate,” if accurate and without misinformation or nearly without error; “appropriate but incomplete,” if lacking significant data needed to answer the question posed; or “inappropriate,” if the response contained significantly inaccurate or misleading data. If a response was rated as “inappropriate,” the reviewer recorded the reasoning behind the rating.
Table 1. Questions Presented to ChatGPT.
Answers reflect the numerical average of ratings by 3 reviewers (i.e., “appropriate” = 2, “appropriate but incomplete” = 1, and “inappropriate” = 0).
MRI = magnetic resonance imaging; CT = computed tomography; PRP = platelet rich plasma.
To investigate the reliability across ChatGPT’s answers to the same question, an intraclass correlation coefficient (ICC) was calculated. Answers deemed “appropriate” were converted to a value of 2, answers deemed “appropriate but incomplete” were converted to a value of 1, and answers graded “inappropriate” were converted to a value of 0. The ICC was then analyzed using the average of the grades determined by the reviewers and calculated with a 2-way mixed effects model. 15
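The calculation described above can be sketched as follows. This is a minimal illustration and not the authors' analysis code. It assumes the consistency form of the two-way mixed effects ICC (often written ICC(3,1) for a single measurement and ICC(3,k) for the average of k measurements), because the text does not state which form was used. The function name `icc_two_way_consistency` and the toy score matrix are hypothetical. Each row is one question, and each column is one of the repeated ChatGPT responses, holding the reviewers' mean grade on the 0 to 2 scale.

```python
from statistics import mean


def icc_two_way_consistency(scores):
    """Return (ICC(3,1), ICC(3,k)) from a rows-by-columns score matrix.

    Assumption: consistency definition of the two-way mixed effects model.
    rows    = questions (subjects)
    columns = repeated responses (measurements)
    """
    n = len(scores)      # number of questions
    k = len(scores[0])   # number of repeated responses per question
    values = [v for row in scores for v in row]
    grand = mean(values)
    row_means = [mean(row) for row in scores]
    col_means = [mean([row[j] for row in scores]) for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Undefined (division by zero) if every question has the same mean score.
    single = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    average = (ms_rows - ms_error) / ms_rows
    return single, average


# Hypothetical mean grades for 4 questions x 3 responses (not study data)
toy_scores = [
    [2.0, 2.0, 2.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
]
icc_single, icc_average = icc_two_way_consistency(toy_scores)
print(icc_single, icc_average)
```

In practice a statistics package would also report the confidence interval, which this sketch omits.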
The quality of the answers was analyzed using the DISCERN score, which allows users to assess the quality of written health information. 16 This tool consists of 16 questions, each scored from 1 to 5. Eight of these questions address reliability, 7 address the quality of treatment information, and 1 is an overall quality rating. Scores greater than 70 have previously been classified as excellent and scores over 50 as good. Subsequently, readability was calculated using the Flesch Reading Ease Score 17 and the Flesch-Kincaid Grade Level 18 to determine the appropriateness of the answers for patients of varying educational levels.
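For orientation, the scoring tools named above can be expressed in a few lines. The sketch below uses the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas and the DISCERN thresholds quoted in the text. It assumes that word, sentence, and syllable counts have already been obtained (syllable counting is itself heuristic) and that the DISCERN total is the simple sum of the 16 item scores. The function names and the label returned for totals of 50 or less are hypothetical. The study's own software and counting rules are not described.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def discern_total(item_scores):
    """Sum the 16 DISCERN items (assumed scoring rule), each rated 1 to 5."""
    if len(item_scores) != 16:
        raise ValueError("DISCERN has 16 questions")
    if any(score < 1 or score > 5 for score in item_scores):
        raise ValueError("each DISCERN item is scored from 1 to 5")
    return sum(item_scores)


def discern_band(total):
    """Apply the thresholds quoted in the text; lower bands are not defined there."""
    if total > 70:
        return "excellent"
    if total > 50:
        return "good"
    return "not classified here"  # hypothetical label for totals of 50 or less


# Hypothetical counts for a single answer (not study data)
print(flesch_reading_ease(words=100, sentences=5, syllables=150))
print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
print(discern_band(60))
```

Longer sentences and more syllables per word lower the Reading Ease score and raise the Grade Level, which is why dense medical vocabulary pushes answers toward a college reading level.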
Results
Of the unique responses, 75% were rated as “appropriate.” Of the 32 questions with 3 individual responses for each, 23 questions were rated as appropriate by all reviewers across all 3 responses.
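A tally of this kind can be sketched as below. The data layout, the function name `summarize_ratings`, and the toy ratings are hypothetical. The sketch also assumes that the share of “appropriate” ratings is counted over every individual reviewer rating, which the text does not specify. The label-to-number mapping follows the conversion stated in the Methods.

```python
from statistics import mean

# Conversion stated in the Methods
GRADE = {
    "appropriate": 2,
    "appropriate but incomplete": 1,
    "inappropriate": 0,
}


def summarize_ratings(ratings):
    """Summarize reviewer labels.

    ratings maps question -> list of responses, where each response is the
    list of labels given by the reviewers (hypothetical layout).
    Returns (share_appropriate, unanimous_questions, mean_scores).
    """
    all_labels = [
        label
        for responses in ratings.values()
        for reviewers in responses
        for label in reviewers
    ]
    # Assumption: the share is taken over individual reviewer ratings.
    share_appropriate = all_labels.count("appropriate") / len(all_labels)

    # A question is unanimous when every reviewer rated every response appropriate.
    unanimous_questions = sum(
        1
        for responses in ratings.values()
        if all(label == "appropriate" for reviewers in responses for label in reviewers)
    )

    # Mean numeric grade per response, as used for the reliability analysis.
    mean_scores = {
        question: [mean([GRADE[label] for label in reviewers]) for reviewers in responses]
        for question, responses in ratings.items()
    }
    return share_appropriate, unanimous_questions, mean_scores


# Toy example with 2 questions, 3 responses each, 3 reviewers (not study data)
toy = {
    "Q1": [["appropriate"] * 3, ["appropriate"] * 3, ["appropriate"] * 3],
    "Q2": [
        ["appropriate", "appropriate but incomplete", "inappropriate"],
        ["appropriate"] * 3,
        ["inappropriate"] * 3,
    ],
}
share, unanimous, scores = summarize_ratings(toy)
print(share, unanimous, scores["Q2"])
```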
The ICC, which measures how reliable ChatGPT is across answers to the same questions, was 0.97 (95% CI [0.46, 0.98]), indicating excellent reliability across answers. However, some answers were found to be factually incorrect (Tables 2 and 3). For example, in question 3 (“What is 4-corner fusion?”), the first answer reported: “In a four-corner fusion procedure, the surgeon fuses the four corners of the wrist bones (scaphoid, lunate, triquetrum, and pisiform) to eliminate motion at the affected joint.” This answer both incorrectly included the pisiform in the 4-corner fusion and omitted points of interest such as scaphoid excision and neurectomy. The other 2 responses provided by ChatGPT regarding 4-corner fusion (Table 2) were similarly inappropriate. In other instances (questions 7, 10, 12, 14), ChatGPT identified the carpal bones involved in 4-corner fusion as “scaphoid, lunate, triquetrum, capitate,” a different but similarly inaccurate response to the question posed.
Table 2. Example of Incorrect Answers: 4-Corner Fusion.
Table 3. Example of Incorrect Answers: Wrist Denervation.
Another example where an inappropriate response was provided for all 3 answers by ChatGPT was question 5 (“What is wrist denervation?”) (Table 3). For all 3 responses to this question, ChatGPT associated wrist denervation with wrist arthroscopy.
Furthermore, several answers were deemed “appropriate” but provided only generic information. For example, there was a recurring theme to answers regarding therapy, with each containing similar iterations of phrases such as “. . . can vary from person to person”; “The extent of therapy required depends on factors such as your overall health, the specific details of your surgery, and how well you respond to the procedure”; and “The therapy may involve exercises to improve range of motion, strength training, and techniques to reduce swelling and improve overall function.” Many answers provided by ChatGPT were also prefaced with “I am not a medical professional. . . .” An example is shown in Table 4, where similar answers were provided for therapy after PRC, 4-corner fusion, and wrist denervation.
Table 4. Example of Appropriate, But Generic Answers.
The Flesch-Kincaid Grade Level was found to be 14.6 ± 1.9, and the Flesch Reading Ease Score was found to be 25.3 ± 6.7, which indicates a college reading level. The DISCERN score was 60, which is considered good.
Discussion
As AI continues to evolve, it is crucial to ensure that it is developed responsibly and for the benefit of all. 19 This study further highlights the immense potential, but also the major pitfalls, of AI. The authors would stress that while ChatGPT and other AI modalities may be useful in some instances, patients should exercise caution using ChatGPT when questions arise regarding their care. This study found that most of the questions (23/32) had all 3 responses rated as appropriate by all reviewers. However, grossly inaccurate data were also noted. Furthermore, the ICC was found to be 0.97, indicating ChatGPT is reliable across answers to the same questions, and the DISCERN score indicated that many answers were of good quality. The Flesch-Kincaid Grade Level and the Flesch Reading Ease Score indicated that much of the information provided required a college reading level. Therefore, with many variables in play, it is difficult to fully support these tools at this time for educational purposes regarding care in hand and upper extremity surgery.
The fact that 75% of answers were rated as appropriate, and that 72% (23/32) of questions provided to ChatGPT prompted 3 individual responses deemed appropriate by all reviewers, is notable. This is consistent with a prior study by Christy et al. 10 in which 76.9% of questions regarding distal radius fractures were deemed “appropriately” answered. However, their study showed a much poorer ICC of 0.12 compared with our ICC of 0.97. Thus, while both studies show that ChatGPT is capable of providing accurate answers, depending on the topic, there is potential for poor reliability and consistency. This again brings up the point that the source of information from ChatGPT is generally unknown and each answer, even to identical questions, may be pulled from different, unverified sources.
It was further noted that the overall information that ChatGPT has for PRC and total wrist arthrodesis and arthroplasty appeared to be more reliable than that for 4CF and denervation. Many of the answers regarding denervation started with the statement, “wrist denervation, also known as wrist arthroscopy with denervation,” although the two are notably not equivalent. In addition, some of the answers regarding 4CF reported “in 4CF, the scaphoid, lunate, triquetrum, and the pisiform,” which is inaccurate. 20 In other instances, ChatGPT identified the carpal bones involved in 4CF as the scaphoid, lunate, triquetrum, and capitate, a different but similarly inaccurate answer. These inaccuracies varied significantly between questions and across answers to the same question, and may well be related to the information source that ChatGPT is drawing from.
Furthermore, the answers on RCPI were often incomplete, though this could be because there is less data on RCPI in general. 21
Postoperative therapy answers were also found to be quite limited, with generic and similar answers between different surgical interventions. While factually correct, they did not necessarily provide the most precise information, as most answers were iterations of similar phrasing: the amount of therapy needed after can vary depending on several factors including the individual’s overall health, the extent of the surgery, any pre-existing conditions, and the recommendations of the surgeon and physical therapist. Generally, physical therapy is an essential component of the rehabilitation process to help regain strength, range of motion, and function of the wrist.
While this information may not be all encompassing, it is seemingly the better alternative to inaccurate information.
While the DISCERN score (60) indicated quality answers, the Flesch-Kincaid Grade Level (14.6 ± 1.9) and the Flesch Reading Ease Score (25.3 ± 6.7) indicated a college reading level. Thus, despite an increasing number of patients turning to the internet prior to consulting with their physician, 22 limited health literacy and educational level may severely impact interpretation of information. Limited health literacy is known to be associated with adverse health outcomes, and the choice to undergo surgery often requires patients to make complex decisions and adhere to complicated instructions. 23 Giudici et al, 24 in a study aimed at designing a tool to measure patient comprehension of the information provided during a surgical consultation, found that there is a tendency to overvalue some information (reasons for the intervention and alternatives to surgery) and that certain information is not understood (risks and complications) or not provided (postoperative follow-up). Thus, while the internet and AI may potentially fill in gaps left by surgeons during discussion, these modalities may similarly overvalue some information or even present information at a level that may not be well understood by every patient. This also highlights the importance of expert society-written pamphlets, informatics, and distribution of patient-centric information for review.
Artificial intelligence and LLMs like ChatGPT serve a role and are being used at increasing rates. While they may one day have great use in health care, and they do currently provide some useful and reliable information regarding common questions, there are many limitations, and inaccurate information is occasionally provided. Although consultation with a physician may similarly have its inherent limitations, any information garnered from AI should be considered in conjunction with a consultation with a trained expert, and a thorough, informed discussion should be had. It is crucial that these modalities do not replace the experience of professionals and that the professionals provide patients the opportunity to discuss their questions. Studies such as this one have limitations. The only LLM utilized in this study was ChatGPT-3.5, and thus, conclusions across all AI may be limited. Although questions were asked multiple times without providing feedback to the AI and multiple reviewers were utilized to rate the answers, inherent biases and opinions could influence the results. Furthermore, the questions were written by physicians and not patients. Inevitably, the results obtained from AI may also vary depending on the complexity of the procedures and questions asked. In addition, while multiple tools were used to increase thoroughness and generalizability, each has intrinsic limitations and potential for human error.
Conclusion
ChatGPT’s reliability and accuracy of information varied across procedures, with 4CF and wrist denervation being particularly limited. This may be due to the unknown and diverse sources ChatGPT draws from. Furthermore, while some answers were factually correct, many provided generic information across differing questions, limiting usefulness. ChatGPT must be used cautiously, and the limitations understood. As AI is increasingly used and more easily accessible, it is crucial that clinicians continue to exercise caution and counsel patients on the limitations of the information provided by the internet and these programs. Furthermore, these findings emphasize the importance of distribution of expert and society-written information that is patient-centric and readily available.
Footnotes
Ethical Approval
No IRB approval was obtained because the review did not involve experimentation on human or animal subjects and the data reviewed are public.
Statement of Human and Animal Rights
All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2008.
Statement of Informed Consent
Informed consent was not required for this study.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
