Abstract
Objective
This study sought to explore the unexamined capabilities of ChatGPT in describing the surgical steps of a specialized operation, the Fisher cleft lip repair.
Design
A chat log within ChatGPT was created to generate the procedural steps of a cleft lip repair utilizing the Fisher technique. A board-certified craniomaxillofacial (CMF) surgeon then wrote the Fisher repair in his own words, blinded to the ChatGPT response. Using both responses, a voluntary, blinded survey questionnaire was distributed to residents of plastic and reconstructive surgery (PRS), general surgery (GS), and internal medicine (IM), and to medical students at our institution.
Setting
Authors collected information from residents (PRS, GS, IM) and medical students at one institution.
Main Outcome Measures
Primary outcome measures included understanding, preference, and author identification of the procedural prompts.
Results
PRS residents detected more inaccuracies in the ChatGPT response and preferred the CMF surgeon's prompt for performing the surgery. Residents with less expertise in the procedure not only failed to identify who wrote each procedure, but also preferred the ChatGPT response for explaining the concept and chose it to perform the surgery.
Conclusions
In applications to surgical education, ChatGPT was found to be effective in generating easy-to-understand procedural steps that can be followed by medical personnel of all specialties. However, it does not have the expert capabilities to provide the minute detail of measurements and specific anatomy required to perform medical procedures.
Introduction
The healthcare industry is consistently at the cutting edge of technological development, utilizing innovations to make advancements in patient care, diagnosis, and research. Artificial intelligence (AI) usage has been increasing in a variety of medical settings. In 2002, an artificial neural network (ANN) was used to recognize the presence of cardiac ischemia in 2204 patients, demonstrating a sensitivity and specificity of 88.1% and 86.2%, respectively, and outperforming two existing standard-of-care aids.1 In 2013, deep learning-based pattern recognition technology performed more accurately than existing methods in left ventricular endocardial tracking on ultrasound data.2 More recently, AI has been utilized in image recognition of chest radiographs, pelvic radiographs, and retinal images, with 98% accuracy in detection of coronavirus pneumonia patients and 91% accuracy in detection and localization of hip fractures, showing promise for the growing field of deep learning's applications in medicine.3–5 The newest iteration of AI's potential usage in a medical setting is ChatGPT.
ChatGPT is an AI-powered chatbot. As defined by IBM, a chatbot is “a computer program that uses artificial intelligence and natural language processing to understand customer questions and automate responses to them, simulating human conversation.”6 ChatGPT was developed by the company OpenAI and released to the public on November 30, 2022. Within 1 week, the software had over 1 million users, and by February 2023 there were over 100 million users, making it the fastest-growing consumer internet application to date.7 ChatGPT represents the result of years of natural language processing development at OpenAI, with its ability to generate natural-sounding long-form text based on the transformer architecture, a complex ANN first described in 2017.8 By utilizing the family of computational methods that make up deep learning, a subset of machine learning, ChatGPT can program itself by learning from sets of text demonstrating desired behavior, in this case, natural speech.
Since its release to the public, much of the literature has concerned its use as a test-taking tool.9 It has been used to take United States Medical Licensing Exam-style tests, demonstrating the ability to score above the 60% accuracy threshold, and has answered questions from the Ophthalmic Knowledge Assessment Program (OKAP), with accuracy ratings of 55.8% and 47.2%.10,11 Another sizable area of interest is its use as a writing tool. Simplification of radiology reports for patient consumption and generation of entire medical abstracts represent medical writing capabilities of ChatGPT. However, this area still requires human review and editing.12–14
Although ChatGPT has been applied to a diverse range of medical scenarios such as imaging interpretation and creation of potential differential diagnoses, there remain numerous unexplored opportunities for its utilization within the healthcare industry. AI and computer-aided design and manufacturing have already advanced the field of medicine in terms of virtual surgical planning. Specifically, craniofacial surgical planning has been improved with the fabrication of cutting guides, custom implants, and stereolithographic models.15 Additionally, further advances in computer-assisted surgery (CAS) have made the intraoperative use of virtual reality, augmented reality, and mixed reality possible, enhancing surgical accuracy and efficiency in head and neck procedures such as tumor resection.16 However, the utilization of ChatGPT in the field of surgical education has not yet been examined. This study sought to explore the capabilities of ChatGPT in describing the surgical steps of a specialized operation, the Fisher cleft lip repair, and to compare that description with the same procedure described by a craniomaxillofacial plastic surgeon. Doing so allowed testing of the expertise and detail of ChatGPT, its ability to coherently write surgical steps, and its applications in teaching introductory surgical concepts.
Methods and Materials
A chat log within ChatGPT was created to generate the procedural steps of a Fisher cleft lip repair. This was accomplished by entering a series of commands to produce a step-by-step surgical protocol with attention to pertinent anatomy, measurements, and surgical landmarks. Because the AI refines its responses over the course of a conversation, the initial commands given in the chat log were simple, one-dimensional requests. This allowed subsequent, refined questions to be answered by the AI in the utmost detail, to replicate a surgeon's response as closely as possible. The last step in creating the chat log was developing the final question, which is shown in Figure 1.
Once the final question and response were analyzed, the same question was given to a board-certified CMF plastic surgeon at the authors’ institution. The surgeon was blinded to ChatGPT's response and asked to produce the procedural steps of the Fisher cleft lip repair. Once the surgeon had created a response, the formatting of the response was edited to match the format of ChatGPT's response. However, no content of the surgeon's response was altered in any way. Both protocols are listed in Figure 2.

Figure 1. Final question entered into ChatGPT.

Figure 2. Responses from ChatGPT and craniofacial plastic surgeon.
After the responses were formulated, a voluntary electronic Google survey questionnaire (Version 0.8, Google, Mountain View, California, USA) was distributed to four groups in February 2023. These groups included residents from Plastic and Reconstructive Surgery (PRS), General Surgery (GS), and Internal Medicine (IM), and a cohort of medical students at the authors’ institution. The groups were blinded to who wrote each protocol, and the survey consisted of 10 questions about the protocols. The sample sizes were n = 9 for PRS residents, n = 10 for GS residents, n = 10 for IM residents, and n = 16 for medical students. All survey questions utilized a Likert scale format of 1 to 5 (1 = lowest confidence in response, 5 = highest confidence in response). The full set of survey questions can be found in Supplemental Figure 1. Because this study involved no patient-identifying information and constituted non-human subject research, institutional review board (IRB) approval was not pursued. Likert scale numerical data were averaged separately within groups for each survey question.
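The within-group averaging described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function name `mean_by_group_and_question` and the response values are hypothetical, chosen only to show how scores on a 1 to 5 Likert scale might be averaged separately for each group and each question.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical responses: (group, question number, Likert score 1-5).
# These values are illustrative only and are NOT the study's data.
responses = [
    ("PRS", 1, 3),
    ("PRS", 1, 4),
    ("GS", 1, 2),
    ("GS", 1, 3),
    ("PRS", 2, 5),
    ("GS", 2, 4),
]


def mean_by_group_and_question(rows):
    """Average Likert scores separately within each group for each question."""
    buckets = defaultdict(list)
    for group, question, score in rows:
        # Reject values outside the 1-5 scale used in the survey.
        if not 1 <= score <= 5:
            raise ValueError(f"Likert score out of range: {score}")
        buckets[(group, question)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


group_means = mean_by_group_and_question(responses)
for (group, question), value in sorted(group_means.items()):
    print(f"{group} Q{question}: mean = {value:.2f}")
```

Keying the averages on the (group, question) pair keeps each specialty's ratings separate, which is what allows the between-group comparisons reported in the Results.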
Results
PRS residents detected more inaccuracies in the ChatGPT response (Mean = 2.66) and preferred the CMF surgeon's prompt for performing the surgery (Mean = 3.22) (Figure 3). Residents with less expertise in the procedure not only had more difficulty identifying who wrote each procedure, but also preferred the ChatGPT response for explaining the concept and chose it to perform the surgery.

Figure 3. Perceived understanding of ChatGPT and craniomaxillofacial surgeon surgical prompts. For all questions other than #7 and #8, Likert scale responses (1 = difficult, 5 = easy) were used. Please see appendix for full questions.
As respondents' familiarity with plastic surgery decreased, the surgeon's explanation became more difficult to understand. PRS residents rated the CMF surgical steps with a comprehension score of 3.3, GS residents 2.6, IM residents 2.3, and medical students 2.1 (Figure 4). In contrast, the ChatGPT response received a comprehension rating of 2.7 from PRS residents, while all other groups rated it higher than 3. Among non-plastic surgery respondents, there was wide variability in identifying which prompt was written by the surgeon and which by ChatGPT.

Figure 4. Comprehension of procedures. CMF, craniomaxillofacial surgeon.
Discussion
Initially, the authors hypothesized that ChatGPT would be able to provide an accurate surgical procedure that could potentially mimic that of a board-certified CMF surgeon, as it has access to virtually unlimited online data. The purpose was to assess whether an individual without prior expertise in cleft lip repairs could ask a general question about the procedure and receive enough detailed and accurate information to learn to perform it adequately. When interacting with the interface, it was soon discovered that the chatbot required significant prompting and direction to provide specific steps in the Fisher repair, including anatomical points and markings such as Noordhoff's point, or equations like c = a - b - 1. The ChatGPT prompt was found to be imprecise and was less preferred by PRS residents. In contrast, respondents from medical specialties with less background knowledge of the procedure preferred the ChatGPT response and chose it to perform the surgery in more instances. This effect was further reflected in a sequential decrease in preference for the craniofacial surgeon's response as respondents' medical knowledge moved farther away from surgery. This indicates that while ChatGPT may not be capable of providing expert-level knowledge of a surgical procedure, it can give a well-organized, easy-to-understand response that can be followed by medical personnel of all specialties, regardless of surgical expertise.
In evaluating AI's capabilities, the Turing Test (TT) has historically been a common benchmark for determining whether a machine can exhibit intelligence and behaviors indistinguishable from those of a real human in natural conversation.17 Previous AI chatbots demonstrated abilities that pass the Turing Test, but none have been as responsive, versatile, and accessible to the public as ChatGPT.18 The software version of ChatGPT utilized in this study was not connected to the Internet past 2021, and significant limitations were placed on its performance.19 However, software updates are currently being released, and new versions have demonstrated eight times the processing power and access to a real-time internet connection, in contrast to the version of ChatGPT used here.20 It is inevitable that the technology will soon have expert capabilities to describe a surgical procedure in the detail necessary to perform it. Future studies should continue to analyze AI technology in the field of surgical education and, specifically, applications in enhancing patient care such as patient teaching and informed consent processes, generating innovative ideas, devices, and research studies, and augmenting medical learning.
Limitations
This study has several limitations. First, ChatGPT had to be prompted multiple times with preliminary questions about the Fisher repair before it provided sufficiently detailed procedural steps. The final response used in the experiment was not the initial response generated by ChatGPT to a basic question. In addition, ChatGPT utilizes information available on the Internet prior to November 2021.20 Because of this, more recent research on the Fisher repair was not utilized in generating its response. While the authors of this study do not believe that this affected the generation of the response, this inability to utilize current research could pose issues for the AI's future usage in medical settings, as medicine is a field that constantly requires current investigation. Differences in survey data between groups were not statistically significant, which may be attributed to the small sample size and the limitations of Likert scale categorical scores. Other limitations include potential confounding from greater access to Fisher repair opportunities among PRS residents compared with GS residents. As responding to the online survey was voluntary, there was also the potential for selection bias.
Conclusion
In applications to surgical education, ChatGPT was found to be effective in generating easy-to-understand procedural steps that can be followed by medical personnel of all specialties, regardless of surgical expertise. However, it does not have the expert capabilities to provide the minute detail of measurements and specific anatomy required to perform medical procedures. As the technology advances, further studies should examine methods of implementing artificial intelligence, such as ChatGPT, in medical education and practice.
Supplemental Material
sj-docx-1-cpc-10.1177_10556656231193966 - Supplemental material for Dr. ChatGPT: Utilizing Artificial Intelligence in Surgical Education
Supplemental material, sj-docx-1-cpc-10.1177_10556656231193966 for Dr. ChatGPT: Utilizing Artificial Intelligence in Surgical Education by Michael S. Lebhar, Alexander Velazquez, Shelby Goza and Ian C. Hoppe in The Cleft Palate Craniofacial Journal
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
