Comparative Evaluation of ChatGPT-4o and Grok-3 on Cleft Lip and Palate and Presurgical Infant Orthopedics

Abstract

Dear Editor,

This is a response to the published article “Comparative Evaluation of ChatGPT-4o and Grok-3 on Cleft Lip and Palate and Presurgical Infant Orthopedics: A Multidisciplinary Assessment by Orthodontists, Pediatricians, and Plastic Surgeons.¹” The study found that both large language models could provide health information on cleft lip and palate and presurgical infant orthopedics (PSIO) at comparable quality levels. There were no statistically significant differences between ChatGPT-4o and Grok-3 on DISCERN and Global Quality Scale ratings, suggesting that both models are broadly applicable and clinically useful. Although Ekizer et al clearly reported the absence of statistical differences, it is worth noting that their results also reveal considerable discrepancies among the opinions of the professional groups. Pediatricians tended to give higher ratings for reliability, clarity, and therapeutic relevance than orthodontic and plastic surgery specialists, which suggests that subjective perception and familiarity with specific content may strongly influence how answers are judged, an aspect that the authors did not emphasize. Furthermore, patient-directed questions obtained higher total scores than questions aimed at medical experts. Upon reviewing the published data, it is also apparent that Grok-3 gave somewhat more relevant and effective PSIO responses, while ChatGPT-4o provided more thorough and structured responses. This highlights each model's distinct strengths and limitations, a nuance that could guide users in selecting a model depending on the clinical context, such as delivering direct information to patients versus supporting professional decision-making. These observations underscore the importance of multidisciplinary assessment and of tailoring questions to the intended audience, points that extend beyond the authors’ original emphasis.
Further research should explore the ability of large language models to address personalized treatment needs and to update outputs according to evolving clinical guidelines, and should assess their influence on both patient learning and physician decision-making in real-world practice. Such work would help better define the potential and boundaries of AI in supporting patient care.

Footnotes

Authors’ Note

AI Declaration: The authors used a computational tool for language editing/checking in preparation of the article.

ORCID iD

Hinpetch Daungsupawong

Authors’ Contribution

Hinpetch Daungsupawong contributed to ideas, writing, analysis, and approval. Viroj Wiwanitkit contributed to ideas, supervision, and approval.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Ekizer, Kurt Demirsoy, Büyük, Canpolat, Bilirer. Comparative evaluation of ChatGPT-4o and Grok-3 on cleft lip and palate and presurgical infant orthopedics: a multidisciplinary assessment by orthodontists, pediatricians, and plastic surgeons. Cleft Palate Craniofac J. 2025:10556656251378591. Online ahead of print. doi:10.1177/10556656251378591