Letter to the Editor Regarding “The Comparative Performance of Large Language Models on the Hand Surgery Self-Assessment Examination”

Dear Editor,

We would like to comment on the article “The Comparative Performance of Large Language Models on the Hand Surgery Self-Assessment Examination.”¹ This study of artificial intelligence (AI) systems, namely ChatGPT 4.0 and Bing AI, on the American Society for Surgery of the Hand Self-Assessment Exams highlights critical challenges in medical education, notably the use of AI to assess specialty knowledge. The assessment process itself is a key concern. Although 999 questions drawn from a 5-year period provide a useful dataset, the method of selecting these questions is unclear. Were the questions chosen at random, or according to their difficulty or relevance? In addition, the inclusion of images and video links complicates interpretation. The study did not specify how these media were incorporated into the answering process, raising the question of whether the AI systems were tested on their analytical abilities or merely on their ability to interpret the presented media.

The published results show that both models performed substantially above the passing threshold; however, a more granular analysis could have yielded key insights into the specific themes or question types on which the AI systems performed poorly. Reporting aggregate performance without stratifying by question category limits the applicability of the findings to real-world settings, where competency may vary across subspecialties. This raises the question of how AI might support individualized education for medical professionals and help them improve in areas where they are currently weak. Future research would benefit from more in-depth performance analysis, allowing educators to design better-targeted training programs.

Moreover, although Bing AI outperformed ChatGPT 4.0 on average, it is important to understand the specific factors that contribute to this difference. For example, does Bing AI’s advantage stem from superior training data or from algorithmic efficiency? Further investigation into the mechanisms underlying these discrepancies may inform future model design and deployment. Understanding these factors could also help AI engineers adapt tools for medical applications.

In the future, incorporating real-time feedback mechanisms may improve the interactive capabilities of these AI systems. Allowing AI to learn from user interactions and continuously refine its responses could significantly increase its educational value. Future research could also assess AI’s long-term performance, tracking its improvement over time and its ability to incorporate new medical knowledge as it becomes available. Finally, collaboration between AI developers and medical institutions should encourage interdisciplinary approaches, ensuring that AI tools are better aligned with the educational needs of medical practitioners.

Footnotes

Author contributions

A.K.: 50% ideas, writing, analyzing, approval

V.W.: 50% ideas, supervision, approval

Ethical Approval

This study was approved by our institutional review board.

Statement of Human and Animal Rights

There is no involvement of human or animal subjects.

Statement of Informed Consent

Informed consent was obtained from all individual participants included in the study.

Data Availability Statement

No new data were generated.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Amnuay Kleebayoon

References

Chen, Sobol, Hickey, et al. The comparative performance of large language models on the hand surgery self-assessment examination. Hand. 2026;21(1):63–67. doi:10.1177/15589447241279460