The Great Debates: De-Escalation of Standardized Testing in Medical Education (USMLE Pass/Fail, De-Emphasizing Shelf Exams and ABSITE Scores) Has Been a Positive Evolution in Medical Education

Abstract

The National Board of Medical Examiners and the American Board of Surgery created a series of examinations, including Subject Exams, United States Medical Licensing Examination Steps 1-3, the ABS In-Training Exam, and the ABS Qualifying and Certifying Exams, originally designed as staged assessments of clinical competence. These exams have evolved beyond licensure requirements into high-stakes screening tools for residency and even surgical fellowship selection. Recent efforts to de-emphasize standardized testing have sparked substantial debate within the medical community. We examine whether standardized tests reflect a true measure of trainee competence and how they affect trainees of low socioeconomic status. We also explore the downstream impact of de-emphasizing standardized testing, which shifts selection from objective test scores toward more subjective factors such as research experience, letters of recommendation, and medical school reputation.

Keywords

general surgery, resident education, surgical education

Key Takeaways

• Exam performance does NOT necessarily translate to clinical performance and may lead to stress and emotional exhaustion

• With fewer objective data points in the selection process for residency and fellowship, the focus has shifted to more subjective factors

• Competition is an intrinsic component of medical culture and professional identity that also drives self-improvement

Introduction

The ability to identify which medical students and residents will become competent physicians and surgeons is a problem that has challenged medical school deans and residency program directors for decades. After 9-10 years of training, it is understandable that educators want to invest in students who will not only provide skilled care but continue to do so throughout a long, successful career. The primary question then becomes what defines a successful physician. Khawar et al¹ attempted to answer this question in a survey study including physicians and patients. The study found that the “excellent physician” possessed three primary characteristics: competence (communication, knowledge, and professionalism), motivation (staying engaged and having intrinsic motivation to learn), and personality (including humility and empathy). To measure competence and knowledge in medicine and general surgery, the National Board of Medical Examiners (NBME) and the American Board of Surgery (ABS) created a series of examinations including Subject Exams, United States Medical Licensing Examination (USMLE) Steps 1-3, the ABS In-Training Exam (ABSITE), and the ABS Qualifying and Certifying Exams (ABS QE/CE).

High-stakes examinations have long served as foundational benchmarks in medical education. Originally designed as staged assessments of clinical competence, these exams have evolved beyond licensure requirements into screening tools for residency and even surgical fellowship selection, an unintended but predictable consequence of competition for desirable postgraduate training positions. However, there is concern that emphasis on standardized exams may be detrimental to the development of a well-rounded physician and surgeon. Advocates for limiting standardized scoring argue that numeric scores lead to stress and burnout, thereby increasing physician attrition rates, and that they disadvantage students of lower socioeconomic status (SES). There have been multiple recent efforts to de-emphasize standardized testing. This movement began in January 2022, when the NBME transitioned the USMLE Step 1 exam from numeric scoring to pass/fail reporting. In January 2025, the ABS followed suit, replacing ABSITE percentile reports with standardized percent correct scores to refocus the exam on its intended purpose of assessing board readiness rather than peer comparison. These decisions have sparked substantial debate within the medical community regarding downstream effects on trainee well-being and competence, medical school curricula, and residency selection processes. The surgical community is just beginning to see how the ABSITE decision affects ABS QE/CE pass rates and the fellowship application process. The purpose of this debate is to evaluate the current literature and discuss the cases for and against medical education entities de-escalating standardized testing.

The Case for De-Escalating Standardized Testing in Medical Education

There is a paucity of evidence that links higher scores on standardized testing with better clinical performance. As one would expect, the data do show that achieving high scores on standardized tests reflects good test-taking skills. Studies consistently demonstrate that students and residents who do well on standardized testing early in their education will continue to do so later in their training.²,³ These studies show a correlation between USMLE Step scores, ABSITE scores, and eventually ABS QE/CE success. The question is whether test-taking skills influence whether a trainee becomes a good surgeon. In a systematic review of 19 studies of general surgery residents by Lombardi et al,⁴ 13 found that USMLE Step 1 or 2 scores correlated only with ABSITE scores and passing the ABS QE/CE. Only one of those studies showed that Step 2 scores were associated with a higher match rate after a preliminary surgery year,⁵ likely due to the belief that higher scores may contribute to a better performing resident. However, two of those studies did not show an association between Step 2 scores and Accreditation Council for Graduate Medical Education (ACGME) milestone scores as residents.⁶,⁷ ACGME milestone scores are determined by each program’s leadership to assess their residents on a continuum across a multitude of components including competence, motivation, and personality, metrics felt to be important in a practicing surgeon and not easily tested with standardized tests.

Interestingly, in one study, award winners such as Resident of the Year, often chosen by faculty to represent an ideal surgical resident, had lower median Step 1 and 2 scores.⁸ This may reflect trainee strengths beyond test-taking ability. Lund et al⁹ evaluated predictors of resident success at the time of the selection process, finding the only predictor of better residency performance was completion of several surgical-related tasks (including knot tying and suturing skills) during the interview; USMLE Step 1 and 2 scores had no correlation. Finally, McGaghie et al¹⁰ found no association between USMLE Step 1 and 2 scores and clinical skills, including central venous catheter insertion, advanced cardiac life support scenarios, and communication tasks. Cumulatively, these studies demonstrate that exam performance does not always translate to trainees doing well clinically.

Standardized testing also leads to increased stress and emotional exhaustion in medical students. In one study, 79% of students reported that studying for Step 1 contributed to medical student burnout, while 61% identified the preparation period as a cause of depression.¹¹ Similar results were found in studies of medical schools that utilized tiered grading systems. Reed et al¹² evaluated mental health in medical students at 12 campuses, reporting that students who had a pass/fail grading system fared significantly better than students with a tiered grading system. They had decreased perceived stress, decreased burnout, decreased emotional exhaustion, improved mental quality of life, and decreased thoughts of dropping out in the past year. Preventing burnout early in a medical student’s education can have a long-lasting impact on their career, which translates to decreased burnout as a surgeon and, ultimately, fewer medical errors and improved patient safety.¹³

Finally, emphasis on standardized tests creates an uneven playing field for students of low SES. Although the USMLE Step 1 and 2 board exams were created simply as a pathway to licensure, residency programs often use them as a screening and selection factor, prompting students to purchase third-party resources such as question banks to optimize their performance. Studies show that students who utilize these resources do better on the exam.¹⁴ In addition to economic constraints, students of low SES face challenges such as unfamiliarity with the academic system; both have historically contributed to poorer performance on standardized exams.¹⁵

The Case Against De-Escalating Standardized Testing in Medical Education

Although correlations between Step 1 scores and markers of surgical resident success are mixed, multiple studies demonstrate positive associations with objective outcomes, including ABSITE performance, ABS QE/CE success, and decreased remediation and attrition.¹⁶ With Step 1 now reported as pass/fail, residency selection has shifted toward other objective measures, most notably the still-reported Step 2 Clinical Knowledge (CK) numeric score. Like Step 1, Step 2 CK correlates with resident success, and it shows even stronger associations with ABSITE scores and passing the ABS QE/CE.¹⁷ As a result, Step 2 CK has assumed disproportionate importance as the only standardized objective metric. This pressure is amplified by the timing of the examination at the end of the third year of medical school, when students face limited dedicated study time alongside competing demands related to sub-internships, visiting rotations, and residency applications. The shift in academic pressure from Step 1 to Step 2 CK has created an “all or nothing” dynamic, replacing what was previously a more balanced distribution of objective assessment across two examinations and further exacerbating medical student anxiety and burnout.

Compounding these challenges is the growing reliance on other subjective metrics, including medical school reputation, letters of recommendation, clinical performance evaluations, and research publications, each of which introduces additional inequities. Medical school reputation and letters of recommendation are inherently biased, as many students lack meaningful choice in where they train.¹⁸ More than 10% of medical schools now exclusively utilize pass/fail grading systems for third-year clerkships and over 20% do so for fourth-year sub-internships. Even among schools that retain honors distinctions, grading practices vary widely, with students at top-ranked schools significantly more likely to receive the highest grades than peers at lower-ranked institutions.¹⁹ Since the transition to Step 1 pass/fail, the mean number of research experiences and publications among applicants has increased two-fold and four-fold, respectively.²⁰ This shift disadvantages students who cannot afford an additional research year and incentivizes quantity over quality, contributing to the proliferation of low-caliber studies with poor methodology in journals that lack adequate peer review.²¹ Following the transition to Step 1 pass/fail, students from highly ranked schools have become even more likely to match into competitive specialties (eg, neurological surgery, plastic surgery, and orthopedic surgery), while applicants from lower-tier programs increasingly hedge their residency applications by applying to less competitive “backup” specialties. Notably, one in six students now pursues this parallel strategy, reflecting heightened uncertainty about their ability to match in an era of selection process subjectivity.²²

Similarly, eliminating ABSITE percentiles removes one of the few remaining objective tools for monitoring resident progression relative to peers. Although ABSITE scores do not correlate with clinical acumen, they remain valuable for identifying foundational knowledge gaps, objectively identifying struggling residents, and guiding remediation. When reported as percentiles, ABSITE scores reflected relative performance among peers at the same level of training and were influenced by cohort strength and test difficulty rather than absolute knowledge acquisition. Without this comparison, the exam's usefulness in fellowship applications diminishes, and it becomes harder for trainees and fellowship directors to know how a resident is performing relative to peers. This will likely mirror the downstream effects of the pass/fail Step 1, shifting fellowship selection toward less objective and more biased criteria such as program reputation, personal networks, and subjective clinical performance evaluations. For residents pursuing highly competitive fellowships, such as pediatric surgery or surgical oncology, ABSITE performance assumes even greater importance as an objective means of comparison between applicants.²³ Standardized ABSITE scores can help level the playing field for residents from diverse training backgrounds, including US MDs, DOs, and international MDs.

Lastly, competition is an intrinsic component of medical culture and professional identity. Knowing that a numeric score or grade is at stake drives human behavior. Three years after the NBME pass/fail decision, Step 1 pass rates had declined across all examinees.²⁴ Surgeons have embraced this culture because the stakes of error are extremely high: human lives matter. Surgeons routinely benchmark themselves against both personal standards and peer performance (eg, operative times, costs, volume, and outcomes) to drive improvement. Evidence demonstrates that increased ABSITE competition within residency programs is associated with improved scores, reflecting heightened engagement with surgical knowledge,²⁵ which is ultimately what is best for our patients. Transitioning standardized testing to a pass/fail format risks producing underprepared students and ultimately surgeons while eroding the competitive esprit de corps that has long defined surgical excellence.

Conclusion

The ability to identify an applicant who will make an excellent physician and surgeon remains elusive, leaving program directors to rely on objective data such as standardized test scores. While passing these exams is a critical competency milestone in surgical training, there are few studies that correlate test scores with clinical performance. Emphasizing test scores can draw medical students’ and residents’ focus away from developing other areas of equal importance, including emotional intelligence and communication skills. However, early studies show that de-emphasizing Step 1 and ABSITE scores has only pushed the issue downstream. With fewer objective data points in the selection process for residency and fellowship, the focus has shifted to other factors such as medical school reputation, letters of recommendation, and research involvement. Similarly, emphasis on standardized testing has been shown to be disadvantageous for students of low socioeconomic status, but it is not clear whether eliminating numeric Step 1 scores would disproportionately disadvantage these students from less prestigious medical schools by removing a critical avenue for academic distinction. One thing is certain: the selection process remains flawed, and additional studies are needed to develop an equitable, holistic, and diverse selection process.

Footnotes

ORCID iDs

Chrysanthy Ha

Peter D. Nguyen

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Khawar, Frederiks, Nasori, et al. What are the characteristics of excellent physicians and residents in the clinical workplace? A systematic review. BMJ Open. 2022;12(9):e065333.

2. Shellito, Osland, Helmer, Chang. American board of surgery examinations: can we identify surgery residency applicants and residents who will pass the examinations on the first attempt? Am J Surg. 2010;199(2):216-222. doi:10.1016/j.amjsurg.2009.03.006.

3. Aljamal, Pakonen, Martin, Heller, McKenzie, Farley. Factors that predict an intern's first ABSITE score are known by September. J Surg Educ. 2018;75(6):e72-e77. doi:10.1016/j.jsurg.2018.08.025.

4. Lombardi, Chidiac, Record, Laukka. USMLE step 1 and step 2 CK as indicators of resident performance. BMC Med Educ. 2023;23(1):543.

5. Al Fayyadh, Heller, Rajab, et al. Predicting success of preliminary surgical residents: a multi-institutional study. J Surg Educ. 2016;73(6):e77-e83. doi:10.1016/j.jsurg.2016.05.018.

6. Gardner, Dunkin. Evaluation of validity evidence for personality, emotional intelligence, and situational judgment tests to identify successful residents. JAMA Surg. 2018;153(5):409-416. doi:10.1001/jamasurg.2017.5013.

7. Alterman, Jones, Heidel, Daley, Goldman. The predictive value of general surgery application data for future resident performance. J Surg Educ. 2011;68(6):513-518. doi:10.1016/j.jsurg.2011.07.007.

8. Mainthia, Tarpley, Davidson, Tarpley. Achievement in surgical residency: are objective measures of performance associated with awards received in final years of training? J Surg Educ. 2014;71(2):176-181. doi:10.1016/j.jsurg.2013.07.012. Epub 2013 Sep 14. PMID: 24602705.

9. Lund, D'Angelo, Baloul, Yeh, Stulak, Rivera. Simulation as soothsayer: simulated surgical skills MMIs during residency interviews are associated with first year residency performance. J Surg Educ. 2022;79(6):e235-e241.

10. McGaghie, Cohen, Wayne. Are United States medical licensing Exam step 1 and 2 scores valid measures for postgraduate medical residency selection decisions? Acad Med. 2011;86(1):48-52.

11. Cortes-Penfield, Khazanchi, Talmon. Educational and personal opportunity costs of medical student preparation for the United States medical licensing examination step 1 exam: a single-center study. Cureus. 2020;12(10):e10938. doi:10.7759/cureus.10938. PMID: 33194500; PMCID: PMC7660126.

12. Reed, Shanafelt, Satele, et al. Relationship of pass/fail grading and curriculum structure with well-being among preclinical medical students: a multi-institutional study. Acad Med. 2011;86(11):1367-1373.

13. Al-Ghunaim, Johnson, Biyani, Alshahrani, Dunning, O'Connor. Surgeon burnout, impact on patient safety and professionalism: a systematic review and meta-analysis. Am J Surg. 2022;224(1 Pt A):228-238.

14. Drake, Phillips, Kovar-Gough. Exploring preparation for the USMLE step 2 exams to inform best practices. PRiMER. 2021;5:26. doi:10.22454/PRiMER.2021.693105.

15. Ghersin, Gulfo, Frohlich, et al. Socioeconomic factors and test preparation strategies are related to success on the USMLE step 2 clinical knowledge (CK) exam: a single-institution study. BMC Med Educ. 2024;24(1):1412. doi:10.1186/s12909-024-06414-x.

16. Bankhead-Kendall, Slama, Truitt. Common attributes of high/low performing general surgery programs as they relate to QE/CE pass rates. Am J Surg. 2016;212(6):1248-1250. doi:10.1016/j.amjsurg.2016.08.024.

17. Willis, Dent, Love, et al. Predicting and enhancing American board of surgery In-Training examination performance: does writing questions really help? Am J Surg. 2016;211(2):361-368. doi:10.1016/j.amjsurg.2015.08.033.

18. Perez, Williams, Henderson, et al. Association of applicant demographic factors with medical school acceptance. BMC Med Educ. 2023;23(1):960. doi:10.1186/s12909-023-04897-8.

19. Hoy, Shuman, Smith, Kogan, Simcock. Analysis of variability and trends in medical school clerkship grades. Surg Open Sci. 2024;19:80-86. doi:10.1016/j.sopen.2024.03.010.

20. Al-Mufti, Ghaith, Sacknovitz, et al. Publication race: the battle for residency in a competitive landscape. Cardiol Rev. 2025;33(5):394-401. doi:10.1097/CRD.0000000000000978.

21. Elliott, Carmody. Publish or perish: the research arms race in residency selection. J Grad Med Educ. 2023;15(5):524-527. doi:10.4300/JGME-D-23-00262.1.

22. Rusk, Holt, Harvey, Shanks. Impact of parallel planning on residency match rate success. BMC Med Educ. 2025;25(1):405. doi:10.1186/s12909-025-06879-4.

23. Miller, Swain, Widmar, Divino. How important are American board of surgery In-Training examination scores when applying for fellowships? J Surg Educ. 2010;67(3):149-151. doi:10.1016/j.jsurg.2010.02.007.

24. English. Assessing the impact of USMLE step 1 going pass-fail: a brief review of the performance data. Avicenna J Med. 2024;14(4):228-230. doi:10.1055/s-0044-1800830.

25. Spurzem, Reeves, Berumen, Jacobsen, Berndtson. A team-based American board of surgery in-training examination (ABSITE) competition improves exam performance. J Surg Educ. 2024;81(11):1691-1698. doi:10.1016/j.jsurg.2024.08.021.