Raising the bar: Can an Online Module With a Validated Tool Improve Clerkship Evaluations?

Abstract

Introduction

Narrative evaluations are crucial components of medical education, offering valuable information about students’ performance. However, the quality of narratives is often variable. Faculty development interventions to improve narrative evaluations can be time-intensive, and clerkship directors cannot mandate faculty participation. To address this challenge, this study implemented a time-efficient faculty development intervention that used the validated Narrative Evaluation Quality Instrument (NEQI) as a feedback tool to improve the quality of narrative evaluations.

Methods

Eighty-nine faculty from the Ambulatory Clerkship at UT Southwestern and the Pediatrics Clerkship at McGovern Medical School were randomized to either a control group, which received no feedback on their narrative evaluations, or an intervention group, which received an educational module on enhancing narrative evaluations using the NEQI, along with individualized NEQI data via email. Pre- and post-intervention NEQI scores were compared between the two groups.

Results

No significant differences in pre- and post-intervention NEQI scores were found between the two groups, and subgroup analysis of faculty who completed the educational module also showed no statistically significant change. The module nonetheless demonstrated feasibility across the two clerkships, and results showed a non-significant trend toward greater specificity in narrative comments.

Discussion

The lack of measurable change in NEQI scores may be attributed to the timing of the educational module or limited opportunities for direct student observation. Future research should explore the use of artificial intelligence-assisted scoring and standardized templates to prompt more detailed and actionable narrative evaluations. Clerkship directors need support for faculty education and development of novel interventions to improve the quality of narrative evaluations. Our study demonstrated that simply emailing feedback and an invitation for an asynchronous learning module did not result in measurable improvement in narrative quality.

Keywords

clerkship evaluations, narrative evaluations, faculty development

Introduction

As medical education has shifted towards a competency-based approach, narrative evaluations have become increasingly essential. This qualitative assessment of learner performance provides valuable insights that numerical scores cannot convey. In clinical settings, the information these evaluations contain offers important feedback that helps trainees put their performance in context and identify areas for improvement.^1-3 Additionally, narrative evaluations serve to assess the learners, determine readiness for promotion, and contribute significantly to the Medical Student Performance Evaluation (MSPE), a key document considered by residency program directors when selecting candidates to interview.^4-8 As such, narrative comments have the potential to serve as an essential handoff between undergraduate medical education (UME) and graduate medical education (GME).

Despite their importance, narrative evaluations often fall short of their potential. Many are vague, overly generic, and describe personal attributes rather than observed behaviors or competencies.^6,9-11 This in turn affects the MSPE, which consequently can be a source of implicit bias and variable interpretation.^6,8 Students prefer narrative feedback that is specific, tailored to their individual performance, actionable, and grounded in actual patient encounters.^12,13 Narrative evaluations often lack this level of specificity and fail to include concrete examples or clear suggestions for improvement.^6,14 These limitations not only reduce the usefulness of feedback for learners but also obscure areas for improvement at the UME to GME handoff.

Clerkship directors (CDs) face additional challenges in ensuring consistent, high-quality narrative evaluations. Narrative comments are typically written by a wide range of faculty across diverse clinical sites, leading to inconsistencies. Faculty also face competing priorities in patient-care settings, and in addition to finding the time to write meaningful comments, they perceive they have insufficient contact with learners to generate thoughtful narratives.^15,16 Faculty development efforts have emerged as a key strategy to improve the quality, specificity, and usefulness of narrative evaluations.^8,17,18 However, while faculty development has shown promise, it is often labor-intensive, and participation is usually voluntary.^17-19 As a result, CDs may lack the resources or authority to mandate training or enforce standards across all evaluators.

A valuable resource that supports evaluation quality and faculty development is the narrative evaluation quality instrument (NEQI).^17,20 The NEQI is a validated tool that scores evaluations across three domains: the number of performance domains commented on, the specificity of comments, and their usefulness to trainees.^17,20,21 As such, this tool provides a quantitative assessment of qualitative information, offering a standardized approach to evaluate narrative comments. Each domain is scored on a scale of 0–4, for a total possible score of 12; a score of 7 or above is considered moderately useful to the trainee. The NEQI thus provides a means to convert qualitative data into a structured, quantifiable format, facilitating a more consistent evaluation of narrative quality.
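The scoring arithmetic described above can be sketched in a few lines of Python. This is an illustration only, not part of the published instrument: the class name `NEQIScore`, its field names, and the constant `MODERATELY_USEFUL_THRESHOLD` are hypothetical, and only the 0 to 4 range per domain, the maximum total of 12, and the threshold of 7 are taken from the description of the tool.

```python
from dataclasses import dataclass

# From the text: a total of 7 or above is considered moderately useful.
MODERATELY_USEFUL_THRESHOLD = 7


@dataclass(frozen=True)
class NEQIScore:
    """Hypothetical container for the three NEQI domain scores (each 0-4)."""

    performance_domains: int
    specificity: int
    usefulness: int

    def __post_init__(self) -> None:
        # Reject any domain score outside the 0-4 scale.
        for name in ("performance_domains", "specificity", "usefulness"):
            value = getattr(self, name)
            if not 0 <= value <= 4:
                raise ValueError(f"{name} must be between 0 and 4, got {value}")

    @property
    def total(self) -> int:
        """Sum of the three domain scores (maximum 12)."""
        return self.performance_domains + self.specificity + self.usefulness

    @property
    def meets_threshold(self) -> bool:
        """True when the total reaches the threshold of 7."""
        return self.total >= MODERATELY_USEFUL_THRESHOLD


# Made-up example scores, not study data.
example = NEQIScore(performance_domains=2, specificity=1, usefulness=2)
print(example.total, example.meets_threshold)  # 5 False
```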

The primary goal of this pilot study was to determine whether the NEQI could serve as a practical and scalable feedback intervention to help faculty improve the quality of their narrative evaluations. We hypothesized that the NEQI may offer a sustainable solution for improving narrative feedback without requiring extensive additional time or mandatory participation in faculty development workshops. The study aimed to evaluate the utility of the NEQI in supporting CDs as a resource-efficient tool to guide faculty in writing clearer, more specific, and useful evaluations, and to assess the overall quality of narrative comments provided by faculty preceptors.

Methods

This multi-institutional study examined narrative evaluations written by faculty for students on the Ambulatory Clerkship at UT Southwestern (UTSW) and the Pediatrics Clerkship at McGovern Medical School (MMS), UTHealth Houston, using an adapted version of the NEQI. The study took place from January 2024 to March 2025.

The NEQI Adaptation

Investigators adapted only the “Performance Domains Commented On” section to more closely match the clinical competencies on the UTSW and MMS student evaluation forms (Figure 1). Specifically, overall performance, clinical skills, prepares for and participates in patient care activities, fund of knowledge, and written and/or oral skills were removed and replaced with history taking and physical exam skills, oral presentation skills, documentation, patient communication skills, and receiving and applying feedback. We also developed a set of criteria for the specificity and usefulness domains to provide more consistency when scoring. For the specificity domain, a qualifier was defined as an adjective used to describe the student (“student was professional”); evidence was defined as a general description (“student showed professionalism by calling consultants”); and an example was defined as a specific encounter (“student showed professionalism by remaining calm and communicating well with a cardiology fellow during a rapid response”). For the usefulness domain, a 0 was defined as minimal, vague information; a 2 was defined as commenting on at least 2 domains with at least 1 piece of evidence but only encouraging the student to continue the same behavior; and a 4 was defined as giving suggestions for student growth.

Figure 1.

Modified narrative evaluation quality instrument (NEQI)

Establishing Interrater Reliability

Four investigators (JC, VG, RA, BC) scored the evaluations. To improve consistency, they used an automated Excel spreadsheet that encoded the scoring rules described above. The intraclass correlation coefficient (ICC) was calculated in the Statistical Package for the Social Sciences (SPSS) using two rounds of a pilot sample of 10 evaluations from UTSW and MMS. The ICCs for the two rounds were 0.892 (95% CI 0.772-0.949) and 0.840 (95% CI 0.720-0.916), respectively, both demonstrating strong reliability.
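For readers who want to reproduce this kind of reliability check outside SPSS, the following is a minimal sketch of one common ICC form. The text does not state which ICC model was selected, so the choice here (two-way random effects, absolute agreement, single rater, often written ICC(2,1)) is an assumption, as are the function name and the small illustrative ratings matrix, which is not study data.

```python
from typing import Sequence


def icc_two_way_random_single(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per evaluation and one column per rater.
    """
    n = len(ratings)       # number of evaluations
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete n x k matrix with n >= 2 and k >= 2")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings: 6 evaluations scored by 4 raters (not study data).
illustrative = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_two_way_random_single(illustrative), 2))  # 0.29
```

A consistency-type or average-measures ICC would give different values for the same ratings, so the model should be matched to whatever was chosen in the original analysis before comparing results.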

The Intervention

This study was approved as exempt by the UT Southwestern Medical Center Human Research Protection Program (UTSW: Y2-23-0316) and was approved by the Committee for the Protection of Human Subjects at UTHealth Houston (MMS: HSC-MS-23-0688). Clinical faculty with teaching appointments for the Ambulatory Clerkship at UTSW or the Pediatrics Clerkship at MMS who submitted medical student evaluations during the study period were eligible (85 at UTSW and 30 at MMS). Faculty were excluded if they did not participate directly in medical student evaluations (i.e., due to a pre-existing doctor-patient relationship) or had resigned before the pre-intervention evaluations were scored. Both institutions granted waivers of informed consent; accordingly, an email with a Letter of Information was sent to all faculty, who could voluntarily opt out of participation. Investigators (AO, PH) not involved in scoring evaluations randomly assigned participants to a control group or an intervention group. All narrative evaluations were de-identified of student and faculty names before their release to the three investigators who scored them, who were also blinded to participant group. Evaluations were divided into two sets. One set was scored by one investigator, and the other was scored by two investigators who met to resolve any differences and determine a single score.

Any faculty member teaching in the Pediatrics Clerkship at MMS or the Ambulatory Clerkship at UTSW who did not opt out was eligible for the study. Initially, 89 evaluations from unique faculty members (59 at UTSW, 30 at MMS) were available for scoring within the study window, and these faculty were randomized, 45 to the control group and 44 to the intervention group (Figure 2). After the intervention, 66 evaluations from the same faculty members were available for scoring (43 at UTSW, 23 at MMS). The drop in the number of available evaluations was due to some faculty members not teaching, and therefore not submitting evaluations, during that period. Thus, a total of 33 paired evaluations were available in each of the control and intervention groups.
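The allocation step can be illustrated with a short sketch. The randomization procedure itself is not described in the text, so simple shuffling with a fixed seed is an assumption; the function name `randomize_two_arms`, the seed value, and the placeholder faculty identifiers are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def randomize_two_arms(participant_ids: Sequence[str], seed: int) -> Dict[str, List[str]]:
    """Shuffle participants and split them into control and intervention arms.

    With an odd number of participants the control arm receives the extra
    one, which mirrors the 45/44 split reported for 89 faculty.
    """
    rng = random.Random(seed)  # local generator keeps the allocation reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    cut = (len(shuffled) + 1) // 2
    return {"control": shuffled[:cut], "intervention": shuffled[cut:]}


# 89 placeholder identifiers standing in for de-identified faculty.
faculty = [f"faculty_{i:03d}" for i in range(1, 90)]
arms = randomize_two_arms(faculty, seed=2024)
print(len(arms["control"]), len(arms["intervention"]))  # 45 44
```

Stratifying by institution or academic rank before shuffling would be one way to balance the characteristics shown in Table 1, although that was not part of the design described here.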

Figure 2.

Flow diagram of the participant randomization

Table 1 shows the demographics of participants in each group. There were more female participants in the control group. There were equal numbers of assistant professors in both groups, but the control group had more associate professors and fewer professors than the intervention group.

Table 1.

Faculty Demographics

	Control (n=33)	Intervention (n=33)
Female	20	14
Male	13	19
Staff/Community Physicians	2	3
Assistant Professor	17	17
Associate Professor	12	7
Professor	2	6
Pediatrics	13	15
Internal Medicine	18	17
Med/Peds	1	1
PMR	1	0

Investigators used the NEQI to score baseline evaluations pulled from both clerkships at the start of the study. The control group did not receive any feedback regarding their evaluations or the NEQI tool itself. The intervention group received an email with a link to an asynchronous, educational module developed by study investigators which oriented learners to the NEQI and demonstrated how to use it to improve the quality of narrative comments. The module began with a low-scoring narrative evaluation and guided learners through each NEQI domain, demonstrating how to build upon and enhance the narrative step by step. The module could be completed in approximately 20 minutes; it was shared via Canvas Catalog and OneDrive to participating faculty at both institutions. Participants in the intervention group received periodic email reminders to complete the module over the course of one month. Participants in the intervention group also received a personalized email from a non-scoring investigator providing feedback on their baseline evaluation NEQI score. The email contained a breakdown of their baseline NEQI score (their score in each domain and their overall score), a link to access the learning module, and a copy of the NEQI chart for reference. The email did not include written, individualized suggestions or targeted recommendations beyond the general guidance provided in the module. After the deadline to complete the learning module, investigators scored the second set of de-identified evaluations. Two-sided t-tests were conducted using SPSS (v. 29.0) for the paired pre- and post-samples to determine significance.

Results

The educational module was completed by 9 of 15 faculty at MMS and 6 of 29 faculty at UTSW, for a total of 15 of 44 (34%). In both the control (Table 2) and intervention (Table 3) groups, there was no significant difference between the pre- and post-intervention total NEQI scores from the paired samples (n=33 per group). Importantly, the post-intervention mean totals were 6.00 in the control group and 5.39 in the intervention group, both below the threshold of 7 that the developers of the NEQI suggested as the minimal quality threshold for students and grading committees. There were also no significant pre- to post-intervention differences in any individual domain.

Table 2.

Paired Changes in NEQI Means: Control Group (n=33)

Domain	Baseline mean (std deviation)	Post mean (std deviation)	Mean difference (std deviation)	P Value
Performance	1.88 (0.96)	2.06 (0.97)	0.18 (0.88)	0.25
Specificity	2.03 (1.26)	1.70 (1.23)	-0.33 (1.59)	0.24
Usefulness	2.18 (1.16)	2.24 (1.30)	0.06 (1.54)	0.82
Total	6.09 (2.69)	6.00 (2.84)	-0.09 (3.07)	0.87

Table 3.

Paired Changes in NEQI Means: Intervention Group (n=33)

Domain	Baseline mean (std deviation)	Post mean (std deviation)	Mean difference (std deviation)	P Value
Performance	1.97 (0.81)	1.97 (0.77)	0.00 (0.70)	1.00
Specificity	1.36 (1.03)	1.67 (1.05)	0.30 (1.31)	0.19
Usefulness	2.06 (0.93)	1.76 (1.20)	-0.30 (1.33)	0.20
Total	5.39 (2.00)	5.39 (2.40)	0.00 (2.50)	1.00
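The p values in Tables 2 and 3 can be approximately reproduced from the reported mean differences and standard deviations of the differences. The sketch below is not the SPSS procedure used in the study; it is a standard-library approximation in which the function names are hypothetical and the t distribution tail is obtained by numerical integration (Simpson's rule), so small rounding differences from the published values are expected.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat: float, df: int, steps: int = 2000) -> float:
    """Two-sided p value: 1 minus twice the area of the density on [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # `steps` must be even for Simpson's rule
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return max(0.0, 1.0 - 2.0 * area)


def paired_t_from_summary(mean_diff: float, sd_diff: float, n: int):
    """Return (t statistic, two-sided p) for a paired t-test from summary data."""
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, two_sided_p(t_stat, n - 1)


# Control group, specificity domain (Table 2): mean difference -0.33, SD 1.59, n=33.
t_stat, p_value = paired_t_from_summary(-0.33, 1.59, 33)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")
```

For the control group specificity row this gives a t statistic of about -1.19 and a p value of about 0.24, in line with Table 2.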

Since one potential cause of the lack of improvement in the intervention group was the low participation rate, we performed a subgroup analysis limited to the 15 faculty who completed the educational module (Table 4). Of these, 12 faculty had both pre- and post-intervention narrative evaluations available for paired scoring. There was no statistically significant difference in the NEQI scores pre- and post-intervention. Although there was a trend toward improvement in the specificity domain, it did not reach statistical significance.

Table 4.

Paired Changes in NEQI Means: Subgroup (n=12)

Domain	Baseline mean (std deviation)	Post mean (std deviation)	Mean difference (std deviation)	P Value
Performance	2.25 (0.87)	2.25 (0.87)	0.00 (0.85)	1.00
Specificity	1.17 (0.94)	1.83 (1.27)	0.67 (1.44)	0.14
Usefulness	2.00 (1.21)	2.00 (0.85)	0.00 (1.21)	1.00
Total	5.42 (2.11)	6.08 (2.87)	0.67 (2.99)	0.46

Discussion

This study has several strengths. As a multi-institutional and multi-clerkship study, we had a diverse sampling of faculty across multiple specialties, practice settings (inpatient and outpatient), and academic ranks (experience). Investigators who assigned NEQI scores were blinded to faculty identity and group assignment, thus minimizing the risk of evaluator bias. We demonstrate how a validated tool from the existing literature can be adapted to fit individual institutional needs. This pragmatic approach allows educators to build upon previous findings to improve their evaluation process. The lessons we learned in reaching high interrater reliability, such as developing specific criteria and creating an automated process, show how others can best utilize the tool at their institutions. A limitation of our study was the small number of evaluations available, with only one pre- and one post-intervention evaluation scored per faculty member. While generalizability evidence suggests that at least three narrative evaluations with the NEQI are needed to generate a dependable estimate of an individual’s narrative quality,^22 we intentionally narrowed the scope to reflect the pilot nature of the study and to ensure feasibility for the investigators. Accordingly, our use of a single evaluation per time point likely reduced statistical power and reliability and may have attenuated our ability to detect any differences.
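As a rough illustration of the power limitation, the following sketch estimates the smallest mean change in total NEQI score that a paired comparison of this size could plausibly detect. It is a back-of-the-envelope calculation and not an analysis performed in the study: it assumes a normal approximation, a two-sided alpha of 0.05, 80% power, and the standard deviation of the paired differences in total score reported in Table 3 (2.50, n=33). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimal_detectable_difference(
    sd_diff: float, n: int, alpha: float = 0.05, power: float = 0.80
) -> float:
    """Normal-approximation minimal detectable mean difference for paired data."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sd_diff / sqrt(n)


# SD of paired differences in total score from Table 3, intervention group.
mdd = minimal_detectable_difference(sd_diff=2.50, n=33)
print(round(mdd, 2))  # 1.22
```

Under these assumptions, a mean change smaller than roughly 1.2 points on the 12-point scale would be unlikely to reach significance with 33 pairs, which is consistent with the caution about limited power.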

We hypothesize a few reasons for the lack of difference in the NEQI scores in the intervention group. As the module was voluntary, the low participation in the educational module could largely explain why there was no difference in scoring. However, our subgroup analysis of faculty who completed the module, though underpowered, also did not show a significant difference in scores. The low participation rate also highlights the difficulties that clerkship directors face when trying to implement voluntary faculty development without incentives or support from leadership. The timing of the educational module may have also contributed. While not specifically measured, based on the timing of the emails and the follow-up evaluations, there was a time gap of several weeks between completion of the educational module and the participants’ subsequent evaluations. By the time the participants wrote their next evaluation, they may not have been primed to think about the NEQI and the educational module. Factors that were not considered in randomization but may have contributed to outcomes include gender (more female faculty in the control group), prior training on writing narrative evaluations, and prior experience as a learner receiving narrative evaluations.^15,23,24 Finally, educating faculty on how to write narrative evaluations may not matter if their workflow does not include documenting and reflecting on observable behaviors in multiple specific domains of their learners in the first place. In other words, it may not be that faculty do not know how to write an evaluation, but rather that they do not have the time or a system in place to prioritize gathering the data needed to observe and remember the specified competencies in a busy clinical environment. Furthermore, our intervention did not mirror best-practice feedback principles that faculty are expected to use for learners. Because feedback to the faculty was limited to a numerical NEQI score from a pre-intervention evaluation, without any individualized, behaviorally anchored coaching, this likely contributed to low uptake and minimal measurable change. Our findings are consistent with previous work showing that a one-time voluntary faculty development session did not significantly impact the quality of narrative evaluations.^5

While the intervention as tested did not result in significant differences in evaluations, the trend toward more specific comments in the intervention group is promising. The specificity of comments is important for many aspects of residency applications, including the MSPE and structured letters of evaluation (SLOEs). Faculty may also reference their prior evaluations when asked to write an individual recommendation letter to document concrete behavioral examples. These findings shed light on some potential next steps, to be investigated in future studies, that may increase the usefulness of the NEQI as a tool for improving narrative evaluations. First, the NEQI could be used to structure evaluator orientation materials, observation tools, and evaluation templates to prompt the specific components needed to achieve higher quality with individualized coaching. Specific observation exercises which parallel the performance domains could be suggested or assigned during the supervisory period. Templated narrative prompts in the evaluation form, rather than blank text boxes, may be more effective in guiding evaluators to comment on competency domains and to provide specific examples and feedback for learner improvement.^25 Next, artificial intelligence may be used to score evaluations using the NEQI, thus relieving time burden, reducing human differences in grading, and allowing multiple opportunities for feedback. These findings could be shared with faculty in longitudinal peer-mentored groups focused on improving faculty teaching skills, which could capitalize on the importance of spaced and active learning.^26 These interventions highlight the importance of technology and of ensuring that the systems of formative and summative assessment are thoughtfully integrated and available for use in clinical settings. Finally, more targeted educational interventions may yield greater benefits. Sessions focusing on portions of the NEQI instead of the overall instrument may lead to more incremental changes in the quality of evaluations over time.

Conclusion

Narrative evaluations are essential for learner feedback, competency-based evaluation, and UME to GME handoffs. Narrative evaluations should comment on multiple competency domains, provide specificity, and be useful for learner growth. Programs need to prioritize and provide space for faculty development on gathering and documenting the necessary observations effectively to promote trainee growth and improve learner handoffs. Clerkship directors require institutional support and resources for faculty education and development of novel interventions to improve the quality of narrative evaluations, as our study showed that simply emailing feedback and an invitation for an asynchronous learning module did not demonstrate measurable improvement.

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Cook, Kuper, Hatala, Ginsburg. When assessment data are words: validity evidence for qualitative educational assessments. Acad Med. 2016;91(10):1359-1369. doi: 10.1097/ACM.0000000000001175.

2. Hemmer, Dadekian, Terndrup, et al. Regular formal evaluation sessions are effective as frame-of-reference training for faculty evaluators of clerkship medical students. J Gen Intern Med. 2015;30(9):1313-1318. doi: 10.1007/s11606-015-3294-6.

3. Pangaro. A new vocabulary and other innovations for improving descriptive in-training evaluations. Acad Med. 1999;74(11):1203-1207. doi: 10.1097/00001888-199911000-00012.

4. Ginsburg, Eva, Regehr. Do in-training evaluation reports deserve their bad reputations? A study of the reliability and predictive ability of ITER scores and narrative comments. Acad Med. 2013;88(10):1539-1544. doi: 10.1097/ACM.0b013e3182a36c3d.

5. Crumbley, Szauter, Karnath, Sonstein, Belalcazar, Qureshi. Narrative comments in internal medicine clerkship evaluations: room to grow. Med Educ Online. 2025;30(1):2471434. doi: 10.1080/10872981.2025.2471434.

6. Buchanan, Strano-Paul, Saudek, et al. Preparing effective narrative evaluations for the Medical School Performance Evaluation (MSPE). MedEdPORTAL. 2022;18:11277. doi: 10.15766/mep_2374-8265.11277.

7. National Resident Matching Program. Data Release and Research Committee: Results of the 2024 NRMP Program Director Survey. Washington, DC: National Resident Matching Program; 2024. https://www.nrmp.org/match-data/charting-outcomes-program-director-survey-results-main-residency-match/. Accessed June 5, 2024.

8. Hall, Gray, Ragsdale. Making narrative feedback meaningful. Clin Teach. 2024;21(5):e13766. doi: 10.1111/tct.13766.

9. Hatala, Sawatsky, Dudek, Ginsburg, Cook. Using In-Training Evaluation Report (ITER) Qualitative Comments to Assess Medical Students and Residents: A Systematic Review. Acad Med. 2017;92(6):868-879. doi: 10.1097/ACM.0000000000001506.

10. Jackson, Kay, Jackson, Frank. The Quality of Written Feedback by Attendings of Internal Medicine Residents. J Gen Intern Med. 2015;30(7):973-978. doi: 10.1007/s11606-015-3237-2.

11. Canavan, Holtman, Richmond, Katsufrakis. The quality of written comments on professional behaviors in a developmental multisource feedback program. Acad Med. 2010;85(10 Suppl):S106-S109. doi: 10.1097/ACM.0b013e3181ed4cdb.

12. Gulbas, Guerin, Ryder. Does what we write matter? Determining the features of high- and low-quality summative written comments of students on the internal medicine clerkship using pile-sort and consensus analysis: a mixed-methods study. BMC Med Educ. 2016;16:145. doi: 10.1186/s12909-016-0660-y.

13. van de Pol MHJ, Lagro, Koopman, Olde Rikkert MGM, Fluit CRMG, Lagro-Janssen ALM. Lessons learned from narrative feedback of students on a geriatric training program. Gerontol Geriatr Educ. 2018;39(1):21-34. doi: 10.1080/02701960.2015.1127810.

14. Raaum, Lappe, Colbert-Getz, Milne. Milestone Implementation's Impact on Narrative Comments and Perception of Feedback for Internal Medicine Residents: a Mixed Methods Study. J Gen Intern Med. 2019;34(6):929-935. doi: 10.1007/s11606-019-04946-3.

15. McQueen, Petrisor, Bhandari, Fahim, McKinnon, Sonnadara. Examining the barriers to meaningful assessment and feedback in medical training. Am J Surg. 2016;211(2):464-475. doi: 10.1016/j.amjsurg.2015.10.002.

16. Eugene, Montes-Rivera, Adair White. Education research: perspectives and experiences of clinical neurology faculty regarding the end-of-rotation assessment: a qualitative study. Neurol Educ. 2023;2(4):e200104. doi: 10.1212/NE9.00000000002001.

17. Mooney, Powell, Dahl, Eiduson, Reinhardt, Stone. Education research: a long-term faculty development initiative improves specificity and usefulness of narrative evaluations of clerkship students. Neurol Educ. 2022;1(1):e200003. doi: 10.1212/NE9.0000000000200003.

18. Dudek, Marks, Wood, et al. Quality evaluation reports: can a faculty development program make a difference. Med Teach. 2012;34(11):e725-e731. doi: 10.3109/0142159X.2012.689444.

19. Wilbur. Does faculty development influence the quality of in-training evaluation reports in pharmacy. BMC Med Educ. 2017;17(1):222. doi: 10.1186/s12909-017-1054-5.

20. Kelly, Mooney, Rosati, Braun, Thompson Stone. Education Research: The Narrative Evaluation Quality Instrument: Development of a tool to assess the assessor. Neurology. 2020;94(2):91-95. doi: 10.1212/WNL.0000000000008794.

21. Bartels, Mooney, Stone. Numerical versus narrative: a comparison between methods to measure medical student performance during clinical clerkships. Med Teach. 2017;39(11):1154-1158. doi: 10.1080/0142159X.2017.1368467.

22. Mooney, Stone, Wang, Blatt, Pascoe, Lang. Examining generalizability of faculty members’ narrative assessments. Acad Med. 2023;98(s3):s210. doi: 10.1097/ACM.0000000000005417.

23. Mooney, Pascoe, Blatt, et al. Predictors of faculty narrative evaluation quality in medical school clerkships. Med Educ. 2022;56(12):1223-1231. doi: 10.1111/medu.14911. Epub 2022 Aug 23. PMID: 35950329.

24. Natesan, Jordan, Sheng, et al. Feedback in Medical Education: An Evidence-based Guide to Best Practices from the Council of Residency Directors in Emergency Medicine. West J Emerg Med. 2023;24(3):479-494. doi: 10.5811/westjem.56544. PMID: 37278777; PMCID: PMC10284500.

25. Curtis, Moon, Hanmore, Hopman, Baxter. Evaluating the Effect of Assessment Form Design on the Quality of Feedback in One Canadian Ophthalmology Residency Program as an Early Adopter of CBME. Can J Ophthalmol. 2023;58(4):e149-e150.

26. Steinert, Mann, Centeno, et al. A systematic review of faculty development initiatives designed to improve teaching effectiveness in medical education: BEME Guide No. 8. Med Teach. 2006;28(6):497-526. doi: 10.1080/01421590600902976.