Abstract
Background:
Instruments to assess surgical skills have been validated for several key indicator procedures in otolaryngology. Selective neck dissection is a core procedure for which trainees must integrate knowledge of complex head and neck anatomy with technical surgical skills. An instrument for assessment of surgical performance in selective neck dissection has not been previously developed. The objective of the current study is to develop and validate an instrument for assessing surgical competency for level II-IV selective neck dissection.
Design:
A Delphi working group composed of 23 fellowship-trained head and neck surgeons from 17 institutions was assembled. The modified Delphi method encompassed a 3-step process, including 2 anonymous voting rounds to successively refine individual items and establish levels of consensus. The threshold for strong consensus, >80% agreement, was determined a priori. The resulting instrument was subsequently validated in a prospective cohort of 17 resident surgeons spanning postgraduate years 1 to 5. Participants were asked to perform a level II-IV selective neck dissection on fresh-frozen cadaveric specimens. Performance was scored by 2 independent, blinded observers using the devised instrument, and construct validity was assessed.
Results:
Through the modified Delphi process, a final list of 30 items, considered the most essential for achieving the goals of a level II-IV selective neck dissection, was developed. Construct validity was supported by a positive association between instrument scores and both resident postgraduate year level and number of head and neck rotations completed.
Conclusion:
The development and validation of a novel instrument for assessment of surgical competency in level II-IV selective neck dissection, a key indicator case in otolaryngology, is described. This new instrument may be used to provide objective feedback on overall and task-specific competency to identify surgical deficiencies and offer granular feedback to enhance surgical training.
Introduction
Objective assessment of trainees’ abilities to perform a set of core surgical procedures safely, efficiently, and independently is a challenging but necessary task for surgical training programs. The Accreditation Council for Graduate Medical Education (ACGME) model of competency-based medical education (CBME) evaluates programs based on trainees meeting defined milestones in 6 Core Competencies encompassing patient care and medical knowledge. There has been an increasing drive to incorporate objective measures of surgical competency into this model. This motivation is further intensified by clinical and trainee constraints, such as limiting concurrent surgery and duty-hour restrictions, and reduced case numbers associated with the COVID-19 global pandemic. Training programs are being encouraged to implement objective assessment tools that not only help trainees safely and efficiently achieve competency, but also serve to evaluate overall program performance and facilitate granular feedback. 1
There is currently no widely accepted standard for assessing surgical skill development and procedural competency. Commonly used methods of evaluation include case log review, informal work-based assessment, and subjective end-of-rotation evaluation.2,3 While case logs show quantitative experience, they are not reliable proxies for surgical competency. Furthermore, differences in case-log entry may reflect variance in resident coding practices rather than true disparity in surgical experience. End-of-rotation evaluations rely on faculty opinion and recollection of trainee performance. These can be influenced by confirmation bias, recall bias, or the “halo or horn” effect and have shown poor reliability and validity; faculty members’ global perceptions of resident performance during rotations may influence their specific scoring of surgical acumen.4 In addition, these methods often do not provide sufficiently granular feedback on specific deficiencies that could prompt trainee-specific adjustments and remediation. An ideal instrument would objectively measure performance in surgical skills and provide reproducible, consistent, and actionable feedback.
Current objective measures use global rating scales (GRSs) and task-based checklists (TBCs). Previously validated GRSs are generalizable to any surgical procedure, assessing fundamental skills such as respect for tissue, appropriate instrument selection, and economy of movement. While these generic instruments can assess general surgical skills, they lack the ability to assess performance of discrete procedural steps. Task-based checklists account for this by assessing a surgeon’s ability to effectively navigate individual steps within a specific procedure. Originally developed for general surgery procedures, Objective Structured Assessments of Technical Skills (OSATS) are performance-based evaluation tools for assessing surgical competence. These involve a combination of GRSs and TBCs for evaluation of surgeon competency either in live animal models or bench models. 5 Since the original inception, OSATS have expanded to the operating room setting and have been successfully adopted into training curricula of specialties including obstetrics and gynecology, general surgery, and orthopedic surgery.6-8
Given the heterogeneous subspecialties within otolaryngology, many unique technical skills are required to perform procedures within each domain of the field. Instruments to assess otolaryngology surgical skills have been previously validated for endoscopic sinus surgery, thyroidectomy, mastoidectomy, direct laryngoscopy, tracheostomy, myringotomy, tonsillectomy, septoplasty, and microvascular free flaps.9-24 Still, many key otolaryngology procedures lack objective evaluation methods. 5 Selective neck dissection (SND) is a core procedure in otolaryngology for which trainees must integrate knowledge of complex head and neck anatomy with technical surgical skills. To the best of the authors’ knowledge, there is currently no assessment tool available for SND. Thus, the chief objective of this study is to develop and validate an operative performance instrument for SND, a key indicator case for otolaryngology residency training.
Materials and Methods
Modified Delphi Process
A modified Delphi method was used for development of a novel OSATS-based instrument for SND competency assessment. The Delphi process is an established method of consensus building that leverages the expertise of a group of individuals who have professional and experiential knowledge surrounding the content under investigation.25 The process inherently facilitates controlled feedback and systematic progress toward consensus among a panel of experts during completion of a series of anonymous online questionnaires.26 Each round of voting is used to refine the next iteration of the questionnaire until a sufficient level of consensus is achieved.
In a modified Delphi process, an initial collection of statements, items, or questions is provided for critique, and the open-ended questions are eliminated. The modified Delphi method employed in this study encompassed development of an inclusive list of instrument items by the 3-member Delphi steering committee (E.M.D., D.L.P., and M.L.C.) and 2 subsequent voting rounds to successively refine individual statements and establish levels of consensus by members of the Delphi panel (Figure 1). To improve generalizability and construct validity, the Delphi working group comprised 23 head and neck surgeons with representation from 17 institutions. With each round, a prospective list of items was presented alongside forced-response multiple-choice options; free-text responses were also supported, offering respondents an opportunity to provide feedback. A 5-point Likert scale was applied to each item. A score of 3 was chosen as the minimum acceptable level to perform the procedure independently, allowing for improvement beyond minimally acceptable levels. The modified Delphi process for each statement was considered complete when there was convergence of opinion or when a point of diminishing returns was reached. The 3-step modified Delphi process was completed over an 8-week period, from December 1, 2018 to January 26, 2019. All voting was performed anonymously via the SurveyMonkey online survey tool (SurveyMonkey, Portland, OR). The threshold for strong consensus, >80% agreement, was determined a priori.

Figure 1. The modified Delphi method employed a 3-step process consisting of 1 pre-voting round and 2 subsequent voting rounds to successively refine individual statements and establish levels of consensus by all members of the panel.
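The a priori consensus rule can be illustrated with a minimal sketch. It assumes that "agreement" for an item means the share of panelists selecting the most common forced-response option; the text does not define agreement at this level of detail, and the vote labels and the `consensus` helper below are hypothetical.

```python
from collections import Counter

CONSENSUS_THRESHOLD = 0.80  # strong consensus requires agreement strictly above 80%


def consensus(votes):
    """Return (modal response, agreement share, strong-consensus flag) for one item."""
    if not votes:
        raise ValueError("at least one vote is required")
    response, count = Counter(votes).most_common(1)[0]
    share = count / len(votes)
    return response, share, share > CONSENSUS_THRESHOLD


# Hypothetical voting round for a 23-member panel; labels are illustrative only.
item_votes = ["keep"] * 19 + ["revise"] * 3 + ["drop"]
response, share, strong = consensus(item_votes)
print(f"{response}: {share:.0%} agreement, strong consensus = {strong}")
```

Under this assumption, an item with exactly 80% agreement would not reach strong consensus, because the threshold is stated as greater than 80%.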
Instrument Validation
Institutional review board approval was obtained for a prospective study designed to validate the novel 30-item SND competency assessment tool that was developed using the methodology described above (IRB#18-007477). Seventeen residents, ranging from postgraduate year (PGY) 1 to 5, performed level II-IV SNDs on fresh-frozen cadaveric specimens in a controlled anatomy laboratory setting. Prior to starting the exercise, trainees were instructed to perform the procedure as if it were a live operative case. Specifically, residents were instructed to choose appropriate instrumentation, delicately navigate vital structures, and ligate necessary vasculature. Video documentation was obtained using an unmodified high-definition GoPro HERO 7 Black digital action camera (GoPro Inc, San Mateo, CA) and a commercially available head mount (Figure 2). The camera was worn by a senior-level trainee acting as a surgical assistant for the procedure. The surgical assistant provided no feedback and assisted only according to the surgeon’s instructions. Captured video was edited to remove participant identifiers. To enhance subject anonymity, the video for left-handed trainees was inverted 180 degrees in the horizontal plane so that all videos displayed right-handed dissection. The video was evaluated using the newly developed assessment instrument for level II-IV SND by 2 independent observers (E.M.D. and D.L.P.) who were blinded to the identity of the resident they were evaluating and to each other’s assessment.

Figure 2. Equipment used to record trainee operative performance, including an unmodified GoPro camera with head mount and recording computer.
Construct validity was evaluated for each component of the tool by comparing trainees’ mean scores across advancing PGY levels and number of previous head and neck rotations completed. The inter-item reliability of the instrument was measured by assessing its internal consistency with Cronbach α, with a value of at least .80 considered acceptable. Inter-rater reliability was calculated using Cohen’s κ. Scores given for each item were evaluated, with an inter-rater score difference of 1 point or less on the Likert scale considered agreement. The Pearson correlation coefficient was used to determine the relationship between scores on individual instrument tasks and overall score. Continuous features were summarized with means and 95% confidence intervals. Analyses were performed using Microsoft Excel 2010 (Microsoft Corporation, Redmond, WA).
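The reliability statistics named above can be sketched as follows. This is an illustration, not the analysis performed for this study (which used Microsoft Excel): the function names and all scores are hypothetical, sample variances are assumed for Cronbach α, and Cohen’s κ is shown in its standard unweighted form. How the 1-point tolerance was combined with κ is not specified in the text, so the within-1-point agreement rate is computed separately.

```python
from collections import Counter
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha; `scores` has one row per evaluation, one column per item."""
    k = len(scores[0])
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportion per category.
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


def agreement_within(rater_a, rater_b, tolerance=1):
    """Share of items on which the two raters differ by at most `tolerance` points."""
    pairs = list(zip(rater_a, rater_b))
    return sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)


# Hypothetical 5-point Likert scores: 5 evaluations x 3 items.
scores = [[2, 2, 3], [3, 3, 3], [4, 3, 4], [4, 5, 5], [1, 2, 2]]
# Hypothetical scores from two raters on the same 8 items.
rater_a = [1, 2, 3, 4, 5, 3, 2, 4]
rater_b = [1, 2, 3, 4, 4, 3, 3, 4]

print(f"Cronbach alpha: {cronbach_alpha(scores):.2f}")
print(f"Cohen kappa: {cohen_kappa(rater_a, rater_b):.2f}")
print(f"Agreement within 1 point: {agreement_within(rater_a, rater_b):.0%}")
```

Reporting the tolerance-based agreement rate alongside κ separates raw concordance from chance-corrected concordance, which is why the two values can differ noticeably.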
Results
Through the modified Delphi process, a final list of 30 items, considered to be the most essential assessable items for achieving the goals of a level II-IV SND, was developed (Figure 3). A total of 23 evaluations were completed for 17 distinct trainees across 5 PGY levels (levels 1-5). A breakdown of the number of trainees by PGY level and number of head and neck rotations is shown in Table 1. Six trainees underwent a second evaluation after advancing in PGY level. The Cronbach α for the tool’s internal consistency was .98. Evaluations of 6 different trainees were completed by both raters and were used to calculate inter-rater reliability on each item in the instrument. Inter-rater reliability showed moderate concordance (Cohen’s κ .65, 86% agreement).

Figure 3. Selective neck dissection (levels II-IV) task-specific checklist and global assessment tool.
Table 1. Number of Participating Trainees According to Postgraduate Year Level and Number of Rotations Completed.
For this study, the instrument demonstrated construct validity. There was a clear association of increasing scores by PGY level in the TBC, GRS, and overall score (Figure 4A-C). PGY-3 had significantly better overall scores compared to PGY-1 (mean difference 0.98; P = .004) and PGY-4-5 had significantly better overall scores compared to PGY-3 (mean difference 1.30; P < .001). The transition to surgical competence (average score ≥3) appeared to occur at the PGY-3 level. The time required to complete the exercise did not significantly correlate with PGY-level (Figure 4D).

Figure 4. Mean instrument scores and time to complete the exercise by PGY level. (A) Mean Global Rating Scale Score. (B) Mean Task-Based Checklist Score. (C) Mean overall instrument score. (D) Mean time to complete the exercise in minutes. Error bars represent 95% confidence intervals.
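As an illustration of how group means and 95% confidence intervals of this kind can be computed, the sketch below uses a normal approximation. The approximation is an assumption, since the interval method used in the study is not stated, and for small groups a t-based interval would be wider. The `mean_ci` helper and all scores are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def mean_ci(values, level=0.95):
    """Return (mean, lower, upper) using a normal-approximation confidence interval."""
    m = mean(values)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * stdev(values) / len(values) ** 0.5
    return m, m - half_width, m + half_width


# Hypothetical overall instrument scores grouped by PGY level.
scores_by_pgy = {
    "PGY-1": [1.8, 2.1, 2.4, 2.0],
    "PGY-3": [2.9, 3.2, 3.1],
    "PGY-4-5": [4.2, 4.5, 4.4],
}

for level, values in scores_by_pgy.items():
    m, lower, upper = mean_ci(values)
    print(f"{level}: mean {m:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```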
Construct validity was also appraised within the context of the number of head and neck rotations completed by the evaluated trainee (0, 1-2, 3-4, and 5-7). The results of this assessment also demonstrate a positive association between the number of head and neck rotations completed and higher instrument scores for TBC, GRS, and overall score (Figure 5A-C). The transition to surgical competence appeared to occur between 1-2 and 3-4 head and neck rotations completed. No significant correlation was found between time to complete the exercise and number of head and neck rotations completed (Figure 5D).

Figure 5. Mean instrument scores and time to complete the exercise by number of head and neck rotations completed. (A) Mean Global Rating Scale Score. (B) Mean Task-Based Checklist Score. (C) Mean overall instrument score. (D) Mean time to complete the exercise in minutes. Error bars represent 95% confidence intervals.
To identify the discrete surgical steps that correlate most strongly with overall score, and thus have the potential to differentiate trainees by level of performance, the mean score of each step and its correlation with overall score were calculated. The surgical steps with the greatest degree of correlation are shown in Table 2. Those steps with the least correlation are shown in Table 3.
Table 2. Task-Based Checklist Items Demonstrating Strongest Correlation With Final Instrument Score.
Table 3. Task-Based Checklist Items Demonstrating Lowest Correlation With Final Instrument Score.
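The item-level analysis can be sketched as a ranking of checklist items by their Pearson correlation with the overall score. The item names and scores below are hypothetical, and the overall score is assumed to be the mean of all item scores for each evaluation. Because each item contributes to that mean, the correlations are somewhat inflated relative to an item-rest correlation.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical item scores; each list holds one score per evaluation.
item_scores = {
    "item_a": [1, 2, 3, 4, 5],
    "item_b": [3, 3, 2, 3, 3],
    "item_c": [2, 2, 3, 4, 4],
}

# Overall score per evaluation, assumed here to be the mean across items.
overall = [mean(column) for column in zip(*item_scores.values())]

# Rank items from strongest to weakest correlation with the overall score.
ranked = sorted(
    ((name, pearson_r(values, overall)) for name, values in item_scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: r = {r:.2f}")
```

Items near the top of such a ranking discriminate between stronger and weaker performances, whereas items scored almost uniformly across trainees fall to the bottom.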
Discussion
This study describes the systematic development and validation of a novel surgical assessment instrument for SND. Instrument scores correlated with PGY level as well as with the number of head and neck surgical rotations completed, supporting construct validity, and inter-rater reliability was moderate. Individual instrument items were rigorously examined and refined through the established Delphi process by an extended group of 23 fellowship-trained head and neck surgeons from 17 institutions. Identification of key vascular structures, the omohyoid muscle, and the posterior belly of the digastric was most strongly associated with overall score. The instrument can feasibly be incorporated into a training program’s armamentarium alongside other, previously validated instruments to identify deficiencies early, provide formative feedback, and facilitate graduated autonomy as performance improves.
Too often, resident procedural deficiencies are identified late in training, when remediation is more challenging and arduous. Ideally, these deficiencies should be identified early, so that minor course corrections can be implemented throughout training. Moreover, accurate assessment of surgical proficiency may be used to accelerate graduated autonomy, whereby residents who demonstrate more advanced skills may be given early opportunities for increasing independence in the operating room when appropriate.
A recent systematic review by Labbé et al 27 identified a paucity of validated assessment tools for otolaryngology procedures, having only been developed for 11 core otolaryngology procedures. Selective neck dissection is among the procedures that lack an objective assessment tool. This pilot study presents a valid, reliable, and practical instrument for assessing operative performance in SND in cadaveric dissection with the potential to be used in live surgery. Intuitively, instrument performance should not only reflect PGY level, but also the number of head and neck rotations completed. Indeed, our data confirm that a higher number of rotations completed in the head and neck subspecialty is associated with improved surgical performance in SND.
Producing a safe and technically proficient surgeon is arguably one of the most important objectives of a surgical training program. Work-hour restrictions, limits on concurrent surgery, and pressure imposed on faculty to increase clinical productivity highlight the need for more innovative training and assessment strategies. With the impetus from accreditation bodies toward competency-based medical education, more objective measures of trainee performance, including multiple operative instruments, have been developed, validated, and implemented by various surgical specialties. These instruments can measure performance and provide formative feedback for improvement. In contrast to traditional evaluation methods that generally provide less reliable subjective feedback, objective assessment tools improve the ability to identify and correct deficiencies early on, provide formative and directed feedback, and ensure certification of a proficient and confident surgeon at the conclusion of training.
Despite the theoretical advantages of incorporating objective procedural assessment into resident evaluation, implementation of these tools is often challenging in practice. Coordination of busy trainee schedules with anatomy laboratory appointments, specimen availability, and the logistics of capturing and distributing video footage requires a systematic approach with significant investment from the department as a whole. Integration of regular resident education time into our curriculum facilitates participation in activities such as this. Another major challenge with implementing assessment tools is the burden placed on faculty. Observing trainees, whether directly or indirectly, and completing assessments requires significant faculty time. This emphasizes the importance of faculty buy-in to ensure that timely formative feedback is provided.
This study is limited by several factors. Direct observational assessment in the operating room setting is the gold standard for evaluation. However, cadaveric dissection was used for this instrument for various reasons. While cadaveric dissection does not offer true fidelity of a living, bleeding human model, it allows for reasonably uniform difficulty with normal anatomy and a lack of disease involvement. In addition, cadaveric dissection allowed for evaluation of junior trainees without placing patients at undue risk. Trainees performed the dissections independently and feedback was withheld until the procedure had concluded. Future application of this instrument in a live operating room setting under appropriate supervision will lend insight into its validity and allow for further modification. Also, due to scheduling constraints, there was a disproportionate number of junior trainees participating in the exercise. While the data show a clear trend toward improvement in senior residents, the deficiency of senior-level evaluation may have impacted reliability. Ideally, each resident would have been evaluated annually throughout training, but the cross-sectional nature of this study was not conducive to serial evaluations. Video was recorded using a head-mounted camera, which was the least labor-intensive method of maintaining an appropriate field of view but created significant motion. A dedicated videographer would create the most ideal recordings but is not practical in most situations. As camera stabilization improves this will become less of an issue, and we found upgrading the camera was beneficial.
Rather than direct observation, video recording was obtained for evaluation. While this was logistically more complicated than direct observation, it allowed for anonymity of the participants and blinding of the evaluators, who are familiar with the trainees. Bias due to known PGY level or the “halo or horn” effect was circumvented using this method. Furthermore, the video allowed for rewind and control of playback speed and did not require direct, real-time observation by the faculty. Remote scoring of otolaryngology procedures using video review has previously been validated by Bowles et al.28 While the TBC developed for this study is based on rigorous literature review and consensus, the tasks described reflect the methodology used at the authors’ institution and are therefore not generalizable to the practice at every institution. Ultimately, multi-institutional application and crowd-sourced feedback will facilitate instrument revision to improve feasibility and generalizability.
Conclusion
Herein, we present a novel instrument for objective assessment of surgical competency in level II-IV SND. It allows for specific and objective documentation of trainee skills and is intended to facilitate appropriate advancement in the operating room setting and identify specific deficiencies for targeted skill development. This instrument can feasibly be adopted into a training program curriculum to monitor progression of trainee skills.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
