Abstract
Objectives:
Sensation is an integral component of laryngeal control for breathing, swallowing, and vocalization. Laryngeal sensation is assessed by elicitation of the laryngeal adductor reflex (LAR), a brainstem-mediated adduction of the true vocal folds. During Flexible Endoscopic Evaluations of Swallowing (FEES), the touch method can be used to elicit the LAR to judge laryngeal sensation. Despite the prevalence of this method in clinical practice and research, prior studies have yet to examine inter- and intra-rater reliability.
Methods:
Four speech-language pathologists rated 125 randomized video clips for the presence, absence, or inability to rate the LAR. Fifty percent of video clips were re-randomized and re-rated 1 week later. Raters then created guidelines and participated in formal consensus training sessions on a separate set of videos. Ratings were repeated post-training.
Results:
Overall inter-rater reliability was fair (κ = 0.22) prior to training. Pre-training intra-rater reliability ranged from fair (κ = 0.35) to almost perfect (κ = 0.89). Inter-rater reliability significantly improved after training (κ = 0.42, P < .001), though agreement did not reach prespecified acceptable levels (κ ≥ 0.80). Post-training intra-rater reliability ranged from moderate (κ = 0.49) to almost perfect (κ = 0.85).
Conclusion:
Adequate inter-rater reliability was not achieved when rating isolated attempts to elicit the LAR. Acceptable within-rater reliability was observed in some raters 1 week after initial ratings, suggesting that ratings may remain consistent within raters over a short period of time. Limitations and considerations for future research using the touch method are discussed.
Background
Sensation is an integral component of laryngeal control for airway protection, breathing, swallowing, and vocalization, and precise sensorimotor integration is required for guiding and modulating these movements and reflexes.1-4 Laryngeal sensation can be assessed by elicitation of the laryngeal adductor reflex (LAR), a brainstem-mediated adduction of the true vocal folds. 5 The afferent branch of the LAR is comprised of the internal branch of the superior laryngeal nerve, which receives sensory input from mechanoreceptors and chemoreceptors in the laryngeal mucosa. 5 This afferent input provides sensory feedback to central neural circuits in the medulla with subsequent elicitation of the efferent component of the reflex through the recurrent laryngeal nerve, which ultimately results in a brief, bilateral contraction of the thyroarytenoid muscle.5-7 An intact LAR requires sensorimotor integrity at both peripheral and central neural structures, including sensory afferents, interneurons, and motor signaling.
Laryngeal sensation has been examined in patients with gastroesophageal reflux,8-10 obstructive sleep apnea,11-14 stroke,15,16 acute respiratory failure, 17 tracheostomy, 18 Parkinson’s disease,2,19 amyotrophic lateral sclerosis, 20 head and neck cancer,19,21 partial laryngectomy, 22 chronic cough and paradoxical vocal fold movement, 23 and pediatric populations with varying medical diagnoses.24,25 In the dysphagia literature, an association between laryngeal sensory deficits and aspiration,15,17,26,27 pharyngolaryngeal secretions,17,28 and pneumonia29,30 has been documented. However, methodological variability exists across studies, including various operational definitions of laryngeal sensation (eg, self-report, LAR, cough, or swallow), as well as location (eg, arytenoids, epiglottis, aryepiglottic folds) and type of stimuli administered (eg, air pulse or touch method).
The LAR is often included as part of Flexible Endoscopic Evaluations of Swallowing (FEES) with two methods: the air pulse 31 and touch method. 32 The air pulse method, also called FEES with sensory testing (FEESST), applies pressure and duration-controlled air pulses to the laryngeal mucosa through a working channel of a flexible endoscope. 31 Though normative values have been established with the air pulse method, 33 its clinical use is currently lacking since this specialized equipment is no longer commercially manufactured.
Tactile stimulation of laryngeal structures, known as “the touch method,” is routinely used in clinical practice since it does not require specialized equipment beyond an endoscope. During laryngeal sensation testing, the endoscope is advanced to make contact with an arytenoid. Once contact is made, the endoscope is partially retracted to view the presence or absence of the LAR. However, there are inherent methodological limitations, including variability of pressures both within and between endoscopists, 34 inconsistency in the location of tactile stimulation of the endoscope, and poor visualization due to obstruction from residue or secretions. Additional patient factors, such as volitional vocal fold adduction during testing or poor exam tolerance, can limit one’s ability to assess the LAR. Despite these limitations, prior studies report on laryngeal sensation outcomes using the touch method without any reported reliability measures.15-17,19,29,35,36
This study sought to examine the inter- and intra-rater reliability of clinician ratings of the LAR during laryngeal sensation testing with the touch method. Following assessment of baseline reliability, additional guidelines were developed and consensus training was performed before post-training reliability ratings were completed. We hypothesized that raters would achieve acceptable inter- and intra-rater reliability (κ ≥ 0.80) after training.
Material and Methods
Study Design
This prospective study examined inter- and intra-rater reliability for ratings of the LAR during the touch method of laryngeal sensation testing both before and after training. Videos of laryngeal sensory testing were retrospectively obtained from a database of FEES videos performed as standard of care exams on dysphagic patients seen in the ambulatory and inpatient setting of a large urban hospital. Videos were included if laryngeal sensation testing was performed during FEES as either an inpatient or outpatient evaluation. Videos were excluded if they lacked a clear view of the larynx at rest or showed suspected unilateral or bilateral vocal fold immobility identified during phonatory speech tasks on FEES (“eee-sniff”), which elicited maximum vocal fold abduction and adduction. In these standard of care FEES videos, laryngeal sensation testing was performed at the end of the FEES and involved brief tactile stimulation of each arytenoid with the tip of the endoscope. Since videos were obtained from FEES performed during routine clinical care, tactile stimulation duration and pressure were not controlled or recorded. The decision to touch the arytenoids was based on research suggesting that this area contains the highest density of mechanoreceptors compared to other laryngeal subsites, such as the epiglottis. 37
Laryngeal Sensory Testing Videos
Videos were examined and obtained by a research assistant. First, the entire laryngeal sensation testing exam was obtained, then each video was divided into isolated clips for each individual tactile stimulation of an arytenoid. A total of 125 video clips were obtained for pre- and post-training reliability, as well as 50 additional video clips for the training session.
Pre-Training Reliability Ratings
Four speech pathologists with clinical experience performing FEES and laryngeal sensation testing with the touch method individually performed baseline and post-training LAR ratings. REDCap (Research Electronic Data Capture), a secure web-based data collection tool, was used to collect ratings for reliability. 38 Videos were put in random order by a research assistant and raters were blinded to patient demographics. Two exemplar video clips of a present LAR and one video clip of an absent LAR were provided to raters at the beginning of each rating session. The exemplar videos were chosen by consensus from clinicians who were not raters. Raters were encouraged to use both frame-by-frame analysis and real-time observation to judge the LAR. Raters were instructed to judge the LAR as present only when adduction of the arytenoids or true vocal folds was visualized. An absent LAR was defined as no movement of the arytenoids or true vocal folds. Neither criteria for the inability to rate the LAR nor a specific rating time frame were provided. Intra-rater reliability was assessed 1 week after the initial rating session on a randomly selected 50% of video clips.
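The random selection of half of the clips for re-rating can be sketched as follows. This is a minimal illustration only: the function name, the seed, and the choice to round 62.5 down to 62 clips are assumptions, since the exact number of re-rated clips and the randomization tool are not reported.

```python
import random


def select_rerating_subset(clip_ids, fraction=0.5, seed=0):
    """Pick a random subset of clips, in random order, for the re-rating session.

    The subset size is truncated to a whole number (an assumption; the
    article does not state how 50% of 125 clips was rounded).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_selected = int(len(clip_ids) * fraction)
    # sample() draws without replacement and returns the clips in random order
    return rng.sample(list(clip_ids), n_selected)


clip_ids = list(range(1, 126))  # hypothetical IDs for the 125 clips
rerating_subset = select_rerating_subset(clip_ids, seed=1)
print(len(rerating_subset))
```

Fixing the seed keeps the subset identical across runs, which makes it possible to audit which clips each rater saw twice.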
Training Session
The raters met to discuss and develop rating guidelines after baseline reliability ratings were completed (see Table 1). Once guidelines were established, the four raters individually judged 50 video clips, which were different from pre-training videos. The first author (JCB) served as a facilitator to guide rating discussions and disagreements were resolved in real-time by consensus after watching the video clip again. Two training sessions, each lasting approximately 2 hours, were completed by all raters.
Table 1. Guidelines Used During Training Sessions and Post-Training Reliability Ratings.
Post-Training Reliability Ratings
The same four speech pathologists were instructed to individually rate video clips from the same, re-randomized database of videos from the pre-training rating session. Raters were provided guidelines developed at the training session, as well as exemplar videos. Intra-rater reliability was again performed 1 week later on a randomly selected 50% of video clips.
Statistical Analyses
Statistical analyses were performed in R 39 with the rel open-source package. 40 To examine inter-rater reliability across four raters with a categorical outcome (present, absent, unable to rate the LAR), Fleiss’ kappa (κ) was computed pre- and post-training. Intra-class correlation coefficients (ICC) were calculated to assess overall intra-rater reliability for pre- and post-training ratings. Estimates were obtained via an intercept-only multinomial logistic mixed effects model. Cohen’s κ was used to examine inter- and intra-rater reliability between dyads and within each rater. Fleiss’ and Cohen’s κ values were interpreted as follows: values ≤0 as indicating no agreement, 0.01-0.20 as none to slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1.00 as almost perfect agreement. 41 An a priori Fleiss’ κ value of 0.80 was set for acceptable reliability. The Fleiss’ κ statistic was compared between pre- and post-training using a linearization method to compare correlated agreement coefficients. 42
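For readers who want to see the computation behind these statistics, the sketch below implements the standard Fleiss' κ formula and the interpretation bands listed above in plain Python. It is not the authors' code (the analysis was run in R with the rel package); the function names and category labels are illustrative assumptions, and confidence intervals and the pre/post comparison are not covered.

```python
from collections import Counter

CATEGORIES = ("present", "absent", "unable")  # illustrative labels


def fleiss_kappa(ratings, categories=CATEGORIES):
    """Fleiss' kappa for a list of clips, each a list with one label per rater.

    Assumes every clip was rated by the same number of raters and that
    more than one category was used overall (otherwise the denominator is 0).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Observed agreement per clip: share of rater pairs that agree
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall proportion of each category
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_expected = sum(p ** 2 for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


def interpret_kappa(kappa):
    """Map a kappa value to the agreement bands used in this study."""
    if kappa <= 0:
        return "no agreement"
    if kappa <= 0.20:
        return "none to slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

Applied to the values reported in this study, `interpret_kappa(0.22)` falls in the fair band and `interpret_kappa(0.42)` in the moderate band.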
Results
Patient Demographics for Pre- and Post-Training Videos
Laryngeal sensory testing videos were obtained from the FEES of 48 patients. The majority of exams were from inpatients (54%) without nasogastric tubes (83%). The mean age of patients was 59 years (SD = 16 years). Primary medical diagnoses included neurologic (33%), head and neck cancer (29%), respiratory (17%), esophageal (11%), and spinal cord injury (10%). Atomized lidocaine (0.2 mL of 4%) was administered at the discretion of the clinician (75%), and laryngeal abnormalities (eg, erythema, edema, ulcerative tissue) were appreciated in nearly half of patients (46%). On average, the clinician performing the FEES with sensory testing performed three attempts (range = 1-5) of tactile stimulation per exam. In total, 125 video clips (isolated attempts at eliciting the LAR) were included for reliability ratings. Raters had an average of 8 years of experience performing FEES with the touch method of sensory testing (range = 1-25 years). Specifically, rater A had 1 year of experience, rater B had 2 years of experience, rater C had 12 years of experience, and rater D had 25 years of experience.
Pre-Training Inter- and Intra-Rater Reliability
Reliability analyses between rater dyads are reported in Table 2. Fair inter-rater reliability was achieved between all raters (Fleiss’ κ = 0.22, 95% CI 0.17-0.28). An item analysis revealed that raters achieved complete agreement on 30% of the video clips, agreement in three out of four raters on 37% of the video clips, and agreement in two out of four raters on 34% of the video clips. On videos where disagreements were limited to only two responses (eg, present or unable to rate), the most common disagreements were between present versus unable to rate (74%), compared to absent versus unable to rate (14%) or present versus absent (12%). Overall intra-rater reliability was moderate (ICC = 0.48), ranging from fair (Cohen’s κ = 0.35) to almost perfect (Cohen’s κ = 0.89) agreement (Table 3).
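The item analysis described above can be reproduced with a short tally. The sketch below is illustrative only: the function names and the four example clips are hypothetical and are not study data. With four raters and three response options, at least two raters always agree, so each clip falls into a 4/4, 3/4, or 2/4 group.

```python
from collections import Counter


def agreement_breakdown(ratings):
    """Share of clips by size of the largest group of raters giving the same label."""
    largest = Counter(max(Counter(item).values()) for item in ratings)
    n_items = len(ratings)
    return {size: largest[size] / n_items for size in sorted(largest, reverse=True)}


def two_response_disagreements(ratings):
    """Among clips where exactly two distinct labels were used, share of each label pair."""
    pairs = Counter()
    for item in ratings:
        labels = set(item)
        if len(labels) == 2:
            pairs[tuple(sorted(labels))] += 1
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}


# Hypothetical ratings for four clips (one label per rater), not study data
example = [
    ["present", "present", "present", "present"],
    ["present", "present", "present", "unable"],
    ["present", "present", "absent", "unable"],
    ["present", "present", "absent", "absent"],
]
print(agreement_breakdown(example))
print(two_response_disagreements(example))
```

Clips on which raters used all three responses are counted in the agreement breakdown but excluded from the pairwise disagreement tally, matching the restriction to videos with only two responses.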
Table 2. Inter-Rater Reliability.
Note. All values are Cohen’s κ, except for overall (Fleiss’ κ).
P < .001.
Table 3. Intra-Rater Reliability.
Post-Training Inter- and Intra-Rater Reliability
After establishing guidelines (Table 1) and additional training, moderate inter-rater reliability was achieved (κ = 0.42, 95% CI 0.37-0.47). Inter-rater reliability showed statistically significant improvements from pre- to post-training (t = 3.46, P < .001); however, post-training reliability did not reach prespecified acceptable levels (Fleiss’ κ ≥ 0.80). An item analysis revealed that raters achieved complete agreement on 34% of clips, agreement in three out of four raters on 41%, and agreement in two out of four raters on 25%. On videos with disagreements limited to two responses, the most common disagreements were between present versus unable to rate (46%), compared to absent versus unable to rate (32%) or present versus absent (22%). Overall intra-rater reliability was moderate (ICC = 0.47), ranging from moderate (Cohen’s κ = 0.49) to almost perfect (Cohen’s κ = 0.85) agreement.
Discussion
This study examined the inter- and intra-rater reliability for judging the LAR with the touch method of laryngeal sensation testing during FEES. Despite establishing guidelines for rating the presence or absence of the LAR in a large sample of clinical videos and participating in two training sessions, we were unable to achieve acceptable reliability between raters. The consensus training sessions did significantly improve reliability compared to pre-training ratings, though post-training reliability did not meet prespecified acceptable levels. Sufficient intra-rater reliability was achieved in some raters, suggesting that ratings may remain consistent within raters over a short period of time. However, the accuracy of these ratings remains unclear.
Though studies have not formally reported reliability for the touch method, a recent study by Kaneoka et al 34 reported raw data on judgments of the LAR from two separate raters. Disagreements were resolved by a third rater. In this study, LAR ratings during the touch method on healthy adults were reported across 48 trials. When trial-by-trial ratings were compared between raters, moderate reliability was achieved (Cohen’s κ = 0.44, 95% CI 0.05-0.83, 85% agreement). Though not formally reported in the original study, this inter-rater reliability value closely aligns with our findings (Fleiss’ κ = 0.42; Cohen’s κ range = 0.39-0.57).
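A κ of 0.44 alongside 85% raw agreement can look contradictory, so a brief sketch of Cohen's κ may help. The code below is a generic implementation, not the analysis used in either study; the function names and the example ratings are hypothetical. Rearranging the κ definition gives the chance agreement implied by a reported pair of values, which for 85% agreement and κ = 0.44 is about 0.73 (both inputs are rounded, so this is approximate).

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same trials."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions per label
    labels = count_a.keys() | count_b.keys()
    p_expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def implied_chance_agreement(p_observed, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for the chance agreement p_e."""
    return (p_observed - kappa) / (1 - kappa)


# Hypothetical ratings (not data from either study): one label dominates,
# so raw agreement is high while kappa stays modest
rater_a = ["present"] * 88 + ["absent"] * 12
rater_b = ["present"] * 80 + ["absent"] * 8 + ["present"] * 7 + ["absent"] * 5
print(cohen_kappa(rater_a, rater_b))
print(implied_chance_agreement(0.85, 0.44))
```

When one response dominates, two raters agree often by chance alone, which is why κ is reported here in addition to percent agreement.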
There are considerations inherent in the administration and interpretation of the touch method of laryngeal sensation testing which might further explain the low levels of reliability found in this study. The touch method requires the clinician to briefly touch the arytenoid and then withdraw the endoscope so that adduction of the arytenoids or vocal folds can be visualized. Given that the LAR occurs within a short time frame, this swift endoscopic movement must ensure that both adequate contact and visualization of the glottis are achieved. This method of sensation testing requires instruction beyond what is typically required during a FEES. Additional patient factors, such as vocal fold adduction during testing from coughing, speaking, or hyperfunctional breathing patterns, can further obfuscate rating the LAR. These factors likely contributed to the high proportion of disagreement in this study between rating the LAR as “present” versus “unable to rate” in both pre- and post-training sessions. Videos were deliberately chosen from routine clinical exams in order to include attempts in suboptimal conditions. Raters potentially had different internal criteria to judge whether an attempt was not sufficient to rate as “present,” despite the explicit guidelines provided post-training. This underscores the complexity of rating the LAR under suboptimal conditions and the importance of obtaining multiple attempts to elicit the LAR with this method. Future studies examining LAR reliability would benefit from additional guidelines to better identify low-quality attempts as unable to be rated.
Another method of laryngeal sensation testing, the air pulse method, addresses many of these limitations. This method allows for quantification of the severity of sensory deficits based on the intensity of the air pulse stimulus, defined as normal (<4 mmHg), moderate (4-6 mmHg), or severe (>6 mmHg). The air pulse method provides a consistent, clear view of the larynx during testing and does not require movement of the endoscope. However, there are some limitations with the air pulse method, including device-generated noise, poor stimulus reproducibility, limited stimulus range, and a lack of commercially manufactured equipment. 43 Though equipment has since been developed to address these limitations,44-46 a major benefit of the touch method is that it can be performed during routine FEES exams and does not require specialized equipment.
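The severity bands for the air pulse method translate directly into a lookup. The sketch below is a minimal illustration with an assumed function name; it encodes only the cutoffs stated above and is not a clinical tool.

```python
def classify_air_pulse_threshold(pressure_mmhg):
    """Classify a sensory threshold (mmHg) using the cutoffs stated above."""
    if pressure_mmhg < 4:
        return "normal"
    if pressure_mmhg <= 6:
        return "moderate"  # 4-6 mmHg, both ends included
    return "severe"  # above 6 mmHg


for pressure in (3.5, 4.0, 6.0, 6.5):
    print(pressure, classify_air_pulse_threshold(pressure))
```

Unlike the present/absent judgment of the touch method, this yields a graded outcome tied to a measured stimulus intensity.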
Few studies have directly compared the air pulse and touch methods. Kaneoka and colleagues 19 showed that the air pulse method identified laryngeal sensory deficits at greater frequency than the touch method in healthy adults and patients with head and neck cancer and Parkinson’s disease. Additionally, these investigators found that the air pulse method was not correlated with abnormal penetration-aspiration scale scores, 47 whereas the touch method did show an association. However, it should be noted that the distribution of penetration-aspiration scale scores in that study’s sample did not include more severe scores (>4) or patients who aspirated. In another study, Cuellar and Harvey 16 examined both methods’ ability to predict sensory responses to airway invasion and pharyngeal residue. Though the authors conclude that LAR testing alone is unable to predict sensory impairments, the authors did not include a spontaneous swallow as a normal sensory response to penetration despite research suggesting that stimulation above the level of the true vocal folds results in a swallow.6,7 Future studies comparing the two methods in patients with severe deficits in airway protection are necessary to better elucidate their clinical utility.
There are several limitations of the study’s design that warrant discussion. First, raters assessed the LAR during a single attempt of tactile stimulation of an arytenoid. Though this design provided precise analysis of rater judgments, this is not common in clinical practice. During FEES, clinicians often perform multiple attempts to elicit the LAR. Once a single positive response is elicited, the LAR is judged as at least unilaterally present. It is unclear whether prior studies have rated the LAR during single attempts or over the course of an entire exam. We suspect that reliability would improve if raters were provided the entire laryngeal sensation testing exam, as opposed to isolated attempts. The inclusion of additional types of reflexes (eg, cough, swallow, gag) would likely improve reliability, though these responses do not consistently occur in healthy adults during laryngeal sensation testing compared to the LAR. 34 Similarly, including the entire exam rather than a clip of a single attempt may provide more context to the raters to bolster confidence and reliability in ratings. Secondly, laryngeal sensation testing was performed at the end of FEES. In case the patient did not tolerate the endoscopic exam well, bolus administration was prioritized in the clinical exams and therefore sensory testing was carried out at the end of the exam. It is possible that there was residue from bolus trials, which might obscure one’s view during testing. Clinicians also rely on proprioceptive feedback during sensation testing to determine if the endoscope adequately contacted the arytenoid. This type of feedback was not available to raters. Thirdly, guidelines were formulated based on clinical experience with the touch method. 
The decision was made to define a present LAR as “robust” (ie, more complete, quick adduction) in order to better delineate unclear cases, though there are likely variations in adduction patterns of the arytenoids or true vocal folds depending on the integrity of the recurrent laryngeal nerve, peripheral trauma such as edema, or potential subtle variations in a normal LAR. Future research using two endoscopes to confirm the presence or absence of the LAR (ie, one endoscope providing the touch stimulus and the other observing in home position) is needed to formulate more accurate judgments and establish detailed and rigorous guidelines to improve reliability with this method. Finally, it should be noted that absolute agreement is an unforgiving standard and the a priori threshold for “acceptable” reliability (Fleiss’ κ ≥ 0.80) might be unrealistic given the inherent complexity and limitations of the touch method. Future investigations might benefit from methodologies to resolve discrepancies between raters, such as a consultation from a third rater or consensus panel.
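The exam-level convention noted in the limitations, in which a single positive response is enough to judge the LAR as present, can be sketched as a simple aggregation over attempt-level ratings. The function name and labels are assumptions, and the precedence of "absent" over "unable to rate" when no attempt is positive is a hypothetical choice that the study does not specify.

```python
def exam_level_lar(attempt_ratings):
    """Collapse attempt-level LAR ratings from one exam into a single judgment."""
    if not attempt_ratings:
        raise ValueError("at least one attempt rating is required")
    # A single positive response is sufficient to call the LAR present
    if "present" in attempt_ratings:
        return "present"
    # Assumption: any ratable attempt without a response counts as absent
    if "absent" in attempt_ratings:
        return "absent"
    return "unable"


print(exam_level_lar(["unable", "absent", "present"]))
print(exam_level_lar(["unable", "unable"]))
```

Because one clear response settles the exam-level judgment, disagreement on a single ambiguous attempt would carry less weight than it does when each clip is rated in isolation.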
Conclusion
This study was unable to demonstrate adequate reliability between raters when judging the LAR with the touch method. Sufficient intra-rater reliability was achieved in some raters, suggesting that ratings may remain consistent within raters over a short period of time. Reliability reporting and specific definitions when performing and judging the LAR are encouraged to promote transparency, reproducibility, and sound methodological practices. Future studies examining reliability of the touch method of laryngeal sensation testing are required to better understand the method’s utility in clinical practice and research.
Footnotes
Authors’ Note
Portions of this manuscript were presented at the European Society for Swallowing Disorders in Vienna, Austria on September 19, 2019.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Ethical Approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This study was deemed exempt by the IRB as it did not constitute research with human subjects.
Informed Consent
The regulatory requirement for a consent form does not apply to exempt research.
