Diagnostic Accuracy of Tuning Fork Tests for Hearing Loss: A Systematic Review

Abstract

Objective

(1) To determine the diagnostic accuracy of tuning fork tests (TFTs; Weber and Rinne) for assessment of hearing loss as compared with standard audiometry. (2) To identify the audiometric threshold at which TFTs transition from normal to abnormal, thus indicating the presence of hearing loss.

Data Sources

PubMed, Ovid Medline, EMBASE, Web of Science, Cochrane, and Scopus, supplemented by manual bibliographic searches.

Review Methods

A systematic review of studies reporting TFT accuracy was performed according to a standardized protocol. Two independent evaluators corroborated the extracted data and assessed risk of bias.

Results

Seventeen studies with 3158 participants, including adults and children, met inclusion criteria. The sensitivity and specificity of the Rinne test for detecting conductive hearing loss ranged from 43% to 91% and 50% to 100%, respectively, for a 256-Hz fork and from 16% to 87% and 55% to 100% for a 512-Hz fork. The audiometric thresholds at which tests transition from normal to abnormal ranged from 13 to 40 dB of conductive hearing loss for the Rinne test and from 2.5 to 4 dB of asymmetry for the Weber test. Significant heterogeneity in TFT methods and audiometric thresholds to define hearing loss precluded meta-analysis. There is high risk of bias in patient selection for a majority of the studies.

Conclusion

Variability exists in the reported test accuracy measurements of TFTs for clinical screening, surgical candidacy assessments, and estimation of hearing loss severity. Clinicians should remain mindful of these differences and optimize these techniques in specific clinical applications to improve TFT accuracy.

Keywords

tuning fork, hearing test, audiometry, Weber, Rinne

Since their introduction in the 1800s, Weber and Rinne tuning fork tests (TFTs) have become valued as fundamental diagnostic clinical tools for hearing loss assessment and established parts of medical education curricula.¹ Even with subsequent advances in audiometry, these simple, inexpensive, and easily incorporated TFTs serve several roles in current practice, including screening for hearing loss, confirmation of audiometry, estimation of hearing loss severity, and verification of surgical candidacy.^2,3 Otolaryngologists are often trained to use TFTs in decision making for patients with otosclerosis, reserving surgery for those whose abnormal Rinne results indicate a level of conductive loss and ossicular fixation sufficient to warrant intervention.

Given their persistent use, one might expect high diagnostic accuracy, supported by evidence, to undergird the various TFT applications. However, the extant literature contains individual studies questioning whether the Rinne and Weber TFTs have sufficient accuracy to identify conductive hearing loss (CHL) and unilateral sensorineural hearing loss (SNHL) or CHL, respectively.^4,5 Furthermore, there is disagreement about how large an air-bone gap (ABG) is necessary for an abnormal Rinne result and whether this value and overall test accuracy differ by tuning fork frequency, material, test technique, or patient population.⁶ One prior systematic review that assessed the accuracy of bedside tests to screen elderly individuals for hearing loss concluded that TFTs lack sufficient reliability and accuracy for use as screening tests.⁷ The review was restricted to 5 studies that reported raw data, excluded pediatric participants, and did not assess the aforementioned questions pertinent to otolaryngologic practice. If TFTs are to remain valuable elements of a clinical examination, their accuracy, conditions for optimal performance, and limitations should be better understood.

This systematic review aimed to determine the diagnostic accuracy of TFTs for assessing hearing loss, as measured by gold standard pure tone audiometry, and to identify the audiometric threshold at which TFT results transition from normal to abnormal. This review also aimed to identify factors that influence TFT accuracy in the evaluation of different types of hearing loss.

Methods

Information Sources and Search Process

This review employed standard methodology for medical test reviews delineated by the US Agency for Healthcare Research and Quality.⁸ OVID Medline, EMBASE, PubMed, Web of Science, Scopus, and Cochrane Library were searched from their dates of inception through May 2016 for articles with title/abstract/keyword or mapping alias terms (tuning fork OR Weber OR Rinne) AND (hearing test OR audiogram OR audiometry). Articles were limited to the English language and human subjects. Searches were repeated prior to publication to ensure that no additional papers had been published since the initial search. Manual searches of bibliographies were performed and citations cross-checked. Title and abstract review was performed, followed by full-text review. Two authors (E.A.K. and B.L.) independently performed the article selection process, with disagreements resolved by discussion with the third author.

Inclusion and Exclusion Criteria

Studies were included per the following criteria: (1) the population comprised adults and/or children with a TFT and standard pure tone audiometry completed for evaluation of hearing loss; (2) the index test was the Weber or Rinne TFT; (3) 256- and/or 512-Hz forks were used; (4) pure tone audiometry was the comparator test (reference standard); (5) at least 1 of the following outcome measures was reported—diagnostic accuracy (sensitivity, specificity) for detection of hearing loss (CHL, SNHL, and/or mixed) or the audiometric threshold at which the Rinne TFT changed from a normal to abnormal result (ie, transition point); and (6) outcome measures were reported individually for TFT types and fork frequency.

Studies were excluded if (1) only the TFT or audiogram was performed, not both; (2) the TFT was not used to assess hearing; or (3) the study did not include a measure of diagnostic test accuracy or the Rinne transition point.

Tuning Fork Tests

Weber and Rinne TFTs were defined as follows. The Weber TFT involves placement of the vibrating tuning fork on a midline osseous structure (eg, forehead, upper incisors). With normal hearing, the tuning fork is heard centrally.⁹ The sound should lateralize to the poorer-hearing ear in the presence of CHL and to the better-hearing ear in the presence of SNHL.^1,10 The Rinne TFT compares perceptions of air- and bone-conducted sounds with either the loudness comparison technique or the timed threshold technique. The loudness comparison technique compares loudness of the tone by air conduction to bone conduction.^9,11,12 In a normal (“positive”) test, the sound intensity is greater when the fork is lateral to the external auditory canal (EAC), as sound is transmitted by a functioning middle ear mechanism, whereas with ipsilateral CHL, sound conducted via the bone of the mastoid process is heard louder than air-conducted sound—an abnormal (“negative”) test. In the timed threshold technique, the tuning fork is first placed on the mastoid process until the sound is no longer heard and then placed lateral to the EAC. In a normal test, the sound is still heard when the vibrating tuning fork is lateral to the EAC after it is no longer heard on the mastoid.^10,12,13

Data Abstraction

Two authors (E.A.K. and B.L.) independently abstracted data onto standardized forms. The main outcomes of interest were TFT sensitivity and specificity for detecting hearing loss as defined by the reference standard audiogram and the threshold value (measured in decibels) at which the Rinne TFT transitions from a normal to an abnormal result. True and false positives and negatives were recorded and, if not available, back calculated if possible from sample sizes.
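The outcome definitions and the back-calculation step can be sketched as follows. This is a minimal illustration and not the reviewers' actual procedure: the function names, the rounding rule, and the example counts are all hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of ears with hearing loss on the audiogram that the TFT called abnormal."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of ears without hearing loss on the audiogram that the TFT called normal."""
    return tn / (tn + fp)


def back_calculate(sens: float, spec: float, n_with_loss: int, n_without_loss: int):
    """Approximate the 2x2 counts from reported accuracy and group sizes.

    Assumes sens and spec are proportions in [0, 1]. Rounding to whole ears is
    an assumption; the result can be off by one if the source rounded its values.
    """
    tp = round(sens * n_with_loss)
    fn = n_with_loss - tp
    tn = round(spec * n_without_loss)
    fp = n_without_loss - tn
    return tp, fp, fn, tn


# Hypothetical example: 60 ears with CHL and 140 ears without.
tp, fp, fn, tn = back_calculate(0.75, 0.95, 60, 140)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("sensitivity:", sensitivity(tp, fn), "specificity:", specificity(tn, fp))
```

Back-calculation of this kind needs the number of ears with and without hearing loss as well as the reported accuracy values, which is presumably why it was possible only for some studies.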

Quality and Applicability Assessments

The QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies–2) tool was used to assess study quality and applicability.¹⁴ It comprises 4 domains: patient selection, index test, reference standard, and flow and timing. All 4 domains are assessed for risk of bias, and the first 3 are also assessed for applicability concerns; each is rated “low,” “high,” or “unclear.”

Data Synthesis

Forest plots of sensitivity and specificity were generated with RevMan 5.3.¹⁵ Variations in study characteristics and absence of raw data in some of the studies precluded meta-analyses. Therefore, qualitative analysis was performed.
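For readers who want to see how a 95% CI around a single sensitivity or specificity estimate can be obtained from raw counts, the sketch below uses the Wilson score interval. This is an assumption chosen for illustration: the text does not state which interval method RevMan applied, so the values may differ slightly from those in the forest plots. The function name and example counts are hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical example: 45 true positives among 60 ears with CHL.
low, high = wilson_interval(45, 60)
print(f"sensitivity 0.75, 95% CI {low:.2f} to {high:.2f}")
```

For sensitivity, pass true positives and the number of ears with hearing loss; for specificity, pass true negatives and the number of ears without hearing loss.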

Results

Study Search and Selection

The search strategy yielded 735 references (Figure 1). After duplicates were removed, 336 articles remained. Eighteen articles met inclusion and exclusion criteria. Two articles reported the same data.^13,16 Seventeen articles were included in analysis.^{4-6,11-13,17-27} Cross-reference checking did not reveal additional articles.

Figure 1.

Flow diagram of study selection.

Study Characteristics

Study characteristics are displayed in Table 1. There were 3158 total participants. The sample sizes ranged from 20 to 1000 (median, 88). Among studies providing demographic information, 6 included adults only, 4 pediatric patients only, and 2 both. Other demographics were not consistently reported. Studies were performed in 5 countries (Canada, United Kingdom, Finland, India, and United States). All were performed in otolaryngology outpatient clinics. Studies included participants with the following causes of hearing loss: otitis media with effusion, otosclerosis, ossicular discontinuity, tympanic membrane perforations, otologic symptoms, mastoid surgery, sudden SNHL, and other/unknown. Some studies included normal hearing individuals.

Table 1.

Study Characteristics.

																	Outcomes Reported
				Population			Hearing Loss Type		Tuning Fork Test Performed		Tuning Fork Frequency, Hz		CHL Definition: ABG, dB				ABG Transition Point
First Author	Year	PTS, n	Ears, n	Adults	Children	NR	CHL	SNHL	Rinne	Weber	256	512	≥10	≥15	≥20	NR	Rinne	Weber	SN	SP
Shuman⁵	2013	250	500	×				×		×		×				×			×
MacKechnie²⁵	2013	50	91	×			×		×			×				×	×
Behn¹¹	2007	58	88		×		×		×		×	×			×				×	×
Burkey¹⁷	1998	1000	2000	×			×		×			×	×						×	×
Haapaniemi²²	1996	687	1374		×		×	×	×	×	×			×			×	×	×	×
Miltenburg²⁶	1994	68	136	×	×		×	×	×	×	×					×			×	×
Jacob²³	1993	20	40	×			×		×		×	×				×	×
Johnston²⁴	1992	62	100	×			×		×			×	×				×		×	×
Browning¹³	1989	132	132			×	×		×		×	×			×				×	×
Chole¹²	1988	200	200			×	×		×		×	×	×						×	×
Capper¹⁸	1987	125	662		×		×		×	×		×				×	×	×	×	×
Doyle²⁰	1984	88	176			×	×	×	×	×	×	×				×	×		×
Stankiewicz⁴	1979	134	268			×	×	×	×	×	×	×				×			×	×
Golabek²¹	1979	34	68	×	×		×		×			×				×	×
Gelfand⁶	1977	123	200	×			×		×		×	×				×	×
Wilson²⁷	1975	50	100		×		×	×	×			×	×	×^a					×	×
Crowley¹⁹	1966	77	153^b			×	×		×		×	×				×	×		×	×
Total		3158	6288	8	6	5	16	6	16	6	10	15	4	2	2	10	9	2	13	11

Abbreviations: ABG, air-bone gap; CHL, conductive hearing loss; NR, not recorded; PTS, patients; SN, sensitivity; SNHL, sensorineural hearing loss; SP, specificity.

^a CHL defined as an ABG of 10 dB at 2 frequencies or 15 dB at 1 frequency.

^b Data from 153 ears were used for Rinne 256-Hz test and 154 ears for Rinne 512-Hz test.

TFT Characteristics

Sixteen studies reported Rinne TFT results with 256- and/or 512-Hz tuning forks (Table 1). Five studies reported results of Weber and Rinne TFTs, and 1 reported Weber TFT results alone. TFT techniques and fork characteristics varied (Table 2). Tuning forks were composed of aluminum, steel, or metal alloy, but 7 studies (41%) did not record the composition. Seven studies used masking (including narrow band noise, Barany box, and/or tragal rubbing) in the nontest ear during TFTs; 5 used masking with all TFTs; and 2 reported data with and without use of masking.

Table 2.

Tuning Fork Test Method and Characteristics.

		Type of Tuning Fork
First Author	Year	Aluminum	Steel	Alloy	Not Recorded	Masking Performed	Rinne Tuning Fork Testing Technique
Shuman⁵	2013				×	No	Not applicable (Weber only)
MacKechnie²⁵	2013	×	×			No	Loudness comparison
Behn¹¹	2007	×				No	Loudness comparison
Burkey¹⁷	1998				×	Yes^a, ^b	Loudness comparison
Haapaniemi²²	1996				×	No	Loudness comparison
Miltenburg²⁶	1994		×			Yes^a, ^c	Loudness comparison
Jacob²³	1993				×	No	Not recorded
Johnston²⁴	1992			×		Yes^c	Loudness comparison
Browning¹³	1989				×	Yes^c	Loudness comparison and timed threshold
Chole¹²	1988				×	No	Loudness comparison
Capper¹⁸	1987	×				No	Loudness comparison
Doyle²⁰	1984				×	Yes^c	Loudness comparison or timed threshold
Stankiewicz⁴	1979			×		No	Timed threshold
Golabek²¹	1979		×			Yes^b	Loudness comparison and timed threshold
Gelfand⁶	1977			×		Yes^b	Loudness comparison (if equivocal, then timed threshold)
Wilson²⁷	1975			×		No	Loudness comparison
Crowley¹⁹	1966			×		No	Loudness comparison

^a Both nonmasking and masking were performed.

^b Narrow band noise was used for masking.

^c Barany box or tragal rubbing was used for masking.

Rinne TFT technique differed among studies. Typically, included studies described placing the tuning fork 2 to 2.5 cm lateral to the EAC for air conduction testing. In 4 studies, the placement of the tuning fork relative to the EAC was not described.^5,6,18,23,26 Eight studies reported that tines were parallel to the axis of the EAC^{11-13,21,22,24,25,27}; 1 indicated that they were perpendicular¹⁷; and the remaining studies did not specify. Both Rinne TFT techniques were represented. Ten studies used only the loudness comparison technique; 1 used only the timed threshold technique; and 2 studies used both techniques on each patient. In the remaining 3 studies, both techniques were used intermittently, or the technique was not reported.

The tuning fork position for the Weber TFT varied among and within studies. Locations included vertex, forehead, nasal bridge, upper dentition, and midline of skull.

Quality and Applicability

Table 3 shows the quality and applicability assessments of the included studies based on the QUADAS-2. In the patient selection domain, 4 studies (24%) had a low risk of bias, as patients were enrolled randomly or consecutively. Eleven studies (65%) had a high risk of bias, as inclusion criteria were specified or they included only pediatric patients or only patients with CHL, SNHL, or abnormal Rinne TFTs. In the index test domain, 14 studies (82%) had a low risk of bias, as TFT examiners were blinded to examination findings and/or audiogram results. Two studies (12%) did not indicate if blinding occurred, and 1 study (6%) had no blinding. In the reference standard domain, 3 studies (18%) had an unclear risk of bias due to an unclear audiometric definition of CHL.

Table 3.

Risk of Bias and Applicability of Included Studies Based on the QUADAS-2 Tool.^a

	Risk of Bias				Applicability Concerns
First Author (Year)	Patient Selection	Index Test	Reference Standard	Flow and Timing	Patient Selection	Index Test	Reference Standard
Shuman⁵ (2013)	☹	☺	☺	☺	☺	☺	☺
MacKechnie²⁵ (2013)	☹	☺	☺	☺	☺	☺	☺
Behn¹¹ (2007)	☹	☺	☺	☺	☹	☺	☺
Burkey¹⁷ (1998)	☺	☺	☺	☺	☺	☺	☺
Haapaniemi²² (1996)	☹	☺	☺	?	☹	☺	☺
Miltenburg²⁶ (1994)	☺	☺	?	☺	☺	☺	☺
Jacob²³ (1993)	☹	☺	☺	?	☹	☺	☺
Johnston²⁴ (1992)	☹	?	?	?	☺	☺	☺
Browning¹³ (1989)	?	☺	☺	☺	☺	☺	☺
Chole¹² (1988)	?	☺	☺	?	☺	☺	☺
Capper¹⁸ (1987)	☹	☺	☺	☺	☹	☺	☺
Doyle²⁰ (1984)	☺	☺	☺	?	☺	☺	☺
Stankiewicz⁴ (1979)	☺	☺	?	?	☺	☺	☺
Golabek²¹ (1979)	☹	?	☺	?	☹	☺	☺
Gelfand⁶ (1977)	☹	☺	☺	?	☺	☺	☺
Wilson²⁷ (1975)	☹	☺	☺	?	☹	☺	☺
Crowley¹⁹ (1966)	☹	☹	☺	?	☺	☺	☺

^a Symbols as recommended by QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies–2)¹⁴: ☺, low risk; ☹, high risk; ?, unclear risk.

In the flow and timing domain, 10 studies (59%) had an unclear risk of bias, as we were unable to determine if TFTs and audiogram tests were performed on the same day to minimize changes in patient condition. Seven studies (41%) had a low risk of bias, as the TFT and audiogram were performed on the same day. Overall, 13 studies (76%) are at risk for bias due to high or unclear ratings in patient selection and/or index test domains, and 11 (65%) are at risk for bias in the reference standard and/or flow and timing domain.
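The domain counts reported in this section can be re-derived from Table 3 with a short tally. The sketch below transcribes the risk-of-bias columns of Table 3 by hand; the letter coding (L, H, U) and the helper names are hypothetical conveniences and not part of QUADAS-2.

```python
from collections import Counter

DOMAINS = ("patient selection", "index test", "reference standard", "flow and timing")

# One letter per domain, in the order above: L = low, H = high, U = unclear.
# Transcribed from the risk-of-bias columns of Table 3.
RISK_OF_BIAS = {
    "Shuman 2013": "HLLL",
    "MacKechnie 2013": "HLLL",
    "Behn 2007": "HLLL",
    "Burkey 1998": "LLLL",
    "Haapaniemi 1996": "HLLU",
    "Miltenburg 1994": "LLUL",
    "Jacob 1993": "HLLU",
    "Johnston 1992": "HUUU",
    "Browning 1989": "ULLL",
    "Chole 1988": "ULLU",
    "Capper 1987": "HLLL",
    "Doyle 1984": "LLLU",
    "Stankiewicz 1979": "LLUU",
    "Golabek 1979": "HULU",
    "Gelfand 1977": "HLLU",
    "Wilson 1975": "HLLU",
    "Crowley 1966": "HHLU",
}


def tally(domain: str) -> Counter:
    """Count low, high, and unclear ratings for one domain."""
    i = DOMAINS.index(domain)
    return Counter(code[i] for code in RISK_OF_BIAS.values())


def at_risk(*domains: str) -> int:
    """Number of studies rated high or unclear in at least one listed domain."""
    idx = [DOMAINS.index(d) for d in domains]
    return sum(any(code[i] != "L" for i in idx) for code in RISK_OF_BIAS.values())


for domain in DOMAINS:
    print(domain, dict(tally(domain)))
print("patient selection and/or index test:", at_risk("patient selection", "index test"))
print("reference standard and/or flow and timing:", at_risk("reference standard", "flow and timing"))
```

With the table as transcribed, the two combined counts come to 13 and 11 of 17 studies, matching the 76% and 65% reported above.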

Six studies (35%) had applicability concerns for some practices based on inclusion criteria, as they included only pediatric patients or both pediatric and adult patients. There were no applicability concerns regarding use of the index or reference standard tests.

Diagnostic Test Accuracy Outcomes

Diagnostic test performance of Rinne TFT

Test accuracy outcomes are presented in Table 4. Among the 8 studies that investigated the accuracy of the 256-Hz Rinne TFT for detecting CHL, sensitivity ranged from 43% to 91% (median, 78.8%), and specificity ranged from 50% to 100% (median, 88%). Among the 10 studies that evaluated the accuracy of the 512-Hz Rinne TFT, sensitivity ranged from 16% to 87% (median, 54%), and specificity ranged from 55% to 100% (median, 98.5%). When those studies not at high risk for bias were assessed, sensitivity ranged from 16% to 91%, and specificity ranged from 71% to 100%.^{4,12,13,17,20}

Table 4.

Diagnostic Test Accuracy Outcomes of Rinne and Weber Tuning Fork Tests.

		Rinne, %				Weber, %
		256 Hz		512 Hz		256 Hz		512 Hz
First Author	CHL Definition, dB	SN	SP	SN	SP	SN	SP	SN	SP	Raw Data Reported
Chole¹²	≥10	78.8	71.4	44.8	100					×
Burkey¹⁷	≥10			73.1	96.6					×
Johnston²⁴	≥10			52.9	100
Wilson²⁷	≥15			44	98					×
Haapaniemi²²	≥15	76.7	98.9			18^a	97^a			×
Behn¹¹,^b	≥20	82 (L)	66 (L)	64 (L)	85 (L)
		80 (R)	50 (R)	54 (R)	83 (R)
Crowley¹⁹	≥20	73.5	100	76	100					×
Browning¹³	≥20	69	94	47	99
Doyle²⁰	NR	90.8		58.5						×
Miltenburg²⁶	NR	84.8	82			50^a	33.3^a			×
Stankiewicz⁴	NR	43	99	16	99	55^a	68^a	67^a	60^a	×
Capper¹⁸	NR			87	55			65^c	75^c
Shuman⁵	NA							78^d		×

Abbreviations: CHL, conductive hearing loss; L, left ear; NA, not applicable; NR, not recorded; R, right ear; SN, sensitivity; SP, specificity.

^a Sensitivity and specificity to detect either unilateral CHL or unilateral sensorineural hearing loss (combined population).

^b Data reported for each ear separately.

^c Sensitivity and specificity to detect unilateral CHL.

^d Sensitivity to detect unilateral sensorineural hearing loss.

Of 6 studies that compared tuning fork frequencies, 5 utilizing the Rinne test found that 256 Hz was more sensitive than 512 Hz for detecting CHL. In contrast, 256 Hz was less specific than (k = 3 studies) or of equal specificity to (k = 2) 512 Hz.
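The within-study comparison of fork frequencies can be reproduced from Table 4. The sketch below is an illustration only: the values are transcribed by hand from Table 4, the helper name is hypothetical, and treating the two ears reported by Behn as rows that must agree is an assumption about how that study was counted.

```python
from collections import Counter

# Rinne accuracy in %, transcribed from Table 4 for the 6 studies that tested
# both forks. Each row is (SN 256, SP 256, SN 512, SP 512); None = not reported.
# Behn reported each ear separately, so it contributes two rows.
TABLE_4 = {
    "Chole": [(78.8, 71.4, 44.8, 100)],
    "Behn": [(82, 66, 64, 85), (80, 50, 54, 83)],
    "Crowley": [(73.5, 100, 76, 100)],
    "Browning": [(69, 94, 47, 99)],
    "Doyle": [(90.8, None, 58.5, None)],
    "Stankiewicz": [(43, 99, 16, 99)],
}


def direction(rows, i256, i512):
    """Classify the 256-Hz value relative to the 512-Hz value for one study."""
    labels = set()
    for row in rows:
        a, b = row[i256], row[i512]
        if a is None or b is None:
            labels.add("not reported")
        elif a > b:
            labels.add("higher")
        elif a < b:
            labels.add("lower")
        else:
            labels.add("equal")
    # Rows from the same study must agree; otherwise flag the study as mixed.
    return labels.pop() if len(labels) == 1 else "mixed"


sens_counts = Counter(direction(rows, 0, 2) for rows in TABLE_4.values())
spec_counts = Counter(direction(rows, 1, 3) for rows in TABLE_4.values())
print("sensitivity, 256 vs 512 Hz:", dict(sens_counts))
print("specificity, 256 vs 512 Hz:", dict(spec_counts))
```

Only 5 of the 6 studies enter the specificity comparison, because Doyle did not report specificity.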

Only 8 of 17 studies provided raw data, permitting construction of forest plots of sensitivity, specificity, and the respective 95% CIs for the 256-Hz (k = 6) and 512-Hz (k = 6) Rinne tests for detecting CHL (Figure 2). Given the disparate definitions of CHL used in the included studies, meta-analysis was not considered appropriate.

Figure 2.

Sensitivity and specificity of Rinne tuning fork test for detection of conductive hearing loss (CHL) according to air-bone gap. FN, false-negative count; FP, false-positive count; NR, not reported; TN, true-negative count; TP, true-positive count.

Effect of CHL definition and equivocal results on Rinne accuracy

Among studies evaluating Rinne TFTs, CHL was defined differently (ie, by differing magnitudes of ABGs; Table 4). The ranges of reported sensitivity and specificity overlap substantially among studies with different definitions. In 3 studies that determined sensitivity for differing ABG ranges (Table 5), the Rinne TFT was more sensitive for greater degrees of CHL, with 1 exception: Crowley and Kaufman¹⁹ reported that sensitivity decreased by 9.1 percentage points when the ABG was >40 dB; notably, the sample size of the patients with an ABG >40 dB was only 40% of the sample size with an ABG >30 dB.

Table 5.

Within-Study Comparisons of Accuracy of Rinne Tuning Fork Test at Different Air-Bone Gap Ranges, with and without Masking, and Inclusion of Equivocal Results.

Covariate
First Author: Frequency of Tuning Fork, Hz	CHL Definition, dB	Sensitivity, %	Specificity, %	Change in Sensitivity^a	Change in Specificity^a
Air-bone gap range, dB
Crowley¹⁹
256	20-29	34	NR	—	—
	≥30	100	NR	—	—
512	20-29	45	NR	—	—
	30-39	98	NR	—	—
	≥40	88.9	NR	—	—
Burkey¹⁷: 512	10-19	49.4	NR	—	—
	20-29	87.9	NR	—	—
	≥30	92.3	NR	—	—

Wilson²⁷: 512	10-19	26	NR	—	—
	20-29	46.1	NR	—	—
	30-39	45.4	NR	—	—
	≥40	100	NR	—	—
Masking
Burkey¹⁷: 512	≥10	97.3	NR^b	7.7	—
Miltenburg²⁶: 256	NR	84.2	82	2.2	0
Equivocal results
Crowley¹⁹
256	NR	79.3	100	3.3	0
512	NR	74.5	100	1	0
Burkey¹⁷: 512	≥10	80.1	NR	7	—

Abbreviations: CHL, conductive hearing loss; NR, not recorded.

^a Calculated as sensitivity (or specificity) with masking in nontest ear or including equivocal data – sensitivity (or specificity) without masking in nontest ear or excluding equivocal data. Dashes (—) indicate not applicable.

^b Specificity values not reported, but study indicated that test specificity was not affected by masking.

Studies varied in classifying and analyzing equivocal Rinne TFT results (ie, when bone conduction was equal to air conduction). Equivocal results were considered abnormal^12,13,25 or excluded,^{6,11,17,19,22} or their classification was not specified.^{4,18,20,21,23,24,26,27} In 2 intrastudy comparisons, when equivocal data were included and considered to indicate an abnormal Rinne test result, within-study sensitivity for CHL increased by 1% to 7%, and specificity did not change (Table 5).^17,19

Diagnostic test accuracy of Weber TFT

Three studies reported accuracy measures for the 256-Hz Weber TFT and 3 studies for the 512-Hz Weber TFT (Table 4). For correct lateralization to either the ipsilateral ear (in CHL) or the contralateral ear (in SNHL), sensitivity of the 256-Hz Weber test ranged from 18% to 55% (median, 50%), and specificity ranged from 33% to 97% (median, 68%). For the 512-Hz Weber test, sensitivity ranged from 65% to 78% (median, 67%), and specificity ranged from 60% to 75% (median, 67.5%). Weber accuracy was also reported for subgroups of patients with only unilateral SNHL versus only unilateral CHL, given that a clinical history would elicit whether sound localized to a patient’s “good” or “bad” ear. For the 256-Hz Weber test, sensitivity to detect unilateral SNHL (ie, sound lateralizes to the contralateral “good” ear) ranged from 3.8% to 50%, and specificity was 98%.^4,22,26 For the 512-Hz Weber test, sensitivity to detect unilateral SNHL was reported as 64% and specificity as 99% in 1 study.⁴ For the 256-Hz Weber test, sensitivity to detect unilateral CHL ranged from 53% to 62.5%, and specificity was 93%.^4,22,26 For the 512-Hz Weber test, sensitivity and specificity to detect unilateral CHL were 76% and 94%, respectively, in 1 study.⁴

Results of Transition Point

Ten studies investigated the transition point: the specific ABG present when the Rinne TFT transitions from a normal to an abnormal result in most patients (Table 6). For the 256-Hz Rinne test, the transition points ranged from 13 to 40 dB (mean, 20.5); for 512 Hz, points ranged from 17.5 to 40 dB (mean, 26).

Table 6.

Rinne Transition Points: ABG at Which Rinne TFT Transitions from Normal to Abnormal Result.

		Hearing Loss within Study Population, %			Transition Point by Tuning Fork, dB
First Author	Statistical Method	Normal	CHL	SNHL	256 Hz	512 Hz
Haapaniemi²²	ABG at intersection of cumulative distribution curves^a	88	3	9	13	—
Johnston²⁴	ABG at intersection of cumulative distribution curves^a	0	100	0	—	32
Capper¹⁸	ABG at intersection of cumulative distribution curves^a	NR	NR	0	—	19
Golabek²¹, ^b	ABG at intersection of cumulative distribution curves^a	NR	NR	0	—	19
Doyle²⁰	Linear regression: ABG correlated with number of abnormal Rinne tests^c	34	45	11	16.3	—
MacKechnie²⁵, ^d
Steel	Logistic regression: ABG with 50% probability of abnormal Rinne	NR	NR	0	—	19
Aluminum	Logistic regression: ABG with 50% probability of abnormal Rinne	NR	NR	0	—	27
Chole¹²	Mean ABG of those with equivocal responses	13.5	86.5	0	15.6	34.5
Jacob²³	Mean ABG at which >50% have abnormal Rinne	30	70	0	17.5	17.5
Crowley¹⁹	Mean ABG at which >50% have abnormal Rinne	20	80	0	—	25
Gelfand⁶	Median ABG among those with abnormal Rinne	0	100	0	40	40

Abbreviations: —, value not calculated in the study; ABG, air-bone gap; CHL, conductive hearing loss; NR, not recorded; SNHL, sensorineural hearing loss; TFT, tuning fork test.

^a Transition point considered to be the ABG at the intersection of the cumulative distribution curves for normal and abnormal Rinne results.

^b Values in table are for loudness comparison; study also reported transition point for timed threshold technique with a cumulative distribution curve (2 dB) and linear regression of mean duration of abnormal Rinne (27.8 dB).

^c Linear regression: ABG = 4.5 + 11.8 (no. of abnormal Rinne tests). 90% CI: 14 to 19 dB.

^d This study reported results for different tuning fork materials separately.

Studies considering the transition point varied in the distribution of hearing loss types among participants and in statistical methods (Table 6). Those studies yielding lower transition points had a greater proportion of individuals with normal hearing. Furthermore, using cumulative distribution curves or linear regression models resulted in lower estimates. The cumulative distribution curve method involves plotting the cumulative distribution curves of normal and abnormal Rinne responses against the ABG in decibels. The point of intersection is selected as the transition point. Higher transition points were reported by studies that had more participants with CHL and/or selected the ABG that yielded an abnormal or equivocal Rinne result in 50% of the participants. The study reporting a markedly higher transition point (40 dB) than the others included only patients with CHL and selected the median ABG among participants with an abnormal Rinne.⁶
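The pooled means of the Rinne transition points, and their dependence on the single 40-dB estimate, can be checked directly against Table 6. In the sketch below the values are transcribed by hand from Table 6; counting the steel and aluminum forks of MacKechnie as two separate 512-Hz estimates is an assumption, and the variable names are hypothetical.

```python
from statistics import mean

# Rinne transition points in dB, transcribed from Table 6.
TP_256 = {"Haapaniemi": 13, "Doyle": 16.3, "Chole": 15.6, "Jacob": 17.5, "Gelfand": 40}
TP_512 = {
    "Johnston": 32,
    "Capper": 19,
    "Golabek": 19,
    "MacKechnie (steel)": 19,
    "MacKechnie (aluminum)": 27,
    "Chole": 34.5,
    "Jacob": 17.5,
    "Crowley": 25,
    "Gelfand": 40,
}


def mean_db(points: dict, exclude: tuple = ()) -> float:
    """Mean transition point in dB, optionally leaving out named studies."""
    values = [v for name, v in points.items() if name not in exclude]
    return round(mean(values), 1)


print("256 Hz, all studies:", mean_db(TP_256))                      # 20.5
print("512 Hz, all studies:", mean_db(TP_512))                      # 25.9
print("256 Hz, without Gelfand:", mean_db(TP_256, ("Gelfand",)))    # 15.6
print("512 Hz, without Gelfand:", mean_db(TP_512, ("Gelfand",)))    # 24.1
```

Leaving out the CHL-only study lowers the 256-Hz mean by about 5 dB, which illustrates how strongly the pooled figure depends on study population and statistical method.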

Two studies, both restricted to pediatric populations, determined the transition point for the Weber TFT, defined as the difference in hearing between the ears (in decibels) at which the majority of patients changed from central to lateralized hearing.^18,22 The intersection point of the cumulative distribution curves for the central and lateralizing Weber test results approximated a 2.5-dB²² and 4-dB¹⁸ difference between ears.

TFT Technique Effects: Rinne Method, Masking, Materials, and Force

The Rinne loudness comparison technique was superior to the timed threshold technique for detecting CHL in the 2 studies that performed within-study comparisons. Using a 256-Hz fork, Browning et al found that loudness comparison was 24% more sensitive than timed threshold for detecting ABGs ≥20 dB.¹³ Golabek and Stephens found that the transition point for the 512-Hz Rinne test was 5 dB higher when using the timed threshold technique versus the loudness comparison technique.²¹ The notable outlier among 512-Hz Rinne test accuracy estimates—Stankiewicz and Mowry,⁴ reporting 16% sensitivity—used the timed threshold method only, while the loudness comparison method was used in at least some participants in all other studies reporting the method.

Two studies reported TFT results for the same patients with and without masking, allowing for within-study comparison (Table 5). With masking, within-study sensitivity increased by 2% to 8%, and specificity did not change.

The effect of tuning fork material on Rinne testing could be assessed from 1 intrastudy comparison. MacKechnie et al²⁵ compared the ABG necessary to detect CHL with steel versus aluminum forks and found that steel forks detected an 8-dB smaller ABG. The effects of fork material could not be compared among the included studies, as too few studies recorded fork material and performed similar TFTs with different materials for meaningful assessment (Table 2).

The amount of force applied during TFTs was not standardized among included studies. One study (Johnston²⁴) evaluated the effects of levels of force. A specific measure of force (2400 g) applied during the 512-Hz Rinne test resulted in higher sensitivity (+20%) but lower specificity (–6.7%) as compared with a normal (unknown) force.

Patient and Practitioner Effects

There were variations in the testing performed on pediatric and adult patients; therefore, the effect of age as a covariate cannot be rigorously assessed. Among the 4 studies that enrolled pediatric patients only, Rinne specificity was at the lower end of the range reported for the 512-Hz (55%)¹⁸ and 256-Hz (50%)¹¹ forks, suggesting higher false-positive rates, but specificity was 98% in the 2 remaining pediatric studies.^22,27

One included study assessed the effect of practitioner experience on TFT accuracy. Burkey et al compared the 512-Hz Rinne test performed by an otology fellow with that of an experienced otologist. When performed by the less experienced individual, the test was less sensitive by 15.8% if equivocal results were included and by 29.7% if equivocal results were not included.¹⁷

Combined TFT Effect

Five studies assessed the use of TFT combinations to improve accuracy.^{13,18,20,23,26} The combined results of Weber and Rinne TFTs performed in 2 studies were inconsistent. Using 512-Hz TFTs, Capper et al observed that for a child with a normal Rinne result on one side and an abnormal on the other, the Weber lateralized to the abnormal Rinne side one-third of the time.¹⁸ Similarly, Miltenburg observed that the 256-Hz Weber and Rinne tests suggested the same type and side of hearing loss <50% of the time; however, agreement between the tests yielded a 95% sensitivity in detecting CHL.²⁶

Four studies examined the benefit of using a combination of tuning fork frequencies for Rinne testing and reached differing conclusions. Two studies observed significant correlations between the number of tuning fork frequencies with negative tests and the average ABG across audiometric speech frequencies. Doyle et al reported that the mean ABG values among ears with a negative Rinne result were as follows: at 256 Hz only, 21.5 dB; at 256 and 512 Hz, 28.4 dB; and at 256, 512, and 1024 Hz, 43.2 dB (P < .01, Spearman rho not reported). A linear regression model for ABG prediction was created: ABG = 4.5 + 11.8 (number of negative forks).²⁰ Gelfand also reported that the number of frequencies with negative tests was proportional to the median ABG across 250 to 4000 Hz (rho = 0.943, P < .01) but did not consider the number of frequencies with negative results to be predictive of CHL magnitude, because numerous exceptions were observed, particularly falsely normal results when a large ABG was present.⁶ Jacob et al found that a combination of frequencies did not reliably predict the severity of CHL: among those with negative tests at 256, 512, and 1024 Hz, the average ABG values ranged from 10 to 60 dB (mean, 40 dB).²³ Similarly, Browning et al observed that the sensitivity and specificity for detecting CHL ≥20 dB did not improve with the 256- and 512-Hz Rinne tests as compared with 256-Hz Rinne test alone.¹³
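The regression reported by Doyle et al can be evaluated for 1, 2, and 3 negative forks and set beside the mean ABG values quoted above. This is a worked illustration only: the function and variable names are hypothetical, and the comparison assumes that the quoted group means correspond to 1, 2, and 3 negative forks.

```python
def predicted_abg(n_negative_forks: int) -> float:
    """ABG in dB from the reported regression: 4.5 + 11.8 x number of negative forks."""
    return 4.5 + 11.8 * n_negative_forks


# Mean ABG (dB) reported for ears with negative Rinne results at
# 256 Hz only (1 fork), 256 and 512 Hz (2 forks), and 256, 512, and 1024 Hz (3 forks).
REPORTED_MEAN_ABG = {1: 21.5, 2: 28.4, 3: 43.2}

for n, reported in REPORTED_MEAN_ABG.items():
    fitted = predicted_abg(n)
    print(f"{n} negative fork(s): predicted {fitted:.1f} dB, "
          f"reported mean {reported} dB, difference {reported - fitted:+.1f} dB")
```

The predictions (16.3, 28.1, and 39.9 dB) track the reported means only loosely, which fits the caution expressed by the other studies about estimating CHL magnitude from the number of negative forks.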

Discussion

This systematic review aimed to evaluate the body of evidence for clinical TFTs by determining the accuracy of Weber and Rinne tests in detecting hearing loss as compared with the clinical standard audiogram. In summary, among studies without a high risk of bias, the sensitivity of the Rinne TFT to detect CHL ranged from 16% to 91%, and specificity ranged from 71% to 100%. While heterogeneity in study design precluded meta-analysis, Rinne test sensitivity was greater, and specificity lower, for the following conditions: the 256-Hz fork (vs 512 Hz); the loudness comparison technique (vs timed threshold); masking; for detecting larger CHLs; when equivocal results are considered abnormal; and, based on individual studies, when practitioners have more experience, apply more pressure, and use steel (vs aluminum) forks. The transition point at which the Rinne TFT changes from normal to abnormal in most individuals ranged from 13 to 40 dB of CHL. The Weber test has poor sensitivity for identifying unilateral CHL or SNHL by correct lateralization (18%-67%), and specificity ranged from 33% to 97%. The Weber test was more sensitive with 512-Hz forks (vs 256 Hz) and for detecting CHL versus SNHL. The Weber can lateralize in the presence of modest hearing asymmetries, ranging from 2.5 to 4 dB. These findings may inform the performance and interpretation of TFTs for various clinical scenarios.

Otolaryngologists commonly employ the Rinne TFT to confirm surgical candidacy for CHL. Among adult patients (who are most likely to have otosclerosis), the 512-Hz Rinne test had high specificity (97%-100%) and, thus, low false-positive rates for detecting at least 10 to 20 dB of CHL. As such, surgeons who reserve surgery for patients with abnormal (negative) Rinne results are highly likely to select patients with CHL. However, a negative Rinne test result does not provide assurance that patients have specific levels of CHL (eg, ≥15 dB). While transition points for the Rinne 512-Hz TFT were consistently above this level (17-40 dB), the divergent methods utilized to calculate the transition point each produced an ABG value beyond which patients become more likely than not to have a negative Rinne test result, rather than a value at which a negative Rinne result is assured. Furthermore, while more tuning fork frequencies show negative Rinne test results as the ABG widens, ABGs among patients with normal and abnormal Rinne results overlapped considerably in the included studies, precluding firm predictions. Also, clinicians should remain mindful that the 512-Hz Rinne test’s sensitivity (<75% in 8 of 10 studies and <90% in all studies) suggests that a normal (positive) test does not reliably rule out CHL. Thus, this approach may lead some appropriate surgical candidates to be offered only medical management.
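
The reasoning in this paragraph can be made concrete with Bayes' rule, which converts sensitivity and specificity into predictive values for a given pretest probability. The Python sketch below is a minimal illustration, not an analysis from the included studies. The 75% sensitivity, 97% specificity, and 50% prevalence are assumed inputs chosen to sit near the ranges discussed above, and the function names are hypothetical. An abnormal (negative) Rinne result is treated as a positive test for CHL.

```python
def _check_probability(name: str, value: float) -> None:
    """Raise ValueError unless value lies in the closed interval [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0 and 1")


def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Probability of CHL given an abnormal (negative) Rinne result."""
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity), ("prevalence", prevalence)):
        _check_probability(name, value)
    true_positive = sensitivity * prevalence
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)


def negative_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Probability of no CHL given a normal (positive) Rinne result."""
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity), ("prevalence", prevalence)):
        _check_probability(name, value)
    true_negative = specificity * (1.0 - prevalence)
    false_negative = (1.0 - sensitivity) * prevalence
    return true_negative / (true_negative + false_negative)


if __name__ == "__main__":
    # Assumed, illustrative inputs; none is a pooled estimate from the review.
    sens, spec, prev = 0.75, 0.97, 0.50
    print(f"PPV: {positive_predictive_value(sens, spec, prev):.3f}")
    print(f"NPV: {negative_predictive_value(sens, spec, prev):.3f}")
```

Under these assumed inputs, roughly 96% of abnormal results would reflect true CHL, whereas about 1 in 5 normal results would miss it. Both figures shift with the prevalence chosen, so they describe only this hypothetical population.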

Variations in TFT technique likely contributed to the wide range of accuracy values reported in these studies. Authors and clinicians reported the performance of what they considered to be “normal” Rinne or Weber TFTs. In actuality, test techniques in comparison studies often differed in ways that would be expected to influence accuracy. In intrastudy comparisons, the loudness comparison technique yielded performance superior to the timed threshold; however, some studies, like clinicians, used the techniques interchangeably or did not report the technique used. Almost half the studies did not report the orientation of the tuning fork tines relative to the EAC during Rinne testing; the majority of those reporting described a parallel orientation. Surveyed otolaryngologists differ in placement of tuning forks relative to the EAC (47% place the axis between the tines parallel to, and 45% perpendicular to, the EAC), but orienting 512-Hz fork tines parallel to the EAC results in an increased sound intensity of 2.5 dB.²⁸ Analogously, Weber test sensitivity is greatest with the incisors, but the forehead is more often used.²¹ Tuning fork activation methods, a factor poorly described in the included studies, may also affect test accuracy. Nonfundamental sound frequencies are produced when tuning forks are struck on metal or wood surfaces rather than the head, pisiform, or other soft surfaces.^29,30 While the principal tone is the same, overtones generated by harder surfaces may confound patient interpretation. Furthermore, while increasing tuning fork force on the mastoid increases Rinne sensitivity to detect CHL (potentially at the expense of specificity), this force would be difficult to standardize in studies and clinical practice. Consistency in test performance may partly explain the observation that TFT sensitivity improves with experience. As Sheehy wrote, “it is important for the individual otologist to become familiar with the normal and abnormal responses with forks the way [he or she] uses them.”¹⁰

Tuning fork selection also influences test accuracy and should be optimized for different clinical applications. Because the 256-Hz Rinne TFT has greater sensitivity (and lower false-negative rates), it may be more appropriate for screening a clinical population for CHL, with follow-up audiometry when results are abnormal. In contrast, the 512-Hz fork would be more appropriate to supplement the evaluation of candidates for stapes procedures due to its higher specificity (and lower false-positive rates). Use of nonaluminum tuning forks may also improve performance. One intrastudy comparison revealed that the Rinne transition point was 8 dB lower with steel than with aluminum forks.²⁵ A study that did not meet inclusion criteria (results of all fork frequencies were combined) reported greater accuracy with magnesium alloy forks than with steel forks.³¹ Heavier materials have more sustained sound levels, and aluminum forks undergo metal fatigue, producing shorter decay times and performance deterioration over time.³²

Clinicians should also be aware that TFT accuracy differed according to the clinical population under consideration. Included studies varied in case mix, hearing loss prevalence, and severity, which would result in spectrum effects—meaning differences in accuracy when tests are applied to distinct subgroups.^33,34 As expected, significant threshold effects were observed, with higher Rinne sensitivity for detecting greater degrees of CHL. Thus, one would expect sensitivity to be greater when a clinician applies TFTs to a population already suspected to have CHL (eg, examination and audiogram findings consistent with otosclerosis) rather than to all patients. This review also highlights the potential impact of patient age. Pediatric patients may not fully understand TFT instructions, leading to unreliable responses.³⁵ While many studies in the review did not specify patient ages, Behn et al¹¹ demonstrated that TFT accuracy for screening purposes among children 2 to 11 years old is poor and does not improve for older children.^11,27 However, Capper et al showed increased TFT reliability among older children for CHL related to “glue ear,”¹⁸ and Yung and Morris achieved good Rinne sensitivity among children with middle ear effusion (94% for bilateral effusions, 79% for unilateral),³⁶ suggesting that test performance is dependent on the target population.

This study has several limitations. Because the review was restricted to articles in English, relevant studies may have been excluded. Heterogeneity among study populations, designs, and outcomes precluded meta-analysis. Many available studies were at risk of bias. Furthermore, only 3 studies were performed in the last decade. Audiometric techniques, including masking strategies and use of circumaural headphones, have differed over time, possibly affecting “gold standard” accuracy. Additionally, as the audiogram was the standard for TFT accuracy, this review cannot address whether tuning forks effectively identify audiometric errors. Similarly, these findings inform but do not negate the value of TFTs in specific circumstances, for example, when audiometry is not available or when masking dilemmas are present.

Conclusion

TFTs are integral components of the otologic examination for many practitioners. However, there is substantial variation in reported accuracy measurements of TFTs for clinical screening, surgical candidacy assessments, and estimation of hearing loss severity. TFT accuracy may be improved and/or stabilized by identifying and consistently using optimal TFT techniques and materials and by considering specific patient populations who may most benefit from testing.

Author Contributions

Elizabeth A. Kelly, designed study, collected data, analyzed data, wrote article; Bin Li, designed study, collected data, revised article; Meredith E. Adams, designed study, analyzed data, revised article.

Disclosures

Competing interests: None.

Sponsorships: None.

Funding source: None.

Footnotes

No sponsorships or competing interests have been disclosed for this article.

This article was presented at the 2017 AAO-HNSF Annual Meeting & OTO Experience; September 10-13, 2017; Chicago, Illinois.

References

1. Albers GD. Tuning or pitch forks. J Mich State Med Soc. 1961;60:1152-1155.

2. Jackler RK. Early history of tuning-fork tests. Am J Otol. 1993;14:100-105.

3. Hildyard, Stool, Valentine. Tuning fork tests as aid to screening audiometry: report on a preliminary field study. Arch Otolaryngol Head Neck Surg. 1963;78:151-154.

4. Stankiewicz, Mowry. Clinical accuracy of tuning fork tests. Laryngoscope. 1979;89:1956-1963.

5. Shuman, Halpin, Rauch, Telian. Tuning fork testing in sudden sensorineural hearing loss. JAMA Internal Medicine. 2013;173:706-707.

6. Gelfand. Clinical precision of the Rinne test. Acta Otolaryngol. 1977;83:480-487.

7. Bagai, Thavendiranathan, Detsky. Does this patient have hearing impairment? JAMA. 2006;295:416-428.

8. Agency for Healthcare Research and Quality. Methods Guide for Medical Test Reviews. Rockville, MD: Agency for Healthcare Research and Quality; 2012. AHRQ publication 12-EC017.

9. British Society of Audiology. Recommended procedure for Rinne and Weber tuning-fork tests. Br J Audiol. 1987;21:229-230.

10. Sheehy, Gardner Jr, Hambley. Tuning fork tests in modern otology. Arch Otolaryngol Head Neck Surg. 1971;94:132-138.

11. Behn, Westerberg, Zhang, Riding, Ludemann, Kozak. Accuracy of the Weber and Rinne tuning fork tests in evaluation of children with otitis media with effusion. J Otolaryngol. 2007;36:197-202.

12. Chole, Cook. The Rinne test for conductive deafness: a critical reappraisal. Arch Otolaryngol Head Neck Surg. 1988;114:399-403.

13. Browning, Swan IRC, Chew. Clinical role of informal tests of hearing. J Laryngol Otol. 1989;103:7-11.

14. Whiting, Rutjes, Westwood, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155:529-536.

15. Review Manager (RevMan) [computer program]. Version 5.3. Copenhagen, Denmark: Nordic Cochrane Centre, Cochrane Collaboration; 2014.

16. Browning, Swan IRC. Sensitivity and specificity of Rinne tuning fork test. BMJ. 1988;297:1381-1382.

17. Burkey, Lippy, Schuring, Rizer. Clinical utility of the 512-Hz Rinne tuning fork test. Am J Otol. 1998;19:59-62.

18. Capper, Slack, Maw. Tuning fork tests in children (an evaluation of their usefulness). J Laryngol Otol. 1987;101:780-783.

19. Crowley, Kaufman. The Rinne tuning fork test. Arch Otolaryngol Head Neck Surg. 1966;84:406-408.

20. Doyle, Anderson, Pijl. The tuning fork—an essential instrument in otologic practice. J Otolaryngol. 1984;13:83-86.

21. Golabek, Stephens. Some tuning fork tests revisited. Clin Otolaryngol. 1979;4:421-430.

22. Haapaniemi, Suonpaa, Salmivalli, Virolainen. C1-tuning fork tests in school-aged children. Auris Nasus Larynx. 1996;23:26-32.

23. Jacob, Alexander, Nalinesha, Nayar. Can Rinne’s test quantify hearing loss? Ear Nose Throat J. 1993;72:152-153.

24. Johnston. A new modification of the Rinne test. Clin Otolaryngol. 1992;17:322-326.

25. MacKechnie, Greenberg, Gerkin, et al. Rinne revisited: steel versus aluminum tuning forks. Otolaryngol Head Neck Surg. 2013;149:907-913.

26. Miltenburg. The validity of tuning fork tests in diagnosing hearing loss. J Otolaryngol. 1994;23:254-259.

27. Wilson, Woods. Accuracy of the Bing and Rinne tuning fork tests. Arch Otolaryngol Head Neck Surg. 1975;101:81-85.

28. Butskiy, Hodgson, Nunez. Rinne test: does the tuning fork position affect the sound amplitude at the ear? J Otolaryngol Head Neck Surg. 2016;45:21.

29. Stevens, Pfannenstiel. The otologist’s tuning fork examination—are you striking it correctly? Otolaryngol Head Neck Surg. 2015;152:477-479.

30. Samuel, Eitelberg. Tuning forks: the problem of striking. J Laryngol Otol. 1989;103:1-6.

31. White. The Rinne test: its use in predicting magnitude of conductive hearing loss. Laryngoscope. 1974;84:459-467.

32. Yuksel, Kemaloglu. Acoustic analysis of used tuning forks. J Int Adv Otol. 2017;13:239-242.

33. Willis. Spectrum bias—why clinicians need to be cautious when applying diagnostic test studies. Fam Pract. 2008;25:390-396.

34. Leeflang, Bossuyt, Irwig. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol. 2009;62:5-12.

35. Kavitha, Jose, Anurudhan, Baby. Hearing assessment of kindergarten children in Mangalore. J Clin Diagn Res. 2009;3:1261-1265.

36. Yung, Morris. Tuning-fork tests in diagnosis of serous otitis media. BMJ Clinical Research Ed. 1981;283:1576.