Abstract
Background
Ultrasonography of the testis is a well-established diagnostic tool for detecting testicular microlithiasis (TML). Detection of TML is influenced by operator-dependent diagnostic variation related to skill, knowledge, and consistency.
Purpose
To determine inter- and intraobserver agreement in the detection of TML with ultrasonography, comparing a group of physicians with no or limited experience with a group of experienced senior radiologists.
Material and Methods
Between May and September 2014, six observers evaluated scrotal ultrasonography recordings from 34 patients, recorded from September to December 2013. The observers were blinded to patient history and previous ultrasonography. Three observers had no or limited experience with detection of TML, and three had more than 15 years of experience. Each observer reviewed all the scrotal ultrasonography recordings twice, 3 months apart.
Results
Inter-observer agreement ranged from substantial up to almost perfect (κ = 0.86). Both the experienced and the less experienced observers showed higher agreement in detecting and grading TML in their second reading.
Conclusion
The ultrasonography grading system for TML used in this study proved reproducible, with inter- and intraobserver agreement ranging from substantial to almost perfect; many years of experience were not necessarily essential.
Keywords
Introduction
Ultrasonography (US) of the testis is a well-established diagnostic tool for detecting testicular microlithiasis (TML); however, the clinical significance of TML is still debated, because some authors have found TML to be associated with testicular malignancy while others have not (1–3). Several factors influence detection of TML, such as operator-dependent diagnostic variation related to skill, knowledge, and consistency. TML was first described by Doherty et al. (4) as innumerable tiny bright echoes diffusely and uniformly scattered throughout the substance of the testes. To our knowledge, no studies have investigated either the inter- or the intraobserver agreement in the detection and grading of TML.
The aim of this study was to determine the inter- and intraobserver agreement in TML detection and to investigate differences in TML detection between a group of physicians with no or limited experience and a group of experienced senior radiologists.
Material and Methods
The study was approved by the Regional Scientific Ethic Committee of Southern Denmark (ID: S-20120144) and the Danish Data Protection Agency.
There are different approaches to determining the amount of TML in a testicle. A simple grading system used by Backus et al. (5) and Aizenstein et al. (6) termed more than 5 microliths per transducer field multiple TML. A more widely accepted classification distinguishes classic TML (5 or more microliths per testis) from limited TML (fewer than 5 microliths per testis) (Fig. 1a), and these terms are used by several authors (7–16). The model has since been refined into grade 1 (5–10 microliths per testis) (Fig. 1b), grade 2 (10–20 microliths per testis) (Fig. 1c), and grade 3 (more than 20 microliths per testis) (Fig. 1d) (12,17,18), and further developed into a more logical model based on TML per testis (19). The grading in this study follows the latter definition.
Fig. 1. (a) A testicle with limited TML; (b) a testicle with 5–10 TML; (c) a testicle with 10–20 TML; (d) a testicle with more than 20 TML.
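The per-testis grading referred to in the Discussion (grade 1 with 1–4 microliths, grade 3 with 11–20, and grade 4 with more than 20) can be sketched as a small lookup. This is a minimal illustration, not the software used in the study: the function name `tml_grade` is hypothetical, and the 5–10 range for grade 2 is an assumption inferred from the neighbouring grades.

```python
def tml_grade(microliths: int) -> int:
    """Map a per-testis microlith count to a TML grade (0 means no TML)."""
    if microliths < 0:
        raise ValueError("microlith count cannot be negative")
    if microliths == 0:
        return 0  # no TML
    if microliths <= 4:
        return 1  # limited TML, 1-4 microliths
    if microliths <= 10:
        return 2  # ASSUMED range 5-10; not stated explicitly in the text
    if microliths <= 20:
        return 3  # 11-20 microliths
    return 4      # more than 20 microliths


if __name__ == "__main__":
    # Hypothetical counts, chosen only to touch every grade once.
    for count in (0, 3, 7, 15, 42):
        print(count, "->", tml_grade(count))
```

Keeping the cut-offs in one function makes it easy to swap in another per-testis or per-field definition when comparing grading systems.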
Recruitment of patients
All patients gave written informed consent before entering the study. In total, 112 patients were referred for scrotal US examination at the radiology department from September to December 2013. A total of 24 patients with TML were offered enrolment in the study. A further 10 patients without TML, scheduled for US for various clinical indications (e.g. pain, swelling, and discomfort), were included after their primary US examination. These 10 patients were included mainly to make the setting more realistic and to keep the observers attentive. No patient refused participation.
Patients referred for US examination of the scrotum were eligible if a usable ultrasonography cine-loop lasting at least 20 s was available. Patients with macrocalcifications or tumors were not included.
Ultrasonography procedures
US was performed with a Siemens S3000 ultrasound machine (Acuson Corporation, Siemens, Mountain View, CA, USA) using a 9L4 linear-array transducer. No patient preparation was necessary.
The following data were recorded: length, height, and width of the testicles, and patient age. An ultrasonography cine-loop of each testicle, lasting between 20 and 30 s, was recorded at a steady pace. Each observer viewed a total of 67 ultrasound recordings from 34 patients. Recordings were obtained from both testicles, except in one patient who had previously undergone a left-sided orchiectomy due to testicular cancer. Two of the authors (MP and SR) performed all the US recordings of the testicles. Examinations were stored in DICOM format, allowing the observers to review the recordings more than once.
Observers
Two observers (observers 2 and 3) had limited experience with US and one (observer 1) had none; three (observers 4, 5, and 6) were trained specialist radiologists with more than 15 years of experience in US of the scrotum. The six observers analyzed all 67 US recordings individually and again after 3 months (May and August 2014). The observers were blinded to patient data, patient history, and results of previous examinations. The observers evaluated the degree of TML in each cine-loop recording.
All observers used the same screen to view the 67 recordings. Between the two readings, the observers received two blinded US recordings of testicles with TML from patients who were not included in the study. The main purpose was to weaken memories of the first reading and so limit recall bias.
In the second reading, the US recordings were presented in random order to further blur the memory of the first reading. Each observer had 1 day, with no time limit within that day, to review the 67 recordings. The review sessions took place in an undisturbed room, and the observers could not discuss the cases with colleagues.
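The text does not say how the random order was generated. A seeded shuffle is one simple way to do it, sketched here under the assumption that recordings are identified by the hypothetical numbers 1 to 67 and that one seed is chosen per reading session.

```python
import random


def reading_order(recording_ids, seed):
    """Return a shuffled copy of the recording IDs for one reading session."""
    rng = random.Random(seed)   # seeded so the order can be reproduced later
    order = list(recording_ids)
    rng.shuffle(order)
    return order


# Hypothetical IDs 1..67; the study's real identifiers are not given.
recordings = list(range(1, 68))
first = reading_order(recordings, seed=1)    # seed values are arbitrary
second = reading_order(recordings, seed=2)
print(second[:5])
```

Using a local `random.Random` instance rather than the module-level generator keeps each session's order independent of any other random calls in the same program.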
Statistical analysis
For proper assessment of the variation between and within observers, it was essential to identify all sources of random variation in the data. Both the observers and the sets of testicles used in the study were considered random samples from larger populations. Data were sampled in a crossed design, in which every observer evaluated every set of testicles twice. Apart from the measurement error on individual observations, there was consequently random variation: (i) between pairs of testicles observed within observers at each reading; (ii) between readings of the same pairs within observers; and (iii) between observers. These quantities were estimated with a logistic random effects model in which the outcome variable was a dichotomisation of the TML readings; the estimates were adjusted for the age of the patients and the volumes of the testicles.
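One way to write such a crossed logistic random effects model is sketched below. The notation is illustrative, and the level at which the reading effect enters is an assumption, since the text does not spell it out: $Y_{ijkt}$ is the dichotomised TML reading for testicle $t$ of patient $i$ by observer $j$ at reading $k$, and $\mathbf{x}_{it}$ holds the adjustment covariates (patient age and testicular volume).

```latex
\begin{aligned}
\operatorname{logit}\,\Pr(Y_{ijkt}=1 \mid u_i, v_j, w_{ijk})
  &= \beta_0 + \mathbf{x}_{it}^{\top}\boldsymbol{\beta} + u_i + v_j + w_{ijk},\\
u_i &\sim \mathcal{N}(0,\sigma_u^2) && \text{(patients, i.e. pairs of testicles)},\\
v_j &\sim \mathcal{N}(0,\sigma_v^2) && \text{(observers)},\\
w_{ijk} &\sim \mathcal{N}(0,\sigma_w^2) && \text{(readings of the same pair within an observer)}.
\end{aligned}
```

Patients and observers are crossed because every $u_i$ is combined with every $v_j$. The relative sizes of $\sigma_u^2$, $\sigma_v^2$, and $\sigma_w^2$ then describe how much of the random variation is attributable to each source.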
The analyses were performed with STATA statistical software (version 13.1, STATA Corporation, College Station, TX, USA).
Results
The six observers interpreted recordings of 67 testicles (34 right and 33 left) from 34 male patients with a median age of 47 years (age range, 20–77 years). The mean reading time of the recordings ranged from 2.4 to 3.6 min. The median testicular volume was 9.06 mL (mean, 10.7 mL; SD, 6.89 mL; range, 1.19–49.66 mL).
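Testicular volume was derived from the length, height, and width measured with US, but the text does not say which formula was used. The sketch below uses the geometric ellipsoid formula as an assumption; other conventions with a different constant exist, and the dimensions in the example are hypothetical.

```python
import math


def ellipsoid_volume_ml(length_cm, height_cm, width_cm):
    """Volume of an ellipsoid with the given diameters; 1 cm^3 equals 1 mL."""
    if min(length_cm, height_cm, width_cm) <= 0:
        raise ValueError("all dimensions must be positive")
    # pi/6 * L * H * W is the exact volume of an ellipsoid with these diameters.
    return math.pi / 6.0 * length_cm * height_cm * width_cm


# Hypothetical dimensions in cm, not taken from the study data.
example_volume = ellipsoid_volume_ml(4.0, 2.5, 2.0)
print(f"{example_volume:.2f} mL")
```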
Inter-observer agreement
Results of the six observers’ two readings.
Inter-observer agreements by the six observers in readings 1 and 2.
Inter-observer agreement between all the observers in readings 1 and 2.
Intraobserver agreement
Intraobserver agreement of the two readings.
Random effect logistic regression
A mixed-effects logistic regression analysis was used to model the variation in TML category. The preliminary model used patient age, testicular volume, and testicular volume squared as fixed effects. Testicular volume was based on the length, height, and width measured with US. Observers, patients, and double readings entered the model as random effects, with observers and patients being crossed. Testicular volume and testicular volume squared were both statistically significant, with odds ratios of 5.88 (95% CI, 1.86–18.51, P = 0.002) and 0.93 (95% CI, 0.87–0.98, P = 0.008), respectively. Age was not statistically significant and was consequently excluded from the model.
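Because an odds ratio above 1 for volume is paired with an odds ratio below 1 for volume squared, the fitted log-odds form an inverted parabola in volume. The sketch below works through that arithmetic with the reported point estimates. It assumes that volume entered the model in mL and uncentred, which the text does not state, so the turning point it prints is illustrative only and ignores the wide confidence intervals.

```python
import math

OR_VOLUME = 5.88      # reported odds ratio for testicular volume
OR_VOLUME_SQ = 0.93   # reported odds ratio for testicular volume squared

b1 = math.log(OR_VOLUME)      # log-odds coefficient for volume
b2 = math.log(OR_VOLUME_SQ)   # log-odds coefficient for volume squared (negative)


def volume_log_odds(volume):
    """Volume contribution to the log-odds, ignoring intercept and random effects."""
    return b1 * volume + b2 * volume ** 2


# Vertex of the parabola b1*v + b2*v^2, where the contribution is largest.
turning_point = -b1 / (2.0 * b2)
print(f"turning point at about {turning_point:.1f} (units ASSUMED to be mL)")
```

Under these assumptions the volume contribution rises up to the turning point and falls beyond it, which is what a significant negative quadratic term implies.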
Random variations between observers, patients, and the two readings.
Note that most of the variation (96%) is due to heterogeneity between patients, whereas only a small part (1.7%) is due to heterogeneity between observers.
Discussion
This study focuses on the inter- and intraobserver agreement in TML detection in the testis. The results show that the most common findings were no TML and limited TML. The ultrasonography grading system for TML used in this study proved reproducible, with inter- and intraobserver agreement ranging from substantial to almost perfect; many years of experience were not necessarily essential.
Microlithiasis in the testicles is detected only by US. Microlithiasis can also be found in other organs, such as the breast, where the primary imaging tool is mammography. Microcalcifications in the breast have a more crucial role in cancer detection than in the testicles, where it is suggested that microcalcifications have an association with cancer progression (20,21).
Inter- and intraobserver agreement is frequently assessed with kappa statistics. However, the kappa statistic does not reflect the more complex crossed random-effects design that was also used in this study. Furthermore, it cannot account for variation associated with patient age, testicular volume, or other covariates. The 67 recordings in the present study resemble patients in daily clinical practice, ranging from testicles with no TML, through testicles with a few or more microliths, to testicles with more than 20 microliths.
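For reference, the pairwise kappa statistic can be computed as in the following sketch. It implements unweighted Cohen's kappa for two readers; whether the study used this exact variant (rather than, for example, a weighted kappa) is not stated, and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items given the same category by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical grades (0-4) for ten recordings; not study data.
reader_1 = [0, 0, 1, 1, 2, 3, 4, 0, 1, 4]
reader_2 = [0, 1, 1, 1, 2, 3, 4, 0, 0, 4]
print(round(cohen_kappa(reader_1, reader_2), 2))
```

In this toy example the readers agree on 8 of 10 recordings, but kappa is lower than 0.8 because part of that agreement would be expected by chance.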
Interestingly, the mixed-effects logistic regression model showed that the degree of TML seems to be associated with testicular volume. The model was used to determine the association between the binary outcome for TML category and the variables testicular volume and patient age, while its random intercepts accounted for random variation between assessments, patients, and observers. The analysis showed no significant effect of age. It is therefore reasonable to assume that the degree of TML is constant across age. Only the volume of the testicle seems to affect the degree of TML.
Table 3 presents the inter-observer agreement. The overall agreement between observers ranged from moderate (κ = 0.46) up to almost perfect (κ = 0.86). The almost perfect agreement (κ = 0.86) was found between an observer with no experience (observer 1) and a specialist radiologist with many years of experience (observer 4), suggesting that experience may not be paramount when detecting TML. In support of this finding, kappa values were also high for observer 2 versus observer 4 (0.70 and 0.77) and for observer 3 versus observer 4 (0.61 and 0.81). In general, kappa values were higher in the second reading than in the first. This could be due to more focus on TML between the two readings. The diagnostic accuracy increased overall in the second reading, especially in the category with no TML (κ = 0.49–0.74) and in grade 1 (1–4 microliths), most likely also because of higher awareness of TML during the 3 months between the readings. However, the diagnostic accuracy was almost unchanged in grade 3 (11–20 microliths) and in grade 4 (more than 20 microliths). This suggests that awareness of TML affects the diagnostic accuracy among observers.
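The verbal labels attached to the kappa values here are consistent with the widely used Landis and Koch bands. The sketch below encodes those bands; that the study applied exactly these cut-offs is an assumption, and the function name is hypothetical.

```python
def agreement_label(kappa):
    """Return the Landis and Koch label for a kappa value."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "poor"
    # Upper bounds of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"  # not reached; kept as a safeguard


# Kappa values quoted in the text.
for value in (0.46, 0.61, 0.70, 0.77, 0.81, 0.86):
    print(value, agreement_label(value))
```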
Table 1 shows the variation between the observers. The variation was most visible in the category no TML, where the detection rate ranged from 8 (observer 3) to 35 (observer 5). In reading 2, however, the detection rate in this category was more stable, in the range of 20–28.
Experience and education are central in all medical imaging modalities. Observers 4, 5, and 6 had more than 15 years of experience with US. Nevertheless, observer 1, who had no experience with US, achieved the same results in the second reading as one of the trained radiologists (observer 4).
The majority of the cine-loops were interpreted with heterogeneous results regarding the detection and grading of TML. The grading frequently differed between observers: they disagreed on 54 of 67 recordings in reading 1 and on 46 of 67 recordings in reading 2. The six observers were in complete agreement on 13 testicles (19.4%) in reading 1, which increased to 21 testicles (31.3%) in reading 2.
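The complete-agreement figures follow directly from the disagreement counts: 67 minus 54 gives 13 recordings in reading 1, and 67 minus 46 gives 21 in reading 2. The sketch below checks that arithmetic and shows how complete agreement could be counted from a table of grades; the per-recording grades are hypothetical.

```python
def complete_agreement(ratings_by_recording):
    """Count recordings on which every observer gave the same grade."""
    return sum(1 for grades in ratings_by_recording if len(set(grades)) == 1)


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Hypothetical grades from six observers for three recordings; not study data.
toy_ratings = [
    (0, 0, 0, 0, 0, 0),   # all six agree
    (1, 1, 2, 1, 1, 1),   # one observer differs
    (4, 4, 4, 4, 4, 4),   # all six agree
]
print(complete_agreement(toy_ratings))

TOTAL_RECORDINGS = 67
agree_reading_1 = TOTAL_RECORDINGS - 54   # disagreement on 54 recordings
agree_reading_2 = TOTAL_RECORDINGS - 46   # disagreement on 46 recordings
print(agree_reading_1, percent(agree_reading_1, TOTAL_RECORDINGS))
print(agree_reading_2, percent(agree_reading_2, TOTAL_RECORDINGS))
```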
To the best of our knowledge, no other studies have investigated the inter- and intraobserver agreement in the detection of TML in testicles with US. More generally, we are not aware of any studies dealing with inter- and intraobserver agreement in scrotal US.
A limitation of this study was the absence of a true gold standard revealing the true presence of TML in our cases. Such a standard would have enabled estimation of sensitivity and specificity for the visual screening method. Further studies could evaluate inter- and intraobserver agreement in the detection of TML against a composite gold standard, e.g. one established by two experts in consensus or by biopsy. Another limitation is that some studies grade TML per field of view rather than per testis, which makes it difficult to compare the grading systems with one another. It is also a limitation that one of the observers (SR) performed some of the US recordings. All the observers reviewed all 67 recordings during 1 day, and therefore a risk of fatigue cannot be excluded. One advantage of working with recorded examinations is the opportunity to study the US until the observer is ready to make a diagnosis. The possible recall bias was reduced by the 3-month interval between the two readings. The study deals with observer variation; however, the operator factor was not analyzed. This would require all the patients to have their scrotum examined by all the observers, which is impractical.
In conclusion, the ultrasonography grading system for TML used in this study proved reproducible, with inter- and intraobserver agreement ranging from substantial to almost perfect; many years of experience were not necessarily essential. Both the experienced and the less experienced observers also showed higher agreement in detecting, grading, and excluding TML in their second reading.
Footnotes
Acknowledgements
The authors thank sonographer Karl Erik Stovgaard for technical support.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
