Abstract
Background
A high false-positive rate remains a technical obstacle to the broad application of deep-learning-based diagnostic tools for rib fractures in routine radiological practice.
Purpose
To examine the performance of two versions of deep-learning-based software tools in aiding radiologists in diagnosing rib fractures on chest computed tomography (CT) images.
Material and Methods
In total, 123 patients (708 rib fractures) were included in this retrospective study. Two groups of radiologists with different experience levels retrospectively reviewed images for rib fractures in the concurrent mode aided with RibFrac-High Sensitivity (HS) and RibFrac-High Precision (HP). We compared their diagnostic performance against the reference standard in terms of sensitivity and positive predictive value (PPV).
Results
On a per-patient basis, RibFrac-HS exhibited a higher sensitivity compared with RibFrac-HP (mean difference=0.051, 95% CI=0.012–0.090; P = 0.011), whereas the latter significantly outperformed the former in terms of the PPV (mean difference=0.273, 95% CI=0.238–0.308; P < 0.0001). The use of RibFrac-HP significantly improved the junior and the senior groups’ sensitivities respectively by 0.058 (95% CI=0.033–0.083; P < 0.0001) and 0.058 (95% CI=0.034–0.081; P < 0.0001), and decreased the diagnosis time by 206 s (95% CI=191–220; P < 0.0001) and 79 s (95% CI=67–92; P < 0.0001), respectively, when compared to no software assistance.
Conclusion
The sensitivity and efficiency of radiologists in identifying rib fractures can be improved by using RibFrac-HS and/or RibFrac-HP. With an added module for false-positive suppression, RibFrac-HP maintains the sensitivity and increases the PPV in fracture detection compared to RibFrac-HS.
Keywords
Introduction
A rib fracture is one of the most common consequences of traumatic injuries, affecting around 10% of trauma patients in general (1) and nearly 40%–80% of patients experiencing blunt chest trauma due to high-impact collision (2). Studies have shown that the mortality rate in patients with post-traumatic rib fractures is in the range of 10%–16% (3), and it may increase to as high as 65% when pulmonary infection occurs secondary to rib fracture (4). Therefore, prompt and effective detection of post-traumatic rib fractures is needed to stratify the severity of the trauma and to determine appropriate patient management.
Chest digital radiography (DR) (5) and computed tomography (CT) (6) are the most commonly used imaging techniques for early assessment of patients with chest trauma. Although radiographs can be obtained more expediently in the emergent setting, incomplete or non-displaced fractures can frequently be missed (7). On the other hand, performing and interpreting CT exams is more time-consuming than for radiographs because of the large number of images and the frequent coexistence of other injuries and incidental findings (8). In addition, the anatomy and orientation of the rib cage require radiologists to follow each rib on multiple slices (whether in axial, coronal, or sagittal views) to detect sometimes subtle discontinuity or distortion of the rib cortices, making fracture detection even more difficult than in other osseous structures (9).
In recent decades, considerable progress has been made in the application of deep learning (DL) to medical image processing and recognition. As a consequence, DL-based computer-aided diagnosis (CAD), such as the detection and characterization of lung (10), prostate, and breast cancers (11,12), has been widely implemented in clinical practice. In terms of fracture detection, CAD has proven accurate and efficient in facilitating the diagnosis of common fractures, including those of the proximal humerus, femoral neck, and wrist joint (13–15). However, its precision, performance, and practical feasibility in detecting rib fractures remain inconclusive, as several studies have reported a high occurrence of false-positive results (9,16).
The aim of the present study was to evaluate the performance of two versions of a DL-based CAD tool for rib fractures (RibFrac; Aitrox, Shanghai, PR China), namely the high sensitivity version (RibFrac-HS) and the high precision version (RibFrac-HP), in assisting radiologists in identifying blunt trauma-induced rib fractures. RibFrac-HS was designed to capture as many rib fractures as possible from CT images, whereas RibFrac-HP had an additional module to suppress likely false positives arising during automatic detection. Here, based on a rib-fracture dataset of 123 cases with the available ground truth as the reference standard, we compared the performance of two groups of radiologists in diagnosing the same rib fractures with and without the aid of the software tools in terms of sensitivity, positive predictive value (PPV), and diagnosis time.
Material and Methods
This study was approved by the Institutional Review Board of Shanghai Changzheng Hospital before patient information was accessed, and the requirement for informed consent of patients was waived due to the retrospective nature of the analysis and the anonymity of the data.
Case selection
Diagnostic imaging records of patients who underwent thoracic CT in Shanghai Changzheng Hospital between January 2015 and December 2019 were retrospectively reviewed with the following inclusion criteria: (i) images were from patients with thoracic trauma only; (ii) images were available with a slice thickness of ≤1 mm; and (iii) images were of fair quality without breathing or motion artifacts. Images were excluded if the patient had undergone internal rib fixation or any other thoracic surgery (e.g. lung wedge resection or mastectomy). After screening, the images of 123 patients were enrolled in our study. This cohort of 123 patients is also related to our other project, which requires patients to have DR and CT examinations at the same time. Fig. 1 shows the case selection approach and the number of patients finally included in the study according to the inclusion and exclusion criteria. Both RibFrac-HS and RibFrac-HP were trained with 1500 chest CT images from two other hospitals.

The strategy used for case selection and the number of patients finally recruited per the inclusion and exclusion criteria.
Image acquisition and preprocessing
The CT images were acquired by using one of the following multidetector CT scanners (manufacturer: Toshiba, General Electric, or Philips) with the following scan settings: tube voltage = 120 kVp; tube current = 50–150 mAs; image matrix = 512 × 512 pixels; and scanning duration = 0.5 s. The included images were reconstructed using a standard reconstruction algorithm (17) with a thickness in the range of 0.625–1 mm. All scans were performed from the level of the thoracic cavity to the level of the upper part of the kidney with a slice thickness of 1 mm.
Reference standard of rib fractures
A rib fracture was regarded as the ground truth if the lesion had been independently annotated by two senior radiologists (S.X. and P.W., with 13 and 15 years of experience in chest CT imaging, respectively). In cases where consensus was not reached, a third expert (Y.X., with 25 years of experience) evaluated the lesion and discussed it with the other two, and the conclusion agreed in that discussion was taken as the reference standard. If there was still disagreement, the lesion was removed. All image annotations were performed on the Prolego (Image Processing System, Aitrox Technology Corporation Limited, Shanghai, China). All CT images were observed under a bone window, with a window width (WW) of 1400 HU and a window level (WL) of 600 HU. We divided all rib fractures into five categories: completely displaced fracture; completely non-displaced fracture; incomplete fracture; cortical distortion; and callus formation. A completely displaced fracture was characterized by a dislocation >2 mm. Fractures without significant dislocation, including simple, oblique, transverse, or butterfly fractures, were classified as completely non-displaced fractures. A fracture was classified as incomplete if part of the cortical bone was discontinuous but the rib was not completely broken. A fracture with “focal deformity” was classified as cortical distortion. Callus formation was defined as focal sclerosis of the rib, with or without cortical displacement.
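The bone-window setting described above (WW = 1400 HU, WL = 600 HU) can be illustrated with a minimal sketch. It assumes the conventional linear windowing rule, in which values below WL − WW/2 are shown as black and values above WL + WW/2 as white; the function name `apply_window` is hypothetical and is not part of Prolego.

```python
def apply_window(hu, window_width=1400.0, window_level=600.0):
    """Map a Hounsfield-unit value to a display intensity in [0, 1].

    Assumes conventional linear windowing; not taken from the study software.
    """
    lower = window_level - window_width / 2.0  # -100 HU for this bone window
    upper = window_level + window_width / 2.0  # 1300 HU for this bone window
    clipped = min(max(hu, lower), upper)       # clamp to the window range
    return (clipped - lower) / (upper - lower)


if __name__ == "__main__":
    for hu in (-1000, -100, 600, 1300, 2000):
        print(hu, apply_window(hu))
```

With these settings, air and soft tissue below −100 HU are clipped to black, while cortical bone occupies most of the displayed gray range.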
DL model architecture
The architecture of “RibFrac-HS” consisted of two modules: (i) a 3D-region proposal network to detect rib fractures based on the combination of a modified U-Net (18) and a ResNet Block (19); and (ii) a bounding box eliminator clearing the regions outside of the rib regions (Fig. 2). Compared to “RibFrac-HS,” “RibFrac-HP” had an additional U-Net-based classification network (18) in between those two modules to suppress false positives (Figs. 2 and 3). All networks were trained using PyTorch version 1.5.1 (20) on the platform of Python version 3.6.8 (Python Software Foundation, Wilmington, DE, USA).

Architectures of the developed DL-based software tools (a) “RibFrac-HS” and (b) “RibFrac-HP.” DL, deep learning.

Structures of the 3D-region proposal network (top) and the novel false-positive suppression module (bottom) in RibFrac-HP.
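The difference between the two pipelines described in this section can be sketched schematically. The code below is only an illustration of the stage ordering, not the actual networks: the `Candidate` fields, the function names, and the 0.5 threshold are all hypothetical, and the neural-network stages are replaced by precomputed values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One candidate box from the proposal stage (all fields hypothetical)."""
    center: Tuple[float, float, float]  # box center in voxel coordinates
    fp_probability: float               # stand-in for the classifier output
    inside_rib_region: bool             # stand-in for the rib-region check


def suppress_false_positives(cands: List[Candidate],
                             threshold: float = 0.5) -> List[Candidate]:
    """Extra stage present only in the HP-style pipeline (assumed threshold)."""
    return [c for c in cands if c.fp_probability < threshold]


def eliminate_outside_ribs(cands: List[Candidate]) -> List[Candidate]:
    """Final stage shared by both pipelines: drop boxes outside the ribs."""
    return [c for c in cands if c.inside_rib_region]


def run_hs(proposals: List[Candidate]) -> List[Candidate]:
    """HS-style pipeline: proposals -> bounding box eliminator."""
    return eliminate_outside_ribs(proposals)


def run_hp(proposals: List[Candidate],
           threshold: float = 0.5) -> List[Candidate]:
    """HP-style pipeline: proposals -> FP suppression -> box eliminator."""
    return eliminate_outside_ribs(suppress_false_positives(proposals, threshold))
```

Because the HP-style pipeline only removes candidates from the HS-style output, it can lower the false-positive count but cannot raise sensitivity, which matches the trade-off reported in the Results.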
Diagnostic performance evaluation
We investigated the diagnostic performance of two groups of radiologists with varied clinical experiences in chest CT interpretation, respectively, with and without the aid of the two software tools. The junior radiologists’ group consisted of Q.X., Q.S., and C.S., with two, three, and three years of experience, respectively. The senior group consisted of H.S., S.C., and X.W., with eight, nine, and nine years of experience, respectively.
Their performances in identifying rib fractures with and without software assistance were evaluated according to the following steps (Fig. 4): (i) the participating radiologists individually marked the location of rib fractures based on their own judgment using a dedicated software tool, Prolego (Image Processing System, Aitrox Technology Corporation Limited, Shanghai, China); (ii) RibFrac-HS and RibFrac-HP were retrospectively applied to the collected rib-fracture dataset, generating a report including the location of all rib fractures identified for each patient; (iii) referring to the reports generated in step (ii), the radiologists re-examined all CT images and marked any location of fractures they could identify, and the new diagnosis time was also recorded; and (iv) locations of rib fractures annotated in steps (i), (ii), and (iii) were respectively compared against the ground truth, in terms of the detection sensitivity and PPV, calculated as:
\[ \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{PPV} = \frac{TP}{TP + FP} \]
where TP, FN, and FP denote the numbers of true-positive, false-negative, and false-positive detections, respectively.

A candidate rib fracture was detected by RibFrac-HS, and the radiologist suspected cortical distortion; RibFrac-HP did not output this candidate. Three-dimensional reconstruction finally confirmed that there was no rib fracture.

A candidate rib fracture was detected and output by RibFrac-HS; the radiologist judged that it was not a rib fracture and removed it. RibFrac-HP did not output this candidate.
An identified rib fracture was regarded as a true positive if the center of its bounding box fell inside the bounding box of a rib fracture annotated in the reference standard (i.e. the ground truth).
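The true-positive rule and the two metrics can be sketched as follows. This is a minimal illustration, assuming axis-aligned 3D boxes given as `(x0, y0, z0, x1, y1, z1)`; the function names are hypothetical. How duplicate detections of the same fracture are counted is not stated in the text, so the sketch makes an explicit assumption: sensitivity counts each reference fracture at most once, and PPV counts every detection whose center falls inside any reference box.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float, float]  # (x0, y0, z0, x1, y1, z1)


def center(box: Box) -> Tuple[float, float, float]:
    """Geometric center of an axis-aligned box."""
    x0, y0, z0, x1, y1, z1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2, (z0 + z1) / 2)


def contains(box: Box, point: Tuple[float, float, float]) -> bool:
    """True if the point lies inside the box (boundaries included)."""
    x0, y0, z0, x1, y1, z1 = box
    x, y, z = point
    return x0 <= x <= x1 and y0 <= y <= y1 and z0 <= z <= z1


def score(detections: List[Box], references: List[Box]) -> Tuple[float, float]:
    """Return (sensitivity, PPV); NaN when a denominator is zero."""
    hit_references = set()   # reference fractures found at least once
    correct_detections = 0   # detections whose center lies in a reference box
    for det in detections:
        c = center(det)
        hits = [i for i, ref in enumerate(references) if contains(ref, c)]
        if hits:
            correct_detections += 1
            hit_references.update(hits)
    nan = float("nan")
    sensitivity = len(hit_references) / len(references) if references else nan
    ppv = correct_detections / len(detections) if detections else nan
    return sensitivity, ppv
```

For example, with two reference fractures and two detections of which one is centered inside a reference box, both sensitivity and PPV are 0.5.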
Note that, in this diagnostic performance test, (i) there was a period of at least one month (21) between steps (i) and (iii) to minimize the radiologists’ diagnostic bias; (ii) all radiologists were always blind to the information of patients; and (iii) the order of CT scans was randomized before each new round of the diagnostic test. The time needed by each radiologist to reach a diagnosis for each patient was automatically recorded by Prolego (Image Processing System, Aitrox Technology Corporation Limited, Shanghai, China). Before this study, all radiologists involved had been trained to be familiar with the operation of the Prolego.
Statistical analysis
To compare the performances of diagnoses made with and without artificial intelligence (AI) assistance, a paired-sample two-sided t-test was applied. A difference was considered statistically significant when a P value was <0.05. The statistical analysis was performed on the R language platform version 4.0.0 (The R Foundation for Statistical Computing, Vienna, Austria).
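The paired comparison described above can be illustrated with a short, dependency-free sketch. The study itself used R; the Python code below is only an illustration of a paired-sample two-sided t-test, and the function names are hypothetical. The P value is obtained by numerically integrating the Student t density with Simpson's rule, which is an implementation choice of this sketch rather than anything stated in the text.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def paired_t_test(a, b, steps=2000):
    """Return (mean difference, t statistic, two-sided P value) for a - b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(n))
    df = n - 1
    # Simpson's rule for the area under the t density between 0 and |t|.
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    p_value = max(0.0, 1.0 - 2.0 * area)  # both tails beyond |t|
    return mean(diffs), t, p_value
```

Here `a` and `b` would hold, for instance, the per-patient sensitivities with and without AI assistance for the same patients, in the same order. In practice, a library routine such as `scipy.stats.ttest_rel` performs the same test.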
Results
Patient demographics
A total of 123 patients (82 men [67%], 41 women [33%]; mean age = 54 years; age range = 46–64 years), with 708 rib fractures identified as the ground truth, met the inclusion and exclusion criteria. Of the 708 fractures, 360 (51%) were completely displaced fractures, 170 (24%) were completely non-displaced fractures, 113 (16%) were incomplete fractures, and 40 (6%) were cortical distortions. The remaining 25 (3%) were callus formations. See Fig. 1 for the strategy used for case selection and the number of patients included, and Table 1 for a description of the patient characteristics.
Patient demographics.
Values are given as n or median (IQR).
Diagnostic performance of radiologists with and without AI assistance
Table 2 presents the sensitivities, PPVs, and the reading time in two groups of radiologists in diagnosing rib fractures with and without the support of RibFrac-HS and RibFrac-HP tools.
Summary of the diagnostic performance and diagnosis time of junior and senior radiologists in identifying rib fractures in CT images with and without the aid of two software tools.
Values are given as mean ± SD (95% CI).
AI, artificial intelligence; CI, confidence interval; CT, computed tomography; PPV, positive predictive value; SD, standard deviation.
On a per-patient basis, the mean diagnostic sensitivities of the junior radiologists’ group without AI assistance, with the assistance of RibFrac-HS, and with the assistance of RibFrac-HP were 0.71 (95% confidence interval (CI) = 0.66–0.75), 0.76 (95% CI = 0.72–0.81), and 0.78 (95% CI = 0.74–0.82), respectively. Likewise, the sensitivities of the senior radiologists’ group under those three conditions were 0.72 (95% CI = 0.68–0.77), 0.77 (95% CI = 0.73–0.81), and 0.78 (95% CI = 0.74–0.83), respectively (Table 2).
Interestingly, neither the junior nor the senior radiologists identified more of the rib fractures labeled in the reference standard than the CAD tools did. RibFrac-HS and RibFrac-HP had sensitivities of 0.83 (95% CI = 0.78–0.88) and 0.79 (95% CI = 0.74–0.84), respectively. Whereas the PPVs of the two groups of radiologists varied relatively little when they read without assistance, the use of the CAD tools significantly decreased the PPV for both groups, regardless of their years of experience (Table 2).
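A per-patient summary of the form "mean ± SD (95% CI)" can be sketched as below. The text does not state how the confidence intervals were computed, so this sketch assumes a normal approximation to the sampling distribution of the mean, which is close to the t-based interval for a cohort of this size; the function name `summarize` is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def summarize(values, confidence=0.95):
    """Return (mean, SD, (CI lower, CI upper)) for per-patient values.

    Uses a normal approximation for the interval (an assumption of this
    sketch, not a method confirmed by the study).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    m = mean(values)
    sd = stdev(values)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * sd / math.sqrt(n)
    return m, sd, (m - half_width, m + half_width)
```

Here `values` would be, for example, the list of per-patient sensitivities for one reader group under one reading condition.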
Comparison of performances between RibFrac-HS and RibFrac-HP
Table 3 presents the differences in the sensitivity, PPV, and the reading time between the following: (i) the junior and the senior radiologists; (ii) RibFrac-HS and RibFrac-HP; (iii) human and machine (i.e. RibFrac-HP); (iv) humans with and without the support of the machine; (v) junior radiologists with RibFrac-HS and RibFrac-HP; and (vi) senior radiologists with RibFrac-HS and RibFrac-HP.
Differences in sensitivity, PPV, and diagnosis time between (1) the junior and the senior radiologists, (2) RibFrac-HS and RibFrac-HP, (3) human and machine, and (4) senior radiologists with and without the support of machine.
*Comparison between senior radiologists and software tool “RibFrac-HP.”
Comparison between senior radiologists and senior radiologists with the aid of software tool “RibFrac-HP.”
AI, artificial intelligence; CI, confidence interval; PPV, positive predictive value.
With an additional module for false-positive suppression, RibFrac-HP significantly reduced the average number of false-positive detections per scan by a factor of 2.90 (95% CI = 1.88–3.93; P < 0.0001). As a result, its PPV was significantly improved by 0.273 (95% CI = 0.238–0.308; P < 0.0001), with a very limited compromise in sensitivity of 0.051 (95% CI = 0.012–0.090; P = 0.011), compared with RibFrac-HS (Tables 2 and 3).
When the software tools were applied to assist radiologists in identifying rib fractures, no statistically significant differences were observed in diagnostic sensitivity or PPV between RibFrac-HS and RibFrac-HP. However, RibFrac-HP further decreased the reading time, especially for the junior radiologists’ group, for which the time decreased from 151 to 80 s (mean difference = 72, 95% CI = 61–82; P < 0.0001); for the senior group, it decreased from 156 to 142 s (mean difference = 13, 95% CI = 3–23; P = 0.0095) (Tables 2 and 3).
Discussion
In the present study, we investigated the diagnostic performance of two versions of a DL-based rib fracture detection tool, RibFrac-HS and RibFrac-HP, in assisting radiologists in diagnosing rib fractures. The sensitivities of the junior and senior radiologists’ groups were 70.9% and 72.2%, respectively. Notably, the application of DL to medical image analysis and interpretation enabled radiologists to detect at least 5.0%–6.5% more rib fractures than without DL support. Furthermore, these models significantly improved the diagnostic efficiency of radiologists (approximately 54% of diagnosis time can be saved) irrespective of their experience.
Compared with a previously published report (22), the PPV of the developed algorithms (RibFrac-HS and RibFrac-HP) was lower. To improve the performance of these models and reduce false positives, multiple types of rib fractures, especially those at atypical sites that tend to be misdiagnosed in clinical practice, were included in the test set. In this study, we had a higher proportion of minor fractures near the sternum and thoracic spine (49%), which could be the primary reason for the slightly lower PPV compared with previous studies.
Rib fractures are common in emergency patients with high-impact collision injuries. The sensitivities of the junior and senior radiologists in detecting the fractures were 0.71 and 0.72, respectively, a difference that was not statistically significant. These results are in line with previously published studies (23). One possible explanation is that the difference in experience between the senior and the junior radiologists enrolled in this study was small. Another is that rib fractures are common in patients with traumatic injury, so radiologists at both levels encounter them frequently. Although there was a gap between the radiologists’ annotations and the reference standard, our data clearly showed that the two algorithms improved the sensitivity and diagnosis time of the radiologists.
Our results demonstrated that the diagnostic application of DL-based CAD tools could significantly improve the sensitivity of fracture detection and reduce the per-patient diagnosis time for radiologists interpreting CT images, regardless of their years of experience, without any statistically significant reduction in precision. In our evaluation of these tools on a real-world dataset, RibFrac-HP achieved a higher PPV and a greater reduction in diagnosis time than RibFrac-HS, which could be attributed to its false-positive suppression module. RibFrac-HS reduced the per-patient diagnosis time by 36.9% and 32.2% for the junior and the senior radiologists, respectively, whereas RibFrac-HP reduced it by 68.8% and 39.2%, respectively.
The comparison between the sensitivity and the number of false positives for RibFrac-HS and RibFrac-HP showed the importance of the false-positive reduction module, as it significantly reduced the number of false positives, thereby improving the precision of fracture detection with only a limited compromise in sensitivity.
To the best of our knowledge, RibFrac-HP is the first software tool specifically designed with an additional embedded U-Net-based classification network to eliminate false positives. In addition, the data and labeling standards were high in this study. We included only images obtained through high-quality, thin-slice CT scans (slice thickness of ≤1 mm). Moreover, two senior radiologists with 13 and 15 years of experience in chest CT diagnosis annotated the reference standard. For the algorithm test, six radiologists with different levels of seniority were organized into two groups of three radiologists each. We consider the obtained results satisfactory.
The diagnostic sensitivities of these two groups of radiologists were high, while the diagnosis time was reasonably short. More importantly, the sensitivity was further improved when radiologists included CAD support in their diagnostic method. Thus, these results suggest that we reached the primary aim of our study: improving diagnostic efficiency while reducing the rates of missed diagnosis and misdiagnosis in rib fracture detection.
In the present study, the PPV of the software tools was lower than we expected, and PPV decreased after the human-machine combination. The algorithms found a comparatively greater number of fractures, but these detections also included false positives, which require manual review and elimination to arrive at the right diagnosis. We accepted a small loss of PPV in order to gain an overall increase in sensitivity and a significant decrease in diagnosis time. Our study therefore indicates that human-machine integration can improve diagnostic performance, at least in the case of rib fracture detection.
In this study, the reference standard contained 708 fractures. We observed that most of the misdiagnosed and missed rib fractures were located near the junctions with the thoracic vertebrae and the sternum. Many distorted cortices and calluses were also misjudged. In addition, the findings most often misdiagnosed as fractures included osteofibrous dysplasia, vascular sulci, bone islands, and other rib abnormalities. The complexity of the ribs themselves and the surrounding bone structures requires an experienced radiologist to distinguish and judge the lesion type. In this study, we had 2.90 false positives per patient, which we considered to be within an acceptable range.
The present study has some limitations. First, this is a retrospective study with all rib fracture samples collected from a single center. As a result, the total number of patients was relatively small. In the future, a multi-center study involving a larger cohort of patients should be conducted to examine the efficacy of the developed software tools in a real-world clinical setting. Second, radiologists in this study were not under the stress of an actual emergency, which usually brings a higher incidence of missed diagnosis or misdiagnosis due to fatigue and stress. Moreover, the radiologists were focused only on the task of rib fracture detection, which is never the case in clinical practice. Third, the reference standard was developed based on the experience of our senior radiologists, whereas different radiologists may apply different standards to the diagnosis of rib fractures, especially inconspicuous ones. However, most non-obvious rib fractures are not clinically significant, so radiologists pay more attention to obvious fractures when evaluating severe complications of trauma. Finally, the landscape of DL networks was not fully explored. Our ongoing work is to optimize the algorithm by improving its accuracy and recall and reducing the probability of misdiagnosis, because missing a non-displaced rib fracture is less harmful than mistakenly diagnosing a benign or malignant bone condition as a fracture. To meet this aim, future improvements to a rib fracture CAD tool should take into account its capability to differentiate fractures from image features of other rib abnormalities or artifacts (e.g. motion artifacts) that can mimic them.
The performances of two versions of software tools (RibFrac-HS and RibFrac-HP) in assisting radiologists with varying years of experience to identify post-traumatic rib fractures were systematically assessed. We found the following: (i) RibFrac-HS and RibFrac-HP both improved the diagnostic sensitivity of radiologists (e.g. for the junior group by 0.058; 95% CI = 0.033–0.083; P < 0.0001), regardless of their years of experience; and (ii) owing to its false-positive suppression module, RibFrac-HP greatly outperformed RibFrac-HS in terms of the PPV (mean difference = 0.273; 95% CI = 0.238–0.308; P < 0.0001) and further reduced the diagnosis time for rib fractures in a clinical setting (e.g. for the senior group by 79 s; 95% CI = 67–92 s; P < 0.0001).
In conclusion, our findings suggest that the application of advanced AI-guided tools can significantly improve the sensitivity and efficiency of radiologists in diagnosing rib fractures based on CT images. RibFrac-HP may serve as an effective and generally practical tool to be used in assisting the clinical management of patients with chest trauma.
Footnotes
Funding
This work was supported by the National Key R&D Program of China [grant number 2016YFE01030003, grant number 2018YFC0116404]; Contract grant sponsor: Pyramid Talent Project of Shanghai Changzheng Hospital and National Natural Science Foundation of China (grant number 82001812).
