Abstract
BACKGROUND
Accurate segmentation of brain tumors depicted on magnetic resonance imaging (MRI) is an important step in determining the optimal treatment plan for gliomas, which are the most common malignant brain tumors and seriously damage patients’ health and life.
OBJECTIVE
This study aims to improve the accuracy and efficiency of brain tumor segmentation on MRI using an advanced deep learning model.
METHOD
In this study, an improved model based on the U-net, called Deeper ResU-net, is proposed for accurate segmentation of brain tumor MRI images. First, a Deeper U-net is built, which has greater network depth than the U-net, uses a Squeeze Operator to control the number of network parameters, and attempts to enhance the feature extraction ability. Then, Deeper ResU-net is formed to eliminate the degradation phenomenon of the deep network: a residual unit is designed and integrated into the Deeper U-net while keeping the number of parameters unchanged.
RESULT
Deeper ResU-net allows the deep network to train stably without degradation. Evaluation results show that Deeper ResU-net achieves competitive results, with average DSC values of 0.9, 0.82 and 0.88 for the Complete tumor region, Core tumor region and Enhanced tumor region, respectively.
CONCLUSION
The experimental results demonstrate that extending the U-net model to a greater depth and adding a residual structure to ensure effective and stable training can eliminate the degradation phenomenon of the deep network and improve segmentation performance.
Introduction
Gliomas are derived from the glial cells of the central nervous system and are the most common malignant brain tumors. According to their aggressiveness and permeability, they are broadly classified into two categories: high-grade gliomas (HGG) and low-grade gliomas (LGG) [1]. In the past few decades, magnetic resonance imaging (MRI) has been widely used to diagnose brain and nervous system abnormalities because of its clear soft-tissue contrast. The commonly used MRI sequences include T1-weighted (T1), T1-weighted contrast-enhanced (T1c), T2-weighted (T2), and T2-weighted fluid attenuated inversion recovery (Flair) [2]. These four sequences are used together because different tumor regions may be clearly displayed in different sequences, which provides more accurate brain tumor boundary information.
Many traditional machine learning methods have been used in medical image and brain tumor image segmentation [3–11]. For example, Jones [4] presented a novel whole-brain diffusion tensor imaging (DTI) segmentation method (D-SEG) to delineate tumor volumes of interest (VOIs). D-SEG uses k-means clustering of the 2D space and a semiautomated flood-filling technique to generate VOI segments with different isotropic and anisotropic diffusion characteristics. Yang [5] presented a computerized decision support framework for discriminating between GBM and solitary MET using MRI. The tumor ROI is established through D-SEG, and supervised learning strategies are then used to classify GBMs and solitary METs based on two-dimensional shape features that were selected by Student’s t-test and were highly correlated. They also proposed a 3D morphometric analysis framework for distinguishing GBMs from solitary METs, in which morphometric features of shape index and curvedness were computed for each tumor surface defined by the DTI segmentation technique [6]. A support vector machine (SVM) classifier has been used to classify different combinations of the first-order and second-order statistical textural features of the ROI [7]. Soltaninejad [8] proposed a fully automated method for the detection and segmentation of brain tumors from FLAIR MRI images by calculating Gabor texton features, fractal analysis, curvature and statistical intensity features from superpixels; extremely randomized trees (ERT) are then used to classify each superpixel as tumor or healthy brain tissue. In [9], supervoxels are calculated using information fusion from multimodal MRI images, and for each supervoxel a variety of features are extracted and fed into a random forests (RF) classifier. These methods require many hand-designed features, which usually leads to the curse of dimensionality. Wu [10] used superpixel features to segment brain tumors in a conditional random field (CRF) framework, but the performance was poor on LGG images.
In addition, geometry-based methods [11], which include level sets and active contour models, are computationally inexpensive but usually require user interaction, expert prior knowledge and feature engineering. Thus, they are not fully automatic segmentation algorithms. Currently, these algorithms are often used as the pre-processing or post-processing stages of deep learning algorithms.
In recent years, the convolutional neural network (CNN) has become the most popular method in image recognition and classification, and it is also widely used in medical image segmentation [12, 13]. Early medical image segmentation algorithms obtain the anatomical descriptions of images or voxels through classification of image patches. This approach mainly considers local information around each pixel and has difficulty taking the image context into account, so it is highly restricted [14]. Some studies use different convolutional kernel sizes or different input image sizes to extract multi-scale information [15, 16]. Other work suggests combining the CNN with post-processing operations, such as CRF [17] or random forests (RF) [18]. Moreover, the patch-based method suffers from poor computational efficiency: when a CNN processes a large number of dense image patches, many calculations are redundant. Therefore, a more efficient segmentation method is necessary.
One of the most common efficient segmentation methods uses the fully convolutional neural network (FCN) [19] semantic segmentation model, which has been widely used in medical image segmentation. Zhao [20] proposed a brain tumor segmentation method that integrates FCN and CRF in a unified framework. Pereira [21] used segmentation Squeeze-and-Excitation (SE) blocks to collect image context information while maintaining spatial information, in order to create more complex semantic features. Oliveira [22] proposed combining the multi-scale analysis provided by the Stationary Wavelet Transform (SWT) with a multi-scale FCN to cope with the varying width and direction of the vessel structure in the retina. In [23], a pre-trained FCN is applied through transfer learning, which simplifies the typical retinal vessel segmentation problem and achieves state-of-the-art results on four databases. Fakhry [24] balanced the tradeoff between the increasing contextual window required for multi-scale reasoning and the ability to preserve pixel-level resolution by adding multiple residual shortcut paths to a fully deconvolutional network. Vijay [25] constructed a deep convolutional encoder-decoder architecture for semantic pixel-wise segmentation termed SegNet, in which the encoder network is a 13-layer VGG16 network [26] and the corresponding decoder network up-samples its lower-resolution input feature maps. Sajid [27] designed three improved SegNet-based models for brain tumor images. The skip connection, interpolation algorithm and SE module were applied to the original SegNet model, and better segmentation results were obtained.
The U-net [28] model, which is based on FCN, has recently achieved competitive segmentation results. Its structure consists of an encoding path and a decoding path, and the skip connection passes the feature map computed in the encoding path to the decoding path, which superimposes the early feature map directly on the later feature localization. A 1×1 convolutional kernel is used to generate the final segmentation map after the last layer of the decoding phase. Because of its end-to-end network characteristic, the segmentation process is simpler and more efficient. Compared with the traditional CNN model, it has at least the following advantages: (1) both local features and global features are used; (2) because medical image segmentation labels are scarce and costly, it is valuable that this model needs fewer training samples to achieve good performance; (3) end-to-end input and output produce complete segmentation results.
At present, a variety of improved methods based on the U-net model have been proposed. Dong [29] used comprehensive data augmentation techniques to improve the segmentation accuracy and applied the Soft Dice loss function to address data imbalance in brain tumor MRI images. Salehi [30] augmented the U-net with the auto-context algorithm [31] to improve the segmentation performance. Chen [32] extended DeepMedic to Multi-Level DeepMedic (MLDeepMedic) to utilize multi-level information; this training strategy is realized by attaching an auxiliary classifier to both MLDeepMedic and U-net. Zhou [33] designed a series of nested, dense skip pathways to connect the encoder and decoder networks, which can reduce the semantic gap between the feature maps of the encoder and decoder networks. Baris [34] proposed two modifications to the U-Net architecture. First, the segmentation maps created at different scales are combined to enrich the diversity of training data, and then the feature maps are transferred from one stage to another using element-wise summation. Experiments have shown that the network has better convergence during the training phase, but no other advantages are observed during the testing phase. Alom [35] proposed a recurrent neural network based on U-net as well as a recurrent residual convolutional neural network based on the U-net model. The models were tested on three datasets, including a retinal blood vessel dataset, and the experimental results showed superior performance on segmentation tasks. Milletari [14] proposed a 3D U-net model called V-net, which extended the original U-net structure with 3D convolution kernels; the residual unit was then added to further modify the original U-net structure, and the model was applied to the prostate image segmentation task. Cicek [36] also extended the network to 3D and designed a structure called 3D U-net to learn features from sparsely annotated volume images.
In this study, an improved U-net model is used as the Baseline Model, and a novel semantic segmentation model based on the Baseline Model is proposed. The main contributions can be summarized as follows: (1) a deep U-net model called Deeper U-net is built, in which the Squeeze Operator is used to control the number of parameters; (2) a model combining residual units and Deeper U-net, called Deeper ResU-net, is proposed, in which the residual unit is integrated into the Deeper U-net to prevent vanishing gradients and optimize the deep training; (3) the models are tested using the Brats2015 training dataset, and the experimental results show superior performance compared with recently proposed state-of-the-art methods.
Method
Figure 1 presents an overview of the proposed approach. There are three main stages: MRI image preprocessing, Deeper ResU-net, and output of the prediction map.

Overview of the proposed method, including the training phase and the testing phase.
Common data preprocessing methods are adopted in this study. Specifically, the original brain tumor 2D slices are cropped, and the large background areas that contain no image information are removed. As a result, 192×192 image slices are obtained, which reduces the computational cost while ensuring that the network can learn effective brain image information. The 1% maximum gray values and 1% minimum gray values of the image sequences are then removed, which ensures that the intensity values of all images are within a relevant range and facilitates learning during the training phase. We observed that the intensity values on the MRI slices still vary widely, so a normalization operation is also applied to bring the mean intensity close to zero and the variance close to one.
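The preprocessing steps described above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the center crop, the 240×240 input size, and the interpretation of "removing" the 1% extreme gray values as percentile clipping are all assumptions, and the function names `center_crop` and `clip_and_normalize` are hypothetical.

```python
import numpy as np

def center_crop(slice_2d, size=192):
    """Crop a 2D slice to size x size around its center (assumed crop strategy)."""
    h, w = slice_2d.shape
    top = (h - size) // 2
    left = (w - size) // 2
    return slice_2d[top:top + size, left:left + size]

def clip_and_normalize(image, low_pct=1.0, high_pct=99.0):
    """Clip the extreme 1% of intensities, then scale to zero mean and unit variance."""
    lo, hi = np.percentile(image, [low_pct, high_pct])
    clipped = np.clip(image, lo, hi)
    std = clipped.std()
    if std == 0:
        # Constant image: nothing to scale, return zeros.
        return np.zeros_like(clipped, dtype=np.float64)
    return (clipped - clipped.mean()) / std

# Synthetic stand-in for one MRI slice (assumed 240x240 input size).
rng = np.random.default_rng(0)
slice_2d = rng.normal(100.0, 20.0, size=(240, 240))
prepared = clip_and_normalize(center_crop(slice_2d))
```

After these two steps the slice has the 192×192 size used by the network, with mean close to zero and variance close to one.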
Deep learning model
Three different architectures were evaluated, namely (1) Baseline Model: U-net with forward convolutional layer and feature superposition. (2) Deeper U-net: deep U-net with Squeeze Operator. (3) Deeper ResU-net: Deeper U-net with residual units.
Baseline model
This study adopts an improved U-net network as the Baseline Model, which consists of two main parts, a convolutional encoding structure and a decoding structure, as shown in Fig. 2. The encoding structure is used to extract image features, capture background information, and encode advanced features to detect tumor location and texture. The decoding structure is used to accurately locate and reconstruct details such as tumor edges and grayscale features. Both parts perform the basic convolutional operation followed by the ReLU activation function. The difference is that in the encoding phase the 2×2 max-pooling operation is performed, the number of feature maps is increased from 4 to 1024, and the size of the feature maps is reduced from 192×192 to 12×12, whereas in the decoding phase the transposed convolutional operation is performed to upsample the feature maps, the number of feature maps is reduced from 1024 to 64, and the size of the feature maps is increased from 12×12 to 192×192. Skip connections also play an important role in making feature localization more precise. To keep the size of the output feature maps unchanged, this study uses zero padding in the convolutional process.

Detail diagram of Baseline Model structure. The top of the figure is the overview of the Baseline Model, and the bottom shows the legends used in the overview. All convolutional layers in the encoding and decoding stages are of size 3×3. A kernel of size 1×1 is used to produce the prediction map. The input layer (192×192×4) is mapped through the encoding path and decoding path to 5 probability distribution maps (192×192×5), from which the final prediction map (192×192) is obtained.
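The feature-map sizes quoted for the Baseline Model can be checked with a short shape trace. The sketch below assumes four 2×2 max-pooling steps, 64 feature maps at the first level, and a doubling of the feature maps after each pooling step; these values are inferred from the 192×192 to 12×12 and 64 to 1024 figures in the text rather than stated explicitly, and `encoder_shapes` is a hypothetical helper.

```python
def encoder_shapes(input_size=192, first_filters=64, num_pools=4):
    """Return (spatial_size, num_feature_maps) at each encoder level."""
    shapes = []
    size, filters = input_size, first_filters
    for level in range(num_pools + 1):
        shapes.append((size, filters))
        if level < num_pools:
            size //= 2      # 2x2 max-pooling halves height and width
            filters *= 2    # assumption: feature maps double after each pooling
    return shapes

levels = encoder_shapes()
for size, filters in levels:
    print(f"{size}x{size}x{filters}")
```

Under these assumptions the deepest level is 12×12 with 1024 feature maps, and the decoding path traverses the same list in reverse order back to 192×192 with 64 feature maps.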
Much evidence has shown that network depth is crucial to feature extraction ability, and nearly all of the best classification results on the challenging ImageNet dataset [37] make use of a very deep CNN [26, 39]. VGG16 deepened the network by stacking 3×3 convolutional kernels, which greatly improved the classification accuracy and demonstrated the importance of network depth for convolutional features. The 3×3 convolutional kernel is used to increase the network depth because of its small number of parameters and strong non-linear mapping ability. It can therefore be seen that, under certain conditions, the deeper the network is, the richer the extracted features are and the better the classification result is. At present, almost no researchers have deepened the U-net model and shown that it can be extended to greater depth to obtain better performance. Therefore, we design a deep U-net structure called Deeper U-net, which increases the depth of each convolutional block in the encoding path and the decoding path. As the network depth increases, the number of network parameters increases dramatically and the computational cost rises further. Inspired by Lin [40], a 1×1 convolutional kernel is used as the Squeeze Operator to compress the number of feature maps while increasing the number of layers and recombining features. It keeps the number of network parameters to a minimum, prevents over-fitting, and increases the non-linearity of the decision function. In this study, we add a 1×1 convolutional kernel to the second convolutional layer of each convolutional block; the structure is shown in Fig. 3.

Detail diagram of Deeper U-net structure. Compared with the Baseline Model, the Deeper U-net changes the original two convolutional layers (operation1) in each block to three convolutional layers (operation1 + operation5). In order to keep the parameters unchanged, the kernel of size 1×1 is used to perform dimensionality reduction (operation3).
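To illustrate why a 1×1 kernel works as a Squeeze Operator, the following sketch counts convolution parameters and applies a 1×1 convolution as a per-pixel linear map over channels. The channel widths (256 and 64) are hypothetical values chosen only for illustration and are not the layer configuration of Deeper U-net; `conv_params` and `squeeze_1x1` are likewise hypothetical helpers.

```python
import numpy as np

def conv_params(kernel, c_in, c_out):
    """Number of weights plus biases in a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out

def squeeze_1x1(features, weights):
    """Apply a 1x1 convolution: a linear map over channels at every pixel.

    features has shape (H, W, C_in); weights has shape (C_in, C_out).
    """
    return np.einsum("hwc,cd->hwd", features, weights)

# Hypothetical channel widths, for illustration only.
c_in, c_mid = 256, 64

# One 3x3 layer applied directly at full width...
direct = conv_params(3, c_in, c_in)
# ...versus a 1x1 squeeze to c_mid followed by a 3x3 layer back to c_in.
squeezed = conv_params(1, c_in, c_mid) + conv_params(3, c_mid, c_in)

features = np.ones((12, 12, c_in))
weights = np.full((c_in, c_mid), 1.0 / c_in)
reduced = squeeze_1x1(features, weights)  # shape (12, 12, 64)
```

With these widths, the squeezed pair of layers uses far fewer parameters than the single full-width 3×3 layer while adding one more layer of depth, which is the trade-off the Squeeze Operator exploits.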
Although the deep convolutional model can improve feature extraction capability and map features to more complex dimensions, it is prone to degradation in the training phase, which results in slow or even stalled optimization of the deep network. In this study, the residual unit [41] is used to optimize the Deeper U-net. Although 2D deep residual networks have been extensively studied in the domain of computer vision, few studies have been conducted in the field of medical image computing. Residual units use special additive skip connections to combat vanishing gradients in Deeper U-net. The key idea is to create an identity mapping connection that bypasses the parameterized layers in the network and then directly merges the input of the residual block into the output through element-wise addition. It has been shown that the residual unit makes information transmission smoother and accelerates convergence. Generally, the residual unit can be expressed as follows:
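$$X_{p+1} = X_p + F(X_p, W_p)$$
Applying this recursively, the feature $X_l$ of any deeper layer $l$ can be expressed as the feature $X_p$ of a shallower unit $p$ plus the sum of the intervening residual functions:
$$X_l = X_p + \sum_{i=p}^{l-1} F(X_i, W_i)$$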
At the beginning of a residual unit, the data flow is separated into two streams: the first carries the unchanged input of the unit, while the second applies weights and non-linearity. At the end of the unit, the two streams are merged using an element-wise summation. Given the structural characteristics of residual units and of the Baseline Model itself, we designed two residual units, as shown in Fig. 4. The left figure shows the classic structure of the residual unit, which contains two non-linear mapping layers; the right figure contains one non-linear layer. The experimental results show that combining the right unit with Deeper U-net makes the network converge faster and clearly improves the classification accuracy. Thus, we change the common two-layer residual unit into a one-layer unit and embed it into Deeper U-net. The overall structure of Deeper ResU-net is shown in Fig. 5. The residual unit is embedded in the first convolutional layer of each block in the Deeper U-net, the second convolutional layer of each block is used as the non-linear unit, and the element-wise summation is performed after the second convolutional layer. In this case, $F(X_p, W_p)$ represents a convolution followed by ReLU.

The residual unit structure used in Deeper ResU-net. The left is the classical structure with two non-linear layers, and the right structure has one.

Detail diagram of Deeper ResU-net structure. The residual units designed and integrated in the Deeper U-net are shown as operation1.
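The one-layer residual unit and its additive behaviour across stacked units can be sketched numerically. In the toy example below, a dense matrix product stands in for the convolution (an assumption made only to keep the example short), and the weights are random; it checks that the output of three stacked units equals the input plus the sum of the three residual terms.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_fn(x, w):
    """F(X, W): one weighted layer followed by ReLU (dense stand-in for a convolution)."""
    return relu(x @ w)

def residual_unit(x, w):
    """One-layer residual unit: identity stream plus residual stream, merged by addition."""
    return x + residual_fn(x, w)

rng = np.random.default_rng(0)
x_p = rng.normal(size=(4, 8))                                   # input to the first unit
weights = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(3)]

# Forward pass through three stacked units, recording each residual term F(X_i, W_i).
x = x_p
residuals = []
for w in weights:
    residuals.append(residual_fn(x, w))
    x = residual_unit(x, w)
x_l = x

# Unrolled form: the deep feature is the shallow feature plus the summed residuals.
unrolled = x_p + sum(residuals)
```

Because the shallow feature reaches the deeper layer through pure addition, the gradient with respect to it always contains an identity term, which is the usual explanation for why residual units ease the training of deep networks.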
Compared with the Baseline Model, the proposed network architecture has several advantages. First, Deeper ResU-net keeps the number of network parameters unchanged relative to the Deeper U-net and maintains stable computing efficiency. Second, the residual unit improves the training of the deep network, further optimizes the approximate identity mapping, and prevents the degradation of the deep network during the training phase. It thereby improves the feature extraction capability of the deep network and the classification result. Finally, the method generalizes well, because it can be easily applied to deep learning models based on V-net, SegNet, etc., to improve their classification performance.
Database
The algorithm in this study is evaluated mainly on the Brats2015 training dataset [42, 43], which contains a total of 274 groups of MRI tumor images (220 groups of HGG and 54 groups of LGG) in four modalities, as shown in Fig. 6: Flair, T1, T1c and T2. Each modality sequence was already aligned with the T1c modality and skull-stripped. Segmentation labels divide tumor images into five categories: normal tissue, edema, necrosis, enhanced tumors, and non-enhanced tumors. The evaluation criteria classify the tumor images into three categories: Enhanced tumor, Core tumor (necrosis + non-enhanced tumor + enhanced tumor), and Complete tumor (all categories except normal tissue).

MRI image modal sequence. From left to right: Flair modality; T1 modality; T1c modality; T2 modality.
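The mapping from the five segmentation labels to the three evaluation regions can be written as a small lookup. The integer encoding used below (0 normal tissue, 1 necrosis, 2 edema, 3 non-enhanced tumor, 4 enhanced tumor) is an assumption, since the text names the categories but does not give their numeric codes; `REGION_LABELS` and `region_mask` are hypothetical names.

```python
import numpy as np

# Assumed label encoding (not stated in the text):
# 0 normal tissue, 1 necrosis, 2 edema, 3 non-enhanced tumor, 4 enhanced tumor.
REGION_LABELS = {
    "complete": (1, 2, 3, 4),  # all categories except normal tissue
    "core": (1, 3, 4),         # necrosis + non-enhanced tumor + enhanced tumor
    "enhanced": (4,),          # enhanced tumor only
}

def region_mask(label_map, region):
    """Binary mask of the pixels that belong to the given evaluation region."""
    return np.isin(label_map, REGION_LABELS[region])

# Tiny label map containing each category at least once.
label_map = np.array([[0, 1, 2],
                      [3, 4, 0]])
masks = {name: region_mask(label_map, name) for name in REGION_LABELS}
```

Each evaluation metric is then computed on these binary masks rather than on the five-class label map directly.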
Five-fold cross-validation is conducted on the Brats2015 training dataset with 274 samples. The evaluation metrics for the segmentation results are the Dice similarity coefficient (DSC), Positive Predictive Value (PPV) and Sensitivity. Cross entropy is used as the loss function, Adam is used as the optimization algorithm, and the initial learning rate is set to 0.0001. The parameters are initialized with the Xavier uniform initialization method. Dropout is used as the regularization method, with a rate of 0.5. All hyper-parameters were selected as those that performed best on the validation set. The method proposed in this study is implemented using the Keras framework based on Tensorflow. The experimental machine uses an Intel Core i7 3.5 GHz processor and is equipped with an NVIDIA GeForce GTX1080 GPU with 8 GB of memory.
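For reference, the three evaluation metrics can be computed from binary masks using their standard definitions: DSC = 2TP / (2TP + FP + FN), PPV = TP / (TP + FP), and Sensitivity = TP / (TP + FN). The sketch below is an illustration, not the evaluation code used in the study; returning 1.0 when a denominator is zero is an assumed convention, and `segmentation_metrics` is a hypothetical name.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Compute DSC, PPV and Sensitivity for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.logical_and(pred, truth).sum())    # predicted and labelled tumor
    fp = int(np.logical_and(pred, ~truth).sum())   # predicted but not labelled
    fn = int(np.logical_and(~pred, truth).sum())   # labelled but missed
    # Assumed convention: an empty denominator counts as a perfect score.
    dsc = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    ppv = tp / (tp + fp) if (tp + fp) else 1.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 1.0
    return {"dsc": dsc, "ppv": ppv, "sensitivity": sensitivity}

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
truth = np.array([[1, 0, 0],
                  [0, 1, 1]])
scores = segmentation_metrics(pred, truth)
```

In this toy example there are two true positives, one false positive and one false negative, so all three metrics equal 2/3.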
Experiment results
Experiments are conducted on U-net, Deeper U-net and Deeper ResU-net. Figures 7 and 8 show the training and validation loss on the Brats2015 training dataset. The two figures demonstrate that Deeper ResU-net achieves a more stable and efficient optimization process, as well as a smaller loss value, than the other models in the comparison. Figures 9 and 10 show the training and validation accuracy. These figures demonstrate that the proposed Deeper ResU-net model provides better performance during both the training and validation phases when compared to U-net and Deeper U-net. As can be seen from the figures, there is no over-fitting in the training of the three networks, indicating that there are no problems such as an excessive number of model parameters or insufficient data.

Training loss of the Deeper ResU-net against Deeper U-net and U-net (Baseline Model).

Validation loss of the Deeper ResU-net against Deeper U-net and U-net (Baseline Model).

Training accuracy of the Deeper ResU-net against Deeper U-net and U-net (Baseline Model).

Validation accuracy of the Deeper ResU-net against Deeper U-net and U-net (Baseline Model).
The comparison of the segmentation results is shown in Fig. 11. The Baseline Model used in this study achieves a relatively accurate segmentation result and performs well in each region, and its output is basically consistent with the contour of the expert segmentation label. Because the differences among the models are hard to see from a visual comparison of the segmentation results, the optimization ability of Deeper ResU-net needs to be evaluated further. The box plot and the comparison chart of the main evaluation metrics are shown in Figs. 12 and 13; the red squares in Fig. 12 are mean values. Apart from some outliers, the models perform well over all images in the dataset. The figure also shows that the Deeper U-net model does not significantly improve performance by deepening the original network, and that the outliers of some metrics and the ranges of the segmentation metrics become larger. This indicates that although deepening the network under certain conditions can enhance its feature extraction ability, in practice the deep network is difficult to optimize during training: the accuracy saturates and begins to degrade. This is not due to over-fitting; rather, it implies that vanishing gradients appear when training the deep network, which leads to no significant improvement in the training results. On the contrary, the performance of the model is worse in some metrics due to the increase of parameters.

Comparison of segmentation results. (a) T1 modality image of the patient’s brain; (b) Segmentation label divided by the expert; (c) Segmentation result of the baseline model; (d) Segmentation result of Deeper U-net; (e) Segmentation result of Deeper ResU-net. Among them, green is the edema area, red is the necrosis area, blue is the non-enhanced tumor area, and yellow is the enhanced tumor area.

Boxplot of the results on the MICCAI Brats2015 dataset using the proposed architectures: Baseline Model (U-net), Deeper U-net and Deeper ResU-net, in terms of Dice score, PPV and Sensitivity. The mean values are shown as red squares.

DSC comparison of Baseline Model (U-net), Deeper U-net and Deeper ResU-net segmentation results.
Therefore, the improvement of Deeper U-net is particularly important in this study. As shown in Fig. 12, after adding the residual unit to the Deeper U-net model to form the Deeper ResU-net, the number of model parameters has not changed, but the performance of the model has improved. The DSC metric shows an obvious increase in the Q1 and mean value of the Complete tumor area, and the Q3 of the Core area is improved. As shown in Fig. 13, the DSC metrics in each region have improved to some extent. This shows that the residual unit optimizes the model training process and prevents the degradation phenomenon of the deep network.
In order to further validate the performance of the proposed algorithm, this study compares several classic brain tumor segmentation algorithms with the proposed algorithm. The results of these algorithms are among the best in the recent Brats competitions.
The quantitative results of this experiment are compared against existing methods in Table 1. Compared with the Deeper U-net and Baseline Model, Deeper ResU-net improves on the majority of metrics. Because the Baseline Model in this study already has a promising feature extraction ability, it is more difficult to improve the accuracy by 1% over the Baseline Model. Compared with other models, Deeper ResU-net produces competitive results in terms of the Complete tumor region, Core tumor region and Enhanced tumor region. In [34], the U-net was also improved by using residual units; although it has results comparable to Deeper ResU-net in the Complete tumor region and the Core tumor region, the high variance of the DSC and the PPV on the Enhanced region indicates that the model is prone to false positives.
Quantitative results of proposed methods compared to the results from the classic brain tumor segmentation algorithms published recently. The bold numbers highlight the scores best among these algorithms on the Brats 2015 Training dataset
In contrast, Deeper ResU-net has stable and good performance in the various regions. The additional residual unit does not affect the segmentation of the Enhanced tumor region and does not lead to strong over-segmentation. Compared with a previous study [17], the algorithm yields the same Complete tumor segmentation result, a DSC of 0.9 in both cases, while its DSC values for the Core tumor and Enhanced tumor regions significantly exceed the 0.75 and 0.73 reported there. In addition, [17] used a post-processing operation to optimize the segmentation result of the CNN model, so it is not an independent feature extraction structure. The Deeper ResU-net, however, does not use any post-processing procedure to optimize the results, which demonstrates the competitiveness of the feature extraction model designed in this study. The three models designed and tested in this study are superior to the conventional U-net model (i.e., as reported in [29]) in all regions, which indicates that the improvement of the Baseline Model is effective and obvious.
The models were iterated approximately 100 times; the Baseline Model, Deeper U-net model and Deeper ResU-net model take about seven hours to train. In the prediction process, Deeper ResU-net has high computational efficiency. Each 3D brain tumor MRI image takes only 3 seconds for prediction, while the previous study [17] needs 30 seconds to predict an entire 3D brain tumor image using an NVIDIA GTX Titan X GPU.
The results on the Brats2017 dataset are shown in Table 2. As can be seen from the table, the algorithm still achieves a good segmentation result. Compared to U-net and Deeper U-net, Deeper ResU-net shows an obvious improvement, which again demonstrates the effectiveness of the proposed algorithm. We also compare the Deeper ResU-net with the 3D U-net model. Since the 3D U-net model makes full use of the 3D information of the data, it also achieves a good segmentation result, and its DSC in the Core region exceeds that of the algorithm in this study. However, the Deeper ResU-net shows an obvious improvement in the Complete and Enhancing regions. In addition, the 3D structure requires a large amount of computation and has a large memory footprint.
Quantitative results of proposed methods based on the Brats 2017 Training dataset
Although the traditional U-net model has been successfully applied in the field of medical image segmentation, few researchers have increased the depth of the U-net model. In this study, by improving the original U-net model, we establish the Baseline Model and the Deeper U-net model, and propose Deeper ResU-net based on the Deeper U-net model. The Deeper ResU-net model combines the deep U-net, the Squeeze Operator and residual units to produce excellent segmentation results. The training set and model use 2D image patches and 2D model structures, but the segmentation results are not inferior to those of 3D-CNN models [17], while the model avoids the disadvantages of 3D structures, which are difficult to train and require large amounts of memory and computation. At the same time, the model avoids complex data preprocessing and post-processing procedures, which greatly increase a model’s dependence on the nature of the data. For example, [27] used N4ITK bias correction, [29] used data augmentation methods, and [17] used post-processing procedures. The effectiveness and efficiency of those models therefore also depend on the data pre-processing and post-processing stages. In contrast, the model proposed in this study achieves better segmentation results with a simplified algorithm flow, and the segmentation results depend almost completely on the performance and efficiency of the model itself.
In summary, this study proposes a Deeper ResU-net model for the segmentation of brain tumor images. For this purpose, we designed and tested a Deeper U-net, which attempts to enhance the convolutional feature extraction ability by increasing the depth of the network. However, experiments show that the degradation phenomenon exists in the deep network and the segmentation result is not improved. Therefore, the Deeper ResU-net structure is built to improve the training process and eliminate vanishing gradients. Experiments show that the Deeper ResU-net model improves the segmentation result while the number of network parameters remains unchanged. This implies that the deeper U-net can enhance the feature extraction ability when the residual unit is added. Finally, the algorithm proposed in this study was evaluated using the Brats2015 and Brats2017 training datasets. Compared with the results of other excellent algorithms, we observed advantages of this improved algorithm, including higher computational efficiency.
Acknowledgments
This study was supported by the Natural Science Foundation of Henan Province Science and Technology Committee [Grant No.162102210189].
