A deep learning semantic segmentation network with attention mechanism for concrete crack detection

Abstract

In this research, an attention-based feature fusion network (AFFNet), with a residual network backbone (ResNet101) enhanced by two attention mechanism modules, is proposed for automatic pixel-level detection of concrete cracks. The two attention modules, namely the vertical and horizontal compression attention module (VH-CAM) and the efficient channel attention upsample module (ECAUM), enable selective concentration on crack features. The VH-CAM generates a feature map integrating pixel-level information in the vertical and horizontal directions. The ECAUM, applied on each decoder layer, combines efficient channel attention (ECA) and feature fusion, which provides rich contextual information as guidance to help low-level features recover crack localization. The proposed model is evaluated on the test dataset and reaches a mean intersection over union (MIoU) of 84.49%. Comparison with other state-of-the-art models demonstrates the high efficiency and accuracy of the proposed method.

Keywords

Semantic segmentation, attention mechanism, crack detection, deep learning

Introduction

Due to the low tensile strength of concrete,¹ cracks inevitably appear in concrete structures under the influence of external load and temperature change. The existence of cracks accelerates the corrosion of rebar, seriously affecting the load-carrying capacity and durability of the structure.² Because cracks are an important indicator for evaluating structural damage and durability,³ crack detection is of considerable importance in concrete structure maintenance. The traditional crack detection method is periodic manual inspection, which generally involves sending inspectors to measure cracks with bulky equipment.⁴ However, the results are susceptible to subjective factors and the process is time-consuming and labor-intensive.⁵ For these reasons, a new crack detection approach with high accuracy and efficiency is desirable, and such development is of great interest to all stakeholders.

To overcome the drawbacks of these human-based methods, many crack detection methods based on image processing techniques (IPTs) have been proposed, such as image thresholding,^6,7 edge detection^8,9 and morphological operations.^10,11 However, the prediction results rely heavily on manually defined parameters and are easily affected by complex environments in real-world situations.

Recently, deep learning (DL) has developed rapidly¹² and offers a way to overcome these limitations. Compared with IPTs, DL can automatically extract abundant abstract features from massive surface crack data.¹³ Therefore, researchers have adopted DL algorithms to improve the accuracy and efficiency of crack detection.^14,15 Most recently, semantic segmentation approaches, which classify each pixel as crack or noncrack,¹⁶ have attracted increasing attention because of their outstanding ability to achieve pixel-level prediction of cracks. A semantic segmentation network usually follows an encoder-decoder design, in which the encoder backbone is derived from a state-of-the-art image classification architecture.^17–19 Yang et al.²⁰ used a fully convolutional network (FCN) with a VGG19 backbone to detect concrete cracks, where the training time was lower than that of CrackNet²¹ on account of the end-to-end structure. Huyan et al.²² adopted two U-Nets with VGGNet and ResNet backbones to perform pavement crack segmentation on good-quality images without noise, which exhibited significant advantages over FCN. Li et al.²³ proposed an FCN with a DenseNet121 backbone to analyze smartphone images for automatic detection of four types of damage in concrete structures. On the other hand, to prevent information loss in the down-sampling operation, researchers tend to develop DL models without pooling layers. Zhang et al.²⁴ proposed a model named CrackNetII, which removed all pooling layers, for automatic pavement crack detection using 3000 asphalt surface images. However, these semantic segmentation approaches share an obvious limitation: they cannot aggregate rich contextual information well. To address this, two well-known models have been proposed. The first is DeepLabv3,²⁵ which uses atrous spatial pyramid pooling (ASPP) to fuse feature maps at different scales to capture contextual information. The second is PSPNet,²⁶ which employs a pyramid pooling module to aggregate multiscale context. Wang et al.²⁷ used 2446 concrete and asphalt crack images to train and evaluate five semantic segmentation models, including FCN, GCN, PSPNet, UPerNet and DeepLabv3+, and found that DeepLabv3+ showed the best performance. Ji et al.²⁸ adopted DeepLabv3+ for automatic detection of pavement cracks, and crack images collected by UAV were tested to obtain a mean intersection over union (MIoU) of 78.75%. However, the above methods can only collect local contextual information, while the global information in the crack image cannot be fully captured.

The attention mechanism, first proposed by Bahdanau et al.,²⁹ is a cognitive process that enables selective concentration on nominated features while intentionally disregarding unimportant information. The attention mechanism can be applied to semantic segmentation models as a global context exploration method, as in SENet,³⁰ CBAM,³¹ and DANet.³² In crack detection, there have been attempts to combine the attention mechanism with available networks to improve detection efficiency.^33–35 For example, Pan et al.³⁶ modified the backbone of DANet from ResNet101 to VGG19 and added a new attention mechanism named feature pyramid attention to improve crack detection accuracy; the resulting model is called SCHNet. The results demonstrated that the three attention mechanisms can increase MIoU by 10.88% over the baseline model. The attention maps in SCHNet are obtained by calculating the similarities among all pixels in the feature map. However, these feature maps are obtained by 1 × 1 convolution, which means that each pixel is influenced by only one pixel in the input feature map. As a result, a single pixel cannot contain much spatial information, which makes the attention map suboptimal. In addition, SCHNet does not consider the fusion of low-level and high-level features, which could help the decoder generate high-resolution semantic features. Meanwhile, direct fusion degrades crack segmentation performance. To solve these problems, this research designs a novel model combining FCN and two attention mechanism modules to aggregate rich contextual information for automatic concrete crack detection.

Once cracks are detected in an image, the evaluation of crack properties such as crack area and crack width is a direct demand, since it assists practitioners in decision-making and maintenance scheduling. To achieve this, crack pixel information must be converted to physical dimensions. Li et al.³⁷ proposed a crack image binarization architecture called SegNet-DCRF and further calculated the unidirectional crack width and web crack area. Bhowmick et al.³⁸ proposed a U-Net architecture for crack image segmentation and used morphological operations from image processing to quantify the geometrical properties of concrete surface cracks. Building on the accurate identification of cracks in images, this research conducts morphological feature measurement and crack severity ranking to enable potential application of the proposed algorithm.

In this research, an attention-based feature fusion network (AFFNet) is proposed for crack segmentation under various complex conditions. The attention mechanism is used to aggregate crack features and suppress irrelevant features to improve segmentation performance. To capture rich contextual information, the vertical and horizontal compression attention module (VH-CAM) is placed on top of the ResNet101 backbone;³⁹ it uses two asymmetric convolutions so that a single pixel contains more information. Meanwhile, the efficient channel attention upsample module (ECAUM) combines efficient channel attention (ECA) and feature fusion to restore semantic boundaries by guiding low-level features. Consequently, these two attention mechanism modules contribute to better feature representations and more precise crack segmentation results. In addition, the semantic segmentation results produced by AFFNet are used to quantitatively measure the morphological features of cracks by means of single-pixel-width skeletons.

The content of this paper is organized as follows. Section “Methodology” presents the detailed feature fusion network. Section “Implementation details” is devoted to the implementation details, including process of generating the dataset and training parameters setting. Section “Experimental results” introduces the experimental results and corresponding analysis. Section “Discussion” describes the discussion of different test sets. Section “Conclusion” summarizes the conclusion of this paper.

Methodology

To distinguish crack pixels from non-crack pixels, this research proposes a novel model named AFFNet, with a ResNet101 backbone pretrained on ImageNet and two integrated attention mechanism modules. The backbone network, ResNet,³⁹ is the first convolutional neural network with a depth of more than 100 layers; it solves the degradation problem, in which accuracy saturates and then decreases as the depth of the network increases. It won the ILSVRC and COCO 2015 competitions and has been widely used in semantic segmentation. In addition, in the last ResNet block, dilated convolution with a rate of two is employed and the stride is changed from two to one, which enlarges the output size of ResNet101 from 1/32 to 1/16 of the raw image. In this way, more details of the feature maps are retained. The structure of the proposed AFFNet, which includes two attention mechanism modules, VH-CAM and ECAUM, is shown in Figure 1. The quadrangular prisms represent the blocks of ResNet101 and the arrows represent the operations of the model. The detailed network parameters are listed in Table 1. Notably, the ReLU and BN layers used in ResNet101 are not presented in Table 1. The keep probability of the dropout layer in Figure 1 is 0.9, that is, each channel is set to 0 with a probability of 0.1.

Figure 1.

Overall architecture of the proposed AFFNet for semantic segmentation. The red and green lines represent down-sampling and up-sampling operations. The proposed model uses ResNet101 as the backbone and applies VH-CAM and ECAUM to improve crack segmentation.

Table 1.

The detailed configuration of each layer in AFFNet.

Layer name	Pad	Kernel size (width × height, channels)	Stride	Output size (width × height)	Output channels	Dilation
Input	–	–	–	224 × 224	3	–
Conv1	3	7 × 7, 64	2	112 × 112	64	1
Maxpool	1	3 × 3	2	56 × 56	64	–
Res-1	$[\begin{matrix} 0 \\ 1 \\ 0 \end{matrix}] \times 3$	$[\begin{matrix} 1 \times 1, 64 \\ 3 \times 3, 64 \\ 1 \times 1, 256 \end{matrix}] \times 3$	$[\begin{matrix} 1 \\ 1 \\ 1 \end{matrix}] \times 3$	56 × 56	256	$[\begin{matrix} 1 \\ 1 \\ 1 \end{matrix}] \times 3$
Res-2	$[\begin{matrix} 0 \\ 1 \\ 0 \end{matrix}] \times 4$	$[\begin{matrix} 1 \times 1, 128 \\ 3 \times 3, 128 \\ 1 \times 1, 512 \end{matrix}] \times 4$	$[\begin{matrix} 1 \\ 2 \\ 1 \end{matrix}] \times 1 + [\begin{matrix} 1 \\ 1 \\ 1 \end{matrix}] \times 3$	28 × 28	512	$[\begin{matrix} 1 \\ 1 \\ 1 \end{matrix}] \times 4$
Res-3	$[\begin{matrix} 0 \\ 1 \\ 0 \end{matrix}] \times 23$	$[\begin{matrix} 1 \times 1, 256 \\ 3 \times 3, 256 \\ 1 \times 1, 1024 \end{matrix}] \times 23$	$[\begin{matrix} 1 \\ 2 \\ 1 \end{matrix}] \times 1 + [\begin{matrix} 1 \\ 1 \\ 1 \end{matrix}] \times 22$	14 × 14	1024	$[\begin{matrix} 1 \\ 1 \\ 1 \end{matrix}] \times 23$
Res-4	$[\begin{matrix} 0 \\ 2 \\ 0 \end{matrix}] \times 3$	$[\begin{matrix} 1 \times 1, 512 \\ 3 \times 3, 512 \\ 1 \times 1, 2048 \end{matrix}] \times 3$	$[\begin{matrix} 1 \\ 1 \\ 1 \end{matrix}] \times 3$	14 × 14	2048	$[\begin{matrix} 1 \\ 2 \\ 1 \end{matrix}] \times 3$
VH-CAM	–	–	–	14 × 14	2048	–
ECAUM(1)	–	–	–	14 × 14	1024	–
ECAUM(2)	–	–	–	28 × 28	512	–
ECAUM(3)	–	–	–	56 × 56	256	–
Dropout	–	–	–	56 × 56	256	–
Conv2	0	1 × 1, 2	1	56 × 56	2	1
Upsample	–	–	–	224 × 224	2	–
Conv3	1	3 × 3, 2	1	224 × 224	2	1
Softmax	–	–	–	224 × 224	2	–

AFFNet: attention-based feature fusion network; VH-CAM: vertical and horizontal compression attention module; ECAUM: efficient channel attention upsample module.
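
The output sizes listed in Table 1 can be checked with the standard output-size formula for convolution and pooling layers. The sketch below is only an arithmetic illustration under the kernel, stride, padding and dilation values given in the table; the function names `conv_output_size` and `trace_backbone` are hypothetical and not part of the original implementation.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def trace_backbone(size=224):
    """Follow the spatial size through the backbone layers listed in Table 1."""
    sizes = {"Input": size}
    size = conv_output_size(size, kernel=7, stride=2, padding=3)
    sizes["Conv1"] = size
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    sizes["Maxpool"] = size
    # Res-1 keeps the resolution (all strides are 1).
    size = conv_output_size(size, kernel=3, stride=1, padding=1)
    sizes["Res-1"] = size
    # Res-2 and Res-3 each halve the resolution in their first 3 x 3 convolution.
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    sizes["Res-2"] = size
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    sizes["Res-3"] = size
    # Res-4 uses stride 1 with dilation 2 and padding 2, so the size is kept
    # and the backbone output stays at 1/16 of the input instead of 1/32.
    size = conv_output_size(size, kernel=3, stride=1, padding=2, dilation=2)
    sizes["Res-4"] = size
    return sizes


sizes = trace_backbone(224)
```

Only the 3 × 3 convolutions are traced here because the 1 × 1 convolutions with stride 1 and no padding do not change the spatial size.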

Similar to human attention, the attention mechanism can concentrate on the features that need to be focused on to acquire more details and ignore irrelevant information.⁴⁰ The essence of the attention mechanism is to learn weight distribution relevant to feature maps. In recent years, the attention mechanism has been developed rapidly in computer vision in light of its advantages. Here, VH-CAM and ECAUM are chosen with key features and explained in following sections.

Vertical and horizontal compression attention module

It is known that the contextual information is of considerable importance in semantic segmentation due to multiple scales of objects.⁴¹ However, local features from traditional FCN may mislead the classification process at the pixel level.³² To overcome this issue, the VH-CAM is introduced,⁴¹ which can capture rich contextual information to accomplish the crack segmentation task. Different from the position attention module in DANet, VH-CAM employs two asymmetric convolutions with the kernel size of 1 × W and H × 1 to enable each pixel to contain more information. Then the attention map is obtained through matrix operations, which is more comprehensive than using 1 × 1 convolution. Next, the process to aggregate contextual information between crack and background is introduced.

The working principle of VH-CAM is described in Figure 2. The feature map $A \in R^{C \times H \times W}$ , where H, W, and C represent the height, width, and channel number of the feature map, respectively, is first sent into two asymmetric convolution layers with kernel sizes of 1 × W and H × 1 to generate two compressed feature maps B and D, as shown in Figure 3. $B \in R^{C' \times H \times 1}$ and $D \in R^{C' \times 1 \times W}$ have a different number of channels ( $C'$ ) from A. Then D is reshaped to $D' \in R^{C' \times W}$ . Meanwhile, after reshape and transpose operations, B is transformed to $B' \in R^{H \times C'}$ ; the specific operation is shown in Figure 4. After that, a matrix multiplication between $B'$ and $D'$ is performed and the sigmoid function is applied to generate the two-dimensional attention map $E \in R^{H \times W}$ :

\begin{matrix} e_{ij} = \frac{1}{1 + \exp (- {B'}_{i} \times {D'}_{j})} \end{matrix}

(1)

where ${B'}_{i}$ indicates the $i^{th}$ row vector of feature map $B'$ and ${D'}_{j}$ indicates the $j^{th}$ column vector of feature map $D'$ . Similarly, $e_{ij}$ indicates the weight at the $i^{th}$ row and $j^{th}$ column of the attention map E after matrix multiplication operation.

Figure 2.

Structure of the proposed vertical and horizontal compression attention module.

Figure 3.

The example of asymmetric convolution (refer to the mathematical process to generate B feature map in Figure 1, the same principle can be used to generate feature map D when the convolutional kernel is a H × 1 vector).

Figure 4.

The example of reshape and transpose operations.

Then, a multiplication operation is applied to E and A to generate a new feature map $F \in R^{C \times H \times W}$ possessing rich context as:

\begin{matrix} F_{m} (i, j) = A_{m} (i, j) e_{ij} \end{matrix}

(2)

where $A_{m} (i, j)$ indicates the element at the $i^{th}$ row and $j^{th}$ column of the $m^{th}$ (m = 1, 2, …, C) channel in the original feature map A, and $F_{m} (i, j)$ is defined in the same way for F.

Finally, each element in feature map F is multiplied by the parameter $α$ . And then the point-wise addition operation is performed on the original feature map A and the above multiplied result to generate the final output $G \in R^{C \times H \times W}$ :

\begin{matrix} G_{m} (i, j) = α \times F_{m} (i, j) + A_{m} (i, j) \end{matrix}

(3)

where $α$ is a learnable weight that is initialized to 0. It can be seen from Equation (2) that the attention map E assigns different weights to each element in the original feature map A. Therefore, the final output G in Equation (3) can aggregate long-range contexts according to the spatial attention map.
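
As an illustration of Equations (1) to (3), the following NumPy sketch traces the shapes through one VH-CAM forward pass. It is a simplified sketch rather than the original implementation: the asymmetric convolutions are written as unpadded, bias-free contractions, the kernel banks `K_w` and `K_h` are random placeholders, and $α$ is passed as a plain float instead of a learnable parameter.

```python
import numpy as np


def vh_cam(A, K_w, K_h, alpha):
    """VH-CAM forward pass following Equations (1)-(3).

    A:     input feature map, shape (C, H, W)
    K_w:   1 x W kernel bank, shape (C', C, W); compresses the width axis
    K_h:   H x 1 kernel bank, shape (C', C, H); compresses the height axis
    alpha: scalar weight (learnable in the paper, a plain float here)
    """
    # Asymmetric convolutions without padding collapse one spatial axis each.
    B = np.einsum("ocw,chw->oh", K_w, A)   # (C', H), i.e. C' x H x 1 squeezed
    D = np.einsum("och,chw->ow", K_h, A)   # (C', W), i.e. C' x 1 x W squeezed

    B_prime = B.T                          # (H, C') after reshape and transpose
    D_prime = D                            # (C', W) after reshape

    # Equation (1): sigmoid of the (H x C') by (C' x W) matrix product.
    E = 1.0 / (1.0 + np.exp(-(B_prime @ D_prime)))   # (H, W)

    # Equation (2): the same spatial weights are applied to every channel.
    F = A * E[None, :, :]

    # Equation (3): scaled attention output plus the original feature map.
    return alpha * F + A, E


rng = np.random.default_rng(0)
C, C_red, H, W = 8, 4, 14, 14
A = rng.standard_normal((C, H, W))
K_w = rng.standard_normal((C_red, C, W)) * 0.01
K_h = rng.standard_normal((C_red, C, H)) * 0.01
G, E = vh_cam(A, K_w, K_h, alpha=0.0)
```

With $α$ = 0, as at initialization, the output G equals the input A, so the module starts as an identity mapping and the attention branch only contributes as $α$ is learned.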

Efficient channel attention upsample module

In general, low-level features and high-level features are equally important.⁴² To restore the details lost in consecutive down-sampling, many models adopt encoder-decoder structures, such as FCN,⁴³ U-Net,⁴⁴ and SegNet.⁴⁵ However, these encoder-decoder structures lack appropriate guidance and may cause misclassification.⁴¹ To overcome this problem, ECAUM, which combines the attention mechanism with feature fusion, is adopted here. Because it involves few parameters and performs well, the ECA⁴⁶ in ECAUM is used to provide high-level semantic information as guidance to help low-level features select precise resolution details. Moreover, since the residual blocks dominate the performance of ResNet101, three ECAUMs are used to fuse the residual block features with the decoder features.

The structure of ECAUM is illustrated in Figure 5 and the computation of ECA is described in Figure 6. First, the high-level feature map $B \in R^{C' \times H' \times W'}$ is passed through global average pooling (see the illustration in Figure 6) to obtain rough global contextual information without dimensionality reduction:

Y_{m} = \frac{1}{H' \times W'} \sum_{i = 1}^{H'} \sum_{j = 1}^{W'} B_{m} (i, j)

(4)

Here, $B_{m} (i, j)$ indicates the element at the $i^{th}$ row and $j^{th}$ column of the $m^{th}$ (m = 1, 2, …, $C'$ ) channel in the high-level feature map B, and $Y_{m}$ indicates the average value of the $m^{th}$ channel, stored in the feature vector Y.

Figure 5.

Structure of the proposed efficient channel attention upsample module.

Figure 6.

The details of efficient channel attention.

Then, the channel weight vector $Y'$ is obtained by applying a 1D convolution of kernel size k to Y. To make the channel number of $Y'$ equal to that of the low-level feature map A, a 1 × 1 convolution is used. After that, the sigmoid function is applied to limit the weights to the range [0, 1], giving the vector $Y ″$ :

\begin{matrix} Y ″ = σ (w_{2} (w_{1} (Y))) \end{matrix}

(5)

where $w_{1}$ indicates 1D convolution, $w_{2}$ indicates 1 × 1 convolution and $σ$ indicates the sigmoid function.

Then, $Y ″$ is utilized as a guidance of the feature map A with below operation:

\begin{matrix} E_{m} = w_{3} ({Y ″}_{m} A_{m} + A_{m}) \end{matrix}

(6)

where $w_{3}$ indicates 1 × 1 convolution and the feature map $E \in R^{C \times H \times W}$ is the same size as A. $A_{m}$ indicates the element of the $m^{th}$ channel in A and the definitions of $E_{m}$ and ${Y ″}_{m}$ are the same.
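
The channel-attention guidance of Equations (4) to (6) can be sketched in NumPy as follows. This is an illustrative sketch under simplifying assumptions: the 1 × 1 convolutions are written as bias-free matrices `W2` and `W3`, the 1D convolution is zero-padded so that the channel count is preserved, and all weights are random placeholders rather than trained values.

```python
import numpy as np


def eca_guidance(A, B, w1, W2, W3):
    """Channel-attention guidance following Equations (4)-(6).

    A:  low-level feature map, shape (C, H, W)
    B:  high-level feature map, shape (C', H', W')
    w1: 1D convolution kernel over the channel axis, shape (k,), k odd
    W2: 1 x 1 convolution written as a matrix, shape (C, C')
    W3: 1 x 1 convolution written as a matrix, shape (C, C)
    """
    # Equation (4): global average pooling, one value per channel of B.
    Y = B.mean(axis=(1, 2))                              # (C',)

    # 1D convolution across neighbouring channels, zero-padded to keep length
    # (written as cross-correlation, as deep learning frameworks do).
    pad = len(w1) // 2
    Y_padded = np.pad(Y, pad)
    Y_prime = np.array([Y_padded[i:i + len(w1)] @ w1 for i in range(len(Y))])

    # Equation (5): match the channel count of A, then squash with a sigmoid.
    Y_dprime = 1.0 / (1.0 + np.exp(-(W2 @ Y_prime)))     # (C,)

    # Equation (6): reweight A channel-wise, add A back, apply the 1 x 1 conv.
    guided = Y_dprime[:, None, None] * A + A
    E = np.einsum("oc,chw->ohw", W3, guided)
    return E, Y_dprime


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8, 8))      # low-level map, C = 4
B = rng.standard_normal((6, 4, 4))      # high-level map, C' = 6
E, weights = eca_guidance(
    A, B,
    w1=rng.standard_normal(3),          # k = 3
    W2=rng.standard_normal((4, 6)),
    W3=np.eye(4),                       # identity, to make the effect easy to inspect
)
```

With `W3` set to the identity, each channel of A is simply scaled by one plus its attention weight, which makes the guiding role of the high-level map easy to inspect.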

The fusion of low-level features and high-level features is an effective approach to restore the lost details caused by the consecutive down-sampling. The transposed convolution upsample is utilized as an efficient method to enlarge the high-level feature map $B \in R^{C^{'} \times H^{'} \times W^{'}}$ and then the 1 × 1 convolution operation is adopted to obtain the new feature map $D \in R^{C \times H \times W}$ , which has the same size as feature map A:

\begin{matrix} D = w_{4} (δ (B)) \end{matrix}

(7)

where $δ$ indicates transposed convolution and $w_{4}$ indicates 1 × 1 convolution operation.

Finally, E and D are concatenated and a 3 × 3 convolution followed by BN and ReLU is applied to obtain the final output $F \in R^{C \times H \times W}$ . Notably, the last ECAUM module uses two 3 × 3 convolutions to increase the depth of the model:

\begin{matrix} F = w_{5} (φ (D, E)) \end{matrix}

(8)

where $φ$ indicates concatenation operation and $w_{5}$ indicates 3 × 3 convolution. The output F is used as the high-level feature map for the next module.

Implementation details

Dataset construction

To verify the effectiveness and robustness of AFFNet, a concrete crack image dataset is constructed for the following experimental validations. To ensure variability, the images in the dataset contain not only wall cracks but also pavement and bridge cracks, saved in JPG format. Furthermore, some of the crack images contain various types of noise that are often observed on concrete structures, such as spots, shadows, water stains, handwriting, Gaussian noise, and insufficient lighting.

The dataset contains 1760 crack images, of which 776 are taken from a published study,²⁰ 524 are collected manually, and 460 are generated using data augmentation techniques. The manually collected images are taken with a 40-megapixel smartphone at different distances without zoom, where the aperture is f/1.8, the ISO is 50, and the original full image resolution is 2736 × 3648 pixels. To decrease the computational cost of training the model, the original images are cropped into sub-images with a size of 224 × 224 pixels. To detect cracks in more complex environments, data augmentation techniques such as rotation and Gaussian noise are used to increase the complexity of the dataset. The proposed AFFNet produces the crack shape and location by segmenting crack images to obtain important crack features. Therefore, the images obtained by the cropping operation are labeled as ground truths using Photoshop software. Then, these ground truths are converted to single-channel PNG format, where crack pixels and background pixels are labeled as 255 and 0, respectively. To assess the generalization ability of the proposed AFFNet, the 1760 images in the dataset are randomly divided into three parts: 64% are used for training, 16% for validation, and the remaining 20% for testing. Specifically, ten types of cracks are included in the dataset, comprising cracks without noise and cracks with noise. For the former, there are four subgroups: (1) diagonal crack: contains only one crack in the diagonal direction; (2) transverse crack: contains only a single transverse crack; (3) reticulation crack: contains more than one crack; (4) wide crack: filled with stones and earth. For the latter, there are six subgroups: (1) crack with spalls in the concrete surface; (2) crack with shadow, where the shadow interferes with crack detection; (3) crack with water stain around the crack; (4) crack with handwriting, where black handwriting resembles the crack; (5) crack with Gaussian noise; (6) crack under insufficient lighting.
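
The random 64%/16%/20% division can be sketched as below. The rounding rule is an assumption, since 1760 images do not divide evenly at these percentages: the training and validation counts are rounded down and the test set receives the remainder. The function name `split_dataset` and the fixed seed are hypothetical.

```python
import random


def split_dataset(items, train_pct=64, val_pct=16, seed=0):
    """Shuffle items and split them into training, validation and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # the remainder, roughly 20%
    return train, val, test


# Image indices stand in for file names here.
train, val, test = split_dataset(range(1760))
```

Under this rounding rule the 1760 images yield 1126 training, 281 validation, and 353 test images; a different rounding choice would shift these counts by at most one image per part.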

Model initialization

Model initialization affects whether the model converges.⁴⁷ When training AFFNet, transfer learning is adopted instead of training from scratch, to improve the training efficiency and crack segmentation performance. Consequently, all convolutional layers are initialized in the same way as in the pretrained ResNet101: weights are initialized with the Kaiming method,⁴⁸ and biases are set to 0 and are not trained.

Moreover, AFFNet uses transposed convolution to enlarge the high-level feature map. Compared with other upsampling methods, transposed convolution has learnable weights, so the network can learn a better upsampling result during training.

Loss function

The loss function can estimate the discrepancy between the predicted result and the ground truth.²⁰ The optimal solution of the model needs to minimize the value of loss function by fine-tuning parameters in the training process. Therefore, the selection of an appropriate loss function is indispensable for AFFNet. Since crack segmentation can be regarded as pixel-level classification, cross entropy loss function is applied to the proposed AFFNet on account of its effectiveness and solid theoretical grounding. The formula of the corresponding loss function for each pixel can be represented as:

\begin{matrix} L = - [y \ln (p) + (1 - y) \ln (1 - p)] \end{matrix}

(9)

where y and p indicate the ground truth value and predicted value, respectively. And the total loss for each concrete crack image is the mean of all losses for the pixels.
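
A minimal sketch of the per-pixel binary cross-entropy and its mean over an image is given below, written with the leading negative sign so that the loss is non-negative and decreases as predictions improve. It assumes the ground-truth labels have been mapped from {0, 255} to {0, 1} and that p is the predicted crack probability; the clamping constant `eps` is an added numerical safeguard, not something specified in the text.

```python
import math


def pixel_bce(y, p, eps=1e-7):
    """Binary cross-entropy for one pixel; y in {0, 1}, p = predicted P(crack)."""
    p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def image_loss(labels, probs):
    """Mean of the per-pixel losses over one image (flattened sequences)."""
    losses = [pixel_bce(y, p) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)


# Four hypothetical pixels: two crack pixels and two background pixels.
labels = [1, 1, 0, 0]
good = image_loss(labels, [0.9, 0.8, 0.1, 0.2])   # mostly correct predictions
bad = image_loss(labels, [0.2, 0.3, 0.7, 0.9])    # mostly wrong predictions
```

The mostly correct predictions give a smaller loss than the mostly wrong ones, which is the behaviour the optimizer relies on.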

Optimizer

The optimizer is one of the crucial components of DL due to the ability of minimizing the value of loss function and updating model parameters. Due to the fast updating speed and simple setting, the stochastic gradient descent with momentum (SGDM) is employed to train AFFNet.¹² The weight decay, an important parameter in the optimizer, is set to 0.0001. In addition, the batch size is set to eight when training AFFNet. The expression for updating parameters using SGDM is as follows:

\begin{matrix} v_{t} = γ v_{t - 1} + η \nabla_{θ} J (θ) \end{matrix}

(10)

\begin{matrix} θ = θ - v_{t} \end{matrix}

(11)

where $η$ represents learning rate, $\nabla_{θ}$ represents gradient of the loss function $J (θ)$ , and $γ$ represents momentum with a value of 0.9.
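
Equations (10) and (11) can be illustrated with a small pure-Python sketch on a toy quadratic objective. The way weight decay enters the update is an assumption (an L2 term added to the gradient), and the function name `sgdm_step` and the toy problem are hypothetical.

```python
def sgdm_step(theta, velocity, grad, lr, momentum=0.9, weight_decay=1e-4):
    """One SGDM update following Equations (10) and (11)."""
    # Assumption: weight decay is applied as an L2 term added to the gradient.
    grad = [g + weight_decay * t for g, t in zip(grad, theta)]
    # Equation (10): v_t = gamma * v_{t-1} + eta * gradient
    velocity = [momentum * v + lr * g for v, g in zip(velocity, grad)]
    # Equation (11): theta = theta - v_t
    theta = [t - v for t, v in zip(theta, velocity)]
    return theta, velocity


# Toy problem: minimise J(theta) = 0.5 * sum(theta_i ** 2), whose gradient is theta.
theta, velocity = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    theta, velocity = sgdm_step(theta, velocity, grad=theta, lr=0.05)
```

On this toy objective the parameters spiral towards the minimum at zero; the momentum term carries part of the previous update into the next one, which speeds up progress along consistent gradient directions.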

The learning rate controls the updating speed of the model parameters during training. A small learning rate reduces the updating speed, whereas an overly large learning rate can leave the parameters oscillating around the optimal value. Therefore, a learning rate schedule based on an exponential decay function is adopted in this paper, as follows:

\begin{matrix} l r_{t} = l r_{0} * r^{t} \end{matrix}

(12)

where $l r_{0}$ is the initial learning rate, r = 0.95 is the drop factor, and t is the number of elapsed drop periods. Here the learning rate is updated once per epoch, so t is the epoch index.
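
The schedule in Equation (12) is straightforward to tabulate. The sketch below assumes an initial learning rate of $5 \times 10^{- 2}$, the value selected later in the experiments, and treats t as the zero-based epoch index; the function name `exponential_lr` is hypothetical.

```python
def exponential_lr(lr0, epoch, r=0.95):
    """Learning rate after `epoch` completed epochs, Equation (12)."""
    return lr0 * r ** epoch


# Learning rate used in each of 100 training epochs (epoch index 0 to 99).
schedule = [exponential_lr(5e-2, t) for t in range(100)]
```

Each epoch multiplies the learning rate by 0.95, so the step size shrinks smoothly from its initial value and the late epochs make only fine adjustments.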

Evaluation metrics

The performance of AFFNet in crack detection needs to be evaluated with standard, well-known metrics.⁴⁸ In this paper, pixel accuracy (PA), mean pixel accuracy (MPA), MIoU, and frequency weighted intersection over union (FWIoU) are used.⁴⁸ The symbols are defined as follows: for a segmentation task with k + 1 classes, $p_{ij}$ represents the number of pixels that belong to class i but are predicted as class j. The same definition applies to the other symbols, $p_{ii}$ , $p_{jj}$ , and $p_{ji}$ ; in particular, $p_{ii}$ is the number of correctly classified pixels of class i. The four evaluation metrics are defined as:

PA = \frac{\sum_{i = 0}^{k} p_{ii}}{\sum_{i = 0}^{k} \sum_{j = 0}^{k} p_{ij}}

(13)

MPA = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{p_{ii}}{\sum_{j = 0}^{k} p_{ij}}

(14)

\begin{matrix} MIoU = \frac{1}{k + 1} \sum_{i = 0}^{k} \frac{p_{ii}}{\sum_{j = 0}^{k} p_{ij} + \sum_{j = 0}^{k} p_{ji} - p_{ii}} \end{matrix}

(15)

\begin{matrix} FWIoU = \frac{1}{\sum_{i = 0}^{k} \sum_{j = 0}^{k} p_{ij}} \sum_{i = 0}^{k} \frac{\sum_{j = 0}^{k} p_{ij} p_{ii}}{\sum_{j = 0}^{k} p_{ij} + \sum_{j = 0}^{k} p_{ji} - p_{ii}} \end{matrix}

(16)

Among all the above metrics, MIoU stands out to evaluate segmentation models because of its representativeness and simplicity.
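
Equations (13) to (16) can all be computed from a single confusion matrix. The following sketch assumes every class appears at least once in the ground truth, so that no denominator is zero; the function name `segmentation_metrics` and the pixel counts in the example are hypothetical.

```python
def segmentation_metrics(conf):
    """PA, MPA, MIoU and FWIoU from a confusion matrix.

    conf[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    row_sums = [sum(conf[i]) for i in range(n)]                        # sum_j p_ij
    col_sums = [sum(conf[j][i] for j in range(n)) for i in range(n)]   # sum_j p_ji
    diag = [conf[i][i] for i in range(n)]                              # p_ii
    iou = [diag[i] / (row_sums[i] + col_sums[i] - diag[i]) for i in range(n)]

    pa = sum(diag) / total                                       # Equation (13)
    mpa = sum(diag[i] / row_sums[i] for i in range(n)) / n       # Equation (14)
    miou = sum(iou) / n                                          # Equation (15)
    fwiou = sum(row_sums[i] * iou[i] for i in range(n)) / total  # Equation (16)
    return {"PA": pa, "MPA": mpa, "MIoU": miou, "FWIoU": fwiou}


# Hypothetical pixel counts for a two-class (background, crack) problem.
metrics = segmentation_metrics([[90, 2], [3, 5]])
```

In this example the background dominates the image, so PA and FWIoU stay high even though the crack class has an IoU of only 0.5; MIoU weights both classes equally and therefore exposes the weak crack segmentation more clearly.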

Experimental results

Analysis of results

Initial learning rate

A number of studies have shown that the initial learning rate significantly affects the convergence of the loss function.⁵⁰ A small learning rate results in slow convergence, while a large one may hinder convergence. To obtain an appropriate value, three initial learning rates, $10^{- 2}$ , $5 \times 10^{- 2}$ , and $8 \times 10^{- 2}$ , are compared, each over 100 epochs of training. The loss function curves are shown in Figure 7; all training losses have converged after 100 training epochs. Since the validation set is used to preliminarily evaluate crack segmentation performance, only the validation losses need to be compared. Figure 7 shows that when the initial learning rate is $10^{- 2}$ or $8 \times 10^{- 2}$ , the validation loss is about 0.035 after 100 epochs, while the validation loss for $5 \times 10^{- 2}$ is about 0.03. The validation loss is therefore lowest when the initial learning rate is $5 \times 10^{- 2}$ .

Figure 7.

Training and validation loss curves under three initial learning rates during 100 epochs.

However, the loss function curves cannot fully reflect the performance of AFFNet.⁵¹ The metric MIoU is also used to select an appropriate initial learning rate. The MIoU curves are shown in Figure 8. It is observed that the MIoU of the validation set is higher when the initial learning rate is $5 \times 10^{- 2}$ . Therefore, the initial learning rate of $5 \times 10^{- 2}$ is selected as the optimal value in the proposed model.

Figure 8.

MIoU curves of three initial learning rates on training set and validation set during 100 epochs.

Execution time

The execution time, defined here as the training time per image, is a valuable metric for evaluating model efficiency.⁴⁹ To put the execution time of AFFNet in context, it is compared with four state-of-the-art models: U-Net, DeepLabv3+, Dilated FCN, and PAN. U-Net⁵² is included because of its multi-scale feature fusion and wide application; to make the output image the same size as the input image, zero-padding is adopted in its convolutional layers. DeepLabv3+⁵³ is chosen because it combines ASPP with an encoder-decoder structure. Dilated FCN⁴³ serves as the baseline. To reflect the advantage of the two attention mechanisms, PAN⁵⁴ with the same structure as AFFNet is also used for comparison. To ensure a fair comparison, all models are trained with the same hyper-parameters and number of epochs. The runtime is measured on a computer with a high-performance GPU (NVIDIA GeForce GTX 1060, 6 GB) using the PyTorch 1.7.1 framework. The execution time of AFFNet is competitive with the other models, as summarized in Table 2. In descending order of execution time, the models rank as: U-Net > PAN > AFFNet > DeepLabv3+ > Dilated FCN. U-Net shows the longest execution time (67 ms) due to the use of deconvolutional layers. Although Dilated FCN has the shortest execution time (33 ms) thanks to its simple decoder structure, this simplicity compromises its segmentation performance. In summary, AFFNet has an acceptable execution time (52 ms) and the highest MIoU (see Table 5).

Table 2.

Comparison of execution time.

Models	Backbone	Time (ms)
U-Net	–	67
DeepLabv3+	ResNet101	42
Dilated FCN	ResNet101	33
PAN	ResNet101	56
AFFNet	ResNet101	52

AFFNet: attention-based feature fusion network; FCN: fully convolutional network.

Visualization of attention module

For VH-CAM, the attention map in Figure 2 is a crucial component; visualizing it makes the weight distribution directly observable. In Figure 9, the attention maps corresponding to two input images are shown in column three. Red areas indicate a high contribution to the feature map, while blue areas indicate a low contribution. It is observed that some blue areas lie in background regions away from the crack. This indicates that VH-CAM can guide the proposed model to focus on the crack, even if not all the red areas are attached to the crack.

Figure 9.

Visualization of feature maps produced by two attention mechanisms: (a) Input, (b) ground truth, (c) the attention map in VH-CAM, (d) visualization results before using ECA, and (e) visualization results after using ECA.

For ECAUM, the ECA is performed on each channel using 1 × 1 convolution. Due to the small size of the feature maps and the large number of channels, it is not feasible to visualize the attention map directly. Here, Grad-CAM⁵⁵ is used to visualize the feature maps before and after the ECA in the last attention module (i.e., ECAUM(3) in Figure 1). In Figure 9, the feature maps before and after channel attention are visualized in columns four and five, respectively, to verify whether the attention highlights crack areas. Before channel attention, only a few blue areas appear in the background, which means that the model treats cracks and background as almost equally important. However, after channel attention, most of the red areas in the background become blue. This shows that the ECAUM helps to locate crack pixels. In short, these visualizations demonstrate the importance of the two attention mechanisms for improving segmentation performance in crack detection.

Effects of attention mechanisms

From the previous section, it can be seen that the attention mechanisms improve the segmentation performance by focusing on the important features, that is, the cracks. To further understand the advantage of the two attention mechanisms, the effects of VH-CAM and ECAUM are visualized in Figure 10, where red boxes denote incorrect segmentation predictions. Four networks are compared: the backbone ResNet101, ResNet101 with VH-CAM only, ResNet101 with ECAUM only, and ResNet101 with both attention mechanisms. As shown in Figure 10(c), some thin cracks are missed when no attention mechanism is used, especially cracks at the image boundary. Figure 10(d) shows that some previously misclassified crack pixels at the image boundary are correctly classified after adding VH-CAM. However, the crack is still not continuous and is split into multiple disconnected segments. Because ECAUM locates crack pixels better than VH-CAM, more crack pixels are classified correctly and the cracks become more continuous. However, some crack pixels remain undetected, such as the thin crack at the bottom of the second image in Figure 10(e). By comparison, the segmentation predictions using both VH-CAM and ECAUM are better than those using only one of them, and the cracks become more complete as a result.

Figure 10.

Visualization results of different attention mechanisms: (a) input, (b) ground truth, (c) without any attention mechanism, (d) with VH-CAM, (e) with ECAUM and (f) with VH-CAM and ECAUM.

Visualization of feature maps

Visualizing the feature maps of DL models provides insight into how the proposed model works. Figure 11 takes three concrete crack images as examples to show the visualization results of the feature maps. It is observed that the feature maps closer to the input layer, such as Res-1 and Res-2, capture substantial crack features. However, noise such as handwriting is also captured by AFFNet, as shown in Figure 11(a). As the image progresses through the subsequent layers, the features become increasingly abstract, which is important for the model to detect cracks. As the size of the feature maps increases in the decoder, the crack features become more accurate and the noise is filtered out. When the image reaches the output layer, the pixels are classified as crack or background.

Figure 11.

Visualization of feature maps at different modules: (a) image 1, (b) image 2 and (c) image 3.

Comparative study

Ablation study for k in ECAUM

The ECAUM involves a 1D convolutional layer with a crucial parameter, that is, the kernel size k. The kernel size k needs to be determined because it controls the coverage of local cross-channel interaction.⁴⁶ Therefore, AFFNet is trained using different values of k, and the comparison results are summarized in Table 3, where k is fixed in all 1D convolutional layers. It can be seen that MIoU increases as the value of k becomes smaller. Since AFFNet has more hidden layers, using a smaller k can improve the nonlinear fitting ability of AFFNet. Consequently, the proposed AFFNet achieves the best result at k = 3.
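To make the role of k concrete, the channel-attention step can be sketched as follows. This is a minimal plain-Python illustration of the general ECA idea (global average pooling, a 1D convolution of size k across channels, and a sigmoid), not the authors' implementation: the function names are hypothetical, and the kernel values, which are learned in a real network, are fixed by hand here.

```python
import math


def eca_weights(channel_means, kernel):
    """Sigmoid of a zero-padded 1D convolution across neighbouring channels."""
    k = len(kernel)
    assert k % 2 == 1, "kernel size k is assumed odd so the padding is symmetric"
    pad = k // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    weights = []
    for c in range(len(channel_means)):
        # Each channel weight only sees its k nearest channels.
        z = sum(kernel[t] * padded[c + t] for t in range(k))
        weights.append(1.0 / (1.0 + math.exp(-z)))
    return weights


def eca(feature_map, kernel):
    """Rescale each channel of a [channels][height][width] nested list."""
    # Global average pooling: one mean value per channel.
    means = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    weights = eca_weights(means, kernel)
    return [[[w * v for v in row] for row in ch]
            for w, ch in zip(weights, feature_map)]


# With k = 3, changing channel 0 only affects the weights of channels 0 and 1.
hand_picked_kernel = [0.2, 0.5, 0.3]  # hypothetical values, learned in practice
w_a = eca_weights([1.0, 2.0, 3.0, 4.0, 5.0], hand_picked_kernel)
w_b = eca_weights([9.0, 2.0, 3.0, 4.0, 5.0], hand_picked_kernel)
```

A larger k widens the neighbourhood of channels that influences each weight, which is the trade-off explored in Table 3.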

Table 3.

Comparison results of ECAUM with different k (%).

Method	k	PA	MPA	MIoU	FWIoU
ECAUM	3	98.36	92.01	84.49	97.07
ECAUM	5	98.26	91.33	83.47	96.87
ECAUM	7	97.85	90.76	82.77	96.76

ECAUM: efficient channel attention upsample module; PA: pixel accuracy; MPA: mean pixel accuracy; MIoU: mean intersection over union; FWIoU: frequency weighted intersection over union.

Ablation study for attention modules

The ablation study is designed to validate the effectiveness of the two attention mechanisms. The models with different attention mechanisms and the corresponding evaluation metrics are summarized in Table 4. Because crack pixels normally occupy only a small proportion of the total pixels, MPA and MIoU are sensitive to small changes in the number of crack pixels according to Equations (14) and (15). Therefore, MPA and MIoU are used as the main indicators in this research. It can be seen that the baseline FCN without any attention mechanism obtains the lowest evaluation metrics, with an MPA and MIoU of 82.96% and 76.26%, respectively. After applying the attention mechanisms, the MPA and MIoU increase steadily as the number of correctly detected crack pixels increases. Compared to the baseline FCN, the model adopting only VH-CAM yields a slight improvement of 1.84% and 1.95% in MPA and MIoU, to 84.80% and 78.21%. Meanwhile, adopting only ECAUM achieves a substantial increase of 8.90% and 7.96% in MPA and MIoU, to 91.86% and 84.22%. The combination of VH-CAM and ECAUM yields 92.01% and 84.49% in MPA and MIoU, which indicates that the two attention mechanisms are complementary.
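The sensitivity of MPA and MIoU to a small number of crack pixels can be illustrated with a toy calculation. The sketch below assumes the standard confusion-matrix definitions of PA, MPA, MIoU, and FWIoU, which may differ in notation from Equations (14) and (15); the function name and the pixel counts are hypothetical.

```python
def segmentation_metrics(conf):
    """conf[i][j] = number of pixels of true class i predicted as class j.

    Every class is assumed to occur at least once, so no division by zero.
    """
    n = len(conf)
    total = sum(map(sum, conf))
    true_tot = [sum(conf[i]) for i in range(n)]
    pred_tot = [sum(conf[i][j] for i in range(n)) for j in range(n)]
    hits = [conf[i][i] for i in range(n)]
    acc = [hits[i] / true_tot[i] for i in range(n)]
    iou = [hits[i] / (true_tot[i] + pred_tot[i] - hits[i]) for i in range(n)]
    return {
        "PA": sum(hits) / total,
        "MPA": sum(acc) / n,
        "MIoU": sum(iou) / n,
        "FWIoU": sum(true_tot[i] / total * iou[i] for i in range(n)),
    }


# Hypothetical 1000-pixel image: class 0 = background, class 1 = crack.
before = segmentation_metrics([[940, 10], [20, 30]])
# Same image after 10 more crack pixels are detected correctly.
after = segmentation_metrics([[940, 10], [10, 40]])
```

In this toy case, recovering 10 crack pixels raises PA by only 1 percentage point, while MPA rises by 10 points and MIoU by roughly 8.8 points.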

Table 4.

Ablation study of two proposed attention mechanism modules on the test set (%).

Models	Backbone	VH-CAM	ECAUM	PA	MPA	MIoU	FWIoU
Dilated FCN	ResNet101			97.75	82.96	76.26	95.88
AFFNet	ResNet101	✓		97.89	84.8	78.21	95.96
AFFNet	ResNet101		✓	98.35	91.86	84.22	96.88
AFFNet	ResNet101	✓	✓	98.36	92.01	84.49	97.07

FCN: fully convolutional network; AFFNet: attention-based feature fusion network; VH-CAM: vertical and horizontal compression attention module; ECAUM: efficient channel attention upsample module; PA: pixel accuracy; MPA: mean pixel accuracy; MIoU: mean intersection over union; FWIoU: frequency weighted intersection over union.

Comparison with other semantic segmentation models

To assess the performance of AFFNet, four state-of-the-art models trained on the same dataset are compared with the proposed model. The segmentation results of the five models are listed in Table 5. The proposed AFFNet outperforms the other models on all four metrics. Owing to the concentration on crack features by VH-CAM and ECAUM, AFFNet identifies more crack pixels than the other models. Therefore, AFFNet achieves the highest evaluation metrics, with MPA and MIoU reaching 92.01% and 84.49%, respectively. The Dilated FCN with a simple decoder shows the lowest MPA and MIoU compared to the other models, which have more trainable parameters in the decoder. PAN achieves a slightly higher MPA and MIoU than Dilated FCN, which is attributed to the simple combination of feature fusion and the attention mechanism used, namely global attention upsample, which is incapable of capturing fine crack features (76.79% in MIoU with global attention upsample only). The combination of ASPP and the encoder-decoder structure contributes to the good performance of DeepLabv3+ in crack detection. However, because two upsample operations cannot restore the lost details efficiently, the performance of DeepLabv3+ is inferior to that of U-Net and AFFNet. The four deconvolutional layers in U-Net are able to recover the image resolution. However, its MIoU is still lower than that of AFFNet on account of the direct fusion between low-level and high-level features.⁴² Consequently, the proposed AFFNet achieves the best performance because it can capture rich contextual information and guide low-level features to recover the crack localization.

Table 5.

Segmentation results of five models (%).

Models	PA	MPA	MIoU	FWIoU
U-Net	98.11	88.83	81.57	96.6
Dilated FCN	97.66	82.78	76.13	95.83
DeepLabv3+	97.8	86.57	79.62	96.04
PAN	97.6	84.07	77.26	95.68
AFFNet	98.36	92.01	84.49	97.07

FCN: fully convolutional network; AFFNet: attention-based feature fusion network; PA: pixel accuracy; MPA: mean pixel accuracy; MIoU: mean intersection over union; FWIoU: frequency weighted intersection over union.

In addition, to understand the respective enhancement brought by the two designed attention mechanisms, VH-CAM and ECAUM are incorporated into U-Net and DeepLabv3+ for comparison. The same training set and test set are used for this experiment with the two modified models. The results show that the MIoU of DeepLabv3+ increases by 1.17%, from 79.62% to 80.79%. Because DeepLabv3+ has only one feature fusion operation, its performance is not much improved. In contrast, the four feature fusion operations of U-Net lead to a larger improvement with the addition of the two attention mechanisms, and its MIoU increases by 1.54%, from 81.57% to 83.11%. These results indicate that the two attention mechanisms can improve the performance of other models, but the extent of improvement is related to the number of feature fusion operations. Thus, VH-CAM and ECAUM can be plugged into existing semantic segmentation models.

Discussion

Visualization results of multi-type crack image

To verify the effectiveness and robustness of AFFNet, a comparative experiment is conducted using different types of cracks. Figure 12 shows the visual comparison between AFFNet and the other models. From top to bottom, the four types of cracks are diagonal cracks, transverse cracks, reticulation cracks, and wide cracks. From left to right, the columns show the input image, the ground truth, and the predictions of U-Net, Dilated FCN, DeepLabv3+, PAN, and AFFNet. It can be seen that when cracks are thin or the contrast between cracks and background is low, Dilated FCN, DeepLabv3+, and PAN fail to capture part of the thin cracks, as clearly presented in the reticulation crack case. U-Net performs better than these three models, but a few crack features in the reticulation crack image are still missing. Meanwhile, for wide cracks, scattered pixels that should belong to the background are misclassified as crack pixels along the crack edge predicted by U-Net. In the wide crack case, Dilated FCN produces small holes in the crack area and PAN cannot generate a complete crack due to the influence of background noise. In contrast, AFFNet adopts two attention mechanisms to extract more crack information, which improves the accuracy of crack detection. Overall, the segmentation performance of AFFNet is better than that of the other models.

Figure 12.

Prediction results of different types of cracks using different models: (a) input, (b) ground truth, (c) U-Net, (d) Dilated FCN, (e) DeepLabv3+, (f) PAN and (g) AFFNet.

Visualization results of concrete cracks under complex conditions

It should be noted that the above concrete crack images are relatively clean and contain little noise. In reality, however, cracks vary widely and appear with various visual disturbances. Such images are affected by spots, shadows, water stains, and handwriting, which increase the difficulty of crack detection. Therefore, another comparative experiment is conducted using cracks under complex conditions. Figure 13 shows the visual comparison between AFFNet and the other models on six types of cracks, such as crack with spots, crack with shadow, crack with water stain, crack with handwriting, and so on. It can be seen that all the models are reasonably able to distinguish crack from noise when detecting cracks with spots and shadow. However, the model deficiencies described in the previous section still exist. For example, Dilated FCN, DeepLabv3+, and PAN are unable to detect the thin part of the crack, and U-Net misclassifies part of the background as crack at the edge of the wide crack. In addition, U-Net incorrectly detects the background at the shadow edge as crack. For the crack with water stain, all the other models exhibit false positives due to the low contrast between crack and water stain. These models overlook the width information of cracks, and the predicted crack width is usually larger than the ground truth. Meanwhile, part of the thin cracks is also missed by Dilated FCN, DeepLabv3+, and PAN. For the crack with handwriting, the discrepancy in crack detection is more distinct. Owing to the shared pretrained ResNet101 backbone, Dilated FCN, DeepLabv3+, and PAN distinguish cracks from handwriting well. However, U-Net incorrectly recognizes part of the handwriting as cracks. Unlike the other models, AFFNet, based on a pretrained ResNet101, still provides a satisfactory crack segmentation result when detecting concrete cracks under complex conditions.

Figure 13.

Prediction results of cracks under complex conditions using different models: (a) input, (b) ground truth, (c) U-Net, (d) Dilated FCN, (e) DeepLabv3+, (f) PAN and (g) AFFNet.

Quantification of crack images

The cracks identified by AFFNet in the test set are used to quantify three morphological features at the pixel level: crack area, crack length, and crack mean width. The crack area is obtained by counting the number of crack pixels. Obtaining the crack length is more involved: the crack is first skeletonized into thin lines with a single-pixel width, and the crack length is then obtained by counting the number of pixels in these thin lines. In this paper, the approach in the literature⁵⁶ is used to skeletonize the cracks. The crack mean width is the ratio between the crack area and the crack length.
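The three pixel-level quantities can be computed directly from two binary grids. The following is a minimal sketch, assuming that the skeleton has already been produced by a thinning algorithm; here it is written by hand for a toy crack, and the function name `quantify_crack` is hypothetical.

```python
def quantify_crack(mask, skeleton):
    """Return (area, length, mean width) in pixels from two binary grids."""
    area = sum(map(sum, mask))          # number of crack pixels
    length = sum(map(sum, skeleton))    # number of skeleton pixels
    # Mean width is the area-to-length ratio; guard against an empty skeleton.
    mean_width = area / length if length else 0.0
    return area, length, mean_width


# Toy 5 x 10 image: a horizontal crack band three pixels thick (rows 1-3).
mask = [[1 if 1 <= r <= 3 else 0 for _ in range(10)] for r in range(5)]
# Its one-pixel-wide centre line (row 2), written by hand for this example.
skeleton = [[1 if r == 2 else 0 for _ in range(10)] for r in range(5)]
area, length, mean_width = quantify_crack(mask, skeleton)
```

For this band the area is 30 pixels, the length is 10 pixels, and the mean width is 3 pixels, matching the thickness of the band.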

The quantification differences between the predicted results and the ground truth are illustrated in Figure 14. As shown in Figure 14(a), the accuracy of the crack area is not yet satisfactory: many points are scattered above the diagonal line, indicating that some background pixels in the crack images are misclassified as crack pixels. Meanwhile, some crack pixels are missed by AFFNet when the crack area exceeds 6000 pixels, so AFFNet tends to underestimate the crack area for large cracks. With respect to crack length, most plotted points are near the diagonal line, which means that AFFNet performs well in identifying crack length. The crack mean width is influenced by the other two indicators. Statistically, the predicted area and mean width are greater than the ground truth in 69.6% and 72.3% of the cases, respectively, while the predicted length is lower than the ground truth in 74.23% of the cases. This means that the proposed model tends to enlarge the crack width and shorten the crack length. A possible reason for the enlarged crack width is that AFFNet is prone to generating coarse segmentation for thin cracks because of up-sampling. The underestimation of crack length occurs because thin cracks, especially reticulation cracks, are missed by AFFNet.
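Percentages of this kind are simple tallies over the test images. As a small illustration with made-up numbers (the values below are hypothetical and are not the paper's data), the share of over-estimated samples could be computed as:

```python
def share_overestimated(predicted, ground_truth):
    """Percentage of samples whose prediction exceeds the ground truth."""
    pairs = list(zip(predicted, ground_truth))
    return 100.0 * sum(p > g for p, g in pairs) / len(pairs)


# Hypothetical crack areas in pixels for four images.
pred_area = [120, 310, 95, 6400]
true_area = [100, 300, 90, 7000]
share = share_overestimated(pred_area, true_area)
```

Here three of the four predictions exceed the ground truth, and the one large crack is under-estimated, which mirrors the pattern described for Figure 14(a).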

Figure 14.

Quantification of concrete crack images at a pixel level: (a) crack area, (b) crack length and (c) crack mean width.

To further evaluate the effectiveness of the proposed algorithm and to obtain the geometric information of actual concrete cracks, a new crack dataset called AFF-D (AFFNet dataset) was collected. The crack image acquisition process is shown in Figure 15. The concrete crack images were captured with an iPhone camera, and the distance from the camera to the concrete surface was set at 30 cm using a laser rangefinder. Then, a crack width meter was used to measure the actual size of the concrete cracks. In this way, more than 1700 concrete crack images with a resolution of 224 × 224 pixels were obtained.

Figure 15.

The crack image acquisition process.

Based on the proposed AFFNet algorithm, crack morphological features such as crack area, crack length, and crack mean width can be calculated for the AFF-D dataset. As mentioned above, the crack is skeletonized into thin lines with a single-pixel width, and the crack geometric information is then obtained by counting the number of pixels in these thin lines. The actual crack area, length, and mean width are then obtained by converting the pixel counts into physical units with the camera calibration parameter k, which relates each pixel to the area and length it represents.
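The conversion from pixel counts to physical units can be sketched as below. This assumes that the calibration parameter can be read as a length per pixel (in mm/pixel), so that lengths scale with it and areas scale with its square; that interpretation, the function name, and the numerical values are assumptions for illustration only.

```python
def to_physical(area_px, length_px, mm_per_pixel):
    """Convert pixel counts to (area in mm^2, length in mm, mean width in mm)."""
    area_mm2 = area_px * mm_per_pixel ** 2   # each pixel covers a small square
    length_mm = length_px * mm_per_pixel     # each skeleton pixel spans one side
    mean_width_mm = area_mm2 / length_mm if length_mm else 0.0
    return area_mm2, length_mm, mean_width_mm


# Hypothetical example: 30 crack pixels, 10 skeleton pixels, 0.5 mm per pixel.
area_mm2, length_mm, mean_width_mm = to_physical(30, 10, 0.5)
```

With these made-up inputs the crack covers 7.5 mm², is 5 mm long, and has a mean width of 1.5 mm.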

Table 6 lists the calculated geometric information of the actual concrete cracks, together with the damage level of each crack evaluated from the actual crack area. These are valuable indicators for inspectors to evaluate and monitor structural health quantitatively.

Table 6.

Calculation of crack parameters and evaluation of damage levels.

Crack type	Crack area (mm²)	Max crack width (mm)	Damage level
Crack 1	1402	15.37	Mild damage
Crack 2	2568	37.65	Moderate damage
Crack 3	3696	42.76	Serious damage

Evaluation of AFFNet using other dataset

To further verify the effectiveness of AFFNet, a robustness analysis is performed. Two additional datasets, namely DeepCrack⁵⁷ and SDNET2018, are used for crack detection with the five selected models. The images in these two datasets contain a variety of noise that differs from that in our own dataset. For example, the disturbances in DeepCrack include surface roughness and marks, while those in SDNET2018 include holes and low lighting. It should be noted that these two datasets are not used to train AFFNet prior to the test, so that the robustness of the model can be examined. The crack images in these datasets need to be resized to 224 × 224 pixels owing to the requirement of the asymmetric convolution in VH-CAM.

Table 7 lists the performance of the five models tested on the DeepCrack dataset, which contains 527 crack images. AFFNet achieves the highest MIoU of 82.28%, a margin of at least 4.34% over the other models. As shown in Figure 16, four characteristic crack images are selected to display the prediction results. From left to right, the types of cracks are reticulation crack, crack with white line, crack with joint, and crack with handwriting. It can be seen that AFFNet can effectively detect cracks, including cracks with a rough background.

Table 7.

Segmentation results of five models on DeepCrack dataset (%).

Models	PA	MPA	MIoU	FWIoU
U-Net	98.28	83.28	77.82	96.82
Dilated FCN	98.16	80.24	75.05	96.69
DeepLabv3+	98.52	83.61	77.94	97.29
PAN	98.38	81.08	75.8	96.69
AFFNet	98.73	90.78	82.28	97.78

FCN: fully convolutional network; AFFNet: attention-based feature fusion network; PA: pixel accuracy; MPA: mean pixel accuracy; MIoU: mean intersection over union; FWIoU: frequency weighted intersection over union.

Figure 16.

Prediction results in the DeepCrack dataset using AFFNet: (a) reticulation crack, (b) crack with white line, (c) crack with joint, and (d) crack with handwriting.

Besides the DeepCrack dataset, the SDNET2018 dataset is also used to test the effectiveness of AFFNet. A total of 50 randomly selected crack images are resized to 224 × 224 pixels with RGB channels and then manually labeled as test images. Table 8 lists the performance of the five models tested on the SDNET2018 dataset, which shows that AFFNet achieves the highest MIoU of 89.21%. Figure 17 illustrates four typical concrete crack images in the SDNET2018 dataset, including a transverse crack, a crack with low lighting, and cracks with holes. The prediction results show that AFFNet remains robust regardless of the conditions surrounding the crack.

Table 8.

Segmentation results of five models on SDNET2018 dataset (%).

Models	PA	MPA	MIoU	FWIoU
U-Net	99.21	91.16	84.17	98.46
Dilated FCN	98.98	87.03	80.97	98.07
DeepLabv3+	99.27	88.78	83.65	98.63
PAN	99.12	87.76	81.58	98.38
AFFNet	99.48	95.21	89.21	99.04

Figure 17.

Prediction results in the SDNET2018 dataset using AFFNet: (a) transverse crack, (b) crack with low lighting, (c) crack with big holes, and (d) crack with tiny holes.

Conclusion

To cope with the complex conditions around concrete structures, this paper implements a novel DL-based framework, namely AFFNet, for automatic concrete crack detection at the pixel level. The proposed AFFNet consists of ResNet101 as the backbone and two attention mechanism modules, the VH-CAM and the ECAUM. Specifically, the VH-CAM uses two convolution layers with kernel sizes of 1 × W and H × 1 so that each pixel obtains more information, and then generates the attention map through matrix multiplication to capture rich contextual information. The ECAUM provides rich contextual information to guide low-level features.

The effectiveness and robustness of AFFNet are verified on a concrete crack dataset through a series of experiments. The experimental results show that the two attention mechanisms contribute to better performance in crack segmentation. The proposed model achieves the highest MIoU of 84.49% in comparison with other existing models, including U-Net, Dilated FCN, DeepLabv3+, and PAN. In addition, a robustness analysis is conducted using the DeepCrack and SDNET2018 datasets. The prediction results show that the proposed model also maintains accurate segmentation performance when detecting cracks in datasets not used for training.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Youth fund project of Jiangsu Natural Science Foundation (No. BK20180708) and Science and Education Integration Innovation Pilot Program from Qilu University of Technology (Shandong Academy of Sciences)–International Collaboration Project (2022GH006).

ORCID iDs

Jiaqi Hang

Yancheng Li

References

1. Jiang, Zhang. Real-time crack assessment using deep neural networks with wall-climbing unmanned aerial system. Comput-Aided Civ Infrastruct Eng 2019; 35(6): 549–564.

2. Nishikawa, Yoshida, Sugiyama, et al. Concrete crack detection by multiple sequential image filtering. Comput-Aided Civ Infrastruct Eng 2012; 27(1): 29–47.

3. Hoang N-D. Detection of surface crack in building structures using image processing technique with an improved Otsu method for image thresholding. Adv Civ Eng 2018; 2018: 1–10.

4. Zhang, Chen. Zernike-moment measurement of thin-crack width in images enabled by dual-scale deep learning. Comput-Aided Civ Infrastruct Eng 2018; 34(5): 367–384.

5. Yeum, Dyke. Vision-based automated crack detection for bridge inspection. Comput-Aided Civ Infrastruct Eng 2015; 30(10): 759–770.

6. Fujita, Hamamoto. A robust automatic crack detection method from noisy concrete surfaces. Mach Vision Appl 2010; 22(2): 245–254.

7. Oliveira, Correia. Automatic road crack segmentation using entropy and image dynamic thresholding. In: 2009 17th European signal processing conference, Glasgow, Scotland, 2009, pp. 622–626.

8. Dhule, Dhurpate, Gonge, et al. Edge detection technique used for identification of cracks on vertical walls of the building. In: 2015 international conference on computing and network communications (CoCoNet), Trivandrum, India, 2015, pp. 263–268.

9. Abdel-Qader, Abudayyeh, Kelly Michael. Analysis of edge-detection techniques for crack identification in bridges. J Comput Civ Eng 2003; 17(4): 255–263.

10. Merazi-Meksen, Boudraa. Mathematical morphology for TOFD image analysis and automatic crack detection. Ultrasonics 2014; 54(6): 1642–1648.

11. Giakoumis, Nikolaidis, Pitas. Digital image processing techniques for the detection and removal of cracks in digitized paintings. IEEE Trans Image Process 2006; 15(1): 178–188.

12. Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. Commun ACM 2012; 60: 84–90.

13. Wang, et al. Automated crack severity level detection and classification for ballastless track slab using deep convolutional neural network. Autom Constr 2021; 124: 103484.

14. Rao, Nguyen, Palaniswami, et al. Vision-based automated crack detection using convolutional neural networks for condition assessment of infrastructure. Struct Health Monit 2021; 20(4): 2124–2142.

15. Deng, Lee VC-S. Imaging-based crack detection on concrete surfaces using You Only Look Once network. Struct Health Monit 2021; 20(2): 484–499.

16. Hsieh Y-A, Tsai. Machine learning for crack detection: review and model performance comparison. J Comput Civ Eng 2020; 34(5): 4020038.1–4020038.12.

17. Alipour, Harris, Miller. Robust pixel-level crack detection using deep fully convolutional neural networks. J Comput Civ Eng 2019; 33(6): 04019040.

18. Zhang, Shen, Zhu. A research on an improved Unet-based concrete crack detection algorithm. Struct Health Monit 2020; 20(4): 1864–1879.

19. Huyan, Tighe, et al. CrackU-net: A novel deep convolutional neural network for pixelwise pavement crack detection. Struct Control Health Monit 2020; 27(8): e2551.

20. Yang, et al. Automatic pixel-level crack detection and measurement using fully convolutional network. Comput-Aided Civ Infrastruct Eng 2018; 33(12): 1090–1109.

21. Zhang, Wang KCP, et al. Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network. Comput-Aided Civ Infrastruct Eng 2017; 32(10): 805–819.

22. Huyan, et al. Pixelwise asphalt concrete pavement crack detection via deep learning-based semantic segmentation method. Struct Control Health Monit. Epub ahead of print 5 April. DOI: 10.1002/stc.2974.

23. Zhao, Zhou. Automatic pixel-level multiple damage detection of concrete structure using fully convolutional network. Comput-Aided Civ Infrastruct Eng 2019; 34(7): 616–634.

24. Zhang, Wang Kelvin, Fei, et al. Deep learning–based fully automated pavement crack detection on 3D asphalt surfaces with an improved CrackNet. J Comput Civ Eng 2018; 32(5): 04018041.

25. Chen L-C, Papandreou, Schroff, et al. Rethinking Atrous convolution for semantic image segmentation. arXiv e-prints 2017. arXiv:1706.05587.

26. Zhao, Shi, et al. Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 2017, pp. 2881–2890.

27. Wang, Liu, Nie, et al. Deep convolutional neural networks for semantic segmentation of cracks. Struct Control Health Monit 2022; 29(1): e2850.

28. Xue, Wang, et al. Image-based road crack risk-informed assessment using a convolutional neural network and an unmanned aerial vehicle. Struct Control Health Monit 2021; 28(7): e2749.

29. Bahdanau, Cho, Bengio. Neural machine translation by jointly learning to align and translate. arXiv e-prints 2014. arXiv:1409.0473.

30. Shen, Albanie, et al. Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell 2020; 42(8): 2011–2023.

31. Woo, Park, Lee J-Y, Kweon. CBAM: Convolutional block attention module. In: Proceedings of the European conference on computer vision, Munich, Germany, 2018, pp. 3–19.

32. Liu, Tian, et al. Dual attention network for scene segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Long Beach, CA, USA, 2019, pp. 3141–3149.

33. Chen. A novel U-shaped encoder–decoder network with attention mechanism for detection and evaluation of road cracks at pixel level. Comput-Aided Civ Infrastruct Eng. Epub ahead of print 18 February 2022. DOI: 10.1111/mice.12826.

34. Fang, Yuan. Distribution equalization learning mechanism for road crack detection. Neurocomputing 2021; 424: 193–204.

35. Hao, Liu, et al. Concrete crack segmentation based on convolution–deconvolution feature fusion with holistically nested networks. Struct Control Health Monit. Epub ahead of print 23 March 2022. DOI: 10.1002/stc.2965.

36. Pan, Zhang. A spatial-channel hierarchical deep learning network for pixel-level automated crack detection. Autom Constr 2020; 119: 103357.

37. Liu, Ren, et al. Automatic recognition and analysis system of asphalt pavement cracks using interleaved low-rank group convolution hybrid deep network and SegNet fusing dense condition random field. Measurement 2021; 170: 108693.

38. Bhowmick, Nagarajaiah, Veeraraghavan. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from UAV videos. Sensors 2020; 20(21): 6299.

39. Zhang, Ren, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 770–778.

40. Zhang, Lan, Zeng, et al. Relation-aware global attention for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA, 2020, pp. 3183–3192.

41. Zhou, Wang Mu, et al. Self-attention feature fusion network for semantic segmentation. Neurocomputing 2021; 453: 50–59.

42. Zhang, Peng, et al. ExFuse: enhancing feature fusion for semantic segmentation. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 2018, pp. 269–284.

43. Long, Shelhamer, Darrell. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA, 2015, pp. 3431–3440.

44. Ronneberger, Fischer, Brox. U-Net: Convolutional networks for biomedical image segmentation. Med Image Comput Comput-Assist Interv 2015; 9351: 234–241.

45. Badrinarayanan, Kendall, Cipolla. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 2017; 39(12): 2481–2495.

46. Wang, Zhu, et al. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Seattle, WA, USA, 2020, pp. 11531–11539.

47. Goodfellow, Bengio, Courville. Deep learning, vol. 301. Cambridge, MA: MIT Press, 2016.

48. Zhang, Ren, et al. Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision, Santiago, Chile, 2015, pp. 1026–1034.

49. Garcia-Garcia, Orts-Escolano, Oprea, et al. A review on deep learning techniques applied to semantic segmentation. arXiv e-prints 2017. arXiv:1704.06857.

50. Sutskever, Martens, Dahl, et al. On the importance of initialization and momentum in deep learning. In: Proceedings of the 30th international conference on machine learning (PMLR), Atlanta, GA, USA, 2013, pp. 1139–1147.

51. et al. Automatic tunnel crack detection based on U-net and a convolutional neural network with alternately updated clique. Sensors 2020; 20(3): 717.

52. Liu, Cao, Wang, et al. Computer vision-based concrete crack detection using U-net fully convolutional networks. Autom Constr 2019; 104: 129–139.

53. Chen L-C, Zhu, Papandreou, et al. Encoder-decoder with Atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 2018, pp. 801–818.

54. Xiong, et al. Pyramid attention network for semantic segmentation. arXiv e-prints 2018. arXiv:1805.10180.

55. Selvaraju, Cogswell, Das, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, Venice, Italy, 2017, pp. 618–626.

56. Lam, Lee, Suen. Thinning methodologies-a comprehensive survey. IEEE Trans Pattern Anal Mach Intell 1992; 14(9): 869–885.

57. Liu, Yao, et al. DeepCrack: a deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019; 338: 139–153.