Abstract
Owing to the harsh working environment of storage stacking machinery, the fault information of its key components is highly complex, which leads to low classification accuracy and high computational complexity in existing deep learning-based fault diagnosis methods. To alleviate this problem, this paper presents a novel architecture, named attention-based adaptive multimodal feature fusion networks, for intelligent fault diagnosis of storage stacking machinery. The architecture aims to improve the diagnostic precision and robustness of the feature fusion network and to learn a broader feature representation. Firstly, a long short-term memory layer is introduced to extract feature information over multiple time steps and improve the self-extraction of multi-temporal features. Then, a maximum temporal feature fusion module is used to make the deep fusion features more discriminative. Finally, a residual layer with spanning connections is added to increase the utilization and characterization capability of the deep fusion features. Experimental results demonstrate the effectiveness and superiority of the proposed method for intelligent fault diagnosis of storage stacking machinery under variable working conditions compared with several state-of-the-art deep learning-based methods.
Introduction
The storage stacker is the most important mechanical transmission equipment in industrial applications (e.g. construction, metallurgy and medicine). With increasing industrialization, storage stacking machinery is developing rapidly toward large-scale transportation capacity, diversified work scenarios, automated and intelligent control systems, and digital monitoring and protection. 1 Because it generally operates in complex and harsh working environments involving high speeds and heavy loads, the storage stacker is prone to local damage, which can eventually cause mechanical system downtime and even casualties.2–4 Therefore, to improve the safety and stability of equipment operation, it is necessary to carry out research on intelligent fault diagnosis of storage stacking machinery. Developing efficient and practical fault diagnosis methods has long been a goal of researchers.5–7 In particular, it is of great value to develop a new technique with good versatility and high accuracy for identifying the mechanical faults of storage stacking machinery under variable working conditions.
Nowadays, the most effective and common approach to intelligent fault diagnosis of storage stacking machinery consists of two steps (i.e. feature extraction and fault identification). In recent years, a large body of research has been conducted on time-frequency analysis-based fault identification methods, 8 including wavelet transform (WT), 9 ensemble empirical mode decomposition (EEMD), 10 local mean decomposition (LMD), 11 intrinsic time-scale decomposition (ITD) 12 and variational mode decomposition (VMD). 13 To study the dynamic response characteristics of cranes under different fault conditions, Wang et al. 14 proposed a combination of WT and singular spectrum decomposition. To extract useful multiple correlation coefficients from the nonlinear vibration data of rolling bearings, Xu et al. 15 presented a new method for structural health monitoring and fault diagnosis of rotating machinery based on EEMD and wavelet packet quantization. To overcome the mode aliasing problem of conventional empirical mode decomposition, Dan et al. 16 proposed a fully ensemble LMD method with adaptive noise and combined it with an improved coyote optimization algorithm and multiscale sample entropy to identify structural damage in storage stacking machinery. To extract the fault feature information of hoisting machinery under heavy background noise, Chen et al. 17 proposed an ITD-based fault identification method to obtain the impulse components containing the fault information. Zhao et al. 18 combined VMD with a support vector machine to realize structural health monitoring of storage stacking machinery under variable working conditions. To mine sequential-multimodal fusion features in parallel and achieve fault diagnosis of hoisting systems, Ramírez et al. 19 introduced single-delay multi-sequential fusion networks (SDMSFNs) based on a practical decomposition of the corresponding characteristic equation of the network system into subsystems, in which the predicted probabilities of the sequential-multimodal fusion features are output by a fully connected layer and a softmax layer. Nevertheless, although these approaches can roughly extract and identify fault features and patterns in rotating machinery, storage stacking machinery and hoisting machinery, they have some obvious drawbacks: (1) They require parameters (e.g. noise level, mode number and penalty parameter) to be set in advance, and these parameters have a large impact on the decomposition results. (2) Their computational complexity is high, requiring more computing resources and time, and they often suffer from mode mixing and end effects when processing nonlinear and non-stationary signals, which limits their practical feasibility. Accordingly, it is necessary to propose a new signal feature extraction technique to accurately identify the structural faults of storage stacking machinery under variable working conditions.
Alternatively, deep learning (DL)-based fault identification methods have become a hot research direction in the field of intelligent fault diagnosis of mechanical systems.20–23 DL uses hierarchically stacked nonlinear processing units to achieve feature self-learning and intelligent fault identification through nonlinear transformations. Representative DL methods mainly include convolutional neural networks (CNNs), 24 recurrent neural networks (RNNs), 25 deep belief networks (DBNs), 26 sparse autoencoders (SAEs) 27 and generative adversarial networks (GANs). 28 Nevertheless, these methods have some obvious weaknesses when analyzing fault signals generated by storage stacking machinery under different working conditions. For example, a CNN requires a large amount of labeled data for training when extracting fault feature information; when such data are scarce it is prone to overfitting, which reduces identification accuracy.29,30 Although an RNN is designed to capture dependencies across time, it has difficulty handling irregular time-series data in which the time between observations is not constant. 31 A DBN is typically used for unsupervised learning, relying on unlabeled data to learn the underlying structure of the input, which limits its applicability in situations where labeled data are available.32–34 Because an SAE contains many hyperparameters that need to be tuned, including sparsity constraints and regularization parameters, its computation is time-consuming, which greatly reduces training speed and can lead to overfitting. 35 A GAN requires large amounts of high-quality data for training and will replicate poor patterns and biases in the dataset when the data are low-quality or insufficient.36,37 Consequently, existing DL-based methods still have the following shortcomings: (1) A DL model usually has many hyperparameters (e.g. learning rate, batch size and number of layers), and selecting them requires a large amount of prior knowledge, which limits its self-adaptability and diagnostic capability. (2) A DL model requires a large amount of training data; when data are insufficient, it tends to overfit and fails to generalize to new data. (3) Lacking multiscale learning capability, traditional DL methods cannot directly extract multiscale features from the original vibration signal.
Moreover, the attention mechanism is a technique that mimics the way human attention is allocated, allowing a model to focus on the important parts of the information it processes and thus improving diagnostic accuracy and efficiency. The attention mechanism has already been studied for fault feature extraction with some success. For example, Banu et al. 38 proposed an attention-based optimal bidirectional long short-term memory technique for renewable energy storage systems to ensure that productivity requirements are met while minimizing power costs. To perform structural health detection of bridges, Yang et al. 39 introduced a framework based on a CNN and gated recurrent units to model spatial and temporal relationships. Wu et al. 40 presented a multi-source domain attention mechanism adaptive network guided by a knowledge dynamic matching unit for bearing fault diagnosis. However, the effectiveness and feasibility of the attention mechanism in fault diagnosis of storage stacking machinery remain unexplored. Therefore, to address these problems, attention-based adaptive multimodal feature fusion networks (AAMFFN) are proposed in this paper to effectively extract feature information for fault diagnosis of storage stacking machinery under variable working conditions.
In summary, this paper proposes a new AAMFFN-based intelligent fault diagnosis method for storage stacking machinery under variable working conditions. Firstly, the attention module is added to the multimodal feature fusion network (MFFN) to refine the fault features by integrating the attention signal waveforms in both the spatial and channel dimensions. Then, the maximum time-delay features are extracted and a residual layer is introduced to further enhance the utilization and characterization of the deep fusion features. Finally, experimental data and comparative analysis are used to verify the effectiveness and superiority of the proposed method. The main contributions and merits of this paper are summarized as follows:
(1) A novel AAMFFN architecture is constructed to extract more comprehensive and richer fault feature information from multiple scales of the raw vibration signal compared with some existing DL models.
(2) An end-to-end AAMFFN-based fault diagnosis scheme for storage stacking machinery under variable working conditions is proposed, which can automatically extract deep fusion features from the raw vibration signal and simultaneously identify fault patterns.
(3) The effectiveness of the proposed method in intelligent fault diagnosis is validated by experimental data under variable working conditions. Besides, the classification performance is also compared with some state-of-the-art methods to manifest the superiority of the proposed method.
The remaining sections of this paper are organized as follows. Section ‘Proposed methodology’ provides the detailed procedures of the proposed fault diagnosis method. Thereafter, section ‘Experimental validation and analysis’ conducts the case study to verify the effectiveness of the proposed approach. Finally, conclusions and some future work are summarized in section ‘Conclusion.’
Proposed methodology
In this section, the proposed AAMFFN is presented in detail. The main contribution of the proposed AAMFFN is to learn richer and more complementary multi-temporal feature information and thereby achieve better classification performance. On this basis, an end-to-end AAMFFN-based fault diagnosis scheme is further proposed for intelligent fault diagnosis of storage stacking machinery under variable working conditions.
Attention mechanism
Due to heavy background noise in the fault signals generated by storage stacking machinery under variable working conditions, it is difficult to extract useful fault features with existing methods. The attention mechanism is similar to the selective mechanism of human vision: it selects the information most critical to the current task from a large amount of information, thereby suppressing non-essential noisy information and guiding the training of the network. Inspired by this, this paper adopts the squeeze-and-excitation block and the attention gate module as the attention mechanism, and spatial attention and channel attention are cascaded and fused to improve the feature refinement performance of the proposed AAMFFN. The spatial attention module can achieve the same feature extraction accuracy as channel attention without introducing a large number of parameters and calculations. Its structure is shown in Figure 1, where g denotes the input signal in the decoder, x represents the input signal of the previous layer in the encoder corresponding to the decoder, and the resolution of x is twice that of g. First, g is upsampled to the same resolution as x and treated as a gating signal. Then g and x are each transformed to the same dimension by a 3 × 3 convolution and superimposed frame by frame. The attention coefficient a is obtained by passing the resulting feature signal through a rectified linear unit (ReLU) activation function and a 3 × 3 convolution, normalizing it with a sigmoid activation function, and upsampling it to the same resolution as the input signal. Moreover, channel attention adaptively recalibrates the feature responses of the channels and improves the expressiveness of the network by establishing interrelationships between channels. A schematic diagram of the channel attention module is displayed in Figure 2.
More specifically, channel attention mainly consists of compression and excitation, where the compression section performs global average pooling on the input feature signal, which is calculated as shown in Equation (1).
where T and L are respectively the period and offset of the signal, C indicates the number of channels in the signal acquisition process, and PC denotes the global response distribution on channel C. Accordingly, channel attention and spatial attention are cascaded to serve as the attention module, whose structure is shown schematically in Figure 3. As seen from Figure 3, the inputs of channel attention and spatial attention are the output feature signals of the current layer and of the corresponding next-layer encoder, respectively. Although the skip connection concatenates high-resolution feature signals from the encoding path with the upsampled features to localize and extract feature information accurately, simple concatenation cannot completely eliminate irrelevant and noisy responses in low-level features; more explicit high-level semantic information is therefore needed to guide the selection of low-level features. Moreover, the central bottleneck between the encoding and decoding paths serves as the key point of the information flow and encodes the most discriminative semantic features. The information in this part can be decomposed spatially and channel-wise, where the spatial encoding and channel encoding are closely related to the localization information and the semantic categories of the segmented objects, respectively. Consequently, spatial attention and channel attention are combined and cascaded in a progressive connection structure. Its input consists of shallow detail and deep global feature information, and its output has the same dimensionality as the shallow feature information with reassigned weights, which improves the accuracy of signal feature extraction.
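The compression and excitation steps of channel attention can be illustrated with a small NumPy sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the input is assumed to be an array of shape (C, T, L), the excitation part is assumed to follow the usual squeeze-and-excitation design (a bottleneck layer with ReLU followed by a sigmoid layer), and the names `channel_attention`, `w1`, `w2` and the reduction ratio `r` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style channel attention (illustrative).

    x  : feature signal of shape (C, T, L)
    w1 : bottleneck weights of shape (C // r, C)   (assumed layout)
    w2 : expansion weights of shape (C, C // r)    (assumed layout)
    """
    # Compression: global average pooling over the T x L plane of each channel.
    p = x.mean(axis=(1, 2))                      # shape (C,)
    # Excitation: bottleneck + ReLU, then per-channel weights in (0, 1).
    hidden = np.maximum(w1 @ p, 0.0)
    scale = sigmoid(w2 @ hidden)                 # shape (C,)
    # Recalibration: rescale every channel by its learned weight.
    return x * scale[:, None, None], scale


rng = np.random.default_rng(0)
C, T, L, r = 8, 4, 16, 2                         # toy sizes, chosen arbitrarily
x = rng.standard_normal((C, T, L))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y, scale = channel_attention(x, w1, w2)
```

The output keeps the shape of the input, so the recalibrated features can be passed on to the spatial attention stage without any reshaping.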

Figure 1. Schematic diagram of the spatial attention module.

Figure 2. Schematic diagram of the channel attention module.

Figure 3. Flowchart of the designed attention module.
Multimodal feature fusion networks
Few DL models have been developed for intelligent fault diagnosis of storage stacking machinery under variable working conditions, and those that exist are mainly based on transfer learning of simple neural network models; as a result, the intrinsic encoding capability and feature complementarity of DL methods have not been sufficiently studied. Therefore, this paper designs a novel DL framework for fault diagnosis of storage stacking machinery under variable working conditions, named MFFNs, whose main architecture is shown in Figure 4. Specifically, as shown in Figure 4, the multi-temporal features are first extracted by using the correlation difference of the MFFN coding matrix. Then, the self-extraction module of multi-temporal features is constructed to extract temporal features. Finally, the maximum temporal feature fusion module is used to deeply fuse the temporal features. According to the flowchart of MFFN, the multi-temporal encoding features extracted by the multi-temporal cyclic layer are complemented by the multi-time delay convolutional features obtained by the multi-expansion rate convolutional layer, which enables the proposed MFFN to learn deeper and more discriminative features and enhances the mapping capability of the fault feature information. In particular, a continuous convolution kernel is first used to extract deep multi-temporal features, and the one-dimensional vector output from the g-th convolution kernel at time delay τ is expressed as follows
where Wg and bg respectively represent the weight and bias of the convolution kernel to be learned. To accommodate the input dimension of the long short-term memory (LSTM) layer, 41 the three-channel output of the deep convolutional layer is converted to two-channel, which can be defined as follows

Figure 4. Main architecture of an MFFN.
Additionally, the dimensionally reshaped concatenation matrix Pτ is input to LSTM layer, while the states of the forgetting gate
where Wf and bf, Wi and bi and Wo and bo respectively denote the weights and bias of
In this equation,
Accordingly, through the joint control of the three gates and the long-term storage of information in multiple memory units, the LSTM unit can utilize the features of multiple preceding and following time steps, thus realizing the self-extraction of multi-temporal features and further enhancing the feature mapping capability of the proposed MFFN for fault diagnosis of storage stacking machinery under variable working conditions.
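The gating behaviour described above can be sketched as a single LSTM time step in NumPy. The sketch assumes the standard LSTM formulation, which may differ in detail from the equations used in this paper. The names `Wf`, `bf`, `Wi`, `bi`, `Wo` and `bo` follow the symbols in the text, whereas the candidate-state parameters `Wc` and `bc`, the helper `init_params` and all sizes are assumptions made only for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard LSTM cell (assumed formulation)."""
    z = np.concatenate([h_prev, x_t])                 # joint gate input
    f = sigmoid(params["Wf"] @ z + params["bf"])      # forget gate
    i = sigmoid(params["Wi"] @ z + params["bi"])      # input gate
    o = sigmoid(params["Wo"] @ z + params["bo"])      # output gate
    c_tilde = np.tanh(params["Wc"] @ z + params["bc"])  # candidate memory
    c = f * c_prev + i * c_tilde                      # long-term memory update
    h = o * np.tanh(c)                                # hidden state / output
    return h, c


def init_params(input_size, hidden_size, rng):
    """Hypothetical helper: small random weights and zero biases."""
    shape = (hidden_size, hidden_size + input_size)
    params = {}
    for gate in ("f", "i", "o", "c"):
        params["W" + gate] = rng.standard_normal(shape) * 0.1
        params["b" + gate] = np.zeros(hidden_size)
    return params


rng = np.random.default_rng(0)
params = init_params(input_size=3, hidden_size=4, rng=rng)
sequence = rng.standard_normal((5, 3))                # 5 time steps, 3 features
h, c = np.zeros(4), np.zeros(4)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
```

Because the memory cell `c` is carried from one step to the next, the final hidden state depends on every earlier time step, which is the property the text relies on for multi-temporal feature extraction.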
AAMFFN architecture
The proposed AAMFFN architecture mainly consists of three layers: (1) the attention feature fusion layer, (2) the multimodal feature learning layer and (3) the classification layer. Besides, the fault diagnosis process of AAMFFN has four characteristics: (1) Because fault signals collected from storage stacking machinery are generally nonlinear and non-stationary, dilated convolution is introduced to extract multi-time delay signal features, thereby enhancing the feature mapping capability of the network. (2) The LSTM layer is introduced to extract the feature information of multiple time steps and improve the self-extraction of multi-temporal features. (3) The maximum temporal feature fusion module is used to make the deep fusion features more discriminative. (4) A residual layer with spanning connections is added to increase the utilization and characterization capability of the deep fusion features. Figure 5 shows the overall architecture of the proposed AAMFFN.

Figure 5. Overall architecture of the proposed AAMFFN.
Its details are described as follows:
Attention feature fusion layer
In this layer, channel attention and spatial attention are cascaded and integrated into the progressive connection to serve as the attention module. The detailed construction process can be found in Figure 3. More specifically, as shown in Figure 3, the designed attention mechanism can capture three kinds of shallow feature information from the original signal, thus improving the feature learning efficiency and identification accuracy of the proposed method.
Multimodal feature learning layer
In this layer, the goal is to automatically extract the maximum time-delay features as deep fusion features and to add a residual layer with cross-connections that makes full use of the fusion information, thus addressing the poor characterization ability of single features and the insufficient use of the complementarity among the time-delay information. As can be seen from Figure 5, the two-channel output of the LSTM layer is first transformed into three channels, and the output yτ of the LSTM layer under time delay τ is split by row into one-dimensional feature vectors, which can be summarized as:
where Q denotes the number of one-dimensional vectors. The 1 × N-dimensional feature vectors
where S q indicates the q-th 3 × N-dimensional matrix obtained by splicing, which contains the feature information under various types of time-delay. The obtained splicing matrix S q is subjected to feature fusion and the maximum time-delay feature is selected as deep fusion features by column traversal matrix, which can be summarized as:
where S q (j) is the j-th column of the q-th splicing matrix. To improve the utilization of the deep fusion features and to counter the network degradation that arises during gradient-based training, a residual layer with cross-connections is introduced to improve the characterization capability of the network. Thus, for the q-th fused feature vector, the output after residual layer processing can be formulated as follows
where h(·) denotes the mapping function of the cross-connections, h(U q ) = U q represents the identity mapping, and W q indicates the weight to be learned. With the identity connection h(U q ), the residual layer can take full advantage of the deep fusion features, thus further exploiting the potential mapping performance.
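The fusion and residual steps of this layer can be sketched in NumPy as follows. The column-wise maximum follows the description above. The residual branch, however, is only an assumption: the text states that the identity mapping h(U q ) = U q is added to a learned transformation with weight W q , but the form of that transformation is not given here, so a single ReLU-activated linear map is used as a stand-in. The function names and toy values are hypothetical.

```python
import numpy as np


def max_temporal_fusion(outputs):
    """Fuse LSTM outputs from three time delays by a column-wise maximum.

    outputs : list of three arrays of shape (Q, N), one per time delay.
    Returns an array of shape (Q, N) whose q-th row is the fused vector U_q.
    """
    # stacked[q] is the 3 x N splicing matrix S_q.
    stacked = np.stack(outputs, axis=1)          # shape (Q, 3, N)
    # For each column j of S_q keep the largest of the three delay features.
    return stacked.max(axis=1)


def residual_layer(U, W):
    """Identity shortcut plus a learned branch (branch form is assumed).

    U : fused features of shape (Q, N)
    W : one N x N weight matrix per fused vector, shape (Q, N, N)
    """
    branch = np.maximum(np.einsum("qij,qj->qi", W, U), 0.0)  # ReLU(W_q U_q)
    return U + branch                                        # h(U_q) = U_q


# Toy example with Q = 1 vector of length N = 3 per time delay.
y1 = np.array([[1.0, 5.0, 2.0]])
y2 = np.array([[4.0, 0.0, 3.0]])
y3 = np.array([[2.0, 2.0, 9.0]])
U = max_temporal_fusion([y1, y2, y3])            # [[4., 5., 9.]]
out_zero = residual_layer(U, np.zeros((1, 3, 3)))
out_eye = residual_layer(U, np.eye(3)[None, :, :])
```

With zero weights the layer reduces to the identity shortcut, which shows why the fused features are never lost even when the learned branch contributes nothing.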
Classification layer
In this layer, the task is to obtain the fault classification results automatically by using a softmax classifier, which is well suited to multi-class problems. In particular, the extracted feature data are input to the softmax layer, where the features are mapped to the range (0, 1), which is more suitable for classification, so as to achieve the fault classification of storage stacking machinery under variable working conditions. Assuming that the input sample x belongs to one of K classes of health states, the probability corresponding to the j-th class can be calculated by the softmax function, as shown in Equation (13).
$$Q_j(x) = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}} \tag{13}$$
where K is the number of classification categories, xj denotes the j-th feature, and Qj(x) represents the probability that the sample x belongs to the j-th class. By completing the calculation shown in Equation (13), the probability of each fault pattern is obtained, and the pattern with the highest probability is taken as the diagnosis result. The details of the softmax classifier can be found in Kuang et al. 42
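As a brief illustration of the softmax mapping, the sketch below converts K = 8 feature scores into class probabilities. The scores are invented for the example, and subtracting the maximum before exponentiation is a common numerical safeguard that does not change the result; neither is taken from the paper.

```python
import numpy as np


def softmax(x):
    """Map K real-valued scores to probabilities that sum to one."""
    shifted = x - np.max(x)        # avoids overflow, leaves the result unchanged
    e = np.exp(shifted)
    return e / e.sum()


# Hypothetical scores for the eight patterns LP, FA, C, WD, EC, GP, GR, N.
scores = np.array([3.2, 0.4, -1.0, 1.1, 0.0, -0.5, 0.7, 0.2])
probs = softmax(scores)
predicted_class = int(np.argmax(probs))   # index of the most probable pattern
```

The predicted pattern is simply the index with the largest probability, here the first entry because it has the largest score.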
Loss function construction in AAMFFN
The dice loss that can directly maximize the intersection over union (IOU) 43 is selected as the loss function of the proposed AAMFFN, which has better feature enhancement performance in the case of imbalanced positive and negative samples, as shown in Equation (14).
where dsc(·) represents the similarity between the diagnostic sample and the actual sample of the network, which can be mathematically calculated as
And the gradient Gd of the loss function dice loss is defined as follows
In this equation, p is the predicted value of the network (i.e. the output of the sigmoid function), and t denotes the truth label. More specifically, the stability of the proposed AAMFFN is further enhanced by combining dice loss and focal loss, where dice loss is used to overcome the problem of sample imbalance during training and focal loss is used to reduce the loss value of diagnostic samples. The combined loss function is defined as follows
where
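To make the combination of dice loss and focal loss concrete, the following NumPy sketch uses the commonly cited forms of both losses. These forms, the smoothing constant `eps`, the focal parameters `gamma` and `alpha`, and the weighting factor `lam` are all assumptions introduced for illustration; the exact expressions and weights used in the paper are not reproduced here.

```python
import numpy as np


def dice_coefficient(p, t, eps=1e-7):
    """Overlap between predictions p in [0, 1] and binary labels t."""
    intersection = np.sum(p * t)
    return (2.0 * intersection + eps) / (np.sum(p) + np.sum(t) + eps)


def dice_loss(p, t):
    """One minus the dice coefficient, so perfect overlap gives zero loss."""
    return 1.0 - dice_coefficient(p, t)


def focal_loss(p, t, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss (standard form, assumed here)."""
    p = np.clip(p, eps, 1.0 - eps)               # keep the logarithm finite
    p_t = np.where(t == 1, p, 1.0 - p)           # probability of the true label
    alpha_t = np.where(t == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def combined_loss(p, t, lam=0.5):
    """Dice loss plus a weighted focal term; lam is a hypothetical weight."""
    return dice_loss(p, t) + lam * focal_loss(p, t)


t = np.array([1.0, 0.0, 1.0, 0.0])
perfect = combined_loss(t.copy(), t)             # predictions equal the labels
uncertain = combined_loss(np.full(4, 0.5), t)    # uninformative predictions
wrong = combined_loss(1.0 - t, t)                # every prediction inverted
```

The three cases are ordered as expected: the loss is close to zero for perfect predictions, larger for uninformative ones and largest when every prediction is inverted.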
Proposed AAMFFN-based fault diagnosis scheme
To improve the robustness of the feature fusion network, overcome the problem of low diagnostic accuracy, and enhance the feature diversity and stability of MFFN so that it can adaptively extract multi-temporal feature information, a new feature fusion framework called AAMFFN is proposed. On this basis, an end-to-end AAMFFN-based fault diagnosis scheme is further presented for intelligent fault diagnosis of storage stacking machinery under variable working conditions. The flowchart of the AAMFFN-based fault diagnosis method is displayed in Figure 6, and its identification steps can be summarized as follows:
(1) The real and imaginary parts of the original signal are extracted to form a 2 × N-dimensional data sample, and the real and imaginary parts are combined into a one-dimensional feature vector by merging the convolutional layers.
(2) Based on the correlation of different coding matrices, dilated convolution at multiple expansion rates is used to extract non-continuous time window information, thus realizing the self-extraction of multi-time delay features.
(3) The attention module is added to MFFN to refine the fault features by integrating the attention signal waveforms in both the spatial and channel dimensions.
(4) The LSTM layer is used to extract multi-temporal features adaptively on the basis of deep convolution with continuously sampled convolution kernels.
(5) After the deep mapping information is stitched together, the maximum time-delay features are extracted and the residual layer is introduced to further enhance the utilization and characterization of the deep fusion features.
(6) The loss function of the AAMFFN architecture is constructed and minimized using the backpropagation algorithm, the fault diagnosis results are output, and the effectiveness of the proposed method is evaluated.
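Step (2) above relies on dilated convolution at several expansion rates. The following NumPy sketch shows how one kernel, applied with different dilation rates, samples non-contiguous time windows of a signal. It is an illustrative sketch only: the kernel values, the rates (1, 2, 3) and the function names are assumptions, and the operation is written as cross-correlation, as is customary in DL frameworks.

```python
import numpy as np


def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1D dilated cross-correlation of a signal with a kernel."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive field of one output value
    out_len = len(signal) - span + 1
    out = np.empty(out_len)
    for n in range(out_len):
        taps = signal[n:n + span:dilation]  # every `dilation`-th sample
        out[n] = np.dot(taps, kernel)
    return out


def multi_rate_features(signal, kernel, rates=(1, 2, 3)):
    """Apply the same kernel at several dilation rates (assumed rates)."""
    return {rate: dilated_conv1d(signal, kernel, rate) for rate in rates}


signal = np.arange(10, dtype=float)        # toy ramp signal 0, 1, ..., 9
kernel = np.array([1.0, 0.0, -1.0])        # difference of first and last tap
features = multi_rate_features(signal, kernel)
```

On the ramp signal the outputs are constant at -2, -4 and -6 for rates 1, 2 and 3, because the first and last taps are 2, 4 and 6 samples apart: a larger dilation rate widens the time window without adding kernel parameters.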

Figure 6. Overall architecture of the proposed method.
Experimental validation and analysis
To evaluate the feasibility of the proposed method, storage stacker fault vibration data under variable working conditions are collected in the experiment. Then, comparisons with some state-of-the-art DL-based fault diagnosis methodologies are conducted to validate the superiority of the proposed approach.
Description of the dataset
The effectiveness of the proposed method is validated using vibration data collected from a storage stacking machinery model in the testing technology and fault diagnosis laboratory. Figure 7 shows the overall setup of the experimental equipment. In the experiment, storage stacker fault vibration data are collected by the acceleration sensor installed on the mechanical system (see Figure 7) at a sampling frequency of 50 kHz. All fault vibration data are obtained under four load conditions (i.e. 10, 40, 70 and 100 kg), corresponding respectively to four machine speeds (160, 172, 185 and 195 m/min). For each load condition, eight patterns (seven fault patterns and the normal state) are obtained from the key components of the storage stacking machinery, abbreviated as load platform (LP), fork arm (FA), column (C), winding drum (WD), electric cabinet (EC), guide pulley (GP), gear rack (GR) and normal (N), respectively. Each vibration record for a given pattern contains 160,000 data points. In this experiment, four datasets (Group A/B/C/D) are used to test the diagnosis performance of the proposed algorithm. The four datasets are summarized in Table 1, where each data sample contains 1024 points. A total of 800 samples are collected for each load condition, of which 400 are randomly chosen as the training dataset and the remainder are used as the testing dataset. That is, the ratio of training samples to testing samples is 1:1. The specific experimental parameters of the fault data acquisition system are presented in Table 2.
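The sample construction described above can be sketched as follows. The record length (160,000 points), sample length (1024 points), total of 800 samples per load condition and the random 1:1 split come from the text. The assumptions are that the 800 samples are divided equally among the eight patterns (100 each), that windows are non-overlapping and taken from the start of each record, and that synthetic noise stands in for the measured vibration data.

```python
import numpy as np

PATTERNS = ["LP", "FA", "C", "WD", "EC", "GP", "GR", "N"]


def segment(record, length=1024, count=100):
    """Cut `count` non-overlapping windows of `length` points from a record."""
    needed = length * count
    if needed > len(record):
        raise ValueError("record is too short for the requested windows")
    return record[:needed].reshape(count, length)


rng = np.random.default_rng(0)
samples, labels = [], []
for index, name in enumerate(PATTERNS):
    record = rng.standard_normal(160_000)    # synthetic stand-in for one record
    windows = segment(record)                # 100 samples of 1024 points
    samples.append(windows)
    labels.append(np.full(len(windows), index))

X = np.concatenate(samples)                  # shape (800, 1024)
y = np.concatenate(labels)                   # shape (800,)

# Random 1:1 split into training and testing sets.
order = rng.permutation(len(X))
train_idx, test_idx = order[:400], order[400:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Since 100 windows of 1024 points use only 102,400 of the 160,000 points in a record, non-overlapping segmentation is sufficient under these assumptions.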

Figure 7. Overview of the storage stacking machinery system.
Table 1. Description of the sample dataset of storage stacking machinery.
LP: load platform; FA: fork arm; C: column; WD: winding drum; EC: electric cabinet; GP: guide pulley; GR: gear rack; N: normal.
Table 2. Experimental parameters of storage stacking machinery.
Fault diagnosis results analysis
According to the fault diagnosis flowchart in Figure 6, the real and imaginary parts of the vibration signal are first extracted to form a 2 × 256-dimensional data sample. Then, the AAMFFN architecture is trained using the training data. Finally, the equal-sized testing data are fed into the well-trained AAMFFN to achieve automatic fault diagnosis. Table 3 lists the specific parameters of the proposed method, which were chosen by taking into account both the stability and the convergence speed of the algorithm. To provide an intuitive analysis, one sample under a load condition of 40 kg is taken as an example. Figure 8 displays the time-domain waveform and Fourier spectrum of the storage stacker vibration signal under different fault patterns for 40 kg. It can be seen from Figure 8 that storage stacker fault patterns are difficult to identify directly from the temporal waveform and Fourier spectrum, because similar vibration patterns appear in different datasets, making the fault features hard to distinguish. In other words, the self-similarity present in storage stacker vibration data can impede fault diagnosis through conventional signal analysis techniques. Therefore, the proposed AAMFFN method, which requires no prior knowledge, is applied to analyze the storage stacker vibration data and reduce the reliance on human factors.
Table 3. Main parameter settings of the AAMFFN algorithm.
AAMFFN: attention-based adaptive multimodal feature fusion network.

Figure 8. Temporal waveform and Fourier spectrum of storage stacker vibration signal under 40 kg.
In the first trial, the confusion matrices of the proposed approach for the various load conditions are displayed in Figure 9. The x-axis and y-axis of each confusion matrix show the abbreviations of the fault patterns, and the identification results for all fault patterns are presented. The diagonal elements give the number of samples of each fault pattern that are correctly classified, and the off-diagonal elements give the number of samples of one fault pattern that are misclassified as another. Figure 9 shows that, under a load condition of 10 kg, all fault patterns are correctly identified except for one LP sample that is misclassified as WD, giving a classification accuracy of 99.88% (799/800) in the first trial. The first-trial classification accuracies for the other three loads (i.e. 40, 70 and 100 kg) are 99.50% (796/800), 99.25% (794/800) and 99.63% (797/800), respectively. Although the accuracy is slightly lower at the heavier loads than at 10 kg, the proposed approach achieves more than 99% accuracy under all load conditions. This implies that the proposed method is effective for storage stacker fault diagnosis under variable working conditions. In addition, following the equations presented in Shao et al., 44 the average recall rate, F-Score and precision of the proposed AAMFFN under different load conditions are calculated and shown in Figure 10. Figure 10 shows that the F-Scores under different loads are high, indicating that the proposed method has good classification performance. To reduce the influence of randomness and validate the superiority of the designed AAMFFN algorithm, a comparison with SDMSFN is conducted over 20 trials. Figure 11 shows the classification accuracy of the two algorithms over the 20 trials for different load conditions.
The detailed comparison results of the two algorithms (i.e. SDMSFN and AAMFFN) are listed in Table 4, including the diagnosis results and the CPU running time in the testing procedure. As seen from Figure 11 and Table 4, the classification accuracy of the proposed method is clearly higher than that of SDMSFN, while its CPU running time is lower under all load conditions. Additionally, Figure 12 presents box plots of the two algorithms (i.e. SDMSFN and AAMFFN) under the various load conditions. In these plots, the upper and lower black lines connected to each box represent the maximum and minimum classification accuracy, and the black dots indicate the average accuracy over the 20 trials, providing a more intuitive comparison of the algorithms. The red plus signs indicate outliers, the red line in each box denotes the median classification accuracy, and the upper and lower blue edges of the box mark the upper and lower quartiles. As can be seen from Figure 12, the accuracy obtained by SDMSFN fluctuates significantly and contains outliers (see the red plus signs), while the accuracy obtained by AAMFFN is more stable, with most values remaining in the range of 99.87%–99.99%. Overall, the comparison results preliminarily validate the effectiveness of the proposed method in identifying mechanical faults of the storage stacker under variable working conditions.
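The accuracy, precision, recall and F-Score discussed above can all be derived from a confusion matrix. The sketch below uses the standard macro-averaged definitions, which are assumed here rather than taken from the cited equations. The example matrix mirrors the 10 kg case described in the text (one LP sample misclassified as WD), with the additional assumptions that rows are true classes, columns are predicted classes, and each pattern contributes 100 samples.

```python
import numpy as np


def macro_metrics(cm):
    """Accuracy and macro-averaged precision, recall and F-Score.

    cm[i, j] counts samples of true class i predicted as class j (assumed).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                           # correctly classified per class
    precision = tp / cm.sum(axis=0)            # column sums: predicted counts
    recall = tp / cm.sum(axis=1)               # row sums: true counts
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f_score.mean()


# Order of classes: LP, FA, C, WD, EC, GP, GR, N (100 samples each, assumed).
cm = np.diag([100] * 8)
cm[0, 0] = 99                                  # 99 LP samples are correct
cm[0, 3] = 1                                   # one LP sample predicted as WD
accuracy, precision, recall, f_score = macro_metrics(cm)
```

The accuracy equals 799/800 = 0.99875, which rounds to the 99.88% reported for the 10 kg load condition; the single error lowers the recall of LP and the precision of WD while leaving the other classes untouched.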

Confusion matrix of the proposed approach for storage stacker dataset under different load conditions: (a) Group A, (b) Group B, (c) Group C, and (d) Group D.

Average diagnostic index of different load conditions obtained by AAMFFN.

Classification accuracy of the 20 trials using the two algorithms for different load conditions: (a) Group A, (b) Group B, (c) Group C, and (d) Group D.
Identification results obtained using the SDMSFN and AAMFFN approach for storage stacker datasets.
AAMFFN: attention-based adaptive multimodal feature fusion network; SDMSFN: single-delay multi-sequential fusion network.

Box-plot of the two algorithms for storage stacker dataset under different load conditions in the 20 trials: (a) SDMSFN and (b) AAMFFN.
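The box-plot elements described for Figure 12 (median, quartiles, whisker limits, outliers and mean) can be computed as in the sketch below. It assumes the common rule that points farther than 1.5 times the interquartile range from the box are outliers; the paper does not state which whisker rule was used. The function name `box_plot_summary` and the accuracy values are hypothetical and are not taken from Table 4.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Summarize a sample the way a box plot draws it.

    Assumption: outliers are points beyond `whisker` * IQR from the
    quartiles, and quartiles use the 'inclusive' interpolation method.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inliers = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "mean": statistics.fmean(values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inliers),   # lowest non-outlier value
        "whisker_high": max(inliers),  # highest non-outlier value
        "outliers": outliers,
    }


# Hypothetical per-trial accuracies (%) with one poor trial.
accuracies = [99.88, 99.90, 99.92, 99.95, 99.87, 99.91, 99.93, 97.50]
summary = box_plot_summary(accuracies)
print(summary)
```

A single poor trial shows up as an outlier rather than stretching the whiskers, which is why a box plot separates occasional failures from the typical spread of a method.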
To further investigate the impact of the number of iterations on classification performance, Figure 13 plots the classification accuracy of the two methods (i.e. SDMSFN and AAMFFN) against the number of iterations under different load conditions. Figure 13 shows that the classification accuracy of both SDMSFN and AAMFFN tends to increase gradually as the number of iterations increases. Specifically, the classification accuracy of the proposed AAMFFN is very close to that of SDMSFN when the number of iterations is less than 120. However, when the number of iterations is greater than 120, the classification accuracy of AAMFFN is significantly higher than that of SDMSFN. Furthermore, to quantitatively evaluate the robustness of the proposed method for intelligent fault diagnosis, Figure 14 illustrates the classification accuracy of AAMFFN as the number of samples increases (i.e. 100, 200, 400, 600 and 800) for different signal-noise ratios (SNRs

The relation curve between classification accuracy and iterations for different load conditions: (a) Group A, (b) Group B, (c) Group C, and (d) Group D.

Fault diagnosis results of AAMFFN as the number of samples increases under different SNRs: (a) Group A, (b) Group B, (c) Group C, and (d) Group D.
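Robustness tests at a given SNR are commonly set up by corrupting the clean vibration signal with noise of a prescribed power. The sketch below assumes additive zero-mean white Gaussian noise and the usual definition SNR (dB) = 10 log10(P_signal / P_noise); the paper does not state its noise model, so this is an illustration rather than the authors' procedure. The function names and the sinusoidal test signal are hypothetical.

```python
import math
import random


def add_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise so the expected SNR equals snr_db (dB)."""
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR (dB) actually realized in a noisy signal."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical stand-in for a vibration signal: a 50 Hz sine at 10 kHz.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 50 * t / 10000) for t in range(10000)]
noisy = add_noise(clean, snr_db=0, rng=rng)
print(round(measured_snr_db(clean, noisy), 2))
```

At 0 dB the noise carries as much power as the signal, so lower or negative SNR values correspond to the strong-noise conditions under which robustness is most informative.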
Comparison with some state-of-the-art approaches
To further validate the effectiveness and superiority of the proposed method, comparisons with seven state-of-the-art DL-based methodologies are conducted. These methods are the meta generative adversarial network (MTGAN), 45 wide deep convolutional neural networks (WDCNN), 46 deep residual shrinkage networks (DRSN), 47 bidirectional long short-term memory-stacked autoencoder (BiLSTM-SAE), 30 ensemble empirical mode decomposition-deep convolution neural network (EEMD-DCNN), 48 multiscale entropy + multiscale permutation entropy + regression neural network (MSE + MPE + RNN), 49 and empirical wavelet transform (EWT) + deep belief networks (DBN) + Softmax. 50 All methods were run in MATLAB R2018a on a computer with an Intel Core i7-8550U 3.80 GHz CPU and 16.00 GB of RAM. The t-distributed stochastic neighbor embedding (t-SNE) 51 technique is employed to map the high-dimensional output features of the last hidden layer of each of the eight methods (the seven comparison methods and AAMFFN) onto a two-dimensional plane, in order to visualize the features learned by the trained networks. Figure 15 displays the feature visualization results for the testing samples. Figure 15 shows that, compared with the other seven methods, the proposed AAMFFN better aggregates samples of the same type and separates samples of different types.

Feature visualization via t-SNE: (a) MTGAN, (b) WDCNN, (c) DRSN, (d) BiLSTM-SAE, (e) EEMD-DCNN, (f) MSE + MPE + RNN, (g) EWT + DBN + Softmax, and (h) the proposed AAMFFN.
In addition, Figure 16 depicts the fault diagnosis results obtained by the various methods under different SNRs. Figure 16 shows that the proposed AAMFFN outperforms the other seven approaches under all SNR conditions, especially under strong noise. Figure 17 displays the classification accuracy of the different methods over five trials. The detailed classification results, including average accuracy, standard deviation and CPU running time, are listed in Table 5. As shown in Figure 17 and Table 5, the proposed method has a higher classification accuracy (99.98%) than the seven comparison methods. Additionally, the standard deviation of the proposed method is lower than that of all comparative approaches, which indicates that the proposed method is more stable in the fault diagnosis of storage stacking machinery. Table 5 also shows that the CPU running time of the proposed method is shorter than that of the other methods when processing storage stacker fault vibration data, further validating its advantages. In summary, the proposed method can achieve high-precision fault diagnosis of storage stacking machinery under variable working conditions.

Fault diagnosis results obtained by various methods under different SNRs.

The comparison of the identification results obtained via different approaches.
Comparison results among different approaches.
EEMD: ensemble empirical mode decomposition; SAE: sparse autoencoder.
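The average accuracy and standard deviation reported per method in Table 5 are summary statistics over repeated trials. A minimal sketch of that computation follows; it assumes the sample standard deviation (n - 1 denominator), which the paper does not specify. The method names and accuracy values are hypothetical placeholders and are not the values of Table 5.

```python
import statistics


def summarize_trials(accuracies):
    """Return (mean, sample standard deviation) of per-trial accuracies."""
    mean = statistics.fmean(accuracies)
    std = statistics.stdev(accuracies)  # n - 1 denominator (assumption)
    return mean, std


# Hypothetical accuracies (%) over five trials for two unnamed methods.
trials = {
    "method_a": [99.97, 99.99, 99.98, 99.98, 99.98],
    "method_b": [98.90, 99.40, 98.20, 99.10, 98.60],
}

summary = {name: summarize_trials(acc) for name, acc in trials.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.2f}%, std={std:.3f}")
```

A lower standard deviation at comparable or higher mean accuracy is what supports the stability claim made above.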
Ablation study
A series of ablation analyses is performed to assess the impact of hyperparameters on performance. The sensitive parameters considered are the time delay τ, the weight W and the bias B. The root-mean-square error (RMSE) obtained when training AAMFFN under different parameter settings is depicted in Figure 18. As shown in Figure 18, as the time delay τ increases, the RMSE first increases and reaches a maximum at τ = 5; when τ is greater than 5, the RMSE gradually decreases and reaches a minimum value. Similarly, the RMSE for the weight W first increases and then decreases, with a distinct turning point at W = 1. The RMSE for various values of the bias B shows an increasing pattern, and the best RMSE is obtained at B = 100. Consequently, the selection of these three hyperparameters (τ, W and B) plays a pivotal role in shaping the performance of AAMFFN, and improper parameter choices can amplify fluctuations in model performance.

RMSE with different hyperparameter settings.
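The RMSE used as the ablation criterion has the standard form sqrt((1/N) Σ (y_i − ŷ_i)²). The sketch below computes it and picks the setting with the lowest RMSE from a sweep. It assumes the RMSE is taken between target values and network outputs, which the paper does not spell out; the function names and the sweep values are hypothetical and do not reproduce Figure 18.

```python
import math


def rmse(targets, predictions):
    """Root-mean-square error between two equal-length sequences."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def best_setting(rmse_by_value):
    """Return the hyperparameter value with the smallest RMSE."""
    return min(rmse_by_value, key=rmse_by_value.get)


# Hypothetical sweep: RMSE recorded for several candidate values.
sweep = {1: 0.30, 5: 0.50, 10: 0.10}
print(best_setting(sweep))
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Repeating such a sweep for each parameter while holding the others fixed gives one curve per parameter, which is the kind of comparison an ablation over τ, W and B calls for.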
Conclusion
This paper concentrates on the issue of multimodal feature fusion in DL and proposes a novel AAMFFN architecture for identifying mechanical faults under variable load conditions. More specifically, an end-to-end AAMFFN-based fault diagnosis scheme is formulated. The novelty of the proposed AAMFFN is that it greatly enhances the feature diversity and stability of MFFN so as to adaptively extract multi-temporal feature information, which allows it to efficiently learn richer feature representations and greatly improve fault classification performance. Experimental data for fault diagnosis of a storage stacker are analyzed to validate the applicability of the proposed method under variable working conditions. Furthermore, comparisons with some state-of-the-art approaches also demonstrate that the proposed AAMFFN has higher diagnostic accuracy. The main contributions of this paper are summarized as follows:
(1) Spatial attention and channel attention are combined and cascaded in a progressive connection structure whose input consists of shallow detail and deep global feature information, which improves the accuracy of fault feature extraction.
(2) A novel DL architecture named AAMFFN is developed to extract multi-temporal encoding features, enabling the proposed MFFN to learn deeper, more discriminative features and enhancing the mapping capability of fault feature information.
(3) An end-to-end fault diagnosis scheme based on AAMFFN is presented that performs automatic feature extraction from the original signal and fault classification simultaneously, which greatly enhances learning ability and improves fault diagnosis accuracy.
In addition, the effectiveness of the proposed method in fault diagnosis of storage stacking machinery is verified by the above experimental analysis results. More specifically, the proposed method outperforms the state-of-the-art methods considered in the comparison, thus providing a solid research foundation for fault diagnosis of storage stacking machinery in practical engineering applications. Nevertheless, some issues in the proposed method remain to be improved and discussed in future research, which can be summarized as follows:
(1) Although the effectiveness of the proposed method in identifying various fault patterns of storage stacking machinery under a constant average speed has been demonstrated, the actual operating speed is difficult to control, and the resulting unstable rotating speed poses a challenge. That is, the performance of the proposed method for fault diagnosis of storage stacking machinery at variable speed is unknown. Consequently, this issue will be thoroughly investigated in future work. Additionally, considering the harsh operating environment of storage stacking machinery, compound faults may occur. Hence, future work will also focus on developing a diagnostic approach for identifying compound faults in storage stacking machinery.
(2) The proposed AAMFFN offers a solution for fault diagnosis by learning an efficient feature representation without the need for extensive signal processing methods or expert knowledge. Nevertheless, the performance of AAMFFN relies heavily on its hyperparameters, such as the number of scales, the number of neurons in each hidden layer, the number of iterations and the learning rate. Hence, parameter optimization of AAMFFN based on intelligent algorithms will be addressed in our future work.
(3) Although the analysis results demonstrate that the proposed method achieves high-precision fault diagnosis of storage stacking machinery under variable working conditions, its ability to predict the remaining useful life (RUL) and to automatically identify different fault patterns of storage stacking machinery in practical engineering applications is still unknown. Therefore, our future research will attempt to apply the proposed method to RUL prediction and multichannel fault diagnosis of storage stacking machinery under various working conditions in practical engineering applications.
Footnotes
Acknowledgements
The author would like to thank the anonymous reviewers and the editor for their helpful comments.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Key Research and Development Program of China (Grant No. 2020YFB1712200), in part by the Major Science and Technology Projects of Sichuan Province, China (Project No. 2022ZDZX0002), by the Fundamental Research Funds for the Central Universities (No. 2682023CX003), by the China Postdoctoral Science Foundation (Grant No. 2023M742895), and by the China Scholarship Council (CSC).
