Fusion of multi-gate expert mechanism and attention strategy for sintering multi-quality prediction model

Abstract

Predicting the quality indicators of the sintering process is a core component of intelligent regulation in the steel industry, directly impacting production efficiency and product quality. However, existing research faces two key issues. First, traditional single-task models struggle to leverage the correlations between indicators to improve prediction accuracy, and are prone to performance degradation due to conflicts in feature sharing. Second, noise and outliers in industrial data interfere with model training, reducing prediction robustness. To address these issues, this paper proposes a multi-gate expert mixture model that integrates an attention mechanism and dynamic loss weighting. The model achieves collaborative optimization through three components. First, a multi-expert network with a task gating mechanism dynamically balances feature sharing and specificity between indicators. Second, a task attention layer enhances the key features of each indicator, improving the model's adaptability to task differences. Third, a dynamic loss weighting strategy combining L1 loss and Smooth L1 loss adaptively suppresses outlier interference. Experimental results on an industrial sintering dataset show that the model outperforms the comparison models, with R² values of 0.967, 0.982, and 0.986 for product yield, drum index, and RDI_+3.15, respectively. The model provides an efficient solution for the collaborative prediction of multiple quality indicators in complex industrial processes and has strong engineering application value.

Keywords

sintering process; quality indicator prediction; multi-gate expert mixture model; attention mechanism; dynamic loss weighting

Introduction

In modern industrial production, the sintering process plays a crucial role in iron and steel as well as in chemical and other heavy industries, such as powder metallurgy and advanced ceramics, directly influencing production efficiency, energy consumption, and final product quality.¹ With the continuous development of industrial automation and data-driven approaches, optimizing the sintering process through data-driven methods has become an urgent issue to address. Accurate sintering quality prediction can not only effectively improve production efficiency and reduce waste rates but also significantly save energy and reduce environmental pollution.²

In recent years, sintering quality prediction has garnered widespread attention as a key component in the intelligent development of the steel industry. Li et al.³ addressed the dynamic characteristics and time-delay issues of the sintering process by proposing dynamic time feature extraction and state-space reconstruction methods. Jiang et al.⁴ employed infrared thermography and image processing techniques to develop an FeO content prediction model based on image features, achieving non-contact, high-precision online predictions. Additionally, Shao et al.⁵ utilized artificial neural networks and convolutional neural networks, combined with process mechanisms, to enhance the prediction performance of key quality indicators such as drum index. In terms of tree models and ensemble learning, Jiang et al.⁶ addressed the issue of unstable predictions for RDI and RI by combining CatBoost and XGBoost with the SHAP method, leveraging the nonlinear fitting capabilities of tree models and the feature contribution analysis mechanism. Xia et al.⁷ addressed the uncertainty in drum index prediction by using LightGBM combined with kernel density estimation to construct an interval prediction model. Li et al.⁸ proposed a prediction system combining Granger causality selection and Stacking ensemble learning to address the challenge of multi-quality indicator linked predictions. In the field of sequential deep learning, Li et al.⁹ addressed the issues of difficulty in parameter tuning, insufficient self-learning ability, and poor generalization in traditional methods with multiple parameter inputs by employing the optimized Optuna-DFNN. Chen et al.¹⁰ developed a spatiotemporal feature extraction model driven by dynamic working condition recognition to address the impact of working condition switching on drum strength. 
Yang et al.¹¹ addressed the cross-domain generalization issue for sintering composite quality indicators by employing latent space transfer and domain adaptation methods. In the field of visual soft sensing, Zhang et al.¹² addressed the difficulty of online FeO measurement and experimental delay by using infrared machine vision combined with CNN models. Tang et al.¹³ addressed the need for mixed material particle recognition and batching evaluation by using micro-CT 3D reconstruction combined with machine learning to construct a particle-level multi-label classification model. In the single-objective optimization of emissions and process condition predictions, Li et al.¹⁴ developed an emission prediction model using a PSO-BP neural network to achieve prior control of sintering flue gas SO₂ and NOₓ. Wang et al.¹⁵ addressed the issue of joint prediction of SO₂, NOₓ, and PM by using a deep ensemble model combining GBDT, RNN, and GRU. Wang et al.¹⁶ addressed the issues of sintering negative pressure and flue gas pressure fluctuations in the main sintering exhaust pipe by using a CNN-GRU prediction model combined with a single-objective optimization approach, achieving overshoot suppression of negative pressure and improved energy consumption. In the field of data-driven modeling for core KPIs such as strength, yield, and output, Gao et al.¹⁷ developed a PCA combined with GA-optimized ANN model for drum strength prediction. Mallick et al.¹⁸ compared linear regression and ANN models for sintering machine productivity prediction and confirmed that ANN performs better in nonlinear scenarios. Fan et al.¹⁹ employed ICEEMDAN combined with CNN-BiLSTM-Attention models to predict the time series of FeO composition under complex disturbances. Zhou et al.²⁰ addressed the cross-source fusion issue of FeO soft measurement by using a heterogeneous data-driven fusion method combined with tensor decomposition. 
Li et al.²¹ developed a system that combines a big data platform with an integrated deep model cascade to meet the needs of full-process cascade prediction.

With the deepening of research, multi-source data fusion and multi-task modeling have gradually become important directions for sintering quality prediction. Table 1 summarizes key studies on multi-source data fusion, multi-task modeling, and system optimization for sintering quality prediction.

Table 1.

Comparison of methods in sintering quality prediction.

Study	Methodology	Application / problem addressed	Main contribution
Li et al.²²	Multi-source data fusion model	Sintering quality prediction	Improved model stability and prediction accuracy
Liu et al.²³	Time-shifted state-space reconstruction	Inconsistent time delays between variables	Addressed time delay issues, improving multi-variable predictions
Liu G et al.²⁴	Dynamic multi-objective optimization with collaborative prediction	Cement clinker firing process production indicators	Optimized production indicator predictions under dynamic conditions
Dai Y et al.²⁵	Hybrid self-learning model	Sintering furnace temperature prediction	Achieved accurate temperature predictions and better control
Singh A et al.²⁶	Multi-objective Bayesian optimization	Retrieval of aerosol structures from microscopic images	Applied Bayesian optimization to optimize retrieval from images
A. N F and M. N V²⁷	Viral evolutionary genetic algorithm	Multi-objective optimization of laser sintering thermoplastics	Optimized material sintering performance using evolutionary algorithms
Vukovic M et al.²⁸	Local machine learning models for multi-objective optimization	System interdependencies in production	Managed system interdependencies using localized models for optimization
Lili F et al.²⁹	Multi-objective differential evolution for gas system scheduling	Steel metallurgy gas system scheduling	Improved gas system scheduling in steel metallurgy processes
Li et al.³⁰	IMOBWO framework with restricted non-dominating criterion	Raw material ratio optimization under complex constraints	Achieved a uniformly distributed Pareto front in steel plant tests
Yang et al.³¹	Full-process multi-objective model based on mass and energy conservation	High energy consumption and carbon emissions in steel process	Reduced cost, CO₂ emissions, and energy consumption by significant percentages
Jiabao W et al.³²	Improved genetic algorithm for batching optimization	Batching for sintering blast furnaces	Minimized cost and energy consumption while meeting quality standards
Singh K et al.³³	Machine learning and NSGA-II for pelletizing and sintering optimization	Integrated optimization of pelletizing and sintering	Established a model for production rate and quality optimization
Luo J et al.³⁴	SDN-NSGAII algorithm	Optimizing complex system relationships	Improved optimization in complex systems using enhanced SDN relationships

Through the above research, it has been found that despite extensive studies on the sintering process using machine learning in recent years, most existing models still face challenges when dealing with the multi-task and multi-indicator issues in the sintering process. Existing methods often overlook the interrelationships between different tasks, resulting in suboptimal model performance during multi-task collaborative prediction. Furthermore, traditional single-task learning methods are limited in their predictive generalization ability because they fail to fully consider the shared features between tasks. The specific issues are reflected in the following: first, multiple quality indicators in the sintering process are highly correlated, but existing models fail to effectively utilize this relationship. Second, when handling multiple objectives, balancing the prediction accuracy of each task and reducing interference between tasks remains an urgent issue to address.

To address these issues, this paper proposes a novel multi-task learning model: Attention-MMoE-DLW (Attention Multi-gate Mixture-of-Experts with Dynamic Loss Weighting), which aims to optimize the prediction of multiple quality indicators in the sintering process through multi-task collaborative learning. Specifically, this study proposes solutions in the following areas: first, the MMoE algorithm is used to achieve feature sharing between tasks, dynamically allocating expert weights through multiple expert networks and task-specific gating mechanisms. Second, a task attention mechanism is incorporated to adaptively enhance task-specific information. Finally, a dynamic weighting strategy based on L1 loss and Smooth L1 loss is designed to further improve the model's robustness and prediction accuracy. Additionally, Optuna is used for hyperparameter optimization, automatically adjusting the model's hyperparameters to further enhance its performance and adaptability. Through these designs, the Attention-MMoE-DLW model effectively addresses the limitations of traditional methods in handling multi-task and multi-indicator problems, providing an effective solution for precise control of the sintering process and multi-objective optimization in industrial processes.

Data processing

Data description

The dataset used in this study consists of operational plant data collected during controlled production trials on an industrial sintering strand, with each sample corresponding to one complete sintering cycle. It includes input features such as raw material particle size, moisture content, thermal parameters, sintering layer thickness, and chemical composition, as well as the corresponding product quality indicators. Each sample contains 17 input variables and 3 output indicators. Table 2 lists all the dimensions of the data used in this study, with the first 17 being input parameters and the 18th to 20th being the output results.

Table 2.

Data description.

Index	Parameter description
1	Water addition rate (%)
2–6	Particle size of mixed materials (<3 mm, 3–5 mm, 5–8 mm, >8 mm) and average particle size (%)
7	Ignition temperature (°C)
8	Ignition negative pressure (kPa)
9	Sintering negative pressure (kPa)
10	Layer thickness (mm)
11	Sintering time (min)
12–16	TFe (%) in sinter, CaO (%), MgO (%), SiO₂ (%), Al₂O₃ (%)
17	Moisture (%)
18	Product yield (%)
19	Drum index (%)
20	RDI_+3.15 (%)

Data cleaning

To identify potential outliers in multi-dimensional feature combinations, this study employs the Isolation Forest algorithm for unsupervised anomaly detection. This method constructs multiple randomly partitioned tree models to measure how easily each sample can be “isolated” and calculates its anomaly score. The lower the anomaly score, the more easily the sample can be isolated, indicating a higher likelihood of being an outlier. Figure 1 shows the distribution of anomaly scores for all samples. It can be observed that most of the samples have scores concentrated between 0.05 and 0.12, while samples with scores less than 0 exhibit a distinctly left-skewed distribution, with fewer samples but greater deviations.

Figure 1.

Distribution of anomaly scores

The dataset used in this study consists of operational data collected from the same batch of material over a consistent time interval. Anomalies were identified using a 5% anomaly-detection threshold, which is consistent with the 95% control limit commonly used in statistical process control, so 95% of the original samples were retained after preprocessing. This threshold was set to ensure that the probability of a false negative remains below 5%. The threshold value is determined indirectly by the contamination parameter of the Isolation Forest, which sets the expected proportion of anomalies in the dataset. Unless otherwise specified, the contamination parameter is set to 5%, balancing effective outlier removal against retaining a sufficient number of valid samples for analysis.³⁵ After removing the outliers, all input variables were standardized using the Z-score method. The formula is as follows:

x^{'} = \frac{x - μ}{σ}

(1)

Where x represents the original feature value, $μ$ is the mean of the feature, $σ$ is the standard deviation, and $x^{'}$ is the standardized data.
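The two cleaning steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the anomaly scores are assumed to have been computed already (for example by an Isolation Forest), the helper names `keep_mask` and `z_score` and the toy values are hypothetical, and the population standard deviation is assumed for Eq. (1).

```python
import math
from statistics import fmean, pstdev


def keep_mask(scores, contamination=0.05):
    """Flag the lowest-scoring `contamination` share of samples as outliers.

    Lower score = easier to isolate = more anomalous, as described in the text.
    Returns a list of booleans, True for samples that are kept.
    """
    n_drop = math.floor(contamination * len(scores))
    # Indices sorted from the most anomalous (lowest score) upward.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(len(scores))]


def z_score(values):
    """Eq. (1): x' = (x - mu) / sigma (population standard deviation assumed)."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("constant feature: standard deviation is zero")
    return [(x - mu) / sigma for x in values]


# Toy example: 20 samples, so a 5% contamination rate removes exactly one.
scores = [0.08] * 19 + [-0.15]
feature = [float(i) for i in range(19)] + [500.0]

mask = keep_mask(scores)
clean = [x for x, keep in zip(feature, mask) if keep]
standardized = z_score(clean)
```

Removing the outliers before standardizing keeps extreme values from inflating the mean and standard deviation used in Eq. (1).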

Model construction

To address the complexity of multi-quality indicator collaborative prediction in the sintering process, a novel multi-task learning model called Attention-MMoE-DLW is proposed. This model combines the multi-gate expert mixture mechanism with a task attention module, using a “shared-private feature” learning framework to effectively capture the coupling relationships between multiple indicators and task specificity. The execution process of the entire model is shown in Figure 2.

Figure 2.

Overall execution flow of the model

Below is an introduction to the various model modules and the model construction process.

Input layer

The input to the Attention-MMoE-DLW model consists of 17-dimensional process parameters from the sintering process. These parameters are derived from the preprocessed dataset and form the input feature vector, $X = [x_{1}, x_{2}, \dots, x_{17}]^{T}$ , where $x_{i}$ represents the standardized value of the i -th process parameter.

Multi-experts and gating mechanism

MMoE is the core algorithm in the Attention-MMoE-DLW model for implementing multi-task feature selective sharing. It consists of the expert network and task gating network. The expert network extracts features from different perspectives, while the task gating mechanism dynamically allocates expert weights based on the prediction target. The calculation formula is given by:

h_{k} = \sum_{i = 1}^{E} g_{k, i} (X) \times f_{i} (X)

(2)

Where E is the number of experts, $f_{i} (X)$ represents the features extracted by the $i$-th expert, and $g_{k, i} (X)$ is the gating weight that task k assigns to expert i. Through this mechanism, the model captures common patterns across quality indicators while preserving their individual characteristics. In summary, the MMoE model has three tasks that map one-to-one to the three outputs. The task features are obtained by compressing the 17 input features into an 8-dimensional task-relevant representation via a linear–nonlinear feature compression layer. The number of experts is not fixed a priori but is obtained via hyperparameter search. The gate routes the shared representation to the experts, and task-specific towers produce the three outputs, as illustrated in Figure 3.

Figure 3.

MMoE model structure
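The gated mixture in Eq. (2) can be sketched for a single sample as follows. This is an illustrative sketch rather than the authors' network: the softmax gate, the single-layer ReLU experts that output 8-dimensional features directly, the random initialization, and the choice of four experts are all assumptions, and the task-specific towers are omitted.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mmoe_forward(x, experts, gates):
    """Eq. (2): h_k = sum_i g_{k,i}(X) * f_i(X), evaluated for every task k."""
    # f_i(X): each expert maps the input to a feature vector (ReLU is assumed).
    expert_out = [[max(0.0, v) for v in linear(w, b, x)] for w, b in experts]
    dim = len(expert_out[0])
    task_features = []
    for w, b in gates:
        g = softmax(linear(w, b, x))  # one weight per expert (softmax is assumed)
        h = [sum(g[i] * expert_out[i][j] for i in range(len(experts))) for j in range(dim)]
        task_features.append(h)
    return task_features


rng = random.Random(0)
N_INPUTS, N_EXPERTS, N_TASKS, FEATURE_DIM = 17, 4, 3, 8  # N_EXPERTS is a placeholder
experts = [init_layer(rng, FEATURE_DIM, N_INPUTS) for _ in range(N_EXPERTS)]
gates = [init_layer(rng, N_EXPERTS, N_INPUTS) for _ in range(N_TASKS)]

x = [rng.gauss(0.0, 1.0) for _ in range(N_INPUTS)]  # one standardized sample
h = mmoe_forward(x, experts, gates)  # three 8-dimensional task features
```

All tasks share the same expert outputs, but each task has its own gate, so each task receives a different weighted combination of those outputs.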

Task attention layer

The task attention layer is a key module in the Attention-MMoE-DLW model for strengthening task-specific features; it is a feature-weighted attention mechanism. Through feature-weighted enhancement, the module amplifies the task-relevant information in the MMoE output features while suppressing noise interference. The layer contains one independent attention generation network per task. The input of each network is the 8-dimensional task feature output by MMoE, and the output is the enhanced feature vector. The structure of this layer is shown in Figure 4.

Figure 4.

Attention layer structure

In the attention-weight generation mechanism, each task's attention network adopts a serial architecture of “two fully connected layers plus a nonlinear activation.” The first fully connected layer applies a linear transformation to the 8-dimensional input features, introduces nonlinearity via the Tanh activation function, and yields intermediate features; the computation is given by:

z_{k} = \tanh (W_{\tanh, k} h_{k} + b_{\tanh, k})

(3)

Here, $W_{\tanh, k} \in R^{8 \times 8}$ denotes the weight matrix of the first layer in the attention network for task k, $b_{\tanh, k} \in R^{8}$ is the bias term, and $z_{k} \in R^{8}$ is the intermediate feature vector after the nonlinear activation. The second fully connected layer further transforms $z_{k}$ and, via a sigmoid activation, maps the output to the interval (0,1) to obtain the attention-weight vector:

a_{k} = \mathrm{Sigmoid} (W_{\mathrm{sigm}, k} z_{k} + b_{\mathrm{sigm}, k})

(4)

In Eq. (4), $W_{\mathrm{sigm}, k} \in R^{8 \times 8}$ is the weight matrix of the second layer, $b_{\mathrm{sigm}, k} \in R^{8}$ is its bias, and $a_{k} = [a_{k, 1}, a_{k, 2}, \dots, a_{k, 8}]^{T} \in (0, 1)^{8}$ is the resulting attention-weight vector. Here, $(0, 1)^{8}$ denotes the Cartesian product of the open interval (0,1) in an 8-dimensional Euclidean space, and $a_{k, j}$ quantifies the importance of the $j$-th feature dimension for task k.

Under the feature-enhancement mechanism, the task feature is produced by “original feature + self-gated enhancement,” as formulated in Eq. (5):

r_{k} = h_{k} + h_{k} ⊙ a_{k}

(5)

Here, $⊙$ denotes element-wise multiplication, and $r_{k} \in R^{8}$ is the final enhanced feature. Because $r_{k, j} = h_{k, j} (1 + a_{k, j})$ and $a_{k, j} \in (0, 1)$, every feature dimension is scaled by a factor between 1 and 2: as $a_{k, j} \to 1$ the corresponding dimension is nearly doubled, whereas as $a_{k, j} \to 0$ it remains essentially unchanged, thereby enabling adaptive emphasis on informative features.
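Equations (3) to (5) can be traced for one task with the short sketch below. The identity weight matrices and zero biases are untrained placeholders chosen only so that the example is easy to follow, and the function names are hypothetical.

```python
import math


def dense(weights, bias, x):
    """Dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def task_attention(h, w_tanh, b_tanh, w_sigm, b_sigm):
    """Return the attention weights a_k and the enhanced feature r_k for one task."""
    z = [math.tanh(v) for v in dense(w_tanh, b_tanh, h)]  # Eq. (3)
    a = [sigmoid(v) for v in dense(w_sigm, b_sigm, z)]    # Eq. (4)
    r = [hj + hj * aj for hj, aj in zip(h, a)]            # Eq. (5)
    return a, r


DIM = 8
# Placeholder parameters (identity weights, zero biases), not trained values.
identity = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
zeros = [0.0] * DIM

h_k = [0.5, -1.0, 2.0, 0.0, 1.5, -0.25, 3.0, -2.0]  # toy 8-dimensional task feature
a_k, r_k = task_attention(h_k, identity, zeros, identity, zeros)
```

Since every attention weight lies strictly between 0 and 1, each entry of `r_k` keeps the sign of the corresponding entry of `h_k` and has between one and two times its magnitude.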

Dynamic loss weighting and optimization strategy

Dynamic loss weighting (DLW) is the core mechanism in the Attention-MMoE-DLW model for enhancing prediction robustness. It adaptively balances the contributions of L1 loss and Smooth L1 loss, suppressing outlier interference while ensuring gradient stability, thereby ensuring efficient model training.

For sintering quality indicator prediction tasks, the model uses “weighted combined loss” as the optimization objective, defined as the linear weighted sum of L1 loss and Smooth L1 loss. For the k-th task, its loss component is:

L_{k} = λ \cdot L_{\mathrm{L1}} (\hat{y}_{k}, y_{k}) + (1 - λ) \cdot L_{\mathrm{SmoothL1}} (\hat{y}_{k}, y_{k})

(6)

Where:

L_{\mathrm{L1}} (\hat{y}_{k}, y_{k}) = \| \hat{y}_{k} - y_{k} \|_{1} = \sum_{i = 1}^{n} | \hat{y}_{k, i} - y_{k, i} |

(7)

This is the L1 loss, which is naturally robust to outliers but has a discontinuous gradient at zero.

L_{\mathrm{SmoothL1}} (\hat{y}_{k}, y_{k}) = \sum_{i = 1}^{n} \begin{cases} \frac{1}{2} (\hat{y}_{k, i} - y_{k, i})^{2}, & | \hat{y}_{k, i} - y_{k, i} | \leq 1, \\ | \hat{y}_{k, i} - y_{k, i} | - \frac{1}{2}, & | \hat{y}_{k, i} - y_{k, i} | > 1 \end{cases}

(8)

This is the Smooth L1 loss, which resolves the gradient discontinuity of the L1 loss at zero while retaining its linear growth for large errors. $λ$ is the dynamic weight coefficient, determined through hyperparameter optimization, and is used to balance the contributions of both losses. In Eqs. (7) and (8), i indexes the n samples and k denotes the task index; $y_{k}$ denotes the ground-truth targets for task k, and $\hat{y}_{k}$ denotes the corresponding predictions.
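The combined objective in Eqs. (6) to (8) can be written out for a single task as in the sketch below. It follows the summed form of the equations; the function names, the argument name `lam` (standing for λ), and the toy values are illustrative assumptions.

```python
def l1_loss(pred, target):
    """Eq. (7): sum of absolute errors."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def smooth_l1_loss(pred, target):
    """Eq. (8): quadratic for |error| <= 1, linear beyond that point."""
    total = 0.0
    for p, t in zip(pred, target):
        e = abs(p - t)
        total += 0.5 * e * e if e <= 1.0 else e - 0.5
    return total


def combined_loss(pred, target, lam):
    """Eq. (6): lam * L1 + (1 - lam) * SmoothL1, with 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * l1_loss(pred, target) + (1.0 - lam) * smooth_l1_loss(pred, target)


# Toy values: one small error (0.5), one large error (2.0), one exact prediction.
y_true = [0.0, 0.0, 0.0]
y_pred = [0.5, -2.0, 0.0]
loss = combined_loss(y_pred, y_true, lam=0.5)
```

For the small error the Smooth L1 term contributes 0.125 against 0.5 for L1, while for the large error the two terms differ only by the constant 0.5, which is why a larger λ shifts the emphasis toward small residuals. Note that the built-in L1 and Smooth L1 losses in PyTorch average over elements by default, so they would match Eqs. (7) and (8) only up to a factor of n unless summation is requested.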

Experiment

Model evaluation metrics

The model performance evaluation metrics selected include the coefficient of determination (R²), mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The calculation formulas are as follows:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(9)

\mathrm{MSE} = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}

(10)

\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}

(11)

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |

(12)

\mathrm{MAPE} = \frac{100\%}{n} \sum_{i = 1}^{n} \left| \frac{y_{i} - {\hat{y}}_{i}}{y_{i}} \right|

(13)

Where n is the number of samples, $y_{i}$ is the true value of the i-th sample, ${\hat{y}}_{i}$ is the predicted value of the i-th sample, and $\bar{y}$ is the mean of all true values. R² measures the model's goodness of fit to the data, with values closer to 1 indicating a better fit. MSE measures the average squared difference between predicted and true values, with smaller values indicating more accurate predictions. RMSE is the square root of MSE and has the same units as the true values, so it reflects the magnitude of prediction errors more intuitively; smaller values indicate more accurate predictions. MAE measures the average absolute difference between predicted and true values, with smaller values indicating more accurate predictions. MAPE measures the average percentage error of predicted values relative to true values, with smaller values indicating better performance.
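The five metrics in Eqs. (9) to (13) can be computed together as in the following sketch. The function name and the toy values are illustrative and are not data from the study; MAPE is returned in percent.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, RMSE, MAE and MAPE (in percent), following Eqs. (9)-(13)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,  # undefined when all true values are equal
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        # MAPE requires every true value to be non-zero.
        "MAPE": 100.0 / n * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)),
    }


# Toy values, not data from the study.
y_true = [80.0, 82.0, 84.0, 86.0]
y_pred = [81.0, 82.0, 83.0, 86.0]
metrics = regression_metrics(y_true, y_pred)
```

For these toy values the residual sum of squares is 2 and the total sum of squares is 20, giving R² = 0.9 and MSE = 0.5.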

Model comparison

To verify the performance advantage of the proposed Attention-MMoE-DLW model in multi-indicator prediction tasks for sintering quality, five representative architectures were compared: the single-task learning model EMD-ISBM-FNN,³⁶ the shared-bottom feature network model SSA-MLP-BPNN,³⁷ the multi-gate expert model Mamba-MoE,³⁸ the progressive feature extraction model HBPLE,³⁹ and the improved structure proposed in this paper, Attention-MMoE-DLW. All models were compared under the same dataset, training iterations, and parameter settings, predicting the three key quality indicators: product yield (%), drum index (%), and RDI_+3.15 (%). The results are shown in Table 3.

Table 3.

Comparison of experimental results.

Prediction Metric	Model	R²	MSE	RMSE	MAE	MAPE
Product Yield	EMD-ISBM-FNN	0.806	0.236	0.486	0.377	0.463
	SSA-MLP-BPNN	0.700	0.609	0.781	0.666	0.803
	Mamba-MoE	0.903	0.119	0.345	0.288	0.348
	HBPLE	0.922	0.096	0.310	0.242	0.297
	Attention-MMoE-DLW	0.967	0.074	0.272	0.212	0.257
Drum Index	EMD-ISBM-FNN	0.911	0.086	0.275	0.226	0.349
	SSA-MLP-BPNN	0.814	0.180	0.293	0.241	0.364
	Mamba-MoE	0.903	0.094	0.307	0.282	0.428
	HBPLE	0.972	0.027	0.163	0.134	0.202
	Attention-MMoE-DLW	0.982	0.017	0.131	0.099	0.150
RDI_+3.15	EMD-ISBM-FNN	0.940	0.147	0.384	0.306	0.414
	SSA-MLP-BPNN	0.901	0.081	0.289	0.235	0.334
	Mamba-MoE	0.974	0.063	0.252	0.198	0.264
	HBPLE	0.979	0.052	0.228	0.187	0.254
	Attention-MMoE-DLW	0.986	0.035	0.187	0.147	0.195

As shown in Table 3 and Figure 5, in the product yield prediction task the Attention-MMoE-DLW model achieves a coefficient of determination (R²) of 0.967, higher than the other four models. It also attains the lowest MSE, MAE, and MAPE, with values of 0.074, 0.212, and 0.257, respectively. This reflects its capability in handling nonlinear coupled features and imbalanced tasks. In the drum index prediction task, Attention-MMoE-DLW again performs best, with an R² of 0.982, an RMSE of only 0.131, and MAE and MAPE of 0.099 and 0.150, respectively, which supports the advantage of the task attention mechanism in fine-grained feature extraction. In the RDI_+3.15 task, Attention-MMoE-DLW also performs best, with an R² of 0.986 and error metrics lower than those of all comparison methods. The weaker performance of the four comparison models can be attributed to two factors. First, operating point changes produce distinct input and output regimes; single-expert or fixed-capacity models struggle to represent these piecewise behaviors, leading to unstable errors across regimes. Second, there is strong coupling among process variables, including basicity with iron content, ignition temperature with pressure, and moisture with particle size. Conventional multilayer perceptron and backpropagation baselines tend to overfit these correlations and fail to generalize beyond them.

Figure 5.

Comparison of model performance

To compare the performance of different models in product yield prediction, the prediction errors of each model on the same test set were analyzed, and the results are shown in Figure 6. The Attention-MMoE-DLW model has a maximum absolute error of 0.8257 and a minimum absolute error of 0.0005 in product yield prediction, with the overall error controlled within 1.48%. The error distribution is uniform with no noticeable bias, indicating that the model maintains high accuracy and stability in the product yield prediction task.

Figure 6.

Comparison of product yield errors for different models

In the drum index prediction task, the error comparison of each model is shown in Figure 7. The Attention-MMoE-DLW model has a minimum absolute error of 0.0003 and a maximum absolute error of 0.3638, with the error values consistently distributed around zero, showing a small fluctuation range. This indicates that the model has better robustness and accuracy in predicting the drum index, effectively avoiding large deviations. Compared to traditional single-task models, this model not only ensures the convergence of overall errors but also provides a closer approximation to the real production patterns in its detailed performance.

Figure 7.

Comparison of drum index errors for different models

For the RDI_+3.15 prediction results, Figure 8 shows the error comparison among different models. The Attention-MMoE-DLW model has a minimum absolute error of 0.0034 and a maximum absolute error of 0.5414, with the error controlled within a reasonable range. This further validates the model's feasibility in predicting complex quality indicators. Overall, the model demonstrates significant advantages in key indicators such as product yield, drum index, and RDI_+3.15. It not only improves prediction accuracy but also shows a strong ability to capture the coupled relationships among multiple indicators.

Figure 8.

Comparison of RDI_+3.15 errors for different models

The results indicate that the Attention-MMoE-DLW model has significant advantages in multi-indicator prediction tasks. It can provide effective support for quality assessment and operational state prediction in the sintering process, thereby improving the predictability and stability of the production process.

Ablation study

To further validate the contribution of key components to overall performance, an ablation study was conducted in which the task attention mechanism and the dynamic loss weighting strategy were removed in turn, yielding two ablation models for comparative analysis. First, to test the effectiveness of the task attention mechanism, the task attention layer was removed from the original model, retaining only the multi-gate expert structure and the loss weighting strategy to form the no-attention version of the model (No-Attention). Second, to evaluate the role of the dynamic loss weighting strategy, the loss function was changed to an equally weighted sum, keeping the MMoE and task attention structures unchanged, to form the no-dynamic-weighting version of the model (No-DLW). Under the same training and testing conditions, the performance metrics of the three models on the product yield (%), drum index (%), and RDI_+3.15 prediction tasks are shown in Table 4.

Table 4.

Ablation study results.

Prediction Metric	Model	R²	MSE	RMSE	MAE	MAPE
Product Yield	No-Attention	0.870	0.159	0.399	0.314	0.380
	No-DLW	0.946	0.096	0.287	0.236	0.268
	Attention-MMoE-DLW	0.967	0.074	0.272	0.212	0.257
Drum Index	No-Attention	0.934	0.064	0.252	0.194	0.295
	No-DLW	0.851	0.144	0.379	0.298	0.450
	Attention-MMoE-DLW	0.982	0.017	0.131	0.099	0.150
RDI_+3.15	No-Attention	0.942	0.143	0.378	0.308	0.410
	No-DLW	0.940	0.146	0.382	0.301	0.401
	Attention-MMoE-DLW	0.986	0.035	0.187	0.147	0.195

To conduct a detailed comparative analysis of the model performance, Figure 9 shows the performance comparison of the three models across different metrics. As shown in Figure 9, Attention-MMoE-DLW demonstrates a clear advantage in all evaluation metrics, particularly in R² and MAE. Specifically, Attention-MMoE-DLW achieves an R² of 0.967 for product yield, significantly outperforming No-Attention (0.870) and No-DLW (0.946). At the same time, Attention-MMoE-DLW's MSE, RMSE, MAE, and MAPE are also significantly lower than the other two models, demonstrating its higher prediction accuracy and stability. In predicting the drum index and RDI_+3.15, Attention-MMoE-DLW also exhibits strong advantages, especially with an R² of 0.982 for the drum index and 0.986 for RDI_+3.15, clearly outperforming No-Attention and No-DLW. These results indicate that the complete model, which combines the task attention mechanism and dynamic loss weighting strategy, can significantly improve the model's accuracy and prediction performance in the complex multi-task learning scenario.

Figure 9.

Comparison of ablation study results

Model prediction results analysis

To ensure reproducibility, the software stack used in this study is reported as follows: Optuna [4.4.0]; Python [3.11]; PyTorch [2.5.1]; NumPy [1.26.4]; pandas [2.2.3]; scikit-learn [1.6.1]. Optuna was used for hyperparameter optimization, and the optimization results are shown in Table 5. The optimization ranges for the parameters num_experts, hidden_dim, learning_rate, batch_size, and weight_l1 were set as follows: [2,8], [64,256], [1e-4,1e-2], [32,128], and [0.4,0.9], respectively. Training was run for up to 100 epochs with early stopping, using the validation mean $R^{2}$ of the first two targets as the stopping criterion.

Table 5.

Hyperparameter optimization results for Attention-MMoE-DLW.

Prediction Metric	num_experts	hidden_dim	learning_rate	batch_size	weight_l1	patience
Product Yield	2	224	0.000597	32	0.721480	15
Drum Index	2	160	0.000934	64	0.494180	20
RDI_+3.15	6	192	0.000890	32	0.400504	18

The parameter hidden_dim represents the dimension of the network's hidden layers, determining the model's ability to extract and transform features, which directly impacts its ability to capture complex patterns. The parameter num_experts is the number of expert networks in the MMoE structure, which affects the model's ability to specialize in learning features for different tasks. Too few experts may result in insufficient task feature learning, while too many can increase model complexity. The parameter learning_rate determines the update step size of the optimizer in each iteration and is a key factor in the model's convergence speed and final performance. A step size that is too large may cause oscillation and prevent convergence, while one that is too small can result in low training efficiency. The parameter batch_size is the number of samples used for each parameter update, affecting the stability and efficiency of training and indirectly influencing the model's generalization ability. The parameter weight_l1 is the weight of the L1 loss in the total loss, i.e., the coefficient λ in Eq. (6), which balances the two loss terms. The parameter patience is the number of epochs without improvement on the validation set that the early stopping strategy tolerates before halting training. It is used to prevent overfitting and enhance training efficiency.
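The role of patience can be made concrete with a small early-stopping helper. This is a sketch under assumptions: the class `EarlyStopping`, its `min_delta` argument, and the validation curve are hypothetical, and the monitored score is taken as already computed (in this study it would be the validation mean R² of the first two targets).

```python
class EarlyStopping:
    """Stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.stale_epochs = 0

    def step(self, epoch, score):
        """Record one epoch and return True when training should stop."""
        if score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Hypothetical validation R^2 curve: it improves for four epochs, then plateaus.
val_scores = [0.70, 0.80, 0.85, 0.90] + [0.89] * 10

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(val_scores):
    if stopper.step(epoch, score):
        stopped_at = epoch
        break
```

With a patience of 3, training halts three epochs after the best score; in practice the model weights from `best_epoch` would usually be restored. The larger values in Table 5 (15 to 20) tolerate a longer plateau before stopping.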

To further validate the model, the prediction results of the Attention-MMoE-DLW model for the three indicators are presented below.

For product yield, Figure 10 shows the comparison between the true and predicted values. The model's predictions closely track the true values and reproduce their overall fluctuation trend.

Figure 10.

Comparison of true and predicted values for product yield

For the drum index, Figure 11 shows the comparison between the true and predicted values. The absolute error of the Attention-MMoE-DLW model ranges from a minimum of 0.0003 to a maximum of 0.3638 and remains close to zero throughout, indicating that the prediction error fluctuates very little. This reflects the stability and accuracy of the model on this indicator.

Figure 11.

Comparison of true and predicted values for drum index

Figure 12 shows the comparison between the true and predicted values for RDI_+3.15. As seen in the figure, the model's predictions closely follow the overall trend of the true values. Despite some fluctuations, the general trend remains consistent, indicating that the model has a good fit.

Figure 12.

Comparison of true and predicted values for RDI_+3.15
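
The agreement between true and predicted values in Figures 10 to 12 can be summarized numerically. The following sketch computes R², MSE, RMSE, MAE, and MAPE, together with the minimum and maximum absolute error, for one target. The function name `regression_report` and the dictionary keys are illustrative choices, not taken from the study's code, and MAPE is expressed here in percent and assumes no true value is zero.

```python
from math import sqrt


def regression_report(y_true, y_pred):
    """Summary metrics for one target's true and predicted values."""
    n = len(y_true)
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in abs_errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - (mse * n) / ss_tot,
        "mse": mse,
        "rmse": sqrt(mse),
        "mae": sum(abs_errors) / n,
        # MAPE in percent; undefined if any true value is zero
        "mape": 100.0 * sum(e / abs(t) for e, t in zip(abs_errors, y_true)) / n,
        "min_abs_error": min(abs_errors),
        "max_abs_error": max(abs_errors),
    }
```

Calling `regression_report` once per indicator on held-out data gives a per-target table of the same quantities used to compare models.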

To better understand the factors affecting the predictions of product yield, drum index, and RDI_+3.15, we performed a feature importance analysis using SHAP values. This analysis identifies which parameters have the most significant impact on the model's predictions for each target. Table 6 summarizes the top three parameters for each of the three indicators. They were selected because they have the highest mean absolute SHAP values, indicating that they exert the greatest influence on the model's predictions.

Table 6.

Top parameters influencing product yield, drum index, and RDI_+3.15 predictions.

Target	Top Parameters (Ranked by SHAP Importance)
Product Yield	Particle size of mixed materials (3–5 mm); Al₂O₃ (%); Particle size of mixed materials (<3 mm)
Drum Index	Sintering negative pressure (kPa); SiO₂ (%); Ignition negative pressure (kPa)
RDI_+3.15	TFe (%); Moisture (%); Water addition rate (%)

The SHAP analysis improves the interpretability of the model by clearly identifying the key features driving predictions for each target.
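
The ranking criterion, mean absolute SHAP value per feature, can be sketched as follows. The sketch assumes the SHAP values have already been computed (for example with an explainer from the `shap` library) and are available as one row per sample and one column per feature. The function name and the toy numbers in the usage note are hypothetical and are not results from this study.

```python
def rank_by_mean_abs_shap(shap_values, feature_names, top_k=3):
    """Return the top_k (feature, importance) pairs by mean |SHAP|.

    shap_values: list of rows, one per sample, one column per feature.
    """
    n_samples = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

For instance, with placeholder features `["f_a", "f_b", "f_c", "f_d"]` and two toy rows `[0.1, -0.4, 0.0, 0.2]` and `[-0.3, 0.6, 0.1, -0.4]`, the function ranks `f_b` first, then `f_d`, then `f_a`. Taking the absolute value before averaging keeps positive and negative contributions from cancelling.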

Conclusion

Existing models often overlook the relationships between different tasks when handling multi-task and multi-indicator problems in the sintering process, resulting in suboptimal model performance during multi-task collaborative prediction. This paper constructs a multi-objective quality prediction model for sintered ore based on operational plant data from industrial production trials, combining sintering production theory, artificial intelligence, and machine learning algorithms. It effectively addresses the shortcomings of traditional methods in handling multi-task and multi-indicator problems, providing an effective solution for precise control of the sintering process and multi-objective optimization of industrial processes.

The specific conclusions of the study are as follows:

A multi-task prediction model is constructed using MMoE, with expert networks and a task attention mechanism introduced to predict multiple target parameters simultaneously. The MMoE layer learns specialized features, and the Task Attention layer enhances the interaction between task features. The results validate the effectiveness of the multi-task learning framework in predicting related indicators.

The MMoE model is optimized using the Optuna framework, with key parameters such as hidden_dim, num_experts, learning_rate, and batch_size being tuned. R², MSE, RMSE, MAE, and MAPE are used as evaluation metrics. The results show that the Attention-MMoE-DLW model outperforms the other four advanced models across all prediction metrics, with R² exceeding 0.96, demonstrating the algorithm's strong performance and good generalization ability in multi-objective quality prediction.

The optimized MMoE multi-task model was tested and evaluated, and the results show that it performs well in predicting all three target parameters. It achieved the best values among the compared models on all evaluation metrics, supporting the model's accuracy and reliability in multi-objective collaborative prediction tasks.

While the proposed Attention-MMoE-DLW model demonstrates strong performance on operational plant data, several limitations remain for practical deployment. Performance may degrade under extreme distribution shifts, for example substantial changes in feed composition, ambient conditions, or control policies that fall outside the training regime. Adaptability to new tasks and new target indicators may require additional data, retraining, and careful mitigation of negative transfer. The computational cost of multi-expert architectures and hyperparameter optimization can affect training efficiency and online latency in resource-constrained environments. These issues motivate future work on drift detection and periodic recalibration, parameter-efficient task expansion, and model compression and distillation to better meet plant-level requirements. Future research can also extend the application of the MMoE multi-task model in multi-objective prediction scenarios, for example by incorporating more industrial production data related to product yield, drum index, and RDI_+3.15. It could further explore how dynamic interactions between production parameters affect the target indicators, enhancing the model's adaptability to complex industrial environments.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

Supported by the Graduate Student Innovation Fund of North China University of Science and Technology (No. 2026S27).

ORCID iD

Zhenhua Yang
