Abstract
The blast furnace (BF) ironmaking system is a complex industrial system. This paper proposes a BF state causality analysis method based on convergent cross-mapping (CCM). The method describes the causal relationships between states at different locations in the BF system and can also serve as a feature selection method for prediction models. The causal characteristics obtained for the BF state covariates are validated against BF process theory and then used as input variables to an extreme gradient boosting (XGBoost) model for predicting BF state parameters. In tests with industrial data, the model achieved an accuracy of over 88% with the absolute error controlled within 2%. The results indicate that the CCM approach is well suited to state causal impact analysis and predictive model feature selection for BF systems.
Introduction
BF ironmaking is a complex industrial process characterized by many physicochemical reactions, strong coupling of parameters, complex variables and nonlinearity [1–3]. Many factors influence the BF smelting process, and these factors are strongly correlated with one another. Adjusting and judging the production status of the BF therefore relies on the professional experience of BF operators, but such human factors are difficult to reflect in a BF model. The key to realizing ‘high yield, high quality, low consumption and long life’ in BF ironmaking is to grasp the operating state of the BF in a timely manner, accurately analyze the reasons for changes in the operating state, and keep the furnace stably within the appropriate operating range. The silicon content of BF molten iron is an important index of pig iron quality and of the production condition of the BF [4]. At the same time, changes in silicon content directly reflect the stability of the smelting process [5]. The BF permeability index is one of the most critical indicators of the overall operating status of the BF [6]. Based on the permeability index, BF operators can detect and avoid malfunctions such as bridging or collapsing of the charge as early as possible, and thereby judge whether the BF is in a stable state. Gas utilization rate is an important index of the level of BF energy consumption [7,8]. It represents the degree to which the primary reducing material (carbon) is utilized in BF production, which directly affects the energy consumption per ton of iron. These three parameters represent the three main concerns of BF production: hot metal quality, running condition and utilization efficiency. Each state parameter is related to a variety of factors in the BF system. To keep the BF running stably and smoothly, it is essential to investigate the underlying causal links between the BF state variables and the other parameters.
Owing to the complexity of the BF system, the interactions between parameters and the time delays, it is still challenging to obtain correct causal relationships for the BF condition parameters from simple data analysis. At present, the adjustment and maintenance of BF conditions still rely mainly on manual experience.
Currently, there are two main directions for analyzing the relationship between BF condition and influencing parameters: model-based and data-driven approaches. Compared with model-based approaches, data-driven approaches are more suitable for complex systems such as BF systems because they do not rely on expert experience or process information and are not limited by process principles. Data-driven approaches are further divided into correlation methods and causality methods. Correlation methods generally use Pearson correlation coefficients to determine the relationship between different covariates and are commonly applied to select parameters in the feature engineering of traditional data prediction models. However, correlation methods have the following disadvantages compared with causality methods: 1. correlation is symmetrical, whereas causality is directional; 2. causality is ordered in time, whereas correlation need not be. The consequences are as follows: 1. the parameters obtained by correlation analysis have poor interpretability; 2. the choice of selection method has a significant influence on the accuracy of prediction models.
Identifying causal relationships between variables is one of the most challenging goals of scientific research. Inferring causal relationships between variables from actual data has become a valuable and economically important technique in many research areas [9–11]. Granger causality is widely used in economics, meteorological science, and neuroscience to measure the interactions between time series. Another standard method is the cross-correlation function, which determines the causal relationship between variables through the similarity and temporal characteristics of two time series [12]. However, these methods require linear systems and separable parameters, which precludes their application to BF systems. Sugihara presented the first data-driven CCM-based method, which provides a new approach to revealing nonlinear causal relationships between weakly coupled variables [11]. The method exploits the diffeomorphism between the manifolds reconstructed from variables in nonlinear coupled systems. It transforms the original causal identification problem into a comparison of the mutual predictive performance of embedded manifolds. The CCM has been successfully applied in many research areas such as ecology, biology, and geosciences [13,14]. The CCM is therefore applicable to the nonlinearly coupled BF system, but no causal analysis of the BF state system has been reported so far.
In this paper, to provide a basis for root cause tracking of BF state parameters, a modified causality method based on the CCM is proposed to accurately determine the causality between BF state parameters and operation parameters. In addition, the XGBoost algorithm is used to verify the advantage of the causal analysis results in selecting input parameters for predicting the BF state parameters.
The main contributions of this paper are summarized as follows: 1. for the BF system, a causal analysis method based on the CCM is used to determine the root causes of the BF state parameters and the direction of the information flow between them; 2. the tuning parameters of the CCM method (embedding dimension, lag interval) are determined by explicit procedures; 3. the effect of this method on input parameter selection for the prediction model is verified by comparison with the traditional parameter selection method.
The paper is organized as follows. Section 2 describes the source detection method based on the CCM, the process for selecting the optimal embedding dimension and lag interval, and the principles of the XGBoost algorithm. Section 3 provides an overview of the BF data and the data preparation for the three enterprises. Section 4 presents the determination of the embedding dimension and lag interval and the causal analysis of all BF states. Section 5 discusses the impact of the input parameters chosen by the different methods on the prediction results of the XGBoost model. A summary is given in Section 6.
Related works
The basic idea of the CCM is that even a chaotic deterministic system remains predictable to some degree in the short run. Based on diffeomorphism theory, the method uses the interaction characteristics of coupled variables to determine the strength and direction of the interaction between variables in a nonlinear system [15]. The CCM can be applied to non-separable dynamical systems with weak to moderate coupling strength, which compensates for the fact that Granger analysis cannot be used for non-separable systems. Since it was proposed, the method has been widely used in ecology and neuroscience. The BF production system is a parametrically coupled, non-separable dynamic system, which falls within the scope of application of the CCM method. However, the BF system is an industrial system with nonlinearity, high dimensionality, continuity, and hysteresis, which differs significantly from the systems to which the method has traditionally been applied.
Source model of the CCM
For systems that are deterministic rather than completely random (e.g. BF production systems), the dynamics are governed by an underlying manifold M (representing coherent trajectories rather than random chaotic states). In dynamical systems theory, time series variables from the same dynamical system (e.g. BF operating covariates and state covariates) share the same attractor manifold M, so the states of each other can be estimated [16,17].
Suppose there are d-dimensional (d ≤ N) time-varying manifolds
According to Takens' embedding theorem, if there is a causal relationship between two variables of a dynamical system, then the shadow manifolds Mx and My of these two variables are diffeomorphic to the manifold M of the original system. The phase space is reconstructed using the time-delay coordinate method. Let the dimension of the reconstructed manifold be E and the sampling interval be τ (1 by default). The lag vector of the reconstructed manifold at time t is as follows
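Under the standard time-delay construction, which is assumed here, with x(t) and y(t) denoting the two observed series (symbols introduced for illustration), the two lag vectors would take the form:

```latex
\begin{aligned}
X(t) &= \bigl[\,x(t),\; x(t-\tau),\; \dots,\; x\bigl(t-(E-1)\tau\bigr)\bigr],\\
Y(t) &= \bigl[\,y(t),\; y(t-\tau),\; \dots,\; y\bigl(t-(E-1)\tau\bigr)\bigr].
\end{aligned}
```

Each vector collects E samples spaced τ apart, so a series of length N yields N − (E − 1)τ points on the corresponding shadow manifold.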
Topologically equivalent reconstructed manifolds are obtained from the above two equations. The set of X(t) is the shadow manifold
First, determine the lagging coordinate vector Y(t) on
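The cross-mapping step can be illustrated with a minimal Python sketch. It assumes the commonly used estimator with E + 1 nearest neighbours and exponential distance weights, which may differ in detail from the modified method of this paper; the function names `delay_embed`, `cross_map_skill` and `ccm_convergence` are hypothetical.

```python
import numpy as np


def delay_embed(series, E, tau=1):
    """Return lag vectors [s(t), s(t - tau), ..., s(t - (E - 1) * tau)] as rows."""
    s = np.asarray(series, dtype=float)
    start = (E - 1) * tau
    return np.column_stack([s[start - j * tau: len(s) - j * tau] for j in range(E)])


def cross_map_skill(source, target, E=2, tau=1):
    """Correlation between `target` and its estimate from the shadow manifold of `source`.

    In CCM, a high skill that converges with data length is read as evidence
    that `target` causally influences `source`.
    """
    manifold = delay_embed(source, E, tau)
    y = np.asarray(target, dtype=float)[(E - 1) * tau:]
    # Pairwise Euclidean distances between lag vectors; a point is not its own neighbour.
    dist = np.linalg.norm(manifold[:, None, :] - manifold[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    # Indices and distances of the E + 1 nearest neighbours of each lag vector.
    nn = np.argsort(dist, axis=1)[:, :E + 1]
    d = np.take_along_axis(dist, nn, axis=1)
    # Exponential weights scaled by the distance to the closest neighbour (assumed form).
    w = np.exp(-d / np.maximum(d[:, :1], 1e-12))
    w /= w.sum(axis=1, keepdims=True)
    estimate = np.sum(w * y[nn], axis=1)
    return float(np.corrcoef(y, estimate)[0, 1])


def ccm_convergence(source, target, lengths, E=2, tau=1):
    """Cross-map skill computed on the first L samples, for each L in `lengths`."""
    return [cross_map_skill(source[:L], target[:L], E, tau) for L in lengths]
```

To test whether X drives Y, one would call `ccm_convergence(y, x, lengths)` and check that the skill rises and levels off as L grows. The pairwise distance matrix needs memory proportional to the square of the series length, so long records would need a neighbour search structure instead.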
Selection of the optimal embedding dimension
The choice of the embedding dimension E determines how well the reconstructed attractor characterizes the original behaviour of the system. If the embedding dimension is too small, the nearest neighbours of a point in the reconstructed state space may not be true nearest neighbours in the actual state space. If it is too large, the amount of unnecessary computation increases significantly. Sugihara used the false nearest neighbour (FNN) method to determine the optimal embedding dimension E of the manifold, and later studies made partial improvements to the FNN method to suit the CCM to specific application scenarios [11]. Other studies have used the Akaike information criterion (AIC) to determine the embedding dimension. BF production data are sampled at high frequency and the sample size is considerable. Because the Bayesian information criterion (BIC) takes the sample size into account, it can effectively prevent the excessive model complexity that results from pursuing high model accuracy. Therefore, the BIC is more suitable than the AIC for embedding dimension selection with BF data, and the BIC is used in this paper to determine the value of E.
With the model complexity denoted by k, the loss function by B and the number of samples by n, the BIC is defined as shown in Formula (11).
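As an illustration of BIC-based selection, the sketch below fits autoregressive models of increasing order by least squares and scores each with the Gaussian-error form of the BIC, n·ln(RSS/n) + k·ln(n). This form, the use of the AR order as a stand-in for the embedding dimension, and the names `ar_bic` and `select_order_by_bic` are assumptions made for the example and are not taken from Formula (11).

```python
import numpy as np


def ar_bic(series, order, max_order):
    """BIC of a least-squares AR(order) fit, scored on a sample shared by all orders."""
    x = np.asarray(series, dtype=float)
    n = len(x) - max_order                      # common number of fitted samples
    target = x[max_order:]
    # Intercept column plus one column per lag 1..order.
    columns = [np.ones(n)] + [x[max_order - j: len(x) - j] for j in range(1, order + 1)]
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    rss = float(np.sum((target - design @ coef) ** 2))
    k = order + 1                               # model complexity: lags + intercept
    return n * np.log(rss / n) + k * np.log(n)


def select_order_by_bic(series, max_order=10):
    """Return the order with the smallest BIC and the BIC value of every order tried."""
    bics = {p: ar_bic(series, p, max_order) for p in range(1, max_order + 1)}
    return min(bics, key=bics.get), bics
```

Scoring every order on the same `n` samples keeps the BIC values comparable. The penalty term k·ln(n) grows with the sample size, which is why the BIC tends to favour smaller models than the AIC on large data sets.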
Selection of lag time
In the BF production system, there are specific time lags between the operation parameters and the status parameters. First, the status of the BF itself has a lag effect: the current status of the system has a significant causal impact on the subsequent status. Second, there is a time interval between the upstream and downstream variables of the BF system, which is called the mechanism time lag. Finally, for the same batch of incoming raw material, there is a technical time lag due to differences in the measurement times of the different variables. For example, when different variables are measured for the same batch of material, the four variables (burden structure, depth of trial rod, blast volume, and temperature of hot metal) are measured at t0, t1, t2, and t3, respectively, and the intervals between these four moments constitute the technical time lag. Time lags make the influence of variables within a BF system vary significantly with the time interval. They exist both within the state parameters themselves and between the operating parameters or other state parameters and the target state parameters. To handle time lags in the CCM, Ye proposed adding a time lag factor λ to the CCM algorithm, so that Y(t) at moment t gives a better prediction of X(t-λ) at moment t-λ [14]. During the reconstruction of manifold
The autocorrelation function (ACF) measures the correlation between observations separated by λ unit time intervals (y_t and y_{t+λ}) in a time series and is given by Equation (14).
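A minimal sketch of the sample ACF follows, assuming the usual estimator normalized by the lag-0 sum of squares. The helper `first_lag_below` encodes one possible reading of threshold-based lag selection (the first lag at which |ACF| drops below the threshold); both function names are hypothetical.

```python
import numpy as np


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    y = np.asarray(series, dtype=float)
    dev = y - y.mean()
    denom = np.sum(dev ** 2)
    return np.array([
        np.sum(dev[: len(y) - lag] * dev[lag:]) / denom
        for lag in range(max_lag + 1)
    ])


def first_lag_below(acf_values, threshold=0.7):
    """First lag whose absolute ACF is below `threshold`, or None if there is none."""
    below = np.flatnonzero(np.abs(acf_values) < threshold)
    return int(below[0]) if below.size else None
```

With hourly data, the returned lag index can be read directly as a number of hours.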
The mutual information function is a measure of the interdependence between variables [18]. The mutual information method takes the first local minimum of the mutual information function as the optimal time lag. An advantage of the method is that the lag time can be selected independently of the embedding dimension. In the phase space reconstruction of a time series, the strength of the lagged dependence is obtained by calculating the mutual information between the sampled series X(i) and the delayed series Y(i+λ). For the variables X(i) and Y(i+λ), the mutual information is calculated as shown in Expression (15).
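The lag selection by mutual information can be sketched as follows. The example assumes a simple histogram estimate of the joint distribution with an arbitrary bin count of 16; the estimator actually used for Expression (15) is not specified here, and the function names are hypothetical.

```python
import numpy as np


def mutual_information(x, y, bins=16):
    """Histogram estimate of the mutual information (in nats) between x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)     # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)     # marginal of y, shape (1, bins)
    mask = pxy > 0                          # empty cells contribute nothing
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))


def lagged_mi(x, y, max_lag, bins=16):
    """Mutual information between x(i) and y(i + lag) for lag = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array([
        mutual_information(x[: len(x) - lag], y[lag:], bins)
        for lag in range(max_lag + 1)
    ])


def first_local_minimum(values):
    """Index of the first interior local minimum, or None if the curve has none."""
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] <= values[i + 1]:
            return i
    return None
```

Calling `first_local_minimum(lagged_mi(x, y, max_lag))` returns a candidate lag. The histogram estimate is biased upward for small samples, so the bin count should be kept modest relative to the data length.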
XGboost prediction model
Owing to the complexity, time delay and nonlinearity of the BF system, predicting the state parameters is a complex nonlinear regression problem. For such problems, XGBoost, an improved gradient boosting decision tree (GBDT) algorithm, has performed well in many engineering and process applications [19]. Therefore, the XGBoost model is used in this paper to predict the BF state parameters. The principle and process are as follows. The input parameters X and the state parameter Y of the sample set are used as the input and output values of XGBoost, respectively. An additive model consisting of M decision trees is built as follows
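In the usual XGBoost formulation, which is assumed here, the additive model and its regularized training objective can be written as below. The symbols f_m, the tree space, the per-sample loss l and the tree complexity penalty Ω are introduced for illustration.

```latex
\hat{y}_i = \sum_{m=1}^{M} f_m(x_i), \qquad f_m \in \mathcal{F},
\qquad
\mathrm{Obj} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{m=1}^{M} \Omega(f_m)
```

Here x_i and y_i are the input vector and the state parameter of the i-th sample, and each f_m is one regression tree. The trees are added one at a time, each fitted to reduce the objective given the trees already in the model.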
The prediction model was implemented with the XGBoost package in Python, and the model was evaluated using a combination of accuracy, MAE, MSE, RMSE, and R².
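The evaluation metrics can be computed as in the sketch below. The accuracy definition is an assumption made for the example: the share of predictions whose absolute error is within a given fraction of the true value. The function name `regression_report` and the `tolerance` parameter are hypothetical.

```python
import numpy as np


def regression_report(y_true, y_pred, tolerance=0.02):
    """MAE, MSE, RMSE, R^2 and the hit rate within a relative error tolerance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "mae": float(np.mean(np.abs(err))),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": 1.0 - float(np.sum(err ** 2)) / ss_tot,
        # Assumed definition: fraction of samples with |error| <= tolerance * |true value|.
        "accuracy": float(np.mean(np.abs(err) <= tolerance * np.abs(y_true))),
    }
```

A wider tolerance can be passed for noisier targets, for example `regression_report(y_true, y_pred, tolerance=0.15)`.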
Data state and feature engineering for BF systems
To verify the applicability of the CCM causal analysis method in BF ironmaking production and to investigate the effect of the analysis results on the prediction model, the production data of three BFs from three different steel companies are collected for testing in this paper.
Data status
Details of blast furnace production data.
Differentiated data for the three blast furnaces.
Feature engineering
The BF production process is complex, and the BF system has many parameters covering many physical and chemical reactions, so the raw data must be processed before causal analysis. Various production conditions occur during BF service, such as normal production, blowing out for maintenance and abnormal furnace conditions. Abnormal production conditions are outside the scope of this paper, which applies causal analysis only to the production status of the BF under normal production. Based on the principles of the ironmaking process, four parameters (blast volume, blast pressure, oxygen enrichment, and coal injection) are used to identify and reject the time intervals of abnormal production states.
Methodology for processing raw data.
Results analysis of causality
Determination of the optimal embedding dimension
To select the embedding dimension for the CCM method, this section uses the statsmodels library in Python to calculate the embedding dimension E according to the BIC and analyses the effect of the embedding dimension on the causal results between the state parameters and the other parameters. The blast volume (X(i)) and gas utilization rate (Y(i)) of the B# BF are chosen as an example. The autoregressive calculation on X(i) yields the trend of BIC(n) shown in Figure 1. As the embedding dimension increases, the BIC value first decreases and then stabilizes. The first small value appears at k = 2, indicating that an embedding dimension of 2 captures the information in the original data adequately.
Figure 1. The trend in BIC results.
The effect of different embedding dimensions on the resulting causal relationship is shown in Figure 2, where the curves show how the causal impact results (CCM values) in the two directions between X(i) and Y(i) change with increasing data length L. When the embedding dimension E is 1, the CCM value for causality from X(i) to Y(i) converges to 0.72, because the reconstructed manifold does not cover enough of the original information. When E is 2, the CCM value for causality from X(i) to Y(i) converges to 0.85 and stabilizes as L increases. When E is 3, the CCM value from X(i) to Y(i) converges to 0.87, which is similar to the result for E = 2. Therefore, the best embedding dimension is 2. Meanwhile, the CCM values from Y(i) to X(i) are less than 0.2 for all three embedding dimensions, so causality in the direction from Y(i) to X(i) is not supported, which confirms the directionality of the causal relationship. This check of the BIC results shows that the BIC method can be used to obtain the optimal embedding dimension of the CCM method in the BF system.
Figure 2. Results of causality detection with different embedding dimensions.
Selection of lag time
Autocorrelation analysis of state variables
The BF production status is autocorrelated: earlier values of a status variable have a significant effect on later values. The correlation of the BF production status parameters at different time intervals is measured by the absolute value of the autocorrelation coefficient. In this paper, [Si], the permeability index, and the gas utilization rate of the C# BF were selected for autocorrelation analysis, and the results are shown in Figure 3. The initial ACF value of each variable is 1. As the time interval increases, the ACF of each variable decreases at a markedly different rate; with a threshold value of 0.7, the lag times of the three parameters are three hours, one hour, and one hour, respectively. The difference in lag time among the three is due to the characteristics of BF production. During BF production, the permeability index and gas utilization rate vary over a wide range and at high frequency, so their self-influence weakens rapidly once the time interval exceeds 2 h. As an indicator of the quality of the iron produced in the BF, [Si] varies more slowly than the other two parameters, and its value is measured manually, resulting in a longer autocorrelation lag interval of 3 h. The results of the state variable autocorrelation analysis are consistent with practical production experience.
Figure 3. State parameter autocorrelation results.
Time lag analysis of other parameters and state parameters
The time lag between the other parameters and the state parameters is analyzed, and the time lag interval td of the CCM method is determined, using the mutual information method described above. The basic idea behind the choice of td is to make the original sample sequence and its delayed sequence largely independent but not completely unrelated, so that the data points x(i) and x(i-td) can be treated as independent coordinates in the phase space reconstruction. If td is too short, neighbouring components contain redundant information and the computational effort increases dramatically; if td is too large, neighbouring coordinates become completely independent, which prevents valid information from being obtained for the phase space reconstruction. In this paper, the selection of the lag interval is demonstrated using the raw gas pressure X(i) and the gas utilization rate Y(i) of the B# BF. The mutual information between Y(t) at moment t and X(t-td) at moment t-td is shown in Figure 4. As shown in Figure 4, the first local minima of the two mutual information curves occur at td = 1, so the lag period can be provisionally set to 1 h.
Figure 4. Mutual information coefficients between parameters.
With the optimal embedding dimension of 2 determined by the BIC above, the causal relationship between X(i) and Y(i) was analyzed at different lag intervals td. The results are shown in Figure 5. Because the causal effect of Y(i) on X(i) is not significant, different values of td have little influence on it. For the causal effect of X(i) on Y(i), when no lag time is added, the CCM value converges to 0.5, which does not indicate a strong causal relationship. As the lag time td increases, the CCM value first increases and then decreases. The CCM value is largest at td = 1, which is consistent with the mutual information result. The influence time of raw gas pressure on gas utilization obtained by the mutual information method also conforms to the actual state of the BF system, and an excessive lag interval weakens the degree of causal correlation.
Figure 5. Results of causality detection with different lag intervals.
Causal analysis of state parameters
Based on the actual BF ironmaking production data collected from different steel companies, the embedding dimensions and lag times were selected in the manner described above. The causal relationships between the three state parameters of each BF and other parameters are analyzed to determine the final causality results for each state parameter.
If a pair of time series has a clear unidirectional causality, then as the length L of the time series increases, the causality strength from y to x keeps increasing and finally converges to a stable value. A larger L means that the time series contains more information and that the causality obtained by the convergent cross-mapping method is more apparent. This convergence property is also a good indication that the convergent cross-mapping method can reveal the causal relationship directly. The change in the CCM values between some of the state parameters and the selected parameters with increasing data length L is shown in Figure 6. As can be seen from the graphs: 1. the final test lengths differ because the lengths of the collected data differ; 2. there is an apparent causal influence between the three state parameters and the selected parameters. The different parameters in BF production all affect the state parameters of their respective systems differently, and the causal results of these three pairs of parameters are in line with the actual situation of the BF system.
Figure 6. Results of causality detection with different parameters in different companies.
Results of the causal influence of state parameters for A#BF.
Results of the causal influence of state parameters for B#BF.
Results of the causal influence of state parameters for C#BF.
Owing to the process mechanism of the BF system, different state parameters have different influence parameters. The gas utilization rate is used here as an example for the process mechanism analysis. In BF operation, site operators who need to adjust the gas utilization rate work in two directions: upper adjustment and lower adjustment. Upper adjustment mainly refers to adjustment of the burden distribution system. Within a certain range, the hourly burden batch reflects changes in the burden distribution system, and adjustments to the batch change how strongly the burden is concentrated at the edges and the centre of the material surface. As the ore batch weight increases, gas utilization can be improved. Changing the proportions of the furnace burden changes the permeability of the upper charge of the BF and thus affects the gas utilization. Adjusting the ring pattern makes the charge distribution reasonable and is the primary means of controlling the gas utilization. Owing to the complexity of the ring pattern, this paper uses the ore to coke ratio at different angles to reflect changes in the ring pattern; it is worth noting that the C# BF uses a bell-type charging device, which does not involve the ring pattern. Changes in top gas pressure alter the residence time of the gas in the BF and affect gas utilization. Lower adjustment covers blast velocity, blast kinetic energy, and the tuyeres, which affect gas utilization by changing the initial gas flow distribution. In general, the blast kinetic energy reflects the state of the central gas flow and affects the gas utilization rate. The daily adjustment parameters of the BF mainly belong to the blast system, including blast volume, blast temperature, blast pressure, humidity, oxygen enrichment rate and coal injection. Any change in the blast system affects the conditions at the tuyeres and hence the gas utilization rate.
Because the databases of the BFs differ, the influence parameters of the permeability index obtained in this paper differ between furnaces. However, analysis of the influence parameters shared by the three BFs shows that the upper adjustment means (burden batch, raw material ratios, ore to coke ratio, top gas pressure) and the lower adjustment means (blast volume, oxygen enrichment rate, pressure difference, blast velocity, blast kinetic energy, coal injection) are all validated in the causal analysis of the gas utilization rate of each BF. This shows that the causal analysis method is suitable for the actual situation of the BF system. The influence parameters of [Si] include incoming furnace data, operating data, outgoing iron data, BF monitoring data, and molten iron quality data. The influence parameters obtained for the permeability index are also validated by BF operating experience.
For each single state parameter, the raw data collected differ among the BFs of the three companies. Nevertheless, the results show that, taking the gas utilization rate as an example, 11 of the listed influencing parameters appear in two or all of the BFs. Parameters that were obtained for only one BF could not be calculated and analyzed for the other BFs because the corresponding data were not collected, but this does not mean that these parameters have no impact on the BFs for which they are missing. For example, the coal injection rate of A#BF and the bosh gas volume index of B#BF are strongly correlated in the production process, and the fact that bosh gas index data are not recorded for A#BF does not mean that the real bosh gas index of A#BF has no strong causal relationship with the gas utilization rate.
Comparing the three state parameters, and allowing for the differences in databases and production processes, the CCM results for [Si] are clearly smaller overall than those for the other two parameters, indicating that the causal relationships between [Si] and its influencing factors are not very strong. Counting the parameters with CCM values below 0.5, the proportion is 51.9% for [Si], while it is 0 for the other two parameters. On the one hand, [Si] is inspection data and is subject to more human interference; on the other hand, the relationship between [Si] and other factors is relatively complex, and the CCM value decreases as the propagation path becomes longer.
Prediction of BF state parameters
Applying the causality analysis method to BF production not only yields relatively accurate causal influence relationships, but also, through the lag time selection in its construction, determines how long the influence parameters act on the state parameters. The lag time is used to determine the prediction horizon. This paper defines a threshold value of 0.4 for the CCM, takes the influence variables with CCM values above 0.4 as input parameters to the XGBoost prediction model, and makes predictions for each of the three state parameters.
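The threshold-based selection of input variables can be sketched as follows. The function name `select_features`, the optional `top_k` cut-off and the scores in the usage lines are hypothetical and purely illustrative; they are not results from this study.

```python
def select_features(ccm_scores, threshold=0.4, top_k=None):
    """Names of variables whose CCM value exceeds `threshold`, strongest first."""
    ranked = sorted(ccm_scores.items(), key=lambda item: item[1], reverse=True)
    kept = [name for name, score in ranked if score > threshold]
    return kept if top_k is None else kept[:top_k]


# Illustrative scores only (not measured values).
example_scores = {"blast_volume": 0.85, "top_gas_pressure": 0.62, "humidity": 0.31}
selected = select_features(example_scores)
```

The selected names would then be used to pick the input columns of the prediction data set.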
Comparison between causal analysis and correlation analysis
Results of Pearson correlation analysis of the state parameters of B#BF.
Prediction of state parameters
To verify the impact of causal analysis as a parameter selection method on the model predictions, the three state covariates of the B# BF were used as target values. For each state covariate, the top 18 causal influence results and the top 18 Pearson correlation analysis results were used as input variables to build a prediction data set. The data set consists of actual BF production data selected between July 2019 and August 2020. The first 95% of the time range of the data set was divided into training and validation sets in a 9:1 ratio, and the last 5% was used as the test set to validate and analyze the prediction results. Figure 7 shows the prediction results for the three state parameters of the B# BF. As can be seen from Figure 7, the causal analysis results have a clear advantage over the Pearson analysis method as a feature selection method for the XGBoost prediction model and allow the trend of each state variable to be predicted accurately. Therefore, the causal analysis method is more promising as a feature selection tool and is better adapted to the actual conditions of BF production.
Figure 7. Prediction results for different state parameters of B# BF.
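The data partition described above can be sketched as follows, assuming that the 9:1 division of the first 95% is also made in time order (a random division would be an equally plausible reading). The function name `chronological_split` is hypothetical.

```python
import numpy as np


def chronological_split(n_samples, test_frac=0.05, val_ratio=0.1):
    """Index arrays for train, validation and test sets, in time order."""
    n_test = int(round(n_samples * test_frac))   # last 5% held out for testing
    n_dev = n_samples - n_test                   # first 95% for model development
    n_val = int(round(n_dev * val_ratio))        # 9:1 train/validation division
    n_train = n_dev - n_val
    idx = np.arange(n_samples)
    return idx[:n_train], idx[n_train:n_dev], idx[n_dev:]
```

The three index arrays can be used to slice the feature matrix and target vector before fitting a regressor such as `xgboost.XGBRegressor`. Keeping the test block at the end of the record avoids evaluating the model on data that precede its training period.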
Evaluation of prediction results for all state parameters.
Note: PI = Permeability Index; GUR = Gas utilization rate.
Comparison of the prediction results for the state parameters of the three BFs shows that the XGBoost model can accurately predict the changing trend of the state variables. The goodness of fit R² is greater than 0.7, and the MAE, MSE, and RMSE all remain at satisfactory levels. In the accuracy analysis of the three state covariates, the permitted range of variation for [Si] is 5–15%, owing to the quality of the [Si] data and the wide tolerance applied to [Si] in actual production. With the deviation of the covariates kept within 2% (15% for [Si]), the accuracy of the model exceeds 85% in all cases. By combining the input parameters obtained from the causal analysis method with the XGBoost model, it is possible to make a preliminary prediction of the changing trend of the production status in different parts of the BF, which provides a basis for further research on BF production status adjustment strategies and offers guidance to on-site operators.
Conclusions
In this paper, a new causal analysis model is proposed to determine the causal relationships between different production state parameters of a blast furnace. The CCM method and the XGBoost model are combined to establish a causal analysis and prediction model for the BF production state. In tests with industrial data, the model predicts the blast furnace production state with the absolute error within 2%, which enables a preliminary prediction of the changing trend of the BF production state based on process data. The CCM method is used to carry out causal analysis of the three state parameters of BF production, and the causal relationships between the state parameters and the BF operating parameters are obtained. These relationships agree well with production process experience, which shows that the causal analysis method is applicable to the analysis of BF production state data. In determining the parameters of the CCM, the best embedding dimension is selected using the Bayesian information criterion, and the results meet the expected requirements. In determining the lags of different covariates, the autocorrelation method is used to analyze the correlation of each state parameter with itself and the mutual information method is used to analyze the dependence between pairs of variables; with the resulting lag intervals, the causal influence relationships become more apparent. With the causal analysis results as input parameters, the XGBoost model produces accurate predictions, and the prediction accuracy and precision are significantly improved compared with the traditional Pearson correlation coefficient selection method.
Acknowledgements
The authors gratefully acknowledge financial support from the Basic Research Program of the National Natural Science Foundation of China (52004096), the China NSF project (E2019209314), and the Hebei Provincial Higher Education Fundamental Research Projects (JQN2020032).
Disclosure statement
No potential conflict of interest was reported by the author(s).
