Modeling with Multiple Correlated Spectral Data Based on Approximating the Nonlinear Spectrum Induced by Scattering

Abstract

In the spectral quantitative analysis of scattering solutions, accuracy is seriously limited by the nonlinearity caused by scattering, and the measurement may even fail. The main reasons are that the modeling variables are strongly affected by nonlinearity and that the information contained in the modeling data cannot represent the scattering characteristics. This paper proposes a method in which the spectral data of several equally spaced optical pathlengths are combined as the modeling data set of a sample. These highly correlated spectral data contain abundant nonlinear information. The additional spectral data provide more options for selecting principal components when modeling with the partial least squares (PLS) method. By giving lower weight to wavelengths that are strongly affected by scattering, the model becomes insensitive to scattering and the prediction accuracy is improved. In a spectral quantitative analysis experiment on a strongly scattering material, the prediction accuracy of the model was 61.7% higher than that of the traditional method and 58.5% higher than that of the variable sorting for normalization method, which verifies the feasibility of the method.

Keywords

Scattering, nonlinearity, pathlength, modeling, spectral quantitative analysis

Introduction

Spectral analysis technology is non-polluting, nondestructive, and rapid, and it is widely used for the quantitative analysis of multicomponent solutions in medicine,^1,2 agriculture,^3,4 and the petrochemical industry.^5,6 These solutions contain scattering substances to a greater or lesser extent. Because of scattering, the relationship between component content and spectrum is nonlinear,^7,8 and this nonlinearity seriously degrades the prediction ability of the model. Current research on suppressing the effect of scattering focuses on two approaches: estimating and analyzing the relationship between the scattering coefficient and the absorption coefficient on the basis of statistical laws, and using mathematical algorithms to reduce the nonlinear effect.

The absorption and scattering coefficients of standard particles are usually used to estimate those of the corresponding particles in solution.^9,10 The measurement result of a sensor can also be reduced to the scattering coefficient.¹¹ These methods require additional equipment, whose errors are introduced into the result. In addition, when the absorption and scattering coefficients of the components are known, the Monte Carlo algorithm can be used to simulate the nonlinearity caused by scattering in the solution as an approximation of the actual nonlinearity.¹² The algorithm has achieved very good simulation results. In practical applications, however, the result is not ideal when not all of the components are known. It is more convenient to use a mathematical algorithm to reduce the influence of scattering.

Preprocessing methods are commonly used to smooth the nonlinearity of spectral lines or to linearize it. Savitzky–Golay convolution smoothing is used to eliminate spectral fluctuations caused by nonlinearity.^13,14 The second derivative removes both the baseline and the linear trend,¹⁵ but it may increase noise.¹⁶ Multivariate scattering correction (MSC) can be used to eliminate the problems caused by differences in solid particle size, surface scattering, and spectral peak overlap.¹⁷ This method attempts to establish a linear relationship between each spectral line and the “ideal” spectrum, and the ideal spectrum is approximately represented by the average spectrum of the verification set. Extended MSC (EMSC) is more detailed than MSC.¹⁸ EMSC can effectively correct the nonlinearity caused by differences in sample thickness and can separate and quantify different types of chemical and physical sources that affect spectral changes.¹⁹ Standard normal variate (SNV) resembles MSC in that it amounts to a simple rotation and offset correction.²⁰ SNV corrects each row of a spectrum separately by subtracting the row mean from all elements of the row and then normalizing in the row direction.²¹ To integrate the advantages of the individual preprocessing methods, many approaches that integrate multiple preprocessing methods have been proposed.^22,23 These methods are usually integrated according to one of two strategies: combination or fusion.
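As a minimal sketch of the two corrections described above, the following Python functions implement SNV and MSC with the standard library only. The function names `snv` and `msc`, the use of the sample standard deviation, and the choice of the mean of the supplied spectra as the MSC reference are assumptions made for illustration; they are not details taken from the cited works.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre one spectrum (one row) on its own
    mean and scale it by its own standard deviation."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1): an assumption
    return [(x - m) / s for x in spectrum]


def msc(spectra):
    """MSC: fit each spectrum to a reference spectrum by least squares
    (row = offset + slope * ref), then remove the fitted offset and slope."""
    ref = [mean(column) for column in zip(*spectra)]  # assumed reference
    ref_mean = mean(ref)
    sxx = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for row in spectra:
        row_mean = mean(row)
        sxy = sum((r - ref_mean) * (x - row_mean) for r, x in zip(ref, row))
        slope = sxy / sxx
        offset = row_mean - slope * ref_mean
        corrected.append([(x - offset) / slope for x in row])
    return corrected


# Hypothetical spectra that differ only by an offset and a gain.
base = [0.10, 0.35, 0.80, 0.55, 0.20]
spectra = [base, [2.0 * x + 1.0 for x in base], [0.5 * x + 0.3 for x in base]]
msc_out = msc(spectra)
snv_out = [snv(row) for row in spectra]
```

For spectra that differ only by an additive offset and a positive multiplicative gain, both corrections return identical rows, which is the linear effect these methods are designed to remove.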

In the combination strategy, the preprocessing methods are run in turn and are combined on the basis of their results, for example by trial and error^22–25 or full factorial design.²⁶ In the fusion strategy, the preprocessing methods are integrated directly rather than by optimizing each preprocessing result.^27,28 Lemos and Kalivas²⁸ proposed a flexible fusion method that solves the problem of selecting processing methods and spectral regions and combines all assessment values with the sum of ranking differences (SRD). To suppress the effect of scattering, MSC has been combined with other preprocessing methods.^29,30

Variable sorting for normalization (VSN) is a signal normalization method that is applied before other standard normalization methods such as SNV and MSC.³¹ The algorithm automatically produces a weighting function that favors signal variables, and this weighting function significantly improves signal shape and model interpretation. After the “abnormal” information has been removed by preprocessing, the VSN method enhances the shape effect, which is generally believed to partially represent the scattering characteristics.²³ A combination of several preprocessing methods removes “useless information” more strongly than a single method. However, there is no clear definition of “useless” information, and characteristic information is greatly reduced after repeated preprocessing. Rich information is the prerequisite for improving the prediction ability of the model. With large sample sets, the amount of information is large, and nonlinear modeling methods can exploit as much of it as possible to further reduce the influence of nonlinearity.

Common nonlinear modeling methods include the artificial neural network (ANN) and the convolutional neural network (CNN). The ANN is used to calibrate the nonlinear spectrum.³² With the CNN,^33,34 all of the collected original spectral information can be used directly as input data. Through sparse local connections and weight sharing, local and abstract features are learned from the original spectral data, and a high-quality nonlinear relation model is established. These nonlinear modeling methods are highly adaptive and can quickly process, analyze, and compare large data sets. However, when the sample size is small, the algorithm cannot accurately extract the relationship between the hidden components and the spectrum, which leads to poor model stability. The stability of the model can be improved by increasing the effective information available to it. The partial least squares (PLS) algorithm is a common modeling method in near-infrared (NIR) spectral analysis and is typically used for small-sample regression modeling.³⁵

The PLS method was developed from principal component regression and combines principal component analysis with cross-validation. The prediction accuracy of PLS has been found to be high.²⁰ The method not only resolves the multicollinearity among the variables but also guarantees the maximum correlation between the independent and dependent variables.³⁶ From the point of view of prediction, however, a weak correlation between the variables of the prediction set and those of the calibration set may change the prediction results greatly,³⁷ so that high modeling accuracy is accompanied by low prediction accuracy. The prediction ability of the model can be improved by increasing the effective information.

Abundant nonlinear spectral information can be obtained by measuring the transmitted light at several equally spaced pathlengths. The smaller the distance between two adjacent pathlengths, the more accurately the information depicts the nonlinearity. The larger the range of the pathlengths, the closer the approximation to the actual nonlinear curve. In this paper, a method is proposed that describes the nonlinear spectra of a scattering solution using the spectral data of multiple pathlengths, and the joint spectral data are used as the modeling data set. The effectiveness of the method was verified by quantitative analysis of intralipid solutions.

The Basis of the Method

The relationship between pathlength and absorbance is nonlinear because of the scattering substances, so the absorbance has a nonzero first derivative with respect to the pathlength at l_k

A' (c_{i}, l_{k}) = f (c_{i}, l_{k}) \neq 0

(1)

where c_i is the concentration of the ith sample and $A (c_{i}, l_{k})$ is the absorbance of the ith sample at the pathlength l_k.

The distance between the two adjacent pathlengths is the same ( $g = l_{k + 1} - l_{k}$ ). According to Taylor's mean value theorem, the polynomial of absorbance for the pathlength l_k is

(2)

According to Eq. 2, taking the pathlength as a variable, the third-order Taylor polynomials are made at the absorbance for l₁, l₂, and l₃, respectively, and are as follows

\begin{matrix} A' (c_{i}, l_{1}) = f (c_{i}, l_{1}) \\ A (c_{i}, l_{2}) = A (c_{i}, l_{1}) + g f (c_{i}, l_{1}) + \frac{g^{2}}{2!} f' (c_{i}, l_{1}) + \frac{g^{3}}{3!} f'' (c_{i}, l_{1}) \\ A (c_{i}, l_{3}) = A (c_{i}, l_{2}) + 2 g f (c_{i}, l_{2}) + \frac{(2 g)^{2}}{2!} f' (c_{i}, l_{2}) + \frac{(2 g)^{3}}{3!} f'' (c_{i}, l_{2}) \end{matrix}

(3)

According to Eq. 3, it can be inferred that with each increase of the distance g, the absorbance gains two higher-order terms whose coefficients increase exponentially. This shows that the nonlinear absorption spectrum of a sample can be represented precisely by the absorbance at multiple pathlengths. The condition for taking the pathlength as a variable is that adjacent pathlengths are equally spaced. The significance and function of the spectral data at multiple pathlengths are demonstrated by the experiments.
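To illustrate why equally spaced pathlengths carry derivative information, the sketch below builds forward differences from absorbance values sampled at a constant spacing g and divides the nth difference by gⁿ to estimate the nth derivative at the first pathlength. The function names and the quadratic test curve are hypothetical; this is a finite-difference illustration of the Taylor argument, not the authors' procedure.

```python
def forward_differences(values):
    """Return the successive forward-difference columns of equally spaced
    samples: first differences, second differences, and so on."""
    columns = []
    current = list(values)
    while len(current) > 1:
        current = [b - a for a, b in zip(current, current[1:])]
        columns.append(current)
    return columns


def derivative_estimates(values, g):
    """Estimate the 1st, 2nd, ... derivatives at the first sample as the
    nth forward difference divided by g**n."""
    return [
        column[0] / g ** (order + 1)
        for order, column in enumerate(forward_differences(values))
    ]


# Hypothetical absorbance curve sampled at four equally spaced pathlengths.
g = 0.5
pathlengths = [g * k for k in range(4)]
curve = [2.0 + 3.0 * l + 0.5 * l ** 2 for l in pathlengths]
estimates = derivative_estimates(curve, g)
```

For this quadratic curve the second difference divided by g² recovers the second derivative (1.0) and the third difference vanishes. The first difference gives 3.25 rather than the true slope of 3.0 at l = 0 because it averages over the interval, so a smaller spacing gives a closer estimate.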

Materials

The pure intralipid samples were prepared by adding distilled water to 20% intralipid (Huari Pharmaceutical Co., China). The intralipid concentrations ranged from 18% down to 8.2% in steps of 0.2%. Ink (Beijing Solarbio Science and Technology Co., Ltd., China) was added to all pure intralipid solutions; after mixing, the ink concentration was 10%. In total, there were 100 samples: 50 pure intralipid solutions (samples with strong scattering) and 50 mixed solutions of intralipid and ink (samples with weak scattering).

NIR Spectroscopy Detection System

The experimental setup is shown in Fig. 1. It mainly consisted of a direct current (DC) power supply, a six-blade light chopper driven by a stepper motor, a light source (tungsten bromide lamp, 12 V, 20 W), a numerically controlled slide table (position accuracy 0.01 mm), a quartz cuvette (15 × 15 × 17 mm), a spectrometer (AvaSpec-NIR256-2.5(TEC), 1041–1770 nm, 256 infrared wavelengths), and a computer.

Figure 1.

The measurement system.

Spectral Data Acquisition, Preprocessing, and Modeling

Collection of Transmitted Light

First, with the chopper turned off, the transmitted light ( $I_{i}^{P}$ ) of each sample at each pathlength was measured.

Modulation of Incident Light

An optical chopper is a device that converts a temporally continuous light beam into a square-wave optical signal. Chopping is a common way to improve the signal-to-noise ratio (S/N) of a spectral signal. In the experiment, an optical chopper with six vanes was used to modulate the incident light I₀ into a pulsed light signal $I'_{0}$ (as shown in Fig. 2). The period of the pulsed light is T_c = 200 ms, and its frequency is f_c = 1/T_c = 5 Hz.

Figure 2.

The multiple pathlength measurement process of spectra with chopping light modulation.

Collection of Transmitted Light on Multiple Pathlengths

The transmitted light of each sample was collected at 10 positions (as shown in Fig. 3). The initial measuring position was 7.3 mm from the bottom of the cuvette, and the distance between two adjacent positions was 0.5 mm. The transmitted light pulse sequence was $\{I_{t}^{0}, I_{t}^{1}, \dots, I_{t}^{P}, \dots, I_{t}^{10}\}$ . The integration time of the spectrometer was set to τ = 160 ms, the sampling frequency was f_s = 1/τ = 6.25 Hz, and the integration number (N) was 30. The near-infrared transmission spectra were collected with Avasoft 7.6 software. To avoid interference from ambient light, the setup was shaded during the experiment.

Figure 3.

The measurement positions.

The output spectrum (I_s) includes the light ( $I_{t}^{P}$ ) transmitted after absorption and scattering by the solution and the background light (I_n). The background light mainly consists of stray light of the spectrometer

I_{s} = \int_{0}^{τ} I_{t}^{P} dt + I_{n}

(4)

I_{t}^{P} = I_{t} - (I_{abs} - I_{sca})

(5)

where I_t is the incident light intensity, $I_{abs}$ is the absorbed light intensity, and $I_{sca}$ is the scattered light intensity.

The preprocessing steps of spectral data are as follows:

(i) According to Eq. 6, the measured transmission spectrum I_i is logarithmically transformed to obtain $E (c_{i}, l_{1})$

E (c_{i}, l_{1}) = \ln (I_{i}) = \ln (I_{0}) - ɛ c_{i} l_{1}

(6)

where I_i is the transmitted intensity and I₀ is the incident intensity.

(ii) According to Eq. 7, the absorption spectrum $A_{i}^{5 - 6}$ of the absorption layer between Positions 5 and 6 is calculated³⁴

A_{i}^{5 - 6} = \ln (I_{i}^{5} / I_{i}^{6})

(7)

(iii) According to Eq. 8, $a_{i, k}$ is obtained by normalizing H

a_{i, k} = \ln (H_{i, k}) / \ln (H_{i, k = 1, \dots, n})_{max}

(8)

where $H_{i, k}$ refers to $E (c_{i}, l_{1})$ , $A_{i}^{5 - 6}$ , and $X_{i}^{P}$ , that is, the light intensity of the ith sample at the kth wavelength, $(H_{i, k = 1, \dots, n})_{max}$ is the maximum over the wavelengths for each of the three kinds of data, and $a_{i, k}$ is the normalized spectral data.

(iv) Calculation of multiple pathlength spectral data: Because the pulsed light signal is obtained by chopping, the transmitted light intensity after logarithmic transformation (i.e., $E (c_{i}, l_{1})$ ) is Fourier transformed, and the frequency-domain amplitude is extracted as modeling data. Using multiple $E (c_{i}, l_{1})$ to represent the nonlinearity of the spectral line has the same effect as using $A (c_{i}, l_{1})$ . The proof is as follows: assuming that the initial incident light I₀ is constant, the Taylor polynomial of the transmitted light intensity at the pathlength l₁ is given in Eq. 9. Comparing the second polynomial of Eq. 3 with Eq. 9 shows that the two polynomials have the same form. Therefore, $E (c_{i}, l_{2})$ can be used as the data set

E (c_{i}, l_{2}) = E (c_{i}, l_{1}) + g e (c_{i}, l_{1}) + \frac{g^{2}}{2!} e' (c_{i}, l_{1}) + \frac{g^{3}}{3!} e'' (c_{i}, l_{1})

(9)

where e(c_i, l₁) is the first derivative of E(c_i, l₁) with respect to the pathlength, which by Eq. 6 is e(c_i, l₁) = E′(c_i, l₁) = −ɛc_i, and ɛ is the absorption coefficient.

Then, after the Fourier transform, I_s is decomposed into 30 harmonics. Because I_n corresponds to the harmonic with f_n = 0, it can be clearly distinguished. In the amplitude-frequency distribution of the spectrum, the harmonic with the largest amplitude is selected as the spectral parameter $X_{i}^{P}$ at position P

X [ν] = DFT [I_{S} [k]] = \sum_{k = 0}^{N - 1} I_{S} [k] e^{- j \frac{2 π}{N} ν k}, \quad ν = 0, 1, 2, \dots, N - 1

(10)

where ν is the harmonic number, k is the index of the measurement point, and N is the number of measurement points, which equals the number of integrations of the spectrometer.

Taking the 18% pure intralipid solution as an example, the transmission spectrum at the 1073.7 nm wavelength after logarithmic transformation and its amplitude-frequency distribution at the tenth measurement position are illustrated in Fig. 4. The duty cycle of the pulsed light after chopping was 0.5. The signal collected by the spectrometer contained transmitted light that is approximately a square wave and stray light that is approximately constant. After the Fourier transform, the stray light is recorded as the zeroth harmonic and can be clearly identified. The amplitude of the third harmonic frequency (f = 6.25/30 × 3 = 0.625 Hz) is the maximum amplitude, and this harmonic amplitude is selected as the modeling data $X_{i}^{P}$ of the sample.

(v) According to Eq. 11, the spectral data of the same sample on P pathlengths are spliced into the same data set $S_{i}^{P}$ . For example, the data set of the ith sample at 256 wavelengths is

\begin{matrix} S_{i}^{P} = [\begin{matrix} X_{i}^{1} & X_{i}^{2} & \dots & X_{i}^{P} \end{matrix}] \\ = [\begin{matrix} X_{i, 1}^{1} & \dots & X_{i, 256}^{1} & X_{i, 1}^{2} & \dots & X_{i, 256}^{2} & \dots & X_{i, 1}^{P} & \dots & X_{i, 256}^{P} \end{matrix}] \\ (i = 1, 2, \dots, m) \end{matrix}

(11)

Figure 4.

Logarithm of transmission spectrum and its amplitude-frequency distribution.
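Steps (i), (ii), (iv), and (v) above can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: all function names are hypothetical, the synthetic pulse sequence is invented for the example, the amplitude is normalized as |X[ν]|/N, and the search for the dominant harmonic is restricted to harmonics 1 to N/2 because the spectrum of a real sequence is mirrored above that point.

```python
import cmath
import math


def log_transform(intensities):
    """Eq. 6: natural logarithm of the transmitted intensities."""
    return [math.log(v) for v in intensities]


def layer_absorbance(i_near, i_far):
    """Eq. 7: absorbance of the layer between two positions, per wavelength."""
    return [math.log(a / b) for a, b in zip(i_near, i_far)]


def dft_amplitudes(samples):
    """Eq. 10: amplitude |X[v]| / N of every harmonic v of a real sequence
    (the 1/N normalization is an assumption)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * v * k / n)
                for k, x in enumerate(samples))) / n
        for v in range(n)
    ]


def dominant_harmonic(samples, fs):
    """Return (harmonic number, frequency in Hz, amplitude) of the largest
    harmonic, skipping v = 0, which holds the constant background."""
    n = len(samples)
    amps = dft_amplitudes(samples)
    v = max(range(1, n // 2 + 1), key=lambda h: amps[h])
    return v, v * fs / n, amps[v]


def splice(spectra_by_position):
    """Eq. 11: concatenate the spectra of P positions into one row."""
    return [x for spectrum in spectra_by_position for x in spectrum]


# Synthetic example: N = 30 points sampled at 6.25 Hz, a constant background
# of 800 plus an oscillation of amplitude 100 at the third harmonic.
N, FS = 30, 6.25
sequence = [800.0 + 100.0 * math.cos(2 * math.pi * 3 * k / N) for k in range(N)]
harmonic, frequency, amplitude = dominant_harmonic(sequence, FS)

# Two positions with three wavelengths each give a six-variable row.
row = splice([[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]])
```

With 256 wavelengths per position, the spliced row for P positions has 256 × P variables, so the number of modeling variables grows in proportion to the number of pathlengths.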

Song et al.³⁸ improved the S/N by reconstructing the signal with the fundamental frequency coefficient while eliminating the background light interference. In that method, the spectrum was collected by short-time multiple sampling, which yields information at multiple frequencies and makes the choice of modeling frequency very difficult. In addition, an integration time that is too short is not suitable for measuring high-concentration scattering samples. Because the integration time is longer and the number of integrations is smaller in our experiment, only the highest amplitude was selected as the modeling data.

To observe the effect of combining the novel method with other preprocessing methods, the spectral data of multiple pathlengths were used as raw data, and the second derivative, MSC, SNV, and VSN preprocessing methods were each applied. The schemes are shown in Table II.

The PLS method was chosen as the modeling method in this paper. The indicators used to analyze the prediction results were the correlation coefficient of the calibration set ( $r_{c}^{2}$ ), the correlation coefficients of the prediction set ( $r_{p}^{2}$ ), the root mean square error of calibration (RMSEC), and the root mean square error of prediction (RMSEP).
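The four indicators can be computed as in the sketch below. The names `rmse` and `r_squared` are hypothetical, and two conventions are assumed because the text does not state them: the root mean square error divides by the number of samples, and r² is the squared Pearson correlation between reference and predicted concentrations. RMSEC and RMSEP are the same function applied to the calibration set and the prediction set, respectively.

```python
import math


def rmse(reference, predicted):
    """Root mean square error, dividing by the number of samples (assumed)."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)


def r_squared(reference, predicted):
    """Squared Pearson correlation between reference and predicted values."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    sxy = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    sxx = sum((r - mean_r) ** 2 for r in reference)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    return sxy ** 2 / (sxx * syy)


# Hypothetical reference and predicted concentrations (%), not measured data.
reference = [8.2, 10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [8.5, 9.8, 12.3, 13.6, 16.2, 17.9]
rmsep = rmse(reference, predicted)
r_p2 = r_squared(reference, predicted)
```

Other conventions exist, for example dividing by the degrees of freedom instead of the number of samples, and they would change the RMSEC slightly for small calibration sets.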

Three kinds of spectral data were used as the calibration set: the transmission spectrum after logarithmic transformation (E), the absorption spectrum ( $A_{i}^{5 - 6}$ ), and the amplitude spectrum ( $S_{i}^{P}$ ). The first two data sets were extracted at a single pathlength, and the last one at multiple pathlengths. Eight modeling schemes were designed to compare the prediction accuracy of the models built from the three kinds of data sets and to examine the third modeling method (as shown in Table I). Fifty pure intralipid samples were used as modeling samples, and 20 of the 50 mixed solutions of intralipid and ink were randomly selected as prediction samples. Because the calibration set and the prediction set differ in the types of components they contain, the prediction ability of the model can be observed.

Table I.

Spectral data analysis of eight schemes.

Schemes	Data sets	Number of pathlengths
1	$E (c_{i}, l_{3})$	One
2	$A_{i}^{5 - 6}$	One
3	$S_{i}^{1}$	One
4	$S_{i}^{1, 2}$	Two
5	$S_{i}^{1, 2, 3}$	Three
6	$S_{i}^{1, 2, 3, 4}$	Four
7	$S_{i}^{1, 2, 3, 4, 5}$	Five
8	$S_{i}^{1, 2, 3, 4, 5, 6}$	Six

Note: The prediction accuracy of the three kinds of data sets was compared, and the third modeling method was examined.

Results and Discussion

The transmission spectra after logarithmic transformation at the 3 mm pathlength are shown in Fig. 5. The spectral lines clearly change nonlinearly as the intralipid concentration increases uniformly, and the nonlinearity is complex.

Figure 5.

The logarithmic transmission spectrum at the pathlength of 3 mm.

The absorption spectra of the 16% intralipid solution at 10 pathlengths are shown in Fig. 6. In the range of small pathlengths (0.5–3 mm), the spectral lines decrease uniformly as the pathlength increases. In the range of large pathlengths (3.5–5 mm), the spectral lines change in a strongly nonlinear way under the influence of scattering in some bands, while they still decrease uniformly in other bands. Studying the spectrum in the pathlength dimension is equivalent to an in-depth analysis that focuses on one point of the spectral line in Fig. 5 and can “describe” the absorption characteristics of the sample at a given wavelength in more detail.

Figure 6.

Absorption spectrum of 16% intralipid solution at 10 pathlengths.

It can be clearly seen from Fig. 7 that the transmitted light is a pulsed light signal. The lowest amplitudes of all spectral lines are in the range of 750–900. This part of the light mainly comes from the stray light of the spectrometer and the constant light caused by dark current. Mean centering is usually used to reduce its influence. This method correlates the change in the property or composition of a component with the change in the spectrum rather than with the absolute value of the spectrum.²⁶ It presupposes that the relationship between component concentration and absorbance is linear, so it introduces additional errors in the spectral analysis of scattering solutions. Because the stray light and the constant light caused by dark current are DC noise, applying the Fourier transform to the pulsed light to eliminate their effect is suitable for the spectral analysis of scattering solutions.

Figure 7.

Transmission spectrum of 20% intralipid solution at 10 pathlengths.

The data in Scheme 3 were modulated, whereas those in Scheme 1 were not. According to the results (as shown in Table III), the prediction accuracy of Scheme 3 is 33.3% higher than that of Scheme 1, which shows that light chopping modulation improves the S/N of the signal.

At the 1073.7 nm wavelength, the third harmonic amplitude of the 18% pure intralipid solution is shown in Fig. 8. The amplitude clearly decreases nonlinearly as the pathlength increases. Four schemes are used to approximate the spectral line: two-point line approximation (at 0.5 mm and 4.5 mm), three-point line approximation (at 0.5 mm, 2.5 mm, and 4.5 mm), four-point line approximation (at 0.5 mm, 1.5 mm, 2.5 mm, and 4.5 mm), and 10-point line approximation. The larger the number of pathlengths and the smaller their spacing, the more accurately the spectral nonlinearity of the components is described.

Figure 8.

The approximation of nonlinear spectral lines with the third harmonic frequency amplitude.
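The effect of the number of pathlengths on the approximation can be illustrated numerically. The sketch below interpolates linearly between the pathlengths of each of the four schemes and reports the largest deviation from a reference curve over 0.5–4.5 mm. The reference curve exp(−l) is a hypothetical stand-in for the measured amplitude (it is not data from the experiment), and the function names are illustrative.

```python
import math


def interpolate(knots_x, knots_y, x):
    """Piecewise-linear interpolation through (knots_x, knots_y) at x."""
    segments = zip(zip(knots_x, knots_y), zip(knots_x[1:], knots_y[1:]))
    for (x0, y0), (x1, y1) in segments:
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the knot range")


def max_error(curve, knots_x, grid):
    """Largest absolute deviation of the interpolant from the curve."""
    knots_y = [curve(x) for x in knots_x]
    return max(abs(curve(x) - interpolate(knots_x, knots_y, x)) for x in grid)


def curve(l):
    """Hypothetical nonlinear amplitude versus pathlength (mm)."""
    return math.exp(-l)


schemes = {
    "two-point": [0.5, 4.5],
    "three-point": [0.5, 2.5, 4.5],
    "four-point": [0.5, 1.5, 2.5, 4.5],
    "ten-point": [0.5 * k for k in range(1, 11)],
}
grid = [0.5 + 4.0 * k / 400 for k in range(401)]  # 0.5 mm to 4.5 mm
errors = {name: max_error(curve, knots, grid) for name, knots in schemes.items()}
```

For this convex stand-in curve the largest deviation shrinks as pathlengths are added, in line with the statement above. The sketch says nothing about the noise or over-fitting that additional measured pathlengths may introduce.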

According to the experimental results (Table III), (i) the $r_{c}^{2}$ of all schemes tended to one, which indicates a good correlation of the model; (ii) the $r_{p}^{2}$ of all schemes was close to or greater than 0.85, which indicates a high correlation between the predicted and real values of the component concentration; (iii) the RMSEC values of all schemes were less than 0.17, which indicates high modeling accuracy; (iv) Schemes 1 and 2 represent the traditional methods, while Schemes 3–8 represent the new method. The prediction accuracy of Scheme 5 is the highest, 61.7% higher than that of Scheme 1. The prediction accuracy of Schemes 4 and 5 is higher than that of Schemes 1 and 2. This shows that the data sets in Schemes 4 and 5 contain more nonlinear information, which helps to establish an effective model relationship. The high prediction accuracy is also related to the use of the PLS modeling method.

Table II.

Spectral data analysis schemes in which the spectral data of multiple pathlengths were used as raw data and the second derivative, MSC, SNV, and VSN preprocessing methods were applied.

Schemes	Data sets	Preprocessing	Schemes	Data sets	Preprocessing
9	$S_{i}^{1}$	Second derivative	17	$S_{i}^{1}$	SNV
10	$S_{i}^{1, 2}$	Second derivative	18	$S_{i}^{1, 2}$	SNV
11	$S_{i}^{1, 2, 3}$	Second derivative	19	$S_{i}^{1, 2, 3}$	SNV
12	$S_{i}^{1, 2, 3, 4}$	Second derivative	20	$S_{i}^{1, 2, 3, 4}$	SNV
13	$S_{i}^{1}$	MSC	21	$S_{i}^{1}$	VSN
14	$S_{i}^{1, 2}$	MSC	22	$S_{i}^{1, 2}$	VSN
15	$S_{i}^{1, 2, 3}$	MSC	23	$S_{i}^{1, 2, 3}$	VSN
16	$S_{i}^{1, 2, 3, 4}$	MSC	24	$S_{i}^{1, 2, 3, 4}$	VSN

Table III.

Results of the experimental data.

Scheme	$r_{c}^{2}$	RMSEC	$r_{p}^{2}$	RMSEP	N
1	0.9999	0.0330	0.8533	5.6723	7
2	0.9969	0.1626	0.9139	2.6556	4
3	0.9971	0.1575	0.9155	3.7804	4
4	0.9982	0.1238	0.9140	2.4831	5
5	0.9988	0.1002	0.8709	2.1705	7
6	0.9995	0.0669	0.9229	2.4052	8
7	0.9999	0.0283	0.9268	3.4175	9
8	0.9998	0.0441	0.9286	3.9053	9

Note: The high prediction accuracy was related to the use of the PLS modeling method.

Modeling with the PLS method is equivalent to additive modeling, and the potential unknown relations in the data can be expressed explicitly by Taylor polynomials. Therefore, modeling with multiple pathlength spectral data suits the modeling mechanism of the PLS method. In the modeling, the spectrum and the concentration are decomposed at the same time, and the concentration information is introduced into the decomposition of the spectral data. For the wavelengths that correlate most strongly with the concentration of the analyzed components, the corresponding terms in the regression vector are given higher weight. If the calibration set contains information on external interference factors to different degrees, the regression vector gives lower weight to the wavelengths with large external interference, and the model becomes insensitive to this kind of interference. The spectral data of multiple pathlengths at each wavelength represent nonlinear effects of different degrees of scattering, and the stronger ones (large interference) are given lower weight by the regression vector. The sensitivity of the model to this interference is thereby reduced.
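The weighting behaviour described above can be illustrated with the first PLS weight vector, which for mean-centred data is proportional to Xᵀy, so each variable is weighted by its covariance with the concentration. The sketch is a simplification: it computes only the first weight vector, not the full regression vector of a multi-component model, and the function name and the toy data are hypothetical.

```python
import math


def first_pls_weights(X, y):
    """First PLS1 weight vector w = X^T y / ||X^T y|| after mean centring
    the columns of X and the response y."""
    n = len(X)
    y_mean = sum(y) / n
    col_means = [sum(column) / n for column in zip(*X)]
    w = [
        sum((row[j] - col_means[j]) * (yi - y_mean) for row, yi in zip(X, y))
        for j in range(len(col_means))
    ]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]


# Toy data: variable 0 follows the concentration, variable 1 does not.
y = [1.0, 2.0, 3.0, 4.0]
X = [[2.0, 1.0], [4.0, -1.0], [6.0, -1.0], [8.0, 1.0]]
weights = first_pls_weights(X, y)
```

In this toy case the variable that tracks the concentration receives the whole weight and the uncorrelated variable receives none. Real spectra fall between these extremes, and later components and the regression step modify the final coefficients.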

Among Schemes 3–8, the prediction accuracy of Scheme 5 is the highest. The main reason is that the modeling data set includes the frequency amplitudes of three pathlengths, which approximate the nonlinearity more closely than one or two amplitudes. From Scheme 5 to Scheme 8, however, the prediction accuracy decreased gradually while the amount of modeling data increased, which contradicts the inference drawn from Eq. 3. One reason is that the number of variables in the calibration set is proportional to the number of pathlengths. When the number of variables is much larger than the number of samples, over-fitting becomes more serious and the prediction accuracy decreases. In these schemes, the RMSEC is lower than in the others while the RMSEP is higher, which indicates serious over-fitting. The second reason is that the modeling data contain a large amount of nonlinear information, whose uncertainty seriously interferes with establishing the model relationship. Even if the number of principal components is increased, it is difficult to improve the prediction accuracy.

The results of processing four types of raw data with the four preprocessing methods are shown in Table IV. The results show the following. (i) Among the preprocessing methods, VSN gives the highest prediction accuracy (RMSEP = 5.2275). This shows that the fusion method can further reduce the influence of additive and multiplicative effects and accordingly highlight the effect of the nonlinear spectral line on prediction. (ii) The prediction accuracy of MSC and SNV is similar and becomes worse as the number of pathlengths increases (as shown in Schemes 13–16 and 17–20). These preprocessing methods not only eliminate the influence of the linear relationship between the spectral lines and between the variables but also distort the geometry of the spectral profiles. Adding the spectral data of each variable is equivalent to increasing the nonlinear complexity of the spectral line, which aggravates the distortion. In MSC, the generation of the ideal spectral line is hindered. In SNV, the difference between adjacent variables becomes smaller, which reduces the difference between spectral lines. Therefore, the prediction accuracy of the four preprocessing methods is worse than that of the new method (Schemes 3–8). Moreover, as the amount of data increases, RMSEP increases and RMSEC decreases. This shows that preprocessing can eliminate too much characteristic information of the components, including absorption and scattering information as well as linear and nonlinear information. Preprocessing has the advantage of removing the size effect, but it is not suitable for data sets with multiple correlated data.

Table IV.

Results of the processing of four types of raw data using four different preprocessing methods.

Scheme	$r_{c}^{2}$	RMSEC	$r_{p}^{2}$	RMSEP	N
9	1	0.0199	0.8393	7.8387	9
10	1	0.0177	0.8235	7.5605	8
11	0.9972	0.1559	0.8842	7.0688	5
12	0.9970	0.1561	0.9300	6.3461	5
13	0.9989	0.0962	0.8715	5.9850	4
14	0.9987	0.1066	0.8956	6.9509	4
15	0.9995	0.0636	0.9137	7.8773	5
16	0.9998	0.0374	0.9286	8.2162	6
17	0.9989	0.0958	0.8716	5.9762	4
18	0.9987	0.1058	0.8956	6.9411	4
19	0.9995	0.0634	0.9134	7.864	5
20	0.9998	0.0373	0.9285	8.1927	6
21	0.9841	0.3709	0.7983	5.3220	3
22	0.9967	0.1683	0.7685	5.2275	5
23	0.9967	0.1685	0.8191	6.1653	5
24	0.9968	0.1644	0.8781	6.3450	5
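The percentage improvements quoted in the abstract can be checked from the RMSEP values in Tables III and IV. The sketch below assumes that a prediction accuracy that is x% higher means a relative reduction of x% in RMSEP, which is an interpretation rather than a definition given in the text; the function name is hypothetical.

```python
def rmsep_reduction(baseline, improved):
    """Relative reduction in RMSEP, in percent."""
    return 100.0 * (baseline - improved) / baseline


rmsep_scheme_1 = 5.6723   # Table III, traditional single-pathlength data
rmsep_scheme_5 = 2.1705   # Table III, three pathlengths
rmsep_scheme_22 = 5.2275  # Table IV, best VSN result

vs_traditional = round(rmsep_reduction(rmsep_scheme_1, rmsep_scheme_5), 1)
vs_vsn = round(rmsep_reduction(rmsep_scheme_22, rmsep_scheme_5), 1)
```

Under this reading the two values round to 61.7% and 58.5%, matching the figures quoted in the abstract.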

Conclusion

The nonlinearity caused by scattering is a common influencing factor in the spectral quantitative analysis of complex solutions, and the complexity of its mechanism makes the nonlinear influence difficult to reduce. A modeling method based on multiple pathlength spectral data is proposed. This method can effectively improve the prediction ability of the model by adding nonlinear information without increasing the number of samples. The experimental results show that the prediction ability of the model is effectively improved when multiple pathlength spectral data take part in the modeling. The method provides a new perspective for studying the nonlinear phenomenon of scattering in solution and for restraining its effect. The results also show that the prediction accuracy cannot be improved by integrating the method with other preprocessing methods. Fusion remains a potentially effective processing approach and merits further research, in which the fusion strategy is the key issue.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Ling Lin

References

Sakudo

“Near-Infrared Spectroscopy for Medical Applications: Current Status and Future Perspectives”. Clin. Chim. Acta. 2016; 455(4): 181–188.

Wang

Yang

Che

, et al. “A New Calibration Model Transferring Strategy Maintaining the Predictive Abilities of NIR Multivariate Calibration Model Applied in Different Batches Process of Extraction”. Infrared Phys. Technol. 2019; 103(3): 103046.

3. Cai, Luo, Wang, et al. “Influence of Tea Polyphenol and Bovine Serum Albumin on Tea Cream Formation by Multiple Spectroscopy Methods and Molecular Docking”. Food Chem. 2020; 333(7): 127432.

4. Véstia, Mota Barroso, Ferreira, Gaspar, et al. “Predicting Calcium in Grape Must and Base Wine by FT-NIR Spectroscopy”. Food Chem. 2019; 276(9): 71–76.

5. Chen, Zhang. “Detection of Ethanol Content in Ethanol Diesel Based on PLS and Multispectral Method”. Optik. 2019; 195(10): 162861.

6. Zhu S.H., Zhang Y.H., Zhai H.L., et al. “An Effective and Rapid Approach to Predict Molecular Composition of Naphtha Based on Raw NIR Spectra”. Vib. Spectrosc. 2020; 109(7): 103071.

7. Pizarro, Esteban-Díez, Nistal A.-J., González-Sáiz J.-M. “Influence of Data Pre-Processing on the Quantitative Determination of the Ash Content and Lipids in Roasted Coffee by Near Infrared Spectroscopy”. Anal. Chim. Acta. 2004; 509(2): 217–227.

8. Cyvin S.J., Rauch J.E., Decius J.C. “Theory of Hyper-Raman Effects (Nonlinear Inelastic Light Scattering): Selection Rules and Depolarization Ratios for the Second-Order Polarizability”. J. Chem. Phys. 1965; 43: 4083–4095.

9. V. Tuchin. “Part II. Light Scattering Methods and Instruments for Medical Diagnosis”. Tissue Optics: Light Scattering Methods and Instruments for Medical Diagnosis. Washington: SPIE, 2000. Pp. 352.

10. Costa Lopes, Ventura Brandão, Calderón Sánchez, Franceschi, et al. “Horseradish Peroxidase Structural Changes by Near Infrared (NIR) Spectroscopy”. Process Biochem. 2018; 71(8): 127–133.

11. Wang, Jacques S.L., Zheng. “MCML: Monte Carlo Modeling of Light Transport in Multi-Layered Tissues”. Comput. Methods Programs Biomed. 1995; 47(2): 131–146.

12. Mengqiu, Gang, Wenjuan, Wang S.H., et al. “Reducing the Spectral Nonlinearity Error Caused by Varying Integration Time”. Infrared Phys. Technol. 2018; 94(8): 48–54.

13. Bing, Nihong, Xufeng, et al. “A Feasibility Quantitative Analysis of NIR Spectroscopy Coupled Si-PLS to Predict Coco-Peat Available Nitrogen from Rapid Measurements”. Comput. Electron. Agric. 2020; 173(6): 105410.

14. Jing, Pengdi, Huan, Wang, et al. “Rapid Screening and Quantitative Analysis of Adulterant Lonicerae flos in Lonicerae japonicae flos by Fourier-Transform Near Infrared Spectroscopy”. Infrared Phys. Technol. 2020; 104(1): 103139.

15. Luypaert, Zhang M.H., Massart D.L. “Feasibility Study for the Use of Near Infrared Spectroscopy in the Qualitative and Quantitative Analysis of Green Tea, Camellia sinensis (L.)”. Anal. Chim. Acta. 2003; 478(10): 303–312.

16. Martens, Nielsen J.P., Engelsen S.B. “Light Scattering and Light Absorbance Separated by Extended Multiplicative Signal Correction. Application to Near-Infrared Transmission Analysis of Powder Mixtures”. Anal. Chem. 2003; 75(3): 394–404.

17. Jiechao, Xuelei, Ruihua, Wang. “Combination of Convolutional Neural Networks and Recurrent Neural Networks for Predicting Soil Properties Using Vis-NIR Spectroscopy”. Geoderma. 2020; 380(12): 114616.

18. H. Martens, S.A. Jensen, P. Geladi. “Multivariate Linearity Transformations for Near Infrared Reflectance Spectroscopy”. In: O.H.J. Christie, editor. Proceedings of the Nordic Symposium on Applied Statistics. Stavanger, Norway: Stokkland Forlag, 1983. Pp. 205–234.

19. L. Spinelli, M. Botwicz, N. Zolek, M. Kacprzak, et al. “Determination of Reference Values for Optical Properties of Liquid Phantoms Based on Intralipid and India Ink”. Biomed. Opt. Express. 2014; 5(7): 2037–2053.

20. Rinnan Å., van den Berg, Engelsen S.B. “Review of the Most Common Pre-Processing Techniques for Near-Infrared Spectra”. TrAC, Trends Anal. Chem. 2009; 28(10): 1201–1222.

21. Pasquini. “Near Infrared Spectroscopy: A Mature Analytical Technique with New Perspectives—A Review”. Anal. Chim. Acta. 2018; 1026(10): 8–36.

22. Bian, Wang, Tan, Diwu, et al. “A Selective Ensemble Preprocessing Strategy for Near-Infrared Spectral Quantitative Analysis of Complex Samples”. Chemom. Intell. Lab. Syst. 2020; 197(2): 103916.

23. Engel, Gerretzen, Szymańska, Jansen J.J., et al. “Breaking with Trends in Pre-Processing?” TrAC, Trends Anal. Chem. 2013; 50(10): 96–106.

24. Mishra, Roger J.M., Rutledge D.N., Biancolillo, et al. “MBA-GUI: A Chemometric Graphical User Interface for Multi-Block Data Visualisation, Regression, Classification, Variable Selection and Automated Pre-Processing”. Chemom. Intell. Lab. Syst. 2020; 205(10): 104139.

25. Mishra, Roger J.M., Marini, Biancolillo, et al. “Parallel Pre-Processing Through Orthogonalization (PORTO) and Its Application to Near-Infrared Spectroscopy”. Chemom. Intell. Lab. Syst. 2021; 212(5): 104190.

26. Gerretzen, Szymanska, Jansen J.J., Bart, et al. “Simple and Effective Way for Data Preprocessing Selection Based on Design of Experiments”. Anal. Chem. 2015; 87(24): 12096–12103.

27. Brownfield, Lemos, Kalivas J.H. “Consensus Classification Using Non-Optimized Classifiers”. Anal. Chem. 2018; 90(7): 4429–4437.

28. Lemos, Kalivas J.H. “Self-Optimized One-Class Classification Using Sum of Ranking Differences Combined with a Receiver Operator Characteristic Curve”. Anal. Chem. 2020; 92(3): 5354–5361.

29. Chen F.-R. “Application of Interval Selection Methods in Quantitative Analysis of Multicomponent Mixtures by Terahertz Time-Domain Spectroscopy”. Guang Pu Xue Yu Guang Pu Fen Xi [Spectrosc. Spectral Anal.]. 2014; 34(12): 413–419.

30. Ignesti, Tommasi, Fini, Martelli, et al. “A New Class of Optical Sensors: A Random Laser Based Device”. Sci. Rep. 2016; 6(10): 35225.

31. Rabatel, Marini, Walczak, Roger J.-M. “VSN: Variable Sorting for Normalization”. J. Chemom. 2019; 34(3): E3164.

32. Sun, Zhang. “Detection of Type, Blended Ratio, and Mixed Ratio of Pu'er Tea by Using Electronic Nose and Visible/Near Infrared Spectrometer”. Chem. Sens. 2019; 19(10): 2359.

33. Chen Y.-Y., Wang Z.-B. “End-to-End Quantitative Analysis Modeling of Near-Infrared Spectroscopy Based on Convolutional Neural Network”. J. Chemom. 2019; 33(5): E3122.

34. Luo, Guan, et al. “A Method to Eliminate the Influence of Incident Light Variations in Spectral Analysis”. Rev. Sci. Instrum. 2018; 89(6): 063103.

35. Garcia-Garcia J.L., Pérez-Guaita, Ventura-Gayete, Garrigues, et al. “Determination of Biochemical Parameters in Human Serum by Near-Infrared Spectroscopy”. Anal. Methods. 2014; 6(12): 3982–3989.

36. Geladi, Kowalski B.R. “Partial Least Squares Regression: A Tutorial”. Anal. Chim. Acta. 1986; 185: 1–17.

37. Höskuldsson. “Variable and Subset Selection in PLS Regression”. Chemom. Intell. Lab. Syst. 2001; 55(1): 23–38.

38. Song, Hou, Zhang, et al. “Principal Frequency Component Analysis Based on Modulate Chopper Technique Used in Diffuse Reflectance Spectroscopy Measurement”. Appl. Opt. 2018; 57(5): 1043–1049.