Abstract
In the spectral quantitative analysis of scattering solutions, improvement in accuracy is seriously restricted by the nonlinearity caused by scattering, and the measurement may even fail under its influence. The main reasons are that the modeling variables are strongly affected by nonlinearity and that the information contained in the modeling data cannot represent the scattering characteristics. In this paper, a method is proposed in which the spectral data from several equally spaced optical pathlengths are combined as the modeling data set of a sample. These highly correlated spectral data carry information about the nonlinearity. The additional spectral data provide more options for the selection of principal components when modeling with the partial least squares (PLS) method. By giving lower weight to the wavelengths that are strongly affected by scattering, the model becomes insensitive to scattering and the prediction accuracy is improved. In a spectral quantitative analysis experiment on a strongly scattering material, the prediction accuracy of the model was 61.7% higher than that of the traditional method and 58.5% higher than that of the variable sorting for normalization method, which verifies the feasibility of the method.
Introduction
Spectral analysis technology is non-polluting, nondestructive, and rapid, and it is widely used in the quantitative analysis of multicomponent solutions in medicine,1,2 agriculture,3,4 and the petrochemical industry.5,6 These solutions contain scattering substances to a greater or lesser extent. Because of scattering, the relationship between component content and spectrum is nonlinear,7,8 and this nonlinearity seriously degrades the prediction ability of the model. At present, research on suppressing the effect of scattering focuses on two approaches: estimating and analyzing the relationship between the scattering coefficient and the absorption coefficient on the basis of statistical laws, and using mathematical algorithms to reduce the nonlinear effect.
The absorption and scattering coefficients of standard particles are usually used to estimate those of the corresponding particles in solution.9,10 The measurement result of a sensor can also be reduced to the scattering coefficient.11 These methods require additional equipment, whose errors are introduced into the measurement. In addition, when the absorption and scattering coefficients of the components are known, the Monte Carlo algorithm can be used to simulate the nonlinearity caused by scattering in the solution as an approximation of the actual nonlinearity.12 The algorithm has achieved very good simulation results. In practical applications, however, the result is not ideal when not all of the components are known. It is more convenient to use mathematical algorithms to reduce the influence of scattering.
Preprocessing methods are commonly used to smooth the nonlinearity of spectral lines or to linearize it. Savitzky–Golay convolution smoothing is used to eliminate spectral fluctuations caused by nonlinearity.13,14 The second derivative removes both the baseline and the linear trend,15 but it may amplify noise.16 Multiplicative scatter correction (MSC) can be used to eliminate the problems caused by differences in solid particle size, surface scattering, and spectral peak overlap.17 This method attempts to establish a linear relationship between each spectral line and an “ideal” spectrum, which is approximated by the average spectrum of the verification set. Extended MSC (EMSC) is more detailed than MSC.18 EMSC can effectively correct the nonlinearity caused by differences in sample thickness and can separate and quantify the different chemical and physical sources that affect spectral changes.19 Standard normal variate (SNV) resembles MSC in that it acts as a simple rotation and offset correction.20 SNV corrects each spectrum (row) separately by subtracting the row mean from every element and then dividing by the row standard deviation.21 To integrate the advantages of the individual preprocessing methods, many approaches that integrate multiple preprocessing methods have been proposed.22,23 These preprocessing methods are usually integrated according to one of two strategies: combination or fusion.
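The SNV and MSC corrections described in the preceding paragraph can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the cited work: the function names `snv` and `msc`, the toy spectra, and the choice of reference spectrum are assumptions made for the example.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each row (spectrum) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    Each spectrum is regressed on the reference (spectrum = slope * ref + offset)
    and then corrected as (spectrum - offset) / slope.
    """
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        # Assumption: use the mean of the supplied spectra as the "ideal" spectrum
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        slope, offset = np.polyfit(reference, row, 1)
        corrected[i] = (row - offset) / slope
    return corrected

# Toy data (hypothetical): one base spectrum seen with three scale/offset distortions
base = np.array([0.2, 0.5, 0.9, 0.4, 0.1])
raw = np.vstack([1.0 * base + 0.0, 1.5 * base + 0.3, 0.7 * base - 0.1])

snv_out = snv(raw)                   # all rows collapse onto the same shape
msc_out = msc(raw, reference=base)   # all rows are mapped back onto `base`
```

Both corrections remove a purely multiplicative and additive distortion exactly in this toy case. Real scattering effects are wavelength dependent, which is why such corrections are only approximate in practice.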
The combination strategy runs the preprocessing methods in turn and combines them on the basis of their results, for example by trial and error22–25 or full factorial design.26 The fusion strategy integrates the preprocessing methods directly rather than optimizing each preprocessing result.27,28 Lemos and Kalivas28 proposed a flexible fusion method that addresses the selection of processing methods and spectral regions and combines all assessment values with the sum of ranking differences (SRD). To suppress the effect of scattering, MSC has been combined with other preprocessing methods.29,30
Variable sorting for normalization (VSN) is a signal normalization method that is applied before standard normalization methods such as SNV and MSC.31 The algorithm automatically produces a weighting function that favors signal variables, and this weighting function significantly improves signal shape and model interpretation. After the “abnormal” information has been removed by preprocessing, the VSN method enhances the shape effect, which is generally believed to partially represent the scattering characteristics.23 A combination of several preprocessing methods removes “useless information” more strongly than a single method. However, “useless” information has no clear definition, and the characteristic information is greatly reduced after repeated preprocessing. Rich information is the prerequisite for improving the prediction ability of the model. With large sample sets, the amount of information is large, and nonlinear modeling methods can use as much of it as possible to further reduce the influence of nonlinearity.
Common nonlinear modeling methods include the artificial neural network (ANN) and the convolutional neural network (CNN). The ANN is used to calibrate nonlinear spectra.32 With the CNN,33,34 all of the collected original spectral information can be used directly as input data. Through sparse local connections and weight sharing, local and abstract features are learned from the original spectral data, and a high-quality nonlinear model is established. These nonlinear modeling methods are highly adaptive and can quickly process, analyze, and compare large data sets. However, when the sample size is small, the algorithm cannot accurately extract the relationship between the hidden components and the spectrum from the spectral data, which leads to poor model stability. The stability of the model can be improved by increasing the effective information available to it. The partial least squares (PLS) algorithm is a common modeling method in NIR spectral analysis and is typically used for small-sample regression modeling.35
The PLS method was developed from principal component regression and combines principal component analysis with cross-validation. As a result, it not only resolves the multicollinearity among the independent and dependent variables but also guarantees the maximum correlation between the two sets of variables.36 The prediction accuracy of PLS has been found to be high.20 From the point of view of prediction, however, a weak correlation between the variables of the prediction set and those of the calibration set may change the prediction results greatly,37 so that high modeling accuracy is accompanied by low prediction accuracy. The prediction ability of the model can be improved by increasing the effective information.
Abundant nonlinear spectral information can be obtained by measuring the transmitted light at several equally spaced pathlengths. The smaller the distance between two adjacent pathlengths, the more accurately the information depicts the nonlinearity. The larger the range of pathlengths, the closer the approximation to the actual nonlinear curve. In this paper, a method is proposed that describes the nonlinear spectra of scattering solutions by using the spectral data of multiple pathlengths, with the joint spectral data used as the modeling data set. The effectiveness of the method was verified by quantitative analysis of intralipid solutions.
The Basis of the Method
Because of the scattering substances, the relationship between pathlength and absorbance is nonlinear, so the absorbance has a first derivative with respect to pathlength at each pathlength lk.
The distance between two adjacent pathlengths is the same and is denoted g.
According to Eq. 2, taking the pathlength as the variable, third-order Taylor polynomials of the absorbance are formed at l1, l2, and l3, respectively.
According to Eq. 3, it can be inferred that with each increase of the distance g, the absorbance gains two higher-order terms whose coefficients increase exponentially. This shows that the nonlinear absorption spectrum of a sample can be represented precisely by the absorbance at multiple pathlengths. Taking the pathlength as a variable requires that the distance between adjacent pathlengths be equal. The significance and function of the multiple pathlength spectral data are illustrated by the experiments.
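As a reading aid, the general shape of a third-order Taylor expansion on an equally spaced pathlength grid can be written out. This is a generic sketch and not a reproduction of Eq. 3: the symbol A(l) for the absorbance at pathlength l and the choice of l1 as the expansion point are assumptions made here.

```latex
% Assumed notation: A(l) is the absorbance at pathlength l, g is the equal spacing,
% and l_{1+m} = l_1 + m g.
\begin{align}
A(l_1 + m g) &\approx A(l_1) + m g\,A'(l_1) + \frac{(m g)^2}{2!}\,A''(l_1) + \frac{(m g)^3}{3!}\,A'''(l_1) \\
A(l_2) &\approx A(l_1) + g\,A'(l_1) + \tfrac{1}{2}\,g^2 A''(l_1) + \tfrac{1}{6}\,g^3 A'''(l_1) \\
A(l_3) &\approx A(l_1) + 2g\,A'(l_1) + 2\,g^2 A''(l_1) + \tfrac{4}{3}\,g^3 A'''(l_1)
\end{align}
```

In this form, the coefficients of the second- and third-order terms grow as m^2 and m^3 with the number of steps m, so measurements at larger pathlengths weight the nonlinear terms more heavily.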
Materials
The pure intralipid samples were prepared by adding distilled water to 20% intralipid (Huari Pharmaceutical Co., China). The intralipid concentrations ranged from 18% to 8.2% in steps of 0.2%. Ink (Beijing Solarbio Science and Technology Co., Ltd., China) was added to all pure intralipid solutions; after mixing, the ink concentration was 10%. In total, there were 100 samples: 50 pure intralipid solutions (samples with strong scattering) and 50 mixed solutions of intralipid and ink (samples with weak scattering).
NIR Spectroscopy Detection System
The experimental setup is shown in Fig. 1. It consisted mainly of a direct current (DC) power supply, a six-blade light chopper driven by a stepper motor, a light source (tungsten bromide lamp, 12 V, 20 W), a numerically controlled slide table (position accuracy 0.01 mm), a quartz cuvette (15 × 15 × 17 mm), a spectrometer (AvaSpec-NIR256-2.5(TEC), 1041–1770 nm, 256 infrared wavelengths), and a computer.
Figure 1. The measurement system.
Spectral Data Acquisition, Preprocessing, and Modeling
Collection of Transmitted Light
First of all, in the case of turning off the chopper, the transmitted light (
Modulation of Incident Light
An optical chopper is a device that converts a temporally continuous light beam into a square-wave optical signal, and it is commonly used to improve the signal-to-noise ratio (S/N) of the spectral signal. In the experiment, an optical chopper with six vanes was used to modulate the incident light I0 into a pulsed light signal.
Figure 2. The multiple pathlength measurement process of spectra with chopping light modulation.
Collection of Transmitted Light on Multiple Pathlengths
The transmitted light of each sample was collected at 10 positions (as shown in Fig. 3). The initial measuring position was 7.3 mm from the bottom of the cuvette, and the distance between two adjacent positions was 0.5 mm. A transmitted light pulse sequence was recorded at each position.
Figure 3. The measurement positions.
The output spectrum (IS) includes the light (
The preprocessing steps of the spectral data are as follows:
(i) According to Eq. 6, the logarithmic transformation of the measured transmission spectrum Ii is made, and
(ii) According to Eq. 7, the absorption spectrum
(iii) According to Eq. 8,
(iv) Calculation of multiple pathlength spectral data: Because the pulsed light signal is obtained by chopping, the transmitted light intensity after logarithmic transformation (i.e.,
where Ii is the transmission intensity, I0 is the incident intensity.
where
where e(ci, l1) = E′(ci, l1) = ɛci is the first derivative of E(ci, l1), and ɛ is the absorbance coefficient.
Then, after the Fourier transform, IS is decomposed into 30 harmonics. Because In corresponds to the harmonic with fn = 0, it can be clearly distinguished. In the amplitude-frequency distribution of the spectrum, the largest harmonic amplitude is selected as the spectral parameter.
Taking the 18% pure intralipid solution as an example, the transmission spectrum at the 1073.7 nm wavelength after logarithmic transformation and its amplitude-frequency distribution at the tenth measurement position are illustrated in Fig. 4. The duty cycle of the pulsed light after chopping was 0.5. The signal collected by the spectrometer contained transmitted light that is approximately a square wave and stray light that is approximately constant. After the Fourier transform, the stray light is recorded as the zeroth harmonic and can be clearly identified. The amplitude of the third harmonic frequency (f = 6.25/30 × 3 = 0.625 Hz) is the maximum amplitude, and this harmonic amplitude is selected as the modeling data.
(v) According to Eq. 11, the spectral data of the same sample on P pathlengths are spliced into the same data set.
Figure 4. Logarithm of transmission spectrum and its amplitude-frequency distribution.
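Steps (iv) and (v) can be illustrated with a small simulation. The sketch below is not the authors' processing code. It assumes a record of 30 samples taken at 6.25 Hz (one way to obtain the 6.25/30 Hz frequency resolution quoted above), an ideal square wave with duty cycle 0.5, and hypothetical pulse heights and stray-light level; the function names are likewise made up for the example.

```python
import numpy as np

def harmonic_amplitudes(signal):
    """One-sided amplitude spectrum of a real, uniformly sampled signal."""
    n = len(signal)
    amp = np.abs(np.fft.rfft(signal)) / n
    # Double every bin except DC (and the Nyquist bin when n is even)
    last = -1 if n % 2 == 0 else None
    amp[1:last] *= 2.0
    return amp

def chopped_signal(pulse, stray, n_samples, samples_per_period):
    """Square wave (duty cycle 0.5) of height `pulse` on a constant `stray` level."""
    high = (np.arange(n_samples) % samples_per_period) < samples_per_period // 2
    return stray + pulse * high

fs = 6.25          # assumed sampling rate, Hz
n_samples = 30     # assumed record length, so the resolution is fs / n_samples
chop_hz = 0.625    # chopping frequency quoted in the text
samples_per_period = int(round(fs / chop_hz))

# Step (iv): pick the largest non-DC harmonic of one chopped record
signal = chopped_signal(120.0, 800.0, n_samples, samples_per_period)
amp = harmonic_amplitudes(signal)
freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
k_max = 1 + int(np.argmax(amp[1:]))   # skip the zeroth (DC) harmonic

# Step (v): splice the selected amplitudes from P pathlengths into one row.
# Rows are pathlengths, columns are wavelengths (hypothetical pulse heights).
pulses = np.array([[120.0, 90.0, 60.0],
                   [100.0, 70.0, 45.0],
                   [85.0, 55.0, 30.0]])
per_pathlength = np.array([
    [harmonic_amplitudes(chopped_signal(p, 800.0, n_samples, samples_per_period))[k_max]
     for p in row]
    for row in pulses
])
sample_row = per_pathlength.reshape(-1)   # length P * number of wavelengths
```

In this simulation the constant level only affects the zeroth harmonic (which also contains the mean of the square wave), while the selected harmonic amplitude is proportional to the pulse height. The splicing order (pathlength by pathlength) is an assumption, since the layout defined by Eq. 11 is not reproduced here.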

Song et al.38 improved the S/N by reconstructing the signal with the fundamental frequency coefficient while eliminating the background light interference. In that method, the spectrum was collected by short-time multiple sampling, which yielded information at multiple frequencies and made the choice of modeling frequency very difficult. In addition, a very short integration time is not suitable for measuring high-concentration scattering samples. Because the integration time was longer and the number of integrations smaller in our experiment, only the highest amplitude was selected as the modeling data.
To observe the effect of combining the novel method with other preprocessing methods, the spectral data of multiple pathlengths were used as raw data, and the second derivative, MSC, SNV, and VSN preprocessing methods were each applied. The schemes are shown in Table II.
The PLS method was chosen as the modeling method in this paper. The indicators used to analyze the prediction results were the correlation coefficient of the calibration set (
Spectral data analysis of eight schemes.
Note: The prediction accuracy of the three sets was compared with an observed third modeling method.
Results and Discussion
The transmission spectra after logarithmic transformation at a pathlength of 3 mm are shown in Fig. 5. The spectral lines clearly change nonlinearly as the intralipid concentration increases uniformly, and the nonlinearity is complex.
Figure 5. The logarithmic transmission spectrum at the pathlength of 3 mm.
The absorption spectra of the 16% intralipid solution at 10 pathlengths are shown in Fig. 6. In the range of small pathlengths (0.5–3 mm), the spectral lines decrease uniformly as the pathlength increases. In the range of large pathlengths (3.5–5 mm), the spectral lines change strongly nonlinearly under the influence of scattering in some bands, while they still decrease uniformly in others. Studying the spectrum along the pathlength dimension is equivalent to an in-depth analysis of a single point on a spectral line in Fig. 5, which “describes” the absorption characteristics of the sample at that wavelength in more detail.
Figure 6. Absorption spectrum of 16% intralipid solution at 10 pathlengths.
It can be clearly seen from Fig. 7 that the transmitted light is a pulsed light signal. The lowest amplitudes of all spectral lines are in the range of 750–900. This part of the light comes mainly from the stray light of the spectrometer and the constant signal caused by dark current. Mean centering is usually used to reduce this influence. It works by correlating changes in the properties or composition of the components with changes in the spectrum rather than with the absolute value of the spectrum.26
This method presupposes a linear relationship between component concentration and absorbance, so it introduces additional errors in the spectral analysis of scattering solutions. Because the stray light and the constant signal caused by dark current are DC noise, applying the Fourier transform to the pulsed light to eliminate their effect is suitable for the spectral analysis of scattering solutions.
Figure 7. Transmission spectrum of 20% intralipid solution at 10 pathlengths.
The data in Scheme 3 were modulated, while those in Scheme 1 were not. According to the results (as shown in Table II), the prediction accuracy of Scheme 3 is 33.3% higher than that of Scheme 1, which shows that the S/N of the signal is improved by light chopping modulation.
At the 1073.7 nm wavelength, the third harmonic amplitude of the 18% pure intralipid solution is shown in Fig. 8. The amplitude clearly decreases nonlinearly as the pathlength increases. Four schemes are used to approximate the spectral line: two-point line approximation (at 0.5 mm and 4.5 mm), three-point line approximation (at 0.5 mm, 2.5 mm, and 4.5 mm), four-point line approximation (at 0.5 mm, 1.5 mm, 2.5 mm, and 4.5 mm), and 10-point line approximation. The more pathlengths are used and the smaller their spacing, the more accurately the spectral nonlinearity of the components is described.
Figure 8. The approximation of nonlinear spectral lines with the third harmonic frequency amplitude.
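The effect of the number of pathlengths on the quality of the line approximation can be checked numerically. The sketch below uses a hypothetical exponential decay as a stand-in for the measured amplitude curve (the real curve in Fig. 8 is not reproduced here) and compares the maximum error of the piecewise-linear approximations built on the four sets of pathlengths named above. The 10-point set is assumed to be 0.5 mm to 5.0 mm in 0.5 mm steps.

```python
import numpy as np

def amplitude(l_mm):
    # Hypothetical stand-in for the harmonic amplitude versus pathlength (mm)
    return np.exp(-0.8 * l_mm)

schemes = {
    "2-point": [0.5, 4.5],
    "3-point": [0.5, 2.5, 4.5],
    "4-point": [0.5, 1.5, 2.5, 4.5],
    "10-point": list(np.arange(1, 11) * 0.5),   # assumed: 0.5, 1.0, ..., 5.0 mm
}

# Evaluate every approximation on a dense grid over the common range
dense = np.linspace(0.5, 4.5, 401)
max_error = {}
for name, nodes in schemes.items():
    nodes = np.asarray(nodes, dtype=float)
    approx = np.interp(dense, nodes, amplitude(nodes))   # piecewise-linear interpolation
    max_error[name] = float(np.max(np.abs(approx - amplitude(dense))))

for name, err in max_error.items():
    print(f"{name}: max error {err:.4f}")
```

For this stand-in curve the maximum error shrinks with every added pathlength, which matches the qualitative statement in the text; the actual size of the improvement depends on the shape of the measured curve.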
Spectral data analysis schemes in which the spectral data of multiple pathlengths were used as raw data and the second derivative, MSC, SNV, and VSN preprocessing methods were applied.
Results of the experimental data.
Note: The high prediction accuracy was related to the use of the PLS modeling method.
Modeling with the PLS method is equivalent to additive modeling, and the potential unknown relations in the data can be expressed explicitly by Taylor polynomials. Therefore, including multiple pathlength spectral data in the modeling suits the modeling mechanism of the PLS method. In the modeling, the spectrum and the concentration are decomposed at the same time, and the concentration information is introduced into the decomposition of the spectral data. For the wavelengths that correlate most strongly with the concentration of the analyzed components, the corresponding terms in the regression vector are given higher weight. If the calibration set contains information about external interference factors of different degrees, the regression vector gives lower weight to the wavelengths with large external interference, and the model becomes insensitive to this kind of interference. The spectral data of multiple pathlengths at each wavelength represent the nonlinear effects of different degrees of scattering, and the stronger effects (large interference) are given lower weight by the regression vector, which reduces the sensitivity of the model to this interference.
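To make the role of the regression vector concrete, the following is a minimal textbook PLS1 (NIPALS) sketch in NumPy. It is not the software used in this study. The synthetic "spectra", the absorptivity values, the noise level, and the calibration/prediction split are all hypothetical and serve only to show how a regression vector, RMSEC, and RMSEP are obtained.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 with the NIPALS algorithm; returns (regression vector, x mean, y mean)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                  # weight: covariance of each variable with y
        w /= np.linalg.norm(w)
        t = Xk @ w                     # scores
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        c = (yk @ t) / tt              # y loading
        Xk = Xk - np.outer(t, p)       # deflate X and y
        yk = yk - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)   # regression vector in the original variables
    return b, x_mean, y_mean

def pls1_predict(X, model):
    b, x_mean, y_mean = model
    return (np.asarray(X, dtype=float) - x_mean) @ b + y_mean

def rmse(y_true, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_hat)) ** 2)))

# Synthetic data (hypothetical): absorbance = concentration * absorptivity + noise
rng = np.random.default_rng(0)
eps = np.linspace(0.2, 1.0, 8)          # made-up absorptivity at 8 wavelengths
conc = np.linspace(8.2, 18.0, 40)       # concentrations in %, spanning the sample range
X = np.outer(conc, eps) + rng.normal(0.0, 0.01, size=(40, 8))

cal = np.arange(40) % 4 != 0            # 30 calibration samples, 10 prediction samples
model = pls1_fit(X[cal], conc[cal], n_components=2)
rmsec = rmse(conc[cal], pls1_predict(X[cal], model))
rmsep = rmse(conc[~cal], pls1_predict(X[~cal], model))
print("regression vector:", np.round(model[0], 3))
print(f"RMSEC = {rmsec:.4f}, RMSEP = {rmsep:.4f}")
```

The regression vector `model[0]` holds one weight per variable. With spliced multiple pathlength data, each wavelength would contribute one such weight per pathlength.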
Among Schemes 3–8, Scheme 5 has the highest prediction accuracy. The main reason is that its modeling data set includes the frequency amplitudes of three pathlengths, which approximate the nonlinearity more closely than one or two amplitudes. However, the prediction accuracy decreased gradually from Scheme 5 to Scheme 8 while the amount of modeling data increased, which contradicts the inference from Eq. 3. One reason is that the number of variables in the calibration set is proportional to the number of pathlengths. When the number of variables is much larger than the number of samples, over-fitting becomes more serious and the prediction accuracy decreases. In these schemes, the RMSEC is lower than in the others while the RMSEP is higher, which indicates serious over-fitting. The second reason is that the modeling data contain a large amount of nonlinear information, and the uncertainty of this information seriously interferes with establishing the model relationship. Even if the number of principal components is increased, it is difficult to improve the prediction accuracy.
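The pattern of a low RMSEC together with a high RMSEP when variables far outnumber samples can be reproduced with a deliberately simple experiment. The sketch below uses ordinary minimum-norm least squares on random, uncorrelated variables rather than PLS on real spectra, so it only illustrates the general mechanism. The sample counts, the 256 × 10 variable count (256 wavelengths at 10 pathlengths), and the true coefficients are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cal, n_new = 30, 200
n_vars = 256 * 10                      # e.g. 256 wavelengths spliced over 10 pathlengths

# Hypothetical ground truth: only three variables carry information about y
beta = np.zeros(n_vars)
beta[:3] = [1.0, -2.0, 0.5]

def simulate(n):
    X = rng.normal(size=(n, n_vars))
    y = X @ beta + rng.normal(0.0, 0.1, size=n)
    return X, y

X_cal, y_cal = simulate(n_cal)
X_new, y_new = simulate(n_new)

# With more variables than samples, least squares can fit the calibration set exactly
coef, *_ = np.linalg.lstsq(X_cal, y_cal, rcond=None)

rmsec = float(np.sqrt(np.mean((X_cal @ coef - y_cal) ** 2)))
rmsep = float(np.sqrt(np.mean((X_new @ coef - y_new) ** 2)))
print(f"RMSEC = {rmsec:.2e}, RMSEP = {rmsep:.2f}")
```

The calibration error is essentially zero while the error on new samples stays large. PLS limits this effect by restricting the model to a few latent components, but the risk grows as more pathlengths are spliced into each sample row.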
Results of the processing of four types of raw data using four different preprocessing methods.
Conclusion
The nonlinearity caused by scattering is a common influencing factor in the spectral quantitative analysis of complex solutions, and the complexity of its mechanism makes the nonlinear influence difficult to reduce. A modeling method based on multiple pathlength spectral data is proposed. The method improves the prediction ability of the model by adding nonlinear information without increasing the number of samples, and the experimental results confirm that including multiple pathlength spectral data in the modeling effectively improves the prediction ability. The method provides a new perspective for studying the nonlinear phenomenon of scattering in solution and suppressing its effect. The results also showed that the prediction accuracy cannot be improved by integration with other preprocessing methods. Since fusion is an effective processing approach, fusing the proposed method with other methods merits further research, and the fusion strategy will be the key to that work.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
