Risk prediction for myocardial infarction via generalized functional regression models

Abstract

In this paper, we propose a generalized functional linear regression model for a binary outcome indicating the presence/absence of a cardiac disease, with multivariate functional data among the relevant predictors. In particular, the motivating aim is the analysis of electrocardiographic traces of patients whose pre-hospital electrocardiogram (ECG) has been sent to the 118 Dispatch Center of Milan (118 being the Italian toll-free number for emergencies) by life support personnel of the basic rescue units. The statistical analysis starts with a preprocessing of the ECGs, treated as multivariate functional data. The signals are reconstructed from noisy observations. The biological variability is then removed by a nonlinear registration procedure based on landmarks. Then, in order to perform a data-driven dimensional reduction, a multivariate functional principal component analysis is carried out on the variance-covariance matrix of the reconstructed and registered ECGs and their first derivatives. We use the scores of the principal component decomposition as covariates in a generalized linear model to predict the presence of the disease in a new patient. Hence, a new semi-automatic diagnostic procedure is proposed to estimate the risk of infarction (in the case of interest, the probability of being affected by Left Bundle Branch Block). The performance of this classification method is evaluated and compared with other methods proposed in the literature. Finally, the robustness of the procedure is checked via leave-j-out techniques.

Keywords

multivariate functional data, ECG signals, generalized linear models

1 Introduction

The use of telemedicine systems in pre-hospital emergency rescues has allowed quicker diagnoses for patients with cardiovascular ischaemic diseases. The literature has shown that pre-hospital electrocardiogram (ECG) reduces treatment times and in-hospital mortality.^1–3 It is also known that pre-hospital ECG may either be transmitted to the hospital staff or can be interpreted in real time by paramedics, who then communicate their diagnosis to the hospital.^4,5

Starting from 2006, in the Milan urban area, a working group comprising 23 Cardiology Units and the 118 Dispatch Center (118 being the Italian toll-free number for emergencies) performed month-long data collections twice a year on all patients admitted to any hospital in Milan with coronary artery disease (MOMI²: MOnth MOnitoring Myocardial Infarction in MIlan survey). The statistical analysis of the collected data^6–8 confirmed the time of first ECG teletransmission as the most important factor in guaranteeing quick access to an effective treatment for patients. In 2008, a project named PROMETEO (PROgetto sull’area Milanese Elettrocardiogrammi Teletrasferiti dall’Extra Ospedaliero) was started with the aim of spreading the intensive use of the ECG as a pre-hospital diagnostic tool. The project is also a way of constructing a new database of ECGs with features never recorded before in any other data collection on heart diseases. Indeed, ECG recorders with GSM transmission have been installed on all Basic Rescue Units of the Milan urban area, thanks to the partnership of Azienda Regionale Emergenza Urgenza (AREU), Abbott Vascular and Mortara Rangoni Europe s.r.l.

The principal aim of this work is the development of a new semi-automatic diagnostic procedure for predicting the disease probability on the basis of the ECG morphology. Such a procedure can be implemented starting from the ECG signals generated by telemedicine equipment of the Basic Rescue Units.

A statistical identification of specific ECG patterns which could benefit from an early invasive approach has been performed on a sample of data arising from the PROMETEO database.⁹ In fact, the presence of a cardiac disease induces morphological changes in the ECG signals. Finding statistical tools capable of classifying curves using only these changes could support an early detection of coronary disease, not based on the usual clinical criteria.

In order to do this, ECG traces are considered as noisy multivariate functional data.⁹ A real-time procedure has been tuned and tested, consisting of preliminary steps such as signal reconstruction, wavelet denoising and removal of biological variability through data registration. Then a multivariate functional k-means has been considered, thus simultaneously clustering all eight leads of each patient. This classification procedure uses group centroids as reference signals. The technique proposed in Ieva et al.⁹ allowed diagnoses to be consistent with clinical practice, starting from purely statistical considerations. However, despite the attractiveness of this method, the estimation of the number of groups as well as their identification might not be straightforward.

In this work, we approach the problem in a different way: we aim at constructing and validating a statistical procedure to model the binary outcome of interest (i.e. the presence of a cardiovascular acute ischaemic event) by means of suitable covariates (i.e. patient characteristics, whenever available) and of multivariate functional predictors (i.e. the ECG signals and the corresponding derivatives, available for each patient). In particular, we focus on estimating the probability that each patient belongs to the Left Bundle Branch Block (LBBB) group. This can be done by using as predictors the eight-lead ECG trace of each patient and its first derivative, which are inserted in a suitable generalized functional linear model. Specifically, following for example Berrendero et al.,¹⁰ James,¹¹ Ratcliffe et al.,¹² Escabias et al.,¹³ Müller and Stadtmüller¹⁴ and Zhu and Cox,¹⁵ we perform a dimensionality reduction by a multivariate functional principal component analysis (see Ramsay and Silverman¹⁶). It consists of summarizing the information carried by the covariance operators of the signals and their first derivatives through the corresponding scores, obtained by projecting data and derivatives on the corresponding Karhunen–Loève bases. Then we introduce the scores into a generalized linear model where the response is the Bernoulli variable indicating the presence of LBBB, performing a model selection inspired by the work of Escabias et al.¹⁷ We then discuss the classification of patients resulting from the procedure described above and propose a comparison with results obtained by other classification methods dealing with suitable dimensional reduction of functional data.^18,19 We also study the robustness of the procedure via leave-j-out techniques. In fact, one of the main aims of the work is to propose a method for identifying a suitable and reliable basis to be used for projecting a new observation and making a real-time prediction of the corresponding risk of disease.

The paper is structured as follows: Section 2 contains the theoretical framework of the multivariate functional principal component analysis we adopt for carrying out dimensional reduction of the multivariate functional data (2.1) and the corresponding first derivatives (2.2). These steps are performed in order to input relevant components in a generalized linear model for risk prediction of the presence of disease (Section 2.3). In Section 3, the analysis of ECG data arising from PROMETEO dataset is presented, together with the leave-j-out analysis. Finally, in Section 4, conclusions are drawn and further developments are discussed. All the analyses are carried out using R statistical software.²⁰

2 Models and methods

A common strategy to deal with complex or high-dimensional data is to perform a dimensional reduction.¹⁶ In the motivating example described here, the eight-lead ECG signal of each patient (a multivariate functional curve) is considered as a predictor of the presence of LBBB. Thus we deal with the dimensional reduction of such data and their first derivatives in order to use them as inputs to a generalized linear model for predicting the risk of LBBB.

2.1 Multivariate functional principal component analysis

Within the functional setting, principal component analysis also provides a way of looking at covariance structure of data that can be much more informative and can complement, or even replace altogether, a direct examination of the variance-covariance function, as detailed in Ramsay and Silverman.¹⁶

Let $X$ be a stochastic process with law $P$ taking values in the space $L^2(I; \mathbb{R}^h)$ of square integrable functions $X(t) = (X_1(t), \dots, X_h(t))^T : I \to \mathbb{R}^h$, where $I$ is a compact interval of $\mathbb{R}$. Let $\mu_l(t) = E[X_l(t)]$, for each $t \in I$, denote the mean function of the $l$-th component $X_l(t)$, for $1 \leq l \leq h$; then

$$\mu(t) = (\mu_1(t), \dots, \mu_h(t))^T$$

is the mean function of $X$. The covariance operator $\mathcal{V}$ of $X$ is a linear compact integral operator from $L^2(I; \mathbb{R}^h)$ to $L^2(I; \mathbb{R}^h)$ acting on a function $g$ as follows:

$$(\mathcal{V} g)(s) = \int_I V(s, t)\, g(t)\, dt.$$

The kernel $V(s, t)$ is defined by

$$V(s, t) = E\left[(X(s) - \mu(s)) \otimes (X(t) - \mu(t))\right],$$

where $\otimes$ is an outer product in $\mathbb{R}^h$. $V(s, t)$ is an $h \times h$ matrix, whose elements will be denoted as $V_{rq}(s, t)$, for $r, q = 1, \dots, h$.

In what follows, the model formulation is already intended for the application of interest, where the ECG of each patient $j = 1, \dots, n$ is an eight-variate functional datum generated by the stochastic process $X$ taking values in the Hilbert space $L^2(I; \mathbb{R}^8)$. The general case of $h \geq 2$, $h \neq 8$ follows straightforwardly. We consider, as data reduction strategy, the multivariate functional principal component analysis proposed in Ramsay and Silverman.¹⁶

So let $V_{rr}(s, t)$, $r = 1, \dots, 8$, be the variance functions of the components of $X$, and let $V_{rq}(s, t)$, $r, q = 1, \dots, 8$, $r \neq q$, be the cross-covariance functions. Thus, for any $(s, t) \in I \times I$, $V_{rq}(s, t) = \mathrm{Cov}(X_r(s), X_q(t))$, $r, q = 1, \dots, 8$.

Consider the usual scalar product between two elements $U$ and $W$ in $L^2(I; \mathbb{R}^8)$:

$$\langle U, W \rangle = \sum_{r=1}^{8} \int_I U_r(t)\, W_r(t)\, dt. \tag{1}$$

Call $e^k(t) = (e_1^k(t), \dots, e_8^k(t))^T$ the $k$-th element of the Karhunen–Loève expansion, which is the solution of the eigenequation system

$$\begin{cases}
\int_I V_{11}(s, t)\, e_1^k(t)\, dt + \dots + \int_I V_{18}(s, t)\, e_8^k(t)\, dt = \rho_k\, e_1^k(s), \\
\qquad \vdots \\
\int_I V_{81}(s, t)\, e_1^k(t)\, dt + \dots + \int_I V_{88}(s, t)\, e_8^k(t)\, dt = \rho_k\, e_8^k(s).
\end{cases}$$

The eight-lead ECGs $\{X_i\}$, $i = 1, \dots, n$, are a sample from $X$. Eigenfunction-eigenvalue couples $\{(e^k, \rho_k)\}_{k \in \mathbb{N}}$ completely explain the modes of variation in the data, in the sense that eigenfunctions represent orthonormal directions of decreasing variability, with the explained variances given by the corresponding eigenvalues. Thanks to the basis expansion given by principal components, it is possible to represent data using just the first $K$ elements of $\{e^k\}_{k \in \mathbb{N}}$, the linear combination of which is, by construction, a good approximation of the original curves. The interpretation of eigenvalues as variances is also useful to determine a choice criterion for the most relevant modes. Since $\sum_{k=1}^{K} \rho_k$ represents the variance captured by the first $K$ components, we can choose $K$ so that the proportion of variance described by these components is at least a given threshold $c$, i.e.

$$\frac{\sum_{k=1}^{K} \rho_k}{\sum_{k=1}^{m} \rho_k} \geq c, \tag{2}$$

where m is the number of abscissa values on which functional data are known, which is an upper bound for the number of components that can be estimated.
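As an illustration of criterion (2), the following minimal sketch returns the smallest $K$ whose leading eigenvalues reach the threshold $c$. The function name and the example eigenvalues are hypothetical and are not taken from the ECG analysis.

```python
def choose_num_components(eigenvalues, c):
    """Return the smallest K such that the first K eigenvalues explain at
    least a proportion c of the total variance, as in criterion (2)."""
    if not 0.0 < c <= 1.0:
        raise ValueError("c must lie in (0, 1]")
    # Sort in decreasing order so that components are ranked by variance.
    rho = sorted(eigenvalues, reverse=True)
    total = sum(rho)
    if total <= 0.0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for k, value in enumerate(rho, start=1):
        cumulative += value
        if cumulative / total >= c:
            return k
    return len(rho)  # reached only through floating-point rounding


# Hypothetical eigenvalues, for illustration only (not from the ECG data).
example_eigenvalues = [5.0, 2.5, 1.5, 0.5, 0.5]
K = choose_num_components(example_eigenvalues, c=0.70)
```

With these illustrative values, the first component explains 50% of the variance and the first two explain 75%, so a threshold of 70% selects two components.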

In the analysis of data, as the literature advises, we deal with centred and scaled data, i.e.

$$Z(t) = (Z_1(t), \dots, Z_8(t))^T = \left(\frac{X_1(t) - \mu_1(t)}{\sqrt{V_{11}(t, t)}}, \dots, \frac{X_8(t) - \mu_8(t)}{\sqrt{V_{88}(t, t)}}\right)^T.$$

2.2 Derivatives refinements

The problem with discrete and noisy observations is amplified when the derivatives of the data are taken into account. In our case, since the information on the presence of the disease is carried not only by the morphological changes we observe in the signals but also by their first derivatives, it is even more necessary to smooth data in a suitable way. In fact, the smoothing procedure is essential not only for an accurate reconstruction of data but also for a proper estimate of their derivatives (see Ieva et al.⁹ for a discussion of such arguments and a comparison of different derivative computations). Moreover, since the eight ECG leads of interest (I, II, V1, V2, V3, V4, V5 and V6) jointly describe the complex heart dynamics, the smoothing technique should take into account simultaneously all the components of the multivariate functional data (i.e. the leads).

Among possible smoothing methods, wavelet-based ones seem suitable for smoothing our data because every basis function is localized both in time and in frequency and is therefore able to capture strongly localized ECG features (peaks, oscillations, etc.). The procedure proposed in Pigoli and Sangalli²¹ can take into account the multi-dimensionality of the data, obtaining smoothed estimates of the eight-dimensional curves of the ECGs. In the case presented below, a Daubechies wavelet basis with 10 vanishing moments is used. This basis is smooth enough to allow the computation of the first derivative of the estimated functional data, too. Wavelet methods have the advantage of providing an estimate of the derivatives of the curves, which is straightforward when the functional reconstruction is obtained via a basis expansion. Indeed, each derivative can be obtained simply as a linear combination of the corresponding basis function derivatives.
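In symbols, the last remark can be written as follows. The coefficients $c_{rj}$, the basis functions $\psi_j$ and the number of basis elements $J$ are notation introduced here for illustration only; they are not taken from the original procedure.

```latex
\hat{X}_r(t) = \sum_{j=1}^{J} c_{rj}\, \psi_j(t)
\quad \Longrightarrow \quad
\hat{X}_r'(t) = \sum_{j=1}^{J} c_{rj}\, \psi_j'(t),
\qquad r = 1, \dots, 8 .
```

The same estimated coefficients are therefore used for the curve and for its derivative, provided the basis functions are differentiable.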

Once we have the smoothed data and derivatives, we carry out the same dimensionality reduction proposed in Section 2.1 on the covariance operator of the first derivatives as well. This makes it possible to take into account the variability of the first derivatives when modelling the risk prediction. In so doing, we fully exploit the functional nature of the data.

2.3 Generalized regression with multivariate functional predictors

We now consider a logistic regression model, where the response variable is $Y_i \sim \mathrm{Be}(p_i)$ for $i = 1, \dots, n$ and $\theta_i = \log(p_i / (1 - p_i))$. We model $\theta_i$ as a linear transformation of the covariates related to the i-th statistical unit, i.e.

$$\theta_i = \int_I \delta^T(t)\, Z_i(t)\, dt + \int_I \delta_d^T(t)\, Z_i'(t)\, dt + \sum_{h=1}^{q} d_{ih}\, \gamma_h, \tag{3}$$

$Z_i(t)$ being the centred and scaled multivariate functional data concerning the i-th statistical unit, and $Z_i'(t)$ the corresponding first derivatives. The vector $d_i = (d_{i1}, \dots, d_{iq})^T \in \mathbb{R}^q$, for $i = 1, \dots, n$, contains the traditional covariates that are possibly available for the i-th statistical unit. Moreover, $\delta(t) : I \to \mathbb{R}^8$ and $\delta_d(t) : I \to \mathbb{R}^8$ are eight-variate functional parameters to be estimated, and $\gamma \in \mathbb{R}^q$ is a vector of parameters to be estimated. Thanks to the dimensional reduction driven by the selection of the number of multivariate functional principal components obtained using condition (2), the linear predictor in equation (3) may be approximated by the following expression:

$$\int_I \sum_{k=1}^{K} \xi_i^k\, \delta(t)^T e^k(t)\, dt + \int_I \sum_{k=1}^{K_d} \tilde{\xi}_i^k\, \delta_d(t)^T e_d^k(t)\, dt + \sum_{h=1}^{q} d_{ih}\, \gamma_h,$$

where $\xi_i^k = \langle Z_i, e^k \rangle$, $\tilde{\xi}_i^k = \langle Z_i', e_d^k \rangle$ and $\{e_d^k\}_{k \in \mathbb{N}}$ is the basis given by the principal components of the derivatives. We can also represent $\delta(t)$ and $\delta_d(t)$ using the corresponding Karhunen–Loève expansions, i.e. $\delta(t) = \sum_{l=1}^{K} \zeta_l\, e^l(t)$, where $\zeta_l = \langle \delta, e^l \rangle$, and $\delta_d(t) = \sum_{l=1}^{K_d} \zeta_d^l\, e_d^l(t)$, with $\zeta_d^l = \langle \delta_d, e_d^l \rangle$. Thanks to the orthonormality of $\{e^k\}_{k \in \mathbb{N}}$ and of $\{e_d^k\}_{k \in \mathbb{N}}$, we obtain

$$\theta_i = \sum_{k=1}^{K} \xi_i^k\, \zeta_k + \sum_{k=1}^{K_d} \tilde{\xi}_i^k\, \zeta_d^k + \sum_{h=1}^{q} d_{ih}\, \gamma_h = \underbrace{\xi_i^T \zeta}_{\text{data contribution}} + \underbrace{\tilde{\xi}_i^T \zeta_d}_{\text{derivatives contribution}} + \underbrace{d_i^T \gamma}_{\text{covariates contribution}}, \qquad i = 1, \dots, n. \tag{4}$$

The model is then reduced to a classical logistic regression model in which the unknowns are the parameters $\zeta = (\zeta_1, \dots, \zeta_K)^T$, $\zeta_d = (\zeta_d^1, \dots, \zeta_d^{K_d})^T$ and $\gamma = (\gamma_1, \dots, \gamma_q)^T$. The same approach can be extended without further difficulties to generalized linear models with different responses and link functions.
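For completeness, the orthonormality step behind equation (4) can be spelled out for the data term; the derivative term is handled in the same way with $e_d^k$, $\tilde{\xi}_i^k$ and $\zeta_d^k$. This is a sketch under the truncations at $K$ components used in the text, with $e^k$ the eigenfunctions, $\xi_i^k = \langle Z_i, e^k \rangle$ the scores and $\zeta_l = \langle \delta, e^l \rangle$ the coefficients of $\delta$.

```latex
\int_I \delta(t)^T Z_i(t)\, dt
  = \langle \delta, Z_i \rangle
  \approx \Big\langle \sum_{l=1}^{K} \zeta_l\, e^l ,\; \sum_{k=1}^{K} \xi_i^k\, e^k \Big\rangle
  = \sum_{l=1}^{K} \sum_{k=1}^{K} \zeta_l\, \xi_i^k\, \langle e^l, e^k \rangle
  = \sum_{k=1}^{K} \xi_i^k\, \zeta_k ,
\qquad \text{since } \langle e^l, e^k \rangle =
  \begin{cases} 1 & l = k, \\ 0 & l \neq k. \end{cases}
```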

3 An application to ECG signals

In Ieva et al.,⁹ a statistical framework has been proposed for the analysis and classification of ECG curves starting from the modification of morphology induced by the presence of disease. The main goal of that paper is to identify specific ECG patterns which could benefit from an early invasive approach. In fact, the identification of statistical tools capable of classifying curves using only the modifications in shape induced by the presence of disease could support an early detection of heart failures.

The basic statistical unit is the eight-variate function (the ECG) which describes the heart dynamics of each patient on the eight leads I, II, V1, V2, V3, V4, V5 and V6, together with the corresponding derivatives. Here, the outcome we consider is the group label, indicating the presence of the disease. It is modelled by a Bernoulli random variable $Y_i$, which takes value 1 if LBBB is diagnosed, and 0 if the trace is physiological. As mentioned before, in this work we analyse ECG traces from the PROMETEO data warehouse. Each file contained in PROMETEO can be associated with three sub-files. The first is called Details and consists of technical information, useful for signal processing and analysis. More precisely, it includes wave repolarization and depolarization times, landmarks indicating onset and offset times of the main ECG subintervals and an automatic diagnosis, established by the commercial Mortara-Rangoni VERITAS™ algorithm. We used these automatic diagnoses to label the ECG traces we analysed. The second sub-file is called Rhythm and contains the output of an ECG recorder. Specifically, it registers 10 s (10,000 sampled points) of the ECG signal. The third sub-file is called Median. It is built from the Rhythm file and depicts a reference beat lasting 1.2 s on a grid of 1200 points. We carried out the analysis using only the Median files, i.e. using eight curves (one for each ECG lead) for each patient, representing the patient’s ‘Median’ beat for that lead. This representative heartbeat is a trace of a single cardiac cycle, i.e. of a P wave, a QRS complex, a T wave and a U wave, the last of which is normally visible in only 50–75% of ECGs.

The sample we analyse consists of the ECG signals of $n = 149$ subjects, among which 101 are Normal and 48 are affected by LBBB. The main reason for the sample size being relatively small is that, among the eligible traces of LBBB, we retained only ‘pure LBBB’ diagnoses. This means that we selected diagnoses of LBBB with no further comorbidities. In fact, it often happens that LBBB arises together with other comorbidities (say atrial fibrillation, atrioventricular block, atrial flutter, paroxysmal and supraventricular tachycardia, etc.). A crucial point in the analysis is that the morphological variations in the ECGs must be induced only by the presence of LBBB, at least in the training set on which we compute the multivariate functional principal component basis. Thus, in order to avoid a priori the biases carried by the presence of other comorbidities, we excluded traces where LBBB was not the only diagnosis. In so doing, the number of traces that we considered as really eligible was considerably reduced.

The novelty of the analysis presented here is to set up a generalized linear model able to discriminate between pathological and physiological traces, explaining the disease probability by means of multivariate functional predictors, i.e. ECG signals and their first derivatives. The contribution of the multivariate curves is summarized through the dimensional reduction carried out by multivariate functional principal component analysis of the covariance operators of signals and signals’ first derivatives proposed in Sections 2.1 and 2.2, in order to take advantage of the functional nature of the data.

In practice, we deal with a noisy and discrete observation of the function describing the ECG trace of each patient. We use the wavelet-based smoothing technique for multivariate curves proposed in Pigoli and Sangalli²¹ to obtain the smoothed estimates of the eight-dimensional ECG signals and their first derivatives. Moreover, since each patient has her/his own ‘biological’ time, the same event of the heart dynamics may happen at different times for different patients. Since the morphological change due to this difference is misleading from a statistical perspective, we need to register the data. It is well known that a correct separation between the different kinds of variability is necessary for a successful analysis.¹⁶ In particular, as detailed in Ieva et al.,⁹ we adopt a registration procedure based on landmarks, which are points of the curve that can be associated with a specific biological time. Five of these landmarks (P_onset, QRS_onset, QRS_offset, T_onset and T_offset) are provided by the Mortara-Rangoni procedure. They identify, for each patient $i = 1, \dots, n$, the P wave, the QRS complex and the T wave, i.e. the main segments and waves of the ECG signal. We add one more landmark: the R peak identified on lead I ($R_{peak}$). We choose the time point identified on this lead as representative for all the leads because only on lead I do both the physiological and the pathological ECG traces present a clearly identifiable R peak. Then, since all the leads capture the same heart dynamics, the biological time must be the same.

Figure 1 shows the denoised and registered data we consider for our analysis. The black solid lines represent the mean functions. Figure 2 shows the corresponding first derivatives. Again, the black solid lines represent the mean functions.

Figure 1.

Denoised and registered data (eight leads) for the 149 patients, with the mean functions superimposed (black solid lines).

Figure 2.

First derivatives (eight leads) for the 149 patients, with the mean functions superimposed (black solid lines).

We shall now select the multivariate functional principal components to be considered in the subsequent analysis, both for the ECG signals and for their first derivatives. In both cases, we choose the first $K$ and $K_d$ components of the data and derivative bases, respectively, such that their associated eigenvalues explain a proportion of variance equal to 70%.

Among these, we retained only the first principal components, using the corresponding scores as covariates. The scores are computed projecting data and first derivative on the first elements of the corresponding multivariate functional principal component basis. We retained only the first multivariate functional principal component for two reasons:

– both a stepwise selection based on the Akaike Information Criterion (AIC) and a Brier score minimization criterion selected the scores on the first and tenth components as the significant ones to explain the disease probability, but

– it is known that the efficiency of the estimates of the eigenfunctions and of the corresponding scores decreases as the index of the eigenfunction increases.

The conditional likelihood ratio test to choose between nested models with functional covariates, proposed in Escabias et al.,¹⁷ shows strong statistical evidence ($p = 0.002$) in favour of also retaining the derivative as a functional covariate in the model. Moreover, the results of the risk prediction obtained with this parsimonious choice remain very robust with respect to those obtained considering more than one multivariate functional principal component (as detailed below). The scores of the first principal component are then the only ones identified as statistically significant for the generalized linear regression model, both for the data and the first derivatives. Figure 3 shows the distributions of the first principal component scores, for the data (left panel) and the first derivatives (right panel), stratified by the presence/absence of LBBB. The p values of the Wilcoxon tests carried out to compare the distributions of the scores are less than $2 \times 10^{-16}$ in both cases.

Figure 3.

Distributions of first principal components scores, stratified by the presence of disease, for the data (left panel) and the first derivatives (right panel).

Thus we fitted the following logistic regression model: for $i = 1, \dots, n$,

$$\theta_i = \log(p_i / (1 - p_i)) = \gamma_0 + \xi_i^1\, \zeta_1 + \tilde{\xi}_i^1\, \zeta_d^1. \tag{5}$$

It arises from model (4), where $K = K_d = 1$ and no further patient covariates are available. The model output is reported in Table 1.

Table 1.

Estimates, standard errors and p values for the parameters of the logistic model.

Parameter	Estimate	Standard error	p value
$\gamma_0$ (Intercept)	−0.07148	0.53112	0.892938
$\zeta_1$ (First PC)	0.16941	0.04695	0.000308
$\zeta_d^1$ (First PC derivative)	0.16304	0.06352	0.010262
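To make the use of Table 1 concrete, the sketch below evaluates model (5) with the reported estimates and returns the estimated probability of LBBB for a given pair of scores. The function names, the example scores and the convention that a trace is labelled LBBB when the probability strictly exceeds the threshold are assumptions made for illustration.

```python
import math

# Estimates reported in Table 1.
GAMMA_0 = -0.07148   # intercept
ZETA_1 = 0.16941     # coefficient of the first PC score of the data
ZETA_D_1 = 0.16304   # coefficient of the first PC score of the derivatives


def lbbb_probability(score, score_derivative):
    """Estimated probability of LBBB from model (5), given the two scores."""
    theta = GAMMA_0 + ZETA_1 * score + ZETA_D_1 * score_derivative
    # Numerically stable inverse logit.
    if theta >= 0.0:
        return 1.0 / (1.0 + math.exp(-theta))
    z = math.exp(theta)
    return z / (1.0 + z)


def classify(score, score_derivative, threshold=0.5):
    """Label a trace 'LBBB' when the estimated probability exceeds the
    threshold (strict inequality is an assumption of this sketch)."""
    p = lbbb_probability(score, score_derivative)
    return "LBBB" if p > threshold else "Normal"


# Hypothetical scores, for illustration only.
p_at_origin = lbbb_probability(0.0, 0.0)
label_high = classify(20.0, 20.0)
label_low = classify(-20.0, -20.0)
```

Since both slope estimates are positive, larger scores on either first principal component correspond to a higher estimated probability of LBBB.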

Figure 4 shows the first multivariate functional principal component of the data. Sample means of each lead are plotted (solid lines), together with two curves obtained by adding (+) and subtracting (−) a suitable multiple $C$ of the principal component. As suggested in Ramsay and Silverman,¹⁶ we set $C$ as 0.2 times the root-mean-square difference between the estimated mean $(\hat{\mu}_1(t), \dots, \hat{\mu}_8(t))$ and its overall time average, i.e. $\bar{\mu} = \frac{1}{8} \sum_{i=1}^{8} \int_I \hat{\mu}_i(t)\, dt$. From Figure 4, it is clear that the first functional principal component describes the morphological variability expressed by specific segments of the ECG. These morphological changes are particularly marked in the ST-segment (the part of the ECG curve usually falling in the time interval between 350 and 600 ms), which in fact is the most useful part of the ECG, apart from the QRS complex, for carrying out the LBBB diagnosis, as confirmed by the cardiologists.

Figure 4.

First multivariate functional principal component.

The confusion matrix obtained by comparing the true and the estimated labels of the patients is reported in Table 2. We set the threshold for the classification carried out by the logistic model in equation (5) equal to 0.5. The corresponding false-positive and false-negative rates are, respectively, 8.3% (4 of the 48 LBBB traces classified as Normal) and 0.99% (1 of the 101 Normal traces classified as LBBB). The correct classification rate is 96.64% and Cohen’s κ statistic is 0.95.

Table 2.

Confusion matrix.

Estimated label	True Normal	True LBBB
Classified as normal	100	4
Classified as LBBB	1	44

LBBB: Left Bundle Branch Block.
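The rates quoted for the logistic classifier follow directly from the counts in Table 2, as the short computation below shows. The variable names are chosen here for illustration and deliberately describe each error by what is misclassified, so that they do not depend on which class is called "positive".

```python
# Counts from Table 2 (rows: estimated label, columns: true label).
normal_as_normal = 100
lbbb_as_normal = 4
normal_as_lbbb = 1
lbbb_as_lbbb = 44

n_normal = normal_as_normal + normal_as_lbbb   # 101 Normal traces
n_lbbb = lbbb_as_normal + lbbb_as_lbbb         # 48 LBBB traces
n_total = n_normal + n_lbbb                    # 149 traces

correct_rate = (normal_as_normal + lbbb_as_lbbb) / n_total
lbbb_missed_rate = lbbb_as_normal / n_lbbb         # LBBB classified as Normal
normal_flagged_rate = normal_as_lbbb / n_normal    # Normal classified as LBBB

print(f"correct classification rate: {100 * correct_rate:.2f}%")   # 96.64%
print(f"LBBB classified as Normal:   {100 * lbbb_missed_rate:.2f}%")   # 8.33%
print(f"Normal classified as LBBB:   {100 * normal_flagged_rate:.2f}%")  # 0.99%
```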

Moreover, following James and Hastie,¹⁸ we performed a two-means classification of the scores arising from projecting both data and derivatives on the corresponding first principal components, and a linear discriminant analysis as suggested in Wouters et al.¹⁹ The results in terms of false-positive rate provided by these procedures are much worse than those obtained with the procedure we propose in this paper. In fact, although the false-negative rate is zero in both cases, the false-positive rate is 48% for the two-means clustering analysis and 29% for the Linear Discriminant Analysis (LDA) procedure. The corresponding correct classification rates are 84.6% and 90.6%, respectively.

All the results indicate that the model can be considered a good classifier, providing reasonable performance, especially with respect to the benchmark of physicians’ practice in diagnosing LBBB.

Finally, we computed the mean cross validation (CV) error of the logistic model for testing its goodness of fit, obtaining a mean CV error equal to 3.6%.

We propose this method as an automatic diagnostic tool to predict the risk also for new patients entering the study. In fact, one of the most innovative aspects of the proposed methodology is the idea of computing off-line a basis to be used in an on-line procedure of risk prediction. Specifically, the off-line step consists of performing the multivariate functional principal component analysis on a given database of ECGs, then fitting a logistic regression model like equation (4) and obtaining the estimates of the coefficients and the number of basis components to be retained for prediction. Then, in the on-line step, for a new patient to be diagnosed, we project her/his ECG on the basis previously identified, in order to get a real-time computation of the scores corresponding to the new data and its first derivative, and of the estimated probability of disease.
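The on-line projection can be sketched as follows, assuming that the mean functions, the pointwise standard deviations and the first eigenfunction have been stored from the off-line step on a common, equally spaced time grid. The function names, the use of the trapezoidal rule to discretize the scalar product (1) and the toy values are assumptions of this sketch rather than the implementation used for the analysis.

```python
def trapezoid(values, dt):
    """Trapezoidal rule on an equally spaced grid with step dt."""
    if len(values) < 2:
        return 0.0
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))


def inner_product(u, w, dt):
    """Discretized scalar product (1): sum over the leads of the integral
    of the pointwise product. u and w are lists of leads (lists of samples)."""
    return sum(
        trapezoid([a * b for a, b in zip(u_r, w_r)], dt)
        for u_r, w_r in zip(u, w)
    )


def standardize(x, mean, sd):
    """Centre and scale each lead pointwise, as for Z(t) in Section 2.1."""
    return [
        [(v - m) / s for v, m, s in zip(x_r, mean_r, sd_r)]
        for x_r, mean_r, sd_r in zip(x, mean, sd)
    ]


def first_score(x_new, mean, sd, eigenfunction, dt):
    """Score of a new curve on a stored eigenfunction (on-line step)."""
    return inner_product(standardize(x_new, mean, sd), eigenfunction, dt)


# Toy check on two leads (hypothetical values, not ECG data).
m = 101
dt = 1.0 / (m - 1)
mean = [[0.5] * m, [-1.0] * m]
sd = [[2.0] * m, [0.5] * m]
eigenfunction = [[2 ** -0.5] * m, [2 ** -0.5] * m]  # unit norm w.r.t. (1)
x_new = [[2.5] * m, [-0.5] * m]                     # one sd above the mean
score = first_score(x_new, mean, sd, eigenfunction, dt)
```

The same function applied to the first derivative of the new curve, with the stored derivative mean, scale and eigenfunction, gives the second score; the two scores are then plugged into the fitted logistic model.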

In order to do this, we need to check the robustness of the basis to be used for the off-line estimation. According to the framework detailed above, we did it through a leave-j-out study. The procedure can be summarized as follows:

– step 1 choose a random subsample of j patients ( $j = 1, 5, 10, 20$ );

– step 2 carry out the multivariate functional principal component analysis on the remaining n–j patients;

– step 3 fit the logistic model for obtaining coefficients estimates and the number of basis components to be retained, both for data and derivatives;

– step 4 project the j chosen ECGs on the basis identified in step 2, getting the real-time computation of the scores corresponding to the new j data and their first derivatives, and the corresponding estimated probability of disease.

We repeated the experiment 500 times. The mean actual error rate (AER) over the 500 simulations and the corresponding standard deviation are reported in Table 3.

Table 3.

Mean and standard deviation of actual error rate.

Subsample size	Mean	Standard deviation
$j = 1$	0.062	0.241
$j = 5$	0.047	0.088
$j = 10$	0.055	0.068
$j = 20$	0.056	0.0491
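The leave-j-out procedure in steps 1 to 4 can be sketched generically as below. The callables `fit` and `predict` stand in for the functional principal component analysis plus logistic fit and for the projection plus classification, respectively; they, the function name, the fixed seed and the use of the population standard deviation are all assumptions of this sketch. The toy classifier at the end only exercises the loop and is unrelated to the ECG model.

```python
import random
import statistics


def leave_j_out(data, labels, j, fit, predict, n_repeats=500, seed=0):
    """Mean and standard deviation of the error rate on j held-out units.

    fit(train_data, train_labels) returns a fitted model;
    predict(model, x) returns the predicted label of one unit.
    """
    rng = random.Random(seed)
    n = len(data)
    error_rates = []
    for _ in range(n_repeats):
        held_out = set(rng.sample(range(n), j))             # step 1
        train = [i for i in range(n) if i not in held_out]
        model = fit([data[i] for i in train],               # steps 2 and 3
                    [labels[i] for i in train])
        errors = sum(predict(model, data[i]) != labels[i]   # step 4
                     for i in held_out)
        error_rates.append(errors / j)
    return statistics.mean(error_rates), statistics.pstdev(error_rates)


# Toy stand-ins (hypothetical): a midpoint threshold on scalar data.
def fit_midpoint(xs, ys):
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2.0


def predict_midpoint(threshold, x):
    return 1 if x > threshold else 0


toy_data = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
mean_err, sd_err = leave_j_out(toy_data, toy_labels, j=2,
                               fit=fit_midpoint, predict=predict_midpoint,
                               n_repeats=50)
```

Refitting the basis and the model inside the loop, rather than once on the full sample, is what makes the held-out error an honest check of the off-line step.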

In general, the idea is the following: once a reliable and representative dataset of N ECGs is given according to clinical best practice, the procedure we propose carries out the off-line step described above on the N multivariate curves, selecting a suitable number of components for the data and first derivative bases and providing the coefficients for the generalized regression model. Therefore, when new patients enter the study, the semi-automatic diagnosis tool projects their ECGs on the eigenfunctions selected in the off-line step and plugs in the score estimates as predictors in the logistic model for estimating the LBBB risk.

4 Conclusions

In this paper, we propose a generalized functional linear regression model for a binary outcome indicating the presence/absence of a cardiac disease, with multivariate functional data among the relevant predictors. This is an example of data which must be treated in the multivariate functional context. Such a framework, despite its evident interest, is quite rarely treated in the statistical literature.

The principal aim of this work is then the development of a new semi-automatic diagnostic procedure for risk prediction based on the ECG signals generated by the telemedicine equipment of the Basic Rescue Units. In fact, we set up a framework for carrying out semi-automatic diagnosis of LBBB, starting from the statistical analysis of the curve morphology alone. The method we propose is then aimed at supporting rescuers’ decisions. It is aimed at identifying specific ECG patterns which could benefit from an early invasive approach, performing a real-time diagnosis.

In particular, we focus on estimating the probability of belonging to the LBBB group, using as predictors the eight-lead ECG trace of each patient and its first derivative, which are inserted in a suitable generalized functional regression model. Specifically, we perform a dimensionality reduction by a multivariate functional principal component analysis, summarizing the information carried by the covariance operators of the signals and their first derivatives through the corresponding scores, obtained by projecting data and derivatives on the corresponding Karhunen–Loève bases. Then we introduce the scores into a generalized linear regression model where the response is the Bernoulli variable indicating the presence of LBBB. We finally carry out the classification of patients and we check the robustness of our method. To this end, we are working on making the estimation method for the regression parameters more robust through the use of a wider dataset of numerically simulated ECGs.

The innovative aspect of this paper lies in developing advanced statistical methods aimed at detecting pathological ECG traces (in particular, LBBB), starting only from the morphological features of the curves. This methodology enables diagnoses that are consistent with clinical practice, starting from purely statistical considerations. Further extensions of this work consist in enlarging the spectrum of acute cardiovascular diseases considered.

Footnotes

Acknowledgements

This work is part of PROMETEO (PROgetto sull’area Milanese Elettrocardiogrammi Teletrasferiti dall’Extra Ospedaliero). Data are provided by Mortara Rangoni Europe s.r.l. The authors wish to thank the 118 Dispatch Centre of Milan and Professor Maurizio Grasselli of Politecnico di Milano for stimulating discussions that gave us the impetus to improve the original version of the paper.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Canto, Rogers, Bowlby. The prehospital electrocardiogram in acute myocardial infarction: is its full potential being realized? National Registry of Myocardial Infarction 2 Investigators. J Am Coll Cardiol 1997; 29: 498–505.

2. Ting, Krumholz, Bradley. Implementation and integration of prehospital ECGs into systems of care for acute coronary syndrome. Circulation 2008; 118: 1066–1079.

3. Diercks, Kontos, Chen. Utilization and impact of prehospital electrocardiograms for patients with acute ST-segment elevation myocardial infarction: data from the NCDR (National Cardiovascular Data Registry) ACTION (Acute Coronary Treatment and Intervention Outcomes Network) registry. J Am Coll Cardiol 2009; 53: 161–166.

4. Brown, Mahmud, Dunford. Effect of prehospital 12-lead electrocardiogram on activation of the cardiac catheterization laboratory and door-to-balloon time in ST-segment elevation acute myocardial infarction. Am J Cardiol 2008; 101: 158–161.

5. Trivedi, Schuur, Cone. Can paramedics read ST-segment elevation myocardial infarction on prehospital 12-lead electrocardiograms? Prehospital Emergency Care 2009; 13: 207–214.

6. Ieva, Paganoni. Multilevel models for clinical registers concerning STEMI patients in a complex urban reality: a statistical analysis of MOMI² survey. Commun Appl Ind Math 2010; 1: 128–147.

7. Grieco, Ieva, Paganoni. Performance assessment using mixed effects models: a case study on coronary patient care. IMA J Manage Math 2012; 23: 117–131.

8. Grieco, Corrada, Sesana. Mortality and ST resolution in patients admitted with STEMI: the MOMI survey of emergency service experience in a complex urban area. Eur Heart J: Acute Cardiovasc Care 2012; 1: 192–199.

9. Ieva, Paganoni, Pigoli. Multivariate functional clustering for the analysis of ECG curves morphology. J R Stat Soc – Ser C 2013; 62: 401–418.

10. Berrendero, Justel, Svarc. Principal components for multivariate functional data. Comput Stat Data Anal 2011; 55: 2619–2634.

11. James. Generalized linear models with functional predictors. J R Stat Soc – Ser B 2002; 64: 411–432.

12. Ratcliffe, Heller, Leader. Functional data analysis with application to periodically stimulated foetal heart rate data. II: Functional logistic regression. Stat Med 2002; 21: 1115–1127.

13. Escabias, Aguilera, Valderrama. Principal component estimation of functional logistic regression: discussion on two different approaches. Nonparametric Stat 2004; 16: 365–384.

14. Müller, Stadtmüller. Generalized functional linear models. Ann Stat 2005; 33: 774–805.

15. Zhu, Cox. A functional generalized linear model with curve selection in cervical pre-cancer diagnosis using fluorescence spectroscopy. IMS Lecture Notes – Monogr Ser (Optimality: The Third Erich L. Lehmann Symposium) 2009; 57: 173–189.

16. Ramsay, Silverman. Functional data analysis, 2nd ed. New York: Springer, 2005.

17. Escabias, Valderrama, Aguilera. Stepwise selection of functional covariates in forecasting peak levels of olive pollen. Stoch Environ Res Risk Assess 2013; 27: 367–376.

18. James, Hastie. Functional linear discriminant analysis for irregularly sampled curves. J R Stat Soc – Ser B 2001; 63: 533–550.

19. Wouters, Cortinas Abrahantes, Molenberghs. A comparison of doubly hierarchical discriminant analyses for multiple class longitudinal data from EEG experiments. J Biopharm Stat 2008; 18: 1120–1135.

20. R Development Core Team. R: a language and environment for statistical computing [online]. Vienna, Austria: R Foundation for Statistical Computing, http://www.R-project.org (2009, accessed 21 June 2013).

21. Pigoli, Sangalli L.M. Wavelets in functional data analysis: Estimation of multidimensional curves and their derivatives. Comput Stat Data Anal 2012; 56: 1482–1498.