Predicting Long-COVID Sequelae: A Multi-Label Classification Approach

Abstract

We present a study on the prediction of long-COVID sequelae through multi-label classification (MLC). Data on more than $300$ patients were collected during a long-COVID study at Ospedale Maggiore of Novara (Italy), covering their baseline situation as well as their condition at acute COVID-19 onset. The goal is to predict the presence of specific long-COVID sequelae after a one-year follow-up. To make the analysis more representative, we investigated both augmenting the dataset, by considering situations where different numbers of complications could arise, and reducing the number of features used for prediction. For augmentation, we considered MLSmote under six different data augmentation policies; for feature reduction, we generated new datasets via a supervised and an unsupervised dimension reduction approach (Relief and PCA, respectively). A representative set of MLC approaches has been tested on all the available datasets. Results have been evaluated in terms of Accuracy, Exact Match, Hamming score and macro-averaged AUC; they show that MLC methods can be useful for predicting specific long-COVID sequelae under the different conditions represented by the considered datasets. In addition, interpretability of the results has been addressed through an approach based on the SHAP method, showing that clinical interpretations of specific predictions can be captured by the method, and that data augmentation techniques do not harm such explanations.

Keywords

multi-label classification, data augmentation, long-COVID syndrome

Introduction

In recent years, the intelligent analysis of clinical data has become a cornerstone of modern biomedical research. An increasing number of projects focus on the definition of biomedical databanks and on their analysis through machine and deep learning techniques. Notably, databanks can be fed from different sources (e.g., electronic health records, clinical trial data, laboratory analysis results), which are usually updated over time. In such contexts, it is useful to provide practitioners with direct and easy access to sophisticated AI-driven data analysis.

In this paper, we move a step forward in such a direction, by studying how machine learning techniques can be adopted in order to work with a real-world databank. This work is part of the TECNOMED-HUB research project (Tecnomed-hub, 2023), which aims at building a research platform supporting the collection and intelligent analysis of long-COVID data, but with the long-term goal of being applicable to a wide range of other diseases. In particular, the goal of this study is to define a framework for predicting long-COVID sequelae, based on patients’ clinical data collected during the acute phase of COVID infection.

Following the characterization in Nalbandian (2021), post(long)-COVID-19 syndrome consists of signs and symptoms (sequelae) consistent with COVID-19 that are present beyond 12 weeks of the onset of acute COVID-19 infection, and not ascribable to alternative causes (i.e., other diseases). From the machine learning point of view, considering the syndrome to be defined as the persistence of at least one of such symptoms, the characterization of the problem can be viewed as a Multi-Label Classification problem (see Section “Multi-label Classification”), where instances are the patients’ data collected at hospitalization, and the labels are the long-COVID symptoms persisting at follow-up.

Notably, this kind of task characterizes a wide range of applications in the field of individualized predictive modelling, where the onset of comorbidities is analyzed in a precise context; examples are reported for diabetes (Zhou et al., 2021), heart failure (Huang et al., 2023) and dyspnea (Baarts et al., 2021). Differently from these works, our study focuses on the prediction of specific symptoms rather than on the risk of comorbidities. This makes the clinical context slightly different, since a physician who suspects the onset of a given symptom can directly take it into account (for instance with specific therapies avoiding undesirable effects of such symptoms).

Coping with individualized predictive modelling in real-world settings implies working in a low-data regime with unbalanced labels. Two different solutions can be adopted: upsampling techniques and feature reduction. However, both have shortcomings. Upsampling may conflict with the evidence-based nature of medical research, where synthetic data has to be considered carefully. On the other hand, even if feature reduction may partially alleviate the curse of dimensionality (and thus mitigate the high variance of a model learned from a limited training set), there is no guarantee that only relevant features are retained; in particular, with a supervised approach the limited number of cases may lead to an incorrect estimate of the actual correlation between the features and the target attributes, while an unsupervised approach can only rely on a limited set of examples to characterize the most important features (for instance in terms of data variance).

To this end, one of the goals of our work is to evaluate which upsampling and feature reduction techniques can be considered a valid trade-off between the need for a more balanced and stable dataset and keeping the synthetic data close to the real-world ones.

The results of the work described in this paper are twofold:

– from the medical point of view, we investigated the correlation between long-COVID syndrome and the patients’ data, providing a framework to predict long-COVID sequelae;

– from the technological perspective, we defined a framework for the multi-label classification of symptoms working in a low-data regime, which can be easily adapted to different tasks in the field of individualized predictive modelling.

The paper is organized as follows: Section “Multi-label Classification” summarizes the MLC framework and the adopted algorithms and evaluation criteria; Section “The Case Study: Long-COVID Syndrome” presents the case study with the characterization of the collected data; Section “Experimental Framework” describes the experimental part whose results are reported in Section “Results”; Section “Interpretability Analysis” presents an interpretability analysis of the obtained predictions, both in terms of clinical implications as well as of the impact of data augmentation techniques; final conclusions are then reported in Section “Conclusions”.

Multi-label Classification

Multi-Label Classification (MLC) can be defined as follows (Bogatinovski et al., 2022; Madjarov et al., 2012). Let us consider an instance space $X = \{x_{1}, \dots, x_{n}\}$ where each $x_{i} \in X$ is a tuple of size $D = | x_{i} |$, a label space $L = \{λ_{1}, \dots, λ_{Q}\}$ of $Q = | L |$ possible labels, a set of instances $E = \{(x_{i}, Y_{i}) \mid x_{i} \in X, Y_{i} \subseteq L, 1 \leq i \leq n\}$, and a quality criterion $q$. The objective of MLC is to find a function $h : X \to 2^{L}$ that maximizes $q$. The goal is to obtain a function $h$ able to predict the subset of labels associated with a given example. To this end, the problem is usually tackled by representing the set of labels as a binary vector of size $Q$, where $0$ means that the label is absent (or not predicted) and $1$ means that the label is present (or predicted). The problem can be generalized to label ranking by learning a function $h : X \times L \to \mathbb{R}$ such that $h (x_{i}, λ_{j}) = r$ is the prediction score for label $λ_{j}$ in the example $x_{i}$. Usually, the score is the probability of label $λ_{j}$ given $x_{i}$, and the presence of the label is predicted if the score exceeds a given threshold (usually $0.5$). Given an instance $(x_{i}, Y_{i}) \in E$ we define $Y_{i}$ as the label-set of $x_{i}$. Usually, the number of label-sets in a given instance space (dataset) is much smaller than $2^{| L |}$, making the label-set space very sparse.
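The binary-vector representation and the thresholding step can be illustrated with a minimal Python sketch. The label names and the helper functions `to_binary_vector` and `predict_label_set` are hypothetical and introduced only for illustration; they are not part of the software used in the study.

```python
# Hypothetical label names, used only for illustration.
LABELS = ["asthenia", "cough", "anosmia"]

def to_binary_vector(label_set, labels=LABELS):
    """Encode a label-set Y_i as a 0/1 vector of size Q = len(labels)."""
    return [1 if lab in label_set else 0 for lab in labels]

def predict_label_set(scores, labels=LABELS, threshold=0.5):
    """Turn per-label scores h(x_i, lambda_j) into a predicted label-set.

    A label is predicted only when its score strictly exceeds the threshold.
    """
    return {lab for lab, s in zip(labels, scores) if s > threshold}

y = to_binary_vector({"cough", "anosmia"})   # ground-truth label-set as a vector
z = predict_label_set([0.2, 0.7, 0.5])       # predicted label-set from scores
```

Whether a score exactly equal to the threshold counts as a prediction is a convention; the sketch assumes a strict inequality.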

MLC Methods

MLC can be approached in two different ways: problem transformation and algorithm transformation (Bogatinovski et al., 2022). Methods in the former category transform the multi-label dataset into one or more datasets that are then handled by traditional single-label classification algorithms; they finally build one or multiple single-label models. Methods in the latter category adapt traditional single-label algorithms, such as decision trees, functional models (SVM or NN), instance-based models and probabilistic models, to the multi-label setting.

In the present study we concentrate on problem transformation approaches, which are the most widely adopted in MLC; in fact, most algorithm transformation methods actually rely on an internal problem transformation in order to solve the MLC task. Moreover, this choice allows us, in principle, to experiment with a larger number of base classifiers.

Problem transformation methods can be further divided into binary, multi-class and ensemble methods (Bogatinovski et al., 2022). In binary methods, each label is considered in turn to produce a set of single-target binary datasets following a one-vs-all strategy; the result is the construction of $| L |$ binary classifiers, one for each possible label. Methods belonging to this category are Binary Relevance (BR; Gibaja & Ventura, 2015) and different versions of classifier chains such as CC (Read et al., 2011) or BCC (Zaragoza et al., 2011). Multi-class methods, on the contrary, build one single multi-class classifier: the target class has one possible value for each distinct label-set occurring in the data. The Label Powerset or Label Combination (LC) algorithm (Gibaja & Ventura, 2015) and the Pruned Set (PS) method (Read et al., 2008) are examples of this kind of approach. Finally, ensemble methods adopt ensemble-like techniques in order to train multiple single-label classifiers; important representatives of such methods are the Conditional Dependency Network (CDN; Guo & Gu, 2011) and the RAkEL (Tsoumakas et al., 2011) algorithms.
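The difference between the binary and the multi-class transformation can be made concrete with a small Python sketch. It shows only how the targets are derived, not the classifiers themselves; the function names and the toy label-sets are assumptions made for illustration.

```python
def binary_relevance_targets(label_sets, labels):
    """Binary transformation: one 0/1 target column per label (one-vs-all)."""
    return {lab: [1 if lab in ys else 0 for ys in label_sets] for lab in labels}

def label_combination_targets(label_sets):
    """Multi-class transformation: each distinct label-set becomes one class id."""
    class_ids = {}
    targets = []
    for ys in label_sets:
        key = frozenset(ys)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # next unused class id
        targets.append(class_ids[key])
    return targets, class_ids

# Toy label-sets (hypothetical patients).
label_sets = [{"cough"}, {"cough", "asthenia"}, set(), {"cough"}]
br = binary_relevance_targets(label_sets, ["asthenia", "cough"])
lc, classes = label_combination_targets(label_sets)
```

In the binary case the number of models grows with the number of labels, while in the multi-class case the number of classes grows with the number of distinct label-sets observed in the data.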

MLC: Characterization and Evaluation Measures

MLC dataset characterization is a very important aspect, especially in terms of label distribution and balancing. Different measures can be used to characterize a dataset for MLC (Tarekegn et al., 2021) starting from basic measures such as number of instances, number of attributes, number of labels, number of distinct label-sets. Another characterization can be given in terms of label distribution through measures such as label cardinality (average number of labels in the examples) and label density (the cardinality divided by the number of labels), as reported in equation (1):

$$\mathit{Card}(E) = \frac{1}{n} \sum_{i = 1}^{n} | Y_{i} |, \qquad \mathit{Dens}(E) = \frac{1}{| L |} \mathit{Card}(E). \tag{1}$$
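Equation (1) translates directly into code. The following Python sketch uses toy label-sets and hypothetical function names; it is meant only to illustrate the two definitions.

```python
def cardinality(label_sets):
    """Card(E): average number of labels per instance."""
    return sum(len(ys) for ys in label_sets) / len(label_sets)

def density(label_sets, n_labels):
    """Dens(E): cardinality divided by the number of labels |L|."""
    return cardinality(label_sets) / n_labels

# Toy example with |L| = 4 labels and n = 4 instances.
toy = [{"a"}, {"a", "b"}, set(), {"c"}]
card = cardinality(toy)
dens = density(toy, n_labels=4)
```

For the toy data, four labels are spread over four instances, so the cardinality is $1$ and the density is $0.25$.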

Imbalance level measures are also important in order to characterize how frequent or rare certain labels are. The most relevant measures are the Imbalance Ratio per Label ($\mathit{IRLbl}$) and the Mean Imbalance Ratio ($\mathit{MeanIR}$), which are reported in equations (2) and (4) respectively:

$$\mathit{IRLbl}(λ) = \frac{\max_{λ' \in L} \sum_{i = 1}^{n} h (λ', Y_{i})}{\sum_{i = 1}^{n} h (λ, Y_{i})}, \tag{2}$$

$$h (λ, Y_{i}) = \begin{cases} 1 & \text{if } λ \in Y_{i} \\ 0 & \text{if } λ \notin Y_{i} \end{cases}; \tag{3}$$

$$\mathit{MeanIR} = \frac{1}{| L |} \sum_{λ \in L} \mathit{IRLbl}(λ). \tag{4}$$

$\mathit{IRLbl}(λ)$ is the ratio between the occurrence of the majority label and the current label $λ$; $\mathit{IRLbl} \geq 1$ ($1$ only for majority labels), and the larger the value the greater the imbalance of the label. $\mathit{MeanIR}$ characterises the level of imbalance of the whole dataset. In addition, the standard coefficient of variation of $\mathit{IRLbl}$ ($\mathit{CVIR}$, the ratio between standard deviation and mean) can be useful to measure whether labels experience a similar level of imbalance or whether there are large differences, in terms of imbalance, among them.
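As an illustration, the imbalance measures can be computed as in the following Python sketch. The function names are hypothetical; the sketch assumes that every label occurs at least once (otherwise the ratio is undefined) and uses the sample standard deviation for the coefficient of variation, which is an assumption about the exact convention.

```python
from statistics import mean, stdev

def label_counts(label_sets, labels):
    """Number of instances in which each label occurs."""
    return {lab: sum(1 for ys in label_sets if lab in ys) for lab in labels}

def irlbl(label_sets, labels):
    """Imbalance ratio per label: majority-label count over this label's count."""
    counts = label_counts(label_sets, labels)
    top = max(counts.values())
    # Assumes every label occurs at least once in the data.
    return {lab: top / counts[lab] for lab in labels}

def mean_ir(ir):
    """Mean imbalance ratio over all labels."""
    return mean(ir.values())

def cvir(ir):
    """Coefficient of variation of IRLbl (sample standard deviation assumed)."""
    return stdev(ir.values()) / mean(ir.values())

toy = [{"a"}, {"a", "b"}, {"a"}, {"b", "c"}]
ir = irlbl(toy, ["a", "b", "c"])   # a occurs 3 times, b twice, c once
```

In the toy data the majority label `a` has ratio $1$, while the rarest label `c` has ratio $3$.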

Finally, the Scumble metric (Charte et al., 2019) provides a way to understand the level of concurrence between minority and majority labels; values are in the $[0, 1]$ range, and the higher the value the more instances sharing minority and majority labels exist in the dataset. It is based on the computation of the Atkinson index (Atkinson, 1970) over the $I R L b l$ of the labels occurring in each instance; the final score is the average over all the instances of the dataset (see Charte et al., 2019 for details). A small value of Scumble (usually $\leq 0.1$ ) denotes a low concurrency between minority and majority labels; this is a proxy for the possibility of adopting data augmentation techniques without the risk of increasing in a significant way the label imbalance. In particular, upsampling techniques such as MLSmote (Charte et al., 2015) are well justified and applicable when the dataset Scumble is low (Charte et al., 2019; Rana et al., 2023).

Concerning evaluation measures for MLC, several metrics have been proposed with very different aims. They can be categorized as bipartition-based and ranking-based (Gibaja & Ventura, 2015). In our setting, bipartition-based metrics are more relevant, since the emphasis is on predicting the right set of long-COVID symptoms rather than a correct ranking. In the present study, we therefore consider some of the most popular and natural bipartition-based metrics. Let $Y_{i}$ be the ground-truth label-set of instance $x_{i}$, $Z_{i}$ the predicted label-set and $Δ$ the symmetric set difference operator:

$\mathit{HammingScore} = 1 - \mathit{HammingLoss}$, where $\mathit{HammingLoss}$ evaluates how many times, on average, an example-label pair is misclassified:

$$\mathit{HammingLoss} = \frac{1}{n} \sum_{i = 1}^{n} \frac{| Y_{i} Δ Z_{i} |}{| L |}.$$

Accuracy or Jaccard Index evaluates the average proportion of labels correctly classified on the total number (predicted and actual) of labels, and averaged over all instances:

$$\mathit{Accuracy} = \frac{1}{n} \sum_{i = 1}^{n} \frac{| Y_{i} \cap Z_{i} |}{| Y_{i} \cup Z_{i} |}.$$

ExactMatch evaluates the percentage of label-sets that are correctly predicted as a whole:

$$\mathit{ExactMatch} = \frac{1}{n} \sum_{i = 1}^{n} 1 (Y_{i} = Z_{i}).$$

Notice that the HammingScore is somewhat lenient, since when several labels are absent in a label-set, the score may find several correct matches; however, it is an important metric since the correct prediction of an absent label should be relevant in several applications, as in our case study. On the contrary, Exact Match is a very strict measure, since partial predictions are completely ruled out. A somewhat intermediate metric is the Accuracy/Jaccard Index, where however, due to the nature of an MLC problem, one can hardly expect to get results close to $1$ (as can be the case in single-label classification). A Jaccard Index of about $0.65$ to $0.7$ can be regarded as a good result (see Madjarov et al., 2012).
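The three bipartition-based metrics can be sketched in Python as follows. The function names and toy data are hypothetical; in addition, the sketch assumes that two empty label-sets count as a perfect Jaccard match, a convention that the formula leaves open because the union is empty in that case.

```python
def hamming_score(Y, Z, n_labels):
    """1 - HammingLoss, with the loss averaged over instances and labels."""
    loss = sum(len(y ^ z) / n_labels for y, z in zip(Y, Z)) / len(Y)
    return 1.0 - loss

def jaccard_accuracy(Y, Z):
    """Average of |Y_i & Z_i| / |Y_i | Z_i| over all instances."""
    total = 0.0
    for y, z in zip(Y, Z):
        union = y | z
        # Assumed convention: two empty label-sets are a perfect match.
        total += len(y & z) / len(union) if union else 1.0
    return total / len(Y)

def exact_match(Y, Z):
    """Fraction of instances whose whole label-set is predicted correctly."""
    return sum(1 for y, z in zip(Y, Z) if y == z) / len(Y)

# Toy ground truth Y and predictions Z over |L| = 4 labels.
Y = [{"cough"}, {"cough", "asthenia"}, set(), {"anosmia"}]
Z = [{"cough"}, {"cough"}, set(), {"cough"}]
hs = hamming_score(Y, Z, n_labels=4)
acc = jaccard_accuracy(Y, Z)
em = exact_match(Y, Z)
```

On this toy example the Hamming score is the highest of the three values and Exact Match the lowest, which mirrors the lenient-to-strict ordering discussed above.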

Finally, we consider the AUC (Area Under the ROC Curve) in its macro-averaged version (the metric is computed independently for each label and then averaged). Since in our case all labels have the same importance (each should be treated equally as one of the symptoms of long-COVID), macro-averaging is preferred to micro-averaging (where the specific measures of each class are pooled together).
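The distinction between macro- and micro-averaging can be illustrated with a small Python sketch. It computes the AUC through its rank interpretation (the probability that a positive example receives a higher score than a negative one, with ties counted as one half); the function names and the toy scores are assumptions made for illustration.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both classes")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y, S):
    """Average of the per-label AUCs; Y and S hold one column per label."""
    return sum(auc(y, s) for y, s in zip(Y, S)) / len(Y)

def micro_auc(Y, S):
    """Single AUC over all pooled (instance, label) pairs."""
    flat_y = [t for col in Y for t in col]
    flat_s = [s for col in S for s in col]
    return auc(flat_y, flat_s)

# Two labels, four instances each (toy values).
Y = [[1, 0, 0, 0], [1, 1, 0, 0]]
S = [[0.9, 0.1, 0.2, 0.3], [0.4, 0.6, 0.5, 0.2]]
macro = macro_auc(Y, S)
micro = micro_auc(Y, S)
```

With macro-averaging each label contributes equally to the final score regardless of how many positive examples it has, which is the property motivating its use here.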

The Case Study: Long-COVID Syndrome

Problem Characterization

The present paper focuses on a long-COVID study carried out at Ospedale Maggiore of Novara in Italy, where data on $324$ patients, hospitalized for acute COVID-19 onset during the first three waves of the pandemic, have been collected (Bellan et al., 2021a, 2021b). As already reported in Section “Introduction”, a patient is considered to be suffering from long-COVID if she shows at least one sequela (symptom) of acute COVID-19 at least 12 weeks after the onset (Nalbandian, 2021). In the present study, we consider a follow-up time of about one year after the first hospitalization due to acute COVID-19, and we focus on some specific sequelae for which data have been collected. In particular, the following symptoms are considered: arthromyalgia, asthenia, cough, diarrhea, dysgeusia, anosmia. They are all represented as binary features denoting the absence/presence of the corresponding symptom at follow-up time. In addition, we also consider the results of the respiratory test mMRC (modified (British) Medical Research Council questionnaire), which is a reliable indicator for dyspnea. Clinicians have summarized it into a binary variable (mMRC_cat) representing the absence/presence of severe dyspnea. In summary, we consider a total of $7$ sequelae at follow-up.

Concerning patient characterization, baseline data describe the demographics and medical history of the patient, while hospitalization data refer to the patient’s symptoms at hospitalization (acute COVID-19 onset). Baseline data are not directly related to COVID-19 infection but are important factors to take into account in order to make an accurate diagnosis or prediction. Features in the baseline data can be grouped into demographic characteristics (sex, age, smoking attitude, $\dots$) and prior comorbidities (obesity, chronic liver disease, hypertension, anxiety and depression, $\dots$).

Hospitalization data include the patient’s symptoms at COVID-19 onset (fever, cough, dyspnea, arthralgia, $\dots$), drugs administered (hydroxychloroquine, monoclonal antibodies, glucocorticoids, antivirals, $\dots$), and hospitalization information (duration, oxygen administration, ICU intubation, etc.). The baseline and hospitalization data together result in a total of $57$ features, of which $47$ are binary and $10$ numeric.

The classification problem can then be described as follows: predict the presence of specific long-COVID sequelae at follow-up, using baseline and hospitalization information.

Dataset Analysis

The dataset produced in our long-COVID study (named orig in the following) consists of $n = 324$ instances (324 different patients under study) with $D = 57$ features ($47$ binary features and $10$ numeric features). No missing value was reported.¹ The label space has cardinality $| L | = 7$, since we consider $7$ long-COVID sequelae (see Section “Problem Characterization”).

Some distributions concerning labels and instances are reported in Figure 1, while the first row of Table 1 shows the main characterization measures of the original dataset. Subsequent rows refer to the augmented datasets described in the following (Section “Increasing the Sample Size”).

Figure 1.

Original dataset. (a) Label number by instance number; (b) Label distribution.

Table 1.

Main Features of Considered Datasets; $n$: Number of Instances, #ls: Number of Distinct Label-sets, #pl: Number of Label-sets with More Than One Label.

Dataset	n	#ls	#pl	Card	Dens	MeanIR	CVIR	Scumble
orig	324	36	54	0.670	0.096	2.303	0.417	0.012
k3I	424	36	54	0.592	0.085	2.173	0.415	0.009
k3R	424	38	79	0.814	0.116	1.762	0.373	0.007
k3U	424	48	154	1.752	0.250	1.566	0.239	0.008
k5I	424	36	54	0.512	0.073	2.303	0.417	0.009
k5R	424	36	54	0.599	0.086	2.089	0.466	0.008
k5U	424	48	154	1.745	0.249	1.602	0.311	0.009

We can notice that the imbalance level (as measured by MeanIR) is significant, even if a low $\mathit{CVIR}$ shows that there is a similar level of imbalance among the labels. Moreover, the Scumble metric $S = 0.012$ is relatively low, showing a low level of interaction (concurrence) between minority and majority labels. This suggests that data augmentation through resampling can be attempted in order to get more significant data; this is very important in the present context, since the study was able to process just a few hundred real patients, and the possibility of increasing the dataset size (even by a limited amount) is considered fundamental.

An alternative to upsampling to handle the “high dimensionality vs low sample size” problem is to reduce the number of features through some feature reduction methods.² In the following we will discuss both alternatives starting from the upsampling approach.

Increasing the Sample Size

The first approach we tested in order to get a more significant analysis of the available data was to resort to MLSmote (Charte et al., 2015), an upsampling technique suited to the multi-label framework. It produces new synthetic data by first identifying minority labels (using $\mathit{IRLbl}$ measures), followed by a kNN search and a feature generation from such a search, as in the standard SMOTE technique (Chawla et al., 2002). Finally, label-sets for the synthetic instances are generated using $3$ possible criteria: Intersection (I): labels appearing in the reference sample and in all the neighbors are added to the new sample; Union (U): labels appearing in either the reference sample or in any of the neighbors are added to the new sample; Ranking (R): labels present in more than half of the reference and neighbor samples are added to the new sample (majority voting).
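The three label-set generation criteria can be sketched in Python as follows. This covers only the label-set step of the procedure (not the minority-label selection, the kNN search or the feature interpolation); the function name and the toy label-sets are hypothetical.

```python
def synthetic_label_set(reference, neighbors, policy):
    """Label-set of a synthetic instance from a reference sample and its neighbors."""
    samples = [set(reference)] + [set(n) for n in neighbors]
    if policy == "I":   # Intersection: labels shared by the reference and all neighbors
        return set.intersection(*samples)
    if policy == "U":   # Union: labels present in the reference or in any neighbor
        return set.union(*samples)
    if policy == "R":   # Ranking: labels present in more than half of the samples
        all_labels = set.union(*samples)
        return {lab for lab in all_labels
                if sum(lab in s for s in samples) > len(samples) / 2}
    raise ValueError(f"unknown policy: {policy}")

# Reference sample with k = 3 neighbors (toy label-sets).
ref = {"cough", "asthenia"}
nbrs = [{"cough"}, {"asthenia", "anosmia"}, {"asthenia", "cough"}]
ls_i = synthetic_label_set(ref, nbrs, "I")
ls_r = synthetic_label_set(ref, nbrs, "R")
ls_u = synthetic_label_set(ref, nbrs, "U")
```

In this toy case the intersection is empty, the ranking keeps the two labels shared by three of the four samples, and the union keeps all three labels, which illustrates why the union policy tends to produce larger label-sets than the other two.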

We started from the original dataset (orig) and generated $6$ data augmented datasets using MLSmote with $k = 3, 5$ (number of considered neighbors) and the intersection, union and ranking generation (datasets k3I, k3U, k3R, k5I, k5U, k5R respectively). The number of synthetic instances to introduce has been set to $100$ , in order to increment the available examples in a reasonable way (the increase is less than $\frac{1}{3}$ of the original size, but allows us to consider a situation with a significant number of additional potential patients). Table 1 shows a summary of the characterization measures for each considered dataset.

We can notice that the more aggressive the augmentation (U is more aggressive than R, which is more aggressive than I), the larger the number of label-sets and the presence of label-sets with several labels, as also reflected by cardinality and density. This also makes it possible to reduce the label imbalance while keeping the Scumble index under control. In this respect, considering a smaller number of neighbours in the instance generation ($k = 3$) seems to keep the label imbalance better under control.

To give an intuition of how MLSmote influences the resulting label-sets, Figure 2 reports the concurrence plots among the labels for datasets orig and k3U respectively.³

Figure 2.

Concurrence plots: orig (upper) and k3U (lower).

Concurrence plots report interactions among labels. The plot is circular, with the circumference partitioned into arcs representing the labels. Each arc has a length proportional to the number of instances having that particular label. Bands join two arcs, showing the relation between the corresponding labels. The width of each band is proportional to the number of instances in which both labels appear simultaneously. The plot for k3U (lower part of Figure 2) shows more uniform bands than the plot for orig (upper part of Figure 2), meaning that the label-sets are more balanced in the former than in the latter.

Dimensionality Reduction

In order to mitigate the problem of a reduced sample size with respect to the dimensionality of the problem, the second approach has been to consider some feature reduction methods. Keeping the number of attributes under control (by discarding noisy ones or those carrying little significant information) may be a way of dealing with a reduced number of examples when evaluating predictive results (Portinale & Saitta, 2002). To this end, we have considered two different methods based on feature ranking: a supervised approach called Relief (Kira & Rendell, 1992), and a standard unsupervised approach, namely Principal Component Analysis (PCA).

Relief computes a score for each feature, which can then be used to rank the features and select them on the basis of their order. The score is based on the identification of feature value differences between nearest-neighbor instance pairs. We get a so-called hit when a feature value difference is observed in a neighboring instance pair with the same class, while we get a miss when the difference is observed in a neighboring instance pair with different classes. The score is suitably increased in case of a miss and decreased in case of a hit. The number $k$ of nearest neighbors to consider is a hyperparameter of the algorithm. In our case, we considered $k = 10$ and we adopted the Relief implementation provided by Weka (Frank et al., 2016).
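The hit/miss update can be illustrated with a minimal Python sketch of the basic Relief scheme. This is a simplification and not the Weka implementation used in the study: it assumes one nearest hit and one nearest miss per instance (rather than $k = 10$ neighbors), numeric features scaled to $[0, 1]$, and a Manhattan distance; the function name and toy data are hypothetical.

```python
def relief_scores(X, y):
    """Basic Relief with one nearest hit and one nearest miss per instance.

    X: list of numeric feature vectors scaled to [0, 1]; y: single class labels.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d

    def dist(a, b):
        # Manhattan distance (an assumption of this sketch).
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # no same-class or other-class neighbor available
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            w[f] -= abs(X[i][f] - X[hit][f]) / n    # a difference on a hit lowers the score
            w[f] += abs(X[i][f] - X[miss][f]) / n   # a difference on a miss raises it
    return w

# Feature 0 separates the two classes; feature 1 does not.
X = [[0.0, 0.1], [0.0, 0.9], [1.0, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
w = relief_scores(X, y)
```

On the toy data the discriminative feature receives a positive score and the uninformative one a negative score, so ranking by score puts the former first.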

Since the method relies on a single class label, in order to apply it in the multi-label setting we derived a single class attribute from the label-sets occurring in the dataset (as in the LC multi-label classification method). Once this single class attribute had been built, we removed the individual attributes composing the label-set and applied Relief. The score computed by the method is essentially a correlation score between each feature and the class. Following the widely used rule of thumb that considers a correlation between two variables significant when its score is above the threshold $σ = \frac{2}{\sqrt{n}}$ (where $n$ is the sample size), we selected the top $8$ features, i.e., those whose score exceeds this threshold (in our case $n = 324$ and $σ \approx 0.11$). The resulting dataset is referred to as Relief in the following (see Section “Results”).

Concerning the unsupervised feature reduction, we removed the attributes related to the sequelae to be predicted and performed (again using Weka) a standard PCA; by retaining $95\%$ of the data variance, we obtained $44$ new features, which are linear combinations of the $57$ original ones. The resulting dataset is referred to as PCA in the following (see Section “Results”). Notice that, apart from the supervised vs unsupervised approach, the two considered methods differ significantly in the total reduction of the number of features ($86\%$ reduction for the former and $23\%$ for the latter). The next section discusses the experiments we have performed on all the datasets that have been generated by considering the original data, the MLSmote upsampling, and finally the two proposed methods for feature reduction.
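As a quick arithmetic check of the figures quoted in this section (a worked restatement of numbers already given in the text, not an additional result), the significance threshold and the two reduction rates follow from $n = 324$ instances and $57$ original features:

$$σ = \frac{2}{\sqrt{n}} = \frac{2}{\sqrt{324}} = \frac{2}{18} \approx 0.111,$$

$$\frac{57 - 8}{57} = \frac{49}{57} \approx 0.86 \ \text{(Relief)}, \qquad \frac{57 - 44}{57} = \frac{13}{57} \approx 0.23 \ \text{(PCA)}.$$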

Experimental Framework

The experimental analysis has been performed using Meka (Read et al., 2016), a multi-label extension to Weka. We considered the following problem transformation approaches to MLC:

– BR (Binary Relevance): a set of $7$ independent binary classification models (one for each possible label) has been built and results have been merged; label correlation information is completely neglected in this case.

– CC (Classifier Chain): labels are processed in a random order by a set of binary classifiers; each classifier predicts the presence of the corresponding label by considering the classifications produced by the previous ones, i.e., classifier $C_{i}$ ($2 \leq i \leq 7$) uses the problem features augmented with the label values predicted by classifiers $C_{j}$ ($1 \leq j \leq i - 1$). Classifier $C_{1}$ uses only the problem features.

– BCC (Bayesian Classifier Chain): as the CC method, but the order in which labels are processed is not random; we tested two possible versions of BCC corresponding to different label orderings: BCC(I), where the label order is induced by the mutual information among labels, and BCC(C), where the label order is determined by label co-occurrence counts.

– LC (Label Combination): the problem features are augmented with a class attribute taking discrete values in the range $[1, \#ls]$, where #ls is the number of distinct label-sets; the problem is then solved as a standard single-label multi-class classification.

– PS (Pruned Set): first it prunes all examples having label-sets that occur fewer than $p$ times in the training set (we set $p = 2$ in our analyses); then it subsamples the label-sets of these examples for label subsets that occur more frequently in the training data. It then attaches these label-sets to the example, creating new examples and reintroducing them into the training set; after these steps, it trains a standard LC classifier.

– CDN (Conditional Dependency Network): it builds a fully-connected network where nodes are the labels, then it builds a set of binary classifiers ($7$ in our case), one for each label $λ_{j}$, predicting $p (λ_{j} \mid x_{i}, λ_{1}, \dots, λ_{j - 1}, λ_{j + 1}, \dots, λ_{Q})$; inference is performed through Gibbs sampling (Robert & Casella, 2004) over a set of $I$ iterations, collecting results from the last $I_{c}$ iterations (in our case we set $I = 500$ and $I_{c} = 100$).

– RAkEL (RAndom k-labEL Pruned Sets): it randomly draws $M$ subsets of labels, each with $k$ labels, from the set of labels, and trains PS upon each one (in our case we set PS as indicated above, $M = 10$ and $k = 3$).
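To make the PS pruning step and the RAkEL label-subset draw more concrete, the following Python sketch illustrates both under simplifying assumptions. It is not the Meka implementation: every frequent, non-empty, proper subset of a pruned label-set is reintroduced (an actual implementation may limit how many), and the function names, patient identifiers and label names are hypothetical.

```python
import random
from collections import Counter

def pruned_sets(examples, p=2):
    """examples: list of (features, label_set). Returns the PS-transformed set."""
    counts = Counter(frozenset(ys) for _, ys in examples)
    frequent = {ls for ls, c in counts.items() if c >= p}
    kept, reintroduced = [], []
    for x, ys in examples:
        ys = frozenset(ys)
        if ys in frequent:
            kept.append((x, ys))
        else:
            # Replace the rare label-set by its frequent, non-empty proper subsets.
            for sub in sorted(frequent, key=sorted):
                if sub and sub < ys:
                    reintroduced.append((x, sub))
    return kept + reintroduced

def draw_label_subsets(labels, M=10, k=3, seed=0):
    """RAkEL-style draw: M random subsets of k distinct labels each."""
    rng = random.Random(seed)
    return [rng.sample(labels, k) for _ in range(M)]

examples = [("p1", {"cough"}), ("p2", {"cough"}),
            ("p3", {"asthenia"}), ("p4", {"asthenia"}),
            ("p5", {"cough", "asthenia", "anosmia"})]
ps_train = pruned_sets(examples, p=2)
subsets = draw_label_subsets(["l1", "l2", "l3", "l4", "l5", "l6", "l7"])
```

In the toy data the label-set of `p5` occurs only once, so it is pruned and `p5` is reintroduced twice, once with each of the two frequent label-sets it contains.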

The tested MLC methods are $4$ binary methods (BR, CC and the two versions of BCC), $2$ multi-class methods (LC and PS), and $2$ ensemble methods (CDN and RAkEL). We tested the above methods with several base classifiers (from lazy classifiers to neural nets, SVM and tree-based classifiers), and we finally found (through an inner cross-validated grid-search) that a $200$-trees Random Forest provided the most interesting results.⁴ In the following, we will report results concerning the above transformation-based methods using this base classifier.

It is worth remarking that multi-class methods rely on the label-sets which are actually present in the data. A suitable data augmentation strategy could therefore be really important in order to consider potential label-sets that do not occur in the original dataset.

Results

This section presents the comparative results obtained by the MLC algorithms described in Section “Experimental Framework”, on the considered datasets. We tested all the methods using $5$ runs of $10$ -fold outer cross-validation, and we finally averaged the results. In the following tables, we highlight in bold the best results obtained for each considered dataset: the original, the ones obtained through upsampling via MLSmote, and the ones obtained through feature reduction via Relief and PCA; the last row reports the average performance of each MLC method over all the datasets (in bold are shown the best results). Table 2 shows the Accuracy/Jaccard Index for the various datasets and tested methods.
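The evaluation protocol of $5$ runs of $10$-fold cross-validation can be sketched in Python as an index generator. The function name and the seed are hypothetical, and the sketch assumes plain shuffled (non-stratified) folds, since the splitting strategy is not detailed in the text.

```python
import random

def repeated_kfold(n, k=10, runs=5, seed=0):
    """Yield (run, fold, train_idx, test_idx) for `runs` repetitions of k-fold CV."""
    rng = random.Random(seed)
    for run in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)                          # new random partition for each run
        folds = [idx[f::k] for f in range(k)]     # k nearly equal-sized folds
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield run, f, train_idx, test_idx

# n = 324 as in the original dataset: 5 x 10 = 50 train/test splits.
splits = list(repeated_kfold(324))
```

Each metric would then be computed on every test fold and averaged over the $50$ splits.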

Table 2.

Accuracy (Jaccard Index).

Datasets	BCC(C)	BCC(I)	BR	CC	CDN	LC	PS	RAkEL
orig	0.604	0.605	0.420	0.604	0.415	0.604	0.605	0.604
k3I	0.646	0.647	0.507	0.649	0.503	0.653	0.673	0.651
k3R	0.615	0.615	0.546	0.613	0.527	0.660	0.643	0.652
k3U	0.637	0.639	0.577	0.638	0.537	0.637	0.624	0.651
k5I	0.666	0.666	0.488	0.669	0.466	0.646	0.680	0.641
k5R	0.645	0.648	0.499	0.645	0.496	0.648	0.668	0.647
k5U	0.651	0.649	0.582	0.649	0.552	0.648	0.637	0.666
Relief	0.556	0.558	0.471	0.559	0.425	0.552	0.566	0.539
PCA	0.606	0.606	0.445	0.606	0.432	0.605	0.605	0.603
Avg	0.625	0.626	0.504	0.626	0.484	0.628	0.633	0.628

We can notice that multi-class methods perform better, with chain classifiers showing very close results (independently of label ordering). The use of an ensemble (RAkEL) does not actually improve on the basic PS version. As also outlined in Section “MLC: Characterization and Evaluation Measures”, the reported scores can be regarded as quite satisfactory. However, looking at the scores obtained on each specific dataset, we can notice that the feature reduction obtained through Relief is too drastic, significantly reducing the score with respect to the other datasets; moreover, PCA does not actually improve performance over the original dataset with all the original features. Similar considerations apply to the Exact Match score as well, where the PS method is clearly the best (see Table 3). Since this metric is much stricter than the Jaccard Index, the obtained results can be considered very satisfactory. Notice that augmented datasets produced with MLSmote(U) have a larger cardinality and density: this decreases the probability of getting an exact match, as shown in the table.

Table 3.

Exact Match.

Datasets	BCC(I)	BCC(C)	BR	CC	CDN	LC	PS	RAkEL
orig	0.601	0.603	0.383	0.601	0.377	0.604	0.605	0.601
k3I	0.641	0.643	0.475	0.645	0.471	0.644	0.668	0.642
k3R	0.590	0.593	0.479	0.590	0.461	0.620	0.623	0.614
k3U	0.559	0.564	0.460	0.568	0.431	0.589	0.594	0.586
k5I	0.666	0.666	0.473	0.669	0.449	0.646	0.680	0.641
k5R	0.641	0.645	0.472	0.640	0.468	0.639	0.664	0.640
k5U	0.571	0.571	0.472	0.580	0.435	0.586	0.594	0.593
Relief	0.546	0.548	0.435	0.548	0.388	0.544	0.561	0.523
PCA	0.605	0.605	0.406	0.605	0.391	0.605	0.605	0.603
Avg	0.602	0.604	0.451	0.605	0.430	0.609	0.622	0.605

The last bipartition-based metric we have considered is the Hamming score; results are shown in Table 4. The obtained scores are in general very good, again with comparable performances of multi-class methods and classifier chains. As before, for feature reduction approaches we notice that PCA usually provides the same results as the use of every available feature, while the reduction proposed by Relief appears to lose attributes that are important for the final predictions.

Table 4.

Hamming Score.

Datasets	BCC(I)	BCC(C)	BR	CC	CDN	LC	PS	RAkEL
orig	0.904	0.904	0.847	0.903	0.844	0.904	0.904	0.904
k3I	0.915	0.915	0.874	0.915	0.875	0.911	0.922	0.913
k3R	0.903	0.903	0.874	0.903	0.871	0.908	0.910	0.908
k3U	0.873	0.876	0.850	0.876	0.839	0.859	0.856	0.876
k5I	0.918	0.918	0.868	0.918	0.869	0.908	0.922	0.909
k5R	0.913	0.914	0.872	0.913	0.874	0.909	0.922	0.911
k5U	0.881	0.880	0.859	0.882	0.842	0.861	0.858	0.883
Relief	0.885	0.886	0.847	0.885	0.846	0.882	0.891	0.877
PCA	0.905	0.905	0.851	0.905	0.847	0.904	0.904	0.904
Avg	0.900	0.900	0.860	0.900	0.856	0.899	0.898	0.900
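To make the relation among the three bipartition-based metrics concrete, the following is a minimal sketch in plain Python. It assumes the usual definitions (per-sample Jaccard index averaged over samples, exact match as the fraction of fully correct label sets, Hamming score as the fraction of correctly predicted label slots); the convention that two empty label sets count as a perfect Jaccard match, the function names and the toy label matrices are assumptions made for illustration and are not taken from the study.

```python
def jaccard_accuracy(y_true, y_pred):
    """Mean per-sample Jaccard index between true and predicted label sets."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(t, p) if a == 1 or b == 1)
        # Assumption: two empty label sets are treated as a perfect match.
        scores.append(1.0 if union == 0 else inter / union)
    return sum(scores) / len(scores)


def exact_match(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return hits / len(y_true)


def hamming_score(y_true, y_pred):
    """Fraction of individual label slots that are predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred)
                  for a, b in zip(t, p) if a == b)
    total = sum(len(t) for t in y_true)
    return correct / total


# Toy data (invented): 4 samples, 4 binary labels each.
y_true = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [1, 0, 1, 0]]

print("Jaccard accuracy:", round(jaccard_accuracy(y_true, y_pred), 3))
print("Exact match:     ", round(exact_match(y_true, y_pred), 3))
print("Hamming score:   ", round(hamming_score(y_true, y_pred), 3))
```

On this toy input the exact match is 0.5, the Jaccard accuracy is about 0.71 and the Hamming score is about 0.81. Under these definitions the ordering exact match ≤ Jaccard ≤ Hamming holds for any input, which is consistent with the Hamming scores in Table 4 being uniformly higher than the values in Tables 2 and 3.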

We can also notice that, in general, unlike the other datasets, the one obtained through PCA shows a more stable behavior (in terms of the considered evaluation measures) across all the tested methods. While for the other datasets there are significant differences in the metrics depending on the adopted multi-label classification algorithm, in the case of PCA this difference is not so evident.

Finally, we computed the macro-averaged AUC, as shown in Table 5. In this case, the results of all methods are comparable over all datasets; a slightly better performance can be noticed for the basic BR method and for the ensemble algorithm CDN. This can be explained by the fact that in these approaches the contribution of the more represented labels dominates the aggregated ROC, disregarding the contribution of some minority labels (which is usually better captured by multi-class methods). We also notice here that, in terms of AUC, the datasets obtained via feature reduction do not allow the tested multi-label classification algorithms to reach performances close to the ones obtained on some upsampled datasets (in particular those using the union method of MLSmote).

Table 5.

Area Under ROC (Macro-averaged).

Datasets	BCC(C)	BCC(I)	BR	CC	CDN	LC	PS	RAkEL
orig	0.504	0.504	0.543	0.503	0.547	0.501	0.500	0.501
k3I	0.535	0.537	0.574	0.533	0.579	0.563	0.560	0.550
k3R	0.621	0.618	0.727	0.622	0.731	0.673	0.639	0.666
k3U	0.763	0.765	0.847	0.768	0.836	0.746	0.724	0.778
k5I	0.496	0.496	0.452	0.496	0.459	0.490	0.497	0.491
k5R	0.529	0.531	0.544	0.532	0.560	0.553	0.551	0.550
k5U	0.785	0.784	0.870	0.787	0.846	0.760	0.737	0.797
Relief	0.503	0.504	0.518	0.504	0.547	0.495	0.495	0.502
PCA	0.501	0.501	0.588	0.501	0.589	0.500	0.500	0.500
Avg	0.582	0.582	0.629	0.583	0.636	0.587	0.578	0.593
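As an illustration of how a macro-averaged AUC can be obtained, the sketch below computes one AUC per label through the pairwise (Mann-Whitney) formulation and then takes the unweighted mean over labels. It is a minimal sketch, not the evaluation code used in the experiments: the function names, the choice of skipping labels that have only one class, and the toy scores are assumptions.

```python
def auc(labels, scores):
    """AUC of one label: probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def macro_auc(y_true, y_score):
    """Unweighted mean of the per-label AUCs (undefined labels are skipped)."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        value = auc([row[j] for row in y_true], [row[j] for row in y_score])
        if value is not None:
            per_label.append(value)
    return sum(per_label) / len(per_label)


# Toy data (invented): 4 samples, 2 labels, real-valued confidence scores.
y_true = [[1, 0], [0, 0], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.3, 0.6], [0.8, 0.4], [0.4, 0.1]]

print("Macro-averaged AUC:", round(macro_auc(y_true, y_score), 3))
```

In this example the first label is ranked perfectly (AUC 1.0) and the second only partially (AUC 2/3), giving a macro average of about 0.83. A scorer that assigns the same confidence to every sample obtains 0.5 on each label, which is the chance-level reference for this metric.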

In conclusion, the performances of MLC methods highlight the capability of a multi-label classifier to obtain interesting predictions concerning long-COVID syndrome on the collected data (baseline and hospitalization). Such results can be improved with a suitable data augmentation, taking into careful consideration the main characteristics of the original data. On the other hand, feature reduction techniques (both supervised and unsupervised) do not appear to yield any particular improvement over the use of the original data with the original dimensionality; aggressive dimensionality reduction (such as in the Relief case) seems to lose features that are important for the label-set prediction task.

Interpretability Analysis

To make our approach truly usable in the medical domain, an essential aspect is to provide physicians with techniques for the interpretability of the model’s predictions. To this purpose, we approached the problem by using the Shapley Additive Explanations (SHAP) framework (Lundberg & Lee, 2017). In short, for a given model prediction, SHAP assigns an importance value to each feature of the dataset. A SHAP value mathematically represents the contribution of each feature to a model’s prediction by distributing the difference between the specific prediction and the average prediction across all features. This distribution is based on cooperative game theory; more specifically, the Shapley value ensures that each feature’s impact is fairly calculated by considering all possible combinations of features and their marginal contributions to the prediction. These values can then be aggregated to provide a broader interpretation of which features, and their corresponding values, are most significant in predicting a particular label. Notably, in the context of multilabel classification, we can analyze the contribution of each label to the SHAP values. It is worth stressing that frameworks like SHAP are particularly useful for black-box models (such as Random Forest), where direct interpretation of the predictions is not straightforward.
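The game-theoretic definition recalled above can be illustrated with a small exact computation. The sketch below enumerates all feature coalitions for a hypothetical three-feature scoring function and checks that the resulting values sum to the difference between the prediction and a reference output. It makes a simplifying assumption: features outside a coalition are replaced by a single fixed baseline vector, whereas SHAP estimates an expectation over background data. The model, the instance and the baseline are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(x)

    def value(subset):
        # Features in the coalition take the instance value, others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition of this size (feature i excluded).
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical score with one interaction term (features 0 and 2).
    return 0.2 * z[0] + 0.1 * z[1] + 0.3 * z[0] * z[2]


x = [1.0, 2.0, 1.0]          # instance to explain (invented)
baseline = [0.0, 0.0, 0.0]   # reference point standing in for the average case

phi = shapley_values(toy_model, x, baseline)
print("Shapley values:", [round(v, 3) for v in phi])
print("Sum:", round(sum(phi), 3),
      "Prediction gap:", round(toy_model(x) - toy_model(baseline), 3))
```

Here the purely additive feature receives exactly its own term (0.2), while the interaction term 0.3 is split equally between the two features involved, giving values of 0.35, 0.2 and 0.15 that add up to the prediction gap of 0.7. Exact enumeration is exponential in the number of features, which is why practical tools rely on model-specific or sampling-based approximations.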

The analysis has been divided into two parts. The first aspect we considered was providing domain experts (specifically, physicians) with direct interpretations of our models, in order to highlight aspects that are not visible through other means. To validate our interpretative findings, we sought evaluations from physicians to ensure consistency with existing medical knowledge on long-COVID syndrome, as reported in the literature.

A second key aspect of our experiments involved data augmentation using the MLSmote technique. We investigated how the upsampling method affects the interpretability of the model’s predictions by comparing the SHAP values of the features from the original dataset with those from the upsampled dataset.

All the interpretability experiments have been conducted using the Random Forest model with the PS approach to MLC (one of the best performing on average in our experimental analysis) and the best hyperparameters found through grid search (see Section “Experimental Framework” for details).

Domain Experts Analysis

The first objective of this analysis was to develop tools, such as plots or indices, that could be provided to physicians for inspection. These tools enable domain experts to assess why the model has made specific predictions, by revealing the key factors influencing the model’s decisions. We describe the SHAP plots used in this project and the conclusions drawn in collaboration with medical experts. To validate our approach, we reference relevant clinical literature that supports our findings wherever possible. It is worth noting that the fact that some of these findings are already established in the literature is an indication of the strength of our approach, suggesting that the method is robust and applicable, in a semi-automatic manner, especially for future analyses in the context of the TECNOMED-HUB research project.

The first type of representation we used is shown in Figure 3. This bar plot presents a broader level of aggregation by summing the SHAP values across all the samples in the dataset. The SHAP method assigns a value to each feature for every sample; we aggregated these values by calculating the sum of their absolute values across all samples. This approach allows us to assess the overall contribution of each feature to the model’s output. More specifically, each bar represents the contribution of a specific feature to the final prediction. The bar is divided into segments, with each segment (proportional to the SHAP value) reflecting the feature’s contribution to a particular label (different labels identified with different colors) within the model’s output.

Figure 3.

SHAP values for the original dataset using a Random Forest model.
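The aggregation behind this kind of stacked bar plot can be sketched as follows. The code assumes that SHAP values are available as a nested list indexed by sample, feature and label; this layout, the function names and the numbers are assumptions made for illustration and do not reproduce values from the study.

```python
def aggregate_abs_shap(shap_values):
    """Sum |SHAP| over samples: result[f][l] is the segment of feature f for label l."""
    n_features = len(shap_values[0])
    n_labels = len(shap_values[0][0])
    totals = [[0.0] * n_labels for _ in range(n_features)]
    for sample in shap_values:
        for f in range(n_features):
            for l in range(n_labels):
                totals[f][l] += abs(sample[f][l])
    return totals


def rank_features(totals, names):
    """Order features by total bar length (sum of their label segments)."""
    lengths = {name: sum(row) for name, row in zip(names, totals)}
    return sorted(lengths, key=lengths.get, reverse=True)


# Invented SHAP values: 2 samples x 3 features x 2 labels.
shap_values = [
    [[0.10, -0.02], [-0.05, 0.01], [0.00, 0.03]],
    [[-0.20, 0.04], [0.05, -0.01], [0.01, -0.03]],
]
names = ["f0", "f1", "f2"]  # placeholder feature names

totals = aggregate_abs_shap(shap_values)
for name, row in zip(names, totals):
    print(name, [round(v, 2) for v in row])
print("Ranking:", rank_features(totals, names))
```

Taking absolute values before summing matters: in the toy data the contributions of f1 to the first label are -0.05 and 0.05, which would cancel in a signed sum although the feature influences both predictions.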

For a more detailed analysis, we instead considered beeswarm plots, as shown in Figures 4 and 5. These plots provide insight at the level of individual samples, showing how SHAP values vary based on the specific value of each feature and a particular class. This allows us to determine how the values of a particular feature influence the final output. In a beeswarm plot, each row is a scatterplot showing the distribution of SHAP values across all samples in the dataset, with the color of the points indicating the corresponding feature values (towards blue for small values and towards red for large values of the corresponding feature).

Figure 4.

SHAP values for arthralgia at follow-up.

Figure 5.

SHAP values for diarrhea at follow-up.

In the following, we report the most significant analyses we have performed; additional analyses have been conducted but are not reported due to space constraints. From the SHAP analysis and the physicians’ evaluations, we identified several findings. The most significant features influencing the model are sex, BMI, and hospitalization duration (see Figure 3), consistent with those reported in some previous studies on long-COVID (Mateu et al., 2023; Silva et al., 2024; Vimercati et al., 2021). In addition, we can notice that these features, along with anxious symptoms, depression, and lack of sleep, are particularly important predictors of arthromyalgia, as previously reported in Phu et al. (2023) (see Figure 4). These features appear to be common predictors across all follow-up symptoms, with the exception of glucocorticoid administration at the onset of COVID-19, which is associated with the absence of diarrhea at follow-up (see Figure 5). Another important observation is that the number of comorbidities and the severity of the initial case seem to be linked to long-COVID symptoms (Mateu et al., 2023). Finally, we also observe that anosmia at the time of hospitalization is a strong predictor of anosmia at follow-up, a pattern which is not observed for other symptoms.

Comparison with Upsampled Datasets

The second part of the interpretability analysis focuses on evaluating the impact of upsampling methods on the interpretability of the model’s predictions. Given the scarcity of data, these augmentations are necessary to enhance the robustness of the model. Our goal is to validate the hypothesis that the augmentations maintain consistency in terms of interpretability. To achieve this, we compared the SHAP values of features from the original dataset with those from various upsampled datasets. Specifically, we normalized the SHAP values of each dataset independently between $0$ and $1$. We then identified the top $15$ features in the original dataset and examined their corresponding SHAP values in the augmented datasets. This approach allows us to determine whether the most significant features in the model trained on the original dataset remain consistent in the models trained on the upsampled datasets. The results of this analysis are presented in Figure 6. Overall, we observe that the top $15$ features identified in the base model consistently achieve high SHAP values across all upsampled models, particularly in the k3I, k5I, and k5R methods. In contrast, the union-based upsampling methods (k3U, k5U) show lower SHAP values for the top three features. This is likely due to the more aggressive nature of union-based upsampling, which may significantly alter the dataset’s statistical properties.

Figure 6.

Comparison of SHAP values between the original dataset and different augmentations. The SHAP values of each dataset have been normalized independently.
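The comparison procedure can be sketched in a few lines. The code below assumes that each dataset is summarized by one aggregated importance per feature (for example, the sum of absolute SHAP values) and that the normalization is a min-max rescaling; the text only states that values were normalized between 0 and 1, so the exact rescaling is an assumption, as are the function names, dataset names and numbers.

```python
def min_max_normalize(importance):
    """Rescale a {feature: importance} mapping to the [0, 1] range."""
    lo, hi = min(importance.values()), max(importance.values())
    span = hi - lo
    return {name: (value - lo) / span if span > 0 else 0.0
            for name, value in importance.items()}


def compare_top_features(original, augmented, k):
    """Top-k features of the original data and their normalized value elsewhere."""
    norm_orig = min_max_normalize(original)
    top = sorted(norm_orig, key=norm_orig.get, reverse=True)[:k]
    table = {}
    for dataset, importance in augmented.items():
        norm = min_max_normalize(importance)  # each dataset normalized on its own
        table[dataset] = [norm[name] for name in top]
    return top, table


# Invented aggregated importances for 4 placeholder features.
original = {"f0": 4.0, "f1": 2.0, "f2": 1.0, "f3": 0.0}
augmented = {
    "aug_a": {"f0": 8.0, "f1": 6.0, "f2": 0.0, "f3": 2.0},
    "aug_b": {"f0": 1.0, "f1": 3.0, "f2": 2.0, "f3": 0.0},
}

top, table = compare_top_features(original, augmented, k=2)
print("Top features in the original data:", top)
for dataset, values in table.items():
    print(dataset, [round(v, 2) for v in values])
```

In this toy case the first augmented dataset keeps both top features high (1.0 and 0.75), whereas in the second the leading feature drops to about 0.33, the kind of shift that the comparison is designed to expose. Because every dataset is normalized independently, the values are comparable only as relative rankings within each dataset, not as absolute magnitudes.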

Conclusions

We have presented a study about the prediction of long-COVID sequelae through multi-label classification (MLC). We initially considered data on more than $300$ patients, including their baseline situation as well as their condition when hospitalized after contracting severe COVID-19 infection. The goal was to study the presence of specific long-COVID sequelae after a one-year follow-up. Since the original set of patients under study was limited and could suffer from under-representativeness, we carefully investigated the possibility of both reducing the number of features and augmenting the dataset, by considering situations where different levels in the number of complications could arise. Concerning data augmentation, MLSmote under six different policies of data augmentation has been considered, while regarding dimensionality reduction we tested a supervised and an unsupervised method applied to the multi-label setting (Relief and PCA respectively).

A representative set of MLC approaches has finally been tested on all the available datasets. Results have been evaluated in terms of Accuracy, Exact match, Hamming Score and macro-averaged AUC. They showed that MLC methods can actually be useful for the prediction of specific (label-based) long-COVID sequelae, under the different conditions represented by the different considered datasets. Multi-class MLC methods appear to be very promising, and binary approaches based on chains could be a valid alternative in order to take into account label correlation.

Results have also been considered in terms of interpretability, by adopting the SHAP method in the context of multi-label classification. The analysis shows that useful clinical information can be obtained by considering the influence of specific features on the predicted labels. Moreover, the use of data augmentation (which is justified in the presented context) seems to leave unchanged the qualitative results obtained on the original dataset. Currently, we are integrating our approach with the TECNOMED-HUB databank.

As future work, on the side of the long-COVID syndrome study, we aim at integrating additional clinical information such as the level of several cytokines, which are indicators of specific inflammatory processes supposed to be involved in the long-COVID syndrome. Moreover, on the side of the applicability of our approach to a general databank, we aim to test it with different diseases and to complement it with explanation techniques for MLC (see, e.g., Panigutti et al., 2020; Tabia, 2019), to make our approach usable in practice by physicians.

Footnotes

ORCID iDs

Mattia Bellan

Annalisa Chiocchetti

Marco Dossena

Christopher Irwin

Luca Piovesan

Luigi Portinale

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability

Due to privacy concerns, the dataset on which the experiments were carried out is not publicly available; enquiries regarding the data can be sent to the authors.

References

Atkinson (1970). On the measurement of inequality. Journal of Economic Theory, 2(3), 244–263.

Baarts, Giezendanner, Lüthi-Corridori, Brändle, Dieterle, Gabutti, Hammerer-Lercher, Hasler, Henny-Fullin, Muser, Leibundgut, Leuppi-Taegtmeyer, Marbet C. P., Schraner, Leuppi J. D., Jaun (2021). Multilabel classification of disease prediction in patients presenting with dyspnea. European Respiratory Journal, 58(65).

Bellan, Baricich, Patrucco, Zeppegno, Gramaglia, Balbo P. E., Carriero, Amico C. S., Avanzi G. C., Barini, et al. (2021a). Long-term sequelae are highly prevalent one year after hospitalization for severe COVID-19. Scientific Reports, 11(1), 22666.

Bellan, Soddu, Balbo P. E., Baricich, Zeppegno, et al. (2021b). Respiratory and psychophysical sequelae among patients with COVID-19 four months after hospital discharge. JAMA Network, 41(1), e2036142.

Bogatinovski, Todorovski, Džeroski, Kocev (2022). Comprehensive comparative study of multi-label classification methods. Expert Systems with Applications, 203, 117215.

Charte, Rivera, del Jesus, Herrera (2015). MLSMOTE: Approaching imbalanced multilabeled learning through synthetic instance generation. Knowledge Based Systems, 89, 385–397.

Charte, Rivera, del Jesus, Herrera (2019). Dealing with difficult minority labels in imbalanced mutilabel data sets. Neurocomputing, 326-327, 39–53.

Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1), 321–357.

Frank, Hall, Witten (2016). The WEKA workbench. In Data mining: Practical machine learning tools and techniques (4th ed.). (online Appendix).

Gibaja, Ventura (2015). A tutorial on multilabel learning. ACM Computing Surveys (CSUR), 47(3), 1–38.

Guo (2011). Multi-label classification using conditional dependency networks. In Proceedings of the 22nd international joint conference on artificial intelligence (IJCAI 11) (pp. 1300–1305).

Huang, Zhang, Xia, Liu, Yang (2023). A multi-label learning prediction model for heart failure in patients with atrial fibrillation based on expert knowledge of disease duration. Applied Intelligence, 53(17), 1–12.

Kira, Rendell L. A. (1992). A practical approach to feature selection. In Proceedings of the ninth international workshop on machine learning (ML 92) (pp. 249–256). Morgan Kaufmann Publishers Inc.

Lundberg S. M., Lee S. I. (2017). A unified approach to interpreting model predictions. In NIPS’17 (pp. 4768–4777). Curran Associates Inc.

Madjarov, Kocev, Gjorgjevikj, Džeroski (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9), 3084–3104.

Mateu, Tebe, Loste, Santos J. R., Lladós, López, España-Cueto, Toledo, Font, Chamorro, et al. (2023). Determinants of the onset and prognosis of the post-COVID-19 condition: A 2-year prospective observational cohort study. The Lancet Regional Health–Europe, 33.

Nalbandian, et al. (2021). Post-acute COVID-19 syndrome. Nature Medicine, 27(4), 601–615.

Panigutti, Guidotti, Monreale, Pedreschi (2020). Explaining multi-label black-box classifiers for health applications. Springer International Publishing.

Phu D. H., Maneerattanasak, Shohaimi, Trang L. T. T., Nam T. T., Kuning, Torpor, Suwanbamrung (2023). Prevalence and factors associated with long COVID and mental health status among recovered COVID-19 patients in southern Thailand. PLoS One, 18(7), e0289382.

Portinale, Saitta (2002). Feature selection. Technical Report D14.1 EU Project MiningMart, University of Dortmund.

Rana, Sowmya, Meijering, Song (2023). Imbalanced classification for protein subcellular localization with multilabel oversampling. Bioinformatics (Oxford, England), 39(1), btac841.

Read, Pfahringer, Holmes (2008). Multi-label classification using ensembles of pruned sets. In Proceedings of 8th IEEE international conference on data mining (ICDM 08) (pp. 995–1000).

Read, Pfahringer, Holmes, Frank (2011). Classifier chains for multi-label classification. Machine Learning, 85, 333–359.

Read, Reutemann, Pfahringer, Holmes (2016). MEKA: A multi-label/multi-target extension to Weka. Journal of Machine Learning Research, 17(21), 1–5. http://meka.sourceforge.net/

Robert C. P., Casella (2004). Monte Carlo statistical methods (2nd ed.). Springer-Verlag.

Silva, Takahashi, Wood, Tabachnikova, Gehlhausen J. R., Greene, Bhattacharjee, Monteiro V. S., Lucas, et al. (2024). Sex differences in symptomatology and immune profiles of long COVID. In medRxiv (pp. 2024–02).

Tabia (2019). Towards explainable multi-label classification. In 2019 IEEE 31st international conference on tools with artificial intelligence (ICTAI) (pp. 1088–1095). https://doi.org/10.1109/ICTAI.2019.00152

Tarekegn, Giacobini, Michalak (2021). A review of methods for imbalanced multi-label classification. Pattern Recognition, 118, 107965.

Tecnomed-hub webpage. (2023). Retrieved June 6, 2023, from https://www.tecnomedhub.it

Tsoumakas, Katakis, Vlahavas (2011). Random K-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 23, 1079–1089.

Vimercati, De Maria, Quarato, Caputi, Gesualdo, Migliore, Cavone, Sponselli, Pipoli, Inchingolo, et al. (2021). Association between long COVID and overweight/obesity. Journal of Clinical Medicine, 10(18), 4143.

Zaragoza, Sucar, Morales, Bielza, Larranaga (2011). Bayesian chain classifiers for multidimensional classification. In Proceedings of the 22nd international joint conference on artificial intelligence (IJCAI 11) (pp. 2192–2197).

Zhou, Zheng, Yang, Wang, Bai (2021). Application of multi-label classification models for the diagnosis of diabetic complications. BMC Medical Informatics and Decision Making, 21(1), 182.