Characterizing SARS-CoV-2 Spike Sequences Based on Geographical Location

Abstract

With the rapid spread of COVID-19 worldwide, viral genomic data are available in the order of millions of sequences on public databases such as GISAID. This Big Data creates a unique opportunity for analysis toward the development of effective vaccines for the current pandemic, and toward avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected, and the patterns found between viral variants and geographical location are an important part of this analysis. One major challenge that researchers face is processing such huge, high-dimensional data to obtain useful insights as quickly as possible. Most existing methods face scalability issues when dealing with data of this magnitude. In this article, we propose an approach that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using k-mers (substrings) and then uses several machine learning models to classify the sequences based on geographical location. We show that our proposed model significantly outperforms the baselines. We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.

1. INTRODUCTION

The adaptability of viruses such as SARS-CoV-2, when coupled with a variety of selection pressures from the various ecosystems, host immunities, and approaches to pharmaceutical intervention, provides an evolutionary environment that leads to the emergence of strains and variants in different geographical locations. SARS-CoV-2 has spread quickly to many parts of the globe since the initial outbreak in Wuhan at the end of 2019, which led to the COVID-19 pandemic (Wu et al, 2020), and it continues to raise global concerns as the virus persistently evolves and accumulates new mutations. Consequently, new variants of SARS-CoV-2 have emerged in different parts of the world: the alpha variant (B.1.1.7) emerged in the United Kingdom, beta (B.1.351) in South Africa, gamma in Brazil, epsilon in California, iota (B.1.526) in New York, and delta (B.1.617.2) and kappa (B.1.617.1) in India, to name a few.

All of these variants possess some mutations that confer increased transmissibility or higher binding affinity of their spike protein (Fig. 1) to human host ACE2 receptors (Farinholt et al, 2021; Huang et al, 2020).

FIG. 1.

The SARS-CoV-2 genome codes for several proteins, including the surface, or spike, protein. The spike region spans 3821 (25,384 − 21,563) nucleotides (plus one stop character “*”). Therefore, the final length of the spike protein is $3822 / 3 = 1274$ (we divide by 3 because each amino acid corresponds to three nucleotides, i.e., one codon) (Huang et al, 2020).

It is concerning that the longer SARS-CoV-2 propagates, the more it is exposed to a wide range of immune responses across diverse communities and geographically diverse environments, which may drive the virus to evolve new variants and strains that are dangerous and highly immune evasive, both locally and globally, as the pandemic continues. From an evolutionary point of view, this gives the virus ample room and time to evolve adaptations, gain of function, and escape from host immune attack. This is already happening: the original Wuhan strain has been almost completely replaced by new variants with different characteristic behaviors, which are hence less responsive to the currently available vaccines (Korber et al, 2020; Hu et al, 2021).

This is why it is important to characterize different strains and variants of SARS-CoV-2 based on geographical location, to understand the patterns of spread in the hope of containing, or at least coping with, this virus.

All viruses mutate with time, and RNA viruses do so at a particularly fast rate. SARS-CoV-2 is an RNA virus; however, it exhibits a moderately lower mutation rate than other RNA viruses such as HIV and influenza because it possesses a genetic proofreading mechanism for correcting errors. The SARS-CoV-2 genome typically accrues 1 or 2 point mutations (single nucleotide polymorphisms, SNPs) in a month. According to a review, some 12,706 such mutations have so far been detected by researchers since the advent of the COVID-19 pandemic. While some changes have neutral effects, a few that occur in major proteins (whether additions, substitutions, or deletions) are critical to viral evolution, genomic stability, transmissibility, antigenicity, virulence, adaptation, and escape from the host immune response (Lorenzo-Redondo et al, 2020; Pachetti et al, 2020).

The SARS-CoV-2 spike (S) protein is a key player in the virus life cycle. The protein is composed of 1274 amino acids encoded by the S gene of the virus (Fig. 1). It is the major target of the neutralizing antibodies from host immune response and currently available vaccines for COVID-19. The virus uses the spike protein to bind the host ACE2 receptor on the cell surface (found abundantly in airways, lungs, mucous lines, and the intestine), which facilitates the uptake of the virus into host cells (Lamers et al, 2020; V'kovski et al, 2021).

Thus, mutations in the S gene have reportedly affected viral pathogenesis and the binding activity of the spike protein to the host, as well as causing conformational changes in the protein molecule. For instance, the mutation D614G, that is, a substitution of glycine (G) for aspartate (D) at position 614, was found to enhance the viral infectivity and stability of the SARS-CoV-2 genome, which has been attributed to spike protein assembly on the virion surface (Korber et al, 2020).

At present, a number of novel variants have been identified by the US Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO) (SARS-CoV-2 Variant Classifications and Definitions). The variants are divided into categories such as Variants of Concern (VOCs), Variants Being Monitored (VBMs), Variants of Interest (VOIs), and Variants of High Consequence (VOHCs). At the time of this study, the VOC was the delta variant (B.1.617.2 and AY.1 sublineages) (SARS-CoV-2 Variant Classifications and Definitions). Since all of these variants are characterized by different spike protein contents (Farinholt et al, 2021; Huang et al, 2020), classification can also help us to discover patterns in the geographic distribution of these variants.

At the time of this study, the VBMs comprised the alpha (B.1.1.7, Q.1-Q.8 Pango lineage), beta (B.1.351, B.1.351.2, B.1.351.3 Pango lineage), gamma (P.1, P.1.1, P.1.2 Pango lineage), epsilon (B.1.427, B.1.429 Pango lineage), eta (B.1.525 Pango lineage), iota (B.1.526 Pango lineage), kappa (B.1.617.1 Pango lineage), zeta (P.2 Pango lineage), and mu (B.1.621, B.1.621.1 Pango lineage) variants (SARS-CoV-2 Variant Classifications and Definitions). There were no VOIs or VOHCs at the time.

SARS-CoV-2 still circulates among human populations in different locations, weather conditions, and epidemiological settings. It is important to investigate how this regional diversity contributes to viral evolution and the emergence of new variants in these regions. Research suggests possible selective mutations in the SARS-CoV-2 genome, with specific sites that appear more subject to selective mutation. Some mutational sites in the ORF1ab, ORF3a, ORF8, and N regions of SARS-CoV-2 reportedly exhibit different rates of mutation (Wang et al, 2020). A study involving the analysis and characterization of samples from COVID-19 patients in different parts of the world identifies eight novel recurrent mutational sites in the SARS-CoV-2 genome. Interestingly, the study also notes changes at sites 2891, 3036, 14408, 23403, and 28881 to be common in Europe, while changes at sites 17746, 17857, and 18060 are common in North America (Pachetti et al, 2020).

A recent study also found that the ongoing evolution of SARS-CoV-2 involves purifying selection, and that a small number of sites appear to be positively selected. The work identifies the spike protein receptor binding domain (RBD) and a region of the nucleocapsid protein as positively selected for substitutions. It also highlights a trend in virus diversity with geographic region, and adaptive diversification that may potentially make variant-specific vaccination an issue (Rochman et al, 2021).

Given all of the novel SARS-CoV-2 variants and strains that have emerged from different geographical regions of the world, we need to investigate this connection to the spread of the virus; for example, weather factors possibly play a systematic role (Pezzotti et al, 2016; Segovia-Dominguez et al, 2021). The immune system also varies across the human population. Genomic variation accounts for only 20%–40% of this immune system variation, while the remaining 60%–80% is accounted for by age and by environmental factors, such as where we live and our neighbors, cohabitation, and chronic viral infections. Immune response is also known to show intraspecies variation (Liston et al, 2016).

There is an ongoing evolutionary arms race between hosts and the pathogens they are exposed to, which constantly changes the host antipathogen attack and in turn causes the pathogen to refine or adjust its escape from host immune attack (Brodin et al, 2015; Liston et al, 2016). This is constantly taking place, with the virus under evolutionary pressure and natural selection to propagate the virus with the highest fitness. It may be complex to characterize how each factor contributes to this variation. The immune system variation is possibly an important driver of how new variants of SARS-CoV-2 are regionally emerging with positive selections for escaping immune neutralization, increased infectivity, and transmissibility, as observed recently.

Classification of the SARS-CoV-2 spike protein sequences based on geographical location of emergence is therefore an important and informative exploration for possible unique patterns, trends, and distribution. The SARS-CoV-2 spike protein must interact chemically with the host receptor molecule, ACE2, for cellular uptake. Since millions of spike sequences are now available on public databases such as GISAID, classifying those sequences becomes a Big Data problem. When dealing with big data, scalability and robustness are two important challenges. Some algorithms are robust, while others scale well but give poor predictive performance on larger data sets. Ali and Patterson (2021) proposed an approach, called Spike2Vec, which is scalable to larger data sets. When there is some structure (natural clustering) in the data, Spike2Vec has been shown to be useful compared with one-hot embedding (Ali and Patterson, 2021).

However, we show in this article that Spike2Vec does not work well in all scenarios. To further improve the results of Spike2Vec and that of one-hot embedding, we use a neural network (NN) model.

In this article, we propose to use a simple sequential convolutional NN along with a k-mer-based feature vector representation for classifying the geographical locations of COVID-19 patients using spike protein sequences only. Our contributions in this article are the following:

(1) We show that the NN model is scalable on a high volume of data and significantly outperforms the baseline algorithms.

(2) We show the importance of different amino acids within the spike sequence by computing information gain (IG) corresponding to the class label.

(3) We show that given the complexity of the data, our model is still able to outperform the baselines while using only 10% of the training data.

(4) We show that preserving the order of amino acids using k-mers achieves better predictive performance than the traditional one-hot encoding (OHE)-based embedding approach.

(5) Our approach allows us to predict the geographical region of the COVID-19-infected patient while accounting for important local and global variability in the spike sequences.

The rest of the article is organized as follows: Section 2 contains the related work. The proposed approach is given in Section 3. Data set details and experimental setup are in Section 4. The results of our method and comparison with baselines are shown in Section 5. Finally, we conclude our article in Section 6.

2. RELATED WORK

Sequence classification is a widely studied problem in domains such as sequence homology (shared ancestry) detection between a pair of proteins and phylogeny-based inference (Dhar et al, 2020) of disease transmission (Krishnan et al, 2021). Knowledge of variants and mutations can also help in identifying the transmission patterns of different variants, which will help to devise appropriate public health interventions so that the rapid spread of viruses can be prevented (Ahmad et al, 2020, 2017, 2016; Tariq et al, 2017). This will also help in vaccine design and efficacy. Previous studies have successfully performed different data analytic tasks using a fixed-length numerical representation of the data. Such representations have applications in different domains such as graphs (Hassan et al, 2021, 2020), nodes in graphs (Ali et al, 2021a; Grover and Leskovec, 2016), and electricity consumption (Ali et al, 2020b, 2019).

This vector-based representation also achieves significant success in sequence analysis, such as texts (Shakeel et al, 2020a, 2020b, 2019), electroencephalography and electromyography sequences (Atzori et al, 2014; Ullah et al, 2020), networks (Ali et al, 2020a), and biological sequences (Ali et al, 2021c). However, most of the existing sequence classification methods require the input sequences to be aligned. Although sequence alignment helps to analyze the data better, it is a very costly process.

In the evolution of the SARS-CoV-2 genome, it is well-known that a disproportionate amount (in terms of its length) of the variation takes place in the spike region. Kuzmin et al (2020) show that viral-host classification can be done efficiently using spike sequences only and applying different machine learning (ML) models. They use OHE to obtain a numerical representation for the spike sequences and then apply traditional ML classifiers after reducing the dimensions of the data using the principal component analysis (PCA) method (Wold et al, 1987). Although OHE is proven to be efficient in terms of predictive performance, it does not preserve the order of amino acids in the spike protein if we want to take the pairwise Euclidean distance (Ali et al, 2021d). Another problem with the OHE-based approach is that it deals with aligned sequential data only.

Many previous studies propose the use of k-mers (substrings of length k), which is an alignment-free approach, instead of the traditional OHE-based embedding to obtain the numerical vector representation for the genomic data (Ali and Patterson, 2021; Ali et al, 2021b, 2021d). After computing substrings of length k, a fixed-length feature vector is generated, containing the count of each unique k-mer in a given sequence. This k-mer-based method has been used for phylogenetic applications (Blaisdell, 1986) and has shown success in constructing accurate phylogenetic trees from DNA sequences. Ali et al (2021d) argue that better sequence classification results can be achieved using k-mers instead of OHE because k-mers tend to preserve the order of amino acids within a (e.g., spike) sequence.

After obtaining the numerical representation, a popular approach is to compute the kernel matrix and provide that matrix as input to traditional ML classifiers such as support vector machines (SVM) (Farhan et al, 2017; Kuksa et al, 2012; Leslie et al, 2003). Farhan et al (2017) propose an approximate kernel (Gram matrix) computation algorithm, which uses the k-mer-based feature vector representation as an input to the kernel computation algorithm.

3. PROPOSED APPROACH

In this section, we present our proposed model for classifying population regions based on spike sequences only. We start by explaining the basic MAJORITY-based model for the classification. We then show the OHE-based feature vector generation approach. After that, we show how we generate k-mer-based frequency vectors. Then, we introduce our models, which we are using for the purpose of classification. Finally, we give brief details on the experimental setup, before reporting the results of these experiments in the following section.

3.1. Majority

We start with a simple baseline model called MAJORITY. In this approach, we simply take the class with majority representation in the training data and declare it as the class label for all data points in the test set. We then measure the performance of this baseline model using different evaluation metrics.
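As an illustration, the metrics of the MAJORITY baseline can be derived in closed form from the class counts alone. The sketch below uses the continent totals of Table 1 and assumes that the test split has the same class proportions as the full data set; it also assumes the common convention that precision and F₁ are taken as 0 for classes that are never predicted. The function name and dictionary keys are hypothetical.

```python
# Continent totals taken from Table 1 of this article.
counts = {
    "Europe": 1_434_445,
    "North America": 774_760,
    "South America": 26_729,
    "Asia": 127_727,
    "Australia": 20_985,
}

def majority_baseline_metrics(class_counts):
    """Metrics obtained when every test point is labeled with the largest class.

    Assumes the test set has the same class proportions as class_counts.
    """
    total = sum(class_counts.values())
    majority = max(class_counts, key=class_counts.get)
    share = class_counts[majority] / total  # fraction of points in the majority class

    accuracy = share          # only majority-class points are labeled correctly
    precision_major = share   # every point is predicted as the majority class
    recall_major = 1.0        # every majority-class point is recovered
    f1_major = 2 * precision_major * recall_major / (precision_major + recall_major)

    # All other classes are never predicted, so their precision, recall,
    # and F1 are taken as 0 (assumed convention).
    return {
        "majority": majority,
        "accuracy": accuracy,
        "precision_weighted": share * precision_major,
        "recall_weighted": share * recall_major,
        "f1_weighted": share * f1_major,
        "f1_macro": f1_major / len(class_counts),
    }

metrics = majority_baseline_metrics(counts)
```

Under these assumptions, the values come out at roughly 0.60 (accuracy), 0.36 (weighted precision), 0.45 (weighted F₁), and 0.15 (macro F₁), in line with the MAJORITY row of Table 2.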

3.2. One-hot encoding

To obtain a numerical representation for the sequence-based data, one of the popular methods is OHE (Ali and Patterson, 2021; Ali et al, 2021b, 2021d; Kuzmin et al, 2020). Note that the length of each spike sequence in our data set is 1274, and that each sequence contains characters (amino acids) from an alphabet of 21 unique characters, “ACDEFGHIKLMNPQRSTVWXY.” For OHE, since we need a 21-dimensional subvector for each amino acid, the length of the OHE-based feature vector for each spike sequence is 21 × 1273 = 26,733 (we take the length of the spike protein as 1273 instead of 1274 because the stop character “*” occupies the 1274th position). After obtaining the OHE for the whole data, since the dimensionality of the data is high, Kuzmin et al (2020) use the typical PCA approach for dimensionality reduction.

Since the size of the data is much larger in our case, we simply cannot use PCA because of the high computational cost (Ali and Patterson, 2021). For this purpose, we use instead an unsupervised approach for low-dimensional feature vector representation, called random Fourier features (RFF) (Rahimi and Recht, 2007).
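The OHE construction described in this section can be sketched as follows. This is a minimal illustration on a short toy sequence, not the implementation used in our experiments; the function name and the ordering of the alphabet are assumptions.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"  # the 21 characters listed in this section
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_encode(sequence):
    """Concatenate one 21-dimensional indicator subvector per amino acid."""
    vector = [0] * (len(ALPHABET) * len(sequence))
    for position, aa in enumerate(sequence):
        # Set the entry for this amino acid inside the subvector of this position.
        vector[position * len(ALPHABET) + INDEX[aa]] = 1
    return vector

toy = "MDPEG"  # short toy sequence, not a full spike protein
encoded = one_hot_encode(toy)
```

For the toy sequence the vector has 21 × 5 = 105 entries, exactly five of which are 1; for a spike sequence of 1273 amino acids the same construction gives the 21 × 1273 = 26,733 dimensions stated above.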

3.3. RFF-based embedding

To compute the pairwise similarity between two feature vectors, a popular method is to compute the kernel (similarity) matrix (Gram matrix) and give it as input to popular classifiers such as SVM (Farhan et al, 2017). However, exact kernel methods are expensive in terms of computation (scale poorly on training data) (Rahimi and Recht, 2007), and they require huge space to store an n × n matrix (where n is the total number of data points). To overcome this problem, we can use the so-called kernel trick.

Definition 1 (Kernel Trick). The kernel trick is a fast way to compute the similarity between feature vectors using the inner product. The kernel trick's main goal is to avoid the explicit need to map the input data to a high-dimensional feature space.

The kernel trick relies on the assumption that any positive definite function $f (a, b)$ , where $a, b \in ℛ^{d}$ , defines an inner product and a lifting $ϕ$ so that we can quickly compute the inner product between the lifted data points (Rahimi and Recht, 2007). It can be described in a formal way as $⟨ ϕ (a), ϕ (b) ⟩ = f (a, b)$ .

Although the kernel trick is effective in terms of computational complexity, it is still not scalable to millions of data points. To overcome these computational and storage problems, we use RFF (Rahimi and Recht, 2007), an unsupervised approach that maps the input data to a randomized low-dimensional feature space (Euclidean inner product space). It can be described in a formal way as $z : ℛ^{d} \to ℛ^{D}$ . In RFF, we approximate the inner product between a pair of transformed points, which is almost equal to the actual inner product between the original data points. More formally: $f (a, b) = ⟨ ϕ (a), ϕ (b) ⟩ \approx z (a)' z (b)$ . Here, z is a (transformed) low-dimensional (approximate) representation of the original feature vector (unlike the lifting $ϕ$ ). Since z is the approximate low-dimensional representation of the original feature vector, we can use z as an input for different ML tasks such as classification.
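To make the approximation $f (a, b) \approx z (a)' z (b)$ concrete, the following sketch builds a random Fourier feature map for the Gaussian kernel, following the cosine construction of Rahimi and Recht (2007). The choice of the Gaussian kernel, the bandwidth `sigma`, the output dimension `D`, and all function names are assumptions made for illustration only.

```python
import math
import random

def make_rff_map(d, D, sigma=1.0, seed=0):
    """Return z: R^d -> R^D approximating the Gaussian kernel with bandwidth sigma."""
    rng = random.Random(seed)
    # Random frequencies drawn from the Fourier transform of the Gaussian kernel.
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    scale = math.sqrt(2.0 / D)

    def z(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(W, offsets)]
    return z

def gaussian_kernel(a, b, sigma=1.0):
    """Exact kernel value f(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

a = [0.2, -0.1, 0.4]
b = [0.0, 0.3, 0.1]
z = make_rff_map(d=3, D=2000)
approx = sum(p * q for p, q in zip(z(a), z(b)))  # z(a)' z(b)
exact = gaussian_kernel(a, b)                     # f(a, b)
```

The inner product of the transformed points is only an estimate of the kernel value; its error shrinks as `D` grows, which is the trade-off between accuracy and dimensionality that RFF exposes.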

3.4. Spike2Vec

Spike2Vec is a recently proposed method that uses k-mers and RFF to design a low-dimensional feature vector representation of the data and then perform typical ML tasks such as classification and clustering (Ali and Patterson, 2021). The first step of Spike2Vec is to generate k-mers for the spike sequences.

3.4.1. k-mer computation

The main idea behind k-mers is to preserve the order of amino acids within spike sequences. The k-mers are basically a set of substrings (called mers) of length k. For each spike sequence, the total number of k-mers is $N - k + 1$ , where N is the length of the spike sequence (1274), and k is a user-defined parameter for the size of each mer. An example of k-mers (where $k = 3, 4,$ and $5$ ) is given in Figure 2. In this article, we are using $k = 3$ (selected empirically).

FIG. 2.

Example of different length k-mers in spike sequence “MDPEG.”
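The k-mer computation and the resulting frequency vector can be sketched as below, using the sequence “MDPEG” of Figure 2. The function names and the lexicographic ordering of the k-mers in the vector are assumptions for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"  # 21 amino acid characters

def kmers(sequence, k):
    """All N - k + 1 overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_frequency_vector(sequence, k=3):
    """Counts over all len(ALPHABET)**k possible k-mers, in lexicographic order."""
    counts = Counter(kmers(sequence, k))
    # Counter returns 0 for k-mers that do not occur in the sequence.
    return [counts["".join(p)] for p in product(ALPHABET, repeat=k)]

three_mers = kmers("MDPEG", 3)   # ['MDP', 'DPE', 'PEG']
vector = kmer_frequency_vector("MDPEG", k=3)
```

With 21 characters and $k = 3$, the vector has $21^{3} = 9261$ entries regardless of the sequence length, and a sequence of length 1274 contributes $1274 - 3 + 1 = 1272$ k-mers to it.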

3.5. ML models

On top of both feature engineering-based embeddings, namely OHE and Spike2Vec, we apply three downstream ML classifiers, namely naive Bayes (NB), logistic regression (LR), and ridge classifier (RC). For all these classifiers, default parameters are used for training. To measure performance, we use average accuracy, precision, recall, weighted and macro F₁, and receiver operating characteristic area under the curve (ROC-AUC). We also show the training runtime (in seconds) for all methods.

3.6. NN model

Although the Spike2Vec embedding allows the downstream ML models to scale to data sets with millions of sequences, and has been shown to outperform the typical OHE, it is not always effective in terms of overall predictive performance. To further increase the predictive performance, we move to an NN architecture, which takes OHE- or k-mer-based vectors as input. Note that no dimensionality reduction step (e.g., PCA, RFF) is applied beforehand: the NN model takes the OHE- or k-mer-based vectors directly. Our NN architecture is built using a sequential constructor. We create a fully connected network with one hidden layer that contains 9261 neurons (which is equal to the length of the k-mer-based feature vector, i.e., one neuron for every feature at the beginning). The activation function that we use is the “rectifier.” In the output layer, we use the “softmax” activation function.

Finally, we use the Adam optimization algorithm (Zhang, 2018) with the “sparse categorical cross-entropy” loss function (used for multiclass classification problems), which computes the cross-entropy loss between the labels and predictions. We use a batch size of 100 and train the model for 10 epochs. Note that we use OHE- and k-mer-based frequency vectors (separately) as input to the NN.
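The forward pass of such a network (one fully connected hidden layer with the rectifier activation, followed by a softmax output layer) can be sketched in plain Python as follows. The toy layer sizes, the random weight initialization, and all function names are assumptions for illustration; this is not the trained model, and a deep learning library would be used in practice.

```python
import math
import random

def relu(values):
    """Rectifier activation: negative values are set to 0."""
    return [max(0.0, v) for v in values]

def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
n_features, n_classes = 8, 5   # toy sizes; the k-mer vectors have 9261 features
W1, b1 = init_layer(n_features, n_features, rng)  # hidden width equals input width
W2, b2 = init_layer(n_features, n_classes, rng)

x = [1, 0, 2, 0, 0, 1, 0, 3]   # toy k-mer count vector
hidden = relu(dense(x, W1, b1))
probs = softmax(dense(hidden, W2, b2))  # one probability per class
```

The predicted class is the index of the largest entry of `probs`; the softmax guarantees that the entries are positive and sum to 1.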

Remark 1. Note that we are using sparse categorical cross-entropy rather than simple categorical cross-entropy because we are using integer labels rather than the one-hot representation of labels.
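The point of Remark 1 can be checked numerically: for a single example, the two losses give the same value and differ only in how the label is encoded. The sketch below uses hypothetical function names and a toy softmax output.

```python
import math

def categorical_cross_entropy(one_hot_label, probs):
    """Cross-entropy when the label is given as a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

def sparse_categorical_cross_entropy(label_index, probs):
    """Cross-entropy when the label is given as an integer class index."""
    return -math.log(probs[label_index])

probs = [0.1, 0.7, 0.2]  # toy softmax output over three classes
label = 1                 # integer label
one_hot = [0, 1, 0]       # the same label in one-hot form

sparse_loss = sparse_categorical_cross_entropy(label, probs)
dense_loss = categorical_cross_entropy(one_hot, probs)
```

Both variables hold $- \log 0.7$; the sparse form simply avoids materializing a one-hot label vector for every sequence.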

4. EXPERIMENTAL EVALUATION

In this section, we provide some statistics and visualization on the data that we use, and then the precise details of the experimental setup used to produce the results.

4.1. Data set statistics

We use a set of 2,384,646 spike amino acid sequences obtained from GISAID (n.d.), along with metadata on geographical location (continent, country, and, in the case of the United States, state). These data, organized by country, are given in Table 1.

Table 1.

The Set of 2,384,646 SARS-CoV-2 Spike Sequences Used in This Study, Labeled by Country of Origin

Region	Country	No. of sequences
Europe	England	568,202
	Germany	146,730
	Denmark	138,574
	Sweden	78,810
	Scotland	69,387
	France	56,247
	Netherlands	49,938
	Spain	48,830
	Switzerland	48,516
	Wales	46,851
	Italy	44,728
	Belgium	28,758
	Ireland	23,441
	Poland	16,061
	Norway	14,684
	Lithuania	13,586
	Luxembourg	12,713
	Finland	11,254
	Slovenia	17,135
Total	19	1,434,445
North America	United States	663,527
	Canada	91,193
	Mexico	20,040
Total	3	774,760
South America	Brazil	26,729
Total	1	26,729
Asia	Japan	75,423
	India	37,943
	Israel	14,361
Total	3	127,727
Australia	Australia	20,985
Total	1	20,985

4.2. Data visualization

To evaluate whether any natural clustering exists in our data, we use t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008). The t-SNE approach maps each data point to a two-dimensional (2D) real vector, which can then be visualized using a scatter plot. Since applying t-SNE on the whole data is very costly and time-consuming, we randomly sampled a subset ( $\approx$ 80 K sequences) from the data (of Table 1) and generated 2D real vectors using the t-SNE approach (Fig. 3).

FIG. 3.

A t-SNE plot from the frequency (k-mer-based feature) vectors along with the country information for $\approx$ 80 K randomly sampled sequences from the set of 2,384,646 sequences (Table 1) used in this study. t-SNE, t-distributed stochastic neighbor embedding.

Remark 2. We (randomly) select $\approx$ 80 K sequences because the t-SNE method is computationally very expensive: its runtime is $O (N^{2})$ , where N is the number of data points (Pezzotti et al, 2016), which makes it infeasible on 2.3 million sequences.
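As a rough worked example of Remark 2, the quadratic cost model can be used to compare the sampled subset with the full data set. The sketch only counts $N^{2}$ operations and ignores constants, so the ratio is indicative rather than a runtime prediction.

```python
n_full = 2_384_646   # sequences in Table 1
n_sample = 80_000    # approximate size of the random sample

# Under an O(N^2) cost model, work grows with the square of the number of points.
work_full = n_full ** 2
work_sample = n_sample ** 2
ratio = work_full / work_sample
```

Under this model, running t-SNE on all sequences would require close to 900 times the work needed for the 80 K sample.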

The rate of spread of the three most common variants of SARS-CoV-2 (in the United States) from March 2020 to July 2021 from our data is given in Figure 4. We can see that the alpha variant was clearly the VOC when it reached its peak in April 2021. We can see a drop from this peak for all variants after April 2021. This is likely because a significant proportion of the population was vaccinated by this point, and hence, the total number of cases started decreasing (Ali and Patterson, 2021).

FIG. 4.

The rate of spread of the three most common SARS-CoV-2 variants (in the United States) from March 2020 till July 2021.

4.3. Experimental setup

All experiments are conducted using an Intel® Xeon® CPU E7-4850 v4 @ 2.10 GHz running 64-bit Ubuntu 16.04.7 LTS (Xenial Xerus) with 3023 GB of memory. The implementation of our algorithms is done in Python and the code is available online for reproducibility.* We obtain a set of 2,384,646 spike amino acid sequences from GISAID (n.d.). GISAID provides many different metadata for these sequences, such as collection date, geographical location, and sometimes variant information. These data, preprocessed to include geographical location, are available online,† and can be used after agreeing to the terms and conditions of GISAID.‡ For the classification algorithms, we use 10% of the data for training and 90% for testing. The purpose of using a smaller training data set is to show how much performance gain we can achieve while using minimal training data.
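A 10%/90% split of this kind can be sketched as a seeded random shuffle of the sequence indices. Whether the split in our experiments is stratified, and how the seed varies between runs, is not specified here, so the plain (non-stratified) shuffle, the function name, and the seed below are assumptions.

```python
import random

def train_test_split_indices(n, train_fraction=0.10, seed=0):
    """Shuffle indices 0..n-1 and split them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n * train_fraction)
    return indices[:n_train], indices[n_train:]

# Toy example with 1000 sequences instead of 2,384,646.
train_idx, test_idx = train_test_split_indices(1000)
```

Changing `seed` between runs gives different random splits, which is one way to obtain the repeated runs over which averages and standard deviations are reported.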

5. RESULTS AND DISCUSSION

In this section, we present results for three different granularities of class labels, namely continents, countries, and finally states in a case study of the United States.

5.1. Continent classification

In this section, we show classification results for five different continents, namely Europe, North America, South America, Asia, and Australia (Table 1). The classification results (average $\pm$ standard deviation of 5 runs) are given in Table 2. In terms of predictive performance, we can observe that the NN model with the k-mer-based embedding performs best compared with the baselines. Comparing the two embedding methods (i.e., OHE and k-mers), we can see that k-mers are better than OHE for the NN model. Since k-mers preserve the order of amino acids better than OHE, they give richer information in the feature vector. In terms of runtime, RC with the Spike2Vec embedding performs best. The NN model takes longer to train than the simple ML classifiers because of the larger number of parameters that must be tuned during training.

Table 2.

Continent Classification Results (Average ± Standard Deviation of 5 Runs) for 5 Continents Comprising 2,384,646 Spike Sequences (10% Training Set and 90% Testing Set)

Approach	Embedding method	Algorithm	Accuracy	Precision	Recall	F ₁ weighted	F ₁ macro	ROC-AUC	Training runtime (seconds)
MAJORITY	—	—	0.60 ± 0.000	0.36 ± 0.000	0.60 ± 0.000	0.45 ± 0.000	0.15 ± 0.000	0.50 ± 0.000	–
Feature engineering	OHE	NB	0.49 ± 0.005	0.63 ± 0.006	0.49 ± 0.005	0.50 ± 0.007	0.38 ± 0.006	0.63 ± 0.005	1457.2 ± 0.023
		LR	0.67 ± 0.007	0.66 ± 0.008	0.67 ± 0.007	0.64 ± 0.007	0.33 ± 0.008	0.58 ± 0.005	1622.4 ± 0.031
		RC	0.67 ± 0.004	0.66 ± 0.005	0.67 ± 0.004	0.64 ± 0.006	0.28 ± 0.004	0.57 ± 0.005	1329.1 ± 0.029
	Spike2Vec	NB	0.48 ± 0.007	0.63 ± 0.006	0.48 ± 0.008	0.49 ± 0.007	0.36 ± 0.007	0.63 ± 0.006	970.6 ± 0.065
		LR	0.67 ± 0.005	0.67 ± 0.007	0.67 ± 0.006	0.64 ± 0.007	0.34 ± 0.006	0.58 ± 0.005	1141.9 ± 0.072
		RC	0.67 ± 0.003	0.66 ± 0.004	0.67 ± 0.003	0.64 ± 0.003	0.29 ± 0.006	0.57 ± 0.007	832.3 ± 0.057
NN	OHE	NN	0.75 ± 0.007	0.76 ± 0.008	0.75 ± 0.008	0.72 ± 0.009	0.47 ± 0.007	0.65 ± 0.008	30,932.0 ± 0.105
	k-mers	NN	0.77 ± 0.009	0.78 ± 0.008	0.77 ± 0.009	0.74 ± 0.007	0.49 ± 0.008	0.65 ± 0.009	18,631.7 ± 0.235

Best average values are shown in bold.

LR, logistic regression; NB, naive Bayes; NN, neural network; OHE, one-hot encoding; RC, ridge classifier; ROC-AUC, receiver operating characteristic area under the curve.

5.2. Country classification

After classifying the continents, we take countries as the class label and train all ML and NN models again with the same parameter settings. The classification results (average $\pm$ standard deviation of 5 runs) for countries are given in Table 3. In terms of predictive performance, we can observe that the NN model performs better than all baselines. In terms of runtime, RC with OHE is the fastest classifier. An important observation here is the drop in overall performance of all classification models compared with the continent classification. This behavior is likely because any natural clustering or other information in the spike sequences corresponding to the location of patients breaks down at this level of granularity. This lack of signal in the data makes country classification a difficult task. However, we can see that the NN model can still classify the countries better than the baselines.

Table 3.

Country Classification Results (Average ± Standard Deviation of 5 Runs) for 27 Countries Comprising 2,384,646 Spike Sequences (10% Training Set and 90% Testing Set)

Approach	Embedding method	Algorithm	Accuracy	Precision	Recall	F ₁ weighted	F ₁ macro	ROC-AUC	Training runtime (seconds)
MAJORITY	—	—	0.27 ± 0.000	0.07 ± 0.000	0.27 ± 0.000	0.12 ± 0.000	0.01 ± 0.000	0.50 ± 0.000	—
Feature engineering	OHE	NB	0.11 ± 0.007	0.44 ± 0.008	0.11 ± 0.007	0.11 ± 0.007	0.10 ± 0.009	0.55 ± 0.008	1308.4 ± 0.098
		LR	0.40 ± 0.009	0.46 ± 0.009	0.40 ± 0.008	0.33 ± 0.007	0.15 ± 0.008	0.55 ± 0.009	2361.8 ± 0.074
		RC	0.40 ± 0.006	0.38 ± 0.007	0.40 ± 0.006	0.31 ± 0.008	0.11 ± 0.007	0.54 ± 0.006	746.4 ± 0.085
	Spike2Vec	NB	0.13 ± 0.004	0.41 ± 0.005	0.13 ± 0.004	0.151 ± 0.006	0.109 ± 0.005	0.555 ± 0.007	1315.3 ± 0.085
		LR	0.40 ± 0.006	0.45 ± 0.006	0.40 ± 0.007	0.33 ± 0.008	0.16 ± 0.007	0.55 ± 0.006	2736.8 ± 0.058
		RC	0.39 ± 0.006	0.37 ± 0.007	0.39 ± 0.006	0.31 ± 0.008	0.11 ± 0.006	0.54 ± 0.007	779.4 ± 0.074
NN	OHE	NN	0.49 ± 0.009	0.53 ± 0.008	0.49 ± 0.009	0.43 ± 0.009	0.24 ± 0.007	0.60 ± 0.006	28,914.8 ± 0.453
	k-mers	NN	0.51 ± 0.005	0.57 ± 0.004	0.51 ± 0.005	0.45 ± 0.006	0.28 ± 0.006	0.60 ± 0.007	10,383.6 ± 0.745

Best average values are shown in bold.

5.3. A case study of the United States

After classifying continents and countries, we investigate our model with finer-grained class labels. For this purpose, we take the single country with the highest number of spike sequences in the data. Since the United States has the most spike sequences in our data (Table 1), we take it as a case study to further explore different states within the United States. The pie chart showing the distribution of the sequences over the US states is given in Figure 5.

FIG. 5.

Distribution of the 663,527 sequences over the US states, with the top 11 states specified, while the remaining fall into the “others” category.

The classification results (average $\pm$ standard deviation of 5 runs) for different states are given in Table 4. We again observe a drop in predictive performance for all models. This again suggests that as the class labels become finer-grained, it becomes harder for any model to classify accurately. We can also observe that the NN model with the k-mer-based feature embedding performs better than all the baselines.

Table 4.

Classification Results (Average ± Standard Deviation of 5 Runs) for Different US States (10% Training Set and 90% Testing Set)

Approach	Embedding method	Algorithm	Accuracy	Precision	Recall	F ₁ weighted	F ₁ macro	ROC-AUC	Training runtime (seconds)
MAJORITY	—	—	0.33 ± 0.000	0.11 ± 0.000	0.33 ± 0.000	0.17 ± 0.000	0.04 ± 0.000	0.50 ± 0.000	—
Feature engineering	OHE	NB	0.18 ± 0.005	0.32 ± 0.006	0.18 ± 0.004	0.14 ± 0.005	0.13 ± 0.006	0.54 ± 0.007	860.2 ± 0.745
		LR	0.37 ± 0.007	0.45 ± 0.008	0.37 ± 0.007	0.26 ± 0.007	0.13 ± 0.008	0.53 ± 0.006	1036.2 ± 0.458
		RC	0.37 ± 0.008	0.41 ± 0.008	0.37 ± 0.007	0.25 ± 0.009	0.12 ± 0.005	0.52 ± 0.007	707.7 ± 0.865
	Spike2Vec	NB	0.19 ± 0.005	0.37 ± 0.005	0.19 ± 0.004	0.14 ± 0.006	0.14 ± 0.006	0.55 ± 0.007	273.7 ± 0.124
		LR	0.38 ± 0.008	0.44 ± 0.007	0.38 ± 0.007	0.29 ± 0.006	0.16 ± 0.007	0.54 ± 0.008	374.2 ± 0.865
		RC	0.37 ± 0.006	0.42 ± 0.006	0.37 ± 0.004	0.27 ± 0.005	0.14 ± 0.008	0.53 ± 0.007	197.1 ± 0.657
NN	OHE	NN	0.38 ± 0.011	0.44 ± 0.012	0.38 ± 0.011	0.34 ± 0.013	0.22 ± 0.015	0.57 ± 0.010	7881.3 ± 0.857
	k-mers	NN	0.47 ± 0.017	0.50 ± 0.016	0.47 ± 0.014	0.42 ± 0.017	0.33 ± 0.015	0.61 ± 0.017	4908.6 ± 0.975

The best average values are shown in bold.

5.4. Importance of attributes

To evaluate the importance of the positions in the spike sequences, we compute the importance of each attribute with respect to the class label (using the WEKA tool^§). For this purpose, a randomly selected subset of spike sequences ( $\approx$ 80K) is taken from the original data set. We then compute the information gain (IG) between each attribute (P = amino acid position) and the true class label (C = country). More formally, IG is computed as follows:

$IG(C, P) = H(C) - H(C \mid P),$ (1)

where $H(C)$ and $H(C \mid P)$ are the entropy and conditional entropy, respectively. The entropy H is calculated as follows:

$H = -\sum_{i \in C} p_{i} \log p_{i},$ (2)

where $p_{i}$ is the probability of class i. The IG values for each attribute are given in Figure 6 and are also available online.^**
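Equations (1) and (2) can be illustrated with a short sketch. This is not the WEKA implementation used in the study. It is a minimal pure-Python version that assumes aligned, equal-length sequences and a base-2 logarithm (the text does not state the base). The function names and the toy sequences and labels are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy H (in bits, an assumed base) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(sequences, labels, position):
    """IG(C, P) = H(C) - H(C | P) for a single amino acid position P."""
    # Group the class labels by the amino acid observed at this position.
    groups = defaultdict(list)
    for seq, label in zip(sequences, labels):
        groups[seq[position]].append(label)
    total = len(labels)
    # H(C | P): entropy within each group, weighted by the group's frequency.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy aligned sequences and country labels (illustrative only, not real data).
toy_sequences = ["MFVD", "MFVG", "MLVD", "MLVG"]
toy_labels = ["A", "A", "B", "B"]
ig_per_position = [
    information_gain(toy_sequences, toy_labels, i)
    for i in range(len(toy_sequences[0]))
]
print(ig_per_position)
```

In this toy example, only the second position separates the two labels, so it receives the full one bit of information gain, while the constant positions and the position that is uninformative about the label receive zero.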

FIG. 6.

Information gain for each amino acid position corresponding to the class.

6. CONCLUSION

This article combines a k-mer-based representation of SARS-CoV-2 spike sequences with several ML models to efficiently classify the sequences by geographical location. We show that our proposed approach outperforms the baselines in terms of predictive performance. Using IG, we also show the importance of attributes (amino acid positions) in the spike sequences. Such classification and its analysis can help researchers study the connection between geographical location and SARS-CoV-2 variants more deeply. In the future, we will explore more sophisticated models such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks, and use other attributes, such as month information, to increase predictive performance and possibly shed light on the dynamics (spread) of the virus over time. Using other alignment-free methods such as minimizers is another possible future direction.

Footnotes

AUTHORS' CONTRIBUTIONS

S.A. and M.P.: Conceptualization. S.A.: Methodology. S.A.: Software. S.A. and Z.T.: Validation. S.A. and M.P.: Formal analysis. S.A., B.B., and Z.T.: Investigation. All: Resources. All: Data curation. All: Writing—original draft. All: Writing—review and editing. S.A.: Visualization. S.A. and M.P.: Supervision. M.P.: Project administration. M.P.: Funding acquisition.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no financial conflicts of interest.

FUNDING INFORMATION

Research supported by an MBD Fellowship for S.A., and a Georgia State University Computer Science start-up grant for M.P.
