Abstract
Given a data-set of ribonucleic acid (RNA) sequences, we can infer the phylogenetics of the samples and use that information for scientific purposes. Based on current data and knowledge, SARS-CoV-2 seemingly mutates much more slowly than the influenza virus that causes seasonal flu. However, very recent evolution casts doubt on this conjecture and dims the prospects raised by mass vaccination. This paper adopts mathematical and computational tools to analyze a data-set of different clades of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). On the mathematical side, the concept of distance associated with the Kolmogorov complexity and Shannon information theories, as well as with the Hamming scheme, is considered. On the computational side, advanced data processing techniques, such as data compression, clustering and visualization, are borrowed for tackling the problem. The results of the synergistic approach reveal the complex time dynamics of the evolutionary process and may help to clarify future directions of the SARS-CoV-2 evolution.
1 Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a single-stranded ribonucleic acid (RNA) beta-coronavirus with a genome 29,903 nucleotides long. It caused an outbreak in the Chinese city of Wuhan in December 2019. Subsequently, the new coronavirus spread worldwide in less than 4 months, and a pandemic situation was declared by the World Health Organization (WHO). The disease was named Coronavirus Disease 2019 (COVID-19) in February 2020, and it is now responsible for more than 145,000,000 confirmed cases and 4,230,000 deaths reported worldwide as of August 3, 2021 (source: World Health Organization https://covid19.who.int). There are significant differences in case fatality rates (proportion of deaths from a specific disease compared to the total number of individuals diagnosed with the disease for a particular period) between countries, possibly related to the efficacy of the measures adopted to limit viral spreading, the demographic pyramid, i.e., the distribution of various age groups in a particular country, and the screening, testing and tracing strategy [1].
Using the data from the database of the Global Initiative on Sharing Avian Influenza Data (GISAID) Consortium, major clades of SARS-CoV-2 were identified [2]. The initial RNA genome (accession number: NC_045512.2), identified in Wuhan, is named clade ‘L’. This root clade underwent mutations due to numerous factors, and some new clades emerged with stable mutations: clade ‘G’ (presenting the spike protein alteration S-D614G, the first dominant variant) and its two derivatives ‘GH’ (with the ORF3a-Q57H mutation) and ‘GR’ (affected by the RG203KR mutation); clade ‘V’ (variant of the ORF3a coding protein NS3-G251), and clade ‘S’ (variant ORF8-L84S). Other alleles or combinations different from the previously described clades are classified as clade ‘O’ [2]. These data confirmed, until now, the relatively low variability presented by SARS-CoV-2 compared to other respiratory viruses [3]. The different fatality rates, speed of transmission, and infectiousness profiles observed in different countries were probably not related to differences in virulence of the clades and their characteristic mutations. However, very recently, some new mutations were found with potential epidemiological consequences. For example, 12 human cases were identified in September 2020 in North Jutland with a unique variant called ‘cluster 5’, a combination of mutations that had not been previously described. All 12 cases were linked to the mink farming industry or the local community [4]. However, the clinical presentation, severity, and duration of COVID-19, and transmission among those infected were similar to those of other circulating SARS-CoV-2 viruses. The variant cluster 5 has not been detected since September despite extensive sequencing and data sharing and is thought to be a dead end, owing to its very restricted spread and infectiousness to humans [4].
More disturbing is the new variant strain of SARS-CoV-2 that contains 23 mutations, eight of which are in the spike protein the virus uses to bind to and enter human cells. The spike protein is also the focus of most COVID-19 vaccines that are now being administered in numerous countries. Moreover, the diagnostic tests for COVID-19 are also based on the protein sequence found on the Wuhan reference strain spike. Therefore, their efficacy can be changed by these genomic variations. The appearance of these mutations can also lead to immunological resistance and vaccine escape [5]. The new British variant was reported for the first time in December 2020 and has become highly prevalent globally and responsible for another COVID-19 wave in numerous countries [6]. Based on these mutations, this variant has been predicted to be potentially more transmissible than other circulating strains of SARS-CoV-2. The variant is referred to as SARS-CoV-2 VUI 202012/01 (i.e., Variant Under Investigation, year 2020, month 12, variant 01), or B.1.1.7, as the lineage of the clade GR was classified on GISAID. Variant B.1.1.7 presents increased transmissibility and increased virus load. However, apparently, there is no association with more severe disease [7], although the increased infectiousness can lead to more deaths due to the strain on the health systems of the affected countries. In mid-November, a new lineage of the clade GH emerged in South Africa, sharing one of the mutations described in the British variant. The South African virus variant (B.1.351), known as ‘triple variant’, is distinct from the UK variant, but both contain an unusually high number of mutations, with potential functional significance, compared to other SARS-CoV-2 lineages, and it can, apparently, partially escape some available vaccines [8]. Several other variants were described more recently in several countries.
Examples are the variants P.1 and P.2 in Brazil, variants B.1.429 and B.1.526 found in the USA, and yet more recently, a variant first sequenced in India (variant B.1.617) [9]. In April 2021, the South Africa, Brazil, and India variants caused or were causing large disease outbreaks in their respective countries, with increased excess deaths due to the overload of hospital capacity.
The SARS-CoV-2 B.1.1.7 variant was already detected at the end of December 2020 in France, Denmark, Holland, and Italy [10]. On 31 May 2021, the WHO assigned simple labels, easy to say and remember, to the most important variants of SARS-CoV-2, using letters of the Greek alphabet (see the WHO page ‘Tracking SARS-CoV-2 variants’).
The most important variants of concern according to this new classification are: Alpha, corresponding to lineage B.1.1.7, found in the United Kingdom in September 2020; Beta, corresponding to lineages B.1.351, B.1.351.2, B.1.351.3 present in South Africa in May 2020; Gamma, corresponding to lineages P.1, P.1.1, P.1.2, sequenced in Brazil, in November 2020; and, Delta, corresponding to lineages B.1.617.2, AY.1, AY.2 and AY.3 found in India in October 2020 (see Table 1).
In the summer of 2021, the SARS-CoV-2 Delta variant is becoming dominant in Europe, North America and other parts of the world and is responsible for a new wave, mainly in the unvaccinated population [11]. It is highly transmissible and also appears to affect vaccine effectiveness: breakthrough infections in vaccinated individuals appear to be more frequent with this variant [11]. It was reported that the viral load is 1000 times higher for Delta compared with previous variants of the initial wave of infections, and that Delta may have a faster replication rate, a reduced incubation period, and greater viral shedding [12]. The Delta variant was found to be approximately 64% more transmissible than the Alpha variant that was dominant in the waves of the end of December 2020 and first months of 2021. In turn, Alpha was already estimated to be 50% more transmissible than the D614G strain, responsible for the first wave in the beginning of 2020 [13].
The so-called coronavirus ‘waves’ can be more contagious and spread faster than the initial ones due to the new variants, which can also present other epidemiological problems. There is no definitive evidence that the various variants are associated with a higher disease severity. However, there is a clear risk that future epidemic ‘waves’ may be larger and, therefore, associated with a greater burden for the health systems and society due to the lockdowns. Therefore, with this work we try to understand the new SARS-CoV-2 variants and their relations with the already known virus clades using new strategies for finding the relationships among them [14,15,16]. It is important to understand the recent evolution of the virus, which is partially responsible for the so-called ‘waves’ [17]. In this regard, computational techniques associated with mathematical tools are a promising strategy to tackle genomic data-sets. This approach was tested in [18, 19] using the Kolmogorov complexity and Shannon information theories, associated with clustering techniques [20, 21] such as the Multidimensional Scaling (MDS). Given its successful application in a primary set of 133 items, encompassing a variety of viruses, this paper extends the study to a more challenging problem, namely that of analyzing and comparing the SARS-CoV-2 mutations. For that purpose a new data-set of 307 viral sequences, including RNA information from the beginning of the spread up to the day of writing this paper, is selected. Furthermore, based on the aforementioned mathematical tools, we include a larger set of indices to allow a more complete comparison. In the case of the Kolmogorov complexity we consider four indices [22, 23]: Normalized Compression Distance, Compression-based Dissimilarity Measure, Chen-Li Metric and Compression-based Cosine (abbreviated by the acronyms NCD, CDM, CLM and CosS).
In the scope of the Shannon information theory we also consider four metrics, namely the Jaccard, Jensen-Shannon, Jeffreys and Topsøe distances (denoted as \(d_{Ja}\), \(d_{JS}\), \(d_{Je}\) and \(d_{To}\)) [24, 25]. A third, distinct type of assessment is also included and consists of the Hamming distance (\(d_{Ha}\)), widely used in information theory [26].
Following these ideas the rest of the paper is organized as follows. Section 2 introduces the fundamental tools. The main mathematical concepts involved with distance, Kolmogorov complexity, Shannon information and Hamming metric are summarized. Additionally, the computational tools such as data compression and MDS are also included. Section 3 describes and analyses the data-set of very close, but distinct, information for a large number of RNA sequences. Section 4 develops a synergistic performance of a variety of measures associated with the MDS clustering and visualization. The results shed light on the dynamics of the evolutionary process of the SARS-CoV-2 lineages. Finally, Sect. 5 presents the conclusions.
2 Fundamental tools
2.1 Distance
A function \(d(\cdot ,\cdot )\) stands for the distance between two objects x and y if it satisfies three axioms [27], namely identity \(d(x,y) = 0\) if \(x = y\), symmetry \(d(x,y)= d(y,x)\) and triangle inequality \( d(x,y) \le d(x,z) + d(y,z)\). These axioms imply the non-negativity (or separation condition) \(d(x,y)\ge 0\). On the other hand, they allow the definition of a plethora of different functions, with distinct pros and cons [24, 25]. Based on these notions, several algorithms [28,29,30] were adopted for comparing data sequences [26, 31,32,33]. However, users must bear in mind that the selection of a set of distances for a given application requires some experience and that a number of numerical trials are usually necessary before finding the ‘best’ ones [34,35,36,37].
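The non-negativity claim follows in one step. As a short check, setting \(y = x\) in the triangle inequality as written above and using the identity axiom gives

\[
0 = d(x,x) \le d(x,z) + d(x,z) = 2\,d(x,z) \quad \Rightarrow \quad d(x,z) \ge 0
\]

for any pair of objects x and z.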
In the final part of the paper, the Appendix presents the mathematical and algorithmic fundamentals of the distances used in this paper for assessing the genetic information.
2.2 Data compression
Data compression algorithms can be classified as ‘lossless’ and ‘lossy’. Lossless compression algorithms are typically used for archival or other high fidelity purposes and reduce the size of files without losing any information, which means that we can reconstruct the original data from the compressed file. Lossy compression algorithms reduce the size of files by discarding the less important information, which can significantly reduce file size but also affects file quality.
In this paper we used the BZip2 compression algorithm, which is based on the Burrows-Wheeler transform [38]. Its files carry the extension BZ2, designating a pure data compression format that does not provide file archival features. BZip2 is somewhat slower than the LZW (extension .Z) and Deflate (extensions .zip and .gz) compression algorithms [39], although a properly implemented BZip2 can easily be parallelized and benefit from recent multi-core CPUs. It is, however, faster than more powerful compression schemes such as the RAR, 7Z, and new ZIPX formats. Its compression ratio is also usually intermediate between the older Deflate-based ZIP/GZ files and the modern RAR, 7Z and ZIPX formats [40].
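To illustrate how a compressor can serve as a practical stand-in for the Kolmogorov complexity, the following minimal sketch computes a compression-based distance with Python's standard `bz2` module. It uses the commonly cited form of the Normalized Compression Distance, \((C(xy) - \min(C(x),C(y)))/\max(C(x),C(y))\), where \(C(\cdot)\) is the compressed length; the exact definitions used in this paper are those of the Appendix, and the function names and toy strings below are assumptions made only for illustration.

```python
import bz2


def compressed_size(data: bytes) -> int:
    """Length in bytes of the BZip2-compressed data (a computable proxy for complexity)."""
    return len(bz2.compress(data, compresslevel=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance in its usual form:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy strings only; these are not real genomic sequences.
seq_a = b"ACGTTGCA" * 500
seq_b = b"ACGTTGCA" * 499 + b"ACGTAGCA"   # one substitution near the end
print(ncd(seq_a, seq_a), ncd(seq_a, seq_b))
```

Because a real compressor only approximates the ideal complexity, the value for identical inputs is small but not necessarily exactly zero, and the measure is only approximately symmetric.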
2.3 Multidimensional scaling
Let us consider a group of N objects \(x_i\), \(i =1, \cdots , N\), in a q-dim space. The MDS is a computational method [41] that re-organizes them in a structure where the objects are represented by points, trying to highlight the similarities between them in the sense of a predefined distance [42]. The process starts by calculating an \(N \times N\) dimensional matrix, \(D=[d_{ij}]\), with \(d_{ij}\in \mathbb {R^{+}}\) for \(i\ne j\) and \(d_{ii}=0\), \(i,j=1, \cdots , N\), giving object to object distances [43]. In a second phase, the MDS calculates the point coordinates \(\hat{x}_{i}\) in a \(d<q\)-dim space, trying to mimic the original distances. The MDS technique includes numerical iterations for optimizing a cost function, often called Stress, that compares the distances \(d_{ij}=\left| x_{i}-x_{j}\right| \) and \(\hat{d}_{ij}=\left| \hat{x}_{i}-\hat{x}_{j}\right| \), so that the index \(Stress = \sqrt{ \sum _{i<j}\left( \hat{d}_{ij}-d_{ij}\right) ^{2}} \) is minimized.
The MDS points \(\hat{x}_{i}\) have coordinates that yield a symmetric matrix \(\hat{D}=[\hat{d}_{ij}]\) of distances that approximate D. The MDS results are interpreted based on the clusters, and eventually the patterns, of points [18, 44]. Similar objects are represented by nearby points, and dissimilar objects by distant points. Different distances produce distinct MDS maps and it is up to the user to choose the metrics that better reflect the characteristics of the objects under analysis. In other words, the different distances are correct from the mathematical point of view, but the association of each metric with the MDS algorithm may produce disparate patterns in the plots. In some cases the emerging patterns, although different, lead to similar conclusions. In other cases, some distances reflect better (or worse) the information embedded in the dataset, and the selection of the ‘best’ metric depends on a trial and error set of tests based on the user experience.
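As a rough illustration of the two phases described above, the sketch below embeds a distance matrix with classical (Torgerson) MDS, which is a closed-form eigen-decomposition rather than an iterative Stress minimization, and then evaluates the Stress index as written in the text. It assumes NumPy is available; the function names and the toy data are hypothetical and do not reproduce the computations of this paper.

```python
import numpy as np


def classical_mds(D: np.ndarray, dim: int = 3) -> np.ndarray:
    """Classical (Torgerson) MDS: embed an N x N distance matrix in `dim` dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]      # keep the `dim` largest
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def stress(D: np.ndarray, X: np.ndarray) -> float:
    """Stress as written in the text: sqrt of summed squared differences over i < j."""
    D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.triu_indices(D.shape[0], k=1)
    return float(np.sqrt(np.sum((D_hat[i, j] - D[i, j]) ** 2)))


# Toy example: four corners of a unit square, which embed exactly in 2 dimensions.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
X = classical_mds(D, dim=2)
print(stress(D, X))  # close to zero for this exactly embeddable example
```

For distances that are not Euclidean, some eigenvalues of the centered matrix can be negative; the sketch simply clips them to zero, which is one common choice among several.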
3 Dataset description
The RNA is commonly sequenced indirectly by copying it into the complementary DNA (cDNA). Then the cDNA is amplified and analyzed using a number of DNA sequencing methods. The sequences of the RNA are published in the databases presenting the bases adenine (A), cytosine (C), guanine (G), and thymine (T). Some symbols, such as N (unspecified or unknown nucleoside), R (unspecified purine nucleoside), Y (unspecified pyrimidine nucleoside) and others, permeate only a small percentage of the information and are not considered [45]. The information about the \(N=307\) genomic sequences (GS) was collected in the Global Initiative on Sharing Avian Influenza Data (GISAID) available at https://www.gisaid.org/. The information regarding the sequences, serial, clade/variant and country is listed in Tables 5, 6, 7, 8 and 9. The genetic information is organized in 8 clades {GH, GR, O, GV, G, L, S, V} with 10 elements each, making a total of 80 cases. The recent advent of new variants corresponds to the remaining 227 additional items as follows:
- South Africa, with 10 cases for the variant ‘South Africa Triple Variant’, denoted as TV-ZA
- Denmark, with 10 cases for the variant ‘Mink Cluster V’, denoted as CL5-DK
- England, with 10 cases for the variant VUI2020/01, denoted as VUI-GB
- Italy, with 10 cases for the variant VUI2020/01, denoted as VUI-IT
- Denmark, with 10 cases for the variant VUI2020/01, denoted as VUI-DK
- Portugal, with 10 cases for the variant VUI2020/01, denoted as VUI-PT
- USA, with 10 cases for the variant VUI2020/01, denoted as VUI-US
- a mixture of several cases scattered around the world to give a higher spatial diversity:
  - Ireland, with 7 cases for the variant VUI2020/01, denoted as VUI-IE
  - Japan, with 6 cases for the variant VUI2020/01, denoted as VUI-JP
  - Australia, with 5 cases for the variant VUI2020/01, denoted as VUI-AU
  - Singapore, with 5 cases for the variant VUI2020/01, denoted as VUI-SG
  - Israel, with 4 cases for the variant VUI2020/01, denoted as VUI-IL
  - South Korea, with 3 cases for the variant VUI2020/01, denoted as VUI-KR
  - Norway, with 3 cases for the variant VUI2020/01, denoted as VUI-NO
  - France, with 2 cases for the variant VUI2020/01, denoted as VUI-FR
  - Germany, with 2 cases for the variant VUI2020/01, denoted as VUI-DE
  - Spain, Gibraltar, Hong Kong, India, Luxembourg, Switzerland and Sweden, with 1 case each, for the variant VUI2020/01, denoted simply as VUI
- Japan, with 10 cases for the variant B.1.1.28, denoted as B.1.1.28-JP
- Brazil, with 10 cases for the variant B.1.1.28, denoted as B.1.1.28-BR
- Brazil, with 10 cases for the variant P.1.1.28, denoted as P.1-BR
- Brazil, with 10 cases for the variant P.2, denoted as P.2-BR
- South Africa, with 10 cases for the variant B.1.351, denoted as B.1.351-ZA
- USA, with 10 cases for the variant B.1.427, denoted as B.1.427-US
- California, USA, with 10 cases for the variant B.1.429, denoted as B.1.429-US
- New York, USA, with 10 cases for the variant B.1.526, denoted as B.1.526-US
- India, with 10 cases for the first variant VUI B.1.617, denoted as VUI B.1.617.1-IN
- India, with 10 cases for the second variant VUI B.1.617, denoted as VUI B.1.617.2-IN
- India, with 10 cases for the third variant VUI B.1.617, denoted as VUI B.1.617.3-IN.
In summary, we have collected a first set of 80 GS of the SARS-CoV-2 virus obtained in several countries during a first period of the outbreak. The second set includes 227 recent GS. The smaller number of cases for some countries reflects the limited genomic data available at the time of writing this paper. All ASCII files have approximately 30 kBytes.
The \(N=307\) GS exhibit very small differences and, therefore, are difficult to distinguish. We can first characterize them by their length L that varies between minimum and maximum values of \(L_{min}=28560\) and \(L_{max}=29900\) symbols. Moreover, we have an average and standard deviation of \(L_{av}=29773.5\) and \(L_{sd}=109.31\) symbols, respectively. This small variability in the size is relevant for the reliability when comparing strings with the Kolmogorov- and Hamming-based metrics.
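To see why length variability matters for a position-by-position comparison, consider the following minimal sketch of a Hamming-type count. The classical Hamming distance is defined only for strings of equal length, so the rule used here for unequal lengths (unmatched trailing symbols count as mismatches) is an assumption of this sketch and not necessarily the scheme defined in the Appendix.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count mismatched positions; positions beyond the shorter string count as mismatches.

    The padding rule is an assumption made for this sketch, since the classical
    Hamming distance is defined only for strings of equal length.
    """
    shorter, longer = sorted((x, y), key=len)
    mismatches = sum(a != b for a, b in zip(shorter, longer))
    return mismatches + (len(longer) - len(shorter))


print(hamming_distance("ACGTAC", "ACGTTC"))    # 1: one substitution
print(hamming_distance("ACGTAC", "ACGT"))      # 2: two unmatched trailing symbols
```

A single insertion or deletion shifts every later position, so two otherwise identical strings can appear very different under such a count; this is one reason why a small spread in L is reassuring.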
As mentioned before, the viral RNA information is represented by ASCII files with the four nitrogenous bases. Therefore, we can consider groupings of \(k_{s}=\left\{ 1,2,3\right\} \) consecutive symbols. For simplicity, we denote the corresponding sub-strings by \(\left\{ S^1,S^2,S^3\right\} \) and obtain the statistics listed in Tables 2, 3 and 4. The term ‘others’ stands for a small number of other symbols distinct from the four nucleotide bases. Occasionally we find the symbols N, M, S, R, Y, K, H, V, and W, which abbreviate aNy (A, C, G or T), aMino (A or C), Strong interaction (3 H-bonds, G or C), puRine (A or G), pYrimidine (C or T), Keto (G or T), not-G (A, C or T), not-T (A, C or G), and Weak interaction (2 H-bonds, A or T), respectively.
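The counting of sub-strings can be sketched in a few lines. The example below tallies every window of \(k_s\) consecutive symbols and groups any window containing a symbol outside {A, C, G, T} under ‘others’. Whether the windows overlap (as assumed here) or are taken in disjoint blocks is not specified in this section, and the function name and toy string are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping sub-strings of k consecutive symbols; any window that
    contains a non-ACGT symbol is tallied under 'others'."""
    counts = Counter({"".join(p): 0 for p in product(BASES, repeat=k)})
    counts["others"] = 0
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        key = word if all(s in BASES for s in word) else "others"
        counts[key] += 1
    return counts


toy = "ACGTNACGT"   # toy string, not a real genomic sequence
for k in (1, 2, 3):
    c = kmer_counts(toy, k)
    print(k, {w: n for w, n in c.items() if n > 0})
```

For \(k_s=3\) this yields 64 bins for the triplets plus one bin for ‘others’.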
Figure 1 shows the cumulative numbers of cases for sub-strings including \(k_{s}=\left\{ 1,2,3\right\} \) symbols in the \(N=307\) GS.
As a complementary analysis of the relationship between GS we use the VOSviewer software tool [46,47,48,49,50] for constructing and visualizing bibliometric networks (https://www.vosviewer.com/). The program was built having in mind scientometric applications, but can be used in the present case if we consider the associations of symbols as keywords in a standard technical text. For that purpose we construct \(k_s\)-tuples of consecutive symbols in the GS in order to have ‘words’ (i.e., sub-strings) with \(k_s\) consecutive symbols. We considered the VOSviewer options ‘Full counting’, ‘Minimum number of occurrences of a term = 5’ and ‘Number of terms selected = 28’. For example, Fig. 2 shows the network for the sub-strings with \(k_s=3\) consecutive symbols in the \(i=1\) genomic sequence. For the other viruses the results are of the same type. We observe the very small relevance of ‘phrases’ with several triplets and, on the other hand, the complex network relationship between triplets.
From Tables 2, 3 and 4 and Figs. 1 and 2 we verify that the case of \(k_s=3\) symbols represents a good compromise between complexity and accurate description of the information content. This conclusion follows previous observations with genetic data [19, 51,52,53].
The existence of the symbols classified as ‘other’ and the variation of size in the GS can be considered as a kind of noise. However, as verified with the previous tests, they occur in very small numbers. Also, in most cases we considered about 10 GS for each type of virus. Therefore, we can proceed with the analysis of the genetic information knowing its robustness against possible volatility in the data.
The mathematical description of the viral information is based on the Kolmogorov, Shannon and Hamming perspectives implemented by the distances \(\left\{ NCD,CLM,CDM,CosS\right\} \), \(\left\{ d_{Ja},d_{JS},d_{Je},d_{To}\right\} \) and \(d_{Ha}\) presented in the Appendix. The MDS clustering and visualization is used to unravel relationships between the data and to identify possible patterns. In the case of the Kolmogorov complexity we consider the compressor BZip2 (https://www.zlib.net). In the case of the Shannon information, we start by calculating the 64-bin histograms of the triplets (\(k_s=3\)) of the nitrogenous bases (i.e., the bins {AAA, AAC, \(\ldots \), GGT, GGG}) and then we calculate the distances between them. The resulting \(307 \times 307\) matrix D is processed by MDS using the Matlab command cmdscale.
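The Shannon-based step can be sketched as follows: build the 64-bin histogram of triplets for each sequence, normalize it to relative frequencies, and compare two histograms. The example uses the standard Jensen-Shannon divergence with base-2 logarithms; the definition of \(d_{JS}\) in the Appendix may differ from it (for instance by a constant factor or a square root). The use of overlapping windows, the skipping of windows with other symbols, and all names below are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import product

TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]   # 64 bins


def triplet_histogram(sequence: str) -> list:
    """Relative frequencies of the 64 triplets, using overlapping windows and
    skipping windows with non-ACGT symbols (both choices are assumptions)."""
    counts = Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))
    total = sum(counts[t] for t in TRIPLETS)
    return [counts[t] / total for t in TRIPLETS]


def jensen_shannon(p: list, q: list) -> float:
    """Jensen-Shannon divergence with base-2 logarithms (range [0, 1])."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy strings only; these are not real genomic sequences.
p = triplet_histogram("ACGTACGTACGTTTGA")
q = triplet_histogram("ACGTACGAACGTTTGA")
print(jensen_shannon(p, p), jensen_shannon(p, q))
```

Repeating the comparison for every pair of sequences fills one symmetric matrix D per distance.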
4 Data-set analysis: clustering results
The distances following the Kolmogorov theory \(\left\{ NCD,CLM,CDM,CosS\right\} \) yield very similar MDS plots. This behavior was also observed in previous studies with distinct data-sets [54]. Concerning the distances based on the Shannon theory \(\left\{ d_{Ja},d_{JS},d_{Je},d_{To}\right\} \), we note that \(d_{Ja}\) produces a slightly different plot from the group \(\left\{ d_{JS},d_{Je},d_{To}\right\} \), whose charts have just small differences among them. On the other hand, the distance \(d_{Ha}\) leads to a very different chart. Therefore, for parsimony, for these sets of distances we depict just the plots for the NCD, \(d_{Ja}\), \(d_{JS}\) and \(d_{Ha}\).
The MDS charts produced by the four distances are represented in Figs. 3, 4, 5 and 6, respectively. In all cases the MDS plots require a careful rotation to get the correct 3-dimensional perspective and assessment, since the planar projections in the figures are not totally capable of depicting their structure.
Several MDS loci reveal different clusters but, in general, we do not have a clear group for each variant of the virus. The Kolmogorov- and the Shannon-based metrics show some clusters, but with a mixture of many variants, while the Hamming scheme gives the worst map from the perspective of clustering. Nonetheless, we can ask a different and more relevant question, which is how to assess the ‘dynamics’ of the evolutionary process. We must note that the available information just reflects ‘time samples’ of the variants, that is, the date when the procedure for collecting, identifying and recording the GS took place. Consequently, we do not have precise control of the time elapsed between the real mutation and the laboratory measurement. Nonetheless, we can have a good idea of the dynamical behavior if we include time information in the MDS plots, even if some ‘noise’ is present in the time information.
Figures 7, 8, 9 and 10 depict the MDS plots with the time information represented by the colors of the marks. The colorbar with the interval \(\left[ 0,1\right] \) corresponds to the period between the dates 2020/Feb/24 and 2021/Apr/23. We observe some improvement over the initial set of experiments, but we still have some ‘noisy’ behavior of the time flow.
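A minimal sketch of how a collection date could be mapped onto the colorbar interval is given below. It assumes a linear mapping between the two end dates quoted in the text, which is the natural reading but is not stated explicitly; the function name and the sample date are hypothetical.

```python
from datetime import date

START = date(2020, 2, 24)   # colorbar value 0
END = date(2021, 4, 23)     # colorbar value 1


def time_fraction(collected: date) -> float:
    """Linear map of a collection date onto the [0, 1] colorbar interval."""
    return (collected - START).days / (END - START).days


print(time_fraction(date(2020, 12, 18)))
```

The resulting value can be passed directly to the color argument of a scatter plot so that earlier and later samples are visually separated.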
It is well known that each distance captures a given characteristic of the phenomenon under analysis. Therefore, we wonder whether some measure associating several distances could better reveal the patterns embedded in the dataset. In this line of thought we can design a ‘generalized’ distance by weighting several of the previous distances. If we consider the distances NCD, \(d_{Ja}\), \(d_{JS}\) and \(d_{Ha}\), then we can define the new metric:
\[ d_{Ge} = \mu _{NCD}\, NCD + \mu _{Ja}\, d_{Ja} + \mu _{JS}\, d_{JS} + \mu _{Ha}\, d_{Ha}, \qquad (1) \]
where \(\mu _{NCD}\), \(\mu _{Ja}\), \(\mu _{JS}\) and \(\mu _{Ha}\) are weight factors for the distances NCD, \(d_{Ja}\), \(d_{JS}\) and \(d_{Ha}\), respectively, so that \(\mu _{NCD}+\mu _{Ja}+\mu _{JS}+\mu _{Ha}=1\).
We can adjust the numerical values of the weight factors to reflect the importance of each distance for improving the MDS representation, in the sense of providing a clearer visualization of the dynamical effect. In our case, after some experiments, and having in mind the dynamics in time, we consider the weights \(\mu _{NCD}=0.55\), \(\mu _{Ja}=0.2\), \(\mu _{JS}=0.2\) and \(\mu _{Ha}=0.05\). Figure 11 shows the MDS plot where we observe the emergence of six clusters denoted by the symbols \(\mathcal {A}\) to \(\mathcal {F}\). The time ‘arrow’ follows somehow the pattern \(\mathcal {A} \rightarrow \mathcal {B} \rightarrow \mathcal {C} \rightarrow \left\{ \mathcal {D},\mathcal {E}\right\} \rightarrow \mathcal {F}\).
We verify (i) that the first and second clusters, \(\mathcal {A}\) and \(\mathcal {B}\), exhibit a very low scattering in contrast with the others, (ii) the existence of some noise in the evolutionary process, (iii) that time flows in discrete steps in the MDS representation, that is, the time evolution is not continuous, (iv) that we can have some evolutionary bifurcations in time, as is visible for \(\left\{ \mathcal {D},\mathcal {E}\right\} \), and (v) that clusters for different time instants may be close in the MDS locus. On reflection, these results seem consistent with our present knowledge of the evolutionary process. In fact, we can interpret and justify the previous assertions as follows:
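The weighting scheme can be sketched as an element-wise weighted sum of distance matrices, using the weight values quoted in the text. Whether each distance matrix is rescaled to a common range before being combined is not stated in this section, so the sketch combines the matrices as given; the function name and the 2 x 2 toy matrices are hypothetical.

```python
WEIGHTS = {"NCD": 0.55, "Ja": 0.20, "JS": 0.20, "Ha": 0.05}   # values quoted in the text


def generalized_distance(matrices: dict, weights: dict = WEIGHTS) -> list:
    """Element-wise weighted sum of equally sized distance matrices (lists of lists)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n = len(next(iter(matrices.values())))
    return [[sum(weights[name] * matrices[name][i][j] for name in weights)
             for j in range(n)] for i in range(n)]


# Hypothetical 2 x 2 matrices, just to exercise the function.
toy = {
    "NCD": [[0.0, 0.8], [0.8, 0.0]],
    "Ja":  [[0.0, 0.5], [0.5, 0.0]],
    "JS":  [[0.0, 0.1], [0.1, 0.0]],
    "Ha":  [[0.0, 0.4], [0.4, 0.0]],
}
D_ge = generalized_distance(toy)
print(D_ge)
```

Since a raw Hamming count is not bounded by 1, a distance of that kind would dominate the sum unless it is normalized or given a small weight, which is one possible reading of the small value chosen for \(\mu _{Ha}\).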
- the initial strains of virus had very similar characteristics, contrary to the more recent variants that reveal an increasing variability, which agrees with the first observation
- evolution and, in particular, mutations, occur randomly, which justifies the second conclusion
- mutations do not emerge continuously in time and, in fact, they are the result of a multitude of issues such as environmental, social, geographical and economical factors. Therefore, new variants that lead to a relevant number of infections are expected to emerge without a clear time pattern, which is in accordance with the third consideration
- the infection spreads in very different space locations and it is reasonable to expect to have several important variants at the same time, particularly when the number of infected people grows considerably. These considerations support the fourth remark
- the close location of some clusters for non-consecutive time samples (e.g., \( \mathcal {C}\) and \(\mathcal {E}\)) can be interpreted as the MDS representation of the infection ‘waves’, which explains the fifth observation.
It is important to analyze the effect of the clustering algorithm, the dimension of the visualization space and the type of representation. In this perspective, the generalized distance (1) and, consequently, the same matrix D used for the MDS scheme, is now used with the set of programs Phylip [55, 56] available at http://evolution.genetics.washington.edu/phylip.html. This package is used in phylogenetics for displaying the evolutionary relationships among various biological objects. It allows 2-dim representations where the final objects are the ‘leaves’ of some type of tree. The algorithm neighbor processes the matrix D and produces the data clustering. The programs drawgram and drawtree are used for obtaining the graphical representations in the form of dendrograms and trees, respectively.
Figures 12 and 13 show the dendrogram and tree generated by Phylip for the generalized distance, \(d_{Ge}\), respectively. The time evolution is represented by colors as before.
The 2-dim graphical representations are easier to visualize in the sense that the user does not need to rotate and shift the plot. However, the lack of the third dimension leads to a clustering somewhat inferior to the one performed by the 3-dim MDS. Nonetheless, we can still see some clusters. In the case of the dendrogram we observe a simple evolution from top (initial period) to bottom (final period) with 3 main clusters. The two clusters at the top do not show a precise separation of the periods of time, since we have overlapping objects for the initial 60% of the total period. In the case of a graphical output by means of a tree we have a more intricate pattern, with the objects placed in the form of two ‘arcs’. The outer arc covers the period of time from the beginning up to approximately 75%, while the inner arc corresponds to the rest, that is, to the more recent 25% of the GS. Moreover, the initial period of time of about 30% is placed at the top of the outer arc.
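As a rough sketch of how a distance matrix might be handed over to Phylip, the following function writes a square matrix in the plain-text distance format that the Phylip programs read: a first line with the number of objects, followed by one row per object starting with a name of 10 characters. The labels, the matrix values and the output location are hypothetical, and details such as the number formatting are choices of this sketch.

```python
import os
import tempfile


def write_phylip_distances(names: list, D: list, path: str) -> None:
    """Write a square distance matrix in the PHYLIP distance format:
    first line = number of taxa, then one row per taxon with a 10-character name."""
    if len({name[:10] for name in names}) != len(names):
        raise ValueError("names must be unique within their first 10 characters")
    with open(path, "w", encoding="ascii") as fh:
        fh.write(f"{len(names):5d}\n")
        for name, row in zip(names, D):
            fh.write(f"{name[:10]:<10} " + " ".join(f"{d:.6f}" for d in row) + "\n")


names = ["seq_A", "seq_B", "seq_C"]   # hypothetical labels
D = [[0.0, 0.58, 0.30], [0.58, 0.0, 0.41], [0.30, 0.41, 0.0]]
path = os.path.join(tempfile.gettempdir(), "infile")
write_phylip_distances(names, D, path)
with open(path, encoding="ascii") as fh:
    print(fh.read())
```

The file name `infile` is used here because the Phylip programs look for a file with that name by default; the tree they produce can then be drawn with drawgram or drawtree as described above.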
In summary, the strategy followed in this paper is consistent with present-day understanding about the SARS-CoV-2 genome and the adoption of several distinct distances allows users to have a complementary interpretation of the information embedded in the data-set.
5 Conclusions
The information of 307 RNA viruses available in a public database was explored by means of an association of mathematical and computational tools. The notions Kolmogorov complexity, Shannon information and Hamming distance, on the perspective of analytic tools, and the ideas of compression and MDS algorithms, on the point of view of computational tools, were considered. Three sets of indices of dissimilarity were adopted, allowing a broad comparison of the results, with four distances based on the BZip2 compression, four distances using 2-dimensional histograms of consecutive triplets, and one metric using the Hamming distance between triplets of bases. From these, the MDS algorithm allowed an efficient clustering and 3-dim visualization. The MDS plots revealed pros and cons of the alternative distances adopted for assessing the set of viruses. The problem at stake proved to pose a considerable challenge and no clear clusters emerged for the virus variants included in the dataset. This motivated a new question, namely the relevance of clustering the variants when thinking about its evolutionary dynamics, since many variants have minor differences between themselves. The idea of assessing the dynamics in time lead to the design of a generalized distance taking advantage of the characteristics of distinct metrics and allowing a adjustment to the phenomenon by means of weight factors. The results showed interesting dynamical effects in terms of forming clear clusters in time. Besides the analytic tools, several algorithmic mechanisms were explored, such as the Matlab, VOSviwer and Phylip programs, which illustrate the diversity and richness of present day computational strategies. The synergistic perspectives provided by distinct processing tools and graphical representations allow comparing the genomic data and provide a computational strategy for exploring future viral outbreaks. 
The association of analytic and computational techniques may help to interpret the phylogeny of these new strain outbreaks, to associate their dynamics with selective pressures, and to give additional insight for the quick development and testing of tailored countermeasures. The computational analysis of the SARS-CoV-2 genome may have a role in the early detection of potential variants of concern and help in the characterization of the risk posed to global public health. This is important as it may contribute to the global monitoring of SARS-CoV-2 variants and improve the search for a more effective response to the COVID-19 pandemic.
Data availability
The reformatted data that support the findings of this study are available at http://ave.dee.isep.ipp.pt/jtm/FILES/datasets/Archive.zip.
References
Dowd, J.B., Andriano, L., Brazel, D.M., Rotondi, V., Block, P., Ding, X., Liu, Y., Mills, M.C.: Demographic science aids in understanding the spread and fatality rates of COVID-19. Proc. Natl. Acad. Sci. 117(18), 9696–9698 (2020). https://doi.org/10.1073/pnas.2004911117
Mercatelli, D., Giorgi, F.M.: Geographic and genomic distribution of SARS-CoV-2 mutations. Front. Microbiol. (2020). https://doi.org/10.3389/fmicb.2020.01800
Ceraolo, C., Giorgi, F.M.: Genomic variance of the 2019-nCoV coronavirus. J. Med. Virol. 92(5), 522–528 (2020). https://doi.org/10.1002/jmv.25700
Mallapaty, S.: COVID mink analysis shows mutations are not dangerous - yet. Nature 587(7834), 340–341 (2020). https://doi.org/10.1038/d41586-020-03218-z
Hamed, S.M., Elkhatib, W.F., Khairalla, A.S., Noreddin, A.M.: Global dynamics of SARS-CoV-2 clades and their relation to COVID-19 epidemiology. Sci. Rep. (2021). https://doi.org/10.1038/s41598-021-87713-x
Rambaut, A., Loman, N., Pybus, O., Barclay, W., Barrett, J., Carabelli, A., Connor, T., Peacock, T., Robertson, D.L., Volz, E., on behalf of COVID-19 Genomics Consortium UK (CoG-UK): Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563 (2020). [Online; posted 18-December-2020]
Frampton, D., Rampling, T., Cross, A., Bailey, H., Heaney, J., Byott, M., Scott, R., Sconza, R., Price, J., Margaritis, M., Bergstrom, M., Spyer, M.J., Miralhes, P.B., Grant, P., Kirk, S., Valerio, C., Mangera, Z., Prabhahar, T., Moreno-Cuesta, J., Arulkumaran, N., Singer, M., Shin, G.Y., Sanchez, E., Paraskevopoulou, S.M., Pillay, D., McKendry, R.A., Mirfenderesky, M., Houlihan, C.F., Nastouli, E.: Genomic characteristics and clinical effect of the emergent SARS-CoV-2 B.1.1.7 lineage in London, UK: a whole-genome sequencing and hospital-based cohort study. The Lancet Infectious Diseases (2021). https://doi.org/10.1016/s1473-3099(21)00170-5
Tegally, H., Wilkinson, E., Giovanetti, M., Iranzadeh, A., Fonseca, V., Giandhari, J., Doolabh, D., Pillay, S., San, E.J., Msomi, N., Mlisana, K., von Gottberg, A., Walaza, S., Allam, M., Ismail, A., Mohale, T., Glass, A.J., Engelbrecht, S., Van Zyl, G., Preiser, W., Petruccione, F., Sigal, A., Hardie, D., Marais, G., Hsiao, M., Korsman, S., Davies, M.A., Tyers, L., Mudau, I., York, D., Maslo, C., Goedhals, D., Abrahams, S., Laguda-Akingba, O., Alisoltani-Dehkordi, A., Godzik, A., Wibmer, C.K., Sewell, B.T., Lourenço, J., Alcantara, L.C.J., Pond, S.L.K., Weaver, S., Martin, D., Lessells, R.J., Bhiman, J.N., Williamson, C., de Oliveira, T.: Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. medRxiv (2020). https://doi.org/10.1101/2020.12.21.20248640. https://www.medrxiv.org/content/early/2020/12/22/2020.12.21.20248640
Long, S.W., Olsen, R.J., Christensen, P.A., Subedi, S., Olson, R., Davis, J.J., Saavedra, M.O., Yerramilli, P., Pruitt, L., Reppond, K., Shyer, M.N., Cambric, J., Finkelstein, I.J., Gollihar, J., Musser, J.M.: Sequence analysis of 20,453 severe acute respiratory syndrome coronavirus 2 genomes from the Houston metropolitan area identifies the emergence and widespread distribution of multiple isolates of all major variants of concern. Am. J. Pathol. (2021). https://doi.org/10.1016/j.ajpath.2021.03.004
Davies, N., Barnard, R.C., Jarvis, C.I., Kucharski, A.J., Munday, J.D., Pearson, C.A., Russell, T.W., Tully, D.C., Abbott, S., Gimma, A., Waites, W., Wong, K.L., van Zandvoort, K., working group, C., Eggo, R.M., Funk, S., Jit, M., Atkins, K.E., Edmunds, W.J.: Estimated transmissibility and severity of novel SARS-CoV-2 variant of concern 2020/12/01 in England. https://cmmid.github.io/topics/covid19/uk-novel-variant.html (2020). [First online: 23-12-2020, Last update: 03-03-2021]
Brown, C.M., Vostok, J., Johnson, H., Burns, M., Gharpure, R., Sami, S., Sabo, R.T., Hall, N., Foreman, A., Schubert, P.L., Gallagher, G.R., Fink, T., Madoff, L.C., Gabriel, S.B., MacInnis, B., Park, D.J., Siddle, K.J., Harik, V., Arvidson, D., Brock-Fisher, T., Dunn, M., Kearns, A., Laney, A.S.: Outbreak of SARS-CoV-2 infections, including COVID-19 vaccine breakthrough infections, associated with large public gatherings – Barnstable county, Massachusetts, July 2021. MMWR. Morbidity and Mortality Weekly Report 70(31) (2021). https://doi.org/10.15585/mmwr.mm7031e2
Li, B., Deng, A., Li, K., Hu, Y., Li, Z., Xiong, Q., Liu, Z., Guo, Q., Zou, L., Zhang, H., Zhang, M., Ouyang, F., Su, J., Su, W., Xu, J., Lin, H., Sun, J., Peng, J., Jiang, H., Zhou, P., Hu, T., Luo, M., Zhang, Y., Zheng, H., Xiao, J., Liu, T., Che, R., Zeng, H., Zheng, Z., Huang, Y., Yu, J., Yi, L., Wu, J., Chen, J., Zhong, H., Deng, X., Kang, M., Pybus, O.G., Hall, M., Lythgoe, K.A., Li, Y., Yuan, J., He, J., Lu, J.: Viral infection and transmission in a large, well-traced outbreak caused by the SARS-CoV-2 delta variant (2021). https://doi.org/10.1101/2021.07.07.21260122
Allen, H., Vusirikala, A., Flannagan, J., Twohig, K.A., Zaidi, A., Groves, N., Lopez-Bernal, J., Harris, R., Charlett, A., Dabrera, G., Kall, M.: Increased household transmission of COVID-19 cases associated with SARS-CoV-2 variant of concern B.1.617.2: a national case-control study (2021). https://www.gov.uk/government/collections/new-sars-cov-2-variant
Polack, F.P., Thomas, S.J., Kitchin, N., Absalon, J., Gurtman, A., Lockhart, S., Perez, J.L., Marc, G.P., Moreira, E.D., Zerbini, C., Bailey, R., Swanson, K.A., Roychoudhury, S., Koury, K., Li, P., Kalina, W.V., Cooper, D., Frenck, R.W., Hammitt, L.L., Türeci, Özlem., Nell, H., Schaefer, A., Ünal, S., Tresnan, D.B., Mather, S., Dormitzer, P.R., Şahin, U., Jansen, K.U., Gruber, W.C.: Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. New Eng. J. Med. (2020). https://doi.org/10.1056/nejmoa2034577
Voysey, M., Clemens, S.A.C., Madhi, S.A., Weckx, L.Y., Folegatti, P.M., Aley, P.K., Angus, B., Baillie, V.L., Barnabas, S.L., Bhorat, Q.E., Bibi, S., Briner, C., Cicconi, P., Collins, A.M., Colin-Jones, R., Cutland, C.L., Darton, T.C., Dheda, K., Duncan, C.J.A., Emary, K.R.W., Ewer, K.J., Fairlie, L., Faust, S.N., Feng, S., Ferreira, D.M., Finn, A., Goodman, A.L., Green, C.M., Green, C.A., Heath, P.T., Hill, C., Hill, H., Hirsch, I., Hodgson, S.H.C., Izu, A., Jackson, S., Jenkin, D., Joe, C.C.D., Kerridge, S., Koen, A., Kwatra, G., Lazarus, R., Lawrie, A.M., Lelliott, A., Libri, V., Lillie, P.J., Mallory, R., Mendes, A.V.A., Milan, E.P., Minassian, A.M., McGregor, A., Morrison, H., Mujadidi, Y.F., Nana, A., O’Reilly, P.J., Padayachee, S.D., Pittella, A., Plested, E., Pollock, K.M., Ramasamy, M.N., Rhead, S., Schwarzbold, A.V., Singh, N., Smith, A., Song, R., Snape, M.D., Sprinz, E., Sutherland, R.K., Tarrant, R., Thomson, E.C., Török, M.E., Toshner, M., Turner, D.P.J., Vekemans, J., Villafana, T.L., Watson, M.E.E., Williams, C.J., Douglas, A.D., Hill, A.V.S., Lambe, T., Gilbert, S.C., Pollard, A.J., Aban, M., Abayomi, F., Abeyskera, K., Aboagye, J., Adam, M., Adams, K., Adamson, J., Adelaja, Y.A., Adlou, S., Ahmed, K., Akhalwaya, Y., Akhalwaya, S., Alcock, A., Ali, A., Allen, E.R., Allen, L., Almeida, T.C.D.S.C., Alves, M.P., Amorim, F., Andritsou, F., Anslow, R., Appleby, M., Arbe-Barnes, E.H., Ariaans, M.P., Arns, B., Arruda, L., Awedetan, G., Azi, P., Azi, L., Babbage, G., Bailey, C., Baker, K.F., Baker, M., Baker, N., Baker, P., Baldwin, L., Baleanu, I., Bandeira, D., Bara, A., Barbosa, M.A., Barker, D., Barlow, G.D., Barnes, E., Barr, A.S., Barrett, J.R., Barrett, J., Bates, L., Batten, A., Beadon, K., Beales, E., Beckley, R., Belij-Rammerstorfer, S., Bell, J., Bellamy, D., Bellei, N., Belton, S., Berg, A., Bermejo, L., Berrie, E., Berry, L., Berzenyi, D., Beveridge, A., Bewley, K.R., Bexhell, H., Bhikha, S., Bhorat, A.E., Bhorat, Z.E., Bijker, E., Birch, G., 
Birch, S., Bird, A., Bird, O., Bisnauthsing, K., Bittaye, M., Blackstone, K., Blackwell, L., Bletchly, H., Blundell, C.L., Blundell, S.R., Bodalia, P., Boettger, B.C., Bolam, E., Boland, E., Bormans, D., Borthwick, N., Bowring, F., Boyd, A., Bradley, P., Brenner, T., Brown, P., Brown, C., Brown-O-Sullivan, C., Bruce, S., Brunt, E., Buchan, R., Budd, W., Bulbulia, Y.A., Bull, M., Burbage, J., Burhan, H., Burn, A., Buttigieg, K.R., Byard, N., Puig, I.C., Calderon, G., Calvert, A., Camara, S., Cao, M., Cappuccini, F., Cardoso, J.R., Carr, M., Carroll, M.W., Carson-Stevens, A., de M. Carvalho, Y., Carvalho, J.A., Casey, H.R., Cashen, P., Castro, T., Castro, L.C., Cathie, K., Cavey, A., Cerbino-Neto, J., Chadwick, J., Chapman, D., Charlton, S., Chelysheva, I., Chester, O., Chita, S., Cho, J.S., Cifuentes, L., Clark, E., Clark, M., Clarke, A., Clutterbuck, E.A., Collins, S.L., Conlon, C.P., Connarty, S., Coombes, N., Cooper, C., Cooper, R., Cornelissen, L., Corrah, T., Cosgrove, C., Cox, T., Crocker, W.E., Crosbie, S., Cullen, L., Cullen, D., Cunha, D.R., Cunningham, C., Cuthbertson, F.C., Guarda, S.N.F.D., da Silva, L.P., Damratoski, B.E., Danos, Z., Dantas, M.T., Darroch, P., Datoo, M.S., Datta, C., Davids, M., Davies, S.L., Davies, H., Davis, E., Davis, J., Davis, J., Nobrega, M.M.D., Kalid, L.M.D.O., Dearlove, D., Demissie, T., Desai, A., Marco, S.D., Maso, C.D., Dinelli, M.I., Dinesh, T., Docksey, C., Dold, C., Dong, T., Donnellan, F.R., Santos, T.D., dos Santos, T.G., Santos, E.P.D., Douglas, N., Downing, C., Drake, J., Drake-Brockman, R., Driver, K., Drury, R., Dunachie, S.J., Durham, B.S., Dutra, L., Easom, N.J., van Eck, S., Edwards, M., Edwards, N.J., Muhanna, O.M.E., Elias, S.C., Elmore, M., English, M., Esmail, A., Essack, Y.M., Farmer, E., Farooq, M., Farrar, M., Farrugia, L., Faulkner, B., Fedosyuk, S., Felle, S., Feng, S., Silva, C.F.D., Field, S., Fisher, R., Flaxman, A., Fletcher, J., Fofie, H., Fok, H., Ford, K.J., Fowler, J., Fraiman, P.H., Francis, 
E., Franco, M.M., Frater, J., Freire, M.S., Fry, S.H., Fudge, S., Furze, J., Fuskova, M., Galian-Rubio, P., Galiza, E., Garlant, H., Gavrila, M., Geddes, A., Gibbons, K.A., Gilbride, C., Gill, H., Glynn, S., Godwin, K., Gokani, K., Goldoni, U.C., Goncalves, M., Gonzalez, I.G., Goodwin, J., Goondiwala, A., Gordon-Quayle, K., Gorini, G., Grab, J., Gracie, L., Greenland, M., Greenwood, N., Greffrath, J., Groenewald, M.M., Grossi, L., Gupta, G., Hackett, M., Hallis, B., Hamaluba, M., Hamilton, E., Hammersley, D., Hanrath, A.T., Hanumunthadu, B., Harris, S.A., Harris, C., Harris, T., Harrison, T.D., Harrison, D., Hart, T.C., Hartnell, B., Hassan, S., Haughney, J., Hawkins, S., Hay, J., Head, I., Henry, J., Herrera, M.H., Hettle, D.B., Hill, J., Hodges, G., Horne, E., Hou, M.M., Houlihan, C., Howe, E., Howell, N., Humphreys, J., Humphries, H.E., Hurley, K., Huson, C., Hyder-Wright, A., Hyamns, C., Ikram, S., Ishwarbhai, A., Ivan, M., Iveson, P., Iyer, V., Jackson, F., Jager, J.D., Jaumdally, S., Jeffers, H., Jesudason, N., Jones, B., Jones, K., Jones, E., Jones, C., Jorge, M.R., Jose, A., Joshi, A., Júnior, E.A., Kadziola, J., Kailath, R., Kana, F., Karampatsas, K., Kasanyinga, M., Keen, J., Kelly, E.J., Kelly, D.M., Kelly, D., Kelly, S., Kerr, D., de Ávila Kfouri, R., Khan, L., Khozoee, B., Kidd, S., Killen, A., Kinch, J., Kinch, P., King, L.D., King, T.B., Kingham, L., Klenerman, P., Knapper, F., Knight, J.C., Knott, D., Koleva, S., Lang, M., Lang, G., Larkworthy, C.W., Larwood, J.P., Law, R., Lazarus, E.M., Leach, A., Lees, E.A., Lemm, N.M., Lessa, A., Leung, S., Li, Y., Lias, A.M., Liatsikos, K., Linder, A., Lipworth, S., Liu, S., Liu, X., Lloyd, A., Lloyd, S., Loew, L., Ramon, R.L., Lora, L., Lowthorpe, V., Luz, K., MacDonald, J.C., MacGregor, G., Madhavan, M., Mainwaring, D.O., Makambwa, E., Makinson, R., Malahleha, M., Malamatsho, R., Mallett, G., Mansatta, K., Maoko, T., Mapetla, K., Marchevsky, N.G., Marinou, S., Marlow, E., Marques, G.N., Marriott, P., 
Marshall, R.P., Marshall, J.L., Martins, F.J., Masenya, M., Masilela, M., Masters, S.K., Mathew, M., Matlebjane, H., Matshidiso, K., Mazur, O., Mazzella, A., McCaughan, H., McEwan, J., McGlashan, J., McInroy, L., McIntyre, Z., McLenaghan, D., McRobert, N., McSwiggan, S., Megson, C., Mehdipour, S., Meijs, W., Mendonça, R.N., Mentzer, A.J., Mirtorabi, N., Mitton, C., Mnyakeni, S., Moghaddas, F., Molapo, K., Moloi, M., Moore, M., Moraes-Pinto, M.I., Moran, M., Morey, E., Morgans, R., Morris, S., Morris, S., Morris, H.C., Morselli, F., Morshead, G., Morter, R., Mottal, L., Moultrie, A., Moya, N., Mpelembue, M., Msomi, S., Mugodi, Y., Mukhopadhyay, E., Muller, J., Munro, A., Munro, C., Murphy, S., Mweu, P., Myasaki, C.H., Naik, G., Naker, K., Nastouli, E., Nazir, A., Ndlovu, B., Neffa, F., Njenga, C., Noal, H., Noé, A., Novaes, G., Nugent, F.L., Nunes, G., O-Brien, K., O-Connor, D., Odam, M., Oelofse, S., Oguti, B., Olchawski, V., Oldfield, N.J., Oliveira, M.G., Oliveira, C., Oosthuizen, A., O-Reilly, P., Osborne, P., Owen, D.R., Owen, L., Owens, D., Owino, N., Pacurar, M., Paiva, B.V., Palhares, E.M., Palmer, S., Parkinson, S., Parracho, H.M., Parsons, K., Patel, D., Patel, B., Patel, F., Patel, K., Patrick-Smith, M., Payne, R.O., Peng, Y., Penn, E.J., Pennington, A., Alvarez, M.P.P., Perring, J., Perry, N., Perumal, R., Petkar, S., Philip, T., Phillips, D.J., Phillips, J., Phohu, M.K., Pickup, L., Pieterse, S., Piper, J., Pipini, D., Plank, M., Plessis, J.D., Pollard, S., Pooley, J., Pooran, A., Poulton, I., Powers, C., Presa, F.B., Price, D.A., Price, V., Primeira, M., Proud, P.C., Provstgaard-Morys, S., Pueschel, S., Pulido, D., Quaid, S., Rabara, R., Radford, A., Radia, K., Rajapaska, D., Rajeswaran, T., Ramos, A.S.F., Lopez, F.R., Rampling, T., Rand, J., Ratcliffe, H., Rawlinson, T., Rea, D., Rees, B., Reiné, J., Resuello-Dauti, M., Pabon, E.R., Ribiero, C.M., Ricamara, M., Richter, A., Ritchie, N., Ritchie, A.J., Robbins, A.J., Roberts, H., Robinson, R.E., 
Robinson, H., Rocchetti, T.T., Rocha, B.P., Roche, S., Rollier, C., Rose, L., Russell, A.L.R., Rossouw, L., Royal, S., Rudiansyah, I., Ruiz, S., Saich, S., Sala, C., Sale, J., Salman, A.M., Salvador, N., Salvador, S., Sampaio, M., Samson, A.D., Sanchez-Gonzalez, A., Sanders, H., Sanders, K., Santos, E., Guerra, M.F.S., Satti, I., Saunders, J.E., Saunders, C., Sayed, A., van der Loeff, I.S., Schmid, A.B., Schofield, E., Screaton, G., Seddiqi, S., Segireddy, R.R., Senger, R., Serrano, S., Shah, R., Shaik, I., Sharpe, H.E., Sharrocks, K., Shaw, R., Shea, A., Shepherd, A., Shepherd, J.G., Shiham, F., Sidhom, E., Silk, S.E., da Silva Moraes, A.C., Silva-Junior, G., Silva-Reyes, L., Silveira, A.D., Silveira, M.B., Sinha, J., Skelly, D.T., Smith, D.C., Smith, N., Smith, H.E., Smith, D.J., Smith, C.C., Soares, A., Soares, T., Solórzano, C., Sorio, G.L., Sorley, K., Sosa-Rodriguez, T., Souza, C.M., Souza, B.S., Souza, A.R., Spencer, A.J., Spina, F., Spoors, L., Stafford, L., Stamford, I., Starinskij, I., Stein, R., Steven, J., Stockdale, L., Stockwell, L.V., Strickland, L.H., Stuart, A.C., Sturdy, A., Sutton, N., Szigeti, A., Tahiri-Alaoui, A., Tanner, R., Taoushanis, C., Tarr, A.W., Taylor, K., Taylor, U., Taylor, I.J., Taylor, J., te Water Naude, R., Themistocleous, Y., Themistocleous, A., Thomas, M., Thomas, K., Thomas, T.M., Thombrayil, A., Thompson, F., Thompson, A., Thompson, K., Thompson, A., Thomson, J., Thornton-Jones, V., Tighe, P.J., Tinoco, L.A., Tiongson, G., Tladinyane, B., Tomasicchio, M., Tomic, A., Tonks, S., Tran, N., Tree, J., Trillana, G., Trinham, C., Trivett, R., Truby, A., Tsheko, B.L., Turabi, A., Turner, R., Turner, C., Ulaszewska, M., Underwood, B.R., Varughese, R., Verbart, D., Verheul, M., Vichos, I., Vieira, T., Waddington, C.S., Walker, L., Wallis, E., Wand, M., Warbick, D., Wardell, T., Warimwe, G., Warren, S.C., Watkins, B., Watson, E., Webb, S., Webb-Bridges, A., Webster, A., Welch, J., Wells, J., West, A., White, C., White, R., Williams, 
P., Williams, R.L., Winslow, R., Woodyer, M., Worth, A.T., Wright, D., Wroblewska, M., Yao, A., Zimmer, R., Zizi, D., Zuidewind, P.: Safety and efficacy of the ChAdOx1 nCoV-19 vaccine (AZD1222) against SARS-CoV-2: an interim analysis of four randomised controlled trials in Brazil, South Africa, and the UK. The Lancet (2020). https://doi.org/10.1016/s0140-6736(20)32661-1
Knoll, M.D., Wonodi, C.: Oxford-AstraZeneca COVID-19 vaccine efficacy. Lancet (2020). https://doi.org/10.1016/s0140-6736(20)32623-4
Machado, J.A.T., Lopes, A.M.: Rare and extreme events: the case of COVID-19 pandemic. Nonlinear Dyn. 100(3), 2953–2972 (2020). https://doi.org/10.1007/s11071-020-05680-w
Lopes, A.M., Andrade, J.P., Machado, J.T.: Multidimensional scaling analysis of virus diseases. Comput. Methods Progr. Biomed. 131, 97–110 (2016). https://doi.org/10.1016/j.cmpb.2016.03.029
Machado, J.A.T., Rocha-Neves, J.M., Andrade, J.P.: Computational analysis of the SARS-CoV-2 and other viruses based on the Kolmogorov’s complexity and Shannon’s information theories. Nonlinear Dyn. 101(3), 1731–1750 (2020). https://doi.org/10.1007/s11071-020-05771-8
Machado, J.T., Lopes, A.M.: A computational perspective of the periodic table of elements. Commun. Nonlinear Sci. Num. Simul. 78, 104883 (2019). https://doi.org/10.1016/j.cnsns.2019.104883
Machado, J.T., Lopes, A.M.: Multidimensional scaling and visualization of patterns in prime numbers. Commun. Nonlinear Sci. Num. Simul. 83, 105128 (2020). https://doi.org/10.1016/j.cnsns.2019.105128
Bennett, C.H., Gács, P., Li, M., Vitányi, P., Zurek, W.H.: Information distance. IEEE Trans. Inf. Theory 44(4), 1407–1423 (1998)
Fortnow, L., Lee, T., Vereshchagin, N.: Kolmogorov complexity with error. In: Durand, B., Thomas, W. (eds.) STACS 2006–23rd Annual Symposium on Theoretical Aspects of Computer Science, Marseille, France, February 23–25, 2006. Lecture Notes in Computer Science, pp. 137–148. Springer, Berlin, Heidelberg (2006)
Cha, S.: Taxonomy of nominal type histogram distance measures. In: Proceedings of the American Conference on Applied Mathematics, pp. 325–330. Harvard, Massachusetts, USA (2008)
Deza, M.M., Deza, E.: Encyclopedia of distances. Springer-Verlag, Berlin, Heidelberg (2009)
Hamming, R.W.: Error detecting and error correcting codes. Bell Syst. Tech. J. 29(2), 147–160 (1950). https://doi.org/10.1002/j.1538-7305.1950.tb00463.x
Cilibrasi, R., Vitányi, P.M.B.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005). https://doi.org/10.1109/TIT.2005.844059
Yin, C., Chen, Y., Yau, S.S.T.: A measure of DNA sequence similarity by Fourier transform with applications on hierarchical clustering complexity for DNA sequences. J. Theor. Biol. 359, 18–28 (2014). https://doi.org/10.1016/j.jtbi.2014.05.043
Kubicova, V., Provaznik, I.: Relationship of bacteria using comparison of whole genome sequences in frequency domain. Inf. Technol. Biomed. 3, 397–408 (2014). https://doi.org/10.1007/978-3-319-06593-9_35
Glunčić, M., Paar, V.: Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm. Nucleic Acids Res. (2013). https://doi.org/10.1093/nar/gks721
Hautamaki, V., Pollanen, A., Kinnunen, T., Aik, K., Haizhou, L., Franti, L.: A comparison of categorical attribute data clustering methods, pp. 53–62. Berlin, Springer (2014). https://doi.org/10.1007/978-3-662-44415-3_6
Hu, L.Y., Huang, M.W., Ke, S.W., Tsai, C.F.: The distance function effect on k-nearest neighbor classification for medical datasets. SpringerPlus 5 (2016). https://doi.org/10.1186/s40064-016-2941-7
Aziz, M., Alhadidi, D., Mohammed, N.: Secure approximation of edit distance on genomic data. BMC Med Genom. (2017). https://doi.org/10.1186/s12920-017-0279-9
Yianilos, P.N.: Normalized forms of two common metrics. Tech. Rep. Report 91–082-9027-1, NEC Research Institute (1991)
Yu, J., Amores, J., Sebe, N., Tian, Q.: A new study on distance metrics as similarity measurement. In: IEEE International Conference on Multimedia and Expo, pp. 533–536 (2006). https://doi.org/10.1109/ICME.2006.262443
Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A. (eds.): Feature extraction: foundations and applications. Springer, Berlin (2008)
Russel, R., Sinha, P.: Perceptually based comparison of image similarity metrics. Perception 40, 1269–1281 (2011). https://doi.org/10.1068/p7063
Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Welch, T.: A technique for high-performance data compression. Computer 17(6), 8–19 (1984). https://doi.org/10.1109/mc.1984.1659158
Kodituwakku, S.: Comparison of lossless data compression algorithms for text data. Indian J. Comput. Sci. Eng. 1(4), 416–425 (2010)
Saeed, N., Nam, H., Haq, M.I.U., Saqib, D.B.M.: A survey on multidimensional scaling. ACM Comput. Surv. (CSUR) 51(3), 47 (2018). https://doi.org/10.1145/3178155
Hartigan, J.A.: Clustering algorithms. Wiley, London (1975)
Tenreiro Machado, J.A., Galhano, A.M.: Multidimensional scaling visualization using parametric similarity indices. Entropy 17(4), 1775–1794 (2015). https://doi.org/10.3390/e17041775
Machado, J.A.T.: Relativistic time effects in financial dynamics. Nonlinear Dyn. 75(4), 735–744 (2014). https://doi.org/10.1007/s11071-013-1100-8
Liébecq, C. (ed.): IUPAC-IUBMB Joint Commission on Biochemical Nomenclature and Nomenclature Commission of IUBMB. In: Biochemical Nomenclature and Related Documents. Portland Press (1992)
van Eck, N.J., Waltman, L.: Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 84(2), 523–538 (2009). https://doi.org/10.1007/s11192-009-0146-3
Waltman, L., van Eck, N.J., Noyons, E.C.: A unified approach to mapping and clustering of bibliometric networks. J. Inf. 4(4), 629–635 (2010). https://doi.org/10.1016/j.joi.2010.07.002
van Eck, N.J., Waltman, L.: Visualizing bibliometric networks. In: Measuring Scholarly Impact, pp. 285–320. Springer International Publishing (2014). https://doi.org/10.1007/978-3-319-10377-8_13
Perianes-Rodriguez, A., Waltman, L., van Eck, N.J.: Constructing bibliometric networks: a comparison between full and fractional counting. J. Inf. 10(4), 1178–1195 (2016). https://doi.org/10.1016/j.joi.2016.10.006
van Eck, N.J., Waltman, L.: Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics 111(2), 1053–1070 (2017). https://doi.org/10.1007/s11192-017-2300-7
Machado, J.A.T.: Shannon information and power law analysis of the chromosome code. Abstr. Appl. Anal. 2012, 1–13 (2012). https://doi.org/10.1155/2012/439089
Machado, J.A.T., Costa, A.C., Quelhas, M.D.: Can power laws help us understand gene and proteome information? Adv. Math. Phys. 2013, 1–10 (2013). https://doi.org/10.1155/2013/917153
Machado, J.T.: Fractional order description of DNA. Appl. Math. Model. 39(14), 4095–4102 (2015). https://doi.org/10.1016/j.apm.2014.12.037
Sculley, D., Brodley, C.: Compression and machine learning: a new perspective on feature space vectors, p. 332. IEEE (2006). https://doi.org/10.1109/dcc.2006.13
Felsenstein, J.: PHYLIP (phylogeny inference package), version 3.5 c. Joseph Felsenstein (1993)
Tuimala, J.: A primer to phylogenetic analysis using the PHYLIP package. CSC - Scientific Computing Ltd., Finland (2006)
Kolmogorov, A.: Three approaches to the quantitative definition of information. Int. J. Comput. Math. 2(1–4), 157–168 (1968)
Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. IEEE Trans. Inf. Theory 50(12), 3250–3264 (2004). https://doi.org/10.1109/tit.2004.838101
Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 863–872 (2004)
Chen, X., Kwong, S., Li, M.: A compression algorithm for DNA sequences and its applications in genome comparison. In: Genome Informatics: Proceedings of the 10th Workshop on Genome Informatics, pp. 51–61 (1999)
Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P., Zhang, H.: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17(2), 149–154 (2001). https://doi.org/10.1093/bioinformatics/17.2.149
Chen, X., Francia, B., Li, M., McKinnon, B., Seker, A.: Shared information and program plagiarism detection. IEEE Trans. Inf. Theory 50(7), 1545–1551 (2004). https://doi.org/10.1109/tit.2004.830793
Keogh, E., Lonardi, S., Ratanamahatana, C.A.: Towards parameter-free data mining. In: Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 206–215. ACM Press (2004). https://doi.org/10.1145/1014052.1014077
Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 623–656 (1948)
Gray, R.M.: Entropy and information theory. Springer-Verlag, New York (2011)
Beck, C.: Generalised information and entropy measures in physics. Contemp. Phys. 50(4), 495–510 (2009). https://doi.org/10.1080/00107510902823517
Khinchin, A.I.: Mathematical foundations of information theory. Dover, New York (1957)
Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106(6), 620–630 (1957)
Pilcher, C.D., Wong, J.K., Pillai, S.K.: Inferring HIV transmission dynamics from phylogenetic sequence relationships. PLoS Med. 5(3), e69 (2008). https://doi.org/10.1371/journal.pmed.0050069
Acknowledgements
The authors thank all those who have contributed and shared sequences to the GISAID database available at https://www.gisaid.org/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
This appendix describes the metrics adopted in the paper. The Kolmogorov complexity and Shannon information theories are revisited and several distances are recalled. The Hamming distance is also included.
1.1 Kolmogorov complexity
The Kolmogorov complexity tackles the assessment of information without considering probabilistic notions [57]. Given a string x, the Kolmogorov complexity can be defined as ‘the length of the shortest program that, given an empty string \(\psi \), can compute x and then halts’. Consequently, the Kolmogorov complexity can loosely be viewed as the length of the compressed form of the string x.
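The information distance between two strings [22, 23] leads to concepts such as the normalized compression distance (NCD) [58, 59], the Chen-Li Metric (CLM) [60,61,62], the Compression-based Dissimilarity Measure (CDM) [63] and the Compression-based Cosine (CosS) [54], given by:
\[ NCD\left( x,y\right) =\frac{C\left( xy\right) -\min \left\{ C\left( x\right) ,C\left( y\right) \right\} }{\max \left\{ C\left( x\right) ,C\left( y\right) \right\} }, \qquad CLM\left( x,y\right) =1-\frac{C\left( x\right) -C\left( x|y\right) }{C\left( xy\right) }, \]
\[ CDM\left( x,y\right) =\frac{C\left( xy\right) }{C\left( x\right) +C\left( y\right) }, \qquad CosS\left( x,y\right) =1-\frac{C\left( x\right) +C\left( y\right) -C\left( xy\right) }{\sqrt{C\left( x\right) C\left( y\right) }}. \qquad (2) \]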
The operator C(x) represents the length (measured as a number of bits) of string x after being compressed by a given algorithm \(C(\cdot )\). The concatenation of two strings, x and y, is denoted as xy and, therefore, C(xy) gives the number of bits when compressing together x and y. Moreover, the notation C(x|y) stands for the length of x when conditionally compressed with a model built on string y.
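Sculley and Brodley [54] noted that the compression has an associated feature space, \(\mathbb {N}\). Therefore, they map a string x into a vector \(\bar{x}\in \mathbb {N}\) and \(C(x) = \left| \left| \bar{x}\right| \right| _{1}\). Taking \(C(xy)=\left| \left| \bar{x}+\bar{y}\right| \right| _{1}\) and \(C(x|y)=C(xy)-C(y)\), we verify that the four compression distances (2) use the expression \(\left| \left| \bar{x} \right| \right| _{1}+\left| \left| \bar{y} \right| \right| _{1}-\left| \left| \bar{x}+\bar{y} \right| \right| _{1}\), since we can write
\[ NCD=1-\frac{\left| \left| \bar{x}\right| \right| _{1}+\left| \left| \bar{y}\right| \right| _{1}-\left| \left| \bar{x}+\bar{y}\right| \right| _{1}}{\max \left\{ \left| \left| \bar{x}\right| \right| _{1},\left| \left| \bar{y}\right| \right| _{1}\right\} }, \qquad CLM=1-\frac{\left| \left| \bar{x}\right| \right| _{1}+\left| \left| \bar{y}\right| \right| _{1}-\left| \left| \bar{x}+\bar{y}\right| \right| _{1}}{\left| \left| \bar{x}+\bar{y}\right| \right| _{1}}, \]
\[ CDM=1-\frac{\left| \left| \bar{x}\right| \right| _{1}+\left| \left| \bar{y}\right| \right| _{1}-\left| \left| \bar{x}+\bar{y}\right| \right| _{1}}{\left| \left| \bar{x}\right| \right| _{1}+\left| \left| \bar{y}\right| \right| _{1}}, \qquad CosS=1-\frac{\left| \left| \bar{x}\right| \right| _{1}+\left| \left| \bar{y}\right| \right| _{1}-\left| \left| \bar{x}+\bar{y}\right| \right| _{1}}{\sqrt{\left| \left| \bar{x}\right| \right| _{1}\left| \left| \bar{y}\right| \right| _{1}}}. \]
The four indices reduce to the form \(1-\frac{\left| \left| \bar{x}\right| \right| _{1}+\left| \left| \bar{y}\right| \right| _{1}-\left| \left| \bar{x}+\bar{y}\right| \right| _{1}}{f\left( \bar{x},\bar{y}\right) }\), where the normalizing term \({f\left( \bar{x},\bar{y}\right) }\) varies for each case.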
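As an illustration, the four compression-based indices can be sketched with the BZip2 compressor available in the Python standard library. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the formulas are the standard ones from the cited literature, the conditional term C(x|y) is approximated by C(xy) − C(y), and the function names `compressed_bits` and `compression_distances` are hypothetical.

```python
import bz2
import math


def compressed_bits(s: bytes) -> int:
    """C(s): length in bits of s after BZip2 compression."""
    return 8 * len(bz2.compress(s))


def compression_distances(x: bytes, y: bytes) -> dict:
    """Return NCD, CLM, CDM and CosS for two byte strings.

    Assumption: C(x|y) is approximated by C(xy) - C(y), since a
    general-purpose compressor offers no conditional mode.
    """
    cx, cy = compressed_bits(x), compressed_bits(y)
    cxy = compressed_bits(x + y)          # C(xy): compress the concatenation
    cx_given_y = cxy - cy                 # assumed stand-in for C(x|y)
    return {
        "NCD": (cxy - min(cx, cy)) / max(cx, cy),
        "CLM": 1.0 - (cx - cx_given_y) / cxy,
        "CDM": cxy / (cx + cy),
        "CosS": 1.0 - (cx + cy - cxy) / math.sqrt(cx * cy),
    }
```

All four values share the term C(x) + C(y) − C(xy) and differ only in the normalization, which is the point made in the text above. For short sequences the fixed overhead of the compressor dominates, so the values are only meaningful for strings of genome-like length.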
1.2 Shannon information
In the scope of information theory [64], the information content I of an event \(x_i\) with probability \(P\left( X=x_{i}\right) \) is defined as:
\[ I\left( X=x_{i}\right) =-\log P\left( X=x_{i}\right) , \]
where X denotes a discrete random variable.
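The expected value of the information, E(I), named Shannon entropy [65, 66], becomes:
\[ H\left( X\right) =E\left( I\right) =-\sum _{i}P\left( X=x_{i}\right) \log P\left( X=x_{i}\right) , \]
which satisfies the four Khinchin axioms [67, 68].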
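A minimal sketch of the entropy computation is given next, assuming that the probabilities are estimated by the relative frequencies of the observed symbols and that the natural logarithm is used (the base only changes the unit). The helper names `triplets` and `shannon_entropy` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def triplets(seq: str) -> list:
    """Split a sequence into consecutive, non-overlapping triplets.

    A trailing remainder of one or two symbols is discarded.
    """
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def shannon_entropy(symbols) -> float:
    """Shannon entropy (in nats) of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

For example, `shannon_entropy(triplets(seq))` gives the entropy of the triplet histogram of a sequence `seq`.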
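In the case of a two-dimensional distribution \(\left( X_{1},X_{2}\right) \) we have the joint entropy \(H\left( X_{1},X_{2}\right) \) and mutual information \(MI\left( X_{1},X_{2}\right) \) given by:
\[ H\left( X_{1},X_{2}\right) =-\sum _{i}\sum _{j}P\left( X_{1}=x_{i},X_{2}=x_{j}\right) \log P\left( X_{1}=x_{i},X_{2}=x_{j}\right) , \]
\[ MI\left( X_{1},X_{2}\right) =\sum _{i}\sum _{j}P\left( X_{1}=x_{i},X_{2}=x_{j}\right) \log \frac{P\left( X_{1}=x_{i},X_{2}=x_{j}\right) }{P\left( X_{1}=x_{i}\right) P\left( X_{2}=x_{j}\right) }. \]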
Several distances can be defined from these mathematical tools. In the following we adopt the Jaccard and Jensen-Shannon distances, \(d_{Ja}\left( X_{1},X_{2}\right) \) and \(d_{JS}\left( X_1 \parallel X_2 \right) \) respectively.
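The Jaccard distance represents a set-theoretic interpretation of information based on the normalized quantity \(\frac{MI\left( X_{1},X_{2}\right) }{H\left( X_{1},X_{2}\right) }\) and is formulated as:
\[ d_{Ja}\left( X_{1},X_{2}\right) =1-\frac{MI\left( X_{1},X_{2}\right) }{H\left( X_{1},X_{2}\right) }. \]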
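The Jensen-Shannon divergence is the symmetric version of the Kullback-Leibler divergence \(d_{KL}(X_1\parallel X_{12} ) \) and is given by:
\[ d_{JS}\left( X_1 \parallel X_2 \right) =\frac{1}{2}\left[ d_{KL}\left( X_1\parallel X_{12}\right) +d_{KL}\left( X_2\parallel X_{12}\right) \right] , \qquad d_{KL}\left( X_1\parallel X_{12}\right) =\sum _{i}P\left( X_{1}=x_{i}\right) \log \frac{P\left( X_{1}=x_{i}\right) }{P\left( X_{12}=x_{i}\right) }, \]
where \(X_{12}=\frac{1}{2}\left( X_1+X_2 \right) \).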
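The Jaccard and Jensen-Shannon distances can be sketched as follows. This is only an illustrative sketch: it assumes that the joint distribution is estimated from a list of observed pairs, uses the identity MI(X1, X2) = H(X1) + H(X2) − H(X1, X2), and adopts the natural logarithm. The function names are hypothetical, and returning 0 when the joint entropy vanishes is a convention chosen here.

```python
import math
from collections import Counter


def _normalize(counts) -> dict:
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def _entropy(p: dict) -> float:
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def jaccard_distance(pairs) -> float:
    """d_Ja = 1 - MI(X1, X2) / H(X1, X2), estimated from observed pairs."""
    pairs = list(pairs)
    joint = _normalize(Counter(pairs))
    p1 = _normalize(Counter(a for a, _ in pairs))
    p2 = _normalize(Counter(b for _, b in pairs))
    h12 = _entropy(joint)
    if h12 == 0.0:                        # degenerate case, convention assumed here
        return 0.0
    mi = _entropy(p1) + _entropy(p2) - h12
    return 1.0 - mi / h12


def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence between two discrete distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: dict) -> float:
        # Kullback-Leibler divergence of a with respect to the mixture m
        return sum(v * math.log(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

With the natural logarithm, `js_divergence` ranges from 0 for identical distributions to log 2 for distributions with disjoint supports.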
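The expression of \(d_{JS}\left( X_1 \parallel X_2\right) \) can be rewritten as:
\[ d_{JS}\left( X_1 \parallel X_2\right) =\frac{1}{2}\sum _{i}P\left( X_{1}=x_{i}\right) \log P\left( X_{1}=x_{i}\right) +\frac{1}{2}\sum _{i}P\left( X_{2}=x_{i}\right) \log P\left( X_{2}=x_{i}\right) -\sum _{i}P\left( X_{12}=x_{i}\right) \log P\left( X_{12}=x_{i}\right) . \]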
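The Shannon entropy family also includes the Jeffreys (\(d_{Je}\)) and Topsøe (\(d_{To}\)) expressions, given by:
\[ d_{Je}\left( X_1,X_2\right) =\sum _{i}\left[ P\left( X_{1}=x_{i}\right) -P\left( X_{2}=x_{i}\right) \right] \log \frac{P\left( X_{1}=x_{i}\right) }{P\left( X_{2}=x_{i}\right) }, \]
\[ d_{To}\left( X_1,X_2\right) =\sum _{i}\left[ P\left( X_{1}=x_{i}\right) \log \frac{2P\left( X_{1}=x_{i}\right) }{P\left( X_{1}=x_{i}\right) +P\left( X_{2}=x_{i}\right) }+P\left( X_{2}=x_{i}\right) \log \frac{2P\left( X_{2}=x_{i}\right) }{P\left( X_{1}=x_{i}\right) +P\left( X_{2}=x_{i}\right) }\right] . \]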
1.3 Hamming distance
We also adopt a distance based on standard concepts for comparing the symbols forming the genetic sequences (GS). In information processing, the Hamming metric between two strings of equal length is given by the number of positions at which the corresponding symbols are different [26, 69]. We start by considering triplets, that is, sub-strings of \(k_{s}=3\) consecutive symbols, in each string. This means that we form sub-strings with \(k_{s}\) successive symbols in the DNA strand, so that for a genetic sequence x with length \(L_x\) we obtain \(n_{x}=\left\lfloor \frac{L_{x}}{3}\right\rfloor \) \(k_{s}\)-tuples. The assessment of similarity between the k-th characters, \(k=1,\ldots ,k_{s}\), in the i-th pair of sub-strings (i.e., the \(x_k\) and \(y_k\) symbols in the two sub-strings i extracted from the x and y sequences), is then based on the concept of Hamming metric:
where the indices k and i stand for the k-th symbol in the i-th sub-string (\(k=1,\ldots ,k_{s}\), and \(i=1,\ldots , n_{x}\)), respectively. In general the strings x and y have slightly different lengths and we simply adopt the length \(n_{xy}=\min \left( n_{x},n_{y}\right) \). For calculating the total distance between two strings we test the ‘normalized’ version of the Hamming distance:
where \(\overline{\delta _{i}\left( x_k,y_k\right) }=1-\delta _i(x_k,y_k)\) and the term ‘normalization’ means that the final values are given in relation to the total number of compared sub-strings, \(n_{xy}\).
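The procedure can be illustrated with the following sketch. It is one plausible reading and not necessarily the exact expression adopted in the paper: we assume that a pair of triplets counts as different when at least one of its \(k_{s}\) symbols differs, and that the result is divided by the number of compared sub-strings, \(n_{xy}\). The function name `normalized_triplet_hamming` is hypothetical.

```python
def normalized_triplet_hamming(x: str, y: str, ks: int = 3) -> float:
    """Fraction of aligned ks-tuples of x and y that differ.

    Assumption: a pair of sub-strings is counted as different when
    any of its ks symbols differs; the paper may aggregate the
    symbol-wise comparisons in another way.
    """
    nx, ny = len(x) // ks, len(y) // ks
    nxy = min(nx, ny)                     # compare only the common prefix
    if nxy == 0:
        raise ValueError("both sequences need at least one full sub-string")
    differing = 0
    for i in range(nxy):
        a = x[i * ks:(i + 1) * ks]
        b = y[i * ks:(i + 1) * ks]
        if any(ca != cb for ca, cb in zip(a, b)):
            differing += 1
    return differing / nxy
```

The comparison is positional and involves no alignment, so an insertion or deletion shifts all subsequent triplets and inflates the value.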
Cite this article
Machado, J.A.T., Rocha-Neves, J.M., Azevedo, F. et al. Advances in the computational analysis of SARS-COV2 genome. Nonlinear Dyn 106, 1525–1555 (2021). https://doi.org/10.1007/s11071-021-06836-y
DOI: https://doi.org/10.1007/s11071-021-06836-y