1 Introduction

The severe acute respiratory syndrome virus (SARS-CoV-2) is a single-stranded Ribonucleic Acid (RNA) beta-coronavirus presenting a 29,903 nucleotides-long genome. It caused an outbreak in the Chinese city of Wuhan, in December 2019. Subsequently, the new coronavirus has spread worldwide in less than 4 months, and a pandemic situation was declared by the World Health Organization (WHO). The disease was named on February 2020, Coronavirus Disease 2019 (COVID-19), and the disease is now responsible for more than 145,000,000 confirmed cases, and 4,230,000 deaths reported worldwide as of August 3, 2021 (source: World Health Organization https://covid19.who.int). There are significant differences in case fatality rates (proportion of deaths from a specific disease compared to the total number of individuals diagnosed with the disease for a particular period) between countries, possibly related to the efficacy of the measures adopted to limit viral spreading, the demographic pyramid, i.e., the distribution of various age groups in a particular country, and the screening, test and tracing strategy [1].

Using the data from the database of the Global Initiative on Sharing Avian Influenza Data (GISAID) Consortium, major clades of SARS-CoV-2 were identified [2]. The initial RNA genome (accession number: NC 045512.2), identified in Wuhan is named as clade ‘L’. This root clade suffered mutations due to numerous factors, and some new clades emerged with stable mutations: clade ‘G’ (presenting an alteration of the spike protein S-D614G was the first dominant variant) and its two derivatives ‘GH’ (with ORF3a-Q57H mutation) and ‘GR’ (affected by RG203KR mutation); clade ‘V’ (variant of the ORF3a coding protein NS3-G251), and clade ‘S’ (variant ORF8-L84S). Other alleles or combinations different from the previously described clades are classified as clade ‘O’ [2]. These data confirmed, until now, the relatively low variability presented by SARS-CoV-2 compared to other respiratory viruses [3]. The different fatality rates, speed of transmission, and infectiousness profiles observed in different countries were probably not related to differences in virulence of the clades and their characteristic mutations. However, very recently, some new mutations were found with potential epidemiological consequences. For example, 12 human cases were identified in September 2020 in North Jutland with a unique variant called ‘cluster 5’, a combination of mutations that have been not previously described. All 12 cases were linked to the mink farming industry or the local community [4]. However, the clinical presentation, severity, and duration of COVID-19, and transmission among those infected were similar to that of other circulating SARS-CoV-2 viruses. The variant cluster 5 has not been detected since September despite extensive sequencing and data sharing and is thought to be a dead-end, owing to the very restricted spreading and infectiousness to humans [4].

More disturbing is the new variant strain of SARS-CoV-2 that contains 23 mutations, eight of which are in the spike protein the virus uses to bind to and enter human cells. The spike protein is also the focus of most COVID-19 vaccines that are now being administered in numerous countries. Moreover, the diagnostics tests of COVID-19 are also based on the protein sequence found on the Wuhan reference strain spike. Therefore, their efficacy can be changed by these genomic variations. The appearance of these mutations can also lead to immunological resistance and vaccine escape [5]. The new British variant was reported for the first time in December 2020 and has become highly prevalent globally and responsible for another COVID-19 wave in numerous countries [6]. Based on these mutations, this variant has been predicted to potentially be more quickly transmissible than other circulating strains of SARS-CoV-2. The variant is referred to as SARS-CoV-2 VUI 202012/01 (i.e., Variant Under Investigation, year 2020, month 12, variant 01), or B.1.1.7, as the lineage of the clade GR was classified on GISAID. Variant B.1.1.7 presents increased transmissibility and increased virus load. However, apparently, there is no association with more severe disease [7] although the increased infectiousness can lead to more deaths due to the strain of the health systems of the affected countries. In Mid-November in South Africa, a new lineage of the clade GH has emerged but shared one of the mutations described in the British variant. The South African virus variant (B.1.351), known as ‘triple variant’ is distinct from the UK variant, but both contain an unusually high number of mutations, with potential functional significance, compared to other SARS-CoV-2 lineages, and it can, apparently, partially escape to some available vaccines [8]. Several other variants were described more recently in several countries. Examples are the variants P1 and P2 in Brazil, variants B.1.429 and B.1.526 found in the USA, and yet more recently, a variant first sequenced in India (variant B.1.167) [9]. In April 2021, the South Africa, Brazil, and India variants caused or are causing large disease outbreaks in their respective countries with increased excess deaths due to rupture of the hospital capacities.

The SARS-CoV-2 B.1.1.7 variant was already detected at the end of December 2020 in France, Denmark, Holland, and Italy [10]. In 31 of May, the WHO has assigned simple, easy to say and remember labels for the most important variants of SARS-CoV-2, using letters of the Greek alphabet (see Tracking SARS-CoV-2 variants at ).

The most important variants of concern according to this new classification are: Alpha, corresponding to lineage B.1.1.7, found in the United Kingdom in September 2020; Beta, corresponding to lineages B.1.351, B.1.351.2, B.1.351.3 present in South Africa in May 2020; Gamma, corresponding to lineages P.1, P.1.1, P.1.2, sequenced in Brazil, in November 2020; and, Delta, corresponding to lineages B.1.617.2, AY.1, AY.2 and AY.3 found in India in October 2020 (see Table 1).

Table 1 Currently designated SARS-CoV-2 variants of concern

In the Summer of 2021, the SARS-CoV-2 Delta variant is becoming dominant in Europe, North America and other parts of the world and is responsible for a new wave, mainly in the unvaccinated population [11]. It is highly transmissible and it also appears to affect vaccine effectiveness and breakthrough infections in vaccinated individuals appear to be more frequent with this variant [11]. It was reported that the viral load is 1000 times higher for Delta compared with previous variants of the initial wave of infections and may have a faster replication rate, a reduced incubation period, and greater viral shedding [12]. The Delta variant was found to be approximately 64% more transmissible than the Alpha variant that was dominant in the waves of the end of December 2020 and first months of 2021. On the other hand, Alpha was already estimated to be 50% more transmissible than the D614G strain, responsible for the first wave in the beginning of 2020 [13].

The so-called coronavirus ‘waves’ can be more contagious and spread faster than the initial ones due to the new variants that can also present other epidemiological problems. There is no definitive evidences that the various variants are associated with an higher disease severity. However, there is a clear risk that future epidemic ‘waves’ may be larger and, therefore, associated with greater burden for the health systems and society due to the lockdowns. Therefore, with this work we try to understand SARS-CoV-2 new variants and the relations with the already known virus clades using new strategies for finding the relationships among them [14,15,16]. It is important to understand the recent evolution of the virus partially responsible for the so-called ‘waves’ [17]. With this regard computational techniques associated with mathematical tools are a promising strategy to tackle genomic data-sets. This approach was tested in [18, 19] using the Kolmogorov complexity and Shannon information theories, associated with clustering techniques [20, 21] such as the Multidimensional Scaling (MDS). Given its successful application in a primary set of 133 items, encompassing a variety of virus, this paper extends the study to a more challenging problem, namely of analyzing and comparing the SARS-CoV-2 mutations. For that purpose a new data-set of 307 virus including RNA information from the beginning of the spread up to the day of writing this paper is selected. Furthermore, based on the aforementioned mathematical tools, we include a larger set of indices to allow a more complete comparison. In the case of the Kolmogorov complexity we consider four indices [22, 23], normalized information distance, Compression-based Dissimilarity Measure, Chen-Li Metric and Compression-based Cosine (that are abbreviated by the acronyms NCD, CDM, CLM and CosS). In the scope of the Shannon information theory we consider also four metrics, namely the Jaccard, Jensen-Shannon, Jeffreys and Topsøe distances (denoted as \(d_{Ja}\), \(d_{JS}\), \(d_{Je}\) and \(d_{To}\)) [24, 25]. A third type and distinct type of assessment is also included and consists of the Hamming distance (\(d_{Ha}\)), widely used in information theory [26].

Following these ideas the rest of the paper is organized as follows. Section 2 introduces the fundamental tools. The main mathematical concepts involved with distance, Kolmogorov complexity, Shannon information and Hamming metric are summarized. Additionally, the computational tools such as data compression and MDS are also included. Section 3 describes and analyses the data-set of very close, but distinct, information for a large number of RNA sequences. Section 4 develops a synergistic performance of a variety of measures associated with the MDS clustering and visualization. The results shed light on the dynamics of the evolutionary process of the SARS-CoV-2 lineages. Finally, Sect. 5 presents the conclusions.

2 Fundamental tools

2.1 Distance

A function \(d(\cdot ,\cdot )\) stands for the distance between two objects x and y if satisfies three axioms [27], namely identity \(d(x,y) = 0\) if \(x = y\), symmetry \(d(x,y)= d(y,x)\) and triangle inequality \( d(x,y) \le d(x,z) + d(y,z)\). These axioms imply the non-negativity (or separation condition) \(d(x,y)\ge 0\). On the other hand, they allow the definition of a plethora of different functions, with distinct pros and cons [24, 25]. Based on these notions, several algorithms [28,29,30] were adopted for comparing data sequences [26, 31,32,33]. However, users must have in mind that the selection of a set of distances for a given application requires some experience and that a number of numerical trials are usually necessary before finding the ‘best’ ones [34,35,36,37].

In the final part of the paper, the Appendix presents the mathematical and algorithmic fundamentals of the distances used in this paper for assessing the genetic information.

2.2 Data compression

Compression data algorithms can be classified as ‘lossless’ and ‘lossy’. Lossless compression algorithms are typically used for archival or other high fidelity purposes and reduce the size of files without losing any information in the file, which means that we can reconstruct the original data from the compressed file. Lossy compression algorithms reduce the size of files by discarding the less important information in a file, which can significantly reduce file size but also affect file quality.

In this paper we used the BZip2 compression algorithm which is based on Burrows-Wheeler transform [38]. This compressor has the extension BZ2 designating a pure data compression format not providing file archival feature. In this algorithm the speed is somewhat slower than for the compressor LZW (extension .Z) and Deflate (extension .zip and .gz) compression algorithms [39]. These employ the classic Deflate algorithm (even if correctly implemented Bzip2 algorithm can be easily made parallel, and benefit of recent multi-core CPU), but faster than more powerful compression schemes as in RAR format, 7Z format, and new ZIPX format. The compression ratio, also, is usually intermediate between the older Deflate-based ZIP/GZ files and modern RAR, 7Z and ZIPX formats [40].

2.3 Multidimensional scaling

Let us consider a group of N objects \(x_i\), \(i =1, \cdots , N\), in a q-dim space. The MDS is a computational method for [41] that re-organizes them in a structure where the objects are represented by points trying to highlight the similarities between them in the sense of a predefined distance [42]. The process starts by calculating a \(N \times N\) dimensional matrix, \(D=[d_{ij}]\), with \(d_{ij}\in \mathbb {R^{+}}\) for \(i\ne j\) and \(d_{ii}=0\), \((i,j)=1, \cdots , N\), giving object to object distances [43]. In a second phase, the MDS calculates the point coordinates \(\hat{x}_{i}\) in a \(d<q\)-dim space, trying to mimic the original distances. The MDS technique includes a numerical iterations for optimizing a cost function, often called Stress, that compares the distances \(d_{ij}=\left| x_{i}-x_{j}\right| \) and \(\hat{d}_{ij}=\left| \hat{x}_{i}-\hat{x}_{j}\right| \), so that the index \(Stress = \sqrt{ \sum _{i<j}\left( \hat{d}_{ij}-d_{ij}\right) ^{2}} \) is minimized.

The MDS points \(\hat{x}_{i}\) have coordinates that yield a symmetric matrix \(\hat{D}=[\hat{d}_{ij}]\) of distances that approximate D. The MDS results are interpreted based on the clusters, and eventually of patterns, of points [18, 44]. Therefore, similar objects are represented by nearby points, and the opposite for dissimilar objects. Different distances produce distinct MDS maps and it is up to the user to choose the metrics that reflect better the characteristics of the objects under analysis. By other words, the different distances are correct from the mathematical point of view, but the association of each metrics with the MDS algorithm may produce disparate patterns in the plots. In some cases the emerging patterns, although different, lead to similar conclusions. In other cases, some distances reflect better (or worst) the information embedded in the dataset and the selection of the ‘best’ metric depends on a trial and error set of tests based on the user experience.

3 Dataset description

The RNA is commonly sequenced indirectly by copying it into the complementary DNA (cDNA). Then the cDNA is amplified and analyzed using a number of DNA sequencing methods. The sequences of the RNA are published in the databases presenting the bases, adenine (A), cytosine (C), guanine (G), and thymine (T). Some symbols, such as N (unspecified or unknown nucleoside), R, (unspecified purine nucleoside), Y (unspecified pyrimidine nucleoside) and others, permeate only a small percentage of the information and are not considered [45]. The information about the \(N=307\) GS was collected in the Global Initiative on Sharing Avian Influenza Data (GISAID) available at https://www.gisaid.org/. The information regarding the sequences, serial, clade/variant and country are listed in the Tables 5, 6, 7, 8 and 9. The genetic information is organized in 8 clades {GH, GR, O, GV, G, L, S, V} with 10 elements each, making a total of 80 cases. The recent advent of new variants correspond to the remaining 227 additional items as follows:

  • South Africa, with 10 cases for the variant ‘South Africa Triple Variant’, denoted as TV-ZA

  • Denmark, with 10 cases for the variant ‘Mink Cluster V’, denoted as CL5-DK

  • England, with 10 cases for the variant VUI2020/01, denoted as VUI-GB

  • Italy, with 10 cases for the variant VUI2020/01, denoted as VUI-IT

  • Denmark, with 10 cases for the variant VUI2020/01, denoted as VUI-DK

  • Portugal, with 10 cases for the variant VUI2020/01, denoted as VUI-PT

  • USA, with 10 cases for the variant VUI2020/01, denoted as VUI-US

  • a mixture of several cases scattered along the world to give an higher spatial diversity

    • Ireland, with 7 cases for the variant VUI2020/ 01, denoted as VUI-IE

    • Japan, with 6 cases for the variant VUI2020/01, denoted as VUI-JP

    • Australia, with 5 cases for the variant VUI2020/ 01, denoted as VUI-AU

    • Singapore, with 5 cases for the variant VUI 2020/01, denoted as VUI-SG

    • Israel, with 4 cases for the variant VUI2020/01, denoted as VUI-IL

    • South Korea, with 3 cases for the variant VUI2020/01, denoted as VUI-KR

    • Norway, with 3 cases for the variant VUI2020/ 01, denoted as VUI-NO

    • France, with 2 cases for the variant VUI2020/01, denoted as VUI-FR

    • Germany, with 2 cases for the variant VUI2020/ 01, denoted as VUI-DE

    • Spain, Gibraltar, Hong Kong, India, Luxembourg, Switzerland and Sweden, with 1 case each, for the variant VUI2020/01, denoted simply as VUI

  • Japan with 10 cases for the variant B.1.1.28, denoted as B.1.1.28-JP

  • Brazil with 10 cases for the variant B.1.1.28, denoted as B.1.1.28-BR

  • Brazil with 10 cases for the variant P.1.1.28, denoted as P.1-BR

  • Brazil with 10 cases for the variant P.2, denoted as P.2-BR

  • South Africa with 10 cases for the variant B.1.351, denoted as B.1.351-ZA

  • USA with 10 cases for the variant B.1.427, denoted as B.1.427-US

  • California, USA, with 10 cases for the variant B.1.429, denoted as B.1.429 -US

  • New York, USA, with 10 cases for the variant B.1.526, denoted as B.1.526-US

  • India with 10 cases for the first variant VUI B.1.617, denoted as VUI B.1.617.1-IN

  • India with 10 cases for the second variant VUI B.1.617, denoted as VUI B.1.617.2-IN

  • India with 10 cases for the third variant VUI B.1.617, denoted as VUI B.1.617.3-IN.

In synthesis, we have collected a first set of 80 GS of the SARS-CoV-2 virus obtained in several countries during a first period of the outbreak. The second set includes 227 recent GS. The smaller number of cases the recent genomic data for some countries is limited to the data set available at the time of writing this paper. All ASCII files have approximately 30 kBytes.

The \(N=307\) GS exhibit very small differences and, therefore, are difficult to distinguish. We can first characterize them by their length L that varies between minimum and maximum values of \(L_{min}=28560\) and \(L_{max}=29900\) symbols. Moreover, we have an average and standard deviation of \(L_{av}=29773.5\) and \(L_{sd}=109.31\) symbols, respectively. This small variability in the size is relevant for the reliability when comparing strings with the Kolmogorov- and Hamming-based metrics.

As mentioned before, the viral RNA information is represented by ASCII files with the four nitrogenous bases. Therefore, we can consider the grouping of \(k_{s}=\left\{ 1,2,3\right\} \), consecutive symbols. For simplicity, we denote the corresponding sub-strings by \(\left\{ S^1,S^2,S^3\right\} \) and obtain the statistics listed in Tables 2, 3 and 4. The term ‘others’ stands for a small number of other symbols distinct of the four nucleotide bases. Therefore, occasionally we find the symbols N, M, S, R, Y, K, H, V, and W, that abbreviate aNy (A, C, G or T), aMino (A or C), Strong interaction (3 H-bonds, G or C), puRine (A or G), pYrimidine (C or T), Keto (G or T), not-G (A, C or T), not-T (A, C or G), and Weak interaction (2 H-bonds, A or T), respectively.

Table 2 Statistics of 1-tuples of symbols in the \(N=307\) GS
Table 3 Statistics of 2-tuples of symbols in the \(N=307\) GS
Table 4 Statistics of 3-tuples of symbols in the \(N=307\) GS

Figure 1 shows the cumulative numbers of cases for sub-strings including \(k_{s}=\left\{ 1,2,3\right\} \) symbols in the \(N=307\) GS.

As a complementary analysis of the relationship between GS we use the VOSviewer software tool [46,47,48,49,50] for constructing and visualizing bibliometric networks (https://www.vosviewer.com/). The program was built having in mind scientometric applications, but can be used in the present case if we consider the associations of symbols as keywords in a standard technical text. For that purpose we construct \(k_s\)-tuples of consecutive symbols in the GS in order to have ‘words’ (i.e., sub-strings) with \(k_s\) consecutive symbols. We considered the VOSviewer options ‘Full counting’, ‘Minimum number of occurrences of a term = 5’ and ‘Number of terms selected = 28’. For example, Fig. 2 shows the network for the sub-strings with \(k_s=3\) consecutive symbols in the \(i=1\) genomic sequence. For the other virus the results are of the same type. We observe the very small relevance of ‘phrases’ with several triplets and, on the other hand, the complex network relationship between triplets.

From the Tables 2, 3 and 4 and Figs. 1 and 2 we verify that the case of \(k_s=3\) symbols represent a good compromise between complexity and accurate description of the information content. This conclusion follows previous observations with genetic data [19, 51,52,53].

The existence of the symbols classified as ‘other’ and the variation of size in the GS can be considered as a kind of noise. However, as verified with the previous tests they reflect in very small numbers. Also, in most cases we considered about 10 GS for each type of virus. Therefore, we can proceed with the analysis of the genetic information knowing its robustness against possible volatility in the data.

The mathematical description of the viral information is based on the Kolmogorov, Shannon and Hamming perspectives implemented by the distances \(\left\{ NCD,CLM,CDM,CosS\right\} \), \(\left\{ d_{Ja},d_{Js},d_{Je},d_{To}\right\} \) and \(d_{Ha}\) presented in the Appendix. The MDS clustering and visualization is used to unravel relationships between the data and to identify possible patterns. In the case of the Kolmogorov complexity we consider the compressor BZip2 (https://www.zlib.net). In the case of the Shannon information, we start by calculating the 64-bin histograms (i.e., the triplets {AAA, AAC, \(\ldots \), GGT, GGG}) for the triplets (\(k_s=3\)) of the nitrogenous bases and then we calculate the distances between them. The resulting matrix D, \(307 \times 307\) dimensional, that is processed by MDS using the Matlab command cmdscale.

4 Data-set analysis: clustering results

The distances following the Kolmogorov theory \(\left\{ NCD,CLM,CDM,CosS\right\} \) yield almost similar MDS plots. This behavior was also observed in previous studies with distinct data-sets [54]. In what concerns the distances based on the Shannon theory \(\left\{ d_{Ja},d_{Js},d_{Je},d_{To}\right\} \) we note that \(d_{Ja}\) produces a slightly different plot from the group \(\left\{ d_{Js},d_{Je},d_{To}\right\} \), which return charts having just small differences. On the other hand, the distance \(d_{Ha}\) leads to a very different chart. Therefore, for parsimony, for these sets of distances we depict just the plots for the NCD, \(d_{Jacc}\), \(d_{JS}\) and \(d_{Ha}\).

The MDS charts produced by the four distances are represented in Fig. 3, 4, 5 and 6, respectively. In all cases the MDS plots require a careful rotation to get the correct 3-dimensional perspective and assessment, since the planar projections in the figures are not totally capable of depicting their structure.

Several MDS loci reveal different clusters, but, in general we do not have a clear group for each variant of the virus. The Kolmogorov- and the Shannon-based metrics show some clusters, but with a mixture of many variants, while the Hamming scheme gives the worst map in the perspective of clustering. Nonetheless, we can ask a different and more relevant question which is how to assess the ‘dynamics’ of the evolutionary process. We must note that the available information just reflects ‘time samples’ of the variants, that is, the date where the procedure for collecting, identifying and recording the GS took place. Consequently, we do not have a precise control of the time elapsed between the real mutation and the laboratory measurement. Nonetheless, we can have a good idea of the dynamical behavior if we include time information in the MDS plots, even if some ‘noise’ is present in the time information.

Fig. 1
figure 1

Cumulative numbers of cases for sub-strings including \(k_{s}=\left\{ 1,2,3\right\} \) symbols in the \(N=307\) GS

Fig. 2
figure 2

Network of the sub-strings with \(k_s=3\) symbols extracted from the GS of case \(i=1\)

Fig. 3
figure 3

One perspective of the MDS 3-dim locus for the set of \(N=307\) virus using the normalized information distance, NCD, and BZip2

Fig. 4
figure 4

One perspective of the MDS 3-dim locus for the set of \(N=307\) virus using the Jaccard distance, \(d_{Ja}\), based on the sub-strings with \(k_s=3\) consecutive symbols

Fig. 5
figure 5

One perspective of the MDS 3-dim locus for the set of \(N=307\) virus using the Jensen-Shannon divergence, \(d_{JS}\), based on the sub-strings with \(k_s=3\) consecutive symbols

Fig. 6
figure 6

One perspective of the MDS 3-dim locus for the set of \(N=307\) virus using the Hamming distance, \(d_{Ha}\), based on the sub-strings with \(k_s=3\) consecutive symbols

Fig. 7
figure 7

MDS 3-dim locus exhibiting the dynamics in time of the virus evolution during the period 2020/Feb/24-2021/Apr/23 for the set of \(N=307\) virus using the normalized information distance, NCD, and BZip2

Fig. 8
figure 8

MDS 3-dim locus exhibiting the dynamics in time of the virus evolution during the period 2020/Feb/24-2021/Apr/23 for the set of \(N=307\) virus using the Jaccard distance, \(d_{Ja}\), based on the sub-strings with \(k_s=3\) consecutive symbols

Fig. 9
figure 9

MDS 3-dim locus exhibiting the dynamics in time of the virus evolution during the period 2020/Feb/24-2021/Apr/23 for the set of \(N=307\) virus using the Jensen-Shannon divergence, \(d_{JS}\), based on the sub-strings with \(k_s=3\) consecutive symbols

Fig. 10
figure 10

MDS 3-dim locus exhibiting the dynamics in time of the virus evolution during the period 2020/Feb/24-2021/Apr/23 for the set of \(N=307\) virus using the Hamming distance, \(d_{Ha}\), based on the sub-strings with \(k_s=3\) consecutive symbols

Fig. 11
figure 11

Two perspectives of the MDS 3-dim locus exhibiting the dynamics in time of the virus evolution during the period 2020/Feb/24-2021/Apr/23 for the set of \(N=307\) virus using the Generalized distance, \(d_{Ge}\). The time ‘arrow’ follows the pattern \(\mathcal {A} \rightarrow \mathcal {B} \rightarrow \mathcal {C} \rightarrow \left\{ \mathcal {D},\mathcal {E}\right\} \rightarrow \mathcal {F}\)

Fig. 12
figure 12

Dendrogram representation of the SARS-COV2 time dynamics during the period 2020/Feb/24-2021/Apr/23 for \(N=307\) GS using the generalized distance, \(d_{Ge}\)

Fig. 13
figure 13

Tree representation of the SARS-COV2 time dynamics during the period 2020/Feb/24-2021/Apr/23 for \(N=307\) GS using the generalized distance, \(d_{Ge}\)

Figures 7, 8 , 9 and 10 depict the MDS plots with the time information represented by the colors of the marks. The colorbar with the interval \(\left[ 0,1\right] \) corresponds to the period between the dates 2020/Feb/24 and 2021/Apr/23. We observe some improvement over the initial set of experiments, but we have still some ‘noisy’ behavior of the time flow.

It is well known that each distance captures a given characteristic of the phenomenon under analysis. Therefore, we wander if some measure associating several distances could reveal better the patterns embedded in the datataset. In this line of thought we can design a ‘generalized’ distance by weighting several of the previous distances. If we consider the distances NCD, \(d_{Jacc}\), \(d_{JS}\) and \(d_{Ha}\), then we can define the new metric:

$$\begin{aligned} d_{Ge}= & {} \mu _{NCD}\frac{NCD}{\max \left( NCD\right) }+\mu _{Ja} \frac{d_{Ja}}{\max \left( d_{Ja}\right) }\nonumber \\&+\mu _{JS} \frac{d_{JS}}{\max \left( d_{JS}\right) }+\mu _{Ha}\frac{d_{Ha}}{\max \left( d_{Ha}\right) }, \end{aligned}$$
(1)

where \(\mu _{NCD}\), \(\mu _{Ja}\), \(\mu _{JS}\) and \(\mu _{Ha}\) are weight factors for the distances NCD, \(d_{Jacc}\), \(d_{JS}\) and \(d_{Ha}\), respectively, so that \(\mu _{NCD}+\mu _{Ja}+\mu _{JS}+\mu _{Ha}=1\).

We can adjust the numerical values of the wight factor to reflect the importance of each distance for improving the MDS representation in the sense of providing a more clear visualization of the dynamical effect. In our case, after some experiments, and having in mind the dynamics in time, we consider the weights \(\mu _{NCD}=0.55\), \(\mu _{Ja}=0.2\), \(\mu _{JS}=0.2\) and \(\mu _{Ha}=0.05\). Figure 11 shows the MDS plot where we observe the emergence of five clusters denoted by the symbols \(\mathcal {A}\) to \(\mathcal {F}\). The time ‘arrow’ follows somehow the pattern \(\mathcal {A} \rightarrow \mathcal {B} \rightarrow \mathcal {C} \rightarrow \left\{ \mathcal {D},\mathcal {E}\right\} \rightarrow \mathcal {F}\).

We verify (i) the first and second clusters, \(\mathcal {A}\) and \(\mathcal {B}\), exhibit a very low scattering in contrast with the others, (ii) the existence of some noise in the evolutionary process, (iii) that time flows in discrete steps in the MDS representation, that is, the time evolution is not continuous, (iv) that we can have some evolutionary bifurcations in time as it is visible for \(\left\{ \mathcal {D},\mathcal {E}\right\} \), and (v) clusters for different time instants may be close in the MDS locus. A deep reflection upon these results seems consistent with our present knowledge of the evolutionary process. In fact, we can interpret and justify the previous assertions as follows:

  • the initial strains of virus had very similar characteristics, contrary to the more recent variants that reveal an increasing variability, which agrees with the first observation

  • evolution and, in particular, mutations, occur randomly which justifies the second conclusion

  • mutations do not emerge continuously in time and, in fact, they are the result of a multitude of issues such as environmental, social, geographical and economical factors. Therefore, new variants that lead to a relevant number of infections are expected to emerge without a clear time pattern which is in accordance with the third consideration

  • the infection spreads in very different space locations and it is reasonable to expect to have several important variants at the same time, particularly when the number of infected people grows considerably. These considerations support the forth remark

  • the close location of some clusters for non- consecutive time samples (e.g., \( \mathcal {C}\) and \(\mathcal {E}\)) can be interpreted as the MDS representation the infection ‘waves’, which explain the fifth observation.

It is important to analyze the effect of the clustering algorithm, the dimension of the visualization space and the type of representation. In this perspective, the Generalized distance (1) and, consequently, the same matrix D used for the MDS scheme, is now used with the set of programs Phylip [55, 56] available at http://evolution.genetics.washington.edu/phylip.html. This program is used in in phylogenetics for displaying the evolutionary relationships among various biological objects. The program allows 2-dim representations where the final objects are the ‘leafs’ of some type of tree. The algorithm neighbor processes the matrix D and produces the data clustering. The programs drawgram and drawtree are used for obtaining the graphical representations in the form of dendrograms and trees, respectively.

Figures 12 and 13 shows the dendrogram and tree generated by Phylip for the generalized distance, \(d_{Ge}\), respectively. The time evolution is represented by colors as before.

The 2-dim graphical representations are easier to visualize in the sense that the user does not needs to rotate and shift the plot. However, the lack of the third dimension leads to a clustering somewhat inferior to the one performed by the 3-dim MDS. Nonetheless, we can still see some clusters. In the case of the dendrogram we observe a simple evolution from top (initial period) to bottom (final period) with 3 main clusters. The two clusters at the top do not show a precise separation of the periods of time since with have overlapping object for the initial 60% of the total period. In the case of a graphical output my means of a tree we have a more intricate pattern, with the objects placed in the form of two ‘arcs’. The outer arc covers the period of time from beginning up to approximately 75%, while the inner arc corresponds to the rest, that is, to the more recent 25% GS. Moreover, the initial period of time of about 30% is place it the top of the outer arc.

In summary, the strategy followed in this paper is consistent with present-day understanding about the SARS-CoV-2 genome and the adoption of several distinct distances allows users to have a complementary interpretation of the information embedded in the data-set.

Table 5 Information about the \(N=307\) GS
Table 6 Information about the \(N=307\) GS
Table 7 Information about the \(N=307\) GS
Table 8 Information about the \(N=307\) GS
Table 9 Information about the \(N=307\) GS

5 Conclusions

The information of 307 RNA viruses available in a public database was explored by means of an association of mathematical and computational tools. The notions Kolmogorov complexity, Shannon information and Hamming distance, on the perspective of analytic tools, and the ideas of compression and MDS algorithms, on the point of view of computational tools, were considered. Three sets of indices of dissimilarity were adopted, allowing a broad comparison of the results, with four distances based on the BZip2 compression, four distances using 2-dimensional histograms of consecutive triplets, and one metric using the Hamming distance between triplets of bases. From these, the MDS algorithm allowed an efficient clustering and 3-dim visualization. The MDS plots revealed pros and cons of the alternative distances adopted for assessing the set of viruses. The problem at stake proved to pose a considerable challenge and no clear clusters emerged for the virus variants included in the dataset. This motivated a new question, namely the relevance of clustering the variants when thinking about its evolutionary dynamics, since many variants have minor differences between themselves. The idea of assessing the dynamics in time lead to the design of a generalized distance taking advantage of the characteristics of distinct metrics and allowing a adjustment to the phenomenon by means of weight factors. The results showed interesting dynamical effects in terms of forming clear clusters in time. Besides the analytic tools, several algorithmic mechanisms were explored, such as the Matlab, VOSviwer and Phylip programs, which illustrate the diversity and richness of present day computational strategies. The synergistic perspectives provided by distinct processing tools and graphical representations allow comparing the genomic data and provide a computational strategy for exploring future viral outbreaks. The association of analytic and computational techniques may help interpreting the phylogeny of these new strain outbreaks, associate its dynamics and selective pressures, and give additional insight for the quick development and testing of tailored countermeasures. The computational analysis of the SARS-CoV-2 genome may have a role in the early detection of potential variants of concern and help in the characterization of the risk posed to global public health. This is important as it may contribute to the global monitoring of SARS-CoV-2 variants and to improve the search for a more effective response to the COVID-19 pandemic.