1 Introduction

In December 2019, a mysterious pneumonia of unknown etiology was reported in the city of Wuhan, Province of Hubei, China [1]. The International Committee on Taxonomy of Viruses (ICTV) named the virus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [2]. Globally, up to the 30th of June 2020, according to the World Health Organization Coronavirus Disease 2019 (COVID-19) Situation Report, 10,185,374 confirmed cases of COVID-19 had been reported, resulting in 503,862 deaths.

The scientific community reacted as never before, and many researchers focused on this urgent topic [3,4,5,6,7,8,9,10]. The mathematical and computer science communities are also studying this challenging problem, and we are witnessing the recent emergence of new models and algorithmic approaches. This multidisciplinary research will allow humankind to implement a robust and fast response to a problem that is causing severe global socioeconomic disruption and will most probably lead to the largest global recession since the Great Depression.

The present paper follows this trend by studying the genetic information by means of mathematical and computational tools. The starting point is the information encoded in ribonucleic acid (RNA), a polymeric molecule essential to various biological roles.

The evolutionary origin and divergence of eukaryotes is mostly recoverable from their genetic relationships. The phylogeny of core genes, such as those for ribosomal proteins, provides a reasonable representation of many billions of years of evolution [11]. Unfortunately, the diversity of viruses prevents such a reconstruction of virus evolutionary histories, as they lack any equivalent set of universally conserved genes on which to construct a phylogeny. Viral diversity is far greater than that of other organisms, with significant differences in their genetic material, RNA or deoxyribonucleic acid (DNA), its configuration (double or single stranded), and the orientation of their encoded genes. The largest virus genomes [12] contain over 2,500 genes [13]. RNA is a nucleic acid essential to all forms of life, found in nature generally as a single strand (ss) rather than as a paired double strand (ds) like DNA. In DNA, there are four bases: adenine (A), complementary to thymine (T), and cytosine (C), complementary to guanine (G). In RNA, uracil (U) is used instead of thymine.

Like DNA, RNA can carry genetic information. An RNA virus is a virus that has RNA as its genetic material, encoding a number of proteins. This nucleic acid is usually single-stranded RNA (ssRNA), but there are also double-stranded RNA (dsRNA) viruses. These viruses exploit the presence of RNA-dependent RNA polymerases in eukaryotic cells for the replication of their genomes. Many human diseases are caused by such RNA viruses, including influenza, severe acute respiratory syndrome (SARS), COVID-19, Ebola virus disease, chikungunya, Zika, influenza B and Lassa fever [14].

RNA is usually sequenced indirectly by copying it into complementary DNA (cDNA), which is often amplified and subsequently analyzed using a variety of DNA sequencing methods. Therefore, the genomic sequences of RNA viruses are published using the four bases, namely adenine (A), cytosine (C), guanine (G) and thymine (T).

This genetic information is analyzed by means of the Kolmogorov complexity and Shannon information theories. In the first case, the so-called normalized information distance (NID) is adopted. In the second, a statistical approach is considered, by constructing histograms of the relative frequency of three consecutive bases (triplets). The histograms are interpreted in the light of entropy, cumulative residual entropy and Jensen–Shannon divergence. The results obtained with each theory, that is, the values assessing the virus genetic code from the perspectives of Kolmogorov complexity and Shannon information, are further processed by means of advanced computational representation techniques. The final visualization is obtained using the hierarchical clustering (HC) [15,16,17,18,19,20] and multidimensional scaling (MDS) techniques [21,22,23,24,25,26,27]. Three alternative representations, namely dendrograms, hierarchical trees and three-dimensional loci, are considered.

According to the ICTV, the human coronavirus belongs to the genus Betacoronavirus, a member of the family Coronaviridae, categorized in the order Nidovirales [28]. The family has been categorized into several genera, based on phylogenetic analyses and antigenic criteria, namely: (i) Alphacoronavirus, responsible for gastrointestinal disorders in humans, dogs, pigs and cats; (ii) Betacoronavirus, including the bat coronaviruses, the human severe acute respiratory syndrome (SARS) virus, the Middle East respiratory syndrome (MERS) virus and now SARS-CoV-2; (iii) Gammacoronavirus, which infects avian species; and (iv) Deltacoronavirus [2, 29].

Four coronaviruses broadly distributed among humans (229E, OC43, NL63 and HKU1) frequently cause only common cold symptoms [2, 12]. Two other strains of coronavirus are linked with deadly diseases and are zoonotic in origin, namely SARS-CoV and MERS-CoV [30]. In 2002–2003, there was an outbreak of SARS beginning in Guangdong Province, China, and subsequently affecting 27 countries [31]. It was considered the first pandemic event of the twenty-first century, with SARS-CoV infecting 8098 individuals and causing 774 deaths [31]. In 2012, in the Middle East, MERS-CoV caused a severe respiratory disease that affected 2494 individuals and caused 858 deaths [2]. In both epidemics, bats were the original hosts of the two coronaviruses [2].

Coronaviruses contain a positive-sense, single-stranded RNA genome. The genome size for coronaviruses ranges from 26.4 to 31.7 kilobases, and it is one of the largest among RNA viruses.

The rapid spread of SARS-CoV-2 raises intriguing questions, such as whether its evolution is driven by mutations, recombination or genetic variation [32]. This information is now being applied to the development of diagnostic tools and effective antiviral therapies, and to the understanding of viral diseases. Although numerous studies have been carried out from the biological and medical perspectives to help further understand SARS-CoV-2 and trace its origin, this paper reports the use of multidimensional scaling techniques to find the similarities and relationships among COVID-19 strains themselves and between them and other described viruses.

For that, we collected a set of 37 complete genome sequences of the SARS-CoV-2 virus obtained in several countries from patients with COVID-19. To help verify whether it is possible to trace the original or intermediate host of SARS-CoV-2, we obtained genomic sequences of related coronaviruses in other hosts, including bats (16 genomic sequences), pangolins (8 genomic sequences) and the environment (market of Wuhan) [13]. We also collected 23 genomic sequences of the coronaviruses that cause mild symptoms related to the common cold in humans (229E, OC43, NL63 and HKU1). Genomic sequences of SARS-CoV (10 genomic sequences) and MERS-CoV (13 genomic sequences) were also gathered. For comparison and control, we also obtained sequences of other deadly pathogenic RNA viruses, such as Lassa (6 genomic sequences), Ebola (6 genomic sequences), dengue (7 genomic sequences), chikungunya (1 genomic sequence) and influenza B (2 genomic sequences).

The diagram in Fig. 1 summarizes the main historical flow of coronaviruses with regard to the twenty-first-century epidemics.

Fig. 1

Coronavirus related to twenty-first-century epidemics. ICST: The International Committee on Taxonomy of Viruses—Coronavirus Study Group, WHO: World Health Organization

To the authors’ best knowledge, this paper is the first to analyze such a large number of viruses through the combination of several distinct mathematical and computational tools. In short, we have the cross-comparison of:

  • The genomic sequences of 133 viruses.

  • The data treatment by means of Kolmogorov’s complexity and Shannon’s information theories, using normalized compression distance, entropy, cumulative residual entropy and Jensen–Shannon divergence.

  • Two clustering computational techniques, namely the hierarchical clustering and multidimensional scaling.

  • The visualization of the results by means of dendrograms, trees and point loci.

The results clearly indicate the superior performance of the approaches based on the Kolmogorov complexity measure and on the MDS three-dimensional visualization. Moreover, the characteristics of the coronaviruses within the large set of tested cases are highlighted. We conclude that:

  • The association between the Kolmogorov perspective and the three-dimensional MDS representation leads to the best results.

  • The clusters are easily distinguishable, and we observe the relation between the new SARS-CoV-2 virus and some CoV found in bats and in the pangolin.

Bearing these ideas in mind, the paper is organized as follows. Section 2 introduces the mathematical and computational tools adopted in the follow-up. Section 3 analyzes the data by means of the complexity and information theories accompanied by the HC and MDS computational visualization resources. Finally, Sect. 4 summarizes the main conclusions.

2 Fundamental tools

2.1 Distance metrics

Evaluating the degree of similarity between several objects, each having a number of features, requires the definition of a distance. A function d of two objects \(x_A\) and \(x_B\) can be interpreted as a distance if \(d(x_A,x_B)\ge 0\) and the following axioms are satisfied [33]:

$$\begin{aligned} \begin{aligned}&\hbox {identity}: \, d(x_A,x_B)= 0 \hbox { if and only if } x_A = x_B\\&\hbox {symmetry}: \, d(x_A,x_B)= d(x_B,x_A) \\&\hbox {triangle inequality}: \, d(x_A,x_B)\le d(x_A,x_C)\\&\quad + d(x_C,x_B). \end{aligned} \end{aligned}$$
(1)

We find in the literature a variety of functions that can tackle datasets and shed light on distinct characteristics [34]. For numerical vectors \(\left( x_{1},\ldots ,x_{N}\right) ^{T}\in \mathbb {R}^{N}\), the Minkowski norm \(L_n: \left( \sum _{k=1}^{N}\left| x_{k}\right| ^n\right) ^{\frac{1}{n}}\), and in particular the Manhattan and Euclidean cases, \(L_1\) and \(L_2\) for \(n=1\) and \(n=2\), are often used [35]. In the case of DNA analysis, these norms support different algorithms [36,37,38] that allow the comparison of data sequences. We can also mention metrics such as the Chi-square [39], Hamming [40] and edit [41] distances. Nonetheless, selecting the optimal distance for a specific application poses relevant challenges [42,43,44,45]. In fact, the adoption of a given metric often depends on the experience of the user, who performs several numerical experiments before selecting one or more distances. Due to these issues, assessing the similarity of several objects is not a straightforward process, and we can adopt both non-probabilistic and probabilistic information measures to obtain distinct perspectives on object similarities.
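As an illustration, the Minkowski family of distances mentioned above can be sketched in a few lines of Python; the vectors and values below are purely hypothetical:

```python
import numpy as np

def minkowski(x, y, n):
    """Minkowski distance L_n between two numeric vectors,
    i.e., the L_n norm of their difference."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** n) ** (1.0 / n))

x = [0.0, 3.0]
y = [4.0, 0.0]
print(minkowski(x, y, 1))  # Manhattan (L1) distance: 7.0
print(minkowski(x, y, 2))  # Euclidean (L2) distance: 5.0
```

The same pair of vectors yields different distance values for different n, which is one reason the choice of metric requires numerical experimentation.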

2.2 Kolmogorov complexity theory

The Kolmogorov complexity, or algorithmic entropy, addresses the measurement of information without relying on probabilistic assumptions. The measurement focuses on an individual finite object described by a string and takes into consideration that ‘regular’ strings are compressible [46]. The Kolmogorov complexity of a string x, denoted K(x), can be defined as the length of the shortest binary program that, given an empty string \(\psi \) at its input, computes x at its output on a universal computer, and then halts. Therefore, K(x) can be interpreted as the length of the ultimate compressed version of x.

The information distance between two strings (or files) \(\lbrace x_A,x_B \rbrace \in \varSigma \), where \(\varSigma \) denotes the space of objects, can also be computed by means of the conditional Kolmogorov complexity \(K(x_A|x_B)\) [47, 48]. This concept can be read as the length of the shortest program that obtains \(x_A\) when \(x_B\) is provided as input. In heuristic terms, if the two strings are more/less similar, then the calculation is less/more difficult and, consequently, the size of the program is smaller/larger. Therefore, the following inequality always holds:

$$\begin{aligned} K(x_A|x_B) \le K(x_A). \end{aligned}$$
(2)

In the light of these concepts, the universal distance metric [47], denoted normalized information distance (NID):

$$\begin{aligned}&NID(x_A,x_B)\nonumber \\&\quad = \frac{\max \lbrace K(x_A|x_B),K(x_B|x_A)\rbrace }{\max \lbrace K(x_A),K(x_B)\rbrace }, \ \ \lbrace x_A,x_B \rbrace \in \varSigma , \end{aligned}$$
(3)

was proposed.

From equation (2), the NID takes values in the range [0, 1]. Moreover, \(NID(x_A,x_A)\approx 0\) and \(NID(x_A,\psi )\approx 1\), where \(\psi \) is an empty object that has no similarity to \(x_A\). It is shown in [47] that the NID is a distance because it satisfies the axioms defined in (1), up to some additive precision, but it is non-computable [47]. In spite of this limitation, the NID provides the basis for the so-called normalized compression distance (NCD), which is a computable distance [45]. Computability comes at the cost of approximating the Kolmogorov complexity by a standard compressor \(C(\cdot )\) (readers interested in the discussion of the equivalence between the NID and the NCD can find further details in [33]).

The NCD is given by:

$$\begin{aligned} NCD(x_A,x_B) = \frac{C(x_Ax_B)-\min \lbrace C(x_A),C(x_B)\rbrace }{\max \lbrace C(x_A),C(x_B) \rbrace }. \end{aligned}$$
(4)

The NCD takes values in the range \(0<NCD(x_A,x_B)<1+ \epsilon \), assessing the distance between the files \(x_A\) and \(x_B\). The parameter \(\epsilon > 0\) reflects ‘imperfections’ in the compression algorithm. The values \(C(x_A)\) and \(C(x_B)\) are the sizes of the compressed files \(x_A\) and \(x_B\), respectively, and \(C(x_Ax_B)\) is the compressed size of the two concatenated files treated by the compressor as a single file.

Let us consider, for example, that \(C(x_B)\ge C(x_A)\). Expression (4) says that the distance \(NCD(x_A,x_B)\) assesses the improvement obtained by compressing \(x_B\) using data from the previously compressed \(x_A\) over compressing \(x_B\) from scratch, expressed as the ratio between the compressed sizes.
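As a minimal sketch, the NCD of expression (4) can be computed with a standard compressor such as zlib; the strings below are synthetic placeholders, not genomic data:

```python
import zlib

def C(data: bytes) -> int:
    """Compressed size: a computable stand-in for the Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(xa: bytes, xb: bytes) -> float:
    """Normalized compression distance of expression (4)."""
    ca, cb = C(xa), C(xb)
    return (C(xa + xb) - min(ca, cb)) / max(ca, cb)

s1 = b"ACGT" * 500                # highly repetitive synthetic 'sequence'
s2 = b"ACGT" * 499 + b"TTTT"      # nearly identical to s1
s3 = b"GATTACAGATTACA" * 150      # a different repetitive pattern
print(ncd(s1, s2))  # small: similar strings
print(ncd(s1, s3))  # larger: dissimilar strings
```

Because of the compressor's imperfections, ncd(s1, s1) is close to, but not exactly, zero, which is the role of the parameter \(\epsilon \) above.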

Obviously, the approximation of the NID by means of the NCD poses operational obstacles. Due to the non-computability of the Kolmogorov complexity, we cannot predict how close the NCD is to the real value of the NID, and the approximation may yield arithmetic problems, particularly in the case of small strings, where numerical indeterminate forms may arise [33, 49]. Moreover, the compressor (as an approximation of the Kolmogorov complexity) must be ‘normal’ in the sense that, given the object \(x_Ax_A\), the compressor C should produce an object of almost identical size to the compressed version of \(x_A\). This is a limitation for the universality of the NCD, since in specific applications the best performing lossless algorithms (e.g., JBIG, JPEG2000 and JPEG-LS in image compression) do not satisfy such a property [50]. Nevertheless, key results were already obtained using the NCD in image distinguishability [51], image OCR [52], malware recognition [53] and genomic analysis [54, 55].

2.3 Shannon information theory

Information theory [56] has been applied in a variety of scientific areas. The most fundamental piece of the theory is the information content I of a given event \(x_i\) having probability of occurrence \(P\left( X=x_{i}\right) \):

$$\begin{aligned} I \left( X=x_{i}\right) =-\ln \left( P\left( X=x_{i}\right) \right) , \end{aligned}$$
(5)

where X is a discrete random variable.

The expected value of the information, the so-called Shannon entropy [57, 58], is given by:

$$\begin{aligned} H\left( X\right)= & {} E\left( -\ln \left( P\left( X\right) \right) \right) \nonumber \\= & {} { -\sum _{i} P\left( X=x_{i}\right) \ln \left( P\left( X=x_{i}\right) \right) } , \end{aligned}$$
(6)

where \(E\left( \cdot \right) \) denotes the expected value operator.

Expression (6) obeys the four Khinchin axioms [59, 60]. Several generalizations of Shannon entropy have been proposed, relaxing some of those axioms, and we can mention the Rényi, Tsallis and generalized entropy [61,62,63], just to name a few.
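For concreteness, the entropy of expression (6) can be computed directly from a relative frequency histogram. A minimal sketch follows; the uniform distribution over the 64 possible base triplets is used only as an illustration:

```python
import math

def shannon_entropy(p):
    """Shannon entropy of expression (6), in nats, for a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# The uniform distribution over the 4^3 = 64 possible triplets maximizes H:
print(shannon_entropy([1 / 64] * 64))  # ln(64), approximately 4.159
# Any concentration of probability lowers the entropy:
print(shannon_entropy([0.5, 0.5]))     # ln(2), approximately 0.693
```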

A recent and interesting concept is the cumulative residual entropy \(\varepsilon \), given by [64, 65]:

$$\begin{aligned} \varepsilon \left( X\right) ={ -\sum _{i}P\left( X> x_{i}\right) \ln \left( P\left( X > x_{i}\right) \right) }. \end{aligned}$$
(7)
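A sketch of expression (7) for a discrete random variable follows, using the natural logarithm as in the other expressions; the two-point distribution is a hypothetical example:

```python
import math

def cumulative_residual_entropy(values, probs):
    """Cumulative residual entropy of expression (7) for a discrete variable."""
    total, tail = 0.0, 1.0
    for _, p in sorted(zip(values, probs)):
        tail -= p                # tail = P(X > x_i) after passing x_i
        if tail > 1e-12:
            total -= tail * math.log(tail)
    return total

# Fair two-point variable on {0, 1}: only the term P(X > 0) = 0.5 contributes.
print(cumulative_residual_entropy([0, 1], [0.5, 0.5]))  # 0.5 ln 2, about 0.347
```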

Within the scope of information theory, we can also formulate the concept of distance discussed in Sect. 2.1. The Kullback–Leibler divergence between the probability distributions \(X_1\) and \(X_2\) is defined as [34, 66,67,68,69]:

$$\begin{aligned} D_{KL}\left( X_1\parallel X_2 \right) ={ \sum _{i}P\left( X_1 = x_{i}\right) \ln \left( \frac{P\left( X_1=x_{i}\right) }{P\left( X_2=x_{i}\right) } \right) }. \end{aligned}$$
(8)

From this, we obtain the Jensen–Shannon divergence (or distance), \(JSD\left( X_1 \parallel X_2 \right) \), given by:

$$\begin{aligned}&JSD\left( X_1\parallel X_2\right) \nonumber \\&\quad =\frac{1}{2}\left[ D_{KL}\left( X_1\parallel X_{12} \right) +D_{KL}\left( X_2\parallel X_{12} \right) \right] , \end{aligned}$$
(9)

where \(X_{12}=\frac{1}{2}\left( X_1+X_2 \right) \).

Alternatively, the \(JSD\left( X_1\parallel X_2\right) \) can be calculated as:

$$\begin{aligned} JSD\left( X_1\parallel X_2\right)= & {} \frac{1}{2} \sum _{i} P\left( X_1 = x_{i}\right) \ln \left( P\left( X_1=x_{i}\right) \right) \nonumber \\&+\frac{1}{2} \sum _{i} P\left( X_2 = x_{i}\right) \ln \left( P\left( X_2=x_{i}\right) \right) \nonumber \\&- \sum _{i} P\left( X_{12} = x_{i}\right) \ln \left( P\left( X_{12}=x_{i}\right) \right) . \end{aligned}$$
(10)
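Expressions (8)–(10) can be sketched as follows; the two three-bin distributions are hypothetical:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence of expression (8), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence of expression (9)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # the mixture X_12
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(jsd(p, p))  # 0.0 for identical distributions
print(jsd(p, q))  # 0.5 ln 2, about 0.347; the JSD is bounded by ln 2
```

Unlike the Kullback–Leibler divergence, the JSD is symmetric and always finite, which is why it is the quantity used to build the distance matrix in the sequel.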

2.4 Hierarchical clustering, multidimensional scaling and visualization

The HC is a computational technique that takes a group of N objects \(X_i\), \(i =1, \cdots , N\), in a q-dimensional space and rearranges them in a visual structure of objects \(Y_i\), highlighting the main similarities between them in the sense of some predefined metric [70].

Let us consider N objects defined in a q-dimensional real-valued space and a distance (or dissimilarity) measure \(\delta _{ij}\) between pairs of objects. The first step is to calculate an \(N \times N\)-dimensional matrix, \(\varDelta =[\delta _{ij}]\), where \(\delta _{ij}\in \mathbb {R^{+}}\) for \(i\ne j\) and \(\delta _{ii}=0\), \((i,j)=1, \cdots , N\), which stores the object-to-object distances [71]. The HC uses the input information in the matrix \(\varDelta \) and produces a graphical representation consisting of a dendrogram or a hierarchical tree.

The so-called agglomerative and divisive iterative clustering techniques are usually adopted for processing the information. In the first approach, each object starts in its own cluster, and the computational iterations merge the most similar items until only one cluster remains. In the second, all objects start in a single cluster, and the computational iterations successively split it, until each object is in its own cluster. For both approaches, the numerical iterations follow a linkage criterion, based on the distances between pairs, for quantifying the dissimilarity between clusters. The maximum, minimum and average linkages are possible criteria. The distance \(d\left( x_{A},x_{B}\right) \) between two objects \(x_{A} \in A\) and \(x_{B} \in B\) can be assessed by means of several metrics, such as the average linkage given by [72]:

$$\begin{aligned} d_{av}\left( A,B\right) =\frac{1}{\left\| A\right\| \left\| B\right\| }{ \sum _{x_{A}\in A,x_{B}\in B}d\left( x_{A},x_{B}\right) }. \end{aligned}$$
(11)

The clustering quality can be assessed by means of the cophenetic correlation [73]:

$$\begin{aligned} c =\frac{{ \sum _{i<j}\left( x(i,j)-\bar{x}\right) \left( y(i,j)-\bar{y}\right) }}{\sqrt{\left[ { \sum _{i<j}\left( x(i,j)-\bar{x}\right) ^{2}}\right] \left[ { \sum _{i<j}\left( y(i,j)-\bar{y}\right) ^{2}}\right] }}, \end{aligned}$$
(12)

where x(i, j) and y(i, j) stand for the distances between the pairs \(X_i\) and \(X_j\), in the initial measurements, and \(Y_i\) and \(Y_j\), in the HC chart, respectively, and \(\bar{x}\) and \(\bar{y}\) denote the averages of x and y.

Values of c close to 1 (to 0) indicate a good (weak) cluster representation of the original data. In MATLAB, c is computed by means of the command cophenet.

In this paper, we adopt the agglomerative clustering and the average linkage [74, 75] for tackling the matrix of distances based on the JSD metric (10).
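The procedure of this subsection can be sketched with standard tools, here SciPy; the 4-object symmetric distance matrix below is hypothetical and stands in for the full virus matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix Delta for N = 4 objects.
Delta = np.array([[0.0, 0.1, 0.8, 0.9],
                  [0.1, 0.0, 0.7, 0.8],
                  [0.8, 0.7, 0.0, 0.2],
                  [0.9, 0.8, 0.2, 0.0]])

condensed = squareform(Delta)             # condensed form expected by SciPy
Z = linkage(condensed, method="average")  # agglomerative HC, average linkage (11)
c, _ = cophenet(Z, condensed)             # cophenetic correlation coefficient (12)
print(round(float(c), 3))                 # close to 1: faithful representation
```

The linkage matrix Z can then be passed to scipy.cluster.hierarchy.dendrogram to draw the corresponding dendrogram.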

The MDS is also a computational technique for clustering and visualizing multidimensional data [76]. Similarly to the HC, the input of the MDS is the matrix \(\varDelta =[\delta _{ij}]\), \((i,j)=1, \cdots , N\). The idea of the MDS is to adopt points to represent the objects in a d-dimensional space, with \(d < q\), trying to reproduce the original distances \(\delta _{ij}\). The MDS performs a series of numerical iterations rearranging the points in order to optimize a given cost function, called stress S. We have, for example, the residual sum of squares and the Sammon criteria:

$$\begin{aligned} S= & {} \left[ \sum _{i<j}\left( \phi _{ij}-\delta _{ij}\right) ^{2}\right] ^{\frac{1}{2}},\end{aligned}$$
(13a)
$$\begin{aligned} S= & {} \left[ \frac{ \sum _{i<j}\left( \phi _{ij}-\delta _{ij}\right) ^{2} }{ \sum _{i<j}\phi _{ij}^{2}} \right] ^{\frac{1}{2}}. \end{aligned}$$
(13b)

The resulting MDS points have coordinates that produce a symmetric matrix \(\varPhi =[\phi _{ij}]\) of distances that approximates the original one, \(\varDelta =[\delta _{ij}]\). In MATLAB, the commands cmdscale and mdscale can be adopted for the classical MDS and for the Sammon stress criterion, respectively.

The interpretation of the MDS locus is based on the patterns of points [77, 78]. Objects with strong (weak) similarities are represented by fairly close (distant) points. The MDS locus is interpreted on the basis of the relative positions of the points. Thus, the absolute position of the points and the shape of the clusters usually have no special meaning, and we can magnify, translate and rotate the locus to achieve a good visualization. In the same line of reasoning, the axes of the plot have no units or physical meaning. The quality of the produced MDS locus can be evaluated using the stress plot and the Shepard diagram. The stress plot shows S versus d and decreases monotonically. Usually, low values of d are adopted; present computational resources allow a direct three-dimensional representation, but often some rotation and magnification are required to achieve the best visual perspective. The Shepard diagram plots \(\phi _{ij}\) against \(\delta _{ij}\) for a given value of d, and a narrow (large) scatter of points indicates a good (weak) fit between the original and the reproduced distances.
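As an illustration, the classical (Torgerson) MDS, the counterpart of MATLAB's cmdscale, can be sketched in a few lines of NumPy. The 3-point distance matrix is hypothetical and, being Euclidean, is reproduced exactly:

```python
import numpy as np

def classical_mds(Delta, d=2):
    """Classical (Torgerson) MDS: embed N objects in d dimensions
    from their N x N distance matrix Delta."""
    N = Delta.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N    # centering matrix
    B = -0.5 * J @ (Delta ** 2) @ J        # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]          # keep the d largest eigenvalues
    L = np.sqrt(np.maximum(w[idx], 0.0))
    return V[:, idx] * L                   # point coordinates Y

# Three hypothetical collinear objects with distances 1, 2 and 3.
Delta = np.array([[0.0, 1.0, 3.0],
                  [1.0, 0.0, 2.0],
                  [3.0, 2.0, 0.0]])
Y = classical_mds(Delta)
print(np.linalg.norm(Y[0] - Y[2]))  # reproduces the original distance 3.0
```

For non-Euclidean inputs, such as an NCD or JSD matrix, the embedding is only approximate, and the stress plot and Shepard diagram described above quantify the quality of that approximation.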

3 The dataset

The information of 133 publicly released genomic sequences was collected in the Global Initiative on Sharing Avian Influenza Data (GISAID) and the GenBank of the National Center for Biotechnology Information (NCBI) databases (https://www.gisaid.org/, https://www.ncbi.nlm.nih.gov/genbank). The information regarding the sequences and their serial numbers is given in the Table in the “Appendix.”

The mathematical tools of the Kolmogorov and Shannon theories are used to compare and extract relationships among the data and to identify viral genomic patterns. The viral genomes are analyzed from the perspective of dynamical systems using HC and MDS. Dendrograms and trees are generated by the HC algorithms, and a three-dimensional visualization is produced through MDS. Several clusters with medical and epidemiological interest were visualized. The MDS loci provide a key visualization of the relation of SARS-CoV-2 to the other known coronaviruses affecting humans.

The several non-coronavirus pathogenic viruses analyzed are very well delimited in several independent and easily delineated clusters (Zika, chikungunya, dengue, Lassa, Ebola, influenza B).

The several types of coronaviruses that cause mild disease (common cold) are very well separated from the other coronavirus clusters. Interestingly, the types 229E, HKU1, NL63 and OC43 form separate mini-clusters within a big well-defined cluster.

There are two well-defined COVID-19 (SARS-CoV-2) clusters. The two environmental SARS-CoV-2 sequences were aggregated with the human SARS-CoV-2. To differentiate the human MERS-CoV, a zoom was made to isolate it, forming a separate cluster. The MERS-CoV obtained from camels was included in this cluster. SARS-CoV formed another independent and well-segregated cluster. The CoV from bats and pangolins are dispersed among these several coronavirus clusters, sometimes forming clusters of few elements. Note that the bat CoV RaTG13 is near one of the SARS-CoV-2 clusters. This coronavirus is the closest known relative of SARS-CoV-2 [79].

To process the RNA information, consisting of ASCII files with the four nitrogenous bases, we consider two approaches. The first follows the Kolmogorov complexity theory described in Sect. 2.2. Therefore, we consider the compressors zlib and bz2 (see https://www.zlib.net and https://sourceware.org/bzip2/), followed by the NCD distance (4). The two algorithms give almost identical results and, therefore, only zlib is considered in the follow-up.

The application of the NCD produces a \(133 \times 133\)-dimensional symmetric matrix \(\varDelta \), which is tackled and visualized by means of a dendrogram and a HC tree obtained with the program Phylip (http://evolution.genetics.washington.edu/phylip.html) and by MDS using MATLAB (https://www.mathworks.com/products/matlab.html). The corresponding charts are represented in Figs. 2, 3 and 4.

Fig. 2

Dendrogram for the set of 133 viruses using the NCD and zlib based on the Kolmogorov complexity theory

Fig. 3

HC tree for the set of 133 viruses using the NCD and zlib based on the Kolmogorov complexity theory

Fig. 4

MDS three-dimensional locus for the set of 133 viruses using the NCD and zlib based on the Kolmogorov complexity theory with the cluster of SARS-CoV-2 connected by a line

Fig. 5

MDS three-dimensional locus for the set of 133 viruses using the NCD and zlib based on the Kolmogorov complexity theory, without point labels and the cluster of SARS-CoV-2 connected by a line

It is known that three-dimensional plots require some rotation on the computer screen. Therefore, Fig. 5 shows the MDS locus from a different perspective, without point labels and with the cluster of SARS-CoV-2 connected by a line.

The second approach follows the Shannon information theory described in Sect. 2.3. Therefore, we start by considering non-overlapping (codon or anticodon) triplets of bases and the histograms of relative frequency for each virus. In a second phase, after obtaining the histograms for the complete set of viruses, we process them using entropy concepts. Before continuing, we must clarify that we tested the adoption of n-tuples, with \(n=1,2,3,4\), in the DNA information analysis of a large set of superior species such as mammals, where the genetic information is considerably larger and the construction of histograms for large values of n does not pose problems of statistical significance (since the number of histogram bins grows as \(4^n\)). It was observed that \(n=3\) is a ‘good value,’ since \(n=1\) leads to poor results, \(n=2\) improves things but is still insufficient, while \(n=4\) gives almost identical results to \(n=3\).
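The construction of the triplet histograms described above can be sketched as follows; the short input string is a toy example, not a real genome:

```python
from collections import Counter

def triplet_histogram(seq):
    """Relative frequencies of non-overlapping base triplets (n = 3)."""
    usable = len(seq) - len(seq) % 3           # drop a trailing incomplete triplet
    triplets = [seq[i:i + 3] for i in range(0, usable, 3)]
    total = len(triplets)
    return {t: c / total for t, c in Counter(triplets).items()}

h = triplet_histogram("ATGGCGATGTAA")
print(h)  # {'ATG': 0.5, 'GCG': 0.25, 'TAA': 0.25}
```

For a real viral genome, the resulting dictionary has up to \(4^3 = 64\) bins, which is the distribution fed to the entropy and divergence measures of Sect. 2.3.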

In a first experiment, we consider the Shannon entropy H and the fractional cumulative residual entropy \(\varepsilon \) as possible descriptors of the 133 histograms. Figure 6 shows the resulting two-dimensional plot, that is, the Shannon entropy H vs the fractional cumulative residual entropy \(\varepsilon \), for the set of 133 viruses. We observe some separation between viruses, but the high density of points in some areas poses difficulties.

Fig. 6

Shannon entropy H vs fractional cumulative residual entropy \(\varepsilon \) for the set of 133 viruses, with the cluster of SARS-CoV-2 connected by a line

We now try the other computational techniques introduced in Sect. 2. In the follow-up, we consider identical input information, that is, the JSD and the corresponding matrix \(\varDelta \) for the set of 133 viruses, and we compare the distinct computational clustering and visualization techniques.

Figures 7, 8 and 9 show the dendrogram, the HC tree and the three-dimensional MDS locus, respectively. The dendrogram is the simplest representation, and it is straightforward to interpret. However, this technique does not take full advantage of the space. The HC tree uses the two-dimensional space more efficiently, but reading the densest clusters becomes more difficult. The three-dimensional MDS takes advantage of the computational representation, but requires some rotation/shift/magnification on the computer screen.

Successive magnifications can be done and are necessary to achieve a more distinct visualization. Therefore, we can say that all representations have their own pros and cons, although the three-dimensional MDS is, a priori, the most efficient method.

Fig. 7

Dendrogram for the set of 133 viruses using the JSD based on the Shannon information theory

Fig. 8

HC tree for the set of 133 viruses using the JSD based on the Shannon information theory

Fig. 9

MDS three-dimensional locus for the set of 133 viruses using the JSD based on the Shannon information theory, with the cluster of SARS-CoV-2 connected by a line

Fig. 10

MDS three-dimensional locus for the set of 133 viruses using the JSD based on the Shannon information theory, without point labels and the cluster of SARS-CoV-2 connected by a line

Since the three-dimensional plot requires some rotation on the computer screen, Fig. 10 shows the MDS locus from a different perspective, without point labels and with the cluster of SARS-CoV-2 connected by a line.

The 3D MDS following the Kolmogorov method isolated the viruses into several extended groups. The cluster of SARS-CoV-2 is very extended. Note that the SARS-CoV-2 virus found in the market of Wuhan is very near the SARS-CoV-2 cluster. The CoV found in pangolins also forms a cluster, mixing with the upper part of the SARS-CoV-2 cluster. The bat CoV viruses also form a diffuse cluster lacking some contiguity. The bat CoV Yunnan RaTG13 is very near the SARS-CoV-2 cluster. There is a big cluster formed by the different CoV that cause a mild respiratory disease (common cold).

Other pathogenic RNA virus clusters, i.e., Lassa, Zika, influenza B and Ebola, can be easily separated from the CoV clusters. However, with this methodology, the cluster formed by the dengue virus is, in some views, not very far from part of the cluster formed by the CoV that cause mild respiratory disease.

The SARS-CoV virus forms another independent cluster. The MERS-CoV from humans forms another extended cluster, in the vicinity of the CoV virus from camels.

4 Conclusions

This paper explored the information of 133 RNA viruses available in public databases. For that purpose, several mathematical and computational tools were applied. On the one hand, the concept of distance and the theories developed by Kolmogorov and Shannon for assessing complexity and information were recalled. The first involved the normalized compression distance and the zlib compressor for the DNA. The second comprised the use of histograms of triplets of bases and their assessment through entropy, cumulative residual entropy and Jensen–Shannon divergence. On the other hand, the advantages of the hierarchical clustering and multidimensional scaling algorithmic techniques were also discussed. Representations in two and three dimensions, namely by dendrograms and trees, and by loci of points or of points and lines, respectively, were compared. The results revealed their pros and cons for the specific application to the set of viruses under comparison.

The 3D MDS in the Kolmogorov perspective proves to be the best visualization method. Not only are the clusters easily distinguishable, but we also find the relation between the new SARS-CoV-2 virus and some CoV found in bats (the primary host of the virus) and in the pangolin, the likely intermediate host. The SARS-CoV-2 found in the environment, namely the market of Wuhan where the epidemic probably started, is indistinguishable from the SARS-CoV-2 found in humans.

This type of methodology may help to study how an animal virus jumped the boundaries of species to infect humans and to pinpoint its origin, as this knowledge can help to prevent future zoonotic events [80]. The statistical and computational techniques allow different perspectives on viral diseases that may be used to grasp the dynamics of the emerging COVID-19 disease. These methodologies [79] may help in interpreting future viral outbreaks, providing additional information concerning these infectious agents and helping to understand the dynamics of viral diseases.