Introduction

SARS-CoV-2 infection led to a large-scale pandemic affecting countries around the world. The infection causes symptoms such as fever, severe respiratory illness, and pneumonia in humans (Wrapp et al. 2020). SARS-CoV-2 is a novel coronavirus that was initially reported in the markets of Wuhan, China in November 2019 (Araujo and Naimi 2020). The virus is highly contagious and can be transmitted by droplets from an infected host. It has been shown to be highly similar to other coronaviruses, some of which caused similar diseases such as SARS (severe acute respiratory syndrome) in 2002 and MERS (Middle East respiratory syndrome) in 2012 (de Wit et al. 2016). Fortunately, the slower infection rate of those viruses did not lead to a pandemic. Statistical observations by the World Health Organization (WHO) reported that infection with MERS and SARS spread at a rate of 1000 people in 4 months, whereas SARS-CoV-2 took 48 days to infect 1000 people. This rapid rate of infection prompted the WHO to declare it a public health emergency of international concern (PHEIC) (Tarik Jasarevic and Chaib 2020). Prolonged infection with SARS-CoV-2 can cause an increased release of cytokines, which may lead to cytokine release syndrome, characterized by multiple organ failure and fever (Sun et al. 2020).

SARS-CoV-2 belongs to the family Coronaviridae and the sub-family Orthocoronavirinae. It is a positive-sense single-stranded RNA virus (26 to 32 kilobases) with spike proteins that appear crown-like when viewed under an electron microscope (Periwal et al. 2020). SARS-CoV-2 is closely related to SARS-CoV and also to bat coronavirus, as discovered by Zhou et al. (2020). Furthermore, it has been reported that capping loops, which enhance the interaction between the viral spike proteins and the human ACE2 cellular receptor, are present in human coronavirus but not in the bat coronavirus. The virus consists of structural and non-structural proteins. Structural proteins are of four types: spike glycoproteins, envelope proteins, membrane proteins, and nucleocapsid proteins (Forster et al. 2020). Spike proteins emerge from the envelope and aid in host sensitivity and attachment of the virus to the host (Ortega et al. 2020). The membrane fusion step of host cell infection is mediated by the spike glycoproteins present on the surface of the virus. These homotrimeric spike glycoproteins on the envelope bind to cellular receptors on the host membrane, leading to viral entry (Zhang et al. 2020). Spike glycoproteins are made up of two subunits, S1 and S2. Each subunit of the trimer is 180 kDa to 200 kDa in size (Ou et al. 2020). The S1 subunit lies in the amino (N) terminal region of the S homotrimer. It consists of the N-terminal domain (NTD), the receptor binding domain (RBD), and the receptor binding motif (RBM). The S2 subunit, in contrast, is highly conserved and lies in the C-terminal region of the sequence. The S2 subunit consists of a fusion peptide (FP), heptad repeat 1 (HR1), heptad repeat 2 (HR2), a transmembrane domain (TM) and a cytoplasmic domain (CP) (Hillen et al. 2020).
Enhanced interactions between heptad repeat 1 and heptad repeat 2 lead to the stabilization of six-helix bundle (6HB) structures, which enhances the capability of SARS-CoV-2 to infect the host (Xia et al. 2020). The spike glycoprotein contains a furin cleavage site between the two subunits of the S protein. This cleavage site aids in replication of viral protein and differentiates SARS-CoV-2 from all other coronaviruses (Walls et al. 2020). The S proteins of the virus can be cleaved by human host proteases at the site of the S2 subunit, which activates the membrane fusion protein through irreversible conformational changes (Walls et al. 2020). With the help of its spike glycoproteins, SARS-CoV-2 interacts with a receptor called human angiotensin converting enzyme 2 (hACE2) and infects the human body. Following the interaction between the viral subunit and the enzyme, entry occurs via endocytosis with the help of phosphoinositides (Ou et al. 2020). The viral spike glycoprotein belongs to class I of fusion proteins. The α-helical coiled structure it forms is characteristic of this type of fusion protein. It also has a C-terminal region that possesses these α-helical formations with a coiled-coil structure (Heald-Sargent and Gallagher 2012; Zhang et al. 2020). Open reading frames present in the viral genome work as templates for the production of sub-genomic mRNAs and also aid in the termination of transcription. Sub-genomic mRNA is a key player in the replication-transcription complex, which carries out transcription of the viral genome. There are up to seven open reading frames in a single coronavirus genome (Xia et al. 2020). The entire spike glycoprotein consists of 1273 sites, of which sites 1 to 667 form the S1 subunit and sites 668 to 1273 form the S2 subunit (Fig. 1). Sites 336 to 516 make up the receptor binding domain (RBD) and sites 424 to 494 are responsible for the membrane binding.
Similarly, in the S2 subunit, sites 770 to 788 make up the fusion peptide, sites 915 to 949 make up heptad repeat 1, sites 1150 to 1185 make up heptad repeat 2, and sites 1190 to 1273 make up the transmembrane and cytoplasmic domains.

Fig. 1

Schematic representation of the binding of the S1 subunit of SARS-CoV-2 to the ACE2 present in a human cell. The receptor binding domain identifies and binds to the ACE2 in the host organism

Studies have reported that SARS-CoV-2 is highly similar to bat coronavirus, specifically to RaTG13, whose spike glycoprotein reportedly shares 98% homology with that of SARS-CoV-2. A furin recognition site "RRAR" is located within the SARS-CoV-2 spike glycoprotein because of an insertion at the S1/S2 cleavage site (Wrapp et al. 2020). Moreover, a study of SARS-CoV-2 indicated that mutations in the spike glycoproteins of the novel coronavirus can change the characteristics of the virus, which has been theorised to increase viral pathogenesis (Shang et al. 2020). It has been noted that the rate of infection varies among countries as per statistics since the outbreak in January 2020 up to March 2021 (Fig. 2) (Othman et al. 2020).

Fig. 2

Analysis of the number of SARS-CoV-2 cases in different countries from January 2020 to May 31st 2020

Multiple hypotheses have been developed and tested over the months concerning the spread of SARS-CoV-2 infections in humans. Araujo and Naimi suggested that the spread of the SARS-CoV-2 virus followed a seasonal climate pattern. Based on their in silico studies, the transmission rates were reported to be higher in arid and temperate regions (Araujo and Naimi 2020). Sardar et al. hypothesised that mutations in the glycoprotein regions which mediate immune response vary between geographical regions and may be key to understanding the differences in severity of infection among countries (Sardar et al. 2020; Fang et al. 2020). The S2 subunit plays a crucial role in transmission of infection. The sequence of the surface glycoprotein is reported to be approximately 1273 amino acids in length. It was hypothesized that variation in the spike glycoproteins of virus found in humans from different countries is highly likely. This might support the hypothesis of a faster pace of infection in the populations of certain countries compared with others. Previous research has shown that phylogenetic investigation of genomes from diverse geographical areas did not give any significant result but showed variable clustering among countries (Sardar et al. 2020). This suggests that variation might exist at the level of amino acid mutations, which could lead to increased infection in certain populations around the world. Several other studies have demonstrated clustering of the protein sequences from different countries, leading to the assumption that a massive exchange was taking place from the epicentre of the disease to other countries via carriers (Begum et al. 2020). The main objective of this study was to understand the mutational changes in the spike glycoproteins between infected populations around the globe.
In this study, phylogenetic studies of SARS-CoV-2 were carried out along with multiple sequence alignment to understand the variation in spike glycoproteins between infected populations in various countries.

Procedure

Protein sequence retrieval

The surface glycoprotein sequences for SARS-CoV-2 from multiple countries were acquired from NCBI Virus, the NCBI (National Center for Biotechnology Information) database for the novel coronavirus. "Surface glycoprotein", "S protein" and "Spike protein", in conjunction with "SARS-CoV-2" and the desired country, were used as query terms during the database search. The sequences were downloaded in FASTA format and stored as plain-text files. All the sequences were made up of 1273 amino acids or sites.
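
The length check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used in the study: the function names and the toy records are assumptions, and the only value taken from the text is the expected length of 1273 residues.

```python
from typing import Dict, Iterable, Iterator, Tuple

EXPECTED_LENGTH = 1273  # spike glycoprotein length stated in the text


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def full_length_records(lines: Iterable[str],
                        length: int = EXPECTED_LENGTH) -> Dict[str, str]:
    """Keep only records whose sequence has exactly `length` residues."""
    return {h: s for h, s in parse_fasta(lines) if len(s) == length}


# Toy input (hypothetical headers and residues, not real accessions).
demo = [">a toy", "MFV", "FLV", ">b", "MF"]
print(full_length_records(demo, length=6))
```

In practice the lines would come from an open file handle for the downloaded FASTA file, with the default length of 1273.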

Multiple sequence alignment

All retrieved sequences were aligned in MEGA X (version 10.1.8) using the inbuilt MUSCLE alignment feature. UPGMA (unweighted pair group method with arithmetic mean) was used as the clustering method for the iterations, with 24 as the minimum diagonal length. A total of 147 sequences were aligned using this software. The aligned data were saved as an Excel sheet and the mutations in the sequences were highlighted. The MEGA software is able to align more than 2000 sequences at once in a few minutes. The data were stored with the MSDX suffix and all conserved, singleton, variable and parsimony-informative sites were highlighted. The alignment was then stored as an image file.
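
The site categories highlighted above can be made concrete with a short Python sketch. It is not MEGA's implementation. It assumes the usual definitions (a conserved site has one residue type, a parsimony-informative site has at least two types that each occur at least twice, and a singleton site is any other variable site) and, as a simplification, treats gap characters like any other symbol. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def classify_site(column: str) -> str:
    """Classify one alignment column by its residue counts."""
    counts = Counter(column)
    if len(counts) == 1:
        return "conserved"
    # Number of residue types that occur at least twice in this column.
    repeated = sum(1 for n in counts.values() if n >= 2)
    return "parsimony-informative" if repeated >= 2 else "singleton"


def summarise_alignment(seqs: List[str]) -> Dict[str, int]:
    """Count conserved, variable, parsimony-informative and singleton sites."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    tally = Counter(classify_site("".join(col)) for col in zip(*seqs))
    return {
        "conserved": tally["conserved"],
        "variable": tally["parsimony-informative"] + tally["singleton"],
        "parsimony-informative": tally["parsimony-informative"],
        "singleton": tally["singleton"],
    }


# Toy alignment of four sequences and four sites (illustrative only).
print(summarise_alignment(["MDLA", "MDLA", "MGLV", "MGLA"]))
```

Under these definitions every variable site is either parsimony-informative or singleton, so the two counts add up to the number of variable sites.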

Phylogenetic tree analysis

Using the previously aligned protein sequences, a phylogenetic tree was constructed to understand the relationships between the sequences collected from different geographical locations around the world. MEGA X software (version 10.1.8) was used to prepare the phylogenetic tree. The tree was inferred using maximum likelihood as the statistical method. The analysis was bootstrapped with 500 replicates. Amino acid substitutions were modelled using the Jones-Taylor-Thornton (JTT) matrix-based model with uniform rates among amino acid sites. The treatment of gaps and missing data was set to use all sites. The tree inference options for the maximum likelihood heuristic included Nearest-Neighbour-Interchange (NNI), and the initial tree was set to default. The resulting tree was stored as a portable document format (PDF) file for further assessment. The amino acid composition was also calculated per sample using the inbuilt tool in the MEGA software. On the basis of the phylogenetic tree, the pairwise distance between the sequences was calculated using the distance feature in MEGA X software (version 10.1.8).
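
The bootstrap step can be illustrated with the standard column-resampling idea: each replicate draws alignment columns at random, with replacement, until it has as many columns as the original alignment. The Python sketch below shows only that resampling. Tree inference on each replicate is not included, the function names and seed are assumptions, and it is not a description of MEGA's internals.

```python
import random
from typing import List


def bootstrap_replicate(seqs: List[str], rng: random.Random) -> List[str]:
    """Resample alignment columns with replacement (one replicate)."""
    n_sites = len(seqs[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every sequence,
    # so each resampled column stays intact across sequences.
    return ["".join(s[i] for i in picks) for s in seqs]


def bootstrap_replicates(seqs: List[str], n_replicates: int = 500,
                         seed: int = 0) -> List[List[str]]:
    """Generate `n_replicates` resampled alignments (500 as in the text)."""
    rng = random.Random(seed)
    return [bootstrap_replicate(seqs, rng) for _ in range(n_replicates)]


# Toy alignment (illustrative only).
toy = ["MDLA", "MGLV", "MGLA"]
replicates = bootstrap_replicates(toy, n_replicates=3, seed=1)
print(len(replicates), [len(s) for s in replicates[0]])
```

A tree would then be inferred from each replicate, and the support for a branch is the fraction of replicate trees in which that branch appears.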

Outcome & analysis

Sequences collected from NCBI Virus, a public database, were downloaded as text documents. As of 4th May, the sequences were predominantly from China and the USA, mainly because of the number of samples submitted to the database. Ten sequences each were retrieved from China, India, Hong Kong, Greece, France, Taiwan, Thailand, Australia, USA, and Spain. The other sequences retrieved were from Germany (6), Czech Republic (8), Puerto Rico (7), Sri Lanka (4), Iran (2), Israel (2), South Africa (1), Kazakhstan (4), Malaysia (3), Nepal (1), Pakistan (2), South Korea (4), Italy (2) and Brazil (2). These sequences were then used to carry out a multiple sequence alignment using MEGA X software (version 10.1.8). The inbuilt MUSCLE feature aligned 147 sequences from various countries around the world. Each sequence was composed of 1273 amino acids. The final alignment displayed 32 variable sites and 1241 conserved sites. Five sites were parsimony-informative, which means that each of these sites contained at least two types of amino acids, at least two of which occurred with a minimum frequency of two. The alignment also showed 27 singleton sites out of 1273, that is, sites containing at least two types of amino acids with at most one of them occurring more than once. The amino acid sequences were > 99% homologous to each other, apart from single amino acid mutations. Multiple mutations were noted after alignment. The most prominent mutation observed was the substitution of Aspartic acid (D) by Glycine (G) at the 614th position (D614G). Based on previous studies, this mutation arises from a change in a triplet codon of the RNA sequence: GAU and GAC code for Aspartic acid and GGU and GGC code for Glycine, so a single nucleotide substitution of A to G, or vice versa, converts one into the other (Fig. 3) (Korber et al. 2020). According to that study, this mutation was visible in many European samples.
In our study, which covered multiple countries, we noted that this mutation was more prevalent in Asian countries, for instance Taiwan, China, Hong Kong, Malaysia, South Korea and Pakistan. Other countries included Italy and Brazil. The other substitutions are shown in Table 1.
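
The codon argument for the change at position 614 can be checked mechanically. The sketch below lists every pair of Aspartic acid and Glycine codons that differ by exactly one nucleotide. The codon sets follow the standard genetic code, in which Glycine has four codons (GGU, GGC, GGA and GGG); the function name is an assumption made for illustration.

```python
from itertools import product
from typing import List, Sequence, Tuple

ASP_CODONS = ("GAU", "GAC")                # Aspartic acid (D)
GLY_CODONS = ("GGU", "GGC", "GGA", "GGG")  # Glycine (G)


def single_nucleotide_changes(
    src_codons: Sequence[str], dst_codons: Sequence[str]
) -> List[Tuple[str, str, int, str]]:
    """Return (source, target, codon position, change) for one-step pairs."""
    changes = []
    for src, dst in product(src_codons, dst_codons):
        diffs = [(i, a, b) for i, (a, b) in enumerate(zip(src, dst)) if a != b]
        if len(diffs) == 1:  # exactly one nucleotide differs
            pos, a, b = diffs[0]
            changes.append((src, dst, pos + 1, f"{a}>{b}"))
    return changes


for change in single_nucleotide_changes(ASP_CODONS, GLY_CODONS):
    print(change)
# ('GAU', 'GGU', 2, 'A>G')
# ('GAC', 'GGC', 2, 'A>G')
```

Only the GAU/GGU and GAC/GGC pairs are one step apart, and in both cases the change is an A to G substitution at the second codon position, which is why the text mentions only those two Glycine codons.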

Fig. 3

MSA of 1247 sequences showed that 63 sequences carried the D614G mutation

Table 1 Mutations deciphered after multiple sequence alignment using MEGA X

Single amino acid mutations were also observed at the 8th site and the 5th site: a substitution of Leucine by Valine in viral samples from Hong Kong and a substitution of Leucine by Phenylalanine in samples from France, respectively. These mutations do not have any major role in the functioning of the virus and are not yet known to affect transmission. It has been hypothesized that these mutations can be used to identify individuals more susceptible to the viral infection than others (Korber et al. 2020). A mutation at the 49th site, a substitution of Histidine by Tyrosine, was observed in a sample collected from Taiwan. This mutation occurs in the N-terminal domain of the S1 subunit but is of little significance other than helping to identify the geographical origin of a sample, as it has so far been found only in the Netherlands and Taiwan. Out of 32 single amino acid substitutions, only 2 were found in the binding domain of the viral spike glycoprotein. These were observed in one sample acquired from India and one from Malaysia, and involved the substitution of Arginine by Isoleucine and of Proline by Leucine, respectively. Both of these mutations are suggested to affect the ability of the domain to attach to the hACE2 of the host. Another peculiar mutation was the substitution of a run of 6 amino acids from site 292 to site 297 in a sample acquired from Malaysia, in which A L D P L S was replaced by V M I H F W. Due to lack of data, the reason for this substitution is still unknown and requires further study.
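
Substitutions of the kind listed above can be read off an alignment by comparing each sample with a reference, position by position. The following Python sketch shows the idea under stated assumptions: the two short sequences are an illustrative fragment only, the function name is hypothetical, and gap positions are simply skipped.

```python
from typing import List


def substitutions(reference: str, sample: str) -> List[str]:
    """Name substitutions as <reference residue><1-based site><sample residue>."""
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have the same length")
    return [
        f"{ref}{site}{alt}"
        for site, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt and "-" not in (ref, alt)  # ignore gap positions
    ]


# Illustrative ten-residue fragment, not a full spike sequence.
reference_fragment = "MFVFLVLLPL"
sample_fragment = "MFVFFVLVPL"
print(substitutions(reference_fragment, sample_fragment))  # ['L5F', 'L8V']
```

Applied to full 1273-residue sequences, the same comparison would produce labels such as D614G for the change discussed earlier.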

To further assess any homology between the amino acid sequences obtained from different countries, an unrooted phylogenetic tree of all 147 sequences was constructed by maximum likelihood from the multiple sequence alignment obtained earlier. The tree is separated into six clades, as visualised in the condensed tree in Fig. 4. Clade 1 represents a unique mutation in two sequences, one from China (QJA20044) and one from India (QJF77846). Clade 2 represents sequences from Hong Kong, Clade 3 represents sequences from Taiwan, Clade 4 represents sequences from France that are unique due to a common mutation, Clade 5 represents sequences from the Czech Republic and Clade 6 represents sequences from Thailand. All of the sequences acquired are unique from each other due to the presence of single mutations in their amino acid sequences. These single mutations increase the evolutionary distance from sequences from other countries. An interesting observation was the similarity between the single sequence obtained from Nepal (QIB84673) and sequences from Puerto Rico, Germany and Pakistan. This suggested that the virus was transmitted to other human hosts via a common carrier. It was also observed that the single sequence from South Africa (QIZ15537) was highly similar to sequences obtained from samples collected in China. The amino acid composition obtained on the basis of the phylogenetic tree indicates that the sequences are similar in nature, with little or no difference apart from single amino acid substitutions. The pairwise distance between amino acid sequences was also calculated using the phylogenetic tree. The analysis was carried out by the software using the Poisson correction model and covered all 147 sequences. All regions containing gaps and missing data were removed. There were a total of 1223 sites in the final dataset. The highest distance observed between sequences was 0.0082 (0.82%) and the lowest was 0.
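
The distance calculation can be sketched as follows. With the Poisson correction, the distance between two sequences is d = -ln(1 - p), where p is the proportion of sites at which they differ. The Python below first removes every column containing a gap or missing-data symbol in any sequence, mirroring the removal described above. The set of missing-data symbols and all function names are assumptions, and this is an illustration rather than MEGA's code.

```python
import math
from typing import List


def complete_deletion(seqs: List[str], missing: str = "-?X") -> List[str]:
    """Drop every column in which any sequence has a gap or missing symbol."""
    keep = [col for col in zip(*seqs) if not any(c in missing for c in col)]
    if not keep:
        return ["" for _ in seqs]
    return ["".join(chars) for chars in zip(*keep)]


def p_distance(a: str, b: str) -> float:
    """Proportion of sites at which two aligned sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def poisson_distance(a: str, b: str) -> float:
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(a, b)
    if p >= 1.0:
        raise ValueError("distance is undefined when every site differs")
    return -math.log(1.0 - p)


# Toy alignment (illustrative only); the last column is removed.
cleaned = complete_deletion(["MDLA-", "MGLAV", "MDLAV"])
print(cleaned, round(poisson_distance(cleaned[0], cleaned[1]), 4))
```

As a purely illustrative calculation, ten differing sites out of 1223 would give p of about 0.0082 and a corrected distance of about 0.0082 as well; at such small values the correction barely changes the distance. The actual number of differing sites behind the reported maximum is not stated in the text.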

Fig. 4

Condensed circular phylogenetic tree of predominantly related samples from Puerto Rico, USA, China, Hong Kong and Australia

Discussion

After the data were acquired from the NCBI Virus database, the protein sequences were aligned to identify single or multiple amino acid mutations that were specific to certain countries, along with a mutation that was identified at a global level. The substitution of Aspartic acid by Glycine at the 614th position was observed in all countries analysed except for the Czech Republic and South Africa. According to previous studies, this mutation was predominant in European countries but has also spread across many other countries around the world. This mutation has been noted to be associated with enhanced transmission of SARS-CoV-2, and several reasons for this have been proposed. One of these is structural, as the mutation is present on the surface of the spike glycoprotein. At this position, the Aspartic acid in S1 of one spike glycoprotein subunit can interact with a Threonine on the S2 subunit of the neighbouring one. A change in this interaction might weaken the association between the S1 and S2 subunits, causing the separation of S1 from bound S2, or it may change the way the receptor binding domain binds to human ACE2 in the host (Korber et al. 2020). This mutation can also be associated with immunological changes in the host which can lead to increased susceptibility to infection. This is because the mutation lies in the immunological domain of the spike glycoprotein, which leads to a high B-cell response, as was seen earlier during the SARS-CoV epidemic in 2002 (Lu et al. 2020).

Initial studies conducted by a group of researchers in Europe found that patients with this mutation generally had a higher viral load (Cascella 2021). Because few studies have been conducted on this mutation, not much can be said about the samples containing it. Furthermore, other than the mutation at the 614th site, multiple single amino acid substitutions were recorded. These were generally specific to a certain country and did not occur in any region important for the functionality of the protein or for receptor binding to the enzyme.

Two interesting mutations were those that occurred in the receptor binding domain of samples from India and Malaysia. These were found to be isolated mutations. In the sequence from the Indian sample, the mutation was present at the 408th site with a substitution of Arginine by Isoleucine, while in the sequence from the Malaysian sample the mutation was present at the 491st site with a substitution of Proline by Leucine. Both of these mutations were located in the binding region of the receptor binding domain. Previous studies have shown that the Arginine at the 408th site is conserved in SARS-CoV-2, SARS-CoV and Bat-CoV. This region directly affects the binding of the viral spike glycoprotein to the ACE2 receptor of the human host. The replacement of the Arginine at the 408th site by Isoleucine is considered to reduce the ACE2 receptor binding ability of SARS-CoV-2, as it disrupts the hydrogen bond with a glycan formed by the Arginine at that site (Jia et al. 2020). The mutation found at the 491st site in the Malaysian sample, the substitution of Proline by Leucine, had the same impact on the binding efficacy of the receptor binding region to ACE2. Due to lack of samples and further data, these observations could not be tested further and therefore call for further studies.

A sudden increase in infections in the human populations of Italy and Australia can also be attributed to the lack of alterations in the RBD of the viral protein. Another reason for the high number of cases in China, USA, Australia, Thailand, Taiwan and Italy may be the presence of the mutation at the 614th site (Figs. 5 and 6), which is known to increase the ability of the receptor binding domain to interact with the human angiotensin converting enzyme 2 in the host organism. At this position, Aspartic acid is replaced by Glycine. Aspartic acid has an average occurrence of about 5% in all proteins, is acidic in nature, and is the cleavage target of the endoproteinase Asp-N, which is normally used in peptide mapping and proteomic analysis. The specificity of Asp-N complements those of trypsin, endoproteinase Lys-C and other proteases. For example, the average occurrence of Asp, Arg and Lys is about 5, 5, and 6% in all proteins, respectively. Therefore, digestion with Asp-N generally leads to longer and fewer peptides than tryptic cleavage. Glycine, in contrast, is hydrophobic in nature and has been reported as a virulence factor in SARS-CoV-2. Another finding of this study was the identification of possible targets for the production of suitable vaccines and therapeutics, which could potentially aid in the battle against the SARS-CoV-2 virus (Jia et al. 2020).

Fig. 5

WT and mutated sequences of the S1 domain of the spike glycoprotein of SARS-CoV-2. Aspartic acid is replaced by Glycine at position 614 of the S1 domain of the spike glycoprotein of SARS-CoV-2

Fig. 6

Structural 3D representation in ribbon style of WT (614:D(Aspartic acid)) and mutated (614:G(Glycine)) spike glycoprotein of SARS-CoV-2

A major aim of this study was to identify mutations within the spike proteins of SARS-CoV-2 in infected populations across countries, in order to understand why some countries are affected at a higher rate than others.

With the data available in the National Center for Biotechnology Information (NCBI) Virus database, we were able to identify a few single amino acid mutations unique to certain populations and one global amino acid mutation, which could offer a novel way to explain the differential infection rate of SARS-CoV-2 across the globe. A large-scale analysis of these mutations, with more samples, is required to confirm and validate the study.

The mutations identified, especially those in the receptor binding domain, can be used as potential targets. Moreover, the phylogenetic analysis showed that most samples were predominantly related to samples collected from Puerto Rico, USA, China, Hong Kong and Australia. This was mainly because more samples from these countries were available in the database than from other countries. To broaden the understanding of the geographical source of transmission of SARS-CoV-2, samples from throughout the globe must be collected in larger numbers and deeper studies of the spike glycoprotein must take place. This would aid in predicting the spread of the infection, which would be a good strategy for preventing its spread on a larger scale. Given the sudden onset of diseases such as the recent SARS-CoV-2 and earlier diseases like SARS and MERS, a system with which disease transmission can be predicted could prove useful in the future.

Conclusions

In silico analysis of surface spike glycoprotein sequences has made it possible to identify multiple mutations in different SARS-CoV-2-infected populations. Over the past few months, constant efforts by researchers across the globe have contributed to vaccine development, and vaccines have been distributed to several countries affected by SARS-CoV-2. Unfortunately, the ability of the virus to acquire mutations in its genome has created a need for fast and effective solutions against it. The new mutated strain of SARS-CoV-2 identified in Britain is one of the new strains reported to carry novel mutations that help it become more contagious and infectious. The analysis of spike glycoprotein sequences by multiple sequence alignment and phylogenetic tree studies helped in understanding the heterogeneity in the S2 subunit of the spike glycoprotein of SARS-CoV-2 in different populations. A deeper study of the mutational changes taking place in the regulatory proteins of SARS-CoV-2 would help researchers and clinicians develop better therapeutics to combat the virus. Studies identifying specific epitopes such as E332-370, E627-651, E440-464 and E694-715, along with MHC-I and MHC-II alleles and B-cell and IFN-inducing epitopes, provide valuable knowledge that could be targeted to develop novel, effective vaccines (Lizbeth et al. 2020; Rahman et al. 2020). Mutational heterogeneity analysis of more samples, along with those of the new variant, would advance the development of more specific therapeutics and vaccines.