Abstract

Since its discovery at the end of 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has rapidly evolved into many variants, including the subvariant BA.2 and the GKA clade. Genomic clarification is needed for better management of the current pandemic as well as the possible reemergence of novel variants. The sequence of the reference genome Wuhan-Hu-1 and approximately 20 representatives of each variant were downloaded from GenBank and GISAID. Two representatives of each variant with no tracts of nondefinitive nucleotides were selected. The sequences were aligned using MUSCLE. The locations of insertions/deletions (indels) in the genome were mapped following the open reading frames (ORFs) of Wuhan-Hu-1. The phylogeny of the spike protein coding region was constructed using the maximum likelihood method. Amino acid substitutions in all ORFs were analyzed separately. There are two indel sites in ORF1AB, eight in spike, and one each in ORF3A, matrix (MA), nucleoprotein (NP), and the 3′-untranslated region (3′UTR). Some indel sites and residues/substitutions are not unique, and some are variant-specific. The phylogeny shows that Omicron, Deltacron, and BA.2 are clustered together and separated from other variants with 100% bootstrap support. In conclusion, whole-genome comparison of representatives of all variants revealed indel patterns that are specific to SARS-CoV-2 variants or subvariants. Polymorphic amino acid comparison across all coding regions also showed amino acid residues shared by specific groups of variants. Finally, the higher transmissibility of BA.2 might be due at least in part to the 48-nucleotide deletion in the 3′UTR, while the apparent extinction of the GKA clade is due to the lack of genetic advantages as a consequence of amino acid substitutions in various genes.

1. Introduction

The rapid evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of the coronavirus disease 2019 (COVID-19) pandemic, requires immediate scientific clarification to better manage the current pandemic and to serve as a reference for the possible emergence of novel variants. Since it was discovered at the end of 2019 [1], the original virus has evolved into many variants, which have been associated with specific clinical consequences [2]. Established variants of concern (VOCs) are Alpha, Beta, Gamma, Delta, and Omicron; Lambda and Mu are variants of interest (VOIs), and GH/490R is a variant under monitoring (VUM). Two other clusters of viruses that are closely related to Omicron also require special attention: the BA.2 subvariant and the GKA clade, the latter carrying molecular markers of both the Delta and Omicron variants. Originally, detection of the GKA clade, popularly known as Deltacron, led to arguments that it may be the result of a sequencing error [3]. However, there is an increasing number of SARS-CoV-2 whole-genome sequences labelled as clade GKA with the notification “this submission requires investigation! It appears to contain markers of multiple lineages from both Delta and Omicron variants” in the database.

The capacity to spread globally as well as other biological properties of each variant must be encoded, at least in part, by its genome. Whole-genome comparison should also be able to confirm the establishment of the BA.2 subvariant and GKA clade. Based on previous publications [4, 5], the whole genome organization of SARS-CoV-2, following the ORF annotation of Wuhan-Hu-1 and adding the 5′- and 3′UTRs and intergenic sequences (IGSs), is 5′UTR-ORF1AB-IGS-Spike-IGS-ORF3A-ORF3B-IGS-Protein E-IGS-membrane (MA)-IGS-ORF6-ORF7A-ORF7B-ORF8-IGS-Nucleoprotein (NP)-ORF10-3′UTR. IGSs have been identified previously [6–8].

The large number of submitted whole-genome sequences poses a major computational challenge, and some submitted sequences contain long tracts of nondefinitive nucleotides. Here, we identify insertion/deletion (indel) and amino acid substitution patterns in the whole genome of representative variants, including the BA.2 subvariant and the GKA clade.

2. Materials and Methods

The sequence of the reference genome of SARS-CoV-2 strain Wuhan-Hu-1 (accession number NC_045512) was downloaded from GenBank. Ten to 20 complete sequences of each definitive variant as well as the subvariant BA.2 and GKA clade were selected randomly from GISAID and downloaded. Two representatives of each variant with no undetermined nucleotides (“N” or other IUPAC ambiguity codes) were selected. Sequences with a single N or nondefinitive nucleotide were also accepted. The dataset identifier is EPI_SET ID: EPI_SET_230223kx; doi: 10.55876/gis8.230223kx. The sequences were aligned using MUSCLE in MEGA-X software [9]. The locations of deletions/insertions in the genome of SARS-CoV-2 were mapped following the open reading frames of Wuhan-Hu-1, as available in the GenBank file. ORFs were analyzed separately to determine the effects of mutations and deletions/insertions.
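The selection of representatives with definitive sequences can be illustrated with a short sketch. This is not the procedure actually used in this study, which relied on GISAID downloads and MEGA-X; the function names `count_ambiguous` and `pick_representatives`, the `max_ambiguous` threshold, and the toy records are hypothetical and serve only to show the filtering logic.

```python
import random

DEFINITIVE = set("ACGT")

def count_ambiguous(seq):
    """Count characters that are not A, C, G, or T (N or other IUPAC codes)."""
    return sum(1 for base in seq.upper() if base not in DEFINITIVE)

def pick_representatives(records, n=2, max_ambiguous=0, seed=0):
    """Randomly pick up to n sequence names with at most max_ambiguous ambiguous bases.

    records maps a (hypothetical) sequence name to its nucleotide string.
    """
    eligible = sorted(name for name, seq in records.items()
                      if count_ambiguous(seq) <= max_ambiguous)
    rng = random.Random(seed)  # fixed seed so the toy example is repeatable
    rng.shuffle(eligible)
    return eligible[:n]

# Toy records (assumption: real input would be full-length genomes).
toy = {
    "variant_a": "ACGTACGTAC",
    "variant_b": "ACGTNNNNAC",   # tract of N: rejected
    "variant_c": "ACGTACGTAN",   # single N: accepted only if max_ambiguous >= 1
    "variant_d": "ACGTTCGTAC",
}
strict = pick_representatives(toy, n=2, max_ambiguous=0)
relaxed = pick_representatives(toy, n=4, max_ambiguous=1)
```

With `max_ambiguous=0` only fully definitive sequences remain eligible; raising it to 1 mirrors the acceptance of sequences with a single nondefinitive nucleotide.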

For each coding region, the first 15 nucleotides of the corresponding Wuhan-Hu-1 ORF were searched for in each sequence, and the sequence upstream of this marker was deleted. The last 15 nucleotides of the Wuhan-Hu-1 ORF were used in the same way to mark the 3′-terminus. The selected sequences were translated into amino acid sequences and aligned using MEGA-X software [9]. Using the same software, the data were exported in MEGA format and analyzed further for polymorphic amino acids. We identified amino acids that were consistently substituted from Wuhan-Hu-1 across all variants, as well as amino acids shared by the following groups: Omicron, BA.2, and GKA; Omicron and BA.2; Delta and GKA; Delta, Omicron, BA.2, and GKA; and Omicron and GKA. The final FASTA file of the data set is available in Supplementary Material 1.
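The anchor-based trimming and translation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MEGA-X workflow used here: the names `trim_to_orf` and `translate` are hypothetical, the standard genetic code is assumed, and the toy sequences are invented for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def trim_to_orf(genome, ref_orf, anchor_len=15):
    """Cut genome to the region bounded by the first and last anchor_len bases of ref_orf."""
    head, tail = ref_orf[:anchor_len], ref_orf[-anchor_len:]
    start = genome.find(head)
    end = genome.rfind(tail)
    if start == -1 or end == -1 or end < start:
        # An anchor can be missing when the variant is mutated inside it.
        raise ValueError("anchor not found")
    return genome[start:end + anchor_len]

def translate(orf):
    """Translate codon by codon; codons with unknown bases become 'X'."""
    usable = len(orf) - len(orf) % 3
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X")
                   for i in range(0, usable, 3))

# Toy reference ORF (hypothetical) and a variant genome with one codon deleted.
ref_orf = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGATAA"
variant_genome = "TT" + ref_orf[:15] + ref_orf[18:] + "AC"
variant_orf = trim_to_orf(variant_genome, ref_orf)
variant_protein = translate(variant_orf)
```

An exact-match search for the anchors fails when a variant carries a mutation within the first or last 15 nucleotides, in which case the region would have to be located from the alignment instead.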

The phylogeny of the spike protein coding region of the representatives of the variants was constructed using the maximum likelihood method and the JTT matrix-based model [10], as implemented in MEGA-X software [9]. The phylogenetic tree was rooted on the Wuhan-Hu-1 sequence.

3. Results

The indel patterns and their locations in the whole genome of representatives of the SARS-CoV-2 variants are presented in Table 1. There are two indel sites in ORF1AB, eight in spike, and one each in ORF3A, MA, NP, and the 3′UTR. No indel occurs in the IGSs. Some indel sites are not unique, as they occur in more than one variant. D21605-21613 is unique to the Omicron and BA.2 variants, and D21965-21967 is unique to the Alpha variant; I21968-21971 and D26143-26146 are unique to the Mu variant; D28351-28359 is unique to the Omicron and BA.2 variants; and D29723-29748 is unique to the BA.2 variant. Some indels occur in only one of the two representatives of a variant.
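Mapping indels onto reference coordinates from a pairwise alignment can be sketched as below. This is an illustrative reconstruction, not the procedure used to build Table 1; the function name `map_indels`, the tuple output format, and the toy alignment are hypothetical. Gaps are assumed to be written as `-`.

```python
def map_indels(ref_aln, query_aln):
    """Return indels as (kind, start, end) in 1-based reference coordinates.

    'D': reference bases start..end are absent from the query.
    'I': extra query bases lie between reference positions start and end.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    events = []
    ref_pos = 0       # number of reference bases consumed so far
    current = None    # open event as [kind, start, end]
    for r, q in zip(ref_aln, query_aln):
        if r != "-":
            ref_pos += 1
        if r != "-" and q == "-":
            kind = "D"
        elif r == "-" and q != "-":
            kind = "I"
        else:
            kind = None
        # Close the open event when the column type changes.
        if current is not None and current[0] != kind:
            events.append(tuple(current))
            current = None
        if kind == "D":
            if current is None:
                current = ["D", ref_pos, ref_pos]
            else:
                current[2] = ref_pos
        elif kind == "I" and current is None:
            current = ["I", ref_pos, ref_pos + 1]
    if current is not None:
        events.append(tuple(current))
    return events

# Toy alignment: a 3-base deletion and a 2-base insertion in the query.
reference = "ACGTACGT--ACGT"
query     = "AC---CGTTTACGT"
indels = map_indels(reference, query)
labels = [f"{kind}{start}-{end}" for kind, start, end in indels]
```

Because positions are counted only over non-gap reference columns, the reported coordinates stay in the numbering of the reference genome regardless of insertions elsewhere in the alignment.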

All polymorphic amino acids of all proteins of two representatives of each variant of SARS-CoV-2 are listed in Supplementary Material 2. A summary of unique amino acids across entire genes in at least one of the representative strains of SARS-CoV-2 variants is presented in Table 2. Amino acids consistently substituted from Wuhan-Hu-1 across all variants are ORF1AB P4715L/F and spike D618G. Regarding the amino acids shared by Omicron, BA.2, and GKA, there are 10 in ORF1AB, 21 in spike, and one each in ORF3A, ORF6, and NP. Three deletions in NP are unique to Omicron and BA.2. Exclusive to Delta and GKA are two deletions in ORF1AB, four in spike, three in NP, and two each in ORF7A and ORF8; only one spike amino acid is shared by Delta, Omicron, BA.2, and GKA. Three insertions, Ins216E, Ins217P, and Ins218E, are unique to Omicron and GKA. GH/490R harbors 16 variant-specific substitutions relative to Wuhan-Hu-1, namely, five in ORF1AB, eight in spike, two in NP, and one in ORF3A. Unique amino acid substitutions from Wuhan-Hu-1 in the GKA clade comprise nine in ORF1AB (E352D, A1306S, P2046L, A2529V, I2820V, V2930L, T3646A, P4715F, and A6319V), one in NP (G215C), two in ORF3A (S92L and D155Y), and one in ORF7B (T40I).
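The idea of residues shared exclusively by a group of variants can be made concrete with a small sketch. It is not the analysis pipeline of this study; the function name `shared_substitutions`, the sequence names, and the toy alignment are hypothetical. Positions in the output are 1-based alignment columns.

```python
def shared_substitutions(aligned, reference, group):
    """List substitutions carried by every member of group and by no other sequence.

    aligned maps a (hypothetical) sequence name to an aligned protein string.
    Returns labels such as 'D3G' (reference residue, alignment column, new residue).
    """
    ref = aligned[reference]
    others = [name for name in aligned
              if name != reference and name not in group]
    hits = []
    for i, ref_res in enumerate(ref):
        residues = {aligned[name][i] for name in group}
        if len(residues) != 1:
            continue                      # group members disagree at this column
        res = residues.pop()
        if res == ref_res:
            continue                      # unchanged from the reference
        if any(aligned[name][i] == res for name in others):
            continue                      # also present outside the group
        hits.append(f"{ref_res}{i + 1}{res}")
    return hits

# Toy alignment of a five-residue protein (invented for illustration).
toy = {
    "ref":       "MKDLQ",
    "omicron_1": "MKGLH",
    "omicron_2": "MKGLH",
    "ba2_1":     "MKGLQ",
    "delta_1":   "MRDLQ",
}
omicron_ba2 = shared_substitutions(toy, "ref", ["omicron_1", "omicron_2", "ba2_1"])
omicron_only = shared_substitutions(toy, "ref", ["omicron_1", "omicron_2"])
```

In the toy data, the change at column 3 is shared by the Omicron and BA.2 sequences, so it is reported for the combined group but not for Omicron alone.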

The topology of the phylogenetic tree of the spike protein gene of two representatives of each variant of SARS-CoV-2 is presented in Figure 1. The phylogeny shows that Omicron, Deltacron, and BA.2 are clustered together and separated from other variants with 100% bootstrap support.

4. Discussion

The genetic diversity of coronaviruses arises through mutation and recombination, as has also been described for SARS-CoV-2 [11]. Although the RNA-dependent RNA polymerases of coronaviruses possess proofreading capacity [12], the virus still undergoes mutation, which might lead to amino acid replacement. Such changes impact the biology of the virus as well as the clinical manifestation of its infection. Recombination involves viral RNA merging with other RNAs, either its own RNA, the RNA of other viruses, or cellular RNA, through template switching during transcription [13]. This process leads to RNA indels. Mutations in SARS-CoV-2 prior to the emergence of variants have been reported [6]. In HIV, deletions occur by at least three different mechanisms: (i) misalignment of the growing point; (ii) incorrect synthesis and termination in the primer-binding sequence during the synthesis of the plus-strand strong-stop DNA; and (iii) incorrect synthesis and termination before the primer-binding sequence during synthesis of the plus-strand strong-stop DNA [14].

Previous whole-genome comparisons have been conducted, including for Omicron [15]. However, that work focused on phylogeny and did not cover the recently identified BA.2 and GKA lineages, the latter colloquially known as Deltacron. Indels and amino acid substitutions unique to specific variants were not described.

Through random selection of variant representatives with definitive sequences across the genome, we managed to identify unique patterns of indels and amino acid substitutions. Even with only two representatives for each variant, which is a limitation of this study, we identified quasispecies or, in the case of a variant, quasivariants. Viral quasispecies refers to a population structure that consists of extremely large numbers of variant genomes, termed mutant spectra, mutant swarms, or mutant clouds [16]. For SARS-CoV-2, this phenomenon has been discovered even in single infected individuals [17–21]. We propose the term quasivariant, as many indels and amino acid substitutions occur in only one of the two representatives. We believe that more variation will be found if more variant representatives are analyzed.

Amino acids consistently substituted from Wuhan-Hu-1 across all variants are ORF1AB P4715L/F and spike D618G. The D618G substitution has been covered in previous works [6, 22–29]. ORF1AB P4715L/F has also been described [30, 31]. A database-wide survey is needed to understand the frequency of those substitutions.

The variant that harbors the most variant-specific substitutions from Wuhan-Hu-1 is the VUM GH/490R. Both representatives show five, eight, two, and one amino acid substitutions in ORF1AB, spike, NP, and ORF3A, respectively. This VUM is being tracked in Europe, Africa, Asia, and America; however, its genome frequency in GISAID, as accessed on March 30, 2022, was lower than 0.3%.

The GKA clade does not comprise a unique variant. It harbors no unique indels or substitutions compared with Wuhan-Hu-1, but it does share 34 amino acid replacements from Wuhan-Hu-1 with Omicron and BA.2, 13 with Delta, and one in spike with Delta, Omicron, and BA.2. Three spike insertions (Ins216E, Ins217P, and Ins218E) found in one representative of the GKA clade are shared with Omicron. The molecular signatures of Delta and Omicron are obvious in the GKA clade. It is plausible that the GKA clade is an Omicron subvariant. We suggest that the clade is not the result of sequencing errors, as previously thought [3].

The GKA clade seems to have no genetic advantage, as it became extinct shortly after its discovery. There were only 89 full genome sequences tagged with the GKA clade when the GISAID database was accessed on May 3, 2022. The collection date of the earliest sequence was January 20, 2022, and that of the last one was March 21, 2022. This clade possesses nine amino acid changes from the reference strain Wuhan-Hu-1 in ORF1AB, two in ORF3A, and one each in NP and ORF7B. No unique amino acid change was observed in the spike protein. The clade seems to be suppressed by antibodies to other variants following previous natural infection and/or vaccination.

Interestingly, we identified a truncated ORF3A in the Mu variant. Deletion of four nucleotides generates a stop codon; thus, ORF3A in this variant is 257 amino acids in length, whereas the others are 275 residues long. This accessory protein contributes to the pathogenesis of SARS-CoV-2 by inducing pathological apoptosis [32]. The effect of the Mu variant at the cellular level has not yet been described. One article on this variant covered the neutralization effect of antibodies [33]. According to the GISAID database accessed on March 30, 2022, this variant has been identified in many countries, with a maximum global genome frequency of less than 1%, which has declined recently.
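A deletion of four nucleotides is not a multiple of three, so it shifts the reading frame downstream of the deletion site, which is how a premature stop codon can arise. The sketch below illustrates this with an invented toy sequence; it is not the Mu ORF3A sequence, and the function names `protein_length` and `delete_bases` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def protein_length(orf):
    """Number of codons read from the start of orf before the first in-frame stop codon."""
    for i in range(0, len(orf) - 2, 3):
        if orf[i:i + 3] in STOP_CODONS:
            return i // 3
    return len(orf) // 3  # no in-frame stop codon found

def delete_bases(seq, start, length):
    """Remove length bases beginning at 0-based index start."""
    return seq[:start] + seq[start + length:]

# Toy ORF (invented): ATG GCT TCA CTG ATC GGC AAA TAA
toy_orf = "ATGGCTTCACTGATCGGCAAATAA"
shifted = delete_bases(toy_orf, 3, 4)       # 4-base deletion: frame shifts
in_frame = delete_bases(toy_orf, 3, 3)      # 3-base deletion: frame preserved

full_len = protein_length(toy_orf)
shifted_len = protein_length(shifted)
in_frame_len = protein_length(in_frame)
```

In this toy case the four-base deletion brings a TGA triplet into frame and truncates the product, whereas the three-base deletion removes exactly one codon and leaves the original stop codon in place.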

BA.2 differs from Omicron in the deletion of 48 nucleotides from the 3′UTR. The 3′UTR of coronaviruses contains all cis-acting sequences necessary for viral replication and binds to cellular components as well as the viral nsp1 and N proteins [34], which are required for minus-strand RNA synthesis [35]. This has also been described in SARS-CoV-2, whereby the 3′UTR is involved in genomic dimerization and interacts with cellular microRNA [36]. BA.2 has recently increased in frequency in multiple regions of the world, suggesting that it has a selective advantage over Omicron [37–40]. The genome frequency of BA.2 has increased exponentially to 90% of total Omicron submissions, based on GISAID data accessed on the previous date. The original SARS-CoV-2 has a basic reproduction number (R0) of 2.4-3 [41], Delta has an R0 of 5 [42], and Omicron has an R0 estimated to be higher than 10, or three times greater than that of Delta [43]; the BA.2 subvariant might have an R0 of 15 or higher. The higher transmissibility of BA.2 might be attributed, at least in part, to the shorter 3′UTR, which could result in a higher speed of viral replication; this needs to be investigated further. However, because the coding region across the whole genome of BA.2, particularly for the spike protein, is very close to that of Omicron, people who survived Omicron infection should be naturally protected against BA.2.

Phylogenetic analysis (Figure 1) demonstrated that the BA.2 subvariant and the GKA clade belong to the Omicron variant. The phylogeny shows that Omicron, GKA/Deltacron, and BA.2 are clustered together and separated from other variants with 100% bootstrap support.

5. Conclusion

Whole-genome comparison of representatives of all variants revealed indel patterns that are specific to SARS-CoV-2 variants or subvariants. Polymorphic amino acid comparison across all coding regions also showed amino acid residues shared by specific groups of variants. Finally, the higher transmissibility of BA.2 might be due at least in part to the 48-nucleotide deletion in the 3′UTR, which could result in a higher speed of viral replication, while the apparent extinction of the GKA clade is due to the lack of genetic advantage as a consequence of amino acid substitutions in various genes.

Data Availability

All genome sequences and associated metadata in this dataset are published in GISAID’s EpiCoV database. The final dataset is available at GISAID with identifier EPI_SET_230223kx, doi: 10.55876/gis8.230223kx. To view the contributors and each individual sequence with details such as the accession number, virus name, collection date, originating lab and submitting lab, and the list of authors, visit 10.55876/gis8.230223kx. All polymorphic amino acids of all proteins of two representatives of each variant of SARS-CoV-2 are listed in Supplementary Material 1.

Disclosure

An earlier version of the manuscript has been posted as a preprint at https://www.researchsquare.com/article/rs-1526043/v1.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Ida B. K. Suardana and Gusti N. Mahardika contributed to the conception of the study. Ida B. K. Suardana, Bayu K. Mahardika, Made Pharmawati, and Putu H. Sudipa contributed to the acquisition, analysis, and interpretation of data. Tri K. Sari and Nyoman B. Mahendra drafted the manuscript. All authors reviewed the manuscript and approved the final version for publication.

Acknowledgments

The English language of the manuscript has been edited by Springer Nature Author Services. This study was supported by the Research, Technology, and Higher Education (RISTEKDIKTI) of Indonesia through world class research.

Supplementary Materials

All polymorphic amino acids of all proteins of two representatives of each variant of SARS-CoV-2. (Supplementary Materials)