Keywords
SARS related coronavirus, SARS-CoV-2, SARS-CoV-1, COVID-19, virus passage, yeast S. cerevisiae, directed evolution, genomic transformation, genome editing, synthetic biology
This article is included in the Cell & Molecular Biology gateway.
This article is included in the Coronavirus collection.
From the beginning of the COVID-19 pandemic in March 2020, evidence was put forward that the outbreak of the novel coronavirus SARS-CoV-2 within the human population was most likely a product of natural evolution1. According to this view, COVID-19 is a zoonosis that probably originated from a species of closely related bat coronaviruses2. Prior to a hypothetical spillover event, a recent ancestor of SARS-CoV-2 likely evolved inside bat host cells for many decades3. However, the natural evolution hypothesis of SARS-CoV-2 origin currently has considerable limitations: first, the difficulty in characterizing the evolutionary origin of the unusual poly-basic (RRAR) furin cleavage site at the S1/S2 junction of the SARS-CoV-2 spike (S) glycoprotein4; second, the discrepancy between an exponentially suppressed tropism of SARS-CoV-2 in Rhinolophus sinicus bat cells5 and the high susceptibility of SARS-CoV-2 toward cell entry via Rhinolophus sinicus angiotensin-converting enzyme 2, its primary entry receptor6; and third, the persistent inability to identify an intermediate ancestral host between human and the horseshoe bat Rhinolophus affinis. This species was reported to be the host of coronavirus RaTG13 (refs. 7, 8), currently the isolate with the highest sequence similarity to the SARS-CoV-2 genome, which is located on the same phylogenetic branch as Rhinolophus sinicus bat coronavirus9. Finding the last animal progenitor host of SARS-CoV-2 has been further complicated by continued uncertainty about the origin of RaTG13 itself10,11. The above limitations of the natural evolution hypothesis for SARS-CoV-2 do not necessarily apply to genetic engineering of viral genomes. For example, in the case of SARS coronavirus, it has long been established that introducing a synthetic poly-arginine construct at the furin cleavage site significantly increases the rate of entry into human cells compared with wild-type spike protein12.
Also before 2010, after a period of rapid progress in understanding the relevant host-virus factors13,14, natural barriers in the host range of positive-strand RNA viruses were rationally extended, leading to directed viral replication in new species, including model organisms that originally were not permissive, such as the yeast Saccharomyces cerevisiae15. Accordingly, to transform yeast into a synthetic host for viral replication, the scheme has been to co-express viral RNA-dependent RNA polymerase (RdRp) and, if also necessary for replication, additional viral factors on plasmids under the control of auxotrophic yeast selectable markers (YSM)16. These selectable markers primarily serve to direct cell lines into stable expression of the desired plasmid DNA, but may at the same time function as entry gates for directed insertion of exogenous genetic material into yeast chromosomes17. In principle, once the RdRp and required auxiliary factors are selectively and functionally expressed, this approach applies to any replication-competent SARS coronavirus RNA, including cloned or de novo synthesized genomic parts or even entire genomes, thus facilitating their replication as well as their integration into the yeast genome. Our hypothesis is that such a passage would leave behind traces in the genomes of both the virus construct and the synthetic host.
SARS and SARS-like sarbecovirus whole genome nucleotide sequences were taken from the comprehensive sequence and phylogenetic analyses by Zhou et al.18 and from Li et al.9. In our study, sequences were selected only if they had a valid GenBank accession identifier or an NCBI Reference Sequence (RefSeq) accession identifier as of 5 June 2021, resulting in a reference set of 13 whole genome virus sequences (see also Extended Data). BLAT whole genome comparative sequence analysis was performed using the BLAT public webserver (BLAT, RRID:SCR_011919) with the options “Genome: Search all” and “All results (no minimum matches)”. For each of the 13 corresponding BLAT output tables produced from genomic alignments to the yeast S. cerevisiae (Extended data Tables S2–S14), the profiled BLAT score, pS, was defined as the genome-wide distribution of BLAT scores (output table column [SCORE]) weighted by the corresponding length of the homologous genomic region (output table distance between columns [START] and [END]). The cumulative profiled BLAT score cS, which was used as a genome-wide quantitative indicator of yeast (S. cerevisiae) homology, was the total sum over this distribution. Subtracting the sample mean from cS and dividing by the sample standard deviation yielded a standardized BLAT z-score, which served as a relative indicator of sequence homology with S. cerevisiae. Sequence alignments for cross-validation were produced with lalign from the fasta36-36.3.8/bin/lalign36 software package (version number 36.3.8) with parameter settings: -f -12 -g 0 -E 1. This parameter choice followed standard parameters for LALIGN.
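The profile, cumulative score and z-score definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, each BLAT hit is taken as a `(score, start, end)` triple with 1-based inclusive coordinates on the query genome, pS at a position is assumed to be the sum of the scores of all hits covering that position (so that summing pS weights each score by the length of its region), and the standard deviation is assumed to be the sample (n − 1) estimate.

```python
from statistics import mean, stdev


def profile_scores(hits, genome_length):
    """Per-position profile pS (assumed: sum of scores of all hits covering a position).

    hits: iterable of (score, start, end), 1-based inclusive query coordinates.
    """
    profile = [0] * genome_length
    for score, start, end in hits:
        # Convert to 0-based indices and clip to the genome length.
        for pos in range(max(start - 1, 0), min(end, genome_length)):
            profile[pos] += score
    return profile


def cumulative_score(profile):
    """Cumulative profiled score cS: total sum over the profile."""
    return sum(profile)


def z_scores(values):
    """Standardize a list of cS values (assumed: sample standard deviation)."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]
```

With one cS value per genome, `z_scores` would be applied once to the list of all 13 values, so each z-score is relative to the chosen reference set rather than an absolute measure.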
Sequence identities were calculated using the Clustal Omega public webserver (RRID:SCR_001591) with standard preset parameters.
To interrogate the possibility that a similar passage through yeast cells took place within the family of SARS coronaviruses, we selected eight reference genomes18 for further analysis (see Methods): SARS-CoV-2 isolate Wuhan-Hu-1 (GenBank reference NC_045512.2), Rhinolophus affinis bat coronavirus RaTG13 (MN996532.2), Rhinolophus pusillus SL-CoV ZXC21 (MG772934.1), Rhinolophus pusillus SL-CoV ZC45 (MG772933.1), Rhinolophus acuminatus bat coronavirus RacCS203 (MW251308.1), Rhinolophus cornutus bat coronavirus Rc-o319 (LC556375.1), SARS-CoV Urbani (AY278741.1), and MERS-CoV isolate HCoV-EMC/2012 (NC_019843.3). For comparative genomic sequence analysis we used a standard bioinformatics approach with the BLAST-like Alignment Tool (BLAT, RRID:SCR_011919)19. Each search with one of the above query sequences against the entire multi-species genome database produced a high number (between 1689 and 5083) of tiles, i.e. perfectly aligned short DNA sequences of length 11. A large majority of these tiles were repeatedly matched on the same two target genomes (out of 107 total; see also Extended data Table S1): SARS-CoV-2 (NC_045512.2), the only coronavirus genome in the database, and S. cerevisiae (SacCer3/S288c). In these instances, BLAT identified many homologous regions by aggregating multiple tiles (Tables S2–S9), and for each homologous region it produced an integer score S, which is the number of perfectly matched positions therein. To obtain a genome-wide view of this homology signal we stacked together all homologous regions weighted by their individual alignment scores S, which resulted in an accumulated homology profile, pS (see Methods and Extended Data Figures S1 and S2). To remove its shortest-scale fluctuations, the profile was smoothed by a centered sliding window filter with a window size of 200 nucleotides (nt). The eight resulting genomic profiles (Figure 1 and Figure S2) were ordered by decreasing sequence identity to SARS-CoV-2.
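The smoothing step can be illustrated as follows. This sketch assumes that the centered sliding window filter is a plain moving average and that the window is truncated at the genome ends; neither detail is stated in the text, and the function name is hypothetical.

```python
def smooth_centered(profile, window=200):
    """Moving average over a window centered on each position.

    Assumptions: mean filter; window truncated (not padded) at the ends.
    For an even window the span is [i - window//2, i + window//2 - 1].
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(profile)
    half = window // 2
    # Prefix sums make each window sum an O(1) lookup.
    prefix = [0]
    for value in profile:
        prefix.append(prefix[-1] + value)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)  # half-open upper bound
        smoothed.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return smoothed
```

A filter of this kind suppresses fluctuations shorter than the window while leaving broader peaks, such as those a few hundred nucleotides wide, largely in place.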
For SARS-CoV-2, two prominent (pS > 20) peaks indicated highly localized profile scores at levels ~10-fold above the apparent background. The first peak (P1) reached a top alignment score of 47 in the narrow genomic interval [7191 nt..7192 nt]max, and the second peak (P2), about 18,000 nt downstream, reached a score of 36 in the region [25196 nt..25212 nt]max (see Figure 1). To put these data into an established gene-function context, these two maxima, with half-maximum widths w1/2 = 215 nt and w1/2 = 219 nt, respectively, were annotated with available information from the closest and most specifically annotated genomic region in RefSeq, the NCBI Reference Sequence database20. Thus P1 was closest to the start of the C-terminal domain of non-structural protein 3 (designated nsp3C), which extends over the interval [6962 nt..8552 nt]. The C-terminal domain of nsp3 is known to play a critical role in replication due to its direct interaction with nsp4, thereby facilitating virus-induced membrane rearrangement and replication complex formation; conversely, loss of nsp3C-nsp4 interaction abolishes SARS coronavirus replication21. P2 was located toward the 3′ end of the open reading frame of the spike gene. Here it overlapped with the 3′ end of the stretch that covers both the S1/S2 cleavage region and the S2 fusion subunit of the S protein (S_S1/S2, with interval [23192 nt..25187 nt]). The S_S1/S2 domain includes the characteristic furin cleavage site at the S1/S2 junction22. Cleavage activates the nearby S2 fusion peptide, and together they constitute an essential part of SARS-CoV-2 particle-dependent and particle-independent cell entry through fusion of viral and cellular membranes23,24. A similar analysis for the RaTG13 viral genome identified only one isolated peak (P3) with a maximum profile score of 50 on the interval [9713 nt..9733 nt]max, and with w1/2 = 230 nt. It intersected with the coding region of the C-terminal domain of nsp4 located at [9770 nt..10046 nt] (Figure 1).
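One way to read the half-maximum widths quoted above is as the length of the contiguous stretch around a peak maximum in which the profile stays at or above half the peak height. The sketch below implements that reading; the function name and the exact convention (inclusive comparison, single global maximum) are assumptions for illustration and may differ from how w1/2 was measured in the study.

```python
def half_max_width(profile):
    """Return (peak_index, width) for the global maximum of a profile.

    Width is counted in positions where the profile is >= half the maximum,
    extending contiguously to the left and right of the (first) maximum.
    """
    if not profile:
        raise ValueError("profile must not be empty")
    peak = max(range(len(profile)), key=profile.__getitem__)
    half = profile[peak] / 2
    left = peak
    while left > 0 and profile[left - 1] >= half:
        left -= 1
    right = peak
    while right < len(profile) - 1 and profile[right + 1] >= half:
        right += 1
    return peak, right - left + 1
```

For a genome with several peaks, the same function could be applied to a slice of the profile around each peak, with the returned index offset by the start of the slice.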
In the SARS coronavirus Urbani genome (SARS-CoV-1), two additional signals were detected: P4 with a maximum score pS = 26 at position [13486 nt..13497 nt]max and w1/2 = 222 nt; and a broader second peak, P5, with pS = 41 at position [22286 nt..22391 nt]max and w1/2 = 477 nt. P4 sharply co-localized with the N-terminus of the RdRp domain at [13414 nt..14470 nt]. P5 was annotated with the N-terminal part of the spike gene’s receptor binding domain (Rbd) located in the interval [22443 nt..23199 nt]. In contrast to the five signals identified in these three genomes, an equivalent analysis for the other three (RacCS203, Rc-o319, MERS-CoV) produced only negative results. Their accumulated homology profiles were evenly distributed across the entire genomes, consistent with a low random score background from many short spurious matches. As a further specificity control, negative results were obtained (see Figure S3 and Tables S10–S14) after profiling the five betacoronavirus isolates most closely related to SARS-CoV-1, taken from five wild animals (civet, Paradoxurus hermaphroditus, Paguma larvata, Aselliscus stoliczkanus, and Rhinolophus sinicus), which together with SARS-CoV-2 occupy the same phylogenetic branch9. These data also led to a highly differential signal of yeast homology in the SARS-CoV-1, SARS-CoV-2 and RaTG13 genomes after calculating standardized z-scores (Figure 2) from the entire BLAT profiles produced for all 13 of the above sequences (Tables S2–S14). To cross-validate the detected yeast homology signals in P1–P5, we used an independent sequence alignment method, LALIGN25, which additionally produced statistics (E-values) for pairwise alignments. While the peaks P1 and P2, as well as P4 and P5, could be positively cross-validated, the P3 signal in RaTG13 detected by BLAT did not yield a statistically significant alignment with LALIGN, with its E-value exceeding 0.01 (see Table S16 and Figure S4).
Taken together, these highly differential data show that, for SARS-CoV-1 and for SARS-CoV-2, genes known to be critical for viral replication and host cell invasion display localized yeast homology at their flanking regions with limited extensions into the corresponding open reading frames.
To explain this yeast DNA enrichment pattern, we propose the following artificial passage model (Figure 3A). Its starting point is a doubly auxotrophic, synthetic yeast cell line with stable, heterologous expression of the viral replicase complex (RdRp, optionally together with auxiliary factors for replication, Aux) from a plasmid under the control of a selectable marker, YSM1. A second plasmid carries another auxotrophic yeast selectable marker, YSM2, which originates from a different chromosome and regulates the expression of a non-replicative segment of viral RNA (nrvRNA1). At this point, nrvRNA1 is any uninterrupted DNA segment from a SARS-coronavirus related genome prior to passage. Through homologous recombination, the target yeast chromosome is transformed and nrvRNA1 is integrated17 at the chromosomal site of the auxotrophy-conferring allele homologous to YSM2. During cell growth in passage, double-stranded DNA breaks occur, and breaks at both ends of nrvRNA1, in their flanking regions, and in their homologous extensions into YSM2 are repaired preferably by intra-chromosomal gene conversion26, i.e. through a non-crossover homologous recombination, with the endogenous site as the homologous repair donor (Figure 3A).
If we assume that nrvRNA1 itself contains a copy of RdRp (and of Aux), then the above model implies that higher-order integration events17 will occur between the YSM1 plasmid and the primary site of integration. In effect, short segments from the plasmid's YSM1 region will also be integrated into nrvRNA1. In this case the passage model specifically predicts that during S. cerevisiae growth nrvRNA1 will accumulate sequences from exactly two yeast chromosomes, i.e. the two from which YSM1 and YSM2 originated.
To test this prediction, we produced the score profile pS, but this time from the yeast sequence hits on each chromosome. For direct comparison, we then transformed each profile into a single number (cS), for all 16 chromosomes (mitochondrial chromosome excluded), by calculating the sum of pS over the entire chromosome length conditional on the cutoff pS > 30. In the case of SARS-CoV-2, this procedure resulted in two distinct peaks at chromosomes II and XV (Figure 3B). For SARS-CoV-1, the highest two peaks were at chromosomes IV and V, followed by a much shallower peak on XVI with only 0.24 times the height of the peak on IV. One peak was detected for RaTG13, also at XVI, whereas the other three viral genomes produced no signal at the chosen cutoff (see Figure 3B, also for similar data without a cutoff). To further connect these data to our passage model, we attempted to match the seven most commonly used auxotrophic yeast selectable markers27,28 according to their chromosomal origin: ADE2 (adenine requiring phosphoribosylaminoimidazole carboxylase, chr. XV), HIS3 (histidine requiring imidazoleglycerol-phosphate dehydratase, chr. XV), LEU2 (leucine requiring beta-isopropylmalate dehydrogenase, chr. III), LYS2 (lysine requiring aminoadipate reductase, chr. II), MET15 (methionine requiring O-acetyl homoserine-O-acetyl serine sulfhydrylase, chr. XII), URA3 (uracil requiring orotidine-5'-phosphate (OMP) decarboxylase, chr. V), and TRP1 (tryptophan requiring phosphoribosylanthranilate isomerase, chr. IV). In agreement with the model prediction, five out of the seven markers could be matched to the four highest of the five chromosome peaks detected in SARS-CoV-2 and SARS-CoV-1 (Figure 3B). In particular, this outcome implies that for SARS-CoV-2 the two auxotrophic markers (YSM1, YSM2) could be any pair from the triple (ADE2, HIS3, LYS2), and for SARS-CoV-1 either the pair (URA3, TRP1) or (TRP1, URA3), since URA3 and TRP1 are the listed markers located on chromosomes V and IV.
Thus SARS-CoV-2 and SARS-CoV-1 both did, but RaTG13 did not fit into this synthetic passage model.
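The per-chromosome score and the marker matching can be sketched as below. The marker-to-chromosome table is copied from the list above; everything else is an assumption for illustration: the function names are hypothetical, and "conditional on the cutoff pS > 30" is read here as summing only those profile values that exceed the cutoff.

```python
# Chromosomal origin of the seven auxotrophic markers, as listed in the text.
MARKER_CHROMOSOME = {
    "ADE2": "XV",
    "HIS3": "XV",
    "LEU2": "III",
    "LYS2": "II",
    "MET15": "XII",
    "URA3": "V",
    "TRP1": "IV",
}


def cumulative_above_cutoff(profile, cutoff=30):
    """Per-chromosome cS (assumed: sum of profile values strictly above the cutoff)."""
    return sum(value for value in profile if value > cutoff)


def markers_on(chromosomes):
    """Markers whose chromosomal origin is among the given peak chromosomes."""
    wanted = set(chromosomes)
    return sorted(m for m, chrom in MARKER_CHROMOSOME.items() if chrom in wanted)
```

With the peak chromosomes reported above, `markers_on(["II", "XV"])` returns ADE2, HIS3 and LYS2, `markers_on(["IV", "V"])` returns TRP1 and URA3, and `markers_on(["XVI"])` returns an empty list, since none of the seven markers lies on chromosome XVI.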
These results further allowed us to infer a specific scheme for the synthetic biogenesis of SARS-CoV-2 and SARS-CoV-1 in transformed yeast cells (Figure 3C). The idea is to stitch together both outer DNA complements of a chosen viral genome with the inner segment nrvRNA1. For co-transformation and integration, two plasmids are designed that carry the YSM2 selectable marker with either the 5′-end (nrvRNA2) or the 3′-end (nrvRNA3) of the target virus genome along with some overlap into nrvRNA1 (regions 1′ and 1′′, respectively; see Figure 3C). Further essential plasmid components are a transcriptional promoter for nrvRNA2 and a self-cleaving ribozyme (Rz) sequence for the correct 3′-end in nrvRNA3 (ref. 15). Once these three non-replicable RNA-encoding segments are integrated on the chromosome in the correct order, expression of fully replicable virus (+)RNA begins, and replication commences upon co-expression of the viral replicase complex (RdRp and Aux, controlled through the auxotrophic marker YSM1). The final step, assembly into a fully infectious viral particle, is conveniently achieved with a yeast virus-like-particle (VLP) expression system for the structural proteins S, E (envelope), M (membrane), and N (nucleocapsid), which can be used in parallel by an extended set of auxiliary proteins, Aux* (ref. 29). This hypothetical cellular factory may therefore produce the targeted, fully infectious viral particles without itself being infected by the virus produced.
Taken together, our results reveal a highly differential homology signal on SARS-CoV-2 and SARS-CoV-1 genomes, which, according to our model, points to their history of targeted integration, recombination, and directed viral replication through passage in an artificial S. cerevisiae host. This genomic pattern suggests similar synthetic origins of SARS-CoV-1 and SARS-CoV-2, but at the same time robustly excludes all other clade members from this type of synthetic origin. A special case is RaTG13, which in our analysis produced both a simpler pattern and a weaker signal of common genetic history with yeast than the two mutually more similar homology signals found in SARS-CoV-1 and SARS-CoV-2. Yet RaTG13 is claimed to be much closer to SARS-CoV-2 evolutionarily7, with 96% genomic sequence identity to SARS-CoV-2 against 80% between SARS-CoV-1 and SARS-CoV-2. This divergence suggests that if RaTG13 is assumed to be a product of natural evolution, then the sequences of both SARS-CoV-1 and SARS-CoV-2 cannot be. Alternatively, the origin of RaTG13 could be artificial11, along with that of SARS-CoV-2 and SARS-CoV-1 (ref. 30), as our results also suggest. In this context, a point of utmost importance would be the identification of the putative input progenitor SARS-CoV nucleotide sequence that went into passage. For example, it could be a highly pathogenic virus designed for, or naturally adapted to, human cells and then selected for a transient artificial passage together with some genetic modifications31 of the virus to attenuate its virulence. Its release back into the human host would then likely initiate a rapid succession of complex reversal mutations toward its more pathogenic original structure30,31. Intriguingly, during the first months of the SARS-CoV-2 outbreak, the genomic regions of nsp3 and spike protein had the highest mutational rate within the SARS-CoV-2 genome32, which may interfere with the yeast homology signal detected in the present study.
During an epidemic, such reversal mutations toward an unidentified artificial genotype would be highly detrimental to most public health countermeasures, including pharmacological interventions and vaccinations. In contrast, through specific guidance of countermeasures such as vaccine development, detailed knowledge about the input progenitor’s nucleotide sequence would effectively confer population immunity against the pathogen. Without such specific knowledge, our and further results about the conditions of the yeast artificial passage could offer important directions for SARS-CoV-2 antiviral interventions.
Associated or additional data. All data underlying the results are available as part of the article and no additional source data are required.
Repository-hosted data. The following sequence data was retrieved from the NCBI GenBank repository:
1. Middle East respiratory syndrome-related coronavirus isolate HCoV-EMC/2012, complete genome (NCBI Reference Sequence: NC_019843.3)
2. Severe acute respiratory syndrome-related coronavirus Rc-o319 RNA, complete genome (GenBank: LC556375.1)
3. Bat SARS-like coronavirus isolate As6526, complete genome (GenBank: KY417142.1)
4. Bat SARS-like coronavirus isolate Rs4874, complete genome (GenBank: KY417150.1)
5. SARS coronavirus Urbani, complete genome (GenBank: AY278741.1)
6. SARS coronavirus PC4-13, complete genome (GenBank: AY613948.1)
7. SARS coronavirus civet020, complete genome (GenBank: AY572038.1)
8. SARS coronavirus HC/SZ/61/03, complete genome (GenBank: AY515512.1)
9. Bat SARS-like coronavirus isolate bat-SL-CoVZC45, complete genome (GenBank: MG772933.1)
10. Bat SARS-like coronavirus isolate bat-SL-CoVZXC21, complete genome (GenBank: MG772934.1)
11. Bat coronavirus RacCS203, complete genome (GenBank: MW251308.1)
12. Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome (NCBI Reference Sequence: NC_045512.2)
13. Bat coronavirus RaTG13, complete genome (GenBank: MN996532.2)
Harvard Dataverse: Differential enrichment of yeast DNA in SARS-CoV-2 and related genomes supports synthetic origin hypothesis. https://doi.org/10.7910/DVN/BK8AL6 (ref. 33).
This project contains the following extended data files:
Figure_S1.pdf : Profiled alignment scores (pS) without smoothing filter from the BLAT alignment output to the query input of six SARS-coronavirus related full genome nucleotide sequences.
Figure_S2.pdf : Profiled alignment scores (pS) from the alignment output to the query input of SARS-coronavirus like genome sequences SL-ZC45 and SL-ZXC21.
Figure_S3.pdf : Smoothed profile yeast BLAT alignment scores of five betacoronavirus isolates from five wild animals, closely related to SARS-CoV-1 and SARS-CoV-2, after the phylogenetic analysis of Li et al. (2020): Paradoxurus hermaphroditus (palm civet) SARS coronavirus PC4-13 (GenBank AY613948), Civet SARS coronavirus civet020 (AY572038), Paguma larvata SARS coronavirus HC/SZ/61/03 (AY515512), Rhinolophus sinicus bat SARS-like coronavirus Rs4874 (KY417150), Aselliscus stoliczkanus bat SARS-like coronavirus As6526 (KY417142).
Figure_S4.pdf : Alignment E-values (inverted, 1/E) as profiles across genomes of SARS-CoV-2, RaTG13, and SARS-CoV-1 calculated with the LALIGN local alignment method by using a sliding window approach with window sizes as given in Table S16.
Table_S1.tab: Output from the BLAT web server.
Table_S2.tab: SARS-CoV-2/S. cerevisiae (sacCer3) BLAT results.
Table_S3.tab: RaTG13/S. cerevisiae (sacCer3) BLAT results.
Table_S4.tab: RacCS203/S. cerevisiae (sacCer3) BLAT results.
Table_S5.tab: SL-CoV ZC45/S. cerevisiae (sacCer3) BLAT results.
Table_S6.tab: SL-CoV ZXC21/S. cerevisiae (sacCer3) BLAT results.
Table_S7.tab: Rc-o319/S. cerevisiae (sacCer3) BLAT results.
Table_S8.tab: SARS-CoV-1 Urbani/S. cerevisiae (sacCer3) BLAT results.
Table_S9.tab: MERS-CoV/S. cerevisiae (sacCer3) BLAT results.
Table_S10.tab: SARS coronavirus PC4-13/S. cerevisiae (sacCer3) BLAT results.
Table_S11.tab: SARS coronavirus civet020/S. cerevisiae (sacCer3) BLAT results.
Table_S12.tab: SARS coronavirus HC/SZ/61/03/S. cerevisiae (sacCer3) BLAT results.
Table_S13.tab: SARS-like coronavirus isolate Rs4874/S. cerevisiae (sacCer3) BLAT results.
Table_S14.tab: SARS-like coronavirus isolate As6526/S. cerevisiae (sacCer3) BLAT results.
Table_S15.tab: Percent identity matrix (Clustal 2.1).
Table_S16.tab: Peak P1-P5 yeast homology signals detected by BLAT, and cross-validated by the LALIGN method.
Version history:
Version 5 (revision): 04 Jul 22
Version 4 (revision): 08 Mar 22
Version 3 (revision): 19 Jan 22
Version 2 (update): 14 Oct 21
Version 1: 10 Sep 21