Your privacy, your choice

We use essential cookies to make sure the site can function. We also use optional cookies for advertising, personalisation of content, usage analysis, and social media.

By accepting optional cookies, you consent to the processing of your personal data - including transfers to third parties. Some third parties are outside of the European Economic Area, with varying standards of data protection.

See our privacy policy for more information on the use of your personal data.

for further information and to change your choices.

Skip to main content

Cov-trans: an efficient algorithm for discontinuous transcript assembly in coronaviruses

Abstract

Background

Discontinuous transcription allows coronaviruses to efficiently replicate and transmit within host cells, enhancing their adaptability and survival. Assembling viral transcripts is crucial for virology research and the development of antiviral strategies. However, traditional transcript assembly methods primarily designed for variable alternative splicing events in eukaryotes are not suitable for the viral transcript assembly problem. The current algorithms designed for assembling viral transcripts often struggle with low accuracy in determining the transcript boundaries. There is an urgent need to develop a highly accurate viral transcript assembly algorithm.

Results

In this work, we propose Cov-trans, a reference-based transcript assembler specifically tailored for the discontinuous transcription of coronaviruses. Cov-trans first identifies canonical transcripts based on discontinuous transcription mechanisms, start and stop codons, as well as reads alignment information. Subsequently, it formulates the assembly of non-canonical transcripts as a path extraction problem, and introduces a mixed integer linear programming to recover these non-canonical transcripts.

Conclusion

Experimental results show that Cov-trans outperforms other assemblers in both accuracy and recall, with a notable strength in accurately identifying the boundaries of transcripts. Cov-trans is freely available at https://github.com/computer-Bioinfo/Cov-trans.git.

Peer Review reports

Introduction

Coronaviruses are enveloped viruses with genomes consisting of linear, single-stranded positive-sense RNA [1, 2]. Their genomes have a capped structure at the 5’ end with methylation and a poly(A) tail at the 3’ end. When these viruses enter host cells, their viral genomic RNA serves as a template for translation with the help of viral RNA-dependent RNA polymerase (RdRp) [3, 4]. Unlike eukaryotes, coronaviruses have a unique mechanism for mRNA maturation called discontinuous transcription. RNA polymerase is guided by specific transcription regulatory sequences (TRSs) that help it selectively transcribe from negative-sense RNA. This includes a regulatory sequence at the 3’ end of the leading sequence (TRS-L), which allows for discontinuous transcription, along with a sequence found before each viral gene (TRS-B) [5]. This process allows RdRp to create a series of discontinuous transcripts by skipping segments in the viral RNA template.

The transcripts of coronaviruses are typically categorized into canonical and non-canonical transcripts based on their characteristics and functions. Canonical transcripts generally refer to the standard transcripts produced through routine transcription of the viral genome, encoding essential viral proteins. On the other hand, non-canonical transcripts may encompass a broader range of transcripts generated through alternative transcription mechanisms, potentially encoding diverse proteins or functional RNA segments. The detailed sequence information of non-canonical transcripts has not been incorporated into databases like National Center for Biotechnology Information (NCBI). The functions and impacts of non-canonical transcripts are still under investigation.

The discontinuous transcript assembly problem was first introduced in Jumper [6] and has garnered widespread attention. Accurately assembling the discontinuous transcripts of coronavirus is crucial for understanding its lifecycle, transmission capabilities, and the infection severity. To the best of our knowledge, Jumper is the only algorithm specifically designed for discontinuous transcription events in coronaviruses. Jumper first aligns short paired-end reads to the reference genome, then constructs a segment graph based on these alignments. It classifies the edges in the discontinuous graph as either continuous or discontinuous. Continuous edges connect two vertices that correspond to adjacent genomic fragments, while discontinuous edges connect two non-adjacent fragments, which is caused by discontinuous transcription. Jumper selects the path with the same source and sink node, and the maximum posterior probability as the transcript. Jumper achieves more competitive performances than the eukaryotic assembly algorithms Scallop and StringTie in terms of identifying viral transcripts. However, Jumper suffers from the issue of inaccurate transcript boundaries.

Existing work on transcriptome assembly problem can be mainly divided into two types: reference-based methods and de novo assembly methods. Reference-based methods such as StringTie [7, 8], IsoInfer [9], IsoLasso [10], Ryuto [11], Scallop [12], Cufflinks [13], CIDANE [14], Traph [15], iReckon [16], and Scripture [17] typically achieve higher accuracy than de novo assemblers when high-quality reference genomes are available. De novo assembly methods such as rnaSPAdes [18], Trinity [19], SOAPdenovo-Trans [20], Bridger [21] and TransLiG [22] become valuable when a reference genome is unavailable. However, these methods are all developed based on alternative splicing in eukaryotic organisms and are not suitable for assembling discontinuous transcripts of viruses. These methods are designed with the assumption that transcripts of each gene are contiguous, meaning there are no jumps from the 5’ end to 3’ end of each transcript. While this assumption holds true for the transcription of most eukaryotic genes, it does not apply for coronaviruses. Moreover, coronavirus transcription involves jumps between transcription regulatory sequences, resulting in multiple mRNAs with identical 3’ ends but different 5’ ends. Current eukaryotic transcript assembly algorithms struggle to effectively identify and handle these jumping events, leading to inaccurate assembly of these transcripts, especially in complex transcript structures generated by TRS jumps.

The assembly of the coronavirus discontinuous transcriptome presents certain complexities. First, the transcriptional regulation mechanism of coronaviruses is highly complex, producing both canonical and non-canonical transcripts. Second, there are repetitive sequences among transcripts, which makes assembly difficult. Third, the abundance of canonical transcripts significantly differs from that of non-canonical transcripts, with canonical transcripts often being more abundant. This greatly increases the difficulty of assembling low-abundance non-canonical transcripts.

In this paper, we introduce Cov-trans, a reference-based assembly strategy specifically designed for reconstructing discontinuous transcripts of coronaviruses. Cov-trans first extracts canonical transcripts, which are usually expressed at higher levels and are crucial for regulating the gene expression and functionality in the virus. It then employs a mixed integer linear programming model to recover non-canonical transcript from discontinuous graph. We benchmarked Cov-trans against Jumper, Scallop, and StringTie. Experimental results show that Cov-trans outperforms other assemblers in both recall and precision. Remarkably, it also delivers more accurate boundaries.

Fig. 1
figure 1

The workflow of Cov-trans algorithm. a Coronaviruses generate two types of transcripts: canonical and non-canonical. b Cov-trans takes the reference genome and aligned reads as input. c The alignments are corrected based on start and stop codons, followed by identifying discontinuous transcriptional sites using corrected alignments and spanning information. d Cov-trans constructs the discontinuous graph based on spanning information. There are three types of edges in the discontinuous graph: continuous edges (black), canonical discontinuous edges (blue), and non-canonical discontinuous edges (red). e, f Cov-trans employs a dual-phase strategy to extract transcripts. It first assembles canonical transcripts and subsequently uses mixed integer linear programming to extract non-canonical transcripts

Methods

Algorithm overview

Cov-trans takes RNA sequencing reads and reference genome as inputs, with the goal of recovering both canonical and non-canonical transcripts. Cov-trans first aligns reads to the reference genome using STAR [23], and it identifies discontinuous transcriptional sites in the genome by analyzing the spliced alignment positions along with start and stop codons. Cov-trans then divides the genome into multiple segments according to these sites and constructs a weighted directed acyclic graph, called discontinuous graph, where each segment is represented as a vertex. Subsequently, Cov-trans extracts canonical transcripts from the discontinuous graph, and then formulates the non-canonical transcript assembly problem as a path extraction within this discontinuous graph and proposes a mixed integer linear programming model to solve it. The workflow of Cov-trans is presented in Fig. 1.

Identification of discontinuous transcriptional sites

Cov-trans extracts all the genome regions jumped by reads and puts them into a set called R. Each element \([r_{1}, r_{2}]\) in set R indicates that there is a region in the genome from \(r_{1}\) to \(r_2\), where two adjacent bases on the reads are aligned to each end of this region, respectively. Cov-trans divides set R into two subsets: the canonical spanning set \(R_c\) and the non-canonical spanning set \(R_n\), based on whether the starting position of each region falls within the transcription regulating leader sequence (TRS-L), i.e. the region leading to discontinuous transcription. The TRS-L region typically ranges from 50 to 85 nucleotides [6]. Specifically, for any element \([r_{1}, r_{2}]\) in set R, if its starting position \(r_1\) is in the TRS-L region, then \([r_{1}, r_{2}]\) is placed in set \(R_c\); otherwise, it is included in set \(R_n\). Cov-trans considers the spanning regions in set \(R_n\) to be a result of non-canonical transcripts, while the spanning regions in set \(R_c\) may arise from either canonical or non-canonical transcripts.

Fig. 2
figure 2

Examples of set \(R_n\), \(R_c\) and discontinuous transcriptional sites. All the start and end positions of each genomic region in set \(R_c\) and \(R_n\) are regarded as discontinuous transcriptional sites (TRS, \(r_a\), \(r_b\), \(r_c\), \(r_d\), \(r_e\), \(r_f\), \(r_g\), and ploy(A)). These sites partition the genome into multiple segments([start,TRS-1],[TRS,\(r_a\)−1], [\(r_a\),\(r_b\)−1], [\(r_b\),\(r_c\)−1],[\(r_c\),\(r_d\)−1], [\(r_d\),\(r_e\)−1], [\(r_e\),\(r_f\)−1], [\(r_f\),\(r_g\)−1], [\(r_g\),ploy(A)−1], and [ploy(A),end])

Given that typical transcripts begin at TRS-L, span from a start codon to a stop codon, and end with a poly(A) tail, Cov-trans adjusts the boundaries of the spanning regions in the set \(R_c\) according to these features. For each spanning region \([r_1, r_2]\) in set \(R_c\), Cov-trans searches for a start codon (AUG) within 100bp [24] downstream of \(r_2\) and updates \(r_2\) to the first position of this start codon once located. Then, it proceeds to search along the start codon for the stop codon, making the final position of the stop codon as \(r_3\). Cov-trans regards \([TRS, r_2]\) and \([r_3, ploy(A)]\) as a pair of discontinuous transcriptional sites, and substitutes them for \([r_1, r_2]\) in set \(R_c\). Note that this pair of sites has the potential to form a candidate for canonical transcripts. Specially, Cov-trans moves the spanning region \([r_1, r_2]\) from \(R_c\) to \(R_n\) when it cannot find either a start codon or a stop codon. Figure 2 presents examples of set \(R_n\), \(R_c\) and discontinuous transcriptional sites.

Cov-trans defines the start and end positions of each spanning region in sets \(R_c\) and \(R_n\) as discontinuous transcriptional sites. These sites partition the genome into multiple segments. Cov-trans regards each segment as a vertex and connects these vertices based on the spanning information provided by set \(R_c\) and \(R_n\) to create a directed acyclic graph, called the discontinuous graph. It extracts paths from this graph to represent canonical and non-canonical transcripts, respectively. Details about the graph construction and path extraction can be found in the following sections.

Construction of discontinuous graph

Cov-trans views each genome region divided by discontinuous transcriptional sites as a vertex in the discontinuous graph. Specially, the TRS-L region in the genome is designated as source vertex s, while the ploy(A) tail is considered the sink vertex t in the graph. Cov-trans adds directed edges to the discontinuous graph based on following rules.

  • Continuous edge. A directed edge \(u \rightarrow v\) is added for vertices u and v when they represent adjacent genomic regions, with u preceding v.

  • Canonical discontinuous edge. Edge \(u \rightarrow v\) is included if an element \([r_1, r_2]\) from set \(R_c\) has \(r_1\) positioned exactly at the end of vertex u and \(r_2\) situated at the beginning of v.

  • Non-canonical discontinuous edge. Edge \(u \rightarrow v\) is added if a component \([r_1, r_2]\) of set \(R_n\) has \(r_1\) positioned at the end of vertex u and \(r_2\) marks the beginning of vertex v.

Cov-trans defines the average depth of reads in the corresponding segment as the depth of the vertex. Cov-trans also assigns a coverage to each vertex, which denotes the proportion of the region covered by reads compared to the total region of this vertex. In this paper, the depth and coverage of vertex v are represented as dep(v) and cov(v), respectively.

In the discontinuous graph, each canonical transcript can be depicted as an \(s \sim t\) path comprising only two canonical discontinuous edges. On the other hand, an \(s \sim t\) path corresponding to a non-canonical transcript involves some non-canonical discontinuous edges. Paths representing both canonical and non-canonical transcripts may contain varying numbers of continuous edges. In fact, not every \(s \sim t\) path that meets the aforementioned conditions accurately represents a transcript. There are significantly more paths that meet these conditions than there are transcripts. In multiple experiments, we consistently observed that the accuracy of non-canonical discontinuous edges is very low. This underscores the necessity of applying specific measures to handle non-canonical discontinuous edges, including the removal of edges supported by very fewer reads. Moreover, it is essential to prioritize the extraction of canonical transcripts before isolating non-canonical transcripts from the complex group of non-canonical discontinuous edges.

Canonical transcripts assembly

For any path \(p=sv_1v_2...v_Lt\) in the set \(P_c\), where \(P_c\) comprises all paths representing canonical transcripts, it satisfies the following conditions. Firstly, the edge \(v_i \rightarrow v_{i+1}\) (\(1 \le i < L\)) on the path must be a continuous edge. Secondly, let \(len(v_i)\) denote the length of the genome region associated with vertex \(v_i\) (\(1 \le i \le L\)), the sum length \(\sum _{i=1}^{L} len(v_i)\) must be a multiple of 3. According to the graph construction rules, \(v_1\) starts with a start codon, and \(v_L\) ends with a stop codon. The genome region from vertex \(v_1\) to \(v_L\) corresponds to the protein-coding region, and the length of this coding region must be a multiple of 3. Thirdly, the average coverage and depth of vertices (excluding s and t) within this path needs to exceed specified thresholds. In this study, the coverage threshold is defined as 0.9, and the depth threshold is specified as the maximum value between 0.001 times the average depth of all vertices in the discontinuous graph. The parameters are set because canonical transcripts usually have high expression levels, resulting in greater depth and coverage of the corresponding paths. Finally, the number of reads supporting the canonical discontinuous edge \(s \rightarrow v_1\) should be more than 2.

Extraction of non-canonical transcripts

For a given discontinuous graph G(VE), Cov-trans represents each non-canonical transcript as a \(s\sim t\) path, comprising at most N non-canonical discontinuous edges. It employs a backtracking algorithm to identify all paths of this type and places them into the set \(P_n\). It then proposes a mixed integer linear programming to select a subset from \(P_n\) to depict the non-canonical transcripts. Similar to MultiTrans [25], Cov-trans aims to minimize the size of the selected subset, and it strives to ensure that the paths in this subset, combined with paths from set \(P_c\), cover all vertices in the discontinuous graph. Moreover, Cov-trans hopes that the depths of each vertex and edge in the graph are as close as possible to the sum of the flow through paths passing through them.

Cov-trans introduces a binary variable \(x_i\) (\(1\le i \le |P_n|\)) to indicate whether the \(i^{th}\) path in the set \(P_n\) is selected (\(x_i = 1\)) or not (\(x_i = 0\)) as a non-canonical transcript. The flow of this path is denoted as \(f_i^{(n)}\) (\(f_i^{(n)} \ge 0\), \(1\le i \le |P_n|\)). Note that the \(i^{th}\) path is selected as a non-canonical transcript if and only if \(f_i^{(n)} > 0\), and in this case, the value of \(f_i^{(n)}\) symbolizes the abundance of this transcript. This rule can be guaranteed by following constraints.

$$\begin{aligned} \lambda x{_i}\ge f_{i}^{(n)},\qquad for\ all \ 1\le i\le \left| P_n\right| \end{aligned}$$
(1)
$$\begin{aligned} \lambda f_{i}^{(n)}\ge x{_i}, \qquad for\ all \ 1\le i\le \left| P_n\right| \end{aligned}$$
(2)

where \(\lambda\) is a large positive number.

Cov-trans employs binary variables \(v_{ij}^{(c)}\) and \(e_{ij}^{(c)}\) to denote whether the \(i^{th}\) path in \(P_c\) passes through the \(j^{th}\) vertex and the \(j^{th}\) edge, respectively. Similarly, \(v_{ij}^{(n)}\) and \(e_{ij}^{(n)}\) indicate whether the \(i^{th}\) path in \(P_n\) traverses the \(j^{th}\) vertex and the \(j^{th}\) edge, respectively. Cov-trans imposes the following constraints to ensure that all vertices and edges are covered by at least one canonical or non-canonical transcript path.

$$\begin{aligned} \sum \limits ^{\left| P_c \right| }_{i=1}v_{ij}^{(c)} + \sum \limits ^{\left| P_n\right| }_{i=1}v_{ij}^{(n)}x_{i} \geqslant 1, \qquad 1\le j\le \left| V\right| \end{aligned}$$
(3)
$$\begin{aligned} \sum \limits ^{\left| P_c \right| }_{i=1}e_{ij}^{(c)} + \sum \limits ^{\left| P_n\right| }_{i=1}e_{ij}^{(n)}x_i \geqslant 1, \qquad 1\le j\le \left| E\right| \end{aligned}$$
(4)

In an ideal scenario, the observed and predicted depths for vertices would match perfectly. However, due to sequencing biases and other factors, it is unrealistic to expect an exact match between observed depth and predicted depth for all vertices and edges. If the observed depth is close to the predicted depth, particularly when the ratio between predicted and observed depth falls inside the interval \([1-\varepsilon , 1+\varepsilon ]\), where \(\varepsilon\) is an experimental value (defaulting to 0.1), Cov-trans sees them as matched. It introduces a binary variable \(y_j\) (\(1 \le j \le |V|\)) to indicate whether the observed depth matches the predicted depth for the \(j^{th}\) vertex. If they match, \(y_j = 0\); otherwise, \(y_j = 1\) and the \(j^{th}\) vertex is considered as a discordant vertex. It aims to minimize the number of vertices with inconsistencies, thus minimizing \(\sum _{j=1}^{|V|} y_j\). The value of \(y_j\) can be determined by the following constraints.

$$\begin{aligned} \lambda y_{j} \ge (1-\varepsilon )dep(v_j) - \sum \limits _{i=1}^{\left| P_{c}\right| }v_{ij}^{(c)}f_{i}^{(c)} -\sum \limits _{i=1}^{\left| P_{n}\right| }v_{ij}^{(n)}f_{i}^{(n)}, 1\le j\le \left| V\right| \end{aligned}$$
(5)
$$\begin{aligned} \lambda y_{j} \ge \sum \limits _{i=1}^{\left| P_{c}\right| }v_{ij}^{(c)}f_{i}^{(c)}+\sum \limits _{i=1}^{\left| P_{n}\right| }v_{ij}^{(n)}f_{i}^{(n)} -(1+\varepsilon )dep(v_j), 1\le j\le \left| V\right| \end{aligned}$$
(6)

Similarly, Cov-trans introduces a binary variable \(z_j\) (\(1\le j \le |E|\)) for each edge \(e_j\) (\(e_j \in E, 1\le j \le |E|\)) to signify whether the observed depth matches the predicted depth.

$$\begin{aligned} \lambda z_{j} \ge (1-\varepsilon )dep(e_j) -\sum \limits _{i=1}^{\left| P_{c}\right| }e_{ij}^{(c)}f_{i}^{(c)} -\sum \limits _{i=1}^{\left| P_{n}\right| }e_{ij}^{(n)}f_{i}^{(n)}, 1\le j\le \left| E\right| \end{aligned}$$
(7)
$$\begin{aligned} \lambda z_{j} \ge \sum \limits _{i=1}^{\left| P_{c}\right| }e_{ij}^{(c)}f_{i}^{(c)}+ \sum \limits _{i=1}^{\left| P_{n}\right| }e_{ij}^{(n)}f_{i}^{(n)}-(1+\varepsilon )dep(e_j), 1\le j\le \left| E\right| \end{aligned}$$
(8)

Cov-trans selects a subset of paths from the set \(P_n\) to represent non-canonical transcripts using the following objective function, which aims to minimize the number of selected paths, as well as the discordant vertices and edges.

$$\begin{aligned} \min {\quad w_{1}\sum \limits ^{\left| P_{n}\right| }_{i=1}x_{i} + w_{2}\sum \limits ^{\left| V\right| }_{j=1}y_{j}+ w_{3}\sum \limits ^{\left| E\right| }_{j=1}z_{j}} \end{aligned}$$
(9)

In the above objective function, \(\min {\sum ^{|P_n|}_{i=1}x_i}\) aims to minimize the number of selected paths, while \(\min {\sum ^{|V|}_{j=1}y_j}\) and \(\min {\sum ^{|E|}_{j=1}z_j}\) strive to minimize the number of discordant vertices and edges, respectively. The weighting coefficients \(w_1\), \(w_2\), and \(w_3\) control the relative importance of the three objectives. When \(w_1 = \frac{1}{|P_n|}\), \(w_2 = \frac{1}{|V|}\), and \(w_3 = \frac{1}{|E|}\), the three objectives are given equal importance. However, in this study, \(w_2\) and \(w_3\) are adjusted to \(\frac{|P_n|}{|V|}\) and \(\frac{|P_n|}{|E|}\) respectively to place a higher emphasis on the latter two objectives. Specifically, in cases where multiple solutions have the same minimum number of discordant vertices and edges, the programming prefers the solution with the fewest paths.

Experimental setting

We compared the performance of the Cov-trans with three reference-based assemblers, Jumper, StringTie, and Scallop. It’s important to note that all assemblers were executed with their default parameters. STAR was employed to align paired-end reads to the reference genome. The Cov-trans method, implemented in Python 3.7, utilized Gurobi [26] for solving mixed integer linear programming.

Dataset information

We benchmarked Cov-trans and three other leading transcript assemblers on the following datasets:

  • Simulated datasets. There are 18 simulated samples produced by Flux-Simulator [27]. The samples Sars2-sim1, Sars2-sim2, Sars2-sim3, Sars2-sim4, Sars2-sim5, and Sars2-sim6 are simulated from 19 discontinuous transcripts of SARS-CoV-2, with each dataset containing 60,000, 50,000, 30,000, 10,000, 5,000, and 2,500 paired-end reads of 150 bp, respectively. Sars1-sim1, Sars1-sim2, Sars1-sim3, Sars1-sim4, Sars1-sim5, and Sars1-sim6 are generated from 17 discontinuous transcripts of SARS-CoV-1, with each dataset comprising 60,000, 50,000, 30,000, 10,000, 5,000, and 2,500 paired-end reads, all 150 bp in length. The samples Mers-sim1, Mers-sim2, Mers-sim3, Mers-sim4, Mers-sim5, and Mers-sim6 are generated based on 25 discontinuous transcripts of MERS-CoV. Each dataset consists of 60,000, 50,000, 30,000, 10,000, 5,000, and 2,500 paired-end reads of 150 bp, respectively.

  • Real datasets. There are nine datasets retrieved from the NCBI Sequence Read Archive (SRA) database that encompass samples from SARS-CoV-1, SARS-CoV-2, and MERS-CoV. A detailed description of these datasets can be found in Table 1.

Table 1 Details of real datasets selected for comparison of different transcript assembly tools

Assessment metrics

In the simulated datasets, we compared the discontinuous transcription boundaries of the assembled transcripts with those of the ground-truth transcripts. A ground truth transcript is labeled as an x-tolerance reconstructed transcript if its discontinuous transcription boundaries match those of an assembled transcript within a margin of x nucleotides, where x is set to 0, 20, 40, and 60. We analyze the precision, recall, and F1-score of each algorithm in reconstructing the transcripts under different tolerances. The F1-score can be calculated using following formula.

$$\begin{aligned} F1{-score}= \frac{2 \times \text {precision} \times \text {recall}}{\text {precision} + \text {recall}} \end{aligned}$$
(10)

where recall is defined as the proportion of reconstructed transcripts among all ground-true transcripts, and precision refers to the proportion of reconstructed transcripts among all assembled candidate transcripts.

In the real datasets where the ground-truth transcripts are unknown, we regard the annotated transcripts from NCBI databases as ground truth transcripts. The annotated transcriptome of SARS-CoV-1, SARS-CoV-2, and MERS-CoV consists of 15, 12, and 11 canonical transcripts, respectively. The non-canonical transcripts are still unknown and not annotated. Therefore, we employ the number of reconstructed transcripts under different tolerance levels in the transcript boundaries.

Results

Evaluation on simulated data

The transcriptome of coronavirus is very complex owing to numerous non-canonical discontinuous transcription events. However, the details of non-canonical transcripts remain unknown. In order to accurately replicate the real coronavirus transcriptome, we manually created some non-canonical transcripts and incorporated them with canonical transcripts collected from the NCBI database to generate simulated reads. Since the simulated datasets are relatively straightforward, the boundaries of transcripts reconstructed by Cov-trans and the compared algorithms are accurate. Here, we compute the precision, recall, and F1-score for each algorithm considering the 0-tolerance reconstructed transcript. The precision, recall, and F1-scores achieved by the assemblers on the simulated datasets are presented in Fig. 3.

Fig. 3
figure 3

The performance of the four assemblers on the simulated datasets. a, b, and c represent the performance on simulated datasets of SARS-CoV-2. d, e, and f represent the assembly results on simulated datasets of SARS-CoV-1. g, h, and i represent the performance on simulated datasets of MERS-CoV

Cov-trans demonstrates a more competitive performance in terms of precision, recall, as well as F1-score across simulated dataset with varying sequencing depths

As shown in Fig. 3, Cov-trans achieved the highest recall across all simulated datasets, with an average recall of 0.690. In comparison, the average recall of Jumper, Scallop, and StringTie is 0.242, 0.292 and 0.050, respectively. This means Cov-trans achieves a recall rate improvement of over 1.36 times. On the six simulated datasets of SARS-CoV-2, Cov-trans shows better performance on the average F1-score (0.739), average recall (0.623) and average precision (0.911). The number of correct transcripts assembled by Cov-trans is also more than double that of Jumper and Scallop, and over ten times that of StringTie. On six simulated SARS-CoV-1 datasets, Cov-trans outperformed Jumper, Scallop, and StringTie, with an average F1-score of 0.796, recall of 0.775, and precision of 0.820. By comparison, Jumper achieved an F1-score of 0.349 (precision: 0.612, recall: 0.284), Scallop scored 0.322 (precision: 0.583, recall: 0.225), and StringTie scored 0.103 (precision: 0.833, recall: 0.059). On the MERS-CoV datasets, although Cov-trans has a slightly lower average precision than Scallop, it achieves higher F1-score and recall, with an 12.8% increase in F1-score and a 27.7% increase in recall. Moreover, while Scallop performs better than Cov-trans on the Mers-Sim5 dataset, the transcripts assembled by Cov-trans have perfectly accurate boundaries, thanks to our boundary correction strategy.

Cov-trans assembled the most 0-tolerance reconstructed transcripts on simulated datasets

Figure 3c, f, and i present the number of 0-tolerance reconstructed transcripts recovered by each assembler across the three types of simulated datasets. Cov-trans assembled the largest number of 0-tolerance reconstructed transcripts on all simulated datasets. For example, Cov-trans recovered an average of 12, 13, and 17 0-tolerance transcripts on the simulated datasets of SARS-CoV-2, SARS-CoV-1, and MERS-CoV, respectively. In comparison, Jumper reconstructed an average of 4, 6, and 8 0-tolerance transcripts for these datasets, while Scallop achieved averages of 2, 4, and 13, respectively. Specifically, it reconstructed at least 62.5% more canonical transcripts than the other methods on the simulated datasets of SARS-CoV-2, SARS-CoV-1. Cov-trans also recovered more non-canonical transcripts compared to other methods, except for the Mers-Sim4 and Mers-Sim5 datasets. On these datasets, Cov-trans assembled the second highest number of non-canonical transcripts, totaling 8, whereas Scallop assembled the most non-canonical transcripts at 9. However, Cov-trans reconstructed a total of 9 canonical transcripts, which is more than the number of canonical transcripts assembled by Scallop. We also found that Cov-trans correctly predicts a higher number of canonical transcripts compared to Jumper, Scallop, and StringTie for all the simulated instances of Jumper. Cov-trans benefits from its strategy of first identifying canonical transcripts, which enables the assembly of a larger number of accurate canonical transcripts.

From the simulated experiments mentioned above, it can be observed that in most cases, Cov-trans achieves higher recall, precision, and F1-score than Jumper, Scallop, and StringTie. Cov-trans correctly predicts more canonical and non-canonical transcripts. However, the recall rate obtained by Cov-trans is usually below 0.7. This may be due to the sharing of long subsequences between canonical and non-canonical transcripts, as well as the enumeration of candidate non-canonical transcript paths using backtracking, resulting in many candidate non-canonical transcripts that are difficult to assemble accurately. Additionally, erroneous reads may introduce incorrect discontinuous edges, leading to errors in the boundaries of non-canonical transcripts and an increase in edges and nodes in the discontinuous graph. To reduce errors, Cov-trans must remove edges with low coverage, which biases against highly expressed transcripts and overlooks some low-expressed transcripts.

Fig. 4
figure 4

The total number of canonical transcripts recovered by assemblers across nine real datasets. a, b, and c represent the number of canonical transcripts on real datasets of SARS-CoV-2. d, e, and f represent the assembly results about canonical transcripts on real datasets of SARS-CoV-1. g, h, and i represent the performance on real datasets of MERS-CoV

Evaluation on real data

In the experiments with simulated datasets, we discovered that StringTie is unsuitable for handling coronavirus transcriptome. It also showed poor performance on the real datasets. Consequently, we decided not to include it in comparisons with other algorithms in the real datasets section. Instead, we evaluated Cov-trans, Jumper, and Scallop on nine real datasets sequenced from SARS-CoV-1, SARS-CoV-2, and MERS-Cov. Figure 4 shows the number of transcripts reconstructed by each assembler under different tolerance levels in the transcript boundaries.

Cov-trans achieved more accurate boundaries on real datasets

As illustrated in Fig. 4, most algorithms struggle to reconstruct transcripts with a strict 0-tolerance, mainly due to the complex and diverse structure of coronavirus transcriptomes in real-world scenarios, with Cov-trans being the exception. Cov-trans consistently demonstrates superior performance in reconstructing canonical transcripts across nine real datasets. For the SARS-CoV-2 datasets, Cov-trans successfully identified 10 complete canonical transcripts without errors, outperforming Scallop and Jumper, which yielded fewer transcripts with multiple errors. Similarly, for the SARS-CoV-1 datasets, Cov-trans reconstructed 12 canonical transcripts, while Scallop and Jumper produced only 1 and 10 transcripts, respectively, with substantial boundary inaccuracies. In the MERS-CoV datasets, Cov-trans reconstructed 8 accurate transcripts, whereas Scallop and Jumper generated fewer and less precise results. It’s worth noting that even when the boundaries are not strictly enforced, Cov-trans consistently reconstructs the highest number of annotated transcripts on each dataset. Besides, Jumper’s performance is commendable when not strictly enforcing transcript boundaries. It even matches Cov-trans’ performance on the MERS-CoV dataset and slightly lags behind on other datasets. On the other hand, Scallop’s performance is subpar due to its design around eukaryotic alternative splicing events, making it unsuitable for coronavirus transcriptome assembly. Jumper and Cov-trans, tailored for coronaviruses’ discontinuous transcripts, exhibit significantly better performance. This underscores the importance of developing a dedicated transcript assembly algorithm specifically for coronaviruses.

In summary, Cov-trans excels in accurately assembling transcripts with precise boundaries, particularly in recovering canonical transcripts. This capability is attributed to its strategy of identifying discontinuous transcriptional sites and extracting transcripts. Cov-trans corrects the discontinuous transcriptional sites based on start and stop codon information, significantly enhancing boundary accuracy. In its transcript extraction strategy, it prioritizes assembling canonical transcripts first and then handles the more complex non-canonical transcripts. This approach allows Cov-trans to assemble more canonical transcripts without compromising the non-canonical ones.

Conclusion

Discontinuous transcription is caused by the virus RNA-dependent RNA polymerase (RdRp) and differs from the alternative splicing observed in eukaryotes. We introduced an efficient transcript assembly algorithm, Cov-trans, specifically designed for discontinuous transcription in coronaviruses. Cov-trans formulates the transcript assembly problem as aligning paired-end reads, correcting reads boundaries, identifying canonical transcripts, building a discontinuous graph, and using mixed integer linear programming to extract candidate non-canonical transcripts. Cov-trans seeks the global optimum solution by simultaneously considering constraints on the number of paths, vertex and edge coverage, and flow information in a mixed integer linear programming formulation. Cov-trans excels at assembling canonical transcripts of coronaviruses, and the assembled transcript boundaries are highly accurate. Experiments on both simulated and real datasets showed that Cov-trans outperformed other assemblers. An important point to note is its ability to precisely identify the boundaries of viral transcripts, offering a dependable assurance for delving deeper into the functional aspects of these transcripts.

Data availability

The real datasets can be accessed from the NCBI SRA database using the following accession numbers: SRR1942954, SRR1942956, SRR1942957, SRR12789544, SRR12789545, SRR12789546, SRR10357372, SRR10357373, and SRR10357374. All the simulated data generated in this study is available at https://github.com/computer-Bioinfo/Cov-trans/tree/main/Cov-trans simulation data.

References

  1. De Vries AA, Horzinek MC, Rottier PJ, De Groot RJ. The genome organization of the Nidovirales: similarities and differences between arteri-, toro-, and coronaviruses. In: Seminars in VIROLOGY. vol. 8. Elsevier; 1997. pp. 33–47.

  2. Kim D, Lee JY, Yang JS, Kim JW, Kim VN, Chang H. The architecture of SARS-CoV-2 transcriptome. Cell. 2020;181(4):914–21.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Maier HJ, Bickerton E, Britton P. Coronaviruses. Methods Protoc. 2015. pp. 1–282.

  4. Sola I, Almazán F, Zúñiga S, Enjuanes, L. Continuous and discontinuous RNA synthesis in coronaviruses. Annual review of virology. 2015;2(1):265–288.

  5. Davidson AD, Williamson MK, Lewis S, Shoemark D, Carroll MW, Heesom KJ, et al. Characterisation of the transcriptome and proteome of SARS-CoV-2 reveals a cell passage induced in-frame deletion of the furin-like cleavage site from the spike glycoprotein. Genome Med. 2020;12:1–15.

    Article  Google Scholar 

  6. Sashittal P, Zhang C, Peng J, El-Kebir M. Jumper enables discontinuous transcript assembly in coronaviruses. Nat Commun. 2021;12(1):6728.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019;20:1–13.

    Article  Google Scholar 

  9. Feng J, Li W, Jiang T. Inference of Isoforms from Short Sequence Reads (Extended Abstract). J Comput Biol J Comput Mol Cell Biol. 2011;18(3):305–21.

    Article  CAS  Google Scholar 

  10. Li W, Feng J, Jiang T. IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly. J Comput Biol J Comput Mol Cell Biol. 2011;18(11):1693–707.

    Article  Google Scholar 

  11. Gatter T, Stadler PF. Ryūtō: network-flow based transcriptome reconstruction. BMC Bioinformatics. 2019;20:1–14.

    Article  Google Scholar 

  12. Shao M, Kingsford C. Accurate assembly of transcripts through phase-preserving graph decomposition. Nat Biotechnol. 2017;35(12):1167–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, Van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. CIDANE: comprehensive isoform discovery and abundance estimation. Genome Biol. 2016;17(1):16.

  15. Tomescu AI, Kuosmanen A, Rizzi R, Mäkinen V. A novel min-cost flow method for estimating transcript expression with RNA-Seq. In: BMC bioinformatics. vol. 14. Springer; 2013. pp. 1–10.

  16. Mezlini AM, Smith EJ, Fiume M, Buske O, Savich GL, Shah S, et al. iReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res. 2013;23(3):519–29.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Guttman M, Garber M, Levin JZ, Donaghey J, et al. Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs. Nat Biotechnol. 2010;28(5):503.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Bushmanova E, Antipov D, Lapidus A, Prjibelski AD. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. GigaScience. 2019;8(9):giz100.

  19. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol. 2011;29(7):644.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, et al. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics. 2014;30(12):1660–6.

    Article  CAS  PubMed  Google Scholar 

  21. Chang Z, Li G, Liu J, Zhang Y, Ashby C, Liu D, et al. Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol. 2015;16:1–10.

    Article  Google Scholar 

  22. Liu J, Yu T, Mu Z, Li G. TransLiG: a de novo transcriptome assembler that uses line graph iteration. Genome Biol. 2019;20:1–9.

    Article  Google Scholar 

  23. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21.

    Article  CAS  PubMed  Google Scholar 

  24. Nomburg J, Meyerson M, DeCaprio JA. Pervasive generation of non-canonical subgenomic RNAs by SARS-CoV-2. Genome Med. 2020;12:1–14.

    Article  Google Scholar 

  25. Zhao J, Feng H, Zhu D, Lin Y. Multitrans: an algorithm for path extraction through mixed integer linear programming for transcriptome assembly. IEEE/ACM Trans Comput Biol Bioinforma. 2021;19(1):48–56.

    Article  Google Scholar 

  26. Gurobi Optimization L. Gurobi optimizer reference manual. 2020. https://www.gurobi.com. Accessed 14 June 2022.

  27. Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, et al. Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 2012;40(20):10073–83.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

The authors would like to thank the editors and reviewers for their constructive comments and suggestions on this manuscript.

Funding

This work is supported by the National Natural Science Foundation of China under Nos. 62202251 and 62272268, and the Natural Science Foundation of Shandong Province under No. ZR2022QF133 and ZR2022QF147.

Author information

Authors and Affiliations

Authors

Contributions

J.Z. contributed to the design of the study. X.G. implemented Cov-trans and performed experiments. J.Z., X.G., Z.W. and S.Z. wrote and reviewed the manuscript.

Corresponding author

Correspondence to Jin Zhao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Guo, X., Wu, Z., Zhang, S. et al. Cov-trans: an efficient algorithm for discontinuous transcript assembly in coronaviruses. BMC Genomics 25, 1257 (2024). https://doi.org/10.1186/s12864-024-11179-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12864-024-11179-0

Keywords