Predict coding regions within transcripts in OmicsBox with TransDecoder

Most transcripts assembled from eukaryotic and prokaryotic RNA-Seq data are expected to code for proteins. The most practical way to identify likely coding transcripts is a sequence homology search, such as BLASTX, against sequences from a well-annotated, closely related species. Unfortunately, such well-annotated related species are often not available for newly sequenced organisms.

De novo assembly strategies are required to reconstruct the transcriptomes of organisms that lack a reference genome or transcriptome. These newly targeted transcriptomes generally encode proteins that are insufficiently represented by detectable homologies to known proteins. Capturing those coding regions requires methods that predict them from metrics tied to sequence composition.

Identifying coding regions is crucial to determine the molecular and biological roles that transcripts play in the cell.

Here, we present the Predict Coding Regions functionality available in OmicsBox, which is based on TransDecoder. TransDecoder is a utility developed and included with Trinity, and it is intended to assist in the identification of potential coding regions within reconstructed transcripts. TransDecoder recognizes likely coding sequences based on the following criteria:

  • A minimum length open reading frame (ORF) in a transcript sequence.
  • A log-likelihood coding score, similar to the one computed by the GeneID software, that is greater than 0.
  • The above coding score is greatest when the ORF is scored in the first reading frame as compared to scores in the other 2 forward reading frames.
  • If a candidate ORF is fully contained within the coordinates of another candidate ORF, only the longer one is reported. However, a single transcript can still report multiple ORFs (allowing for operons, chimeras, etc.).
  • A Position-Specific Scoring Matrix (PSSM) is computed, trained and used to refine the start codon prediction.
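
The first and fourth criteria above (minimum ORF length and dropping ORFs contained in a longer one) can be illustrated with a minimal Python sketch. This is not TransDecoder's implementation: the function names are hypothetical, only complete ATG-to-stop ORFs on the forward strand are considered, the standard genetic code is assumed, and the default minimum length of 100 amino acids is an illustrative value. The coding score and start codon refinement are not modeled.

```python
# Illustrative sketch only; names and defaults are assumptions, not TransDecoder code.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def find_orfs(seq, min_aa=100):
    """Return (start, end, frame) for complete ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; `end` includes the stop codon.
    Only ORFs encoding at least `min_aa` amino acids are kept.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3  # amino acids, stop codon excluded
                if n_aa >= min_aa:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


def drop_contained(orfs):
    """Keep only ORFs that are not fully inside another, longer ORF."""
    kept = []
    # Visit the longest ORFs first so that shorter, nested ones are rejected.
    for s, e, f in sorted(orfs, key=lambda o: o[0] - o[1]):
        if not any(ks <= s and e <= ke for ks, ke, _ in kept):
            kept.append((s, e, f))
    return sorted(kept)
```

Calling `drop_contained(find_orfs(transcript))` on a transcript string returns the surviving candidates. A real run would additionally scan the reverse strand where appropriate and keep partial ORFs that run off the transcript ends.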

Optionally, to further maximize sensitivity for capturing ORFs that may have functional significance regardless of their coding likelihood score, ORFs can be scanned for homology to known proteins, and all ORFs with such homology are retained. This can be done by searching Pfam to identify common protein domains.

The TransDecoder methodology can be easily applied in OmicsBox via the Predict Coding Regions utility. Transcript sequences must have been previously loaded on the platform. If the RNA-Seq reads have not been assembled yet, we recommend using the RNA-Seq de novo assembly application offered by OmicsBox (based on Trinity). If the transcripts have been previously assembled, they can be loaded into OmicsBox in FASTA format: File -> Load -> Load Sequences -> Load Fasta File.

The Predict Coding Regions procedure can be adjusted according to the data and the species under study. Users can select the appropriate genetic code to find ORFs, establish a minimum protein length cutoff, configure how ORFs will be retained, and so forth. The Pfam Search is an optional but highly recommended option. If it is checked, ORFs will be scanned against Pfam to identify protein domains, which will then be used as ORF retention criteria.

The identified ORFs are returned in three different formats:

  • CDSs: Nucleotide sequences for coding regions of the final candidate ORFs, in FASTA format.
  • Proteins: Peptide sequences for the final candidate ORFs, in FASTA format.
  • Coordinates: Positions within the target transcripts of the final selected ORFs, in GFF format.
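
To show how the coordinates output relates to the CDS sequences, here is a minimal Python sketch that reads CDS features from a GFF file and slices the matching region out of a transcript. It assumes the nine-column, tab-delimited GFF3 layout with 1-based inclusive coordinates; the function names and the example record are hypothetical, and the exact feature types and attributes written by OmicsBox may differ.

```python
# Illustrative sketch; assumes standard 9-column GFF3 with 1-based inclusive coordinates.

def parse_gff_cds(lines):
    """Yield (transcript_id, start, end, strand) for each CDS feature."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # skip malformed lines and non-CDS features
        yield cols[0], int(cols[3]), int(cols[4]), cols[6]


def extract_cds(transcript, start, end):
    """Slice a forward-strand CDS out of a transcript sequence.

    GFF coordinates are 1-based and inclusive, hence the `start - 1`.
    """
    return transcript[start - 1:end]


# Hypothetical example record, for illustration only.
example = [
    "##gff-version 3",
    "tx1\tsketch\tCDS\t4\t12\t.\t+\t0\tID=cds.tx1.p1",
]
for tx_id, start, end, strand in parse_gff_cds(example):
    print(tx_id, start, end, strand)
```

For ORFs on the minus strand, the sliced sequence would additionally need to be reverse-complemented, which this sketch leaves out.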

The OmicsBox genome browser is recommended for viewing the candidate ORFs in the context of the transcriptome.

A few details in these files are worth noting. TransDecoder provides information about each predicted ORF, such as its length and the strand on which the coding region was found. Furthermore, it classifies each predicted ORF according to its start and stop signals:

  • Complete ORF: Contains a start and a stop codon.
  • 5′ partial ORF: It lacks the start codon and presumably part of the N-terminus.
  • 3′ partial ORF: It lacks the stop codon and presumably part of the C-terminus.
  • Internal ORF: It is both 5′ and 3′ partial.
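
The four categories above follow directly from whether a start and a stop codon are present. The Python sketch below makes that decision explicit for a CDS nucleotide sequence. It is an illustration rather than TransDecoder's own logic: the function name and label strings are chosen here for the example, and it assumes ATG as the only start codon and the standard stop codons.

```python
# Illustrative sketch; function name, labels and codon sets are assumptions.

def classify_orf(cds, stop_codons=("TAA", "TAG", "TGA")):
    """Classify an ORF by the presence of a start and a stop codon."""
    cds = cds.upper()
    has_start = cds.startswith("ATG")
    has_stop = len(cds) >= 3 and cds[-3:] in stop_codons
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"  # start codon missing
    if has_start:
        return "3prime_partial"  # stop codon missing
    return "internal"            # both ends missing


for cds in ("ATGGCCTAA", "GCCGCCTAA", "ATGGCCGCC", "GCCGCC"):
    print(cds, classify_orf(cds))
```

Partial ORFs typically arise when the assembled transcript is truncated, so the ORF runs off one or both ends of the sequence.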

In practice, after applying this strategy, the most confident ORFs are extracted and used to predict protein functions. In OmicsBox, this step can be directly linked to the homology-based functional annotation pipeline, which uses the widely known Blast2GO methodology.

Example Use Case

Reanalyzing the A. galli transcriptomic response to an anthelmintic drug with OmicsBox