Predict coding regions within transcripts in OmicsBox with TransDecoder

Most transcripts assembled from eukaryotic and prokaryotic RNA-Seq data are expected to code for proteins. Predicting their coding regions is crucial to determine the molecular role that transcripts play in the cell. The most practical procedure to identify likely coding transcripts is a sequence homology search, such as BLASTX, against sequences from well-annotated, related species. Unfortunately, such well-annotated related species are often not available for newly sequenced transcriptomes.

When we work with non-model organisms, a reference genome and transcriptome might not be available, so transcriptome assembly requires de novo strategies. The proteins encoded by these newly targeted transcriptomes are often poorly represented among known proteins, so homology searches alone miss many of them. To capture those coding regions, we need methods, such as TransDecoder, that predict coding regions based on metrics tied to sequence composition. This tool to Predict Coding Regions is available in the Transcriptomics Module in OmicsBox.

 

Predict Coding Regions with TransDecoder

TransDecoder is a utility developed and included with Trinity (which can also be used inside OmicsBox). It is intended to assist in the identification of potential coding regions within reconstructed transcripts. TransDecoder recognizes likely coding sequences based on the following criteria:

  • A minimum length open reading frame (ORF) in a transcript sequence.

  • A log-likelihood coding score greater than 0. This score is similar to the one computed by GeneID.

  • The above coding score is greatest when the ORF is scored in the first reading frame, as compared to its scores in the other 2 forward reading frames.

  • If a candidate ORF is fully encapsulated by the coordinates of another candidate ORF, only the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc.).

  • A Position-Specific Scoring Matrix (PSSM) is computed, trained, and used to refine the start codon prediction.
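The length and containment criteria above can be illustrated with a minimal Python sketch. This is not TransDecoder's implementation: it only scans the three forward frames for ATG-to-stop ORFs, and it leaves out the log-likelihood score and the PSSM-based start refinement. The names `Orf`, `find_orfs` and `drop_nested`, as well as the 300-nt default threshold, are assumptions made for this example.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


class Orf(NamedTuple):
    frame: int  # forward reading frame: 0, 1 or 2
    start: int  # 0-based index of the first base of the start codon
    end: int    # 0-based index one past the last base of the stop codon


def find_orfs(seq: str, min_len_nt: int = 300) -> List[Orf]:
    """Return ATG...stop ORFs of at least min_len_nt bases (forward frames only)."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len_nt:
                    orfs.append(Orf(frame, start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


def drop_nested(orfs: List[Orf]) -> List[Orf]:
    """Keep an ORF only if no other ORF fully contains its coordinates."""
    return [
        orf for orf in orfs
        if not any(
            other != orf and other.start <= orf.start and orf.end <= other.end
            for other in orfs
        )
    ]


# Toy transcript: one 18-nt ORF in frame 0 (threshold lowered for the demo).
print(drop_nested(find_orfs("ATGGCCGCCGCCGCCTAA", min_len_nt=12)))
# [Orf(frame=0, start=0, end=18)]
```

A real run would also scan the reverse strand and keep partial ORFs that reach the transcript ends; both are omitted here to keep the sketch short.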

Optionally, to further maximize sensitivity for capturing ORFs that may have functional significance, ORFs can be scanned for homology to known proteins, for example by using PFAM to identify common protein domains. All ORFs with such hits are retained, regardless of their coding likelihood score.

TransDecoder in OmicsBox

The TransDecoder methodology is available in OmicsBox via the

  • CDSs: Nucleotide sequences for coding regions of the final candidate ORFs, in FASTA format.

  • Proteins: Peptide sequences for the final candidate ORFs, in FASTA format.

  • Coordinates: Positions within the target transcripts of the final selected ORF, in GFF format.

We recommend the OmicsBox Genome Browser for viewing the candidate ORFs in the context of the transcriptome (Fig. 1).

    Figure 1. Study interesting sequences in the OmicsBox Genome Browser
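For scripted checks outside the Genome Browser, the coordinates file can also be read directly. The sketch below assumes a standard nine-column, tab-separated GFF file in which coding regions are reported as `CDS` features; the exact feature types and attributes written by OmicsBox may differ. The function name `read_cds_coordinates` and the transcript ID `tx1` are hypothetical.

```python
from typing import Dict, List, Tuple


def read_cds_coordinates(gff_text: str) -> Dict[str, List[Tuple[int, int, str]]]:
    """Map each sequence ID to the (start, end, strand) of its CDS features.

    GFF coordinates are 1-based and inclusive; they are returned unchanged.
    """
    cds: Dict[str, List[Tuple[int, int, str]]] = {}
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # ignore malformed lines and non-CDS features
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        cds.setdefault(seqid, []).append((start, end, strand))
    return cds


# Made-up example record, built with explicit tabs for clarity.
example = "\n".join([
    "##gff-version 3",
    "\t".join(["tx1", "example", "mRNA", "1", "900", ".", "+", ".", "ID=tx1.p1"]),
    "\t".join(["tx1", "example", "CDS", "10", "309", ".", "+", "0", "ID=cds.tx1.p1"]),
])
print(read_cds_coordinates(example))
# {'tx1': [(10, 309, '+')]}
```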

     

    ORF Types

    There are a few items to pay attention to in the above files. TransDecoder provides details about the predicted ORFs, such as the length and the strand in which the coding region was found. Furthermore, it classifies each predicted ORF according to the presence of start and stop signals (Fig. 2):

    • Complete ORF: Contains a start and a stop codon.

    • 5′ partial ORF: It lacks the start codon and presumably part of the N-terminus.

    • 3′ partial ORF: It lacks the stop codon and presumably part of the C-terminus.

    • Internal ORF: It is both 5′ and 3′ partial.
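The four classes above can be expressed as a simple rule on the first and last codon of a coding sequence. The sketch below is a simplified illustration, not the logic TransDecoder runs internally: it assumes the CDS is given in frame, that a leading ATG is the true start, and it uses illustrative label strings. The function name `classify_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_orf(cds: str) -> str:
    """Classify an in-frame CDS by the presence of a start and a stop codon."""
    cds = cds.upper()
    has_start = cds[:3] == "ATG"
    has_stop = cds[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"  # start codon missing
    if has_start:
        return "3prime_partial"  # stop codon missing
    return "internal"            # both ends missing


for cds in ("ATGGCCTAA", "GCCGCCTAA", "ATGGCCGCC", "GCCGCCGCC"):
    print(cds, classify_orf(cds))
# ATGGCCTAA complete
# GCCGCCTAA 5prime_partial
# ATGGCCGCC 3prime_partial
# GCCGCCGCC internal
```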

    Figure 2. Pie Chart with ORF Classification

     

    In practice, after applying this strategy, OmicsBox extracts the most confident ORFs, which are then used to predict protein functions. In OmicsBox, this step can be directly linked to the homology-based functional annotation pipeline, which uses the widely known Blast2GO methodology (Fig. 3).

    Figure 3. Analysis Workflow Example

    Example Use Case

    Reanalyzing the A. galli transcriptomic response to an anthelmintic drug with OmicsBox

    About the Author

    Enrique Presa

    With a biological and technological academic background, including a BSc in Biotechnology and an MSc in Bioinformatics, Enrique’s expertise lies in the areas of Long Reads and Genetic Variation.
