fbpx

Quality Assessment of DNA-Seq De Novo Assemblies using QUAST

Quality Assessment of DNA-Seq De Novo Assemblies using QUAST

Genetic Variation studies often involve analyzing samples from a previously studied species. For instance, it is of interest to examine genomes of various cultivars, strains, or populations of the same species. In such cases, it may be necessary to perform de novo DNA-Seq assembly to obtain the genome of the sequenced organisms, even if a reference genome is available. To ensure the accuracy and reliability of the analysis, it is recommended to employ different assembly algorithms with varying configurations, as different datasets may require distinct analytical approaches.

In this regard, the QUAST tool proves useful, as it allows for easy comparison of different de novo assemblies. To illustrate its functionality, we will present several case studies utilizing a Caenorhabditis elegans WGS dataset. This dataset comprises short paired-end and mate-pair reads of the C. elegans N2 strain, enabling a thorough examination of QUAST’s features.

  • SRA Study: DRP001005
  • Reference Genome: WBcel235

    Comparing different assembly algorithms: ABySS and SPAdes

    Different assemblers employ various algorithms with diverse assumptions and procedures. As a consequence, the resulting assemblies will differ depending on the software utilized. At times, there is no single algorithm that performs optimally in all scenarios, and the efficacy could vary depending on the input data. In this section, we will evaluate the performance of ABySS and SPAdes using the C. elegans dataset. To this end, four distinct assemblies were generated:

    • Using ABySS with paired-end reads
    • Using ABySS with paired-end and mate-pair reads
    • Using SPAdes with paired-end reads
    • Using SPAdes with paired-end and mate-pair reads

    QUAST Results

    The NGx plot provides information on the size of the contigs. For example, an NG50 of 10bp indicates that 50% of the total reference genome size is covered by contigs longer than 10bp. The commonly used Nx statistic employs the assembly size rather than the genome size. However, since each assembly has a different length, the Nx is not directly comparable between assemblies. Therefore, the NGx is preferable for comparative purposes.

    In general, larger contigs are desirable as they are more representative of the real chromosome or genome size. By examining the NGx plot, it is immediately evident that the assemblies generated using ABySS contain larger contigs (Figure 1). In particular, the combination of ABySS and mate-pair results produced notably better results. This is further supported by the length of the largest contig and the total assembly length (Table 1).

    Another useful metric for evaluating the quality of an assembly is the Genome Fraction, which indicates the percentage of genomic bases covered by the assembly. This is measured by aligning the assembly to the reference genome. As shown in Figure 2, the assemblies generated using ABySS cover a larger portion of the reference genome. This is consistent with the total assembly size (Table 1), as a larger assembly has the potential to cover more of the genome.

    Lastly, the number of misassemblies detected can provide insight into the accuracy of the assembly. These regions of the assembly do not align with the reference genome. In this dataset, the number of misassemblies is comparatively lower in the assembly generated using ABySS and mate-pair reads (Figure 3).

    In conclusion, after considering the different metrics, it can be inferred that the assembly generated using ABySS and mate-pair reads is more complete and accurate.

    Figure 1. NGx Plot for the assemblies made with ABySS and SPAdes.
    Figure 2. Genome Fraction for each of the assemblies.
    Figure 3. Number of Misassemblies detected for each of the


    Table 1. Summary statistics obtained by QUAST for each of the assemblies performed with different algorithms.

    StatisticAbyssAbyss_mateSpadesSpades_unpaired
    Num. contigs (>= 0 bp)70674686006145963280
    Num. contigs (>= 10000 bp)2229139323282377
    Num. contigs (>= 50000 bp)490601458437
    Num. contigs5674365456686112
    Largest contig384836753101283380239865
    Total length958131559738427294415625239865
    Reference length100286401100286401100286401100286401
    auNG62763.7135616.354011.749442.1

    Adding different mate-pairs or unpaired reads to the assembly

    Both the quality and quantity of input data used can influence de novo assembly. Assembling algorithms aim to reconstruct the genome using only the input reads without the guidance of a reference genome, making the resulting assembly highly dependent on the input data. Generally, the more input data that is used, the better the assembly, but this is not always true if the provided data is erroneous.

    This section investigates how the addition of mate-pair data and/or unpaired reads affects the assembly. Unpaired reads are reads that have lost their corresponding pairs during the FASTQ Preprocessing step, thus making them single-end reads. Some scientists suggest that unpaired reads, being short and potentially of lower quality, could introduce noise into the assembly. However, others argue that not using them results in the loss of valuable information.

    Input Data

    In order to assess this, a dataset consisting of a WGS of C. elegans has been assembled using ABySS. It has been done with four different sets of input data:

    • Short paired-end reads
    • Short paired-end reads + mate-pair reads
    • Short paired-end reads + short unpaired (single) reads
    • Short paired-end reads + short unpaired (single) reads + mate-pair reads

    Then, the quality of the assemblies has been assessed and compared using the QUAST implementation available in OmicsBox.

    QUAST Results

    By examining Figure 4, the first observation is that assemblies generated with mate-pair data produce larger assemblies as the lines on the NGx plot are positioned higher on the Y-axis. Another way to assess this quickly is by examining the auNG (Table 2), which is the area under the NGx curve. Hence, the greater the auNG, the greater the NGx values.

    This is further supported by some statistics presented in the Summary Report (Table 2). For instance, the number of large contigs (with a length >= 50k bp), the largest contig length, and the total assembly length are all higher in assemblies generated with mate-pair data.

    Additionally, the number of misassemblies detected is substantially lower in assemblies generated with mate-pair reads (Figure 5).

    Overall, there is a significant difference between using and not using mate-pair reads. This is logical since these reads are used specifically to enhance assemblies. Mate-pair reads provide long-range information, which can aid in scaffolding and connecting contigs generated from short-read sequencing data, as well as resolving ambiguities and errors in the assembly process.

    Concerning the difference between adding or not adding unpaired reads, it can be observed that it helps to construct larger contigs in this instance (Table 2, Figure 4). However, the difference is not as significant as it is with using or not using mate-pair reads.

    Figure 4. NGx Plot for the assemblies obtained with different input data.
    Figure 5. Number of Misassemblies detected for each of the assemblies.

    Table 2. Summary statistics obtained by QUAST for each of the assemblies obtained with different sets of input data.

    StatisticAbyssAbyss_mateAbyss_mate_unpaired    Abyss_unpaired
    Num. contigs (>= 0 bp)70674686008571787640
    Num. contigs (>= 10000 bp)2229139313962235
    Num. contigs (>= 50000 bp)490601613478
    Num. contigs5674365436445532
    Largest contig384836753101942534516982
    Total length95813155973842729737951695812895
    Reference length100286401100286401100286401100286401
    auNG62763.7135616.3141949.464254.4

    Trying different k-mer sizes

    Even using the same algorithm and input data, the resulting assembly can be really different when specifying different parameters. In particular, one parameter that largely affects the assembly is the k-mer size. In a de novo genome assembly, k-mers are short, contiguous sequences of length k (usually ranging from 20-150 nucleotides) that are extracted from the sequencing reads. The k-mer size is an important parameter in the assembly process, as it determines the level of sensitivity and specificity in detecting overlaps and assembling contigs.

    Input Data

    The length of the reads for the C. elegans dataset is 110bp. In this section, we will assess the performance of using three different k-mer lengths:

    • 55bp kmer size, which is 50% of the read length.
    • 77bp kmer size, which is 70% of the read length.
    • 25bp kmer size, which is ~23% of the read length.

     

    QUAST Results

    As in the other section, it is easy to see in the NGx plot (Figure 6) that the assembly obtained with a kmer size of 77 contains bigger contigs, which is desirable. Surprisingly, it seems that the default configuration (55 kmer size) is the assembly that performed the worst since it is made of smaller contigs.

    However, by looking at the Genome Fraction (Figure 7), we see that the assembly obtained with a kmer size of 27 covers a smaller percentage of the reference genome. Say it in another way, even if contigs are bigger, the assembly is less complete in comparison with the assembly obtained with a kmer size of 55. It is also evidenced by the total assembly length (Table 3); the smaller assembly is obtained with a kmer length of 27. This exemplifies how important is to look into different statistics in order to evaluate an assembly.

    Moreover, the total number of contigs (Num. contigs (>= 0 bp) in Table 3) it’s quite higher in the abyss_kmer27 assembly. Thus, it seems that there is a high number of tiny contigs that the algorithm was not able to join in bigger sequences, in comparison with the other two assemblies.

    Finally, it turns out that the assembly with a higher number of misassemblies is the one obtained with a kmer length of 55 (Figure 8), which is the default configuration. Thus, in conclusion, it seems that a kmer size of 77 gave the best results.

    Figure 6. NGx plot for the assemblies obtained with different kmer sizes.


    Figure 7. Percentage of the genome covered by each of the assemblies.
    Figure 8. The number of misassemblies detected for each of the assemblies.

    Table 3. Summary statistics obtained by QUAST for each of the assemblies obtained with different k-mer sizes.

    Statistic Abyss_kmer27 Abyss_kmer255 Abyss_kmer77
    Num. contigs (>= 0 bp) 479663 70674 53279
    Num. contigs (>= 10000 bp) 1375 2229 1009
    Num. contigs (>= 50000 bp) 452 490 553
    Num. contigs 20530 5674 2641
    Largest contig 618279 284836 914631
    Total length 90902996 95813155 98116581
    Reference length 100286401 100286401 100286401
    auNG 84684.1 62763.7 197579.9

     

     

    References

    Bolger, A.M., Lohse, M. and Usadel, B. (2014) “Trimmomatic: A flexible trimmer for Illumina sequence data,” Bioinformatics, 30(15), pp. 2114–2120. Available at: Trimmomatic: a flexible trimmer for Illumina sequence data .

    Gurevich, A. et al. (2013) “Quast: Quality Assessment Tool for Genome Assemblies,” Bioinformatics, 29(8), pp. 1072–1075. Available at: QUAST: quality assessment tool for genome assemblies .

    Jackman, S.D. et al. (2017) “Abyss 2.0: Resource-Efficient Assembly of large genomes using a bloom filter,” Genome Research, 27(5), pp. 768–777. Available at: ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter .

    Li, H. 2020 Aun: A new metric to measure assembly contiguity, Sitewide ATOM. Available at: auN: a new metric to measure assembly contiguity (Accessed: February 27, 2023).

    Prjibelski, A. et al. (2020) “Using Spades de Novo Assembler,” Current Protocols in Bioinformatics, 70(1). Available at: https://doi.org/10.1002/cpbi.102.

    About the Author Marta Benegas Marta Benegas studied biotechnology at the Valencia Polytechnic University (UPV) and continued her studies with a Master's in Bioinformatics at the Autonomous University of Barcelona (UAB), Spain. After her master's degree, she started her professional career at Biobam where she is now working as a bioinformatics specialist and support manager. At the moment she is mainly focused on Single-Cell technologies developing various pipelines which allow getting from reads to functional insights at a single-cell resolution. These developments are available in OmicsBox, BioBam’s software solution.
Mesa de trabajo 2 copia 2

Blog Categories:

News

Releases, Media, Announcements, etc.

Use Cases, Reviews, Tutorials

Product Tutorial, Quickstarts, New Features, etc.

Video Tutorials

Helpful Features, Tips and Tricks

Tips And Tricks

Mini-tutorials for common use-cases and to address frequently asked questions FAQs

Most Popular:

Facebook
Twitter
LinkedIn
Email
Print