How to create a taxonomic mapping file.

Create a taxonomic mapping file to Make Blast Database within OmicsBox

OmicsBox allows you to create a custom database and run Blast locally: the Blast algorithm runs on the user’s computer against a locally installed database.
To do so, we either download a pre-formatted NCBI database (see tutorial) or format our own database (see this tutorial until step 3). When formatting your own database, an optional file may be used to associate each sequence with a taxonomic node.
The following use cases explain how to create this taxonomy ID file.

The taxonomic mapping file is a tab-delimited text file with one sequence per line, and should be provided in the following format:

<SequenceId> \t <TaxonomyId><newline>
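
Before loading a mapping file it can be useful to check that every line really follows this format. The following is a minimal sketch, not part of OmicsBox: the function name check_mapping and the file name taxonomy_map.txt are hypothetical, and the small sample file is created here only so that the example runs on its own.

```shell
# Create a small sample mapping file (hypothetical name) so the example is self-contained.
printf 'WP_016515408.1\t1126\nWP_015357425.1\t261524\n' > taxonomy_map.txt

# Print every line that does not consist of exactly two tab-separated fields
# with a non-empty first field and a purely numeric second field.
# The exit status is non-zero if at least one such line was found.
check_mapping() {
    awk -F'\t' '
        NF != 2 || $1 == "" || $2 !~ /^[0-9]+$/ { printf "line %d: %s\n", NR, $0; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

if check_mapping taxonomy_map.txt; then
    echo "taxonomy_map.txt: format OK"
fi
```

A line that uses spaces instead of a tab, or that has a non-numeric taxonomy ID, is printed together with its line number.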

Use case 1

In this use case we show how to create a taxonomy mapping file from a fasta file that has been downloaded from NCBI.
We assume there are only a few sequences (e.g. 20), so the taxonomic mapping file can be created by hand.
Here are the headers of the first 4 sequences from the fasta file that we will use as an example.

>gi|513846619|ref|WP_016515408.1| alpha/beta fold hydrolase [Microcystis aeruginosa]
>gi|505170323|ref|WP_015357425.1| alpha/beta fold hydrolase [Mycobacterium liflandii]
>gi|505107425|ref|WP_015294527.1| alpha/beta fold hydrolase [Mycobacterium canettii]
>gi|504907150|ref|WP_015094252.1| alpha/beta fold hydrolase [Pseudomonas sp. UW4]
  1. The sequence identifier (gi or Acc) (gi|513846619|ref|WP_016515408.1|) can be extracted from the fasta file with the following Linux command. In this example we extract the Acc identifier.

    grep ">" my_fasta_file.fasta | cut -d \| -f4 > sequence_names.txt

    The sequence_names.txt file looks like:

    WP_016515408.1
    WP_015357425.1
    WP_015294527.1
    WP_015094252.1
  2. The taxonomic name ([Microcystis aeruginosa]) of each sequence can also be retrieved from the fasta file:

    grep ">" my_fasta_file.fasta | cut -d '>' -f2 | cut -d '[' -f2 | cut -d ']' -f1 > taxonomy_names.txt

    The taxonomy_names.txt file looks like:

    Microcystis aeruginosa
    Mycobacterium liflandii
    Mycobacterium canettii
    Pseudomonas sp. UW4
  3. Now that you have the taxonomic names, you can look up their IDs in NCBI and create the taxonomy mapping file.
    This tool helps you retrieve the taxonomic identifiers that correspond to a list of taxonomy names: https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

  4. The final tab-delimited taxonomy ID file, which can easily be created with Excel, should look like:

    WP_016515408.1 1126
    WP_015357425.1 261524
    WP_015294527.1 78331
    WP_015094252.1 1207075
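
Steps 1 to 4 can also be combined on the command line, which avoids copying between files. The following is only a sketch. It assumes that the name-to-ID results from step 3 have been saved as a tab-delimited lookup table. The file names example.fasta, name_to_taxid.txt and taxonomy_map.txt are hypothetical, and the sample inputs are created here only so that the example runs on its own.

```shell
# Sample input: two of the fasta headers from this use case (sequence lines are dummies).
cat > example.fasta <<'EOF'
>gi|513846619|ref|WP_016515408.1| alpha/beta fold hydrolase [Microcystis aeruginosa]
MSEQ
>gi|504907150|ref|WP_015094252.1| alpha/beta fold hydrolase [Pseudomonas sp. UW4]
MSEQ
EOF

# Hypothetical lookup table (taxonomy name <TAB> taxonomy ID), e.g. assembled
# from the results of the NCBI taxonomy tool mentioned in step 3.
printf 'Microcystis aeruginosa\t1126\nPseudomonas sp. UW4\t1207075\n' > name_to_taxid.txt

# First file (NR == FNR): load the lookup table into an array.
# Second file: for each header line, take the accession (4th |-separated field)
# and the organism name (text after the last "[" up to the following "]"),
# then print accession <TAB> taxonomy ID. Unknown names are reported on stderr.
awk -F'\t' '
    NR == FNR { taxid[$1] = $2; next }
    /^>/ {
        split($0, part, "|")
        name = $0
        sub(/^.*\[/, "", name)
        sub(/\].*$/, "", name)
        if (name in taxid) print part[4] "\t" taxid[name]
        else print "no taxonomy ID for: " name > "/dev/stderr"
    }
' name_to_taxid.txt example.fasta > taxonomy_map.txt
```

The sketch relies on the header layout shown above (gi|...|ref|Acc| description [organism]); headers in another layout would need a different field number.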

Copying and pasting data between different files is error-prone. The next use case explains how to create a taxonomy ID file directly from a pre-formatted database from NCBI.

Use case 2

Here the main goal is to obtain a class-specific database from a pre-formatted NCBI database.
In this example we want to create an EST database for the class Hydrozoa using an Acc list retrieved from NCBI.
Once we have these files, we create the taxonomic mapping file, which can then be used with the Make Blast Database feature within OmicsBox.

  1. Download the Blast+ executables (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) from NCBI and extract them.
  2. Download the est_others database (with the Perl script or directly from the NCBI webpage).
    • At the moment (July 2019) there are 11 files, est_others.00.tar.gz to est_others.10.tar.gz.
  3. Search the Entrez Nucleotide database for a taxonomic identifier, e.g. “txid6074[Organism]”.
  4. Download the Acc list for the desired taxon from NCBI:
    Send to > Complete Record > File > Accession List
  5. The above-mentioned Perl script is located in the bin folder of the Blast+ executables and can be used to download the desired pre-formatted database, e.g. est_others.
    To run the script, Perl has to be installed on the computer.
    perl update_blastdb.pl est_others
  6. Once all .tar.gz files have been downloaded, they need to be extracted. This can be done in one go with a Linux command.
    for file in *.tar.gz; do tar -xvzf "$file"; done
  7. Now that we have the .nal database and the Acc list, we can create the taxonomic mapping file to be used in the Make Blast Database feature in OmicsBox.

    blastdbcmd -db est_others -entry_batch sequence_acclist.txt -outfmt '%i %T' > sequenceID_taxa.txt
  8. You will end up with a text file that contains each sequence ID and its corresponding taxonomy ID. This file can be opened with Excel, using | as the column separator.
    emb|HE983324.1| 6087
    emb|HE983323.1| 6087
    emb|HE983322.1| 6087
    emb|AM755036.1| 252671
    emb|AM755035.1| 252671
    emb|AM755034.1| 252671

    If the file has been opened correctly in Excel, we end up with 3 columns, of which the 2nd and 3rd are the ones we need to generate the EST database for the desired species:

    HE983324.1 6087
    HE983323.1 6087
    HE983322.1 6087
    AM755036.1 252671
    AM755035.1 252671
    AM755034.1 252671
  9. To create the desired database within OmicsBox (Functional Analysis > Blast > Make Blast Database), we also need the EST sequences themselves in fasta format.
    ./blastdbcmd -db est_others -entry_batch sequence_acclist.txt -out est_species_6074.fasta
  10. That is all. You can now select the EST fasta file and the corresponding taxonomy ID file in the Make Blast Database dialog in OmicsBox.
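
The conversion in step 8 can also be done without Excel. The following is a minimal sketch, assuming every line of the blastdbcmd output has the form shown in step 8 (database tag, |, accession, |, space, taxonomy ID). The file names sample_sequenceID_taxa.txt and sample_taxonomy_map.txt are hypothetical, and the sample input is created here only so that the example runs on its own.

```shell
# Sample of the blastdbcmd output from step 8 (two of the lines shown there).
cat > sample_sequenceID_taxa.txt <<'EOF'
emb|HE983324.1| 6087
emb|AM755036.1| 252671
EOF

# Split each line on "|" and spaces: field 1 is the database tag (emb),
# field 2 the accession, field 3 the taxonomy ID.
# Write accession <TAB> taxonomy ID, the format expected for the mapping file.
awk -F'[| ]+' -v OFS='\t' '{ print $2, $3 }' \
    sample_sequenceID_taxa.txt > sample_taxonomy_map.txt
```

With a real run, replace the sample input with sequenceID_taxa.txt from step 7. Lines whose sequence ID has a different layout would end up with the wrong fields, so it is worth inspecting the result before using it.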


Make Blast Database with Mapping file

Select .fasta and .txt files to Make Blast Database

For more information on how to create your own database or run Local Blast within OmicsBox, please have a look at our user manual, tutorials and videos.