Taxonomic reference databases

Database processing principle

zAMP can train classifiers such as RDP [1], QIIME [2] and Decipher [3] either on full-length sequences or on subregions, such as the V3-V4 region of the 16S rRNA gene.

Classification accuracy depends on the classifier parameters and on the region on which the classifier was trained [4]. Bokulich et al. [4] demonstrated that, for short reads, training classifiers on the specific amplified region improves accuracy compared with training on full-length sequences.

Therefore, by default, zAMP extracts the primer-amplified regions with cutadapt [5], dereplicates and clusters the sequences with vsearch [6], and adapts the taxonomy according to these clusters.

Database processing flowchart:

flowchart TD
A["dna-sequences.fasta"]
taxonomy["taxonomy.tsv"]
taxonomy --> clean("Clean taxonomy")
clean --> B
A --> B{Extract region ?}
B --> |Yes| C("Extract regions
(cutadapt)")
B --> |No| D(Copy files)
C --> E("Cluster
(vsearch)")
E --> F("Derep and merge taxonomy")
F --> G("Train classifiers
(RDP, Decipher...) ")
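
The taxonomy-adaptation step in the flowchart can be pictured with a toy example. The sketch below is not zAMP's implementation. It assumes a hypothetical three-column table (ID, taxonomy, sequence) and assumes that identical sequences with conflicting lineages are reduced to the leading ranks they share; the rules zAMP actually applies may differ.

```shell
# Hypothetical toy table: ID <tab> taxonomy <tab> sequence (not a zAMP file format).
toy=$(mktemp)
printf '%s\t%s\t%s\n' \
    s1 'Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis' ACGTACGT \
    s2 'Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus oralis' ACGTACGT \
    s3 'Bacteria;Fusobacteria;Fusobacteria_c;Fusobacteriales;Fusobacteriaceae;Fusobacterium;Fusobacterium nucleatum' TTGACCGA \
    > "$toy"

# Group records by identical sequence; when lineages disagree,
# keep only the leading ranks common to all of them (assumed rule).
merged=$(awk -F'\t' '
{
    seq = $3
    if (!(seq in lca)) { lca[seq] = $2; order[++n] = seq; next }
    na = split(lca[seq], a, ";")
    nb = split($2, b, ";")
    out = ""
    for (i = 1; i <= na && i <= nb && a[i] == b[i]; i++)
        out = (i == 1) ? a[i] : (out ";" a[i])
    lca[seq] = out
}
END { for (i = 1; i <= n; i++) print order[i] "\t" lca[order[i]] }
' "$toy")

# One line per unique sequence: sequence <tab> merged taxonomy.
printf '%s\n' "$merged"
rm -f "$toy"
```

In this toy input, the two identical sequences differ only at the species rank, so the merged lineage stops at the genus.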
    

Custom database

Taxonomy table in QIIME format:

1   Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Magnetospirillum;Magnetospirillum magnetotacticum
2   Bacteria;Fusobacteria;Fusobacteria_c;Fusobacteriales;Fusobacteriaceae;Fusobacterium;Fusobacterium nucleatum

Fasta file:

>1
CTGNCGGCGTGCCTAACACATNCAAGTCGAGCGGTGCTACGGAGGTCTTCGGACTGAAGTAGCATAGCGGCGGACGGGTGAGTAATACACAGGAACGTGCCCCTTGGAGGCGGATAGCTGTGGGAAACTGCAGGTAATCCGCCGTAAGCTCGGGAGAGGAAAGCCGGAAGGCGCCGAGGGAGCGGCCTGTGGCCCATCAGGTAGTTGGTAGGGTAAGAGCCTACCAAGCCGACGACGGGTAGCCGGTCTGAGAGGATGGACGGCCACAAGGGCACTGAGACACGGGCCCTACTCCTACGGGAGGCAGCAGTGGGGGATATTGGACAATGGGCGAAAGCCTGATCCAGCGACGCCGCGTGAGGGACGAAGTCCTTCGGGACGTAAACCTCTGTTGTAGGGGAAGAAGACAGTGACGGTACCCTACGAGGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGNCGAGCGTTACCCGGAATCACTGGGCGTAAAGGGTGCGTA
>2
AACGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAACGAAGTCTTCGGACTTAGTGGCGCACGGGTGAGTAACACGTGGGAATATACCTCTTGGTGGGGAATAACGTCGGGAAACTGACGCTAATACCGCATACGCCCTTCGGGGGAAAGATTTATCGCCGAGAGATTAGCCCGCGTCCGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTCACCCACGACGATGATGACGGTAGTGGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTGGTCATAGTCAGAAGTGAAAGCCCTGGGCTCAACCCGGGAATTGCTTTTGATACTGGACCGCTAGAATCACGGAGAGGGTAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGG
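
Before running `zamp db` on a custom database, it can help to confirm that the two files describe the same set of identifiers. The following is a minimal sketch, not part of zAMP. It assumes the taxonomy table is tab-separated with the ID in the first column and that FASTA IDs are the header text up to the first whitespace; the toy file contents are made up for illustration.

```shell
# Scratch directory so the toy files do not overwrite real data.
workdir=$(mktemp -d)

# Toy inputs mirroring the two formats above (sequence 3 has no taxonomy).
printf '1\tBacteria;Proteobacteria;Alphaproteobacteria\n2\tBacteria;Fusobacteria;Fusobacteria_c\n' \
    > "$workdir/taxonomy.tsv"
printf '>1\nACGT\n>2\nTTGA\n>3\nGGCC\n' > "$workdir/sequences.fasta"

# IDs from FASTA headers and from the first taxonomy column, sorted for comm.
grep '^>' "$workdir/sequences.fasta" | sed 's/^>//' | awk '{print $1}' | sort -u \
    > "$workdir/fasta_ids.txt"
cut -f1 "$workdir/taxonomy.tsv" | sort -u > "$workdir/tax_ids.txt"

# comm -23: IDs only in the FASTA file; comm -13: IDs only in the taxonomy table.
missing_tax=$(comm -23 "$workdir/fasta_ids.txt" "$workdir/tax_ids.txt")
missing_seq=$(comm -13 "$workdir/fasta_ids.txt" "$workdir/tax_ids.txt")

echo "sequences without taxonomy: ${missing_tax:-none}"
echo "taxonomy entries without sequence: ${missing_seq:-none}"
```

With real data, replace the two toy files with your own taxonomy table and FASTA file.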

Default command:

zamp db --taxonomy taxonomy.tsv \
        --fasta sequences.fasta \
        --fw-primer CCTACGGGNGGCWGCAG \
        --rv-primer GACTACHVGGGTATCTAATCC \
        -o processed-db

Skip primer amplified region extraction:

zamp db --taxonomy taxonomy.tsv \
    --fasta sequences.fasta \
    --no-processing \
    -o unprocessed-db

Available databases

Note

Processed and unprocessed SILVA, Greengenes2 and UNITE databases will be made available soon.

Here is a short, non-exhaustive list of sources from which we could successfully prepare a database:

Note

For Eukaryome, additional steps might be needed to filter a kingdom of interest (e.g. Fungi), or remove entries with incomplete taxonomy.

Parameters

Command:

zamp db -h

Examples

Bacteria

Greengenes2

  • Download:

    wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza && \
    wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.tax.qza
    
  • Decompress qza with qiime2 export:

    docker run -t -i -v $(pwd):/data quay.io/qiime2/tiny:2024.5 \
    qiime tools export \
    --input-path 2022.10.backbone.full-length.fna.qza \
    --output-path greengenes2 && \
    docker run -t -i -v $(pwd):/data quay.io/qiime2/tiny:2024.5 \
    qiime tools export \
    --input-path 2022.10.backbone.tax.qza --output-path greengenes2
    
  • Prepare database:

    zamp db --fasta greengenes2/dna-sequences.fasta \
    --taxonomy greengenes2/taxonomy.tsv --name greengenes2 \
    --fw-primer CCTACGGGNGGCWGCAG --rv-primer GACTACHVGGGTATCTAATCC \
    -o greengenes2
    

Fungi

Unite ITS1

The fungal ITS databases Unite v10 and Eukaryome v1.8 contain the 5.8S but not the adjacent SSU/LSU sequences, on which some commonly used PCR primers lie. It is therefore important to adjust the cutadapt parameters so that only the absent primer is optional. In the following example, we prepare a database for fungal ITS1 from Unite. Here the forward primer, which lies on the 18S, is absent from most Unite/Eukaryome sequences, whereas the reverse primer, which lies on the 5.8S, is present. We therefore set the forward primer as optional: the extracted sequences start at the available 5’ end of the database sequence and end at the reverse primer:

zamp db \
--fasta sh_refs_qiime_unite_ver10_dynamic_04.04.2024.fasta \
--taxonomy sh_taxonomy_qiime_unite_ver10_dynamic_04.04.2024.txt \
--name unite \
--fw-primer CYHRGYYATTTAGAGGWMSTAA --rv-primer RCKDYSTTCWTCRWYGHTGB \
--minlen 50 --maxlen 900 \
--cutadapt_args_fw "optional" \
-o unite_ITS1

Eukaryome ITS2

Similarly, to extract ITS2 from fungal databases such as Eukaryome, the reverse primer needs to be set as optional, because it is located on the LSU, which is absent in the database sequences:

zamp db \
--fasta QIIME2_EUK_ITS_v1.8.fasta \
--taxonomy QIIME2_EUK_ITS_v1.8.txt \
--name eukaryome \
--fw-primer GCATCGATGAAGAACGCAGC --rv-primer TCCTCCGCTTATTGATATGC \
--minlen 50 --maxlen 900 \
--cutadapt_args_rv "optional" \
-o eukaryome_ITS2
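
To decide which primer to mark as optional, one can first count how many reference sequences contain each primer. The sketch below is only an illustration and is not part of zAMP: the helper names `iupac_to_regex` and `revcomp` are hypothetical, the toy FASTA is invented, and records are assumed to hold their sequence on a single line. It uses the 16S primers from the default command above. The search is an exact match, so it is stricter than cutadapt, which tolerates mismatches.

```shell
# Expand IUPAC degenerate bases into a regex of bracket expressions.
iupac_to_regex() {
    printf '%s' "$1" | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Reverse complement: the reverse primer appears in this orientation
# on the reference strand.
revcomp() {
    printf '%s\n' "$1" | tr 'ACGTRYKMBVDH' 'TGCAYRMKVBHD' |
        awk '{ for (i = length($0); i > 0; i--) printf "%s", substr($0, i, 1); print "" }'
}

# Toy FASTA: two sequences carry the forward primer, one does not.
toy=$(mktemp)
printf '>a\nAACCTACGGGAGGCAGCAGTT\n>b\nAACCTACGGGTGGCTGCAGTT\n>c\nGGGGTTTTAAAACCCC\n' > "$toy"

fw_regex=$(iupac_to_regex CCTACGGGNGGCWGCAG)
rv_regex=$(iupac_to_regex "$(revcomp GACTACHVGGGTATCTAATCC)")

total=$(grep -c '^>' "$toy")
hits=$(grep -v '^>' "$toy" | grep -c -E "$fw_regex")
echo "forward primer found in $hits of $total sequences"
rm -f "$toy"
```

A primer found in almost no sequences is the likely candidate for the "optional" setting; `rv_regex` can be searched in the same way for the reverse primer.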

Output

See the <tax_DB_path>/<tax_DB_name>/QIIME/problematic_taxa.txt file for identical sequences whose taxonomic identifiers disagreed above the genus rank.

References