Taxonomic reference databases¶
Database processing principle¶
zAMP can train classifiers such as RDP [1], QIIME [2] and Decipher [3] on full-length sequences or on subregions such as the V3-V4 region of the 16S rRNA gene.
Classification accuracy depends on the classifier parameters and on the region on which the classifier was trained [4]. Bokulich et al. [4] demonstrated that, for short reads, training classifiers on the specific amplified region gives better accuracy than training on full-length sequences.
Therefore, by default, zAMP extracts primer-amplified regions with cutadapt [5], dereplicates and clusters sequences with vsearch [6], and adapts the taxonomy according to these clusters.
Database processing flowchart:
flowchart TD
A["dna-sequences.fasta"]
taxonomy["taxonomy.tsv"]
taxonomy --> clean("Clean taxonomy")
clean --> B
A --> B{Extract region ?}
B --> |Yes| C("Extract regions
(cutadapt)")
B --> |No| D(Copy files)
C --> E("Cluster
(vsearch)")
E --> F("Derep and merge taxonomy")
F --> G("Train classifiers
(RDP, Decipher...) ")
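The "Derep and merge taxonomy" step can be illustrated with a minimal Python sketch. It is not zAMP's implementation: the function names are hypothetical, and the merging rule used here (truncate the lineage to the ranks shared by all identical sequences) is an assumption chosen for illustration.

```python
from collections import defaultdict


def shared_lineage(lineages):
    """Return the longest rank prefix common to all lineages (assumed merge rule)."""
    split = [lineage.split(";") for lineage in lineages]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)


def dereplicate(sequences, taxonomy):
    """Group identical sequences and collapse their lineages.

    sequences: dict mapping sequence ID -> nucleotide string
    taxonomy:  dict mapping sequence ID -> semicolon-separated lineage
    Returns a dict mapping sequence -> (member IDs, merged lineage).
    """
    groups = defaultdict(list)
    for seq_id, seq in sequences.items():
        groups[seq.upper()].append(seq_id)
    return {
        seq: (ids, shared_lineage([taxonomy[i] for i in ids]))
        for seq, ids in groups.items()
    }


# Toy input: sequences 1 and 2 are identical but disagree at species rank.
sequences = {"1": "ACGTACGT", "2": "acgtacgt", "3": "TTGACCAA"}
taxonomy = {
    "1": "Bacteria;Fusobacteria;Fusobacteria_c;Fusobacteriales;Fusobacteriaceae;Fusobacterium;Fusobacterium nucleatum",
    "2": "Bacteria;Fusobacteria;Fusobacteria_c;Fusobacteriales;Fusobacteriaceae;Fusobacterium;Fusobacterium periodonticum",
    "3": "Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Magnetospirillum;Magnetospirillum magnetotacticum",
}
merged = dereplicate(sequences, taxonomy)
```

In this toy case the two identical sequences end up in one group whose lineage stops at the genus, because the species names differ.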
Custom database¶
Taxonomy table in QIIME format:
1 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Magnetospirillum;Magnetospirillum magnetotacticum
2 Bacteria;Fusobacteria;Fusobacteria_c;Fusobacteriales;Fusobacteriaceae;Fusobacterium;Fusobacterium nucleatum
Fasta file:
>1
CTGNCGGCGTGCCTAACACATNCAAGTCGAGCGGTGCTACGGAGGTCTTCGGACTGAAGTAGCATAGCGGCGGACGGGTGAGTAATACACAGGAACGTGCCCCTTGGAGGCGGATAGCTGTGGGAAACTGCAGGTAATCCGCCGTAAGCTCGGGAGAGGAAAGCCGGAAGGCGCCGAGGGAGCGGCCTGTGGCCCATCAGGTAGTTGGTAGGGTAAGAGCCTACCAAGCCGACGACGGGTAGCCGGTCTGAGAGGATGGACGGCCACAAGGGCACTGAGACACGGGCCCTACTCCTACGGGAGGCAGCAGTGGGGGATATTGGACAATGGGCGAAAGCCTGATCCAGCGACGCCGCGTGAGGGACGAAGTCCTTCGGGACGTAAACCTCTGTTGTAGGGGAAGAAGACAGTGACGGTACCCTACGAGGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGNCGAGCGTTACCCGGAATCACTGGGCGTAAAGGGTGCGTA
>2
AACGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAACGAAGTCTTCGGACTTAGTGGCGCACGGGTGAGTAACACGTGGGAATATACCTCTTGGTGGGGAATAACGTCGGGAAACTGACGCTAATACCGCATACGCCCTTCGGGGGAAAGATTTATCGCCGAGAGATTAGCCCGCGTCCGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCTTAGGGTTGTAAAGCTCTTTCACCCACGACGATGATGACGGTAGTGGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTGGTCATAGTCAGAAGTGAAAGCCCTGGGCTCAACCCGGGAATTGCTTTTGATACTGGACCGCTAGAATCACGGAGAGGGTAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGG
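Before running zAMP on a custom database, it can help to check that every FASTA record has a taxonomy entry and vice versa. The sketch below is a hypothetical helper, not part of zAMP. It assumes the taxonomy table is tab-separated with no header line, and that the sequence ID is the first word of each FASTA header.

```python
import io


def read_fasta_ids(handle):
    """Collect the first word of every FASTA header line."""
    return [line[1:].split()[0] for line in handle if line.startswith(">")]


def read_taxonomy(handle):
    """Parse 'ID<TAB>rank1;rank2;...' lines into a dict of ID -> list of ranks."""
    table = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        seq_id, lineage = line.split("\t", 1)
        table[seq_id] = lineage.split(";")
    return table


def check_database(fasta_handle, taxonomy_handle):
    """Return (IDs lacking taxonomy, IDs lacking a sequence), both sorted."""
    fasta_ids = set(read_fasta_ids(fasta_handle))
    tax_ids = set(read_taxonomy(taxonomy_handle))
    return sorted(fasta_ids - tax_ids), sorted(tax_ids - fasta_ids)


# In-memory toy files; with real files, pass open("sequences.fasta") etc.
fasta = io.StringIO(">1\nACGT\n>2\nTTGA\n>3\nGGCC\n")
tax = io.StringIO("1\tBacteria;Proteobacteria\n2\tBacteria;Fusobacteria\n")
missing_taxonomy, missing_sequence = check_database(fasta, tax)
```

Here sequence 3 has no taxonomy line, so it is reported in the first list.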
Default command:
zamp db --taxonomy taxonomy.tsv \
--fasta sequences.fasta \
--fw-primer CCTACGGGNGGCWGCAG \
--rv-primer GACTACHVGGGTATCTAATCC \
-o processed-db
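The primers in this command contain IUPAC degenerate bases (N, W, H, V). As a rough illustration of what they match, the following sketch expands a primer into a regular expression. It only finds exact matches, whereas cutadapt also tolerates mismatches, so treat it as a way to inspect a primer rather than a replacement for the extraction step. The function name is hypothetical.

```python
import re

# Standard IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a degenerate primer into an exact-match regular expression."""
    parts = []
    for base in primer.upper():
        bases = IUPAC[base]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


fw = primer_to_regex("CCTACGGGNGGCWGCAG")
# Short toy target containing one possible instance of the forward primer.
hit = fw.search("TTCCTACGGGAGGCAGCAGTG")
```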
Skip primer amplified region extraction:
zamp db --taxonomy taxonomy.tsv \
--fasta sequences.fasta \
--no-processing \
-o unprocessed-db
Available databases¶
Note
Processed and unprocessed SILVA, Greengenes2 and UNITE databases will be made available soon.
Here is a short, non-exhaustive list of reference databases that we have successfully processed with zAMP:
EzBioCloud (16S rRNA - Bacteria)
SILVA (16/18S rRNA, 23/28S rRNA - Bacteria and Eukarya)
UNITE (ITS - Eukarya)
Eukaryome (ITS - Eukarya)
Note
For Eukaryome, additional steps might be needed to filter a kingdom of interest (e.g. Fungi), or remove entries with incomplete taxonomy.
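Such a filtering step could look like the sketch below. It is not part of zAMP, and the lineage layout is an assumption: seven semicolon-separated ranks, optionally carrying QIIME-style prefixes such as `k__`, with unresolved ranks left empty or named "unclassified" or "unidentified". Check the actual Eukaryome files and adapt the placeholders before relying on it.

```python
def keep_entry(lineage, kingdom="Fungi", n_ranks=7,
               placeholders=("", "unclassified", "unidentified")):
    """Return True if the lineage belongs to `kingdom` and is fully resolved."""
    ranks = [rank.strip() for rank in lineage.split(";")]
    # Drop an optional 'k__'-style prefix from each rank name.
    names = [rank.split("__", 1)[-1] for rank in ranks]
    if len(names) < n_ranks or names[0] != kingdom:
        return False
    return all(name.lower() not in placeholders for name in names)


# Hypothetical entries illustrating the three cases.
table = {
    "a": "k__Fungi;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;"
         "f__Saccharomycetaceae;g__Saccharomyces;s__Saccharomyces_cerevisiae",
    "b": "k__Fungi;p__Ascomycota;c__unclassified;o__unclassified;"
         "f__unclassified;g__unclassified;s__unclassified",
    "c": "k__Viridiplantae;p__Streptophyta;c__Magnoliopsida;o__Poales;"
         "f__Poaceae;g__Zea;s__Zea_mays",
}
kept = {seq_id: lin for seq_id, lin in table.items() if keep_entry(lin)}
```

The FASTA file then needs to be restricted to the retained IDs so that sequences and taxonomy stay in sync.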
Parameters¶
Command:
zamp db -h
The “--no-processing” parameter skips the preprocessing: the provided database is only formatted and used to train the classifiers.
“--fw-primer” and “--rv-primer” are fed to the cutadapt linked adapter argument.
“--cutadapt_args_fw” and “--cutadapt_args_rv” pass additional arguments to cutadapt for the forward and reverse primer, respectively. For instance, they can indicate which primer is optional (https://cutadapt.readthedocs.io/en/v3.0/guide.html#changing-which-adapters-are-required). This is particularly useful when extracting ITS1 amplicons: the 5’ universal primer lies on the SSU rRNA gene preceding the ITS region and is therefore absent from ITS reference databases. In this case, “--cutadapt_args_fw optional” makes the forward primer optional.
“errors” is fed to cutadapt to define the number of accepted mismatches per primer.
“ampcov” is used with the length of the provided primers to feed cutadapt with a minimal overlap.
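To make the relationship between these parameters more concrete, here is a hedged sketch of how a cutadapt linked adapter specification could be assembled from them. It is not zAMP's code. The assumptions are: the reverse primer is reverse-complemented so that it reads on the same strand as the reference sequence, the minimal overlap is the primer length multiplied by “ampcov” and rounded down, and the default of 0.9 is purely illustrative. The `...` separator and the `;min_overlap=` and `;optional` suffixes follow cutadapt's documented adapter syntax.

```python
# IUPAC-aware complement table (e.g. R <-> Y, K <-> M, B <-> V, D <-> H).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(primer):
    """Reverse-complement a primer, including degenerate bases."""
    return primer.upper().translate(COMPLEMENT)[::-1]


def min_overlap(primer, ampcov):
    """Assumed rule: fraction of the primer length, rounded down, at least 1."""
    return max(1, int(len(primer) * ampcov))


def linked_adapter(fw, rv, ampcov=0.9, fw_args="", rv_args=""):
    """Build a 'FW;params...RV_RC;params' linked adapter string."""
    def part(seq, extra):
        fields = [seq, f"min_overlap={min_overlap(seq, ampcov)}"]
        if extra:
            fields.append(extra)
        return ";".join(fields)

    return f"{part(fw, fw_args)}...{part(reverse_complement(rv), rv_args)}"


spec = linked_adapter(
    "CCTACGGGNGGCWGCAG", "GACTACHVGGGTATCTAATCC", fw_args="optional"
)
```

A string of this shape would be passed to cutadapt through one of its adapter options. The exact option and parameters that zAMP uses are not stated here, so inspect the generated command in the zAMP logs if the details matter.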
Examples¶
Bacteria¶
Greengenes2
Download:
wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza && \
wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.tax.qza
Decompress qza with qiime2 export:
docker run -t -i -v $(pwd):/data quay.io/qiime2/tiny:2024.5 \
qiime tools export \
--input-path 2022.10.backbone.full-length.fna.qza \
--output-path greengenes2 && \
docker run -t -i -v $(pwd):/data quay.io/qiime2/tiny:2024.5 \
qiime tools export \
--input-path 2022.10.backbone.tax.qza \
--output-path greengenes2
Prepare database:
zamp db --fasta greengenes2/dna-sequences.fasta \
--taxonomy greengenes2/taxonomy.tsv --name greengenes2 \
--fw-primer CCTACGGGNGGCWGCAG --rv-primer GACTACHVGGGTATCTAATCC \
-o greengenes2
Fungi¶
Unite ITS1
The fungal ITS databases Unite v10 and Eukaryome v1.8 contain the 5.8S but not the adjacent SSU/LSU sequences, on which some commonly used PCR primers lie. The cutadapt parameters must therefore be adjusted so that only the absent primer is optional. The following example prepares a database for fungal ITS1 from Unite. Here the forward primer (lying on the 18S) is missing from most Unite/Eukaryome sequences, whereas the reverse primer (lying on the 5.8S) is present. We therefore set the forward primer as optional: the extracted sequences start at the available 5’ end of each database sequence and end at the reverse primer:
zamp db \
--fasta sh_refs_qiime_unite_ver10_dynamic_04.04.2024.fasta \
--taxonomy sh_taxonomy_qiime_unite_ver10_dynamic_04.04.2024.txt \
--name unite \
--fw-primer CYHRGYYATTTAGAGGWMSTAA --rv-primer RCKDYSTTCWTCRWYGHTGB \
--minlen 50 --maxlen 900 \
--cutadapt_args_fw "optional" \
-o unite_ITS1
Eukaryome ITS2
Similarly, to extract ITS2 from fungal databases such as Eukaryome, the reverse primer needs to be set as optional, because it is located on the LSU, which is absent in the database sequences:
zamp db \
--fasta QIIME2_EUK_ITS_v1.8.fasta \
--taxonomy QIIME2_EUK_ITS_v1.8.txt \
--name eukaryome \
--fw-primer GCATCGATGAAGAACGCAGC --rv-primer TCCTCCGCTTATTGATATGC \
--minlen 50 --maxlen 900 \
--cutadapt_args_rv "optional" \
-o eukaryome_ITS2
Output¶
Please see the <tax_DB_path>/<tax_DB_name>/QIIME/problematic_taxa.txt file for identical sequences whose taxonomic identifiers disagreed above the genus rank.