In silico validation tool
Aim
This module predicts in silico whether specific taxa are:
amplified by a set of PCR primers used for amplicon-based metagenomics
accurately classified taxonomically based on the generated amplicon
Working principle
Based on a user-defined list of NCBI tax IDs, assemblies or taxon queries, genome assemblies are downloaded from the NCBI database with Assembly Finder. The PCR primer sequences provided by the user are then used to run an in silico PCR with in_silico_pcr (or, alternatively, with simulate_PCR). The resulting in silico amplicons are processed by the main pipeline as if they were sequencing reads (primer trimming, amplicon clustering into representative sequences, taxonomic classification).
Finally, for each downloaded assembly, this module provides a table describing the amplicons predicted to be amplified by the PCR primers (number of sequence variants, number of copies), together with the expected and obtained taxonomic assignments.
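To make the in silico PCR step more concrete, the sketch below shows one way to locate an amplicon flanked by degenerate primers using standard shell tools. It is not how in_silico_pcr or simulate_PCR work internally: it searches only the forward strand, tolerates no mismatches, and reports the longest possible match on each input line. The function names (iupac_to_regex, revcomp, find_amplicon), the toy primers and the toy template are all hypothetical and chosen for illustration only.

```shell
# Expand IUPAC ambiguity codes into regex character classes.
# Assumes an upper-case nucleotide sequence as the first argument.
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Reverse-complement a sequence, including IUPAC ambiguity codes.
revcomp() {
    printf '%s\n' "$1" | tr 'ACGTRYSWKMBDHVN' 'TGCAYRSWMKVHDBN' |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

# Print the region spanning the forward primer to the reverse-complemented
# reverse primer. Arguments: forward primer, reverse primer, template.
find_amplicon() {
    fw_re=$(iupac_to_regex "$1")
    rv_re=$(iupac_to_regex "$(revcomp "$2")")
    printf '%s\n' "$3" | grep -oE "${fw_re}[ACGT]*${rv_re}"
}

# Toy example (hypothetical primers and template, not real data):
find_amplicon ACGN TTYA GGACGTCCCCTGAAGG   # prints ACGTCCCCTGAA
```

A dedicated tool remains preferable for real genomes, since it also handles both strands, primer mismatches and multiple amplicon copies per assembly.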
Inputs
To execute the pipeline, one needs:
An input file containing the accession names or the tax IDs of interest. This is a one-column text file without a header. The identifiers should match the NCBI taxonomy. One can skip this text file and use a query term instead; see the usage cases below.
A taxonomic database preprocessed with our dedicated pipeline
Input file example:
With accession names:
GCA_000008005.1
GCA_000010425.1
GCA_000016965.1
GCA_020546685.1
GCA_000172575.2
GCA_000005845.2
GCA_000014425.1
GCA_003324715.1
GCA_000007645.1
GCA_000007465.2
GCA_013372085.1
GCA_031191545.1
GCA_000012825.1
GCA_000307795.1
GCA_000008805.1
GCA_000010505.1
GCA_000231215.1
GCA_000017205.1
GCA_000013425.1
GCA_000007265.1
With NCBI tax IDs:
1069201
182096
41058
1220207
1287682
746128
5059
5062
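Because the --accession flag depends on what the input file contains, it can help to check the file before launching the module. The following is a minimal sketch of such a check; classify_ids is a hypothetical helper and not part of zamp. It assumes that assembly accessions start with GCA_ or GCF_, followed by digits and a version suffix, and that tax IDs are purely numeric.

```shell
# Report whether a one-column identifier list holds assembly accessions,
# tax IDs, or a mix. Reads the file given as argument, or stdin if none.
classify_ids() {
    awk '
        /^[ \t]*$/                { next }          # skip blank lines
        /^GC[AF]_[0-9]+\.[0-9]+$/ { acc++; next }   # e.g. GCA_000008005.1
        /^[0-9]+$/                { tax++; next }   # e.g. 1069201
                                  { other++ }       # anything unrecognised
        END {
            if (other || (acc && tax)) print "mixed"
            else if (acc)              print "accession"
            else if (tax)              print "taxid"
            else                       print "empty"
        }' "$@"
}

# Example with inline data:
printf 'GCA_000008005.1\nGCA_000010425.1\n' | classify_ids   # prints accession
```

A result of accession suggests adding --accession to the zamp insilico call, while mixed or empty suggests the file needs fixing first.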
Execution
The module is executed with zamp insilico. You can see all required and optional arguments with:
zamp insilico -h
Example usage cases:
Using bacterial assembly accession names (note the --accession argument, required when using accession names instead of tax IDs):
zamp insilico -i zamp/data/bacteria-accs.txt -db greengenes2 --accession --fw-primer CCTACGGGNGGCWGCAG --rv-primer GACTACHVGGGTATCTAATCC
Using fungal tax IDs (requires additional ITS amplicon-specific parameters to adjust the amplicon size):
zamp insilico -i zamp/data/fungi-taxa.txt \
    -db unite_db_v10 \
    --fw-primer CYHRGYYATTTAGAGGWMSTAA --rv-primer RCKDYSTTCWTCRWYGHTGB \
    --minlen 50 --maxlen 900
Using a query term. In this example, 100 assemblies will be downloaded per taxon (-nb 100), including non-reference assemblies (--not-only-ref):
zamp insilico -i "lactobacillus" \
    -db ezbiocloud \
    --fw-primer CCTACGGGNGGCWGCAG --rv-primer GACTACHVGGGTATCTAATCC \
    --replace-empty -nb 100 --not-only-ref
Output
The pipeline gathers information on the available assemblies for the requested tax IDs in the assembly_finder folder.
The output of the in silico amplification is in the Insilico folder, which contains the following subfolders:
PCR: output of the in silico PCR amplification
2_denoised: output of clustering and denoising into representative sequences, and count tables
3_classified: output of taxonomic classification and tables comparing expected and obtained taxonomic assignments (InSilico_compare_tax.tsv and InSilico_compare_tax_long.tsv)
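The column layout of the comparison tables is not described here, so a reasonable first step when exploring them is to list the header fields. The sketch below does this for any tab-separated file; list_columns is a hypothetical helper, and the path in the commented usage line is an assumption based on the folder names above, relative to wherever the module wrote its output.

```shell
# Print the numbered column names found on the first line of a
# tab-separated file. Reads the file given as argument, or stdin if none.
list_columns() {
    awk -F '\t' 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$@"
}

# Assumed location of the comparison table (adjust to your output directory):
# list_columns Insilico/3_classified/InSilico_compare_tax.tsv
```

Once the column numbers are known, standard tools such as cut or awk can be used to pull out the expected and obtained assignments for a side-by-side look.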