The JSON configuration file
The JSON configuration file plays a central role in the workflow.
By editing it, you control which inputs are used, which tools are run and how they are parameterized.
While starting from the test dataset's configuration file could be a good way to proceed, this tutorial focuses on creating it from scratch.
This workflow relies on Qiime2: many of the tools it uses are part of the Qiime2 toolbox. A default configuration is provided for most of them, but it also means that you should refer to the Qiime2 documentation when editing a section of the config file that is tied to a Qiime2 tool.
Use your favorite editor to create the following minimal JSON structure:

```json
{
    "global" : {
    }
}
```

The config file is divided into sections, each section corresponding to a workflow tool.
The global section lists all the parameters that are common to the whole workflow (i.e. input data, output directory, library type, ...).
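For example, a more complete global section, spelling out the parameters whose defaults are listed in the options table further below, could look like this (all values shown are the built-in defaults, given here purely for illustration):

```json
{
    "global" : {
        "input" : "test/fastq",
        "context" : "test/context.tsv",
        "outdir" : "out",
        "library-type" : "paired-end",
        "denoiser" : "deblur",
        "graph_inference" : "SpiecEasi"
    }
}
```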
The first mandatory file is the one describing the experiment, aka the context file (or sample metadata, as it is called in Qiime2). It's a TSV file that must respect the format imposed by Qiime2 and contains information about the samples (body site, sex, ...):

```
#SampleID    BarcodeSequence    BodySite     Year
#q2:types    categorical        categorical  numeric
L1S8         AGCTGACTAGTC       gut          2008
L2S155       ACGATGCGACCA       left palm    2009
```

Declare it in the global section with the context keyword:

```json
{
    "global" : {
        "context" : "test/context.tsv"
    }
}
```
microSysMics allows you to start the analysis from three types of data:

- a directory of fastq files (CASAVA 1.8 format),
- a MANIFEST file (CSV),
- an abundance matrix (TSV).

The first two options start from sequence files; the last one assumes you have already performed the first part of the analysis on your own.

NB: Sequence files must already be demultiplexed!

To use the first option (a fastq directory), your fastq files must be gathered in a directory, in the CASAVA 1.8 format, and compressed as fastq.gz:
```json
{
    "global" : {
        "input" : "test/fastq",
        "context" : "test/context.tsv"
    },
    "import" : {
        "type" : "SampleData[PairedEndSequencesWithQuality]",
        "source-format" : "CasavaOneEightSingleLanePerSampleDirFmt"
    }
}
```
The input keyword in the global section indicates the input data path, here the test/fastq directory.
The import section tells Qiime2 about the input data type and source format.

NB: If you have single-end data, use SampleData[SequencesWithQuality] as type.
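As a sketch only, a single-end variant of the configuration above might look like the following; library-type comes from the global options table below, and you should double-check on the Qiime2 website that CasavaOneEightSingleLanePerSampleDirFmt is the right source format for your single-end data:

```json
{
    "global" : {
        "input" : "test/fastq",
        "context" : "test/context.tsv",
        "library-type" : "single-end"
    },
    "import" : {
        "type" : "SampleData[SequencesWithQuality]",
        "source-format" : "CasavaOneEightSingleLanePerSampleDirFmt"
    }
}
```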
The second option is a MANIFEST file: a CSV file mapping sample IDs to their absolute file paths and read direction. It must respect the format expected by Qiime2.
```json
{
    "global" : {
        "input" : "test/manifest.csv",
        "context" : "test/context.tsv"
    },
    "import" : {
        "source-format" : "PairedEndFastqManifestPhred33"
    }
}
```

You can check which source format matches your data on the Qiime2 website.
Example of a MANIFEST file:

```
sample-id,absolute-filepath,direction
TARA-085-DCM-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_085_DCM_0.22-3/BTR_ACQIOSW_2_1_HMLJJBCXX.12BA026_clean-min.fastq,forward
TARA-085-DCM-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_085_DCM_0.22-3/BTR_ACQIOSW_2_2_HMLJJBCXX.12BA026_clean-min.fastq,reverse
TARA-056-SRF-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_056_SRF_0.22-3/BTR_AERIOSW_2_2_HMLJJBCXX.12BA101_clean-min.fastq,reverse
TARA-056-SRF-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_056_SRF_0.22-3/BTR_AERIOSW_2_1_HMLJJBCXX.12BA101_clean-min.fastq,forward
```

NB: The file extension must be csv!
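If your manifest describes single-end reads, a possible variant (assuming Qiime2's SingleEndFastqManifestPhred33 source format; check the exact format name on the Qiime2 website) would be:

```json
{
    "global" : {
        "input" : "test/manifest.csv",
        "context" : "test/context.tsv",
        "library-type" : "single-end"
    },
    "import" : {
        "source-format" : "SingleEndFastqManifestPhred33"
    }
}
```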
The third option is an abundance matrix: a tab-separated file with samples as columns and OTUs as rows.
```json
{
    "global" : {
        "context" : "test/context.tsv",
        "input-type" : "abundance_matrix",
        "input" : "test/abundance_matrix_test.tsv"
    }
}
```
You need to add "input-type" : "abundance_matrix" to tell the pipeline to skip the first part of the analysis and start directly from the abundance matrix.

NB: The file extension must be tsv!
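As a purely illustrative sketch (the OTU identifiers and counts below are made up, and the exact header of the first column may differ), the expected layout is roughly:

```
OTU_ID    L1S8    L2S155
OTU_1     12      0
OTU_2     530     42
```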
Each tool comes with its own set of options. The names of these options can be found in each tool's documentation. For example, to change the deblur trim length:

```json
{
    "deblur": {
        "p-trim-length" : 200
    }
}
```

NB: You should not add input and output options to the tools, as Snakemake handles them.
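Similarly, if you selected dada2 as denoiser in the global section, you could override its truncation lengths (parameter names taken from the tool table below; the values here are purely illustrative):

```json
{
    "dada2": {
        "p-trunc-len-f" : 240,
        "p-trunc-len-r" : 200
    }
}
```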
| Option name | Option value | Default value |
|---|---|---|
| input | fastq directory, MANIFEST file (CSV) or abundance matrix (TSV) | MANDATORY |
| context | sample metadata (TSV) | MANDATORY |
| outdir | analysis output directory | out |
| library-type | single-end or paired-end | paired-end |
| denoiser | deblur or dada2 | deblur |
| graph_inference | FlashWeave or SpiecEasi | SpiecEasi |
| Tool name | Goal | Mandatory parameters and default values |
|---|---|---|
| cutadapt | Trim adapter sequences from reads | |
| quality-filter | Filter reads according to some criteria | |
| deblur | Denoise amplicon reads | p-trim-length: 200 |
| dada2 | Denoise amplicon reads | p-trunc-len-f: 0, p-trunc-len-r: 0 |
| diversity | Compute diversity metrics | p-sampling-depth: 1000 |
| rarefaction | Produce rarefaction curves | p-max-depth: 1000 |
| phylogenetic_tree | Build a phylogenetic tree | |
All possible options can be found on the Qiime2 website.
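Tool sections simply sit side by side in the same configuration file; as a sketch, here the diversity and rarefaction defaults from the table above are made explicit:

```json
{
    "diversity": {
        "p-sampling-depth" : 1000
    },
    "rarefaction": {
        "p-max-depth" : 1000
    }
}
```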
The abundance matrix can be filtered according to several criteria, which can be combined.
"pre-filter" : {
"samples":[{
"filterOn" : "BodySite",
"regexp":"gut|palm"}
]
}
Pre-filtering on samples
Keep only samples that comes from gut or palm
There must be a BodySite column in the context file
"pre-filter" : {
"otus":[{
"filterOn" : "lineage",
"regexp":"Prokaryota"}
]
}
Pre-filtering on otus
Keep only prokaryotes
There must be a lineage column in the abundance file
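Both kinds of pre-filters can be combined in a single pre-filter section, as in this sketch:

```json
"pre-filter" : {
    "samples" : [{
        "filterOn" : "BodySite",
        "regexp" : "gut|palm"}
    ],
    "otus" : [{
        "filterOn" : "lineage",
        "regexp" : "Prokaryota"}
    ]
}
```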
We can also filter the matrix according to some statistical measures
"filter" : {
"otus":[{
"name" : "prevalence",
"cutoff-percent":"0.33"}
]
}
Filtering by prevalence
Keep otus that are present in at least 1 sample out of 3
Filtering on number of reads
Keep otus with at least 100 reads across all samples
"filter" : {
"otus":[{
"name" : "read_number",
"cutoff":"100"}
]
}
"filter" : {
"otus":[{
"name" : "standard_deviation",
"cutoff" : 100}
]
}
Filtering by standard deviation
Keep the first 100 OTUs with highest standard deviation
Filtering on abundance
Keep the 60% most abundant OTUs
"filter" : {
"otus":[{
"name" : "abundance",
"cutoff-percent":"0.6"}
]
}
For those two filters, a TSS normalization is performed first
Example:

```json
"pre-filter" : {
    "otus" : {
        "filterOn" : "index",
        "regexp" : "Bacteria|Archaea",
        "keepNa" : true
    },
    "samples" : {
        "filterOn" : "Depth",
        "regexp" : "SRF|DCM"
    }
},
"groupBy" : {
    "samples" : {
        "Depth" : ["SRF|DCM"],
        "Marine.biome" : ["Westerlies Biome", "Polar Biome",
                          "Coastal Biome", "Trades Biome"]
    }
},
"filter" : {
    "otus" : [
        {"name" : "prevalence", "cutoff-percent" : 0.33},
        {"name" : "standard_deviation", "cutoff-percent" : 0.5}
    ]
}
```
This configuration will:

- keep OTUs that have Bacteria or Archaea in their index name,
- keep samples with SRF or DCM in the Depth column of the context file,
- group together samples from SRF or DCM,
- split samples according to the values in the Marine.biome column,
- filter OTUs first according to prevalence (remove OTUs that appear in less than 1/3 of the samples),
- then keep only the 50% most variable OTUs.
The SpiecEasi graph inference step accepts the following options:

| Option name | Option value | Default value |
|---|---|---|
| nc | Number of cores to use | 4 |
| lambda.min.ratio | Determines the minimum lambda value in the LASSO regularization | 0.01 |
| nlambda | Number of lambdas to test | 20 |
| rep.num | Number of subsampling steps in the StARS procedure | 20 |
| stars.thresh | StARS variability threshold | 0.05 |
To get a better understanding of these parameters, please refer to the SpiecEasi documentation.
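As a hedged sketch (the section key is assumed here to be "spieceasi" and should be checked against the key the pipeline actually expects), these options could be set in the configuration file as follows, with the default values from the table above:

```json
{
    "global" : {
        "graph_inference" : "SpiecEasi"
    },
    "spieceasi" : {
        "nc" : 4,
        "lambda.min.ratio" : 0.01,
        "nlambda" : 20,
        "rep.num" : 20,
        "stars.thresh" : 0.05
    }
}
```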