microSysMics
The JSON configuration file
The JSON configuration file plays the key role in the workflow.
Why?
By editing it, you will:
- Indicate your input files and output directory
- Select the tools you want to use
- Add specific options to each tool
Remarks
While starting from the configuration file of the test dataset can be a good way to proceed, this tutorial focuses on creating it from scratch.
This workflow relies on Qiime2: many of the tools it uses are part of the Qiime2 toolbox. A default configuration is provided for most tools, but you should refer to the Qiime2 documentation when editing a section of the config file that is tied to a Qiime2 tool.
Creating config file from scratch
Use your favorite editor to create the following minimal JSON structure:
{
"global" :{
}
}
The config file is divided into sections, each section corresponding to a workflow tool
The global section lists all the parameters that are common to the whole workflow (i.e. input data, output directory, library type, ...)
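The same minimal structure can also be produced programmatically. The Python sketch below is only an illustration: the helper name `make_config` is hypothetical and not part of the workflow; only the `global` section name comes from this tutorial.

```python
import json

def make_config(global_options=None, tool_sections=None):
    """Build a config dict: one "global" section plus optional tool sections."""
    config = {"global": dict(global_options or {})}
    config.update(tool_sections or {})
    return config

# Minimal structure: an empty global section, to be filled in step by step.
minimal = make_config()
print(json.dumps(minimal, indent=2))
```

Building the dictionary in code and dumping it with `json.dumps` guarantees that the result is valid JSON, which is easy to get wrong when editing by hand.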
Context file
{
"global" :{
"context" : "test/context.tsv"
}
}
The first mandatory file is the one describing the experiment, also called the context file, or sample metadata in Qiime2 terminology
It's a TSV file that must respect the format imposed by Qiime2
#SampleID BarcodeSequence BodySite Year
#q2:types categorical categorical numeric
L1S8 AGCTGACTAGTC gut 2008
L2S155 ACGATGCGACCA left palm 2009
It contains information about the samples (body site, sex, ...)
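To make the layout of the context file concrete, here is a simplified Python reader for the example shown above. It is a sketch, not Qiime2's own validator: the function name `read_context` is hypothetical, and it only handles the header line, the optional `#q2:types` line and plain data rows.

```python
import csv
import io

def read_context(handle):
    """Parse a metadata TSV and return (header, column types, data rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    types = {}
    rows = []
    for fields in reader:
        if not fields or not any(fields):
            continue  # skip blank lines
        if fields[0] == "#q2:types":
            # The types line describes every column except the ID column.
            types = dict(zip(header[1:], fields[1:]))
        elif not fields[0].startswith("#"):
            rows.append(dict(zip(header, fields)))
    return header, types, rows

example = (
    "#SampleID\tBarcodeSequence\tBodySite\tYear\n"
    "#q2:types\tcategorical\tcategorical\tnumeric\n"
    "L1S8\tAGCTGACTAGTC\tgut\t2008\n"
    "L2S155\tACGATGCGACCA\tleft palm\t2009\n"
)
header, types, rows = read_context(io.StringIO(example))
print(types)
print([row["BodySite"] for row in rows])
```

Note that the fields are separated by tabs, which is why a value such as `left palm` can contain a space.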
Input data
microSysMics can start the analysis from three types of data:
- Input directory containing compressed fastq files in CASAVA 1.8 format
- MANIFEST file listing fastq files
- OTU abundance matrix
The first two options start from sequence files; the last one assumes you have already performed the first part of the analysis on your own
NB : Sequence files must already be demultiplexed !
Input data : input directory
To use this option, your fastq files must be gathered in a directory, in the CASAVA 1.8 format and compressed as fastq.gz
{
"global" : {
"input" : "test/fastq",
"context" : "test/context.tsv"
},
"import" : {
"type": "SampleData[PairedEndSequencesWithQuality]",
"source-format" : "CasavaOneEightSingleLanePerSampleDirFmt"
}
}
The input keyword in the global section indicates the input data path, here the test/fastq directory
The import section tells Qiime2 about the input data type and source format
NB : If you have single-end data, use SampleData[SequencesWithQuality] as type
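Before launching the workflow, it can help to check that the file names in the input directory follow the CASAVA 1.8 naming scheme. The Python sketch below assumes the usual layout of five underscore-separated fields (sample identifier, barcode identifier, lane, read direction, set number) followed by `.fastq.gz`; the regular expression, the function name and the example file name are illustrative and not taken from the workflow.

```python
import re

# Assumed layout: <sample>_<barcode>_L<lane>_R<1|2>_<set>.fastq.gz
CASAVA_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<set>\d{3})\.fastq\.gz$"
)

def parse_casava_name(filename):
    """Return the name fields as a dict, or None if the name does not match."""
    match = CASAVA_NAME.match(filename)
    return match.groupdict() if match else None

print(parse_casava_name("L2S357_15_L001_R1_001.fastq.gz"))
print(parse_casava_name("reads.fastq"))
```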
Input data : MANIFEST file
The MANIFEST is a CSV file mapping each sample ID to the absolute path of its fastq file and to the read direction (forward or reverse). It must follow the format rules of the chosen source format.
{
"global" : {
"input" : "test/manifest.csv",
"context" : "test/context.tsv"
},
"import" : {
"source-format" : "PairedEndFastqManifestPhred33"
}
}
You can check for your format on the Qiime2 website.
sample-id,absolute-filepath,direction
TARA-085-DCM-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_085_DCM_0.22-3/BTR_ACQIOSW_2_1_HMLJJBCXX.12BA026_clean-min.fastq,forward
TARA-085-DCM-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_085_DCM_0.22-3/BTR_ACQIOSW_2_2_HMLJJBCXX.12BA026_clean-min.fastq,reverse
TARA-056-SRF-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_056_SRF_0.22-3/BTR_AERIOSW_2_2_HMLJJBCXX.12BA101_clean-min.fastq,reverse
TARA-056-SRF-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_056_SRF_0.22-3/BTR_AERIOSW_2_1_HMLJJBCXX.12BA101_clean-min.fastq,forward
Example of MANIFEST file
Extension must be csv !
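Writing the MANIFEST by hand is error-prone when there are many samples. The following Python sketch generates one with the three columns shown in the example; the function name and the sample paths are hypothetical, and paired-end data (one forward and one reverse file per sample) is assumed.

```python
import csv
import io
import os

def write_manifest(handle, samples):
    """Write a paired-end MANIFEST; samples is a list of (id, forward, reverse)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample-id", "absolute-filepath", "direction"])
    for sample_id, forward, reverse in samples:
        # The MANIFEST expects absolute paths, so relative ones are expanded.
        writer.writerow([sample_id, os.path.abspath(forward), "forward"])
        writer.writerow([sample_id, os.path.abspath(reverse), "reverse"])

buffer = io.StringIO()
write_manifest(buffer, [("sample1", "/data/s1_R1.fastq", "/data/s1_R2.fastq")])
manifest_lines = buffer.getvalue().splitlines()
print("\n".join(manifest_lines))
```

To produce a real file, pass a handle opened with `open("manifest.csv", "w", newline="")` instead of the in-memory buffer.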
Input data : Abundance matrix
The abundance matrix is a tab-separated file with samples as columns and OTUs as rows
{
"global" : {
"context":"test/context.tsv",
"input-type" : "abundance_matrix",
"input" : "test/abundance_matrix_test.tsv"
}
}
You need to add "input-type" : "abundance_matrix" to the global section to tell the pipeline to skip the first part of the analysis and start from the abundance matrix
Extension must be tsv !
Options
{
"deblur": {
"p-trim-length" : 200
}
}
Each tool comes with its own set of options. The names of these options can be found in the documentation of each tool.
For example, the deblur section shown here changes the deblur trim length.
NB : You should not add input and output options to the tools, as Snakemake handles them.
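A quick check can catch input or output options added by mistake. The Python sketch below assumes that option names in the config mirror the Qiime2 command-line names without the leading dashes (as `p-trim-length` suggests), so that inputs start with `i-` and outputs with `o-`. The function name is hypothetical and this is not how the workflow itself validates the file.

```python
def forbidden_options(config):
    """List (section, option) pairs that look like Qiime2 inputs or outputs."""
    found = []
    for section, options in config.items():
        if section == "global" or not isinstance(options, dict):
            continue  # only tool sections are inspected
        for name in options:
            if name.startswith(("i-", "o-")):
                found.append((section, name))
    return found

config = {
    "global": {"input": "test/fastq", "context": "test/context.tsv"},
    "deblur": {"p-trim-length": 200, "o-table": "table.qza"},  # o-table is the mistake
}
print(forbidden_options(config))
```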
global section
Option name | Option value | Default value |
---|---|---|
input | fastq directory, MANIFEST file (CSV) or abundance matrix (TSV) | MANDATORY |
context | sample metadata (TSV) | MANDATORY |
outdir | analysis output directory | out |
library-type | single-end or paired-end | paired-end |
denoiser | deblur or dada2 | deblur |
graph_inference | FlashWeave or SpiecEasi | SpiecEasi |
Qiime2 tools
Tool name | Goal | Mandatory parameters and default values |
---|---|---|
cutadapt | Trim adapters sequences from reads | |
quality-filter | Filter reads according to quality criteria | |
deblur | Denoise amplicon reads | p-trim-length : 200 |
dada2 | Denoise amplicon reads | p-trunc-len-f : 0 p-trunc-len-r : 0 |
diversity | Compute diversity metrics | p-sampling-depth : 1000 |
rarefaction | Produce rarefaction curves | p-max-depth : 1000 |
phylogenetic_tree | Build a phylogenetic tree | |
All available options can be found on the Qiime2 website
Filtering abundance matrix
The abundance matrix can be filtered according to several criteria, which can be combined.
"pre-filter" : {
"samples":[{
"filterOn" : "BodySite",
"regexp":"gut|palm"}
]
}
Pre-filtering on samples
Keep only samples that come from gut or palm
There must be a BodySite column in the context file
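The effect of such a pre-filter can be illustrated in a few lines of Python. This is a sketch under assumptions: it treats `regexp` as a pattern searched anywhere in the column value (whether the workflow searches or requires a full match is not stated here), and the function name and the third sample are hypothetical.

```python
import re

def filter_samples(rows, column, pattern):
    """Keep the rows whose value in `column` contains a match for `pattern`."""
    regex = re.compile(pattern)
    return [row for row in rows if regex.search(row.get(column, ""))]

rows = [
    {"#SampleID": "L1S8", "BodySite": "gut"},
    {"#SampleID": "L2S155", "BodySite": "left palm"},
    {"#SampleID": "S3", "BodySite": "tongue"},  # hypothetical extra sample
]
kept = filter_samples(rows, "BodySite", "gut|palm")
print([row["#SampleID"] for row in kept])
```

With a search, `palm` also matches `left palm`; with a full match it would not, so it is worth checking the result on a small subset first.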
"pre-filter" : {
"otus":[{
"filterOn" : "lineage",
"regexp":"Prokaryota"}
]
}
Pre-filtering on OTUs
Keep only prokaryotes
There must be a lineage column in the abundance file
We can also filter the matrix according to some statistical measures
"filter" : {
"otus":[{
"name" : "prevalence",
"cutoff-percent":"0.33"}
]
}
Filtering by prevalence
Keep OTUs that are present in at least 1 sample out of 3
Filtering on number of reads
Keep OTUs with at least 100 reads across all samples
"filter" : {
"otus":[{
"name" : "read_number",
"cutoff":"100"}
]
}
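The prevalence and read-number criteria can be sketched as follows in Python. This is an illustration of the two definitions given above (fraction of samples in which an OTU is present, and total reads across samples), not the workflow's own code; the function names and the toy matrix are hypothetical.

```python
def prevalence(counts):
    """Fraction of samples in which the OTU has at least one read."""
    return sum(1 for count in counts if count > 0) / len(counts)

def filter_otus(matrix, min_prevalence=0.0, min_reads=0):
    """Keep OTUs meeting both the prevalence and the total-read thresholds."""
    return {
        otu: counts
        for otu, counts in matrix.items()
        if prevalence(counts) >= min_prevalence and sum(counts) >= min_reads
    }

# Toy matrix: OTUs as rows, three samples as columns.
matrix = {"otu1": [50, 60, 0], "otu2": [0, 0, 5], "otu3": [200, 0, 0]}
print(sorted(filter_otus(matrix, min_prevalence=0.5)))
print(sorted(filter_otus(matrix, min_reads=100)))
```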
"filter" : {
"otus":[{
"name" : "standard_deviation",
"cutoff" : 100}
]
}
Filtering by standard deviation
Keep the 100 OTUs with the highest standard deviation
Filtering on abundance
Keep the 60% most abundant OTUs
"filter" : {
"otus":[{
"name" : "abundance",
"cutoff-percent":"0.6"}
]
}
For these two filters (standard deviation and abundance), a TSS (total sum scaling) normalization is performed first
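As an illustration, TSS (total sum scaling) divides each count by the total of its sample, so that every sample sums to 1. The Python sketch below normalizes a toy matrix and then ranks OTUs by standard deviation. It is not the workflow's implementation: the function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics

def tss_normalize(matrix):
    """Divide each count by the total of its sample (column)."""
    n_samples = len(next(iter(matrix.values())))
    totals = [sum(counts[j] for counts in matrix.values()) for j in range(n_samples)]
    return {
        otu: [count / total if total else 0.0 for count, total in zip(counts, totals)]
        for otu, counts in matrix.items()
    }

def top_by_std(matrix, n):
    """Return the n OTUs with the highest standard deviation after TSS."""
    normalized = tss_normalize(matrix)
    ranked = sorted(
        normalized,
        key=lambda otu: statistics.pstdev(normalized[otu]),
        reverse=True,
    )
    return ranked[:n]

# Toy matrix: OTUs as rows, two samples as columns.
matrix = {"a": [10, 90], "b": [50, 5], "c": [40, 5]}
print(tss_normalize(matrix))
print(top_by_std(matrix, 2))
```

Normalizing first keeps samples with a large sequencing depth from dominating the ranking.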
Example
"pre-filter" : {
"otus" : {
"filterOn": "index",
"regexp": "Bacteria|Archaea",
"keepNa" : true
},
"samples":{
"filterOn" : "Depth",
"regexp":"SRF|DCM"
}
},
"groupBy":{
"samples":{
"Depth":["SRF|DCM"],
"Marine.biome":["Westerlies Biome","Polar Biome",
"Coastal Biome", "Trades Biome"]
}
},
"filter":{
"otus":[{"name":"prevalence",
"cutoff-percent":0.33
}, {"name":"standard_deviation","cutoff-percent":0.5}]
}
Keep OTUs that have Bacteria or Archaea in their index name
Keep samples with SRF or DCM in the Depth column of the context file
Group together samples from SRF or DCM
Split samples according to the values in the Marine.biome column
Filter OTUs first according to prevalence (remove OTUs that appear in fewer than 1/3 of the samples)
Then keep only the 50% most variable OTUs (highest standard deviation)
Graph inference : SpiecEasi
Option name | Option value | Default value |
---|---|---|
nc | Number of cores to use | 4 |
lambda.min.ratio | Determines lambda minimum value in the LASSO regularization | 0.01 |
nlambda | Number of lambdas to test | 20 |
rep.num | Number of subsampling steps in the STARS procedure | 20 |
stars.thresh | STARS variability threshold | 0.05 |
To get a better understanding of these parameters, please refer to the SpiecEasi documentation.
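To give an intuition of how `lambda.min.ratio` and `nlambda` interact, the Python sketch below builds a grid of regularization values. It assumes a log-spaced grid going from a maximum lambda down to that maximum multiplied by `lambda.min.ratio`; the function name and the value of `lambda_max` are hypothetical, and the exact grid used by SpiecEasi should be checked in its documentation.

```python
import math

def lambda_path(lambda_max, lambda_min_ratio=0.01, nlambda=20):
    """Log-spaced values from lambda_max down to lambda_max * lambda_min_ratio."""
    log_max = math.log(lambda_max)
    log_min = math.log(lambda_max * lambda_min_ratio)
    step = (log_min - log_max) / (nlambda - 1)
    return [math.exp(log_max + i * step) for i in range(nlambda)]

# lambda_max = 1.0 is a placeholder; in practice it is derived from the data.
path = lambda_path(1.0)
print(len(path), path[0], path[-1])
```

Under this assumption, a smaller `lambda.min.ratio` extends the grid toward denser networks, while a larger `nlambda` makes the grid finer at the cost of more model fits.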
By edelage