microSysMics

The JSON configuration file

The JSON configuration file plays the key role in the workflow.

Why?

By editing it, you will:

  • Indicate your input files and output directory
  • Select the tools you want to use
  • Add specific options to each tool

Remarks

Starting from the configuration file of the test dataset can be a good way to proceed, but this tutorial focuses on creating one from scratch.

This workflow relies on Qiime2, and many of the tools it uses are part of the Qiime2 toolbox. A default configuration is provided for most tools, but you should refer to the Qiime2 documentation when editing a section of the config file that is tied to a Qiime2 tool.

Creating the config file from scratch

Use your favorite editor to create the following minimal JSON structure:

{
    "global" : {
    }
}

The config file is divided into sections, each section corresponding to a workflow tool.

The global section lists all the parameters that are common to the whole workflow (e.g. input data, output directory, library type, ...).

Context file

{
    "global" :{
        "context" : "test/context.tsv"
    }
}

The first mandatory file is the one describing the experiment, also known as the context file, or sample metadata in Qiime2 terms.

It is a TSV file that must follow the format imposed by Qiime2:

#SampleID	BarcodeSequence	BodySite	Year
#q2:types	categorical	categorical	numeric
L1S8	AGCTGACTAGTC	gut	2008
L2S155	ACGATGCGACCA	left palm	2009

It contains information about the samples (body site, sex, ...).
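As a quick sanity check before running the workflow, the context file can be read with a few lines of Python. This is only a sketch: read_context is a hypothetical helper, not part of microSysMics or Qiime2, and it assumes the optional #q2:types row directly describes the columns of the header.

```python
import csv
import io


def read_context(handle):
    """Parse a Qiime2-style metadata TSV into (column types, sample rows).

    Hypothetical helper for illustration; not part of microSysMics.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    types = {}
    samples = {}
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        if row[0] == "#q2:types":
            # Optional directive row giving the type of each column
            types = dict(zip(header[1:], row[1:]))
            continue
        samples[row[0]] = dict(zip(header[1:], row[1:]))
    return types, samples


example = (
    "#SampleID\tBarcodeSequence\tBodySite\tYear\n"
    "#q2:types\tcategorical\tcategorical\tnumeric\n"
    "L1S8\tAGCTGACTAGTC\tgut\t2008\n"
    "L2S155\tACGATGCGACCA\tleft palm\t2009\n"
)
types, samples = read_context(io.StringIO(example))
print(types)
print(samples["L2S155"]["BodySite"])
```

Column names used later in the config file (for example in a filterOn entry) can be checked against the keys returned here.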

Input data

microSysMics lets you start the analysis from three types of data:

  1. Input directory containing compressed fastq files in CASAVA 1.8 format
  2. MANIFEST file listing fastq files
  3. OTU abundance matrix

The first two options start from sequence files; the last one assumes you have already performed the first part of the analysis on your own.

NB : Sequence files must already be demultiplexed !
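To make the three entry points concrete, here is a minimal sketch of how a global section could be mapped to one of them. The function detect_input_kind and its return values are hypothetical; the actual workflow may decide differently.

```python
def detect_input_kind(global_section):
    """Guess which entry point a 'global' section describes.

    Hypothetical helper: the rules below only mirror the tutorial
    (input-type flag, .csv manifest, otherwise a fastq directory).
    """
    if global_section.get("input-type") == "abundance_matrix":
        return "abundance_matrix"
    if global_section["input"].lower().endswith(".csv"):
        return "manifest"
    return "fastq_directory"


print(detect_input_kind({"input": "test/fastq"}))
print(detect_input_kind({"input": "test/manifest.csv"}))
print(detect_input_kind({"input": "test/abundance_matrix_test.tsv",
                         "input-type": "abundance_matrix"}))
```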

Input data : input directory

To use this option, your fastq files must be gathered in a single directory, named in the CASAVA 1.8 format and compressed as fastq.gz.

{
	"global" : {
		"input" : "test/fastq",
		"context" : "test/context.tsv"
	},
	"import" : {
		"type": "SampleData[PairedEndSequencesWithQuality]",
		"source-format" : "CasavaOneEightSingleLanePerSampleDirFmt"
	}
}

The input keyword in the global section indicates the input data path, here the test/fastq directory

The import section tells Qiime2 about the input data type and source format

NB : If you have single-end data, use SampleData[SequencesWithQuality] as type
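File names can be checked before launching the workflow. The sketch below assumes the usual CASAVA 1.8 naming as described in the Qiime2 importing documentation (sample identifier, barcode identifier, lane number, read direction, set number, for example L2S357_15_L001_R1_001.fastq.gz). The regular expression and parse_casava_name are illustrative, not the check Qiime2 itself performs.

```python
import re

# Assumed CASAVA 1.8 layout: <sample>_<barcode>_L<lane>_R<read>_<set>.fastq.gz
CASAVA_RE = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<set>\d{3})\.fastq\.gz$"
)


def parse_casava_name(filename):
    """Return the fields of a CASAVA 1.8 file name, or None if it does not match."""
    match = CASAVA_RE.match(filename)
    return match.groupdict() if match else None


print(parse_casava_name("L2S357_15_L001_R1_001.fastq.gz"))
print(parse_casava_name("sample.fastq"))
```

A file for which the function returns None would need to be renamed, or listed in a MANIFEST file instead.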

Input data : MANIFEST file

The MANIFEST is a CSV file mapping each sample ID to the absolute file path and read direction of its fastq files. It should respect some rules.

{
	"global" : {
		"input" : "test/manifest.csv",
		"context" : "test/context.tsv"
	},
	"import" : {
		"source-format" : "PairedEndFastqManifestPhred33"
	}
}

You can look up the source format that matches your data on the Qiime2 website.

sample-id,absolute-filepath,direction
TARA-085-DCM-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_085_DCM_0.22-3/BTR_ACQIOSW_2_1_HMLJJBCXX.12BA026_clean-min.fastq,forward
TARA-085-DCM-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_085_DCM_0.22-3/BTR_ACQIOSW_2_2_HMLJJBCXX.12BA026_clean-min.fastq,reverse
TARA-056-SRF-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_056_SRF_0.22-3/BTR_AERIOSW_2_2_HMLJJBCXX.12BA101_clean-min.fastq,reverse
TARA-056-SRF-0.22-3,/Users/delage-e/workspace/microSysMics/tara/TARA_056_SRF_0.22-3/BTR_AERIOSW_2_1_HMLJJBCXX.12BA101_clean-min.fastq,forward

Example of MANIFEST file

Extension must be csv !
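A manifest can be checked with a short script before it is handed to the workflow. This is a sketch under assumptions: check_manifest is a hypothetical helper, paths are taken to be POSIX-style, and the forward/reverse pairing check only makes sense for paired-end data. The paths in the example are made up.

```python
import csv
import io
import posixpath

EXPECTED_HEADER = ["sample-id", "absolute-filepath", "direction"]


def check_manifest(handle):
    """Return a list of problems found in a manifest (empty list: nothing found)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADER:
        return ["header must be: " + ",".join(EXPECTED_HEADER)]
    problems = []
    directions = {}
    for line_no, row in enumerate(reader, start=2):
        path = row["absolute-filepath"] or ""
        if not posixpath.isabs(path):
            problems.append(f"line {line_no}: path is not absolute")
        if row["direction"] not in ("forward", "reverse"):
            problems.append(f"line {line_no}: direction must be forward or reverse")
        directions.setdefault(row["sample-id"], set()).add(row["direction"])
    # Paired-end assumption: every sample needs one forward and one reverse file
    for sample, seen in directions.items():
        if seen != {"forward", "reverse"}:
            problems.append(f"{sample}: expected both forward and reverse reads")
    return problems


good = (
    "sample-id,absolute-filepath,direction\n"
    "S1,/data/S1_R1.fastq,forward\n"
    "S1,/data/S1_R2.fastq,reverse\n"
)
bad = (
    "sample-id,absolute-filepath,direction\n"
    "S1,reads/S1_R1.fastq,forward\n"
)
print(check_manifest(io.StringIO(good)))
print(check_manifest(io.StringIO(bad)))
```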

Input data : Abundance matrix

The abundance matrix is a tab-separated file with samples as columns and OTUs as rows

{
	"global" : {
		"context" : "test/context.tsv",
		"input-type" : "abundance_matrix",
		"input" : "test/abundance_matrix_test.tsv"
	}
}

You need to add "input-type" : "abundance_matrix" to the global section to tell the pipeline to skip the first part of the analysis and start from the abundance matrix.

Extension must be tsv !

Options

Each tool comes with its own set of options. The names of these options can be found in the tool's documentation.

For example, if we want to change the deblur trim length:

{
	"deblur": {
		"p-trim-length" : 200
	}
}

NB : You should not add input and output options to the tools, as snakemake handles them.
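Option names in a tool section follow the tool's command-line names without the leading dashes. As an illustration only, a section could be turned into command-line arguments as sketched below; to_cli_args is hypothetical and the workflow's own translation may differ (boolean flags, for instance, are not handled here).

```python
def to_cli_args(tool_options):
    """Turn a tool section of the config into a flat list of CLI arguments.

    Hypothetical helper: handles plain scalar values only.
    """
    args = []
    for name, value in tool_options.items():
        args.extend(["--" + name, str(value)])
    return args


config = {"deblur": {"p-trim-length": 200}}
print(to_cli_args(config["deblur"]))
```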

global section

Option name       Option value                                                      Default value
input             fastq directory, MANIFEST file (CSV) or abundance matrix (TSV)    MANDATORY
context           sample metadata (TSV)                                             MANDATORY
outdir            analysis output directory                                         out
library-type      single-end or paired-end                                          paired-end
denoiser          deblur or dada2                                                   deblur
graph_inference   FlashWeave or SpiecEasi                                           SpiecEasi
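The rules of the global section (two mandatory keys, defaults for the rest) can be summarised in code. This is a sketch: resolve_global is a hypothetical helper that only mirrors the option names, allowed values and defaults listed for the global section, not the workflow's actual implementation.

```python
import json

# Defaults and allowed values copied from the global section options
GLOBAL_DEFAULTS = {
    "outdir": "out",
    "library-type": "paired-end",
    "denoiser": "deblur",
    "graph_inference": "SpiecEasi",
}
MANDATORY = ("input", "context")
ALLOWED = {
    "library-type": {"single-end", "paired-end"},
    "denoiser": {"deblur", "dada2"},
    "graph_inference": {"FlashWeave", "SpiecEasi"},
}


def resolve_global(config):
    """Return the global section with defaults filled in, or raise ValueError."""
    section = dict(GLOBAL_DEFAULTS)
    section.update(config.get("global", {}))
    missing = [key for key in MANDATORY if key not in section]
    if missing:
        raise ValueError("missing mandatory option(s): " + ", ".join(missing))
    for key, allowed in ALLOWED.items():
        if section[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    return section


cfg = json.loads('{"global": {"input": "test/fastq", "context": "test/context.tsv"}}')
resolved = resolve_global(cfg)
print(resolved)
```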

Qiime2 tools

Tool name           Goal                                       Mandatory parameters and default values
cutadapt            Trim adapter sequences from reads          -
quality-filter      Filter reads according to some criteria    -
deblur              Denoise amplicon reads                     p-trim-length : 200
dada2               Denoise amplicon reads                     p-trunc-len-f : 0, p-trunc-len-r : 0
diversity           Compute diversity metrics                  p-sampling-depth : 1000
rarefaction         Produce rarefaction curves                 p-max-depth : 1000
phylogenetic_tree   Build a phylogenetic tree                  -

All possible options can be found on the Qiime2 website.

Filtering abundance matrix

The abundance matrix can be filtered according to several criteria, which can be combined.

"pre-filter" : {
    "samples":[{
        "filterOn" : "BodySite",
        "regexp":"gut|palm"}
    ]
 }

Pre-filtering on samples

Keep only samples that come from gut or palm

There must be a BodySite column in the context file
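The effect of such a pre-filter can be sketched as follows. It assumes the regular expression is searched anywhere in the column value (so that "left palm" matches "gut|palm"); prefilter_samples is a hypothetical helper and the workflow may apply the pattern differently.

```python
import re


def prefilter_samples(context, filter_on, regexp):
    """Keep the IDs of samples whose `filter_on` column matches `regexp`.

    Assumption: the pattern may match anywhere in the value (re.search).
    """
    pattern = re.compile(regexp)
    return [sample_id for sample_id, meta in context.items()
            if pattern.search(meta[filter_on])]


context = {
    "L1S8": {"BodySite": "gut"},
    "L2S155": {"BodySite": "left palm"},
    "L5S104": {"BodySite": "tongue"},  # made-up sample for the example
}
print(prefilter_samples(context, "BodySite", "gut|palm"))
```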

"pre-filter" : {
    "otus":[{
        "filterOn" : "lineage",
        "regexp":"Prokaryota"}
    ]
 }

Pre-filtering on OTUs

Keep only prokaryotes

There must be a lineage column in the abundance file

Filtering abundance matrix

We can also filter the matrix according to some statistical measures

"filter" : {
    "otus":[{
        "name" : "prevalence",
        "cutoff-percent":"0.33"}
    ]
 }

Filtering by prevalence

Keep OTUs that are present in at least 1 sample out of 3

Filtering on number of reads

Keep OTUs with at least 100 reads across all samples

"filter" : {
    "otus":[{
        "name" : "read_number",
        "cutoff":"100"}
    ]
 }
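Both filters can be written in a few lines, which may help when choosing cutoffs. The sketch below makes two assumptions: an OTU counts as present in a sample when its count is above zero, and both cutoffs are inclusive. The function names are hypothetical.

```python
def filter_by_prevalence(matrix, cutoff_percent):
    """Keep OTUs present (count > 0) in at least `cutoff_percent` of the samples."""
    kept = {}
    for otu, counts in matrix.items():
        prevalence = sum(1 for count in counts if count > 0) / len(counts)
        if prevalence >= cutoff_percent:
            kept[otu] = counts
    return kept


def filter_by_read_number(matrix, cutoff):
    """Keep OTUs whose total read count across all samples reaches `cutoff`."""
    return {otu: counts for otu, counts in matrix.items() if sum(counts) >= cutoff}


# Rows are OTUs, columns are samples (made-up counts)
matrix = {
    "otu1": [10, 0, 0],
    "otu2": [0, 0, 0],
    "otu3": [50, 60, 0],
}
print(sorted(filter_by_prevalence(matrix, 0.33)))
print(sorted(filter_by_read_number(matrix, 100)))
```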

Filtering abundance matrix

"filter" : {
    "otus":[{
        "name" : "standard_deviation",
        "cutoff" : 100}
    ]
 }

Filtering by standard deviation

Keep the 100 OTUs with the highest standard deviation

Filtering on abundance

Keep the 60% most abundant OTUs

"filter" : {
    "otus":[{
        "name" : "abundance",
        "cutoff-percent":"0.6"}
    ]
 }

For these two filters (standard deviation and abundance), a TSS (total sum scaling) normalization is performed first.
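TSS normalization divides each count by the total of its sample, so that every sample sums to 1. A minimal sketch of the normalization and of the two rankings follows. Several points are assumptions: abundance is taken as the summed relative abundance of an OTU, the population standard deviation is used, and the number of OTUs kept by a percent cutoff is rounded to the nearest integer. All function names are hypothetical.

```python
import statistics


def tss_normalize(matrix):
    """Total sum scaling: divide each count by its sample (column) total."""
    n_samples = len(next(iter(matrix.values())))
    totals = [sum(counts[j] for counts in matrix.values()) for j in range(n_samples)]
    return {
        otu: [count / total if total else 0.0 for count, total in zip(counts, totals)]
        for otu, counts in matrix.items()
    }


def top_by_standard_deviation(matrix, cutoff):
    """Return the `cutoff` OTUs with the highest standard deviation."""
    ranked = sorted(matrix, key=lambda otu: statistics.pstdev(matrix[otu]), reverse=True)
    return ranked[:cutoff]


def top_by_abundance(matrix, cutoff_percent):
    """Return the `cutoff_percent` fraction of OTUs with the highest abundance."""
    ranked = sorted(matrix, key=lambda otu: sum(matrix[otu]), reverse=True)
    return ranked[:round(cutoff_percent * len(ranked))]


# Rows are OTUs, columns are samples (made-up counts)
counts = {"a": [70, 10], "b": [25, 5], "c": [5, 85]}
normalized = tss_normalize(counts)
print(top_by_standard_deviation(normalized, 1))
print(top_by_abundance(normalized, 0.6))
```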

Filtering abundance matrix

Example

"pre-filter" : {
    "otus" : {
        "filterOn" : "index",
        "regexp" : "Bacteria|Archaea",
        "keepNa" : true
    },
    "samples" : {
        "filterOn" : "Depth",
        "regexp" : "SRF|DCM"
    }
},
"groupBy" : {
    "samples" : {
        "Depth" : ["SRF|DCM"],
        "Marine.biome" : ["Westerlies Biome", "Polar Biome",
                          "Coastal Biome", "Trades Biome"]
    }
},
"filter" : {
    "otus" : [
        {"name" : "prevalence", "cutoff-percent" : 0.33},
        {"name" : "standard_deviation", "cutoff-percent" : 0.5}
    ]
}

Keep OTUs that have Bacteria or Archaea in their index name

Keep samples with SRF or DCM in the Depth column of the context file

Group together samples from SRF or DCM

Split samples according to the values in the Marine.biome column

Filter OTUs first according to prevalence (remove OTUs that appear in fewer than 1/3 of the samples)

Then keep only the 50% of OTUs that vary the most (highest standard deviation)

Graph inference : SpiecEasi

Option name        Option value                                                       Default value
nc                 Number of cores to use                                             4
lambda.min.ratio   Determines the minimum lambda value in the LASSO regularization    0.01
nlambda            Number of lambdas to test                                          20
rep.num            Number of subsampling steps in the STARS procedure                 20
stars.thresh       STARS variability threshold                                        0.05

To get a better understanding of these parameters, please refer to the SpiecEasi documentation.
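To give an intuition of how lambda.min.ratio and nlambda interact, the sketch below builds a decreasing, log-spaced sequence of lambda values, which is a common way to lay out a LASSO regularization path. It is an assumption that SpiecEasi spaces its lambdas this way, and lambda_max is a placeholder: in practice the largest lambda is derived from the data.

```python
import math


def lambda_path(lambda_max, lambda_min_ratio=0.01, nlambda=20):
    """Log-spaced lambdas from lambda_max down to lambda_max * lambda_min_ratio.

    Illustrative only: assumes a log-spaced path and a known lambda_max.
    """
    log_max = math.log(lambda_max)
    log_min = math.log(lambda_max * lambda_min_ratio)
    step = (log_min - log_max) / (nlambda - 1)
    return [math.exp(log_max + i * step) for i in range(nlambda)]


path = lambda_path(1.0)
print(len(path), path[0], path[-1])
```

With this layout, a smaller lambda.min.ratio extends the path towards denser networks, and a larger nlambda samples the same range more finely at a higher computational cost.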


By edelage
