“Analysis of omics data” course
Skoltech Term 4
23 April 2020
This presentation can be found at https://slides.com/agalicina/epigenetics-practice2-2020
Proteins and nucleic acids are the most important components of the cell. Their interactions are necessary for functioning and regulation. Types of interactions:
Wasserman & Sandelin, 2004
Engreitz et al. 2016
Protein binding is typically specific to the DNA sequence:
Binding pattern
DNA-protein binding
List of DNAs with binding events
courtesy of Ivan Kulakovsky, Pavel Mazin
Chromatin immunoprecipitation followed by sequencing (ChIP-Seq):
Unwanted factors contributing to sequencing & processing output:
The "Input" experiment is a ChIP-Seq control without the immunoprecipitation step. It is used to normalise the ChIP-Seq signal.
Park, Nat Reviews Genetics, 2009
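One simple way to picture normalisation by Input is a per-bin fold enrichment, where each track is first scaled by its library size. The sketch below is only an illustration: the function name, the pseudocount and the read counts are assumptions made for this example, and real peak callers use more elaborate statistical models.

```python
def fold_enrichment(chip, control, pseudocount=1.0):
    """Per-bin ChIP/Input ratio after scaling each track by its library size."""
    chip_total = sum(chip)
    control_total = sum(control)
    return [
        ((c + pseudocount) / chip_total) / ((i + pseudocount) / control_total)
        for c, i in zip(chip, control)
    ]


# Hypothetical read counts in four genomic bins (not real data).
chip_counts = [10, 50, 10, 10]
input_counts = [20, 20, 20, 20]

enrichment = fold_enrichment(chip_counts, input_counts)
# Index of the bin with the strongest enrichment over Input.
best_bin = max(range(len(enrichment)), key=enrichment.__getitem__)
print(best_bin, [round(x, 2) for x in enrichment])
```

Bins with a ratio well above 1 are candidates for true binding; bins that are high in both ChIP and Input are more likely to reflect the unwanted factors listed above.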
To predict binding events, we need to call peaks in ChIP-Seq data:
Mahony & Pugh 2015
Example output of the procedure:
What is the next step?
courtesy of Ivan Kulakovsky, Pavel Mazin
Question: Can you propose a computational approach to find it?
courtesy of Ivan Kulakovsky, Pavel Mazin
We need some advanced algorithms to find binding patterns.
courtesy of Ivan Kulakovsky, Pavel Mazin
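As a naive baseline for the question above, one could count k-mers in peak sequences and compare them with background sequences. The sketch below is not one of the advanced algorithms mentioned on the slide; the function names, the pseudocount and the toy sequences are all assumptions made for illustration, and the ratio is only meaningful when both sets have a comparable total length.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def enriched_kmers(peak_seqs, background_seqs, k=6, pseudocount=1.0):
    """Rank k-mers by (count in peaks) / (count in background), with a pseudocount."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    scores = {
        kmer: (n + pseudocount) / (bg[kmer] + pseudocount)
        for kmer, n in fg.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy sequences invented for this example (not real peaks).
peaks = ["GGCTATAATGCC", "ACTATAATTGCA", "CCGTATAATAGG"]
background = ["GGCCGGCCAAGT", "ACGTTGCAACGT", "CCGGAATTCCGG"]

top = enriched_kmers(peaks, background)
print(top[:3])
```

Exact k-mer counting breaks down as soon as the binding site tolerates mismatches, which is one reason dedicated motif discovery algorithms are needed.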
TATAAT TAAAAT TAATAT TGTAAT TATACT
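Consensus is the sequence of the most frequent letters at each position (here, TATAAT). Listing every letter observed at each position gives a degenerate pattern: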
T[AG][AT][AT][AC]T
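Both representations can be derived column by column from the five aligned sites. The following is a minimal sketch; the function names are chosen for this example and the sites are assumed to be of equal length and already aligned.

```python
from collections import Counter

sites = ["TATAAT", "TAAAAT", "TAATAT", "TGTAAT", "TATACT"]


def consensus(seqs):
    """Most frequent letter in each column of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def degenerate_pattern(seqs):
    """Bracket pattern listing every letter observed in each column."""
    parts = []
    for col in zip(*seqs):
        letters = sorted(set(col))
        parts.append(letters[0] if len(letters) == 1 else "[" + "".join(letters) + "]")
    return "".join(parts)


print(consensus(sites))
print(degenerate_pattern(sites))
```

Note that neither string records how often each letter occurs at a position.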
With the consensus sequence, we lose information:
courtesy of Pavel Mazin
Other representations:
Frequency matrix
Probability matrix
Matrix normalized by background
Position Weight Matrix (PWM)
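The representations listed above can be built one from another. The sketch below goes from counts to probabilities to log-odds weights for the five example sites; the pseudocount of 1 and the uniform background of 0.25 per base are assumptions made for this illustration, and other conventions exist.

```python
import math

ALPHABET = "ACGT"
sites = ["TATAAT", "TAAAAT", "TAATAT", "TGTAAT", "TATACT"]


def count_matrix(seqs):
    """Frequency matrix: number of times each base occurs at each position."""
    length = len(seqs[0])
    return {b: [sum(s[i] == b for s in seqs) for i in range(length)] for b in ALPHABET}


def probability_matrix(counts, pseudocount=1.0):
    """Probability matrix: counts plus pseudocount, normalised per column."""
    n_sites = sum(counts[b][0] for b in ALPHABET)
    total = n_sites + pseudocount * len(ALPHABET)
    return {b: [(c + pseudocount) / total for c in counts[b]] for b in ALPHABET}


def weight_matrix(probs, background=None):
    """PWM: log2 ratio of position probability to background probability."""
    background = background or {b: 0.25 for b in ALPHABET}  # assumed uniform
    return {b: [math.log2(p / background[b]) for p in probs[b]] for b in ALPHABET}


def score(seq, weights):
    """Sum of position weights for a sequence of the motif length."""
    return sum(weights[b][i] for i, b in enumerate(seq))


counts = count_matrix(sites)
probs = probability_matrix(counts)
pwm = weight_matrix(probs)
print(counts["T"], round(score("TATAAT", pwm), 2), round(score("GGGGGG", pwm), 2))
```

A sequence close to the consensus gets a high score, while an unrelated sequence gets a negative one, so the PWM can be slid along a genome to rank candidate sites.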
courtesy of Pavel Mazin
Approaches:
courtesy of Pavel Mazin
Specific and non-specific binding:
Slattery et al. 2014
Hsu et al. 2018, Slattery et al. 2014
We can also search for motifs in open chromatin regions.
A pipeline is a set of processing steps, usually wrapped in a single script, container, or tool.
Pipelines aim to be:
Some web-server based pipelines:
Some command-line-based pipelines:
Task 1 (no points, but if absent, the practice won't be counted as done):
Please store all the obtained files and results in your home folder under the EpiPract2 path. I will collect and check them in a week.
$ ssh username@servername
# Create the working directory and enter it:
$ mkdir EpiPract2
$ cd EpiPract2
# Clone the public nf-core ATAC-Seq repository and enter its directory:
$ git clone https://github.com/nf-core/atacseq.git
$ cd atacseq
$ export PATH="/home/galitsyna/anaconda3/bin:$PATH"
$ printf "envs_dirs:\n - /home/galitsyna/anaconda3/envs/" > .condarc
$ conda activate atacseq-nf
The test dataset that the authors provide is ATAC-Seq on Saccharomyces cerevisiae (yeast) cells from the Schep et al. paper.
You can find its details in the GEO database: GSE66386
Open the link and read about it.
Name | File Prefix for EpiPract2 |
---|---|
Anastasia Pivnyuk | OSMOTIC_STRESS_T0_R1_T1_1 |
Nikita Sharaev | OSMOTIC_STRESS_T0_R1_T1_2 |
Artemy Shumskiy | OSMOTIC_STRESS_T0_R2_T1_1 |
Dmitrii Kriukov | OSMOTIC_STRESS_T0_R2_T1_2 |
Pletenev Ilya | OSMOTIC_STRESS_T15_R1_T1_1 |
Konstantin Chernyshov | OSMOTIC_STRESS_T15_R1_T1_2 |
Ivan Kuznetsov | OSMOTIC_STRESS_T15_R2_T1_1 |
Anna Kalinina | OSMOTIC_STRESS_T15_R2_T1_2 |
Sofya Kasatskaya | OSMOTIC_STRESS_T0_R1_T1_1 |
Julia Bocharkina | OSMOTIC_STRESS_T0_R1_T1_2 |
Vasily Borodin | OSMOTIC_STRESS_T0_R2_T1_1 |
Sofia Kamalyan | OSMOTIC_STRESS_T0_R2_T1_2 |
Anna Krasivskaya | OSMOTIC_STRESS_T15_R1_T1_1 |
Aleksandra Ozerova | OSMOTIC_STRESS_T15_R1_T1_2 |
Victoria Kobets | OSMOTIC_STRESS_T15_R2_T1_1 |
Slesareva Anastasiia | OSMOTIC_STRESS_T15_R2_T1_2 |
Mikhail Moldovan | OSMOTIC_STRESS_T0_R1_T1_1 |
Viktor Mamontov | OSMOTIC_STRESS_T0_R1_T1_2 |
Evgeniia Alekseeva | OSMOTIC_STRESS_T0_R2_T1_1 |
Trofimova Anna | OSMOTIC_STRESS_T0_R2_T1_2 |
Task 5 (0.5 points): Describe the details of your experiment and answer the questions above. Is it similar to the others, or is it an outlier? By what criteria? (List at least one.)
We aim to study transcription factor binding in these samples. Thus, we should explicitly state that we need narrow peaks, which originate from TF binding.
Task 6 (1.5 points): What is the Fraction of Reads in Peaks for your experiment with and without the --narrow_peak option?
Why?
Is it consistent with the number of peaks for your experiment?
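For reference, the Fraction of Reads in Peaks (FRiP) is the number of reads overlapping at least one called peak divided by the total number of reads. The sketch below only makes this definition concrete on toy data; the function names, the half-open coordinate convention and the intervals are assumptions made for this example, and it is not how the pipeline computes the value in its report.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def frip(reads, peaks):
    """Fraction of reads that overlap at least one peak on the same chromosome."""
    if not reads:
        return 0.0
    in_peaks = sum(
        any(rc == pc and overlaps(rs, re_, ps, pe) for pc, ps, pe in peaks)
        for rc, rs, re_ in reads
    )
    return in_peaks / len(reads)


# Toy intervals as (chromosome, start, end); invented for this example.
toy_peaks = [("chrI", 100, 200), ("chrI", 500, 600)]
toy_reads = [
    ("chrI", 90, 140),    # overlaps the first peak
    ("chrI", 150, 190),   # inside the first peak
    ("chrI", 300, 350),   # between peaks
    ("chrII", 100, 150),  # same coordinates, different chromosome
]

print(frip(toy_reads, toy_peaks))
```

Because the denominator is fixed, FRiP changes between runs only through the set of called peaks, which is worth keeping in mind when comparing the two peak-calling modes.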
Buenrostro et al., 2013
Hsu et al. 2018
paired-end sequencing and mapping
Buenrostro et al., 2013
Nucleosome-free fraction
Nucleosome-bound fraction
Buenrostro et al., 2013
Nucleosome-free fraction
Nucleosome-bound fraction
Albanus et al., 2019
We can draw a V-plot around the positions of factor binding sites:
Albanus et al., 2019
Schep et al., 2015
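A V-plot places each paired-end fragment as a point: the x coordinate is the distance from the fragment midpoint to the site, and the y coordinate is the fragment length. The sketch below computes these points for one site; the function name, the flank size and the fragment coordinates are assumptions made for illustration.

```python
def v_plot_points(fragments, site, flank=100):
    """Return (midpoint offset from site, fragment length) for nearby fragments."""
    points = []
    for start, end in fragments:
        midpoint = (start + end) // 2
        offset = midpoint - site
        if abs(offset) <= flank:
            points.append((offset, end - start))
    return points


# Toy fragments as (start, end) on one chromosome; invented for this example.
fragments = [(960, 1040), (900, 1080), (1010, 1190), (2000, 2100)]
points = v_plot_points(fragments, site=1000)
print(points)
```

In practice the points from many sites are pooled into a 2D histogram, which is the image shown on the slides.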
We can also draw a V-plot around the positions of nucleosomes:
Find the insert size distribution in your report.
Highlight your ATAC-Seq experiment.
Task 7 (1 point): What is the approximate spacing between consecutive nucleosomes in your ATAC-Seq experiment? Highlight it on the plot and add it to the report.
Task 8 (1 point): Can you see nucleosome dimers? Trimers? Tetramers? Report and speculate why.
Task 9 (2 points): Which motifs were found? Report the logo of the best hit. What is the significance of the hit? Which factor might bind this motif (given the experimental setup and analysis)?
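One possible way to read the spacing off an insert size distribution is to locate its local maxima and take the differences between the nucleosomal ones. The sketch below runs on a synthetic histogram: the peak positions, the window, the minimum height and the 150 bp cutoff for nucleosomal fragments are arbitrary assumptions for this example and say nothing about what your data will show.

```python
import math


def local_maxima(hist, window=30, min_height=1.0):
    """Indices whose value is the maximum within +/- window and above min_height."""
    peaks = []
    for i, value in enumerate(hist):
        lo, hi = max(0, i - window), min(len(hist), i + window + 1)
        if value >= min_height and value == max(hist[lo:hi]):
            peaks.append(i)
    return peaks


def gaussian(x, mu, sigma):
    """Unnormalised Gaussian bump used to build the synthetic histogram."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


# Synthetic histogram: index = fragment length in bp, value = fragment count.
hist = [
    1000 * gaussian(x, 50, 15) + 400 * gaussian(x, 200, 20)
    + 150 * gaussian(x, 390, 25) + 50 * gaussian(x, 580, 30)
    for x in range(800)
]

peaks = local_maxima(hist)
# Assumed cutoff: maxima above 150 bp are treated as nucleosome-bound fragments.
nucleosomal = [p for p in peaks if p > 150]
spacing = [b - a for a, b in zip(nucleosomal, nucleosomal[1:])]
print(peaks, spacing)
```

On real data the histogram is noisy, so smoothing it first or simply reading the maxima from the plot may work better.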
$ genomepy install sacCer3 UCSC --annotation
$ conda activate gimme
$ gimme motifs <your-narrowPeakFile> gimme.denovo.output -g sacCer3 --denovo
Task 10 (2 points): Which motifs were found? Report the logo of the best hit. What is the significance of the hit? Is it similar to the one that was found de novo?
$ gimme motifs <yourfile> -p HOMER --known -g sacCer3