“Analysis of omics data” course
Skoltech Term 4
27 April 2020
This presentation can be found at https://slides.com/agalicina/epigenetics-practice4-2020
For EpiPract 1, we obtained a bigWig file with the coverage in your experiments:
We may need these files for the last practice on data association (EpiPract5). You can exchange these files with three of your colleagues in any preferable way (e.g. e-mail or copying from the folder on cluster).
For your convenience, I've created the folder for exchange:
Feel free to copy your file to this directory and use other files from there.
$ mv <file.genomecov.bw> <YOUR_CELL_LINE-CHROMOSOME.bw>
$ cp ${your_file} /home/shared/EpiPract1/
Adopted from Imakaev et al. Nature Methods 2012
Bonev et al. Nature Reviews 2016
Falk et al. Nature 2019
b. Average number of contacts between pairs of chromosomes. Average cis contacts are much higher than trans contacts.
a. Hi-C maps for chr 1, 2, 3 demonstrating territoriality:
Lieberman-Aiden, 2009
Bonev et al. Nature Reviews 2016
Lieberman-Aiden et al. Nature 2009
Lieberman-Aiden et al. Nature 2009
Lieberman-Aiden et al. Nature 2009
Falk et al. Nature 2019
Usual scenario: euchromatin resides in the interior, heretochromatin is outside. On comparison with Hi-C, you see the compartments have different strength:
Falk et al. Nature 2019
Non-typical scenario, inverted nuclei when the euchromatin is peripheral:
inverted
inverted
usual
Note that compartments on Hi-C are the same for two fundamentally different structures of thymocytes nuclei.
Procedure:
1. Call compartments.
2. Calculate observed over expected for Hi-C,
3. Reorder Hi-C rows and columns by the 1st component of PCA,
4. Average neighbouring pixels.
Bonev et al. Nature Reviews 2016
Filippova et al. Algorithms for Molecular Biology 2014
TADs are hierarchical, there is no single solution for the TAD calling problem:
Forcato et al. Nature Methods 2017
based on Crane, 2015
Bonev et al. Nature Reviews 2016
Bonev et al. Nature Reviews 2016
Different names for the same feature:
Rao et al. Cell 2014
In mammals, loops are associated with CTCF binding motif with a particular type of orientation:
Li et al. Nature 2020
Mechanism of loop extrusion as an explanation of formation of bright dots:
Rao et al. Cell 2014
Due to this mechanism, dots are also hierarchical:
Forcato et al. Nature Methods 2017
There is a variety of tools for calculation of enriched contacts in Hi-C, but not all of the results can be considered loops:
Rao et al. Cell 2014
Hi-C Computational Unbiased Peak Search:
Flyamer Bioinformatics 2019
We can average Hi-C plots around all the pixels of the found loops:
https://github.com/Phlya/coolpuppy
We'll need conda environment "hic" (see instructions on activation can be found in EpiPract3). We will:
For two organisms:
For Drosophila Hug 2017: you may use your cool files obtained for EpiPract3 (*), but it's not required.
.mcool is a multi-resolution cool file for Hi-C data storage.
Copy these files to your current folder:
You can check what resolutions are available for your mcool-file:
Then you can get detailed information about your single-resolution cooler (let's stick to 10 Kb resolution throughout this practice):
Note that you access different cooler resolutions by this query: ${cool}::/resolutions/10000
$ cooler ls GM12878_Rao2014.hg19.1000.mcool
$ cooler info GM12878_Rao2014.hg19.1000.mcool::/resolutions/10000
Let's plot the dependence of contact probability from the genomic distance with HiCExplorer for Drosophila set.
The basic command for 10 Kb-resolution looks as follows:
However, we need also to:
$ hicPlotDistVsCounts --matrices ${file.mcool}::/resolutions/10000 --plotFile ${output.png}
Task 1 (1 point): Write down the final command that you've got. Explain briefly what it does and why you need queries/parameters.
Task 2 (1 point): Add the resulting plot to your report. What is the difference between two datasets? What are possibly due to technical and what are due to biological reasons?
Non-optimized example, similar to what you should get:
Text
Labels need to be corrected
Note the intercept in the beginning. It corresponds to contact probability at the smallest distance. Does it depend on coverage?
Do we get a strict line? If not, it might indicate more frequent/rare short- or long-distance contacts.
We will use HiCExplorer's hiCTransform and hiCPlotMatrix for plotting the observed over expected:
We will need to modify in these commands:
$ hicTransform --matrix ${file.mcool}::/resolutions/10000 --outFileName ${normalized.cool} --method obs_exp
$ hicPlotMatrix -m ${normalized.cool} -o ${output.png}
Task 3 (1 point): Add the resulting commands and plots with expected maps for 10 Mb-regions of S2 Drosophila and GM12878 human data to your report. What features can you observe on this plot? Are they present in both species? Do they have the same prominence and size in genomic units?
Task 4 (1 point*): Plot the same region for Drosophila embryogenesis dataset. What differences do you observe with S2 cell line? What might be explained by technical reasons and what by biological? Consult with the paper.
For this task, we will use cooltools Command Line Interface (CLI).
Compartments call can be divided into following steps:
! Note that we don't do phasing of compartments, so we don't know what is actually A and what is B.
Task 5 (2 point): Run for both Drosophila and human. Try -n 10 and 20. What are your conclusions?
Are both compartments strong for both species?
$ cooltools call-compartments ${file.mcool}::/resolutions/10000 -o ${compartments.tsv} --contact-type trans
$ cooltools compute-expected ${file.mcool}::/resolutions/10000 -o ${expected.tsv}
$ cooltools compute-saddle ${file.mcool}::/resolutions/10000 ${compartment.tsv}.trans.vecs.tsv ${expected.tsv} -o ${saddle} --fig png -n 10
We will use HiCExplorer HiCFindTADs functionality.
1. Call TADs:
2. Convert cool format to h5:
3. Create tracks.ini file (example on the right), with vim on nano.
4. Run plotting step:
Task 6 (2 points*): Read about TAD calling in HiCExplorer. What are other parameters that you can try to vary? Try 5-6 different sets, pick the best and describe your observations.
Task 7 (3 points*). Run the same for human and data on embryogenesis at 10 Kb resolution. Describe your observations.
hicFindTADs --matrix ${file.mcool}::/resolutions/10000 --outPrefix ${tmp.tads} --correctForMultipleTesting fdr --minDepth 30000 --maxDepth 10000000 --step 30000 --numberOfProcessors 2
[x-axis]
where = top
[hic matrix]
file = ${converted.file.h5}
title = Hi-C data
depth = 300000
transform = log1p
file_type = hic_matrix
[tads]
file = ${tads.file.domains.bed}
file_type = domains
border_color = black
color = none
overlay_previous = share-y
$ hicPlotTADs --tracks track.ini -o {tads.pnf} --region chr2L:0-1000000
$ hicConvertFormat --matrices ${file.mcool}::/resolutions/10000 -o ${converted.file.h5}.h5 --inputFormat cool --outputFormat h5
tracks.ini
Flyamer et al. Nature 2017
Stevens et al. Nature 2017
work in preparation
One Read-Based Interactions Annotation (ORBITA):
Number of contacts per restriction fragment for two snHi-C approaches:
hiclib as in Flyamer 2017
ORBITA - One Read-Based Interactions Annotation
We expect at most 4 contacts per restriction fragments (2 DNA copies per cell, 2 ends to be ligated)
work in preparation
Resulting Hi-C datasets for Drosophila melanogaster
work in preparation
Barcoding-based protocol:
Ramani et al. Nature methods 2017
Ramani et al. Nature methods 2017
Engreitz et al. Nature Reviews 2016
Variety of methods exist for RNA-DNA interactome assays:
Gavrilov, Zharikova, Mironov, Galitsyna, paper in review
Gavrilov, Zharikova, Mironov, Galitsyna, paper in review