“Analysis of omics data” course
Skoltech Term 4
30 April 2020
This presentation can be found at https://slides.com/agalicina/epigenetics-practice5-2020
Problem: I've collected multiple epigenomics datasets, what should I do with that?
Filion et al. Cell 2010
Let's say we have 53 ChIP-Seq datasets for the same cell line of Drosophila:
Filion et al. Cell 2010
Perform dimensionality reduction with PCA. Each dot is one genomic region:
Filion et al. Cell 2010
And then we can train some machine learning algorithm to obtain the groups by similarity:
Filion et al. Cell 2010
Then we can do annotation of these groups and derive the functions of these genomic regions (chromatin states, or colors):
Cusanovich et al. Cell 2018
Let's say we have chromatin openness for datasets for multiple cell types (different tissues):
DNase-Seq coverage for ∼350-kb region:
Thurman et al. Nature 2012
Different cell types
Cusanovich et al. Cell 2018
Let's check the similarity of pooled tissue samples with DNase-seq from known tissues. For that, let's plot bi-clustered heatmap of Spearman correlation coefficients:
Result of clustering,
the tree reflecting the similarity between samples
Cusanovich et al. Cell 2018
Finally, let's do dimensionality reduction, where each dot is a single sample.
Here is an example with t-SNE, although PCA (Principal Component Analysis) and MDS (Multidimensional scaling) are other popular techniques:
We will use deeptools commands multiBigwigSummary and plotCorrelation to compare the DNase-Seq experiments that we've obtained during EpiPact1. We will construct bi-clustered heatmaps of correlations:
Text
Task 1 (1 point): Provide the final set of commands that you've used and briefly justify the parameters.
Task 2 (1 point): Report the final scatteplot for 100 Kb. Describe your observations.
Task 3 (*1 point): Repeat the procedure for 10 Kb and 1 Kb and report the changes in observations. At what resolution you might observe the differential DNase I signal?
multiBigwigSummary bins --binSize 100000 -b ${file1.bw} ${file2.bw} ${file3.bw} ${file4.bw} -o ${summary.npz}
plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot scatterplot -o ${scatterplot.png}
6. Plot correlations for 100 Kb as a heatmap. Use the parameters that you found to be representative for the scatterplots. For example:
7. Inspect and describe your observations. Adjust the visualization, if needed.
plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o ${heatmap.png}
Result of clustering
Correlation coefficient
Labels are not representative
(correct if needed)
Task 4 (2 points): Provide the resulting heatmap and PCA plot with your report. What are the similar cell lines by DNase signal in your experiment? Can you explain it? Look up the literature on these cell lines and check the scatterplots in T.2, if needed.
Task 5 (*1 point): Repeat these steps at 10 Kb and 1 Kb. Do you observe the same clustering? Why?
6. Plot correlations for 100 Kb as a heatmap. Use the parameters that you found to be representative for the scatterplots. For example:
7. Inspect and describe your observations. Adjust the visualization, if needed.
8. Plot the PCA plot of your experiments:
plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o ${heatmap.png}
plotPCA --corData ${summary.npz} --plotFile ${pca.png}
Ulyanov, Khrameeva et al. Genome Research 2016
Example of association of more complex datasets:
Ulyanov, Khrameeva et al. Genome Research 2016
Plotting the profiles of enrichment around TAD boundaries (ChIP-Seq):
Ulyanov, Khrameeva et al. Genome Research 2016
Plotting the profiles of enrichment around TAD boundaries (chromatin colors):
work in preparation
One Read-Based Interactions Annotation (ORBITA):
Resulting Hi-C datasets for Drosophila melanogaster
work in preparation
Resulting maps and TADs for Drosophila:
work in preparation
work in preparation
Groups of boundaries based on the presence in single cells:
work in preparation
Groups of boundaries based on the presence in single cells:
We will:
We'll need conda environment "hic" (see instructions on activation can be found in EpiPract3).
We will work with 20 Kb resolution and use HiCExplorer HiCFindTADs functionality.
1. Call TADs with parameters that you can find in the table:
Task 6 (1 points): What were the generated files? How many TADs and boundaries you were able to find in your final version of code?
hicFindTADs --matrix ${file.mcool}::/resolutions/20000 --outPrefix ${tmp.tads} \
--minDepth ${minDepth} --maxDepth ${maxDepth} --step 20000 \
--correctForMultipleTesting fdr --thresholdComparisons ${Threshold} --delta ${Delta} \
--numberOfProcessors 2
Name | minDepth | maxDepth | Threshold | Delta | ChIP-Seq (from /home/galitsyna/EpiPract5/Drosophila_chipseq/ ) |
---|---|---|---|---|---|
Anastasia Pivnyuk | 60000 | 1000000 | 0.05 | 0.01 | S2-Chriz.bw |
Nikita Sharaev | 60000 | 1000000 | 0.05 | 0.01 | S2-CTCF.bw |
Artemy Shumskiy | 60000 | 1000000 | 0.05 | 0.01 | S2-H3K4me3.bw |
Dmitrii Kriukov | 60000 | 1000000 | 0.05 | 0.01 | S2-JIL1.bw |
Pletenev Ilya | 60000 | 1000000 | 0.05 | 0.01 | S2-RNAPolII.bw |
Konstantin Chernyshov | 60000 | 1000000 | 0.05 | 0.01 | S2-SuHw.bw |
Ivan Kuznetsov | 60000 | 1000000 | 0.05 | 0.01 | S2-WDS.bw |
Anna Kalinina | 60000 | 500000 | 0.01 | 0.01 | S2-Chriz.bw |
Sofya Kasatskaya | 60000 | 500000 | 0.01 | 0.01 | S2-CTCF.bw |
Julia Bocharkina | 60000 | 500000 | 0.01 | 0.01 | S2-H3K4me3.bw |
Vasily Borodin | 60000 | 500000 | 0.01 | 0.01 | S2-JIL1.bw |
Sofia Kamalyan | 60000 | 500000 | 0.01 | 0.01 | S2-RNAPolII.bw |
Anna Krasivskaya | 60000 | 500000 | 0.01 | 0.01 | S2-SuHw.bw |
Aleksandra Ozerova | 60000 | 500000 | 0.01 | 0.01 | S2-WDS.bw |
Victoria Kobets | 60000 | 200000 | 0.05 | 0.01 | S2-Chriz.bw |
Slesareva Anastasiia | 60000 | 200000 | 0.05 | 0.01 | S2-CTCF.bw |
Mikhail Moldovan | 60000 | 200000 | 0.05 | 0.01 | S2-H3K4me3.bw |
Viktor Mamontov | 60000 | 200000 | 0.05 | 0.01 | S2-JIL1.bw |
Evgeniia Alekseeva | 60000 | 200000 | 0.05 | 0.01 | S2-RNAPolII.bw |
Trofimova Anna | 60000 | 200000 | 0.05 | 0.01 | S2-SuHw.bw |
We will use HiCExplorer HiCFindTADs functionality.
1. Call TADs:
2. Convert cool format to h5:
3. Create tracks.ini file (see two slides below) with Hi-C map, TADs segmentation and your factor profile. Use vim on nano.
4. Run plotting step:
Task 7 (2 points): Report the resulting plot and describe your observations. Is your factor enriched relative to TAD boundaries or TAD bodies? Is this plot enough to draw this conclusion?
$ hicPlotTADs --tracks tracks.ini -o {tads.png} --region chr2L:0-1000000
$ hicConvertFormat --matrices ${file.mcool}::/resolutions/20000 -o ${converted.file.h5} --inputFormat cool --outputFormat h5
hicFindTADs --matrix ${file.mcool}::/resolutions/20000 --outPrefix ${tmp.tads} \
--minDepth ${minDepth} --maxDepth ${maxDepth} --step 20000 \
--correctForMultipleTesting fdr --thresholdComparisons ${Threshold} --delta ${Delta} \
--numberOfProcessors 2
TADs visualisation with HiCExplorer required tracks.ini file with plot description. It allows to adjust the very detail of your plot, add multiple tracks, see the documentation.
The minimal working version of tracks.ini file is placed below. It might be improved, if you try to change parameters in the file and produce the better vizualisation of TADs and heatmap features:
[x-axis]
where = top
[hic matrix]
file = ${converted.file.h5}
title = Hi-C data
depth = 300000
transform = log1p
file_type = hic_matrix
[tads]
file = ${tads.file.domains.bed}
file_type = domains
border_color = black
color = none
overlay_previous = share-y
tracks.ini
[bigwig]
file = ${YourFactor.bw}
title = ${Put Your Factor Name Here}
color = black
min_value = 0
max_value = auto
nans to zeros = True
Let's compare TADs boundaries with ChIP-Seq profiles using plotHeatmap (see Examples).
Task 8 (2 points): Provide the resulting enrichment profile. Describe your observations. Is your factor enriched at TADs boundaries? Is this number of boundaries enough to draw conclusions?
Task 9 (*2 points): Repeat the same with TAD bodies. Do you observe the opposite effect?
computeMatrix reference-point -S ${factor.bw} -R ${boundaries.bed} --referencePoint center -a 200000 -b 200000 -bs 20000 -out ${enrichment.tab.gz}
plotHeatmap -m ${enrichment.tab.gz} --outFileName ${enrichment.png} --heatmapHeight 15 --colorMap jet --sortRegions ascend --regionsLabel 'Put Your Factor Name here'