“Analysis of omics data” course
Skoltech Term 4
24 April 2020
This presentation can be found at https://slides.com/agalicina/epigenetics-practice3-2020
Ulianov et al. Genome Biology 2016
Microscopy:
Microscopy with fluorescing marks:
(FISH)
For 2 marks:
For multiple marks:
3C: Dekker et al. Science 2002
Viewpoint
High-throughput chromosome conformation capture (Hi-C):
DNA interactome map:
Lieberman-Aiden et al. Science 2009
Adapted from Schmitt et al. Nature Reviews 2016
Adapted from Imakaev et al. Nature Methods 2012
Bonev et al. Nature Reviews 2016
Lupiáñez et al., Cell, 2015
Bonev et al. Nature Reviews 2016
1. Reads mapping
2. Contacts filtration
3. Contacts map retrieval
4. Balancing and map normalization
5. Features calling (TADs, compartments, loops)
Adapted from Lajoie et al., The Hitchhiker's guide to Hi-C analysis: Practical guidelines. Methods 2015
Iterative mapping or mapping allowing chimeric reads (split read alignment, e.g. bwa mem)
Imakaev et al. Nature Methods 2012
Hi-C restriction fragments are assigned to bins (consecutive genomic windows of equal size), and the contacts of all fragments within each pair of bins are summed:
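As a minimal sketch of this aggregation, assume the contacts are already given as pairs of coordinates on a single chromosome. The function name `bin_contacts` and the toy coordinates are illustrative only and do not come from any Hi-C tool:

```python
import numpy as np

def bin_contacts(pos1, pos2, chrom_length, bin_size):
    """Sum pairwise contacts into a symmetric bin-by-bin matrix."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    i = np.asarray(pos1) // bin_size       # bin index of the first contact side
    j = np.asarray(pos2) // bin_size       # bin index of the second contact side
    matrix = np.zeros((n_bins, n_bins), dtype=np.int64)
    np.add.at(matrix, (i, j), 1)           # add.at handles repeated index pairs
    off_diagonal = i != j
    np.add.at(matrix, (j[off_diagonal], i[off_diagonal]), 1)  # mirror to keep symmetry
    return matrix

# Toy contacts (made-up coordinates) on a 60 kb chromosome, 20 kb bins
pos1 = [1_000, 5_000, 25_000, 41_000]
pos2 = [15_000, 45_000, 30_000, 59_000]
contact_map = bin_contacts(pos1, pos2, chrom_length=60_000, bin_size=20_000)
print(contact_map)
```

Larger bins collect more contacts per cell and give a less noisy but coarser map, which is why the bin size is chosen according to sequencing depth.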
Two major types of approaches:
Adapted from Schmitt et al. Nature Reviews 2016
Schmitt et al. Nature Reviews 2016
Imakaev et al. Nature Methods 2012
Forcato et al. Nature Methods 2017
There are various tools for Hi-C data processing:
1. Mapping
2. Filtering, sorting, creation of the contacts list
3. Binning and normalization
4. Aggregation
Output: multires.cool (mcool) file
All of the processing steps are accessible in the form of pipelines,
e.g. distiller-nf:
https://github.com/mirnylab/distiller-nf
1. Log in to the server via terminal:
ssh username@servername
or use PuTTY on Windows
(example instructions: https://www.ssh.com/ssh/putty/windows/#sec-Configuration-options-and-saved-profiles)
2. Create a folder for your project:
mkdir EpiPract3
cd EpiPract3
3. Activate the prepared environment:
export PATH="/home/galitsyna/anaconda3/bin:$PATH"
conda activate hic
(if you have problems with it, please contact the TA)
4. Check your environment:
ls # Should list all the files in the directory
pwd # Should print your current location
conda list # Should list all installed programs if conda was set up successfully
$ cp /home/galitsyna/EpiPract3/fastq/your_file_{1,2}.fastq.gz ~/EpiPract3/  # both mates of your read pair, see the table below
$ mkdir ~/EpiPract3/genome/
$ cp /home/galitsyna/EpiPract3/genome/* ~/EpiPract3/genome/
$ ls -hts
$ ls -hts genome/
Name | EpiPract3 file, read 1 | EpiPract3 file, read 2 |
---|---|---|
Anastasia Pivnyuk | nuclear_cycle_12_repl_1_run1_1.fastq.gz | nuclear_cycle_12_repl_1_run1_2.fastq.gz |
Nikita Sharaev | nuclear_cycle_12_repl_1_run2_1.fastq.gz | nuclear_cycle_12_repl_1_run2_2.fastq.gz |
Artemy Shumskiy | nuclear_cycle_12_repl_1_run3_1.fastq.gz | nuclear_cycle_12_repl_1_run3_2.fastq.gz |
Dmitrii Kriukov | nuclear_cycle_12_repl_1_run4_1.fastq.gz | nuclear_cycle_12_repl_1_run4_2.fastq.gz |
Pletenev Ilya | nuclear_cycle_12_repl_2_run1_1.fastq.gz | nuclear_cycle_12_repl_2_run1_2.fastq.gz |
Konstantin Chernyshov | nuclear_cycle_12_repl_2_run2_1.fastq.gz | nuclear_cycle_12_repl_2_run2_2.fastq.gz |
Ivan Kuznetsov | nuclear_cycle_13_repl_1_run1_1.fastq.gz | nuclear_cycle_13_repl_1_run1_2.fastq.gz |
Anna Kalinina | nuclear_cycle_13_repl_1_run2_1.fastq.gz | nuclear_cycle_13_repl_1_run2_2.fastq.gz |
Sofya Kasatskaya | nuclear_cycle_13_repl_1_run3_1.fastq.gz | nuclear_cycle_13_repl_1_run3_2.fastq.gz |
Julia Bocharkina | nuclear_cycle_12_repl_2_run1_1.fastq.gz | nuclear_cycle_12_repl_2_run1_2.fastq.gz |
Vasily Borodin | nuclear_cycle_14_repl_1_run1_1.fastq.gz | nuclear_cycle_14_repl_1_run1_2.fastq.gz |
Sofia Kamalyan | nuclear_cycle_14_repl_1_run2_1.fastq.gz | nuclear_cycle_14_repl_1_run2_2.fastq.gz |
Anna Krasivskaya | nuclear_cycle_14_repl_1_run3_1.fastq.gz | nuclear_cycle_14_repl_1_run3_2.fastq.gz |
Aleksandra Ozerova | nuclear_cycle_14_repl_1_run4_1.fastq.gz | nuclear_cycle_14_repl_1_run4_2.fastq.gz |
Victoria Kobets | 3-4h_repl_1_run1_1.fastq.gz | 3-4h_repl_1_run1_2.fastq.gz |
Slesareva Anastasiia | 3-4h_repl_1_run2_1.fastq.gz | 3-4h_repl_1_run2_2.fastq.gz |
Mikhail Moldovan | 3-4h_repl_1_run3_1.fastq.gz | 3-4h_repl_1_run3_2.fastq.gz |
Viktor Mamontov | 3-4h_repl_1_run4_1.fastq.gz | 3-4h_repl_1_run4_2.fastq.gz |
Evgeniia Alekseeva | 3-4h_repl_2_run1_1.fastq.gz | 3-4h_repl_2_run1_2.fastq.gz |
Trofimova Anna | 3-4h_repl_2_run2_1.fastq.gz | 3-4h_repl_2_run2_2.fastq.gz |
The files that should be in the genome folder:
The files that should be in your working directory:
# Names in ${...} are placeholders for your own file paths
$ bwa mem -t 1 -v 3 -SP ${genome_fa} ${fastq_file1} ${fastq_file2} > ${output_sam}
$ samtools view -bS ${output_sam} > ${output_bam}
$ pairtools parse --walks-policy 5unique -c ${chromsizes} ${output_bam} -o ${output_pairs}
# pairtools dedup expects sorted pairs, so sort them first
$ pairtools sort ${output_pairs} -o ${sorted_pairs}
$ pairtools dedup --output-stats ${output_stats} ${sorted_pairs} -o ${nodup_pairs}
$ cooler cload pairs -c1 2 -c2 4 -p1 3 -p2 5 ${chromsizes}:20000 ${nodup_pairs} ${output_cool}
Task 1 (1 point): Report the number of uniquely mapped pairs, and the number and fraction of pairs removed by deduplication.
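The numbers for this task can be read from the deduplication statistics file. The sketch below assumes that the file is a two-column, tab-separated table containing the keys `total_mapped` and `total_dups`; check the key names in your own file, since they may differ between pairtools versions. The helper names and the example counts are made up for illustration:

```python
def parse_stats(text):
    """Parse 'key<TAB>value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        key, value = line.split("\t", 1)
        stats[key] = value
    return stats

def dedup_summary(stats):
    """Return mapped pairs, duplicates and the fraction removed."""
    mapped = int(stats["total_mapped"])  # assumed key name
    dups = int(stats["total_dups"])      # assumed key name
    return {
        "mapped": mapped,
        "duplicates": dups,
        "fraction_removed": dups / mapped,
    }

# Made-up counts, only to show the expected layout of the file
example_stats = "total\t1000\ntotal_mapped\t800\ntotal_dups\t200\ntotal_nodups\t600\n"
summary = dedup_summary(parse_stats(example_stats))
print(summary)
```

For a real run, replace `example_stats` with the contents of your statistics file, e.g. `open(path).read()`.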
Task 2 (2 points): Use cooler show to visualize the data in png format for any 2 Mb genomic region, before and after correction (balancing). Report the resulting figures and your observations.
$ cooler balance ${input_cool}
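To give an intuition for what balancing does, here is a minimal sketch of iterative correction in the spirit of Imakaev et al. 2012: each bin gets a bias factor such that, after dividing every contact by the biases of its two bins, all rows of the map have the same total. This is a simplified illustration on a toy matrix with an empty main diagonal, not the cooler implementation, which additionally masks poorly covered bins and checks for convergence:

```python
import numpy as np

def iterative_correction(matrix, n_iter=50):
    """Balance a symmetric contact matrix so that all row sums become equal."""
    m = np.asarray(matrix, dtype=float).copy()
    bias = np.ones(m.shape[0])
    for _ in range(n_iter):
        row_sums = m.sum(axis=1)
        scale = row_sums / row_sums[row_sums > 0].mean()
        scale[scale == 0] = 1.0       # leave empty bins untouched
        m /= np.outer(scale, scale)   # divide each contact by both bin factors
        bias *= scale                 # accumulate the total bias per bin
    return m, bias

# Toy symmetric map (made-up counts) with an empty main diagonal
raw = np.array([[0, 4, 2],
                [4, 0, 8],
                [2, 8, 0]])
balanced, bias = iterative_correction(raw)
print(balanced.sum(axis=1))  # row sums are now nearly identical
print(bias)                  # per-bin bias factors
```

The balanced map equals the raw map divided by the outer product of the bias vector, so only one number per bin has to be stored alongside the raw counts.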
Lieberman-Aiden, 2009
Task 3 (optional*): Convert your cool file to h5 with HiCExplorer command hicConvertFormat. Run hicInfo on the resulting file. What are the differences from cooler info?
Task 4 (optional*): Create the scaling plot for your dataset with hicPlotDistVsCounts. What are your observations?
We'll try to plot our own scaling plots on the data with HiCExplorer.
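A scaling plot shows how the average contact frequency decays with genomic distance. As a rough sketch of the underlying calculation (not the HiCExplorer code), the mean of each diagonal of a dense contact matrix can be taken as the contact frequency at that separation. The function name `contact_scaling` and the toy matrix with a 1/s decay are assumptions made for illustration:

```python
import numpy as np

def contact_scaling(matrix, bin_size):
    """Mean contact frequency for each genomic separation (one value per diagonal)."""
    m = np.asarray(matrix, dtype=float)
    n = m.shape[0]
    distances = np.arange(1, n) * bin_size
    mean_contacts = np.array(
        [np.diagonal(m, offset=k).mean() for k in range(1, n)]
    )
    return distances, mean_contacts

# Toy matrix in which contacts decay as 1/s with separation s (in bins)
n = 6
idx = np.arange(n)
separation = np.abs(idx[:, None] - idx[None, :])
toy = np.where(separation > 0, 1.0 / np.maximum(separation, 1), 0.0)

distances, p_s = contact_scaling(toy, bin_size=20_000)
# Slope of the scaling curve in log-log coordinates
slope = np.polyfit(np.log10(distances), np.log10(p_s), 1)[0]
print(distances, p_s, slope)
```

On real data the curve is usually plotted in log-log coordinates, and its slope is compared between samples or conditions.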
5 points in total (rescaled proportionally to 10 in Canvas).
Optional tasks (about HiCExplorer) can substitute for the points that you might miss in the required part.