“Analysis of omics data” course
Skoltech
28 February 2022
Proteins and nucleic acids are the most important components of the cell. Their interactions are necessary for functioning and regulation. Types of interactions:
Wasserman & Sandelin, 2004
Engreitz et al. 2016
Binding of protein is typically specific to DNA sequence:
Binding pattern
DNA-protein binding
List of DNAs
with binding events
courtesy of Ivan Kulakovsky, Pavel Mazin
Chromatin-immunoprecipitation followed by sequencing:
Unwanted factors contributing to sequencing & processing output:
"Input" experiment is a ChIP-Seq control without precipitation step. It is used for normalisation of ChIP-Seq.
Park, Nat Reviews Genetics, 2009
In order to predict binging events we need to call peaks in ChIP-Seq data:
Mahony & Pugh 2015
courtesy of Ivan Kulakovsky, Pavel Mazin
Question: Can you propose a computational approach to find it?
courtesy of Ivan Kulakovsky, Pavel Mazin
courtesy of Ivan Kulakovsky, Pavel Mazin
We need some advanced algorithms to find binding patterns.
courtesy of Ivan Kulakovsky, Pavel Mazin
TATAAT TAAAAT TAATAT TGTAAT TATACT
Consensus is the sequence of the most frequent letters:
T[AG][AT][AT][AC]T
With the consensus sequence, we lose the information:
courtesy of Pavel Mazin
Other representations:
Frequency matrix
Probability matrix
Matrix normalized by background
Position Weight Matrix (PWM)
courtesy of Pavel Mazin
courtesy of Pavel Mazin
Approaches:
Approaches:
courtesy of Pavel Mazin
Specific and non-specific binding:
Slattery et al. 2014
Buenrostro et al., 2013
Meyer & Liu Nat. Rev. Gen. 2014
Meyer & Liu Nat. Rev. Gen. 2014
Meyer & Liu Nat. Rev. Gen. 2014
Hsu et al. 2018
paired-end sequencing and mapping
Buenrostro et al., 2013
Buenrostro et al., 2013
Buenrostro et al., 2013
Nucleosome-free fraction
Nucleosome-bound fraction
Buenrostro et al., 2013
Nucleosome-free fraction
Nucleosome-bound fraction
Albanus et al., 2019
We can plot V-plot around positions of factors binding sites:
Albanus et al., 2019
Schep et al., 2015
We can also plot V-plot around positions of nucleosomes:
Hsu et al. 2018, Slattery et al. 2014
We can also do motifs search in open chromatin regions
Specific and non-specific binding:
Slattery et al. 2014
Cazares et al. 2022, bioRxiv preprint
Kaya-Okur and Henikoff et al. 2019 Efficient low-cost chromatin profiling with CUT&Tag
Cleavage Under Targets and Release Using Nuclease (CUT&Run)
Cleavage Under Targets & Tagmentation (CUT&Tag)
Kaya-Okur and Henikoff et al. 2019 Efficient low-cost chromatin profiling with CUT&Tag
Cleavage Under Targets and Release Using Nuclease (CUT&Run)
Cleavage Under Targets & Tagmentation (CUT&Tag)
DNase-Seq coverage for ∼350-kb region:
Thurman et al. Nature 2012
Different cell types
Cusanovich et al. Cell 2018
Let's check the similarity of pooled tissue samples with DNase-seq from known tissues. For that, let's plot bi-clustered heatmap of Spearman correlation coefficients:
Result of clustering,
the tree reflecting the similarity between samples
Cusanovich et al. Cell 2018
Finally, let's do dimensionality reduction, where each dot is a single sample.
Here is an example with t-SNE, although PCA (Principal Component Analysis) and MDS (Multidimensional scaling) are other popular techniques:
We will use deeptools commands multiBigwigSummary and plotCorrelation to compare the ChIP-Seq experiments. We will construct bi-clustered heatmaps of correlations:
Text
echo "Bioinformatics has never been easier! Copy&paste wisely, or rm -rf ~/ will strike you one day."
Review of some useful bash commands:
ls
pwd
less file.txt
some_command -h
some_command --help
man command
cp target_file destination_folder
cd directory_name/
cd ../
ls -hts
mkdir
gzip -d file.gz
unzip file.zip
1. Login to the server via terminal:
or use Putty on Windows.
(example instructions https://www.ssh.com/ssh/putty/windows/#sec-Configuration-options-and-saved-profiles)
2. Create the folder with your project:
3. Activate prepared environment:
(if you have problems with it, please, contact the TA)
4. Check your environment:
ssh username@servername
mkdir EpiPract1
cd EpiPract1
export PATH="/home/a.galitsyna/conda/bin:$PATH"
conda activate chipseq
ls # Should list all the filesin the directory
pwd # Should result in your currentl location
conda list # Should list all the programs installed, if conda imported successfully.
less /home/a.galitsyna/data
mkdir ~/lesson6/
cp /home/a.galitsyna/data/reads/${your_files.fastq.gz} ~/lesson6/
cd ~/lesson6
ls -hts
gzip -d <file.fastq.gz> # This will result in <file.fastq> unzipped file
less <file.fastq>
wc -l <file.fastq>
fastqc <file.fastq>
Let's review the NGS data processing workflow:
fastqc <input.fastq>
ls
# input.fastq input_fastqc/
cd input_fastqc/
ls
# fastqc_data.txt fastqc_report.html Icons Images summary.txt
Don't run this code, it's FYI
(Manual: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)
bowtie2-build -h
bowtie2-build chr1.fa,chr2.fa,chr3.fa,chr4.fa hg38
# Check the results:
ls -hs
# 938M hg38.1.bt2 701M hg38.2.bt2 12K hg38.3.bt2 701M hg38.4.bt2 3.0G hg38.fa 938M hg38.rev.1.bt2 701M hg38.rev.2.bt2
bowtie2-inspect -n hg38
Don't run this code, it's FYI
FASTQ
+ indexed database
-> SAM file
(Manual: http://www.htslib.org/doc/samtools.html)
samtools view file.sam
samtools view -F 4 input.sam
samtools stats file.sam
samtools view -b file.sam > file.bam
(Manual: https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
BED format specification: https://genome.ucsc.edu/FAQ/FAQformat.html)
bedtools bamtobed -i file.bam > file.bed
map
intersect
Full set of compiled tools: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
bedGraphToBigWig -h
These steps are time-consuming, but we have to complete them:
cd GENOME/
bowtie2-build dm6.fa
bowtie2-build -h
bowtie2 -h
bowtie2 --very-sensitive -x ./GENOME/dm6 <file.fastq> -S <file.sam>
less <file.sam>
ls
cd ../
Don't run this code, it's FYI
samtools view -h -F 4 <file.sam> | less
samtools --help
samtools view -h -q 10 <file.sam> > <file_filtered.sam>
samtools view -b -q 10 <file.sam> > <file_filtered.bam>
samtools sort <file_filtered.bam> > <file_filtered.sorted.bam>
Don't run this code, it's FYI
bedtools makewindows -g GENOME/chrom_sizes.tsv -w 1000 > hg38_windows.1000.bed
grep "chr5" hg38_windows.1000.bed > chr5_windows.1000.bed
bedtools genomecov -ibam <file_filtered.sorted.bam> -bga -split > <file.genomecov.bedgraph>
bedtools genomecov -h
bedtools makewindows -h
bedtools coverage -h
bedtools coverage -counts -a chr5_windows.1000.bed -b <file_filtered.sorted.bam> > <file.binned.bedgraph>
ls
less <file.genomecov.bedgraph>
less <file.binned.bedgraph>
Don't run this code, it's FYI
We will use UCSC browser for bedgraph files visualization.
mv <file.genomecov.bw> <YOUR_CELL_LINE-CHROMOSOME.bw>
bedGraphToBigWig <file.genomecov.bedgraph> ./GENOME/chrom_sizes.tsv <file.genomecov.bw>
Task 1 (1 point): Provide the final set of commands that you've used and briefly justify the parameters.
Task 2 (1 point): Report the final scatteplot for 100 Kb. Describe your observations.
Task 3 (*1 point): Repeat the procedure for 10 Kb and 1 Kb and report the changes in observations. At what resolution you might observe the differential DNase I signal?
multiBigwigSummary bins --binSize 100000 -b ${file1.bw} ${file2.bw} ${file3.bw} ${file4.bw} -o ${summary.npz}
plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot scatterplot -o ${scatterplot.png}
6. Plot correlations for 100 Kb as a heatmap. Use the parameters that you found to be representative for the scatterplots. For example:
7. Inspect and describe your observations. Adjust the visualization, if needed.
plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o ${heatmap.png}
Result of clustering
Correlation coefficient
Labels are not representative
(correct if needed)
We will:
We'll need conda environment "hic" (see instructions on activation can be found in EpiPract3).