“Analysis of omics data” course
Skoltech
24 February 2022
"Computational analysis of
Sanger
Illumina, Roche 454, SOLiD
Nanopore
Wasserman & Sandelin. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 2004
Key determinant of expression is activation of promoters and enhancers
Promoter
One enhancer can regulate multiple genes,
while multiple enhancers can influence a single gene:
Wasserman & Sandelin. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 2004
Park 2009
echo "Bioinformatics has never been easier! Copy&paste wisely, or rm -rf ~/ will strike you one day."
Review of some useful bash commands:
ls
pwd
less file.txt
some_command -h
some_command --help
man command
cp source_file destination_folder
cd directory_name/
cd ../
ls -hts
mkdir
gzip -d file.gz
unzip file.zip
Park 2009
The cross-correlation typically peaks at the shift corresponding to the fragment length and the shift corresponding to the read length.
Bailey 2013
NSC (normalized strand cross-correlation coefficient):
The ratio between the cross-correlation at the fragment length and the background cross-correlation.
RSC (relative strand cross-correlation coefficient):
The ratio between the cross-correlation at the fragment length and the cross-correlation at the read length.
Very successful ChIP experiments generally have NSC > 1.05 and RSC > 0.8.
Bailey 2013
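The two coefficients can be computed from a table of cross-correlation values per strand shift. The following is only a sketch on made-up numbers: the file cc.tsv, its values, and the read length of 36 bp are assumptions for illustration. It also assumes the background is the minimum cross-correlation and subtracts that background from both values before taking the RSC ratio, which is a common refinement of the simpler wording above; drop the subtraction to follow the wording literally.

```shell
workdir=$(mktemp -d)
# Columns: strand shift (bp), cross-correlation value (hypothetical numbers).
printf '%s\t%s\n' 0 0.20 36 0.30 100 0.25 150 0.50 300 0.22 500 0.20 > "$workdir/cc.tsv"

read_len=36   # assumed read length; the "phantom" peak sits at this shift

metrics=$(LC_ALL=C awk -v rl="$read_len" '
    NR == 1 || $2 < min   { min = $2 }                     # background = minimum value
    $1 == rl              { phantom = $2 }                 # value at the read-length shift
    $1 != rl && $2 > frag { frag = $2; frag_shift = $1 }   # highest value at any other shift
    END {
        # Output: estimated fragment length, NSC, RSC
        printf "%d %.2f %.2f", frag_shift, frag / min, (frag - min) / (phantom - min)
    }' "$workdir/cc.tsv")

echo "fragment_length NSC RSC: $metrics"
```

With these toy values both coefficients are far above the quoted thresholds, as expected for an idealized profile.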
Park 2009
Elevated (absolute values)
Enriched (relative to input)
Park 2009
ENCODE Standards:
https://www.encodeproject.org/chip-seq/transcription_factor/
MACS2 is the most commonly used peak-calling software:
Binding of protein is typically specific to DNA sequence:
Example output of the procedure:
What is the next step?
Specific and non-specific binding:
Slattery et al. 2014
DNase-Seq coverage for ∼350-kb region:
Thurman et al. Nature 2012
Different cell types
Cusanovich et al. Cell 2018
Let's check the similarity of pooled tissue samples with DNase-seq data from known tissues. For that, let's plot a bi-clustered heatmap of Spearman correlation coefficients:
Result of clustering:
the tree reflects the similarity between samples
Cusanovich et al. Cell 2018
Finally, let's do dimensionality reduction, where each dot is a single sample.
Here is an example with t-SNE, although PCA (Principal Component Analysis) and MDS (Multidimensional scaling) are other popular techniques:
We will use deeptools commands multiBigwigSummary and plotCorrelation to compare the ChIP-Seq experiments. We will construct bi-clustered heatmaps of correlations:
1. Log in to the server via terminal:
ssh username@servername
or use PuTTY on Windows.
(example instructions: https://www.ssh.com/ssh/putty/windows/#sec-Configuration-options-and-saved-profiles)
2. Create the folder for your project:
mkdir EpiPract1
cd EpiPract1
3. Activate the prepared environment:
export PATH="/home/a.galitsyna/conda/bin:$PATH"
conda activate chipseq
(if you have problems with it, please contact the TA)
4. Check your environment:
ls # Should list all the files in the directory
pwd # Should print your current location
conda list # Should list all installed programs if conda was set up successfully.
ls /home/a.galitsyna/data
mkdir ~/lesson6/
cp /home/a.galitsyna/data/reads/${your_files.fastq.gz} ~/lesson6/
cd ~/lesson6
ls -hts
gzip -d <file.fastq.gz> # This produces the decompressed file <file.fastq>
less <file.fastq>
wc -l <file.fastq>
fastqc <file.fastq>
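The line count reported by wc -l can be turned into a read count, because each FASTQ record occupies four lines (header, sequence, separator, qualities). A minimal sketch, assuming an uncompressed FASTQ whose sequences are not wrapped over several lines; toy.fastq and its two reads are made up for illustration:

```shell
workdir=$(mktemp -d)
# Toy FASTQ with two reads (hypothetical data).
cat > "$workdir/toy.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGCAAGC
+
IIIIIIII
EOF

# Four lines per record, so reads = lines / 4.
n_lines=$(wc -l < "$workdir/toy.fastq")
n_reads=$(( n_lines / 4 ))
echo "reads: $n_reads"
```

If the line count is not divisible by four, the file is likely truncated or uses wrapped sequences.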
Let's review the NGS data processing workflow:
fastqc <input.fastq>
ls
# input.fastq input_fastqc/
cd input_fastqc/
ls
# fastqc_data.txt fastqc_report.html Icons Images summary.txt
Don't run this code, it's FYI
(Manual: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)
bowtie2-build -h
bowtie2-build chr1.fa,chr2.fa,chr3.fa,chr4.fa hg38
# Check the results:
ls -hs
# 938M hg38.1.bt2 701M hg38.2.bt2 12K hg38.3.bt2 701M hg38.4.bt2 3.0G hg38.fa 938M hg38.rev.1.bt2 701M hg38.rev.2.bt2
bowtie2-inspect -n hg38
Don't run this code, it's FYI
FASTQ + indexed database -> SAM file
(Manual: http://www.htslib.org/doc/samtools.html)
samtools view file.sam
samtools view -F 4 input.sam
samtools stats file.sam
samtools view -b file.sam > file.bam
(Manual: https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
BED format specification: https://genome.ucsc.edu/FAQ/FAQformat.html)
bedtools bamtobed -i file.bam > file.bed
map
intersect
Full set of compiled tools: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
bedGraphToBigWig -h
These steps are time-consuming, but we have to complete them:
cd GENOME/
bowtie2-build -h
bowtie2-build dm6.fa dm6
ls
cd ../
bowtie2 -h
bowtie2 --very-sensitive -x ./GENOME/dm6 -U <file.fastq> -S <file.sam>
less <file.sam>
Don't run this code, it's FYI
samtools view -h -F 4 <file.sam> | less
samtools --help
samtools view -h -q 10 <file.sam> > <file_filtered.sam>
samtools view -b -q 10 <file.sam> > <file_filtered.bam>
samtools sort <file_filtered.bam> > <file_filtered.sorted.bam>
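To see what the -F 4 and -q 10 filters select, the same logic can be mimicked with awk on a toy SAM file. This is only an illustrative sketch, not a replacement for samtools: the file toy.sam and its four reads are made up. Column 2 is the FLAG field (bit 0x4 marks an unmapped read) and column 5 is MAPQ.

```shell
workdir=$(mktemp -d)
# Toy SAM: one header line and four reads (hypothetical data).
{
    printf '@SQ\tSN:chr5\tLN:1000\n'
    printf 'r1\t0\tchr5\t100\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n'
    printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII\n'
    printf 'r3\t16\tchr5\t200\t3\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n'
    printf 'r4\t16\tchr5\t300\t30\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n'
} > "$workdir/toy.sam"

kept=$(awk -F'\t' '
    /^@/                 { next }   # skip header lines
    int($2 / 4) % 2 == 1 { next }   # FLAG bit 0x4 set: read is unmapped (like -F 4)
    $5 < 10              { next }   # MAPQ below 10 (like -q 10)
    { print $1 }' "$workdir/toy.sam" | paste -s -d ' ' -)

echo "reads kept: $kept"
```

Here r2 is dropped as unmapped and r3 is dropped for its low mapping quality.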
Don't run this code, it's FYI
bedtools makewindows -g GENOME/chrom_sizes.tsv -w 1000 > hg38_windows.1000.bed
grep "chr5" hg38_windows.1000.bed > chr5_windows.1000.bed
bedtools genomecov -ibam <file_filtered.sorted.bam> -bga -split > <file.genomecov.bedgraph>
bedtools genomecov -h
bedtools makewindows -h
bedtools coverage -h
bedtools coverage -counts -a chr5_windows.1000.bed -b <file_filtered.sorted.bam> > <file.binned.bedgraph>
ls
less <file.genomecov.bedgraph>
less <file.binned.bedgraph>
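The idea behind the binned bedgraph can be sketched with awk on a toy BED file. This is a simplification under stated assumptions: reads.bed and its coordinates are made up, each read is assigned to one window by its start coordinate only, and empty windows are not printed. bedtools coverage -counts instead counts every read overlapping a window and also reports windows with zero reads.

```shell
workdir=$(mktemp -d)
# Toy reads in BED format: chromosome, start, end (hypothetical data).
printf 'chr5\t%s\t%s\n' 10 46 950 986 1200 1236 1500 1536 3100 3136 > "$workdir/reads.bed"

binned=$(awk -v w=1000 'BEGIN { OFS = "\t" }
    {
        bin = int($2 / w)                          # window index of the read start
        n[$1 OFS bin * w OFS (bin + 1) * w]++      # key: chrom, window start, window end
    }
    END { for (k in n) print k, n[k] }' "$workdir/reads.bed" | sort -k1,1 -k2,2n)

echo "$binned"
```

The output has the same four columns as the binned bedgraph: chromosome, window start, window end, read count.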
Don't run this code, it's FYI
We will use the UCSC Genome Browser to visualize bedGraph files.
bedGraphToBigWig <file.genomecov.bedgraph> ./GENOME/chrom_sizes.tsv <file.genomecov.bw>
mv <file.genomecov.bw> <YOUR_CELL_LINE-CHROMOSOME.bw>
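Both bedtools makewindows -g and bedGraphToBigWig expect a two-column, tab-separated file of chromosome names and lengths. If such a file is not at hand, it can be derived from the genome FASTA. A minimal sketch, assuming the chromosome name is the first word of each header line; toy.fa and its sequences are made up for illustration:

```shell
workdir=$(mktemp -d)
# Toy genome with two short "chromosomes" (hypothetical data).
cat > "$workdir/toy.fa" <<'EOF'
>chrA toy sequence
ACGTACGTAC
ACGT
>chrB
GGGCCC
EOF

awk '
    /^>/ { if (name != "") print name "\t" len     # flush the previous sequence
           name = substr($1, 2); len = 0; next }   # first word of the header, without ">"
    { len += length($0) }                          # add up the sequence lines
    END  { if (name != "") print name "\t" len }
' "$workdir/toy.fa" > "$workdir/chrom_sizes.tsv"

cat "$workdir/chrom_sizes.tsv"
```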
Task 1 (1 point): Provide the final set of commands that you've used and briefly justify the parameters.
Task 2 (1 point): Report the final scatterplot for 100 Kb. Describe your observations.
Task 3 (*1 point): Repeat the procedure for 10 Kb and 1 Kb and report the changes in your observations. At what resolution might you observe the differential DNase I signal?
multiBigwigSummary bins --binSize 100000 -b ${file1.bw} ${file2.bw} ${file3.bw} ${file4.bw} -o ${summary.npz}
plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot scatterplot -o ${scatterplot.png}
6. Plot correlations for 100 Kb as a heatmap. Use the parameters that you found to be representative for the scatterplots. For example:
plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o ${heatmap.png}
7. Inspect and describe your observations. Adjust the visualization, if needed.
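Each number in the correlation plots is a correlation coefficient between the per-bin scores of two samples. As a rough illustration of what --corMethod pearson computes, here is an awk sketch on a made-up table; scores.tsv and its values are hypothetical, and skipping bins that are zero in both samples only imitates the intent of --skipZeros.

```shell
workdir=$(mktemp -d)
# Columns: bin id, score in sample A, score in sample B (hypothetical values).
printf '%s\t%s\t%s\n' \
    bin1 1 2 \
    bin2 2 4 \
    bin3 0 0 \
    bin4 3 5 \
    bin5 4 4 \
    bin6 5 5 > "$workdir/scores.tsv"

pearson=$(LC_ALL=C awk -F'\t' '
    $2 == 0 && $3 == 0 { next }     # skip bins that are zero in both samples
    { n++; sx += $2; sy += $3; sxx += $2 * $2; syy += $3 * $3; sxy += $2 * $3 }
    END {
        num = n * sxy - sx * sy
        den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
        printf "%.3f", num / den
    }' "$workdir/scores.tsv")

echo "Pearson r = $pearson"
```

Pearson correlation is sensitive to a few bins with extreme signal, which is one reason to compare it with Spearman when adjusting the visualization.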
Result of clustering
Correlation coefficient
Labels are not representative
(correct if needed)
We will:
We'll need the conda environment "hic" (instructions on activation can be found in EpiPract3).