ChIP-Seq data analysis

Introduction to epigenetics and ChIP-Seq

Aleksandra Galitsyna

“Analysis of omics data” course
Skoltech
24 February 2022

Introduction: sequencing

"Computational analysis of next generation sequencing data and its applications in clinical oncology" Wadapukar&Vyas 2018

Sanger

Illumina, Roche 454, SOLiD

Nanopore

Illumina sequencing

Illumina sequencers

https://www.illumina.com/systems/sequencing-platforms.html

NGS techniques diversity

Wasserman & Sandelin. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 2004

Activation of promoters and enhancers

Key determinant of expression is activation of promoters and enhancers

http://advancedtranscription16.blogspot.com/p/overview.html

Promoter

How to activate enhancer and promoter?

Regulatory networks of enhancers

One gene can regulate multiple enhancers,
while multiple enhancers can influence a gene:

https://cotney.research.uchc.edu/

Wasserman & Sandelin. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 2004

Epigenetic encoding of expression patterns:

Sequencing for epigenetics

https://doi.org/10.1016/B978-0-12-812215-0.00004-2

ChIP-Seq: experimental protocol

Aim: find all regions in the genome bound by a specic transcription factor (or bearing a specic histone modication)
Answer: ChIP-Seq, Chromatin immunoprecipitation with sequencing

Park 2009

Practice

Before we start the practice...

Hometask made available on canvas:

Find and follow the instructions for the seminar here

Before we start the practice...

We will work in browser and with terminal. We also need to copy files from the remote server, so install appropriate software, if needed (your favourite terminal distributive,
Putty: https://www.putty.org/ or WinSCP: https://winscp.net/eng/download.php)
File formats are specified after the dot (e.g. .bam).
Code is running in grey boxes with monospace font, try not to copy it blindly, because:
some file names are surrounded by brackets "${}". This indicates that it should be replaced by the name of your file. You may use the filenames that you like, but I highly recommend to stick to using the name of your factor and input instead.

echo "Bioinformatics has never been easier! Copy&paste wisely, or rm -rf ~/ will strike you one day."

Before we start the practice...

Review of some useful bash commands:

List files in the current directory:
Check the current position in the directory tree:
Open text file for reading:
Run command help, one of the options:
(some of them work for ones
but not the others)
Copy file:
Enter directory:
Exit the directory:
Create directory:
File decompression:

ls

pwd

less file.txt

some_command -h

some_command --help

man command

cp target_file destination_folder

cd directory_name/

cd ../

ls -hts

mkdir

gzip -d file.gz

unzip file.zip

Expected result of the seminar

Quizz on canvas (10 points in total)
Extended deadline until 24th of March.
Each task of the quizz is screenshot + essay answer.
Mistakes on biology and theory are penalized.
Feel free to ask the questions in the chat!

ChIP-Seq theory

ChIP-Seq: density of DNA fragments

We expect to find different results for plus (Watson) and minus (Crick) strands of DNA:

Park 2009

ChIP-Seq: Cross-Correlation

We expect to find different results for plus (Watson) and minus (Crick) strands of DNA:

Park 2009

ChIP-Seq: Cross-Correlation

The cross-correlation typically peaks at the shift corresponding to the fragment length and the shift corresponding to the read length.

Bailey 2013

ChIP-Seq: Cross-Correlation measure

NSC (normalized strand cross-correlation coefficient):
The ratio between the cross-correlation at the fragment length and the background crosscorrelation.
RSC (relative strand cross-correlation coefficient):
The ratio between cross-correlation at the fragment length and the crosscorrelation at the read length.
Very successful ChIP experiments generally have NSC>1.05 and RSC>0.8

Bailey 2013

Input control

Input experiment is ChIP-Seq control with the same experimental steps but without immunoprecipitation.
Input is used for normalization of ChIP-Seq.
Input experiment accounts for unwanted factors contributing to sequencing & processing output:
- DNA-shearing is not uniform across genome -> more reads in open chromatin regions,
- Amplification bias (GC content),
- Repetitive regions,
- DNA accessibility

Park 2009

Normalization by input control

Park 2009

elevated
(absolute values)

enriched
(relative to input)

log(\frac{signal}{input})

ChIP-Seq data processing

Park 2009

ChIP-Seq data processing: implementation

https://hbctraining.github.io/Intro-to-ChIPseq/lessons/07_handling-replicates-idr.html

ENCODE Standards:

https://www.encodeproject.org/chip-seq/transcription_factor/

ChIP-Seq peak calling

MACS2 is the most common software:

https://galaxyproject.org/tutorials/chip/

ChIP-Seq peak calling result

Binding of protein is typically specific to DNA sequence:

ChIP-Seq: peak calling

https://www.strand-ngs.com/features/chip-seq

Example output of the procedure:

What is the next step?

Binding events and DNA motifs

Specific and non-specific binding:

Slattery et al. 2014

Genomics tracks association

DNase-Seq coverage for ∼350-kb region:

Thurman et al. Nature 2012

Different cell types

Clustering of samples

Cusanovich et al. Cell 2018

Let's check the similarity of pooled tissue samples with DNase-seq from known tissues. For that, let's plot bi-clustered heatmap of Spearman correlation coefficients:

Result of clustering,
the tree reflecting the similarity between samples

Clustering of samples

Cusanovich et al. Cell 2018

Finally, let's do dimensionality reduction, where each dot is a single sample.
Here is an example with t-SNE, although PCA (Principal Component Analysis) and MDS (Multidimensional scaling) are other popular techniques:

Practical Part: Samples Comparison

We will use deeptools commands multiBigwigSummary and plotCorrelation to compare the ChIP-Seq experiments. We will construct bi-clustered heatmaps of correlations:

Text

Stackup Plots (enrichment profiles):

https://deeptools.readthedocs.io/en/develop/content/tools/plotHeatmap.html#usage-examples

Additional slides
FYI

Get in touch with your data: step 1

1. Login to the server via terminal:

or use Putty on Windows.
(example instructions https://www.ssh.com/ssh/putty/windows/#sec-Configuration-options-and-saved-profiles)

2. Create the folder with your project:

3. Activate prepared environment:
(if you have problems with it, please, contact the TA)

4. Check your environment:

ssh username@servername

mkdir EpiPract1
cd EpiPract1

export PATH="/home/a.galitsyna/conda/bin:$PATH"
conda activate chipseq

ls # Should list all the filesin the directory
pwd # Should result in your currentl location
conda list # Should list all the programs installed, if conda imported successfully.

Get in touch with your data: step 2

Find your datasets (here or in the table on the right):
Copy to the local folder:
Check the content and file sizes:

less /home/a.galitsyna/data

mkdir ~/lesson6/
cp /home/a.galitsyna/data/reads/${your_files.fastq.gz} ~/lesson6/

cd ~/lesson6
ls -hts

Get in touch with your data: step 3

Open FASTQ files, check the format
(see also: http://emea.support.illumina.com/bulletins/2016/04/fastq-files-explained.html)
Check FASTQ files, what is the number of rows vs the number of entries?
Run fastqc (a tool for FASTQ quality control, or QC)

gzip -d <file.fastq.gz> # This will result in <file.fastq> unzipped file
less <file.fastq>

wc -l <file.fastq>

fastqc <file.fastq>

NGS data typical workflow example

Let's review the NGS data processing workflow:

Quality control: fastqc on initial FASTQ
Index build: bowtie2-build on initial chromosomes files
Read mapping: FASTQ → SAM/BAM
Filtering of mapped reads: samtools on SAM from p.3
Conversion to BED: bedtools on SAM from p.4

FASTQC report example

Takes FASTQ as input:
Output format: html and plain text

fastqc <input.fastq>

ls 
  # input.fastq input_fastqc/

cd input_fastqc/
ls
  # fastqc_data.txt  fastqc_report.html  Icons  Images  summary.txt

Don't run this code, it's FYI

FASTQC report example

Example html output:
(see https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ for more examples)

Genome index build example

(Manual: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)

Required for fast search of query sequence during mapping
Takes fasta with reference as input (comma-separated list) and prefix name.
Example run and resulting database:

bowtie2-build -h

bowtie2-build chr1.fa,chr2.fa,chr3.fa,chr4.fa hg38

# Check the results: 

ls -hs

# 938M hg38.1.bt2  701M hg38.2.bt2   12K hg38.3.bt2  701M hg38.4.bt2  3.0G hg38.fa  938M hg38.rev.1.bt2  701M hg38.rev.2.bt2

bowtie2-inspect -n hg38

Don't run this code, it's FYI

Read mapping example

Takes query FASTQ and indexed database as input.
Output is in SAM or BAM format
(SAM format specification: https://samtools.github.io/hts-specs/SAMv1.pdf)
SAM is a TSV file with header, BAM is its compressed binary version. The columns are:

Read mapping example

FASTQ

+ indexed database

-> SAM file

Samtools example

(Manual: http://www.htslib.org/doc/samtools.html)

A toolkit designed for SAM/BAM manipulations
View:
Statistics assessment:
Format conversion:
Filtering:

samtools view file.sam

samtools view -F 4 input.sam

samtools stats file.sam

samtools view -b file.sam > file.bam

Bedtools example

(Manual: https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
BED format specification: https://genome.ucsc.edu/FAQ/FAQformat.html)

A toolkit for conversion and manipulations with BED:
A powerful tool for genomic arithmetics:
(Tutorial: https://bedtools.readthedocs.io/en/latest/)

bedtools bamtobed -i file.bam > file.bed

genomecov

map

intersect

Other formats

BED
BEDGRAPH
BIGWIG: binary compressed format, created with UCSC tools (see below)

UCSC genome browser & UCSC tools

UCSC maintains genome browser and has a set of tools: https://genome-euro.ucsc.edu/

UCSC tools example

Full set of compiled tools: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/

We will need file conversion, check the documentation:

bedGraphToBigWig -h

Next step: index build and mapping

These steps are time-consuming, but we have to complete them:

Build genome index with name prefix hg38_chr5-6_index:
Run mapping of your FASTQ file for the index hg38_chr5-6_index:

cd GENOME/
bowtie2-build dm6.fa

bowtie2-build -h

bowtie2 -h

bowtie2 --very-sensitive -x ./GENOME/dm6 <file.fastq> -S <file.sam>

less <file.sam>

ls 
cd ../

Don't run this code, it's FYI

Filtering

Filter only uniquely mapped reads with high alignment score, sort resulting file:

samtools view -h -F 4 <file.sam> | less

samtools --help

samtools view -h -q 10 <file.sam> > <file_filtered.sam>

samtools view -b -q 10 <file.sam> > <file_filtered.bam>
samtools sort <file_filtered.bam> > <file_filtered.sorted.bam>

Don't run this code, it's FYI

File format conversion

Read the documentation of bedtools:
Convert BAM to BEDGRAPH genome coverage:
Create files with genomic windows (1Kbp bin size, don't forget to use your chromosome!). Convert BAM to BEDGRAPH coverage of windows:

bedtools makewindows -g GENOME/chrom_sizes.tsv -w 1000 > hg38_windows.1000.bed
grep "chr5" hg38_windows.1000.bed > chr5_windows.1000.bed

bedtools genomecov -ibam <file_filtered.sorted.bam> -bga -split > <file.genomecov.bedgraph>

bedtools genomecov -h
bedtools makewindows -h
bedtools coverage -h

bedtools coverage -counts -a chr5_windows.1000.bed -b <file_filtered.sorted.bam> > <file.binned.bedgraph>

Check the resulting files, you should see genome coverage
files (*.genomecov.*) and binned datasets (*.binned.*):

ls
less <file.genomecov.bedgraph>
less <file.binned.bedgraph>

Don't run this code, it's FYI

BEDGRAPH tracks visualization

We will use UCSC browser for bedgraph files visualization.

Download both files file.genomecov.bedgraph and file.binned.bedgraph to your local computer.
Manually add the header to your file:
(see UCSC recommendations: https://genome.ucsc.edu/goldenPath/help/bedgraph.html)
track type=bedGraph visibility=full
Go to UCSC genome browser, hg38 genome:
https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38

UCSC genome browser

Find "add custom tracks button".
Check the upload form:

Upload your modified BEDGRAPH files (with headers) to UCSC ("Choose file" -> "Submit").

Experiments comparison

Convert BEDGRAPH to BIGWIG:
Rename your final file (substitute YOUR_CELL_LINE and CHROMOSOME):
Find three of your colleagues with the same chromosome as you. Exchange your BIGWIG files and upload them to the server. Thus you should collect at least 4 BIGWIG files with recognisable names (cell line and chromosome should be present in the file names).

mv <file.genomecov.bw> <YOUR_CELL_LINE-CHROMOSOME.bw>

bedGraphToBigWig <file.genomecov.bedgraph> ./GENOME/chrom_sizes.tsv <file.genomecov.bw>

Samples comparison

Task 1 (1 point): Provide the final set of commands that you've used and briefly justify the parameters.

Task 2 (1 point): Report the final scatteplot for 100 Kb. Describe your observations.
Task 3 (*1 point): Repeat the procedure for 10 Kb and 1 Kb and report the changes in observations. At what resolution you might observe the differential DNase I signal?

Collect at least 4 DNase I results for your chromosome in bigWig format. One of them should be yours, the others you may take from /home/shared/EpiPract1/. you may use conda environment "hic" (see instructions on activation can be found in EpiPract3).
Use multiBigwigSummary to create a summary file. Use bins mode, bin size 100 Kb:
Plot correlations obtained from your file as a scatterplot:
Try to adjust the parameters of the plot to improve the analysis. Try plotting in log-scale (--log1p). If you get the warning about the outliers, try to fix it: use --removeOutliers and --corMethod spearman. Which method worked?
* Repeat the procedure for 10 Kb and 1 Kb. You may want to create another intermediary summary file for that. Inspect the changes.

multiBigwigSummary bins --binSize 100000 -b ${file1.bw} ${file2.bw} ${file3.bw} ${file4.bw} -o ${summary.npz}

plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot scatterplot -o ${scatterplot.png}

Experiments comparison

6. Plot correlations for 100 Kb as a heatmap. Use the parameters that you found to be representative for the scatterplots. For example:

7. Inspect and describe your observations. Adjust the visualization, if needed.

plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o ${heatmap.png}

Result of clustering

Correlation coefficient

Labels are not representative
(correct if needed)

Practical Part: Enrichment Profiles

We will:

Call TADs in S2 cell line (Wang et al. 2017),
Plot the enrichment profiles for multiple factors.

We'll need conda environment "hic" (see instructions on activation can be found in EpiPract3).

Introduction to epigenetics and ChIP-Seq

Aleksandra Galitsyna

Introduction: sequencing

Illumina sequencing

Illumina sequencers

NGS techniques diversity

Activation of promoters and enhancers

How to activate enhancer and promoter?

Regulatory networks of enhancers

Epigenetic encoding of expression patterns:

Sequencing for epigenetics

ChIP-Seq: experimental protocol

Practice

Before we start the practice...

Before we start the practice...

Before we start the practice...

Expected result of the seminar

ChIP-Seq theory

ChIP-Seq: density of DNA fragments

ChIP-Seq: Cross-Correlation

ChIP-Seq: Cross-Correlation

ChIP-Seq: Cross-Correlation measure

Input control

Normalization by input control

ChIP-Seq data processing

ChIP-Seq data processing: implementation

ChIP-Seq peak calling

ChIP-Seq peak calling result

ChIP-Seq: peak calling

Binding events and DNA motifs

Genomics tracks association

Clustering of samples

Clustering of samples

Practical Part: Samples Comparison

Stackup Plots (enrichment profiles):

Additional slides FYI

Get in touch with your data: step 1

Get in touch with your data: step 2

Get in touch with your data: step 3

NGS data typical workflow example

FASTQC report example

FASTQC report example

Genome index build example

Read mapping example

Read mapping example

Samtools example

Bedtools example

Other formats

UCSC genome browser & UCSC tools

UCSC tools example

Next step: index build and mapping

Filtering

File format conversion

BEDGRAPH tracks visualization

UCSC genome browser

Experiments comparison

Samples comparison

Experiments comparison

Practical Part: Enrichment Profiles

Additional slides
FYI