Introduction to epigenetics and ATAC-Seq

Aleksandra Galitsyna

“Analysis of omics data” course
Skoltech
28 February 2022

Sequencing for epigenetics

https://doi.org/10.1016/B978-0-12-812215-0.00004-2

NGS techniques diversity

Types of binding events in the cell

Proteins and nucleic acids are the most important components of the cell. Their interactions are necessary for functioning and regulation. Types of interactions:

DNA-protein interactions, examples:
- replication/reparation/recombination
- transcription
- chromatin modification and folding
RNA-protein:
- RNA metabolism regulation
- processing
- translation
- degradation: microRNA, NMD
RNA-DNA interactions:
- expression regulation
- chromatin structure maintenance

Wasserman & Sandelin, 2004

Types of molecular interactions

Direct:
- protein-protein
- RNA-protein
- DNA-protein
- RNA-DNA (R-loops)
Indirect:
- Protein-mediated
- RNA-mediated
Colocalization:
- Compartments
- Hubs

RNA-DNA interactions

Engreitz et al. 2016

ChIP-Seq

Binding of protein is typically specific to DNA sequence:

Binding pattern

DNA-protein binding

List of DNAs
with binding events

courtesy of Ivan Kulakovsky, Pavel Mazin

ChIP-Seq

Chromatin-immunoprecipitation followed by sequencing:

ChIP-Seq: input experiment

Unwanted factors contributing to sequencing & processing output:

DNA accessibility
DNA amplification efficiency
Mappability

"Input" experiment is a ChIP-Seq control without precipitation step. It is used for normalisation of ChIP-Seq.

Park, Nat Reviews Genetics, 2009

ChIP-Seq: peak calling

In order to predict binging events we need to call peaks in ChIP-Seq data:

Mahony & Pugh 2015

ChIP-Seq: binding pattern

As an output from ChIP-Seq we have a list of peaks.
In theory, each peak should have a binding event.
Imagine you have the peaks with the following sequences:
What is the binding pattern?

courtesy of Ivan Kulakovsky, Pavel Mazin

ChIP-Seq: binding pattern

Seems to be easy to find the binding pettern:

Question: Can you propose a computational approach to find it?

courtesy of Ivan Kulakovsky, Pavel Mazin

ChIP-Seq: motifs search

Let's introduce some variety, which can arise as:
- Suboptimal binding of the protein
- Genome variability
- Errors during the sequencing

courtesy of Ivan Kulakovsky, Pavel Mazin

ChIP-Seq: motifs search

With possible substitutions, finding the correct pattern is not easy anymore:

We need some advanced algorithms to find binding patterns.

courtesy of Ivan Kulakovsky, Pavel Mazin

Binding pattern representation

Binding motif is a DNA sequence pattern that has biological significance (e.g. it is a target for protein binding).
Consider following pattern of sequences:

TATAAT
TAAAAT
TAATAT
TGTAAT
TATACT

Consensus is the sequence of the most frequent letters:

T[AG][AT][AT][AC]T

With the consensus sequence, we lose the information:

about suboptimal binding,
of background nucleotides frequencies.

courtesy of Pavel Mazin

Binding motif representation

Other representations:

Frequency matrix

Probability matrix

Matrix normalized by background

Position Weight Matrix (PWM)

courtesy of Pavel Mazin

Position Weight Matrix (PWM)

courtesy of Pavel Mazin

Binding motifs search

Approaches:

Scanning of the motifs from the database
(e.g. JASPAR http://jaspar.genereg.net/ )

Binding motifs search

Approaches:

Scanning of the motifs from the database
(e.g. JASPAR http://jaspar.genereg.net/ )
Motifs discovery (de novo search)

http://meme-suite.org/tools/meme

courtesy of Pavel Mazin

ChIP-Seq: motifs search tools

Web server tools:
- http://rsat.eu/
- http://meme-suite.org/
- ...
Command line interface tools (CLI):
- MEME
- HOMER
- Autosome
- ...
- GimmeMotifs https://gimmemotifs.readthedocs.io/en/master/
  - Simple CLI
  - Can run multiple other tools
  - Databases included

Binding events and DNA motifs

Specific and non-specific binding:

Slattery et al. 2014

DNase-Seq: experimental protocol

DNase I hypersensitive sites sequencing

ATAC-Seq: experimental protocol

Assay for Transposase-Accessible Chromatin with high-throughput sequencing
Nucleosome-free regions detection method
Tagmentation: Tn5 transposase cuts DNA and ligates adaptors to open DNA

Buenrostro et al., 2013

"Block-schemes" of Seqs:

Meyer & Liu Nat. Rev. Gen. 2014

Data processing of Seqs:

Meyer & Liu Nat. Rev. Gen. 2014

Examples of tracks:

Meyer & Liu Nat. Rev. Gen. 2014

In theory:

ATAC-Seq

Hsu et al. 2018

paired-end sequencing and mapping

ATAC-Seq: insertion size distribution

Buenrostro et al., 2013

Distance between forward and reverse mapping positions:

ATAC-Seq: insertion size distribution

Buenrostro et al., 2013

Distance between forward and reverse mapping positions:

ATAC-Seq: insertion size distribution

Buenrostro et al., 2013

Distance between forward and reverse mapping positions:

Nucleosome-free fraction

Nucleosome-bound fraction

ATAC-Seq around gene starts

Buenrostro et al., 2013

Nucleosome-free fraction

Nucleosome-bound fraction

Average plot around
Transcription Start Sites (TSS):

ATAC-Seq around TF binding sites

Albanus et al., 2019

We can plot V-plot around positions of factors binding sites:

ATAC-Seq around TF binding sites

Albanus et al., 2019

ATAC-Seq around nucleosomes

Schep et al., 2015

We can also plot V-plot around positions of nucleosomes:

Chromatin openness (accessibility)

Hsu et al. 2018, Slattery et al. 2014

We can also do motifs search in open chromatin regions

Binding events and DNA motifs

Specific and non-specific binding:

Slattery et al. 2014

maxATAC: automated search of motifs in ATAC-Seq peaks

Cazares et al. 2022, bioRxiv preprint

High-resolution antibody-targeted chromatin profiling

Kaya-Okur and Henikoff et al. 2019 Efficient low-cost chromatin profiling with CUT&Tag

Cleavage Under Targets and Release Using Nuclease (CUT&Run)

Cleavage Under Targets & Tagmentation (CUT&Tag)

High-resolution antibody-targeted chromatin profiling

Kaya-Okur and Henikoff et al. 2019 Efficient low-cost chromatin profiling with CUT&Tag

Cleavage Under Targets and Release Using Nuclease (CUT&Run)

Cleavage Under Targets & Tagmentation (CUT&Tag)

Few slides on epigenetics data association

Genomics tracks association

DNase-Seq coverage for ∼350-kb region:

Thurman et al. Nature 2012

Different cell types

Clustering of samples

Cusanovich et al. Cell 2018

Let's check the similarity of pooled tissue samples with DNase-seq from known tissues. For that, let's plot bi-clustered heatmap of Spearman correlation coefficients:

Result of clustering,
the tree reflecting the similarity between samples

Clustering of samples

Cusanovich et al. Cell 2018

Finally, let's do dimensionality reduction, where each dot is a single sample.
Here is an example with t-SNE, although PCA (Principal Component Analysis) and MDS (Multidimensional scaling) are other popular techniques:

Practical Part: Samples Comparison

We will use deeptools commands multiBigwigSummary and plotCorrelation to compare the ChIP-Seq experiments. We will construct bi-clustered heatmaps of correlations:

Text

Stackup Plots (enrichment profiles):

https://deeptools.readthedocs.io/en/develop/content/tools/plotHeatmap.html#usage-examples

Practice

Before we start the practice...

Hometask made available on canvas:

Find and follow the instructions for the seminar here

Before we start the practice...

We will work in browser and with terminal. We also need to copy files from the remote server, so install appropriate software, if needed (your favourite terminal distributive,
Putty: https://www.putty.org/ or WinSCP: https://winscp.net/eng/download.php)
File formats are specified after the dot (e.g. .bam).
Code is running in grey boxes with monospace font, try not to copy it blindly, because:
some file names are surrounded by brackets "${}". This indicates that it should be replaced by the name of your file. You may use the filenames that you like, but I highly recommend to stick to using the name of your factor and input instead.

echo "Bioinformatics has never been easier! Copy&paste wisely, or rm -rf ~/ will strike you one day."

Before we start the practice...

Review of some useful bash commands:

List files in the current directory:
Check the current position in the directory tree:
Open text file for reading:
Run command help, one of the options:
(some of them work for ones
but not the others)
Copy file:
Enter directory:
Exit the directory:
Create directory:
File decompression:

ls

pwd

less file.txt

some_command -h

some_command --help

man command

cp target_file destination_folder

cd directory_name/

cd ../

ls -hts

mkdir

gzip -d file.gz

unzip file.zip

Expected result of the seminar

Quizz on canvas (10 points in total)
Extended deadline until 24th of March.
Each task of the quizz is screenshot + essay answer.
Mistakes on biology and theory are penalized.
Feel free to ask the questions in the chat!

Additional slides
FYI

Get in touch with your data: step 1

1. Login to the server via terminal:

or use Putty on Windows.
(example instructions https://www.ssh.com/ssh/putty/windows/#sec-Configuration-options-and-saved-profiles)

2. Create the folder with your project:

3. Activate prepared environment:
(if you have problems with it, please, contact the TA)

4. Check your environment:

ssh username@servername

mkdir EpiPract1
cd EpiPract1

export PATH="/home/a.galitsyna/conda/bin:$PATH"
conda activate chipseq

ls # Should list all the filesin the directory
pwd # Should result in your currentl location
conda list # Should list all the programs installed, if conda imported successfully.

Get in touch with your data: step 2

Find your datasets (here or in the table on the right):
Copy to the local folder:
Check the content and file sizes:

less /home/a.galitsyna/data

mkdir ~/lesson6/
cp /home/a.galitsyna/data/reads/${your_files.fastq.gz} ~/lesson6/

cd ~/lesson6
ls -hts

Get in touch with your data: step 3

Open FASTQ files, check the format
(see also: http://emea.support.illumina.com/bulletins/2016/04/fastq-files-explained.html)
Check FASTQ files, what is the number of rows vs the number of entries?
Run fastqc (a tool for FASTQ quality control, or QC)

gzip -d <file.fastq.gz> # This will result in <file.fastq> unzipped file
less <file.fastq>

wc -l <file.fastq>

fastqc <file.fastq>

NGS data typical workflow example

Let's review the NGS data processing workflow:

Quality control: fastqc on initial FASTQ
Index build: bowtie2-build on initial chromosomes files
Read mapping: FASTQ → SAM/BAM
Filtering of mapped reads: samtools on SAM from p.3
Conversion to BED: bedtools on SAM from p.4

FASTQC report example

Takes FASTQ as input:
Output format: html and plain text

fastqc <input.fastq>

ls 
  # input.fastq input_fastqc/

cd input_fastqc/
ls
  # fastqc_data.txt  fastqc_report.html  Icons  Images  summary.txt

Don't run this code, it's FYI

FASTQC report example

Example html output:
(see https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ for more examples)

Genome index build example

(Manual: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)

Required for fast search of query sequence during mapping
Takes fasta with reference as input (comma-separated list) and prefix name.
Example run and resulting database:

bowtie2-build -h

bowtie2-build chr1.fa,chr2.fa,chr3.fa,chr4.fa hg38

# Check the results: 

ls -hs

# 938M hg38.1.bt2  701M hg38.2.bt2   12K hg38.3.bt2  701M hg38.4.bt2  3.0G hg38.fa  938M hg38.rev.1.bt2  701M hg38.rev.2.bt2

bowtie2-inspect -n hg38

Don't run this code, it's FYI

Read mapping example

Takes query FASTQ and indexed database as input.
Output is in SAM or BAM format
(SAM format specification: https://samtools.github.io/hts-specs/SAMv1.pdf)
SAM is a TSV file with header, BAM is its compressed binary version. The columns are:

Read mapping example

FASTQ

+ indexed database

-> SAM file

Samtools example

(Manual: http://www.htslib.org/doc/samtools.html)

A toolkit designed for SAM/BAM manipulations
View:
Statistics assessment:
Format conversion:
Filtering:

samtools view file.sam

samtools view -F 4 input.sam

samtools stats file.sam

samtools view -b file.sam > file.bam

Bedtools example

(Manual: https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
BED format specification: https://genome.ucsc.edu/FAQ/FAQformat.html)

A toolkit for conversion and manipulations with BED:
A powerful tool for genomic arithmetics:
(Tutorial: https://bedtools.readthedocs.io/en/latest/)

bedtools bamtobed -i file.bam > file.bed

genomecov

map

intersect

Other formats

BED
BEDGRAPH
BIGWIG: binary compressed format, created with UCSC tools (see below)

UCSC genome browser & UCSC tools

UCSC maintains genome browser and has a set of tools: https://genome-euro.ucsc.edu/

UCSC tools example

Full set of compiled tools: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/

We will need file conversion, check the documentation:

bedGraphToBigWig -h

Next step: index build and mapping

These steps are time-consuming, but we have to complete them:

Build genome index with name prefix hg38_chr5-6_index:
Run mapping of your FASTQ file for the index hg38_chr5-6_index:

cd GENOME/
bowtie2-build dm6.fa

bowtie2-build -h

bowtie2 -h

bowtie2 --very-sensitive -x ./GENOME/dm6 <file.fastq> -S <file.sam>

less <file.sam>

ls 
cd ../

Don't run this code, it's FYI

Filtering

Filter only uniquely mapped reads with high alignment score, sort resulting file:

samtools view -h -F 4 <file.sam> | less

samtools --help

samtools view -h -q 10 <file.sam> > <file_filtered.sam>

samtools view -b -q 10 <file.sam> > <file_filtered.bam>
samtools sort <file_filtered.bam> > <file_filtered.sorted.bam>

Don't run this code, it's FYI

File format conversion

Read the documentation of bedtools:
Convert BAM to BEDGRAPH genome coverage:
Create files with genomic windows (1Kbp bin size, don't forget to use your chromosome!). Convert BAM to BEDGRAPH coverage of windows:

bedtools makewindows -g GENOME/chrom_sizes.tsv -w 1000 > hg38_windows.1000.bed
grep "chr5" hg38_windows.1000.bed > chr5_windows.1000.bed

bedtools genomecov -ibam <file_filtered.sorted.bam> -bga -split > <file.genomecov.bedgraph>

bedtools genomecov -h
bedtools makewindows -h
bedtools coverage -h

bedtools coverage -counts -a chr5_windows.1000.bed -b <file_filtered.sorted.bam> > <file.binned.bedgraph>

Check the resulting files, you should see genome coverage
files (*.genomecov.*) and binned datasets (*.binned.*):

ls
less <file.genomecov.bedgraph>
less <file.binned.bedgraph>

Don't run this code, it's FYI

BEDGRAPH tracks visualization

We will use UCSC browser for bedgraph files visualization.

Download both files file.genomecov.bedgraph and file.binned.bedgraph to your local computer.
Manually add the header to your file:
(see UCSC recommendations: https://genome.ucsc.edu/goldenPath/help/bedgraph.html)
track type=bedGraph visibility=full
Go to UCSC genome browser, hg38 genome:
https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38

UCSC genome browser

Find "add custom tracks button".
Check the upload form:

Upload your modified BEDGRAPH files (with headers) to UCSC ("Choose file" -> "Submit").

Experiments comparison

Convert BEDGRAPH to BIGWIG:
Rename your final file (substitute YOUR_CELL_LINE and CHROMOSOME):
Find three of your colleagues with the same chromosome as you. Exchange your BIGWIG files and upload them to the server. Thus you should collect at least 4 BIGWIG files with recognisable names (cell line and chromosome should be present in the file names).

mv <file.genomecov.bw> <YOUR_CELL_LINE-CHROMOSOME.bw>

bedGraphToBigWig <file.genomecov.bedgraph> ./GENOME/chrom_sizes.tsv <file.genomecov.bw>

Samples comparison

Task 1 (1 point): Provide the final set of commands that you've used and briefly justify the parameters.

Task 2 (1 point): Report the final scatteplot for 100 Kb. Describe your observations.
Task 3 (*1 point): Repeat the procedure for 10 Kb and 1 Kb and report the changes in observations. At what resolution you might observe the differential DNase I signal?

Collect at least 4 DNase I results for your chromosome in bigWig format. One of them should be yours, the others you may take from /home/shared/EpiPract1/. you may use conda environment "hic" (see instructions on activation can be found in EpiPract3).
Use multiBigwigSummary to create a summary file. Use bins mode, bin size 100 Kb:
Plot correlations obtained from your file as a scatterplot:
Try to adjust the parameters of the plot to improve the analysis. Try plotting in log-scale (--log1p). If you get the warning about the outliers, try to fix it: use --removeOutliers and --corMethod spearman. Which method worked?
* Repeat the procedure for 10 Kb and 1 Kb. You may want to create another intermediary summary file for that. Inspect the changes.

multiBigwigSummary bins --binSize 100000 -b ${file1.bw} ${file2.bw} ${file3.bw} ${file4.bw} -o ${summary.npz}

plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot scatterplot -o ${scatterplot.png}

Experiments comparison

6. Plot correlations for 100 Kb as a heatmap. Use the parameters that you found to be representative for the scatterplots. For example:

7. Inspect and describe your observations. Adjust the visualization, if needed.

plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o ${heatmap.png}

Result of clustering

Correlation coefficient

Labels are not representative
(correct if needed)

Practical Part: Enrichment Profiles

We will:

Call TADs in S2 cell line (Wang et al. 2017),
Plot the enrichment profiles for multiple factors.

We'll need conda environment "hic" (see instructions on activation can be found in EpiPract3).

Introduction to epigenetics and ATAC-Seq

Aleksandra Galitsyna

Sequencing for epigenetics

NGS techniques diversity

Types of binding events in the cell

Types of molecular interactions

RNA-DNA interactions

ChIP-Seq

ChIP-Seq

ChIP-Seq: input experiment

ChIP-Seq: peak calling

ChIP-Seq: binding pattern

ChIP-Seq: binding pattern

ChIP-Seq: motifs search

ChIP-Seq: motifs search

Binding pattern representation

Binding motif representation

Position Weight Matrix (PWM)

Binding motifs search

Binding motifs search

ChIP-Seq: motifs search tools

Binding events and DNA motifs

DNase-Seq: experimental protocol

ATAC-Seq: experimental protocol

"Block-schemes" of Seqs:

Data processing of Seqs:

Examples of tracks:

In theory:

ATAC-Seq

ATAC-Seq: insertion size distribution

ATAC-Seq: insertion size distribution

ATAC-Seq: insertion size distribution

ATAC-Seq around gene starts

ATAC-Seq around TF binding sites

ATAC-Seq around TF binding sites

ATAC-Seq around nucleosomes

Chromatin openness (accessibility)

Binding events and DNA motifs

maxATAC: automated search of motifs in ATAC-Seq peaks

High-resolution antibody-targeted chromatin profiling

High-resolution antibody-targeted chromatin profiling

Few slides on epigenetics data association

Genomics tracks association

Clustering of samples

Clustering of samples

Practical Part: Samples Comparison

Stackup Plots (enrichment profiles):

Practice

Before we start the practice...

Before we start the practice...

Before we start the practice...

Expected result of the seminar

Additional slides FYI

Get in touch with your data: step 1

Get in touch with your data: step 2

Get in touch with your data: step 3

NGS data typical workflow example

FASTQC report example

FASTQC report example

Genome index build example

Read mapping example

Read mapping example

Samtools example

Bedtools example

Other formats

UCSC genome browser & UCSC tools

UCSC tools example

Next step: index build and mapping

Filtering

File format conversion

BEDGRAPH tracks visualization

UCSC genome browser

Experiments comparison

Samples comparison

Experiments comparison

Practical Part: Enrichment Profiles

Additional slides
FYI