Introduction to epigenetics and ATAC-Seq

Aleksandra Galitsyna

“Analysis of omics data” course
Skoltech
28 February 2022

Sequencing for epigenetics

NGS techniques diversity

Types of binding events in the cell

Proteins and nucleic acids are the most important components of the cell. Their interactions are necessary for functioning and regulation. Types of interactions:

  • DNA-protein interactions, examples:
    • replication/reparation/recombination
    • transcription
    • chromatin modification and folding
  • RNA-protein:
    • RNA metabolism regulation
    • processing
    • translation
    • degradation: microRNA, NMD
  • RNA-DNA interactions:
    • expression regulation
    • chromatin structure maintenance

Wasserman & Sandelin, 2004

Types of molecular interactions

  • Direct:
    • protein-protein
    • RNA-protein
    • DNA-protein
    • RNA-DNA (R-loops)
  • Indirect:
    • Protein-mediated
    • RNA-mediated
  • Colocalization:
    • Compartments
    • Hubs

RNA-DNA interactions

Engreitz et al. 2016

ChIP-Seq

Binding of protein is typically specific to DNA sequence:

Binding pattern

DNA-protein binding

List of DNAs
with binding events

courtesy of Ivan Kulakovsky, Pavel Mazin

ChIP-Seq

Chromatin-immunoprecipitation followed by sequencing:

ChIP-Seq: input experiment

Unwanted factors contributing to sequencing & processing output:

  • DNA accessibility
  • DNA amplification efficiency
  • Mappability

"Input" experiment is a ChIP-Seq control without precipitation step. It is used for normalisation of ChIP-Seq.

Park, Nat Reviews Genetics, 2009

ChIP-Seq: peak calling

In order to predict binging events we need to call peaks in ChIP-Seq data:

Mahony & Pugh 2015

 

ChIP-Seq: binding pattern

  • As an output from ChIP-Seq we have a list of peaks.
  • In theory, each peak should have a binding event.
  • Imagine you have the peaks with the following sequences:









     
  • What is the binding pattern?

courtesy of Ivan Kulakovsky, Pavel Mazin

ChIP-Seq: binding pattern

  • Seems to be easy to find the binding pettern:

Question: Can you propose a computational approach to find it?

courtesy of Ivan Kulakovsky, Pavel Mazin

ChIP-Seq: motifs search

  • Let's introduce some variety, which can arise as:
    • Suboptimal binding of the protein
    • Genome variability
    • Errors during the sequencing

courtesy of Ivan Kulakovsky, Pavel Mazin

ChIP-Seq: motifs search

  • With possible substitutions, finding the correct pattern is not easy anymore:

We need some advanced algorithms to find binding patterns.

courtesy of Ivan Kulakovsky, Pavel Mazin

Binding pattern representation

  • Binding motif is a DNA sequence pattern that has biological significance (e.g. it is a target for protein binding).
  • Consider following pattern of sequences:
TATAAT
TAAAAT
TAATAT
TGTAAT
TATACT

Consensus is the sequence of the most frequent letters:

T[AG][AT][AT][AC]T

With the consensus sequence, we lose the information:

  • about suboptimal binding,
  • of background nucleotides frequencies.

 

courtesy of  Pavel Mazin

Binding motif representation

Other representations: 

Frequency matrix

Probability matrix

Matrix normalized by background

Position Weight Matrix (PWM)

courtesy of  Pavel Mazin

Position Weight Matrix (PWM)

courtesy of  Pavel Mazin

Binding motifs search 

Approaches:

Binding motifs search 

Approaches:

courtesy of  Pavel Mazin

ChIP-Seq: motifs search tools

Binding events and DNA motifs

Specific and non-specific binding:

Slattery et al. 2014

DNase-Seq: experimental protocol

  • DNase I hypersensitive sites sequencing

ATAC-Seq: experimental protocol

  • Assay for Transposase-Accessible Chromatin with high-throughput sequencing
  • Nucleosome-free regions detection method
  • Tagmentation: Tn5 transposase cuts DNA and ligates adaptors to open DNA

Buenrostro et al., 2013

"Block-schemes" of Seqs:

Meyer & Liu Nat. Rev. Gen. 2014

Data processing of Seqs:

Meyer & Liu Nat. Rev. Gen. 2014

Examples of tracks:

Meyer & Liu Nat. Rev. Gen. 2014

In theory: 

ATAC-Seq

Hsu et al. 2018

paired-end sequencing and mapping

ATAC-Seq: insertion size distribution

Buenrostro et al., 2013

  • Distance between forward and reverse mapping positions:

ATAC-Seq: insertion size distribution

Buenrostro et al., 2013

  • Distance between forward and reverse mapping positions:

ATAC-Seq: insertion size distribution

Buenrostro et al., 2013

  • Distance between forward and reverse mapping positions:

Nucleosome-free fraction

Nucleosome-bound fraction

ATAC-Seq around gene starts

Buenrostro et al., 2013

Nucleosome-free fraction

Nucleosome-bound fraction

  • Average plot around
    Transcription Start Sites (TSS): 

ATAC-Seq around TF binding sites

 Albanus et al., 2019

We can plot V-plot around positions of factors binding sites:

ATAC-Seq around TF binding sites

 Albanus et al., 2019

ATAC-Seq around nucleosomes

Schep et al., 2015

We can also plot V-plot around positions of nucleosomes:

Chromatin openness (accessibility)

Hsu et al. 2018, Slattery et al. 2014

We can also do motifs search in open chromatin regions

Binding events and DNA motifs

Specific and non-specific binding:

Slattery et al. 2014

maxATAC: automated search of motifs in ATAC-Seq peaks

Cazares et al. 2022, bioRxiv preprint

High-resolution antibody-targeted chromatin profiling

Kaya-Okur and Henikoff et al. 2019 Efficient low-cost chromatin profiling with CUT&Tag

 Cleavage Under Targets and Release Using Nuclease (CUT&Run)

Cleavage Under Targets & Tagmentation (CUT&Tag)

High-resolution antibody-targeted chromatin profiling

Kaya-Okur and Henikoff et al. 2019 Efficient low-cost chromatin profiling with CUT&Tag

 Cleavage Under Targets and Release Using Nuclease (CUT&Run)

Cleavage Under Targets & Tagmentation (CUT&Tag)

Few slides on epigenetics data association

Genomics tracks association

DNase-Seq coverage for 350-kb region:

Thurman et al. Nature 2012

Different cell types

Clustering of samples

Cusanovich et al. Cell 2018

Let's check the similarity of pooled tissue samples with DNase-seq from known tissues. For that, let's plot bi-clustered heatmap of Spearman correlation coefficients:

 

Result of clustering,
the tree reflecting the similarity between samples

Clustering of samples

Cusanovich et al. Cell 2018

Finally, let's do dimensionality reduction, where each dot is a single sample.
Here is an example with t-SNE, although PCA (Principal Component Analysis) and MDS (Multidimensional scaling) are other popular techniques:

 

Practical Part: Samples Comparison

We will use deeptools commands multiBigwigSummary and plotCorrelation to compare the ChIP-Seq experiments. We will construct bi-clustered heatmaps of correlations:

Text

Stackup Plots (enrichment profiles):

Practice

Before we start the practice...

  • Hometask made available on canvas: 
  • Find and follow the instructions for the seminar here

Before we start the practice...

  • We will work in browser and with terminal. We also need to copy files from the remote server, so install appropriate software, if needed (your favourite terminal distributive,
    Putty: https://www.putty.org/ or WinSCP: https://winscp.net/eng/download.php)
  • File formats are specified after the dot (e.g. .bam).
  • Code is running in grey boxes with monospace font, try not to copy it blindly, because:
    some file names are surrounded by brackets "${}". This indicates that it should be replaced by the name of your file.​ You may use the filenames that you like, but I highly recommend to stick to using the name of your factor and input instead. 
echo "Bioinformatics has never been easier! Copy&paste wisely, or rm -rf ~/ will strike you one day." 

            

Before we start the practice...

Review of some useful bash commands: 

  • List files in the current directory:
     
  • Check the current position in the directory tree:
  • Open text file for reading:
     
  • Run command help, one of the options:
    (some of them work for ones
    but not the others)
     
  • Copy file:
  • Enter directory:
  • Exit the directory:
  • Create directory:
  • File decompression:
ls
pwd
less file.txt
some_command -h
some_command --help
man command
cp target_file destination_folder
cd directory_name/
cd ../
ls -hts
mkdir
gzip -d file.gz
unzip file.zip

Expected result of the seminar

  • Quizz on canvas (10 points in total)
  • Extended deadline until 24th of March.
  • Each task of the quizz is screenshot + essay answer.
  • Mistakes on biology and theory are penalized. 
  • Feel free to ask the questions in the chat!

Additional slides
FYI

Get in touch with your data: step 1

1. Login to the server via terminal:

 

or use Putty on Windows.
(example instructions https://www.ssh.com/ssh/putty/windows/#sec-Configuration-options-and-saved-profiles)

2. Create the folder with your project:

 

 

3. Activate prepared environment:
(if you have problems with it, please, contact the TA)

 


4. Check your environment:

 

ssh username@servername
mkdir EpiPract1
cd EpiPract1
export PATH="/home/a.galitsyna/conda/bin:$PATH"
conda activate chipseq
ls # Should list all the filesin the directory
pwd # Should result in your currentl location
conda list # Should list all the programs installed, if conda imported successfully.

Get in touch with your data: step 2

  1. Find your datasets (here or in the table on the right):
     
  2. Copy to the local folder:


     
  3. Check the content and file sizes:
less /home/a.galitsyna/data
mkdir ~/lesson6/
cp /home/a.galitsyna/data/reads/${your_files.fastq.gz} ~/lesson6/
cd ~/lesson6
ls -hts 

Get in touch with your data: step 3

gzip -d <file.fastq.gz> # This will result in <file.fastq> unzipped file
less <file.fastq>
wc -l <file.fastq>
fastqc <file.fastq>

NGS data typical workflow example

Let's review the NGS data processing workflow:

  1. Quality control: fastqc on initial FASTQ
  2. Index build: bowtie2-build on initial chromosomes files
  3. Read mapping: FASTQ → SAM/BAM
  4. Filtering of mapped reads: samtools on SAM from p.3
  5. Conversion to BED: bedtools on SAM from p.4

FASTQC report example

  • Takes FASTQ as input:
     
  • Output format: html and plain text

     
fastqc <input.fastq>
ls 
  # input.fastq input_fastqc/

cd input_fastqc/
ls
  # fastqc_data.txt  fastqc_report.html  Icons  Images  summary.txt

Don't run this code, it's FYI

FASTQC report example

Genome index build example

(Manual: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)

  • Required for fast search of query sequence during mapping
  • Takes fasta with reference as input (comma-separated list) and prefix name.
  • Example run and resulting database:
bowtie2-build -h

bowtie2-build chr1.fa,chr2.fa,chr3.fa,chr4.fa hg38

# Check the results: 

ls -hs

# 938M hg38.1.bt2  701M hg38.2.bt2   12K hg38.3.bt2  701M hg38.4.bt2  3.0G hg38.fa  938M hg38.rev.1.bt2  701M hg38.rev.2.bt2

bowtie2-inspect -n hg38

Don't run this code, it's FYI

Read mapping example

  • Takes query FASTQ and indexed database as input.
  • Output is in SAM or BAM format
    (SAM format specification: https://samtools.github.io/hts-specs/SAMv1.pdf)
  • SAM is a TSV file with header, BAM is its compressed binary version. The columns are:

Read mapping example

FASTQ

+ indexed database

-> SAM file

Samtools example

(Manual: http://www.htslib.org/doc/samtools.html)

  • A toolkit designed for SAM/BAM manipulations
  • View:
     
  • Statistics assessment:
     
  • Format conversion:
     
  • Filtering:





     
samtools view file.sam
samtools view -F 4 input.sam
samtools stats file.sam
samtools view -b file.sam > file.bam

Bedtools example

(Manual: https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
BED format specification: https://genome.ucsc.edu/FAQ/FAQformat.html)

bedtools bamtobed -i file.bam > file.bed

genomecov

map

intersect

Other formats

  1. BED



     
  2. BEDGRAPH




     
  3. BIGWIG: binary compressed format, created with UCSC tools (see below)

UCSC genome browser & UCSC tools

UCSC tools example

Full set of compiled tools: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/

  • We will need file conversion, check the documentation:
bedGraphToBigWig -h

Next step: index build and mapping

These steps are time-consuming, but we have to complete them:

  1. Build genome index with name prefix hg38_chr5-6_index:



     
  2. Run mapping of your FASTQ file for the index hg38_chr5-6_index:
cd GENOME/
bowtie2-build dm6.fa
bowtie2-build -h
bowtie2 -h
bowtie2 --very-sensitive -x ./GENOME/dm6 <file.fastq> -S <file.sam>
less <file.sam>
ls 
cd ../

Don't run this code, it's FYI

Filtering

 

  • Filter only uniquely mapped reads with high alignment score, sort resulting file:

     
samtools view -h -F 4 <file.sam> | less
samtools --help
samtools view -h -q 10 <file.sam> > <file_filtered.sam>
samtools view -b -q 10 <file.sam> > <file_filtered.bam>
samtools sort <file_filtered.bam> > <file_filtered.sorted.bam>

Don't run this code, it's FYI

File format conversion

  • Read the documentation of bedtools:

     
  • Convert BAM to BEDGRAPH genome coverage:
     
  • Create files with genomic windows (1Kbp bin size, don't forget to use your chromosome!). Convert BAM to BEDGRAPH coverage of windows:
bedtools makewindows -g GENOME/chrom_sizes.tsv -w 1000 > hg38_windows.1000.bed
grep "chr5" hg38_windows.1000.bed > chr5_windows.1000.bed
bedtools genomecov -ibam <file_filtered.sorted.bam> -bga -split > <file.genomecov.bedgraph>
bedtools genomecov -h
bedtools makewindows -h
bedtools coverage -h
bedtools coverage -counts -a chr5_windows.1000.bed -b <file_filtered.sorted.bam> > <file.binned.bedgraph>
  • Check the resulting files, you should see genome coverage
    files (*.genomecov.*) and binned datasets (*.binned.*):
ls
less <file.genomecov.bedgraph>
less <file.binned.bedgraph>

Don't run this code, it's FYI

BEDGRAPH tracks visualization

We will use UCSC browser for bedgraph files visualization.

  

UCSC genome browser

  • Find "add custom tracks button".
  • Check the upload form:

 

 

 

 

 

 

 

 

 

  • Upload your modified BEDGRAPH files (with headers) to UCSC ("Choose file" -> "Submit").

Experiments comparison

  • Convert BEDGRAPH to BIGWIG:

     
  • Rename your final file (substitute YOUR_CELL_LINE and CHROMOSOME):

     
  • Find three of your colleagues with the same chromosome as you. Exchange your BIGWIG files and upload them to the server. Thus you should collect at least 4 BIGWIG files with recognisable names (cell line and chromosome should be present in the file names).
mv <file.genomecov.bw> <YOUR_CELL_LINE-CHROMOSOME.bw>
bedGraphToBigWig <file.genomecov.bedgraph> ./GENOME/chrom_sizes.tsv <file.genomecov.bw>

Samples comparison

Task 1 (1 point):  Provide the final set of commands that you've used and briefly justify the parameters.

Task 2 (1 point): Report the final scatteplot for 100 Kb. Describe your observations.
Task 3 (*1 point): Repeat the procedure for 10 Kb  and 1 Kb and report the changes in observations. At what resolution you might observe the differential DNase I signal?

  1. Collect at least 4 DNase I results for your chromosome in bigWig format. One of them should be yours, the others you may take from /home/shared/EpiPract1/. you may use conda environment "hic" (see instructions on activation can be found in EpiPract3).
  2. Use multiBigwigSummary to create a summary file. Use bins mode, bin size 100 Kb:

     
  3. Plot correlations obtained from your file as a scatterplot:


     
  4. Try to adjust the parameters of the plot to improve the analysis. Try plotting in log-scale (--log1p). If you get the warning about the outliers, try to fix it: use --removeOutliers and --corMethod spearman. Which method worked?
  5. * Repeat the procedure for 10 Kb and 1 Kb. You may want to create another intermediary summary file for that. Inspect the changes.
multiBigwigSummary bins --binSize 100000 -b ${file1.bw} ${file2.bw} ${file3.bw} ${file4.bw} -o ${summary.npz}
plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot scatterplot -o ${scatterplot.png}

Experiments comparison

6. Plot correlations for 100 Kb as a heatmap. Use the parameters that you found to be representative for the scatterplots. For example:
 

 

7. Inspect and describe your observations. Adjust the visualization, if needed. 

 

 

plotCorrelation -in ${summary.npz} --corMethod pearson --skipZeros --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o ${heatmap.png}

Result of clustering

Correlation coefficient

Labels are not representative
(correct if needed)

Practical Part: Enrichment Profiles

We will:

  • Call TADs in S2 cell line (Wang et al. 2017),
  • Plot the enrichment profiles for multiple factors.

 

We'll need conda environment "hic" (see instructions on activation can be found in EpiPract3).

ChIP-Seq data analysis

By agalicina

ChIP-Seq data analysis

Today we'll talk about chromatin openness and ATAC-Seq. We'll finalize the ChIP-Seq data analysis.

  • 149