Tools for a Reproducible Workflow for ChIP-seq QC and Visualisation.

Thomas Carroll

Head of Bioinformatics

MRC Clinical Sciences Centre

Overview

  • Higher sequencing output allows more complex hypotheses.
  • New challenges for:
    • Quality control
    • Visualisation
    • Dynamics of epigenetic states.
    • Reproducibility and reporting.

 

Then

  • Limited sequence depth.
  • Shorter reads.
  • Developing protocols.
  • Single sample studies.
  • Non-standardised toolset.
  • ChIP-seq != RNA-seq

 

Now

  • Greater depths per lane.
  • Established protocols
  • Multiplexing technologies - Sample replication.
  • Advanced toolsets
  • Borrowing of analysis techniques between data-types

High-throughput sequencing technologies

Bigger studies require large scale quality control

  • ChIP-seq has many sources of noise
    • Inefficient or non-specific antibodies.
    • Artefact signals.
    • Signal complexity.
  • Requirement for methods to assess quality within sample groups and across large experiments.

How to assess what is useful.

  • ENCODE and others have established metrics of ChIP-seq quality.
    • SPP/phantompeakqualtools.
    • htSeqTools.
  • Use and interpretation of these metrics varies with ChIP type.
  • Such metrics may be overfitted to these big studies and dependent on the standardised processing protocols applied.

Epigenetics data QC study

  • > 500 ChIP-seq datasets for transcription factors and epigenetic marks.
  • Evaluate quality control metrics under different conditions.
  • Identify importance/redundancy of QC metrics.
  • Establish effect of processing on QC metrics

Carroll et al. Impact of artifact removal on ChIP quality metrics in ChIP-seq and ChIP-exo data. Front Genet. April 2014.

Blacklisted Regions

  • Identified by Peter Park.
  • DAC list created by Anshul Kundaje for ENCODE.
  • Species-specific regions of consistently ultra-high signal.
  • Contain repeat regions and genomic expansions.
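
As a rough illustration of how blacklisted regions are used in QC, the sketch below splits reads into those overlapping a blacklist and those that do not, and reports the blacklisted fraction. The coordinates, the read tuples and the function names are invented for this example; a real analysis would read the published blacklist (BED) and aligned reads (BAM) with dedicated tools.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]      # half-open [start, end)
Read = Tuple[str, int, int]     # (chromosome, start, end), half-open


def overlaps_blacklist(read: Read, blacklist: Dict[str, List[Interval]]) -> bool:
    """True if the read overlaps any blacklisted interval on its chromosome."""
    chrom, start, end = read
    return any(start < b_end and b_start < end
               for b_start, b_end in blacklist.get(chrom, []))


def split_by_blacklist(reads: List[Read], blacklist: Dict[str, List[Interval]]):
    """Return (kept, removed) reads; 'removed' are those hitting the blacklist."""
    kept, removed = [], []
    for read in reads:
        (removed if overlaps_blacklist(read, blacklist) else kept).append(read)
    return kept, removed


# Toy data: these coordinates are hypothetical, not real blacklist entries.
blacklist = {"chr1": [(1000, 2000)]}
reads = [
    ("chr1", 950, 1000),    # ends exactly where the region starts: no overlap
    ("chr1", 990, 1040),    # straddles the region start: overlap
    ("chr1", 1500, 1550),   # inside the region: overlap
    ("chr2", 1500, 1550),   # different chromosome: no overlap
]

kept, removed = split_by_blacklist(reads, blacklist)
fraction_blacklisted = len(removed) / len(reads)
print(f"{len(removed)} of {len(reads)} reads fall in blacklisted regions")
```

The linear scan over intervals keeps the sketch short; with genome-scale data an interval tree or sorted-interval search would be the usual choice.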

Blacklisted Regions and the "Phantom Peak"

  • Fragment length peak indicates ChIP signal strength.
  • "Phantom peak" often identified in cross-correlation analysis at read length.
  • Signal from blacklisted regions contributes solely to this peak.
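
The idea behind the cross-correlation profile can be sketched with toy data: forward-strand and reverse-strand read counts are correlated at increasing shifts, and the shift with the highest correlation estimates the fragment length. In this simplified example (all positions, counts and lengths are invented) an artefact pile-up offset by the read length produces a second, smaller peak that disappears when the artefact is absent. Real tools work from aligned reads and include details not modelled here.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0


def strand_cross_correlation(fwd, rev, max_shift):
    """Correlate forward counts at position i with reverse counts at i + shift."""
    n = len(fwd)
    return {s: pearson(fwd[:n - s], rev[s:]) for s in range(max_shift + 1)}


def toy_strands(length, sites, offset, count=5):
    """Place 'count' forward reads at each site and matching reverse reads 'offset' bp downstream."""
    fwd, rev = [0] * length, [0] * length
    for pos in sites:
        fwd[pos] += count
        rev[pos + offset] += count
    return fwd, rev


FRAGMENT_LENGTH, READ_LENGTH = 150, 36   # hypothetical values

# True binding events: strands separated by the fragment length.
fwd, rev = toy_strands(2000, [200, 600, 1100, 1500], FRAGMENT_LENGTH)
clean = strand_cross_correlation(fwd, rev, 300)

# Add an artefact pile-up whose strands are separated by the read length.
fwd_art, rev_art = list(fwd), list(rev)
fwd_art[1800] += 5
rev_art[1800 + READ_LENGTH] += 5
with_artefact = strand_cross_correlation(fwd_art, rev_art, 300)

best_shift = max(with_artefact, key=with_artefact.get)
print("estimated fragment length:", best_shift)
print("correlation at read length, with artefact:", round(with_artefact[READ_LENGTH], 3))
print("correlation at read length, artefact removed:", round(clean[READ_LENGTH], 3))
```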

Measures of inequality of coverage (SSD)

  • Standardised Standard Deviation (SSD) of ChIP-signal.
    • Measure of overall "peakiness".
    • Better suited to longer marks.
  • SSD is strongly influenced by blacklisted regions.
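
A minimal sketch of why artefact regions matter for SSD, assuming SSD is the standard deviation of per-base coverage divided by the square root of the number of reads. That formulation is an assumption made for this example, and implementations may scale the statistic differently. The coverage values below are invented.

```python
from math import sqrt
from statistics import pstdev


def ssd(coverage, n_reads):
    """Assumed formulation: SD of per-base coverage scaled by sqrt(read count)."""
    return pstdev(coverage) / sqrt(n_reads)


READ_LENGTH = 50                      # hypothetical read length
PEAK = range(3000, 3200)              # a genuine 200 bp enriched region
ARTEFACT = range(8000, 8050)          # a 50 bp artefact "tower"

# Toy coverage: flat background, one real peak, one extreme artefact.
coverage = [2] * 10000
for i in PEAK:
    coverage[i] = 20
for i in ARTEFACT:
    coverage[i] = 500

# Approximate the read count from total coverage.
n_reads = sum(coverage) // READ_LENGTH
ssd_raw = ssd(coverage, n_reads)

# Mask the artefact region and recompute from the remaining positions.
masked = [c for i, c in enumerate(coverage) if i not in ARTEFACT]
ssd_masked = ssd(masked, sum(masked) // READ_LENGTH)

print(f"SSD with artefact: {ssd_raw:.3f}")
print(f"SSD after masking: {ssd_masked:.3f}")
```

In this toy case a single 50 bp tower dominates the statistic, which is the behaviour the bullet above describes.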

Library complexity and duplication levels

  • Library complexity is often assessed by duplication rate.
  • The number of artefact duplicates is often overestimated.
  • Duplicates may arise from high ChIP efficiency and/or high sequencing depth.
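
The distinction between artefact and signal-driven duplicates can be illustrated by computing the duplication rate separately inside and outside peaks. This is a toy sketch: the read tuples, the peak coordinates and the function names are invented, and reads are treated as duplicates when chromosome, 5' position and strand all match.

```python
from collections import Counter


def duplication_rate(reads):
    """Fraction of reads that repeat an already-seen (chrom, 5' position, strand)."""
    counts = Counter(reads)
    total = sum(counts.values())
    return (total - len(counts)) / total if total else 0.0


def in_peak(read, peaks):
    """True if the read's 5' position lies in a peak (half-open intervals)."""
    chrom, pos, _strand = read
    return any(start <= pos < end for start, end in peaks.get(chrom, []))


# Toy data: hypothetical reads and a single hypothetical peak.
reads = (
    [("chr1", 100, "+")] * 4          # heavily duplicated position inside the peak
    + [("chr1", 105, "+"),
       ("chr1", 5000, "+"),
       ("chr1", 5000, "-"),           # same position, other strand: not a duplicate
       ("chr1", 7000, "+")]
)
peaks = {"chr1": [(50, 200)]}

overall = duplication_rate(reads)
inside = duplication_rate([r for r in reads if in_peak(r, peaks)])
outside = duplication_rate([r for r in reads if not in_peak(r, peaks)])

print(f"overall: {overall:.3f}, in peaks: {inside:.3f}, outside peaks: {outside:.3f}")
```

Here all duplicates sit inside the peak, so removing every duplicate would discard genuine signal rather than artefact.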

ChIP-exo

  • ChIP-exo offers higher resolution and greater efficiency.
  • Reduced "Blacklisted" signal.
  • Higher duplication in peaks.
  • Cross-correlation profiles require interpretation.

ChIPQC package

High-throughput QC of epigenetics data.

  • Implemented QC learned from study.
  • Designed for work on large experiments with multiple groups.
  • Implemented in R and Bioconductor for ease of installation, implementation and integration with reporting.

ChIPQC examples

  • Evaluate +/- read clustering around peaks.

 

  • Contribution of artefact signal.

 

  • Distribution of ChIP signal across genome.

QC from low depth

(MiSeq)

  • The speed of a MiSeq run allows rapid evaluation of a ChIP prior to large-scale sequencing.
  • Direct comparison of HiSeq vs MiSeq demonstrates that MiSeq data can be used to evaluate ChIP quality.

~ 1 Million reads

Visualisation of data

  • Essential step is to look at your data.

 

  • Evaluate results in their genomic context

 

  • Identify patterns and artefact regions.

 

Genome Browsing

  • Genome Browsers present a linear representation of genomics data.

 

  • Allow for integration of multiple data-types as well as inclusion of both the user's and public datasets.

 

  • Popular browsers include UCSC (server) and IGV (local).

Integration with genome browsers

 

  • Track hubs and sample information files may be created and manually curated.

 

  • For bigger studies, the organisation of tracks in a genome browser is not a trivial task.

 

  • Evaluation of results requires exporting, conversion and potentially uploading to FTP servers.

tracktables package.

Generation and integration of R objects and dynamic HTML reports with the IGV genome browser.

  • Allow for rapid presentation of high-throughput sequencing results in the IGV genome browser.

 

  • Produce experiment focused IGV-linked HTML reports.

 

  • Cast R objects into HTML tables with links to genomic locations.

 

  • Integration with pre-existing reporting tools reinforcing reproducible research.
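
The idea of an HTML table whose rows link to genomic locations can be sketched as follows. This is not the tracktables API; it is a minimal Python illustration that relies on IGV's remote-control port (60151 by default, and it must be enabled in a running IGV session). The function names and the example peaks are hypothetical.

```python
from html import escape

IGV_PORT = 60151  # IGV's default port for remote-control commands


def igv_link(chrom, start, end):
    """Anchor tag asking a locally running IGV to jump to the given locus."""
    locus = f"{chrom}:{start}-{end}"
    url = f"http://localhost:{IGV_PORT}/goto?locus={locus}"
    return f'<a href="{url}">{escape(locus)}</a>'


def intervals_to_html(rows):
    """Render rows (dicts with chrom, start, end, score) as an HTML table."""
    header = "<tr><th>Location</th><th>Score</th></tr>"
    body = "".join(
        f"<tr><td>{igv_link(r['chrom'], r['start'], r['end'])}</td>"
        f"<td>{escape(str(r['score']))}</td></tr>"
        for r in rows
    )
    return f"<table>{header}{body}</table>"


# Hypothetical peak calls used only to show the output format.
peaks = [
    {"chrom": "chr1", "start": 100, "end": 200, "score": 12.5},
    {"chrom": "chr2", "start": 5000, "end": 5400, "score": 8.1},
]
html_table = intervals_to_html(peaks)
print(html_table)
```

Clicking a location in the rendered table moves the open IGV session to that interval, so results can be inspected in their genomic context without manual navigation.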

Building Experiment Reports

Rapid prototyping by visualisation

  • Cast R/Bioconductor objects into HTML tables.

 

  • Allows user to evaluate results in genomic context.

Dynamics of epigenetic states

  • With the advent of multiplexing technologies
    • Greater number of sample groups
    • Higher replicate numbers
  • More complex hypotheses.
    • From simple epigenetic mapping to identifying significant changes in epigenetic states.

Borrowing RNA-seq methods for ChIP-seq data

  • Differential affinity/binding of epigenetic marks or transcription factors.

 

  • Methods implemented for RNA-seq are now being used for ChIP-seq (DESeq2/limma/edgeR/DiffBind).

 

  • ChIPQC integrates with DiffBind.

Visualising dynamics of epigenomic states over genomic interval sets

  • RNA-seq visualisation methods don't cover ChIP-seq

 

  • Require tools to investigate shapes of signal.

 

  • Simple tools designed for small sample numbers.

soGGi package

Summarising Over Grouped Genomic Intervals

  • Accepts multiple input types (BAMs/bigWigs/Motifs) to generate profiles.

 

  • Leverages ggplot2 to enable an intuitive plotting grammar.

 

  • Implements normalisation procedures and arithmetic operations across profiles.

Profile types require different techniques

  • soGGi allows for three types of plot
    • Point (signal around a genomic location).
    • Meta (signal over and around a genomic region, normalised to that genomic region's length).
    • Hybrid (Point plot over edges and Meta plot over central genomic region).
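
The difference between point and meta profiles can be sketched in a few lines. This is not soGGi code; the function names and toy signal are invented. A point profile takes a fixed window around a location, while a meta profile rescales regions of different lengths to a fixed number of bins so that they can be averaged. A hybrid profile would combine point windows at the region edges with a meta profile over the centre.

```python
def point_profile(signal, centre, flank):
    """Signal in a fixed window of 'flank' bases either side of a location."""
    return signal[centre - flank: centre + flank + 1]


def meta_profile(signal, start, end, n_bins):
    """Mean signal in n_bins equal-width bins across the region [start, end)."""
    length = end - start
    if length < n_bins:
        raise ValueError("region is shorter than the number of bins")
    profile = []
    for b in range(n_bins):
        lo = start + (b * length) // n_bins
        hi = start + ((b + 1) * length) // n_bins
        window = signal[lo:hi]
        profile.append(sum(window) / len(window))
    return profile


def average_profiles(profiles):
    """Position-wise mean across profiles of equal length."""
    return [sum(column) / len(column) for column in zip(*profiles)]


# Toy signal: the value at each base equals its coordinate.
signal = list(range(100))

window = point_profile(signal, centre=50, flank=2)
short_region = meta_profile(signal, 0, 10, n_bins=5)    # 10 bp region
long_region = meta_profile(signal, 20, 60, n_bins=5)    # 40 bp region
combined = average_profiles([short_region, long_region])

print(window)
print(short_region, long_region, combined)
```

Both regions end up as five-bin profiles despite their fourfold difference in length, which is what allows them to be summarised together.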

 

Integrating Arithmetic Operations 

 

  • Profiles can be further normalised and operated on to visualise hypotheses.

 

  • Most common arithmetic operations and transformations can be applied between profiles.
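
As an illustration of arithmetic between profiles, the sketch below defines a small profile class with element-wise operators and uses it to compute a log2 ratio of ChIP over input. The class, its methods and the values are hypothetical and only show the general idea, not soGGi's own interface.

```python
from math import log2


class Profile:
    """A signal profile over a fixed set of bins, supporting element-wise arithmetic."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def _combine(self, other, op):
        if len(self.values) != len(other.values):
            raise ValueError("profiles must have the same number of bins")
        return Profile(op(a, b) for a, b in zip(self.values, other.values))

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __sub__(self, other):
        return self._combine(other, lambda a, b: a - b)

    def __truediv__(self, other):
        return self._combine(other, lambda a, b: a / b)

    def log2(self, pseudocount=0.0):
        """Log2 transform; a pseudocount guards against zero-valued bins."""
        return Profile(log2(v + pseudocount) for v in self.values)


# Toy profiles: invented ChIP and input signal over five bins.
chip = Profile([2, 8, 32, 8, 2])
control = Profile([2, 2, 2, 2, 2])

enrichment = (chip / control).log2()
difference = chip - control

print(enrichment.values)
print(difference.values)
```

Operator overloading keeps expressions such as `(chip / control).log2()` close to the hypothesis being visualised.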

Accounting for noise in ChIP-seq

  • Technical variation in ChIP-seq is typically considered very high.

 

  • Signal across artefact regions and duplication rates can be highly variable.

 

  • Such defined sources of noise can inform the models used in differential binding analysis.

pol2Rates package (unpublished)

  • Tools to evaluate quantitative changes in Pol2.

 

  • Calculates normalisation factors from Pol2 signal distribution.

 

  • Not yet available; release is hoped for before the end of the year.

Comparison of Uncorrected vs Corrected

  • Correction significantly improves correlation with RNA-seq data and known targets.

A Reproducible Workflow

  • Created three packages to handle:
    • Quality control
    • Genome Browser integration.
    • Visualisation of signal structure over genomic regions of interest.
  • Integrated with R's rmarkdown (reporting) and packrat (package version control) packages.
  • Together, these allow the creation of customisable and reproducible reports for ChIP-seq.

Acknowledgements

MRC Clinical Sciences Centre

  • Ziwei Liang
  • Gopuraja Dharmalingam
  • Sanjay Khadayate
  • Yi-Fang Wang
  • Matthias Merkenschlager

CRUK Cambridge Institute

  • Rafik Salama
  • Ines de Santiago
  • Rory Stark

References

  • Carroll et al. Impact of artifact removal on ChIP quality metrics in ChIP-seq and ChIP-exo data. Front Genet. April 2014.
  • Bioconductor (http://bioconductor.org/)
    • ChIPQC Bioconductor 2.13 (May 2014)
    • tracktables Bioconductor 2.14 (September 2014)
    • soGGi Bioconductor 3.0 (May 2015)
