Project to analyse and report QC metrics across all ChIP-seq data.
Training
MRC asking for all PhD students to receive bioinformatics training.
Introduction to R
Statistical inference in Biology.
Applied bioinformatics
Postdocs and senior scientists wish to acquire skills.
R
High-throughput sequencing analysis.
Simple statistics
Training from Bioinformatics team
General training in genomics data
Visualisation of high-throughput sequencing data
Training in the analysis of high-throughput sequencing data in R
Analysis of HTS in R
All courses are presented in RStudio.
Rpres format used to allow display of material within RStudio itself.
All material is public and version controlled on GitHub.
Topics include
Introduction to R
Reproducible R
Introduction to Bioconductor
Introduction to ChIP-seq and RNA-seq
Analysis of HTS in R - Issues
Very popular course.
Requires large room.
Requires computer systems set-up in advance.
Any changes require update to all computers.
Analysis of HTS is computationally expensive.
Most courses use dummy data.
Use of subsetted data limits the usefulness of the course for real-world data.
Data for course must be downloaded to every machine.
Incorrect selection of data often leads to problems.
Analysis of HTS in R - Solution
RStudio Server.
Bring a low-powered laptop simply to log in to the server.
Environment already set-up and easy to update.
All computation runs on the server. This allows for examples using multicore processing and more memory.
Central/shared directory for data.
An RStudio Server?
Where to get an RStudio Server?
Invest in MRC CSC server?
Not so scalable
Not useful for most of the year
Use cloud platform? Tried in BioC 2014 and received grant for MRC CSC courses.
HTS pipelines
Rapid and reproducible results from high-throughput sequencing data are essential for any modern bioinformatics core.
To achieve this we use version controlled pipelines optimised for our local compute resources.
ChIP-seq pipeline
Previous versions of ChIP-seq pipeline were developed at Cambridge University by Thomas Carroll.
Widely used in CRUK, Sanger and Cambridge University and now adapted for MRC CSC in London.
Many of the tools within the pipeline have now been wrapped up in R/Bioconductor packages maintained by MRC CSC.
ChIPQC.
Triform.
soGGi.
tracktables.
Updated ChIP-seq pipeline
First pipeline used multiple tool sets and so was hard to version control and install.
An R-centric ChIP-seq pipeline has now been developed within the MRC Clinical Sciences Centre.
This new pipeline runs as a single R Markdown script and generates HTML reports linked to IGV.
Easily installed dependencies from Bioconductor and CRAN repositories.
Testing the ChIP-seq pipeline
To test the first version of the ChIP-seq pipeline we analysed 1400 datasets and investigated QC metrics and their relationships both to each other and to related processing steps.
This provided us with essential knowledge of which QC flags to use for quality control of ChIP-seq data.
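The kind of comparison described above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: it assumes each QC metric is stored as a list with one value per dataset, it uses Pearson correlation as one possible measure of how metrics relate, and the function names and the cutoff passed to `flag_datasets` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of metric values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant metric")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(metrics):
    """Correlate every pair of metrics.

    `metrics` maps a metric name to a list holding one value per dataset.
    Returns a dict keyed by (name_a, name_b) with name_a < name_b.
    """
    names = sorted(metrics)
    return {
        (a, b): pearson(metrics[a], metrics[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


def flag_datasets(values, cutoff):
    """Return the indices of datasets whose metric value is below `cutoff`.

    The cutoff is supplied by the caller; no recommended value is implied.
    """
    return [i for i, value in enumerate(values) if value < cutoff]
```

With metrics collected across many datasets, `pairwise_correlations` gives a quick view of which metrics carry overlapping information, and `flag_datasets` marks the datasets that fall below a cutoff chosen from that analysis.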
Testing the R-centric ChIP-seq pipeline
More recently, new forms of ChIP-seq have emerged (MNase-seq, ChIP-exo), and with new technologies the sequencing output has rapidly grown.
A large-scale reanalysis of ChIP-seq data is required to gain a better understanding of how metrics relate to new ChIP methods as well as to the increased length, depth and complexity of sequencing.
Testing the R-centric ChIP-seq pipeline
Re-run the study with new tools and the new pipeline on all available ChIP-seq, ChIP-exo and MNase-seq data.
Analysis of results will be reviewed within the Bioinformatics team and with Shamith Samarajiwa (MRC Cancer Unit) and Ines De Santiago (CRUK Cambridge).
The full pipeline and all metric results will be freely available and published on GitHub. Analysis of QC flags to be published in a peer-reviewed journal.