What you should know about
Graph-based Reference Genomes
What are they?
Why should they be used?
How can you use them?
What are they?
UC Santa Cruz Genomics Institute
[Figure: Chr17, GRCh38, with alt loci marked]
GRCh38 can be seen as a graph
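To make the graph view concrete, here is a minimal sketch in which the main chromosome path and one alt locus are two routes through the same graph. The block names (`chr17_before`, `alt_locus`, and so on) are hypothetical and chosen only for illustration; they are not GRCh38 identifiers.

```python
from collections import defaultdict

# Directed graph: each key is a sequence block, each value lists its successors.
# All block names are hypothetical placeholders, not real GRCh38 identifiers.
edges = defaultdict(list)


def add_edge(src, dst):
    edges[src].append(dst)


# Main path of the chromosome, split into three blocks around one alt locus.
add_edge("chr17_before", "chr17_middle")
add_edge("chr17_middle", "chr17_after")
# The alt locus is an alternative route past the middle block.
add_edge("chr17_before", "alt_locus")
add_edge("alt_locus", "chr17_after")


def paths(node, end):
    """Enumerate every path from node to end (fine for a tiny acyclic graph)."""
    if node == end:
        return [[node]]
    return [[node] + rest for nxt in edges[node] for rest in paths(nxt, end)]


all_paths = paths("chr17_before", "chr17_after")
for p in all_paths:
    print(" -> ".join(p))
```

A linear reference corresponds to keeping only one of these paths.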
Why use them?
- Individuals do not have the same sequence
- Enables more accurate mapping and variant calling
- Makes it possible to create references for variation-rich species
To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.
Paten et al. (2014)
Some examples
- Reference genome of malaria-infected mosquitoes (Harding et al.)
- De novo assembly and genotyping of variants using colored de Bruijn graphs (Iqbal et al., Nature Genetics 2012)
- Population reference graphs for tuberculosis bacteria (Mouzos et al.)
How can you use graph-based reference genomes?
You can use GRCh38, however:
- Few tools use the alternative loci
- Be aware of flanking regions
- Where to find annotated data?
You can create your own using vg:
- Building a graph from the 1000 Genomes VCF requires 200+ GB of memory and ~1.5 TB of disk space
- Compiling vg can be hard
- No annotated data
Conclusion: Depends on what you want to use it for
What we have done
Coordinates
- Partition the graph into region paths
- Offset-based coordinate system: region identifier + offset
Black: Hierarchical partitioning (used today in GRCh38)
Red: Sequential partitioning
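A coordinate of the form "region identifier + offset" can be sketched as a small value type. The region path names and lengths below are hypothetical. The helper `to_linear` is an illustration only: it shows that the same graph position gets different linear coordinates depending on which path through the graph is followed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    region_path: str  # identifier of the region path
    offset: int       # 0-based offset within that region path


# Hypothetical region paths and their lengths in base pairs.
REGION_LENGTHS = {"main_1": 100, "main_2": 50, "alt_1": 60, "main_3": 200}


def to_linear(pos, path):
    """Linear coordinate of pos when walking the given list of region paths."""
    linear = 0
    for region_path in path:
        if region_path == pos.region_path:
            return linear + pos.offset
        linear += REGION_LENGTHS[region_path]
    raise ValueError(f"{pos.region_path} is not on the given path")


pos = Position("main_3", 10)
via_main = to_linear(pos, ["main_1", "main_2", "main_3"])
via_alt = to_linear(pos, ["main_1", "alt_1", "main_3"])
print(via_main, via_alt)
```

The graph coordinate `Position("main_3", 10)` stays the same in both cases; only its linear projection changes.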
Genomic intervals
Example: Genes on GRCh38
"Flanking" regions of the alternative locus have been merged with the main path, revealing that the two genes start at the same position.
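One way to picture such intervals is as a start position, an end position, and the region paths in between. The sketch below is an assumption about how this could be modelled, with hypothetical region paths and gene coordinates. It shows two genes, one on the main path and one on the alt locus, that share a start position once the flanks are part of the main path.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Position:
    region_path: str
    offset: int


@dataclass(frozen=True)
class Interval:
    start: Position                 # inclusive
    end: Position                   # exclusive
    region_paths: Tuple[str, ...]   # region paths traversed, in order

    def length(self, region_lengths):
        """Number of bases covered by the interval."""
        if len(self.region_paths) == 1:
            return self.end.offset - self.start.offset
        first = region_lengths[self.region_paths[0]] - self.start.offset
        middle = sum(region_lengths[rp] for rp in self.region_paths[1:-1])
        return first + middle + self.end.offset


# Hypothetical region paths: the alt locus replaces main_2.
REGION_LENGTHS = {"main_1": 100, "main_2": 50, "alt_1": 60, "main_3": 200}

# Hypothetical genes: both start in the shared region main_1.
gene_on_main = Interval(Position("main_1", 80), Position("main_3", 20),
                        ("main_1", "main_2", "main_3"))
gene_on_alt = Interval(Position("main_1", 80), Position("main_3", 20),
                       ("main_1", "alt_1", "main_3"))

same_start = gene_on_main.start == gene_on_alt.start
print(same_start)
print(gene_on_main.length(REGION_LENGTHS), gene_on_alt.length(REGION_LENGTHS))
```

With separate coordinate systems for the main path and the alt locus, the shared start would not be visible from the coordinates alone.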
Next project:
Statistical genomics on graph-based reference genomes (GBRGs)
Example case:
Do SNPs associated with a disease overlap more with open chromatin in one cell type vs others?
Genomic HyperBrowser
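On a linear reference, the raw quantity behind this question could be computed roughly as follows. This is a simplified sketch with made-up SNPs and intervals, not the Genomic HyperBrowser's method. It only counts overlap and does not test for significance.

```python
# Hypothetical disease-associated SNPs as (chromosome, position).
snps = [("chr1", 150), ("chr1", 420), ("chr2", 75), ("chr2", 900)]

# Hypothetical open chromatin per cell type as half-open (chromosome, start, end).
open_chromatin = {
    "cell_type_A": [("chr1", 100, 200), ("chr2", 50, 100)],
    "cell_type_B": [("chr1", 400, 450)],
}


def fraction_in_open_chromatin(snps, intervals):
    """Fraction of SNPs that fall inside at least one interval."""
    hits = sum(
        any(chrom == c and start <= pos < end for c, start, end in intervals)
        for chrom, pos in snps
    )
    return hits / len(snps)


fractions = {
    cell_type: fraction_in_open_chromatin(snps, intervals)
    for cell_type, intervals in open_chromatin.items()
}
for cell_type, fraction in fractions.items():
    print(cell_type, fraction)
```

A real analysis would compare these fractions against a null model. On a graph-based reference, the position check would also need to use graph coordinates rather than a single chromosome offset.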
Statistical genomics on GBRGs
Assumptions:
- GWAS SNPs and open chromatin are in variation-rich regions
- SNPs come from multiple sources
- Open chromatin follows a single path
Some further challenges
- Open chromatin on GRCh38?
- GWAS on GRCh38?
- Mapping to GRCh38?
- If you ask a bioinformatician for help, they probably do not know what an alt locus is
Summary
- The future is not linear
- Use GRCh38 and its alternative loci
- You can map to GRCh38 using BWA-MEM or vg
- Using a linear reference increases noise or even leads to biases when doing statistical genomics
Check out:
- vg: github.com/vgteam/vg
- Our preprint: biorxiv.org/content/early/2016/07/11/063206
- Guide on how to map to GRCh38: http://gatkforums.broadinstitute.org/gatk/discussion/8017/how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38