What you should know about

Graph-based Reference Genomes

 

What are they?

Why should they be used?

How can you use them?

What are they?

UC Santa Cruz Genomics Institute

Chr17, GRCh38

Alt loci

GRCh38 can be seen as a graph
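A minimal sketch of this view: the main chromosome path is cut where an alternative locus branches off and rejoins, and the alt locus becomes a parallel branch. All block names and sequences below are invented toy values, not real GRCh38 data.

```python
# Toy graph: a main path split into three blocks, with one alt locus
# running parallel to the middle block. Names and sequences are invented.
sequences = {
    "chr17_a": "ACGT",       # main path before the alt locus
    "chr17_b": "TTGA",       # main path region covered by the alt locus
    "chr17_alt1": "TTCGGA",  # alternative locus
    "chr17_c": "CCAT",       # main path after the alt locus
}

# Directed edges: block -> blocks that can follow it.
edges = {
    "chr17_a": ["chr17_b", "chr17_alt1"],
    "chr17_b": ["chr17_c"],
    "chr17_alt1": ["chr17_c"],
    "chr17_c": [],
}


def all_paths(node, edges):
    """Enumerate every path from `node` to a block with no successors."""
    if not edges[node]:
        return [[node]]
    return [[node] + rest for nxt in edges[node] for rest in all_paths(nxt, edges)]


def spell(path, sequences):
    """Concatenate the sequences of the blocks along a path."""
    return "".join(sequences[block] for block in path)


haplotypes = [spell(p, sequences) for p in all_paths("chr17_a", edges)]
print(haplotypes)
```

Each path through the graph spells one possible sequence; a linear reference is the special case with exactly one path.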

Why use them?

  1. Individuals do not have the same sequence
     
  2. Enables more accurate mapping and variant calling
     
  3. Create references for variation-rich species

 To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.
Paten et al. (2014)

Some examples

Reference genome of malaria-infected mosquitoes

Harding et al.

De novo assembly and genotyping of variants using colored de Bruijn graphs

Iqbal et al., Nature Genetics 2012

Population reference graphs for tuberculosis bacteria

Mouzos et al.

How can you use graph-based reference genomes?

You can use GRCh38, however:

- Few tools use the alternative loci

- Be aware of flanking regions

- Where to find annotated data?

 

 

 

You can create your own using vg:

- Building a graph with the 1000 Genomes VCF requires 200+ GB of memory and ~1.5 TB of disk space

- Compiling vg can be hard

- No annotated data

Conclusion: Depends on what you want to use it for

What we have done

Coordinates

  • Partition the graph into region paths
     
  • Offset-based coordinate system:
    Region identifier + offset
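As a rough sketch, such a coordinate could be stored as a pair of region path identifier and offset. The class name `Position`, the 0-based offset convention and the region path names and lengths are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    region_path: str  # identifier of the region path (hypothetical names below)
    offset: int       # assumed 0-based offset within that region path


# Hypothetical region path lengths, in base pairs.
region_lengths = {"main_1": 1000, "alt_1": 250, "main_2": 800}


def is_valid(pos, region_lengths):
    """A position is valid if its region path exists and the offset is inside it."""
    length = region_lengths.get(pos.region_path)
    return length is not None and 0 <= pos.offset < length


p = Position("alt_1", 42)
print(is_valid(p, region_lengths))
```

Because the offset is local to a region path, a position on an alternative locus does not need to be translated to main-path coordinates first.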
     

 

Black: Hierarchical partitioning (used today in GRCh38)

Red: Sequential partitioning

Genomic intervals
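One possible way to sketch an interval on a graph is as a start offset, an end offset and the ordered region paths it passes through. The class name `GraphInterval`, the region path names and all lengths are hypothetical; the two toy genes share a region path at their start and then diverge onto the main path and an alternative locus.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class GraphInterval:
    region_paths: Tuple[str, ...]  # ordered region paths the interval passes through
    start_offset: int              # assumed 0-based offset in the first region path
    end_offset: int                # exclusive offset in the last region path

    def start(self) -> Tuple[str, int]:
        """Start coordinate as (region path, offset)."""
        return (self.region_paths[0], self.start_offset)

    def length(self, region_lengths: Dict[str, int]) -> int:
        """Number of bases covered along the listed region paths."""
        if len(self.region_paths) == 1:
            return self.end_offset - self.start_offset
        first = region_lengths[self.region_paths[0]] - self.start_offset
        middle = sum(region_lengths[r] for r in self.region_paths[1:-1])
        return first + middle + self.end_offset


# Hypothetical region path lengths and two toy genes.
region_lengths = {"shared_left": 100, "main_mid": 500, "alt_mid": 300}
gene_main = GraphInterval(("shared_left", "main_mid"), 90, 40)
gene_alt = GraphInterval(("shared_left", "alt_mid"), 90, 55)

print(gene_main.start() == gene_alt.start())
print(gene_main.length(region_lengths), gene_alt.length(region_lengths))
```

With this representation, two intervals that begin in a shared region path have directly comparable start coordinates even when they end on different branches.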

Example: Genes on GRCh38

"Flanking" regions of the alternative locus have been merged with the main path, revealing that the two genes start at the same position.

Next project:
Statistical genomics on GBRGs

Example case:
Do SNPs associated with a disease overlap more with open chromatin in one cell type vs others?

Genomic HyperBrowser

 

Statistical genomics on GBRGs

Assumptions:

  • GWAS SNPs and open chromatin are in variation-rich regions
  • SNPs come from multiple sources
  • Open chromatin follows a single path
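Under these assumptions, the overlap question could be sketched as counting SNPs that fall inside open-chromatin intervals on the same region path, once per cell type. This is only a toy overlap count, not a statistical test, and every region path name, coordinate and cell type label below is invented.

```python
def overlap_count(snps, peaks):
    """Count SNPs that fall inside at least one peak on the same region path.

    snps:  list of (region_path, offset)
    peaks: list of (region_path, start, end) with end exclusive
    """
    count = 0
    for region, offset in snps:
        if any(r == region and start <= offset < end for r, start, end in peaks):
            count += 1
    return count


# Toy SNPs; they may sit on the main path or on an alternative locus.
snps = [("main_1", 120), ("alt_1", 30), ("main_2", 700), ("alt_1", 200)]

# Toy open-chromatin peaks per (hypothetical) cell type.
peaks_by_cell_type = {
    "cell_type_A": [("main_1", 100, 150), ("alt_1", 0, 50)],
    "cell_type_B": [("main_1", 300, 400), ("main_2", 650, 750)],
}

counts = {ct: overlap_count(snps, peaks) for ct, peaks in peaks_by_cell_type.items()}
print(counts)
```

A real analysis would compare such counts against a null model rather than reading them directly.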

Some further challenges

  • Open chromatin on GRCh38?
  • GWAS on GRCh38?
  • Mapping to GRCh38?
     
  • If you ask a bioinformatician for help, they probably do not know what an alt locus is

Summary

  • The future is not linear
  • Use GRCh38 and its alternative loci
  • You can map to GRCh38 using BWA-MEM or vg
  • Using a linear reference increases noise or even leads to biases when doing statistical genomics
  • vg: github.com/vgteam/vg
  • Our preprint: biorxiv.org/content/early/2016/07/11/063206
  • Guide on how to map to GRCh38:
    http://gatkforums.broadinstitute.org/gatk/discussion/8017/how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38

Check out:

grbgs

By ivarg
