HLI WGS HG38

Summary

  • VCF File Characteristics
  • Sample characteristics
  • High Pass Regions
  • Manifest of IDs

VCF File Characteristics

  • N = 2357 files
  • N variants (Min,Mean,Max)
    • (4,495,629; 4,738,841; 5,846,745)
  • Not phased
  • Not QCed

Sample characteristics

  • N=2357
  • Sample volume
  • DNA concentration
  • DNA mass
  • DNA purity 260/280 ratio

Characteristic               1-missingness   Mean (SD)
Sample volume (uL)           100%            60.15 (3.80)
DNA mass (ng), calculated    99.6%           13693.10 (10186.93)
DNA concentration (ng/uL)    99.6%           228.58 (170.61)
DNA purity (260/280)         17.5%           1.72 (0.10)

Notes:

  1. Protein contamination is negligible (pure DNA has a 260/280 ratio of approximately 1.8).
  2. Min mass > 2220 ng, min volume > 30 uL, min concentration > 27 ng/uL
  3. Illumina usually requires a minimum of 10 nM in 20 uL
  4. Min nM > 80, assuming an insert size of 50 bp plus 2 x 151 bp paired-end reads
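
The conversion from a mass concentration to a molar concentration behind notes 3 and 4 can be sketched as follows. This is a minimal sketch, not the calculation used for the slides: the average mass of 660 g/mol per base pair is a common approximation and an assumption here, and the function name is hypothetical.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of double-stranded DNA (assumed approximation)


def ng_per_ul_to_nM(conc_ng_per_ul, fragment_bp, bp_mass=AVG_BP_MASS):
    """Convert a dsDNA concentration in ng/uL to nM for a given fragment length."""
    # 1 ng/uL = 1e-3 g/L; dividing by the molar mass (g/mol) gives mol/L;
    # multiplying by 1e9 gives nM, hence the overall factor of 1e6.
    return conc_ng_per_ul * 1e6 / (bp_mass * fragment_bp)


# Fragment length as described in note 4: 50 bp insert plus two 151 bp reads.
fragment_bp = 50 + 2 * 151
min_nM = ng_per_ul_to_nM(27, fragment_bp)
print(round(min_nM, 1))
```

Under these assumptions the minimum concentration of 27 ng/uL stays above the 80 nM bound quoted in note 4; a different fragment length or molar mass changes the exact value.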

Highpass regions

  • Number
  • Length
  • Total length

Number of regions | chr

Total number of regions 712,265

Length of highpass regions

Length | chr

Mean length | chr 

Highpass total length | chr

2,572,251,922 (84%)

HG38 total non-N length = 3,074,968,030
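
The per-chromosome counts and lengths summarised above can be derived from a BED file of highpass regions. The sketch below assumes standard BED coordinates (0-based start, exclusive end); the function name and the three example intervals are hypothetical, and only the two genome-wide totals come from the slides.

```python
from collections import defaultdict


def summarize_bed(lines):
    """Return {chrom: (n_regions, total_length)} for BED-style lines."""
    stats = defaultdict(lambda: [0, 0])
    for line in lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        stats[chrom][0] += 1
        stats[chrom][1] += int(end) - int(start)
    return {chrom: tuple(values) for chrom, values in stats.items()}


# Hypothetical intervals, for illustration only.
example = ["chr1\t100\t250", "chr1\t400\t500", "chr2\t0\t1000"]
per_chrom = summarize_bed(example)
for chrom, (n_regions, total) in sorted(per_chrom.items()):
    print(chrom, n_regions, total, total / n_regions)  # count, total and mean length

# Genome-wide fraction, using the totals reported on the slide.
highpass_total = 2_572_251_922
non_n_total = 3_074_968_030
fraction = highpass_total / non_n_total
print(f"{fraction:.0%}")
```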

Manifest of IDs

  • Number of individuals sent
  • Number of individuals sequenced
  • Number of twin pairs and zygosity
  • Number of trios

Analysis of IDs

Samples    Retrieved   Submitted   Intersection
N          2,357       2,069       2,036
Setdiff    321         33          0
Gender 1971/65
Singletons 134 134 (46/88)
Twin pairs 951 951 (401/550)
Triplets 0 --
Trios 159 (with DNA) 161 (41/120)
Parents 175 (with DNA)
Missing from DB 106 
Missing Annot 40

Total = Singletons + 2xTwins + Parents = 2211
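
The intersection and set-difference rows of the ID analysis are plain set operations, and the total can be checked arithmetically. A minimal sketch with a hypothetical function name and placeholder IDs (the counts in the final check are the ones reported above):

```python
def compare_ids(retrieved, submitted):
    """Split two ID collections into shared and one-sided sets."""
    retrieved, submitted = set(retrieved), set(submitted)
    return {
        "intersection": retrieved & submitted,
        "retrieved_only": retrieved - submitted,  # Setdiff, Retrieved column
        "submitted_only": submitted - retrieved,  # Setdiff, Submitted column
    }


# Placeholder IDs, for illustration only.
result = compare_ids(["A", "B", "C"], ["B", "C", "D"])
print({key: sorted(value) for key, value in result.items()})

# Consistency checks on the reported counts.
assert 2357 - 2036 == 321 and 2069 - 2036 == 33
singletons, twin_pairs, parents = 134, 951, 175
total = singletons + 2 * twin_pairs + parents
print(total)
```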

Omics overlap

Omics          N
PainExomes     272
GOT2DExomes    100
UK10K          861
EB_Fat         545
EB_Skin        516
EB_LCL         586
EB_WB          298
Fat_450K       449

Pipeline for DNASeq

Annotating, merging, converting, validating, and quality control

Filter Chr20: vcftools

Merge VCF: BCFtools

WG HLI HG38 VCF

Convert to plink: BCFtools

flip +ve strand: plink

Annotate VCF: GATK

Filter highpass regions: Bedtools

gen file

matrixeqtl file

phasing: shapeit2

GenotypeConcordance: VCFtools

Exon HG19 vcf

Coordinate change HG38:

crossmap

convert files: qctool / gtools

plink file

phased plink file

QC individuals

  • Analysis conducted on SNPs shared by HLI WGS and Genotyping arrays merged at Sanger
  • Comparison of statistics with and without exclusion of individuals with high heterozygosity (Het>0.4)

Releases comparison Heterozygosity

Sanger vs Alex

Heterozygosity

Exclusions

 

  • 42 individuals excluded due to high heterozygosity (Het>0.4):
    • 5 Twin Pairs (10 individuals)
    • 22 unrelated (unpaired twins)
    • 10 parents
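
The two per-individual statistics used in this QC, the heterozygosity rate and the inbreeding coefficient F, can be sketched as below. This is an assumption-laden illustration rather than the pipeline's code: genotypes are coded as alternate-allele counts (0, 1, 2, or None when missing), F takes the usual form (observed minus expected homozygous calls, divided by called sites minus expected homozygous calls), and no small-sample correction is applied. Function names and example values are hypothetical.

```python
def het_rate(genotypes):
    """Fraction of called genotypes that are heterozygous."""
    called = [g for g in genotypes if g is not None]
    return sum(g == 1 for g in called) / len(called)


def inbreeding_f(genotypes, alt_freqs):
    """F = (O_hom - E_hom) / (N - E_hom) over called sites."""
    observed_hom = 0
    expected_hom = 0.0
    n_called = 0
    for g, p in zip(genotypes, alt_freqs):
        if g is None:
            continue
        n_called += 1
        observed_hom += g != 1
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)  # Hardy-Weinberg expectation
    if n_called == expected_hom:  # no informative sites
        return float("nan")
    return (observed_hom - expected_hom) / (n_called - expected_hom)


# Hypothetical individual typed at six sites, all with allele frequency 0.5.
genotypes = [0, 1, 1, 2, None, 1]
freqs = [0.5] * 6
rate = het_rate(genotypes)
f_stat = inbreeding_f(genotypes, freqs)
print(rate, f_stat, rate > 0.4)
```

An excess of heterozygous calls gives a negative F, which is why both Het>0.4 and a large |F| can point to the same contaminated sample.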

Heterozygosity before / after exclusions

Heterozygosity -- F

Heterozygosity O&E ~ MAF

Releases comparison Heterozygosity

Tao vs Alex

Only variants with MAF<0.05 and HWE<1E-6

Missing vcf file

[alvesa@athena HLI_HG38]$ ls | grep 6052
60521.vcf.gz
60521.vcf.gz.tbi
     Client.Subject.ID Gender Ethnicity Birth.Date FamilyID Relation Zygosity Stool.Sample HLI.genome.ID
1731             60522      F     White    10/4/35      762     Twin       MZ         TRUE     176500025
 
 

Comparison PLINK vs VCF

After removal of 16 individuals shared with Tao's release with |F|>.1

Comparison PLINK vs VCF

Before filtering SNP with MAF<0.05 and HWE<1E-6

Releases comparison Zygosity & IBD

Sanger vs Alex

Relatedness ~ Zygosity

Concordance

SUM(I(PI_HAT<0.9)) = 22

SUM(I(PI_HAT<0.9)) = 5
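
The quantity SUM(I(PI_HAT<0.9)) is an indicator sum: the number of pairs whose estimated IBD proportion falls below 0.9. A minimal sketch, assuming the pairs are ones expected to be genetically identical; the function name, IDs and values are hypothetical.

```python
def count_low_pi_hat(pairs, threshold=0.9):
    """Count (id1, id2, pi_hat) pairs whose PI_HAT is below the threshold."""
    return sum(1 for _id1, _id2, pi_hat in pairs if pi_hat < threshold)


# Hypothetical PI_HAT values for pairs expected to be identical.
pairs = [("S1", "S1", 0.99), ("S2", "S2", 0.51), ("S3", "S3", 0.02)]
n_low = count_low_pi_hat(pairs)
print(n_low)
```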

Discordant individuals

Discordant individuals: 5201, 5202, 50472, 59022, 92521

Discordance analysis

Discordant individuals: 5201, 5202, 50472, 59022, 92521

"Inbred" individual

Individual is unrelated to any sequenced sample, including its chip-genotyped sample

Sample swaps

 

Sample swap within twin pairs:

HLI 5202 is actually SANGER 5201

HLI 5201 is actually SANGER 5202

Sample swaps in unpaired twins:

HLI 50472 is actually SANGER 50471

HLI 59021 is actually SANGER 59022

Individuals with low relatedness with SANGER sample:

HLI 92521 is matched to SANGER 92521 but with lower relatedness.

Unpaired twins in either SANGER or HLI:

SANGER 59021

HLI 50471
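
One way to surface swaps like those listed above is to find, for each sequenced sample, the array sample with the highest PI_HAT and flag cases where that best match carries a different ID. The sketch below is an illustration under that assumption, not the procedure used for these slides; the function name, IDs and PI_HAT values are hypothetical.

```python
def find_swaps(pi_hat, min_pi_hat=0.9):
    """Return {hli_id: best_sanger_id} where the best match has a different ID.

    pi_hat maps (hli_id, sanger_id) to the estimated IBD proportion.
    """
    best = {}
    for (hli_id, sanger_id), value in pi_hat.items():
        if hli_id not in best or value > best[hli_id][1]:
            best[hli_id] = (sanger_id, value)
    return {
        hli_id: sanger_id
        for hli_id, (sanger_id, value) in best.items()
        if sanger_id != hli_id and value >= min_pi_hat
    }


# Hypothetical values: X1 and X2 are swapped, X3 matches itself.
pi_hat = {
    ("X1", "X1"): 0.50, ("X1", "X2"): 0.99,
    ("X2", "X1"): 0.98, ("X2", "X2"): 0.50,
    ("X3", "X3"): 0.99, ("X3", "X1"): 0.01,
}
swaps = find_swaps(pi_hat)
print(swaps)
```

A sample whose best match has the right ID but a PI_HAT below the threshold is not a swap under this rule and needs separate review, as with the low-relatedness case noted above.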

Conclusions

  • 2357 samples analysed
  • 1 missing individual as per Tao's release
  • 47 individuals discarded with signs of contamination
    • 2% of all individuals with |F|>0.1
  • 4 sample swaps

Future work

  • Deploy
    • VCF and plink releases with highpass regions filtered in and problematic individuals filtered out
    • merged VCF with all individuals and all regions


Convert PLINK to VCF

WGS data HG38 VCF

(HLI)

Merge Datasets

Array data PLINK HG18 (Sanger merge)

Coordinate change HG38:

crossmap

Flipping alleles from positive to reference strand

Convert VCF to PLINK

Convert VCF to PLINK

IBS Analysis

Filter in MAF>5% HWE>1E-6
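
The MAF and HWE filter in this workflow can be illustrated on genotype counts for a single biallelic variant. The sketch uses a chi-square goodness-of-fit test with one degree of freedom as a stand-in for whatever HWE test the pipeline actually ran (genotype tools often use an exact test instead); function names and example counts are hypothetical.

```python
import math


def maf_and_hwe_p(n_aa, n_ab, n_bb):
    """Return (minor allele frequency, chi-square HWE p-value) from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    maf = min(p, q)
    if maf == 0:
        return 0.0, 1.0  # monomorphic site: HWE test is not informative
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return maf, p_value


def keep_variant(n_aa, n_ab, n_bb, min_maf=0.05, min_hwe_p=1e-6):
    """Keep variants with MAF > 5% and HWE p-value > 1e-6."""
    maf, p_value = maf_and_hwe_p(n_aa, n_ab, n_bb)
    return maf > min_maf and p_value > min_hwe_p


print(keep_variant(25, 50, 25))  # common variant in equilibrium
print(keep_variant(50, 0, 50))   # no heterozygotes at all
print(keep_variant(98, 2, 0))    # rare variant
```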


EG Meeting -- HLI WGS HG38

By acoutoal
