Hi GATK folks
I am joint genotyping fairly large cohorts of 30x WGS data (2000 to 5000 samples), following the Best Practices with GATK 3.5.
By and large this works pretty well; the one major bottleneck, however, is the VQSR step. Let me describe my setup (a rough sketch of the VQSR commands follows the list of steps below), and hopefully you can comment on whether and how I can improve my workflow.
1. CombineGVCFs to get from the 2000+ single-sample gVCFs down to <200 multi-sample gVCFs
2. GenotypeGVCFs, parallelized by genomic coordinates
3. VariantAnnotator, parallelized by genomic coordinates
4. CatVariants to write one massive wgs.vcf.gz
5. Train the SNP model:
   VariantRecalibrator(wgs.vcf.gz, mode=SNP) -> SNP-model
6. Train the INDEL model:
   VariantRecalibrator(wgs.vcf.gz, mode=INDEL) -> INDEL-model
7. ApplyRecalibration using the SNP model:
   ApplyRecalibration(wgs.vcf.gz, SNP-model) -> wgs.SNP.vcf.gz
8. ApplyRecalibration using the INDEL model:
   ApplyRecalibration(wgs.SNP.vcf.gz, INDEL-model) -> wgs.SNP.INDEL.vcf.gz
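For concreteness, steps 5-8 boil down to roughly the following (file names, memory/thread settings and the truth-sensitivity threshold are placeholders, and I am only showing one -resource line and a few -an annotations):

    # step 5: train the SNP model on the full cohort VCF (step 6 is the same with -mode INDEL)
    java -Xmx32g -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
        -R ref.fasta -input wgs.vcf.gz -mode SNP -nt 32 \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
        -an QD -an MQ -an FS -an ReadPosRankSum \
        -recalFile wgs.SNP.recal -tranchesFile wgs.SNP.tranches

    # step 7: apply the SNP model; this pass rewrites the entire cohort VCF
    # (step 8 repeats it with -mode INDEL on wgs.SNP.vcf.gz)
    java -Xmx32g -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
        -R ref.fasta -input wgs.vcf.gz -mode SNP --ts_filter_level 99.5 \
        -recalFile wgs.SNP.recal -tranchesFile wgs.SNP.tranches \
        -o wgs.SNP.vcf.gz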
Now, the problem with this setup is that it has to write a massive vcf.gz file in steps 4, 7, and 8. Since these files easily grow beyond 1 TB, each of these steps takes well over a day, and I have seen it take up to three days for a file containing 2700 samples. I am already using -nt 32 on a beefy machine to speed up the computation, but I suspect the real bottleneck is writing the file to disk (simply copying the file with cp took 3 hours).
One idea I had to speed things up would be if VariantRecalibrator could take more than one file as input. Then I could parallelize ApplyRecalibration by genomic region and run the CatVariants from step 4 only once, at the very end.
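In terms of commands, the scatter-gather I have in mind would look roughly like this (completely untested; I am assuming -input can be given more than once and that ApplyRecalibration honours -L, and all names are placeholders):

    # train once across the per-chromosome shards (if repeated -input is allowed)
    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -mode SNP \
        -input wgs.chr1.vcf.gz -input wgs.chr2.vcf.gz \
        -recalFile wgs.SNP.recal -tranchesFile wgs.SNP.tranches
        # (remaining shards plus -resource/-an arguments omitted)

    # apply the single model to each shard in parallel ...
    java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta -mode SNP \
        -L chr1 -input wgs.chr1.vcf.gz --ts_filter_level 99.5 \
        -recalFile wgs.SNP.recal -tranchesFile wgs.SNP.tranches \
        -o wgs.chr1.SNP.vcf.gz

    # ... and run the CatVariants from step 4 only once, at the very end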
Another idea is to split SNPs and indels into two separate files and run VQSR on each independently, but this still requires a final merge step (e.g. with CombineVariants), or downstream analyses have to deal with two different input files.
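Roughly what I mean, again untested, assuming SelectVariants -selectType does the splitting and CombineVariants the merging (file names are placeholders):

    # split the cohort VCF by variant type
    java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta \
        -V wgs.vcf.gz -selectType SNP -o wgs.snps.vcf.gz
    java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta \
        -V wgs.vcf.gz -selectType INDEL -o wgs.indels.vcf.gz

    # run VariantRecalibrator + ApplyRecalibration on each file independently, then
    # merge the two recalibrated files back into one
    java -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta \
        --variant wgs.snps.recal.vcf.gz --variant wgs.indels.recal.vcf.gz \
        --assumeIdenticalSamples -o wgs.recal.vcf.gz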
It would be great to hear how you at the Broad and others handle the VQSR step, especially when dealing with, say, 5000 samples. What are typical run times that you achieve? How do you parallelize it, if that is even possible?
Jens