Hi GATK folks
I am joint genotyping fairly large cohorts of 30x WGS data (2000 to 5000 samples), following the Best Practices with GATK 3.5.
By and large this works pretty well; the one major bottleneck, however, is the VQSR step. Let me describe my setup (a rough sketch of the VQSR commands follows the list of steps below), and hopefully you can comment on whether and how I can improve my workflow.
1. CombineGVCFs to get from the 2000+ single-sample gVCFs down to <200 multi-sample gVCFs
2. GenotypeGVCFs, parallelized by genomic coordinates
3. VariantAnnotator, parallelized by genomic coordinates
4. CatVariants to write one massive wgs.vcf.gz
5. Train the SNP model:
   VariantRecalibrator(wgs.vcf.gz, mode=SNP) -> SNP-model
6. Train the INDEL model:
   VariantRecalibrator(wgs.vcf.gz, mode=INDEL) -> INDEL-model
7. ApplyRecalibration using the SNP model:
   ApplyRecalibration(wgs.vcf.gz, SNP-model) -> wgs.SNP.vcf.gz
8. ApplyRecalibration using the INDEL model:
   ApplyRecalibration(wgs.SNP.vcf.gz, INDEL-model) -> wgs.SNP.INDEL.vcf.gz
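For concreteness, steps 5-8 boil down to roughly the following (file names, memory/thread settings and the truth-sensitivity threshold are placeholders, and I am only showing one -resource line and a few -an annotations):

    # step 5: train the SNP model on the full cohort VCF (step 6 is the same with -mode INDEL)
    java -Xmx32g -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
        -R ref.fasta -input wgs.vcf.gz -mode SNP -nt 32 \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
        -an QD -an MQ -an FS -an ReadPosRankSum \
        -recalFile wgs.SNP.recal -tranchesFile wgs.SNP.tranches

    # step 7: apply the SNP model; this pass rewrites the entire cohort VCF
    # (step 8 repeats it with -mode INDEL on wgs.SNP.vcf.gz)
    java -Xmx32g -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
        -R ref.fasta -input wgs.vcf.gz -mode SNP --ts_filter_level 99.5 \
        -recalFile wgs.SNP.recal -tranchesFile wgs.SNP.tranches \
        -o wgs.SNP.vcf.gz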
Now, the problem with this setup is that it has to write a massive vcf.gz file in steps 4, 7, and 8. Since these files easily grow beyond 1 TB, each of these steps takes well over a day, and I have seen it take up to three days for a file containing 2700 samples. I am already using -nt 32 on a beefy machine to speed up the computation, but I suspect the real bottleneck is writing the file to disk (simply copying the file with cp took 3 hours).
One idea I had to speed things up would be if VariantRecalibrator could take more than one file as input. Then I could parallelize ApplyRecalibration by genomic region and run the CatVariants from step 4 only once, at the very end.
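In terms of commands, the scatter-gather I have in mind would look roughly like this (completely untested; I am assuming -input can be given more than once and that ApplyRecalibration honours -L, and all names are placeholders):

    # train once across the per-chromosome shards (if repeated -input is allowed)
    java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -mode SNP \
        -input wgs.chr1.vcf.gz -input wgs.chr2.vcf.gz \
        -recalFile wgs.SNP.recal -tranchesFile wgs.SNP.tranches
        # (remaining shards plus -resource/-an arguments omitted)

    # apply the single model to each shard in parallel ...
    java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta -mode SNP \
        -L chr1 -input wgs.chr1.vcf.gz --ts_filter_level 99.5 \
        -recalFile wgs.SNP.recal -tranchesFile wgs.SNP.tranches \
        -o wgs.chr1.SNP.vcf.gz

    # ... and run the CatVariants from step 4 only once, at the very end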
Another idea is to split SNPs and indels into two separate files and run VQSR on each independently, but this still requires a final merge step (e.g. with CombineVariants), or downstream analyses have to deal with two different input files.
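Roughly what I mean, again untested, assuming SelectVariants -selectType does the splitting and CombineVariants the merging (file names are placeholders):

    # split the cohort VCF by variant type
    java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta \
        -V wgs.vcf.gz -selectType SNP -o wgs.snps.vcf.gz
    java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta \
        -V wgs.vcf.gz -selectType INDEL -o wgs.indels.vcf.gz

    # run VariantRecalibrator + ApplyRecalibration on each file independently, then
    # merge the two recalibrated files back into one
    java -jar GenomeAnalysisTK.jar -T CombineVariants -R ref.fasta \
        --variant wgs.snps.recal.vcf.gz --variant wgs.indels.recal.vcf.gz \
        --assumeIdenticalSamples -o wgs.recal.vcf.gz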
It would be great to hear how you at the Broad and others handle the VQSR step, especially when dealing with, say, 5000 samples. What are typical run times that you achieve? How do you parallelize it, if that is even possible?
Jens