Dear GATK team members and forum users,
I am analysing 200 germline whole genomes following the GATK Best Practices. I am running into an issue with GenotypeGVCFs, whose runtime grows much faster than linearly as the number of samples (gVCFs) increases.
To give some context: I have 200 germline whole-genome BAM files. They are high coverage, so each file is between 40 and 130 GB, and after base recalibration the BAMs roughly double in size. The recalibrated BAMs are the input to HaplotypeCaller, which I have run on ~100 of them so far to produce gVCFs.
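For reference, the per-sample HaplotypeCaller command looks roughly like this (the memory setting and file names are simplified placeholders, not my exact values):
java -Xmx32g -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-I sample1.recal.bam \
--emitRefConfidence GVCF \
-o sample1.g.vcf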
Now I want to perform joint genotyping with GenotypeGVCFs. With only 22 samples, running GenotypeGVCFs on the 22 gVCFs did not take long (around 4.5 hours), but re-running it with 100 samples takes far too long (around 1 week). I am running the pipeline on an HPC cluster with a maximum walltime of 1 week, so GenotypeGVCFs is killed before it finishes.
The gVCFs are compressed with bgzip and indexed with tabix; each .g.vcf.gz weighs between 1.9 and 7 GB. These are the input to GenotypeGVCFs, and I am giving the job 230 GB of memory.
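The compression and indexing step is just the standard bgzip/tabix pair per sample, e.g.:
bgzip sample1.g.vcf
tabix -p vcf sample1.g.vcf.gz
The exact GenotypeGVCFs command I am running is the following: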
java -Xmx230g -Djava.io.tmpdir=/tmp \
-jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fa \
--dbsnp bundle2_8/b37/dbsnp_138.b37.vcf \
--variant sample1.g.vcf.gz --variant sample2.g.vcf.gz ... --variant sample100.g.vcf.gz \
-o joint.vcf
The reason I am not using the -nt option is that it fails with an "ERROR MESSAGE: Code exception".
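In case it helps, the -nt attempt was simply the same command with a data-threads argument added (the thread count here is just an example):
java -Xmx230g -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-nt 8 \
-R reference.fa \
--variant sample1.g.vcf.gz ... --variant sample100.g.vcf.gz \
-o joint.vcf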
The GATK version I am using is 3.7.
I also tried grouping the 100 gVCFs into 2 batches of 50 and running them separately, but this also takes too long: around 3 days per batch (6 days in total).
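One option I have been considering, although I am not sure it is the recommended route, is pre-merging the gVCFs in batches with CombineGVCFs and then running GenotypeGVCFs on the combined files. A sketch of what I have in mind (the batch split and file names are only illustrative):
java -Xmx230g -jar GenomeAnalysisTK.jar \
-T CombineGVCFs \
-R reference.fa \
--variant sample1.g.vcf.gz ... --variant sample50.g.vcf.gz \
-o batch1.g.vcf
That way GenotypeGVCFs would only see 2 combined gVCFs instead of 100 individual ones.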
I would like to know what approach would be suitable for handling this amount of data, and whether these runtimes are expected. I am quite concerned because I do not know how I will manage once all 200 gVCFs are ready.
Any advice will be much appreciated.
Thanks,
Ezequiel