We have to perform HaplotypeCaller variant calling on a cohort of whole-genome samples (~400 samples). Since the region to be called is huge, I was wondering what the best way would be to go about the sample-level gVCF calling and, thereafter, the GenotypeGVCFs step.

- Should we split the whole-genome region into several smaller parts (for example, 100 BED parts) and then perform gVCF calling for each of those 100 parts for each of the 400 samples (100 BED parts * 400 samples = 40,000 gVCFs)?
- ..and then merge each BED part gVCF from each sample into one final joined VCF for that BED part (100 VCFs)?
- ..and then concatenate each of the joined 100 VCF parts into one final whole-genome VCF file?

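
To make the bookkeeping concrete, here is a minimal Python sketch of the job plan I have in mind. Everything in it is a placeholder rather than our real setup: the sample names, the single `chr1` contig with a made-up length, and the helper names `split_interval` and `plan_jobs` are all hypothetical. A real whole genome would of course span many contigs.

```python
def split_interval(contig, length, n_parts):
    """Split one contig into n_parts contiguous BED-style intervals
    (0-based start, exclusive end)."""
    bounds = [length * i // n_parts for i in range(n_parts + 1)]
    return [(contig, bounds[i], bounds[i + 1]) for i in range(n_parts)]


def plan_jobs(samples, parts):
    """Enumerate the three stages described in the list above."""
    # Stage 1: one gVCF calling job per (sample, BED part) pair.
    scatter = [(sample, part) for sample in samples for part in parts]
    # Stage 2: one joint genotyping job per BED part, across all samples.
    joint = list(parts)
    # Stage 3: a single concatenation of the per-part joint VCFs.
    n_concat_jobs = 1
    return scatter, joint, n_concat_jobs


# Hypothetical inputs: 400 samples, one toy contig split into 100 parts.
samples = [f"sample_{i:03d}" for i in range(1, 401)]
parts = split_interval("chr1", 1_000_000, 100)

scatter, joint, n_concat_jobs = plan_jobs(samples, parts)
print(len(scatter), len(joint), n_concat_jobs)  # 40000 100 1
```

The counts match the arithmetic above: 40,000 per-sample gVCFs, 100 joined VCFs, and one final concatenation.
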
Or is there a more efficient way to go about this?
I have not been able to find much information on your forums about people doing gVCF calling on smaller BED regions and then stitching those regions' gVCFs together into one giant gVCF or VCF. I am aware that there is a -L option available in the HaplotypeCaller module, but I am not sure what the recommended best practices are for using that option when it comes to gVCF calling.
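
For reference, this is roughly how I would expect to pass one BED part to HaplotypeCaller in gVCF mode. It is only a sketch and assumes GATK4-style command-line syntax (older GATK 3 releases are invoked differently); the file names and the helper `haplotypecaller_gvcf_cmd` are hypothetical, and the function only builds the command, it does not run it.

```python
import shlex


def haplotypecaller_gvcf_cmd(reference, bam, bed_part, out_gvcf):
    """Build (but do not run) a per-sample, per-part gVCF calling command.

    Assumes GATK4-style arguments; all paths are placeholders.
    """
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,   # reference FASTA
        "-I", bam,         # one sample's aligned reads
        "-L", bed_part,    # restrict calling to this BED part
        "-ERC", "GVCF",    # emit reference confidence, i.e. gVCF output
        "-O", out_gvcf,    # per-sample, per-part gVCF
    ]


cmd = haplotypecaller_gvcf_cmd(
    "ref.fasta",
    "sample_001.bam",
    "part_001.bed",
    "sample_001.part_001.g.vcf.gz",
)
print(shlex.join(cmd))
```

Whether restricting each run with -L in this way is the recommended practice for gVCF calling is exactly what I am unsure about.
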
Shalabh Suman