Hi!
I'm calling variants in a cohort of samples following the Best Practices recommendations (https://gatkforums.broadinstitute.org/gatk/discussion/3893/calling-variants-on-cohorts-of-samples-using-the-haplotypecaller-in-gvcf-mode). When I run HaplotypeCaller, the estimated running time escalates to more than a month per sample, which is excessive. A way around this is to call variants one chromosome/scaffold at a time using the -L flag:
# $chr is passed to the script by a for loop over all the chromosomes/scaffolds in my reference
java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fasta \
-I sample_realigned.bam \
-L $chr:1+ \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-o sample.$chr.g.vcf
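The wrapper loop is roughly this (call_one_chr.sh is just a placeholder name for a script containing the command above):

# loop over every chromosome/scaffold name listed in the reference .fai index
for chr in $(cut -f1 reference.fasta.fai); do
    bash call_one_chr.sh $chr    # runs the HaplotypeCaller command above for this $chr
done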
I then concatenate the per-chromosome files for each sample using CatVariants. This way I can process each sample in less than a day. Later, I run GenotypeGVCFs on all samples together and get my VCF ready for filtering (rough commands for both steps are below). My question is: is it safe to do this? Am I affecting HaplotypeCaller's ability to call variants by splitting my dataset into many small subsets and then combining them again?
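For reference, the concatenation and joint-genotyping steps look roughly like this (chromosome names are placeholders, I only show two of them, and sample2.g.vcf stands in for the other samples' concatenated GVCFs):

java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
-R reference.fasta \
-V sample.chr1.g.vcf \
-V sample.chr2.g.vcf \
-out sample.g.vcf \
-assumeSorted

java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fasta \
--variant sample.g.vcf \
--variant sample2.g.vcf \
-o cohort.vcf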
Thanks!