Dear GATK team,
To calculate variant frequencies, all samples from different runs must be in the same GVCF. For each run, I currently run the CombineGVCFs tool, and then I use the GenotypeGVCFs tool to combine the different runs and calculate variant frequencies. When a new run arrives, to obtain a single file for all runs, I run GenotypeGVCFs again, adding the GVCF from the new run to the list of input GVCF files. Is there a more efficient way, with lower computational cost, to join the GVCF files produced by CombineGVCFs into one file, so that I can then use that single file as input to GenotypeGVCFs along with the GVCFs from new runs?
Example:
This is how I currently work:
Run_1
gatk --analysis_type CombineGVCFs --variant sample_1.g.vcf --variant sample_2.g.vcf -R GRCh38.fasta -o run1.g.vcf
Run_2
gatk --analysis_type CombineGVCFs --variant sample_3.g.vcf --variant sample_4.g.vcf -R GRCh38.fasta -o run2.g.vcf
Combine Run_1 and Run_2
gatk --analysis_type GenotypeGVCFs --variant run1.g.vcf --variant run2.g.vcf -R GRCh38.fasta -o run1_run2.g.vcf --includeNonVariantSites
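When a new run arrives, the current approach amounts to rebuilding that GenotypeGVCFs command with one more --variant argument. A minimal sketch of this, assuming the same flags and reference as in the commands above: build_genotype_cmd is a hypothetical helper of my own (not a GATK command), and it only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of GATK): prints the GenotypeGVCFs command
# for any number of per-run GVCFs, using the same flags as the commands above.
# Usage: build_genotype_cmd OUTPUT RUN_GVCF [RUN_GVCF ...]
build_genotype_cmd() {
    out=$1
    shift
    cmd="gatk --analysis_type GenotypeGVCFs"
    # Add one --variant argument per per-run GVCF.
    for gvcf in "$@"; do
        cmd="$cmd --variant $gvcf"
    done
    printf '%s\n' "$cmd -R GRCh38.fasta -o $out --includeNonVariantSites"
}

# When Run_3 arrives, its GVCF is appended to the list of per-run GVCFs.
build_genotype_cmd run1_run2_run3.g.vcf run1.g.vcf run2.g.vcf run3.g.vcf
```

Every earlier run is passed to GenotypeGVCFs again each time, which is the computational cost I would like to avoid.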
I wonder if there is a better way to do the following:
Run_3
gatk --analysis_type CombineGVCFs --variant sample_5.g.vcf --variant sample_6.g.vcf -R GRCh38.fasta -o run3.g.vcf
Combine the GVCF from Run_1+Run_2 and the new Run_3
gatk --analysis_type GenotypeGVCFs --variant run1_run2.g.vcf --variant run3.g.vcf -R GRCh38.fasta -o run1_run2_run3.g.vcf --includeNonVariantSites
I've tried different ways with no success.
Thanks very much in advance.
Regards,
Sheila