Hi,
I've run into the already reported bug (http://gatkforums.broadinstitute.org/dsde/discussion/5598/missing-depth-dp-after-haplotypecaller) in which the DP FORMAT field is missing from my calls.
I've run the following (relevant) commands:
HaplotypeCaller, to generate the GVCFs:
java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R ${ref} \
-I ${NEWTMPDIR}/${prefix}.realigned.fixed.recal.bam \
-L ${reg} \
-ERC GVCF \
-nct ${nct} \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o ${prefix}.raw_variants.annotated.g.vcf \
-A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage
This produces GVCF files that DO HAVE the DP FORMAT field for all reference positions, but DO NOT HAVE it for any called variant (although the DP is still kept in the INFO field):
18 11255 . T <NON_REF> . . END=11256 GT:DP:GQ:MIN_DP:PL 0/0:18:48:18:0,48,720
18 11257 . C G,<NON_REF> 229.77 . BaseQRankSum=1.999;DP=20;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-1.377;ReadPosRankSum=0.489 GT:AD:GQ:PL:SB 0/1:10,8,0:99:258,0,308,288
18 11258 . G <NON_REF> . . END=11260 GT:DP:GQ:MIN_DP:PL 0/0:17:48:16:0,48,530
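A minimal sketch of how this pattern could be checked across a GVCF, assuming uncompressed, whitespace-separated records like the ones above. The function names `format_has_dp` and `info_dp` are hypothetical helpers, not part of GATK; the two example records are copied from the output above.

```python
from typing import Optional

# Two records copied from the GVCF excerpt above.
REF_BLOCK = "18 11255 . T <NON_REF> . . END=11256 GT:DP:GQ:MIN_DP:PL 0/0:18:48:18:0,48,720"
VARIANT = ("18 11257 . C G,<NON_REF> 229.77 . "
           "BaseQRankSum=1.999;DP=20;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;"
           "MQRankSum=-1.377;ReadPosRankSum=0.489 "
           "GT:AD:GQ:PL:SB 0/1:10,8,0:99:258,0,308,288")


def format_has_dp(record: str) -> bool:
    """Return True if the FORMAT column (9th field) lists a DP key."""
    fields = record.split()
    return "DP" in fields[8].split(":")


def info_dp(record: str) -> Optional[int]:
    """Return the DP value from the INFO column (8th field), or None if absent."""
    info = record.split()[7]
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP":
            return int(value)
    return None


for name, rec in (("reference block", REF_BLOCK), ("variant", VARIANT)):
    print(name, "FORMAT has DP:", format_has_dp(rec), "INFO DP:", info_dp(rec))
```

For a single-sample GVCF the INFO DP is the only depth value left on the variant records, so a helper like `info_dp` would be one way to pull it out; whether that value is comparable to the FORMAT DP of the reference blocks is something I'd like confirmed.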
Later, I ran GenotypeGVCFs to jointly genotype all the samples, with the following command:
java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R ${ref} \
-L ${pos} \
-o ${prefix}.raw_variants.annotated.vcf \
--variant ${variant} [...]
This generated VCF files in which the DP field is listed in the FORMAT description and IS present for the homozygous REF samples, but IS MISSING for every heterozygous or homozygous ALT sample:
22 17280388 . T C 18459.8 PASS AC=34;AF=0.340;AN=100;BaseQRankSum=-2.179e+00;DP=1593;FS=2.526;InbreedingCoeff=0.0196;MLEAC=34;MLEAF=0.340;MQ=60.00;MQRankSum=0.196;QD=19.76;ReadPosRankSum=-9.400e-02;SOR=0.523 GT:AD:DP:GQ:PL 0/0:29,0:29:81:0,81,1118 0/1:20,22:.:99:688,0,682 1/1:0,27:.:81:1018,81,0 0/0:22,0:22:60:0,60,869 0/1:20,10:.:99:286,0,664 0/1:11,17:.:99:532,0,330 0/1:14,14:.:99:431,0,458 0/0:28,0:28:81:0,81,1092 0/0:35,0:35:99:0,99,1326 0/1:14,20:.:99:631,0,453 0/1:13,16:.:99:511,0,423 0/1:38,29:.:99:845,0,1231 0/1:20,10:.:99:282,0,671 0/0:22,0:22:63:0,63,837 0/1:8,15:.:99:497,0,248 0/0:32,0:32:90:0,90,1350 0/1:12,12:.:99:378,0,391 0/1:14,26:.:99:865,0,433 0/0:37,0:37:99:0,105,1406 0/0:44,0:44:99:0,120,1800 0/0:24,0:24:72:0,72,877 0/0:30,0:30:84:0,84,1250 0/0:31,0:31:90:0,90,1350 0/1:15,25:.:99:827,0,462 0/0:35,0:35:99:0,99,1445 0/0:29,0:29:72:0,72,1089 1/1:0,32:.:96:1164,96,0 0/0:21,0:21:63:0,63,809 0/1:21,15:.:99:450,0,718 1/1:0,40:.:99:1539,120,0 0/0:20,0:20:60:0,60,765 0/1:11,9:.:99:293,0,381 1/1:0,35:.:99:1306,105,0 0/1:18,14:.:99:428,0,606 0/0:32,0:32:90:0,90,1158 0/1:24,22:.:99:652,0,816 0/0:20,0:20:60:0,60,740 1/1:0,30:.:90:1120,90,0 0/1:15,13:.:99:415,0,501 0/0:31,0:31:90:0,90,1350 0/1:15,18:.:99:570,0,480 0/1:22,13:.:99:384,0,742 0/1:19,11:.:99:318,0,632 0/0:28,0:28:75:0,75,1125 0/0:20,0:20:60:0,60,785 1/1:0,27:.:81:1030,81,0 0/0:30,0:30:90:0,90,1108 0/1:16,16:.:99:479,0,493 0/1:14,22:.:99:745,0,439 0/0:31,0:31:90:0,90,1252
22 17280822 . G A 5491.56 PASS AC=8;AF=0.080;AN=100;BaseQRankSum=1.21;DP=1651;FS=0.000;InbreedingCoeff=-0.0870;MLEAC=8;MLEAF=0.080;MQ=60.00;MQRankSum=0.453;QD=17.89;ReadPosRankSum=-1.380e-01;SOR=0.695 GT:AD:DP:GQ:PL 0/0:27,0:27:72:0,72,1080 0/0:34,0:34:90:0,90,1350 0/1:15,16:.:99:528,0,491 0/0:27,0:27:60:0,60,900 0/1:15,22:.:99:699,0,453 0/0:32,0:32:90:0,90,1350 0/0:37,0:37:99:0,99,1485 0/0:31,0:31:87:0,87,1305 0/0:40,0:40:99:0,108,1620 0/1:20,9:.:99:258,0,652 0/0:26,0:26:72:0,72,954 0/1:16,29:.:99:943,0,476 0/0:27,0:27:69:0,69,1035 0/0:19,0:19:48:0,48,720 0/0:32,0:32:81:0,81,1215 0/0:36,0:36:99:0,99,1435 0/0:34,0:34:99:0,99,1299 0/0:35,0:35:99:0,102,1339 0/0:38,0:38:99:0,102,1520 0/0:36,0:36:99:0,99,1476 0/0:31,0:31:81:0,81,1215 0/0:31,0:31:75:0,75,1125 0/0:35,0:35:99:0,99,1485 0/0:37,0:37:99:0,99,1485 0/0:35,0:35:90:0,90,1350 0/0:20,0:20:28:0,28,708 0/1:16,22:.:99:733,0,474 0/0:32,0:32:90:0,90,1350 0/0:35,0:35:99:0,99,1467 0/1:27,36:.:99:1169,0,831 0/0:28,0:28:75:0,75,1125 0/0:36,0:36:81:0,81,1215 0/0:35,0:35:90:0,90,1350 0/0:28,0:28:72:0,72,1080 0/0:31,0:31:81:0,81,1215 0/0:37,0:37:99:0,99,1485 0/0:31,0:31:84:0,84,1260 0/0:39,0:39:99:0,101,1575 0/0:37,0:37:96:0,96,1440 0/0:34,0:34:99:0,99,1269 0/0:30,0:30:81:0,81,1215 0/0:36,0:36:99:0,99,1485 0/1:17,17:.:99:567,0,530 0/0:26,0:26:72:0,72,1008 0/0:18,0:18:45:0,45,675 0/0:33,0:33:84:0,84,1260 0/0:25,0:25:61:0,61,877 0/1:9,21:.:99:706,0,243 0/0:35,0:35:81:0,81,1215 0/0:35,0:35:99:0,99,1485
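A minimal sketch of how the affected genotypes could be listed from the final VCF, assuming uncompressed, whitespace-separated records. The function name `missing_dp_samples` is hypothetical, and the example record is a shortened copy of the first record above (INFO trimmed, first three samples only). The summed AD is reported only as a rough proxy for depth; it is an assumption that it is usable in place of DP, and it need not equal the DP value GATK would have written.

```python
from typing import List, Optional, Tuple

# Shortened copy of the first record above: INFO trimmed, first three samples kept.
RECORD = ("22 17280388 . T C 18459.8 PASS AC=34;AF=0.340;AN=100 GT:AD:DP:GQ:PL "
          "0/0:29,0:29:81:0,81,1118 0/1:20,22:.:99:688,0,682 1/1:0,27:.:81:1018,81,0")


def missing_dp_samples(record: str) -> List[Tuple[int, str, Optional[int]]]:
    """List (sample index, genotype, summed AD) for samples whose FORMAT DP is '.'.

    The sample index is 0-based, counted from the first sample column.
    Summed AD is None when AD is absent or entirely missing for that sample.
    """
    fields = record.split()
    keys = fields[8].split(":")
    if "DP" not in keys:
        return []
    gt_idx, dp_idx = keys.index("GT"), keys.index("DP")
    ad_idx = keys.index("AD") if "AD" in keys else None

    hits = []
    for i, sample in enumerate(fields[9:]):
        values = sample.split(":")
        # Trailing FORMAT fields may be dropped; treat a dropped DP as missing.
        dp = values[dp_idx] if dp_idx < len(values) else "."
        if dp != ".":
            continue
        ad_sum = None
        if ad_idx is not None and ad_idx < len(values):
            counts = [int(x) for x in values[ad_idx].split(",") if x.isdigit()]
            ad_sum = sum(counts) if counts else None
        hits.append((i, values[gt_idx], ad_sum))
    return hits


print(missing_dp_samples(RECORD))
```

On the shortened record this reports the second and third samples (the 0/1 and 1/1 genotypes) and skips the 0/0 sample, which matches the pattern described above.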
I've just discovered this issue, and I need to run an analysis of the differential depth of coverage in different regions, including whether there is a DP bias between called and not-called samples.
I have thousands of files and I've spent almost a year generating all these calls, so redoing the calling is not an option.
What would be the best/fastest strategy to either fix my final VCFs using the DP data present in all the intermediate GVCF files (preferably) or, at least, extract this data for all SNPs and samples?
Thanks in advance,
Txema
PS: Re-calling the individual samples from the BAM files is not an option. Fixing the individual GVCFs and redoing the joint GenotypeGVCFs step could be.