Hi all,
I'm using GATK v3.6 for a multi-sample analysis. I followed the Best Practices, and this is my HaplotypeCaller command line:
java -Xmx64g -jar $GATK -T HaplotypeCaller \
-R $REF \
-I $PROCESSING/5_BQSR/${filename%.*}.bam \
-o $PROCESSING/6_Variant/GATK/${filename%.*}.g.vcf \
-ERC GVCF \
--doNotRunPhysicalPhasing \
-bamout $PROCESSING/6_Variant/GATK/${filename%.*}.g.vcf.bam \
-L $TARGET
Looking at the final VCF (after GenotypeGVCFs, but the same call is also in the .g.vcf file), I found this record:
chr21 44477938 . CGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT TGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT,C 6064.54 PASS AC=3,1;AF=0.125,0.042;AN=24;BaseQRankSum=1.00;ClippingRankSum=0.092;DP=3685;ExcessHet=0.6070;FS=5.275;InbreedingCoeff=0.4000;MLEAC=3,1;MLEAF=0.125,0.042;MQ=57.73;MQRankSum=-4.330e-01;QD=10.40;ReadPosRankSum=-5.700e-02;SOR=0.577 GT:AD:DP:GQ:PL 0/0:229,0,0:229:99:0,120,1800,120,1800,1800 0/1:177,131,0:308:99:2363,0,5706,2891,6106,8997 0/0:246,0,0:246:99:0,120,1800,120,1800,1800 0/0:223,0,0:223:99:0,120,1800,120,1800,1800 0/1:123,80,0:203:99:1422,0,3124,1790,3357,5147 0/0:393,0,0:393:99:0,120,1800,120,1800,1800 0/0:311,0,35:346:61:0,913,12646,61,11795,11566 0/0:461,0,0:461:99:0,120,1800,120,1800,1800 1/2:1,35,36:72:99:2338,868,1163,1274,0,3057 0/0:374,0,0:374:99:0,120,1800,120,1800,1800 0/0:37,0,0:37:99:0,99,1239,99,1239,1239 0/0:356,0,0:356:99:0,120,1800,120,1800,1800
Looking at this genotype, 1/2:1,35,36:72:99:2338,868,1163,1274,0,3057, it seems that this sample carries a 62 bp deletion at this site (the second ALT allele replaces the 63 bp REF allele with a single C).
Now look at the BAM screenshot for this sample: there is no deletion at that site, only three SNPs. In fact, the other variant callers I tried (VarScan and FreeBayes) report these variants and no deletion:
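The size of the event can be checked from the allele lengths alone: the REF allele is 63 bp long and the second ALT allele is a single base, so 62 bp are removed. Below is a minimal bash sketch of that arithmetic. The function name describe_alleles is hypothetical (it is not part of GATK), and the alleles are rebuilt from the 31 bp unit that REF repeats twice, only to keep the lines short.

```shell
#!/usr/bin/env bash
# Hypothetical helper: describe each ALT allele of one VCF record relative to REF.
# Usage: describe_alleles POS REF ALT1[,ALT2,...]
describe_alleles() {
    local pos=$1 ref=$2 alts=$3
    local end=$(( pos + ${#ref} - 1 ))   # last reference position covered by REF
    echo "REF length=${#ref} span=${pos}-${end}"
    local alt diff
    local IFS=,                          # split the ALT column on commas
    for alt in $alts; do
        diff=$(( ${#ref} - ${#alt} ))
        if (( diff > 0 )); then
            echo "ALT length=${#alt} deletion of ${diff} bp"
        elif (( diff < 0 )); then
            echo "ALT length=${#alt} insertion of $(( -diff )) bp"
        else
            echo "ALT length=${#alt} same length as REF (substitution)"
        fi
    done
}

# REF at chr21:44477938 is two copies of this 31 bp unit followed by a T.
unit=CGGGCACCCGTTTGAGCTGCCTGTAGGTGAC
ref="${unit}${unit}T"
alt1="T${unit#C}${unit}T"   # first ALT: REF with its first base changed C -> T
alt2="C"                    # second ALT: a single base

describe_alleles 44477938 "$ref" "$alt1,$alt2"
# REF length=63 span=44477938-44478000
# ALT length=63 same length as REF (substitution)
# ALT length=1 deletion of 62 bp
```

The span ends at 44478000, so this single record covers all three positions discussed in this post (44477938, 44477971 and 44478000).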
chr21 44477938 . C T
chr21 44477971 . G C
chr21 44478000 . T C
However, GATK does call the last of these variants as a separate record at its own position:
chr21 44478000 . T C 56720.78 PASS AC=14;AF=0.636;AN=22;BaseQRankSum=0.692;ClippingRankSum=-2.390e-01;DP=3859;ExcessHet=1.1475;FS=0.000;InbreedingCoeff=0.2143;MLEAC=14;MLEAF=0.636;MQ=57.12;MQRankSum=0.139;QD=21.15;ReadPosRankSum=0.667;SOR=0.702 GT:AD:DP:GQ:PL ./.:234,0:234 1/1:0,287:287:99:8615,853,0 1/1:0,252:252:99:7501,751,0 1/1:0,188:188:99:5121,531,0 1/1:0,229:229:99:5490,588,0 0/0:393,0:393:99:0,120,1800 0/1:133,190:323:99:4813,0,490 1/1:3,464:467:99:14732,1315,0 0/1:56,35:91:99:1155,0,1414 0/1:155,280:435:99:5265,0,3405 0/0:37,0:37:99:0,99,1239 0/1:167,243:410:99:4096,0,310
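For comparing the callers, it may help to note that the first ALT allele of the chr21:44477938 record has the same length as REF, so it can be compared with REF base by base. The bash sketch below does that. The function name allele_mismatches is hypothetical, and the alleles are again rebuilt from the 31 bp repeat unit for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the positions where an equal-length ALT differs from REF.
# Usage: allele_mismatches POS REF ALT
allele_mismatches() {
    local pos=$1 ref=$2 alt=$3 i
    if (( ${#ref} != ${#alt} )); then
        echo "alleles differ in length; not a plain substitution" >&2
        return 1
    fi
    for (( i = 0; i < ${#ref}; i++ )); do
        if [[ ${ref:i:1} != "${alt:i:1}" ]]; then
            # print: genomic position, REF base, ALT base
            echo "$(( pos + i )) ${ref:i:1} ${alt:i:1}"
        fi
    done
}

unit=CGGGCACCCGTTTGAGCTGCCTGTAGGTGAC
ref="${unit}${unit}T"       # REF allele of the chr21:44477938 record
alt1="T${unit#C}${unit}T"   # first ALT allele of the same record

allele_mismatches 44477938 "$ref" "$alt1"
# 44477938 C T
```

So the first ALT allele is just the C>T SNP at 44477938 written against the long REF, and it matches the first variant reported by VarScan and FreeBayes. Neither ALT allele of that record carries a G>C change at 44477971 on its own.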
In addition, note that VarScan and FreeBayes report the SNP at chr21:44477971 in several samples:
0/1:255:254:245:136:106:43,44%:4,17E-39:34:32:53:83:60:46 0/0:452:294:286:279:5:1,75%:3,07E-2:33:31:130:149:3:2 0/0:430:242:230:230:0:0%:1E0:32:0:101:129:0:0 0/0:334:216:206:200:3:1,46%:1,2407E-1:32:33:100:100:2:1 0/0:304:200:188:185:3:1,6%:1,24E-1:32:31:84:101:1:2 0/0:557:357:342:337:5:1,46%:3,0793E-2:32:23:173:164:2:3 0/1:255:384:374:241:133:35,56%:5,2547E-47:33:31:112:129:69:64 0/0:734:416:391:391:0:0%:1E0:33:0:183:208:0:0 0/1:187:123:119:67:52:43,7%:1,6526E-19:34:32:29:38:28:24 0/0:589:340:326:325:1:0,31%:5E-1:33:31:152:173:0:1 0/0:399:242:233:229:2:0,86%:2,4946E-1:32:20:113:116:2:0 0/0:498:332:317:309:6:1,89%:1,5254E-2:32:28:151:158:4:2
In conclusion, GATK seems to call a deletion and to lose the information about the second variant (chr21:44477971) in all samples.
I hope everything is clear. Thank you in advance for your help!
Matteo