I have a question about how the reassembly works in HaplotypeCaller from GATK version 3.6-0-g89b7209. I have been checking the called SNPs in IGV and have found a couple of regions where the BAM file and the VCF file very clearly disagree.
I have attached an image showing a few files in IGV for 4 different samples.
At the top are the coverage tracks for the 4 samples (bam files generated with bwa, then sorted and duplicates marked with picard-tools).
Below that are the 4 VCF outputs from HaplotypeCaller.
Below that are the 4 VCF outputs of the freebayes variant caller.
And the final 4 tracks show the coverage for the bamout files of the 4 samples.
From the coverage tracks alone it seems clear that there should not be any SNPs here (the coverage allele fraction threshold in IGV is set to 0.1, so any site where a non-reference allele appears in at least 10% of the reads should be highlighted). As one specific example, the first site in this region that HaplotypeCaller called as a SNP (the left-most variant) has a reference of T and is called as a heterozygous SNP with an alternate allele of A. However, according to IGV, this site in sample CLIB_2 has 744 reads with a T, 1 read with a C, and none with an A, while the bamout file shows 762 T and 748 A. This is essentially the case for all of the variants called in this region.
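To make the numbers concrete, here is a minimal sketch of the allele-fraction check I am describing, applied to the counts above. It is plain Python, the function name is my own (it is not part of IGV or GATK), and it only approximates what IGV does when it colors the coverage track.

```python
# Hypothetical helper, not an IGV or GATK function: return the non-reference
# alleles whose fraction of the reads reaches the given threshold.
def flagged_alleles(counts, ref, threshold=0.1):
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        base: n / total
        for base, n in counts.items()
        if base != ref and n / total >= threshold
    }

# Read counts for sample CLIB_2 at the left-most called site (reference T).
original_bam = {"T": 744, "C": 1}
bamout = {"T": 762, "A": 748}

# C is 1 of 745 reads, far below 0.1, so nothing is flagged.
print(flagged_alleles(original_bam, ref="T"))
# A is 748 of 1510 reads, just under half, so it is flagged.
print(flagged_alleles(bamout, ref="T"))
```

So by this check the original BAM shows no candidate variant at all, while the bamout looks like a clean heterozygous site.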
Now, it is striking that the coverage takes a noticeable dip in this area, and in fact this same problem with SNPs occurs a number of times, seemingly always in areas with such a dip. So I am wondering whether the dip somehow influences the reassembly during variant calling and, for some reason, brings actual SNPs from other areas to these spots?
Freebayes, which I believe does not have this reassembly process, does not seem to have this same issue in any of these spots. But it does tend to agree with HaplotypeCaller outside of these areas, as seen with the 2 variants to the right of this area.
Do you have any idea why this occurs and how I could work around it when using HaplotypeCaller? Thanks for any help you can provide.
Jesse