I am trying to figure out why a site is being called with the wrong AD / Likelihood and what can be done about it.
In particular, there is a site in my VCF that is being called with a very wonky AD value of 190,63, which is well outside the expected 50:50 ratio. The call in question is a deletion of 2 T's from a run of 7 T's (i.e. T[7] -> T[5]), and was made on a BAM file of a single subject using the HaplotypeCaller with --genotyping_mode DISCOVERY and otherwise the default values.
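To put a number on how far 190,63 is from 50:50, here is a minimal Python sketch. It assumes that AD lists reference depth first and alternate depth second, and that under a clean heterozygous call each informative read is equally likely to support either allele; the binomial model is my own simplification, not something the HaplotypeCaller computes.

```python
from math import comb

# AD as reported in the VCF: reference depth, then alternate depth.
ref_depth, alt_depth = 190, 63
total = ref_depth + alt_depth

# Fraction of informative reads supporting the deletion.
alt_fraction = alt_depth / total

# One-sided binomial tail under the assumed 50:50 model:
# probability of seeing alt_depth or fewer alt reads out of total.
tail_count = sum(comb(total, k) for k in range(alt_depth + 1))
p_value = tail_count / 2 ** total

print(f"alt fraction = {alt_fraction:.3f}")
print(f"P(alt <= {alt_depth} | n = {total}, p = 0.5) = {p_value:.3g}")
```

The alt fraction comes out at roughly a quarter of the reads, and the tail probability is small enough that sampling noise alone is an unlikely explanation under this model.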
In tracking down this issue, it seems the algorithm has a strong bias toward the reference. Stepping through the code, my impression is that the HC is doing the following:
1 - Creating a set of 82 possible haplotypes
2 - Creating a candidate set of variants at each position (in this case, GTTTT*, GTT, GTTTTT, G)
3 - Assigning each haplotype to one of the possible variants, taking the highest "score" among the haplotypes assigned to a variant, and making that the likelihood for that read/variant combination.
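Step 3 as I understand it can be sketched in Python as below. This is only an illustration of my reading of the code, not the actual GATK implementation: the function name, the haplotype names and the log-likelihood values are all made up, and only the allele strings come from the candidate set above.

```python
def allele_likelihoods(read_hap_loglik, hap_to_allele):
    """For one read, keep the best haplotype score seen for each allele.

    read_hap_loglik: haplotype name -> log-likelihood of the read given it.
    hap_to_allele:   haplotype name -> allele that haplotype was assigned to.
    """
    best = {}
    for hap, loglik in read_hap_loglik.items():
        allele = hap_to_allele[hap]
        if allele not in best or loglik > best[allele]:
            best[allele] = loglik
    return best


# Hypothetical scores for a single read against three haplotypes.
read_hap_loglik = {"hap_ref": -12.0, "hap_del2": -3.5, "hap_other": -4.0}
# Hypothetical assignment of haplotypes to alleles at the site.
hap_to_allele = {"hap_ref": "GTTTT", "hap_del2": "GTT", "hap_other": "GTTTT"}

per_allele = allele_likelihoods(read_hap_loglik, hap_to_allele)
print(per_allele)
```

Because only the maximum per allele is kept, a single haplotype filed under the wrong allele is enough to change that allele's likelihood for the read.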
However, this process seems to be flawed. Examining the region, it looks like the process that assigns haplotypes to variants (Method createEventMapper@HaplotypeCallerGenotypingEngine.java:1043) assumes that if a haplotype doesn't have an event at a particular location, it should be assigned to the reference.
This, however, does not work when the haplotype does not match the reference at that location, which can occur, for example, if there is an upstream deletion that removes the reference sequence being considered. The graphic below shows all the haplotypes that were considered for each possible variant. Clearly, many haplotypes carrying the "TT" deletion are identified as being part of the reference, which seems to lead to skewed likelihoods and incorrect AD counts further downstream.
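The failure mode I suspect can be reproduced in a toy Python sketch. Everything here is hypothetical: the coordinates, the event tuples and both function names are invented for illustration, and the "no event at this start position means reference" rule is my reading of the behaviour, not a copy of the GATK code.

```python
REF_ALLELE = "GTTTT"
LOCUS = 105  # hypothetical start position of the site being genotyped


def naive_assign(events, locus, ref_allele):
    """Return the alt allele of an event starting at locus, else the reference."""
    for start, ref, alt in events:
        if start == locus:
            return alt
    return ref_allele


def overlapped_by_upstream_event(events, locus):
    """True if an event that starts before locus deletes the base at locus."""
    return any(start < locus < start + len(ref) for start, ref, alt in events)


# Each event is (start, ref bases, alt bases); all values are made up.
hap_with_site_deletion = [(105, "GTTTT", "GTT")]
hap_with_upstream_deletion = [(103, "AGGTT", "A")]  # spans positions 103..107

for name, events in [("site deletion", hap_with_site_deletion),
                     ("upstream deletion", hap_with_upstream_deletion)]:
    allele = naive_assign(events, LOCUS, REF_ALLELE)
    spanned = overlapped_by_upstream_event(events, LOCUS)
    print(f"{name}: assigned {allele}, locus deleted upstream: {spanned}")
```

In this toy case the second haplotype is assigned the reference allele even though the reference bases at the locus are gone from it, which is the kind of mismatch I think I am seeing.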
Is this a known issue with the haplotype caller? And if so, are there any workarounds?