I've found some interesting behavior in HaplotypeCaller in GATK 3.6 that I was hoping someone could help explain. I have a sequence context with three phased SNPs directly downstream of an intronic repetitive region containing four 21-bp repeats. Calling with HaplotypeCaller on the full region produces no variant calls for any of these SNPs; they are instead called as reference with GQ=0. The debug output showed that the assembly graph did not include nodes corresponding to these variants, or any haplotypes containing them, and that the default kmer size of 25 was dynamically bumped up to 75 before assembly because of the repetitive upstream region.
I was able to get correct calls on the two exonic SNPs by restricting the calling region (-L) to exclude some of the upstream repetitive region. With that restriction the kmer size was still expanded, but only up to 55, and the downstream variants were assembled correctly.
I was also able to get correct calls on all three SNPs by including the --allowNonUniqueKmersInRef flag, which stops the kmer size from being expanded to accommodate the repetitive regions.
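To illustrate what I mean by kmer expansion, here is a minimal Python sketch of my understanding, not GATK's actual code. It looks only at the reference sequence and increases k until no kmer occurs twice. The function names, the step of 10, the ceiling of 85, and the toy sequence (four copies of a made-up 21-bp unit between made-up flanks) are all assumptions for illustration. The real assembler also takes the reads into account, and my real locus is different, so the numbers are not expected to match the 75 and 55 I saw.

```python
def has_repeated_kmer(seq, k):
    """Return True if any kmer of length k occurs more than once in seq."""
    seen = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def smallest_unique_k(ref, start_k=25, step=10, max_k=85):
    """Try start_k, start_k + step, ... and return the first k at which
    every kmer in ref is unique, or None if none is found up to max_k.
    The step and ceiling are illustrative assumptions."""
    for k in range(start_k, max_k + 1, step):
        if not has_repeated_kmer(ref, k):
            return k
    return None


# Toy reference (hypothetical): 30-bp flank + 4 x 21-bp repeat + 30-bp flank.
unit = "ACGTTGCAAGCTTAGGCTAAC"
left_flank = "TTGACCGTATGGCATCAGTCCGATTGCAGG"
right_flank = "GTCCATAGGCTTCAAGTGACCTGATCGTAA"
ref = left_flank + unit * 4 + right_flank

print(smallest_unique_k(ref))  # -> 65
```

In this toy case every k up to 63 (84 bp of repeat minus one 21-bp unit) still has a duplicated kmer inside the tandem repeat, which is why k has to grow well past the default of 25 before the reference becomes unambiguous.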
I had a couple of questions about how to move forward with these results:
1) Why are larger kmer sizes preventing correct graph assembly? Is my coverage just low enough that, with 75-bp kmers, I'm not getting full k-1 overlaps between some graph nodes, so that some correct nodes end up getting pruned?
2) What's the downside to allowing cycles (non-unique ref kmers) in the graph? If the regions I'm trying to call are not repetitive but are occasionally flanked by repetitive regions, should I expect to see problems in my regions of interest? Is it safe to enable this for all my calls, or should I limit it to known problematic regions?
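To make the coverage concern in question 1 concrete, here is a back-of-the-envelope Python sketch. A read of length L contributes L - k + 1 kmers, so the expected kmer depth is roughly the read depth scaled by (L - k + 1) / L. The read length of 100 and read depth of 20 below are made-up values for illustration, not my actual data, and the function names are hypothetical.

```python
def kmers_per_read(read_len, k):
    """Number of kmers of length k that a single read of read_len contains."""
    return max(read_len - k + 1, 0)


def expected_kmer_depth(read_depth, read_len, k):
    """Rough expected kmer depth given per-base read depth.
    Ignores sequencing errors and uneven coverage."""
    return read_depth * kmers_per_read(read_len, k) / read_len


# Assumed values for illustration only: 100-bp reads at 20x depth.
for k in (25, 55, 75):
    print(k, kmers_per_read(100, k), expected_kmer_depth(20, 100, k))
```

Under these assumed numbers each read yields 76 kmers at k=25 but only 26 at k=75, so the effective kmer depth drops from about 15x to about 5x, which is the kind of thinning I suspect could push real variant paths below the pruning threshold.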
Thanks!