Actually, there are several aspects to this. I am looking for de novo variants in trios and I have far too many, which suggests that HaplotypeCaller is making incorrect calls. One example that highlights a number of issues: a parent and child have very similar reads, but the child is called heterozygous with GQ=99 while the parent is called homozygous, also with GQ=99. Here are the entries from the gVCFs (child first, then parent):
0/1:115,11,0:126:99:0
0/0:103,5,0:108:99:0
Now, if you ask me, having 5-10 alt bases out of 100 should not give a high GQ for either genotype.
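To put numbers on that, here is a minimal Python sketch that computes the alt read fraction for the two entries above. It assumes the sample fields are ordered GT:AD:DP:GQ (the FORMAT column is not shown in this post), and the helper name alt_fraction is made up for illustration.

```python
def alt_fraction(sample_entry):
    """Return (ref_depth, alt_depth, alt_fraction) for one sample entry.

    Assumes the second colon-separated field is AD (allele depths),
    with the reference depth first and the first alt depth second.
    """
    fields = sample_entry.split(":")
    depths = [int(x) for x in fields[1].split(",")]
    ref, alt = depths[0], depths[1]
    total = ref + alt
    # Guard against an empty site so the division cannot fail.
    return ref, alt, (alt / total if total else 0.0)


# The two entries quoted above: child (0/1) and parent (0/0).
child = alt_fraction("0/1:115,11,0:126:99:0")
parent = alt_fraction("0/0:103,5,0:108:99:0")

print("child  alt fraction: {:.3f}".format(child[2]))
print("parent alt fraction: {:.3f}".format(parent[2]))
```

On these entries that is 11 of 126 reads (about 8.7%) in the child and 5 of 108 (about 4.6%) in the parent, so the two samples sit in the same low range even though their genotypes differ.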
In fact, the situation is worse because I only see this if I set the -forceActive flag. If I don't, then the parental gvcf contains a region spanning this variant with entry 0/0:94:99:49:0,104,1575. What this means is that in the vcf file the variant is reported to have 49 ref and 0 alt alleles. I only used that flag for writing all reads using -bamout to try to understand what was going on. Without it, there is no way from looking at the vcf file entries for me to know that the homozygous genotype call might be unreliable.
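As a rough illustration of why the block entry hides the alt reads, the sketch below splits it into named fields. It assumes the reference-block FORMAT is GT:DP:GQ:MIN_DP:PL, which is an assumption and not something copied from the file header. The helper parse_entry is hypothetical.

```python
# Assumed FORMAT string for a gVCF reference block (not taken from the file).
REF_BLOCK_FORMAT = "GT:DP:GQ:MIN_DP:PL"


def parse_entry(fmt, entry):
    """Pair each FORMAT key with the matching value of a sample entry."""
    return dict(zip(fmt.split(":"), entry.split(":")))


# The parental entry emitted when -forceActive is not set.
block = parse_entry(REF_BLOCK_FORMAT, "0/0:94:99:49:0,104,1575")

# Under the assumed FORMAT there is no per-allele depth in the block at all.
has_allele_depths = "AD" in block

for key, value in block.items():
    print(key, value)
print("has AD:", has_allele_depths)
```

Under that assumption the block carries only total depths and likelihoods, with no AD field, so the 5 alt reads seen with -forceActive cannot be recovered from it.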
I have tried using -bamout to understand why the two calls are so different and I have viewed the output in IGV, but I haven't been able to work out what's going on. To me, both sets of reads look very similar and the alt bases seem to be of high quality. The alt bases are all on haplotypes that also carry alt bases at a nearby position, which could suggest poor alignment but might just mean that another variant is in LD. However, I don't really understand why all the alt bases are ignored in the parent.
Looking at the reads, I'd say both calls are ambiguous, and I wouldn't have a problem discarding them. The problem is that I can't see any way to filter out calls like this, because they both have high GQ and because the AD in the VCF ends up being misleading.
I'm wondering if there are any arguments I could change for HaplotypeCaller to do a better job of handling situations like this?
The bam files for the region are here:
https://1drv.ms/u/s!AhC8mtxvI36M8_RjYQGqH_7ZLbNkyw
https://1drv.ms/u/s!AhC8mtxvI36M8_RkGl5BBlLH0O_A9w
The problematic variant is at 6:17602910.
My arguments for HaplotypeCaller look like this:
$java17 -Djava.io.tmpdir=${javaTemp} -Xmx8g -jar $GATK -T HaplotypeCaller -R $fasta -I $bam \
--dbsnp ${bundle}/dbsnp_137.b37.vcf \
--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 \
-stand_call_conf 30.0 \
-stand_emit_conf 10.0 \
-L ${chr}:${start}-${end} \
--activeRegionExtension 100 \
-o ${ID}.$chr.$pos.gvcf \
--bamOutput ${ID}.$chr.$pos.bam \
--bamWriterType ALL_POSSIBLE_HAPLOTYPES \
#-forceActive
Thanks for any suggestions you might have.
- Dave