Hi,
I am mapping chimpanzee samples to the human reference hg19. I mapped the samples using the standard protocol (BWA mem, duplicate removal, indel realignment) and called variants with GATK 3.7 HaplotypeCaller. After all variant filtering (hard filter + removal of duplicated and low-mappability regions from external BED files), I found an interesting insertion in one of my samples:
At chr2:48033272 there is a deletion of the sequence TTTTTGTTTTAATTCCT . The human reference has this sequence duplicated: GCA|TTTTTGTTTTAATTCCTT|TTTTGTTTTAATTCCTT|TG
This sample is called homozygous for the deletion.
A few bp after this, GATK calls an insertion:
chr2 48033352 . C CAACCGATGTTGCTTTTCTGTCCTAGCATTTTTGTTTTAATTCCTT 108.02 PASS
Long story short:
- There are only 6 reads supporting this insertion.
- Of those, only 3 contain the full "GATK-ALT-insertion", and all 3 of them carry at least 1 extra bp.
- None of these reads have the 3' side of the reference. It should be:
CTTTAACAGGAAGAGGTAC ins TGCAACATTTGATGGG
- I lied. One of these reads does have the full sequence:
TAACAGGAAGAGGTAC | AACCGATGTTGCTTTTCTGTCCTAGCATTTTTGTTTTAATTCCTTTGAGTTACTTCCTTATGCATATTTTACTTTAACAGGAAGAGGTAC | TGCAACATTTGATGGGACAGCAATAGCAAATGCAGTTGTTAAAGA
It is a duplication of the whole previous sequence, including the deletion 80 bp upstream. I want to run functional analysis of the detected variants, and this changes the variant from a frameshift insertion to a non-frameshift insertion.
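To make the read evidence above concrete, here is a minimal Python sketch of the kind of check I have in mind: for a called insertion, count how many reads contain the 5' junction (left flank + start of the insertion) and the 3' junction (end of the insertion + right flank). This is not a GATK feature; the function name `count_junction_support` and the `anchor` length are my own assumptions, and it uses plain substring matching on read sequences rather than real alignments. The sequences are the ones quoted in this post.

```python
def count_junction_support(reads, left_flank, insertion, right_flank, anchor=10):
    """Count reads containing the 5' junction, the 3' junction, and both.

    Hypothetical helper: exact substring matching, no mismatches allowed.
    """
    # 5' junction: end of the left flank followed by the start of the insertion
    five_prime = left_flank[-anchor:] + insertion[:anchor]
    # 3' junction: end of the insertion followed by the start of the right flank
    three_prime = insertion[-anchor:] + right_flank[:anchor]
    counts = {"five_prime": 0, "three_prime": 0, "both": 0}
    for read in reads:
        has5 = five_prime in read
        has3 = three_prime in read
        counts["five_prime"] += int(has5)
        counts["three_prime"] += int(has3)
        counts["both"] += int(has5 and has3)
    return counts


# Reference flanks expected around the called insertion (from this post)
LEFT_FLANK = "CTTTAACAGGAAGAGGTAC"
RIGHT_FLANK = "TGCAACATTTGATGGG"
# GATK ALT allele without its leading anchor base "C"
GATK_INSERTION = "AACCGATGTTGCTTTTCTGTCCTAGCATTTTTGTTTTAATTCCTT"

# The one read that spans the whole event (segments as quoted above)
FULL_READ = (
    "TAACAGGAAGAGGTAC"
    "AACCGATGTTGCTTTTCTGTCCTAGCATTTTTGTTTTAATTCCTTTGAGTTACTTCCTTATGCATATTTTACTTTAACAGGAAGAGGTAC"
    "TGCAACATTTGATGGGACAGCAATAGCAAATGCAGTTGTTAAAGA"
)

support = count_junction_support([FULL_READ], LEFT_FLANK, GATK_INSERTION, RIGHT_FLANK)
print(support)
```

For the spanning read, the 5' junction of the GATK allele is present but its 3' junction is not, which is exactly the pattern I would like to flag across the whole call set.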
OK, I have caught this erroneous indel, but I am calling 1.9M indels in this dataset and 1.7M more in another one, and I am worried about assigning strong functional annotations to erroneous variants.
Is there any method I can use to detect this kind of indel (not enough reads supporting both ends of the insertion)? Or can I only filter by QD?
My hard filter removes variants with QUAL<50 or QD<2 . This particular variant has QUAL=108.02, QD=4.91 and Genotype Quality (GQ)=99
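For reference, the thresholds above amount to the following logic, written here as a small Python sketch rather than the actual filtering command I ran. The helper names are hypothetical, and the INFO column in the example record is reduced to `QD=4.91` as an assumption (the real record has more annotations). It shows that this variant clears both thresholds comfortably.

```python
def parse_qual_and_qd(vcf_line):
    """Extract QUAL and the QD annotation from one tab-separated VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = float(fields[5])
    # INFO is a ';'-separated list of KEY=VALUE pairs (flags without '=' are skipped)
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    return qual, float(info["QD"])


def passes_hard_filter(qual, qd, min_qual=50.0, min_qd=2.0):
    """Keep a variant only if it is not removed by QUAL<50 or QD<2."""
    return qual >= min_qual and qd >= min_qd


# The insertion from this post; INFO is simplified to the QD value only (assumption)
record = "\t".join([
    "chr2", "48033352", ".", "C",
    "CAACCGATGTTGCTTTTCTGTCCTAGCATTTTTGTTTTAATTCCTT",
    "108.02", "PASS", "QD=4.91",
])

qual, qd = parse_qual_and_qd(record)
print(qual, qd, passes_hard_filter(qual, qd))
```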
Any advice?
Thanks in advance,
Txema