Hello, I recently compared results from the GATK Best Practices pipeline (bwa, Picard, HaplotypeCaller, GenotypeGVCFs) with a SNP array set (a high-confidence method for detecting known variants) for 6 samples (data from an Illumina HiSeq 2500) and got a really interesting confusion matrix. Rows are GATK calls and columns are SNP array genotypes.
gatk\snp-array | wild | het | hom |
---|---|---|---|
wild | 109,575 | 20,122 | 63 |
het | 60 | 44,579 | 28 |
hom | 378 | 26,493 | 28,402 |
This means that GATK (like any other caller) has trouble calling heterozygous variants. We are discussing the causes of this phenomenon and how HC+GG deal with it.
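To put a number on this, here is a minimal Python sketch that computes, for each SNP array genotype (column), the fraction of sites where GATK made the same call. The function and variable names are only illustrative, and it assumes the rows are GATK calls and the columns are array genotypes, as in the table header.

```python
# Confusion matrix copied from the table above.
# Rows = GATK call, columns = SNP array genotype (assumed orientation).
GENOTYPES = ["wild", "het", "hom"]
UNFILTERED = [
    [109_575, 20_122, 63],      # GATK wild
    [60, 44_579, 28],           # GATK het
    [378, 26_493, 28_402],      # GATK hom
]


def concordance_per_array_genotype(matrix):
    """Return, per array genotype, the fraction of sites GATK called identically."""
    result = {}
    for j, name in enumerate(GENOTYPES):
        column_total = sum(row[j] for row in matrix)  # all sites with this array genotype
        result[name] = matrix[j][j] / column_total    # diagonal cell = agreeing calls
    return result


for name, value in concordance_per_array_genotype(UNFILTERED).items():
    print(f"{name}: {value:.1%}")
```

On the unfiltered table this gives roughly 99.6% for wild, 48.9% for het and 99.7% for hom, so almost all of the disagreement sits in the het column.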
At first we thought it was a DP problem and yes, it is: when keeping only variants with DP>20, the matrix (in particular the het column) becomes:
gatk\snp-array | wild | het | hom |
---|---|---|---|
wild | 46,323 | 1,524 | 42 |
het | 22 | 32,337 | 14 |
hom | 273 | 1,325 | 9,207 |
This means that the proportion of ref/alt bases is critical when calling heterozygous variants.
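A rough way to see why depth matters so much for hets: under an idealized model (each read at a true het site carries ref or alt with probability 0.5, independently, with no sequencing or mapping errors), the chance that every read shows the same allele falls off quickly with depth. This is only an illustrative sketch of that sampling effect, not HaplotypeCaller's actual genotype likelihood model, and the function name is hypothetical.

```python
def prob_het_shows_one_allele(depth):
    """Probability that all `depth` reads at a true het site carry the same allele.

    Assumes an idealized 50/50 allele sampling with independent, error-free reads.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Either all reads are ref or all are alt: 0.5**depth each.
    return 2 * 0.5 ** depth


for depth in (4, 8, 20):
    print(f"DP={depth}: {prob_het_shows_one_allele(depth):.2e}")
```

Under these assumptions, about 12.5% of true hets at DP=4 would look like wild or hom from the reads alone, while at DP=20 the figure is about 2 in a million. Real data will deviate from this because of reference bias, mapping errors and uneven coverage.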
We hope you can give us more ideas on the causes of this problem and on how we can turn those array-het sites that GATK calls wild into called variants, even at the cost of getting more false positives.
We used bwa 0.7.10-r789 and GATK 3.7-0-gcfedb67.