Dear GATK team,
I have started to re-analyse some samples that I originally analysed a long time ago with v2.7, mainly because a new reference genome has become available for this organism. The old reference genome was of fairly poor quality, with many assembly errors, a large number of scaffolds, and a significant proportion of Ns.
With the new reference I have observed two things:
*) the number of SNPs increased significantly
*) the percentage of overlap (mutual SNPs) between two samples increased significantly
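
To make the overlap figure concrete, here is a minimal sketch of one way such a percentage could be computed from two single-sample VCF files. The definition used (shared SNP sites divided by the union of sites in both samples) and the function names `snp_sites` and `percent_overlap` are assumptions for illustration only, not necessarily how the numbers in this post were produced.

```python
# Hypothetical helpers: names and the overlap definition are illustrative assumptions.
BASES = {"A", "C", "G", "T"}


def snp_sites(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys for biallelic SNP records in VCF text lines."""
    sites = set()
    for line in vcf_lines:
        # Skip blank lines and header/meta lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # Keep only single-base REF and single-base ALT; this drops indels,
        # multi-allelic records (ALT contains a comma) and symbolic alleles.
        if ref.upper() in BASES and alt.upper() in BASES:
            sites.add((chrom, pos, ref.upper(), alt.upper()))
    return sites


def percent_overlap(sites_a, sites_b):
    """Shared sites as a percentage of the union of both site sets."""
    union = sites_a | sites_b
    if not union:
        return 0.0
    return 100.0 * len(sites_a & sites_b) / len(union)
```

With real files, something like `snp_sites(open("sample_a.vcf"))` would work for uncompressed VCFs (the file name is a placeholder). Dividing by the union rather than by one sample's SNP count is a design choice; a different denominator gives a different percentage, so it matters which definition is used when comparing the old and new runs.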
My questions concern the 'why'.
Part of the increase in the number of SNPs will of course come from the additional sequence information in the new genome (in place of the Ns), but this explains neither the size of the increase I observed nor the higher percentage of mutual SNPs.
1) Can a poor-quality reference lead to fewer SNPs being called?
2) Does the new GATK version call more SNPs, and/or is it able to call SNPs more reliably on low-coverage data? The number of SNPs called now appears to be more consistent across samples. With the previous version I observed quite a strong dependency between average coverage and the number of SNPs.
Thank you very much for your help!
Best,
Julia