I'm analysing a set of bacterial isolates, some of which are (almost) identical to the reference and some of which are very different. Although the identical isolates have good coverage (80x), I end up filtering out a lot of SNPs for them due to lack of depth (cutoff of 10). I was wondering if this is due to the way the g.vcf files are used.
Below is a typical part of the g.vcf file for one of the identical isolates:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
REF 1 . A <NON_REF> . . END=296981 GT:DP:GQ:MIN_DP:PL 0:82:99:8:0,252
The average depth (DP) is 82, but the lowest depth (MIN_DP) in that region of roughly 300 kb is 8. If any of the other samples in the same analysis has a SNP in this region, what will the DP for sample1 be at that SNP? Will it be 82 or 8?
If it is 8, every SNP in that region will be hard filtered for sample1, even though the actual coverage in that region (and most likely at that SNP) is a lot higher. How can I avoid discarding all that data for samples that are highly similar to the reference used?
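To make the two candidate values concrete, here is a minimal Python sketch that pulls DP and MIN_DP out of the record shown above and compares each against my depth cutoff of 10. The function name `parse_ref_block` and the constant `DEPTH_CUTOFF` are made up for illustration, and the line is split on whitespace because that is how the record is pasted here. It does not show which value ends up in the joint-genotyped VCF, only why the answer matters for filtering.

```python
def parse_ref_block(line):
    """Extract position, END and per-sample depth fields from one single-sample gVCF record."""
    cols = line.split()  # whitespace split; works for tabs or the spaces used in this post
    chrom, pos, info = cols[0], int(cols[1]), cols[7]

    # Pair the FORMAT keys (GT:DP:GQ:MIN_DP:PL) with the sample values (0:82:99:8:0,252).
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))

    # INFO is a semicolon-separated list of KEY=VALUE entries; only END is needed here.
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    end = int(info_map.get("END", pos))

    return {
        "chrom": chrom,
        "start": pos,
        "end": end,
        "length": end - pos + 1,
        "DP": int(sample["DP"]),
        "MIN_DP": int(sample["MIN_DP"]) if "MIN_DP" in sample else None,
    }


DEPTH_CUTOFF = 10  # the hard-filter depth threshold mentioned in this post

record = "REF 1 . A <NON_REF> . . END=296981 GT:DP:GQ:MIN_DP:PL 0:82:99:8:0,252"
block = parse_ref_block(record)

print(block)
print("DP passes cutoff:    ", block["DP"] >= DEPTH_CUTOFF)
print("MIN_DP passes cutoff:", block["MIN_DP"] >= DEPTH_CUTOFF)
```

If the per-site depth reported for sample1 comes from DP, the block passes the cutoff; if it comes from MIN_DP, every site in the block fails it.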
I'm using version 3.6-44-ge7d1cd2.