I have a plasmid library of random mutations (thousands of mutations), on which we performed targeted amplicon sequencing (high depth, >7000) on Illumina's MiSeq platform. When I perform variant calling with the GATK UnifiedGenotyper (GATK-UG), following the best practices guidelines (without BQSR), I get a list of variant calls. However, that list is not comprehensive: it misses many mutations that I know should be there (experimental evidence). On the other hand, if I generate a variant report using samtools mpileup and filter the calls by read depth and quality, more mutations remain, and that number seems closer to my estimate of the library size.
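To make the mpileup-side filtering concrete, here is a minimal sketch of a depth/quality filter over plain-text VCF data lines. It assumes the depth is stored as `DP` in the INFO column; the function names and the cutoffs (`min_depth=1000`, `min_qual=30.0`) are hypothetical placeholders for illustration, not the values actually used.

```python
def passes_filter(vcf_line, min_depth=1000, min_qual=30.0):
    """Return True if a VCF data line meets the depth and quality cutoffs.

    Assumes tab-separated VCF columns (QUAL is column 6, INFO is column 8)
    and that INFO carries a DP=<int> entry. Cutoffs are placeholders.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    info = fields[7]
    if qual == ".":
        # Missing QUAL: treat as failing the quality cutoff
        return False
    # Parse INFO key=value pairs; flag entries without '=' are skipped
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    depth = int(info_map.get("DP", 0))
    return depth >= min_depth and float(qual) >= min_qual


def filter_vcf_lines(lines, min_depth=1000, min_qual=30.0):
    """Keep only data lines (not '#' header lines) that pass both cutoffs."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            continue
        if passes_filter(line, min_depth, min_qual):
            kept.append(line)
    return kept


# Made-up example records, only to show the expected input shape
example_lines = [
    "##fileformat=VCFv4.2",
    "plasmid\t101\t.\tA\tG\t55.0\t.\tDP=8000;INDEL",
    "plasmid\t102\t.\tC\tT\t12.0\t.\tDP=7900",
    "plasmid\t103\t.\tG\tA\t60.0\t.\tDP=300",
]
kept_lines = filter_vcf_lines(example_lines)
```

In this toy input only the first record is kept: the second fails the quality cutoff and the third fails the depth cutoff.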
The caveat is that, because this is a random mutation library, most of the reads at any given locus are wild type: most of the plasmids are wild type at that locus, and only the plasmids carrying that particular mutation differ. The result is a sample with thousands of very low-frequency mutations.
My questions are:
1) Is GATK suitable for analyzing a sample with a large number of very low-frequency mutations (depth at locus ≈8000, reads with mutation in the range 20-100, VAF for the most abundant mutation 0.05) in a very small genomic region, e.g. 1 gene? In other words, does GATK decide that, because there are so many mutations in this region (which is a real possibility in our case), they are likely to be sequencing artifacts?
2) Why does GATK drop so many variants, reducing the number of reported variants ≈10-20 fold?
3) Is there a way I can ask GATK to report all the mismatches it finds, so that I can perform my own filtering?
4) The mutations are in an oncogene. Does GATK cross-reference some kind of cancer mutation database and take that into account? That would bias it for my application area.
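
For reference, the allele fractions implied by the numbers in question 1 can be worked out with a short sketch. The function name is hypothetical, and the depth of 8000 is the approximate figure quoted above, not an exact per-locus value.

```python
def variant_allele_fraction(alt_reads, depth):
    """Fraction of reads at a locus that support the variant allele."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


depth = 8000  # approximate depth at a locus, as quoted in question 1
for alt_reads in (20, 100):
    vaf = variant_allele_fraction(alt_reads, depth)
    print(f"{alt_reads} alt reads / {depth} depth -> VAF = {vaf:.4f}")
# prints:
# 20 alt reads / 8000 depth -> VAF = 0.0025
# 100 alt reads / 8000 depth -> VAF = 0.0125
```

So a typical mutation in the library sits at roughly 0.25% to 1.25% of reads, while a VAF of 0.05 at this depth corresponds to about 400 supporting reads.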