Hello GATK team!
I'm having a hard time finding discussions or examples of HaplotypeCaller being used by researchers whose data and research questions are similar to mine, so apologies for posting such high-level questions on your forum.
My goal is to explore genetic diversity at 7 amplicon loci across 60 populations of a very genetically diverse species of parasitic nematode (each population sample consists of pooled DNA from ~200-500 worms). To do this, I would like to count the number of haplotypes present, and their proportions, in each population at each of these seven loci.
I intend to infer signatures of selection by drug treatment from the level of haplotypic diversity in each population at each locus; some loci are candidate genes of interest and some are control loci. The populations are part of controlled drug-treatment experiments, so a drop in diversity at a candidate locus in a post-treatment sample, relative to its associated pre-treatment sample, should tell us whether selection is happening near that locus.
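To make the pre/post comparison concrete, here is a minimal sketch of what I have in mind. It is not GATK code. I haven't settled on a diversity metric, so the use of gene diversity (1 minus the sum of squared haplotype proportions) is an assumption for illustration, and the function name and the toy proportions are hypothetical.

```python
def haplotype_diversity(proportions):
    """Gene diversity, 1 - sum(p_i^2), computed on normalised proportions.

    Assumption: this metric is only an illustrative choice; any other
    diversity index could be swapped in.
    """
    total = sum(proportions)
    if total == 0:
        return 0.0
    return 1.0 - sum((p / total) ** 2 for p in proportions)


# Hypothetical haplotype proportions at one candidate locus.
pre_treatment = [0.25, 0.25, 0.25, 0.25]
post_treatment = [0.85, 0.05, 0.05, 0.05]

# A positive value means diversity dropped after treatment.
drop = haplotype_diversity(pre_treatment) - haplotype_diversity(post_treatment)
print(f"diversity drop: {drop:.2f}")
```

The same calculation at the control loci would give the baseline against which a drop at a candidate locus is judged.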
My data consist of roughly 5,000-25,000 paired-end reads (~600 bp) per population, per amplicon locus, sequenced on a MiSeq V3 run. I've already aligned the reads of each population sample to a reference 'genome' consisting of the seven loci using bowtie2 [--local --no-mixed]. Reads were indexed on the MiSeq by population, so each fastq R1/R2 fileset contains reads from all 7 loci for a single population.
I would now like to take these 60 .sam/.bam files and assess them with HaplotypeCaller to get counts of unique haplotypes that pass a confidence threshold (i.e. aren't likely to be false haplotypes caused by sequencing error etc.) for each of the 7 loci in each of the 60 population samples. Additionally (though less important than the counts), I'd like the proportion at which each haplotype occurs in each population, and ideally each unique haplotype would be traceable across populations.
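To show the kind of output I'm after, here is a minimal sketch of the table I'd like for one population at one locus. It is not GATK code and doesn't reflect how HaplotypeCaller works internally. The function name, the toy sequences, and the minimum-frequency cutoff used as a crude stand-in for "passes confidence" are all hypothetical.

```python
from collections import Counter


def haplotype_table(sequences, min_fraction=0.01):
    """Return {haplotype: proportion} for haplotypes at or above min_fraction.

    Proportions are relative to all input sequences, including those
    belonging to haplotypes that get filtered out.
    Assumption: min_fraction is an arbitrary illustrative cutoff.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        hap: n / total
        for hap, n in counts.items()
        if n / total >= min_fraction
    }


# Toy data: three distinct sequences, one of them rare enough to be filtered.
reads = ["ACGT"] * 60 + ["ACGA"] * 39 + ["TCGA"] * 1
table = haplotype_table(reads, min_fraction=0.02)

print(len(table), "haplotypes retained")
for hap, fraction in sorted(table.items(), key=lambda kv: -kv[1]):
    print(hap, round(fraction, 2))
```

Because each haplotype is keyed by its sequence, the same key could be looked up in the tables of other populations, which is what I mean by traceable across populations.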
Here's what I think is the main problem: this worm is exceptionally diverse, both between and within populations. I expect roughly 5-40 true haplotypes within each population, usually distinguished by the presence or absence of large indels (5-50 base pairs) and by many SNV sites (20-50 variant sites in one population across the 600 bp would be a reasonable expectation).
So, to get to my actual questions (sorry): 1) Is HaplotypeCaller designed to give me the information I want, namely haplotype counts, frequencies, and presence across populations? 2) Will the very high genetic diversity cause problems for the tool and confound the output, given that it's optimized for organisms with much lower levels of diversity?
Thank you so much for your help!
Andew