Greetings,
I am hoping to get some help troubleshooting a frustrating error I am running into while trying to genotype a large data set. The source data are nearly 12,000 WES samples, which were sequenced by a third-party company, so I am assuming the work is worth the money that was spent. I know they followed best practices and used the same reference file for all samples. I have the gVCF files for the entire set, and I have successfully genotyped the entire WES intervals, as well as subset the gVCF files to 74 genes and successfully genotyped those. All of this was done with GATK v3.7.
I now have a third set of intervals (SXP) I am trying to process. SelectVariants with this interval set works fine. I then create 40 cohort.g.vcf files with roughly 290 samples in each; this process has worked without any errors in all three use cases.
However, with these SXP cohorts, I get about 2.5% of the way through GenotypeGVCFs and then receive this error:
##### ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 1:62732364, A* vs. G*
I identified a single cohort that has this ref anomaly. I looked for it in the individual SXP subset g.vcfs of all the samples in that cohort, but could not find a single sample with that reference allele at that position; I have no idea where it comes from. I tried removing that position from the cohort g.vcf, and then received the same error at a different position, in a different cohort, although I notice that it's technically happening in the same gene as the original error.
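A per-sample check of this kind can be sketched roughly as below. This is only a minimal Python sketch under stated assumptions, not the exact procedure used above: the helper name `records_covering` and the two example records are hypothetical and invented for illustration. It assumes uncompressed, tab-delimited gVCF lines and reports every record that covers a position, including deletions and reference blocks that start upstream of it, since those are missed when searching on the POS column alone.

```python
import re

# Reference blocks in a gVCF carry their end coordinate in INFO as END=<pos>.
END_RE = re.compile(r"(?:^|;)END=(\d+)")


def records_covering(lines, chrom, pos):
    """Return (start, end, ref, alt, base_at_pos) for each record covering chrom:pos.

    base_at_pos is the REF base at `pos` when the REF string reaches that far
    (e.g. inside a deletion), otherwise None (e.g. inside a reference block,
    where REF holds only the first base of the block).
    """
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[0] != chrom:
            continue
        start = int(fields[1])
        ref, alt, info = fields[3], fields[4], fields[7]
        match = END_RE.search(info)
        # Use END= when present; otherwise the record spans the REF allele.
        end = int(match.group(1)) if match else start + len(ref) - 1
        if start <= pos <= end:
            offset = pos - start
            base = ref[offset] if offset < len(ref) else None
            hits.append((start, end, ref, alt, base))
    return hits


# Hypothetical records, made up for illustration (not taken from the real data).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t62732360\t.\tGAAAA\tG,<NON_REF>\t.\t.\t.\n",
    "1\t62732365\t.\tT\t<NON_REF>\t.\t.\tEND=62732400\n",
]

for hit in records_covering(example, "1", 62732364):
    print(hit)
```

To run it on a real file, the lines of each sample's g.vcf would be passed in place of `example`, for instance from `open(path)`, where `path` is a placeholder for the file name.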
I then removed that gene from my interval list, re-subset the entire sample set, and made the same cohorts from the modified data; I received the same error, at a different position, in a different cohort, in a different gene.
I can find no evidence that these data had any sort of inconsistent reference when they were created, and again I have used them successfully a couple of times already, and so have other researchers working with the data files.
I do not understand where these genotypes are coming from. From my understanding, I cannot run ValidateVariants on gVCFs and get anything meaningful. Is there anything else I can do to find the issue, or is there a way to make GenotypeGVCFs move past these error positions? I think the only thing I haven't tried is upgrading to GATK 4.0, but I am dubious it will make a difference. Thank you!
-bwubb