CombineGVCFs batch effects

Dear Sir/Madam,

We have been using the Best Practices variant calling pipeline (GATK 3.2.2) to call variants in our targeted sequencing data. Others on the team and I have found batch effects in the data that seem to be related to the CombineGVCFs step. Because we have thousands of samples, we used CombineGVCFs to merge the individual samples into groups of about 100 before running GenotypeGVCFs.

Several times we have seen heterozygous calls that are present only in samples from certain CombineGVCFs groups. These calls often coincide with groups that have a lot of missing genotype data. When my colleagues investigated further, they found that the odd heterozygous calls seem to arise in the CombineGVCFs step: when the samples from the problematic group are passed individually to GenotypeGVCFs (with all other samples still passed as groups), those heterozygous calls do not appear. What is actually happening here? I thought that CombineGVCFs only merges the data for the 100 samples and does not recalculate any genotypes, but maybe I have misunderstood?
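To make the pattern concrete, here is a minimal sketch of how the per-group heterozygous and missing genotype counts at a single site could be tallied. It is only an illustration, not the GATK method: the function name `tally_by_group`, the sample names, and the sample-to-group mapping are all hypothetical, and the genotypes are assumed to be plain VCF `GT` strings such as `0/1` or `./.`.

```python
from collections import defaultdict


def tally_by_group(genotypes, sample_to_group):
    """Count heterozygous and missing genotypes per group at one site.

    genotypes: dict mapping sample name -> VCF GT string ("0/1", "0|0", "./.")
    sample_to_group: dict mapping sample name -> group label
                     (hypothetical; stands in for the CombineGVCFs grouping)
    """
    counts = defaultdict(lambda: {"het": 0, "missing": 0, "total": 0})
    for sample, gt in genotypes.items():
        group = sample_to_group[sample]
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        alleles = gt.replace("|", "/").split("/")
        counts[group]["total"] += 1
        if "." in alleles:
            # Any no-call allele marks the genotype as missing.
            counts[group]["missing"] += 1
        elif len(set(alleles)) > 1:
            # More than one distinct allele means a heterozygous call.
            counts[group]["het"] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up genotypes for illustration only, not real data.
    genotypes = {"s1": "0/1", "s2": "0/1", "s3": "./.",
                 "s4": "0/0", "s5": "0/0", "s6": "1|1"}
    sample_to_group = {"s1": "A", "s2": "A", "s3": "A",
                       "s4": "B", "s5": "B", "s6": "B"}
    for group, c in sorted(tally_by_group(genotypes, sample_to_group).items()):
        print(group, c)
```

A site where one group carries nearly all of the heterozygous calls and also most of the missing genotypes would match the batch effect described above.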

A similar problem was posted to this forum earlier, and the suggested solution was to use a newer version of GATK, which was in fact the version we used (3.2.2). Have you heard of this problem before? Do you think an even newer version of GATK would solve it? We are considering rerunning our samples to get rid of this batch effect, but we would like to be sure that a newer version actually fixes it. If not, we will have to think very carefully about which samples are grouped together by CombineGVCFs.

Any input would be very much appreciated!

Thank you!

Best regards, Lina
