Hi!
I'm trying to replicate the variant calling procedure used by a previous graduate student in my lab. To be precise, I have access to the original fastq files and his final vcf file; unfortunately, his pipeline seems to have been lost, so I had to reconstruct the procedure from the description in his thesis. The only known difference is that he used UnifiedGenotyper, whereas I'm using HaplotypeCaller (and a more recent version of GATK; his thesis was done almost three years ago). After running my scripts, my vcf is significantly smaller than his: he obtained ~200 million sites and I got ~70 million. I have thoroughly checked my scripts against the pipeline description in his thesis and everything seems OK. This may mean one of two things: 1. I'm unable to replicate his pipeline because some crucial steps or parameters may have been left out of the description; 2. HaplotypeCaller and UnifiedGenotyper can produce different results under certain circumstances.
Could the second option explain the difference, especially considering that the sequencing data weren't of very good quality? Maybe HaplotypeCaller and more recent versions of GATK are stricter when calling variants...
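In case it helps narrow things down, one check I could run is to compare which sites the two files actually share. Below is a minimal sketch in Python, assuming both VCFs are available as uncompressed, tab-delimited text; the function names `site_keys` and `compare_sites` and the file names in the usage comment are hypothetical and not part of either pipeline.

```python
def site_keys(lines):
    """Yield (chrom, pos) for each VCF record line, skipping headers and blanks."""
    for line in lines:
        if line.startswith("#"):
            continue  # meta-information and column header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        yield fields[0], int(fields[1])


def compare_sites(lines_a, lines_b):
    """Count sites found only in A, only in B, and in both."""
    a = set(site_keys(lines_a))
    b = set(site_keys(lines_b))
    return {"only_a": len(a - b), "only_b": len(b - a), "shared": len(a & b)}


# Hypothetical usage with real files (names are placeholders):
# with open("old_ug.vcf") as fa, open("new_hc.vcf") as fb:
#     print(compare_sites(fa, fb))
```

This holds every site in memory, so with hundreds of millions of records it would probably have to be run one chromosome at a time. It also only compares positions, not alleles or genotypes.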
Thanks!