Hi,
I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.
I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when haplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.
When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?
Alternatively, can BaseRecallibrator and/or HaplotypeCaller simultaneously analyze multiple bam files?
Is it common practice to combine bam files for discovering sequence variants?