Dear Sir/Madam,
We have discovered a problem in data that we generated with the Best Practices pipeline for variant calling (GATK 2.7.2, bwa 0.7.4, picard 1.92). We originally used the hg19 reference (chr1-22, M, X, Y) with the following extra chromosomes added:
chr19_gl000208_random
chr19_gl000209_random
chr6_cox_hap2
We have now discovered issues related to the inclusion of the extra chromosomes. For example, the sequence reads from a sample get split between chr6 and chr6_cox_hap2: sometimes some of the reads go to chr6 and the rest to chr6_cox_hap2, and sometimes all of the reads go to one or the other. As a result, the variant at that position does not pass our quality filters, because it is not called in enough samples.
We are actually not interested in the extra chromosomes, and have realised it was a mistake to include them from the beginning. We are now looking into different solutions for this problem.
I suspect the best and safest solution is to rerun all samples using a new reference (without the extra chromosomes). We have thousands of samples, so this would take considerable time and storage space. To make sure we have thought through every option, I would like to ask some questions:
- Is it correct that aligning the reads to only part of the genome is not advised, because reads could then be aligned incorrectly to the restricted reference in a way that would not happen if the aligner had the entire genome?
- Is it possible to select only some sequence reads to re-align? I am thinking of selecting only the reads that aligned to the extra chromosomes and re-aligning those to the new reference.
- Is it possible to combine, at a later stage in the pipeline, two datasets that were aligned to different references? If so, how and at which step would that best be done?
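To make the second question concrete, here is a minimal Python sketch of the kind of selection I have in mind. It is only an illustration and not something from our pipeline: the function name `read_names_to_realign` is hypothetical, and it assumes the alignments are available as SAM text lines (for example the output of `samtools view -h`). It collects the names of reads where either the read or its mate was placed on one of the extra chromosomes; whether re-aligning only those reads is valid is exactly what I am unsure about.

```python
# Extra chromosomes that were added to our hg19 reference (listed above).
EXTRA_CONTIGS = {
    "chr19_gl000208_random",
    "chr19_gl000209_random",
    "chr6_cox_hap2",
}


def read_names_to_realign(sam_lines, extra_contigs=EXTRA_CONTIGS):
    """Return the names of reads whose own alignment, or whose mate's
    alignment, is on one of the extra contigs.

    Hypothetical helper: sam_lines is any iterable of SAM text lines.
    """
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        # SAM columns: 1 = QNAME, 3 = RNAME, 7 = RNEXT
        qname, rname, rnext = fields[0], fields[2], fields[6]
        if rnext == "=":
            rnext = rname  # "=" means the mate is on the same contig
        if rname in extra_contigs or rnext in extra_contigs:
            names.add(qname)
    return names
```

The resulting names could then be written to a file and used to pull the corresponding read pairs out of the BAM for re-alignment. Note that this would not pick up reads that stayed on chr6 but whose placement or mapping quality was affected by the presence of chr6_cox_hap2.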
Any help would be very much appreciated!
Best regards,
Lina