Hi Everyone,
I am brand new to this so please go easy on me. I have just taken over a project where we are going to be doing variant calling on a large number of human samples. I have inherited a number of scripts that are at least a few years old. I decided I wanted to follow the GATK best practices while noting the differences between the them and the scripts I have. I'm currently trying to push a single family (5 individuals) through the pipeline before applying it to all of the other samples I have.
So, first of all, all of my raw data is stored as paired-end reads in fastq format, I have no uBAM files available to me. According to the data pre-processing for variant discovery steps, the "reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM." So the first thing I did was use FastqToSam to do the conversion yesterday. This is not an insignificant task, I ran each sample for my test family in parallel and it took roughly 5 hours.
I understand the benefit of using uBAM from the get-go (keeping some metadata that is discarded in fastq as described here: https://gatkforums.broadinstitute.org/gatk/discussion/5990/what-is-ubam-and-why-is-it-better-than-fastq-for-storing-unmapped-sequence-data), but I don't see the benefit of doing this conversion if the first step of the alignment is to convert this uBAM back to fastq before running bwa mem and samtools view. The next step would be to use MergeBamAlignments to merge the mapped and unmapped alignments which I guess I couldn't do if I did not do the original fastq->uBAM conversion.
Basically, my question is if the initial conversion from fastq to uBam is necessary or even recommended in this case. I don't see how it could have any added benefit and converting from and to fastq will incur a significant overhead. For what it's worth, the script I inherited simply ran 'bwa mem' on the paired-end reads and piped the output into 'samtools view -bh' to create the aligned BAM file. From here they would move on to the marking of duplicates. If I don't convert to uBAM and then skip the MergeBamAlignments, will that have an impact on me being able to apply the best practices down the line? I want to stick as close to the best practices as I possibly can, but If I can cut out some unnecessary computation time then that would be great.
Thanks!