Dear GATK Team,
We received whole-genome sequencing data from Illumina that was aligned with CASAVA.
The problem is that the BAM files we have do not contain any @RG tags.
However, the read group information is encoded in the name of each read. The following is an example:
PROD104_897:1:2201:15530:13154 99 chr1 9999 254 ...
The read name has the following structure:
<instrument>_<fcnumber>:<lane>:<tile>:<xcoord>:<ycoord>
So you can extract the physical lane information by doing something like:
samtools view file.bam | cut -f1 | cut -d: -f1,2
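To make the read name structure concrete, here is a minimal Python sketch of the same parsing. The function name and the `<instrument>_<fcnumber>.<lane>` read group ID format are our own illustrative choices, not something defined by CASAVA or GATK, and the sketch assumes every read name has exactly the five colon-separated fields shown above.

```python
def parse_read_name(qname):
    """Split a read name of the form
    <instrument>_<fcnumber>:<lane>:<tile>:<xcoord>:<ycoord>
    into its fields. Raises ValueError if the field count differs."""
    instrument_fc, lane, tile, xcoord, ycoord = qname.split(":")
    # The flowcell number is whatever follows the last underscore.
    instrument, _, flowcell = instrument_fc.rpartition("_")
    return {
        "instrument": instrument,
        "flowcell": flowcell,
        "lane": lane,
        "tile": tile,
        "xcoord": xcoord,
        "ycoord": ycoord,
        # Hypothetical ID scheme: one read group per flowcell lane.
        "read_group_id": f"{instrument_fc}.{lane}",
    }


fields = parse_read_name("PROD104_897:1:2201:15530:13154")
print(fields["read_group_id"], fields["flowcell"], fields["lane"])
```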
Using Picard, we can add RGSM for each sample. However, we cannot add lane and flowcell information this way, because that would require splitting each BAM file into multiple files by lane or flowcell, adding @RG information to each of the smaller BAM files, and then merging them back together.
This seems impractical because we have many samples and each file is about 110 GB.
My questions are:
Is there a better way to add @RG information? Instead of running Picard AddOrReplaceReadGroups.jar separately for each lane or flowcell, is there a way to process the BAM file read by read, take the read group information from each read name, and write the corresponding @RG tag?
Does GATK have to get the read group information from the @RG tag? Could I modify the GATK code so that it reads the read group information from the read name for this type of data? If so, which part of the code should I look at?
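For the first question, what we have in mind is roughly the following single-pass filter over the SAM text. This is only a sketch under several assumptions: the set of read group IDs is collected beforehand (for example with the samtools/cut command above), the input has no existing @RG lines or RG tags, and the function names, the ID format, and the PL and PU values are our own placeholders rather than anything Picard or GATK prescribes.

```python
def read_group_id(qname):
    """Derive a read group ID from <instrument>_<fcnumber>:<lane>:..."""
    prefix, lane = qname.split(":")[:2]
    return f"{prefix}.{lane}"


def add_read_groups(sam_lines, sample, rg_ids, platform="ILLUMINA"):
    """Yield SAM lines with @RG header lines added after the existing
    header and an RG:Z: tag appended to every alignment record."""
    known = set(rg_ids)
    rg_header = [
        f"@RG\tID:{rg}\tSM:{sample}\tPL:{platform}\tPU:{rg}" for rg in rg_ids
    ]
    header_written = False
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@") and not header_written:
            yield line  # pass existing header lines through unchanged
            continue
        if not header_written:
            # First alignment record reached: emit the @RG lines now.
            yield from rg_header
            header_written = True
        rg = read_group_id(line.split("\t", 1)[0])
        if rg not in known:
            raise ValueError(f"read group {rg} was not declared in rg_ids")
        yield f"{line}\tRG:Z:{rg}"
    if not header_written:
        # Input contained no alignment records; still emit the @RG lines.
        yield from rg_header


# Small in-memory demonstration with a made-up header and one read.
example = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:249250621\n",
    "PROD104_897:1:2201:15530:13154\t99\tchr1\t9999\t254\t*\t=\t0\t0\t*\t*\n",
]
for out_line in add_read_groups(example, "SAMPLE1", ["PROD104_897.1"]):
    print(out_line)
```

If the same loop were wired to standard input and output in a script (say a hypothetical add_rg.py), it could sit between `samtools view -h file.bam` and `samtools view -bS -`, so the 110 GB file would be streamed once without being split and merged.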
Thanks very much.