Hello all,
I am running into a problem at the Split'N'Trim step of the RNAseq Best Practices. The command I used is as follows:
java -jar /data1/APPS/gatk/GenomeAnalysisTK.jar -T SplitNCigarReads \
-R /path/reference.fa \
-I 042517Sam3C_S3_combined_dedup.bam \
-o 042517Sam3C_S3_combined_split.bam \
-rf ReassignOneMappingQuality \
-RMQF 255 -RMQT 60 \
-U ALLOW_N_CIGAR_READS
When I run ValidateSamFile on this output, I get the following errors:
Error Type Count
ERROR:INVALID_CIGAR 397
ERROR:MATES_ARE_SAME_END 6588323
ERROR:MATE_NOT_FOUND 5711036
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 13112240
ERROR:MISMATCH_FLAG_MATE_UNMAPPED 78
ERROR:MISMATCH_MATE_ALIGNMENT_START 15160687
ERROR:MISMATCH_MATE_CIGAR_STRING 20226660
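For reference, a summary like the one above can be produced with Picard's ValidateSamFile in SUMMARY mode. The sketch below is only illustrative: the PICARD_JAR location and the helper names validate_summary and total_errors are placeholders made up for this example, not part of Picard.

```shell
# Hypothetical location of the Picard jar; adjust to your installation.
PICARD_JAR=${PICARD_JAR:-/path/to/picard.jar}

# Run ValidateSamFile in SUMMARY mode on the BAM given as $1.
validate_summary() {
    java -jar "$PICARD_JAR" ValidateSamFile I="$1" MODE=SUMMARY
}

# Add up the counts of all ERROR: rows in a summary read from stdin.
# Assumes each row looks like "ERROR:SOME_TYPE <count>".
total_errors() {
    awk '$1 ~ /^ERROR:/ { sum += $2 } END { print sum + 0 }'
}

# Usage (not executed here):
#   validate_summary 042517Sam3C_S3_combined_split.bam | total_errors
```

Summing the rows is just a quick way to compare error totals between runs; it says nothing about which reads are affected.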
These errors look similar to the ones reported in this thread:
https://gatkforums.broadinstitute.org/gatk/discussion/7957/errors-when-running-picard-validatesamfile-on-bam-file-got-from-splitncigarreads
I have tried simply skipping this step; however, when I then run BQSR I get this message:
INFO 14:06:29,259 MicroScheduler - 67278005 reads were filtered out during the traversal out of approximately 69796967 total reads (96.39%)
INFO 14:06:29,260 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 14:06:29,260 MicroScheduler - -> 861213 reads (1.23% of total) failing DuplicateReadFilter
INFO 14:06:29,260 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 14:06:29,260 MicroScheduler - -> 515857 reads (0.74% of total) failing MalformedReadFilter
INFO 14:06:29,260 MicroScheduler - -> 57110033 reads (81.82% of total) failing MappingQualityUnavailableFilter
INFO 14:06:29,261 MicroScheduler - -> 3740357 reads (5.36% of total) failing MappingQualityZeroFilter
INFO 14:06:29,261 MicroScheduler - -> 5050545 reads (7.24% of total) failing NotPrimaryAlignmentFilter
INFO 14:06:29,261 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
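As far as I understand, the large MappingQualityUnavailableFilter count is consistent with STAR assigning MAPQ 255 to uniquely mapped reads, which GATK treats as "mapping quality unavailable". A quick sketch for checking how many records carry a given MAPQ, assuming samtools is installed; the function name count_mapq is hypothetical:

```shell
# Count SAM records (read from stdin) whose MAPQ (column 5) equals $1.
# Header lines starting with "@" are skipped.
count_mapq() {
    awk -v q="$1" -F '\t' '!/^@/ && $5 == q { n++ } END { print n + 0 }'
}

# Usage (not executed here; assumes samtools is on PATH):
#   samtools view 042517Sam3C_S3_combined_dedup.bam | count_mapq 255
```

Running the same count with 60 on a reassigned file would show whether the 255-to-60 reassignment actually took effect.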
I understand that I need to reassign mapping qualities, so I ran the following command:
java -jar /data1/APPS/gatk/GenomeAnalysisTK.jar -T PrintReads \
-R /path/reference.fa \
-I 042517Sam3C_S3_combined_dedup.bam \
-o 042517Sam3C_reassigned.bam \
-rf ReassignOneMappingQuality \
-RMQF 255 -RMQT 60 \
--filter_reads_with_N_cigar
When I validate the resulting file, I get this error:
ERROR:MATE_NOT_FOUND 5649538
I feel that at this point I have run into a dead end and don't know where to turn.
I have deviated from the Best Practices workflow in only two ways. First, I ran MergeBamAlignment on the 2-pass file produced by STAR, because validating that file reported the MATE_NOT_FOUND error, and MergeBamAlignment fixed it. Second, I have multiple lanes and multiple samples, so the 1st pass of STAR produced many SJ.out.tab files (48, to be exact); I used cat to combine all of them into a single SJ.all.tab file and used that for the 2nd pass. I saw this suggested in a forum post, although I can no longer find the link (my advisor also suggested it). I compared the STAR SAM output from this method with the output from running all samples separately, and the results were more or less the same.
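The junction-combining step can be sketched roughly as follows. This is not the exact command I ran: the helper name combine_sj, the pass1/ directory layout, the FASTQ names, and the genome directory are all hypothetical placeholders, and passing the combined file through STAR's --sjdbFileChrStartEnd option is my assumption about how it would be used in the 2nd pass.

```shell
# Concatenate first-pass junction files into one file.
# $1 = output file; remaining arguments = per-lane/per-sample SJ.out.tab files.
combine_sj() {
    out=$1
    shift
    # Refuse to run without inputs, otherwise cat would wait on stdin.
    [ "$#" -ge 1 ] || return 1
    cat "$@" > "$out"
}

# Usage (not executed here; paths and file names are placeholders):
#   combine_sj SJ.all.tab pass1/*/SJ.out.tab
#   STAR --genomeDir /path/to/genome --readFilesIn R1.fastq R2.fastq \
#        --sjdbFileChrStartEnd SJ.all.tab
```

Plain concatenation keeps duplicate junctions that appear in more than one lane or sample; this sketch does not deduplicate or filter them.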
The file produced by the final step of MarkDuplicates (042517Sam3C_S3_combined_dedup.bam) passes the validation with "no errors found."
Any help/suggestions would be greatly appreciated!
As a side note, I also tried running SplitNCigarReads in GATK4.beta with the following command:
java -jar /data1/APPS/gatk-4.beta.1/gatk-package-4.beta.1-local.jar SplitNCigarReads \
-R /path/reference.fa \
-I 042517Sam3C_S3_combined_dedup.bam \
-O 042517Sam3C_S3_combined_split.bam
The engine stopped immediately as it started the second pass.