I would like to use GATK to call variants in RNA-seq including RNA editing sites. I have followed the guide article #3891 "Calling variants in RNAseq", except for starting with already aligned (Novoalign) Bam files. Prior to SplitNCigarReads, IGV shows >30% of the reads have a T-C transition (attached IGV screenshot), whereas post SplitNCigarReads, there are < 1%. This known editing site is near an intron, and it looks like the SpitNCigarReads may be overaggressive in hard clipping 2 additional nucleotides into the exon (see below examples of reads pre and post splitting - the intron is 539N, but it looks like 539N + 2 are trimmed). How can I get around this?
java -jar /Applications/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SplitNCigarReads -R Homo_sapiens_assembly38.fasta -I a.bam -o a.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
samtools view a.bam --threads 16 | grep "K00282:11:H5VKLBBXX:3:1102:1824:16682"
K00282:11:H5VKLBBXX:3:1102:1824:16682 1040 chrX 154351037 52 5M539N46M * 0 0 GACTCCCGAAGGCTAGAAACAGTGAGGCGGCGGGCGTCGCCAGACGGAGAA AAFFFJJJJJJJJJJJJJJJJJAFJJJJJJJJJJJJJJAJJJJJJJJJJJJ PG:Z:MarkDuplicates.3 RG:Z:id NH:i:1 SJ:Z:chrX:154350995-154351041_154351580-154351626 NM:i:1 GN:Z:FLNA UQ:i:30 AS:i:30
samtools view a.split.bam --threads 16 | grep "K00282:11:H5VKLBBXX:3:1102:1824:16682"
K00282:11:H5VKLBBXX:3:1102:1824:16682 1040 chrX 154351037 52 5M585H * 0 0 GACTC AAFFF PG:Z:MarkDuplicates.3 RG:Z:id NH:i:1 SJ:Z:chrX:154350995-154351041_154351580-154351626 NM:i:1 GN:Z:FLNA UQ:i:30 AS:i:30
K00282:11:H5VKLBBXX:3:1102:1824:16682 1040 chrX 154351583 52 546H44M * 0 0 GAAGGCTAGAAACAGTGAGGCGGCGGGCGTCGCCAGACGGAGAA JJJJJJJJJJJJJJJAFJJJJJJJJJJJJJJAJJJJJJJJJJJJ PG:Z:MarkDuplicates.3 RG:Z:id NH:i:1 SJ:Z:chrX:154350995-154351041_154351580-154351626 NM:i:1 GN:Z:FLNA UQ:i:30 AS:i:30