Hi, I'm working on a pipeline that takes an RNA-seq BAM file and looks for overlaps using R. When I run the R script on my original BAM file it works fine; however, when I run it on the BAM file produced by MarkDuplicates it throws this error:
Error in `$<-.data.frame`(`*tmp*`, "queryHits", value = integer(0)) :
  replacement has 0 rows, data has 2
Calls: $<- -> $<-.data.frame
Execution halted
From what I can find online, this means that the R script has been asked to find a variable in the file which it cannot find. The R script loads these packages: library(GenomicRanges) and library(Biostrings).
This is the section of R script which falls over:
overlaps <- findOverlaps(GRbam,GR)
printWithTimeStamp("Collating data:\n")
overs <- data.frame(NA,rownames=c(1:length(overlaps)))
printWithTimeStamp(" queryHits\n")
overs$queryHits<-queryHits(overlaps)
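For reference, the same message can be reproduced in base R without any BAM file. The sketch below assumes that findOverlaps returns zero hits on the MarkDuplicates output, which the post does not confirm. The variable hits is a hypothetical stand-in for queryHits(overlaps). With zero hits, 1:length(overlaps) evaluates to 1:0, which is c(1, 0), so the data frame gets two rows while the replacement vector has none.

```r
# Hypothetical stand-in for queryHits(overlaps) when there are no overlaps
# (assumption: the post does not confirm that the hit count is zero).
hits <- integer(0)

# 1:length(hits) is 1:0, which counts DOWN to c(1, 0): two rows, not zero.
overs <- data.frame(NA, rownames = c(1:length(hits)))
n_rows <- nrow(overs)

# Assigning a zero-length vector to a two-row data frame raises the error.
msg <- tryCatch({
  overs$queryHits <- hits
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# seq_len() returns an empty sequence for 0, so it avoids the phantom rows.
safe_rows <- seq_len(length(hits))
```

If this is what is happening, printing length(GRbam) and length(overlaps) just before the data frame is built should show whether the new BAM file yields any reads and any hits at all.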
I included both the following options when calling MarkDuplicates to try and reduce the formatting changes in the new file:
REMOVE_DUPLICATES=TRUE
PROGRAM_RECORD_ID=null
Has anyone come across a similar issue, and does anyone know what might be different in the new BAM file? I've compared both headers and the only difference is that the original file has @HD VN:1.0, while the MarkDuplicates output has @HD VN:1.5. Could this be the issue? I can't find much online about the differences.
Please feel free to ask for more information if I haven't been clear.