Hi GATK team and users,
I am using PrintReads with the -dfrac option to simulate different depths of coverage. The original data are WGS paired-end (PE) reads from the BAM in the GATK bundle, processed with PrintReads using -L 20 and -dfrac 0.18. I'm using GATK 3.7.
I think that the PE mates are lost during downsampling (I first noticed this in IGV with 'View as pairs').
samtools flagstat still reports 96.64% of the reads as "properly paired", but I guess that is because the flags are inherited from the reads in the original BAM.
samtools flagstat ./NA12878/CEUTrio.HiSeq.WGS.b37.NA12878.L20.dfrac0.18.bam:
## 9278360 + 0 in total (QC-passed reads + QC-failed reads)
## ...
## 8966551 + 0 properly paired (96.64% : N/A)
## 9023705 + 0 with itself and mate mapped
## ...
However, about 82% of the reads (7612964 of 9278360) have a name that occurs only once in the file, instead of every name occurring twice as expected for PE data:
samtools view ./NA12878/CEUTrio.HiSeq.WGS.b37.NA12878.L20.dfrac0.18.bam | cut -f1 | sort | uniq -c | awk '{print $1}' | sort -n | uniq -c
## 7612964 1
## 832698 2
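As a rough sanity check on those numbers, here is a minimal Python sketch. It assumes (this is my guess, not something I have confirmed about PrintReads) that -dfrac keeps each read independently with probability 0.18, in which case a kept read would lose its mate with probability 1 - 0.18. The function names are just for illustration, and the counts are copied from the output above.

```python
# Counts copied from the `uniq -c` output above:
# key = number of reads sharing a name, value = number of such names.
name_counts = {1: 7612964, 2: 832698}
dfrac = 0.18  # value passed to -dfrac


def total_reads(counts):
    """Total number of reads represented by the name-multiplicity table."""
    return sum(multiplicity * n_names for multiplicity, n_names in counts.items())


def singleton_fraction(counts):
    """Fraction of reads whose name occurs only once (mate not in the file)."""
    return counts.get(1, 0) / total_reads(counts)


def expected_singleton_fraction(fraction):
    """ASSUMPTION: every read is kept independently with probability `fraction`,
    so the mate of a kept read is dropped with probability 1 - fraction."""
    return 1.0 - fraction


observed = singleton_fraction(name_counts)
expected = expected_singleton_fraction(dfrac)

print(f"total reads: {total_reads(name_counts)}")
print(f"observed fraction of reads without a mate: {observed:.4f}")
print(f"expected under independent per-read sampling: {expected:.4f}")
```

The total matches the flagstat read count, and the observed fraction is close to 1 - dfrac, which is what makes me suspect that reads are sampled individually rather than as pairs. The match is not expected to be exact, since some reads had no mate on chromosome 20 in the original data either.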
Is there a way to downsample a BAM file while keeping read pairs together, so that I can simulate having less data that is still properly paired?
Thanks a lot for any help/discussion,
EsterQ