Hi,
I am facing an issue with MarkDuplicates that I do not understand.
I have 2 BAM files produced from mRNA .fastq files with the same protocol (GATK Best Practices):
- file A is 20GB (HiSeq 2500), aligned with STAR, sorted by coordinate
- file B is 10GB (HiSeq 4000), aligned with STAR, sorted by coordinate
Command line used for both files:
picard.sam.markduplicates.MarkDuplicates \
INPUT=[file.bam] \
OUTPUT=file_dedupe.bam \
METRICS_FILE=file_dedupe_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=250 \
CREATE_INDEX=true \
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 \
SORTING_COLLECTION_SIZE_RATIO=0.25 \
REMOVE_SEQUENCING_DUPLICATES=false \
TAGGING_POLICY=DontTag \
REMOVE_DUPLICATES=false \
ASSUME_SORTED=false \
DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES \
PROGRAM_RECORD_ID=MarkDuplicates \
PROGRAM_GROUP_NAME=MarkDuplicates \
READ_NAME_REGEX= \
VERBOSITY=INFO \
QUIET=false \
VALIDATION_STRINGENCY=STRICT \
COMPRESSION_LEVEL=5 \
MAX_RECORDS_IN_RAM=500000 \
CREATE_MD5_FILE=false \
GA4GH_CLIENT_SECRETS=client_secrets.json
MarkDuplicates took about 12 hours on the larger file A, where the 'Large duplicate set. size' values reported in the log ranged from a minimum of 1,001 to a maximum of 145,066.
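For reference, a minimal sketch of how such minimum and maximum set sizes can be pulled out of a MarkDuplicates log. It assumes each relevant log line contains the phrase 'Large duplicate set. size' followed by the number on the same line; the function name and the sample lines are hypothetical, not real output.

```python
import re

# Assumption: the set size is the first integer that follows the phrase
# 'Large duplicate set. size' on the same log line.
PATTERN = re.compile(r"Large duplicate set\. size\D*(\d+)")

def duplicate_set_sizes(log_lines):
    """Return the list of set sizes found in an iterable of log lines."""
    sizes = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            sizes.append(int(match.group(1)))
    return sizes

# Hypothetical sample lines, only to illustrate the expected shape.
sample = [
    "INFO MarkDuplicates Large duplicate set. size = 1001",
    "INFO MarkDuplicates some other message",
    "INFO MarkDuplicates Large duplicate set. size = 145066",
]
sizes = duplicate_set_sizes(sample)
print(min(sizes), max(sizes))
```

On a real run, the lines of the log file would be passed in place of `sample`.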
For the smaller file B, it has been running for 1 week and is far from finished.
Everything ran fine until a monster set of size 19,220,000 appeared. The 'ReadEnds to keeper' step was very quick, but the 'ReadEnds to others' step is taking about 11 minutes to process 1,000 reads. If the rate stays constant, it will take no less than 132 days to complete this set!
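The estimate is simple arithmetic. A minimal sketch, assuming the rate stays at exactly 11 minutes per 1,000 reads (the observed rate is only approximate, so this is an order of magnitude rather than a prediction); the function name is just for illustration.

```python
def eta_days(set_size, reads_per_batch, minutes_per_batch):
    """Days needed to process set_size reads at a constant batch rate."""
    batches = set_size / reads_per_batch
    total_minutes = batches * minutes_per_batch
    return total_minutes / (60 * 24)  # minutes per day

# Numbers taken from the log: set size, batch size, minutes per batch.
days = eta_days(19_220_000, 1_000, 11)
print(f"{days:.1f} days")
```

With these round numbers the result comes out at roughly 147 days, in line with the figure of at least 132 days.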
I am running the program with 150GB of memory and 24 CPUs, with -Xmx32G set and a large temporary file. Increasing OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500 does not change anything.
(1) Why would a smaller file take forever to complete? Does it mean I have a big "clump" of duplicates?
(2) Is there a way to apply a threshold to the duplicate set size?
(3) Is the HiSeq 4000 flow cell not supposed to avoid optical duplicates? Is it safe to skip optical duplicate detection when the BAM files will be used for variant detection and gene expression analysis downstream?
Thanks!