MarkDuplicates: avoid excessive duplicate set size?

Hi,

I am facing an issue with MarkDuplicates that I do not understand.

I have 2 bam files produced from mRNA .fastq files with the same protocol (GATK Best Practices):

file A is 20GB (HiSeq 2500), aligned with STAR, sorted by coordinate
file B is 10GB (HiSeq 4000), aligned with STAR, sorted by coordinate

Command line used for both files:
picard.sam.markduplicates.MarkDuplicates \
INPUT=[file.bam] \
OUTPUT=file_dedupe.bam \
METRICS_FILE=file_dedupe_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=250 \
CREATE_INDEX=true \
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 \
SORTING_COLLECTION_SIZE_RATIO=0.25 \
REMOVE_SEQUENCING_DUPLICATES=false \
TAGGING_POLICY=DontTag \
REMOVE_DUPLICATES=false \
ASSUME_SORTED=false \
DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES \
PROGRAM_RECORD_ID=MarkDuplicates \
PROGRAM_GROUP_NAME=MarkDuplicates \
READ_NAME_REGEX= \
VERBOSITY=INFO \
QUIET=false \
VALIDATION_STRINGENCY=STRICT \
COMPRESSION_LEVEL=5 \
MAX_RECORDS_IN_RAM=500000 \
CREATE_MD5_FILE=false \
GA4GH_CLIENT_SECRETS=client_secrets.json

MarkDuplicates took about 12 hours on the larger file, A, with a minimum and maximum 'Large duplicate set. size' of 1,001 and 145,066 respectively.

For the smaller file, B, it has been running for 1 week and is far from finished.
Everything ran fine until a monster set of size 19,220,000 appeared. The 'ReadEnds to keeper' step was very quick, but the 'ReadEnds to others' step is taking about 11 minutes per 1,000 reads. If the rate stays constant, it should take no less than 132 days to complete this set!
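
For reference, the estimate can be checked with a few lines of Python. This is only a rough sketch: it assumes the observed rate of 11 minutes per 1,000 reads stays constant and that none of the 19,220,000 reads in the set have been processed yet. The function name `eta_days` is made up for illustration.

```python
# Rough ETA for the 'ReadEnds to others' step, assuming a constant rate.
SET_SIZE = 19_220_000          # size of the large duplicate set, from the log
MINUTES_PER_1000_READS = 11    # observed processing rate

def eta_days(n_reads, minutes_per_1000):
    """Return the estimated run time in days for n_reads at the given rate."""
    total_minutes = n_reads / 1000 * minutes_per_1000
    return total_minutes / (60 * 24)

print(f"{eta_days(SET_SIZE, MINUTES_PER_1000_READS):.1f} days")  # prints "146.8 days"
```

Under those assumptions the whole set would take roughly 147 days, which agrees with the "no less than 132 days" figure.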

I am running the program with 150GB of memory and 24 CPUs, with -Xmx32G set and a large temporary file. Increasing OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500 does not change anything.

(1) Why would a smaller file take forever to complete? Does it mean I have a big "clump" of duplicates?
(2) Is there a way to apply a threshold to the duplicate set size?
(3) Isn't the HiSeq 4000 flow cell supposed to avoid optical duplicates? Is it safe to skip optical duplicate detection when the bam files will be used to detect variants and gene expression downstream?
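
Regarding question (1), one way to look for such a clump is to count how many reads share the same start position. The sketch below is only a rough proxy and not how MarkDuplicates itself groups reads: it uses the leftmost SAM POS and the strand, and ignores soft clipping and mate information. The function name `count_start_positions` is made up for illustration; the input is assumed to be plain SAM text lines.

```python
from collections import Counter

def count_start_positions(sam_lines):
    """Count primary mapped reads per (reference, leftmost position, strand).

    Rough proxy only: MarkDuplicates works on unclipped 5' coordinates and
    mate pairs, which this sketch does not reproduce.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x4:                    # read is unmapped
            continue
        if flag & 0x900:                  # secondary or supplementary alignment
            continue
        strand = "-" if flag & 0x10 else "+"
        counts[(fields[2], int(fields[3]), strand)] += 1
    return counts
```

It could be fed from `samtools view file.bam` through `sys.stdin`, after which `count_start_positions(sys.stdin).most_common(10)` would list the ten most crowded positions. A single position holding millions of reads would point to one very highly expressed locus as the source of the large set.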

Thanks!
