Hi,
Im running HaplotypeCaller on a server this way:
java -XX:ParallelGCThreads=8 -Xmx80g -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -I a2tl1_14_final.bam --min_base_quality_score 25 --min_mapping_quality_score 25 -rf DuplicateRead -rf BadMate -rf BadCigar -R JIC_reference/alygenomes.fasta -o a2tl1_14_HC1.g.vcf.gz -ploidy 2 -stand_call_conf 25 -ERC GVCF --pcr_indel_model NONE -nct 8 --max_num_PL_values 350
And I can not figure out why no duplicate reads are being filtered out although they were marked by Picard (with option TAGGING_POLICY=All ) and I also see around 20% of duplicates in corresponding samtools flagstat.
The beginning of the stdout looks like this:
INFO 04:37:54,634 HelpFormatter - -------------------------------------------------------------------------------- INFO 04:37:54,986 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 INFO 04:37:54,986 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 04:37:54,986 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk INFO 04:37:54,986 HelpFormatter - [Tue Nov 07 04:37:54 CET 2017] Executing on Linux 3.16.0-4-amd64 amd64 INFO 04:37:54,986 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 INFO 04:37:54,990 HelpFormatter - Program Args: -T HaplotypeCaller -I h2al1_21_final.bam --min_base_quality_score 25 --min_mapping_quality_score 25 -rf DuplicateRead -rf BadMate -rf BadCigar -R JIC_reference/alygenomes.fasta -o h2al1_21_HC1.g.vcf.gz -ploidy 2 -stand_call_conf 25 -ERC GVCF --pcr_indel_model NONE -nct 8 --max_num_PL_values 350 INFO 04:37:55,002 HelpFormatter - Executing as vlkofly@zigur17.cerit-sc.cz on Linux 3.16.0-4-amd64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27. INFO 04:37:55,003 HelpFormatter - Date/Time: 2017/11/07 04:37:54 INFO 04:37:55,003 HelpFormatter - -------------------------------------------------------------------------------- INFO 04:37:55,003 HelpFormatter - -------------------------------------------------------------------------------- WARN 04:37:55,009 GATKVCFUtils - Creating Tabix index for h2al1_21_HC1.g.vcf.gz, ignoring user-specified index type and parameter INFO 04:37:55,237 GenomeAnalysisEngine - Strictness is SILENT INFO 04:37:56,044 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 04:37:56,051 SAMDataSource$SAMReaders - Initializing SAMRecords in serial WARNING: BAM index file /scratch/vlkofly/job_386162.wagap-pro.cerit-sc.cz/h2al1_21_final.bai is older than BAM /scratch/vlkofly/job_386162.wagap-pro.cerit-sc.cz/h2al1_21_final.bam INFO 04:37:56,221 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.17 INFO 04:37:56,244 HCMappingQualityFilter - Filtering out reads with MAPQ < 25 INFO 04:37:56,289 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 8 processors available on this machine INFO 04:37:57,903 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 04:37:58,090 GenomeAnalysisEngine - Done preparing for traversal INFO 04:37:58,090 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 04:37:58,091 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 04:37:58,091 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime INFO 04:37:58,091 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output INFO 04:37:58,092 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output WARN 04:37:58,411 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples. INFO 04:37:58,510 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units INFO 04:37:58,511 PairHMM - Performance profiling for PairHMM is disabled because the program is being run with multiple threads (-nct>1) option
And the info lines showing no duplicates removed:
INFO 16:53:05,155 ProgressMeter - Total runtime 130507.06 secs, 2175.12 min, 36.25 hours INFO 16:53:05,155 MicroScheduler - 46705813 reads were filtered out during the traversal out of approximately 149962396 total reads (31.15%) INFO 16:53:05,155 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter INFO 16:53:05,156 MicroScheduler - -> 13334530 reads (8.89% of total) failing BadMateFilter INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 16:53:05,156 MicroScheduler - -> 31823278 reads (21.22% of total) failing HCMappingQualityFilter INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 16:53:05,157 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 16:53:05,157 MicroScheduler - -> 1548005 reads (1.03% of total) failing NotPrimaryAlignmentFilter INFO 16:53:05,157 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter