Hi,
I am running HaplotypeCaller on whole-genome re-sequenced (~10X coverage) African buffalo genomes, using a high-coverage (~90X) African buffalo genome as the reference. The genome is about 2.8Gb, and the reference currently consists of 442 402 scaffolds and contigs.
HaplotypeCaller works fine and produces the expected output, but it takes >36 hours per genome. A lot of that time is spent before processing starts: on the "Strictness is SILENT" step in particular, and also on the "MicroScheduler" and "Preparing for traversal over 1 BAM files" steps (9 and 7 hrs, respectively).
I know this is not a quick analysis and some steps will take a long time, but in all the log files I've seen on the forum and from a colleague (working on smaller fungal genomes), the "Strictness is SILENT" step takes <1 min.
Why is HaplotypeCaller spending so much time on this step? Could it be because of the many scaffolds and contigs in the reference genome? Is there something I can do to speed up this step (and/or the other two steps with long processing times)?
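To put numbers on how fragmented the reference is, a rough sketch like the one below could summarise the FASTA index. It assumes a samtools-style `.fai` file (tab-separated, sequence name in column 1 and length in column 2) sits next to the reference; the function name `summarize_fai` and the 1000 bp cutoff are only illustrative choices, not anything from GATK.

```python
def summarize_fai(fai_lines, short_cutoff=1000):
    """Summarise sequence counts and lengths from .fai index lines.

    Assumes each line is tab-separated with the sequence length in column 2.
    short_cutoff is an arbitrary, illustrative threshold in bp.
    """
    lengths = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lengths.append(int(fields[1]))

    total = sum(lengths)

    # N50: length of the sequence at which the running sum of lengths,
    # taken from longest to shortest, first reaches half the total.
    n50 = 0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "sequences": len(lengths),
        "total_bp": total,
        "n50": n50,
        "shorter_than_cutoff": sum(1 for n in lengths if n < short_cutoff),
    }


if __name__ == "__main__":
    import sys

    # Usage (path is an assumption): python summarize_fai.py buffalo.final.fa.fai
    with open(sys.argv[1]) as handle:
        print(summarize_fai(handle))
```

This only describes the reference; it does not by itself show that the number of scaffolds is what slows the initialization down.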
I am running GATK v3.6-0-g89b7209 and Java 1.8.0_73-b02. I've allocated 10GB of memory to Java (-Xmx10g), but have 125GB available. Would increasing the allocated RAM (to, say, about 40GB) help speed up some of these steps?
The log file:
INFO 21:29:46,295 HelpFormatter - ----------------------------------------------------------------------------------
INFO 21:29:46,298 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
INFO 21:29:46,298 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 21:29:46,298 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
INFO 21:29:46,298 HelpFormatter - [Mon Sep 25 21:29:46 SAST 2017] Executing on Linux 3.10.0-514.6.1.el7.x86_64 amd64
INFO 21:29:46,298 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 JdkDeflater
INFO 21:29:46,301 HelpFormatter - Program Args: -T HaplotypeCaller -R /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/refs/buffalo.final.fa -I /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/M_47_14_aln-PE_sorted_dups_marked.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -hets 0.03 -mbq 20 -stand_emit_conf 20 -stand_call_conf 30 -out_mode EMIT_ALL_CONFIDENT_SITES -nct 24 -ploidy 2 -o M_47_14_aln-PE_sorted_dups_marked_output.raw.snps.indels.g.vcf
INFO 21:29:46,307 HelpFormatter - Executing as djager@cnode0033 on Linux 3.10.0-514.6.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02.
INFO 21:29:46,307 HelpFormatter - Date/Time: 2017/09/25 21:29:46
INFO 21:29:46,308 HelpFormatter - ----------------------------------------------------------------------------------
INFO 21:29:46,308 HelpFormatter - ----------------------------------------------------------------------------------
WARN 21:29:46,314 GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values for --variant_index_type and --variant_index_parameter
INFO 21:29:46,332 GenomeAnalysisEngine - Strictness is SILENT
INFO 05:46:59,587 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 05:46:59,737 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 05:47:07,772 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 8.03
INFO 05:47:11,853 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 05:47:13,107 MicroScheduler - Running the GATK in parallel mode with 24 total threads, 24 CPU thread(s) for each of 1 data thread(s), of 24 processors available on this machine
INFO 14:05:27,691 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 21:57:03,635 GenomeAnalysisEngine - Done preparing for traversal
INFO 21:57:03,635 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
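
The per-step times can be read off the log by taking the gap between each timestamp and the next. A minimal sketch of that calculation follows, assuming every log line has the form `LEVEL HH:MM:SS,mmm message` as above; `step_durations` is a hypothetical helper, not part of GATK. Because the log carries no dates, a timestamp earlier than the previous one is treated as a rollover past midnight, so a single step longer than 24 hours would be undercounted.

```python
from datetime import datetime, timedelta


def step_durations(log_lines):
    """Return (message, elapsed) pairs: time from each log line to the next.

    Assumes lines look like 'INFO 21:29:46,332 Component - message'.
    Lines without a parsable timestamp in the second field are skipped.
    """
    stamps = []  # (absolute time, message)
    offset = timedelta(0)  # whole days added after midnight rollovers
    for line in log_lines:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        try:
            t = datetime.strptime(parts[1], "%H:%M:%S,%f") + offset
        except ValueError:
            continue
        if stamps and t < stamps[-1][0]:
            # Clock went backwards, so assume the run crossed midnight.
            offset += timedelta(days=1)
            t += timedelta(days=1)
        stamps.append((t, parts[2].strip()))
    return [(msg, nxt - t) for (t, msg), (nxt, _) in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    import sys

    # Usage (file name is an assumption): python step_durations.py gatk.log
    with open(sys.argv[1]) as handle:
        for message, elapsed in step_durations(handle):
            if elapsed >= timedelta(minutes=1):
                print(elapsed, message, sep="\t")
```

Run on the log above, this reports only the steps that took a minute or longer, which makes the slow initialization steps easy to pick out.
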
Thanks and kind regards
Deon