The HC calls in question were made in a complete GATK 3.6-0 / JDK 1.8 workflow as follows:
java -Xmx64G -jar $GATK_JAR -T HaplotypeCaller -ERC GVCF -R $REFGENOME -I $INPUT_FILE -o $HAPLOTYPECALLER_OUTPUT_FILE -G Standard -G AS_Standard -A HomopolymerRun
The output is largely unremarkable, with the exception of occasional alt-allele count warnings:
INFO 11:24:33,739 HelpFormatter - ----------------------------------------------------------------------------------
INFO 11:24:33,742 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
INFO 11:24:33,743 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 11:24:33,743 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
INFO 11:24:33,743 HelpFormatter - [Thu Sep 15 11:24:33 CDT 2016] Executing on Linux 2.6.32-431.23.3.el6.x86_64 amd64
INFO 11:24:33,743 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 JdkDeflater
INFO 11:24:33,749 HelpFormatter - Program Args: [skipped]
INFO 11:24:33,762 HelpFormatter - Executing as [redacted] on Linux 2.6.32-431.23.3.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14.
INFO 11:24:33,763 HelpFormatter - Date/Time: 2016/09/15 11:24:33
INFO 11:24:33,763 HelpFormatter - ----------------------------------------------------------------------------------
INFO 11:24:33,763 HelpFormatter - ----------------------------------------------------------------------------------
INFO 11:24:33,784 GenomeAnalysisEngine - Strictness is SILENT
INFO 11:24:33,984 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 11:24:33,994 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 11:24:34,097 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.10
INFO 11:24:34,169 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 11:24:34,327 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 11:24:34,690 GenomeAnalysisEngine - Done preparing for traversal
INFO 11:24:34,691 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 11:24:34,691 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 11:24:34,692 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 11:24:34,693 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
INFO 11:24:34,693 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output
WARN 11:24:34,754 AS_InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
WARN 11:24:34,755 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
INFO 11:24:34,960 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
Using un-vectorized C++ implementation of PairHMM
INFO 11:24:38,259 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 11:24:38,260 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
WARN 11:24:38,361 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper
[...]
INFO 18:09:21,878 VectorLoglessPairHMM - Time spent in setup for JNI call : 5.591074305
INFO 18:09:21,878 PairHMM - Total compute time in PairHMM computeLikelihoods() : 4182.0076576070005
INFO 18:09:21,879 HaplotypeCaller - Ran local assembly on 10816364 active regions
INFO 18:09:21,989 ProgressMeter - done 3.099750718E9 6.7 h 7.0 s 100.0% 6.7 h 0.0 s
INFO 18:09:21,990 ProgressMeter - Total runtime 24287.30 secs, 404.79 min, 6.75 hours
INFO 18:09:21,992 MicroScheduler - 15060649 reads were filtered out during the traversal out of approximately 71598790 total reads (21.03%)
INFO 18:09:21,993 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 18:09:21,994 MicroScheduler - -> 10614114 reads (14.82% of total) failing DuplicateReadFilter
INFO 18:09:21,995 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 18:09:21,996 MicroScheduler - -> 4446535 reads (6.21% of total) failing HCMappingQualityFilter
INFO 18:09:21,997 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 18:09:21,998 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 18:09:21,999 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 18:09:22,000 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
I attempted to use GenotypeGVCFs to make calls from these GVCFs:
java -Xmx64G -jar $GATK_JAR -T GenotypeGVCFs -A HomopolymerRun -R $REFGENOME -stand_call_conf 30 -stand_emit_conf 10 -V [skipped] -o [skipped]
While GenotypeGVCFs does complete, it emits a large number of warnings (the stderr log is larger than 1 GB) of the following type:
WARN 11:19:19,340 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 1058.00|1542.00|0.00 doesn't parse and will not be annotated in the final VC.
WARN 11:19:19,341 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 8,1,20,1|4,1,9,1,20,1| doesn't parse and will not be annotated in the final VC.
WARN 11:19:19,341 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 23,2|22,1,23,2| doesn't parse and will not be annotated in the final VC.
WARN 11:19:19,341 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 1,1|1,2|0,0 doesn't parse and will not be annotated in the final VC.
WARN 11:19:19,342 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 20,2|33,1,36,2| doesn't parse and will not be annotated in the final VC.
I have traced these five lines to a variant called in a single sample:
chr1 13273 . G C,<NON_REF> 38.77 . AS_RAW_BaseQRankSum=20,2|33,1,36,2|;AS_RAW_MQ=1058.00|1542.00|0.00;AS_RAW_MQRankSum=23,2|22,1,23,2|;AS_RAW_ReadPosRankSum=8,1,20,1|4,1,9,1,20,1|;AS_SB_TABLE=1,1|1,2|0,0;BaseQRankSum=1.645;DP=5;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.524;RAW_MQ=2600.00;ReadPosRankSum=-0.253 GT:AD:GQ:PL:SB 0/1:2,3,0:34:67,0,34,73,43,117:1,1,1,2
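To make the link between the record and the warnings explicit, here is a minimal Python sketch that lists the INFO keys whose values contain a pipe. The function name `pipe_delimited_info_keys` is my own, and the code assumes the INFO column is the eighth whitespace-separated column and contains no spaces; it does not use or reproduce any GATK code.

```python
# The record quoted above, copied verbatim (whitespace-separated here).
RECORD = (
    "chr1 13273 . G C,<NON_REF> 38.77 . "
    "AS_RAW_BaseQRankSum=20,2|33,1,36,2|;AS_RAW_MQ=1058.00|1542.00|0.00;"
    "AS_RAW_MQRankSum=23,2|22,1,23,2|;AS_RAW_ReadPosRankSum=8,1,20,1|4,1,9,1,20,1|;"
    "AS_SB_TABLE=1,1|1,2|0,0;BaseQRankSum=1.645;DP=5;ExcessHet=3.0103;MLEAC=1,0;"
    "MLEAF=0.500,0.00;MQRankSum=-0.524;RAW_MQ=2600.00;ReadPosRankSum=-0.253 "
    "GT:AD:GQ:PL:SB 0/1:2,3,0:34:67,0,34,73,43,117:1,1,1,2"
)


def pipe_delimited_info_keys(vcf_line):
    """Return the INFO keys whose value contains a '|' separator.

    Assumption: columns are separated by whitespace (tabs in a real VCF)
    and INFO is the eighth column.
    """
    columns = vcf_line.split()
    info = columns[7]
    keys = []
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Flag-type INFO entries have no '=' and are skipped.
        if sep and "|" in value:
            keys.append(key)
    return keys


print(pipe_delimited_info_keys(RECORD))
```

For this record, the flagged keys are exactly the AS_* annotations whose values appear in the warnings above.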
However, so many WARNs are emitted that I have been able to trace them to calls from every sample, and to every INFO field that contains a pipe separator.
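For a log of this size, something like the following sketch can confirm how many of the warnings involve a pipe-separated value. It is only an illustration: the function names are hypothetical, and the regular expression is written against the warning text quoted above, not against any documented GATK log format.

```python
import re

# Captures the value reported between "but " and " doesn't parse" in the
# ReferenceConfidenceVariantContextMerger warnings quoted above.
UNPARSED = re.compile(
    r"ReferenceConfidenceVariantContextMerger - WARNING: .* but (\S+) doesn't parse"
)


def unparsed_value(line):
    """Return the annotation value that failed to parse, or None."""
    match = UNPARSED.search(line)
    return match.group(1) if match else None


def summarize_warnings(lines):
    """Return (total matching warnings, warnings whose value contains '|')."""
    total = 0
    with_pipe = 0
    for line in lines:
        value = unparsed_value(line)
        if value is None:
            continue  # not one of the merger warnings
        total += 1
        if "|" in value:
            with_pipe += 1
    return total, with_pipe
```

Because `summarize_warnings` accepts any iterable of lines, an open file handle can be passed to it directly, so the 1 GB stderr log is read line by line without being loaded into memory.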
I noticed that a previous thread described a similar warning message, but it doesn't seem to fit my current issue.
ValidateVariants turns out to be more of a pain to run than I thought; Java consumes so many I/O cores that even requesting 8 cores on my cluster still exceeds the PROC hard limit... I'll try to generate some results, but I kind of doubt that's the issue here.