Main question (for the community):
How to solve / avoid getting the error Must initialize the cache of allele anyploid indices for ploidy 1
with the VariantAnnotator (3.7-0) and freebayes1.1.0 .
The best solution might be Do not use third party tools!
as the mod will say. Thus I might not need a perfect solution and so experiences merging variants from different sources are welcome. My workflow producing the results: (BWA->AddOrReplaceReadgroups-> MergeSamFiles->MarkDuplicates-> IndelrealignmentKnownIndels->(Freebayes + HC) -> CombineVariants -> snpEff+ VariantAnnotator)
. Optionally with a VariantsToAllelicPrimitives
after freebayes
.
The usual stuff
stack trace,an afflicting record and my solution
java -Xmx8g -Djava.io.tmpdir=/scratch/ -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=1 -jar /data/GATK/3.7-0-foss-2016a-Java-1.8.0_121/GenomeAnalysisTK.jar -T VariantAnnotator -R /data/ftp.broadinstitute.org/bundle/2.8/b37//human_g1k_v37_decoy.fasta --dbsnp /data/ftp.broadinstitute.org/bundle/2.8/b37//dbsnp_138.b37.vcf -I 1.bam -I 2.bam -I 3.bam -I 4.bam -I 5.bam -I 6.bam --excludeAnnotation MVLikelihoodRatio --excludeAnnotation TechnologyComposition --excludeAnnotation DepthPerSampleHC --excludeAnnotation PercentNBaseSolid --excludeAnnotation PossibleDeNovo --excludeAnnotation AS_RMSMappingQuality --excludeAnnotation ClusteredReadPosition --filter_bases_not_stored --useAllAnnotations --snpEffFile /scratch/snpEffGatk.intermediate.vcf --resource:cosmic,VCF /data/sftp-cancer.sanger.ac.uk/files/grch37/cosmic//v72/VCF/CosmicCodingMuts.bundle_2.8_b37.vcf.gz -E cosmic.ID --resource:1000gPhase1Snps,vcf /data//ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz -E 1000gPhase1Snps.AF -E 1000gPhase1Snps.AFR_AF -E 1000gPhase1Snps.AMR_AF -E 1000gPhase1Snps.ASN_AF -E 1000gPhase1Snps.EUR_AF --resource:1000gPhase1Indels,vcf /data/ftp.broadinstitute.org/bundle/2.8/b37//1000G_phase1.indels.b37.vcf -E 1000gPhase1Indels.AF -E 1000gPhase1Indels.AFR_AF -E 1000gPhase1Indels.AMR_AF -E 1000gPhase1Indels.ASN_AF -E 1000gPhase1Indels.EUR_AF --resource:dbSnp,vcf /data/ftp.broadinstitute.org/bundle/2.8/b37//dbsnp_138.b37.vcf -E dbSnp.CAF -E dbSnp.COMMON -E dbSnp.dbSNPBuildID --resource:exac,vcf /data//ExAC/release0.3/ExAC.r0.3.sites.vep.vcf.gz -E exac.AC -E exac.AN -E exac.AC_Adj -E exac.AC_Hom -E exac.AC_Het -E exac.AC_Hemi -E exac.AN_AFR -E exac.AC_AFR -E exac.AC_AMR -E exac.AN_AMR -E exac.AC_EAS -E exac.AN_EAS -E exac.AC_FIN -E exac.AN_FIN -E exac.AC_OTH -E exac.AN_OTH -E exac.AN_SAS -E exac.AC_SAS -V:input,VCF /scratch/3_178927848_178927848.vcf --out /scratch/3_178927848_178927848.vcf -L /scratch/test.vcf &>/dev/stdout | tail -n 150
INFO 10:10:06,670 HelpFormatter - ----------------------------------------------------------------------------------
INFO 10:10:06,672 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 10:10:06,672 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 10:10:06,672 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 10:10:06,672 HelpFormatter - [Thu Jun 08 10:10:06 CEST 2017] Executing on Linux 2.6.32-642.6.2.el6.x86_64 amd64
INFO 10:10:06,672 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13
INFO 10:10:06,675 HelpFormatter - Program Args: -T VariantAnnotator -R /data/ftp.broadinstitute.org/bundle/2.8/b37//human_g1k_v37_decoy.fasta --dbsnp /data//ftp.broadinstitute.org/bundle/2.8/b37//dbsnp_138.b37.vcf -I /1.bam -I 2.bam -I 3.bam -I 4.bam -I 5.bam -I 6.bam --excludeAnnotation MVLikelihoodRatio --excludeAnnotation TechnologyComposition --excludeAnnotation DepthPerSampleHC --excludeAnnotation PercentNBaseSolid --excludeAnnotation PossibleDeNovo --excludeAnnotation AS_RMSMappingQuality --excludeAnnotation ClusteredReadPosition --filter_bases_not_stored --useAllAnnotations --snpEffFile /scratch/npEffGatk.intermediate.vcf --resource:cosmic,VCF /data/sftp-cancer.sanger.ac.uk/files/grch37/cosmic//v72/VCF/CosmicCodingMuts.bundle_2.8_b37.vcf.gz -E cosmic.ID --resource:1000gPhase1Snps,vcf /data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz -E 1000gPhase1Snps.AF -E 1000gPhase1Snps.AFR_AF -E 1000gPhase1Snps.AMR_AF -E 1000gPhase1Snps.ASN_AF -E 1000gPhase1Snps.EUR_AF --resource:1000gPhase1Indels,vcf /data/ftp.broadinstitute.org/bundle/2.8/b37//1000G_phase1.indels.b37.vcf -E 1000gPhase1Indels.AF -E 1000gPhase1Indels.AFR_AF -E 1000gPhase1Indels.AMR_AF -E 1000gPhase1Indels.ASN_AF -E 1000gPhase1Indels.EUR_AF --resource:dbSnp,vcf /data/ftp.broadinstitute.org/bundle/2.8/b37/dbsnp_138.b37.vcf -E dbSnp.CAF -E dbSnp.COMMON -E dbSnp.dbSNPBuildID --resource:exac,vcf /data/ExAC/release0.3/ExAC.r0.3.sites.vep.vcf.gz -E exac.AC -E exac.AN -E exac.AC_Adj -E exac.AC_Hom -E exac.AC_Het -E exac.AC_Hemi -E exac.AN_AFR -E exac.AC_AFR -E exac.AC_AMR -E exac.AN_AMR -E exac.AC_EAS -E exac.AN_EAS -E exac.AC_FIN -E exac.AN_FIN -E exac.AC_OTH -E exac.AN_OTH -E exac.AN_SAS -E exac.AC_SAS -V:input,VCF /scratch/3_178927848_178927848.vcf --out /scratch/3_178927848_178927848.vcf -L /scratch/3_178927848_178927848.vcf
INFO 10:10:06,680 HelpFormatter - Executing as umcg-mterpstra@pg-interactive on Linux 2.6.32-642.6.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13.
INFO 10:10:06,680 HelpFormatter - Date/Time: 2017/06/08 10:10:06
INFO 10:10:06,680 HelpFormatter - ----------------------------------------------------------------------------------
INFO 10:10:06,680 HelpFormatter - ----------------------------------------------------------------------------------
INFO 10:10:06,711 GenomeAnalysisEngine - Strictness is SILENT
INFO 10:10:06,790 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 10:10:06,796 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 10:10:07,894 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 1.10
INFO 10:10:10,340 IntervalUtils - Processing 6042 bp from intervals
WARN 10:10:10,345 IndexDictionaryUtils - Track cosmic doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 10:10:10,345 IndexDictionaryUtils - Track 1000gPhase1Snps doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 10:10:10,346 IndexDictionaryUtils - Track exac doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 10:10:10,483 GenomeAnalysisEngine - Preparing for traversal over 6 BAM files
INFO 10:10:10,613 GenomeAnalysisEngine - Done preparing for traversal
INFO 10:10:10,613 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 10:10:10,614 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 10:10:10,614 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
WARN 10:10:11,001 AS_RankSumTest - Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output
WARN 10:10:11,002 AS_StrandBiasTest - Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output
WARN 10:10:11,002 AS_InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
WARN 10:10:11,003 AS_RankSumTest - Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output
WARN 10:10:11,003 AS_RankSumTest - Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output
WARN 10:10:11,003 AS_RankSumTest - Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output
WARN 10:10:11,003 AS_RankSumTest - Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output
WARN 10:10:11,003 AS_StrandBiasTest - Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output
WARN 10:10:11,004 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN 10:10:11,004 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
WARN 10:10:11,004 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN 10:10:28,397 BaseQualitySumPerAlleleBySample - Annotation will not be calculated, can only be called from MuTect2, not org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator
WARN 10:10:28,397 OxoGReadCounts - Annotation will not be calculated, can only be called from MuTect2, not org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator
WARN 10:10:28,398 AnnotationUtils - SAC annotation will not be calculated, must be called from HaplotypeCaller or MuTect2, not VariantAnnotator
WARN 10:10:28,398 AnnotationUtils - SB annotation will not be calculated, must be called from HaplotypeCaller or MuTect2, not VariantAnnotator
WARN 10:10:28,429 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator
WARN 10:10:28,429 HardyWeinberg - Too few genotypes
WARN 10:10:28,435 SpanningDeletions - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator
WARN 10:10:28,437 TransmissionDisequilibriumTest - Transmission disequilibrium test annotation requires a valid ped file be passed in.
##### ERROR --
##### ERROR stack trace
java.lang.IllegalStateException: Must initialize the cache of allele anyploid indices for ploidy 1
at htsjdk.variant.variantcontext.GenotypeLikelihoods.getAlleles(GenotypeLikelihoods.java:532)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.getLikelihoodIndexes(GATKVariantContextUtils.java:681)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.determineLikelihoodIndexesToUse(GATKVariantContextUtils.java:639)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.subsetAlleles(GATKVariantContextUtils.java:610)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.splitVariantContextToBiallelics(GATKVariantContextUtils.java:1071)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.splitVariantContextToBiallelics(GATKVariantContextUtils.java:1019)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.splitVariantContextToBiallelics(GATKVariantContextUtils.java:1000)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.getMinRepresentationBiallelics(VariantAnnotatorEngine.java:535)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateExpressions(VariantAnnotatorEngine.java:452)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:226)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:212)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.map(VariantAnnotator.java:355)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.map(VariantAnnotator.java:112)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Must initialize the cache of allele anyploid indices for ploidy 1
##### ERROR ------------------------------------------------------------------------------------------
Afflicting record (multiple present in VCF)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1 2 3 4 5 6
3 178927848 rs67871207 AT A 61.21 . AC=4;AF=1.00;AN=4;DB;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=20.40;SOR=3.258;set=GATK GT:AD:DP:GQ:PL ./../. 1/1:0,1:1:3:28,3,0 ./. ./. 1/1:0,3:3:9:66,9,0
3 178927848 . ATTTTTTTTTTTTA ATTTTTTTTTTTA,ATTTTTTTTTTA,ATTTTTTTTCTTTA 117.54 . AB=0.666667,0.333333,0;ABP=3.73412,3.73412,0;AC=3,1,2;AF=0.500,0.167,0.333;AN=6;AO=3,1,1;CIGAR=1M1D12M,1M2D11M,9M1X4M;DP=5;DPB=4.64286;DPRA=2,3,1;EPP=3.73412,5.18177,5.18177;EPPR=0;GTI=1;LEN=1,2,1;MEANALT=1.5,2,1;MQM=60,60,60;MQMR=0;NS=3;NUMALT=3;ODDS=0.405465;PAIRED=0,0,0;PAIREDR=0;PAO=0,0,0;PQA=0,0,0;PQR=0;PRO=0;QA=86,36,36;QR=0;RO=0;RPL=1,0,0;RPP=3.73412,5.18177,5.18177;RPPR=0;RPR=2,1,1;RUN=1,1,1;SAF=3,1,1;SAP=9.52472,5.18177,5.18177;SAR=0,0,0;SRF=0;SRP=0;SRR=0;TYPE=del,del,snp;set=freebayes;technology.illumina=1,1,1 GT:AD:AO:DP:PL:QA:QR:RO . . 1/1:0,1,0,0:1,0,0:1:36,3,0,36,3,36,36,3,36,36:36,0,0:0:0 . 3/3:0,0,0,1:0,0,1:1:36,36,36,36,36,36,3,3,3,0:0,0,36:0:0 1/2:0,2,1,0:2,1,0:3:71,33,27,41,0,38,71,33,41,71:50,36,0:0:0
My solution
#keep only simple variants not multiallelic
perl -i.bak -wpe 'if(not((m/^#/) || (m/TYPE=complex;/||m/TYPE=snp;/||m/TYPE=del;/||m/TYPE=ins;/||m/TYPE=mnp;/))){$_="";}' probematic.vcf
#CombineVariants with -genotypeMergeOptions PRIORITIZE -priority GATK,freebayes --filteredrecordsmergetype KEEP_UNCONDITIONAL
#select the GATK records if two records have the same CHROM/POS
#VariantAnnotator....