Channel: Recent Discussions — GATK-Forum

CombineGVCFs: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double


Hi

I'm trying to combine a bunch of gVCFs generated by bcbio-nextgen with GATK.
However, when running the command I get the following error:

INFO  09:32:20,508 GenomeAnalysisEngine - Preparing for traversal
INFO  09:32:20,519 GenomeAnalysisEngine - Done preparing for traversal
INFO  09:32:20,520 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  09:32:20,520 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  09:32:20,521 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
WARN  09:32:21,591 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN  09:32:21,592 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
##### ERROR --
##### ERROR stack trace
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
        at java.lang.Double.compareTo(Double.java:49)
        at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:320)
        at java.util.ComparableTimSort.sort(ComparableTimSort.java:188)
        at java.util.Arrays.sort(Arrays.java:1312)
        at java.util.Arrays.sort(Arrays.java:1506)
        at java.util.ArrayList.sort(ArrayList.java:1454)
        at java.util.Collections.sort(Collections.java:141)
        at org.broadinstitute.gatk.utils.MathUtils.median(MathUtils.java:1010)
        at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.combineAnnotationValues(ReferenceConfidenceVariantContextMerger.java:84)
        at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:206)
        at org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs.endPreviousStates(CombineGVCFs.java:366)
        at org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs.reduce(CombineGVCFs.java:254)
        at org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs.reduce(CombineGVCFs.java:116)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:279)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: java.lang.Integer cannot be cast to java.lang.Double
##### ERROR ------------------------------------------------------------------------------------------

I'm aware that gVCFs that went through bcftools have caused this same stack trace in the past, but I've already been able to do several merges successfully, with only some failing.
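
For what it's worth, the stack trace points at MathUtils.median sorting a mixed list of Integer and Double annotation values, which is the classic symptom of an INFO field declared Type=Float in the header but written as a bare integer (e.g. 3 instead of 3.00) in some of the bcftools-processed records. A quick diagnostic sketch (the file name is a placeholder):

zcat sample.g.vcf.gz | awk -F'\t' '
    # remember INFO keys declared as Type=Float in the header
    /^##INFO=<ID=/ { if ($0 ~ /Type=Float/) { split($0, a, "[=,]"); isfloat[a[3]] = 1 }; next }
    /^#/ { next }
    {
        n = split($8, kv, ";")
        for (i = 1; i <= n; i++) {
            split(kv[i], p, "=")
            # a bare integer in a Float-declared field is the suspect mismatch
            if (p[1] in isfloat && p[2] ~ /^-?[0-9]+$/) print p[1]
        }
    }' | sort | uniq -c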

Any idea on how I could fix this?

Thanks a lot
M

Sample VCF header (contigs omitted for brevity):

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.8-0-ge9d806836,Date="Wed Oct 04 07:09:47 CEST 2017",Epoch=1507093787231,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/home/projects/bcbio_annotation/exomes/HSQExomes/hg38/HSQ_RUN_055/samples_HSQ055-merged/work/align/D1308739/D1308739-sort.bam] showFullBamList=false read_buffer_size=null read_filter=[BadCigar, NotPrimaryAlignment] disable_read_filter=[] intervals=[/home/projects/bcbio_annotation/exomes/HSQExomes/hg38/HSQ_RUN_055/samples_HSQ055-merged/work/gatk-haplotype/chr1/D1308739-chr1_0_16125340-regions.bed] excludeIntervals=null interval_set_rule=INTERSECTION interval_merging=ALL interval_padding=0 reference_sequence=/home/galaxy/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 secondsBetweenProgressUpdates=10 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=LENIENT_VCF_PROCESSING use_jdk_deflater=false use_jdk_inflater=false disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=LINEAR variant_index_parameter=128000 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN dbsnp=(RodBinding name=dbsnp source=/home/galaxy/bcbio/genomes/Hsapiens/hg38/variation/dbsnp-150.vcf.gz) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[FisherStrand, MappingQualityRankSumTest, MappingQualityZero, QualByDepth, ReadPosRankSumTest, RMSMappingQuality, BaseQualityRankSumTest, GCContent, HaplotypeScore, HomopolymerRun, DepthPerAlleleBySample, Coverage, ClippingRankSumTest, DepthPerSampleHC, StrandBiasBySample] excludeAnnotation=[ChromosomeCounts, FisherStrand, StrandOddsRatio, QualByDepth] group=[StandardAnnotation, StandardHCAnnotation] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=GVCF bamOutput=null bamWriterType=CALLED_HAPLOTYPES emitDroppedReads=false disableOptimizations=false annotateNDA=false useNewAFCalculator=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 heterozygosity_stdev=0.01 standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=30.0 max_alternate_alleles=6 max_genotype_count=1024 max_num_PL_values=100 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=true gcpHMM=10 pair_hmm_implementation=VECTOR_LOGLESS_CACHING phredScaledGlobalReadMismappingRate=45 noFpga=false nativePairHmmThreads=1 useDoublePrecision=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false maxNumHaplotypesInPopulation=128 errorCorrectKmers=false minPruning=2 debugGraphTransformations=false allowCyclesInKmerGraphToGeneratePaths=false graphOutput=null kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 GVCFGQBands=[10, 20, 30, 40, 60, 80] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 includeUmappedReads=false useAllelesTrigger=false doNotRunPhysicalPhasing=false keepRG=null justDetermineActiveRegions=false dontGenotype=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false errorCorrectReads=false pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=10000 minReadsPerAlignmentStart=10 mergeVariantsViaLD=false activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxReadsInMemoryPerSample=30000 maxTotalReadsInMemory=10000000 maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GVCFBlock0-10=minGQ=0(inclusive),maxGQ=10(exclusive)
##GVCFBlock10-20=minGQ=10(inclusive),maxGQ=20(exclusive)
##GVCFBlock20-30=minGQ=20(inclusive),maxGQ=30(exclusive)
##GVCFBlock30-40=minGQ=30(inclusive),maxGQ=40(exclusive)
##GVCFBlock40-60=minGQ=40(inclusive),maxGQ=60(exclusive)
##GVCFBlock60-80=minGQ=60(inclusive),maxGQ=80(exclusive)
##GVCFBlock80-100=minGQ=80(inclusive),maxGQ=100(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=GC,Number=1,Type=Float,Description="GC content around the variant (see docs for window size details)">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=chr1,length=248956422>
...
##contig=<ID=HLA-DRB1*16:02:01,length=11005>
##reference=file:///home/galaxy/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
##bcftools_concatVersion=1.5+htslib-1.5
##bcftools_concatCommand=concat --allow-overlaps -O z --file-list /home/projects/bcbio_annotation/exomes/HSQExomes/hg38/HSQ_RUN_055/samples_HSQ055-merged/work/gatk-haplotype/D1308739-files.list -o /tmp/bcbio/tmpWDKfKz/D1308739.vcf.gz; Date=Wed Oct  4 22:37:37 2017
##bcftools_viewVersion=1.5+htslib-1.5
##bcftools_viewCommand=view -h /home/projects/bcbio_annotation/exomes/HSQExomes/hg38/HSQ_RUN_055/samples_HSQ055-merged/final/D1308739/D1308739-gatk-haplotype.vcf.gz; Date=Wed Oct 18 09:19:18 2017
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  D1308739

Read Groups without known barcodes


Dear all,

I am working with some barley accessions, but the barcodes for the samples are not known (not publicly available). However, because adding read groups is a prerequisite for BQSR and HaplotypeCaller, I wonder: can I use the AddOrReplaceReadGroups command and substitute the barcodes with Ns, i.e. RGPU=NNNNNNNN-NNNNNNNN? Thank you in advance.
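
For reference, the command being asked about would look something like the sketch below (read-group values are placeholders, not a recommendation). As far as I know, GATK mainly requires that the read-group fields be present and consistent; PU, when present, is used to group reads during recalibration, so a constant dummy barcode mostly matters if different lanes would otherwise be distinguished by it.

java -jar picard.jar AddOrReplaceReadGroups \
    I=accession1.bam \
    O=accession1.rg.bam \
    RGID=accession1.lane1 \
    RGLB=lib1 \
    RGPL=ILLUMINA \
    RGPU=NNNNNNNN-NNNNNNNN.1 \
    RGSM=accession1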

Picard - SortSam


Hello,

I'm using Picard's SortSam tool and was wondering if I can use the same script for 2 files. Meaning, can I do something like the script below? What would be the separator sign? Or would I have to run it independently for each file?

Thank you!


source /broad/software/scripts/useuse
reuse -q Java-1.8
"$@"

source /broad/software/scripts/useuse
reuse -q Picard-Tools
"$@"

java -Xmx4g -jar ~/picard-2.12.2/picard.jar SortSam \
INPUT=FILE1.sam, FILE2.bam \
OUTPUT=FILE1.bam, FILE2.bam \
SORT_ORDER=queryname
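
For reference, SortSam takes a single INPUT/OUTPUT pair per invocation, so as far as I know there is no separator sign that would make the command above handle two files at once; a shell loop is the usual pattern. A sketch reusing the jar path above (output names are assumptions):

for f in FILE1.sam FILE2.bam; do
    java -Xmx4g -jar ~/picard-2.12.2/picard.jar SortSam \
        INPUT="$f" \
        OUTPUT="${f%.*}.sorted.bam" \
        SORT_ORDER=queryname
done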

Are filtering and trimming necessary before mapping and SNP calling using GATK


Hi GATK team,

I am using GATK to call SNPs from whole-genome re-sequencing data. According to the FastQC report, base quality drops below 20 after 100 bp (120 bp reads) and Illumina Universal Adapter content reaches 5% after 60 bp. I set base quality 30 and mapping quality 30 (--min_base_quality_score 30 --min_mapping_quality_score 30) to call SNPs in GATK. Are these two settings enough to remove low-quality data? Do I also need to remove reads with adapter contamination and trim low-quality reads before mapping and SNP calling? Thanks very much for your help.
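
For reference, if trimming does turn out to be necessary, adapter plus quality trimming before alignment is the common approach. A hedged sketch with cutadapt (the adapter is the standard Illumina universal adapter sequence; the thresholds and file names are assumptions, not a GATK recommendation):

cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20 --minimum-length 50 \
    -o trimmed_R1.fq.gz -p trimmed_R2.fq.gz reads_R1.fq.gz reads_R2.fq.gz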

Here attached FastQC report

Best regards,
Baosheng

Web-based Oncotator server


There is a web-based version of Oncotator which you can use for annotation without running anything on your own machine.

However, please note that the web-based version is an older version, with fewer datasources and many limitations. We urge you to use the downloadable version instead, and at this time we do not provide user support for the web-based version. It is simply provided as-is.

Note also that on rare occasions the server malfunctions and needs to be rebooted. If you experience any server errors (e.g. an error message stating that the server is unavailable), please post a note in the thread below and we'll reboot it as soon as we can.

BAM with soft-clipped primer sequences as an input for RevertSam (Tutorial #6483)


Could I use RevertSam on the primer-clipped BAM and then run MergeBamAlignment using the primer-clipped BAM and the uBAM derived from it as input?
I'm analyzing paired-end TruSeq Custom Amplicon panel data, so the workflow includes soft-clipping primer sequences with BamClipper. How do I fit the soft-clipping step into the data cleaning pipeline offered in Tutorial #6483?
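
For reference, the round trip being asked about would look roughly like this (a sketch only; file names are placeholders, and whether BamClipper's soft clips survive the revert-and-merge round trip is exactly the open question, so treat this as a starting point rather than a validated recipe):

java -jar picard.jar RevertSam \
    I=primer_clipped.bam \
    O=reverted_unmapped.bam \
    SANITIZE=true

java -jar picard.jar MergeBamAlignment \
    ALIGNED=primer_clipped.bam \
    UNMAPPED=reverted_unmapped.bam \
    O=merged.bam \
    R=ref.fasta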

VQSR: low TiTv


Hi

I'm trying out VQSR on a batch of 16 human whole genomes (~25-30x). I was wondering if someone could review the profiles below. The false-positive rate seems much higher than in the GATK examples.

Has anyone else experienced similar results? Any possible solutions?

Here are the commands used with GATK-3.7.0:

#Build the SNP recalibration model
/share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
    -input "$seqId"_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /state/partition1/db/human/gatk/2.8/b37/hapmap_3.3.b37.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /state/partition1/db/human/gatk/2.8/b37/1000G_omni2.5.b37.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /state/partition1/db/human/gatk/2.8/b37/1000G_phase1.snps.high_confidence.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /state/partition1/db/human/gatk/2.8/b37/dbsnp_138.b37.vcf \
    -an DP \
    -an QD \
    -an FS \
    -an SOR \
    -an MQ \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile "$seqId"_SNP.recal \
    -tranchesFile "$seqId"_SNP.tranches \
    -rscriptFile "$seqId"_SNP_plots.R \
    -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
    -nt 12 \
    -ped "$seqId"_pedigree.ped \
    -dt NONE

#Apply the desired level of recalibration to the SNPs in the call set
/share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
    -input "$seqId"_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.0 \
    -recalFile "$seqId"_SNP.recal \
    -tranchesFile "$seqId"_SNP.tranches \
    -o "$seqId"_recalibrated_snps_raw_indels.vcf \
    -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
    -nt 12 \
    -ped "$seqId"_pedigree.ped \
    -dt NONE

#Build the Indel recalibration model
/share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
    -input "$seqId"_recalibrated_snps_raw_indels.vcf \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 /state/partition1/db/human/gatk/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /state/partition1/db/human/gatk/2.8/b37/dbsnp_138.b37.vcf \
    -an DP \
    -an QD \
    -an FS \
    -an SOR \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode INDEL \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    --maxGaussians 4 \
    -recalFile "$seqId"_INDEL.recal \
    -tranchesFile "$seqId"_INDEL.tranches \
    -rscriptFile "$seqId"_INDEL_plots.R \
    -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
    -nt 12 \
    -ped "$seqId"_pedigree.ped \
    -dt NONE

#Apply the desired level of recalibration to the Indels in the call set
/share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
    -input "$seqId"_recalibrated_snps_raw_indels.vcf \
    -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile "$seqId"_INDEL.recal \
    -tranchesFile "$seqId"_INDEL.tranches \
    -o "$seqId"_recalibrated_variants.vcf \
    -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
    -nt 12 \
    -ped "$seqId"_pedigree.ped \
    -dt NONE

GATK4 resource bundle


Hi,

I was wondering if you guys are planning to release a new resource bundle with full hg38 support (+patches)?
Perhaps to celebrate the release of GATK4?

Thanks
M


MergeBamAlignment giving error although the inputs are sorted


I am trying to use Picard MergeBamAlignment after sorting both the aligned and the unaligned reads by queryname using Picard. I merge them with SORT_ORDER=queryname.
Then I get the following error:

Exception merging bam alignment - attempting to sort aligned reads and try again: Aligned record iterator (GM12878_CTCF_sequence_B1_T1_solid:222_1008_1366) is behind the unmapped reads (GM12878_CTCF_sequence_B1_T1_solid:222_1008_1366/1)

....
Exception in thread "main" java.lang.IllegalStateException: Aligned record iterator (GM12878_CTCF_sequence_B1_T1_solid:222_1008_1366) is behind the unmapped reads (GM12878_CTCF_sequence_B1_T1_solid:222_1008_1366/1)

However, when I look at the input files, they are in the same order:

samtools view unmapped/GM12878_CTCF.clean.sorted.bam | grep -n -B 5 -A 5 -m 1 222_1008_1366 | cut -f 1-5
211-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1148/1 4 * 0 0
212-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1184/1 4 * 0 0
213-GM12878_CTCF_sequence_B1_T1_solid:222_1008_13/1 4 * 0 0
214-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1301/1 4 * 0 0
215-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1344/1 4 * 0 0
216:GM12878_CTCF_sequence_B1_T1_solid:222_1008_1366/1 4 * 0 0
217-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1482/1 4 * 0 0
218-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1492/1 4 * 0 0
219-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1498/1 4 * 0 0
220-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1556/1 4 * 0 0
221-GM12878_CTCF_sequence_B1_T1_solid:222_1008_18/1 4 * 0 0

samtools view alignments/GM12878_CTCF.clean.sorted.bam | grep -n -B 5 -A 5 -m 1 222_1008_1366 | cut -f 1-5
211-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1148 4 * 0 0
212-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1184 4 * 0 0
213-GM12878_CTCF_sequence_B1_T1_solid:222_1008_13 4 * 0 0
214-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1301 4 * 0 0
215-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1344 4 * 0 0
216:GM12878_CTCF_sequence_B1_T1_solid:222_1008_1366 0 chr7 105529114 0
217-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1482 4 * 0 0
218-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1492 4 * 0 0
219-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1498 4 * 0 0
220-GM12878_CTCF_sequence_B1_T1_solid:222_1008_1556 4 * 0 0
221-GM12878_CTCF_sequence_B1_T1_solid:222_1008_18 4 * 0 0

I could not understand the reason for this error. Can you help?
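
For reference, one way to rule out tool-specific queryname collation differences (note the trailing /1 on the unmapped read names above, which can collate differently from the bare names) is to re-sort both inputs with the same Picard version immediately before merging -- a sketch using the file names above:

java -jar picard.jar SortSam I=unmapped/GM12878_CTCF.clean.sorted.bam \
    O=unmapped.qsorted.bam SORT_ORDER=queryname

java -jar picard.jar SortSam I=alignments/GM12878_CTCF.clean.sorted.bam \
    O=alignments.qsorted.bam SORT_ORDER=queryname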

Calling variants in RNAseq


Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as examples of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

[image]

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

[image]

Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline produces both false positive errors (wrong variant calls) and false negative errors (missed variants). While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article, as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.


The workflow

1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using the STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

4) The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

2. Add read groups, sort, mark duplicates, and create index

The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample

java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics

3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clips any sequences overhanging into the intronic regions.

[image]

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).
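
For completeness, the DNAseq-style realignment commands referenced here would look like this (a sketch; the known-indels resource file is a placeholder for the bundle file you normally use):

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I split.bam \
    -known Mills_and_1000G_gold_standard.indels.vcf -o realign.intervals

java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I split.bam \
    -known Mills_and_1000G_gold_standard.indels.vcf -targetIntervals realign.intervals -o realigned.bam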

5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.
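
Again for completeness, a sketch of the standard GATK 3 BQSR pair as it would slot in here (resource paths are placeholders; use split.bam as input if you skipped step 4):

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I realigned.bam \
    -knownSites dbsnp.vcf -knownSites Mills_and_1000G_gold_standard.indels.vcf -o recal.table

java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I realigned.bam \
    -BQSR recal.table -o recal.bam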

6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it performs much better in our hands than UnifiedGenotyper (our tests show that UG was able to call fewer than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we set the minimum phred-scaled confidence threshold for calling variants to 20, but you can lower this to increase sensitivity if needed.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf

7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter out clusters of at least 3 SNPs within a 35-base window by adding -window 35 -cluster 3 to your command. This filter recommendation is specific to RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf

Please note that we selected these hard filtering values while attempting to optimize both high sensitivity and specificity. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on their own project. If you care more about sensitivity and are willing to tolerate more false positive calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

[image]

An example of true variants that were filtered (false negatives). As explained in the text, there is a tradeoff that comes with applying filters:

[image]


Known issues

There are a few known issues; one is that the allelic ratio is problematic. At many heterozygous sites, even if we can see both of the DNA alleles in the RNAseq data, the ratio between the numbers of reads carrying the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may also be candidates for downstream analysis of allele-specific expression).

Although our new tool (SplitNCigarReads) cleans up many false positive calls that are caused by splicing inaccuracies in the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

[image]

[image]

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionality, to improve true positive rates while minimizing false positive rates, and to develop statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.


[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013


NOTE: Questions about this document that were posted before June 2014 have been moved to this archival thread: http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow

How to use HaplotypeCallerSpark from GATK 4 (beta 6) with Adam input files


Hi,
It seems that GatkReads can read SAM/BAM/CRAM files and ADAM files.
But when I try to use HaplotypeCallerSpark with ADAM Parquet files, it fails because of some dictionary validation.
Here's the command line I use (the same command line I use with SAM files; I just changed the input path):

./gatk-4.beta.6/gatk-launch HaplotypeCallerSpark \
    --sparkMaster spark://<my spark master> \
    --input "input.adam" \
    --output output.vcf \
    --reference /data/hg19/hg19.2bit \
    -- --sparkRunner SPARK --driver-memory 10G --executor-memory 10G

obsolete index parameters


Hello,

We have a bunch of gVCFs created with an older version, when the flags -variant_index_type LINEAR -variant_index_parameter 128000 --filter_mismatching_base_and_quals were still necessary for indexing. We are now generating a new batch with the latest release. Is there any need to re-do the old gVCFs without the obsolete flags, or will it have no impact on compatibility?

Filtering MuTect2 output for depth of coverage


Dear GATK team,

I am writing with a question about depth of coverage in the output of MuTect2 (GATK3 and GATK4). I would like to filter my somatic calls to keep only calls with sufficient depth of coverage in both the tumor and the normal. However, I am unsure of the optimal approach.

The first approach I took was to add together the AD annotations for both the ref and alt alleles. However, my understanding from reading the forums is that the AD includes filtered reads. Because of this, I was concerned that some sites might actually have filtered depths lower than what was shown, and I wouldn't be able to filter those sites adequately.

The next approach I took was to use the DepthOfCoverage tool to annotate the filtered coverage at each site in the input bam file, and then look up the depth of each variant. This works, but does not take into account the fact that MuTect2 does local realignments which may change the depth of coverage around a variant.

So, I next looked into using the DepthOfCoverage tool on the MuTect2 bamout file. However, the depth of the normal sample in particular seemed really low compared to the input bam. Is there downsampling going on?

Do you have any thoughts on which of these annotations would be best to use to capture the depth for determination of whether we can confidently make a variant call at a particular position?

Is filtering for depth not something that you are expecting users to do, because the information on whether we can make a confident call is captured in the TLOD score, or another annotation?
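
For what it's worth, if a plain genotype-depth cutoff turns out to be acceptable, the JEXL support described elsewhere on this page can express it directly. A sketch (the sample names "TUMOR" and "NORMAL" and the thresholds are assumptions about your VCF, not a recommendation):

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R ref.fasta \
        -V mutect2_calls.vcf \
        -select 'vc.getGenotype("TUMOR").getDP() > 14 && vc.getGenotype("NORMAL").getDP() > 8' \
        -o depth_filtered.vcf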

Thank you for your help!

Best,
Kate


HaplotypeCaller can't call a 10bp deletion variant



Hi, GATK team.
I use HaplotypeCaller to call variants, but it can't find a 10 bp deletion variant, as you can see in the graph.
I used
-L targetInterval
-bamWriterType ALL_POSSIBLE_HAPLOTYPES
-bamout haplotype.bam
to check whether the haplotype is correctly assembled, but haplotype.bam is empty; it seems the target interval is not an active region.

Then I used -forceActive. haplotype.bam is no longer empty, but the output VCF file still doesn't contain the 10 bp deletion, so I'm really confused now. What should I do to call this variant?
I use GATK 3.6.

The 10 bp deletion's position is chr17:29541466.
Here is my pipeline:
java -d64 -server -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Djava.io.tmpdir=$tmp_dir -jar $gatk \
-R $reference_file \
-L 17:29,541,000-29,542,000 \
-bamWriterType ALL_POSSIBLE_HAPLOTYPES \
-bamout test.bam \
-T HaplotypeCaller \
-I $in_dir/NA12878MOD_sort_markdup_vardict_sort_realign_recal.bam \
--dbsnp $dbsnp_del100 \
-forceActive \
-o $out_dir/test.raw.snps.indels.vcf
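
For reference, one way to test the active-region hypothesis without -forceActive is to dump the activity profile and active regions (both arguments exist in GATK 3's ActiveRegion walkers, as the VCF headers elsewhere on this page show) and inspect them in IGV around chr17:29541466 -- output file names here are placeholders:

java -jar $gatk \
    -T HaplotypeCaller \
    -R $reference_file \
    -I $in_dir/NA12878MOD_sort_markdup_vardict_sort_realign_recal.bam \
    -L 17:29,541,000-29,542,000 \
    --activityProfileOut activity.igv \
    --activeRegionOut active_regions.igv \
    -o test_activity.vcf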


Picard GenotypeConcordance


Hi GATK team!
I have a question.
I want to compare two VCF files to check their discordance.
Reading the GATK forum, I picked Picard GenotypeConcordance.
I can get the metrics files, but I can't create the output VCF file with the argument OUTPUT_VCF=true.
I received the message: Exception in thread "main" java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: CA*.
I don't understand the message.
Please tell me how to solve this problem, or suggest another way.

I use picard/2.9.2

Thanks

How to run the entire pipeline (using even Spark tools) from Java?


I am trying to write a Java pipeline which follows the GATK Best Practices, in particular using more than one input sample.
As a first step, I am trying to use FastqToSam (not mandatory for the Best Practices, but required when starting from FASTQ samples), BwaAndMarkDuplicatesPipelineSpark and BQSRPipelineSpark.

For example, with FastqToSam I am using this simple approach, in which I manage to "sparkify" the command over several samples and even obtain some speedup:

// fastq_r1_r2 is a List<String> of FASTQ file paths; sc is the JavaSparkContext
JavaRDD<String> rdd_fastq_r1_r2 = sc.parallelize(fastq_r1_r2);

// writes a wrapper script that applies the Picard command to each input line
createBashScript(gatkCommand);

// pipe each path through the script, spawning one external process per element
JavaRDD<String> bashExec = rdd_fastq_r1_r2.pipe("/path/script.sh");

where fastq_r1_r2 is a list of Strings representing the paths of the samples to use.
In short, I execute a bash command for each pair of paired-end read files (specifically, the bash command explained here) inside the pipe method provided by Spark:

java -Xmx8G -jar picard.jar FastqToSam [...]

But this approach does not work with the GATK Spark tools, like BwaAndMarkDuplicatesPipelineSpark and BQSRPipelineSpark.

So, is there any other way to execute these Spark tools from Java code? For example, 4.5 years ago in this post they suggested using org.broadinstitute.sting.gatk.CommandLineGATK, but this class is no longer available.
Moreover, is any kind of Java API available (and if so, any tutorial) for calling your tools directly (much like the Spark API) without resorting to bash commands?
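
For reference, these Spark pipelines are normally driven through gatk-launch rather than a public Java API. Mirroring the HaplotypeCallerSpark invocation shown elsewhere on this page, a launch might look like the sketch below (the tool-specific arguments are assumptions extrapolated from that example, not verified against the BwaAndMarkDuplicatesPipelineSpark documentation):

./gatk-4.beta.6/gatk-launch BwaAndMarkDuplicatesPipelineSpark \
    --sparkMaster spark://<my spark master> \
    --input input.bam \
    --output marked.bam \
    --reference /data/hg19/hg19.2bit \
    -- --sparkRunner SPARK --driver-memory 10G --executor-memory 10G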

Thanks for your time, and I hope I have explained my questions clearly,
Nicholas

Using GenomeStrip to genotype a known VCF


Hi,

I want to genotype known CNVs (from 1000G Phase3, GoNL, etc.) in our samples using GenomeStrip without performing any discovery step at first.

1) Do I have to run only the SVPreprocess steps followed by the SVGenotyper step? Do the CNVDiscovery and/or LCNVDiscovery pipelines produce metadata that is useful for duplication and CNV genotyping, or does SVPreprocess produce all the needed metadata?

2) As far as I understand, imprecise variants are genotyped using the SVTYPE info. The documentation explains SVTYPE=DEL and SVTYPE=CNV, but how does the software handle SVTYPE=DUP, SVTYPE=INS, SVTYPE=DEL_ALU and so on, which are present in the 1000G Phase 3 VCF file? Does SVGenotyper consider every SVTYPE except DEL as CNV, so that it will try to genotype different copy-number alleles? If true, does that mean that an SVTYPE=DUP can be genotyped as a deletion, for example, if GenomeStrip finds it's not a pure duplication (so that we can't force the SVTYPE)?

Best,

Using JEXL to apply hard filters or select variants based on annotation values


1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.


2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".
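
These expressions are typically handed to VariantFiltration via --filterExpression; for example (filter name and file names are placeholders):

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V variants.vcf \
        --filterExpression "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0" \
        --filterName lowQualFilter \
        -o filtered.vcf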


4. Filtering on sample/genotype-level properties

You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1). An example is shown below.
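
For example, to tag genotypes with low genotype quality (a sketch; the filter name is a placeholder):

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V variants.vcf \
        --genotypeFilterExpression "GQ < 5.0" \
        --genotypeFilterName lowGQ \
        -o filtered.vcf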


5. Important caveats

Sensitivity to case and type

You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:

  • Case
    Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type
    The types (i.e. string, integer, non-integer, floating point or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (specifically, a Java exception, e.g. a NumberFormatException for numerical type mismatches).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.


6. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The best way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

Using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by accessing the underlying VariantContext as follows:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of a JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, that are not filtered, and that are non-reference in sample 01-0263:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select '! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' \
        -o 01-0263.high_freq_novels.vcf -sn 01-0263

Using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V my.vcf \
        -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V my.vcf \
        -select 'vc.hasAttribute("DB")'

Using VariantContext to access annotations in multiallelic sites

The order of alleles in the VariantContext object is not guaranteed to be the same as in the VCF output, so accessing the AF by an index derived from a scrambled alleles array is dangerous. However! If we have the sample genotypes, there's a workaround:

java -jar GenomeAnalysisTK.jar -T SelectVariants  \
        -R reference.fasta  \
        -V multiallelics.vcf  \
        -select 'vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf

The odd 1.0 is there because otherwise we're dividing two integers, which will always yield 0. The vc.hasGenotypes() is extra error checking. This might be slow for large files, but we could use something like this if performance is a concern:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V multiallelics.vcf \
         -select 'vc.isBiallelic() ? AF > 0.1 : vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf

Where hopefully the ternary expression shortcuts the extra vc calls for all the biallelics.

Using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").getAD().0 > 10'

If you would like to select sites where the alternate allele frequency is greater than 50%, you can use the following expression:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").getAD().1 / vc.getGenotype("NA12878").getDP() > 0.50'

VQSR error


I used to run VQSR using the following command. For approximately 400 samples it worked very well. But for the first time I am getting an error while running VQSR after adding a few more samples to the old ones.

[root@localhost Process]# java -Xmx8g -XX:ParallelGCThreads=20 -jar /mnt/exome/Softwares/GenomeAnalysisTK.jar -T VariantRecalibrator -R /mnt/exome/ReferenceFiles/human_g1k_v37.fasta -input Combined.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /mnt/exome/Softwares/HG19/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /mnt/exome/Softwares/HG19/1000G_omni2.5.b37.vcf -resource:dbsnp,known=true,training=true,truth=false,prior=10.0 /mnt/exome/Softwares/HG19/dbsnp_hg19_138.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R
INFO 15:22:17,107 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:22:17,108 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
INFO 15:22:17,109 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 15:22:17,109 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
INFO 15:22:17,109 HelpFormatter - [Mon Oct 23 15:22:17 IST 2017] Executing on Linux 3.10.0-514.6.1.el7.x86_64 amd64
INFO 15:22:17,109 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_121-b13 JdkDeflater
INFO 15:22:17,112 HelpFormatter - Program Args: -T VariantRecalibrator -R /mnt/exome/ReferenceFiles/human_g1k_v37.fasta -input Combined.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /mnt/exome/Softwares/HG19/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /mnt/exome/Softwares/HG19/1000G_omni2.5.b37.vcf -resource:dbsnp,known=true,training=true,truth=false,prior=10.0 /mnt/exome/Softwares/HG19/dbsnp_hg19_138.vcf -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R
INFO 15:22:17,117 HelpFormatter - Executing as root@localhost.localdomain on Linux 3.10.0-514.6.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b13.
INFO 15:22:17,117 HelpFormatter - Date/Time: 2017/10/23 15:22:17
INFO 15:22:17,117 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:22:17,118 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:22:17,135 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:22:17,218 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 15:22:17,546 GenomeAnalysisEngine - Preparing for traversal
INFO 15:22:17,551 GenomeAnalysisEngine - Done preparing for traversal
INFO 15:22:17,551 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 15:22:17,552 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 15:22:17,552 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 15:22:17,557 TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0
INFO 15:22:17,558 TrainingSet - Found omni track: Known = false Training = true Truth = true Prior = Q12.0
INFO 15:22:17,558 TrainingSet - Found dbsnp track: Known = true Training = true Truth = false Prior = Q10.0
INFO 15:22:47,554 ProgressMeter - 2:55755921 6684477.0 30.0 s 4.0 s 9.8% 5.1 m 4.6 m
INFO 15:23:17,556 ProgressMeter - 3:113288945 1.3557799E7 60.0 s 4.0 s 19.5% 5.1 m 4.1 m
INFO 15:23:47,557 ProgressMeter - 5:16225713 2.0535053E7 90.0 s 4.0 s 28.9% 5.2 m 3.7 m
INFO 15:24:17,558 ProgressMeter - 6:136574731 2.7525682E7 120.0 s 4.0 s 38.7% 5.2 m 3.2 m
INFO 15:24:47,559 ProgressMeter - 8:95110754 3.4599575E7 2.5 m 4.0 s 48.0% 5.2 m 2.7 m
INFO 15:25:17,560 ProgressMeter - 10:116951732 4.1411494E7 3.0 m 4.0 s 57.9% 5.2 m 2.2 m
INFO 15:25:47,561 ProgressMeter - 12:126998698 4.8109015E7 3.5 m 4.0 s 67.0% 5.2 m 103.0 s
INFO 15:26:17,562 ProgressMeter - 16:3932287 5.4821274E7 4.0 m 4.0 s 77.8% 5.1 m 68.0 s
INFO 15:26:47,563 ProgressMeter - 19:37901784 6.1660969E7 4.5 m 4.0 s 87.0% 5.2 m 40.0 s
INFO 15:27:17,323 VariantDataManager - DP: mean = 23089.37 standard deviation = 15289.49
INFO 15:27:17,395 VariantDataManager - QD: mean = 11.97 standard deviation = 5.15
INFO 15:27:17,431 VariantDataManager - FS: mean = 2.00 standard deviation = 9.01
INFO 15:27:17,462 VariantDataManager - MQRankSum: mean = -0.21 standard deviation = 1.39
INFO 15:27:17,491 VariantDataManager - ReadPosRankSum: mean = 0.42 standard deviation = 1.03
INFO 15:27:17,564 ProgressMeter - GL000202.1:10465 6.8459572E7 5.0 m 4.0 s 99.8% 5.0 m 0.0 s
INFO 15:27:17,702 VariantDataManager - Annotations are now ordered by their information content: [DP, QD, FS, ReadPosRankSum, MQRankSum]
INFO 15:27:17,724 VariantDataManager - Training with 195822 variants after standard deviation thresholding.
INFO 15:27:17,727 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 15:27:27,669 VariantRecalibratorEngine - Finished iteration 0.
INFO 15:27:32,206 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 1.13417
INFO 15:27:36,627 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.59423
INFO 15:27:41,171 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.20486
INFO 15:27:46,168 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.03726
INFO 15:27:47,565 ProgressMeter - GL000202.1:10465 6.8459572E7 5.5 m 4.0 s 99.8% 5.5 m 0.0 s
INFO 15:27:51,022 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.03971
INFO 15:27:55,971 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.04863
INFO 15:28:00,943 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.03344
INFO 15:28:05,845 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.03454
INFO 15:28:10,855 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.04853
INFO 15:28:15,849 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.10959
INFO 15:28:17,566 ProgressMeter - GL000202.1:10465 6.8459572E7 6.0 m 5.0 s 99.8% 6.0 m 0.0 s
INFO 15:28:20,957 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.00585
INFO 15:28:26,167 VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.00347
INFO 15:28:31,206 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.00198
INFO 15:28:31,206 VariantRecalibratorEngine - Convergence after 65 iterations!
INFO 15:28:31,895 VariantRecalibratorEngine - Evaluating full set of 361653 variants...
INFO 15:28:31,917 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR --
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:489)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:185)
at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------
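
"Training with worst 0 scoring variants" means the negative model received no training data: with the default bad-LOD cutoff of -5.0 (visible in the log line above), no variants scored badly enough, so model generation fails with "No data found". A commonly suggested workaround is to relax that cutoff via -badLodCutoff; note also that the dbsnp resource is normally training=false (compare the VQSR commands earlier on this page). A hedged sketch of the adjusted command (the cutoff value is an assumption to be tuned):

java -Xmx8g -XX:ParallelGCThreads=20 -jar /mnt/exome/Softwares/GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R /mnt/exome/ReferenceFiles/human_g1k_v37.fasta -input Combined.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /mnt/exome/Softwares/HG19/hapmap_3.3.b37.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /mnt/exome/Softwares/HG19/1000G_omni2.5.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /mnt/exome/Softwares/HG19/dbsnp_hg19_138.vcf \
    -an DP -an QD -an FS -an MQRankSum -an ReadPosRankSum \
    -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -badLodCutoff -3.0 \
    -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R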