Channel: Recent Discussions — GATK-Forum

What causes BaseRecalibratorSpark to run for a long time and end up failing with memory errors?


Hi, GATK team,

I am testing BaseRecalibrator in the GATK 4.5 beta. When running in LOCAL mode, it finishes pretty fast. However, when I run BaseRecalibratorSpark in SPARK mode, it runs for a long time and eventually fails with memory errors like:

java.lang.OutOfMemoryError: GC overhead limit exceeded

When I look at the stdout of the executors, it contains many messages like this:

14:17:19.753 INFO KnownSitesCache - Number of variants read: 37000001

I tested HaplotypeCallerSpark on the same Spark cluster and it finishes pretty quickly too.
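Would raising the executor memory through the Spark passthrough arguments be the right approach? Below is a minimal sketch of what I have in mind (assuming the beta-era gatk-launch spellings --sparkRunner/--sparkMaster and the --conf passthrough; the paths, master URL, and 8g values are placeholders):

./gatk-launch BaseRecalibratorSpark \
    --input sample.bam \
    --reference ref.2bit \
    --knownSites dbsnp.vcf \
    --output recal.table \
    -- \
    --sparkRunner SPARK \
    --sparkMaster spark://master:7077 \
    --conf spark.executor.memory=8g \
    --conf spark.driver.memory=8g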


VQSR: low TiTv


Hi

I'm trying out VQSR on a batch of 16 human whole genomes (~25-30x). I was wondering if someone could review the profiles below; the false-positive rate seems much higher than in the GATK examples.

Has anyone else experienced similar results? Any possible solutions?

Here are the commands used with GATK-3.7.0:

#Build the SNP recalibration model
/share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8g \
    -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
    -input "$seqId"_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /state/partition1/db/human/gatk/2.8/b37/hapmap_3.3.b37.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /state/partition1/db/human/gatk/2.8/b37/1000G_omni2.5.b37.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /state/partition1/db/human/gatk/2.8/b37/1000G_phase1.snps.high_confidence.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /state/partition1/db/human/gatk/2.8/b37/dbsnp_138.b37.vcf \
    -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile "$seqId"_SNP.recal \
    -tranchesFile "$seqId"_SNP.tranches \
    -rscriptFile "$seqId"_SNP_plots.R \
    -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
    -nt 12 \
    -ped "$seqId"_pedigree.ped \
    -dt NONE

#Apply the desired level of recalibration to the SNPs in the call set
/share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g \
    -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
    -input "$seqId"_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.0 \
    -recalFile "$seqId"_SNP.recal \
    -tranchesFile "$seqId"_SNP.tranches \
    -o "$seqId"_recalibrated_snps_raw_indels.vcf \
    -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
    -nt 12 \
    -ped "$seqId"_pedigree.ped \
    -dt NONE

#Build the Indel recalibration model
/share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g \
    -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
    -input "$seqId"_recalibrated_snps_raw_indels.vcf \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 /state/partition1/db/human/gatk/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /state/partition1/db/human/gatk/2.8/b37/dbsnp_138.b37.vcf \
    -an DP -an QD -an FS -an SOR -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff \
    -mode INDEL \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    --maxGaussians 4 \
    -recalFile "$seqId"_INDEL.recal \
    -tranchesFile "$seqId"_INDEL.tranches \
    -rscriptFile "$seqId"_INDEL_plots.R \
    -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
    -nt 12 \
    -ped "$seqId"_pedigree.ped \
    -dt NONE

#Apply the desired level of recalibration to the Indels in the call set
/share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g \
    -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
    -input "$seqId"_recalibrated_snps_raw_indels.vcf \
    -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile "$seqId"_INDEL.recal \
    -tranchesFile "$seqId"_INDEL.tranches \
    -o "$seqId"_recalibrated_variants.vcf \
    -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
    -nt 12 \
    -ped "$seqId"_pedigree.ped \
    -dt NONE

bam files header


Hi, I am new to GATK, and to bioinformatics too.
I am trying to analyze some public bam files; they include the whole genome, but I am only interested in plastome genomes. I need to obtain variants from the plastome, so I first extracted the Pt reads using samtools view (including the header). Then I tried to eliminate all the other chromosomes and scaffolds from the header so I could run HaplotypeCaller with Pt.fasta as the reference: I extracted the headers, removed the unnecessary lines with sed, and finally ran samtools reheader... but the files could not be indexed ("bus error"). I don't understand exactly what I am doing wrong; the corrected headers look like the .dict file from the reference. Can you please help me? Thanks!!
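For reference, this is roughly what I did, as a sketch (here 'Pt' stands for the plastome contig name in my files; the sed pattern assumes no other contig name contains 'Pt'):

# extract plastome reads and drop the other contigs from the header in one pass
# (assumes in.bam is coordinate-sorted and indexed)
samtools view -h in.bam Pt \
    | sed '/^@SQ/{/SN:Pt/!d;}' \
    | samtools view -b -o Pt.bam -
samtools sort -o Pt.sorted.bam Pt.bam
samtools index Pt.sorted.bam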

Are you going to have a working version of the complete GATK Best Practices in Spark soon?


I am trying to use the Spark commands in the GATK 4 release to execute the GATK Best Practices with Spark, and it seems to work fine up through BQSRPipelineSpark (I am using gatk_4/build/libs/gatk-package-4.beta.5-67-ge919f85-SNAPSHOT-local.jar with openjdk version "1.8.0_131" on a single node with 27.5 GB RAM and 4 execution threads). However, when I try to execute HaplotypeCallerSpark I run into some unresolved issues (also reported in other questions on the GATK forums).
For example, I tried to execute

./gatk-launch ReadsPipelineSpark \
    --input ERR000589_aligned.bam \
    --reference ucsc.hg19.2bit \
    --disableSequenceDictionaryValidation true \
    --knownSites dbsnp_138.hg19.vcf \
    --knownSites Mills_and_1000G_gold_standard.indels.hg19.vcf \
    --knownSites 1000G_phase1.indels.hg19.sites.vcf \
    --emitRefConfidence GVCF \
    --output ERR000589_raw_variants.g.vcf

which, according to the output of gatk-launch --list,

ReadsPipelineSpark                           (BETA Tool) Takes aligned reads (likely from BWA) and runs MarkDuplicates, BQSR, and HaplotypeCaller. The final result is analysis-ready variants

contains HaplotypeCaller. In particular, I suspect the execution never ends (or takes far too long): I ran the program all night long without reaching the end of the execution of this command. I attached part of the command output and even tried to google some of the WARN lines, but I was not able to interpret them.
I saw the same behavior when I tried to execute HaplotypeCallerSpark directly.

So is the problem mine, in how I execute the command? Or is it a problem of HaplotypeCallerSpark not yet being mature? If the latter, I would like to know whether the GATK staff is working on it and whether we will have a working version soon, or will have to wait much longer.
If it will take a while, I could consider looking at other "Sparkified" solutions for the subsequent GATK Best Practices steps (HaplotypeCaller, GenotypeGVCFs, Variant Annotation...).

Thanks for your time,
Nicholas

(howto) Recalibrate variant quality scores = run VQSR


Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

Prerequisites

  • TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document. See that document also for caveats regarding exome vs. whole-genome analysis designs.

Steps

  1. Prepare recalibration parameters for SNPs
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  2. Build the SNP recalibration model

  3. Apply the desired level of recalibration to the SNPs in the call set

  4. Prepare recalibration parameters for Indels
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  5. Build the Indel recalibration model

  6. Apply the desired level of recalibration to the Indels in the call set


1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • True sites training resource: HapMap

This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G

This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP

This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command), it can be added using the VariantAnnotator tool, as shown in the sketch after this list.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • RMSMappingQuality (MQ)

Estimation of the overall mapping quality of reads supporting a variant call.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.
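For example, a missing annotation can be added to an existing callset along these lines (a minimal sketch with placeholder file names; the BAM used for calling must be supplied so the annotation can be recomputed):

java -jar GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R reference.fa \
    -I sample.bam \
    -V raw_variants.vcf \
    -A Coverage \
    -o raw_variants.annotated.vcf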

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.


2. Build the SNP recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fa \
    -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an DP \
    -an QD \
    -an FS \
    -an SOR \
    -an MQ \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -rscriptFile recalibrate_SNP_plots.R

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the VQSR method documentation and presentation videos.


3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps_raw_indels.vcf

Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the original variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.
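For example, to apply the recommended 99.9 tranche instead, only the truth sensitivity argument changes; the recalibration files from step 2 are simply re-used:

java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.9 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps_raw_indels.vcf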


4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • Known and true sites training resource: Mills

This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Known sites resource, not used in training: dbSNP

This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

d. Determine additional model parameters

  • Maximum number of Gaussians (--maxGaussians) 4

This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include a very large number of variants.


5. Build the Indel recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fa \
    -input recalibrated_snps_raw_indels.vcf \
    -resource:mills,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an QD \
    -an DP \
    -an FS \
    -an SOR \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode INDEL \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    --maxGaussians 4 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -rscriptFile recalibrate_INDEL_plots.R

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.


6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input recalibrated_snps_raw_indels.vcf \
    -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -o recalibrated_variants.vcf

Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the original variants from the original recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set of Indels. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.

MergeBamAlignment help


Hi all,

I am attempting to work through the Drop-seq pipeline, but I have changed a few things from the defaults. I have aligned my fastq files to the hg38 genome rather than hg19, and I've also used TopHat to align rather than STAR. However, when I get to the MergeBamAlignment step, used to merge an unaligned bam and the aligned bam to re-introduce the tags into the aligned bam file, I keep getting an error and am unsure how to resolve it.

Both bam files are sorted by queryname, as the pipeline says to do, but I keep getting the following error (I've removed some of the chromosome names, otherwise it would be too long, as it contains all the contigs):

Exception in thread "main" java.lang.IllegalArgumentException: Do not use this function to merge dictionaries with different sequences in them. Sequences must be in the same order as well. Found [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, ...].
        at htsjdk.samtools.SAMSequenceDictionary.mergeDictionaries(SAMSequenceDictionary.java:305)
        at picard.sam.SamAlignmentMerger.getDictionaryForMergedBam(SamAlignmentMerger.java:197)
        at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:346)
        at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:181)
        at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:282)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

The Picard command I'm using is:

picard MergeBamAlignment \
    UNMAPPED_BAM=4571121.blue101/temp/unaligned_mc_tagged_polyA_filtered.bam \
    ALIGNED_BAM=4571121.blue101/temp/aligned.sorted.bam \
    OUTPUT=4565921.blue101/temp/merged.bam \
    REFERENCE_SEQUENCE=/scratch/ea11g10/Dropseq/hg38.fasta \
    PAIRED_RUN=false \
    INCLUDE_SECONDARY_ALIGNMENTS=false \
    CLIP_ADAPTERS=true \
    IS_BISULFITE_SEQUENCE=false \
    ALIGNED_READS_ONLY=false \
    MAX_INSERTIONS_OR_DELETIONS=1 \
    READ1_TRIM=0 \
    READ2_TRIM=0 \
    ALIGNER_PROPER_PAIR_FLAGS=false \
    SORT_ORDER=coordinate \
    PRIMARY_ALIGNMENT_STRATEGY=BestMapq \
    CLIP_OVERLAPPING_READS=true \
    ADD_MATE_CIGAR=true \
    VERBOSITY=INFO \
    QUIET=false \
    VALIDATION_STRINGENCY=STRICT \
    COMPRESSION_LEVEL=5 \
    MAX_RECORDS_IN_RAM=500000 \
    CREATE_INDEX=false \
    CREATE_MD5_FILE=false \
    GA4GH_CLIENT_SECRETS=client_secrets.json

I created the dict file for the hg38.fasta file using Picard CreateSequenceDictionary.
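For what it's worth, here is roughly how the sequence dictionaries of the aligned BAM and the reference can be compared (a diagnostic sketch; it assumes the dict file sits next to hg38.fasta):

# compare @SQ entries (SN name and LN length columns) between the BAM and the dict
samtools view -H 4571121.blue101/temp/aligned.sorted.bam | grep '^@SQ' | cut -f2,3 > aligned.sq
grep '^@SQ' /scratch/ea11g10/Dropseq/hg38.dict | cut -f2,3 > ref.sq
diff aligned.sq ref.sq    # any output means names, order, or lengths differ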

I am using picard version 2.8.3 and the java version is 1.8.9_51.

Any help would be appreciated. Thanks


picard ./gradlew shadowJar install error


I got an error when I tried to install Picard on our system.
I am running a CentOS system. Could you please let me know what I should do?
Thank you.

[root@NGS:picard]# java -version
java version "1.7.0_141"
OpenJDK Runtime Environment (rhel-2.6.10.1.el7_3-x86_64 u141-b02)
OpenJDK 64-Bit Server VM (build 24.141-b02, mixed mode)

[root@NGS:picard]# git --version
git version 1.8.3.1

[root@NGS:picard]# ll
total 48K
-rw-rw-r-- 1 hanmr hanmr 13K Oct 12 10:04 build.gradle
-rw-rw-r-- 1 hanmr hanmr 1.1K Oct 12 10:04 build.xml
-rw-rw-r-- 1 hanmr hanmr 869 Oct 12 10:04 Dockerfile
drwxrwxr-x 3 hanmr hanmr 26 Oct 12 10:04 etc
drwxrwxr-x 3 hanmr hanmr 29 Oct 12 10:04 gradle
-rwxrwxr-x 1 hanmr hanmr 5.0K Oct 12 10:04 gradlew
-rw-rw-r-- 1 hanmr hanmr 1.1K Oct 12 10:04 LICENSE.txt
drwxrwxr-x 2 hanmr hanmr 33 Oct 12 10:04 project
-rw-rw-r-- 1 hanmr hanmr 5.9K Oct 12 10:04 README.md
-rw-rw-r-- 1 hanmr hanmr 56 Oct 12 10:04 settings.gradle
drwxrwxr-x 4 hanmr hanmr 42 Oct 12 10:04 src
drwxrwxr-x 3 hanmr hanmr 28 Oct 12 10:04 testdata
[root@NGS:picard]# ./gradlew shadowJar

FAILURE: Build failed with an exception.

  • Where:
    Build file '/data3/tools/picard/build.gradle' line: 55

  • What went wrong:
    A problem occurred evaluating root project 'picard'.

    Cannot invoke method getURLs() on null object

  • Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 1.175 secs

I also downloaded the picard.jar file, but I still get an error...

[root@NGS:hanmr]# java -jar picard.jar
Exception in thread "main" java.lang.UnsupportedClassVersionError: picard/cmdline/PicardCommandLine : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:803)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:442)
at java.net.URLClassLoader.access$100(URLClassLoader.java:64)
at java.net.URLClassLoader$1.run(URLClassLoader.java:354)
at java.net.URLClassLoader$1.run(URLClassLoader.java:348)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:347)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:312)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)


why HaplotypeCaller did not call variants with high alt allele coverage


Hi GATK team,
HC variant calling command:

java -jar GenomeAnalysisTK-3.6.jar -T HaplotypeCaller \
    --dontUseSoftClippedBases -mbq 10 -stand_call_conf 10 -dt NONE \
    -R hs37d5.fa -I UM1.recal.bam -L test.bed \
    -D dbsnp147_GRCH37_All_20160601.vcf -o normal.g.vcf \
    -forceActive -ERC GVCF -disableOptimizations

the region in test.bed is :

2   211513107   211513324

below is a screenshot from IGV:

the BAM tracks in the screenshot, from top to bottom, are:

  1. the bamout file from GATK (using the -bamout option),
  2. the bam file after soft-clipping (soft-clipping the primer regions) and BQSR,
  3. the bam file mapped using bwa-mem.

As we can see, most of the reads carry the T allele at position 2:211513118 with high mapping quality and high base quality, but this variant is not called in any mode of HC.

And if we use UG to call this region, this variant is successfully called:

2   211513118   rs4673540   C   T   25161.77    .   AC=2;AF=1.00;AN=2;BaseQRankSum=4.561;DB;DP=691;Dels=0.00;ExcessHet=3.0103;FS=0.000;HaplotypeScore=23.2440;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=0.000;QD=34.24;ReadPosRankSum=0.000;SOR=11.687    GT:AD:DP:GQ:PL  1/1:1,689:691:99:25190,2045,0

Any idea why this site was not called using HC?

Thanks

VariantRecalibrator ERROR: Bad input: Found annotations with zero variance. They must be excluded be


Hello GATK Team,

I am following the Best Practices for germline SNPs and Indels in whole genomes and exomes. Below is the partial pipeline of my code; I have marked with ** the VariantRecalibrator line where I am getting the error:
os.system('java -Xmx40g -Djava.io.tmpdir=/Users/seyfim/tmp -jar /Users/seyfim/software/GenomeAnalysisTK-3.7-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R '+hg19+' -ERC GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I '+n+'.sample.bam --dbsnp '+dbsnp+' -stand_call_conf 20 -o '+n+'.g.vcf')

os.system('java -Xmx6g -Djava.io.tmpdir=/Users/seyfim/tmp -jar /Users/seyfim/software/GenomeAnalysisTK-3.7-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -ERC GVCF '+hg19+' --variant '+n+'.g.vcf -o ALL_SAMPLES.vcf')

os.system('java -Xmx40g -Djava.io.tmpdir=/Users/seyfim/tmp -jar /Users/seyfim/software/GenomeAnalysisTK-3.7-0/GenomeAnalysisTK.jar -T VariantAnnotator -R '+hg19+' -I '+n+'.sample.bam -o '+n+'.sample.raw.vcf -A Coverage --variant ALL_SAMPLES.vcf -L ALL_SAMPLES.vcf --dbsnp /Users/seyfim/working/bwa_working/DbSNP137_hg19.vcf')

os.system('java -Xmx28g -Djava.io.tmpdir=/Users/seyfim/tmp -jar /Users/seyfim/software/picard.jar UpdateVcfSequenceDictionary I='+n+'.sample.raw.vcf O='+n+'.sorted.raw.vcf SEQUENCE_DICTIONARY='+reference_dictionary+'')

** os.system('java -Xmx40g -Djava.io.tmpdir=/Users/seyfim/tmp -jar /Users/seyfim/software/GenomeAnalysisTK-3.7-0/GenomeAnalysisTK.jar -T VariantRecalibrator -R '+hg19+' -input '+n+'.sorted.raw.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 '+hapmap+' -resource:omni,known=false,training=true,truth=false,prior=12.0 '+omni+' -resource:1000G,known=false,training=true,truth=false,prior=10.0 '+SNPs+' -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 '+dbsnp+' -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an ClippingRankSum -mode SNP -L ALL_SAMPLES.vcf -U ALLOW_SEQ_DICT_INCOMPATIBILITY -recalFile '+n+'.output.recal -tranchesFile '+n+'.output.tranches')

Here is the error I am getting:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Bad input: Found annotations with zero variance. They must be excluded before proceeding.
ERROR ------------------------------------------------------------------------------------------

I have checked the forum, made adjustments and still cannot get this tool to work, any help is greatly appreciated :)

Mutect1 -> depth of 'PASS' variants


Hi,

I have used MuTect1 (tumor-control) for calling variants. As mentioned at http://archive.broadinstitute.org/cancer/cga/mutect: "We currently use cutoffs of at least 14 reads in the tumor and at least 8 in the normal (these cutoffs are applied after removing noisy reads in the preprocessing step)". This means the variants I get as a result of the analysis should have a depth of at least 14 in the tumor and 8 in the normal. However, I do see variants in my resulting vcf file with DP below these thresholds. I would like to know the explanation for this. Does MuTect still call them 'PASS' based on LOD, or is there something else to take into account here?

Persistent memory problem with BaseRecalibrator.


I'm using BaseRecalibrator as part of the GATK best practices workflow. My workflow up to this point is trim --> align --> deduplicate --> fix mates.

I'm calling BaseRecalibrator like this (edited for clarity):

java -Xmx16G -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R GATK_resource/b37/human_g1k_v37.fasta \
    -I ${i} \
    -knownSites GATK_resource/b37/dbsnp_138.b37.vcf.gz \
    -knownSites GATK_resource/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
    -knownSites GATK_resource/b37/1000G_phase3_v4_20130502.sites.vcf.gz \
    -L targets.interval.list \
    -o ${NAME}_recal_data.table

My data is paired-end WES data produced by an Illumina HiSeq. I have 5 samples, each of which gave the same result.

Each time, BaseRecalibrator runs for about 20 minutes before quitting with the following error:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
ERROR ------------------------------------------------------------------------------------------

Here's what I've tried.
1. The error seems clear enough, so I doubled the memory (-Xmx32G), and got the same error. So I doubled it again (64G)... and finally one more time (128G). Our server has 500G of RAM. Same error every time.
2. Checked the input bam files with ValidateSamFile. No errors in the bam files.
3. Updated GATK to version 3.8. Same error for every file.
4. Run a smaller interval, namely just chromosome 20 (-L 20 argument). Same error.
5. Tried to recalibrate an older WES bam from our lab, which was aligned to hg19. This worked, so I continued the whole BQSR workflow, with success all the way to a vcf.
6. Ran that same file through my whole workflow, from trimming to BQSR. No problems along the way.
7. Tried to recalibrate an older (smaller) targeted capture PE dataset, aligned to b37. It worked.

Any ideas? This has taken 3 weeks and I'm sort of at a loss where to turn next.

Haplotype caller not picking up variants for HiSeq Runs


Hello,
We were sequencing all our data on a HiSeq and have now moved to a NextSeq. We have sequenced the same batch of samples on both sequencers, and both are processed using the same pipeline/parameters.
What I have noticed is that GATK 3.7 HC is not picking up variants for the HiSeq runs, even though the coverage is good and the variants are evidently present in the BAM file.

For example, the screenshot below shows the BAM files for both the NextSeq and HiSeq samples. There are at least 3 variants in the region 22:29885560-29885861 (NEPH, exon 5) that are expected to be picked up for HiSeq.

These variants are picked up for the NextSeq samples (even though the coverage for HiSeq is much better).

The command that I have used for both samples is

java -Xmx32g -jar GATK_v3_7/GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37.fa --dbsnp GATK_ref/dbsnp_138.b37.vcf -I ${i}.HiSeq_Run31.variant_ready.bam -L NEPH.bed -o ${i}.HiSeq_Run31.NEPH.g.vcf

Any idea why this can happen?

Many thanks,

Problem with allele specific annotation AS_QualByDepth (AS_QD) during variant calling


Hi GATK team,

First a big thank you for all your hard work in developing the tool and supporting the users!

I am trying out the allele-specific (AS) annotations in version 3.6. While I have gotten a few other AS annotations to show up properly in my VCF, I am having trouble getting AS_QualByDepth in particular.

For example, I tried to call variants on a few samples at a specific locus with a "T" homopolymer run. I first ran HaplotypeCaller in GVCF mode for each sample:

java -jar GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 \
  -R ref_fasta \
  -I sample_$i \
  -L chr1:10348759-10348801 \
  -A AS_StrandOddsRatio -A AS_FisherStrand -A AS_QualByDepth \
  -A AS_BaseQualityRankSumTest -A AS_ReadPosRankSumTest -A AS_MappingQualityRankSumTest \
  -o sample_$i.gvcf

I then did GenotypeGVCFs on all the samples together:

java -jar GenomeAnalysisTK.jar \
  -T GenotypeGVCFs \
  -R ref_fasta \
  -V gvcf_list \
  -L chr1:10348759-10348801 \
  -A AS_StrandOddsRatio -A AS_FisherStrand -A AS_QualByDepth \
  -A AS_BaseQualityRankSumTest -A AS_ReadPosRankSumTest -A AS_MappingQualityRankSumTest \
  -o out.vcf

In the final joint-called VCF header, the following AS annotations all showed up.

##INFO=<ID=AS_BaseQRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt Vs. Ref base qualities">
##INFO=<ID=AS_FS,Number=A,Type=Float,Description="allele specific phred-scaled p-value using Fisher's exact test to detect strand bias of each alt allele">
##INFO=<ID=AS_MQRankSum,Number=A,Type=Float,Description="Allele-specific Mapping Quality Rank Sum">
##INFO=<ID=AS_QD,Number=1,Type=Float,Description="Allele-specific Variant Confidence/Quality by Depth">
##INFO=<ID=AS_RAW_BaseQRankSum,Number=1,Type=String,Description="raw data for allele specific rank sum test of base qualities">
##INFO=<ID=AS_RAW_MQRankSum,Number=1,Type=String,Description="Allele-specific raw data for Mapping Quality Rank Sum">
##INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
##INFO=<ID=AS_ReadPosRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt vs. Ref read position bias">
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests">
##INFO=<ID=AS_SOR,Number=A,Type=Float,Description="Allele specific strand Odds Ratio of 2x|Alts| contingency table to detect allele specific strand bias">

However, in the INFO column, I only got the other AS annotations but not AS_QD.

chr1    10348779        .       AT      A,ATT   981.29  .       AC=4,2;AF=0.333,0.167;AN=12;AS_BaseQRankSum=-1.087,-2.521;AS_FS=3.986,7.378;AS_MQRankSum=-1.130,-2.349;AS_ReadPosRankSum=-1.192,-1.396;AS_SOR=0.415,0.254;BaseQRankSum=-6.350e-01;ClippingRankSum=0.00;DP=627;ExcessHet=14.6052;FS=6.378;MLEAC=4,2;MLEAF=0.333,0.167;MQ=59.95;MQRankSum=0.00;QD=1.94;ReadPosRankSum=-1.050e-01;SOR=0.352        GT:AD:DP:GQ:PL  0/1:44,9,7:63:81:81,0,1033,93,844,1165  0/1:71,11,8:99:47:47,0,1659,110,1414,1803       0/1:54,15,7:81:99:205,0,1239,280,1087,1635      0/1:69,25,12:106:99:311,0,1603,336,1306,2058    0/2:55,11,22:94:99:291,233,1636,0,943,1294      0/2:61,11,14:91:14:92,14,1473,0,1071,1468

I also checked the individual sample gVCFs. Similarly, AS_QD is present in the header but not in the INFO column. I am wondering whether this might be a bug or whether I am doing something wrong.

Another curious thing I noticed is that in the VCF header the other AS annotations all have "Number=A", but AS_QD has "Number=1". I don't know if this might be causing a problem.

Best practice on SRA data


Dear all,

I'm trying to detect variants starting from the data in a few (5) SRA files downloaded from NCBI (whole-genome resequencing on Illumina GA). I do not have information about lanes (each SRA file includes sequences for one individual, with no specification of the number of lanes used, at least that I know of). All libraries are paired-end.

I should also point out that:

  • I do not have pre-existing SNP/INDEL information for this species.
  • The reference genome is about 230 Mbp

I proceeded as follows for each SRA file (i.e. each sample):

  1. extract paired reads into separate fastq files (sratoolkit)
  2. quality-trim reads and keep only full pairs
  3. align and map PE reads (bwa aln + bwa sampe)
  4. fix the sam file (Picard CleanSam)
  5. convert sam to bam (Picard SamFormatConverter)
  6. sort the bam file and add meta-info (@RG etc.) (Picard AddOrReplaceReadGroups)
  7. index the sorted bam file

These bam files pass Picard ValidateSamFile.

I consider these indexed bam files as "Raw reads".
In order to properly call variants with GATK, I was now trying to go from "Raw reads" to "Analysis ready reads" as specified in GATK best practices.

So far I proceeded as follows (on each sample):

  1. Indel local realignment: GATK RealignerTargetCreator (without dbSNP) + IndelRealigner (using the .intervals file produced in the previous step)
  2. Mark duplicates (Picard MarkDuplicates)

I now have a realigned & dedupped bam file for each sample.
What follows, before variant calling, should be Base Quality Score Recalibration (GATK BaseRecalibrator + PrintReads using recalibration data produced in the previous step).

To do this without known indels, I am planning to follow the bootstrap procedure suggested in the Base Quality Score Recalibration article in the Methods & Workflows section of the Guide (Troubleshooting paragraph).
Namely, for each sample separately I will do an initial run of SNP calling on the initial data (i.e. on the realigned & dedupped bam), select high-confidence SNPs, and feed them as known SNPs (vcf file) to BaseRecalibrator + PrintReads to produce a recalibrated bam file; a sketch of one such round is below.
Then I will do the real SNP calling (HaplotypeCaller) on the recalibrated bam files thus obtained (all samples together).
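Concretely, one bootstrap round could look like this (a sketch only, in GATK 3 syntax; the hard-filter thresholds are placeholder guesses for what counts as "hi-confidence"):

# 1. initial round of calling on the realigned & dedupped bam
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R ref.fasta -I sample.realigned.dedup.bam -o bootstrap_raw.vcf

# 2. flag low-confidence calls with conservative hard filters, then drop them
java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R ref.fasta -V bootstrap_raw.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    --filterName lowConf -o bootstrap_flagged.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R ref.fasta -V bootstrap_flagged.vcf \
    --excludeFiltered -o bootstrap_confident.vcf

# 3. recalibrate using the confident calls as known sites
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R ref.fasta -I sample.realigned.dedup.bam \
    -knownSites bootstrap_confident.vcf -o recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R ref.fasta -I sample.realigned.dedup.bam \
    -BQSR recal.table -o sample.recal.bam

# repeat from step 1 on sample.recal.bam until the reported qualities converge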

My questions are:

  1. Is the order of the various steps correct?
  2. Did I choose the appropriate GATK methods for these data?
  3. Is it better to perform the BQS recalibration using all data together or bam-by-bam?
  4. How do I select "hi-confidence SNPs" in the bootstrap procedure? Can anyone indicate a threshold quality for this?
  5. How can I verify "convergence" of the bootstrap procedure? At convergence should perhaps obtained SNP calls coincide with known SNPs fed to the analysis?

Sorry for the lengthy post, I'm not quite a bioinformatician, and I'd really need to be sure before proceeding further.

Thanks!

Giorgio


Errors in SAM/BAM files can be diagnosed with ValidateSamFile


The problem

You're trying to run a GATK or Picard tool that operates on a SAM or BAM file, and getting some cryptic error that doesn't clearly tell you what's wrong. Bits of the stack trace (the pile of lines in the output log that the program outputs when there is a problem) may contain the following: java.lang.String, Error Type Count, NullPointerException -- or maybe something else that doesn't mean anything to you.

Why this happens

The most frequent cause of these unexplained problems is not a bug in the program -- it's an invalid or malformed SAM/BAM file. This means that there is something wrong either with the content of the file (something important is missing) or with its format (something is written the wrong way). Invalid SAM/BAM files generally have one or more errors in the following sections: the header tags, the alignment fields, or the optional alignment tags. In addition, the SAM/BAM index file can be a source of errors as well.

These errors are usually introduced by upstream processing tools, such as the genome mapper/aligner or any other data processing tools you may have applied before feeding the data to Picard or GATK.

The solution

To fix these problems, you first have to know what's wrong. Fortunately there's a handy Picard tool that can test for (almost) all possible SAM/BAM format errors, called ValidateSamFile.

We recommend the workflow included below for diagnosing problems with ValidateSamFile. This workflow will help you tackle the problem efficiently and set priorities for dealing with multiple errors (which often happens). We also outline typical solutions for common errors, but note that this is not meant to be an exhaustive list -- there are too many possible problems to tackle all of them in this document. To be clear, here we focus on diagnostics, not treatment.

Some problems may be too severe to fix, and you may need to redo the genome alignment/mapping from scratch! Consider running ValidateSamFile proactively at all key steps of your analysis pipeline to catch errors early!


Workflow for diagnosing SAM/BAM file errors with ValidateSamFile

image

1. Generate summary of errors

First, run ValidateSamFile in SUMMARY mode in order to get a summary of everything that is missing or improperly formatted in your input file. We set MODE=SUMMARY explicitly because by default the tool would just emit details about the first 100 problems it finds and then quit. If you have some minor formatting issues that don't really matter but affect every read record, you won't get to see more important problems that occur later in the file.

$ java -jar picard.jar ValidateSamFile \
        I=input.bam \
        MODE=SUMMARY

If this outputs No errors found, then your SAM/BAM file is completely valid. If you were running this purely as a preventative measure, then you're good to go and can proceed to the next step in your pipeline. If you were doing this to diagnose a problem, then you're back to square one -- but at least now you know it's not likely to be a SAM/BAM file format issue. One exception: some analysis tools require Read Group tags like SM that are not required by the format specification itself, so the input files will pass validation but the analysis tools will still error out. If that happens to you, check whether your files have SM tags in the @RG lines of their BAM header. That is the most common culprit.

However, if the command above outputs one or more of the 8 possible WARNING or 48 possible ERROR messages (see tables at the end of this document), you must proceed to the next step in the diagnostic workflow.

When run in SUMMARY mode, ValidateSamFile outputs a table that differentiates between two levels of error: ERROR proper and WARNING, based on the severity of problems that they would cause in downstream analysis. All problems that fall in the ERROR category must be addressed in order to proceed with other Picard or GATK tools, while those that fall in the WARNING category may often be ignored for some, if not all, subsequent analyses.

Example of error summary

ValidateSamFile (SUMMARY) Count
ERROR:MISSING_READ_GROUP 1
ERROR:MISMATCH_MATE_ALIGNMENT_START 4
ERROR:MATES_ARE_SAME_END 894289
ERROR:CIGAR_MAPS_OFF_REFERENCE 354
ERROR:MATE_NOT_FOUND 1
ERROR:MISMATCH_FLAG_MATE_UNMAPPED 46672
ERROR:MISMATCH_READ_LENGTH_AND_E2_LENGTH 1
WARNING:RECORD_MISSING_READ_GROUP 54
WARNING:MISSING_TAG_NM 33

This table, generated by ValidateSamFile from a real BAM file, indicates that this file has a total of 1 MISSING_READ_GROUP error, 4 MISMATCH_MATE_ALIGNMENT_START errors, 894,289 MATES_ARE_SAME_END errors, and so on. Moreover, this output also indicates that there are 54 RECORD_MISSING_READ_GROUP warnings and 33 MISSING_TAG_NM warnings.

2. Generate detailed list of ERROR records

Since ERRORs are more severe than WARNINGs, we focus on diagnosing and fixing them first. From the first step we only had a summary of errors, so now we generate a more detailed report with this command:

$ java -jar picard.jar ValidateSamFile \
        I=input.bam \
        IGNORE_WARNINGS=true \
        MODE=VERBOSE

Note that we invoked the MODE=VERBOSE and the IGNORE_WARNINGS=true arguments.

The former is technically not necessary as VERBOSE is the tool's default mode, but we specify it here to make it clear that that's the behavior we want. This produces a complete list of every problematic record, as well as a more descriptive explanation for each type of ERROR than is given in the SUMMARY output.

The IGNORE_WARNINGS option enables us to specifically examine only the records with ERRORs. When working with large files, this feature can be quite helpful, because there may be many records with WARNINGs that are not immediately important, and we don't want them flooding the log output.

Example of VERBOSE report for ERRORs only

ValidateSamFile (VERBOSE) Error Description
ERROR: Read groups is empty Empty read group field for multiple records
ERROR: Record 1, Read name 20FUKAAXX100202:6:27:4968:125377 Mate alignment does not match alignment start of mate
ERROR: Record 3, Read name 20FUKAAXX100202:6:27:4986:125375 Both mates are marked as second of pair
ERROR: Record 6, Read name 20GAVAAXX100126:4:47:18102:194445 Read CIGAR M operator maps off end of reference
ERROR: Read name 30PPJAAXX090125:1:60:1109:517#0 Mate not found for paired read
ERROR: Record 402, Read name 20GAVAAXX100126:3:44:17022:23968 Mate unmapped flag does not match read unmapped flag of mate
ERROR: Record 12, Read name HWI-ST1041:151:C7BJEACXX:1:1101:1128:82805 Read length does not match quals length

These ERRORs are all problems that we must address before using this BAM file as input for further analysis. Most ERRORs can typically be fixed using Picard tools to either correct the formatting or fill in missing information, although sometimes you may want to simply filter out malformed reads using Samtools.

For example, MISSING_READ_GROUP errors can be solved by adding the read group information to your data using the AddOrReplaceReadGroups tool. Most mate pair information errors can be fixed with FixMateInformation.
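
For instance, those two fixes might look like this -- a minimal sketch in which the read group values (RGID, RGLB, RGPL, RGPU, RGSM) are placeholders you should replace with your sample's actual metadata:

$ java -jar picard.jar AddOrReplaceReadGroups \
        I=input.bam \
        O=input.rg_fixed.bam \
        RGID=group1 \
        RGLB=lib1 \
        RGPL=illumina \
        RGPU=unit1 \
        RGSM=sample1

$ java -jar picard.jar FixMateInformation \
        I=input.rg_fixed.bam \
        O=input.fixed.bam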

Once you have attempted to fix the errors in your file, you should put your new SAM/BAM file through the first validation step in the workflow, running ValidateSamFile in SUMMARY mode again. We do this to evaluate whether our attempted fix has solved the original ERRORs, and/or any of the original WARNINGs, and/or introduced any new ERRORs or WARNINGs (sadly, this does happen).

If you still have ERRORs, you'll have to loop through this part of the workflow until no more ERRORs are detected.

If you have no more ERRORs, congratulations! It's time to look at the WARNINGs (assuming there are still some -- if not, you're off to the races).

3. Generate detailed list of WARNING records

To obtain more detailed information about the warnings, we invoke the following command:

$ java -jar picard.jar ValidateSamFile \
        I=input.bam \
        IGNORE=type \
        MODE=VERBOSE

At this stage we often use the IGNORE option to tell the program to ignore a specific type of WARNING that we consider less important, in order to focus on the rest. In some cases we may even decide to not try to address some WARNINGs at all because we know they are harmless (for example, MATE_NOT_FOUND warnings are expected when working with a small snippet of data). But in general we do strongly recommend that you address all of them to avoid any downstream complications, unless you're sure you know what you're doing.

Example of VERBOSE report for WARNINGs only

ValidateSamFile (VERBOSE) Warning Description
WARNING: Read name H0164ALXX140820:2:1204:13829:66057 A record is missing a read group
WARNING: Record 1, Read name HARMONIA-H16:1253:0:7:1208:15900:108776 NM tag (nucleotide differences) is missing

Here we see a read group-related WARNING which would probably be fixed when we fix the MISSING_READ_GROUP error we encountered earlier, hence the prioritization strategy of tackling ERRORs first and WARNINGs second.

We also see a WARNING about missing NM tags. This is an alignment tag that is added by some but not all genome aligners, and is not used by the downstream tools that we care about, so you may decide to ignore this warning by adding IGNORE=MISSING_TAG_NM from now on when you run ValidateSamFile on this file.
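
Concretely, that future invocation is just the template above with the warning type filled in:

$ java -jar picard.jar ValidateSamFile \
        I=input.bam \
        IGNORE=MISSING_TAG_NM \
        MODE=VERBOSE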

Once you have attempted to fix all the WARNINGs that you care about in your file, you put your new SAM/BAM file through the first validation step in the workflow again, running ValidateSamFile in SUMMARY mode. Again, we check that no new ERRORs have been introduced and that the only WARNINGs that remain are the ones we feel comfortable ignoring. If that's not the case we run through the workflow again. If it's all good, we can proceed with our analysis.


Appendix: List of all WARNINGs and ERRORs emitted by ValidateSamFile

We are currently in the process of updating the Picard website to include the following two tables, describing WARNING (Table I) and ERROR (Table II) cases. Until that's done, you can find them below.

Table I
WARNING Description
Header Issues
INVALID_DATE_STRING Date string is not ISO-8601
INVALID_QUALITY_FORMAT Quality encodings out of range; appear to be Solexa or Illumina when Phred expected. This avoids an exception being thrown as a result of no qualities being read.
General Alignment Record Issues
ADJACENT_INDEL_IN_CIGAR CIGAR string contains an insertion (I) followed by deletion (D), or vice versa
RECORD_MISSING_READ_GROUP A SAMRecord is found with no read group id
Mate Pair Issues
PAIRED_READ_NOT_MARKED_AS_FIRST_OR_SECOND Pair flag set but not marked as first or second of pair
Optional Alignment Tag Issues
MISSING_TAG_NM The NM tag (nucleotide differences) is missing
E2_BASE_EQUALS_PRIMARY_BASE Secondary base calls should not be the same as primary, unless one or the other is N
General File, Index or Sequence Dictionary Issues
BAM_FILE_MISSING_TERMINATOR_BLOCK BAM appears to be healthy, but is an older file so doesn't have terminator block
Table II
ERROR Description
Header Issues
DUPLICATE_PROGRAM_GROUP_ID Same program group id appears more than once
DUPLICATE_READ_GROUP_ID Same read group id appears more than once
HEADER_RECORD_MISSING_REQUIRED_TAG Header tag missing in header line
HEADER_TAG_MULTIPLY_DEFINED Header tag appears more than once in header line with different value
INVALID_PLATFORM_VALUE The read group has an invalid value set for its PL field
INVALID_VERSION_NUMBER Does not match any of the acceptable versions
MISSING_HEADER The SAM/BAM file is missing the header
MISSING_PLATFORM_VALUE The read group is missing its PL (platform) field
MISSING_READ_GROUP The header is missing read group information
MISSING_SEQUENCE_DICTIONARY There is no sequence dictionary in the header
MISSING_VERSION_NUMBER Header has no version number
POORLY_FORMATTED_HEADER_TAG Header tag does not have colon
READ_GROUP_NOT_FOUND A read group ID on a SAMRecord is not found in the header
UNRECOGNIZED_HEADER_TYPE Header record is not one of the standard types
General Alignment Record Issues
CIGAR_MAPS_OFF_REFERENCE Bases corresponding to M operator in CIGAR extend beyond reference
INVALID_ALIGNMENT_START Alignment start position is incorrect
INVALID_CIGAR CIGAR string error for either read or mate
INVALID_FLAG_FIRST_OF_PAIR First of pair flag set for unpaired read
INVALID_FLAG_SECOND_OF_PAIR Second of pair flag set for unpaired read
INVALID_FLAG_PROPER_PAIR Proper pair flag set for unpaired read
INVALID_FLAG_MATE_NEG_STRAND Mate negative strand flag set for unpaired read
INVALID_FLAG_NOT_PRIM_ALIGNMENT Not primary alignment flag set for unmapped read
INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT Supplementary alignment flag set for unmapped read
INVALID_FLAG_READ_UNMAPPED Mapped read flag not set for mapped read
INVALID_INSERT_SIZE Inferred insert size is out of range
INVALID_MAPPING_QUALITY Mapping quality set for unmapped read or is >= 256
INVALID_PREDICTED_MEDIAN_INSERT_SIZE PI tag value is not numeric
MISMATCH_READ_LENGTH_AND_QUALS_LENGTH Length of sequence string and length of base quality string do not match
TAG_VALUE_TOO_LARGE Unsigned integer tag value is deprecated in BAM
Mate Pair Issues
INVALID_FLAG_MATE_UNMAPPED Mate unmapped flag is incorrectly set
MATE_NOT_FOUND Read is marked as paired, but its pair was not found
MATE_CIGAR_STRING_INVALID_PRESENCE A cigar string for a read whose mate is NOT mapped
MATE_FIELD_MISMATCH Read alignment fields do not match its mate
MATES_ARE_SAME_END Both mates of a pair are marked either as first or second mates
MISMATCH_FLAG_MATE_UNMAPPED Mate unmapped flag does not match read unmapped flag of mate
MISMATCH_FLAG_MATE_NEG_STRAND Mate negative strand flag does not match read strand flag
MISMATCH_MATE_ALIGNMENT_START Mate alignment does not match alignment start of mate
MISMATCH_MATE_CIGAR_STRING The mate cigar tag does not match its mate's cigar string
MISMATCH_MATE_REF_INDEX Mate reference index (MRNM) does not match reference index of mate
Optional Alignment Tag Issues
INVALID_MATE_REF_INDEX Mate reference index (MRNM) set for unpaired read
INVALID_TAG_NM The NM tag (nucleotide differences) is incorrect
MISMATCH_READ_LENGTH_AND_E2_LENGTH Lengths of secondary base calls tag values and read should match
MISMATCH_READ_LENGTH_AND_U2_LENGTH Secondary base quals tag values should match read length
EMPTY_READ Indicates that a read corresponding to the first strand has a length of zero and/or lacks flow signal intensities (FZ)
INVALID_INDEXING_BIN Indexing bin set on SAMRecord does not agree with computed value
General File, Index or Sequence Dictionary Issues
INVALID_INDEX_FILE_POINTER Invalid virtualFilePointer in index
INVALID_REFERENCE_INDEX Reference index not found in sequence dictionary
RECORD_OUT_OF_ORDER The record is out of order
TRUNCATED_FILE BAM file does not have terminator block

GATK4 is completely open source


This is one of two posts announcing the imminent beta release of GATK4; for a technical description of features, see this other post.


"Wait, what?" Yes, you read that right, we're moving GATK4 to a fully open source license -- specifically, BSD 3-clause. And to be clear, this applies to all of GATK4. Not just the core framework (which, little known fact, has always been open source), but all the tools that were previously "protected", including HaplotypeCaller, the new CNV discovery tools, everything. The whole enchilada.


Old-timers in the field (i.e. anyone with what, 3+ years experience?) will recognize this as a major shift. An important subset of the GATK -- some might say "all the really valuable bits" -- has been under a mixed licensing model since version 2.0 was released in 2012. Under this mixed model, GATK was free for academic/non-profit research purposes, while any for-profit use required a paid commercial license. The proceeds funded further GATK development and support.

Admittedly the move from the initial open-source state of GATK 1.x to the mixed licensing model caused a fair amount of debate. I'm not going to revisit it in full (even my therapist is sick of hearing about it), but it's fair to say that the licensing created an obstacle for our interactions with some other groups, and that it raised some barriers to accessing GATK, especially for smaller companies and startups.

Since then the context within which we operate at the Broad has evolved significantly: a little over two years ago, our small development team was assimilated into a then-newly created larger group called the Data Sciences Platform (DSP), which aims to tackle the big challenges in genomics with robust engineering solutions. This involves applying some novel approaches compared to traditional academic software development, including: 1) give engineers a good home; 2) focus on products, not projects; and 3) maximize openness. This last point in particular means that our DSP mothership-within-Broad recognizes the immense potentiating role of open-source software in driving technological and methodological innovation. In fact, all of DSP's software products have been open-source since its inception, with the notable exception of GATK, which it inherited in a mixed state.

Over the past two years, the collaborations that DSP has cultivated with external groups have immensely benefitted the development of the new framework that would eventually become GATK4. Key features that we have come to rely on were contributed as open-source code by external collaborators: the GenomicsDB datastore that allows us to scale joint genotyping to tens of thousands of whole genomes, by Karthik Gururaj and colleagues at Intel; the Genomics Kernel Library, which provides many impressive speedups for the GATK, by George Powley at Intel; the NIO functionality that allows us to access data on Google Cloud Storage directly, by JP Martin at Google; and the Apache Spark support that allows us to parallelize operations in a much more robust way than before, by Tom White at Cloudera. And it's not all about institutional collaborations; we have also received spontaneous contributions from individuals such as Daniel Gómez-Sánchez of the Institut für Populationsgenetik of Vienna, which have collectively enhanced the GATK codebase and its value to the user community.

So with GATK4 on the cusp of release, and with enthusiasm from all of us at the Broad, we're seizing this opportunity to do a reboot* and bring into alignment our mandate (to build great software), our mission (to empower great research) and our means: a more community-minded approach anchored in openness and free exchange of ideas.

* (at least we had already ditched Jar-Jar "Phone Home" Binks...)

I expect the benefits of this new direction are fairly self-evident, so I'll do us all a favor and close with just one last, somewhat personal note specifically from the development team. We want to thank all the collaborators who have worked with us so far for their support, their invaluable contributions and their faith in what we could accomplish together. And as we turn over this new leaf, we look forward to welcoming into the GATK family anyone who would like to see how much further we can push the genomics envelope.

Version highlights for GATK version 3.8


One more 3.x version, for the road! That's right, even as we're ramping up our efforts on GATK4 (we're three beta releases in at this point, and getting down to brass tacks writing the migration guide ahead of the 4.0 general release) we still found it worthwhile to cut one last release of GATK3.

Our main motivation here is to introduce the Intel Genomics Kernel Library, which comes bearing the gift of speed improvements for those of you who won't be able to migrate to GATK4 right away.

As a secondary benefit, this version includes a handful of bug fixes, some usability improvements including better error messages, documentation fixes and logging tweaks, and a few improvements to annotation calculations (especially in allele-specific mode), which you'll find described briefly in the release notes. No big changes though, except perhaps the new default behavior of VariantsToTable with regard to missing annotation values, discussed below. Finally, we've committed a copy of all the peripheral documentation (= the docs that live in the forum and complement the tool documentation) to the now-old GATK codebase.

And thus, the last-ever GATK3 version emerges covered in carbonite.


Introducing the Intel Genomics Kernel Library

The Genomics Kernel Library or GKL is an open-source library developed by our collaborators at Intel that provides accelerated versions of algorithms, i.e. "kernels", used in genomics tools. These kernels are optimized to run on Intel Architecture under 64-bit Linux and Mac OSX. They're plugged into the GATK in such a way that they will be automatically used if your computing hardware supports them, but if it doesn't they will remain inactive and the "default" generic Java versions will be used instead.

At the moment there are three main kernels included:

  • Intel inflater/deflater: a file compression/decompression kernel that provides different levels of compression (with correspondingly variable speedups). This replaces the JDK inflater/deflater and is now activated by default. It can be disabled by using the -jdk_deflater and -jdk_inflater flags.

  • Intel chip optimization for PairHMM: a version of the PairHMM algorithm used by HaplotypeCaller to calculate genotype likelihoods that runs faster on Intel hardware. It can be disabled by setting -pairHMM LOGLESS_CACHING, for example if you need completely deterministic behavior across different machine types (at the expense, of course, of speed).

  • FPGA support for PairHMM: another version of the PairHMM algorithm, this one designed to run on FPGAs, a type of processor that is gaining popularity for computing applications requiring extremely high speed. The FPGA support in this version is fairly experimental so we can't guarantee results, but if you have access to this specialized hardware we definitely encourage you to try it out and let us know how it goes. (An example command using the flags named in this list follows below.)
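
For the record, here's a sketch of how these flags fit into a command line, using HaplotypeCaller as an example (the reference and file names are placeholders; drop whichever flags you don't need):

$ java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta \
        -I input.bam \
        -o output.g.vcf \
        -ERC GVCF \
        -jdk_deflater -jdk_inflater \
        -pairHMM LOGLESS_CACHING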


Attitude adjustment for VariantsToTable

VariantsToTable is a tool we're quite fond of because it allows us to extract just the information we want from VCFs when we want to probe a callset interactively, typically for filtering purposes. Previously we had to tell it explicitly not to freak out if it came across any sites or genotypes where an annotation we requested was missing; but realistically, there are always some sites for which we can't calculate some annotations (like ranksum annotations at sites where we don't have any heterozygous samples), so that was annoying. Now we've flipped the behavior so that by default the tool keeps going and just outputs "NA" anywhere it encounters such sites or genotypes, unless you specify that it should freak out by using the --errorIfMissingData flag.
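
To make that concrete, here's a sketch of a typical invocation under the new default behavior (the -F fields shown are just common site-level annotations; swap in whatever you want to extract):

$ java -jar GenomeAnalysisTK.jar -T VariantsToTable \
        -R reference.fasta \
        -V input.vcf \
        -F CHROM -F POS -F QD -F FS -F MQ \
        -o output.table

Any site that lacks one of the requested annotations now simply gets "NA" in that column; add --errorIfMissingData if you want the old fail-fast behavior back.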


Documentation archive and deprecation plans

In preparation for the general release of GATK4 (in the form of a 4.0 version), we made a copy of all the peripheral (forum-based) documentation in its current state and archived it in the codebase itself here. This is intended to be a permanent archive for documentation that we are phasing out in favor of GATK4-focused documentation.

Our ultimate goal is to provide some degree of continuity and support for users who cannot migrate to GATK4 right away and must continue to use older versions, without leaving too much clutter around that might confuse everyone else.

In the immediate future we will delete three sets of documents from the forum (and therefore from the website):

  • "Developer Zone": replaced in GATK4 by a developer-oriented Wiki in the github repository;
  • "Queue": superseded for all versions by Cromwell+WDL;
  • The current contents of "Archive", which have typically been replaced by individual articles linked at the top of the deprecated article.

Within the other documentation sections, articles may get updated in place or moved to the Archive for future removal. Versioned tool documentation going back to 3.5-0 will remain available on the website for the foreseeable future. For older versions, the documentation can be built from source. Finally, the Best Practices section of the website will be updated to reflect the new world order once GATK 4.0 is released and becomes the officially supported version of GATK. Going forward we'll have versioned Best Practices accompanied by a publicly available WDL script for each major use case. We'll post more details of what this will look like in the coming weeks.

Allele-specific annotation and filtering

$
0
0

Introduction and FAQs

The current recalibration paradigm evaluates each position, and passes or filters all alleles at that position, regardless of how many alternate alleles occur there. This has major disadvantages in cases where a real variant allele occurs at the same position as an error that has sufficient evidence to be called as a variant. The goal of the Allele-Specific Filtering Workflow is to treat each allele separately in the annotation, recalibration and filtering phases.

What studies can benefit from the Allele-Specific Filtering Workflow?

Multi-allelic sites benefit the most from the Allele-Specific Filtering Workflow because each allele will be evaluated more accurately than if its data was lumped together with other alleles. Large callsets will benefit more than small callsets because multi-allelics will occur more frequently as the number of samples in a cohort increases. One callset with 42 samples that was used for development contains 3% multi-allelic sites, while the ExAC callset [http://biorxiv.org/content/early/2015/10/30/030338] with approximately 60,000 samples contains nearly 8% multi-allelic sites. Recalibrating each allele separately will also greatly benefit rare disease studies, in which rare alleles may not be shared by other members of the callset, but could still occur at the same positions as common alleles or errors.

What additional data do I need to run the Allele-Specific Filtering Workflow?

No additional resource files are necessary, but this workflow does require the sample bam files. Annotations cannot be calculated from VCF or gVCF files alone.

Is the Allele-Specific Filtering Workflow going to change my data? Can I still use my old analysis pipeline?

After running the Allele-Specific Filtering Workflow, several new annotations will be added to the INFO field for your variants (see below), and VQSR results will be based on those new annotations, though using SNP and INDEL tranche sensitivity cutoffs equivalent to the non-allele-specific best practices. If after analyzing your recalibrated data, you’re not convinced that this workflow is for you, you can still run the classic VQSR on your genotyped VCF because the standard annotations for VQSR are still included in the genotyped VCF.

Can I run the Allele-Specific Filtering Workflow not in reference confidence mode?

Nope. Sorry. The way we generate and combine the allele-specific data depends on having raw data for each sample in the gVCF.

Is the Allele-Specific Filtering Workflow part of the GATK Best Practices?

Not yet. Although we are happy with the performance of this workflow, our own production pipelines have not yet been updated to include this, so it should still be considered experimental. However, we do encourage you to try this out on your own data and let us know what you find, as this helps us refine the tools and catch bugs.


Allele-Specific Workflow

Input

Begin with a post-BQSR bam file for each sample. The read data in the bam are necessary to generate the allele-specific annotations.

Step 1: HaplotypeCaller

Using the locally-realigned reads, HaplotypeCaller will generate gVCFs with all of its usual standard annotations, plus raw data to calculate allele-specific versions of the standard annotations. That means each alternate allele in each VariantContext will get its own data used by downstream tools to generate allele-specific QualByDepth, RMSMappingQuality, FisherStrand and allele-specific versions of the other standard annotations. For example, this will help us sort out good alleles that only occur in a few samples and have a good balance of forward and reverse reads but occur at the same position as another allele that has bad strand bias because it’s probably a mapping error.

java -jar $GATKjar -T HaplotypeCaller -R $reference \
    -I mySample.bam \
    -o mySample.AS.g.vcf \
    -ERC GVCF \
    -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation

(Optional) Step 2a: CombineGVCFs

Here the allele-specific data for each sample is combined per allele, but is not yet in its final annotation form. We only do this for computational efficiency reasons when we have >200 samples.

java -jar $GATKjar -T CombineGVCFs -R $reference \
    -V mySample1.AS.g.vcf -V mySample2.AS.g.vcf -V mySample3.AS.g.vcf \
    -o allSamples.g.AS.vcf \
    -G StandardAnnotation -G AS_StandardAnnotation

Note that if you run this, you need to modify the -V input in the next step to just the combined gVCF file.
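
For example, following the file names used above, the Step 2 command would become:

java -jar $GATKjar -T GenotypeGVCFs -R $reference \
    -V allSamples.g.AS.vcf \
    -o allSamples.AS.vcf \
    -G StandardAnnotation -G AS_StandardAnnotation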

Step 2: GenotypeGVCFs

Raw allele-specific data for each sample is used to calculate the finalized annotation values. In GATK 3.6, non-allele-specific rank sum annotations are still combined using the median across all samples. (See below for details on more accurate MQ calculations.)

java -jar $GATKjar -T GenotypeGVCFs -R $reference \
    -V mySample1.AS.g.vcf -V mySample2.AS.g.vcf -V mySample3.AS.g.vcf \
    -o allSamples.AS.vcf \
    -G StandardAnnotation -G AS_StandardAnnotation

Step 3: VariantRecalibrator

In allele-specific mode, the VariantRecalibrator builds the statistical model based on data for each allele, rather than each site. This has the added benefit of being able to recalibrate the SNPs in mixed sites according to the appropriate model, rather than lumping them in with INDELs as had been done previously. It will also provide better results by matching the exact allele in the training and truth data rather than just the position.

# SNP modeling pass

java -jar $GATKjar -T VariantRecalibrator -R $reference -mode SNP -AS \
    -an AS_QD -an AS_FS -an AS_ReadPosRankSum -an AS_MQ -an AS_MQRankSum -an AS_SOR \
    -input allSamples.AS.vcf \
    -resource:known=false,training=true,truth=true,prior=15.0 $hapmap_sites \
    -resource:known=false,training=true,truth=true,prior=12.0 $omni_sites \
    -resource:known=false,training=true,truth=false,prior=10.0 $training_1000G_sites \
    -resource:known=true,training=false,truth=false,prior=2.0 $dbSNP_129 \
    -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.7 -tranche 99.5 -tranche 99.3 -tranche 99.0 -tranche 90.0 \
    -recalFile allSamples.AS.snps.recal \
    -tranchesFile allSamples.AS.snps.tranches \
    -modelFile allSamples.AS.snps.report \
    -rscriptFile allSamples.AS.snps.R

# INDEL modeling pass

java -jar $GATKjar -T VariantRecalibrator -R $reference -mode INDEL -AS \
    -an AS_QD -an AS_FS -an AS_ReadPosRankSum -an AS_MQRankSum -an AS_SOR \
    -input allSamples.AS.vcf \
    -resource:known=false,training=true,truth=true,prior=12.0 $indelGoldStandardCallset \
    -resource:known=true,training=false,truth=false,prior=2.0 $dbSNP_129 \
    -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.7 -tranche 99.5 -tranche 99.3 -tranche 99.0 -tranche 90.0 \
    --maxGaussians 4 \
    -recalFile allSamples.AS.indels.recal \
    -tranchesFile allSamples.AS.indels.tranches \
    -modelFile allSamples.AS.indels.report \
    -rscriptFile allSamples.AS.indels.R

Note that these commands are for exomes. For whole genomes, the classic DP annotation (-an DP) will still be used for SNP recalibration, as in the Best Practices.

Step 4: ApplyRecalibration

Allele-specific filters are calculated and stored in the AS_FilterStatus INFO annotation. A site-level filter is applied to each site based on the most lenient filter across all alleles. For example, if any allele passes, the entire site passes. If no alleles pass, then the filter will be applied corresponding to the allele with the lowest tranche (best VQSLOD).

The two ApplyRecalibration passes should be run in series, as in our Best Practices recommendations. Running the SNP and INDEL passes in parallel and merging the results with CombineVariants works for the standard VQSR arguments, but in allele-specific mode any mixed sites will not be processed correctly.

# SNP filtering pass

java -jar $GATKjar -T ApplyRecalibration -R $reference \
    -input allSamples.AS.vcf \
    -mode SNP --ts_filter_level 99.70 -AS \
    --recal_file allSamples.AS.snps.recal \
    --tranches_file allSamples.AS.snps.tranches \
    -o allSamples.AS.snp_recalibrated.vcf

# INDEL filtering pass

java -jar $GATKjar -T ApplyRecalibration -R $reference \
    -input allSamples.AS.snp_recalibrated.vcf \
    -mode INDEL --ts_filter_level 99.3 -AS \
    --recal_file allSamples.AS.indels.recal \
    --tranches_file allSamples.AS.indels.tranches \
    -o allSamples.AS.snp_indel_recalibrated.vcf


Output of the workflow

The Allele-Specific Filtering Workflow adds new allele-specific INFO-level annotations to the VCFs and produces a final output with allele-specific filters based on the VQSR SNP and INDEL tranches.

Allele-specific annotations

The AS_Standard annotation set will produce allele-specific versions of our standard annotations. For AS_MQ, this means that the root-mean-squared mapping quality will be given for all of the reads that support each allele, respectively. For rank sum and strand bias tests, the annotation for each allele will compare that alternative allele’s values against the reference allele.

Recalibration files from allele-specific VariantRecalibrator

Each allele will be described in a separate line in the output recalibration (.recal) files. For the advanced analyst, this is a good way to check which allele has the worst data and is responsible for a NEGATIVE_TRAIN_SITE classification.

Allele-specific filters

After both ApplyRecalibration modes are run, the INFO field will contain an annotation called AS_FilterStatus, which will list the filter corresponding to each alternate allele. Allele-specific culprit and VQSLOD scores will also be added to the final VCF in the AS_culprit and AS_VQSLOD annotations, respectively.

Sample output

3 195507036 . C G,CCT 6672.42 VQSRTrancheINDEL99.80to99.90 AC=7,2;AF=0.106,0.030;AN=66;AS_BaseQRankSum=-0.144,1.554;AS_FS=127.421,52.461;AS_FilterStatus=VQSRTrancheSNP99.90to100.00,VQSRTrancheINDEL99.80to99.90;AS_MQ=29.70,28.99;AS_MQRankSum=1.094,0.045;AS_ReadPosRankSum=1.120,-7.743;AS_SOR=9.981,7.523;AS_VQSLOD=-48.3935,-7.8306;AS_culprit=AS_FS,AS_FS;BaseQRankSum=0.028;DP=2137;ExcessHet=1.6952;FS=145.982;GQ_MEAN=200.21;GQ_STDDEV=247.32;InbreedingCoeff=0.0744;MLEAC=7,2;MLEAF=0.106,0.030;MQ=29.93;MQRankSum=0.860;NCC=9;NEGATIVE_TRAIN_SITE;QD=10.94;ReadPosRankSum=-7.820e-01;SOR=10.484

3 153842181 . CT TT,CTTTT,CTTTTTTTTTT,C 4392.82 PASS AC=15,1,1,1;AF=0.192,0.013,0.013,0.013;AN=78;AS_BaseQRankSum=-11.667,-3.884,-2.223,0.972;AS_FS=204.035,22.282,16.930,2.406;AS_FilterStatus=VQSRTrancheSNP99.90to100.00,VQSRTrancheINDEL99.50to99.70,VQSRTrancheINDEL99.70to99.80,PASS;AS_MQ=58.44,59.93,54.79,59.72;AS_MQRankSum=2.753,0.123,0.157,0.744;AS_ReadPosRankSum=-9.318,-5.429,-5.578,1.336;AS_SOR=6.924,3.473,5.131,1.399;AS_VQSLOD=-79.9547,-2.0208,-3.4051,0.7975;AS_culprit=AS_FS,AS_ReadPosRankSum,AS_ReadPosRankSum,QD;BaseQRankSum=-2.828e+00;DP=1725;ExcessHet=26.1737;FS=168.440;GQ_MEAN=117.51;GQ_STDDEV=141.53;InbreedingCoeff=-0.1776;MLEAC=16,1,1,1;MLEAF=0.205,0.013,0.013,0.013;MQ=54.35;MQRankSum=0.967;NCC=3;NEGATIVE_TRAIN_SITE;QD=4.42;ReadPosRankSum=-2.515e+00;SOR=4.740


Caveats

Spanning deletions

Since GATK3.4, GenotypeGVCFs has had the ability to output a “spanning deletion allele” (now represented with *) to indicate that a position in the VCF is contained within an upstream deletion and may have “missing data” in samples that contain that deletion. While the upstream deletions will continue to be recalibrated and filtered by VQSR similar to the way they always have been, these spanning deletion alleles that occur downstream (and represent the same event) will be skipped.

gVCF size increase

Using the default gVCF bands ([1:60,70,80,90,99]), the raw allele-specific data adds very little to gVCF size: less than a 1% increase on the NA12878 exome used for development.

MQ calculation change

If you ran the same callset through GATK 3.4 or earlier and then through GATK 3.5 or later, you may notice that the MQ annotation values for your variants changed slightly. That's because, with or without allele-specific annotation and filtering, MQ is now calculated in a new, more accurate way. GenotypeGVCFs used to combine each sample's annotations by taking the median, but the new MQ calculation code combines each sample's data in a more mathematically correct way.
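
To sketch the difference: MQ is the root mean square of the mapping qualities of all the reads covering a site (the notation below is our illustration, not taken from the release notes):

\[
\mathrm{MQ} = \sqrt{\frac{\sum_{i=1}^{N} q_i^2}{N}}
\]

where q_i is the mapping quality of read i and N is the total number of reads across all samples. Summing each sample's raw sum of squared qualities and its read count recovers this value exactly, whereas taking the median of per-sample MQ values does not.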

Potential usage errors

Problem: WARN 08:35:26,273 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 0.00|15005.00|14400.00|0.00 doesn't parse and will not be annotated in the final VC.

Solution: Remember to add -G StandardAnnotation -G AS_StandardAnnotation to the GenotypeGVCFs command (as shown in Step 2)

Problem: Standard (non-allele-specific) annotations are missing

Solution: HaplotypeCaller and GenotypeGVCFs need -G StandardAnnotation specified if -G AS_StandardAnnotation is also specified.
