Channel: Recent Discussions — GATK-Forum

Picard Sort Vcf Error



I am using GATK version 3.6, picard-2.8.2.jar

I downloaded hapmap_3.3.hg38.vcf from the GATK resource bundle. I then used the command below to remove the chr notation.
awk '{gsub(/^chr/,""); print}' hapmap_3.3.hg38.vcf > no_chr_hapmap_3.3.hg38.vcf.vcf

Before (hapmap_3.3.hg38.vcf)
chr1 2242065 rs263526 T C . PASS AC=724;AF=0.259;AN=2792
chr1 2242417 rs16824926 C . . PASS AN=530
chr1 2242880 rs11581436 A . . PASS AN=540

After (no_chr_hapmap_3.3.hg38.vcf.vcf)
1 6421563 rs4908891 G A . PASS AC=1086;AF=0.389;AN=2792
1 6421782 rs4908892 A G . PASS AC=1692;AF=0.606;AN=2792
1 6421856 rs12078257 T C . PASS AC=368;AF=0.132;AN=2790

I then used Picard SortVcf to sort no_chr_hapmap_3.3.hg38.vcf.vcf:
java -jar picard-2.8.2.jar SortVcf I=removedChr_HapMap.vcf O=sortedHapMap.vcf SEQUENCE_DICTIONARY=hg38.dict

@SQ SN:1 LN:248956422 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:2648ae1bacce4ec4b6cf337dcae37816
@SQ SN:10 LN:133797422 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:907112d17fcb73bcab1ed1c72b97ce68
@SQ SN:11 LN:135086622 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:1511375dc2dd1b633af8cf439ae90cec
@SQ SN:12 LN:133275309 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:e81e16d3f44337034695a29b97708fce

I then encountered this error:

Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=248956422,dict_index=0,assembly=20) was found when SAMSequenceRecord(name=1,length=248956422,dict_index=0,assembly=null) was expected.
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:126)
at picard.vcf.SortVcf.doWork(SortVcf.java:95)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=248956422,dict_index=0,assembly=20) was found when SAMSequenceRecord(name=1,length=248956422,dict_index=0,assembly=null) was expected.
at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:170)
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:124)
... 4 more

I have tried many times but still get the same error. Please advise how I can solve this problem.

I would then like to use SelectVariants to extract variants that are missing from HapMap but present in my dataset.

Thank you so much in advance.
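
One possible reading of the error above, offered as an assumption rather than a confirmed diagnosis: the awk command only changes lines that begin with chr, so any `##contig=<ID=chr1,...>` lines in the VCF header keep their prefix, and SortVcf then finds chr1 in the header where the dictionary says 1. A minimal sketch that renames both the header contigs and the CHROM column; the function name strip_chr and the toy records are hypothetical:

```shell
# strip_chr: remove the "chr" prefix from contig names in a VCF read on stdin.
# Rewrites both the ##contig header lines and the CHROM column of data lines.
strip_chr() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^##contig=<ID=chr/ { sub(/ID=chr/, "ID="); print; next }
       /^#/                { print; next }
       { sub(/^chr/, "", $1); print }'
}

# Toy input (hypothetical, tab-separated records).
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1,length=248956422>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t2242065\trs263526\tT\tC\t.\tPASS\tAN=2792\n' | strip_chr
```

On a real file this would be used as `strip_chr < hapmap_3.3.hg38.vcf > no_chr_hapmap_3.3.hg38.vcf`. Whether the remaining contig names and lengths then match hg38.dict still has to be checked separately.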


Cannot open GATK Documentation

When browsing the GATK4 documentation, I see this message:
"Showing docs for version | The latest version is"

However, when I choose a version from the list, I am redirected to an empty directory.

https://software.broadinstitute.org/gatk/documentation/tooldocs/

HaplotypeCaller advanced option --force-active missing

From reading the docs for version (and working with GATK version

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php#--disable-optimizations

In the documentation for the --disable-optimizations option, I see:
"Setting the --force-active and --dont-trim-active-regions flags may also be necessary."

However, the option --force-active no longer seems to be available, because I get this error message:

"A USER ERROR has occurred: force-active is not a recognized option"

Is this option no longer relevant and the documentation out of date, or should it still be present?
(Older versions of the documentation also mention --force-active or -forceActive.)

Thank you for clarification.

Differences between GATK3 MuTect2 and GATK4 Mutect2


Tool names are cased differently

The first thing to note is how the tool names are different. In GATK3 it's spelled MuTect2 with an uppercase T, whereas in GATK4 it's spelled Mutect2 with a lowercase t. Not only is the new tool name easier to type, it helps us distinguish which version of the tool a document refers to.


The two Mutects differ in functionality

And their respective workflow tools differ too. The table shows the tools for the workflow functionalities for GATK3 versus GATK4.


GATK3 MuTect2 will remain in beta status. GATK4 Mutect2 is in beta status as of the official GATK4 release.

One major difference is GATK4 breaks off filtering into a separate tool, FilterMutectCalls. In GATK3, MuTect2 both calls and filters variants. In GATK4, Mutect2 is focused mostly on calling and does some minimal upfront filtering of obvious non-somatic sites. However, it leaves the majority of filtering to FilterMutectCalls. This separation makes it easier to test changes to filtering thresholds as the computationally expensive calling is decoupled from filtering.

Another major difference is in site versus allele filtering against the germline resource. GATK3 MuTect2 prefilters sites in the germline resource regardless of the allele in the tumor. GATK4 Mutect2 distinguishes alleles in the germline resource and only filters the site if the tumor allele matches. If the alleles are different, then the tool considers the allele a putative somatic mutation.

Filtering of sites in the panel of normals (PoN) and the matched normal remains unchanged, except that the tool will prefilter most of these such that site records are absent from the VCF.


With the 1000 Genomes Project now wrapped up, and with the availability of germline variant callsets from even larger cohorts, e.g. gnomAD, the germline component of human cancers is something that GATK4 Mutect2 can account for in a more sophisticated way. GATK4 Mutect2 factors the germline population allele frequencies into somatic probability calculations. For a given allele in the tumor, if it is present in the germline resource, its probability of being a somatic mutation is weighted inversely to the frequency with which the allele is observed in the population.

Here are the differences between GATK3 MuTect2 and GATK4 Mutect2 as a list.

  1. The filtering functionality that annotates the FILTER column is now done by a separate tool called FilterMutectCalls. To filter further based on sequence context artifacts, additionally use FilterByOrientationBias. Note that Mutect2 still performs some upfront filtering (see next point).
  2. Mutect2 ignores sites present in the Panel of Normals (PoN) as well as sites that correspond to high fraction variants in the normal. By doing so, the tool avoids spending time in steps such as graph assembly and pairHMM alignments that cost compute. However, there is an option to force the tool to run the full process on sites that are in the PoN (--genotype-pon-sites), which can be useful in comparing results to older MuTect versions.
  3. If using a known germline variants resource, then it must contain population allele frequencies, e.g. if working on humans then from gnomAD. The VCF INFO field contains the allele frequency (AF) tag. See the GATK Resource Bundle or the Mutect2 tool documentation for an example.
  4. To create the PoN, call on each normal sample using Mutect2's tumor-only mode and then use GATK4's CreateSomaticPanelOfNormals, a tool new to GATK4. This contrasts with the GATK3 workflow, which uses an artifact calling mode in MuTect2 and CombineVariants for PoN creation. In GATK4, omitting to filter with FilterMutectCalls achieves the same result.
  5. Instead of using a maximum likelihood estimate to calculate the variant likelihoods, GATK4 Mutect2 marginalizes over allele fractions using a Bayesian likelihoods model. See the Mutect2 methods whitepaper for algorithm details. GATK3 MuTect2 uses allele depths (AD) directly to estimate allele fractions and calculate likelihoods. In contrast, GATK4 Mutect2 accounts for the statistical error inherent in allele depths by marginalizing over allele fractions when calculating likelihoods.
  6. In GATK4, we recommend including cross-sample contamination estimates from CalculateContamination when filtering with FilterMutectCalls. CalculateContamination, in turn, relies on the results of GetPileupSummaries and can incorporate information from the matched normal, if available, when calculating the contamination in the tumor sample.

What remains unchanged is that neither version calls potential loss of heterozygosity (LoH) events. To detect LoH, see the Somatic Copy Number Variant (CNV) workflow.

You can find tutorials that explore considerations in the GATK3 workflow or the GATK4 workflow on our forum.

  • Tutorial#9183 outlines the GATK3 MuTect2 workflow.
  • Tutorial#11136 outlines the GATK4 Mutect2 workflow.
  • If you are wondering about the differences between Mutect2 and HaplotypeCaller, see Article#11127.
  • If you are nostalgic for the original MuTect, you can get it as a standalone jar from the MuTect1 Download page. The version is v1.1.7 and it requires Java 7 to run. MuTect1 is a somatic pileup caller that calls SNVs only. That is, it does not call indels, and therefore workflows that use it should include indel realignment. Version 1.1.7 writes results to VCF format (specify with --vcf). For example usage commands see this thread. For prior versions that give results in MAF format, see the Broad CGA website. For workflows that use a composite of MuTect1 SNV calls and MuTect2 indel calls, see FireCloud Article#7512.

A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed

Hi Team,

We are running Cromwell on AWS and running five-dollar-genome-analysis-pipeline-master using the GATK toolset. After resolving a couple of issues related to missing files, NIO, etc., we are facing yet another issue, this time related to file indexing during BaseRecalibrator processing.

Can you please help us identify the missing index files, or the script that is responsible for indexing the BAM files?

Here is error stack trace for your reference.

"A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.

Please index all input files:
Please index all input files:

samtools index /cromwell_root/cromwelleast/cromwell-execution/germline_single_sample_workflow/29c2f87f-54e6-47d3-aa46-b062cee5df57/call-to_bam_workflow/ToBam.to_bam_workflow/3e2283ad-6b4d-4e5d-af68-438f9af13843/call-SortSampleBam/NA12878.aligned.duplicate_marked.sorted.bam
samtools index /cromwell_root/cromwelleast/cromwell-execution/germline_single_sample_workflow/29c2f87f-54e6-47d3-aa46-b062cee5df57/call-to_bam_workflow/ToBam.to_bam_workflow/3e2283ad-6b4d-4e5d-af68-438f9af13843/call-SortSampleBam/NA12878.aligned.duplicate_marked.sorted.bam"

Do let us know if any other information is required. Thanks in advance!
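
As a first diagnostic for the error above, it can help to list which BAM files have no index sitting next to them. The sketch below assumes the two common naming conventions, `sample.bam.bai` and `sample.bai`; the function name missing_bam_indexes and the placeholder files are hypothetical, and the check says nothing about whether the workflow actually passes the index to the task.

```shell
# missing_bam_indexes: print each BAM path given as an argument that has
# neither a sibling "<name>.bam.bai" nor a sibling "<name>.bai" index file.
missing_bam_indexes() {
  for bam in "$@"; do
    if [ ! -e "${bam}.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
      printf '%s\n' "$bam"
    fi
  done
}

# Demo on empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/indexed.bam" "$demo_dir/indexed.bam.bai" "$demo_dir/unindexed.bam"
missing_bam_indexes "$demo_dir"/*.bam   # prints only the unindexed.bam path
rm -rf "$demo_dir"
```

Any path it prints is a candidate for the `samtools index` command quoted in the error message.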

Concordance crashed


Dear Gatk team,
I'm trying to compare two vcfs with genotypes for a single sample on two platforms using Concordance (gatk Getting an error, please help:

[jlr328@cbsurf01 Concordance]$ /programs/gatk4/gatk --java-options "-Xmx120G" Concordance -R ../RemapWGS/Reference/v2.refseq/GRCh38_latest_genomic.fna --evaluation qmd-27-07.deep-wes.grch38-refseq.gatk4.qchip-sites.vcf --truth qmd-27-07.deep-wgs.grch38-refseq.gatk4.qchip-sites.vcf --summary concordance.qmd-27-07.eval-deep-wes.truth-deep-wgs.tsv
Using GATK jar /programs/gatk-
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xmx120G -jar /programs/gatk- Concordance -R ../RemapWGS/Reference/v2.refseq/GRCh38_latest_genomic.fna --evaluation qmd-27-07.deep-wes.grch38-refseq.gatk4.qchip-sites.vcf --truth qmd-27-07.deep-wgs.grch38-refseq.gatk4.qchip-sites.vcf --summary concordance.qmd-27-07.eval-deep-wes.truth-deep-wgs.tsv
14:29:40.034 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/programs/gatk-!/com/intel/gkl/native/libgkl_compression.so
14:29:40.157 INFO Concordance - ------------------------------------------------------------
14:29:40.157 INFO Concordance - The Genome Analysis Toolkit (GATK) v4.0.1.1
14:29:40.157 INFO Concordance - For support and documentation go to https://software.broadinstitute.org/gatk/
14:29:40.158 INFO Concordance - Executing as jlr328@cbsurf01.biohpc.cornell.edu on Linux v3.10.0-693.el7.x86_64 amd64
14:29:40.158 INFO Concordance - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-b12
14:29:40.158 INFO Concordance - Start Date/Time: January 8, 2019 2:29:40 PM EST
14:29:40.158 INFO Concordance - ------------------------------------------------------------
14:29:40.159 INFO Concordance - ------------------------------------------------------------
14:29:40.159 INFO Concordance - HTSJDK Version: 2.14.1
14:29:40.159 INFO Concordance - Picard Version: 2.17.2
14:29:40.159 INFO Concordance - HTSJDK Defaults.COMPRESSION_LEVEL : 1
14:29:40.159 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:29:40.159 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:29:40.159 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:29:40.160 INFO Concordance - Deflater: IntelDeflater
14:29:40.160 INFO Concordance - Inflater: IntelInflater
14:29:40.160 INFO Concordance - GCS max retries/reopens: 20
14:29:40.160 INFO Concordance - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
14:29:40.160 INFO Concordance - Initializing engine
14:29:40.712 INFO FeatureManager - Using codec VCFCodec to read file file:///local/storage/AnalysisData/Ongoing/Qatar/Array/QChip/QChipData/v4/GRCh38/Concordance/qmd-27-07.deep-wgs.grch38-refseq.gatk4.qchip-sites.vcf
14:29:40.736 INFO FeatureManager - Using codec VCFCodec to read file file:///local/storage/AnalysisData/Ongoing/Qatar/Array/QChip/QChipData/v4/GRCh38/Concordance/qmd-27-07.deep-wes.grch38-refseq.gatk4.qchip-sites.vcf
14:29:40.744 INFO Concordance - Done initializing engine
14:29:40.745 INFO ProgressMeter - Starting traversal
14:29:40.745 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute
14:29:40.958 INFO Concordance - Shutting down engine
[January 8, 2019 2:29:40 PM EST] org.broadinstitute.hellbender.tools.walkers.validation.Concordance done. Elapsed time: 0.02 minutes.
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:879)
at org.broadinstitute.hellbender.tools.walkers.validation.Concordance.areVariantsAtSameLocusConcordant(Concordance.java:256)
at org.broadinstitute.hellbender.engine.AbstractConcordanceWalker$ConcordanceIterator.next(AbstractConcordanceWalker.java:188)
at org.broadinstitute.hellbender.engine.AbstractConcordanceWalker$ConcordanceIterator.next(AbstractConcordanceWalker.java:163)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at org.broadinstitute.hellbender.engine.AbstractConcordanceWalker.traverse(AbstractConcordanceWalker.java:121)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:277)
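
The trace shows getAlternateAllele being asked for a second alternate allele (index 1) on a record that has only one. One guess, not a confirmed cause, is that some site carries more ALT alleles in one VCF than in the other. A minimal sketch to list such sites, assuming uncompressed, tab-separated VCFs; the function name alt_count_mismatches and the toy files are hypothetical:

```shell
# alt_count_mismatches: compare two VCF files and print "CHROM POS nA nB" for
# every position present in both files where the number of ALT alleles differs.
alt_count_mismatches() {
  awk -F '\t' '
    /^#/ { next }                                  # skip header lines
    { n = split($5, alts, ",") }                   # count ALT alleles
    NR == FNR { count[$1 FS $2] = n; next }        # first file: remember counts
    ($1 FS $2) in count && count[$1 FS $2] != n {  # second file: compare
      print $1, $2, count[$1 FS $2], n
    }' "$1" "$2"
}

# Demo on two toy files in a temporary directory.
demo_dir=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tT\n' > "$demo_dir/truth.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tT,G\n' > "$demo_dir/eval.vcf"
alt_count_mismatches "$demo_dir/truth.vcf" "$demo_dir/eval.vcf"   # prints: chr1 200 1 2
rm -rf "$demo_dir"
```

If the real files show such mismatches, splitting multiallelic records before comparison would be one thing to try.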

How to restart a wdl workflow task from where it failed on google cloud

Hi,
I'm trying to run a scatter WDL workflow on Google Cloud through Cromwell. How do I restart the workflow from where it stopped?


Error with GATK ModelSegments


I am using the BETA tool "ModelSegments" in a copy number variation analysis and I've run into an error that I don't understand. Within our institution's cluster computing environment, I submitted the following job:


srun $GATK/gatk --java-options "-Xmx10000m" ModelSegments --allelic-counts $ALLELIC_COUNTS_T --normal-allelic-counts $ALLELIC_COUNTS_N --output-prefix 10058 -O $OUTPUT_DIR

From this, I get the following error:

Using GATK jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10000m -jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gat$
06:42:48.839 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-!/com/intel/gkl/native/libgkl_compression.so
06:42:49.212 INFO ModelSegments - ------------------------------------------------------------
06:42:49.212 INFO ModelSegments - The Genome Analysis Toolkit (GATK) v4.0.4.0
06:42:49.212 INFO ModelSegments - For support and documentation go to https://software.broadinstitute.org/gatk/
06:42:49.213 INFO ModelSegments - Executing as jacojam@exanode-3-7.local on Linux v3.10.0-693.17.1.el7.x86_64 amd64
06:42:49.213 INFO ModelSegments - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_161-b14
06:42:49.213 INFO ModelSegments - Start Date/Time: May 2, 2018 6:42:48 AM PDT
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.214 INFO ModelSegments - HTSJDK Version: 2.14.3
06:42:49.214 INFO ModelSegments - Picard Version: 2.18.2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:42:49.214 INFO ModelSegments - Deflater: IntelDeflater
06:42:49.214 INFO ModelSegments - Inflater: IntelInflater
06:42:49.214 INFO ModelSegments - GCS max retries/reopens: 20
06:42:49.214 INFO ModelSegments - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
06:42:49.215 WARN ModelSegments -

^[[1m^[[31m !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: ModelSegments is a BETA tool and is not yet ready for use in production


06:42:49.215 INFO ModelSegments - Initializing engine
06:42:49.215 INFO ModelSegments - Done initializing engine
06:42:49.224 INFO ModelSegments - Reading file (/home/exacloud/lustre1/BioDSP/users/jacojam/data/hnscc/DNASeq/hg19_BWA_alignment_10058_tumor/tumor.allelicCounts.tsv)...
06:15:44.797 INFO ModelSegments - Shutting down engine
[May 3, 2018 6:15:44 AM PDT] org.broadinstitute.hellbender.tools.copynumber.ModelSegments done. Elapsed time: 1,412.93 minutes.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at com.opencsv.CSVParser.parseLine(CSVParser.java:383)
at com.opencsv.CSVParser.parseLineMulti(CSVParser.java:299)
at com.opencsv.CSVReader.readNext(CSVReader.java:275)
at org.broadinstitute.hellbender.utils.tsv.TableReader.fetchNextRecord(TableReader.java:348)
at org.broadinstitute.hellbender.utils.tsv.TableReader.access$200(TableReader.java:94)
at org.broadinstitute.hellbender.utils.tsv.TableReader$1.hasNext(TableReader.java:458)
at java.util.Iterator.forEachRemaining(Iterator.java:115)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractRecordCollection.<init>(AbstractRecordCollection.java:82)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractLocatableCollection.<init>(AbstractLocatableCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractSampleLocatableCollection.<init>(AbstractSampleLocatableCollection.java:44)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AllelicCountCollection.<init>(AllelicCountCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments$$Lambda$29/27313641.apply(Unknown Source)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.readOptionalFileOrNull(ModelSegments.java:559)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.doWork(ModelSegments.java:462)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
srun: error: exanode-3-7: task 0: Exited with exit code 1

Is this something you could potentially help me with? Thank you.

(How to) Call somatic mutations using GATK4 Mutect2


Post suggestions and read about updates in the Comments section.

This tutorial introduces researchers to considerations in somatic short variant discovery using GATK4 Mutect2. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood and are aligned to GRCh38 with post-alt processing [1]. The tutorial focuses on how to call traditional somatic short mutations, as described in Article#11127 and pipelined in GATK v4.0.0.0's mutect2.wdl [2]. The tool and its workflow are in BETA status as of this writing, which means they may undergo changes and are not guaranteed for production.

► For Broad Mutation Calling Best Practices, see FireCloud Article#45055.

Section 1 calls somatic mutations with Mutect2 using all the bells and whistles of the tool. Section 2 outlines how to create the panel of normals resource using the tumor-only mode of Mutect2. Section 3 outlines how to estimate cross-sample contamination. Section 4 shows how to filter the callset with FilterMutectCalls. Unlike GATK3, in GATK4 the somatic calling and filtering functionalities are embodied by separate tools. Section 5 shows an optional filtering step to filter by sequence context artifacts that present with orientation bias, e.g. OxoG artifacts. Section 6 shows how to set up in IGV for manual review. Finally, section 7 provides a brief list of related resources that may be of interest to researchers.

GATK4 Mutect2 is a versatile variant caller that not only is more sensitive than, but is also roughly twice as fast as, HaplotypeCaller's reference confidence mode. Researchers who wish to customize analyses should find the tutorial's descriptions of the multiple levers of Mutect2 in section 1 and descriptions of the tumor-only mode of Mutect2 in section 2 of interest.

Jump to a section

  1. Call somatic short variants and generate a bamout with Mutect2
    1.1 What are the Mutect2 annotations?
    1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?
  2. Create a sites-only PoN with CreateSomaticPanelOfNormals
    2.1 The tumor-only mode of Mutect2 is useful outside of PoN creation
  3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination
    3.1 What if I find high levels of contamination?
  4. Filter for confident somatic calls using FilterMutectCalls
  5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias
    5.1 Tally of applied filters for the tutorial data
  6. Set up in IGV to review somatic calls
  7. Related resources

Tools involved

  • GATK v4.0.0.0 is available in a Docker image and as a standalone jar. For the latest release, see the Downloads page. Note that GATK v4.0.0.0 contains Picard tools from release v2.17.2 that are callable with the gatk launch script.
  • Desktop IGV. The tutorial uses v2.3.97.

Download example data

Download tutorial_11136.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data and resources, see [3] and [4].

► The tutorial steps switch between the subset and full data. Some of the data files, e.g. BAMs, are restricted to a small region of the genome to efficiently pace the tutorial. Other files, e.g. the Mutect2 calls that the tutorial filters, are from the entire genome. The tutorial content was originally developed for the 2017-09 Helsinki workshop and we make the full data files, i.e. the resource files and the BAMs, available at gs://gatk-best-practices/somatic-hg38.

1. Call somatic short variants and generate a bamout with Mutect2

Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see Article#11127. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.

gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam 

This produces a raw unfiltered somatic callset 1_somatic_m2.vcf.gz, a reassembled reads BAM 2_tumor_normal_m2.bam and the respective indices 1_somatic_m2.vcf.gz.tbi and 2_tumor_normal_m2.bai.

Comments on select parameters

  • Specify the case sample for somatic calling with two parameters. Provide the BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName. Alternatively, use samtools view -H tumor.bam | grep '@RG'.
  • Prefilter variant sites in a control sample alignment. Specify the control BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.
  • Prefilter variant sites in a panel of normals callset. Specify the panel of normals (PoN) VCF with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it presents commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool does not reassemble nor emit variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.
  • Annotate variant alleles by specifying a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025 which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.
  • Include reads whose mate maps to a different contig. For our somatic analysis that uses alt-aware and post-alt processed alignments to GRCh38, we disable a specific read filter with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (in alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.
  • Target the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.
  • Generate the reassembled alignments file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.
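
The 0.0000025 value used for --af-of-alleles-not-in-resource follows from the 1/(2*exome samples) rule given above. As a worked check, assuming a round figure of 200,000 exomes (the text says ~200k), the arithmetic can be reproduced with awk; the helper name af_not_in_resource is hypothetical:

```shell
# af_not_in_resource: 1 / (2 * N) for N samples in the germline resource,
# i.e. one allele copy among the 2N chromosomes of N diploid samples.
af_not_in_resource() {
  awk -v n="$1" 'BEGIN { printf "%.7f\n", 1 / (2 * n) }'
}

af_not_in_resource 200000   # prints 0.0000025
```

For a resource built from a different number of samples, the same helper gives the corresponding starting value.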

To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.


We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.
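
The one-liner above splits on any whitespace and also inspects header lines, so a header line with a comma in its fifth whitespace-delimited token could slip through. A slightly stricter sketch, shown on hypothetical toy records rather than the tutorial callset, splits on tabs and skips lines starting with #; the function name multiallelic_records is an assumption for illustration:

```shell
# multiallelic_records: print VCF data records (read on stdin) whose ALT column
# (column 5) contains a comma, i.e. lists more than one alternate allele.
multiallelic_records() {
  awk -F '\t' '!/^#/ && $5 ~ /,/'
}

# Toy input (hypothetical, tab-separated records).
printf '#CHROM\tPOS\tID\tREF\tALT\nchr17\t100\t.\tA\tG\nchr17\t200\t.\tC\tT,G\n' | multiallelic_records
```

Only the chr17 record at position 200 is printed. On the tutorial data the function would replace the awk stage, e.g. `gzcat somatic_m2.vcf.gz | multiallelic_records`.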

☞ 1.1 What are the Mutect2 annotations?

We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.



The Variant Annotations section of the Tool Documentation further describes some of the annotations. For a complete list of annotations available in GATK4, see this site.

To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.

☞ 1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?

To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter (MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigs and read pairs that span contigs can now lend support to variant calls.

For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls comes from the ALT, HLA and decoy contigs.


2. Create a sites-only PoN with CreateSomaticPanelOfNormals

We go through the motions of creating a PoN using three germline samples. These samples are HG00190, NA19771 and HG02759 [3].

First, run Mutect2 in tumor-only mode on each normal sample. In tumor-only mode, a single case sample is analyzed with the -tumor flag without an accompanying matched control -normal sample. For the tutorial, we run this command only for sample HG00190.

gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz

This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.


We see for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively for the two sites.
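To put the AD values above in perspective, here is a minimal sketch that converts allele depths into naive per-allele read fractions. It assumes a simple depth ratio, which is not necessarily how Mutect2 estimates its own AF annotation; the function name `allele_fractions` is hypothetical.

```python
def allele_fractions(allele_depths):
    """Return each allele's share of the total allele depth (reference allele first)."""
    total = sum(allele_depths)
    if total == 0:
        return [0.0 for _ in allele_depths]
    return [depth / total for depth in allele_depths]


# AD values for the two multiallelic sites quoted in the text.
for ad in ([16, 3, 12, 4], [41, 5, 24, 4]):
    fractions = allele_fractions(ad)
    print(ad, [round(f, 3) for f in fractions])
```

Under this naive ratio, several of the alternate alleles sit at low fractions, which illustrates the sensitivity of the calling described above.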

Comments on select parameters

  • One option that is not used here is to include a germline resource with --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af.
  • Recapitulate any special options used in somatic calling in the panel of normals sample calling, e.g. --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.

Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for panel of normals is a minimum of forty samples.

gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.
CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.
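The retention rule described above (keep sites with variants in two or more samples) can be sketched as follows. This is a simplification under stated assumptions: sites are represented as hypothetical (contig, position) tuples, alleles are ignored, and the function name `sites_in_two_or_more` is illustrative rather than part of GATK.

```python
from collections import Counter


def sites_in_two_or_more(sample_callsets, min_samples=2):
    """Return sites that appear in at least `min_samples` of the per-sample callsets."""
    counts = Counter()
    for callset in sample_callsets:
        # set() ensures each sample contributes at most once per site.
        for site in set(callset):
            counts[site] += 1
    return sorted(site for site, n in counts.items() if n >= min_samples)


# Hypothetical sites for three normal samples.
normal_a = [("chr17", 100), ("chr17", 200)]
normal_b = [("chr17", 200), ("chr17", 300)]
normal_c = [("chr17", 200), ("chr17", 300), ("chr17", 400)]
pon_sites = sites_in_two_or_more([normal_a, normal_b, normal_c])
print(pon_sites)
```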


Ideally, the PoN includes samples that are technically representative of the tumor case sample, i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur at largely the same genomic loci across short-read sequencing approaches.

What do you think of including samples of family members in the PoN?

☞ 2.1 The tumor-only mode of Mutect2 is useful outside of PoN creation

For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples. This approach is described in Blog#11315.


3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination

First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.

gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table

This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.


Comments on select parameters

  • The tool only considers homozygous alternate sites in the sample that have a population allele frequency that ranges between that set by --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
  • One option to speed up analysis, not used in the command above, is to limit data collection with the -L argument to a subset of genomic regions that is still sufficiently large.
  • As of GATK4.0.8.0, released August 2, 2018, GetPileupSummaries requires both -L and -V parameters. For the tutorial, provide the same resources/chr17_small_exac_common_3_grch38.vcf.gz file to each parameter. For details, see the GetPileupSummaries tool documentation.
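The population allele frequency window described in the first comment above can be illustrated with a small sketch. It covers only the frequency range, not the homozygous alternate requirement, and it assumes the bounds are inclusive, which the text does not state. The row dictionaries, their values and the function name `sites_in_af_range` are hypothetical; only the column names alt_count, other_alt_count and allele_frequency come from the table description.

```python
def sites_in_af_range(rows, min_af=0.01, max_af=0.2):
    """Keep rows whose population allele frequency lies within [min_af, max_af]."""
    return [row for row in rows if min_af <= row["allele_frequency"] <= max_af]


# Hypothetical pileup summary rows.
rows = [
    {"position": 1000, "alt_count": 30, "other_alt_count": 0, "allele_frequency": 0.005},
    {"position": 2000, "alt_count": 28, "other_alt_count": 1, "allele_frequency": 0.05},
    {"position": 3000, "alt_count": 12, "other_alt_count": 0, "allele_frequency": 0.45},
]
kept = sites_in_af_range(rows)
print([row["position"] for row in kept])
```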

Second, estimate contamination with CalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table

This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.
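As a minimal sketch of the rule of thumb above, the following flags calls whose alternate allele fraction falls below the contamination estimate. This is only an illustration of a hard cutoff and is not the logic FilterMutectCalls implements; the function name `is_suspect` and the example fractions are hypothetical.

```python
CONTAMINATION = 0.0205  # estimate reported above for the full tumor sample


def is_suspect(alt_allele_fraction, contamination=CONTAMINATION):
    """Return True when the alternate allele fraction is below the contamination estimate."""
    return alt_allele_fraction < contamination


# Hypothetical alternate allele fractions.
for fraction in (0.015, 0.25):
    print(fraction, is_suspect(fraction))
```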


Comments on select parameters

  • CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument.

► Cross-sample contamination differs from normal contamination of tumor and tumor contamination of normal. Currently, the workflow does not account for the latter type of purity issue.

☞ 3.1 What if I find high levels of contamination?

One thing to rule out is sample swaps at the read group level.

Picard’s CrosscheckFingerprints can detect sample-swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual and (ii) check at the read group level that each of the read group data come from the same individual.

Again, imagine if we mistook the contaminating read group data as some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us All read groups related as expected.


4. Filter for confident somatic calls using FilterMutectCalls

FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.

Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.

gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz

This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.


This step seemingly applies 14 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.

So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.

► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.


5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.

First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Alternatively, use the tool from a standalone Picard jar.

java -jar picard.jar \
CollectSequencingArtifactMetrics \
I=tumor.bam \
O=10_tumor_artifact \
R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.



Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.

  • The TOTAL_QSCORE is Phred-scaled such that lower scores equate to a higher probability the change is artifactual. E.g. forty translates to 1 in 10,000 probability. For OxoG, a rough cutoff for concern is 30. FilterByOrientationBias uses the quality score as a prior that a context will produce an artifact. The tool also weighs the evidence from the reads. For example, if the QSCORE is 50 but the allele is supported by 15 reads in F1R2 and no reads in F2R1, then the tool should filter the call.
  • FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations. Oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. In GATK4 Mutect2, the F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.
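The Phred scaling mentioned for TOTAL_QSCORE follows the standard relationship Q = -10 * log10(P). Here is a minimal sketch of the conversion in both directions; the function names are hypothetical.

```python
import math


def phred_to_probability(qscore):
    """Convert a Phred-scaled score to a probability."""
    return 10 ** (-qscore / 10)


def probability_to_phred(probability):
    """Convert a probability to a Phred-scaled score."""
    return -10 * math.log10(probability)


# 40 is the example score from the text; 30 is the rough OxoG cutoff for concern.
for q in (40, 30):
    print(q, phred_to_probability(q))
```

A score of 30 therefore corresponds to a probability ten times higher than a score of 40.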

Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.

gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz

This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.


Is the filtering in line with our earlier prediction?

In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.


☞ 5.1 Tally of applied filters for the tutorial data

The table shows the breakdown in filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallies the instances in which each filter was applied across the calls and the third column tallies the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).
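A tally of this kind can be sketched from the FILTER column alone. The sketch below assumes semicolon-separated filter names, as in standard VCF, and treats PASS and the missing value "." as unfiltered. The function name `tally_filters` and the example FILTER values are hypothetical, although the filter names themselves appear in this tutorial.

```python
from collections import Counter


def tally_filters(filter_fields):
    """Count filter applications, sole-reason applications and mean filters per filtered call."""
    applied = Counter()
    sole = Counter()
    filtered_calls = 0
    for field in filter_fields:
        if field in ("PASS", "."):
            continue  # unfiltered call
        names = field.split(";")
        filtered_calls += 1
        applied.update(names)
        if len(names) == 1:
            sole[names[0]] += 1
    mean_filters = sum(applied.values()) / filtered_calls if filtered_calls else 0.0
    return applied, sole, filtered_calls, mean_filters


# Hypothetical FILTER values.
fields = [
    "PASS",
    "t_lod",
    "panel_of_normals;t_lod",
    "panel_of_normals",
    "artifact_in_normal;germline_risk;panel_of_normals",
]
applied, sole, filtered_calls, mean_filters = tally_filters(fields)
print(applied.most_common(), dict(sole), filtered_calls, mean_filters)
```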


Which filters appear to have the greatest impact? What types of calls do you think compel manual review?

Examine passing records with the following command. Take note of the AD and AF annotation values in particular, as they show the high sensitivity of the caller.

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less


6. Set up in IGV to review somatic calls

Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.

To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect those of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.

First, load Human (hg38) as the reference in IGV. Then load these six files in order:

  • resources/chr17_pon.vcf.gz
  • resources/chr17_af-only-gnomad_grch38.vcf.gz
  • 11_somatic_twicefiltered.vcf.gz
  • 2_tumor_normal_m2.bam
  • normal.bam
  • tumor.bam

With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.

Second, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).

  • One of the tracks is dominating the view. Right-click on track chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.
  • Zoom into the somatic call in 11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.
  • Hover over or click on the gray call in track 11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.
  • Scroll through the alignment data and notice the coverage for the samples.

A C→T variant is in tumor.bam but not normal.bam. What is happening in 2_tumor_normal_m2.bam?

Third, tweak IGV settings that aid in visualizing reassembled alignments.

  • Make room to focus on track 2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bam, normal.bam and their coverages. Right-click and Remove Tracks.
  • Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
  • Drag the alignments panel to center the red variant.
  • Right-click on the alignments track and

    • Group by sample
    • Sort by base
    • Color by tag: HC.
  • Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.


What are the three grouped tracks for the bamout? What do the pastel versus gray colors indicate? How plausible is it that all tumor copies of this locus have this alteration?

Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.


chr17 7,674,220 . C T . PASS DP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15
HCC1143_normal 0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false
HCC1143_tumor 0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946

Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.

chr17 4,539,344 T TA artifact_in_normal;germline_risk;panel_of_normals
chr17 7,221,420 CACTGCCCTAGGTCAGGA C artifact_in_normal;panel_of_normals;str_contraction
chr17 7,483,063 A AC mapping_quality;t_lod
chr17 8,513,688 GTT G panel_of_normals
chr17 19,748,387 G GA t_lod
chr17 26,982,033 G GC artifact_in_normal;clustered_events
chr17 30,059,463 CT C t_lod
chr17 35,422,473 C CA t_lod
chr17 35,671,734 CTT C,CT,CTTT artifact_in_normal;multiallelic;panel_of_normals
chr17 43,104,057 CA C artifact_in_normal;germline_risk;panel_of_normals
chr17 43,104,072 AAAAAAAAAGAAAAG A panel_of_normals;t_lod
chr17 46,332,538 G GT artifact_in_normal;panel_of_normals
chr17 47,157,394 CAA C panel_of_normals;t_lod
chr17 50,124,771 GCACACACACACACACA G clustered_events;panel_of_normals;t_lod
chr17 68,907,890 GA G artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
chr17 69,182,632 C CA artifact_in_normal;t_lod
chr17 69,182,835 GAAAA G panel_of_normals


7. Related resources

The next step after generating a carefully manicured somatic callset is typically functional annotation.

  • Funcotator is available in BETA and can annotate GRCh38 and prior reference aligned VCF format data.
  • Oncotator can annotate GRCh37 and prior reference aligned MAF and VCF format data. It is also possible to download and install the tool following instructions in Article#4154.
  • Annotate with the external program VEP to predict phenotypic changes and confirm or hypothesize biochemical effects.

For a cohort, after annotation, use MutSig to discover driver mutations. MutSigCV (the CV version of MutSig) is available on GenePattern. If more samples are needed to increase the power of the analysis, consider padding the analysis set with TCGA Project or other data.

The dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665 shows somatic copy number alterations for the HCC1143 tumor sample. Its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. Somatic copy number alteration detection will be covered in another GATK tutorial. For reference implementations of Somatic CNV workflows see here.



[1] Data was alt-aware aligned to GRCh38 and post-alt processed. For an introduction to alt-aware alignment and post-alt processing, see [Blog#8180](https://software.broadinstitute.org/gatk/blog?id=8180). The HCC1143 alignments are identical to those in [Tutorial#9183](https://software.broadinstitute.org/gatk/documentation/article?id=9183), which uses GATK3 MuTect2.

[2] For scripted GATK Best Practices Somatic Short Variant Discovery workflows, see [https://github.com/gatk-workflows](https://github.com/gatk-workflows). Within the repository, as of this writing, [gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels), which uses GRCh37, is the sole GATK4 Mutect2 workflow. This tutorial uses additional parameters not used in the [GRCh37 gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels) example because the tutorial data was preprocessed with post-alt processing of alt-aware alignments, which deviates from production practices. The general workflow steps remain the same.

[3] About the tutorial data:

  • The data tarball contains 15 files in the main directory, six files in its resources folder and twenty files in its precomputed folder. Of the files, chr17 refers to data subset to that in the regions in chr17plus.interval_list, the m2pon consists of forty 1000 Genomes Project samples, pon to panel of normals, tumor to the tumor HCC1143 breast cancer sample and normal to its matched blood normal.
  • Again, example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and known as HCC1143 and HCC1143_BL, respectively. The Broad Cancer Genome Analysis (CGA) group has graciously provided 2x76 paired-end whole exome sequence data from the two cell lines (C835.HCC1143_2 and C835.HCC1143_BL.4), and @shlee reverted and aligned these to GRCh38 using alt-aware alignment and post-alt processing as described in Tutorial#8017. During preprocessing, the MergeBamAlignment step was omitted, reads containing adapter sequence were removed altogether for both samples (~0.153% of reads in the tumor) as determined by MarkIlluminaAdapters, base qualities were not binned during base recalibration and indel realignment was included to match the toolchain of the PoN normals. The program group for base recalibration is absent from the BAM headers due to a bug in the version of PrintReads at the time of pre-processing, in January of 2017.
  • Note that the tutorial uses exome data for its small size. The workflow is applicable to whole genome sequence data (WGS).
  • @shlee lifted-over or remapped the gnomAD resource files from GRCh37 counterparts to GRCh38. The tutorial uses subsets of the full resources; the full-length versions are available at gs://gatk-best-practices/somatic-hg38/. The official GRCh37 versions of the resources are available in the GATK Resource Bundle and are based on the gnomAD resource. These GRCh37 versions were prepared by @davidben according to the method outlined in the mutect_resources.wdl and described in [4].
  • The full data in the tutorial were generated by @shlee using the github.com/broadinstitute/gatk mutect2.wdl from between the v4.0.0.0 and v4.0.0.1 release with commit hash b4d1ddd. The GATK Docker image was broadinstitute/gatk: and Picard was v2.14.1. A single modification was made to the script to enable generating the bamout. The script was run locally on a Google Cloud Compute VM using Cromwell v30.1. Given Docker was installed and the specified Docker images were present on the VM, Cromwell automatically launched local Docker container instances during the run and handled the local files as hard-links to avoid redundant copying. Workflow input variables were as follows.
  "Mutect2.is_run_oncotator": "False",
  "Mutect2.is_run_orientation_bias_filter": "True",
  "Mutect2.picard": "/home/shlee/picard-2.14.1.jar",
  "Mutect2.gatk_docker": "broadinstitute/gatk:",
  "Mutect2.oncotator_docker": "broadinstitute/oncotator:",
  "Mutect2.artifact_modes": ["G/T", "C/T"],
  "Mutect2.m2_extra_args": "--af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter",
  "Mutect2.m2_extra_filtering_args": "",
  "Mutect2.scatter_count": "10"
  • If using newer versions of the mutect2.wdl that allow setting SplitIntervals optional arguments, then @shlee recommends setting --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION to avoid splitting contigs.
  • With the exception of the PoN and Picard tool steps, data was generated using v4.0.0.0. The PoN was generated using GATK4 vbeta.6. Besides the syntax, little changed for the Mutect2 workflow between these releases and the workflow and most of its tools remain in beta status as of this writing. We used Picard v2.14.1 for the CollectSequencingArtifactMetrics step. Figures in section 5 reflect results from Picard v2.11.0, which gives, at a glance, identical results as 2.14.1.
  • The three samples in section 2 are present in the forty sample PoN used in section 1 and they are 1000 Genomes Project samples.

[4] The WDL script [mutect_resources.wdl](https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl) takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in _section 1_ and a common biallelic variants resource for use in _section 3_. The script first generates a sites-only VCF and in the process _removes all extraneous annotations_ except for `AF` allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF.


Allele Depth (AD) is lower than expected


The problem:

You're trying to evaluate the support for a particular call, but the numbers in the DP (total depth) and AD (allele depth) fields aren't making any sense. For example, the sum of all the ADs doesn't match up to the DP, or even more baffling, the AD for an allele that was called is zero!

Many users have reported being confused by variant calls where there is apparently no evidence for the called allele. For example, sometimes a VCF may contain a variant call that looks like this:

2 151214 . G A 673.77 . AN=2;DP=10;FS=0.000;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:10:38:702,0,38

You can see in the FORMAT field that the AD values are 0 for both alleles. However, in both the INFO and FORMAT fields, the DP is 10. Because the DP in the INFO field is unfiltered and the DP in the FORMAT field is filtered, you know none of the reads were filtered out by the engine's built-in read filters. And if you look at the "bamout", you see 10 reads covering the position! So why is the VCF reporting an AD value of 0?

The explanation: uninformative reads

This is not actually a bug -- the program is doing what we expect; this is an interpretation problem. The answer lies in uninformative reads.

We call a read “uninformative” when it passes the quality filters, but the likelihood of the most likely allele given the read is not significantly larger than the likelihood of the second most likely allele given the read. Specifically, the difference between the Phred-scaled likelihoods must be greater than 0.2 to be considered significant. In other words, the most likely allele must be roughly 60% more likely than the second most likely allele.

Let’s walk through an example to make this clearer. Let’s say we have 2 reads and 2 possible alleles at a site. All of the reads have passed HaplotypeCaller’s quality filters, and the likelihoods of the alleles given the reads are in the table below.

Reads Likelihood of A Likelihood of T
1 3.8708e-7 3.6711e-7
2 4.9992e-7 2.8425e-7

Note: Keep in mind that HaplotypeCaller marginalizes the likelihoods of the haplotypes given the reads to get the likelihoods of the alleles given the reads. The table above shows the likelihoods of the alleles given the reads. For additional details, please see the HaplotypeCaller method documentation.

Now, let’s convert the likelihoods into Phred-scaled likelihoods. To do this, we simply take the base-10 logarithm (log10) of the likelihoods.

Reads Phred-scaled likelihood of A Phred-scaled likelihood of T
1 -6.4122 -6.4352
2 -6.3011 -6.5463

Now, we want to determine if read 1 is informative. To do this, we simply look at the Phred-scaled likelihoods of the most likely allele and the second most likely allele. The Phred-scaled likelihood of the most likely allele (A) is -6.4122. The Phred-scaled likelihood of the second most likely allele (T) is -6.4352. Taking the difference between the two likelihoods gives us 0.023. Because 0.023 is less than 0.2, read 1 is considered uninformative.

To determine if read 2 is informative, we take -6.3011-(-6.5463). This gives us 0.2452, which is greater than 0.2. Read 2 is considered informative.

How does a difference of 0.2 mean the most likely allele is ~60% more likely than the second most likely allele? Because the likelihoods are on a log10 scale, a difference of 0.2 corresponds to a likelihood ratio of 10^0.2 ≈ 1.585, which is approximately 60% greater.
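The arithmetic for the two reads can be reproduced with a short sketch. It assumes, as in the worked example, that the scaled likelihoods are base-10 logarithms and that the 0.2 threshold is exclusive. The function name `is_informative` is hypothetical and this is not HaplotypeCaller's actual implementation.

```python
import math

THRESHOLD = 0.2  # minimum log10 difference quoted in the text


def is_informative(likelihoods, threshold=THRESHOLD):
    """Return True if the best allele beats the runner-up by more than `threshold` in log10 space."""
    logs = sorted((math.log10(value) for value in likelihoods), reverse=True)
    return (logs[0] - logs[1]) > threshold


# Likelihoods of A and T for reads 1 and 2, taken from the table above.
reads = {1: (3.8708e-7, 3.6711e-7), 2: (4.9992e-7, 2.8425e-7)}
for name, likelihoods in reads.items():
    print(name, is_informative(likelihoods))

# Likelihood ratio implied by a log10 difference of 0.2.
print(10 ** THRESHOLD)
```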


So, now that we know the math behind determining which reads are informative, let’s look at how this affects the record output to the VCF. If a read is considered informative, it gets counted toward the AD and DP of the variant allele in the output record. If a read is considered uninformative, it is counted towards the DP, but not the AD. That way, the AD value reflects how many reads actually contributed support for a given allele at the site. We would not want to include uninformative reads in the AD value because we don’t have confidence in them.
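The counting rule described above can be sketched as a small tally: every read adds to DP, and only informative reads add to the AD of their most likely allele. This is a simplified illustration with a hypothetical function name `tally_depths`. It assumes at least two candidate alleles and the same 0.2 log10 threshold, and it does not reproduce the tool's internals.

```python
import math


def tally_depths(reads, alleles, threshold=0.2):
    """Return (AD per allele, DP) where only informative reads contribute to AD."""
    ad = {allele: 0 for allele in alleles}
    dp = 0
    for likelihoods in reads:
        dp += 1  # every read that passed filters counts toward DP
        ranked = sorted(alleles, key=lambda allele: likelihoods[allele], reverse=True)
        best, second = ranked[0], ranked[1]
        difference = math.log10(likelihoods[best]) - math.log10(likelihoods[second])
        if difference > threshold:
            ad[best] += 1  # informative read supports its most likely allele
    return ad, dp


# The two reads from the worked example above.
reads = [
    {"A": 3.8708e-7, "T": 3.6711e-7},
    {"A": 4.9992e-7, "T": 2.8425e-7},
]
ad, dp = tally_depths(reads, ["A", "T"])
print(ad, dp)
```

With these two reads, only read 2 contributes to AD, so the summed AD is smaller than DP.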

Please note, however, that although an uninformative read is not reported in the AD, it is still used in calculations for genotyping. In future we may add an annotation to indicate counts of reads that were considered informative vs. uninformative. Let us know in the comments if you think that would be helpful.

In most cases, you will have enough coverage at a site to disregard small numbers of uninformative reads. Unfortunately, sometimes uninformative reads are the only reads you have at a site. In this case, we report the potential variant allele, but keep the AD values 0. The uncertainty at the site will be reflected in the GQ and PL values.

somatic variants

GATK "bwa_commandline" variable settings and comparisons to Bowtie2

Hi there,

I've been working on improving a workflow for processing data for my lab, and I wanted to compare the workflow we have set out to some of GATK's best practices workflows. We are currently using Bowtie2 and fgbio+picard to deal with our data.

My question is on the bwa mem command line written in the "processing-for-variant-discovery-gatk4.hg38.wgs.inputs.json"

The bwa mem command line is: "bwa mem -K 100000000 -p -v 3 -t 16 -Y $bash_ref_fasta"
Looking at the manual page of bwa mem, I'm having a hard time finding out what these two variables are: -K and -Y. The two variables are not defined in the bwa mem manual page, so I don't really know what they are doing.

As for the 2nd part, I want to compare the results from alignment that we get from Bowtie2 and Bwa Mem using GATK's settings for BWA mem. Would you happen to know the closest parameters I can set for BT2, so that I can do an apt comparison when aligning? (We're using hg19.fa as the ref files.)


How to parallelise HaplotypeCaller in 4.0.0?



At this time, what is the recommended way to parallelise HaplotypeCaller in GATK4, please? Assuming I care about results and don't want to use the Spark version. In particular, what is the effect of nativePairHmmThreads? So far it has had no influence on the speed of my runs, and yet runs with the same parameters can vary drastically in length (with the record ones taking over a month on a very diverse insect...).

Interpreting results of VQSR on a non-human species


Hello and happy new Year!

I have done VQSR on a non-human data set. Data corresponds to WGS (~20x) on 60 mice sequenced individually. Total number of raw SNPs (according to GATK best practices): 16,012,193

As a truth/training resource, I used the sites that PASSed generic hard filters and that are found in any of the two mouse genotyping arrays: GIGA-Mouse Universal Genotyping Array; Mouse Diversity Array. Total number of SNPs in resource: 313,776.

As a training-only resource I used variants reported by Sanger's Mouse Genomes Project found across 36 strains and that PASSed their filters. There were 9,130,946 sites found in my raw SNPs.

Known sites correspond to dbSNP150.

This is a snippet of the command line:

--resource TRUTH,known=false,training=true,truth=true,prior=12.0:RawSNPs_in_GIGA_or_MDA_OnlyPASS.vcf \
--resource sanger,known=false,training=true,truth=false,prior=10.0:mgp.v5.merged.snps_all.dbSNP142_PASS_final.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:mus_musculus.vcf \

After running VQSR (90% truth sensitivity), there are 8,070,948 SNPs that PASSed (~50% of the raw SNPs), of which 8,060,873 are bi-allelic.

The tranche plot shows a Ti/Tv ratio of 1.7 and 1/5 of false positives at tranche 90. Also, Ti/Tv has a wide range across tranches (1.7 to 1.078). Overall, I think the tranche plot is telling me there is room for improvement.

However, the tranche plot refers to novel variants (not found in any of the resources, incl. dbSNP), and the Ti/Tv ratio for variants found in dbSNP, as reported by Picard's CollectVariantCallingMetrics, is 2.15, which is much more satisfactory and represents >95% of all bi-allelic SNPs.

|    8060873|       7674154|     386719|  0.952025|   2.146346|   1.700175|
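For readers less familiar with the metric, here is a minimal Python sketch of how a Ti/Tv ratio is computed from a list of SNPs. The function names and the toy input are made up for illustration; this is not how Picard or GATK compute the values shown above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine<->purine or pyrimidine<->pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")


# Toy example: 4 transitions and 2 transversions
toy_snps = [("A", "G"), ("C", "T"), ("T", "C"), ("G", "A"), ("A", "C"), ("G", "T")]
print(titv(toy_snps))
```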

I would like to clarify the following:

1) Was the construction of the truth set reasonable (i.e. enough sites)?

2) If novel variants are more likely to be false positives, how are false and true positives defined in the tranche plot, which is constructed from novel variants only?

3) Should I deal with the low Ti/Tv ratio at tranche 90, considering that novel SNPs correspond to <5% of all PASSing SNPs after VQSR?

Your feedback will be greatly appreciated!

I have two bams with equivalent sams; one yields a validation error and the other does not.


Hi! I have two bam files whose sam equivalents are identical-- as in:

diff <(samtools view -h small.bam) <(samtools view -h smalltest.bam)

yields nothing, and when I run haplotype caller on one file I get errors that say (for every read):

Ignoring SAM validation error: ERROR: Record 1, Read name RSRS1, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned

and no SNPs are generated, while the other file processes just fine.

Needless to say, the bin fields are the same.

To be clear, I generated one of the files, it generated the error, and when I converted from bam->sam->bam, GATK processed it correctly.
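For context on the error message: the bin is a field stored in the binary BAM record and is not printed in SAM text, so two BAMs can produce identical `samtools view` output while storing different bin values. The SAM specification defines the expected value with a `reg2bin` function; below is a Python transcription of it, offered as a sketch for checking a record by hand rather than as the validator's actual code.

```python
def reg2bin(beg, end):
    """Compute the BAM/BAI bin for a 0-based, half-open interval [beg, end).

    Transcribed from the reg2bin C function in the SAM specification.
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


# A read aligned to the first 100 bases falls in the first 16 kb bin
print(reg2bin(0, 100))
```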

I'm using gatk-, samtools version 1.7 (htslib 1.9)


Genotypes in vcf with a GQ=0



I'm using GATK version 3.7 and I have a problem similar to the point raised in 2015 and titled "Unexpected genotype likelihoods & genotype qualities".
I've used HaplotypeCaller to generate individual GVCFs and after that, have used GenotypeGVCFs on multiple individual samples.
In the resulting vcf, I have some genotypes 0/0 with a GQ=0, whereas I was expecting genotypes ./. (I suppose that when 2 genotypes have the same probability, the genotype should be unknown).

Could you please tell me how to force GATK to put ./. when GQ=0?
I've tried the GenotypeGVCFs command with version 3.8 and the same problem occurs.
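To make the situation concrete, here is a small Python sketch of how GQ relates to the PL values and how a GQ=0 genotype could be turned into a no-call in post-processing. It assumes GQ is the gap between the two lowest PLs, capped at 99; the function names are hypothetical and this is not a GATK option.

```python
def genotype_quality(pls, cap=99):
    """GQ as the difference between the second-lowest and lowest PL, capped."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], cap)


def call_or_nocall(gt, pls):
    """Replace the genotype with ./. when the two best genotypes are tied."""
    return "./." if genotype_quality(pls) == 0 else gt


# Two genotypes share PL=0, so GQ is 0 and the call becomes ./.
print(call_or_nocall("0/0", [0, 0, 45]))
# A clear hom-ref call is left alone
print(call_or_nocall("0/0", [0, 30, 400]))
```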

Thank you very much.

Error in trying to use VariantRecalibrator



I'm trying to use the VariantRecalibrator on SNP calls from ~160 yeast genomes. Because there are no resources for known SNPs, I had to create my own. As all the genomes are of yeast that derived in an evolution experiment, they have a (relatively small) common set of differences relative to the reference. As coverage varies between strains, and across the genome, not all common SNPs get called in all strains, so I generated one vcf file with SNPs that were called in 25% or more of the strains - manual examination of these suggests that they are all real (I called this truth.vcf - it has ~120 SNPs) - and one file of SNPs that were called in between 10-25% of strains (I called this train.vcf - it only has ~25 SNPs). I then try to do variant recalibration as:

gatk- VariantRecalibrator -R Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fasta -V called_genotypes.vcf.gz --resource truth,known=false,training=true,truth=true,prior=15.0:truth.vcf --resource train,known=false,training=true,truth=false,prior=10.0:train.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -O output.recal --tranches-file output.tranches --rscript-file output.plots.R --max-gaussians=4

but got the following output:

Using GATK jar /Volumes/Promise_Pegasus/gatk-
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /Volumes/Promise_Pegasus/gatk- VariantRecalibrator -R Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fasta -V called_genotypes.vcf.gz --resource truth,known=false,training=true,truth=true,prior=15.0:truth.vcf --resource train,known=false,training=true,truth=false,prior=10.0:train.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -O output.recal --tranches-file output.tranches --rscript-file output.plots.R --max-gaussians=4
18:23:35.917 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Volumes/Promise_Pegasus/gatk-!/com/intel/gkl/native/libgkl_compression.dylib
18:23:37.594 INFO VariantRecalibrator - ------------------------------------------------------------
18:23:37.594 INFO VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.0.11.0
18:23:37.595 INFO VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
18:23:37.595 INFO VariantRecalibrator - Executing as sherlock@hopslam.local on Mac OS X v10.13.3 x86_64
18:23:37.595 INFO VariantRecalibrator - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_51-b16
18:23:37.595 INFO VariantRecalibrator - Start Date/Time: January 9, 2019 6:23:35 PM PST
18:23:37.595 INFO VariantRecalibrator - ------------------------------------------------------------
18:23:37.595 INFO VariantRecalibrator - ------------------------------------------------------------
18:23:37.596 INFO VariantRecalibrator - HTSJDK Version: 2.16.1
18:23:37.596 INFO VariantRecalibrator - Picard Version: 2.18.13
18:23:37.596 INFO VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:23:37.596 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:23:37.596 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:23:37.596 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:23:37.596 INFO VariantRecalibrator - Deflater: IntelDeflater
18:23:37.596 INFO VariantRecalibrator - Inflater: IntelInflater
18:23:37.596 INFO VariantRecalibrator - GCS max retries/reopens: 20
18:23:37.596 INFO VariantRecalibrator - Requester pays: disabled
18:23:37.596 INFO VariantRecalibrator - Initializing engine
18:23:38.024 INFO FeatureManager - Using codec VCFCodec to read file file:///Volumes/Promise_Pegasus/Lucas/whole_genome_seq/Gavin_Haploid_Analysis/truth.vcf
18:23:38.040 INFO FeatureManager - Using codec VCFCodec to read file file:///Volumes/Promise_Pegasus/Lucas/whole_genome_seq/Gavin_Haploid_Analysis/train.vcf
18:23:38.047 INFO FeatureManager - Using codec VCFCodec to read file file:///Volumes/Promise_Pegasus/Lucas/whole_genome_seq/Gavin_Haploid_Analysis/called_genotypes.vcf.gz
18:23:38.082 INFO VariantRecalibrator - Done initializing engine
18:23:38.086 INFO TrainingSet - Found truth track: Known = false Training = true Truth = true Prior = Q15.0
18:23:38.086 INFO TrainingSet - Found train track: Known = false Training = true Truth = false Prior = Q10.0
18:23:38.092 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "recal". Defaulting to VCF.
18:23:38.107 INFO ProgressMeter - Starting traversal
18:23:38.108 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
18:23:38.347 INFO VariantRecalibrator - No variants filtered by: AllowAllVariantsVariantFilter
18:23:38.348 INFO ProgressMeter - XI:67773 0.0 1598 399500.0
18:23:38.348 INFO ProgressMeter - Traversal complete. Processed 1598 total variants in 0.0 minutes.
18:23:38.349 INFO VariantDataManager - QD: mean = 31.03 standard deviation = 2.13
18:23:38.349 INFO VariantDataManager - MQ: mean = 56.26 standard deviation = 5.91
18:23:38.349 INFO VariantDataManager - MQRankSum: mean = 0.17 standard deviation = 0.65
18:23:38.350 INFO VariantDataManager - ReadPosRankSum: mean = 0.22 standard deviation = 0.89
18:23:38.350 INFO VariantDataManager - FS: mean = 0.53 standard deviation = 3.12
18:23:38.351 INFO VariantDataManager - SOR: mean = 1.03 standard deviation = 0.77
18:23:38.351 INFO VariantDataManager - DP: mean = 3260.79 standard deviation = 10238.34
18:23:38.358 INFO VariantDataManager - Annotation order is: [DP, MQ, QD, FS, ReadPosRankSum, MQRankSum, SOR]
18:23:38.358 INFO VariantDataManager - Training with 112 variants after standard deviation thresholding.
18:23:38.358 WARN VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable.
18:23:38.361 INFO GaussianMixtureModel - Initializing model with 100 k-means iterations...
18:23:38.394 INFO VariantRecalibratorEngine - Finished iteration 0.
18:23:38.412 INFO VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.04664
18:23:38.419 INFO VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.05034
18:23:38.421 INFO VariantRecalibratorEngine - Convergence after 11 iterations!
18:23:38.424 WARN VariantRecalibratorEngine - Model could not pre-compute denominators.
18:23:38.424 INFO VariantDataManager - Selected worst 0 scoring variants --> variants with LOD <= -5.0000.
18:23:38.427 INFO VariantRecalibrator - Shutting down engine
[January 9, 2019 6:23:38 PM PST] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 0.04 minutes.
java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:34)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.onTraversalSuccess(VariantRecalibrator.java:656)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:968)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

I don't understand what the "No data found" error at the end means, but it results in me having no SNPs in the output.recal file (should that be a vcf file?), and I get no .R file from which to generate any plots.

Running MuTect2 without knowing my BAM header

I have BAM files that were generated elsewhere and I'm trying to run MuTect2 on tumor/normal pairs, but I keep getting the following message:

A USER ERROR has occurred: Bad input: Sample Sample1.IS.noPG is not in BAM header: []

I'm not sure what the header actually is, and I'm having no luck using 'samtools reheader' to rename the header.
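One way to see which sample names a BAM header declares is to look at the SM tags on its @RG lines, for example in the text printed by `samtools view -H`. The Python sketch below parses such header text; the function name and the example header are made up for illustration. An empty result would be consistent with the `[]` in the error message.

```python
def samples_in_header(header_text):
    """Collect the SM (sample) values from @RG lines of SAM header text."""
    samples = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SM:") and field[3:] not in samples:
                samples.append(field[3:])
    return samples


# Hypothetical header text, as would be printed by `samtools view -H`
with_rg = "@HD\tVN:1.5\tSO:coordinate\n@RG\tID:rg1\tSM:tumor\tPL:ILLUMINA\n"
without_rg = "@HD\tVN:1.5\tSO:coordinate\n@SQ\tSN:1\tLN:248956422\n"

print(samples_in_header(with_rg))     # one sample name
print(samples_in_header(without_rg))  # empty list: no read groups
```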

Any suggestions?

How does MuTect2 assign germline_risk filter


I have read the MuTect1 paper and the MuTect2 code, and it seems the germline_risk filter is assigned this way:
1. If the variant is in dbSNP but not in COSMIC, the NLOD cutoff is 5.5.
2. Otherwise, the cutoff is 2.2.

Some variants that look somatic are nevertheless labelled germline_risk. One example is:
chr11 123456789 . C A . germline_risk ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=3.61;TLOD=12.81 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:13,0:0.00:0:0:.:440,0:10:3 0/1:10,6:0.357:2:4:0.667:336,173:4:6

And I used the MuTect2 default settings: initial_tumor_lod=4.0 initial_normal_lod=0.5 tumor_lod=6.3 normal_lod=2.2 dbsnp_normal_lod=5.5
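The rule as I understand it can be written as a short Python sketch. This encodes only my reading described above and has not been checked against the MuTect2 source; the function name and the strict "less than" comparison are assumptions.

```python
def germline_risk(nlod, in_dbsnp, in_cosmic,
                  normal_lod=2.2, dbsnp_normal_lod=5.5):
    """Return True if the site would be flagged germline_risk.

    Uses the stricter dbSNP cutoff when the site is in dbSNP but not COSMIC.
    """
    cutoff = dbsnp_normal_lod if (in_dbsnp and not in_cosmic) else normal_lod
    return nlod < cutoff


# The example record has NLOD=3.61
print(germline_risk(3.61, in_dbsnp=True, in_cosmic=False))   # below 5.5
print(germline_risk(3.61, in_dbsnp=False, in_cosmic=False))  # above 2.2
```

Under this reading, the example variant would only be flagged if it is in dbSNP and not in COSMIC.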

Spanning or overlapping deletions (* allele)


We use the term spanning deletion or overlapping deletion to refer to a deletion that spans a position of interest.

The presence of a spanning deletion affects how we can represent genotypes at any site(s) that it spans for those samples that carry the deletion, whether in heterozygous or homozygous variant form. Page 8, item 5 of the VCF v4.3 specification reserves the * allele to reference overlapping deletions. This is not to be confused with the bracketed asterisk <*> used to denote symbolic alternate alleles.


Here we illustrate with four human samples. Bob and Lian each have a heterozygous A to T single nucleotide polymorphism at position 20, our position of interest. Kyra has a 9 bp deletion from position 15 to 23 on both homologous chromosomes that extends across position 20. Lian and Omar each are heterozygous for the same 9 bp deletion. Omar and Bob's other allele is the reference A.

What are the genotypes for each individual at position 20? For Bob, the reference A and variant T alleles are clearly present for a genotype of A/T.

What about Lian? Lian has a variant T allele plus a 9 bp deletion overlapping position 20. To notate the deletion as we do single nucleotide deletions is technically inaccurate. We need a placeholder notation to signify absent sequence that extends beyond the position of interest and that is listed for an earlier position, in our case position 14. The solution is to use a star or asterisk * at position 20 to refer to the spanning deletion. Using this convention, Lian's genotype is T/*.

At the sample-level, Kyra and Omar would not have records for position 20. However, we are comparing multiple samples and so we indicate the spanning deletion at position 20 with *. Omar's genotype is A/* and Kyra's is */*.
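The four genotypes can be reproduced with a small Python sketch. The haplotype encoding (tuples tagged "ref", "snp" or "del") and the function `allele_at` are invented for this illustration and do not correspond to any GATK data structure.

```python
REF_BASE = "A"
POSITION = 20


def allele_at(haplotype, pos=POSITION, ref=REF_BASE):
    """Allele a single haplotype shows at pos; '*' marks a spanning deletion."""
    kind = haplotype[0]
    if kind == "ref":
        return ref
    if kind == "snp":
        _, snp_pos, alt = haplotype
        return alt if snp_pos == pos else ref
    if kind == "del":
        _, start, end = haplotype  # deleted bases, inclusive coordinates
        return "*" if start <= pos <= end else ref
    raise ValueError(f"unknown haplotype kind: {kind}")


DELETION = ("del", 15, 23)  # the 9 bp deletion from the example
samples = {
    "Bob": [("ref",), ("snp", 20, "T")],
    "Lian": [("snp", 20, "T"), DELETION],
    "Omar": [("ref",), DELETION],
    "Kyra": [DELETION, DELETION],
}

genotypes = {
    name: "/".join(allele_at(h) for h in haps) for name, haps in samples.items()
}
print(genotypes)
```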


In the VCF, depending on the format used by tools, positions equivalent to our example position 20 may or may not be listed. If listed, such as in the first example VCF shown, the spanning deletion is noted with the asterisk * under the ALT column. The spanning deletion is then referred to in the genotype GT for Kyra, Lian and Omar. Alternatively, a VCF may altogether avoid referencing the spanning deletion by listing the variant with the spanning deletion together with the deletion. This is shown in the second example VCF at position 14.
