Channel: Recent Discussions — GATK-Forum

Where is "known_indels_sites_VCFs" defined?


Dear GATK team,

I have been translating your WDL files into shell scripts to map them better to the scheduler on our Linux cluster (shell scripts are not already available anywhere, are they?).

At some point in the PairedEndSingleSampleWf.wdl you reference known_indels_sites_VCFs. I thought this array would be defined in JointGenotypingWf.hg38.inputs.json; however, the name known_indels_sites_VCFs is not mentioned there, and the files listed under "##_COMMENT4": "KNOWN SITES RESOURCES" are not only known indel sites but also SNPs. So my question is: is known_indels_sites_VCFs this entire list or some subset of said list? If it is a subset, where is it defined?

Highest regards,

Freek
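
For what it's worth, in public copies of the single-sample pipeline this array is typically defined in the workflow's own inputs JSON (e.g. PairedEndSingleSampleWf.hg38.inputs.json), not in the JointGenotyping inputs, and it holds only the indel resources used for BQSR. A sketch of what such an entry usually looks like; the exact key prefix and paths are assumptions that depend on your pipeline version:

{
  "PairedEndSingleSampleWorkflow.known_indels_sites_VCFs": [
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "Homo_sapiens_assembly38.known_indels.vcf.gz"
  ]
}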


Algorithm question for VQSR


As far as I understand, VQSR selects a pool of SNPs present in both the test set and a known, annotated SNP database. These SNPs are considered true variants, and a Gaussian mixture model is built from the features of these true variants to classify the remaining SNPs.

These true SNPs will be clustered using the Gaussian model. However, a Gaussian mixture model means we are also clustering "bad" SNPs. I imagine that these "bad" SNPs have different kinds of poor quality in different directions, so in the end the Gaussian mixture model will produce multiple clusters (one true-SNP cluster and multiple bad-SNP clusters), right?

Then why can't we just use a simple Gaussian model to draw the distribution of true SNPs, so that any SNP far from this cluster is more likely to be false?
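
For reference, VQSR is close to this idea but trains two models rather than one: a positive Gaussian mixture model on variants that overlap the truth/training resources, and a negative model on the worst-scoring variants. Every candidate is then scored by the log odds ratio of the two likelihoods:

VQSLOD = \ln \frac{P(\mathbf{x} \mid \text{positive model})}{P(\mathbf{x} \mid \text{negative model})}

where \mathbf{x} is the vector of annotation values at the site. Modeling the bad variants explicitly, rather than calling everything far from the good cluster false, sharpens the decision boundary precisely because artifacts form their own clusters in annotation space, as the question anticipates.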

Filtering VCF file to remove ./.


Hello,
I am trying to understand my sample format in my merged VCF file of RNA-seq SNPs produced from the GATK best practices. I have several VCF files that I have merged into one file using CombineVariants. Before the files are merged, the format of each sample is mostly 1/1 and 0/1. I understand what these genotypes mean, but after I merge the files, I end up with lots of SNPs that have the genotype "./." while one of the other samples has 1/1. I have been reading through other people's work and it seems like maybe "./." indicates that this SNP did not have a high enough quality for this sample? I want to select from my merged VCF file only those variants that pass the quality for all individual files. Just to clarify, if I had three VCF files merged and one variant in these files had the genotypes 1/1, 1/0, and 0/0 respectively, I want to keep that variant. However, if there was a variant with the genotypes 1/1, 1/0, and ./., I don't want to keep that variant. Am I understanding what ./. means correctly? And is there an easy way to remove these variants from my merged file? Thank you very much for your help!
Leigh Ann
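
One approach, assuming GATK 3.6 or later: SelectVariants can drop sites with any no-call genotypes via --maxNOCALLnumber. A minimal sketch with placeholder file names:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fasta \
-V merged.vcf \
--maxNOCALLnumber 0 \
-o merged.nocalls_removed.vcf

Note that after a CombineVariants merge, ./. usually just means that sample's input VCF had no record at that position (no call was attempted), which is not quite the same as the site failing a quality threshold in that sample.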

Interpretation GATK BaseRecalibration report


I am having some difficulties understanding the plots from the GATK BaseRecalibration report. Is there any guide or tutorial available that could help me make sense of them? Thank you.
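
The plots come from AnalyzeCovariates, so regenerating them from a before/after pair of recalibration tables lets you inspect individual covariates more closely. A minimal sketch, assuming a GATK4-style invocation and placeholder file names:

gatk AnalyzeCovariates \
-before recal_before.table \
-after recal_after.table \
-plots recalibration_plots.pdf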

combineGVCFs with duplicate sample id?


I am performing the joint calling workflow on a large batch of samples, and I have a handful that were sequenced twice using two different capture kits. For these, the sample ID in the GVCFs is the same. I am looking for an option like -genotypeMergeOption UNIQUIFY for CombineGVCFs that will make the sample names unique. I see that if two GVCFs with the same ID are given to CombineGVCFs, the ID is present only once in the resulting combined GVCF header, and if the ID is present in two different combined GVCFs that are given to GenotypeGVCFs, the ID is only present once in the output. What is the recommended practice here? I would like to avoid rerunning my pipeline to make the names unique in the single-sample GVCFs.
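
One workaround that avoids rerunning the pipeline: rename the sample in the header of one of the duplicate GVCFs before combining. Picard ships a RenameSampleInVcf tool for this; a minimal sketch with placeholder names:

java -jar picard.jar RenameSampleInVcf \
I=sampleA_kit2.g.vcf.gz \
O=sampleA_kit2.renamed.g.vcf.gz \
NEW_SAMPLE_NAME=sampleA_kit2

The renamed GVCF may need a fresh index before it is passed to CombineGVCFs.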

how to do downsampling


Hello,

Could anyone tell me how to do downsampling analysis by using GATK tools? The GATK version I use is 4.0.4.0.

Below is my command:
gatk PrintReads -R path/to/hg19.fa -I LIB.bam -O LIB_downsample10.bam --downsample_to_coverage 10

But it always threw this error:
A USER ERROR has occurred: downsample_to_coverage is not a recognized option

When I changed the '--downsample_to_coverage' to '-dcov', the error became this:
A USER ERROR has occurred: d is not a recognized option

Could you please kindly help?

Thank you very much.
Li
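
A note on what changed: -dcov/--downsample_to_coverage was a GATK3 engine argument and was not carried over to GATK4's PrintReads. In GATK 4.0.4.0 the bundled Picard tool DownsampleSam can downsample by a fraction of reads (not by target coverage); a minimal sketch with placeholder names:

gatk DownsampleSam \
--INPUT LIB.bam \
--OUTPUT LIB_downsample.bam \
--PROBABILITY 0.1

To approximate a target coverage such as 10x, estimate the current mean coverage first and set --PROBABILITY to target/current.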

DiscoverVariantsFromContigAlignmentsSAMSpark error


Hello there!

I am using "DiscoverVariantsFromContigAlignmentsSAMSpark" to call SNPs on contigs from an assembly. The assembly was done by FALCON on PacBio reads. While running the command, I receive an interval error for the SAM file. In preparation, I mapped the contigs to the reference genome using minimap2, converted the resulting CRAM file to a SAM file, added read group info and sorted the SAM file. I also ran "ValidateSamFile" successfully. I am running the latest version of GATK. Here are the command and part of the error message:

Command:

gatk DiscoverVariantsFromContigAlignmentsSAMSpark -R $ref_genome -I $input_file -O ${input_file}_gatk_Vcalled.vcf

Error:

ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 162)
java.lang.IllegalArgumentException: Invalid interval. Contig:16 start:46400198 end:46400197
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:687)
at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:61)
at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:37)
at org.broadinstitute.hellbender.tools.spark.sv.discovery.alignment.ContigAlignmentsModifier.splitGappedAlignment(ContigAlignmentsModifier.java:310)
at org.broadinstitute.hellbender.tools.spark.sv.discovery.SvDiscoverFromLocalAssemblyContigAlignmentsSpark$SAMFormattedContigAlignmentParser.lambda

Any help on resolving the issue is appreciated!

Does ReadBackedPhasing rely on a VCF's GT field?


I have a VCF file that is missing the GT field. Can I just add 0/1 for each variant, and let GATK's ReadBackedPhasing take care of resolving the actual phased genotypes?


VariantRecalibrator parameter question


For the VariantRecalibrator program, there is an option "--trust-all-polymorphic". The documentation says

"Trust that all the input training sets' unfiltered records contain only polymorphic sites to drastically speed up the computation."

What I'm trying to figure out is whether this means that the sites in the training dataset are polymorphic in the training set or in the test set. For example, I have a set of data I'm using as my training dataset (not human data). I've filtered it to a set of sites that I am confident in, and would like to use as my training set. Within this set, all those sites are polymorphic.
I have a test set of data, with different individuals, which I would like to filter. In this test set, some of the sites identified in the training set will be polymorphic, but some will not be. In this case, should I set --trust-all-polymorphic to TRUE?

Spark


In a nutshell, Spark is a piece of software that GATK4 uses to do multithreading, which is a form of parallelization that allows a computer (or cluster of computers) to finish executing a task sooner. You can read more about multithreading and parallelism in GATK here. The Spark software library is open-source and maintained by the Apache Software Foundation. It is very widely used in the computing industry and is one of the most promising technologies for accelerating execution of analysis pipelines.


Not all GATK tools use Spark

Tools that can use Spark generally have a note to that effect in their respective Tool Doc.

- Some GATK tools exist in distinct Spark-capable and non-Spark-capable versions

The "sparkified" versions have the suffix "Spark" at the end of their names. Many of these are still experimental; down the road we plan to consolidate them so that there will be only one version per tool.

- Some GATK tools only exist in a Spark-capable version

Those tools don't have the "Spark" suffix.


You don't need a Spark cluster to run Spark-enabled GATK tools!

If you're working on a "normal" machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine -- or however many you choose to allocate. See the example parameters below and the local-Spark tutorial for more information on how to control this. And if your machine only has a single core, these tools can always be run in single-core mode -- it'll just take longer for them to finish.

To be clear, even the Spark-only tools can be run on regular machines, though in practice a few of them may be prohibitively slow (SV tools and PathSeq). See the Tool Docs for tool-specific recommendations.

If you do have access to a Spark cluster, the Spark-enabled tools are going to be extra happy but you may need to provide some additional parameters to use them effectively. See the cluster-Spark tutorial for more information.

Example command-line parameters

Here are some example arguments you would give to a Spark-enabled GATK tool:

  • --sparkMaster local[*] -> "Run on the local machine using all cores"
  • --sparkMaster local[2] -> "Run on the local machine using two cores"
  • --sparkMaster spark://23.195.26.187:7077 -> "Run on the cluster at 23.195.26.187, port 7077"
  • --sparkRunner GCS --cluster my_cluster -> "Run on my_cluster in Google Dataproc"
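
For example, a complete local invocation might look like the following. This is a sketch: MarkDuplicatesSpark stands in for any Spark-enabled tool, the file names are placeholders, and some GATK4 releases spell the argument --spark-master and expect Spark arguments after a -- separator, so check your version's documentation:

gatk MarkDuplicatesSpark \
-I input.bam \
-O marked_duplicates.bam \
--sparkMaster local[4]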

You don't need to install any additional software to use Spark in GATK

All the necessary software for using Spark, whether it's on a local machine or a Spark cluster, is bundled within the GATK itself. Just make sure to invoke GATK using the gatk wrapper script rather than calling the jar directly, because the wrapper will select the appropriate jar file (there are two!) and will set some parameters for you.

ReadBackedPhasing with several BAM files possible?


Hello,
I have run HaplotypeCaller on 12 BAM files all together at once, using "Variant-only calling on DNAseq". Now I have one VCF file containing all my variants. I would like to run ReadBackedPhasing in order to phase my SNPs. However, I see in the manual that one BAM file is required to provide physical information. Is it possible, in my case, to provide not one, but all 12 of my BAM files in a single command?
(I am aware that running the GVCF workflow would have phased my SNPs already, but I realised that too late...)
Thank you in advance for your help :)
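
One possibility, assuming the GATK3 engine's usual handling of multiple -I inputs (a sketch with placeholder names, so treat it as a starting point):

java -jar GenomeAnalysisTK.jar \
-T ReadBackedPhasing \
-R reference.fasta \
-I sample01.bam -I sample02.bam -I sample12.bam \
-V all_samples.vcf \
-o all_samples.phased.vcf

Alternatively, the engine accepts a .list file (one BAM path per line) passed to a single -I.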

"java.lang.NullPointerException" error


Dear developers,
So recently I encountered a "java.lang.NullPointerException" error when dealing with 400+ samples in the SVDiscovery step. The run was divided into 300+ jobs, and about 30 jobs failed due to this error. After I reran the failed jobs, about 15 of them worked, and the rest still failed with the same error. Rerunning the jobs again helps a bit, but most of the remaining jobs still fail.

Here is the one example error:

ERROR 11:12:08,761 FunctionEdge - Error:  'java'  '-Xmx4096m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/u/home/h/hjzhou/.queue/tmp'  '-cp' '/u/home/a/alden/svtoolkit/lib/SVToolkit.jar:/u/home/a/alden/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/u/home/a/alden/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVDiscovery '-T' 'SVDiscoveryWalker'  '-R' '/u/home/a/alden/eeskin2/bipolar_sv/svtoolkit/reference/Homo_sapiens_assembly19.fasta'  '-I' '/u/home/h/hjzhou/batch_redo449.list'  '-O' '/u/flashscratch/h/hjzhou/redo_discovery_out/deletions100k.svtoolkit2017April.mask/P0263.discovery.vcf.gz'  '-disableGATKTraversal' 'true'  '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.0/metadata' '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.1/metadata' '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.2/metadata' '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.3/metadata' '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.4/metadata' '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.0/metadata' '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.1/metadata' '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.2/metadata' '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.3/metadata' '-md' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.4/metadata'  '-configFile' '/u/home/a/alden/svtoolkit/conf/genstrip_parameters.txt'  '-runDirectory' '/u/flashscratch/h/hjzhou/redo_discovery_out/deletions100k.svtoolkit2017April.mask'  '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.0/metadata/sample_gender.report.txt' '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.1/metadata/sample_gender.report.txt' '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.2/metadata/sample_gender.report.txt' '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.3/metadata/sample_gender.report.txt' '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.4/metadata/sample_gender.report.txt' '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.0/metadata/sample_gender.report.txt' '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.1/metadata/sample_gender.report.txt' '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.2/metadata/sample_gender.report.txt' '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.3/metadata/sample_gender.report.txt' '-genderMapFile' '/u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.4/metadata/sample_gender.report.txt'  '-genomeMaskFile' '/u/home/a/alden/eeskin2/bipolar_sv/svtoolkit/reference/Homo_sapiens_assembly19.svmask.fasta'  '-partitionName' 'P0263'  '-runFilePrefix' 'P0263'  '-storeReadPairFile' 'true'  -L chr17:19997001-30103001 -searchLocus chr17:20000001-30000000 -searchWindow chr17:19997001-30103001 -searchMinimumSize 100 -searchMaximumSize 100000
ERROR 11:12:08,770 FunctionEdge - Contents of /u/flashscratch/h/hjzhou/redo_discovery_out/deletions100k.svtoolkit2017April.mask/logs/SVDiscovery-263.out:
INFO  11:07:06,817 HelpFormatter - -----------------------------------------------------------------------------------------
INFO  11:07:06,821 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5.GS-r1732-0-gf101448, Compiled 2017/04/18 15:39:27
INFO  11:07:06,821 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  11:07:06,821 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  11:07:06,826 HelpFormatter - Program Args: -T SVDiscoveryWalker -R /u/home/a/alden/eeskin2/bipolar_sv/svtoolkit/reference/Homo_sapiens_assembly19.fasta -O /u/flashscratch/h/hjzhou/redo_discovery_out/deletions100k.svtoolkit2017April.mask/P0263.discovery.vcf.gz -disableGATKTraversal true -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.0/metadata -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.1/metadata -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.2/metadata -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.3/metadata -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.4/metadata -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.0/metadata -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.1/metadata -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.2/metadata -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.3/metadata -md /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.4/metadata -configFile /u/home/a/alden/svtoolkit/conf/genstrip_parameters.txt -runDirectory /u/flashscratch/h/hjzhou/redo_discovery_out/deletions100k.svtoolkit2017April.mask -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.0/metadata/sample_gender.report.txt -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.1/metadata/sample_gender.report.txt -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.2/metadata/sample_gender.report.txt -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.3/metadata/sample_gender.report.txt -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.4/metadata/sample_gender.report.txt -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.0/metadata/sample_gender.report.txt -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.1/metadata/sample_gender.report.txt -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.2/metadata/sample_gender.report.txt -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.3/metadata/sample_gender.report.txt -genderMapFile /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.4/metadata/sample_gender.report.txt -genomeMaskFile /u/home/a/alden/eeskin2/bipolar_sv/svtoolkit/reference/Homo_sapiens_assembly19.svmask.fasta -partitionName P0263 -runFilePrefix P0263 -storeReadPairFile true -L chr17:19997001-30103001 -searchLocus chr17:20000001-30000000 -searchWindow chr17:19997001-30103001 -searchMinimumSize 100 -searchMaximumSize 100000
INFO  11:07:06,833 HelpFormatter - Executing as hjzhou@n6221 on Linux 2.6.32-696.18.7.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14.
INFO  11:07:06,833 HelpFormatter - Date/Time: 2018/06/07 11:07:06
INFO  11:07:06,833 HelpFormatter - -----------------------------------------------------------------------------------------
INFO  11:07:06,834 HelpFormatter - -----------------------------------------------------------------------------------------
INFO  11:07:07,133 07-Jun-2018 GenomeAnalysisEngine - Strictness is SILENT
INFO  11:07:07,236 07-Jun-2018 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  11:07:07,256 07-Jun-2018 IntervalUtils - Processing 10106001 bp from intervals
INFO  11:07:07,354 07-Jun-2018 GenomeAnalysisEngine - Preparing for traversal
INFO  11:07:07,356 07-Jun-2018 GenomeAnalysisEngine - Done preparing for traversal
INFO  11:07:07,356 07-Jun-2018 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  11:07:07,356 07-Jun-2018 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  11:07:07,356 07-Jun-2018 ProgressMeter -        Location |     reads | elapsed |     reads | completed | runtime |   runtime
INFO  11:07:07,357 07-Jun-2018 SVDiscovery - Initializing SVDiscovery ...
INFO  11:07:07,357 07-Jun-2018 SVDiscovery - Reading configuration file ...
INFO  11:07:07,362 07-Jun-2018 SVDiscovery - Read configuration file.
INFO  11:07:07,362 07-Jun-2018 SVDiscovery - Opening reference sequence ...
INFO  11:07:07,362 07-Jun-2018 SVDiscovery - Opened reference sequence.
INFO  11:07:07,363 07-Jun-2018 SVDiscovery - Opening genome mask ...
INFO  11:07:07,364 07-Jun-2018 SVDiscovery - Opened genome mask.
INFO  11:07:07,364 07-Jun-2018 SVDiscovery - Initializing input data set ...
INFO  11:07:11,540 07-Jun-2018 SVDiscovery - Initialized data set: 449 files, 449 read groups, 449 samples.
INFO  11:07:11,541 07-Jun-2018 MetaData - Opening metadata ...
INFO  11:07:11,543 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.0/metadata ...
INFO  11:07:11,543 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.1/metadata ...
INFO  11:07:11,544 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.2/metadata ...
INFO  11:07:11,545 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.3/metadata ...
INFO  11:07:11,546 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch1.4/metadata ...
INFO  11:07:11,547 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.0/metadata ...
INFO  11:07:11,548 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.1/metadata ...
INFO  11:07:11,549 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.2/metadata ...
INFO  11:07:11,549 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.3/metadata ...
INFO  11:07:11,550 07-Jun-2018 MetaData - Adding metadata directory /u/flashscratch/h/hjzhou/bipolar_sv_large/batch2.4/metadata ...
INFO  11:07:11,567 07-Jun-2018 MetaData - Opened metadata.
INFO  11:07:11,580 07-Jun-2018 SVDiscovery - Opened metadata.
INFO  11:07:11,586 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:11,615 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:11,629 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:11,641 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:11,653 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:11,669 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:11,680 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:11,692 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:11,704 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:11,716 07-Jun-2018 MetaData - Loading insert size distributions ...
INFO  11:07:12,023 07-Jun-2018 SVDiscovery - Processing locus: chr17:20000001-30000000:100-100000
INFO  11:07:12,023 07-Jun-2018 SVDiscovery - Locus search window: chr17:19997001-30103001
INFO  11:07:37,361 07-Jun-2018 ProgressMeter -        Starting         0.0    30.0 s      49.6 w      100.0%    30.0 s       0.0 s
INFO  11:08:07,363 07-Jun-2018 ProgressMeter -        Starting         0.0    60.0 s      99.2 w      100.0%    60.0 s       0.0 s
INFO  11:08:37,364 07-Jun-2018 ProgressMeter -        Starting         0.0    90.0 s     148.8 w      100.0%    90.0 s       0.0 s
INFO  11:09:07,365 07-Jun-2018 ProgressMeter -        Starting         0.0   120.0 s     198.4 w      100.0%   120.0 s       0.0 s
INFO  11:09:37,367 07-Jun-2018 ProgressMeter -        Starting         0.0     2.5 m     248.0 w      100.0%     2.5 m       0.0 s
INFO  11:10:07,870 07-Jun-2018 ProgressMeter -        Starting         0.0     3.0 m     298.5 w      100.0%     3.0 m       0.0 s
INFO  11:10:37,885 07-Jun-2018 ProgressMeter -        Starting         0.0     3.5 m     348.1 w      100.0%     3.5 m       0.0 s
INFO  11:11:07,886 07-Jun-2018 ProgressMeter -        Starting         0.0     4.0 m     397.7 w      100.0%     4.0 m       0.0 s
INFO  11:11:37,887 07-Jun-2018 ProgressMeter -        Starting         0.0     4.5 m     447.3 w      100.0%     4.5 m       0.0 s
Caught exception while processing read: HS2000-9109_119:4:1305:6758:46890       97      chr17   22253156        3       29M1I70M        =       22260357        7296    GTTGGAAACGGGATAAACCGCACAGAACTAAAACAGAAGCATTCTAAGAACCCTCTTCGTGATGTTTGCATTCAACTCACAGTGCTGAACCTTTCTTTGA    AABDDDEEA<DDDBBDD@B:AB?CBBC?BJCCC>BABBAABABAAAB@BB?AAAAAB@8AABAA@CBBACABBCB?CBCADCACBDDCDADEDDCFEEDC    MD:Z:44C6T23C7T15       RG:Z:LP6005646-DNA_A12  NM:i:5  AS:i:72 XS:i:67
INFO  11:11:43,073 07-Jun-2018 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NullPointerException
        at htsjdk.samtools.SAMRecordQueryNameComparator.fileOrderCompare(SAMRecordQueryNameComparator.java:76)
        at htsjdk.samtools.SAMRecordQueryNameComparator.compare(SAMRecordQueryNameComparator.java:32)
        at htsjdk.samtools.SAMRecordQueryNameComparator.compare(SAMRecordQueryNameComparator.java:29)
        at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
        at java.util.TimSort.sort(TimSort.java:234)
        at java.util.Arrays.sort(Arrays.java:1512)
        at htsjdk.samtools.util.SortingCollection$InMemoryIterator.<init>(SortingCollection.java:350)
        at htsjdk.samtools.util.SortingCollection.iterator(SortingCollection.java:269)
        at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:213)
        at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.closeOutputFiles(DeletionDiscoveryAlgorithm.java:1217)
        at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.close(DeletionDiscoveryAlgorithm.java:120)
        at org.broadinstitute.sv.discovery.SVDiscoveryWalker.onTraversalDone(SVDiscoveryWalker.java:109)
        at org.broadinstitute.sv.discovery.SVDiscoveryWalker.onTraversalDone(SVDiscoveryWalker.java:40)
        at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
        at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:133)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:87)
        at org.broadinstitute.sv.main.SVDiscovery.main(SVDiscovery.java:21)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5.GS-r1732-0-gf101448):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------
INFO  11:12:08,771 QGraph - Writing incremental jobs reports...
INFO  11:12:08,771 QJobsReporter - Writing JobLogging GATKReport to file /u/home/h/hjzhou/SVDiscovery.jobreport.txt
INFO  11:12:08,797 QGraph - 5 Pend, 12 Run, 1 Fail, 310 Done

In the error, the read associated with a particular partition is not always the same (in most cases, it is different). I wonder what might be the cause.

How to use mutation rate to identify a homozygous or heterozygous mutation?


Dear all,
This may not be directly related to GATK. We use software to call SNPs and get the mutation rate of each SNP. Are there any standards for deciding whether a mutation is homozygous or heterozygous, such as setting a threshold where 40%-80% is a heterozygous mutation and above 80% or below 40% is a homozygous mutation? Is there an internationally recognized standard?
Thank you for your time.

HaplotypeCaller on whole genome or chromosome by chromosome: different results


Hi,

I'm working on targeted resequencing data and I'm doing a multi-sample variant calling with the HaplotypeCaller. First, I tried to call the variants in all the targeted regions by doing the calling at one time on a cluster. I thus specified all the targeted regions with the -L option.

Then, as it was taking too long, I decided to split my interval list chromosome by chromosome and to do the calling on each chromosome. At the end, I merged the VCF files that I had obtained from the callings on the different chromosomes.

Then, I compared this merged VCF file with the VCF file that I obtained by doing the calling on all the targeted regions at one time. I noticed about 1% variation between the two variant lists, and I can't explain this stochasticity. Any suggestions?

Thanks!

Maguelonne

is it right to use CombineVariants to combine all sample vcf together?


I use HaplotypeCaller and VariantFiltration to get every sample's VCF, then use CombineVariants to combine all the VCFs. Your guide describes CombineVariants as "CombineVariants reads in variants records from separate ROD (Reference-Ordered Data) sources and combines them into a single VCF." Does this merging happen per chromosome (reference-based RNA-seq) or per transcript (de novo RNA-seq)?


Variant analysis on 1 or 2 samples: should I skip the final steps?


Hello all,

I'm a bit confused as to what steps are necessary, and what steps are not going to add much benefit. I have 2 jobs to complete for 2 different research groups we support: 1) Germline short variant discovery on whole exome sequencing (WES) data collected from 1 mouse (1 sample in total), and 2) Germline short variant discovery on whole genome sequencing data (WGS) collected from 2 macaques (2 samples in total).

I have written a wrapper that follows the GATK best practices from fastq preprocessing to HaplotypeCaller with appropriate conditional loops and required files specific to each species and type of sequencing.

According to the GATK workflow, my next steps after running HaplotypeCaller (with --emit-ref-confidence GVCF) in the pipeline are 1) consolidate GVCFs, 2) joint-call the cohort, and 3) VQSR.

So here are my concerns:

  1. Considering I have only 1 or 2 samples - is it pointless doing some/all of these steps? Should I just stick to the variants called in each sample by HaplotypeCaller? Should I remove "--emit-ref-confidence GVCF" and just create a regular VCF? Is it possible to hard filter a regular VCF?
  2. If I do VQSR, I don't know where to find truth sets. Someone has suggested to me that I can use the human truth set in other species because the information taken from the truth set is the profile of what a true SNP looks like, not the position of the SNP - I'm really not sure about this.

I have previously posted this on biostars without success.

Help Me, Obi-Wan.

Thanks in advance.

Kenneth
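
On the hard-filtering question: yes, a regular VCF can be hard filtered with VariantFiltration, and with only one or two samples VQSR has far too little data to train on, so hard filtering is the usual fallback. A minimal sketch using the generic SNP thresholds from the GATK hard-filtering recommendations (split SNPs and indels with SelectVariants first; file names are placeholders):

gatk VariantFiltration \
-R reference.fasta \
-V sample.snps.vcf.gz \
--filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "snp_hard_filter" \
-O sample.snps.filtered.vcf.gz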

How to remove uncalled variants with ./. from a VCF file?


Hello,
I have a multi-sample VCF file (generated after joint genotyping and VQSR) which includes some unrelated samples and a few trios. I have extracted a new VCF file for one trio (family) from this multi-sample VCF using GATK's SelectVariants tool. Now in this file there are some variants which are not called in any of the individuals of the trio and hence are left as ./. in the GT field.
e.g.

chr1 65745 . A G 141.58 VQSRTrancheSNP99.00to99.90 AC=0;AF=0.00;AN=0;DP=164;ExcessHet=0.1296;FS=0.000;InbreedingCoeff=0.2775;MQ=27.00;QD=23.60;SOR=3.912;VQSLOD=-4.044e+00;culprit=MQ GT ./. ./. ./.

Now I want to remove these variants from the VCF file. Is there any option in GATK, or any other tool, that I can use for this purpose?
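
A sketch of one option, assuming GATK4-style SelectVariants and placeholder file names: sites like the example above have AN=0 and AC=0, so excluding non-variant sites after subsetting should drop them:

gatk SelectVariants \
-R reference.fasta \
-V trio.vcf.gz \
--exclude-non-variants \
--remove-unused-alternates \
-O trio.called_only.vcf.gz

In GATK3 the corresponding SelectVariants flags are --excludeNonVariants (-env) and --trimAlternates.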

GATK 3.8 No data Found


Hi, I ran VariantRecalibrator to do VQSR. First I got the error message 'Unable to retrieve the result', so I removed the -nt parameter. Then I got the following error message. I also saw Geraldine_VdAuwera say that VQSR is not suitable for pretty small datasets. I wonder if it is suitable for panel data of around 600 Mb in size?

 java -Xmx4g -jar ../GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R ../resource_bundle/ucsc.hg19.fasta \
-input annotated.vcf \
-recalFile out.recal \
-tranchesFile out.tranches  \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 ../resource_bundle/hapmap_3.3.hg19.sites.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 ../resource_bundle/1000G_omni2.5.hg19.sites.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 ../resource_bundle/1000G_phase1.snps.high_confidence.hg19.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../resource_bundle/dbsnp_138.hg19.vcf \
-an QD -an MQ -an FS -an SOR -an MQRankSum -an ReadPosRankSum \
-mode SNP \
-L ../56gene171230/56gene-20170328.bed 
INFO  13:21:42,163 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  13:21:42,165 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50 
INFO  13:21:42,165 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  13:21:42,166 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  13:21:42,166 HelpFormatter - [Tue Jan 02 13:21:42 CST 2018] Executing on Mac OS X 10.12.6 x86_64 
INFO  13:21:42,166 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01 
INFO  13:21:42,170 HelpFormatter - Program Args: -T VariantRecalibrator -R ../resource_bundle/ucsc.hg19.fasta -input annotated.vcf -recalFile out.recal -tranchesFile out.tranches -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ../resource_bundle/hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 ../resource_bundle/1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 ../resource_bundle/1000G_phase1.snps.high_confidence.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ../resource_bundle/dbsnp_138.hg19.vcf -an QD -an MQ -an FS -an SOR -an MQRankSum -an ReadPosRankSum -mode SNP -L ../56gene171230/56gene-20170328.bed 
INFO  13:21:42,175 HelpFormatter - Executing as bioinformatician@bioinformaticiandeiMac.local on Mac OS X 10.12.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_144-b01. 
INFO  13:21:42,175 HelpFormatter - Date/Time: 2018/01/02 13:21:42 
INFO  13:21:42,176 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  13:21:42,176 HelpFormatter - ---------------------------------------------------------------------------------- 
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/Users/bioinformatician/Programs/gatk/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  13:21:42,303 GenomeAnalysisEngine - Deflater: IntelDeflater 
INFO  13:21:42,303 GenomeAnalysisEngine - Inflater: IntelInflater 
INFO  13:21:42,304 GenomeAnalysisEngine - Strictness is SILENT 
INFO  13:21:42,376 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  13:21:43,036 IntervalUtils - Processing 228705 bp from intervals 
INFO  13:21:43,095 GenomeAnalysisEngine - Preparing for traversal 
INFO  13:21:43,096 GenomeAnalysisEngine - Done preparing for traversal 
INFO  13:21:43,096 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  13:21:43,097 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  13:21:43,097 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  13:21:43,099 TrainingSet - Found hapmap track:    Known = false   Training = true     Truth = true    Prior = Q15.0 
INFO  13:21:43,099 TrainingSet - Found omni track:  Known = false   Training = true     Truth = true    Prior = Q12.0 
INFO  13:21:43,099 TrainingSet - Found 1000G track:     Known = false   Training = true     Truth = false   Prior = Q10.0 
INFO  13:21:43,100 TrainingSet - Found dbsnp track:     Known = true    Training = false    Truth = false   Prior = Q2.0 
INFO  13:21:44,050 VariantDataManager - QD:      mean = 20.71    standard deviation = 9.05 
INFO  13:21:44,051 VariantDataManager - MQ:      mean = 60.13    standard deviation = 1.53 
INFO  13:21:44,051 VariantDataManager - FS:      mean = 2.31     standard deviation = 3.39 
INFO  13:21:44,052 VariantDataManager - SOR:     mean = 1.05     standard deviation = 0.59 
INFO  13:21:44,053 VariantDataManager - MQRankSum:   mean = -0.28    standard deviation = 1.09 
INFO  13:21:44,053 VariantDataManager - ReadPosRankSum:      mean = 0.03     standard deviation = 1.00 
INFO  13:21:44,057 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, SOR, ReadPosRankSum, MQRankSum] 
INFO  13:21:44,058 VariantDataManager - Training with 158 variants after standard deviation thresholding. 
WARN  13:21:44,058 VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable. 
INFO  13:21:44,061 GaussianMixtureModel - Initializing model with 100 k-means iterations... 
INFO  13:21:44,117 VariantRecalibratorEngine - Finished iteration 0. 
INFO  13:21:44,137 VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 0.11471 
INFO  13:21:44,145 VariantRecalibratorEngine - Finished iteration 10.   Current change in mixture coefficients = 0.01279 
INFO  13:21:44,157 VariantRecalibratorEngine - Finished iteration 15.   Current change in mixture coefficients = 0.02997 
INFO  13:21:44,165 VariantRecalibratorEngine - Finished iteration 20.   Current change in mixture coefficients = 0.01192 
INFO  13:21:44,175 VariantRecalibratorEngine - Finished iteration 25.   Current change in mixture coefficients = 0.00462 
INFO  13:21:44,182 VariantRecalibratorEngine - Finished iteration 30.   Current change in mixture coefficients = 0.00288 
INFO  13:21:44,189 VariantRecalibratorEngine - Convergence after 34 iterations! 
WARN  13:21:44,193 VariantRecalibratorEngine - Model could not pre-compute denominators. 
INFO  13:21:44,197 VariantDataManager - Selected worst 0 scoring variants --> variants with LOD <= -5.0000. 
##### ERROR --
##### ERROR stack trace 
java.lang.IllegalArgumentException: No data found.
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:536)
    at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:191)
    at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:115)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: No data found.
##### ERROR ------------------------------------------------------------------------------------------

(How to) Call somatic mutations using GATK4 Mutect2


Post suggestions and read about updates in the Comments section.


This tutorial introduces researchers to considerations in somatic short variant discovery using GATK4 Mutect2. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood and are aligned to GRCh38 with post-alt processing [1]. The tutorial focuses on how to call traditional somatic short mutations, as described in Article#11127 and pipelined in GATK v4.0.0.0's mutect2.wdl [2]. The tool and its workflow are in BETA status as of this writing, which means they may undergo changes and are not guaranteed for production.

► For Broad Mutation Calling Best Practices, see FireCloud Article#45055.

Section 1 calls somatic mutations with Mutect2 using all the bells and whistles of the tool. Section 2 outlines how to create the panel of normals resource using the tumor-only mode of Mutect2. Section 3 outlines how to estimate cross-sample contamination. Section 4 shows how to filter the callset with FilterMutectCalls. Unlike GATK3, in GATK4 the somatic calling and filtering functionalities are embodied by separate tools. Section 5 shows an optional filtering step to filter by sequence context artifacts that present with orientation bias, e.g. OxoG artifacts. Section 6 shows how to set up in IGV for manual review. Finally, section 7 provides a brief list of related resources that may be of interest to researchers.

GATK4 Mutect2 is a versatile variant caller that not only is more sensitive than, but is also roughly twice as fast as, HaplotypeCaller's reference confidence mode. Researchers who wish to customize analyses should find the tutorial's descriptions of the multiple levers of Mutect2 in section 1 and descriptions of the tumor-only mode of Mutect2 in section 2 of interest.


Jump to a section

  1. Call somatic short variants and generate a bamout with Mutect2
    1.1 What are the Mutect2 annotations?
    1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?
  2. Create a sites-only PoN with CreateSomaticPanelOfNormals
    2.1 The tumor-only mode of Mutect2 is useful outside of pon creation
  3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination
    3.1 What if I find high levels of contamination?
  4. Filter for confident somatic calls using FilterMutectCalls
  5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias
    5.1 Tally of applied filters for the tutorial data
  6. Set up in IGV to review somatic calls
  7. Related resources

Tools involved

  • GATK v4.0.0.0 is available in a Docker image and as a standalone jar. For the latest release, see the Downloads page. Note that GATK v4.0.0.0 contains Picard tools from release v2.17.2 that are callable with the gatk launch script.
  • Desktop IGV. The tutorial uses v2.3.97.

Download example data

Download tutorial_11136.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data and resources, see [3] and [4].

► The tutorial steps switch between the subset and full data. Some of the data files, e.g. BAMs, are restricted to a small region of the genome to efficiently pace the tutorial. Other files, e.g. the Mutect2 calls that the tutorial filters, are from the entire genome. The tutorial content was originally developed for the 2017-09 Helsinki workshop and we make the full data files, i.e. the resource files and the BAMs, available at gs://gatk-best-practices/somatic-hg38.


1. Call somatic short variants and generate a bamout with Mutect2

Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see Article#11127. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.

gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam 

This produces a raw unfiltered somatic callset 1_somatic_m2.vcf.gz, a reassembled reads BAM 2_tumor_normal_m2.bam and the respective indices 1_somatic_m2.vcf.gz.tbi and 2_tumor_normal_m2.bai.

Comments on select parameters

  • Specify the case sample for somatic calling with two parameters. Provide the BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName, as sketched just after these comments. Alternatively, use samtools view -H tumor.bam | grep '@RG'.
  • Prefilter variant sites in a control sample alignment. Specify the control BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.
  • Prefilter variant sites in a panel of normals callset. Specify the panel of normals (PoN) VCF with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it presents commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool does not reassemble nor emit variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.
  • Annotate variant alleles by specifying a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025 which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.
  • Include reads whose mate maps to a different contig. For our somatic analysis that uses alt-aware and post-alt processed alignments to GRCh38, we disable a specific read filter with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (in alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.
  • Target the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.
  • Generate the reassembled alignments file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.
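
As referenced in the first comment above, GetSampleName writes the SM field of a BAM's read groups to a text file; a minimal sketch with placeholder names:

gatk GetSampleName \
-I tumor.bam \
-O tumor_sample_name.txt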

To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.

[image: five multiallelic sites from the full callset]

We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.


☞ 1.1 What are the Mutect2 annotations?

We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.

[image: standard FORMAT-level Mutect2 annotations in the VCF header]

[image: standard INFO-level Mutect2 annotations in the VCF header]

The Variant Annotations section of the Tool Documentation further describe some of the annotations. For a complete list of annotations available in GATK4, see this site.

To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.


☞ 1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?

To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter (MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigs and read pairs that span contigs can now lend support to variant calls.

For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls comes from the ALT, HLA and decoy contigs.


back to top


2. Create a sites-only PoN with CreateSomaticPanelOfNormals

We make the motions of creating a PoN using three germline samples. These samples are HG00190, NA19771 and HG02759 [3].

First, run Mutect2 in tumor-only mode on each normal sample. In tumor-only mode, a single case sample is analyzed with the -tumor flag without an accompanying matched control -normal sample. For the tutorial, we run this command only for sample HG00190.

gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz

This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.

[image: two multiallelic records from 3_HG00190.vcf.gz]

We see for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively for the two sites.

Comments on select parameters

  • One option that is not used here is to include a germline resource with --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af.
  • Recapitulate any special options used in somatic calling in the panel of normals sample calling, e.g. --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.

Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for panel of normals is a minimum of forty samples.

gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.
CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.

[image: sites-only records from the eight-column PoN VCF]

Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.

What do you think of including samples of family members in the PoN?


☞ 2.1 The tumor-only mode of Mutect2 is useful outside of pon creation

For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples. This approach is described in Blog#11315.


back to top


3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination

First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.
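
The biallelic subsetting mentioned above can be done ahead of time; a minimal sketch, with a placeholder resource name:

gatk SelectVariants \
-V af-only-gnomad.grch38.vcf.gz \
--restrict-alleles-to BIALLELIC \
-O af-only-gnomad.biallelic.grch38.vcf.gz

With a suitable resource in hand, run GetPileupSummaries on the tumor BAM: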

gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table

This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.

[image: six-column GetPileupSummaries output table]

Comments on select parameters

  • The tool only considers homozygous alternate sites in the sample that have a population allele frequency that ranges between that set by --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
  • One option to speed up analysis, that is not used in the command above, is to limit data collection to a sufficiently large but subset genomic region with the -L argument.

Second, estimate contamination with CalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table

This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.

[image: CalculateContamination output table for the full tumor sample]

Comments on select parameters

  • CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument.

► Cross-sample contamination differs from normal contamination of tumor and tumor contamination of normal. Currently, the workflow does not account for the latter type of purity issue.


☞ 3.1 What if I find high levels of contamination?

One thing to rule out is sample swaps at the read group level.

Picard’s CrosscheckFingerprints can detect sample swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they come from the same individual, and (ii) check at the read group level that all of a sample’s read groups come from the same individual.

Again, imagine if we mistook the contaminating read group data as some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us All read groups related as expected.
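
Here is a hedged sketch of a CrosscheckFingerprints command; the haplotype map is a required input that must correspond to your reference, and the file names are placeholders.

java -jar picard.jar CrosscheckFingerprints \
INPUT=tumor.bam \
INPUT=normal.bam \
HAPLOTYPE_MAP=hg38_haplotype_map.txt \
CROSSCHECK_BY=READGROUP \
EXPECT_ALL_GROUPS_TO_MATCH=true \
OUTPUT=crosscheck_metrics.txt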


back to top


4. Filter for confident somatic calls using FilterMutectCalls

FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.

Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.

gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz

This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.
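
For example:

gzcat 9_somatic_oncefiltered.vcf.gz | grep '##FILTER'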

This step seemingly applies 14 filters, including contamination. However, if an annotation that a filter relies on is absent, the tool skips that particular filter, though the filter still appears in the VCF header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.

So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered calls, the contamination filter flags eight, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.
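
These counts can be reproduced with simple one-liners, e.g.:

# count total calls, then passing calls
gzcat 9_somatic_oncefiltered.vcf.gz | grep -v '^#' | wc -l
gzcat 9_somatic_oncefiltered.vcf.gz | grep -v '^#' | awk '$7=="PASS"' | wc -l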

► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.


back to top


5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.

First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
--FILE_EXTENSION ".txt" \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Alternatively, use the tool from a standalone Picard jar.

java -jar picard.jar \
CollectSequencingArtifactMetrics \
I=tumor.bam \
O=10_tumor_artifact \
FILE_EXTENSION=.txt \
R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.

Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.

  • The TOTAL_QSCORE is Phred-scaled such that lower scores equate to a higher probability the change is artifactual, e.g. a score of forty translates to a 1 in 10,000 probability. For OxoG, a rough cutoff for concern is 30. FilterByOrientationBias uses the quality score as a prior that a context will produce an artifact. The tool also weighs the evidence from the reads. For example, if the QSCORE is 50 but the allele is supported by 15 reads in F1R2 and no reads in F2R1, then the tool should filter the call.
  • FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations. Oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. In GATK4 Mutect2, the F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.

Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.

gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz

This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.

Is the filtering in line with our earlier prediction?

In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.


☞ 5.1 Tally of applied filters for the tutorial data

The table shows the breakdown of filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallies the instances in which each filter was applied across the calls and the third column tallies the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).

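One way to reproduce the per-filter tally is to split the FILTER field on semicolons and count each filter's occurrences:

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '^#' | cut -f7 | grep -v PASS | tr ';' '\n' | sort | uniq -c | sort -rn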

Which filters appear to have the greatest impact? What types of calls do you think compel manual review?

Examine passing records with the following command. Take note of the AD and AF annotation values in particular, as they show the high sensitivity of the caller.

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less


back to top


6. Set up in IGV to review somatic calls

Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.

To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect that of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.

First, load Human (hg38) as the reference in IGV. Then load these six files in order:

  • resources/chr17_pon.vcf.gz
  • resources/chr17_af-only-gnomad_grch38.vcf.gz
  • 11_somatic_twicefiltered.vcf.gz
  • 2_tumor_normal_m2.bam
  • normal.bam
  • tumor.bam

With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.

Second, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).

  • One of the tracks is dominating the view. Right-click on track chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.
  • Zoom into the somatic call in 11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.
  • Hover over or click on the gray call in track 11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.
  • Scroll through the alignment data and notice the coverage for the samples.

A C→T variant is in tumor.bam but not normal.bam. What is happening in 2_tumor_normal_m2.bam?

Third, tweak IGV settings that aid in visualizing reassembled alignments.

  • Make room to focus on track 2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bam, normal.bam and their coverages. Right-click and Remove Tracks.
  • Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
  • Drag the alignments panel to center the red variant.
  • Right-click on the alignments track and

    • Group by sample
    • Sort by base
    • Color by tag: HC.
  • Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.

What are the three grouped tracks for the bamout? What do the pastel versus gray colors indicate? How plausible is it that all tumor copies of this locus have this alteration?

Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.

CHROM POS ID REF ALT QUAL FILTER INFO
chr17 7,674,220 . C T . PASS DP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15
FORMAT GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB
HCC1143_normal 0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false
HCC1143_tumor 0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946

Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.

CHROM POS REF ALT FILTER
chr17 4,539,344 T TA artifact_in_normal;germline_risk;panel_of_normals
chr17 7,221,420 CACTGCCCTAGGTCAGGA C artifact_in_normal;panel_of_normals;str_contraction
chr17 7,483,063 A AC mapping_quality;t_lod
chr17 8,513,688 GTT G panel_of_normals
chr17 19,748,387 G GA t_lod
chr17 26,982,033 G GC artifact_in_normal;clustered_events
chr17 30,059,463 CT C t_lod
chr17 35,422,473 C CA t_lod
chr17 35,671,734 CTT C,CT,CTTT artifact_in_normal;multiallelic;panel_of_normals
chr17 43,104,057 CA C artifact_in_normal;germline_risk;panel_of_normals
chr17 43,104,072 AAAAAAAAAGAAAAG A panel_of_normals;t_lod
chr17 46,332,538 G GT artifact_in_normal;panel_of_normals
chr17 47,157,394 CAA C panel_of_normals;t_lod
chr17 50,124,771 GCACACACACACACACA G clustered_events;panel_of_normals;t_lod
chr17 68,907,890 GA G artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
chr17 69,182,632 C CA artifact_in_normal;t_lod
chr17 69,182,835 GAAAA G panel_of_normals


back to top


7. Related resources

The next step after generating a carefully manicured somatic callset is typically functional annotation.

  • Funcotator is available in BETA and can annotate GRCh38 and prior reference aligned VCF format data.
  • Oncotator can annotate GRCh37 and prior reference aligned MAF and VCF format data. It is also possible to download and install the tool following instructions in Article#4154.
  • Annotate with the external program VEP to predict phenotypic changes and confirm or hypothesize biochemical effects.

For a cohort, after annotation, use MutSig to discover driver mutations. MutSigCV (the CV version) is available on GenePattern. If more samples are needed to increase the power of the analysis, consider padding the analysis set with TCGA or other data.

The dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665 shows somatic copy number alterations for the HCC1143 tumor sample. Its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. Somatic copy number alteration detection will be covered in another GATK tutorial. For reference implementations of Somatic CNV workflows see here.


back to top


Footnotes

[1] Data was alt-aware aligned to GRCh38 and post-alt processed. For an introduction to alt-aware alignment and post-alt processing, see [Blog#8180](https://software.broadinstitute.org/gatk/blog?id=8180). The HCC1143 alignments are identical to that in [Tutorial#9183](https://software.broadinstitute.org/gatk/documentation/article?id=9183), which uses GATK3 MuTect2.

[2] For scripted GATK Best Practices Somatic Short Variant Discovery workflows, see [https://github.com/gatk-workflows](https://github.com/gatk-workflows). Within the repository, as of this writing, [gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels), which uses GRCh37, is the sole GATK4 Mutect2 workflow. This tutorial uses additional parameters not used in the [GRCh37 gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels) example because the tutorial data was preprocessed with post-alt processing of alt-aware alignments, which deviates from production practices. The general workflow steps remain the same.

[3] About the tutorial data:

  • The data tarball contains 15 files in the main directory, six files in its resources folder and twenty files in its precomputed folder. Of the file names, chr17 refers to data subset to the regions in chr17plus.interval_list, m2pon to the PoN built from forty 1000 Genomes Project samples, pon to panel of normals, tumor to the HCC1143 breast cancer sample and normal to its matched blood normal.
  • Again, example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and known as HCC1143 and HCC1143_BL, respectively. The Broad Cancer Genome Analysis (CGA) group has graciously provided 2x76 paired-end whole exome sequence data from the two cell lines (C835.HCC1143_2 and C835.HCC1143_BL.4), and @shlee reverted and aligned these to GRCh38 using alt-aware alignment and post-alt processing as described in Tutorial#8017. During preprocessing, the MergeBamAlignment step was omitted, reads containing adapter sequence were removed altogether for both samples (~0.153% of reads in the tumor) as determined by MarkIlluminaAdapters, base qualities were not binned during base recalibration and indel realignment was included to match the toolchain of the PoN normals. The program group for base recalibration is absent from the BAM headers due to a bug in the version of PrintReads at the time of pre-processing, in January of 2017.
  • Note that the tutorial uses exome data for its small size. The workflow is applicable to whole genome sequence data (WGS).
  • @shlee lifted-over or remapped the gnomAD resource files from GRCh37 counterparts to GRCh38. The tutorial uses subsets of the full resources; the full-length versions are available at gs://gatk-best-practices/somatic-hg38/. The official GRCh37 versions of the resources are available in the GATK Resource Bundle and are based on the gnomAD resource. These GRCh37 versions were prepared by @davidben according to the method outlined in the mutect_resources.wdl and described in [4].
  • The full data in the tutorial were generated by @shlee using the github.com/broadinstitute/gatk mutect2.wdl from between the v4.0.0.0 and v4.0.0.1 release with commit hash b4d1ddd. The GATK Docker image was broadinstitute/gatk:4.0.0.0 and Picard was v2.14.1. A single modification was made to the script to enable generating the bamout. The script was run locally on a Google Cloud Compute VM using Cromwell v30.1. Given Docker was installed and the specified Docker images were present on the VM, Cromwell automatically launched local Docker container instances during the run and handled the local files as hard-links to avoid redundant copying. Workflow input variables were as follows.
{
  "##_COMMENT1:": "WORKFLOW STEP OPTIONS",
  "Mutect2.is_run_oncotator": "False",
  "Mutect2.is_run_orientation_bias_filter": "True",
  "Mutect2.picard": "/home/shlee/picard-2.14.1.jar",
  "Mutect2.gatk_docker": "broadinstitute/gatk:4.0.0.0",
  "Mutect2.oncotator_docker": "broadinstitute/oncotator:1.9.3.0",
...
  "##_COMMENT3:": "ANALYSIS PARAMETERS",
  "Mutect2.artifact_modes": ["G/T", "C/T"],
  "Mutect2.m2_extra_args": "--af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter",
  "Mutect2.m2_extra_filtering_args": "",
  "Mutect2.scatter_count": "10"
}
  • If using newer versions of the mutect2.wdl that allow setting SplitIntervals optional arguments, then @shlee recommends setting --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION to avoid splitting contigs.
  • With the exception of the PoN and Picard tool steps, data was generated using v4.0.0.0. The PoN was generated using GATK4 vbeta.6. Besides the syntax, little changed for the Mutect2 workflow between these releases, and the workflow and most of its tools remain in beta status as of this writing. We used Picard v2.14.1 for the CollectSequencingArtifactMetrics step. Figures in section 5 reflect results from Picard v2.11.0, which gives, at a glance, identical results to v2.14.1.
  • The three samples in section 2 are present in the forty sample PoN used in section 1 and they are 1000 Genomes Project samples.

[4] The WDL script [mutect_resources.wdl](https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl) takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in _section 1_ and a common biallelic variants resource for use in _section 3_. The script first generates a sites-only VCF and in the process _removes all extraneous annotations_ except for `AF` allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF.
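
As a rough sketch, the sites-only step can be approximated with Picard's MakeSitesOnlyVcf (keeping in mind the actual WDL additionally strips all INFO annotations except AF, which this tool alone does not do); the biallelic selection then uses SelectVariants --restrict-alleles-to BIALLELIC as shown in section 3. File names below are placeholders.

java -jar picard.jar MakeSitesOnlyVcf \
INPUT=gnomad.vcf.gz \
OUTPUT=gnomad.sites-only.vcf.gz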

back to top


Evaluating the quality of a variant callset


Introduction

Running through the steps involved in variant discovery (calling variants, joint genotyping and applying filters) produces a variant callset in the form of a VCF file. So what’s next? Technically, that callset is ready to be used in downstream analysis. But before you do that, we recommend running some quality control analyses to evaluate how “good” that callset is.

To be frank, distinguishing between a “good” callset and a “bad” callset is a complex problem. If you knew the absolute truth of what variants are present or not in your samples, you probably wouldn’t be here running variant discovery on some high-throughput sequencing data. Your fresh new callset is your attempt to discover that truth. So how do you know how close you got?

Methods for variant evaluation

There are several methods that you can apply which offer different insights into the probable biological truth, all with their own pros and cons. Possibly the most trusted method is Sanger sequencing of regions surrounding putative variants. However, it is also the least scalable as it would be prohibitively costly and time-consuming to apply to an entire callset. Typically, Sanger sequencing is only applied to validate candidate variants that are judged highly likely. Another popular method is to evaluate concordance against results obtained from a genotyping chip run on the same samples. This is much more scalable, and conveniently also doubles as a quality control method to detect sample swaps. Although it only covers the subset of known variants that the chip was designed for, this method can give you a pretty good indication of both sensitivity (ability to detect true variants) and specificity (not calling variants where there are none). This is something we do systematically for all samples in the Broad’s production pipelines.

The third method, presented here, is to evaluate how your variant callset stacks up against another variant callset (typically derived from other samples) that is considered to be a truth set (sometimes referred to as a gold standard -- these terms are very close and often used interchangeably). The general idea is that key properties of your callset (metrics discussed later in the text) should roughly match those of the truth set. This method is not meant to render any judgments about the veracity of individual variant calls; instead, it aims to estimate the overall quality of your callset and detect any red flags that might be indicative of error.

Underlying assumptions and truthiness*: a note of caution

It should be immediately obvious that there are two important assumptions being made here: 1) that the content of the truth set has been validated somehow and is considered especially trustworthy; and 2) that your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set. These assumptions are not always well-supported, depending on the truth set, your callset, and what they have (or don’t have) in common. You should always keep this in mind when choosing a truth set for your evaluation; it’s a jungle out there. Consider that if anyone can submit variants to a truth set’s database without a well-regulated validation process, and there is no process for removing variants if someone later finds they were wrong (I’m looking at you, dbSNP), you should be extra cautious in interpreting results.
*With apologies to Stephen Colbert.

Validation

So what constitutes validation? Well, the best validation is done with orthogonal methods, meaning that it is done with technology (wetware, hardware, software, etc.) that is not subject to the same error modes as the sequencing process. Calling variants with two callers that use similar algorithms? Great way to reinforce your biases. It won’t mean anything that both give the same results; they could both be making the same mistakes. On the wetlab side, Sanger and genotyping chips are great validation tools; the technology is pretty different, so they tend to make different mistakes. Therefore it means more if they agree or disagree with calls made from high-throughput sequencing.

Matching populations

Regarding the population genomics aspect: it’s complicated -- especially if we’re talking about humans (I am). There’s a lot of interesting literature on this topic; for now let’s just summarize by saying that some important variant calling metrics vary depending on ethnicity. So if you are studying a population with a very specific ethnic composition, you should try to find a truth set composed of individuals with a similar ethnic background, and adjust your expectations accordingly for some metrics.

Similar principles apply to non-human genomic data, with important variations depending on whether you’re looking at wild or domesticated populations, natural or experimentally manipulated lineages, and so on. Unfortunately we can’t currently provide any detailed guidance on this topic, but hopefully this explanation of the logic and considerations involved will help you formulate a variant evaluation strategy that is appropriate for your organism of interest.


Variant evaluation metrics

So let’s say you’ve got your fresh new callset and you’ve found an appropriate truth set. You’re ready to look at some metrics (but don’t worry yet about how; we’ll get to that soon enough). There are several metrics that we recommend examining in order to evaluate your data. The set described here should be considered a minimum and is by no means exclusive. It is nearly always better to evaluate more metrics if you possess the appropriate data to do so -- and as long as you understand why those additional metrics are meaningful. Please don’t try to use metrics that you don’t understand properly, because misunderstandings lead to confusion; confusion leads to worry; and worry leads to too many desperate posts on the GATK forum.

Variant-level concordance and genotype concordance

The relationship between variant-level concordance and genotype concordance is illustrated in this figure.

  • Variant-level concordance (aka % Concordance) gives the percentage of variants in your samples that match (are concordant with) variants in your truth set. It essentially serves as a check of how well your analysis pipeline identified variants contained in the truth set. Depending on what you are evaluating and comparing, the interpretation of percent concordance can vary quite significantly.
    Comparing your sample(s) against genotyping chip results matched per sample allows you to evaluate whether you missed any real variants within the scope of what is represented on the chip. Based on that concordance result, you can extrapolate what proportion you may have missed out of the real variants not represented on the chip.
    If you don't have a sample-matched truth set and you're comparing your sample against a truth set derived from a population, your interpretation of percent concordance will be more limited. You have to account for the fact that some variants that are real in your sample will not be present in the population and that conversely, many variants that are in the population will not be present in your sample. In both cases, "how many" depends on how big the population is and how representative it is of your sample's background.
    Keep in mind that for most tools that calculate this metric, all unmatched variants (present in your sample but not in the truth set) are considered to be false positives. Depending on your trust in the truth set and whether or not you expect to see true, novel variants, these unmatched variants could warrant further investigation -- or they could be artifacts that you should ignore.

  • Genotype concordance is a similar metric but operates at the genotype level. It allows you to evaluate, within a set of variant calls that are present in both your sample callset and your truth set, what proportion of the genotype calls have been assigned correctly. This assumes that you are comparing your sample to a matched truth set derived from the same original sample.

Number of Indels & SNPs and TiTv Ratio

These metrics are widely applicable. The table below summarizes their expected value ranges for Human Germline Data:

Sequencing Type # of Variants* TiTv Ratio
WGS ~4.4M 2.0-2.1
WES ~41k 3.0-3.3

*for a single sample

  • Number of Indels & SNPs
    The number of variants detected in your sample(s) are counted separately as indels (insertions and deletions) and SNPs (Single Nucleotide Polymorphisms). Many factors can affect this statistic including whole exome (WES) versus whole genome (WGS) data, cohort size, strictness of filtering through the GATK pipeline, the ethnicity of your sample(s), and even algorithm improvement due to a software update. For reference, Nature's recently published 2015 paper in which various ethnicities in a moderately large cohort were analyzed for number of variants. As such, this metric alone is insufficient to confirm data validity, but it can raise warning flags when something went extremely wrong: e.g. 1000 variants in a large cohort WGS data set, or 4 billion variants in a ten-sample whole-exome set.

  • TiTv Ratio
    This metric is the ratio of transition (Ti) to transversion (Tv) SNPs. If the distribution of transition and transversion mutations were random (i.e. without any biological influence) we would expect a ratio of 0.5. This is simply due to the fact that there are twice as many possible transversion mutations than there are transitions. However, in the biological context, it is very common to see a methylated cytosine undergo deamination to become thymine. As this is a transition mutation, it has been shown to increase the expected random ratio from 0.5 to ~2.01. Furthermore, CpG islands, usually found in primer regions, have higher concentrations of methylcytosines. By including these regions, whole exome sequencing shows an even stronger lean towards transition mutations, with an expected ratio of 3.0-3.3. A significant deviation from the expected values could indicate artifactual variants causing bias. If your TiTv Ratio is too low, your callset likely has more false positives.

    It should also be noted that the TiTv ratio from exome-sequenced data will vary from the expected value based upon the length of flanking sequences. When we analyze exome sequence data, we add some padding (usually 100 bases) around the targeted regions (using the -ip engine argument) because this improves calling of variants that are at the edges of exons (whether inside the exon sequence or in the promoter/regulatory sequence before the exon). These flanking sequences are not subject to the same evolutionary pressures as the exons themselves, so the ratio of transition to transversion mutations leans away from the expected value. The amount of "lean" depends on how long the flanking sequence is, as in the sketch below.
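
For example, a GATK3-style command with padded exome targets might look like this sketch; file names are placeholders.

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fasta \
-I sample.bam \
-L exome_targets.interval_list \
-ip 100 \
-o sample.vcf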

Ratio of Insertions to Deletions (Indel Ratio)

This metric is generally evaluated after filtering for purposes that are specific to your study, and the expected value range depends on whether you're looking for rare or common variants, as summarized in the table below.

Filtering for Indel Ratio
common ~1
rare 0.2-0.5

A significant deviation from the expected ratios listed in the table above could indicate a bias resulting from artifactual variants.


Tools for performing variant evaluation

VariantEval

This is the GATK’s main tool for variant evaluation. It is designed to collect and calculate a variety of callset metrics that are organized in evaluation modules, which are listed in the tool doc. For each evaluation module that is enabled, the tool will produce a table containing the corresponding callset metrics based on the specified inputs (your callset of interest and one or more truth sets). By default, VariantEval will run with a specific subset of the available modules (listed below), but all evaluation modules can be enabled or disabled from the command line. We recommend setting the tool to produce only the metrics that you are interested in, because each active module adds to the computational requirements and overall runtime of the tool.

It should be noted that all module calculations only include variants that passed filtering (i.e. the FILTER column in your VCF file should read PASS); variants tagged as filtered out will be ignored. It is not possible to modify this behavior. See the example analysis for more details on how to use this tool and interpret its output.
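
A minimal GATK3-style invocation sketch follows; file names are placeholders, and the dbSNP and truth-set inputs should be whatever resources fit your evaluation.

java -jar GenomeAnalysisTK.jar \
-T VariantEval \
-R reference.fasta \
--eval my_callset.vcf \
--comp truth_set.vcf \
-D dbsnp.vcf \
-o my_callset.eval.grp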

GenotypeConcordance

This tool calculates -- you’ve guessed it -- the genotype concordance between callsets. In earlier versions of GATK, GenotypeConcordance was itself a module within VariantEval. It was converted into a standalone tool to enable more complex genotype concordance calculations.
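
A minimal GATK3-style sketch, with placeholder file names:

java -jar GenomeAnalysisTK.jar \
-T GenotypeConcordance \
-R reference.fasta \
-eval my_callset.vcf \
-comp truth_set.vcf \
-o concordance.grp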

Picard tools

The Picard toolkit includes two tools that perform similar functions to VariantEval and GenotypeConcordance, respectively called CollectVariantCallingMetrics and GenotypeConcordance. Both are relatively lightweight in comparison to their GATK equivalents; their functionalities are more limited, but they do run quite a bit faster. See the example analysis of CollectVariantCallingMetrics for details on its use and data interpretation. Note that in the coming months, the Picard tools are going to be integrated into the next major version of GATK, at which point we plan to consolidate these two pairs of homologous tools to eliminate redundancy.
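
For example, Picard's CollectVariantCallingMetrics can be run as follows; file names are placeholders, and DBSNP is a required input.

java -jar picard.jar CollectVariantCallingMetrics \
INPUT=my_callset.vcf.gz \
DBSNP=dbsnp.vcf.gz \
OUTPUT=my_callset_metrics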

Which tool should I use?

We recommend Picard's version of each tool for most cases. The GenotypeConcordance tools provide mostly the same information, but Picard's version is preferred by Broadies. Both VariantEval and CollectVariantCallingMetrics produce similar metrics; however, the latter runs faster and scales better for larger cohorts. By default, CollectVariantCallingMetrics stratifies by sample, allowing you to see the value of relevant statistics as they pertain to specific samples in your cohort. It includes all metrics discussed here, as well as a few more. On the other hand, VariantEval provides many more metrics beyond the minimum described here for analysis. It should be noted that none of these tools use phasing to determine metrics.

So when should I use CollectVariantCallingMetrics?

  • If you have a very large callset
  • If you want to look at the metrics discussed here and not much else
  • If you want your analysis back quickly

When should I use VariantEval?

  • When you require a more detailed analysis of your callset
  • If you need to stratify your callset by another factor (allele frequency, indel size, etc.)
  • If you need to compare to multiple truth sets at the same time