VariantRecalibrator and CalculateGenotypePosteriors require the same 1000G VCF file?

March 21, 2016, 2:58 pm

Hello,

I am using the best practice pipeline to call genotypes from whole-genome sequencing data on 67 human samples. In the examples on the website, VariantRecalibrator uses 1000G_phase1.snps.high_confidence.hg19.sites.vcf as one of the training examples. However, CalculateGenotypePosteriors uses 1000G_phase3_v4_20130502.sites.vcf as the supporting file. Is this intentional? Does it matter which version of 1000G vcf is supplied to CalculateGenotypePosteriors?

↧

MarkIlluminaAdapters missing from GATK pipelines

August 27, 2019, 1:43 pm

The best practices GATK exome and genome pipelines seem to not perform adapter clipping/marking with MarkIlluminaAdapters as described here: https://software.broadinstitute.org/gatk/documentation/article?id=6483

This would likely lead to significant issues in subsequent alignments and false positive calls.

Why is MarkIlluminaAdapters not part of the standard GATK pipelines even though the GATK website says it should be?

↧

GenomicsDBImport with many intervals

August 27, 2019, 2:23 pm

Hi GATK team,

I am running GATK4 GenomicsDBImport to create a somatic PoN for TCGA whole-exome data. Can I split the interval list into as many pieces as possible (I used 1000 pieces from SplitIntervals and dispatched the 1000 jobs to a local cluster)? Will such splitting improve performance? Could too many intervals cause trouble in the subsequent analysis?

Thanks!

↧

Input Priors HaplotypeCaller and GenotypeGVCFs

August 30, 2019, 7:12 am

Hi,

Both HaplotypeCaller and GenotypeGVCFs have --input_prior options.

I'm confused about whether both need to be changed to remove reference bias, or just the one in HaplotypeCaller. Previous posts mention changing --input_prior in HaplotypeCaller but not in GenotypeGVCFs

(https://gatkforums.broadinstitute.org/gatk/discussion/11877/free-of-reference-bias-priors-in-haplotypecaller)

Also, say I created gVCFs and forgot to change --input_prior: can I just change the priors at the GenotypeGVCFs step instead of creating the gVCFs again with the correct priors?

Thanks,
Mo

↧

Panel of Normals for RNA-Seq Samples

August 30, 2019, 8:57 am

Hello. Is it appropriate to use the 1000g panel of normals (1000g_pon.hg38.vcf.gz) when working with RNA-Seq samples? I've been looking for more information about the 1000g PoN and have not found much.

Specifically, I am using the 1000g PoN in tumor-only mode with RNA-Seq tumor samples. I have used STAR 2-pass alignment and the somatic variant calling best-practices pipeline on a few dozen samples. I unexpectedly saw hundreds of variants shared by the samples. Is there anything wrong with using the 1000g PoN with RNA-Seq generated BAM files in tumor-only mode?

Best,
Bruno

↧

CollectHsMetrics - Excludes duplicates?

August 29, 2019, 7:48 pm

Hi,
Does CollectHsMetrics exclude duplicates from all of its metrics? In particular:
MEAN_TARGET_COVERAGE
PCT_TARGET_BASES_*

This is not documented in the tool or online tool reference.

↧

Combining variants from different WES capture types

October 15, 2019, 2:49 am

Hi there!
I've searched the GATK forum with no success for the following topic. I have a set of WES samples (around 110 in total), all of them from a specific population. The aim of the project is to study population genetic variation. All samples have been processed with GATK 4.1.2. The issue is that I have two subsets of samples, each generated with a different capture technology.

I'm not sure how to proceed to study variants for the whole set, since we want to reduce the batch effect as much as possible. So far I've run the following: gVCF files were generated for each sample, and then a joint analysis was applied using all gVCF files (GenomicsDBImport and GenotypeGVCFs). I'm not sure this approach is the best one (it is the same as assuming a single capture technology). For GenomicsDBImport, the intervals used were all chromosomes, although another option would be to build the database using a specific set of regions given by the intersection of the two capture BEDs.

Another approach would be to perform joint variant calling separately for each subset and then combine the results somehow (not sure how), again using the intersection of the capture BEDs, but this might introduce a worse batch effect.

Any suggestions?
Thanks,
Javier

↧

Error in ReadsPipelineSpark version 4.1.4

October 15, 2019, 8:54 am

Hi.
I just downloaded version 4.1.4 to test the performance of the ReadsPipelineSpark tool.
Very good indeed.
But at the end of the pipeline, when the tool has to concatenate all the VCF parts, it throws the following exception:

A USER ERROR has occurred: Couldn't write file hdfs://cloudera08/gatk-test2/WES2019-022_S4_out.vcf because writing failed with exception concat: target file /gatk-test2/WES2019-022_S4_out.vcf.parts/output is empty
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.concatInternal(FSNamesystem.java:2303)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.concatInt(FSNamesystem.java:2257)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.concat(FSNamesystem.java:2219)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.concat(NameNodeRpcServer.java:829)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.concat(AuthorizationProviderProxyClientProtocol.java:285)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.concat(ClientNamenodeProtocolServerSideTranslatorPB.java:580)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2278)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2274)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2272)

org.broadinstitute.hellbender.exceptions.UserException$CouldNotCreateOutputFile: Couldn't write file hdfs://cloudera08/gatk-test2/WES2019-022_S4_out.vcf because writing failed with exception concat: target file /gatk-test2/WES2019-022_S4_out.vcf.parts/output is empty
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.concatInternal(FSNamesystem.java:2303)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.concatInt(FSNamesystem.java:2257)
...
...

This is the command I used to run the job:
nohup /opt/gatk/gatk-4.1.4.0/gatk ReadsPipelineSpark \
--spark-runner SPARK \
--spark-master yarn \
--spark-submit-command spark2-submit \
-I hdfs://cloudera08/gatk-test2/WES2019-022_S4.bam \
-O hdfs://cloudera08/gatk-test2/WES2019-022_S4_out.vcf \
-R hdfs://cloudera08/gatk-test1/ucsc.hg19.fasta \
--known-sites hdfs://cloudera08/gatk-test1/dbsnp_150_hg19.vcf.gz \
--known-sites hdfs://cloudera08/gatk-test1/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz \
--align true \
--emit-ref-confidence GVCF \
--standard-min-confidence-threshold-for-calling 50.0 \
--conf deploy-mode=cluster \
--conf "spark.driver.memory=2g" \
--conf "spark.executor.memory=18g" \
--conf "spark.storage.memoryFraction=1" \
--conf "spark.akka.frameSize=200" \
--conf "spark.default.parallelism=100" \
--conf "spark.core.connection.ack.wait.timeout=600" \
--conf "spark.yarn.executor.memoryOverhead=4096" \
--conf "spark.yarn.driver.memoryOverhead=400" \
> WES2019-022_S4.out

-bash-4.1$ hdfs dfs -ls /gatk-test2/
Found 7 items
-rw-r--r-- 3 hdfs supergroup 39673964 2019-09-19 15:45 /gatk-test2/RefGene_exons.bed
-rw-r--r-- 3 hdfs supergroup 38516963 2019-09-19 15:45 /gatk-test2/RefGene_exons.interval_list
-rw-r--r-- 3 hdfs supergroup 13569684570 2019-10-02 11:49 /gatk-test2/WES2019-022_S4.bam
-rw-r--r-- 3 hdfs supergroup 16 2019-10-02 11:58 /gatk-test2/WES2019-022_S4.bam.bai
drwxr-xr-x - hdfs supergroup 0 2019-10-15 16:21 /gatk-test2/WES2019-022_S4_out.vcf.parts

-bash-4.1$ hdfs dfs -ls /gatk-test2/WES2019-022_S4_out.vcf.parts/
Found 105 items
-rw-r--r-- 3 hdfs supergroup 0 2019-10-15 16:21 /gatk-test2/WES2019-022_S4_out.vcf.parts/_SUCCESS
-rw-r--r-- 3 hdfs supergroup 10632 2019-10-15 16:21 /gatk-test2/WES2019-022_S4_out.vcf.parts/header
-rw-r--r-- 3 hdfs supergroup 0 2019-10-15 16:21 /gatk-test2/WES2019-022_S4_out.vcf.parts/output
-rw-r--r-- 3 hdfs supergroup 21498665 2019-10-15 14:43 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00000
-rw-r--r-- 3 hdfs supergroup 25489817 2019-10-15 15:10 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00001
-rw-r--r-- 3 hdfs supergroup 35599315 2019-10-15 14:44 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00002
-rw-r--r-- 3 hdfs supergroup 25185088 2019-10-15 14:41 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00003
-rw-r--r-- 3 hdfs supergroup 70456674 2019-10-15 14:43 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00004
-rw-r--r-- 3 hdfs supergroup 41305463 2019-10-15 14:52 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00005
...
...
-rw-r--r-- 3 hdfs supergroup 41022593 2019-10-15 16:08 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00097
-rw-r--r-- 3 hdfs supergroup 46040755 2019-10-15 16:03 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00098
-rw-r--r-- 3 hdfs supergroup 63441406 2019-10-15 15:57 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00099
-rw-r--r-- 3 hdfs supergroup 44377853 2019-10-15 15:55 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00100
-rw-r--r-- 3 hdfs supergroup 22847475 2019-10-15 16:21 /gatk-test2/WES2019-022_S4_out.vcf.parts/part-r-00101

It seems like a bug.
Could you please verify and let me know?

Thanks a lot.
Alessandro

↧

GATK, WDL, Cromwell and Terra at ASHG 2019

October 15, 2019, 11:13 pm

If it's hot, humid and everyone around you has a name tag, you're probably in Houston, TX for ASHG. I know there's a lot going on and a million different presentations vying for your attention, so I'll cut to the chase: we have several members of our department (Data Sciences Platform) who will be at the Broad Genomics booth #714 in the Exhibition Hall at the following times. Don't miss this opportunity to come chat with us in person and get answers for all your burning questions about the latest exciting developments, whether it's DRAGEN-GATK or Cromwell on Azure that floats your boat, or you just want to learn more about running our fully configured GATK pipelines on Terra. We look forward to seeing you there!

Day	Time	Team member	Focus area
Wednesday 16	12-2pm	Geraldine Van der Auwera	All
Thursday 17	10am-12:30pm	Bhanu Gandham	GATK support
Thursday 17	12:30-1:30pm	Rob Title	Interactive analysis on Terra
Thursday 17	2:30-4:30pm	Sushma Chaluvadi	Terra support
Friday 18	11am-12pm	Ruchi Munshi	Cromwell and WDL

↧

MergeBamAlignment – Select primary alignment

September 1, 2019, 11:51 am

Hi,

In the current best practices workflow gatk4-data-processing, you recommend using uBAMs instead of FASTQ files. Great idea! However, when it comes to merging with the BWA alignment BAM, there is something that puzzles me.

Here is an example of a paired-end read mapped by BWA:

XXXXXXXX:412:YYYYYYYYY:1:11101:10001:10497  83  chr16   1229894 0   149M    =   1229833 -210    GGGCCGCGTAGGCGCGGCTCGCCAGGACGGGCAGCGCCAGCAGCAGCAGATTCAGCATCTGGGGAGCAAGGAGGAGCATCGTGGGCCTGGCCGGGCCTCACAGGGCAGGGCTGGGGGCTACAGATTGTGGGGTGAAGAATGGAGCTGAG   AAAAA/E<EEAA</A/<EA<<EEEEEEEE/EEEAAEEAEE/EAEAAEEEEEEEEEEEAEEAAEEAEAEAAEEEEEEEEEEEEAAEEEEAE6EAEEEEEEEE/EEEEEEE/EE/AEAAEEEEEEEEEAAEEEEEEEEEEEEEEEEAAAAA   XA:Z:chr16,+1240848,149M,1;chr16,+1256211,149M,6;   MC:Z:150M   MD:Z:147G1  RG:Z:NS500158.1 NM:i:1  AS:i:147    XS:i:147
XXXXXXXX:412:YYYYYYYYY:1:11101:10001:10497  163 chr16   1229833 0   150M    =   1229894 210 CCAGGCCCTGACCTGTGGAATGTGGTGAGGGGCAGGGTGGACCCCGGCTGGGACTCACCAGGGGCCGCGTAGGCGCGGCTCGCCAGGACGGGCAGCGCCAGCAGCAGCAGATTCAGCATCTGGGGAGCAAGGAGGAGCATCGTGGGCCTG  AAAAAEEEEEEEEEEEEAE6EEEAEEEEEEEEEEEEEEAE/EEEEEEEEEEA/AEAEEEEEEEEEAEAE<EEE6A/EEAAAEEEA/EEAAEEAEEE/AAAAEEEEEEEAE/EEEEEEEEEEAEEEEEEAEEEAEE6EAEEAE<</AAA<6  XA:Z:chr16,-1240908,150M,0; MC:Z:149M   MD:Z:150    RG:Z:NS500158.1 NM:i:0  AS:i:150    XS:i:150

Note that BWA has suggested an alternative alignment given in the XA tag. When using MergeBamAlignment as in the best practices pipeline, the alignment in XA is chosen. I have tried modifying the --PRIMARY_ALIGNMENT_STRATEGY parameter, but it doesn't change anything.
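
To compare the BWA records above with what ends up in the merged BAM, it can help to decode the FLAG values and split the XA tag into its individual hits. The following is a minimal Python sketch, not part of any GATK or Picard tool: the helper names `decode_flag` and `parse_xa` are hypothetical, the flag bits are those defined in the SAM specification, and the XA layout (contig, signed position, CIGAR, edit distance) is assumed from the records shown above.

```python
# Flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_xa(xa_value):
    """Split the value of an XA:Z tag into a list of alternative hits."""
    hits = []
    for entry in xa_value.strip(";").split(";"):
        if not entry:
            continue
        contig, signed_pos, cigar, edit_distance = entry.split(",")
        hits.append({
            "contig": contig,
            "strand": signed_pos[0],
            "pos": int(signed_pos[1:]),
            "cigar": cigar,
            "nm": int(edit_distance),
        })
    return hits


# Values copied from the first read of the pair shown above.
print(decode_flag(83))
print(decode_flag(163))
print(parse_xa("chr16,+1240848,149M,1;chr16,+1256211,149M,6;"))
```

Running the same two helpers on the corresponding records of the merged BAM shows directly whether the reported position matches the original BWA position or one of the XA hits.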

In the old days before uBAMs, you worked directly with FASTQ files and hence used the primary alignment selected by BWA. What is the motivation for changing that?

↧

Exit code 3 for HaplotypeCaller?

October 16, 2019, 12:43 pm

I am getting an exit code 3 for HaplotypeCaller version 4.1.2.0. I checked the command line output and the only thing I can see is this:

WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples

and

htsjdk.samtools.util.RuntimeIOException: Unable to close index for file:///path/to/Sample_DS-bkm-085-N.txt.b37.bam.g.vcf.gz
at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:183)
at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:231)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.closeTool(HaplotypeCaller.java:246)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1043)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
Caused by: java.io.IOException: Terminator block not found after closing BGZF file /path/to/Sample_DS-bkm-085-N.txt.b37.bam.g.vcf.gz
at htsjdk.samtools.util.BlockCompressedOutputStream.close(BlockCompressedOutputStream.java:329)
at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.close(IndexingVariantContextWriter.java:172)

Does anyone know what is going on?
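
The "Terminator block not found" message refers to the empty BGZF block that is expected at the very end of a block-compressed file. One way to check a file independently of GATK is to compare its last 28 bytes with that end-of-file marker, whose byte values are given in the SAM/BAM specification. This is only a diagnostic sketch: the function name `has_bgzf_eof` is hypothetical, and a missing marker shows that the file is incomplete without saying why.

```python
import os

# The 28-byte empty BGZF block that marks end-of-file (SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff0600 42430200 1b000300 00000000 00000000"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF end-of-file marker."""
    size = os.path.getsize(path)
    if size < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(size - len(BGZF_EOF))
        return handle.read() == BGZF_EOF
```

For example, `has_bgzf_eof("/path/to/Sample_DS-bkm-085-N.txt.b37.bam.g.vcf.gz")` returning False would be consistent with the exception above.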

↧

BAM file generated by GATK HaplotypeCaller shows duplicates, not fixed by Picard FixMateInformation

October 17, 2019, 6:45 am

I have some BAM files that are being processed with GATK to call variants, following a schedule like the one below:

BAM -> HaplotypeCaller + CombineGVCFs + GenotypeGVCFs -> VCF

However, when I examine the variants found (VCF) in IGV, I see some inconsistencies with these BAM files (e.g. variants that do not match the position). I understand that this is likely caused by HaplotypeCaller doing some local realignment.

Because of that, I'm running HaplotypeCaller again with the -bamout --force-active --disable-optimizations options in order to generate a BAM file (BAM*) that accounts for these local realignments.

HaplotypeCaller () -> BAM*

However, when I examine these BAM* files in IGV, the number of reads has increased, and many of them look like duplicates. This is odd, because the original BAM file was processed with Picard AddOrReplaceReadGroups and MarkDuplicates, and ValidateSamFile reported no errors.

The BAM* generated by HaplotypeCaller shows some mate read errors when ValidateSamFile is used. However, when Picard FixMateInformation is used, the resulting file has size 0 (it is empty). Any ideas on why this is happening and how to solve it?

Version numbers are 2.21.1 for Picard and 4.1.4.0 for GATK

↧

Methods repository not accepting new Mutect2 WDL as a subworkflow

October 17, 2019, 12:00 pm

I have a pipeline that runs Mutect2 as a subworkflow. The first line in the WDL imports the WDL used in the featured workspace:

import "https://raw.githubusercontent.com/gatk-workflows/gatk4-somatic-snvs-indels/2.6.0/mutect2.wdl" as m2

(Previously this was the 2.4.0/mutect2_nio.wdl version, which worked).

When I switched to this import (with no other change to the WDL) the methods repository gives the following error:

Error: Invalid WDL: Unrecognized token on line 445, column 42: samtools view -h -T ~{ref_fasta} ~{cram} | ^

Line 445, column 42 of the imported WDL is the tilde in ~{cram} inside a command block. The previous WDL in the featured workspace used a $ instead of a ~ here.
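
Since the error points at a `~{...}` placeholder, one thing that may be worth checking is which WDL version each file declares and where `~{` placeholders occur. The sketch below assumes (the post does not confirm this) that the parser is treating a file as an older WDL draft in which only `${...}` is recognized; the function name `wdl_placeholder_report` is hypothetical and the embedded WDL text is a made-up fragment, not the real mutect2.wdl.

```python
def wdl_placeholder_report(wdl_text):
    """Return (declared_version, line_numbers_that_use_tilde_placeholders)."""
    lines = wdl_text.splitlines()
    version = None
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments before the first statement
        if stripped.startswith("version "):
            version = stripped.split(None, 1)[1]
        break  # only the first statement can be a version declaration
    tilde_lines = [n for n, line in enumerate(lines, start=1) if "~{" in line]
    return version, tilde_lines


# Made-up fragment for illustration only.
example = (
    "version 1.0\n"
    "task CramToBam {\n"
    "  command {\n"
    "    samtools view -h -T ~{ref_fasta} ~{cram}\n"
    "  }\n"
    "}\n"
)
print(wdl_placeholder_report(example))
```

Running this on both the importing WDL and the imported one shows whether a file without a version declaration is being combined with one that uses `~{...}`.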

↧

Is it possible to create a VCF file with information from a specific set of sites?

October 17, 2019, 12:34 pm

Hi,

Basically, I have a set of FASTQ files from ~50 samples and I need a VCF file containing information about ~1.1 million markers across all samples.
I prepared all 50 filtered BAM files using GATK best practices and now I'm trying to create the raw VCF file just for a subset of positions of interest. The main problem is that the FASTQ sequences were obtained from a capture kit with low coverage, so sites absent from the final VCF file can be either non-variant or missing data.
I tried two different approaches using HaplotypeCaller:
(1) Using a VCF file containing the positions of interest in other samples:

gatk HaplotypeCaller \
--java-options '-Xmx32g -XX:ParallelGCThreads=1' \
-R ref.fa.gz \
-I sample1.sorted.dedup.bam \
-I sample2.sorted.dedup.bam \
...
-I sampleN.sorted.dedup.bam \
-O output.raw.vcf \
-L other_samples.vcf \
--output-mode EMIT_ALL_SITES

(2) Using a ".interval_list" file with the specific site positions:

gatk HaplotypeCaller \
--java-options '-Xmx32g -XX:ParallelGCThreads=1' \
-R ref.fa.gz \
-I sample1.sorted.dedup.bam \
-I sample2.sorted.dedup.bam \
...
-I sampleN.sorted.dedup.bam \
-O output.raw.vcf \
-L list_of_snp_positions.interval_list \
--output-mode EMIT_ALL_SITES

As a result I obtained two VCF files containing information from the same 576,468 sites (approximately half of the information I desired).
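
To see exactly which requested sites are absent from the output, the interval list can be compared with the positions present in the VCF. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, the interval list is assumed to be tab-separated with 1-based inclusive coordinates and `@` header lines, and every VCF record is counted only at its POS (records spanning several bases are not expanded). The inline data are made up.

```python
def interval_list_sites(lines):
    """Yield (contig, position) for every base covered by an interval list."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and the SAM-style header
        contig, start, end = line.split("\t")[:3]
        for pos in range(int(start), int(end) + 1):
            yield contig, pos


def vcf_sites(lines):
    """Yield (contig, position) for every record in a VCF."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        contig, pos = line.split("\t")[:2]
        yield contig, int(pos)


def missing_sites(requested, emitted):
    """Return the requested sites that are absent from the emitted sites, sorted."""
    return sorted(set(requested) - set(emitted))


# Tiny made-up inputs standing in for the real files.
intervals = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "chr1\t100\t100\t+\tsnp1\n",
    "chr1\t200\t200\t+\tsnp2\n",
    "chr1\t300\t300\t+\tsnp3\n",
]
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\t.\t.\n",
    "chr1\t200\t.\tC\t.\t.\t.\t.\n",
]
missing = missing_sites(interval_list_sites(intervals), vcf_sites(vcf))
print(missing)
```

With real files, passing open file handles instead of the lists gives the set of sites to inspect in the BAMs, for instance to check whether they have any coverage at all.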

I'm using java version "1.8.0_201" and The Genome Analysis Toolkit (GATK) v4.1.0.0

Then, my question is: Is it possible to create a VCF file with information from a specific set of sites?

Thanks in advance!

↧

known site for BaseRecalibrator with hg38

October 17, 2019, 1:56 pm

Hello,

I'm trying to call germline SNPs and indels following the best practice pipeline with GATK-4.1.4 on hg38.
To run BaseRecalibrator I'm referring to this page:
https://gatkforums.broadinstitute.org/gatk/discussion/1247/what-should-i-use-as-known-variants-sites-for-running-tool-x

I found dbSNP and the Mills indels in the GATK hg38 bundle, but not the 1KG indels. Do you recommend using another indel resource instead, such as the known_indels.vcf I found in the bundle's beta directory?

Thank you!
Hiroko Matsui

↧

StrandBias annotation generated by HaplotypeCaller is absent in output of GenotypeGVCFs

October 17, 2019, 3:53 pm

Dear GATK staff and forum community,

My question follows on from a comment posted in a broader topic on annotations not working for HaplotypeCaller:
https://gatkforums.broadinstitute.org/gatk/discussion/comment/58658#Comment_58658
In my case I want to include the StrandBiasBySample (SB) annotation (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_StrandBiasBySample.php), and HaplotypeCaller generated the annotation successfully across all my files (e.g. GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:73,6,0,73). However, the output from GenotypeGVCFs does not have any SB annotation. My commands and log below specify java version and gatk package. I cannot seem to find any answers posted on this, it would be great to know if there is a way of incorporating the annotation with GenotypeGVCFs or not.

Many thanks for your help!

All the best,

Lucio

/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar ~/Mito_reads/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar HaplotypeCaller \
-R ~/Mito_reads/data/ref_seqs/NC_001960.1_Salmo.fa \
-I $i \
-O "$i".SB.g.vcf \
--emit-ref-confidence GVCF \
-A StrandBiasBySample \
-A AS_StrandOddsRatio \
-A QualByDepth

/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar ~/Mito_reads/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar GenomicsDBImport \
--genomicsdb-workspace-path ~/Mito_reads/data/Demultiplexed_Salmon/db_SB/ \
--sample-name-map ~/Mito_reads/data/Demultiplexed_Salmon/vcf_SB.map \
--intervals NC_001960.1:1-16665

/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar ~/Mito_reads/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar GenotypeGVCFs \
-R ~/Mito_reads/data/ref_seqs/NC_001960.1_Salmo.fa \
-V gendb:///home/lmarcello/Mito_reads/data/Demultiplexed_Salmon/db_SB/ \
-O ~/Mito_reads/data/Demultiplexed_Salmon/GATKsalmon_SB.vcf

If I run grep -c ":SB" on any of my samples I get a number corresponding to the number of variants (i.e. SB has been added to all variants as an annotation), whereas if I run grep -c ":SB" on GATKsalmon_SB.vcf I get 0.
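
The grep check above matches the text ":SB" anywhere on a line. A slightly stricter variant of the same check looks only at the FORMAT column of each record. This is a minimal Python sketch; the function name `count_format_key` is hypothetical and the two example records are made up to mimic a per-sample gVCF line and a joint-genotyped line.

```python
def count_format_key(lines, key="SB"):
    """Count VCF records whose FORMAT column (9th field) lists the given key."""
    count = 0
    for line in lines:
        if line.startswith("#"):
            continue  # header lines carry no FORMAT column
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 8 and key in fields[8].split(":"):
            count += 1
    return count


# Made-up records for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
gvcf_lines = [
    header,
    "NC_001960.1\t639\t.\tA\tG,<NON_REF>\t50\t.\t.\tGT:AD:DP:GQ:PL:SB\t1/1:0,2,0:2:6:73,6,0,73,6,73:0,0,1,1\n",
]
joint_lines = [
    header,
    "NC_001960.1\t639\t.\tA\tG\t50\t.\t.\tGT:AD:DP:GQ:PL\t1/1:0,2:2:6:73,6,0\n",
]
print(count_format_key(gvcf_lines), count_format_key(joint_lines))
```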

Here is the log from running GenotypeGVCFs. There does not seem to be any mention of SB.

(base) lmarcello@RLI-Linux:~/Mito_reads/data/Demultiplexed_Salmon$ sh GATK_GenotypeGVCFs.sh
23:28:32.694 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/lmarcello/Mito_reads/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:28:34.323 INFO GenotypeGVCFs - ------------------------------------------------------------
23:28:34.324 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.11.0
23:28:34.324 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
23:28:34.324 INFO GenotypeGVCFs - Executing as lmarcello@RLI-Linux on Linux v4.15.0-54-generic amd64
23:28:34.324 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_222-8u222-b10-1ubuntu1~18.04.1-b10
23:28:34.324 INFO GenotypeGVCFs - Start Date/Time: 17 October 2019 23:28:32 BST
23:28:34.324 INFO GenotypeGVCFs - ------------------------------------------------------------
23:28:34.324 INFO GenotypeGVCFs - ------------------------------------------------------------
23:28:34.324 INFO GenotypeGVCFs - HTSJDK Version: 2.16.1
23:28:34.324 INFO GenotypeGVCFs - Picard Version: 2.18.13
23:28:34.324 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:28:34.324 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:28:34.324 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:28:34.324 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:28:34.324 INFO GenotypeGVCFs - Deflater: IntelDeflater
23:28:34.324 INFO GenotypeGVCFs - Inflater: IntelInflater
23:28:34.324 INFO GenotypeGVCFs - GCS max retries/reopens: 20
23:28:34.324 INFO GenotypeGVCFs - Requester pays: disabled
23:28:34.325 INFO GenotypeGVCFs - Initializing engine
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
23:28:34.752 INFO GenotypeGVCFs - Done initializing engine
23:28:34.783 INFO ProgressMeter - Starting traversal
23:28:34.784 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
23:28:36.213 WARN ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location NC_001960.1:639 the annotation AS_SB_TABLE=0,0|0,0|0,0 was not a numerical value and was ignored
23:28:41.267 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:43.364 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:43.366 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:45.448 INFO ProgressMeter - NC_001960.1:7595 0.2 3000 16879.2
23:28:45.896 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:47.749 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:49.051 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:53.357 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:53.392 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:55.493 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:55.549 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
23:28:56.594 INFO ProgressMeter - NC_001960.1:15684 0.4 7000 19257.2
23:28:58.321 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),3.0794550240000023,Cpu time(s),3.064977544000002
23:28:58.333 INFO GenotypeGVCFs - No variants filtered by: AllowAllVariantsVariantFilter
23:28:58.333 INFO ProgressMeter - NC_001960.1:15684 0.4 7853 20008.5
23:28:58.333 INFO ProgressMeter - Traversal complete. Processed 7853 total variants in 0.4 minutes.
23:28:58.336 INFO GenotypeGVCFs - Shutting down engine
[17 October 2019 23:28:58 BST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.43 minutes.
Runtime.totalMemory()=1613758464

↧

Weird VQSR filtering pattern

August 19, 2019, 12:38 am

I performed HC joint calling on 300 normal tissue samples. I followed every step of the best practices and got a VQSR plot very similar to this one. Most variants have the same MQ and FS values, which makes these feature distributions non-Gaussian. Would such a result still be valid? Can anyone share some insight?

Cheng

↧

(How to) Call somatic mutations using GATK4 Mutect2 (Deprecated)

January 6, 2018, 8:39 pm

This tutorial is now deprecated and only valid for Mutect2 v4.1.0.0 and lower. For Mutect2 v4.1.1.0 and higher, please refer to this tutorial.

Post suggestions and read about updates in the Comments section.

This tutorial introduces researchers to considerations in somatic short variant discovery using GATK4 Mutect2. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood and are aligned to GRCh38 with post-alt processing [1]. The tutorial focuses on how to call traditional somatic short mutations, as described in Article#11127 and pipelined in GATK v4.0.0.0's mutect2.wdl [2]. The tool and its workflow are in BETA status as of this writing, which means they may undergo changes and are not guaranteed for production.

► For Broad Mutation Calling Best Practices, see FireCloud Article#45055.

Section 1 calls somatic mutations with Mutect2 using all the bells and whistles of the tool. Section 2 outlines how to create the panel of normals resource using the tumor-only mode of Mutect2. Section 3 outlines how to estimate cross-sample contamination. Section 4 shows how to filter the callset with FilterMutectCalls. Unlike GATK3, in GATK4 the somatic calling and filtering functionalities are embodied by separate tools. Section 5 shows an optional filtering step to filter by sequence context artifacts that present with orientation bias, e.g. OxoG artifacts. Section 6 shows how to set up in IGV for manual review. Finally, section 7 provides a brief list of related resources that may be of interest to researchers.

GATK4 Mutect2 is a versatile variant caller that not only is more sensitive than, but is also roughly twice as fast as, HaplotypeCaller's reference confidence mode. Researchers who wish to customize analyses should find the tutorial's descriptions of the multiple levers of Mutect2 in section 1 and descriptions of the tumor-only mode of Mutect2 in section 2 of interest.

Tools involved

GATK v4.0.0.0 is available in a Docker image and as a standalone jar. For the latest release, see the Downloads page. Note that GATK v4.0.0.0 contains Picard tools from release v2.17.2 that are callable with the gatk launch script.
Desktop IGV. The tutorial uses v2.3.97.

Download example data

Download tutorial_11136.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data and resources, see [3] and [4].

► The tutorial steps switch between the subset and full data. Some of the data files, e.g. BAMs, are restricted to a small region of the genome to efficiently pace the tutorial. Other files, e.g. the Mutect2 calls that the tutorial filters, are from the entire genome. The tutorial content was originally developed for the 2017-09 Helsinki workshop and we make the full data files, i.e. the resource files and the BAMs, available at gs://gatk-best-practices/somatic-hg38.

1. Call somatic short variants and generate a bamout with Mutect2

Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see Article#11127. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.

gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam

This produces a raw unfiltered somatic callset 1_somatic_m2.vcf.gz, a reassembled reads BAM 2_tumor_normal_m2.bam and the respective indices 1_somatic_m2.vcf.gz.tbi and 2_tumor_normal_m2.bai.

Comments on select parameters

Specify the case sample for somatic calling with two parameters. Provide the BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName. Alternatively, use samtools view -H tumor.bam | grep '@RG'.
Prefilter variant sites in a control sample alignment. Specify the control BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.
Prefilter variant sites in a panel of normals callset. Specify the panel of normals (PoN) VCF with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it also captures commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool neither reassembles nor emits variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.
Annotate variant alleles by specifying a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025 which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.
Include reads whose mate maps to a different contig. For our somatic analysis that uses alt-aware and post-alt processed alignments to GRCh38, we disable a specific read filter with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (in alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.
Target the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.
Generate the reassembled alignments file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.
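
The arithmetic behind the --af-of-alleles-not-in-resource value used in the command can be sketched as follows. This is only an illustration of the 1/(2*exome samples) relationship stated above; the helper name af_not_in_resource is made up for this example and is not part of GATK, and the fixed seven decimal places are chosen to suit this particular magnitude.

```shell
# Allele frequency to assume for an allele that is absent from the germline
# resource: one allele among the 2 * N chromosomes of N diploid samples.
af_not_in_resource() {
  # $1 = number of samples in the resource (assumed diploid)
  awk -v n="$1" 'BEGIN { printf "%.7f\n", 1 / (2 * n) }'
}

# ~200k gnomAD exomes, as quoted above
af_not_in_resource 200000
```

For 200,000 exomes this prints 0.0000025, the value passed to Mutect2 in the command above.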

To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.

We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.

☞ 1.1 What are the Mutect2 annotations?

We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.

The Variant Annotations section of the Tool Documentation further describes some of the annotations. For a complete list of annotations available in GATK4, see this site.

To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.

☞ 1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?

To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter (MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigs and read pairs that span contigs can now lend support to variant calls.

For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls comes from the ALT, HLA and decoy contigs.

2. Create a sites-only PoN with CreateSomaticPanelOfNormals

We go through the motions of creating a PoN using three germline samples. These samples are HG00190, NA19771 and HG02759 [3].

First, run Mutect2 in tumor-only mode on each normal sample. In tumor-only mode, a single case sample is analyzed with the -tumor flag without an accompanying matched control -normal sample. For the tutorial, we run this command only for sample HG00190.

gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz

This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.

We see for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively for the two sites.
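
To get a feel for how low some of these allele fractions are, the AD values can be turned into naive read-count fractions. The sketch below is a rough illustration only: it divides each allele depth by the total depth, which is not how Mutect2 derives its own AF annotation, and the helper name ad_fractions is hypothetical.

```shell
# Convert a comma-separated AD value (reference allele first) into
# naive per-allele read fractions, rounded to three decimals.
ad_fractions() {
  echo "$1" | awk -F',' '{
    total = 0
    for (i = 1; i <= NF; i++) total += $i
    if (total == 0) { print "NA"; next }   # no informative reads
    for (i = 1; i <= NF; i++)
      printf "%.3f%s", $i / total, (i < NF ? "," : "\n")
  }'
}

# AD values of the two multiallelic records discussed above
ad_fractions 16,3,12,4
ad_fractions 41,5,24,4
```

The first record gives 0.457,0.086,0.343,0.114 and the second 0.554,0.068,0.324,0.054, so some alternate alleles are supported by well under 10% of the reads.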

Comments on select parameters

One option that is not used here is to include a germline resource with --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af.
Recapitulate any special options used in somatic calling in the panel of normals sample calling, e.g. --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.

Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for panel of normals is a minimum of forty samples.

gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.
CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.

Ideally, the PoN includes samples that are technically representative of the tumor case sample, i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur at largely the same genomic loci across short read sequencing approaches.

What do you think of including samples of family members in the PoN?

☞ 2.1 The tumor-only mode of Mutect2 is useful outside of PoN creation

For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples. This approach is described in Blog#11315.

3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination

First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.

gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table

This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.

Comments on select parameters

The tool only considers homozygous alternate sites in the sample that have a population allele frequency that ranges between that set by --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
One option to speed up analysis, which the command above does not use, is to limit data collection to a sufficiently large subset of the genome with the -L argument.
As of GATK4.0.8.0, released August 2, 2018, GetPileupSummaries requires both -L and -V parameters. For the tutorial, provide the same resources/chr17_small_exac_common_3_grch38.vcf.gz file to each parameter. For details, see the GetPileupSummaries tool documentation.

Second, estimate contamination with CalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table

This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.
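
The rule of thumb above, comparing a call's alternate allele fraction against the contamination estimate, can be sketched as a simple check. This is an illustration of the reasoning only, not a reimplementation of FilterMutectCalls; the helper name flag_by_contamination and the 0.015 example value are hypothetical.

```shell
# Flag a call whose alternate allele fraction falls below the
# contamination estimate as one to treat with suspicion.
flag_by_contamination() {
  # $1 = alternate allele fraction of the call, $2 = contamination estimate
  awk -v af="$1" -v c="$2" 'BEGIN { print (af < c ? "suspect" : "ok") }'
}

flag_by_contamination 0.973 0.0205   # tumor AF of the TP53 call in section 6
flag_by_contamination 0.015 0.0205   # hypothetical low-fraction call
```

The first call prints ok and the second prints suspect.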

Comments on select parameters

CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument.

► Cross-sample contamination differs from normal contamination of tumor and tumor contamination of normal. Currently, the workflow does not account for the latter type of purity issue.

☞ 3.1 What if I find high levels of contamination?

One thing to rule out is sample swaps at the read group level.

Picard’s CrosscheckFingerprints can detect sample-swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual and (ii) check at the read group level that each of the read group data come from the same individual.

Again, imagine if we mistook the contaminating read group data as some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us All read groups related as expected.

4. Filter for confident somatic calls using FilterMutectCalls

FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.

Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.

gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz

This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.

This step seemingly applies 14 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.

So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.

► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.

5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.

First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
--FILE_EXTENSION ".txt" \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Alternatively, use the tool from a standalone Picard jar.

java -jar picard.jar \
CollectSequencingArtifactMetrics \
I=tumor.bam \
O=10_tumor_artifact \
FILE_EXTENSION=.txt \
R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.

Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.

The TOTAL_QSCORE is Phred-scaled such that lower scores equate to a higher probability the change is artifactual. E.g. forty translates to 1 in 10,000 probability. For OxoG, a rough cutoff for concern is 30. FilterByOrientationBias uses the quality score as a prior that a context will produce an artifact. The tool also weighs the evidence from the reads. For example, if the QSCORE is 50 but the allele is supported by 15 reads in F1R2 and no reads in F2R1, then the tool should filter the call.
FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations. Oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. In GATK4 Mutect2, the F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.
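
The Phred relationship mentioned above, where a score of forty corresponds to a 1 in 10,000 probability, follows the usual formula P = 10^(-Q/10). A minimal sketch of the conversion, with a hypothetical helper name qscore_to_prob:

```shell
# Convert a Phred-scaled quality score Q into the probability it encodes,
# P = 10^(-Q/10).
qscore_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

qscore_to_prob 40   # the 1 in 10,000 example from the text
qscore_to_prob 30   # the rough OxoG cutoff for concern
```

These print 0.0001 and 0.001, respectively, so the rough OxoG cutoff of 30 corresponds to a 1 in 1,000 probability.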

Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.

gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz

This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.

Is the filtering in line with our earlier prediction?

In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.

☞ 5.1 Tally of applied filters for the tutorial data

The table shows the breakdown in filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallies the instances in which each filter was applied across the calls and the third column tallies the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).
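
The summary figures quoted here can be rechecked from the four underlying counts. The sketch below simply redoes the arithmetic with the numbers given in this section; the function name tally_summary is made up for this example.

```shell
# Recompute the summary statistics from the counts quoted in the text:
# 3,695 total calls, 673 passing calls, 1,694 calls filtered by a single
# filter and 5,223 filter applications overall.
tally_summary() {
  awk -v total=3695 -v pass=673 -v sole=1694 -v applied=5223 'BEGIN {
    filtered = total - pass
    printf "filtered calls: %d\n", filtered
    printf "passing fraction: %.1f%%\n", 100 * pass / total
    printf "filtered by a single filter: %.1f%%\n", 100 * sole / filtered
    printf "filters per filtered call: %.2f\n", applied / filtered
  }'
}

tally_summary
```

This reports 3022 filtered calls, a passing fraction of 18.2%, 56.1% of filtered calls filtered by a single filter and 1.73 filters per filtered call.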

Which filters appear to have the greatest impact? What types of calls do you think compel manual review?

Examine passing records with the following command. Take note of the AD and AF annotation values in particular, as they show the high sensitivity of the caller.

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less

6. Set up in IGV to review somatic calls

Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.

To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect those of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.

First, load Human (hg38) as the reference in IGV. Then load these six files in order:

resources/chr17_pon.vcf.gz
resources/chr17_af-only-gnomad_grch38.vcf.gz
11_somatic_twicefiltered.vcf.gz
2_tumor_normal_m2.bam
normal.bam
tumor.bam

With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.

Second, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).

One of the tracks is dominating the view. Right-click on track chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.
Zoom into the somatic call in 11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.
Hover over or click on the gray call in track 11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.
Scroll through the alignment data and notice the coverage for the samples.

A C→T variant is in tumor.bam but not normal.bam. What is happening in 2_tumor_normal_m2.bam?

Third, tweak IGV settings that aid in visualizing reassembled alignments.

Make room to focus on track 2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bam, normal.bam and their coverages. Right-click and Remove Tracks.
Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
Drag the alignments panel to center the red variant.
Right-click on the alignments track and
- Group by sample
- Sort by base
- Color by tag: HC.
Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.

What are the three grouped tracks for the bamout? What do the pastel versus gray colors indicate? How plausible is it that all tumor copies of this locus have this alteration?

Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.

CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr17	7,674,220	.	C	T	.	PASS	DP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15

FORMAT	GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB
HCC1143_normal	0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false
HCC1143_tumor	0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946

Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.

CHROM	POS	REF	ALT	FILTER
chr17	4,539,344	T	TA	artifact_in_normal;germline_risk;panel_of_normals
chr17	7,221,420	CACTGCCCTAGGTCAGGA	C	artifact_in_normal;panel_of_normals;str_contraction
chr17	7,483,063	A	AC	mapping_quality;t_lod
chr17	8,513,688	GTT	G	panel_of_normals
chr17	19,748,387	G	GA	t_lod
chr17	26,982,033	G	GC	artifact_in_normal;clustered_events
chr17	30,059,463	CT	C	t_lod
chr17	35,422,473	C	CA	t_lod
chr17	35,671,734	CTT	C,CT,CTTT	artifact_in_normal;multiallelic;panel_of_normals
chr17	43,104,057	CA	C	artifact_in_normal;germline_risk;panel_of_normals
chr17	43,104,072	AAAAAAAAAGAAAAG	A	panel_of_normals;t_lod
chr17	46,332,538	G	GT	artifact_in_normal;panel_of_normals
chr17	47,157,394	CAA	C	panel_of_normals;t_lod
chr17	50,124,771	GCACACACACACACACA	G	clustered_events;panel_of_normals;t_lod
chr17	68,907,890	GA	G	artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
chr17	69,182,632	C	CA	artifact_in_normal;t_lod
chr17	69,182,835	GAAAA	G	panel_of_normals
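
As a starting point for studying the filters, the FILTER values of the 17 records above can be tallied in the same spirit as the table in section 5.1. The sketch below is one possible approach using awk and sort; the function name tally_filters is hypothetical, and the FILTER strings are copied from the records listed above.

```shell
# FILTER values of the 17 indel records, one record per line.
filters='artifact_in_normal;germline_risk;panel_of_normals
artifact_in_normal;panel_of_normals;str_contraction
mapping_quality;t_lod
panel_of_normals
t_lod
artifact_in_normal;clustered_events
t_lod
t_lod
artifact_in_normal;multiallelic;panel_of_normals
artifact_in_normal;germline_risk;panel_of_normals
panel_of_normals;t_lod
artifact_in_normal;panel_of_normals
panel_of_normals;t_lod
clustered_events;panel_of_normals;t_lod
artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
artifact_in_normal;t_lod
panel_of_normals'

# Print "filter total sole": how often each filter was applied, and how
# often it was the only filter on a record, most frequent filter first.
tally_filters() {
  awk -F';' '
    { for (i = 1; i <= NF; i++) total[$i]++ }
    NF == 1 { sole[$1]++ }
    END { for (f in total) printf "%s %d %d\n", f, total[f], sole[f] }
  ' | sort -k2,2nr -k1,1
}

printf '%s\n' "$filters" | tally_filters
```

The first two output lines are panel_of_normals 11 2 and t_lod 9 3, i.e. the panel of normals flags 11 of the 17 indels but is the sole reason for filtering only two of them. To run the same tally on a full callset, one could feed the function the seventh column of the VCF instead.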

7. Related resources

The next step after generating a carefully manicured somatic callset is typically functional annotation.

Funcotator is available in BETA and can annotate GRCh38 and prior reference aligned VCF format data.
Oncotator can annotate GRCh37 and prior reference aligned MAF and VCF format data. It is also possible to download and install the tool following instructions in Article#4154.
Annotate with the external program VEP to predict phenotypic changes and confirm or hypothesize biochemical effects.

For a cohort, after annotation, use MutSig to discover driver mutations. MutsigCV (the version is CV) is available on GenePattern. If more samples are needed to increase the power of the analysis, consider padding the analysis set with TCGA Project or other data.

The dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665 shows somatic copy number alterations for the HCC1143 tumor sample. Its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. Somatic copy number alteration detection will be covered in another GATK tutorial. For reference implementations of Somatic CNV workflows see here.

Footnotes

[1] Data was alt-aware aligned to GRCh38 and post-alt processed. For an introduction to alt-aware alignment and post-alt processing, see [Blog#8180](https://software.broadinstitute.org/gatk/blog?id=8180). The HCC1143 alignments are identical to those in [Tutorial#9183](https://software.broadinstitute.org/gatk/documentation/article?id=9183), which uses GATK3 MuTect2.

[2] For scripted GATK Best Practices Somatic Short Variant Discovery workflows, see [https://github.com/gatk-workflows](https://github.com/gatk-workflows). Within the repository, as of this writing, [gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels), which uses GRCh37, is the sole GATK4 Mutect2 workflow. This tutorial uses additional parameters not used in the [GRCh37 gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels) example because the tutorial data was preprocessed with post-alt processing of alt-aware alignments, which deviates from production practices. The general workflow steps remain the same.

[3] About the tutorial data:

The data tarball contains 15 files in the main directory, six files in its resources folder and twenty files in its precomputed folder. In the file names, chr17 refers to data subset to the regions in chr17plus.interval_list, m2pon to the PoN that consists of forty 1000 Genomes Project samples, pon to panel of normals, tumor to the HCC1143 breast cancer tumor sample and normal to its matched blood normal.
Again, example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and known as HCC1143 and HCC1143_BL, respectively. The Broad Cancer Genome Analysis (CGA) group has graciously provided 2x76 paired-end whole exome sequence data from the two cell lines (C835.HCC1143_2 and C835.HCC1143_BL.4), and @shlee reverted and aligned these to GRCh38 using alt-aware alignment and post-alt processing as described in Tutorial#8017. During preprocessing, the MergeBamAlignment step was omitted, reads containing adapter sequence were removed altogether for both samples (~0.153% of reads in the tumor) as determined by MarkIlluminaAdapters, base qualities were not binned during base recalibration and indel realignment was included to match the toolchain of the PoN normals. The program group for base recalibration is absent from the BAM headers due to a bug in the version of PrintReads at the time of pre-processing, in January of 2017.
Note that the tutorial uses exome data for its small size. The workflow is applicable to whole genome sequence data (WGS).
@shlee lifted-over or remapped the gnomAD resource files from GRCh37 counterparts to GRCh38. The tutorial uses subsets of the full resources; the full-length versions are available at gs://gatk-best-practices/somatic-hg38/. The official GRCh37 versions of the resources are available in the GATK Resource Bundle and are based on the gnomAD resource. These GRCh37 versions were prepared by @davidben according to the method outlined in the mutect_resources.wdl and described in [4].
The full data in the tutorial were generated by @shlee using the github.com/broadinstitute/gatk mutect2.wdl from between the v4.0.0.0 and v4.0.0.1 release with commit hash b4d1ddd. The GATK Docker image was broadinstitute/gatk:4.0.0.0 and Picard was v2.14.1. A single modification was made to the script to enable generating the bamout. The script was run locally on a Google Cloud Compute VM using Cromwell v30.1. Given Docker was installed and the specified Docker images were present on the VM, Cromwell automatically launched local Docker container instances during the run and handled the local files as hard-links to avoid redundant copying. Workflow input variables were as follows.

{
  "##_COMMENT1:": "WORKFLOW STEP OPTIONS",
  "Mutect2.is_run_oncotator": "False",
  "Mutect2.is_run_orientation_bias_filter": "True",
  "Mutect2.picard": "/home/shlee/picard-2.14.1.jar",
  "Mutect2.gatk_docker": "broadinstitute/gatk:4.0.0.0",
  "Mutect2.oncotator_docker": "broadinstitute/oncotator:1.9.3.0",
...
  "##_COMMENT3:": "ANALYSIS PARAMETERS",
  "Mutect2.artifact_modes": ["G/T", "C/T"],
  "Mutect2.m2_extra_args": "--af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter",
  "Mutect2.m2_extra_filtering_args": "",
  "Mutect2.scatter_count": "10"
}

If using newer versions of the mutect2.wdl that allow setting SplitIntervals optional arguments, then @shlee recommends setting --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION to avoid splitting contigs.
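As a rough illustration of that recommendation, the sketch below assembles a standalone SplitIntervals command line in Python. The file names (`ref.fasta`, `targets.interval_list`, `scattered`) are placeholders and the helper `split_intervals_command` is hypothetical; only the `--subdivision-mode` value comes from the text above.

```python
import shlex


def split_intervals_command(reference, intervals, scatter_count, output_dir):
    """Build a GATK4 SplitIntervals command that does not subdivide intervals.

    All paths are caller-supplied placeholders; nothing is executed here.
    """
    return [
        "gatk", "SplitIntervals",
        "-R", reference,
        "-L", intervals,
        "--scatter-count", str(scatter_count),
        "--subdivision-mode", "BALANCING_WITHOUT_INTERVAL_SUBDIVISION",
        "-O", output_dir,
    ]


# Placeholder inputs; the scatter count mirrors Mutect2.scatter_count above.
cmd = split_intervals_command("ref.fasta", "targets.interval_list", 10, "scattered")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting concerns.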
With the exception of the PoN and Picard tool steps, data were generated using v4.0.0.0. The PoN was generated using GATK4 vbeta.6. Besides the syntax, little changed for the Mutect2 workflow between these releases, and the workflow and most of its tools remain in beta status as of this writing. We used Picard v2.14.1 for the CollectSequencingArtifactMetrics step. Figures in section 5 reflect results from Picard v2.11.0, which gives, at a glance, results identical to those from v2.14.1.
The three samples in section 2 are present in the forty sample PoN used in section 1 and they are 1000 Genomes Project samples.

[4] The WDL script [mutect_resources.wdl](https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl) takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in _section 1_ and a common biallelic variants resource for use in _section 3_. The script first generates a sites-only VCF and in the process _removes all extraneous annotations_ except for `AF` allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF.
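The two transformations described in [4] can be sketched on plain VCF text lines. This is a minimal illustration in Python, not the implementation in mutect_resources.wdl, which relies on GATK tools; the function names and the toy records are made up for the example, and header lines describing the dropped annotations are left in place for brevity.

```python
def to_sites_only_af(lines):
    """Drop genotype columns and keep only the AF entry in INFO."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")[:8]  # CHROM..INFO, no FORMAT or samples
        if not line.startswith("#"):
            af = [kv for kv in fields[7].split(";") if kv.startswith("AF=")]
            fields[7] = af[0] if af else "."
        yield "\t".join(fields)


def biallelic_only(lines):
    """Keep header lines and records with exactly one ALT allele."""
    for line in lines:
        if line.startswith("#"):
            yield line
        else:
            alt = line.split("\t")[4]
            if alt != "." and "," not in alt:
                yield line


# Toy input: one biallelic and one multiallelic record (invented values).
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tA\tG\t50\tPASS\tAC=2;AF=0.01;AN=200\tGT\t0/1",
    "chr1\t200\t.\tC\tT,G\t50\tPASS\tAC=1,1;AF=0.005,0.005\tGT\t1/2",
]
sites = list(to_sites_only_af(vcf))
biallelic = list(biallelic_only(sites))
print("\n".join(biallelic))
```

The first function corresponds to the simplified germline resource and the second to the biallelic subset taken from it.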

↧

Funcotator user-defined data sources

October 19, 2019, 11:57 am

Hi, I was recently trying to expand the annotation data sources used by Funcotator, but the documentation gives little information and no example. I am now trying to add CADD to the data sources folder. After running Funcotator, I got the following error:

org.broadinstitute.hellbender.exceptions.GATKException: Error initializing feature reader for path file: funcotator_dataSources.v1.6.20190124s/cadd/hg19/cadd.config
at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:353)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getFeatureReader(FeatureDataSource.java:305)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:256)
at org.broadinstitute.hellbender.engine.FeatureManager.addToFeatureSources(FeatureManager.java:234)
at org.broadinstitute.hellbender.engine.GATKTool.addFeatureInputsAfterInitialization(GATKTool.java:957)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.DataSourceUtils.createAndRegisterFeatureInputs(DataSourceUtils.java:328)
at org.broadinstitute.hellbender.tools.funcotator.dataSources.DataSourceUtils.createDataSourceFuncotationFactoriesForDataSources(DataSourceUtils.java:277)
at org.broadinstitute.hellbender.tools.funcotator.Funcotator.onTraversalStart(Funcotator.java:774)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1037)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
Caused by: htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Duplicate key 0, for input source: cadd.config
at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:263)
at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:102)
at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:127)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:120)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:350)
... 14 more
Caused by: java.lang.IllegalStateException: Duplicate key 0
at java.util.stream.Collectors.lambda$throwingMerger$0(Collectors.java:133)
at java.util.HashMap.merge(HashMap.java:1254)
at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at org.broadinstitute.hellbender.utils.codecs.xsvLocatableTable.XsvLocatableTableCodec.readActualHeader(XsvLocatableTableCodec.java:341)
at org.broadinstitute.hellbender.utils.codecs.xsvLocatableTable.XsvLocatableTableCodec.readActualHeader(XsvLocatableTableCodec.java:64)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37)
at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:261)
... 18 more

java version:
java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1~deb9u1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)

I added the cadd folder to the data sources folder, following the structure described in the documentation:

cadd
|- hg19
| |- cadd.config
| |- InDels_inclAnno.tsv
| |- InDels_inclAnno.tsv.gz.tbi
|
|- hg38
| |- cadd.config
| |- InDels_inclAnno.tsv
| |- InDels_inclAnno.tsv.gz.tbi

The config file (cadd.config)

name = CADD
version = v1.4
src_file = InDels_inclAnno.tsv
origin_location =
preprocessing_script = UNKNOWN

# Whether this data source is for the b37 reference.
# Required and defaults to false.
isB37DataSource = false

# Supported types:
# simpleXSV -- Arbitrary separated value table (e.g. CSV), keyed off Gene Name OR Transcript ID
# locatableXSV -- Arbitrary separated value table (e.g. CSV), keyed off a genome location
# gencode -- Custom datasource class for GENCODE
# cosmic -- Custom datasource class for COSMIC
# vcf -- Custom datasource class for Variant Call Format (VCF) files
type = locatableXSV

# Required field for GENCODE files.
# Path to the FASTA file from which to load the sequences for GENCODE transcripts:
gencode_fasta_path =

# Required field for GENCODE files.
# NCBI build version (either hg19 or hg38):
ncbi_build_version =

# Required field for simpleXSV files.
# Valid values:
# GENE_NAME
# TRANSCRIPT_ID
xsv_key = GENE_NAME

# Required field for simpleXSV files.
# The 0-based index of the column containing the key on which to match
xsv_key_column =

# Required field for simpleXSV AND locatableXSV files.
# The delimiter by which to split the XSV file into columns.
xsv_delimiter = \t

# Required field for simpleXSV files.
# Whether to permissively match the number of columns in the header and data rows
# Valid values:
# true
# false
xsv_permissive_cols =

# Required field for locatableXSV files.
# The 0-based index of the column containing the contig for each row
contig_column = 0

# Required field for locatableXSV files.
# The 0-based index of the column containing the start position for each row
start_column = 1

# Required field for locatableXSV files.
# The 0-based index of the column containing the end position for each row
end_column = 1
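A quick way to sanity-check a config like the one above is to parse its `key = value` lines and confirm that the fields its own comments mark as required for locatableXSV are filled in. The Python sketch below does only that. It is not how Funcotator reads the file, the helper names are hypothetical, and treating `src_file` as required is an assumption.

```python
COLUMN_KEYS = ("contig_column", "start_column", "end_column")
# src_file is assumed to be required; the rest follow the config comments.
REQUIRED = ("src_file", "xsv_delimiter") + COLUMN_KEYS


def parse_config(text):
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def check_locatable_xsv(settings):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    for key in REQUIRED:
        if not settings.get(key):
            problems.append(f"{key} is missing or empty")
    for key in COLUMN_KEYS:
        value = settings.get(key, "")
        if value and not value.isdigit():
            problems.append(f"{key} is not a non-negative integer: {value!r}")
    return problems


# Values copied from the cadd.config shown above (comments omitted).
EXAMPLE = r"""
name = CADD
version = v1.4
src_file = InDels_inclAnno.tsv
type = locatableXSV
xsv_key_column =
xsv_delimiter = \t
contig_column = 0
start_column = 1
end_column = 1
"""
settings = parse_config(EXAMPLE)
problems = check_locatable_xsv(settings)
print(problems or "no problems found")
```

A clean result here only shows that the fields are populated; it says nothing about whether the values match the data file.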

A snapshot of InDels_inclAnno.tsv:
Chrom Pos Ref Alt Type Length AnnoType Consequence ConsScore ConsDetail GC CpG motifECount motifEName motifEHIPos motifEScoreChng oAA nAA GeneID FeatureID GeneName CCDS Intron Exon cDNApos relcDNApos CDSpos relCDSpos protPos relProtPos Domain Dst2Splice Dst2SplType minDistTSS minDistTSE SIFTcat SIFTval PolyPhenCat PolyPhenVal priPhCons mamPhCons verPhCons priPhyloP mamPhyloP verPhyloP bStatistic targetScan mirSVR-Score mirSVR-E mirSVR-Aln cHmm_E1 cHmm_E2 cHmm_E3 cHmm_E4 cHmm_E5 cHmm_E6 cHmm_E7 cHmm_E8 cHmm_E9 cHmm_E10 cHmm_E11 cHmm_E12 cHmm_E13 cHmm_E14 cHmm_E15 cHmm_E16 cHmm_E17 cHmm_E18 cHmm_E19 cHmm_E20 cHmm_E21 cHmm_E22 cHmm_E23 cHmm_E24 cHmm_E25 GerpRS GerpRSpval GerpN GerpS tOverlapMotifs motifDist EncodeH3K4me1-sum EncodeH3K4me1-max EncodeH3K4me2-sum EncodeH3K4me2-max EncodeH3K4me3-sum EncodeH3K4me3-max EncodeH3K9ac-sum EncodeH3K9ac-max EncodeH3K9me3-sum EncodeH3K9me3-max EncodeH3K27ac-sum EncodeH3K27ac-max EncodeH3K27me3-sum EncodeH3K27me3-max EncodeH3K36me3-sum EncodeH3K36me3-max EncodeH3K79me2-sum EncodeH3K79me2-max EncodeH4K20me1-sum EncodeH4K20me1-max EncodeH2AFZ-sum EncodeH2AFZ-max EncodeDNase-sum EncodeDNase-max EncodetotalRNA-sum EncodetotalRNA-max Grantham Dist2Mutation Freq100bp Rare100bp Sngl100bp Freq1000bp Rare1000bp Sngl1000bp Freq10000bp Rare10000bp Sngl10000bp EnsembleRegulatoryFeature dbscSNV-ada_score dbscSNV-rf_score RemapOverlapTF RemapOverlapCL RawScore PHRED

1 10001 T TC INS 1 RegulatoryFeature REGULATORY 4 regulatory 0.448933333333 0.00993288590604 NA NA NA NA NA NA NA ENSR00000344265 NA NA NA NA NA NA NA NA NA NA NA NA NA 1869 3670 NA NA NA NA NA NA NA NA NA NA 994 NA NA NA NA 0.008 0.000 0.000 0.000 0.016 0.000 0.024 0.087 0.472 0.000 0.000 0.000 0.000 0.000 0.394 NA NA 0 0 NA NA NA NA NA GM1 10.04 2.84 8.0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2773 NA NA NA NA NA NA 3 2 32 NA NA -0.083014 1.567
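The innermost cause in the stack trace is a `Duplicate key` error raised while `XsvLocatableTableCodec.readActualHeader` collects the header into a map. One possible reading, offered as a guess rather than a confirmed diagnosis, is that two header entries collided. A small Python check for repeated column names in a header line is sketched below; the function name and the sample headers are made up for illustration.

```python
from collections import Counter


def duplicate_columns(header_line, delimiter="\t"):
    """Return column names that occur more than once, in first-seen order."""
    names = header_line.rstrip("\n").split(delimiter)
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


# Invented headers: the first repeats "Pos", the second has no repeats.
with_repeat = duplicate_columns("Chrom\tPos\tRef\tAlt\tPos")
without_repeat = duplicate_columns("Chrom\tPos\tRef\tAlt")
print(with_repeat, without_repeat)
```

To apply it to the real file, pass the first line of the file and the same delimiter that `xsv_delimiter` names in the config.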

The funcotator command:
gatk Funcotator \
  --variant TriLevelv2_bqsr-filtered.vcf \
  --output test_cadd.vcf \
  --reference hg19.fa \
  --data-sources-path /funcotator_dataSources.v1.6.20190124s \
  --ref-version hg19 \
  --output-file-format VCF \
  --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' \
  --verbosity DEBUG \
  --disable-sequence-dictionary-validation true \
  --disable-bam-index-caching true

I am not sure what I missed here, and I am also unsure how new data sources should be added. I would sincerely appreciate your help!

↧