Recent Discussions — GATK-Forum

FilterMutectCalls - Key P_CONTAM found in VariantContext field Error


I am using GATK version 4.0.3.0. I get the following error with the FilterMutectCalls command:

Key P_CONTAM found in VariantContext field INFO at chr1:289 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default

Can you please help me resolve this?
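
Since the complaint is a missing header definition, one hedged workaround is to inject a P_CONTAM definition into the VCF header before re-running downstream tools; a sketch, assuming a bgzipped VCF, with assumed Number/Type values (adjust to your data) and hypothetical file names:

# Hedged workaround: print a P_CONTAM INFO definition just before the #CHROM line.
# The Number/Type/Description values are assumptions; adjust to match your data.
zcat filtered.vcf.gz \
  | awk '/^#CHROM/ {print "##INFO=<ID=P_CONTAM,Number=1,Type=Float,Description=\"Posterior probability of contamination\">"} {print}' \
  > filtered.fixed.vcf
# Re-compress and index afterwards if downstream tools require it (bgzip/tabix).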


GetPileupSummaries error on hg19 liftover


I transformed small_exac_common_3 from b37 to hg19 using Picard, as suggested on the forum:

 picard -Xmx10G LiftoverVcf I=small_exac_common_3_b37.vcf  CHAIN=b37tohg19.chain  O=small_exac_common_3_hg19.vcf R=database/hg19_primary.fa REJECT=reject.variant.vcf

When I try to run GATK4:

gatk-launch GetPileupSummaries -I Results/mutect2/432.mutect2.vcf -V /home/centos/illumina/database/small_exac_common_3_hg19.vcf -O Results/mutect2/432.pileups.table

I get this error:

Runtime.totalMemory()=590872576
***********************************************************************

A USER ERROR has occurred: Input files reads and features have incompatible contigs: No overlapping contigs found.
  reads contigs = []
  features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--javaOptions '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
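
For reference, an empty reads contigs list usually means -I did not receive a BAM: GetPileupSummaries expects the sample alignment via -I and the common variant sites VCF via -V. A hedged sketch of the expected invocation (the BAM name here is hypothetical):

gatk-launch GetPileupSummaries \
    -I Results/432_tumor.bam \
    -V /home/centos/illumina/database/small_exac_common_3_hg19.vcf \
    -O Results/mutect2/432.pileups.table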

Can ASEReadCounter take indels?

NoSuchFileException during running JointGenotyping locally


Dear GATK developers,

I have a consistent problem when running the joint-discovery part of your gatk4-germline-snps-indels workflow.
Regardless of the file path, I get a NoSuchFileException, which seems to be caused by running it locally rather than on the cloud.

Please see my Github issue for details: https://github.com/gatk-workflows/gatk4-germline-snps-indels/issues/13

Please let me know if you need any further information in order to reproduce this.

Kind regards
Tim

GATK ERROR MESSAGE: 38 in HaplotypeCaller


Hello everyone,

I'm using the HaplotypeCaller program on whole sheep genomes.

The same command was used for all 158 samples. We use nodes with 16 cores (-nct 16) and 28 GB of RAM.

Could you tell me what might be causing the following error?
Thank you in advance.
-------------------------------------------------------------------------------------------------------------------------------------------------------------

ERROR --
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 38
at org.broadinstitute.gatk.tools.walkers.annotator.BaseQualityRankSumTest.getElementForRead(BaseQualityRankSumTest.java:96)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.getElementForRead(RankSumTest.java:209)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.fillQualsFromLikelihoodMap(RankSumTest.java:187)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.annotate(RankSumTest.java:104)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContextForActiveRegion(VariantAnnotatorEngine.java:315)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContextForActiveRegion(VariantAnnotatorEngine.java:260)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.annotateCall(HaplotypeCallerGenotypingEngine.java:328)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:290)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:970)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:252)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR

##### ERROR MESSAGE: 38

##### ERROR ------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------------------------------------------------------------------
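
Since the stack trace points at BaseQualityRankSumTest, one hedged workaround is to exclude that annotation with GATK 3.8's --excludeAnnotation flag; a sketch, with hypothetical file names:

java -Xmx28g -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R sheep_reference.fasta \
    -I sample.bam \
    -nct 16 \
    --excludeAnnotation BaseQualityRankSumTest \
    -o sample.vcf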

Optimal values of parameters for GATK4.0.3.0 FilterMutectCalls on Tumor-Only Mutect2 VCF


Intro

I was hoping someone could help me understand (or at least learn how to titrate) the values of the parameters in FilterMutectCalls to maximize my sensitivity without getting too many false positives in the tumor-only case. My intention is that this post will serve as a reference for future users facing this same problem, as I've done an extensive search on the internet and have not been able to find any recent posts or discussions. So, please forgive the extended length of this post.

Current Understanding

The parameters that I "think" are relevant: --tumor-lod and --max-germline-posterior, and maybe --log-somatic-prior. But I really could use some help determining whether these are the correct parameters or if there are any others I should consider.

Background and attempts to understand param values

Currently, I'm running a command like this:

java -Xmx4g -jar gatk-package-4.0.3.0-14-g95430b1-SNAPSHOT-local.jar FilterMutectCalls \
--variant SDS1-000196-1-H3.vcf \
--contamination-table SDS1-000196-1-H3.contamination.table \
--output SDS1-000196-1-H3.fcn.vcf \
--tumor-lod $TLOD \
--tumor-segmentation SDS1-000196-1-H3.tumor_segments.table \
--max-germline-posterior $MGP

So, in my attempt to understand how changing $TLOD and $MGP affects the number of mutations that "PASS", I ran the above with various values of $TLOD and $MGP on a cohort with 43 tumor-normal pairs and my 1 tumor-only sample (a sketch of the sweep is shown below). We suspect that these samples have a low mutation rate (rare blood disease, no previous WES/WGS studies to get a ballpark mutation rate, some samples are pre-cancer while others are full-blown cancer).
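
A minimal sketch of that sweep as a bash loop (the value grids are illustrative, and the PASS-counting line assumes an uncompressed VCF):

for TLOD in 4.0 5.3 6.3 8.0 10.0; do
  for MGP in 0.025 0.05 0.1 0.25 0.5; do
    java -Xmx4g -jar gatk-package-4.0.3.0-14-g95430b1-SNAPSHOT-local.jar FilterMutectCalls \
      --variant SDS1-000196-1-H3.vcf \
      --contamination-table SDS1-000196-1-H3.contamination.table \
      --tumor-segmentation SDS1-000196-1-H3.tumor_segments.table \
      --tumor-lod "$TLOD" \
      --max-germline-posterior "$MGP" \
      --output SDS1-000196-1-H3.tlod${TLOD}_mgp${MGP}.vcf
    # Count calls whose FILTER column is PASS for this parameter combination
    echo "TLOD=$TLOD MGP=$MGP PASS=$(grep -v '^#' SDS1-000196-1-H3.tlod${TLOD}_mgp${MGP}.vcf | awk '$7 == "PASS"' | wc -l)"
  done
done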

The TLOD plot shows the number of PASS mutations per pair/sample with varying $TLOD and fixed MGP=0.1 (default); TLOD=5.3 is colored red in this plot. The MGP plot shows the number of PASS mutations per pair/sample with varying $MGP and fixed TLOD=5.3 (default). The results are attached below. As you can see from the MGP plot specifically, changing these values affects the number of mutations for all pairs/samples, but it most drastically affects the number of mutations in the tumor-only case (I've marked the sample with a red blip on the x axis of the plot). This sample has markedly more mutations that PASS the default filters for some reason; I have a feeling that they are all germline events.

Specific Questions

So I wonder, assuming the default TLOD=5.3 is optimal, is there an optimal MGP for the tumor-only case? Should I even be playing with these parameter values at all? Or should I resort to a different filtering method (all ears for suggestions)? Should I be playing with the --log-somatic-prior parameter?

I hope that this makes sense. Please let me know if I can help to clarify my dilemma/questions. Thank you so much for your time and consideration.

GenomeSTRiP v2.0 CNV calling


Hi Bob,

I upgraded to version 2 and ran my samples with 30 background population samples from 1000G for DEL discovery. I know that SVToolkit now calls both deletions and copy number variation; my question is: how do I run CNV discovery? Here are my commands:

java -cp ${classpath} ${mx} org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVPreprocess.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    -cp ${classpath} \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
    -tempDir ${SV_TMPDIR} \
    -R data/human_g1k_v37.fasta \
    -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
    -ploidyMapFile data/humgen_g1k_v37_ploidy.map \
    -copyNumberMaskFile data/cn2_mask_g1k_v37.fasta \
    -genderMapFile data/geneder1000GandMysamples.map \
    -runDirectory ${runDir} \
    -md ${runDir}/metadata \
    -reduceInsertSizeDistributions true \
    -bamFilesAreDisjoint true \
    -computeGCProfiles true \
    -jobLogDir ${runDir}/logs \
    -I ${bamList} \
    -run

java -cp ${classpath} ${mx} org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVDiscovery.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    -cp ${classpath} \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
    -tempDir ${SV_TMPDIR} \
    -R data/human_g1k_v37.fasta \
    -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
    -genderMapFile data/geneder1000GandMysamples.map \
    -runDirectory ${runDir} \
    -md ${runDir}/metadata \
    -jobLogDir ${runDir}/logs \
    -minimumSize 100 \
    -maximumSize 1000000 \
    -suppressVCFCommandLines \
    -I ${bamList} \
    -O ${sites} \
    -run

java -cp ${classpath} ${mx} org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVGenotyper.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    --disableJobReport \
    -cp ${classpath} \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
    -tempDir ${SV_TMPDIR} \
    -R data/human_g1k_v37.fasta \
    -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
    -ploidyMapFile data/humgen_g1k_v37_ploidy.map \
    -genderMapFile data/geneder1000GandMysamples.map \
    -runDirectory ${runDir} \
    -md ${runDir}/metadata \
    -jobLogDir ${runDir}/logs \
    -I ${bamList} \
    -vcf ${sites} \
    -O ${genotypes} \
    -run
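
For reference, GenomeSTRiP 2.0 ships a dedicated CNV discovery qscript. Below is a hedged sketch of an invocation; the qscript path and the tiling parameters follow the SVToolkit documentation's example and are assumptions to verify against your ${SV_DIR}, not a tested command:

java -cp ${classpath} ${mx} org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    -cp ${classpath} \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
    -R data/human_g1k_v37.fasta \
    -genomeMaskFile data/human_g1k_v37.mask.100.fasta \
    -ploidyMapFile data/humgen_g1k_v37_ploidy.map \
    -genderMapFile data/geneder1000GandMysamples.map \
    -md ${runDir}/metadata \
    -runDirectory ${runDir}/cnv \
    -jobLogDir ${runDir}/cnv/logs \
    -tilingWindowSize 1000 \
    -tilingWindowOverlap 500 \
    -maximumReferenceGapLength 1000 \
    -boundaryPrecision 100 \
    -minimumRefinedLength 500 \
    -I ${bamList} \
    -run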

Would you please let me know what I should do to run CNV calling, and let me know if I did anything wrong in my DEL calling?

Thank you so much,
Reza

Run the germline GATK Best Practices Pipeline for $5 per genome


By Eric Banks, Director, Data Sciences Platform at the Broad Institute

Last week I wrote about our efforts to develop a data processing pipeline specification that would eliminate batch effects, in collaboration with other major sequencing centers. Today I want to share our implementation of the resulting "Functional Equivalence" pipeline spec, and highlight the cost-centric optimizations we've made that make it incredibly cheap to run on Google Cloud.

For a little background, we started transitioning our analysis pipelines to Google Cloud Platform in 2016. Throughout that process we focused most of our engineering efforts on bringing down compute cost, which is the most important factor for our production operation. It's been a long road, but all that hard work really paid off: we managed to get the cost of our main Best Practices analysis pipeline down from about $45 to $5 per genome! As you can imagine that kind of cost reduction has a huge impact on our ability to do more great science per research dollar -- and now, we’re making this same pipeline available to everyone.


The Best Practices pipeline I'm talking about is the most common type of analysis done on a 30x WGS: germline short variant discovery (SNPs and indels). This pipeline covers taking the data from unmapped reads all the way to an analysis-ready BAM or CRAM (i.e. the part covered by the Functional Equivalence spec), then either a single-sample VCF or an intermediate GVCF, plus 15 steps of quality control metrics collected at various points in the pipeline, totalling $5 in compute cost on Google Cloud. As far as I know this is the most comprehensive pipeline available for whole-genome data processing and germline short variant discovery (without skimping on QC and important cleanup steps like base recalibration).

Let me give you a real-world example of what this means for an actual project. In February 2017, our production team processed a cohort of about 900 30x WGS samples through our Best Practices germline variant discovery pipeline; the compute costs totalled $12,150 or $13.50 per sample. If we had run the version of this pipeline we had just one year prior (before the main optimizations were made), it would have cost $45 per sample; a whopping $40,500 total! Meanwhile we've made further improvements since February, and if we were to run this same pipeline today, the cohort would cost only $4,500 to analyze.

                              2016      2017     Today
# of Whole Genomes Analyzed    900       900       900
Total Compute Cost         $40,500   $12,150    $4,500
Cost per Genome Analyzed       $45    $13.50        $5

For the curious, the most dramatic reductions we saw came from using different machine types for each of the various tasks (rather than piping data between tasks), leveraging GCP’s preemptible VMs, and most recently incorporating NIO to minimize the amount of data localization involved. You can read more about these approaches on Google's blog. At this point the single biggest culprit for cost in the pipeline is BWA (the genome mapper), a problem which its author Heng Li is actively working to address through a much faster (but equivalently accurate) mapper. Once Heng's new mapper is available, we anticipate the cost per genome analyzed to drop below $3.

On top of the low cost of operating the pipeline, the other huge bonus we get from running this pipeline on the cloud is that we can get any number of samples done in the time it takes to do just one, due to the staggeringly elastic scalability of the cloud environment. Even though it takes a single genome 30 hours to run through the pipeline (and we're still working on speeding that up), we're able to process genomes at a rate of one every 3.6 minutes, and we've been averaging about 500 genomes completed per day.

We're making the workflow script for this pipeline available on GitHub under an open-source license so anyone can use it, and we're also providing it as a preconfigured pipeline in FireCloud, the pipelining service we run on Google Cloud. Anyone can access FireCloud for free; you just need to pay Google for any compute and storage costs you incur when running the pipelines. So to be clear, when you run this pipeline on your data in FireCloud, all $5 of compute costs will go directly to the cloud provider; we won't make any money off of it. And there are no licensing fees involved at any point!

As a cherry on the cake, our friends at Google Cloud Platform are sponsoring free credits to help first-time users get started with FireCloud: the first 1,000 applicants can get $250 worth of credits to cover compute and storage costs. You can learn more here on the FireCloud website if you're interested.

Of course, we understand that not everyone is on Google Cloud, so we are actively collaborating with other cloud vendors and technology partners to expand the range of options for taking advantage of our optimized pipelines. For example, the Chinese cloud giant Alibaba Cloud is developing a backend for Cromwell, the execution engine we use to run our pipelines. And it's not all cloud-centric either; we are also collaborating with our long-time partners at Intel to ensure our pipelines can be run optimally on on-premises infrastructure without compromising on quality.

In conclusion, this pipeline is the result of two years' worth of hard work by a lot of people, both on our team and on the teams of the institutions and companies we collaborate with. We're all really excited to finally share it with the world, and we hope it will make it easier for everyone in the community to get more mileage out of their research dollars, just like we do.


Multiple reference allele types existed in one site?


Hello all,

I found there were two or more reference allele types for a few sites after the haplotype calling process, for example,

TRINITY_DN100017_c0_g1_i1 454 . CT C,<NON_REF> 24.54 . DP=3;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=10800.00 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:60,6,0,60,6,60:0,0,0,2

It shows the reference allele is "CT" in the 4th column. To my understanding, one site can only have one reference allele. Could anyone help me figure out what is causing this issue?

My command is listed below in case it's needed:
gatk --java-options "-Xmx20g" HaplotypeCaller -R Trinity.fasta -I 151_RWG1_assembly_merge_rm_dup.bam -O 151_RWG1_assembly_RESOLUTION.g.vcf -ERC BP_RESOLUTION

Any comments will be appreciated!

Best,
Yuanwen

(How to) Map and clean up short read sequence data efficiently



If you are interested in emulating the methods used by the Broad Genomics Platform to pre-process your short read sequencing data, you have landed on the right page. The parsimonious operating procedures outlined in this three-step workflow maximize data quality as well as storage and processing efficiency, to produce a mapped and clean BAM. This clean BAM is ready for analysis workflows that start with MarkDuplicates.

Since your sequencing data could be in a number of formats, the first step of this workflow refers you to specific methods to generate a compatible unmapped BAM (uBAM, Tutorial#6484) or (uBAMXT, Tutorial#6570, coming soon). Not all unmapped BAMs are equal, and these methods emphasize cleaning up prior meta information while giving you the opportunity to assign proper read group fields. The second step of the workflow has you marking adapter sequences, e.g. arising from read-through of short inserts, using MarkIlluminaAdapters, such that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies meta data. The resulting clean BAM is coordinate-sorted and indexed.

The workflow reflects a lossless operating procedure that retains original sequencing read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one needs only keep the final BAM file.

Geraldine_VdAuwera points out that there are many different ways of correctly preprocessing HTS data for variant discovery and ours is only one approach. So keep this in mind.

We present this workflow using real data from a public sample. The original data file, called Solexa-272222, is large at 150 GB. The file contains 151 bp paired PCR-free reads giving 30x coverage of a human whole genome sample referred to as NA12878. The entire sample library was sequenced in a single flow cell lane and thereby assigns all the reads the same read group ID. The example commands work both on this large file and on smaller files containing a subset of the reads, collectively referred to as snippet. NA12878 has a variant in exon 5 of the CYP2C19 gene, on the portion of chromosome 10 covered by the snippet, resulting in a nonfunctional protein. Consistent with GATK's recommendation of using the most up-to-date tools, for the given example results, with the exception of BWA, we used the most current versions of tools as of their testing (September to December 2015). We provide illustrative example results, some of which were derived from processing the original large file and some of which show intermediate stages skipped by this workflow.

Download the example snippet data to follow along with the tutorial.

We welcome feedback. Share your suggestions in the Comments section at the bottom of this page.


Jump to a section

  1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL
  2. Mark adapter sequences using MarkIlluminaAdapters
  3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment
    A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq
    B. Align reads and flag secondary hits using BWA-MEM
    C. Restore altered data and apply & adjust meta information using MergeBamAlignment
    D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

Tools involved

  • Picard RevertSam, MarkIlluminaAdapters, SamToFastq and MergeBamAlignment
  • BWA-MEM

Prerequisites

  • Installed Picard tools
  • Installed GATK tools
  • Installed BWA
  • Reference genome
  • Illumina or similar tech DNA sequence reads file containing data corresponding to one read group ID. That is, the file contains data from one sample and from one flow cell lane.

Download example data

  • To download the reference, open ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/ in your browser. Leave the password field blank. Download the following three files (~860 MB) to the same folder: human_g1k_v37_decoy.fasta.gz, .fasta.fai.gz, and .dict.gz. This same reference is available to load in IGV.
  • I divided the example data into two tarballs: tutorial_6483_piped.tar.gz contains the files for the piped process and tutorial_6483_intermediate_files.tar.gz contains the intermediate files produced by running each process independently. The data contain reads originally aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) of GRCh37. The table shows the steps of the workflow, corresponding input and output example data files and approximate minutes and disk space needed to process each step. Additionally, we tabulate the time and minimum storage needed to complete the workflow as presented (piped) or without piping.

[Table: workflow steps, corresponding example input/output files, and approximate runtime and disk space per step]

Related resources

Other notes

  • When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.
  • For large files, (1) use the Java -Xmx setting and (2) set the Picard TMP_DIR option to point to a temporary directory.

    java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
        TMP_DIR=/path/shlee 
    

    In the command, the -Xmx8G Java option caps the maximum heap size, or memory usage, to eight gigabytes. The path given by TMP_DIR points the tool to scratch space that it can use. These options allow the tool to run without slowing down and without causing an out-of-memory error. The -Xmx settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. These options can be omitted for small files such as the example data, and the equivalent command is as follows.

    java -jar /path/picard.jar MarkIlluminaAdapters 
    

    To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads.
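
    For convenience, the lookup can be piped through grep:

    java -XX:+PrintFlagsFinal -version | grep -i MaxHeapSize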

  • When I call default options within a command, follow suit to ensure the same results.


1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL

If you have raw read data in BAM format with appropriately assigned read group fields, then you can start with step 2. Namely, besides differentiating samples, the read group ID should differentiate factors contributing to technical batch effects, i.e. flow cell lane. If not, you need to reassign read group fields. A dictionary post describes factors to consider, and two other forum posts provide some strategic advice on handling multiplexed data.

If your reads are mapped, or in BCL or FASTQ format, then generate an unmapped BAM according to the following instructions.

  • To convert FASTQ or revert aligned BAM files, follow directions in Tutorial#6484. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).
  • To convert Illumina Base Call (BCL) files, use IlluminaBasecallsToSam. The tool marks adapter sequences at the same time. The resulting uBAMXT has adapter sequences marked with the XT tag, so you can skip step 2 of this workflow and go directly to step 3. The corresponding Tutorial#6570 is coming soon.

See if you can revert 6483_snippet.bam, containing 279,534 aligned reads, to the unmapped 6483_snippet_revertsam.bam, containing 275,546 reads.
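
A hedged sketch of such a reversion, following the approach in Tutorial#6484 (see that tutorial for the full rationale behind each option):

java -Xmx8G -jar /path/picard.jar RevertSam \
I=6483_snippet.bam \
O=6483_snippet_revertsam.bam \
SANITIZE=true \ #discards reads that would fail strict validation
MAX_DISCARD_FRACTION=0.005 \ #maximum fraction of reads SANITIZE may discard
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
SORT_ORDER=queryname \ #default
RESTORE_ORIGINAL_QUALITIES=true \ #default
REMOVE_DUPLICATE_INFORMATION=true \ #default
REMOVE_ALIGNMENT_INFORMATION=true \ #default
TMP_DIR=/path/shlee #optional to process large files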

back to top


2. Mark adapter sequences using MarkIlluminaAdapters

MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence and produces a metrics file. Some of the marked adapters come from concatenated adapters that randomly arise from the primordial soup that is a PCR reaction. Others represent read-through to 3' adapter ends of reads and arise from insert sizes that are shorter than the read length. In some instances read-through can affect the majority of reads in a sample, e.g. in Nextera library samples over-titrated with transposomes, and render these reads unmappable by certain aligners. Tools such as SamToFastq use the XT tag in various ways to effectively remove adapter sequence contribution to read alignment and alignment scoring metrics. Depending on your library preparation, insert size distribution and read length, expect varying amounts of such marked reads.

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
O=6483_snippet_markilluminaadapters.bam \
M=6483_snippet_markilluminaadapters_metrics.txt \ #naming required
TMP_DIR=/path/shlee #optional to process large files

This produces two files. (1) The metrics file, 6483_snippet_markilluminaadapters_metrics.txt bins the number of tagged adapter bases versus the number of reads. (2) The 6483_snippet_markilluminaadapters.bam file is identical to the input BAM, 6483_snippet_revertsam.bam, except reads with adapter sequences will be marked with a tag in XT:i:# format, where # denotes the 5' starting position of the adapter sequence. At least six bases are required to mark a sequence. Reads without adapter sequence remain untagged.

  • By default, the tool uses Illumina adapter sequences. This is sufficient for our example data.
  • Adjust the default standard Illumina adapter sequences to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters. To clear and add new adapter sequences, first set ADAPTERS to 'null', then specify each sequence with the parameter, as sketched below.
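
A hedged sketch of overriding the adapter list (the sequences shown are placeholders, here the tool's standard Illumina defaults; substitute your own):

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
O=6483_snippet_markilluminaadapters_custom.bam \
M=6483_snippet_markilluminaadapters_custom_metrics.txt \
ADAPTERS=null \ #clears the built-in adapter list
FIVE_PRIME_ADAPTER=AATGATACGGCGACCACCGAGATCTACAC \ #placeholder sequence
THREE_PRIME_ADAPTER=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC #placeholder sequence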

We plot the metrics data that is in GATKReport file format using RStudio, and as you can see, marked bases vary in size up to the full length of reads.
[Plots: distribution of marked adapter bases for the snippet and the full Solexa-272222 data]

Do you get the same number of marked reads? 6483_snippet marks 448 reads (0.16%) with XT, while the original Solexa-272222 marks 3,236,552 reads (0.39%).

Below, we show a read pair marked with the XT tag by MarkIlluminaAdapters. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. For XT:i:20, the majority of the read is adapter sequence. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.

Unmapped uBAM (step 1): [image]

After MarkIlluminaAdapters (step 2): [image]

After SamToFastq (step 3): [image]

After MergeBamAlignment (step 3): [image]

back to top


3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment

This step actually pipes three processes, performed by three different tools. Our tutorial example files are small enough to easily view, manipulate and store, so any difference in piped or independent processing will be negligible. For larger data, however, using Unix pipelines can add up to significant savings in processing time and storage.

Not all tools are amenable to piping, and piping the wrong tools or the wrong format can result in anomalous data.

The three tools we pipe are SamToFastq, BWA-MEM and MergeBamAlignment. By piping these we bypass storage of larger intermediate FASTQ and SAM files. We additionally save time by eliminating the need to read in and write out data for two of the processes, as piping streams data directly from one process to the next.

To make the information more digestible, we will first talk about each tool separately. At the end of the section, we provide the piped command.

back to top


3A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq

Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove previously marked adapter sequences, in this example marked with an XT tag. By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to downstream read alignment and alignment scoring metrics.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=6483_snippet_samtofastq_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \ 
NON_PF=true \
TMP_DIR=/path/shlee #optional to process large files         

This produces a FASTQ file in which all extant meta data, i.e. read group information, alignment information, flags and tags are purged. What remains are the read query names prefaced with the @ symbol, read sequences and read base quality scores.

  • For our paired reads example file we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query name of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file. BWA aligner accepts interleaved FASTQ files given the -p option.
  • We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider an archaic 0x200 flag bit that denotes reads that do not pass quality controls, aka reads failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.
  • Other CLIPPING_ACTION options include (1) X to hard-clip, (2) N to change bases to Ns or (3) another number to change the base qualities of those positions to the given value.

back to top


3B. Align reads and flag secondary hits using BWA-MEM

In this workflow, alignment is the most compute intensive and will take the longest time. GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA-MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.

  • Aligning our snippet reads against either a portion or the whole genome is not equivalent to aligning our original Solexa-272222 file, merging and taking a new slice from the same genomic interval.
  • For the tutorial, we use BWA v 0.7.7.r441, the same aligner used by the Broad Genomics Platform as of this writing (9/2015).
  • As mentioned, alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Epstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference and that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
  • BWA alignment requires an indexed reference genome file. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa.

    bwa index -a bwtsw human_g1k_v37_decoy.fasta
    

The example command below aligns our example data against the GRCh37 genome. The tool automatically locates the index files within the same folder as the reference FASTA file.

Illustration of an intermediate step unused in workflow. See piped command.

/path/bwa mem -M -t 7 -p /path/human_g1k_v37_decoy.fasta \ 
6483_snippet_samtofastq_interleaved.fq > 6483_snippet_bwa_mem.sam

This command takes the FASTQ file, 6483_snippet_samtofastq_interleaved.fq, and produces an aligned SAM format file, 6483_snippet_bwa_mem.sam, containing read alignment information, an automatically generated program group record and reads sorted in the same order as the input FASTQ file. Aligner-assigned alignment information, flag and tag values reflect each read's or split read segment's best sequence match and do not take into consideration whether pairs are mapped optimally or whether a mate is unmapped. Added tags include the aligner-specific XS tag that marks secondary alignment scores in XS:i:# format. This tag is given for each read even when the score is zero and even for unmapped reads. The program group record (@PG) in the header gives the program group ID, group name, group version and recapitulates the given command. Reads are sorted by query name. For the given version of BWA, the aligned file is in SAM format even if given a BAM extension.

Does the aligned file contain read group information?

We invoke three options in the command.

  • -M to flag shorter split hits as secondary.
    This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.

  • -p to indicate the given file contains interleaved paired reads.

  • -t followed by a number for the number of processor threads to use concurrently. Here we use seven threads, which is one less than the total threads available on my Mac laptop. Check your server or system's total number of threads with the following command provided by KateN.

    getconf _NPROCESSORS_ONLN 
    

In the example data, all of the 1211 unmapped reads each have an asterisk (*) in column 6 of the SAM record, where a read typically records its CIGAR string. The asterisk represents that the CIGAR string is unavailable. The several asterisked reads I examined are recorded as mapping exactly to the same location as their mapped mates but with MAPQ of zero. Additionally, the asterisked reads had varying, noticeable amounts of low base qualities, e.g. strings of #s, that corresponded to original base quality calls and not those changed by SamToFastq. This accounting by BWA allows these pairs to always list together, even when the reads are coordinate-sorted, and leaves a pointer to the genomic mapping of the mate of the unmapped read. For the example read pair shown below, comparing sequences shows no apparent overlap, with the highest identity at 72% over 25 nts.

After MarkIlluminaAdapters (step 2): [image]

After BWA-MEM (step 3): [image]

After MergeBamAlignment (step 3): [image]

back to top


3C. Restore altered data and apply & adjust meta information using MergeBamAlignment

MergeBamAlignment is a beast of a tool, so its introduction is longer. It does more than is implied by its name. Explaining these features requires I fill you in on some background.

Broadly, the tool merges defined information from the unmapped BAM (uBAM, step 1) with that of the aligned BAM (step 3) to conserve read data, e.g. original read information and base quality scores. The tool also generates additional meta information based on the information generated by the aligner, which may alter aligner-generated designations, e.g. mate information and secondary alignment flags. The tool then makes adjustments so that all meta information is congruent, e.g. read and mate strand information based on proper mate designations. We ascribe the resulting BAM as clean.

Specifically, the aligned BAM generated in step 3 lacks read group information and certain tags--the UQ (Phred likelihood of the segment), MC (CIGAR string for mate) and MQ (mapping quality of mate) tags. It has hard-clipped sequences from split reads and altered base qualities. The reads also have what some call mapping artifacts but what are really just features we should not expect from our aligner. For example, the meta information so far does not consider whether pairs are optimally mapped and whether a mate is unmapped (in reality or for accounting purposes). Depending on these assignments, MergeBamAlignment adjusts the read and read mate strand orientations for reads in a proper pair. Finally, the alignment records are sorted by query name. We would like to fix all of these issues before taking our data to a variant discovery workflow.

Enter MergeBamAlignment. As the tool name implies, MergeBamAlignment applies read group information from the uBAM and retains the program group information from the aligned BAM. In restoring original sequences, the tool adjusts CIGAR strings from hard-clipped to soft-clipped. If the alignment file is missing reads present in the unaligned file, then these are retained as unmapped records. Additionally, MergeBamAlignment evaluates primary alignment designations according to a user-specified strategy, e.g. for optimal mate pair mapping, and changes secondary alignment and mate unmapped flags based on its calculations. It makes additional changes for desired congruency. I will soon explain these and additional changes in more detail and show a read record to illustrate.

Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. Therefore, it is critical that these were previously flagged.

A read with multiple alignment records may map to multiple loci or may be chimeric--that is, split in its alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and adds the read mapped in proper pair (0x2) and mate unmapped (0x8) flags. The tool then adjusts the strand orientation flag for a read (0x10) and its proper mate (0x20).

In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters. The path to the reference FASTA given by R should also contain the corresponding .dict sequence dictionary with the same prefix as the reference FASTA. It is imperative that the uBAM and the aligned BAM are both sorted by queryname.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
R=/path/Homo_sapiens_assembly19.fasta \ 
UNMAPPED_BAM=6483_snippet_revertsam.bam \
ALIGNED_BAM=6483_snippet_bwa_mem.sam \ #accepts either SAM or BAM
O=6483_snippet_mergebamalignment.bam \
CREATE_INDEX=true \ #standard Picard option for coordinate-sorted outputs
ADD_MATE_CIGAR=true \ #default; adds MC tag
CLIP_ADAPTERS=false \ #changed from default
CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not extend past each other
INCLUDE_SECONDARY_ALIGNMENTS=true \ #default
MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq
ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain tags starting with X, Y, or Z 
TMP_DIR=/path/shlee #optional to process large files

This generates a coordinate-sorted and clean BAM, 6483_snippet_mergebamalignment.bam, and corresponding .bai index. These are ready for analyses starting with MarkDuplicates. The two bullet-point lists below describe changes to the resulting file. The first list gives general comments on select parameters and the second describes some of the notable changes to our example data.

Comments on select parameters

  • Setting PRIMARY_ALIGNMENT_STRATEGY to MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that, of the chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.
  • It may well be that alignments marked as secondary represent interesting biology, so we retain them with the INCLUDE_SECONDARY_ALIGNMENTS parameter.
  • Setting MAX_INSERTIONS_OR_DELETIONS to -1 retains reads regardless of the number of insertions and deletions. The default is 1.
  • Because we leave the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment will reassess and reassign proper pair designations made by the aligner. These are explained below using the example data.
  • ATTRIBUTES_TO_RETAIN is specified to carry over the XS tag from the alignment, which reports BWA-MEM's suboptimal alignment scores. My impression is that this is the next highest score for any alternative or additional alignments BWA considered, whether or not these additional alignments made it into the final aligned records. (IGV's BLAT feature allows you to search for additional sequence matches.) For our tutorial data, this is the only additional unaccounted tag from the alignment. The XS tag is unnecessary for the Best Practices Workflow and is not retained by the Broad Genomics Platform's pipeline. We retain it here not only to illustrate that the tool carries over select alignment information only if asked, but also because I think it prudent. Given how compute intensive the alignment process is, the additional ~1% gain in the snippet file size seems a small price against having to rerun the alignment because we realize later that we want the tag.
  • Setting CLIP_ADAPTERS to false leaves reads unclipped.
  • By default the merged file is coordinate sorted. We set CREATE_INDEX to true to additionally create the bai index.
  • We need not invoke PROGRAM options as BWA's program group information is sufficient and is retained in the merging.
  • As a standalone tool, we would normally feed in a BAM file for ALIGNED_BAM instead of the much larger SAM. We will be piping this step however and so need not add an extra conversion to BAM.

Description of changes to our example data

  • MergeBamAlignment merges header information from the two sources that define read groups (@RG) and program groups (@PG) as well as reference contigs.
  • Tags are updated for our example data as shown in the table. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment), MC (mate CIGAR string) and MQ (mapping quality of the mate/next segment) tags if applicable. For unmapped reads (marked with an asterisk (*) in column 6 of the SAM record), the tool removes AS and XS tags and assigns MC (if applicable), PG and RG tags. This is illustrated for example read H0164ALXX140820:2:1101:29704:6495 in the BWA-MEM section of this document.
  • Original base quality score restoration is illustrated in step 2.

The example below shows a read pair for which MergeBamAlignment adjusts multiple information fields, and these changes are described in the remaining bullet points.

  • MergeBamAlignment changes hard-clipping to soft-clipping, e.g. 96H55M to 96S55M, and restores corresponding truncated sequences with the original full-length read sequence.
  • The tool reorders the read records to reflect the chromosome and contig ordering in the header and the genomic coordinates for each.
  • MergeBamAlignment's MostDistant PRIMARY_ALIGNMENT_STRATEGY asks the tool to consider the best pair to mark as primary from the primary and secondary records. In this pair, one of the reads has two alignment loci, on contig hs37d5 and on chromosome 10. The two loci align 115 and 55 nucleotides, respectively, and the aligned sequences are identical by 55 bases. Flag values set by BWA-MEM indicate the contig hs37d5 record is primary and the shorter chromosome 10 record is secondary. For this chimeric read, MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment and the contig hs37d5 mapping as secondary (0x100 flag bit).
  • In addition, MergeBamAlignment designates each record on chromosome 10 as read mapped in proper pair (0x2 flag bit) and the contig hs37d5 mapping as mate unmapped (0x8 flag bit). IGV's paired reads mode displays the two chromosome 10 mappings as a pair after these MergeBamAlignment adjustments.
  • MergeBamAlignment adjusts read reverse strand (0x10 flag bit) and mate reverse strand (0x20 flag bit) flags consistent with changes to the proper pair designation. For our non-stranded DNA-Seq library alignments displayed in IGV, a read pointing rightward is in the forward direction (absence of 0x10 flag) and a read pointing leftward is in the reverse direction (flagged with 0x10). In a typical pair, where the rightward pointing read is to the left of the leftward pointing read, the left read will also have the mate reverse strand (0x20) flag.

Two distinct classes of mate unmapped read records are now present in our example file: (1) reads whose mates truly failed to map and are marked by an asterisk * in column 6 of the SAM record and (2) multimapping reads whose mates are in fact mapped but in a proper pair that excludes the particular read record. Each of these two classes of mate unmapped reads can contain multimapping reads that map to two or more locations.

Comparing 6483_snippet_bwa_mem.sam and 6483_snippet_mergebamalignment.bam, we see the number of unmapped reads remains the same at 1211, while the number of records with the mate unmapped flag increases by 1359, from 1276 to 2635. These now account for 0.951% of the 276,970 read records.
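
One way to reproduce such counts, assuming samtools is available:

samtools view -c -f 8 6483_snippet_mergebamalignment.bam #count records with the mate unmapped flag (0x8)
samtools view -c -f 4 6483_snippet_mergebamalignment.bam #count records with the read unmapped flag (0x4)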

For 6483_snippet_mergebamalignment.bam, how many additional unique reads become mate unmapped?

After BWA-MEM alignment: [image]

After MergeBamAlignment: [image]

back to top


3D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

We pipe the three tools described above to generate the final clean BAM, which is coordinate-sorted and indexed. In the piped command, the commands for the three processes are given together, separated by a vertical bar |. We also replace each intermediate output and input file name with a symbolic path to the system's output and input devices, here /dev/stdout and /dev/stdin, respectively. We need only provide the first input file and name the last output file.

Before using a piped command, we should ask UNIX to stop the piped command if any step of the pipe should error and also return to us the error messages. Type the following into your shell to set these UNIX options.

set -o pipefail
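
If the commands live in a script, it may also help to add set -e so the shell exits as soon as any command fails; a hedged combination, assuming a bash-compatible shell:

set -e -o pipefail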

Overview of command structure

[SamToFastq] | [BWA-MEM] | [MergeBamAlignment]

Piped command

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=/dev/stdout \
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
TMP_DIR=/path/shlee | \ 
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta /dev/stdin | \  
java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
ALIGNED_BAM=/dev/stdin \
UNMAPPED_BAM=6483_snippet_revertsam.bam \
OUTPUT=6483_snippet_piped.bam \
R=/path/Homo_sapiens_assembly19.fasta CREATE_INDEX=true ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS \
TMP_DIR=/path/shlee

The piped output file, 6483_snippet_piped.bam, is for all intents and purposes the same as 6483_snippet_mergebamalignment.bam, produced by running MergeBamAlignment separately without piping. However, the resulting files, as well as new runs of the workflow on the same data, have the potential to differ in small ways because each uses a different alignment instance.

How do these small differences arise?

Counting the number of mate unmapped reads shows that this number remains unchanged for the two described workflows. Two counts emitted at the end of the process, which also remain constant across these instances, are the number of alignment records and the number of unmapped reads.

INFO    2015-12-08 17:25:59 AbstractAlignmentMerger Wrote 275759 alignment records and 1211 unmapped reads.

back to top


Some final remarks

We have produced a clean BAM that is coordinate-sorted and indexed, in an efficient manner that minimizes processing time and storage needs. The file is ready for marking duplicates as outlined in Tutorial#2799. Additionally, we can now free up storage on our file system by deleting the original file we started with, the uBAM and the uBAMXT. We sleep well at night knowing that the clean BAM retains all original information.

We have two final comments (1) on multiplexed samples and (2) on fitting this workflow into a larger workflow.

For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates.
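
As a preview of that next step, a minimal MarkDuplicates sketch (Tutorial#2799 covers the details; the file names follow this tutorial's convention):

java -Xmx16G -jar /path/picard.jar MarkDuplicates \
INPUT=6483_snippet_piped.bam \
OUTPUT=6483_snippet_markduplicates.bam \
METRICS_FILE=6483_snippet_markduplicates_metrics.txt \
CREATE_INDEX=true \
TMP_DIR=/path/shlee #optional to process large files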

For workflows that nest this pipeline, consider additionally optimizing the Java runtime parameters for SamToFastq and MergeBamAlignment. For example, the following are the additional settings used by the Broad Genomics Platform in the piped command for very large data sets.

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...

I give my sincere thanks to Julian Hess, the GATK team and the Data Sciences and Data Engineering (DSDE) team members for all their help in writing this and related documents.

back to top


FixMateInformation cannot fix GATK4 BQSR'd bam???


I used to run FixMateInformation to fix GATK 2.3-9 IndelRealigner's BAM and it worked perfectly. But when I switched to GATK 4.0.2.1 for BQSR, I am getting "mate not found for paired read" errors in my downstream analysis.

So I ran picard-2.17.10's FixMateInformation again on the BQSR'd BAM, but I am still getting the same error. I ran ValidateSamFile and it also shows the same error. Is there anything I can do to fix this?

Here are the commands I ran:

gatk-4.0.2.1/gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true" BaseRecalibrator -R hs38DH.fa -O test.table -L ../../my.bed -I dedup.bam --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz
gatk-4.0.2.1/gatk --java-options "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true" ApplyBQSR -R hs38DH.fa -bqsr test.table -L ../../my.bed -I dedup.bam -O recal.bam --known-sites dbsnp_146.hg38.vcf.gz --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz
java -jar picard-2.17.10.jar FixMateInformation I=recal.bam O=fmi.bam CREATE_INDEX=true

(How to) Call somatic mutations using GATK4 Mutect2


Post suggestions and read about updates in the Comments section.


This tutorial introduces researchers to considerations in somatic short variant discovery using GATK4 Mutect2. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood, and are aligned to GRCh38 with post-alt processing [1]. The tutorial focuses on how to call traditional somatic short mutations, as described in Article#11127 and pipelined in GATK v4.0.0.0's mutect2.wdl [2]. The tool and its workflow are in BETA status as of this writing, which means they may undergo changes and are not guaranteed for production.

► For Broad Mutation Calling Best Practices, see FireCloud Article#45055.

Section 1 calls somatic mutations with Mutect2 using all the bells and whistles of the tool. Section 2 outlines how to create the panel of normals resource using the tumor-only mode of Mutect2. Section 3 outlines how to estimate cross-sample contamination. Section 4 shows how to filter the callset with FilterMutectCalls. Unlike GATK3, in GATK4 the somatic calling and filtering functionalities are embodied by separate tools. Section 5 shows an optional filtering step to filter by sequence context artifacts that present with orientation bias, e.g. OxoG artifacts. Section 6 shows how to set up in IGV for manual review. Finally, section 7 provides a brief list of related resources that may be of interest to researchers.

GATK4 Mutect2 is a versatile variant caller that not only is more sensitive than, but is also roughly twice as fast as, HaplotypeCaller's reference confidence mode. Researchers who wish to customize analyses should find the tutorial's descriptions of the multiple levers of Mutect2 in section 1 and descriptions of the tumor-only mode of Mutect2 in section 2 of interest.


Jump to a section

  1. Call somatic short variants and generate a bamout with Mutect2
    1.1 What are the Mutect2 annotations?
    1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?
  2. Create a sites-only PoN with CreateSomaticPanelOfNormals
    2.1 The tumor-only mode of Mutect2 is useful outside of pon creation
  3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination
    3.1 What if I find high levels of contamination?
  4. Filter for confident somatic calls using FilterMutectCalls
  5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias
    5.1 Tally of applied filters for the tutorial data
  6. Set up in IGV to review somatic calls
  7. Related resources

Tools involved

  • GATK v4.0.0.0 is available in a Docker image and as a standalone jar. For the latest release, see the Downloads page. Note that GATK v4.0.0.0 contains Picard tools from release v2.17.2 that are callable with the gatk launch script.
  • Desktop IGV. The tutorial uses v2.3.97.

Download example data

Download tutorial_11136.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data and resources, see [3] and [4].

► The tutorial steps switch between the subset and full data. Some of the data files, e.g. BAMs, are restricted to a small region of the genome to efficiently pace the tutorial. Other files, e.g. the Mutect2 calls that the tutorial filters, are from the entire genome. The tutorial content was originally developed for the 2017-09 Helsinki workshop and we make the full data files, i.e. the resource files and the BAMs, available at gs://gatk-best-practices/somatic-hg38.


1. Call somatic short variants and generate a bamout with Mutect2

Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see Article#11127. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.

gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam 

This produces a raw unfiltered somatic callset 1_somatic_m2.vcf.gz, a reassembled reads BAM 2_tumor_normal_m2.bam and the respective indices 1_somatic_m2.vcf.gz.tbi and 2_tumor_normal_m2.bai.

Comments on select parameters

  • Specify the case sample for somatic calling with two parameters. Provide the BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName, as sketched after this list. Alternatively, use samtools view -H tumor.bam | grep '@RG'.
  • Prefilter variant sites in a control sample alignment. Specify the control BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.
  • Prefilter variant sites in a panel of normals callset. Specify the panel of normals (PoN) VCF with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it presents commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool does not reassemble nor emit variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.
  • Annotate variant alleles by specifying a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025 which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.
  • Include reads whose mate maps to a different contig. For our somatic analysis that uses alt-aware and post-alt processed alignments to GRCh38, we disable a specific read filter with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (in alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.
  • Target the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.
  • Generate the reassembled alignments file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.
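
Here is a minimal sketch of the GetSampleName lookup mentioned in the first bullet, assuming the tool's standard -I/-O arguments; the output filename is arbitrary and the tool writes the SM value to it.

gatk GetSampleName \
-I tumor.bam \
-O tumor_sample_name.txt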

To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.

image

We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.


☞ 1.1 What are the Mutect2 annotations?

We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.

image

image

The Variant Annotations section of the Tool Documentation further describes some of the annotations. For a complete list of annotations available in GATK4, see this site.

To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.
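
As a hedged sketch of the -A syntax, here is the section 1 command with the ReferenceBases annotation added; the input names echo section 1 and the output filename is arbitrary.

gatk Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-A ReferenceBases \
-O m2_with_refbases.vcf.gz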


☞ 1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?

To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter (MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigs and read pairs that span contigs can now lend support to variant calls.

For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls comes from the ALT, HLA and decoy contigs.


back to top


2. Create a sites-only PoN with CreateSomaticPanelOfNormals

We make the motions of creating a PoN using three germline samples. These samples are HG00190, NA19771 and HG02759 [3].

First, run Mutect2 in tumor-only mode on each normal sample. In tumor-only mode, a single case sample is analyzed with the -tumor flag without an accompanying matched control -normal sample. For the tutorial, we run this command only for sample HG00190.

gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz

This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.

image

We see for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively for the two sites.

Comments on select parameters

  • One option that is not used here is to include a germline resource with --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af. A sketch of such a command appears after this list.
  • Recapitulate any special options used in somatic calling in the panel of normals sample calling, e.g. --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.
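
A hedged sketch of tumor-only calling with a germline resource, reusing the section 1 gnomAD file; the output filename is arbitrary.

gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--max-population-af 0.01 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190_gnomad.vcf.gz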

Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for panel of normals is a minimum of forty samples.

gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.
CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.

image

Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur at largely the same genomic loci across short-read sequencing approaches.

What do you think of including samples of family members in the PoN?


☞ 2.1 The tumor-only mode of Mutect2 is useful outside of pon creation

For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples. This approach is described in Blog#11315.


back to top


3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination

First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.
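
For those preparing such a common biallelic resource themselves, here is a hedged sketch of the subsetting with SelectVariants; the input is assumed to be a sites-only VCF that already carries AF annotations, and the filenames are placeholders.

gatk SelectVariants \
-R hg38/Homo_sapiens_assembly38.fasta \
-V af-only-gnomad.vcf.gz \
--restrict-alleles-to BIALLELIC \
-O af-only-gnomad.biallelic.vcf.gz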

gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table

This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.

image

Comments on select parameters

  • The tool only considers homozygous alternate sites in the sample that have a population allele frequency that ranges between that set by --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
  • One option to speed up analysis, that is not used in the command above, is to limit data collection to a sufficiently large but subset genomic region with the -L argument.

Second, estimate contamination with CalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table

This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.

image

Comments on select parameters

  • CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument. A sketch of this matched mode appears after this list.
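
A hedged sketch of the matched mode; the normal pileup table is assumed to come from running GetPileupSummaries on normal.bam with the same germline resource, and the output filenames are arbitrary.

gatk GetPileupSummaries \
-I normal.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O normal_getpileupsummaries.table

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-matched normal_getpileupsummaries.table \
-O 8_tumor_matched_calculatecontamination.table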

► Cross-sample contamination differs from normal contamination of tumor and tumor contamination of normal. Currently, the workflow does not account for the latter type of purity issue.


☞ 3.1 What if I find high levels of contamination?

One thing to rule out is sample swaps at the read group level.

Picard’s CrosscheckFingerprints can detect sample swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they come from the same individual, and (ii) check at the read group level that the data in each read group come from the same individual.

Again, imagine if we mistook the contaminating read group data for some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us All read groups related as expected.


back to top


4. Filter for confident somatic calls using FilterMutectCalls

FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.

Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.

gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz

This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.
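
For example, on the tutorial callset (gzcat on macOS; zcat on many Linux systems):

gzcat 9_somatic_oncefiltered.vcf.gz | grep '##FILTER'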

image

This step seemingly applies 14 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.

So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered calls, the contamination filter flags eight, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.

► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.


back to top


5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.

First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
--FILE_EXTENSION ".txt" \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Alternatively, use the tool from a standalone Picard jar.

java -jar picard.jar \
CollectSequencingArtifactMetrics \
I=tumor.bam \
O=10_tumor_artifact \
FILE_EXTENSION=.txt \
R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.

image

image

Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.

  • The TOTAL_QSCORE is Phred-scaled such that lower scores equate to a higher probability the change is artifactual. E.g. forty translates to 1 in 10,000 probability (see the conversion note after this list). For OxoG, a rough cutoff for concern is 30. FilterByOrientationBias uses the quality score as a prior that a context will produce an artifact. The tool also weighs the evidence from the reads. For example, if the QSCORE is 50 but the allele is supported by 15 reads in F1R2 and no reads in F2R1, then the tool should filter the call.
  • FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations. Oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. In GATK4 Mutect2, the F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.
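
For reference, the conversion behind these numbers: a Phred-scaled score Q corresponds to an error probability p = 10^(-Q/10), so a TOTAL_QSCORE of 40 gives p = 10^-4, i.e. a 1 in 10,000 chance the change is artifactual, and a score of 30 gives 1 in 1,000.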

Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.

gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz

This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.

image

Is the filtering in line with our earlier prediction?

In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.

image


☞ 5.1 Tally of applied filters for the tutorial data

The table shows the breakdown in filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallies the instances in which each filter was applied across the calls and the third column tallies the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).

image

Which filters appear to have the greatest impact? What types of calls do you think compel manual review?

Examine passing records with the following command. Take note of the AD and AF annotation values in particular, as they show the high sensitivity of the caller.

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less
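
To tally the 673 passing records instead of paging through them, append a count:

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | wc -l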


back to top


6. Set up in IGV to review somatic calls

Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.

To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect those of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.

First, load Human (hg38) as the reference in IGV. Then load these six files in order:

  • resources/chr17_pon.vcf.gz
  • resources/chr17_af-only-gnomad_grch38.vcf.gz
  • 11_somatic_twicefiltered.vcf.gz
  • 2_tumor_normal_m2.bam
  • normal.bam
  • tumor.bam

With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.

image

Second, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).

  • One of the tracks is dominating the view. Right-click on track chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.
  • Zoom into the somatic call in 11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.
  • Hover over or click on the gray call in track 11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.
  • Scroll through the alignment data and notice the coverage for the samples.

A C→T variant is in tumor.bam but not normal.bam. What is happening in 2_tumor_normal_m2.bam?

image

Third, tweak IGV settings that aid in visualizing reassembled alignments.

  • Make room to focus on track 2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bam, normal.bam and their coverages. Right-click and Remove Tracks.
  • Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
  • Drag the alignments panel to center the red variant.
  • Right-click on the alignments track and

    • Group by sample
    • Sort by base
    • Color by tag: HC.
  • Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.

image

What are the three grouped tracks for the bamout? What do the pastel versus gray colors indicate? How plausible is it that all tumor copies of this locus have this alteration?

Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.

image

CHROM POS ID REF ALT QUAL FILTER INFO
chr17 7,674,220 . C T . PASS DP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15
FORMAT GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB
HCC1143_normal 0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false
HCC1143_tumor 0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946

Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.

CHROM POS REF ALT FILTER
chr17 4,539,344 T TA artifact_in_normal;germline_risk;panel_of_normals
chr17 7,221,420 CACTGCCCTAGGTCAGGA C artifact_in_normal;panel_of_normals;str_contraction
chr17 7,483,063 A AC mapping_quality;t_lod
chr17 8,513,688 GTT G panel_of_normals
chr17 19,748,387 G GA t_lod
chr17 26,982,033 G GC artifact_in_normal;clustered_events
chr17 30,059,463 CT C t_lod
chr17 35,422,473 C CA t_lod
chr17 35,671,734 CTT C,CT,CTTT artifact_in_normal;multiallelic;panel_of_normals
chr17 43,104,057 CA C artifact_in_normal;germline_risk;panel_of_normals
chr17 43,104,072 AAAAAAAAAGAAAAG A panel_of_normals;t_lod
chr17 46,332,538 G GT artifact_in_normal;panel_of_normals
chr17 47,157,394 CAA C panel_of_normals;t_lod
chr17 50,124,771 GCACACACACACACACA G clustered_events;panel_of_normals;t_lod
chr17 68,907,890 GA G artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
chr17 69,182,632 C CA artifact_in_normal;t_lod
chr17 69,182,835 GAAAA G panel_of_normals


back to top


7. Related resources

The next step after generating a carefully manicured somatic callset is typically functional annotation.

  • Funcotator is available in BETA and can annotate GRCh38 and prior reference aligned VCF format data.
  • Oncotator can annotate GRCh37 and prior reference aligned MAF and VCF format data. It is also possible to download and install the tool following instructions in Article#4154.
  • Annotate with the external program VEP to predict phenotypic changes and confirm or hypothesize biochemical effects.

For a cohort, after annotation, use MutSig to discover driver mutations. MutsigCV (the version is CV) is available on GenePattern. If more samples are needed to increase the power of the analysis, consider padding the analysis set with TCGA Project or other data.

The dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665 shows somatic copy number alterations for the HCC1143 tumor sample. Its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. Somatic copy number alteration detection will be covered in another GATK tutorial. For reference implementations of Somatic CNV workflows see here.


back to top


Footnotes

[1] Data was alt-aware aligned to GRCh38 and post-alt processed. For an introduction to alt-aware alignment and post-alt processing, see [Blog#8180](https://software.broadinstitute.org/gatk/blog?id=8180). The HCC1143 alignments are identical to that in [Tutorial#9183](https://software.broadinstitute.org/gatk/documentation/article?id=9183), which uses GATK3 MuTect2.

[2] For scripted GATK Best Practices Somatic Short Variant Discovery workflows, see [https://github.com/gatk-workflows](https://github.com/gatk-workflows). Within the repository, as of this writing, [gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels), which uses GRCh37, is the sole GATK4 Mutect2 workflow. This tutorial uses additional parameters not used in the [GRCh37 gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels) example because the tutorial data was preprocessed with post-alt processing of alt-aware alignments, which deviates from production practices. The general workflow steps remain the same.

[3] About the tutorial data:

  • The data tarball contains 15 files in the main directory, six files in its resources folder and twenty files in its precomputed folder. Of the files, chr17 refers to data subset to that in the regions in chr17plus.interval_list, the m2pon consists of forty 1000 Genomes Project samples, pon to panel of normals, tumor to the tumor HCC1143 breast cancer sample and normal to its matched blood normal.
  • Again, example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and known as HCC1143 and HCC1143_BL, respectively. The Broad Cancer Genome Analysis (CGA) group has graciously provided 2x76 paired-end whole exome sequence data from the two cell lines (C835.HCC1143_2 and C835.HCC1143_BL.4), and @shlee reverted and aligned these to GRCh38 using alt-aware alignment and post-alt processing as described in Tutorial#8017. During preprocessing, the MergeBamAlignment step was omitted, reads containing adapter sequence were removed altogether for both samples (~0.153% of reads in the tumor) as determined by MarkIlluminaAdapters, base qualities were not binned during base recalibration and indel realignment was included to match the toolchain of the PoN normals. The program group for base recalibration is absent from the BAM headers due to a bug in the version of PrintReads at the time of pre-processing, in January of 2017.
  • Note that the tutorial uses exome data for its small size. The workflow is applicable to whole genome sequence data (WGS).
  • @shlee lifted-over or remapped the gnomAD resource files from GRCh37 counterparts to GRCh38. The tutorial uses subsets of the full resources; the full-length versions are available at gs://gatk-best-practices/somatic-hg38/. The official GRCh37 versions of the resources are available in the GATK Resource Bundle and are based on the gnomAD resource. These GRCh37 versions were prepared by @davidben according to the method outlined in the mutect_resources.wdl and described in [4].
  • The full data in the tutorial were generated by @shlee using the github.com/broadinstitute/gatk mutect2.wdl from between the v4.0.0.0 and v4.0.0.1 release with commit hash b4d1ddd. The GATK Docker image was broadinstitute/gatk:4.0.0.0 and Picard was v2.14.1. A single modification was made to the script to enable generating the bamout. The script was run locally on a Google Cloud Compute VM using Cromwell v30.1. Given Docker was installed and the specified Docker images were present on the VM, Cromwell automatically launched local Docker container instances during the run and handled the local files as hard-links to avoid redundant copying. Workflow input variables were as follows.
{
  "##_COMMENT1:": "WORKFLOW STEP OPTIONS",
  "Mutect2.is_run_oncotator": "False",
  "Mutect2.is_run_orientation_bias_filter": "True",
  "Mutect2.picard": "/home/shlee/picard-2.14.1.jar",
  "Mutect2.gatk_docker": "broadinstitute/gatk:4.0.0.0",
  "Mutect2.oncotator_docker": "broadinstitute/oncotator:1.9.3.0",
...
  "##_COMMENT3:": "ANALYSIS PARAMETERS",
  "Mutect2.artifact_modes": ["G/T", "C/T"],
  "Mutect2.m2_extra_args": "--af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter",
  "Mutect2.m2_extra_filtering_args": "",
  "Mutect2.scatter_count": "10"
}
  • If using newer versions of the mutect2.wdl that allow setting SplitIntervals optional arguments, then @shlee recommends setting --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION to avoid splitting contigs.
  • With the exception of the PoN and Picard tool steps, data was generated using v4.0.0.0. The PoN was generated using GATK4 vbeta.6. Besides the syntax, little changed for the Mutect2 workflow between these releases and the workflow and most of its tools remain in beta status as of this writing. We used Picard v2.14.1 for the CollectSequencingArtifactMetrics step. Figures in section 5 reflect results from Picard v2.11.0, which give, at a glance, identical results to v2.14.1.
  • The three samples in section 2 are present in the forty sample PoN used in section 1 and they are 1000 Genomes Project samples.

[4] The WDL script [mutect_resources.wdl](https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl) takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in _section 1_ and a common biallelic variants resource for use in _section 3_. The script first generates a sites-only VCF and in the process _removes all extraneous annotations_ except for `AF` allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF.

back to top


Variants for GetPileupSummaries for non-human organisms


Hi there

I am working with mouse matched tumor-normal samples and running somatic calling with Mutect2 (gatk4). I want to run contamination estimation using GetPileupSummaries, but I don't know what I should provide for the common variants (--variant) option.

Would it be at all helpful to use variants called with haplotypecaller from normal samples? I only have ~5 normals, so the numbers are very small...

CombineGVCFs input multiple files


Hello,

How can I specify the multiple input files for CombineGVCF without having to type the name of each file separately (I have 432 input files...)?

Using

java -jar GenomeAnalysisTK.jar \
-T CombineGVCFs \
-R /home/mruhsam/bluebells/Ref_files/refseq_gene236.fa \
--variant *.scc30_sec20.g.vcf \
-o cohort.g.vcf

gives me the error message

ERROR MESSAGE: Invalid argument value '12663.gatkHC.raw.snps.indels_scc30_sec20.g.vcf ' at position 6.
ERROR Invalid argument value '187-03.gatkHC.raw.snps.indels_scc30_sec20.g.vcf ' at position 7.
ERROR Invalid argument value '188-02.gatkHC.raw.snps.indels_scc30_sec20.g.vcf ' at position 8.
ERROR Invalid argument value '232-01.gatkHC.raw.snps.indels_scc30_sec20.g.vcf ' at position 9.

AND SO ON going through all the 432 sample names. However, specifying a few files separately using '--variant' works. Surely there must be a way to batch-specify the input files?

Thank you

Markus
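
(For reference, one possible approach: GATK3 treats a --variant argument whose filename ends in .list as a text file listing one input path per line, so a hedged sketch of a batch invocation, with filenames unchanged from the question, might be:)

ls *.scc30_sec20.g.vcf > gvcfs.list

java -jar GenomeAnalysisTK.jar \
-T CombineGVCFs \
-R /home/mruhsam/bluebells/Ref_files/refseq_gene236.fa \
--variant gvcfs.list \
-o cohort.g.vcf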

GATK4beta6 annotation incompatibility between HaplotypeCaller and GenomicsDBImport


Happy New Year!

I'm attempting to joint genotype ~1000 exomes using GATK4. I've run HC per sample with the following command:

java -Xmx7g -jar gatk-package-4.beta.6-local.jar HaplotypeCaller -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation --maxReadsPerAlignmentStart 0 -GQB 5 -GQB 10 -GQB 15 -GQB 20 -GQB 25 -GQB 30 -GQB 35 -GQB 40 -GQB 45 -GQB 50 -GQB 55 -GQB 60 -GQB 65 -GQB 70 -GQB 75 -GQB 80 -GQB 85 -GQB 90 -GQB 95 -GQB 99 -I example.bam -O example.g.vcf.gz -R /path/to/GRCh38.d1.vd1.fa

And then attempted to create a GenomicDB per chromosome with the following command:

java -Xmx70g -jar gatk-package-4.beta.6-local.jar GenomicsDBImport -genomicsDBWorkspace chrX_db --overwriteExistingGenomicsDBWorkspace true --intervals chrX -V gvcfs.list

I get the following error:

Exception:

[January 2, 2018 9:36:26 AM EST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=2238185472
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at chrX:251751. Field had 4 values but the header says this should have 1 values based on header record INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
at htsjdk.variant.variantcontext.VariantContext.fullyDecodeAttributes(VariantContext.java:1571)
at htsjdk.variant.variantcontext.VariantContext.fullyDecodeInfo(VariantContext.java:1546)
at htsjdk.variant.variantcontext.VariantContext.fullyDecode(VariantContext.java:1530)
at htsjdk.variant.variantcontext.writer.BCF2Writer.add(BCF2Writer.java:176)
at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1232)
at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1282)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:443)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:137)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:158)
at org.broadinstitute.hellbender.Main.main(Main.java:239)

Which refers the following line in one of the GVCFs:

chrX 251751 . G A,<NON_REF> 46.56 . AS_RAW_BaseQRankSum=|30,1,33,1|;AS_RAW_MQ=0.00|7200.00|0.00;AS_RAW_MQRankSum=|60,2|;AS_RAW_ReadPosRankSum=|5,1,20,1|;AS_SB_TABLE=0,0|0,0|0,0;DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:GQ:PL:SB 1/1:0,2,0:6:73,6,0,73,6,73:0,0,1,1

I haven't found a way to get past this error. I found this post from a while back with a very similar error:

https://gatkforums.broadinstitute.org/gatk/discussion/comment/43382#Comment_43382

But they seemed to indicate that it was fixed for them in GATK4beta6.

Any help/insight into how to resolve it, or, if it's an unimportant annotation, how to ignore it, would be greatly appreciated. Thanks!

Ben
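
(For reference, one possible workaround sketch, assuming bcftools and tabix are installed: strip the discordant annotation named in the error from each GVCF before import. Note this discards that allele-specific raw data.)

bcftools annotate -x INFO/AS_RAW_ReadPosRankSum -O z -o example.fixed.g.vcf.gz example.g.vcf.gz
tabix -p vcf example.fixed.g.vcf.gz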


How to run GATK directly on SRA files


Hello , I recently saw a webinar by NCBI "Advanced Workshop on SRA and dbGaP Data Analysis" (ftp://ftp.ncbi.nlm.nih.gov/pub/education/public_webinars/2016/03Mar23_Advanced_Workshop/). They mentioned that they were able to run GATK directly on SRA files.

I downloaded GenomeAnalysisTK-3.5 jar file to my computer. I tried both these commands:

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R SRRFileName -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T SRRFileName -R SRR1718738 -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

For both these commands, I got this error:
ERROR MESSAGE: Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension and lists of BAM/CRAM files with the .list extension, but the file SRR1718738 has neither extension. Please ensure that your BAM/CRAM file or list of BAM/CRAM files is in the correct format, update the extension, and try again.

I don't see any documentation here about this, so I wanted to check whether you or anyone else has had any experience with this.

Thanks
K
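
(For reference, a common alternative sketch, assuming the SRA Toolkit and samtools are installed and the run contains alignments: convert the run to a sorted, indexed BAM first, then pass that BAM to GATK.)

sam-dump SRR1718738 | samtools sort -o SRR1718738.bam -
samtools index SRR1718738.bam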

Question about ASEReadCounter


Dear GATK team,

I have two questions about ASEReadCounter.
1. It seems a large fraction of my sites were filtered out in the output file. I don't think they are multiallelic sites, since those sites are present in some output csv files.

  2. The output always seems to be homozygous sites, as follows:
    contig position variantID refAllele altAllele refCount altCount totalCount lowMAPQDepth lowBaseQDepth rawDepth otherBases improperPairs
    chr1 752721 rs3131972 A A 0 0 0 0 0 11 0
    chr1 757936 rs4951862 C C 1 0 1 0 0 21 0
    chr1 758626 rs3131954 C C 0 0 0 0 0 11 0
    chr1 760912 rs1048488 C C 1 0 1 0 0 32 0
    chr1 761147 rs3115850 T T 0 0 0 0 0 22 0
    chr1 761752 rs1057213 C C 1 0 1 0 0 10 0
    chr1 762485 rs148989274 C C 2 0 2 0 0 20 0
    chr1 762632 rs61768173 T T 0 0 0 0 0 11 0
    chr1 769223 rs60320384 C C 0 0 0 0 0 11 0

I think my vcf file looks normal.

##fileformat=VCFv4.2
##fileDate=20180228
##source=PLINKv1.90
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 0_HPAP017 0_HP15031 0_ICRH95 0_ADGK346 0_HPAP005 0_ADGC040 0_ACFC227 0_ACJT115 0_ICRH39 0_HPAP001 0_ICRH76 0_HPAP004 0_HPAP002 0_HPAP011 0_HPAP016 0_ADGD159 0_HPAP014 0_ICRH41 0_HPAP003 0_HPAP006 0_ZHK162 0_HPAP015 0_ABFD452 0_ABEL098 0_ICRH96 0_ADDU206 0_HPAP010 0_HPAP012 0_HPAP018 0_HPAP008 0_HPAP013 0_ABEF248 0_HPAP007
chr1 752721 rs3131972 G A . . PR GT 0/0 0/0 0/1 0/1 0/1 0/0 1/1 0/0 0/1 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0
chr1 753405 rs61770173 A C . . PR GT 0/0 0/0 0/0 0/1 0/1 0/0 1/1 0/0 ./. 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
chr1 753425 rs3131970 C T . . PR GT 0/0 0/0 0/0 0/1 0/1 0/0 1/1 0/0 ./. 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
chr1 754334 rs3131967 C T . . PR GT 0/0 0/0 0/1 0/1 0/1 0/0 1/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0
chr1 755890 rs3115858 T A . . PR GT 0/0 0/0 0/0 0/1 0/1 0/0 1/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
chr1 756434 rs61768170 C G . . PR GT 1/1 1/1 1/1 0/1 0/1 1/1 0/0 1/1 1/1 1/1 1/1 0/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 0/1 0/1 0/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1
chr1 756604 rs3131962 G A . . PR GT 0/0 0/0 0/0 0/1 0/1 0/0 1/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
chr1 757734 rs4951929 T C . . PR GT 0/0 0/0 0/0 0/1 0/1 0/0 1/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
chr1 757936 rs4951862 A C . . PR GT 0/0 0/0 0/0 0/1 0/1 0/0 1/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
chr1 758144 rs3131956 G A . . PR GT 0/0 0/0 0/0 0/1 0/1 0/0 1/1 0/0 ./. 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
chr1 758626 rs3131954 T C . . PR GT 0/0 0/0 0/0 0/1 0/1 0/0 1/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
chr1 759837 rs114111569 A T . . PR GT 0/0 0/0 0/0 0/1 0/1 0/0 1/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
chr1 760912 rs1048488 T C . . PR GT 0/0 0/0 0/0 0/1 1/1 0/0 1/1 0/0 ./. 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0

Thanks,
Long

GATK4 GenotypeGVCFs does not support caret symbol (^) in filenames


The title says it all. When I attempt to run GATK4 (v4.0.3.0) GenotypeGVCFs with the input file A^B^C.g.vcf.gz the following error is produced (full output in attached gatk4-ggvcfs.txt):

[user@hostname] $ /usr/local/genome/gatk-4.0.3.0/gatk GenotypeGVCFs \
  -R ${REFMEM} \
  -V A^B^C.g.vcf.gz \
  -O foo.vcf
...
14:57:22.252 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/user/A%5EB%5EC.g.vcf.gz
...
java.lang.IllegalArgumentException: Illegal character in path at index 1: A^B^C.g.vcf.gz
...
Caused by: java.net.URISyntaxException: Illegal character in path at index 1: A^B^C.g.vcf.gz

Looks like those ^ are getting changed to %5E, which is somehow not interpreted correctly at some point.

This works just fine with GATK 3.x, and interestingly, works just fine when running the following command (full output in attached gatk4-hc.txt):

[user@hostname] $ /usr/local/genome/gatk-4.0.3.0/gatk HaplotypeCaller \
  -R ${REFMEM} \
  -I A^B^C.bam \
  -O foo.g.vcf.gz \
  -ERC GVCF

Sorry in advance for being a repeat troublemaker

CallableLoci replacement in GATK4?


I noticed that CallableLoci does not exist in GATK4 anymore; is there a replacement tool for it in GATK4? Thanks, Luobin

SelectVariants Large VCF slow runtime


I am attempting to subset and filter a large (10k exome sample, 250GB) VCF file using SelectVariants. My goal is to subset by individual samples (iterating over each sample using a custom script and passing an individual SelectVariants command for each), selecting only heterozygous alleles, with an alt allele depth > 5, GQ > 30, and for SNPs that pass the filter. My issue is very slow runtime, which seems like it shouldn't be a problem when I only want calls from a single sample. I feel it may be an issue with how I have set up my SelectVariants command (shown below), or it may be an issue with SelectVariants and large VCFs.

Here is the command I am using:

java -jar GATK.3.7.jar -T SelectVariants -R ref.fa -V very.large.vcf.gz -o single.sample.filtered.vcf.gz -sn sample.name -selectType SNP -select 'vc.getGenotype("sample.name").isHet()' -select 'vc.getGenotype("sample.name").getAD().1 > 5' -select 'vc.getGenotype("sample.name").getGQ() > 30' -select 'vc.isNotFiltered()'
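
(For reference: GATK3 SelectVariants supports engine-level multithreading with -nt, and multiple -select expressions are combined with logical AND, so they can also be merged into a single JEXL expression. A hedged sketch with the question's values unchanged:)

java -jar GATK.3.7.jar -T SelectVariants -R ref.fa -V very.large.vcf.gz -o single.sample.filtered.vcf.gz \
-sn sample.name -selectType SNP -nt 4 \
-select 'vc.isNotFiltered() && vc.getGenotype("sample.name").isHet() && vc.getGenotype("sample.name").getAD().1 > 5 && vc.getGenotype("sample.name").getGQ() > 30'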
