How can i set cores that MarkDuplicatesSpark uses?

June 22, 2019, 6:50 pm

≫ Next: Questions about calculating the genotype likelihoods

≪ Previous: Picard FindMendelianViolations: "Malformed header" error when specifying output directory

I use command --conf 'spark.executor.cores=5', but the MarkDuplicates still all available core on my service.
Besides, my service got 16 cores in total.
How can fixed it ?

↧

Questions about calculating the genotype likelihoods

April 8, 2019, 9:41 pm

≫ Next: GenotypeGVCFs Error:

≪ Previous: How can i set cores that MarkDuplicatesSpark uses?

In this website, https://software.broadinstitute.org/gatk/documentation/article.php?id=4442, you showed the formula used to calculate PL.

I can understand most of the formulas used here. But I can't understand the change on the formula when you are trying to implement G=H1H2 to P(D|G). I tried a lot of times and I cannot finish the math inference on my own. I think the formula you used to calculate P(D|G) should also be available to be generated by pure math deduction.

Therefore, if convenient, would you please show me the process of the math deduction of the formula to prove that P(D|G)=P(D|H1)/2 + P(D|H2)/2 (given a single read sequence).

Thank you!

↧

GenotypeGVCFs Error:

January 28, 2017, 5:45 pm

≫ Next: (How to) Call somatic mutations using GATK4 Mutect2 (Deprecated)

≪ Previous: Questions about calculating the genotype likelihoods

Hi, I was trying to run GenotypeGVCFs with my data as follwing (version GATK 3.7):

java -jar /var/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R RefGenome.fa --variant a-45.g.vcf --variant a-8.g.vcf --variant e-18.g.vcf --variant e-55.g.vcf --variant e-69.g.vcf --variant e-854.g.vcf --variant e-98.g.vcf -stand_call_conf 30 -o 7samples.vcf

And I got an error like this:

The list of input alleles must contain as an allele but that is not the case at position 4988; please use the Haplotype Caller with gVCF output to generate appropriate records

But I have produced the *g.vcf files using the Haplotype Caller. Kindly correct me if I am wrong or give me an solution. Thank you.

↧

(How to) Call somatic mutations using GATK4 Mutect2 (Deprecated)

January 6, 2018, 8:39 pm

≫ Next: GATKException: Somehow the requested coordinate is not covered by the read.

≪ Previous: GenotypeGVCFs Error:

This tutorial is now deprecated and only valid for Mutect2 v4.1.0.0 and lower. For Mutect2 v4.1.1.0 and higher, please refer to this tutorial.

Post suggestions and read about updates in the Comments section.

This tutorial introduces researchers to considerations in somatic short variant discovery using GATK4 Mutect2. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood and are aligned to GRCh38 with post-alt processing [1]. The tutorial focuses on how to call traditional somatic short mutations, as described in Article#11127 and pipelined in GATK v4.0.0.0's mutect2.wdl [2]. The tool and its workflow are in BETA status as of this writing, which means they may undergo changes and are not guaranteed for production.

► For Broad Mutation Calling Best Practices, see FireCloud Article#45055.

Section 1 calls somatic mutations with Mutect2 using all the bells and whistles of the tool. Section 2 outlines how to create the panel of normals resource using the tumor-only mode of Mutect2. Section 3 outlines how to estimate cross-sample contamination. Section 4 shows how to filter the callset with FilterMutectCalls. Unlike GATK3, in GATK4 the somatic calling and filtering functionalities are embodied by separate tools. Section 5 shows an optional filtering step to filter by sequence context artifacts that present with orientation bias, e.g. OxoG artifacts. Section 6 shows how to set up in IGV for manual review. Finally, section 7 provides a brief list of related resources that may be of interest to researchers.

GATK4 Mutect2 is a versatile variant caller that not only is more sensitive than, but is also roughly twice as fast as, HaplotypeCaller's reference confidence mode. Researchers who wish to customize analyses should find the tutorial's descriptions of the multiple levers of Mutect2 in section 1 and descriptions of the tumor-only mode of Mutect2 in section 2 of interest.

Jump to a section

Tools involved

GATK v4.0.0.0 is available in a Docker image and as a standalone jar. For the latest release, see the Downloads page. Note that GATK v4.0.0.0 contains Picard tools from release v2.17.2 that are callable with the gatk launch script.
Desktop IGV. The tutorial uses v2.3.97.

Download example data

Download tutorial_11136.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data and resources, see [3] and [4].

► The tutorial steps switch between the subset and full data. Some of the data files, e.g. BAMs, are restricted to a small region of the genome to efficiently pace the tutorial. Other files, e.g. the Mutect2 calls that the tutorial filters, are from the entire genome. The tutorial content was originally developed for the 2017-09 Helsinki workshop and we make the full data files, i.e. the resource files and the BAMs, available at gs://gatk-best-practices/somatic-hg38.

1. Call somatic short variants and generate a bamout with Mutect2

Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see Article#11127. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.

gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam

This produces a raw unfiltered somatic callset 1_somatic_m2.vcf.gz, a reassembled reads BAM 2_tumor_normal_m2.bam and the respective indices 1_somatic_m2.vcf.gz.tbi and 2_tumor_normal_m2.bai.

Comments on select parameters

Specify the case sample for somatic calling with two parameters. Provide the BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName. Alternatively, use samtools view -H tumor.bam | grep '@RG'.
Prefilter variant sites in a control sample alignment. Specify the control BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.
Prefilter variant sites in a panel of normals callset. Specify the panel of normals (PoN) VCF with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it presents commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool does not reassemble nor emit variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.
Annotate variant alleles by specifying a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025 which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.
Include reads whose mate maps to a different contig. For our somatic analysis that uses alt-aware and post-alt processed alignments to GRCh38, we disable a specific read filter with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (in alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.
Target the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.
Generate the reassembled alignments file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.

To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.

We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.

☞ 1.1 What are the Mutect2 annotations?

We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.

The Variant Annotations section of the Tool Documentation further describe some of the annotations. For a complete list of annotations available in GATK4, see this site.

To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.

☞ 1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?

To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter (MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigs and read pairs that span contigs can now lend support to variant calls.

For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls comes from the ALT, HLA and decoy contigs.

2. Create a sites-only PoN with CreateSomaticPanelOfNormals

We make the motions of creating a PoN using three germline samples. These samples are HG00190, NA19771 and HG02759 [3].

First, run Mutect2 in tumor-only mode on each normal sample. In tumor-only mode, a single case sample is analyzed with the -tumor flag without an accompanying matched control -normal sample. For the tutorial, we run this command only for sample HG00190.

gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz

This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.

We see for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively for the two sites.

Comments on select parameters

One option that is not used here is to include a germline resource with --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af.
Recapitulate any special options used in somatic calling in the panel of normals sample calling, e.g.--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.

Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for panel of normals is a minimum of forty samples.

gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.
CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.

Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.

What do you think of including samples of family members in the PoN?

☞ 2.1 The tumor-only mode of Mutect2 is useful outside of pon creation

For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples. This approach is described in Blog#11315.

3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination.

First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.

gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table

This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.

Comments on select parameters

The tool only considers homozygous alternate sites in the sample that have a population allele frequency that ranges between that set by --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
One option to speed up analysis, that is not used in the command above, is to limit data collection to a sufficiently large but subset genomic region with the -L argument.
As of GATK4.0.8.0, released August 2, 2018, GetPileupSummaries requires both -L and -V parameters. For the tutorial, provide the same resources/chr17_small_exac_common_3_grch38.vcf.gz file to each parameter. For details, see the GetPileupSummaries tool documentation.

Second, estimate contamination with CalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table

This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.

Comments on select parameters

CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument.

► Cross-sample contamination differs from normal contamination of tumor and tumor contamination of normal. Currently, the workflow does not account for the latter type of purity issue.

☞ 3.1 What if I find high levels of contamination?

One thing to rule out is sample swaps at the read group level.

Picard’s CrosscheckFingerprints can detect sample-swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual and (ii) check at the read group level that each of the read group data come from the same individual.

Again, imagine if we mistook the contaminating read group data as some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us All read groups related as expected.

4. Filter for confident somatic calls using FilterMutectCalls

FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.

Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.

gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz

This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.

This step seemingly applies 14 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.

So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.

► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.

5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.

First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
–-FILE_EXTENSION ".txt" \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Alternatively, use the tool from a standalone Picard jar.

java -jar picard.jar \
CollectSequencingArtifactMetrics \
I=tumor.bam \
O=10_tumor_artifact \
FILE_EXTENSION=.txt \
R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.

Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.

The TOTAL_QSCORE is Phred-scaled such that lower scores equate to a higher probability the change is artifactual. E.g. forty translates to 1 in 10,000 probability. For OxoG, a rough cutoff for concern is 30. FilterByOrientationBias uses the quality score as a prior that a context will produce an artifact. The tool also weighs the evidence from the reads. For example, if the QSCORE is 50 but the allele is supported by 15 reads in F1R2 and no reads in F2R1, then the tool should filter the call.
FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations. Oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. In GATK4 Mutect2, the F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.

Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.

gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz

This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.

Is the filtering in line with our earlier prediction?

In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.

☞ 5.1 Tally of applied filters for the tutorial data

The table shows the breakdown in filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallys the instances in which each filter was applied across the calls and the third column tallys the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).

Which filters appear to have the greatest impact? What types of calls do you think compels manual review?

Examine passing records with the following command. Take note of the AD and AF annotation values in particular, as they show the high sensitivity of the caller.

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less

6. Set up in IGV to review somatic calls

Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.

To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect that of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.

First, load Human (hg38) as the reference in IGV. Then load these six files in order:

resources/chr17_pon.vcf.gz
resources/chr17_af-only-gnomad_grch38.vcf.gz
11_somatic_twicefiltered.vcf.gz
2_tumor_normal_m2.bam
normal.bam
tumor.bam

With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.

Second, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).

One of the tracks is dominating the view. Right-click on track chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.
Zoom into the somatic call in 11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.
Hover over or click on the gray call in track 11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.
Scroll through the alignment data and notice the coverage for the samples.

A C→T variant is in tumor.bam but not normal.bam. What is happening in 2_tumor_normal_m2.bam?

Third, tweak IGV settings that aid in visualizing reassembled alignments.

Make room to focus on track 2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bam, normal.bam and their coverages. Right-click and Remove Tracks.
Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
Drag the alignments panel to center the red variant.
Right-click on the alignments track and
- Group by sample
- Sort by base
- Color by tag: HC.
Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.

What are the three grouped tracks for the bamout? What does the pastel versus gray colors indicate? How plausible is it that all tumor copies of this locus have this alteration?

Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.

CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr17	7,674,220	.	C	T	.	PASS	DP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15

FORMAT	GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB
HCC1143_normal	0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false
HCC1143_tumor	0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946

Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.

CHROM	POS	REF	ALT	FILTER
chr17	4,539,344	T	TA	artifact_in_normal;germline_risk;panel_of_normals
chr17	7,221,420	CACTGCCCTAGGTCAGGA	C	artifact_in_normal;panel_of_normals;str_contraction
chr17	7,483,063	A	AC	mapping_quality;t_lod
chr17	8,513,688	GTT	G	panel_of_normals
chr17	19,748,387	G	GA	t_lod
chr17	26,982,033	G	GC	artifact_in_normal;clustered_events
chr17	30,059,463	CT	C	t_lod
chr17	35,422,473	C	CA	t_lod
chr17	35,671,734	CTT	C,CT,CTTT	artifact_in_normal;multiallelic;panel_of_normals
chr17	43,104,057	CA	C	artifact_in_normal;germline_risk;panel_of_normals
chr17	43,104,072	AAAAAAAAAGAAAAG	A	panel_of_normals;t_lod
chr17	46,332,538	G	GT	artifact_in_normal;panel_of_normals
chr17	47,157,394	CAA	C	panel_of_normals;t_lod
chr17	50,124,771	GCACACACACACACACA	G	clustered_events;panel_of_normals;t_lod
chr17	68,907,890	GA	G	artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
chr17	69,182,632	C	CA	artifact_in_normal;t_lod
chr17	69,182,835	GAAAA	G	panel_of_normals

7. Related resources

The next step after generating a carefully manicured somatic callset is typically functional annotation.

Funcotator is available in BETA and can annotate GRCh38 and prior reference aligned VCF format data.
Oncotator can annotate GRCh37 and prior reference aligned MAF and VCF format data. It is also possible to download and install the tool following instructions in Article#4154.
Annotate with the external program VEP to predict phenotypic changes and confirm or hypothesize biochemical effects.

For a cohort, after annotation, use MutSig to discover driver mutations. MutsigCV (the version is CV) is available on GenePattern. If more samples are needed to increase the power of the analysis, consider padding the analysis set with TCGA Project or other data.

The dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665 shows somatic copy number alterations for the HCC1143 tumor sample. Its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. Somatic copy number alteration detection will be covered in another GATK tutorial. For reference implementations of Somatic CNV workflows see here.

Footnotes

[1] Data was alt-aware aligned to GRCh38 and post-alt processed. For an introduction to alt-aware alignment and post-alt processing, see [Blog#8180](https://software.broadinstitute.org/gatk/blog?id=8180). The HCC1143 alignments are identical to that in [Tutorial#9183](https://software.broadinstitute.org/gatk/documentation/article?id=9183), which uses GATK3 MuTect2.

[2] For scripted GATK Best Practices Somatic Short Variant Discovery workflows, see [https://github.com/gatk-workflows](https://github.com/gatk-workflows). Within the repository, as of this writing, [gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels), which uses GRCh37, is the sole GATK4 Mutect2 workflow. This tutorial uses additional parameters not used in the [GRCh37 gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels) example because the tutorial data was preprocessed with post-alt processing of alt-aware alignments, which deviates from production practices. The general workflow steps remain the same.

[3] About the tutorial data:

The data tarball contains 15 files in the main directory, six files in its resources folder and twenty files in its precomputed folder. Of the files, chr17 refers to data subset to that in the regions in chr17plus.interval_list, the m2pon consists of forty 1000 Genomes Project samples, pon to panel of normals, tumor to the tumor HCC1143 breast cancer sample and normal to its matched blood normal.
Again, example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and known as HCC1143 and HCC1143_BL, respectively. The Broad Cancer Genome Analysis (CGA) group has graciously provided 2x76 paired-end whole exome sequence data from the two cell lines (C835.HCC1143_2 and C835.HCC1143_BL.4), and @shlee reverted and aligned these to GRCh38 using alt-aware alignment and post-alt processing as described in Tutorial#8017. During preprocessing, the MergeBamAlignment step was omitted, reads containing adapter sequence were removed altogether for both samples (~0.153% of reads in the tumor) as determined by MarkIlluminaAdapters, base qualities were not binned during base recalibration and indel realignment was included to match the toolchain of the PoN normals. The program group for base recalibration is absent from the BAM headers due to a bug in the version of PrintReads at the time of pre-processing, in January of 2017.
Note that the tutorial uses exome data for its small size. The workflow is applicable to whole genome sequence data (WGS).
@shlee lifted-over or remapped the gnomAD resource files from GRCh37 counterparts to GRCh38. The tutorial uses subsets of the full resources; the full-length versions are available at gs://gatk-best-practices/somatic-hg38/. The official GRCh37 versions of the resources are available in the GATK Resource Bundle and are based on the gnomAD resource. These GRCh37 versions were prepared by @davidben according to the method outlined in the mutect_resources.wdl and described in [4].
The full data in the tutorial were generated by @shlee using the github.com/broadinstitute/gatk mutect2.wdl from between the v4.0.0.0 and v4.0.0.1 release with commit hash b4d1ddd. The GATK Docker image was broadinstitute/gatk:4.0.0.0 and Picard was v2.14.1. A single modification was made to the script to enable generating the bamout. The script was run locally on a Google Cloud Compute VM using Cromwell v30.1. Given Docker was installed and the specified Docker images were present on the VM, Cromwell automatically launched local Docker container instances during the run and handled the local files as hard-links to avoid redundant copying. Workflow input variables were as follows.

{
  "##_COMMENT1:": "WORKFLOW STEP OPTIONS",
  "Mutect2.is_run_oncotator": "False",
  "Mutect2.is_run_orientation_bias_filter": "True",
  "Mutect2.picard": "/home/shlee/picard-2.14.1.jar",
  "Mutect2.gatk_docker": "broadinstitute/gatk:4.0.0.0",
  "Mutect2.oncotator_docker": "broadinstitute/oncotator:1.9.3.0",
...
  "##_COMMENT3:": "ANALYSIS PARAMETERS",
  "Mutect2.artifact_modes": ["G/T", "C/T"],
  "Mutect2.m2_extra_args": "--af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter",
  "Mutect2.m2_extra_filtering_args": "",
  "Mutect2.scatter_count": "10"
}

If using newer versions of the mutect2.wdl that allow setting SplitIntervals optional arguments, then @shlee recommends setting --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION to avoid splitting contigs.
With the exception of the PoN and Picard tool steps, data was generated using v4.0.0.0. The PoN was generated using GATK4 vbeta.6. Besides the syntax, little changed for the Mutect2 workflow between these releases and the workflow and most of its tools remain in beta status as of this writing. We used Picard v2.14.1 for the CollectSequencingArtifactMetrics step. Figures in section 5 reflect results from Picard v2.11.0, which give, at glance, identical results as 2.14.1.
The three samples in section 2 are present in the forty sample PoN used in section 1 and they are 1000 Genomes Project samples.

[4] The WDL script [mutect_resources.wdl](https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl) takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in _section 1_ and a common biallelic variants resource for use in _section 3_. The script first generates a sites-only VCF and in the process _removes all extraneous annotations_ except for `AF` allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF.

↧

GATKException: Somehow the requested coordinate is not covered by the read.

September 26, 2018, 6:31 am

≫ Next: Does the tool BaseRecalibrator use only variants that have passed all filtration steps?

≪ Previous: (How to) Call somatic mutations using GATK4 Mutect2 (Deprecated)

Hi!

I am analysing Ion Torrent targeted panel data. I follow the best practices pipeline (https://software.broadinstitute.org/gatk/documentation/article?id=11136) but I got this error when I run Mutect2 in these files.

12:19:15.515 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/soft/EB_repo/bio/sequence/programs/foss/2016b/GATK/4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:19:17.186 INFO  Mutect2 - ------------------------------------------------------------
12:19:17.187 INFO  Mutect2 - The Genome Analysis Toolkit (GATK) v4.0.4.0
12:19:17.188 INFO  Mutect2 - For support and documentation go to https://software.broadinstitute.org/gatk/
12:19:17.189 INFO  Mutect2 - Executing as jgibert@node01 on Linux v3.10.0-693.21.1.el7.x86_64 amd64
12:19:17.192 INFO  Mutect2 - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_92-b14
12:19:17.194 INFO  Mutect2 - Start Date/Time: August 30, 2018 12:19:15 PM CEST
12:19:17.194 INFO  Mutect2 - ------------------------------------------------------------
12:19:17.194 INFO  Mutect2 - ------------------------------------------------------------
12:19:17.195 INFO  Mutect2 - HTSJDK Version: 2.14.3
12:19:17.196 INFO  Mutect2 - Picard Version: 2.18.2
12:19:17.196 INFO  Mutect2 - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:19:17.197 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:19:17.197 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:19:17.198 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:19:17.198 INFO  Mutect2 - Deflater: IntelDeflater
12:19:17.199 INFO  Mutect2 - Inflater: IntelInflater
12:19:17.199 INFO  Mutect2 - GCS max retries/reopens: 20
12:19:17.199 INFO  Mutect2 - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
12:19:17.200 INFO  Mutect2 - Initializing engine
12:19:19.295 INFO  FeatureManager - Using codec VCFCodec to read file file:///users/genomics/jgibert/Varis/af-only-gnomad.raw.sites.h19.vcf.gz
12:19:20.028 INFO  FeatureManager - Using codec BEDCodec to read file file:///users/genomics/jgibert/Varis/Oncomine_TML.20170222.designed.bed
12:19:21.127 INFO  IntervalArgumentCollection - Processing 1656280 bp from intervals
12:19:21.281 INFO  Mutect2 - Done initializing engine
12:19:22.708 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/soft/EB_repo/bio/sequence/programs/foss/2016b/GATK/4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
12:19:22.928 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
12:19:22.929 WARN  PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
12:19:23.836 INFO  ProgressMeter - Starting traversal
12:19:23.843 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
12:19:47.129 INFO  ProgressMeter -         chr1:6529491              0.4                    20             51.5
12:19:56.279 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 17.687161069000002
12:19:56.279 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 4.22 sec
12:19:56.864 INFO  Mutect2 - Shutting down engine
[August 30, 2018 12:19:56 PM CEST] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 0.69 minutes.
Runtime.totalMemory()=1211629568
org.broadinstitute.hellbender.exceptions.GATKException: Somehow the requested coordinate is not covered by the read. Alignment 6530911 | 55M1I36M
    at org.broadinstitute.hellbender.utils.read.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:815)
    at org.broadinstitute.hellbender.utils.read.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:691)
    at org.broadinstitute.hellbender.utils.read.ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(ReadUtils.java:649)
    at org.broadinstitute.hellbender.tools.walkers.annotator.BaseQualityRankSumTest.getReadBaseQuality(BaseQualityRankSumTest.java:43)
    at org.broadinstitute.hellbender.tools.walkers.annotator.BaseQualityRankSumTest.getElementForRead(BaseQualityRankSumTest.java:38)
    at org.broadinstitute.hellbender.tools.walkers.annotator.RankSumTest.getElementForRead(RankSumTest.java:108)
    at org.broadinstitute.hellbender.tools.walkers.annotator.RankSumTest.fillQualsFromLikelihood(RankSumTest.java:86)
    at org.broadinstitute.hellbender.tools.walkers.annotator.RankSumTest.annotate(RankSumTest.java:44)
    at org.broadinstitute.hellbender.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:377)
    at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:167)
    at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:182)
    at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:183)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:295)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:271)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:892)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /soft/EB_repo/bio/sequence/programs/foss/2016b/GATK/4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /soft/EB_repo/bio/sequence/programs/foss/2016b/GATK/4.0.4.0/gatk-package-4.0.4.0-local.jar Mutect2 --af-of-alleles-not-in-resource 0.0000025 --germline-resource /users/genomics/jgibert/Varis/af-only-gnomad.raw.sites.h19.vcf.gz -A BaseQualityRankSumTest -A ClippingRankSumTest -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A InbreedingCoeff -I IonXpress_001.bam -R /users/genomics/jgibert/hg19_IOT/hg19_IOT.fasta -tumor IonXpress_001 -L /users/genomics/jgibert/Varis/Oncomine_TML.20170222.designed.bed -bamout /users/genomics/jgibert/data/TML_Pedro_samples/GATK/IonXpress_001_Mutect.bam -O /users/genomics/jgibert/data/TML_Pedro_samples/GATK/IonXpress_001_Mutect.vcf.gz

In other files I also get different error: GATKException: Reference coordinate corresponds to a non-existent base in the read.

I tried to use the last version of GATK but the error still persists. Any idea what's happening?
Thanks

↧

Does the tool BaseRecalibrator use only variants that have passed all filtration steps?

June 24, 2019, 4:17 am

≫ Next: Trio or SelectVariants

≪ Previous: GATKException: Somehow the requested coordinate is not covered by the read.

Hi,

@Geraldine_VdAuwera

BaseRecalibrator tools requires known sites which I don't have so I have read that one can iteratively call variants and use the set that passes the hard filtering cutoffs as input for the recalibration.

So, after hard filtering, I have variants that passed and did not pass in the vcf file - do I need to write the ones that passed to a separate file that is fed into the recalibration or can I just feed in all - so does the tool BaseRecalibrator use only variants that have passed all filtration steps?

Thanks!

↧

Trio or SelectVariants

June 24, 2019, 6:25 am

≫ Next: How does the insert size influence the variant calling?

≪ Previous: Does the tool BaseRecalibrator use only variants that have passed all filtration steps?

I have 25 samples of exome sequencing data from 6 families on normal and diseased individuals. Two are joint families with data available for affected/unaffected uncles and aunts. I am not sure if I should run trio for this or SelectVariants (normal vs diseased) (or if there is another, more appropriate way).

If I run trio, then I have only about 10 complete trios. The uncles and aunts are part of incomplete trios and I am not sure if those samples will be of much use in calling disease specific mutations. If I run SelectVariants, I will be able to include all samples but I think trio analysis will be more accurate.

The third possibility is performing the analysis in both ways and then selecting the common variants. The hypothesis is that the common variants will be high-confidence variants.

Any ideas?

↧

How does the insert size influence the variant calling?

June 24, 2019, 8:06 am

≫ Next: Larger Sample Sizes are Reducing SNPs dramatically

≪ Previous: Trio or SelectVariants

Hi,

@AdelaideR

I would like to use a public dataset that has been used for assembly originally for variant calling. All libraries come from the same sample and I define appropriate RG. However, there the libraries have varying high insert size (up to around 800). But all are paired end Illumina reads. Is this a problem? Should I exclude some?

Thanks!

↧

Larger Sample Sizes are Reducing SNPs dramatically

June 24, 2019, 9:38 am

≫ Next: Count of False Positives

≪ Previous: How does the insert size influence the variant calling?

I'm running a pipeline with a little over 200 samples. I've tried this in both the "regular" and GVCF modes of haplotype caller, and have gotten ~1200 and ~500 SNPS, respectively. Biologically, I don't believe this can be possible given the diversity of the dataset. Additionally, when the sample size is smaller I get many more than that. By removing 65 of the samples I had over 12,000 SNPs and in a subset I used to pilot the pipeline with just 35 samples I was getting hundreds of thousands of SNPs. Removing low coverage, or sequences with highly-variable coverage did not help. Is this a normal outcome of HaplotypeCaller, and how can I rescue lost SNPs?

↧

Count of False Positives

December 17, 2018, 9:43 am

≫ Next: Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode

≪ Previous: Larger Sample Sizes are Reducing SNPs dramatically

After applying VariantRecalibrator, I am examining the SNP.tranches.pdf and SNP.tranches table output files. I am wondering, is it possible to get a count of the tranche-specific false-positives? This value isn't immediately obvious to me in the SNP.tranches table but is plotted in the graph in SNP.tranches.pdf and I would like to access those values.

My SNP.tranches table is shown here:

# Variant quality score tranches file
# Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
80.00,420195,76067,2.4862,2.0882,3.8931,VQSRTrancheSNP0.00to80.00,SNP,525369,420295,0.8000
90.00,472716,100489,2.4582,1.9875,2.2868,VQSRTrancheSNP80.00to90.00,SNP,525369,472832,0.9000
95.00,498977,113066,2.4434,1.9268,1.0517,VQSRTrancheSNP90.00to95.00,SNP,525369,499100,0.9500
99.00,519990,124866,2.4257,1.8821,-0.5150,VQSRTrancheSNP95.00to99.00,SNP,525369,520115,0.9900
99.90,524717,144668,2.4167,1.7196,-3.2399,VQSRTrancheSNP99.00to99.90,SNP,525369,524843,0.9990
100.00,525243,147612,2.4118,1.6796,-73.9905,VQSRTrancheSNP99.90to100.00,SNP,525369,525369,1.0000

Thanks so much. (And this is whole-exome data and I am following the Best Practices Guidelines - although we are using GATK version 3.5, this is a choice we made for the purpose of ensuring the highest possible consistency with older data called using version 3.5)

↧

Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode

March 6, 2014, 12:12 am

≫ Next: How can I choose parameters in STAR to satisfy a read that only matches a position in the genome?

≪ Previous: Count of False Positives

This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.

Overview

This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.

In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).

In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.
This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.

This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".

Workflow details

This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.

1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option --emitRefConfidence GVCF, and using the .g.vcf extension for the output file.

Note that versions older than 3.4 require passing the options --variant_index_type LINEAR --variant_index_parameter 128000 to set the correct index strategy for the output gVCF.

2. Data aggregation step

A new tool called GenomicsDBImport is necessary to aggregate the GVCF files and feed in one GVCF to GenotypeGVCFs. You can read more about it here. You can also run CombineGVCFs if you are not able to use GenomicsDBImport.

3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.

4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.

That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!

↧

How can I choose parameters in STAR to satisfy a read that only matches a position in the genome?

June 24, 2019, 8:57 pm

≫ Next: Difference in analysis pipeline of WGS versus WES

≪ Previous: Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode

In the downstream analysis of RNA-seq, such as gene expression calculation, SNP calling, Is it necessary that one read must be aligned to a single position in the genome? If so, How can I choose parameters in STAR to satisfy the condition? Thanks

↧

Difference in analysis pipeline of WGS versus WES

June 24, 2019, 11:46 pm

≫ Next: The future of GATK tutorials is written in Jupyter Notebooks

≪ Previous: How can I choose parameters in STAR to satisfy a read that only matches a position in the genome?

Hello GATK team,
I wanted to know what are the differences in the analyses steps in WES vs WGS?
I have paired samples for a complex disease.
(P.S: A detailed documentation for both would be of great help!!)

↧

The future of GATK tutorials is written in Jupyter Notebooks

June 25, 2019, 12:09 am

≫ Next: Strange tranche plot after GATK4 germline SNPs pipeline

≪ Previous: Difference in analysis pipeline of WGS versus WES

Last week I wrote about how we're using a cloud platform called Terra to make it easier to get started with GATK; and specifically I highlighted the fully loaded workspaces that showcase our Best Practices pipelines, which we think will make it a lot easier to test drive our pipelines end-to-end. This week I want to talk about a complementary approach we're taking, using Jupyter Notebooks on Terra to teach the step-by-step details of what happens inside the pipelines. Though before we get into the nitty gritty of how it works, I'd like to take some time to walk you through why we're taking this particular approach.

Writing a good tutorial is not that hard, in theory. You state the problem, provide a command line, then give a few instructions for poking at the outputs and you discuss what happened. The hardest part should be choosing what details and parameters to explain vs. what to leave alone to avoid confusing newcomers. Right? Well… In practice, the hardest part is often providing the inputs and instructions in such a way that most people will be able to run it in their own, unique and precious computing environment without some amount of head scratching and at least three pages of alternative instructions for this system or that system. Ugh.

We've run dozens of workshops where the setup is that we provide a PDF of instructions and a data bundle, and participants run commands in the terminal on their laptops. Inevitably some non-trivial amount of time ends up being spent debugging environment settings, typos and character encodings. That's just not a good use of anypony's time. Plus we want to be able to demonstrate larger-scale analyses with full-size inputs, not just the usual snippets of data whittled down to be convenient to download and move around. (Genomic data is getting big, if you haven't noticed.)

So earlier this year, we converted all our workshop tutorials to Jupyter Notebooks, an increasingly popular medium for combining live, executable code and documentation content, hosted on Terra.

And no kidding, it's been transformative. So far this year we've done three "GATK bootcamp" workshops (4 days long, 50% hands-on tutorials) and in every one of them the verdict was the same: notebooks FTW. Compared to our old approach, we spend so much less time troubleshooting technical issues and so much more actually exploring and discussing what the tools are doing, what the data looks like and so on -- you know, the interesting stuff. Not unexpectedly, the Notebooks-based approach is also proving to be extremely popular with participants who have less experience with command line environments.

In my next post later this week, I'll walk you through one of the notebooks from our most recent workshop. My goal is is to show how you can take advantage of these resources to level up your understanding of how GATK tools work even if you can't make it to one of our workshops in person.

Of course if you're too impatient to wait for the guided tour, feel free to sneak a peek at the notebooks I plan to demo, which you can find in this workshop workspace in the Terra Showcase. If you read my post on the Best Practices pipelines from last week, you might have already signed up on Terra and claimed your free credits… but if you haven't, please go ahead and do that now, because you're going to want to clone the workspace and open the notebooks in interactive mode.

Go to http://app.terra.bio and you'll be asked to log in with a Google identity. If you don't have one already, you can create one, and choose to either create a new Gmail account for it or associate your new Google identity with your existing email address. See this article for step-by-step instructions on how to register if needed. Once you've logged in, look for the big green banner at the top of the screen and click "Start trial" to take advantage of the free credits program. As a reminder, access to Terra is free but Google charges you for compute and storage; the credits (a $300 value) will allow you to try out the resources I'm describing here for free. To clone a workspace, open it, expand the workspace action menu (three-dot icon, top right) and select the "Clone" option. In the cloning dialog, select the billing project we created for you with your free credits. The resulting workspace clone belongs to you. Have fun!

↧

Strange tranche plot after GATK4 germline SNPs pipeline

June 25, 2019, 4:10 am

≫ Next: (How to) Call somatic mutations using GATK4 Mutect2

≪ Previous: The future of GATK tutorials is written in Jupyter Notebooks

Dear GATK team,

Context : 504 WES human samples
Objective : Call SNP using GATK4 best practices

After applying the GATK4 germline SNP and indels best practives pipeline ( we used your .wdl file with cromwell ), VariantRecalibrator gave us this very strange (and bad) tranche plot.

We decided to go further and hard filter using the parameters suggested on your site :

for SNPs :
--filter-expression "ExcessHet > 54.69" --filter-name "ExcessHet" \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \

and for indels :
--filter-expression "ExcessHet > 54.69" --filter-name "ExcessHet" \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \

We had already analyzed these samples using GWAS. We decided to compare the concordance of the hard filered WES and the GWAS using SnpSift concordance. Results : on average 97% of concordance between the WES and GWAS samples.

So we are kind of lost now regarding the strange tranche plot that VariantRecalibrator gave us and we would like to solve this issue. Even if the hard filtering seems to work ok we are afraid to lose some important information ; or to have too many false positive calls..

In this context, could you help us and give us some advice to debug this ?

Thank you
N.

The tranche plot :

↧

(How to) Call somatic mutations using GATK4 Mutect2

May 24, 2019, 12:00 pm

≫ Next: Issues on FilterMutectCalls: Log10-probability must be 0 or less

≪ Previous: Strange tranche plot after GATK4 germline SNPs pipeline

This tutorial is applicable to Mutect2 version 4.1.1.0 and higher. Post suggestions/questions in the Ask the GATK team section.

GATK 4.1.1.0 contains several major advances in the Mutect2 pipeline, which is good, but we have had to change command lines in a few places, which we usually try to avoid. Here are the major changes.

Bug Fixes

We fixed several bugs with error messages about invalid log probabilities, infinities, NaNs etc that were introduced by neglecting to account for finite precision errors. We also resolved an issue where CalculateContamination worked poorly on very small gene panels.

New Required Inputs to FilterMutectCalls

1) FilterMutectCalls now requires a reference, which should be the same fasta file input to Mutect2.
2) FilterMutectCalls also requires a stats file from Mutect2. When you run Mutect2 with output -O unfiltered.vcf, for example, it produces a file called unfiltered.vcf.stats. FilterMutectCalls automatically looks for this file, so if you are running locally no change is needed. That is,

gatk Mutect2 -R ref.fasta -I tumor.bam -O unfiltered.vcf

followed by

gatk FilterMutectCalls -R ref.fasta -V unfiltered.vcf -O filtered.vcf

works because Mutect2 creates unfiltered.vcf.stats behind the scenes and FilterMutectCalls knows to look for it. However, if you are running on a cluster or the cloud you need to keep track of the stats file. For example, you need to delocalize it from a VM, as is done in the Mutect2 WDL. You can explicitly input the stats file with the -stats argument in FilterMutectCalls. If you are scattering Mutect2 over multiple nodes you must merge the stats files with MergeMutectStats:

gatk MergeMutectStats -stats unfiltered1.vcf.stats -stats unfiltered2.vcf.stats -O merged.stats

and pass merged.stats to FilterMutectCalls.

New Filtering Strategy in FilterMutectCalls

FilterMutectCalls now filters based on a single quantity, the probability that a variant is a somatic mutation. Previously, different reasons why this might not be the case each had their own thresholds. We have removed parameters such as -normal-artifact-lod, -max-germline-posterior, -max-strand-artifact-probability, and -max-contamination-probability. Even the previously fundamental -tumor-lod is gone. Rather than replace these with a single threshold for the error probability, FilterMutectCalls replaces them with nothing at all. It automatically determines the threshold that optimizes the "F score", the harmonic mean of sensitivity and precision. In order to tweak results in favor of more sensitivity users may set -f-score-beta to a value greater than its default of 1 (beta is the relative weight of sensitivity versus precision in the harmonic mean). Setting it lower biases results toward greater precision.

You can think of these changes as doing two things. Unifying filtering thresholds puts all the filters at the same place on a ROC curve of sensitivity vs precision. Previously, one threshold might be sacrificing a lot of sensitivity for a small gain in precision while another might be doing the opposite, the result being poor sensitivity and precision that fell below the potential ROC curve. Once we’re on that ROC curve, we achieve a good balance between sensitivity and precision by optimizing the F score.

New Somatic Clustering Model in FilterMutectCalls

We had long suspected that modeling the spectrum of subclonal allele fractions would help distinguish somatic variants from errors. For example, if every somatic variant in a tumor were a het occurring in 40% of cells, we would know to reject anything with an allele fraction significantly different from 20%. In the Bayesian framework of Mutect2 this means that the likelihood for somatic variants is given by a binomial distribution. We account for an unknown number of subclones with a Dirichlet process binomial mixture model. This is still an oversimplification because CNVs, small subclones, and genetic drift of passenger mutations all contribute allele fractions that don’t match a few discrete values. Therefore, we include a couple of beta-binomials in the mixture to account for a background spread of allele fractions while still benefiting from clustering. Finally, we use these binomial and beta-binomial likelihoods to refine the tumor log odds calculated by Mutect2, which assume a uniform distribution of allele fractions.

In addition to clustering allele fractions we also learn the overall prior probabilities of somatic SNVs and indels so that we can be more skeptical of apparent variants in a quiet tumor, for example. We learn the parameters of this model with a stochastic EM approach, where the E steps consist of Chinese Restaurant Process sampling of the allele fraction clusters. In case you were wondering, we have tested this new approach on some of the off-label, non-cancer uses of Mutect2, such as mitochondria, and it works very well.
FilterMutectCalls reports the learned parameters of somatic clustering in a new .filtering_stats.tsv output. This file also contains information on the probability threshold chosen to optimize the F score and the number of false positives and false negatives from each filter to be expected from this choice.

A step-by-step guide to the new Mutect2 Read Orientation Artifacts Workflow

If you suspect any of your samples of substitution errors that occur on a single strand before sequencing you should definitely use Mutect2's orientation bias filter. This applies to all FFPE tumor samples and samples sequenced on Illumina Novaseq machines, among others. In fact, with the optimizations in 4.1.1.0 you can run the filter even when you're not suspicious. It won't hurt accuracy and the CPU cost is now quite small.

There are three steps to the filter. First, run Mutect2 with the --f1r2-tar-gz argument. This creates an output with raw data used to learn the orientation bias model. Previously this was done by CollectF1R2Counts. By absorbing it into Mutect2, we eliminated the cost of CollectF1R2Counts with almost no effect on the runtime of Mutect2. When multiple tumor samples are specified, you only need a single --f1r2-tar-gz output, which contains data for each tumor sample.

gatk Mutect2 -R ref.fasta \
   -L intervals.interval_list \
   -I tumor.bam \
   -germline-resource af-only-gnomad.vcf \
   -pon panel_of_normals.vcf   \
   --f1r2-tar-gz f1r2.tar.gz \
   -O unfiltered.vcf

Next, pass this raw data to LearnReadOrientationModel:

gatk LearnReadOrientationModel -I f1r2.tar.gz -O read-orientation-model.tar.gz

Finally, pass the learned read orientation model to FilterMutectCallswith the -ob-priors argument:

gatk FilterMutectCalls -V unfiltered.vcf \
   [--tumor-segmentation segments.table] \
   [--contamination-table contamination.table] \
   --ob-priors read-orientation-model.tar.gz \
   -O filtered.vcf

Advanced note: if you are scattering Mutect2 over nodes in a cluster or on the cloud, you must input the --f1r2-tar-gz output from each Mutect2 scatter to LearnReadOrientationModel. This is done automatically in the Mutect2 wdl in the gatk repo and on Terra. For example:

for chromosome in {1..22}; do
gatk Mutect2 -R ref.fasta -I tumor.bam -L $chromosome --f1r2-tar-gz ${chromosome}-f1r2.tar.gz -O ${chromosome}-unfiltered.vcf``
done
all_f1r2_input=`for chromosome in {1..22}; do printf -- "-I ${chromosome}-f1r2.tar.gz "; done`
LearnReadOrientationModel $all_f1_r2_input -O read-orientation-model.tar.gz

A step-by-step guide to the new Mutect2 Panel of Normals Workflow

We rewrote CreateSomaticPanelOfNormals to use GenomicsDB for scalability. We also added the INFO fields FRACTION (the fraction of normal samples with an artifact) and BETA (the shape parameters of the beta distribution of artifact allele fractions among samples with the artifact) to the panel of normals. FilterMutectCalls doesn’t use these yet, but we hope to experiment with them in the near future. Furthermore, we never liked how germline variants, which are handled in a more principled way with our germline filter, ended up as hard-filtered pon sites, so the panel of normals workflow now optionally excludes germline events from its output, keeping only technical artifacts.

The three steps to create a panel of normals are:

Step 1: Run Mutect2 in tumor-only mode for each normal sample:

gatk Mutect2 -R reference.fasta -I normal1.bam -O normal1.vcf.gz
gatk Mutect2 -R reference.fasta -I normal2.bam -O normal2.vcf.gz
Etc

Step 2: Create a GenomicsDB from the normal Mutect2 calls:

gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list \
  --genomicsdb-workspace-path pon_db \
  --max-mnp-distance 0 \
  -V normal1.vcf.gz \
  -V normal2.vcf.gz \
  -V normal3.vcf.gz

Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:

gatk CreateSomaticPanelOfNormals -R reference.fasta \
  --germline-resource af-only-gnomad.vcf.gz \
  -V gendb://pon_db \
  -O pon.vcf.gz

In the third step, the AF-only gnomAD resource is the same public GATK resource used by Mutect2 (when calling tumor samples, but not in Step 1, above).

↧

Issues on FilterMutectCalls: Log10-probability must be 0 or less

March 17, 2019, 10:53 pm

≫ Next: BaitBias Artifact Filter

≪ Previous: (How to) Call somatic mutations using GATK4 Mutect2

When I use FilterMutectCalls to Filter VCF resutls called by Mutect2, I got "Log10-probability must be 0 or less" Error on some of the VCF results (not all of them got the error message, maybe 10 out of 50 or something...), and it seems that the filtering process is done since I got the "org.broadinstitute.hellbender.tools.walkers.mutect.FilterMutectCalls done. Elapsed time: 0.54 minutes" message before the error alert, but I got no filtered vcf in my output directory.

Additional Info: the samples' vcf that got issues on FilterMutectCalls use the exact same experimental and bioinformatics approach with those samples of the same batch on which FilterMutectCalls works perfectly normal.

The detailed log is like this (PS: My GATK version is 4.1.0.0):

19:15:30.504 INFO FilterMutectCalls - Done initializing engine
19:15:30.573 INFO ProgressMeter - Starting traversal
19:15:30.574 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
19:15:30.574 INFO FilterMutectCalls - Starting first pass through the variants
19:15:30.606 INFO FilterMutectCalls - Shutting down engine
[March 11, 2019 7:15:30 PM CST] org.broadinstitute.hellbender.tools.walkers.mutect.FilterMutectCalls done. Elapsed time: 0.54 minutes.
Runtime.totalMemory()=2032140288
java.lang.IllegalArgumentException: log10p: Log10-probability must be 0 or less
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:724)
at org.broadinstitute.hellbender.utils.MathUtils.log10BinomialProbability(MathUtils.java:1031)
at org.broadinstitute.hellbender.utils.MathUtils.binomialProbability(MathUtils.java:1024)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2FilteringEngine.applyContaminationFilter(Mutect2FilteringEngine.java:68)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2FilteringEngine.calculateFilters(Mutect2FilteringEngine.java:518)
at org.broadinstitute.hellbender.tools.walkers.mutect.FilterMutectCalls.firstPassApply(FilterMutectCalls.java:130)
at org.broadinstitute.hellbender.engine.TwoPassVariantWalker.lambda$traverseVariants$0(TwoPassVariantWalker.java:76)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.TwoPassVariantWalker.traverseVariants(TwoPassVariantWalker.java:74)
at org.broadinstitute.hellbender.engine.TwoPassVariantWalker.traverse(TwoPassVariantWalker.java:27)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)

Could anyone please help to verify this?

↧

BaitBias Artifact Filter

May 22, 2019, 2:38 pm

≫ Next: Panel of Normals

≪ Previous: Issues on FilterMutectCalls: Log10-probability must be 0 or less

Hi GATK team,

We recently have a set of targeted sequencing samples showing inflated G>T false positive variants. Picard CollectSequencingArtifactMetrics showed that almost all samples in the batch is having very low bait bias qscore(under 30) for G to T base change.

I read through the picard document about BaitBiasSummaryMetrics and PreAdapterSummaryMetrics. They used G>T example for both types of artifact. It gave me a not very clear impression that BaitBias is looking for bias between reference and complementary while PreAdapterSummaryMetrics is looking for an orientation bias?

Can I have a more understandable explanation of how to differentiate elevated G>T rates that are OxoG artifacts, or G-ref artifacts?

Also, GATK4 has an experimental tool FilterByOrientationBias(https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_exome_FilterByOrientationBias.php)
to filter out OxoG artifact. I'm wondering is there a similar tool that can potentially remove G-ref artifact given SequencingArtifactMetrics.BaitBiasDetailMetrics?

Thanks!
Wen

↧

Panel of Normals

March 27, 2019, 3:43 am

≫ Next: How to evaluate a call set when no true snp set is available?

≪ Previous: BaitBias Artifact Filter

Hi,

I am currently using mutect2 with and without panel of normals and I would like to know what I use. Do you have a documentation about how does panel of normals selected. How does panel of normal selected from normal samples ?

Thanks in advance, Gufran

↧

How to evaluate a call set when no true snp set is available?

June 26, 2019, 4:41 am

≫ Next: GATK Determine Contig Ploidy Error: ValueError: invalid literal for int() with base 10: '0 PLOIDY_P

≪ Previous: Panel of Normals

Hi,

I would like to know how to evaluate a call set when no true snp set is available. Are there recommendations? Certain values one can look at? VariantEval seems to focus on comparing with a reference of known variants. But maybe there are at least a couple of basic values that one can evaluate?

Also, is there a ballpark percentage of SNPs that is usually filtered by the hard filtering recommendations? I am aware that you cannot give me a precise number here but I would like to know if I should expect 10-30 % to get kicked out or more like 60-80 % to get kicked out.

Thanks so much!!

↧