Channel: Recent Discussions — GATK-Forum

Can I skip GenomicsDBImport and CombineGVCFs?

Hi,
I am not confident in my GenotypeGVCFs result because the subsequent VQSR steps showed no TPs (or extremely few). With only 31 human exome samples, may I know whether it is okay to skip GenomicsDBImport after HaplotypeCaller? I ask because GenomicsDBImport has left out values like SOR and FS even though they were picked out by HaplotypeCaller.
Otherwise, may I ask for some recommendations on a cutoff to filter away SNPs with no GT? I am very confused by my findings using SelectVariants...

If panel and WES data are not run with the corresponding BED files, are there problems other than wasted time?

Hi, thanks a lot.
In https://gatkforums.broadinstitute.org/gatk/discussion/11062/when-should-i-restrict-my-analysis-to-specific-intervals, you give a detailed discussion of when a BED file is needed.

Here I want to confirm two things with you.

Q1:
If I use a targeted panel or WES data without a BED file, is anything wrong beyond the extra time and compute consumption? (both germline and somatic)


Q2:
For the BQSR and ApplyBQSR steps, should I use a BED file?

thanks a lot

Creating a list of common SNPs for use with GetBayesianHetCoverage


The first step in the GATK ACNV workflow (http://gatkforums.broadinstitute.org/gatk/discussion/7387/description-and-examples-of-the-steps-in-the-acnv-case-workflow) is to naively call heterozygous SNPs in a case sample from the pileups at common SNP sites. These SNP sites are specified by a Picard interval list and can be constructed from any suitable database of SNPs. We outline below one possible method of building such a list with the GATK and Picard from the filtered 1000 Genomes Project Phase 3 markers used by BEAGLE 4.x.

First, download and extract the BEAGLE 1000G VCFs to a working folder:

wget -r -nH -nd -np -R index.html -A 'chr*.1kg.phase3.v5a.vcf.zip' http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/
unzip 'chr*.1kg.phase3.v5a.vcf.zip'

Structural variants have already been filtered from these VCFs. We further filter out indels, multiallelic sites, and sites with minor-allele frequency less than a given threshold (we choose 10% here) using SelectVariants. We then use CatVariants to merge the per-chromosome VCFs and produce a single VCF, which is finally converted to the desired interval list using Picard. Note that sex chromosomes are excluded as they are not currently supported by the GATK CNV/ACNV workflows.

PICARD_JAR=/seq/software/picard/current/bin/picard.jar
GATK_JAR=/humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar
HG19_FASTA=/seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta
CATVARIANTS_COMMAND="java -cp $GATK_JAR org.broadinstitute.gatk.tools.CatVariants -R $HG19_FASTA"

for chr in {1..22}
do 
    # Index the per-chromosome vcf.gz files using tabix.
    tabix -p vcf chr$chr.1kg.phase3.v5a.vcf.gz
    # Structural variants have already been filtered from these VCFs.  We further filter out indels, multiallelic sites, and sites with minor-allele frequency less than 10%.
    java -jar $GATK_JAR -T SelectVariants -R $HG19_FASTA -V chr$chr.1kg.phase3.v5a.vcf.gz -selectType SNP --restrictAllelesTo BIALLELIC -select 'vc.getCalledChrCount(vc.altAlleleWithHighestAlleleCount) * 1.0 / vc.calledChrCount >= 0.10 && vc.getCalledChrCount(vc.altAlleleWithHighestAlleleCount) * 1.0 / vc.calledChrCount < 0.90' -o chr$chr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz
    # Add filtered VCF for this chromosome to the list of inputs for CatVariants.
    CATVARIANTS_COMMAND="$CATVARIANTS_COMMAND -V chr$chr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz"
done

# Finish constructing the CatVariants command and execute it.
CATVARIANTS_COMMAND="$CATVARIANTS_COMMAND -out allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz --assumeSorted"
$CATVARIANTS_COMMAND

# Convert to interval list.
java -jar $PICARD_JAR VcfToIntervalList I=allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz O=allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.interval_list

Depending on the amount of free space in the working folder, commands to remove intermediate files can be added to the appropriate places above.

Known Issues with Mutect2 GATK4.1

  • Error:
    java.lang.IllegalArgumentException: log10p: Log10-probability must be 0 or less when running FilterMutectCalls (GATK 4.1) to filter VCF results called by Mutect2.
    Solution:
    This error is fixed in master as of March 19, 2019, and will be released in GATK version 4.1.1.0.

GATK4: How to reassign STAR mapping quality from 255 to 60 with SplitNCigarReads


Hi,

How can I reassign STAR mapping quality from 255 to 60 with SplitNCigarReads?

In GATK 3.X this used to be done like this:
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
See this blog post: https://software.broadinstitute.org/gatk/blog?id=4285

With the latest GATK4 beta, the read filter argument has been renamed. Trying the same -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS arguments leads to the following error:
A USER ERROR has occurred: rf is not a recognized option

Through looking at the CLI help documentation I got as far as:
--readFilter ReassignOneMappingQuality -RMQF 255 -RMQT 60

The readFilter argument is now recognized, but not the -RMQF 255 -RMQT 60 part:
A USER ERROR has occurred: U is not a recognized option

Could you please advise on how to run the GATK4 SplitNCigarReads tool with reassignment of the mapping quality?

Without reassignment of the mapping quality, GATK HaplotypeCaller discards all the STAR-mapped reads and calls the full chromosome as reference, without any variants.

Thank you.

Samtools 'non-existent file' error stops the gatk4-germline-snps-indels/joint-discovery-gatk4 workflow

Hello,
I am trying to run a version of the joint-discovery-gatk4-local workflow, slightly adjusted to run with a SLURM backend (I am running GATK 4.0.12.0; the JSON and WDL files are both based on the 'local' version from github.com/gatk-workflows/gatk4-germline-snps-indels). When running with enough samples to trigger the scatter-gather of the metrics, the workflow stops at the "GatherMetrics" step. I get this error message:
htsjdk.samtools.SAMException: Cannot read non-existent file: file:///test_joint-call/cromwell-executions/JointGenotyping/0c5fec3d-ae6a-4740-b991-3c5832c36315/call-GatherMetrics/inputs/-343490749/test3000.0.variant_calling_detail_metrics.variant_calling_detail_metrics

This file (with the doubled suffix) is indeed non-existent, but the file test3000.0.variant_calling_detail_metrics does exist in the right location. And in the command line featured in the logs, the filename is correct and points to an existing, readable file:

```
Using GATK jar /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/gatk/4.0.12.0/gatk-package-4.0.12.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx2g -Xms2g -jar /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/gatk/4.0.12.0/gatk-package-4.0.12.0-local.jar AccumulateVariantCallingMetrics --INPUT /test_joint-call/cromwell-executions/JointGenotyping/0c5fec3d-ae6a-4740-b991-3c5832c36315/call-GatherMetrics/inputs/-343490749/test3000.0.variant_calling_detail_metrics --INPUT [... follows a long list of input shards ...] --OUTPUT test3000
```

Have you seen this problem before? Do you know how to solve it? These files are generated and named automatically; it would be strange if there were really a problem reading one.

Many thanks,

Frederic

GATK 4.1.0.0 HaplotypeCaller VCFs have both 0/1 and 0|1 genotypes. How do I distinguish "/" from "|"?

In the previous version (GATK 4.0.3.0), there were only 0/1 genotypes, no 0|1. In the latest version (GATK 4.1.0.0), there are both 0/1 and 0|1 genotypes. What is the difference between "/" and "|", and why is this happening? Is it because of the version update?
Thank you!

VariantFiltration not recognizing first-position annotations

Dear GATK staff,

I tried to run VariantFiltration on my results from GenotypeGVCFs so that I can combine the variants into a single VCF file before piping it into VQSR. This is because my 31 VCF files have very few variants each, ranging from 100 to 3,000, which makes the Gaussian model less reliable (the graph was full of FPs as well). Feel free to advise me if I am making silly mistakes in this workflow.

However, in both encounters with VariantFiltration (using my results from GenotypeGVCFs and the results after VQSR), I faced the same problem.

17:01:16.119 WARN JexlEngine - ![0,14]: 'ReadPosRankSum < -8.0 || MQRankSum < -12.5 || QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0;' undefined variable ReadPosRankSum
17:01:16.121 WARN JexlEngine - ![0,14]: 'ReadPosRankSum < -8.0 || MQRankSum < -12.5 || QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0;' undefined variable ReadPosRankSum
17:01:16.125 WARN JexlEngine - ![0,14]: 'ReadPosRankSum < -8.0 || MQRankSum < -12.5 || QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0;' undefined variable ReadPosRankSum
17:01:16.126 WARN JexlEngine - ![0,14]: 'ReadPosRankSum < -8.0 || MQRankSum < -12.5 || QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0;' undefined variable ReadPosRankSum

When I swapped ReadPosRankSum with other annotations, whichever annotation was placed in the first position was reported as undefined. I hope I can find an answer here.

My commands:
for file in *.vcf.gz; do gatk VariantFiltration -R $reference_dir -O ${file%%.vcf.gz}_filtered.vcf.gz -V $file --filter-name "snps_filter" --filter-expression "ReadPosRankSum < -8.0 || MQRankSum < -12.5 || QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0" ; done
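For reference, a common variation on this command (not from this thread, so treat it as a sketch to verify) gives each annotation its own filter expression, so that a record missing one annotation is still evaluated against the others:

for file in *.vcf.gz; do
    gatk VariantFiltration \
        -R $reference_dir \
        -V $file \
        --filter-name "QD2" --filter-expression "QD < 2.0" \
        --filter-name "FS60" --filter-expression "FS > 60.0" \
        --filter-name "MQ40" --filter-expression "MQ < 40.0" \
        --filter-name "MQRankSum-12.5" --filter-expression "MQRankSum < -12.5" \
        --filter-name "ReadPosRankSum-8" --filter-expression "ReadPosRankSum < -8.0" \
        --filter-name "SOR3" --filter-expression "SOR > 3.0" \
        -O ${file%%.vcf.gz}_filtered.vcf.gz
done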

Input for VQSR from MergeVcfs

Dear GATK staff,

I have 28 VCF files from 31 human exome samples. The output after GenotypeGVCFs, restricted to the targeted gene intervals, shows very few variants in each VCF file, fewer than 200 or even 60, which seems likely to produce a less reliable Gaussian model in VQSR. Should I use MergeVcfs to combine the 31 VCF files into a single file before piping them into VQSR?

Multi-allelic sites in VQSR


I was wondering how GATK VQSR deals with multi-allelic sites.
I already know that:
i) VQSR treats them the same way as bi-allelic sites (https://gatkforums.broadinstitute.org/gatk/discussion/7754/how-vqsr-deals-with-multiallelic-snps-and-indel);
ii) multi-allelic sites can be split before VQSR (https://gatkforums.broadinstitute.org/gatk/discussion/23559/split-multiallelic-variants-before-vqsr-and-cnnscorevariants-gatk-team-opinion).
The latter mainly discusses mixed (SNP + indel) multi-allelic sites.

Summary questions:
1. Do you recommend splitting multi-allelic SNPs before VQSR? Would it be biased, since site-level information/annotations would be counted multiple times? I got different results with split and not split (not split performed relatively better).
2. If we don't split multi-allelic SNP sites, how is the Ti/Tv ratio calculated?
For example:

chr1 123 A T,G
chr2 234 C *,A,T

In the above cases, which allele(s) are taken to calculate the Ti/Tv ratio in the tranche file? If VQSR takes the first allele, what should we expect in the second case, where a star allele is in the first position? Or is it better to remove star alleles before VQSR?

(How to) Call somatic mutations using GATK4 Mutect2


Post suggestions and read about updates in the Comments section.


This tutorial introduces researchers to considerations in somatic short variant discovery using GATK4 Mutect2. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood and are aligned to GRCh38 with post-alt processing [1]. The tutorial focuses on how to call traditional somatic short mutations, as described in Article#11127 and pipelined in GATK v4.0.0.0's mutect2.wdl [2]. The tool and its workflow are in BETA status as of this writing, which means they may undergo changes and are not guaranteed for production.

► For Broad Mutation Calling Best Practices, see FireCloud Article#45055.

Section 1 calls somatic mutations with Mutect2 using all the bells and whistles of the tool. Section 2 outlines how to create the panel of normals resource using the tumor-only mode of Mutect2. Section 3 outlines how to estimate cross-sample contamination. Section 4 shows how to filter the callset with FilterMutectCalls. Unlike GATK3, in GATK4 the somatic calling and filtering functionalities are embodied by separate tools. Section 5 shows an optional filtering step to filter by sequence context artifacts that present with orientation bias, e.g. OxoG artifacts. Section 6 shows how to set up in IGV for manual review. Finally, section 7 provides a brief list of related resources that may be of interest to researchers.

GATK4 Mutect2 is a versatile variant caller that not only is more sensitive than, but is also roughly twice as fast as, HaplotypeCaller's reference confidence mode. Researchers who wish to customize analyses should find the tutorial's descriptions of the multiple levers of Mutect2 in section 1 and descriptions of the tumor-only mode of Mutect2 in section 2 of interest.


Jump to a section

  1. Call somatic short variants and generate a bamout with Mutect2
    1.1 What are the Mutect2 annotations?
    1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?
  2. Create a sites-only PoN with CreateSomaticPanelOfNormals
    2.1 The tumor-only mode of Mutect2 is useful outside of pon creation
  3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination
    3.1 What if I find high levels of contamination?
  4. Filter for confident somatic calls using FilterMutectCalls
  5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias
    5.1 Tally of applied filters for the tutorial data
  6. Set up in IGV to review somatic calls
  7. Related resources

Tools involved

  • GATK v4.0.0.0 is available in a Docker image and as a standalone jar. For the latest release, see the Downloads page. Note that GATK v4.0.0.0 contains Picard tools from release v2.17.2 that are callable with the gatk launch script.
  • Desktop IGV. The tutorial uses v2.3.97.

Download example data

Download tutorial_11136.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data and resources, see [3] and [4].

► The tutorial steps switch between the subset and full data. Some of the data files, e.g. BAMs, are restricted to a small region of the genome to efficiently pace the tutorial. Other files, e.g. the Mutect2 calls that the tutorial filters, are from the entire genome. The tutorial content was originally developed for the 2017-09 Helsinki workshop and we make the full data files, i.e. the resource files and the BAMs, available at gs://gatk-best-practices/somatic-hg38.


1. Call somatic short variants and generate a bamout with Mutect2

Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see Article#11127. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.

gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam 

This produces a raw unfiltered somatic callset 1_somatic_m2.vcf.gz, a reassembled reads BAM 2_tumor_normal_m2.bam and the respective indices 1_somatic_m2.vcf.gz.tbi and 2_tumor_normal_m2.bai.

Comments on select parameters

  • Specify the case sample for somatic calling with two parameters. Provide the BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName. Alternatively, use samtools view -H tumor.bam | grep '@RG'. A sketch of both lookups follows this list.
  • Prefilter variant sites in a control sample alignment. Specify the control BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.
  • Prefilter variant sites in a panel of normals callset. Specify the panel of normals (PoN) VCF with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it presents commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool does not reassemble nor emit variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.
  • Annotate variant alleles by specifying a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025 which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.
  • Include reads whose mate maps to a different contig. For our somatic analysis that uses alt-aware and post-alt processed alignments to GRCh38, we disable a specific read filter with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (in alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.
  • Target the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.
  • Generate the reassembled alignments file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.
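For the sample-name lookup mentioned in the first bullet, a minimal sketch (assuming the tutorial's tumor.bam):

# Write the read group SM field of tumor.bam to a text file.
gatk GetSampleName -I tumor.bam -O tumor_sample_name.txt

# Or inspect the @RG header line directly.
samtools view -H tumor.bam | grep '@RG'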

To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.

[image: five multiallelic records from the full callset]

We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.


☞ 1.1 What are the Mutect2 annotations?

We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.

[image: FORMAT-level Mutect2 annotation definitions in the VCF header]

[image: INFO-level Mutect2 annotation definitions in the VCF header]

The Variant Annotations section of the Tool Documentation further describes some of the annotations. For a complete list of annotations available in GATK4, see this site.

To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.
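For instance, a minimal sketch of adding the ReferenceBases annotation to the section 1 command (the output name here is hypothetical; other arguments as shown above):

gatk Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-A ReferenceBases \
-L chr17plus.interval_list \
-O 1_somatic_m2_refbases.vcf.gz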


☞ 1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?

To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter (MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigs and read pairs that span contigs can now lend support to variant calls.

For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls comes from the ALT, HLA and decoy contigs.


back to top


2. Create a sites-only PoN with CreateSomaticPanelOfNormals

We make the motions of creating a PoN using three germline samples. These samples are HG00190, NA19771 and HG02759 [3].

First, run Mutect2 in tumor-only mode on each normal sample. In tumor-only mode, a single case sample is analyzed with the -tumor flag without an accompanying matched control -normal sample. For the tutorial, we run this command only for sample HG00190.

gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz

This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.

[image: two multiallelic records from 3_HG00190.vcf.gz]

We see for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively for the two sites.
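Pull such records out of the callset the same way as in section 1:

gzcat 3_HG00190.vcf.gz | awk '$5 ~ ","'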

Comments on select parameters

  • One option that is not used here is to include a germline resource with --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af.
  • Recapitulate any special options used in somatic calling in the panel of normals sample calling, e.g. --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.

Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for panel of normals is a minimum of forty samples.

gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.
CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.

[image: eight-column, sites-only PoN VCF records]

Ideally, the PoN includes samples that are technically representative of the tumor case sample, i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur at pretty much the same genomic loci for short read sequencing approaches.

What do you think of including samples of family members in the PoN?


☞ 2.1 The tumor-only mode of Mutect2 is useful outside of pon creation

For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples. This approach is described in Blog#11315.


back to top


3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination

First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.
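As a sketch of that subsetting step (the input and output names here are hypothetical), one might run:

# Keep only biallelic SNPs from a population germline resource.
gatk SelectVariants \
-R hg38/Homo_sapiens_assembly38.fasta \
-V af-only-gnomad.grch38.vcf.gz \
--select-type-to-include SNP \
--restrict-alleles-to BIALLELIC \
-O common_biallelic_snps.vcf.gz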

gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table

This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.

[image: six-column pileup summary table]

Comments on select parameters

  • The tool only considers homozygous alternate sites in the sample that have a population allele frequency that ranges between that set by --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
  • One option to speed up analysis, that is not used in the command above, is to limit data collection to a sufficiently large but subset genomic region with the -L argument.
  • As of GATK4.0.8.0, released August 2, 2018, GetPileupSummaries requires both -L and -V parameters. For the tutorial, provide the same resources/chr17_small_exac_common_3_grch38.vcf.gz file to each parameter. For details, see the GetPileupSummaries tool documentation.

Second, estimate contamination with CalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table

This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.

[image: contamination estimate table for the full tumor sample]

Comments on select parameters

  • CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument, as in the sketch below.
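A minimal sketch of this matched mode, reusing the tutorial's file naming (the normal table name is hypothetical):

# Summarize the normal sample's pileups at the same common biallelic sites.
gatk GetPileupSummaries \
-I normal.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O normal_getpileupsummaries.table

# Estimate contamination, informed by the matched normal's pileup table.
gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-matched normal_getpileupsummaries.table \
-O 8_tumor_matched_calculatecontamination.table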

► Cross-sample contamination differs from normal contamination of tumor and tumor contamination of normal. Currently, the workflow does not account for the latter type of purity issue.


☞ 3.1 What if I find high levels of contamination?

One thing to rule out is sample swaps at the read group level.

Picard’s CrosscheckFingerprints can detect sample-swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual and (ii) check at the read group level that each of the read group data come from the same individual.

Again, imagine if we mistook the contaminating read group data as some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us All read groups related as expected.


back to top


4. Filter for confident somatic calls using FilterMutectCalls

FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.

Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.

gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz

This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.
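For example:

gzcat 9_somatic_oncefiltered.vcf.gz | grep '##FILTER'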

[image: FILTER definitions in the VCF header]

This step seemingly applies 14 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.

So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.

► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.


back to top


5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.

First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
--FILE_EXTENSION ".txt" \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Alternatively, use the tool from a standalone Picard jar.

java -jar picard.jar \
CollectSequencingArtifactMetrics \
I=tumor.bam \
O=10_tumor_artifact \
FILE_EXTENSION=.txt \
R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.

[image: pre_adapter_summary_metrics for the full data]

Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.

  • The TOTAL_QSCORE is Phred-scaled such that lower scores equate to a higher probability the change is artifactual. E.g. forty translates to a 1 in 10,000 probability. For OxoG, a rough cutoff for concern is 30. FilterByOrientationBias uses the quality score as a prior that a context will produce an artifact. The tool also weighs the evidence from the reads. For example, if the QSCORE is 50 but the allele is supported by 15 reads in F1R2 and no reads in F2R1, then the tool should filter the call. A worked example of the Phred scaling follows this list.
  • FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations. Oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. In GATK4 Mutect2, the F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.
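As a quick worked example of the Phred scaling, where the artifact probability is 10^(-QSCORE/10):

# TOTAL_QSCORE 40 -> 10^(-40/10) = 0.0001, i.e. a 1 in 10,000 probability the change is artifactual.
# TOTAL_QSCORE 30 -> 10^(-30/10) = 0.001, i.e. 1 in 1,000, the rough OxoG cutoff for concern.
python -c 'for q in (40, 30): print(q, 10**(-q/10))'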

Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.

gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz

This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.

[image: summary of calls per sequence context and the number filtered]

Is the filtering in line with our earlier prediction?

In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.

[image: VCF header showing the added orientation_bias filter]


☞ 5.1 Tally of applied filters for the tutorial data

The table shows the breakdown in filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallies the instances in which each filter was applied across the calls and the third column tallies the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).

[image: tally of filters applied to 11_somatic_twicefiltered.vcf.gz]

Which filters appear to have the greatest impact? What types of calls do you think compel manual review?

Examine passing records with the following command. Take note of the AD and AF annotation values in particular, as they show the high sensitivity of the caller.

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less


back to top


6. Set up in IGV to review somatic calls

Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.

To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect that of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.

First, load Human (hg38) as the reference in IGV. Then load these six files in order:

  • resources/chr17_pon.vcf.gz
  • resources/chr17_af-only-gnomad_grch38.vcf.gz
  • 11_somatic_twicefiltered.vcf.gz
  • 2_tumor_normal_m2.bam
  • normal.bam
  • tumor.bam

With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.

Second, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).

  • One of the tracks is dominating the view. Right-click on track chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.
  • Zoom into the somatic call in 11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.
  • Hover over or click on the gray call in track 11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.
  • Scroll through the alignment data and notice the coverage for the samples.

A C→T variant is in tumor.bam but not normal.bam. What is happening in 2_tumor_normal_m2.bam?

Third, tweak IGV settings that aid in visualizing reassembled alignments.

  • Make room to focus on track 2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bam, normal.bam and their coverages. Right-click and Remove Tracks.
  • Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
  • Drag the alignments panel to center the red variant.
  • Right-click on the alignments track and

    • Group by sample
    • Sort by base
    • Color by tag: HC.
  • Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.

[image: bamout alignments grouped by sample in IGV]

What are the three grouped tracks for the bamout? What do the pastel versus gray colors indicate? How plausible is it that all tumor copies of this locus have this alteration?

Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.


CHROM POS ID REF ALT QUAL FILTER INFO
chr17 7,674,220 . C T . PASS DP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15
FORMAT GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB
HCC1143_normal 0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false
HCC1143_tumor 0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946

Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.

CHROM POS REF ALT FILTER
chr17 4,539,344 T TA artifact_in_normal;germline_risk;panel_of_normals
chr17 7,221,420 CACTGCCCTAGGTCAGGA C artifact_in_normal;panel_of_normals;str_contraction
chr17 7,483,063 A AC mapping_quality;t_lod
chr17 8,513,688 GTT G panel_of_normals
chr17 19,748,387 G GA t_lod
chr17 26,982,033 G GC artifact_in_normal;clustered_events
chr17 30,059,463 CT C t_lod
chr17 35,422,473 C CA t_lod
chr17 35,671,734 CTT C,CT,CTTT artifact_in_normal;multiallelic;panel_of_normals
chr17 43,104,057 CA C artifact_in_normal;germline_risk;panel_of_normals
chr17 43,104,072 AAAAAAAAAGAAAAG A panel_of_normals;t_lod
chr17 46,332,538 G GT artifact_in_normal;panel_of_normals
chr17 47,157,394 CAA C panel_of_normals;t_lod
chr17 50,124,771 GCACACACACACACACA G clustered_events;panel_of_normals;t_lod
chr17 68,907,890 GA G artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
chr17 69,182,632 C CA artifact_in_normal;t_lod
chr17 69,182,835 GAAAA G panel_of_normals


back to top


7. Related resources

The next step after generating a carefully manicured somatic callset is typically functional annotation.

  • Funcotator is available in BETA and can annotate GRCh38 and prior reference aligned VCF format data.
  • Oncotator can annotate GRCh37 and prior reference aligned MAF and VCF format data. It is also possible to download and install the tool following instructions in Article#4154.
  • Annotate with the external program VEP to predict phenotypic changes and confirm or hypothesize biochemical effects.

For a cohort, after annotation, use MutSig to discover driver mutations. MutSigCV (the version is CV) is available on GenePattern. If more samples are needed to increase the power of the analysis, consider padding the analysis set with TCGA Project or other data.

The dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665 shows somatic copy number alterations for the HCC1143 tumor sample. Its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. Somatic copy number alteration detection will be covered in another GATK tutorial. For reference implementations of Somatic CNV workflows see here.


back to top


Footnotes

[1] Data was alt-aware aligned to GRCh38 and post-alt processed. For an introduction to alt-aware alignment and post-alt processing, see [Blog#8180](https://software.broadinstitute.org/gatk/blog?id=8180). The HCC1143 alignments are identical to that in [Tutorial#9183](https://software.broadinstitute.org/gatk/documentation/article?id=9183), which uses GATK3 MuTect2.

[2] For scripted GATK Best Practices Somatic Short Variant Discovery workflows, see [https://github.com/gatk-workflows](https://github.com/gatk-workflows). Within the repository, as of this writing, [gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels), which uses GRCh37, is the sole GATK4 Mutect2 workflow. This tutorial uses additional parameters not used in the [GRCh37 gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels) example because the tutorial data was preprocessed with post-alt processing of alt-aware alignments, which deviates from production practices. The general workflow steps remain the same.

[3] About the tutorial data:

  • The data tarball contains 15 files in the main directory, six files in its resources folder and twenty files in its precomputed folder. Of the files, chr17 refers to data subset to that in the regions in chr17plus.interval_list, the m2pon consists of forty 1000 Genomes Project samples, pon to panel of normals, tumor to the tumor HCC1143 breast cancer sample and normal to its matched blood normal.
  • Again, example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and known as HCC1143 and HCC1143_BL, respectively. The Broad Cancer Genome Analysis (CGA) group has graciously provided 2x76 paired-end whole exome sequence data from the two cell lines (C835.HCC1143_2 and C835.HCC1143_BL.4), and @shlee reverted and aligned these to GRCh38 using alt-aware alignment and post-alt processing as described in Tutorial#8017. During preprocessing, the MergeBamAlignment step was omitted, reads containing adapter sequence were removed altogether for both samples (~0.153% of reads in the tumor) as determined by MarkIlluminaAdapters, base qualities were not binned during base recalibration and indel realignment was included to match the toolchain of the PoN normals. The program group for base recalibration is absent from the BAM headers due to a bug in the version of PrintReads at the time of pre-processing, in January of 2017.
  • Note that the tutorial uses exome data for its small size. The workflow is applicable to whole genome sequence data (WGS).
  • @shlee lifted-over or remapped the gnomAD resource files from GRCh37 counterparts to GRCh38. The tutorial uses subsets of the full resources; the full-length versions are available at gs://gatk-best-practices/somatic-hg38/. The official GRCh37 versions of the resources are available in the GATK Resource Bundle and are based on the gnomAD resource. These GRCh37 versions were prepared by @davidben according to the method outlined in the mutect_resources.wdl and described in [4].
  • The full data in the tutorial were generated by @shlee using the github.com/broadinstitute/gatk mutect2.wdl from between the v4.0.0.0 and v4.0.0.1 release with commit hash b4d1ddd. The GATK Docker image was broadinstitute/gatk:4.0.0.0 and Picard was v2.14.1. A single modification was made to the script to enable generating the bamout. The script was run locally on a Google Cloud Compute VM using Cromwell v30.1. Given Docker was installed and the specified Docker images were present on the VM, Cromwell automatically launched local Docker container instances during the run and handled the local files as hard-links to avoid redundant copying. Workflow input variables were as follows.
{
  "##_COMMENT1:": "WORKFLOW STEP OPTIONS",
  "Mutect2.is_run_oncotator": "False",
  "Mutect2.is_run_orientation_bias_filter": "True",
  "Mutect2.picard": "/home/shlee/picard-2.14.1.jar",
  "Mutect2.gatk_docker": "broadinstitute/gatk:4.0.0.0",
  "Mutect2.oncotator_docker": "broadinstitute/oncotator:1.9.3.0",
...
  "##_COMMENT3:": "ANALYSIS PARAMETERS",
  "Mutect2.artifact_modes": ["G/T", "C/T"],
  "Mutect2.m2_extra_args": "--af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter",
  "Mutect2.m2_extra_filtering_args": "",
  "Mutect2.scatter_count": "10"
}
  • If using newer versions of the mutect2.wdl that allow setting SplitIntervals optional arguments, then @shlee recommends setting --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION to avoid splitting contigs.
  • With the exception of the PoN and Picard tool steps, data was generated using v4.0.0.0. The PoN was generated using GATK4 vbeta.6. Besides the syntax, little changed for the Mutect2 workflow between these releases, and the workflow and most of its tools remain in beta status as of this writing. We used Picard v2.14.1 for the CollectSequencingArtifactMetrics step. Figures in section 5 reflect results from Picard v2.11.0, which gives, at a glance, identical results to v2.14.1.
  • The three samples in section 2 are present in the forty sample PoN used in section 1 and they are 1000 Genomes Project samples.

[4] The WDL script [mutect_resources.wdl](https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl) takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in _section 1_ and a common biallelic variants resource for use in _section 3_. The script first generates a sites-only VCF and in the process _removes all extraneous annotations_ except for `AF` allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF.

back to top


Should I use OUTPUT_BY_READGROUP on RevertSam and why?


Hi,

We have re-analyzed a TCGA WES sample by taking the BAM file, running RevertSam, and then putting it through the standard pipeline. We have noticed about a 1% difference in variants when doing it by read group (i.e. producing a uBAM per read group and merging them at the MarkDuplicates step) compared to doing it without read groups.

Separating by read group is a bit of a nuisance for our pipeline, and we wanted to know if it is correct not to do so. I take it all the read groups followed the same sequencing protocol. I imagine this may have to do with BWA being read-group aware. (A sketch of the two invocation modes follows the questions below.)

Could you please clarify

  • If we can safely disregard OUTPUT_BY_READGROUP in general
  • If not, why?
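For context, a minimal sketch of the two invocation modes being compared (file names hypothetical; per Picard's RevertSam documentation, OUTPUT must be a directory when OUTPUT_BY_READGROUP=true):

# One uBAM per read group, written into a directory.
java -jar picard.jar RevertSam \
I=tcga_sample.bam \
O=reverted_by_rg/ \
OUTPUT_BY_READGROUP=true

# A single uBAM for the whole file.
java -jar picard.jar RevertSam \
I=tcga_sample.bam \
O=tcga_sample_reverted.bam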

Thanks a lot for a terrific tool.

GATK (v4.0.10.1) CombineGVCFs failing with 'java.lang.OutOfMemoryError'; not using memory provided


Hi,

We ran a CombineGVCFs job using the following command, where gvcfs.list contained only 31 gvcf files with 24 samples each:

$GATK --java-options "-Xmx650G" \
CombineGVCFs \
-R $referenceFasta \
-O full_cohort.b37.g.vcf \
--variant gvcfs.list

We tried the extreme memory because CombineGVCFs kept failing. This node has 750G of RAM.

Despite the high memory provided, we get the stack trace below. The total memory reported by GATK is only ~12G, though (Runtime.totalMemory()=12662603776). Am I missing something? I don't understand why GATK is only using 12G of RAM when we provided much more, and then failing with an OutOfMemoryError.

We are currently setting up GenomicsDBImport, but this seems worth reporting.

Really appreciate your help.

18:55:51.944 INFO ProgressMeter - 4:26649295 23.6 18617000 787894.4
18:56:01.988 INFO ProgressMeter - 4:26655758 23.8 18779000 789159.6
18:59:13.407 INFO CombineGVCFs - Shutting down engine
[October 19, 2018 6:59:13 PM CDT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 27.06 minutes.
Runtime.totalMemory()=12662603776
Exception in thread "main" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:316)
at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
at java.io.BufferedWriter.close(BufferedWriter.java:266)
at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:226)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.closeTool(CombineGVCFs.java:461)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:970)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

Is my "--resource" parameter to VariantRecalibrator correct?


I'm trying to upgrade my pipeline using the latest GATK-4.1.0.0 package (I was using the 4.0.0.6 version), and I found that VariantRecalibrator stopped running. From the error message, it treats the entire argument string to "--resource" as the file name, like "hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.b37.vcf", instead of using the actual file name after ":".

I'm not sure if my argument string has a correct format -- I followed the example on https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.0.0/org_broadinstitute_hellbender_tools_walkers_vqsr_VariantRecalibrator.php

I then found that this argument string is handled by the "FeatureInput" class, and that a recent change in January 2019 removed some code in its constructor, so the raw argument string is passed directly to the constructor of its superclass (PathSpecifier) as the URI of the file. I'm not sure if this is the cause of the issue?

BTW here is the change applied to FeatureInput.java in January:

https://github.com/broadinstitute/gatk/commit/0238d2a0273349f9865b75d3e3e81528cc7cafde#diff-b2806dbc398b2b88f95db09016e64a05

The "ParsedArgument.of(rawArgumentValue)" call was removed after this change; this function used to split the argument string to the key-value pairs and the file name. Sorry if my understanding is wrong.

My command:

java \
-Dsamjdk.use_async_io_read_samtools=false \
-Dsamjdk.use_async_io_write_samtools=true \
-Dsamjdk.use_async_io_write_tribble=false \
-Dsamjdk.compression_level=2 \
-jar gatk-package-4.1.0.0-local.jar VariantRecalibrator \
-R CHR21_B37_ONLY.fasta \
-V my-pipeline-S2.1.vcf \
-an QD -an FS -an SOR \
-mode SNP \
--truth-sensitivity-tranche 100.0 \
--truth-sensitivity-tranche 99.9 \
--truth-sensitivity-tranche 99.5 \
--truth-sensitivity-tranche 99.0 \
-O my-pipeline-S2.2.recal \
--tranches-file my-pipeline-S2.2.tranches \
--resource hapmap,known=false,training=true,truth=true,prior=15.0:GATK/data/ftp.broadinstitute.org-bundle-b37/hapmap_3.3.b37.vcf \
--resource omni,known=false,training=true,truth=true,prior=12.0:GATK/data/ftp.broadinstitute.org-bundle-b37/1000G_omni2.5.b37.vcf \
--resource 1000G,known=false,training=true,truth=false,prior=10.0:GATK/data/ftp.broadinstitute.org-bundle-b37/1000G_phase1.snps.high_confidence.b37.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:GATK/data/ftp.broadinstitute.org-bundle-b37/dbsnp_138.b37.vcf

The error message:

A USER ERROR has occurred: Couldn't read file file:///home/users/dbpz/myPipeline/hapmap,known=false,training=true,truth=true,prior=15.0:GATK/data/ftp.broadinstitute.org-bundle-b37/hapmap_3.3.b37.vcf. Error was: It doesn't exist.
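For what it's worth, the VariantRecalibrator documentation for 4.1.0.0 shows the resource tag attached to the argument name with a colon rather than placed inside the value, so a form like the following may parse (a sketch based on that documentation, worth verifying):

# Replace the four --resource arguments in the command above with:
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 GATK/data/ftp.broadinstitute.org-bundle-b37/hapmap_3.3.b37.vcf \
--resource:omni,known=false,training=true,truth=true,prior=12.0 GATK/data/ftp.broadinstitute.org-bundle-b37/1000G_omni2.5.b37.vcf \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 GATK/data/ftp.broadinstitute.org-bundle-b37/1000G_phase1.snps.high_confidence.b37.vcf \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 GATK/data/ftp.broadinstitute.org-bundle-b37/dbsnp_138.b37.vcf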

Java version:

$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

GATK resource bundle dbsnp_138.hg19.vcf


I've tried to connect via FTP and download dbsnp_138.hg19.vcf, but I cannot.
I've also tried with

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz

but it says the username is incorrect.
Any other solution?

thanks!


Mutect2 MNV and read_orientation_artifact filter


I used different callers to detect SNVs from FFPE WES samples, including Mutect2 (GATK 4.1.0), and compared the results.

My original command line:
gatk CollectF1R2Counts \
-L Covered.bed \
-R genome.fa \
-I Tumor_reca.bam \
-alt-table Tumor.alt.tsv \
-ref-hist Tumor.ref.metrics \
-alt-hist Tumor.alt.metrics

gatk LearnReadOrientationModel \
-alt-table Tumor.alt.tsv \
-ref-hist Tumor.ref.metrics \
-alt-hist Tumor.alt.metrics \
-O Tumor.artifact-prior.tsv

gatk Mutect2 \
-R genome.fa \
-I Tumor_reca.bam \
-I Normal_reca.bam \
-normal Normal_name \
-L Covered.bed \
-ip 100 \
--orientation-bias-artifact-priors Tumor.artifact-prior.tsv \
--germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz \
-pon pon.vcf \
-O Tumor.vcf.gz \
-bamout Tumor_bamout.bam

gatk GetPileupSummaries \
-I Tumor_reca.bam \
-V small_exac_common_3_hg19.vcf.gz \
-L small_exac_common_3_hg19.vcf.gz \
-O Tumor_getpileupsummaries.table

gatk GetPileupSummaries \
-I Normal_reca.bam \
-V small_exac_common_3_hg19.vcf.gz \
-L small_exac_common_3_hg19.vcf.gz \
-O Normal_getpileupsummaries.table

gatk CalculateContamination \
-I Tumor_getpileupsummaries.table \
-matched Normal_getpileupsummaries.table \
--tumor-segmentation Tumor_segments.table \
-O Pair_calculatecontamination.table

gatk FilterMutectCalls \
-V Tumor.vcf.gz \
--contamination-table Pair_calculatecontamination.table \
--tumor-segmentation Tumor_segments.table \
--stats Tumor_somatic_filtered.stats.txt \
-O Tumor_somatic_filtered.vcf.gz

and its output:
chr4 38828727 . CAG TTT . PASS CONTQ=24;DP=23;ECNT=1;GERMQ=46;MBQ=33,22;MFRL=118,117;MMQ=60,60;MPOS=20;NALOD=1.24;NLOD=4.81;POPAF=6.00;REF_BASES=TCATTGTTTTCAGTGACTAGT;SAAF=0.303,0.303,0.333;SAPP=0.024,0.024,0.952;TLOD=6.98 GT:AD:AF:DP:F1R2:F2R1 0/0:16,0:0.054:16:5,0:11,0 0/1:4,2:0.375:6:3,0:1,2

Mutect2 can output MNVs (multi-nucleotide variants), while the other tools (e.g. Strelka2, VarScan2, MuSE, SomaticSniper) output this MNV site as separate records in their VCFs.
To find the intersection of passing variants, I tried changing a Mutect2 parameter to split the MNV into its adjacent SNVs (to match the other tools' VCF output).

My command line (only changed a parameter in "gatk Mutect2"):

gatk Mutect2 \
-R genome.fa \
-I Tumor_reca.bam \
-I Normal_reca.bam \
-normal Normal_name \
-L Covered.bed \
-ip 100 \
--max-mnp-distance 0 \
--orientation-bias-artifact-priors Tumor.artifact-prior.tsv \
--germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz \
-pon pon.vcf \
-O Tumor.vcf.gz \
-bamout Tumor_bamout.bam

and its output:
chr4 38828727 . C T . bad_haplotype;clustered_events;read_orientation_artifact CONTQ=24;DP=22;ECNT=3;GERMQ=46;MBQ=33,22;MFRL=118,117;MMQ=60,60;MPOS=20;NALOD=1.23;NLOD=4.81;POPAF=6.00;REF_BASES=TCATTGTTTTCAGTGACTAGT;SAAF=0.293,0.303,0.333;SAPP=0.028,0.021,0.951;TLOD=6.98 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:P_PRIOR_RO:P_RO:ROF_TYPE 0|0:16,0:0.056:16:5,0:11,0:0|1:38828727_C_T:38828727 0|1:4,2:0.375:6:3,0:1,2:0|1:38828727_C_T:38828727:0.034:0.982:F2R1

chr4 38828728 . A T . bad_haplotype;clustered_events;read_orientation_artifact CONTQ=24;DP=23;ECNT=3;GERMQ=46;MBQ=26,22;MFRL=118,117;MMQ=60,60;MPOS=21;NALOD=1.24;NLOD=4.81;POPAF=6.00;REF_BASES=CATTGTTTTCAGTGACTAGTG;SAAF=0.293,0.303,0.333;SAPP=0.028,0.021,0.951;TLOD=6.98 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:P_PRIOR_RO:P_RO:ROF_TYPE 0|0:16,0:0.054:16:5,0:11,0:0|1:38828727_C_T:38828727 0|1:4,2:0.375:6:3,0:1,2:0|1:38828727_C_T:38828727:1.797e-03:0.766:F2R1

chr4 38828729 . G T . bad_haplotype;clustered_events CONTQ=24;DP=22;ECNT=3;GERMQ=43;MBQ=32,23;MFRL=119,117;MMQ=60,60;MPOS=22;NALOD=1.22;NLOD=4.52;POPAF=6.00;REF_BASES=ATTGTTTTCAGTGACTAGTGT;SAAF=0.293,0.303,0.333;SAPP=0.028,0.021,0.951;TLOD=6.98 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:P_PRIOR_RO:P_RO:ROF_TYPE 0|0:15,0:0.057:15:5,0:10,0:0|1:38828727_C_T:38828727 0|1:4,2:0.375:6:3,0:1,2:0|1:38828727_C_T:38828727:1.383e-03:0.632:F2R1

Because these adjacent SNVs didn't pass the "clustered_events" and "bad_haplotype" filters, I also relaxed the criteria in "gatk FilterMutectCalls".

My command line (only changed parameters in "gatk Mutect2" and "gatk FilterMutectCalls"):

gatk Mutect2 \
-R genome.fa \
-I Tumor_reca.bam \
-I Normal_reca.bam \
-normal Normal_name \
-L Covered.bed \
-ip 100 \
--max-mnp-distance 0 \
--orientation-bias-artifact-priors Tumor.artifact-prior.tsv \
--germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz \
-pon pon.vcf \
-O Tumor.vcf.gz \
-bamout Tumor_bamout.bam

gatk FilterMutectCalls \
-V Tumor.vcf.gz \
--contamination-table Pair_calculatecontamination.table \
--tumor-segmentation Tumor_segments.table \
--max-events-in-region 3 \
--stats Tumor_somatic_filtered.stats.txt \
-O Tumor_somatic_filtered.vcf.gz

and its output:
chr4 38828727 . C T . read_orientation_artifact CONTQ=24;DP=22;ECNT=3;GERMQ=46;MBQ=33,22;MFRL=118,117;MMQ=60,60;MPOS=20;NALOD=1.23;NLOD=4.81;POPAF=6.00;REF_BASES=TCATTGTTTTCAGTGACTAGT;SAAF=0.293,0.303,0.333;SAPP=0.028,0.021,0.951;TLOD=6.98 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:P_PRIOR_RO:P_RO:ROF_TYPE 0|0:16,0:0.056:16:5,0:11,0:0|1:38828727_C_T:38828727 0|1:4,2:0.375:6:3,0:1,2:0|1:38828727_C_T:38828727:0.034:0.982:F2R1

chr4 38828728 . A T . read_orientation_artifact CONTQ=24;DP=23;ECNT=3;GERMQ=46;MBQ=26,22;MFRL=118,117;MMQ=60,60;MPOS=21;NALOD=1.24;NLOD=4.81;POPAF=6.00;REF_BASES=CATTGTTTTCAGTGACTAGTG;SAAF=0.293,0.303,0.333;SAPP=0.028,0.021,0.951;TLOD=6.98 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:P_PRIOR_RO:P_RO:ROF_TYPE 0|0:16,0:0.054:16:5,0:11,0:0|1:38828727_C_T:38828727 0|1:4,2:0.375:6:3,0:1,2:0|1:38828727_C_T:38828727:1.797e-03:0.766:F2R1

chr4 38828729 . G T . PASS CONTQ=24;DP=22;ECNT=3;GERMQ=43;MBQ=32,23;MFRL=119,117;MMQ=60,60;MPOS=22;NALOD=1.22;NLOD=4.52;POPAF=6.00;REF_BASES=ATTGTTTTCAGTGACTAGTGT;SAAF=0.293,0.303,0.333;SAPP=0.028,0.021,0.951;TLOD=6.98 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:P_PRIOR_RO:P_RO:ROF_TYPE 0|0:15,0:0.057:15:5,0:10,0:0|1:38828727_C_T:38828727 0|1:4,2:0.375:6:3,0:1,2:0|1:38828727_C_T:38828727:1.383e-03:0.632:F2R1

Two of the sites didn't pass the "read_orientation_artifact" filter.
My guess is that each of these separate adjacent SNVs gets its own "P_PRIOR_RO", so two of the sites failed the "read_orientation_artifact" filter because of their "P_RO" values.

I wonder why the original MNV passes the "read_orientation_artifact" filter?

However, changing these two parameters affects more sites than just the MNV sites. Does Mutect2 have more targeted parameters that output adjacent SNVs (instead of MNVs) without affecting other sites?
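If no Mutect2 parameter does exactly this, one possible workaround (a downstream step, not a Mutect2 option) would be to keep the default MNV calling so the filters behave as before, and split MNVs into SNVs afterwards with bcftools (the --atomize option exists in recent bcftools releases):

# Decompose MNVs into per-base SNVs after FilterMutectCalls, leaving
# Mutect2's own filtering untouched (requires a bcftools version with --atomize).
bcftools norm --atomize -Oz -o Tumor_somatic_filtered.atomized.vcf.gz Tumor_somatic_filtered.vcf.gz
tabix -p vcf Tumor_somatic_filtered.atomized.vcf.gz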

Thanks! :smile:

Hi, What function does the complement_full_prob_avxs implement in the GKL library?


The GKL library speeds up the PairHMM algorithm, but I don't see the implementation of complement_full_prob_avxs in the source code. What does complement_full_prob_avxs implement in the GKL library, and where is its source code?

How does Mutect2 deal with SoftClip bases?


Hi, I get a variant record like this using GATK 4.1.0.0 Mutect2 in tumor-only mode.
It seems that the insertion is caused by soft-clipped bases. Please see the IGV snapshot (I have turned on "Show soft-clipped bases").

chr16   68845762    .   G   GAGTTTCCCTACGTATACCCTGGTGGTTCAAGCTGCTGACCTTCAAGGT   .   read_position   CONTQ=35;DP=40;ECNT=1;GERMQ=84;MBQ=37,40;MFRL=257,1656;MMQ=60,60;MPOS=0;POPAF=7.30;SAAF=0.00,0.071,0.075;SAPP=0.180,3.791e-03,0.816;TLOD=7.89   GT:AD:AF:DP:F1R2:F2R1:OBAM:OBAMRC   0/1:37,3:0.095:40:24,0:13,3:false:false

I want to know how Mutect2 deals with soft-clipped bases, and how it decides to apply the "read_position" filter tag.

Thank you.

Difference between HC and UG methods

When I compared the results of HaplotypeCaller and UnifiedGenotyper, I found some loci where a variant is called only by UG. However, both the original BAM file and the bamout file produced by reassembly contain reads supporting this variant (see figure 1). The numbers of reads supporting the reference and alternate alleles are 456 (87%) and 69 (13%) in the raw BAM, and 439 (86%) and 69 (14%) in the bamout file. I am confused by this.

Variant Quality Score Recalibration (VQSR)


VQSR stands for Variant Quality Score Recalibration. In a nutshell, it is a sophisticated filtering technique applied on the variant callset that uses machine learning to model the technical profile of variants in a training set and uses that to filter out probable artifacts from the callset.

Note that this variant recalibration process (VQSR) should not be confused with base recalibration (BQSR), which is a data pre-processing method applied to the read data in an earlier step. The developers who named these methods wish to apologize sincerely to anyone, especially Spanish-speaking users, who get tripped up by the similarity of these names.


Contents

  1. Overview
  2. Variant recalibration procedure details
  3. Interpretation of the Gaussian mixture model plots
  4. Tranches and the tranche plot
  5. Resource datasets
  6. Problems

1. Overview

What's in a name?

Let's get this out of the way first -- “variant quality score recalibration” is kind of a bad name because it’s not re-calibrating variant quality scores at all; it is calculating a new quality score called the VQSLOD (for variant quality score log-odds) that takes into account various properties of the variant context not captured in the QUAL score. The purpose of this new score is to enable variant filtering in a way that allows analysts to balance sensitivity (trying to discover all the real variants) and specificity (trying to limit the false positives that creep in when filters get too lenient) as finely as possible.

Filtering approaches

The basic, traditional way of filtering variants is to look at various annotations (context statistics) that describe e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation; things like that -- then choose threshold values and throw out any variants that have annotation values above or below the set thresholds. The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.
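To make that concrete, a traditional hard filter might look like this (GATK3-style VariantFiltration; the thresholds are illustrative placeholders, not recommendations):

# Hard filtering: every variant is judged against fixed per-annotation cutoffs.
java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R reference.fasta \
    -V raw_variants.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    --filterName "hard_filter" \
    -o hard_filtered.vcf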

The VQSR method, in contrast, uses machine learning algorithms to learn from each dataset what is the annotation profile of good variants vs. bad variants, and does so in a way that integrates information from multiple dimensions (like, 5 to 8, typically). The cool thing is that this allows us to pick out clusters of variants in a way that frees us from the traditional binary choice of “is this variant above or below the threshold for this annotation?”

Let’s do a quick mental visualization exercise, in two dimensions because our puny human brains work best at that level. Imagine a topographical map of a mountain range, with North-South and East-West axes standing in for two variant annotation scales. Your job is to define a subset of territory that contains mostly mountain peaks, and as few lowlands as possible. Traditional hard-filtering forces you to set a single longitude cutoff and a single latitude cutoff, resulting in one rectangular quadrant of the map being selected, and all the rest being greyed out. It’s about as subtle as a sledgehammer and forces you to make a lot of compromises. VQSR allows you to select contour lines around the peaks and decide how low or how high you want to go to include or exclude territory.

That sounds great! How does it work?

Well, like many things that mention the words "machine learning", it's a bit complicated. The key point is that we use known, highly validated variant resources (omni, 1000 Genomes, hapmap) to select a subset of variants within our callset that we’re really confident are probably true positives (that’s the training set). We look at the annotation profiles of those variants (in our own data!), and from that we learn some rules about how to recognize good variants. We do something similar for bad variants as well. Then we apply the rules we learned to all of the sites, which (through some magical hand-waving) yields a single score for each variant that describes how likely it is based on all the examined dimensions. In our map analogy this is the equivalent of determining on which contour line the variant sits. Finally, we pick a threshold value indirectly by asking the question “what score do I need to choose so that e.g. 99% of the variants in my callset that are also in hapmap will be selected?”. This is called the target sensitivity. We can twist that dial in either direction depending on what is more important for our project, sensitivity or specificity.

Recalibrate variant types separately!

Due to important differences in how the annotation distributions relate to variant quality between SNPs and indels, we recalibrate them separately. See the Best Practices workflow recommendations for details on how to wire this up.
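In practice this means two back-to-back passes, for example (GATK3-era sketch with placeholder file and resource names):

# Pass 1: model and filter SNPs only; indel records pass through unfiltered.
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R reference.fasta -input raw.vcf \
    -resource:truthset,known=false,training=true,truth=true,prior=15.0 truth_snps.vcf \
    -an QD -an FS -an MQ -mode SNP \
    -recalFile snp.recal -tranchesFile snp.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R reference.fasta -input raw.vcf \
    -mode SNP --ts_filter_level 99.9 \
    -recalFile snp.recal -tranchesFile snp.tranches -o snp_recal.vcf

# Pass 2: model and filter indels, starting from the SNP-recalibrated VCF.
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R reference.fasta -input snp_recal.vcf \
    -resource:truthset,known=false,training=true,truth=true,prior=12.0 truth_indels.vcf \
    -an QD -an FS -mode INDEL \
    -recalFile indel.recal -tranchesFile indel.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R reference.fasta -input snp_recal.vcf \
    -mode INDEL --ts_filter_level 99.0 \
    -recalFile indel.recal -tranchesFile indel.tranches -o recalibrated.vcf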


2. Variant recalibration procedure details

VariantRecalibrator builds the model(s)

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.

From a more technical point of view, the idea is that we can develop a continuous, covarying estimate of the relationship between variant call annotations (QD, MQ, FS etc.) and the probability that a variant call is a true genetic variant versus a sequencing or data processing artifact. We determine this model adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). We can then apply this adaptive error model to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The VQSLOD score, which gets added to the INFO field of each variant, is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
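Written out, this is simply (a restatement of the definition above, not an exact formula from the code):

VQSLOD = log( P(annotations | model of true variants) / P(annotations | model of artifacts) )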

This step produces a recalibration file in VCF format and some accessory files (tranches and plots).

Note that for workflow efficiency purposes it is possible to split this step in two: (1) run the tool on all the data and output an intermediate recalibration model report, then (2) run the tool again to calculate the VQSLOD scores and write out the recalibration file, tranches and plots. This has the advantage of making it possible to scatter the second part over genomic intervals, to accelerate the process.
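In GATK 4 this split is exposed through model-report arguments; a sketch (the flag names are from GATK 4 and are an assumption relative to this older write-up, with placeholder file names):

# Step 1: build the Gaussian mixture model over the whole callset and save it.
gatk VariantRecalibrator -R reference.fasta -V cohort.vcf.gz \
    --resource:truthset,known=false,training=true,truth=true,prior=15.0 truth.vcf.gz \
    -an QD -an FS -mode SNP \
    --output-model snp_model.report \
    -O full.recal --tranches-file full.tranches

# Step 2 (scatterable over intervals): re-score using the saved model.
gatk VariantRecalibrator -R reference.fasta -V cohort.vcf.gz -L chr1 \
    --resource:truthset,known=false,training=true,truth=true,prior=15.0 truth.vcf.gz \
    -an QD -an FS -mode SNP \
    --input-model snp_model.report \
    -O chr1.recal --tranches-file chr1.tranches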

ApplyRecalibration applies a filtering threshold

Here we go again with the not-so-great naming, sorry. This tool doesn't really apply the recalibration model to the callset, since VariantRecalibrator already did that -- that's what the recalibration file contains. Rather, it applies a filtering threshold and writes out which variants pass and which fail to an output VCF.

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in a truth set were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), the program finds the VQSLOD value above which 99.9% of the variants in the truth set are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the truth set, which are considered false positives. Yes, we accept the possibility that some small number of variant calls in the truth set are wrong...
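As a worked illustration (the numbers are invented for clarity): suppose 10,000 truth-set variants received VQSLOD scores. Requesting a 99.9% tranche means finding the VQSLOD of the 9,990th-highest-scoring truth variant (say 4.2) and using that as the cutoff: every call in your own set with VQSLOD >= 4.2 gets PASS, and the rest are labeled with the tranche they fall into.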


3. Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run. For every pairwise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

[Figure: example page of a Gaussian mixture model report]

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from an example HiSeq call set of SNPs. This page shows the 2D projection of Mapping Quality Rank Sum Test (MQRankSum) versus Haplotype Score (HS) by marginalizing over the other annotation dimensions in the model.

Note that this is an old example that uses an annotation, Haplotype Score, that has been deprecated and is no longer available. However, all the points made in the description are still valid.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown in the figure above. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown. As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense based on what the annotated statistics mean; for example the higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out.


4. Tranches and the tranche plot

The VQSLOD score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well-calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority, then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (90 by default, but you can specify your own values) has the lowest value of truth sensitivity but the highest value of novel Ti/Tv; it is extremely specific (almost exclusively real variants in here) but less sensitive (it's missing a lot). From there, each subsequent tranche introduces additional true positive calls along with a growing number of false positive calls.

A plot of the tranches is automatically generated for SNPs by the VariantRecalibrator tool if you have the requisite R dependencies installed; an example is shown below.

[Figure: example tranches plot for a HiSeq call set]

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion (TiTv) ratio and the overall truth sensitivity. Remember that specificity decreases as the truth sensitivity increases.

The tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.


5. Resource datasets

This procedure relies heavily on the availability of good resource datasets of the following types:

Truth resource

This resource must be a call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites.

Training resource

This resource must be a call set that has been validated to some degree of confidence. The program will consider that the variants in this resource may contain false positives as well as true variants (truth=false), and will use them to train the recalibration model (training=true).

Known sites resource

This resource can be a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true).
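Expressed as VariantRecalibrator arguments, the three roles map onto resource tags like this (GATK3-style fragments to add to the command line; the file names are the usual human bundle files, shown as examples):

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf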

Obtaining appropriate resources

The human genome training, truth and known resource datasets that are used in our Best Practices workflow applied to human data are all available from our Resource Bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties as described above. You'll also need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.


6. Problems

VariantRecalibrator is like a teenager, frequently moody and uncommunicative. There are many things that can put this tool in a bad mood and cause it to fail. When that happens, it rarely provides a useful error message. We're very sorry about that and we're working hard on developing a new tool (based on deep learning) that will be more stable and user-friendly. In the meantime though, here are a few things to watch out for.

Greedy for data

This tool expects large numbers of variant sites in order to achieve decent modeling with the Gaussian mixture model. It's difficult to put a hard number on minimum requirements because it depends a lot on the quality of the data (clean, well-behaved data requires fewer sites because the clustering tends to be less noisy), but empirically we find that in humans, the procedure tends to work well enough with at least one whole genome or 30 exomes. Anything smaller than that scale is likely to run into difficulties, especially for the indel recalibration.

If you don't have enough of your own samples, consider using publicly available data (e.g. exome bams from the 1000 Genomes Project) to "pad" your cohort. Be aware however that you cannot simply merge in someone else's variant calls. You must joint-call variants from the original BAMs with your own samples. We recommend using the GVCF workflow to generate GVCFs from the original BAMs, and joint-call those GVCFs along with your own samples' GVCFs using GenotypeGVCFs.
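A minimal sketch of that padding workflow (GATK3-era GVCF syntax; file names are placeholders):

# Call each BAM (yours and the public padding samples) to a GVCF...
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta \
    -I my_sample.bam -ERC GVCF -o my_sample.g.vcf
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta \
    -I public_1kg_sample.bam -ERC GVCF -o public_1kg_sample.g.vcf

# ...then joint-genotype all the GVCFs together before attempting VQSR.
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta \
    -V my_sample.g.vcf -V public_1kg_sample.g.vcf -o joint_cohort.vcf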

One other thing that can help is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line. This controls the maximum number of different "clusters" (=Gaussians) of variants the program is allowed to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

Annotations are tricky

VariantRecalibrator assumes that annotation values are Gaussian-distributed, but we know this assumption breaks down for some annotations. For example, mapping quality has a very different distribution because it is not a calibrated statistic, so in some cases it can destabilize the model. When you run into trouble, excluding MQ from the list of annotations can be helpful.

In addition, some of the annotations included in our standard VQSR recommendations might not be the best for your particular dataset. In particular, the following caveats apply:

  • Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured. In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

  • InbreedingCoeff is a population-level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or that include many closely related samples (such as a family), please omit this annotation from the command line.

Dependencies

Plot generation depends on having Rscript accessible in your environment path. Rscript is the command line version of R that allows execution of R scripts from the command line. We also make use of the ggplot2 library so please be sure to install that package as well. See the Common Problems section of the Guide for more details.
