Channel: Recent Discussions — GATK-Forum

Is there any "pre copy number" value in the processing?


Dear Genome STRiP users,

I ran the SVCNVDiscovery process on a cohort and also calculated copy numbers with other software such as Lumpy. But when I compared the two sets of copy number results, I found some samples with odd copy number values.

sample    Lumpy CN    Genome STRiP CN
sample1   3.05128     2
sample2   2.64587     2
sample3   2.56714     3
sample4   1.84659     1

It is hard to figure out the conversion between the continuous CN from Lumpy and the discrete CN from GS. I am wondering whether Genome STRiP also generates continuous CNs and then discretizes them into the final values. If so, where can I get the "pre" value behind the final discrete CNs, or can I output this kind of "pre" value? Thank you very much.

Best regards,
Minzhi


Can't open GATK GUI on MacBook Pro

Hi,

I downloaded the GATK software to run on my MacBook Pro. I've opened the gatk terminal file and it seems to load, but then nothing happens. None of the JAR files open either. I have attached screenshots. Can someone tell me where I'm going wrong?

Cheers,

Peter

Problem understanding construction of Tranches Plots


Hello!
I must be missing a fundamental concept about VQSR, because I cannot figure out how false positives are defined in the construction of the tranches plot.

First, I understand a tranche as a subset of the original raw sites, in which sites PASS if they are above a particular VQSLOD score that allows us to keep N% of the sites in the truth set. From this, all sites in the tranche must have FILTER=PASS, and some of those sites will be novel because they do not appear in any of the resources provided.

When constructing the tranches plot, only novel sites are considered.

So, if in order to be in the tranche a site must have PASSed filtering by VQSLOD score, then where do the tranche-specific false positives come from?

This question is derived from a previous post where it was indicated that

All the SNPs that are found are considered "positives" because they were found by earlier stages of analysis. "True" vs "False" positives is simply referring to whether they pass the VQSR filter in a given tranche.

Is there a stage in the analysis that I am not considering?

Thanks

Please add an explicit type tag :NAME


Hi,

I am using the VariantsToTable walker to convert my vcfs to tabular format. However, I keep getting the following error:

Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types

The problem is that I don't understand what this advice means. How do you actually provide the :NAME tag on the command line? I have tried a number of ways, nothing seems to work, and I can't find any reference to this tag in the documentation.

Best wishes,

Kath

Error: Not enough resources available to fulfill request

I am just starting to use FireCloud for data analysis. We are currently on the free credit program and want to start our first analysis. Our workspace name is "fccredits-iron-jade-XX31/XXXXX_free". After triggering the method from the available configuration "cellranger_mkfastq_count", both my colleague and I experience issues right after job submission. The error message is below:

Call #1 (Subworkflow ID 2fd28187-015c-4adc-ae4e-4aa450581a1d):
Started:January 17, 2019, 9:48 AM (57 minutes ago)
Ended:January 17, 2019, 9:50 AM (55 minutes ago)
Failures:
message: Workflow failed
causedBy:
message: Task cellranger_mkfastq.run_cellranger_mkfastq:NA:1 failed. The job was stopped before the command finished. PAPI error code 2. The zone 'projects/fccredits-iron-jade-XX31/zones/us-central1-f' does not have enough resources available to fulfill the request. '(resource type:compute)'.

According to SO, this suggests a true limitation in Google’s compute resources. Do you think this could be the problem? Have you encountered similar scenarios in the past?

Thank you!
Best, Niklas

Picard 2.18.23 crashing when building the JAR

Hi there! I am following the instructions to build the JAR for Picard found in the repository README. I am trying to execute the ./gradlew shadowJar step, but I am presented with this error:

Exception in thread "main" java.io.IOException: Function not implemented
at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90)
at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115)
at java.nio.channels.FileChannel.tryLock(FileChannel.java:1155)
at org.gradle.wrapper.ExclusiveFileAccessManager.access(ExclusiveFileAccessManager.java:51)
at org.gradle.wrapper.Install.createDist(Install.java:48)
at org.gradle.wrapper.WrapperExecutor.execute(WrapperExecutor.java:128)
at org.gradle.wrapper.GradleWrapperMain.main(GradleWrapperMain.java:61)

Could something be wrong?

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs


In GATK4, the GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case only two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport, which has a few limitations (for example it can only run on diploid data at the moment). We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


Using GenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs, your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20

That generates a directory called my_database containing the combined GVCF data for chromosome 20. The contents of the directory are not really human-readable; see further down for tips to deal with that.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At the moment you can only run GenomicsDBImport on a single genomic interval (i.e. max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this, of course; see the sketch after this list.

  3. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using CatVariants) or scatter the following steps by chromosome as well.
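
As a minimal sketch of the per-interval scripting mentioned in limitation 2 (the chromosome names here are placeholders; the GVCF paths match the trio example above):

for chrom in chr20 chr21 chr22; do
    gatk GenomicsDBImport \
        -V data/gvcfs/mother.g.vcf \
        -V data/gvcfs/father.g.vcf \
        -V data/gvcfs/son.g.vcf \
        --genomicsdb-workspace-path my_database_${chrom} \
        --intervals ${chrom}
done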

**If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way; a minimal example follows.**
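
Here is a minimal sketch of the CombineGVCFs fallback for the same trio (the output file name is arbitrary):

gatk CombineGVCFs \
    -R data/ref/ref.fasta \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    -O trio_combined.g.vcf

The resulting combined GVCF can then be passed to GenotypeGVCFs with -V trio_combined.g.vcf (no gendb:// prefix needed).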


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Combine multi-sample GVCFs


Hi GATK experts,

I have 6144 individual sample GVCFs with different ploidies, so I can't use GenomicsDBImport to generate a single GVCF to pass to GenotypeGVCFs. I have tried running all 6144 GVCFs through CombineGVCFs but got stuck due to ulimit constraints, which couldn't be resolved despite increasing the ulimit 'nproc' and 'nofile' settings to the required higher numbers. I think this is due to some conflict with the SGE environment or some other arrangement in our own cluster setup. Previously I have successfully run 384 GVCFs through CombineGVCFs to the final steps. So now I have divided these 6144 GVCFs into 16 parts, each containing 384 GVCFs. I am running these sixteen 384-GVCF batches through CombineGVCFs for each chromosome (12 chromosomes in total) separately. This will lead to the generation of 192 multi-sample GVCFs. My question is: can CombineGVCFs be used to merge multi-sample GVCFs in addition to single-sample GVCFs, and if so, will all the annotation fields still be meaningful?

Regards,
Sanjeev


MuTect generating empty VCF file

I'm new to MuTect, but I keep getting an empty VCF file when I run it. I'm working with fairly low variant frequencies (<5%), but I have a synthetic WT "normal" as the normal in my tumor/normal pair, so the background is low. Using other methods, I've identified 10-15 mutations that are >10-fold higher in my sample than in my synthetic normal, but I can't replicate the results with MuTect. Here's the output:

03:39:59.750 INFO IntelPairHmm - Available threads: 8
03:39:59.750 INFO IntelPairHmm - Requested threads: 4
03:39:59.751 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
03:39:59.838 INFO ProgressMeter - Starting traversal
03:39:59.839 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
03:40:09.854 INFO ProgressMeter - 1:37436701 0.2 124790 748740.0
03:40:19.839 INFO ProgressMeter - 1:88544701 0.3 295150 885450.0
03:40:29.839 INFO ProgressMeter - 1:140798701 0.5 469330 938660.0
03:40:39.839 INFO ProgressMeter - 1:194402701 0.7 648010 972015.0
03:40:49.839 INFO ProgressMeter - 1:247406701 0.8 824690 989628.0
03:40:59.839 INFO ProgressMeter - 2:54147901 1.0 1011330 1011330.0
03:41:09.839 INFO ProgressMeter - 2:112905901 1.2 1207190 1034734.3
03:41:19.839 INFO ProgressMeter - 2:171579901 1.3 1402770 1052077.5
03:41:29.839 INFO ProgressMeter - 2:230160901 1.5 1598040 1065360.0
03:41:34.539 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
03:41:34.540 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
03:41:34.540 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.00 sec
03:41:34.541 INFO Mutect2 - Shutting down engine
[January 18, 2019 3:41:34 AM GMT] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 1.61 minutes.
Runtime.totalMemory()=1574961152
java.lang.IllegalStateException: Offered read with sample name null to SamplePartitioner but this sample wasn't provided as one of possible samples at construction
at org.broadinstitute.hellbender.utils.Utils.validate(Utils.java:749)
at org.broadinstitute.hellbender.utils.locusiterator.SamplePartitioner.submitRead(SamplePartitioner.java:86)
at org.broadinstitute.hellbender.utils.locusiterator.ReadStateManager.submitRead(ReadStateManager.java:188)
at org.broadinstitute.hellbender.utils.locusiterator.ReadStateManager.collectPendingReads(ReadStateManager.java:160)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.lazyLoadNextAlignmentContext(LocusIteratorByState.java:315)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.hasNext(LocusIteratorByState.java:252)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.advanceAlignmentContext(IntervalAlignmentContextIterator.java:104)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.(IntervalAlignmentContextIterator.java:45)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.(AssemblyRegionIterator.java:117)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:282)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

Recommendations for calling on and filtering ~100 low coverage samples


I have about 80-100 population-specific WGS samples with coverage around 5-10X. What modifications would you recommend to the GATK Best Practices to suit variant calling (and VQSR)? Also, what are the drawbacks of using the standard GVCF joint genotyping workflow for low coverage samples?

I have one high coverage sample of about 30X; would it help if I pooled that in?

Coverage bias in HaplotypeCaller


Hi,

I am doing joint variant calling for Illumina paired-end data from 150 monkeys. Coverage varies from 3-30X, with most individuals having around 4X coverage.

I did all the variant detection and hard-filtering (GATK Best Practices) with both UnifiedGenotyper and HaplotypeCaller.

My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?

In particular, consider pairwise differences across individuals:
The absolute values are lower for low coverage individuals than for high coverage, for both methods, since it is more difficult to make calls for them.
However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals by subtracting from the total reference length all the filtered sites and sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper. Values are comparable for low and high coverage individuals (if both pairs consist of members of similar populations).

However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.

Can you confirm this behaviour? For population genetic analysis this effect is highly problematic. I would trade in more false positives if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity), which is the average cross-individual diversity and thus already at the higher end for within-individual heterozygosity. I would expect the problem to be even worse if I chose lower values.

Can you give me any recommendation? Should I go back to using UnifiedGenotyper, or is there any way to solve this problem?

Many thanks in advance,
Hannes

Variant Quality Score Recalibration (VQSR)


This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. The first section is a high-level overview aimed at non-specialists. Additional technical details are provided below.

For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article. See the VariantRecalibrator tool doc and the ApplyRecalibration tool doc for a complete description of available command line arguments.

As a complement to this document, we encourage you to watch the workshop videos available in the Presentations section.


High-level overview

VQSR stands for “variant quality score recalibration”, which is a bad name because it’s not re-calibrating variant quality scores at all; it is calculating a new quality score that is supposedly super well calibrated (unlike the variant QUAL score which is a hot mess) called the VQSLOD (for variant quality score log-odds). I know this probably sounds like gibberish, stay with me. The purpose of this new score is to enable variant filtering in a way that allows analysts to balance sensitivity (trying to discover all the real variants) and specificity (trying to limit the false positives that creep in when filters get too lenient) as finely as possible.

The basic, traditional way of filtering variants is to look at various annotations (context statistics) that describe e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation; things like that -- then choose threshold values and throw out any variants that have annotation values above or below the set thresholds. The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.

The VQSR method, in a nutshell, uses machine learning algorithms to learn from each dataset what is the annotation profile of good variants vs. bad variants, and does so in a way that integrates information from multiple dimensions (like, 5 to 8, typically). The cool thing is that this allows us to pick out clusters of variants in a way that frees us from the traditional binary choice of “is this variant above or below the threshold for this annotation?”

Let’s do a quick mental visualization exercise (pending an actual figure to illustrate this), in two dimensions because our puny human brains work best at that level. Imagine a topographical map of a mountain range, with North-South and East-West axes standing in for two variant annotation scales. Your job is to define a subset of territory that contains mostly mountain peaks, and as few lowlands as possible. Traditional hard-filtering forces you to set a single longitude cutoff and a single latitude cutoff, resulting in one rectangular quadrant of the map being selected, and all the rest being greyed out. It’s about as subtle as a sledgehammer and forces you to make a lot of compromises. VQSR allows you to select contour lines around the peaks and decide how low or how high you want to go to include or exclude territory within your subset.

How this is achieved is another can of worms. The key point is that we use known, highly validated variant resources (omni, 1000 Genomes, hapmap) to select a subset of variants within our callset that we’re really confident are probably true positives (that’s the training set). We look at the annotation profiles of those variants (in our own data!), and from that we learn some rules about how to recognize good variants. We do something similar for bad variants as well. Then we apply the rules we learned to all of the sites, which (through some magical hand-waving) yields a single score for each variant that describes how likely it is to be real, based on all the examined dimensions. In our map analogy this is the equivalent of determining on which contour line the variant sits. Finally, we pick a threshold value indirectly by asking the question “what score do I need to choose so that e.g. 99% of the variants in my callset that are also in hapmap will be selected?”. This is called the target sensitivity. We can twist that dial in either direction depending on what is more important for our project, sensitivity or specificity.


Technical overview

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
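
Schematically (this is just a restatement of the definition above, not an excerpt from the implementation), the score for a variant with annotation vector $x$ is

\mathrm{VQSLOD}(x) = \log \frac{P(x \mid \text{positive model})}{P(x \mid \text{negative model})}

where the positive and negative models are Gaussian mixtures fit to the annotation profiles of high-confidence (training) variants and of the worst-scoring variants, respectively.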

The variant recalibrator contrastively evaluates variants in a two-step process, each step performed by a distinct tool:

  • VariantRecalibrator
    Create a Gaussian mixture model by looking at the annotation values over a high quality subset of the input call set and then evaluate all input variants. This step produces a recalibration file.

  • ApplyRecalibration
    Apply the model parameters to each variant in the input VCF files, producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by marking the FILTER field of variants that don't meet the specified lod threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.


How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.


How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), the program looks up the VQSLOD value above which 99.9% of the variants in the training callset are included. It then takes that VQSLOD value and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.


Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (the report will appear alongside your other outputs, e.g. as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

[Image: example page from a Gaussian mixture model report]

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!


Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (90), which has the lowest value of truth sensitivity but the highest value of novel Ti/Tv, is exceedingly specific but less sensitive. Each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select more specific or more sensitive call sets in a principled way, or incorporate the recalibrated quality scores directly, weighting individual variant calls by their probability of being real rather than analyzing only a fixed subset of calls. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

[Image: example tranches plot]

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.


Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
  • The truth sensitivity (TS) approach is conceptually cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used HapMap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other high-quality sets of sites (~99% truly variable in the population) should work just as well. In our experience with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.


Finally, a couple of Frequently Asked Questions

- Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.
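
For illustration, here is a schematic GATK3-style invocation with the reduced number of Gaussians (the reference, resource files, annotations and output names are placeholders, not recommendations):

java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fasta \
    -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an QD -an FS -an MQ \
    -mode SNP \
    --maxGaussians 4 \
    -recalFile output.recal \
    -tranchesFile output.tranches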

- Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside it. We also make use of the ggplot2 library, so please be sure to install that package as well. See the Common Problems section of the Guide for more details.

Filtering variants after calling on intervals - VQSR vs Hard-filter vs CNN


Hi. I have ~80-100 low-coverage (5-10X) WGS samples which I want to run through a joint genotyping (GVCF) workflow. I have called variants with HaplotypeCaller (-ERC GVCF) on my genes of interest using an interval list (containing 90 genes).

Will VQSR work with the final interval-subsetted VCF? If not, would it help if I extended my intervals to cover 100 more genes and extended the intervals further? I'm very reluctant to call variants on the whole-genome samples since I'm in a hurry and HaplotypeCaller takes too long (~6 hours).

PS I have one high coverage sample which I can download but would prefer not to.

Apart from that, I have previously made calls on the same samples (but over the whole genome) which I plan to use in my training set.

GATK4 CNV ModelSegments hets output


Hi GATK team!

I'm having trouble precisely understanding the ModelSegments hets output when it is run on a tumor sample and both tumor and normal AllelicCounts are provided.

The documentation reads:

If the matched normal is available, its allelic counts will be used to genotype the sites, and we will simply assume these genotypes are the same in the case sample. (This can be critical, for example, for determining sites with loss of heterozygosity in high purity case samples; such sites will be genotyped as homozygous if the matched-normal sample is not available.)

If this were truly the case, then why:
1. is a different number of variants (not 1:1 exactly overlapping) output to hets.tsv and hets.normal.tsv?
2. if I roughly quantify variant allele fractions in the hets.normal.tsv file, are a large portion of them far away from 0.5?

Both these observations seem to contradict what the documentation states. Can someone explain the differences and similarities between the hets.tsv and hets.normal.tsv output files in a way other than what is stated in the documentation? I'm not understanding this explanation.

something fishy? VCF depth and BAM depth don't match


Hi,

So after performing a multisample variant calling with GATK 4.0.4.0, I separated one of the samples and wanted to visualize one heterozygous site in IGV. As you can see from the attached image, the VCF shows that the depth is 1 and the GQ is 99, but the BAM file has lots of reads assigned to that position. I used the original BAM file that was used during the variant calling. Is there a possibility of a rearrangement at that site due to which the depth became 1? Should I have used the rearranged BAM file instead for visualizing? If the depth is 1 in the VCF file, how come the GQ is 99?


IntervalListTools UNION corner case: doesn't keep all name fields if merged coordinates are same


When IntervalListTools UNION encounters two records with identical coordinates but different name fields, it outputs a single record with only one of the original name fields. E.g.

chr1    2228866 2228866 +   NM_003036.3(SKI):c.100G>A
chr1    2228866 2228866 +   NM_003036.3(SKI):c.100G>T

becomes

chr1    2228866 2228866 +   NM_003036.3(SKI):c.100G>A

whereas this is expected:

chr1    2228866 2228866 +   NM_003036.3(SKI):c.100G>A|NM_003036.3(SKI):c.100G>T

The expected, name-concatenating, behavior works when the coordinates to be merged are not identical:

chr1    11992659    11992659    +   NM_014874.3(MFN2):c.280C>T
chr1    11992660    11992660    +   NM_014874.3(MFN2):c.281G>A

becomes

chr1    11992659    11992660    +   NM_014874.3(MFN2):c.280C>T|NM_014874.3(MFN2):c.281G>A

GATK command used:

java -jar $HOME/apps/GATK4.jar IntervalListTools \
--ACTION UNION \
-I /mnt/hdd/resources/clinvar/clinvar_2019-01-17/tab_delimited/clinvar_2019-01-17_variant_summary.pathogenic.tmp.interval_list \
-O /mnt/hdd/resources/clinvar/clinvar_2019-01-17/tab_delimited/clinvar_2019-01-17_variant_summary.pathogenic.interval_list \
;

(How to) Filter variants either with VQSR or by hard-filtering


Document is in BETA. It may be incomplete and/or inaccurate. Post suggestions to the Comments section and be sure to read about updates also within the Comments section.



This article outlines two different approaches to site-level variant filtration. Site-level filtering involves using INFO field annotations in filtering. Section 1 outlines steps in Variant Quality Score Recalibration (VQSR) and section 2 outlines steps in hard-filtering. See Article#6925 for in-depth descriptions of the different variant annotations.

The GATK Best Practices recommend filtering germline variant callsets with VQSR. For WDL script implementations of pipelines the Broad Genomics Platform uses in production, i.e. reference implementations, see the links provided on https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145 to both the gatk-workflows WDL script repository and to FireCloud workspaces. Each includes example data as well as publicly available recommended human population resources.

Hard-filtering is useful when the data cannot support VQSR or when an analysis requires manual filtering. Additionally, hard-filtering allows for filtering on sample-level annotations, i.e. FORMAT field annotations, which this article does not cover. See Tutorial#12350 to filter on FORMAT field attributes and to change the genotypes of such filtered sample sites to null (./.).

► GATK4 offers a deep learning method to filter germline variants that is applicable to single sample callsets. As of this writing, the CNN workflow is in experimental status (check here for an update). See Blog#10996 for an overview and initial benchmarking results, and see the gatk4-cnn-variant-filter repository for the WDL pipeline.
► For more complex variant filtering and annotation, see the Broad Hail.is framework at https://hail.is/index.html.
► After variant filtration, if downstream analyses require high-quality genotype calls, consider genotype refinement, e.g. filtering posterior-corrected GQ<20 genotypes. See Article#11074 for an overview.


Jump to a section

  1. VQSR: filter a cohort callset with VariantRecalibrator & ApplyVQSR
    ☞ 1.1 How can I parallelize VQSR?
  2. Hard filter a cohort callset with VariantFiltration
  3. Evaluate the filtered callset


1. VQSR: filter a cohort callset with VariantRecalibrator & ApplyVQSR

This section outlines the VQSR filtering steps performed in the 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline. Note the workflow hard-filters on the ExcessHet annotation before filtering with VQSR with the expectation that the callset represents many samples.

[A] Hard-filter a large cohort callset on ExcessHet using VariantFiltration
ExcessHet filtering applies only to callsets with a large number of samples, e.g. hundreds of unrelated samples. Small cohorts should not trigger ExcessHet filtering as values should remain small. Note that cohorts of consanguineous samples will inflate ExcessHet, and it is possible to limit the annotation to founders for such cohorts by providing a pedigree file during variant calling.

gatk --java-options "-Xmx3g -Xms3g" VariantFiltration \
-V cohort.vcf.gz \
--filter-expression "ExcessHet > 54.69" \
--filter-name ExcessHet \
-O cohort_excesshet.vcf.gz 

This produces a VCF callset where any record with ExcessHet greater than 54.69 is filtered with the ExcessHet label in the FILTER column. The phred-scaled 54.69 corresponds to a z-score of -4.5. If a record lacks the ExcessHet annotation, it will pass filters.
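
As a quick sanity check of that number (standard phred scaling of a normal-tail probability, not taken from the pipeline code): a z-score of $-4.5$ corresponds to a one-sided tail probability of roughly $3.4 \times 10^{-6}$, and

-10 \log_{10}\!\left(3.4 \times 10^{-6}\right) \approx 54.69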

[B] Create sites-only VCF with MakeSitesOnlyVcf
Site-level filtering requires only site-level annotations. We can speed up the analysis in the modeling step by using a VCF that drops sample-level columns.

gatk MakeSitesOnlyVcf \
-I cohort_excesshet.vcf.gz \
-O cohort_sitesonly.vcf.gz

This produces a VCF that retains only the first eight columns.

[C] Calculate VQSLOD tranches for indels using VariantRecalibrator
All of the population resource files are publicly available at gs://broad-references/hg38/v0. The parameters in this article reflect those in the 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline and are thus tuned for WGS samples. For recommendations specific to exome samples, reasons why SNPs versus indels require different filtering, and additional discussion of training sets and arguments, see Article#1259. For example, the article states:

[For filtering indels, m]ost annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

gatk --java-options "-Xmx24g -Xms24g" VariantRecalibrator \
-V cohort_sitesonly.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 \
-an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP \
-mode INDEL \
--max-gaussians 4 \
-resource mills,known=false,training=true,truth=true,prior=12:Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-resource axiomPoly,known=false,training=true,truth=false,prior=10:Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz \
-resource dbsnp,known=true,training=false,truth=false,prior=2:Homo_sapiens_assembly38.dbsnp138.vcf \
-O cohort_indels.recal \
--tranches-file cohort_indels.tranches

The --max-gaussians parameter sets the expected number of clusters in modeling. If a dataset gives fewer distinct clusters, e.g. as can happen for smaller data, then the tool will tell you there is insufficient data with a No data found error message. In this case, try decrementing the --max-gaussians value.

[D] Calculate VQSLOD tranches for SNPs using VariantRecalibrator

gatk --java-options "-Xmx3g -Xms3g" VariantRecalibrator \
-V cohort_sitesonly.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP \
-mode SNP \
--max-gaussians 6 \
-resource hapmap,known=false,training=true,truth=true,prior=15:hapmap_3.3.hg38.vcf.gz \
-resource omni,known=false,training=true,truth=true,prior=12:1000G_omni2.5.hg38.vcf.gz \
-resource 1000G,known=false,training=true,truth=false,prior=10:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource dbsnp,known=true,training=false,truth=false,prior=7:Homo_sapiens_assembly38.dbsnp138.vcf \
-O cohort_snps.recal \
--tranches-file cohort_snps.tranches

Each step, C and D, produces a .recal recalibration table and a .tranches tranches table. In the filtering step, ApplyVQSR will use both types of data.

  • To additionally produce the optional tranche plot, specify the --rscript-file parameter. See the VariantRecalibrator tool documentation for details and this discussion thread for an example plot.
  • For allele-specific recalibration of an allele-specific callset, a beta feature as of this writing, add the -AS parameter.


☞ 1.1 How can I parallelize VQSR?

For cohorts with more than 10,000 WGS samples, it is possible to break down the analysis across genomic regions for parallel processing. The 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline does so first by increasing --java-options to "-Xmx100g -Xms100g" and second by adding the following parameters to the command to subsample variants and to produce a file of the VQSR model.

--sample-every-Nth-variant 10 \
--output-model ${model_report_filename} \

The pipeline then applies the resulting model to each genomic interval with the same parameters as above with two additions. It provides the resulting model report to VariantRecalibrator with --input-model and specifies the flag --output-tranches-for-scatter. The pipeline then collates the resulting per-interval tranches with GatherTranches. Refer to the pipeline script for implementation details.
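
A rough sketch of that scatter pattern for the SNP model (not a runnable command: "..." stands in for the same tranche, annotation, and resource arguments as in step [D], and the file names are placeholders):

# Stage 1: build the VQSR model once, on a subsampled set of sites
gatk --java-options "-Xmx100g -Xms100g" VariantRecalibrator \
    -V cohort_sitesonly.vcf.gz \
    ... \
    --sample-every-Nth-variant 10 \
    --output-model cohort_snps.model.report \
    -O cohort_snps.model.recal \
    --tranches-file cohort_snps.model.tranches

# Stage 2: apply the model to each genomic interval, emitting per-interval tranches
gatk VariantRecalibrator \
    -V cohort_sitesonly.vcf.gz \
    -L ${interval} \
    ... \
    --input-model cohort_snps.model.report \
    --output-tranches-for-scatter \
    -O cohort_snps.${shard}.recal \
    --tranches-file cohort_snps.${shard}.tranches

# The per-interval .tranches files are then collated with GatherTranches
# (see the pipeline script for the exact invocation).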


Successively apply the indel and SNP recalibrations to the full callset that has already undergone ExcessHet filtering.

[E] Filter indels on VQSLOD using ApplyVQSR

gatk --java-options "-Xmx5g -Xms5g" \
ApplyVQSR \
-V cohort_excesshet.vcf.gz \
--recal-file cohort_indels.recal \
--tranches-file cohort_indels.tranches \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode INDEL \
-O indel.recalibrated.vcf.gz

This produces an indel-filtered callset. At this point, SNP-type variants remain unfiltered.

[F] Filter SNPs on VQSLOD using ApplyVQSR

gatk --java-options "-Xmx5g -Xms5g" \
ApplyVQSR \
-V indel.recalibrated.vcf.gz \
--recal-file ${snps_recalibration} \
--tranches-file ${snps_tranches} \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode SNP \
-O snp.recalibrated.vcf.gz

This produces a SNP-filtered callset. Given the indel-filtered callset, this results in the final filtered callset.


back to top


2. Hard filter a cohort callset with VariantFiltration

This section of the tutorial provides generic hard-filtering thresholds and example commands for site-level manual filtering. A typical scenario requiring manual filtration is small cohort callsets, e.g. less than thirty exomes. See the GATK3 hard filtering Tutorial#2806 for additional discussion.

Researchers are expected to fine-tune hard-filtering thresholds for their data. Towards gauging the relative informativeness of specific variant annotations, the GATK hands-on hard-filtering workshop tutorial demonstrates how to plot distributions of annotation values for variant calls stratified against a truthset.

As with VQSR, hard-filter SNPs and indels separately. As of this writing, SelectVariants subsets SNP-only records, indel-only records or mixed-type, i.e. SNP and indel alternate alleles in the same record, separately. Therefore, when subsetting to SNP-only or indel-only records, mixed-type records are excluded. See this GitHub ticket for the status of a feature request to apply VariantFiltration directly on types of variants.

To avoid the loss of mixed-type variants, break up the multiallelic records into biallelic records before proceeding with the following subsetting. Alternatively, to process mixed-type variants with indel filtering thresholds similar to VQSR, add -select-type MIXED to the second command [B].
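
One way to do that split, as a sketch (GATK4's LeftAlignAndTrimVariants provides a --split-multi-allelics option; argument spellings may differ between versions, so check the tool doc):

gatk LeftAlignAndTrimVariants \
    -R Homo_sapiens_assembly38.fasta \
    -V cohort.vcf.gz \
    --split-multi-allelics \
    -O cohort_biallelic.vcf.gz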

[A] Subset to SNPs-only callset with SelectVariants

gatk SelectVariants \
-V cohort.vcf.gz \
-select-type SNP \
-O snps.vcf.gz

This produces a VCF with records with SNP-type variants only.

[B] Subset to indels-only callset with SelectVariants

gatk SelectVariants \
-V cohort.vcf.gz \
-select-type INDEL \
-O indels.vcf.gz

This produces a VCF with records with indel-type variants only.

[C] Hard-filter SNPs on multiple expressions using VariantFiltration
The GATK does not recommend use of compound filtering expressions, e.g. the logical || "OR". For such expressions, if a record is null for or missing a particular annotation in the expression, the tool negates the entire compound expression and so automatically passes the variant record even if it fails on one of the expressions. See this issue ticket for details.

Provide each expression separately with the -filter parameter followed by the --filter-name. The tool evaluates each expression independently. Here we show basic filtering thresholds researchers may find useful to start with.

gatk VariantFiltration \
-V snps.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O snps_filtered.vcf.gz

This produces a VCF with the same variant records now annotated with filter status. Specifically, if a record passes all the filters, it receives a PASS label in the FILTER column. A record that fails a filter receives the filter name in the FILTER column, e.g. SOR3. If a record fails multiple filters, then each failing filter name appears in the FILTER column separated by semi-colons ;, e.g. MQRankSum-12.5;ReadPosRankSum-8.
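
For illustration only (these records are made up), the FILTER column of the hard-filtered output might look like:

#CHROM  POS      ID  REF  ALT  QUAL   FILTER                           INFO
chr1    1012345  .   G    A    803.1  PASS                             .
chr1    1013456  .   T    C    45.2   SOR3                             .
chr1    1014567  .   A    G    31.7   MQRankSum-12.5;ReadPosRankSum-8  .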

[D] Similarly, hard-filter indels on multiple expressions using VariantFiltration

gatk VariantFiltration \
-V indels.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
-O indels_filtered.vcf.gz

This produces a VCF with the same variant records annotated with filter names for failing records. At this point, consider merging the separate callsets together. Select comments follow.

  • RankSum annotations can only be calculated for REF/ALT heterozygous sites and therefore will be absent from records that do not present read counts towards heterozygous genotypes.
  • By default, GATK HaplotypeCaller and GenotypeGVCFs do not emit variants with QUAL < 10. The --standard-min-confidence-threshold-for-calling (-stand-call-conf) parameter adjusts this threshold. GATK recommends filtering variants with QUAL less than 30. The lower default QUAL threshold of the callers allows for more negative training data in VQSR filtering.
  • When providing filtering thresholds, the tool expects the value to match the type specified in the ##INFO lines of the callset. For example, an Integer type is a whole number without decimals, e.g. 0, and a Float type is a number with decimals, e.g. 0.0. If the expected type mismatches, the tool will give a java.lang.NumberFormatException error.
  • If a filter expression is misspelled, the tool does not give a warning, so be sure to carefully review filter expressions for correctness.


back to top


3. Evaluate the filtered callset

Filtering is about balancing sensitivity and precision for research aims. For example, genome-wide association studies can afford to maximize sensitivity over precision such that there are more false positives in the callset. Conversely, downstream analyses that require high precision, e.g. those that cannot tolerate false positive calls because validating variants is expensive, maximize precision over sensitivity such that the callset loses true positives.

Two tools enable site-level evaluation--CollectVariantCallingMetrics and VariantEval. Another tool, GenotypeConcordance, measures sample-level genotype concordance and is not covered here. For an overview of all three tools, see Article#6308.

Compare callset against a known population callset using CollectVariantCallingMetrics

gatk CollectVariantCallingMetrics \
-I filtered.vcf.gz \
--DBSNP Homo_sapiens_assembly38.dbsnp138.vcf \
-SD Homo_sapiens_assembly38.dict \
-O metrics 

This produces detailed and summary metrics report files. The summary metrics provide cohort-level variant metrics, while the detailed metrics segment variant metrics for each sample in the callset and add several per-sample metrics on top of those in the summary. These are explained in detail at https://broadinstitute.github.io/picard/picard-metric-definitions.html.

Compare callset against a known population callset using VariantEval
As of this writing, VariantEval is in beta status in GATK v4.1, so we provide an example GATK3 command, where the tool is in production status. GATK3 Dockers are available at https://hub.docker.com/r/broadinstitute/gatk3.

java -jar gatk3.jar \
-T VariantEval \
-R Homo_sapiens_assembly38.fasta \
-eval cohort.vcf.gz \
-D Homo_sapiens_assembly38.dbsnp138.vcf \
-noEV \
-EV CompOverlap -EV IndelSummary -EV TiTvVariantEvaluator \
-EV CountVariants -EV MultiallelicSummary \
-o cohortEval.txt

This produces a file containing a table for each of the evaluation modules, e.g. CompOverlap.

Please note the GA4GH (Global Alliance for Genomics and Health) recommends using hap.py for stratified variant evaluations (1, 2). One approach using hap.py wraps the vcfeval module of RTG-Tools. The module accounts for differences in variant representation by matching variants mapped back to the reference.


back to top

BQSR with MuTect2: use it or not ?


Hello,

I've been reading some threads on the forum about BQSR with MuTect2. I know it has been proposed as part of the Best Practices. However, there were a lot of mixed comments and I can't find a clear conclusion on whether to use BQSR with MuTect2, since MuTect2 takes the base quality scores into consideration, and that's what BQSR adjusts. I am working on 18 human matched normal and tumor samples. These samples have been exome-sequenced. I am using MuTect2 from the GATK 3.7 stable version. I generated results using the proposed pipeline here. I used the following inputs:

  • Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
  • dbsnp_138.hg19.vcf
  • hg19_ref_genome.fa

Following this thread here for example, I am worried that potential true variants could be altered due to recalibration.

I also have another doubt from the BQSR thread: I just want to make sure that BQSR does NOT change the base of the variant itself, but just assigns a lower base quality score if it gets recalibrated.

I have analyzed commands run by The Cancer Genome Atlas and they actually use BQSR in their workflow. So finally, I would like to know: is it safe to use BQSR with MuTect2? Is it better to have multiple dbSNPs to avoid mismatches of potential variants (for example, I have downloaded all known human SNPs from NCBI, a ~57 GB VCF file)?

Thank you in advance !

PICARD MarkDuplicates errors near the end of its process: tmp does not exist


Hi, I have a problem in that Picard MarkDuplicates appears to error near the end of its process, with a temp-file-not-found error.

This is running in a GATK pipeline on our cluster for WGS Best Practices.

Error is;
Exception in thread "main" java.lang.IllegalStateException: Non-zero numRecords but /tmp/rwillia/CSPI.7224120933399878689.tmp/2.tmp does not exist

We get the same error with picard 2.9.2 and picard 2.0.1. The java being used is:
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

We have tried multiple different BAM files as tests and get the same error.

Has anybody else seen this "tmp does not exist" error before? Was there a fix that worked? I cannot see this error previously reported.
I can run MarkDuplicates using a standalone Qsub tester script and I get the same error.

I made the directory /tmp/rwillia on the scratch drive but it did not help.
Thanks for any help,
Cheers,
Roy Williams

Test qsub script

#PBS -l walltime=99:00:00
#PBS -l nodes=1:ppn=8:memory=29gb

export TMPDIR=/scratch/rwillia
module load samtools
module load picard
java -Xmx26g -jar /opt/applications/picard/2.1.0/bin/picard.jar MarkDuplicates \
I=/mnt/loring/data/OMICS_PIPE_DATA/ANALYSIS/DNAseq/RW_WGS/BWA_RESULTS/REACH000450/REACH000450_sorted.rg.bam \
O=/mnt/loring/data/OMICS_PIPE_DATA/ANALYSIS/DNAseq/RW_WGS/BWA_RESULTS/REACH000450/REACH000450_sorted.rg.md.bam \
M=/mnt/loring/data/OMICS_PIPE_DATA/ANALYSIS/DNAseq/RW_WGS/BWA_RESULTS/REACH000450/REACH000450_sorted.rg.md.metrics.txt \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=LENIENT

picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 14.35 minutes.
Runtime.totalMemory()=21026570240
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: Non-zero numRecords but /tmp/rwillia/CSPI.7224120933399878689.tmp/2.tmp does not exist
at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:141)
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:388)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:185)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Roy Williams, Ph.D.
Bioinformatics Director,
The Center for Regenerative Medicine,
Scripps Research Institute,
10550 North Torrey Pines Road
San Diego, California, 92121
USA

When should I use -L to pass in a list of intervals?


The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: intervals are not required but they can help speed up analysis
- Whole exome analysis: you must provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you must provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.

Example Use of -L:

  • -L 20 for chromosome 20 in the b36/b37 builds

  • -L chr20:1-100 for chromosome 20 positions 1-100 in hg18/hg19 build

  • -L intervals.list (or intervals.interval_list, or intervals.bed) where the value passed to the argument is a text file containing intervals (an example of the .list format is shown after this list)

  • -L some_variant_calls.vcf where the value passed to the argument is a VCF file containing variant records; their genomic coordinates will be used as intervals.
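
For example, a .list file is plain text with one interval per line, given either as a whole contig or as contig:start-stop coordinates (the intervals below are made up for illustration):

20
20:1000000-2000000
X:5000000-5100000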

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

- For example, HLA-A*01:01:01:01 is a new contig in GRCh38. Colons in contig names are a new feature of GRCh38 compared to prior assemblies. This has implications for using the -L option of GATK, as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
- When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
- However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

-L HLA-A*01:01:01:01:1+

So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use an intervals list in whole genome analysis -- presumably you're interested in the whole genome!

However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere) where you know the data is not reliable or is very messy, causing excessive slowdowns. You can do this by providing a list of "good" intervals with -L, or you could also provide a list of "bad" intervals with -XL, which does the exact opposite of -L: it excludes the provided intervals. We share the whole-genome interval lists (of good intervals) that we use in our production pipelines, in our resource bundle (see Download page).

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
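
For example, here is a GATK3-style sketch of adding 100 bp of padding at the calling step (the reference, BAM, target list, and output names are placeholders; use the same padding value consistently at every step that takes -L):

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -L exome_targets.interval_list \
    -ip 100 \
    -o sample_raw_variants.vcf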

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

  • BaseRecalibrator: YES. This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.
  • PrintReads: NO. Output is a bam file; using -L would lead to lost data.
  • UnifiedGenotyper/HaplotypeCaller: YES. We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.
  • Next steps: NO. No need, since subsequent steps operate on the callset, which was restricted to the exome at the calling step.

Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

Debugging / troubleshooting

You can use -L a lot while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval. This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).
