Here it is at last… as in, last release for 2016, and possibly the last point release of GATK 3 ever!
Aside from the usual pile of bug fixes, (almost) all of the new features in this version were originally developed for GATK 4. We backported them to the GATK 3 framework to make them widely available sooner rather than later, since we still have some work to do before GATK 4 is complete enough to become the new standard. Of course there are a lot of other new things in the GATK 4 alpha that we can't backport (especially the speed/performance improvements) because they depend on the new framework. But what we could backport, we did.
The hottest change here is a new model for calculating the QUAL score, but be aware it's there on an opt-in basis, not enabled by default. This also comes with a lower default value for the -stand_call_conf threshold, and deprecation of the confusing and ultimately rather pointless -stand_emit_conf threshold. We're also introducing some logic for prioritizing alleles to improve performance in messy regions. And we've got some improvements for MuTect2, although that tool does remain in beta status for now.
As usual, see the release notes for a full list of changes, and read on below for details on what we think you'll care most about.
A certain je ne sais QUAL ...
There's been a recent infusion of new blood in the development team -- meaning new members, to be clear (GATK development does not actually involve any demonic rituals, despite any rumors you might have heard). With that came a renewal of ideas on how to calculate key variant metrics, including QUAL.
As we explain in loving detail in this document, the QUAL score is the Phred-scaled posterior probability that all samples in your callset are homozygous reference. In other words, it represents the probability that all the variant evidence you saw in your data is wrong. It's not a very reliable way of ranking variant quality as such, because it's vulnerable to artifacts like inflation at high read depth -- but it does allow us to rule out the majority of glaringly false calls at the low end of the scale.
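In concrete terms, the Phred scaling means

    QUAL = -10 * log10( P(all samples are hom-ref) )

so a QUAL of 30 corresponds to a 1 in 1,000 chance that the site really is homozygous reference in everyone, and a QUAL of 10 to a 1 in 10 chance.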
The current model for calculating QUAL has some flaws that manifest, among other things, as a tendency to excessively penalize singletons and doubletons (variants observed only in one or two samples in a cohort), especially at large cohort sizes. It also uses different and needlessly complicated logic for dealing with haploid, diploid and polyploid cases, leading to "amusing" inconsistencies. No one likes that.
So then some magic happened and now we have a new model that is simpler and behaves better, according to our tests. We're using it in our own work already, but we're not switching it on by default in the public release because it is a pretty big change. Instead, it's available as an opt-in feature that you can enable by setting -newQual in any command line invoking a germline caller (HaplotypeCaller, GenotypeGVCFs or UnifiedGenotyper if you really must use it).
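For instance, a typical GATK 3.7 command line with the new model switched on might look like this (the reference and file names here are just placeholders):

    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R reference.fasta \
        -I sample.bam \
        -newQual \
        -o output.vcf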
Assuming it continues to behave well in our hands and yours (those of you who switch it on), it will be the default model in GATK 4. And then it will get documented in loving detail too, of course.
One threshold for calling
One of the last steps in the germline short variant calling process is the calculation of the QUAL score for each candidate variant. Once that's done, a threshold is applied to the QUAL score and we discard any variants that scored lower than the given threshold value. When you're running HaplotypeCaller in "GVCF mode", i.e. with -ERC GVCF or -ERC BP_RESOLUTION, that threshold is set to zero and every record is written to the output file. In every other case, the threshold is set by the -stand_call_conf argument, which stands for "standard calling confidence".
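So in a GVCF-mode invocation like the following (file names are placeholders), no QUAL thresholding happens at all and every record goes into the output:

    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R reference.fasta \
        -I sample.bam \
        -ERC GVCF \
        -o sample.g.vcf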
That sounds perfectly reasonable, doesn't it? Well, in practice that's not exactly how it works -- or has worked so far, anyway. No, we looove to provide sooo many additional options, so we just had to use two arguments to do the QUAL thresholding: one called -stand_call_conf, and a second one called -stand_emit_conf. The first one works as advertised; the second was meant to allow candidate variants with a lower QUAL score to be written to the output, BUT with a "LowQual" tag of shame in the FILTER field. It was supposed to provide a sort of filtering preview.
Frankly, for the past few years we've used the same value for both, which effectively cancels out the -stand_emit_conf functionality, and generally speaking the presence of that argument has been sowing confusion since the dawn of time. In my view it's at odds with our philosophy of "call everything that moves then filter things out properly with other annotations". So we're killing it. It's gone.
So to summarize, -stand_call_conf is the last threshold left standing for variant calling. Oh, and while we were at it we lowered the default value from 30 to 10. This is more generous than it needs to be, but you can always filter whatever's in the output that you don't want. Whereas you can't easily go back and relax the threshold if it was higher than you wanted -- or if you hadn't even realized it was a thing you could do to increase sensitivity.
Meaningful nod to people who do variant caller comparisons...
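If you want a different tradeoff, you can of course set the threshold explicitly; for example, a joint genotyping run with the new default spelled out might look like this (file names are placeholders):

    java -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R reference.fasta \
        -V sample1.g.vcf \
        -V sample2.g.vcf \
        -stand_call_conf 10 \
        -o cohort.vcf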
Culling the genotype herd to avoid a stampede
In some regions where the sequence context is very repetitive, we tend to find many candidate alleles for the same position, even within a single sample. When that happens, and especially if the requested ploidy is high (e.g. in pooled experiments), the number of possible genotypes that we have to evaluate (i.e. calculate likelihoods for) becomes downright astronomical. Depending on conditions, the consequences can range from unacceptably long runtimes to a complete crash.
We previously tried to solve this by providing an argument called -maxAltAlleles to limit the number of alleles the caller has to consider, but the way it was wired up only limited the alleles that were output, not those considered internally. So it only solved superficial problems, and it didn't account for ploidy directly.
Now we're trying a new approach that involves setting a limit on the number of genotypes that we're willing to consider, instead of a number of alleles. Under the hood, this ties into some logic that drops alt alleles that are really unlikely until we get to a number of possible genotypes that we deem acceptable. The default value is set to be comfortable enough that this only kicks in at complex sites when ploidy is high, but it can be modified with the -maxGenotypeCnt argument to be more or less generous.
Note that -maxAltAlleles is still applicable, but the current implementation is set to give precedence to -maxGenotypeCnt. So at sites where sample ploidy and -maxAltAlleles combine to give a genotype count higher than the value in -maxGenotypeCnt, the -maxAltAlleles limit will be ignored, and alternate alleles will be removed based on ploidy and -maxGenotypeCnt.
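As a rough sketch, a pooled-sample invocation that sets both limits explicitly might look like this (file names and values are placeholders, not recommendations):

    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R reference.fasta \
        -I pool.bam \
        -ploidy 18 \
        -maxAltAlleles 6 \
        -maxGenotypeCnt 1024 \
        -o pool.vcf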
If you want to tweak these settings, keep in mind their interactions and the rule of precedence, otherwise you might run into surprises (and not necessarily of the surprise birthday cake at your team meeting kind) that we will NOT consider to be bugs. For example, let's say you provide -maxAltAlleles with a high value, leave -maxGenotypeCnt at its default, and work with a high-ploidy sample. Due to the newly imposed maximum genotype count, the alt alleles actually used in genotyping will be limited to far fewer than the maximum you requested. With ploidy 18 and the maximum genotype count at 1024 (the current default value, which is arbitrary but definitely reasonable in most cases), the maximum allele count is 3 (i.e. 2 alt alleles), potentially much lower than the -maxAltAlleles you requested.
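To see where that limit of 3 comes from: the number of distinct unordered genotypes you can build from A alleles at ploidy P is the multiset coefficient C(P + A - 1, P). At ploidy 18:

    A = 3 alleles: C(20, 18) =  190 genotypes  (within the 1024 cap)
    A = 4 alleles: C(21, 18) = 1330 genotypes  (over the cap)

so the caller keeps at most 3 alleles total, i.e. the reference plus 2 alts.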
MuTect2 starting to see the light at the end of the beta tunnel
MuTect2 is a next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller. It was first made publicly available in GATK version 3.5 as a beta tool earmarked for experimental work only (no production or commercial work). In evaluations performed with our colleagues in the Broad's Cancer Genome Analysis group (CGA), we found that MuTect2 was doing a great job on indels, but its sensitivity on SNVs was slightly inferior to the original MuTect on which it was based.
Due to a shift in priorities we then had to put MuTect2 development on hiatus, which is why it stayed virtually unchanged in GATK 3.6. But I'm happy to report that MuTect2 is now back in active development! In this release, we have a small but appreciable crop of improvements to MuTect2, which will be the last ever made in a 3.x version.
The improvements made in this version mainly have to do with cleaning up the code and simplifying parts where hybridization with the HaplotypeCaller machinery got a bit Frankenstein-y. As part of that, we're now exposing a couple of downsampling-related arguments that were previously hardcoded, so -maxReadsInRegionPerSample and -minReadsPerAlignment can now be set from the command line (see the example below). We've also added back a few components that were in the original MuTect but weren't ported in the move to GATK, including the clustered read position and strand bias filters. Work is still ongoing to determine exactly how best to leverage these components.
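For example, a tumor-normal MuTect2 run that overrides both of the newly exposed settings might look like this (file names and values are placeholders):

    java -jar GenomeAnalysisTK.jar \
        -T MuTect2 \
        -R reference.fasta \
        -I:tumor tumor.bam \
        -I:normal normal.bam \
        -maxReadsInRegionPerSample 1000 \
        -minReadsPerAlignment 5 \
        -o somatic_calls.vcf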
To be clear, despite these improvements we're still keeping MuTect2 in beta status pending full satisfaction from our CGA friends -- so that does mean that the eventual fully-supported version of MuTect2 will be released in GATK 4 only. We'll post a roadmap/expected timeline of GATK4 and MuTect2 development in the coming weeks.