Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.
Overview
In a nutshell, the MuTect analysis consists of three steps:
- Pre-processing the aligned reads in the tumor and normal sequencing data
- Statistical analysis to identify sites that are likely to carry somatic mutations with high confidence
- Post-processing of candidate somatic mutations
This document summarizes the key points of these three steps. For complete details, please see the 2013 publication in Nature Biotechnology:
1. Pre-processing the aligned reads in the tumor and normal sequencing data
In this step we ignore reads with too many mismatches or very low quality scores, since such reads contribute more noise than signal.
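As a rough illustration of this kind of read filter, the sketch below drops reads based on their mismatch count and mean base quality. The `Read` type, its field names and both cutoffs are hypothetical and chosen only for the example; they are not MuTect's actual parameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    name: str
    mismatches: int             # mismatches against the reference
    base_qualities: List[int]   # Phred-scaled base qualities


# Hypothetical cutoffs, for illustration only.
MAX_MISMATCHES = 3
MIN_MEAN_QUALITY = 20.0


def keep_read(read: Read) -> bool:
    """Return True if the read passes both illustrative filters."""
    if not read.base_qualities:
        return False
    mean_quality = sum(read.base_qualities) / len(read.base_qualities)
    return read.mismatches <= MAX_MISMATCHES and mean_quality >= MIN_MEAN_QUALITY


def filter_reads(reads: List[Read]) -> List[Read]:
    """Keep only the reads that pass keep_read, preserving order."""
    return [r for r in reads if keep_read(r)]
```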
2. Statistical analysis to identify sites that are likely to carry somatic mutations with high confidence
The statistical analysis predicts a somatic mutation by using two Bayesian classifiers. The first aims to detect whether the tumor is non-reference at a given site. For sites found to be non-reference, the second classifier makes sure the normal does not carry the variant allele. In practice the classification is performed by calculating a LOD score (log odds) and comparing it to a cutoff determined by the log ratio of prior probabilities of the considered events.
For the tumor we calculate:
$$ LOD_T = log_{10} \left ( \frac{ P( \text{observed data in tumor | site is mutated} ) } { P( \text{observed data in tumor | site is reference} ) } \right ) $$
And for the normal:
$$ LOD_N = log_{10} \left ( \frac{ P( \text{observed data in normal | site is reference} ) } { P( \text{observed data in normal | site is mutated} ) } \right ) $$
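To make the two scores concrete, here is a minimal sketch of how they could be computed from a pileup. It assumes a simple per-base error model (each base is read incorrectly with probability `e`, split evenly across the three other bases), takes "mutated" in the tumor to mean the observed alternate allele fraction, and takes "mutated" in the normal to mean a heterozygous germline variant (allele fraction 0.5). These modeling choices and all function names are assumptions made for illustration; see the publication for the model MuTect actually uses.

```python
import math
from typing import List, Tuple

# One observation: (base, error_probability), base is "ref", "alt" or "other".
Obs = Tuple[str, float]


def log10_likelihood(pileup: List[Obs], f: float) -> float:
    """log10 P(pileup | alt allele fraction f) under the assumed error model."""
    total = 0.0
    for base, e in pileup:
        if not 0.0 < e < 1.0:
            raise ValueError("error probability must be strictly between 0 and 1")
        if base == "ref":
            p = f * e / 3 + (1 - f) * (1 - e)
        elif base == "alt":
            p = f * (1 - e) + (1 - f) * e / 3
        else:
            p = e / 3
        total += math.log10(p)
    return total


def lod_tumor(pileup: List[Obs]) -> float:
    """LOD_T: mutated (at the observed alt fraction) versus reference."""
    n_alt = sum(1 for base, _ in pileup if base == "alt")
    f_hat = n_alt / len(pileup) if pileup else 0.0
    return log10_likelihood(pileup, f_hat) - log10_likelihood(pileup, 0.0)


def lod_normal(pileup: List[Obs]) -> float:
    """LOD_N: reference versus heterozygous germline variant (f = 0.5)."""
    return log10_likelihood(pileup, 0.0) - log10_likelihood(pileup, 0.5)
```

Working in log space and summing per-read terms avoids numerical underflow at high coverage.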
Since we expect somatic mutations to occur at a rate of ~1 per Mb, we require
$$ LOD_T > -log_{10} (0.5 \times 10^{-6} ) \approx 6.3 $$
which guarantees that our false positive rate, due to noise in the tumor, is less than half of the somatic mutation rate.
In the normal, for sites that are not in dbSNP, we require
$$ LOD_N > -log_{10} (0.5 \times 10^{-2} ) \approx 2.3 $$
since non-dbSNP germline variants occur roughly at a rate of 100 per Mb. This cutoff guarantees that the false positive somatic call rate, due to missing the variant in the normal, is also less than half the somatic mutation rate.
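The arithmetic behind the two cutoffs can be checked with a few lines of Python. The values 6.3 and 2.3 are the negative base-10 logarithms of half the somatic mutation rate and of half the ratio between the somatic and novel germline rates, respectively. The constant and function names below are illustrative, and the helper covers only the non-dbSNP case described in the text.

```python
import math

SOMATIC_RATE = 1e-6          # ~1 somatic mutation per Mb
NOVEL_GERMLINE_RATE = 1e-4   # ~100 non-dbSNP germline variants per Mb

# Tumor cutoff: false positives from tumor noise stay below half the somatic rate.
TUMOR_LOD_CUTOFF = -math.log10(0.5 * SOMATIC_RATE)

# Normal cutoff (non-dbSNP sites): missed germline variants stay below
# half the somatic rate, i.e. 0.5 * 1e-6 / 1e-4 = 0.5e-2.
NORMAL_LOD_CUTOFF = -math.log10(0.5 * SOMATIC_RATE / NOVEL_GERMLINE_RATE)


def passes_lod_cutoffs(lod_t: float, lod_n: float) -> bool:
    """Apply both cutoffs to a site that is not in dbSNP."""
    return lod_t > TUMOR_LOD_CUTOFF and lod_n > NORMAL_LOD_CUTOFF


print(round(TUMOR_LOD_CUTOFF, 1), round(NORMAL_LOD_CUTOFF, 1))
```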
3. Post-processing of candidate somatic mutations
This step aims to eliminate artifacts of next-generation sequencing, short read alignment and hybrid capture. For example, sequence context can produce spurious alternate alleles, but often only in reads of a single direction. Therefore, we test that the alternate alleles supporting a mutation are observed in both read directions.
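The idea of requiring support in both directions can be sketched as a simple count-based check. This is a deliberate simplification for illustration; the function name, the input representation and the `min_alt_per_strand` parameter are hypothetical, and the publication describes the test MuTect actually applies.

```python
from typing import Iterable, Tuple


def alt_seen_on_both_strands(
    observations: Iterable[Tuple[str, bool]],
    min_alt_per_strand: int = 1,
) -> bool:
    """Check that alt-supporting reads occur in both read directions.

    Each observation is (base, is_reverse_strand), base being "ref" or "alt".
    """
    forward = 0
    reverse = 0
    for base, is_reverse in observations:
        if base != "alt":
            continue
        if is_reverse:
            reverse += 1
        else:
            forward += 1
    return forward >= min_alt_per_strand and reverse >= min_alt_per_strand
```

A candidate whose alternate alleles all come from forward-strand reads would fail this check, which matches the artifact pattern described above.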
Note on method validation
Most cancer genome studies at the Broad Institute have made use of MuTect and have validated the mutation calls as part of their cancer biology papers, showing that MuTect has a very low false positive rate. A summary of validation rates from these papers is shown below: