Hi!
I am trying to use MuTect2 on RNA-seq data to detect somatic mutations.
I have a few questions about the formulas used by the program. I went through the original MuTect publication (Cibulskis et al. 2013) and found this part about dbSNP:
"There are ~30×10^6 sites known to be variant in the human population according to dbSNP release 134, which is ~1000 variants/megabase. A given individual typically has ~3×10^6 variants in their genome, 95% of which fall on dbSNP sites. Therefore we expect ~50 variants/mb not at dbSNP sites, i.e. P(germline| non-dbSNP site) = 5×10^−5 and therefore we use θN|non-dbSNP site = 2.2. At dbSNP sites, however, we expect 95% of the ~3×10^6 variants to occur in the 30×10^6 sites in the dbSNP database, yielding P(germline| dbSNP site) = 0.095 hence θN|dbSNP site = 5.5."
But the latest dbSNP release (147) now contains about 150 million variants (5 times more). So I think this changes the probabilities quite a lot, no? I was wondering whether the values mentioned above were updated in MuTect2 or in newer versions. Or does the program adapt to the dbSNP database by counting the number of variants in the dbSNP file to estimate this probability?
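
To make my concern concrete, here is a rough sketch of the arithmetic as I understand it from the quoted paragraph. It is only my own back-of-the-envelope calculation, not MuTect code. The genome size of 3×10^9 bases is my assumption (it is not stated in the quote), and so is the idea that 95% of an individual's variants would still fall on dbSNP sites with a larger database. The function name `germline_priors` is made up for this post.

```python
# Back-of-the-envelope priors, following the reasoning in the quoted paragraph.
GENOME_SIZE = 3e9               # ASSUMPTION: approximate genome length in bases
VARIANTS_PER_INDIVIDUAL = 3e6   # from the quote
FRACTION_AT_DBSNP = 0.95        # from the quote; ASSUMED unchanged for a larger dbSNP


def germline_priors(n_dbsnp_sites):
    """Return (P(germline | dbSNP site), P(germline | non-dbSNP site))."""
    variants_at_dbsnp = FRACTION_AT_DBSNP * VARIANTS_PER_INDIVIDUAL
    variants_elsewhere = VARIANTS_PER_INDIVIDUAL - variants_at_dbsnp
    p_dbsnp = variants_at_dbsnp / n_dbsnp_sites
    p_non_dbsnp = variants_elsewhere / (GENOME_SIZE - n_dbsnp_sites)
    return p_dbsnp, p_non_dbsnp


for label, n_sites in [("release 134, ~30 million sites", 30e6),
                       ("release 147, ~150 million sites", 150e6)]:
    p_dbsnp, p_non_dbsnp = germline_priors(n_sites)
    print(f"{label}: P(germline|dbSNP) = {p_dbsnp:.3g}, "
          f"P(germline|non-dbSNP) = {p_non_dbsnp:.3g}")
```

With 30 million sites this gives about 0.095 and about 5×10^−5, matching the paper. With 150 million sites, under the same assumptions, P(germline | dbSNP site) drops to about 0.019, while the non-dbSNP value barely moves. That is why I suspect the dbSNP threshold of 5.5 may no longer correspond to the current database.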
I have another question: is there any publication for MuTect2 that explains the rationale behind the indel detection?
Thanks a lot!
Alexandre Coudray