VariantRecalibrator tranche plots have a lot of false positives

Hello!

I am working with data from 122 human whole exomes, captured using SeqCap EZ Prime Exome. My software versions are GATK 3.8.0 and java 1.8.0_131.

After following the Best Practices guidelines, I get tranche plots from VariantRecalibrator that show a high estimated proportion of 'false positives' among my novel variants (the estimate is driven by their low Ti/Tv ratio). I can't find anything this extreme on the forum, and I'm wondering whether I'm doing something wrong in my variant calling.

The command that produced the tranche plots is:

```
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R hg38.fa \
-input SNP.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg38.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode SNP \
-recalFile SNP.recal \
-tranchesFile SNP.tranches \
-rscriptFile SNP.plots.R
```

As you can see in 'all_SNPs.pdf', something like 40% of the novel SNPs are estimated to be false positives. 'more_tranches.pdf' shows that lowering the truth threshold does not resolve this (though it does discard a ton of SNPs).
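
For the 'more_tranches.pdf' run I just asked VariantRecalibrator for extra -tranche cut points and then applied the recalibration at a lower truth sensitivity, roughly like this (the exact cut points here are illustrative):

```
# extra sensitivity cut points, appended to the VariantRecalibrator command above:
# -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 95.0 -tranche 90.0

# apply the recalibration, filtering at a lower truth sensitivity
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T ApplyRecalibration \
-R hg38.fa \
-input SNP.vcf \
-recalFile SNP.recal \
-tranchesFile SNP.tranches \
--ts_filter_level 99.0 \
-mode SNP \
-o SNP.recalibrated.vcf
```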

As an alternative, I did hard filtering based on the distributions of all my annotations, which I examined in R. (They looked fairly normal except for QD, I think because of high depths; see 'QD.png' attached here, and the QUAL-by-DP plots in the thread for Discussion 23514 [sorry, can't post links].)
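
For reference, the annotation values I plotted were pulled out with VariantsToTable, roughly like this (the field list is just the annotations I looked at):

```
# dump per-site annotation values to a tab-delimited table for plotting in R
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T VariantsToTable \
-R hg38.fa \
-V SNP.vcf \
-F CHROM -F POS -F QUAL -F QD -F DP -F MQ -F MQRankSum -F ReadPosRankSum -F FS -F SOR -F InbreedingCoeff \
--allowMissingData \
-o SNP.annotations.table
```

The filtering expression I settled on was: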

```
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R hg38.fa \
--variant SNP.vcf \
-o SNP.FILT.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 55.0 || MQRankSum < -1.0 || ReadPosRankSum < -2.5 || SOR > 2.5 || DP < 500 || InbreedingCoeff < -0.1" \
--filterName "HARDFILTER"
```

I then ran VariantRecalibrator on the hard-filtered variants to see what would happen. Hard filtering reduces the false-positive estimates a little (see 'hard_filtered_SNPs.pdf'), but it does not really solve the problem.
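
Note that VariantFiltration only flags sites in the FILTER column; to make sure only passing sites go into the rerun, the flagged records can be dropped first, along these lines:

```
# keep only records that passed the hard filter (drop HARDFILTER-flagged sites)
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg38.fa \
-V SNP.FILT.vcf \
--excludeFiltered \
-o SNP.PASS.vcf
```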

I ran VariantEval on the filtered variants to get a better idea of what was going on, and found the following:

My data:

Subset                          Ti/Tv
All SNPs                        2.23
SNPs in dbSNP (68% of total)    2.65
Novel SNPs (32% of total)       1.52
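
For completeness, the VariantEval call was roughly the following (dbSNP is what defines the known/novel split):

```
# Ti/Tv and known/novel breakdown for the filtered call set
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T VariantEval \
-R hg38.fa \
--eval SNP.FILT.vcf \
-D dbsnp_138.hg38.vcf.gz \
-L PrimeExome.intervals -ip 100 \
-o SNP.eval.grp
```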

So, it seems like my SNPs that also appear in dbSNP are alright, but the novel ones are not trustworthy.

One obvious option is to just filter out any variant not found in an existing database. This is OK for my purposes, since I'm looking for effects of common variants, but it still gives me pause that my novel variants can't be trusted. Any ideas about what would lead to such a low Ti/Tv in an exome dataset? (Note: I used '-L PrimeExome.intervals -ip 100' at the relevant steps.)
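
If I do go the 'known sites only' route, I'm thinking of something like this (an untested sketch; --concordance keeps only sites also present in the comparison track):

```
# keep only sites that are also present in dbSNP
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg38.fa \
-V SNP.FILT.vcf \
--concordance dbsnp_138.hg38.vcf.gz \
-o SNP.known.vcf
```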

Thanks a lot!
