Channel: Recent Discussions — GATK-Forum

How to get rid of false positives? And troubles with VQSR


For the last couple of months we have been trying to analyze an exome data set of ~1500 samples.
The data comprise two sets: ~1200 control samples sequenced at one site and ~300 case samples sequenced at another facility. Both sets used the same capture kit and have similar coverage.

We pooled all BAM files and followed the Best Practices, performing base recalibration, joint genotyping and variant recalibration on the whole dataset together. In downstream analysis we found a lot of false positives (for example, variants with a frequency of 20% in the controls that never occur in the cases, or vice versa).
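A minimal sketch of how such batch-specific variants could be flagged, assuming carrier counts per batch are available. It uses a two-sided Fisher's exact test, which is only an illustration and not part of the pipeline described here. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are batches (e.g. controls, cases); columns are carriers / non-carriers.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x carriers in the first batch.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    p_obs = prob(a)
    # Sum all tables that are no more likely than the observed one.
    p = sum(q for q in probs if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Illustrative counts only: 240 of 1200 controls carry the variant, 0 of 300 cases do.
p_value = fisher_exact_two_sided(240, 960, 0, 300)
suspicious = p_value < 1e-6  # the threshold is an arbitrary assumption
```

A very small p-value for a variant with no expected biological difference between sites points to a batch artifact rather than a real association.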

We initially thought that the main problem was the difference in sequencing site. To address this, we filtered the two batches separately on DP>8, GQ>20, HWE, averageGQ>35 and callrate>85%, as recommended by: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-125
After that we repeated the VQSR on the whole dataset together. The results improved significantly, but we still had some strange results. When we looked at the BAM files of people carrying these variants, mapping seemed to be the main problem for some of them, so we added the MQCap option to the VQSR, which further improved the results.
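The per-batch thresholds above can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact procedure we ran: genotypes failing DP or GQ are treated as missing, call rate and average GQ are computed on the remaining genotypes, and the HWE test is left out. The function name and the input layout are hypothetical.

```python
def passes_site_filters(genotypes, min_dp=8, min_gq=20,
                        min_mean_gq=35.0, min_call_rate=0.85):
    """Decide whether one site passes the filters within one batch.

    genotypes: list of dicts with 'DP' and 'GQ' per sample
               (None when the genotype is missing).
    """
    if not genotypes:
        return False
    # Keep only genotypes with sufficient depth and genotype quality.
    kept = [g for g in genotypes
            if g["DP"] is not None and g["GQ"] is not None
            and g["DP"] > min_dp and g["GQ"] > min_gq]
    # Assumption: call rate is measured after masking low-quality genotypes.
    call_rate = len(kept) / len(genotypes)
    if call_rate <= min_call_rate:
        return False
    # Assumption: average GQ is taken over the retained genotypes only.
    mean_gq = sum(g["GQ"] for g in kept) / len(kept)
    return mean_gq > min_mean_gq


# Toy batch: nine well-covered genotypes and one low-depth genotype.
batch = [{"DP": 30, "GQ": 60}] * 9 + [{"DP": 3, "GQ": 10}]
site_ok = passes_site_filters(batch)
```

Running the function once per batch and keeping only sites that pass in both batches mirrors the idea of filtering the two sets separately.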

1. Is the way we set up this experiment correct, or is there anything we should change? Are we being too stringent or not stringent enough? We still have false positives, but what is the chance of missing something by filtering before the VQSR?

The next question is about using different versions of dbSNP to perform the VQSR. In our whole pipeline we initially used dbSNP version 144. However, a lot of the variants from the downstream analysis could not be confirmed with Sanger sequencing. When we perform the VQSR with dbSNP 138, most of these false positive variants are no longer in the dataset (except 2 that remain).

2. How is it possible that using a different dbSNP version has such a big impact on the output of the VQSR? And which dbSNP version should we use? (2 variants, and most likely more, still remain that cannot be verified but are associated with our trait in downstream analysis.)
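To make the difference between the two runs concrete, the sites that are PASS in one recalibrated VCF but not in the other can be listed. The sketch below assumes plain-text, tab-separated VCF lines and compares sites by chromosome, position, REF and ALT. The function names and the toy records are hypothetical.

```python
def pass_sites(vcf_lines):
    """Return the set of (chrom, pos, ref, alt) for records whose FILTER is PASS."""
    sites = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        if flt == "PASS":
            sites.add((chrom, int(pos), ref, alt))
    return sites


def lost_between_runs(run_a_lines, run_b_lines):
    """Sites that pass in run A but do not pass in run B, in sorted order."""
    return sorted(pass_sites(run_a_lines) - pass_sites(run_b_lines))


# Toy records standing in for the two ApplyRecalibration outputs.
run_dbsnp144 = [
    "##fileformat=VCFv4.1",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr1\t2000\t.\tC\tT\t50\tPASS\t.",
]
run_dbsnp138 = [
    "##fileformat=VCFv4.1",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr1\t2000\t.\tC\tT\t50\tVQSRTrancheSNP99.90to100.00\t.",
]
only_in_144 = lost_between_runs(run_dbsnp144, run_dbsnp138)
```

For real files the same functions can be fed an open file handle instead of a list, since both are iterated line by line.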

If we look at the HaplotypeCaller bamout files of a person with and a person without one of the variants that could not be confirmed with Sanger sequencing, we see something peculiar. HaplotypeCaller seems to treat this site differently in the individual supposedly carrying the variant (sample 2) than in one that does not carry it (sample 1), even though the coverage in both samples is the same and the region looks very messy in both. We could not confirm this variant in sample 2 with Sanger sequencing.

3. What is going on? And could this explain the false positives we find?

We also have some questions about using different GATK versions. The haplotype calling and joint genotyping were performed with GATK 3.4 by a collaborator, and repeating them would be very cumbersome.
The VQSR was performed with GATK 3.5 on our local cluster and could be repeated with 3.7.

4. How advisable is it to mix versions of GATK? And will the new options in 3.7 still work on g.vcfs that were generated with 3.4?

We are running out of ideas and do not have much experience with this kind of analysis. After searching the forum extensively we did not find answers to our questions, so we hope someone can answer them.
Thank you very much in advance.

Command:

    java -Xmx32G -jar /opt/GATK-3.5/GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R {path}References/hs_ref_GRCh37.p5_all_contigs.fasta \
        -input {path}/Complete_withXY_sorted.vcf \
        -nt 6 \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 {path}References/hapmap_3.3.b37_withchr.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 {path}References/1000G_omni2.5.b37_withchr.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 {path}References/1000G_phase1.snps.high_confidence_withchr.b37.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {path}References/snp144-All.vcf \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
        -MQCap 70 \
        -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.0 -tranche 90.0 \
        -recalFile {path}/Complete_withXY_snps_dbsnp144.recal \
        -tranchesFile {path}/Complete_withXY_snps_dbsnp144.tranches \
        -rscriptFile {path}/Complete_withXY_snps_dbsnp144.R \
        -mode SNP

    # indels
    java -Xmx8G -jar /opt/GATK-3.5/GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R {path}References/hs_ref_GRCh37.p5_all_contigs.fasta \
        -input {path}/Complete_withXY_sorted.vcf \
        -recalFile {path}/Complete_withXY_indels_dbsnp144.recal \
        -tranchesFile {path}/Complete_withXY_indels_dbsnp144.tranches \
        -rscriptFile {path}/Complete_withXY_indels_dbsnp144.R \
        -nt 3 \
        -resource:mills,known=false,training=true,truth=true,prior=12.0 {path}References/Mills_and_1000G_gold_standard.indels.b37.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {path}References/snp144-All.vcf \
        -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
        -MQCap 70 \
        --maxGaussians 4 \
        -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.0 -tranche 90.0 \
        -mode INDEL

    #apply recalibration
    java -jar /opt/GATK-3.5/GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R {path}References/hs_ref_GRCh37.p5_all_contigs.fasta \
        -input {path}/Complete_withXY_sorted.vcf \
        --ts_filter_level 99.9 \
        -recalFile {path}/Complete_withXY_snps_dbsnp144.recal \
        -tranchesFile {path}/Complete_withXY_snps_dbsnp144.tranches \
        -o {path}/Complete_withXY_snps_dbsnp144.vcf \
        -mode SNP

    java -jar /opt/GATK-3.5/GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R {path}References/hs_ref_GRCh37.p5_all_contigs.fasta \
        -input {path}/Complete_withXY_sorted.vcf \
        --ts_filter_level 99.9 \
        -recalFile {path}/Complete_withXY_indels_dbsnp144.recal \
        -tranchesFile {path}/Complete_withXY_indels_dbsnp144.tranches \
        -o {path}/Complete_withXY_indels_dbsnp144.vcf \
        -mode INDEL    
