Channel: Recent Discussions — GATK-Forum

How to diagnose missing MQRankSum annotations (when BaseQRankSum is available)

We want to discover short variants in a cohort of 60 plant whole-genome samples, but we are blocked at the VariantRecalibrator step.

We have a VCF truth set (a.k.a. resource) of SNPs that was computed beforehand and hard-filtered, and a raw VCF for the 60 samples under study. The raw VCF was joint-called over the whole genome with HaplotypeCaller (GVCF mode) + GenomicsDBImport + GenotypeGVCFs. We computed a sites-only version of that VCF and fed it to VariantRecalibrator. We configured both HaplotypeCaller (-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation) and GenotypeGVCFs (-G StandardAnnotation -G AS_StandardAnnotation) to produce allele-specific annotations.

We've configured VariantRecalibrator to build its SNP model from the following set of annotations: -an AS_QD -an MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_MQ -an AS_SOR -an DP. This choice was based on the allele-specific annotation and filtering article (reference #6).

Unfortunately, both the AS_MQRankSum and MQRankSum annotations have zero variance over our data, which prevents the model from being built. Dropping the annotation is one option, but as far as we know it is ill-advised (see reference #1).

How do we diagnose this?

/gatk/gatk VariantRecalibrator --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx183296m' --tmp-dir <redacted> <redacted_list_of_input_vcf_files> --resource:GOLD,known=false,training=true,truth=true,prior=10.0 <redacted>/gold.snps.vcf.gz --mode SNP -an AS_QD -an MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_MQ -an AS_SOR -an DP --trust-all-polymorphic --truth-sensitivity-tranche 100.0 --truth-sensitivity-tranche 99.0 --truth-sensitivity-tranche 90.0 --truth-sensitivity-tranche 70.0 --truth-sensitivity-tranche 50.0 --max-gaussians 6 --rscript-file <redacted>  --tranches-file <redacted> -AS --output <redacted>/snp.recal.vcf.gz
22:16:29.941 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
...
A USER ERROR has occurred: Bad input: Found annotations with zero variance. They must be excluded before proceeding.
...
19:52:31.846 INFO  ProgressMeter - Traversal complete. Processed 2786933 total variants in 0.8 minutes.
19:52:31.959 INFO  VariantDataManager - AS_QD:   mean = 17.44    standard deviation = 8.75
19:52:32.173 INFO  VariantDataManager - MQRankSum:       mean = 0.00     standard deviation = 0.00
19:52:32.487 INFO  VariantDataManager - AS_ReadPosRankSum:       mean = 0.01     standard deviation = 0.84
19:52:32.797 INFO  VariantDataManager - AS_FS:   mean = 1.79     standard deviation = 3.03
19:52:32.962 INFO  VariantDataManager - AS_MQ:   mean = 60.00    standard deviation = 0.12
19:52:33.139 INFO  VariantDataManager - AS_SOR:          mean = 0.68     standard deviation = 0.25
19:52:33.335 INFO  VariantDataManager - DP:      mean = 1323.99  standard deviation = 945.30

(We initially tried AS_MQRankSum instead of MQRankSum, but it also showed zero variance.)
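
To spot the offending annotations without reading the log by eye, the VariantDataManager summary lines can be scanned programmatically. This is a minimal Python sketch, assuming the log format shown above; the function name `zero_variance_annotations` and the regular expression are our own illustration, not part of GATK.

```python
import re

# Matches summary lines such as:
#   "... VariantDataManager - MQRankSum:   mean = 0.00   standard deviation = 0.00"
# The pattern is an assumption based on the log excerpt in this post.
LINE_RE = re.compile(
    r"VariantDataManager - (?P<name>[^\s:]+):\s+mean = (?P<mean>-?[\d.]+)\s+"
    r"standard deviation = (?P<sd>-?[\d.]+)"
)

def zero_variance_annotations(log_text, tol=0.0):
    """Return the annotation names whose reported standard deviation is <= tol."""
    flagged = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and float(match.group("sd")) <= tol:
            flagged.append(match.group("name"))
    return flagged

# Summary lines copied from the VariantRecalibrator output above.
log = """\
19:52:31.959 INFO  VariantDataManager - AS_QD:   mean = 17.44    standard deviation = 8.75
19:52:32.173 INFO  VariantDataManager - MQRankSum:       mean = 0.00     standard deviation = 0.00
19:52:32.487 INFO  VariantDataManager - AS_ReadPosRankSum:       mean = 0.01     standard deviation = 0.84
19:52:32.797 INFO  VariantDataManager - AS_FS:   mean = 1.79     standard deviation = 3.03
19:52:32.962 INFO  VariantDataManager - AS_MQ:   mean = 60.00    standard deviation = 0.12
19:52:33.139 INFO  VariantDataManager - AS_SOR:          mean = 0.68     standard deviation = 0.25
19:52:33.335 INFO  VariantDataManager - DP:      mean = 1323.99  standard deviation = 945.30
"""

print(zero_variance_annotations(log))  # ['MQRankSum']
```

Raising `tol` slightly (for example to 0.2) would also surface near-constant annotations such as AS_MQ in this log.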

Question 1: As we understand it, MQRankSum can only be computed at sites that are heterozygous reference (see reference #2). Generally speaking, and not just for our particular dataset, do we need a good representation of such sites both in the truth "resource" sets and in the input raw variant VCFs, or only in the input VCFs?

Question 2: We've confirmed that our data (both the truth set and the input set) does have many het-ref sites with good read support for all alleles. One piece of evidence for this (we think) is that AS_ReadPosRankSum was calculated to have non-zero variance, as shown in the VariantRecalibrator output above. MQRankSum ended up with zero variance, but both it and AS_MQRankSum have the same caveats in the documentation. In which cases can one annotation be calculated but not the other? For example, does this indicate that our mapping qualities are too uniform (in which case the variance would be exactly 0.000)?
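
To illustrate the hypothesis in Question 2, here is a minimal sketch of a textbook rank-sum z-score (Mann-Whitney U with the normal approximation, midranks for ties, no tie correction). It is not GATK's implementation, and the read-level values are made up. It only shows that when every read has the same mapping quality the statistic is exactly 0, while varied read positions at the same site give a non-zero value.

```python
import math

def midranks(values):
    """Rank values 1..N, giving tied values the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def rank_sum_z(alt, ref):
    """z-score of the alt-vs-ref rank sum (normal approximation, no tie correction)."""
    n1, n2 = len(alt), len(ref)
    ranks = midranks(list(alt) + list(ref))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u

# Hypothetical het-ref site: 5 alt-supporting reads and 7 ref-supporting reads.
alt_mq, ref_mq = [60] * 5, [60] * 7                 # every read mapped with MQ 60
alt_pos = [3, 10, 42, 77, 90]                       # made-up positions within reads
ref_pos = [5, 20, 30, 55, 60, 81, 99]

print(rank_sum_z(alt_mq, ref_mq))    # 0.0
print(rank_sum_z(alt_pos, ref_pos))  # non-zero
```

If this mechanism is what is happening in our data, a site could carry an informative ReadPosRankSum and still report MQRankSum=0.00, which would match the AS_MQ summary above (mean 60.00, standard deviation 0.12).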

Question 3: If we dig into the dataset (i.e. the sites-only VCF inputs), we see many sites whose relevant annotations are a mix of "nul" and "0.000". At some sites AS_MQRankSum is present but MQRankSum is not; at others both are present. How should we interpret the different values? Is there anything "wrong" with that?

Ex: biallelic heterozygous-ref site. AS_MQRankSum and AS_ReadPosRankSum are "nul". MQRankSum and ReadPosRankSum are omitted altogether.

HanXRQChr01     16169   .       G       T       114.60  PASS    AC=2;AF=0.250;AN=8;AS_BaseQRankSum=nul;AS_FS=0.000;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=30.82;AS_ReadPosRankSum=nul;AS_SOR=0.693;DP=6;ExcessHet=0.3218;FS=0.000;MLEAC=10;MLEAF=1.00;MQ=60.00;QD=29.27;SOR=0.693

Ex: biallelic het-ref site. ReadPosRankSum is non-zero. AS_MQRankSum is zero.

HanXRQChr01     17137   .       G       A       53.23   PASS AC=1;AF=0.010;AN=96;AS_BaseQRankSum=0.600;AS_FS=0.000;AS_InbreedingCoeff=-0.0476;AS_MQ=60.00;AS_MQRankSum=0.000;AS_QD=8.87;AS_ReadPosRankSum=0.800;AS_SOR=1.179;BaseQRankSum=0.623;DP=243;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.0476;MLEAC=1;MLEAF=0.010;MQ=60.00;MQRankSum=0.00;QD=8.87;ReadPosRankSum=0.842;SOR=1.179

Ex: site where one allele has nuls, and the other one has floats

HanXRQChr01     18154   .       G       C,A     8466.08 PASS    AC=25,1;AF=0.240,9.615e-03;AN=104;AS_BaseQRankSum=-0.550,nul;AS_FS=1.536,2.158;AS_InbreedingCoeff=0.8253,-0.0126;AS_MQ=60.00,60.00;AS_MQRankSum=0.000,nul;AS_QD=29.59,8.09;AS_ReadPosRankSum=0.900,nul;AS_SOR=0.400,0.223;BaseQRankSum=0.494;DP=813;ExcessHet=0.0000;FS=1.538;InbreedingCoeff=0.8833;MLEAC=26,1;MLEAF=0.250,9.615e-03;MQ=60.00;MQRankSum=0.00;QD=32.25;ReadPosRankSum=1.52;SOR=0.391
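
One way to quantify the mix described in Question 3 is to tally, per annotation, how often it is absent, "nul", zero, or non-zero. This is a minimal sketch, assuming whitespace-separated VCF columns with INFO in the eighth column. The helper names `classify` and `tally` are hypothetical, and the records are the three examples above with INFO trimmed to the relevant keys.

```python
from collections import Counter, defaultdict

def classify(value):
    """Label one annotation value as 'nul', 'zero', 'nonzero', or 'other'."""
    if value == "nul":
        return "nul"
    try:
        return "zero" if float(value) == 0.0 else "nonzero"
    except ValueError:
        return "other"  # anything that is neither "nul" nor a number

def tally(vcf_lines, keys):
    """Count, per annotation key, how often it is absent, 'nul', zero, or non-zero."""
    counts = defaultdict(Counter)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        info = line.split()[7]  # INFO is the 8th column
        fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        for key in keys:
            if key not in fields:
                counts[key]["absent"] += 1
                continue
            # AS_* annotations carry one comma-separated value per ALT allele.
            for value in fields[key].split(","):
                counts[key][classify(value)] += 1
    return counts

# The three example records from this post, INFO trimmed to the keys of interest.
records = [
    "HanXRQChr01\t16169\t.\tG\tT\t114.60\tPASS\tAS_MQ=60.00;AS_MQRankSum=nul;AS_ReadPosRankSum=nul;MQ=60.00",
    "HanXRQChr01\t17137\t.\tG\tA\t53.23\tPASS\tAS_MQRankSum=0.000;AS_ReadPosRankSum=0.800;MQRankSum=0.00;ReadPosRankSum=0.842",
    "HanXRQChr01\t18154\t.\tG\tC,A\t8466.08\tPASS\tAS_MQRankSum=0.000,nul;AS_ReadPosRankSum=0.900,nul;MQRankSum=0.00;ReadPosRankSum=1.52",
]
keys = ["MQRankSum", "AS_MQRankSum", "ReadPosRankSum", "AS_ReadPosRankSum"]

result = tally(records, keys)
for key in keys:
    print(key, dict(result[key]))
```

Run over the full sites-only VCF, a tally in which MQRankSum and AS_MQRankSum never land in the "nonzero" bucket would be consistent with the zero-variance error reported by VariantRecalibrator.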

Relevant pages and comments I've found on the subject:
1. "MQRankSum is one of the core annotations that we recommend using, so I would recommend going to the trouble of finding out why it's not working." (https://gatkforums.broadinstitute.org/gatk/discussion/comment/9737/#Comment_9737)
2. "The Rank Sum Tests require at least one individual to be heterozygous and have a mix of ref and alt reads" (https://gatkforums.broadinstitute.org/gatk/discussion/comment/33174/#Comment_33174)
3. AS_ReadPosRankSum annotation documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_AS_ReadPosRankSumTest.php)
4. AS_MQRankSum annotation documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_AS_MappingQualityRankSumTest.php)
5. Hard-filtering recommendations, which discuss how the tests work, in particular MQRankSum and ReadPosRankSum (https://software.broadinstitute.org/gatk/documentation/article.php?id=6925)
6. Allele-specific annotation and filtering article (https://gatkforums.broadinstitute.org/gatk/discussion/9622/allele-specific-annotation-and-filtering/)

