We wish to discover short variants in a cohort of 60 plant whole-genome samples. We're blocked on VariantRecalibrator.
We have a truth-set VCF (aka resource) of SNPs, which was computed beforehand and hard-filtered, and a raw VCF for the 60 samples under study. This input VCF was joint-called with HaplotypeCaller (GVCF) + GenomicsDBImport + GenotypeGVCFs over the whole genome. We computed a sites-only version of that input VCF and fed it to VariantRecalibrator. We configured HaplotypeCaller to produce allele-specific annotations (-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation), and GenotypeGVCFs as well (-G StandardAnnotation -G AS_StandardAnnotation).
We've configured VariantRecalibrator to build its SNP model from the following annotations: -an AS_QD -an MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_MQ -an AS_SOR -an DP. This choice was based on the allele-specific annotation and filtering article (reference #6).
Unfortunately, both the AS_MQRankSum and MQRankSum annotations have zero variance over our data, which prevents the model from being built. Dropping the annotation is one option, but as far as we know it's ill-advised (see reference #1).
How do we diagnose this?
/gatk/gatk VariantRecalibrator --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx183296m' --tmp-dir <redacted> <redacted_list_of_input_vcf_files> --resource:GOLD,known=false,training=true,truth=true,prior=10.0 <redacted>/gold.snps.vcf.gz --mode SNP -an AS_QD -an MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_MQ -an AS_SOR -an DP --trust-all-polymorphic --truth-sensitivity-tranche 100.0 --truth-sensitivity-tranche 99.0 --truth-sensitivity-tranche 90.0 --truth-sensitivity-tranche 70.0 --truth-sensitivity-tranche 50.0 --max-gaussians 6 --rscript-file <redacted> --tranches-file <redacted> -AS --output <redacted>/snp.recal.vcf.gz
22:16:29.941 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
...
A USER ERROR has occurred: Bad input: Found annotations with zero variance. They must be excluded before proceeding.
...
19:52:31.846 INFO ProgressMeter - Traversal complete. Processed 2786933 total variants in 0.8 minutes.
19:52:31.959 INFO VariantDataManager - AS_QD: mean = 17.44 standard deviation = 8.75
19:52:32.173 INFO VariantDataManager - MQRankSum: mean = 0.00 standard deviation = 0.00
19:52:32.487 INFO VariantDataManager - AS_ReadPosRankSum: mean = 0.01 standard deviation = 0.84
19:52:32.797 INFO VariantDataManager - AS_FS: mean = 1.79 standard deviation = 3.03
19:52:32.962 INFO VariantDataManager - AS_MQ: mean = 60.00 standard deviation = 0.12
19:52:33.139 INFO VariantDataManager - AS_SOR: mean = 0.68 standard deviation = 0.25
19:52:33.335 INFO VariantDataManager - DP: mean = 1323.99 standard deviation = 945.30
(We initially tried AS_MQRankSum instead of MQRankSum, but it also had a standard deviation of 0.00.)
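To cross-check the VariantDataManager numbers independently of VariantRecalibrator, one option is to compute the spread of each annotation directly from the sites-only VCF. The following is a minimal sketch, not GATK code: the function names are hypothetical, it treats "nul" and other non-numeric tokens as missing, and it looks at every record rather than only the sites VariantRecalibrator would actually use for training.

```python
import gzip
import math


def info_fields(vcf_path):
    """Yield the INFO column (8th field) of each record in a plain or gzipped VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header lines
            columns = line.rstrip("\n").split("\t")
            if len(columns) >= 8:
                yield columns[7]


def numeric_values(info, key):
    """Return the numeric per-allele values of `key` in one INFO string.

    Non-numeric tokens such as "nul" or "." are skipped; a missing key gives [].
    """
    for item in info.split(";"):
        name, sep, raw = item.partition("=")
        if name == key and sep:
            values = []
            for token in raw.split(","):
                try:
                    values.append(float(token))
                except ValueError:
                    pass  # e.g. "nul"
            return values
    return []


def summarize(vcf_path, keys):
    """Return {key: (count, mean, standard deviation)} over all records."""
    collected = {key: [] for key in keys}
    for info in info_fields(vcf_path):
        for key in keys:
            collected[key].extend(numeric_values(info, key))
    summary = {}
    for key, values in collected.items():
        if not values:
            summary[key] = (0, None, None)
            continue
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        summary[key] = (len(values), mean, math.sqrt(variance))
    return summary
```

A call such as summarize("sites_only.vcf.gz", ["MQRankSum", "AS_MQRankSum", "AS_ReadPosRankSum"]) (the path is a placeholder) would show how many sites carry a numeric value at all, which helps separate "mostly missing" from "present but constant".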
Question 1: As we understand it, MQRankSum can only be computed at heterozygous-reference (het-ref) sites (see reference #2). Generally speaking (not just for our particular dataset), do we need a good representation of such sites in both the truth "resource" set and the input raw-variant VCFs, or only in the input VCFs?
Question 2: We've confirmed that our data (both the truth set and the input set) has many het-ref sites with good read support for all alleles. One piece of evidence for this (we think) is that AS_ReadPosRankSum was calculated to have non-zero variance, as shown in the VariantRecalibrator output above. MQRankSum's variance came out as zero, yet both it and AS_MQRankSum have the same caveats in the documentation. In which cases can one annotation be calculated but not the other? For example, does this indicate that our mapping qualities are too uniform (in which case the variance would be exactly 0.000)?
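To make the "too uniform" hypothesis concrete, here is a minimal sketch of a textbook rank-sum z-score (normal approximation, average ranks for ties, no tie correction in the variance). It is an assumption-laden illustration and not GATK's implementation; the function names are hypothetical. It shows that when every read has the same mapping quality the statistic is exactly 0, and that it is undefined when one of the two read groups is empty.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def rank_sum_z(alt, ref):
    """Rank-sum z-score comparing alt-supporting and ref-supporting reads.

    Returns None when either group is empty, since the test is undefined then.
    """
    n_alt, n_ref = len(alt), len(ref)
    if n_alt == 0 or n_ref == 0:
        return None
    ranks = average_ranks(list(alt) + list(ref))
    u_alt = sum(ranks[:n_alt]) - n_alt * (n_alt + 1) / 2
    mean_u = n_alt * n_ref / 2
    sd_u = math.sqrt(n_alt * n_ref * (n_alt + n_ref + 1) / 12)
    return (u_alt - mean_u) / sd_u


# All reads at MAPQ 60: the statistic is exactly zero.
uniform = rank_sum_z([60, 60, 60], [60, 60, 60, 60])

# Alt reads with lower MAPQ than ref reads: the statistic is negative.
skewed = rank_sum_z([20, 30], [60, 60, 60])

# No ref-supporting reads: the statistic cannot be computed.
undefined = rank_sum_z([60, 60], [])
```

Read positions differ from read to read even when every read maps with MAPQ 60, which would be consistent with AS_ReadPosRankSum varying while MQRankSum stays at 0.000, if this hypothesis holds.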
Question 3: If we dig into the dataset (i.e. the sites-only VCF inputs), we see many sites whose relevant annotations are a mix of "nul" and "0.000". At some sites AS_MQRankSum is present but MQRankSum is not; at others both are present. How should we interpret the different values? Is there anything "wrong" with that?
Ex: biallelic heterozygous-ref site. AS_MQRankSum and AS_ReadPosRankSum are "nul". MQRankSum and ReadPosRankSum are omitted altogether.
HanXRQChr01 16169 . G T 114.60 PASS AC=2;AF=0.250;AN=8;AS_BaseQRankSum=nul;AS_FS=0.000;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=30.82;AS_ReadPosRankSum=nul;AS_SOR=0.693;DP=6;ExcessHet=0.3218;FS=0.000;MLEAC=10;MLEAF=1.00;MQ=60.00;QD=29.27;SOR=0.693
Ex: biallelic het-ref site. ReadPosRankSum is non-zero. AS_MQRankSum is zero.
HanXRQChr01 17137 . G A 53.23 PASS AC=1;AF=0.010;AN=96;AS_BaseQRankSum=0.600;AS_FS=0.000;AS_InbreedingCoeff=-0.0476;AS_MQ=60.00;AS_MQRankSum=0.000;AS_QD=8.87;AS_ReadPosRankSum=0.800;AS_SOR=1.179;BaseQRankSum=0.623;DP=243;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.0476;MLEAC=1;MLEAF=0.010;MQ=60.00;MQRankSum=0.00;QD=8.87;ReadPosRankSum=0.842;SOR=1.179
Ex: site where one allele has nuls, and the other one has floats
HanXRQChr01 18154 . G C,A 8466.08 PASS AC=25,1;AF=0.240,9.615e-03;AN=104;AS_BaseQRankSum=-0.550,nul;AS_FS=1.536,2.158;AS_InbreedingCoeff=0.8253,-0.0126;AS_MQ=60.00,60.00;AS_MQRankSum=0.000,nul;AS_QD=29.59,8.09;AS_ReadPosRankSum=0.900,nul;AS_SOR=0.400,0.223;BaseQRankSum=0.494;DP=813;ExcessHet=0.0000;FS=1.538;InbreedingCoeff=0.8833;MLEAC=26,1;MLEAF=0.250,9.615e-03;MQ=60.00;MQRankSum=0.00;QD=32.25;ReadPosRankSum=1.52;SOR=0.391
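To quantify how common each of these patterns is, one could tally, per allele, whether an annotation is absent, "nul", zero, or non-zero. The sketch below is only an illustration with hypothetical function names; the INFO strings are the three records above, trimmed to the rank-sum fields.

```python
from collections import Counter


def classify(info, key):
    """Label each per-allele value of `key` in one INFO string.

    Labels: "absent" (key not present), "nul", "zero", "nonzero", or "other"
    for any token that is neither "nul" nor a number.
    """
    for item in info.split(";"):
        name, sep, raw = item.partition("=")
        if name == key and sep:
            labels = []
            for token in raw.split(","):
                if token == "nul":
                    labels.append("nul")
                    continue
                try:
                    value = float(token)
                except ValueError:
                    labels.append("other")
                    continue
                labels.append("zero" if value == 0.0 else "nonzero")
            return labels
    return ["absent"]


def tally(info_strings, key):
    """Count the labels of `key` across many INFO strings."""
    counts = Counter()
    for info in info_strings:
        counts.update(classify(info, key))
    return counts


# Rank-sum fields of the three example records (positions 16169, 17137, 18154).
EXAMPLE_INFOS = [
    "AS_MQRankSum=nul;AS_ReadPosRankSum=nul",
    "AS_MQRankSum=0.000;AS_ReadPosRankSum=0.800;MQRankSum=0.00;ReadPosRankSum=0.842",
    "AS_MQRankSum=0.000,nul;AS_ReadPosRankSum=0.900,nul;MQRankSum=0.00;ReadPosRankSum=1.52",
]

for annotation in ("AS_MQRankSum", "MQRankSum", "AS_ReadPosRankSum"):
    print(annotation, dict(tally(EXAMPLE_INFOS, annotation)))
```

On these three records no AS_MQRankSum or MQRankSum value falls in the "nonzero" bucket, while AS_ReadPosRankSum has two non-zero values; running the same tally over the full sites-only VCF would show whether that holds genome-wide.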
Relevant pages and comments I've found on the subject:
1. "MQRankSum is one of the core annotations that we recommend using, so I would recommend going to the trouble of finding out why it's not working." (https://gatkforums.broadinstitute.org/gatk/discussion/comment/9737/#Comment_9737)
2. "The Rank Sum Tests require at least one individual to be heterozygous and have a mix of ref and alt reads" (https://gatkforums.broadinstitute.org/gatk/discussion/comment/33174/#Comment_33174)
3. AS_ReadPosRankSum annotation documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_AS_ReadPosRankSumTest.php)
4. AS_MQRankSum annotation documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_AS_MappingQualityRankSumTest.php)
5. Hard-filtering recommendations, which discuss how the tests work, in particular MQRankSum and ReadPosRankSum (https://software.broadinstitute.org/gatk/documentation/article.php?id=6925)
6. Allele-specific annotation and filtering article (https://gatkforums.broadinstitute.org/gatk/discussion/9622/allele-specific-annotation-and-filtering/)