Hello!
I ran VariantFiltration on my joint-called SNP and indel VCF file (HaplotypeCaller -> CombineGVCFs -> GenotypeGVCFs), using the following command in GATK 4.0.6.0:
gatk VariantFiltration \
-R path_to/genome.fa \
-V path_to/joint_call_set.vcf \
--genotype-filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--genotype-filter-name "my_snp_filter" \
-O path_to/joint_call_set_HARD.vcf
According to the documentation on hard filtering (https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set), each entry in the resulting file should be labeled "PASS" if it passed all of the filters, or "FILTER" if it failed any of them.
I checked how many entries remained after filtering and realized that all entries were kept:
## Lines in vcf body
grep -E -v "^#" joint_call_set_HARD.vcf | wc -l
20939832
## Lines flagged as PASS in column FILTER
grep -E -v "^#" joint_call_set_HARD.vcf | awk '{print $7}' | grep "PASS" | wc -l #the 7th column corresponds to "FILTER"
20939832
## Lines flagged as FILTER in column FILTER
grep -E -v "^#" joint_call_set_HARD.vcf | awk '{print $7}' | grep "FILTER" | wc -l #the 7th column corresponds to "FILTER"
0
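To rule out that the exact-string greps above are missing some other label, the distinct values in the FILTER column can be tallied in one pass. This is only a minimal sketch: the function name count_filter_values is hypothetical, and it assumes a standard tab-delimited VCF body with FILTER in the 7th column.

```shell
# Hypothetical helper (not part of GATK): tally each distinct FILTER value in a VCF.
# Assumes a tab-delimited VCF whose 7th column is FILTER.
count_filter_values() {
    # Drop header lines, keep only column 7, then count each distinct value.
    grep -v '^#' "$1" | cut -f7 | sort | uniq -c
}

# Usage: count_filter_values joint_call_set_HARD.vcf
```

Each output line is a count followed by one FILTER value, so any label other than "PASS" would show up here even if it is not literally "FILTER".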
Essentially, all entries passed the filters, which cannot be correct.
The file contains the genotype information for 60 samples. Could this have something to do with the issue (i.e. if one sample passes, is the whole entry labeled as PASS)?
This file is intended to be used as a training set for VariantRecalibrator.
I appreciate your feedback.
Cheers!