Hi,
I use SelectVariants to remove some samples from my initial data. As a result, some sites no longer show any variation, because the variant alleles were carried only by the samples that were removed.
However, when I add the option -env (--excludeNonVariants) to remove all sites where there is no variation between samples, the output VCF file still contains sites where all samples carry the same allele, which differs from the reference.
Here is my command:
java -Xmx4G -jar /path/to/GenomeAnalysisTK.jar -T SelectVariants -R /path/to/ref.fa -V starting_vcf_file.vcf -env -o results_vcf_file.fa
I expect the resulting VCF file to contain only sites with variation between the samples. However, I still have sites such as:
scaffold1 25003 . T C 40226.42 PASS AC=14;AF=1.00;AN=14;DP=866;ExcessHet=0.2482;FS=0.000;MQ=60.06;QD=31.72;SOR=1.046 GT:AD:DP:GQ:PL 1:0,54:54:99:2427,0 1:0,69:69:99:3089,0 1:0,83:83:99:3807,0 1:0,35:35:99:1530,0 1:0,40:40:99:1842,0 1:0,31:31:99:1405,0 1:0,65:65:99:2899,0 1:0,43:43:99:1935,0 1:0,55:55:99:2464,0 1:0,64:64:99:2900,0 1:0,86:86:99:3850,0 1:0,91:91:99:4182,0 1:0,84:84:99:3817,0 1:0,66:66:99:3029,0
As you can see, there is no variation between the samples! Does GATK consider every site where all samples carry the alternative allele to be a variant site?
If this is the case, how can I exclude these sites?
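To be precise about what I mean by "no variation between samples", here is a minimal Python sketch of the check I have in mind. It is not GATK code and says nothing about how -env works internally. The function names is_monomorphic and keep_polymorphic are hypothetical, and I assume whitespace-separated VCF records with a GT entry in the FORMAT column.

```python
def is_monomorphic(vcf_line: str) -> bool:
    """Return True if all called samples at this site share one genotype."""
    # Assumption: fields are separated by whitespace and contain no spaces.
    fields = vcf_line.rstrip("\n").split()
    # Columns 0-8 are CHROM..FORMAT; sample columns start at index 9.
    gt_index = fields[8].split(":").index("GT")
    genotypes = set()
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        # Treat phased (|) and unphased (/) genotypes the same way.
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # skip samples with a missing call
        genotypes.add(tuple(sorted(alleles)))
    # Zero or one distinct genotype means no variation between samples.
    return len(genotypes) <= 1


def keep_polymorphic(lines):
    """Yield header lines and records where the samples differ."""
    for line in lines:
        if line.startswith("#") or not is_monomorphic(line):
            yield line
```

With this definition, the record above (every sample has GT 1) would count as monomorphic and be dropped, even though it differs from the reference. Note that a site where every sample is heterozygous would also be dropped, which is fine for my haploid calls but may not be what others want.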
Thank you very much in advance.