Hi - we have whole-genome samples consisting of pooled, non-barcoded DNA from 50 diploid individuals with 80-120x coverage. We use GATK version 3.5-0-g36282e. We align, then run BQSR. For variant calling we cannot use GATK: HaplotypeCaller doesn't support this, and UnifiedGenotyper (UG) never finishes with such high ploidy levels (it's not just us; Huang et al. 2015, "Evaluation of variant detection software for pooled next-generation sequence data", noted that UG failed to complete for large pools).
Thus, after BQSR we call variants with LoFreq. However, when we try to run VQSR on the vcf file produced by LoFreq, we get error messages. We suspect the absence of genotype columns is involved, although strictly speaking you can't call a genotype for a high-ploidy pooled sample.
An example of the LoFreq vcf file is shown at the end of this discussion.
I'm quite willing to put together some awk scripts to add a non-haplotyped genotype column to our vcf, if this will be enough to make VQSR happy. Do you have any experience with this? A comment by delangel in 2013 stated the team had not at that time tried to do VQSR with pooled samples. Has this changed?
Example from LoFreq vcf:
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
#CHROM POS ID REF ALT QUAL FILTER INFO
1.1 1348 . C T 152 PASS DP=81;AF=0.345679;SB=8;DP4=27,26,9,19
1.1 1418 . G A 101 PASS DP=124;AF=0.233871;SB=2;DP4=45,50,16,13
1.1 1493 . T G 507 PASS DP=109;AF=0.522936;SB=9;DP4=30,21,24,33
1.1 1523 . A T 1640 PASS DP=86;AF=0.976744;SB=3;DP4=0,1,43,41
I suspect that producing something like the lines below might help:
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samp1
1.1 1348 . C T 152 PASS DP=81;AF=0.345679;SB=8;DP4=27,26,9,19 GT:DP 1:81
1.1 1418 . G A 101 PASS DP=124;AF=0.233871;SB=2;DP4=45,50,16,13 GT:DP 1:124
1.1 1493 . T G 507 PASS DP=109;AF=0.522936;SB=9;DP4=30,21,24,33 GT:DP 1:109
1.1 1523 . A T 1640 PASS DP=86;AF=0.976744;SB=3;DP4=0,1,43,41 GT:DP 1:86
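For concreteness, here is a rough Python sketch (rather than awk) of the conversion I have in mind. It only reproduces the layout proposed above: it appends a FORMAT column and one sample column, taking the depth from the INFO DP field and using "1" as a placeholder genotype. The sample name samp1, the wording of the two ##FORMAT header lines, and the function names are my own assumptions, and I have no evidence yet that VQSR accepts the result. It assumes the real file is tab-delimited, as vcf files normally are.

```python
# Sketch only: add a FORMAT column and a single sample column to a
# sites-only vcf such as the LoFreq output shown above.
# SAMPLE and FORMAT_HEADERS are hypothetical placeholders, not LoFreq
# or GATK output.
SAMPLE = "samp1"
FORMAT_HEADERS = [
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">',
]


def info_value(info, key):
    """Return the value of key in a semicolon-separated INFO string, or None."""
    for field in info.split(";"):
        name, _, value = field.partition("=")
        if name == key:
            return value
    return None


def add_sample_column(lines, sample=SAMPLE):
    """Return the vcf lines with FORMAT headers and a GT:DP sample column added."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines pass through unchanged.
            out.append(line)
        elif line.startswith("#CHROM"):
            # Declare the new FORMAT fields just before the column header.
            out.extend(FORMAT_HEADERS)
            out.append(line + "\tFORMAT\t" + sample)
        elif line:
            cols = line.split("\t")
            depth = info_value(cols[7], "DP") or "."
            # "1" is a placeholder genotype, not a real call for a pool.
            out.append(line + "\tGT:DP\t1:" + depth)
    return out


example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1.1\t1348\t.\tC\tT\t152\tPASS\tDP=81;AF=0.345679;SB=8;DP4=27,26,9,19",
]
for row in add_sample_column(example):
    print(row)
```

To convert a whole file, the same function can be fed an open file handle instead of the example list.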
Is there some way forward here, or is VQSR with high-ploidy pooled samples a no-go?
Thanks!