Quantcast
Channel: Recent Discussions — GATK-Forum
Viewing all articles
Browse latest Browse all 12345

Variant calling inconsistencies using GVCF

$
0
0

Before starting, I'm a grad student without formal training in bioinformatics so I've mainly learned from online forums. That being said, if this question is stupid or I'm missing critical information, don't hold back.

I have an issue similar to this thread http://gatkforums.broadinstitute.org/gatk/discussion/5638/incorrect-missing-genotypes-using-haplotypecaller but it doesn't appear to be a data quality issue as was seen there.

To summarize:
I followed the best practices workflow using GATK 3.7, although I omitted base quality score recalibration. I have a a combination of WGS and WES and 21 samples for which I generated gvcf files and then genotyped concurrently. I noticed that occasionally single samples would be called ./. in regions where every other sample is 1/1. I focused on one of these oddities and went to the single sample that was causing the problem.
image
Sample 1270 is the original bam that is called improperly while 1271 is called properly. Sample 1270 has 40 alternate reads and 0 reference reads.

I ran the following:

$ GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R GRCz10.fa \ 
-I 1270.mark.bam \ 
-o 1270origregion.g.vcf \
-L 16:17352455-17392053 \
-bamout 1270origregion.bam \
-ERC GVCF 

The region (-L) is the surrounding gene from the region list I used on the whole genome data although this sample specifically is from an exome capture.

In the resulting GVCF I see:

16  17361797    .   C   <NON_REF>   .   .   END=17361797    GT:DP:GQ:MIN_DP:PL  0/0:40:0:40:0,0,0

In the bamout generated I see:
image

This looks odd as the "ArtificialHaploytypes" (the top 4 reads) generated do not contain the variant. The remaining 24 reads are from the sample. It seems that the caller is pulling the DP of 40 from the original file but for some reason calling it reference.

Out of curiosity I bootstrapped this with Haplotype Caller not in GCVF mode:

$ GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R GRCz10.fa \ 
-I 1270.origregion.bam \ 
-o 1270bootstrap.bam \
-L 16:17352455-17392053 \
-bamout 1270bootstrap.bam

And now the vcf includes the variant

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1270    HC
16  17361797    .   C   T   780.17  .   AC=2;AF=0.500;AN=4;BaseQRankSum=-0.659;ClippingRankSum=0.000;DP=29;ExcessHet=0.7918;FS=10.017;MLEAC=2;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=31.21;ReadPosRankSum=-0.913;SOR=0.472 GT:AD:DP:GQ:PL  1/1:0,25:25:75:814,75,00/0:4,0:4:12:0,12,151

and the bamout looks to include the variant in the artificial haplotypes

image

If I do a traditional multisample haplotype caller using the 1270 and 1271 shown above, it will call both properly. However, if I call g.vcf and the genotype them it misses the call in 1270. If I call 1270 in the standard mode it also misses the call. The base qualities and mapping qualities in the region seem to be fine, with the caveat that I didn't recalibrate scores.

The last thing I notice is that the active region is truncated for sample 1270. But if I call 1270 with 1271 the active region extends 100 bp upstream:
image

If anyone has any experience with this, tips for troubleshooting or needs more information, please let me know.

Thanks
Steve


Viewing all articles
Browse latest Browse all 12345

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>