Dear GATK team:
I'm currently calling variants in our amplicon sequencing data, which targets regions of around 300 bp. The data were generated on an Illumina MiSeq with 250 bp paired-end reads, so the forward and reverse reads of each pair overlap.
The original BAM file was first processed by setting the initial 23 bases of each read in every pair to N, to remove the influence of the primers on variant calling. I then ran HaplotypeCaller (GATK-3.6-0-g89b7209) in GVCF mode, followed by GenotypeGVCFs for joint genotyping.
In the joint genotyping result I observe many "./." genotypes at sites with high AD and DP. They seem to occur only at homozygous-reference sites, and they do not appear in every sample, only in specific ones.
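For readers unfamiliar with this kind of primer masking, here is a minimal Python sketch of the idea. It is an illustration only and not the tool that produced the clipped BAM files in this post: the function name `mask_primer` is hypothetical, and it assumes the primer occupies the first bases of each read as sequenced. Because a BAM record stores reverse-strand reads reverse-complemented, the start of such a read sits at the end of the stored sequence.

```python
def mask_primer(seq, is_reverse, n=23):
    """Replace the first n sequenced bases of a read with N.

    seq is the sequence as stored in the BAM record. For a read mapped
    to the reverse strand the stored sequence is reverse-complemented,
    so the first sequenced bases are the LAST n characters of seq.
    Assumption: base qualities and the CIGAR are left untouched.
    """
    n = min(n, len(seq))
    if n == 0:
        return seq
    if is_reverse:
        return seq[:-n] + "N" * n
    return "N" * n + seq[n:]
```

A real implementation would also have to decide how to treat soft-clipped and hard-clipped bases, which this sketch ignores.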
One example from the vcf result is shown below:
chr1 18722933 rs56176731 G C 3760.6 . AC=2;AF=0.019;AN=104;BaseQRankSum=7.94;ClippingRankSum=0.00;DB;DP=311240;ExcessHet=4.4395;FS=0.000;InbreedingCoeff=-0.0558;MLEAC=2;MLEAF=0.019;MQ=13.65;MQRankSum=0.00;QD=1.96;ReadPosRankSum=1.72;SOR=0.150 GT:AD:DP:GQ:PL:SAC 0/0:10893,0:10893:99:0,120,1800 0/0:3095,0:3095:99:0,120,1800 0/0:7109,0:7109:99:0,120,1800 0/0:6159,0:6159:99:0,120,1800 ./.:9187,0:9187:.:0,0,0 0/0:9921,0:9921:99:0,120,1800 0/0:7886,0:7886:99:0,120,1800 0/0:9599,0:9599:99:0,120,1800 0/0:3568,0:3568:99:0,120,1800 0/0:3587,0:3587:99:0,120,1800 ./.:10150,0:10150:.:0,0,0 0/0:10063,0:10063:99:0,120,1800 0/0:7977,0:7977:99:0,120,1800 0/0:6701,0:6701:99:0,120,1800 0/0:9992,0:9992:99:0,120,1800 0/0:8268,0:8268:99:0,120,1800 0/0:7164,0:7164:99:0,120,1800 0/0:3071,0:3071:99:0,120,1800 0/0:3744,0:3744:99:0,120,1800 0/0:4276,0:4276:99:0,120,1800 0/0:2209,0:2209:0:0,0,2796 0/0:2951,0:2951:0:0,0,3425 0/0:9073,0:9073:99:0,120,1800 0/0:2828,0:2828:99:0,120,1800 ./.:3450,0:3450:.:0,0,0 0/0:2960,0:2960:99:0,120,1800 0/0:4119,0:4119:99:0,120,1800 0/0:5063,0:5063:99:0,120,1800 0/1:1305,331:1645:99:858,0,62435:0,1305,0,331 0/0:4505,0:4505:99:0,120,1800 0/0:2868,0:2868:99:0,120,1800 0/0:6611,0:6611:0:0,0,876 0/0:7709,0:7709:0:0,0,3260 0/0:4767,0:4767:99:0,120,1800 0/0:4956,0:4956:99:0,120,1800 0/0:6305,0:6305:99:0,120,1800 0/0:1866,0:1866:99:0,120,1800 0/1:73,214:287:99:2943,0,1029:0,73,0,214 0/0:8021,0:8021:0:0,0,9405 0/0:2878,0:2878:99:0,120,1800 0/0:8165,0:8165:99:0,120,1800 0/0:3005,0:3005:99:0,120,1800 ./.:3688,0:3688:.:0,0,0 0/0:7725,0:7725:0:0,0,11872 0/0:8611,0:8611:99:0,120,1800 0/0:3994,0:3994:99:0,120,1800 0/0:4031,0:4031:0:0,0,3673 0/0:5476,0:5476:99:0,120,1800 0/0:8891,0:8891:99:0,120,1800 0/0:2868,0:2868:99:0,120,1800 ./.:3637,0:3637:.:0,0,0 ./.:7150,0:7150:.:0,0,0 0/0:1092,0:1092:99:0,120,1800 0/0:924,0:924:99:0,120,1800 ./.:4579,0:4579:.:0,0,0 0/0:1658,0:1658:99:0,120,1800 0/0:2684,0:2684:99:0,120,1800 0/0:1396,0:1396:99:0,120,1800 0/0:1328,0:1328:99:0,120,1800 
./.:2372,0:2372:.:0,0,0
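To pull the affected samples out of a record like this one, a small Python sketch along the following lines could be used. The function name `high_depth_no_calls` and the `min_dp` threshold are my own illustrative choices. The record is split on any whitespace because the pasted text above has lost its tabs, whereas a real VCF file is tab-delimited. The example record is a shortened copy of the one above (INFO abbreviated, three samples only).

```python
def high_depth_no_calls(record, min_dp=1000):
    """Return (sample_index, DP) for samples with GT './.' and DP >= min_dp.

    sample_index is 0-based, counted from the first sample column.
    """
    cols = record.split()           # whitespace split: pasted text has no tabs
    keys = cols[8].split(":")       # FORMAT column, e.g. GT:AD:DP:GQ:PL:SAC
    hits = []
    for idx, sample in enumerate(cols[9:]):
        # Trailing FORMAT fields may be dropped per sample; zip tolerates that.
        values = dict(zip(keys, sample.split(":")))
        dp = values.get("DP", ".")
        if values.get("GT") == "./." and dp.isdigit() and int(dp) >= min_dp:
            hits.append((idx, int(dp)))
    return hits


# Shortened copy of the record above (assumption: INFO reduced to AC=2).
record = ("chr1 18722933 rs56176731 G C 3760.6 . AC=2 GT:AD:DP:GQ:PL:SAC "
          "0/0:10893,0:10893:99:0,120,1800 ./.:9187,0:9187:.:0,0,0 "
          "0/1:1305,331:1645:99:858,0,62435:0,1305,0,331")
print(high_depth_no_calls(record))  # [(1, 9187)]
```

Running this over every line of the joint VCF would show whether the uncalled genotypes cluster in particular samples or particular amplicons.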
The HaplotypeCaller command line I used (for one sample):
java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -dt NONE --genotyping_mode DISCOVERY -A StrandAlleleCountsBySample -A StrandBiasBySample -R ucsc.hg19_nohap_v3_fixed.fasta -I JK0812_m_clipped.bam -o JK0812_m_clipped.g.vcf --dbsnp dbsnp_138.hg19.vcf -L target_region.bed -ERC GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --maxReadsInRegionPerSample 200000
In this command I turned downsampling off and set the maximum number of reads per sample in an active region to 200000, to account for the nature of targeted sequencing data.
I then picked a single sample with "./." and ran it through HaplotypeCaller in native mode (without -ERC GVCF).
The genomic interval was set to chr1:18722675-18722950, and the command line was as follows:
java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -dt NONE --genotyping_mode DISCOVERY -I SK1466_m_clipped.bam -R ucsc.hg19_nohap_v3_fixed.fasta --dbsnp dbsnp_138.hg19.vcf -L chr1_target_region.bed -o SK1466_m_clipped.vcf -allSitePLs --maxReadsInRegionPerSample 200000 -out_mode EMIT_ALL_SITES
Some of the results are shown below:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SK1466
chr1 18722701 . C . 0 LowQual AN=2;DP=6432;MQ=60.00 GT:AD:DP 0/0:6432:6432
chr1 18722702 . T . 0 LowQual AN=2;DP=6432;MQ=60.00 GT:AD:DP 0/0:6432:6432
chr1 18722703 . A . 0 LowQual AN=2;DP=6432;MQ=60.00 GT:AD:DP 0/0:6432:6432
chr1 18722704 . G . 0 LowQual AN=2;DP=6432;MQ=60.00 GT:AD:DP 0/0:6432:6432
chr1 18722706 . T . 0 LowQual AN=2;DP=6432;MQ=60.00 GT:AD:DP 0/0:6432:6432
chr1 18722707 . A . 0 LowQual AN=2;DP=6432;MQ=60.00 GT:AD:DP 0/0:6432:6432
chr1 18722709 . T . 0 LowQual AN=2;DP=6432;MQ=60.00 GT:AD:DP 0/0:6432:6432
chr1 18722710 . T . 0 LowQual AN=2;DP=6432;MQ=60.00 GT:AD:DP 0/0:6432:6432
chr1 18722711 . C . 0 LowQual AN=2;DP=6432;MQ=60.00 GT:AD:DP 0/0:6432:6432
chr1 18722713 rs1336130 T C 107857.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.647;ClippingRankSum=-0.000;DB;DP=6432;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.000;QD=16.78;ReadPosRankSum=-0.109;SOR=0.532 GT:AD:DP:GQ:PL 0/1:3497,2930:6427:99:107886,0,133316
The homozygous-reference sites are labeled LowQual with QUAL=0, and no PL is calculated for them. Could anyone tell me what is going on with these odd calls in both the GVCF mode and the native mode of HaplotypeCaller?
I have generated the corresponding bamout file and found that many haplotypes are generated because of the nature of amplicon sequencing. Please let me know how to upload it, or how to insert a comparison figure from IGV.
One similar thread: gatkforums.broadinstitute.org/gatk/discussion/8783/homozygous-reference-genotype-is-called-in-native-mode-but-uncalled-in-haplotypecaller-erc-modes
Help on this will be greatly appreciated.
Best