Channel: Recent Discussions — GATK-Forum

Alternate Alleles in VCF are more than 1 base

Hi there,

I've removed INDELs from a multi-sample VCF produced by HaplotypeCaller, using SelectVariants. However, some of the remaining ALT 'SNPs' are more than a single-nucleotide substitution, e.g.:

TTTTTTGTTTTTTGTTTT,GTTTTTGTTTT,G
TTTTTTTA,*
TTTTTTTAG,*
TTTTTTTATTTTTCATTTA,*
TTTTTGTTTTTTTA,TC,*

Q1) What is the meaning of the * symbol?
Q2) Is it to be expected that these SNPs are more than a single nucleotide substitution?
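
For reference, a sketch of how one might keep only biallelic SNPs, assuming GATK4 SelectVariants syntax (file names here are placeholders):

    gatk SelectVariants \
        -R reference.fasta \
        -V multisample.vcf.gz \
        --select-type-to-include SNP \
        --restrict-alleles-to BIALLELIC \
        -O snps.biallelic.vcf.gz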

Thanks,
Tom


ERROR StatusLogger Log4j2 could not find a logging implementation

Hi developers,

When I used GATK (version 3.8-0-ge9d806836) to generate a GVCF with the HaplotypeCaller tool, I got this error message:

ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/Data/Sunhh/src/pipeline/gatk/GenomeAnalysisTK-3.8-0-g

ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...

I tested this with a very small BAM file and got the same message again, and the GATK process never finished.
Here is the detailed information about the environment:
Operating system : Fedora 26 (4.12.5-300.fc26.x86_64);
GATK Version : 3.8-0-ge9d806836
JAVA version : Java(TM) SE Runtime Environment (build 1.8.0_144-b01) ; Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
Variant calling from paired-end DNA-Seq Illumina reads ;
Command I used : /usr/java/jre1.8.0_144/bin/java -Xmx8G -jar /home/Sunhh/tools/GenomeAnalysisTK.jar -T HaplotypeCaller -R /Data/Sunhh/wm_reseqByPB/db/wm97pv0.scf.fa -I t_merged_dedup_pipe2.bam --genotyping_mode DISCOVERY -stand_call_conf 30 -ERC GVCF -o t_merged.g.vcf 1>s6.std.t_merged 2>s6.err.t_merged

If possible, I'd like to try an older version of GATK, but I don't know where to download one. Could you tell me how to get those old versions?

Thanks!

Best
Honghe

Possible bug in SelectVariants tool

Dear GATK experts,

I have done variant calling on 384 potato samples, mostly following the GATK Best Practices, and have applied hard filters to select SNPs for further use. However, I am noticing that the '--max-nocall-fraction', '--max-nocall-number' and '--max-fraction-filtered-genotypes' arguments for 'SelectVariants' are not working properly. I have tried various cutoff settings, and every time I observe SNPs where the number of 'no call' genotypes (~246 out of 384) is much larger than the set threshold. I searched the forum first but couldn't find any relevant threads. I am using the latest GATK version (4.0.7.0). I am attaching three example sets of (1) log files, (2) subset VCF files and (3) the VCF index files for the three main VCFs. I would appreciate any feedback on this issue, and would like to know whether other users have observed this behaviour as well.
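
For reference, a sketch of the kind of invocation in question, with placeholder file names and thresholds (GATK 4.0.x syntax):

    gatk SelectVariants \
        -R potato_reference.fasta \
        -V hard_filtered_snps.vcf.gz \
        --max-nocall-fraction 0.1 \
        --max-fraction-filtered-genotypes 0.1 \
        -O selected_snps.vcf.gz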

Regards,
Sanjeev

BWA mem -M option

bwa mem has an -M flag that will:

Mark shorter split hits as secondary (for Picard compatibility).

However, my guess is that Picard has since been updated and this is no longer required. Should bwa mem be run with or without the -M flag, assuming we are using reasonably up-to-date software?
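
For context, a typical invocation with the flag, sketched with placeholder file and sample names:

    # -M marks shorter split hits as secondary; -R sets the read group
    bwa mem -M -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
        reference.fasta sample1_R1.fastq.gz sample1_R2.fastq.gz \
        | samtools sort -o sample1.sorted.bam -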

about ASEReadCounter

Dear all,

I am using ASEReadCounter to count the number of reads per variant in a BAM file, and, similar to a previous post (linked below), I am encountering this error:

"MESSAGE: More then one variant context at position: chr19:125517"

i.e. in the VCF file there are 2 entries for the same position:

chr19 125517 . A G 42.01 . AC1=1;AF1=0.5;BQB=0.950129;DP=43;DP4=12,7,4,1;FQ=45.0154;MQ=46;MQ0F=0;MQB=0.984335;MQSB=0.998127;PV4=0.631094,1,1,1;RPB=1;SGB=-0.590765;VDB=0.233642 GT:PL 0/1:72,0,255

chr19 125517 . AA AAGAGA 5.79 . AC1=1;AF1=0.499984;DP=43;DP4=7,5,5,0;FQ=8.19012;IDV=2;IMF=0.0444444;INDEL;MQ=45;MQ0F=0;MQSB=0.99446;PV4=0.244505,1,0.0559047,0.273348;SGB=-0.590765;VDB=0.125771 GT:PL 0/1:42,0,151

The question is: is there any way in GATK to remove these sites? Of course, I could do it with a simple script outside GATK, although doing it outside GATK would complicate the pipeline a bit. Thank you very much!

-- bogdan

ps: the previous post was:
http://gatkforums.broadinstitute.org/gatk/discussion/comment/30752#Comment_30752
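
pps: a minimal sketch of such a script, assuming a position-sorted VCF; it keeps only the first record seen at each CHROM/POS, so in my example the SNP record survives and the overlapping indel record is dropped:

    awk '/^#/ { print; next } !seen[$1"\t"$2]++' input.vcf > deduped.vcf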

SelectVariants processing 0 variants with -ids rsidlist.txt -V Homo_sapiens_assembly38.dbsnp138.vcf

Hello,

The title probably says it all: the above VCF is the second genome file from which I have tried to extract the desired rsIDs, using a text file with one rsID per line. In each case the resulting filtered VCF contains 0 variants, and I am unable to determine why.
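
A couple of quick sanity checks worth running here, sketched with placeholder names (rs334 stands in for one of the rsIDs in the list):

    file rsidlist.txt       # reports 'with CRLF line terminators' if Windows line endings break the ID match
    head -3 rsidlist.txt    # confirm one bare rsID per line, no extra columns
    grep -m1 -w rs334 Homo_sapiens_assembly38.dbsnp138.vcf   # confirm the ID appears in the VCF ID column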

Apologies if I have not asked the question in a convenient manner.

"Could not open array genomicsdb_array at workspace" from GenotypeGVCFs in GATK 4.0.0.0

I am experiencing issues with GenotypeGVCFs and GenomicsDB input in the final GATK4 release, using the official Docker image. This does not occur with the 4.beta.6 release. It looks like an earlier beta release had a bug with the same error message, which was fixed. Is my issue related to that old bug, or does it just produce the same error message? What can I do to debug the issue?

2018-01-10T12:15:04.154516155Z terminate called after throwing an instance of 'VariantQueryProcessorException'
2018-01-10T12:15:04.154547266Z   what():  VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db
2018-01-10T12:15:04.154561314Z 
2018-01-10T12:15:04.620517615Z Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
2018-01-10T12:15:04.620517615Z Running:
2018-01-10T12:15:04.620517615Z     /gatk/build/install/gatk/bin/gatk GenotypeGVCFs -V gendb:///keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db --output chr22_db.vcf --reference /keep/db91e5f04cbd9018e42708316c28e82d+2160/hg19.fa

Documentation for GATK 4.0.11.0

Hi,

I'm wondering when I can expect the updated documentation for GATK 4.0.11.0. I'm especially looking forward to the updates concerning the --mitochondria-mode in Mutect2 and FilterMutectCalls, and since this version has been out for almost a month now, I would really like to get started.

Best,
Daniel


STAR 2-pass mapping

I want to use the STAR 2-pass alignment steps for SNP detection in RNA-seq data.

But I am getting very confused. I am using STAR version 2.5.3a.

I understand that there are 4 steps to perform in STAR 2-pass mapping:

1) 1st genome generation

2) 1st-pass alignment. But I am not able to understand whether to run the 1st-pass aligner for all samples together or separately.

3) Genome generation again.

4) 2nd-pass alignment. After the 1st-pass alignment, how do I specify all the tab files to the 2nd-pass aligner? What parameters should be used to filter the SJ.out.tab files? And how do I give each SJ.out.tab a different name?

The command lines I am using to perform all four steps:

1) 1st Genome Generator

/STAR --runThreadN 6 --runMode genomeGenerate --genomeDir /data/SNU_work/genome --sjdbGTFfile Radish_123.cds.gtf --genomeFastaFiles Rs.R1_R9.fasta

1st read mapping

2) /home/yog/software/STAR-2.5.3a/source/STAR --genomeDir /data/SNU_work/genome --readFilesIn 216_R1.fq 216_R2.fq --runThreadN 6

2nd Genome generator:

3) /STAR --runThreadN 6 --runMode genomeGenerate --genomeDir /data/SNU_work/genome --sjdbOverhang 124 --sjdbFileChrStartEnd /data/SNU_work/SJ.out.tab --genomeFastaFiles Rs.R1_R9.fasta

Now here I am confused: should I generate all the SJ.out.tab files together, or one by one? And if one by one, how do I give them different names according to the RNA-seq library?

4) Run the STAR aligner again.

Please also look at the command lines and suggest whether I have everything correct or not.
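
For what it's worth, a sketch of how the per-sample 1st pass and the combined 2nd genome generation are often set up (STAR 2.5.x flags; the sample prefixes and the second genome directory are placeholders): give each 1st-pass run its own --outFileNamePrefix so the SJ.out.tab files do not overwrite each other, then list all of them after --sjdbFileChrStartEnd.

    # 1st pass: one run per sample, each with its own output prefix
    STAR --genomeDir /data/SNU_work/genome --runThreadN 6 \
        --readFilesIn 216_R1.fq 216_R2.fq --outFileNamePrefix 216_

    # 2nd genome generation: pass every sample's junction file together
    STAR --runThreadN 6 --runMode genomeGenerate --genomeDir /data/SNU_work/genome_2pass \
        --genomeFastaFiles Rs.R1_R9.fasta --sjdbOverhang 124 \
        --sjdbFileChrStartEnd 216_SJ.out.tab 217_SJ.out.tab 218_SJ.out.tab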

Question Regarding GATK Doc #6483

Hi

I just happened to see a strange issue with this document.

In my own practice I always remove adapters at the very beginning (demultiplexing stage) and continue my analyses from fastq to uBAM to mapping and so on.

However, I recently received some external data for analysis and realized that there is about 9 percent adapter contamination in the FASTQs. It looks like adapter cleanup was omitted at the demultiplexing stage.

As a personal preference, I am against removing anything from the FASTQ after the demultiplexing stage, and I am totally against trimmers, since they tend to mess up the order of reads and further complicate debugging of already-established production pipelines.

So I decided to give MarkIlluminaAdapters a try, since it gives me the option to mark the adapters and rescore them to QV2 so they won't interfere with my analyses. However, looking at document #6483, after the MarkIlluminaAdapters step the uBAM is streamed to BWA and then to MergeBamAlignment to create a clean BAM, and at that stage the adapter bases marked with QV2 are reverted to their original quality values (QV > 30 for most!). So I am concerned about this.

Can anyone from the GATK team comment on why we mark the adapters if the original qualities will be restored anyway?

Am I missing something?

My current practice is to mark the adapters, convert the uBAM to FASTQ with CLIPPING_ACTION=2, start mapping with this FASTQ, and also generate a second uBAM from these new FASTQs in which the adapter sequences carry QV2.
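
For reference, a sketch of that marking + clipping-to-FASTQ flow, using the parameter names from document #6483 (file names are placeholders):

    java -Xmx8G -jar picard.jar MarkIlluminaAdapters \
        I=sample.unmapped.bam \
        O=sample.adaptermarked.bam \
        M=sample.adapter_metrics.txt

    # CLIPPING_ACTION=2 rescores the marked adapter bases to QV2
    java -Xmx8G -jar picard.jar SamToFastq \
        I=sample.adaptermarked.bam \
        FASTQ=sample.interleaved.fq \
        CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 \
        INTERLEAVE=true NON_PF=true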

Am I understanding wrong?

Thanks for the help.

Loss of rsIDs in GenotypeGVCFs, possibly an issue with the dbsnp filter.vcf file from SelectVariants

Hello,

I've run into this problem a few times now and have attempted to debug it in various ways. The first time it occurred I was using a VCF file containing the rsIDs I wished to genotype, applied to a given GVCF file. There was no error message that I could discern; however, the resulting VCF file contained fewer than 1/5th of the rsIDs of interest from the filter file.

I assumed this issue related to the GVCF file, so I acquired the BAM file to begin the workflow from the beginning and stick to the Best Practices; however, I have now run into the same issue using HaplotypeCaller with the same filter.vcf file as the -L and -D arguments.

This leads me to believe the filter file is the issue, but when I look at the file I can see all of the rsIDs there, so I'm not sure what is causing the loss. Of lesser concern is the loss of 7 rsIDs when I used SelectVariants to produce the filter initially, but I will also have to address that.

I have tried reducing the minimum base quality to 1, but it has not resulted in any increase in variant calling. I also looked into generating a bamout file, which I will return to when possible; however, I am currently working remotely and unable to install IGV on this device.
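
For reference, a sketch of this kind of invocation (GATK4 syntax assumed; file names are placeholders):

    gatk HaplotypeCaller \
        -R reference.fasta \
        -I sample.bam \
        -L filter.vcf \
        -D filter.vcf \
        -O genotyped.vcf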

Many thanks for any tips

GATK4 VariantAnnotator

Hello,

Has this tool been removed in the latest release? :o

Thanks,
Pedro

GATK v4.0.11.0 HaplotypeCaller missing SNP

Hi GATK team,

GATK v4.0.11.0 HaplotypeCaller fails to call this SNP.

$java8 -Xmx10g -jar /gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar \
    HaplotypeCaller \
    -R hg19.fa \
    -L Target.bed \
    -I sample.markDup.recalibrate.bam \
    -A Coverage \
    -A MappingQualityRankSumTest \
    -A ReadPosRankSumTest \
    -A BaseQualityRankSumTest \
    -G StandardAnnotation \
    --dbsnp hg19_dbSNP138.vcf \
    --min-base-quality-score 10 \
    --output sample.HaplotypeCaller.snp.indel.vcf \
    --stand-call-conf 10 \
    --bam-output sample.HaplotypeCaller.bam 

I have tried various parameters (--allow-non-unique-kmers-in-ref; --kmer-size 10, 25, 35), but no luck.
All base & mapping qualities of the non-reference bases are looking good.

This is the output of the HaplotypeCaller GVCF.
chr20 62316953 . A . . END=62316953 GT:DP:GQ:MIN_DP:PL 0/0:98:0:98:0,0,416

But GATK v3.2 UnifiedGenotyper detected this variant, and it passes the filters.
chr20 62316953 . A G 947.77 PASS ABHet=0.622;AC=1;AF=0.500;AN=2;BaseQRankSum=3.481;DP=98;Dels=0.00;FS=5.289;HaplotypeScore=2.3548;MLEAC=1;MLEAF=0.500;MQ=58.64;MQ0=0;MQRankSum=0.022;QD=9.67;ReadPosRankSum=-0.366;SB=-3.310e+02 GT:AD:DP:GQ:PL 0/1:61,37:98:99:976,0,1402

Any idea why this SNP was not called by HC?

Thanks for your help.

Regards
Shibu

Question about ReCapSeg .pcov files

I have a question about ReCapSeg. I would like more clarification on how to generate the .pcov and .pcov.cr.stat files.

Inputs
~~~~~~
All file lists must have corresponding entries. These files will correspond to control/normal samples.

  • control_pcov_file_list -- text file listing file paths to .pcov files. One entry per line.
  • control_cr_stat_file_list -- text file listing file paths to .pcov.cr.stat files. One entry per line.
  • control_sample_id_list -- list of sample ids for the corresponding pcov and cr stat files. One entry per line.
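
A minimal sketch of building these list files, assuming all control files sit in a single directory (the path is a placeholder):

    ls /path/to/controls/*.pcov > control_pcov_file_list
    ls /path/to/controls/*.pcov.cr.stat > control_cr_stat_file_list
    # control_sample_id_list: one sample id per line, in the same order as the two lists above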

Outputs
~~~~~~~

  • panel of normals file (hdf5)

How the HaplotypeCaller's reference confidence model works

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFs), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

  • Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
  • Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the less confident of these two models.

We use a symbolic allele, <NON_REF>, to indicate that the site is not homozygous reference, and because we thereby have an ALT allele we can provide allele-specific AD and PL field values.
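
For illustration, a per-base record carrying the symbolic allele might look like the line below (all values are invented for illustration; the second AD entry and the extra PL entries correspond to <NON_REF>):

    chr20  10000117  .  T  <NON_REF>  .  .  .  GT:AD:DP:GQ:PL  0/0:35,2:37:63:0,63,945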

For details of the gVCF format, please see the document that explains what is a gVCF.

VariantRecalibrator fails on 30x chr22 subset (GIAB NA12878)

I ran the full workflow with the 300x-depth data, but now retrying with a 30x-depth subset (required for a training session) fails.

Is there a requirement for recalibration that is set too high by default and fails on my data?
Or is it because my data is present in the calibration collections?
This GIAB data was mapped and then called using only chr22 reads (10% of the 300x) against HG38, leading to a gVCF from which I derived the VCF:

-rw-r--r-- 1 root domain users 142M Aug  3 11:17 NA12878_0.1.g.vcf.gz
-rw-r--r-- 1 root domain users  60K Aug  3 11:17 NA12878_0.1.g.vcf.gz.tbi
-rw-r--r-- 1 root domain users 4.6M Aug  3 11:22 NA12878_0.1.vcf.gz
-rw-r--r-- 1 root domain users  24K Aug  3 11:22 NA12878_0.1.vcf.gz.tbi

The command and output are below. I can add any useful extract on request.

Thanks a lot for helping me find out why this fails.

Stephane

# True sites training resource: HapMap
truetraining15=reference/hg38_v0_hapmap_3.3.hg38.vcf.gz

# True sites training resource: Omni
truetraining12=reference/hg38_v0_1000G_omni2.5.hg38.vcf.gz

# Non-true sites training resource: 1000G
nontruetraining10=reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

# Known sites resource, not used in training: dbSNP
knowntraining2=reference/hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz

# indels True sites training resource: Mills
truetrainingindel12=reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

java -Xmx${maxmem} -jar $GATK/gatk.jar VariantRecalibrator \
-R $BWA_INDEXES/NCBI_GRCh38.fa \
-V ${mappings}/${samplename}_${p}.vcf.gz \
--resource hapmap,known=false,training=true,truth=true,prior=15.0:${truetraining15} \
--resource omni,known=false,training=true,truth=true,prior=12.0:${truetraining12} \
--resource 1000g,known=false,training=true,truth=false,prior=10.0:${nontruetraining10} \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:${knowntraining2} \
--resource Mills_and_1000G_gold,known=false,training=true,truth=true,prior=12.0:${truetrainingindel12} \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
--mode BOTH \
--output ${mappings}/output.recal_${p}.vcf \
--tranches-file ${mappings}/output.tranches_${p} \
--rscript-file ${mappings}/output.plots_${p}.R

stderr

java -Xmx${maxmem} -jar $GATK/gatk.jar VariantRecalibrator \
> -R $BWA_INDEXES/NCBI_GRCh38.fa \
> -V ${mappings}/${samplename}_${p}.vcf.gz \
> --resource hapmap,known=false,training=true,truth=true,prior=15.0:${truetraining15} \
> --resource omni,known=false,training=true,truth=true,prior=12.0:${truetraining12} \
> --resource 1000g,known=false,training=true,truth=false,prior=10.0:${nontruetraining10} \
> --resource dbsnp,known=true,training=false,truth=false,prior=2.0:${knowntraining2} \
> --resource Mills_and_1000G_gold,known=false,training=true,truth=true,prior=12.0:${truetrainingindel12} \
> -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
> --mode BOTH \
> --output ${mappings}/output.recal_${p}.vcf \
> --tranches-file ${mappings}/output.tranches_${p} \
> --rscript-file ${mappings}/output.plots_${p}.R
11:33:58.090 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/biotools/gatk-4.0.7.0/gatk-package-4.0.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:33:58.338 INFO  VariantRecalibrator - ------------------------------------------------------------
11:33:58.339 INFO  VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.0.7.0
11:33:58.339 INFO  VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
11:33:58.339 INFO  VariantRecalibrator - Executing as u0002316@gbw-s-pacbio01 on Linux v4.4.0-131-generic amd64
11:33:58.339 INFO  VariantRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11
11:33:58.340 INFO  VariantRecalibrator - Start Date/Time: August 3, 2018 11:33:58 AM CEST
11:33:58.340 INFO  VariantRecalibrator - ------------------------------------------------------------
11:33:58.340 INFO  VariantRecalibrator - ------------------------------------------------------------
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Version: 2.16.0
11:33:58.340 INFO  VariantRecalibrator - Picard Version: 2.18.7
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:33:58.340 INFO  VariantRecalibrator - Deflater: IntelDeflater
11:33:58.341 INFO  VariantRecalibrator - Inflater: IntelInflater
11:33:58.341 INFO  VariantRecalibrator - GCS max retries/reopens: 20
11:33:58.341 INFO  VariantRecalibrator - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
11:33:58.341 INFO  VariantRecalibrator - Initializing engine
11:33:58.804 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_hapmap_3.3.hg38.vcf.gz
11:33:59.042 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_1000G_omni2.5.hg38.vcf.gz
11:33:59.177 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
11:33:59.290 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz
11:33:59.402 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
11:33:59.513 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/bwa_mappings_10pc/NA12878_0.1.vcf.gz
11:33:59.633 INFO  VariantRecalibrator - Done initializing engine
11:33:59.645 INFO  TrainingSet - Found hapmap track:    Known = false   Training = true         Truth = true    Prior = Q15.0
11:33:59.646 INFO  TrainingSet - Found omni track:      Known = false   Training = true         Truth = true    Prior = Q12.0
11:33:59.646 INFO  TrainingSet - Found 1000g track:     Known = false   Training = true         Truth = false   Prior = Q10.0
11:33:59.646 INFO  TrainingSet - Found dbsnp track:     Known = true    Training = false        Truth = false   Prior = Q2.0
11:33:59.646 INFO  TrainingSet - Found Mills_and_1000G_gold track:      Known = false   Training = true         Truth = true    Prior = Q12.0
11:33:59.693 INFO  ProgressMeter - Starting traversal
11:33:59.693 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
11:34:08.098 INFO  ProgressMeter -       chr22:50362905              0.1                 81887         584628.7
11:34:08.098 INFO  ProgressMeter - Traversal complete. Processed 81887 total variants in 0.1 minutes.
11:34:08.113 INFO  VariantDataManager - QD:      mean = 19.68    standard deviation = 9.58
11:34:08.124 INFO  VariantDataManager - MQ:      mean = 59.83    standard deviation = 1.70
11:34:08.132 INFO  VariantDataManager - MQRankSum:       mean = -0.02    standard deviation = 0.29
11:34:08.145 INFO  VariantDataManager - ReadPosRankSum:          mean = 0.03     standard deviation = 1.00
11:34:08.157 INFO  VariantDataManager - FS:      mean = 1.97     standard deviation = 3.36
11:34:08.165 INFO  VariantDataManager - SOR:     mean = 1.02     standard deviation = 0.58
11:34:08.173 INFO  VariantDataManager - DP:      mean = 25.19    standard deviation = 7.16
11:34:08.276 INFO  VariantDataManager - Annotations are now ordered by their information content: [MQ, DP, QD, MQRankSum, FS, SOR, ReadPosRankSum]
11:34:08.284 INFO  VariantDataManager - Training with 27312 variants after standard deviation thresholding.
11:34:08.288 INFO  GaussianMixtureModel - Initializing model with 100 k-means iterations...
11:34:09.210 INFO  VariantRecalibratorEngine - Finished iteration 0.
11:34:09.860 INFO  VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 2.94951
11:34:10.490 INFO  VariantRecalibratorEngine - Finished iteration 10.   Current change in mixture coefficients = 0.39301
11:34:11.065 INFO  VariantRecalibratorEngine - Finished iteration 15.   Current change in mixture coefficients = 0.00837
11:34:11.507 INFO  VariantRecalibratorEngine - Finished iteration 20.   Current change in mixture coefficients = 0.01513
11:34:12.037 INFO  VariantRecalibratorEngine - Finished iteration 25.   Current change in mixture coefficients = 0.01792
11:34:12.531 INFO  VariantRecalibratorEngine - Finished iteration 30.   Current change in mixture coefficients = 0.01851
11:34:12.998 INFO  VariantRecalibratorEngine - Finished iteration 35.   Current change in mixture coefficients = 0.02430
11:34:13.451 INFO  VariantRecalibratorEngine - Finished iteration 40.   Current change in mixture coefficients = 0.01579
11:34:13.891 INFO  VariantRecalibratorEngine - Finished iteration 45.   Current change in mixture coefficients = 0.00536
11:34:14.433 INFO  VariantRecalibratorEngine - Finished iteration 50.   Current change in mixture coefficients = 0.00169
11:34:14.433 INFO  VariantRecalibratorEngine - Convergence after 50 iterations!
11:34:14.549 WARN  VariantRecalibratorEngine - Model could not pre-compute denominators.
11:34:14.567 INFO  VariantDataManager - Selected worst 0 scoring variants --> variants with LOD <= -5.0000.
11:34:14.593 INFO  VariantRecalibrator - Shutting down engine
[August 3, 2018 11:34:14 AM CEST] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 0.28 minutes.
Runtime.totalMemory()=5262802944
java.lang.IllegalArgumentException: No data found.
        at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:34)
        at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.onTraversalSuccess(VariantRecalibrator.java:630)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:981)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

HC step 4: Assigning per-sample genotypes

This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as PL and GQ.

Note that this describes the regular mode of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in -ERC modes (GVCF and BP_RESOLUTION) please see the reference confidence model documentation.

Overview

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now all that remains is to evaluate those likelihoods in aggregate to determine the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihood of each possible genotype and then selecting the most likely one. This produces a genotype call as well as various metrics that will be annotated in the output VCF if a variant call is emitted.


1. Preliminary assumptions / limitations

Quality

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

Ploidy

Both HaplotypeCaller and GenotypeGVCFs (but not UnifiedGenotyper) assume by default that the organism of study is diploid, but the desired ploidy can be set using the -ploidy argument. Ploidy is taken into account in the mathematical development of the Bayesian calculation. The generalized form of the genotyping algorithm that can handle ploidies other than 2 is available as of version 3.3-0. Note that using high ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles under consideration (currently, the maximum "workable" ploidy is ~20 for a maximum of 6 alt alleles). Future developments will aim to mitigate those limitations.
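
For example, a pooled or polyploid run can be sketched like this (GATK 3.x syntax; file names are placeholders, and -ploidy 4 would suit a tetraploid organism or a pool of two diploid individuals):

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta -I pool.bam -ploidy 4 -o pool.vcf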

Paired end reads

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

Single-sample vs multi-sample

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.


2. Calculating genotype likelihoods using Bayes' Theorem

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype G given the observed data D.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype G:

$$ P(G) $$

represents how probable it is, a priori, that we see this genotype, based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value), but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype G is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod_{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype G into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.
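
To make this concrete with invented numbers: for a single read D_j and a heterozygous genotype composed of haplotypes H_1 and H_2, if the read matches H_1 well and H_2 poorly, say P(D_j|H_1) = 0.9 and P(D_j|H_2) = 0.01, then

$$ P(D_j|G) = \frac{0.9}{2} + \frac{0.01}{2} = 0.455 $$

and these per-read values are multiplied over all reads j to give P(D|G).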

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which, as it turns out, is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype and then marginalized them to find the likelihood of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype G.


3. Selecting a genotype and emitting the call record

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.
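
As a worked illustration with invented likelihoods: suppose the three genotype likelihoods come out as P(D|AA) = 10^-7, P(D|AT) = 10^-2 and P(D|TT) = 10^-6. The PL values are these likelihoods rescaled on the -10 log10 scale and normalized so that the best genotype gets 0:

$$ PL_i = -10\log_{10} P(D|G_i) - \min_i \left( -10\log_{10} P(D|G_i) \right) = (50, 0, 40) $$

so the emitted call is AT, and GQ is the difference between the two smallest PLs, here 40 (capped at 99 in the output).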

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ -- see the linked docs for details. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.

GATK v4.0.8.1 GenomicsDBImport Error (VariantStorageManagerException exception)

Hi, I am following the current Best Practices to prepare a consolidated GVCF from 5 WGS samples for joint calling with the following command, and I encounter an error:

java -Djava.io.tmpdir=/work/TMP -Xmx40g \
    -jar ~/bin/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar GenomicsDBImport \
    -V /work/Analysis/III_3P_RG_DupMark.raw.snps.indels.g.vcf \
    -V /work/Analysis/IV_11N_RG_DupMark.raw.snps.indels.g.vcf \
    -V /work/Analysis/IV_8N_RG_DupMark.raw.snps.indels.g.vcf \
    -V /work/Analysis/IV_10P_RG_DupMark.raw.snps.indels.g.vcf \
    -V /work/Analysis/IV_20P_RG_DupMark.raw.snps.indels.g.vcf \
    --genomicsdb-workspace-path /work/Analysis/wang_chr19_re \
    --intervals chr19

Error Log

15:00:35.770 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/wang/bin/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
15:00:35.944 INFO GenomicsDBImport - ------------------------------------------------------------
15:00:35.944 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.8.1
15:00:35.945 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
15:00:35.945 INFO GenomicsDBImport - Executing as wang@Ubuntu1604 on Linux v3.16.0-43-generic amd64
15:00:35.945 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-8u171-b11-2~14.04-b11
15:00:35.945 INFO GenomicsDBImport - Start Date/Time: October 2, 2018 3:00:35 PM JST
15:00:35.945 INFO GenomicsDBImport - ------------------------------------------------------------
15:00:35.945 INFO GenomicsDBImport - ------------------------------------------------------------
15:00:35.946 INFO GenomicsDBImport - HTSJDK Version: 2.16.0
15:00:35.946 INFO GenomicsDBImport - Picard Version: 2.18.7
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:00:35.946 INFO GenomicsDBImport - Deflater: IntelDeflater
15:00:35.946 INFO GenomicsDBImport - Inflater: IntelInflater
15:00:35.946 INFO GenomicsDBImport - GCS max retries/reopens: 20
15:00:35.946 INFO GenomicsDBImport - Using google-cloud-java fork https://github.com/broadinstitute/google-cloud-java/releases/tag/0.20.5-alpha-GCS-RETRY-FIX
15:00:35.946 INFO GenomicsDBImport - Initializing engine
15:00:38.360 INFO IntervalArgumentCollection - Processing 58617616 bp from intervals
15:00:38.366 INFO GenomicsDBImport - Done initializing engine
Created workspace /work/Analysis/wgs_chr19
15:00:38.849 INFO GenomicsDBImport - Vid Map JSON file will be written to /work/Analysis/wgs_chr19/vidmap.json
15:00:38.849 INFO GenomicsDBImport - Callset Map JSON file will be written to /work/Analysis/wgs_chr19/callset.json
15:00:38.849 INFO GenomicsDBImport - Complete VCF Header will be written to /work/Analysis/wgs_chr19/vcfheader.vcf
15:00:38.850 INFO GenomicsDBImport - Importing to array - /work/Analysis/wgs_chr19/genomicsdb_array
15:00:38.850 INFO ProgressMeter - Starting traversal
15:00:38.850 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
15:00:39.771 INFO GenomicsDBImport - Importing batch 1 with 5 samples
Buffer resized from 28469bytes to 32688
Buffer resized from 28473bytes to 32630
Buffer resized from 28469bytes to 32745
Buffer resized from 28469bytes to 32717
Buffer resized from 28466bytes to 32648
Buffer resized from 32688bytes to 32758
Buffer resized from 32630bytes to 32726
Buffer resized from 32648bytes to 32703
Buffer resized from 32717bytes to 32751
Buffer resized from 32703bytes to 32765
Buffer resized from 32745bytes to 32768
Buffer resized from 32726bytes to 32763
Buffer resized from 32765bytes to 32767
Buffer resized from 32758bytes to 32765
Buffer resized from 32751bytes to 32762
Buffer resized from 32767bytes to 32769
Buffer resized from 32763bytes to 32768
Buffer resized from 32762bytes to 32768
Buffer resized from 32765bytes to 32767
Buffer resized from 32767bytes to 32769
Buffer resized from 32768bytes to 32769
Buffer resized from 32768bytes to 32769
Buffer resized from 32768bytes to 32769
terminate called after throwing an instance of 'VariantStorageManagerException'
what(): VariantStorageManagerException exception : Error while syncing array chr19$1$58617616 to disk
TileDB error message : [TileDB::utils] Error: Cannot sync file '/work/Analysis/wgs_chr19/chr19$1$58617616/.__a89fdd44-1241-43ba-9072-6fcf116fbc1d139627949156096_1538460040234'; File syncing error

Things I have checked: I have confirmed there is enough disk space, and the working directory is on a shared volume.
It would be appreciated if you could help me with the troubleshooting.

Thanks

HaplotypeCaller may fail to detect a variant from the same reads with a different BAM composition.

I have run into a confusing variant detection issue. The attached PNG shows the results of the exact same NextSeq experiment, differing only in the read extraction range.

NextSeq2_point.bam: a BAM composed only of the reads that cover position chr16:89100686.

NextSeq2_region.bam: a BAM composed of the reads that cover the region chr16:89100686 ±100 bp.
At position chr16:89100686, I presume a T>C variant should be detected, but HaplotypeCaller failed to detect it with NextSeq2_region.bam.

NextSeq2_point.vcf:
chr16 89100686 . T C,<NON_REF> 7397.77 . DP=199;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=716400.00 GT:AD:DP:GQ:PL:SB 1/1:0,199,0:199:99:7426,599,0,7426,599,7426:0,0,155,44

NextSeq2_region.vcf:
chr16 89100686 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:0,199:199:0:0,0,0

What causes the difference and why?

--- GATK Version (Docker latest)
    Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
    Running:
        /gatk/build/install/gatk/bin/gatk HaplotypeCaller --version
    Version:4.0.1.2
---

--- Command used
gatk HaplotypeCaller -I /temp/NextSeq2_region.bam -O /temp/NextSeq2_region.vcf -R /temp/genome.fa -L /temp/only16.bed --debug true --output-mode EMIT_ALL_SITES --all-site-pls true --dont-trim-active-regions true --emit-ref-confidence BP_RESOLUTION
---
--- Genome Version: hg38
--- bed
chr16   89100681    89101347    NM_174917.4_cds_2_0_chr16_89100682_f    0   +

If you need the bams and vcfs, I can post them here.
