Channel: Recent Discussions — GATK-Forum

VariantRecalibrator: ERROR MESSAGE: NaN LOD value assigned.


I have some gVCF files, and I need to call variants from them. I am able to run HaplotypeCaller successfully, but VariantRecalibrator is giving me an error.

java -jar /storage/s1saini/GenomeAnalysisTK.jar -T GenotypeGVCFs -V SSC00003.g.vcf.gz -V SSC00004.g.vcf.gz -V SSC00005.g.vcf.gz -V SSC00006.g.vcf.gz -V SSC01958.g.vcf.gz -V SSC01964.g.vcf.gz -V SSC01965.g.vcf.gz -V SSC01966.g.vcf.gz -V SSC02852.g.vcf.gz -V SSC02854.g.vcf.gz -V SSC02857.g.vcf.gz -V SSC02858.g.vcf.gz -V SSC03070.g.vcf.gz -V SSC03078.g.vcf.gz -V SSC03092.g.vcf.gz -V SSC03093.g.vcf.gz -o jointcalls.vcf -R ref/human_g1k_b37_20.fasta -L 20 -nt 4
INFO  10:24:23,052 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  10:24:23,055 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 
INFO  10:24:23,055 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  10:24:23,055 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  10:24:23,055 HelpFormatter - [Tue Apr 04 10:24:23 PDT 2017] Executing on Linux 3.10.0-514.2.2.el7.x86_64 amd64 
INFO  10:24:23,055 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_111-b15 
INFO  10:24:23,059 HelpFormatter - Program Args: -T GenotypeGVCFs -V SSC00003.g.vcf.gz -V SSC00004.g.vcf.gz -V SSC00005.g.vcf.gz -V SSC00006.g.vcf.gz -V SSC01958.g.vcf.gz -V SSC01964.g.vcf.gz -V SSC01965.g.vcf.gz -V SSC01966.g.vcf.gz -V SSC02852.g.vcf.gz -V SSC02854.g.vcf.gz -V SSC02857.g.vcf.gz -V SSC02858.g.vcf.gz -V SSC03070.g.vcf.gz -V SSC03078.g.vcf.gz -V SSC03092.g.vcf.gz -V SSC03093.g.vcf.gz -o jointcalls.vcf -R ref/human_g1k_b37_20.fasta -L 20 -nt 4 
INFO  10:24:23,063 HelpFormatter - Executing as s1saini@snorlax on Linux 3.10.0-514.2.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-b15. 
INFO  10:24:23,064 HelpFormatter - Date/Time: 2017/04/04 10:24:23 
INFO  10:24:23,064 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  10:24:23,064 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  10:24:23,114 GenomeAnalysisEngine - Strictness is SILENT 
INFO  10:24:23,303 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  10:24:25,741 IntervalUtils - Processing 63025520 bp from intervals 
WARN  10:24:25,741 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,742 IndexDictionaryUtils - Track variant2 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,742 IndexDictionaryUtils - Track variant3 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,742 IndexDictionaryUtils - Track variant4 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,742 IndexDictionaryUtils - Track variant5 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,743 IndexDictionaryUtils - Track variant6 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,743 IndexDictionaryUtils - Track variant7 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,743 IndexDictionaryUtils - Track variant8 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,744 IndexDictionaryUtils - Track variant9 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,744 IndexDictionaryUtils - Track variant10 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,744 IndexDictionaryUtils - Track variant11 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,744 IndexDictionaryUtils - Track variant12 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,744 IndexDictionaryUtils - Track variant13 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,745 IndexDictionaryUtils - Track variant14 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,745 IndexDictionaryUtils - Track variant15 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,745 IndexDictionaryUtils - Track variant16 doesn't have a sequence dictionary built in, skipping dictionary validation 
INFO  10:24:25,753 MicroScheduler - Running the GATK in parallel mode with 4 total threads, 1 CPU thread(s) for each of 4 data thread(s), of 28 processors available on this machine 
INFO  10:24:25,809 GenomeAnalysisEngine - Preparing for traversal 
INFO  10:24:25,810 GenomeAnalysisEngine - Done preparing for traversal 
INFO  10:24:25,811 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  10:24:25,811 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  10:24:25,811 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
WARN  10:24:26,003 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. 
WARN  10:24:26,004 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail. 
INFO  10:24:26,005 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
WARN  10:24:28,595 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs 
WARN  10:24:31,245 ExactAFCalculator - This tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 20: 83250 has 10 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument. Unless the DEBUG logging level is used, this warning message is output just once per run and further warnings are suppressed. 

Message from syslogd@snorlax at Apr  4 10:24:45 ...
 kernel:do_IRQ: 8.228 No irq handler for vector (irq -1)
INFO  10:24:56,003 ProgressMeter -      20:3126601         0.0    30.0 s      49.9 w        5.0%    10.1 m       9.6 m 
INFO  10:25:26,005 ProgressMeter -      20:3535701         0.0    60.0 s      99.5 w        5.6%    17.8 m      16.8 m 
INFO  10:25:56,006 ProgressMeter -      20:6041401   3000000.0    90.0 s      30.0 s        9.6%    15.6 m      14.1 m 
INFO  10:26:26,008 ProgressMeter -      20:7496301   4000000.0   120.0 s      30.0 s       11.9%    16.8 m      14.8 m 
INFO  10:26:56,010 ProgressMeter -     20:11018501   8000000.0     2.5 m      18.0 s       17.5%    14.3 m      11.8 m 
INFO  10:27:26,011 ProgressMeter -     20:11547201   8000000.0     3.0 m      22.0 s       18.3%    16.4 m      13.4 m 
INFO  10:27:56,012 ProgressMeter -     20:15076001       1.2E7     3.5 m      17.0 s       23.9%    14.6 m      11.1 m 

Message from syslogd@snorlax at Apr  4 10:28:14 ...
 kernel:do_IRQ: 3.86 No irq handler for vector (irq -1)
INFO  10:28:26,013 ProgressMeter -     20:15629601       1.2E7     4.0 m      20.0 s       24.8%    16.1 m      12.1 m 
INFO  10:28:56,014 ProgressMeter -     20:19188001       1.6E7     4.5 m      16.0 s       30.4%    14.8 m      10.3 m 
INFO  10:29:26,015 ProgressMeter -     20:19745601       1.6E7     5.0 m      18.0 s       31.3%    16.0 m      11.0 m 
INFO  10:29:56,017 ProgressMeter -     20:23238001       2.0E7     5.5 m      16.0 s       36.9%    14.9 m       9.4 m 
INFO  10:30:26,018 ProgressMeter -     20:23764301       2.0E7     6.0 m      18.0 s       37.7%    15.9 m       9.9 m 
INFO  10:30:56,019 ProgressMeter -     20:29293301       2.6E7     6.5 m      15.0 s       46.5%    14.0 m       7.5 m 
INFO  10:31:26,020 ProgressMeter -     20:31020501       2.8E7     7.0 m      15.0 s       49.2%    14.2 m       7.2 m 
INFO  10:31:56,021 ProgressMeter -     20:33371001       3.0E7     7.5 m      15.0 s       52.9%    14.2 m       6.7 m 
INFO  10:32:26,022 ProgressMeter -     20:34325401       3.2E7     8.0 m      15.0 s       54.5%    14.7 m       6.7 m 
INFO  10:32:56,024 ProgressMeter -     20:37383101       3.4E7     8.5 m      15.0 s       59.3%    14.3 m       5.8 m 
INFO  10:33:26,025 ProgressMeter -     20:39016401       3.6E7     9.0 m      15.0 s       61.9%    14.5 m       5.5 m 
INFO  10:33:56,026 ProgressMeter -     20:41453001       3.8E7     9.5 m      15.0 s       65.8%    14.4 m       4.9 m 
INFO  10:34:26,027 ProgressMeter -     20:45001701       4.2E7    10.0 m      14.0 s       71.4%    14.0 m       4.0 m 
INFO  10:34:56,029 ProgressMeter -     20:46006401       4.3E7    10.5 m      14.0 s       73.0%    14.4 m       3.9 m 
INFO  10:35:26,030 ProgressMeter -     20:49063101       4.6E7    11.0 m      14.0 s       77.8%    14.1 m       3.1 m 
INFO  10:35:56,031 ProgressMeter -     20:50020001       4.7E7    11.5 m      14.0 s       79.4%    14.5 m       3.0 m 
INFO  10:36:26,032 ProgressMeter -     20:53090001       5.0E7    12.0 m      14.0 s       84.2%    14.2 m       2.2 m 
INFO  10:36:56,033 ProgressMeter -     20:54019201       5.1E7    12.5 m      14.0 s       85.7%    14.6 m       2.1 m 
INFO  10:37:26,034 ProgressMeter -     20:57112001       5.4E7    13.0 m      14.0 s       90.6%    14.3 m      80.0 s 
INFO  10:37:56,036 ProgressMeter -     20:58083301       5.5E7    13.5 m      14.0 s       92.2%    14.6 m      68.0 s 
INFO  10:38:26,037 ProgressMeter -     20:61190101       5.8E7    14.0 m      14.0 s       97.1%    14.4 m      25.0 s 
INFO  10:38:56,038 ProgressMeter -     20:62134501       5.9E7    14.5 m      14.0 s       98.6%    14.7 m      12.0 s 
INFO  10:39:26,039 ProgressMeter -     20:63025501   6.202552E7    15.0 m      14.0 s      100.0%    15.0 m       0.0 s 
INFO  10:39:49,084 ProgressMeter -            done   6.302552E7    15.4 m      14.0 s      100.0%    15.4 m       0.0 s 
INFO  10:39:49,084 ProgressMeter - Total runtime 923.27 secs, 15.39 min, 0.26 hours 
------------------------------------------------------------------------------------------
Done. There were 20 WARN messages, the first 10 are repeated below.
WARN  10:24:25,741 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,742 IndexDictionaryUtils - Track variant2 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,742 IndexDictionaryUtils - Track variant3 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,742 IndexDictionaryUtils - Track variant4 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,742 IndexDictionaryUtils - Track variant5 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,743 IndexDictionaryUtils - Track variant6 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,743 IndexDictionaryUtils - Track variant7 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,743 IndexDictionaryUtils - Track variant8 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,744 IndexDictionaryUtils - Track variant9 doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  10:24:25,744 IndexDictionaryUtils - Track variant10 doesn't have a sequence dictionary built in, skipping dictionary validation 
java -jar /storage/s1saini/GenomeAnalysisTK.jar -T VariantRecalibrator -R ref/human_g1k_b37_20.fasta -input jointcalls.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf.gz -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R
INFO  10:40:30,682 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  10:40:30,684 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 
INFO  10:40:30,685 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  10:40:30,685 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  10:40:30,685 HelpFormatter - [Tue Apr 04 10:40:30 PDT 2017] Executing on Linux 3.10.0-514.2.2.el7.x86_64 amd64 
INFO  10:40:30,685 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_111-b15 
INFO  10:40:30,689 HelpFormatter - Program Args: -T VariantRecalibrator -R ref/human_g1k_b37_20.fasta -input jointcalls.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf.gz -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R 
INFO  10:40:30,693 HelpFormatter - Executing as s1saini@snorlax on Linux 3.10.0-514.2.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-b15. 
INFO  10:40:30,694 HelpFormatter - Date/Time: 2017/04/04 10:40:30 
INFO  10:40:30,694 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  10:40:30,694 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  10:40:30,718 GenomeAnalysisEngine - Strictness is SILENT 
INFO  10:40:30,808 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
WARN  10:40:31,044 IndexDictionaryUtils - Track hapmap doesn't have a sequence dictionary built in, skipping dictionary validation 
INFO  10:40:31,165 GenomeAnalysisEngine - Preparing for traversal 
INFO  10:40:31,166 GenomeAnalysisEngine - Done preparing for traversal 
INFO  10:40:31,167 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  10:40:31,167 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  10:40:31,167 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  10:40:31,172 TrainingSet - Found hapmap track:    Known = false   Training = true     Truth = true    Prior = Q15.0 
INFO  10:40:35,327 VariantDataManager - DP:      mean = 535.96   standard deviation = 65.37 
INFO  10:40:35,483 VariantDataManager - Annotations are now ordered by their information content: [DP] 
INFO  10:40:35,498 VariantDataManager - Training with 61633 variants after standard deviation thresholding. 
INFO  10:40:35,502 GaussianMixtureModel - Initializing model with 100 k-means iterations... 
INFO  10:40:36,992 VariantRecalibratorEngine - Finished iteration 0. 
INFO  10:40:37,902 VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 0.08556 
INFO  10:40:40,563 VariantRecalibratorEngine - Finished iteration 10.   Current change in mixture coefficients = 0.04317 
INFO  10:40:42,340 VariantRecalibratorEngine - Finished iteration 15.   Current change in mixture coefficients = 0.02471 
INFO  10:40:43,232 VariantRecalibratorEngine - Finished iteration 20.   Current change in mixture coefficients = 0.01472 
INFO  10:40:43,805 VariantRecalibratorEngine - Finished iteration 25.   Current change in mixture coefficients = 0.01129 
INFO  10:40:44,384 VariantRecalibratorEngine - Finished iteration 30.   Current change in mixture coefficients = 0.01005 
INFO  10:40:44,965 VariantRecalibratorEngine - Finished iteration 35.   Current change in mixture coefficients = 0.00837 
INFO  10:40:45,538 VariantRecalibratorEngine - Finished iteration 40.   Current change in mixture coefficients = 0.00690 
INFO  10:40:46,119 VariantRecalibratorEngine - Finished iteration 45.   Current change in mixture coefficients = 0.00585 
INFO  10:40:46,703 VariantRecalibratorEngine - Finished iteration 50.   Current change in mixture coefficients = 0.00541 
INFO  10:40:47,286 VariantRecalibratorEngine - Finished iteration 55.   Current change in mixture coefficients = 0.00555 
INFO  10:40:47,866 VariantRecalibratorEngine - Finished iteration 60.   Current change in mixture coefficients = 0.00570 
INFO  10:40:48,460 VariantRecalibratorEngine - Finished iteration 65.   Current change in mixture coefficients = 0.00588 
INFO  10:40:49,048 VariantRecalibratorEngine - Finished iteration 70.   Current change in mixture coefficients = 0.00611 
INFO  10:40:49,640 VariantRecalibratorEngine - Finished iteration 75.   Current change in mixture coefficients = 0.00634 
INFO  10:40:50,456 VariantRecalibratorEngine - Finished iteration 80.   Current change in mixture coefficients = 0.00651 
INFO  10:40:51,053 VariantRecalibratorEngine - Finished iteration 85.   Current change in mixture coefficients = 0.00651 
INFO  10:40:51,651 VariantRecalibratorEngine - Finished iteration 90.   Current change in mixture coefficients = 0.00626 
INFO  10:40:52,249 VariantRecalibratorEngine - Finished iteration 95.   Current change in mixture coefficients = 0.00575 
INFO  10:40:52,841 VariantRecalibratorEngine - Finished iteration 100.  Current change in mixture coefficients = 0.00508 
INFO  10:40:53,434 VariantRecalibratorEngine - Finished iteration 105.  Current change in mixture coefficients = 0.00436 
INFO  10:40:54,050 VariantRecalibratorEngine - Finished iteration 110.  Current change in mixture coefficients = 0.00368 
INFO  10:40:54,668 VariantRecalibratorEngine - Finished iteration 115.  Current change in mixture coefficients = 0.00308 
INFO  10:40:55,282 VariantRecalibratorEngine - Finished iteration 120.  Current change in mixture coefficients = 0.00257 
INFO  10:40:55,905 VariantRecalibratorEngine - Finished iteration 125.  Current change in mixture coefficients = 0.00213 
INFO  10:40:56,161 VariantRecalibratorEngine - Convergence after 127 iterations! 
INFO  10:40:56,243 VariantRecalibratorEngine - Evaluating full set of 188987 variants... 
INFO  10:40:56,546 VariantDataManager - Training with worst 856 scoring variants --> variants with LOD <= -5.0000. 
INFO  10:40:56,546 GaussianMixtureModel - Initializing model with 100 k-means iterations... 
INFO  10:40:56,551 VariantRecalibratorEngine - Finished iteration 0. 
INFO  10:40:56,554 VariantRecalibratorEngine - Convergence after 3 iterations! 
INFO  10:40:56,563 VariantRecalibratorEngine - Evaluating full set of 188987 variants... 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example).
##### ERROR ------------------------------------------------------------------------------------------

I don't believe this is due to a small dataset: I am working with 16 samples on chromosome 20.


Picard vcf filtering


Hi, I am using Picard to filter a VCF file, but it does not seem to work properly. I am filtering on read depth with MIN_DP=50, yet I still see many genotypes in the resulting file with coverage below 50.

Here is my call:
java -jar /tools/picard-2.9.0/picard.jar FilterVcf I=output.vcf O=output2.vcf MIN_AB=0.4 MIN_DP=50 MIN_GQ=30

Run GATK on files stored in AWS S3?


Hi,
Same as the title: is it possible to run GATK on files stored in S3?

VCF input and BaseRecalibrator error


I'm trying to run BaseRecalibrator with GATK4-Alpha. I'm using --knownSites to input a vcf file as follows:

java -Xmx80G -jar $GATK BaseRecalibrator \
-R reference.fa \
-I ${SAMPLE}_dedup.bam \
-knownSites KnownVariants.vcf \
-BQSR recal.table \
-O after_recal.table

However, it returns the following error:

A USER ERROR has occurred: Cannot read KnownVariants.vcf because no suitable codecs found

I've used this input file with GATK 3.5.0 and it was fine.

Any guidance on what the problem might be would be much appreciated.

Variant detection in RNAseq: when to merge samples


Hi, I have followed the recommendations for my RNAseq variant search as outlined here:
http://gatkforums.broadinstitute.org/gatk/discussion/4067/best-practices-for-variant-discovery-in-rnaseq

Since I am pretty new to the GATK world, silly questions are bound to emerge. Anyway, I have followed the linked Best Practices and am now in the process of doing base recalibration. I have treated my four lines individually, meaning that all the processing so far has been done individual by individual. My question is whether I have done this correctly by treating the individuals separately up to now.
Since I am looking for SNPs, I assume that at some stage I need to merge them together, but should I have done that at an earlier stage?

Feedback/comments are greatly appreciated.
Thank you.

jahn

Mutect is not working


Dear Cancer team,

I installed mvn, gatk-protected, and mutect (https://github.com/broadinstitute/mutect). After that, running it produced the following error message:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ExceptionInInitializerError
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(GenomeAnalysisEngine.java:167)
at org.broadinstitute.sting.gatk.CommandLineExecutable.(CommandLineExecutable.java:57)
at org.broadinstitute.sting.gatk.CommandLineGATK.(CommandLineGATK.java:66)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:106)
Caused by: java.lang.NullPointerException
at org.reflections.Reflections.scan(Reflections.java:220)
at org.reflections.Reflections.scan(Reflections.java:166)
at org.reflections.Reflections.(Reflections.java:94)
at org.broadinstitute.sting.utils.classloader.PluginManager.(PluginManager.java:79)
... 4 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-0-g72492bb):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

It seems to be a GATK version mismatch. Where can I download a version of MuTect that is compatible with GATK 3.3 or GATK 3.1?

Thanks,
Woody

VQSLOD score different for Full and split (by chromosome) BAM file


Dear all,

I am generating VCFs with GATK from a full BAM file and from BAM files split by chromosome. Both runs complete successfully, but the VQSLOD scores differ:

For full : VQSLOD=-6.339e-01

For split : VQSLOD=-1.472e+00

Does this mean that splitting the BAM is incorrect, or is it caused by wrong parameters? Or is this expected, and can I safely ignore the difference?

Genotype Refinement workflow


Overview

This document describes the purpose and general principles of the Genotype Refinement workflow. For the mathematical details of the methods involved, please see the Genotype Refinement math documentation. For step-by-step instructions on how to apply this workflow to your data, please see the Genotype Refinement tutorial.


1. Introduction

The core GATK Best Practices workflow has historically focused on variant discovery -- that is, on the existence of genomic variants in one or more samples in a cohort -- and consistently delivers high-quality results when applied appropriately. However, we know that the quality of the individual genotype calls coming out of the variant callers can vary widely based on the quality of the BAM data for each sample. The goal of the Genotype Refinement workflow is to use additional data to improve the accuracy of genotype calls and to filter out genotype calls that are not reliable enough for downstream analysis. In this sense it serves as an optional extension of the variant calling workflow, intended for researchers whose work requires high-quality identification of individual genotypes.

A few commonly asked questions are:

What studies can benefit from the Genotype Refinement workflow?

While every study can benefit from increased data accuracy, this workflow is especially useful for analyses that are concerned with how many copies of each variant an individual has (e.g. in the case of loss of function) or with the transmission (or de novo origin) of a variant in a family.

What additional data do I need to run the Genotype Refinement workflow?

If a “gold standard” dataset for SNPs is available, that can be used as a very powerful set of priors on the genotype likelihoods in your data. For analyses involving families, a pedigree file describing the relatedness of the trios in your study will provide another source of supplemental information. If neither of these applies to your data, the samples in the dataset itself can provide some degree of genotype refinement (see section 5 below for details).
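For reference, GATK pedigree files use the standard six-column PED format (family ID, individual ID, father ID, mother ID, sex with 1=male/2=female, phenotype); founders use 0 for unknown parents. The sample names below are hypothetical:

```
FAM001  child01   father01  mother01  1  0
FAM001  father01  0         0         1  0
FAM001  mother01  0         0         2  0
```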

Is the Genotype Refinement workflow going to change my data? Can I still use my old analysis pipeline?

After running the Genotype Refinement workflow, several new annotations will be added to the INFO and FORMAT fields of your variants (see below), GQ fields will be updated, and genotype calls may be modified. However, the Phred-scaled genotype likelihoods (PLs) which indicate the original genotype call (the genotype candidate with PL=0) will remain untouched. Any analysis that made use of the PLs will produce the same results as before.


2. The Genotype Refinement workflow

Overview

[Workflow diagram]

Input

Begin with recalibrated variants from VQSR at the end of the best practices pipeline. The filters applied by VQSR will be carried through the Genotype Refinement workflow.

Step 1: Derive posterior probabilities of genotypes

Tool used: CalculateGenotypePosteriors

Using the Phred-scaled genotype likelihoods (PLs) for each sample, prior probabilities for a sample taking on a HomRef, Het, or HomVar genotype are applied to derive the posterior probabilities of the sample taking on each of those genotypes. A sample’s PLs were calculated by HaplotypeCaller using only the reads for that sample. By introducing additional data like the allele counts from the 1000 Genomes project and the PLs for other individuals in the sample’s pedigree trio, those estimates of genotype likelihood can be improved based on what is known about the variation of other individuals.

SNP calls from the 1000 Genomes project capture the vast majority of variation across most human populations and can provide very strong priors in many cases. At sites where most of the 1000 Genomes samples are homozygous variant with respect to the reference genome, the probability that a sample being analyzed is also homozygous variant is very high.

For a sample for which both parent genotypes are available, the child’s genotype can be supported or invalidated by the parents’ genotypes based on Mendel’s laws of allele transmission. Even the confidence of the parents’ genotypes can be recalibrated, such as in cases where the genotypes output by HaplotypeCaller are apparent Mendelian violations.
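As a rough illustration of how priors reshape the likelihoods, here is a minimal sketch in Python. It mirrors the Bayesian idea described above, not GATK's exact implementation, and the prior values are hypothetical:

```python
import math

def phred_to_prob(phred):
    """Convert a Phred-scaled value to a probability in [0, 1]."""
    return 10 ** (-phred / 10)

def apply_priors(pls, priors):
    """Combine Phred-scaled genotype likelihoods (PLs) with genotype
    prior probabilities via Bayes' rule, returning Phred-scaled
    posteriors (PPs) normalized so the best genotype has PP = 0,
    mirroring the PL convention."""
    unnormalized = [phred_to_prob(pl) * prior for pl, prior in zip(pls, priors)]
    total = sum(unnormalized)
    pps = [-10 * math.log10(u / total) for u in unnormalized]
    best = min(pps)
    return [round(pp - best) for pp in pps]

# A child whose PLs leave HomRef and Het equally likely is pulled
# toward Het by a hypothetical prior favoring the heterozygous call:
print(apply_priors([0, 0, 249], [0.1, 0.8, 0.1]))
```

The same mechanism applies whether the priors come from population allele counts or from the parents' genotypes; only the source of the prior vector changes.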

Step 2: Filter low quality genotypes

Tool used: VariantFiltration

After the posterior probabilities are calculated for each sample at each variant site, genotypes with GQ < 20 based on the posteriors are filtered out. GQ 20 is widely accepted as a good threshold for genotype accuracy, indicating that there is a 99% chance that the genotype in question is correct. Tagging low-quality genotypes in this way signals to researchers that those genotypes may not be suitable for downstream analysis. However, as with VQSR, only a filter tag is applied; the data is not removed from the VCF.
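The GQ 20 / 99% correspondence is just the standard Phred conversion, which a one-line helper confirms:

```python
def phred_error_prob(gq):
    """Probability that a call is wrong, given its Phred-scaled GQ."""
    return 10 ** (-gq / 10)

# GQ 20 corresponds to a 1% error probability, i.e. 99% accuracy.
print(1 - phred_error_prob(20))
```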

Step 3: Annotate possible de novo mutations

Tool used: VariantAnnotator

Using the posterior genotype probabilities, possible de novo mutations are tagged. Low confidence de novos have child GQ >= 10 and AC < 4 or AF < 0.1%, whichever is more stringent for the number of samples in the dataset. High confidence de novo sites have all trio sample GQs >= 20 with the same AC/AF criterion.
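The tiers above can be summarized in a small sketch. The thresholds are taken from the text; the function name and the exact way the rarity criterion is combined are illustrative, not GATK's implementation:

```python
def classify_de_novo(child_gq, mother_gq, father_gq, ac, af):
    """Classify a candidate de novo call as 'high' or 'low' confidence
    per the thresholds described above, or None if neither tier applies.
    The AC < 4 / AF < 0.1% rarity check is shown as a simple 'or';
    GATK applies whichever criterion is more stringent for the cohort."""
    rare = ac < 4 or af < 0.001
    if not rare:
        return None
    if min(child_gq, mother_gq, father_gq) >= 20:
        return "high"   # all trio GQs pass the stricter threshold
    if child_gq >= 10:
        return "low"    # only the child's GQ passes the lenient threshold
    return None
```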

Step 4: Functional annotation of possible biological effects

Tool options: SnpEff or Oncotator (both are non-GATK tools)

Especially in the case of de novo mutation detection, analysis can benefit from the functional annotation of variants to restrict variants to exons and surrounding regulatory regions. The GATK currently does not feature integration with any functional annotation tool, but SnpEff and Oncotator are useful utilities that can work with the GATK's VCF output.


3. Output annotations

The Genotype Refinement Pipeline adds several new info- and format-level annotations to each variant. GQ fields will be updated, and genotypes calculated to be highly likely to be incorrect will be changed. The Phred-scaled genotype likelihoods (PLs) carry through the pipeline without being changed. In this way, PLs can be used to derive the original genotypes in cases where sample genotypes were changed.

Population Priors

New INFO field annotation PG is a vector of the Phred-scaled prior probabilities of a sample at that site being HomRef, Het, and HomVar. These priors are based on the input samples themselves along with data from the supporting samples if the variant in question overlaps another in the supporting dataset.

Phred-Scaled Posterior Probability

New FORMAT field annotation PP is the Phred-scaled posterior probability of the sample taking on each genotype for the given variant context alleles. The PPs represent a better-calibrated estimate of genotype probabilities than the PLs and are recommended for use in further analyses instead of the PLs.

Genotype Quality

Current FORMAT field annotation GQ is updated based on the PPs. The calculation is the same as for GQ based on PLs.

Joint Trio Likelihood

New FORMAT field annotation JL is the Phred-scaled joint likelihood of the posterior genotypes for the trio being incorrect. This calculation is based on the PLs produced by HaplotypeCaller (before application of priors), but the genotypes used come from the posteriors. The goal of this annotation is to be used in combination with JP to evaluate the improvement in the overall confidence in the trio’s genotypes after applying CalculateGenotypePosteriors. The calculation of the joint likelihood is given as:

$$ JL = -10\log_{10} \big( 1 - GL_{mother}[\text{Posterior mother GT}] \times GL_{father}[\text{Posterior father GT}] \times GL_{child}[\text{Posterior child GT}] \big) $$

where the GLs are the genotype likelihoods in [0, 1] probability space.

Joint Trio Posterior

New FORMAT field annotation JP is the Phred-scaled posterior probability of the output posterior genotypes for the three samples being incorrect. The calculation of the joint posterior is given as:

$$ -10*\log (1-GP_{mother}[\text{Posterior mother GT}] * GP_{father}[\text{Posterior father GT}] * GP_{child}[\text{Posterior child GT}] )$$

where the GPs are the genotype posteriors in [0, 1] probability space.
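Both formulas have the same shape, so a single sketch covers them. The helper below is illustrative only (hypothetical names, not GATK source); it converts each sample's Phred-scaled value for its posterior genotype back to [0, 1] probability space and combines them:

```python
import math

# Illustrative sketch of the JL/JP formulas above (not GATK source code).
# q values are the Phred-scaled GL (for JL) or GP (for JP) of each sample's
# posterior genotype.
def phred_to_prob(q):
    return 10 ** (-q / 10.0)

def joint_phred(mother_q, father_q, child_q):
    # probability that all three posterior genotypes are correct
    p = phred_to_prob(mother_q) * phred_to_prob(father_q) * phred_to_prob(child_q)
    return -10 * math.log10(1 - p)
```

Note that if all three probabilities were exactly 1, the argument of the log would be 0, so a real implementation must cap the result.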

Low Genotype Quality

New FORMAT field filter lowGQ indicates samples with posterior GQ less than 20. Filtered samples tagged with lowGQ are not recommended for use in downstream analyses.

High and Low Confidence De Novo

New INFO field annotation for sites at which at least one family has a possible de novo mutation. Following the annotation tag is a list of the children with de novo mutations. High and low confidence are output separately.


4. Example

Before:

1       1226231 rs13306638      G       A       167563.16       PASS    AC=2;AF=0.333;AN=6;…        GT:AD:DP:GQ:PL  0/0:11,0:11:0:0,0,249   0/0:10,0:10:24:0,24,360 1/1:0,18:18:60:889,60,0

After:

1       1226231 rs13306638      G       A       167563.16       PASS    AC=3;AF=0.500;AN=6;…PG=0,8,22;…    GT:AD:DP:GQ:JL:JP:PL:PP 0/1:11,0:11:49:2:24:0,0,249:49,0,287    0/0:10,0:10:32:2:24:0,24,360:0,32,439   1/1:0,18:18:43:2:24:889,60,0:867,43,0

The original call for the child (first sample) was HomRef with GQ0. However, given that, with high confidence, one parent is HomRef and one is HomVar, we expect the child to be heterozygous at this site. After family priors are applied, the child’s genotype is corrected and its GQ is increased from 0 to 49. Based on the allele frequency from 1000 Genomes for this site, the somewhat weaker population priors favor a HomRef call (PG=0,8,22). The combined effect of family and population priors still favors a Het call for the child.

The joint likelihood for this trio at this site is two, indicating that the genotype for one of the samples may have been changed. Specifically, a low JL indicates that the posterior genotype for at least one of the samples was not the most likely as predicted by the PLs. The joint posterior value for the trio is 24, which indicates that the GQ values based on the posteriors for all of the samples are at least 24. (See above for a more complete description of JL and JP.)


5. More information about priors

The Genotype Refinement Pipeline uses Bayes’s Rule to combine independent data with the genotype likelihoods derived from HaplotypeCaller, producing more accurate and confident genotype posterior probabilities. Different sites will have different combinations of priors applied based on the overlap of each site with external, supporting SNP calls and on the availability of genotype calls for the samples in each trio.

Input-derived Population Priors

If the input VCF contains at least 10 samples, then population priors will be calculated based on the discovered allele count for every called variant.
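As a rough illustration of how an allele count can translate into Phred-scaled genotype priors like the PG annotation above, here is a hypothetical sketch using Hardy-Weinberg proportions (this is an illustration only, not the exact calculation GATK performs):

```python
import math

# Hypothetical sketch: Phred-scaled genotype priors (HomRef, Het, HomVar)
# from an observed alt allele count, assuming Hardy-Weinberg proportions.
# Assumes 0 < alt_ac < total_an so no probability is exactly zero.
def phred_priors(alt_ac, total_an):
    q = alt_ac / total_an      # alt allele frequency
    p = 1.0 - q
    probs = [p * p, 2 * p * q, q * q]
    return [round(-10 * math.log10(x), 1) for x in probs]

print(phred_priors(3, 6))  # [6.0, 3.0, 6.0]
```

As expected, rarer alternate alleles yield stronger priors against Het and HomVar calls.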

Supporting Population Priors

Priors derived from supporting SNP calls can only be applied at sites where the supporting calls overlap with called variants in the input VCF. The values of these priors vary based on the called reference and alternate allele counts in the supporting VCF. Higher allele counts (for ref or alt) yield stronger priors.

Family Priors

The strongest family priors occur at sites where the called trio genotype configuration is a Mendelian violation. In such a case, each Mendelian violation configuration is penalized by a de novo mutation probability (currently 10^-6). Confidence also propagates through a trio. For example, two GQ60 HomRef parents can substantially boost a low-GQ HomRef child, and a GQ60 HomRef child and parent can improve the GQ of the second parent. Application of family priors requires the child to be called at the site in question. If one parent has a no-call genotype, priors can still be applied, but the potential for confidence improvement is not as great as in the 3-sample case.

Caveats

Right now family priors can only be applied to biallelic variants and population priors can only be applied to SNPs. Family priors only work for trios.


Empty ContEst Output


Hello,

I have some TruSeq cancer panel amplicon data and I am in the process of calling somatic variants in tumor samples with MuTect. Using the default fraction_contamination, I am not getting any entries with "KEEP" status. However, reducing this value does start yielding SNPs. I am not sure what the ideal value would be here, hence I have been trying to run ContEst with GATK 3.7 as suggested on the website. But I am not getting any output. Here are the details of the command:

GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T ContEst -R hg19.fa -I:eval tumor_sorted.bam -I:genotype normal_sorted.bam --popfile popaf/hapmap_3.3_hg19_pop_stratified_af_sorted.vcf -isr INTERSECTION -o output.txt -L targets.bed

Here are the STDOUT and STDERR:

INFO  13:53:43,463 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:53:43,652 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO  13:53:43,652 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  13:53:43,652 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  13:53:43,652 HelpFormatter - [Wed Apr 05 13:53:43 BST 2017] Executing on Linux 3.10.0-229.el7.x86_64 amd64
INFO  13:53:43,653 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15
INFO  13:53:43,656 HelpFormatter - Program Args: -T ContEst -R hg19.fa -I:eval tumor_sorted.bam -I:genotype normal_sorted.bam --popfile popaf/hapmap_3.3_hg19_pop_stratified_af_sorted.vcf -isr INTERSECTION -o output.txt -L targets.bed
INFO  13:53:43,892 HelpFormatter - Executing as urmi208@node6 on Linux 3.10.0-229.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15.
INFO  13:53:43,892 HelpFormatter - Date/Time: 2017/04/05 13:53:43
INFO  13:53:43,892 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:53:43,892 HelpFormatter - --------------------------------------------------------------------------------
INFO  13:53:43,905 GenomeAnalysisEngine - Strictness is SILENT
INFO  13:53:44,787 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  13:53:44,794 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  13:53:44,973 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.17
INFO  13:53:45,491 IntervalUtils - Processing 38385 bp from intervals
INFO  13:53:46,311 GenomeAnalysisEngine - Preparing for traversal over 2 BAM files
INFO  13:53:46,390 GenomeAnalysisEngine - Done preparing for traversal
INFO  13:53:46,390 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  13:53:46,391 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  13:53:46,391 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  13:53:46,392 ContEst - Running in sequencing mode
INFO  13:54:08,909 ContEst - Total sites:  37742
INFO  13:54:08,910 ContEst - Population informed sites:  37
INFO  13:54:08,910 ContEst - Non homozygous variant sites: 32
INFO  13:54:08,910 ContEst - Homozygous variant sites: 5
INFO  13:54:08,910 ContEst - Passed coverage: 5
INFO  13:54:08,911 ContEst - Results: 0
INFO  13:54:08,913 ProgressMeter -            done     68536.0    22.0 s       5.5 m       99.8%    22.0 s       0.0 s
INFO  13:54:08,914 ProgressMeter - Total runtime 22.52 secs, 0.38 min, 0.01 hours
INFO  13:54:08,914 MicroScheduler - 918 reads were filtered out during the traversal out of approximately 893918 total reads (0.10%)
INFO  13:54:08,914 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
INFO  13:54:08,915 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  13:54:08,915 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  13:54:08,915 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  13:54:08,915 MicroScheduler -   -> 918 reads (0.10% of total) failing NotPrimaryAlignmentFilter
INFO  13:54:08,915 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter
Done. ------------------------------------------------------------------------------------------
There were no warn messages.

It seems that most things are filtered out, as the data is downsampled with a target coverage of 1000. I would be grateful if you could provide any help.

Many thanks.

variant calling with biological replicates


I am new to using the GATK pipeline for SNP calling. I am currently working on four different populations (RNAseq data) with 6 clones. Each clone has 3 biological replicates. How do I combine the variant calling step for the replicates?
I went through the GATK documentation on cohorts, but I am not sure whether I should continue down that path.
Is combining the resultant VCF files also a possibility?
Please help!

I get very different MQ values when using GVCF vs BP_RESOLUTION


Hello! I had a question about the difference between using HaplotypeCaller's --emitRefConfidence GVCF vs BP_RESOLUTION. Maybe the answer is obvious or already somewhere in the forum, but I couldn't spot it...

First, some context: I'm working with GATK v3.5.0 in a haploid organism. I have 34 samples, of which 5 are very similar to the reference (they are backcrosses) while the rest are strains from a wild population. Originally I used --emitRefConfidence GVCF followed by GenotypeGVCFs. While checking the output VCF file, I realized that the five backcrosses had a much lower DP on average than the other samples (this can't be explained by differences in read numbers or anything like that, since they were run in the same lane, etc.). I assume this happened because those samples have long tracts without any variant compared to the reference, so the GVCF blocks end up assigning a lower depth for a great number of sites in those samples compared to the much more polymorphic ones. In any case, I figured I could just get all sites using BP_RESOLUTION so as to obtain the "true" DP values per site. However, when I tried that, the resulting VCF file had very low MQ values! Can you explain why this happened?

This is the original file with --emitRefConfidence GVCF:

$ bcftools view -H 34snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   309.4   .   AC=4;AF=0.235;AN=17;DP=582;FS=0;MLEAC=4;MLEAF=0.235;MQ=40;QD=34.24;SOR=2.303
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=603;FS=0;MLEAC=2;MLEAF=0.065;MQ=44.44;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-0.762;ClippingRankSum=0.762;DP=660;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=29.53;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

And this is with --emitRefConfidence BP_RESOLUTION:

$ bcftools view -H 34allgenome_snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   307.28  .   AC=4;AF=0.211;AN=19;DP=602;FS=0;MLEAC=4;MLEAF=0.211;MQ=8.23;QD=34.24;SOR=2.204
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=750;FS=0;MLEAC=2;MLEAF=0.065;MQ=5.53;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-1.179;ClippingRankSum=0.762;DP=796;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=4.8;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

I find it particularly strange since the mapping quality of the backcrosses should in fact be slightly better on average (around 59 in the original BAM files) than that of the other, more polymorphic samples (around 58)...

Thank you very much!

Web-based Oncotator server


There is a web-based version of Oncotator which you can use for annotation without running anything on your own machine.

However, please note that the web-based version is an older version, with fewer datasources and many limitations. We urge you to use the downloadable version instead, and at this time we do not provide user support for the web-based version. It is simply provided as-is.

Note also that on rare occasions the server malfunctions and needs to be rebooted. If you experience any server errors (e.g. an error message stating that the server is unavailable), please post a note in the thread below and we'll reboot it as soon as we can.

Advice for running GenotypeVCF and Recalibration on a thousand samples


Hi,

I am working with a couple thousand gVCFs. I plan to run GenotypeGVCFs and ApplyRecalibration on them in a joint fashion. I want to know how GATK scales at these numbers, and whether it is even recommended to run this analysis at this scale.

I tried running GenotypeGVCFs in multi-threaded mode, but it runs into a race condition (or some other issue), so a single thread seems the only viable option.

Posters on somatic analysis with GATK4 presented at AACR 2017


A few of us GATKers (among a flood of other Broadies) traveled to Washington, DC this week for the General Meeting of the American Association for Cancer Research (AACR). Here are PDF copies of the posters we presented on Tuesday morning.

Abbreviated title                                   Presenter                  Link
Somatic mutation discovery with GATK4               Geraldine Van der Auwera   PDF
Allelic Copy Number Variation Discovery             Aaron Chevalier            PDF
Copy Number Variation Discovery in WGS and Exomes   Mehrtash Babadi            PDF

Incidentally, it's the end of the conference so now 10,000 people are trying to get home, and apparently half of them are going to Boston. I was hoping to catch an earlier flight on standby; the gate attendant laughed so hard. Most of the flights are overbooked to start with. So I have some time to kill until 9 PM. Well, I guess there's plenty of documentation in need of writing!

Anticipate "--fix_misencoded_quality_scores"


Hello everyone,

I've created a full Variant Calling pipeline (on Galaxy). There is of course the IndelRealignment phase done by GATK.
We all know the solution to the "SAM/BAM file x appears to be using the wrong encoding for quality scores" problem: applying (or not applying) the "--fix_misencoded_quality_scores" option.

My question is: can we determine whether this option is needed before actually running the pipeline? Until now I have been running the pipeline and changing the option if it failed, which of course wastes time.

I thank you in advance for your help.
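For reference, the encoding can usually be determined up front by scanning the quality characters themselves. Phred+33 ("Sanger") qualities occupy ASCII 33-74, while Phred+64 ("Illumina 1.3+") qualities occupy ASCII 64-104, so any character below ASCII 59 rules out Phred+64. A minimal sketch (hypothetical helper, assuming plain or gzipped FASTQ input; not part of GATK):

```python
import gzip
import itertools

# Hypothetical helper (not part of GATK): guess the quality-score encoding of
# a FASTQ file by looking at the lowest quality character observed.
def guess_encoding(fastq_path, max_reads=10000):
    opener = gzip.open if fastq_path.endswith(".gz") else open
    lowest = 255
    with opener(fastq_path, "rt") as fh:
        # every 4th line of a FASTQ record is the quality string
        for qual in itertools.islice(fh, 3, 4 * max_reads, 4):
            q = qual.rstrip("\n")
            if q:
                lowest = min(lowest, min(map(ord, q)))
    if lowest < 59:
        return "phred+33"   # no --fix_misencoded_quality_scores needed
    if lowest >= 64:
        return "phred+64"   # apply --fix_misencoded_quality_scores
    return "ambiguous"      # range 59-63: inspect more reads
```

One well-known caveat: high-quality Phred+33 data whose characters all fall in ASCII 64-74 is indistinguishable from low-quality Phred+64 data, so sampling many reads (or the whole file) is safer than sampling a few.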


Questions about DepthOfCoverage

Best Practices for Variant Discovery in RNAseq


This article is part of the Best Practices documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full documentation set.

This is our recommended workflow for calling variants in RNAseq data from single samples, in which all steps are performed per-sample. In the future we will provide cohort analysis recommendations, but these are not yet available.

image

The workflow is divided into three main sections that are meant to be performed sequentially:

  • Pre-processing: from raw RNAseq sequence reads (FASTQ files) to analysis-ready reads (BAM files)
  • Variant discovery: from reads (BAM files) to variants (VCF files)
  • Refinement and evaluation: genotype refinement, functional annotation and callset QC

Compared to the DNAseq Best Practices, the key adaptations for calling variants in RNAseq focus on handling splice junctions correctly. These involve specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller, and are highlighted in the figure below.

image


Pre-Processing

The data generated by the sequencers is put through several pre-processing steps to make it suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Split'N'Trim; Local Realignment Around Indels (optional); and Base Quality Score Recalibration (BQSR), performed in that order.

Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using the STAR aligner (2-pass protocol) to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.

Split'N'Trim

Then, an RNAseq-specific step is applied: reads with N operators in the CIGAR strings (which denote the presence of a splice junction) are split into component reads and trimmed to remove any overhangs into splice junctions, which reduces the occurrence of artifacts. At this step, we also reassign mapping qualities from 255 (assigned by STAR) to 60 which is more meaningful for GATK tools.

Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads. This step is considered optional for RNAseq.

Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.


Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample) and variant filtering (also performed per-sample). The first step is designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

Our current recommendation for RNAseq is to run all these steps per-sample. At the moment, we do not recommend applying the GVCF-based workflow to RNAseq data because although there is no obvious obstacle to doing so, we have not validated that configuration. Therefore, we cannot guarantee the quality of results that this would produce.

Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample VCFs containing raw SNP and indel calls.

Per-Sample Variant Filtering

For RNAseq, it is not appropriate to apply variant recalibration in its present form. Instead, we provide hard-filtering recommendations to filter variants based on specific annotation value thresholds. This produces a VCF of calls annotated with filtering information that can then be used in downstream analyses.
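Conceptually, a hard filter is just a set of threshold tests on INFO annotations. A minimal sketch (the helper itself is hypothetical; the FS > 30.0 and QD < 2.0 cutoffs follow the RNAseq hard-filtering recommendations, but always check the current documentation for up-to-date values):

```python
# Hypothetical helper (not a GATK tool): apply simple hard-filter thresholds
# to the INFO field of a VCF record, returning the value a FILTER column
# might carry.
def hard_filter(info_field):
    # flag-style entries without "=" (e.g. DB) are skipped
    info = dict(kv.split("=", 1) for kv in info_field.split(";") if "=" in kv)
    failed = []
    if float(info.get("FS", 0.0)) > 30.0:
        failed.append("FS")
    if float(info.get("QD", 100.0)) < 2.0:
        failed.append("QD")
    return "PASS" if not failed else ";".join(failed)

# A record passing both thresholds:
print(hard_filter("AC=2;AF=0.065;DP=603;FS=0;MQ=44.44;QD=30.63"))  # PASS
```

In practice this is done with VariantFiltration rather than custom code; the sketch only shows what the thresholds mean.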


Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.


Important note on GATK versions


The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

How can I access the GSA public FTP server?


NOTE: This article will be deprecated in the near future as this information will be consolidated elsewhere.

We make various files available for public download from the GSA FTP server, such as the GATK resource bundle and presentation slides. We also maintain a public upload feature for processing bug reports from users.

There are two logins to choose from depending on whether you want to upload or download something:

Downloading

location: ftp.broadinstitute.org
username: gsapubftp-anonymous
password: <blank>

Uploading

location: ftp.broadinstitute.org
username: gsapubftp
password: 5WvQWSfi

Using a browser as FTP client

If you use your browser as FTP client, make sure to include the login information in the address, otherwise you will access the general Broad Institute FTP instead of our team FTP. This should work as a direct link (for downloading only):

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle

How can I invoke read filters and their arguments?


Most GATK tools apply several read filters by default. You can look up exactly what the defaults are for each tool in their respective Technical Documentation pages.

But sometimes you want to specify additional filters yourself (and before you ask, no, you cannot disable the default read filters used by a given tool). This is how you do it:

The --read_filter argument (or -rf for short) allows you to apply whatever read filters you'd like. For example, to add the MaxReadLengthFilter to PrintReads, you just add this to your command line:

--read_filter MaxReadLength 

Notice that when you specify a read filter, you need to strip the Filter part of its name off!

The read filter will be applied with its default value (which you can also look up in the Tech Docs for that filter). Now, if you want to specify a different value from the default, you pass the relevant argument by adding this right after the read filter:

--read_filter MaxReadLength -maxReadLength 76

It's important that you pass the argument right after the filter itself, otherwise the command line parser won't know that they're supposed to go together.

And of course, you can add as many filters as you like by using multiple copies of the --read_filter parameter:

--read_filter MaxReadLength -maxReadLength 76 --read_filter ZeroMappingQualityRead

Purpose and operation of Read-backed Phasing


This document describes the underlying concepts of physical phasing as applied in the ReadBackedPhasing tool. For a complete, detailed argument reference, refer to the tool documentation page.

Note that as of GATK 3.3, physical phasing is performed automatically by HaplotypeCaller when it is run in -ERC GVCF or -ERC BP_RESOLUTION mode, so post-processing variant calls with ReadBackedPhasing is no longer necessary unless you want to merge consecutive variants into MNPs.


Underlying concepts

The biological unit of inheritance from each parent in a diploid organism is a set of single chromosomes, so that a diploid organism contains a set of pairs of corresponding chromosomes. The full sequence of each inherited chromosome is also known as a haplotype. It is critical to ascertain which variants are associated with one another in a particular individual. For example, if an individual's DNA possesses two consecutive heterozygous sites in a protein-coding sequence, there are two alternative scenarios of how these variants interact and affect the phenotype of the individual. In one scenario, they are on two different chromosomes, so each one has its own separate effect. On the other hand, if they co-occur on the same chromosome, they are thus expressed in the same protein molecule; moreover, if they are within the same codon, they are highly likely to encode an amino acid that is non-synonymous (relative to the other chromosome). The ReadBackedPhasing program serves to discover these haplotypes based on high-throughput sequencing reads.


How it works

The first step in phasing is to call variants ("genotype calling") using a SAM/BAM file of reads aligned to the reference genome -- this results in a VCF file. Using the VCF file and the SAM/BAM reads file, the ReadBackedPhasing tool considers all reads within a Bayesian framework and attempts to find the local haplotype with the highest probability, based on the reads observed.

The local haplotype and its phasing is encoded in the VCF file as a "|" symbol (which indicates that the alleles of the genotype are listed in the same chromosomal order as the alleles of the genotype at the preceding variant site). For example, the following VCF indicates that SAMP1 is heterozygous at chromosome 20 positions 332341 and 332503, and that the reference base at the first position (A) is on the same chromosome of SAMP1 as the alternate base at the second position (G), and vice versa (the alternate G with the reference C):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMP1   
chr20   332341  rs6076509   A   G   470.60  PASS    AB=0.46;AC=1;AF=0.50;AN=2;DB;DP=52;Dels=0.00;HRun=1;HaplotypeScore=0.98;MQ=59.11;MQ0=0;OQ=627.69;QD=12.07;SB=-145.57    GT:DP:GL:GQ 0/1:46:-79.92,-13.87,-84.22:99
chr20   332503  rs6133033   C   G   726.23  PASS    AB=0.57;AC=1;AF=0.50;AN=2;DB;DP=61;Dels=0.00;HRun=1;HaplotypeScore=0.95;MQ=60.00;MQ0=0;OQ=894.70;QD=14.67;SB=-472.75    GT:DP:GL:GQ:PQ  1|0:60:-110.83,-18.08,-149.73:99:126.93

The per-sample per-genotype PQ field is used to provide a Phred-scaled phasing quality score based on the statistical Bayesian framework employed for phasing. For cases of homozygous sites that lie in between phased heterozygous sites, these homozygous sites will be phased with the same quality as the next heterozygous site.

Note that this tool can only handle diploid data properly. If your organism of interest is polyploid or if you are working with data from pooling experiments, you should not run this tool on your data.


More detailed aspects of semantics of phasing in the VCF format

  • The "|" symbol is used for each sample to indicate that each of the alleles of the genotype in question derive from the same haplotype as each of the alleles of the genotype of the same sample in the previous NON-FILTERED variant record. That is, rows without FILTER=PASS are essentially ignored in the read-backed phasing (RBP) algorithm.

  • Note that the first heterozygous genotype record in a pair of haplotypes will necessarily have a "/" - otherwise, they would be the continuation of the preceding haplotypes.

  • A homozygous genotype is always "appended" to the preceding haplotype. For example, any 0/0 or 1/1 record is always converted into 0|0 or 1|1, respectively.

  • RBP attempts to phase a heterozygous genotype relative to the preceding HETEROZYGOUS genotype for that sample. If there is sufficient read information to deduce the two haplotypes (for that sample), then the current genotype is declared phased ("/" changed to "|") and assigned a PQ that is proportional to the estimated Phred-scaled error rate. All homozygous genotypes for that sample that lie in between the two heterozygous genotypes are also assigned the same PQ value (and remain phased).

  • If RBP cannot phase the heterozygous genotype, then the genotype remains with a "/", and no PQ score is assigned. This site essentially starts a new section of haplotype for this sample.

For example, consider the following records from the VCF file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMP1   SAMP2
chr1    1   .   A   G   99  PASS    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    2   .   A   G   99  PASS    .   GT:GL:GQ:PQ 1|1:-100,0,-100:99:60   0|1:-100,0,-100:99:50
chr1    3   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:60   0|0:-100,0,-100:99:60
chr1    4   .   A   G   99  FAIL    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    5   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:70   1|0:-100,0,-100:99:60
chr1    6   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0/1:-100,0,-100:99  1|1:-100,0,-100:99:70
chr1    7   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:80   0|1:-100,0,-100:99:70
chr1    8   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:90   0|1:-100,0,-100:99:80

The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:

AGAAA
GGGAG

And two haplotypes at positions 6-8:

AAA
GGG

And, SAMP2 has the two haplotypes at positions 1-8:

AAAAGGAA
GGAAAGGG

Note that we have excluded the non-PASS SNP call (at chr1:4), thus assuming that both samples are homozygous reference at that site.
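The reconstruction above can be sketched in code. The following illustrative (hypothetical) snippet rebuilds SAMP2's two haplotypes from the example records, treating non-PASS sites as homozygous reference; for unphased "/" genotypes the allele order is arbitrary, and we follow the convention used in the table above.

```python
# Rebuild SAMP2's two haplotypes from the example records above.
# All eight sites have REF=A, ALT=G; non-PASS sites are assumed hom-ref.
REF, ALT = "A", "G"

# (FILTER, SAMP2 genotype) for chr1 positions 1-8, copied from the table.
records = [
    ("PASS", "0/1"), ("PASS", "0|1"), ("PASS", "0|0"), ("FAIL", "0/1"),
    ("PASS", "1|0"), ("PASS", "1|1"), ("PASS", "0|1"), ("PASS", "0|1"),
]

hap1, hap2 = [], []
for filt, gt in records:
    if filt != "PASS":              # excluded site -> assume hom-ref
        gt = "0|0"
    a1, a2 = (int(i) for i in gt.replace("/", "|").split("|"))
    hap1.append(ALT if a1 else REF)
    hap2.append(ALT if a2 else REF)

print("".join(hap1))  # AAAAGGAA
print("".join(hap2))  # GGAAAGGG
```

The output matches the SAMP2 haplotypes listed above; a faithful implementation would also track where "/" genotypes break the haplotype into separate sections, as in the SAMP1 example.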
