Channel: Recent Discussions — GATK-Forum

GATK 3 to GATK 4 and BQSR


Hello,
I am trying to modify my exome pipelines to integrate GATK4.
For recalibration with GATK 3, I ran two rounds of BaseRecalibrator and then passed the recalibration table to HaplotypeCaller with the -BQSR option, in order to avoid writing intermediate BAM files, which are much bigger.

With the new GATK it does not seem to be possible to do two rounds and pass the recalibration table to HaplotypeCaller. Is the -BQSR option still available in the new HaplotypeCaller?
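
For reference, a minimal sketch of the GATK3-style pattern described above (file names and known-sites resources are placeholders, not taken from this post):

# GATK3: build the recalibration table
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R ref.fa -I sample.bam \
    -knownSites dbsnp.vcf -knownSites mills.vcf \
    -o recal.table

# GATK3: apply the table on the fly during calling with -BQSR, avoiding an intermediate BAM
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R ref.fa -I sample.bam \
    -BQSR recal.table \
    -o sample.vcf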

Thanks for your help.


Off-label workflow to simply call differences in two samples


Given my years as a biochemist, if given two samples to compare, my first impulse is to want to know what the functional differences are, i.e. the differences in proteins expressed between the two samples. I am interested in genomic alterations that ripple down the central dogma to transform a cell.

Please note the workflow that follows is NOT a part of the Best Practices. This is an illustrative, unsupported workflow. For the official Somatic Short Variant Calling Best Practices workflow, see Tutorial#11136.

To call every allele that is different between two samples, I have devised a two-pass workflow that takes advantage of Mutect2 features. This workflow uses Mutect2 in tumor-only mode and appropriates the --germline-resource argument to supply a single-sample VCF with allele fractions instead of population allele frequencies. The workflow assumes the two case samples being compared originate from the same parental line and the ploidy and mutation rates make it unlikely that any site accumulates more than one allele change.


First, call on each sample using Mutect2's tumor-only mode.

gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
-O A.vcf

gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
-O B.vcf

Second, for each single-sample VCF, move the sample-level AF allele-fraction annotation to the INFO field and simplify to a sites-only VCF.

This is a heuristic solution in which we substitute sample-level allele fractions for the expected population germline allele frequencies. Mutect2 is actually designed to use population germline allele frequencies in somatic likelihood calculations, so this substitution allows us to fulfill the requirement for an AF annotation with plausible fractional values. The terminal screenshots highlight the data transpositions.
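
One possible way to do this (a sketch only, not part of the original tutorial; it assumes bcftools, bgzip, and tabix are installed, and the intermediate file names A.af.tab.gz and af.hdr are arbitrary) is to pull the per-sample AF out into a tab-delimited file, annotate it back in as INFO/AF, and drop the genotype columns:

# Extract CHROM, POS, REF, ALT and the sample-level allele fraction from A.vcf
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[%AF]\n' A.vcf | bgzip > A.af.tab.gz
tabix -s1 -b2 -e2 A.af.tab.gz

# Write the allele fraction into INFO/AF and drop genotypes to get a sites-only VCF
echo '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele fraction">' > af.hdr
bcftools annotate -a A.af.tab.gz -c CHROM,POS,REF,ALT,INFO/AF -h af.hdr A.vcf | \
    bcftools view --drop-genotypes -o Aaf.vcf

# Index the resource so Mutect2 can use it in the second pass
gatk IndexFeatureFile -F Aaf.vcf

The same steps applied to B.vcf would produce Baf.vcf for the reciprocal comparison.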

Before: [screenshot of the single-sample VCF with AF as a sample-level (FORMAT) annotation]

After: [screenshot of the sites-only VCF with AF moved to the INFO field]

Third, call on each sample in a second pass, again in tumor-only mode, with the following additions.

gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
--germline-resource Baf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O A-B.vcf

gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
--germline-resource Aaf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O B-A.vcf
  • Provide the matched single-sample callset for the case sample with the --germline-resource argument.
  • Avoid calling any allele in the --germline-resource by setting --max-population-af to zero.
  • Maximize the probability of calling any differing allele by setting --af-of-alleles-not-in-resource to zero.
  • Prefilter sites with artifacts and cross-sample contamination with a panel of normals (PoN) from which confident variant sites for both sample A and sample B have been removed, e.g. with gatk SelectVariants -V pon.vcf -XL AandB_haplotypecaller.vcf -O pon_maskAB.vcf.

Fourth, filter out unlikely calls with FilterMutectCalls.

gatk FilterMutectCalls \
-V A-B.vcf \
-O A-B-filter.vcf

gatk FilterMutectCalls \
-V B-A.vcf \
-O B-A-filter.vcf

FilterMutectCalls provides many filters, e.g. filters that account for low base quality, clustered events, low mapping quality, and short-tandem-repeat contractions. Of these, let's consider the multiallelic filter. It discounts sites with more than two variant alleles that pass the tumor LOD threshold.

  • We assume case sample variant sites will have a maximum of one allele that is different from the --germline-resource control. A single allele call will pass the multiallelic filter. However, if we emit any shared variant allele alongside the differing allele, e.g. for a heterozygous site without ref alleles, then the call becomes multiallelic and will be filtered, which is not what we want. We previously set Mutect2’s --max-population-af to zero to ensure only the differing allele is called, and so here we can rely on FilterMutectCalls to filter artifactual multiallelic sites.
  • If multiple variant alleles are expected per call, then FilterMutectCalls' multiallelic filtering will be undesirable. For example, if changes in allele fractions for shared alleles were of interest for the two samples derived from the same parental line, and Mutect2's --max-population-af was set to one in the previous step to additionally emit the shared variant alleles, then you would expect multiallelic calls. These will be indistinguishable from artifactual multiallelic sites.

This workflow produces contrastive variants. If the samples are a tumor and its matched normal, then the calls include sites where heterozygosity was lost.

We know that loss of heterozygosity (LOH) plays a role in tumorigenesis (doi:10.1186/s12920-015-0123-z). This leads us to believe the heterozygosity of proteins we express contributes to our health. If this is true, then for somatic studies, if cataloging the gain of alleles is of interest, then cataloging the loss of alleles should also be of interest. Can we assume just because variants are germline that they do not play a role in disease processes? How can we account for the combinatorial effects of the diploid nature of our genomes?

Remember regions of LOH do not necessarily represent a haploid state but can be copy-neutral or even copy-amplified. It may be that as one parental chromosome copy is lost, the other is duplicated to maintain copy number, which presumably compensates for dosage effects similar to uniparental isodisomy.


About the result of CNVDiscoveryPipeline


@bhandsaker
This is the CNV result of my test: gs_cnv.genotypes.vcf.gz


But I have some doubts:
1. How can I select the CNVs that belong to my test sample (there are so many background-population samples)? The VCF file does not indicate which sample each CNV belongs to.
2. The VCF file does not mark the genotype (homozygous or heterozygous) of each CNV. How can I get that information?
Thank you very much!

Finished running SVPreprocess and SVDiscovery but no results in xxxx.discovery.vcf.gz


When I finished SVPreprocess, I checked the output in xxxx.discovery.vcf.gz, and there is nothing under the header:

    ##fileformat=VCFv4.2
##ALT=<ID=DEL,Description="Deletion">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate of this variant">
##INFO=<ID=GSCOHERENCE,Number=1,Type=Float,Description="Value of coherence statistic">
##INFO=<ID=GSCOHFN,Number=1,Type=Float,Description="Coherence statistic per pair">
##INFO=<ID=GSCOHPVALUE,Number=1,Type=Float,Description="Coherence metric (not a true p-value)">
##INFO=<ID=GSCOORDS,Number=4,Type=Integer,Description="Original cluster coordinates">
##INFO=<ID=GSCORA6,Number=1,Type=Float,Description="Correlation with array intensity from Affy6 arrays">
##INFO=<ID=GSCORI1M,Number=1,Type=Float,Description="Correlation with array intensity from Illumina 1M arrays">
##INFO=<ID=GSCORNG,Number=1,Type=Float,Description="Correlation with array intensity from NimbleGen arrays">
##INFO=<ID=GSDEPTHCALLS,Number=.,Type=String,Description="Samples with discrepant read pairs or low read depth">
##INFO=<ID=GSDEPTHCALLTHRESHOLD,Number=1,Type=Float,Description="Read depth threshold (median read depth of samples with discrepant read pairs)">
##INFO=<ID=GSDEPTHNOBSSAMPLES,Number=1,Type=Integer,Description="Number of samples with discrepant read pairs in depth test">
##INFO=<ID=GSDEPTHNTOTALSAMPLES,Number=1,Type=Integer,Description="Total samples in depth test">
##INFO=<ID=GSDEPTHOBSSAMPLES,Number=.,Type=String,Description="Samples with discrepant read pairs in depth test">
##INFO=<ID=GSDEPTHPVALUE,Number=1,Type=Float,Description="Depth p-value using chi-squared test">
##INFO=<ID=GSDEPTHPVALUECOUNTS,Number=4,Type=Integer,Description="Depth test read counts (carrier inside event, carrier outside event, non-carrier inside, non-carrier outside)">
##INFO=<ID=GSDEPTHRANKSUMPVALUE,Number=1,Type=Float,Description="Depth p-value using rank-sum test">
##INFO=<ID=GSDEPTHRATIO,Number=1,Type=Float,Description="Read depth ratio test">
##INFO=<ID=GSDMAX,Number=1,Type=Integer,Description="Maximum value considered for DOpt">
##INFO=<ID=GSDMIN,Number=1,Type=Integer,Description="Minimum value considered for DOpt">
##INFO=<ID=GSDOPT,Number=1,Type=Integer,Description="Most likely event length">
##INFO=<ID=GSDSPAN,Number=1,Type=Integer,Description="Inner span length of read pair cluster">
##INFO=<ID=GSELENGTH,Number=1,Type=Integer,Description="Effective length">
##INFO=<ID=GSMEMBNPAIRS,Number=1,Type=Integer,Description="Number of pairs used in membership test">
##INFO=<ID=GSMEMBNSAMPLES,Number=1,Type=Integer,Description="Number of samples used in membership test">
##INFO=<ID=GSMEMBOBSSAMPLES,Number=.,Type=String,Description="Samples participating in membership test">
##INFO=<ID=GSMEMBPVALUE,Number=1,Type=Float,Description="Membership p-value">
##INFO=<ID=GSMEMBSTATISTIC,Number=1,Type=Float,Description="Value of membership statistic">
##INFO=<ID=GSNDEPTHCALLS,Number=1,Type=Integer,Description="Number of samples with discrepant read pairs or low read depth">
##INFO=<ID=GSNHET,Number=1,Type=Integer,Description="Number of heterozygous snp genotype calls inside the event">
##INFO=<ID=GSNHOM,Number=1,Type=Integer,Description="Number of homozygous snp genotype calls inside the event">
##INFO=<ID=GSNNOCALL,Number=1,Type=Integer,Description="Number of snp genotype non-calls inside the event">
##INFO=<ID=GSNPAIRS,Number=1,Type=Integer,Description="Number of discrepant read pairs">
##INFO=<ID=GSNSAMPLES,Number=1,Type=Integer,Description="Number of samples with discrepant read pairs">
##INFO=<ID=GSNSNPS,Number=1,Type=Integer,Description="Number of snps inside the event">
##INFO=<ID=GSOUTLEFT,Number=1,Type=Integer,Description="Number of outlier read pairs on left">
##INFO=<ID=GSOUTLIERS,Number=1,Type=Integer,Description="Number of outlier read pairs">
##INFO=<ID=GSOUTRIGHT,Number=1,Type=Integer,Description="Number of outlier read pairs on right">
##INFO=<ID=GSREADGROUPS,Number=.,Type=String,Description="Read groups contributing discrepant read pairs">
##INFO=<ID=GSREADNAMES,Number=.,Type=String,Description="Discrepant read pair identifiers">
##INFO=<ID=GSRPORIENTATION,Number=1,Type=String,Description="Read pair orientation">
##INFO=<ID=GSSAMPLES,Number=.,Type=String,Description="Samples contributing discrepant read pairs">
##INFO=<ID=GSSNPHET,Number=1,Type=Float,Description="Fraction of het snp genotype calls inside the event">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##fileDate=20161013
##source=GenomeSTRiP_v2.00
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

My commands were as follows:

java -cp ${classpath} ${mx} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVPreprocess.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    --disableJobReport \
    -cp ${classpath} \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt\
    -tempDir ${SV_TMPDIR} \
    -R /home/liyc/test_bam/1000G_phase1/human_g1k_v37.fasta \
    -genomeMaskFile /home/liyc/test_bam/1000G_phase1/human_g1k_v37.svmask.fasta \
    -copyNumberMaskFile /home/liyc/test_bam/1000G_phase1/human_g1k_v37.gcmask.fasta \
    -genderMaskBedFile /home/liyc/test_bam/1000G_phase1/human_g1k_v37.gendermask.bed \
    -runDirectory ${runDir} \
    -md ${runDir}/metadata \
    -disableGATKTraversal \
    -useMultiStep \
    -reduceInsertSizeDistributions false \
    -computeGCProfiles true \
    -computeReadCounts true \
    -jobLogDir ${runDir}/logs \
    -I ${inputFile} \
    -run \
    || exit 1
# Run discovery.
java -cp ${classpath} ${mx} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/SVDiscovery.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    --disableJobReport \
    -cp ${classpath} \
    -configFile /home/liyc/SV/svtoolkit/installtest/conf/genstrip_installtest_parameters.txt\
    -tempDir ${SV_TMPDIR} \
    -R /home/liyc/test_bam/1000G_phase1/human_g1k_v37.fasta \
    -genomeMaskFile /home/liyc/test_bam/1000G_phase1/human_g1k_v37.svmask.fasta \
    -genderMapFile /home/liyc/test_bam/test_gender.map \
    -runDirectory ${runDir} \
    -md ${runDir}/metadata \
    -disableGATKTraversal \
    -minimumSize 100 \
    -maximumSize 1000000 \
    -jobLogDir ${runDir}/logs \
    -suppressVCFCommandLines \
    -I ${inputFile} \
    -O ${sites} \
    -run \
    || exit 1

During SVPreprocess I ran into three problems. The first was:

INFO  14:43:05,581 QGraph - Failed:   samtools index test2_2_new/metadata/headers.bam

I extracted the failed command and solved this by running it by hand:

 samtools index headers.bam

The second and third problems were as follows:

  INFO  14:59:58,545 QGraph - Failed:   'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/home/liyc/test_bam/test2_raw/tmpdir2_2'  '-cp' '/home/liyc/SV/svtoolkit/liuc_sv/svtoolkit/lib/SVToolkit.jar:/home/liyc/SV/svtoolkit/liuc_sv/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/home/liyc/SV/svtoolkit/liuc_sv/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/home/liyc/SV/svtoolkit/liuc_sv/svtoolkit/lib/SVToolkit.jar:/home/liyc/SV/svtoolkit/liuc_sv/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/home/liyc/SV/svtoolkit/liuc_sv/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.ComputeDepthProfiles'  '-O' '/home/liyc/test_bam/test2_raw/test2_2_new/metadata/profiles_100Kb/profile_seq_2_100000.dat.gz'  '-I' 'test2_2_new/metadata/headers.bam'  '-configFile' '/home/liyc/SV/svtoolkit/liuc_sv/svtoolkit/conf/genstrip_parameters.txt'  '-R' '/home/liyc/test_bam/1000G_phase1/human_g1k_v37.fasta'  '-L' '2:0-0'  '-genomeMaskFile' '/home/liyc/test_bam/1000G_phase1/human_g1k_v37.svmask.fasta'  '-md' 'test2_2_new/metadata'  '-profileBinSize' '100000'  '-maximumReferenceGapLength' '10000'
    INFO  14:59:58,545 QGraph - Log:     /home/liyc/test_bam/test2_raw/test2_2_new/logs/SVPreprocess-21.out
    INFO  14:59:58,546 QCommandLine - Script failed: 2 Pend, 0 Run, 1 Fail, 186 Done

I solved them by just rerunning Queue, and this time no failures were reported.

The header of my headers.bam is:

    @HD VN:1.5  GO:none SO:coordinate
@SQ SN:1    LN:249250621
@SQ SN:2    LN:243199373
@SQ SN:3    LN:198022430
@SQ SN:4    LN:191154276
@SQ SN:5    LN:180915260
@SQ SN:6    LN:171115067
@SQ SN:7    LN:159138663
@SQ SN:8    LN:146364022
@SQ SN:9    LN:141213431
@SQ SN:10   LN:135534747
@SQ SN:11   LN:135006516
@SQ SN:12   LN:133851895
@SQ SN:13   LN:115169878
@SQ SN:14   LN:107349540
@SQ SN:15   LN:102531392
@SQ SN:16   LN:90354753
@SQ SN:17   LN:81195210
@SQ SN:18   LN:78077248
@SQ SN:19   LN:59128983
@SQ SN:20   LN:63025520
@SQ SN:21   LN:48129895
@SQ SN:22   LN:51304566
@SQ SN:X    LN:155270560
@SQ SN:Y    LN:59373566
@SQ SN:MT   LN:16569
@SQ SN:GL000207.1   LN:4262
@SQ SN:GL000226.1   LN:15008
@SQ SN:GL000229.1   LN:19913
@SQ SN:GL000231.1   LN:27386
@SQ SN:GL000210.1   LN:27682
@SQ SN:GL000239.1   LN:33824
@SQ SN:GL000235.1   LN:34474
@SQ SN:GL000201.1   LN:36148
@SQ SN:GL000247.1   LN:36422
@SQ SN:GL000245.1   LN:36651
@SQ SN:GL000197.1   LN:37175
@SQ SN:GL000203.1   LN:37498
@SQ SN:GL000246.1   LN:38154
@SQ SN:GL000249.1   LN:38502
@SQ SN:GL000196.1   LN:38914
@SQ SN:GL000248.1   LN:39786
@SQ SN:GL000244.1   LN:39929
@SQ SN:GL000238.1   LN:39939
@SQ SN:GL000202.1   LN:40103
@SQ SN:GL000234.1   LN:40531
@SQ SN:GL000232.1   LN:40652
@SQ SN:GL000206.1   LN:41001
@SQ SN:GL000240.1   LN:41933
@SQ SN:GL000236.1   LN:41934
@SQ SN:GL000241.1   LN:42152
@SQ SN:GL000243.1   LN:43341
@SQ SN:GL000242.1   LN:43523
@SQ SN:GL000230.1   LN:43691
@SQ SN:GL000237.1   LN:45867
@SQ SN:GL000233.1   LN:45941
@SQ SN:GL000204.1   LN:81310
@SQ SN:GL000198.1   LN:90085
@SQ SN:GL000208.1   LN:92689
@SQ SN:GL000191.1   LN:106433
@SQ SN:GL000227.1   LN:128374
@SQ SN:GL000228.1   LN:129120
@SQ SN:GL000214.1   LN:137718
@SQ SN:GL000221.1   LN:155397
@SQ SN:GL000209.1   LN:159169
@SQ SN:GL000218.1   LN:161147
@SQ SN:GL000220.1   LN:161802
@SQ SN:GL000213.1   LN:164239
@SQ SN:GL000211.1   LN:166566
@SQ SN:GL000199.1   LN:169874
@SQ SN:GL000217.1   LN:172149
@SQ SN:GL000216.1   LN:172294
@SQ SN:GL000215.1   LN:172545
@SQ SN:GL000205.1   LN:174588
@SQ SN:GL000219.1   LN:179198
@SQ SN:GL000224.1   LN:179693
@SQ SN:GL000223.1   LN:180455
@SQ SN:GL000195.1   LN:182896
@SQ SN:GL000212.1   LN:186858
@SQ SN:GL000222.1   LN:186861
@SQ SN:GL000200.1   LN:187035
@SQ SN:GL000193.1   LN:189789
@SQ SN:GL000194.1   LN:191469
@SQ SN:GL000225.1   LN:211173
@SQ SN:GL000192.1   LN:547496
@RG ID:ST-E0020653.1    PL:ILLUMINA PU:H5KFTCCXX.1.test LB:lib1 SM:test1
@PG ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa mem -t 10 -R @RG\tID:ST-E0020653.1\tPL:ILLUMINA\tPU:H5KFTCCXX.1.test\tLB:lib1\tSM:test1 /home/liyc/test_bam/1000G_phase1/human_g1k_v37.fasta 1_test2.fastq 2_test2.fastq

The bwa on my system is this version:

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>

Thank you very much!

Information about the result of CNVDiscoveryPipeline


@bhandsaker
Hi Bob, I got the result of the CNVDiscoveryPipeline.
In the VCF file the records have SVTYPE=CNV. How can I tell whether a given CNV is a DEL or a DUP?
If it is a DUP, how many copies are there?
And what do the CN, CNQ, CNL, and CNP fields mean, respectively?
Thank you very much!

Invalid or corrupt jarfile


When I run

./gatk --help

it seems to be working fine. However, running anything else such as

./gatk --list

produces an error:

Error: Invalid or corrupt jarfile /path/to/gatk/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar

What's going on? Sorry, this might be a noob question.
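
(As a generic diagnostic, not from this thread: one quick way to test whether the jar archive itself is readable, assuming unzip is available.)

unzip -t /path/to/gatk/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar > /dev/null && echo "jar archive OK"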

GATK for long reads like PacBio and Oxford Nanopore


Hi,

I am aware that GATK is developed and used with Illumina short reads in mind, and it is also advised not to use it for calling variants from long reads. Just out of curiosity, at which step in the algorithm would this cause an issue?

Interpreting alt_allele_in_normal filter in Mutect2


Hi - If I understand Mutect2 correctly, the alt_allele_in_normal flag is assigned when the number of variant reads in the normal exceeds the --max_alt_alleles_in_normal_count setting (which defaults to 1).

I wonder if this generates false negatives, since you can have only a few variant reads in the normal out of many total reads. For example, in the normal you could have REF:ALT=1000:3 and in the tumour REF:ALT=1000:700. These numbers would pass the NLOD and TLOD filters (right?), but the variant would be flagged as alt_allele_in_normal. So I wonder whether one should rescue alt_allele_in_normal variants by, for example, applying a simple Fisher test to check whether the ALT counts are convincingly larger in tumour vs normal.

Just to make a concrete example, this is a variant that could potentially be rescued. Here tumour REF:ALT=35:25 and normal REF:ALT=51:2:

chr17 39524885 . G A . alt_allele_in_normal ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=6.35;TLOD=76.43 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:35,25:0.396:7:18:0.28:1386,995:12:23 0/0:51,2:0.039:2:0:1:2028,82:26:25
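
For illustration only (not part of the original post), the Fisher test proposed above could be run on these counts from the command line, assuming R is installed:

# 2x2 table filled by column: column 1 = tumour (ALT=25, REF=35), column 2 = normal (ALT=2, REF=51)
Rscript -e 'fisher.test(matrix(c(25, 35, 2, 51), nrow = 2))'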

(This comment is based on Mutect2 in GATK 3.8.0; I would switch to GATK4, but I see Mutect2 is still in beta there.)

Many thanks
Dario


VariantRecalibrator - no data found


I just updated to the latest nightly and got the same error:

INFO 12:03:16,652 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00258
INFO 12:03:23,474 ProgressMeter - GL000202.1:10465 5.68e+07 32.4 m 34.0 s 98.7% 32.9 m 25.0 s
INFO 12:03:32,263 VariantRecalibratorEngine - Convergence after 46 iterations!
INFO 12:03:41,008 VariantRecalibratorEngine - Evaluating full set of 4944219 variants...
INFO 12:03:41,100 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-03-20-g65934ae):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

Which version of Spark works well with the current GATK 4.0?


Hi,

I am interested in using GATK 4.0 with Spark on my PC. From previous posts I noticed that there are compatibility issues (errors) between GATK4 and the latest versions of Spark (2.1 or 2.2, if I am correct). Which version of Spark should I use to run GATK 4.0?

Furthermore, will the calling results differ between the Spark and non-Spark versions of the tools?
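
For context, an illustrative sketch (the tool choice and arguments are examples only, not from this post) of how a GATK4 Spark tool is typically invoked on a single machine, with Spark-specific options passed after the -- separator:

gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -- \
    --spark-master local[4]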

Thank you so much !

Mutect 2 B38 germline resource


Hi,

Congratulations on GATK 4.0!

I'm looking at the instructions for Mutect2 where it suggests using a germline resource "--germline-resource af-only-gnomad.vcf.gz".

Do you have a version of this for b38 coming? Or know where I could obtain one?

Thanks

Dan

Mutect2 resources guide


A new tutorial for somatic calling

We have a new tutorial, Tutorial#11136, that outlines how to call somatic short variants, i.e. SNVs and indels, with GATK4 Mutect2. The tutorial provides small example data to follow along with.

Mutect2-compatible germline resources

Full-length Mutect2-compatible human germline resources are available on our FTP server and at gs://gatk-best-practices/. The resources are simplified from the gnomAD resource and retain population allele frequencies. Mutect2 and GetPileupSummaries are the two tools in the workflow that each require a germline resource.
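
As a minimal sketch of where the resource plugs in (file and sample names are placeholders and other required arguments, e.g. intervals, are omitted; see the tutorial for complete commands):

# Mutect2 uses the population allele frequencies in its somatic likelihood model
gatk Mutect2 \
    -R ref.fasta \
    -I tumor.bam \
    -tumor tumor_sample_name \
    --germline-resource af-only-gnomad.vcf.gz \
    -O somatic_unfiltered.vcf.gz

# GetPileupSummaries uses a germline resource (often a smaller common-variant subset) for contamination estimation
gatk GetPileupSummaries \
    -I tumor.bam \
    -V af-only-gnomad.vcf.gz \
    -O pileup_summaries.table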

Working WDL scripts

If you want to run the Somatic Short Variant Discovery Best Practices workflow using WDL, be sure to check out the official Mutect2 WDL script in the gatk-workflows repository. @bshifaw and other engineers optimize the scripts in the repository to run efficiently in the cloud. Furthermore, the scripts come with example JSON-format input files filled out with publicly accessible cloud data.

For other Mutect2-related scripts, e.g. towards panel of normals generation, check out the gatk repository's scripts/mutect2_wdl directory. Our developers update these scripts on a continual basis.
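
For those new to WDL, these scripts are typically run with a WDL execution engine such as Cromwell; a minimal sketch (the jar and file names are placeholders):

java -jar cromwell.jar run mutect2.wdl --inputs mutect2.inputs.json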

For background information

If you are new to somatic calling, be sure to read Article#11127. It gives an overview of what traditional somatic calling entails. For one, somatic calling is NOT just a difference between two callsets in that germline variant sites are excluded from consideration.

For those switching from GATK3 MuTect2, Blog#10911 will bring you up to speed on the differences.

An off-label tutorial for simple difference calling

If you are interested in simply calling differences between two samples, Blog#11315 outlines an off-label two-pass Mutect2 workflow. Off-label means the workflow is not a part of the Best Practices and is therefore unsupported. However, if given enough community interest, we may be convinced to further flesh out the workflow. Please do post to the forum to express interest.


PICARD MarkDuplicates errors near the end of its process: tmp does not exist


Hi, I have a problem in that PICARD MarkDuplicates appears to error near the end of its process -- with a temp file not found error.

This is running in a GATK pipeline on our cluster for WGS Best Practices.

Error is;
Exception in thread "main" java.lang.IllegalStateException: Non-zero numRecords but /tmp/rwillia/CSPI.7224120933399878689.tmp/2.tmp does not exist

We get the same error with Picard 2.9.2 and Picard 2.0.1. The Java being used is:
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

We have tried multiple different BAM files as tests and get the same error.

Has anybody else seen this "tmp does not exist" error before? Was there a fix that worked? I cannot see this error previously reported.
I can run MarkDuplicates using a standalone Qsub tester script and I get the same error.

I made the directory /tmp/rwillia on the scratch drive but it did not help.
Thanks for any help,
Cheers,
Roy Williams

Test qsub script

#PBS -l walltime=99:00:00
#PBS -l nodes=1:ppn=8:memory=29gb

export TMPDIR=/scratch/rwillia
module load samtools
module load picard
java -Xmx26g -jar /opt/applications/picard/2.1.0/bin/picard.jar MarkDuplicates \
I=/mnt/loring/data/OMICS_PIPE_DATA/ANALYSIS/DNAseq/RW_WGS/BWA_RESULTS/REACH000450/REACH000450_sorted.rg.bam \
O=/mnt/loring/data/OMICS_PIPE_DATA/ANALYSIS/DNAseq/RW_WGS/BWA_RESULTS/REACH000450/REACH000450_sorted.rg.md.bam \
M=/mnt/loring/data/OMICS_PIPE_DATA/ANALYSIS/DNAseq/RW_WGS/BWA_RESULTS/REACH000450/REACH000450_sorted.rg.md.metrics.txt \
ASSUME_SORTED=true \
VALIDATION_STRINGENCY=LENIENT

picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 14.35 minutes.
Runtime.totalMemory()=21026570240
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: Non-zero numRecords but /tmp/rwillia/CSPI.7224120933399878689.tmp/2.tmp does not exist
at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:141)
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:388)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:185)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
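
For reference (this is not from the original report, just a generic note): Picard tools accept an explicit TMP_DIR argument and the JVM temp directory can be set with -Djava.io.tmpdir, so the scratch area could be pointed at directly; a sketch with placeholder file names:

java -Xmx26g -Djava.io.tmpdir=/scratch/rwillia -jar picard.jar MarkDuplicates \
    I=input_sorted.rg.bam \
    O=output_sorted.rg.md.bam \
    M=output.md.metrics.txt \
    TMP_DIR=/scratch/rwillia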

Roy Williams, Ph.D.
Bioinformatics Director,
The Center for Regenerative Medicine,
Scripps Research Institute,
10550 North Torrey Pines Road
San Diego, California, 92121
USA

PrintReads gets stuck


Hi GATK team!
I am trying to do somatic variant calling with RNA-seq data (following the GATK Best Practices with paired-end Illumina reads). I have normals and tumors, and my pipeline runs from the FASTQ to the final recalibrated BAM, right before running MuTect2. Of the 30 samples I have, 29 run fine, but for one of them the last step, "PrintReads", just gets stuck almost at the end.
It seems to progress fine at the beginning:
INFO 21:18:07,884 ReadShardBalancer$1 - Done loading BAM index data
INFO 21:18:37,868 ProgressMeter - chr1:632281 200216.0 30.0 s 2.5 m 0.0% 42.5 h 42.5 h
INFO 21:19:37,870 ProgressMeter - chr1:19978481 701190.0 90.0 s 2.1 m 0.6% 4.0 h 4.0 h
INFO 21:20:07,872 ProgressMeter - chr1:37612921 1001193.0 120.0 s 119.0 s 1.2% 2.9 h 2.8 h
INFO 21:20:37,874 ProgressMeter - chr1:61289244 1201197.0 2.5 m 2.1 m 1.9% 2.2 h 2.2 h
INFO 21:21:07,875 ProgressMeter - chr1:91387373 1525198.0 3.0 m 118.0 s 2.8% 106.0 m 103.0 m

and then it gets stuck at a specific position:
INFO 05:02:07,536 ProgressMeter - chr21:8579818 1.31038633E8 7.7 h 3.5 m 86.3% 9.0 h 73.5 m
INFO 05:04:34,643 ProgressMeter - chr21:8579818 1.31038633E8 7.8 h 3.6 m 86.3% 9.0 h 73.9 m
INFO 05:07:07,032 ProgressMeter - chr21:8579818 1.31038633E8 7.8 h 3.6 m 86.3% 9.1 h 74.3 m
INFO 05:09:18,063 ProgressMeter - chr21:8579818 1.31038633E8 7.9 h 3.6 m 86.3% 9.1 h 74.7 m
(The last 208 log entries are exactly like the above, stuck at chr21:8579818 and 86.3% progress.)

First I thought it might be a memory issue, but I am now running with 1 TB of RAM and it just runs out of time (max walltime is 24 h on my server, though I doubt allowing more time would let it finish?).
The thing is, this is not the biggest sample (in terms of reads or file size), and all the others run fine in less than 12 h.

I am using GenomeAnalysisTK-3.8-0, and this is my command line for that part:
INFO 21:18:06,716 HelpFormatter - Program Args: -T PrintReads -R /reference/GRCh38.p7.genome.fa -I /RNAseq_alignments/sample_dir/sample_rg_added_sorted.marked_duplicates.split.bam -BQSR /RNAseq_alignments/sample_dir/sample_rg_added_sorted.marked_duplicates.split.bam-realigned_recal_data.table -o /RNAseq_alignments/sample_dir/sample_recal_reads.bam

Any help, please?
Thanks!

There is a bug in "CollectSequencingArtifactMetrics" in GATK 4.0


I am analyzing WES data with GATK for mutation calling.
I selected two samples for a test: one tumor sample and one matched normal sample.

I used bowtie2 to do the alignment (hg19), then used Picard to sort and remove duplicates,
and then used GATK to do the base recalibration, followed by Mutect2.

I ran into an error at the filtering step.
When I use CollectSequencingArtifactMetrics, two inputs are needed: a BAM file and a GATK reference.

gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-R ref.fasta \
-O tumor_artifact \
--FILE_EXTENSION ".txt"

The command line is:
gatk CollectSequencingArtifactMetrics \
-I ../MySam/SRR5038441_Pst_MD_BQSR.bam \
-R ucsc.hg19.fasta \
-O tumor_artifact \
--FILE_EXTENSION ".txt"

(The reference ucsc.hg19.fasta is from the GATK FTP site.)

The error is: Sequence dictionaries are not the same size (84, 93).

The reason is that there are 84 entries in the BAM file's sequence dictionary but 93 entries in the reference's dictionary.
The sequence dictionary in the BAM was generated by bowtie2, and I selected all the reference files for hg19.
The reference used by bowtie2 might be different from the reference downloaded from the GATK resource bundle (your FTP site).
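
(Added for illustration only, not part of the original post: a quick way to compare the two dictionaries, assuming bash and samtools are available.)

# Count the @SQ entries in the BAM header versus the reference dictionary
samtools view -H ../MySam/SRR5038441_Pst_MD_BQSR.bam | grep -c '^@SQ'
grep -c '^@SQ' ucsc.hg19.dict

# List contigs present in the reference dictionary but missing from the BAM header
comm -13 <(samtools view -H ../MySam/SRR5038441_Pst_MD_BQSR.bam | grep '^@SQ' | cut -f2 | sort) \
         <(grep '^@SQ' ucsc.hg19.dict | cut -f2 | sort)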

Then I printed the header of the BAM file, checked it against "ucsc.hg19.dict", and tried to remove the extra 9 lines, such as:
SN:chr4_ctg9_hap1
SN:chr6_apd_hap1
SN:chr6_cox_hap2
SN:chr6_dbb_hap3
SN:chr6_mann_hap4
SN:chr6_mcf_hap5
SN:chr6_qbl_hap6
SN:chr6_ssto_hap7
SN:chr17_ctg5_hap1

I also edited ucsc.hg19.fasta.fai in the same way,

but I still got an error:
Sequences at index 0 don't match: 0/249250621/chr1 0/16571/chrM/UR=file:/humgen/gsa-hpprojects/GATK/bundle/ucsc.hg19/ucsc.hg19.fasta/M5=d2ed829b8a1628d16cbeee88e88e39eb

However, I think the updated "ucsc.hg19.fasta.fai" is consistent with "ucsc.hg19.dict" (I checked each line).

Finally, I tested this with hg38.
I downloaded the bowtie2 index files for hg38, re-aligned the original reads, and got a new BAM file,
and then I re-downloaded the hg38 GATK reference from the GATK resource bundle (ftp://ftp.broadinstitute.org/bundle/hg38/).

gatk CollectSequencingArtifactMetrics \
-I ../MySam/441.bam \
-R Homo_sapiens_assembly38.fasta \
-O tumor_artifact \
--FILE_EXTENSION ".txt"

I still got a similar error: Sequence dictionaries are not the same size (195, 3366).
This error means that there are 195 entries in the BAM file's sequence dictionary but 3366 in the GATK reference.

In summary, although bowtie2 and GATK may follow the same standard for the same genome, some functions are still not compatible. Many Picard metrics tools cannot process BAM files whose sequence dictionaries differ in size from the reference's dictionary.
However, there is an exception:
when I called CollectOxoGMetrics, there was no error.

gatk CollectOxoGMetrics -I ../MySam/SRR5038441_Pst_MD_BQSR.bam -R ucsc.hg19.fasta -O tumor_artifact.txt

Please note that this BAM file has 84 sequence-dictionary entries while ucsc.hg19.fasta has 93.


Input files reference and features have incompatible contigs


Hi
I am very new to GATK. I read the paper Curr Protoc Bioinformatics 11(1110): 11.10.1-11.10.33 (doi:10.1002/0471250953.bi1110s43) and the Best Practices guide. I want to run a whole-exome analysis to find high-confidence SNPs and indels. I am stuck at the BQSR step (BaseRecalibrator); it shows the error message:
Input files reference and features have incompatible contigs: No overlapping contigs found.

I did the following steps:
1. Downloaded the genome file hg38.fa.gz from the UCSC genome browser.
2. Indexed the reference genome: bwa index hg38.fa
3. Created the fasta index: samtools faidx hg38.fa
4. Created the sequence dictionary: java -jar picard.jar CreateSequenceDictionary REFERENCE=hg38.fa OUTPUT=hg38.dict
5. Mapped the data to the reference: bwa mem -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' -p hg38.fa R1.fastq R2.fastq > aligned_reads.sam
6. Sorted the aligned reads: java -jar picard.jar SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate
7. Marked duplicates: java -jar picard.jar MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt
8. Indexed the mark-duplicates file: java -jar picard.jar BuildBamIndex INPUT=dedup_reads.bam
9. Downloaded dbSNP from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b149_GRCh38p7/VCF/00-All.vcf.gz
10. Downloaded indels from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources/Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz
11. Ran BaseRecalibrator: java -jar gatk-package-4.0.0.0-local.jar BaseRecalibrator -R hg38.fa -I dedup_reads.bam --known-sites SNP_00-All_GRCh38.vcf --known-sites Mills_and_1000G_gold_standard_indels.vcf -O recal_data.table
    It showed an error telling me to index SNP_00-All_GRCh38.vcf and the indel file.
12. I indexed the files by running: java -jar gatk-package-4.0.0.0-local.jar IndexFeatureFile -F SNP_00-All_GRCh38.vcf and java -jar gatk-package-4.0.0.0-local.jar IndexFeatureFile -F Mills_and_1000G_gold_standard_indels.vcf
13. After indexing, I ran BaseRecalibrator again: java -jar gatk-package-4.0.0.0-local.jar BaseRecalibrator -R hg38.fa -I dedup_reads.bam --known-sites SNP_00-All_GRCh38.vcf --known-sites Mills_and_1000G_gold_standard_indels.vcf -O recal_data.table


It shows the error: "A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found."
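
(As an illustration only, not part of the original question: one quick way to check whether the reference and the known-sites VCFs use the same contig naming, e.g. chr1 versus 1.)

# Contig names in the reference dictionary
grep '^@SQ' hg38.dict | cut -f2 | head

# Contig names actually used in the dbSNP known-sites file
grep -v '^#' SNP_00-All_GRCh38.vcf | cut -f1 | sort -u | head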

Please help me to solve the problem

Hi there, my script has many bugs


java -Xmx22g -jar /home/gatk/GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R /home/raw.fa \
    -V /home/raw.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    --filterName "my_snp_filter" \
    -o /home/result/filtered_snps.vcf

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR

' at position 12.GE: Invalid argument value '

ERROR ------------------------------------------------------------------------------------------

Installing GATK4 via Conda


Hi there! I have a small problem, or a suggestion for improvement, related to the use of (Mini)conda and GATK4. I'm not entirely sure if this forum is the right place to ask this, because I don't really know how GATK4's Conda package is maintained, but let's give it a try!

So I'm using a wide variety of bioinformatics tools in my work, which is why I prefer Conda for package management - just to make it a little bit easier to handle package dependencies and updates. I am now planning to try the new GATK4, as version 4.0.1.1 seems to be available in Bioconda. With GATK3 I was able to launch GATK simply with the command 'gatk', so I naturally tried the very same command for GATK4. However:

gatk -h
bash: gatk: command not found
gatk4 -h
bash: gatk4: command not found

I located the GATK4 .jar file and successfully tried the command:

java -jar /home/user/miniconda3/pkgs/gatk4-4.0.1.1-py36/share/gatk4-4.0.0.1-0/gatk-package-4-0.0.1-local.jar -h

This prints all available tools as expected. So the main problem seems to be that a shortcut to this .jar file is not included in the Conda distribution. Is there any particular reason for this behaviour, or is this just a bug in the package? It is, of course, possible to use GATK4 with the 'java -jar' command, but a simple 'gatk' or 'gatk4' command would be easier for Conda users. For example, if I update GATK4 in the future, I must also update my pipelines so that my paths point to the right .jar file. If I could use a direct 'gatk4' command instead, I could simply update GATK4 with Conda and launch it with 'gatk4' in my pipeline - without manual path updating.

Thank you!

Single cell snp-calling


Hi,
I am dealing with some single-cell data from a breast cancer line. I have two questions:
1. Do I need to make any changes when I do SNP calling on single-cell data?
2. Each sample (cell) has four uBAM files. It seems that they come from different lanes; I show the name of the first read of each file below:
Sample1.file1: HS32_13160:1:1101:3804:2200#1
Sample1.file2: HS32_13160:2:1101:5014:2219#1
Sample1.file3: HS31_13175:1:1101:2905:2214#1
Sample1.file4: HS31_13175:2:1101:2640:2225#1
However, the four uBAM files have the same ID value in the RG tag, as shown below:

Sample1.file1:
@RG ID:1#1 PL:ILLUMINA PU:140605_HS32_13160_A_H9FDBADXX_1#1 LB:10450847 PG:BamIndexDecoder CN:SC
@PG ID:SplitBamByReadGroup PN:SplitBamByReadGroup PP:BamMerger DS:Split a BAM file into multiple BAM files based on ReadGroup. Headers are a copy of the original file, removing @RGs where IDs match with the other ReadGroup IDs

Sample1.file2:
@RG ID:1#1 PL:ILLUMINA PU:140605_HS32_13160_A_H9FDBADXX_2#1 LB:10450847 PG:BamIndexDecoder CN:SC
@PG ID:SplitBamByReadGroup PN:SplitBamByReadGroup PP:BamMerger DS:Split a BAM file into multiple BAM files based on ReadGroup. Headers are a copy of the original file, removing @RGs where IDs match with the other ReadGroup IDs

Sample1.file3:
@RG ID:1#1 PL:ILLUMINA PU:140606_HS31_13175_B_H9FDUADXX_1#1 LB:10450847 PG:BamIndexDecoder CN:SC
@PG ID:SplitBamByReadGroup PN:SplitBamByReadGroup PP:BamMerger DS:Split a BAM file into multiple BAM files based on ReadGroup. Headers are a copy of the original file, removing @RGs where IDs match with the other ReadGroup IDs

Sample1.file4:
@RG ID:1#1 PL:ILLUMINA PU:140606_HS31_13175_B_H9FDUADXX_2#1 LB:10450847 PG:BamIndexDecoder CN:SC
@PG ID:SplitBamByReadGroup PN:SplitBamByReadGroup PP:BamMerger DS:Split a BAM file into multiple BAM files based on ReadGroup. Headers are a copy of the original file, removing @RGs where IDs match with the other ReadGroup IDs

How should I deal with these files?
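
(For illustration only, this is not an answer from the thread: read group fields such as ID can be rewritten per file with Picard's AddOrReplaceReadGroups; the values below are placeholders based on the headers above.)

# Give each lane-level uBAM a unique read group ID while keeping the same sample (SM) and library (LB)
java -jar picard.jar AddOrReplaceReadGroups \
    I=Sample1.file1.bam \
    O=Sample1.file1.rg.bam \
    RGID=HS32_13160.1 \
    RGPU=140605_HS32_13160_A_H9FDBADXX_1#1 \
    RGLB=10450847 \
    RGPL=ILLUMINA \
    RGSM=Sample1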

Thank you!

HaplotypeCaller with WARNING DepthPerSampleHC


Hello, sorry for my bad English.
When running GATK 4.0.1.1 HaplotypeCaller with two or more BAMs together, the tool prints many lines of warning messages about DepthPerSampleHC, like the following:

10:32:58.674 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
10:33:02.097 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
10:33:02.102 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
10:33:02.114 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null

All those messages go away when using "--annotation-group StandardAnnotation", which removes "DP" from the FORMAT field. Maybe it is due to the "./." genotypes in some samples.
So, can I keep "DP" in the vcf.gz without receiving so many WARN lines?
Many thanks!

Here is my command:

gatk HaplotypeCaller --java-options -Xmx40g \
    -R GRCh37_latest_chr_genomic.fna \
    --dbsnp All_20170710.vcf.gz \
    -O pop.vcf.gz \
    -I B1700.refine.bam \
    -I NC.refine.bam \
    --standard-min-confidence-threshold-for-calling 30 \
    1>pop.HaplotypeCaller.log

