Channel: Recent Discussions — GATK-Forum

GenotypeGVCFs calculateLikelihoodSums NullPointerException


Hi,

I run into a NullPointerException when trying to use GenotypeGVCFs to merge a set of gVCF files stored in a database via GenomicsDBImport.

java.lang.NullPointerException
        at org.broadinstitute.hellbender.tools.walkers.genotyper.AlleleSubsettingUtils.calculateLikelihoodSums(AlleleSubsettingUtils.java:234)
        at org.broadinstitute.hellbender.tools.walkers.genotyper.AlleleSubsettingUtils.calculateMostLikelyAlleles(AlleleSubsettingUtils.java:199)
        at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:241)
        at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:205)
        at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.calculateGenotypes(GenotypeGVCFs.java:276)
        at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:234)
        at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:213)
        at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:110)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
        at org.broadinstitute.hellbender.Main.main(Main.java:230)

A pull request addressing a NullPointerException in GenotypeGVCFs calculateLikelihoodSums was merged 15 days ago, after the GATK4 beta.1 release 22 days ago:

https://github.com/broadinstitute/gatk/pull/3212
https://github.com/broadinstitute/gatk/releases/tag/4.beta.1

Can I download a GATK4 jar somewhere that contains the fix for the NullPointerException issue in GenotypeGVCFs, i.e., a nightly build?
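
If no nightly is published, a jar containing the merged fix can be built from the current master branch; a minimal sketch, assuming the repository's standard Gradle build (exact task names may differ between versions; the repo README documents them):

    git clone https://github.com/broadinstitute/gatk.git
    cd gatk
    # build a GATK4 package from the latest master, which includes merged PRs
    ./gradlew bundle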

Thank you.


picard tool for MarkDuplicates for cram


Hi picarders,

I tried to run Picard's MarkDuplicates on CRAM files, but it failed. Here is the error message:
Exception in thread "main" java.lang.IllegalStateException: A valid CRAM reference was not supplied and one cannot be acquired via the property settings reference_fasta or use_cram_ref_download
at htsjdk.samtools.cram.ref.ReferenceSource.getDefaultCRAMReferenceSource(ReferenceSource.java:108)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:386)
at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:211)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:470)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:228)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:228)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

Here is the command I ran:
java -jar /software/picard.jar MarkDuplicates TMP_DIR=/space1/tmp AS=TRUE M=/dev/null VALIDATION_STRINGENCY=SILENT I=H2MMFCCXY-2.hgv.cram.20x.ds.cram O=H2MMFCCXY-2.hgv.cram.20x.ds.Marked.cram R=/next-gen/Illumina/bwa_references/g/GRCh38_1000Genomes/GRCh38_full.fa

The Java version is jdk1.8.0_74 and the Picard version is 2.10.6.
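
Since the htsjdk error names the property settings reference_fasta and use_cram_ref_download, one workaround to try is supplying the CRAM reference as a JVM system property (samjdk.reference_fasta comes from htsjdk's Defaults; paths taken from the command above):

    # hand the CRAM reference to htsjdk directly via a JVM system property
    java -Dsamjdk.reference_fasta=/next-gen/Illumina/bwa_references/g/GRCh38_1000Genomes/GRCh38_full.fa \
      -jar /software/picard.jar MarkDuplicates \
      TMP_DIR=/space1/tmp AS=TRUE M=/dev/null VALIDATION_STRINGENCY=SILENT \
      I=H2MMFCCXY-2.hgv.cram.20x.ds.cram O=H2MMFCCXY-2.hgv.cram.20x.ds.Marked.cram \
      R=/next-gen/Illumina/bwa_references/g/GRCh38_1000Genomes/GRCh38_full.fa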

Thanks!

Peter

setup local picard against local htsjdk


Hi picarders,

Since I am not having success using Picard on my CRAM files, I decided to set up the Picard source code along with the htsjdk source code locally, with htsjdk (a folder of the same name) placed under the picard folder.

Here is the problem: I am trying to link my local htsjdk with my local Picard code, but there is no documentation on how to do this.

There is an old thread on this, https://gatkforums.broadinstitute.org/gatk/discussion/6826/picard-build-cant-find-htsjdk, but somehow it never explicitly answers the very question it set out to answer. It provides all the other links except the one that answers the question!

It is possible that I have missed the place where the answer is. Please point me in the right direction. Thank you very much!
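
One generic approach, sketched under the assumption that both projects build with Gradle (composite builds require Gradle 3.1+, and whether Picard's dependency coordinates match your htsjdk checkout should be verified against the Picard README):

    # from the picard checkout: substitute the local htsjdk build for the
    # published htsjdk dependency via a Gradle composite build
    ./gradlew shadowJar --include-build ../htsjdk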

Best,

Peter

MuTect2: "Somehow the requested coordinate is not covered by the read. Too many deletions?"


Hello! I am using MuTect2 (in particular, I am following this pipeline: http://gatkforums.broadinstitute.org/gatk/discussion/5963/tumor-normal-paired-exome-sequencing-pipeline), but today I am getting this error on chromosome 3:

ERROR --
ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Somehow the requested coordinate is not covered by the read. Too many deletions?
        at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:490)
        at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:436)
        at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:427)
        at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:543)
        at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesLeftTail(ReadClipper.java:177)
        at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:408)
        at org.broadinstitute.gatk.utils.clipping.ReadClipper.hardClipAdaptorSequence(ReadClipper.java:411)
        at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.finalizeActiveRegion(MuTect2.java:1201)
        at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.assembleReads(MuTect2.java:1145)
        at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:536)
        at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:176)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
ERROR ------------------------------------------------------------------------------------------

Thank you in advance!
Best

How should I set the value of "--contamination_fraction_to_filter" in MuTect2?


Hi,
What is the meaning of the MuTect2 option --contamination_fraction_to_filter, and how should I set its value? Is it the value from the ContEst result? I know ContEst is used to estimate cross-individual contamination, and a result above the 5% threshold indicates serious contamination.
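
For illustration, a hedged sketch of how the value is commonly passed: ContEst reports a contamination percentage, and the option takes a fraction, so a hypothetical ContEst estimate of 3.2% would be given as 0.032 (file names illustrative):

    # ContEst estimate of 3.2% contamination -> fraction 0.032
    java -jar GenomeAnalysisTK.jar -T MuTect2 \
      -R reference.fasta -I:tumor tumor.bam -I:normal normal.bam \
      --contamination_fraction_to_filter 0.032 -o somatic.vcf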

RealignerTargetCreator hangs


Hi GATK team!

we have an issue with running RealignerTargetCreator. The command line looks like this:

gatk -T RealignerTargetCreator -R ref.fasta -I /testsample.sorted.bam -nt 32 -o /testsample.intervals
INFO  13:00:59,111 HelpFormatter - ---------------------------------------------------------------------------------------------
INFO  13:00:59,141 HelpFormatter - The Genome Analysis Toolkit (GATK) vnightly-2017-07-11-g1f763d5, Compiled 2017/07/11 00:01:14
INFO  13:00:59,141 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  13:00:59,142 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  13:00:59,142 HelpFormatter - [Thu Jul 20 13:00:58 UTC 2017] Executing on Linux 3.10.0-327.3.1.el7.x86_64 amd64
INFO  13:00:59,142 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11
INFO  13:00:59,170 HelpFormatter - Program Args:  -T RealignerTargetCreator -R ref.fasta -I /testsample.sorted.bam -nt 32 -o /testsample.intervals
INFO  13:00:59,226 HelpFormatter - Executing as user on Linux 3.10.0-327.3.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11.
INFO  13:00:59,227 HelpFormatter - Date/Time: 2017/07/20 13:00:59
INFO  13:00:59,227 HelpFormatter - ---------------------------------------------------------------------------------------------
INFO  13:00:59,228 HelpFormatter - ---------------------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/opt/gatk/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console…

After this, the application hangs. Running this with GATK v3.7 stable also does not work; there we ran into the bug in HaplotypeCaller's VectorHMM library. Any ideas what we can do?
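
Two generic debugging steps worth trying for a hang like this (a sketch, not a known fix): run single-threaded to rule out a deadlock among the -nt data threads, and take a JVM thread dump to see where the process is stuck:

    # rule out a multi-threading deadlock
    gatk -T RealignerTargetCreator -R ref.fasta -I /testsample.sorted.bam -nt 1 -o /testsample.intervals
    # while it hangs, capture a thread dump of the Java process
    jstack $(pgrep -f GenomeAnalysisTK)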

Joint genotyping exomes is extremely slow (part of the germline haplotypecaller GVCF pipeline)


I am enduring an incredible slow down during my genotyping stage of the haplotypecaller GVCF command series. It is my understanding from the documentation that this step should be rather fast: "This step runs very fast and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem."

However, given 50-100 exomes, the command estimates several weeks until completion, despite being given 64 cores, 256 GB of RAM, and unlimited disk space. I'm concerned because this seems unrealistically high, especially since, once a pool of several hundred training exomes has been created, the purpose of the GVCF pipeline is to quickly use that pool in a joint genotyping step with a new sample exome. As it stands, each time I have a new sample exome I would have to endure another multi-week joint genotyping step.

Can you please advise me as to why my command is taking so long? Any insight is much appreciated. Please find below a copy of my command:

    time java -Djava.io.tmpdir=$temp_directory -Xmx192g -jar /root/Installation/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R /bundles/b37/human_g1k_v37.fasta \
    (list of all training exomes and the single sample exome goes here \
    --disable_auto_index_creation_and_locking_when_reading_rods \
    -o genotyped.g.vcf -nt 60


    # I deactivated the following step since it seems to be unnecessary
    # --sample_ploidy 60 \ #(ploidy is set to number of samples per pool * individual sample ploidy)
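
For what it's worth, two approaches are commonly suggested for this situation, sketched here under assumptions (batch size, sample file names, and the interval list are illustrative): restrict GenotypeGVCFs to the exome capture targets with -L, and merge per-sample gVCFs into batches with CombineGVCFs so that GenotypeGVCFs receives a handful of combined inputs rather than 50-100 individual files.

    # merge per-sample gVCFs into batches first (one run per batch of ~50 samples)
    java -Xmx32g -jar /root/Installation/gatk/GenomeAnalysisTK.jar -T CombineGVCFs \
      -R /bundles/b37/human_g1k_v37.fasta \
      -V sample001.g.vcf -V sample002.g.vcf \
      -o batch1.g.vcf
    # then joint-genotype the combined batches, restricted to the capture targets
    java -Xmx32g -jar /root/Installation/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs \
      -R /bundles/b37/human_g1k_v37.fasta \
      -V batch1.g.vcf -V batch2.g.vcf \
      -L exome_targets.interval_list \
      -o genotyped.vcf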

Scatter Gather and Spark together


Hi,

I can't find any recommendations on how to use scatter gather and spark together.
We do panel diagnostics and whole exomes, so a run may contain up to 60 samples. My first idea was to use scatter-gather to analyse a few samples at the same time. Our server crashed with a concurrent-job-limit > 3 (see the config sketch below) because we ran out of RAM and all cores were at 100%. Since I plan to use Spark in the future, I wanted to know

  • whether it is a good idea to use scatter-gather and Spark together,
  • how they work with each other,
  • and whether there are recommendations on how to calculate the number of Cromwell jobs and Spark workers from my hardware.

Please just show me the way if there is already a tutorial that answers my question.
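
For reference, the job cap mentioned above lives in the Cromwell backend configuration; a minimal hypothetical snippet for cromwell.conf (HOCON; the concurrent-job-limit key belongs to Cromwell's local backend, and the value is whatever your hardware tolerates):

    backend {
      providers {
        Local {
          config {
            # upper bound on simultaneously running jobs on this backend
            concurrent-job-limit = 3
          }
        }
      }
    }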

Thanks and best regards,
Daniel


Picard MarkDuplicates error: Value was put into PairInfoMap more than once


Hi, all!

I get the error "Value was put into PairInfoMap more than once" when I use Picard to mark duplicates.
I have already tested the newest versions, bwa 0.7.16a and Picard 2.10.7.
My mapping parameter is bwa mem -M.
All paired read IDs are unique in the FASTQ files.

When I run samtools view this.bam | grep readsname, the output is captured in this picture:
https://p.qlogo.cn/qqmail_head/LIND77SSexibQw48mEewIK3YKyoKCpB06NW5gSKticJqN2mMnvbd8S7KZdvcuWGj31sIGFzvpM9Bg/0
I have read the pages about this problem asked before on the GATK and Biostars forums. Compared to the previously reported cases, I think the secondary hits in my BAM are correct, but I am not sure. Therefore, I have come here to ask for professional help.
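
As a first diagnostic, this error usually means some read name occurs more often than a single pair should; a hedged one-liner to list such names, excluding secondary and supplementary records (flag mask 0x900), which MarkDuplicates does not pair up:

    # read names with more than 2 primary records are candidates for the error
    samtools view -F 0x900 this.bam | cut -f1 | sort | uniq -c | awk '$1 > 2 {print $2, $1}'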

Does GATK work with polyploids?


Hi, I would like to use GATK for SNP and indel calling in wild relatives of a crop plant. I have a good reference genome for the diploid crop that I can map my reads to, but the species in which I would like to call SNPs are di-, tetra-, and hexaploid. I want to trace their ancestry. Can I use GATK as a platform to call genome-wide SNPs?
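
For what it's worth, HaplotypeCaller and UnifiedGenotyper accept a ploidy argument (GATK 3.3+), so each sample can be called with its matching value; a hedged sketch with illustrative file names:

    # call a tetraploid sample against the diploid reference
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
      -R crop_reference.fasta -I tetraploid_sample.bam \
      -ploidy 4 -ERC GVCF -o tetraploid_sample.g.vcf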

How to edit MULTIPLE read groups in one bam file


Hi everyone,

I recently received a WGS BAM from Broad for 1 sample, but with about 8 read groups. BQSR kicked it back, saying that the sequencer name in the read group is not recognized.

Anyway, I need to edit the sequencer name so that BQSR can run. AddOrReplaceReadGroups in Picard will toss out the 8 RGs and add a single RG, so that will not work. So how do you edit one or two of the RGs, or replace all 8 RGs in the BAM?

I am sure this is a common issue.
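
One approach that preserves all 8 @RG lines is to edit only the header and re-attach it with samtools reheader; a hedged sketch, assuming the rejected field is the platform tag (the sed pattern is illustrative and should match whatever BQSR actually complains about):

    # extract the header, fix the @RG field, reheader without rewriting the reads
    samtools view -H sample.bam > header.sam
    sed -i 's/PL:unknown/PL:ILLUMINA/' header.sam
    samtools reheader header.sam sample.bam > sample.fixed.bam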

Thanks

Running ASE Read counter on WXS


Hi
I wanted to ask if I can use ASEReadCounter on WXS (whole-exome) data instead of RNA-seq. Would it give me the number of times a particular SNP has been observed at the DNA level? And is it correct to interpret the results that way?

When should I use -L to pass in a list of intervals?


The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: intervals are not required but they can help speed up analysis
- Whole exome analysis: you must provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you must provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.

Example Use of -L:

  • -L 20 for chromosome 20 in the b36/b37 builds

  • -L chr20:1-100 for chromosome 20 positions 1-100 in hg18/hg19 build

  • -L intervals.list (or intervals.interval_list, or intervals.bed) where the value passed to the argument is a text file containing intervals

  • -L some_variant_calls.vcf where the value passed to the argument is a VCF file containing variant records; their genomic coordinates will be used as intervals.

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

- For example, HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming in GRCh38, not present in prior assemblies. This has implications for using the -L option of GATK, as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
- When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
- However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

-L HLA-A*01:01:01:01:1+

So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use an intervals list in whole genome analysis -- presumably you're interested in the whole genome!

However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere) where you know the data is not reliable or is very messy, causing excessive slowdowns. You can do this by providing a list of "good" intervals with -L, or you could also provide a list of "bad" intervals with -XL, which does the exact opposite of -L: it excludes the provided intervals. We share the whole-genome interval lists (of good intervals) that we use in our production pipelines, in our resource bundle (see Download page).

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
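
For illustration, a hedged sketch of applying the same targets and padding at two -L steps of an exome pipeline (file names are illustrative; -ip is the short form of --interval_padding):

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I sample.bam \
      -L exome_targets.interval_list -ip 100 -knownSites dbsnp.vcf -o recal.table
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I sample.bam \
      -L exome_targets.interval_list -ip 100 -ERC GVCF -o sample.g.vcf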

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

  • BaseRecalibrator: YES. This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.
  • PrintReads: NO. Output is a bam file; using -L would lead to lost data.
  • UnifiedGenotyper/HaplotypeCaller: YES. We're only interested in making calls in exome regions; the rest is a waste of time and includes lots of false positives.
  • Next steps: NO. No need, since subsequent steps operate on the callset, which was restricted to the exome at the calling step.

Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

Debugging / troubleshooting

You can use -L a lot while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval. This is really useful when you're trying to figure out what's going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what the effect of changing a parameter would be (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you'd use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).

[GATK 4 beta] read_position and clipping filters in FilterMutectCalls


Hello,
I would like to understand the clipping and read_position filters better. Is the read_position filter useful because base quality gets worse toward the end of a read in Illumina sequencing? And is the clipping filter useful because high-quality soft-clipped and hard-clipped bases are signs of bad alignment, or evidence of structural variation that Mutect2 cares less about? --maxMedianClippingDifference is set to 1 by default, which looks stringent to me, though I am sure there is a good reason that I don't know. I'd appreciate advice on the rationale behind these filters.
(I assume that read_position is driven by --minMedianReadPosition and clipping by --maxMedianClippingDifference.)

On the other hand, I am having a hard time finding evidence supporting the filters (at least for the variant below).
I guess the filters work based on the information in MPOS and MCL:
##FORMAT=<ID=MPOS,Number=A,Type=Float,Description="median distance from end of read">
##FORMAT=<ID=MCL,Number=A,Type=Float,Description="number of soft- and hard- clipped bases">
For the two closely located variants below (17:7577079 and 17:7577087), one (17:7577087) is filtered by the clipping and read_position filters and the other (17:7577079) is not. To me, the MPOS and MCL values are suspicious. MPOS for the alt allele is 0 for the filtered variant, which looks incorrect because the two variants always occur together in supporting reads. MCL is "0,0" for both, which doesn't explain the clipping filter. What did I miss?

Not filtered (17:7577079): MPOS: 37,59; MCL: 0,0
Filtered (17:7577087): MPOS: 40,0; MCL: 0,0

17      7577079 .       CTTCCT  C       .       clustered_events        DP=592;ECNT=5;NLOD=95.58;N_ART_LOD=-2.510e+00;POP_AF=1.000e-03;P_GERMLINE=-9.228e+01;TLOD=18.63 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:PGT:PID:SA_MAP_AF:SA_POST_PROB       0/0:316,0:0.019:41,0:0,0:413,0:60,0:41,0:false:false:0|1:7577079_CTTCCT_C       0/1:265,8:0.042:41,41:0,0:410,416:60,60:37,59:false:false:0|1:7577079_CTTCCT_C:0.030,0.020,0.029:4.069e-03,3.816e-03,0.992
17      7577087 .       GT      G       .       clipping;clustered_events;read_position DP=570;ECNT=5;NLOD=93.62;N_ART_LOD=-2.494e+00;POP_AF=1.000e-03;P_GERMLINE=-9.031e+01;TLOD=18.71 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:PGT:PID:SA_MAP_AF:SA_POST_PROB       0/0:302,0:3.807e-05:41,0:0,0:416,0:60,0:40,0:false:false:0|1:7577079_CTTCCT_C   0/1:260,8:0.029:41,0:0,0:416,416:60,60:40,0:false:false:0|1:7577079_CTTCCT_C:0.030,0.020,0.030:4.483e-03,3.633e-03,0.992


What's the difference between a read marked by SAM FLAG 1024 and one marked by PG:MarkDuplicates?


Hi,

1. When I run the following command:
       samtools view -h -f 1024 bwa.raw.bam
   I get no reads at all, so I think bwa does not mark duplicates using FLAG 1024.
2. After processing bwa.raw.bam with Picard MarkDuplicates and running:
       samtools view -h -f 1024 bwa.raw.markdup.bam
   I do get some reads, carrying the following marks:
       FLAG: 1187    PG:Z:MarkDuplicates

So I have a question: I know the option -f 1024 selects only optical or PCR duplicates, but which software modifies the FLAG to add 1024? According to 1, bwa does not do it, because no reads were found. However, according to 2, I get reads marked with FLAG 1024 after the Picard MarkDuplicates step, so I guess Picard modifies the SAM FLAG as well, rather than just adding PG:Z:MarkDuplicates?
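
For reference, samtools can decode a FLAG value into its component bits, which shows directly that 1187 includes the 0x400 duplicate bit set by MarkDuplicates (output format may vary slightly between samtools versions):

    samtools flags 1187
    # 0x4a3  1187  PAIRED,PROPER_PAIR,MREVERSE,READ2,DUP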

Joint genotyping of trio exome


I am running GenotypeGVCFs on trio exome datasets (father, mother, and child).

java -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fasta \
--variant father.g.vcf \
--variant mother.g.vcf \
--variant child.g.vcf \
-o output.vcf

I am wondering which variants are trio variants, i.e. shared by father, mother, and child. How can I see this from the VCF file? Could you please advise?

Example of result:

...
chr19 43355878 . G A 1088.92 . AC=3;AF=0.500;AN=6;BaseQRankSum=-3.460e-01;ClippingRankSum=0.00;DP=51;ExcessHet=3.0103;FS=1.358;MLEAC=3;MLEAF=0.500;MQ=41.72;MQRankSum=-2.136e+00;QD=13.44;ReadPosRankSum=-5.170e-01;SOR=0.952 GT:AD:DP:GQ:PGT:PID:PL 0/1:16,11:27:99:0|1:43355878_G_A:373,0,622
chr19 43355896 . G T 875.92 . AC=3;AF=0.500;AN=6;BaseQRankSum=0.569;ClippingRankSum=0.00;DP=34;ExcessHet=3.0103;FS=0.000;MLEAC=3;MLEAF=0.500;MQ=51.21;MQRankSum=-1.546e+00;QD=16.22;ReadPosRankSum=1.34;SOR=0.880 GT:AD:DP:GQ:PGT:PID:PL 0/1:11,7:18:99:0|1:43355878_G_A:302,0,425
...
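
For context, in the joint VCF each sample has its own genotype column, in the order the samples are listed on the #CHROM header line, so a variant shared by the trio simply shows a non-reference GT in all three columns (the excerpt above appears truncated to one column). To extract sites that violate Mendelian inheritance (candidate de novo variants), GATK3 SelectVariants can take a pedigree; a hedged sketch, where trio.ped is a hypothetical PED file describing the trio and the -mvq threshold is illustrative:

    java -jar GenomeAnalysisTK.jar -T SelectVariants \
      -R reference.fasta -V output.vcf \
      --pedigree trio.ped --mendelianViolation -mvq 30 \
      -o mendelian_violations.vcf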

GATK 4 VariantRecalibrator throws error for missing R path, but only for SNPs


I'm not using the --rscript_file option, but VariantRecalibrator for SNPs still throws an error about it; the exact same command, except for -mG 4, runs fine in INDEL mode. The SNP run produces its output just fine, but for some reason complains about not finding Rscript.

Here's the stderr output of the SNP run, edited for readability.

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=call-VariantRecalibratorSNP/execution/tmp.sAdeNd
[August 9, 2017 6:36:15 AM UTC] VariantRecalibrator  --mode SNP --resource v1000G:1000G_phase1.snps.high_confidence.hg38.vcf.gz --resource omni:1000G_omni2.5.hg38.vcf.gz --resource dbsnp:dbsnp_146.hg38.vcf.gz --resource hapmap:hapmap_3.3.hg38.vcf.gz --output GVCF.varRec.SNP.recal --tranches_file GVCF.varRec.SNP.tranches --use_annotation QD --use_annotation MQ --use_annotation DP --use_annotation MQRankSum --use_annotation ReadPosRankSum --use_annotation FS --use_annotation SOR --TStranche 100.0 --TStranche 99.95 --TStranche 99.9 --TStranche 99.8 --TStranche 99.6 --TStranche 99.5 --TStranche 99.4 --TStranche 99.3 --TStranche 99.0 --TStranche 98.0 --TStranche 97.0 --TStranche 90.0 --variant GVCF.genotypegvcf.g.vcf --reference Homo_sapiens_assembly38.fasta  --useAlleleSpecificAnnotations false --maxGaussians 8 --maxNegativeGaussians 2 --maxIterations 150 --numKMeans 100 --stdThreshold 10.0 --shrinkage 1.0 --dirichlet 0.001 --priorCounts 20.0 --maxNumTrainingData 2500000 --minNumBadVariants 1000 --badLodCutoff -5.0 --MQCapForLogitJitterTransform 0 --no_MQ_logit false --MQ_jitter 0.05 --target_titv 2.15 --ignore_all_filters false --sample_every_Nth_variant 1 --output_tranches_for_scatter false --replicate 200 --max_attempts 1 --trustAllPolymorphic false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false
[August 9, 2017 6:36:15 AM UTC] Executing as root@db77ec148549 on Linux 2.6.32-642.11.1.el6.centos.plus.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11; Version: 4.beta.3
[August 9, 2017 7:13:35 AM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 37.34 minutes.
Runtime.totalMemory()=5059379200
***********************************************************************

A USER ERROR has occurred: Unable to execute RScript command: Please add the Rscript directory to your environment ${PATH}

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--javaOptions '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

And the INDEL stderr.

[August 9, 2017 6:36:15 AM UTC] VariantRecalibrator  --mode INDEL --maxGaussians 4 --resource mills:Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --resource dbsnp:dbsnp_146.hg38.vcf.gz --output GVCF.varRec.INDEL.recal --tranches_file GVCF.varRec.INDEL.tranches --use_annotation QD --use_annotation DP --use_annotation FS --use_annotation SOR --use_annotation ReadPosRankSum --use_annotation MQRankSum --TStranche 100.0 --TStranche 99.95 --TStranche 99.9 --TStranche 99.5 --TStranche 99.0 --TStranche 97.0 --TStranche 96.0 --TStranche 95.0 --TStranche 94.0 --TStranche 93.5 --TStranche 93.0 --TStranche 92.0 --TStranche 91.0 --TStranche 90.0 --variant GVCF.genotypegvcf.g.vcf --reference Homo_sapiens_assembly38.fasta  --useAlleleSpecificAnnotations false --maxNegativeGaussians 2 --maxIterations 150 --numKMeans 100 --stdThreshold 10.0 --shrinkage 1.0 --dirichlet 0.001 --priorCounts 20.0 --maxNumTrainingData 2500000 --minNumBadVariants 1000 --badLodCutoff -5.0 --MQCapForLogitJitterTransform 0 --no_MQ_logit false --MQ_jitter 0.05 --target_titv 2.15 --ignore_all_filters false --sample_every_Nth_variant 1 --output_tranches_for_scatter false --replicate 200 --max_attempts 1 --trustAllPolymorphic false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false
[August 9, 2017 6:36:15 AM UTC] Executing as root@db77ec148549 on Linux 2.6.32-642.11.1.el6.centos.plus.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11; Version: 4.beta.3
[August 9, 2017 6:52:54 AM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 16.66 minutes.
Runtime.totalMemory()=7318011904

The only occurrence of anything related to R is in the error message, so I don't understand why it's complaining. I run it with cromwell-28_2 and GATK4 beta 3. If I run ApplyVQSR with the SNP recal and tranches files, it works just fine, so there is nothing wrong with the output files; VariantRecalibrator just complains that R is missing even though it is not supposed to be using R. What could be wrong?
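
If the goal is simply to silence the error, making an Rscript visible on the PATH inside the execution environment should do it; a trivial hedged check (the R install location is hypothetical):

    # verify whether the tool's environment can see Rscript
    which Rscript || echo "Rscript not on PATH"
    export PATH=/opt/R/bin:$PATH   # hypothetical install location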

dbSNP_RS seems to not be annotated with Oncotator 1.9.3


Oncotator 1.9.3 seems not to be annotating dbSNP annotations (dbSNP_RS); all elements in this column are left blank. Other annotations are applied correctly.
My command line is: oncotator -v --db-dir /Database/oncotator_v1_ds_April052016/ test.mutect.txt test.snv.maf hg19

What should I use as known variants/sites for running tool X?


1. Notes on known sites

Why are they important?

Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.

In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.

Human genomes

If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.

Non-human genomes

If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on What are the standard resources for non-human genomes? in which we hope people with non-human genomics experience will share their knowledge.

And if it turns out that there is as yet no suitable set of known sites for your organism, here's how to make your own for the purposes of BaseRecalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs, feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps can be repeated several times until convergence; see the sketch after the next paragraph. Good luck!

Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
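
A hedged sketch of one round of that bootstrap with GATK3 tools (the filter expression, thresholds, and file names are illustrative, not a recommendation):

    # 1. initial calls on the unrecalibrated data
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I sample.bam -o raw.vcf
    # 2. keep only high-confidence calls via a strict hard filter
    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fasta -V raw.vcf \
      --filterExpression "QD < 10.0 || DP < 20" --filterName lowConfidence -o flagged.vcf
    java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta -V flagged.vcf \
      --excludeFiltered -o confident.vcf
    # 3. recalibrate against the confident set, then re-call and iterate
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I sample.bam \
      -knownSites confident.vcf -o recal.table
    java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I sample.bam \
      -BQSR recal.table -o sample.recal.bam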

2. Recommended sets of known sites per tool

Summary table

Tool                                 Known sites to use
RealignerTargetCreator               Mills indels, 1KG indels
IndelRealigner                       Mills indels, 1KG indels
BaseRecalibrator                     dbSNP >132, Mills indels, 1KG indels
UnifiedGenotyper / HaplotypeCaller   dbSNP >132
VariantRecalibrator                  dbSNP >132, Mills indels, HapMap, Omni
VariantEval                          dbSNP 129

RealignerTargetCreator and IndelRealigner

These tools require known indels passed with the -known argument to function properly. We use both the following files:

  • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

BaseRecalibrator

This tool requires known SNPs and indels passed with the -knownSites argument to function properly. We use all the following files:

  • The most recent dbSNP release (build ID > 132)
  • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
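
Putting these together, a hedged example of a b37 BaseRecalibrator invocation with the bundle files (the exact dbSNP file name depends on the bundle version you downloaded):

    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R human_g1k_v37.fasta -I sample.bam \
      -knownSites dbsnp_138.b37.vcf \
      -knownSites Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
      -knownSites 1000G_phase1.indels.b37.vcf \
      -o recal.table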

UnifiedGenotyper / HaplotypeCaller

These tools do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:

  • The most recent dbSNP release (build ID > 132)

VariantRecalibrator

For VariantRecalibrator, please see the FAQ article on VQSR training sets and arguments.

VariantEval

This tool requires known SNPs passed with the -dbsnp argument to function properly. We use the following file:

  • A version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.

HaplotypeCaller raises an error with -A BaseCountsBySample


Hi GATK team, FYI I'm getting the following error with GATK 3.7:

java -X -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta --validation_strictness LENIENT -I .bam.list \
  -o "out.vcf.gz" -l INFO -nct 10 --dbsnp "138.b37.vcf" --annotation PossibleDeNovo --annotation AS_FisherStrand \
  --annotation AlleleBalance --annotation AlleleBalanceBySample --annotation BaseCountsBySample --annotation GCContent \
  --annotation ClippingRankSumTest --pedigree in.ped -L:BED "in.bed"

##### ERROR stack trace
java.lang.IllegalStateException: Never found start -1 or stop -1 given cigar 100I
        at org.broadinstitute.gatk.utils.sam.AlignmentUtils.getBasesCoveringRefInterval(AlignmentUtils.java:204)
        at org.broadinstitute.gatk.utils.sam.AlignmentUtils.countBasesAtPileupPosition(AlignmentUtils.java:1418)
        at org.broadinstitute.gatk.tools.walkers.annotator.BaseCountsBySample.getBaseCounts(BaseCountsBySample.java:135)
        at org.broadinstitute.gatk.tools.walkers.annotator.BaseCountsBySample.annotate(BaseCountsBySample.java:113)
        at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateGenotypes(VariantAnnotatorEngine.java:517)
        at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContextForActiveRegion(VariantAnnotatorEngine.java:278)
        at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContextForActiveRegion(VariantAnnotatorEngine.java:260)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.annotateCall(HaplotypeCallerGenotypingEngine.java:328)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:290)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:962)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:250)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Searching for '100I', the following command returns nothing:

find wordir -name "*.bam" | while read F; do samtools view $F "12:132806991-133841896" | grep 100I ; done
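
(A slightly tighter check, since grep over the whole SAM line can also match tag fields; this variant tests only the CIGAR column, reusing the directory and region from the command above:)

    find wordir -name "*.bam" | while read F; do
      samtools view "$F" "12:132806991-133841896" | awk -v f="$F" '$6 ~ /100I/ {print f, $1, $6}'
    done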

Removing 'BaseCountsBySample' seems to fix the problem.

P.
