Channel: Recent Discussions — GATK-Forum
Viewing all 12345 articles

Can HaplotypeCaller be used on drug-treated samples?


Hello, I am working with RNA-Seq data consisting of liver samples from donors. It is a case-control study in which 12 samples are divided into Normal (control) and Rifampin-treated (case) groups. I want to create sample-specific VCF files. I was going through the documentation and got a bit confused between HaplotypeCaller and Mutect2. Which one should I use to get my VCF file?

In addition, is there a decent way to add gene names, symbols, and other annotations to the INFO field of the VCF file?

Any help is much appreciated.

Regards,
Anurag


GATK ERROR MESSAGE: 38 HaplotypeCaller


Hello everyone,

I'm using the HaplotypeCaller program on whole sheep genomes.

We ran the same command for all 158 samples, on nodes with 16 cores (-nct 16) and 28 GB of RAM.

Could you tell me what might be causing the error below?
Thank you in advance.
-------------------------------------------------------------------------------------------------------------------------------------------------------------

ERROR --
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 38
at org.broadinstitute.gatk.tools.walkers.annotator.BaseQualityRankSumTest.getElementForRead(BaseQualityRankSumTest.java:96)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.getElementForRead(RankSumTest.java:209)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.fillQualsFromLikelihoodMap(RankSumTest.java:187)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.annotate(RankSumTest.java:104)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContextForActiveRegion(VariantAnnotatorEngine.java:315)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContextForActiveRegion(VariantAnnotatorEngine.java:260)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.annotateCall(HaplotypeCallerGenotypingEngine.java:328)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:290)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:970)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:252)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR

##### ERROR MESSAGE: 38

##### ERROR ------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------------------------------------------------------------------

Problems with BAMs decompressed from CRAMs


I have a number of NGS samples with 30x coverage, processed according to GATK Best Practices. For some samples the processed BAMs were kept; for others the BAMs were compressed to CRAMs. My task now is to use Picard's downsampling to generate various low-coverage scenarios.
For the samples where the original BAMs were kept, the downsampling works as expected. However, BAMs obtained from CRAMs do not work at all.

First, I decompressed the CRAMs to BAMs using samtools. This seemed to work fine, as I was able to index the BAMs and get stats from them. However, when I tried to use a file for downsampling, I got errors saying the file is truncated.
Thinking it might be an issue with the decompression, I tried cramtools instead. This time the decompression failed with the following error:
ERROR 2018-02-02 09:23:13 ReferenceSource Downloaded sequence is corrupt: requested md5=971cb1c7a7f62a402dab61cfe84a93b1, received md5=d41d8cd98f00b204e9800998ecf8427e
ERROR 2018-02-02 09:23:13 ReferenceSource Downloaded sequence is corrupt: requested md5=971cb1c7a7f62a402dab61cfe84a93b1, received md5=d41d8cd98f00b204e9800998ecf8427e
ERROR 2018-02-02 09:23:13 Cram2Bam Can't find reference to validate slice md5: 0 CM000093.4
ERROR 2018-02-02 09:23:13 ReferenceSource Downloaded sequence is corrupt: requested md5=971cb1c7a7f62a402dab61cfe84a93b1, received md5=d41d8cd98f00b204e9800998ecf8427e
ERROR 2018-02-02 09:23:13 ReferenceSource Downloaded sequence is corrupt: requested md5=971cb1c7a7f62a402dab61cfe84a93b1, received md5=d41d8cd98f00b204e9800998ecf8427e
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at net.sf.cram.CramTools.invoke(CramTools.java:91)
at net.sf.cram.CramTools.main(CramTools.java:121)
Caused by: java.lang.RuntimeException: Reference sequence required but not found: CM000093.4, md5=971cb1c7a7f62a402dab61cfe84a93b1
at htsjdk.samtools.cram.build.CramNormalizer.restoreBases(CramNormalizer.java:228)
at htsjdk.samtools.cram.build.CramNormalizer.normalizeRecordsForReferenceSource(CramNormalizer.java:201)
at net.sf.cram.Cram2Bam.main(Cram2Bam.java:237)
... 6 more

According to https://github.com/enasequence/cramtools/issues/74, this is because cramtools has not been patched to use https when downloading reference sequences.
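If the reference download is indeed the problem, one possible workaround (my suggestion, not part of the original post) is to skip the network fetch entirely and point samtools at a local copy of the exact reference FASTA the CRAMs were written against; the path below is a placeholder:

```shell
# Placeholder path: substitute the reference the CRAMs were actually made with.
REF=local_reference.fa
# Decompress against the local reference instead of downloading sequences:
#   samtools view -b -T "$REF" -o sample.bam sample.cram
# Then a basic truncation check before attempting to downsample:
#   samtools quickcheck -v sample.bam
echo "decompressing CRAMs against local reference: $REF"
```

If the md5 of a local sequence still does not match what the CRAM slice expects, that would support the theory that the BAMs were made with a different reference build.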

Finally, I ran Picard's ValidateSamFile, which revealed multiple errors (it reached the maximum output at 100):

ERROR: Record x, Read name y, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned

Is this something that can be fixed? Could it be that the original BAMs were created using a different version of the reference genome? I did not create the BAMs myself, so I am trying to find possible causes of the problem and, ideally, a solution that avoids re-processing the samples!

Training GATK4


Hi,

I'm a researcher in Italy, and I'm looking for a research center in Europe that can host me for a short training on GATK4.
I'm new to GATK and am building my pipeline in Bash. For now I don't have access to a server where I can use the ready-to-use GATK4 WDL pipeline.

Please let me know with a private message.

All the best,
manolis

Mutect2 version 4.0.0.0


Mutect2 version 4.beta.6

./gatk-launch Mutect2 --help
BETA FEATURE - FOR EVALUATION ONLY

Mutect2 version 4.0.0.0

./gatk Mutect2 --help
BETA FEATURE - WORK IN PROGRESS

Is there a timeline for a stable release of Mutect2?

Thank you,

Rai

Web-based Oncotator server


There is a web-based version of Oncotator which you can use for annotation without running anything on your own machine.

However, please note that the web-based version is an older version, with fewer datasources and many limitations. We urge you to use the downloadable version instead, and at this time we do not provide user support for the web-based version. It is simply provided as-is.

Note also that on rare occasions the server malfunctions and needs to be rebooted. If you experience any server errors (e.g. an error message stating that the server is unavailable), please post a note in the thread below and we'll reboot it as soon as we can.

Can I use GATK on non-diploid organisms?



In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.

Ploidy-related functionalities

As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.

For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the -ploidy argument.

Cases where ploidy needs to be specified

  1. Native variant calling in haploid or polyploid organisms.
  2. Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
  3. Pooled validation/genotyping at known sites.

For normal organism ploidy, you just set the -ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).
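As a worked example of that arithmetic (the counts and filenames here are made up for illustration): a pool of 10 diploid individuals sharing one barcode would be called with -ploidy 20.

```shell
# Ploidy for a pooled sample = (ploidy per individual) * (individuals in pool).
PLOIDY_PER_INDIVIDUAL=2
INDIVIDUALS_IN_POOL=10
POOL_PLOIDY=$((PLOIDY_PER_INDIVIDUAL * INDIVIDUALS_IN_POOL))
echo "pool ploidy: $POOL_PLOIDY"
# Hypothetical HaplotypeCaller invocation using that value (paths are placeholders):
#   java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
#       -R reference.fa -I pool.bam -ploidy $POOL_PLOIDY -o pool.vcf
```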

Important limitations

Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF, and Genotype annotations such as PL, AD, GT, etc.

You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
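To make that concrete with illustrative numbers (not from the original text): a singleton allele in a pool of 100 chromosomes sequenced to 100x depth is expected to contribute about one supporting read, which is also roughly what a 1% per-base error rate produces at that depth, so the two signals overlap.

```shell
# Illustrative numbers only.
DEPTH=100           # total reads covering the site
CHROMOSOMES=100     # chromosomes in the pool (singleton frequency = 1/CHROMOSOMES)
ERROR_RATE_PCT=1    # per-base sequencing error rate, in percent
EXPECTED_ALT=$((DEPTH / CHROMOSOMES))
EXPECTED_ERRORS=$((DEPTH * ERROR_RATE_PCT / 100))
echo "singleton support: $EXPECTED_ALT read(s); expected error reads: $EXPECTED_ERRORS"
```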

Trouble with running GenomicsDBImport


I'm writing a pipeline using GATK4 for our local cluster, which uses Slurm as the job scheduler. The command below seems to run successfully; however, it takes only a few seconds and the output files are very small. When I use the GenomicsDB workspace from the output as input for joint genotyping, the output VCF contains only the header section.

gatk --java-options "-Xmx8000M" GenomicsDBImport -V /gpfs/scratch/jw24/variant_discovery/raw_vcf/TEST/TEST_sample_2745_T_AS.g.vcf -V /gpfs/scratch/jw24/variant_discovery/raw_vcf/TEST/TEST_sample_2753_T_AS.g.vcf --genomicsdb-workspace-path /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB -L chr1
Using GATK jar /util/common/bioinformatics/GATK/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xmx8000M -jar /util/common/bioinformatics/GATK/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar GenomicsDBImport -V /gpfs/scratch/jw24/variant_discovery/raw_vcf/TEST/TEST_MMRF_2745_T_AS.g.vcf -V /gpfs/scratch/jw24/variant_discovery/raw_vcf/TEST/TEST_MMRF_2753_T_AS.g.vcf --genomicsdb-workspace-path /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB -L chr1
11:47:00.370 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/util/common/bioinformatics/GATK/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:47:00.540 INFO GenomicsDBImport - ------------------------------------------------------------
11:47:00.541 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.0.0
11:47:00.541 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
11:47:00.541 INFO GenomicsDBImport - Executing as jw24@srv-p27-40.cbls.ccr.buffalo.edu on Linux v3.10.0-693.11.6.el7.x86_64 amd64
11:47:00.542 INFO GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_45-b14
11:47:00.542 INFO GenomicsDBImport - Start Date/Time: February 5, 2018 11:47:00 AM EST
11:47:00.542 INFO GenomicsDBImport - ------------------------------------------------------------
11:47:00.542 INFO GenomicsDBImport - ------------------------------------------------------------
11:47:00.543 INFO GenomicsDBImport - HTSJDK Version: 2.13.2
11:47:00.543 INFO GenomicsDBImport - Picard Version: 2.17.2
11:47:00.543 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 1
11:47:00.543 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:47:00.543 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:47:00.543 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:47:00.543 INFO GenomicsDBImport - Deflater: IntelDeflater
11:47:00.544 INFO GenomicsDBImport - Inflater: IntelInflater
11:47:00.544 INFO GenomicsDBImport - GCS max retries/reopens: 20
11:47:00.544 INFO GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
11:47:00.544 INFO GenomicsDBImport - Initializing engine
11:47:01.241 INFO IntervalArgumentCollection - Processing 248956422 bp from intervals
11:47:01.244 INFO GenomicsDBImport - Done initializing engine
Created workspace /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB
11:47:01.437 INFO GenomicsDBImport - Vid Map JSON file will be written to /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB/vidmap.json
11:47:01.437 INFO GenomicsDBImport - Callset Map JSON file will be written to /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB/callset.json
11:47:01.438 INFO GenomicsDBImport - Complete VCF Header will be written to /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB/vcfheader.vcf
11:47:01.438 INFO GenomicsDBImport - Importing to array - /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB/genomicsdb_array
11:47:01.456 INFO ProgressMeter - Starting traversal
11:47:01.457 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
11:47:01.704 INFO GenomicsDBImport - Importing batch 1 with 2 samples
11:47:01.850 INFO GenomicsDBImport - Done importing batch 1/1
11:47:01.851 INFO ProgressMeter - chr1:1 0.0 1 152.3
11:47:01.852 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.0 minutes.
11:47:01.852 INFO GenomicsDBImport - Import completed!
11:47:01.872 INFO GenomicsDBImport - Shutting down engine
[February 5, 2018 11:47:01 AM EST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2356150272
Tool returned:
true

Thanks

Jason


BQSR and Novaseq quality scores


Dear all,
we recently bought an Illumina NovaSeq instrument, and as a first project we sequenced some human whole genomes.
Since the NovaSeq emits only four quality scores (2, 11, 25, and 37), I was wondering whether BQSR makes any sense. Could you please comment on this issue?
Best
Stefan

Germline copy number variant discovery (CNVs)


Purpose

Identify germline copy number variants.


Diagram is not available


Reference implementation is not available


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.

GATK4 results for PrecisionFDA Consistency challenge data


Dear team,

I've run GATK 4.0.0 using Cromwell (30.2) and the WDLs at https://github.com/gatk-workflows/gatk4-data-processing and https://github.com/gatk-workflows/gatk4-germline-snps-indels. I had bwa-aligned and deduplicated BAMs, so I modified "processing-for-variant-discovery-gatk4.wdl" to start from BQSR, but otherwise used the published WDLs with minimal modifications.

The results for the public PrecisionFDA datasets (https://precision.fda.gov/) are interesting. Recall and precision were great for the Truth challenge datasets (HiSeq 2500, PCR-free, ~50x), but not for the Consistency challenge datasets (HiSeq X, PCR+, ~30x). In particular, for indels from the Consistency challenge datasets, recall and precision were far worse than the GATK3 results available for these datasets: ~92% and ~79% for the Garvan dataset, and ~89% and ~83% for the HLI dataset, after VQSR filtration.

Do these numbers match what you normally get for PCR+ HiSeq X WGS datasets at depths of ~35x? If not, are there any parameters that I need to change?

Also, I think it would be very helpful to the community if the team made its GATK4 results publicly available for these popular public datasets.

Best,

Sangtae

What is the correct way to generate the input vcf and vcf.idx files for GATK4?


Hi,

We want to call SNVs with Mutect2 (M2) on our WGS data, but we do not have a gnomAD VCF covering the whole genome, so we downloaded per-chromosome VCF files from the gnomAD website. Could you please let me know the "correct" way to generate both the VCF and its index file for the GATK4 pipeline? Any suggestions are welcome. Thank you so much!

Do bcftools and tabix work for this?

Best,
Chunyang

VariantRecalibrator Stack Error


Hi,

I tried to run VariantRecalibrator on one WGS sample after HaplotypeCaller, but I got a stack trace error.

The command I ran is below.
java -Xmx4g -Djava.io.tmpdir=tmp -jar /home/serein/app/GenomeAnalysisTK-3.6-0/GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R /oasis/tscc/scratch/serein/genome/hg19/bwa/karyotic/hg19.fa \
    -input A0004/vcf/A0004.raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /oasis/tscc/scratch/serein/resources/GATK_bundle/hg19/hapmap_3.3.hg19.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 /oasis/tscc/scratch/serein/resources/GATK_bundle/hg19/1000G_omni2.5.hg19.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /oasis/tscc/scratch/serein/resources/GATK_bundle/hg19/dbsnp_138.hg19.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 /oasis/tscc/scratch/serein/resources/GATK_bundle/hg19/1000G_phase1.snps.high_confidence.hg19.vcf \
    -an DP -an QD -an MQ -an SOR -an FS -an MQRankSum -an ReadPosRankSum \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile A0004/vcf/A0004.recalibrate_SNP.recal \
    -tranchesFile A0004/vcf/A0004.recalibrate_SNP.tranches \
    -rscriptFile A0004/vcf/A0004.recalibrate_SNP_plots.R \
    &> A0004/log/A0004.VariantRecalibrator_SNP.log

The error message is below.

INFO 09:15:39,047 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.01403
INFO 09:16:17,661 ProgressMeter - chrY:59362044 6.8249266E7 43.1 m 37.0 s 100.0% 43.1 m 0.0 s
INFO 09:16:49,156 ProgressMeter - chrY:59362044 6.8249266E7 43.6 m 38.0 s 100.0% 43.6 m 0.0 s
INFO 09:17:12,991 VariantRecalibratorEngine - Convergence after 38 iterations!
INFO 09:17:19,157 ProgressMeter - chrY:59362044 6.8249266E7 44.1 m 38.0 s 100.0% 44.1 m 0.0 s
INFO 09:17:31,148 VariantRecalibratorEngine - Evaluating full set of 4580637 variants...
INFO 09:17:31,229 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR --
ERROR stack trace

java.lang.IllegalArgumentException: No data found.

I read some similar posts but cannot figure out how I should proceed.

Thanks,
Yunjiang

[ERROR] SelectVariants


Hi,

I want to run SelectVariants on FC using my own VCF, but I got the following error. Any suggestions are welcome. Thanks!

Best,
Chunyang

22:08:56.550 INFO SelectVariants - Initializing engine
22:08:57.487 INFO FeatureManager - Using codec VCFCodec to read file file:///cromwell_root/fc-b9cc3b48-69c9-4013-8eba-ff7973fc7bc9/gnomad.genomes.r2.0.2.sites.vcf.gz
22:08:58.200 INFO SelectVariants - Shutting down engine
[February 4, 2018 10:08:58 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 2.55 minutes.
Runtime.totalMemory()=74297344
htsjdk.tribble.TribbleException: Contig 1 does not have a length field.
at htsjdk.variant.vcf.VCFContigHeaderLine.getSAMSequenceRecord(VCFContigHeaderLine.java:80)
at htsjdk.variant.vcf.VCFHeader.getSequenceDictionary(VCFHeader.java:206)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getSequenceDictionary(FeatureDataSource.java:400)
at org.broadinstitute.hellbender.engine.FeatureManager.lambda$getAllSequenceDictionaries$5(FeatureManager.java:282)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.engine.FeatureManager.getAllSequenceDictionaries(FeatureManager.java:284)
at org.broadinstitute.hellbender.engine.GATKTool.validateSequenceDictionaries(GATKTool.java:588)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:563)
at org.broadinstitute.hellbender.engine.VariantWalker.onStartup(VariantWalker.java:43)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:277)
Using GATK jar /root/gatk.jar defined in environment variable GATK_LOCAL_JAR
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xmx2g -jar /root/gatk.jar SelectVariants -V /cromwell_root/fc-b9cc3b48-69c9-4013-8eba-ff7973fc7bc9/gnomad.genomes.r2.0.2.sites.vcf.gz -O selected.vcf --lenient

How can I exclude multi-mapped reads before using UnifiedGenotyper to call SNPs?


Hi, how can I exclude multi-mapped reads before using UnifiedGenotyper to call SNPs?

You mentioned that Picard could be used to exclude multi-mapped reads; which Picard tool should be used?

Thanks a lot!


HaplotypeCaller and reads mapped to multiple locations


Dear GATK team,

I've been trying to use GATK to call SNPs from RNA-Seq data mapped to a transcriptome assembly; I used Bowtie2 for the read mapping. I apologize if this information is already posted, but it seemed hard to find, so I hoped to get some advice or be pointed to the right place: how does HaplotypeCaller handle reads mapped to multiple places? I used paired-end reads for the mapping.

Thank you very much for any feedback you might have.

Sincerely,

Xin

Elapsed time about the CNVDiscoveryPipeline


@bhandsaker
Hi Bob, why is the CNVDiscoveryPipeline so time-consuming? I am testing a WGS sample (about 30x); it has been running for about 4 days and is still running. This is my script for the CNVDiscoveryPipeline:

#!/bin/bash

# If you adapt this script for your own use, you will need to set these two variables based on your environment.
# SV_DIR is the installation directory for SVToolkit - it must be an exported environment variable.
# SV_TMPDIR is a directory for writing temp files, which may be large if you have a large data set.

export SV_DIR=/work/SoftW/svtoolkit
SV_TMPDIR=2016006L-3-1/tmpdir_CNVDiscovry

runDir=2016006L-3-1
inputFile=/work1/wsh/4.test/1.perl/1.pipetest/WGS/2016006L-3-1.dedupped.bam
sites=2016006L-3-1.discovery.vcf
genotypes=2016006L-3-1.genotypes.vcf

# These executables must be on your path.
which java > /dev/null || exit 1
which Rscript > /dev/null || exit 1
which samtools > /dev/null || exit 1

# For SVAltAlign, you must use the version of bwa compatible with Genome STRiP.
export PATH=${SV_DIR}/bwa:${PATH}
export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}

classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"

mkdir -p ${runDir}/logs || exit 1
mkdir -p ${runDir}/metadata || exit 1

java -Xmx4g -cp ${classpath} \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-cp ${classpath} \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-configFile conf/genstrip_parameters.txt \
-R /work/wsh/0.Pipeline/TargetSeq/Genome_STRiP_ref/Homo_sapiens_assembly19.fasta \
-I ${inputFile} \
-md ${runDir}/metadata \
-runDirectory ${runDir} \
-jobLogDir ${runDir}/logs \
-intervalList /work/wsh/0.Pipeline/TargetSeq/Genome_STRiP_ref/Homo_sapiens_assembly19.interval.list \
-genderMapFile /work1/wsh/4.test/1.perl/1.pipetest/WGS/2016006L-3-1_gender.map \
-jobRunner Shell \
--disableJobReport \
-tempDir ${SV_TMPDIR} \
-gatkJobRunner Shell \
-retry 10 \
-tilingWindowSize 1000 \
-tilingWindowOverlap 500 \
-maximumReferenceGapLength 1000 \
-boundaryPrecision 100 \
-minimumRefinedLength 500 \
-genotypingParallelRecords 500 \
-run


Could you help me check whether there are any mistakes in my script? Thank you very much.

MuTect2 Seems to lose a lot of mutations.


I use MuTect2 (tumor-only mode) to find somatic mutations, but at some positions it is not able to find them.
I have attached some IGV screenshots of regions where MuTect2 has missed somatic mutations (top is the BAM file given as input to MuTect2; bottom is the bamout file from MuTect2).
Some are very obvious, and I cannot understand why MuTect2 does not identify them as mutations.
I think I might need to modify some parameters when I run MuTect2.
Which parameters can I adjust?

The command is here:
java -Xmx4g -jar /GATK3.7-0/GenomeAnalysisTK.jar -T MuTect2 \
    -L Target.bed \
    -R hg19.fasta \
    -I:tumor test.bam \
    --dbsnp All_20170403.vcf.gz \
    --cosmic CosmicCodingMutsGRCh37v81.vcf \
    -dfrac 1 \
    --maxReadsInRegionPerSample 100000 \
    --minPruning 1 \
    --artifact_detection_mode \
    --useNewAFCalculator \
    -o test.vcf \
    -bamout test.bam \
    -log test.log

Versions: BWA-MEM 0.7.12, GATK 3.7/4.0

(IGV screenshots were attached here.)




CalculateMixingFractions Returns all NaN?


Hi there, I'm attempting to use the CalculateMixingFractions tool to estimate the mixture of genomes from a VCF in a pooled BAM file, and I am getting all NaN as output.

The output looks like this:

SAMPLE  MIXING_FRACTION
GENO1   NaN
GENO2   NaN
GENO3   NaN
etc...

And the header of my VCF file is the following:

##fileformat=VCFv4.1
##filedate=2017.8.8
##source=Minimac3
##FORMAT=
##FORMAT=
##FORMAT=
##INFO=
##INFO=
##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GENO1 GENO2 GENO3
chr10 67193 rs1111111:67193:C:T C T . . . GT:AS 0|0:0,0 0|0:0,0 0|0:4,0

And the header of the SAM file is:

@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
...etc
SL-HEL:C66HBACXX150512:C66HBACXX:6:1302:18557:85138 147 chr1 10000 0 25M = 10003 -22 ATAACCCTAACCCTAACCCTAACCC ##BA39A<@CBBCCBBBAAAA<==> MD:Z:25 PG:Z:MarkDuplicates NM:i:0 AS:i:25 XS:i:25 etc...

(howto) Recalibrate base quality scores = run BQSR


Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

Prerequisites

  • TBD

Steps

  1. Analyze patterns of covariation in the sequence dataset
  2. Do a second pass to analyze covariation remaining after recalibration
  3. Generate before/after plots
  4. Apply the recalibration to your sequence data

1. Analyze patterns of covariation in the sequence dataset

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I input_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -o recal_data.table

Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.

Note that -L 20 is used here and in the next steps to restrict analysis to only chromosome 20 in the b37 human genome reference build. To run against a different reference, you may need to change the name of the contig according to the nomenclature used in your reference.


2. Do a second pass to analyze covariation remaining after recalibration

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fa \
    -I input_reads.bam \
    -L 20 \
    -knownSites dbsnp.vcf \
    -knownSites gold_indels.vcf \
    -BQSR recal_data.table \
    -o post_recal_data.table

Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.


3. Generate before/after plots

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T AnalyzeCovariates \
    -R reference.fa \
    -L 20 \
    -before recal_data.table \
    -after post_recal_data.table \
    -plots recalibration_plots.pdf

Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.


4. Apply the recalibration to your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fa \
    -I input_reads.bam \
    -L 20 \
    -BQSR recal_data.table \
    -o recal_reads.bam

Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag --emit_original_quals to the PrintReads command, in which case the original qualities will also be written to the file under the OQ tag.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.
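The four steps above can be sketched as one script. This is a minimal outline reusing the placeholder filenames from the commands above; the commands are echoed rather than executed, so you can review the plan before running it:

```shell
# Assemble the four BQSR commands from shared settings (placeholder filenames).
GATK="java -jar GenomeAnalysisTK.jar"
COMMON="-R reference.fa -I input_reads.bam -L 20"
KNOWN="-knownSites dbsnp.vcf -knownSites gold_indels.vcf"

STEP1="$GATK -T BaseRecalibrator $COMMON $KNOWN -o recal_data.table"
STEP2="$GATK -T BaseRecalibrator $COMMON $KNOWN -BQSR recal_data.table -o post_recal_data.table"
STEP3="$GATK -T AnalyzeCovariates -R reference.fa -L 20 -before recal_data.table -after post_recal_data.table -plots recalibration_plots.pdf"
STEP4="$GATK -T PrintReads $COMMON -BQSR recal_data.table -o recal_reads.bam"

# Review the plan; swap 'echo' for 'eval' to actually run each step in order.
for STEP in "$STEP1" "$STEP2" "$STEP3" "$STEP4"; do
    echo "$STEP"
done
```

Keeping the shared arguments in variables ensures that all four steps see the same reference, input BAM, and intervals, which is a common source of mistakes when the commands are pasted separately.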
