
M2 error with canine germline resource and variants_for_contamination files


Dear GATK team,

I am running the Mutect2 pipeline on canine tumor samples in Terra, using WDL version 2.5 and GATK version 4.1.2.0. I was able to run the pipeline successfully without supplying a germline resource file or a VCF of common variants for contamination; however, when I added these files in, I got the following error:

java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.lambda$getGermlineAltAlleleFrequencies$31(SomaticGenotypingEngine.java:350)
    at java.util.stream.ReferencePipeline$6$1.accept(ReferencePipeline.java:244)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
    at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
    at java.util.stream.DoublePipeline.toArray(DoublePipeline.java:506)
    at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.getGermlineAltAlleleFrequencies(SomaticGenotypingEngine.java:352)
    at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.getNegativeLogPopulationAFAnnotation(SomaticGenotypingEngine.java:335)
    at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:141)
    at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:250)
    at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:324)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:308)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:281)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1039)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /root/gatk.jar defined in environment variable GATK_LOCAL_JAR
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx3000m -jar /root/gatk.jar Mutect2 -R gs://fc-0b0cb3ce-e2cb-4aef-a8b2-08e60d78e87c/Canis_lupus_familiaris_assembly3.fasta -I gs://fc-8268e82b-ed61-4e04-a8c9-a95a05c0952e/bda6f5ba-8928-45bf-a6b0-9fe67d8dd9a4/PreProcessingForVariantDiscovery_GATK4/cccdda67-56e1-4363-aa6c-46ce53ef8afd/call-GatherBamFiles/attempt-2/Abrams_cell.bam -tumor Abrams_1 --germline-resource gs://fc-0b0cb3ce-e2cb-4aef-a8b2-08e60d78e87c/canid_wgs_ref.1.0.no_samples.vcf.gz -pon gs://fc-afa03a31-404c-4a93-9f6a-31b673db5c69/b92f3c35-5813-455b-94dc-3de3b54f5f98/Mutect2_Panel/c9f21d8a-384e-4d17-a6f8-79a502698827/call-MergeVCFs/1-Mutect2_PON_2019-07-25T22-08-49.vcf -L gs://fc-afa03a31-404c-4a93-9f6a-31b673db5c69/f2138b33-3918-4f8a-9b87-1823a0084ac3/Mutect2/c4844164-ecad-4878-9e5d-cd134a7fb40d/call-SplitIntervals/glob-0fc990c5ca95eebc97c4c204e3e303e1/0000-scattered.interval_list -O output.vcf --f1r2-tar-gz f1r2.tar.gz --af-of-alleles-not-in-resource 0.0007 --downsampling-stride 20 --max-reads-per-alignment-start 6 --max-suspicious-reads-per-alignment-start 6

The germline resource is a VCF of approximately 80 million SNPs and indels (including multiallelic sites) called from a large number of canine whole-genome sequences. It is formatted as a VCF with no sample information:

chr1    240     .       TG      T       464.40  PASS    AC=4;AF=0.011;AN=332;BaseQRankSum=0.674;ClippingRankSum=0;DP=14798;ExcessHet=0.0026;FS=5.63;InbreedingCoeff=-0.005;MLEAC=14;MLEAF=0.017;MQ=7.49;MQRankSum=-0.967;QD=22.11;ReadPosRankSum=0.967;SOR=3.18

The VCF for variants for contamination is a subset of this VCF, with only biallelic SNPs with AF between 0.01 and 0.2. Initially, it was formatted the same as the above file.
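
For concreteness, here is a hedged bcftools sketch of that kind of subsetting, with the INFO field stripped down to AF as in the debugging step described below (file names follow the command above; this is an illustration, not necessarily the exact commands used):

    # keep biallelic SNPs with 0.01 < AF < 0.2, then drop every INFO field except AF
    bcftools view -m2 -M2 -v snps -i 'INFO/AF>0.01 && INFO/AF<0.2' \
        canid_wgs_ref.1.0.no_samples.vcf.gz \
      | bcftools annotate -x '^INFO/AF' -O z -o variants_for_contamination.vcf.gz -
    tabix -p vcf variants_for_contamination.vcf.gz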

As part of debugging, I tried removing everything except allele frequency from the INFO field of the variants-for-contamination file, and I tried using that simplified VCF as both the germline resource and the variants-for-contamination file. This seemed to fix the index-out-of-bounds error, but the job then failed at the filtering step, with the following error:

java.lang.IllegalArgumentException: log10p: Log10-probability must be 0 or less
    at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:724)
    at org.broadinstitute.hellbender.utils.MathUtils.log10BinomialProbability(MathUtils.java:934)
    at org.broadinstitute.hellbender.utils.MathUtils.binomialProbability(MathUtils.java:927)
    at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.ContaminationFilter.calculateErrorProbability(ContaminationFilter.java:56)
    at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.Mutect2VariantFilter.errorProbability(Mutect2VariantFilter.java:15)
    at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.ErrorProbabilities.lambda$new$1(ErrorProbabilities.java:19)
    at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
    at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.ErrorProbabilities.<init>(ErrorProbabilities.java:19)
    at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.Mutect2FilteringEngine.accumulateData(Mutect2FilteringEngine.java:141)
    at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.FilterMutectCalls.nthPassApply(FilterMutectCalls.java:146)
    at org.broadinstitute.hellbender.engine.MultiplePassVariantWalker.lambda$traverse$0(MultiplePassVariantWalker.java:40)
    at org.broadinstitute.hellbender.engine.MultiplePassVariantWalker.lambda$traverseVariants$1(MultiplePassVariantWalker.java:77)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.MultiplePassVariantWalker.traverseVariants(MultiplePassVariantWalker.java:75)
    at org.broadinstitute.hellbender.engine.MultiplePassVariantWalker.traverse(MultiplePassVariantWalker.java:40)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /root/gatk.jar defined in environment variable GATK_LOCAL_JAR
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx6500m -jar /root/gatk.jar FilterMutectCalls -V gs://fc-afa03a31-404c-4a93-9f6a-31b673db5c69/0bbb4e0e-7293-4ce5-b81f-d722fcec561a/Mutect2/223610c8-ec63-4439-b339-9503ceb80828/call-MergeVCFs/Abrams_cell-unfiltered.vcf -R gs://fc-0b0cb3ce-e2cb-4aef-a8b2-08e60d78e87c/Canis_lupus_familiaris_assembly3.fasta -O Abrams_cell-filtered.vcf --contamination-table /cromwell_root/fc-afa03a31-404c-4a93-9f6a-31b673db5c69/0bbb4e0e-7293-4ce5-b81f-d722fcec561a/Mutect2/223610c8-ec63-4439-b339-9503ceb80828/call-CalculateContamination/contamination.table --tumor-segmentation /cromwell_root/fc-afa03a31-404c-4a93-9f6a-31b673db5c69/0bbb4e0e-7293-4ce5-b81f-d722fcec561a/Mutect2/223610c8-ec63-4439-b339-9503ceb80828/call-CalculateContamination/segments.table --ob-priors /cromwell_root/fc-afa03a31-404c-4a93-9f6a-31b673db5c69/0bbb4e0e-7293-4ce5-b81f-d722fcec561a/Mutect2/223610c8-ec63-4439-b339-9503ceb80828/call-LearnReadOrientationModel/artifact-priors.tar.gz -stats /cromwell_root/fc-afa03a31-404c-4a93-9f6a-31b673db5c69/0bbb4e0e-7293-4ce5-b81f-d722fcec561a/Mutect2/223610c8-ec63-4439-b339-9503ceb80828/call-MergeStats/merged.stats --filtering-stats filtering.stats --min-median-read-position 10

Both of these tests were run on an interval that included a single chromosome (approximately 24Mb).

Thank you for your help!

Best,
Kate


Why "InbreedingCoeff -must provide < 10 samples" warning is generated, when no. of samples is 29?

Dear GATK Team,

I tried performing joint calling for 29 samples from the GenomicsDB datastore for the interval of complete chromosome 1, but got the following warning: "WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples". Even though my number of samples is 29, the warning still says I must provide at least 10 samples.
Also, in the generated VCF file I have the inbreeding score calculated and reported. Why am I getting this warning, then? Is it something to worry about?

Thank you

Regards

Abhishek

Multi-tumor variant calling with Mutect2 (GATK v4.1.3.0) weakens sensitivity


This observation is very clear when one or more of the tumor samples have very low tumor content or no tumor content.

So, for example, if we have tumor1 and tumor2 with high tumor content and tumor3 with low tumor content, running Mutect2 on tumor1 and tumor2 will produce many more variants than calling tumor1, tumor2, and tumor3 together.

Which part of my QD plot is the homozygous peak?


Hi there,

Just a quick question, which I think may be of use to people with similarly...squiffy...plots! I've plotted QD values against density to inform the hard-filtering process, but I'm having difficulty discerning the expected peaks for heterozygous and homozygous calls, as described at https://software.broadinstitute.org/gatk/guide/article?id=6925. As you can see from the attached plot, there is a peak at the lower values (or is it a shoulder?), a tiny bump, and then a major peak, but then just a shoulder on the other side of the peak. As effective (and stringent) filtering is key to my study, I'd like to know what each peak and shoulder represents before I take the plunge, if anyone can make an educated guess, please?

Many thanks,

Ian

Is it possible to use two versions of GATK (4 and 2) on the same computer?

I currently use GATK4 but need some GATK tools that are available only in previous versions. Is it possible to install and use both versions on the same computer (command line) with the same files?
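
For what it's worth, a minimal sketch of running the two side by side; the install paths are placeholders:

    # GATK4 tools go through the gatk wrapper script
    ~/tools/gatk-4.1.3.0/gatk HaplotypeCaller -R ref.fasta -I sample.bam -O gatk4.vcf.gz

    # GATK 2/3 tools are invoked directly from GenomeAnalysisTK.jar with -T
    java -jar ~/tools/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R ref.fasta -I sample.bam -o gatk2.vcf

Both versions can read the same reference and BAM files, provided the usual .fai/.dict and BAM index files are present.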

A fatal error has been detected by the Java Runtime Environment while running Genome Strip.


Hello Every one,

I was trying to run GenomeSTRiP CNV discovery for one of my samples and I am getting the error below.

INFO 14:42:06,029 FunctionEdge - Starting: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/.queue/tmp' '-cp' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/SVToolkit.jar:/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/gatk/Queue.jar' '-cp' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/SVToolkit.jar:/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/lib/gatk/Queue.jar' 'org.broadinstitute.sv.discovery.SVDepthScanner' '-O' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/testCNV/cnv_stage1/seq_9/seq_9.sites.vcf.gz' '-R' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/svtoolkit/reference_metadata_bundles/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta' '-genomeMaskFile' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/svtoolkit/reference_metadata_bundles/Homo_sapiens_assembly19/Homo_sapiens_assembly19.svmask.fasta' '-genomeMaskFile' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/svtoolkit/reference_metadata_bundles/Homo_sapiens_assembly19/Homo_sapiens_assembly19.lcmask.fasta' '-genderMapFile' 'gender_map_file.txt' '-md' 'testCNV/metadata' '-configFile' '/gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/conf/genstrip_parameters.txt' '-L' '9' '-tilingWindowSize' '1000' '-tilingWindowOverlap' '500' '-maximumReferenceGapLength' '1000'
INFO 14:42:06,030 FunctionEdge - Output written to /gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/testCNV/cnv_stage1/seq_9/logs/CNVDiscoveryStage1-1.out
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000037f8d32d5f, pid=26385, tid=47583501379328
#
# JRE version: Java(TM) SE Runtime Environment (8.0_66-b17) (build 1.8.0_66-b17)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.66-b17 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x132d5f]
#
# Core dump written. Default location: /gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/core or core.26385
#
# An error report file with more information is saved as:
# /gpfs/projects/bioinfo/najeeb/CNV_pipeline/GenomeStrip/svtoolkit/hs_err_pid26385.log
#
I tried both Java 1.7 and Java 1.8, with GenomeSTRiP 2.00.1650 and 2.00.1636 as well. The preprocessing step works, but CNVDiscovery is giving errors. I am using LSF for submitting jobs, as bsub -n 8 scriptname.sh.
I am attaching the script for kind perusal as well.

Someone please help me on this.


What's the sample limit for CombineGVCFs?

Hello!

I'm trying to run VQSR on a set of single-sample VCFs. Since this is a large data set with almost 50,000 samples, I think it would be better to merge them into a multi-sample VCF and do VQSR afterwards.
During the merge step, there is an error saying the argument list is too long for gatk, which is probably due to the number of samples.
Could someone please tell me what the sample limit for CombineGVCFs is? Maybe I can separate the samples into several groups and do the merge in multiple passes.
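
A hedged sketch of such a batched merge (file names and batch size are placeholders; GenomicsDBImport with a --sample-name-map file is another way to avoid very long command lines):

    # all_gvcfs.list holds one GVCF path per line; merge in batches of 200
    split -l 200 all_gvcfs.list batch_
    for b in batch_*; do
        # expand each batch file into a series of -V arguments
        gatk CombineGVCFs -R ref.fasta \
            $(sed 's/^/-V /' "$b") \
            -O combined_${b}.g.vcf.gz
    done
    # the combined_batch_*.g.vcf.gz files can then be merged again, or genotyped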

Thanks in advance!

Warning in GenotypeGVCFs using output from CombineGVCFs in GATK4

I am running GenotypeGVCFs using a single, combined GVCF produced by CombineGVCFs in GATK4. The file contains 44 individuals, yet I am receiving the warning 'InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples'. Should I be concerned by this? I can verify from the header that my input file contains more than 10 samples.

This is the command I am running:

```
java -jar ../../programs/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar GenotypeGVCFs -R revisedAssemblyUnmasked.fa --variant subset_of_pop.vcf -O genotype_subset.vcf
```
Thank you,


Appropriate LB parameter replacement (AddOrReplaceReadGroups) if starting with paired fastq files

Greetings, all.

I am starting with paired fastq WGS files and cannot obtain information about what library was used for prep. To the best of my knowledge, each file represents a single sample (R1 or R2) and is the result of a single run.

I would like to add the readgroup information (picard AddOrReplaceReadGroups) and am searching for an appropriate library identifier when the original information is gone forever. My question is very similar to https://gatkforums.broadinstitute.org/gatk/discussion/9263/obtaining-read-group-information-from-the-fastq-files -- but not quite...

I could cobble together some items from the fastq headers (e.g., @D00687:89:CBV1EANXX:2:2201:1088:2049 1:N:0:ACTATGCA) but I don't know if that is copacetic. Guidance is appreciated.
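
In case it helps frame the question, a hedged sketch built from that header (flowcell CBV1EANXX, lane 2, and barcode ACTATGCA are real fields from the quoted header; RGLB is a made-up placeholder, since the true library ID is lost):

    # @D00687:89:CBV1EANXX:2:2201:1088:2049 1:N:0:ACTATGCA gives
    # flowcell CBV1EANXX, lane 2, sample barcode ACTATGCA
    java -jar picard.jar AddOrReplaceReadGroups \
        I=sample1.bam O=sample1.rg.bam \
        RGID=CBV1EANXX.2 RGPU=CBV1EANXX.2.ACTATGCA \
        RGPL=ILLUMINA RGSM=sample1 RGLB=sample1.lib1

MarkDuplicates groups reads by library via LB, so the main requirement is a label that is consistent for reads from the same library when the original name is unrecoverable.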

gatk4 error: java.lang.IllegalStateException: The covariates table is missing ReadGroup


Hi,
Trying to run the GATK4 best practices pipeline on an AWS Batch setup with Docker (using SSDs).
Most of the time, ApplyBQSR fails with the following error.

Oops... Pipeline execution stopped with the following message:

08:43:27.126 INFO  ApplyBQSR - ------------------------------------------------------------
08:43:27.126 INFO  ApplyBQSR - ------------------------------------------------------------
08:43:27.126 INFO  ApplyBQSR - HTSJDK Version: 2.14.1
08:43:27.126 INFO  ApplyBQSR - Picard Version: 2.17.2
08:43:27.126 INFO  ApplyBQSR - HTSJDK Defaults.COMPRESSION_LEVEL : 1
08:43:27.126 INFO  ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
08:43:27.126 INFO  ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
08:43:27.126 INFO  ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
08:43:27.127 INFO  ApplyBQSR - Deflater: IntelDeflater
08:43:27.127 INFO  ApplyBQSR - Inflater: IntelInflater
:
08:43:27.954 INFO  ProgressMeter - Current Locus  Elapsed Minutes  Reads Processed  Reads/Minute
08:43:28.470 INFO  ApplyBQSR - Shutting down engine
[February 28, 2018 8:43:28 AM UTC] org.broadinstitute.hellbender.tools.walkers.bqsr.ApplyBQSR done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2112618496
java.lang.IllegalStateException: The covariates table is missing ReadGroup e5dae369 in RecalTable0
    at org.broadinstitute.hellbender.utils.Utils.validate(Utils.java:706)
    at org.broadinstitute.hellbender.utils.recalibration.covariates.ReadGroupCovariate.keyForReadGroup(ReadGroupCovariate.java:81)
    at org.broadinstitute.hellbender.utils.recalibration.covariates.ReadGroupCovariate.recordValues(ReadGroupCovariate.java:53)
    at org.broadinstitute.hellbender.utils.recalibration.covariates.StandardCovariateList.recordAllValuesInStorage(StandardCovariateList.java:133)
    at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.computeCovariates(RecalUtils.java:546)
    at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.computeCovariates(RecalUtils.java:527)
    at org.broadinstitute.hellbender.transformers.BQSRReadTransformer.apply(BQSRReadTransformer.java:145)
    at org.broadinstitute.hellbender.transformers.BQSRReadTransformer.apply(BQSRReadTransformer.java:27)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.ReadWalker.traverse(ReadWalker.java:94)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:277)
Using GATK jar /opt/conda/share/gatk4-4.0.1.1-0/gatk-package-4.0.1.1-local.jar
Rerunning the same step works most of the time. Any suggestions? Thanks.
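
The message itself says a read group in the BAM (e5dae369) is absent from RecalTable0, which is typically what happens when the recalibration table was generated from a different input than the BAM being recalibrated. A minimal sketch of the intended pairing, with placeholder file names:

    # build the recalibration table from the SAME BAM that ApplyBQSR will consume,
    # so every read group in the BAM appears in the table
    gatk BaseRecalibrator -R ref.fasta -I sample.bam \
        --known-sites known_sites.vcf.gz -O recal.table
    gatk ApplyBQSR -R ref.fasta -I sample.bam \
        --bqsr-recal-file recal.table -O sample.recal.bam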

Running HaplotypeCaller on a multi-lane sample

Hello,

I am interested in running HaplotypeCaller on a multi-lane sample. What is the best way to do this? I followed the protocol here (https://software.broadinstitute.org/gatk/documentation/article?id=6057), which says to run MarkDuplicates and BQSR before HaplotypeCaller, which I did. I fed the new BAM file into HaplotypeCaller but get an error about a corrupt GVCF file. Am I meant to merge the read groups into a single read group before using HaplotypeCaller?
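
For context, a minimal sketch of one common approach, assuming per-lane BAMs that already carry distinct @RG lines sharing the same SM tag (file names are placeholders):

    # MarkDuplicates accepts multiple inputs and merges them;
    # the per-lane read groups are preserved, all with SM=sample1
    java -jar picard.jar MarkDuplicates \
        I=sample1.lane1.bam I=sample1.lane2.bam \
        O=sample1.md.bam M=sample1.dup_metrics.txt
    # a single HaplotypeCaller run over the merged BAM; read groups need not be merged
    gatk HaplotypeCaller -R ref.fasta -I sample1.md.bam \
        -O sample1.g.vcf.gz -ERC GVCF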

Best practices for small number of exome samples


Hi, in processing a small number of exome samples (3 or 4, e.g. trio or quad family sequencing), what is the best way to do the analysis? Joint calling requires a large number of samples per the workflow documentation. So should we stop after creating the initial GVCFs? Or is there some other step we should do to combine the GVCFs from the family?
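
For a family, one reasonable pattern is per-sample GVCFs followed by joint genotyping of just those samples; a minimal sketch with placeholder file names:

    # combine the family's GVCFs (each produced by HaplotypeCaller -ERC GVCF)
    gatk CombineGVCFs -R ref.fasta \
        -V father.g.vcf.gz -V mother.g.vcf.gz -V child.g.vcf.gz \
        -O family.g.vcf.gz
    # joint-genotype the combined GVCF
    gatk GenotypeGVCFs -R ref.fasta -V family.g.vcf.gz -O family.vcf.gz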


Funcotator Information and Tutorial


0 - Introduction

This page explains what Funcotator is and how to run it.

0.1 - Table of Contents

  1. 0.0 Introduction
    1. 0.1 Table of Contents
  2. 1.0 Funcotator Background Information
    1. 1.1 Data Sources
      1. 1.1.1 Data Source Folders
      2. 1.1.2 Pre-Packaged Data Sources
        1. 1.1.2.1 Downloading Pre-Packaged Data Sources
        2. 1.1.2.2 gnomAD
          1. 1.1.2.2.1 Enabling gnomAD
          2. 1.1.2.2.2 Included gnomAD Fields
      3. 1.1.3 Data Source Downloader Tool
      4. 1.1.4 Disabling Data Sources
      5. 1.1.5 User-Defined Data Sources
        1. 1.1.5.1 Configuration File Format
          1. 1.1.5.1.1 Simple XSV Config File Example
          2. 1.1.5.1.2 Locatable XSV Config File Example
        2. 1.1.5.2 Cloud Data Sources
      6. 1.1.6 Data Source Versioning
    2. 1.2 Input Variant Data Formats
    3. 1.3 Output
      1. 1.3.1 Output Data Formats
        1. 1.3.1.1 VCF Format
        2. 1.3.1.2 MAF Format
      2. 1.3.2 Annotations for Pre-Packaged Data Sources
        1. 1.3.2.1 Gencode Annotation Specification
    4. 1.4 Reference Genome Versions
    5. 1.5 Comparisons with Oncotator
      1. 1.5.1 Funcotator / Oncotator Feature Comparison
      2. 1.5.2 Oncotator Bugs Compared With Funcotator
  3. 2.0 Tutorial
    1. 2.0 Requirements
    2. 2.1 Running Funcotator in the GATK With Base Options
    3. 2.2 Optional Parameters
      1. 2.2.1 - --ignore-filtered-variants
      2. 2.2.2 - --transcript-selection-mode
      3. 2.2.3 - --transcript-list
      4. 2.2.4 - --annotation-default
      5. 2.2.5 - --annotation-override
      6. 2.2.6 - --allow-hg19-gencode-b37-contig-matching
  4. 3.0 FAQ
  5. 4.0 Known Issues
  6. 5.0 Github
  7. 6.0 Tool Documentation

1 - Funcotator Background Information

Funcotator (FUNCtional annOTATOR) analyzes given variants for their function (as retrieved from a set of data sources) and produces the analysis in a specified output file.

This tool allows a user to add their own annotations to variants based on a set of data sources. Each data source can be customized to annotate a variant based on several matching criteria. This allows a user to create their own custom annotations easily, without modifying any Java code.

[Workflow diagram: an example Funcotator workflow based on the GATK Best Practices Somatic Pipeline]

1.1 - Data Sources

Data sources are expected to be in folders that are specified as input arguments. While multiple data source folders can be specified, no two data sources can have the same name.

1.1.1 - Data Source Folders

In each main data source folder, there should be sub-directories for each individual data source, with further sub-directories for a specific reference (e.g. hg19, hg38, etc.). In the reference-specific data source directory, there is a configuration file detailing information about the data source and how to match it to a variant. This configuration file is required.

An example of a data source directory is the following:

    dataSourcesFolder/
         Data_Source_1/
             hg19
                 data_source_1.config
                 data_source_1.data.file.one
                 data_source_1.data.file.two
                 data_source_1.data.file.three
                 ...
              hg38
                 data_source_1.config
                 data_source_1.data.file.one
                 data_source_1.data.file.two
                 data_source_1.data.file.three
                 ...
         Data_Source_2/
             hg19
                 data_source_2.config
                 data_source_2.data.file.one
                 data_source_2.data.file.two
                 data_source_2.data.file.three
                 ...
              hg38
                 data_source_2.config
                 data_source_2.data.file.one
                 data_source_2.data.file.two
                 data_source_2.data.file.three
                 ...
          ...

1.1.2 - Pre-Packaged Data Sources

The GATK includes two sets of pre-packaged data sources, allowing for Funcotator use without (much) additional configuration.
These data source packages correspond to the germline and somatic use cases.
Broadly speaking, if you have a germline VCF, the germline data sources are what you want to use to start with.
Conversely, if you have a somatic VCF, the somatic data sources are what you want to use to start with.

1.1.2.1 - Downloading Pre-Packaged Data Sources

Versioned gzip archives of data source files are provided here:
● FTP: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/funcotator/
● Google Cloud Bucket: gs://broad-public-datasets/funcotator/

1.1.2.2 - gnomAD

The pre-packaged data sources include a subset of gnomAD, a large database of known variants. This subset contains a greatly reduced set of INFO fields, primarily allele frequency data. gnomAD is split into two parts, one based on exome data and one based on whole-genome data. These two data sources are not equivalent, and for complete coverage using gnomAD we recommend annotating with both.
Due to its size, gnomAD cannot be included in the data sources package directly. Instead, the configuration data are present and point to a Google bucket in which the gnomAD data reside. This causes Funcotator to actively connect to that bucket when it is run.
For this reason, gnomAD is disabled by default.

Because Funcotator will query the Internet when gnomAD is enabled, performance will depend on the machine's Internet connection speed.
If this degradation is significant, you can localize gnomAD to the machine running Funcotator to improve performance (though, given the size of gnomAD, this may be impractical).

1.1.2.2.1 - Enabling gnomAD

To enable gnomAD, simply change directories to your data sources directory and untar the gnomAD tar.gz files:

cd DATA_SOURCES_DIR
tar -zxf gnomAD_exome.tar.gz
tar -zxf gnomAD_genome.tar.gz

1.1.2.2.2 - Included gnomAD Fields

The fields included in the pre-packaged gnomAD subset are the following:

AF: Allele Frequency, for each ALT allele, in the same order as listed
AF_afr: Alternate allele frequency in samples of African-American ancestry
AF_afr_female: Alternate allele frequency in female samples of African-American ancestry
AF_afr_male: Alternate allele frequency in male samples of African-American ancestry
AF_amr: Alternate allele frequency in samples of Latino ancestry
AF_amr_female: Alternate allele frequency in female samples of Latino ancestry
AF_amr_male: Alternate allele frequency in male samples of Latino ancestry
AF_asj: Alternate allele frequency in samples of Ashkenazi Jewish ancestry
AF_asj_female: Alternate allele frequency in female samples of Ashkenazi Jewish ancestry
AF_asj_male: Alternate allele frequency in male samples of Ashkenazi Jewish ancestry
AF_eas: Alternate allele frequency in samples of East Asian ancestry
AF_eas_female: Alternate allele frequency in female samples of East Asian ancestry
AF_eas_jpn: Alternate allele frequency in samples of Japanese ancestry
AF_eas_kor: Alternate allele frequency in samples of Korean ancestry
AF_eas_male: Alternate allele frequency in male samples of East Asian ancestry
AF_eas_oea: Alternate allele frequency in samples of non-Korean, non-Japanese East Asian ancestry
AF_female: Alternate allele frequency in female samples
AF_fin: Alternate allele frequency in samples of Finnish ancestry
AF_fin_female: Alternate allele frequency in female samples of Finnish ancestry
AF_fin_male: Alternate allele frequency in male samples of Finnish ancestry
AF_male: Alternate allele frequency in male samples
AF_nfe: Alternate allele frequency in samples of non-Finnish European ancestry
AF_nfe_bgr: Alternate allele frequency in samples of Bulgarian ancestry
AF_nfe_est: Alternate allele frequency in samples of Estonian ancestry
AF_nfe_female: Alternate allele frequency in female samples of non-Finnish European ancestry
AF_nfe_male: Alternate allele frequency in male samples of non-Finnish European ancestry
AF_nfe_nwe: Alternate allele frequency in samples of North-Western European ancestry
AF_nfe_onf: Alternate allele frequency in samples of non-Finnish but otherwise indeterminate European ancestry
AF_nfe_seu: Alternate allele frequency in samples of Southern European ancestry
AF_nfe_swe: Alternate allele frequency in samples of Swedish ancestry
AF_oth: Alternate allele frequency in samples of uncertain ancestry
AF_oth_female: Alternate allele frequency in female samples of uncertain ancestry
AF_oth_male: Alternate allele frequency in male samples of uncertain ancestry
AF_popmax: Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry)
AF_raw: Alternate allele frequency in samples, before removing low-confidence genotypes
AF_sas: Alternate allele frequency in samples of South Asian ancestry
AF_sas_female: Alternate allele frequency in female samples of South Asian ancestry
AF_sas_male: Alternate allele frequency in male samples of South Asian ancestry
OriginalAlleles*: A list of the original alleles (including REF) of the variant prior to liftover. If the alleles were not changed during liftover, this attribute will be omitted.
OriginalContig*: The name of the source contig/chromosome prior to liftover.
OriginalStart*: The position of the variant on the source contig prior to liftover.
ReverseComplementedAlleles*: The REF and the ALT alleles have been reverse complemented in liftover since the mapping from the previous reference to the current one was on the negative strand.
SwappedAlleles*: The REF and the ALT alleles have been swapped in liftover due to changes in the reference. It is possible that not all INFO annotations reflect this swap, and in the genotypes, only the GT, PL, and AD fields have been modified. You should check the TAGS_TO_REVERSE parameter that was used during the LiftOver to be sure.

* - only available in hg38

1.1.3 - Data Source Downloader Tool

To improve ease-of-use of Funcotator, there is a tool to download the pre-packaged data sources to the user's machine.
This tool is the FuncotatorDataSourceDownloader and can be run to retrieve the pre-packaged data sources from the Google bucket and localize them to the machine on which it is run.
Briefly:
For somatic data sources:

./gatk FuncotatorDataSourceDownloader --somatic --validate-integrity --extract-after-download

For germline data sources:

./gatk FuncotatorDataSourceDownloader --germline --validate-integrity --extract-after-download

1.1.4 - Disabling Data Sources

A data source can be disabled by removing the folder containing the configuration file for that source. This can be done on a per-reference basis. If the entire data source should be disabled, the entire top-level data source folder can be removed.
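
For example, using the Data_Source_1 layout shown above (a sketch; the paths are illustrative):

# disable Data_Source_1 for hg19 only
rm -r dataSourcesFolder/Data_Source_1/hg19

# disable Data_Source_1 for all references
rm -r dataSourcesFolder/Data_Source_1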

1.1.5 - User-Defined Data Sources

Users can define their own data sources by creating a new correctly-formatted data source sub-directory in the main data sources folder. In this sub-directory, the user must create an additional folder for the reference for which the data source is valid. If the data source is valid for multiple references, then multiple reference folders should be created. Inside each reference folder, the user should place the file(s) containing the data for the data source. Additionally the user must create a configuration file containing metadata about the data source.

There are several formats allowed for data sources:

simpleXSV: Separated value table (e.g. CSV), keyed off Gene Name OR Transcript ID
locatableXSV: Separated value table (e.g. CSV), keyed off a genome location
gencode: Class for GENCODE data files (gtf format)
cosmic: Class for COSMIC data
vcf: Class for Variant Call Format (VCF) files

Two of the most useful are arbitrarily separated value (XSV) files, such as comma-separated value (CSV) or tab-separated value (TSV) files. These files contain a table of data that can be matched to a variant by gene name, transcript ID, or genome position. In the case of gene name and transcript ID, one column must contain the gene name or transcript ID for each row's data.

  • For gene name, when a variant is annotated with a gene name that exactly matches an entry in the gene name column for a row, that row's other fields will be added as annotations to the variant.
  • For transcript ID, when a variant is annotated with a transcript ID that exactly matches an entry in the transcript ID column for a row, that row's other fields will be added as annotations to the variant.
  • For genome position, one column must contain the contig ID, another column must contain the start position (1-based, inclusive), and a column must contain the stop position (1-based, inclusive). The start and stop columns may be the same column. When a variant is annotated with a genome position that overlaps an entry in the three genome position columns for a row, that row's other fields will be added as annotations to the variant.
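
As an illustration, a hypothetical simpleXSV data file keyed off gene name might look like this (all values invented for the example):

HUGO_Symbol     Risk_Factor     Syndrome
BRCA1           High            Hereditary breast and ovarian cancer
MLH1            High            Lynch syndrome

With xsv_key = GENE_NAME and xsv_key_column = 0, a variant annotated with the gene name BRCA1 would gain the Risk_Factor and Syndrome values of the first data row as annotations.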

1.1.5.1 - Configuration File Format

The configuration file is a standard Java properties-style configuration file with key-value pairs. This file name must end in .config.

1.1.5.1.1 - Simple XSV

The following is an example of a Simple XSV configuration file (for the Familial Cancer Genes data source):

name = Familial_Cancer_Genes
version = 20110905
src_file = Familial_Cancer_Genes.no_dupes.tsv
origin_location = oncotator_v1_ds_April052016.tar.gz
preprocessing_script = UNKNOWN

# Whether this data source is for the b37 reference.
# Required and defaults to false.
isB37DataSource = false

# Supported types:
# simpleXSV    -- Arbitrary separated value table (e.g. CSV), keyed off Gene Name OR Transcript ID
# locatableXSV -- Arbitrary separated value table (e.g. CSV), keyed off a genome location
# gencode      -- Custom datasource class for GENCODE
# cosmic       -- Custom datasource class for COSMIC
# vcf          -- Custom datasource class for Variant Call Format (VCF) files
type = simpleXSV

# Required field for GENCODE files.
# Path to the FASTA file from which to load the sequences for GENCODE transcripts:
gencode_fasta_path =

# Required field for GENCODE files.
# NCBI build version (either hg19 or hg38):
ncbi_build_version = 

# Required field for simpleXSV files.
# Valid values:
#     GENE_NAME
#     TRANSCRIPT_ID
xsv_key = GENE_NAME

# Required field for simpleXSV files.
# The 0-based index of the column containing the key on which to match
xsv_key_column = 2

# Required field for simpleXSV AND locatableXSV files.
# The delimiter by which to split the XSV file into columns.
xsv_delimiter = \t

# Required field for simpleXSV files.
# Whether to permissively match the number of columns in the header and data rows
# Valid values:
#     true
#     false
xsv_permissive_cols = true

# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the contig for each row
contig_column =

# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the start position for each row
start_column =

# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the end position for each row
end_column =

1.1.5.1.2 - Locatable XSV

The following is an example of a Locatable XSV configuration file (for the ORegAnno data source):

name = Oreganno
version = 20160119
src_file = oreganno.tsv
origin_location = http://www.oreganno.org/dump/ORegAnno_Combined_2016.01.19.tsv
preprocessing_script = getOreganno.py

# Whether this data source is for the b37 reference.
# Required and defaults to false.
isB37DataSource = false

# Supported types:
# simpleXSV    -- Arbitrary separated value table (e.g. CSV), keyed off Gene Name OR Transcript ID
# locatableXSV -- Arbitrary separated value table (e.g. CSV), keyed off a genome location
# gencode      -- Custom datasource class for GENCODE
# cosmic       -- Custom datasource class for COSMIC
# vcf          -- Custom datasource class for Variant Call Format (VCF) files
type = locatableXSV

# Required field for GENCODE files.
# Path to the FASTA file from which to load the sequences for GENCODE transcripts:
gencode_fasta_path =

# Required field for GENCODE files.
# NCBI build version (either hg19 or hg38):
ncbi_build_version = 

# Required field for simpleXSV files.
# Valid values:
#     GENE_NAME
#     TRANSCRIPT_ID
xsv_key =

# Required field for simpleXSV files.
# The 0-based index of the column containing the key on which to match
xsv_key_column =

# Required field for simpleXSV AND locatableXSV files.
# The delimiter by which to split the XSV file into columns.
xsv_delimiter = \t

# Required field for simpleXSV files.
# Whether to permissively match the number of columns in the header and data rows
# Valid values:
#     true
#     false
xsv_permissive_cols = true

# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the contig for each row
contig_column = 1

# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the start position for each row
start_column = 2

# Required field for locatableXSV files.
# The Name or 0-based index of the column containing the end position for each row
end_column = 3
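
For illustration, data rows matching this configuration might look like the following (the indices above are 0-based, so the contig, start, and end sit in the second, third, and fourth columns; values invented):

OREG0000001    chr1    12000    12600    TRANSCRIPTION_FACTOR_BINDING_SITE
OREG0000002    chr2    9500     9800     REGULATORY_REGION

A variant overlapping chr1:12000-12600 would receive the remaining columns of the first row as annotations.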

1.1.5.2 - Cloud Data Sources

Funcotator allows for data sources with source files that live on the cloud, enabling users to annotate with data sources that are not physically present on the machines running Funcotator.
To create a data source based on the cloud, create a configuration file for that data source and put the cloud URL in as the src_file property (see Configuration File Format for details).
E.g.:

 ...
src_file = gs://broad-references/hg19/v0/1000G_phase1.snps.high_confidence.b37.vcf.gz
...

1.1.6 - Data Source Versioning

Each release of the data sources contains a version number. Newer versions of Funcotator require minimum versions of data sources in order to run. If a new version of Funcotator is run with an older version of the data sources, an error will be thrown prompting the user to download a new release of the data sources.

Similarly, newer releases of the data source packages are not backward compatible with older versions of Funcotator. However, in this case Funcotator may or may not throw an error or warning.

To ensure compatibility when upgrading Funcotator, always download the latest data sources release. Similarly, when updating data sources make sure to update Funcotator to the latest version.

1.2 - Input Variant Data Formats

Currently Funcotator can only accept input variants in the form of a VCF file.

1.3 - Output

1.3.1 - Output Data Formats

Funcotator supports output in both VCF format and MAF format.

1.3.1.1 - VCF Output

VCF files will contain the annotations for each variant allele as part of a custom INFO tag - FUNCOTATION. This custom tag will contain a pipe-separated (|) list of annotations for each alternate allele on a given line of the VCF. The VCF header will contain an INFO field comment line for the FUNCOTATION data describing the field name for each value in the pipe-separated list. For variants with multiple alternate alleles, the INFO field will contain multiple lists of annotations (each list separated by a comma), the order of which corresponds to the alternate allele being annotated.

For example:

##fileformat=VCFv4.2
...
##INFO=<ID=FUNCOTATION,Number=A,Type=String,Description="Functional annotation from the Funcotator tool.  Funcotation fields are: dbSNP_Val_Status|Center">
...
#CHROM  POS ID  REF ALT QUAL  FILTER  INFO
chr19 8914955 . C A 40  . FUNCOTATION=No Value|broad.mit.edu

In this example, the variant has one alternate allele (A) with two fields (dbSNP_Val_Status and Center). The values of the fields are:

dbSNP_Val_Status: No Value
Center: broad.mit.edu

For multiple alternate alleles:

##fileformat=VCFv4.2
...
##INFO=<ID=FUNCOTATION,Number=A,Type=String,Description="Functional annotation from the Funcotator tool.  Funcotation fields are: dbSNP_Val_Status|Center">
...
#CHROM  POS ID  REF ALT QUAL  FILTER  INFO
chr7 273846 . C A,G 40  . FUNCOTATION=No Value|broad.mit.edu,Big Value Here|brandeis.edu

In this example, the variant has two alternate alleles (A and G), each with two fields (dbSNP_Val_Status and Center). The values of the fields are:

Allele A: dbSNP_Val_Status = No Value, Center = broad.mit.edu
Allele G: dbSNP_Val_Status = Big Value Here, Center = brandeis.edu

This formatting is the result of limitations in the VCF file specification.

1.3.1.2 - MAF Output

The MAF format used in Funcotator is an extension of the standard TCGA MAF. It is based on the MAF format specified for Oncotator here under Output Format. While the actual columns can vary (due to different data sources being used to create annotations), columns 1-67 will generally be the same.

In the case of a variant with multiple alternate alleles, each alternate allele will be written to a separate line in the MAF file.

1.3.2 - Annotations for Pre-Packaged Data Sources

The pre-packaged data sources will create a set of baseline, or default annotations for an input data set.
Most of these data sources copy and paste values from their source files into the output of Funcotator to create annotations. In this sense they are trivial data sources.

1.3.2.1 - Gencode Annotation Specification

Funcotator performs some processing on the input data to create the Gencode annotations. Gencode is currently required, so Funcotator will create these annotations for all input variants.
See this forum post for the specification of Gencode annotations in Funcotator.

1.4 - Reference Genome Versions

The two currently supported genomes for annotations out of the box are hg19 and hg38. This is due to the pre-packaged Gencode data sources being for those two references. Any reference genome with published Gencode data sources can be used.

1.4.1 - hg19 vs b37 Reference

The Broad Institute uses an alternate hg19 reference known as b37 for our sequencing. UCSC uses the baseline hg19 reference. These references are similar but not identical.

Due to the Gencode data source being published by UCSC, the data sources all use the hg19 reference for hg19 data (as opposed to b37). Funcotator detects when user data is from the b37 reference and forces the use of the hg19 data sources in this case. The user is warned when this occurs. Generally speaking this is OK, but due to the differences in the sequence data it is possible that some erroneous data will be created.

This effect has not yet been quantified, but in most cases should not be appreciable. For details, see this forum post.

1.5 - Comparisons with Oncotator

Oncotator is an older functional annotation tool developed by The Broad Institute. Funcotator and Oncotator are fundamentally different tools with some similarities.

While I maintain that a direct comparison should not be made, to address some inevitable questions some comparison highlights between Oncotator and Funcotator are in the following two tables:

1.5.1 - Funcotator / Oncotator Feature Comparison

Feature | Funcotator | Oncotator | Notes
Override values for annotations | Yes | Yes |
Default values for annotations | Yes | Yes |
VCF input | Yes | Yes |
VCF output | Yes | Yes | Annotation format between Funcotator and Oncotator differs.
MAF input | No | Yes |
MAF output | Yes | Yes |
TSV/maflite input | No | Yes |
Simple TSV output | No | Yes |
Removing datasources does not require developer | Yes | Yes |
hg38 support | Yes | No |
Cloud datasources | Yes | No | All data sources supported
Transcript override list | Yes | Yes |
Default config speed somatic (muts/min) (hg19) | | |
Default config speed germline (muts/min) (hg19) | | A very long time.... |
Default config speed somatic (muts/min) (hg38) | | N/A |
Default config speed germline (muts/min) (hg38) | | N/A |
Documentation | Tutorial; Specifications forum post; inclusion in workshop materials | Minimal support in forum |
Manuscript | Planned | Yes |
HGVS support | No | Yes |
BigWig datasource support | No | Linux only |
Seg file input/output | No | Yes |
Transcript modes: canonical and most deleterious effect | Yes | Yes |
Transcript mode: ALL | Yes | No |
Exclude annotations/columns on CLI | Yes | No |
Automated datasource download tool | Yes | No |
Automated tool for creating datasources | No | Yes |
Web application | No | Yes | Uses old version of Oncotator and datasources
Config file to specify CLI arguments | Yes | No | GATK built-in command line arguments file
Simple MAF to VCF or VCF to MAF conversion | No | Yes |
Inferring ONPs | No | Yes (not recommended) | Mutect2 infers ONPs when calling variants. This is not the job of a functional annotator.
Ignores filtered input variants | Yes | Yes |
Mitochondrial amino acid sequence rendering | Yes | No |
gnomAD annotations | Yes (cloud support) | Not recommended | v2.1 support for hg19; v2.0.2 support for hg38 liftover coming soon; must be manually enabled
UniProt ID annotations | Yes | Yes |
Other UniProt annotations (e.g. AAxform) | No | Yes |
Custom fields: t_alt_count; t_ref_count; etc. | MAF output only | Yes |
“other_transcripts” annotation | Yes | Yes |
Reference context annotations | Yes | Yes |
COSMIC annotations | Yes | Yes |
UCSC ID annotations | Yes | Yes | In Funcotator, UCSC ID is part of the HGNC data source.
RefSeq ID annotations | Yes | Yes |

1.5.2 - Oncotator Bugs Compared With Funcotator

Issue | Fixed in Funcotator | Fixed in Oncotator | Notes
Collapsing ONP counts into one number | N/A | No |
Variants resulting in protein changes that do not overlap the variant codon itself are not rendered properly | Yes | No |
Appris ranking not properly sorted | Yes | No |
Using protein-coding status of gene for sorting (instead of transcript) | Yes | No |
De Novo Start in UTRs not properly annotated | Yes | No |
Protein changes for Frame-Shift Insertions on the Negative strand incorrectly rendered | Yes | No |
MNP End positions incorrectly reported | Yes | No |
MNPs on the Negative strand have incorrect cDNA/codon/protein changes | Yes | No |
For Negative strand indels, cDNA string is incorrect | Yes | No |
Negative strand splice site detection boundary check for indels is incorrect | Yes | No |
Inconsistent number of bases in reported reference context annotation for indels | Yes | No |
5’ Flanking variants are reported with an incorrect transcript chosen for Canonical mode | Yes | No |
Variants overlapping both introns and exons or transcript boundaries are not rendered properly | No | No | Funcotator produces a ‘CANNOT_DETERMINE’ variant classification and minimal populated annotations.

2 - Tutorial

2.0 - Requirements

  1. Java 1.8
  2. A functioning GATK4 jar
  3. Reference genome (fasta files) with fai and dict files. Human references can be downloaded as part of the GATK resource bundle. Other references can be used but must be provided by the user.
  4. A local copy of the Funcotator data sources
  5. A VCF file containing variants to annotate.

2.1 - Running Funcotator in the GATK With Base Options

Open a command line and navigate to your GATK directory.

cd ~/gatk

At this point you should choose your output format. There are two output format choices, one of which must be specified.

Additionally, you must specify a reference version. This reference version is used verbatim to determine which data sources to use for annotations. That is, specifying hg19 will cause Funcotator to look in the <data_sources_dir>/hg19 folder for data sources to use.

A VCF instantiation of the Funcotator tool looks like this:

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.vcf --output-file-format VCF

A MAF instantiation of the Funcotator tool looks like this:

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.maf --output-file-format MAF

2.2 - Optional Parameters

2.2.1 - --ignore-filtered-variants

This flag controls whether Funcotator will annotate filtered variants. By default, this flag is set to true.
To annotate filtered variants, run Funcotator with this flag set to false:

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.maf --output-file-format MAF --ignore-filtered-variants false

2.2.2 - --transcript-selection-mode

This parameter determines how the primary annotated transcript is selected. The three modes for this parameter are BEST_EFFECT, CANONICAL, and ALL. By default, Funcotator uses the CANONICAL transcript selection mode.

The explanations and rules governing the transcript selection modes are as follows:

BEST_EFFECT
Select a transcript to be reported, with priority on effect, according to the following list of selection criteria:

  • Choose the transcript that is on the custom list specified by the user. If no list was specified, treat as if no transcripts were on the list (tie).
  • In case of tie, choose the transcript that yields the variant classification highest on the variant classification rank list (see below).
  • If still a tie, choose the transcript with highest level of curation. Note that this means lower number is better for level (see below).
  • If still a tie, choose the transcript with the best appris annotation (see below).
  • If still a tie, choose the transcript with the longest transcript sequence length.
  • If still a tie, choose the first transcript, alphabetically.

CANONICAL

Select a transcript to be reported, with priority on canonical order, according to the following list of selection criteria:

  • Choose the transcript that is on the custom list specified by the user. If no list was specified, treat as if all transcripts were on the list (tie).
  • In case of tie, choose the transcript with highest level of curation. Note that this means lower number is better for level (see below).
  • If still a tie, choose the transcript that yields the variant classification highest on the variant classification rank list (see below).
  • If still a tie, choose the transcript with the best appris annotation (see below).
  • If still a tie, choose the transcript with the longest transcript sequence length.
  • If still a tie, choose the first transcript, alphabetically.

ALL
Same as CANONICAL, but indicates that no transcripts should be dropped. Render all overlapping transcripts.
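
For example, to use the BEST_EFFECT mode instead of the CANONICAL default (mirroring the earlier example commands):

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.maf --output-file-format MAF --transcript-selection-mode BEST_EFFECT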

2.2.3 - --transcript-list

This parameter will restrict the reported/annotated transcripts to only include those on the given list of transcript IDs. This list can be given as the path to a file containing one transcript ID per line OR this parameter can be given multiple times each time specifying a transcript ID.

When specifying transcript IDs, transcript version numbers will be ignored.

Using a manually specified set of transcripts for the transcript list:

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.maf --output-file-format MAF --transcript-list TRANSCRIPT_ID1 --transcript-list TRANSCRIPT_ID2

Using an equivalent transcript file:

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.maf --output-file-format MAF --transcript-list transcriptFile.txt

Contents of transcriptFile.txt:

TRANSCRIPT_ID1
TRANSCRIPT_ID2

2.2.4 - --annotation-default

This parameter specifies a default value for an annotation. The default value will be applied to every annotated variant. However, if Funcotator would itself add this annotation to a variant, the Funcotator value overwrites the default.

To specify this annotation default, the value on the command line takes the format:
ANNOTATION_FIELD:value

For example, to set the Center annotation to broad.mit.edu:

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.maf --output-file-format MAF --annotation-default Center:broad.mit.edu

It is valid to provide both the --annotation-default and --annotation-override arguments to Funcotator; however, the behavior of specifying an annotation-default and an annotation-override for the same annotation field is undefined.
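
For instance, combining the two arguments on different annotation fields (a hypothetical but valid combination, given the caution above):

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.maf --output-file-format MAF --annotation-default Center:broad.mit.edu --annotation-override NCBI_Build:HG19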

2.2.5 - --annotation-override

This parameter specifies an override value for an annotation. If the annotation would be added to a variant by a data source, the value for that annotation is replaced with the value specified in the annotation override. If the annotation would not be added by a data source, it is added to the output with the given value.

To specify this annotation override, the value on the command line takes the format:
ANNOTATION_FIELD:value

For example, to override the NCBI_Build annotation to HG19:

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.maf --output-file-format MAF --annotation-override NCBI_Build:HG19

It is valid to provide both the --annotation-override and --annotation-default arguments to Funcotator; however, the behavior of specifying an annotation-override and an annotation-default for the same annotation field is undefined.

2.2.6 - --allow-hg19-gencode-b37-contig-matching

This flag will cause hg19 contig names to match b37 contig names, allowing a set of variants called on an hg19 reference to match a b37 reference and vice versa.

hg19 was created by UCSC. b37 was created by the Genome Reference Consortium. In practice these references are very similar but have small differences in certain bases, as well as a different naming convention for chromosomal contigs (chr1 in hg19 vs 1 in b37). In 99.9% of cases the results will be identical; however, for certain genomic regions the results will differ.

This flag defaults to true.

To run Funcotator without this hg19/b37 matching:

./gatk Funcotator --variant variants.vcf --reference Homo_sapiens_assembly19.fasta --ref-version hg19 --data-sources-path funcotator_dataSources.v1.2.20180329 --output variants.funcotated.maf --output-file-format MAF --allow-hg19-gencode-b37-contig-matching false

3 - FAQ

Why do I not get annotations from my favorite data source on my favorite variant?

This almost always happens when the data source does not overlap the variant. Commonly, a variant that is not within a gene will not be annotated because it falls outside the regions that the data sources cover (e.g. when the VariantClassification is IGR, FIVE_PRIME_FLANK, COULD_NOT_DETERMINE, etc.).
This can also happen if the given reference file does not match the data sources' reference (for the pre-packaged data sources either hg19/b37 or hg38). In this case, Funcotator will produce a large obnoxious warning:

  _ _ _  __        ___    ____  _   _ ___ _   _  ____   _ _ _ 
 | | | | \ \      / / \  |  _ \| \ | |_ _| \ | |/ ___| | | | |
 | | | |  \ \ /\ / / _ \ | |_) |  \| || ||  \| | |  _  | | | |
 |_|_|_|   \ V  V / ___ \|  _ <| |\  || || |\  | |_| | |_|_|_|
 (_|_|_)    \_/\_/_/   \_\_| \_\_| \_|___|_| \_|\____| (_|_|_)
--------------------------------------------------------------------------------
Only IGRs were produced for this dataset.  This STRONGLY indicates that this   
run was misconfigured.     
You MUST check your data sources to make sure they are correct for these data.
================================================================================
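
If you see this warning, a quick sanity check (assuming the directory layout described in section 2.1, and using the bundle name from this tutorial) is to list the data sources directory and confirm it contains a folder matching the value you passed with --ref-version:

ls funcotator_dataSources.v1.2.20180329   # look for a folder matching --ref-version, e.g. hg19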

4 - Known Issues

The current list of known open issues can be found on the GATK github page (https://github.com/broadinstitute/gatk/issues).

5 - Github

Funcotator is developed as part of GATK. The GATK github page is https://github.com/broadinstitute/gatk.

6 - Tool Documentation

Tool documentation is written in the source code for Funcotator to better explain the options for running and some details of its features.
The rendered tool documentation for Funcotator can be found in the GATK tool documentation index.




PlotDenoisedCopyRatios error when generating plots


Hi

I am trying to generate some plots using the PlotDenoisedCopyRatios function in GATK 4.1.2.0.
Everything seems to work nicely until the R script is invoked, at which point I get an error regarding --sample_name:

gatk PlotDenoisedCopyRatios --standardized-copy-ratios try_sta --denoised-copy-ratios try --sequence-dictionary /bicoh/MARGenomics/Ref_Genomes_fa/GATK_bundle/hg38/ref/Homo_sapiens_assembly38.dict --minimum-contig-length 46709983 --output Plots --output-prefix Sample_7387
Using GATK jar /soft/EB_repo/bio/sequence/programs/noarch/GATK/4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /soft/EB_repo/bio/sequence/programs/noarch/GATK/4.1.2.0/gatk-package-4.1.2.0-local.jar PlotDenoisedCopyRatios --standardized-copy-ratios try_sta --denoised-copy-ratios try --sequence-dictionary /bicoh/MARGenomics/Ref_Genomes_fa/GATK_bundle/hg38/ref/Homo_sapiens_assembly38.dict --minimum-contig-length 46709983 --output Plots --output-prefix Sample_7387
13:26:45.991 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/soft/EB_repo/bio/sequence/programs/noarch/GATK/4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Oct 31, 2019 1:28:08 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
13:28:08.914 INFO  PlotDenoisedCopyRatios - ------------------------------------------------------------
13:28:08.916 INFO  PlotDenoisedCopyRatios - The Genome Analysis Toolkit (GATK) v4.1.2.0
13:28:08.916 INFO  PlotDenoisedCopyRatios - For support and documentation go to https://software.broadinstitute.org/gatk/
13:28:08.917 INFO  PlotDenoisedCopyRatios - Executing as jgibert@hydra.prib.upf.edu on Linux v3.10.0-862.14.4.el7.x86_64 amd64
13:28:08.918 INFO  PlotDenoisedCopyRatios - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_92-b14
13:28:08.918 INFO  PlotDenoisedCopyRatios - Start Date/Time: October 31, 2019 1:26:45 PM CET
13:28:08.918 INFO  PlotDenoisedCopyRatios - ------------------------------------------------------------
13:28:08.918 INFO  PlotDenoisedCopyRatios - ------------------------------------------------------------
13:28:08.927 INFO  PlotDenoisedCopyRatios - HTSJDK Version: 2.19.0
13:28:08.928 INFO  PlotDenoisedCopyRatios - Picard Version: 2.19.0
13:28:08.928 INFO  PlotDenoisedCopyRatios - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:28:08.928 INFO  PlotDenoisedCopyRatios - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:28:08.928 INFO  PlotDenoisedCopyRatios - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:28:08.928 INFO  PlotDenoisedCopyRatios - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:28:08.928 INFO  PlotDenoisedCopyRatios - Deflater: IntelDeflater
13:28:08.928 INFO  PlotDenoisedCopyRatios - Inflater: IntelInflater
13:28:08.928 INFO  PlotDenoisedCopyRatios - GCS max retries/reopens: 20
13:28:08.929 INFO  PlotDenoisedCopyRatios - Requester pays: disabled
13:28:08.929 INFO  PlotDenoisedCopyRatios - Initializing engine
13:28:08.929 INFO  PlotDenoisedCopyRatios - Done initializing engine
13:28:08.943 INFO  PlotDenoisedCopyRatios - Reading and validating input files...
13:28:14.567 INFO  PlotDenoisedCopyRatios - Contigs above length threshold: {chr1=248956422, chr2=242193529, chr3=198295559, chr4=190214555, chr5=181538259, chr6=170805979, chr7=159345973, chr8=145138636, chr9=138394717, chr10=133797422, chr11=135086622, chr12=133275309, chr13=114364328, chr14=107043718, chr15=101991189, chr16=90338345, chr17=83257441, chr18=80373285, chr19=58617616, chr20=64444167, chr21=46709983, chr22=50818468, chrX=156040895, chrY=57227415}
13:28:14.638 WARN  PlotDenoisedCopyRatios - Contigs present in the file try_sta are missing from the sequence dictionary and will not be plotted.
13:28:14.797 WARN  PlotDenoisedCopyRatios - Contigs present in the file try are missing from the sequence dictionary and will not be plotted.
13:28:14.873 INFO  PlotDenoisedCopyRatios - Writing plots to /users/genomics/jgibert/Exomes_Bea_uBAM/CNV_somatic/Plots...
13:28:28.947 INFO  PlotDenoisedCopyRatios - Shutting down engine
[October 31, 2019 1:28:28 PM CET] org.broadinstitute.hellbender.tools.copynumber.plotting.PlotDenoisedCopyRatios done. Elapsed time: 1.72 minutes.
Runtime.totalMemory()=737148928
org.broadinstitute.hellbender.utils.R.RScriptExecutorException: 
Rscript exited with 1
Command Line: Rscript -e tempLibDir = '/tmp/Rlib.1345838355110254415';source('/tmp/CNVPlottingLibrary.6888862715841197009.R');source('/tmp/PlotDenoisedCopyRatios.3695016065488923741.R'); --args --sample_name=-A --standardized_copy_ratios_file=/users/genomics/jgibert/Exomes_Bea_uBAM/CNV_somatic/try_sta --denoised_copy_ratios_file=/users/genomics/jgibert/Exomes_Bea_uBAM/CNV_somatic/try --contig_names=chr1CONTIG_DELIMITERchr2CONTIG_DELIMITERchr3CONTIG_DELIMITERchr4CONTIG_DELIMITERchr5CONTIG_DELIMITERchr6CONTIG_DELIMITERchr7CONTIG_DELIMITERchr8CONTIG_DELIMITERchr9CONTIG_DELIMITERchr10CONTIG_DELIMITERchr11CONTIG_DELIMITERchr12CONTIG_DELIMITERchr13CONTIG_DELIMITERchr14CONTIG_DELIMITERchr15CONTIG_DELIMITERchr16CONTIG_DELIMITERchr17CONTIG_DELIMITERchr18CONTIG_DELIMITERchr19CONTIG_DELIMITERchr20CONTIG_DELIMITERchr21CONTIG_DELIMITERchr22CONTIG_DELIMITERchrXCONTIG_DELIMITERchrY --contig_lengths=248956422CONTIG_DELIMITER242193529CONTIG_DELIMITER198295559CONTIG_DELIMITER190214555CONTIG_DELIMITER181538259CONTIG_DELIMITER170805979CONTIG_DELIMITER159345973CONTIG_DELIMITER145138636CONTIG_DELIMITER138394717CONTIG_DELIMITER133797422CONTIG_DELIMITER135086622CONTIG_DELIMITER133275309CONTIG_DELIMITER114364328CONTIG_DELIMITER107043718CONTIG_DELIMITER101991189CONTIG_DELIMITER90338345CONTIG_DELIMITER83257441CONTIG_DELIMITER80373285CONTIG_DELIMITER58617616CONTIG_DELIMITER64444167CONTIG_DELIMITER46709983CONTIG_DELIMITER50818468CONTIG_DELIMITER156040895CONTIG_DELIMITER57227415 --output_dir=/users/genomics/jgibert/Exomes_Bea_uBAM/CNV_somatic/Plots/ --output_prefix=Sample_7387
Stdout: 
Stderr: Error in make_option(c("--sample_name", "-sample_name"), dest = "sample_name",  : 
  Short flag -sample_name must only be a '-' and a single letter
Calls: source -> withVisible -> eval -> eval -> make_option

    at org.broadinstitute.hellbender.utils.R.RScriptExecutor.getScriptException(RScriptExecutor.java:80)
    at org.broadinstitute.hellbender.utils.R.RScriptExecutor.getScriptException(RScriptExecutor.java:19)
    at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
    at org.broadinstitute.hellbender.utils.R.RScriptExecutor.exec(RScriptExecutor.java:129)
    at org.broadinstitute.hellbender.tools.copynumber.plotting.PlotDenoisedCopyRatios.writeDenoisingPlots(PlotDenoisedCopyRatios.java:204)
    at org.broadinstitute.hellbender.tools.copynumber.plotting.PlotDenoisedCopyRatios.doWork(PlotDenoisedCopyRatios.java:155)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
Exception in thread "Thread-1" htsjdk.samtools.util.RuntimeIOException: java.nio.file.NoSuchFileException: /tmp/Rlib.1345838355110254415
    at htsjdk.samtools.util.IOUtil.recursiveDelete(IOUtil.java:1346)
    at org.broadinstitute.hellbender.utils.io.IOUtils.deleteRecursively(IOUtils.java:1061)
    at org.broadinstitute.hellbender.utils.io.DeleteRecursivelyOnExitPathHook.runHooks(DeleteRecursivelyOnExitPathHook.java:56)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /tmp/Rlib.1345838355110254415
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
    at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
    at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
    at java.nio.file.Files.readAttributes(Files.java:1737)
    at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
    at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
    at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322)
    at java.nio.file.Files.walkFileTree(Files.java:2662)
    at java.nio.file.Files.walkFileTree(Files.java:2742)
    at htsjdk.samtools.util.IOUtil.recursiveDelete(IOUtil.java:1344)
    ... 3 more

Following the error message, I changed the @RG SM: field in both the denoised and standardized tsv files to meet the requirements (as you can see in the log), but the error persists.

Any idea why this is happening?
Thanks!

Calling variants in RNAseq


Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as examples of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.


In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

[image: detailed overview of the RNAseq workflow]

Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline is producing both false positive (wrong variant calls) and false negative (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article, as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.


The workflow

1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using the STAR aligner. Specifically, we use the STAR 2-pass method described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details; we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

4) The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

2. Add read groups, sort, mark duplicates, and create index

The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample 

java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics 

3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads, developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clips any sequences overhanging into intronic regions.

[image: illustration of SplitNCigarReads splitting reads at N CIGAR elements]

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).
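
No RNAseq-specific arguments are needed; below is a minimal sketch of the two DNAseq-style GATK3 commands (known_indels.vcf stands in for your known-sites resource):

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I split.bam -known known_indels.vcf -o realignment_targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I split.bam -targetIntervals realignment_targets.intervals -known known_indels.vcf -o realigned.bam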

5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.
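
For reference, here is a sketch of the standard GATK3 BQSR commands as they would slot in here (dbsnp.vcf is a placeholder for your known-sites resource):

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I realigned.bam -knownSites dbsnp.vcf -o recal_data.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I realigned.bam -BQSR recal_data.table -o recalibrated.bam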

6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it performs much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform "dangling head merging" operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary, as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we set the minimum phred-scaled confidence threshold for calling variants to 20, but you can lower this to increase sensitivity if needed.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf

7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs within a window of 35 bases by adding -window 35 -cluster 3 to your command. This filter recommendation is specific to RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf

Please note that we selected these hard filtering values in an attempt to optimize both sensitivity and specificity together. By applying the hard filters, some real sites will get filtered out. This is a tradeoff that each analyst should consider based on their own project. If you care more about sensitivity and are willing to tolerate more false positive calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

[image: examples of filtered and unfiltered false variant calls]

An example of true variants that were filtered out (false negatives). As explained in the text, there is a tradeoff that comes with applying filters:

[image: examples of true variants removed by the SNP cluster filter]


Known issues

There are a few known issues; one is that the allelic ratio is problematic. At many heterozygous sites, even when both alleles present in the DNA are visible in the RNAseq data, the ratio between the numbers of reads carrying the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may also be candidates for downstream analysis of allele-specific expression).

Although our new tool (SplitNCigarReads) removes many false positive calls caused by splicing inaccuracies from the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

[images: examples of false positive calls at splice junctions]

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionality, improve the true positive rate and minimize the false positive rate, and develop statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.


[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013


NOTE: Questions about this document that were posted before June 2014 have been moved to this archival thread: http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow

Picard CollectWgsMetrics for WES?


Hello,

I want to get a coverage report for WES using Picard CollectWgsMetrics, since the DepthOfCoverage function in GATK is relatively slow. But the coverage results from the two programs differ a lot.

Settings:
* CollectWgsMetrics and DepthOfCoverage were provided with the interval list file from the WES (see the sketch after this list).
* Both were run with default settings.
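
For context, the two invocations were along these lines (a reconstruction with placeholder file names, not the exact commands used):

java -jar picard.jar CollectWgsMetrics I=sample.bam O=wgs_metrics.txt R=ref.fasta INTERVALS=exome_targets.interval_list
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R ref.fasta -I sample.bam -L exome_targets.interval_list -o coverage_out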

Results:
CollectWgsMetrics reports mean coverage = 33.50, while DepthOfCoverage reports 58.33, so I tried to find out where such a big difference comes from. I found that over 29.4% of aligned bases were excluded (PCT_EXC_DUPE) by CollectWgsMetrics, whereas only 8.9% of reads were filtered by the DuplicateReadFilter according to the DepthOfCoverage log file.

Is the lower coverage from CollectWgsMetrics due to the larger number of excluded duplicate bases?

Do CollectWgsMetrics and DepthOfCoverage count duplicated reads differently?

Thanks a lot,
JL

Why does HaplotypeCaller call heterozygotes when all reads are identical?


Hello, I've been exploring the VCF output from HaplotypeCaller in gVCF mode, and I noticed that for some SNPs (37/~250K), HaplotypeCaller has called a heterozygote when, according to the VCF, all the reads support either the reference or the alternative allele (90% of the time, it is when all reads support the alternative allele).

These are all fairly low-coverage sites (all < 10), although from exploring the data it looks like the "GQ" values are quite high (20+) and they are all loci that have been phased by HaplotypeCaller.

Is HaplotypeCaller doing something smart here with the phasing to "recover" information that isn't in the raw sequences? I can't really figure out what that might be, but I also don't really know what is going on under the hood with physical phasing.

In case it matters, I am working with mosquito exome data, 150bp PE illumina reads, GATK 4.1.2.0, java 1.8.0_181. I've been following the Broad best practices pretty closely, although I haven't managed to figure out how to bootstrap base recalibration yet.

An example line from the VCF, with the individual of concern being the phased 0|1 call (note that there are some bulks in here, and I think they might be exhibiting the same behavior, but so far I'm only looking at single diploid individuals):

NW_021837111.1 53637 . T G 107.95 . AC=1;AF=9.434e-03;AN=106;DP=221;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1312;MLEAC=1;MLEAF=9.434e-03;MQ=60.00;QD=32.06;SOR=2.833GT:AD:DP:GQ:PGT:PID:PL:PS 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:55,0:55:8:.:.:0,8,17,25,35,45,55,67,79,91,105,120,137,155,176,199,226,257,296,346,417,537,1800 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:51,0:51:11:.:.:0,11,23,36,50,65,82,100,120,144,170,202,241,291,361,482,1800 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:36,0:36:7:.:.:0,7,14,22,30,38,47,57,67,78,89,102,116,132,149,169,192,219,252,294,354,456,1530 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:45,0:45:7:.:.:0,7,15,23,32,41,50,60,70,82,94,107,120,136,152,170,191,214,241,272,311,361,432,552,1800 0|1:0,3:3:25:0|1:53637_T_G:123,0,25:53637 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0 0/0:1,0:1:3:.:.:0,3,22 0/0:1,0:1:3:.:.:0,3,36 0/0:7,0:7:18:.:.:0,18,270 0/0:2,0:2:6:.:.:0,6,75 0/0:2,0:2:6:.:.:0,6,77 0/0:1,0:1:3:.:.:0,3,40 0/0:5,0:5:15:.:.:0,15,205 0/0:3,0:3:9:.:.:0,9,112 ./.:0,0:0:.:.:.:0,0,0 0/0:5,0:5:15:.:.:0,15,213 ./.:0,0:0:.:.:.:0,0,0 0/0:4,0:4:12:.:.:0,12,124
