Channel: Recent Discussions — GATK-Forum

LiftoverVCF chain file for b37 to hg38

For liftover of a b37 VCF to hg38, the GATK LiftoverVcf tool needs a chain file (b37Tohg38.chain). The documentation example uses this file, but I don't see it in the GATK bundles. Is there one available? Or will the UCSC hg19Tohg38.chain file work despite the contig name difference?
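
For reference, a hedged sketch of the LiftoverVcf invocation (file names are placeholders; argument spellings follow the Picard-style tools bundled with GATK4):

gatk LiftoverVcf \
  -I input.b37.vcf \
  -O output.hg38.vcf \
  --CHAIN b37Tohg38.chain \
  --REJECT rejected_records.vcf \
  -R Homo_sapiens_assembly38.fasta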


Does HaplotypeCaller allow multiple allele calls?

I'm trying to use HaplotypeCaller to call variants from a pooled population sample. However, whenever I run the basic program with the ploidy set to 2 x the number of individuals in the pool, I get this warning:

WARN HaplotypeCallerGenotypingEngine - Removed alt alleles where ploidy is 88 and original allele count is 3, whereas after trimming the allele count becomes 2. Alleles kept are:[T*, A]

What is going on behind the scenes to reduce the allele count down to two? Is there a way to turn this off?

Cheers,
James
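
A hedged note on the knobs that appear to be involved (argument names assume a recent GATK4; GATK3 used underscore-separated equivalents): HaplotypeCaller trims alt alleles so that the number of possible genotypes at a site stays under a cap, and at ploidy 88 even three alleles imply several thousand genotypes, which exceeds the default cap. Raising the cap keeps more alleles at a substantial runtime and memory cost, for example:

gatk HaplotypeCaller \
  -R ref.fasta \
  -I pool.bam \
  --sample-ploidy 88 \
  --max-alternate-alleles 6 \
  --max-genotype-count 10000 \
  -O pool.vcf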

MuTect2 INDEL accuracy

Hi all,
I was wondering about other people's opinions on the accuracy of MuTect2 for indel detection. Are you satisfied?
No hidden pitfall behind the question, just wondering!

Thank you in advance!

error using ClipReads with commas for multiple cycles to clip

Hi,
I want to remove cycles with high error rates. The following command works:
java -jar /homes/aplic/noarch/software/GATK/3.7-Java-1.8.0_74/GenomeAnalysisTK.jar -T ClipReads -R /gpfs42/projects/lab_lcarey/single_cell_behavior/Data/PhiX/NC_001422.fasta --clipRepresentation WRITE_NS_Q0S -I CRAM/180402_7001450_0415_ACC843ANXX__lane8_NoIndex_L008.cram -o CRAM_CLIPPED/180402_7001450_0415_ACC843ANXX__lane8_NoIndex_L008.bam -QT 30 -CT "1-2,3-4"
...
INFO 11:53:18,177 ClipReads - Creating Q-score clipper with threshold 30
INFO 11:53:18,177 ClipReads - Creating cycle clipper 0-1
INFO 11:53:18,178 ClipReads - Creating cycle clipper 2-3

but using a list with commas does not:
$ java -jar /homes/aplic/noarch/software/GATK/3.7-Java-1.8.0_74/GenomeAnalysisTK.jar -T ClipReads -R /gpfs42/projects/lab_lcarey/single_cell_behavior/Data/PhiX/NC_001422.fasta --clipRepresentation WRITE_NS_Q0S -I CRAM/180402_7001450_0415_ACC843ANXX__lane8_NoIndex_L008.cram -o CRAM_CLIPPED/180402_7001450_0415_ACC843ANXX__lane8_NoIndex_L008.bam -QT 30 -CT "1,2,3,4"
....
INFO 11:54:06,394 ClipReads - Creating Q-score clipper with threshold 30

ERROR --
ERROR stack trace

java.lang.RuntimeException: Badly formatted cyclesToClip argument: 1,2,3,4
at org.broadinstitute.gatk.tools.walkers.readutils.ClipReads.initialize(ClipReads.java:279)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Badly formatted cyclesToClip argument: 1,2,3,4
ERROR ------------------------------------------------------------------------------------------
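
A hedged workaround based only on the behavior shown above (the range syntax parses while bare cycle numbers do not): express each single cycle as a one-cycle range. Paths are shortened to placeholders here.

java -jar GenomeAnalysisTK.jar -T ClipReads \
  -R reference.fasta \
  -I input.cram \
  -o output.bam \
  --clipRepresentation WRITE_NS_Q0S \
  -QT 30 \
  -CT "1-1,2-2,3-3,4-4"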

GATK4 MergeVcfs "One or more header lines must be in the header line collection"

Hi! I am trying to use MergeVcfs to merge several VCF files (VarScan2 output files) but I am getting the following error:

gatk MergeVcfs \
   -I A.vcf \
   -I B.vcf \
   -D human_g1k_v37_decoy.dict \
   -O out.vcf

...
java.lang.IllegalArgumentException: One or more header lines must be in the header line collection
...

Unfortunately I cannot find any information about this error message. I have tried using gatk ValidateVariants to validate the input VCF files but this does not return any errors:

gatk ValidateVariants \
   -V A.vcf \
   -R human_g1k_v37_decoy.fasta

...
12:01:11.764 INFO  ValidateVariants - Done initializing engine
12:01:11.764 INFO  ProgressMeter - Starting traversal
12:01:11.765 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
12:01:12.641 INFO  ProgressMeter -           1:29562369              0.0                 43393        2978924.5
12:01:12.642 INFO  ProgressMeter - Traversal complete. Processed 43393 total variants in 0.0 minutes.
12:01:12.642 INFO  ValidateVariants - Shutting down engine
[July 1, 2018 12:01:12 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.03 minutes.

Can anyone familiar with the code point me in the right direction?

The VCF header for A.vcf and B.vcf looks as follows:

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)
##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls
##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
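
A hedged observation rather than a confirmed diagnosis: two of the ##INFO lines pasted above (SS and GPV) appear to be missing their closing '">'. A quick check that flags header metadata lines which don't terminate properly:

grep '^##' A.vcf | grep -v '>$'

Any line this prints is worth inspecting before re-running MergeVcfs.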

Current status of GATK4 GermlineCNVCaller tools and best practices.

Hi,

I would like to try out GATK4 for discovering or genotyping germline CNVs in a cohort of a few hundred whole-genome-sequenced samples. I work with non-human species data, but the genome sizes are comparable to human or smaller.

The Best Practices documentation for germline CNV calling is still empty:
https://software.broadinstitute.org/gatk/best-practices/workflow?id=11148

According to the gatk4-4.0.0.0-0 JAR file, germline CNV calling tools are already included:
java -jar ./gatk4-4.0.0.0-0/gatk-package-4.0.0.0-local.jar
USAGE: [-h]
--------------------------------------------------------------------------------------
Copy Number Variant Discovery: Tools that analyze read coverage to detect copy number variants.
AnnotateIntervals (BETA Tool) Annotates intervals with GC content
CallCopyRatioSegments (BETA Tool) Calls copy-ratio segments as amplified, deleted, or copy-number neutral
CombineSegmentBreakpoints (EXPERIMENTAL Tool) Combine the breakpoints of two segment files and annotate the resulting intervals with chosen columns from each file.
CreateReadCountPanelOfNormals (BETA Tool) Creates a panel of normals for read-count denoising
DenoiseReadCounts (BETA Tool) Denoises read counts to produce denoised copy ratios
DetermineGermlineContigPloidy (BETA Tool) Determines the baseline contig ploidy for germline samples given counts data.
GermlineCNVCaller (BETA Tool) Calls copy-number variants in germline samples given their counts and the output of DetermineGermlineContigPloidy.
ModelSegments (BETA Tool) Models segmented copy ratios from denoised read counts and segmented minor-allele fractions from allelic counts
PlotDenoisedCopyRatios (BETA Tool) Creates plots of denoised copy ratios
PlotModeledSegments (BETA Tool) Creates plots of denoised and segmented copy-ratio and minor-allele-fraction estimates
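
A hedged sketch of how these tools appear to chain together in cohort mode (the argument names are my reading of this GATK4 generation and may differ in 4.0.0.0, so treat them as assumptions; file names are placeholders):

gatk DetermineGermlineContigPloidy \
  --input sample1.counts.hdf5 --input sample2.counts.hdf5 \
  --contig-ploidy-priors ploidy_priors.tsv \
  --output ploidy-calls \
  --output-prefix cohort

gatk GermlineCNVCaller \
  --run-mode COHORT \
  --input sample1.counts.hdf5 --input sample2.counts.hdf5 \
  --contig-ploidy-calls ploidy-calls/cohort-calls \
  --output cnv-calls \
  --output-prefix cohort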

Can you give some more information about the current status of the GATK4 GermlineCNVCaller tools, and do you have an estimate of when the Best Practices for these tools will be available?

It would also be nice if you could give an idea of whether the GATK4 GermlineCNVCaller tools are expected to work for non-human species, e.g. other vertebrates, simple or complex plant genomes, and bacteria.

Thank you.

SplitNCigarReads java.lang.ArrayIndexOutOfBoundsException

Dear GATK team,

I'm using gatk4-4.0.5.1-0 on CentOS, installed through conda.
I'm posting this question here because I could not find any answer regarding this matter online.

I'm running gatk SplitNCigarReads on my RNA-seq sorted BAM file.
The sorted BAM file was obtained with gatk AddOrReplaceReadGroups, using the -SO coordinate option.

Before writing down the error message, here is the command recorded in the log file (both stdout and stderr were saved):

Using GATK jar /home/genomics_cf/.conda/envs/exome/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -Djava.io.tmpdir=/home/genomics_cf/180530_NB501839_0019_AHKG5LBGX5/tmp -jar /home/genomics_cf/.conda/envs/exome/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar SplitNCigarReads -I /home/genomics_cf/180530_NB501839_0019_AHKG5LBGX5/03_sort_bam/ParkK_S12.srt.bam -O /home/genomics_cf/180530_NB501839_0019_AHKG5LBGX5/03_sort_bam/ParkK_S12.mqfix.bam -R /storage/data3/public_data/Broad_DBs/ucsc.hg19.fasta -skip-mq-transform false
15:31:09.392 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/genomics_cf/.conda/envs/exome/share/gatk4-4.0.5.1-0/gatk-package-4.0.5.1-local.jar!/com/intel/gkl/native/libgkl_compression.so

And below is the error message (including some normal-looking log messages; please ignore the Korean text, it's just the date/time of the message):

15:59:00.033 INFO  ProgressMeter -       chr14:50053490             27.8              66479000        2388541.4
15:59:10.034 INFO  ProgressMeter -       chr14:50320395             28.0              66895000        2389179.7
15:59:20.146 INFO  ProgressMeter -       chr14:50320427             28.2              67280000        2388554.3
15:59:20.653 INFO  SplitNCigarReads - Shutting down engine
[2018년 7월 17일 (화) 오후 3시 59분 20초] org.broadinstitute.hellbender.tools.walkers.rnaseq.SplitNCigarReads done. Elapsed time: 28.19 minutes.
Runtime.totalMemory()=3288858624
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.broadinstitute.hellbender.tools.walkers.rnaseq.OverhangFixingManager.overhangingBasesMismatch(OverhangFixingManager.java:313)
        at org.broadinstitute.hellbender.tools.walkers.rnaseq.OverhangFixingManager.fixSplit(OverhangFixingManager.java:259)
        at org.broadinstitute.hellbender.tools.walkers.rnaseq.OverhangFixingManager.addReadGroup(OverhangFixingManager.java:209)
        at org.broadinstitute.hellbender.tools.walkers.rnaseq.SplitNCigarReads.splitNCigarRead(SplitNCigarReads.java:270)
        at org.broadinstitute.hellbender.tools.walkers.rnaseq.SplitNCigarReads.firstPassApply(SplitNCigarReads.java:180)
        at org.broadinstitute.hellbender.engine.TwoPassReadWalker.lambda$traverseReads$0(TwoPassReadWalker.java:62)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at org.broadinstitute.hellbender.engine.TwoPassReadWalker.traverseReads(TwoPassReadWalker.java:60)
        at org.broadinstitute.hellbender.engine.TwoPassReadWalker.traverse(TwoPassReadWalker.java:42)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:994)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

I've never seen this kind of error before. Is there any way I can fix or work around it?
Thank you for your support.

Best,
Seongmin
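
A hedged first debugging step rather than a fix: since the traversal dies on a specific read, it may be worth confirming the input BAM itself is intact, e.g. with the Picard validator bundled in GATK4 (the --MODE spelling is my assumption):

gatk ValidateSamFile \
  -I ParkK_S12.srt.bam \
  --MODE SUMMARY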

Get Error when using CreateReadCountPanelOfNormals in Calling Somatic Copy Number Variation

Error information:

Using GATK jar /home/yangyuan/Desktop/Tool/gatk-4.0.5.2/gatk-package-4.0.5.2-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx6500m -jar /home/yangyuan/Desktop/Tool/gatk-4.0.5.2/gatk-package-4.0.5.2-local.jar CreateReadCountPanelOfNormals -I 1_19_0427_S18.counts.hdf5 -I 1_20_0427_S19.counts.hdf5 -I 1_21_0427_S20.counts.hdf5 -I 1_22_0427_S21.counts.hdf5 -I 1_23_0427_S22.counts.hdf5 -I 1_24_0427_S23.counts.hdf5 -I 1_25_0427_S24.counts.hdf5 -I 1_26_0427_S25.counts.hdf5 -I 1_50_0427_S48.counts.hdf5 -I 1_51_0427_S49.counts.hdf5 -I ......
......
18/07/29 19:46:57 INFO Executor: Running task 32.0 in stage 1.0 (TID 33)
18/07/29 19:46:57 INFO Executor: Running task 33.0 in stage 1.0 (TID 34)
18/07/29 19:46:57 INFO Executor: Running task 34.0 in stage 1.0 (TID 35)
18/07/29 19:46:57 INFO Executor: Running task 35.0 in stage 1.0 (TID 36)
18/07/29 19:46:57 INFO Executor: Running task 36.0 in stage 1.0 (TID 37)
18/07/29 19:46:57 INFO Executor: Running task 37.0 in stage 1.0 (TID 38)
18/07/29 19:46:57 INFO Executor: Running task 38.0 in stage 1.0 (TID 39)
18/07/29 19:46:57 INFO Executor: Running task 39.0 in stage 1.0 (TID 40)
Jul 29, 2018 7:46:57 PM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /tmp/yangyuan/jniloader7881962181404460704netlib-native_system-linux-x86_64.so
java: symbol lookup error: /tmp/yangyuan/jniloader7881962181404460704netlib-native_system-linux-x86_64.so: undefined symbol: cblas_dspr
(the same symbol lookup error is printed, interleaved, by several java worker processes)

Hi, when I'm using CreateReadCountPanelOfNormals in the Somatic Copy Number Variation calling workflow, I get this error.
When I searched for the error on Google, I found someone who had met the same problem (https://gatkforums.broadinstitute.org/gatk/discussion/8810/something-about-create-pon-workflow).

But that solution did not work for me. This is the solution the above link gives:

I met this problem too. it was running very well with one sample input, but this bug appeared when I input multiple samples... BTW, my version is 4.0.3.0.
It seems related to Spark, and I just solved it.
1. install libblas.so, liblapacke.so and libopenblas.so(which I lacked).
2. add to environment. export LD_PRELOAD=/path/to/libopenblas.so
Then everything works as expected.

The command I ran was:
gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
-I 1_19_0427_S18.counts.hdf5 \
-I 1_20_0427_S19.counts.hdf5 \
-I 1_21_0427_S20.counts.hdf5 \
-I 1_22_0427_S21.counts.hdf5 \
-I 1_23_0427_S22.counts.hdf5 \
-I 1_24_0427_S23.counts.hdf5 \
-I 1_25_0427_S24.counts.hdf5 \
-I 1_26_0427_S25.counts.hdf5 \
-I 1_50_0427_S48.counts.hdf5 \
-I 1_51_0427_S49.counts.hdf5 \
-I 1_52_0427_S50.counts.hdf5 \
-I 1_53_0427_S51.counts.hdf5 \
-I 1_54_0427_S52.counts.hdf5 \
-I 1_55_0427_S53.counts.hdf5 \
-I 1_56_0427_S54.counts.hdf5 \
-I 1_57_0427_S55.counts.hdf5 \
-I 1_58_0427_S56.counts.hdf5 \
-I 1_59_0427_S57.counts.hdf5 \
--minimum-interval-median-percentile 55.0 \
-O cnvponC.pon.hdf5
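
For what it's worth, a hedged expansion of the quoted fix on a Debian/Ubuntu-style system (the package name and library path are assumptions; adjust for your distro):

# Install an OpenBLAS build that provides the CBLAS symbols (such as cblas_dspr) that the netlib-java native stub expects.
sudo apt-get install libopenblas-base

# Preload it so the JNI library resolves cblas_dspr from OpenBLAS, then re-run the GATK command in the same shell.
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libopenblas.so.0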

Off-label workflow to simply call differences in two samples

Given my years as a biochemist, when handed two samples to compare, my first impulse is to ask what the functional differences are, i.e. the differences in the proteins expressed between the two samples. I am interested in genomic alterations that ripple down the central dogma to transform a cell.

Please note the workflow that follows is NOT a part of the Best Practices. This is an illustrative, unsupported workflow. For the official Somatic Short Variant Calling Best Practices workflow, see Tutorial#11136.

To call every allele that is different between two samples, I have devised a two-pass workflow that takes advantage of Mutect2 features. This workflow uses Mutect2 in tumor-only mode and appropriates the --germline-resource argument to supply a single-sample VCF with allele fractions instead of population allele frequencies. The workflow assumes the two case samples being compared originate from the same parental line and the ploidy and mutation rates make it unlikely that any site accumulates more than one allele change.


First, call on each sample using Mutect2's tumor-only mode.

gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
-O A.vcf

gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
-O B.vcf

Second, for each single-sample VCF, move the sample-level AF allele-fraction annotation to the INFO field and simplify to a sites-only VCF.

This is a heuristic solution in which we substitute sample-level allele fractions for the expected population germline allele frequencies. Mutect2 is actually designed to use population germline allele frequencies in somatic likelihood calculations, so this substitution allows us to fulfill the requirement for an AF annotation with plausible fractional values. The terminal screenshots highlight the data transpositions.

Before: [screenshot of the single-sample VCF with the AF annotation in the FORMAT/sample columns]

After: [screenshot of the sites-only VCF with the AF annotation moved to the INFO field]
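
One hedged way to do this transposition with standard Unix tools (an illustrative sketch rather than the exact method used for the screenshots; it assumes a single-sample VCF named A.vcf with AF present in the FORMAT/sample columns and no pre-existing INFO/AF definition):

(
  grep '^##' A.vcf
  echo '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele fraction">'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  grep -v '^#' A.vcf | awk -F'\t' -v OFS='\t' '{
    # locate AF within the FORMAT field and pull the matching value from the sample column
    n = split($9, fmt, ":"); split($10, smp, ":"); af = "."
    for (i = 1; i <= n; i++) if (fmt[i] == "AF") af = smp[i]
    # keep the first seven columns and rewrite INFO to carry only the allele fraction
    print $1, $2, $3, $4, $5, $6, $7, "AF=" af
  }'
) > Aaf.vcf

Repeat for B.vcf to produce Baf.vcf, and index both (e.g. with IndexFeatureFile) before supplying them to --germline-resource.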

Third, call on each sample in a second pass, again in tumor-only mode, with the following additions.

gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
--germline-resource Baf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O A-B.vcf

gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
--germline-resource Aaf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O B-A.vcf
  • Provide the matched single-sample callset for the case sample with the --germline-resource argument.
  • Avoid calling any allele in the --germline-resource by setting --max-population-af to zero.
  • Maximize the probability of calling any differing allele by setting --af-of-alleles-not-in-resource to zero.
  • Prefilter sites with artifacts and cross-sample contamination with a panel of normals (PoN) in which confident variant sites for both sample A and B have been removed, e.g. with gatk SelectVariants -V pon.vcf -XL AandB_haplotypecaller.vcf -O pon_maskAB.vcf.

Fourth, filter out unlikely calls with FilterMutectCalls.

gatk FilterMutectCalls \
-V A-B.vcf \
-O A-B-filter.vcf

gatk FilterMutectCalls \
-V B-A.vcf \
-O B-A-filter.vcf

FilterMutectCalls provides many filters, e.g. that account for low base quality, for events that are clustered, for low mapping quality and for short-tandem-repeat contractions. Of the filters, let's consider the multiallelic filter. It discounts sites with more than two variant alleles that pass the tumor LOD threshold.

  • We assume case sample variant sites will have a maximum of one allele that is different from the --germline-resource control. A single allele call will pass the multiallelic filter. However, if we emit any shared variant allele alongside the differing allele, e.g. for a heterozygous site without ref alleles, then the call becomes multiallelic and will be filtered, which is not what we want. We previously set Mutect2’s --max-population-af to zero to ensure only the differing allele is called, and so here we can rely on FilterMutectCalls to filter artifactual multiallelic sites.
  • If multiple variant alleles are expected per call, then FilterMutectCalls' multiallelic filtering will be undesirable. For example, if changes to the allele fractions of shared alleles were of interest for the two samples derived from the same parental line, and Mutect2's --max-population-af was set to one in the previous step to additionally emit the shared variant alleles, then you would expect multiallelic calls. These will be indistinguishable from artifactual multiallelic sites.

This workflow produces contrastive variants. If the samples are a tumor and its matched normal, then the calls include sites where heterozygosity was lost.

We know that loss of heterozygosity (LOH) plays a role in tumorigenesis (doi:10.1186/s12920-015-0123-z). This leads us to believe the heterozygosity of proteins we express contributes to our health. If this is true, then for somatic studies, if cataloging the gain of alleles is of interest, then cataloging the loss of alleles should also be of interest. Can we assume just because variants are germline that they do not play a role in disease processes? How can we account for the combinatorial effects of the diploid nature of our genomes?

Remember regions of LOH do not necessarily represent a haploid state but can be copy-neutral or even copy-amplified. It may be that as one parental chromosome copy is lost, the other is duplicated to maintain copy number, which presumably compensates for dosage effects as is the case in uniparental isodisomy.


Clarification of --normal-artifact-lod in FilterMutectCalls

The FilterMutectCalls tool doc says:

If the normal artifact log odds is larger than the threshold, then FilterMutectCalls applies the artifact-in-normal filter. For matched normal analyses with tumor contamination in the normal, consider increasing the normal-artifact-lod threshold.

This is what I understand:

The normal artifact log odds is the threshold above which an artifactual site detected in the normal will be used to filter any variants at that site (assuming this is just a shared artifact). If there is tumor contamination of the normal, an apparent artifact in the normal may just represent tumor contamination. Hence, this threshold should be raised if we suspect tumor contamination of the normal.

Am I right?

Is GATK CalculateGenotypePosteriors useful for de novo family mutations?

I have a few WGS samples: patient, mother, father, and brother. I am interested in looking for de novo mutations in the patient within the family. My data is a set of recalibrated variants from GATK VQSR.

The Genotype Refinement workflow from GATK looks promising for finding the mutations (https://software.broadinstitute.org/gatk/documentation/article.php?id=4723).

... (e.g. in the case of loss of function) or with the transmission (or de novo origin) of a variant in a family.

The first step in the workflow is the CalculateGenotypePosteriors tool. In the documentation:

Using the default behavior, priors will only be applied for each variants (provided each variant has at least 10 called samples.)...

Q1: This is where I am not sure. I only have 4 WGS samples from one family; it's not a large-scale population analysis. Since 4 < 10, does that mean the tool is not designed to work for my samples? Is the Bayesian model not meant for just 4 samples in a family?

Q2: If the tool is not appropriate for a family analysis, what would be a better workflow?
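
A hedged reading of the docs rather than an authoritative answer: the 10-called-samples condition is usually met by supplying an external population callset alongside a pedigree, not by your own cohort size. The argument names below are assumptions based on GATK4, and the file names are placeholders:

gatk CalculateGenotypePosteriors \
  -R human_g1k_v37.fasta \
  -V family.recalibrated.vcf.gz \
  -ped family.ped \
  --supporting-callsets 1000G_phase3_sites.vcf.gz \
  -O family.posteriors.vcf.gz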

A complete script that processes a trio with WGS data from FASTQ to BAM to VCF

Hi,

I have found quite a lot of GATK documentation material online. However, I could not find a document that shows exactly how to process a WGS dataset from end to end. For example, right now I have 3 samples with WGS data. Each sample has 4 FASTQ files; for the first sample these are: s1_L1_1.fa.gz, s1_L1_2.fa.gz, s1_L2_1.fa.gz, s1_L2_2.fa.gz. My question is: what is the exact series of commands I should use to create a VCF file with these 3 samples?

I found a nice example at http://www.htslib.org/workflow, but I am afraid it is not the latest version. I spent quite some time trying to figure this out. Below is what I have for running on each of the 3 samples:

  1. bwa mem -t 1000 -k 32 -M hg19.fa s1_L1_1.fa.gz s1_L1_2.fa.gz s1_L2_1.fa.gz s1_L2_2.fa.gz | samtools view -b -S -t hg19.fa.fai - > s1.bam

  2. samtools sort -@ 4 s1.tmp.bam s1.sorted.bam, then java -jar picard/MarkDuplicates.jar I=s1.sorted.bam O=s1.markdup.bam M=s1.dupStat

  3. gatk RealignerTargetCreator -R hg19.fq.gz -I s1.sorted.bam -known indels.vcf -O realigner.intervals

  4. gatk BaseRecalibrator -R hg19.fq.gz -I realigned.bam -knownSites dbsnp137.vcf -knownSites gold.standard.indels.vcf -O recal.table

  5. gatk HaplotypeCaller -R hg19.fa.gz -I s1.bam -o s1.gvcf -ERC GVCF

Once I've done the above for each of the 3 samples, I then merge the 3 gVCF files together by:
gatk GenotypeGVCFs -R hg19.fa.gz -V s1.gvcf -V s2.gvcf -V s3.gvcf

Can someone please let me know if I got the above correct? If not, can you please kindly correct me?

Thank you & best regards,
Jie
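
A hedged note on the last step (file names are placeholders): in GATK4, GenotypeGVCFs accepts a single -V input, so the per-sample GVCFs are usually consolidated first, for example:

gatk CombineGVCFs \
  -R hg19.fa \
  -V s1.g.vcf.gz -V s2.g.vcf.gz -V s3.g.vcf.gz \
  -O trio.g.vcf.gz

gatk GenotypeGVCFs \
  -R hg19.fa \
  -V trio.g.vcf.gz \
  -O trio.vcf.gz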

HaplotypeCaller doesn't annotate some rs IDs on called variants

Hi!

I have three WES files and produced VCF files using GATK4 HaplotypeCaller (v4.0.6.0).

Each WES file went through HaplotypeCaller separately. When I was looking through the VCF files, I found that some rs IDs were missing in one file while they were present in the others.

Here's a screenshot of it:

All three files called chr1:63912546 as a variant, but only the first file lacks the rs ID even though I've provided dbsnp.

Here is my command for HaplotypeCaller:

gatk --java-options "-Xms4g -Xmx10g" HaplotypeCaller \
-R human_g1k_v37.fasta \
-I my_bam_file.bam \
-O output.vcf \
--dbsnp dbsnp_138.b37.vcf \
-L exome_regions.bed

What could have gone wrong?

Thanks!
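
A hedged way to narrow this down (assuming dbsnp_138.b37.vcf is uncompressed, as in the command above): pull the dbSNP record at that position and compare its REF/ALT and position representation with what each of the three VCFs reports there, since a mismatch there could explain the missing ID:

awk '($1 == "1" || $1 == "chr1") && $2 == 63912546' dbsnp_138.b37.vcf
awk '($1 == "1" || $1 == "chr1") && $2 == 63912546' output.vcf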


VariantsToTable only returning header lines

Hi, when running VariantsToTable the output table contains only the header. I can't see anything that would cause this to happen.
I am using gatk-4.0.6 with the following settings:

./gatk VariantsToTable -V test.vcf -F CHROM -F POS -O output.table

The vcf header is too long to paste but can provide that if needed.

All I'm getting back is

CHROM POS

If I include -GF GT I get back all the sample IDs but again only header information.
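
A hedged pair of checks, since an empty table usually means no records survived selection: first confirm the VCF has non-header records at all, then re-run while keeping filtered sites (VariantsToTable skips FILTERed records by default; the flag spelling below is my assumption for GATK 4.0.x):

grep -vc '^#' test.vcf

./gatk VariantsToTable -V test.vcf -F CHROM -F POS --show-filtered -O output.table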

DetermineGermlineContigPloidy

I was using the brand new DetermineGermlineContigPloidy in 4.0.6.0. However, I ran into an error. In COHORT mode, DetermineGermlineContigPloidy needs a TSV file specifying prior probabilities for each integer ploidy state and for each contig. The following shows an example of such a table:
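
(The documentation's table did not carry over here; the following is a hedged reconstruction of its format, with assumed column names and illustrative prior values. Columns are tab-separated and each row's priors should sum to 1.)

CONTIG_NAME	PLOIDY_PRIOR_0	PLOIDY_PRIOR_1	PLOIDY_PRIOR_2	PLOIDY_PRIOR_3
1	0.01	0.01	0.97	0.01
2	0.01	0.01	0.97	0.01
X	0.01	0.49	0.49	0.01
Y	0.50	0.49	0.01	0.00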

In the example in COHORT mode, a_valid_ploidy_priors_table.tsv is needed.

I could not find a tool in GATK to produce a_valid_ploidy_priors_table.tsv. So how can I produce this file in order to run DetermineGermlineContigPloidy?

thanks

Running Picard MergeBamAlignment

When I'm running Picard MergeBamAlignment, I am getting this error:

Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.util.zip.DataFormatException: invalid distance too far back.

Any idea what this could be and how to solve it?
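
A hedged aside: that exception is a zlib decompression error, which usually points to a truncated or corrupted gzipped input (FASTQ, BAM, or reference block). Two quick integrity checks, with placeholder file names:

# Test gzip integrity of a FASTQ input (a non-zero exit status means the stream is corrupt).
gzip -t reads_R1.fastq.gz

# Ask samtools whether the BAM/CRAM inputs look intact (readable records, EOF block present).
samtools quickcheck -v unmapped.bam aligned.bam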

Running picard MergeBamAlignment step

I am running the Drop-seq pipeline. I have successfully run all the steps up to the MergeBamAlignment step.

Here I am getting an error that says: Exception in thread "main" htsjdk.samtools.SAMException: Could not find dictionary next to reference file /scratch/saimukund/Reference/Human/GSM1629193_hg19_ERCC.fasta

These are the contents in the /scratch/saimukund/Reference/Human directory:
1. GSM1629193_hg19_ERCC.dict.txt
2. GSM1629193_hg19_ERCC.fasta
3. GSM1629193_hg19_ERCC.gtf
4. GSM1629193_hg19_ERCC.refFlat.txt

I am not able to solve this error. Could you please help out here?
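
A hedged observation: the tool looks for GSM1629193_hg19_ERCC.dict next to the FASTA, and the directory listing shows GSM1629193_hg19_ERCC.dict.txt instead. Renaming the file may be enough, or the dictionary can be regenerated from the reference:

mv GSM1629193_hg19_ERCC.dict.txt GSM1629193_hg19_ERCC.dict

gatk CreateSequenceDictionary -R GSM1629193_hg19_ERCC.fasta -O GSM1629193_hg19_ERCC.dict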

How to get the min, max and average coverage for whole intervals (pre-defined in a bed file)?

Hi,
I am working with Illumina DNA sequencing data; we mostly use targeted sequencing. I analysed the data following your Best Practices using GATK 4.0.1.1.
My question is: how can I simply generate a report showing, for each selected interval in a BED file, the minimum, maximum, and average coverage? The tools available under "Coverage Analysis" and "Diagnostics and Quality Control" don't seem to provide this. I think such tools were available in GATK 3.8, but I can't see which GATK4 tools replace them.
Thanks in advance
Nawar
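
A hedged workaround outside GATK: loop over the BED intervals with samtools and awk (this sketch assumes a sorted, indexed BAM, a plain headerless BED, and samtools on the PATH; file names are placeholders):

while read -r chrom start end rest; do
  # BED is 0-based half-open; samtools regions are 1-based inclusive.
  samtools depth -a -r "${chrom}:$((start + 1))-${end}" sample.bam |
    awk -v iv="${chrom}:${start}-${end}" '
      NR == 1 { min = $3; max = $3 }
      { sum += $3; if ($3 < min) min = $3; if ($3 > max) max = $3; n++ }
      END { if (n) printf "%s\tmin=%d\tmax=%d\tmean=%.2f\n", iv, min, max, sum / n }'
done < targets.bed > coverage_summary.txt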
