Channel: Recent Discussions — GATK-Forum

Can't post a discussion on the forum


I've been trying to post a question and it gets rejected every time.
There is nothing special about this question except that it contains a couple of links.
When I click on the post discussion button there is a pop up saying "Your discussion will appear after it is approved."

Can somebody tell me what I'm doing wrong?

Thanks


How to use GATK to analyze multiple samples from different tumor phases? GenotypeGVCFs or Mutect2


Hello, I am new to GATK. I have samples from different patients at different tumor phases, for example:
normal phase : 1 patient sample
tumor phase1 : 9 patient samples
tumor phase2 : 7 patient samples
tumor phase3 : 10 patient samples

I want to obtain the variants specific to each phase. Following the GATK Best Practices, I have finished joint calling for each phase (CombineGVCFs + GenotypeGVCFs):

# $samples holds the sample IDs for one phase
sample_gvcfs=""
for sample in $samples ; do
    sample_gvcfs="${sample_gvcfs} -V $outdir/${sample}/gatk/${sample}.HC.g.vcf.gz"
done
time $gatk CombineGVCFs \
    -R $reference/Homo_sapiens_assembly38.fasta \
    ${sample_gvcfs} \
    -O $outdir/population/${outname}.HC.g.vcf.gz && echo "** ${outname}.HC.g.vcf.gz done ** " && \
time $gatk GenotypeGVCFs \
    -R $reference/Homo_sapiens_assembly38.fasta \
    -V $outdir/population/${outname}.HC.g.vcf.gz \
    -O $outdir/population/${outname}.HC.vcf.gz && echo "** ${outname}.HC.vcf.gz done ** "

I have two questions:

1. After getting the VCF and applying hard filters, I found there are still too many variants. How can I remove more unrelated variants? (Maybe I can use the normal sample to filter out more variants; is there a tool for that?)

2. Is Mutect2 the better choice in this situation?

Any help will be appreciated!
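Regarding the first question: one common approach is to subtract the variants also seen in the normal sample from each phase's callset. Below is a minimal sketch assuming GATK4's SelectVariants tool with its --discordance option; all file names are hypothetical, and the composed command is printed as a dry run rather than executed:

```shell
# Hypothetical inputs: one phase's joint callset and the normal sample's callset.
phase_vcf="phase1.HC.vcf.gz"
normal_vcf="normal.HC.vcf.gz"

# SelectVariants --discordance keeps only records absent from the normal VCF.
cmd="gatk SelectVariants \
  -R Homo_sapiens_assembly38.fasta \
  -V ${phase_vcf} \
  --discordance ${normal_vcf} \
  -O phase1.minus_normal.vcf.gz"

# Print the composed command; in a real run you would execute it instead.
echo "${cmd}"
```

This only removes positions shared with the normal; for genuine somatic calling, Mutect2 with a matched normal models the tumor/normal contrast directly.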

GATK for Nanopore data


Hi,
I'm trying to use GATK to perform variant calling with MinION Nanopore data. Specifically, I would like to ask whether anyone has already used GATK to analyse Nanopore sequencing data, and whether processing the data with GATK can mitigate homopolymer errors.
Thank you,

Paola Orsini

MarkDuplicatesSpark raised an error: Multiple inputs but sorted in coordinate order


"A USER ERROR has occurred: Multiple inputs to MarkDuplicatesSpark detected but input abcd_lane1.bam was sorted in coordinate order."

  1. All inputs are sorted in coordinate order.
  2. When I used the non-Spark version, the error did not occur.

Does the Spark version require the inputs to be sorted differently?

I can't find any reference from the manuals.

Could you help me to solve this problem?
Thank you in advance.
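For what it's worth, MarkDuplicatesSpark expects multiple inputs to be queryname-sorted rather than coordinate-sorted. A minimal sketch of re-sorting each input first (the second file name is hypothetical); the commands are composed and printed as a dry run so the sketch stands on its own:

```shell
# Hypothetical coordinate-sorted inputs like the one that triggered the error.
inputs="abcd_lane1.bam abcd_lane2.bam"

sort_cmds=""
for bam in $inputs; do
    out="${bam%.bam}.qsorted.bam"
    # MarkDuplicatesSpark accepts multiple inputs only when queryname-sorted.
    sort_cmds="${sort_cmds}gatk SortSam -I ${bam} -O ${out} -SO queryname
"
done

# Print the composed commands (dry run), then the follow-up invocation.
printf '%s' "$sort_cmds"
echo "gatk MarkDuplicatesSpark -I abcd_lane1.qsorted.bam -I abcd_lane2.qsorted.bam -O marked.bam"
```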

Error in Mutect2


I just encountered this error in Mutect2:

Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -jar /Users/loeblabm11/bioinformatics/programs/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar Mutect2 -R /Users/loeblabm11/bioinformatics/reference/human/hg19/hg19.fa -I 20171027_BN31_python.dcs.filt.no_overlap.bam -tumor BN31 -O 20171027_BN31_python.dcs.MuTect2.vcf -bamout 20171027_BN31_python.dcs.MuTect2.bam --max-reads-per-alignment-start 0 --max-population-af 1 --disable-tool-default-read-filters
12:18:10.900 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/loeblabm11/bioinformatics/programs/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
12:18:11.387 INFO Mutect2 - ------------------------------------------------------------
12:18:11.388 INFO Mutect2 - The Genome Analysis Toolkit (GATK) v4.0.3.0
12:18:11.388 INFO Mutect2 - For support and documentation go to https://software.broadinstitute.org/gatk/
12:18:11.388 INFO Mutect2 - Executing as loeblabm11@LoeblabM11s-iMac.local on Mac OS X v10.12.6 x86_64
12:18:11.388 INFO Mutect2 - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_151-b12
12:18:11.388 INFO Mutect2 - Start Date/Time: April 11, 2018 12:18:10 PM PDT
12:18:11.388 INFO Mutect2 - ------------------------------------------------------------
12:18:11.388 INFO Mutect2 - ------------------------------------------------------------
12:18:11.388 INFO Mutect2 - HTSJDK Version: 2.14.3
12:18:11.388 INFO Mutect2 - Picard Version: 2.17.2
12:18:11.388 INFO Mutect2 - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:18:11.388 INFO Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:18:11.388 INFO Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:18:11.388 INFO Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:18:11.388 INFO Mutect2 - Deflater: IntelDeflater
12:18:11.388 INFO Mutect2 - Inflater: IntelInflater
12:18:11.389 INFO Mutect2 - GCS max retries/reopens: 20
12:18:11.389 INFO Mutect2 - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
12:18:11.389 INFO Mutect2 - Initializing engine
12:18:11.724 INFO Mutect2 - Done initializing engine
12:18:12.288 INFO NativeLibraryLoader - Loading libgkl_utils.dylib from jar:file:/Users/loeblabm11/bioinformatics/programs/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_utils.dylib
12:18:12.290 WARN NativeLibraryLoader - Unable to find native library: native/libgkl_pairhmm_omp.dylib
12:18:12.290 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
12:18:12.290 INFO NativeLibraryLoader - Loading libgkl_pairhmm.dylib from jar:file:/Users/loeblabm11/bioinformatics/programs/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm.dylib
12:18:12.368 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
12:18:12.368 WARN IntelPairHmm - Ignoring request for 4 threads; not using OpenMP implementation
12:18:12.369 INFO PairHMM - Using the AVX-accelerated native PairHMM implementation
12:18:12.403 INFO ProgressMeter - Starting traversal
12:18:12.403 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
12:18:22.403 INFO ProgressMeter - chr1:75065650 0.2 250240 1501440.0
12:18:29.713 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.009098343
12:18:29.713 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.121747383
12:18:29.713 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.42 sec
12:18:29.763 INFO Mutect2 - Shutting down engine
[April 11, 2018 12:18:29 PM PDT] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 0.31 minutes.
Runtime.totalMemory()=1781006336
java.lang.NullPointerException
at org.broadinstitute.hellbender.utils.locusiterator.ReadStateManager.readStartsAtCurrentPosition(ReadStateManager.java:132)
at org.broadinstitute.hellbender.utils.locusiterator.ReadStateManager.collectPendingReads(ReadStateManager.java:159)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.lazyLoadNextAlignmentContext(LocusIteratorByState.java:315)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.hasNext(LocusIteratorByState.java:252)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.advanceAlignmentContext(IntervalAlignmentContextIterator.java:104)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.advanceAlignmentContextToCurrentInterval(IntervalAlignmentContextIterator.java:99)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.next(IntervalAlignmentContextIterator.java:69)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.next(IntervalAlignmentContextIterator.java:21)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:143)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:135)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:34)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:290)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:271)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

My command looks like this:
~/bioinformatics/programs/GATK/gatk-4.0.3.0/gatk --java-options "-Xmx4g" Mutect2 -R /Users/loeblabm11/bioinformatics/reference/human/hg19/hg19.fa -I 20171027_BN31_python.dcs.filt.no_overlap.bam -tumor BN31 -O 20171027_BN31_python.dcs.MuTect2.vcf -bamout 20171027_BN31_python.dcs.MuTect2.bam --max-reads-per-alignment-start 0 --max-population-af 1 --disable-tool-default-read-filters

and I'm running on MacOSX Sierra (10.12.6). java -version returns
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

The input file is position-sorted output from Duplex Sequencing pipelines; it has been through IndelRealigner (from GATK 3.7) and ClipOverlappingReads from fgbio, and it has been indexed.

I can put together a full bug report, complete with my input file, if necessary; I just want an idea of what might be happening.

Brendan

Choice of Known indels file for local realignment

Dear all:

I am following a pipeline in an article to realign WGS reads to GRCh38 in an ALTs aware manner. I plan to use gatk 3.7 to do the local realignment.
I am wondering whether it is OK to use "Homo_sapiens_assembly38.known_indels.vcf.gz" from the Google Cloud bucket as the known-indels file for local realignment with "IndelRealigner".
Also, is this the 1000 Genomes Phase 3 known-indels file?

I know that gatk4 no longer does realignment. But I still need to do it. I am just confused about which known indels file to use, since I want to process reads re-aligned to GRCh38.

Many thanks,
Yidong Zhang

Variant calling: high (and strange) number of alternative alleles


Dear GATK team,

I am calling variants on a trio (mother, father and offspring) of Macaca mulatta, with 60X whole-genome sequencing for each individual. I use GATK 4.0.7.0: I call variants with HaplotypeCaller in BP_RESOLUTION mode, combine per chromosome with GenomicsDBImport, and genotype with GenotypeGVCFs.

I am interested in the number of sites where I have only reference allele (AD=0 for the alternative) and the number of sites where I have some reads supporting ALT allele (AD > 0) in the parents.

I found a lot of sites (for each individual) where I have AD>0 in the GVCF file (per individual, in the combined one, and after genotyping). I looked at the sites that are HomRef, and for each individual fewer than 30% of the HomRef sites have AD=0 for the alternative allele. I know that HaplotypeCaller does a realignment step that may change the positions of the reads, but 70% of sites with AD>0 seems like a lot. I looked back at the BAM file and those alternative alleles don't seem to be there. I tried calling again with the -bamout option, and there again I don't see that many alternative alleles. However, I see that sometimes a read that carried no alternative allele in the input BAM carries one in the output.
I have also tried samtools mpileup, and in that case almost 90% of the HomRef sites have AD=0 for the alternative allele.

Just as an example, below is the VCF output from HaplotypeCaller for one individual, followed by the read counts from both the input BAM file and the output BAM file (originally shown as a screenshot).
For chr1 pos 24203380 the ref is A and I have:
Vcf --> DP=96 AD=92,4
Bam input --> DP 93, 92,1 (N)
Bam out --> DP=80, 79,1 (N)

chr1 24203380 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,4:96:57:0,57,5771
chr1 24203381 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:90,5:95:0:0,0,5897
chr1 24203382 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:78:0,78,6075
chr1 24203383 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:68:0,68,6127
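The tally described above can be reproduced with a short awk pass over a BP_RESOLUTION GVCF. The sketch below applies it to the four example records, counting HomRef sites whose alternative-allele AD is nonzero:

```shell
# Write the four example records to a file (whitespace-delimited columns:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE).
cat > gvcf_example.txt <<'EOF'
chr1 24203380 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,4:96:57:0,57,5771
chr1 24203381 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:90,5:95:0:0,0,5897
chr1 24203382 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:78:0,78,6075
chr1 24203383 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:68:0,68,6127
EOF

# Count HomRef records, and those among them with alt-allele AD > 0.
result=$(awk '{
    split($10, fmt, ":")        # sample values: GT:AD:DP:GQ:PL
    split(fmt[2], ad, ",")      # AD = ref_count,alt_count
    if (fmt[1] == "0/0") {
        homref++
        if (ad[2] + 0 > 0) withalt++
    }
} END { printf "%d of %d HomRef sites have alt AD > 0", withalt, homref }' gvcf_example.txt)

echo "$result"   # all four example records carry nonzero alt AD
```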

Just in case here is my code:
gatk --java-options "-XX:ParallelGCThreads=16 -Xmx64g" HaplotypeCaller -R /PATH/rheMac8.fa -I /PATH/R01068_sorted.merged.addg.uniq.rmdup.bam -O /PATH/R01068_res.g.vcf -ERC BP_RESOLUTION

I don’t know why I have this high number of alternative alleles, or how to get rid of them to obtain the 'real' number of alternative alleles per position. The problem persists in the genotyped VCF files, with some alternative alleles that are not present in any BAM (input or HaplotypeCaller output).

I hope I gave you enough details so you have a clear idea of my problem and will be able to help me.
Best,

FireCloud Sort and BQSR failing - java.io.IOException: Broken pipe


Dear FireCloud team,

We have been running a scaled-down version of the GATK Best Practices preprocessing WDL, edited so that it just runs SortAndFixTags and then BQSR on whole-genome sequencing data from dogs (alignment and MarkDuplicates having already been run by the Genomics Platform). Our pipeline seemed to work well on 33 of the 34 samples we have run so far, but one sample failed with the following errors:

Elapsed time: 12:07:19s.  Time for last 10,000,000:  222s.  Last read position: chr6:45,839,960
[Sun Apr 07 07:15:32 UTC 2019] picard.sam.SetNmAndUqTags done. Elapsed time: 727.50 minutes.
Runtime.totalMemory()=1094713344
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Exception when processing alignment for BAM index HFTLGALXX170510:4:1116:22262:20436 1/2 151b aligned to chr6:46451817-46451967.
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:140)
    at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:185)
    at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyWrite(AsyncSAMFileWriter.java:36)
    at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyWrite(AsyncSAMFileWriter.java:16)
    at htsjdk.samtools.util.AbstractAsyncWriter$WriterRunnable.run(AbstractAsyncWriter.java:123)
    at java.lang.Thread.run(Thread.java:748)
Caused by: htsjdk.samtools.util.RuntimeIOException: Write error; BinaryCodec in writemode; streamed file (filename not available)
    at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:222)
    at htsjdk.samtools.util.BlockCompressedOutputStream.writeGzipBlock(BlockCompressedOutputStream.java:429)
    at htsjdk.samtools.util.BlockCompressedOutputStream.deflateBlock(BlockCompressedOutputStream.java:392)
    at htsjdk.samtools.util.BlockCompressedOutputStream.write(BlockCompressedOutputStream.java:291)
    at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:220)
    at htsjdk.samtools.util.BinaryCodec.writeByteBuffer(BinaryCodec.java:188)
    at htsjdk.samtools.util.BinaryCodec.writeInt(BinaryCodec.java:234)
    at htsjdk.samtools.BAMRecordCodec.encode(BAMRecordCodec.java:150)
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:133)
    ... 5 more
Caused by: java.io.IOException: No space left on device
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
    at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
    at java.nio.channels.Channels.writeFully(Channels.java:101)
    at java.nio.channels.Channels.access$000(Channels.java:61)
    at java.nio.channels.Channels$1.write(Channels.java:174)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at htsjdk.samtools.util.Md5CalculatingOutputStream.write(Md5CalculatingOutputStream.java:89)
    at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:220)
    ... 13 more
Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Dsamjdk.compression_level=5 -Xms500m -jar /gatk/gatk-package-4.1.0.0-local.jar SetNmAndUqTags --INPUT /dev/stdin --OUTPUT MR415976_N.sorted.bam --CREATE_INDEX true --CREATE_MD5_FILE true --REFERENCE_SEQUENCE /cromwell_root/fc-0b0cb3ce-e2cb-4aef-a8b2-08e60d78e87c/Canis_lupus_familiaris_assembly3.fasta
[Sun Apr 07 07:15:32 UTC 2019] picard.sam.SortSam done. Elapsed time: 727.51 minutes.
Runtime.totalMemory()=4192731136
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.util.RuntimeIOException: Write error; BinaryCodec in writemode; streamed file (filename not available)
    at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:222)
    at htsjdk.samtools.util.BlockCompressedOutputStream.writeGzipBlock(BlockCompressedOutputStream.java:429)
    at htsjdk.samtools.util.BlockCompressedOutputStream.deflateBlock(BlockCompressedOutputStream.java:392)
    at htsjdk.samtools.util.BlockCompressedOutputStream.write(BlockCompressedOutputStream.java:291)
    at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:220)
    at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:212)
    at htsjdk.samtools.BAMRecordCodec.encode(BAMRecordCodec.java:164)
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:143)
    at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:210)
    at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyClose(AsyncSAMFileWriter.java:38)
    at htsjdk.samtools.util.AbstractAsyncWriter.close(AbstractAsyncWriter.java:89)
    at picard.sam.SortSam.doWork(SortSam.java:167)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
Caused by: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
    at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
    at java.nio.channels.Channels.writeFully(Channels.java:101)
    at java.nio.channels.Channels.access$000(Channels.java:61)
    at java.nio.channels.Channels$1.write(Channels.java:174)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at htsjdk.samtools.util.BinaryCodec.writeBytes(BinaryCodec.java:220)
    ... 16 more
Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Dsamjdk.compression_level=5 -Xms4000m -jar /gatk/gatk-package-4.1.0.0-local.jar SortSam --INPUT /cromwell_root/fc-8268e82b-ed61-4e04-a8c9-a95a05c0952e/MR415976_2.bam --OUTPUT /dev/stdout --SORT_ORDER coordinate --CREATE_INDEX false --CREATE_MD5_FILE false

Any thoughts as to what may be going on?

I have checked that the bam and index file uploaded to FireCloud correctly (md5sums match) and have run Picard Tools ValidateSamFile on the bam without getting any errors.

Thank you for your help!

Best regards,
Kate


Mutect2 stops with no log files

GenomicsDBImport haploid data ?


The documentation says 'At the moment GenomicsDBImport only supports diploid data. There's some work underway to support non-diploid data as well.'

Does that mean it can't be used on haploid data at all, or is it only ploidies greater than diploid that are unsupported?

Thanks

Why does HaplotypeCaller (HC) use a flat prior in joint calling?


In genotyping, P(G|D) = P(G)P(D|G)/P(D). Why does HC use a flat P(G) instead of computing one based on cohort allele frequencies?
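For reference, the contrast the question draws can be written out explicitly. With a flat prior the posterior is driven by the likelihood alone, whereas a cohort-aware prior would weight genotypes by allele frequency; the second line is purely illustrative (Hardy-Weinberg with alternate-allele frequency q), not a statement of HC's internals:

```latex
% Flat prior over the three diploid genotypes: posterior ∝ likelihood.
P(G \mid D) = \frac{P(G)\,P(D \mid G)}{P(D)} \propto P(D \mid G)
\quad \text{when } P(G) = \tfrac{1}{3} \text{ for } G \in \{AA,\, Aa,\, aa\}.

% A cohort-based prior under Hardy--Weinberg with alt-allele frequency q:
P(AA) = (1-q)^2, \qquad P(Aa) = 2q(1-q), \qquad P(aa) = q^2.
```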

Workshop season is right around the corner


We recently rolled out a fresh new edition of our popular 4-day GATK bootcamp, which we trialed at the Broad, and now we're excited to take it on the road! We have six locations lined up, one every month for the next six months. We're going to be bouncing around Europe quite a bit, with a couple of side trips to Central and South America:

  • May 14-17: Helsinki, Finland (register here)
  • June 18-21: Newcastle, UK (register here)
  • July 8-11: Cambridge, UK (register here)
  • August 26-29: São Paulo, Brazil
  • September 10-13: San Jose, Costa Rica
  • October 21-24: Seville, Spain

(registration links will be added as they become available)

In this bootcamp-style workshop, we take you through the basics of working with genomic sequence data and a high-level overview of variant discovery so that by the end of the first day, you have a solid understanding of all the main pieces and processes. Then over the course of the next two days we delve into the details of variant calling for the most mature supported use cases -- germline SNPs and Indels on Day 2, somatic SNPs and Indels and somatic copy number (CNVs) on Day 3 -- through a combination of lectures and hands-on tutorials. Finally, on the fourth day we cover the mechanics of assembling and executing analysis workflows, as well as some useful tools and newer use cases that are hot off the development press.

The most innovative feature of this season is that we'll be running all hands-on tutorials on the Broad's new cloud-based analysis platform, Terra (the next-gen version of FireCloud), which you can learn more about at https://terra.bio. Terra is currently in a closed beta phase until May 1st, but you can already request access by filling in a contact form at https://terra.bio/contact. Going forward Terra will be our primary channel for delivering tutorials, working examples of GATK-based analyses and more, so be sure to check it out even if you don't plan to move your work to the cloud.

BaseRecalibrator was able to recalibrate 0 reads

I'm trying to track down why `BaseRecalibrator` isn't able to recalibrate my reads, but am having trouble finding the problem's source. I'm running GATK v 3.5.

Command:
```
strings=(
    S1233686
)

for i in "${strings[@]}"; do
    echo "${i}"

    java -jar /ast/emb/anaconda3/opt/gatk-3.5/GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -R /ast/emb/prtj3/refs/Homo_sapiens_assembly38.fasta \
        -I /ast/emb/prtj3/indel_realigned/${i}.bam \
        -knownSites /ast/emb/prtj3/refs/Mills_and_1000G_gold_standard.indels.hg38.vcf \
        -knownSites /ast/emb/prtj3/refs/1000G_phase1.snps.high_confidence.hg38.vcf \
        -o /ast/emb/prtj3/BQSR/${i}.recal.data.table.txt
done
```



Output:
```
INFO 14:42:03,137 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:42:03,139 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 14:42:03,139 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:42:03,139 HelpFormatter - For support and documentation go to removed link
INFO 14:42:03,142 HelpFormatter - Program Args: -T BaseRecalibrator -R /ast/emb/well/prjt3/Homo_sapiens_assembly38.fasta -I /ast/emb/prjt3/indel_realigned/S1233686.bam
-knownSites /ast/emb/prjt3/refs/Mills_and_1000G_gold_standard.indels.hg38.vcf -knownSites /ast/emb/prjt3/refs/1000G_phase1.snps.high_confidence.hg38.vcf -o /ast/emb/prjt3/BQSR/S1233686.recal.data.table.txt
INFO 14:42:03,147 HelpFormatter - Executing as emb@compute-3-45.local on Linux 2.6.32-696.el6.centos.plus.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14.
INFO 14:42:03,147 HelpFormatter - Date/Time: 2019/04/10 14:42:03
INFO 14:42:03,147 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:42:03,147 HelpFormatter - --------------------------------------------------------------------------------
INFO 14:42:03,466 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:42:03,650 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 14:42:03,654 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:42:03,773 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.12
INFO 14:42:04,552 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 14:42:04,556 GenomeAnalysisEngine - Done preparing for traversal
INFO 14:42:04,557 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 14:42:04,557 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 14:42:04,557 ProgressMeter - Location | reads | elapsed | reads | completed | runtime | runtime
INFO 14:42:04,574 BaseRecalibrator - The covariates being used here:
INFO 14:42:04,575 BaseRecalibrator - ReadGroupCovariate
INFO 14:42:04,575 BaseRecalibrator - QualityScoreCovariate
INFO 14:42:04,575 BaseRecalibrator - ContextCovariate
INFO 14:42:04,575 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 14:42:04,575 BaseRecalibrator - CycleCovariate
INFO 14:42:04,650 ReadShardBalancer$1 - Loading BAM index data
INFO 14:42:04,653 ReadShardBalancer$1 - Done loading BAM index data
INFO 14:42:04,663 BaseRecalibrator - Calculating quantized quality scores...
INFO 14:42:04,674 BaseRecalibrator - Writing recalibration report...
INFO 14:42:04,692 BaseRecalibrator - ...done!
INFO 14:42:04,692 BaseRecalibrator - BaseRecalibrator was able to recalibrate 0 reads
INFO 14:42:04,694 ProgressMeter - done 0.0 0.0 s 37.9 h 100.0% 0.0 s 0.0 s
INFO 14:42:04,694 ProgressMeter - Total runtime 0.14 secs, 0.00 min, 0.00 hours
INFO 14:42:05,344 GATKRunReport - Uploaded run statistics report to AWS S3
```


From the Resource Bundle, I retrieved
`Homo_sapiens_assembly38.fasta`,
`Mills_and_1000G_gold_standard.indels.hg38.vcf`, and
`1000G_phase1.snps.high_confidence.hg38.vcf`.



Here I tried validating the input .bam:

Command:
```
java -jar /ast/emb/anaconda3/share/picard-2.18.16-0/picard.jar ValidateSamFile \
I=/ast/emb/prjt3/indel_realigned/S1233686.bam \
MODE=SUMMARY
```

Output:
```
INFO 2019-04-10 15:12:23 ValidateSamFile

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** removed link
**********
********** The command line looks like this in the new syntax:
**********
********** ValidateSamFile -I /ast/emb/prjt3/indel_realigned/S1233686.bam -MODE SUMMARY
**********


15:12:24.009 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/ast/emb/anaconda3/share/picard-2.18.16-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Apr 10 15:12:24 EDT 2019] ValidateSamFile INPUT=/ast/emb/prjt3/indel_realigned/S1233686.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 SKIP_MATE_VALIDATION=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Apr 10 15:12:24 EDT 2019] Executing as emb@compute-3-45.local on Linux 2.6.32-696.el6.centos.plus.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.16-SNAPSHOT
WARNING 2019-04-10 15:12:24 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
No errors found
[Wed Apr 10 15:12:24 EDT 2019] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=506462208
```


Prior to running `BaseRecalibrator`, here's what I did with aligned reads:
1. `MarkDuplicates`
2. `SortSam`, `SORT_ORDER=coordinate`
3. `AddOrReplaceReadGroups`
4. `index`
5. GATK
6. `RealignerTargetCreator`
7. `IndelRealigner`
8. `index`


Funcotator: Functional annotation out of beta


A production-ready tool to predict variant function

For the past year Funcotator has been a beta tool in GATK. With this new 4.1 release, Funcotator is (finally) out of beta and ready for use in a production environment. So... what exactly is Funcotator, and why should you care?

Funcotator is a functional annotator (FUNCtional annOTATOR) that reads in variants and adds useful information about their potential effects. Its uses range from answering the question ‘in which gene (if any) does this variant occur?’ to predicting an amino acid change string for variants that result in different protein structures. Accurate functional annotation is critical to turning vast amounts of genomic data into a better understanding of protein function.

Funcotator was created to be a fast, functional, and accurate annotation tool that supports the hg38 genome, and many recent updates have made it more robust and more correct. Of particular note, the protein change string algorithm can now account for protein changes that do not occur at the same amino acid position as a variant (such as when deletions occur in short tandem repeats). If you have a set of variants and wish to identify the genes affected and/or the protein amino acid sequence change, or if you simply wish to cross-reference your variants with a list of variants thought to be implicated in disease, Funcotator is the tool for you.

We publish two sets of data sources to go with Funcotator (including Gencode, ClinVar, gnomAD, and more) so it can be used out of the box and with minimal effort to add annotations to either germline or somatic variants. Best of all, it can be updated by you, the user, to include your favorite annotation data sources when you annotate your VCF files (with some caveats).


“Fun” means improved user experience and data output

A huge number of bug fixes and accuracy improvements mean output is now much better and more correct than Oncotator. As an example of improved user experience, the new FuncotatorDataSourceDownloader tool enables downloading the data sources from which annotations are created directly from the command-line. It is as simple as running ./gatk FuncotatorDataSourceDownloader --somatic to get the somatic data sources (though there are more options for the tool as well).
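Once the data sources are downloaded, a typical annotation run points Funcotator at them. A hedged sketch (the input file names and data-sources directory are hypothetical placeholders); the command is composed and printed as a dry run:

```shell
# Hypothetical inputs; the data-sources path is wherever the downloader put them.
cmd="gatk Funcotator \
  --variant input.vcf \
  --reference Homo_sapiens_assembly38.fasta \
  --ref-version hg38 \
  --data-sources-path funcotator_dataSources \
  --output input.funcotated.vcf \
  --output-file-format VCF"

# Print the composed command rather than executing it.
echo "$cmd"
```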

“Funcotator” versus “Oncotator” - very different annotator tools

A savvy user may want to compare Funcotator to the Broad’s previous functional annotation tool Oncotator. Despite similar names and purpose, they are VERY different pieces of software and a direct comparison cannot really be made. Funcotator is not Oncotator. The forum post below details some of the differences between the two tools.

Future of “Fun”

There are many features on the horizon for Funcotator (in addition to normal support and bug fixes). In the long term, we would like to greatly increase performance with a Spark version of Funcotator. Adding even more supported data formats for data sources will offer users additional options for adding annotations. Since the tool is in active development, small features are always being added and bugs fixed.

Check for current progress on the GATK Github page here:
https://github.com/broadinstitute/gatk/labels/Funcotator

A forum post with a tutorial and some additional data can be found here:
https://gatkforums.broadinstitute.org/dsde/discussion/11193/funcotator-information-and-tutorial

The tool documentation for Funcotator can be found here:
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php

Allele-specific annotation and filtering of germline short variants


Overview

The traditional VQSR recalibration paradigm evaluates each position, and passes or filters all alleles at that position, regardless of how many alternate alleles occur there. This has major disadvantages in cases where a real variant allele occurs at the same position as an error that has sufficient evidence to be called as a variant. The goal of the Allele-Specific filtering workflow is to treat each allele separately in the annotation, recalibration and filtering phases.

Multi-allelic sites benefit the most from the Allele-Specific filtering workflow because each allele will be evaluated more accurately than if its data was lumped together with other alleles. Large callsets will benefit more than small callsets because multi-allelics will occur more frequently as the number of samples in a cohort increases. One callset with 42 samples that was used for development contains 3% multi-allelic sites, while the ExAC callset [http://biorxiv.org/content/early/2015/10/30/030338] with approximately 60,000 samples contains nearly 8% multi-allelic sites. Recalibrating each allele separately will also greatly benefit rare disease studies, in which rare alleles may not be shared by other members of the callset, but could still occur at the same positions as common alleles or errors.

No additional resource files are necessary compared to the traditional VQSR workflow, but this must be run starting from the sample BAM files. The relevant annotations cannot be calculated from VCF or GVCF files alone.

After running the Allele-Specific filtering workflow, several new annotations will be added to the INFO field for your variants (see below), and VQSR results will be based on those new annotations, though using SNP and INDEL tranche sensitivity cutoffs equivalent to the non-allele-specific best practices. If after analyzing your recalibrated data, you’re not convinced that this workflow is for you, you can still run the classic VQSR on your genotyped VCF because the standard annotations for VQSR are still included in the genotyped VCF.

To be clear, this workflow cannot be run without the GVCF mode. This is because the way we generate and combine the allele-specific data depends on having raw data for each sample in the GVCF.

Note that this workflow is not yet officially part of the GATK Best Practices for germline variant discovery. Although we are happy with the performance of this workflow, our own production pipelines have not yet been updated to include this, so it should still be considered experimental. However, we do encourage you to try this out on your own data and let us know what you find, as this helps us refine the tools and catch bugs.


Summary of workflow steps

Input:

Begin with a BAM file that has been fully pre-processed according to our Best Practices recommendations for each sample. The read data in the BAM are necessary to generate the allele-specific annotations.

Step 1: Generate a GVCF per sample with HaplotypeCaller

Using the locally-realigned reads, HaplotypeCaller will generate GVCFs with all of its usual standard annotations, plus raw data to calculate allele-specific versions of the standard annotations. That means each alternate allele in each VariantContext will get its own data used by downstream tools to generate allele-specific QualByDepth, RMSMappingQuality, FisherStrand and allele-specific versions of the other standard annotations. For example, this will help us sort out good alleles that only occur in a few samples and have a good balance of forward and reverse reads but occur at the same position as another allele that has bad strand bias because it’s probably a mapping error.

Use -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation to request the appropriate annotations at this stage.
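As a sketch, step 1 might look like the following (GATK4 syntax; the reference and file names are placeholders, not part of the original article):

```shell
# Per-sample GVCF with both standard and allele-specific raw annotations.
gatk HaplotypeCaller \
    -R Homo_sapiens_assembly38.fasta \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF \
    -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation
```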

Step 2: Consolidate GVCFs (using either CombineGVCFs or GenomicsDBImport)

Here the allele-specific data for each sample is combined per allele, but is not yet in its final annotation form. Use -G StandardAnnotation -G AS_StandardAnnotation to request the appropriate annotations at this stage.

Step 3: Joint-call all samples with GenotypeGVCFs

Raw allele-specific data for each sample is used to calculate the finalized annotation values. Use -G StandardAnnotation -G AS_StandardAnnotation to request the appropriate annotations at this stage.
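Steps 2 and 3 together might be sketched as follows (GATK4 syntax, CombineGVCFs variant; file names are placeholders):

```shell
# Step 2: consolidate per-sample GVCFs, carrying the raw allele-specific data.
gatk CombineGVCFs \
    -R Homo_sapiens_assembly38.fasta \
    -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
    -O combined.g.vcf.gz \
    -G StandardAnnotation -G AS_StandardAnnotation

# Step 3: joint-call, finalizing the allele-specific annotation values.
gatk GenotypeGVCFs \
    -R Homo_sapiens_assembly38.fasta \
    -V combined.g.vcf.gz \
    -O joint.vcf.gz \
    -G StandardAnnotation -G AS_StandardAnnotation
```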

Step 4: Variant filtering with VQSR

In allele-specific mode (activated using -AS), the VariantRecalibrator builds the statistical model based on data for each allele, rather than each site. This has the added benefit of being able to recalibrate the SNPs in mixed sites according to the appropriate model, rather than lumping them in with INDELs as had been done previously. It will also provide better results by matching the exact allele in the training and truth data rather than just the position.

When you run the second step of VQSR, ApplyRecalibration, allele-specific filters are calculated and stored in the AS_FilterStatus INFO annotation. A site-level filter is applied to each site based on the most lenient filter across all alleles. For example, if any allele passes, the entire site passes. If no alleles pass, then the filter will be applied corresponding to the allele with the lowest tranche (best VQSLOD).

The two ApplyRecalibration modes should be run in series, as in our usual Best Practices recommendations. If SNP and INDEL ApplyRecalibration modes are run in parallel and combined with CombineVariants (which would work for the standard VQSR arguments), in allele-specific mode any mixed sites will fail to be processed correctly.
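A sketch of the SNP pass of step 4 follows. This uses GATK4 syntax, where ApplyRecalibration is named ApplyVQSR; the resource files, annotation list, and tranche cutoff are illustrative, not prescriptive:

```shell
# SNP pass shown; repeat with -mode INDEL and indel resources, then run the
# two ApplyVQSR passes in series as described above.
gatk VariantRecalibrator \
    -R Homo_sapiens_assembly38.fasta \
    -V joint.vcf.gz \
    -AS \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an AS_QD -an AS_FS -an AS_MQ -an AS_MQRankSum -an AS_ReadPosRankSum -an AS_SOR \
    -mode SNP \
    -O snps.recal --tranches-file snps.tranches

gatk ApplyVQSR \
    -R Homo_sapiens_assembly38.fasta \
    -V joint.vcf.gz \
    -AS \
    --recal-file snps.recal --tranches-file snps.tranches \
    -mode SNP --truth-sensitivity-filter-level 99.7 \
    -O joint.snps.recalibrated.vcf.gz
```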

Output of the workflow

This workflow adds new allele-specific info-level annotations to the VCFs and produces a final output with allele-specific filters based on the VQSR SNP and INDEL tranches.


Additional details

Allele-specific annotations

The AS_Standard annotation set will produce allele-specific versions of our standard annotations. For AS_MQ, this means that the root-mean-squared mapping quality will be given for all of the reads that support each allele, respectively. For rank sum and strand bias tests, the annotation for each allele will compare that alternative allele’s values against the reference allele.

Recalibration files from allele-specific VariantRecalibrator

Each allele will be described in a separate line in the output recalibration (.recal) files. For the advanced analyst, this is a good way to check which allele has the worst data and is responsible for a NEGATIVE_TRAIN_SITE classification.

Allele-specific filters

After both ApplyRecalibration modes are run, the INFO field will contain an annotation called AS_FilterStatus, which will list the filter corresponding to each alternate allele. Allele-specific culprit and VQSLOD scores will also be added to the final VCF in the AS_culprit and AS_VQSLOD annotations, respectively.

Sample output

3 195507036 . C G,CCT 6672.42 VQSRTrancheINDEL99.80to99.90 AC=7,2;AF=0.106,0.030;AN=66;AS_BaseQRankSum=-0.144,1.554;AS_FS=127.421,52.461;AS_FilterStatus=VQSRTrancheSNP99.90to100.00,VQSRTrancheINDEL99.80to99.90;AS_MQ=29.70,28.99;AS_MQRankSum=1.094,0.045;AS_ReadPosRankSum=1.120,-7.743;AS_SOR=9.981,7.523;AS_VQSLOD=-48.3935,-7.8306;AS_culprit=AS_FS,AS_FS;BaseQRankSum=0.028;DP=2137;ExcessHet=1.6952;FS=145.982;GQ_MEAN=200.21;GQ_STDDEV=247.32;InbreedingCoeff=0.0744;MLEAC=7,2;MLEAF=0.106,0.030;MQ=29.93;MQRankSum=0.860;NCC=9;NEGATIVE_TRAIN_SITE;QD=10.94;ReadPosRankSum=-7.820e-01;SOR=10.484

3 153842181 . CT TT,CTTTT,CTTTTTTTTTT,C 4392.82 PASS AC=15,1,1,1;AF=0.192,0.013,0.013,0.013;AN=78;AS_BaseQRankSum=-11.667,-3.884,-2.223,0.972;AS_FS=204.035,22.282,16.930,2.406;AS_FilterStatus=VQSRTrancheSNP99.90to100.00,VQSRTrancheINDEL99.50to99.70,VQSRTrancheINDEL99.70to99.80,PASS;AS_MQ=58.44,59.93,54.79,59.72;AS_MQRankSum=2.753,0.123,0.157,0.744;AS_ReadPosRankSum=-9.318,-5.429,-5.578,1.336;AS_SOR=6.924,3.473,5.131,1.399;AS_VQSLOD=-79.9547,-2.0208,-3.4051,0.7975;AS_culprit=AS_FS,AS_ReadPosRankSum,AS_ReadPosRankSum,QD;BaseQRankSum=-2.828e+00;DP=1725;ExcessHet=26.1737;FS=168.440;GQ_MEAN=117.51;GQ_STDDEV=141.53;InbreedingCoeff=-0.1776;MLEAC=16,1,1,1;MLEAF=0.205,0.013,0.013,0.013;MQ=54.35;MQRankSum=0.967;NCC=3;NEGATIVE_TRAIN_SITE;QD=4.42;ReadPosRankSum=-2.515e+00;SOR=4.740


Caveats

Spanning deletions

Since GATK3.4, GenotypeGVCFs has had the ability to output a “spanning deletion allele” (now represented with *) to indicate that a position in the VCF is contained within an upstream deletion and may have “missing data” in samples that contain that deletion. While the upstream deletions will continue to be recalibrated and filtered by VQSR similar to the way they always have been, these spanning deletion alleles that occur downstream (and represent the same event) will be skipped.

GVCF size increase

Using the default GVCF bands, the raw allele-specific data causes a minimal size increase, which amounted to less than a 1% increase on the NA12878 exome used for development.

Typical usage errors

Problem: WARN 08:35:26,273 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 0.00|15005.00|14400.00|0.00 doesn't parse and will not be annotated in the final VC.

Solution: Remember to add -G Standard -G AS_Standard to the GenotypeGVCFs command

Problem: Standard (non-allele-specific) annotations are missing

Solution: HaplotypeCaller and GenotypeGVCFs need -G Standard specified if -G AS_Standard is also specified.


How can I get a common variant of three samples from multi-sample VCF after joint genotyping?


Hi. I’m studying sequencing data analysis following the GATK Best Practices for germline SNP and indel discovery. At the end of the pipeline I get a multi-sample VCF from joint genotyping. From this file I want to extract the variants common to three particular samples. I ran SelectVariants with the --sample_name argument and got a VCF restricted to those three samples, but it still contains every site from the multi-sample VCF, not just the sites where those three samples carry a variant. Could you suggest a method to get the common variants of three samples from a multi-sample VCF? Thank you.

How can I get a common variant of several samples from multi-sample VCF after joint genotyping ?


Here's a previous similar thread, but I can't find the resolution.
I'm studying sequencing data analysis following the GATK Best Practices for germline SNP and indel discovery. At the end of the pipeline I get a multi-sample VCF from joint genotyping. I'm looking for sites where three particular samples are all variant; how can I use SelectVariants to get these common variants?
I tried this command, but it produced no output:
gatk SelectVariants -V P_14-18.HC.vcf -select 'set == "Intersection";' -O P_14-18.HC.select.vcf

Maybe I should call each sample separately, get one VCF per sample, and then intersect them by some other method; however, joint calling is what this thread recommends.
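A few editorial notes (not an official answer). The `set` annotation is populated by CombineVariants, not by joint genotyping, so selecting on it here matches nothing. With GATK4 a commonly suggested first step is subsetting plus dropping sites that are non-variant in the subset, e.g. `gatk SelectVariants -V P_14-18.HC.vcf -sn S1 -sn S2 -sn S3 --exclude-non-variants -O subset.vcf` (sample names are placeholders); note this keeps sites where *any* of the selected samples is variant, so requiring all three still needs a per-genotype filter. That final test can be sketched with plain awk on a toy, whitespace-delimited VCF (real VCFs are tab-delimited; sample columns start at field 10):

```shell
# Toy 3-sample VCF body with FORMAT=GT only; columns are space-separated
# here for readability, whereas real VCFs use tabs.
cat > toy.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3
1 100 . A G 50 PASS . GT 0/1 1/1 0/1
1 200 . C T 50 PASS . GT 0/0 0/1 1/1
1 300 . G A 50 PASS . GT 0/1 0/1 1/1
EOF

# Keep only sites where every sample carries at least one ALT allele.
awk '/^#/ {next}
     { keep=1
       for (i=10; i<=NF; i++) if ($i=="0/0" || $i=="./.") keep=0
       if (keep) print $1, $2 }' toy.vcf
```

On this toy input the filter keeps positions 100 and 300 and drops 200, where S1 is homozygous reference.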

GenotypeGVCFs 3.7 gives different results depending on sharding


Broad's BP joint calling workflow shards GenotypeGVCFs by genomic locations. I was tasked to port the BP workflow from GATK4 to GATK3.7, and reference genome 37.

https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/b6f86ebfeeb8990b2588b72f3ad869e93e218e5f/joint-discovery-gatk4-local.wdl#L115

I was not able to find an equivalent of the hg38.even.handcurated.20k.intervals file for hg37, so tried to make one myself. (I believe people have asked before on this forum for hg38 versions of these intervals, or an explanation of how they were created).

I found that depending where the intervals lie, different calls are emitted by GenotypeGVCFs. The first difference I have seen is that ALT '*' alleles are not called if the upstream deletion is outside the interval.

Called with these shards:

21:10705896-10707251
21:10707252-10707407
21:10707408-10722075

21 10707250 . CTTTGTGA C 140.17 . AC=1;AF=0.0005097;AN=1962;BaseQRankSum=2.25;ClippingRankSum=0;DP=1
21 10707253 rs71227073 T A,* 15301.9 . AC=96,1;AF=0.049,0.0005097;AN=1962;BaseQRankSum=1.62;Clipping...


VS


21 10707250 . CTTTGTGA C 140.17 . AC=1;AF=0.0005097;AN=1962;BaseQRankSum=2.25;ClippingRankSum=0;DP=1
21 10707253 rs71227073 T A 15301.9 . AC=96;AF=0.049;AN=1962;BaseQRankSum=1.62;ClippingRankSum=0;DB;D

Is this expected behaviour? I.e., are calls near the shard boundaries incorrect?

Is there documentation about the method used to create hg38.even.handcurated.20k.intervals ?

Is there a way to workaround this issue?
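[Editorial note, not an official answer: one mitigation sometimes used for boundary artifacts is to genotype each shard with some padding and then trim the output back to the unpadded interval before concatenating, so that upstream events just outside the shard remain visible to the caller. A sketch in GATK 3.7 syntax; PAD size, reference name, and coordinates are illustrative:]

```shell
# Genotype the middle shard (21:10707252-10707407) with 1 kb of context
# on each side so the upstream deletion at 10707250 is in view.
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R human_g1k_v37.fasta \
    -V combined.g.vcf.gz \
    -L 21:10706252-10708407 \
    -o shard2.padded.vcf

# Keep only records whose POS lies inside the original shard; keep one
# header copy when concatenating the trimmed shards.
awk -F'\t' '/^#/ || ($1=="21" && $2>=10707252 && $2<=10707407)' \
    shard2.padded.vcf > shard2.vcf
```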

Thanks

Two validated variants missed by HaplotypeCaller using MIP data (amplicon like data)

$
0
0

Dear GATK,

We are using MIPs (amplicon like) data to analyze the variants for certain genes. However, in two independent samples two validated variants were missed by the HaplotypeCaller. We were wondering if you have any idea why these variants were not called?

I've used the latest version of GATK (3.6) and the two commands we performed are:
--filter_mismatching_base_and_quals -R hs_ref_GRCh37.p5_all_contigs.fa -I sample1.sorted.bam -T HaplotypeCaller --emitRefConfidence GVCF -L targets.bed --dbsnp dbsnp_137.hg19.vcf -rf BadCigar -stand_call_conf 30.0 -stand_emit_conf 30.0 -nct 1 -o sample1_haplotypecaller.g.vcf
--filter_mismatching_base_and_quals -R hs_ref_GRCh37.p5_all_contigs.fa -I sample2.sorted.bam -T HaplotypeCaller --emitRefConfidence GVCF -L targets.bed --dbsnp dbsnp_137.hg19.vcf -rf BadCigar -stand_call_conf 30.0 -stand_emit_conf 30.0 -nct 1 -o sample2_haplotypecaller.g.vcf

Attached you will find two screenshots of the BAM files. The mapping quality of the variant-supporting reads (~60) is similar to that of the reference-supporting reads, as is the base Phred quality (~36). I've also tried many other settings/arguments, for example lowering the minimum Phred-scaled confidence thresholds at which variants should be called and emitted. Nothing made the variants appear. However, if I use a smaller target region, I am able to call the variant located on chr8.

The output of the GVCF gave:
chr14 31355353 . C . . END=31355353 GT:DP:GQ:MIN_DP:PL 0/0:987:0:987:0,0,11170
and
chr8 117861187 . G . . END=117861187 GT:DP:GQ:MIN_DP:PL 0/0:1253:0:1253:0,0,20903
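[Editorial note, not a fix: when HaplotypeCaller drops a variant that is visible in the aligned reads, a common debugging step is to re-run over a small window with a bamout, so the locally reassembled reads can be inspected in IGV, optionally forcing the region to be treated as active. A sketch with GATK 3.x flags; the window size is illustrative. With very deep amplicon data it is also worth checking whether per-region downsampling is discarding supporting reads.]

```shell
# Re-run HaplotypeCaller on a small window around the missed chr14 site,
# writing the realigned/reassembled reads to a debug BAM for inspection.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R hs_ref_GRCh37.p5_all_contigs.fa \
    -I sample1.sorted.bam \
    -L chr14:31355103-31355603 \
    -bamout sample1.debug.bam \
    -forceActive -disableOptimizations \
    -o sample1.debug.vcf
```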

Thank you very much in advance!
Kind regards,

Maartje

Picard GenotypeConcordance was over in the middle of chr2.

Hi,

I am using Picard GenotypeConcordance.
I tried to compare two VCF files. I got a result, but the run stopped partway through.
The command was: java -jar picard.jar GenotypeConcordance TRUTH_VCF=PG2572_01_b.vcf TRUTH_SAMPLE=PG2572_01_b CALL_VCF=PG3149_01_a.vcf CALL_SAMPLE=PG3149_01_a O=PG3149_01_a_PG257201_b OUTPUT_VCF=true

And the command line was,

INFO 2019-04-11 18:41:44 GenotypeConcordance Starting iteration over variants.
INFO 2019-04-11 18:41:45 GenotypeConcordance checked 10,000 variants. Elapsed time: 00:00:01s. Time for last 10,000: 0s. Last read position: chr1:4,736,774
~~~~~~~(omission)~~~~~~~~~
INFO 2019-04-11 18:42:02 GenotypeConcordance checked 680,000 variants. Elapsed time: 00:00:18s. Time for last 10,000: 0s. Last read position: chr2:165,005,044

The run ended here, in the middle of chr2.
Both VCF files contain data for the whole genome.

What happened and how can I resolve?

I do appreciate your help.

Best Regards.