GenomicsDBImport terminates after Overlapping contigs found error

October 15, 2018, 8:11 am

My original query was about batching and making intervals for GenomicsDBImport, but I have run into a new problem. I am using version 4.0.7.0. I tried the following:

gatk GenomicsDBImport \
--java-options "-Xmx250G -XX:+UseParallelGC -XX:ParallelGCThreads=24" \
-V input.list \
--genomicsdb-workspace-path 5sp_45ind_assmb_00 \
--intervals interval.00.list \
--batch-size 9

Here I have split my list of contigs into 50 interval lists and set the batch size to 9 (instead of reading in all 45 g.vcf files at once), for a total of 5 batches. The run appeared to start, but it terminated shortly afterwards with an error.
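
A minimal sketch of the setup described above, assuming GNU coreutils `split` and a hypothetical one-contig-per-line input file. The helper names `split_contigs` and `batch_count` are illustrative. The first writes interval lists named `interval.00.list`, `interval.01.list`, and so on; the second computes how many batches a given batch size yields.

```shell
# Split a one-contig-per-line file into N interval lists without breaking lines.
# Assumes GNU split (for -n l/N and --additional-suffix); file names are illustrative.
split_contigs() {
  contig_file=$1   # hypothetical input, e.g. contigs.txt
  n_lists=$2       # 50 in the post above
  split -n "l/$n_lists" -d --additional-suffix=.list "$contig_file" interval.
}

# Number of batches needed for a sample count and batch size: ceil(samples / batch_size).
batch_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

batch_count 45 9   # prints 5
```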

The resulting stack trace is:

00:53:23.869 INFO  GenomicsDBImport - HTSJDK Version: 2.16.0
00:53:23.869 INFO  GenomicsDBImport - Picard Version: 2.18.7
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:53:23.869 INFO  GenomicsDBImport - Deflater: IntelDeflater
00:53:23.869 INFO  GenomicsDBImport - Inflater: IntelInflater
00:53:23.869 INFO  GenomicsDBImport - GCS max retries/reopens: 20
00:53:23.869 INFO  GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
00:53:23.869 INFO  GenomicsDBImport - Initializing engine
01:26:13.490 INFO  IntervalArgumentCollection - Processing 58057410 bp from intervals
01:26:13.517 INFO  GenomicsDBImport - Done initializing engine
Created workspace /home/leq/gvcfs/5sp_45ind_assmb_00
01:26:13.655 INFO  GenomicsDBImport - Vid Map JSON file will be written to 5sp_45ind_assmb_00/vidmap.json
01:26:13.655 INFO  GenomicsDBImport - Callset Map JSON file will be written to 5sp_45ind_assmb_00/callset.json
01:26:13.655 INFO  GenomicsDBImport - Complete VCF Header will be written to 5sp_45ind_assmb_00/vcfheader.vcf
01:26:13.655 INFO  GenomicsDBImport - Importing to array - 5sp_45ind_assmb_00/genomicsdb_array
01:26:13.656 INFO  ProgressMeter - Starting traversal
01:26:13.656 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
01:33:16.970 INFO  GenomicsDBImport - Importing batch 1 with 9 samples
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
Contig/chromosome ctg7180018354961 begins at TileDB column 0 and intersects with contig/chromosome ctg7180018354960 that spans columns [1380207667, 1380207970]
terminate called after throwing an instance of 'ProtoBufBasedVidMapperException'
what(): ProtoBufBasedVidMapperException : Overlapping contigs found

How do I overcome this issue of 'overlapping contigs found'? Is there a problem with my set of contigs? Also, is the warning about protocol messages something to worry about?

Thank you!

↧

wants to look for problems or diseases in my Helix Exome+

August 20, 2018, 9:13 pm

I bought my Exome+ from Helix after participating in the 'Healthy Nevada' study. It is a gVCF file which I have unpacked.

I have been using Linux for years, so I managed to download GATK and successfully ran the gatk --help and gatk --list commands.

I want to find variations which may affect my health, such as in APOE (the so-called 'Alzheimer's gene').
I browsed the documentation on your site, but most of it seemed to deal with processing multiple genomes.

My first 2 questions are:
Is there a way to view the exome without doing anything to it?
How may I query the gVCF, searching for gene-based health issues?
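
As a hedged sketch of one way to look at the file without running any GATK tool: an unpacked gVCF is tab-delimited text, so standard shell tools can page through it or pull out a region. The function name `vcf_region` and the file name in the usage comments are hypothetical, and the coordinates of a gene such as APOE have to be looked up for the reference build the file was called against. Records in a gVCF can be reference blocks that span many positions, so the filter also reads the `END` tag in the INFO column.

```shell
# Print records on CHROM that overlap the closed interval [START, END].
# Usage: vcf_region CHROM START END < file.g.vcf
vcf_region() {
  awk -F'\t' -v c="$1" -v s="$2" -v e="$3" '
    /^#/ { next }                      # skip header lines
    $1 == c {
      end = $2                         # default: the record covers only its POS
      if (match($8, /(^|;)END=[0-9]+/)) {
        tag = substr($8, RSTART, RLENGTH)
        sub(/^.*END=/, "", tag)        # keep only the number after END=
        end = tag + 0
      }
      if ($2 + 0 <= e + 0 && end >= s + 0) print
    }'
}
# Page through the raw file without modifying it (hypothetical file name):
#   less -S my_exome.g.vcf
# Pull out one region, with coordinates looked up for the right reference build:
#   vcf_region chr19 START END < my_exome.g.vcf
```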

Thanks for your help, and please forgive me if I missed the 'newbie' section explaining how to do this.
Regards,
Ralph

↧

Mutect2 pipeline fails for some inputs

October 25, 2018, 2:03 am

I'm running a WGS analysis and parallelizing the run by chromosome. A few chromosomes fail during somatic variant calling with the following error. It is really difficult to pinpoint the problem, because most of the chromosomes are processed correctly to the end (20 out of 24). I'm guessing there is some integer vs. floating-point conversion error. For now, I would really appreciate it if you could tell me how to get rid of this issue!

I think this is really something you should fix.

I'm running Mutect2 in a docker container: GATK jar /gatk/gatk-package-4.0.11.0-local.jar

...
15:39:59.895 INFO  ProgressMeter -        chr2:89909854            218.8                770780           3523.1
15:40:09.967 INFO  ProgressMeter -        chr2:90031053            218.9                771760           3524.9
15:40:20.003 INFO  ProgressMeter -        chr2:90125567            219.1                772540           3525.7
15:40:30.146 INFO  ProgressMeter -        chr2:90285632            219.3                773610           3527.9
15:40:43.457 INFO  ProgressMeter -        chr2:90296305            219.5                773680           3524.7
15:40:53.596 INFO  ProgressMeter -        chr2:90357855            219.7                774100           3523.9
15:40:59.892 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 11.926295505
15:40:59.892 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 1194.2077419680002
15:40:59.892 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 2637.29 sec
INFO    2018-10-24 15:41:01     SortingCollection       Creating merging iterator from 8 files
15:41:13.039 INFO  Mutect2 - Shutting down engine
[October 24, 2018 3:41:13 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 220.03 minutes.
Runtime.totalMemory()=12962496512
java.lang.NumberFormatException: For input string: "."
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.lang.Double.valueOf(Double.java:502)
        at htsjdk.variant.variantcontext.CommonInfo.lambda$getAttributeAsDoubleList$2(CommonInfo.java:299)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Collections$2.tryAdvance(Collections.java:4717)
        at java.util.Collections$2.forEachRemaining(Collections.java:4725)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsList(CommonInfo.java:273)
        at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsDoubleList(CommonInfo.java:293)
        at htsjdk.variant.variantcontext.VariantContext.getAttributeAsDoubleList(VariantContext.java:740)
        at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getGermlineAltAlleleFrequencies(GermlineProbabilityCalculator.java:49)
        at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getPopulationAFAnnotation(GermlineProbabilityCalculator.java:27)
        at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:155)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:221)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:230)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:291)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/gatk-package-4.0.11.0-local.jar
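
The trace above passes through `GermlineProbabilityCalculator.getGermlineAltAlleleFrequencies` and fails to parse `"."` as a number. One plausible reading, offered here as an assumption rather than a confirmed diagnosis, is that some records in the VCF supplying population allele frequencies carry a missing value (`AF=.`). A minimal sketch for listing such records from VCF text; the function name and the file name in the usage comment are hypothetical.

```shell
# Print CHROM, POS and the AF entry for records whose INFO column has a missing AF value.
# Reads VCF text on stdin; for a compressed file, pipe it through `gzip -dc` first.
missing_af() {
  awk -F'\t' '
    /^#/ { next }
    {
      n = split($8, kv, ";")
      for (i = 1; i <= n; i++) {
        # match AF=. as well as a "." element inside a comma-separated list
        if (kv[i] ~ /^AF=/ && kv[i] ~ /(=|,)\.(,|$)/) {
          print $1 "\t" $2 "\t" kv[i]
          break
        }
      }
    }'
}
# Usage (hypothetical file name):
#   gzip -dc germline_resource.vcf.gz | missing_af | head
```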

↧

GATK v4.0.8.1 GenomicsDBImport Error (VariantStorageManagerException exception)

October 1, 2018, 11:21 pm

Hi, I am following the current best practices to prepare a consolidated GVCF from 5 WGS samples for joint calling. I ran the following command and encountered an error:
java -Djava.io.tmpdir=/work/TMP \
-Xmx40g -jar ~/bin/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar \
GenomicsDBImport \
-V /work/Analysis/III_3P_RG_DupMark.raw.snps.indels.g.vcf \
-V /work/Analysis/IV_11N_RG_DupMark.raw.snps.indels.g.vcf \
-V /work/Analysis/IV_8N_RG_DupMark.raw.snps.indels.g.vcf \
-V /work/Analysis/IV_10P_RG_DupMark.raw.snps.indels.g.vcf \
-V /work/Analysis/IV_20P_RG_DupMark.raw.snps.indels.g.vcf \
--genomicsdb-workspace-path /work/Analysis/wang_chr19_re \
--intervals chr19

Error Log

15:00:35.770 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/wang/bin/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
15:00:35.944 INFO GenomicsDBImport - ------------------------------------------------------------
15:00:35.944 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.8.1
15:00:35.945 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
15:00:35.945 INFO GenomicsDBImport - Executing as wang@Ubuntu1604 on Linux v3.16.0-43-generic amd64
15:00:35.945 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-8u171-b11-2~14.04-b11
15:00:35.945 INFO GenomicsDBImport - Start Date/Time: October 2, 2018 3:00:35 PM JST
15:00:35.945 INFO GenomicsDBImport - ------------------------------------------------------------
15:00:35.945 INFO GenomicsDBImport - ------------------------------------------------------------
15:00:35.946 INFO GenomicsDBImport - HTSJDK Version: 2.16.0
15:00:35.946 INFO GenomicsDBImport - Picard Version: 2.18.7
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:00:35.946 INFO GenomicsDBImport - Deflater: IntelDeflater
15:00:35.946 INFO GenomicsDBImport - Inflater: IntelInflater
15:00:35.946 INFO GenomicsDBImport - GCS max retries/reopens: 20
15:00:35.946 INFO GenomicsDBImport - Using google-cloud-java fork https://github.com/broadinstitute/google-cloud-java/releases/tag/0.20.5-alpha-GCS-RETRY-FIX
15:00:35.946 INFO GenomicsDBImport - Initializing engine
15:00:38.360 INFO IntervalArgumentCollection - Processing 58617616 bp from intervals
15:00:38.366 INFO GenomicsDBImport - Done initializing engine
Created workspace /work/Analysis/wgs_chr19
15:00:38.849 INFO GenomicsDBImport - Vid Map JSON file will be written to /work/Analysis/wgs_chr19/vidmap.json
15:00:38.849 INFO GenomicsDBImport - Callset Map JSON file will be written to /work/Analysis/wgs_chr19/callset.json
15:00:38.849 INFO GenomicsDBImport - Complete VCF Header will be written to /work/Analysis/wgs_chr19/vcfheader.vcf
15:00:38.850 INFO GenomicsDBImport - Importing to array - /work/Analysis/wgs_chr19/genomicsdb_array
15:00:38.850 INFO ProgressMeter - Starting traversal
15:00:38.850 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
15:00:39.771 INFO GenomicsDBImport - Importing batch 1 with 5 samples
Buffer resized from 28469bytes to 32688
Buffer resized from 28473bytes to 32630
Buffer resized from 28469bytes to 32745
Buffer resized from 28469bytes to 32717
Buffer resized from 28466bytes to 32648
Buffer resized from 32688bytes to 32758
Buffer resized from 32630bytes to 32726
Buffer resized from 32648bytes to 32703
Buffer resized from 32717bytes to 32751
Buffer resized from 32703bytes to 32765
Buffer resized from 32745bytes to 32768
Buffer resized from 32726bytes to 32763
Buffer resized from 32765bytes to 32767
Buffer resized from 32758bytes to 32765
Buffer resized from 32751bytes to 32762
Buffer resized from 32767bytes to 32769
Buffer resized from 32763bytes to 32768
Buffer resized from 32762bytes to 32768
Buffer resized from 32765bytes to 32767
Buffer resized from 32767bytes to 32769
Buffer resized from 32768bytes to 32769
Buffer resized from 32768bytes to 32769
Buffer resized from 32768bytes to 32769
terminate called after throwing an instance of 'VariantStorageManagerException'
what(): VariantStorageManagerException exception : Error while syncing array chr19$1$58617616 to disk
TileDB error message : [TileDB::utils] Error: Cannot sync file '/work/Analysis/wgs_chr19/chr19$1$58617616/.__a89fdd44-1241-43ba-9072-6fcf116fbc1d139627949156096_1538460040234'; File syncing error

Things I have checked:
I have confirmed that there is enough disk space and that the working directory is on a shared volume.
I would appreciate any help with troubleshooting.

Thanks

↧

Mutect 2 "Cannot read non-existent {bam} file"

November 6, 2018, 9:01 am

Hello,

I built several pipelines using Singularity images and Nextflow with GATK tools.

I got a very silly error with Mutect2.

Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.0.9.0-0/gatk-package-4.0.9.0-local.jar Mutect2 -R /bettik/tintest/PROJECTS/Test_nextflow_OAR/REF/hg38/hg38.fasta -I S668_D4B_C000F4B_MD_BSQR2.bam -I S668_D4C_C000F4C_MD_BSQR2.bam -tumor S668_D4B -normal S668_D4C -L 1 -pon SPARK_GATK_pon.vcf.gz --germline-resource SPARK_GATK_gnomad_hg38.vcf.gz --af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter -O patient1_S668_D4B_S668_D4C_1.vcf.gz -bamout patient1_S668_D4B_S668_D4C_1.bam
  15:11:26.144 WARN  GATKReadFilterPluginDescriptor - Disabled filter (MateOnSameContigOrNoMappedMateReadFilter) is not enabled by this tool
  15:11:26.235 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.0.9.0-0/gatk-package-4.0.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
  15:11:28.747 INFO  Mutect2 - ------------------------------------------------------------
  15:11:28.747 INFO  Mutect2 - The Genome Analysis Toolkit (GATK) v4.0.9.0
  15:11:28.747 INFO  Mutect2 - For support and documentation go to https://software.broadinstitute.org/gatk/
  15:11:28.747 INFO  Mutect2 - Executing as tintest@dahu44 on Linux v4.9.0-8-amd64 amd64
  15:11:28.748 INFO  Mutect2 - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
  15:11:28.748 INFO  Mutect2 - Start Date/Time: November 6, 2018 3:11:26 PM UTC
  15:11:28.748 INFO  Mutect2 - ------------------------------------------------------------
  15:11:28.748 INFO  Mutect2 - ------------------------------------------------------------
  15:11:28.748 INFO  Mutect2 - HTSJDK Version: 2.16.1
  15:11:28.748 INFO  Mutect2 - Picard Version: 2.18.13
  15:11:28.749 INFO  Mutect2 - HTSJDK Defaults.COMPRESSION_LEVEL : 2
  15:11:28.749 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
  15:11:28.749 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
  15:11:28.749 INFO  Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
  15:11:28.749 INFO  Mutect2 - Deflater: IntelDeflater
  15:11:28.749 INFO  Mutect2 - Inflater: IntelInflater
  15:11:28.749 INFO  Mutect2 - GCS max retries/reopens: 20
  15:11:28.749 INFO  Mutect2 - Requester pays: disabled
  15:11:28.749 INFO  Mutect2 - Initializing engine
  15:11:28.835 INFO  Mutect2 - Shutting down engine
  [November 6, 2018 3:11:28 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 0.04 minutes.
  Runtime.totalMemory()=1962934272
  ***********************************************************************

  A USER ERROR has occurred: Couldn't read file. Error was: S668_D4B_C000F4B_MD_BSQR2.bam with exception: Cannot read non-existent file: file://S668_D4B_C000F4B_MD_BSQR2.bam

  ***********************************************************************
  Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

However, a symlink to my BAM file is in my workdir:

ll /bettik/tintest/PROJECTS/Test_nextflow_OAR/work/5d/a31934453d8db9ebfffcc2809a8da4
total 16K
drwxr-xr-x 2 tintest l-iab   14 Nov  6 16:11 .
drwxr-xr-x 3 tintest l-iab    1 Nov  6 16:11 ..
-rw-r--r-- 1 tintest l-iab    0 Nov  6 16:11 .command.begin
-rw-r--r-- 1 tintest l-iab 3.2K Nov  6 16:11 .command.err
-rw-r--r-- 1 tintest l-iab    0 Nov  6 16:11 .command.out
-rwx------ 1 tintest l-iab 3.6K Nov  6 16:11 .command.run
-rw-r--r-- 1 tintest l-iab  485 Nov  6 16:11 .command.sh
-rw-r--r-- 1 tintest l-iab    1 Nov  6 16:11 .exitcode
-rw-r--r-- 1 tintest l-iab 3.2K Nov  6 16:11 OAR.nf-mutect2_1.8458699.stderr
-rw-r--r-- 1 tintest l-iab    0 Nov  6 16:11 OAR.nf-mutect2_1.8458699.stdout
lrwxrwxrwx 1 tintest l-iab   60 Nov  6 16:11 S668_D4B_C000F4B_MD_BSQR2.bam -> /bettik/tintest/SPARK/illumina/S668_D4B_C000F4B_MD_BSQR2.bam
lrwxrwxrwx 1 tintest l-iab   60 Nov  6 16:11 S668_D4C_C000F4C_MD_BSQR2.bam -> /bettik/tintest/SPARK/illumina/S668_D4C_C000F4C_MD_BSQR2.bam
lrwxrwxrwx 1 tintest l-iab   60 Nov  6 16:11 SPARK_GATK_gnomad_hg38.vcf.gz -> /bettik/tintest/SPARK/illumina/SPARK_GATK_gnomad_hg38.vcf.gz
lrwxrwxrwx 1 tintest l-iab   64 Nov  6 16:11 SPARK_GATK_gnomad_hg38.vcf.gz.tbi -> /bettik/tintest/SPARK/illumina/SPARK_GATK_gnomad_hg38.vcf.gz.tbi
lrwxrwxrwx 1 tintest l-iab   52 Nov  6 16:11 SPARK_GATK_pon.vcf.gz -> /bettik/tintest/SPARK/illumina/SPARK_GATK_pon.vcf.gz
lrwxrwxrwx 1 tintest l-iab   56 Nov  6 16:11 SPARK_GATK_pon.vcf.gz.tbi -> /bettik/tintest/SPARK/illumina/SPARK_GATK_pon.vcf.gz.tbi

I know the cluster I'm using has separate file systems for front nodes and archive nodes. All my data are on the archive nodes. This didn't cause any problems for a "standard germline pipeline" using tools like MarkDuplicates, BaseRecalibrator, HaplotypeCaller ...
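
Since the inputs are symlinks, one thing worth ruling out (an assumption, not a confirmed cause) is that the link targets are not visible from where Mutect2 actually runs, for example inside the Singularity container if the target directory is not mounted there. A minimal sketch that lists symlinks in a directory whose targets cannot be reached; the function name `dangling_links` is hypothetical.

```shell
# List symlinks directly inside DIR (default: current directory) whose targets
# do not exist from the point of view of the current environment.
dangling_links() {
  dir=${1:-.}
  for f in "$dir"/*; do
    # -L: the entry is a symlink; ! -e: following it does not reach an existing file
    if [ -L "$f" ] && [ ! -e "$f" ]; then
      echo "$f"
    fi
  done
}
# Usage: run it in the task work directory, once on the host and once inside the container.
#   dangling_links .
```

If the same directory reports no dangling links on the host but lists the BAM links inside the container, the problem is likely in how the container sees the file system rather than in Mutect2.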

Do you have any solution?

Thank you.

↧

GenomicsDBImport--do multiple samples need individual databases?

September 27, 2018, 9:04 am

I am trying to do variant calling on a reference transcriptome that I've produced, but I have some questions about the functionality of GenomicsDBImport and, downstream, of SelectVariants. I know that you can only look at one genomic interval per run, but does each interval need its own database?

I've begun running this command from bash, where $contigs is the list of contigs to go through, $path is the absolute path, and files.txt lists all my samples.

for i in $contigs
do
gatk GenomicsDBImport \
$(cat files.txt) \
--genomicsdb-workspace-path $path/my_database \
--intervals $i
done

I get this error after running it:

A USER ERROR has occurred: The workspace you're trying to create already exists. ( /gatk/my_data/my_database ) Writing into an existing workspace can cause data corruption. Please choose an output path that doesn't already exist.

Does this mean that I need to create an individual database for each contig? And how does this affect the downstream SelectVariants command?
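
The error says the workspace path must not already exist, while the loop above reuses `$path/my_database` for every contig. A minimal sketch that gives each contig its own workspace directory instead; the naming scheme and the helper names `workspace_for` and `import_all` are illustrative choices, not something GATK requires.

```shell
# Build a distinct workspace path for one contig under a base directory.
workspace_for() {
  echo "$1/my_database_$2"
}

# Run one import per contig, each into its own workspace.
# $1: base path; remaining arguments: contig names.
# files.txt is assumed, as in the post, to hold the sample arguments.
import_all() {
  base=$1
  shift
  for contig in "$@"; do
    gatk GenomicsDBImport \
      $(cat files.txt) \
      --genomicsdb-workspace-path "$(workspace_for "$base" "$contig")" \
      --intervals "$contig"
  done
}
# Usage: import_all "$path" $contigs
```

With this layout, any downstream step that reads from GenomicsDB would be pointed at the workspace that matches the contig it is processing.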

↧

VariantRecalibrator fails after traversal

October 8, 2018, 8:59 am

Hi,

VariantRecalibrator fails during indel recalibration at the point where it should finish. I use GATK 4.0.4.0, and this is the first run to fail at this point.
The output recal file has no entries after the header, and the tranches file is completely empty.

Error:

13:54:08.684 INFO  ProgressMeter -        chrY:56879627             24.2               5075692         210002.4
13:54:08.684 INFO  ProgressMeter - Traversal complete. Processed 5075692 total variants in 24.2 minutes.
13:54:08.831 INFO  VariantDataManager - QD:      mean = 25.11    standard deviation = 6.61
13:54:08.925 INFO  VariantDataManager - FS:      mean = 0.38     standard deviation = 2.35
13:54:09.020 INFO  VariantDataManager - SOR:     mean = 0.99     standard deviation = 0.64
13:54:09.110 INFO  VariantDataManager - MQRankSum:       mean = -0.01    standard deviation = 0.33
13:54:09.257 INFO  VariantDataManager - ReadPosRankSum:          mean = 0.39     standard deviation = 1.03
13:54:10.024 INFO  VariantDataManager - Annotations are now ordered by their information content: [QD, SOR, ReadPosRankSum, FS, MQRankSum]
13:54:10.078 INFO  VariantDataManager - Training with 301439 variants after standard deviation thresholding.
13:54:10.117 INFO  GaussianMixtureModel - Initializing model with 100 k-means iterations...
13:54:22.733 INFO  VariantRecalibratorEngine - Finished iteration 0.
13:54:26.379 INFO  VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 1.01851
13:54:30.054 INFO  VariantRecalibratorEngine - Finished iteration 10.   Current change in mixture coefficients = 0.32170
13:54:33.267 INFO  VariantRecalibratorEngine - Convergence after 14 iterations!
13:54:33.742 INFO  VariantRecalibratorEngine - Evaluating full set of 696457 variants...
13:54:33.742 WARN  VariantRecalibratorEngine - Evaluate datum returned a NaN.
13:54:33.793 INFO  VariantDataManager - Selected worst 0 scoring variants --> variants with LOD <= -5.0000.
13:54:33.824 INFO  VariantRecalibrator - Shutting down engine
[8. Oktober 2018 13:54:33 MESZ] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 24.70 minutes.
Runtime.totalMemory()=7774666752
java.lang.IllegalArgumentException: No data found.
        at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:34)
        at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.onTraversalSuccess(VariantRecalibrator.java:629)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:894)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

Command:

/opt/gatk/4.0.4.0/gatk VariantRecalibrator \
    -R GRCh38_latest_genomic_final.fa \
    -V recalibrated_snps_raw_indels.vcf \
    -O recalibrate_INDEL.recal \
    --tranches-file recalibrate_INDEL.tranches \
    --rscript-file vrecalibrate_INDEL_plots.R \
    -an QD \
    -an FS \
    -an SOR \
    -an MQRankSum \
    -an ReadPosRankSum \
    --resource mills,known=false,training=true,truth=true,prior=12.0:Mills.hg38.vcf \
    --resource dbsnp,known=true,training=false,truth=false,prior=2.0:dbsnp.vcf \
    --mode INDEL \
    --truth-sensitivity-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    --max-gaussians 4

Best,
Daniel

↧

What is your pet name?

November 8, 2018, 4:36 am

I like dogs & puppies...

↧

gatk-4.0.11.0 multithreads tool HaplotypeCallerSpark error

November 7, 2018, 6:44 pm

I ran the following command on our cluster:

gatk --java-options "-Xmx16g" HaplotypeCallerSpark -ERC GVCF -R ../ref.fa -I AS39.realn.bam -O AS39.raw.g.vcf --spark-master local[6]

After around 30 hours, the job shut down and I got the following error:
16:13:36.991 INFO HaplotypeCallerSpark - Shutting down engine
[November 7, 2018 4:13:36 PM CST] org.broadinstitute.hellbender.tools.HaplotypeCallerSpark done. Elapsed time: 66.76 minutes.
Runtime.totalMemory()=8630304768
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 (TID 1306, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 560455 ms
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:266)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:128)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:925)
at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:915)
at org.broadinstitute.hellbender.engine.spark.datasources.VariantsSparkSink.sortVariants(VariantsSparkSink.java:243)
at org.broadinstitute.hellbender.engine.spark.datasources.VariantsSparkSink.writeVariantsSingle(VariantsSparkSink.java:224)
at org.broadinstitute.hellbender.engine.spark.datasources.VariantsSparkSink.writeVariants(VariantsSparkSink.java:198)
at org.broadinstitute.hellbender.engine.spark.datasources.VariantsSparkSink.writeVariants(VariantsSparkSink.java:180)
at org.broadinstitute.hellbender.tools.HaplotypeCallerSpark.callVariantsWithHaplotypeCallerAndWriteOutput(HaplotypeCallerSpark.java:210)
at org.broadinstitute.hellbender.tools.HaplotypeCallerSpark.runTool(HaplotypeCallerSpark.java:167)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:466)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
18/11/07 16:13:38 INFO ShutdownHookManager: Shutdown hook called
18/11/07 16:13:38 INFO ShutdownHookManager: Deleting directory /tmp/spark-08e80a6b-5c0e-4d17-96c4-6b1a3cdf7faa

↧

GenotypeGVCFs and VariantRecalibrator

November 8, 2018, 6:40 am

Hi GATK developers,

First, we are using GATK 4.0.3.0. I have two simple questions:
1. Are there any internal criteria (i.e., ones that cannot be set by the user) in GenotypeGVCFs that cause the program to drop calls? We found that sometimes, when two calls are next to each other, the program drops calls.
2. If we have only read depth information for our variants (WGS data), can we use VariantRecalibrator (assuming the format is correct)?

Thanks.

zwang

↧

Family analysis

November 7, 2018, 1:03 am

Hi Sheila,
Is it possible to identify whether a variant comes from the maternal or the paternal allele in a family analysis using the GATK pipeline?

eg:
Affected child
ref: AA Alt: AT GT: 0/1

Unaffected Mother
ref: AA Alt: AA GT:0/0

Unaffected father
ref: AA Alt: AT GT:0/1

↧

VariantsToTable : Badly formed genome location

November 7, 2018, 7:09 am

Hello,

I want to use VariantsToTable on a VCF file with structural variations.
In my VCF, I have a translocation chr1:861936 => chr14:295688
But I get this error for this line:
Badly formed genome location: Parameters to GenomeLocParser are incorrect:The stop position 295688 is less than start 861936 in contig chr1

Line in VCF:
chr1 861936 TRA006SUR N N[chr8:295687[ . PASS SVTYPE=BND;CHR2=chr8;END=295688
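
The error message reports a stop of 295688 on chr1, which matches the END value in INFO, so it looks as if END (a coordinate on the mate contig) is being read as the stop position on the record's own contig. As a minimal sketch under that assumption, records of this kind can be flagged before running the tool; the function name `end_before_pos` is hypothetical.

```python
def end_before_pos(vcf_line):
    """Return True if the record's INFO END is smaller than its POS.

    Assumption: a tool that builds an interval from POS and END on the same
    contig would reject such a record.
    """
    fields = vcf_line.split()  # handles tabs or spaces
    pos = int(fields[1])
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    end = info.get("END")
    return end is not None and int(end) < pos


record = "chr1 861936 TRA006SUR N N[chr8:295687[ . PASS SVTYPE=BND;CHR2=chr8;END=295688"
print(end_before_pos(record))
```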

Do you have a solution?

Thank you,

↧

GATK 4.0.11.0 Variant Recalibrator ERROR

November 6, 2018, 12:18 am

Could someone please help me run VariantRecalibrator with GATK 4.0.11.0?
I ran the tool using GATK 4.0.11.0 with the following command line:

time ~/gatk-4.0.11.0/gatk VariantRecalibrator \
-R ~/reference/hg19.fa -V ~/MT-1/outname.HC.vcf.gz \
--resource hapmap,known=false,training=true,truth=true,prior=15.0:~/reference/hg19/hapmap_3.3.hg19.sites.vcf \
--resource omni,known=false,training=true,truth=false,prior=12.0:~/reference/hg19/1000G_omni2.5.hg19.sites.vcf \
--resource 1000G,known=false,training=true,truth=false,prior=10.0:~/reference/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=6.0:~/reference/hg19/dbsnp_138.hg19.vcf \
--use-annotation DP --use-annotation QD --use-annotation FS --use-annotation SOR --use-annotation ReadPosRankSum --use-annotation MQRankSum \
--mode SNP \
--truth-sensitivity-tranche 100.0 --truth-sensitivity-tranche 99.9 --truth-sensitivity-tranche 99.0 --truth-sensitivity-tranche 95.0 --truth-sensitivity-tranche 90.0 \
--rscript-file ~/MT-1/outname.HC.snps.plots.R \
--tranches-file ~/MT-1/outname.HC.snps.tranches \
--output ~/MT-1/outname.HC.snps.recal

I got this error: A USER ERROR has occurred: Couldn't read file file:///home/chenjie1/~/reference/hg19/hapmap_3.3.hg19.sites.vcf. Error was: It doesn't exist
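
The path in the error, `/home/chenjie1/~/reference/...`, still contains a literal `~`, which suggests the shell did not expand the tilde because it sits in the middle of the `--resource` word (after the `:`) rather than at its start. A minimal sketch of that behaviour and one way to build the argument with an absolute path; `expand_resource_path` is a hypothetical helper, not part of GATK.

```python
import os


def expand_resource_path(spec, home=None):
    """Expand a leading '~' in the path part of a 'name,options:path' string."""
    home = home if home is not None else os.path.expanduser("~")
    head, sep, path = spec.rpartition(":")
    if path == "~" or path.startswith("~/"):
        path = home + path[1:]
    return head + sep + path


spec = "hapmap,known=false,training=true,truth=true,prior=15.0:~/reference/hg19/hapmap_3.3.hg19.sites.vcf"

# Like the shell, expanduser only acts on a tilde at the very start of the string,
# so the whole --resource value is left untouched.
print(os.path.expanduser(spec) == spec)

# Expanding only the path part gives an absolute path.
print(expand_resource_path(spec, home="/home/chenjie1"))
```

Writing `$HOME/reference/...` or the full absolute path in the `--resource` arguments would avoid relying on tilde expansion at all.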

The command syntax follows the same pattern as version 4.0.9.0.
Has the syntax been changed for GATK version 4.0.11.0?
Thanks.
Best regards.

↧

PhaseByTransmission errors

November 2, 2018, 12:54 pm

Hi,

I am interested in finding de novo mutations in human families. I have implemented the GATK best practices data preprocessing pipeline as well as the Germline short variant discovery workflow. I have done this using GATK4 and the hg38 reference genome. I plan on implementing the Genotype Refinement workflow for germline short variants (as this seems to be the GATK suggested way of identifying denovos) but I wanted to try out PhaseByTransmission before proceeding. To do this I needed to use GATK3 as PBT is not implemented in GATK4. At first I ran into a problem because my VCF contained variants from an entire family (5 individuals) and PBT can only run on a single trio at a time. So I wrote individual PED files for each child and ran PBT with the '--pedigreeValidationType SILENT' option. Here is what I ran for one of the children (I ran something very similar for the other 2):

java -jar GenomeAnalysisTK.jar \
-T PhaseByTransmission \
-R $ref_dir/Homo_sapiens_assembly38.fasta \
-V 2003_57_recalibrated_variants.vcf \
-ped 2003058.ped \
-o 2003058_pbt.vcf \
-mvf 2003058_mandelian_violations.vcf \
--pedigreeValidationType SILENT 2

and here are the contents of 2003058.ped:

2003_57 2003003 0 0 2 0
2003_57 2003057 0 0 1 0
2003_57 2003058 2003057 2003003 2 0
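
As a quick sanity check on a PED file like this one, the trio can be read back out of it. This is a minimal sketch assuming the standard six-column PED layout (family, individual, father, mother, sex with 1 = male and 2 = female, phenotype); `find_trios` is a hypothetical helper.

```python
def find_trios(ped_text):
    """List child/father/mother trios found in PED-formatted text."""
    rows = [line.split()[:6] for line in ped_text.strip().splitlines() if line.strip()]
    sex = {ind: s for _fam, ind, _father, _mother, s, _pheno in rows}
    trios = []
    for fam, ind, father, mother, _sex, _pheno in rows:
        if father == "0" or mother == "0":
            continue  # founder or incomplete parent information
        trios.append({
            "family": fam,
            "child": ind,
            "father": father,
            "mother": mother,
            # father should be coded 1 (male), mother 2 (female)
            "sexes_consistent": sex.get(father) == "1" and sex.get(mother) == "2",
        })
    return trios


ped = """
2003_57 2003003 0 0 2 0
2003_57 2003057 0 0 1 0
2003_57 2003058 2003057 2003003 2 0
"""
print(find_trios(ped))
```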

For each of these runs, PBT crashed after about 10 minutes. Here is the tail of the output:

INFO  15:05:18,853 ProgressMeter - chr13:108037823   8021595.0     9.0 m      67.0 s       67.9%    13.3 m       4.3 m 
INFO  15:05:48,854 ProgressMeter -  chr15:46251520   8472743.0     9.5 m      67.0 s       72.9%    13.0 m       3.5 m 
INFO  15:06:18,855 ProgressMeter -  chr16:71683502   8933283.0    10.0 m      67.0 s       76.8%    13.0 m       3.0 m 
INFO  15:06:48,857 ProgressMeter -  chr18:10072380   9399718.0    10.5 m      67.0 s       80.3%    13.1 m       2.6 m 
INFO  15:07:18,859 ProgressMeter -  chr19:45951638   9858096.0    11.0 m      66.0 s       83.9%    13.1 m       2.1 m 
INFO  15:07:48,860 ProgressMeter -  chr21:27088297   1.0311184E7    11.5 m      66.0 s       87.2%    13.2 m     101.0 s 
##### ERROR --
##### ERROR stack trace 
java.lang.ArrayIndexOutOfBoundsException: 2
    at htsjdk.variant.variantcontext.GenotypeLikelihoods.getAsMap(GenotypeLikelihoods.java:171)
    at org.broadinstitute.gatk.tools.walkers.phasing.PhaseByTransmission.getLikelihoodsAsMapSafeNull(PhaseByTransmission.java:625)
    at org.broadinstitute.gatk.tools.walkers.phasing.PhaseByTransmission.phaseTrioGenotypes(PhaseByTransmission.java:669)
    at org.broadinstitute.gatk.tools.walkers.phasing.PhaseByTransmission.map(PhaseByTransmission.java:878)
    at org.broadinstitute.gatk.tools.walkers.phasing.PhaseByTransmission.map(PhaseByTransmission.java:143)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-1-0-gf15c1c3ef):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: 2
##### ERROR ------------------------------------------------------------------------------------------

I saw threads on similar error messages, so I used ValidateVariants to make sure the VCF file produced by the GATK pipeline was OK. GATK4's ValidateVariants said it was fine. However, GATK3 had the following output:

INFO  15:49:26,538 ValidateVariants - Reference allele is too long (133) at position chr2:3725401; skipping that record. Set --reference_window_stop >= 133 
INFO  15:49:26,818 ValidateVariants - Reference allele is too long (111) at position chr2:8067973; skipping that record. Set --reference_window_stop >= 111 
INFO  15:49:26,857 ValidateVariants - Reference allele is too long (120) at position chr2:8895476; skipping that record. Set --reference_window_stop >= 120 
INFO  15:49:26,884 ValidateVariants - Reference allele is too long (113) at position chr2:9406449; skipping that record. Set --reference_window_stop >= 113 
INFO  15:49:27,010 ValidateVariants - Reference allele is too long (113) at position chr2:10925438; skipping that record. Set --reference_window_stop >= 113 
INFO  15:49:27,105 ValidateVariants - Reference allele is too long (108) at position chr2:12456149; skipping that record. Set --reference_window_stop >= 108 
INFO  15:49:27,402 ValidateVariants - Reference allele is too long (187) at position chr2:17964404; skipping that record. Set --reference_window_stop >= 187 
INFO  15:49:27,428 ValidateVariants - Reference allele is too long (122) at position chr2:18294631; skipping that record. Set --reference_window_stop >= 122 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.8-1-0-gf15c1c3ef): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: File /home/aschoenr/work/data/8_rvqs/2003_57/2003_57_recalibrated_variants.vcf fails strict validation: one or more of the ALT allele(s) for the record at position chr2:20357491 are not observed at all in the sample genotypes
##### ERROR ------------------------------------------------------------------------------------------

Does this mean that the VCF created by the GATK4 pipeline will not work with PhaseByTransmission? Like I said before, I plan on implementing the Genotype Refinement workflow, but I thought it would be nice to have the PBT output to compare to the Genotype Refinement workflow output.

Any help would be greatly appreciated!

↧

GATK runtime error (READ_MAX_LENGTH must be > 0 but got 0) with 1000g bam

April 7, 2017, 1:48 am

Hi,

I'm trying to build a pon with GATK 3.7-0 to use with mutect2. For that, I've downloaded 80 exome bam files from the 1000g project (GBR, TSI, IBS and CEU populations).
For most of them, when I try to use --artifact_detection_mode, I get a GATK runtime error saying 'READ_MAX_LENGTH must be > 0 but got 0'.
To try you can, for example, download ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00116/exome_alignment/HG00116.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam
I'm using the b37 reference files from the bundle_gatk_2_8 and the bed file of SureSelect6 from Agilent.

Here is the full stack trace and command line:

INFO 09:42:32,040 HelpFormatter - ------------------------------------------------------------------------------------
INFO 09:42:32,045 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 09:42:32,045 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 09:42:32,045 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 09:42:32,045 HelpFormatter - [Thu Apr 06 09:42:32 GMT 2017] Executing on Linux 2.6.18-275.12.1.el5.573g0000 amd64
INFO 09:42:32,045 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b25
INFO 09:42:32,049 HelpFormatter - Program Args: -T MuTect2 -I:tumor /data/misc/mutect2/pon/1000g_bam/HG00116.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam --dbsnp /data/highlander/reference/bundle_gatk_2_8/b37/dbsnp_138.b37.vcf --artifact_detection_mode -L /data/highlander/reference/bundle_gatk_2_8/b37/capture.sureselect6.bed -R /data/highlander/reference/bundle_gatk_2_8/b37/human_g1k_v37.fasta -o /data/misc/mutect2/pon/1000g_vcf_normal/HG00116.vcf.gz
INFO 09:42:32,062 HelpFormatter - Executing as lifescope@n3 on Linux 2.6.18-275.12.1.el5.573g0000 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b25.
INFO 09:42:32,062 HelpFormatter - Date/Time: 2017/04/06 09:42:32
INFO 09:42:32,062 HelpFormatter - ------------------------------------------------------------------------------------
INFO 09:42:32,063 HelpFormatter - ------------------------------------------------------------------------------------

...

##### ERROR --
##### ERROR stack trace
java.lang.IllegalArgumentException: READ_MAX_LENGTH must be > 0 but got 0
at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:126)
at org.broadinstitute.gatk.utils.pairhmm.N2MemoryPairHMM.initialize(N2MemoryPairHMM.java:60)
at org.broadinstitute.gatk.utils.pairhmm.LoglessPairHMM.initialize(LoglessPairHMM.java:66)
at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:159)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.initializePairHMM(PairHMMLikelihoodCalculationEngine.java:267)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeReadLikelihoods(PairHMMLikelihoodCalculationEngine.java:282)
at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:644)
at org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2.map(MuTect2.java:171)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: READ_MAX_LENGTH must be > 0 but got 0
##### ERROR ------------------------------------------------------------------------------------------

Am I doing something wrong or is it some kind of bug :-) ?

Thank you in advance for your help

Raphael

↧

GATK4 - Available generic command line options and read filters for every tool ?

November 5, 2018, 5:25 am

Hi,

I am currently trying to set up a pipeline using Mutect2 (GATK 4.0.11.0). I would like to know exactly what this tool is doing with my reads and also whether I am able to supply a multi-threading option. I can not easily find information on that (command line help, online tool documentation or snippets in tutorials).

Is there a way to know for any tool :

- which read filters are applied by default 
- if there is a support for multi-threading

Thank you very much,

:smile:

Anthony

↧

Using GenomicsDBImport with many samples and few thousand intervals, can I restart the run?

November 9, 2018, 1:30 am

Hello,
I'm running the GATK4 pipeline for calling SNPs in a capture-seq dataset of 500 individuals and 9,500 genomic intervals. I successfully produced GVCF files with HaplotypeCaller. When moving on to combine them with GenomicsDBImport, I first tested 10 intervals with 50 individuals, and after confirming that this took a few minutes to complete, I was hoping the entire run with 500 individuals and 9,500 genomic intervals would take 2-3 days. I'm using the following code:

${GATKLoc} --java-options "-Xmx80g -Xms8g" \
GenomicsDBImport \
--genomicsdb-workspace-path ${SAMPLES}_database \
--consolidate \
--batch-size 88 \
-L $INTERVAL \
--sample-name-map ${SAMPLES}_list.sample_map \
--tmp-dir=${TMPdir}

I was wrong. It has been a week and is still not finished....

Now I realise I should have done batch work with subsets of the intervals!?

So my questions are: Can I interrupt the run now and restart the unfinished intervals () working with batches? How can I find out which intervals are finished already? Should I put the new runs in the same database folder?
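
Splitting the interval list into subsets, as considered above, can be sketched as follows. This only illustrates the chunking step; it says nothing about whether an interrupted workspace can be resumed. The function name `chunk_intervals` and the example interval strings are hypothetical, and giving each chunk its own interval list file and workspace path is an assumption.

```python
def chunk_intervals(intervals, n_chunks):
    """Split a list of intervals into at most n_chunks consecutive chunks."""
    if not intervals or n_chunks < 1:
        return []
    size = -(-len(intervals) // n_chunks)  # ceiling division
    return [intervals[i:i + size] for i in range(0, len(intervals), size)]


# Hypothetical interval names standing in for the 9,500 capture intervals
intervals = [f"Contig{i}:1-100" for i in range(9500)]
chunks = chunk_intervals(intervals, 50)
print(len(chunks), len(chunks[0]))
```

Each chunk could then be written to its own interval list and passed to a separate GenomicsDBImport run via `-L`.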

The screen output is not informative, it only shows a huge list of these lines:

11:10:44.477 INFO GenomicsDBImport - Importing batch 3 with 88 samples
11:11:14.452 INFO GenomicsDBImport - Importing batch 3 with 88 samples
11:11:42.312 INFO GenomicsDBImport - Importing batch 3 with 88 samples

Also, ALL intervals have generated their respective folder in the database folder, but I don't know how to tell which ones are finished and which ones are not. E.g. this is the output for the first interval:

ls f2_samples_database/Contig2\$1\$7033/
__90d72b7d-46fb-40a5-9734-0fb7ee64cbb7140370416006912_1541705482603 __bb96df53-3662-4fb7-9d6d-59b906d673a0140370416006912_1541429490911 genomicsdb_meta_dir
__array_schema.tdb __c9eae372-1963-4631-abe7-222589fb2a88140370416006912_1541149600860

And this is the output for the last interval:

ls f2_samples_database/Contig388217\$1\$15477/
__50096ca9-52c9-4606-82a7-514133a73dd9140370416006912_1541429461943 __a3423e6f-4321-4c62-a487-085b8d903774140370416006912_1541705455158 __array_schema.tdb genomicsdb_meta_dir
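
Judging from the two listings above, the per-interval folders appear to be named `contig$start$end`. A minimal sketch for turning those folder names back into intervals, so they can be compared against the original interval list; `parse_workspace_dir` is a hypothetical helper, and the naming pattern is inferred from this output only. It does not tell you whether an interval has finished importing.

```python
def parse_workspace_dir(name):
    """Split a folder name like 'Contig2$1$7033' into (contig, start, end)."""
    # rsplit from the right so a '$' inside the contig name would be preserved
    contig, start, end = name.rsplit("$", 2)
    return contig, int(start), int(end)


for folder in ["Contig2$1$7033", "Contig388217$1$15477"]:
    contig, start, end = parse_workspace_dir(folder)
    print(f"{contig}:{start}-{end}")
```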

Thanks!

↧

MNP and HaplotypeCaller GVCF mode

November 8, 2018, 5:26 am

Hello

I am attempting to run HaplotypeCaller in a way that will merge adjacent SNPs into MNPs.
To do so I set --max-mnp-distance to 1 or 2.

This worked well when I did not use GVCF mode.
However, when I attempted this in GVCF mode I got the following error:
A USER ERROR has occurred: Illegal argument value: Non-zero maxMnpDistance is incompatible with GVCF mode.
(I am using GATK 4.0.8.1).

I am not sure I understand this conceptually:
If my callset contains two (or more) heterozygous SNPs that occur in adjacent genomic sites, they can only be determined to constitute part of a single MNP if both SNPs originate from the same chromosome/haplotype.
This is determined by phasing the callset, which as explained in the "Purpose and operation of Read-backed Phasing" page, is only enabled when HaplotypeCaller is run in GVCF or BP_RESOLUTION mode.

Following this reasoning, it appears to me that merging SNPs into MNPs will only make sense in one of these modes, since otherwise SNPs from different haplotypes can be merged erroneously.
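
The reasoning in this post can be written out as a small check on phased genotypes. This is only a sketch of the argument made here, not a description of how HaplotypeCaller decides to merge; `alt_on_same_haplotype` is a hypothetical helper and it assumes both genotypes belong to the same phase set.

```python
def alt_on_same_haplotype(gt_a, gt_b):
    """Return True if two phased genotypes carry ALT on the same haplotype.

    Returns None when either genotype is unphased ('/' separator), since the
    question cannot be answered from the GT field alone in that case.
    """
    if "|" not in gt_a or "|" not in gt_b:
        return None
    return any(a == "1" and b == "1"
               for a, b in zip(gt_a.split("|"), gt_b.split("|")))


print(alt_on_same_haplotype("0|1", "0|1"))  # both ALT alleles on the second haplotype
print(alt_on_same_haplotype("0|1", "1|0"))  # ALT alleles on different haplotypes
print(alt_on_same_haplotype("0/1", "0/1"))  # unphased
```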

Therefore I do not understand why MNP merging is possible without enabling GVCF mode, but is incompatible with GVCF mode.

I will be very glad for an explanation.

↧

Can I use GATK on non-diploid organisms?

July 26, 2012, 7:50 am

In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.

Ploidy-related functionalities

As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.

For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the -ploidy argument.

Cases where ploidy needs to be specified

Native variant calling in haploid or polyploid organisms.
Pooled calling where many pooled organisms share a single barcode and hence are treated as a single "sample".
Pooled validation/genotyping at known sites.

For normal organism ploidy, you just set the -ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).
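
The pooled-ploidy rule above is a single multiplication. A minimal worked example, with a hypothetical helper name and made-up pool sizes:

```python
def pooled_ploidy(ploidy_per_individual, individuals_in_pool):
    """Value for -ploidy: chromosomes per barcoded sample."""
    return ploidy_per_individual * individuals_in_pool


# A single tetraploid organism sequenced on its own
print(pooled_ploidy(4, 1))

# Ten diploid individuals sharing one barcode
print(pooled_ploidy(2, 10))
```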

Important limitations

Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF, and Genotype annotations such as PL, AD, GT, etc.

You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
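
To make that limitation concrete: a variant present on one chromosome copy out of N is expected in about 1/N of the reads, which approaches the sequencing error rate as N grows. A back-of-the-envelope sketch; the depth, pool sizes, and the 1% error rate are illustrative assumptions, not GATK parameters.

```python
def expected_alt_reads(depth, total_ploidy, alt_copies=1):
    """Expected reads carrying a variant present on alt_copies of total_ploidy chromosomes."""
    return depth * alt_copies / total_ploidy


def expected_error_reads(depth, error_rate):
    """Expected reads showing a mismatch at a site purely from sequencing error."""
    return depth * error_rate


depth = 100
assumed_error_rate = 0.01  # illustrative assumption
for total_ploidy in (2, 20, 200):
    print(total_ploidy,
          expected_alt_reads(depth, total_ploidy),
          expected_error_reads(depth, assumed_error_rate))
```

At a total ploidy of 200 the expected variant signal (0.5 reads at depth 100) is already below the assumed error background (1 read), which is the situation the paragraph above describes.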

↧

Help me understand why this variant was called

April 12, 2017, 7:26 am

This variant was called with MuTect2 in T/N mode:

chr7 152008861 . A T . PASS \
ECNT=1;HCNT=70;MAX_ED=.;MIN_ED=.;NLOD=96.12;TLOD=37.01 \
GT:AD:AF:ALT_F1R2:ALT_F2R1:DP:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC \
0/1:201,0:0.112:0:0:201:.:5721,0:0:0:101,100,0,0 \
0/0:445,0:0.055:0:0:445:.:15084,0:0:0:219,226,0,0

If I read this correctly:
AF for this SNP (A->T) is 0.112 (11.2%) in the tumor sample.
AF for this SNP (A->T) is 0.055 in the normal sample.
Depth at this position (AD) in tumor sample : 201 (ref), 0 (alt).
Depth at this position (AD) in normal sample : 445 (ref), 0 (alt).

This seems strange to call a SNP with AF = 0.112 when no reads at all show up for the alt_allele in the AD field.
We have nothing to support alt_allele in the QSS field either (for both tumor and normal sample, the sum of base quality scores for alt_allele equals 0).

PS: We wanted to have some clue about strand bias for our variants (the strand_artifact flag never appears in the MuTect2 VCF), so we requested the SAC to be present in the VCF output. No reads at all in the SAC field are supporting the alternate allele on +/- strands.

I know that it's stated in the doc that "the allele counts provided by AD and SAC should not be used to make assumptions about the called genotype", but still, I think it's very strange to have no supporting allele count in both AD & SAC when AF = 11.2%. And given that AF = AD[format] / DP[format], I was expecting some correlation between AF for a given variant and its AD field.
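
The mismatch described here can be made explicit by pairing the FORMAT keys with the tumor sample values and comparing the reported AF with the fraction implied by AD. This is a minimal sketch that only restates the numbers in the record; it does not show how MuTect2 computes AF, and the helper names are hypothetical.

```python
def parse_sample(format_keys, sample):
    """Pair colon-separated FORMAT keys with one sample's values."""
    return dict(zip(format_keys.split(":"), sample.split(":")))


def ad_fraction(fields):
    """ALT fraction implied by AD (ref,alt); None if there are no reads."""
    ref, alt = (int(x) for x in fields["AD"].split(","))
    total = ref + alt
    return alt / total if total else None


fmt = "GT:AD:AF:ALT_F1R2:ALT_F2R1:DP:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC"
tumor = parse_sample(fmt, "0/1:201,0:0.112:0:0:201:.:5721,0:0:0:101,100,0,0")

print("reported AF:", float(tumor["AF"]))
print("AD-derived fraction:", ad_fraction(tumor))
```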

What am I missing? Did I misread how AF is calculated?

↧