Channel: Recent Discussions — GATK-Forum

VariantsToTable does not produce the table of all SNPs in the vcf file


Hi,

I am trying to convert my VCF file to a flat tab-separated format using "VariantsToTable" in GATK. I am particularly interested in the per-sample DP annotation for my analysis. I am using
/data/programs/gatk-4.0.4.0/gatk VariantsToTable -V final.vcf -GF DP -O snps.DP.tab

My VCF file contains over 4 million SNPs, but the table I get from VariantsToTable contains only about 1 million. I wonder what is wrong! Am I missing something? I would appreciate your help sorting this out!
Thanks
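For reference: VariantsToTable omits records whose FILTER field is anything other than PASS or "." unless --show-filtered is given, which is a common cause of a table with far fewer rows than the VCF has records. A small sketch to compare the two counts, assuming a standard VCF layout; `count_records` and the paths are illustrative, not from the post:

```python
import gzip

def count_records(vcf_path):
    """Count all VCF records vs. those whose FILTER is PASS or '.'.

    VariantsToTable drops filtered records by default (--show-filtered
    includes them), so a large gap between these two counts would
    explain the missing rows.
    """
    total = unfiltered = 0
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            total += 1
            if line.split("\t")[6] in (".", "PASS"):  # FILTER column
                unfiltered += 1
    return total, unfiltered
```

If the unfiltered count is close to the 1 million rows you see, the table is simply excluding filtered sites, and adding --show-filtered to the VariantsToTable command should recover the rest.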


Picard IlluminaBasecallsToSam showing inconsistent XT tags in different runs


I am using Picard 2.17.10 to generate unmapped BAMs from Illumina raw data.

My command lines for ExtractIlluminaBarcodes and IlluminaBasecallsToSam:
java -jar picard.jar ExtractIlluminaBarcodes BASECALLS_DIR=Data/Intensities/BaseCalls/ LANE=1 READ_STRUCTURE=76T6B76T BARCODE_FILE=sample.barcode METRICS_FILE=metrics1.out COMPRESS_OUTPUTS=true NUM_PROCESSORS=0
java -jar picard.jar IlluminaBasecallsToSam BASECALLS_DIR=Data/Intensities/BaseCalls/ LANE=1 READ_STRUCTURE=76T6B76T RUN_BARCODE=firstrun IGNORE_UNEXPECTED_BARCODES=true LIBRARY_PARAMS=lane1.params ADAPTERS_TO_CHECK=PAIRED_END MAX_READS_IN_RAM_PER_TILE=1000000 MAX_RECORDS_IN_RAM=5000000 FORCE_GC=false

I am getting different XT tags across different runs. For example in one run, I have
180307_NB551391_0005_AHYMM2AFXX:3:21403:25151:1041 77 * 0 0 * * 0 0 CCACAAATGCCGGTTCCCTTCTACAGGCCCAGTCGCCAGCTCAGAGGACACTCGATCTCCTGAGATCGGAAGAGCA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEA RG:Z:18030.3 XT:i:63
180307_NB551391_0005_AHYMM2AFXX:3:21403:25151:1041 141 * 0 0 * * 0 0 CAGGAGATCGAGTGTCCTCTGAGCTGGCGACTGGGCCTGTAGAANNNANCCNGCATTNGTGGAGATCGNNAGNGCG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE###E#EE#EEEEE#EEEEEEEEEE##EE#EAE RG:Z:18030.3 XT:i:63
In another run, I have
180307_NB551391_0005_AHYMM2AFXX:3:21403:25151:1041 77 * 0 0 * * 0 0 CCACAAATGCCGGTTCCCTTCTACAGGCCCAGTCGCCAGCTCAGAGGACACTCGATCTCCTGAGATCGGAAGAGCA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEA RG:Z:18030.3
180307_NB551391_0005_AHYMM2AFXX:3:21403:25151:1041 141 * 0 0 * * 0 0 CAGGAGATCGAGTGTCCTCTGAGCTGGCGACTGGGCCTGTAGAANNNANCCNGCATTNGTGGAGATCGNNAGNGCG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE###E#EE#EEEEE#EEEEEEEEEE##EE#EAE RG:Z:18030.3

From manual inspection, the record with the XT tag appears to be the correct one.
Why didn't the second run produce it? Is there some randomness in IlluminaBasecallsToSam? Does a missing XT tag matter in downstream analysis if I follow the Best Practices?

Should I avoid using IlluminaBasecallsToSam for adapter marking and use MarkIlluminaAdapters instead? Will MarkIlluminaAdapters produce consistent XT tags across runs?
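For comparing two such runs, the optional tags of a SAM record can be diffed with a few lines of Python. A sketch, assuming well-formed SAM lines; the two records below are shortened stand-ins for the reads shown above:

```python
def sam_tags(line):
    """Return the optional tags (columns 12+) of a SAM alignment line
    as a dict mapping tag name to (type, value)."""
    tags = {}
    for field in line.rstrip("\n").split("\t")[11:]:
        name, typ, value = field.split(":", 2)
        tags[name] = (typ, value)
    return tags

# Shortened stand-ins for the records above (first 11 columns collapsed):
run1 = "\t".join(["read1", "77", "*", "0", "0", "*", "*", "0", "0",
                  "SEQ", "QUAL", "RG:Z:18030.3", "XT:i:63"])
run2 = "\t".join(["read1", "77", "*", "0", "0", "*", "*", "0", "0",
                  "SEQ", "QUAL", "RG:Z:18030.3"])

only_in_run1 = set(sam_tags(run1)) - set(sam_tags(run2))
print(only_in_run1)  # {'XT'}
```

Run over whole files, this kind of comparison makes it easy to see whether the XT differences are confined to particular reads or tiles.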

Thank you very much for your time

A Java question!


Hello!
I'm using GATK 4.0.4.0, but I've run into a Java problem: every time I run GATK, it fails with the error below.

Using GATK jar /home/gaotiangang/niuguohao/biosoft/GATK4/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10240m -Djava.io.tmpdir=./ -jar /home/gaotiangang/niuguohao/biosoft/GATK4/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar AnalyzeCovariates -before seq103.4.table -after seq103.5.table -plots seq103.5.pdf --ignore-last-modification-times
Picked up _JAVA_OPTIONS: -Xmx20480m -Xms20480m
Exception in thread "main" java.lang.UnsatisfiedLinkError: no net in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at java.net.InetAddress$1.run(InetAddress.java:294)
at java.net.InetAddress$1.run(InetAddress.java:292)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.InetAddress.<clinit>(InetAddress.java:291)
at org.apache.logging.log4j.core.util.NetUtils.getLocalHostname(NetUtils.java:53)
at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:486)
at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:562)
at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:578)
at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:214)
at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:145)
at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
at org.apache.logging.log4j.LogManager.getContext(LogManager.java:182)
at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:455)
at org.broadinstitute.hellbender.utils.Utils.<clinit>(Utils.java:75)
at org.broadinstitute.hellbender.Main.<clinit>(Main.java:45)

I really don't understand what it means, because it worked fine yesterday!
Thanks!

What's the best way to process multiple samples in "Data pre-processing for variant discovery"


In the Best Practice "Data pre-processing for variant discovery," the JSON file includes the parameters PreProcessingForVariantDiscovery_GATK4.sample_name and PreProcessingForVariantDiscovery_GATK4.flowcell_unmapped_bams_list. The flowcell_unmapped_bams_list is meant to include multiple BAM files from the same sample, and sample_name is the actual sample name. So this pipeline can only process one sample per run.

My question is: is there a way to parameterize this pipeline to process multiple samples in one call, or to integrate it into a script that processes multiple samples? For example, is it possible to give sample_name a value like $NAME and take the actual sample name from a shell variable? How about flowcell_unmapped_bams_list?
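Since the workflow takes one sample per inputs JSON, one common workaround is to generate an inputs file per sample and launch the workflow once per sample. A small sketch of that idea; the key names come from the post, while `make_inputs`, the sample names, and the file paths are placeholders:

```python
import json

PREFIX = "PreProcessingForVariantDiscovery_GATK4"

def make_inputs(sample_name, bams_list_path):
    """Render the inputs JSON for one sample's workflow run."""
    return json.dumps({
        f"{PREFIX}.sample_name": sample_name,
        f"{PREFIX}.flowcell_unmapped_bams_list": bams_list_path,
    }, indent=2)

# One inputs file per sample; each JSON would then be submitted as its
# own workflow run (e.g. from a shell loop over sample names).
for sample in ["sampleA", "sampleB"]:
    inputs_json = make_inputs(sample, f"{sample}.unmapped_bams.list")
    with open(f"{sample}.inputs.json", "w") as fh:
        fh.write(inputs_json)
```

The same substitution could of course be done with sed or envsubst on a template JSON; the point is that the pipeline itself stays single-sample and the looping happens outside it.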

Thank you very much for the help!

Mark duplicate file being overwritten when running ApplyBQSR


Hello,
I have been trying to sort this issue out, but I can't seem to figure out why it is happening, and would appreciate any help :)

When I run this script, it runs for only 60 seconds, nothing appears in the error file, but my MarkDuplicates BAMs are being overwritten and their size drops drastically, e.g. from 17.7 GB to 33 kB.
The files named .recal_pass1.bam are not being created at all.
This is what my output looks like, in terms of which files come out.

This is the script I wrote:

#!/usr/bin/env python

import subprocess, glob, re

input_bams=glob.glob('*.markdup.bam')
input_tables=glob.glob('*.recal_pass1.table')

input_bams.sort()
input_tables.sort()

for i in range(len(input_bams)):
  subprocess.call(['gatk --java-options "-Djava.io.tmpdir=/usr/users/havill/temp" ApplyBQSR \
  -I ' + input_bams[i] + ' \
  --bqsr-recal-file ' + input_tables[i] + ' \
  -O '+ re.sub('.recal_pass1.table', '.recal_pass1.bam', input_bams[i]) +' \
  -R /usr/users/havill/genomes/Galga16/GCF_000002315.5_GRCg6a_genomic.fna \
  > ' + re.sub('.recal_pass1.table', '.recal_pass1.bam.log', input_tables[i])+ ' 2>&1'], shell=True )
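One observation on the script above, offered as a hypothesis: the re.sub pattern '.recal_pass1.table' never matches a name taken from input_bams (those end in .markdup.bam), so the substitution returns the input name unchanged and -O becomes the same file as -I, which would explain the input BAMs shrinking to a few kB. A minimal demonstration, with one way to derive the output name from the BAM name instead:

```python
import re

bam = "sample1.markdup.bam"  # a typical input_bams[i]

# The script's pattern targets .recal_pass1.table names, which never
# occur in input_bams, so re.sub returns the string unchanged:
out = re.sub(".recal_pass1.table", ".recal_pass1.bam", bam)
assert out == bam  # -O == -I, so ApplyBQSR would clobber its own input

# Deriving the output from the BAM suffix avoids the collision
# (dots escaped so '.' isn't treated as a regex wildcard):
fixed = re.sub(r"\.markdup\.bam$", ".recal_pass1.bam", bam)
print(fixed)  # sample1.recal_pass1.bam
```

The same pattern mismatch also affects the log-file name built from input_tables, though that one at least produces a distinct path.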

VariantRecalibrator fails on 30x chr22 subset (GIAB NA12878)


I ran the full workflow with the 300x-depth data, but retrying with a 30x-depth subset (needed for a training session) now fails.

Is there a recalibration requirement whose default is set too high, so that it fails on my data? Or is it because my sample is present in the calibration collections? This GIAB data was mapped and then called using only chr22 reads (10% of the 300x) against HG38, producing a GVCF from which I derived the VCF:

-rw-r--r-- 1 root domain users 142M Aug  3 11:17 NA12878_0.1.g.vcf.gz
-rw-r--r-- 1 root domain users  60K Aug  3 11:17 NA12878_0.1.g.vcf.gz.tbi
-rw-r--r-- 1 root domain users 4.6M Aug  3 11:22 NA12878_0.1.vcf.gz
-rw-r--r-- 1 root domain users  24K Aug  3 11:22 NA12878_0.1.vcf.gz.tbi

Command and output are below. I can add any useful extracts on request.

Thanks a lot for helping me find why this fails.

Stephane

# True sites training resource: HapMap
truetraining15=reference/hg38_v0_hapmap_3.3.hg38.vcf.gz

# True sites training resource: Omni
truetraining12=reference/hg38_v0_1000G_omni2.5.hg38.vcf.gz

# Non-true sites training resource: 1000G
nontruetraining10=reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

# Known sites resource, not used in training: dbSNP
knowntraining2=reference/hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz

# indels True sites training resource: Mills
truetrainingindel12=reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

java -Xmx${maxmem} -jar $GATK/gatk.jar VariantRecalibrator \
-R $BWA_INDEXES/NCBI_GRCh38.fa \
-V ${mappings}/${samplename}_${p}.vcf.gz \
--resource hapmap,known=false,training=true,truth=true,prior=15.0:${truetraining15} \
--resource omni,known=false,training=true,truth=true,prior=12.0:${truetraining12} \
--resource 1000g,known=false,training=true,truth=false,prior=10.0:${nontruetraining10} \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:${knowntraining2} \
--resource Mills_and_1000G_gold,known=false,training=true,truth=true,prior=12.0:${truetrainingindel12} \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
--mode BOTH \
--output ${mappings}/output.recal_${p}.vcf \
--tranches-file ${mappings}/output.tranches_${p} \
--rscript-file ${mappings}/output.plots_${p}.R

stderr

java -Xmx${maxmem} -jar $GATK/gatk.jar VariantRecalibrator \
> -R $BWA_INDEXES/NCBI_GRCh38.fa \
> -V ${mappings}/${samplename}_${p}.vcf.gz \
> --resource hapmap,known=false,training=true,truth=true,prior=15.0:${truetraining15} \
> --resource omni,known=false,training=true,truth=true,prior=12.0:${truetraining12} \
> --resource 1000g,known=false,training=true,truth=false,prior=10.0:${nontruetraining10} \
> --resource dbsnp,known=true,training=false,truth=false,prior=2.0:${knowntraining2} \
> --resource Mills_and_1000G_gold,known=false,training=true,truth=true,prior=12.0:${truetrainingindel12} \
> -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
> --mode BOTH \
> --output ${mappings}/output.recal_${p}.vcf \
> --tranches-file ${mappings}/output.tranches_${p} \
> --rscript-file ${mappings}/output.plots_${p}.R
11:33:58.090 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/biotools/gatk-4.0.7.0/gatk-package-4.0.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:33:58.338 INFO  VariantRecalibrator - ------------------------------------------------------------
11:33:58.339 INFO  VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.0.7.0
11:33:58.339 INFO  VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
11:33:58.339 INFO  VariantRecalibrator - Executing as u0002316@gbw-s-pacbio01 on Linux v4.4.0-131-generic amd64
11:33:58.339 INFO  VariantRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11
11:33:58.340 INFO  VariantRecalibrator - Start Date/Time: August 3, 2018 11:33:58 AM CEST
11:33:58.340 INFO  VariantRecalibrator - ------------------------------------------------------------
11:33:58.340 INFO  VariantRecalibrator - ------------------------------------------------------------
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Version: 2.16.0
11:33:58.340 INFO  VariantRecalibrator - Picard Version: 2.18.7
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:33:58.340 INFO  VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:33:58.340 INFO  VariantRecalibrator - Deflater: IntelDeflater
11:33:58.341 INFO  VariantRecalibrator - Inflater: IntelInflater
11:33:58.341 INFO  VariantRecalibrator - GCS max retries/reopens: 20
11:33:58.341 INFO  VariantRecalibrator - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
11:33:58.341 INFO  VariantRecalibrator - Initializing engine
11:33:58.804 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_hapmap_3.3.hg38.vcf.gz
11:33:59.042 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_1000G_omni2.5.hg38.vcf.gz
11:33:59.177 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
11:33:59.290 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz
11:33:59.402 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
11:33:59.513 INFO  FeatureManager - Using codec VCFCodec to read file file:///mnt/freenas/NGS_Variant-Analysis-training2018/bwa_mappings_10pc/NA12878_0.1.vcf.gz
11:33:59.633 INFO  VariantRecalibrator - Done initializing engine
11:33:59.645 INFO  TrainingSet - Found hapmap track:    Known = false   Training = true         Truth = true    Prior = Q15.0
11:33:59.646 INFO  TrainingSet - Found omni track:      Known = false   Training = true         Truth = true    Prior = Q12.0
11:33:59.646 INFO  TrainingSet - Found 1000g track:     Known = false   Training = true         Truth = false   Prior = Q10.0
11:33:59.646 INFO  TrainingSet - Found dbsnp track:     Known = true    Training = false        Truth = false   Prior = Q2.0
11:33:59.646 INFO  TrainingSet - Found Mills_and_1000G_gold track:      Known = false   Training = true         Truth = true    Prior = Q12.0
11:33:59.693 INFO  ProgressMeter - Starting traversal
11:33:59.693 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
11:34:08.098 INFO  ProgressMeter -       chr22:50362905              0.1                 81887         584628.7
11:34:08.098 INFO  ProgressMeter - Traversal complete. Processed 81887 total variants in 0.1 minutes.
11:34:08.113 INFO  VariantDataManager - QD:      mean = 19.68    standard deviation = 9.58
11:34:08.124 INFO  VariantDataManager - MQ:      mean = 59.83    standard deviation = 1.70
11:34:08.132 INFO  VariantDataManager - MQRankSum:       mean = -0.02    standard deviation = 0.29
11:34:08.145 INFO  VariantDataManager - ReadPosRankSum:          mean = 0.03     standard deviation = 1.00
11:34:08.157 INFO  VariantDataManager - FS:      mean = 1.97     standard deviation = 3.36
11:34:08.165 INFO  VariantDataManager - SOR:     mean = 1.02     standard deviation = 0.58
11:34:08.173 INFO  VariantDataManager - DP:      mean = 25.19    standard deviation = 7.16
11:34:08.276 INFO  VariantDataManager - Annotations are now ordered by their information content: [MQ, DP, QD, MQRankSum, FS, SOR, ReadPosRankSum]
11:34:08.284 INFO  VariantDataManager - Training with 27312 variants after standard deviation thresholding.
11:34:08.288 INFO  GaussianMixtureModel - Initializing model with 100 k-means iterations...
11:34:09.210 INFO  VariantRecalibratorEngine - Finished iteration 0.
11:34:09.860 INFO  VariantRecalibratorEngine - Finished iteration 5.    Current change in mixture coefficients = 2.94951
11:34:10.490 INFO  VariantRecalibratorEngine - Finished iteration 10.   Current change in mixture coefficients = 0.39301
11:34:11.065 INFO  VariantRecalibratorEngine - Finished iteration 15.   Current change in mixture coefficients = 0.00837
11:34:11.507 INFO  VariantRecalibratorEngine - Finished iteration 20.   Current change in mixture coefficients = 0.01513
11:34:12.037 INFO  VariantRecalibratorEngine - Finished iteration 25.   Current change in mixture coefficients = 0.01792
11:34:12.531 INFO  VariantRecalibratorEngine - Finished iteration 30.   Current change in mixture coefficients = 0.01851
11:34:12.998 INFO  VariantRecalibratorEngine - Finished iteration 35.   Current change in mixture coefficients = 0.02430
11:34:13.451 INFO  VariantRecalibratorEngine - Finished iteration 40.   Current change in mixture coefficients = 0.01579
11:34:13.891 INFO  VariantRecalibratorEngine - Finished iteration 45.   Current change in mixture coefficients = 0.00536
11:34:14.433 INFO  VariantRecalibratorEngine - Finished iteration 50.   Current change in mixture coefficients = 0.00169
11:34:14.433 INFO  VariantRecalibratorEngine - Convergence after 50 iterations!
11:34:14.549 WARN  VariantRecalibratorEngine - Model could not pre-compute denominators.
11:34:14.567 INFO  VariantDataManager - Selected worst 0 scoring variants --> variants with LOD <= -5.0000.
11:34:14.593 INFO  VariantRecalibrator - Shutting down engine
[August 3, 2018 11:34:14 AM CEST] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 0.28 minutes.
Runtime.totalMemory()=5262802944
java.lang.IllegalArgumentException: No data found.
        at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:34)
        at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.onTraversalSuccess(VariantRecalibrator.java:630)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:981)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs


In GATK4, the GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case only two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport, which has a few limitations (for example it can only run on diploid data at the moment). We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


Using GenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs, your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20

That generates a directory called my_database containing the combined GVCF data for chromosome 20. The contents of the directory are not really human-readable; see further down for tips to deal with that.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At the moment you can only run GenomicsDBImport on a single genomic interval (i.e. max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this, of course.

  3. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using CatVariants) or scatter the following steps by chromosome as well.
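The per-interval runs from limitation 2 are straightforward to script. A sketch that builds one GenomicsDBImport invocation per chromosome; the GVCF paths mirror the trio example above, the workspace naming is an assumption, and the actual run line is left commented out:

```python
import subprocess  # used by the commented-out run line below

gvcfs = ["data/gvcfs/mother.g.vcf",
         "data/gvcfs/father.g.vcf",
         "data/gvcfs/son.g.vcf"]

def import_command(interval):
    """Argument list for one GenomicsDBImport run over a single
    interval, writing into its own workspace (e.g. my_database_chr20)."""
    cmd = ["gatk", "GenomicsDBImport"]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    cmd += ["--genomicsdb-workspace-path", f"my_database_{interval}",
            "--intervals", interval]
    return cmd

for chrom in [f"chr{i}" for i in range(1, 23)]:
    cmd = import_command(chrom)
    # subprocess.run(cmd, check=True)  # uncomment to run each import
```

Each workspace is then genotyped separately with GenotypeGVCFs (gendb://my_database_chr20, and so on), and the per-chromosome VCFs concatenated at the end.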

**If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.**


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Suppress Man Page


Hi! I've looked through the forums and help, but I can't find an option that does what I want it to do.

When I'm debugging a pipeline, whenever an error occurs, the tool that receives the error prints out the man page to my terminal prior to printing out the suspected cause of the error.

Is there an option to suppress outputting the man page? I'm using gatk v4.0.5.1

Here is an example:

command:
gatk GenotypeGVCFs -R /path/to/ref/GRCh38_.fa -V gendb://wild_dir -G StandardAnnotation -newQual -O wild_genotype.vcf

Output:
Perform joint genotyping on a single-sample GVCF from HaplotypeCaller or a multi-sample GVCF from CombineGVCFs or
GenomicsDBImport
Version:4.0.5.1

Required Arguments:

--output,-O:File File to which variants should be written Required.

--reference,-R:String Reference sequence file Required.

--variant,-V:String A VCF file containing variants Required.

Optional Arguments:

--add-output-sam-program-record,-add-output-sam-program-record:Boolean
If true, adds a PG tag to created SAM/BAM/CRAM files. Default value: true. Possible
values: {true, false}

--add-output-vcf-command-line,-add-output-vcf-command-line:Boolean
If true, adds a command line header line to created VCF files. Default value: true.
Possible values: {true, false}

--annotate-with-num-discovered-alleles:Boolean
If provided, we will annotate records with the number of alternate alleles that were
discovered (but not necessarily genotyped) at a given site Default value: false. Possible
values: {true, false}

--annotation,-A:String One or more specific annotations to add to variant calls This argument may be specified 0
or more times. Default value: null. Possible Values: {AS_BaseQualityRankSumTest,
AS_FisherStrand, AS_InbreedingCoeff, AS_MappingQualityRankSumTest, AS_QualByDepth,
AS_ReadPosRankSumTest, AS_RMSMappingQuality, AS_StrandOddsRatio, BaseQuality,
BaseQualityRankSumTest, ChromosomeCounts, ClippingRankSumTest, Coverage,
DepthPerAlleleBySample, DepthPerSampleHC, ExcessHet, FisherStrand, FragmentLength,
GenotypeSummaries, InbreedingCoeff, LikelihoodRankSumTest, MappingQuality,
MappingQualityRankSumTest, MappingQualityZero, OxoGReadCounts, PossibleDeNovo,
QualByDepth, ReadPosition, ReadPosRankSumTest, ReferenceBases, RMSMappingQuality,
SampleList, StrandArtifact, StrandBiasBySample, StrandOddsRatio, TandemRepeat,
UniqueAltReadCount}

--annotation-group,-G:String One or more groups of annotations to apply to variant calls This argument may be
specified 0 or more times. Default value: null. Possible Values: {AS_StandardAnnotation,
ReducibleAnnotation, StandardAnnotation, StandardHCAnnotation, StandardMutectAnnotation}

--annotations-to-exclude,-AX:String
One or more specific annotations to exclude from variant calls This argument may be
specified 0 or more times. Default value: null. Possible Values: {BaseQualityRankSumTest,
ChromosomeCounts, Coverage, DepthPerAlleleBySample, ExcessHet, FisherStrand,
InbreedingCoeff, MappingQualityRankSumTest, QualByDepth, ReadPosRankSumTest,
RMSMappingQuality, StrandOddsRatio}

--arguments_file:File read one or more arguments files and add them to the command line This argument may be
specified 0 or more times. Default value: null.

--cloud-index-prefetch-buffer,-CIPB:Integer
Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to
cloudPrefetchBuffer if unset. Default value: -1.

--cloud-prefetch-buffer,-CPB:Integer
Size of the cloud-only prefetch buffer (in MB; 0 to disable). Default value: 40.

--create-output-bam-index,-OBI:Boolean
If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file. Default
value: true. Possible values: {true, false}

--create-output-bam-md5,-OBM:Boolean
If true, create a MD5 digest for any BAM/SAM/CRAM file created Default value: false.
Possible values: {true, false}

--create-output-variant-index,-OVI:Boolean
If true, create a VCF index when writing a coordinate-sorted VCF file. Default value:
true. Possible values: {true, false}

--create-output-variant-md5,-OVM:Boolean
If true, create a a MD5 digest any VCF file created. Default value: false. Possible
values: {true, false}

--dbsnp,-D:FeatureInput dbSNP file Default value: null.

--disable-bam-index-caching,-DBIC:Boolean
If true, don't cache bam indexes, this will reduce memory requirements but may harm
performance if many intervals are specified. Caching is automatically disabled if there
are no intervals specified. Default value: false. Possible values: {true, false}

--disable-read-filter,-DF:String
Read filters to be disabled before analysis This argument may be specified 0 or more
times. Default value: null. Possible Values: {WellformedReadFilter}

--disable-sequence-dictionary-validation,-disable-sequence-dictionary-validation:Boolean
If specified, do not check the sequence dictionaries from our inputs for compatibility.
Use at your own risk! Default value: false. Possible values: {true, false}

--exclude-intervals,-XL:String One or more genomic intervals to exclude from processing This argument may be specified 0
or more times. Default value: null.

--founder-id,-founder-id:String
Samples representing the population "founders" This argument may be specified 0 or more
times. Default value: null.

--gatk-config-file:String A configuration file to use with the GATK. Default value: null.

--gcs-max-retries,-gcs-retries:Integer
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the
connection Default value: 20.

--help,-h:Boolean display the help message Default value: false. Possible values: {true, false}

--heterozygosity:Double Heterozygosity value used to compute prior likelihoods for any locus. See the GATKDocs
for full details on the meaning of this population genetics concept Default value: 0.001.

--heterozygosity-stdev:Double Standard deviation of heterozygosity for SNP and indel calling. Default value: 0.01.

--indel-heterozygosity:Double Heterozygosity for indel calling. See the GATKDocs for heterozygosity for full details on
the meaning of this population genetics concept Default value: 1.25E-4.

--input,-I:String BAM/SAM/CRAM file containing reads This argument may be specified 0 or more times.
Default value: null.

--interval-exclusion-padding,-ixp:Integer
Amount of padding (in bp) to add to each interval you are excluding. Default value: 0.

--interval-merging-rule,-imr:IntervalMergingRule
Interval merging rule for abutting intervals Default value: ALL. Possible values: {ALL,
OVERLAPPING_ONLY}

--interval-padding,-ip:Integer Amount of padding (in bp) to add to each interval you are including. Default value: 0.

--interval-set-rule,-isr:IntervalSetRule
Set merging approach to use for combining interval inputs Default value: UNION. Possible
values: {UNION, INTERSECTION}

--intervals,-L:String One or more genomic intervals over which to operate This argument may be specified 0 or
more times. Default value: null.

--lenient,-LE:Boolean Lenient processing of VCF files Default value: false. Possible values: {true, false}

--num-reference-samples-if-no-call:Integer
Number of hom-ref genotypes to infer at sites not present in a panel Default value: 0.

--pedigree,-ped:File Pedigree file for determining the population "founders" Default value: null.

--population-callset,-population:FeatureInput
Callset to use in calculating genotype priors Default value: null.

--QUIET:Boolean Whether to suppress job-summary info on System.err. Default value: false. Possible
values: {true, false}

--read-filter,-RF:String Read filters to be applied before analysis This argument may be specified 0 or more
times. Default value: null. Possible Values: {AlignmentAgreesWithHeaderReadFilter,
AllowAllReadsReadFilter, AmbiguousBaseReadFilter, CigarContainsNoNOperator,
FirstOfPairReadFilter, FragmentLengthReadFilter, GoodCigarReadFilter,
HasReadGroupReadFilter, LibraryReadFilter, MappedReadFilter,
MappingQualityAvailableReadFilter, MappingQualityNotZeroReadFilter,
MappingQualityReadFilter, MatchingBasesAndQualsReadFilter, MateDifferentStrandReadFilter,
MateOnSameContigOrNoMappedMateReadFilter, MetricsReadFilter,
NonZeroFragmentLengthReadFilter, NonZeroReferenceLengthAlignmentReadFilter,
NotDuplicateReadFilter, NotOpticalDuplicateReadFilter, NotSecondaryAlignmentReadFilter,
NotSupplementaryAlignmentReadFilter, OverclippedReadFilter, PairedReadFilter,
PassesVendorQualityCheckReadFilter, PlatformReadFilter, PlatformUnitReadFilter,
PrimaryLineReadFilter, ProperlyPairedReadFilter, ReadGroupBlackListReadFilter,
ReadGroupReadFilter, ReadLengthEqualsCigarLengthReadFilter, ReadLengthReadFilter,
ReadNameReadFilter, ReadStrandFilter, SampleReadFilter, SecondOfPairReadFilter,
SeqIsStoredReadFilter, ValidAlignmentEndReadFilter, ValidAlignmentStartReadFilter,
WellformedReadFilter}

--read-index,-read-index:String
Indices to use for the read inputs. If specified, an index must be provided for every read
input and in the same order as the read inputs. If this argument is not specified, the
path to the index for each input will be inferred automatically. This argument may be
specified 0 or more times. Default value: null.

--read-validation-stringency,-VS:ValidationStringency
Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default
stringency value SILENT can improve performance when processing a BAM file in which
variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default
value: SILENT. Possible values: {STRICT, LENIENT, SILENT}

--sample-ploidy,-ploidy:Integer
Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in
each pool * Sample Ploidy). Default value: 2.

--seconds-between-progress-updates,-seconds-between-progress-updates:Double
Output traversal statistics every time this many seconds elapse Default value: 10.0.

--sequence-dictionary,-sequence-dictionary:String
Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a
.dict file. Default value: null.

--sites-only-vcf-output:Boolean
If true, don't emit genotype fields when writing vcf file output. Default value: false.
Possible values: {true, false}

--standard-min-confidence-threshold-for-calling,-stand-call-conf:Double
The minimum phred-scaled confidence threshold at which variants should be called Default
value: 10.0.

--TMP_DIR:File Undocumented option This argument may be specified 0 or more times. Default value: null.

--use-jdk-deflater,-jdk-deflater:Boolean
Whether to use the JdkDeflater (as opposed to IntelDeflater) Default value: false.
Possible values: {true, false}

--use-jdk-inflater,-jdk-inflater:Boolean
Whether to use the JdkInflater (as opposed to IntelInflater) Default value: false.
Possible values: {true, false}

--use-new-qual-calculator,-new-qual:Boolean
If provided, we will use the new AF model instead of the so-called exact model Default
value: false. Possible values: {true, false}

--verbosity,-verbosity:LogLevel
Control verbosity of logging. Default value: INFO. Possible values: {ERROR, WARNING,
INFO, DEBUG}

--version:Boolean display the version number for this tool Default value: false. Possible values: {true,
false}

Advanced Arguments:

--disable-tool-default-annotations,-disable-tool-default-annotations:Boolean
Disable all tool default annotations Default value: false. Possible values: {true, false}

--disable-tool-default-read-filters,-disable-tool-default-read-filters:Boolean
Disable all tool default read filters (WARNING: many tools will not function correctly
without their default read filters on) Default value: false. Possible values: {true,
false}

--enable-all-annotations:Boolean
Use all possible annotations (not for the faint of heart) Default value: false. Possible
values: {true, false}

--input-prior:Double Input prior for calls This argument may be specified 0 or more times. Default value:
null.

--max-alternate-alleles:Integer
Maximum number of alternate alleles to genotype Default value: 6.

--max-genotype-count:Integer Maximum number of genotypes to consider at any site Default value: 1024.

--only-output-calls-starting-in-intervals:Boolean
Restrict variant output to sites that start within provided intervals Default value:
false. Possible values: {true, false}

--showHidden,-showHidden:Boolean
display hidden arguments Default value: false. Possible values: {true, false}

Conditional Arguments for read-filter:

Valid only if "AmbiguousBaseReadFilter" is specified:
--ambig-filter-bases:Integer Threshold number of ambiguous bases. If null, uses threshold fraction; otherwise,
overrides threshold fraction. Default value: null. Cannot be used in conjunction with
argument(s) maxAmbiguousBaseFraction

--ambig-filter-frac:Double Threshold fraction of ambiguous bases Default value: 0.05. Cannot be used in conjunction
with argument(s) maxAmbiguousBases

Valid only if "FragmentLengthReadFilter" is specified:
--max-fragment-length:Integer Maximum length of fragment (insert size) Default value: 1000000.

Valid only if "LibraryReadFilter" is specified:
--library,-library:String Name of the library to keep This argument must be specified at least once. Required.

Valid only if "MappingQualityReadFilter" is specified:
--maximum-mapping-quality:Integer
Maximum mapping quality to keep (inclusive) Default value: null.

--minimum-mapping-quality:Integer
Minimum mapping quality to keep (inclusive) Default value: 10.

Valid only if "OverclippedReadFilter" is specified:
--dont-require-soft-clips-both-ends:Boolean
Allow a read to be filtered out based on having only 1 soft-clipped block. By default,
both ends must have a soft-clipped block, setting this flag requires only 1 soft-clipped
block Default value: false. Possible values: {true, false}

--filter-too-short:Integer Minimum number of aligned bases Default value: 30.

Valid only if "PlatformReadFilter" is specified:
--platform-filter-name:String Platform attribute (PL) to match This argument must be specified at least once. Required.

Valid only if "PlatformUnitReadFilter" is specified:
--black-listed-lanes:String Platform unit (PU) to filter out This argument must be specified at least once. Required.

Valid only if "ReadGroupBlackListReadFilter" is specified:
--read-group-black-list:String The name of the read group to filter out This argument must be specified at least once.
Required.

Valid only if "ReadGroupReadFilter" is specified:
--keep-read-group:String The name of the read group to keep Required.

Valid only if "ReadLengthReadFilter" is specified:
--max-read-length:Integer Keep only reads with length at most equal to the specified value Required.

--min-read-length:Integer Keep only reads with length at least equal to the specified value Default value: 1.

Valid only if "ReadNameReadFilter" is specified:
--read-name:String Keep only reads with this read name Required.

Valid only if "ReadStrandFilter" is specified:
--keep-reverse-strand-only:Boolean
Keep only reads on the reverse strand Required. Possible values: {true, false}

Valid only if "SampleReadFilter" is specified:
--sample,-sample:String The name of the sample(s) to keep, filtering out all others This argument must be
specified at least once. Required.


A USER ERROR has occurred: n is not a recognized option


I only want the last section where it tells me about the USER ERROR.


Is it possible to use "HaplotypeCaller" without "AddOrReplaceReadGroups" ?


I have tens of thousands of BAM files on which I need to call SNPs (only at several specific loci, indicated by an interval file passed with -L). Running AddOrReplaceReadGroups on each of them would be too cumbersome. Is it possible to use HaplotypeCaller without first modifying these BAM files with AddOrReplaceReadGroups? Thanks.
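For scale, the loop I'm trying to avoid would look roughly like this (a sketch; the file names and read-group values are hypothetical placeholders, and the commands are printed for inspection rather than executed):

```shell
# Hypothetical sketch: emit one Picard AddOrReplaceReadGroups command per BAM.
# All paths and read-group fields below are placeholders.
for bam in sample1.bam sample2.bam; do
  sample=${bam%.bam}
  echo "java -jar picard.jar AddOrReplaceReadGroups \
I=${bam} O=${sample}.rg.bam \
RGID=${sample} RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=${sample}"
done
```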

Understanding -nct and -nt


I am trying to understand GATK's parameters for parallelism:

-nt / --num_threads controls the number of data threads sent to the processor

-nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

So, is the following example true?

-nt 8 means you use 8 cores

-nct 8 means you use 8 threads within a single core
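For concreteness, the arithmetic I'm assuming (a sketch of my mental model, not a statement of documented GATK behaviour) is that the total worker-thread count is nt x nct:

```shell
# Sketch: if total threads = data threads (-nt) * CPU threads per data thread (-nct),
# both configurations below would occupy 8 threads overall.
nt=8; nct=1
echo "-nt $nt -nct $nct => $((nt * nct)) threads total"
nt=1; nct=8
echo "-nt $nt -nct $nct => $((nt * nct)) threads total"
```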

Thanks!

Is it still valid to pre-process read groups separately when multiplexing?


Hello,
I have 60 DNA samples. Each one was barcoded, then pooled (10 samples per pool), and each pool was sequenced on 9 different lanes. So for each sample I have 9*2 (PE) = 18 FASTQ files, or 1,080 FASTQs in total.

My intention is to do variant calling using HaplotypeCaller.

My plan is to follow the recommendations in this tutorial: https://gatkforums.broadinstitute.org/gatk/discussion/3060/how-should-i-pre-process-data-from-multiplexed-sequencing-and-multi-library-designs

However, it has been suggested to me that this might not be the most up-to-date reference workflow, and that it is more convenient to merge all BAMs right after alignment as one read group, given that there are no lane-derived artifacts:
https://biostars.org/p/328605/#328704

My question is: is the tutorial still valid? (valid = still used by the community and at the Broad)
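To make the bookkeeping concrete, my current approach tags each (sample, lane) pair with its own read group at alignment time, roughly like this (a sketch; the sample, lane, and library names are placeholders, and only the @RG strings are printed):

```shell
# Hypothetical sketch: one @RG string per (sample, lane), so that after merging
# the per-lane BAMs, BQSR can still model lane-level covariates.
sample=sample01
for lane in 1 2 3 4 5 6 7 8 9; do
  printf '@RG\\tID:%s.L%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA\n' \
    "$sample" "$lane" "$sample" "${sample}_lib1"
done
```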

Thanks in advance!

HaplotypeCaller calls incorrect genotypes at several sites


Hi,
I found that HaplotypeCaller made heterozygous calls with no support for them in the BAM. We used IGV to compare the input BAM and the HaplotypeCaller output BAM. The region shown in the figure confused us: at the top of the figure is the input BAM, and below it is the HaplotypeCaller output BAM. The HaplotypeCaller output GVCF also reports this site as heterozygous.
It seems to be the same issue as https://gatkforums.broadinstitute.org/gatk/discussion/2319/haplotypecaller-incorrectly-making-heterozygous-calls-again. In that thread, the suggested solution was updating GATK. However, we tried both GATK 3.8 and GATK 4.0.6 and got the same results.
The command line we used is:
~/software/gatk-4.0.6.0/gatk --java-options "-Xmx30G" HaplotypeCaller -L chr01:9550000-9850000 -ERC GVCF -R <reference.fasta> -I <input.bam> -O <output_g.vcf> -bamout <output.bam>


Why does my GATK4 GenotypeGVCFs command not work?


If I run these commands:
java -Xmx64G -jar /GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fa -V in.combined.vcf.gz -o in.raw_variants.vcf.gz
java -Xmx64G -jar /gatk-4.0.6.0/gatk-package-4.0.6.0-local.jar GenotypeGVCFs -R ref.fa -V in.combined.vcf.gz -O in.raw_variants.vcf.gz

The first command (GATK3) works fine.
The latter (GATK4) gets stuck right where the action begins:
15:47:12.332 INFO GenotypeGVCFs - Done initializing engine
15:47:12.423 INFO ProgressMeter - Starting traversal
15:47:12.424 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute

Any ideas why? Or how can I get some more verbose output?
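One thing I plan to try, based on the --verbosity argument listed in the GATK4 help, is DEBUG logging (a sketch; the file names are the placeholders from my command above, and the command is only printed here):

```shell
# Sketch: re-run GenotypeGVCFs with DEBUG logging to see where it stalls.
# File names are placeholders.
cmd="java -Xmx64G -jar gatk-package-4.0.6.0-local.jar GenotypeGVCFs \
-R ref.fa -V in.combined.vcf.gz -O in.raw_variants.vcf.gz --verbosity DEBUG"
echo "$cmd"
```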
Thanks.
Tim

VariantsToBinaryPed runtime error


I have a multisample WES gVCF file processed all of the way through GenotypeGVCFs. I have used GATK 4.0.6 for everything. Now that I wish to convert my gVCF to PLINK format, I have to use GATK 3. But using VariantsToBinaryPed from GATK 3.8 (I've tried 3.7 as well) I am getting a runtime error. I originally had multiallelic sites and thought that would be the issue, but after using SelectVariant and keeping only the biallelic sites the runtime issue remains the same.

Previous issues similar to mine that were solved had the top item of the stack trace pointing at reading the metadata file; for me it points at the gVCF file itself, in a function called getGQLog10FromLikelihoods. I'm currently trying GenotypeGVCFs from 3.8 to see whether that is compatible.

INFO  09:17:44,576 HelpFormatter - Date/Time: 2018/07/24 09:17:44 
INFO  09:17:44,576 HelpFormatter - ------------------------------------------------------------------------------------ 
INFO  09:17:44,577 HelpFormatter - ------------------------------------------------------------------------------------ 
INFO  09:17:44,616 NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/kk2252/sandbox/gatk3.jar!/com/intel/gkl/native/libgkl_compression.so 
INFO  09:17:44,650 GenomeAnalysisEngine - Deflater: IntelDeflater 
INFO  09:17:44,651 GenomeAnalysisEngine - Inflater: IntelInflater 
INFO  09:17:44,651 GenomeAnalysisEngine - Strictness is SILENT 
INFO  09:17:44,816 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  09:17:44,999 GenomeAnalysisEngine - Preparing for traversal 
INFO  09:17:45,003 GenomeAnalysisEngine - Done preparing for traversal 
INFO  09:17:45,004 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  09:17:45,004 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  09:17:45,004 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
##### ERROR --
##### ERROR stack trace 
java.lang.ArrayIndexOutOfBoundsException: -1
    at htsjdk.variant.variantcontext.GenotypeLikelihoods.getGQLog10FromLikelihoods(GenotypeLikelihoods.java:220)
    at org.broadinstitute.gatk.tools.walkers.variantutils.VariantsToBinaryPed.checkGQIsGood(VariantsToBinaryPed.java:442)
    at org.broadinstitute.gatk.tools.walkers.variantutils.VariantsToBinaryPed.getStandardEncoding(VariantsToBinaryPed.java:406)
    at org.broadinstitute.gatk.tools.walkers.variantutils.VariantsToBinaryPed.getEncoding(VariantsToBinaryPed.java:398)
    at org.broadinstitute.gatk.tools.walkers.variantutils.VariantsToBinaryPed.writeIndividualMajor(VariantsToBinaryPed.java:282)
    at org.broadinstitute.gatk.tools.walkers.variantutils.VariantsToBinaryPed.map(VariantsToBinaryPed.java:267)
    at org.broadinstitute.gatk.tools.walkers.variantutils.VariantsToBinaryPed.map(VariantsToBinaryPed.java:103)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-1-0-gf15c1c3ef):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: -1
##### ERROR ------------------------------------------------------------------------------------------

VariantRecalibrator Error; No Data Found


Hi, I'm trying to run VariantRecalibrator on my VCF files. I have followed most of the pipeline, using HaplotypeCaller before this step. I am working with a file of specific genes extracted from a whole-genome sequence of a tumor sample, and am trying to run VariantRecalibrator in SNP mode.

Here is the error message:
INFO 15:14:35,232 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:14:35,235 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 15:14:35,235 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 15:14:35,235 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 15:14:35,236 HelpFormatter - [Tue Jul 24 15:14:35 BST 2018] Executing on Linux 3.10.0-693.el7.x86_64 amd64
INFO 15:14:35,236 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_131-b12
INFO 15:14:35,238 HelpFormatter - Program Args: -T VariantRecalibrator -R resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta -input extracted.genotyped.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 resources_broad_hg38_v0_hapmap_3.3.hg38.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=.20 resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an QD -mode SNP -recalFile extracted.output.recal -tranchesFile extracted.output.tranches -rscriptFile output.genotyped.plots.R
INFO 15:14:35,286 HelpFormatter - Executing as student2@ls-msc12 on Linux 3.10.0-693.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-b12.
INFO 15:14:35,286 HelpFormatter - Date/Time: 2018/07/24 15:14:35
INFO 15:14:35,286 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:14:35,286 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:14:35,314 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:14:36,339 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 15:14:38,780 GenomeAnalysisEngine - Preparing for traversal
INFO 15:14:38,787 GenomeAnalysisEngine - Done preparing for traversal
INFO 15:14:38,787 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 15:14:38,787 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 15:14:38,787 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 15:14:38,794 TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0
INFO 15:14:38,794 TrainingSet - Found omni track: Known = false Training = true Truth = true Prior = Q12.0
INFO 15:14:38,794 TrainingSet - Found 1000G track: Known = false Training = true Truth = false Prior = Q10.0
INFO 15:14:38,795 TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q0.2
INFO 15:15:08,790 ProgressMeter - chr1:158866731 3040060.0 30.0 s 9.0 s 4.9% 10.1 m 9.6 m
INFO 15:15:38,790 ProgressMeter - chr2:32996905 5958581.0 60.0 s 10.0 s 8.8% 11.4 m 10.4 m
INFO 15:16:08,791 ProgressMeter - chr2:164765745 8934105.0 90.0 s 10.0 s 12.9% 11.7 m 10.2 m
INFO 15:16:38,792 ProgressMeter - chr3:45607274 1.1837802E7 120.0 s 10.0 s 16.7% 12.0 m 10.0 m
INFO 15:17:08,793 ProgressMeter - chr3:173995509 1.4702158E7 2.5 m 10.0 s 20.7% 12.1 m 9.6 m
INFO 15:17:48,794 ProgressMeter - chr4:146994206 1.8753018E7 3.2 m 10.0 s 26.0% 12.2 m 9.0 m
INFO 15:18:18,795 ProgressMeter - chr5:84721074 2.1717071E7 3.7 m 10.0 s 30.0% 12.2 m 8.6 m
INFO 15:18:48,795 ProgressMeter - chr6:29998321 2.4723551E7 4.2 m 10.0 s 33.9% 12.3 m 8.1 m
INFO 15:19:18,796 ProgressMeter - chr6:158998294 2.770885E7 4.7 m 10.0 s 37.9% 12.3 m 7.6 m
INFO 15:19:48,797 ProgressMeter - chr7:117998677 3.0823038E7 5.2 m 10.0 s 42.0% 12.3 m 7.1 m
INFO 15:20:28,798 ProgressMeter - chr8:121277023 3.4763574E7 5.8 m 10.0 s 47.0% 12.4 m 6.6 m
INFO 15:20:58,799 ProgressMeter - chr9:116352967 3.7647625E7 6.3 m 10.0 s 51.4% 12.3 m 6.0 m
INFO 15:21:28,799 ProgressMeter - chr10:95120082 4.04697E7 6.8 m 10.0 s 55.0% 12.4 m 5.6 m
INFO 15:21:58,800 ProgressMeter - chr11:80387285 4.3313383E7 7.3 m 10.0 s 58.7% 12.5 m 5.2 m
INFO 15:22:28,806 ProgressMeter - chr12:61686300 4.6067319E7 7.8 m 10.0 s 62.3% 12.6 m 4.7 m
INFO 15:22:58,807 ProgressMeter - chr13:71168017 4.8988535E7 8.3 m 10.0 s 66.8% 12.5 m 4.1 m
INFO 15:23:38,808 ProgressMeter - chr15:60998995 5.3047905E7 9.0 m 10.0 s 73.3% 12.3 m 3.3 m
INFO 15:24:08,810 ProgressMeter - chr16:88819894 5.6127061E7 9.5 m 10.0 s 77.4% 12.3 m 2.8 m
INFO 15:24:48,811 ProgressMeter - chr19:14848081 6.0360246E7 10.2 m 10.0 s 83.0% 12.3 m 2.1 m
INFO 15:25:18,812 ProgressMeter - chr21:30210373 6.3574404E7 10.7 m 10.0 s 87.3% 12.2 m 93.0 s
INFO 15:25:48,812 ProgressMeter - chrX:113562949 6.6694362E7 11.2 m 10.0 s 92.9% 12.0 m 51.0 s
INFO 15:25:56,610 VariantDataManager - FS: mean = 0.00 standard deviation = 0.13
INFO 15:25:56,666 VariantDataManager - SOR: mean = 1.47 standard deviation = 0.73
INFO 15:25:56,710 VariantDataManager - MQ: mean = 59.74 standard deviation = 1.69
INFO 15:25:56,759 VariantDataManager - MQRankSum: mean = -0.00 standard deviation = 0.51
INFO 15:25:56,857 VariantDataManager - ReadPosRankSum: mean = 0.21 standard deviation = 0.88
INFO 15:25:56,936 VariantDataManager - QD: mean = 30.70 standard deviation = 3.26
INFO 15:25:57,191 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, SOR, MQRankSum, ReadPosRankSum]
INFO 15:25:57,216 VariantDataManager - Training with 543153 variants after standard deviation thresholding.
INFO 15:25:57,219 GaussianMixtureModel - Initializing model with 100 k-means iterations...
INFO 15:26:14,865 VariantRecalibratorEngine - Finished iteration 0.
INFO 15:26:18,813 ProgressMeter - chrUn_JTFH01001976v1_decoy:1087 6.7557875E7 11.7 m 10.0 s 99.9% 11.7 m 0.0 s
INFO 15:26:24,537 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 1.81969
INFO 15:26:33,560 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.50173
INFO 15:26:42,845 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 2.36851
INFO 15:26:48,815 ProgressMeter - chrUn_JTFH01001976v1_decoy:1087 6.7557875E7 12.2 m 10.0 s 99.9% 12.2 m 0.0 s
INFO 15:26:53,744 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.07550
INFO 15:27:04,513 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.06077
INFO 15:27:15,536 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.02782
INFO 15:27:18,816 ProgressMeter - chrUn_JTFH01001976v1_decoy:1087 6.7557875E7 12.7 m 11.0 s 99.9% 12.7 m 0.0 s
INFO 15:27:26,437 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.02561
INFO 15:27:37,924 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.02931
INFO 15:27:48,817 ProgressMeter - chrUn_JTFH01001976v1_decoy:1087 6.7557875E7 13.2 m 11.0 s 99.9% 13.2 m 0.0 s
INFO 15:27:49,460 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.03923
INFO 15:28:00,461 VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.04987
INFO 15:28:11,614 VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.08558
INFO 15:28:18,818 ProgressMeter - chrUn_JTFH01001976v1_decoy:1087 6.7557875E7 13.7 m 12.0 s 99.9% 13.7 m 0.0 s
INFO 15:28:22,738 VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.11308
INFO 15:28:33,851 VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.03488
INFO 15:28:44,767 VariantRecalibratorEngine - Finished iteration 70. Current change in mixture coefficients = 0.00368
INFO 15:28:48,819 ProgressMeter - chrUn_JTFH01001976v1_decoy:1087 6.7557875E7 14.2 m 12.0 s 99.9% 14.2 m 0.0 s
INFO 15:28:49,297 VariantRecalibratorEngine - Convergence after 72 iterations!
INFO 15:28:50,143 VariantRecalibratorEngine - Evaluating full set of 618296 variants...
INFO 15:28:50,163 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR --
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:489)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:185)
at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:115)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.

Increasing Fastaalternatereferencemaker speed


Hello,
I'm using GATK 3.6. It takes about 20 minutes for FastaAlternateReferenceMaker to build a new genome from a VCF file and a reference genome. I wanted to make this faster by filtering my VCF file so it only contains transcript regions, so I ran VCFtools with --keep-INFO-all.
However I now have this error.

Done. There were 1 WARN messages, the first 1 are repeated below.
WARN 17:48:03,374 IndexDictionaryUtils - Track /home/gnojoomi/Project_1/RNAseq_pipeline/files_made/VCF_intervals_with_transcripts.recode.vcf doesn't have a sequence dictionary built in, skipping dictionary validation

I was wondering whether this is an important error. More importantly, it still takes around 20 minutes, and I'm hoping there is a way to make this faster for my lab. If it would help, I can ask my school to update our cluster to the newest GATK version.
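One alternative I'm considering, instead of shrinking the VCF, is restricting the tool itself with the engine-level -L argument (a sketch; file names are placeholders and the command is only printed):

```shell
# Sketch: restrict FastaAlternateReferenceMaker to transcript intervals via -L,
# rather than filtering the VCF first. Paths are placeholders.
cmd="java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker \
-R ref.fa -V variants.vcf -L transcripts.intervals -o alt_ref.fa"
echo "$cmd"
```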

VariantEval and PASS filter.


I am wondering whether I should run SelectVariants before using the VariantEval tool.
I have WES data; I followed the GATK pipeline and applied VQSR to the joint-genotyped VCF file. The FILTER column indicates either "PASS" or the name of the failed filter.
Before I use VariantEval, should I use SelectVariants to keep only records with PASS in the FILTER column, or will VariantEval examine only passing variants and disregard the non-passing ones automatically, even though they exist in the VCF file?
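For reference, the pre-filtering step I'm considering is the following (GATK4 syntax assumed; file names are placeholders and the command is only printed):

```shell
# Sketch: keep only FILTER=PASS records before running VariantEval.
# --exclude-filtered drops any record with a non-PASS filter. Paths are placeholders.
cmd="gatk SelectVariants -R ref.fa -V joint.vqsr.vcf.gz \
--exclude-filtered -O joint.pass.vcf.gz"
echo "$cmd"
```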

GenotypeGVCFs WARN Track variant doesn't have a sequence dictionary built in

Hi Team,
I'm getting `WARN  21:19:30,478 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation` when processing gzipped g.vcf files produced by HaplotypeCaller (via -o foo.g.vcf.gz, as suggested by @Geraldine_VdAuwera in blog post 3893) with GenotypeGVCFs.
This results in a dramatic increase in run time (which makes sense if GenotypeGVCFs has to uncompress the files) and in memory requirements (why??) for GenotypeGVCFs, compared to processing GVCFs from the same BAM files when the HC output files are uncompressed. Most batches that previously completed with 4x8GB of RAM now produce `java.lang.OutOfMemoryError: Java heap space` errors even with 4x64GB!

Could you please advise whether this warning is expected behaviour? If yes, what exactly is missing (can't see much difference in unzipped vs gzipped vcf headers), and can this be added somehow?
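For completeness, the compression/indexing I would expect to be equivalent to HC's -o foo.g.vcf.gz output is plain bgzip plus tabix (a sketch; htslib tools, placeholder file name, commands only printed here):

```shell
# Sketch: bgzip-compress a GVCF and build a tabix index for it.
# The file name is a placeholder; commands are printed, not executed.
cmd_compress="bgzip foo.g.vcf"
cmd_index="tabix -p vcf foo.g.vcf.gz"
echo "$cmd_compress"
echo "$cmd_index"
```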

SelectVariants V4 TribbleException Contig chr1 does not have a length field


I indexed my VCF file with GATK V4.0.6.0 IndexFeatureFile, then ran GATK V4.0.6.0 SelectVariants on it, and I got an exception:

htsjdk.tribble.TribbleException: Contig chr1 does not have a length field.

When I run the same VCF using GATK V3 SelectVariants, it works.

As far as I know, ##contig entries in the VCF header should NOT have a length in them.
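For comparison, a ##contig header line that does carry a length field (the form GATK4 seems to require; the length shown is GRCh38 chr1) looks like:

```
##contig=<ID=chr1,length=248956422>
```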

GenotypeGVCFs calculation


Hello,

Is there any documentation for how GenotypeGVCFs recalculates Phred-likelihoods for genotypes across samples? I've looked fairly extensively, and I've been unable to find any concrete algorithmic details.
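For what it's worth, my working understanding (a sketch based on the VCF spec's definition of PL, not an official GATK description) is that per-sample PLs are Phred-scaled genotype likelihoods normalized so the best genotype has PL 0, and joint genotyping combines them with a genotype prior:

```latex
\mathrm{PL}_g \;=\; -10\,\log_{10}\frac{P(D \mid g)}{\max_{g'} P(D \mid g')},
\qquad
P(g \mid D) \;\propto\; P(D \mid g)\,P(g)
```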

Thank you!
