Channel: Recent Discussions — GATK-Forum

Rounds of BQSR in GATK3 GenericPreProcessingWorkflow


I reviewed the GATK3 WDL preprocessing workflow on GitHub and found something strange.

I recall that in GATK3 we needed to run BQSR twice and then use PrintReads to apply the recalibration. However, it seems the code only runs BQSR once before running PrintReads. Can anyone confirm whether we actually need to run it twice? Thanks!
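For reference, a minimal sketch of the GATK3 pattern being discussed, with placeholder file names; the optional second BaseRecalibrator pass shown at the end only produces a post-recalibration table for AnalyzeCovariates plots rather than recalibrating the reads a second time:

# Pass 1: build the recalibration table (file names are placeholders)
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I input.bam -knownSites known_sites.vcf -o recal.table

# Apply the recalibration with PrintReads
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I input.bam -BQSR recal.table -o recalibrated.bam

# Optional second pass: generates a post-recalibration table used only for before/after plots
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I input.bam -knownSites known_sites.vcf -BQSR recal.table -o post_recal.table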


An Indel was called by HaplotypeCaller, passed SNPFilter, but the AD for alt allele is low


Using openjdk-8-jre-headless and GATK version 3.7-0-gcfedb67
I ran the BAM output through HaplotypeCaller and VariantFiltration in GATK version 3.7.

Commands:

java -jar GenomeAnalysisTK_3.7.jar -T HaplotypeCaller -I sample.bam -R hg37.fasta --genotyping_mode DISCOVERY -stand_call_conf 30 -L targets.interval_list -nct 8 -o sample.vcf.gz

java -jar GenomeAnalysisTK_3.7.jar -T VariantFiltration -V sample.vcf.gz -R hg37.fasta --filterName SnpFilter --filterExpression "QD<2.0||FS>60.0||MQ<40.0||MQRankSum<-12.5||ReadPosRankSum<-8.0" --filterName DPLow --filterExpression "DP<30" --filterName GQLow --filterExpression "GQ!=-1&&GQ<30" -o sample.vcf.gz

There were two indels of interest in the VCF output.

INDEL 1
6   137143793   .   C   CCGCGTG 723.73  PASS    AC=1;ACov=365.79;AF=0.500;AN=2;BaseQRankSum=-1.732;ClippingRankSum=0.000;DP=257;ExcessHet=3.0103;FS=16.829;GQ=99;Group=group1;MLEAC=1;MLEAF=0.500;MQ=70.00;MQRankSum=0.000;QD=3.11;ReadPosRankSum=-1.522;SOR=2.200;TCov=55235   GT:AD:DP:GQ:PL  0/1:231,2:233:99:761,0,8880
INDEL 2
6   137143796   .   C   CGGGGGGGGGG 491.73  IndelFilter AC=1;ACov=365.79;AF=0.500;AN=2;BaseQRankSum=-8.303;ClippingRankSum=0.000;DP=302;ExcessHet=3.0103;FS=294.284;GQ=99;Group=group1;MLEAC=1;MLEAF=0.500;MQ=70.00;MQRankSum=0.000;QD=1.78;ReadPosRankSum=-9.180;SOR=7.657;TCov=55235  GT:AD:DP:GQ:PL  0/1:232,45:277:99:529,0,9061

Indel 2 failed the indel filter and indel 1 passed the filter.

My question is why indel 1 was called by HaplotypeCaller and had high enough quality values to pass VariantFiltration. The indel does not look real, because the allele depth is 2 for the alt allele and 231 for the ref allele. Is there a reason why something like this would be called?

Thank you in advance for the help.

A strange INDEL where AD and DP are both zero


Hi there,

Here is the mutation.

4   5858712 .   G   GTTATCACCACTATCATTATCACCCCACCACCATCATCACCATCACCACCACCACCATCATCATTACTAACATTATCACCACCACCACCATCACCACCATCATCACCACCGCCACCATCCACTATCATCACCACCATCCACTA,GTTATCACCACTATCATTATCACCCCACCACCATCATCACCATCACCACCACCACCATCATCATTACTAACATTATCACCACCACCACCATCACCACCATCATCACCACCGCCACCATCCACTATCATCACCACCATCCACTATCATCACCACCATCCACTA  1155.73 .   AC=1,1;AF=0.500,0.500;AN=2;DP=15;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;SOR=0.693   GT:AD:DP:GQ:PL  1/2:0,0,0:0:99:1193,321,251,263,0,191

Why was this mutation called when the AD and DP values in the FORMAT field are both zero? Can this mutation be trusted or not?

Is there a limit on indel size in Mutect2 (the length of the insertion or deletion)?

Is there a limit on the length of insertions or deletions that Mutect2 can call?
Thanks a lot.

How to use bwa mem for paired-end Illumina reads


Dear All,

We would like to use the bwa mem algorithm for the alignment of paired-end (100 bp) Illumina reads and for variant calling with GATK. Unfortunately, we are having some trouble understanding the command description.
Do I need to use the -p and [mates.fq] options for paired-end reads?
Or can we simply use the command below?

bwa mem -M -t 16 ref.fa read1.fq read2.fq > aln.sam

Command description (http://bio-bwa.sourceforge.net/bwa.shtml):
"If mates.fq file is absent and option -p is not set, this command regards input reads are single-end. If mates.fq is present, this command assumes the i-th read in reads.fq and the i-th read in mates.fq constitute a read pair. If -p is used, the command assumes the 2i-th and the (2i+1)-th read in reads.fq constitute a read pair (such input file is said to be interleaved). In this case, mates.fq is ignored. In the paired-end mode, the mem command will infer the read orientation and the insert size distribution from a batch of reads.
The BWA-MEM algorithm performs local alignment. It may produce multiple primary alignments for different part of a query sequence. This is a crucial feature for long sequences. However, some tools such as Picard’s markDuplicates does not work with split alignments. One may consider to use option -M to flag shorter split hits as secondary."
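For what it's worth, here is a minimal sketch of both modes described above (the read-group string and file names are placeholder assumptions):

# Two separate FASTQs; -p is NOT used, bwa pairs read i of read1.fq with read i of read2.fq
bwa mem -M -t 16 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' ref.fa read1.fq read2.fq > aln.sam

# One interleaved FASTQ; -p tells bwa that consecutive records are mates, and mates.fq is omitted
bwa mem -M -t 16 -p ref.fa interleaved.fq > aln.sam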

We appreciate your help!

Best regards,
Sugi

Problem calling SNPs with GenotypeGVCFs


Dear Mr/Ms.,

I used GenomicsDBImport and GenotypeGVCFs to call SNPs in GATK4. Due to the large reference genome (1 Gb) and sample size (100), I want to split the work by chromosome. It seems that GenomicsDBImport works for all chromosomes, but GenotypeGVCFs only works for chromosome 1. Could you please give me some suggestions? Below are the commands and log information for chromosome 2.

Look forward to hearing from you soon.
Best regards,
Baosheng

Commands (all variables are defined before the command lines):

$GATK --java-options "-Xmx24g" \
GenomicsDBImport \
${InputVCF} \
--genomicsdb-workspace-path ${OUTDIR}/chr02 \
-L Qrob_Chr02

$GATK --java-options "-Xmx48g" \
GenotypeGVCFs \
-R ${REF} \
-V gendb://${OUTDIR}/chr02 \
-all-sites \
-O ${OUTDIR}/chr02.vcf

Log file:

23:41:40.274 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/WangBS/software/GATK/gatk/build/libs/gatk-package-4.0.11.0-56-g2c0e9b0-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:42:41.990 INFO GenomicsDBImport - ------------------------------------------------------------
23:42:41.990 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.11.0-56-g2c0e9b0-SNAPSHOT
23:42:41.990 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
23:42:41.991 INFO GenomicsDBImport - Executing as WangBS@cu53 on Linux v3.10.0-693.el7.x86_64 amd64
23:42:41.991 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-b12
23:42:41.991 INFO GenomicsDBImport - Start Date/Time: January 26, 2019 11:41:40 PM CST
23:42:41.991 INFO GenomicsDBImport - ------------------------------------------------------------
23:42:41.991 INFO GenomicsDBImport - ------------------------------------------------------------
23:42:41.991 INFO GenomicsDBImport - HTSJDK Version: 2.18.1
23:42:41.991 INFO GenomicsDBImport - Picard Version: 2.18.16
23:42:41.992 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:42:41.992 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:42:41.992 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:42:41.992 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:42:41.992 INFO GenomicsDBImport - Deflater: IntelDeflater
23:42:41.992 INFO GenomicsDBImport - Inflater: IntelInflater
23:42:41.992 INFO GenomicsDBImport - GCS max retries/reopens: 20
23:42:41.992 INFO GenomicsDBImport - Requester pays: disabled
23:42:41.992 INFO GenomicsDBImport - Initializing engine
23:42:44.099 INFO IntervalArgumentCollection - Processing 115639695 bp from intervals
23:42:44.102 INFO GenomicsDBImport - Done initializing engine
23:42:44.276 INFO GenomicsDBImport - Vid Map JSON file will be written to /home/WangBS/Analyses/vcf/test/chr02/vidmap.json
23:42:44.276 INFO GenomicsDBImport - Callset Map JSON file will be written to /home/WangBS/Analyses/vcf/test/chr02/callset.json
23:42:44.276 INFO GenomicsDBImport - Complete VCF Header will be written to /home/WangBS/Analyses/vcf/test/chr02/vcfheader.vcf
23:42:44.276 INFO GenomicsDBImport - Importing to array - /home/WangBS/Analyses/vcf/test/chr02/genomicsdb_array
23:42:44.276 INFO ProgressMeter - Starting traversal
23:42:44.276 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
23:42:45.830 INFO GenomicsDBImport - Importing batch 1 with 63 samples
Buffer resized from 37294 bytes to 65464
Buffer resized from 37294 bytes to 65511
Buffer resized from 37293 bytes to 65539
Buffer resized from 37294 bytes to 65447
.....
.....
Buffer resized from 65538 bytes to 65539
Buffer resized from 65538 bytes to 65539
Buffer resized from 65538 bytes to 65539
06:50:14.219 INFO ProgressMeter - Qrob_Chr02:1 427.5 1 0.0
06:50:14.220 INFO GenomicsDBImport - Done importing batch 1/1
06:50:14.221 INFO ProgressMeter - Qrob_Chr02:1 427.5 1 0.0
06:50:14.229 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 427.5 minutes.
06:50:14.236 INFO GenomicsDBImport - Import completed!
06:50:14.236 INFO GenomicsDBImport - Shutting down engine
[January 27, 2019 6:50:14 AM CST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 428.57 minutes.
Runtime.totalMemory()=8988393472
Tool returned:
true
Using GATK jar /home/WangBS/software/GATK/gatk/build/libs/gatk-package-4.0.11.0-56-g2c0e9b0-SNAPSHOT-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx24g -jar /home/WangBS/software/GATK/gatk/build/libs/gatk-package-4.0.11.0-56-g2c0e9b0-SNAPSHOT-local.jar GenotypeGVCFs -R /home/WangBS/Reference/Qrobur/Qrob_PM1N.fa -V gendb:///home/WangBS/Analyses/vcf/test/chr02 -all-sites -O /home/WangBS/Analyses/vcf/test/chr02.vcf
06:50:19.236 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/WangBS/software/GATK/gatk/build/libs/gatk-package-4.0.11.0-56-g2c0e9b0-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
06:51:21.116 INFO GenotypeGVCFs - ------------------------------------------------------------
06:51:21.116 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.11.0-56-g2c0e9b0-SNAPSHOT
06:51:21.116 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
06:51:21.117 INFO GenotypeGVCFs - Executing as WangBS@cu53 on Linux v3.10.0-693.el7.x86_64 amd64
06:51:21.117 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-b12
06:51:21.117 INFO GenotypeGVCFs - Start Date/Time: January 27, 2019 6:50:19 AM CST
06:51:21.117 INFO GenotypeGVCFs - ------------------------------------------------------------
06:51:21.117 INFO GenotypeGVCFs - ------------------------------------------------------------
06:51:21.118 INFO GenotypeGVCFs - HTSJDK Version: 2.18.1
06:51:21.118 INFO GenotypeGVCFs - Picard Version: 2.18.16
06:51:21.118 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:51:21.118 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:51:21.118 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:51:21.118 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:51:21.118 INFO GenotypeGVCFs - Deflater: IntelDeflater
06:51:21.118 INFO GenotypeGVCFs - Inflater: IntelInflater
06:51:21.118 INFO GenotypeGVCFs - GCS max retries/reopens: 20
06:51:21.118 INFO GenotypeGVCFs - Requester pays: disabled
06:51:21.118 INFO GenotypeGVCFs - Initializing engine
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
06:51:26.212 INFO GenotypeGVCFs - Done initializing engine
06:51:26.257 INFO ProgressMeter - Starting traversal
06:51:26.257 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
06:51:26.278 INFO GenotypeGVCFs - Shutting down engine
[January 27, 2019 6:51:26 AM CST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 1.12 minutes.
Runtime.totalMemory()=1972371456
java.lang.IllegalStateException: There are no sources based on those query parameters
at com.intel.genomicsdb.reader.GenomicsDBFeatureIterator.<init>(GenomicsDBFeatureIterator.java:131)
at com.intel.genomicsdb.reader.GenomicsDBFeatureReader.query(GenomicsDBFeatureReader.java:144)
at org.broadinstitute.hellbender.engine.FeatureDataSource.refillQueryCache(FeatureDataSource.java:534)
at org.broadinstitute.hellbender.engine.FeatureDataSource.queryAndPrefetch(FeatureDataSource.java:503)
at org.broadinstitute.hellbender.engine.FeatureDataSource.query(FeatureDataSource.java:469)
at org.broadinstitute.hellbender.engine.VariantLocusWalker.lambda$traverse$2(VariantLocusWalker.java:144)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.ReferencePipeline$Head.forEachOrdered(ReferencePipeline.java:590)
at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:143)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs


In GATK4, the GenotypeGVCFs tool can only take a single input, i.e., 1) a single single-sample GVCF, 2) a single multi-sample GVCF created by CombineGVCFs, or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. The input samples must possess genotype likelihoods containing the allele <NON_REF> produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION.

Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


Using GenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21

That generates a directory called my_database containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see “extracting GVCF data from a GenomicsDB” below to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like Buffer resized from 178298 bytes to 262033 -- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified, which is handy when enumerating the GVCFs individually as above becomes arduous.
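As an illustration of those two options, here is a sketch of the batched, sample-map form of the command (the map file name and its contents are hypothetical; the format is one sample name and one GVCF path per line, separated by a tab):

    gatk GenomicsDBImport \
        --sample-name-map cohort.sample_map \
        --batch-size 50 \
        --genomicsdb-workspace-path my_database \
        -L chr20

where cohort.sample_map contains:

    mother  data/gvcfs/mother.g.vcf
    father  data/gvcfs/father.g.vcf
    son     data/gvcfs/son.g.vcf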

Then you run joint genotyping; note the gendb:// prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -new-qual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations and Common “Gotchas”:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At least one interval must be provided when using GenomicsDBImport.

  3. Input GVCFs cannot contain multiple entries for a single genomic position

  4. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.

  5. The annotation counts specified in the header MUST BE VALID! If not, you may see an error like A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV, with mention of a core dump (which may or may not be output depending on your system configuration). You can check your annotation headers with vcf-validator from VCFtools [https://github.com/vcftools/vcftools]

  6. GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.

  7. If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable TILEDB_DISABLE_FILE_LOCKING=1 before running any GenomicsDB tool (see the sketch after this list). If you don’t, you will likely see an error like Could not open array genomicsdb_array at workspace:[...]

  8. HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample callsets, MNPs can be inferred from the phase set (PS) tag in the FORMAT field.

  9. There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open GitHub issues [https://github.com/broadinstitute/gatk/issues?utf8=✓&q=is:issue+is:open+genomicsdb] to see if a fix is in progress.
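As a minimal sketch of item 7 above (the workspace path and interval are placeholders), the variable just needs to be exported in the shell before invoking the tool:

    export TILEDB_DISABLE_FILE_LOCKING=1
    gatk GenomicsDBImport \
        -V data/gvcfs/mother.g.vcf \
        --genomicsdb-workspace-path my_database \
        -L chr20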

If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Bells and Whistles

GenomicsDB now supports allele-specific annotations [ https://software.broadinstitute.org/gatk/documentation/article?id=9622 ], which have become standard in our Broad exome production pipeline.

GenomicsDB can now import directly from a Google cloud path (i.e. gs://) using NIO.

MuTect2 for amplicon did not call some variants

Hi GATK team,

I am a beginner with GATK. I performed amplicon-based target sequencing and then used GATK4 Mutect2 to call variants. However, when we compared the variants from GATK4 Mutect2 with those from the VariantCaller on the Ion Torrent Server, we found some inconsistencies. Therefore, I generated the bamout and found that some variants seem to have been realigned and were therefore not called (see figure, chr17:50196078).

In another case, the allele frequency (AF) looks homozygous in both the input BAM file and the bamout, but it is called heterozygous in the VCF shown below and in the figure at chr17:50188065.

chr17 50188065 . A G . clustered_events DP=6046;ECNT=18;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-2.846e+01;TLOD=1518.06 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/1:447,1238:0.544:1685:174,937:273,301:32,36:0,0:60:12:0.737,0.525,0.735:0.00,1.00,0

The following are my parameters, IGV and VCF data:
date;/share/app/GATK/gatk-4.0.9.0/gatk --java-options "-Xmx256g" Mutect2 -R grch38.p2_rmsk.fasta -I 18060114.bam -tumor 18060114 -O 8060114_Mutect2_tumor_maxaf1_mrra0.vcf.gz --max-population-af 1 --max-reads-per-alignment-start 0 --min-base-quality-score 0
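For reference, the bamout mentioned above can be produced by adding the assembly BAM output option to the same kind of command (a sketch; the output file names here are placeholders):

/share/app/GATK/gatk-4.0.9.0/gatk Mutect2 -R grch38.p2_rmsk.fasta -I 18060114.bam -tumor 18060114 \
    --bam-output 18060114.bamout.bam \
    -O 18060114_Mutect2.vcf.gz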

Your advice is highly appreciated, and I look forward to your reply.

Thank you!

Respectfully yours,
Ching-Yuan Wang

GATK4 - CreateSomaticPanelOfNormals. What's under the hood ?


Hi,

I am using GATK4 (4.0.12) to create a panel of normals to be used with Mutect2.

The

Version numbers


GATK4 version numbers are based on semantic versioning. If that term doesn't sound familiar to you, rest assured it's just a fancy way of saying that the version numbers are structured in a meaningful way.

The official "semver" semantic versioning scheme describes a system of MAJOR.MINOR.PATCH version, where a MAJOR version bump involves breaking changes, a MINOR version bump adds functionality without breaking anything, and a PATCH version provides bug fixes (to existing functionality) that don't break anything.

In GATK4, we apply a relaxed interpretation of that scheme. We use a PATCH version for a release that has only very minor changes or bug fixes, and a MINOR version for "typical" releases containing some new features and some bug fixes. We go to a MAJOR version on an exceptional basis, when we have major new features to show off.

So, how do GATK4 version numbers match up to this? The MAJOR.MINOR.PATCH version numbers are those that come after 4. So for example, version 4.0.12.0 was the 12th minor version of the initial (0th) release of GATK4, and it has not received any patches. Now you may point out, waitafrigginminute, isn't the 4 itself a version number? Well, yes it is, but we decided it should be considered "bigger than MAJOR" so that we wouldn't already be jumping to GATK5 after, like, a year of development. We made such a big deal out of that 4, we want to keep it around for a while, y'know? So we kept it as a sort of prefix to the "proper" semver-compliant version.

There you go, fascinating bit of GATK4 trivia right there.

Mutect2

Cromwell terminates unexpectedly in Google cloud. UnknownHostException: genomics.googleapis.com


Hi,

I am trying to run a WGS pipeline, but it stops at the FASTQ-to-BAM step, complaining about "java.net.UnknownHostException: genomics.googleapis.com". A small test FASTQ file can successfully complete the whole pipeline in Google Cloud, but when I try a WGS FASTQ it fails as shown below. Any advice on why this happens? I don't quite understand why it's getting this error message.

Below is part of the log of the failure. I am using Cromwell version 36.

[2019-01-28 21:42:15,24] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.prefastqc:NA:1]: Status change from - to Running
[2019-01-28 21:43:56,72] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.ScatterIntervalList:NA:1]: Status change from Running to Success
[2019-01-28 23:36:16,31] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.prefastqc:NA:1]: Status change from Running to Success
[2019-01-29 03:11:59,22] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.trim:NA:1]: Status change from Running to Success
[2019-01-29 03:12:00,73] [info] WorkflowExecutionActor-6f4b745c-0c1f-4b6e-84c6-0d768079e2d3 [6f4b745c]: Starting W1.postfastqc, W1.PairedFastQsToUnmappedBAM
[2019-01-29 03:12:03,99] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.PairedFastQsToUnmappedBAM:NA:1]: /gatk/gatk --java-options "-Xmx3000m" \
FastqToSam \
--FASTQ /cromwell_root/pca_binf_test/bill-w1-cromwell-execution/W1/6f4b745c-0c1f-4b6e-84c6-0d768079e2d3/call-trim/NEUCV649UJK_ATTCAGAA_HJY2KCCXX_L1_4_001.R1.fastq.gz_R1_trim.fq.gz \
--FASTQ2 /cromwell_root/pca_binf_test/bill-w1-cromwell-execution/W1/6f4b745c-0c1f-4b6e-84c6-0d768079e2d3/call-trim/NEUCV649UJK_ATTCAGAA_HJY2KCCXX_L1_4_001.R1.fastq.gz_R2_trim.fq.gz \
--OUTPUT NEUCV649UJK_ATTCAGAA_HJY2KCCXX_L1_4_001.unmapped.bam \
--READ_GROUP_NAME NEUCV649UJK_ATTCAGAA_HJY2KCCXX_L1_4_001 \
--SAMPLE_NAME NEUCV649UJK_ATTCAGAA_HJY2KCCXX_L1_4_001 \
--LIBRARY_NAME NEUCV649UJK_ATTCAGAA_HJY2KCCXX_L1_4_001 \
--PLATFORM_UNIT NEUCV649UJK_ATTCAGAA_HJY2KCCXX_L1_4_001 \
--RUN_DATE 2019-01-25T22:14:37 \
--PLATFORM illumina \
--SEQUENCING_CENTER BI
[2019-01-29 03:12:07,52] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.postfastqc:NA:1]: fastqc -t 4 --outdir $PWD /cromwell_root/pca_binf_test/bill-w1-cromwell-execution/W1/6f4b745c-0c1f-4b6e-84c6-0d768079e2d3/call-trim/NEUCV649UJK_ATTCAGAA_HJY2KCCXX_L1_4_001.R1.fastq.gz_R1_trim.fq.gz /cromwell_root/pca_binf_test/bill-w1-cromwell-execution/W1/6f4b745c-0c1f-4b6e-84c6-0d768079e2d3/call-trim/NEUCV649UJK_ATTCAGAA_HJY2KCCXX_L1_4_001.R1.fastq.gz_R2_trim.fq.gz
[2019-01-29 03:12:36,15] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.PairedFastQsToUnmappedBAM:NA:1]: job id: operations/EKup76mJLRiajZ6C85PAoG4g6OKq67wbKg9wcm9kdWN0aW9uUXVldWU
[2019-01-29 03:12:36,15] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.postfastqc:NA:1]: job id: operations/EJyp76mJLRjazoGLuIr6o6UBIOjiquu8GyoPcHJvZHVjdGlvblF1ZXVl
[2019-01-29 03:13:07,02] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.PairedFastQsToUnmappedBAM:NA:1]: Status change from - to Running
[2019-01-29 03:13:07,02] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.postfastqc:NA:1]: Status change from - to Running
[2019-01-29 04:30:18,60] [info] Message [cromwell.docker.DockerHashActor$DockerHashFailedResponse] from Actor[akka://cromwell-system/user/HealthMonitorDockerHashActor#-2047665085] to Actor[akka://cromwell-system/deadLetters] was not delivered. [1] dead letters encountered, no more dead letters will be logged. If this is not an expected behavior, then [Actor[akka://cromwell-system/deadLetters]] may have terminated unexpectedly, This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2019-01-29 04:35:15,79] [info] PipelinesApiAsyncBackendJobExecutionActor [6f4b745cW1.postfastqc:NA:1]: Status change from Running to Success
[2019-01-29 05:43:27,62] [error] The JES API worker actor Actor[akka://cromwell-system/user/SingleWorkflowRunnerActor/JES-Singleton/PAPIQueryManager/PAPIQueryWorker-aba5cceb-6c41-421d-842c-becadaf4269a#-146857563] unexpectedly terminated while conducting 1 polls. Making a new one...
java.net.UnknownHostException: genomics.googleapis.com
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:673)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1334)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1309)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:259)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:77)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
at cromwell.backend.google.pipelines.common.api.PipelinesApiRequestWorker.runBatch(PipelinesApiRequestWorker.scala:56)
at cromwell.backend.google.pipelines.common.api.PipelinesApiRequestWorker.cromwell$backend$google$pipelines$common$api$PipelinesApiRequestWorker$$handleBatch(PipelinesApiRequestWorker.scala:50)
at cromwell.backend.google.pipelines.common.api.PipelinesApiRequestWorker$$anonfun$receive$1.applyOrElse(PipelinesApiRequestWorker.scala:35)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at cromwell.backend.google.pipelines.common.api.PipelinesApiRequestWorker.aroundReceive(PipelinesApiRequestWorker.scala:19)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
at akka.actor.ActorCell.invoke(ActorCell.scala:557)

New year, new Comms team members!


I can't quite believe it's been a full year since we released GATK4! The tools have evolved a lot since then, and as a matter of fact we're due for another major version release very soon indeed. So we're going to be talking a lot more about that over the next few weeks -- specifically, the GATK developers are going to tell you all about their latest work in a series of guest posts on this very blog.

But before we get to all that cool new stuff, I want to take a moment to introduce you to the wonderful people who most recently joined us (the DSP* Communications Team) in our mission to help all of you make effective use of our tools in your work.

* DSP = Data Sciences Platform of the Broad Institute


For context, our Comms team is dedicated to providing education, support and outreach for the software and services produced by the Data Sciences Platform at the Broad Institute (not just GATK anymore). It counts 10 people (excluding yours truly, since as Head Bureaucrat I barely do anything anymore), organized in two sub-teams: User Education under our Lead Educator Robert Majovski, and Frontline Support under our Senior Community Manager Tiffany Miller. You'll hear more from both Robert and Tiffany in the near future.

Robert's User Education team specializes in writing docs and developing educational materials. It includes veteran scientific writer Soo Hee Lee, who has penned much of our documentation, as well as newcomers Allie Hajian and Anton Kovalsky. So far Allie and Anton have mostly been rocking the cloud platform side of our world but you'll likely start seeing their fingerprints in more of our GATK-related materials in the near future.

Tiffany's Frontline Support team specializes in bringing you swift helpful answers through the forums and helpdesk. It includes pipelining veteran Beri Shifaw, the indispensable curator of gatk-workflows and associated workspaces in FireCloud, as well as newcomers Bhanu Gandham, Sushma Chaluvadi and Adelaide Rhodes. They all contribute to providing top-notch frontline support across all our products, but you may recognize Bhanu in particular as the point person for GATK. You can expect to hear more from Bhanu on this blog in the very near future!

Both teams additionally contribute to developing resources like preloaded workspaces in DSP's cloud-based analysis platform, FireCloud, as well as developing and teaching workshops. All this cross-team work is facilitated by veteran forum specialist Kate Noblett in her current capacity as Senior Project Coordinator.

Please join me in welcoming our new team members --and appreciating our veterans-- as we gear up for an exciting new season of GATK developments!

SNP chip database for Base Quality Score Recalibration


Hi all,

I want to perform Base Quality Score Recalibration on maize data, and I am deciding how to obtain:

A database of known polymorphic sites to mask out

One of the possibilities I am exploring is to use positions from a SNP chip, because I know the positions on the chip come from high-quality SNPs; there are around 600K positions. Do you think it would be a good idea to use this as my database? I expect to obtain around 20 million SNPs from my calling, so I am wondering whether this database is too small.
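For context, the chip positions would be supplied to BaseRecalibrator as a known-sites VCF, roughly like this (a sketch under assumed file names; the chip sites would need to be expressed against the same reference used for alignment):

gatk BaseRecalibrator \
    -R maize_ref.fasta \
    -I sample.bam \
    --known-sites chip_600k_sites.vcf.gz \
    -O recal.table

gatk ApplyBQSR \
    -R maize_ref.fasta \
    -I sample.bam \
    --bqsr-recal-file recal.table \
    -O sample.recal.bam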

Best.

Fast and accurate genomic analyses using genome graphs, anyone used?


Understanding my VQSR results


Hi Everyone, I have a few questions about VQSR that I was hoping to have answered. First, a bit about my project. I am interested in finding de novo mutations in human (multi-sibling) families, looking at the whole genome. The 14 families I am dealing with have anywhere between 2 and 6 children (for a total of 79 whole genomes in this cohort). I have implemented the preprocessing pipeline, the germline short variant discovery workflow and finally the genotype refinement workflow for germline short variants. The only major change I have made was to parallelize as much as possible by processing an entire family at a time (as opposed to single individuals). This means that I have been applying VQSR to each family independently (4-8 whole genomes each time).

At first I was getting the dreaded 'no data' error when running VariantRecalibrator on the SNPs (step 1). I added the parameter '--max-gaussians 4' and this seems to have solved the problem. Here is an example of how I call it:

~/gatk-4.0.11.0/gatk VariantRecalibrator \
    -R ~/hg38/Homo_sapiens_assembly38.fasta \
    -V ~/8_rvqs/1041_08/1041_08.vcf \
    --resource hapmap,known=false,training=true,truth=true,prior=15.0:~/hg38/hapmap_3.3.hg38.vcf.gz \
    --resource omni,known=false,training=true,truth=true,prior=12.0:~/hg38/1000G_omni2.5.hg38.vcf.gz \
    --resource 1000G,known=false,training=true,truth=false,prior=10.0:~/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    --resource dbsnp,known=true,training=false,truth=false,prior=2.0:~/hg38/dbsnp_146.hg38.vcf.gz \
    -an DP \
    -an QD \
    -an FS \
    -an SOR \
    -an MQ \
    -an MQRankSum \
    -an ReadPosRankSum \
    -mode SNP \
    --max-gaussians 4 \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -O ~/8_rvqs/1041_08/1041_08_recalibrate_SNP.recal \
    --tranches-file ~/8_rvqs/1041_08/1041_08_recalibrate_SNP.tranches \
    --rscript-file ~/8_rvqs/1041_08/1041_08_recalibrate_SNP_plots.R \
    >>~/8_rvqs/1041_08/1041_08_VR_SNP.log 2>&1

Other than '--max-gaussians 4', I think this is straight from the documentation. I then apply the model for SNPs, build a model for indels, and finally apply the model for indels. Everything runs successfully and I have attached the plots generated. Now for my questions:

1) Looking at the SNP and INDEL plots I'm not sure how to evaluate them. Do these look reasonable? The plots generated from family to family look similar.

2) From what I know, DP means read depth. The average read depth for the individuals in this example family is between 33 and 35, but the plots show a DP range between 0 and 400. Does this make sense somehow?

3) For the SNP tranche plot, in the example family here I have a Ti/Tv range of 1 - 1.6. I have read on here and in other places online, as well as in the talk on VQSR I found on YouTube, that you would ideally like a Ti/Tv over 2 for human data. Is this something I should worry about? This is similar across all of my families.

4) My last question may be the most basic: if I'm only interested in identifying de novo mutations, do I need to worry about any of this? After VQSR I take the results and feed them directly into the genotype refinement workflow for germline short variants and then identify the annotated de novo mutations. I never do any explicit filtering on the scores calculated here (I assume they are used to calculate posterior probabilities in the next step). So if I'm not filtering based on VQSLOD, should I be worried about any of this? Or should I definitely be filtering based on VQSLOD before proceeding?

Please let me know if any other details would be helpful here. Thanks!

[attached image: VQSR recalibration and tranche plots]

CalculateGenotypePosteriors produces a bunch of zero coverage variants


Hi GATK Team,

we are running small targeted panels on GATK4. It seems that most of the variants (~90%) are DP 0 variants, emerging after applying CalculateGenotypePosteriors. Before this step we run VQSR with over thirty exomes. Should we use external databases for CalculateGenotypePosteriors?

This is what we do now:

GATK version 4.0.4.0
${tool_gatk} --java-options "${javaArg_xms} ${javaArg_xmx}" CalculateGenotypePosteriors \
    -R ${reference} \
    -V ${outDirectory}/${variant_vcf}.vcf \
    -O ${outDirectory}/${variant_vcf}.postCGP.vcf \
    --supporting ${knownsite_hapmap} \
    --pedigree ${pedigree}

Greetings
Martin

Error with GATK ModelSegments


I am using the BETA tool "ModelSegments" in a copy number variation analysis and I've run into an error that I don't understand. Within our institution's cluster computing environment, I submitted the following job:

COMMON_DIR="/home/exacloud/lustre1/BioDSP/users/jacojam"
GATK=$COMMON_DIR"/programs/gatk-4.0.4.0"
ALIGNMENT_RUN_T="hg19_BWA_alignment_10058_tumor"
ALIGNMENT_RUN_N="hg19_BWA_alignment_10058_normal"
ALLELIC_COUNTS_T=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_T"/tumor.allelicCounts.tsv"
ALLELIC_COUNTS_N=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_N"/normal.allelicCounts.tsv"
OUTPUT_DIR=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_T"/GATK_CNV"

srun $GATK/gatk --java-options "-Xmx10000m" ModelSegments --allelic-counts $ALLELIC_COUNTS_T --normal-allelic-counts $ALLELIC_COUNTS_N --output-prefix 10058 -O $OUTPUT_DIR

From this, I get the following error:

Using GATK jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10000m -jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gat$
06:42:48.839 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
06:42:49.212 INFO ModelSegments - ------------------------------------------------------------
06:42:49.212 INFO ModelSegments - The Genome Analysis Toolkit (GATK) v4.0.4.0
06:42:49.212 INFO ModelSegments - For support and documentation go to https://software.broadinstitute.org/gatk/
06:42:49.213 INFO ModelSegments - Executing as jacojam@exanode-3-7.local on Linux v3.10.0-693.17.1.el7.x86_64 amd64
06:42:49.213 INFO ModelSegments - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_161-b14
06:42:49.213 INFO ModelSegments - Start Date/Time: May 2, 2018 6:42:48 AM PDT
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.214 INFO ModelSegments - HTSJDK Version: 2.14.3
06:42:49.214 INFO ModelSegments - Picard Version: 2.18.2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:42:49.214 INFO ModelSegments - Deflater: IntelDeflater
06:42:49.214 INFO ModelSegments - Inflater: IntelInflater
06:42:49.214 INFO ModelSegments - GCS max retries/reopens: 20
06:42:49.214 INFO ModelSegments - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
06:42:49.215 WARN ModelSegments -

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: ModelSegments is a BETA tool and is not yet ready for use in production

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

06:42:49.215 INFO ModelSegments - Initializing engine
06:42:49.215 INFO ModelSegments - Done initializing engine
06:42:49.224 INFO ModelSegments - Reading file (/home/exacloud/lustre1/BioDSP/users/jacojam/data/hnscc/DNASeq/hg19_BWA_alignment_10058_tumor/tumor.allelicCounts.tsv)...
06:15:44.797 INFO ModelSegments - Shutting down engine
[May 3, 2018 6:15:44 AM PDT] org.broadinstitute.hellbender.tools.copynumber.ModelSegments done. Elapsed time: 1,412.93 minutes.
Runtime.totalMemory()=6298271744
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at com.opencsv.CSVParser.parseLine(CSVParser.java:383)
at com.opencsv.CSVParser.parseLineMulti(CSVParser.java:299)
at com.opencsv.CSVReader.readNext(CSVReader.java:275)
at org.broadinstitute.hellbender.utils.tsv.TableReader.fetchNextRecord(TableReader.java:348)
at org.broadinstitute.hellbender.utils.tsv.TableReader.access$200(TableReader.java:94)
at org.broadinstitute.hellbender.utils.tsv.TableReader$1.hasNext(TableReader.java:458)
at java.util.Iterator.forEachRemaining(Iterator.java:115)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractRecordCollection.<init>(AbstractRecordCollection.java:82)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractLocatableCollection.<init>(AbstractLocatableCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractSampleLocatableCollection.<init>(AbstractSampleLocatableCollection.java:44)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AllelicCountCollection.<init>(AllelicCountCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments$$Lambda$29/27313641.apply(Unknown Source)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.readOptionalFileOrNull(ModelSegments.java:559)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.doWork(ModelSegments.java:462)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
srun: error: exanode-3-7: task 0: Exited with exit code 1

Is this something you could potentially help me with? Thank you.

ECNT Value in Mutect2

Hello,

I am using Mutect2 and FilterMutectCalls to call variants in mtDNA. According to the VCF file, ECNT is described as "Number of events in this haplotype". I am assuming that this is the number of times a particular mutation was found in all the reads covering that base pair. I am concerned because in my data I am seeing clusters of mutations with the same ECNT value. For example:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
gi|9626243|ref|NC_001416.1| 115 . C T VL PASS DP=445;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-7.025e+01;TLOD=6.86 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 157 . T A VL PASS DP=427;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.113e+02;TLOD=8.55 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 470 . C T VL PASS DP=703;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.859e+02;TLOD=15.58 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 500 . A T VL PASS DP=691;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.927e+02;TLOD=7.59 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 601 . CT C VL PASS DP=671;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.766e+02;RPA=3,2;RU=T;STR;TLOD=7.68 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 635 . C T VL PASS DP=665;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.602e+02;TLOD=34.37 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 645 . C T VL PASS DP=668;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.818e+02;TLOD=7.87 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 704 . T TAAAAAA VL PASS DP=660;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.845e+02;RPA=5,11;RU=A;STR;TLOD=5.89 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 736 . A G VL PASS DP=666;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.723e+02;TLOD=20.03 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 754 . A G VL PASS DP=654;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.606e+02;TLOD=29.79 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 788 . A T VL PASS DP=639;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.738e+02;TLOD=6.65 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 898 . C T VL PASS DP=671;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.821e+02;TLOD=5.38 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 958 . A G VL PASS DP=671;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.847e+02;TLOD=6.69

Is there an explanation for why the data would look like this? It seems odd that each mutation would occur exactly the same number of times as the variants surrounding it.

Thank you,

kzwon

Intervals and interval lists


Interval lists define subsets of genomic regions, sometimes even just individual positions in the genome. You can provide GATK tools with intervals or lists of intervals when you want to restrict them to operating on a subset of genomic regions. There are four main types of reasons for doing so:

  • You want to run a quick test on a subset of data (often used in troubleshooting)
  • You want to parallelize execution of an analysis across genomic regions
  • You need to exclude regions that have bad or uninformative data where a tool is getting stuck
  • The analysis you're running should only take data from those subsets due to how the underlying algorithm works

Regarding the last case, see the Best Practices workflow recommendations and tool example commands for guidance on when to restrict analysis to intervals.


Interval-related arguments and syntax

Arguments for specifying and modifying intervals are provided by the engine and can be applied to most if not all tools. The main arguments you need to know about are the following:

  • -L / --intervals allows you to specify an interval or list of intervals to include.
  • -XL / --exclude-intervals allows you to specify an interval or list of intervals to exclude.
  • -ip / --interval-padding allows you to add padding (in bp) to the intervals you include.
  • -ixp / --interval-exclusion-padding allows you to add padding (in bp) to the intervals you exclude.

By default the engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by specifying an alternate interval merging rule (see --interval-merging-rule in the Tool Docs).

The syntax for using -L is as follows; it applies equally to -XL:

  • -L chr20 for contig chr20.
  • -L chr20:1-100 for contig chr20, positions 1-100.
  • -L intervals.list (or intervals.interval_list, or intervals.bed) when specifying a text file containing intervals (see supported formats below).
  • -L variants.vcf when specifying a VCF file containing variant records; their genomic coordinates will be used as intervals.

If you want to provide several intervals or several interval lists, just pass them in using separate -L or -XL arguments (you can even use both of them in the same command). You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by specifying an alternate interval set rule (see --interval-set-rule in the Tool Docs).
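For example, a command combining several of these arguments might look like this (a sketch with placeholder file names; HaplotypeCaller stands in for any interval-aware tool, and UNION is shown explicitly even though it is the default):

    gatk HaplotypeCaller \
        -R ref.fasta \
        -I sample.bam \
        -L chr20 \
        -L capture_targets.interval_list \
        -XL blacklist.bed \
        --interval-set-rule UNION \
        -O sample.vcf.gz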


Supported interval list formats

GATK supports several types of interval list formats: Picard-style .interval_list, GATK-style .list, BED files with extension .bed, and VCF files. The intervals MUST be sorted by coordinate (in increasing order) within contigs, and the contigs must be sorted in the same order as in the sequence dictionary. This is required for efficiency reasons.

A. Picard-style .interval_list

Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6

This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. apply hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).

B. GATK-style .list or .intervals

This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/ contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
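For instance, a GATK-style intervals file might contain the following (made-up coordinates, shown only to illustrate the format):

    chr20:1000000-2000000
    chr21:5000000-5500000
    chr22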

C. BED files with extension .bed

We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.
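To make the offset concrete, the same region would be written as follows in the two formats (illustrative coordinates; in a real BED file the fields are tab-separated):

    GATK-style, 1-based:   chr20:1000001-2000000
    BED, 0-based start:    chr20  1000000  2000000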

D. VCF files

Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.
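A sketch of that scenario (file names are placeholders):

    gatk HaplotypeCaller \
        -R ref.fasta \
        -I sample.bam \
        -L colleague_calls.vcf \
        -ip 100 \
        -O recalled.vcf.gz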


Obtaining suitable interval lists

So where do those intervals come from? It depends a lot on what you're working with (everyone's least favorite answer, I know). The most important distinction is the sequencing experiment type: is it whole genome, or targeted sequencing of some sort?

Targeted sequencing (exomes, gene panels etc.)

For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).

We make our exome interval lists available, but be aware that they are specific to the custom exome targeting kits used at the Broad. If you got your sequencing done somewhere else, you should seek to get the appropriate intervals list from the sequencing provider.

Whole genomes (WGS)

For whole genome sequence, the interval lists don’t depend on the prep (since in principle you captured the “whole genome”), so instead they depend on what regions of the genome you want to blacklist (e.g. centromeric regions that waste your time for nothing) and how the reference genome build enables you to cut up regions (separated by Ns) for scatter-gather parallelizing.

We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist! Which you can examine by looking at the intervals themselves. However, we don't currently have documentation on their provenance, sorry -- baby steps.
