Channel: Recent Discussions — GATK-Forum

BaseRecalibrator : getContigNames(SequenceDictionaryUtils.java:463)

Hi,

I'm trying to use the BaseRecalibrator tool on a BAM file, but the program does not run to completion, and the messages it returns have not allowed me to correct the error myself. I am running version 4.1.2.0 of GATK4.

Here is the complete message:

```
16:09:12.733 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data2/home/pamesl/miniconda3/envs/smk_env/share/gatk4-4.1.2.0-1/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jul 11, 2019 4:09:14 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
16:09:14.487 INFO BaseRecalibrator - ------------------------------------------------------------
16:09:14.488 INFO BaseRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.2.0
16:09:14.488 INFO BaseRecalibrator - For support and documentation go to
16:09:14.488 INFO BaseRecalibrator - Executing as pamesl@NODE01 on Linux v2.6.32-573.7.1.el6.x86_64 amd64
16:09:14.489 INFO BaseRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
16:09:14.489 INFO BaseRecalibrator - Start Date/Time: 11 juillet 2019 16:09:12 CEST
16:09:14.489 INFO BaseRecalibrator - ------------------------------------------------------------
16:09:14.489 INFO BaseRecalibrator - ------------------------------------------------------------
16:09:14.490 INFO BaseRecalibrator - HTSJDK Version: 2.19.0
16:09:14.490 INFO BaseRecalibrator - Picard Version: 2.19.0
16:09:14.490 INFO BaseRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:09:14.491 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:09:14.491 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:09:14.491 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:09:14.491 INFO BaseRecalibrator - Deflater: IntelDeflater
16:09:14.491 INFO BaseRecalibrator - Inflater: IntelInflater
16:09:14.491 INFO BaseRecalibrator - GCS max retries/reopens: 20
16:09:14.491 INFO BaseRecalibrator - Requester pays: disabled
16:09:14.492 INFO BaseRecalibrator - Initializing engine
16:09:15.263 INFO FeatureManager - Using codec VCFCodec to read file file:///data1/scratch/pamesl/projet_cbf/data/dbSNP/dbsnp_138.hg19.vcf.gz
16:09:15.411 INFO FeatureManager - Using codec VCFCodec to read file file:///data1/scratch/pamesl/projet_cbf/data/mills_1000G/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
16:09:15.428 INFO BaseRecalibrator - Shutting down engine
[11 juillet 2019 16:09:15 CEST] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=2224553984
java.lang.NullPointerException
at org.broadinstitute.hellbender.utils.SequenceDictionaryUtils.getContigNames(SequenceDictionaryUtils.java:463)
at org.broadinstitute.hellbender.utils.SequenceDictionaryUtils.getCommonContigsByName(SequenceDictionaryUtils.java:457)
at org.broadinstitute.hellbender.utils.SequenceDictionaryUtils.compareDictionaries(SequenceDictionaryUtils.java:234)
at org.broadinstitute.hellbender.utils.SequenceDictionaryUtils.validateDictionaries(SequenceDictionaryUtils.java:150)
at org.broadinstitute.hellbender.utils.SequenceDictionaryUtils.validateDictionaries(SequenceDictionaryUtils.java:98)
at org.broadinstitute.hellbender.engine.GATKTool.validateSequenceDictionaries(GATKTool.java:760)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:702)
at org.broadinstitute.hellbender.engine.ReadWalker.onStartup(ReadWalker.java:50)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /data2/home/pamesl/miniconda3/envs/smk_env/share/gatk4-4.1.2.0-1/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /data2/home/pamesl/miniconda3/envs/smk_env/share/gatk4-4.1.2.0-1/gatk-package-4.1.2.0-local.jar BaseRecalibrator -I /data1/scratch/pamesl/projet_cbf/data/bam/SJCBF016_G-C0DG1ACXX.5_marked_duplicates.bam -R /data1/scratch/pamesl/projet_cbf/data/hg19_data/reference_hg19/ucsc.hg19.fasta.gz --known-sites /data1/scratch/pamesl/projet_cbf/data/dbSNP/dbsnp_138.hg19.vcf.gz --known-sites /data1/scratch/pamesl/projet_cbf/data/mills_1000G/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -O /data1/scratch/pamesl/projet_cbf/data/bam/recal_data_SJCBF016_G-C0DG1ACXX.5.table
```

I checked the validity of the BAM file SJCBF016_G-C0DG1ACXX.5_marked_duplicates.bam using the ValidateSamFile tool and got the following result:

```
No errors found
Tool returned:
0
```

I have a feeling that the problem comes from my Mills_and_1000G_gold_standard.indels.hg19.sites.vcf, dbsnp_138.hg19.vcf.gz, or my reference file ucsc.hg19.fasta.gz, but I don't know which way to go.

Edit: I will run ValidateVariants on each VCF file and post the results tomorrow.
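For reference, a sketch of that check using the paths from the command above, one VCF at a time:

```
gatk ValidateVariants \
    -R /data1/scratch/pamesl/projet_cbf/data/hg19_data/reference_hg19/ucsc.hg19.fasta.gz \
    -V /data1/scratch/pamesl/projet_cbf/data/dbSNP/dbsnp_138.hg19.vcf.gz
```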

Best regards,

Paul-Arthur

BQSR: How to speed this step up?


Hi GATK team members,

I am running a 120G WGS bam file through BaseRecalibrator, and it took me more than a day to even finish chr2. I am writing to learn your advice on speeding this step up. Below are some of my thoughts:

  1. I recall that GATK version 3 command lines started with java -Xmx8g ... so that I could assign memory. Now with GATK 4, the command line starts with "gatk", so how should I specify memory? (See the sketch after this list.)
  2. I understand that in theory I could split the BAM by chromosome, run BQSR on each piece, and then recombine. I would like to avoid that if at all possible, because I have too many samples to handle and doing this would take too much storage space.
  3. I would also love to learn how to parallelize this step of GATK and/or distribute the work across different threads.
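A minimal sketch covering points 1 and 3 (the -Xmx value, paths, and interval are illustrative): the GATK 4 wrapper forwards JVM options through --java-options, and BaseRecalibrator can be scattered over intervals with -L without splitting the BAM, then the per-interval reports merged with GatherBQSRReports.

```
gatk --java-options "-Xmx8g" BaseRecalibrator \
    -I sample.bam \
    -R reference.fasta \
    --known-sites dbsnp.vcf.gz \
    -L chr1 \
    -O recal_chr1.table

# after one such job per interval:
gatk GatherBQSRReports -I recal_chr1.table -I recal_chr2.table -O recal.merged.table
```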

Thanks a lot.

Helen

GATK 4.1.1.0 GenomicsDBImport error : Duplicate fields exist in vid attribute "fields" and 2 errors

Hello GATK team!
I am currently using Mutect2, FilterMutectCalls, and GenomicsDBImport for somatic calling. In the Mutect2 and FilterMutectCalls steps I got the samples' gVCFs fine. However, when I use GenomicsDBImport to combine all the gVCFs, the step fails with several errors and I am running out of ideas. Thank you.
GATK4.1.1.0
1: Mutect2 .bam to .g.vcf; seems OK and generates .vcf, .vcf.idx and .vcf.stats
gatk Mutect2 --reference .../hg19.fa --input ....bam --output ...g.vcf -ERC GVCF --tmp-dir ...
2: FilterMutectCalls .g.vcf to .g.vcf; seems OK and generates .vcf, .vcf.idx and .vcf.filteringStats.tsv
gatk FilterMutectCalls --reference .../hg19.fa --variant ...g.vcf --intervals ...hg19.bed --output ...g.vcf --tmp-dir ...
3. When combining:
gatk GenomicsDBImport --reference .../hg19.fa --sample-name-map ${sample_mapFile} --validate-sample-name-map true --intervals ...hg19.bed --genomicsdb-workspace-path ... --max-num-intervals-to-import-in-parallel 20 --consolidate true --batch-size 100 --merge-input-intervals true --tmp-dir ...
This command is the same one that works in my germline pipeline, and the VCF files themselves can be read. But I got this error:
Duplicate field name TLOD found in vid attribute "fields"
Duplicate field name TLOD found in vid attribute "fields"
terminate called after throwing an instance of 'FileBasedVidMapperException'
terminate called recursively
what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute "fields"
4. I deleted this header line and re-ran:
##INFO=
Then got this error:
htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 171: . is not a valid start position in the VCF format, for input source: file:///home/yb87626/breast/variantCalling/SRR8437498.postM2.g.vcf
at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:797)
at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:324)
...
5. I deleted this line and re-ran:
##tumor_sample=SAMN10735600
Then got this error:
[July 3, 2019 7:10:52 AM UTC] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.06 minutes.
Runtime.totalMemory()=1243086848
htsjdk.tribble.TribbleException: Line 169: there aren't enough columns for line END=17447;STRANDQ=93 GT:DP:MIN_DP:TLOD 0/0:1:1:-4.765e-01 (we expected 9 tokens, and saw 3 ), for input source: file:///home/yb87626/breast/variantCalling/SRR8437498.postM2.g.vcf
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:296)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:277)
...
You can see the program recognizes 'END=17447;STRANDQ=93 GT:DP:MIN_DP:TLOD 0/0:1:1:-4.765e-01' as a whole line, but there are columns before these fields in the same line. The problem may not be this particular line, because the same problem happens at the next line when I delete it.
Now I don't know how to solve this. And were steps 4 and 5 the right thing to do?
Part of .g.vcf:
##fileformat=VCFv4.2
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMN10735589
chr1 1 . N . PASS END=17405;STRANDQ=93 GT:DP:MIN_DP:TLOD 0/0:0:0:0.00

GATK 3.8: genotype called as HOM_VAR while there is a large number of REF and only a few ALT


Hi the GATK team,
there is a mutation in my (HaloPlex) BAM that was called as HOM_VAR (REF=C) while only a few reads carry the ALT=A:

$ samtools mpileup -r "15:42171531-42171531" ~/jeter.bam 
[mpileup] 1 samples in 1 input files
15  42171531    N   735 cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc$cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccaacccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccacccccccccccccccccccccccccccccccccccccccccccccccccccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC    

I put a minimal example in https://nextcloud-bird.univ-nantes.fr/index.php/s/B4WQSw3oLwsWwYS

HaplotypeCaller (3.8-0-ge9d806836) in vcf or g.vcf mode ( I'd like to stick to 3.8 for now )

my command line:

java -Djava.io.tmpdir=. -jar ${GATK_JAR} -T HaplotypeCaller -R "/path/to/b37/human_g1k_v37.fasta" -I ~/jeter.bam  -L "15:42171520-42171550"
(...)
15  42171531    .   C   A   2080.77 .   AC=2;AF=1.00;AN=2;BaseQRankSum=-3.799;ClippingRankSum=0.000;DP=90;ExcessHet=3.0103;FS=12.355;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=24.19;ReadPosRankSum=1.079;SOR=6.982    GT:AD:DP:GQ:PL  1/1:5,81:86:99:2109,133,0

here the genotype is 1/1 AD=5/81 DP=86

Furthermore, the DEPTH looks much higher with samtools:

$ samtools depth -r "15:42171531-42171531" ~/jeter.bam      
15  42171531    833
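For comparison, a rough way to count reads the way HaplotypeCaller's default filters would (a sketch: 0x704 excludes unmapped, secondary, QC-fail, and duplicate reads, and 20 is the default minimum mapping quality):

```
samtools view -c -q 20 -F 0x704 ~/jeter.bam "15:42171531-42171531"
```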

What could be the cause of this?

thank you in advance for your help,

Pierre

No overlapping contigs found

I am trying to use BaseRecalibrator but I am facing the error "No overlapping contigs found". How do I solve this?
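"No overlapping contigs found" typically means the BAM and the reference (or known-sites VCF) name their contigs differently, e.g. "chr1" versus "1". A quick comparison of sequence names (a sketch; paths are placeholders):

```
samtools view -H input.bam | grep '^@SQ' | head -3
head -3 reference.dict
```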

I tested the following. BAM is WGS data; however, the VCF contains only CHROM 1, why? Thanks!

# Launch multiple process gatk code
gatk Mutect2 \
-R human_g1k_v37.fasta \
-I B0026.bam \
-I S0026.bam \
-tumor B0026 \
-normal S0026 \
--germline-resource af-only-gnomad.raw.sites.b37.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-O B0026_vs_S0026.vcf.gz \
-bamout B0026_vs_S0026.bam

#Also, the VCF has "##filtering_status=Warning: unfiltered Mutect 2 calls. Please run FilterMutectCalls to remove false positives." How do I run FilterMutectCalls? (See the sketch below.)
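For the FilterMutectCalls question, a minimal sketch continuing from the command above (recent GATK 4.1 releases also require the reference via -R; adjust to your version):

```
gatk FilterMutectCalls \
    -R human_g1k_v37.fasta \
    -V B0026_vs_S0026.vcf.gz \
    -O B0026_vs_S0026.filtered.vcf.gz
```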

#The VCF header is attached behind.
##fileformat=VCFv4.2
##FORMAT=<...>          (12 FORMAT definitions; their contents were stripped by the forum rendering)
##GATKCommandLine=<...>
##INFO=<...>            (23 INFO definitions; their contents were stripped by the forum rendering)
##MutectVersion=2.2
##contig=<...>          (84 contig definitions; their contents were stripped by the forum rendering)
##filtering_status=Warning: unfiltered Mutect 2 calls. Please run FilterMutectCalls to remove false positives.
##normal_sample=S0026
##source=Mutect2
##tumor_sample=B0026

GenomicsDBImport does not support GVCFs with MNPs; GATK (v4.1.0.0)


Hello!

I am running the GATK (v4.1.0.0) best practices pipeline on FireCloud with 12 pooled WGS samples; one pooled sample contains ~48 individual fish (I am using a ploidy of 20 throughout the pipeline). Though I have 24 linkage groups I also have 8286 very small scaffolds that my reads are aligned to, which has caused some issues with using scatter/gather and running the tasks by interval with -L (though that is not my main issue here). Lately I have run into a problem at the JointGenotyping stage.

I have one GVCF for each pool from HaplotypeCaller, and I tried to combine them all using CombineGVCFs. Because of the ploidy of 20 I thought I could not use GenomicsDBImport. I had the same error using CombineGVCFs as the person in this thread: gatkforums.broadinstitute.org/gatk/discussion/13430/gatk-v4-0-10-1-combinegvcfs-failing-with-java-lang-outofmemoryerror-not-using-memory-provided. No matter the amount of memory I allowed the task, it failed every time.

But following @shlee's advice and reading this: github.com/broadinstitute/gatk/issues/5383 I decided to give GenomicsDBImport a try. I just used my 24 linkage groups, so my interval list has only those 24 listed.

I am stumped by the error I got for many of the linkage groups:

***********************************************************************

A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at LG07:4616323 in VCF /6942d818-1ae4-4c81-a4be-0f27ec47ec16/HaplotypeCallerGVCF_halfScatter_GATK4/3a4a3acc-2f06-44dc-ab6d-2617b06f3f46/call-MergeGVCFs/301508.merged.matefixed.sorted.markeddups.recal.g.vcf.gz

***********************************************************************

What is the best way to address this? I didn't see anything in the GenomicsDB documentation about flagging or ignoring the MNPs. I was thinking of removing the MNPs with SelectVariants before importing the GVCFs into GenomicsDB, but how do you get SelectVariants to output a GVCF, which is needed for joint genotyping?

What would you recommend I do to get past this MNP hurdle?
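For what it's worth, a sketch of the SelectVariants idea (whether its output still behaves as a valid GVCF for joint genotyping is exactly the open question; the output name is a placeholder):

```
gatk SelectVariants \
    -R reference.fasta \
    -V 301508.merged.matefixed.sorted.markeddups.recal.g.vcf.gz \
    --select-type-to-exclude MNP \
    -O 301508.noMNP.g.vcf.gz
```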

Pipeline Index


This document is under construction. It aims to provide an overview of use cases covered by GATK Best Practices workflows.

| Variant Discovery               | Germline                      | Somatic                   | Notes                        |
|---------------------------------|-------------------------------|---------------------------|------------------------------|
| Data pre-processing             | Single-sample                 | Single-sample             | Same workflow applies to all |
| Short variants: SNPs and Indels | Single-sample & Joint Calling | Tumor-Normal & Tumor-Only |                              |
| Copy Number Variants (CNVs)     | Multisample                   | Tumor-Normal & Tumor-Only |                              |
| Structural Variants (SVs)       | In progress                   | TBD                       |                              |

| Special use cases              | Notes |
|--------------------------------|-------|
| Metagenomic analysis (PathSeq) |       |
| Mitochondrial short variants   |       |
| Liquid blood biopsy            |       |

a clear interpretation in filter column of Mutect2 vcf


Hi, I have read the doc "mathematical notes on mutect.pdf", the latest "mutect.pdf", and the header in the VCF, but I am still confused when reading variants in the VCF file, so I want to go through this with you carefully. I hope you can give me some instructions; the questions are very detailed and may take you a lot of time, for which I am sorry, and thanks a lot.


min-median-base-quality is the minimum median base quality of bases supporting a SNV, but it does not say the concrete threshold value of the minimum median base quality.
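For what it's worth, these cutoffs are exposed as FilterMutectCalls arguments, and the tool prints the default value of each in its help text (a sketch; the grep pattern just narrows the output):

```
gatk FilterMutectCalls --help | grep -A 2 'min-median'
```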



min-median-mapping-quality also does not say the concrete value of the minimum median mapping quality.


FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">: this is the maximum allowable number of called variants co-occurring in a single assembly region; if the number of called variants exceeds it, they will all be filtered. How should I understand what is called "a single assembly region"?


FILTER=<ID=bad_haplotype,Description="Variant near filtered variant on same haplotype.">: how should I understand this? For example, here is a site:

chr1 144854528 . A G . bad_haplotype;clustered_events DP=971;ECNT=5;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.537e+01;TLOD=1291.33 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:ORIGINAL_CONTIG_MISMATCH:PGT:PID:SA_MAP_AF:SA_POST_PROB 0/1:607,356:0.370:963:289,183:318,173:29,29:175,187:60:39:false:false:.:.:50.60:100.00:0:0|1:144854528_A_G:0.364,0.364,0.370:5.314e-03,0.015,0.980


FILTER=<ID=chimeric_original_alignment,Description="NuMT variant with too many ALT reads originally from autosome">: how should I understand this?



• max-germline-posterior is the maximum posterior probability, as determined by the above germline probability model, that a variant is a germline event, but it does not say the concrete threshold value.

FILTER=<ID=low_avg_alt_quality,Description="Low average alt quality">: is this the base quality averaged over all reads supporting the alt at this site? It seems to come up very rarely, why?



max-alt-allele-count is the maximum allowable number of alt alleles at a site. By default only biallelic variants pass the filter. Does that mean a site can have at most three base possibilities (the ref and two possible alts)?


FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">: it seems to come up very rarely, why?


FILTER=<ID=orientation_bias,Description="Orientation bias (in one of the specified artifact mode(s) or complement) seen in one or more samples.">: does it mean the reads come only from the positive or only from the negative strand?

Here is an example site:
chr3 181496339 . G T . orientation_bias DP=450;ECNT=1;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.164e+02;TLOD=5.93 GT:AD:AF:DP:F1R2:F2R1:FT:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/1:428,7:0.018:435:198,6:230,1:orientation_bias:35,28:178,195:60:27:true:false:0.857:0.249:33.02:100.00:0:0.010,0.020,0.016:3.913e-03,1.666e-03,0.994



min-median-read-position is the minimum median length of bases supporting an allele from the closest end of the read. Indel positions are measured by the end farthest from the end of the read. But it does not say the concrete threshold value.
Here is an example site:
chr1 16258280 . C CTCTAAATCTTCA . read_position DP=618;ECNT=1;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.682e+02;TLOD=5.82 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/1:586,4:8.350e-03:590:272,4:314,0:29,33:176,140:60:0:false:false:0:0.010,0.010,6.780e-03:1.216e-03,1.579e-03,0.997



max-strand-artifact-probability is the posterior probability of a strand artifact, as determined by the model described above, required to apply the strand artifact filter. This is necessary but not sufficient – we also require the estimated max a posteriori allele fraction to be less than min-strand-artifact-allele-fraction. The second condition prevents filtering real variants that also have significant strand bias, i.e. a true variant that also has some artifactual reads.

How should I understand "Evidence for alt allele comes from one read direction only"? How many read directions are there? Can you illustrate when it is one read direction and when it is two? And do F1R2 and F2R1 stand for this?


FILTER=<ID=strict_strand_bias,Description="Evidence for alt allele is not represented in both directions">: it comes up very rarely, why?

Another question: when viewing reads in IGV, if I sort reads by sample, there can be many reads with 'Sample=HC'; it seems these are excluded from the DP count in the VCF even though they show up in the bamout BAM.

MuTect2 output shows the sample name as "$sp" instead of "TUMOR" and "NORMAL"

Here is my code:

java -jar GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar \
-T MuTect2 -L chrM -I:tumor Exon_P3_TXC_PT1.realn_Recal.bam \
-I:normal Exon_P3_TXC_N/Exon_P3_TXC_N.realn_Recal.bam \
--dbsnp dbsnp_135.hg19.vcf --output_mode EMIT_VARIANTS_ONLY -o test.vcf.gz -R genome.fa

The output shows that the sample name is "$sp" instead of "TUMOR" and "NORMAL".
Here is the output file (test.vcf.gz):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT $sp
chrM 3395 . T C . alt_allele_in_normal;clustered_events;t_lod_fstar ECNT=3;HCNT=13;MAX_ED=27;MIN_ED=1;NLOD=34.28;TLOD=4.37 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:PGT:PID:QSS:REF_F1R2:REF_F2R1 0/0:147,3:0.020:1:2:0.333:0|1:3395_T_C:4031,59:72:75
chrM 3396 . A G . alt_allele_in_normal;clustered_events;t_lod_fstar ECNT=3;HCNT=13;MAX_ED=27;MIN_ED=1;NLOD=33.98;TLOD=4.38 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:PGT:PID:QSS:REF_F1R2:REF_F2R1 0/0:148,3:0.021:1:2:0.667:0|1:3395_T_C:3715,61:72:76
chrM 3422 . G A . alt_allele_in_normal;clustered_events ECNT=3;HCNT=16;MAX_ED=27;MIN_ED=1;NLOD=9.29;TLOD=19.31 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:154,12:0.077:7:5:0.583:4237,317:76:78
chrM 3698 . G T . alt_allele_in_normal;t_lod_fstar ECNT=1;HCNT=21;MAX_ED=.;MIN_ED=.;NLOD=39.42;TLOD=6.27 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:197,5:0.027:4:1:0.800:5645,138:93:104
chrM 13189 . C A . alt_allele_in_normal;t_lod_fstar ECNT=1;HCNT=36;MAX_ED=.;MIN_ED=.;NLOD=27.20;TLOD=4.07 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:139,6:0.038:3:3:0.500:4167,133:72:67
chrM 14784 . T C . alt_allele_in_normal ECNT=1;HCNT=40;MAX_ED=.;MIN_ED=.;NLOD=11.41;TLOD=9.65 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:102,8:0.067:5:3:0.625:2854,188:60:42
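For what it's worth, a literal $sp in the sample column usually indicates that an unexpanded shell variable ended up in the BAM read groups' SM field when the files were prepared; a quick check of what the BAMs actually carry (a sketch):

```
samtools view -H Exon_P3_TXC_PT1.realn_Recal.bam | grep '^@RG'
```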

Mutect2 FilterByOrientationBias failure


I am trying to create a panel of normals using the featured GATK4 method on FireCloud. Everything else runs fine, but the method keeps failing at FilterByOrientationBias. It looks like this task has trouble locating the filtered VCFs from the previous task, but I'm not sure why. Can you help me with this? I have attached the stderr and JES log here. Thanks!

java.lang.IncompatibleClassChangeError GATK 4


Hi,

I hit an error with GATK 4 beta 6 using RealignerTargetCreator; as a complete Java newbie it's quite incomprehensible to me. I'm running (Oracle) Java 9.0.1 (and thus GATK 3's RealignerTargetCreator isn't working for me either :# ).

Here is the command I ran:

gatk-launch RealignerTargetCreator -R ~/data/ref/hg38.fa -I Sample1_dedup.bam -o Sample1_int.intervals

And this is the output:

Using GATK jar /usr/local/bin/gatk-4.beta.6/gatk-package-4.beta.6-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -jar /usr/local/bin/gatk-4.beta.6/gatk-package-4.beta.6-local.jar RealignerTargetCreator -R /home/jamie/data/ref/hg38.fa -I Sample1_dedup.bam -o Sample1_int.intervals
Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/broadinstitute/barclay/argparser/CommandLineProgramGroup. Method lambda$static$0(Lorg/broadinstitute/barclay/argparser/CommandLineProgramGroup;Lorg/broadinstitute/barclay/argparser/CommandLineProgramGroup;)I at index 43 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef
    at org.broadinstitute.barclay.argparser.CommandLineProgramGroup.<clinit>(CommandLineProgramGroup.java:16)
    at org.broadinstitute.hellbender.Main.printUsage(Main.java:332)
    at org.broadinstitute.hellbender.Main.extractCommandLineProgram(Main.java:305)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:156)
    at org.broadinstitute.hellbender.Main.main(Main.java:239)
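For context, GATK releases of this era target a Java 8 runtime, and this kind of IncompatibleClassChangeError is typical of running them on Java 9, so checking the active JVM is a reasonable first step (a sketch, assuming java is on the PATH):

```
java -version   # GATK 4 beta expects a 1.8.x (Java 8) runtime
```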

Many thanks! :*

Not getting any output from GetPileupSummaries


Hi,

I am getting GetPileupSummaries to complete seemingly without error, but the output table has no records; it only has a header row. I have tried different steps to resolve this, but I have not yet figured out the problem. Below I show the content of the output file, the content of the error log file, and the first few rows of the input VCF file. The input BAM is an exome-sequencing BAM. I checked and found that over 29,000 sites of the input file are in the capture kit region, so I think there should be many records in the output. Please let me know if you have any suggestions to get this to work; if not, could you please suggest an alternative approach for creating the input tables for CalculateContamination? Thanks.

$ cat ../output/GetPileupSummaries_sample.table

SAMPLE=sample

contig position ref_count alt_count other_alt_count allele_frequency
[user@headnode output]$ cat ../logs/GetPileupSummaries_sample.e1715758
01:51:01.533 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/research/gatk-4.1.1.0/gatk-package-4.1.1.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Apr 15, 2019 1:51:03 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
01:51:03.228 INFO GetPileupSummaries - ------------------------------------------------------------
01:51:03.228 INFO GetPileupSummaries - The Genome Analysis Toolkit (GATK) v4.1.1.0
01:51:03.228 INFO GetPileupSummaries - For support and documentation go to https://software.broadinstitute.org/gatk/
01:51:03.229 INFO GetPileupSummaries - Executing as user@node016 on Linux v2.6.32-754.6.3.el6.x86_64 amd64
01:51:03.229 INFO GetPileupSummaries - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_111-b14
01:51:03.229 INFO GetPileupSummaries - Start Date/Time: April 15, 2019 1:51:01 AM CDT
01:51:03.229 INFO GetPileupSummaries - ------------------------------------------------------------
01:51:03.229 INFO GetPileupSummaries - ------------------------------------------------------------
01:51:03.229 INFO GetPileupSummaries - HTSJDK Version: 2.19.0
01:51:03.229 INFO GetPileupSummaries - Picard Version: 2.19.0
01:51:03.229 INFO GetPileupSummaries - HTSJDK Defaults.COMPRESSION_LEVEL : 2
01:51:03.230 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
01:51:03.230 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
01:51:03.230 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
01:51:03.230 INFO GetPileupSummaries - Deflater: IntelDeflater
01:51:03.230 INFO GetPileupSummaries - Inflater: IntelInflater
01:51:03.230 INFO GetPileupSummaries - GCS max retries/reopens: 20
01:51:03.230 INFO GetPileupSummaries - Requester pays: disabled
01:51:03.230 WARN GetPileupSummaries -

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: GetPileupSummaries is a BETA tool and is not yet ready for use in production

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

01:51:03.230 INFO GetPileupSummaries - Initializing engine
01:51:03.532 INFO FeatureManager - Using codec VCFCodec to read file file:///research/GATKresources/gatk-test-data__wgs_ubam__HCC1143T/af-only-gnomad.hg38.common.bialleliconly.canonicalonly.AFonly.vcf.gz
01:51:03.628 INFO FeatureManager - Using codec VCFCodec to read file file:///research/GATKresources/gatk-test-data__wgs_ubam__HCC1143T/af-only-gnomad.hg38.common.bialleliconly.canonicalonly.AFonly.vcf.gz
01:51:09.675 INFO IntervalArgumentCollection - Processing 3893864 bp from intervals
01:51:09.815 INFO GetPileupSummaries - Done initializing engine
01:51:09.815 INFO ProgressMeter - Starting traversal
01:51:09.816 INFO ProgressMeter - Current Locus Elapsed Minutes Loci Processed Loci/Minute
01:51:25.428 INFO ProgressMeter - chr1:6440859 0.3 1000 3843.2
01:51:35.668 INFO ProgressMeter - chr1:143476552 0.4 11000 25529.9
01:51:46.273 INFO ProgressMeter - chr1:216684973 0.6 19000 31269.7
01:51:56.664 INFO ProgressMeter - chr2:111011916 0.8 31000 39702.9
01:52:08.579 INFO ProgressMeter - chr3:52786965 1.0 44000 44926.2
01:52:18.730 INFO ProgressMeter - chr4:53788564 1.1 57000 49627.1
01:52:30.410 INFO ProgressMeter - chr5:148065222 1.3 73000 54346.5
01:52:41.138 INFO ProgressMeter - chr6:118615131 1.5 86000 56504.0
01:52:51.745 INFO ProgressMeter - chr7:99050742 1.7 98000 57687.2
01:53:01.813 INFO ProgressMeter - chr8:102360515 1.9 109000 58394.4
01:53:12.806 INFO ProgressMeter - chr10:3339720 2.0 122000 59517.0
01:53:22.994 INFO ProgressMeter - chr11:5425116 2.2 134000 60370.3
01:53:33.642 INFO ProgressMeter - chr12:8242332 2.4 146000 60907.4
01:53:44.178 INFO ProgressMeter - chr13:27900338 2.6 158000 61414.1
01:53:54.586 INFO ProgressMeter - chr15:34863204 2.7 171000 62268.6
01:54:05.829 INFO ProgressMeter - chr16:24964104 2.9 181000 61700.3
01:54:17.241 INFO ProgressMeter - chr17:39213163 3.1 191000 61144.8
01:54:27.416 INFO ProgressMeter - chr19:2408074 3.3 202000 61336.3
01:54:37.846 INFO ProgressMeter - chr19:52439481 3.5 211000 60856.9
01:54:48.734 INFO ProgressMeter - chr22:23928590 3.6 223000 61118.8
01:54:57.895 INFO GetPileupSummaries - 4313566 read(s) filtered by: (((((((((MappingQualityAvailableReadFilter AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryLineReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND MateOnSameContigOrNoMappedMateReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
4313566 read(s) filtered by: ((((((((MappingQualityAvailableReadFilter AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryLineReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND MateOnSameContigOrNoMappedMateReadFilter) AND GoodCigarReadFilter)
4313566 read(s) filtered by: (((((((MappingQualityAvailableReadFilter AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryLineReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND MateOnSameContigOrNoMappedMateReadFilter)
4296314 read(s) filtered by: ((((((MappingQualityAvailableReadFilter AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryLineReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
4296314 read(s) filtered by: (((((MappingQualityAvailableReadFilter AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryLineReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
4296314 read(s) filtered by: ((((MappingQualityAvailableReadFilter AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryLineReadFilter) AND NotDuplicateReadFilter)
592640 read(s) filtered by: (((MappingQualityAvailableReadFilter AND MappingQualityNotZeroReadFilter) AND MappedReadFilter) AND PrimaryLineReadFilter)
583953 read(s) filtered by: ((MappingQualityAvailableReadFilter AND MappingQualityNotZeroReadFilter) AND MappedReadFilter)
583953 read(s) filtered by: (MappingQualityAvailableReadFilter AND MappingQualityNotZeroReadFilter)
583953 read(s) filtered by: MappingQualityNotZeroReadFilter
8687 read(s) filtered by: PrimaryLineReadFilter
3703674 read(s) filtered by: NotDuplicateReadFilter
17252 read(s) filtered by: MateOnSameContigOrNoMappedMateReadFilter

01:54:57.895 INFO ProgressMeter - chrX:143216908 3.8 231568 60917.8
01:54:57.895 INFO ProgressMeter - Traversal complete. Processed 231568 total loci in 3.8 minutes.
01:54:57.904 INFO GetPileupSummaries - Shutting down engine
[April 15, 2019 1:54:57 AM CDT] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 3.94 minutes.
Runtime.totalMemory()=3257925632
[user@headnode output]$ zcat /research/GATKresources/gatk-test-data__wgs_ubam__HCC1143T/af-only-gnomad.hg38.common.bialleliconly.canonicalonly.AFonly.vcf.gz | grep -v ^## | head -n 20

#CHROM POS ID REF ALT QUAL FILTER INFO

chr1 10583 rs58108140 G A 1052610 InbreedingCoeff AF=0.229
chr1 12783 . G A 10065800 InbreedingCoeff AF=0.556
chr1 13116 rs201725126 T G 21488400 InbreedingCoeff AF=0.532
chr1 13118 rs200579949 A G 21440300 InbreedingCoeff AF=0.531
chr1 13868 . A G 1610370 PASS AF=0.204
chr1 13896 rs201696125 C A 1115790 PASS AF=0.207
chr1 14464 . A T 3542820 PASS AF=0.204
chr1 14653 rs375086259 C T 836949 InbreedingCoeff AF=0.252
chr1 14699 rs372910670 C G 2774380 InbreedingCoeff AF=0.373
chr1 14907 rs79585140 A G 23517400 InbreedingCoeff AF=0.497
chr1 14930 rs75454623 A G 23449200 InbreedingCoeff AF=0.495
chr1 15118 rs71252250 A G 10335000 InbreedingCoeff AF=0.442
chr1 15190 rs200030104 G A 2612030 InbreedingCoeff AF=0.287
chr1 15688 . C T 1120870 InbreedingCoeff AF=0.223
chr1 16068 rs372319358 T C 2962330 InbreedingCoeff AF=0.466
chr1 16103 rs200358166 T G 7153330 InbreedingCoeff AF=0.535
chr1 16288 rs200736374 C G 1007750 InbreedingCoeff AF=0.278
chr1 16298 rs200451305 C T 8569780 InbreedingCoeff AF=0.538
chr1 16378 rs148220436 T C 27199800 InbreedingCoeff AF=0.535
[user@headnode output]$

Null pointer exception using GATK4 at chr19 for mouse genome


Hi,

I'm doing tumor-normal somatic variant calling using **Mutect2 (GATK v4.1.2.0)**. I have used the mm10 reference genome for mapping and I have tried to follow the general best practices for GATK and Mutect2.

But for some reason the variant calling breaks at chr19 and reports a null pointer exception. This happens to all my samples.

The command I have used for running MuTect2 is as below:

gatk Mutect2 \
    -R $ref \
    -I $BAMs/3_MOO111A3_S8_001.markdup.realigned.bam \
    -I $BAMs/4_MOO111A4_S11_001.markdup.realigned.bam \
    -tumor 3_MOO111A3_S8_001 \
    -normal 4_MOO111A4_S11_001 \
    -O $out/3_4_unfiltered.vcf.gz \
    -bamout $out/3_4_tumor_normal.bam 

The error I get is as follows:

02:04:08.372 INFO  ProgressMeter -       chr19:61211311            984.9               4994610           5071.4
02:04:19.192 INFO  ProgressMeter -       chr19:61266818            985.0               4994840           5070.7
02:04:27.335 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 39.543984247000004
02:04:27.335 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 15107.717470987001
02:04:27.336 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 14077.45 sec
INFO    2019-07-11 02:04:28 SortingCollection   Creating merging iterator from 34 files
02:05:08.531 INFO  Mutect2 - Shutting down engine
[July 11, 2019 2:05:08 AM EDT] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 985.91 minutes.
Runtime.totalMemory()=2113929216
java.lang.NullPointerException
    at org.broadinstitute.hellbender.transformers.PalindromeArtifactClipReadTransformer.apply(PalindromeArtifactClipReadTransformer.java:98)
    at org.broadinstitute.hellbender.transformers.PalindromeArtifactClipReadTransformer.apply(PalindromeArtifactClipReadTransformer.java:49)
    at org.broadinstitute.hellbender.transformers.ReadTransformer.lambda$andThen$f85d1091$1(ReadTransformer.java:20)
    at org.broadinstitute.hellbender.utils.iterators.ReadTransformingIterator.next(ReadTransformingIterator.java:42)
    at org.broadinstitute.hellbender.utils.iterators.ReadTransformingIterator.next(ReadTransformingIterator.java:14)
    at org.broadinstitute.hellbender.utils.iterators.PushToPullIterator.fillCache(PushToPullIterator.java:72)
    at org.broadinstitute.hellbender.utils.iterators.PushToPullIterator.advanceToNextElement(PushToPullIterator.java:58)
    at org.broadinstitute.hellbender.utils.iterators.PushToPullIterator.<init>(PushToPullIterator.java:37)
    at org.broadinstitute.hellbender.utils.downsampling.ReadsDownsamplingIterator.<init>(ReadsDownsamplingIterator.java:21)
    at org.broadinstitute.hellbender.engine.MultiIntervalLocalReadShard.iterator(MultiIntervalLocalReadShard.java:149)
    at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.<init>(AssemblyRegionIterator.java:109)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:296)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:281)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1039)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)

I have tried reading other posts with Null pointers exception, but could not find a solution for my issue.

I also validated my BAMs using ValidateSamFile from Picard, as per one of the answers. There were no issues/errors with my BAM files.

About CombineGVCFs

Hi,

I have about 600 g.vcf files from GATK HaplotypeCaller, and I want to combine them.
I tried to combine the 600 g.vcf files with GATK CombineGVCFs but I couldn't.
The command I used is as below.
----------
java -jar /home/h1kimura/lib/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar CombineGVCFs \
-R /oasis/projects/nsf/ddp195/h1kimura/resources/human_g1k_v37_decoy.fasta \
-V /oasis/projects/nsf/ddp195/h1kimura/exon_vcfs//gVCF.list \
-O /oasis/projects/nsf/ddp195/h1kimura/exon_vcfs/combined_genotyped.vcf

gVCF.list includes 600 g.vcf files
----------------------
The error messages are as below:
----------------------
07:30:27.308 INFO FeatureManager - Using codec VCFCodec to read file file:///oasis/projects/nsf/ddp195/h1kimura/exon_vcfs/NP070.raw_variants.g.vcf
07:30:41.618 INFO FeatureManager - Using codec VCFCodec to read file file:///oasis/projects/nsf/ddp195/h1kimura/exon_vcfs/NP071.raw_variants.g.vcf
08:29:17.988 INFO CombineGVCFs - Shutting down engine
[July 15, 2019 8:29:18 AM PDT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 62.33 minutes.
Runtime.totalMemory()=26742882304
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:202)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:178)
at htsjdk.tribble.TribbleIndexedFeatureReader.loadIndex(TribbleIndexedFeatureReader.java:163)
at htsjdk.tribble.TribbleIndexedFeatureReader.hasIndex(TribbleIndexedFeatureReader.java:228)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:263)
at org.broadinstitute.hellbender.engine.FeatureManager.addToFeatureSources(FeatureManager.java:234)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$initializeDrivingVariants$0(MultiVariantWalker.java:73)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.initializeDrivingVariants(MultiVariantWalker.java:63)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.initializeFeatures(VariantWalkerBase.java:55)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:697)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.onStartup(MultiVariantWalker.java:46)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:195)
... 18 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at htsjdk.tribble.index.linear.LinearIndex$ChrIndex.read(LinearIndex.java:295)
at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:404)
at htsjdk.tribble.index.linear.LinearIndex.<init>(LinearIndex.java:116)
at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:195)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:178)
at htsjdk.tribble.TribbleIndexedFeatureReader.loadIndex(TribbleIndexedFeatureReader.java:163)
at htsjdk.tribble.TribbleIndexedFeatureReader.hasIndex(TribbleIndexedFeatureReader.java:228)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:263)
at org.broadinstitute.hellbender.engine.FeatureManager.addToFeatureSources(FeatureManager.java:234)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$initializeDrivingVariants$0(MultiVariantWalker.java:73)
at org.broadinstitute.hellbender.engine.MultiVariantWalker$$Lambda$69/1189187821.accept(Unknown Source)
... 12 more
---------------------
Could you tell me how to combine the 600 g.vcf files?
Should I use more CPUs?
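The trace bottoms out in java.lang.OutOfMemoryError: Java heap space while loading VCF indexes, so more CPUs will not help; a larger Java heap might. A sketch (the 64g figure is illustrative, size it to your node):

```
java -Xmx64g -jar /home/h1kimura/lib/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar CombineGVCFs \
    -R /oasis/projects/nsf/ddp195/h1kimura/resources/human_g1k_v37_decoy.fasta \
    -V /oasis/projects/nsf/ddp195/h1kimura/exon_vcfs//gVCF.list \
    -O /oasis/projects/nsf/ddp195/h1kimura/exon_vcfs/combined_genotyped.vcf
```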

Thanks in advance.

Meganejin

how to let GATK support Coordinate Sorted Index (CSI) format of bam file


Since samtools 1.0, the CSI index format for BAM files has been available specifically for organisms with long chromosomes (> 536 Mb). Could you help me figure out how to let GATK SplitNCigarReads support CSI index files?
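For reference, the samtools side of this looks as follows; the open question is GATK reading the result:

```
samtools index -c input.bam   # writes input.bam.csi instead of the default .bai
```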

Thanks so much.

"Failed to create reader" error in GenomicsDBImport


I ran GenomicsDBImport and got the error below. Perhaps it is worth noting that I'm running this in Nextflow (nextflow.io), because I didn't have this problem outside of Nextflow.

15:54:24.801 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/olavur/miniconda3/envs/exolink/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
15:54:26.895 INFO  GenomicsDBImport - ------------------------------------------------------------
15:54:26.896 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.0.0
15:54:26.896 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
15:54:26.897 INFO  GenomicsDBImport - Executing as olavur@hnpv-fargenCompute01.heilsunet.fo on Linux v4.4.0-101-generic amd64
15:54:26.897 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_192-b01
15:54:26.898 INFO  GenomicsDBImport - Start Date/Time: 02 May 2019 15:54:24 WEST
15:54:26.898 INFO  GenomicsDBImport - ------------------------------------------------------------
15:54:26.898 INFO  GenomicsDBImport - ------------------------------------------------------------
15:54:26.899 INFO  GenomicsDBImport - HTSJDK Version: 2.18.2
15:54:26.899 INFO  GenomicsDBImport - Picard Version: 2.18.25
15:54:26.899 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:54:26.899 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:54:26.899 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:54:26.900 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:54:26.900 INFO  GenomicsDBImport - Deflater: IntelDeflater
15:54:26.900 INFO  GenomicsDBImport - Inflater: IntelInflater
15:54:26.900 INFO  GenomicsDBImport - GCS max retries/reopens: 20
15:54:26.900 INFO  GenomicsDBImport - Requester pays: disabled
15:54:26.900 INFO  GenomicsDBImport - Initializing engine
15:54:27.029 INFO  GenomicsDBImport - Shutting down engine
[02 May 2019 15:54:27 WEST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=205800865792
***********************************************************************

A USER ERROR has occurred: Failed to create reader from file://data/results/gvcf/FN000119.gvcf

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /home/olavur/miniconda3/envs/exolink/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx200g -Xms200g -jar /home/olavur/miniconda3/envs/exolink/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar GenomicsDBImport -V data/results/gvcf/FN000119.gvcf -V data/results/gvcf/FN000103.gvcf -V data/results/gvcf/FN000105.gvcf -L resources/sureselect_human_all_exon_v6_utr_grch38/S07604624_Padded.bed --genomicsdb-workspace-path genomicsdb/run --merge-input-intervals --tmp-dir=tmp

The command I ran was:

export TILEDB_DISABLE_FILE_LOCKING=1
gatk GenomicsDBImport         -V data/results/gvcf/FN000119.gvcf -V data/results/gvcf/FN000103.gvcf -V data/results/gvcf/FN000105.gvcf         -L resources/sureselect_human_all_exon_v6_utr_grch38/S07604624_Padded.bed         --genomicsdb-workspace-path "genomicsdb/run"         --merge-input-intervals         --tmp-dir=tmp         --java-options "-Xmx200g -Xms200g"

The command that produced the GVCF in question is:

gatk HaplotypeCaller          -I recalibrated.bam         -O "FN000119.gvcf"         -R resources/reference_10x_Genomics/refdata-GRCh38-2.1.0/fasta/genome.fa         -L resources/sureselect_human_all_exon_v6_utr_grch38/S07604624_Padded.bed         --dbsnp resources/gatk_bundle/Homo_sapiens_assembly38.dbsnp138/Homo_sapiens_assembly38.dbsnp138.vcf         -ERC GVCF         --create-output-variant-index         --annotation MappingQualityRankSumTest         --annotation QualByDepth         --annotation ReadPosRankSumTest         --annotation RMSMappingQuality         --annotation FisherStrand         --annotation Coverage         --verbosity INFO         --tmp-dir=tmp         --java-options "-Xmx100g -Xms100g"
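One frequent cause of "Failed to create reader" when running under a workflow engine is that the GVCF's companion index was not staged into the work directory along with the file itself, so the reader finds a missing or stale index. A sketch of recreating it, assuming the -F argument of IndexFeatureFile in GATK 4.1.0:

```
gatk IndexFeatureFile -F data/results/gvcf/FN000119.gvcf
```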

RNAseq short variant discovery (SNPs + Indels)


Purpose

Identify short variants (SNPs and Indels) in RNAseq data.



Reference Implementations

| Pipeline                                 | Summary    | Notes                | Github | Terra |
|------------------------------------------|------------|----------------------|--------|-------|
| RNAseq short variant per-sample calling  | BAM to VCF | universal (expected) | yes    | TBD   |

Expected input

This workflow is designed to operate on a set of samples (uBAM files) one-at-a-time; joint calling RNAseq is not supported.


Main Steps

Mapping to the Reference

Tools involved: STAR

We begin by mapping the RNA reads to a reference. We recommend the STAR aligner because of its increased sensitivity compared to TopHat (especially for indels). We use STAR’s two-pass mode to get better alignments around novel splice junctions.
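A sketch of a two-pass STAR run (index directory, file names, and thread count are placeholders):

```
STAR --runMode alignReads \
     --genomeDir star_index/ \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --twopassMode Basic \
     --outSAMtype BAM Unsorted \
     --runThreadN 8 \
     --outFileNamePrefix sample.
```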

Data Cleanup

Tools involved: MergeBamAlignment, MarkDuplicates

We use MergeBamAlignment and MarkDuplicates, similarly to our DNA pre-processing Best Practices pipeline.
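For the duplicate-marking half, a minimal sketch (file names are placeholders):

```
gatk MarkDuplicates \
    -I sample.merged.bam \
    -O sample.marked.bam \
    -M sample.duplicate_metrics.txt
```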

SplitNCigarReads

Tools involved: SplitNCigarReads

Because RNA aligners have different conventions than DNA aligners, we need to reformat some of the alignments that span introns for HaplotypeCaller. This step splits reads with N in the cigar into multiple supplementary alignments and hard clips mismatching overhangs. By default this step also reassigns mapping qualities for good alignments to match DNA conventions.
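A minimal sketch of the step (file names are placeholders):

```
gatk SplitNCigarReads \
    -R reference.fasta \
    -I sample.marked.bam \
    -O sample.split.bam
```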

Base Quality Recalibration

Tools involved: BaseRecalibrator, ApplyBQSR, AnalyzeCovariates (optional)

This step is performed per-sample and consists of applying machine learning to detect and correct patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or from instrumentation defects in the sequencer.

The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes, but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized, but it is computationally trivial and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.
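A sketch of the two recalibration commands (file names and the known-sites resource are placeholders):

```
gatk BaseRecalibrator \
    -R reference.fasta \
    -I sample.split.bam \
    --known-sites dbsnp.vcf.gz \
    -O sample.recal.table

gatk ApplyBQSR \
    -R reference.fasta \
    -I sample.split.bam \
    --bqsr-recal-file sample.recal.table \
    -O sample.recalibrated.bam
```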

Variant Calling

Tools involved: HaplotypeCaller

HaplotypeCaller doesn’t need any specific changes to run with RNA once the bam has been run through SplitNCigarReads. We do adjust the minimum phred-scaled confidence threshold for calling variants to 20, but this value will depend on your specific use case.
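A sketch of the call (the soft-clip flag follows the published RNAseq reference implementation; treat the exact flag set as an assumption):

```
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.recalibrated.bam \
    --dont-use-soft-clipped-bases \
    --standard-min-confidence-threshold-for-calling 20 \
    -O sample.vcf.gz
```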

Variant Filtering

Tools involved: VariantFiltration

We recommend specific hard filters, since VQSR and CNNScoreVariants require truth data for training that we don’t yet have for RNA.
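A sketch of such hard filters (the cluster settings and the FS/QD cutoffs follow values commonly published for RNAseq; treat them as a starting point, not a prescription):

```
gatk VariantFiltration \
    -R reference.fasta \
    -V sample.vcf.gz \
    --cluster-window-size 35 \
    --cluster-size 3 \
    --filter-name "FS" --filter-expression "FS > 30.0" \
    --filter-name "QD" --filter-expression "QD < 2.0" \
    -O sample.filtered.vcf.gz
```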


A question about the database bed file:cn2_mask_g1k_v37.mask.bed


I just want to confirm whether the file cn2_mask_g1k_v37.mask.bed (available from ftp://ftp.broadinstitute.org/pub/svtoolkit/cn2masks/cn2_mask_g1k_v37.mask.bed) can serve as the segmental-duplication file specified by the "--segmental-duplication-track" parameter of the AnnotateIntervals tool.
I am asking because when I used it as the segmental duplication file, I found that some filtered intervals are not repeat regions, like these two filtered intervals:
chr12 25378298 25378957 0.313636 1.000000 0.734848
chr12 25397958 25398579 0.337621 1.000000 1.000000

Removing NON_REF tags from VCF


Dear GATK Team,

I ran HaplotypeCaller in GVCF mode and extracted variants using gvcftools "extract_variants". The VCF has "NON_REF" tags which were carried over from the g.vcf.

I want to generate both a g.vcf and a vcf for each sample. Instead of running HaplotypeCaller again to get the vcf, I am using gvcftools "extract_variants" to extract variants from the existing g.vcf, which is faster.

Please suggest how I can get rid of the NON_REF tags in the vcf file.
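One possible route, assuming GATK4's SelectVariants: its --remove-unused-alternates flag drops alternate alleles not used in any genotype, which in principle includes the <NON_REF> placeholder (a sketch, not verified against gvcftools output):

```
gatk SelectVariants \
    -R reference.fasta \
    -V sample.extracted.vcf \
    --remove-unused-alternates \
    -O sample.clean.vcf
```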

Thanks In Advance
Fazulur Rehaman
