Channel: Recent Discussions — GATK-Forum

RealignerTargetCreator and IndelRealigner


Hi,

The tools RealignerTargetCreator and IndelRealigner are obsolete in GATK4. Are there any replacements for these tools in GATK4?

Also, there are two options, -BQSR (used with PrintReads) and -stand_emit_conf (used with HaplotypeCaller), that are not available in GATK4. What should be done in these cases?
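
For reference, a hedged sketch of the GATK3-style invocation these options come from, next to its GATK4 counterpart for the BQSR step (file names are placeholders). As I understand it, ApplyBQSR takes over the -BQSR role of PrintReads, while indel realignment has no direct replacement because HaplotypeCaller and Mutect2 perform their own local reassembly:

# GATK3: apply recalibration via PrintReads with -BQSR (placeholder paths)
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I input.bam -BQSR recal.table -o recal.bam
# GATK4 sketch: ApplyBQSR replaces the PrintReads -BQSR combination
gatk ApplyBQSR -R ref.fasta -I input.bam --bqsr-recal-file recal.table -O recal.bam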


ReadBacked phasing vs Trio phasing?


I think I understand the technical difference. But in terms of phasing quality, how does one compare to the other? Are there any publications/reports/blog posts comparing the two? Is there some quantifiable metric that shows how different the estimates are?

How does the BwaSpark in GATK4 control the number of threads?


I tried to process the ERR000589 data with BwaSpark. The BAM file size is 1.3 GB. The average time spent is about 25 min (5 nodes).
However, it takes only about 5 min to process the same data with the original C bwa using 32 threads.
Based on this observation, I have several questions, listed as follows:
1. Is there anything wrong with my parameters?
2. Does BwaSpark run in multi-threaded mode within each partition?
3. How can I control the number of bwa threads inside BwaSpark?

P.S.
The running command is:
./gatk-launch BwaSpark -I hdfs:///user/XX/ERR000589/ERR000589.bam -O hdfs:///user/XX/ERR000589/ERR000589_bwa.bam -R hdfs:///user/xx/refs/ucsc.hg19.fasta --bwamemIndexImage ~/data/ref/ucsc.hg19.img -disableSequenceDictionaryValidation true -- --sparkRunner SPARK --sparkMaster --executor-cores 1 --total-executor-cores 16 --executor-memory 4G

I tried to further adjust the following parameters:
--executor-cores --total-executor-cores --executor-memory --driver-memory
but none of these brought the time below 16 min.

Besides, I also tried to run it in local mode, but it did not finish successfully. It seems the CPU was waiting endlessly. I guess it occupied so much memory that the swap space came into use? Pic 1 shows the memory consumed while running.
This time, the command is:
./gatk-launch BwaSpark -I hdfs:///user/XX/ERR000589/ERR000589.bam -O hdfs:///user/xx/ERR000589/ERR000589.bwa.bam -R /software/home/xx/data/ref/ucsc.hg19.fasta --bwamemIndexImage ~/data/ref/ucsc.hg19.img -disableSequenceDictionaryValidation true -- --sparkRunner SPARK --sparkMaster local[*] --total-executor-cores 8 --executor-memory 20G --driver-memory 30G

BTW, the testing environment is:
CPU: 2 x 8 physical cores
nodes: 5
network: GbE
memory: 64 GB
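
For what it's worth, a hedged rewrite of the cluster command with an explicit Spark master URL (spark://master-host:7077 is a placeholder) and more cores per executor; the layout mirrors the command above, and the executor numbers are purely illustrative for a 5-node, 16-core-per-node setup:

./gatk-launch BwaSpark -I hdfs:///user/XX/ERR000589/ERR000589.bam -O hdfs:///user/XX/ERR000589/ERR000589_bwa.bam -R hdfs:///user/xx/refs/ucsc.hg19.fasta --bwamemIndexImage ~/data/ref/ucsc.hg19.img -disableSequenceDictionaryValidation true -- --sparkRunner SPARK --sparkMaster spark://master-host:7077 --executor-cores 4 --total-executor-cores 64 --executor-memory 16G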

GatherBamFiles / FixMateInformation / ValidateSamFile


Hi, here is the pipeline...

1) ApplyBQSR

while read -r f1 f2; do ....
${ph6} --java-options ${java_opt1} ApplyBQSR -R ${gnm} -I ${fBAM} -O ${fol5}/${c_applybqsr} -L ${f1} -bqsr ${fol5}/${bqsrrd} --static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30 --add-output-sam-program-record --create-output-bam-md5 --use-original-qualities
... done

2) GatherBamFiles
java -Dsamjdk.compression_level=${cl} ${java_opt1} -jar ${ph3} GatherBamFiles ${BQSRs} O=${fol4}/${tofixapplybqsr} CREATE_INDEX=true CREATE_MD5_FILE=true

The ${BQSRs} variable contains all the per-chromosome input .bam files.

3) ValidateSamFile
java -jar ${ph3} ValidateSamFile I=${tofixapplybqsr} MODE=SUMMARY

4)FixMateInformation
java -jar ${ph3} FixMateInformation I=${tofixapplybqsr} O=${applybqsr} CREATE_INDEX=true CREATE_MD5_FILE=true

5)ValidateSamFile

java -jar ${ph3} ValidateSamFile I=${applybqsr} MODE=SUMMARY

After step 2, the validation of the BAM file (step 3) gives me an error:

HISTOGRAM java.lang.String

Error Type Count
ERROR:MATE_NOT_FOUND 11647

Then I try to fix this error with FixMateInformation (step 4), but when I validate my BAM again, the error is still there!

HISTOGRAM java.lang.String

Error Type Count

ERROR:MATE_NOT_FOUND 11647

Is this kind of error important? Do I really have to fix it, or can I move on to the next steps (BAM -> gVCF)?
Any suggestion on how to fix it?

My intervals are: chr1... chr22, chrX, chrY, chrM.

Many thanks


Error details

ERROR: Read name A00125:27:H3JT2DMXX:2:2168:22227:12054, Mate not found for paired read
ERROR: Read name A00125:27:H3JT2DMXX:1:2229:28085:35008, Mate not found for paired read
ERROR: Read name A00125:27:H3JT2DMXX:2:2111:1045:33646, Mate not found for paired read
....
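
For reference, a hedged sketch of suppressing just this error class during validation while everything else is still checked, assuming Picard's IGNORE option and reusing the variables from the pipeline above:

java -jar ${ph3} ValidateSamFile I=${applybqsr} MODE=SUMMARY IGNORE=MATE_NOT_FOUND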

Biallelic vs Multiallelic sites


A biallelic site is a specific locus in a genome that contains two observed alleles, counting the reference as one, and therefore allowing for one variant allele. In practical terms, this is what you would call a site where, across multiple samples in a cohort, you have evidence for a single non-reference allele. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion at position 7. Sample 4 matches the reference. This is considered a biallelic site because there are only two possible alleles-- a deletion, or the reference allele G.

           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T - C G
Sample 3 : A T A T A T - C G
Sample 4 : A T A T A T G C G

A multiallelic site is a specific locus in a genome that contains three or more observed alleles, again counting the reference as one, and therefore allowing for two or more variant alleles. This is what you would call a site where, across multiple samples in a cohort, you see evidence for two or more non-reference alleles. Shown below is a toy example in which the consensus sequences for samples 1-3 have a deletion or a SNP at the 7th position. Sample 4 matches the reference. This is considered a multiallelic site because there are four possible alleles-- a deletion, the reference allele G, a C (SNP), or a T (SNP). True multiallelic sites are not observed very frequently unless you look at very large cohorts, so they are often taken as a sign of a noisy region where artifacts are likely.

           1 2 3 4 5 6 7 8 9
Reference: A T A T A T G C G
Sample 1 : A T A T A T - C G
Sample 2 : A T A T A T C C G
Sample 3 : A T A T A T T C G
Sample 4 : A T A T A T G C G
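
For concreteness, a sketch of how these two toy sites could be written as VCF records. Both records are anchored on the T at position 6, so the deletion is represented as TG > T; the chromosome name is a placeholder and the remaining required VCF columns are omitted:

#CHROM  POS  ID  REF  ALT
chr1    6    .   TG   T          (biallelic: the deletion of the G at position 7)
chr1    6    .   TG   T,TC,TT    (multiallelic: the deletion plus the two SNPs at position 7)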

GATK 4.0 does not have the IndelRealigner method?


Hi, I found that the latest version of GATK 4.0 does not have the RealignerTargetCreator and IndelRealigner modules! Does that mean we don't need them any more?

Problem with annotating GATK4 VCF file


Hi

Based on the GATK4 Best Practices pipeline, I have made a VCF file from the WES data of 4 individuals. I want to annotate it with ANNOVAR, but ANNOVAR cannot annotate all the variants: nearly 70% of them are discarded into Invalid_input.

I thought it might be due to the VCF version (4.2), but it doesn't work with ANNOVAR's default input format (avinput) either.

What is your suggestion for annotating GATK4 output VCFs?
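
For reference, a hedged sketch of the conversion step commonly used before annotating a multi-sample VCF (this assumes ANNOVAR's convert2annovar.pl; the flags are from memory and worth checking against the ANNOVAR documentation):

perl convert2annovar.pl -format vcf4 -allsample -withfreq input.vcf > input.avinput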

Confusion in using gVCF mode


Hi

I have a problem using the HaplotypeCaller gVCF mode (GATK4 Best Practices). Please help me with the following questions:

1. Should we run gVCF mode even when we have only one WES sample?

2. I have 3 WES samples. Should I use gVCF --> Consolidate --> GenotypeGVCFs --> VCF, or is it better to obtain the VCF directly from HaplotypeCaller and skip the subsequent steps? (See the sketch below.)

3. If I have 3-5 WES samples, is it better to run HaplotypeCaller with multiple input BAMs or on each sample separately?
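
For reference on question 2, a hedged sketch of the per-sample gVCF flow for 3 samples (paths are placeholders; CombineGVCFs is shown for the consolidate step, with GenomicsDBImport being the alternative):

# run once per sample
gatk HaplotypeCaller -R ref.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF
# consolidate the per-sample gVCFs, then joint-genotype the cohort
gatk CombineGVCFs -R ref.fasta -V sample1.g.vcf.gz -V sample2.g.vcf.gz -V sample3.g.vcf.gz -O combined.g.vcf.gz
gatk GenotypeGVCFs -R ref.fasta -V combined.g.vcf.gz -O cohort.vcf.gz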

Regards.


extracting forward and reverse reads from uBAM file


Hi everyone,

I am using the Ion 16S Metagenomics Kit to perform microbiome analysis of wastewater. After paired-end sequencing, I was given the raw reads in a uBAM file which contains both forward and reverse reads (unaligned). I used samtools to convert the uBAM into FASTQ (because QIIME 2.0 doesn't accept uBAM), but I was unable to get both forward and reverse reads. The command I used to convert the uBAM to FASTQ was
samtools fastq -0 out.fastq input.ubam
which gives me only one output FASTQ file, whereas I need two (one containing the forward and the other the reverse reads).

Please help! I am stuck.
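
For reference, a hedged sketch of splitting the pairs, assuming a reasonably recent samtools (collate groups mates by read name first; -1/-2 then send forward and reverse reads to separate files):

samtools collate -u -O input.ubam | samtools fastq -1 fwd.fastq -2 rev.fastq -0 /dev/null -s /dev/null -n -

Picard's SamToFastq with FASTQ= and SECOND_END_FASTQ= should be an equivalent route.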

[GATK 4.0.1.2] No non-zero singular values were found in creating a panel of normals for somatic CNV


Hello,

I got the exception below when creating a PoN for somatic CNV on about 80 WGS samples. The cohort consists of ~30 males and ~50 females. I was able to create PoNs successfully for each sex separately, but not for all samples together. Sex chromosomes were excluded from the PoN creation for both sexes.

The exception message suggests setting minimum-interval-median-percentile to a higher value. I wonder how much higher it should be, and what it means. Could you also help me understand the --number-of-eigensamples argument?

Thanks!

12:14:03.269 WARN  HDF5SVDReadCountPanelOfNormals - Exception encountered during creation of panel of normals (org.broadinstitute.hellbender.exceptions.UserException: No non-zero singular values were found.  It may be necessary to use stricter parameters for filtering.  For example, use a larger value of minimum-interval-median-percentile.).  Attempting to delete partial output in cromwell-executions/CNVSomaticPanelWorkflow/0c26e635-0641-4791-9769-adaa5cee0e87/call-CreateReadCountPanelOfNormals/execution/gatk_somatic_wgs.pon.hdf5...
18/02/18 12:14:03 INFO SparkUI: Stopped Spark web UI at http://192.168.1.12:4040
18/02/18 12:14:03 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/02/18 12:14:03 INFO MemoryStore: MemoryStore cleared
18/02/18 12:14:03 INFO BlockManager: BlockManager stopped
18/02/18 12:14:03 INFO BlockManagerMaster: BlockManagerMaster stopped
18/02/18 12:14:03 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/02/18 12:14:03 INFO SparkContext: Successfully stopped SparkContext
12:14:03.373 INFO  CreateReadCountPanelOfNormals - Shutting down engine
[February 18, 2018 12:14:03 PM CST] org.broadinstitute.hellbender.tools.copynumber.CreateReadCountPanelOfNormals done. Elapsed time: 3.64 minutes.
Runtime.totalMemory()=16482041856
org.broadinstitute.hellbender.exceptions.GATKException: Could not create panel of normals.  It may be necessary to use stricter parameters for filtering.  For example, use a larger value of minimum-interval-median-percentile.
        at org.broadinstitute.hellbender.tools.copynumber.denoising.HDF5SVDReadCountPanelOfNormals.create(HDF5SVDReadCountPanelOfNormals.java:341)
        at org.broadinstitute.hellbender.tools.copynumber.CreateReadCountPanelOfNormals.runPipeline(CreateReadCountPanelOfNormals.java:269)
        at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
        at org.broadinstitute.hellbender.Main.main(Main.java:277)
Caused by: org.broadinstitute.hellbender.exceptions.UserException: No non-zero singular values were found.  It may be necessary to use stricter parameters for filtering.  For example, use a larger value of minimum-interval-median-percentile.
        at org.broadinstitute.hellbender.tools.copynumber.denoising.HDF5SVDReadCountPanelOfNormals.create(HDF5SVDReadCountPanelOfNormals.java:317)
        ... 8 more
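
For reference, a hedged sketch of retrying with the stricter filter the message suggests (the input names and the percentile value are placeholders; as far as I know the default for --minimum-interval-median-percentile is 10.0, so anything above that is "stricter"):

gatk CreateReadCountPanelOfNormals -I sample1.counts.hdf5 -I sample2.counts.hdf5 --minimum-interval-median-percentile 25.0 --number-of-eigensamples 20 -O cnv.pon.hdf5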

A GATK RUNTIME ERROR, Invalid alignment found, alignmentStart > alignemntEnd


Hi, I was in the second step of Genome STRiP, SV discovery. It seems that all "partition" jobs worked fine except one. The error message looks like this:

ERROR 00:12:01,906 FunctionEdge - Contents of /u/flashscratch/h/hjzhou/biploar_del_discovery_out/deletions100k/logs/SVDiscovery-203.out:
...
##### ERROR stack trace
java.lang.RuntimeException: Invalid alignment found, alignmentStart (1231809) > alignemntEnd (1231808) HS2000-887_377:1:1112:13543:35396        81      chr12   1231809 70      62I38S  =       1230685 -1124   ATCTCTATCTCTATCTCTATCTGTGCCTATTGATATATCTGTATATATCTATCTAAATCTCTATCTCTATCTCTATCTGTGCCTATTGATATATCTGTAT    CA@>EFECFEDBEAHHHIIIHEHGHGIIJIIGIIGIGJJCJJIGFCJIJJIJJIIIHDEGHGIEJGHHHEHGIJIHJJIJJIJJJIJHGHHHFDFFFC@C    SA:Z:chr12,1231809,-,56S44M,60,1;       MC:Z:100M       OC:Z:62M38S     RG:Z:LP6005646-DNA_B05  NM:i:62 MQ:i:60 AS:i:57 XS:i:25
...

I am very new to Genome STRiP and GATK. Any clue or advice would be helpful.

Mutect2 outputting samples with AD of "."


I've set up a workflow that runs Mutect2, which completes successfully, but the AD of all variants is output as ".".

This is an issue as we use the AD for downstream stages. Is there any reason this might be happening?

I have attached both the VCF that is missing the AD field, and the log from our workflow that includes the GATK logging. Everything there seems fine, but perhaps I'm missing something.

CombineVariants Key . found in VariantContext field INFO but this key isn't defined in the VCFHeader


I have encountered the following error when trying to merge two VCFs from different callers:

Key . found in VariantContext field INFO at chrM:711 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

From looking at the VCF inputs, I can see that one of my VCFs has . as its INFO field (it's an output from Mutect 1). However, according to the VCF spec, this is a valid placeholder value.

I have attached the output log from this command, as well as the two VCFs.

Is this the correct behaviour? How can this error be avoided?

Germline VQSR recommended settings

$
0
0

Hi, we've been looking at the new Best Practices pages and at the WDLs linked there. In particular, we looked at the settings for VariantRecalibrator in this WDL. We ran germline analyses on samples HG001 and HG002 with versions 4.beta.2 and 4.0.0.0. The 4.beta.2 VariantRecalibrator used the 3.7 Best Practices parameters, while the 4.0.0.0 VariantRecalibrator used the settings from the mentioned WDL. Going from 4.beta.2 to 4.0.0.0, we noticed a decrease in SNP recall and an increase in INDEL recall. For example, for sample HG001, the scores are:

Version    SNP Precision  SNP Recall  INDEL Precision  INDEL Recall
4.beta.2   0.999058       0.997499    0.993983         0.986496
4.0.0.0    0.998427       0.985755    0.993613         0.993145

Are these scores to be expected, and if not, are there other VariantRecalibrator settings that we should use for germline analysis?

Note: Except for VariantRecalibrator, the settings for all of the tools are the same. HaplotypeCaller was run with --interval-set-rule UNION --genotyping-mode DISCOVERY --emit-ref-confidence GVCF. Also, the precision and recall scores for the raw VCFs output by HaplotypeCaller/GenotypeGVCFs are close to identical between 4.beta.2 and 4.0.0.0 for these samples. The exact command lines for 4.0.0.0 VQSR are:

./gatk --java-options "-Xmx2048M" VariantRecalibrator --rscript-file snp_hg001.recal.R --tranches-file snp_hg001.tranches --output snp_hg001.recal --use-annotation QD --use-annotation MQRankSum --use-annotation FS --use-annotation DP --use-annotation ReadPosRankSum --use-annotation SOR --use-annotation MQ --variant hg001.vcf --resource dbsnp,prior=7,truth=false,training=false,known=true:dbsnp_137.b37.vcf --resource 1000G,prior=10,truth=true,training=true,known=false:1000G_phase1.snps.high_confidence.b37.vcf --resource omni,prior=12,truth=true,training=true,known=false:1000G_omni2.5.b37.vcf --resource hapmap,prior=15,truth=true,training=true,known=false:hapmap_3.3.b37.vcf --truth-sensitivity-tranche 100 --truth-sensitivity-tranche 99.95 --truth-sensitivity-tranche 99.9 --truth-sensitivity-tranche 99.8 --truth-sensitivity-tranche 99.6 --truth-sensitivity-tranche 99.5 --truth-sensitivity-tranche 99.4 --truth-sensitivity-tranche 99.3 --truth-sensitivity-tranche 99 --truth-sensitivity-tranche 98 --truth-sensitivity-tranche 97 --truth-sensitivity-tranche 90 --trust-all-polymorphic --reference human_g1k_v37_decoy.fasta --mode SNP --max-gaussians 6

./gatk --java-options "-Xmx2048M" VariantRecalibrator --rscript-file indel_hg001.recal.R --tranches-file indel_hg001.tranches --output indel_hg001.recal --use-annotation DP --use-annotation FS --use-annotation ReadPosRankSum --use-annotation MQRankSum --use-annotation QD --use-annotation SOR --variant hg001.vcf --resource dbsnp,prior=2,truth=false,training=false,known=true:dbsnp_137.b37.vcf --resource mills,prior=12,truth=true,training=true,known=false:Mills_and_1000G_gold_standard.indels.b37.sites.vcf --truth-sensitivity-tranche 100 --truth-sensitivity-tranche 99.95 --truth-sensitivity-tranche 99.9 --truth-sensitivity-tranche 99.5 --truth-sensitivity-tranche 99 --truth-sensitivity-tranche 97 --truth-sensitivity-tranche 96 --truth-sensitivity-tranche 95 --truth-sensitivity-tranche 94 --truth-sensitivity-tranche 93.5 --truth-sensitivity-tranche 93 --truth-sensitivity-tranche 92 --truth-sensitivity-tranche 91 --truth-sensitivity-tranche 90 --trust-all-polymorphic --reference human_g1k_v37_decoy.fasta --mode INDEL --max-gaussians 4

./gatk --java-options "-Xmx2048M" ApplyVQSR --output hg001.vcf --variant hg001.vcf --truth-sensitivity-filter-level 99.7 --tranches-file snp_hg001.tranches --reference human_g1k_v37_decoy.fasta --recal-file snp_hg001.recal --mode SNP

./gatk --java-options "-Xmx2048M" ApplyVQSR --output hg001.vcf --variant hg001.vcf --truth-sensitivity-filter-level 99.7 --tranches-file indel_hg001.tranches --reference human_g1k_v37_decoy.fasta --recal-file indel_hg001.recal --mode INDEL

gatk3.8 vs gatk4 vs gatk4spark: the newer, the slower!!


I used gatk3.8, gatk4.0.0, and gatkspark to test my data, and I received a surprising result: gatk4 is slower than gatk3.8, and gatkspark is slower than both. The times are 17.3 vs 19.2 vs 24 min. The commands are basic and as follows:

gatk3.8

java -jar /GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller -R cr.fa -I 10_dedup_reads.bam -o testgatk3.raw.variants.vcf

gatk4.0.0

/gatk-4.0.0.0/gatk HaplotypeCaller -R cr.fa -I 10_dedup_reads.bam -O 10.g.vcf.gz

gatkspark

/gatk-4.0.0.0/gatk HaplotypeCallerSpark -R cr.2bit -I 10_dedup_reads.bam -O 10.g.vcf.gz
And I am sure that the IO, the CPUs, and the memory do not reach their limits, so did I do something wrong? Thanks a lot for reading or replying to my question!!!


GenotypeGVCFs error


Hi, I am getting the following error. I ran the exact same samples/pipeline a couple of weeks ago using 3.6 and it worked fine; now with 3.7 I am getting an error:

INFO 09:49:49,742 HelpFormatter - ----------------------------------------------------------------------------------
INFO 09:49:49,745 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 09:49:49,745 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 09:49:49,745 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 09:49:49,747 HelpFormatter - [Thu Feb 09 09:49:49 MST 2017] Executing on Linux 3.10.0-327.28.3.el7.x86_64 amd64
INFO 09:49:49,747 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_111-b15
INFO 09:49:49,749 HelpFormatter - Program Args: -T GenotypeGVCFs -nt 16 -R /gs0/home/martensc/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -o cohen_test.vcf --variant gatk/9273.raw.snps.indels.g.vcf --variant gatk/9274.raw.snps.indels.g.vcf --variant gatk/9275.raw.snps.indels.g.vcf --variant gatk/9276.raw.snps.indels.g.vcf --variant gatk/9277.raw.snps.indels.g.vcf --variant gatk/9278.raw.snps.indels.g.vcf --variant gatk/9279.raw.snps.indels.g.vcf --variant gatk/9280.raw.snps.indels.g.vcf --variant gatk/9281.raw.snps.indels.g.vcf
INFO 09:49:49,751 HelpFormatter - Executing as martensc@ai-rmlcpu09.niaid.nih.gov on Linux 3.10.0-327.28.3.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-b15.
INFO 09:49:49,751 HelpFormatter - Date/Time: 2017/02/09 09:49:49
INFO 09:49:49,751 HelpFormatter - ----------------------------------------------------------------------------------
INFO 09:49:49,751 HelpFormatter - ----------------------------------------------------------------------------------
INFO 09:49:50,063 GenomeAnalysisEngine - Strictness is SILENT
INFO 09:49:50,238 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 09:49:50,603 MicroScheduler - Running the GATK in parallel mode with 16 total threads, 1 CPU thread(s) for each of 16 data thread(s), of 16 processors available on this machine
INFO 09:49:50,673 GenomeAnalysisEngine - Preparing for traversal
INFO 09:49:50,676 GenomeAnalysisEngine - Done preparing for traversal
INFO 09:49:50,676 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 09:49:50,676 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 09:49:50,676 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
WARN 09:49:50,756 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN 09:49:50,757 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
WARN 09:49:50,757 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
INFO 09:49:50,757 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
WARN 09:49:51,164 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs
WARN 09:49:53,406 ExactAFCalculator - This tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1: 7828173 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument. Unless the DEBUG logging level is used, this warning message is output just once per run and further warnings are suppressed.

ERROR --
ERROR stack trace

java.lang.NullPointerException
at java.util.LinkedList$ListItr.next(LinkedList.java:893)
at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.coveredByDeletion(GenotypingEngine.java:426)
at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateOutputAlleleSubset(GenotypingEngine.java:387)
at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:251)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:392)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:375)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:330)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:326)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:304)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:135)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Thanks, Craig

OpenMP multi-threaded AVX-accelerated native PairHMM in HaplotypeCaller not supported


I'm unable to get a multithreaded instance of PairHMM to work in HaplotypeCaller with JDK 1.8 on my local machine (Intel i7 4770K, 8 threads) running macOS 10.12.6. I've tried both a pre-built version from the Docker hub as well as one that I built on my local machine, and in both cases I get the warning:
"NativeLibraryLoader - Unable to find native library: native/libgkl_pairhmm_omp.dylib"

I've tried the "-pairHMM AVX_LOGLESS_CACHING_OMP" option, but I then get:
"A USER ERROR has occurred: Machine does not support OpenMP AVX PairHMM.
PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported"

I suspect this might be caused by having a version of clang that doesn't support OpenMP, but I'm not sure. I've tried the Homebrew gcc and g++ compilers, and an OpenMP-enabled clang (http://openmp.llvm.org), to no avail. Or maybe the Intel 4770K can't support the OpenMP PairHMM?

Here's my command:
gatk --java-options "-Xmx20g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" HaplotypeCaller \
-R /Volumes/HighSierra/Users/tschappe/Documents/P.nicotianae_assembly/ASM148301v1/GCA_001483015.1_ASM148301v1_genomic.fna \
-I /Volumes/HighSierra/Users/tschappe/Documents/P.nicotianae_assembly/race0_2_sorted.bam \
-O /Volumes/HighSierra/Users/tschappe/Documents/P.nicotianae_assembly/race0.g.vcf.gz \
-pairHMM AVX_LOGLESS_CACHING_OMP
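
As a hedged workaround while the OpenMP library issue is unresolved: the single-threaded AVX implementation ships in the same jar and needs no OpenMP support, so the same command can be tried with a different -pairHMM value:

gatk --java-options "-Xmx20g" HaplotypeCaller \
-R /Volumes/HighSierra/Users/tschappe/Documents/P.nicotianae_assembly/ASM148301v1/GCA_001483015.1_ASM148301v1_genomic.fna \
-I /Volumes/HighSierra/Users/tschappe/Documents/P.nicotianae_assembly/race0_2_sorted.bam \
-O /Volumes/HighSierra/Users/tschappe/Documents/P.nicotianae_assembly/race0.g.vcf.gz \
-pairHMM AVX_LOGLESS_CACHING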

Here's the entire error stack trace:
16:28:17.652 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Applications/gatk-4.0/gatk/build/libs/gatk-package-4.0.0.0-37-g1316033-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
16:28:17.731 INFO HaplotypeCaller - ------------------------------------------------------------
16:28:17.731 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.0.0-37-g1316033-SNAPSHOT
16:28:17.731 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
16:28:17.731 INFO HaplotypeCaller - Executing as tschappe@Tylers-iMac.local on Mac OS X v10.12.6 x86_64
16:28:17.731 INFO HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_161-b12
16:28:17.731 INFO HaplotypeCaller - Start Date/Time: January 24, 2018 4:28:17 PM EST
16:28:17.731 INFO HaplotypeCaller - ------------------------------------------------------------
16:28:17.731 INFO HaplotypeCaller - ------------------------------------------------------------
16:28:17.732 INFO HaplotypeCaller - HTSJDK Version: 2.14.1
16:28:17.732 INFO HaplotypeCaller - Picard Version: 2.17.2
16:28:17.732 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 1
16:28:17.732 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:28:17.732 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:28:17.732 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:28:17.732 INFO HaplotypeCaller - Deflater: IntelDeflater
16:28:17.732 INFO HaplotypeCaller - Inflater: IntelInflater
16:28:17.732 INFO HaplotypeCaller - GCS max retries/reopens: 20
16:28:17.732 INFO HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
16:28:17.732 INFO HaplotypeCaller - Initializing engine
16:28:18.287 INFO HaplotypeCaller - Done initializing engine
16:28:18.332 INFO HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
16:28:18.877 INFO NativeLibraryLoader - Loading libgkl_utils.dylib from jar:file:/Applications/gatk-4.0/gatk/build/libs/gatk-package-4.0.0.0-37-g1316033-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_utils.dylib
16:28:18.880 WARN NativeLibraryLoader - Unable to find native library: native/libgkl_pairhmm_omp.dylib
16:28:18.880 INFO HaplotypeCaller - Shutting down engine
[January 24, 2018 4:28:18 PM EST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=740294656


A USER ERROR has occurred: Machine does not support OpenMP AVX PairHMM.


org.broadinstitute.hellbender.exceptions.UserException$HardwareFeatureException: Machine does not support OpenMP AVX PairHMM.
at org.broadinstitute.hellbender.utils.pairhmm.VectorLoglessPairHMM.(VectorLoglessPairHMM.java:78)
at org.broadinstitute.hellbender.utils.pairhmm.PairHMM$Implementation.lambda$static$4(PairHMM.java:64)
at org.broadinstitute.hellbender.utils.pairhmm.PairHMM$Implementation.makeNewHMM(PairHMM.java:120)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.(PairHMMLikelihoodCalculationEngine.java:141)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyBasedCallerUtils.createLikelihoodCalculationEngine(AssemblyBasedCallerUtils.java:169)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.initialize(HaplotypeCallerEngine.java:191)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.(HaplotypeCallerEngine.java:160)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.(HaplotypeCallerEngine.java:151)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.onTraversalStart(HaplotypeCaller.java:197)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:891)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:152)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:275)

Regarding of piping - Picard and BWA (Align and MergeBamAlignment step)


I made 3 BAM files with the command below.

Picard version: 2.17.8
BWA version: 0.7.17-r1188

compression_level=2
java_opt="-Xmx32G"
bwa_version="0.7.17-r1188"
bwa_commandline="mem -K 100000000 -p -v 3 -t 64 -Y ${ref_fasta}"

java ${java_opt} -jar ${PICARD_JAR} SamToFastq \
I=${INPUT_BAM} \
INTERLEAVE=true NON_PF=true \
FASTQ=/dev/stdout \
TMP_DIR=${TMP_DIR} | \
${BWA} ${bwa_commandline} /dev/stdin - 2> >(tee ${OUTPUT_BAM}.stderr.log >&2) | \
java -Dsamjdk.compression_level=${compression_level} -Xms12G -jar ${PICARD_JAR} \
MergeBamAlignment \
    VALIDATION_STRINGENCY=SILENT \
    EXPECTED_ORIENTATIONS=FR \
    ATTRIBUTES_TO_RETAIN=X0 \
    ATTRIBUTES_TO_REMOVE=NM \
    ATTRIBUTES_TO_REMOVE=MD \
    ALIGNED_BAM=/dev/stdin \
    UNMAPPED_BAM=${INPUT_BAM} \
    OUTPUT=${OUTPUT_BAM} \
    REFERENCE_SEQUENCE=${ref_fasta} \
    PAIRED_RUN=true \
    SORT_ORDER="unsorted" \
    IS_BISULFITE_SEQUENCE=false \
    ALIGNED_READS_ONLY=false \
    CLIP_ADAPTERS=false \
    MAX_RECORDS_IN_RAM=2000000 \
    ADD_MATE_CIGAR=true \
    MAX_INSERTIONS_OR_DELETIONS=-1 \
    PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
    PROGRAM_RECORD_ID="bwamem" \
    PROGRAM_GROUP_VERSION="${bwa_version}" \
    PROGRAM_GROUP_COMMAND_LINE="${bwa_commandline}" \
    PROGRAM_GROUP_NAME="bwamem" \
    UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
    ALIGNER_PROPER_PAIR_FLAGS=true \
    UNMAP_CONTAMINANT_READS=true \
    ADD_PG_TAG_TO_READS=false

Then I tried the MarkDuplicates step, but it had a problem.

Exception in thread "main" htsjdk.samtools.FileTruncatedException: Premature end of file: /BiO/Project/brandon-genome-analysis/analysis/B001.fastqtosam.unmerged.bam
at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:530)
at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:458)
at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:196)
at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:331)
at java.io.DataInputStream.read(DataInputStream.java:149)
at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:418)
at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:394)
at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:380)
at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:209)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:829)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:803)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
at htsjdk.samtools.MergingSamRecordIterator.next(MergingSamRecordIterator.java:130)
at htsjdk.samtools.MergingSamRecordIterator.next(MergingSamRecordIterator.java:38)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:495)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:232)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

All the BAM files were truncated.

$ samtools view -c /BiO/Project/brandon-genome-analysis/analysis/B001.fastqtosam.unmerged.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error -1 after 107 of 180 bytes
[main_samview] truncated file.
$ samtools view -c /BiO/Project/brandon-genome-analysis/analysis/B002.fastqtosam.unmerged.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error -1 after 1 of 180 bytes
[main_samview] truncated file.
$ samtools view -c /BiO/Project/brandon-genome-analysis/analysis/B003.fastqtosam.unmerged.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error -1 after 10 of 39 bytes
[main_samview] truncated file.
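
For reference, a hedged way to catch truncated inputs before a long pipeline run (samtools quickcheck exits non-zero when the BGZF EOF block is missing; the glob matches the three files above):

samtools quickcheck -v /BiO/Project/brandon-genome-analysis/analysis/B00?.fastqtosam.unmerged.bam && echo "all BAMs intact" || echo "at least one BAM truncated"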

Current status of GATK4 GermlineCNVCaller tools and best practices.


Hi,

I would like to try out GATK4 for discovering or genotyping germline CNVs in a cohort of a few hundred whole-genome sequenced samples. I work with non-human species data, but the genome sizes are similar to human or smaller.

The best practice documentation for germline CNV calling is still empty.
https://software.broadinstitute.org/gatk/best-practices/workflow?id=11148

According to the gatk4-4.0.0.0-0 JAR file, the germline CNV calling tools are already included:
java -jar ./gatk4-4.0.0.0-0/gatk-package-4.0.0.0-local.jar
USAGE: [-h]
--------------------------------------------------------------------------------------
Copy Number Variant Discovery: Tools that analyze read coverage to detect copy number variants.
AnnotateIntervals (BETA Tool) Annotates intervals with GC content
CallCopyRatioSegments (BETA Tool) Calls copy-ratio segments as amplified, deleted, or copy-number neutral
CombineSegmentBreakpoints (EXPERIMENTAL Tool) Combine the breakpoints of two segment files and annotate the resulting intervals with chosen columns from each file.
CreateReadCountPanelOfNormals (BETA Tool) Creates a panel of normals for read-count denoising
DenoiseReadCounts (BETA Tool) Denoises read counts to produce denoised copy ratios
DetermineGermlineContigPloidy (BETA Tool) Determines the baseline contig ploidy for germline samples given counts data.
GermlineCNVCaller (BETA Tool) Calls copy-number variants in germline samples given their counts and the output of DetermineGermlineContigPloidy.
ModelSegments (BETA Tool) Models segmented copy ratios from denoised read counts and segmented minor-allele fractions from allelic counts
PlotDenoisedCopyRatios (BETA Tool) Creates plots of denoised copy ratios
PlotModeledSegments (BETA Tool) Creates plots of denoised and segmented copy-ratio and minor-allele-fraction estimates
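
For context, a hedged sketch of how the two germline tools in this listing chain together (the tool names are from the listing above; the argument names are assumptions based on the 4.0 tool docs and worth verifying with --help):

# 1) baseline ploidy per contig, per sample (read counts from CollectReadCounts; names are placeholders)
gatk DetermineGermlineContigPloidy -I sample1.counts.hdf5 -I sample2.counts.hdf5 --contig-ploidy-priors ploidy_priors.tsv --output ploidy --output-prefix cohort
# 2) cohort-mode CNV calling, consuming the ploidy calls
gatk GermlineCNVCaller --run-mode COHORT -L intervals.interval_list -I sample1.counts.hdf5 -I sample2.counts.hdf5 --contig-ploidy-calls ploidy/cohort-calls --output cnv --output-prefix cohort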

Can you give some more information about the current status of the GATK4 GermlineCNVCaller tools, and do you have an estimate of when the best practices for these tools will be available?

It would also be nice if you could give an idea of whether the GATK4 GermlineCNVCaller tools are expected to work for non-human species, e.g. other vertebrates, simple or complex plant genomes, and bacteria.

Thank you.

GATK - 4.0.0.0 [BaseRecalibratorSpark low performance]


Dear GATK team,

I'd like to run Spark-enabled GATK tools on a Spark cluster. Specifically, I am launching a Spark cluster in standalone mode, submitting the BaseRecalibratorSpark application via Slurm. Before the official release, I was running the gatk-4.beta.6-17 version with the following allocated resources and command line for the Spark arguments:

./gatk-launch BaseRecalibratorSpark --sparkRunner SPARK --sparkMaster spark://${MASTER} --driver-memory 80g --num-executors 16 --executor-memory 8g

The runtime achieved was 3.79 min. However, with the official GATK-4.0.0.0 release, with the same data files and the same Spark arguments, I don't see the same nice speed-up anymore (~40 min). Am I missing something with the new version, or with the invoking command line?

Thanks in advance for your time and kind answer. Best, Giuseppe
