Channel: Recent Discussions — GATK-Forum

Different "java.lang.OutOfMemoryError: Java heap space" at nearly the end of SVPreprocess


Dear Genome STRiP users,

I have nearly completed SVPreprocess on all 10,686 samples:

INFO  01:17:45,002 QGraph - 4 Pend, 2 Run, 0 Fail, 32076 Done

However, I encountered a "java.lang.OutOfMemoryError: Java heap space" similar to, but not the same as, the one I reported before:

ERROR 01:19:31,326 FunctionEdge - Error:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs/gs_tempdir/svpre_tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.ComputeDepthProfiles'  '-O' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir/profiles_100Kb/profile_seq_chr16_100000.dat.gz'  '-I' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir/headers.bam'  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt' '-configFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt'  '-R' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-L' 'chr16:1-500000'  '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta'  '-md' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir'  '-profileBinSize' '100000'  '-maximumReferenceGapLength' '10000'  
ERROR 01:19:31,333 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/logs/SVPreprocess-32077.out:
INFO  01:18:49,973 HelpFormatter - ------------------------------------------------------------- 
INFO  01:18:49,976 HelpFormatter - Program Name: org.broadinstitute.sv.apps.ComputeDepthProfiles 
INFO  01:18:49,979 HelpFormatter - Program Args: -O /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir/profiles_100Kb/profile_seq_chr16_100000.dat.gz -I /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir/headers.bam -configFile /proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt -configFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt -R /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta -L chr16:1-500000 -genomeMaskFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta -md /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir -profileBinSize 100000 -maximumReferenceGapLength 10000 
INFO  01:18:49,983 HelpFormatter - Executing as minzhi@c0924.ll.unc.edu on Linux 3.10.0-957.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_191-b12. 
INFO  01:18:49,984 HelpFormatter - Date/Time: 2019/03/01 01:18:49 
INFO  01:18:49,984 HelpFormatter - ------------------------------------------------------------- 
INFO  01:18:49,984 HelpFormatter - ------------------------------------------------------------- 
INFO  01:18:49,999 ComputeDepthProfiles - Opening reference sequence ... 
INFO  01:18:50,002 ComputeDepthProfiles - Opened reference sequence. 
INFO  01:18:50,003 ComputeDepthProfiles - Opening genome mask ... 
INFO  01:18:50,005 ComputeDepthProfiles - Opened genome mask. 
INFO  01:18:50,007 MetaData - Opening metadata ...  
INFO  01:18:50,007 MetaData - Adding metadata location /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir ... 
INFO  01:18:50,018 MetaData - Opened metadata. 
INFO  01:18:50,018 ComputeDepthProfiles - Opened metadata. 
INFO  01:18:50,018 ComputeDepthProfiles - Initializing input data set ... 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:139)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:94)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:655)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:376)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:202)
    at org.broadinstitute.sv.dataset.SAMFileLocation.createSamFileReader(SAMFileLocation.java:97)
    at org.broadinstitute.sv.dataset.SAMLocation.createSamFileReader(SAMLocation.java:41)
    at org.broadinstitute.sv.dataset.DataSet.initInputFile(DataSet.java:138)
    at org.broadinstitute.sv.dataset.DataSet.initialize(DataSet.java:128)
    at org.broadinstitute.sv.apps.ComputeDepthProfiles.initDataSet(ComputeDepthProfiles.java:263)
    at org.broadinstitute.sv.apps.ComputeDepthProfiles.run(ComputeDepthProfiles.java:141)
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
    at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
    at org.broadinstitute.sv.apps.ComputeDepthProfiles.main(ComputeDepthProfiles.java:109) 

After this error, the log shows:

INFO  01:20:13,358 QGraph - 4 Pend, 1 Run, 1 Fail, 32076 Done 

The same kind of error then repeated for another job rather than the pipeline exiting directly:

ERROR 01:20:43,371 FunctionEdge - Error:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs/gs_tempdir/svpre_tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.CallSampleGender'  '-O' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir/sample_gender.report.txt'  '-I' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir/headers.bam'  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt' '-configFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt'  '-R' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-md' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir'  '-genderBedFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gendermask.bed'  '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta'  '-L' 'chr16:1-500000'  '-ploidyMapFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_standard_ploidy.map'  
ERROR 01:20:43,375 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/logs/SVPreprocess-32080.out:
INFO  01:19:29,576 HelpFormatter - --------------------------------------------------------- 
INFO  01:19:29,580 HelpFormatter - Program Name: org.broadinstitute.sv.apps.CallSampleGender 
INFO  01:19:29,587 HelpFormatter - Program Args: -O /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir/sample_gender.report.txt -I /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir/headers.bam -configFile /proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt -configFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt -R /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta -md /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir -genderBedFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gendermask.bed -genomeMaskFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta -L chr16:1-500000 -ploidyMapFile /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_standard_ploidy.map 
INFO  01:19:29,594 HelpFormatter - Executing as minzhi@c0816.ll.unc.edu on Linux 3.10.0-957.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_191-b12. 
INFO  01:19:29,594 HelpFormatter - Date/Time: 2019/03/01 01:19:29 
INFO  01:19:29,594 HelpFormatter - --------------------------------------------------------- 
INFO  01:19:29,595 HelpFormatter - --------------------------------------------------------- 
INFO  01:19:29,595 CallSampleGender - Opening reference sequence ... 
INFO  01:19:29,598 CallSampleGender - Opened reference sequence. 
INFO  01:19:29,609 MetaData - Opening metadata ...  
INFO  01:19:29,610 MetaData - Adding metadata location /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/md_tempdir ... 
INFO  01:19:29,626 MetaData - Opened metadata. 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:139)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:94)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:655)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:376)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:202)
    at org.broadinstitute.sv.dataset.SAMFileLocation.createSamFileReader(SAMFileLocation.java:97)
    at org.broadinstitute.sv.dataset.SAMLocation.createSamFileReader(SAMLocation.java:41)
    at org.broadinstitute.sv.dataset.DataSet.initInputFile(DataSet.java:138)
    at org.broadinstitute.sv.dataset.DataSet.initialize(DataSet.java:128)
    at org.broadinstitute.sv.apps.CallSampleGender.run(CallSampleGender.java:105)
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
    at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
    at org.broadinstitute.sv.apps.CallSampleGender.main(CallSampleGender.java:92) 
INFO  01:20:43,375 QGraph - Writing incremental jobs reports... 

The details of this error differ from the other "java.lang.OutOfMemoryError: Java heap space" reports I found in the forum. Is it possible to solve this problem by editing the SVQScript? May I have your suggestions?
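For reference, the failing jobs run with -Xmx2048m, and both stack traces show the heap being exhausted while reading the header of headers.bam, which merges the read groups of all 10,686 samples. A sketch of what I am considering, assuming my Genome STRiP version supports the Queue -memLimit argument for raising the per-job heap (${OTHER_SVPREPROCESS_ARGS} stands for my existing, unchanged arguments):

java -Xmx4g -cp ${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/SVPreprocess.q \
-memLimit 4 \
${OTHER_SVPREPROCESS_ARGS}

Thank you very much.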

Best regards,
Wusheng


Best Practices RNAseq Test Files


Hello GATK team,

I've been using NA12878.bam as a test file for the RNAseq short variant discovery workflow, as stated in the inputs JSON file. After running the workflow, the fraction of unmapped reads was 99.26%. It seems that the reads in this file are single-end rather than paired-end. According to the WDL file, the workflow always expects paired-end data and cannot properly handle single-end reads. Could you confirm that this unmapped BAM is not a proper test file for the RNAseq workflow, and if that is the case, could you point me to an adequate test file? Maybe there's something I am missing, but is there a specific reason for leaving out support for both single-end and paired-end reads?
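For context, this is how I measured the unmapped fraction (a standard samtools check; the BAM name here is illustrative, it is the aligned output of the workflow):

samtools flagstat NA12878.aligned.bam

The 99.26% figure comes from the "mapped" percentage that flagstat reports.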

Thanks in advance,
Nemanja

Out of order read after MarkDuplicatesSpark + BaseRecalibrator/ApplyBQSR


Hi,

I am building a workflow for discovery of somatic SNVs + indels that largely follows the Broad's Best Practices, but incorporates MarkDuplicatesSpark and a couple of other minor changes. Today I was running a normal-tumor pair of samples from WES experiments in GCP, and everything was going well until the workflow failed during Mutect2. In one of the shards (I am scattering the M2 step across 12 splits of the exome BED file) I got this error:

    13:53:46.994 INFO  ProgressMeter -       chr19:18926479             20.2                 22440           1112.1
    13:53:51.138 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.589863008
    13:53:51.145 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 415.78724766500005
    13:53:51.147 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 82.56 sec
    13:53:52.161 INFO  Mutect2 - Shutting down engine
    [February 19, 2019 1:53:52 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 20.68 minutes.
    Runtime.totalMemory()=1132453888
    java.lang.IllegalArgumentException: Attempting to add a read to ActiveRegion out of order w.r.t. other reads: lastRead SRR3270880.37535587 chr19:19227104-19227253 at 19227104 attempting to add SRR3270880.23592400 chr19:19226999-19227148 at 19226999
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:730)
        at org.broadinstitute.hellbender.engine.AssemblyRegion.add(AssemblyRegion.java:338)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.fillNextAssemblyRegionWithReads(AssemblyRegionIterator.java:230)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:194)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:135)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:34)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:286)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
    Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar

The other 11 shards finished without errors and produced the expected output.

I checked the BAM from the tumor sample and indeed the read mentioned in the error is out of order. It is the second read from the end in the following snippet (showing only the first 9 columns of the BAM):

    SRR3270880.37535587 163 chr19   19227104    60  150M    =   19227395    441
    SRR3270880.46694860 147 chr19   19227106    60  150M    =   19226772    -484
    SRR3270880.60287639 1171    chr19   19227106    60  150M    =   19226772    -484
    SRR3270880.68448188 83  chr19   19227106    60  150M    =   19226611    -645
    SRR3270880.70212050 1171    chr19   19227106    60  150M    =   19226772    -484
    SRR3270880.23592400 163 chr19   19226999    60  150M    =   19227232    383
    SRR3270880.21876644 1171    chr19   19227001    60  150M    =   19226793    -358

The read does not have any bad-quality flags, and it appears twice in the BAM; it is in the correct order at its first occurrence (the second read in the following snippet):

    SRR3270880.61849825 147 chr19   19226995    60  150M    =   19226895    -250
    SRR3270880.23592400 163 chr19   19226999    60  150M    =   19227232    383
    SRR3270880.21876644 1171    chr19   19227001    60  150M    =   19226793    -358
    SRR3270880.47062210 147 chr19   19227001    60  150M    =   19226625    -526

The workflow does not include SortSam after MarkDuplicatesSpark, as MarkDuplicatesSpark's output is supposed to be coordinate sorted. From the BAM's header: @HD VN:1.6 GO:none SO:coordinate

Prior to Mutect2, BaseRecalibrator-GatherBQSRReports-ApplyBQSR-GatherBamFiles (non-Spark versions) finished without any errors. These steps are also scattered across interval splits of the exome BED file.

Strikingly, the start and end positions of this out-of-order read span from the last interval of interval split 6 into the first interval of interval split 7. Perhaps the read was included in two contiguous splits of the BAM file at the same time, and that is why it appears twice in the BAM after the merge done by GatherBamFiles. (Last interval from split 6: chr19 19226311 19227116; first interval from split 7: chr19 19227145 19228774.)

Intervals in my workflow are split with the SplitIntervals tool (GATK 4.1.0.0). I am currently including the argument --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION and suspect this could be related to the error.
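For reference, my SplitIntervals invocation looks roughly like this (reference and BED file names are illustrative):

gatk SplitIntervals \
-R Homo_sapiens_assembly38.fasta \
-L exome_targets.bed \
--scatter-count 12 \
--subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
-O interval_splits/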

Any ideas on how this issue can be solved?

Thank you in advance

How is the phasing done in single-sample HaplotypeCaller?

Hi, after running HaplotypeCaller with this command:
gatk --java-options "-Xmx4g" HaplotypeCaller -R $refGenome -I /home/ready.bam -ERC GVCF -O /home/GATK4-HC.g.vcf

I obtain positions in the gvcf file like these:
1 1243896 . C T,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT ./.
1 1243929 . G T,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID ./.:0|1:1243929_G_T
1 4204648 . CTACCA C,<NON_REF> 0 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID ./.:0|1:4204601_T_C
1 6292991 . C <NON_REF> . . END=6293126 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0

How is it possible for those positions to have 0|1 if I have not used any database or any other samples?

How does ./. differ from 0/0:0:0:0:0,0,0 in the GT field?

Thank you very much

GermlineCNVCaller: some samples do not have read depth metadata


GATK 4.1.0.0, Linux server, bash

Hi, I'm testing the following commands:

  ${GATK4} --java-options "${javaOpt}" DetermineGermlineContigPloidy \
  -L ${fol7}/"${INTERVAL}.preprocessed.interval_list" \
  --interval-merging-rule "OVERLAPPING_ONLY" \
  ${INPUT_HDF5} \
  --contig-ploidy-priors ${fol8}/ContigPloidyPriors.tsv \
  -O ${fol8}/"Karyo"/ \
  --output-prefix "Karyo_cohort" \
  --verbosity "DEBUG" \
  --tmp-dir ${tmp}/

 ${GATK4} --java-options "${javaOpt}" GermlineCNVCaller \
 --run-mode COHORT \
 -L ${fol7}/"${INTERVAL}.preprocessed.interval_list" \
 --interval-merging-rule "OVERLAPPING_ONLY" \
  --contig-ploidy-calls ${fol8}/"Karyo"/ \
 ${INPUT_HDF5} \
 --output ${fol8}/"germCNV"/ \
 --output-prefix "germCNV_cohort" \
 --verbosity "DEBUG" \
 --tmp-dir ${tmp}/

In both tools (DetermineGermlineContigPloidy, GermlineCNVCaller) I used the same interval list and the same input files, so I cannot figure out why the second command, GermlineCNVCaller, fails with the error "Some samples do not have read depth metadata".

Part of the logs:

  Traceback (most recent call last):
    File "/home/manolis/GATK4/tmp/cohort_denoising_calling.6605166682471840218.py", line 133, in <module>
      n_st, sample_names, sample_metadata_collection)
    File "/share/apps/bio/miniconda2/envs/gatk4100/lib/python3.6/site-packages/gcnvkernel/models/model_denoising_calling.py", line 379, in __init__
      sample_metadata_collection, sample_names, self.contig_list)
    File "/share/apps/bio/miniconda2/envs/gatk4100/lib/python3.6/site-packages/gcnvkernel/models/model_denoising_calling.py", line 618, in _get_baseline_
  copy_number_and_read_depth
      "Some samples do not have read depth metadata"
  AssertionError: Some samples do not have read depth metadata
  17:51:21.529 DEBUG ScriptExecutor - Result: 1
  17:51:21.531 INFO  GermlineCNVCaller - Shutting down engine
  [March 1, 2019 5:51:21 PM CET] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 0.77 minutes.
  Runtime.totalMemory()=3591372800
  org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException: 
  python exited with 1
  Command Line: python ...
  ....
          at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
          at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
          at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)

...
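One thing I am unsure about: --contig-ploidy-calls currently points at the parent output directory of DetermineGermlineContigPloidy. If that tool writes its ploidy calls into a <output-prefix>-calls subdirectory (an assumption on my part), then perhaps GermlineCNVCaller should point at that subdirectory instead, e.g.:

 --contig-ploidy-calls ${fol8}/"Karyo"/Karyo_cohort-calls \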

Many thanks for any help!

GATK4 Mutect2: is there an order for the normal sample name and tumor sample name?

I saw a guide about how to use GATK4 Mutect2 in which the normal sample comes first. Is this a fixed order, and what is the order in GATK3?

If the tumor is given first, is there any bad consequence?
Thanks a lot.

Is there a depth limit in GATK4? If so, does that mean it is not suitable for high-depth data?


Is there a depth limit in GATK4? If so, does that mean it is not suitable for high-depth (deep sequencing) data? Is there a concrete depth number?

Does CollectSequencingArtifactMetrics only need to be run on the tumor BAM file?


When running paired samples with GATK4 Mutect2, does CollectSequencingArtifactMetrics only need to be run on the tumor BAM file?


(How to) Call somatic mutations using GATK4 Mutect2


Post suggestions and read about updates in the Comments section.


This tutorial introduces researchers to considerations in somatic short variant discovery using GATK4 Mutect2. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood and are aligned to GRCh38 with post-alt processing [1]. The tutorial focuses on how to call traditional somatic short mutations, as described in Article#11127 and pipelined in GATK v4.0.0.0's mutect2.wdl [2]. The tool and its workflow are in BETA status as of this writing, which means they may undergo changes and are not guaranteed for production.

► For Broad Mutation Calling Best Practices, see FireCloud Article#45055.

Section 1 calls somatic mutations with Mutect2 using all the bells and whistles of the tool. Section 2 outlines how to create the panel of normals resource using the tumor-only mode of Mutect2. Section 3 outlines how to estimate cross-sample contamination. Section 4 shows how to filter the callset with FilterMutectCalls. Unlike GATK3, in GATK4 the somatic calling and filtering functionalities are embodied by separate tools. Section 5 shows an optional filtering step to filter by sequence context artifacts that present with orientation bias, e.g. OxoG artifacts. Section 6 shows how to set up in IGV for manual review. Finally, section 7 provides a brief list of related resources that may be of interest to researchers.

GATK4 Mutect2 is a versatile variant caller that not only is more sensitive than, but is also roughly twice as fast as, HaplotypeCaller's reference confidence mode. Researchers who wish to customize analyses should find the tutorial's descriptions of the multiple levers of Mutect2 in section 1 and descriptions of the tumor-only mode of Mutect2 in section 2 of interest.


Jump to a section

  1. Call somatic short variants and generate a bamout with Mutect2
    1.1 What are the Mutect2 annotations?
    1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?
  2. Create a sites-only PoN with CreateSomaticPanelOfNormals
    2.1 The tumor-only mode of Mutect2 is useful outside of pon creation
  3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination
    3.1 What if I find high levels of contamination?
  4. Filter for confident somatic calls using FilterMutectCalls
  5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias
    5.1 Tally of applied filters for the tutorial data
  6. Set up in IGV to review somatic calls
  7. Related resources

Tools involved

  • GATK v4.0.0.0 is available in a Docker image and as a standalone jar. For the latest release, see the Downloads page. Note that GATK v4.0.0.0 contains Picard tools from release v2.17.2 that are callable with the gatk launch script.
  • Desktop IGV. The tutorial uses v2.3.97.

Download example data

Download tutorial_11136.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data and resources, see [3] and [4].

► The tutorial steps switch between the subset and full data. Some of the data files, e.g. BAMs, are restricted to a small region of the genome to efficiently pace the tutorial. Other files, e.g. the Mutect2 calls that the tutorial filters, are from the entire genome. The tutorial content was originally developed for the 2017-09 Helsinki workshop and we make the full data files, i.e. the resource files and the BAMs, available at gs://gatk-best-practices/somatic-hg38.


1. Call somatic short variants and generate a bamout with Mutect2

Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see Article#11127. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.

gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam 

This produces a raw unfiltered somatic callset 1_somatic_m2.vcf.gz, a reassembled reads BAM 2_tumor_normal_m2.bam and the respective indices 1_somatic_m2.vcf.gz.tbi and 2_tumor_normal_m2.bai.

Comments on select parameters

  • Specify the case sample for somatic calling with two parameters. Provide the BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName. Alternatively, use samtools view -H tumor.bam | grep '@RG' (see the lookup example after this list).
  • Prefilter variant sites in a control sample alignment. Specify the control BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.
  • Prefilter variant sites in a panel of normals callset. Specify the panel of normals (PoN) VCF with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it presents commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool does not reassemble nor emit variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.
  • Annotate variant alleles by specifying a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025 which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.
  • Include reads whose mate maps to a different contig. For our somatic analysis that uses alt-aware and post-alt processed alignments to GRCh38, we disable a specific read filter with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (in alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.
  • Target the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.
  • Generate the reassembled alignments file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.
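For the sample-name lookup mentioned in the first bullet, either of the following works (file names illustrative):

gatk GetSampleName -I tumor.bam -O tumor_sample_name.txt
samtools view -H tumor.bam | grep '@RG'

GetSampleName writes the SM field value to the output text file; the samtools command prints the @RG header lines for visual inspection.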

To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.


We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.


☞ 1.1 What are the Mutect2 annotations?

We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.


The Variant Annotations section of the Tool Documentation further describes some of the annotations. For a complete list of annotations available in GATK4, see this site.

To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.


☞ 1.2 What is the impact of disabling the MateOnSameContigOrNoMappedMateReadFilter read filter?

To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter (MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigs and read pairs that span contigs can now lend support to variant calls.

For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls come from the ALT, HLA and decoy contigs.


back to top


2. Create a sites-only PoN with CreateSomaticPanelOfNormals

We make the motions of creating a PoN using three germline samples. These samples are HG00190, NA19771 and HG02759 [3].

First, run Mutect2 in tumor-only mode on each normal sample. In tumor-only mode, a single case sample is analyzed with the -tumor flag without an accompanying matched control -normal sample. For the tutorial, we run this command only for sample HG00190.

gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz

This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.


We see for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively for the two sites.

Comments on select parameters

  • One option that is not used here is to include a germline resource with --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af.
  • Recapitulate any special options used in somatic calling in the panel of normals sample calling, e.g. --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.

Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for panel of normals is a minimum of forty samples.

gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.
CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.


Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.

What do you think of including samples of family members in the PoN?


☞ 2.1 The tumor-only mode of Mutect2 is useful outside of pon creation

For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples. This approach is described in Blog#11315.


back to top


3. Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination

First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.
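For example, a biallelic-only subset can be prepared with SelectVariants as follows (a sketch; input and output file names are illustrative, and the tutorial already provides such a subset as resources/chr17_small_exac_common_3_grch38.vcf.gz):

gatk SelectVariants \
-R Homo_sapiens_assembly38.fasta \
-V af-only-gnomad.grch38.vcf.gz \
--restrict-alleles-to BIALLELIC \
-O af-only-gnomad.biallelic.vcf.gz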

gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table

This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.


Comments on select parameters

  • The tool only considers homozygous alternate sites in the sample that have a population allele frequency that ranges between that set by --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
  • One option to speed up analysis, that is not used in the command above, is to limit data collection to a sufficiently large but subset genomic region with the -L argument.
  • As of GATK4.0.8.0, released August 2, 2018, GetPileupSummaries requires both -L and -V parameters. For the tutorial, provide the same resources/chr17_small_exac_common_3_grch38.vcf.gz file to each parameter. For details, see the GetPileupSummaries tool documentation.

Second, estimate contamination with CalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table

This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.


Comments on select parameters

  • CalculateContamination can operate in two modes. The command above uses the mode that simply estimates contamination for a given sample. The alternate mode incorporates the metrics for the matched normal, to enable a potentially more accurate estimate. For the second mode, run GetPileupSummaries on the normal sample and then provide the normal pileup table to CalculateContamination with the -matched argument.
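A sketch of this matched mode with GATK v4.0.0.0 as used in this tutorial (the normal pileup table name is illustrative; the numbered file names follow the tutorial's convention):

gatk GetPileupSummaries \
-I normal.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O normal_getpileupsummaries.table

gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-matched normal_getpileupsummaries.table \
-O 8_pair_calculatecontamination.table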

► Cross-sample contamination differs from normal contamination of tumor and tumor contamination of normal. Currently, the workflow does not account for the latter type of purity issue.


☞ 3.1 What if I find high levels of contamination?

One thing to rule out is sample swaps at the read group level.

Picard’s CrosscheckFingerprints can detect sample-swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual and (ii) check at the read group level that each of the read group data come from the same individual.

Again, imagine if we mistook the contaminating read group data as some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us All read groups related as expected.


back to top


4. Filter for confident somatic calls using FilterMutectCalls

FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.

Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.

gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz

This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.
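For example, following the gzcat convention used elsewhere in this tutorial:

gzcat 9_somatic_oncefiltered.vcf.gz | grep '##FILTER'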


This step seemingly applies 14 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.

So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.

► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.


back to top


5. (Optional) Estimate artifacts with CollectSequencingArtifactMetrics and filter them with FilterByOrientationBias

FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.

First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.

gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
--FILE_EXTENSION ".txt" \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

Alternatively, use the tool from a standalone Picard jar.

java -jar picard.jar \
CollectSequencingArtifactMetrics \
I=tumor.bam \
O=10_tumor_artifact \
FILE_EXTENSION=.txt \
R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta

This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.


Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.

  • The TOTAL_QSCORE is Phred-scaled such that lower scores equate to a higher probability that the change is artifactual. E.g. forty translates to a 1 in 10,000 probability (see the worked conversion after this list). For OxoG, a rough cutoff for concern is 30. FilterByOrientationBias uses the quality score as a prior that a context will produce an artifact. The tool also weighs the evidence from the reads. For example, if the QSCORE is 50 but the allele is supported by 15 reads in F1R2 and no reads in F2R1, then the tool should filter the call.
  • FFPE stands for formalin-fixed, paraffin-embedded. Formaldehyde deaminates cytosines and thereby results in C→T transition mutations. Oxidation of guanine to 8-oxoguanine results in G→T transversion mutations during library preparation. Another Picard tool, CollectOxoGMetrics, similarly gives Phred-scaled scores for the 16 three-base extended sequence contexts. In GATK4 Mutect2, the F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.
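To make the Phred scaling explicit (this is the standard Phred definition, not anything specific to these tools): a score Q corresponds to an error probability p = 10^(-Q/10). So TOTAL_QSCORE = 40 gives p = 10^(-4), i.e. 1 in 10,000, and the OxoG concern cutoff of 30 gives p = 10^(-3), i.e. 1 in 1,000. Lower scores therefore mean the change is more probably artifactual.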

Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.

gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz

This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.


Is the filtering in line with our earlier prediction?

In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.



☞ 5.1 Tally of applied filters for the tutorial data

The table shows the breakdown in filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallies the instances in which each filter was applied across the calls and the third column tallies the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).


Which filters appear to have the greatest impact? What types of calls do you think compel manual review?

Examine passing records with the following command. Take note of the AD and AF annotation values in particular, as they show the high sensitivity of the caller.

gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less


back to top


6. Set up in IGV to review somatic calls

Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.

To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect that of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.

First, load Human (hg38) as the reference in IGV. Then load these six files in order:

  • resources/chr17_pon.vcf.gz
  • resources/chr17_af-only-gnomad_grch38.vcf.gz
  • 11_somatic_twicefiltered.vcf.gz
  • 2_tumor_normal_m2.bam
  • normal.bam
  • tumor.bam

With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.

Second, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).

  • One of the tracks is dominating the view. Right-click on track chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.
  • Zoom into the somatic call in 11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.
  • Hover over or click on the gray call in track 11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.
  • Scroll through the alignment data and notice the coverage for the samples.

A C→T variant is in tumor.bam but not normal.bam. What is happening in 2_tumor_normal_m2.bam?

Third, tweak IGV settings that aid in visualizing reassembled alignments.

  • Make room to focus on track 2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bam, normal.bam and their coverages. Right-click and Remove Tracks.
  • Go to View>Preferences>Alignments. Toggle on Show center line and toggle off Downsample reads.
  • Drag the alignments panel to center the red variant.
  • Right-click on the alignments track and

    • Group by sample
    • Sort by base
    • Color by tag: HC.
  • Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.


What are the three grouped tracks for the bamout? What do the pastel versus gray colors indicate? How plausible is it that all tumor copies of this locus have this alteration?

Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.


CHROM POS ID REF ALT QUAL FILTER INFO
chr17 7,674,220 . C T . PASS DP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15
FORMAT GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB
HCC1143_normal 0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false
HCC1143_tumor 0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946

Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.

CHROM POS REF ALT FILTER
chr17 4,539,344 T TA artifact_in_normal;germline_risk;panel_of_normals
chr17 7,221,420 CACTGCCCTAGGTCAGGA C artifact_in_normal;panel_of_normals;str_contraction
chr17 7,483,063 A AC mapping_quality;t_lod
chr17 8,513,688 GTT G panel_of_normals
chr17 19,748,387 G GA t_lod
chr17 26,982,033 G GC artifact_in_normal;clustered_events
chr17 30,059,463 CT C t_lod
chr17 35,422,473 C CA t_lod
chr17 35,671,734 CTT C,CT,CTTT artifact_in_normal;multiallelic;panel_of_normals
chr17 43,104,057 CA C artifact_in_normal;germline_risk;panel_of_normals
chr17 43,104,072 AAAAAAAAAGAAAAG A panel_of_normals;t_lod
chr17 46,332,538 G GT artifact_in_normal;panel_of_normals
chr17 47,157,394 CAA C panel_of_normals;t_lod
chr17 50,124,771 GCACACACACACACACA G clustered_events;panel_of_normals;t_lod
chr17 68,907,890 GA G artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod
chr17 69,182,632 C CA artifact_in_normal;t_lod
chr17 69,182,835 GAAAA G panel_of_normals


back to top


7. Related resources

The next step after generating a carefully manicured somatic callset is typically functional annotation.

  • Funcotator is available in BETA and can annotate GRCh38 and prior reference aligned VCF format data.
  • Oncotator can annotate GRCh37 and prior reference aligned MAF and VCF format data. It is also possible to download and install the tool following instructions in Article#4154.
  • Annotate with the external program VEP to predict phenotypic changes and confirm or hypothesize biochemical effects.

For a cohort, after annotation, use MutSig to discover driver mutations. MutsigCV (the version is CV) is available on GenePattern. If more samples are needed to increase the power of the analysis, consider padding the analysis set with TCGA Project or other data.

The dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665 shows somatic copy number alterations for the HCC1143 tumor sample. Its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. Somatic copy number alteration detection will be covered in another GATK tutorial. For reference implementations of Somatic CNV workflows see here.


back to top


Footnotes

[1] Data was alt-aware aligned to GRCh38 and post-alt processed. For an introduction to alt-aware alignment and post-alt processing, see [Blog#8180](https://software.broadinstitute.org/gatk/blog?id=8180). The HCC1143 alignments are identical to that in [Tutorial#9183](https://software.broadinstitute.org/gatk/documentation/article?id=9183), which uses GATK3 MuTect2.

[2] For scripted GATK Best Practices Somatic Short Variant Discovery workflows, see [https://github.com/gatk-workflows](https://github.com/gatk-workflows). Within the repository, as of this writing, [gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels), which uses GRCh37, is the sole GATK4 Mutect2 workflow. This tutorial uses additional parameters not used in the [GRCh37 gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels) example because the tutorial data was preprocessed with post-alt processing of alt-aware alignments, which deviates from production practices. The general workflow steps remain the same.

[3] About the tutorial data:

  • The data tarball contains 15 files in the main directory, six files in its resources folder and twenty files in its precomputed folder. Of the files, chr17 refers to data subset to the regions in chr17plus.interval_list, the m2pon consists of forty 1000 Genomes Project samples, pon refers to the panel of normals, tumor to the tumor HCC1143 breast cancer sample and normal to its matched blood normal.
  • Again, example data are based on a breast cancer cell line and its matched normal cell line derived from blood. Both cell lines are consented and known as HCC1143 and HCC1143_BL, respectively. The Broad Cancer Genome Analysis (CGA) group has graciously provided 2x76 paired-end whole exome sequence data from the two cell lines (C835.HCC1143_2 and C835.HCC1143_BL.4), and @shlee reverted and aligned these to GRCh38 using alt-aware alignment and post-alt processing as described in Tutorial#8017. During preprocessing, the MergeBamAlignment step was omitted, reads containing adapter sequence were removed altogether for both samples (~0.153% of reads in the tumor) as determined by MarkIlluminaAdapters, base qualities were not binned during base recalibration and indel realignment was included to match the toolchain of the PoN normals. The program group for base recalibration is absent from the BAM headers due to a bug in the version of PrintReads at the time of pre-processing, in January of 2017.
  • Note that the tutorial uses exome data for its small size. The workflow is applicable to whole genome sequence data (WGS).
  • @shlee lifted-over or remapped the gnomAD resource files from GRCh37 counterparts to GRCh38. The tutorial uses subsets of the full resources; the full-length versions are available at gs://gatk-best-practices/somatic-hg38/. The official GRCh37 versions of the resources are available in the GATK Resource Bundle and are based on the gnomAD resource. These GRCh37 versions were prepared by @davidben according to the method outlined in the mutect_resources.wdl and described in [4].
  • The full data in the tutorial were generated by @shlee using the github.com/broadinstitute/gatk mutect2.wdl from between the v4.0.0.0 and v4.0.0.1 release with commit hash b4d1ddd. The GATK Docker image was broadinstitute/gatk:4.0.0.0 and Picard was v2.14.1. A single modification was made to the script to enable generating the bamout. The script was run locally on a Google Cloud Compute VM using Cromwell v30.1. Given Docker was installed and the specified Docker images were present on the VM, Cromwell automatically launched local Docker container instances during the run and handled the local files as hard-links to avoid redundant copying. Workflow input variables were as follows.
{
  "##_COMMENT1:": "WORKFLOW STEP OPTIONS",
  "Mutect2.is_run_oncotator": "False",
  "Mutect2.is_run_orientation_bias_filter": "True",
  "Mutect2.picard": "/home/shlee/picard-2.14.1.jar",
  "Mutect2.gatk_docker": "broadinstitute/gatk:4.0.0.0",
  "Mutect2.oncotator_docker": "broadinstitute/oncotator:1.9.3.0",
...
  "##_COMMENT3:": "ANALYSIS PARAMETERS",
  "Mutect2.artifact_modes": ["G/T", "C/T"],
  "Mutect2.m2_extra_args": "--af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter",
  "Mutect2.m2_extra_filtering_args": "",
  "Mutect2.scatter_count": "10"
}
  • If using newer versions of the mutect2.wdl that allow setting SplitIntervals optional arguments, @shlee recommends setting --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION to avoid splitting contigs (a hedged sketch follows this list).
  • With the exception of the PoN and Picard tool steps, data was generated using v4.0.0.0. The PoN was generated using GATK4 vbeta.6. Besides the syntax, little changed for the Mutect2 workflow between these releases, and the workflow and most of its tools remain in beta status as of this writing. We used Picard v2.14.1 for the CollectSequencingArtifactMetrics step. Figures in section 5 reflect results from Picard v2.11.0, which gives, at a glance, results identical to v2.14.1.
  • The three samples in section 2 are present in the forty sample PoN used in section 1 and they are 1000 Genomes Project samples.
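As referenced above, a hedged SplitIntervals sketch (the reference and interval list are the tutorial's; the scatter count and output directory are placeholder choices):

gatk SplitIntervals \
    -R Homo_sapiens_assembly38.fasta \
    -L chr17plus.interval_list \
    --scatter-count 10 \
    --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
    -O split_intervals/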

[4] The WDL script [mutect_resources.wdl](https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl) takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in _section 1_ and a common biallelic variants resource for use in _section 3_. The script first generates a sites-only VCF and in the process _removes all extraneous annotations_ except for `AF` allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF.
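A minimal sketch of the same two steps as standalone GATK commands (file names are placeholders and the WDL remains the authoritative recipe; note that SelectVariants drops the genotype columns here but does not strip INFO annotations, which the WDL additionally reduces to `AF` alone):

# Step 1: make a sites-only VCF (drops all genotype columns)
gatk SelectVariants \
    -V gnomad.vcf.gz \
    --sites-only-vcf-output \
    -O gnomad.sites.vcf.gz

# Step 2: select the biallelic sites for the common biallelic resource
gatk SelectVariants \
    -V gnomad.sites.vcf.gz \
    --restrict-alleles-to BIALLELIC \
    -O gnomad.biallelic.vcf.gz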



GATK4 pipeline in easy bash scripts, please


Hi, I asked this question a while ago and a few times. I know there is a wonderful WDL platform and FireCloud stuff to run things in parallel and check this and that. But for someone who is so used to a series of simple bash commands, can you please kindly provide an example script like the one shown here: https://gencore.bio.nyu.edu/variant-calling-pipeline/?

Right after I found the above, I found that it is not updated for GATK4, and I would hate to use a pipeline based on an outdated version of the engine.

I will say "Thank You So Much, GATK". For almost a year, I still could not make GATK run on my own server, although there are a million documents, tutorials and PPTs to be googled everywhere.
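For reference, here is a minimal sketch of the kind of plain-bash sequence being asked for, covering the core GATK4 germline steps for a single paired-end sample. Every path, sample name and resource file below is a placeholder, and production pipelines add per-lane handling, validation and scatter-gather on top of this:

# Align and sort (bwa and samtools assumed installed)
bwa mem -t 8 -R '@RG\tID:S1\tSM:S1\tPL:ILLUMINA' hg38.fasta R1.fq.gz R2.fq.gz \
    | samtools sort -o S1.sorted.bam -
samtools index S1.sorted.bam

# Mark duplicates
gatk MarkDuplicates -I S1.sorted.bam -O S1.md.bam -M S1.dup_metrics.txt

# Base quality score recalibration
gatk BaseRecalibrator -R hg38.fasta -I S1.md.bam \
    --known-sites dbsnp.vcf.gz --known-sites mills.vcf.gz -O S1.recal.table
gatk ApplyBQSR -R hg38.fasta -I S1.md.bam \
    --bqsr-recal-file S1.recal.table -O S1.bqsr.bam

# Call variants per sample, then genotype
gatk HaplotypeCaller -R hg38.fasta -I S1.bqsr.bam -O S1.g.vcf.gz -ERC GVCF
gatk GenotypeGVCFs -R hg38.fasta -V S1.g.vcf.gz -O S1.vcf.gz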

SVCNVDiscovery Error "java.lang.OutOfMemoryError: Java heap space" AFTER adding "-memLimit 100" flag


Dear Genome STRiP users,

I completed SVPreprocess on the 10686 samples successfully. When I run the SVCNVDiscovery pipeline with the following script,

classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
gs_dir="/proj/yunligrp/users/minzhi/gs"
svpreprocess_dir="${gs_dir}/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success"
rundir="${gs_dir}/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000"

java -Xmx4g -cp ${classpath} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -cp ${classpath} \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
    -R ${gs_dir}/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
    -I ${gs_dir}/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_full_sample.list \
    -genderMapFile ${gs_dir}/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_full_all-male_gender.map \
    -ploidyMapFile ${gs_dir}/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_standard_ploidy.map \
    -md ${svpreprocess_dir}/md_tempdir \
    -tempDir ${gs_dir}/gs_tempdir/svcnv_tmp \
    -runDirectory ${rundir} \
    -jobLogDir ${rundir}/logs \
    -intervalList ${gs_dir}/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_1-500000_interval.list \
    -tilingWindowSize 1000 \
    -tilingWindowOverlap 500 \
    -maximumReferenceGapLength 1000 \
    -boundaryPrecision 100 \
    -minimumRefinedLength 500 \
    -memLimit 100 \
    -jobRunner Drmaa \
    -gatkJobRunner Drmaa \
    -jobNative "--mem=100000 --time=08:00:00 --nodes=1 --ntasks-per-node=8" \
    -jobQueue general \
    -run \
    || exit 1

The modifications I made to the QScripts are:

  1. In SVQScript.q (lines 1284 & 1285):

        this.memoryLimit = Some(85)
        this.javaMemoryLimit = Some(85)

  2. Under the "CallSampleGender()" and "ComputeDepthProfile(profilesDir: File, sequenceName: String, intervalList: List[GenomeInterval])" functions, I added the following lines to allow more memory:

        this.memoryLimit = Some(10)
        this.javaMemoryLimit = Some(10)

Then I got the following errors:

  1. Non-specific error
ERROR 16:06:20,037 FunctionEdge - Error:  'java'  '-Xmx102400m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs/gs_tempdir/svcnv_tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.gatk.queue.QCommandLine'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-S' '/proj/yunligrp/users/minzhi/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStage2.q'  '-S' '/proj/yunligrp/users/minzhi/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStageBase.q' '-S' '/proj/yunligrp/users/minzhi/svtoolkit/qscript/discovery/cnv/CNVDiscoveryGenotyper.q'  '-S' '/proj/yunligrp/users/minzhi/svtoolkit/qscript/SVQScript.q'  '-gatk' '/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar'  '-jobLogDir' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16/logs'  '-memLimit' '100.0'  '-jobRunner' 'Drmaa'  '-gatkJobRunner' 'Drmaa'  '-jobNative' '--mem=100000 --time=08:00:00 --nodes=1 --ntasks-per-node=8'  '-jobQueue' 'general'  -run  '-sequenceName' 'chr16'  '-runDirectory' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16'  '-sentinelFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_sentinel_files/stage_2_seq_chr16.sent'  --disableJobReport  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt'  '-R' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-ploidyMapFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_standard_ploidy.map'  '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta' '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.lcmask.fasta'  '-copyNumberMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gcmask.fasta'  '-readDepthMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.rdmask.bed'  '-genderMaskBedFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gendermask.bed'  '-vdjBedFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.vdjregions.bed'  '-genderMapFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_full_all-male_gender.map'  '-md' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir'  -disableGATKTraversal  '-I' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/bam_headers/merged_headers.bam'  '-vcf' 
'/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage1/seq_chr16/seq_chr16.sites.vcf.gz'  '-genderGenotypeFilterFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/gender_gt_filters/gender_gt_filter.txt'  '-filterDescriptionFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/gender_gt_filters/gender_gt_filter_descr.txt'  '-genotypingParallelRecords' '1000'  
ERROR 16:06:20,069 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/logs/CNVDiscoveryPipeline-5.out:
INFO  16:05:27,250 QScriptManager - Compiling 4 QScripts 
INFO  16:05:32,094 QScriptManager - Compilation complete 
INFO  16:05:32,226 HelpFormatter - ---------------------------------------------------------------------- 
INFO  16:05:32,227 HelpFormatter - Queue v3.7.GS-r1748-0-g74bfe0b, Compiled 2018/04/10 10:30:23 
INFO  16:05:32,227 HelpFormatter - Copyright (c) 2012 The Broad Institute 
INFO  16:05:32,227 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:05:32,227 HelpFormatter - Program Args: -cp /proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar -S /proj/yunligrp/users/minzhi/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStage2.q -S /proj/yunligrp/users/minzhi/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStageBase.q -S /proj/yunligrp/users/minzhi/svtoolkit/qscript/discovery/cnv/CNVDiscoveryGenotyper.q -S /proj/yunligrp/users/minzhi/svtoolkit/qscript/SVQScript.q -gatk /proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar -jobLogDir /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16/logs -memLimit 100.0 -jobRunner Drmaa -gatkJobRunner Drmaa -jobNative --mem=100000 --time=08:00:00 --nodes=1 --ntasks-per-node=8 -jobQueue general -run -sequenceName chr16 -runDirectory /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16 -sentinelFile /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_sentinel_files/stage_2_seq_chr16.sent --disableJobReport -configFile /proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt -R /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta -ploidyMapFile /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_standard_ploidy.map -genomeMaskFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta -genomeMaskFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.lcmask.fasta -copyNumberMaskFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gcmask.fasta -readDepthMaskFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.rdmask.bed -genderMaskBedFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gendermask.bed -vdjBedFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.vdjregions.bed -genderMapFile /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_full_all-male_gender.map -md /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir -disableGATKTraversal -I /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/bam_headers/merged_headers.bam -vcf /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage1/seq_chr16/seq_chr16.sites.vcf.gz -genderGenotypeFilterFile /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/gender_gt_filters/gender_gt_filter.txt -filterDescriptionFile /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/gender_gt_filters/gender_gt_filter_descr.txt -genotypingParallelRecords 1000 
INFO  16:05:32,228 HelpFormatter - Executing as minzhi@c0318.ll.unc.edu on Linux 3.10.0-957.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_191-b12. 
INFO  16:05:32,228 HelpFormatter - Date/Time: 2019/03/02 16:05:32 
INFO  16:05:32,228 HelpFormatter - 
  2. java.lang.OutOfMemoryError: Java heap space
ERROR 16:06:03,437 FunctionEdge - Error:  'java'  '-Xmx3072m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs/gs_tempdir/svcnv_tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVGenotyper '-T' 'SVGenotyperWalker'  '-R' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-I' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/bam_headers/merged_headers.bam'  '-O' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16/P0001.genotypes.vcf.gz'  '-disableGATKTraversal' 'true'  '-md' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir'  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt' '-configFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt'  '-P' 'genotyping.modules:depth' '-P' 'depth.readCountCacheIgnoreGenomeMask:true'  '-runDirectory' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16'  '-genderMapFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_full_all-male_gender.map'  '-ploidyMapFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_standard_ploidy.map'  '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta' '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.lcmask.fasta'  '-vcf' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage1/seq_chr16/seq_chr16.sites.vcf.gz'  '-partitionName' 'P0001'  '-partition' 'records:1-922'  
ERROR 16:06:03,450 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16/logs/CNVDiscoveryStage2-1.out:
INFO  16:05:46,373 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  16:05:46,375 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7.GS-r1748-0-g74bfe0b, Compiled 2018/04/10 10:30:23 
INFO  16:05:46,376 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  16:05:46,376 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  16:05:46,376 HelpFormatter - [Sat Mar 02 16:05:46 EST 2019] Executing on Linux 3.10.0-957.el7.x86_64 amd64 
INFO  16:05:46,376 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_191-b12 
INFO  16:05:46,380 HelpFormatter - Program Args: -T SVGenotyperWalker -R /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta -O /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16/P0001.genotypes.vcf.gz -disableGATKTraversal true -md /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir -configFile /proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt -configFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt -P genotyping.modules:depth -P depth.readCountCacheIgnoreGenomeMask:true -runDirectory /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16 -genderMapFile /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_full_all-male_gender.map -ploidyMapFile /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_standard_ploidy.map -genomeMaskFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta -genomeMaskFile /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.lcmask.fasta -vcf /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage1/seq_chr16/seq_chr16.sites.vcf.gz -partitionName P0001 -partition records:1-922 -L chr1:1-1 
INFO  16:05:46,385 HelpFormatter - Executing as minzhi@c0320.ll.unc.edu on Linux 3.10.0-957.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_191-b12. 
INFO  16:05:46,385 HelpFormatter - Date/Time: 2019/03/02 16:05:46 
INFO  16:05:46,386 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  16:05:46,386 HelpFormatter - ----------------------------------------------------------------------------------------- 
INFO  16:05:46,396 02-Mar-2019 GenomeAnalysisEngine - Strictness is SILENT
INFO  16:05:47,594 02-Mar-2019 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  16:05:48,474 02-Mar-2019 IntervalUtils - Processing 1 bp from intervals
INFO  16:05:48,521 02-Mar-2019 GenomeAnalysisEngine - Preparing for traversal
INFO  16:05:48,522 02-Mar-2019 GenomeAnalysisEngine - Done preparing for traversal
INFO  16:05:48,522 02-Mar-2019 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  16:05:48,522 02-Mar-2019 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  16:05:48,522 02-Mar-2019 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  16:05:48,528 02-Mar-2019 SVGenotyper - Opening reference sequence ...
INFO  16:05:48,528 02-Mar-2019 SVGenotyper - Opened reference sequence.
INFO  16:05:48,529 02-Mar-2019 SVGenotyper - Opening genome mask ...
INFO  16:05:48,530 02-Mar-2019 SVGenotyper - Opened genome mask.
INFO  16:05:48,532 02-Mar-2019 MetaData - Opening metadata ... 
INFO  16:05:48,532 02-Mar-2019 MetaData - Adding metadata location /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir ...
INFO  16:05:48,535 02-Mar-2019 MetaData - Opened metadata.
INFO  16:05:48,569 02-Mar-2019 SVGenotyper - Initializing input data set ...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:139)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:94)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:655)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:376)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:202)
    at org.broadinstitute.sv.dataset.SAMFileLocation.createSamFileReader(SAMFileLocation.java:97)
    at org.broadinstitute.sv.dataset.SAMLocation.createSamFileReader(SAMLocation.java:41)
    at org.broadinstitute.sv.dataset.DataSet.initInputFile(DataSet.java:138)
    at org.broadinstitute.sv.dataset.DataSet.initialize(DataSet.java:128)
    at org.broadinstitute.sv.genotyping.SVGenotyperWalker.initDataSet(SVGenotyperWalker.java:355)
    at org.broadinstitute.sv.genotyping.SVGenotyperWalker.initialize(SVGenotyperWalker.java:197)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:141)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:91)
    at org.broadinstitute.sv.main.SVGenotyper.main(SVGenotyper.java:21) 
INFO  16:06:03,451 QGraph - Writing incremental jobs reports... 
INFO  16:06:03,452 QGraph - 8 Pend, 0 Run, 1 Fail, 0 Done 
INFO  16:06:03,453 QCommandLine - Writing final jobs report... 
INFO  16:06:03,454 QCommandLine - Done with errors 

You can see that even though I added the flag "-memLimit 100", the second error above still shows '-Xmx3072m' at the top.

Besides, there are also two failure messages here.

The first also shows '-Xmx3072m':

INFO  16:06:03,458 QGraph - Failed:   'java'  '-Xmx3072m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs/gs_tempdir/svcnv_tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVGenotyper '-T' 'SVGenotyperWalker'  '-R' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-I' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/bam_headers/merged_headers.bam'  '-O' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16/P0001.genotypes.vcf.gz'  '-disableGATKTraversal' 'true'  '-md' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir'  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt' '-configFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt'  '-P' 'genotyping.modules:depth' '-P' 'depth.readCountCacheIgnoreGenomeMask:true'  '-runDirectory' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16'  '-genderMapFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_full_all-male_gender.map'  '-ploidyMapFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_standard_ploidy.map'  '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta' '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.lcmask.fasta'  '-vcf' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage1/seq_chr16/seq_chr16.sites.vcf.gz'  '-partitionName' 'P0001'  '-partition' 'records:1-922'  
INFO  16:06:03,458 QGraph - Log:     /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16/logs/CNVDiscoveryStage2-1.out 
INFO  16:06:03,459 QCommandLine - Script failed: 8 Pend, 0 Run, 1 Fail, 0 Done 
------------------------------------------------------------------------------------------
Done. There were no warn messages.

But the second shows '-Xmx102400m':

INFO  16:06:22,771 QGraph - Failed:   'java'  '-Xmx102400m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs/gs_tempdir/svcnv_tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.gatk.queue.QCommandLine'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-S' '/proj/yunligrp/users/minzhi/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStage2.q'  '-S' '/proj/yunligrp/users/minzhi/svtoolkit/qscript/discovery/cnv/CNVDiscoveryStageBase.q' '-S' '/proj/yunligrp/users/minzhi/svtoolkit/qscript/discovery/cnv/CNVDiscoveryGenotyper.q'  '-S' '/proj/yunligrp/users/minzhi/svtoolkit/qscript/SVQScript.q'  '-gatk' '/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar'  '-jobLogDir' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16/logs'  '-memLimit' '100.0'  '-jobRunner' 'Drmaa'  '-gatkJobRunner' 'Drmaa'  '-jobNative' '--mem=100000 --time=08:00:00 --nodes=1 --ntasks-per-node=8'  '-jobQueue' 'general'  -run  '-sequenceName' 'chr16'  '-runDirectory' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage2/seq_chr16'  '-sentinelFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_sentinel_files/stage_2_seq_chr16.sent'  --disableJobReport  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt'  '-R' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-ploidyMapFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_standard_ploidy.map'  '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta' '-genomeMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.lcmask.fasta'  '-copyNumberMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gcmask.fasta'  '-readDepthMaskFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.rdmask.bed'  '-genderMaskBedFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gendermask.bed'  '-vdjBedFile' '/proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.vdjregions.bed'  '-genderMapFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/supporting_freeze6-AA_chr16/freeze6-AA_chr16_full_all-male_gender.map'  '-md' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir'  -disableGATKTraversal  '-I' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/bam_headers/merged_headers.bam'  '-vcf' 
'/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/cnv_stage1/seq_chr16/seq_chr16.sites.vcf.gz'  '-genderGenotypeFilterFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/gender_gt_filters/gender_gt_filter.txt'  '-filterDescriptionFile' '/proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/gender_gt_filters/gender_gt_filter_descr.txt'  '-genotypingParallelRecords' '1000'  
INFO  16:06:22,771 QGraph - Log:     /proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svcnv_freeze6-AA_chr16_standard_full_single_1-500000over1-500000/logs/CNVDiscoveryPipeline-5.out 
INFO  16:06:22,772 QCommandLine - Script failed: 10 Pend, 0 Run, 1 Fail, 4 Done

Does this mean I need to edit the QScript more deeply to allow more memory? I find that the first subjob (successful) triggered by SVCNVDiscovery takes about 80 GB of memory, but the other subjobs (failed) take only about 2.9 GB. In particular, some of the errors and failures come from QGraph steps that I cannot find in the QScript. May I have your suggestions about these errors? Thank you very much.

Best regards,
Wusheng

SVCNVDiscovery Error: java.lang.RuntimeException: Read count cache file


Dear Genome STRiP users,

I am running the SVCNVDiscovery pipeline on 10686 samples after a successfully completed SVPreprocess run. I encountered two "java.lang.RuntimeException" errors in the output.

Exception in thread "main" java.lang.RuntimeException: Read count cache file file:///proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir/rccache.bin is truncated
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:65)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
    at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
    at org.broadinstitute.sv.genotyping.RefineCNVBoundaries.main(RefineCNVBoundaries.java:133)
Caused by: java.lang.RuntimeException: Read count cache file file:///proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir/rccache.bin is truncated
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.decodeRow(ReadCountFileReader.java:516)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.getReadCacheItems(ReadCountFileReader.java:470)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.aggregateSampleReadCounts(ReadCountFileReader.java:476)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader.getReadCounts(ReadCountFileReader.java:266)
    at org.broadinstitute.sv.common.ReadCountCache.getReadCounts(ReadCountCache.java:100)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:295)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:245)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getReadCounts(GenotypingDepthModule.java:230)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getCnpReadCounts(GenotypingDepthModule.java:217)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.genotypeCnp(GenotypingDepthModule.java:141)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.genotypeCnp(BoundaryRefinementAlgorithm.java:287)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineOneBoundary(BoundaryRefinementAlgorithm.java:633)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineBoundaryStep(BoundaryRefinementAlgorithm.java:553)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineBoundaries(BoundaryRefinementAlgorithm.java:536)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.processVariant(BoundaryRefinementAlgorithm.java:232)
    at org.broadinstitute.sv.genotyping.RefineCNVBoundaries.run(RefineCNVBoundaries.java:204)
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
    ... 5 more 
INFO  23:29:51,212 QGraph - Writing incremental jobs reports... 
INFO  23:29:51,213 QGraph - 4 Pend, 5 Run, 1 Fail, 33 Done 
INFO  23:30:51,242 FunctionEdge - Done:  'java'  '-Xmx102400m' ...
...
INFO  23:25:21,377 MetaData - Opened metadata. 
INFO  23:25:21,436 RefineCNVBoundaries - Initializing input data set ... 
INFO  23:25:31,820 RefineCNVBoundaries - Initialized data set: 1 file, 121337 read groups, 10148 samples. 
INFO  23:25:32,452 ReadCountCache - Initializing read count cache with 1 file. 
mInputFile=file:///proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir/rccache.bin mCurrentSequenceName=chr16; mCurrentPosition=500001
Exception in thread "main" java.lang.RuntimeException: Read count cache file file:///proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir/rccache.bin is truncated
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:65)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
    at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
    at org.broadinstitute.sv.genotyping.RefineCNVBoundaries.main(RefineCNVBoundaries.java:133)
Caused by: java.lang.RuntimeException: Read count cache file file:///proj/yunligrp/users/minzhi/gs/freeze6-AA_chr16/svpre_freeze6-AA_chr16_standard_full_single_1-500000over1-500000_parallel_success/md_tempdir/rccache.bin is truncated
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.decodeRow(ReadCountFileReader.java:516)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.getReadCacheItems(ReadCountFileReader.java:470)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.aggregateSampleReadCounts(ReadCountFileReader.java:476)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader.getReadCounts(ReadCountFileReader.java:266)
    at org.broadinstitute.sv.common.ReadCountCache.getReadCounts(ReadCountCache.java:100)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:295)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:245)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getReadCounts(GenotypingDepthModule.java:230)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getCnpReadCounts(GenotypingDepthModule.java:217)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.genotypeCnp(GenotypingDepthModule.java:141)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.genotypeCnp(BoundaryRefinementAlgorithm.java:287)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineOneBoundary(BoundaryRefinementAlgorithm.java:633)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineBoundaryStep(BoundaryRefinementAlgorithm.java:558)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineBoundaries(BoundaryRefinementAlgorithm.java:536)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.processVariant(BoundaryRefinementAlgorithm.java:232)
    at org.broadinstitute.sv.genotyping.RefineCNVBoundaries.run(RefineCNVBoundaries.java:204)
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
    ... 5 more 
INFO  23:42:51,601 QGraph - Writing incremental jobs reports... 
INFO  23:42:51,602 QGraph - 4 Pend, 0 Run, 2 Fail, 37 Done 
INFO  23:42:51,605 QCommandLine - Writing final jobs report... 
INFO  23:42:51,606 QCommandLine - Done with errors 
INFO  23:42:51,609 QGraph - ------- 
INFO  23:42:51,609 QGraph - Failed:   'java'  '-Xmx102400m'  ...
...

Has anyone encountered this error? Is it related to the original .bam files, or is my SVPreprocess output broken? May I have your suggestions? Thank you very much.

Best regards,
Wusheng

VariantRecalibrator accepts only one VCF file?


The documentation on VariantRecalibrator reads:

--variant  -V   []  One or more VCF files containing variants

However, when I supply multiple VCF files I get this error message:

A USER ERROR has occurred: Invalid argument 'out_GenotypeGVCFs/2nd.vcf.gz'.

Previously one could do --input vcfs.list. Now it seems the VCFs have to be merged prior to running VariantRecalibrator? Thanks!
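If merging does turn out to be required, one hedged way to do it is Picard's MergeVcfs via the gatk wrapper (the first input file name below is a placeholder; the inputs must share the same samples and reference):

gatk MergeVcfs \
    -I out_GenotypeGVCFs/1st.vcf.gz \
    -I out_GenotypeGVCFs/2nd.vcf.gz \
    -O out_GenotypeGVCFs/merged.vcf.gz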

LeftAlignAndTrimVariants --splitMultiallelics changes GT from known to unknown


I have a VCF file with this line (i.e. GT=0/1=G/T):

20  10120854    .   G   T,A 32175.56    .   AC=399,18;AF=0.111,5.006e-03;AN=3596;BaseQRankSum=1.03;DP=6710;FS=2.485;GQ_MEAN=15.45;GQ_STDDEV=20.21;InbreedingCoeff=0.1235;MLEAC=416,17;MLEAF=0.116,4.727e-03;MQ=60.00;MQ0=0;MQRankSum=0.358;NCC=189;QD=18.08;ReadPosRankSum=0.358 GT:AD:DP:GQ:PL 0/1:1,3,0:.:34:123,0,34,126,43,169

When I run it through version 3.2 of LeftAlignAndTrimVariants with the --splitMultiallelics flag, then the genotype information is lost; i.e. GT=./. and the output is:

20  10120854    .   G   T   32175.56    .   BaseQRankSum=1.03;DP=6710;FS=2.485;GQ_MEAN=15.45;GQ_STDDEV=20.21;InbreedingCoeff=0.1235;MLEAC=416,17;MLEAF=0.116,4.727e-03;MQ=60.00;MQ0=0;MQRankSum=0.358;NCC=189;QD=18.08;ReadPosRankSum=0.358 GT  ./.
20  10120854    .   G   A   32175.56    .   BaseQRankSum=1.03;DP=6710;FS=2.485;GQ_MEAN=15.45;GQ_STDDEV=20.21;InbreedingCoeff=0.1235;MLEAC=416,17;MLEAF=0.116,4.727e-03;MQ=60.00;MQ0=0;MQRankSum=0.358;NCC=189;QD=18.08;ReadPosRankSum=0.358 GT  ./.

I have attached the input VCF. I also got rid of the PL information, but still got the unexpected output.

Maybe I should just write some code for normalizing variants myself in order to save time :) Thank you very much as always!
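As a point of comparison rather than a GATK fix, bcftools norm is a commonly used alternative that splits multiallelic records and rewrites the per-sample genotypes accordingly (a hedged sketch, assuming bcftools is available; verify the output on your own data):

# Split multiallelic records into biallelic ones, updating GT
bcftools norm -m -any input.vcf.gz -Oz -o split.vcf.gz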

A problem met when running Mutect2 4.0.9.0


14:44:03.509 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 14.865818774000001
14:44:03.509 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 3144.705825942
14:44:03.509 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 1498.35 sec
14:44:03.511 INFO Mutect2 - Shutting down engine
[March 4, 2019 2:44:03 PM CST] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 276.83 minutes.
Runtime.totalMemory()=4570218496
java.lang.NumberFormatException: For input string: "."
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.lang.Double.valueOf(Double.java:502)
at htsjdk.variant.variantcontext.CommonInfo.lambda$getAttributeAsDoubleList$2(CommonInfo.java:299)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsList(CommonInfo.java:273)
at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsDoubleList(CommonInfo.java:293)
at htsjdk.variant.variantcontext.VariantContext.getAttributeAsDoubleList(VariantContext.java:740)
at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getGermlineAltAlleleFrequencies(GermlineProbabilityCalculator.java:49)
at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getPopulationAFAnnotation(GermlineProbabilityCalculator.java:27)
at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:151)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:217)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:215)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:291)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /hwfssz1/ST_PRECISION/PMO/F16ZQSB1SY2968/3.scripts/5.WESpipeline/software/gatk-4.0.9.0/gatk-package-4.0.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8g -XX:ParallelGCThreads=1 -Dsamjdk.compression_level=5 -XX:-UseGCOverheadLimit -Djava.io.tmpdir=tmp_dir -jar /mydir/gatk-4.0.9.0/gatk-package-4.0.9.0-local.jar Mutect2 -R /mydir/database/reference/grch37.fa -I /mydir/Tumor.bqsr.bam -tumor Tumor -I /mydir/Normal.bqsr.bam -normal Normal --germline-resource /mydir/GATK_bundle/gnomAD/exomes/gnomad.exomes.r2.0.2.sites.vcf.gz -L /mydir/allchr.intervals -O /mydir/mutect2.vcf.gz
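The trace shows Mutect2 failing while parsing the AF annotation of the germline resource: a gnomAD record with a missing AF value (".") would raise exactly this NumberFormatException in getAttributeAsDoubleList. A hedged workaround sketch, assuming bcftools is available, is to drop such records from the resource before rerunning (output file name is a placeholder):

# Remove germline-resource records whose INFO/AF is missing (".")
bcftools view -e 'AF="."' gnomad.exomes.r2.0.2.sites.vcf.gz -Oz \
    -o gnomad.exomes.r2.0.2.sites.noMissingAF.vcf.gz
tabix -p vcf gnomad.exomes.r2.0.2.sites.noMissingAF.vcf.gz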


GATK v4.1.0.0 ValidateVariants, gVCF mode, error; non in v4.0.11.0


GATK v4.0.11.0 & v4.1.0.0, linux server, bash

Hi,

I was running the following codes

${GATK4} --java-options '-Xmx10g -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ConcGCThreads=1 -XX:ParallelGCThreads=2' HaplotypeCaller -R /shared/resources/hgRef/hg38/Homo_sapiens_assembly38.fasta -I /home/manolis/GATK4/2.BQSR/bqsr_PROVA/WES_16-1239_bqsr.bam -O "PROVA_${version}.g.vcf.gz" -L /home/manolis/GATK4/DB/hg38_SureSelectV6noUTR_S07604514_HC_1-22_XY.intervals -ip 100 -ERC GVCF --max-alternate-alleles 3 -ploidy 2 -A StrandBiasBySample --tmp-dir /home/manolis/GATK4/tmp/

${GATK4} --java-options '-Xmx10g -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ConcGCThreads=1 -XX:ParallelGCThreads=2' ValidateVariants -R /shared/resources/hgRef/hg38/Homo_sapiens_assembly38.fasta -V "PROVA_${version}.g.vcf.gz" -L /home/manolis/GATK4/DB/hg38_SureSelectV6noUTR_S07604514_HC_1-22_XY.intervals -ip 100 -gvcf -Xtype ALLELES --tmp-dir /home/manolis/GATK4/tmp/

and I created the following files:

HaplotypeCaller v4.0.11.0 -> output "PROVA_v40110.g.vcf.gz"
HaplotypeCaller v4.1.0.0 -> output "PROVA_v4100.g.vcf.gz"

When I'm going to validate them I have the following results:

1) ValidateVariants v4.0.11.0 -> input "PROVA_v40110.g.vcf.gz" ........ Everything OK !!!

2) ValidateVariants v4.0.11.0 -> input "PROVA_v4100.g.vcf.gz" ........ Everything OK !!!

3) ValidateVariants v4.1.0.0 -> input "PROVA_v4100.g.vcf.gz" ........ ERROR !!!

***********************************************************************
A USER ERROR has occurred: In a GVCF all records must ordered. Record: [VC Unknown @ chr2:41350-41765 Q. of type=SYMBOLIC alleles=[A*, <NON_REF>] attr={END=41765} filters= covers a position previously traversed.
***********************************************************************

4) ValidateVariants v4.1.0.0 -> input "PROVA_v40110.g.vcf.gz" ........ ERROR !!!

***********************************************************************
A USER ERROR has occurred: In a GVCF all records must ordered. Record: [VC Unknown @ chr2:41350-41765 Q. of type=SYMBOLIC alleles=[A*, <NON_REF>] attr={END=41765} filters= covers a position previously traversed.
***********************************************************************

If I create a vcf.gz file with HaplotypeCaller v4.1.0.0 (standard mode, no gVCF) and validate it with ValidateVariants v4.1.0.0, I do not get any error!

For now... can I validate my g.vcf.gz files generated by HC v4.1.0.0 with ValidateVariants from v4.0.11.0?

Thanks

CNNScoreVariants problem? Slow or stopped?

Dear all, I'm trying to use CNNScoreVariants to annotate my VCF, adapting the wdl from gatk4-cnn-variant-filter on GitHub (gatk-workflows).

My project is a custom target capture panel. I've scattered my bams (each 2 megabytes) and vcfs (each 38 kilobytes), and each pair is fed to CNNScoreVariants. There are no error messages, but the process is not consuming any CPU and looks like it has been stuck for half a day without any progress.

Would truly appreciate if you can offer your expert advice. Thank you!

Command as follows (gatk4 was installed via conda):

gatk --java-options -Xmx5G \
CNNScoreVariants \
-I /mnt/operation/RedCellNGS/Processed_bam_vcf_1/bam/cromwell-executions/CNN_filtering/982a2730-51ee-4e06-aa2a-6f8e9ab0ab3d/call-CNNScoreVariants/shard-0/inputs/-1372610583/11H28680-2_bamout.bam \
-R /mnt/operation/RedCellNGS/Processed_bam_vcf_1/bam/cromwell-executions/CNN_filtering/982a2730-51ee-4e06-aa2a-6f8e9ab0ab3d/call-CNNScoreVariants/shard-0/inputs/131803870/Homo_sapiens_assembly38.fasta \
-V /mnt/operation/RedCellNGS/Processed_bam_vcf_1/bam/cromwell-executions/CNN_filtering/982a2730-51ee-4e06-aa2a-6f8e9ab0ab3d/call-CNNScoreVariants/shard-0/inputs/-1372610583/11H28680-2_hc4.vcf.gz \
-O 11H28680-2_cnn_annotated.vcf.gz \
-L /mnt/operation/RedCellNGS/Processed_bam_vcf_1/bam/cromwell-executions/CNN_filtering/982a2730-51ee-4e06-aa2a-6f8e9ab0ab3d/call-CNNScoreVariants/shard-0/inputs/1460496992/0000-scattered.interval_list \
--tensor-type read_tensor \
--inference-batch-size 2 \
--transfer-batch-size 2

Stderr content is in next post

Parameters for running GenomicsDB import


I have a system with about 8 GB RAM. I've run HaplotypeCaller (-ERC GVCF) on specific genes of interest using a .list file and have 109 .g.vcf.gz files of about 5-10 GB each. What would be the optimal way to run GenomicsDBImport on these samples for joint calling? Will I need to further subset these files into specific intervals or set a batch size?

GATK version - 4.0.11, Java version-1.8

Optimal = avoid errors, maximise input samples, minimise computational load and minimise time, in that order.

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs


In GATK4, the GenotypeGVCFs tool can only take a single input, i.e., 1) a single single-sample GVCF, 2) a single multi-sample GVCF created by CombineGVCFs, or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case), you will need to combine them before feeding them to GenotypeGVCFs. The input samples must possess genotype likelihoods produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION.

Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


Using GenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21

That generates a directory called my_database containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see “extracting GVCF data from a GenomicsDB” to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like Buffer resized from 178298bytes to 262033 -- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous.
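For illustration, a larger-cohort import using both of those options might look like the following (cohort.sample_map is a placeholder file with one tab-separated sample_name and GVCF path per line):

gatk GenomicsDBImport \
    --sample-name-map cohort.sample_map \
    --genomicsdb-workspace-path my_database \
    --batch-size 50 \
    --intervals chr20,chr21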

Then you run joint genotyping; note the gendb:// prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -new-qual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations and Common “Gotchas”:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At least one interval must be provided when using GenomicsDBImport.

  3. Input GVCFs cannot contain multiple entries for a single genomic position.

  4. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.

  5. The annotation counts specified in the header MUST BE VALID! If not, you may see an error like A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV with mention of a core dump (which may or may not be output depending on your system configuration). You can check your annotation headers with vcf-validator from VCFtools [https://github.com/vcftools/vcftools].

  6. GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.

  7. If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable TILEDB_DISABLE_FILE_LOCKING=1 before running any GenomicsDB tool. If you don’t, you will likely see an error like Could not open array genomicsdb_array at workspace:[...]

  8. HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample callsets, MNPs can be inferred from the phase set (PS) tag in the FORMAT field.

  9. There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues [https://github.com/broadinstitute/gatk/issues?utf8=✓&q=is:issue+is:open+genomicsdb] to see if a fix is in progress.

If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.
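For the same trio, the CombineGVCFs fallback would look like this, after which GenotypeGVCFs takes the combined GVCF directly with -V cohort.g.vcf.gz instead of the gendb:// path:

gatk CombineGVCFs \
    -R data/ref/ref.fasta \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    -O cohort.g.vcf.gz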


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Bells and Whistles

GenomicsDB now supports allele-specific annotations [ https://software.broadinstitute.org/gatk/documentation/article?id=9622 ], which have become standard in our Broad exome production pipeline.

GenomicsDB can now import directly from a Google cloud path (i.e. gs://) using NIO.

Question: MuTect v1 doesn't find a second alternative allele

MuTect version 1 doesn't find any alternative allele other than the first one. I use MuTect 1 for somatic variant calling; the reason I don't use MuTect 2 is that it produces too many false positives.

Thanks in advance