Channel: Recent Discussions — GATK-Forum

GenomicsDBImport does not support GVCFs with MNPs; GATK (v4.1.0.0)


Hello!

I am running the GATK (v4.1.0.0) best practices pipeline on FireCloud with 12 pooled WGS samples; one pooled sample contains ~48 individual fish (I am using a ploidy of 20 throughout the pipeline). Though I have 24 linkage groups I also have 8286 very small scaffolds that my reads are aligned to, which has caused some issues with using scatter/gather and running the tasks by interval with -L (though that is not my main issue here). Lately I have run into a problem at the JointGenotyping stage.

I have one GVCF for each pool from HaplotypeCaller, and I tried to combine them all using CombineGVCFs. Because of the ploidy of 20 I thought I could not use GenomicsDBImport. I had the same error using CombineGVCFs as the person in this thread: gatkforums.broadinstitute.org/gatk/discussion/13430/gatk-v4-0-10-1-combinegvcfs-failing-with-java-lang-outofmemoryerror-not-using-memory-provided. No matter the amount of memory I allowed the task, it failed every time.

But following @shlee's advice and reading this: github.com/broadinstitute/gatk/issues/5383 I decided to give GenomicsDBImport a try. I just used my 24 linkage groups, so my interval list has only those 24 listed.

I am stumped by the error I got for many of the linkage groups:

***********************************************************************

A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at LG07:4616323 in VCF /6942d818-1ae4-4c81-a4be-0f27ec47ec16/HaplotypeCallerGVCF_halfScatter_GATK4/3a4a3acc-2f06-44dc-ab6d-2617b06f3f46/call-MergeGVCFs/301508.merged.matefixed.sorted.markeddups.recal.g.vcf.gz

***********************************************************************

What is the best way to address this? I didn't see anything in the GenomicsDB documentation about flagging the MNPs or ignoring them. I was thinking of removing the MNPs with SelectVariants before importing the GVCFs into GenomicsDB, but how do you get SelectVariants to output a GVCF, which is needed for joint genotyping?
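For reference, this is the kind of SelectVariants command I had in mind (the output name is just a placeholder); I'm not sure whether the result still counts as a valid GVCF for joint genotyping:

```
# What I was considering; the output file name is a placeholder.
gatk SelectVariants \
    -R reference.fasta \
    -V 301508.merged.matefixed.sorted.markeddups.recal.g.vcf.gz \
    --select-type-to-exclude MNP \
    -O 301508.noMNP.g.vcf.gz
```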

What would you recommend I do to get past this MNP hurdle?


MergeBamAlignment – Select primary alignment


Hi,

In the current best practices workflow gatk4-data-processing, you recommend using uBAMs instead of FASTQ files. Great idea! However, when it comes to merging with the BWA alignment BAM, there is something that puzzles me.

Here is an example of a paired-end read mapped by BWA:

XXXXXXXX:412:YYYYYYYYY:1:11101:10001:10497  83  chr16   1229894 0   149M    =   1229833 -210    GGGCCGCGTAGGCGCGGCTCGCCAGGACGGGCAGCGCCAGCAGCAGCAGATTCAGCATCTGGGGAGCAAGGAGGAGCATCGTGGGCCTGGCCGGGCCTCACAGGGCAGGGCTGGGGGCTACAGATTGTGGGGTGAAGAATGGAGCTGAG   AAAAA/E<EEAA</A/<EA<<EEEEEEEE/EEEAAEEAEE/EAEAAEEEEEEEEEEEAEEAAEEAEAEAAEEEEEEEEEEEEAAEEEEAE6EAEEEEEEEE/EEEEEEE/EE/AEAAEEEEEEEEEAAEEEEEEEEEEEEEEEEAAAAA   XA:Z:chr16,+1240848,149M,1;chr16,+1256211,149M,6;   MC:Z:150M   MD:Z:147G1  RG:Z:NS500158.1 NM:i:1  AS:i:147    XS:i:147
XXXXXXXX:412:YYYYYYYYY:1:11101:10001:10497  163 chr16   1229833 0   150M    =   1229894 210 CCAGGCCCTGACCTGTGGAATGTGGTGAGGGGCAGGGTGGACCCCGGCTGGGACTCACCAGGGGCCGCGTAGGCGCGGCTCGCCAGGACGGGCAGCGCCAGCAGCAGCAGATTCAGCATCTGGGGAGCAAGGAGGAGCATCGTGGGCCTG  AAAAAEEEEEEEEEEEEAE6EEEAEEEEEEEEEEEEEEAE/EEEEEEEEEEA/AEAEEEEEEEEEAEAE<EEE6A/EEAAAEEEA/EEAAEEAEEE/AAAAEEEEEEEAE/EEEEEEEEEEAEEEEEEAEEEAEE6EAEEAE<</AAA<6  XA:Z:chr16,-1240908,150M,0; MC:Z:149M   MD:Z:150    RG:Z:NS500158.1 NM:i:0  AS:i:150    XS:i:150

Note that BWA has suggested an alternative alignment given in the XA tag. When using MergeBamAlignment as in the best practices pipeline, the alignment in XA is chosen. I have tried modifying the --PRIMARY_ALIGNMENT_STRATEGY parameter, but it doesn't change anything.
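For reference, this is roughly how I invoke MergeBamAlignment (file names are placeholders); changing --PRIMARY_ALIGNMENT_STRATEGY, e.g. to MostDistant, made no difference for me:

```
# Roughly my invocation; file names are placeholders.
gatk MergeBamAlignment \
    --UNMAPPED_BAM sample.unmapped.bam \
    --ALIGNED_BAM sample.bwa.bam \
    --REFERENCE_SEQUENCE Homo_sapiens_assembly38.fasta \
    --PRIMARY_ALIGNMENT_STRATEGY MostDistant \
    --OUTPUT sample.merged.bam
```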

In the old days before uBAMs, you worked directly with FASTQ files and hence used the primary alignment selected by BWA. What is the motivation for changing that?

GC overhead Java error on high shard count with Cromwell for LearnReadOrientationModel


Hey,

I am back with another issue when running multi-sample somatic variant calling with Mutect2 (10 tumor WGS samples at 130x).
Currently I have a workflow definition in Cromwell splitting the calling regions into 7-million-base regions, which leads to a scatter across 515 shards. (The region size is chosen so that each region can run within 24h.)

The workflow is basically:

  • scatter mutect2
  • concat vcfs
  • combine stats
  • run pileup
  • estimate contamination
  • learn read orientation model

And this is where it fails.
I have already allowed for a 32 GB heap size, and I will see how much I have to request to make it work.
I understand that my use case might not be a common one, but this could affect other people as well.

Runtime.totalMemory()=30542397440
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1875)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at htsjdk.samtools.util.FormatUtil.parseDouble(FormatUtil.java:141)
        at htsjdk.samtools.metrics.MetricsFile.read(MetricsFile.java:434)
        at org.broadinstitute.hellbender.tools.walkers.readorientation.LearnReadOrientationModel.readMetricsFile(LearnReadOrientationModel.java:296)
        at org.broadinstitute.hellbender.tools.walkers.readorientation.LearnReadOrientationModel.lambda$doWork$7(LearnReadOrientationModel.java:96)
        at org.broadinstitute.hellbender.tools.walkers.readorientation.LearnReadOrientationModel$$Lambda$52/825496893.apply(Unknown Source)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.broadinstitute.hellbender.tools.walkers.readorientation.LearnReadOrientationModel.doWork(LearnReadOrientationModel.java:97)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /gatk/gatk-package-4.1.2.0-local.jar

As always, thanks for your help

EDIT:
I have tried a few more runs, and it seems that in my case the required heap space is 62 GB, which is quite substantial.
So my problem is more or less solved, but I think it would be worth looking into the code for possible optimizations.
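For anyone who hits the same thing, this is roughly the invocation that worked for me once the heap was raised (the f1r2 shard file names are placeholders):

```
# Sketch of the invocation with a larger heap; the f1r2 shard names are placeholders.
gatk --java-options "-Xmx62g" LearnReadOrientationModel \
    -I f1r2_shard_001.tar.gz \
    -I f1r2_shard_002.tar.gz \
    -O read-orientation-model.tar.gz
```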

A point mutation was missed by Mutect2 when read length was reduced from 150bp to 75bp


Hello,

I have been analyzing the NGS data from clinical cancer samples for diagnosis.
The NGS data are paired end reads with 150bp each.
Most of the paired-end reads overlap because the DNA extracted from FFPE samples is likely to be degraded into short pieces of less than 200bp.
So, I am thinking about reducing the read length to 75bp if it makes no difference to mutation detection.
The two read lengths (150bp and 75bp) were tested with one sample, which was technically replicated 6 times.
The 75bp reads were generated by cutting the original reads in half in fastq files.
In the analysis, one mutation (SNP) was missed with 75bp reads all the time but detected with 150bp reads.
The depth at the mutation site was about 300x, and its VAF was about 50% (a heterozygous SNP).
Why could Mutect2 not detect the mutation with 75bp reads?

Germline short variant discovery (SNPs + Indels)


Important: This document is currently being updated


Purpose

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.


Reference Implementations

| Pipeline | Summary | Notes | Github | Terra |
| --- | --- | --- | --- | --- |
| Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | pending |
| Prod* germline short variant joint genotyping | GVCFs to cohort VCF | optimized for GCP | yes | pending |
| $5 Genome Analysis Pipeline | uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38 |
| Generic germline short variant per-sample calling | analysis-ready BAM to GVCF | universal | yes | hg38 |
| Generic germline short variant joint genotyping | GVCFs to cohort VCF | universal | yes | hg38 & b37 |
| Intel germline short variant per-sample calling | uBAM to GVCF | Intel optimized for local architectures | yes | NA |

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.


Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.


Main steps for Germline Cohort Data

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.
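For illustration, a minimal sketch of the per-sample call in GVCF mode (file names are placeholders, not part of any reference implementation):

```
# Minimal sketch of per-sample calling in GVCF mode; file names are placeholders.
gatk --java-options "-Xmx4g" HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF
```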

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed up the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly.

Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.
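For illustration, a minimal sketch of the consolidation step (sample names, workspace path, and interval are placeholders):

```
# Consolidate per-sample GVCFs into a GenomicsDB workspace; names are placeholders.
gatk GenomicsDBImport \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -V sample3.g.vcf.gz \
    --genomicsdb-workspace-path my_database \
    -L chr20
```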

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.
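For illustration, a minimal sketch of joint genotyping from the GenomicsDB workspace created above (names are placeholders):

```
# Joint genotyping from the GenomicsDB workspace; names are placeholders.
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V gendb://my_database \
    -O cohort.vcf.gz
```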

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyVQSR (called ApplyRecalibration in GATK3)

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.
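For illustration, a minimal sketch of the two-step process for SNPs (the resource files, annotations, and sensitivity threshold shown are placeholders; see the tool documentation for recommended resources):

```
# Step 1: build the recalibration model for SNPs; resources and annotations are illustrative.
gatk VariantRecalibrator \
    -R reference.fasta \
    -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum -an SOR \
    -mode SNP \
    -O cohort.snps.recal \
    --tranches-file cohort.snps.tranches

# Step 2: apply the model to filter the callset at a chosen sensitivity level.
gatk ApplyVQSR \
    -R reference.fasta \
    -V cohort.vcf.gz \
    --recal-file cohort.snps.recal \
    --tranches-file cohort.snps.tranches \
    --truth-sensitivity-filter-level 99.7 \
    -mode SNP \
    -O cohort.snps.vqsr.vcf.gz
```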

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.


Main steps for Germline Single-Sample Data

Single sample variant discovery uses HaplotypeCaller in its default single-sample mode to call variants in an analysis-ready BAM file. The VCF that HaplotypeCaller emits errs on the side of sensitivity, so some filtering is often desired. To filter variants, first run the CNNScoreVariants tool. This tool annotates each variant with a score indicating the model's prediction of the quality of each variant. To apply filters based on those scores, run the FilterVariantTranches tool with SNP and INDEL sensitivity tranches appropriate for your task.
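For illustration, a minimal sketch of this single-sample filtering (file names, resources, and tranche values are placeholders):

```
# Annotate each variant with a CNN score; file names are placeholders.
gatk CNNScoreVariants \
    -R reference.fasta \
    -V sample.vcf.gz \
    -O sample.cnn_scored.vcf.gz

# Filter on the CNN score using SNP and indel sensitivity tranches; resources are illustrative.
gatk FilterVariantTranches \
    -V sample.cnn_scored.vcf.gz \
    --resource hapmap.vcf.gz \
    --resource mills.vcf.gz \
    --info-key CNN_1D \
    --snp-tranche 99.95 \
    --indel-tranche 99.4 \
    -O sample.filtered.vcf.gz
```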


Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which require access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described below, enabling incremental growth of cohorts as well as scaling to large cohort sizes.

Two types of separators for GT in vcf file


Hi,
I've tried the joint genotyping process with the 'GenotypeGVCFs' tool in GATK v4.1.2, and the output was generated without any errors. However, there is something weird that I can't understand. In the VCF file, there exist two types of GT separator, '|' and '/'.

1   38  .   C   CA  10745.24    .   AC=92;AF=0.920;AN=100;DP=210;ExcessHet=0.0000;FS=0.000;InbreedingCoeff=0.4128;MLEAC=106;MLEAF=1.00;MQ=36.67;QD=25.36;SOR=10.626 GT:AD:DP:GQ:PGT:PID:PL:PS   1|1:0,19:19:69:1|1:38_C_CA:996,69,0:38  1|1:0,1:1:6:1|1:38_C_CA:81,6,0:38   1|1:0,24:24:75:1|1:38_C_CA:1110,75,0:38 1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38  1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38   1|1:0,19:19:63:1|1:38_C_CA:926,63,0:38  1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38   1|1:0,5:5:18:1|1:38_C_CA:261,18,0:38    1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38    1|1:0,2:2:9:1|1:38_C_CA:126,9,0:38  1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38    1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38    1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38    1|1:0,3:3:12:1|1:38_C_CA:171,12,0:38    1|1:0,1:1:6:1|1:38_C_CA:90,6,0:38   1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38    0/0:1,0:1:3:.:.:0,3,28  1|1:0,6:6:18:1|1:38_C_CA:270,18,0:38    ./.:2,0:2:.:.:.:0,0,./.:0,0:0:.:.:.:0,0,0   1|1:0,4:4:18:1|1:38_C_CA:261,18,0:38    ./.:0,0:0:.:.:.:0,0,0   1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38   1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38   ./.:0,0:0:.:.:.:0,0,1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38  1|1:0,3:3:12:1|1:38_C_CA:171,12,0:38    1|1:0,5:5:15:1|1:38_C_CA:225,15,0:38    1|1:0,1:1:6:1|1:38_C_CA:90,6,0:38   ./.:0,0:0:.:.:.:0,0,0   1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38  1|1:0,5:5:15:1|1:38_C_CA:225,15,0:38    1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38   1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38   ./.:2,0:2:.:.:.:0,0,0   1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38  0/0:1,0:1:3:.:.:0,3,34  ./.:0,0:0:.:.:.:0,0,0   0/0:1,0:1:3:.:.:0,3,35  1|1:0,5:5:15:1|1:38_C_CA:225,15,0:38    1|1:0,5:5:18:1|1:38_C_CA:270,18,0:38    1|1:0,6:6:18:1|1:38_C_CA:270,18,0:38    1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38  1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38   1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38    1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38   1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38  1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38  0/0:1,0:1:3:.:.:0,3,34  1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38   1|1:0,3:3:12:1|1:38_C_CA:171,12,0:38    1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38  1|1:0,5:5:21:1|1:38_C_CA:306,21,0:38    1|1:0,1:1:6:1|1:38_C_CA:90,6,0:38   1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38    1|1:0,5:5:15:1|1:38_C_CA:225,15,0:38    1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38

I couldn't find any options in the tool documentation that address this.
Looking forward to your reply. Thank you!

"Invalid interval" error


Hi,

Occasionally I encounter this error when combining gVCFs for the next step, joint calling/genotyping.

For example, recently I made this run:

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /my/gatk/directory/gatk-package-4.0.10.0-local.jar CombineGVCFs -R /my/genome/reference/reference.fa -L 5:63300001-65300000 -O Combined.vcf -V gVCF.list

And got this:

java.lang.IllegalArgumentException: Invalid interval. Contig:5 start:63630211 end:63630210
    at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:730)
    at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:61)
    at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:37)
    at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:49)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:152)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:151)
    at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:113)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

I got this error with 4.0.4.10 and recently upgraded to the current version 4.0.10.0, and the problem persisted.

There must be some variant sites in the gVCFs causing confusion for GATK, because the error only pops up for specific sites in specific windows. I have read some threads here and found some potential causes, such as overlapping blocks or zero-length reads. But there hasn't been a solution for this, and as I have hundreds of gVCFs it would be difficult to screen each one for such problems. Does anyone have any advice on this?

Ke

IllegalArgumentException: beta must be greater than 0 in FilterMutectCalls


Hi there,

I have a simulated dataset of related samples and am currently running Mutect2 on it (10 tumor WGS samples at 130x).
I managed to run everything through, and now FilterMutectCalls crashes after the first pass through the variants with:

[October 1, 2019 12:16:16 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.filtering.FilterMutectCalls done. Elapsed time: 370.68 minutes.
Runtime.totalMemory()=20597702656
java.lang.IllegalArgumentException: beta must be greater than 0 but got -87566.7500301585
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:724)
        at org.broadinstitute.hellbender.tools.walkers.readorientation.BetaDistributionShape.<init>(BetaDistributionShape.java:14)
        at org.broadinstitute.hellbender.tools.walkers.mutect.clustering.BinomialCluster.getFuzzyBinomial(BinomialCluster.java:42)
        at org.broadinstitute.hellbender.tools.walkers.mutect.clustering.BinomialCluster.learn(BinomialCluster.java:33)
        at org.broadinstitute.hellbender.tools.walkers.mutect.clustering.SomaticClusteringModel.lambda$learnAndClearAccumulatedData$7(SomaticClusteringModel.java:131)
        at org.broadinstitute.hellbender.utils.IndexRange.forEach(IndexRange.java:116)
        at org.broadinstitute.hellbender.tools.walkers.mutect.clustering.SomaticClusteringModel.learnAndClearAccumulatedData(SomaticClusteringModel.java:131)
        at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.Mutect2FilteringEngine.learnParameters(Mutect2FilteringEngine.java:156)
        at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.FilterMutectCalls.afterNthPass(FilterMutectCalls.java:151)
        at org.broadinstitute.hellbender.engine.MultiplePassVariantWalker.traverse(MultiplePassVariantWalker.java:44)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1039)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /gatk/gatk-package-4.1.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.2.0-local.jar FilterMutectCalls --contamination-table /cromwell-executions/sebastian/f3c8dc32-7754-42c3-b0a7-9d667904c2e5/call-filtering/inputs/1230118915/generated-3208ebe8-e3ef-11e9-91de-005056b01e3e --tumor-segmentation /cromwell-executions/sebastian/f3c8dc32-7754-42c3-b0a7-9d667904c2e5/call-filtering/inputs/1230118915/generated-3208e648-e3ef-11e9-91de-005056b01e3e --stats /cromwell-executions/sebastian/f3c8dc32-7754-42c3-b0a7-9d667904c2e5/call-filtering/inputs/61973814/generated-58e77956-db7e-11e9-9da2-005056b01e3e.txt --orientation-bias-artifact-priors /cromwell-executions/sebastian/f3c8dc32-7754-42c3-b0a7-9d667904c2e5/call-filtering/inputs/-1768654832/generated-58e75278-db7e-11e9-9da2-005056b01e3e.tar.gz -V /cromwell-executions/sebastian/f3c8dc32-7754-42c3-b0a7-9d667904c2e5/call-filtering/inputs/164276910/generated-58e6fad0-db7e-11e9-9da2-005056b01e3e.vcf.gz -R /cromwell-executions/sebastian/f3c8dc32-7754-42c3-b0a7-9d667904c2e5/call-filtering/inputs/1500471319/human_g1k_v37.fasta -O generated-32095ea2-e3ef-11e9-91de-005056b01e3e.vcf.gz

I do not have any idea how to work around this.
Any suggestions?


Germline CNV pipeline missing MLPA-confirmed deletions with very low copy ratios


Hi,

I’m running a GATK Germline CNV pipeline-like workflow on a dataset of 1389 regions of interest (ROI) for 69 samples.

In general, the default parameters seem to be working very well: the number of CNVs found seems reasonable, and the pipeline found one duplication and one deletion confirmed by MLPA. It also correctly determines the sex of the samples.

However, it does not call two deletions that were previously confirmed by MLPA, although the denoised copy ratio values are very low: 1.182 +/- 0.042 and 1.030 +/- 0.050 (+/- standard deviation). The first case is even stranger because it calls a deletion for a sample with a denoised copy ratio higher than that: 1.199 +/- 0.066.

I tried to follow the suggestions here to increase the sensitivity and it did, indeed, recover the two missing CNVs. However, it also annotated many more CNVs, and in all samples, which I think are false positives.

I attached a table with the denoised copy ratio values for all samples and all ROIs. Since this approach uses the information from all samples, I do not know what kind of files I should provide for you to help me understand why those deletions are missing.

Commands using broadinstitute/gatk:4.1.3.0 Docker image:

# Read counts: from BAM to HDF5
## For each sample
gatk CollectReadCounts --input /mnt/bam_dir/$sample.bam --intervals /mnt/intervals_dir/trusight_cancer.bed --format HDF5 --output /mnt/out_dir/$sample.hdf5 --interval-merging-rule OVERLAPPING_ONLY --reference /mnt/ref_dir/Homo_sapiens.GRCh38.dna.primary_assembly.fa

# Annotate intervals
gatk AnnotateIntervals --intervals /mnt/intervals_dir/trusight_cancer.bed --output /mnt/out_dir/annotate_outfile.gc_mappability150_duplication.tsv --reference /mnt/ref_dir/Homo_sapiens.GRCh38.dna.primary_assembly.fa --interval-merging-rule OVERLAPPING_ONLY --mappability-track /mnt/mappability_dir/human_g1k_v37_gemmap_l150_m2_e1_uniq.bed.gz --segmental-duplication-track /mnt/dup_dir/hg19_self_chain_split_both.bed.gz

# Filter intervals
gatk FilterIntervals --intervals /mnt/intervals_dir/trusight_cancer.bed --arguments_file /mnt/out_dir/gatk_arguments_file.list_input_hdf5_sample_files.txt --annotated-intervals /mnt/annotate_dir/annotate_outfile.gc_mappability150_duplication.tsv --output /mnt/out_dir/filtered_intervals.interval_list --interval-merging-rule OVERLAPPING_ONLY --maximum-segmental-duplication-content 0.6 --minimum-mappability 0.8 --exclude-intervals X:10001-2781479 --exclude-intervals X:155701383-156030895 --exclude-intervals Y:10001-2781479 --exclude-intervals Y:56887902-57217415

# DetermineGermlineContigPloidy
gatk DetermineGermlineContigPloidy --intervals /mnt/intervals_dir/filtered_intervals.interval_list --arguments_file /mnt/out_dir/gatk_arguments_file.list_input_hdf5_sample_files.txt --output /mnt/out_dir/ --output-prefix cohort --interval-merging-rule OVERLAPPING_ONLY --contig-ploidy-priors /mnt/prior_dir/contig_ploidy_priors.suggested_gatk.tab

# GermlineCNVCaller
## Original command
gatk GermlineCNVCaller --run-mode COHORT --intervals /mnt/intervals_dir/filtered_intervals.interval_list --arguments_file /mnt/out_dir/gatk_arguments_file.list_input_hdf5_sample_files.txt --output /mnt/out_dir/ --output-prefix cohort --interval-merging-rule OVERLAPPING_ONLY --contig-ploidy-calls /mnt/ploidy_dir/

## Changed command to increase the sensitivity
gatk GermlineCNVCaller --run-mode COHORT --intervals /mnt/intervals_dir/filtered_intervals.interval_list --arguments_file /mnt/out_dir/gatk_arguments_file.list_input_hdf5_sample_files.txt --output /mnt/out_dir/ --output-prefix cohort --interval-merging-rule OVERLAPPING_ONLY --contig-ploidy-calls /mnt/ploidy_dir/ --interval-psi-scale 0.000001 --log-mean-bias-standard-deviation 0.01 --sample-psi-scale 0.000001 --class-coherence-length 1000 --cnv-coherence-length 1000

# PostprocessGermlineCNVCalls
## For each sample
gatk PostprocessGermlineCNVCalls --calls-shard-path /mnt/cnv_dir/cohort-calls/ --model-shard-path /mnt/cnv_dir/cohort-model/ --sample-index $sample_index --autosomal-ref-copy-number 2 --allosomal-contig X --allosomal-contig Y --output-genotyped-intervals /mnt/out_dir/$sample.genotyped_intervals.vcf --output-genotyped-segments /mnt/out_dir/$sample.genotyped_segments.vcf --contig-ploidy-calls /mnt/ploidy_dir/ --intervals /mnt/intervals_dir/filtered_intervals.interval_list --reference /mnt/ref_dir/Homo_sapiens.GRCh38.dna.primary_assembly.fa --output-denoised-copy-ratios /mnt/out_dir/$sample.denoised_copy_ratios.tsv

GenotypeConcordance IndexOutOfBoundsException


I'm running GenotypeConcordance from the newest version of GATK4 when I get this error.

ERROR: "java.lang.IndexOutOfBoundsException: Index: 0, Size: 0"

Yesterday I performed troubleshooting with a freshly installed version of GATK4 with:

  • conda install -c bioconda gatk4

The call VCF was created with a pipeline running the slightly older GATK 4.0. I have no problem running:

  • call sample vcf vs call sample vcf

It's only when I use:

  • call vcf vs truth vcf

that I have an issue. The truth VCF and the reference were both downloaded from the GATK resource bundle.

My pipeline uses bwa mem, fastp & GATK 4.0 but I'm troubleshooting with GATK 4.1.3.

COMMAND: gatk GenotypeConcordance -R resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta -O NA12891_GenotypeConcordance.txt -CS NA12891 -CV NA12891_HaplotypeCaller.SNP_filtered.vcf -TV dbsnp_138.hg38.vcf --INTERVALS NA12891_bait.interval_list
REFERENCE_MDSUM: 7ff134953dcca8c8997453bbb80b6b5e resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
VCF_MDSUM: f7e1ef5c1830bfb33675b9c7cbaa4868 dbsnp_138.hg38.vcf

Running.....
==============================================

Using GATK jar /data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk4/share/gatk4-4.1.3.0-0/gatk-package-4.1.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk4/share/gatk4-4.1.3.0-0/gatk-package-4.1.3.0-local.jar GenotypeConcordance -R resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta -O NA12891_GenotypeConcordance.txt -CS NA12891 -CV NA12891_HaplotypeCaller.SNP_filtered.vcf -TV dbsnp_138.hg38.vcf --INTERVALS NA12891_bait.interval_list
09:47:05.280 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data1/BIOINFORMATICS/SOFTWARE/ANACONDA_JN/MINI-CONDA/envs/gatk4/share/gatk4-4.1.3.0-0/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Oct 02 09:47:05 EDT 2019] GenotypeConcordance --TRUTH_VCF dbsnp_138.hg38.vcf --CALL_VCF NA12891_HaplotypeCaller.SNP_filtered.vcf --OUTPUT NA12891_GenotypeConcordance.txt --CALL_SAMPLE NA12891 --INTERVALS NA12891_bait.interval_list --REFERENCE_SEQUENCE resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta --OUTPUT_VCF false --INTERSECT_INTERVALS true --MIN_GQ 0 --MIN_DP 0 --OUTPUT_ALL_ROWS false --USE_VCF_INDEX false --MISSING_SITES_HOM_REF false --IGNORE_FILTER_STATUS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Oct 02, 2019 9:47:07 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Wed Oct 02 09:47:07 EDT 2019] Executing as nowackj1@ridus004.ind.roche.com on Linux 3.10.0-957.21.3.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.3.0
INFO 2019-10-02 09:47:07 GenotypeConcordance Starting to load intervals list(s).
INFO 2019-10-02 09:47:08 GenotypeConcordance Finished loading up intervals list(s).
[Wed Oct 02 09:47:09 EDT 2019] picard.vcf.GenotypeConcordance done. Elapsed time: 0.06 minutes.
Runtime.totalMemory()=2530738176
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at picard.vcf.GenotypeConcordance.doWork(GenotypeConcordance.java:325)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)

HaplotypeCaller does not call SNPs with 100% frequency

Hello, GATK team!

I'm using HaplotypeCaller from GATK 4 for germline variant calling in bacteria. The pipeline:
1) Raw read mapping to reference with SMALT.
2) Variant calling with HaplotypeCaller:


```
gatk --java-options "-Xmx4g" HaplotypeCaller -VS SILENT -R path_to_ref -I path_to_input_bam -O path_to_output_gvcf -ERC GVCF -bamout path_to_gatk_bam --smith-waterman FASTEST_AVAILABLE -ploidy 1
```

For debug purposes I've also tried: --force-active true --disable-optimizations true

3) Merging multiple gvcfs with GenomicsDBImport
4) Genotyping with GenotypeGVCFs:

```
gatk --java-options "-Xmx4g " GenotypeGVCFs --sample-ploidy 1 ...
```

5) Splitting of resulting huge.vcf:

```
gatk --java-options "-Xmx1g -Xms1g" SelectVariants --exclude-non-variants true --remove-unused-alternates true -OVI false -R path_to_ref -sn sample_id -V huge.vcf -O final.vcf
```

I've compared the initial BAM file (SAMEA1015921.bam) with the GATK bamout (SAMEA1015921_h37rv_gatk.bam), obtained with "--force-active true --disable-optimizations true", in IGV.

Here is what I see in one interesting locus:

![IGV_PPE34](https://drive.google.com/open?id=1vrg1yR05BIRMzHeKhfcdAyY6x7Dx6Was "Missing SNPs in PPE34")

I can't see these SNPs in corresponding final.vcf:

```
ch1 2246247 . T G 18794118.69 . AC=1;AF=1.00;AN=1;BaseQRankSum=1.70;DP=89;FS=0.000;MQ=59.16;MQRankSum=-2.800e-01;QD=30.34;ReadPosRankSum=0.484;SOR=0.518 GT:AD:DP:GQ:PL 1:0,89:89:99:3836,0
ch1 2250185 . T TA 12273363.57 . AC=1;AF=1.00;AN=1;BaseQRankSum=-7.900e-02;DP=106;FS=0.000;MQ=59.28;MQRankSum=-9.400e-02;QD=32.71;ReadPosRankSum=0.124;SOR=0.576 GT:AD:DP:GQ:PL 1:0,106:106:99:3540,0
```
and even in g.vcf:

```
ch1 2248181 . G <NON_REF> . . END=2248181 GT:DP:GQ:MIN_DP:PL 0:96:0:96:0,0
ch1 2248182 . A <NON_REF> . . END=2248182 GT:DP:GQ:MIN_DP:PL 0:111:99:111:0,1800
ch1 2248183 . G <NON_REF> . . END=2248183 GT:DP:GQ:MIN_DP:PL 0:98:0:98:0,0
ch1 2248184 . C <NON_REF> . . END=2248184 GT:DP:GQ:MIN_DP:PL 0:110:99:110:0,1800
ch1 2248185 . C <NON_REF> . . END=2248185 GT:DP:GQ:MIN_DP:PL 0:87:0:87:0,0
ch1 2248186 . C <NON_REF> . . END=2248186 GT:DP:GQ:MIN_DP:PL 0:98:99:98:0,1800
ch1 2248187 . G <NON_REF> . . END=2248187 GT:DP:GQ:MIN_DP:PL 0:89:0:89:0,0
ch1 2248188 . G <NON_REF> . . END=2248189 GT:DP:GQ:MIN_DP:PL 0:92:99:91:0,1800
ch1 2248190 . T <NON_REF> . . END=2248190 GT:DP:GQ:MIN_DP:PL 0:84:0:84:0,0
ch1 2248191 . A <NON_REF> . . END=2248198 GT:DP:GQ:MIN_DP:PL 0:80:99:77:0,1800
ch1 2248199 . C <NON_REF> . . END=2248199 GT:DP:GQ:MIN_DP:PL 0:71:0:71:0,0
ch1 2248200 . G <NON_REF> . . END=2248201 GT:DP:GQ:MIN_DP:PL 0:71:99:70:0,1800
ch1 2248202 . C <NON_REF> . . END=2248202 GT:DP:GQ:MIN_DP:PL 0:70:0:70:0,0
ch1 2248203 . T <NON_REF> . . END=2250184 GT:DP:GQ:MIN_DP:PL 0:95:99:52:0,1000
```

Two SNPs C->T at 2248199 and 2248202 have 100% frequency in the initial BAM. Also, in the GATK bamout the alternative allele is dominating: the only read without the mutations is HC11051 from the ArtificialHaplotypeRG group, which contains no variant in the area.

Can you comment on this? Is HC11051 the haplotype assembled by HaplotypeCaller from these reads?

By the way, what does <NON_REF> in the GVCF mean?

All the data can be found in [google drive](https://drive.google.com/open?id=103gfjVhbbWlPFTyrDFKhMIZhuvdHbPei).

Looking forward to your answer,
Gennady

Picard LiftoverVcf: contig not part of the target reference


Dear GATK team,

I am trying to lift over a VCF file from hg19 to hg38 by running the command:
java -jar ~/tools/picard-2.1.0/dist/picard.jar LiftoverVcf I=input.chr22.vcf O=hg38.chr22.vcf CHAIN=hg19ToHg38.over.chain REJECT=liftover_rejected.chr22.vcf R=chr22.fa

Since I'm working on one chromosome only, my VCF file has only "chr22" in the CHROM field. chr22.fa, the reference genome in hg38, starts with ">chr22" on the first line. I also generated the .dict file for it using Picard tools. The chr22.dict file looks like:
@HD VN:1.5 SO:unsorted
@SQ SN:chr22 LN:50818468 M5:221733a2a15e2de66d33e73d126c5109 UR:file:/my/directory/chr22.fa

However, after a few seconds I always get the following error message:
[Thu Feb 18 15:53:08 GMT 2016] Executing as me@myhost on Linux 3.2.0-75-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02; Picard version: 2.1.0() JdkDeflater
INFO 2016-02-18 15:53:09 LiftoverVcf Loading up the target reference genome.
INFO 2016-02-18 15:53:11 LiftoverVcf Lifting variants over and sorting.
ERROR 2016-02-18 15:53:11 LiftoverVcf Encountered a contig, chr22 that is not part of the target reference.

Could you suggest how to fix this? Thank you!

Best,
Ruoyun

Large cohort VCFs in GATK4 - to combine or not ...


Hi,

I've somatically called a few thousand samples against a PoN. I'm now looking through the results and wondering how best to collate all these single VCFs. Is there a tool like GenotypeGVCFs for VCFs? (CombineVariants is no longer available, and would presumably take too long.)
If not, what would be a strategy for pooling these result files into an analysis set? Is there an alternative analysis strategy to pooling?

Thanks.

MergeBamAlignment - what are all the exact steps it performs?


Hi, I have a question about the MergeBamAlignment tool.
I tried reading through the documentation and through the couple of blog posts that I found on the GATK website, but I still have a couple of things that I could use help clearing up.

Basically, I ran the following tests:
1. Starting from an unmapped BAM file with multiple read groups, I ran the GATK data pre-processing Best Practices WDL
2. Starting from the same uBAM, but with read group information removed using AddOrReplaceReadGroups, I ran the GATK data pre-processing Best Practices WDL
3. Starting from the uBAM without readgroup information, I ran the Data pre-processing pipeline where I removed the MergeBamAlignment step

After this, with the resulting BAM files, I ran the GATK generic germline variant calling Best Practices WDL.

Between the first two cases, for the samples that I was testing with, I found a 2.1% difference in the variants called.
I understand here that MergeBamAlignment adds the missing read group information from the uBAM, which in turn is used during the MarkDuplicates, BaseRecalibrator and ApplyBQSR steps; this would lead to a different BAM than in the case where I didn't have the read group information (test 2), and so the variant calling would also be different.

But, between test 2 (uBAM with no read groups and MBA present) and test 3 (uBAM with no read groups and MBA absent), I also noted differences in variants called - the difference was 0.18%, so albeit small, it still exists.
My understanding is that MergeBamAlignment also performs more actions in addition to just merging read group and read-level tag information.
From one post, I understood that MBA turns hard-clipped reads (by BWA, usually some chimeric reads) back into soft-clipped reads.
Does anyone have more info on this? Or on what exactly MBA does?
Should I expect these small differences, or not?

Low CNQ value for all the CNV events in a proband.


Dear Officer, I've recently been using gCNV for germline CNV detection. I noticed that in the interval VCF file of one patient, all the CNV events have extremely low CNQ values (below 10). For the other intervals, with a GT of 0 and a CN of 2, the CNQ values seem as normal as usual. And this kind of issue does not happen for the proband's parents.

It might be that this proband does not have a single CNV event, but that seems unlikely to happen in the real world...

Here I paste the commands for DetermineGermlineContigPloidy in COHORT and CASE mode.
COHORT:
$gatk DetermineGermlineContigPloidy \
    -L ${v7dir}/v7.cohort.gc.filtered.interval_list \
    -I ...(approximately 50 samples) \
    --contig-ploidy-priors ${v7dir}/contig_ploidy_priors_table.tsv \
    -imr OVERLAPPING_ONLY \
    --output ${valid_model_dir} \
    --output-prefix v7_normal_cohort \
    --verbosity DEBUG
CASE(for a WES trio):
${gatk} DetermineGermlineContigPloidy \
    --model /paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V7_probe/v7_ploidy_model/v7_normal_cohort-model \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190382.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190383.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190384.counts.hdf5 \
    -O /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling \
    --output-prefix v7_case_ploidy \
    --verbosity DEBUG

Code for gCNV in COHORT and CASE mode.
COHORT:
$gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L ${v7dir}/v7.cohort.gc.filtered.interval_list \
    -I ...(approximately 50 samples) --interval-merging-rule OVERLAPPING_ONLY \
    --contig-ploidy-calls ${valid_ploidy_call} \
    --verbosity DEBUG \
    --annotated-intervals ${v7dir}/v7.annotated.tsv \
    --output ... --output-prefix ...
CASE(for a WES trio):
${gatk} GermlineCNVCaller \
    --run-mode CASE \
    --model /paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V7_probe/v7_gCNV_model/v7_gCNV_normal_cohort-model \
    --contig-ploidy-calls /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/v7_case_ploidy-calls \
    --class-coherence-length 500.0 \
    --cnv-coherence-length 500.0 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190382.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190383.counts.hdf5 \
    -I /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190384.counts.hdf5 \
    -O /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling \
    --output-prefix v7_case_gCNV \
    --verbosity DEBUG

Code for PostprocessGermlineCNVCalls:
${gatk} PostprocessGermlineCNVCalls \
    --model-shard-path ${v7_gCNV_model} \
    --calls-shard-path ${cnv_dir}/${v7_gCNV_case_prefix}-calls \
    --allosomal-contig chrX --allosomal-contig chrY \
    --contig-ploidy-calls ${cnv_dir}/${v7_ploidy_case_prefix}-calls \
    --sample-index ${sample_index} \
    --output-denoised-copy-ratios ${cnv_dir}/${patientID}.sample_${sample_index}.denoised_copy_ration.tsv \
    --output-genotyped-intervals ${cnv_dir}/genotyped-intervals-"case"-${patientID}-vs-v7cohort.vcf.gz \
    --output-genotyped-segments ${cnv_dir}/genotyped-segments-"case"-${patientID}-vs-v7cohort.vcf.gz \
    --sequence-dictionary ${ref_gen}/ucsc.hg19.dict

Here I paste the DEBUG log file (for PostprocessGermlineCNVCalls) for reference:
PostProcessGermlineCNVCalls:
Using GATK jar /home/yangyxt/software/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/yangyxt/software/gatk-4.1.3.0
17:46:45.860 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/yangyxt/software/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Oct 03, 2019 5:46:48 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
17:46:48.386 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
17:46:48.387 INFO PostprocessGermlineCNVCalls - The Genome Analysis Toolkit (GATK) v4.1.3.0
17:46:48.387 INFO PostprocessGermlineCNVCalls - For support and documentation go to https://software.broadinstitute.org/gatk/
17:46:48.387 INFO PostprocessGermlineCNVCalls - Executing as yangyxt@paedwy01 on Linux v3.10.0-957.10.1.el7.x86_64 amd64
17:46:48.388 INFO PostprocessGermlineCNVCalls - Java runtime: OpenJDK 64-Bit Server VM v11.0.1+13-LTS
17:46:48.388 INFO PostprocessGermlineCNVCalls - Start Date/Time: October 3, 2019 at 5:46:45 PM HKT
17:46:48.388 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
17:46:48.388 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
17:46:48.389 INFO PostprocessGermlineCNVCalls - HTSJDK Version: 2.20.1
17:46:48.389 INFO PostprocessGermlineCNVCalls - Picard Version: 2.20.5
17:46:48.390 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:46:48.390 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:46:48.390 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:46:48.390 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:46:48.390 INFO PostprocessGermlineCNVCalls - Deflater: IntelDeflater
17:46:48.390 INFO PostprocessGermlineCNVCalls - Inflater: IntelInflater
17:46:48.390 INFO PostprocessGermlineCNVCalls - GCS max retries/reopens: 20
17:46:48.390 INFO PostprocessGermlineCNVCalls - Requester pays: disabled
17:46:48.391 INFO PostprocessGermlineCNVCalls - Initializing engine
17:46:58.628 INFO PostprocessGermlineCNVCalls - Done initializing engine
17:46:59.602 INFO ProgressMeter - Starting traversal
17:46:59.602 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute
17:46:59.603 INFO ProgressMeter - unmapped 0.0 0 NaN
17:46:59.603 INFO ProgressMeter - Traversal complete. Processed 0 total records in 0.0 minutes.
17:46:59.603 INFO PostprocessGermlineCNVCalls - Generating intervals VCF file...
17:46:59.621 INFO PostprocessGermlineCNVCalls - Writing intervals VCF file to /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/genotyped-intervals-case-A190382-vs-v7cohort.vcf.gz...
17:46:59.621 INFO PostprocessGermlineCNVCalls - Analyzing shard 0 / 1...
17:47:04.683 INFO PostprocessGermlineCNVCalls - Generating segments VCF file...
17:47:52.205 INFO PostprocessGermlineCNVCalls - Writing segments VCF file to /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/genotyped-segments-case-A190382-vs-v7cohort.vcf.gz...
17:47:52.272 INFO PostprocessGermlineCNVCalls - Generating denoised copy ratios...
17:47:52.752 INFO PostprocessGermlineCNVCalls - Writing denoised copy ratios to /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190382.sample_0.denoised_copy_ration.tsv...
17:47:53.003 INFO PostprocessGermlineCNVCalls - PostprocessGermlineCNVCalls complete.
17:47:53.003 INFO PostprocessGermlineCNVCalls - Shutting down engine
[October 3, 2019 at 5:47:53 PM HKT] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 1.12 minutes.
Runtime.totalMemory()=3028287488

DEBUG level log file for gCNV in CASE mode:
Using GATK jar /home/yangyxt/software/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/yangyxt/software/gatk-4.1.3.0
16:52:52.098 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/yangyxt/software/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
16:52:52.114 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/libgkl_compression5118930638540507614.so
Oct 03, 2019 4:52:53 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
16:52:53.790 INFO GermlineCNVCaller - ------------------------------------------------------------
16:52:53.791 INFO GermlineCNVCaller - The Genome Analysis Toolkit (GATK) v4.1.3.0
16:52:53.791 INFO GermlineCNVCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
16:52:53.791 INFO GermlineCNVCaller - Executing as yangyxt@paedwy01 on Linux v3.10.0-957.10.1.el7.x86_64 amd64
16:52:53.791 INFO GermlineCNVCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.1+13-LTS
16:52:53.792 INFO GermlineCNVCaller - Start Date/Time: October 3, 2019 at 4:52:52 PM HKT
16:52:53.792 INFO GermlineCNVCaller - ------------------------------------------------------------
16:52:53.792 INFO GermlineCNVCaller - ------------------------------------------------------------
16:52:53.793 INFO GermlineCNVCaller - HTSJDK Version: 2.20.1
16:52:53.793 INFO GermlineCNVCaller - Picard Version: 2.20.5
16:52:53.795 INFO GermlineCNVCaller - HTSJDK Defaults.BUFFER_SIZE : 131072
16:52:53.795 INFO GermlineCNVCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:52:53.795 INFO GermlineCNVCaller - HTSJDK Defaults.CREATE_INDEX : false
16:52:53.795 INFO GermlineCNVCaller - HTSJDK Defaults.CREATE_MD5 : false
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.CUSTOM_READER_FACTORY :
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.DISABLE_SNAPPY_COMPRESSOR : false
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.EBI_REFERENCE_SERVICE_URL_MASK : https://www.ebi.ac.uk/ena/cram/md5/%s
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.NON_ZERO_BUFFER_SIZE : 131072
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.REFERENCE_FASTA : null
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:52:53.796 INFO GermlineCNVCaller - HTSJDK Defaults.USE_CRAM_REF_DOWNLOAD : false
16:52:53.797 DEBUG ConfigFactory - Configuration file values:
16:52:53.804 DEBUG ConfigFactory - gcsMaxRetries = 20
16:52:53.804 DEBUG ConfigFactory - gcsProjectForRequesterPays =
16:52:53.804 DEBUG ConfigFactory - gatk_stacktrace_on_user_exception = false
16:52:53.804 DEBUG ConfigFactory - samjdk.use_async_io_read_samtools = false
16:52:53.804 DEBUG ConfigFactory - samjdk.use_async_io_write_samtools = true
16:52:53.804 DEBUG ConfigFactory - samjdk.use_async_io_write_tribble = false
16:52:53.804 DEBUG ConfigFactory - samjdk.compression_level = 2
16:52:53.804 DEBUG ConfigFactory - spark.kryoserializer.buffer.max = 512m
16:52:53.804 DEBUG ConfigFactory - spark.driver.maxResultSize = 0
16:52:53.804 DEBUG ConfigFactory - spark.driver.userClassPathFirst = true
16:52:53.804 DEBUG ConfigFactory - spark.io.compression.codec = lzf
16:52:53.804 DEBUG ConfigFactory - spark.executor.memoryOverhead = 600
16:52:53.805 DEBUG ConfigFactory - spark.driver.extraJavaOptions =
16:52:53.805 DEBUG ConfigFactory - spark.executor.extraJavaOptions =
16:52:53.805 DEBUG ConfigFactory - codec_packages = [htsjdk.variant, htsjdk.tribble, org.broadinstitute.hellbender.utils.codecs]
16:52:53.805 DEBUG ConfigFactory - read_filter_packages = [org.broadinstitute.hellbender.engine.filters]
16:52:53.805 DEBUG ConfigFactory - annotation_packages = [org.broadinstitute.hellbender.tools.walkers.annotator]
16:52:53.805 DEBUG ConfigFactory - cloudPrefetchBuffer = 40
16:52:53.805 DEBUG ConfigFactory - cloudIndexPrefetchBuffer = -1
16:52:53.805 DEBUG ConfigFactory - createOutputBamIndex = true
16:52:53.805 INFO GermlineCNVCaller - Deflater: IntelDeflater
16:52:53.806 INFO GermlineCNVCaller - Inflater: IntelInflater
16:52:53.806 INFO GermlineCNVCaller - GCS max retries/reopens: 20
16:52:53.806 INFO GermlineCNVCaller - Requester pays: disabled
16:52:53.806 INFO GermlineCNVCaller - Initializing engine
16:52:53.810 DEBUG ScriptExecutor - Executing:
16:52:53.810 DEBUG ScriptExecutor - python
16:52:53.810 DEBUG ScriptExecutor - -c
16:52:53.810 DEBUG ScriptExecutor - import gcnvkernel

16:53:03.383 DEBUG ScriptExecutor - Result: 0
16:53:03.383 INFO GermlineCNVCaller - Done initializing engine
16:53:04.046 INFO GermlineCNVCaller - Running the tool in CASE mode...
16:53:04.046 INFO GermlineCNVCaller - Validating and aggregating data from input read-count files...
16:53:04.077 INFO GermlineCNVCaller - Aggregating read-count file /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190382.counts.hdf5 (1 / 3)
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
16:53:04.589 INFO GermlineCNVCaller - Aggregating read-count file /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190383.counts.hdf5 (2 / 3)
16:53:04.845 INFO GermlineCNVCaller - Aggregating read-count file /paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/A190384.counts.hdf5 (3 / 3)
16:53:05.106 DEBUG ScriptExecutor - Executing:
16:53:05.106 DEBUG ScriptExecutor - python
16:53:05.106 DEBUG ScriptExecutor - /tmp/case_denoising_calling.4935013407031412822.py
16:53:05.106 DEBUG ScriptExecutor - --ploidy_calls_path=/paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/v7_case_ploidy-calls
16:53:05.106 DEBUG ScriptExecutor - --output_calls_path=/paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/v7_case_gCNV-calls
16:53:05.106 DEBUG ScriptExecutor - --output_tracking_path=/paedwy/disk1/yangyxt/wes/3_samples/CNV_calling/v7_case_gCNV-tracking
16:53:05.106 DEBUG ScriptExecutor - --input_model_path=/paedwy/disk1/yangyxt/wes/healthy_bams_for_CNV/using_V7_probe/v7_gCNV_model/v7_gCNV_normal_cohort-model
16:53:05.106 DEBUG ScriptExecutor - --read_count_tsv_files
16:53:05.106 DEBUG ScriptExecutor - /tmp/sample-06628694563004719291.tsv
16:53:05.106 DEBUG ScriptExecutor - /tmp/sample-11038104874439639617.tsv
16:53:05.106 DEBUG ScriptExecutor - /tmp/sample-213882947664018409285.tsv
16:53:05.106 DEBUG ScriptExecutor - --psi_s_scale=1.000000e-04
16:53:05.106 DEBUG ScriptExecutor - --mapping_error_rate=1.000000e-02
16:53:05.106 DEBUG ScriptExecutor - --depth_correction_tau=1.000000e+04
16:53:05.106 DEBUG ScriptExecutor - --q_c_expectation_mode=hybrid
16:53:05.107 DEBUG ScriptExecutor - --p_alt=1.000000e-06
16:53:05.107 DEBUG ScriptExecutor - --cnv_coherence_length=5.000000e+02
16:53:05.107 DEBUG ScriptExecutor - --max_copy_number=5
16:53:05.107 DEBUG ScriptExecutor - --learning_rate=1.000000e-02
16:53:05.107 DEBUG ScriptExecutor - --adamax_beta1=9.000000e-01
16:53:05.107 DEBUG ScriptExecutor - --adamax_beta2=9.900000e-01
16:53:05.107 DEBUG ScriptExecutor - --log_emission_samples_per_round=50
16:53:05.107 DEBUG ScriptExecutor - --log_emission_sampling_rounds=10
16:53:05.107 DEBUG ScriptExecutor - --log_emission_sampling_median_rel_error=5.000000e-03
16:53:05.107 DEBUG ScriptExecutor - --max_advi_iter_first_epoch=5000
16:53:05.107 DEBUG ScriptExecutor - --max_advi_iter_subsequent_epochs=200
16:53:05.107 DEBUG ScriptExecutor - --min_training_epochs=10
16:53:05.107 DEBUG ScriptExecutor - --max_training_epochs=50
16:53:05.107 DEBUG ScriptExecutor - --initial_temperature=1.500000e+00
16:53:05.107 DEBUG ScriptExecutor - --num_thermal_advi_iters=2500
16:53:05.107 DEBUG ScriptExecutor - --convergence_snr_averaging_window=500
16:53:05.107 DEBUG ScriptExecutor - --convergence_snr_trigger_threshold=1.000000e-01
16:53:05.107 DEBUG ScriptExecutor - --convergence_snr_countdown_window=10
16:53:05.107 DEBUG ScriptExecutor - --max_calling_iters=10
16:53:05.107 DEBUG ScriptExecutor - --caller_update_convergence_threshold=1.000000e-03
16:53:05.107 DEBUG ScriptExecutor - --caller_internal_admixing_rate=7.500000e-01
16:53:05.107 DEBUG ScriptExecutor - --caller_external_admixing_rate=1.000000e+00
16:53:05.108 DEBUG ScriptExecutor - --disable_caller=false
16:53:05.108 DEBUG ScriptExecutor - --disable_sampler=false
16:53:05.108 DEBUG ScriptExecutor - --disable_annealing=false
17:46:43.343 DEBUG ScriptExecutor - Result: 0
17:46:43.344 INFO GermlineCNVCaller - GermlineCNVCaller complete.
17:46:43.345 INFO GermlineCNVCaller - Shutting down engine
[October 3, 2019 at 5:46:43 PM HKT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 53.85 minutes.
Runtime.totalMemory()=2843738112


HaplotypeCaller multi-sample error on single sample

I have 2 questions:

1. I am running HaplotypeCaller on a single whole-exome BAM file. However, I get this error message:

```
A USER ERROR has occurred: Argument --emitRefConfidence has a bad value: Can only be used in single sample mode currently. Use the sample_name argument to run on a single sample out of a multi-sample BAM file.
```
I believe this should be a single-sample BAM file, but I'm not sure how to check.
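
(For context, a quick way to check how many samples a BAM actually contains is to look at the SM tags of its @RG header lines; the sketch below assumes samtools is installed and `my.bam` is a placeholder for the actual file.)

```
# Each distinct SM: value among the @RG lines corresponds to one sample;
# a single-sample BAM should show exactly one sample name here.
samtools view -H my.bam | grep '^@RG'
```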

2. Are the gVCFs output by parallelizing HaplotypeCaller mergeable into a single gVCF for further analysis?
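
(For question 2: per-interval gVCF shards from the same sample can generally be concatenated back into one gVCF. A minimal sketch using Picard MergeVcfs, with shard names as placeholders:)

```
# Concatenate per-interval gVCF shards of a single sample into one gVCF
java -jar picard.jar MergeVcfs \
    I=sample.shard1.g.vcf.gz \
    I=sample.shard2.g.vcf.gz \
    O=sample.merged.g.vcf.gz
```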

combineGVCFs taking weeks, GATK 4.1.2.


Dear GATK Team
I am running according to best practices, using GATK 4.1.2.
After running HaplotypeCaller per sample per chromosome, I am now running CombineGVCFs. My design has two different datasets, each with ~20 samples. CombineGVCFs took less than a day for dataset 1. Dataset 2, using the same reference genome and the same way of calling variants, just with different input samples, seems to run forever. I do get progress outputs, so it looks like it is doing what it is supposed to do; it just takes extremely long to do so. The initial HaplotypeCaller gVCFs look fine and are no different from those of the other dataset, and they also took no longer to be generated.
I am running the single chromosomes on different machines and have started the entire script twice, so I can rule out that it is a defective/old machine I am running on. I am using 50 GB of RAM, so that should also be fine. Java is 1.8.
I do not see any difference between the two datasets that would technically explain what is happening.
The script I am using is:

gatk CombineGVCFs \
-R mygenome.fna \
--variant MFG4_NC_031971.2.g.vcf.gz \
--variant MFG8_NC_031971.2.g.vcf.gz \
--variant MFG9_NC_031971.2.g.vcf.gz \
....
-O NC_031971.2_CTEHOR_combined.g.vcf.gz
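
(For comparison, a roughly equivalent GenomicsDBImport invocation for the same inputs, which can scale better than CombineGVCFs on large cohorts, is sketched below; the sample list is abbreviated in the same way.)

```
gatk GenomicsDBImport \
    -R mygenome.fna \
    -L NC_031971.2 \
    -V MFG4_NC_031971.2.g.vcf.gz \
    -V MFG8_NC_031971.2.g.vcf.gz \
    -V MFG9_NC_031971.2.g.vcf.gz \
    --genomicsdb-workspace-path NC_031971.2_gdb_workspace
```

GenotypeGVCFs can then read directly from the workspace via `-V gendb://NC_031971.2_gdb_workspace`.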

and here the last lines of the progress file
08:43:46.589 INFO ProgressMeter - NC_031972.2:14648571 25937.1 72258000 2785.9
08:44:07.800 INFO ProgressMeter - NC_031972.2:14648741 25937.4 72259000 2785.9
08:44:29.259 INFO ProgressMeter - NC_031972.2:14648911 25937.8 72260000 2785.9
08:44:49.915 INFO ProgressMeter - NC_031972.2:14649074 25938.1 72261000 2785.9
08:45:12.491 INFO ProgressMeter - NC_031972.2:14649251 25938.5 72262000 2785.9

Any help would be great; it has been running for 18 days now, and it seems to get slower rather than finishing, and it will eventually hit a time limit on the compute cluster.
Thank you
Astrid

Haplotype GVCF mode

Hi

I have a question about whether a) I should run HaplotypeCaller in single-sample mode, b) I should run GVCF mode combining all samples from my study, or c) I should just group all the cases together and all the controls together and then run GVCF mode separately for both groups.

I am currently following the GATK best practices guidelines for germline variant calling.
I am interested in analyzing germline mutations and signatures. All samples are from the same cancer type but are divided into two categories: a) samples with a somatic mutation in my gene of interest and a high somatic mutation burden (cases), or b) samples with no mutation in my gene of interest and a low somatic mutation burden (controls).

I am using aligned sequencing reads from the blood-derived normals of TCGA data in GDC and running HaplotypeCaller. I have 10 samples in the control group and 10 in the case group.

I would appreciate any feedback and/or advice. Thank you in advance.
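
(For reference, the usual pattern behind options b and c is to call every sample in GVCF mode and then joint-genotype the whole cohort together; case/control comparisons are typically done downstream on the joint VCF. A minimal sketch with placeholder file names:)

```
# 1) Per-sample calling in GVCF mode (run once per BAM, cases and controls alike)
gatk HaplotypeCaller -R ref.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF

# 2) Consolidate all 20 per-sample gVCFs (one -V per sample)
gatk CombineGVCFs -R ref.fasta \
    -V sample1.g.vcf.gz -V sample2.g.vcf.gz -V sample3.g.vcf.gz \
    -O cohort.g.vcf.gz

# 3) Joint genotyping across the whole cohort
gatk GenotypeGVCFs -R ref.fasta -V cohort.g.vcf.gz -O cohort.vcf.gz
```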

Somatic mutation artifact - unknown mode


Hi,
In an exome sample in which I'm trying to call somatic mutations, I came across a strange somatic mutation that I cannot interpret and that doesn't seem to fit any known error mode that Mutect handles.

The mutation is a T>G (relative to the reference top/+ strand) in 15% of reads (14/92 reads, after PCR duplicate removal), and it is only seen in F2R1 reads. This suggests a single-strand artifact. However, the weird part is this: there is ALSO a strand bias. All 14 of the reads with the mutation are on the REVERSE strand.

Not only that, but in the 5 of the 14 mutation-carrying read pairs where the forward and reverse reads overlap the mutation site, the mutation was only seen in the reverse read. How is that possible? That contradicts the possibility of a single-strand artifact.

I've been struggling to understand this and I cannot seem to find an explanation. Any input is appreciated, especially from the Mutect team.
Thank you.

MarkDuplicatesSpark error: Multiple mark duplicate record objects corresponding to read with name

Hi,
I am working on WES data and am trying to follow GATK's best practices guidelines. I want to switch from MarkDuplicates + SortSam to MarkDuplicatesSpark.
My problem is that MarkDuplicatesSpark terminates with an error message:

"ERROR TaskSetManager: Task 21 in stage 4.0 failed 1 times; aborting job"

Here is the more detailed error message:

"org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 4.0 failed 1 times, most recent failure: Lost task 21.0 in stage 4.0 (TID 1705, localhost, executor dri
ver): org.broadinstitute.hellbender.exceptions.GATKException: Detected multiple mark duplicate records objects corresponding to read with name 'NB551494:142:HC5KTBGXC:1:11309:14684:9603',
this could be the result of the file sort order being incorrect or that a previous tool has let readnames span multiple partitions"

I believe that this indicates that I am doing something wrong in the process upstream but I could not fix that issue.

Currently, here is the pipeline that I have in place: the first steps are to run bwa-mem on the fastq files and (in parallel) convert the fastq files to the ubam format with Picard FastqToSam:

bwa mem -t 8 -M $bwaIndex ${fq}_R1.fastq.gz ${fq}_R2.fastq.gz \
> ${fq}.sam

java -Xmx8G -jar $PICARD_PATH/picard.jar FastqToSam \
FASTQ=${fq}_R1.fastq.gz \
FASTQ2=${fq}_R2.fastq.gz \
OUTPUT=${fq}.unmapped.bam \
READ_GROUP_NAME=${rgid} \
SAMPLE_NAME=${rgsm} \
LIBRARY_NAME=${rglb} \
PLATFORM_UNIT=${rgpu} \
PLATFORM=${rgpl} \
SEQUENCING_CENTER=${rgcn} \
DESCRIPTION=${rgds}


Then I use MergeBamAlignment for merging the BWA-aligned SAM with the ubam file as follows:

java -Xmx8G -jar $PICARD_PATH/picard.jar MergeBamAlignment \
R=${fasta} \
ALIGNED=${fq}.sam \
UNMAPPED=${fq}.unmapped.bam \
SORT_ORDER=queryname \
O=${fq}.merged.bam

Finally I run MarkDuplicatesSpark:

java -Xmx42g -Djava.io.tmpdir=$tmpDir -jar $GATK_PATH/GenomeAnalysisTK.jar MarkDuplicatesSpark \
-I ${fq}.merged.bam \
-O ${fq}.s.dedup.bam \
-M ${fq}.dup.metrics.out \
--read-validation-stringency LENIENT \
--conf 'spark.executor.cores=8' \
--conf "spark.local.dir=${tmpDir}"
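
(One thing that might be worth ruling out, as a sketch only and not part of the pipeline above, is the merged BAM losing its queryname grouping before MarkDuplicatesSpark; re-sorting it by queryname with Picard SortSam just before the Spark step would test that.)

```
# Hypothetical extra step: force a queryname sort of the merged BAM
# before handing it to MarkDuplicatesSpark
java -Xmx8G -jar $PICARD_PATH/picard.jar SortSam \
    I=${fq}.merged.bam \
    O=${fq}.merged.qsorted.bam \
    SORT_ORDER=queryname
```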

Any help would be highly appreciated. Thank you in advance!

Best regards,

Florian