Channel: Recent Discussions — GATK-Forum

Mismatch header after generating reference from FastaAlternateReferenceMaker

Hello,

I am trying to use FastaAlternateReferenceMaker to build a new genome reference from a FASTA + VCF. However, the header format is changed from >AABL01000001 to >1 AABL01000001:1 (a number is added before the chromosome name). Is there a command to change these headers back so they match the original headers in the FASTA file?
The commands I used are:
java -jar /GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R aaa.fasta -o aaa_new.fasta -V bbb.vcf
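One possible cleanup (my own sketch, not part of the original question) is to rewrite the headers with sed, assuming every new header looks like ">1 AABL01000001:1" and the original contig names contain no spaces or colons:

sed -E 's/^>[0-9]+ ([^ :]+):.*/>\1/' aaa_new.fasta > aaa_new.renamed.fasta

The accompanying index and dictionary files would then need to be regenerated (e.g. with samtools faidx and Picard CreateSequenceDictionary).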

Best,

Nattawat

Graphical (GUI) and interactive exploration tool for large genotype matrices like 1KG or gnomAD


Dear GATK development team and GATK users,

What is currently the best visual (GUI) and interactive genotype matrix exploration tool (a browser) for large genotype matrices, say the 1000 Genomes VCF?
Or something between the 1000 Genomes VCF and the gnomAD (~15K genomes) VCF? The full VCF, including the genotypes, should be visualized and explorable, not just the variant sites.

So 100M plus variants, 1000+ samples, raw uncompressed VCF file size 1TB+.

One requirement is that it should do all the kinds of filtering that 'bcftools view' or 'GATK VariantFiltration' do:
http://www.htslib.org/doc/bcftools.html#view
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_filters_VariantFiltration.php#--filter-expression

But then in an interactive and visual way (graphical user interface).
Queries should return within seconds, and a (paged) variant and genotype table should be shown, maybe even with summary stats for the current selection.

Does something like this already exist? If so, which tools? Or is it being built by someone? If not, why not?

My preference would be:
1) An open source solution that builds on bcftools or GATK, or the HTSJDK or HTSlib libraries, maybe in combination with an open source big-data backend.
2) A standard commercial front end/analytics tool (e.g. Spotfire/Tableau) that takes in a tab-delimited file created by bcftools query or GATK VariantsToTable. The downside, of course, is that Spotfire/Tableau have no genomics/genetics domain logic that can be used for filtering the table, and a machine with a very large amount of memory would be needed, since all the data is loaded into memory? Did anyone try this?
3) A standard commercial front end/analytics tool (e.g. Spotfire/Tableau) that somehow works with the domain logic of bcftools/GATK/HTSJDK/HTSlib in the backend, maybe with a 'big data' distributed or in-memory database backend, e.g. Apache Spark? Is this possible?
4) A commercial software tool that builds on top of the functionality/results of GATK GenotypeGVCFs or maybe even Intel GenomicsDB.
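For option 2, the kind of export I have in mind (file names are placeholders) would be something like:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' cohort.vcf.gz > genotypes.tsv
gatk VariantsToTable -V cohort.vcf.gz -F CHROM -F POS -F REF -F ALT -GF GT -O genotypes.tsv

Either command flattens the genotypes into one column per sample, which a generic BI tool can load, but without any variant-aware filtering logic.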

Thank you.

Error with GATK ModelSegments


I am using the BETA tool "ModelSegments" in a copy number variation analysis and I've run into an error that I don't understand. Within our institution's cluster computing environment, I submitted the following job:

COMMON_DIR="/home/exacloud/lustre1/BioDSP/users/jacojam"
GATK=$COMMON_DIR"/programs/gatk-4.0.4.0"
ALIGNMENT_RUN_T="hg19_BWA_alignment_10058_tumor"
ALIGNMENT_RUN_N="hg19_BWA_alignment_10058_normal"
ALLELIC_COUNTS_T=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_T"/tumor.allelicCounts.tsv"
ALLELIC_COUNTS_N=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_N"/normal.allelicCounts.tsv"
OUTPUT_DIR=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_T"/GATK_CNV"

srun $GATK/gatk --java-options "-Xmx10000m" ModelSegments --allelic-counts $ALLELIC_COUNTS_T --normal-allelic-counts $ALLELIC_COUNTS_N --output-prefix 10058 -O $OUTPUT_DIR

From this, I get the following error:

Using GATK jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10000m -jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gat$
06:42:48.839 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
06:42:49.212 INFO ModelSegments - ------------------------------------------------------------
06:42:49.212 INFO ModelSegments - The Genome Analysis Toolkit (GATK) v4.0.4.0
06:42:49.212 INFO ModelSegments - For support and documentation go to https://software.broadinstitute.org/gatk/
06:42:49.213 INFO ModelSegments - Executing as jacojam@exanode-3-7.local on Linux v3.10.0-693.17.1.el7.x86_64 amd64
06:42:49.213 INFO ModelSegments - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_161-b14
06:42:49.213 INFO ModelSegments - Start Date/Time: May 2, 2018 6:42:48 AM PDT
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.214 INFO ModelSegments - HTSJDK Version: 2.14.3
06:42:49.214 INFO ModelSegments - Picard Version: 2.18.2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:42:49.214 INFO ModelSegments - Deflater: IntelDeflater
06:42:49.214 INFO ModelSegments - Inflater: IntelInflater
06:42:49.214 INFO ModelSegments - GCS max retries/reopens: 20
06:42:49.214 INFO ModelSegments - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
06:42:49.215 WARN ModelSegments -

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: ModelSegments is a BETA tool and is not yet ready for use in production

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

06:42:49.215 INFO ModelSegments - Initializing engine
06:42:49.215 INFO ModelSegments - Done initializing engine
06:42:49.224 INFO ModelSegments - Reading file (/home/exacloud/lustre1/BioDSP/users/jacojam/data/hnscc/DNASeq/hg19_BWA_alignment_10058_tumor/tumor.allelicCounts.tsv)...
06:15:44.797 INFO ModelSegments - Shutting down engine
[May 3, 2018 6:15:44 AM PDT] org.broadinstitute.hellbender.tools.copynumber.ModelSegments done. Elapsed time: 1,412.93 minutes.
Runtime.totalMemory()=6298271744
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at com.opencsv.CSVParser.parseLine(CSVParser.java:383)
at com.opencsv.CSVParser.parseLineMulti(CSVParser.java:299)
at com.opencsv.CSVReader.readNext(CSVReader.java:275)
at org.broadinstitute.hellbender.utils.tsv.TableReader.fetchNextRecord(TableReader.java:348)
at org.broadinstitute.hellbender.utils.tsv.TableReader.access$200(TableReader.java:94)
at org.broadinstitute.hellbender.utils.tsv.TableReader$1.hasNext(TableReader.java:458)
at java.util.Iterator.forEachRemaining(Iterator.java:115)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractRecordCollection.<init>(AbstractRecordCollection.java:82)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractLocatableCollection.<init>(AbstractLocatableCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractSampleLocatableCollection.<init>(AbstractSampleLocatableCollection.java:44)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AllelicCountCollection.<init>(AllelicCountCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments$$Lambda$29/27313641.apply(Unknown Source)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.readOptionalFileOrNull(ModelSegments.java:559)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.doWork(ModelSegments.java:462)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
srun: error: exanode-3-7: task 0: Exited with exit code 1

Is this something you could potentially help me with? Thank you.
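In case it helps anyone else reading this: the stack trace shows the JVM running out of heap while parsing the allelic-counts TSV, so one thing I may try (resources permitting) is simply giving the tool more memory, e.g.:

srun $GATK/gatk --java-options "-Xmx32g" ModelSegments \
    --allelic-counts $ALLELIC_COUNTS_T \
    --normal-allelic-counts $ALLELIC_COUNTS_N \
    --output-prefix 10058 \
    -O $OUTPUT_DIR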

Does the best-practices pipeline for data pre-processing output a merged BAM file?


I'm using the GATK4 best-practices pipeline for data pre-processing. I have several questions I would like to confirm:

  1. The final output is a merged BAM file, right? All the input per-sample BAM files are merged during MarkDuplicates and are processed together from then on.

  2. It seems that I can use SplitSam on the final output to extract the BAM file for each sample; is this correct? (See the sketch at the end of this post.)

  3. In the accompanying .json file, the sample_name is NA12878. Is there any special reason for choosing this name, or can it be any name?

Also, following the first question: why are all the input BAM files merged? Is there a special reason for doing so? For the dataset I'm analyzing, each BAM is about 15 GB and I have 200 such files, so merged together they would be 3000 GB = 3 TB. Although I'm using a server, it would still be very difficult to process such a huge file (if a file that large is even supported). Besides, once they are merged we cannot compute in parallel, so the computing time will be longer.
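Regarding question 2, the kind of per-sample extraction I have in mind (just a sketch; note that samtools splits by read group, which only corresponds to samples if each sample has its own read groups) is:

samtools split -f '%!.bam' merged.bam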

Thank you very much for your help!

NullPointerException while writing VCF


Hello,

I'm getting this error while writing a VCF file:

Exception in thread "main" java.lang.NullPointerException
        at htsjdk.tribble.index.tabix.TabixIndexCreator.advanceToReference(TabixIndexCreator.java:116)
        at htsjdk.tribble.index.tabix.TabixIndexCreator.addFeature(TabixIndexCreator.java:96)
        at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.add(IndexingVariantContextWriter.java:203)
        at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:240)
        at chowser.execute.VcfUtils$.$anonfun$transformVcf$1(VcfUtils.scala:18)
        at chowser.execute.VcfUtils$.$anonfun$transformVcf$1$adapted(VcfUtils.scala:18)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at chowser.execute.VcfUtils$.transformVcf(VcfUtils.scala:18)
        at chowser.execute.VariantsCanonicalizeVcfExecuter$.execute(VariantsCanonicalizeVcfExecuter.scala:11)
        at chowser.execute.ChowserExecuter$.execute(ChowserExecuter.scala:24)
        at chowser.app.ChowserApp$.main(ChowserApp.scala:15)
        at chowser.app.ChowserApp.main(ChowserApp.scala)

My code is this:

package chowser.execute

import better.files.File
import htsjdk.variant.variantcontext.VariantContext
import htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder
import htsjdk.variant.vcf.VCFFileReader
import scala.collection.JavaConverters.asScalaIteratorConverter

object VcfUtils {
  def transformVcf(inFile: File, outFile: File)
                  (transformation: Iterator[VariantContext] => Iterator[VariantContext]): Unit = {
    val reader = new VCFFileReader(inFile.path, false)
    val header = reader.getFileHeader
    val dict = header.getSequenceDictionary
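    // NOTE (my assumption from the stack trace above, not verified): the tabix index creator
    // needs contig information from this dictionary; if the input header has no sequence
    // dictionary, dict is null and IndexingVariantContextWriter.add can fail with an NPE.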
    val writer = new VariantContextWriterBuilder().setOutputPath(outFile.path).setReferenceDictionary(dict).build
    writer.writeHeader(header)
    val variantContextIter = reader.iterator().asScala
    transformation(variantContextIter).foreach(writer.add)
    writer.close()
  }
}

which is invoked like this:

package chowser.execute

import chowser.cmd.VariantsCanonicalizeVcfCommand
import chowser.genomics.VariantGroupId
import htsjdk.variant.variantcontext.VariantContextBuilder

object VariantsCanonicalizeVcfExecuter extends ChowserExecuter[VariantsCanonicalizeVcfCommand] {

  def execute(command: VariantsCanonicalizeVcfCommand): Result = {
    import command.{inFile, outFile}
    VcfUtils.transformVcf(inFile, outFile) { variantContextIter =>
      variantContextIter.flatMap { context =>
        VariantGroupId.fromVariantContext(context) match {
          case Right(newId) =>
            Some(new VariantContextBuilder(context).id(newId.toString).make)
          case Left(message) =>
            println(message)
            None
        }
      }
    }
    Result(command, success = true)
  }

  case class Result(command: VariantsCanonicalizeVcfCommand, success: Boolean)
    extends ChowserExecuter.Result[VariantsCanonicalizeVcfCommand]

}

All I'm trying to do is read a VCF file and write a new VCF file that looks identical to the original file, except that the variant ids have been replaced by canonical ones.

Thanks!

Best, Oliver

GermlineCNVCaller - How to add samples to existing model


GATK 4.1.2 Linux Ubuntu 16.0.4, using bash script

Hello,
I am in the process of testing your GermlineCNVCaller. First, I built a model from my 45 WES samples; the steps were:
1. For each sample:

java -Xmx30G -jar $gatk CollectReadCounts \
    -I ${fileline[2]} \
    -L $interval_file \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O $output_path/${fileline[1]}.counts.hdf5

2. Determine germline contig ploidy with the list of all hdf5 files:

java -jar $gatk DetermineGermlineContigPloidy \
    -L $interval_file \
    --interval-merging-rule OVERLAPPING_ONLY \
    --input $45_hdf5.list \
    --contig-ploidy-priors $ploidy_table \
    --output $output_path \
    --output-prefix WES_45_cohort

3. GermlineCNVCaller:

java -jar $gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L $interval_file \
    --interval-merging-rule OVERLAPPING_ONLY \
    --contig-ploidy-calls WES_45_cohort-calls \
    --input $45_hdf5.list \
    --output $output_path \
    --output-prefix WES_45_cohort

4. Postprocessing:

java -jar $gatk PostprocessGermlineCNVCalls \
    --calls-shard-path WES_45_cohort-calls \
    --model-shard-path WES_45_cohort-model \
    --sample-index 4 \
    --autosomal-ref-copy-number 2 \
    --allosomal-contig chrX \
    --allosomal-contig chrY \
    --output-genotyped-intervals outputintervals.vcf \
    --output-genotyped-segments outputsegments.vcf \
    --contig-ploidy-calls WES_45_cohort-calls \
    -imr OVERLAPPING_ONLY \
    -R /home/dnalab/bioinformatics/hg19/ucsc.hg19.fasta

Now I would like to use this model on the other 9 WES samples (same sequencing setup). Is there any way to add them to the existing model without re-running all these steps? Or maybe I just don't understand the flow of the pipeline.
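For reference, the kind of approach I am imagining (directory names below are my assumptions based on the commands above; the two --model arguments should point to the ploidy model and the CNV model directories produced by steps 2 and 3, respectively) would be to run the new samples in CASE mode against the existing model rather than rebuilding it:

java -jar $gatk DetermineGermlineContigPloidy \
    --model WES_45_cohort-model \
    --input $new_9_hdf5.list \
    --output $output_path \
    --output-prefix WES_9_case

java -jar $gatk GermlineCNVCaller \
    --run-mode CASE \
    --model WES_45_cohort-model \
    --contig-ploidy-calls WES_9_case-calls \
    --input $new_9_hdf5.list \
    --output $output_path \
    --output-prefix WES_9_case

Please correct me if this is not the intended flow of the pipeline.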

tried removing Mutect2 parameters


I tried removing the -tumor and -normal parameters from the Mutect2 command.
It runs, and produces a VCF output that is much bigger than the one we obtain when including those parameters.
I am curious to know what the difference is between including and excluding those parameters. Thanks a lot.
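For anyone comparing, the two invocations look roughly like this (file and sample names are placeholders):

gatk Mutect2 -R ref.fasta -I tumor.bam -I normal.bam -tumor tumor_sample -normal normal_sample -O with_labels.vcf.gz
gatk Mutect2 -R ref.fasta -I tumor.bam -I normal.bam -O without_labels.vcf.gz

My understanding (worth confirming) is that without -normal, Mutect2 has no way of knowing which sample is the matched normal, so all reads are treated as tumor and germline variants are no longer removed by the matched normal, which would explain the much larger output.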

How can I prevent the file header from showing up in gigantic font?


Hi. My question is: when I post to the forum, some parts of my post are rendered in huge text, e.g. file headers or error messages. I'm showing a truncated example of a VCF header below. How can I prevent this from happening and show the copy-pasted blocks in normal font?

##fileformat=VCFv4.2

...

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878


Multi-allelic sites in VQSR


I was wondering how GATK VQSR deals with multi-allelic sites.
I already know that:
i) VQSR treats them the same way as bi-allelic sites (https://gatkforums.broadinstitute.org/gatk/discussion/7754/how-vqsr-deals-with-multiallelic-snps-and-indel)
ii) Multi-allelic sites can be split before VQSR (https://gatkforums.broadinstitute.org/gatk/discussion/23559/split-multiallelic-variants-before-vqsr-and-cnnscorevariants-gatk-team-opinion).
The latter mainly covers mixed (SNP + INDEL) multi-allelic sites.

Summary questions:
1. Do you recommend splitting multi-allelic SNPs before VQSR? Will it be biased, since site-level information/annotations would be counted multiple times? I got different results with and without splitting (the split version performed relatively better).
2. If we don't split multi-allelic SNP sites, how is the Ti/Tv ratio calculated?
For example:

chr1 123 A T,G
chr2 234 C *,A,T

In the cases above, which allele(s) are taken to calculate the Ti/Tv ratio in the tranche file? If VQSR takes the first allele, what should we expect in the second case, where a star allele is in the first position? Or is it better to remove star alleles before VQSR?
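For context, splitting multi-allelic records before VQSR is typically done along these lines (my example commands; file names are placeholders):

bcftools norm -m -any -f ref.fasta input.vcf.gz -Oz -o split.vcf.gz

or with GATK:

gatk LeftAlignAndTrimVariants -R ref.fasta -V input.vcf.gz --split-multi-allelics -O split.vcf.gz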

What to do if the read group information is not properly available?


Hi,

There are recommendations on how to work with read groups (https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups).
However, I was wondering how to proceed when the read group information is not properly available, i.e. when I am working with public data. SRA strips/replaces the read names in the FASTQ files, so I basically only have a run, experiment, and BioSample ID from SRA. I am aware that working with public data is always difficult, but I am trying to find the best possible way to handle this.
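The best fallback I have come up with so far (an assumption on my part, not official guidance) is to use the SRA run accession as the read-group ID and platform unit and the BioSample as the sample name, e.g. with Picard (accessions here are placeholders):

java -jar picard.jar AddOrReplaceReadGroups \
    I=SRR1234567.bam \
    O=SRR1234567.rg.bam \
    RGID=SRR1234567 \
    RGPU=SRR1234567 \
    RGLB=SRR1234567 \
    RGSM=SAMN00000000 \
    RGPL=ILLUMINA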

Thanks for your input!
Best,

Failure when using GATK 3.8.1 BaseRecalibrator with -nct


Hi, GATK team,
I am using GATK 3.8.1 BaseRecalibrator on targeted region sequencing data.
I want to speed up the process, so I added the -nct option, but the run eventually failed with an empty output file.
Please find the commands I ran and the result below, and please give me some suggestions.
Thanks a lot.

java -jar -Xmx64G /usr/GenomeAnalysisTK.jar -T BaseRecalibrator \
-nct 10 \
-R ${reference} \
-I ${inputbam}/${SM}.sorted.bam \
-knownSites ${gatk_bundle}/dbsnp_138.hg19.vcf \
-knownSites ${gatk_bundle}/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-o ${outputdir}/BQSR/${SM}_first_recal.table \
-L ${bedfile}

java -jar -Xmx64G /usr/GenomeAnalysisTK.jar -T BaseRecalibrator \
-R ${reference} \
-I ${inputbam}/${SM}.sorted.bam \
-knownSites $gatk_bundle/dbsnp_138.hg19.vcf \
-knownSites $gatk_bundle/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-BQSR ${outputdir}/BQSR/${SM}_first_recal.table \
-o ${outputdir}/BQSR/${SM}_second_recal.table
.....

Result:

BaitBias Artifact Filter


Hi GATK team,

We recently have a set of targeted sequencing samples showing inflated G>T false-positive variants. Picard CollectSequencingArtifactMetrics showed that almost all samples in the batch have a very low bait bias Q-score (under 30) for the G to T base change.

I read through the Picard documentation on BaitBiasSummaryMetrics and PreAdapterSummaryMetrics. Both use G>T as the example artifact. My (not very clear) impression is that BaitBias looks for bias between the reference and complementary strands, while PreAdapterSummaryMetrics looks for an orientation bias?

Could I have a clearer explanation of how to differentiate elevated G>T rates that are OxoG artifacts from those that are G-ref artifacts?

Also, GATK4 has an experimental tool, FilterByOrientationBias (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_exome_FilterByOrientationBias.php), to filter out OxoG artifacts. I'm wondering whether there is a similar tool that can remove G-ref artifacts given SequencingArtifactMetrics.BaitBiasDetailMetrics?
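For reference, the metrics in question came from a command along these lines (file names are placeholders):

java -jar picard.jar CollectSequencingArtifactMetrics \
    I=sample.bam \
    R=ref.fasta \
    O=sample.artifact_metrics

which writes, among others, sample.artifact_metrics.bait_bias_summary_metrics and sample.artifact_metrics.pre_adapter_summary_metrics.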

Thanks!
Wen

A few things need help about output of command ModelSegments.

Hello GATK team, thanks for your good tools. I have some questions about the output of ModelSegments.
1. In the *.cr.seg file, is the NUM_POINTS_COPY_RATIO field the same as Num_Probes in a CGH result? Or does it mean how many bins support the segment? I want to plot a heatmap of results for multiple samples using the "copynumber" package in R.
2. In the *.igv.seg file, is the last field, Segment_Mean, the absolute value of the copy-number ratio (all positive numbers in my case)? IGV seems to interpret this field as a log2 ratio.
3. In *.modelFinal.seg, what do the LOG2_COPY_RATIO_POSTERIOR_(10/50/90) values mean?
Thanks.

GenomicsDBImport does not support GVCFs with MNPs; GATK (v4.1.0.0)


Hello!

I am running the GATK (v4.1.0.0) best practices pipeline on FireCloud with 12 pooled WGS samples; one pooled sample contains ~48 individual fish (I am using a ploidy of 20 throughout the pipeline). Though I have 24 linkage groups I also have 8286 very small scaffolds that my reads are aligned to, which has caused some issues with using scatter/gather and running the tasks by interval with -L (though that is not my main issue here). Lately I have run into a problem at the JointGenotyping stage.

I have one GVCF for each pool from HaplotypeCaller, and I tried to combine them all using CombineGVCFs. Because of the ploidy of 20, I thought I could not use GenomicsDBImport. I had the same error using CombineGVCFs as the person in this thread: gatkforums.broadinstitute.org/gatk/discussion/13430/gatk-v4-0-10-1-combinegvcfs-failing-with-java-lang-outofmemoryerror-not-using-memory-provided. No matter how much memory I gave the task, it failed every time.

But following @shlee's advice and reading this: github.com/broadinstitute/gatk/issues/5383 I decided to give GenomicsDBImport a try. I just used my 24 linkage groups, so my interval list has only those 24 listed.

I am stumped by the error I got for many of the linkage groups:

***********************************************************************

A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at LG07:4616323 in VCF /6942d818-1ae4-4c81-a4be-0f27ec47ec16/HaplotypeCallerGVCF_halfScatter_GATK4/3a4a3acc-2f06-44dc-ab6d-2617b06f3f46/call-MergeGVCFs/301508.merged.matefixed.sorted.markeddups.recal.g.vcf.gz

***********************************************************************

What is the best way to address this? I didn't see anything in the GenomicsDB documentation about flagging or ignoring the MNPs. I was thinking of removing the MNPs with SelectVariants before importing the GVCFs into GenomicsDB, but how do you get SelectVariants to output a GVCF, which is needed for joint genotyping?
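For reference, the removal I was considering looks roughly like this (untested, and I am not sure it preserves a valid GVCF):

gatk SelectVariants \
    -R ref.fasta \
    -V sample.g.vcf.gz \
    --select-type-to-exclude MNP \
    -O sample.noMNP.g.vcf.gz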

What would you recommend I do to get past this MNP hurdle?

Does GATK4 Mutect2 support Force Calling?


Hi,
Does GATK4 Mutect2 support Force Calling? If yes, could you let me know how to do it? Thanks a lot!

Thanks,
Chunyang


What's new with Mutect2 since v4.1.1.0 ?


GATK v4.1.1.0 introduces streamlined somatic calling with fewer errors, fewer false negatives, and optimized sensitivity and precision, thanks to several major advances in the Mutect2 pipeline. We hope the changes will help make your work more efficient, more accurate, and less expensive, benefits that will be worth the slight annoyance of the occasional command-line change to the workflow. Read to the bottom for what you need to know to run and take advantage of the new pipeline.

Reducing errors with key bug fixes

We fixed several bugs that were responsible for error messages about invalid log probabilities, infinities, NaNs etc. We also resolved an issue where CalculateContamination worked poorly on very small gene panels.

Maximizing sensitivity and precision with a streamlined filtering strategy

FilterMutectCalls now filters based on a single quantity: the probability that a variant is not a somatic mutation, regardless of cause. Previously, each error mode had its own threshold; we have removed parameters such as -normal-artifact-lod, -max-germline-posterior, -max-strand-artifact-probability, -max-contamination-probability, and even -tumor-lod. FilterMutectCalls automatically determines the probability threshold that optimizes the "F score," the harmonic mean of sensitivity and precision. Users can tweak results in favor of more or less sensitivity by modifying a single parameter, beta (the relative weight of sensitivity versus precision in the harmonic mean). Setting beta to a value greater than its default filters for greater sensitivity; setting it lower filters for greater precision.
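As an illustration (the value here is made up; see the FilterMutectCalls documentation for your version for the exact parameter), tuning the tradeoff looks roughly like:

gatk FilterMutectCalls -R ref.fasta -V somatic.vcf.gz --f-score-beta 1.5 -O filtered.vcf.gz

where a beta above 1 weights sensitivity more heavily and a beta below 1 favors precision.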

Reducing false positives with a Bayesian somatic clustering model

We had long suspected that modeling the spectrum of subclonal allele fractions would help distinguish somatic variants from errors. For example, if every somatic variant in a tumor occurred in 40% of cells (for a heterozygous variant in a diploid region, that corresponds to an allele fraction of about 20%), we would know to reject anything with an allele fraction significantly different from 20%. In the Bayesian framework of Mutect2, this means we can model the read counts of somatic variants with binomial distributions. We account for an unknown number of subclones with a Dirichlet process binomial mixture model. Because CNVs, small subclones, and genetic drift of passenger mutations all contribute allele fractions that don't match a few discrete values, this is still an oversimplification. Therefore, we include a couple of beta-binomials in the mixture to account for a background spread of allele fractions while still benefiting from clustering. Finally, we use these binomial and beta-binomial likelihoods to refine the tumor log odds calculated by Mutect2, which assume a uniform distribution of allele fractions.
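As a rough sketch of this likelihood (informal notation, not taken verbatim from the model): for a candidate site with a alt reads out of n total reads, the clustering model evaluates

$$P(a \mid n) \;=\; \sum_k \pi_k \, \mathrm{Binom}(a \mid n, f_k) \;+\; \sum_j \pi_j \, \mathrm{BetaBinom}(a \mid n, \alpha_j, \beta_j),$$

where the f_k are the discrete subclonal allele fractions learned by the Dirichlet process, the beta-binomial components absorb the background spread of allele fractions, and the mixture weights are learned from the data.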

We are working on providing you with a tutorial for somatic variant calling with Mutect2, so keep an eye out for it. In the meantime, refer to the latest Mutect2 tool documentation here.

Where can I download "gcnvkernel" from?


Hi,
I am trying to experiment with the germline CNV tools, and I read that they rely on a Python package called "gcnvkernel". I tried installing it using (conda install gcnvkernel), but conda says it is not available in any of its channels. I also searched the Anaconda database and found nothing.
Any advice on how to install it, please?
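In case it helps: as far as I can tell, gcnvkernel is not published on the default conda channels; it ships with the GATK release itself and is normally installed through the bundled conda environment file (the file sits in the unpacked GATK release directory, and the exact path may differ by version):

conda env create -n gatk -f gatkcondaenv.yml
conda activate gatk

The GATK Docker image also comes with this environment pre-installed.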
Thanks for your help
Nawar

GATK variant calling

Hi,

I am using this code to call SNPs on paired-end DNA-seq files aligned with bowtie2:

module load gatk/4.1.0.0 picard/2.9.2
picard CreateSequenceDictionary R=ref.fa O=ref.dict
java -jar picard.jar ValidateSamFile \
I=bowtie.sorted.bam \
MODE=SUMMARY
java -jar picard.jar AddOrReplaceReadGroups \
I=bowtie.sorted.bam \
O=bowtie_output.bam \
RGID=1 \
RGLB=lib1 \
RGPL=illumina \
RGPU=unit1 \
RGSM=20
java -jar picard.jar ValidateSamFile \
I=bowtie_output.bam \
MODE=SUMMARY
samtools index bowtie_output.bam
gatk --java-options "-Xmx4G" HaplotypeCaller -R ref.fa -I bowtie_output.bam -O bowtie_gatk.vcf

I got the VCF file, but it is much smaller (by a factor of almost 10,000) than the VCF I got using samtools with the same input and this command:
bcftools mpileup -f ref.fa bowtie.sorted.bam > bowtie_samtools.vcf

I am not sure if this is normal or not?
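One note in case it is relevant (my own observation, not from the GATK docs): bcftools mpileup without a calling step emits a record for essentially every covered position, while HaplotypeCaller emits only variant sites by default, so a very large size difference is expected. A more comparable samtools/bcftools run would add the calling step:

bcftools mpileup -f ref.fa bowtie.sorted.bam | bcftools call -mv -Ov -o bowtie_samtools_calls.vcf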

Cannot find gcnvkernel to install in order to run the GATK4 CNV tools


I used Miniconda to install the GATK environment. I thought this would install gcnvkernel as well, but it did not, and I cannot find it anywhere to install.
Any advice?

A USER ERROR has occurred: v is not a recognized option

I came across this error when collating all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. The command was:

java -jar -Xmx16g ./gatk-package-4.1.2.0-local.jar CreateSomaticPanelOfNormals \
-vcfs tutorial_11136/3_HG00190.vcf.gz \
-vcfs tutorial_11136/4_NA19771.vcf.gz \
-vcfs tutorial_11136/5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz

I tried this in both Cygwin and PowerShell on my PC.
Any suggestions?
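One guess (please verify against the 4.1.2 documentation): the -vcfs argument belongs to the GATK 4.0 version of CreateSomaticPanelOfNormals; in 4.1.x the tool expects a GenomicsDB workspace built from the normal VCFs, roughly along these lines (the reference and interval files here are placeholders):

gatk GenomicsDBImport -L intervals.interval_list \
    --genomicsdb-workspace-path pon_db \
    -V tutorial_11136/3_HG00190.vcf.gz \
    -V tutorial_11136/4_NA19771.vcf.gz \
    -V tutorial_11136/5_HG02759.vcf.gz

gatk CreateSomaticPanelOfNormals -R ref.fasta -V gendb://pon_db -O 6_threesamplepon.vcf.gz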