Channel: Recent Discussions — GATK-Forum

GenotypeGVCFs: --includeNonVariantSites disappeared?


Hi,
I just wanted to use the GenotypeGVCFs tool to genotype some GVCFs at known variant sites, and I am also quite interested in whether my samples are reference at these positions or whether the sites are not covered. The old GATK 3.7 version had the option --includeNonVariantSites, which is not supported by GATK4. Do you have any hints or a workaround? Currently, I'm rolling back to v3.7, which might lead to difficulties later.
Thanks for your help
Stefan


Quality trimming


Hi GATK team,

I was wondering: your best practices for data preprocessing (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165) don't mention any trimming (using e.g. trim_galore or fastx). It seems that, at least some time ago, this was pretty standard. Does this mean it is no longer advised? Can I really take my raw fastq files and map them using BWA directly, without any filtering?

Highest regards,

Freek.

Germline short variant discovery (SNPs + Indels)


Purpose

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.



Reference Implementations

Pipeline | Summary | Notes | Github | FireCloud
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | pending
Prod* germline short variant joint genotyping | GVCFs to cohort VCF | optimized for GCP | yes | pending
$5 Genome Analysis Pipeline | uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38
Generic germline short variant per-sample calling | analysis-ready BAM to GVCF | universal | yes | hg38
Generic germline short variant joint genotyping | GVCFs to cohort VCF | universal | yes | hg38 & b37
Intel germline short variant per-sample calling | uBAM to GVCF | Intel optimized for local architectures | yes | NA

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.


Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, it expects a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.


Main steps

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.
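
As a rough illustration of the first three steps described above, the following Python sketch assembles one command line per step. The file names, the workspace path and the interval are hypothetical placeholders. The tool names and arguments are the ones we understand GATK4 to use (the consolidation tool is invoked as GenomicsDBImport), so check them against the documentation for your version. The VQSR filtering step is left out.

```python
from typing import Dict, List


def haplotypecaller_gvcf_cmd(reference: str, bam: str, out_gvcf: str) -> List[str]:
    """Per-sample calling in GVCF mode (-ERC GVCF)."""
    return ["gatk", "HaplotypeCaller",
            "-R", reference, "-I", bam, "-O", out_gvcf,
            "-ERC", "GVCF"]


def genomicsdbimport_cmd(gvcfs: List[str], workspace: str, interval: str) -> List[str]:
    """Consolidate per-sample GVCFs into a GenomicsDB datastore."""
    cmd = ["gatk", "GenomicsDBImport"]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    cmd += ["--genomicsdb-workspace-path", workspace, "-L", interval]
    return cmd


def genotypegvcfs_cmd(reference: str, workspace: str, out_vcf: str) -> List[str]:
    """Joint genotyping, reading from the datastore via a gendb:// path."""
    return ["gatk", "GenotypeGVCFs",
            "-R", reference, "-V", "gendb://" + workspace, "-O", out_vcf]


def plan_cohort(reference: str, bams: Dict[str, str], workspace: str,
                interval: str, out_vcf: str) -> List[List[str]]:
    """Return the commands in the order they would be run (hypothetical file naming)."""
    commands = []
    gvcfs = []
    for sample, bam in sorted(bams.items()):
        gvcf = sample + ".g.vcf.gz"
        gvcfs.append(gvcf)
        commands.append(haplotypecaller_gvcf_cmd(reference, bam, gvcf))
    commands.append(genomicsdbimport_cmd(gvcfs, workspace, interval))
    commands.append(genotypegvcfs_cmd(reference, workspace, out_vcf))
    return commands


if __name__ == "__main__":
    for command in plan_cohort("ref.fasta", {"s1": "s1.bam", "s2": "s2.bam"},
                               "cohort_db", "chr20", "cohort.vcf.gz"):
        print(" ".join(command))
```

Because the per-sample commands are independent of each other, adding a sample only adds one HaplotypeCaller command before the consolidation and joint genotyping steps are rerun.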

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed up the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the consolidated output cannot be considered to have been called jointly.

Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyRecalibration

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.
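
To make the idea of hard-filtering concrete, here is a minimal Python sketch that applies flat thresholds to the annotations of each variant. The annotation names (QD, FS, MQ) are standard GATK annotations, but the cutoff values and filter names below are illustrative assumptions only; the appropriate thresholds depend on your data and should be taken from the methods articles.

```python
from typing import Dict, List, Optional

# (filter name, annotation key, comparison, cutoff): illustrative values, not a recommendation.
THRESHOLDS = [
    ("QD2", "QD", "<", 2.0),
    ("FS60", "FS", ">", 60.0),
    ("MQ40", "MQ", "<", 40.0),
]


def failed_filters(annotations: Dict[str, Optional[float]]) -> List[str]:
    """Return the names of all flat-threshold filters that this variant fails."""
    failed = []
    for name, key, comparison, cutoff in THRESHOLDS:
        value = annotations.get(key)
        if value is None:
            continue  # a missing annotation cannot fail a filter in this sketch
        if comparison == "<" and value < cutoff:
            failed.append(name)
        elif comparison == ">" and value > cutoff:
            failed.append(name)
    return failed


def filter_field(annotations: Dict[str, Optional[float]]) -> str:
    """Build the value of the VCF FILTER column: PASS or a ';'-joined list of failures."""
    failed = failed_filters(annotations)
    return ";".join(failed) if failed else "PASS"


if __name__ == "__main__":
    print(filter_field({"QD": 1.5, "FS": 10.0, "MQ": 60.0}))
    print(filter_field({"QD": 12.0, "FS": 75.0, "MQ": 30.0}))
```

Unlike VQSR, the same thresholds are applied to every variant regardless of how the annotations interact, which is why this approach is simpler but less discriminating.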

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.


Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which requires access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described above, enabling incremental growth of cohorts as well as scaling to large cohort sizes.
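
The split between the expensive step (1) and the cheap step (2) can be illustrated with a toy Python example. This is not the model GATK uses; it is a deliberately simplified sketch for a single biallelic site, assuming that per-sample genotype likelihoods for (hom-ref, het, hom-alt) have already been computed, that the cohort allele frequency is estimated in one pass under a flat prior, and that the prior follows Hardy-Weinberg proportions.

```python
from typing import List


def posteriors(likelihoods: List[float], priors: List[float]) -> List[float]:
    """Bayes rule: posterior is proportional to likelihood times prior."""
    weighted = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


def hwe_priors(alt_freq: float) -> List[float]:
    """Genotype priors (hom-ref, het, hom-alt) under Hardy-Weinberg proportions."""
    ref_freq = 1.0 - alt_freq
    return [ref_freq ** 2, 2 * ref_freq * alt_freq, alt_freq ** 2]


def estimate_alt_freq(cohort_likelihoods: List[List[float]]) -> float:
    """One-pass estimate: expected ALT allele count under a flat genotype prior."""
    flat = [1.0 / 3.0] * 3
    expected_alt = 0.0
    for likelihoods in cohort_likelihoods:
        post = posteriors(likelihoods, flat)
        expected_alt += post[1] + 2 * post[2]
    return expected_alt / (2 * len(cohort_likelihoods))


def joint_posteriors(cohort_likelihoods: List[List[float]]) -> List[List[float]]:
    """Cheap step: reuse stored likelihoods; only the cohort-level prior changes."""
    priors = hwe_priors(estimate_alt_freq(cohort_likelihoods))
    return [posteriors(likelihoods, priors) for likelihoods in cohort_likelihoods]


if __name__ == "__main__":
    cohort = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.45, 0.5, 0.05]]
    for row in joint_posteriors(cohort):
        print([round(x, 3) for x in row])
```

Adding a sample only requires computing that sample's likelihoods once; the posterior calculation for the whole cohort can then be repeated without going back to the reads.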

GATK4.0.2.1: steps, interval list, precision of the range (interval)


Hi,

I'm still a little confused about intervals. For example, I personally use the -L option in the following steps (-L intervals, -R reference genome):

BaseRecalibrator (-L): chr1... chrX, chrY (= 24 contigs)
ApplyBQSR (-R)
HaplotypeCaller (-L): 404 intervals*
GenomicsDBImport (-L): 404 intervals*
GenotypeGVCFs (-L): 404 intervals*

* the whole genome without the gap regions (as reported in the UCSC browser)

I also work with WES cohorts that have different exon-targeted designs.

Is my pipeline reliable/precise using the above intervals, or is it better to use the regions/intervals of my exon-targeted designs? If so, at which step?

Many thanks

Skip "indel realignment" and "recalibration"


Hi to all
Can I skip the "indel realignment" and "recalibration" steps when I am using HaplotypeCaller?

SortSam before MarkDuplicates?


Hi GATK team,

I'm setting up a GATK best practices workflow. It is described here: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165 that after mapping, which I did like this:

bwa mem -M -t 8 Homo_sapiens.GRCh38.dna.primary_assembly.fa R1_001.fastq.gz R2_001.fastq.gz > unmarkedDuplicates.bam

...I should MarkDuplicates. I do this like this:

gatk MarkDuplicates \
    -I unmarkedDuplicates.bam \
    -O markedDuplicates.bam \
    -M DuplicationMetrics.txt

This fails with the following error:

picard.PicardException: This program requires input that are either coordinate or query sorted. Found unsorted

Am I doing something wrong, or should the order be reversed in the description/best practices? It is of course easy to just sort first, but I really want to follow your guides as closely as possible.

Highest regards,

Freek
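
The error above refers to the sort order that a SAM/BAM file declares in its header. As a small illustration, the Python sketch below reads the SO field of the @HD header line and checks it against the two orders named in the error message. It assumes the header has already been extracted as text (for example with samtools view -H); the function names are hypothetical, and this is not how MarkDuplicates itself performs the check.

```python
def sam_sort_order(header_text: str) -> str:
    """Return the SO value of the @HD line, or 'unknown' if it is absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
            return "unknown"
    return "unknown"


def acceptable_for_markduplicates(sort_order: str) -> bool:
    """The error message names coordinate-sorted and query-sorted input as acceptable."""
    return sort_order in ("coordinate", "queryname")


if __name__ == "__main__":
    header = "@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:1\tLN:248956422"
    order = sam_sort_order(header)
    print(order, acceptable_for_markduplicates(order))
```

A file that reports unsorted or unknown would need to be sorted before duplicate marking.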

CombineVariants-- inconsistent references error


Hi, I have 2 VCF files:
vcf1 from WGS data
vcf2 from genotype data
I am trying to merge these two files using CombineVariants:
java -jar $GenomeAnalysisTK_jar -T CombineVariants -R $REF --variant $vcf1 --variant $vcf2 -o $outputfile

I am getting this error

MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 21:11098723, T* vs. A*

ERROR ------------------------------------------------------------------------------------------

When I grep these positions in the two files, I get:
vcf1

21 11098723 . T C

21 11098724 . G A

vcf2

21 11098723 exm1562347 A G

How can I fix this error?
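
One way to look at the conflicting records is to check how the two REF/ALT pairs relate to each other. The Python sketch below classifies a pair of SNP records at the same position as identical, strand-flipped (one file reports the complementary strand) or REF/ALT-swapped. Whether the genotype-array file really is reported on the opposite strand is an assumption that would have to be verified against the reference sequence; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(allele: str) -> str:
    """Reverse-complement an allele made of A/C/G/T bases."""
    return "".join(COMPLEMENT[base] for base in reversed(allele))


def classify_alleles(ref1: str, alt1: str, ref2: str, alt2: str) -> str:
    """Describe how the alleles of record 2 relate to those of record 1."""
    if (ref1, alt1) == (ref2, alt2):
        return "match"
    if (ref1, alt1) == (reverse_complement(ref2), reverse_complement(alt2)):
        return "strand_flip"
    if (ref1, alt1) == (alt2, ref2):
        return "ref_alt_swap"
    return "mismatch"


if __name__ == "__main__":
    # Records at 21:11098723 from the post: vcf1 has T>C, vcf2 has A>G.
    print(classify_alleles("T", "C", "A", "G"))
```

For the two records quoted in the post, the alleles in vcf2 are the complements of those in vcf1, which is the pattern expected if one file is reported on the other strand.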

Can GATK 4 output all reference sites after joint genotyping?


Hi GATK,

My question is whether it's possible to use --includeNonVariantSites in GATK 4.

Ari


HaplotypeCaller on whole genome or chromosome by chromosome: different results


Hi,

I'm working on targeted resequencing data and I'm doing multi-sample variant calling with the HaplotypeCaller. First, I tried to call the variants in all the targeted regions in a single run on a cluster, specifying all the targeted regions with the -L option.

Then, as it was taking too long, I decided to split my interval list chromosome by chromosome and run the calling on each chromosome separately. At the end, I merged the VCF files obtained from the per-chromosome runs.

Then, I compared this merged VCF file with the VCF file that I obtained from the single run on all the targeted regions. I noticed a 1% difference between the two variant lists, and I can't explain this stochasticity. Any suggestions?

Thanks!

Maguelonne
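
A comparison like the one described in this post can be made explicit with a short Python sketch that keys each record on (contig, position, REF, ALT) and reports what is shared and what is unique to each callset. It assumes uncompressed, tab-delimited VCF text already loaded into memory, and it ignores genotypes and annotations; the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

VariantKey = Tuple[str, int, str, str]


def variant_keys(vcf_text: str) -> Set[VariantKey]:
    """Collect (CHROM, POS, REF, ALT) for every record, skipping header lines."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _identifier, ref, alt = line.split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def compare_callsets(first: Set[VariantKey], second: Set[VariantKey]) -> Dict[str, float]:
    """Count shared and unique records and the fraction of the union that differs."""
    only_first = first - second
    only_second = second - first
    union = first | second
    differing = len(only_first) + len(only_second)
    return {
        "shared": len(first & second),
        "only_first": len(only_first),
        "only_second": len(only_second),
        "fraction_differing": differing / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    whole = "#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tG\n1\t200\t.\tC\tT\n2\t50\t.\tG\tA\n"
    merged = "#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tA\tG\n2\t50\t.\tG\tA\n2\t75\t.\tT\tC\n"
    print(compare_callsets(variant_keys(whole), variant_keys(merged)))
```

Listing the records in only_first and only_second, rather than just counting them, shows where in the genome the two runs disagree.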

Mutect2 does not recognize reference sequence


Hi GATK team,

I'm trying to run the Mutect2 pipeline on build 38 WGS CRAM files. I am using something very similar to this WDL: https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect2.wdl .

My call to Mutect2 looks like this (the following is from exec.sh, generated by Cromwell):

gatk --java-options "-Xmx4g" Mutect2 \
$tumor_command_line \
$normal_command_line \
--germline-resource /cromwell_root/somatic-workspace/sites.vcf.gz \
-pon /cromwell_root/somatic-workspace/PON.vcf \
-L /cromwell_root/somatic-workspace/Mutect2/8f8e9314-936a-45f4-91cb-878530214146/call-SplitIntervals/glob-6f4bc12a708659d4f5f3eecd1cdffff7/0000-scattered.intervals \
--af-of-alleles-not-in-resource 0.000006 \
--max-population-af 0.001 \
-R /cromwell_root/reference/hs38DH.fa \
-O "output.vcf"

I am running this without specifying the "normal_bam" option; i.e., I only specify the tumor_bam as my CRAM file.

Despite the fact that I am specifying the reference, I receive the following error:

"A USER ERROR has occurred: A reference file is required when using CRAM files."

Am I specifying the call incorrectly?

Thanks,
Josh

Invalid command line: Cannot process the provided BAM/CRAM file(s)


I am trying to generate a joint-call VCF file from two BAM files. Here is my command:

java -jar /usr/bin/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I 1-A10_S10.bam -I 1-A11_S11.bam -o jointcalls_A.vcf

But it fails with "ERROR MESSAGE: Invalid command line: Cannot process the provided BAM/CRAM file(s) because they were not indexed."
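
The message says that no index was found for the input files. As a quick pre-flight check, the Python sketch below looks for an index next to each input under the common naming conventions (sample.bam.bai or sample.bai for BAM, sample.cram.crai for CRAM). The helper names are hypothetical, and the existence test is passed in as a parameter so the logic can be exercised without real files.

```python
import os
from typing import Callable, List, Optional


def candidate_index_paths(path: str) -> List[str]:
    """List the index file names commonly used for a BAM or CRAM file."""
    root, extension = os.path.splitext(path)
    if extension == ".bam":
        return [path + ".bai", root + ".bai"]
    if extension == ".cram":
        return [path + ".crai"]
    return []


def find_index(path: str, exists: Callable[[str], bool] = os.path.exists) -> Optional[str]:
    """Return the first index file that exists, or None if the input looks unindexed."""
    for candidate in candidate_index_paths(path):
        if exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    for bam in ["1-A10_S10.bam", "1-A11_S11.bam"]:
        index = find_index(bam)
        print(bam, "->", index if index else "no index found")
```

An input that reports no index would need to be indexed (for example with samtools index) before being passed to the caller.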

error_java.lang.IllegalArgumentException: No data found--when using VQSR

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:409)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:157)
at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

When should I use -L to pass in a list of intervals?


The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: intervals are not required but they can help speed up analysis
- Whole exome analysis: you must provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you must provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.

Example Use of -L:

  • -L 20 for chromosome 20 in the b37 build

  • -L chr20:1-100 for chromosome 20 positions 1-100 in hg18/hg19 build

  • -L intervals.list (or intervals.interval_list, or intervals.bed) where the value passed to the argument is a text file containing intervals

  • -L some_variant_calls.vcf where the value passed to the argument is a VCF file containing variant records; their genomic coordinates will be used as intervals.

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

- For example, HLA-A*01:01:01:01 is a new contig in GRCh38. Colons in contig names are new in GRCh38 relative to prior assemblies. This has implications for the -L option of GATK, because the option also uses the colon as the delimiter between the contig name and the genomic coordinates.
- When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
- However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

-L HLA-A*01:01:01:01:1+
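
The ambiguity described above can be reproduced with a naive interval parser. The Python sketch below is not GATK's parser; it is a minimal illustration that splits on the last colon and treats the tail as coordinates whenever it looks like a position, a start-end range or a start followed by +. With that rule, a bare contig name ending in ":01" is misread, while the :1+ form is parsed as intended.

```python
import re
from typing import Optional, Tuple

# Tail after the last colon: "N", "N-M" or "N+" (from N to the end of the contig).
_COORDINATES = re.compile(r"^(\d+)(?:-(\d+)|(\+))?$")


def parse_interval(text: str) -> Tuple[str, int, Optional[int]]:
    """Return (contig, start, end); end is None when it means 'end of contig'."""
    contig, separator, tail = text.rpartition(":")
    if separator:
        match = _COORDINATES.match(tail)
        if match:
            start = int(match.group(1))
            if match.group(2) is not None:
                end = int(match.group(2))
            elif match.group(3) is not None:
                end = None  # open-ended: runs to the end of the contig
            else:
                end = start  # single position
            return (contig, start, end)
    return (text, 1, None)  # no coordinates: the whole contig


if __name__ == "__main__":
    for example in ["chr1:1-100", "HLA-A*01:01:01:01:1-100",
                    "HLA-A*01:01:01:01:1+", "HLA-A*01:01:01:01"]:
        print(example, "->", parse_interval(example))
```

The last example shows the problem: without the :1+ suffix, the final ":01" of the contig name is consumed as a coordinate.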

So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use an intervals list in whole genome analysis -- presumably you're interested in the whole genome!

However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere) where you know the data is not reliable or is very messy, causing excessive slowdowns. You can do this by providing a list of "good" intervals with -L, or you could also provide a list of "bad" intervals with -XL, which does the exact opposite of -L: it excludes the provided intervals. We share the whole-genome interval lists (of good intervals) that we use in our production pipelines, in our resource bundle (see Download page).
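
The relationship between -L and -XL can be pictured as interval subtraction. The Python sketch below removes excluded regions from included regions on a single contig, using 1-based inclusive coordinates. It is only an illustration of the concept under those assumptions, with made-up coordinates, and not a description of how the GATK engine resolves intervals.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive, same contig


def subtract_intervals(include: List[Interval], exclude: List[Interval]) -> List[Interval]:
    """Return the parts of the included intervals not covered by any excluded interval."""
    result = []
    for start, end in include:
        pieces = [(start, end)]
        for ex_start, ex_end in exclude:
            remaining = []
            for piece_start, piece_end in pieces:
                if ex_end < piece_start or ex_start > piece_end:
                    remaining.append((piece_start, piece_end))  # no overlap
                    continue
                if ex_start > piece_start:
                    remaining.append((piece_start, ex_start - 1))  # part to the left
                if ex_end < piece_end:
                    remaining.append((ex_end + 1, piece_end))  # part to the right
            pieces = remaining
        result.extend(pieces)
    return result


if __name__ == "__main__":
    # Hypothetical contig of length 1000 with two "bad" regions masked out.
    print(subtract_intervals([(1, 1000)], [(100, 200), (500, 600)]))
```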

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
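
To show what padding does to a target list, the Python sketch below extends each interval by a fixed number of bases on both sides and merges any intervals that end up overlapping or touching. It assumes 1-based inclusive coordinates on a single contig and does not clamp the end to the contig length; it is an illustration of the idea, not the engine's implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive, same contig


def pad_and_merge(intervals: List[Interval], padding: int = 100) -> List[Interval]:
    """Pad each interval on both sides, then merge overlapping or adjacent results."""
    padded = sorted((max(1, start - padding), end + padding) for start, end in intervals)
    merged: List[Interval] = []
    for start, end in padded:
        if merged and start <= merged[-1][1] + 1:
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    targets = [(50, 150), (1000, 1100), (1250, 1300)]
    print(pad_and_merge(targets, padding=100))
```

In this example the second and third targets are less than 200 bp apart, so their padded versions overlap and are reported as one interval.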

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

Tool | -L? | Why / why not
BaseRecalibrator | YES | This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.
PrintReads | NO | Output is a bam file; using -L would lead to lost data.
UnifiedGenotyper/HaplotypeCaller | YES | We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.
Next steps | NO | No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step.

Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

Debugging / troubleshooting

You can use -L a lot while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval. This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).

CombineGVCFs error in GATK 4.0.1.2


Hi,

I am aware that some people have faced this error, but they were using older versions of GATK, and I am not sure whether it applies to the GATK version I am using (4.0.1.2 with Java 1.8.0_74). I am facing these errors:

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:190)
at htsjdk.tribble.TribbleIndexedFeatureReader.loadIndex(TribbleIndexedFeatureReader.java:162)
at htsjdk.tribble.TribbleIndexedFeatureReader.hasIndex(TribbleIndexedFeatureReader.java:227)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:251)
at org.broadinstitute.hellbender.engine.MultiVariantDataSource.lambda$new$0(MultiVariantDataSource.java:89)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at org.broadinstitute.hellbender.engine.MultiVariantDataSource.<init>(MultiVariantDataSource.java:88)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.initializeDrivingVariants(MultiVariantWalker.java:71)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.initializeFeatures(VariantWalkerBase.java:47)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:558)
at org.broadinstitute.hellbender.engine.MultiVariantWalker.onStartup(MultiVariantWalker.java:48)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:277)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:181)
... 16 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at htsjdk.tribble.index.interval.IntervalTree.insert(IntervalTree.java:57)
at htsjdk.tribble.index.interval.IntervalTreeIndex$ChrIndex.read(IntervalTreeIndex.java:223)
at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:404)
at htsjdk.tribble.index.interval.IntervalTreeIndex.<init>(IntervalTreeIndex.java:53)
at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:181)
at htsjdk.tribble.TribbleIndexedFeatureReader.loadIndex(TribbleIndexedFeatureReader.java:162)
at htsjdk.tribble.TribbleIndexedFeatureReader.hasIndex(TribbleIndexedFeatureReader.java:227)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:251)
at org.broadinstitute.hellbender.engine.MultiVariantDataSource.lambda$new$0(MultiVariantDataSource.java:89)
at org.broadinstitute.hellbender.engine.MultiVariantDataSource$$Lambda$59/1292784864.accept(Unknown Source)
... 12 more

The command I am running is:

java -Xmx200g -jar /exports/eddie3_homes_local/s0928794/tools/gatk-package-4.0.1.2-local.jar CombineGVCFs -R GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa --variant All_gvcfs.list -O combined_81.g.vcf.gz

The All_gvcfs.list contains absolute paths to 81 GVCF files of varied sizes (24-106 GB) generated by HaplotypeCaller from GATK 4.0.1.2. For example:

/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/WGS_atlas_animals_gvcf/Lodi_female_30x_WGS_atlas.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/WGS_atlas_animals_gvcf/Pandharpuri_female_30x_WGS_atlas.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/WGS_atlas_animals_gvcf/Lodi_male_30x_WGS_atlas.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/WGS_atlas_animals_gvcf/Bhadawari_male_30x_WGS_atlas.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/indian_wgs_10x_gvcf/Surti-214_10x.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/indian_wgs_10x_gvcf/Jaffrabadi-548_10x.g.vcf
/exports/cmvm/eddie/eb/groups/prendergast_dutta_phd/indian_wgs_10x_gvcf/Bhadhwari-B254_10x.g.vcf

.....Total 81 GVCFs

I tested many Java heap sizes, starting from 8G, but not all files were being read by VCFCodec. When I gave it 200G, it read them all, but the above error occurred just as the traversal was about to start.

Error stack trace when I try running GenomeAnalysisTKLite.jar.


Hello, I know this may sound odd, but I have to use an older version of GATK for my company. We are using 2.3.0, and when I try running it on a new machine I get:

`##### ERROR ------------------------------------------------------------------------------------------

ERROR stack trace

java.lang.ExceptionInInitializerError
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(GenomeAnalysisEngine.java:160)
at org.broadinstitute.sting.gatk.CommandLineExecutable.(CommandLineExecutable.java:53)
at org.broadinstitute.sting.gatk.CommandLineGATK.(CommandLineGATK.java:54)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:90)
Caused by: java.lang.NullPointerException
at org.reflections.Reflections.scan(Reflections.java:220)
at org.reflections.Reflections.scan(Reflections.java:166)
at org.reflections.Reflections.(Reflections.java:94)
at org.broadinstitute.sting.utils.classloader.PluginManager.(PluginManager.java:77)
... 4 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-gdcdccbb):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

`

I'm not sure what is causing this, because when I run it on my older machine, which has BioLinux 14.04, it works. But this is the error I receive when I try running it on Ubuntu 16.04.

Also, I'm not even inputting any parameters yet. I'm just typing java -jar GenomeAnalysisTKLite.jar

Any ideas?


Several Annotations not working in GATK Haplotype Caller


I am using GENOTYPE_GIVEN_ALLELES mode with HaplotypeCaller.
I am trying to explicitly request all annotations that the documentation says are compatible with HaplotypeCaller (and that make sense for a single sample, e.g. no Hardy-Weinberg).

The following annotations all have "NA":
GCContent (GC), HomopolymerRun (Hrun), TandemRepeatAnnotator (STR, RU, RPA)
but they are valid requests, because I get no errors from GATK.

This is the command I ran (all on one line)

java -Xmx40g -jar /data5/bsi/bictools/alignment/gatk/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller --input_file /data2/external_data/Weinshilboum_Richard_weinsh/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff

The log file is below. Notice the "weird" WARNings about "StrandBiasBySample annotation exists in input VCF header", which make no sense because the header is empty other than the barebone fields.

This is the barebone VCF
head /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf

##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO

chr1 723918 rs144434834 G A 30 PASS .
chr1 729632 rs116720794 C T 30 PASS .
chr1 752566 rs3094315 G A 30 PASS .
chr1 752721 rs3131972 A G 30 PASS .
chr1 754063 rs12184312 G T 30 PASS .
chr1 757691 rs74045212 T C 30 PASS .
chr1 759036 rs114525117 G A 30 PASS .
chr1 761764 rs144708130 G A 30 PASS .

This is the output

INFO 10:03:06,080 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,082 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12
INFO 10:03:06,083 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 10:03:06,083 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 10:03:06,086 HelpFormatter - Program Args: -T HaplotypeCaller --input_file /data2/external_data/Weinshilboum_Richard_weinsh/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff
INFO 10:03:06,093 HelpFormatter - Executing as m037385@franklin04-213 on Linux 2.6.32-573.8.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26.
INFO 10:03:06,094 HelpFormatter - Date/Time: 2016/01/19 10:03:06
INFO 10:03:06,094 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,094 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,545 GenomeAnalysisEngine - Strictness is SILENT
INFO 10:03:06,657 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Fraction: 0.04
INFO 10:03:06,666 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 10:03:07,012 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.35
INFO 10:03:07,031 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 10:03:07,170 IntervalUtils - Processing 51304566 bp from intervals
INFO 10:03:07,256 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 10:03:07,595 GenomeAnalysisEngine - Done preparing for traversal
INFO 10:03:07,595 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 10:03:07,595 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 10:03:07,596 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 10:03:07,596 HaplotypeCaller - Disabling physical phasing, which is supported only for reference-model confidence output
WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
INFO 10:03:07,719 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 10:03:37,599 ProgressMeter - chr22:5344011 0.0 30.0 s 49.6 w 10.4% 4.8 m 4.3 m
INFO 10:04:07,600 ProgressMeter - chr22:11875880 0.0 60.0 s 99.2 w 23.1% 4.3 m 3.3 m
Using AVX accelerated implementation of PairHMM
INFO 10:04:29,924 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 10:04:29,925 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called
WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called
WARN 10:04:29,939 AnnotationUtils - Annotation will not be calculated, genotype is not called
INFO 10:04:37,601 ProgressMeter - chr22:17412465 0.0 90.0 s 148.8 w 33.9% 4.4 m 2.9 m
INFO 10:05:07,602 ProgressMeter - chr22:18643131 0.0 120.0 s 198.4 w 36.3% 5.5 m 3.5 m
INFO 10:05:37,603 ProgressMeter - chr22:20133744 0.0 2.5 m 248.0 w 39.2% 6.4 m 3.9 m
INFO 10:06:07,604 ProgressMeter - chr22:22062452 0.0 3.0 m 297.6 w 43.0% 7.0 m 4.0 m
INFO 10:06:37,605 ProgressMeter - chr22:23818297 0.0 3.5 m 347.2 w 46.4% 7.5 m 4.0 m
INFO 10:07:07,606 ProgressMeter - chr22:25491290 0.0 4.0 m 396.8 w 49.7% 8.1 m 4.1 m
INFO 10:07:37,607 ProgressMeter - chr22:27044271 0.0 4.5 m 446.4 w 52.7% 8.5 m 4.0 m
INFO 10:08:07,608 ProgressMeter - chr22:28494980 0.0 5.0 m 496.1 w 55.5% 9.0 m 4.0 m
INFO 10:08:47,609 ProgressMeter - chr22:30866786 0.0 5.7 m 562.2 w 60.2% 9.4 m 3.8 m
INFO 10:09:27,610 ProgressMeter - chr22:32908950 0.0 6.3 m 628.3 w 64.1% 9.9 m 3.5 m
INFO 10:09:57,610 ProgressMeter - chr22:34451306 0.0 6.8 m 677.9 w 67.2% 10.2 m 3.3 m
INFO 10:10:27,611 ProgressMeter - chr22:36013343 0.0 7.3 m 727.5 w 70.2% 10.4 m 3.1 m
INFO 10:10:57,613 ProgressMeter - chr22:37387478 0.0 7.8 m 777.1 w 72.9% 10.7 m 2.9 m
INFO 10:11:27,614 ProgressMeter - chr22:38534891 0.0 8.3 m 826.8 w 75.1% 11.1 m 2.8 m
INFO 10:11:57,615 ProgressMeter - chr22:39910054 0.0 8.8 m 876.4 w 77.8% 11.4 m 2.5 m
INFO 10:12:27,616 ProgressMeter - chr22:41738463 0.0 9.3 m 926.0 w 81.4% 11.5 m 2.1 m
INFO 10:12:57,617 ProgressMeter - chr22:43113306 0.0 9.8 m 975.6 w 84.0% 11.7 m 112.0 s
INFO 10:13:27,618 ProgressMeter - chr22:44456937 0.0 10.3 m 1025.2 w 86.7% 11.9 m 95.0 s
INFO 10:13:57,619 ProgressMeter - chr22:45448656 0.0 10.8 m 1074.8 w 88.6% 12.2 m 83.0 s
INFO 10:14:27,620 ProgressMeter - chr22:46689073 0.0 11.3 m 1124.4 w 91.0% 12.5 m 67.0 s
INFO 10:14:57,621 ProgressMeter - chr22:48062438 0.0 11.8 m 1174.0 w 93.7% 12.6 m 47.0 s
INFO 10:15:27,622 ProgressMeter - chr22:49363910 0.0 12.3 m 1223.6 w 96.2% 12.8 m 29.0 s
INFO 10:15:57,623 ProgressMeter - chr22:50688233 0.0 12.8 m 1273.2 w 98.8% 13.0 m 9.0 s
INFO 10:16:12,379 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.061128124000000006
INFO 10:16:12,379 PairHMM - Total compute time in PairHMM computeLikelihoods() : 22.846350295
INFO 10:16:12,380 HaplotypeCaller - Ran local assembly on 25679 active regions
INFO 10:16:12,434 ProgressMeter - done 5.1304566E7 13.1 m 15.0 s 100.0% 13.1 m 0.0 s
INFO 10:16:12,435 ProgressMeter - Total runtime 784.84 secs, 13.08 min, 0.22 hours
INFO 10:16:12,435 MicroScheduler - 727347 reads were filtered out during the traversal out of approximately 4410423 total reads (16.49%)
INFO 10:16:12,435 MicroScheduler - -> 2 reads (0.00% of total) failing BadCigarFilter
INFO 10:16:12,436 MicroScheduler - -> 669763 reads (15.19% of total) failing DuplicateReadFilter
INFO 10:16:12,436 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 10:16:12,436 MicroScheduler - -> 57582 reads (1.31% of total) failing HCMappingQualityFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 10:16:12,438 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

Can GATK Mutect2 detect MNPs?

In GATK 3.8, we used ReadBackedPhasing to merge SNPs into an MNP, but I do not see this tool listed for GATK 4.0.
Does Mutect2 do this now?
Thanks

How can I merge haplotypes into MNPs in GATK4?

I was previously using GATK 3.8, where I identified MNPs in haploid sequences by first running HaplotypeCaller with GVCFs and then ReadBackedPhasing with the enableMergeToMNP option. GATK4 seems to have discontinued the ReadBackedPhasing tool, and I am at a loss as to how to get phased SNPs in my haploid (prokaryotic) genome merged into MNPs where that applies. I would appreciate your help so that I can start using the current version of GATK4.
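To make the desired result concrete, here is a minimal sketch of merging directly adjacent SNPs into MNPs outside of GATK. It is not a replacement for ReadBackedPhasing: it assumes a haploid sample (so neighbouring ALT alleles sit on the same haplotype by construction), handles only SNPs at consecutive positions, and the function name `merge_adjacent_snps` is hypothetical.

```python
def merge_adjacent_snps(snps):
    """Merge SNPs at consecutive positions on the same contig into MNPs.

    snps: list of (chrom, pos, ref, alt) tuples sorted by contig and
    position, each with a single-base ref and alt (haploid calls assumed).
    """
    merged = []
    for chrom, pos, ref, alt in snps:
        if merged:
            p_chrom, p_pos, p_ref, p_alt = merged[-1]
            # The previous record ends right before this SNP: extend it.
            if p_chrom == chrom and p_pos + len(p_ref) == pos:
                merged[-1] = (p_chrom, p_pos, p_ref + ref, p_alt + alt)
                continue
        merged.append((chrom, pos, ref, alt))
    return merged


calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 101, "C", "T"),  # adjacent to the SNP at 100
    ("chr1", 105, "G", "A"),  # isolated SNP, left unchanged
]
print(merge_adjacent_snps(calls))
# [('chr1', 100, 'AC', 'GT'), ('chr1', 105, 'G', 'A')]
```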

Does GATK HaplotypeCaller have a resume analysis feature?

Hello there,

I am calling variants on 800 exome samples using HaplotypeCaller. For some reason the caller stopped at a certain location on chromosome 5 (after 5 weeks, with 10 weeks still to go). The error is below (one of the sample files was malformed):
INFO 02:04:51,540 ProgressMeter - chr5:180670788 1.05846818873E11 4.5 w 25.0 s 31.7% 14.1 w 9.6 w
INFO 02:06:01,736 ProgressMeter - chr5:180687847 1.05853600709E11 4.5 w 25.0 s 31.7% 14.1 w 9.6 w
INFO 02:07:01,737 ProgressMeter - chr5:180687847 1.05853600709E11 4.5 w 25.0 s 31.7% 14.1 w 9.6 w
INFO 02:08:11,738 ProgressMeter - chr5:180687847 1.05853600709E11 4.5 w 25.0 s 31.7% 14.1 w 9.6 w
INFO 02:09:11,739 ProgressMeter - chr5:180687847 1.05853600709E11 4.5 w 25.0 s 31.7% 14.1 w 9.6 w
INFO 02:10:11,740 ProgressMeter - chr5:180687847 1.05853600709E11 4.5 w 25.0 s 31.7% 14.1 w 9.6 w
INFO 02:11:11,741 ProgressMeter - chr5:180687847 1.05853600709E11 4.5 w 25.0 s 31.7% 14.1 w 9.6 w

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.8-1-0-gf15c1c3ef):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: File /media/daruma/sea/sample483.fastq.gz.mdup.realigned.fixed.recal.bai is malformed: Premature end-of-file while reading BAM index file sample483.fastq.gz.mdup.realigned.fixed.recal.bai. It's likely that this file is truncated or corrupt -- Please try re-indexing the corresponding BAM file.

I will reprocess that sample and would like to resume the analysis rather than start over, so I was wondering whether HaplotypeCaller has a resume function. If it doesn't, what is the best way to work around the issue?
- Should I modify the exome target file? If so, should I start from the beginning of chromosome 5 or from the position closest to where the analysis stopped?
- Why not include a resume feature that takes the VCF file written up to the point where the analysis stopped?

I would also like to hear any other suggestions not mentioned here.
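The first option (trimming the target file to what has not been processed yet) can be sketched as follows. This is only an illustration under stated assumptions: targets are 1-based inclusive `(contig, start, end)` tuples, the contig order matches the reference, and the function names are hypothetical. It restarts exactly at the last reported position; backing off by some margin would be the more conservative choice.

```python
def remaining_targets(targets, contig_order, stop_contig, stop_pos):
    """Return the targets at or after stop_contig:stop_pos.

    Targets on earlier contigs, or ending before stop_pos on the same
    contig, are dropped; a target spanning stop_pos is clipped to it.
    """
    rank = {contig: i for i, contig in enumerate(contig_order)}
    stop_rank = rank[stop_contig]
    remaining = []
    for contig, start, end in targets:
        if rank[contig] > stop_rank:
            remaining.append((contig, start, end))
        elif rank[contig] == stop_rank and end >= stop_pos:
            remaining.append((contig, max(start, stop_pos), end))
    return remaining


def as_interval_strings(targets):
    """Format targets as 'contig:start-end' strings, one per interval."""
    return [f"{contig}:{start}-{end}" for contig, start, end in targets]


# Made-up targets; the stop position is the last one in the log above.
targets = [
    ("chr4", 100, 200),
    ("chr5", 180670000, 180680000),
    ("chr5", 180685000, 180690000),
    ("chr6", 500, 900),
]
left = remaining_targets(targets, ["chr4", "chr5", "chr6"], "chr5", 180687847)
print(as_interval_strings(left))
# ['chr5:180687847-180690000', 'chr6:500-900']
```

The resulting intervals could then be written to a list file and passed with `-L`, and the partial and new VCFs combined afterwards.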

Thanks in advance :)

VariantFiltration: how to filter samples where fewer than 95% of reads agree with the called genotype

I've been looking over the documentation for VariantFiltration and JEXL (http://gatkforums.broadinstitute.org/gatk/discussion/1255/using-jexl-to-apply-hard-filters-or-select-variants-based-on-annotation-values) to figure out how to do this, but I can't seem to find an answer.

I would like to filter SNPs where many reads disagree with the called genotype (e.g. FORMAT 1:20,60:80:99:1900,0).

In pseudocode, I'd like to write something like "AD[GT]/DP<0.95", i.e. flag a call when the allelic depth (AD) for the called genotype (GT) divided by the total depth (DP) is below 0.95. However, the docs indicate that it is not possible to access GT:
"For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception"

Is there another way to accomplish what I want using VariantFiltration? It seems like a common-sense filter, and it's also something I frequently see in the literature for SNPs in haploid organisms.
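For reference, the quantity in the pseudocode can be computed outside of VariantFiltration. The sketch below is not JEXL and not a GATK feature. It assumes the FORMAT keys for the example are `GT:AD:DP:GQ:PL` (the post gives only the values), handles called haploid genotypes only, and uses a hypothetical function name.

```python
def called_allele_fraction(format_keys, sample_field):
    """Return AD[GT] / DP for a haploid call, or None if not computable."""
    fields = dict(zip(format_keys.split(":"), sample_field.split(":")))
    gt = fields["GT"]
    if "." in gt or "/" in gt or "|" in gt:
        return None  # no-call or non-haploid genotype: outside this sketch
    allele_depths = [int(x) for x in fields["AD"].split(",")]
    depth = int(fields["DP"])
    if depth == 0:
        return None
    return allele_depths[int(gt)] / depth


# The example from the post: GT=1, AD=20,60, DP=80.
fraction = called_allele_fraction("GT:AD:DP:GQ:PL", "1:20,60:80:99:1900,0")
print(fraction, fraction < 0.95)
# 0.75 True
```

Here 60 of 80 reads support the called allele, so the call falls below the 0.95 threshold and would be flagged.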
