Channel: Recent Discussions — GATK-Forum

java.lang.IncompatibleClassChangeError GATK 4


Hi,

I hit an error with GATK 4 beta 6 using RealignerTargetCreator - as a complete Java newbie, it's quite incomprehensible to me. I'm running (Oracle) Java 9.0.1 (and thus the GATK 3 RealignerTargetCreator isn't working for me either :# ).

Here is the command I ran:

gatk-launch RealignerTargetCreator -R ~/data/ref/hg38.fa -I Sample1_dedup.bam -o Sample1_int.intervals

And this is the output:

Using GATK jar /usr/local/bin/gatk-4.beta.6/gatk-package-4.beta.6-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -jar /usr/local/bin/gatk-4.beta.6/gatk-package-4.beta.6-local.jar RealignerTargetCreator -R /home/jamie/data/ref/hg38.fa -I Sample1_dedup.bam -o Sample1_int.intervals
Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/broadinstitute/barclay/argparser/CommandLineProgramGroup. Method lambda$static$0(Lorg/broadinstitute/barclay/argparser/CommandLineProgramGroup;Lorg/broadinstitute/barclay/argparser/CommandLineProgramGroup;)I at index 43 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef
    at org.broadinstitute.barclay.argparser.CommandLineProgramGroup.<clinit>(CommandLineProgramGroup.java:16)
    at org.broadinstitute.hellbender.Main.printUsage(Main.java:332)
    at org.broadinstitute.hellbender.Main.extractCommandLineProgram(Main.java:305)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:156)
    at org.broadinstitute.hellbender.Main.main(Main.java:239)

Many thanks! :*


GATK reported a deletion as a homozygous while the same deletion is shown heterozygous in IGV.


I am using GATK version 3.7 and following the Best Practices pipeline as recommended for cohort analysis. I called variants from 70 samples using HaplotypeCaller joint genotyping with default parameters. When I went through individual samples, I found that a deletion reported by GATK HaplotypeCaller as homozygous (1/1:0,14:14) with a DP of 14 is shown as heterozygous (0/1:5,8:13) with a DP of 13 in IGV. As shown in the picture, the Ref bases are GGGCG.

Kindly give your valuable comment on this, Thanks.

MuTect2 beta --germline_resource for build b37


Hi - I'm looking to run MuTect2 beta using the --germline_resource option. However, I've consistently used the b37 genome build throughout my analysis, while the suggested resource (gnomAD) appears to only be available for the hg19 build. So I'm wondering whether I should:
1. Go ahead and use the gnomad hg19 files, despite the fact that my whole analysis has used the b37 build?
2. Lift over my existing gnomad vcfs from hg19 to b37? (In this case, I'd need an hg19tob37 liftOver file - I can't find one anywhere).
3. Use another germline resource?

I wonder which option you would recommend? Many thanks for your time.

1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf corrupted?


Hi, I downloaded the file from the GATK Google Cloud bucket, but it seems the file is corrupted: only chr1-chr15 sites are present.

LiftoverVcf no output


Hi!

I'm using LiftoverVcf to make an hg38 version of the gnomad file from the GATK resource bundle.
The gnomad file is in the b37 build; as there isn't a chain file for b37 -> hg38, I have to do b37 -> hg19 and then hg19 -> hg38 (even if I'm not very enthusiastic about doing so).
Anyway, I performed the first "conversion step" with the following command line:

java -Xmx8g -jar picard.jar LiftoverVcf
I=~/af-only-gnomad.raw.sites.b37.vcf.gz
O=~/af-only-gnomad.raw.sites_LIFTOVERhg19.vcf
CHAIN=~/b37tohg19.chain
REJECT=~/af-only-gnomad.raw.sites_LIFTrejected19.vcf
R=~/hg19_M_rCRS.fasta

The result was an empty VCF file. I thought something had gone wrong and looked at the "rejected" file, but it was empty as well... it only contained the header (I don't even know if it is normal for the header to appear there).

The log on the terminal was this:

INFO    2017-12-06 13:20:39 LiftoverVcf Loading up the target reference genome.
INFO    2017-12-06 13:20:45 LiftoverVcf Lifting variants over and sorting.
INFO    2017-12-06 13:20:51 LiftoverVcf read     1.000.000 records.  Elapsed time: 00:00:05s.  Time for last 1.000.000:    5s.  Last read position: 1:9.514.996
INFO    2017-12-06 13:20:55 LiftoverVcf read     2.000.000 records.  Elapsed time: 00:00:09s.  Time for last 1.000.000:    4s.  Last read position: 1:19.801.042
[...]
INFO    2017-12-06 13:37:29 LiftoverVcf read   267.000.000 records.  Elapsed time: 00:16:44s.  Time for last 1.000.000:    3s.  Last read position: X:131.552.026
INFO    2017-12-06 13:37:33 LiftoverVcf read   268.000.000 records.  Elapsed time: 00:16:48s.  Time for last 1.000.000:    3s.  Last read position: X:146.780.876
INFO    2017-12-06 13:37:36 LiftoverVcf Processed 268545311 variants.
INFO    2017-12-06 13:37:36 LiftoverVcf 0 variants failed to liftover.
INFO    2017-12-06 13:37:36 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO    2017-12-06 13:37:36 LiftoverVcf 0,0000% of variants were not successfully lifted over and written to the output.
INFO    2017-12-06 13:37:36 LiftoverVcf Writing out sorted records to final VCF.
[Wed Dec 06 13:37:36 CET 2017] picard.vcf.LiftoverVcf done. Elapsed time: 16,96 minutes.
Runtime.totalMemory()=7299137536
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.NumberFormatException: For input string: "30,35"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at java.lang.Double.valueOf(Double.java:502)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseQual(AbstractVCFCodec.java:518)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:322)
    at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:284)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:262)
    at htsjdk.variant.vcf.VCFRecordCodec.decode(VCFRecordCodec.java:53)
    at htsjdk.variant.vcf.VCFRecordCodec.decode(VCFRecordCodec.java:18)
    at htsjdk.samtools.util.SortingCollection$FileRecordIterator.advance(SortingCollection.java:497)
    at htsjdk.samtools.util.SortingCollection$FileRecordIterator.<init>(SortingCollection.java:469)
    at htsjdk.samtools.util.SortingCollection$MergingIterator.<init>(SortingCollection.java:407)
    at htsjdk.samtools.util.SortingCollection.iterator(SortingCollection.java:273)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:329)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

According to the log, 0 variants failed to lift over, yet none were written to the output file.
I also don't actually know what the last lines of the log mean... I'm talking about the "Exception in thread" part.

Detect MNP variants using HaplotypeCaller


Hi!

I used HaplotypeCaller 3.1 to detect MNP variants.
java -Xmx6g -Djava.io.tmpdir=$PWD -jar $GATK -R $hg19 -T HaplotypeCaller -I $bamlist --dbsnp $dpsnp135 -o $call/$sample.all.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 200 -nct $nct -A SpanningDeletions -A TandemRepeatAnnotator -A HomopolymerRun -A AlleleBalance -l INFO -baqGOP 30 --max_alternate_alleles 2 -rf BadCigar --minPruning 5

The result had no MNP-type variants, but I can find some contiguous SNP variants; these SNPs were not combined into MNP variants.
Is this normal, or did I really not get any MNP variants?

Why is HaplotypeCaller slower in the most recent GATK4 beta versions?


Because it's saving its strength for the 4.0 general release ;)

Many of the "early adopters" who have been testing out the GATK4 during its beta phase have pointed out that they saw significant speed improvements in early beta versions (yay!), but then when they upgraded to more recent betas (starting with 4.beta.4), they observed a return to the slowness seen in GATK3 versions (boo!). This has understandably caused some concern to those who were attracted to the GATK4 beta version of HaplotypeCaller because of its promised speed improvements -- so, basically everyone.

The good news is that this is only a temporary artifact of some of our development and evaluation constraints, which forced us to remove some key improvements while we refine and evaluate the equivalence of results with the older version. We should be able to restore the HaplotypeCaller's speed improvements in the very near future -- in time for the GATK 4.0 planned for January 9, 2018.

If you're interested in understanding why we had to hobble the HaplotypeCaller in this way, please read on! Otherwise feel free to take our word for it.


There are two opposing forces in play when we migrate tools from the older GATK to the new 4.x framework. One is that we want to streamline the program's operation to make it run faster and cheaper. The other is that we have been asked by our internal stakeholders to produce an exact "tie-out" for the germline variant discovery pipeline that we run in production at the Broad (i.e. for a subset of tools including HaplotypeCaller). This means that the HaplotypeCaller we release in GATK 4.0 needs to produce exactly the same output (modulo some margins) as the one from version 3.8, to minimize disruption when the pipelines are migrated. That's a very high standard, and it's the right thing to do both from an operations standpoint and from a software engineering standpoint.

However, these two directives came into conflict because we realized, somewhere in the early beta stages, that some of the optimizations that were introduced to make HaplotypeCaller faster also created output differences that were outside of the acceptable margins. We believe that those differences may actually be improvements on the "old" results, but for the sake of the tie-outs we had to take them out temporarily -- hence the HaplotypeCaller went back to being slower than we'd like in the later beta releases.

We're confident we have a solution that will allow us to put the efficiency optimizations back in as soon as the final tie-out test results have been approved, which appears to be imminent. So by the time GATK4 is released into general availability in January, the new HaplotypeCaller should have all its superpowers back.

Variant annotations


Variant annotations are available to HaplotypeCaller, Mutect2, VariantAnnotator and GenotypeGVCFs. These are listed under Annotations in the Tool Documentation.

  • HaplotypeCaller and Mutect2 calculate annotations based on realigned reads.
  • If given the optional BAM input, VariantAnnotator will calculate annotations based on the pileup. Otherwise, VariantAnnotator and GenotypeGVCFs calculate summary metrics based on existing VCF fields.
  • Some annotations, when called by different tools, may give different results.

See related forum discussion here.



What is a VCF and how should I interpret it?


This document describes "regular" VCF files produced for GERMLINE calls. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document. For information specific to SOMATIC calls, see the MuTect documentation.


Contents

  1. What is VCF?
  2. Basic structure of a VCF file
  3. Interpreting the VCF file header information
  4. Structure of variant call records
  5. How the genotype and other sample-level information is represented
  6. How to extract information from a VCF in a sane, straightforward way

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.


2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

[image: schematic of a VCF file's two parts, the header and the variant call records]

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also includes the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so, as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
1   873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1   974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variation are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, i.e. records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.


3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta

We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

  • VCF spec version

The first line:

##fileformat=VCFv4.1

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

  • FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).

  • FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.

  • GATKCommandLine

The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

  • Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

[todo: FAQ on genome builds]


4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

The first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, i.e. ., to serve as a placeholder).

  • CHROM and POS : The contig and genomic coordinates on which the variant occurs.
    Note that for deletions the position given is actually the base preceding the event.

  • ID: An optional identifier for the variant.
    Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.

  • REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated).
    Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

  • QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data.
    Because the Phred scale is -10 * log10(1-p), a value of 10 indicates a 1 in 10 chance of error, while a value of 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.
    Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

  • FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters.
    If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

  • INFO: Various site-level annotations.
    The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, i.e. =, and pairs are separated by semicolons, i.e. ;, as in this example: MQ=99.00;MQ0=0;QD=17.94.
    They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.
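As a small aside, the QUAL Phred scaling and the INFO tag-value layout described above are easy to demonstrate in a few lines of Python. This is an illustrative sketch only, not GATK code (for real extraction work, use VariantsToTable as recommended elsewhere in this document); the example INFO string is the one from the text above.

```python
def qual_to_error_prob(qual):
    """Convert a Phred-scaled QUAL value to the probability
    that the call is wrong: p_error = 10^(-QUAL/10)."""
    return 10 ** (-qual / 10.0)

def parse_info(info_field):
    """Split an INFO field into a tag -> value dict.
    Pairs are separated by ';', tag and value by '='."""
    out = {}
    for pair in info_field.split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            out[tag] = value
        else:
            out[pair] = True  # flag-type annotations have no value
    return out

print(qual_to_error_prob(10))    # 0.1   -> a 1 in 10 chance of error
print(qual_to_error_prob(100))   # 1e-10 -> a 1 in 10^10 chance
print(parse_info("MQ=99.00;MQ0=0;QD=17.94"))
```

Note that values come out as strings; the Number and Type declarations in the header lines tell you how each tag should actually be typed.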

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.


5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but it's actually not that hard to interpret once you understand that it's just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1   873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample at this site.
    For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate
      In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
      For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
  • AD and DP : Allele depth and depth of coverage.
    These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
    AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
    DP is the filtered depth, at the sample level. This gives you the total number of reads at this site that passed the variant caller's internal quality filters; you can check the variant caller's documentation to see which filters are applied by default. However, unlike the AD calculation, uninformative reads are included in DP.
    See the Tool Documentation on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.

  • PL : "Normalized" Phred-scaled likelihoods of the possible genotypes.
    For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale. We use "normalized" in quotes because these are not probabilities; we set the PL of the most likely genotype to 0 purely for ease of reading. The other values are scaled relative to this most likely genotype.
    Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.

  • GQ : Quality of the assigned genotype.
    The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.
    Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.
    Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP = 4), 1 of which was REF (= had the reference base) and 3 of which were ALT (= had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0), as is always the case for the assigned genotype, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous with the variant allele). Either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele), since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.
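The arithmetic in this walkthrough can be sketched in a few lines of Python. This is purely illustrative (not GATK code, and not a VCF parser - see section 6 for why you shouldn't write one); it decodes a GT string against REF/ALT and derives GQ from the PLs as described above, i.e. the difference between the two smallest PL values, capped at 99.

```python
def decode_gt(gt, ref, alts):
    """Translate a GT string like '0/1' into allele letters,
    where 0 is REF and 1, 2, ... index the ALT alleles."""
    alleles = [ref] + list(alts)
    return "/".join(alleles[int(i)] for i in gt.replace("|", "/").split("/"))

def gq_from_pls(pls):
    """GQ is the difference between the smallest PL (0 after
    normalization) and the second smallest, capped at 99."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], 99)

# The 1:899282 record above: GT:AD:DP:GQ:PL = 0/1:1,3:4:26:103,0,26
print(decode_gt("0/1", "C", ["T"]))   # C/T
print(gq_from_pls([103, 0, 26]))      # 26, matching the GQ in the record
print(10 ** (-26 / 10.0))             # the relative likelihood of 1/1, ~0.0025
```

Running this on the example record reproduces the C/T genotype and the GQ of 26 shown in the text.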


6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)

Why does GenotypeGVCFs with and without the "includeNonVariantSites" option give different results?


Hi GATK team,
For reasons explained in another discussion (https://gatkforums.broadinstitute.org/gatk/discussion/10751/high-proportion-of-spanning-deletion-in-a-whole-genome-callset#latest) we decided to run GenotypeGVCFs with and without the --includeNonVariantSites option. The idea was then to use the VCF with variants only for VQSR and to include the filtering information in the all sites VCF.
We use GATK 3.7 and mostly follow the GATK guidelines (except for a more complex BQSR step). I did not run GenotypeGVCFs in multi-threading mode.
I compared the outputs of both modes and found that they differ slightly. For SNPs, about 0.04% are found only in the default mode, 0.03% only with --includeNonVariantSites, and the rest are common to both. For indels the situation is more complex. I paste below the first few discordant or differing calls that I get (obtained using vcftools).

Records in the VCF obtained in GenotypeGVCFs default mode:

 #CHROM POS ID  REF ALT QUAL    FILTER
chr20   1007104 rs141600758 AT  A,* 86338.8 .
chr20   1032276 rs754360776 CTT CT,C    5312.73 .
chr20   1061955 .   CAAATTGTGGTGCAAAAGTAGTTGTGGTTTTTGCCATTACTTTCAATGGA  C,* 15411.4 .
chr20   1061958 .   ATTGTGGTGC  A,* 16207.4 .
chr20   1133585 .   TGGTGTAG    T   853.85  .

Records in the VCF obtained with GenotypeGVCFs --includeNonVariantSites:

#CHROM  POS ID  REF ALT QUAL    FILTER
chr20   1007103 .   TA  *   9457.13 .
chr20   1007104 rs141600758 AT  A,* 86338.9 .
chr20   1032276 rs764424719 CTTT    CTT,C,CT    5377.99 .
chr20   1032295 .   TG  T   16.63   .
chr20   1060512 .   AAAAAC  *   28726.9 .
chr20   1061953 .   GGCAAATTGTGGTGCAAAAGTAGTTGTGGTTTTTGCCATTACTTTCAATGGAAAAAACAGCAATTACTTTTGCACCAACA    *   15029   .
chr20   1061955 .   CAAATTGTGGTGCAAAAGTAGTTGTGGTTTTTGCCATTACTTTCAATGGA  C,* 15422.5 .
chr20   1061958 .   ATTGTGGTGC  A,* 16216.4 .
chr20   1062031 .   CAG *   24447.4 .

I reran GenotypeGVCFs in both modes for one chromosome. For the two runs with the same settings, the output is the same.
Can you explain to me why using the option --includeNonVariantSites would change the result? And is one of the results more to be trusted?
Thanks in advance and I hope I provided enough information,
Best,
Gwenna

GATK4-beta Mutect2 input and samplenames


I'm using paired tumor-normal samples, and since the command line parameters have changed, I wanted to make sure that I'm defining my input files and sample names correctly. The command I use is

java -Xmx{params.xmx} -Xms{params.xms} -jar {params.GATK} Mutect2 --reference {input.REF} --intervals {input.INTERVAL_LIST} --input {input.NORMAL_BAM} --input {input.TUMOR_BAM} --tumorSampleName {params.TUMOR_BAM_SAMPLE_NAME} -O {output.VCF}

TUMOR_BAM_SAMPLE_NAME is the name of the sample as in the BAM header (SM). Is it necessary to define the sample name of the normal sample, when using paired samples?

From the help:
--input,-I:String. BAM/SAM/CRAM file containing reads This argument must be specified at least once. Required.
--tumorSampleName,-tumor:String. BAM sample name of tumor Required.
--normalSampleName,-normal:String. BAM sample name of tumor Default value: null.

How to define different target regions in Haplotypecaller?


Hi, I'm new to using GATK to analyze whole-exome sequencing data. I'm wondering how to define target regions in HaplotypeCaller via the -L argument. The customer provided me a BED file. Your help will be greatly appreciated.
Thanks in advance!
xiaohong
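For what it's worth, GATK's -L argument accepts a BED file directly, so a customer-supplied file can usually be passed as-is. A hedged sketch (all file names are placeholders), with a quick awk sanity check of the BED beforehand:

```shell
# HaplotypeCaller's -L argument accepts a BED file directly, e.g.
#   java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
#        -R ref.fasta -I sample.bam -L targets.bed -o out.vcf
# (all paths above are placeholders). Before using a customer-supplied
# BED, a quick sanity check that every row has at least three columns
# and start < end:
printf 'chr1\t100\t200\nchr1\t300\t250\n' > targets.bed   # toy BED
awk -F'\t' 'NF < 3 || $2 >= $3 { print "bad line " NR ": " $0 }' targets.bed
```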

help with picard CollectHsMetrics


I tried to run CollectHsMetrics with this command

java -jar /picard.jar CollectHsMetrics \
I=190.sortedDeDup.bam \
O=190_hs_metrics.txt \
R=ucsc.hg19.fasta \
BAIT_INTERVALS=AgilentSSV6_bait_list.interval_list \
TARGET_INTERVALS=V6SureSelect/AgilentSSV6_targets_list.interval_list

it works OK until minute 2, when it stops and displays this error:
[Wed Dec 06 19:23:49 CST 2017] picard.analysis.directed.CollectHsMetrics done. Elapsed time: 2.74 minutes.
Runtime.totalMemory()=5752487936
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: Could not find percentile: 0.2
at htsjdk.samtools.util.Histogram.getPercentile(Histogram.java:327)
at picard.analysis.directed.TargetMetricsCollector$PerUnitTargetMetricCollector.calculateTargetCoverageMetrics(TargetMetricsCollector.java:688)
at picard.analysis.directed.TargetMetricsCollector$PerUnitTargetMetricCollector.finish(TargetMetricsCollector.java:626)
at picard.metrics.MultiLevelCollector$AllReadsDistributor.finish(MultiLevelCollector.java:208)
at picard.metrics.MultiLevelCollector.finish(MultiLevelCollector.java:324)
at picard.analysis.directed.CollectTargetedMetrics.doWork(CollectTargetedMetrics.java:153)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

What am I doing wrong?

Thank you

How MuTect filters candidate mutations


Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.

Overview

This document describes the methodological underpinnings of the filters that MuTect applies by default to distinguish real mutations from sequencing artifacts and errors. Some of these filters are applied in all detection modes, while others are only applied in "High Confidence" detection mode.

Note that at the moment, there is no straightforward way to disable these filters. It is possible to disable each by passing parameter values that render the filters ineffective (e.g. set a value of zero for a filter that requires a minimum value of some quantity) but this has to be examined on a case-by-case basis. A more practical solution is to leave the filter parameters untouched, but instead perform some filtering on the CALLSTATS file using text processing functions (e.g. test for lines that have REJECT in only one of several columns).
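A minimal sketch of that kind of CALLSTATS post-processing in Python, keeping rows that passed plus rows rejected for exactly one tolerated reason. The column names used here are illustrative assumptions, not the exact CALLSTATS header:

```python
import csv
import io

# Toy CALLSTATS-like table. The real file is tab-delimited with many more
# columns; "judgement" (KEEP/REJECT) and "failure_reasons" are illustrative
# column names, not the exact CALLSTATS header.
data = """contig\tposition\tjudgement\tfailure_reasons
1\t100\tKEEP\t
1\t200\tREJECT\tstrand_artifact
1\t300\tREJECT\tstrand_artifact,clustered_read_position
"""

def rescue_single_failure(text, allowed=("strand_artifact",)):
    """Keep rows that passed, plus rows rejected for exactly one reason
    when that reason is in the allowed set."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    kept = []
    for row in rows:
        reasons = [r for r in row["failure_reasons"].split(",") if r]
        if row["judgement"] == "KEEP" or (len(reasons) == 1 and reasons[0] in allowed):
            kept.append(row)
    return kept

kept = rescue_single_failure(data)
print([r["position"] for r in kept])
```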


Filters used in high-confidence mode

1. Proximal Gap

This filter removes false positives (FP) caused by nearby misaligned small indel events. MuTect will reject a candidate site if there are more than a given number of reads with insertions/deletions in an 11 base pair window centered on the candidate. The threshold value is controlled by the --gap_events_threshold argument.

In the CALLSTATS output file, the relevant columns are labeled t_ins_count and t_del_count.

2. Poor Mapping

This filter removes FPs caused by reads that are poorly mapped (typically due to sequence similarities between different portions of the genome). The filter uses two tests:

  • Reject the candidate if the fraction of reads with a mapping quality of 0 in the tumor and normal samples exceeds a given threshold. The threshold value is controlled by --fraction_mapq_threshold.

  • Reject candidate if it does not have at least one observation of the mutant allele with a mapping quality that satisfies a given threshold. The threshold value is controlled by --required_maximum_alt_allele_mapping_quality_score.

In the CALLSTATS output file, the relevant columns are labeled total_reads and map_Q0_reads for the first test, and t_alt_max_mapq for the second test.

3. Strand Bias

This filter rejects FPs caused by context-specific sequencing errors where the vast majority of alternate alleles are seen in a single direction of reads. Candidates are rejected if the strand-specific LOD is below a given threshold in a direction where the sensitivity to have passed that threshold is above a certain percentage. The LOD threshold value is controlled by --strand_artifact_lod and the percentage by --strand_artifact_power_threshold.

In the CALLSTATS output file, the relevant columns are labeled power_to_detect_negative_strand_artifact and t_lod_fstar_forward. There are also complementary columns labeled power_to_detect_positive_strand_artifact and t_lod_fstar_reverse.

4. Clustered Position

This filter rejects FPs caused by misalignments evidenced by the alternate alleles being clustered at a consistent distance from the start or end of the read alignment. Candidates are rejected if both their median distance from the start/end of the read and the median absolute deviation of that distance are lower than or equal to given thresholds. The median threshold value is controlled by --pir_median_threshold and the deviation threshold by --pir_mad_threshold.

In the CALLSTATS output file, the relevant columns are labeled tumor_alt_fpir_median and tumor_alt_fpir_mad for the forward strand, and complementary columns are labeled tumor_alt_rpir_median and tumor_alt_rpir_mad for the reverse (note the name difference is fpir vs. rpir, for forward vs. reverse position in read).
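The median and median-absolute-deviation statistics behind this filter can be sketched in a few lines of Python (a toy illustration, not MuTect's implementation; the thresholds named in the comment are examples only):

```python
from statistics import median

def pir_stats(positions):
    """Median and median absolute deviation (MAD) of alt-allele
    positions in read, i.e. distances from the read end."""
    med = median(positions)
    mad = median(abs(p - med) for p in positions)
    return med, mad

# Alt alleles all ~3 bp from the read end: tightly clustered, so both
# statistics are small and the candidate would fail thresholds of the
# kind described above (e.g. median <= 10 and MAD <= 3; example values,
# not necessarily MuTect's defaults).
med, mad = pir_stats([3, 3, 4, 2, 3])
print(med, mad)
```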

5. Observed in Control

This filter rejects FPs in tumor data by looking at control data (typically from a matched normal) for evidence of the alternate allele above the level of random sequencing error. Candidates are rejected if both of the following conditions are met:

  • The number of observations of the alternate allele or the proportion of reads carrying the alternate allele is above a given threshold, controlled by --max_alt_alleles_in_normal_count and --max_alt_allele_in_normal_fraction.

  • The sum of quality scores is above a given threshold value, controlled by --max_alt_alleles_in_normal_qscore_sum.

In the CALLSTATS output file, the relevant columns are labeled n_alt_count, normal_f, and n_alt_sum.


Filters applied in all MuTect modes

1. Tumor and normal LOD scores

This filter rejects candidates with a tumor LOD score below a given threshold value, controlled by --tumor_lod, and similarly for a normal LOD score threshold controlled by --normal_lod_threshold.

In the CALLSTATS output file, the relevant columns are labeled t_lod_fstar and init_n_lod, respectively.

2. Possible contamination

This filter rejects candidates with potential cross-patient contamination, controlled by --fraction_contamination.

In the CALLSTATS output file, the relevant columns are labeled t_lod_fstar and contaminant_lod.

3. Normal LOD score and dbsnp status

If a candidate mutation is in dbSNP but not in COSMIC, it may be a germline variant. In that case, the normal LOD threshold that the candidate must clear is raised to a value controlled by --dbsnp_normal_lod.

In the CALLSTATS output file, the relevant column is labeled init_n_lod.

4. Triallelic Site Filter

When the program is evaluating a site, it considers all possible alternate alleles as mutation candidates, and puts them through all the filters detailed above. If more than one candidate allele passes all filters, resulting in a proposed triallelic site, the site is rejected with the reason triallelic_site because it is extremely unlikely that this would really happen in a tumor sample.

[GATK 4 beta] clustered_events in Mutect2/FilterMutectCalls


Hi,

I have a question about filtering Mutect2 calls. A well-characterized SNV (VCF records below; 17:7577120) is filtered out by the clustered_events filter. It appears that an artificial haplotype is assembled that carries the SNV along with other variants nearby. But those variants don't appear in the same reads, and it seems unlikely to me that they coexist in one haplotype, so I am hesitant to configure a higher value for --maxEventsInHaplotype.
Do you think that is reasonable? If so, I wonder how I can get the variant (17:7577120) unfiltered (and keep the variants to its left filtered).
Thank you!

17      7577079 .       CTTCCT  C       .       clustered_events        DP=592;ECNT=5;NLOD=95.58;N_ART_LOD=-2.510e+00;POP_AF=1.000e-03;P_GERMLINE=-9.228e+01;TLOD=18.63 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:PGT:PID:SA_MAP_AF:SA_POST_PROB       0/0:316,0:0.019:41,0:0,0:413,0:60,0:41,0:false:false:0|1:7577079_CTTCCT_C       0/1:265,8:0.042:41,41:0,0:410,416:60,60:37,59:false:false:0|1:7577079_CTTCCT_C:0.030,0.020,0.029:4.069e-03,3.816e-03,0.992
17      7577087 .       GT      G       .       clipping;clustered_events;read_position DP=570;ECNT=5;NLOD=93.62;N_ART_LOD=-2.494e+00;POP_AF=1.000e-03;P_GERMLINE=-9.031e+01;TLOD=18.71 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:PGT:PID:SA_MAP_AF:SA_POST_PROB       0/0:302,0:3.807e-05:41,0:0,0:416,0:60,0:40,0:false:false:0|1:7577079_CTTCCT_C   0/1:260,8:0.029:41,0:0,0:416,416:60,60:40,0:false:false:0|1:7577079_CTTCCT_C:0.030,0.020,0.030:4.483e-03,3.633e-03,0.992
17      7577089 .       G       A       .       clipping;clustered_events;read_position DP=566;ECNT=5;NLOD=91.51;N_ART_LOD=-2.484e+00;POP_AF=1.000e-03;P_GERMLINE=-8.821e+01;TLOD=18.76 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:PGT:PID:REF_F1R2:REF_F2R1:SA_MAP_AF:SA_POST_PROB   0/0:297,0:3.822e-05:0:0:NaN:41,0:0,0:413,0:60,0:40,0:false:false:.:.:.:.:0|1:7577079_CTTCCT_C:154:143   0/1:261,8:0.030:2:6:0.250:41,0:0,0:418,416:60,60:39,0:false:true:0.750:0.038:44.92:100.00:0|1:7577079_CTTCCT_C:133:128:0.030,0.020,0.030:4.595e-03,3.543e-03,0.992
17      7577090 .       CGCCGGT C       .       clipping;clustered_events;read_position DP=579;ECNT=5;NLOD=93.27;N_ART_LOD=-2.494e+00;POP_AF=1.000e-03;P_GERMLINE=-8.997e+01;TLOD=18.63 GT:AD:AF:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:PGT:PID:SA_MAP_AF:SA_POST_PROB       0/0:305,0:4.990e-03:41,0:0,0:416,0:60,0:40,0:false:false:0|1:7577079_CTTCCT_C   0/1:266,8:0.030:41,0:0,0:417,416:60,60:39,0:false:false:0|1:7577079_CTTCCT_C:0.030,0.020,0.029:4.692e-03,3.389e-03,0.992
17      7577120 .       C       G       .       clustered_events        DP=520;ECNT=5;NLOD=82.01;N_ART_LOD=-2.443e+00;POP_AF=2.462e-05;P_GERMLINE=-8.032e+01;TLOD=188.97        GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:MBQ:MCL:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBQ:OBQRC:REF_F1R2:REF_F2R1:SA_MAP_AF:SA_POST_PROB   0/0:273,0:0.016:0:0:NaN:41,0:0,0:415,0:60,0:39,0:false:false:.:.:135:138        0/1:191,56:0.233:26:30:0.536:41,41:0,0:418,388:60,60:37,44:false:false:54.12:100.00:102:89:0.222,0.202,0.227:9.482e-03,0.012,0.978




VariantAnnotator using GnomAD gives NullPointerException


Hello,

Running VariantAnnotator, I am running into errors I couldn't find solutions for in the forum. Using the publicly available gnomAD VCFs, I would like to add information to a VCF, specifically frequency annotations (I am testing with a portion of chromosome 1). I am using the following command:

java -jar ~/Downloads/GenomeAnalysisTK38.jar -R ~/build/GRCh37/GRCh37.fa -T VariantAnnotator -V sample.vcf --comp:gnomad,vcf gnomad.vcf --expression gnomad.AF -o output_AF.vcf

The error I get is:

##### ERROR --
##### ERROR stack trace
java.lang.NullPointerException
    at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.initialize(VariantAnnotator.java:284)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

If I remove the --expression argument, it works perfectly fine and I get an output VCF with ";gnomad" added to the INFO column wherever the variant was observed in gnomAD (but of course without any other gnomAD-derived annotations). The error occurs with any requested field (GT, DP, AC, AF, lcr, etc.).

I ran ValidateVariants on both VCFs. Both exited with "Done. There were no warn messages."

Thank you very much in advance for any help.
Klaasjan

Variant annotations


Variant annotations are available to HaplotypeCaller, Mutect2, VariantAnnotator and GenotypeGVCFs. These are listed under Annotations in the Tool Documentation.

  • HaplotypeCaller and Mutect2 calculate annotations based on realigned reads.
  • If given the optional BAM input, VariantAnnotator will calculate annotations based on the pileup. Otherwise, VariantAnnotator and GenotypeGVCFs calculate summary metrics based on existing VCF fields.
  • Some annotations, when called by different tools, may give different results.

See related forum discussion here.


How to use multiple g.VCF files in GATK4.beta.1 GenotypeGVCFs?


Hi,
I tried to use GenotypeGVCFs from GATK 4.beta.1, but there still seems to be a bug with the --variant argument. At first I gave it a list of my g.VCF files (ending in .list, as worked in GATK 3.7), but got an error message that no suitable codecs were found. Specifying the argument multiple times, once per input file, produced the error that I am only allowed to set this option once, while running GenotypeGVCFs with a single input g.VCF worked (though no longer when I referenced that same sample through the input list). Was there a change since 3.7, or is this a bug?
In addition, I'm wondering how to get the full stack trace, as -DGATK_STACKTRACE_ON_USER_EXCEPTION was somehow recognized as -D (A USER ERROR has occurred: Argument '[D, dbsnp]' cannot be specified more than once.) and -GATK_STACKTRACE_ON_USER_EXCEPTION just produced no change in the log.
Thanks in advance
Johannes

Using GATK jar /home/uni08/geibel/software/gatk-4.beta.1/gatk-package-4.beta.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -Xmx220g -jar /home/uni08/geibel/software/gatk-4.beta.1/gatk-package-4.beta.1-local.jar GenotypeGVCFs -R /home/uni08/geibel/chicken/chickenrefgen/galGal5_Dec2015/galGal5.fa --variant /usr/users/geibel/chicken/pool_sequence_nov2016/data/VCF_test/input_chr26.list --useNewAFCalculator -L /home/uni08/geibel/chicken/chickenrefgen/galGal5_Dec2015/contigs_chr26.intervals --dbsnp /home/uni08/geibel/chicken/chickenrefgen/ENSEMBL_20170106/Gallus_gallus.updated.vcf -O /usr/users/geibel/chicken/pool_sequence_nov2016/data/VCF_test/IndandPool_chr26.raw.vcf
[July 7, 2017 2:35:20 PM CEST] GenotypeGVCFs  --output /usr/users/geibel/chicken/pool_sequence_nov2016/data/VCF_test/IndandPool_chr26.raw.vcf --useNewAFCalculator true --dbsnp /home/uni08/geibel/chicken/chickenrefgen/ENSEMBL_20170106/Gallus_gallus.updated.vcf --variant /usr/users/geibel/chicken/pool_sequence_nov2016/data/VCF_test/input_chr26.list --intervals /home/uni08/geibel/chicken/chickenrefgen/galGal5_Dec2015/contigs_chr26.intervals --reference /home/uni08/geibel/chicken/chickenrefgen/galGal5_Dec2015/galGal5.fa  --annotateNDA false --heterozygosity 0.001 --indel_heterozygosity 1.25E-4 --heterozygosity_stdev 0.01 --standard_min_confidence_threshold_for_calling 10.0 --max_alternate_alleles 6 --max_genotype_count 1024 --sample_ploidy 2 --group StandardAnnotation --onlyOutputCallsStartingInIntervals false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false
[July 7, 2017 2:35:20 PM CEST] Executing as geibel@gwdu101 on Linux 3.10.0-327.36.3.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-b15; Version: 4.beta.1
[July 7, 2017 2:35:33 PM CEST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.22 minutes.
Runtime.totalMemory()=985661440
***********************************************************************

A USER ERROR has occurred: Cannot read /usr/users/geibel/chicken/pool_sequence_nov2016/data/VCF_test/input_chr26.list because no suitable codecs found

***********************************************************************
Use -DGATK_STACKTRACE_ON_USER_EXCEPTION to print the stack trace.
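For reference, GATK4's GenotypeGVCFs expects a single input g.VCF, unlike GATK 3.7, so the usual route is to merge per-sample g.VCFs first with CombineGVCFs (or import them into GenomicsDB). A hedged shell sketch; file names are placeholders, and the tool/flag spellings follow the GATK4 beta docs and should be double-checked:

```shell
# GATK4's GenotypeGVCFs takes a single --variant input, unlike GATK 3.7,
# so per-sample g.VCFs are usually merged first with CombineGVCFs (or
# imported into GenomicsDB). File names below are placeholders; the
# commands are only echoed for inspection, not executed.
samples="s1.g.vcf s2.g.vcf"
variant_args=""
for g in $samples; do
    variant_args="$variant_args --variant $g"
done
echo "gatk-launch CombineGVCFs -R ref.fa$variant_args -O combined.g.vcf"
echo "gatk-launch GenotypeGVCFs -R ref.fa --variant combined.g.vcf -O out.vcf"
```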

Java error on running VariantRecalibrator


I have been trying to run VariantRecalibrator on my VCF with the standard command line suggested here:
https://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr

Command line I used is this:

gatk -T VariantRecalibrator \
-R /Users/debortoli/Doutorado/hg19/hg19.fa \
-input test_raw_snps.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf \
-an QD \
-an FS \
-an SOR \
-an MQ \
-an MQRankSum \
-an ReadPosRankSum \
-an InbreedingCoeff \
-mode SNP \
-tranche 100.0 \
-tranche 99.9 \
-tranche 99.0 \
-tranche 90.0 \
-recalFile recalibrate_SNP.recal \
-tranchesFile recalibrate_SNP.tranches \
-rscriptFile recalibrate_SNP_plots.R

After some minutes running I get this error:

INFO 18:54:29,091 HelpFormatter - The Genome Analysis Toolkit (GATK) vexported, Compiled 2017/07/31 02:32:19
INFO 18:54:29,091 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 18:54:29,091 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
INFO 18:54:29,091 HelpFormatter - [Thu Dec 07 18:54:29 BRST 2017] Executing on Mac OS X 10.13.1 x86_64
INFO 18:54:29,092 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_141-b15 JdkDeflater
INFO 18:54:29,095 HelpFormatter - Program Args: -T VariantRecalibrator -R /Users/debortoli/Doutorado/hg19/hg19.fa -input oca2_herc2_raw_snps.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R
INFO 18:54:29,108 HelpFormatter - Executing as debortoli@MacBook-Pro on Mac OS X 10.13.1 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_141-b15.
INFO 18:54:29,108 HelpFormatter - Date/Time: 2017/12/07 18:54:29
INFO 18:54:29,109 HelpFormatter - --------------------------------------------------------------------------
INFO 18:54:29,109 HelpFormatter - --------------------------------------------------------------------------
INFO 18:54:29,132 GenomeAnalysisEngine - Strictness is SILENT
INFO 18:54:29,234 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 18:54:29,832 GenomeAnalysisEngine - Preparing for traversal
INFO 18:54:29,837 GenomeAnalysisEngine - Done preparing for traversal
INFO 18:54:29,837 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 18:54:29,837 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 18:54:29,838 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 18:54:29,842 TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0
INFO 18:54:29,842 TrainingSet - Found omni track: Known = false Training = true Truth = true Prior = Q12.0
INFO 18:54:29,843 TrainingSet - Found 1000G track: Known = false Training = true Truth = false Prior = Q10.0
INFO 18:54:29,843 TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0
INFO 18:54:59,845 ProgressMeter - chr1:180418989 3596240.0 30.0 s 8.0 s 5.8% 8.6 m 8.1 m
INFO 18:55:29,853 ProgressMeter - chr2:89269200 7377132.0 60.0 s 8.0 s 10.9% 9.1 m 8.1 m
INFO 18:55:59,866 ProgressMeter - chr3:27451056 1.1488123E7 90.0 s 7.0 s 16.8% 8.9 m 7.4 m
INFO 18:56:29,873 ProgressMeter - chr4:6767822 1.5567576E7 120.0 s 7.0 s 22.5% 8.9 m 6.9 m
INFO 18:56:59,879 ProgressMeter - chr4:180178303 1.9568326E7 2.5 m 7.0 s 28.1% 8.9 m 6.4 m
INFO 18:57:29,882 ProgressMeter - chr6:13510279 2.439402E7 3.0 m 7.0 s 34.8% 8.6 m 5.6 m
INFO 18:57:59,894 ProgressMeter - chr7:7711400 2.8355875E7 3.5 m 7.0 s 40.1% 8.7 m 5.2 m
INFO 18:58:29,905 ProgressMeter - chr8:12472026 3.2353219E7 4.0 m 7.0 s 45.4% 8.8 m 4.8 m
INFO 18:58:59,918 ProgressMeter - chr9:29350711 3.6268024E7 4.5 m 7.0 s 50.7% 8.9 m 4.4 m
INFO 18:59:29,930 ProgressMeter - chr10:82904883 4.0231189E7 5.0 m 7.0 s 57.0% 8.8 m 3.8 m

ERROR --
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: -32
at org.broadinstitute.gatk.utils.BaseUtils.convertIUPACtoN(BaseUtils.java:215)
at org.broadinstitute.gatk.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:347)
at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.initializeReferenceSequence(LocusReferenceView.java:163)
at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.<init>(LocusReferenceView.java:139)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:90)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version exported):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: -32
ERROR ------------------------------------------------------------------------------------------

Any thoughts about what I can do in this case?

Thanks
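One hedged first check for the convertIUPACtoN ArrayIndexOutOfBoundsException: that method chokes on reference characters it doesn't recognize as bases, so scanning the FASTA for characters outside the IUPAC nucleotide alphabet can localize the problem. A toy Python sketch (the FASTA text here is illustrative; the same scan can be run over the real hg19.fa):

```python
# Scan FASTA sequence lines for characters outside the IUPAC nucleotide
# alphabet. convertIUPACtoN fails on bytes it does not recognize as
# bases, so this is a reasonable first diagnostic. The FASTA text below
# is a toy stand-in; apply the same scan to the real reference.
IUPAC = set("ACGTURYSWKMBDHVN-")

def find_bad_bases(fasta_text):
    """Yield (line_number, char) for non-IUPAC characters in sequence lines."""
    for lineno, line in enumerate(fasta_text.splitlines(), start=1):
        if line.startswith(">"):
            continue
        for ch in line:
            if ch.upper() not in IUPAC:
                yield lineno, ch

toy = ">chr1\nACGTNacgt\nACG TACGT\n"   # note the stray space
print(list(find_bad_bases(toy)))
```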


