
Too Many Missing 0 Calls in GVCF


Hello!
I have 5555 patient samples. I have a VCF file for each patient with the 1 and 2 calls, and a GVCF file for each patient with just the 0 calls. I used CombineVariants to merge just the VCF files. I then wrote a Python script to loop through the GVCF files and collect the quality 0 calls. Surprisingly, although I have up to 400 million 0 genotype calls in some of these GVCF files, the final 0,1,2 matrix I created is only sparsely populated with 0 calls.

For instance, in ONE sample I examined 3.7 million SNPs of interest, and across all of those SNPs the combined major-allele frequency is only 27% (that is, looking at 3.7 million SNPs in one patient, the major allele is represented at only 27% frequency). In short, there are surprisingly few 0 calls in my GVCFs that match up with the 1,2 SNP calls (filtered PASS) in my VCF files across the 5555 patients.

Is there a way to infer the missing calls from the GVCF files? Or would it make a difference if I merged my GVCF files first?


Where to download BAM files required for PON in Mutect2 worksheets given for Workshop 1803


I am a beginner and have just started using GATK. I was following the worksheets for Workshop 1803, specifically the worksheet on Mutect2 basics. There is an exercise on PONs that requires HG00190.bam, NA19771.bam and HG02759.bam. These are not provided with the GATK bundle for the workshop. When I ran the commands, I got the error "unable to read non-existent file." The worksheet says these files are retrieved from 1000 Genomes, and I was able to track down the locations of files corresponding to them, but they are all CRAM files. All of them are around 3 GB in size, whereas the normal and tumor BAM files (provided in the bundle) are only around 250 MB. I have the following queries -

1. Is there any link to download the BAM files directly?
2. Should I download the CRAM files and convert them to BAM?

The links that I have found for these are the following -

HG00190 - ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/FIN/HG00190/exome_alignment/
NA19771 - ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/MXL/NA19771/
HG02759 - ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/GWD/HG02759/
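
In case it matters, if conversion turns out to be the way to go, this is roughly what I would run (a sketch; I'm assuming the CRAMs need the exact GRCh38 reference they were aligned to for decoding, and the file names are placeholders):

    # Decode CRAM to BAM; -T must point at the reference the CRAM was created against.
    samtools view -b -T GRCh38_full_analysis_set_plus_decoy_hla.fa \
        -o HG00190.bam HG00190.cram
    samtools index HG00190.bam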

Calling a task depending on the type of input


I want to call a task depending on the type of input file. Is there a way to compare strings in WDL to do so?

  # is the input a cram file?
  Boolean is_cram = ".cram" == suffix(input_file)

  if (is_cram) {
    call CramToBamTask {
      input:
        input_cram = input_file,
        input_cram_index = input_file_index,
    }
  }
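
One pattern that might work (a sketch, untested against your pipeline): WDL's two-argument basename() strips a suffix only when it is present, so comparing the stripped and unstripped names detects the extension without needing a dedicated suffix() function:

  # basename(input_file, ".cram") drops a trailing ".cram" if present,
  # so the two names differ exactly when the input is a CRAM.
  Boolean is_cram = basename(input_file) != basename(input_file, ".cram")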

Why is converting from fastq to uBAM necessary before preprocessing?


Hi Everyone,

I am brand new to this so please go easy on me. I have just taken over a project where we are going to be doing variant calling on a large number of human samples. I have inherited a number of scripts that are at least a few years old, and I decided to follow the GATK Best Practices while noting the differences between them and the scripts I have. I'm currently trying to push a single family (5 individuals) through the pipeline before applying it to all of the other samples I have.

So, first of all, all of my raw data is stored as paired-end reads in FASTQ format; I have no uBAM files available to me. According to the data pre-processing for variant discovery steps, the "reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM." So the first thing I did yesterday was use FastqToSam to do the conversion. This is not an insignificant task: I ran each sample of my test family in parallel and it took roughly 5 hours.
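
For reference, the conversion I ran looked roughly like this (a sketch; file names, sample name and read-group values are placeholders):

    # One uBAM per sample; read-group metadata is attached at this step.
    java -jar picard.jar FastqToSam \
        FASTQ=sample_R1.fastq.gz \
        FASTQ2=sample_R2.fastq.gz \
        OUTPUT=sample.unmapped.bam \
        SAMPLE_NAME=sample1 \
        READ_GROUP_NAME=rg1 \
        PLATFORM=ILLUMINA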

I understand the benefit of using uBAM from the get-go (keeping some metadata that is discarded in FASTQ, as described here: https://gatkforums.broadinstitute.org/gatk/discussion/5990/what-is-ubam-and-why-is-it-better-than-fastq-for-storing-unmapped-sequence-data), but I don't see the benefit of doing this conversion if the first step of the alignment is to convert the uBAM back to FASTQ before running bwa mem and samtools view. The next step would be to use MergeBamAlignment to merge the mapped and unmapped reads, which I guess I couldn't do if I had not done the original FASTQ-to-uBAM conversion.

Basically, my question is whether the initial conversion from FASTQ to uBAM is necessary or even recommended in this case. I don't see how it could have any added benefit, and converting from and then back to FASTQ incurs significant overhead. For what it's worth, the script I inherited simply ran bwa mem on the paired-end reads and piped the output into samtools view -bh to create the aligned BAM file. From there it moved on to marking duplicates. If I don't convert to uBAM and skip MergeBamAlignment, will that have an impact on my ability to apply the Best Practices down the line? I want to stick as close to the Best Practices as I possibly can, but if I can cut out some unnecessary computation time, that would be great.

Thanks!

Can BQSR work on data coming from BGISEQ-500 and how to set the Read Group PL?


Hi GATK team,
I have two questions when using GATK4:
1) Can Base Quality Score Recalibration (BQSR) work on data coming from the BGISEQ-500? Please see this paper for more information about the sequencing platform: www.ncbi.nlm.nih.gov/pubmed/28379488
2) If it can, how should I set the platform (PL) information in the Read Group (RG)? I have noticed that the only valid values for PL are ILLUMINA, SOLID, LS454, HELICOS and PACBIO.
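
For context, I currently attach the read group at alignment time roughly like this (a sketch; the PL value is exactly the part I'm unsure about, and the file names are placeholders):

    # -R injects the @RG header line; PL is the field in question.
    bwa mem -R '@RG\tID:rg1\tSM:sample1\tPL:ILLUMINA' \
        ref.fa reads_1.fastq.gz reads_2.fastq.gz > aligned.sam
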
Thanks very much!


Using JEXL to apply hard filters or select variants based on annotation values


1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.


2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, this simple JEXL expression selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".
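
As a sketch, such a compound expression can also be attached as a filter with VariantFiltration, which marks failing records in the FILTER column instead of removing them (file and filter names here are placeholders):

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V variants.vcf \
        --filterExpression "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0" \
        --filterName "hard_filter" \
        -o filtered.vcf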


4. Filtering on sample/genotype-level properties

You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1), as in the sketch below.
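
For example, here is a minimal sketch of a genotype-level filter that tags heterozygous calls with the FT annotation (file and filter names are placeholders):

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V variants.vcf \
        --genotypeFilterExpression "isHet == 1" \
        --genotypeFilterName "isHetFilter" \
        -o filtered.vcf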


5. Important caveats

Sensitivity to case and type

You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:

  • Case
    Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type
    The types (i.e. string, integer, non-integer, floating point or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (specifically, a Java exception, e.g. a NumberFormatException for numerical type mismatches).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.


6. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The best way to find out what methods are available is to read the VariantContext documentation in the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

Using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by accessing the underlying VariantContext as follows:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of a JEXL expression that finds all novel variants in the total set with an allele frequency greater than 0.25 but not equal to 1, that are not filtered, and that are non-reference in sample 01-0263:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select '! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' \
        -o 01-0263.high_freq_novels.vcf \
        -sn 01-0263

Using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V my.vcf \
        -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V my.vcf \
        -select 'vc.hasAttribute("DB")'

Using VariantContext to access annotations in multiallelic sites

The order of alleles in the VariantContext object is not guaranteed to be the same as in the VCF output, so accessing the AF by an index derived from a scrambled alleles array is dangerous. However! If we have the sample genotypes, there's a workaround:

java -jar GenomeAnalysisTK.jar -T SelectVariants  \
        -R reference.fasta  \
        -V multiallelics.vcf  \
        -select 'vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf

The odd 1.0 is there because otherwise we'd be dividing two integers, and integer division would truncate the fraction to 0. The vc.hasGenotypes() is extra error checking. This might be slow for large files, but we could use something like this if performance is a concern:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V multiallelics.vcf \
         -select 'vc.isBiallelic() ? AF > 0.1 : vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf

Where hopefully the ternary expression shortcuts the extra vc calls for all the biallelics.

Using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").getAD().0 > 10'

If you would like to select sites where the alternate allele frequency is greater than 50%, you can use the following expression:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").getAD().1 / vc.getGenotype("NA12878").getDP() > 0.50'

Interval list in CollectWgsMetrics has huge effect on output


Hello,

I'm getting VERY different results based on whether or not I include an interval list in CollectWgsMetrics.

I initially ran one of my analysis ready bams through CollectWgsMetrics with the following command:

java -Xms2000m -jar ${picard_jar} \
        CollectWgsMetrics \
        INPUT=${input_bam} \
        VALIDATION_STRINGENCY=SILENT \
        REFERENCE_SEQUENCE=${ref_fasta} \
        INCLUDE_BQ_HISTOGRAM=true \
        INTERVALS=${wgs_coverage_interval_list} \
        OUTPUT=${metrics_filename} \
        USE_FAST_ALGORITHM=true \
        READ_LENGTH=150

where ref_fasta and wgs_coverage_interval_list are files taken from the Google Cloud paths gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta and gs://broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list

I got pretty abysmal results from the above command, with a mean coverage of 0.01746. This is largely because a huge proportion of reads are being filtered out because of poor mapping quality (63%) and duplication (34%).

I then ran the same command, without the INTERVALS=${wgs_coverage_interval_list} line, and my results dramatically improved to a mean coverage of 17.97; the poor mapping quality and duplication rates shrank considerably.

Now, this seems to suggest that the majority of my reads map with high quality to crappy locations between the intervals, but I highly doubt that is the case. Moreover, the extremely low coverage isn't unique to this particular sample; it has happened with every other sample I've tested.

Any reason why there might be such a large discrepancy between these two commands?

For reference, I attached the summary metrics and logs for both commands.

Thank you!


GATK4 haplotypecaller doesn't genotype multiple bams


In GATK3, using multiple BAMs with the -I parameter would print multiple columns of genotypes. This doesn't seem to happen with GATK4. Here is the command:

    gatk --java-options "-Xmx60G" HaplotypeCaller \
    -I $BAM1 \
    -I $BAM2 \
     --genotyping-mode DISCOVERY \
    -R $ref \
    -D $dbsnp \
    -O $out_directory$fname.el.snps.vcf

Here is the output

##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller  --dbsnp /home/exacloud/tempwork/SpellmanLab/heskett/DBSNP_150_uw.vcf --genotyping-mode DISCOVERY --output 4e700.sorted.de-duplicated.recalibrated.el.snps.vcf --input 4e700.sorted.de-duplicated.recalibrated.bam --input 4l700.sorted.de-duplicated.recalibrated.bam --reference /home/exacloud/lustre1/SpellmanLab/heskett/refs/myron_refs/human_g1k_v37.fasta


#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1
1       12783   rs62635284      G       A       251.77  .       AC=2;AF=1.00;AN=2;DB;DP=10;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=25.24;QD=25.18;SOR=4.804     GT:AD:DP:GQ:PL  1/1:0,10:10:30:280,30,0
1       13116   rs62635286      T       G       710.77  .       AC=2;AF=1.00;AN=2;DB;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=28.05;QD=25.36;SOR=0.804     GT:AD:DP:GQ:PL  1/1:0,17:17:51:739,51,0
1       13118   rs62028691      A       G       710.77  .       AC=2;AF=1.00;AN=2;DB;DP=15;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=26.73;QD=28.73;SOR=0.818     GT:AD:DP:GQ:PL  1/1:0,15:15:51:739,51,0

Is this feature not implemented yet, or did the command-line call change?

Thanks!

Is it possible to pass multiple bams to GATK4 HaplotypeCaller?


Because of a variety of preprocessing steps that I have to perform on my input BAM, I have split it by chromosome to make the process faster. Joining these BAMs back together before calling SNVs with HaplotypeCaller is a very time-consuming step, so it would be most convenient to just pass all the BAMs at once without the join. Is it possible to pass multiple BAMs to HaplotypeCaller? I have tried a variety of ways, such as

gatk-launch HaplotypeCaller -R hg19-1.2.0/fasta/genome.fa -I example.20.bam -I example.21.bam -O out.vcf

and the Mutect2 way

gatk-launch HaplotypeCaller -R hg19-1.2.0/fasta/genome.fa -I:20 example.20.bam -I:21 example.21.bam -O out.vcf

Is this even possible?

Thanks very much,

Stephen

How does the picard tool "FixMateInformation" work exactly?


Dear Forum:

Could you help explain exactly how the tool "FixMateInformation" verifies mate-pair information between mates? This document indicates that missing mate-pair information can be "filled in" using Picard tools. What does that mean exactly? Specifically, should every read have its corresponding CIGAR string as well as that of its mate? Is the script simply going through and adding the "missing" mate CIGAR string to each read that doesn't have both?
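
For context, I'm running it roughly like this (a sketch; file names are placeholders, and I noticed the ADD_MATE_CIGAR option, which seems related to my question):

    # Synchronizes mate-pair fields between paired reads; ADD_MATE_CIGAR
    # controls whether the mate's CIGAR (MC tag) is added.
    java -jar picard.jar FixMateInformation \
        INPUT=input.bam \
        OUTPUT=fixed.bam \
        ADD_MATE_CIGAR=true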

Thanks for your help,
Caroline

What is truth? Or, how an accident of nature can illuminate our path


By Yossi Farjoun, Associate Director of computational research methods in the Data Sciences Platform

A note to explain the context of the new paper by Heng Li, myself and others, “New synthetic-diploid benchmark for accurate variant calling evaluation” available as a preprint in bioRxiv.

Developing new tools and algorithms for genome analysis relies heavily on the availability of so-called "truth sets" that are used to evaluate performance (accuracy, sensitivity, etc.). This has long been a sticking point, though recently the situation has improved dramatically with the availability of several public, high-quality truth sets such as Genome In A Bottle from NIST and Platinum Genomes from Illumina. Even these resources, which have been produced through painstaking analysis and curation, are not immune to the lack of "orthogonality" that plagues most available truth sets. Chief among the issues is that the failure modes of Illumina sequencing are usually masked out, so the resulting data are biased towards the easier parts of the genome.

The paper I linked above introduces a new dataset that we developed to be less biased. It is based solely on PacBio sequencing, and thus its error modes are less correlated with Illumina’s error modes. Using this dataset for benchmarking has given us high confidence in the accuracy of our validations and has enabled us to improve our methods with less concern of overfitting.


Truth data (for germline DNA methods) tend to be derived from two sources: synthetic (that is, computer generated), or Illumina (and other) sequencing of a particular sample called NA12878. Both of these sources are deeply flawed and ultimately, not good enough. First, it is virtually impossible to create synthetic data that truly resemble the results of sequencing actual biological tissue, for several reasons: the reference is an approximation and the effects of sample-extraction, library-construction, and sequencing are really hard to model accurately. Regarding our biggest issue with NA12878, we simply love this sample too much! Nearly all of NA12878’s variants are present in our resource files (dbSNP, the training files for VQSR, etc.). When we evaluate our method’s performance on NA12878, we cannot really trust the results since we have been using the answer all along. Furthermore, both the NIST and Platinum Genomes truthsets are each restricted to a subset of the genome that they consider the “confidence region”. This region is defined differently in the two datasets, but in both cases it is dependent on performance of Illumina sequencing of NA12878 (among other things). This has the perverse effect that the results are reflecting performance only in the easier-to-sequence-and-analyze part of the genome, falsely inflating our self-confidence, and giving no blame or credit for performance in the harder regions of the genome.

The “Synthetic-diploid” (or as we affectionately call it, SynDip) is generated from two human cell lines (CHM1 and CHM13, PacBio-sequenced and assembled by others) that were derived from Complete Hydatidiform Moles. This rare and devastating condition results in a non-viable collection of cells that is almost entirely homozygous. The homozygosity implies that PacBio sequencing is much more trustworthy as there are no heterozygous sites that tend to confuse the assembly: any confusion is almost certainly due to sequencing error and can therefore be masked out. To make use of this, we aligned the CHM1 and CHM13 assemblies to the hg38 reference, and created a VCF and a confidence region that characterize the variation that a 50-50 mixture of the two cell lines would have. At the same time, we also sequenced and aligned such a 50-50 mixture using our WEx and WGS protocols on Illumina. So to be clear, in that regard, the name is misleading. The only “synthetic” part about SynDip is that it’s synthetically diploid, but in all other aspects it’s as natural as can be, since it was generated from live cells using regular sequencing protocols.

Since the CHM dataset was generated using PacBio data alone, with no consideration for the flaws of Illumina’s short-read technology, there should be less correlation between the failure modes of our methods on the short-read data and SynDip’s confidence regions. This allows us to have better, more trustworthy truth-data. It enables us to remove much uncertainty, defusing our natural tendency to “look under the lamp” and to overfit our methods.

And beyond that, it empowers us to push our method development further by exposing large tracts of the reference where our methods (and not only ours!) do not perform well -- and provides us with a more truthful picture of what lies in those regions. Here are the main ways we have used this resource to that end:

  • We have used the insights gained from applying our filtering methods on the SynDip data, which reveal the flaws in their performance, to design better filtering architectures and fine-tune existing ones. (More on this in a future post….)
  • We have used the dataset to assess new variant calling methods for CNVs and SVs.
  • We have used it to compare different analysis pipelines and determine whether there’s a significant difference between them (e.g. What is the effect of running BQSR over and over again? Answer: Not much beyond the first run.)
  • We are currently using it to develop the next version of our joint-calling pipeline which will be able to joint call more than 100K genomes (!!!)

One thing that the current CHM dataset doesn’t help us do is develop better lab methods. This is because the CHM cell lines are not currently commercially available and thus the technology companies cannot test their new protocols and technologies on it. Hopefully, this will eventually be made possible and could enable us to explore hard-to-sequence regions of the genome.

If you are a method developer or you are in a position to evaluate the performance of various pipelines, we encourage you to check out the CHM dataset, and we hope it will help you develop new methods and pipelines! In the future we plan to share more data from the CHM cell lines and make the methods we use for evaluating our methods and data publicly available.

GenotypeGVCFs WARN Track variant doesn't have a sequence dictionary built in

Hi Team,
I'm getting `WARN  21:19:30,478 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation` when processing gzipped g.vcf files produced by HaplotypeCaller (via -o foo.g.vcf.gz, as suggested by @Geraldine_VdAuwera in blog post 3893) with GenotypeGVCFs.
This results in dramatic increases in run time (which makes sense if GenotypeGVCFs has to uncompress the files) and memory requirements (why??) for GenotypeGVCFs compared to processing the GVCFs for the same BAM files when the HC output files are unzipped. Most batches that previously completed with 4x8 GB RAM now produce `java.lang.OutOfMemoryError: Java heap space` errors even with 4x64 GB!

Could you please advise whether this warning is expected behaviour? If yes, what exactly is missing (I can't see much difference between the unzipped and gzipped VCF headers), and can it be added somehow?

SelectVariants V4 TribbleException Contig chr1 does not have a length field


I indexed my VCF file with GATK V4.0.6.0 IndexFeatureFile, then ran GATK V4.0.6.0 SelectVariants on it, and I got an exception:

htsjdk.tribble.TribbleException: Contig chr1 does not have a length field.

When I run the same VCF using GATK V3 SelectVariants, it works.

As far as I know, ##contig entries in the VCF header should NOT have a length in them.
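
If it turns out the lengths are in fact required by V4, a workaround I'm considering (a sketch, assuming Picard can copy contig lengths in from the reference sequence dictionary; file names are placeholders):

    java -jar picard.jar UpdateVcfSequenceDictionary \
        INPUT=variants.vcf \
        OUTPUT=variants.withdict.vcf \
        SEQUENCE_DICTIONARY=reference.dict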

Required R packages to run AnalyzeCovariates


I was trying to run AnalyzeCovariates, but I keep getting error messages saying:

Stderr: Error in library("ggplot2") : there is no package called ‘ggplot2’

Since I don't have sudo access to the server I'm working on, I had to repeatedly ask the administrator to install the R packages for me. Can you please add a full list of the R packages required to run AnalyzeCovariates to the documentation?
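
For anyone else stuck without sudo, this is roughly what I ended up running (a sketch; the package list is pieced together from forum posts and the plotting script's library() calls, so treat it as a best guess rather than an official list):

    # Installs into the per-user R library, so no sudo is needed.
    Rscript -e 'dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE); install.packages(c("ggplot2", "gplots", "reshape", "gsalib"), lib = Sys.getenv("R_LIBS_USER"), repos = "https://cloud.r-project.org")'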

Cheers,
James


The allele with index is not defined in the REF/ALT columns in the record : CombineVariants


Hello, I am having an issue with combining VCFs. I am using GATK 3.8-1 for the CombineVariants step that produces the error.

I have a VCF containing SNPs and INDELs. I first split the VCF using GATK 4.0.5.1. This step does not produce an error, and I am able to use bgzip and tabix on the resulting VCFs without error.

/home/dantakli/bin/gatk-4.0.5.1/gatk SplitVcfs --INPUT $1 --SNP_OUTPUT $out\.snps.vcf --INDEL_OUTPUT $out\.indels.vcf --STRICT=false

My next step is to combine the SNPs from the previous command with another SNP VCF file containing different samples (set1). At this combine step, I get the allele index error.

Here's the trace; set2 is the VCF that was split above and is the file that produces the error.

INFO  11:48:34,113 HelpFormatter - ------------------------------------------------------------------------------------
INFO  11:48:34,119 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50
INFO  11:48:34,119 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  11:48:34,119 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  11:48:34,119 HelpFormatter - [Sun Jul 15 11:48:34 PDT 2018] Executing on Linux 2.6.32-696.10.3.el6.x86_64 amd64
INFO  11:48:34,119 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02
INFO  11:48:34,122 HelpFormatter - Program Args: -T CombineVariants -R /reference/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr5:1-10000000 --genotypemergeoption UNIQUIFY --variant:set1,vcf set1.snps.hg38.chr5.vcf.gz --variant:set2,vcf set2.chr5.snps.vcf.gz -o /set1.set2.chr5.1-10000000.snps.vcf

...

##### ERROR MESSAGE: The allele with index 107548038 is not defined in the REF/ALT columns in the record

I ran ValidateVariants on the set2 SNP file and got the same error.

INFO  11:14:16,261 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-1-0-gf15c1c3ef, Compiled 2018/02/19 05:43:50
...
INFO  11:14:16,262 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02
INFO  11:14:16,265 HelpFormatter - Program Args: -T ValidateVariants -L chr5:1-10000000 -R GRCh38_full_analysis_set_plus_decoy_hla.fa -V set2.chr5.snps.vcf.gz
...

INFO  11:14:16,301 NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/dantakli/bin/GenomeAnalysisTK-3.8-1/GenomeAnalysisTK.jar!/com/intel/gkl/native/libgkl_compression.so
INFO  11:14:16,313 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO  11:14:16,313 GenomeAnalysisEngine - Inflater: IntelInflater
INFO  11:14:16,313 GenomeAnalysisEngine - Strictness is SILENT
INFO  11:14:17,634 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  11:14:18,822 IntervalUtils - Processing 10000000 bp from intervals
WARN  11:14:18,822 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  11:14:18,890 GenomeAnalysisEngine - Preparing for traversal
....
##### ERROR MESSAGE: File set2.chr5.snps.vcf.gz fails strict validation: The allele with index 107548038 is not defined in the REF/ALT columns in the record
##### ERROR ------------------------------------------------------------------------------------------

I get the same error with the SNP+INDEL vcf (before splitting) too

##### ERROR MESSAGE: File set2.chr5.vcf fails strict validation: The allele with index 107548038 is not defined in the REF/ALT columns in the record

I don't get this error when splitting the VCF into SNPs and INDELs. So why am I getting it when I combine the variants?

Thanks.

RNAseq short variant discovery (SNPs + Indels)


Purpose

Identify short variants (SNPs and Indels) in RNAseq data.




Reference Implementations

Pipeline:   RNAseq short variant per-sample calling
Summary:    BAM to VCF
Notes:      universal (expected)
Github:     :)
FireCloud:  TBD

Expected input

This workflow is designed to operate on a set of samples one sample at a time; joint calling RNAseq is not supported.


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.

GATK haplotypecaller crashes randomly


Hi GATK team,

I am having a hard time getting GATK HaplotypeCaller to run. I tried both GATK3 and GATK4, but both crash near the end of execution. The problem is that the crash appears randomly: since I was distributing the jobs among multiple servers and multiple cores, some jobs completed successfully but some crashed. Below is a typical error message:

03:11:24.083 INFO  ProgressMeter -          1:307040580            180.7               1076383           5955.6
03:11:24.084 INFO  ProgressMeter - Traversal complete. Processed 1076383 total regions in 180.7 minutes.
03:11:24.374 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 5.836160996
03:11:24.375 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 7187.316120899
03:11:24.375 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 1707.84 sec
03:11:24.375 INFO  HaplotypeCaller - Shutting down engine
[July 16, 2018 3:11:24 AM CDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 180.78 minutes.
Runtime.totalMemory()=5114953728
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007fff2bbfac40, pid=26222, tid=0x00007ffff7dcf700
#
# JRE version: OpenJDK Runtime Environment (8.0_121-b15) (build 1.8.0_121-b15)
# Java VM: OpenJDK 64-Bit Server VM (25.121-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
#
[error occurred during error reporting (printing problematic frame), id 0x7]

# Core dump written. Default location: /panfs/roc/scratch/zhoux379/biomap/core or core.26222
#
# An error report file with more information is saved as:
# /panfs/roc/scratch/zhoux379/biomap/hs_err_pid26222.log
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/
#
Using GATK jar /panfs/roc/groups/15/springer/zhoux379/software/miniconda3/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10G -Djava.io.tmpdir=/scratch.global/zhoux379/temp -jar /panfs/roc/groups/15/springer/zhoux379/software/miniconda3/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar HaplotypeCaller -R /home/springer/zhoux379/data/genome/B73/21_gatk/maize.fasta -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation -L 1 -I 22_bam/bm227.bam -O 26_gatk/bm227/1.g.vcf.gz

And here is a job that successfully completes:

03:24:47.832 INFO  ProgressMeter -          1:307040580            194.1               1069913           5511.6
03:24:47.832 INFO  ProgressMeter - Traversal complete. Processed 1069913 total regions in 194.1 minutes.
03:24:48.227 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 4.24577457
03:24:48.228 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 8237.175499536
03:24:48.228 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 1517.82 sec
03:24:48.228 INFO  HaplotypeCaller - Shutting down engine
[July 16, 2018 3:24:48 AM CDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 194.17 minutes.
Runtime.totalMemory()=5136449536
pure virtual method called
terminate called without an active exception
Using GATK jar /panfs/roc/groups/15/springer/zhoux379/software/miniconda3/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10G -Djava.io.tmpdir=/scratch.global/zhoux379/temp -jar /panfs/roc/groups/15/springer/zhoux379/software/miniconda3/share/gatk4-4.0.6.0-0/gatk-package-4.0.6.0-local.jar HaplotypeCaller -R /home/springer/zhoux379/data/genome/B73/21_gatk/maize.fasta -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation -L 1 -I 22_bam/bm348.bam -O 26_gatk/bm348/1.g.vcf.gz

I can attach the java error hs_err_pid26222.log (which is pretty long) if necessary.

Any clue would be greatly appreciated!

a question about running HaplotypeCaller with intervals


Hi,

I have a question about running HaplotypeCaller with intervals on exome-seq data.
Here is the command I used:
java -jar gatk-package-4.0.6.0-local.jar HaplotypeCaller -R /espresso/share/genomes/hg38/genome.fa -I recal_reads.bam -O variants.g.vcf -ERC GVCF -L capture.bed

However, when I ran the command, I got the following message:
17:13:14.439 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.0.6.0/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:13:14.591 INFO  HaplotypeCaller - ------------------------------------------------------------
17:13:14.591 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.6.0
17:13:14.591 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:13:14.591 INFO  HaplotypeCaller - Executing as ... on Linux v2.6.32-431.29.2.el6.x86_64 amd64
17:13:14.592 INFO  HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_121-b13
17:13:14.592 INFO  HaplotypeCaller - Start Date/Time: July 16, 2018 5:13:14 PM EDT
17:13:14.592 INFO  HaplotypeCaller - ------------------------------------------------------------
17:13:14.592 INFO  HaplotypeCaller - ------------------------------------------------------------
17:13:14.592 INFO  HaplotypeCaller - HTSJDK Version: 2.16.0
17:13:14.592 INFO  HaplotypeCaller - Picard Version: 2.18.7
17:13:14.592 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:13:14.592 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:13:14.592 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:13:14.592 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:13:14.593 INFO  HaplotypeCaller - Deflater: IntelDeflater
17:13:14.593 INFO  HaplotypeCaller - Inflater: IntelInflater
17:13:14.593 INFO  HaplotypeCaller - GCS max retries/reopens: 20
17:13:14.593 INFO  HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
17:13:14.593 INFO  HaplotypeCaller - Initializing engine
17:13:15.037 INFO  FeatureManager - Using codec BEDCodec to read file file:///capture.bed
17:13:16.883 INFO  IntervalArgumentCollection - Processing 64190747 bp from intervals
17:13:17.009 INFO  HaplotypeCaller - Shutting down engine
[July 16, 2018 5:13:17 PM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2041053184
java.lang.NullPointerException
    at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:325)
    at java.util.ComparableTimSort.sort(ComparableTimSort.java:202)
    at java.util.Arrays.sort(Arrays.java:1312)
    at java.util.Arrays.sort(Arrays.java:1506)
    at java.util.ArrayList.sort(ArrayList.java:1454)
    at java.util.Collections.sort(Collections.java:141)
    at org.broadinstitute.hellbender.utils.IntervalUtils.sortAndMergeIntervals(IntervalUtils.java:459)
    at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:956)
    at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:971)
    at org.broadinstitute.hellbender.engine.MultiIntervalLocalReadShard.<init>(MultiIntervalLocalReadShard.java:59)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.makeReadShards(AssemblyRegionWalker.java:195)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.onStartup(AssemblyRegionWalker.java:175)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:133)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

I did not see any error but it seems HaplotypeCaller did not run and there is no output.
So I will really appreciate it if I can get help from you guys.

Thank you!

Best,
Siyu

help me, GATK4 VQSR Error


How can I handle it?

Error message:
A USER ERROR has occurred: The argument: "resource/resource" does not accept tags: "hapmap,known=false,training=true,truth=true,prior=15.0"

Command:

java -Xmx60g -jar /UUU/chul/wes/tools/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar VariantRecalibrator -R /UUU/chul/wes/hg19/ucsc.hg19.fasta -input new.vcf -input new.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /UUU/chul/wes/hg19/hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /UUU/chul/wes/hg19/1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /UUU/chul/wes/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /UUU/chul/wes/hg19/dbsnp_138.hg19.vcf -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -O t4.recalibrate_SNP.recal --tranches-file t4.recalibrate_SNP.tranches --rscript-file t4.recalibrate_SNP_plots.R
