Channel: Recent Discussions — GATK-Forum

Questions about GenotypeVCFs output


Hi,

I generated 12 .g.vcf files with HaplotypeCaller in GVCF mode and then a VCF with GenotypeGVCFs. What's the easiest way to split this VCF per sample? Should I apply hard filtering first and then split the VCF per sample? (These VCFs are normal exomes; I would like to use them afterwards with Mutect2.)

Moreover, lines of the VCF file are annotated differently: BaseQRankSum, ClippingRankSum, ExcessHet, MQRankSum and ReadPosRankSum appear on some lines but not on all of them (I get the same when I create a VCF with HaplotypeCaller without GVCF mode). Do you know why? Is it possible to change this? Will it be a problem at the hard-filtering step when I handle variants with SelectVariants and VariantFiltration?

Thanks a lot!


VariantFiltration | HaplotypeCaller - ignoring variants close (5 bp) to the 3′ and 5′ ends


Hi,

I am currently working with data from the HaloPlex Target Enrichment System. HaloPlex uses restriction enzymes to digest the DNA, thus producing non-random reads that often have false mutations at the 3′ and 5′ ends caused by adapter remnants. The problem with the adapter-remnant mutations has previously been handled using custom scripts as described in Gréen et al. (https://doi.org/10.1016/j.jmoldx.2014.09.006): _First, the cleaned index-sorted paired-end reads were scanned for flanking HaloPlex adapter sequences, ie, 5′-AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC-3′ and 5′-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT-3′. However, the adapter 5′-recognition motif was restricted to 6 to 13 bp depending on the position of the adapter in the read. A perfect match was required in each case, which is simpler and faster compared with the procedure recommended by the HaloPlex development team. The minimal sequence for identification of an adapter at the 3′ end of a read was set to AGATCG. The adapter sequences were removed in the following way: i) five bases were removed from the 3′ end of all reads lacking identified adapter sequence (resulting in approximately 146-bp reads), ii) reads with adapter sequence within 50 bp of the 5′ end were discarded, and iii) reads with flanking adapter sequence in the 3′ end were trimmed by removal of the corresponding number of nucleotides. _

My question: Is there an option in VariantFiltration or HaplotypeCaller that can mask/ignore variants detected within X (e.g. 5) bp of the 3′ and 5′ ends of the reads?

Thank you!

ERROR MESSAGE: Illegal base [ ] seen in the allele when running FastaAlternateReferenceMaker


Hello everyone,
It's me again :wink:
Since I have finally worked out where to apply the parameter -trimAlternates, I now have another question.
When I run FastaAlternateReferenceMaker, I get the error message below:

ERROR --
ERROR stack trace

java.lang.IllegalArgumentException: Illegal base [ ] seen in the allele
at htsjdk.variant.variantcontext.Allele.create(Allele.java:231)
at htsjdk.variant.variantcontext.Allele.create(Allele.java:355)
at org.broadinstitute.gatk.tools.walkers.fasta.FastaAlternateReferenceMaker.map(FastaAlternateReferenceMaker.java:170)
at org.broadinstitute.gatk.tools.walkers.fasta.FastaAlternateReferenceMaker.map(FastaAlternateReferenceMaker.java:96)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Illegal base [ ] seen in the allele
ERROR ------------------------------------------------------------------------------------------

It doesn't seem to be the indels that cause this error, because I tried running the same command line with SNP-only data and got stuck on the same message.
Here is my working steps:

1 java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R Melc_scaffolds.fasta -V variant.vcf -sn MD -trimAlternates -env -nt 4 -o MD.vcf

2 java -Xmx4g -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker -R Melc_scaffolds.fasta -V MD.vcf -o MD.fasta

Actually, what I am trying to do is convert a VCF file into single FASTA files and then use them to construct a phylogenetic tree, so any suggestion for a fast and convenient format conversion is appreciated.
Thanks for your attention.
caiyc

VariantAnnotator using GnomAD gives NullPointerException


Hello,

Running VariantAnnotator, I am running into errors I couldn't find solutions for in the forum. Using the publicly available gnomAD VCFs, I would like to add information to a VCF, specifically frequency tracks (I am testing with a portion of chromosome 1). I am using the following command:

java -jar ~/Downloads/GenomeAnalysisTK38.jar -R ~/build/GRCh37/GRCh37.fa -T VariantAnnotator -V sample.vcf --comp:gnomad,vcf gnomad.vcf --expression gnomad.AF -o output_AF.vcf

The error I get is:

##### ERROR --
##### ERROR stack trace
java.lang.NullPointerException
    at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.initialize(VariantAnnotator.java:284)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

If I remove the --expression argument, it works perfectly fine and I get an output VCF with ";gnomad" added to the FORMAT column wherever the variant was observed in gnomAD (but of course without any other gnomAD-derived annotation). The error occurs with any requested field (GT, DP, AC, AF, lcr, etc.).

I ran ValidateVariants on both VCFs. Both exited with "Done. There were no warn messages."

Thank you very much in advance for any help.
Klaasjan

Understanding and adapting the generic hard-filtering recommendations


This document aims to provide insight into the logic of the generic hard-filtering recommendations that we provide as a substitute for VQSR. Hopefully it will also serve as a guide for adapting these recommendations or developing new filters that are appropriate for datasets that diverge significantly from what we usually work with.


Introduction

Hard-filtering consists of choosing specific thresholds for one or more annotations and throwing out any variants that have annotation values above or below the set thresholds. By annotations, we mean properties or statistics that describe for each variant e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation, and so on.
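As a minimal sketch of the idea (a hypothetical helper, not a GATK tool; the threshold names and values mirror the generic recommendations discussed in this article):

```python
# Hypothetical sketch of hard filtering: fail any variant whose annotation
# value crosses a fixed threshold. Threshold names/values follow the generic
# GATK recommendations for SNPs.
THRESHOLDS = {
    "QD": ("<", 2.0),
    "FS": (">", 60.0),
    "MQ": ("<", 40.0),
}

def fails_hard_filter(annotations):
    """annotations: dict mapping annotation name to its value for one variant."""
    for name, (op, cutoff) in THRESHOLDS.items():
        value = annotations.get(name)
        if value is None:
            continue  # a missing annotation does not trigger the filter
        if (op == "<" and value < cutoff) or (op == ">" and value > cutoff):
            return True
    return False
```

Note how each dimension is tested independently: a variant with a marginal QD is rejected no matter how good its other annotations look, which is exactly the limitation described next.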

The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.

In contrast, VQSR is more powerful because it uses machine-learning algorithms to learn from the data what are the annotation profiles of good variants (true positives) and of bad variants (false positives) in a particular dataset. This empowers you to pull out variants based on how they cluster together along different dimensions, and liberates you to a large extent from the linear tyranny of single-dimension thresholds.

Unfortunately this method requires a large number of variants and well-curated known variant resources. For those of you working with small gene panels or with non-model organisms, this is a deal-breaker, and you have to fall back on hard-filtering.


Outline

In this article, we illustrate how the generic hard-filtering recommendations we provide relate to the distribution of annotation values we typically see in callsets produced by our variant calling tools, and how this in turn relates to the underlying physical properties of the sequence data.

We also use results from VQSR filtering (which we take as ground truth in this context) to highlight the limitations of hard-filtering.

We do this in turn for each of six annotations that are highly informative among the recommended annotations: QD, FS, SOR, MQ, MQRankSum and ReadPosRankSum. The same principles can be applied to most other annotations produced by GATK tools.


Overview of data and methods

Origin of the dataset

We called variants on a whole genome trio (samples NA12878, NA12891, NA12892, previously pre-processed) using HaplotypeCaller in GVCF mode, yielding a gVCF file for each sample. We then joint-genotyped the gVCFs using GenotypeGVCFs, yielding an unfiltered VCF callset for the trio. Finally, we ran VQSR on the trio VCF, yielding the filtered callset. We will be looking at the SNPs only.

Plotting methods and interpretation notes

All plots shown below are density plots generated using the ggplot2 library in R. On the x-axis are the annotation values, and on the y-axis are the density values. The area under a region of the density plot gives the probability of observing an annotation value in that range, so the entire area under each curve is equal to 1. For example, if you would like to know the probability of observing an annotation value between 0 and 1, you take the area under the curve between 0 and 1.

In plain English, this means that the plots show you, for a given set of variants, the distribution of their annotation values. The caveat is that when we're comparing two or more sets of variants on the same plot, we have to keep in mind that they may contain very different numbers of variants, so the number of variants in a given part of the distribution is not directly comparable; only their proportions are comparable.


QualByDepth (QD)

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.
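In spirit, the computation is simply the following (a simplified sketch; GATK's actual implementation handles additional edge cases):

```python
def quality_by_depth(qual, sample_depths):
    """QD sketch: QUAL divided by the summed unfiltered depth of the
    non-hom-ref samples. sample_depths lists the depth of each such sample."""
    total_depth = sum(sample_depths)
    return qual / total_depth

# A QUAL of 250 spread over 25 supporting reads gives QD = 10.0,
# whereas the same QUAL at 250x depth would give only QD = 1.0.
```

This is why QD is a better filtering metric than raw QUAL: a modest QUAL achieved at enormous depth is far less convincing than the same QUAL at normal depth.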

The generic filtering recommendation for QD is to filter out variants with QD below 2. Why is that?

First, let’s look at the QD values distribution for unfiltered variants. Notice the values can be anywhere from 0 to 40. There are two peaks where the majority of variants are (around QD = 12 and QD = 32). These two peaks correspond to variants that are mostly observed in heterozygous (het) versus mostly homozygous-variant (hom-var) states, respectively, in the called samples. This is because hom-var samples contribute twice as many reads supporting the variant as do het samples. We also see, to the left of the distribution, a "shoulder" of variants with QD hovering between 0 and 5.

image

We expect to see a similar distribution profile in callsets generated from most types of high-throughput sequencing data, although values where the peaks form may vary.

Now, let’s look at the plot of QD values for variants that passed VQSR and those that failed VQSR. Red indicates the variants that failed VQSR, and blue (green?) the variants that passed VQSR.

image

We see that the majority of variants filtered out correspond to that low-QD "shoulder" (remember that since this is a density plot, the y-axis indicates proportion, not number of variants); that is what we would filter out with the generic recommendation of the threshold value 2 for QD.

Notice however that VQSR has failed some variants that have a QD greater than 30! All those variants would have passed the hard filter threshold, but VQSR tells us that these variants looked artifactual in one or more other annotation dimensions. Conversely, although it is not obvious in the figure, we know that VQSR has passed some variants that have a QD less than 2, which hard filters would have eliminated from our callset.


FisherStrand (FS)

This is the Phred-scaled probability that there is strand bias at the site. Strand bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.

Note: SB, SOR and FS are related but not the same! They all measure strand bias (a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other) in different ways. SB gives the raw counts of reads supporting each allele on the forward and reverse strand. FS is the result of using those counts in a Fisher's Exact Test. SOR is a related annotation that applies a different statistical test (using the SB counts) that is better for high coverage data.
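To make the relationship between SB and FS concrete, here is a rough re-implementation (not GATK's code) that Phred-scales a two-sided Fisher's exact test on the per-strand allele counts:

```python
from math import comb, log10

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    n, r1, c1 = a + b + c + d, a + b, a + c
    def prob(x):
        # hypergeometric probability of a table with the same fixed margins
        return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)
    p_obs = prob(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    # sum the probabilities of all tables at least as extreme as the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def fisher_strand(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """FS sketch: Phred-scaled p-value; ~0 means no detectable strand bias."""
    return -10 * log10(fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev))
```

With perfectly balanced counts, fisher_strand(10, 10, 10, 10) is 0; with the reference allele only on the forward strand and the alternate only on the reverse, fisher_strand(20, 0, 0, 20) comfortably exceeds the 60.0 hard-filter cutoff discussed below.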

Let’s look at the FS values for the unfiltered variants. The FS values have a very wide range; we made the x-axis log-scaled so the distribution is easier to see. Notice most variants have an FS value less than 10, and almost all variants have an FS value less than 100. However, there are indeed some variants with a value close to 400.

image

The plot below shows FS values for variants that passed VQSR and failed VQSR.

image

Notice most of the variants that fail have an FS value greater than 55. Our hard filtering recommendations tell us to fail variants with an FS value greater than 60. Notice that although we are able to remove many false positives by removing variants with FS greater than 60, we still keep many false positive variants. If we move the threshold to a lower value, we risk losing true positive variants.


StrandOddsRatio (SOR)

This is another way to estimate strand bias, using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Positions at the ends of exons tend to be covered by reads in only one direction, and FS gives those variants a bad score. SOR takes into account the ratios of reads that cover both alleles.

Let’s look at the SOR values for the unfiltered variants. The SOR values range from 0 to greater than 9. Notice most variants have an SOR value less than 3, and almost all variants have an SOR value less than 9. However, there is a long tail of variants with a value greater than 9.

image

The plot below shows SOR values for variants that passed VQSR and failed VQSR.

image

Notice most of the variants that have an SOR value greater than 3 fail the VQSR filter. Although there is a non-negligible population of variants with an SOR value less than 3 that failed VQSR, our hard filtering recommendation of failing variants with an SOR value greater than 3 will at least remove the long tail of variants that show fairly clear bias according to the SOR test.


RMSMappingQuality (MQ)

This is the root mean square mapping quality over all the reads at the site. Instead of the average mapping quality of the site, this annotation gives the square root of the average of the squares of the mapping qualities at the site. It is meant to incorporate the standard deviation of the mapping qualities, which captures the variation in the dataset: a low standard deviation means the values are all close to the mean, whereas a high standard deviation means the values are far from the mean. When the mapping qualities are good at a site, the MQ will be around 60.
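The calculation itself is simple (a sketch, not GATK's code):

```python
from math import sqrt

def rms_mapping_quality(mapping_qualities):
    """Root mean square: the square root of the mean of the squared values."""
    return sqrt(sum(q * q for q in mapping_qualities) / len(mapping_qualities))
```

Note how spread pulls the RMS away from the plain mean: ten reads all at MQ 60 give exactly 60, while reads split between MQ 40 and MQ 60 give about 51 rather than the arithmetic mean of 50, because squaring weights the larger values more heavily.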

Now let’s check out the graph of MQ values for the unfiltered variants. Notice the very large peak around MQ = 60. Our recommendation is to fail any variant with an MQ value less than 40.0. You may argue that hard filtering any variant with an MQ value less than 50 is fine as well. This brings up an excellent point that our hard filtering recommendations are meant to be very lenient. We prefer to keep all potentially decent variants rather than get rid of a few bad variants.

image

Let’s look at the VQSR pass vs fail variants. At first glance, it seems like VQSR has passed the variants in the high peak and failed any variants not in the peak.

image

It is hard to tell which variants passed and failed, so let’s zoom in and see what exactly is happening.

image

The plot above shows the x-axis from 59-61. Notice the variants in blue (the ones that passed) all have MQ around 60. However, some variants in red (the ones that failed) also have an MQ around 60.


MappingQualityRankSumTest (MQRankSum)

This is the u-based z-approximation from the Rank Sum Test for mapping qualities. It compares the mapping qualities of the reads supporting the reference allele and the alternate allele. A positive value means the mapping qualities of the reads supporting the alternate allele are higher than those supporting the reference allele; a negative value indicates the mapping qualities of the reference allele are higher than those supporting the alternate allele. A value close to zero is best and indicates little difference between the mapping qualities.
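A simplified version of the underlying statistic (average ranks for ties; GATK's implementation differs in details such as tie-variance and continuity corrections):

```python
def rank_sum_z(alt_values, ref_values):
    """Mann-Whitney U z-approximation: positive when alt_values rank higher."""
    data = sorted([(v, "alt") for v in alt_values] + [(v, "ref") for v in ref_values])
    ranks = [0.0] * len(data)
    i = 0
    while i < len(data):
        j = i
        while j + 1 < len(data) and data[j + 1][0] == data[i][0]:
            j += 1  # extend over a run of tied values
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # assign the average rank to the tied run
        i = j + 1
    r_alt = sum(r for r, (_, grp) in zip(ranks, data) if grp == "alt")
    n1, n2 = len(alt_values), len(ref_values)
    u = r_alt - n1 * (n1 + 1) / 2          # U statistic for the alt group
    mu = n1 * n2 / 2                        # mean of U under the null
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5  # std dev of U under the null
    return (u - mu) / sigma
```

Identical distributions yield z ≈ 0; alt-supporting reads with systematically lower mapping qualities than ref-supporting reads yield a negative z, matching the sign convention described above.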

Next, let’s look at the distribution of values for MQRankSum in the unfiltered variants. Notice the values range from approximately -10.5 to 6.5. Our hard filter threshold is -12.5. There are no variants in this dataset that have MQRankSum less than -10.5! In this case, hard filtering would not fail any variants based on MQRankSum. Remember, our hard filtering recommendations are meant to be very lenient. If you do plot your annotation values for your samples and find none of your variants have MQRankSum less than -12.5, you may want to refine your hard filters. Our recommendations are indeed recommendations that you the scientist will want to refine yourself.

image

Looking at the plot of pass VQSR vs fail VQSR variants, we see the variants with an MQRankSum value less than -2.5 fail VQSR. However, the region between -2.5 to 2.5 contains both pass and fail variants. Are you noticing a trend here? It is very difficult to pick a threshold for hard filtering. If we pick -2.5 as our hard filtering threshold, we still have many variants that fail VQSR in our dataset. If we try to get rid of those variants, we will lose some good variants as well. It is up to you to decide how many false positives you would like to remove from your dataset vs how many true positives you would like to keep and adjust your threshold based on that.

image


ReadPosRankSumTest (ReadPosRankSum)

This is the u-based z-approximation from the Rank Sum Test for site position within reads. It compares whether the positions of the reference and alternate alleles are different within the reads. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele; a positive value indicates that the reference allele is found at the ends of reads more often than the alternate allele. A value close to zero is best because it indicates there is little difference between the positions of the reference and alternate alleles in the reads.

The last annotation we will look at is ReadPosRankSum. Notice the values fall mostly between -4 and 4. Our hard filtering threshold removes any variant with a ReadPosRankSum value less than -8.0. Again, there are no variants in this dataset that have a ReadPosRankSum value less than -8.0, but some datasets might. If you plot your variant annotations and find there are no variants that have a value less than or greater than one of our recommended cutoffs, you will have to refine them yourself based on your annotation plots.

image

Looking at the VQSR pass vs fail variants, we can see VQSR has failed variants with ReadPosRankSum values less than -1.0 and greater than 3.5. However, notice VQSR has also failed some variants whose values fall within that passing range.

image

Are there any plans to add multi-interval support to GenomicsDBImport?


The reason I ask is that it's rather annoying when you're chunking your input data and one of your chunks crosses a chromosome boundary. It seems, according to the GitHub docs, that GenomicsDB supports this with vcf2tiledb, but I'm not sure whether it will then work with GenotypeGVCFs?

Differences between GATK 4.beta.5 vs 4.0.0.0 HaplotypeCaller results


Hi!
I'd like to perform short germline variant calling on human DNA-seq samples (separate analysis of WES cohort and PCR-free WGS cohort, both paired end). The plan is to follow GATK best practices of short variant discovery with joint genotyping, starting with single-sample gVCF creation through HaplotypeCaller.

I have analyzed some samples with HaplotypeCaller 4.beta.5, and was unsure whether there had been any fixes between 4.beta.5 and 4.0.0.0 that would necessitate re-running the samples.

To check, I ran the 4.beta.5 and 4.0.0.0 HaplotypeCaller on chr21 of 1000 Genomes sample NA11992.

I ran the same command, on the same machine, in different conda environments:

4.beta.5

gatk4                     4.0b5                    py27_0    bioconda
picard                    2.16.0                   py27_0    bioconda
setuptools                38.2.4                   py27_0    conda-forge
wheel                     0.30.0                     py_1    conda-forge

4.0.0.0

gatk4                     4.0.0.0                  py27_0    bioconda
picard                    2.17.2                   py27_0    bioconda
setuptools                38.4.0                   py27_0    conda-forge
wheel                     0.30.0                   py27_2    conda-forge

Java

$ java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (Zulu 8.20.0.5-linux64) (build 1.8.0_121-b15)
OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-linux64) (build 25.121-b15, mixed mode)

GATK command

gatk-launch HaplotypeCaller -R $refs/GRCh37.71.nochr.fa -I $data/NA11992.mapped.ILLUMINA.bwa.CEU.exome.20130415.bam -O ${version}.NA11992.21.default.vcf.gz -ERC GVCF -L 21

Results showed 9 differences using diff (10 including the different ##GATKCommandLine in the gVCF header). Sometimes PL or SB fields change, sometimes non-variant blocks are subdivided differently, and sometimes indel calls change. Three differences are below:

$diff 4.0.0.0.NA11992.21.default.vcf 4.beta.5.NA11992.21.default.vcf
...
36482c36480
< 21    19701769        .       AT      A,<NON_REF>     33.73   .       BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PGT:PID:PL:SB
       0/1:2,3,0:5:66:0|1:19701769_AT_A:71,0,66,77,75,152:0,2,1,2
---
> 21    19701769        .       AT      A,<NON_REF>     33.73   .       BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PL:SB       0/1:2,3,0:5:66:71,0,66,77,76,152:0,2,1,2

36485,36486c36483,36485
< 21    19701776        .       T       <NON_REF>       .       .       END=19701778    GT:DP:GQ:MIN_DP:PL      0/0:6:12:6:0,12,180
< 21    19701779        .       TG      T,<NON_REF>     19.78   .       BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PGT:PID:PL:SB
       0/1:4,2,0:6:57:1|0:19701769_AT_A:57,0,146,69,152,221:1,3,0,2
---
> 21    19701776        .       T       <NON_REF>       .       .       END=19701777    GT:DP:GQ:MIN_DP:PL      0/0:6:12:6:0,12,180
> 21    19701778        .       TTG     T,<NON_REF>     0       .       DP=6;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=19369.00 GT:AD:DP:GQ:PL:SB       0/0:6,0,0:6:18:0,18,203,18,203,203:1,5,0,0
> 21    19701779        .       TG      T,<NON_REF>     3.96    .       BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PL:SB       0/1:4,2,0:6:39:39,0,146,51,152,204:1,3,0,2

54434c54433,54435
< 21    26959938        .       A       <NON_REF>       .       .       END=26959949    GT:DP:GQ:MIN_DP:PL      0/0:9:24:8:0,24,260
---
> 21    26959938        .       A       <NON_REF>       .       .       END=26959943    GT:DP:GQ:MIN_DP:PL      0/0:8:24:8:0,24,260
> 21    26959944        .       C       <NON_REF>       .       .       END=26959944    GT:DP:GQ:MIN_DP:PL      0/0:9:27:9:0,27,275
> 21    26959945        .       A       <NON_REF>       .       .       END=26959949    GT:DP:GQ:MIN_DP:PL      0/0:9:24:9:0,24,360
...

Overall, can I use 4.beta.5 HaplotypeCaller results in downstream analysis, or should results be re-analyzed with 4.0.0.0 HaplotypeCaller? This analysis showed relatively few differences, but I'm still unsure about 4.beta.5 HaplotypeCaller.

Base Quality Score Recalibration (BQSR)


BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to anyone, especially Spanish-speaking users, who get tripped up by the similarity of these names.


Contents

  1. Overview
  2. Base recalibration procedure details
  3. Important factors for successful recalibration
  4. Examples of pre- and post-recalibration metrics
  5. Recalibration report

1. Overview

It's all about the base, 'bout the base (quality scores)

Base quality scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion base calls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases -- which is a lot of bad bases. The quality score each base call gets is determined through some dark magic jealously guarded by the manufacturer of the sequencing machines.
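The Phred arithmetic in that example, spelled out (the standard formula, nothing GATK-specific):

```python
def phred_to_error_prob(q):
    """Phred scale: Q = -10 * log10(p_error), so p_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Q20 -> 1-in-100 chance the call is wrong; over ~90 billion base calls
# in a 30x genome, that is roughly 900 million expected wrong calls.
expected_errors = phred_to_error_prob(20) * 90e9
```

By the same arithmetic, Q30 means a 1-in-1000 error rate, which is why recalibrated Q30 bases (see the end of this article) are considered highly trustworthy.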

Why does it matter? Because our short variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a base call that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally our impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

This procedure can be applied to BAM files containing data from any sequencing platform that outputs base quality scores on the expected scale. We have run it ourselves on data from several generations of Illumina, SOLiD, 454, Complete Genomics, and Pacific Biosciences sequencers.

That sounds great! How does it work?

The base recalibration process involves two key steps: first the BaseRecalibrator tool builds a model of covariation based on the input data and a set of known variants, producing a recalibration file; then the ApplyBQSR tool adjusts the base quality scores in the data based on the model, producing a new BAM file. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.


2. Base recalibration procedure details

BaseRecalibrator builds the model

To build the recalibration model, this first tool goes through all of the reads in the input BAM file and tabulates data about the following features of the bases:

  • read group the read belongs to
  • quality score reported by the machine
  • machine cycle producing this base (Nth cycle = Nth base from the start of the read)
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to the known variants resource (typically dbSNP). This information is output to a recalibration file in GATKReport format.
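As a rough sketch of what this tabulation amounts to (a simplified illustration with a made-up read schema, not GATK's actual implementation):

```python
from collections import defaultdict

def tabulate(reads, reference, known_sites):
    """Count observations and mismatches per covariate bin.
    Each read is assumed to be a dict with 'read_group', 'pos',
    'bases', and 'quals' fields; this schema is illustrative only."""
    # bin key: (read group, reported quality, cycle, dinucleotide context)
    bins = defaultdict(lambda: [0, 0])  # [observations, errors]
    for read in reads:
        for cycle, (base, qual) in enumerate(zip(read['bases'], read['quals'])):
            ref_pos = read['pos'] + cycle
            if ref_pos in known_sites:
                continue  # mask out sites of known variation
            # previous base + current base (just the base itself at cycle 0)
            context = read['bases'][max(0, cycle - 1):cycle + 1]
            key = (read['read_group'], qual, cycle, context)
            bins[key][0] += 1
            if base != reference[ref_pos]:
                bins[key][1] += 1
    return bins
```

A real recalibrator would of course read BAM records and handle indel error events as well; this only shows the mismatch bookkeeping.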

Note that the recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting; the correction has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.
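In Python terms, the corrected empirical quality of a bin looks like this; the second example reproduces the SRR032768 deletion row from the ReadGroup table shown further down (222,475 errors in 2,642,683,174 observations):

```python
import math

def empirical_quality(errors, observations):
    """Phred-scaled empirical quality with the Yates-style
    (+1 error, +2 observations) correction for sparse bins."""
    p = (errors + 1) / (observations + 2)
    return -10 * math.log10(p)

# Sparse bin: 0 mismatches in 0 bases -> p = 1/2 -> Q ~ 3, not infinity
print(round(empirical_quality(0, 0), 2))                 # 3.01

# Well-populated bin (SRR032768 deletions from the report below)
print(round(empirical_quality(222475, 2642683174), 4))   # 40.7476
```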

ApplyBQSR adjusts the scores

This second tool goes through all the reads again, using the recalibration file to adjust each base's score based on which bins it falls in. So effectively the new quality score is:

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effect

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as variant calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
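The additive adjustment described above can be sketched as follows (a simplification; the actual tool derives these deltas from the recalibration tables and applies them per event type):

```python
def recalibrate(reported_q, global_delta, qscore_delta, covariate_deltas):
    """New quality = reported quality
                     + global (read-group) shift
                     + reported-quality-bin shift
                     + per-covariate shifts (e.g. cycle, context).
    The clamp to [1, 93] is an illustrative choice, not GATK's exact bounds."""
    q = reported_q + global_delta + qscore_delta + sum(covariate_deltas)
    return max(1, min(round(q), 93))

# e.g. a reported Q30 base whose read group runs 2 units over-confident and
# whose Q30 bin is a further 1 unit over-confident, with neutral cycle and
# context effects:
print(recalibrate(30, -2.0, -1.0, [0.0, 0.0]))  # 27
```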


3. Important factors for successful recalibration

Read groups

The recalibration system is read-group aware, meaning it uses @RG tags to partition the data by read group. This allows it to perform the recalibration per read group, which reflects which library a read belongs to and what lane it was sequenced in on the flowcell. We know that systematic biases can occur in one lane but not the other, or one library but not the other, so being able to recalibrate within each unit of sequence data makes the modeling process more accurate. As a corollary, that means it's okay to run BQSR on BAM files with multiple read groups. However, please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.

Amount of data

A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. This procedure will not work well on a small number of aligned reads. We usually expect to see more than 100M bases per read group; as a rule of thumb, larger numbers will work better.

No excuses

You should almost always perform recalibration on your sequencing data. In human data, given the exhaustive databases of variation we have available, almost all of the remaining mismatches -- even in cancer -- will be errors, so it's super easy to ascertain an accurate error model for your data, which is essential for downstream analysis. For non-human data it can be a little more work since you may need to bootstrap your own set of variants if no such resources are already available for your organism, but it's worth it.

Here's how you would bootstrap a set of known variants:

  • First do an initial round of variant calling on your original, unrecalibrated data.
  • Then take the variants that you have the highest confidence in and use that set as the database of known variants by feeding it as a VCF file to the BaseRecalibrator.
  • Finally, do a real round of variant calling with the recalibrated data. These steps could be repeated several times until convergence.

The main cases where you really might need to skip BQSR are when you have too little data (some small gene panels have that problem), or you're working with a really weird organism that displays insane amounts of variation.


4. Examples of pre- and post-recalibration metrics

This shows recalibration results from a lane sequenced at the Broad by an Illumina GA-II in February 2010. This is admittedly not very recent but the results are typical of what we still see on some more recent runs, even if the overall quality of sequencing has improved. You can see there is a significant improvement in the accuracy of the base quality scores after applying the recalibration procedure. Note that the plots shown below are not the same as the plots that are produced by the AnalyzeCovariates tool.

[Four before/after recalibration plots comparing reported and empirical base quality scores]


5. Recalibration report

The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSR for this dataset.

#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support for quantizing base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSR, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization. You can override this by using the engine argument -qq: with -qq 0 you don't quantize qualities, and with -qq N you recalculate the quantization bins using N bins.
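Conceptually, quantization chooses N representative scores so that the count-weighted error introduced by remapping each observed quality is small. A naive greedy sketch (GATK's actual algorithm differs):

```python
def quantize(counts, n_levels):
    """Greedily merge adjacent quality bins until n_levels remain.
    counts: dict mapping quality score -> number of bases observed.
    Returns a dict mapping each original quality -> quantized score.
    Each merged group is represented by its count-weighted mean quality."""
    groups = [[q] for q in sorted(counts)]

    def penalty(group):
        total = sum(counts[q] for q in group)
        rep = sum(q * counts[q] for q in group) / total
        return sum(counts[q] * abs(q - rep) for q in group)

    while len(groups) > n_levels:
        # merge the adjacent pair whose merge adds the least weighted error
        best_i = min(range(len(groups) - 1),
                     key=lambda i: penalty(groups[i] + groups[i + 1]))
        groups[best_i:best_i + 2] = [groups[best_i] + groups[best_i + 1]]

    mapping = {}
    for group in groups:
        total = sum(counts[q] for q in group)
        rep = round(sum(q * counts[q] for q in group) / total)
        for q in group:
            mapping[q] = rep
    return mapping
```

For example, with most bases at Q2 and Q30 and a few at Q3 and Q4, the sparse qualities get folded into the nearest heavy bin, much as Q3 and Q4 map to 9 in the example table below.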

#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions, and deletions.

#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions, and deletions.

#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.

#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

Picard LiftoverVcf: contig not part of the target reference

Dear GATK team,

I am trying to liftover a vcf file from hg19 to hg38, by running the command
java -jar ~/tools/picard-2.1.0/dist/picard.jar LiftoverVcf I=input.chr22.vcf O=hg38.chr22.vcf CHAIN=hg19ToHg38.over.chain REJECT=liftover_rejected.chr22.vcf R=chr22.fa

Since I'm working on one chromosome only, my vcf file has only "chr22" in the CHROM field. chr22.fa, the reference genome in hg38, starts with ">chr22" on the first line. I also generated the .dict file for it using Picard tools. chr22.dict file looks like:
@HD VN:1.5 SO:unsorted
@SQ SN:chr22 LN:50818468 M5:221733a2a15e2de66d33e73d126c5109 UR:file:/my/directory/chr22.fa

However, after a few seconds I always get the following error message:
[Thu Feb 18 15:53:08 GMT 2016] Executing as me@myhost on Linux 3.2.0-75-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02; Picard version: 2.1.0() JdkDeflater
INFO 2016-02-18 15:53:09 LiftoverVcf Loading up the target reference genome.
INFO 2016-02-18 15:53:11 LiftoverVcf Lifting variants over and sorting.
ERROR 2016-02-18 15:53:11 LiftoverVcf Encountered a contig, chr22 that is not part of the target reference.

Could you suggest how to fix this? Thank you!

Best,
Ruoyun

libVectorLoglessPairHMM is not present in GATK 3.8 - HaplotypeCaller is slower than 3.4-46!

We are running GATK on a multi-core Intel Xeon that does not have AVX. We have just upgraded from running 3.4-46 to running 3.8, and HaplotypeCaller runs much more slowly. I noticed that our logs used to say:

Using SSE4.1 accelerated implementation of PairHMM
INFO 06:18:09,932 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 06:18:09,933 VectorLoglessPairHMM - Using vectorized implementation of PairHMM

But now they say:

WARN 07:10:21,304 PairHMMLikelihoodCalculationEngine$1 - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
WARN 07:10:21,310 PairHMMLikelihoodCalculationEngine$1 - AVX-accelerated native PairHMM implementation is not supported. Falling back to slower LOGLESS_CACHING implementation

I'm guessing the newfangled Intel GKL isn't working so well for us. Note that I had a very similar problem with GATK 3.4-0, in http://gatk.vanillaforums.com/entry/passwordreset/21436/OrxbD0I4oRDaj8y1hDSE and this was resolved in GATK 3.4-46.

MuTect2 and samtools mpileup read info seem to be very different?

Hi - I used MuTect2 to call variants in multiple samples from one patient. However, I wanted read information for those samples where a mutation wasn't detected in all the samples, and decided to use samtools mpileup at these sites. I noticed that a variant in a germline sample was reported as 24:0 ref:alt (MuTect2) and 104:25 (samtools). In one case I'd call it a somatic mutation, while it would be a germline mutation in the other case. Why do we see this difference? Is there a way to make MuTect2 output read info when it detects a mutation in one of the samples from a patient?

A logical problem with SplitCommonSuffices and MergeCommonSuffices

@Sheila @valentin @depristo
For example:
A+x -> y (A+x and y are vertices)
B+x -> y
After SplitCommonSuffices:
A -> x -> y (A, B, x, y are vertices)
B -> x -> y
After MergeCommonSuffices:
A -> x => A+x
B -> x => B+x
Then SplitCommonSuffices again, then MergeCommonSuffices again, and so on.
Of course, sometimes after SplitCommonSuffices the SeqVertex id is bigger than before, but this may occur in unintended circumstances.

Error while running BaseRecalibrator

Hello, I was running the BQSR on the RNAseq reads. This is the command i typed:

java -jar /usr/local/bin/GenomeAnalysisTK.jar -T BaseRecalibrator -R ../../../genome/hg38/hg38.fa -I a_split.bam -L 22 -knownSites ../../../genome/vcf/All_20170710.vcf -o a_sorted.bam_dedupped_split_recal_data.table

However, I get this error:
ERROR MESSAGE: The platform (platform) associated with read group GATKSAMReadGroupRecord @RG:id is not a recognized platform. Allowable options are ILLUMINA,SLX,SOLEXA,SOLID,454,LS454,COMPLETE,PACBIO,IONTORRENT,CAPILLARY,HELICOS,UNKNOWN.

I used the sed command to change the PL:platform to PL:SOLID using this command:

samtools view -H a_split.bam|sed -e 's/PL:platform/PL:SOLID/g'|samtools reheader - a_split.bam >a_reheadered.bam

But I then get another error stating that the BAM file is not indexed. How does changing one line in the header make it unindexed? Can anyone comment on this?

Regards,
Anurag

Can GATK4 be used in old shell scripts as GATK3 (without WDL)?

Can GATK4 be used in old shell scripts like GATK3 (and is there a point in changing) without planning to use WDL? Is GATK4 faster (if it can be used locally like GATK3)?

define java to use in GATK4?

Hi,

today I tried to run GATK4, but I ran into an issue. Just calling "gatk" looks fine, but running "gatk --list" produces the following output.

Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -jar /dsk/data1/ngs/bin/GATK/4.0.0.0/gatk-package-4.0.0.0-local.jar --help
Error: Invalid or corrupt jarfile /dsk/data1/ngs/bin/GATK/4.0.0.0/gatk-package-4.0.0.0-local.jar

Since the output shows the command with which GATK is actually started, I replaced "java" with "java8" and everything works well.

So my question is: is there an option, or a config, or anything where I can define the path to my java8? I can't just switch to java8 as my main java since I'm working on a cluster and the admins don't want it.

Thanks in advance,

Anselm Hoppmann


MuTect- small fraction of tumor contamination in normal samples.

Hi,
We have normal samples that have a small fraction of tumor contamination, usually less than 10%. Is there a modification we can apply to MuTect that can maximize the sensitivity and specificity of our somatic calls?

Mutect 2 B38 germline resource

Hi,

Congratulations on GATK 4.0!

I'm looking at the instructions for Mutect2 where it suggests using a germline resource "--germline-resource af-only-gnomad.vcf.gz".

Do you have a version of this for b38 coming? Or know where I could obtain one?

Thanks

Dan

Panel of Normals (PON)

A Panel of Normal or PON is a type of resource used in somatic variant analysis. Depending on the type of variant you're looking for, the PON will be generated differently. What all PONs have in common is that (1) they are made from normal samples (in this context, "normal" means derived from healthy tissue that is believed to not have any somatic alterations) and (2) their main purpose is to capture recurrent technical artifacts in order to improve the results of the variant calling analysis.

As a result, the most important selection criteria for choosing normals to include in any PON are the technical properties of how the data was generated. It's very important to use normals that are as technically similar as possible to the tumor (same exome or genome preparation methods, sequencing technology and so on). Additionally, the samples should come from subjects that were young and healthy to minimize the chance of using as normal a sample from someone who has an undiagnosed tumor. Normals are typically derived from blood samples.

There is no definitive rule for how many samples should be used to make a PON (even a small PON is better than no PON) but in practice we recommend aiming for a minimum of 40.

At the Broad Institute, we typically make a standard PON for a given version of the pipeline (corresponding to the combination of all protocols used in production to generate the sequence data, starting from sample preparation and including the analysis software) and use it to process all tumor samples that go through that version of the pipeline. Because we process many samples in the same way, we are able to make PONs composed of hundreds of samples.

Variant type-specific recommendations are given below.


Short variants (SNVs and indels)

For short variant discovery, the PON is created by running the variant caller Mutect2 individually on a set of normal samples and combining the resulting variant calls with some criteria (e.g. excluding any sites that are not present in at least 2 normals) as defined in the Best Practices documentation. This produces a sites-only VCF file that can be used as PON for Mutect2.


Copy Number Variants

For CNV discovery, the PON is created by running the initial coverage collection tools individually on a set of normal samples and combining the resulting copy ratio data using a dedicated PON creation tool. This produces a binary file that can be used as PON.

Using GenomicsDBImport to consolidate GVCFs for input to GenotypeGVCFs in GATK4

In GATK4, the GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case there is only one valid way to do it: with GenomicsDBImport.

The GenomicsDBImport tool takes in one or more single-sample GVCFs, imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output a VCF. Note that GenomicsDBImport does not accept two or more GVCFs from the same sample; you will need to create one GVCF per sample before running the tool.

Here are example commands to use it:

gatk-launch GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsDBWorkspace my_database \
    --intervals 20

That generates a directory called my_database containing the combined gvcf data.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path.

gatk-launch GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf

And that's all there is to it.

There are three caveats:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At the moment you can only run GenomicsDBImport on a single genomic interval (i.e. max one contig). This will probably change because we'd like to enable running on one or more intervals in one go, but for now you need to run on each interval separately. We recommend scripting this, of course.

  3. At the moment GenomicsDB only supports diploid data. The developers of GenomicsDB are working on implementing support for non-diploid data.


Addendum: extracting data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk-launch SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Caveat: cannot move database after creation

Currently the GenomicsDB internal code uses the absolute path of the location of the database as part of the data encoding. As a consequence, you cannot move the database to a different location before running GenotypeGVCFs on it. If you do, it will no longer work. This is obviously not desirable, and the development team is looking at options to remediate this.

PGT and PID is a dot

Hi,
I am following the Best Practices pipeline with version 3.6 of GATK, and in order to reduce the amount of compound heterozygote variants found in my analysis, I recently chose to "mature" into using the phased genotype annotations PID and PGT, and I have very nice results.
However, I found a high number of variants with a dot ('.') in both fields in the vcfs I started to use the new analysis on.
I scanned around in the forums and documentation, however I could not find any sign of what that means.
This is especially interesting, as I have several variants that are on the same read, however only a few of them are phased; the others show the '.'.
I am pleased with the outcome already, however I fear that I might miss information.

I included the screenshot as well as an overview of the variants in that region.

I would appreciate it if you could shed some light on this issue, even if it is just "the dot means we were not sure about it".

chr1    152280671   .   A   C   3783.47 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.019;AN=2;BaseQRankSum=5.09;ClippingRankSum=0;DP=26522;ExcessHet=10.4471;FS=22.852;InbreedingCoeff=-0.0643;MLEAC=7;MLEAF=0.019;MQ=58.86;MQRankSum=-7.267;QD=1.96;ReadPosRankSum=-1.448;SOR=2.749;VQSLOD=-14.1;culprit=MQRankSum    GT:AD:DP:GQ:PGT:PID:PL  0/1:277,34:311:99:0|1:152280669_T_C:585,0,11704
chr1    152280685   .   C   A   9358.84 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.019;AN=2;BaseQRankSum=4.79;ClippingRankSum=0;DP=35550;ExcessHet=33.4707;FS=9.592;InbreedingCoeff=-0.1314;MLEAC=11;MLEAF=0.03;MQ=58.82;MQRankSum=-7.86;QD=3.99;ReadPosRankSum=-2.019;SOR=1.604;VQSLOD=-8.457;culprit=MQRankSum GT:AD:DP:GQ:PGT:PID:PL  0/1:299,66:365:99:0|1:152280669_T_C:1871,0,12440
chr1    152280687   .   C   T   9105.01 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.019;AN=2;BaseQRankSum=5.67;ClippingRankSum=0;DP=35912;ExcessHet=38.6724;FS=9.592;InbreedingCoeff=-0.1438;MLEAC=12;MLEAF=0.033;MQ=58.81;MQRankSum=-7.86;QD=3.89;ReadPosRankSum=-1.766;SOR=1.605;VQSLOD=-8.451;culprit=MQRankSum    GT:AD:DP:GQ:PGT:PID:PL  0/1:299,66:365:99:0|1:152280669_T_C:1840,0,13062
chr1    152280688   .   A   G   9486.72 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.019;AN=2;BaseQRankSum=9.54;ClippingRankSum=0;DP=36233;ExcessHet=39.4426;FS=9.592;InbreedingCoeff=-0.1455;MLEAC=8;MLEAF=0.022;MQ=58.83;MQRankSum=-7.765;QD=4.02;ReadPosRankSum=-1.182;SOR=1.63;VQSLOD=-8.617;culprit=MQRankSum GT:AD:DP:GQ:PGT:PID:PL  0/1:303,66:371:99:.:.:1926,0,13029
chr1    152280691   .   A   T   11612.7 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.025;AN=2;BaseQRankSum=6.18;ClippingRankSum=0;DP=38167;ExcessHet=38.2356;FS=4.867;InbreedingCoeff=-0.1452;MLEAC=13;MLEAF=0.036;MQ=58.89;MQRankSum=-10.18;QD=3.99;ReadPosRankSum=-1.247;SOR=1.173;VQSLOD=-9.026;culprit=MQRankSum   GT:AD:DP:GQ:PGT:PID:PL  0/1:281,110:391:99:.:.:3611,0,11146
