Channel: Recent Discussions — GATK-Forum

The GATK Best Practices for variant calling on RNAseq, in full detail


We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.


Best Practices workflow for RNAseq

[Workflow diagram]

This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.

Please see the new document here for full details about how to run this workflow in practice.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.
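For example, the splice-junction handling mentioned above is done with the SplitNCigarReads tool; here is a minimal sketch of the GATK3-era invocation (file names are placeholders; see the linked document for the full set of commands and parameters):

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads \
    -R reference.fasta \
    -I dedupped.bam \
    -o split.bam \
    -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 \
    -U ALLOW_N_CIGAR_READS

This splits reads that contain Ns in their CIGAR strings (i.e. reads spanning splice junctions) into separate exon segments, and reassigns the STAR-assigned mapping quality of 255 to 60 so that downstream GATK tools will not discard those reads.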

Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline is producing both false positive (wrong variant call) and false negative (missed variant) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors, as well as our ideas for fixing them in the future, are given in this article.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!


4000x data suitable for mutect2?

Hi, is there a coverage restriction for Mutect2? For example, 4000x. If that is OK, will it downsample the data?
For comparison, the software LoFreq* is generic and fast enough to be applied to high-coverage data and large genomes.

can mutect2 be applied for plants?

I do not know whether the concepts of somatic and germline variants apply to plants. Do you have any experience with this? Thanks a lot.

Understanding and adapting the generic hard-filtering recommendations


This document aims to provide insight into the logic of the generic hard-filtering recommendations that we provide as a substitute for VQSR. Hopefully it will also serve as a guide for adapting these recommendations or developing new filters that are appropriate for datasets that diverge significantly from what we usually work with.


Introduction

Hard-filtering consists of choosing specific thresholds for one or more annotations and throwing out any variants that have annotation values above or below the set thresholds. By annotations, we mean properties or statistics that describe for each variant e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation, and so on.

The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.

In contrast, VQSR is more powerful because it uses machine-learning algorithms to learn from the data what are the annotation profiles of good variants (true positives) and of bad variants (false positives) in a particular dataset. This empowers you to pull out variants based on how they cluster together along different dimensions, and liberates you to a large extent from the linear tyranny of single-dimension thresholds.

Unfortunately this method requires a large number of variants and well-curated known variant resources. For those of you working with small gene panels or with non-model organisms, this is a deal-breaker, and you have to fall back on hard-filtering.
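For reference, here is a minimal sketch of what applying the generic SNP hard filters examined below looks like with VariantFiltration (GATK4 syntax; file names and filter names are placeholders, and the thresholds are the generic recommendations discussed in this article):

gatk VariantFiltration \
    -R reference.fasta \
    -V raw_snps.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "SOR > 3.0" --filter-name "SOR3" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O filtered_snps.vcf.gz

Variants that match any of these expressions are not removed from the file; they are simply marked with the corresponding filter name in the FILTER column, so you can always revisit the thresholds later.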


Outline

In this article, we illustrate how the generic hard-filtering recommendations we provide relate to the distribution of annotation values we typically see in callsets produced by our variant calling tools, and how this in turn relates to the underlying physical properties of the sequence data.

We also use results from VQSR filtering (which we take as ground truth in this context) to highlight the limitations of hard-filtering.

We do this in turn for each of six annotations that are highly informative among the recommended annotations: QD, FS, SOR, MQ, MQRankSum and ReadPosRankSum. The same principles can be applied to most other annotations produced by GATK tools.


Overview of data and methods

Origin of the dataset

We called variants on a whole genome trio (samples NA12878, NA12891, NA12892, previously pre-processed) using HaplotypeCaller in GVCF mode, yielding a GVCF file for each sample. We then joint-genotyped the GVCFs using GenotypeGVCFs, yielding an unfiltered VCF callset for the trio. Finally, we ran VQSR on the trio VCF, yielding the filtered callset. We will be looking at the SNPs only.

Plotting methods and interpretation notes

All plots shown below are density plots generated using the ggplot2 library in R. On the x-axis are the annotation values, and on the y-axis are the density values. The area under the density curve gives you the probability of observing annotation values in a given range, so the entire area under each curve is equal to 1. For example, if you would like to know the probability of observing an annotation value between 0 and 1, you take the area under the curve between 0 and 1.

In plain English, this means that the plots show you, for a given set of variants, the distribution of their annotation values. The caveat is that when we're comparing two or more sets of variants on the same plot, we have to keep in mind that they may contain very different numbers of variants, so the number of variants in a given part of the distribution is not directly comparable; only their proportions are comparable.
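If you want to make the same kind of plots for your own callset, here is a minimal sketch (GATK4 VariantsToTable syntax; file names are placeholders) of extracting the annotation values into a table that can then be read into R and plotted with ggplot2:

gatk VariantsToTable \
    -V trio_snps.vcf.gz \
    -F CHROM -F POS -F TYPE -F FILTER \
    -F QD -F FS -F SOR -F MQ -F MQRankSum -F ReadPosRankSum \
    -O trio_snps.annotations.table

Each -F names a site-level or INFO field to export as a column of the tab-separated output.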


QualByDepth (QD)

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.
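As a made-up example of the calculation: a variant with QUAL = 480 called in a single sample with 40 informative reads would have QD = 480 / 40 = 12, right around the het peak described below.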

The generic filtering recommendation for QD is to filter out variants with QD below 2. Why is that?

First, let’s look at the QD values distribution for unfiltered variants. Notice the values can be anywhere from 0 to 40. There are two peaks where the majority of variants are (around QD = 12 and QD = 32). These two peaks correspond to variants that are mostly observed in heterozygous (het) versus mostly homozygous-variant (hom-var) states, respectively, in the called samples. This is because hom-var samples contribute twice as many reads supporting the variant as het samples do. We also see, to the left of the distribution, a "shoulder" of variants with QD hovering between 0 and 5.

[Plot: QD value density for unfiltered variants]

We expect to see a similar distribution profile in callsets generated from most types of high-throughput sequencing data, although values where the peaks form may vary.

Now, let’s look at the plot of QD values for variants that passed VQSR and those that failed VQSR. Red indicates the variants that failed VQSR, and blue (green?) the variants that passed VQSR.

[Plot: QD values for variants that passed vs. failed VQSR]

We see that the majority of variants filtered out correspond to that low-QD "shoulder" (remember that since this is a density plot, the y-axis indicates proportion, not number of variants); that is what we would filter out with the generic recommendation of the threshold value 2 for QD.

Notice however that VQSR has failed some variants that have a QD greater than 30! All those variants would have passed the hard filter threshold, but VQSR tells us that these variants looked artifactual in one or more other annotation dimensions. Conversely, although it is not obvious in the figure, we know that VQSR has passed some variants that have a QD less than 2, which hard filters would have eliminated from our callset.


FisherStrand (FS)

This is the Phred-scaled probability that there is strand bias at the site. Strand bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.

Note: SB, SOR and FS are related but not the same! They all measure strand bias (a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other) in different ways. SB gives the raw counts of reads supporting each allele on the forward and reverse strand. FS is the result of using those counts in a Fisher's Exact Test. SOR is a related annotation that applies a different statistical test (using the SB counts) that is better for high coverage data.

Let’s look at the FS values for the unfiltered variants. The FS values have a very wide range; we made the x-axis log-scaled so the distribution is easier to see. Notice most variants have an FS value less than 10, and almost all variants have an FS value less than 100. However, there are indeed some variants with a value close to 400.

[Plot: FS value density for unfiltered variants, log-scaled x-axis]

The plot below shows FS values for variants that passed VQSR and failed VQSR.

[Plot: FS values for variants that passed vs. failed VQSR]

Notice most of the variants that fail have an FS value greater than 55. Our hard filtering recommendations tell us to fail variants with an FS value greater than 60. Notice that although we are able to remove many false positives by removing variants with FS greater than 60, we still keep many false positive variants. If we move the threshold to a lower value, we risk losing true positive variants.


StrandOddsRatio (SOR)

This is another way to estimate strand bias, using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Variants at the ends of exons tend to be covered by reads in only one direction, and FS gives those variants a bad score. SOR takes into account the ratios of reads that cover both alleles.

Let’s look at the SOR values for the unfiltered variants. The SOR values range from 0 to greater than 9. Notice most variants have an SOR value less than 3, and almost all variants have an SOR value less than 9. However, there is a long tail of variants with a value greater than 9.

[Plot: SOR value density for unfiltered variants]

The plot below shows SOR values for variants that passed VQSR and failed VQSR.

[Plot: SOR values for variants that passed vs. failed VQSR]

Notice most of the variants that have an SOR value greater than 3 fail the VQSR filter. Although there is a non-negligible population of variants with an SOR value less than 3 that failed VQSR, our hard filtering recommendation of failing variants with an SOR value greater than 3 will at least remove the long tail of variants that show fairly clear bias according to the SOR test.


RMSMappingQuality (MQ)

This is the root mean square mapping quality over all the reads at the site. Instead of the average mapping quality of the site, this annotation gives the square root of the average of the squares of the mapping qualities at the site. It is meant to include the standard deviation of the mapping qualities, which captures the variation in the dataset: a low standard deviation means the values are all close to the mean, whereas a high standard deviation means the values are spread far from the mean. When the mapping qualities are good at a site, the MQ will be around 60.

Now let’s check out the graph of MQ values for the unfiltered variants. Notice the very large peak around MQ = 60. Our recommendation is to fail any variant with an MQ value less than 40.0. You may argue that hard filtering any variant with an MQ value less than 50 is fine as well. This brings up an excellent point that our hard filtering recommendations are meant to be very lenient. We prefer to keep all potentially decent variants rather than get rid of a few bad variants.

[Plot: MQ value density for unfiltered variants]

Let’s look at the VQSR pass vs fail variants. At first glance, it seems like VQSR has passed the variants in the high peak and failed any variants not in the peak.

[Plot: MQ values for variants that passed vs. failed VQSR]

It is hard to tell which variants passed and failed, so let’s zoom in and see what exactly is happening.

[Plot: MQ values for passed vs. failed variants, zoomed in to MQ 59-61]

The plot above shows the x-axis from 59-61. Notice the variants in blue (the ones that passed) all have MQ around 60. However, some variants in red (the ones that failed) also have an MQ around 60.


MappingQualityRankSumTest (MQRankSum)

This is the u-based z-approximation from the Rank Sum Test for mapping qualities. It compares the mapping qualities of the reads supporting the reference allele and the alternate allele. A positive value means the mapping qualities of the reads supporting the alternate allele are higher than those supporting the reference allele; a negative value indicates the mapping qualities of the reference allele are higher than those supporting the alternate allele. A value close to zero is best and indicates little difference between the mapping qualities.

Next, let’s look at the distribution of values for MQRankSum in the unfiltered variants. Notice the values range from approximately -10.5 to 6.5. Our hard filter threshold is -12.5. There are no variants in this dataset that have MQRankSum less than -10.5! In this case, hard filtering would not fail any variants based on MQRankSum. Remember, our hard filtering recommendations are meant to be very lenient. If you plot the annotation values for your samples and find that none of your variants have MQRankSum less than -12.5, you may want to refine your hard filters. Our recommendations are just that: recommendations that you, the scientist, will want to refine yourself.

[Plot: MQRankSum value density for unfiltered variants]

Looking at the plot of pass VQSR vs fail VQSR variants, we see the variants with an MQRankSum value less than -2.5 fail VQSR. However, the region between -2.5 to 2.5 contains both pass and fail variants. Are you noticing a trend here? It is very difficult to pick a threshold for hard filtering. If we pick -2.5 as our hard filtering threshold, we still have many variants that fail VQSR in our dataset. If we try to get rid of those variants, we will lose some good variants as well. It is up to you to decide how many false positives you would like to remove from your dataset vs how many true positives you would like to keep and adjust your threshold based on that.

[Plot: MQRankSum values for variants that passed vs. failed VQSR]


ReadPosRankSumTest (ReadPosRankSum)

This is the u-based z-approximation from the Rank Sum Test for site position within reads. It compares whether the positions of the reference and alternate alleles are different within the reads. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele; a positive value indicates that the reference allele is found at the ends of reads more often than the alternate allele. A value close to zero is best because it indicates there is little difference between the positions of the reference and alternate alleles in the reads.

The last annotation we will look at is ReadPosRankSum. Notice the values fall mostly between -4 and 4. Our hard filtering threshold removes any variant with a ReadPosRankSum value less than -8.0. Again, there are no variants in this dataset that have a ReadPosRankSum value less than -8.0, but some datasets might. If you plot your variant annotations and find there are no variants that have a value less than or greater than one of our recommended cutoffs, you will have to refine them yourself based on your annotation plots.

[Plot: ReadPosRankSum value density for unfiltered variants]

Looking at the VQSR pass vs fail variants, we can see VQSR has failed variants with ReadPosRankSum values less than -1.0 and greater than 3.5. However, notice that VQSR has also failed some variants whose values fall within the range where most variants pass, so no single ReadPosRankSum threshold would separate them cleanly.

[Plot: ReadPosRankSum values for variants that passed vs. failed VQSR]

When should I restrict my analysis to specific intervals?


This document covers the reasoning behind the use of genomic intervals. If you're looking for instructions on how to use intervals in practice, including argument details and supported formats, please see this doc.

Depending on what you're trying to do, there are many reasons why you might want to tell a tool to operate on a subset of genomic regions only. We distinguish four main types of reasons for doing so:

  • You want to run a quick test on a subset of data (often used in troubleshooting)
  • You want to parallelize execution of an analysis across genomic regions
  • You need to exclude regions that have bad or uninformative data where a tool is getting stuck
  • The analysis you're running should only take data from those subsets due to how the underlying algorithm works

The first three should be fairly self-explanatory, but let's go into a bit more detail on the fourth one.


In a nutshell

- Whole genome analysis:
Intervals are not required but they can help speed up analysis by eliminating "difficult" regions and enabling parallelism

- Exome analysis and other targeted sequencing:
You must provide the list of targets, with padding, to exclude off-target noise. This will also speed up analysis and enable parallelism.


Whole genome analysis

It is not strictly necessary to restrict analysis to intervals when working with whole genomes, since presumably you're interested in all of it. However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere) where you know the data is not reliable or is very messy, causing excessive slowdowns. In addition, defining whole-genome intervals allows you to parallelize execution across intervals using the scatter gather mode of parallelism.
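For instance, here is a minimal sketch (GATK4 syntax; file, contig and tool names are placeholders for whatever you are running) of excluding a problematic contig and a list of messy regions from a whole-genome run:

gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -XL chrY \
    -XL messy_regions.bed \
    -O sample1.g.vcf.gz \
    -ERC GVCF

The -XL (--exclude-intervals) argument can be given multiple times and accepts contig names as well as interval files.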

We share the lists of "good" whole-genome intervals that we use in our production pipelines for human analysis in our resource bundle (see Download page).


Exome analysis and other targeted sequencing

By definition, exome sequencing and other targeted sequencing data don’t cover the entire genome, so most analyses can be restricted to just the capture targets (genes or exons) to save processing time and enable scatter gather parallelism. In addition, there are some processing steps, such as BQSR, that should be restricted to the capture targets in order to eliminate off-target sequencing data, which is uninformative and is a source of noise.
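For example, a minimal sketch (GATK4 syntax; file names are placeholders) of restricting BQSR to the capture targets:

gatk BaseRecalibrator \
    -R reference.fasta \
    -I exome_sample.bam \
    --known-sites dbsnp.vcf.gz \
    -L exome_targets.interval_list \
    -O exome_sample.recal.table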

You should use the list of target intervals that corresponds to the library preparation method that was used to generate the data. If you're working with exome sequencing data that was prepared by someone else, you'll need to find out what kit was used; the kit manufacturers typically provide the lists of intervals that correspond to their kits on their websites. We cannot provide you with a suitable interval list unless you are sure that your data was sequenced at the Broad.


Important notes:

Whatever you end up using intervals for, keep this in mind: for tools that output a BAM or VCF file, the output file will only contain data from the intervals you specified. Any data that falls outside these intervals will be lost to downstream analysis.

In general we recommend adding some padding to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use a list of intervals.
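For example, a minimal sketch (GATK4 syntax; file names are placeholders) of having the engine add 100 bp of padding to the target list on the fly:

gatk HaplotypeCaller \
    -R reference.fasta \
    -I exome_sample.bam \
    -L exome_targets.interval_list \
    --interval-padding 100 \
    -O exome_sample.g.vcf.gz \
    -ERC GVCF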

You will have noticed by now that we do not provide detailed guidelines for which tool should or should not use an interval list in this article. For tool-by-tool recommendations, please see the example commands in the individual tool docs; they show the most common recommended usage for each. See also the Best Practices documentation for up to date implementation notes.

Build the SNP recalibration model error


Hi,

I am trying to build the SNP recalibration model by running the following GATK command:

./gatk-4.0.3.0/gatk VariantRecalibrator \
-R human_g1k_v37_decoy.fasta \
-input /mergedFiles.vcf \
--resource hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
--resource omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
--resource 1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_135.b37.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
--recalFile recalibrate_SNP.recal \
-tranchesFile output.tranches \
--rscriptFile output.plots.R

But I am getting following error.

Error:


A USER ERROR has occurred: Invalid argument 'hapmap_3.3.b37.sites.vcf'.


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

I used human_g1k_v37_decoy.fasta for alignment, so I am using the same reference for recalibration. I would like to convert the raw variants into analysis-ready variants by applying filtration and annotation. Please let me know if you have any direction for the best practice approach.

Thanks

GenotypeGVCFs WARN Track variant doesn't have a sequence dictionary built in

Hi Team,
I'm getting `WARN  21:19:30,478 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation` when processing gzipped g.vcf files produced by HaplotypeCaller (via -o foo.g.vcf.gz, as suggested by @Geraldine_VdAuwera in blog post 3893) with GenotypeGVCFs.
This results in dramatic increases in run time (which makes sense if GenotypeGVCFs un-compresses the files) and memory requirements (why??) for GenotypeGVCFs, compared to processing the GVCFs for the same BAM files when the HaplotypeCaller output files are unzipped. Most batches that previously completed with 4x8GB RAM now produce `java.lang.OutOfMemoryError: Java heap space` errors even with 4x64GB!

Could you please advise whether this warning is expected behaviour? If yes, what exactly is missing (can't see much difference in unzipped vs gzipped vcf headers), and can this be added somehow?

How to consolidate 81 GVCF files for 35,000 intervals ?

Our aim is to mine exome capture DNA sequencing data, generated for 81 provenances of tropical pine tree species (5-8 trees were pooled per provenance), for informative SNPs. DNA fragments were captured by 35,000 probes. We mapped the exome capture data against the full, but highly fragmented Pinus taeda 2.0 genome (22GB; consisting of 1.76 million scaffolds).

We have per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing. As suggested in the GATK Best Practices “Germline short variant discovery (SNPs + Indels)” workflow, we called variants per sample in order to produce per-sample files in GVCF format. We currently assume a diploid model despite working with pooled samples.

The bottleneck is when we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We use the latest version of GATK (4.0.12.0), with multi-interval support. We perform the analysis on a server with 3 TB memory and 96 CPU cores, using the following command:

```
gatk --java-options "-Xmx2500G" GenomicsDBImport -V f1.vcf -V f2.vcf -V f3.vcf -V f4.vcf -V f5.vcf -V f6.vcf -V f7.vcf -V f8.vcf -V f9.vcf -V f10.vcf -V f11.vcf --genomicsdb-workspace-path outputDB -L capture_probe_regions.bed

```

Each capture probe region is roughly 800 bp, mostly on different scaffolds. I performed a few test runs with between 2 and 10 intervals. It takes ~2.5 hours per interval for 81 files (all files, which is what we would like to do) and ~20 minutes per interval for 11 files (the files for one sub-species). It is impractical to perform this step for 35,000 intervals.
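For reference, GenomicsDBImport can also take its inputs from a sample-name map instead of many -V arguments: a tab-separated file with one `sample_name<TAB>/path/to/sample.g.vcf` line per sample, passed with --sample-name-map. A minimal sketch of the same import in that form (the map file name is a placeholder):

```
gatk --java-options "-Xmx2500G" GenomicsDBImport --sample-name-map cohort.sample_map --genomicsdb-workspace-path outputDB -L capture_probe_regions.bed
```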

Any advice will be appreciated. Thank you in advance.

Nanette

(howto) Install and run Oncotator for the first time


1. Download the Oncotator package, the default datasources package, and (recommended) transcript override list from the Downloads page

Please note: Broadies who wish to run the installed Oncotator on the Broad cluster should follow the instructions here, instead of this page

Oncotator Download

  • Download the latest release here.

Default Datasource Corpus Download (April 5, 2016)

Please note that this corpus should be used with Oncotator 1.4.x.x and above. Uniprot AA Pos annotations will not function properly with Oncotator 1.3.x.x and below.

Transcript override lists

We highly recommend that you download and use one of the transcript override lists below, especially for clinical applications of Oncotator. When running Oncotator, provide one of the below files with the -c parameter.

  • Download UniProt Exact Match for GENCODE v19; this will give selection priority to transcripts with protein sequences that match the UniProt protein sequence exactly. This file can also be found in the Oncotator download at test/testdata/tx_exact_uniprot_matches.txt.

  • Download UniProt Exact Match + Clinical for GENCODE v19; this will give priority to known clinical protein changes. This file is a modification of the UniProt Exact Match (above). For more information about how this list was generated, please see the powerpoint presentation here.

The Oncotator and default datasource corpus packages are simple tar files that can be expanded using the following commands:

$ tar zxvf oncotator-1.5.1.0.tar.gz
$ tar zxvf oncotator_v1_ds_Jan262015.tar.gz

This will produce two directories called oncotator-1.5.1.0 and oncotator_v1_ds_Jan262015, respectively. Move to the oncotator-1.5.1.0 directory by doing:

$ cd oncotator-1.5.1.0

2. Set up your Python environment and install dependencies

See the article on platform requirements for a full list of dependencies. This tutorial will show you how to use the virtual environment script we provide to set everything up automagically, and this tutorial will show you how to install dependencies manually if needed (or preferred).


3. Install Oncotator

Once you have installed all the necessary dependencies listed above, simply run the standard Python install script which is included with the Oncotator distribution.

$ python setup.py install

Two binaries (executable program files) named oncotator and initializeDatasource respectively will be installed into your Python's bin/ directory. You can test that they were installed by running e.g.:

$ oncotator -h 

to invoke the help / usage instructions. You can also do a test run of Oncotator on the Patient0.snp.maf.txt file provided with the Oncotator distribution (in the test/testdata/maflite/ directory) with the following command:

$ oncotator -v --db-dir /path/to/oncotator_v1_ds_Jan262015 test/testdata/maflite/Patient0.snp.maf.txt exampleOutput.tsv hg19

where you provide the location of the datasources using the --db-dir argument. You may need to adapt the file path for the Patient0.snp.maf.txt file depending on where you run this command from.

This will produce a new file named exampleOutput.tsv with the appropriate annotations, built against the hg19 reference.
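If you also downloaded one of the transcript override lists from step 1, it is supplied with the -c argument described there; for example (a sketch, using the copy of the override list bundled in the Oncotator distribution):

$ oncotator -v --db-dir /path/to/oncotator_v1_ds_Jan262015 -c test/testdata/tx_exact_uniprot_matches.txt test/testdata/maflite/Patient0.snp.maf.txt exampleOutput.tsv hg19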

RNAseq short variant discovery (SNPs + Indels)


Purpose

Identify short variants (SNPs and Indels) in RNAseq data.


Diagram is not available


Reference Implementations

Pipeline                                    Summary      Notes                  Github   FireCloud
RNAseq short variant per-sample calling     BAM to VCF   universal (expected)   :)       TBD

Expected input

This workflow is designed to operate on a set of samples, one sample at a time; joint calling of RNAseq data is not supported.


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.

Service note: we are now software.broadinstitute.org/gatk


For largely practical reasons, the GATK website home URL has become http://software.broadinstitute.org/gatk. Don't worry, your bookmarked www links will still work foreveeeer -- at least that's what I'm told by our valiant IT folks. As always, let us know if you run into any trouble, not that we're expecting any.

GATK4: How to reassign STAR mapping quality from 255 to 60 with SplitNCigarReads


Hi,

How can I reassign STAR mapping quality from 255 to 60 with SplitNCigarReads?

In GATK 3.X this used to be done like this:
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
See this blog post: https://software.broadinstitute.org/gatk/blog?id=4285

With the latest GATK4 beta, the read filter argument has been renamed. Trying the same -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS arguments leads to the following error:
A USER ERROR has occurred: rf is not a recognized option

Through looking at the CLI help documentation I got as far as:
--readFilter ReassignOneMappingQuality -RMQF 255 -RMQT 60

The readFilter argument is now recognized. But not the -RMQF 255 -RMQT 60 part:
A USER ERROR has occurred: U is not a recognized option

Could you please advise on how to run the GATK4 SplitNCigarReads tool with reassignment of the mapping quality?

Without reassignment of the mapping quality, GATK HaplotypeCaller discards all the STAR-mapped reads and calls the full chromosome as reference, without any variants.

Thank you.

DepthOfCoverage: Error with option -baseCounts

Hi all!

I'm trying to use DepthOfCoverage with this command:
```
java -jar home/apps/Logiciels/GATK/3.6-0/GenomeAnalysisTK.jar -l INFO -T DepthOfCoverage -R ref.fasta -I input.bam -L intervals.bed -baseCounts true -o output_DoC
```

and it keeps giving me this error:
```
##### ERROR MESSAGE: Invalid argument value 'true' at position 11.
```

I'm using version 3.6 of GATK and java jdk1.8.0_40.

Thank you,

Description and examples of the steps in the ACNV case workflow

$
0
0

Once you have run GATK CNV, you can run ACNV to obtain revised segments based on both the target-coverage profile and the ref/alt counts at heterozygous SNPs. ACNV will report estimates of the posterior probabilities for copy ratio and minor-allele fraction in each segment.

The ACNV case workflow (description and examples)

Requirements

  1. Java 1.8
  2. A functioning GATK4-protected jar (hellbender-protected.jar or gatk-protected.jar)
  3. Reference genome (fasta files) with fai and dict files. This can be downloaded as part of the GATK resource bundle: http://www.broadinstitute.org/gatk/guide/article?id=1213
  4. Samples must be paired. You will need both a case sample (typically, a tumor) and a control sample (typically, a blood normal). We are working on alleviating this requirement.
  5. A list of common heterozygous SNP sites. Currently, this needs to be in the Picard interval-list format. See http://gatkforums.broadinstitute.org/gatk/discussion/7812/creating-a-list-of-common-snps-for-use-with-getbayesianhetcoverage
  6. A completed run of GATK CNV for the case sample.

Overview of steps

  1. Identify heterozygous SNPs in the normal and aggregate read counts at these sites in the tumor.
  2. Segment the case sample (based on both the read counts from step 1 and input from GATK CNV) and estimate copy ratio and minor-allele fraction in each segment.
  3. Call copy-neutral loss-of-heterozygosity and balanced segments. This step will also create files that can be used as input for ABSOLUTE (Broad-internal versions only) and TITAN.

Step 1. Het Pulldown

** These instructions describe one method for Het Pulldown for matched samples. For more options, including tumor-only, please see: http://gatkforums.broadinstitute.org/gatk/discussion/7719/overview-of-getbayesianhetcoverage-for-heterozygous-snp-calling **

Inputs
  • control_bam -- BAM file for control sample (normal).
  • case_bam -- BAM file for case sample (tumor).
  • reference_sequence -- FASTA file for b37 reference.
  • snp_file -- Picard interval list of common SNP sites at which to test for heterozygosity in the control sample.

Outputs
  • normal_het_pulldown -- TSV file with M entries containing ref/alt counts, ref/alt bases, etc., where M is the number of hets called in the control sample.
  • tumor_het_pulldown -- TSV file with M entries containing ref/alt counts, ref/alt bases, etc. for sites in the case sample that were called as het in the control sample, where M is the number of hets called in the control sample.

Format for both output files:

CONTIG  POSITION        REF_COUNT       ALT_COUNT       REF_NUCLEOTIDE  ALT_NUCLEOTIDE  READ_DEPTH
1       809876  5       16      A       G       21
1       881627  23      12      G       A       35
1       882033  9       10      G       A       19
1       900505  26      24      G       C       50
....snip....

Invocation
java -jar <path_to_gatk_protected_jar> GetBayesianHetCoverage --reference <reference_sequence>
    --snpIntervals <snp_file> --tumor <case_bam> --tumorHets <tumor_het_pulldown> --normal <control_bam>
    --normalHets <normal_het_pulldown> --hetCallingStringency 30

Step 2. Allelic CNV

Inputs
  • tumor_het_pulldown -- Generated in step 1.
  • coverage_profile -- Tangent-normalized coverage TSV file obtained in the GATK CNV case workflow.
  • called_segments -- Called-segments TSV file obtained in the GATK CNV case workflow.
  • output_prefix -- Path and file prefix for creating the output files. For example, /home/lichtens/my_acnv_output/sample1

Outputs
  • acnv_segments -- TSV file with name ending with -sim-final.seg containing posterior summary statistics for log_2 copy ratio and minor-allele fraction in each segment. Using the above output_prefix, /home/lichtens/my_acnv_output/sample1-sim-final.seg
  • acnv_cr_parameters -- TSV file with name ending with -sim-final.cr.param containing posterior summary statistics for global parameters of the copy-ratio model. Using the above output_prefix, /home/lichtens/my_acnv_output/sample1-sim-final.cr.param
  • acnv_af_parameters -- TSV file with name ending with -sim-final.af.param containing posterior summary statistics for global parameters of the allele-fraction model. Using the above output_prefix, /home/lichtens/my_acnv_output/sample1-sim-final.af.param

Other files containing intermediate results of the calculation are also generated.

Invocation
 java -Xmx8g -jar <path_to_gatk_protected_jar> AllelicCNV  --tumorHets <tumor_het_pulldown>
    --tangentNormalized <coverage_profile> --segments <called_segments> --outputPrefix <output_prefix>

Step 3. Call CNLoH and Balanced Segments

** WARNING: This tool is experimental and exists primarily for internal Broad use. **

Inputs
  • tumor_het_pulldown -- Generated in step 1.
  • acnv_segments -- Generated in step 2 (*-sim-final.seg).
  • coverage_profile -- Tangent-normalized coverage TSV file obtained in the GATK CNV case workflow
  • output_dir -- Directory for creating the output files. For example, /home/lichtens/my_acnv_cnlohcalls_output/

Outputs
  • GATK-CNV-formatted seg file -- TSV file ending with -sim-final.cnv.seg. This file is formatted identically to the output of GATK CNV. Note that this implies that the allelic-fraction values are not captured in this file.
  • AllelicCapSeg-formatted seg file -- TSV file ending with -sim-final.acs.seg. This file is formatted identically to the output of Broad CGA AllelicCapSeg. Note that this file can be used as input to Broad-internal versions of ABSOLUTE.
  • TITAN-compatible het file -- TSV file ending with -sim-final.titan.het.tsv. This file can be used as the input to TITAN for the het read counts.
  • TITAN-compatible copy-ratio file -- TSV file ending with -sim-final.titan.tn.tsv. This file can be used as the input to TITAN for the per-target copy-ratio estimates.

Invocation
 java -Xmx8g -jar <path_to_gatk_protected_jar> CallCNLoHAndSplits  --tumorHets <tumor_het_pulldown>
    --segments <acnv_segments> --tangentNormalized <coverage_profile> --outputDir <output_dir>
    --rhoThreshold 0.2 --numIterations 10  --sparkMaster local[*]  

GVCF - Genomic Variant Call Format


GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variant discovery analyses.

Important notes

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller in GATK versions 3.x and 4.x contain additional information that is formatted in a very specific way. Read on to find out more.

GVCF files produced by HaplotypeCaller from GATK versions 3.x and 4.x are not substantially different. While we don't recommend mixing versions, and we have not tested this ourselves, it should be okay to use GVCFs made by different versions if the annotations and the GVCFBlock definitions (see below) are the same.


General comparison of VCF vs. GVCF

The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a GVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

[Figure: a regular VCF compared with the GVCF types produced with -ERC BP_RESOLUTION and -ERC GVCF]

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION GVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of GVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a GVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a GVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the GVCF header. The purpose of the blocks (also called banding) is to keep file size down, so we recommend using the -GVCF option over BP_RESOLUTION.
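For reference, a minimal sketch (GATK4 syntax; file names are placeholders) of producing a banded GVCF like the example below; swapping -ERC GVCF for -ERC BP_RESOLUTION gives the per-site version instead:

gatk HaplotypeCaller \
    -R reference.fasta \
    -I NA12878.bam \
    -O NA12878.g.vcf.gz \
    -ERC GVCF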


Example GVCF file

This is a banded GVCF produced by HaplotypeCaller with the -GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.2 VCF:

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

One FORMAT annotation is unique to the GVCF format:

##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">

This gives the minimum depth of coverage observed at any one site within a block of records.

The header goes on:

##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="[full command line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 4:04:34 PM EST">

At this point in the header we see the GVCFBlock definitions, which indicate the GQ ranges used for banding:

[individual blocks from 1 to 55]
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive)
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive)
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive)
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive)
##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)

In recent versions of GATK, the banding strategy has been tuned to provide high resolution at lower values of GQ (59 and below) and more compression at high values (60 and above). Note that since GQ is capped at 99, records where the corresponding PL is greater than 99 are lumped into the 99-100 band.

After that, the header goes on:

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=GRCh37>
##source=HaplotypeCaller

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10001567 and ends at 20:10001616.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10001567    .   A   <NON_REF>   .   .   END=10001616    GT:DP:GQ:MIN_DP:PL  0/0:38:99:34:0,101,1114
20  10001617    .   C   A,<NON_REF> 493.77  .   BaseQRankSum=1.632;ClippingRankSum=0.000;DP=38;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=136800.00;ReadPosRankSum=0.170    GT:AD:DP:GQ:PL:SB   0/1:19,19,0:38:99:522,0,480,578,538,1116:11,8,13,6
20  10001618    .   T   <NON_REF>   .   .   END=10001627    GT:DP:GQ:MIN_DP:PL  0/0:39:99:37:0,105,1575
20  10001628    .   G   A,<NON_REF> 1223.77 .   DP=37;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=133200.00   GT:AD:DP:GQ:PL:SB   1/1:0,37,0:37:99:1252,111,0,1252,111,1252:0,0,21,16
20  10001629    .   G   <NON_REF>   .   .   END=10001660    GT:DP:GQ:MIN_DP:PL  0/0:43:99:38:0,102,1219
20  10001661    .   T   C,<NON_REF> 1779.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1808,129,0,1808,129,1808:0,0,26,16
20  10001662    .   T   <NON_REF>   .   .   END=10001669    GT:DP:GQ:MIN_DP:PL  0/0:44:99:43:0,117,1755
20  10001670    .   T   G,<NON_REF> 1773.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1802,129,0,1802,129,1802:0,0,25,17
20  10001671    .   G   <NON_REF>   .   .   END=10001673    GT:DP:GQ:MIN_DP:PL  0/0:43:99:42:0,120,1800
20  10001674    .   A   <NON_REF>   .   .   END=10001674    GT:DP:GQ:MIN_DP:PL  0/0:42:96:42:0,96,1197
20  10001675    .   A   <NON_REF>   .   .   END=10001695    GT:DP:GQ:MIN_DP:PL  0/0:41:99:39:0,105,1575
20  10001696    .   A   <NON_REF>   .   .   END=10001696    GT:DP:GQ:MIN_DP:PL  0/0:38:97:38:0,97,1220

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).


GermlineCNVCaller --interval-merging-rule error.


Hi
I was testing the brand new GermlineCNVCaller in 4.0.3.0; however, I ran into a very strange error.

All my read count collections were made with the following command:

gatk CollectReadCounts -R $HG19FULL --interval-merging-rule OVERLAPPING_ONLY -L $TSOREG -I samplename_final.bam -O samplename_counts.hdf5

And there was no problem with the DetermineGermlineContigPloidy step. All files were generated using the GATK 4.0.3.0 docker image, fresh from the docker repo.

The DetermineGermlineContigPloidy command was set up according to the doc files within the gatk folder. I have 32 samples and I was working in COHORT mode.

Here is the error message

Where is the problem here?

Input files reference and features have incompatible contigs


Hi
I am very new to GATK. I read the paper (Curr Protoc Bioinformatics 11(1110): 11.10.1–11.10.33, doi:10.1002/0471250953.bi1110s43) and the Best Practices guide. I want to run whole exome analysis to find high-confidence SNPs and indels. I am stuck at the BQSR analysis step (BaseRecalibrator). It shows the error message:
Input files reference and features have incompatible contigs: No overlapping contigs found.

I did the following steps:
1. Download the genome file hg38.fa.gz from the UCSC genome browser.
2. Index the reference genome: bwa index hg38.fa
3. Create the fasta file index: samtools faidx hg38.fa
4. Create the sequence dictionary: java -jar picard.jar CreateSequenceDictionary REFERENCE=hg38.fa OUTPUT=hg38.dict
5. Map the data to the reference: bwa mem -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' -p hg38.fa R1.fastq R2.fastq > aligned_reads.sam
6. Sort the aligned reads: java -jar picard.jar SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate
7. Mark duplicates: java -jar picard.jar MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt
8. Index the mark-duplicates file: java -jar picard.jar BuildBamIndex INPUT=dedup_reads.bam
9. Download dbSNP from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b149_GRCh38p7/VCF/00-All.vcf.gz
10. Download indels from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources/Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz
11. Run BaseRecalibrator: java -jar gatk-package-4.0.0.0-local.jar BaseRecalibrator -R hg38.fa -I dedup_reads.bam --known-sites SNP_00-All_GRCh38.vcf --known-sites Mills_and_1000G_gold_standard_indels.vcf -O recal_data.table
This shows an error message telling me to index SNP_00-All_GRCh38.vcf and the indel file.
12. Index the files by running: java -jar gatk-package-4.0.0.0-local.jar IndexFeatureFile -F SNP_00-All_GRCh38.vcf and java -jar gatk-package-4.0.0.0-local.jar IndexFeatureFile -F Mills_and_1000G_gold_standard_indels.vcf
13. After indexing, run BaseRecalibrator again: java -jar gatk-package-4.0.0.0-local.jar BaseRecalibrator -R hg38.fa -I dedup_reads.bam --known-sites SNP_00-All_GRCh38.vcf --known-sites Mills_and_1000G_gold_standard_indels.vcf -O recal_data.table

It shows the error: "A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found."

Please help me to solve the problem

Calling variants at known sites with HaplotypeCaller


Hi, all. We are calling variants on large numbers of dogs (WGS) of a variety of breeds using HaplotypeCaller followed by GenotypeGVCFs.

When we call VCFs of dogs of different breeds in one group, that goes well, but when we call a smaller number of dogs of a single breed, we lose variants in the final VCF. We suspect that the problem is that breed-specific variants are being lost. In other words, if we call variants on several Golden Retrievers, the pipeline will save variants that differ between those dogs, as well as variants where the dogs differ from the reference. However, German Shepherd-specific variants will be lost, as they do not appear in the Goldens.

We would like to specify a list of variant sites of interest, based on the large number of sequenced samples we currently have available to us. We'd then use this list of sites in our pipeline so that when we call a smaller number of dogs, all variants of interest are retained in that VCF, even if they are invariant in those samples and vs the reference. We would also retain variants new to these samples (so would not be LIMITED to sites previously of interest).

I've been struggling with the documentation and can't quite see how to do this, although there are a variety of parameters that are ALMOST what we want. What am I missing?

Currently using GATK 3.3, about to move to 4.0 and happy to find a 4.0 solution.

Best,
Jessica

SAM bin field error for the GATK run


Hello,

I am running a Picard+GATK pipeline on paired-end Illumina samples. The BAM files were downloaded from TCGA. GATK 3.1.1 and java v1.7.0 were used. I have encountered the error shown below. I found the same errors in the Picard MarkDuplicates step, but after I changed the Picard version to 1.88, these errors were gone (as I read on another forum). GATK now picks up these errors again. When I set --validation_strictness to LENIENT, these errors do not affect the GATK run. I am wondering if there is a better way to solve this problem?

BTW, is the Picard ValidateSamFile option IGNORE=INVALID_INDEXING_BIN related to such a problem?

INFO 22:50:57,977 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 22:50:57,977 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 22:50:57,977 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 22:50:57,983 HelpFormatter - Program Args: -T RealignerTargetCreator -R b37_2.8/human_g1k_v37.fasta -I A3NJ_NB_rmdup.bam -I A3NJ_TP_rmdup.bam -known b37_2.8/1000G_phase1.indels.b37.vcf -known b37_2.8/Mills_and_1000G_gold_standard.indels.b37.vcf -o realigner.A3NJ.intervals --validation_strictness LENIENT
INFO 22:50:57,987 HelpFormatter - Executing as xrao@cnode18 on Linux 2.6.18-194.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0-b147.
INFO 22:50:57,987 HelpFormatter - Date/Time: 2014/06/10 22:50:57
INFO 22:50:57,987 HelpFormatter - --------------------------------------------------------------------------------
INFO 22:50:57,987 HelpFormatter - --------------------------------------------------------------------------------
INFO 22:50:58,801 GenomeAnalysisEngine - Strictness is LENIENT
INFO 22:50:58,973 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 22:50:58,984 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:50:59,062 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.08
INFO 22:50:59,567 GenomeAnalysisEngine - Preparing for traversal over 2 BAM files
INFO 22:51:00,913 GenomeAnalysisEngine - Done preparing for traversal
INFO 22:51:00,914 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 22:51:00,914 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
Ignoring SAM validation error: ERROR: Record 13726, Read name HWI-ST735:144061002:C3D17ACXX:1:2202:14399:15015, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
Ignoring SAM validation error: ERROR: Record 8265, Read name HWI-ST735:144061002:C3D17ACXX:8:2107:2975:86239, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
Ignoring SAM validation error: ERROR: Record 79, Read name HWI-ST735:144061002:C3D17ACXX:1:2202:14399:15015, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
............

Any input would be very appreciated!

Thanks,

Xiayu

GenomeLoc 11:69653434-69653483 has a size == 50 but the variation reference allele has length 51


Hello,

I am using GATK version 3.8.1 (3.8-1-0-gf15c1c3ef) and want to merge 4 VCF files using the CombineVariants command.
The command I am using is here:

java -jar GenomeAnalysisTK.jar \
-T CombineVariants -R genome.fa \
-nt 20 \
--variant a.vcf \
--variant b.vcf \
--variant c.vcf \
--variant d.vcf \
-o Combined.vcf \
-genotypeMergeOptions UNIQUIFY

I am using GRCh38 as a reference genome.

However, after running for a while I get this error:

##### ERROR --
##### ERROR stack trace 
java.lang.IllegalStateException: BUG: GenomeLoc 11:69653434-69653483 has a size == 50 but the variation reference allele has length 51 this = [VC variant @ 11:69653434-69653483 Q. of type=MNP alleles=[ATGTGATCAATTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG*, TTGGGTTAATTTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG, TTTGTCTCAATTTTGACTTTATTCTTTTACCGCTCTTTTCCAAAAAGGGTA] attr={AC=0, AF=0.0, AN=2, HOMLEN=0, SVTYPE=RPL, set=variant} GT=[[TUMOR_pindel.variant ATGTGATCAATTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG*/ATGTGATCAATTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG*]]
        at htsjdk.variant.variantcontext.VariantContext.validateStop(VariantContext.java:1327)
        at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1294)
        at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:401)
        at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:494)
        at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:488)
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:1363)
        at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:361)
        at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:143)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-1-0-gf15c1c3ef):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: BUG: GenomeLoc 11:69653434-69653483 has a size == 50 but the variation reference allele has length 51 this = [VC variant @ 11:69653434-69653483 Q. of type=MNP alleles=[ATGTGATCAATTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG*, TTGGGTTAATTTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG, TTTGTCTCAATTTTGACTTTATTCTTTTACCGCTCTTTTCCAAAAAGGGTA] attr={AC=0, AF=0.0, AN=2, HOMLEN=0, SVTYPE=RPL, set=variant} GT=[[TUMOR_pindel.variant ATGTGATCAATTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG*/ATGTGATCAATTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG*]]
##### ERROR ------------------------------------------------------------------------------------------

When I look at a.vcf, I see there is this line:

11  69653434    .   ATGTGATCAATTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG TTTGTCTCAATTTTGACTTTATTCTTTTACCGCTCTTTTCCAAAAAGGGTA .   PASS    END=69653483;HOMLEN=0;SVLEN=-51;SVTYPE=RPL;NTLEN=51 GT:AD   0/0:141,1

But I do not see anything wrong with this line. Could you please help me here?
This is my a.vcf file:

##fileformat=VCFv4.0
##fileDate=20190101
##source=sampleA
##reference=GRCh38
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=1,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=PF,Number=1,Type=Integer,Description="The number of samples carry the variant">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=NTLEN,Number=.,Type=Integer,Description="Number of bases inserted in place of deleted code">
##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Reference depth, how many reads support the reference">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth, how many reads support this allele">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sampleA
11  69653433    .   TA  T   .   PASS    END=69653434;HOMLEN=0;SVLEN=-1;SVTYPE=DEL   GT:AD   0/0:96,1
11  69653434    .   ATGTGATCAA  TTGGGTTAAT  .   PASS    END=69653442;HOMLEN=0;SVLEN=-10;SVTYPE=RPL;NTLEN=10 GT:AD   0/0:106,2
11  69653434    .   ATGTGATCAATTTTGACTTAATGTGATTACTGCTCTATTCCAAAAAGGTTG TTTGTCTCAATTTTGACTTTATTCTTTTACCGCTCTTTTCCAAAAAGGGTA .   PASS    END=69653483;HOMLEN=0;SVLEN=-51;SVTYPE=RPL;NTLEN=51 GT:AD   0/0:141,1
11  69653438    .   GA  G   .   PASS    END=69653439;HOMLEN=0;SVLEN=-1;SVTYPE=DEL   GT:AD   0/0:102,1
11  69653550    .   T   TGGCGGGCAGACACGCGGGCGCGATCCCACACAGGCTGGCGGGGGGCGGGCCCCCGGGCGCC  .   PASS    END=69653550;HOMLEN=44;HOMSEQ=GGCGGGCAGACACGCGGGCGCGATCCCACACAGGCTGGCGGGGG;SVLEN=61;SVTYPE=INS  GT:AD   0/0:110,1
11  69653562    .   A   ACGCGGGCGCGATCCCACACAGGCTGGCGGGGGGCGGGGCCCCCGGCCC   .   PASS    END=69653562;HOMLEN=32;HOMSEQ=CGCGGGCGCGATCCCACACAGGCTGGCGGGGG;SVLEN=48;SVTYPE=INS  GT:AD   0/0:100,1

Any help will be appreciated. Thank you.


