Are there any Broad-specific instructions for using GATK?
In general you should use FireCloud, which has all the major GATK workflows preloaded, is more scalable and makes it easier to share any work you do with external collaborators, since the portal is...
Docker - container - image - registry
A container is something quite similar to a virtual machine, which can be used to contain and execute all the software required to run a particular program or set of programs. The container includes an...
(How to) Run GATK in a Docker container
This document explains how to install and use Docker to run GATK on a local machine. For a primer on what Docker containers are for and related terminology, see this Dictionary entry. Contents Install...
GenomicsDB
GenomicsDB is a datastore format developed by our collaborators at Intel to store variant call data (where "datastore" = something that we mere mortals can think of as a database, even though IT...
Google Dataproc - Spark cluster service
Dataproc is Google's Spark cluster service, which you can use to run GATK tools that are Spark-enabled very quickly and efficiently. To use it, you need a Google login and billing account, as well as...
(How to) Create a Spark cluster on Google Dataproc
As noted in our brief primer on Dataproc, there are two ways to create and control a Spark cluster on Dataproc: through a form in Google's web-based console, or directly through gcloud, a.k.a. Google...
Errors about input files having missing or incompatible contigs
These errors occur when the names or sizes of contigs don't match between input files. This is a classic problem that typically happens when you get some files from collaborators, you try to use them...
Errors in SAM/BAM files can be diagnosed with ValidateSamFile
The problem: You're trying to run a GATK or Picard tool that operates on a SAM or BAM file, and getting some cryptic error that doesn't clearly tell you what's wrong. Bits of the stack trace (the pile...
Allele Depth (AD) is lower than expected
The problem: You're trying to evaluate the support for a particular call, but the numbers in the DP (total depth) and AD (allele depth) fields aren't making any sense. For example, the sum of all the...
Can't use VQSR on non-model organism or small dataset
The problem: Our preferred method for filtering variants after the calling step is to use VQSR, a.k.a. recalibration. However, it requires well-curated training/truth resources, which are typically not...
Errors about contigs in BAM or VCF files not being properly ordered or sorted
This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for...
Missing annotations in the output callset VCF
The problem: You specified -A <some annotation> in a command line invoking one of the annotation-capable tools (HaplotypeCaller, MuTect2, GenotypeGVCFs and VariantAnnotator), but that annotation...
Expected variant at a specific site was not called
This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV. There are several possibilities,...
Need to run programs that require different versions of Java
The problem: We sometimes need to be able to use multiple versions of Java on the same computer to run command-line tools that have different version requirements. For example, at one point, GATK...
Errors about misencoded quality scores
The problem: You get an error like this: "SAM/BAM/CRAM file <filename> appears to be using the wrong encoding for quality scores". Why this happens: The standard format for quality score encodings is...
Errors about read group (RG) information
See the Dictionary entry on read groups for more information about what they represent and why they're very important. Note that the command line examples in this article have not yet been updated for...
Java version issues
As documented here, GATK requires a particular major version of Java. If you try to run it with any other version, you'll get an error that will include this line: Unsupported major.minor version To...
Pipelining recommendations
We use Cromwell + WDL for all batch execution purposes. WDL is a community-driven, user-friendly scripting language managed by the OpenWDL organization. Cromwell is an open-source workflow execution...
GATK on Amazon Web Services
We are soon adding support for running Cromwell on AWS Batch, integrating with AWS products. This will allow you to log in with your AWS credentials, access your files in S3, and run your WDL files...
GATK on Google Cloud
At this time we are able to offer two services for running WDL workflows on Google Cloud using the Cromwell execution engine and the Google Pipelines API. Note that while access to both of these...