1. Introduction
GATK-Queue is a command-line scripting framework for defining multi-stage genomic analysis pipelines, combined with an execution manager that runs those pipelines from end to end. Processing genome data often involves several steps to produce outputs. For example, our BAM-to-VCF calling pipeline includes, among other things:
- Local realignment around indels
- Emitting raw SNP calls
- Emitting indels
- Masking the SNPs at indels
- Annotating SNPs using chip data
- Labeling suspicious calls based on filters
- Creating a summary report with statistics
Running these tools one by one in series can take weeks, and speeding them up with parallel resources would otherwise require custom scripting.
With a Queue script users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.
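The idea of running multiple copies of the same program on different portions of the genome can be sketched in plain Scala. This is a toy model, not Queue's API: Interval, scatter, process, gather, and run are hypothetical names, Scala Futures stand in for cluster jobs, and the per-piece "work" is just counting bases.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ScatterGatherSketch {
  // A half-open genomic interval [start, end) on one contig (hypothetical type).
  final case class Interval(contig: String, start: Int, end: Int)

  // Scatter: split one interval into at most `parts` contiguous, non-overlapping pieces.
  def scatter(region: Interval, parts: Int): Seq[Interval] = {
    require(parts > 0, "parts must be positive")
    val length = region.end - region.start
    val size = math.max(1, math.ceil(length.toDouble / parts).toInt)
    (region.start until region.end by size).map { s =>
      Interval(region.contig, s, math.min(s + size, region.end))
    }
  }

  // Stand-in for the per-piece work; a real pipeline would launch a tool here.
  def process(piece: Interval): Int = piece.end - piece.start

  // Gather: merge the partial results back into a single output.
  def gather(partials: Seq[Int]): Int = partials.sum

  // Run every piece as an independent task, wait for all of them, then gather.
  def run(region: Interval, parts: Int): Int = {
    val jobs = scatter(region, parts).map(piece => Future(process(piece)))
    gather(Await.result(Future.sequence(jobs), 1.minute))
  }

  def main(args: Array[String]): Unit = {
    val total = run(Interval("chr1", 0, 1000), 4)
    println("bases processed: " + total)
  }
}
```

The pieces do not depend on each other, so they can run at the same time; only the gather step has to wait for all of them.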
2. Obtaining Queue
You have two options: download the binary distribution (prepackaged, ready to run program) or build it from source.
- Download the binary
This is obviously the easiest way to go. Links are on the Downloads page. Just get the Queue package; no need to get the GATK package separately as GATK is bundled in with Queue.
- Building Queue from source
Briefly, here's what you need to know/do:
Queue is part of the GATK repository. Download the source from the public repository on GitHub by running the following command:
git clone https://github.com/broadgsa/gatk.git
IMPORTANT NOTE: These instructions refer to the MIT-licensed version of the GATK+Queue source code. With that version, you will be able to build Queue itself, as well as the public portion of the GATK (the core framework), but that will not include the GATK analysis tools. If you want to use Queue to pipeline the GATK analysis tools, you need to clone the 'protected' repository. Please note however that part of the source code in that repository (the 'protected' module) is under a different license which excludes for-profit use, modification and redistribution.
Move to the Git root directory and use Maven to build the source:
mvn clean verify
All dependencies will be managed by Maven as needed.
See this article on how to test your installation of Queue.
3. Running Queue
See this article on running Queue for the first time for full details.
Queue arguments can be listed by running with --help:
java -jar dist/Queue.jar --help
To list the arguments required by a QScript, add the script with -S and run with --help:
java -jar dist/Queue.jar -S script.scala --help
Note that by default Queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands, execute the pipeline by adding -run.
See QFunction and Command Line Options for more info on adjusting Queue options.
4. QScripts
General Information
Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.
Every QScript includes the following steps:
- New instances of CommandLineFunctions are created
- Input and output arguments are specified on each function
- The function is added with add() to Queue for dispatch and monitoring
The basic command line for running a Queue pipeline is:
java -jar Queue.jar -S <script>.scala
See the main article Queue QScripts for more info on QScripts.
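The three steps listed above (create functions, declare their inputs and outputs, add them) can be pictured with a small self-contained model. This is a sketch in current Scala rather than Queue's actual API: Step, Pipeline, and waves are hypothetical names, the file names are made up, and it assumes that a scheduler orders jobs by matching each function's declared outputs against the inputs of the others.

```scala
object QScriptStepsSketch {
  // Hypothetical stand-in for a command-line function and its declared files.
  final case class Step(name: String, inputs: Set[String], outputs: Set[String])

  final class Pipeline {
    private var steps = Vector.empty[Step]

    // Stand-in for add(): the step is only registered here, nothing runs yet.
    def add(step: Step): Unit = { steps = steps :+ step }

    // Group the registered steps into waves. Every step in a wave has all of
    // its inputs available, so the steps of one wave could run in parallel.
    def waves(initialFiles: Set[String]): Vector[Vector[Step]] = {
      @annotation.tailrec
      def loop(pending: Vector[Step], available: Set[String],
               acc: Vector[Vector[Step]]): Vector[Vector[Step]] =
        if (pending.isEmpty) acc
        else {
          val (ready, blocked) = pending.partition(_.inputs.subsetOf(available))
          require(ready.nonEmpty, "unsatisfiable inputs: " + blocked.map(_.name).mkString(", "))
          loop(blocked, available ++ ready.flatMap(_.outputs), acc :+ ready)
        }
      loop(steps, initialFiles, Vector.empty)
    }
  }

  // Steps are added in an arbitrary order; the declared files determine the order of execution.
  def example(): Vector[Vector[String]] = {
    val pipeline = new Pipeline
    pipeline.add(Step("mask", Set("snps.vcf", "indels.vcf"), Set("masked.vcf")))
    pipeline.add(Step("snps", Set("realigned.bam"), Set("snps.vcf")))
    pipeline.add(Step("indels", Set("realigned.bam"), Set("indels.vcf")))
    pipeline.add(Step("realign", Set("my.bam"), Set("realigned.bam")))
    pipeline.waves(Set("my.bam")).map(_.map(_.name))
  }

  def main(args: Array[String]): Unit =
    example().zipWithIndex.foreach { case (wave, i) =>
      println("wave " + (i + 1) + ": " + wave.mkString(", "))
    }
}
```

In this model the order of the add calls does not matter: "realign" ends up in the first wave, "snps" and "indels" share the second, and "mask" runs last.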
Supported QScripts
Most QScripts are analysis pipelines that are custom-built for specific projects, and we currently do not offer any QScripts as supported analysis tools. However, we do provide some example scripts that you can use as a basis for writing your own QScripts (see below).
Example QScripts
The latest versions of the example files are available in the Sting GitHub repository under public/scala/qscript/examples.
5. Visualization and Queue
QJobReport
Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.
Note that Queue attempts to generate a standard visualization using an R script in the public/R directory of the GATK repository. You must provide a path to this location if you want the script to run automatically. Additionally, the script requires the gsalib package to be installed on the machine, which is typically done by providing its path in your .Rprofile file:
bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile
.libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")
Note that gsalib is available from the CRAN repository so you can install it with the canonical R package install command.
Caveats
The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the previously completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves.
This feature only works for the command-line and LSF execution models. SGE should be easy to add for a motivated individual, but we cannot test this capability here at the Broad. Please send us a patch if you do extend Queue to support SGE.
DOT visualization of Pipelines
Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like DOT, OmniGraffle, etc. to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline.
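To give a feel for what a DOT description of a pipeline looks like, here is a minimal sketch that turns a list of jobs into DOT text with short labels. It does not reproduce the layout of Queue's own queue.dot, which is not shown here: Job, toDot, the job labels, and the file names are all hypothetical, and edges are drawn on the assumption that a job depends on whichever job produces one of its input files.

```scala
object DotSketch {
  // Hypothetical job description: a display label plus the files it reads and writes.
  final case class Job(label: String, inputs: Set[String], outputs: Set[String])

  // Escape double quotes so a label cannot end the quoted DOT string early.
  private def quote(s: String): String = "\"" + s.replace("\"", "\\\"") + "\""

  // Emit one node per job and one edge for each producer -> consumer file dependency.
  def toDot(jobs: Seq[Job]): String = {
    val nodes = jobs.map(job => "  " + quote(job.label) + ";")
    val edges = for {
      from <- jobs
      to <- jobs
      if from.outputs.exists(file => to.inputs.contains(file))
    } yield "  " + quote(from.label) + " -> " + quote(to.label) + ";"
    (Seq("digraph pipeline {") ++ nodes ++ edges ++ Seq("}")).mkString("\n")
  }

  // Made-up two-step example: the second job consumes the first job's output.
  val example: Seq[Job] = Seq(
    Job("CountCovariates: my.bam", Set("my.bam"), Set("my.recal.csv")),
    Job("TableRecalibration: my.bam", Set("my.bam", "my.recal.csv"), Set("my.recal.bam"))
  )

  def main(args: Array[String]): Unit = println(toDot(example))
}
```

Saved to a file, text like this can be rendered with Graphviz, for example with dot -Tpng pipeline.dot -o pipeline.png. Short labels keep the graph readable where full command lines would not.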
To clarify your pipeline, override the dotString() function:
class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
  @Input(doc="foo") var bam = bamIn
  @Input(doc="foo") var bamIndex = bai(bamIn)
  @Output(doc="foo") var recalData = recalDataIn
  memoryLimit = Some(4)

  // Short label shown for this job in queue.dot instead of the full command line
  override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)

  def commandLine = gatkCommandLine("CountCovariates") + args +
    " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}
Here we only see a short label such as CountCovariates: my.bam [args -OQ] in the dot file, rather than the full command line. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here: