In a nutshell, Apache Spark is a software framework that GATK4 uses for multithreading, a form of parallelization that allows a computer (or cluster of computers) to finish executing a task sooner. You can read more about multithreading and parallelism in GATK here. The Spark software library is open-source and maintained by the Apache Software Foundation. It is widely used in the computing industry and is one of the most promising technologies for accelerating the execution of analysis pipelines.
Spark in GATK
Here are the key things you need to know, whether you plan to use a Spark cluster or not:
- Not all GATK tools make use of Spark; tools that do have a note to that effect in their respective Tool Doc.
- Some GATK tools exist in distinct Spark-capable and non-Spark-capable versions; the "sparkified" versions carry the suffix "Spark" in their names. Many of these are still experimental; down the road we plan to consolidate them so that there is only one version per tool.
- Some of the newer GATK tools (mainly CNV tools right now) only exist in a Spark-capable version; those don't have the "Spark" suffix.
- You don't need a Spark cluster to run Spark-capable tools! If you're working on a "normal" machine (even just a laptop) with multiple CPU cores, the GATK engine can use Spark to create a virtual standalone cluster in place and take advantage of however many cores are available on the machine -- or however many you choose to allocate (see the example command after this list). The local-Spark tutorial has more information on how to control this. And if your machine only has a single core, these tools can always be run in single-core mode -- it'll just take longer for them to finish.
- If you do have access to a Spark cluster, the Spark-capable tools are going to be extra happy, but you may need to provide some additional parameters to use them effectively; a sketch of a cluster submission also follows below. See the cluster-Spark tutorial and the sparkified tools' respective Tool Docs for detailed recommendations.
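For a concrete sense of what this looks like, here is a minimal sketch of running a Spark-capable tool locally. The file names are placeholders, and it assumes the GATK4 convention of passing Spark-specific arguments after a "--" separator (check the local-Spark tutorial for the exact syntax in your version):

```
# Run MarkDuplicatesSpark on a single machine, using 4 local cores.
# Tool arguments come first; Spark arguments go after the "--" separator.
gatk MarkDuplicatesSpark \
    -I input.bam \
    -O output_markdup.bam \
    -- \
    --spark-master 'local[4]'
```

Swapping local[4] for local[*] asks Spark to use every core available on the machine; with no Spark arguments at all, the tool simply runs in single-core mode.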
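And here is what a submission to an actual Spark cluster might look like, again as a hedged sketch: the spark:// master URL and the HDFS paths are hypothetical placeholders, and the exact runner options depend on your cluster setup (the cluster-Spark tutorial is the authoritative reference):

```
# Submit the same tool to a standalone Spark cluster.
# The cluster URL and HDFS paths below are hypothetical placeholders.
gatk MarkDuplicatesSpark \
    -I hdfs://namenode/data/input.bam \
    -O hdfs://namenode/data/output_markdup.bam \
    -- \
    --spark-runner SPARK \
    --spark-master spark://my-cluster-host:7077
```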