Folks, it really makes my day when I get to announce some good news that has been cooking for a long time. So this is going to be a very happy Humpday indeed.
The good news (which I may have hinted at previously) is that we are making our production pipeline scripts public, starting with the one that implements our Best Practices for data pre-processing and initial variant calling (aka GVCF generation) in whole genomes. Not only that, all the GRCh38/hg38 resource files needed to run it, plus test data, are in a Google Cloud bucket. In time the bucket will replace our not-so-reliable FTP server as our bundle-sharing mechanism.
Details below the fold, in FAQ format (sort of).
TL;DR: Take this script and run it, for it is our production WGS processing workflow (uBAMs -> per-sample GVCF).
Wait, what? The GATK dev team is sharing a pipeline script?
Yup. I know some old-timers out there will be shaking their heads in disbelief. In the past we were very reluctant to share scripts because our internal scripts were very infrastructure-specific and difficult to provide support for (because, well, Scala). But now that we have a beautifully simple workflow language, WDL, that can run pretty much anywhere (cough cluster support coming soon cough), we're a lot more comfortable sharing our scripts.
What are you hoping to achieve with this?
The scientific equivalent of peace, love and understanding: reproducibility, economy of effort and, uh, understanding. That last one goes both ways, it turns out.
No but really, what's the hidden agenda?
Frankly? From our support team's point of view, it's all about reducing support burden. There's a big gap between our Best Practices recommendations, which were always meant to be a generic, platonically ideal representation of The Scientifically Correct Way to do variant calling with GATK, and an actual pipeline implementation that represents A Technically Valid Way to run the Best Practices in practice. That gap causes a lot of head-scratching, a great deal of energy spent reinventing the same wheel over and over again across the globe, and, in the process, a great many forum questions.
So we hope that sharing this script (and others to come) will help fill the gap by providing researchers with either a fully-baked solution (if their use case is the same as ours) or at least a solid blueprint that they can tweak without too much difficulty. In theory that should lead to fewer questions about how to run Best Practices, and more time for everyone to do more interesting things, like cure cancer and/or watch wacky cat videos on YouTube.
That sounds reasonable. Now get to the point -- the script?
Alright, alright, I'm getting to the interesting stuff.
What we're sharing today is the workflow that we use in production to process the Broad's whole genomes, from unaligned BAMs (uBAMs) all the way to HaplotypeCaller GVCFs. Its official name is PublicPairedSingleSampleWf because it's designed to run per sample on paired-end reads, though we may sometimes refer to it as just "the single-sample pipeline". Don't look at me, the engineers named it. I just added "Public" at the front.
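To give a sense of the overall per-sample shape (and nothing more), here is a drastically simplified, hypothetical WDL sketch: one analysis-ready BAM in, one GVCF out. It is NOT the actual PublicPairedSingleSampleWf -- the workflow name, task, file paths, docker identifier and GATK invocation details below are placeholders I made up for illustration, and the real script covers the whole uBAM-to-GVCF chain (alignment, duplicate marking, BQSR, QC, etc.).

```
## Hypothetical, heavily simplified sketch -- NOT the actual PublicPairedSingleSampleWf.
## Shows only the general per-sample shape (one BAM in, one GVCF out).
workflow HypotheticalSingleSampleSketch {
  String sample_name
  File analysis_ready_bam
  File analysis_ready_bam_index
  File ref_fasta
  File ref_fasta_index
  File ref_dict

  call HaplotypeCallerGvcf {
    input:
      sample_name = sample_name,
      input_bam = analysis_ready_bam,
      input_bam_index = analysis_ready_bam_index,
      ref_fasta = ref_fasta,
      ref_fasta_index = ref_fasta_index,
      ref_dict = ref_dict
  }

  output {
    HaplotypeCallerGvcf.output_gvcf
  }
}

# Single illustrative task: GATK 3.x-style HaplotypeCaller in GVCF mode
task HaplotypeCallerGvcf {
  String sample_name
  File input_bam
  File input_bam_index
  File ref_fasta
  File ref_fasta_index
  File ref_dict

  command {
    java -jar /path/to/GenomeAnalysisTK.jar \
      -T HaplotypeCaller \
      -R ${ref_fasta} \
      -I ${input_bam} \
      -ERC GVCF \
      -o ${sample_name}.g.vcf.gz
  }
  output {
    File output_gvcf = "${sample_name}.g.vcf.gz"
  }
  runtime {
    # Placeholder; in practice use the docker image identified in the real script's header
    docker: "your-dockerhub-org/your-image:tag"
  }
}
```

The real thing is much longer, but the pattern is the same: workflow-level inputs, a chain of task calls, and per-task command, output and runtime blocks.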
Note that it's really only half of the pipeline; the second half, which runs the GVCFs through joint discovery and filtering, is done by a second script. We're comfortable sharing the first script because it's mature enough that we don't anticipate it changing much. In contrast, the second one is still being actively worked on (as we port our local pipelines to the cloud) and we can't commit to releasing and supporting it quite yet. But we will do so as soon as we can.
SHOW ME THE SCRIPT!
That's not a question, but okay. The PublicPairedSingleSampleWf script is written in WDL and lives here, in a directory of the WDL repository dedicated to hosting all the pipeline scripts that we will make public. Crucially, the script comes with all the things that you need to run it:
- The header, specifying scope of application, expectations and input requirements;
- The DockerHub identifier of the docker image containing all the software that is used in the pipeline (but note that the usual GATK licensing rules apply);
- An example JSON file that specifies inputs, including resources and test data, which are all available in a Google Cloud bucket (a minimal illustration of the format follows this list);
- An example JSON file that specifies cloud-based runtime options;
- A document that explains everything that happens in the pipeline (what tools are run, in what order and with what parameters) with particular focus on how the implementation relates to the theoretical Best Practices.
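For orientation, a Cromwell inputs JSON simply maps fully-qualified workflow input names (workflow name, then input name) to values. A hypothetical fragment matching the made-up sketch above -- the names and gs:// paths are placeholders, not the real example file -- might look like this:

```
{
  "HypotheticalSingleSampleSketch.sample_name": "NA12878",
  "HypotheticalSingleSampleSketch.analysis_ready_bam": "gs://your-bucket/NA12878.bam",
  "HypotheticalSingleSampleSketch.analysis_ready_bam_index": "gs://your-bucket/NA12878.bai",
  "HypotheticalSingleSampleSketch.ref_fasta": "gs://your-bucket/Homo_sapiens_assembly38.fasta",
  "HypotheticalSingleSampleSketch.ref_fasta_index": "gs://your-bucket/Homo_sapiens_assembly38.fasta.fai",
  "HypotheticalSingleSampleSketch.ref_dict": "gs://your-bucket/Homo_sapiens_assembly38.dict"
}
```

The example JSON we provide follows the same convention, just with the real workflow's (much longer) list of inputs.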
Why no direct links to the various files?
I'm not linking to the files directly because they are date-stamped (pending a more formal versioning system TBD by the engineering team) and the current versions will eventually be supplanted by newer ones -- but past versions remain accessible in the archive.
In the near future, and as we add more scripts, we will add versioned links on the corresponding Best Practices pages to make it super easy to find these reference implementation scripts and accompanying resources.
So does it actually work out of the box? How do I run it?
Well, yes, with caveats.
It works out of the box, no questions asked, if you have access to a service on Google Cloud that runs the Cromwell execution engine. Like we do. (What do you mean, "that's not very helpful"?)
If you don't (hello, majority of the world), there is a light at the end of the tunnel, in two forms:
- It will run on the FireCloud platform (which, although presented largely as a cancer genome analysis platform, is not technically restricted to cancer work) pending a few tweaks to the FireCloud backend. Not sure what the ETA is, but it's on the engineering group's roadmap. We'll announce widely when this is ready.
- Our friendly collaborators at Google are putting the final touches on a Cromwell execution service that will run this WDL on the Google Genomics platform. We'll announce that when it's ready as well.
The cloud-free alternative for now is to run it on your local machine (cluster support coming soon). You'll need to download all the files from the cloud bucket or substitute your own. The Getting Started section of the WDL documentation includes installation and execution instructions (see also the rough example at the end of this answer). The Cromwell engine will ignore the cloud-specific settings that are included in the WDL when you run locally, so the script should run seamlessly on the local backend -- but if you run into any trouble at the Crom/WDL level, you're welcome to ask for help in the WDL forum. Conveniently, it is a sister forum to the GATK one, and does not require a separate registration. The same goes for the FireCloud forum, by the way.
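For concreteness, once you have Cromwell and the input files in place, a local run looks roughly like this. The filenames below are placeholders, and the exact syntax depends on the Cromwell version you download (newer releases take --inputs/--options flags rather than positional arguments), so defer to the Getting Started instructions:

```
# Hypothetical invocation; adjust filenames and syntax to your Cromwell version
java -jar cromwell-<version>.jar run \
  PublicPairedSingleSampleWf.wdl \
  PublicPairedSingleSampleWf.inputs.json
```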
Can anyone modify and redistribute the script?
Sure, feel free! But if you do so, please also modify the filenames and header of the script accordingly, to avoid any misunderstandings about the identity and origin of the script you redistribute. Specifically, we'd like to avoid situations where someone asks us for help troubleshooting a script that they think is ours but that turns out to have been modified by someone else. We'll still try to help, but will know to look for differences that could explain any unexpected behavior. Time saved dealing with that sort of problem translates to more/better documentation and support of other things.
How can I get notified when scripts are added or updated?
Subscribe to forum notifications, to the GATK blog RSS feed, or follow @GATK_dev and/or @WDL_dev on Twitter.