GATK has always been kind of a beast to get started with -- command-line program, many different tools under the hood, complex algorithms, multi-step pipelines, scale of computational resources involved... Plenty of challenges to go around, especially if you don't have a lot of computational experience.
We want to make it easier for anyone to try out the GATK Best Practices without investing a whole lot of time and effort up front. To that end, we're now using a cloud-based platform called Terra to share the GATK Best Practices as fully-configured pipelines that work right out of the box on example data that we provide, complemented by Jupyter Notebooks that walk you through the logic, operation and results of each step. We've already been using this approach in our popular workshop series with encouraging results, and we're planning to convert all our tutorials to Jupyter Notebooks that can be run in Terra. We don't expect all of you to adopt Terra for your work, but this feels like the best way we can empower you to get started with GATK.
The Terra platform is developed by our colleagues in the Data Sciences Platform at the Broad. It's free to access, and we have funding to give every new account $300 in credits to cover computing and storage costs (which are billed by Google Cloud), so anyone can go in and try the pipelines at no cost and with minimal effort. If you've heard of FireCloud before, this is essentially the same platform, but with a redesigned interface that makes it more user-friendly.
We've set up the Best Practices pipelines in fully furnished workspaces so you can poke at them, see how they work and examine the results they produce on example data. Then, and this is where I think it gets really exciting, you can upload your own data to test how the pipelines perform on that. When a new version comes out, you can test it quickly and decide whether the new results make it worth upgrading or whether you can wait until the next version. (The GATK engine team is developing some additional infrastructure to publish systematic benchmarks for every release, but that's still a few months down the road at least.) We're also working to provide utilities for common ancillary tasks like converting between formats; for example, converting FASTQs you received from your sequence provider into the unmapped BAMs that our pre-processing workflow takes as input.
We've been using Terra in our most recent workshops, and we're really encouraged by the responses we've gotten so far as well as the educational opportunities it offers. The user-friendly access to cloud compute means participants can run full-scale pipelines without worrying about computational infrastructure. The support for Jupyter Notebooks makes it much easier to run interactive hands-on tutorials during workshops and to distribute the workshop materials for self-service learning to anyone who can't make it to a workshop.
There's a lot to unpack on this topic, so we're going to roll out a series of blog posts explaining what you can do with the GATK resources we publish in Terra, how to get started and where to go from there. Stay tuned and make sure to follow the blog or gatk_dev on Twitter.
Getting started with GATK is easier on Terra