How long does it take, Part 2

A while back, I posted this article about work done by the Intel Bio Team to benchmark the speed and resource utilization of each step in the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF). They published their results as a white paper on the Intel Life Sciences website, which has a section dedicated to GATK (which makes us feel all warm and tingly).

Now the Intel team has published an updated version of the white paper here that extends the work, originally done on a WGS trio, to a cohort of 50 exomes and adds the joint analysis segment of the pipeline (GenotypeGVCFs to VQSR; GVCFs to filtered multisample VCF) for both datasets.

As before, the paper does a great job of showing where the performance bottlenecks are and where you can get the biggest speed increases by parallelizing execution.

My commentary from the previous post still applies pretty much equally to this updated version, except now we have performance profiles for GenotypeGVCFs and the VQSR tools as well, which I'll comment on briefly (using WGS but the profiles are similar for exomes).

The biggest takeaway here is that the runtime of GenotypeGVCFs scales down almost linearly with how widely you parallelize it, which is obviously great news if you're in a rush and you have access to lots of machines. But pay attention here to the meaning of "thread count" in the context of the paper! As a reminder, most of the parallelization it presents (including at the GenotypeGVCFs step) is achieved through scatter-gather (parallelizing over predetermined genomic intervals), not by multithreading using -nt and/or -nct. In our own production pipelines we don't use -nt/-nct multithreading at all, and in GATK4 we're abandoning them and replacing the functionality with Spark support wherever it makes sense. Why am I pointing this out here? Because we're finding that GenotypeGVCFs is especially difficult to parallelize through multithreading, due to the complexity of dealing with overlapping events across multiple samples (which occur more often as cohort size increases). In the recent GATK 3.7 release, we added some functionality to deal better with overlapping deletions -- and now we're getting reports that this breaks when multithreading is turned on (cue the poop emoji). The safest way to deal with this? Don't use multithreading with GenotypeGVCFs; use scatter-gather instead (ask me how in the comments).
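To make the scatter-gather idea a bit more concrete, here is a minimal Python sketch of the scatter half: it cuts contigs into intervals and builds one GATK3-style GenotypeGVCFs command per interval. It only constructs the command lines; it doesn't run anything. The helper names, the toy contig lengths, the file names and the fixed-size chunking are all assumptions made for illustration and are not how the paper or our production pipelines define intervals (those use predetermined interval lists, and naive fixed-size cuts can land in the middle of an event).

```python
from typing import Dict, List, Tuple

# (contig, start, end), 1-based and inclusive, as in GATK -L strings
Interval = Tuple[str, int, int]


def scatter_intervals(contig_lengths: Dict[str, int], chunk_size: int) -> List[Interval]:
    """Split each contig into consecutive, non-overlapping chunks (hypothetical helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    intervals: List[Interval] = []
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk_size - 1, length)
            intervals.append((contig, start, end))
            start = end + 1
    return intervals


def genotype_command(interval: Interval, gvcfs: List[str], reference: str,
                     out_path: str, jar: str = "GenomeAnalysisTK.jar") -> List[str]:
    """Build (but do not run) a GenotypeGVCFs command restricted to one interval."""
    contig, start, end = interval
    cmd = ["java", "-jar", jar, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]  # one -V per input GVCF
    cmd += ["-L", f"{contig}:{start}-{end}", "-o", out_path]
    return cmd


if __name__ == "__main__":
    # Toy contig lengths and placeholder file names, for illustration only.
    toy_contigs = {"chrA": 2500, "chrB": 1000}
    gvcfs = ["sample1.g.vcf", "sample2.g.vcf"]
    for i, interval in enumerate(scatter_intervals(toy_contigs, 1000)):
        print(" ".join(genotype_command(interval, gvcfs, "ref.fasta", f"shard_{i:04d}.vcf")))
```

Each command can then go to a separate machine or job slot. The gather half is to concatenate the per-interval VCFs in genomic order into a single file; that step isn't shown here.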

Also, don't parallelize VQSR. Look at the graph; it's not worth it. VQSR needs to see all of the things most of the time.

Finally, I should add that having the exome numbers to compare to the WGS numbers is a big upgrade -- it really gives you a sense of the scale of the practical implications of choosing to work with one datatype versus the other. All other sciencey considerations being equal (which they're not, but let's pretend) the computational resource commitment is massively different. Which is hardly news to our Ops team that processed Daniel MacArthur's ludicrously large gnomAD dataset, let me tell you -- for reference, the final joint VCF on that was ~22TB for 20K genomes. That's a big part of why we run our whole genomes on the cloud. It's real Big Data, no hype needed.
