Channel: Recent Discussions — GATK-Forum

What are the smallest units I can break whole human genomes into, for scatter-gather?


Hi, and thank you so much for the wonderful tools and support! :smile:

For our current project, we'd like to run 2000+ whole genomes from FASTQ to VCF using GATK best practices.

I'd like to optimize the runtime, in particular of GenotypeGVCFs.

Previously, we have used -nt with GenotypeGVCFs for parallelism.
With GATK 3.7, using threads (-nt) with GenotypeGVCFs has always crashed due to what appears to be lack of thread safety.
From what I've read on this forum, this is a known issue, and users are urged to use a scatter-gather approach.

If I understand correctly, scatter-gather for GenotypeGVCFs would entail splitting the combined, multi-sample, whole-genome gVCFs into, say, combined, multi-sample, per-chromosome gVCFs, and then running GenotypeGVCFs on each per-chromosome gVCF on a cluster, in single-threaded mode?
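To make the per-chromosome idea concrete, here is a rough sketch of the jobs I have in mind. It assumes the scatter is expressed with the -L interval argument against the same combined gVCF, rather than by physically writing per-chromosome files. The file names (reference.fasta, combined.g.vcf, GenomeAnalysisTK.jar), the output naming, and the contig names "1" to "22" are placeholders, and `genotype_gvcfs_command` is just a hypothetical helper, not anything from GATK itself.

```python
from typing import List


def genotype_gvcfs_command(
    interval: str,
    reference: str = "reference.fasta",   # placeholder path
    combined_gvcf: str = "combined.g.vcf",  # placeholder path
    jar: str = "GenomeAnalysisTK.jar",    # placeholder path
) -> List[str]:
    """Build one single-threaded GATK 3.x GenotypeGVCFs command for one interval.

    No -nt argument is passed, so each job runs with a single data thread.
    """
    return [
        "java", "-jar", jar,
        "-T", "GenotypeGVCFs",
        "-R", reference,
        "-V", combined_gvcf,
        "-L", interval,
        "-o", f"genotyped.{interval}.vcf",
    ]


# One job per autosome; contig names are assumed to be "1".."22" here and
# would need to match whatever the reference actually uses (e.g. "chr1").
autosomes = [str(n) for n in range(1, 23)]
commands = [genotype_gvcfs_command(contig) for contig in autosomes]

# Print the first command as it would be handed to the batch system.
print(" ".join(commands[0]))
```

Each entry in `commands` would be submitted as its own cluster job, and the resulting per-chromosome VCFs would be concatenated in the gather step.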

Please correct me if this understanding is not accurate. I have read the Parallelism and Scatter-Gather pages on the forums.

If my understanding of scatter-gather is accurate, then it seems that to get the best performance when scaling out, you would want to subset the multi-sample, whole-genome gVCFs into as-small-as-reasonably-possible gVCFs, so that you could run hundreds or thousands of them in parallel on the cluster.

E.g., partition the multi-sample, whole-genome gVCFs into, say, 10 kb regions over each chromosome, yielding ~300,000 multi-sample gVCFs. Then submit those to your batch/queue system and run each as its own invocation of GenotypeGVCFs with -nt 1.
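For illustration, the 10 kb partitioning could be sketched like this. The contig names and lengths below are toy values so the output is easy to check; real lengths would come from the reference's sequence dictionary or .fai index. `fixed_size_intervals` and `as_interval_string` are hypothetical helpers of my own, and the intervals are 1-based and inclusive, written as contig:start-end.

```python
from typing import Dict, Iterator, Tuple


def fixed_size_intervals(
    contig_lengths: Dict[str, int], size: int = 10_000
) -> Iterator[Tuple[str, int, int]]:
    """Yield (contig, start, end) windows of at most `size` bases.

    Coordinates are 1-based and inclusive; the last window on each
    contig is truncated at the contig end.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, size):
            yield contig, start, min(start + size - 1, length)


def as_interval_string(contig: str, start: int, end: int) -> str:
    """Format a window as contig:start-end."""
    return f"{contig}:{start}-{end}"


# Toy contig lengths (not real chromosome sizes).
toy_lengths = {"chrA": 25_000, "chrB": 10_000}
intervals = [as_interval_string(*iv) for iv in fixed_size_intervals(toy_lengths)]
print(intervals)
# ['chrA:1-10000', 'chrA:10001-20000', 'chrA:20001-25000', 'chrB:1-10000']
```

With a roughly 3 Gb genome and 10 kb windows, this yields on the order of 300,000 intervals, each of which would become one single-threaded job.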

However, the recommendations on this forum, for whole-genome data, tend to be to split at the chromosome level.
This would limit your parallelism to 22 jobs if you were running the human autosomes.
And if there are a large number of high-coverage samples, and you're forced into single-threaded mode, this will not be efficient.

So, what is the smallest unit into which one can break up whole-genome data for GenotypeGVCFs?

