In GATK4, the GenotypeGVCFs tool can only take a single input: 1) a single single-sample GVCF, 2) a single multi-sample GVCF created by CombineGVCFs, or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case), you will need to combine them before feeding them to GenotypeGVCFs. The input samples must possess genotype likelihoods containing the <NON_REF> allele, as produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION.
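For orientation, here is a minimal sketch of how such a GVCF is produced per sample; the BAM path data/bams/mother.bam is hypothetical, while the other paths match the trio example below:

# the BAM path below is hypothetical
gatk HaplotypeCaller \
-R data/ref/ref.fasta \
-I data/bams/mother.bam \
-O data/gvcfs/mother.g.vcf \
-ERC GVCF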
Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.
GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).
Using GenomicsDBImport in practice
The GenomicsDBImport tool takes in one or more single-sample GVCFs, imports the data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.
So if you have a trio of GVCFs, your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):
gatk GenomicsDBImport \
-V data/gvcfs/mother.g.vcf \
-V data/gvcfs/father.g.vcf \
-V data/gvcfs/son.g.vcf \
--genomicsdb-workspace-path my_database \
--intervals chr20 \
--intervals chr21
That generates a directory called my_database containing the combined GVCF data for chromosomes 20 and 21. (The contents of the directory are not really human-readable; see "Addendum: extracting GVCF data from the GenomicsDB" below to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like Buffer resized from 178298bytes to 262033 -- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous; a sketch using both options follows.
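Here is a hedged sketch of such a larger-cohort import; the file cohort.sample_map and the sample names in it are hypothetical. The sample map is a plain tab-delimited file with one sample name and the path to that sample's GVCF per line:

# cohort.sample_map and its sample names are hypothetical
gatk GenomicsDBImport \
--sample-name-map cohort.sample_map \
--genomicsdb-workspace-path my_database \
--batch-size 50 \
--intervals chr20

where cohort.sample_map contains tab-separated lines like:

sample1	gvcfs/sample1.g.vcf.gz
sample2	gvcfs/sample2.g.vcf.gz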
Then you run joint genotyping; note the gendb:// prefix to the database input directory path. Also note that this step requires a reference, even though the import can be run without one.
gatk GenotypeGVCFs \
-R data/ref/ref.fasta \
-V gendb://my_database \
-new-qual \
-O test_output.vcf
And that's all there is to it.
Important limitations and common “gotchas”:
You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.
At least one interval must be provided when using GenomicsDBImport.
Input GVCFs cannot contain multiple entries for a single genomic position.
GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs; see the sketch after this list) or scatter the following steps by chromosome as well.
The annotation counts specified in the header MUST BE VALID! If not, you may see an error like A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV with mention of a core dump (which may or may not be output depending on your system configuration). You can check your annotation headers with vcf-validator from VCFtools [https://github.com/vcftools/vcftools]
GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.
If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4, etc.), you must set the environment variable TILEDB_DISABLE_FILE_LOCKING=1 before running any GenomicsDB tool. If you don’t, you will likely see an error like Could not open array genomicsdb_array at workspace:[...]
HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample callsets, MNPs can be inferred from the phase set (PS) tag in the FORMAT field.
There are a few other rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues [https://github.com/broadinstitute/gatk/issues?utf8=✓&q=is:issue+is:open+genomicsdb] to see if a fix is in progress.
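As mentioned in the list above, per-chromosome GVCFs for one sample can be concatenated with GatherVcfs before import. A minimal sketch, assuming hypothetical per-chromosome files mother.chr20.g.vcf and mother.chr21.g.vcf (inputs must be given in reference order):

# the per-chromosome input names are hypothetical
gatk GatherVcfs \
-I data/gvcfs/mother.chr20.g.vcf \
-I data/gvcfs/mother.chr21.g.vcf \
-O data/gvcfs/mother.g.vcf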
If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.
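For reference, a minimal sketch of consolidating the same trio with CombineGVCFs (same file paths as the import example above; the output name trio.g.vcf is hypothetical). The resulting combined GVCF is then passed to GenotypeGVCFs via -V:

# the output name trio.g.vcf is hypothetical
gatk CombineGVCFs \
-R data/ref/ref.fasta \
-V data/gvcfs/mother.g.vcf \
-V data/gvcfs/father.g.vcf \
-V data/gvcfs/son.g.vcf \
-O data/gvcfs/trio.g.vcf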
Addendum: extracting GVCF data from the GenomicsDB
If you want to generate a flat multi-sample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:
gatk SelectVariants \
-R data/ref/ref.fasta \
-V gendb://my_database \
-O combined.g.vcf
Bells and Whistles
GenomicsDB now supports allele-specific annotations [https://software.broadinstitute.org/gatk/documentation/article?id=9622], which have become standard in our Broad exome production pipeline.
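As a hedged sketch (see the linked article for authoritative details), allele-specific annotations are requested by adding the AS annotation group at the HaplotypeCaller step and again at the GenotypeGVCFs step:

# the BAM path is hypothetical, as above
gatk HaplotypeCaller \
-R data/ref/ref.fasta \
-I data/bams/mother.bam \
-O data/gvcfs/mother.g.vcf \
-ERC GVCF \
-G StandardAnnotation -G AS_StandardAnnotation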
GenomicsDB can now import directly from a Google cloud path (i.e. gs://) using NIO.
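For instance, a -V argument can point at a bucket object directly; a minimal sketch with a hypothetical bucket path:

# the gs:// bucket path is hypothetical
gatk GenomicsDBImport \
-V gs://my-bucket/gvcfs/mother.g.vcf \
--genomicsdb-workspace-path my_database \
--intervals chr20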