GenomicsDB is a datastore format developed by our collaborators at Intel to store variant call data (where "datastore" = something that we mere mortals can think of as a database, even though IT professionals insist that it's a completely different thing). The long-term vision is that ultimately we will use this datastore format as an alternative to VCF files for storing and working with variant data. For now though, we are only actively using it as a GVCF consolidation tool in the germline joint-calling workflow.
Note that at the moment GenomicsDB only supports diploid data; our Intel collaborators are working on implementing support for non-diploid data, but in the meantime if you need to work with non-diploid data you'll need to use CombineGVCFs instead.
There are currently three supported operations you can do with a GenomicsDB datastore: create a new GenomicsDB datastore from one or more GVCFs, joint-call it, and extract sample data from it.
Contents
- Create a new GenomicsDB datastore from one or more GVCFs
- Joint-call samples in a GenomicsDB datastore
- Extract data from a GenomicsDB datastore
1. Create a new GenomicsDB datastore from one or more GVCFs
The goal of this operation is to consolidate a set of GVCFs into a single datastore that GenotypeGVCFs can run on (because GenotypeGVCFs can only take a single input). To do this via GenomicsDB, we use the GenomicsDBImport tool. This tool takes in one or more single-sample GVCFs (multi-sample GVCFs are not supported); it imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output a VCF.
Here's what a typical command looks like:
gatk-launch GenomicsDBImport \
-V data/gvcfs/mother.g.vcf \
-V data/gvcfs/father.g.vcf \
-V data/gvcfs/son.g.vcf \
--genomicsDBWorkspace my_database \
--intervals 20
This command generates a directory called my_database containing the combined GVCF data.
Note that the GVCFs can also be passed in as a list or map instead of being enumerated in the command. However, the --intervals argument value must be a single interval, not a list, because this functionality was designed from the start to be used from within a script that scatters execution over multiple intervals. We'd like to enable running on one or more intervals in one go, but we might not get to that for a while, so for now you need to run on each interval separately.
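As a rough illustration of running on each interval separately, here is a minimal sketch of a scatter loop in plain POSIX shell. It only prints one GenomicsDBImport command per interval (a dry run) using the same arguments as the command above. The extra intervals (21 and 22), the import_command helper, and the one-workspace-per-interval naming scheme (my_database_<interval>) are assumptions made for illustration, not something GATK prescribes.

```shell
# Sketch only: build one GenomicsDBImport command per interval.
# GVCFS and INTERVALS are whitespace-separated lists, so paths must not contain spaces.
GVCFS="data/gvcfs/mother.g.vcf data/gvcfs/father.g.vcf data/gvcfs/son.g.vcf"
INTERVALS="20 21 22"   # hypothetical: only interval 20 appears in the example above

# Print the import command for a single interval.
# The workspace name my_database_<interval> is an assumed naming scheme.
import_command() {
    target=$1
    cmd="gatk-launch GenomicsDBImport"
    for gvcf in $GVCFS; do
        cmd="$cmd -V $gvcf"
    done
    printf '%s\n' "$cmd --genomicsDBWorkspace my_database_${target} --intervals ${target}"
}

# Dry run: print one command per interval instead of executing it.
for iv in $INTERVALS; do
    import_command "$iv"
done
```

Once the printed commands look right, you could pipe the output to a shell, or hand each line to whatever job scheduler you use, to actually run the imports.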
Note also that at the moment you can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options that help make this reasonably quick. Overall it's much more scalable than the old CombineGVCFs route anyway (sorry, non-diploids!).
2. Joint-call samples in a GenomicsDB datastore
Once you have a GenomicsDB datastore containing GVCF data from one or more samples, you can run GenotypeGVCFs on it to joint-call the samples it contains.
Here's an example command:
gatk-launch GenotypeGVCFs \
-R data/ref/ref.fasta \
-V gendb://my_database \
-G StandardAnnotation -newQual \
-O test_output.vcf
This will produce a multi-sample VCF with all the usual bells and whistles.
Note the gendb:// prefix to the database input directory path. That's the only difference compared to a regular GenotypeGVCFs command, but it's an important one: if you forget the prefix, you will get a big fat error.
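If you build these commands from a script, one way to avoid forgetting the prefix is a small helper that adds it only when it is missing. The sketch below is plain POSIX shell; gendb_uri is a hypothetical helper name, not part of GATK, and the script only prints the resulting GenotypeGVCFs command rather than running it.

```shell
# Sketch only: gendb_uri is a hypothetical helper, not a GATK feature.
# It prints its argument with the gendb:// prefix, adding the prefix only if absent.
gendb_uri() {
    case "$1" in
        gendb://*) printf '%s\n' "$1" ;;
        *)         printf 'gendb://%s\n' "$1" ;;
    esac
}

# Dry run: print the joint-calling command from the example above.
printf '%s\n' "gatk-launch GenotypeGVCFs -R data/ref/ref.fasta -V $(gendb_uri my_database) -G StandardAnnotation -newQual -O test_output.vcf"
```

Because the helper leaves an already-prefixed value untouched, it is safe to call on a workspace name regardless of where that name came from.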
3. Extract data from a GenomicsDB datastore
If you want to generate a flat multi-sample GVCF file from a GenomicsDB you created, you can do so with SelectVariants as follows:
gatk-launch SelectVariants \
-R data/ref/ref.fasta \
-V gendb://my_database \
-O combined.g.vcf
You can use any of the usual SelectVariants modifiers to extract, for example, only a subset of samples or a subset of genomic intervals. This can be useful for troubleshooting variant calls when you want to look at what the intermediate GVCF looked like, since it's not possible to view the calls in the GenomicsDB itself in a human-readable way.
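As an example of such a subset extraction, the sketch below prints a SelectVariants command that pulls a single sample over a single region. It assumes the usual SelectVariants short arguments -sn (sample name) and -L (interval); check them against the documentation for your GATK version. The sample name son, the region 20:1000000-2000000, the output file naming, and the select_command helper are all hypothetical. In particular, the sample name must match the name recorded in the GVCF header, which is not necessarily the file name.

```shell
# Sketch only: print a SelectVariants command for one sample over one region.
# -sn and -L are assumed to be the sample-name and interval arguments in your GATK version.
select_command() {
    sample=$1
    region=$2
    printf '%s\n' "gatk-launch SelectVariants -R data/ref/ref.fasta -V gendb://my_database -sn ${sample} -L ${region} -O ${sample}.g.vcf"
}

# Dry run with a hypothetical sample name and region.
select_command son 20:1000000-2000000
```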