Hi,
I am attempting to run GEMINI within a Docker container through WDL and Cromwell. I installed GEMINI without its annotation data, since the data is too large to bake into a Docker image (and doing so is bad practice anyway). So I need to download the data elsewhere and make it available to the gemini binary. Locally, on my own machine without WDL, I would run something like the following:
docker run --rm -v /path/to/local/gemini/data:/path/to/container/gemini/data -i gemini load -t VEP -v my.vcf my.db
At the bottom I have outlined my submission script (gcloud alpha genomics pipelines run) and the YAML configuration for background. The crux of my problem is that, with the Broad wdl_runner Docker image, I am unsure what the procedure is for mounting data into the container.
According to the WDL documentation, on local backends Docker is invoked by default as follows:
docker run --rm -v <cwd>:<docker_cwd> -i <docker_image> /bin/bash < <script>
Now suppose I have my data in a Google bucket at gs://my_bucket/data_for_gemini. How would I write the WDL so that this bucket directory is mounted (or otherwise made accessible) inside the container for gemini to use?
Example WDL:
task Gemini {
  File my_vcf
  # how to pass an entire google bucket directory as a target site?

  command {
    # define mounts in here somehow?
    gemini load -t VEP -v ${my_vcf} out.db
  }

  runtime {
    # define mounts in here?
    docker: "gcr.io/my_containers/gemini"
    memory: "4 GB"
    cpu: "1"
  }

  output {
    File gemini_db = "out.db"
  }
}
One inelegant solution I have considered is running Docker-in-Docker and mounting the data that way, but I wanted to know whether there is a better, more elegant approach.
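The other workaround I have sketched out is to skip mounting entirely and localize the bucket contents at the start of the command block with gsutil. This is only a sketch: it assumes gsutil is installed inside my gcr.io/my_containers/gemini image, and the String input gemini_data_gs and the /gemini/data target path are names I made up for illustration:

```wdl
task Gemini {
  File my_vcf
  # Hypothetical: pass the bucket path as a plain String, not a File,
  # so Cromwell does not try to localize it as a single file
  String gemini_data_gs   # e.g. "gs://my_bucket/data_for_gemini"

  command {
    # Copy the annotation data into the container at task start
    # (assumes gsutil is available in the image)
    mkdir -p /gemini/data
    gsutil -m cp -r ${gemini_data_gs}/* /gemini/data/
    gemini load -t VEP -v ${my_vcf} out.db
  }

  runtime {
    docker: "gcr.io/my_containers/gemini"
    memory: "4 GB"
    cpu: "1"
  }

  output {
    File gemini_db = "out.db"
  }
}
```

This avoids Docker-in-Docker, but it re-downloads the data on every task invocation, which is why I am hoping there is a proper mount mechanism instead.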
-- Derrick DeConti
My submission script is:
gcloud alpha genomics pipelines run \
    --pipeline-file wdl_pipeline.yaml \
    --zones us-east1-b \
    --logging gs://dfci-cccb-pipeline-testing/logging \
    --inputs-from-file WDL=VariantCalling.cloud.wdl \
    --inputs-from-file WORKFLOW_INPUTS=VariantCalling.cloud.inputs.json \
    --inputs-from-file WORKFLOW_OPTIONS=VariantCalling.cloud.options.json \
    --inputs WORKSPACE=gs://dfci-cccb-pipeline-testing/workspace \
    --inputs OUTPUTS=gs://dfci-cccb-pipeline-testing/outputs
The wdl_pipeline.yaml it references is as follows:
name: WDL Runner
description: Run a workflow defined by a WDL file
inputParameters:
- name: WDL
  description: Workflow definition
- name: WORKFLOW_INPUTS
  description: Workflow inputs
- name: WORKFLOW_OPTIONS
  description: Workflow options
- name: WORKSPACE
  description: Cloud Storage path for intermediate files
- name: OUTPUTS
  description: Cloud Storage path for output files
docker:
  imageName: gcr.io/broad-dsde-outreach/wdl_runner
  cmd: >
    /wdl_runner/wdl_runner.sh
resources:
  minimumRamGb: 1