I'm having a strange issue running GATK4 tools on a Dataproc cluster. I'm submitting from a Broad VM with an empty bash profile. As an example, here's what happens when I try to reproduce this tutorial. I'm running these commands from inside my GATK repo, which is checked out at the current master branch:
$ use .google-cloud-sdk-98.0.0
$ use Java-1.8
$ gsutil ls -lr gs://gatk-test-data/exome_bam/1000G_wex_hg38/HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.bam
This shows me the correct file size.
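For reference, here's roughly how I confirm that both dotkits actually took effect before touching the cluster (output elided; the exact version strings will depend on the dotkit):

$ gcloud version
$ java -version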
Then, I spin up a Dataproc cluster with image version 1.1 as instructed. I'm able to ssh into the master node and confirm that it's running Java 1.8 as well.
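For completeness, the cluster was created with a command along these lines (the zone, worker count, and cluster name shown here are placeholders rather than my exact settings):

$ gcloud dataproc clusters create cluster-8ed1 \
    --image-version 1.1 \
    --zone us-central1-a \
    --num-workers 2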
The problem occurs when I try to run any command via gatk-launch:
$ ./gatk-launch FlagStatSpark \
-I gs://gatk-tutorials/how-to/6484_snippet.bam \
--disableReadFilter WellformedReadFilter \
-- --sparkRunner GCS --cluster cluster-8ed1
The output from this particular command shows the generated gcloud command and the error I get:
Using GATK jar /xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar
jar caching is disabled because GATK_GCS_STAGING is not set
please set GATK_GCS_STAGING to a bucket you have write access too in order to enable jar caching
add the following line to you .bashrc or equivalent startup script
export GATK_GCS_STAGING=gs://<my_bucket>/
Replacing spark-submit style args with dataproc style args
--cluster cluster-8ed1 -> --cluster cluster-8ed1 --properties spark.driver.userClassPathFirst=true,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600
Running:
gcloud dataproc jobs submit spark --cluster cluster-8ed1 --properties spark.driver.userClassPathFirst=true,spark.io.compression.codec=lzf,spark.driver.maxResultSize=0,spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true ,spark.kryoserializer.buffer.max=512m,spark.yarn.executor.memoryOverhead=600 --jar /xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar -- FlagStatSpark -I gs://gatk-tutorials/how-to/6484_snippet.bam --disableReadFilter WellformedReadFilter --sparkMaster yarn
Copying file:///xchip/scarter/dmccabe/software/gatk/build/libs/gatk-spark.jar [Content-Type=application/octet-stream]...
Uploading ...827d4-e9bd-470f-b4e6-0b95e5dd676f/gatk-spark.jar: 124.84 MiB/124.84 MiB
Job [7d532aeb-6a3b-4e2e-8b43-187374e33104] submitted.
Waiting for job output...
USAGE: <program name> [-h]
Available Programs:
--------------------------------------------------------------------------------------
<snip>
Exception in thread "main" org.broadinstitute.hellbender.exceptions.UserException: '--' is not a valid command.
This is the same error you'd get if you ran ./gatk-launch -- instead of an actual tool name. I get this error for any tool name and options I specify.
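To illustrate, the minimal local reproduction is just the bare separator (output trimmed to the matching line):

$ ./gatk-launch --
Exception in thread "main" org.broadinstitute.hellbender.exceptions.UserException: '--' is not a valid command.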
I can see that the command does get sent to the cluster with -- FlagStatSpark as the first tokens of the application arguments.
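One way to confirm what the cluster actually received is to describe the submitted job using the job ID from the log above; if I'm reading the job resource correctly, the application arguments are listed under the Spark job entry (newer SDK versions may also ask for the cluster's region):

$ gcloud dataproc jobs describe 7d532aeb-6a3b-4e2e-8b43-187374e33104

The argument list there starts with "--", so the separator is being forwarded straight to the jar.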
Why is this happening? Is there something wrong with the GCS dotkit? Has something changed with the GATK?