Channel: Recent Discussions — GATK-Forum

How to run the entire pipeline (using even Spark tools) from Java?


I am trying to write a Java pipeline that follows the GATK Best Practices, in particular one that handles more than one input sample.
As a first step, I am trying to use FastqToSam (not mandatory for the Best Practices, but required when starting from FASTQ samples), BwaAndMarkDuplicatesPipelineSpark, and BQSRPipelineSpark.

For example, with FastqToSam I use this simple approach, which lets me "sparkify" the command across several samples and even obtain some speedup:

JavaRDD<String> rdd_fastq_r1_r2 = sc.parallelize(fastq_r1_r2);

createBashScript(gatkCommand);

JavaRDD<String> bashExec = rdd_fastq_r1_r2.pipe("/path/script.sh");

where fastq_r1_r2 is a list of Strings holding the paths of the samples to process.
In short, for each pair of paired-end read files I execute a bash command (specifically, the command explained here) inside Spark's pipe method:

java -Xmx8G -jar picard.jar FastqToSam [...]
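As an illustration of the per-pair command, here is a minimal sketch of how the bash line could be assembled in Java before being written into the script. The class name and the specific FastqToSam arguments shown (FASTQ, FASTQ2, OUTPUT, SAMPLE_NAME) are my own illustrative choices, not part of the pipeline above:

```java
// Sketch: assemble the Picard FastqToSam command line for one pair of
// paired-end FASTQ files. All paths and the sample name are illustrative.
public class FastqToSamCommand {

    public static String build(String fastq1, String fastq2,
                               String outputBam, String sampleName) {
        return String.join(" ",
                "java", "-Xmx8G", "-jar", "picard.jar", "FastqToSam",
                "FASTQ=" + fastq1,
                "FASTQ2=" + fastq2,
                "OUTPUT=" + outputBam,
                "SAMPLE_NAME=" + sampleName);
    }

    public static void main(String[] args) {
        System.out.println(build("r1.fastq", "r2.fastq", "unmapped.bam", "sampleA"));
    }
}
```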

However, this approach does not work with the GATK Spark tools, such as BwaAndMarkDuplicatesPipelineSpark and BQSRPipelineSpark.

So, is there another way to execute these Spark tools from Java code? For example, a post from about 4.5 years ago suggested using org.broadinstitute.sting.gatk.CommandLineGATK, but that class is no longer available.
Moreover, is there any kind of Java API (and, if so, a tutorial) for calling your methods directly, much as one would use the Spark API, without going through bash commands?
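One possibility would be to call GATK4's command-line entry point in the current JVM instead of shelling out. The sketch below assumes the GATK4 Main class (org.broadinstitute.hellbender.Main) is on the classpath at runtime; it is looked up reflectively here so the file compiles without the GATK jar, and the argument layout (tool name first, then tool options) mirrors what the gatk wrapper script passes along:

```java
import java.util.Arrays;

// Sketch: run a GATK4 tool in-process by delegating to the same entry
// point the command line uses. The Main class name is an assumption
// based on GATK4's package layout; verify it against your GATK version.
public class InProcessGatk {

    // Prepend the tool name to its arguments, as the CLI expects.
    public static String[] buildArgs(String toolName, String... toolArgs) {
        String[] args = new String[toolArgs.length + 1];
        args[0] = toolName;
        System.arraycopy(toolArgs, 0, args, 1, toolArgs.length);
        return args;
    }

    public static void runTool(String toolName, String... toolArgs) throws Exception {
        String[] args = buildArgs(toolName, toolArgs);
        // Reflective lookup so this file compiles without GATK on the classpath.
        Class<?> gatkMain = Class.forName("org.broadinstitute.hellbender.Main");
        gatkMain.getMethod("main", String[].class).invoke(null, (Object) args);
    }

    public static void main(String[] args) {
        // Only demonstrate argument assembly here; runTool needs the GATK jar.
        System.out.println(Arrays.toString(
                buildArgs("BQSRPipelineSpark", "-I", "input.bam", "-O", "recal.bam")));
    }
}
```

With this helper, each pair of samples could be dispatched to runTool from plain Java instead of a generated bash script.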

Thanks for your time; I hope my questions are clear.
Nicholas


