I tried to process data with BQSRPipelineSpark (the latest released GATK4 beta version), but it only works when the input is small. To illustrate, I ran experiments on data extracted from ERR000589: the pipeline succeeds with 10,000 SAM records (a 2 MB SAM file) but fails once the input grows beyond 10,000 records (about 20 MB). I also tried increasing each executor's memory from 30 GB to 60 GB, but the program still fails.
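For reference, a record-count subset like the ones above can be produced along these lines (a sketch; it assumes samtools is installed, and the subset file names are illustrative):

samtools view -H ERR000589.bwa.mark.bam > subset.sam                  # copy the header only
samtools view ERR000589.bwa.mark.bam | head -n 10000 >> subset.sam    # append the first 10,000 records
samtools view -b subset.sam > ERR000589.subset.bam                    # convert back to BAM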
The known-sites file is dbsnp_138.hg19.vcf, about 10 GB.
The reference is ucsc.hg19.2bit, about 0.8 GB.
The job ran on Spark 2.0 with 4 workers in total; each node has 16 physical cores and 64 GB of memory.
Below is my command.
./gatk-launch BQSRPipelineSpark -I hdfs:///user/liucheng/ERR000589.bwa.mark.bam -O hdfs:///user/liucheng/ERR000589.bwa.mark.bqsr.bam -R hdfs:///user/liucheng/refs/ucsc.hg19.fasta --knownSites hdfs:///user/liucheng/dbsnp/dbsnp_138.hg19.vcf -joinStrategy SHUFFLE -- --sparkRunner SPARK --sparkMaster spark://cu11:7077 --total-executor-cores 48 --executor-cores 6 --executor-memory 25G --driver-memory 30G
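As a sanity check on the setup (an illustrative check, not part of the pipeline itself), the reference on HDFS should have its companion files next to it, since htsjdk's IndexedFastaSequenceFile needs the .fai index for random access and GATK needs the .dict sequence dictionary:

hdfs dfs -ls /user/liucheng/refs/
# expect ucsc.hg19.fasta alongside ucsc.hg19.fasta.fai and ucsc.hg19.dict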
The log is attached as follows:
[July 19, 2017 2:45:18 PM CST] org.broadinstitute.hellbender.tools.spark.pipelines.BQSRPipelineSpark done. Elapsed time: 3.57 minutes.
Runtime.totalMemory()=1559232512
org.apache.spark.SparkException: Job aborted due to stage failure: Task 25 in stage 5.0 failed 4 times, most recent failure: Lost task 25.3 in stage 5.0 (TID 418, 192.168.0.10, executor 1): htsjdk.samtools.SAMException: Unable to load chr13(72100194, 72110026) from /user/liucheng/refs/ucsc.hg19.fasta
at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:247)
at org.broadinstitute.hellbender.engine.datasources.ReferenceHadoopSource.getReferenceBases(ReferenceHadoopSource.java:33)
at org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource.getReferenceBases(ReferenceMultiSource.java:99)
at org.broadinstitute.hellbender.engine.spark.ShuffleJoinReadsWithRefBases.lambda$addBases$cff38836$1(ShuffleJoinReadsWithRefBases.java:123)
at org.broadinstitute.hellbender.engine.spark.ShuffleJoinReadsWithRefBases$$Lambda$78/1542874140.call(Unknown Source)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.lambda$apply$26a6df3e$1(BaseRecalibratorSparkFn.java:27)
at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn$$Lambda$89/403397455.call(Unknown Source)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-796786996-222.201.145.253-1457530889871:blk_1074761762_1022478 file=/user/liucheng/refs/ucsc.hg19.fasta
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:930)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:609)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:841)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:889)
at java.io.DataInputStream.read(DataInputStream.java:149)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at hdfs.jsr203.HadoopFileSystem$3.read(HadoopFileSystem.java:478)
at htsjdk.samtools.reference.IndexedFastaSequenceFile.readFromPosition(IndexedFastaSequenceFile.java:292)
at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:244)
... 32 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1127)
at org.apache.spark.api.java.JavaRDDLike$class.treeAggregate(JavaRDDLike.scala:439)
at org.apache.spark.api.java.AbstractJavaRDDLike.treeAggregate(JavaRDDLike.scala:45)
at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.apply(BaseRecalibratorSparkFn.java:39)
at org.broadinstitute.hellbender.tools.spark.pipelines.BQSRPipelineSpark.runTool(BQSRPipelineSpark.java:110)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:353)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:38)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
at org.broadinstitute.hellbender.Main.main(Main.java:230)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: htsjdk.samtools.SAMException: Unable to load chr13(72100194, 72110026) from /user/liucheng/refs/ucsc.hg19.fasta
at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:247)
at org.broadinstitute.hellbender.engine.datasources.ReferenceHadoopSource.getReferenceBases(ReferenceHadoopSource.java:33)
at org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource.getReferenceBases(ReferenceMultiSource.java:99)
at org.broadinstitute.hellbender.engine.spark.ShuffleJoinReadsWithRefBases.lambda$addBases$cff38836$1(ShuffleJoinReadsWithRefBases.java:123)
at org.broadinstitute.hellbender.engine.spark.ShuffleJoinReadsWithRefBases$$Lambda$78/1542874140.call(Unknown Source)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn.lambda$apply$26a6df3e$1(BaseRecalibratorSparkFn.java:27)
at org.broadinstitute.hellbender.tools.spark.transforms.BaseRecalibratorSparkFn$$Lambda$89/403397455.call(Unknown Source)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-796786996-222.201.145.253-1457530889871:blk_1074761762_1022478 file=/user/liucheng/refs/ucsc.hg19.fasta
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:930)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:609)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:841)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:889)
at java.io.DataInputStream.read(DataInputStream.java:149)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at hdfs.jsr203.HadoopFileSystem$3.read(HadoopFileSystem.java:478)
at htsjdk.samtools.reference.IndexedFastaSequenceFile.readFromPosition(IndexedFastaSequenceFile.java:292)
at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:244)
... 32 more
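Since the root cause is an HDFS BlockMissingException on the reference FASTA, block health for that file seems worth checking. A sketch of the check, using the standard HDFS fsck command with the path from my setup:

hdfs fsck /user/liucheng/refs/ucsc.hg19.fasta -files -blocks -locations
# lists every block of the FASTA, its replica count, and the datanodes holding it;
# a corrupt or missing replica here would explain why blk_1074761762_1022478 could not be read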