I tried to ERR000589 process data with BwaSpark. The bam file size is 1.3G. The average time spent is about 25 min (5 nodes).
However it would only cost 5 min in processing same data if I tried to use original C bwa with 32 threads.
Base on this observation, I have several questions list as follow:
1. If there is anything wrong with my params?
2. For each Partition, is BwaSpark running in multi-thread mode?
3. How to control the number of the bwa threads inside BwaSpark?
P.S.
The running command is:
./gatk-launch BwaSpark -I hdfs:///user/XX/ERR000589/ERR000589.bam -O hdfs:///user/XX/ERR000589/ERR000589_bwa.bam -R hdfs:///user/xx/refs/ucsc.hg19.fasta --bwamemIndexImage ~/data/ref/ucsc.hg19.img -disableSequenceDictionaryValidation true -- --sparkRunner SPARK --sparkMaster --executor-cores 1 --total-executor-cores 16 --executor-memory 4G
I tried to further adjust the following parameters,
--executor-cores --total-executor-cores --executor-memory --driver-memory
but none of these took less time than 16 min
Besides, I alsow tried to run it in local mode, while it won't end successfully. It seems that CPU was in endless waiting. I guess it occupied so much memory that the swap space is in use? Pic 1 shows the memory consumed while running
This time, the command is:
./gatk-launch BwaSpark -I hdfs:///user/XX/ERR000589/ERR000589.bam -O hdfs:///user/xx/ERR000589/ERR000589.bwa.bam -R /software/home/xx/data/ref/ucsc.hg19.fasta \ --bwamemIndexImage ~/data/ref/ucsc.hg19.img -disableSequenceDictionaryValidation true -- --sparkRunner SPARK --sparkMaster local[*] --total-executor-cores 8 --executor-memory 20G --driver-memory 30G
BTW, the testing environment is:
CPU 2 X 8 physical core
node: 5
network: GBE
memory: 64G