I am trying to run a pipeline in the style of https://github.com/broadinstitute/wdl/tree/develop/scripts/broad_pipelines. The step picard SortSam | picard SetNmMdAndUqTags fails because BWA aligned part of some reads beyond the end of the chromosome. Is this normal behaviour for BWA MEM? I checked that the md5sums of the reference files used by bwa are the same as those of
gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38*
Did I make a mistake in the pipeline, or am I missing something obvious?
Below I provide the output of the commands and a minimal SAM file to reproduce the problem.
With the default VALIDATION_STRINGENCY (STRICT):
$picard SortSam TMP_DIR=. INPUT=minimal.bam OUTPUT=/dev/stdout SORT_ORDER="coordinate" |$picard SetNmMdAndUqTags INPUT=/dev/stdin OUTPUT=NA12878_sorted.bam REFERENCE_SEQUENCE=/cvmfs/softdrive.nl/maartenk/project_mine/HG38/reference/Homo_sapiens_assembly38.fasta
[Mon Mar 13 12:02:58 CET 2017] picard.sam.SortSam INPUT=minimal.bam OUTPUT=/dev/stdout SORT_ORDER=coordinate TMP_DIR=[.] VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Mar 13 12:02:58 CET 2017] picard.sam.SetNmMdAndUqTags INPUT=/dev/stdin OUTPUT=NA12878_sorted.bam REFERENCE_SEQUENCE=/cvmfs/softdrive.nl/maartenk/project_mine/HG38/reference/Homo_sapiens_assembly38.fasta IS_BISULFITE_SEQUENCE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Mar 13 12:02:58 CET 2017] Executing as maartenk@ui on Linux 2.6.32-642.6.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
[Mon Mar 13 12:02:58 CET 2017] Executing as maartenk@ui on Linux 2.6.32-642.6.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14; Picard version: 2.9.0-1-gf5b9f50-SNAPSHOT
[Mon Mar 13 12:02:58 CET 2017] picard.sam.SortSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=251658240
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 1, Read name C4681ANXX:1:1109:2167753:0, Mate Alignment start (59024838) must be <= reference sequence length (57227415) on reference chrY
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:796)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.<init>(BAMFileReader.java:769)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.<init>(BAMFileReader.java:757)
at htsjdk.samtools.BAMFileReader.getIterator(BAMFileReader.java:465)
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.iterator(SamReader.java:473)
at picard.sam.SortSam.doWork(SortSam.java:99)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
[Mon Mar 13 12:02:58 CET 2017] picard.sam.SetNmMdAndUqTags done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=251658240
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Input must be coordinate-sorted for this program to run. Found: unsorted
at picard.sam.SetNmMdAndUqTags.doWork(SetNmMdAndUqTags.java:96)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
Relaxing VALIDATION_STRINGENCY to LENIENT does not help, because SetNmMdAndUqTags then crashes:
[Mon Mar 13 12:04:50 CET 2017] picard.sam.SetNmMdAndUqTags done. Elapsed time: 0.21 minutes.
Runtime.totalMemory()=440926208
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 59024837
at htsjdk.samtools.util.SequenceUtil.sumQualitiesOfMismatches(SequenceUtil.java:497)
at picard.sam.AbstractAlignmentMerger.fixNmMdAndUq(AbstractAlignmentMerger.java:563)
at picard.sam.SetNmMdAndUqTags.lambda$doWork$0(SetNmMdAndUqTags.java:106)
at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:372)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at picard.sam.SetNmMdAndUqTags.doWork(SetNmMdAndUqTags.java:107)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
Running picard ValidateSamFile on the minimal example reports:
ERROR: Record 1, Read name C4681ANXX:1:1109:2167753:0, Mate Alignment start (59024838) must be <= reference sequence length (57227415) on reference chrY
ERROR: Record 1, Read name C4681ANXX:1:1109:2167753:0, Mate CIGAR M operator maps off end of reference
ERROR: Record 2, Read name C4681ANXX:1:1109:2167753:0, Alignment start (59024838) must be <= reference sequence length (57227415) on reference chrY
ERROR: Record 2, Read name C4681ANXX:1:1109:2167753:0, Read CIGAR M operator maps off end of reference
A minimal example file:
@HD VN:1.5 GO:none SO:queryname
@SQ SN:chrY LN:57227415 M5:ce3e31103314a704255f3cd90369ecce UR:file:/scratch2/testingmicro_MergeBamAlignment_3/Homo_sapiens_assembly38.fasta
@SQ SN:chrY_KI270740v1_random LN:37240 M5:69e42252aead509bf56f1ea6fda91405 UR:file:/scratch2/testingmicro_MergeBamAlignment_3/Homo_sapiens_assembly38.fasta
@RG ID:0 PL:ILLUMINA SM:NA12878 PU:C4681ANXX:1:none
@RG ID:1 PL:ILLUMINA SM:NA12878 PU:C4681ANXX:2:none
@RG ID:2 PL:ILLUMINA SM:NA12878 PU:C468BANXX:1:none
@PG ID:bwamem VN:0.7.15-r1140 CL:bwa mem -K 100000000 -p -v 3 -t 8 /cvmfs/softdrive.nl/maartenk/project_mine/HG38/reference/Homo_sapiens_assembly38.fasta PN:bwamem
@PG ID:MarkDuplicates VN:2.9.0-1-gf5b9f50-SNAPSHOT CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/data/scratch/10366498.batch.gina.sara.nl/NA12878_0.bam, /data/scratch/10366498.batch.gina.sara.nl/NA12878_1.bam, /data/scratch/10366498.batch.gina.sara.nl/NA12878_2.bam] OUTPUT=NA12878MarkDuplicates.bam METRICS_FILE=NA12878MarkDuplicates.metrics.txt ASSUME_SORT_ORDER=queryname OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 TMP_DIR=[.] VALIDATION_STRINGENCY=SILENT COMPRESSION_LEVEL=1 MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> VERBOSITY=INFO QUIET=false MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates PP:bwamem
C4681ANXX:1:1109:2167753:0 99 chrY 56878349 0 94M31S = 59024838 2146524 CTCGATTTCATCCAAAAGTTTGGGCAGTGATCCCATCCACACTANNNNNNNNNNNNNNNNNNNNNAGTATAACACCAGGACATCGAGNNNATGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF!!!!!!!!!!!!!!!!!!!!!<<<FFFFFFFFFFFFFFFFFFF!!!<<BF!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! BC:Z:none MC:Z:90S35M MD:Z:44A0A0A0A0A0G0G0A0A0G0A0G0A0T0C0A0T0A0C0A0G22A0T0C4 PG:Z:MarkDuplicates RG:Z:0 NM:i:24 SM:i:416 MQ:i:60 AS:i:46 ZX:i:1944 ZY:i:10212
C4681ANXX:1:1109:2167753:0 147 chrY 59024838 60 90S35M = 56878349 -2146524 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNTTTGTATATTTTACCANNA !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<!FFFFFFFFFFFFFF<<!!B BC:Z:none MC:Z:94M31S PG:Z:MarkDuplicates RG:Z:0 NM:i:17 SM:i:191 MQ:i:0 AS:i:705 ZX:i:1944 ZY:i:10212
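To find such records without running Picard, the position checks reported by ValidateSamFile above can be approximated on the SAM text. The following is only a rough Python sketch, not Picard's implementation: it assumes tab-separated SAM lines (for example from samtools view -h), and the function names find_off_reference and reference_span are made up for illustration. In the demo, SEQ and QUAL are replaced by "*" for brevity.

```python
import re
from typing import Dict, Iterable, List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operators that advance along the reference


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by a CIGAR string (0 for '*')."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def find_off_reference(lines: Iterable[str]) -> List[Tuple[str, str, int, int]]:
    """Return (read name, field, position, reference length) per violation.

    Hypothetical helper: flags POS, alignment end, or PNEXT values that lie
    beyond the LN declared in the matching @SQ header line.
    """
    ref_len: Dict[str, int] = {}
    problems: List[Tuple[str, str, int, int]] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            if fields[0] == "@SQ":
                tags = dict(f.split(":", 1) for f in fields[1:])
                ref_len[tags["SN"]] = int(tags["LN"])
            continue
        if len(fields) < 11:
            continue  # not a complete alignment record
        qname, rname, pos, cigar = fields[0], fields[2], int(fields[3]), fields[5]
        rnext, pnext = fields[6], int(fields[7])
        if rname in ref_len:
            end = pos + max(reference_span(cigar), 1) - 1
            if pos > ref_len[rname]:
                problems.append((qname, "POS", pos, ref_len[rname]))
            elif end > ref_len[rname]:
                problems.append((qname, "END", end, ref_len[rname]))
        mate_ref = rname if rnext == "=" else rnext
        if mate_ref in ref_len and pnext > ref_len[mate_ref]:
            problems.append((qname, "PNEXT", pnext, ref_len[mate_ref]))
    return problems


# The two records from the minimal example, with SEQ/QUAL shortened to "*".
demo = [
    "@SQ\tSN:chrY\tLN:57227415",
    "C4681ANXX:1:1109:2167753:0\t99\tchrY\t56878349\t0\t94M31S\t=\t59024838\t2146524\t*\t*",
    "C4681ANXX:1:1109:2167753:0\t147\tchrY\t59024838\t60\t90S35M\t=\t56878349\t-2146524\t*\t*",
]
for problem in find_off_reference(demo):
    print(problem)
```

On the full data, the same function can be fed the lines of a SAM stream to list every affected read pair before SortSam is run.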
Software used:
picard 2.9.0
jre1.8.0_102
BWA 0.7.15-r1140
CentOS 6.x