Picard ValidateSamFile is failing with INVALID_TAG_NM on hg38 HLA contigs when MODE=VERBOSE. The first 100 HLA reads in my BAM file failed. I assume all would fail as there were a number of different contigs among the first 100. When I validated the same BAM file with IGNORE=INVALID_TAG_NM, it passes.
Oddly, when MODE=SUMMARY, I got 'No errors found'.
I am running Picard 2.9.2 and using the GATk bundle Homo_sapiens_assembly38.fasta* as the reference. The BAM file was produced by Novoalign and processed by SortSam and SetNmMdAndUqTags. I also tried the deprecated SetNmAndUqTags. Moreover, I manually checked a few of the failing reads in the original BAM file produced by Novoalign and the records, including the NM tags, were the same as after running SetNmMdAndUqTags. I looked at the subset of the failed reads where NM was 0 and compared them directly to the sequence in the fasta file; all were a perfect match at the expected position.
*This shouldn't matter, but I replaced the tab character with two spaces to separate the contig name field in the '>' lines of of the HLA records of Homo_sapiens_assembly38.fasta. All the other records have two spaces, and the tab character was causing problems for Novoalign (to be fixed in the next release).
Picard ValidateSamFile command/stderr:
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/run/media/yoursham/MY_6Tb_1/germline/cromwell-executions/PairedEndSingleSampleWorkflow/d626fa7a-c423-4f0c-98b2-aa0274b658e3/call-ValidateReadGroupSamFile/shard-0/execution/tmp.Et4z6V
[Tue Jun 06 00:40:48 UTC 2017] picard.sam.ValidateSamFile INPUT=/run/media/yoursham/MY_6Tb_1/germline/cromwell-executions/PairedEndSingleSampleWorkflow/d626fa7a-c423-4f0c-98b2-aa0274b658e3/call-ValidateReadGroupSamFile/shard-0/inputs/run/media/yoursham/MY_6Tb_1/germline/cromwell-executions/PairedEndSingleSampleWorkflow/d626fa7a-c423-4f0c-98b2-aa0274b658e3/call-SortAndFixReadGroupBam/shard-0/execution/NIST7035_TAAGGCGA_L001.aligned.sorted.bam OUTPUT=NIST7035_TAAGGCGA_L001.validation_report MODE=VERBOSE IGNORE=[] MAX_OUTPUT=1000000000 IS_BISULFITE_SEQUENCED=false REFERENCE_SEQUENCE=/mnt/hdd/germline/resources/gatk_bundle/Homo_sapiens_assembly38.fasta IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Jun 06 00:40:48 UTC 2017] Executing as yoursham@yoursham-linux on Linux 3.10.0-514.16.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-b12; Picard version: 2.9.2-SNAPSHOT
INFO 2017-06-06 00:41:45 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:56s. Time for last 10,000,000: 55s. Last read position: chr5:75,080,505
INFO 2017-06-06 00:42:37 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:48s. Time for last 10,000,000: 52s. Last read position: chr11:65,355,941
INFO 2017-06-06 00:43:31 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:02:41s. Time for last 10,000,000: 53s. Last read position: chr19:1,918,167
INFO 2017-06-06 00:44:24 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:03:34s. Time for last 10,000,000: 52s. Last read position: */*
[Tue Jun 06 00:45:14 UTC 2017] picard.sam.ValidateSamFile done. Elapsed time: 4.42 minutes.
Runtime.totalMemory()=1348468736
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Typical errors:
ERROR: Record 38230964, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [75]
ERROR: Record 38230965, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [76]
ERROR: Record 38230966, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [72]
ERROR: Record 38230967, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [72]
ERROR: Record 38230968, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [72]
BAM records for the above errors:
HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239 99 HLA-A*11:50Q 1091 30 101M = 1111 121 CGCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGG @?>?=>?@>@@@@==@@>>>>@>>@@@@@>@??@@>@@>@@>@@@@>@@>@@>@@@@@?@@@=@>=AA@>@@>@>@>>?>@?>>?@@@>>@>@@@>@?>?? ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome MD:Z:101 PG:Z:novoalign RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878 AM:i:2 NM:i:0 SM:i:2 PQ:i:1 UQ:i:0 AS:i:0 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756 163 HLA-A*11:50Q 1092 30 101M = 1286 295 GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAACGCAAGTGGGAGGCGGC ?@@>=@@8@@@?==?@<==<@9<@5@@@>@>=@@9??>@@?@@@@>@>=@@<9?@<?><??<?<>>:>.??==:>>>@=9>=>.>>?==?=@??9>>?9=@ ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome MD:Z:83G17 PG:Z:novoalign RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878 AM:i:2 NM:i:1 SM:i:2 PQ:i:26 UQ:i:13 AS:i:18 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908 163 HLA-A*11:50Q 1101 30 101M = 1162 162 GGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCCCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGC ?>@;<@@>===?==@@@@@=@=>@?=@@>@@>@>@@=@@=@A=@@@@@?@@@=9==@>@=@?+<=@<:?=@?/2<>9:/;;;@@@>=:?@@@==>;>?@@> ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome MD:Z:62T38 PG:Z:novoalign RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878 AM:i:1 NM:i:1 SM:i:3 PQ:i:13 UQ:i:10 AS:i:13 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488 163 HLA-A*11:50Q 1111 30 101M = 1143 133 ACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGCGGAGCAGCGG ;@;>@@?@@=@==@@>@@>@@>@@@@>@@?@@?@@@@@@@@@>@>>@@@>@@=@>@>>@=@@>=@@A@=>?>?@@>@@@@@@@@@>@?@@?@@>@?>@@=> ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome MD:Z:101 PG:Z:novoalign RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878 AM:i:1 NM:i:0 SM:i:25 PQ:i:0 UQ:i:0 AS:i:0 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440 163 HLA-A*11:50Q 1111 30 101M = 1135 125 ACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGCGGAGCAGCGG ;@;>@@?@@=@==@@>@@>@@>@@@@>@@?@@?@@@@@@@@@>@>>??@>@@=@>@>>@>@@>>@?@@>>@>@@@>@@@@@@@@@>?>@@?>?=?@>@@@@ ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome MD:Z:101 PG:Z:novoalign RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878 AM:i:1 NM:i:0 SM:i:25 PQ:i:1 UQ:i:0 AS:i:0 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878