Errors in SAM/BAM files can be diagnosed with ValidateSamFile

The problem

You're trying to run a GATK or Picard tool that operates on a SAM or BAM file, and you get a cryptic error that doesn't clearly tell you what's wrong. Bits of the stack trace (the block of lines that the program writes to the output log when there is a problem) may contain the following: java.lang.String, Error Type Count, NullPointerException -- or maybe something else that doesn't mean anything to you.

Why this happens

The most frequent cause of these unexplained problems is not a bug in the program -- it's an invalid or malformed SAM/BAM file. This means that there is something wrong either with the content of the file (something important is missing) or with its format (something is written the wrong way). Invalid SAM/BAM files generally have one or more errors in the following sections: the header tags, the alignment fields, or the optional alignment tags. The SAM/BAM index file can also be a source of errors.

These errors are usually introduced by upstream processing tools, such as the genome mapper/aligner or any other data processing tools you may have applied before feeding the data to Picard or GATK.

The solution

To fix these problems, you first have to know what's wrong. Fortunately there's a handy Picard tool that can test for (almost) all possible SAM/BAM format errors, called ValidateSamFile.

We recommend the workflow included below for diagnosing problems with ValidateSamFile. This workflow will help you tackle the problem efficiently and set priorities for dealing with multiple errors (which often happens). We also outline typical solutions for common errors, but note that this is not meant to be an exhaustive list -- there are too many possible problems to tackle all of them in this document. To be clear, here we focus on diagnostics, not treatment.

In some cases the problems may be too severe to fix, and you may need to redo the genome alignment/mapping from scratch. Consider running ValidateSamFile proactively at all key steps of your analysis pipeline to catch errors early!

Workflow for diagnosing SAM/BAM file errors with ValidateSamFile

1. Generate summary of errors

First, run ValidateSamFile in SUMMARY mode to get a summary of everything that is missing or improperly formatted in your input file. We set MODE=SUMMARY explicitly because by default the tool emits details about only the first 100 problems it finds, then quits. If a minor formatting issue that doesn't really matter affects every read record, you would never get to see more important problems that occur later in the file.

$ java -jar picard.jar ValidateSamFile \
        I=input.bam \
        MODE=SUMMARY

If this outputs No errors found, then your SAM/BAM file is completely valid. If you were running this purely as a preventative measure, then you're good to go and can proceed to the next step in your pipeline. If you were doing this to diagnose a problem, then you're back to square one -- but at least now you know it's not likely to be a SAM/BAM file format issue. One exception: some analysis tools require read group tags like SM that are not required by the format specification itself, so the input files will pass validation but the analysis tools will still error out. If that happens to you, check whether your files have SM tags in the @RG lines of their BAM header. That is the most common culprit.
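The SM check described above can be sketched as a small shell helper. This is only an illustration: the function name rg_lines_missing_sm is made up for this example, and it assumes the header text arrives on standard input (for instance from samtools view -H, which requires Samtools to be installed).

```shell
# Hypothetical helper: print every @RG header line that lacks an SM (sample) tag.
# Reads SAM header text on stdin.
# Exit status: 0 if every @RG line has an SM tag, 1 if at least one does not.
rg_lines_missing_sm() {
    awk -F'\t' '
        /^@RG/ {
            has_sm = 0
            # Header fields are tab-separated TAG:value pairs; field 1 is "@RG".
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SM:/) has_sm = 1
            }
            if (!has_sm) { print; missing = 1 }
        }
        END { exit (missing ? 1 : 0) }
    '
}

# Typical use (assumes samtools is on your PATH):
#   samtools view -H input.bam | rg_lines_missing_sm
```

Note that a header with no @RG lines at all also returns 0 here, so this check complements the MISSING_READ_GROUP error reported by ValidateSamFile rather than replacing it.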

However, if the command above outputs one or more of the 8 possible WARNING or 48 possible ERROR messages (see tables at the end of this document), you must proceed to the next step in the diagnostic workflow.

When run in SUMMARY mode, ValidateSamFile outputs a table that differentiates between two levels of error: ERROR proper and WARNING, based on the severity of the problems that they would cause in downstream analysis. All problems that fall in the ERROR category must be addressed in order to proceed with other Picard or GATK tools, while those that fall in the WARNING category may often be ignored for some, if not all, subsequent analyses.

Example of error summary

ValidateSamFile (SUMMARY)	Count
ERROR:MISSING_READ_GROUP	1
ERROR:MISMATCH_MATE_ALIGNMENT_START	4
ERROR:MATES_ARE_SAME_END	894289
ERROR:CIGAR_MAPS_OFF_REFERENCE	354
ERROR:MATE_NOT_FOUND	1
ERROR:MISMATCH_FLAG_MATE_UNMAPPED	46672
ERROR:MISMATCH_READ_LENGTH_AND_E2_LENGTH	1
WARNING:RECORD_MISSING_READ_GROUP	54
WARNING:MISSING_TAG_NM	33

This table, generated by ValidateSamFile from a real BAM file, indicates that this file has a total of 1 MISSING_READ_GROUP error, 4 MISMATCH_MATE_ALIGNMENT_START errors, 894,289 MATES_ARE_SAME_END errors, and so on. Moreover, this output also indicates that there are 54 RECORD_MISSING_READ_GROUP warnings and 33 MISSING_TAG_NM warnings.
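For long summaries it can help to total the counts per severity level. The sketch below assumes the summary rows look like the ones in the example table (a line starting with ERROR: or WARNING:, with the count as the last whitespace-separated field); the function name summarize_severity is hypothetical.

```shell
# Hypothetical helper: total the counts in a ValidateSamFile SUMMARY table,
# per severity level. Reads the table on stdin and prints two lines:
#   ERROR <total>
#   WARNING <total>
summarize_severity() {
    awk '
        /^ERROR:/   { errors   += $NF }   # $NF is the count column
        /^WARNING:/ { warnings += $NF }
        END { printf "ERROR %d\nWARNING %d\n", errors, warnings }
    '
}

# Typical use, with the summary saved to a file first:
#   java -jar picard.jar ValidateSamFile I=input.bam MODE=SUMMARY > summary.txt
#   summarize_severity < summary.txt
```

Lines that do not start with ERROR: or WARNING: (such as the table header) are ignored, so for the example table above this would report 941322 ERROR records and 87 WARNING records.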

2. Generate detailed list of ERROR records

Since ERRORs are more severe than WARNINGs, we focus on diagnosing and fixing them first. From the first step we only had a summary of errors, so now we generate a more detailed report with this command:

$ java -jar picard.jar ValidateSamFile \
        I=input.bam \
        IGNORE_WARNINGS=true \
        MODE=VERBOSE

Note that we invoked the MODE=VERBOSE and the IGNORE_WARNINGS=true arguments.

The former is technically not necessary as VERBOSE is the tool's default mode, but we specify it here to make it clear that that's the behavior we want. This produces a complete list of every problematic record, as well as a more descriptive explanation for each type of ERROR than is given in the SUMMARY output.

The IGNORE_WARNINGS option enables us to specifically examine only the records with ERRORs. When working with large files, this feature can be quite helpful, because there may be many records with WARNINGs that are not immediately important, and we don't want them flooding the log output.

Example of VERBOSE report for ERRORs only

ValidateSamFile (VERBOSE)	Error Description
ERROR: Read groups is empty	Empty read group field for multiple records
ERROR: Record 1, Read name 20FUKAAXX100202:6:27:4968:125377	Mate alignment does not match alignment start of mate
ERROR: Record 3, Read name 20FUKAAXX100202:6:27:4986:125375	Both mates are marked as second of pair
ERROR: Record 6, Read name 20GAVAAXX100126:4:47:18102:194445	Read CIGAR M operator maps off end of reference
ERROR: Read name 30PPJAAXX090125:1:60:1109:517#0	Mate not found for paired read
ERROR: Record 402, Read name 20GAVAAXX100126:3:44:17022:23968	Mate unmapped flag does not match read unmapped flag of mate
ERROR: Record 12, Read name HWI-ST1041:151:C7BJEACXX:1:1101:1128:82805	Read length does not match quals length

These ERRORs are all problems that we must address before using this BAM file as input for further analysis. Most ERRORs can typically be fixed using Picard tools to either correct the formatting or fill in missing information, although sometimes you may want to simply filter out malformed reads using Samtools.

For example, MISSING_READ_GROUP errors can be solved by adding the read group information to your data using the AddOrReplaceReadGroups tool. Most mate pair information errors can be fixed with FixMateInformation.
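As a rough sketch, the two fixes mentioned above could be wrapped as follows, using the same I=/O= argument style as the commands in this document. The function names, the file names, and every RG* value are placeholders chosen for illustration; replace them with values that describe your own data. Setting DRY_RUN=1 prints each command instead of running it, which is a convenient way to review it first.

```shell
# Run a command, or just print it when DRY_RUN is set (illustrative helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Add (or overwrite) read group information.
# $1 = input BAM, $2 = output BAM. All RG* values below are placeholders.
add_read_groups() {
    run java -jar picard.jar AddOrReplaceReadGroups \
        I="$1" O="$2" \
        RGID=group1 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=sample1
}

# Recompute mate information so that each read agrees with its mate.
# $1 = input BAM, $2 = output BAM.
fix_mates() {
    run java -jar picard.jar FixMateInformation I="$1" O="$2"
}

# Example (prints the command only):
#   DRY_RUN=1 fix_mates input.bam fixed_mate.bam
```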

Once you have attempted to fix the errors in your file, you should put your new SAM/BAM file through the first validation step in the workflow, running ValidateSamFile in SUMMARY mode again. We do this to evaluate whether our attempted fix has solved the original ERRORs, and/or any of the original WARNINGs, and/or introduced any new ERRORs or WARNINGs (sadly, this does happen).

If you still have ERRORs, you'll have to loop through this part of the workflow until no more ERRORs are detected.

If you have no more ERRORs, congratulations! It's time to look at the WARNINGs (assuming there are still some -- if not, you're off to the races).
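The fix-and-revalidate loop can be summarized as a small decision helper that reads a SUMMARY table and reports which stage of the workflow comes next. This is a sketch under the assumption that summary rows start with ERROR: or WARNING:, as in the example table; the function name next_step and the three labels it prints are invented for this illustration.

```shell
# Hypothetical helper: decide the next workflow stage from a SUMMARY table.
# Reads the table on stdin and prints exactly one of:
#   fix-errors       at least one ERROR row is present
#   review-warnings  no ERROR rows, but at least one WARNING row
#   done             neither (for example, "No errors found")
next_step() {
    awk '
        /^ERROR:/   { e = 1 }
        /^WARNING:/ { w = 1 }
        END {
            if (e)      print "fix-errors"
            else if (w) print "review-warnings"
            else        print "done"
        }
    '
}

# Typical use after each attempted fix:
#   java -jar picard.jar ValidateSamFile I=fixed.bam MODE=SUMMARY > summary.txt
#   next_step < summary.txt
```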

3. Generate detailed list of WARNING records

To obtain more detailed information about the warnings, we invoke the following command:

$ java -jar picard.jar ValidateSamFile \
        I=input.bam \
        IGNORE=type \
        MODE=VERBOSE

Here, type stands for the name of an issue type, such as MISSING_TAG_NM. At this stage we often use the IGNORE option to tell the program to skip a specific type of WARNING that we consider less important, in order to focus on the rest. In some cases we may even decide not to address some issues at all because we know they are harmless (for example, MATE_NOT_FOUND is expected when working with a small snippet of data, although ValidateSamFile reports it as an ERROR rather than a WARNING). In general, though, we strongly recommend that you address all of them to avoid any downstream complications, unless you're sure you know what you're doing.

Example of VERBOSE report for WARNINGs only

ValidateSamFile (VERBOSE)	Warning Description
WARNING: Read name H0164ALXX140820:2:1204:13829:66057	A record is missing a read group
WARNING: Record 1, Read name HARMONIA-H16:1253:0:7:1208:15900:108776	NM tag (nucleotide differences) is missing

Here we see a read group-related WARNING which would probably be fixed when we fix the MISSING_READ_GROUP error we encountered earlier, hence the prioritization strategy of tackling ERRORs first and WARNINGs second.

We also see a WARNING about missing NM tags. This is an alignment tag that is added by some but not all genome aligners, and is not used by the downstream tools that we care about, so you may decide to ignore this warning by adding IGNORE=MISSING_TAG_NM from now on when you run ValidateSamFile on this file.
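If you end up ignoring more than one type, the IGNORE argument can be given once per type. The helper below only assembles and prints the command line so you can review it before running it; its name, ignore_cmd, is hypothetical, and it assumes the BAM path contains no spaces.

```shell
# Hypothetical helper: print a ValidateSamFile SUMMARY command that ignores
# the given issue types. $1 = BAM file, remaining arguments = types to ignore.
ignore_cmd() {
    bam=$1
    shift
    cmd="java -jar picard.jar ValidateSamFile I=$bam MODE=SUMMARY"
    # Append one IGNORE=<type> argument per requested type.
    for type in "$@"; do
        cmd="$cmd IGNORE=$type"
    done
    printf '%s\n' "$cmd"
}

# Example:
#   ignore_cmd input.bam MISSING_TAG_NM
```

Printing the command rather than executing it keeps the sketch free of side effects; once it looks right, you can paste it into your terminal or pipeline script.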

Once you have attempted to fix all the WARNINGs that you care about in your file, you put your new SAM/BAM file through the first validation step in the workflow again, running ValidateSamFile in SUMMARY mode. Again, we check that no new ERRORs have been introduced and that the only WARNINGs that remain are the ones we feel comfortable ignoring. If that's not the case we run through the workflow again. If it's all good, we can proceed with our analysis.

Appendix: List of all WARNINGs and ERRORs emitted by ValidateSamFile

The following two tables describe WARNING (Table I) and ERROR (Table II) cases, respectively.

Table I
WARNING	Description
Header Issues
INVALID_DATE_STRING	Date string is not ISO-8601
INVALID_QUALITY_FORMAT	Quality encodings out of range; appear to be Solexa or Illumina when Phred expected. Avoid exception being thrown as a result of no qualities being read.
General Alignment Record Issues
ADJACENT_INDEL_IN_CIGAR	CIGAR string contains an insertion (I) followed by deletion (D), or vice versa
RECORD_MISSING_READ_GROUP	A SAMRecord is found with no read group id
Mate Pair Issues
PAIRED_READ_NOT_MARKED_AS_FIRST_OR_SECOND	Pair flag set but not marked as first or second of pair
Optional Alignment Tag Issues
MISSING_TAG_NM	The NM tag (nucleotide differences) is missing
E2_BASE_EQUALS_PRIMARY_BASE	Secondary base calls should not be the same as primary, unless one or the other is N
General File, Index or Sequence Dictionary Issues
BAM_FILE_MISSING_TERMINATOR_BLOCK	BAM appears to be healthy, but is an older file so doesn't have terminator block

Table II
ERROR	Description
Header Issues
DUPLICATE_PROGRAM_GROUP_ID	Same program group id appears more than once
DUPLICATE_READ_GROUP_ID	Same read group id appears more than once
HEADER_RECORD_MISSING_REQUIRED_TAG	Header tag missing in header line
HEADER_TAG_MULTIPLY_DEFINED	Header tag appears more than once in header line with different value
INVALID_PLATFORM_VALUE	The read group has an invalid value set for its PL field
INVALID_VERSION_NUMBER	Does not match any of the acceptable versions
MISSING_HEADER	The SAM/BAM file is missing the header
MISSING_PLATFORM_VALUE	The read group is missing its PL (platform) field
MISSING_READ_GROUP	The header is missing read group information
MISSING_SEQUENCE_DICTIONARY	There is no sequence dictionary in the header
MISSING_VERSION_NUMBER	Header has no version number
POORLY_FORMATTED_HEADER_TAG	Header tag does not have colon
READ_GROUP_NOT_FOUND	A read group ID on a SAMRecord is not found in the header
UNRECOGNIZED_HEADER_TYPE	Header record is not one of the standard types
General Alignment Record Issues
CIGAR_MAPS_OFF_REFERENCE	Bases corresponding to M operator in CIGAR extend beyond reference
INVALID_ALIGNMENT_START	Alignment start position is incorrect
INVALID_CIGAR	CIGAR string error for either read or mate
INVALID_FLAG_FIRST_OF_PAIR	First of pair flag set for unpaired read
INVALID_FLAG_SECOND_OF_PAIR	Second of pair flag set for unpaired read
INVALID_FLAG_PROPER_PAIR	Proper pair flag set for unpaired read
INVALID_FLAG_MATE_NEG_STRAND	Mate negative strand flag set for unpaired read
INVALID_FLAG_NOT_PRIM_ALIGNMENT	Not primary alignment flag set for unmapped read
INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT	Supplementary alignment flag set for unmapped read
INVALID_FLAG_READ_UNMAPPED	Mapped read flag not set for mapped read
INVALID_INSERT_SIZE	Inferred insert size is out of range
INVALID_MAPPING_QUALITY	Mapping quality set for unmapped read or is >= 256
INVALID_PREDICTED_MEDIAN_INSERT_SIZE	PI tag value is not numeric
MISMATCH_READ_LENGTH_AND_QUALS_LENGTH	Length of sequence string and length of base quality string do not match
TAG_VALUE_TOO_LARGE	Unsigned integer tag value is deprecated in BAM
Mate Pair Issues
INVALID_FLAG_MATE_UNMAPPED	Mate unmapped flag is incorrectly set
MATE_NOT_FOUND	Read is marked as paired, but its pair was not found
MATE_CIGAR_STRING_INVALID_PRESENCE	A mate CIGAR string is present for a read whose mate is NOT mapped
MATE_FIELD_MISMATCH	Read alignment fields do not match its mate
MATES_ARE_SAME_END	Both mates of a pair are marked either as first or second mates
MISMATCH_FLAG_MATE_UNMAPPED	Mate unmapped flag does not match read unmapped flag of mate
MISMATCH_FLAG_MATE_NEG_STRAND	Mate negative strand flag does not match read strand flag
MISMATCH_MATE_ALIGNMENT_START	Mate alignment does not match alignment start of mate
MISMATCH_MATE_CIGAR_STRING	The mate cigar tag does not match its mate's cigar string
MISMATCH_MATE_REF_INDEX	Mate reference index (MRNM) does not match reference index of mate
Optional Alignment Tag Issues
INVALID_MATE_REF_INDEX	Mate reference index (MRNM) set for unpaired read
INVALID_TAG_NM	The NM tag (nucleotide differences) is incorrect
MISMATCH_READ_LENGTH_AND_E2_LENGTH	Lengths of secondary base calls tag values and read should match
MISMATCH_READ_LENGTH_AND_U2_LENGTH	Secondary base quals tag values should match read length
EMPTY_READ	Indicates that a read corresponding to the first strand has a length of zero and/or lacks flow signal intensities (FZ)
INVALID_INDEXING_BIN	Indexing bin set on SAMRecord does not agree with computed value
General File, Index or Sequence Dictionary Issues
INVALID_INDEX_FILE_POINTER	Invalid virtualFilePointer in index
INVALID_REFERENCE_INDEX	Reference index not found in sequence dictionary
RECORD_OUT_OF_ORDER	The record is out of order
TRUNCATED_FILE	BAM file does not have terminator block
