[work in progress--I am breaking this up into smaller chunks]
This document in part replaces the previous post (howto) Revert a BAM file to FastQ format that uses HTSlib commands. The workflow assumes familiarity with the concepts given in Collected FAQs about BAM files.
We outline steps to preprocess Illumina and similar tech DNA sequence reads for use in GATK's variant discovery workflow. This preprocessing workflow involves marking adapter sequences using MarkIlluminaAdapters so they contribute minimally to alignments, alignment using the BWA aligner's maximal exact match (MEM) algorithm, and preserving and adjusting read and read meta data using MergeBamAlignment for consistency and comparability of downstream results with analyses from the Broad Institute. With the exception of BWA, we use the most current versions of tools as of this writing. The workflow results in an aligned BAM file with appropriate meta information that is ready for processing with MarkDuplicates.
This workflow applies to three common types of sequence read files: (A) aligned BAMs that need realignment, (B) FASTQ format data and (C) raw sequencing data in BAM format. If you have raw data in BAM format (C), given appropriate read group fields, you can start with step 2. The other two formats require conversion to unmapped BAM (uBAM). We use Picard's RevertSam to convert an aligned BAM (A) or Picard's FastqToSam to convert a FASTQ (B) to the uBAM.
We address options relevant to process reads extracted from an interval as well as options to process large files, in our case a ~150G file called Solexa-272222
. The tutorial uses a smaller file of reads aligning to a genomic interval, called snippet
derived from Solexa-272222
, for faster processing. The example commands apply to the larger file. Some comments on the workflow:
The workflow reflects a lossless operating procedure that retains original FASTQ read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and longterm storage efficient, as one needs only store the final BAM file.
When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.
- Finally, when I call default options within a command, follow suit to ensure the same results.
The steps of the workflow are as follows.
- Generate an unmapped BAM (uBAM)
(A) Convert the FASTQ to uBAM and add read group information using FastqToSam
(B1) [Optional] Extract reads in a genomic interval from aligned BAM
(B2) Convert aligned BAM to uBAM and discard problematic records using RevertSam - Mark adapter sequences using MarkIlluminaAdapters
- Convert uBAM to FASTQ and assign adapter bases low qualities using SamToFastq
- Align reads and flag secondary hits using BWA MEM
- [Optional] Pipe steps 3 & 4 and collect alignment metrics
- [Optional] Sort, index and convert alignment to a BAM using SortSam and visualize on IGV
- Restore altered data and apply & adjust meta information using MergeBamAlignment
1. Generate an unmapped BAM (uBAM)
The goal is to produce an unmapped BAM file with appropriate read group (@RG) information that differentiates not only samples, but also factors that contribute to technical artifacts. To see the read group information for a BAM file, use the following command.
samtools view -H Solexa-272222.bam | grep '@RG'
This prints the lines starting with @RG within the header. Our tutorial file's single @RG line is shown below. The file has the read group fields required by this workflow as well as extra fields for record keeping. Two read group fields, ID
and PU
, appropriately differentiate flow cell lane, marked by .2
, a factor that contributes to batch effects.
@RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI
GATK's variant discovery workflow requires
ID
,SM
andLB
fields and recommends thePL
field.Each
@RG
line has a uniqueID
that differentiates read groups. It is the lowest denominator that differentiates factors contributing to technical batch effects and is repeatedly indicated by theRG
tag for each read record. Thus, the length of this field contributes to file size.SM
indicates sample name and, within a collection of samples,LB
indicates if the same sample was sequenced in multiple lanes. See item 8 of Collected FAQs about BAM files for more detail.PU
is not required by any GATK tool. If present it is used by BQSR instead ofID
. It is required by Picard's AddOrReplaceReadGroups but not FastqToSam.
If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields.
Here we illustrate how to derive both ID
and PU
fields from query names. We break down the common portion of two different read query names from the tutorial file.
H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288
#Breaking down the common portion of the query names:
H0164____________ # portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ # portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 # portion of @RG ID and PU fields indicating flow cell lane
(A) Convert the FASTQ to uBAM and add read group information using FastqToSam
Picard's FastqToSam transforms a FASTQ file to unmapped BAM, requires two read group fields and makes optional specification of other read group fields. In the command below we note which fields are required for our workflow. All other read group fields are optional.
java -Xmx8G -jar /seq/software/picard/current/bin/picard.jar FastqToSam \
FASTQ=snippet_XT_interleaved.fq \ #our single tutorial file contains both reads in a pair
OUTPUT=snippet_FastqToSam_PU.bam \
READ_GROUP_NAME=H0164.2 \ # required; changed from default of A
SAMPLE_NAME=NA12878 \ # required
LIBRARY_NAME=Solexa-272222 \ # required
PLATFORM_UNIT=H0164ALXX140820.2 \
PLATFORM=illumina \ # recommended
SEQUENCING_CENTER=BI \
RUN_DATE=2014-08-20T00:00:00-0400
Some details on select parameters:
QUALITY_FORMAT
is detected automatically if unspecified.SORT_ORDER
by default is queryname.- Specify both
FASTQ
andFASTQ2
for paired reads in separate files. PLATFORM_UNIT
is often in run_barcode.lane format. Include if sample is multiplexed.RUN_DATE
is in Iso8601 date format.
(B1) [Optional] Extract reads in a genomic interval from aligned BAM
We want to test our reversion process on a subset of the tutorial file before committing to reverting the entire BAM. This process requires the reads in the BAM to be aligned to a reference genome and produces a BAM containing reads from a genomic interval.
java -Xmx8G -jar /path/GenomeAnalysisTK.jar \
-T PrintReads \
-R /path/human_g1k_v37_decoy.fasta \
-L 10:90000000-100000000 \ # this is the retained interval
-I Solexa-272222.bam -o snippet.bam # snippet.bam is newly created
This seems a good time to bring this up. In the command, the
-Xmx8G
Java option sets the maximum heap size, or memory usage to eight gigabytes. We want to both cap Java's use of memory so the system doesn't slow down as well as allow enough memory for the tool to run without causing an out of memory error. The-Xmx
settings we provide here is more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. I have heard up to16G specified and have also omitted this option for small files. To find a system's default maximum heap size, typejava -XX:+PrintFlagsFinal -version
, and look forMaxHeapSize
. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads.This step is for our tutorial only. For applying interval lists, e.g. to whole exome data, see When should I use L to pass in a list of intervals.
(B2) Convert aligned BAM to uBAM and discard problematic records using RevertSam
We use Picard's RevertSam to remove alignment information. The resulting unmapped BAM (uBAM) has two uses in this workflow: (1) for processing through the MarkIlluminaAdapters branch of the workflow, and (2) for application of read group, read sequence and other read meta information to the aligned read file in the MergeBamAlignment branch of the workflow. The RevertSam parameters we specify remove information pertaining to previous alignments including program group records and standard alignment flags and tags that would otherwise transfer over in the MergeBamAlignment step. We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR
option. For example, we clear the XT
tag using this option so that it is free for use by MarkIlluminaAdapters. Our settings also reset flags to unmapped values, e.g. 77 and 141 for paired reads. Additionally, we invoke the SANITIZE
option to remove reads that cause problems for MarkIlluminaAdapters. Our tutorial's snippet
requires such filtering while Solexa-272222
does not.
For our particular file, we use the following parameters.
java -Xmx8G -jar /path/picard.jar RevertSam \
I=snippet.bam \
O=snippet_revert.bam \
SANITIZE=true \
MAX_DISCARD_FRACTION=0.005 \ # informational; does not affect processing
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
ATTRIBUTE_TO_CLEAR=OC \
ATTRIBUTE_TO_CLEAR=OP \
SORT_ORDER=queryname \ #default
RESTORE_ORIGINAL_QUALITIES=true \ #default
REMOVE_DUPLICATE_INFORMATION=true \ #default
REMOVE_ALIGNMENT_INFORMATION=true #default
To process large files, also designate a temporary directory.
TMP_DIR=/path/shlee # sets environmental variable for temporary directory
We change these settings for RevertSam:
SANITIZE
If the BAM file contains problematic reads, such as that might arise from taking a genomic interval of reads (Step 1), then RevertSam'sSANTITIZE
option removes them. Our workflow's downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities.MAX_DISCARD_FRACTION
is set to a more strict threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. This parameter asks the tool to additionally inform you of the discarded fraction via an exception as it finishes processing.Exception in thread "main" picard.PicardException: Discarded 0.947% which is above MAX_DISCARD_FRACTION of 0.500%
ATTRIBUTE_TO_CLEAR
is set to clear more than the default standard tags, which are NM, UQ, PG, MD, MQ, SA, MC, and AS tags. The AS tag is removed by default for Picard releases starting 9/2015. Remove all other tags, such as the XT tag needed by MarkIlluminaAdapters, by specifying each with theATTRIBUTE_TO_CLEAR
option. To list all tags within my BAM, I used the command below to get RG, OC, XN, OP, SA, MD, NM, PG, UQ, MC, MQ, AS, XT, and OQ tags. Those removed by default and byRESTORE_ORIGINAL_QUALITIES
are italicized. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags.samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'
Some comments on options kept at default:
SORT_ORDER
=queryname
For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order. Coordinate sorted reads result in the aligner incorrectly estimating insert size from blocks of paired reads as they are not randomly distributed.RESTORE_ORIGINAL_QUALITIES
=true
Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag.REMOVE_ALIGNMENT_INFORMATION
=true will remove program group records and alignment information. It also invokes the defaultATTRIBUTE_TO_CLEAR
parameter which removes standard alignment tags.
For snippet.bam, SANITIZE
removes 25,909 out of 2,735,539 (0.947%) reads, leaving us with 2,709,630 reads. The intact BAM retains all reads. The example shows a read pair before and after RevertSam.
#original BAM
H0164ALXX140820:2:1101:10003:23460 83 10 91515318 60 151M = 91515130 -339 CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:0 MQ:i:0 OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA UQ:i:0 AS:i:151
H0164ALXX140820:2:1101:10003:23460 163 10 91515130 0 60M91S = 91515318 339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1; MC:Z:151M MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2 NM:i:2 MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### UQ:i:49 AS:i:50
#after RevertSam (step 1.B2)
H0164ALXX140820:2:1101:10003:23460 77 * 0 0 * * 0 0 TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:23460 141 * 0 0 * * 0 0 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2
2. Mark adapter sequences using MarkIlluminaAdapters
Previously we cleared the XT tag from our BAM so Picard's MarkIlluminaAdapters can use it to mark adapter sequences. SamToFastq (step 4) will use these in turn to assign low base quality scores to the adapter bases, effectively removing their contribution to read alignment and alignment scoring metrics. For the tutorial data, adapter sequences have already been removed from the beginning of reads. We want to additionally effectively remove any adapter sequences at the ends of reads arising from read-through to adapters in read pairs with shorter inserts.
java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=snippet_revert.bam \
O=snippet_revertmark.bam \
M=snippet_revertmark.metrics.txt \ #naming required
TMP_DIR=/path/shlee # optional to process large files
By default, the tool uses Illumina adapter sequences. This is sufficient for our tutorial data. Specify other adapter sequences as outlined in the tool documentation.
Only reads with adapter sequence are marked with the tag in XT:i:[#] format, where # denotes the starting position of the adapter sequence.
The example shows a read pair marked with the XT tag by MarkIlluminaAdapters. This is a different pair than shown previously as H0164ALXX140820:2:1101:10003:23460
reads do not contain adapter sequence. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.
#after MarkIlluminaAdapters (step 2)
H0164ALXX140820:2:1101:15118:25288 77 * 0 0 * * 0 0
ACCTGCCTCAGCCTCCCAAAGTGCTGGGATTATAGGTATGTGTCACCACACCCAGCCAAGTATACTCACATTGTCGTGCAACCAAACTCCAGAACTTTTTCATCTTAAAGAATCAAGGTTTTTTATTGTTTACTTTATTACTTATTTATTT
AFFFFFJJFJFAAJJFFJJFJFJ<FJJJJJJF<JJJFFJJAF7JJJAAF7AJJFJFJFFJ--A-FAJA-F<J7A--AFJ7AJ7AJ-FJ7-JJJ-F-J---7J---7FF-JAJJ<A7JFAFAA7--FF----AF-7<JF<JFA-7<F-FF-J RG:Z:H0164.2 XT:i:63
H0164ALXX140820:2:1101:15118:25288 141 * 0 0 * * 0 0
GTCATGGCTGGACGCAGTGGCTCATACCTGTAATCCCAGCACTTTTGGAGGCTGAGGCAGGTAGATCGGAAGCGCCTCGTGTAGGGAGAGAGGGTTAACAAAAATGTAGATACCGGAGGTCGCCGTAAAATAAAAAAGTAGCAAGGAGTAG
AAFFFJJJJJAJJJJJFJJJJ<JFJJJJJJJJFJJJJFJ<FJJJJAJJJJJJJJFJJJ7JJ--JJJ<J<-FJ7F--<-J7--7AJJA-J------J7F<-77--F--FFJ---J-J-J--A-7<<----J-7-J-FJ--J--FA####### RG:Z:H0164.2 XT:i:63
#after SamToFastq (step 3)
@H0164ALXX140820:2:1101:15118:25288/1
ACCTGCCTCAGCCTCCCAAAGTGCTGGGATTATAGGTATGTGTCACCACACCCAGCCAAGTATACTCACATTGTCGTGCAACCAAACTCCAGAACTTTTTCATCTTAAAGAATCAAGGTTTTTTATTGTTTACTTTATTACTTATTTATTT
+
AFFFFFJJFJFAAJJFFJJFJFJ<FJJJJJJF<JJJFFJJAF7JJJAAF7AJJFJFJFFJ--#########################################################################################
@H0164ALXX140820:2:1101:15118:25288/2
GTCATGGCTGGACGCAGTGGCTCATACCTGTAATCCCAGCACTTTTGGAGGCTGAGGCAGGTAGATCGGAAGCGCCTCGTGTAGGGAGAGAGGGTTAACAAAAATGTAGATACCGGAGGTCGCCGTAAAATAAAAAAGTAGCAAGGAGTAG
+
AAFFFJJJJJAJJJJJFJJJJ<JFJJJJJJJJFJJJJFJ<FJJJJAJJJJJJJJFJJJ7JJ-#########################################################################################
#after MergeBamAlignment (step 7)
H0164ALXX140820:2:1101:15118:25288 99 10 99151971 60 151M = 99152350 440
ACCTGCCTCAGCCTCCCAAAGTGCTGGGATTATAGGTATGTGTCACCACACCCAGCCAAGTATACTCACATTGTCGTGCAACCAAACTCCAGAACTTTTTCATCTTAAAGAATCAAGGTTTTTTATTGTTTACTTTATTACTTATTTATTT
AFFFFFJJFJFAAJJFFJJFJFJ<FJJJJJJF<JJJFFJJAF7JJJAAF7AJJFJFJFFJ--A-FAJA-F<J7A--AFJ7AJ7AJ-FJ7-JJJ-F-J---7J---7FF-JAJJ<A7JFAFAA7--FF----AF-7<JF<JFA-7<F-FF-J MC:Z:90S61M MD:Z:74T10T3A37T23 PG:Z:bwamem RG:Z:H0164.2 NM:i:4 MQ:i:60 UQ:i:48 AS:i:131 XS:i:40
H0164ALXX140820:2:1101:15118:25288 147 10 99152350 60 90S61M = 99151971 -440
CTACTCCTTGCTACTTTTTTATTTTACGGCGACCTCCGGTATCTACATTTTTGTTAACCCTCTCTCCCTACACGAGGCGCTTCCGATCTACCTGCCTCAGCCTCCAAAAGTGCTGGGATTACAGGTATGAGCCACTGCGTCCAGCCATGAC
#######AF--J--JF-J-7-J----<<7-A--J-J-J---JFF--F--77-<F7J------J-AJJA7--7J-<--F7JF-<J<JJJ--JJ7JJJFJJJJJJJJAJJJJF<JFJJJJFJJJJJJJJFJ<JJJJFJJJJJAJJJJJFFFAA MC:Z:151M MD:Z:61 PG:Z:bwamem RG:Z:H0164.2 NM:i:0 MQ:i:60 UQ:i:0 AS:i:61 XS:i:50
Snippet_revertmark.bam marks 5,810 reads (0.21%) with XT, while Solexa-272222_revertmark.bam marks 3,236,552 reads (0.39%). We plot the metrics data using RStudio.
3. Convert BAM to FASTQ using SamToFastq
Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove adapter sequences previously marked with the XT tag. All extant meta data, i.e. alignment information, flags and tags, are purged in this transformation.
java -Xmx8G -jar /path/picard.jar SamToFastq \
I=snippet_revertmark.bam \
FASTQ=snippet_XT_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true \
TMP_DIR=/path/shlee # optional to process large files
By specifying
CLIPPING_ATTRIBUTE
=XT andCLIPPING_ACTION
=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to read alignment and alignment scoring metrics. This reassignment is temporary as we will restore the original base quality scores after alignment in step 7.For our paired reads sample we set SamToFastq's
INTERLEAVE
to true. During the conversion to FASTQ format, the query name of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file.BWA aligner accepts interleaved FASTQ files given the
-p
option. This command indicates that the i-th and the (i+1)-th reads constitute a read pair.We change the
NON_PF
, akaINCLUDE_NON_PF_READS
, option from default to true. SamToFastq will then retain reads marked by what some consider archaic 0x200 flag bit that denotes reads that do not pass quality controls. These reads are also known as failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.
[Optional] Compress the FASTQ using gzip
This step is optional. The step is irrelevant if you pipe steps 3 and 4, as we outline in step 5.
BWA handles both FASTQ and gzipped FASTQ files natively--that is, BWA works on both file types directly. Thus, this step is optional. Compress the FASTQ file using the UNIX gzip utility.
gzip snippet_XT_interleaved.fq #replaces the file with snippet_XT_interleaved.fq.gz
4. Align reads and flag secondary hits using BWA MEM
GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.
We use BWA v 0.7.7.r441, the same aligner used by the Broad's Genomics Platform as of this writing (9/2015).
Alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Ebstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
- Aligning our
snippet
reads from a genomic interval against either a portion or the whole genome is not equivalent to aligning our entire file and taking a newslice
from the same genomic interval.
Index the reference genome file for BWA. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index
function on the reference genome file, e.g. human_g1k_v37_decoy.fasta
. This produces five index files with the extensions amb
, ann
, bwt
, pac
and sa
.
bwa index -a bwtsw human_g1k_v37_decoy.fasta
Align using BWA MEM. The tool automatically locates the index files within the same folder as the reference FASTA file. In the alignment command, >
denotes the aligned file.
The aligned file is in SAM format even if given a BAM extension and retains the sort order of the FASTQ file. Thus, our aligned tutorial file remains sorted by query name.
BWA automatically creates a program group record (@PG) in the header that gives the ID, group name, group version, and command line information.
/path/bwa mem -M -t 7 -p \
/path/Homo_sapiens_assembly19.fasta \ #reference genome
Solexa-272222_interleavedXT.fq > Solexa-272222_markXT_aln.sam
We invoke three options in the command.
-M
to flag shorter split hits as secondary.
This is optional for Picard compatibility. However, if we want MergeBamAlignment to reassign proper pair alignments, we need to mark secondary alignments.-p
to indicate the given file contains interleaved paired reads.-t
followed by a number for the additional number of processor threads to use concurrently. Check your server or system's total number of threads with the following command.getconf _NPROCESSORS_ONLN #thanks Kate
MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, the point of this workflow is to take advantage of the features offered by MergeBamAlignment that allow for the scalable, lossless operating procedure practiced by Broad's Genomics Platform and to produce comparable metrics.
5. [Optional] Pipe steps 3 & 4 and collect alignment metrics
Piping processes saves time and space. Our tutorial's resulting SAM file is small enough to easily view, manipulate and store. For larger data, however, consider using Unix pipelines. Piping allows streaming data in the processor's input-output (I/O) device directly to the next process for efficient processing and storage. We recommend piping steps 3 and 4 so as to avoid rereading and storing the large intermediate FASTQ file.
You may additionally extend piping to include step 6's SortSam. Steps 3-4-6 are piped in the example command below to generate an aligned BAM file and index. [For the larger file, I couldn't pipe Step 7's MergeBamAlignment.]
#overview of command structure
[step 3's SamToFastq] | [step 4's bwa mem] | [step 6's SortSam]
#for our file
java -Xmx8G -jar /path/picard.jar SamToFastq I=snippet_revertmark.bam \
FASTQ=/dev/stdout \
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
TMP_DIR=/path/shlee | \
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta \
/dev/stdin | \ #to stop piping here, add '> snippet_piped.sam'
java -Xmx8G -jar /path/picard.jar SortSam \
INPUT=/dev/stdin \
OUTPUT=snippet_piped.bam \
SORT_ORDER=coordinate CREATE_INDEX=true \
TMP_DIR=/path/shlee
Calculate alignment metrics using Picard tools. Picard offers a variety of metrics collecting tools, e.g. CollectAlignmentSummaryMetrics, CollectWgsMetrics and CollectInsertSizeMetrics. Some tools give more detailed metrics if given the reference sequence. See Picard for metrics definitions. Metrics calculations will differ if run on the BAM directly from alignment (BWA) versus on the merged BAM (MergeBamAlignment). See [link--get from G] for guidelines on when to run tools.
java -Xmx8G -jar /path/picard.jar CollectAlignmentSummaryMetrics \
R=/path/Homo_sapiens_assembly19.fasta \
INPUT=slice.bam \
OUTPUT=slice_bam_metrics.txt \
TMP_DIR=/path/shlee # optional to process large files
For example, percent chimeras is a calculated metric. Our tutorial alignment of the whole data set gives 0.019% (BWA) or 0.0034% (MergeBamAlignment) chimeric paired reads. The genomic interval defined in step 1 reports 0.0032% chimeric paired reads. In contrast, the aligned snippet
gives 0.0012% (BWA) or 0.00002% (MergeBamAlignment) chimeric paired reads. This illustrates in part the differences I alluded to at the beginning of step 4.
6. [Optional] Sort, index and convert alignment to a BAM using SortSam and visualize on IGV
Picard's SortSam sorts, indexes and converts between SAM and BAM formats. For file manipulations and to view aligned reads using the Integrative Genomics Viewer (IGV), the SAM or BAM file must be coordinate-sorted and indexed. Some Picard tools, such as MergeBamAlignment in step 7, by default coordinate sort and can use the standard CREATE_INDEX
option. If you didn't create an index in step 7, or want to convert to BAM and index the alignment file from step 4, then use Picard's SortSam. The index file will have an sai
or bai
extension depending on the specified format.
java -Xmx8G -jar /path/picard.jar SortSam \
INPUT=Solexa-272222_markXT_aln.sam \
OUTPUT=Solexa-272222_markXT_aln.bam \ #extension here specifies format conversion
SORT_ORDER=coordinate \
CREATE_INDEX=true \ # a standard option for Picard commands
TMP_DIR=/path/shlee # optional to process large files
View aligned reads using the Integrative Genomics Viewer (IGV). Of the multiple IGV versions, the Java Web Start jnlp
version allows the highest memory, as of this writing 10 GB for machines with 64-bit Java.
To run the
jnlp
version of IGV, you may need to adjust your system's Java Control Panel settings, e.g. enable Java content in the browser. Also, when first opening thejnlp
, overcome Mac OS X's gatekeeper function by right-clicking the savedjnlp
and selecting Open with Java Web Start.Load the appropriate reference genome. For our tutorial this is Human (b37).
- Go to View>Preferences and make sure the settings under the Alignments tab allows you to view reads of interest, e.g. duplicate reads. Default settings are tuned to genomic sequence libraries. Right-click on a track to access a menu of additional viewing modes. See Viewing Alignments in IGV documentation for details.
- Go to File>Load from and either load alignments from File or URL.
Here, IGV displays our example chimeric pair, H0164ALXX140820:2:1101:10003:23460
at its alignment loci. BWA's secondary alignment designation causes the mates on chromosome 10 to display as unpaired in IGV's paired view. MergeBamAlignment corrects for this when it switches the secondary alignment designation. Mates display as paired on chromosome 10.
Visualizing alignments in such a manner makes apparent certain convergent information. For example, we see that the chimeric region on chromosome 2 is a low complexity GC-rich region, apparent by the predominantly yellow coloring (representing guanine) of the reference region. We know there are many multimapping reads because reads with MAPQ score of zero are filled in white versus gray, and the region is down-sampled, as indicated by the underscoring in the log-scaled coverage chart. We can infer reads in this chromosome 2 region are poorly mapped based on the region's low complexity, depth of reads and prevalence of low MAPQ reads.
7. Restore altered data and apply & adjust meta information using MergeBamAlignment
Our alignment file lacks read group information and certain tags, such as the mate CIGAR (MC) tag. It has hard-clipped sequences and altered base qualities. The alignment also has some mapping artifacts we would like to correct for accounting congruency. Finally, the alignment records require coordinate sorting and indexing.
We use Picard's MergeBamAlignment to address all of these needs to produce a raw BAM file that is ready for GATK's variant discovery workflow. MergeBamAlignment takes metadata from a SAM or BAM file of unmapped reads (uBAM) and merges it with a SAM or BAM file containing alignment records for a subset of those reads. Metadata include read group information, read sequences, base quality scores and tags. The tool applies read group information from the uBAM and retains the program group information from the aligned file. In restoring original sequences, MergeBamAlignment adjusts CIGAR strings from hard-clipped to soft-clipped. The tool adjusts flag values, e.g. changes primary alignment designations according to a user-specified strategy, for desired congruency. Optional parameters allow introduction of additional metadata, e.g. user-specified program group information or nonstandard aligner-generated tags. If the alignment file is missing reads present in the unaligned file, these are retained as unaligned records. Finally, alignment records are coordinate sorted, meaning they are ordered by chromosomal mapping position.
To simply edit read group information, see Picard's AddOrReplaceReadGroups. To simply concatenate read records into one file, use Picard's GatherBamFiles. An advantage of using MergeBamAlignment over AddOrReplaceReadGroups is the ability to transfer mixed read groups to reads in a single file.
Consider what
PRIMARY_ALIGNMENT_STRATEGY
option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations.- MergeBamAlignment retains secondary alignments with the
INCLUDE_SECONDARY_ALIGNMENTS
parameter. It may be that alignments marked as secondary are truer to biology or at least reveal useful insight.
A read with multiple alignment records may map to multiple loci or may be chimeric--that is, splits the alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M
option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and other flags, e.g. read mapped in proper pair and mate unmapped flags, to fix mapping artifacts. We only note some changes made by MergeBamAlignment to our tutorial data and by no means comprehensively list its features.
java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
R=/path/Homo_sapiens_assembly19.fasta \
UNMAPPED_BAM=Solexa-272222_revertclean.bam \
ALIGNED_BAM=Solexa-272222_markXT_aln.sam \
O=Solexa-272222_merge_IGV_raw.bam \ #output file name in SAM or BAM format
CREATE_INDEX=true \ #standard option for any Picard command
ADD_MATE_CIGAR=true \ #default; adds MC tag
CLIP_ADAPTERS=false \ #changed from default
CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not overlap
INCLUDE_SECONDARY_ALIGNMENTS=true \ #default
MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq
ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain alignment tags starting with X, Y, or Z
TMP_DIR=/path/shlee #optional to process large files
You need not invoke PROGRAM
options as BWA's program group information is sufficient and transfer from the alignment during the merging. If, for whatever reason, you need to apply program group information by a different means, then use MergeBamAlignment to assign each of the following program group options. Example information is given.
PROGRAM_RECORD_ID=bwa \
PROGRAM_GROUP_NAME=bwamem \
PROGRAM_GROUP_VERSION=0.7.7-r441 \
PROGRAM_GROUP_COMMAND_LINE='/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta Solexa-272222_interleavedXT.fq > Solexa-272222_markXT_aln.sam' \
In the command, we change CLIP_ADAPTERS
, MAX_INSERTIONS_OR_DELETIONS
and PRIMARY_ALIGNMENT_STRATEGY
values from default, and invoke other optional parameters.
The path to the reference FASTA given by
R
should also contain the corresponding sequence dictionary with the same base name and extension.dict
. Create a sequence dictionary using Picard's CreateSequenceDictionary.CLIP_ADAPTERS
=false leaves reads unclipped.MAX_INSERTIONS_OR_DELETIONS
=-1 retains reads irregardless of the number of insertions and deletions. Default is 1.PRIMARY_ALIGNMENT_STRATEGY
=MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.ATTRIBUTES_TO_RETAIN
is specified to carryover the XS tag from the alignment, which for BWA reports suboptimal alignment scores. The XS tag in not necessary for our workflow. We retain it to illustrate that the tool only carries over select alignment information unless specified otherwise. For our tutorial data, this is the only additional unaccounted tag from the alignment. [IDK if this tag is used downstream; need to confirm I can keep this.]- Because we have left the
ALIGNER_PROPER_PAIR_FLAGS
parameter at the default false value, MergeBamAlignment may reassign proper pair designations made by the aligner. - By default the merged file is coordinate sorted. We set
CREATE_INDEX
=true to additionally create thebai
index.
Original base quality score restoration is illustrated in Step 3. The following example shows a read pair for which MergeBamAlignment adjusts multiple other information fields. The query name is listed thrice because we have paired reads where one of the reads has two alignment loci, on chromosome 2 and on chromosome 10. The mate is mapped with high MAPQ to chromosome 10. The two loci align 69 and 60 nucleotide regions, respectively, and the aligned regions coincide by 15 bases. A good portion of the chromosome 2 aligned region has low base quality scores. The NM
tag indicates that the chromosome 2 alignment requires one change to match the reference, while the chromosome 10 read requires two changes and this is also reflected in the MD
tags that provide the mismatching positions. When tallying alignment scores, given by the AS
tag, aligners penalize mismatching positions, here apparently by five points per mismatch, e.g. 60 matches minus two mismatches multiplied by five gives an alignment score of 50. Both read records have values for the XS
(suboptimal alignment score) and SA
(chimeric alignment) tags that indicate a split read. Flag values, set by BWA, indicate the chromosome 2 record is primary and the chromosome 10 record is secondary.
#aligned reads from step 4
H0164ALXX140820:2:1101:10003:23460 177 2 33141435 0 37S69M45S 10 91515318 0
GGGTGGGAGGGGGGGAGAGAGGGGTGGGAGAGGGGAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAAGGAAAGGAGGGAGGGAGGGAGCAAGGAAGGAAGGAAGGAAAGA ###########################################################################################FFA<<7F<A-7-AJA7AF-A--FFA<AF-FJA-FF-AA<<JAAFA7A<FJF<F<AF-<-< NM:i:1 MD:Z:51G17 AS:i:64 XS:i:64 SA:Z:10,91515130,+,60M91S,0,2;
H0164ALXX140820:2:1101:10003:23460 417 10 91515130 0 60M91H = 91515318 339
TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF NM:i:2 MD:Z:48T4T6 AS:i:50 XS:i:36 SA:Z:2,33141435,-,37S69M45S,0,1;
H0164ALXX140820:2:1101:10003:23460 113 10 91515318 60 151M 2 33141435 0
CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA <FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA NM:i:0 MD:Z:151 AS:i:151 XS:i:0
#after merging (step 7)
H0164ALXX140820:2:1101:10003:23460 409 2 33141435 0 37S69M45S = 33141435 0
GGGTGGGAGGGGGGGAGAGAGGGGTGGGAGAGGGGAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAAGGAAAGGAGGGAGGGAGGGAGCAAGGAAGGAAGGAAGGAAAGA ###########################################################################################FFA<<7F<A-7-AJA7AF-A--FFA<AF-FJA-FF-AA<<JAAFA7A<FJF<F<AF-<-< SA:Z:10,91515130,+,60M91S,0,2; MD:Z:51G17 PG:Z:bwamem RG:Z:H0164.2 NM:i:1 UQ:i:2 AS:i:64 XS:i:64
H0164ALXX140820:2:1101:10003:23460 163 10 91515130 0 60M91S = 91515318 339
TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC ###########################################################################################FFA<<7F<A-7-AJA7AF-A--FFA<AF-FJA-FF-AA<<JAAFA7A<FJF<F<AF-<-< SA:Z:2,33141435,-,37S69M45S,0,1; MC:Z:151M MD:Z:48T4T6 PG:Z:bwamem RG:Z:H0164.2 NM:i:2 MQ:i:60 UQ:i:4 AS:i:50 XS:i:36
H0164ALXX140820:2:1101:10003:23460 83 10 91515318 60 151M = 91515130 -339
CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA <FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA MC:Z:60M91S MD:Z:151 PG:Z:bwamem RG:Z:H0164.2 NM:i:0 MQ:i:0 UQ:i:0 AS:i:151 XS:i:0
- For the read with two alignments, the aligner hard-clipped the alignment on chromosome 10 giving a CIGAR string of 60M91H and a truncated read sequence. MergeBamAlignment restores this chromosome 10 alignment with a full read sequence and adjusts the CIGAR string to 60M91S, which soft-clips the previously hard-clipped region without loss of alignment specificity.
- Both chromosome 2 and chromosome 10 alignments have zero mapping qualities to indicate multiple equally likely mappings. The similar alignment scores of 64 and 50, given by the
AS
tag, contribute in part to this ambiguity. Additionally, because we asked the aligner to flag shorter split reads as secondary, with the-M
option, it assigned a417
flag to the shorter split chromosome 10 alignment. This makes the chromosome 2 alignment for this read the primary alignment. We set ourPRIMARY_ALIGNMENT_STRATEGY
to MostDistant which asks the tool to consider the best pair to mark as primary from the primary and secondary records. MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment (163
flag) and the chromosome 2 mapping as secondary (409
flag). - MergeBamAlignment updates read group
RG
information, program groupPG
information and mate CIGARMC
tags as specified by our command for reads and in the header section. The tool retainsSA
,MD
,NM
andAS
tags from the alignment, given these are not present in the uBAM. The tool additionally addsUQ
(the Phred likelihood of the segment) andMQ
(mapping quality of the mate/next segment) tags if applicable. The following table summarizes changes to our tutorial data's tags during the workflow.
original | RevertSam | BWA MEM | MergeBamAlignment | |
---|---|---|---|---|
RG | RG | RG | read group | |
PG | PG | PG | program group | |
OC | original cigar | |||
XN | # of ambiguous bases in ref | |||
OP | original mapping position | |||
SA | SA | SA | chimeric alignment | |
MD | MD | MD | string for mismatching positions | |
NM | NM | NM | # of mismatches | |
AS | AS | AS | alignment score | |
UQ | UQ | Phred likelihood of the segment | ||
MC | MC | CIGAR string for mate | ||
MQ | MQ | mapping quality of the mate | ||
OQ | original base quality | |||
XT | tool specific | |||
XS | XS | BWA's secondary alignment score |
- In our example, we retained the aligner generated
XS
tag, for secondary alignment scores, with theATTRIBUTES_TO_RETAIN
option.
After merging our whole tutorial file, our unmapped read records increases by 620, from 5,334,323 to 5,334,943 due to changes in flag designations and not because any reads failed to map. Our total read records remains the same at 828,846,200 for our 819,728,254 original reads, giving ~1.11% multi-record reads.