Channel: Recent Discussions — GATK-Forum

Regarding piping - Picard and BWA (Align and MergeBamAlignment step)


I made 3 BAM files with the **command below.**

Picard version: 2.17.8
BWA version: 0.7.17-r1188

compression_level=2
java_opt="-Xmx32G"
bwa_version="0.7.17-r1188"
bwa_commandline="mem -K 100000000 -p -v 3 -t 64 -Y ${ref_fasta}"

java ${java_opt} -jar ${PICARD_JAR} SamToFastq \
I=${INPUT_BAM} \
INTERLEAVE=true NON_PF=true \
FASTQ=/dev/stdout \
TMP_DIR=${TMP_DIR} | \
${BWA} ${bwa_commandline} /dev/stdin - 2> >(tee ${OUTPUT_BAM}.stderr.log >&2) | \
java -Dsamjdk.compression_level=${compression_level} -Xms12G -jar ${PICARD_JAR} \
MergeBamAlignment \
    VALIDATION_STRINGENCY=SILENT \
    EXPECTED_ORIENTATIONS=FR \
    ATTRIBUTES_TO_RETAIN=X0 \
    ATTRIBUTES_TO_REMOVE=NM \
    ATTRIBUTES_TO_REMOVE=MD \
    ALIGNED_BAM=/dev/stdin \
    UNMAPPED_BAM=${INPUT_BAM} \
    OUTPUT=${OUTPUT_BAM} \
    REFERENCE_SEQUENCE=${ref_fasta} \
    PAIRED_RUN=true \
    SORT_ORDER="unsorted" \
    IS_BISULFITE_SEQUENCE=false \
    ALIGNED_READS_ONLY=false \
    CLIP_ADAPTERS=false \
    MAX_RECORDS_IN_RAM=2000000 \
    ADD_MATE_CIGAR=true \
    MAX_INSERTIONS_OR_DELETIONS=-1 \
    PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
    PROGRAM_RECORD_ID="bwamem" \
    PROGRAM_GROUP_VERSION="${bwa_version}" \
    PROGRAM_GROUP_COMMAND_LINE="${bwa_commandline}" \
    PROGRAM_GROUP_NAME="bwamem" \
    UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
    ALIGNER_PROPER_PAIR_FLAGS=true \
    UNMAP_CONTAMINANT_READS=true \
    ADD_PG_TAG_TO_READS=false

Then I tried the MarkDuplicates step, but it had a problem.

Exception in thread "main" htsjdk.samtools.FileTruncatedException: Premature end of file: /BiO/Project/brandon-genome-analysis/analysis/B001.fastqtosam.unmerged.bam
at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:530)
at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:458)
at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:196)
at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:331)
at java.io.DataInputStream.read(DataInputStream.java:149)
at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:418)
at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:394)
at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:380)
at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:209)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:829)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:803)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
at htsjdk.samtools.MergingSamRecordIterator.next(MergingSamRecordIterator.java:130)
at htsjdk.samtools.MergingSamRecordIterator.next(MergingSamRecordIterator.java:38)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:495)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:232)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

All of the BAM files were truncated.

$ samtools view -c /BiO/Project/brandon-genome-analysis/analysis/B001.fastqtosam.unmerged.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error -1 after 107 of 180 bytes
[main_samview] truncated file.
$ samtools view -c /BiO/Project/brandon-genome-analysis/analysis/B002.fastqtosam.unmerged.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error -1 after 1 of 180 bytes
[main_samview] truncated file.
$ samtools view -c /BiO/Project/brandon-genome-analysis/analysis/B003.fastqtosam.unmerged.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error -1 after 10 of 39 bytes
[main_samview] truncated file.
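
For reference, a BGZF-compressed BAM ends with a fixed 28-byte EOF block whose first two bytes are the gzip magic 1f 8b; one way to check for it directly, in addition to the samtools counts above (file paths reused from this post, and assuming xxd is available), is:

# Dump the last 28 bytes; an intact BAM shows the BGZF EOF block starting 1f 8b.
tail -c 28 /BiO/Project/brandon-genome-analysis/analysis/B001.fastqtosam.unmerged.bam | xxd

# Picard's ValidateSamFile also reports truncation and other structural problems.
java -jar ${PICARD_JAR} ValidateSamFile \
    I=/BiO/Project/brandon-genome-analysis/analysis/B001.fastqtosam.unmerged.bam \
    MODE=SUMMARY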


(How to) Map and clean up short read sequence data efficiently



If you are interested in emulating the methods used by the Broad Genomics Platform to pre-process your short read sequencing data, you have landed on the right page. The parsimonious operating procedures outlined in this three-step workflow maximize data quality as well as storage and processing efficiency to produce a mapped and clean BAM. This clean BAM is ready for analysis workflows that start with MarkDuplicates.

Since your sequencing data could be in a number of formats, the first step of this workflow refers you to specific methods to generate a compatible unmapped BAM (uBAM, Tutorial#6484) or (uBAMXT, Tutorial#6570 coming soon). Not all unmapped BAMs are equal and these methods emphasize cleaning up prior meta information while giving you the opportunity to assign proper read group fields. The second step of the workflow has you marking adapter sequences, e.g. arising from read-through of short inserts, using MarkIlluminaAdapters such that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies meta data. The resulting clean BAM is coordinate-sorted and indexed.

The workflow reflects a lossless operating procedure that retains original sequencing read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one needs only keep the final BAM file.

Geraldine_VdAuwera points out that there are many different ways of correctly preprocessing HTS data for variant discovery and ours is only one approach. So keep this in mind.

We present this workflow using real data from a public sample. The original data file, called Solexa-272222, is large at 150 GB. The file contains 151 bp paired PCR-free reads giving 30x coverage of a human whole genome sample referred to as NA12878. The entire sample library was sequenced in a single flow cell lane and thereby assigns all the reads the same read group ID. The example commands work both on this large file and on smaller files containing a subset of the reads, collectively referred to as snippet. NA12878 has a variant in exon 5 of the CYP2C19 gene, on the portion of chromosome 10 covered by the snippet, resulting in a nonfunctional protein. Consistent with GATK's recommendation of using the most up-to-date tools, for the given example results, with the exception of BWA, we used the most current versions of tools as of their testing (September to December 2015). We provide illustrative example results, some of which were derived from processing the original large file and some of which show intermediate stages skipped by this workflow.

Download example snippet data to follow along the tutorial.

We welcome feedback. Share your suggestions in the Comments section at the bottom of this page.


Jump to a section

  1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL
  2. Mark adapter sequences using MarkIlluminaAdapters
  3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment
    A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq
    B. Align reads and flag secondary hits using BWA-MEM
    C. Restore altered data and apply & adjust meta information using MergeBamAlignment
    D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

Tools involved

Prerequisites

  • Installed Picard tools
  • Installed GATK tools
  • Installed BWA
  • Reference genome
  • Illumina or similar tech DNA sequence reads file containing data corresponding to one read group ID. That is, the file contains data from one sample and from one flow cell lane.

Download example data

  • To download the reference, open ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/ in your browser. Leave the password field blank. Download the following three files (~860 MB) to the same folder: human_g1k_v37_decoy.fasta.gz, .fasta.fai.gz, and .dict.gz. This same reference is available to load in IGV.
  • I divided the example data into two tarballs: tutorial_6483_piped.tar.gz contains the files for the piped process and tutorial_6483_intermediate_files.tar.gz contains the intermediate files produced by running each process independently. The data contain reads originally aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) of GRCh37. The table shows the steps of the workflow, corresponding input and output example data files and approximate minutes and disk space needed to process each step. Additionally, we tabulate the time and minimum storage needed to complete the workflow as presented (piped) or without piping.

image

Related resources

Other notes

  • When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.
  • For large files, (1) use the Java -Xmx setting and (2) set the environmental variable TMP_DIR for a temporary directory.

    java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
        TMP_DIR=/path/shlee 
    

    In the command, the -Xmx8G Java option caps the maximum heap size, or memory usage, to eight gigabytes. The path given by TMP_DIR points the tool to scratch space that it can use. These options allow the tool to run without slowing down as well as run without causing an out of memory error. The -Xmx settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. These options can be omitted for small files such as the example data and the equivalent command is as follows.

    java -jar /path/picard.jar MarkIlluminaAdapters 
    

    To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads.
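
    For example, on a typical Linux or Mac system the check described above boils down to:

    # Print the JVM's default maximum heap size (in bytes).
    java -XX:+PrintFlagsFinal -version | grep MaxHeapSize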

  • When I call default options within a command, follow suit to ensure the same results.


1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL

If you have raw reads data in BAM format with appropriately assigned read group fields, then you can start with step 2. Namely, besides differentiating samples, the read group ID should differentiate factors contributing to technical batch effects, i.e. flow cell lane. If not, you need to reassign read group fields. This dictionary post describes factors to consider and this post and this post provide some strategic advice on handling multiplexed data.

If your reads are mapped, or in BCL or FASTQ format, then generate an unmapped BAM according to the following instructions.

  • To convert FASTQ or revert aligned BAM files, follow directions in Tutorial#6484. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).
  • To convert Illumina Base Call (BCL) files, use IlluminaBasecallsToSam. The tool marks adapter sequences at the same time. The resulting uBAMXT has adapter sequences marked with the XT tag so you can skip step 2 of this workflow and go directly to step 3. The corresponding Tutorial#6570 is coming soon.

See if you can revert 6483_snippet.bam, containing 279,534 aligned reads, to the unmapped 6483_snippet_revertsam.bam, containing 275,546 reads.
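
As a rough sketch only (Tutorial#6484 is the authoritative reference and lists the full set of recommended options), the reversion might look like the following RevertSam command, which produces a query-name-sorted uBAM with alignment information cleared and original base qualities restored:

    # Sketch only; see Tutorial#6484 for the recommended options.
    java -Xmx8G -jar /path/picard.jar RevertSam \
        I=6483_snippet.bam \
        O=6483_snippet_revertsam.bam \
        SANITIZE=true \
        REMOVE_ALIGNMENT_INFORMATION=true \
        REMOVE_DUPLICATE_INFORMATION=true \
        RESTORE_ORIGINAL_QUALITIES=true \
        SORT_ORDER=queryname \
        TMP_DIR=/path/shlee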



2. Mark adapter sequences using MarkIlluminaAdapters

MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence and produces a metrics file. Some of the marked adapters come from concatenated adapters that randomly arise from the primordial soup that is a PCR reaction. Others represent read-through to 3' adapter ends of reads and arise from insert sizes that are shorter than the read length. In some instances read-through can affect the majority of reads in a sample, e.g. in Nextera library samples over-titrated with transposomes, and render these reads unmappable by certain aligners. Tools such as SamToFastq use the XT tag in various ways to effectively remove adapter sequence contribution to read alignment and alignment scoring metrics. Depending on your library preparation, insert size distribution and read length, expect varying amounts of such marked reads.

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
O=6483_snippet_markilluminaadapters.bam \
M=6483_snippet_markilluminaadapters_metrics.txt \ #naming required
TMP_DIR=/path/shlee #optional to process large files

This produces two files. (1) The metrics file, 6483_snippet_markilluminaadapters_metrics.txt bins the number of tagged adapter bases versus the number of reads. (2) The 6483_snippet_markilluminaadapters.bam file is identical to the input BAM, 6483_snippet_revertsam.bam, except reads with adapter sequences will be marked with a tag in XT:i:# format, where # denotes the 5' starting position of the adapter sequence. At least six bases are required to mark a sequence. Reads without adapter sequence remain untagged.

  • By default, the tool uses Illumina adapter sequences. This is sufficient for our example data.
  • Adjust the default standard Illumina adapter sequences to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters. To clear and add new adapter sequences first set ADAPTERS to 'null' then specify each sequence with the parameter.
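
As an illustration of that last point, a hypothetical override could look like the following; the adapter sequences shown are placeholders, not recommendations:

    # Hypothetical example: clear the default adapter list and supply custom
    # adapter sequences (the sequences below are placeholders).
    java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
        I=6483_snippet_revertsam.bam \
        O=6483_snippet_markilluminaadapters.bam \
        M=6483_snippet_markilluminaadapters_metrics.txt \
        ADAPTERS=null \
        FIVE_PRIME_ADAPTER=AATGATACGGCGACCACCGAGATCTACAC \
        THREE_PRIME_ADAPTER=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC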

We plot the metrics data that is in GATKReport file format using RStudio, and as you can see, marked bases vary in size up to the full length of reads.
image image

Do you get the same number of marked reads? 6483_snippet marks 448 reads (0.16%) with XT, while the original Solexa-272222 marks 3,236,552 reads (0.39%).

Below, we show a read pair marked with the XT tag by MarkIlluminaAdapters. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. For XT:i:20, the majority of the read is adapter sequence. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.

Unmapped uBAM (step 1)
image

After MarkIlluminaAdapters (step 2)
image

After SamToFastq (step 3)
image

After MergeBamAlignment (step 3)
image



3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment

This step actually pipes three processes, performed by three different tools. Our tutorial example files are small enough to easily view, manipulate and store, so any difference in piped or independent processing will be negligible. For larger data, however, using Unix pipelines can add up to significant savings in processing time and storage.

Not all tools are amenable to piping and piping the wrong tools or wrong format can result in anomalous data.

The three tools we pipe are SamToFastq, BWA-MEM and MergeBamAlignment. By piping these we bypass storage of larger intermediate FASTQ and SAM files. We additionally save time by eliminating the need for the processor to read in and write out data for two of the processes, as piping retains data in the processor's input-output (I/O) device for the next process.

To make the information more digestible, we will first talk about each tool separately. At the end of the section, we provide the piped command.



3A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq

Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove previously marked adapter sequences, in this example marked with an XT tag. By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to downstream read alignment and alignment scoring metrics.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=6483_snippet_samtofastq_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \ 
NON_PF=true \
TMP_DIR=/path/shlee #optional to process large files         

This produces a FASTQ file in which all extant meta data, i.e. read group information, alignment information, flags and tags are purged. What remains are the read query names prefaced with the @ symbol, read sequences and read base quality scores.

  • For our paired reads example file we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query names of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file (see the example after this list). The BWA aligner accepts interleaved FASTQ files given the -p option.
  • We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider an archaic 0x200 flag bit that denotes reads that do not pass quality controls, aka reads failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.
  • Other CLIPPING_ACTION options include (1) X to hard-clip, (2) N to change bases to Ns or (3) another number to change the base qualities of those positions to the given value.
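
As a quick sanity check of the interleaving, the first eight lines of the FASTQ should show both mates of one pair back to back, with /1 and /2 suffixes on the query names:

    # First two FASTQ records (8 lines): read 1 then read 2 of the same pair.
    head -n 8 6483_snippet_samtofastq_interleaved.fq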



3B. Align reads and flag secondary hits using BWA-MEM

In this workflow, alignment is the most compute intensive and will take the longest time. GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA-MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.

  • Aligning our snippet reads against either a portion or the whole genome is not equivalent to aligning our original Solexa-272222 file, merging and taking a new slice from the same genomic interval.
  • For the tutorial, we use BWA v 0.7.7-r441, the same aligner used by the Broad Genomics Platform as of this writing (9/2015).
  • As mentioned, alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Epstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
  • BWA alignment requires an indexed reference genome file. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa.

    bwa index -a bwtsw human_g1k_v37_decoy.fasta
    

The example command below aligns our example data against the GRCh37 genome. The tool automatically locates the index files within the same folder as the reference FASTA file.

Illustration of an intermediate step unused in workflow. See piped command.

/path/bwa mem -M -t 7 -p /path/human_g1k_v37_decoy.fasta \ 
6483_snippet_samtofastq_interleaved.fq > 6483_snippet_bwa_mem.sam

This command takes the FASTQ file, 6483_snippet_samtofastq_interleaved.fq, and produces an aligned SAM format file, 6483_snippet_bwa_mem.sam, containing read alignment information, an automatically generated program group record and reads sorted in the same order as the input FASTQ file. Aligner-assigned alignment information, flag and tag values reflect each read's or split read segment's best sequence match and do not take into consideration whether pairs are mapped optimally or if a mate is unmapped. Added tags include the aligner-specific XS tag that marks secondary alignment scores in XS:i:# format. This tag is given for each read even when the score is zero and even for unmapped reads. The program group record (@PG) in the header gives the program group ID, group name, group version and recapitulates the given command. Reads are sorted by query name. For the given version of BWA, the aligned file is in SAM format even if given a BAM extension.

Does the aligned file contain read group information?
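
One quick way to check, assuming samtools is installed, is to count @RG lines in the header; for the SAM produced by this step, the count should be zero:

    # Count @RG header lines; BWA's output from this step carries no read group.
    samtools view -H 6483_snippet_bwa_mem.sam | grep -c '^@RG'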

We invoke three options in the command.

  • -M to flag shorter split hits as secondary.
    This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.

  • -p to indicate the given file contains interleaved paired reads.

  • -t followed by a number for the number of processor threads to use concurrently. Here we use seven threads which is one less than the total threads available on my Mac laptop. Check your server or system's total number of threads with the following command provided by KateN.

    getconf _NPROCESSORS_ONLN 
    

In the example data, all of the 1211 unmapped reads each have an asterisk (*) in column 6 of the SAM record, where a read typically records its CIGAR string. The asterisk represents that the CIGAR string is unavailable. The several asterisked reads I examined are recorded as mapping exactly to the same location as their _mapped_ mates but with MAPQ of zero. Additionally, the asterisked reads had varying noticeable amounts of low base qualities, e.g. strings of #s, that corresponded to original base quality calls and not those changed by SamToFastq. This accounting by BWA allows these pairs to always list together, even when the reads are coordinate-sorted, and leaves a pointer to the genomic mapping of the mate of the unmapped read. For the example read pair shown below, comparing sequences shows no apparent overlap, with the highest identity at 72% over 25 nts.
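
To pull these records out yourself, you can filter on column 6 of the SAM records as described above:

    # Show the first few records whose CIGAR field (column 6) is '*', i.e. the
    # unmapped reads discussed above; header lines are skipped.
    awk '!/^@/ && $6 == "*"' 6483_snippet_bwa_mem.sam | head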

After MarkIlluminaAdapters (step 2)
image

After BWA-MEM (step 3)
image

After MergeBamAlignment (step 3)
image



3C. Restore altered data and apply & adjust meta information using MergeBamAlignment

MergeBamAlignment is a beast of a tool, so its introduction is longer. It does more than is implied by its name. Explaining these features requires I fill you in on some background.

Broadly, the tool merges defined information from the unmapped BAM (uBAM, step 1) with that of the aligned BAM (step 3) to conserve read data, e.g. original read information and base quality scores. The tool also generates additional meta information based on the information generated by the aligner, which may alter aligner-generated designations, e.g. mate information and secondary alignment flags. The tool then makes adjustments so that all meta information is congruent, e.g. read and mate strand information based on proper mate designations. We ascribe the resulting BAM as clean.

Specifically, the aligned BAM generated in step 3 lacks read group information and certain tags--the UQ (Phred likelihood of the segment), MC (CIGAR string for mate) and MQ (mapping quality of mate) tags. It has hard-clipped sequences from split reads and altered base qualities. The reads also have what some call mapping artifacts but what are really just features we should not expect from our aligner. For example, the meta information so far does not consider whether pairs are optimally mapped and whether a mate is unmapped (in reality or for accounting purposes). Depending on these assignments, MergeBamAlignment adjusts the read and read mate strand orientations for reads in a proper pair. Finally, the alignment records are sorted by query name. We would like to fix all of these issues before taking our data to a variant discovery workflow.

Enter MergeBamAlignment. As the tool name implies, MergeBamAlignment applies read group information from the uBAM and retains the program group information from the aligned BAM. In restoring original sequences, the tool adjusts CIGAR strings from hard-clipped to soft-clipped. If the alignment file is missing reads present in the unaligned file, then these are retained as unmapped records. Additionally, MergeBamAlignment evaluates primary alignment designations according to a user-specified strategy, e.g. for optimal mate pair mapping, and changes secondary alignment and mate unmapped flags based on its calculations. It makes additional adjustments so that the meta information is congruent. I will soon explain these and additional changes in more detail and show a read record to illustrate.

Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. Therefore, it is critical that these were previously flagged.

A read with multiple alignment records may map to multiple loci or may be chimeric--that is, splits the alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and adds the read mapped in proper pair (0x2) and mate unmapped (0x8) flags. The tool then adjusts the strand orientation flag for a read (0x10) and its proper mate (0x20).

In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters. The path to the reference FASTA given by R should also contain the corresponding .dict sequence dictionary with the same prefix as the reference FASTA. It is imperative that the uBAM and aligned BAM are both sorted by queryname.
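
A quick way to confirm the sort order recorded in the uBAM header, assuming samtools is installed; the @HD line of the step 1 uBAM should report SO:queryname (BWA's SAM output simply keeps the query-name-grouped order of the input FASTQ):

    # The uBAM's @HD header line should report SO:queryname.
    samtools view -H 6483_snippet_revertsam.bam | grep '^@HD'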

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
R=/path/Homo_sapiens_assembly19.fasta \ 
UNMAPPED_BAM=6483_snippet_revertsam.bam \ 
ALIGNED_BAM=6483_snippet_bwa_mem.sam \ #accepts either SAM or BAM
O=6483_snippet_mergebamalignment.bam \
CREATE_INDEX=true \ #standard Picard option for coordinate-sorted outputs
ADD_MATE_CIGAR=true \ #default; adds MC tag
CLIP_ADAPTERS=false \ #changed from default
CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not extend past each other
INCLUDE_SECONDARY_ALIGNMENTS=true \ #default
MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq
ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain tags starting with X, Y, or Z 
TMP_DIR=/path/shlee #optional to process large files

This generates a coordinate-sorted and clean BAM, 6483_snippet_mergebamalignment.bam, and corresponding .bai index. These are ready for analyses starting with MarkDuplicates. The two bullet-point lists below describe changes to the resulting file. The first list gives general comments on select parameters and the second describes some of the notable changes to our example data.

Comments on select parameters

  • Setting PRIMARY_ALIGNMENT_STRATEGY to MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.
  • It may well be that alignments marked as secondary represent interesting biology, so we retain them with the INCLUDE_SECONDARY_ALIGNMENTS parameter.
  • Setting MAX_INSERTIONS_OR_DELETIONS to -1 retains reads regardless of the number of insertions and deletions. The default is 1.
  • Because we leave the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment will reassess and reassign proper pair designations made by the aligner. These are explained below using the example data.
  • ATTRIBUTES_TO_RETAIN is specified to carry over the XS tag from the alignment, which reports BWA-MEM's suboptimal alignment scores. My impression is that this is the next highest score for any alternative or additional alignments BWA considered, whether or not these additional alignments made it into the final aligned records. (IGV's BLAT feature allows you to search for additional sequence matches). For our tutorial data, this is the only additional unaccounted tag from the alignment. The XS tag is unnecessary for the Best Practices Workflow and is not retained by the Broad Genomics Platform's pipeline. We retain it here not only to illustrate that the tool carries over select alignment information only if asked, but also because I think it prudent. Given how compute intensive the alignment process is, the additional ~1% gain in the snippet file size seems a small price against having to rerun the alignment because we realize later that we want the tag.
  • Setting CLIP_ADAPTERS to false leaves reads unclipped.
  • By default the merged file is coordinate sorted. We set CREATE_INDEX to true to additionally create the bai index.
  • We need not invoke PROGRAM options as BWA's program group information is sufficient and is retained in the merging.
  • As a standalone tool, we would normally feed in a BAM file for ALIGNED_BAM instead of the much larger SAM. We will be piping this step however and so need not add an extra conversion to BAM.

Description of changes to our example data

  • MergeBamAlignment merges header information from the two sources that define read groups (@RG) and program groups (@PG) as well as reference contigs.
  • Tags are updated for our example data as shown in the table. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment), MC (mate CIGAR string) and MQ (mapping quality of the mate/next segment) tags if applicable. For unmapped reads (marked with an * asterisk in column 6 of the SAM record), the tool removes AS and XS tags and assigns MC (if applicable), PG and RG tags. This is illustrated for example read H0164ALXX140820:2:1101:29704:6495 in the BWA-MEM section of this document.
  • Original base quality score restoration is illustrated in step 2.

The example below shows a read pair for which MergeBamAlignment adjusts multiple information fields, and these changes are described in the remaining bullet points.

  • MergeBamAlignment changes hard-clipping to soft-clipping, e.g. 96H55M to 96S55M, and restores corresponding truncated sequences with the original full-length read sequence.
  • The tool reorders the read records to reflect the chromosome and contig ordering in the header and the genomic coordinates for each.
  • MergeBamAlignment's MostDistant PRIMARY_ALIGNMENT_STRATEGY asks the tool to consider the best pair to mark as primary from the primary and secondary records. In this pair, one of the reads has two alignment loci, on contig hs37d5 and on chromosome 10. The two loci align 115 and 55 nucleotides, respectively, and the aligned sequences are identical by 55 bases. Flag values set by BWA-MEM indicate the contig hs37d5 record is primary and the shorter chromosome 10 record is secondary. For this chimeric read, MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment and the contig hs37d5 mapping as secondary (0x100 flag bit).
  • In addition, MergeBamAlignment designates each record on chromosome 10 as read mapped in proper pair (0x2 flag bit) and the contig hs37d5 mapping as mate unmapped (0x8 flag bit). IGV's paired reads mode displays the two chromosome 10 mappings as a pair after these MergeBamAlignment adjustments.
  • MergeBamAlignment adjusts read reverse strand (0x10 flag bit) and mate reverse strand (0x20 flag bit) flags consistent with changes to the proper pair designation. For our non-stranded DNA-Seq library alignments displayed in IGV, a read pointing rightward is in the forward direction (absence of 0x10 flag) and a read pointing leftward is in the reverse direction (flagged with 0x10). In a typical pair, where the rightward pointing read is to the left of the leftward pointing read, the left read will also have the mate reverse strand (0x20) flag.

Two distinct classes of mate unmapped read records are now present in our example file: (1) reads whose mates truly failed to map and are marked by an asterisk * in column 6 of the SAM record and (2) multimapping reads whose mates are in fact mapped but in a proper pair that excludes the particular read record. Each of these two classes of mate unmapped reads can contain multimapping reads that map to two or more locations.

Comparing 6483_snippet_bwa_mem.sam and 6483_snippet_mergebamalignment.bam, we see the number of unmapped reads remains the same at 1211, while the number of records with the mate unmapped flag increases by 1359, from 1276 to 2635. These now account for 0.951% of the 276,970 read records.
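
These counts can be reproduced with samtools by filtering on the relevant SAM flag bits, 0x4 for read unmapped and 0x8 for mate unmapped:

    # Count unmapped reads (0x4) and records flagged mate unmapped (0x8).
    samtools view -c -f 4 6483_snippet_mergebamalignment.bam
    samtools view -c -f 8 6483_snippet_mergebamalignment.bam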

For 6483_snippet_mergebamalignment.bam, how many additional unique reads become mate unmapped?

After BWA-MEM alignment
image

After MergeBamAlignment
image



3D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

We pipe the three tools described above to generate an aligned BAM file sorted by query name. In the piped command, the commands for the three processes are given together, separated by a vertical bar |. We also replace each intermediate output and input file name with a symbolic path to the system's output and input devices, here /dev/stdout and /dev/stdin, respectively. We need only provide the first input file and name the last output file.

Before using a piped command, we should ask the shell to report a failure if any step of the pipe errors, so that errors do not pass silently. Type the following into your shell to set this UNIX option.

set -o pipefail

Overview of command structure

[SamToFastq] | [BWA-MEM] | [MergeBamAlignment]

Piped command

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=/dev/stdout \
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
TMP_DIR=/path/shlee | \ 
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta /dev/stdin | \  
java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
ALIGNED_BAM=/dev/stdin \
UNMAPPED_BAM=6483_snippet_revertsam.bam \ 
OUTPUT=6483_snippet_piped.bam \
R=/path/Homo_sapiens_assembly19.fasta CREATE_INDEX=true ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS \
TMP_DIR=/path/shlee

The piped output file, 6483_snippet_piped.bam, is for all intents and purposes the same as 6483_snippet_mergebamalignment.bam, produced by running MergeBamAlignment separately without piping. However, the resulting files, as well as new runs of the workflow on the same data, have the potential to differ in small ways because each uses a different alignment instance.

How do these small differences arise?

Counting the number of mate unmapped reads shows that this number remains unchanged for the two described workflows. Two counts emitted at the end of the progress updates, which also remain constant for these instances, are the number of alignment records and the number of unmapped reads.

INFO    2015-12-08 17:25:59 AbstractAlignmentMerger Wrote 275759 alignment records and 1211 unmapped reads.



Some final remarks

We have produced a clean BAM that is coordinate-sorted and indexed, in an efficient manner that minimizes processing time and storage needs. The file is ready for marking duplicates as outlined in Tutorial#2799. Additionally, we can now free up storage on our file system by deleting the original file we started with, the uBAM and the uBAMXT. We sleep well at night knowing that the clean BAM retains all original information.

We have two final comments (1) on multiplexed samples and (2) on fitting this workflow into a larger workflow.

For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates.
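
As a rough sketch of that per-lane duplicate marking (the output names are illustrative; Tutorial#2799 is the authoritative reference for the recommended options):

    # Sketch only; see Tutorial#2799 for the recommended MarkDuplicates options.
    java -Xmx16G -jar /path/picard.jar MarkDuplicates \
        I=6483_snippet_piped.bam \
        O=6483_snippet_markduplicates.bam \
        M=6483_snippet_markduplicates_metrics.txt \
        CREATE_INDEX=true \
        TMP_DIR=/path/shlee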

For workflows that nest this pipeline, consider additionally optimizing the Java parameters for SamToFastq and MergeBamAlignment. For example, the following are the additional settings used by the Broad Genomics Platform in the piped command for very large data sets.

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...

I give my sincere thanks to Julian Hess, the GATK team and the Data Sciences and Data Engineering (DSDE) team members for all their help in writing this and related documents.



Best practices for joint genotyping of a very large sample size


We are going to variant call 25k+ WGS samples soon. We want to adopt the joint genotyping pipeline provided at https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4-local.wdl & https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4-local.hg38.wgs.inputs.json.

Two questions:
1. One problem is that this pipeline uses a gvcf file, and gvcf is much bigger than vcf in size. So I am not sure if it is practical to have a gvcf file for 25k+ samples.
2. Another problem is memory usage. Can we joint genotype 25k+ WGS samples at once?

The above being said, I am wondering if we could divide the 25k+ samples into smaller groups (e.g. 1000 samples each group), do joint genotyping group by group, without compromising variant calling quality too much. By dividing, we should save space, memory, and time.

BTW, where can I find the gVCFs such as "/home/bshifaw/data/joint_discovery/NA12878.g.vcf.gz" and "/home/bshifaw/data/joint_discovery/NA12878.g.vcf.gz.tbi"?

Thanks.

Does Picard MarkDuplicates remove clonal duplicates from a BS-seq aligned BAM file?


The documentation of this module refers only to sequence similarity.
I would like to ask for clarification: will two sequences with more than 3% mismatches (the default limit for a sequence duplicate, if I understand correctly) that are aligned to the same genomic coordinates be marked as duplicates or not?
Thank you in advance,

Build the SNP recalibration model error


Hi,

I am trying to build the SNP recalibration model by running the following GATK command:

./gatk-4.0.3.0/gatk VariantRecalibrator \
-R human_g1k_v37_decoy.fasta \
-input /mergedFiles.vcf \
--resource hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
--resource omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
--resource 1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_135.b37.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
--recalFile recalibrate_SNP.recal \
-tranchesFile output.tranches \
--rscriptFile output.plots.R

But I am getting the following error.

Error:


A USER ERROR has occurred: Invalid argument 'hapmap_3.3.b37.sites.vcf'.


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

I have used human_g1k_v37_decoy.fasta for alignment; therefore, I am using the same reference for recalibration. I would like to convert the raw variants into analysis-ready variants by applying filtration and annotation. Please let me know if you have any direction on the best practice approach.

Thanks

Making GATK available on the cloud for everyone


Today, several members of our extended group are talking at the BioIT World meeting in Boston, and the Broad mothership is putting out a handful of announcements that are related to GATK. Among other communications there's a press release accompanied by a blog post on the Broad Institute blog, which unveil a landmark agreement we have reached with several major cloud vendors. I'd like to take a few minutes to discuss what is at stake, both in terms of what we're doing, and of how this will affect the wider GATK community.

These announcements all boil down to two things: we built a platform to run the Broad's GATK analysis pipelines in the cloud instead of our local cluster, and we're making that platform accessible to the wider community following a "Software as a Service" (SaaS) model.

Now, before we get any further into discussing what that entails, I want to reassure everyone that we will continue to provide the GATK software as a downloadable executable that can be used anywhere, whether locally on your laptop, on your institution's server farm or computing cluster, or on a cloud platform if you've already got that set up for yourself. The cloud-based service we're announcing is just one more option that we're making available for running GATK. And it should go without saying that we'll continue to provide the same level of support as we have in the past to everyone through the GATK forum; our commitment to that mission is absolute and unwavering.

Alright, so what's happening exactly? Read on to find out!


As discussed recently on this very blog, we've been migrating a substantial portion of the Broad's production genomic analysis pipelines to the cloud. This move was motivated in large part by a need for greater elasticity to deal with the onslaught of massive projects periodically hammering our datacenter (I'm looking at you, @dgmacarthur) as well as a drive toward increased cost-efficiency. But it was also a recognition that the mind-boggling rate at which genomic data is generated (roughly doubling every 8 months!) means we have to adapt how we share and interact with these frankly staggering amounts of data.

To that end, we've been working elbow to elbow with Google engineers for the past eighteen months; in short, they taught us how to cloud and we taught them how to genome. Together we built a system capable of operating our GATK Best Practices production pipelines at scale on the Google Cloud Platform (GCP), using Cromwell and WDL to define and execute the actual workflows. We've also been working closely with a team from the Intel Life Sciences division to solve some of the key challenges involved in scaling up to the next order of dataset magnitude, resulting in a new kind of database that will enable us to perform joint calling on tens, even hundreds of thousands of genomes at a time.

We're already running the Broad's whole genomes on this new platform, and eventually we plan to migrate most if not all our research pipelines (exomes, RNA etc) as well. As a corollary, all of the analysis results produced by the cloud-based pipeline are delivered to researchers through cloud-based workspaces within which they can kick off further analyses. That way, what happens on the cloud stays on the cloud, as far into the process as possible (in part to minimize egress charges).

From my perspective the most immediate upshot of this is that it finally puts us within reach of the holy grail of reproducibility: given the pipeline WDL scripts and resource datasets (both of which we plan to share freely) anyone will be able to reproduce our pipeline processing on their own instance of the Google Cloud with complete independence.

That being said, standing up and administering your own cloud-based service is not exactly trivial, and we know there's a lot of demand for push-button solutions, so we built our system to double as a Software as a Service (SaaS) platform that we can make publicly available for the convenience of the wider community. We plan to make this service accessible to everyone, Broadies and non-Broadies alike, including commercial/for-profit organizations, under the same conditions. Exact pricing has yet to be determined, but it will certainly include the cloud vendor's compute costs, and there will be no separate licensing cost for for-profit use.

We're also opening resale of GATK as a service to commercial SaaS vendors in order to maximize the options available to the community. Illumina has signed on as the first to offer GATK as a service through BaseSpace via Cromwell+WDL, and we're working with all the major cloud computing vendors mentioned above to ensure that the Cromwell+WDL pipelining solution will work as seamlessly and cost-effectively on their platforms as it does today on Google Cloud Platform.

Our ultimate goal here is to reduce the amount of effort that goes into standing up and maintaining implementations of GATK Best Practices worldwide, so that all those resources can be refocused on more interesting work. Personally, I expect that these new developments will contribute to making the GATK Best Practices more readily accessible and affordable to all, and I'm looking forward to being able to announce the availability of the new service later this year!

Phantom indels from HaplotypeCaller?


Dear GATK users and developers,

I am running HaplotypeCaller followed by ValidateVariants, and the latter complains about variants that have a called alternative allele without any observations supporting it.

ERROR MESSAGE: File /storage/rafal.gutaker/NEXT_test/work/4f/6f8738a66d1c9d12651b76b7ef8819/IRIS_313-15896.g.vcf fails strict validation: one or more of the ALT allele(s) for the record at position LOC_Os01g01010:6190 are not observed at all in the sample genotypes |
ERROR ------------------------------------------------------------------------------------------

Here is an example of a site that ValidateVariants complains about:

LOC_Os01g01010 6190 . GT G,<NON_REF> 0 . DP=4;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=14400.00 GT:AD:DP:GQ:PL:SB 0/0:4,0,0:4:12:0,12,135,12,135,135:4,0,0,0
LOC_Os01g01010 6192 . T <NON_REF> . . END=6192 GT:DP:GQ:MIN_DP:PL 0/0:8:0:8:0,0,254

In general, it seems harmless, so I am thinking of removing this check, but why HaplotypeCaller finds phantom variants is a mystery to me.

Thank you and

Best!
Rafal

How do I specify Java 8 when it's not in a system-wide location?


The original version of Java on my computer is 1.7.0_131, and I installed JDK 8 in a directory of my own (not the system-wide location, because I don't have root access).

Then when I try to use GATK 4.0 for SNP calling:
./python /path/to/gatk/gatk --java-options "-Xmx4g" HaplotypeCaller -R sequence/reference.fa.fasta -I sequence/A6475_aligned_out.bam -O output.g.vcf.gz -ERC GVCF
Error: Invalid or corrupt jarfile /export/home/biostuds/2257069w/gatk/gatk-package-4.0.4.0-local.jar

I specify the path to python because I didn't install it in a system-wide location (because I don't have root access, as with Java). It seems that in this command I did not specify JDK 8, but I don't know where I can specify the Java version in this command?


GATK 3.8 Best Practice documents?


Hi there,

Congrats on the version 4 release.
Now the pages are updated to version 4, but I'm currently still working on 3.8. Is there a link where I can find the old Best Practices documents?

Wenfu

BQSR quality strings


Hi! I have a question about the BQSR quality strings, those output as BD:Z, BI:Z and BQ:Z in the final BAM file. I would like to decrease the footprint of our final analysis files. I was thinking of removing those quality strings after variant calling (since they are required for variant calling). I read in one of your posts that it could be possible to retrieve those quality strings if one has the recalibration tables, by just using the PrintReads option. Could you confirm that this is the case?

High proportion of reads excluded after Base Recalibration


Hi GATK team,

Currently, I am doing RNA-seq variant calling following your best practice, except I am using HISAT2 for alignment.
After performing the BaseRecalibration step, I notice in the output there are many reads excluded, mainly because of NotPrimaryAlignmentFilter. Does it mean my reads have a high number of secondary alignments? If yes, will it be a problem?

Below is the complete output message:

INFO 15:33:06,313 BaseRecalibrator - BaseRecalibrator was able to recalibrate 63343819 reads
INFO 15:33:06,318 ProgressMeter - done 6.3343875E7 38.8 m 36.0 s 99.7% 38.9 m 6.0 s
INFO 15:33:06,319 ProgressMeter - Total runtime 2325.16 secs, 38.75 min, 0.65 hours
INFO 15:33:06,319 MicroScheduler - 292530923 reads were filtered out during the traversal out of approximately 355874798 total reads (82.20%)
INFO 15:33:06,320 MicroScheduler - -> 753 reads (0.00% of total) failing BadCigarFilter
INFO 15:33:06,320 MicroScheduler - -> 50087851 reads (14.07% of total) failing DuplicateReadFilter
INFO 15:33:06,320 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 15:33:06,320 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 15:33:06,320 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 15:33:06,320 MicroScheduler - -> 9331702 reads (2.62% of total) failing MappingQualityZeroFilter
INFO 15:33:06,320 MicroScheduler - -> 233110617 reads (65.50% of total) failing NotPrimaryAlignmentFilter
INFO 15:33:06,321 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter


Done. ------------------------------------------------------------------------------------------

Need your suggestion. Thank you.

HG37 support in GATK4


Hi,

Is it possible to run GATK4 with HG37 data?

BR,
Juho

Is there a planned container for GATK 3.8.1?

Mouse Gencode version 15 bundle file

HaplotypeCaller sensitivity in large(ish) cohorts


One of my projects currently has ~150 patients (exomes) that I've been processing through the standard pipeline (2.8-1, including ReduceReads). In my most recent run through HC, I split the cohort in half for the sake of time. A subset of these patients have undergone targeted genotyping in the clinic, and I have a list of 36 validated variants in 28 samples. When I checked these variants in the final VCF, 5 of 36 were not called by HaplotypeCaller and have moderate to excellent support in the BAM. Several of these (possibly all of them? Not sure) were present in previous HC and UG runs with fewer samples, and I verified that the one I'm focusing on is called correctly when I only use five samples.

Debugging runs on a small region have revealed the following:

  1. ReduceReads does not seem to be the culprit, my variant is still uncalled when using the un-reduced bams
  2. My variant is not inside an Active Region
  3. When I force it to be with -forceActive, it's not in the trimmed ActiveRegion
  4. I've tried increasing -maxNumHaplotypesInPopulation as high as 1024, and the trimmed region still doesn't include my variant
  5. I've also tried running with -dontTrimActiveRegions, but haven't successfully finished yet (runtime increases from 30 seconds to over an hour, I keep trying to run it in short queues while I'm doing other stuff and getting killed by the scheduler)

A couple of other random notes that may or may not be applicable: These are rare variants that I only expect to see in 1 or 2 samples. My testing region is ~400bp around the variant in question. There is a variant in another sample at an immediately adjacent nucleotide that is also not called (and, perhaps obviously, is also outside the active regions).

Do you have any suggestions for approaching this? I haven't messed with -minPruning yet, as increasing that value should result in a loss of sensitivity and reducing it seems like a bad idea. I suppose I could split my cohort into subsets of 30 or 40 samples, but that doesn't seem like the best approach.


CombineGVCFs outputs genomic region out of specified intervals


Hi,

I am using the CombineGVCFs module to merge a number of individual WGS gVCFs generated by HaplotypeCaller into a single gVCF file. The -L argument was used to restrict processing to a specific genomic interval, chr1:100000001-150000000. However, the output gVCF file contains info from region chr1:99999813-100000000, which was supposed to be excluded from the output.

Did I make a mistake?

Here is my command-line:

gatk --java-options "-Xmx4G -XX:+PrintCommandLineFlags -XX:ParallelGCThreads=1" CombineGVCFs -R hg38.fa -L chr1:100000001-150000000 --variant gvcf.list -O combine50_1.chr1.100000001-150000000.g.vcf.gz

Error with GATK ModelSegments


I am using the BETA tool "ModelSegments" in a copy number variation analysis and I've run into an error that I don't understand. Within our institution's cluster computing environment, I submitted the following job:

COMMON_DIR="/home/exacloud/lustre1/BioDSP/users/jacojam"
GATK=$COMMON_DIR"/programs/gatk-4.0.4.0"
ALIGNMENT_RUN_T="hg19_BWA_alignment_10058_tumor"
ALIGNMENT_RUN_N="hg19_BWA_alignment_10058_normal"
ALLELIC_COUNTS_T=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_T"/tumor.allelicCounts.tsv"
ALLELIC_COUNTS_N=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_N"/normal.allelicCounts.tsv"
OUTPUT_DIR=$COMMON_DIR"/data/hnscc/DNASeq/"$ALIGNMENT_RUN_T"/GATK_CNV"

srun $GATK/gatk --java-options "-Xmx10000m" ModelSegments --allelic-counts $ALLELIC_COUNTS_T --normal-allelic-counts $ALLELIC_COUNTS_N --output-prefix 10058 -O $OUTPUT_DIR

From this, I get the following error:

Using GATK jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10000m -jar /home/exacloud/lustre1/BioDSP/users/jacojam/programs/gat$
06:42:48.839 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/exacloud/lustre1/BioDSP/users/jacojam/programs/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
06:42:49.212 INFO ModelSegments - ------------------------------------------------------------
06:42:49.212 INFO ModelSegments - The Genome Analysis Toolkit (GATK) v4.0.4.0
06:42:49.212 INFO ModelSegments - For support and documentation go to https://software.broadinstitute.org/gatk/
06:42:49.213 INFO ModelSegments - Executing as jacojam@exanode-3-7.local on Linux v3.10.0-693.17.1.el7.x86_64 amd64
06:42:49.213 INFO ModelSegments - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_161-b14
06:42:49.213 INFO ModelSegments - Start Date/Time: May 2, 2018 6:42:48 AM PDT
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.213 INFO ModelSegments - ------------------------------------------------------------
06:42:49.214 INFO ModelSegments - HTSJDK Version: 2.14.3
06:42:49.214 INFO ModelSegments - Picard Version: 2.18.2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:42:49.214 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:42:49.214 INFO ModelSegments - Deflater: IntelDeflater
06:42:49.214 INFO ModelSegments - Inflater: IntelInflater
06:42:49.214 INFO ModelSegments - GCS max retries/reopens: 20
06:42:49.214 INFO ModelSegments - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
06:42:49.215 WARN ModelSegments -

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: ModelSegments is a BETA tool and is not yet ready for use in production

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

06:42:49.215 INFO ModelSegments - Initializing engine
06:42:49.215 INFO ModelSegments - Done initializing engine
06:42:49.224 INFO ModelSegments - Reading file (/home/exacloud/lustre1/BioDSP/users/jacojam/data/hnscc/DNASeq/hg19_BWA_alignment_10058_tumor/tumor.allelicCounts.tsv)...
06:15:44.797 INFO ModelSegments - Shutting down engine
[May 3, 2018 6:15:44 AM PDT] org.broadinstitute.hellbender.tools.copynumber.ModelSegments done. Elapsed time: 1,412.93 minutes.
Runtime.totalMemory()=6298271744
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at com.opencsv.CSVParser.parseLine(CSVParser.java:383)
at com.opencsv.CSVParser.parseLineMulti(CSVParser.java:299)
at com.opencsv.CSVReader.readNext(CSVReader.java:275)
at org.broadinstitute.hellbender.utils.tsv.TableReader.fetchNextRecord(TableReader.java:348)
at org.broadinstitute.hellbender.utils.tsv.TableReader.access$200(TableReader.java:94)
at org.broadinstitute.hellbender.utils.tsv.TableReader$1.hasNext(TableReader.java:458)
at java.util.Iterator.forEachRemaining(Iterator.java:115)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractRecordCollection.<init>(AbstractRecordCollection.java:82)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractLocatableCollection.<init>(AbstractLocatableCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AbstractSampleLocatableCollection.<init>(AbstractSampleLocatableCollection.java:44)
at org.broadinstitute.hellbender.tools.copynumber.formats.collections.AllelicCountCollection.<init>(AllelicCountCollection.java:58)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments$$Lambda$29/27313641.apply(Unknown Source)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.readOptionalFileOrNull(ModelSegments.java:559)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.doWork(ModelSegments.java:462)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
srun: error: exanode-3-7: task 0: Exited with exit code 1

Is this something you could potentially help me with? Thank you.
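
One straightforward thing to try is giving the JVM a much larger heap for this step, along these lines (just a sketch of the same command with a bigger -Xmx; whether 32 GB is actually enough for these allelicCounts files, or available on the node, is an assumption):

srun $GATK/gatk --java-options "-Xmx32g" ModelSegments --allelic-counts $ALLELIC_COUNTS_T --normal-allelic-counts $ALLELIC_COUNTS_N --output-prefix 10058 -O $OUTPUT_DIR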

How to left normalize multiallelic indels in ploidy=6 vcf files?

Hi,
I have generated 26 GVCFs (with the ploidy parameter set to 6) and then run GATK GenotypeGVCFs, successfully obtaining my genotype calls. Now I am trying to left-normalize the indels using GATK LeftAlignAndTrimVariants, and I get an error message.

The command line:

java -jar /home/tools/manual/GATK-3.7.0/GenomeAnalysisTK.jar -T LeftAlignAndTrimVariants -R hg38/Homo_sapiens_assembly38.fasta --variant samples_6N.vcf -o samples_6N_LeftNorm.vcf --splitMultiallelics   

The ERROR message:

##### ERROR --
##### ERROR stack trace 
java.lang.IllegalStateException: Must initialize the cache of allele anyploid indices for ploidy 6
        at htsjdk.variant.variantcontext.GenotypeLikelihoods.getAlleles(GenotypeLikelihoods.java:532)
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.getLikelihoodIndexes(GATKVariantContextUtils.java:681)
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.determineLikelihoodIndexesToUse(GATKVariantContextUtils.java:639)
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.subsetAlleles(GATKVariantContextUtils.java:610)
        at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.splitVariantContextToBiallelics(GATKVariantContextUtils.java:1071)
        at org.broadinstitute.gatk.tools.walkers.variantutils.LeftAlignAndTrimVariants.map(LeftAlignAndTrimVariants.java:212)
        at org.broadinstitute.gatk.tools.walkers.variantutils.LeftAlignAndTrimVariants.map(LeftAlignAndTrimVariants.java:137)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Must initialize the cache of allele anyploid indices for ploidy 6
##### ERROR ------------------------------------------------------------------------------------------  

Could you please help me with this problem? I could not find a similar discussion.
Thanks a lot!
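
A possible workaround would be to split the multiallelic sites and left-align outside GATK with bcftools norm, along these lines (just a sketch; how bcftools handles the ploidy-6 PL/AD fields when splitting has not been verified here):

bcftools norm -f hg38/Homo_sapiens_assembly38.fasta -m -any samples_6N.vcf -o samples_6N_LeftNorm.bcftools.vcf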

Outputting and Using VariantRecalibration Models

Hello,

I'm working on optimizing the variant-filtering pipeline for my team. Currently we're using VQSR following the Best Practices guidelines. I've been testing VQSR's ability to discriminate FPs from TPs by applying it to sequencing data we've generated from the GM12878 cell line and comparing the resulting VCFs to GIAB's gold-standard NA12878 call set using VCFeval. While following Best Practices for VQSR lowers the total number of FPs, it also lowers the F-measure, meaning more TPs are being filtered than FPs. This isn't optimal, naturally.

If I apply VariantRecalibrator with the GIAB SNP call set as a resource, more FPs than TPs are filtered down to a truth-sensitivity tranche of ~99.9. Ideally, I'd like to train a VQSR model using VCFs with gold-standard call sets as resources, output that model, and then apply it to other VCFs.

I've been working to test the possibility of using this approach to variant filtering. The first step is to feed VariantRecalibrator a GM12878 library and the GIAB truth set, output the model, apply the recalibration and get results from VCFeval.

/home/gatk-4.0.3.0/gatk VariantRecalibrator \
   --reference $ref_fa \
   --variant $raw_snp \
   --resource GIAB,known=false,training=true,truth=true,prior=15.0:$GIAB_snp \
   --use-annotation DP \
   --use-annotation QD \
   --use-annotation FS \
   --use-annotation SOR \
   --use-annotation MQ \
   --use-annotation MQRankSum \
   --use-annotation ReadPosRankSum \
   --mode SNP \
   --truth-sensitivity-tranche 100.0 \
   --truth-sensitivity-tranche 99.98 \
   --truth-sensitivity-tranche 99.95 \
   --truth-sensitivity-tranche 99.90 \
   --output recalibrate_snp_giab.recal \
   --tranches-file recalibrate_snp_giab.tranches \
   --rscript-file recalibrate_snp_giab.plots.R \
   --output-model recalibrate_snp_giab.model

/home/gatk-4.0.3.0/gatk ApplyVQSR \
   --reference $ref_fa \
   --variant $raw_snp \
   --output ${laneid}.filtered.giab.99.9.snp.vcf.gz \
   --truth-sensitivity-filter-level 99.9 \
   --tranches-file recalibrate_snp_giab.tranches \
   --recal-file recalibrate_snp_giab.recal \
   --mode SNP
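
For completeness, the VCFeval comparison step afterwards looks roughly like the following (a sketch; the truth VCF, reference SDF, and output directory names are placeholders):

rtg vcfeval \
   -b giab_na12878_truth.vcf.gz \
   -c ${laneid}.filtered.giab.99.9.snp.vcf.gz \
   -t reference_sdf \
   -o vcfeval_giab_99.9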

Next, I want to feed the model and the same VCF back into VariantRecalibrator (ideally without resources, though that isn't possible), apply the recalibration, and get results from VCFeval. If what I'm trying to do is possible, the two sets of results should be the same for any given tranche.

An example of VariantRecalibrator options I've tried are below.

/home/gatk-4.0.3.0/gatk VariantRecalibrator \
   --reference $ref_fa \
   --variant $raw_snp \
   --resource HapMap,known=false,training=true,truth=true,prior=15.0:$HapMap \
   --resource Omni,known=false,training=true,truth=true,prior=12.0:$Omni \
   --resource 1000G,known=false,training=true,truth=false,prior=10.0:$Thousand_g  \
   --resource dbsnp,known=true,training=false,truth=false,prior=2.0:$DBsnp \
   --input-model $snp_model_orig \
   --use-annotation DP \
   --use-annotation QD \
   --use-annotation FS \
   --use-annotation SOR \
   --use-annotation MQ \
   --use-annotation MQRankSum \
   --use-annotation ReadPosRankSum \
   --mode SNP \
   --truth-sensitivity-tranche 100.0 \
   --truth-sensitivity-tranche 99.98 \
   --truth-sensitivity-tranche 99.95 \
   --truth-sensitivity-tranche 99.90 \
   --output recalibrate_snp_${laneid}.recal \
   --tranches-file recalibrate_snp_${laneid}.tranches \
   --rscript-file recalibrate_snp_${laneid}.plots.R

/home/gatk-4.0.3.0/gatk ApplyVQSR \
   --reference $ref_fa \
   --variant $raw_snp \
   --output ${laneid}.filtered.model.99.9.snp.vcf.gz \
   --truth-sensitivity-filter-level 99.9 \
   --tranches-file recalibrate_snp_${laneid}.tranches \
   --recal-file recalibrate_snp_${laneid}.recal \
   --mode SNP

As I must supply some resources to VariantRecalibrator, I also tried minimizing the effect of any resources.

/home/gatk-4.0.3.0/gatk VariantRecalibrator \
   --reference $ref_fa \
   --variant $raw_snp \
   --resource HapMap,known=false,training=true,truth=true,prior=0.0:$HapMap \
   --input-model $snp_model_ms \
   --use-annotation DP \
   --use-annotation QD \
   --use-annotation FS \
   --use-annotation SOR \
   --use-annotation MQ \
   --use-annotation MQRankSum \
   --use-annotation ReadPosRankSum \
   --mode SNP \
   --truth-sensitivity-tranche 100.0 \
   --truth-sensitivity-tranche 99.98 \
   --truth-sensitivity-tranche 99.95 \
   --truth-sensitivity-tranche 99.90 \
   --output recalibrate_snp_${laneid}.recal \
   --tranches-file recalibrate_snp_${laneid}.tranches \
   --rscript-file recalibrate_snp_${laneid}.plots.R

Neither of these approaches has been very successful in producing results similar to those obtained when the model was first generated and applied. Is there any way to use an output model "as is", without having it changed by VariantRecalibrator the second time around? Or do I misunderstand the nature of VQSR and how models are trained and applied?

Thanks a ton for any help! I really appreciate all the work y'all do!

-Ellis

CRAM vs BAM difference in output of HaplotypeCaller

Hi,
I have been investigating CRAM as a long-term storage format. I took a paired FASTQ dataset through processing to an 'analysis-ready BAM' following https://github.com/gatk-workflows/gatk4-data-processing

I then converted that file to CRAM using scramble:
scramble -r hs37d5x.fa NA12878.hs37d5x.bam NA12878.hs37d5x.bam.scramble.cram

I believe this conversion is mostly lossless: it preserves names, bases, and qualities. However, it does reorder the optional SAM tags and discards some (NM and MD at least, because they are trivially re-calculable).
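
A quick way to sanity-check that the core fields survive the round trip is to compare the first eleven SAM columns of the two files (a sketch; it deliberately ignores the optional tags that get reordered or dropped):

samtools view NA12878.hs37d5x.bam | cut -f1-11 | sort | md5sum
samtools view -T hs37d5x.fa NA12878.hs37d5x.bam.scramble.cram | cut -f1-11 | sort | md5sum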

I then ran both the original BAM and the CRAM through Best Practices part 2: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/haplotypecaller-gvcf-gatk4.wdl

There are some small differences in the output .g.vcf (interleaved below: for each site, the CRAM-derived record first, then the BAM-derived record):

1   143405790   .   GC  G,<NON_REF> 0   .   BaseQRankSum=0.48;ClippingRankSum=0;DP=70;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-1.485;RAW_MQ=87127;ReadPosRankSum=-0.741  GT:AD:DP:GQ:PL:SB   0/0:66,2,0:68:99:0,121,3162,205,3168,3252:28,38,2,0
1   143405790   .   GC  G,<NON_REF> 0   .   BaseQRankSum=0.48;ClippingRankSum=0;DP=70;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-1.485;RAW_MQ=87127;ReadPosRankSum=-0.741  GT:AD:DP:GQ:PL:SB   0/0:66,2,0:68:99:0,121,3130,205,3136,3220:28,38,2,0
1   143405791   .   C   T,<NON_REF> 0   .   BaseQRankSum=0.52;ClippingRankSum=0;DP=70;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-0.045;RAW_MQ=87127;ReadPosRankSum=1.37    GT:AD:DP:GQ:PL:SB   0/0:67,3,0:70:99:0,114,3025,204,3034,3124:31,36,1,2
1   143405791   .   C   T,<NON_REF> 0   .   BaseQRankSum=0.52;ClippingRankSum=0;DP=70;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=-0.045;RAW_MQ=87127;ReadPosRankSum=1.37    GT:AD:DP:GQ:PL:SB   0/0:67,3,0:70:99:0,114,2991,204,3000,3090:31,36,1,2
1   144920661   .   G   A,<NON_REF> 378.77  .   BaseQRankSum=1.196;ClippingRankSum=0;DP=41;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-1.184;RAW_MQ=144373;ReadPosRankSum=-0.807  GT:AD:DP:GQ:PL:SB   0/1:23,15,0:38:99:407,0,897,477,942,1419:10,13,11,4
1   144920661   .   G   A,<NON_REF> 378.77  .   BaseQRankSum=1.196;ClippingRankSum=0;DP=41;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=-1.184;RAW_MQ=144373;ReadPosRankSum=-0.807  GT:AD:DP:GQ:PL:SB   0/1:23,15,0:38:99:407,0,897,476,942,1418:10,13,11,4
1   144920668   .   C   T,<NON_REF> 612.77  .   BaseQRankSum=-1.145;ClippingRankSum=0;DP=40;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=1.266;RAW_MQ=140773;ReadPosRankSum=1.3 GT:AD:DP:GQ:PL:SB   0/1:15,22,0:37:99:641,0,564,686,630,1316:11,4,10,12
1   144920668   .   C   T,<NON_REF> 611.77  .   BaseQRankSum=-1.145;ClippingRankSum=0;DP=40;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=1.266;RAW_MQ=140773;ReadPosRankSum=1.3 GT:AD:DP:GQ:PL:SB   0/1:15,22,0:37:99:640,0,564,685,630,1315:11,4,10,12
1   144920703   .   CA  C,CAA,CAAA,<NON_REF>    0.2 .   BaseQRankSum=-0.776;ClippingRankSum=0;DP=35;ExcessHet=3.0103;MLEAC=1,0,0,0;MLEAF=0.5,0,0,0;MQRankSum=0;RAW_MQ=122773;ReadPosRankSum=-1.378  GT:AD:DP:GQ:PL:SB   0/1:9,4,1,2,0:16:14:24,0,188,23,109,216,14,137,215,319,61,173,215,246,267:9,0,6,1
1   144920703   .   CA  C,CAA,CAAA,<NON_REF>    0.2 .   BaseQRankSum=-0.776;ClippingRankSum=0;DP=35;ExcessHet=3.0103;MLEAC=1,0,0,0;MLEAF=0.5,0,0,0;MQRankSum=0;RAW_MQ=122773;ReadPosRankSum=-1.378  GT:AD:DP:GQ:PL:SB   0/1:9,4,1,2,0:16:14:24,0,188,23,105,190,14,137,204,319,61,171,202,239,258:9,0,6,1
1   144920754   .   GA  G,<NON_REF> 224.73  .   BaseQRankSum=1.247;ClippingRankSum=0;DP=39;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0;RAW_MQ=138564;ReadPosRankSum=0.893    GT:AD:DP:GQ:PL:SB   0/1:22,11,0:33:99:262,0,668,328,701,1030:16,6,6,5
1   144920754   .   GA  G,<NON_REF> 205.73  .   BaseQRankSum=1.247;ClippingRankSum=0;DP=39;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MQRankSum=0;RAW_MQ=138564;ReadPosRankSum=0.893    GT:AD:DP:GQ:PL:SB   0/1:22,11,0:33:99:243,0,669,309,701,1010:16,6,6,5

The QUAL field differs slightly in some cases, but substantially in the last record (224.73 vs 205.73).
The PL genotype field ('PL: phred-scaled genotype likelihoods rounded to the closest integer') also differs slightly.

Can anyone help me explain these differences? I am not sure whether they are numerically significant, but I would not have expected them regardless. I do need to double-check that each HaplotypeCaller run used the same version of GATK4.
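
One quick way to check that is to grep the gVCF headers for the GATK command line (a sketch; the file names are placeholders, and it assumes the usual ##GATKCommandLine header line is present):

zcat NA12878.from_cram.g.vcf.gz | grep -m1 '^##GATKCommandLine'
zcat NA12878.from_bam.g.vcf.gz | grep -m1 '^##GATKCommandLine'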

Thanks
