Channel: Recent Discussions — GATK-Forum

ERROR running SplitNCigarReads on RNAseq data using Ensembl Mus musculus reference FASTA file

I was trying to run this command:

~/downloads/gatk-3.7 -T SplitNCigarReads -R ~/refFiles/Mus_musculus/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa -I ../cleanRun/preqcBamFiles/H1_sorted_reordered_marked_dups.bam -o H1.intervals -U ALLOW_N_CIGAR_READS
cat ~/downloads/gatk-3.7

#!/bin/bash

jar="$HOME/downloads/GenomeAnalysisTK.jar"

exec java -Xmx3g -jar "$jar" "$@"

I get this error about an hour into the run

##### ERROR --
##### ERROR stack trace 
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: BUG: requested unknown contig=CHR_MG132_PATCH index=-1
        at org.broadinstitute.gatk.utils.MRUCachingSAMSequenceDictionary.updateCache(MRUCachingSAMSequenceDictionary.java:178)
        at org.broadinstitute.gatk.utils.MRUCachingSAMSequenceDictionary.getSequence(MRUCachingSAMSequenceDictionary.java:109)
        at org.broadinstitute.gatk.utils.GenomeLocParser.validateGenomeLoc(GenomeLocParser.java:306)
        at org.broadinstitute.gatk.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:261)
        at org.broadinstitute.gatk.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:471)
        at org.broadinstitute.gatk.engine.datasources.providers.ReadReferenceView.getReferenceContext(ReadReferenceView.java:98)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:140)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:128)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.aggregateMapData(TraverseReadsNano.java:119)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:101)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:107)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: BUG: requested unknown contig=CHR_MG132_PATCH index=-1
##### ERROR ------------------------------------------------------------------------------------------

Also, the docs aren't accurate. https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_rnaseq_SplitNCigarReads.php lists -U ALLOW_N_CIGARS, but according to the command-line help it should be -U ALLOW_N_CIGAR_READS. The latter worked; the former didn't.

Thanks


Custom Walker :

Hi all,

I'm playing with the GATK3.6 API, writing a custom walker.

@DocumentedGATKFeature(
        summary="Annotate Variants with eigen data",
        groupName = HelpConstants.DOCS_CAT_VARMANIP,
        extraDocs = {CommandLineGATK.class}
        )
public class EigenVariants extends AbstractVariantProcessor {
public abstract class AbstractVariantProcessor
extends RodWalker<Long, Long> implements TreeReducible<Long>
{
    @Input(fullName="variant", shortName = "V", doc="Input VCF file", required=true)
    protected RodBinding<VariantContext> variants;

    @Output(doc="File to which variants should be written")
    protected VariantContextWriter writer = null;
(...)

I'm compiling, packaging & everything is fine:

$  java -cp GenomeAnalysisTK.jar:mygatk.jar org.broadinstitute.gatk.engine.CommandLineGATK -T EigenVariants --help
(...)
Arguments for EigenVariants:
 -eigen,--eigenDirectory <eigenDirectory>   The Eigen directory
 -V,--variant <variant>                     Input VCF file
 -o,--out <out>                             File to which variants should be written
(...)

but when I only add one import from my source (no variable, only the import):

import com.github.lindenb.jvarkit.tools.vcfeigen.EigenInfoAnnotator;

then the PluginManager goes mad:

$  java -cp GenomeAnalysisTK.jar:mygatk.jar org.broadinstitute.gatk.engine.CommandLineGATK -T EigenVariants --help
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.6-0-g89b7209): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Could not create module CodingFeatureCodec because Cannot instantiate class (Illegal Access) caused by exception Class org.broadinstitute.gatk.utils.classloader.PluginManager can not access a member of class com.github.lindenb.jvarkit.tools.vcfeigen.EigenInfoAnnotator$CodingFeatureCodec with modifiers "private"
##### ERROR ------------------------------------------------------------------------------------------

Why? How can I fix this? Thanks!

GATK 3.6 api: GATK cannot find my Walker when I use a java lambda / Filter ?

Hi all,
It's cross-posted on SO: http://stackoverflow.com/questions/41678374

I've written a custom plugin for GATK 3.6

$ javac -version
javac 1.8.0_60

My walker compiled but wasn't found/loaded by GATK, even though it was in the classpath and a similar tool was working without any problem. By removing parts of the code, I've narrowed the problem down to the following line (see filter):

(...)
ctx.getAlleles().stream().filter(T->!(T.isSymbolic() || T.isNoCall())).mapToInt(new ToIntFunction<Allele>() {
    public int applyAsInt(Allele value) {return value.length();};
    });
(...)

My plugin is loaded (I can see it with -T MyPlugin --help) if I change the line above to:

(...)
ctx.getAlleles().stream().mapToInt(new ToIntFunction<Allele>() {
        public int applyAsInt(Allele value) {return value.length();};
    });
(...)

It's also loaded if the line is:

final Predicate<Allele> afilter = new Predicate<Allele>() {
    @Override
    public boolean test(Allele a) {
        return !(a.isNoCall() || a.isSymbolic());
    }
};
ctx.getAlleles().stream().filter(afilter).mapToInt(new ToIntFunction<Allele>() {
        public int applyAsInt(Allele value) {return value.length();};
    });

Why?

Annotation is lost when merging GVCF files with GenotypeGVCFs

Dear GATK team,
The StrandBiasBySample annotation works fine with the HaplotypeCaller, but when I merge all my GVCF files into one with the GenotypeGVCFs tool I lose this information for all my variants. Here is the command I use:

java -jar /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/mapping/GenomeAnalysisTK.jar \
    --analysis_type GenotypeGVCFs \
    -R /data/references/human/genome/GRCh38_hg20/GRCh38.fasta \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/A1048_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/A333_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/A798_REP_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/A846_URG_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/A899_URG_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/P283_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/P284_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/P285_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/P288_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/P289_INH_raw.g.vcf.gz \
    --variant /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/P291_INH_raw.g.vcf.gz \
    -o /data/results/Bioinfo/BF18_TargetSeq_Genomica/results/variant_calling/all_samples.vcf \
    --includeNonVariantSites

Is there any way to safely join my GVCF files into one and keep the StrandBiasBySample annotation for my variants?

Thanks very much in advance.

Regards,

Sheila

(How to) Map and clean up short read sequence data efficiently


If you are interested in emulating the methods used by the Broad Genomics Platform to pre-process your short read sequencing data, you have landed on the right page. The parsimonious operating procedures outlined in this three-step workflow maximize data quality as well as storage and processing efficiency to produce a mapped and clean BAM. This clean BAM is ready for analysis workflows that start with MarkDuplicates.

Since your sequencing data could be in a number of formats, the first step of this workflow refers you to specific methods to generate a compatible unmapped BAM (uBAM, Tutorial#6484) or (uBAMXT, Tutorial#6570 coming soon). Not all unmapped BAMs are equal and these methods emphasize cleaning up prior meta information while giving you the opportunity to assign proper read group fields. The second step of the workflow has you marking adapter sequences, e.g. arising from read-through of short inserts, using MarkIlluminaAdapters so that they contribute minimally to alignments, which allows the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies meta data. The resulting clean BAM is coordinate-sorted and indexed.

The workflow reflects a lossless operating procedure that retains original sequencing read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one needs only keep the final BAM file.

Geraldine_VdAuwera points out that there are many different ways of correctly preprocessing HTS data for variant discovery and ours is only one approach. So keep this in mind.

We present this workflow using real data from a public sample. The original data file, called Solexa-272222, is large at 150 GB. The file contains 151 bp paired PCR-free reads giving 30x coverage of a human whole genome sample referred to as NA12878. The entire sample library was sequenced in a single flow cell lane, so all the reads are assigned the same read group ID. The example commands work both on this large file and on smaller files containing a subset of the reads, collectively referred to as snippet. NA12878 has a variant in exon 5 of the CYP2C19 gene, on the portion of chromosome 10 covered by the snippet, resulting in a nonfunctional protein. Consistent with GATK's recommendation of using the most up-to-date tools, for the given example results, with the exception of BWA, we used the most current versions of tools as of their testing (September to December 2015). We provide illustrative example results, some of which were derived from processing the original large file and some of which show intermediate stages skipped by this workflow.

Download example snippet data to follow along with the tutorial.

We welcome feedback. Share your suggestions in the Comments section at the bottom of this page.


Jump to a section

  1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL
  2. Mark adapter sequences using MarkIlluminaAdapters
  3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment
    A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq
    B. Align reads and flag secondary hits using BWA-MEM
    C. Restore altered data and apply & adjust meta information using MergeBamAlignment
    D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

Tools involved

Prerequisites

  • Installed Picard tools

  • Installed GATK tools

  • Installed BWA
  • Reference genome
  • Illumina or similar tech DNA sequence reads file containing data corresponding to one read group ID. That is, the file contains data from one sample and from one flow cell lane.

Download example data

  • To download the reference, open ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/ in your browser. Leave the password field blank. Download the following three files (~860 MB) to the same folder: human_g1k_v37_decoy.fasta.gz, .fasta.fai.gz, and .dict.gz. This same reference is available to load in IGV.

  • I divided the example data into two tarballs: tutorial_6483_piped.tar.gz contains the files for the piped process and tutorial_6483_intermediate_files.tar.gz contains the intermediate files produced by running each process independently. The data contain reads originally aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) of GRCh37. The table shows the steps of the workflow, corresponding input and output example data files and approximate minutes and disk space needed to process each step. Additionally, we tabulate the time and minimum storage needed to complete the workflow as presented (piped) or without piping.

image

Related resources

Other notes

  • When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.

  • For large files, (1) use the Java -Xmx setting and (2) set the Picard TMP_DIR option to point to a temporary directory.

    java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
        TMP_DIR=/path/shlee 
    

    In the command, the -Xmx8G Java option caps the maximum heap size, or memory usage, to eight gigabytes. The path given by TMP_DIR points the tool to scratch space that it can use. These options allow the tool to run without slowing down as well as run without causing an out of memory error. The -Xmx settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. These options can be omitted for small files such as the example data and the equivalent command is as follows.

    java -jar /path/picard.jar MarkIlluminaAdapters 
    

    To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads, e.g. if 1G is the minimum required for one thread, then use 2G for two threads.

  • When I call default options within a command, follow suit to ensure the same results.
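
As a rough illustration of the memory note above, the following sketch scales an -Xmx value with the thread count and wraps the default heap lookup in a function. The helper names heap_for_threads and default_max_heap are hypothetical, and the linear scaling is only the rule of thumb stated above, not a guarantee for any particular tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scale the -Xmx value linearly with the thread count.
# Arguments: per-thread minimum in gigabytes, number of threads.
heap_for_threads() {
    local per_thread_gb=$1 threads=$2
    echo "$(( per_thread_gb * threads ))G"
}

# Report the JVM's default maximum heap size.
# Only runs when called, and needs java on the PATH.
default_max_heap() {
    java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -w MaxHeapSize
}

# Build (but do not run) a Picard command line for two threads at 1G each.
xmx=$(heap_for_threads 1 2)
echo "java -Xmx${xmx} -jar /path/picard.jar MarkIlluminaAdapters TMP_DIR=/path/shlee"
```

The script only prints the command so that you can inspect the computed -Xmx value before running anything.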


1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL

If you have raw reads data in BAM format with appropriately assigned read group fields, then you can start with step 2. Namely, besides differentiating samples, the read group ID should differentiate factors contributing to technical batch effects, i.e. flow cell lane. If not, you need to reassign read group fields. This dictionary post describes factors to consider and this post and this post provide some strategic advice on handling multiplexed data.

If your reads are mapped, or in BCL or FASTQ format, then generate an unmapped BAM according to the following instructions.

  • To convert FASTQ or revert aligned BAM files, follow directions in Tutorial#6484. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).

  • To convert Illumina Base Call (BCL) files, use IlluminaBasecallsToSam. The tool marks adapter sequences at the same time. The resulting uBAMXT has adapter sequences marked with the XT tag so you can skip step 2 of this workflow and go directly to step 3. The corresponding Tutorial#6570 is coming soon.

See if you can revert 6483_snippet.bam, containing 279,534 aligned reads, to the unmapped 6483_snippet_revertsam.bam, containing 275,546 reads.


2. Mark adapter sequences using MarkIlluminaAdapters

MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence and produces a metrics file. Some of the marked adapters come from concatenated adapters that randomly arise from the primordial soup that is a PCR reaction. Others represent read-through to 3' adapter ends of reads and arise from insert sizes that are shorter than the read length. In some instances read-through can affect the majority of reads in a sample, e.g. in Nextera library samples over-titrated with transposomes, and render these reads unmappable by certain aligners. Tools such as SamToFastq use the XT tag in various ways to effectively remove adapter sequence contribution to read alignment and alignment scoring metrics. Depending on your library preparation, insert size distribution and read length, expect varying amounts of such marked reads.

# M names the metrics file (required); TMP_DIR is optional, for processing large files
java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
O=6483_snippet_markilluminaadapters.bam \
M=6483_snippet_markilluminaadapters_metrics.txt \
TMP_DIR=/path/shlee

This produces two files. (1) The metrics file, 6483_snippet_markilluminaadapters_metrics.txt bins the number of tagged adapter bases versus the number of reads. (2) The 6483_snippet_markilluminaadapters.bam file is identical to the input BAM, 6483_snippet_revertsam.bam, except reads with adapter sequences will be marked with a tag in XT:i:# format, where # denotes the 5' starting position of the adapter sequence. At least six bases are required to mark a sequence. Reads without adapter sequence remain untagged.

  • By default, the tool uses Illumina adapter sequences. This is sufficient for our example data.

  • Adjust the default standard Illumina adapter sequences to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters. To clear and add new adapter sequences first set ADAPTERS to 'null' then specify each sequence with the parameter.
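
To check the XT:i:# tagging described above on your own output, something like the following sketch can list tagged reads from SAM text. The helper names list_xt and sample_sam are hypothetical, and the two records are made up for illustration. With a real BAM you would need to stream it as SAM text first, for example with samtools view, which is an assumption on my part and not part of this tutorial's tool list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read SAM text on stdin and print "<read name> <XT value>"
# for every record that carries an XT:i tag. Header lines (starting with @) are skipped.
list_xt() {
    awk -F'\t' '!/^@/ {
        for (i = 12; i <= NF; i++)
            if ($i ~ /^XT:i:/) print $1, substr($i, 6)
    }'
}

# Two made-up unmapped records: the first is tagged at position 5, the second is untagged.
sample_sam() {
    printf 'readA\t77\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tRG:Z:rg1\tXT:i:5\n'
    printf 'readB\t77\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tRG:Z:rg1\n'
}

sample_sam | list_xt          # prints: readA 5
sample_sam | list_xt | wc -l  # number of XT-tagged records
```

Optional tags start at column 12 of a SAM record, which is why the loop begins there.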

We plot the metrics data that is in GATKReport file format using RStudio, and as you can see, marked bases vary in size up to the full length of reads.
image image

Do you get the same number of marked reads? 6483_snippet marks 448 reads (0.16%) with XT, while the original Solexa-272222 marks 3,236,552 reads (0.39%).

Below, we show a read pair marked with the XT tag by MarkIlluminaAdapters. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. For XT:i:20, the majority of the read is adapter sequence. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.

Unmapped uBAM (step 1)
image

After MarkIlluminaAdapters (step 2)
image

After SamToFastq (step 3)
image

After MergeBamAlignment (step 3)
image


3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment

This step actually pipes three processes, performed by three different tools. Our tutorial example files are small enough to easily view, manipulate and store, so any difference in piped or independent processing will be negligible. For larger data, however, using Unix pipelines can add up to significant savings in processing time and storage.

Not all tools are amenable to piping and piping the wrong tools or wrong format can result in anomalous data.

The three tools we pipe are SamToFastq, BWA-MEM and MergeBamAlignment. By piping these we bypass storage of larger intermediate FASTQ and SAM files. We additionally save time by eliminating the need to write out and read back data for two of the processes, as piping streams the output of one process directly into the next instead of going through files on disk.

To make the information more digestible, we will first talk about each tool separately. At the end of the section, we provide the piped command.


3A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq

Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove previously marked adapter sequences, in this example marked with an XT tag. By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to downstream read alignment and alignment scoring metrics.

Illustration of an intermediate step unused in workflow. See piped command.

# TMP_DIR is optional, for processing large files
java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=6483_snippet_samtofastq_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true \
TMP_DIR=/path/shlee

This produces a FASTQ file in which all extant meta data, i.e. read group information, alignment information, flags and tags are purged. What remains are the read query names prefaced with the @ symbol, read sequences and read base quality scores.

  • For our paired reads example file we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query names of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file. The BWA aligner accepts interleaved FASTQ files given the -p option.

  • We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider an archaic 0x200 flag bit that denotes reads that do not pass quality controls, aka reads failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.

  • Other CLIPPING_ACTION options include (1) X to hard-clip, (2) N to change bases to Ns or (3) another number to change the base qualities of those positions to the given value.
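
To make the three CLIPPING_ACTION behaviours concrete, here is a small model of what happens to one read from the XT position onward. This is only a sketch: clip_read, phred_to_char and repeat_char are hypothetical helpers, not Picard code, and I assume the XT value is a 1-based position and that FASTQ qualities use the Sanger (Phred+33) encoding, under which a quality of 2 prints as the # symbol.

```shell
#!/usr/bin/env bash
# Convert a Phred score to its Sanger (Phred+33) FASTQ character.
phred_to_char() {
    printf '%b' "\\0$(printf '%03o' "$(( $1 + 33 ))")"
}

# Repeat a single character n times.
repeat_char() {
    local ch=$1 n=$2 out="" i
    for (( i = 0; i < n; i++ )); do out+=$ch; done
    echo "$out"
}

# Hypothetical model of CLIPPING_ACTION applied from 1-based position XT to the read end.
# Usage: clip_read SEQ QUAL XT ACTION   (ACTION is X, N, or an integer quality)
clip_read() {
    local seq=$1 qual=$2 xt=$3 action=$4
    local keep=$(( xt - 1 ))
    local tail_len=$(( ${#seq} - keep ))
    case $action in
        X) echo "${seq:0:keep} ${qual:0:keep}" ;;                    # hard-clip bases and qualities
        N) echo "${seq:0:keep}$(repeat_char N "$tail_len") $qual" ;; # mask bases, keep qualities
        *) echo "$seq ${qual:0:keep}$(repeat_char "$(phred_to_char "$action")" "$tail_len")" ;;
    esac
}

clip_read ACGTACGTAC FFFFFFFFFF 5 2   # ACGTACGTAC FFFF######
clip_read ACGTACGTAC FFFFFFFFFF 5 N   # ACGTNNNNNN FFFFFFFFFF
clip_read ACGTACGTAC FFFFFFFFFF 5 X   # ACGT FFFF
```

The first call mirrors the setting used in this workflow: the bases are kept, which is what lets MergeBamAlignment restore the original qualities later, while the aligner sees the adapter portion as very low quality.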


3B. Align reads and flag secondary hits using BWA-MEM

In this workflow, alignment is the most compute intensive and will take the longest time. GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA-MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.

  • Aligning our snippet reads against either a portion or the whole genome is not equivalent to aligning our original Solexa-272222 file, merging and taking a new slice from the same genomic interval.

  • For the tutorial, we use BWA v 0.7.7.r441, the same aligner used by the Broad Genomics Platform as of this writing (9/2015).

  • As mentioned, alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Epstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
  • BWA alignment requires an indexed reference genome file. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa.

    bwa index -a bwtsw human_g1k_v37_decoy.fasta
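
Before starting a long alignment it can be worth confirming that all five index files sit next to the reference. The following is a minimal sketch; check_bwa_index is a hypothetical helper name, and the extensions are the five listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the five BWA index files are missing
# for a given reference FASTA. Returns 0 only if all five are present.
check_bwa_index() {
    local ref=$1 ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -e "${ref}.${ext}" ]; then
            echo "missing: ${ref}.${ext}"
            missing=1
        fi
    done
    return "$missing"
}

ref=/path/human_g1k_v37_decoy.fasta
check_bwa_index "$ref" || echo "index incomplete, run: bwa index -a bwtsw $ref"
```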
    

The example command below aligns our example data against the GRCh37 genome. The tool automatically locates the index files within the same folder as the reference FASTA file.

Illustration of an intermediate step unused in workflow. See piped command.

/path/bwa mem -M -t 7 -p /path/human_g1k_v37_decoy.fasta \
6483_snippet_samtofastq_interleaved.fq > 6483_snippet_bwa_mem.sam

This command takes the FASTQ file, 6483_snippet_samtofastq_interleaved.fq, and produces an aligned SAM format file, 6483_snippet_bwa_mem.sam, containing read alignment information, an automatically generated program group record and reads sorted in the same order as the input FASTQ file. Aligner-assigned alignment information, flag and tag values reflect each read's or split read segment's best sequence match and do not take into consideration whether pairs are mapped optimally or if a mate is unmapped. Added tags include the aligner-specific XS tag that marks secondary alignment scores in XS:i:# format. This tag is given for each read even when the score is zero and even for unmapped reads. The program group record (@PG) in the header gives the program group ID, group name, group version and recapitulates the given command. Reads are sorted by query name. For the given version of BWA, the aligned file is in SAM format even if given a BAM extension.

Does the aligned file contain read group information?

We invoke three options in the command.

  • -M to flag shorter split hits as secondary.
    This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.

  • -p to indicate the given file contains interleaved paired reads.

  • -t followed by a number for the number of processor threads to use concurrently. Here we use seven threads, which is one less than the total threads available on my Mac laptop. Check your server or system's total number of threads with the following command provided by KateN.

    getconf _NPROCESSORS_ONLN 
    

In the example data, all of the 1211 unmapped reads each have an asterisk (*) in column 6 of the SAM record, where a read typically records its CIGAR string. The asterisk represents that the CIGAR string is unavailable. The several asterisked reads I examined are recorded as mapping exactly to the same location as their mapped mates but with MAPQ of zero. Additionally, the asterisked reads had varying noticeable amounts of low base qualities, e.g. strings of #s, that corresponded to original base quality calls and not those changed by SamToFastq. This accounting by BWA allows these pairs to always list together, even when the reads are coordinate-sorted, and leaves a pointer to the genomic mapping of the mate of the unmapped read. For the example read pair shown below, comparing sequences shows no apparent overlap, with the highest identity at 72% over 25 nts.
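
A count like the one above can be reproduced from SAM text by testing the 0x4 (read unmapped) flag bit together with the CIGAR column. The sketch below uses a hypothetical helper name, count_unmapped_star, and one made-up read pair in which the unmapped mate is placed at its mapped mate's coordinates with a MAPQ of zero.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count SAM records on stdin that have the 0x4 (unmapped)
# flag bit set and "*" in the CIGAR column (column 6).
# int($2 / 4) % 2 tests bit 0x4 without needing gawk's and() function.
count_unmapped_star() {
    awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 1 && $6 == "*" { n++ } END { print n + 0 }'
}

# Made-up pair: first read mapped (flag 73), second read unmapped (flag 133).
sample_pair() {
    printf 'pair1\t73\t10\t96000100\t60\t10M\t=\t96000100\t0\tACGTACGTAC\tFFFFFFFFFF\n'
    printf 'pair1\t133\t10\t96000100\t0\t*\t=\t96000100\t0\tTTTTGGGGCC\t##########\n'
}

sample_pair | count_unmapped_star   # prints: 1
```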

After MarkIlluminaAdapters (step 2)
image

After BWA-MEM (step 3)
image

After MergeBamAlignment (step 3)
image


3C. Restore altered data and apply & adjust meta information using MergeBamAlignment

MergeBamAlignment is a beast of a tool, so its introduction is longer. It does more than is implied by its name. Explaining these features requires I fill you in on some background.

Broadly, the tool merges defined information from the unmapped BAM (uBAM, step 1) with that of the aligned BAM (step 3) to conserve read data, e.g. original read information and base quality scores. The tool also generates additional meta information based on the information generated by the aligner, which may alter aligner-generated designations, e.g. mate information and secondary alignment flags. The tool then makes adjustments so that all meta information is congruent, e.g. read and mate strand information based on proper mate designations. We describe the resulting BAM as clean.

Specifically, the aligned BAM generated in step 3 lacks read group information and certain tags--the UQ (Phred likelihood of the segment), MC (CIGAR string for mate) and MQ (mapping quality of mate) tags. It has hard-clipped sequences from split reads and altered base qualities. The reads also have what some call mapping artifacts but what are really just features we should not expect from our aligner. For example, the meta information so far does not consider whether pairs are optimally mapped and whether a mate is unmapped (in reality or for accounting purposes). Depending on these assignments, MergeBamAlignment adjusts the read and read mate strand orientations for reads in a proper pair. Finally, the alignment records are sorted by query name. We would like to fix all of these issues before taking our data to a variant discovery workflow.

Enter MergeBamAlignment. As the tool name implies, MergeBamAlignment applies read group information from the uBAM and retains the program group information from the aligned BAM. In restoring original sequences, the tool adjusts CIGAR strings from hard-clipped to soft-clipped. If the alignment file is missing reads present in the unaligned file, then these are retained as unmapped records. Additionally, MergeBamAlignment evaluates primary alignment designations according to a user-specified strategy, e.g. for optimal mate pair mapping, and changes secondary alignment and mate unmapped flags based on its calculations. It makes additional adjustments for desired congruency. I will soon explain these and additional changes in more detail and show a read record to illustrate.

Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. Therefore, it is critical that these were previously flagged.

A read with multiple alignment records may map to multiple loci or may be chimeric--that is, its alignment is split. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and adds the read mapped in proper pair (0x2) and mate unmapped (0x8) flags. The tool then adjusts the strand orientation flag for a read (0x10) and its proper mate (0x20).
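
The hexadecimal values mentioned above are bits of the SAM FLAG field (column 2). To read a flag value at a glance, a decoder along these lines can help. It is a sketch: decode_flag and the short bit names are my own labels, while the bit positions follow the SAM specification.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of the SAM flag bits set in a decimal FLAG value.
# Bit order follows the SAM specification, from 0x1 up to 0x800.
decode_flag() {
    local flag=$1 out="" i
    local names=(paired proper_pair unmapped mate_unmapped reverse mate_reverse
                 first_in_pair second_in_pair secondary qc_fail duplicate supplementary)
    for i in "${!names[@]}"; do
        if (( flag & (1 << i) )); then
            out+="${out:+,}${names[i]}"
        fi
    done
    echo "$out"
}

decode_flag 99    # paired,proper_pair,mate_reverse,first_in_pair
decode_flag 329   # paired,mate_unmapped,first_in_pair,secondary
```

Comparing the decoded flags of the same read before and after MergeBamAlignment is a quick way to see which of the 0x2, 0x8, 0x10, 0x20 and 0x100 designations the tool changed.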

In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters. The path to the reference FASTA given by R should also contain the corresponding .dict sequence dictionary with the same prefix as the reference FASTA. It is imperative that both the uBAM and the aligned BAM are sorted by queryname.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
R=/path/Homo_sapiens_assembly19.fasta \ 
UNMAPPED_BAM=6383_snippet_revertsam.bam \ 
ALIGNED_BAM=6483_snippet_bwa_mem.sam \ #accepts either SAM or BAM
O=6483_snippet_mergebamalignment.bam \
CREATE_INDEX=true \ #standard Picard option for coordinate-sorted outputs
ADD_MATE_CIGAR=true \ #default; adds MC tag
CLIP_ADAPTERS=false \ #changed from default
CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not overlap
INCLUDE_SECONDARY_ALIGNMENTS=true \ #default
MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq
ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain tags starting with X, Y, or Z 
TMP_DIR=/path/shlee #optional to process large files

This generates a coordinate-sorted and clean BAM, 6483_snippet_mergebamalignment.bam, and corresponding .bai index. These are ready for analyses starting with MarkDuplicates. The two bullet-point lists below describe changes to the resulting file. The first list gives general comments on select parameters and the second describes some of the notable changes to our example data.

Comments on select parameters

  • Setting PRIMARY_ALIGNMENT_STRATEGY to MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.

  • It may well be that alignments marked as secondary represent interesting biology, so we retain them with the INCLUDE_SECONDARY_ALIGNMENTS parameter.

  • Setting MAX_INSERTIONS_OR_DELETIONS to -1 retains reads regardless of the number of insertions and deletions. The default is 1.
  • Because we leave the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment will reassess and reassign proper pair designations made by the aligner. These are explained below using the example data.
  • ATTRIBUTES_TO_RETAIN is specified to carry over the XS tag from the alignment, which reports BWA-MEM's suboptimal alignment scores. My impression is that this is the next highest score for any alternative or additional alignments BWA considered, whether or not these additional alignments made it into the final aligned records. (IGV's BLAT feature allows you to search for additional sequence matches). For our tutorial data, this is the only additional unaccounted tag from the alignment. The XS tag is unnecessary for the Best Practices Workflow and is not retained by the Broad Genomics Platform's pipeline. We retain it here not only to illustrate that the tool carries over select alignment information only if asked, but also because I think it prudent. Given how compute intensive the alignment process is, the additional ~1% gain in the snippet file size seems a small price against having to rerun the alignment because we realize later that we want the tag.
  • Setting CLIP_ADAPTERS to false leaves reads unclipped.
  • By default the merged file is coordinate sorted. We set CREATE_INDEX to true to additionally create the bai index.
  • We need not invoke PROGRAM options as BWA's program group information is sufficient and is retained in the merging.
  • As a standalone tool, we would normally feed in a BAM file for ALIGNED_BAM instead of the much larger SAM. We will be piping this step however and so need not add an extra conversion to BAM.

Description of changes to our example data

  • MergeBamAlignment merges header information from the two sources that define read groups (@RG) and program groups (@PG) as well as reference contigs.

  • Tags are updated for our example data as shown in the table. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment), MC (mate CIGAR string) and MQ (mapping quality of the mate/next segment) tags if applicable. For unmapped reads (marked with an * asterisk in column 6 of the SAM record), the tool removes AS and XS tags and assigns MC (if applicable), PG and RG tags. This is illustrated for example read H0164ALXX140820:2:1101:29704:6495 in the BWA-MEM section of this document.

  • Original base quality score restoration is illustrated in step 2.

The example below shows a read pair for which MergeBamAlignment adjusts multiple information fields, and these changes are described in the remaining bullet points.

  • MergeBamAlignment changes hard-clipping to soft-clipping, e.g. 96H55M to 96S55M, and restores corresponding truncated sequences with the original full-length read sequence.

  • The tool reorders the read records to reflect the chromosome and contig ordering in the header and the genomic coordinates for each.

  • MergeBamAlignment's MostDistant PRIMARY_ALIGNMENT_STRATEGY asks the tool to consider the best pair to mark as primary from the primary and secondary records. In this pair, one of the reads has two alignment loci, on contig hs37d5 and on chromosome 10. The two loci align 115 and 55 nucleotides, respectively, and the aligned sequences are identical by 55 bases. Flag values set by BWA-MEM indicate the contig hs37d5 record is primary and the shorter chromosome 10 record is secondary. For this chimeric read, MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment and the contig hs37d5 mapping as secondary (0x100 flag bit).
  • In addition, MergeBamAlignment designates each record on chromosome 10 as read mapped in proper pair (0x2 flag bit) and the contig hs37d5 mapping as mate unmapped (0x8 flag bit). IGV's paired reads mode displays the two chromosome 10 mappings as a pair after these MergeBamAlignment adjustments.
  • MergeBamAlignment adjusts read reverse strand (0x10 flag bit) and mate reverse strand (0x20 flag bit) flags consistent with changes to the proper pair designation. For our non-stranded DNA-Seq library alignments displayed in IGV, a read pointing rightward is in the forward direction (absence of 0x10 flag) and a read pointing leftward is in the reverse direction (flagged with 0x10). In a typical pair, where the rightward pointing read is to the left of the leftward pointing read, the left read will also have the mate reverse strand (0x20) flag.
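The hard-clip to soft-clip change described in the list above (96H55M to 96S55M) can be sketched on the CIGAR string alone. This is only an illustration of the string-level change, assuming a well-formed CIGAR; MergeBamAlignment also restores the clipped bases and qualities from the uBAM, which this sketch does not attempt. Both function names are hypothetical.

```python
import re


def hard_to_soft_clip(cigar):
    """Rewrite every hard-clip (H) operation in a CIGAR string as a soft clip (S)."""
    return re.sub(r"(\d+)H", r"\1S", cigar)


def query_length(cigar):
    """Sum the operations that consume read bases (M, I, S, =, X)."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in "MIS=X"
    )


if __name__ == "__main__":
    converted = hard_to_soft_clip("96H55M")
    print(converted, query_length(converted))
```

After conversion the CIGAR accounts for the full read: 96 soft-clipped plus 55 aligned bases give 151, whereas the hard-clipped form accounts for only the 55 bases stored in the record.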

Two distinct classes of mate unmapped read records are now present in our example file: (1) reads whose mates truly failed to map and are marked by an asterisk * in column 6 of the SAM record and (2) multimapping reads whose mates are in fact mapped but in a proper pair that excludes the particular read record. Each of these two classes of mate unmapped reads can contain multimapping reads that map to two or more locations.

Comparing 6483_snippet_bwa_mem.sam and 6483_snippet_mergebamalignment.bam, we see the number of unmapped reads remains the same at 1211, while the number of records with the mate unmapped flag increases by 1359, from 1276 to 2635. These now account for 0.951% of the 276,970 read records.
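As a quick arithmetic check on the figures just quoted, the sketch below recomputes the new count and its share of all records. The numbers are copied from the text; the function name is ours.

```python
def mate_unmapped_summary(before, increase, total_records):
    """Return the new mate-unmapped count and its percentage of all records."""
    after = before + increase
    percent = 100.0 * after / total_records
    return after, percent


if __name__ == "__main__":
    after, percent = mate_unmapped_summary(1276, 1359, 276970)
    print(f"{after} records, {percent:.3f}% of all records")
```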

For 6483_snippet_mergebamalignment.bam, how many additional unique reads become mate unmapped?

After BWA-MEM alignment
image

After MergeBamAlignment
image



3D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

We pipe the three tools described above to generate an aligned, coordinate-sorted BAM file. In the piped command, the commands for the three processes are given together, separated by a vertical bar |. We also replace each intermediate output and input file name with a symbolic path to the system's output and input devices, here /dev/stdout and /dev/stdin, respectively. We need only provide the first input file and name the last output file.

Before using a piped command, we should ask the shell to report a failure if any step of the pipe errors, so that an error in an early step is not masked by the exit status of the last step. Type the following into your shell to set this option.

set -o pipefail
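To see what this option changes, here is a small sketch that runs a two-step pipe whose first step fails, once without and once with pipefail, and reports the exit status of the whole pipe. It assumes bash is available on the PATH; the toy pipe `false | cat` stands in for the real tools and is not part of the tutorial.

```python
import subprocess


def pipeline_exit_status(use_pipefail):
    """Run `false | cat` in bash and return the exit status of the whole pipe."""
    prefix = "set -o pipefail; " if use_pipefail else ""
    # `false` always fails; `cat` succeeds on the empty input it receives.
    result = subprocess.run(["bash", "-c", prefix + "false | cat"])
    return result.returncode


if __name__ == "__main__":
    print("without pipefail:", pipeline_exit_status(False))
    print("with pipefail:   ", pipeline_exit_status(True))
```

Without the option the pipe reports the status of its last step only, so the failure of the first step goes unnoticed.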

Overview of command structure

[SamToFastq] | [BWA-MEM] | [MergeBamAlignment]

Piped command

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=/dev/stdout \
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
TMP_DIR=/path/shlee | \
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta /dev/stdin | \
java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
ALIGNED_BAM=/dev/stdin \
UNMAPPED_BAM=6383_snippet_revertsam.bam \
OUTPUT=6483_snippet_piped.bam \
R=/path/Homo_sapiens_assembly19.fasta CREATE_INDEX=true ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS \
TMP_DIR=/path/shlee

The piped output file, 6483_snippet_piped.bam, is for all intents and purposes the same as 6483_snippet_mergebamalignment.bam, produced by running MergeBamAlignment separately without piping. However, the resulting files, as well as new runs of the workflow on the same data, have the potential to differ in small ways because each uses a different alignment instance.

How do these small differences arise?

Counting the number of mate unmapped reads shows that this number remains unchanged for the two described workflows. Two counts reported at the end of the tool's progress log, the number of alignment records and the number of unmapped reads, also remain constant for these instances.

INFO    2015-12-08 17:25:59 AbstractAlignmentMerger Wrote 275759 alignment records and 1211 unmapped reads.
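When comparing runs, it can help to pull these two counts out of the log programmatically. The sketch below parses the line shown above with a regular expression; the pattern is written against this one example line and is an assumption about the log format, not a documented Picard interface.

```python
import re

LOG_LINE = (
    "INFO    2015-12-08 17:25:59 AbstractAlignmentMerger "
    "Wrote 275759 alignment records and 1211 unmapped reads."
)


def parse_merge_counts(line):
    """Return (alignment_records, unmapped_reads) from the log line, or None."""
    match = re.search(
        r"Wrote (\d+) alignment records and (\d+) unmapped reads", line
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    aligned, unmapped = parse_merge_counts(LOG_LINE)
    print(aligned, unmapped, aligned + unmapped)
```

For this line the two counts add up to 276,970, the number of read records quoted earlier for the merged file.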



Some final remarks

We have produced a clean BAM that is coordinate-sorted and indexed, in an efficient manner that minimizes processing time and storage needs. The file is ready for marking duplicates as outlined in Tutorial#2799. Additionally, we can now free up storage on our file system by deleting the original file we started with, the uBAM and the uBAMXT. We sleep well at night knowing that the clean BAM retains all original information.

We have two final comments (1) on multiplexed samples and (2) on fitting this workflow into a larger workflow.

For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates.

For workflows that nest this pipeline, consider additionally optimizing the Java parameters for SamToFastq and MergeBamAlignment. For example, the following are the additional settings used by the Broad Genomics Platform in the piped command for very large data sets.

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...

I give my sincere thanks to Julian Hess, the GATK team and the Data Sciences and Data Engineering (DSDE) team members for all their help in writing this and related documents.



(How to) Generate an unmapped BAM from FASTQ or aligned BAM



Here we outline how to generate an unmapped BAM (uBAM) from either a FASTQ or aligned BAM file. We use Picard's FastqToSam to convert a FASTQ (Option A) or Picard's RevertSam to convert an aligned BAM (Option B).

Jump to a section on this page

(A) Convert FASTQ to uBAM and add read group information using FastqToSam
(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

Tools involved

Prerequisites

  • Installed Picard tools

Download example data

Tutorial data reads were originally aligned to the advanced tutorial bundle's human_g1k_v37_decoy.fasta reference and to 10:91,000,000-92,000,000.

Related resources

  • Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure your read group fields conform to the recommendations.

  • This post provides an example command for AddOrReplaceReadGroups.

  • This How to is part of a larger workflow and tutorial on (How to) Efficiently map and clean up short read sequence data.
  • To extract reads in a genomic interval from the aligned BAM, use GATK's PrintReads.
  • In the future we will post on how to generate a uBAM from BCL data using IlluminaBasecallsToSAM.

(A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file to an unmapped BAM. It requires two read group fields and accepts other read group fields as options. In the command below we note which fields are required for GATK Best Practices Workflows. All other read group fields are optional.

java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_1.fastq \ #first read file of pair
    FASTQ2=6484_snippet_2.fastq \ #second read file of pair
    OUTPUT=6484_snippet_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \ #required; changed from default of A
    SAMPLE_NAME=NA12878 \ #required
    LIBRARY_NAME=Solexa-272222 \ #required 
    PLATFORM_UNIT=H0164ALXX140820.2 \ 
    PLATFORM=illumina \ #recommended
    SEQUENCING_CENTER=BI \ 
    RUN_DATE=2014-08-20T00:00:00-0400

Some details on select parameters:

  • For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first read file and the second read file, respectively. Records in each file must be queryname sorted as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate reads are paired. Do not provide a single interleaved fastq file, as the tool will assume reads are unpaired and the SAM flag values will reflect single ended reads.

  • For single ended reads, specify the input file with FASTQ.

  • QUALITY_FORMAT is detected automatically if unspecified.
  • SORT_ORDER by default is queryname.
  • PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
  • RUN_DATE is in ISO 8601 date format.

Paired reads will have SAM flag values that reflect pairing and the fact that the reads are unmapped as shown in the example read pair below.

Original first read

@H0164ALXX140820:2:1101:10003:49022/1
ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA
+
-FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ

Original second read

@H0164ALXX140820:2:1101:10003:49022/2
TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG
+
AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F######################

After FastqToSam

H0164ALXX140820:2:1101:10003:49022      77      *       0       0       *       *       0       0       ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA -FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:49022      141     *       0       0       *       *       0       0       TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F###################### RG:Z:H0164.2
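The transformation shown in this example pair can be sketched in a few lines: strip the /1 or /2 suffix, set the unmapped paired flags (77 for the first read, 141 for the second), fill the alignment columns with placeholder values, and attach the read group tag. This only illustrates the record layout seen above and is not how FastqToSam is implemented; the function names are hypothetical, and the read name is passed without the leading @ of the FASTQ header.

```python
def strip_pair_suffix(name):
    """Drop a trailing /1 or /2 from a read name, if present."""
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2]
    return name


def unmapped_sam_record(name, seq, qual, flag, read_group):
    """Build a tab-separated SAM line for an unmapped read."""
    fields = [
        strip_pair_suffix(name),  # QNAME
        str(flag),                # FLAG: 77 (first) or 141 (second)
        "*", "0", "0", "*",       # RNAME, POS, MAPQ, CIGAR
        "*", "0", "0",            # RNEXT, PNEXT, TLEN
        seq,
        qual,
        "RG:Z:" + read_group,
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    name = "H0164ALXX140820:2:1101:10003:49022"
    print(unmapped_sam_record(name + "/1", "ACTTTAGA", "-FFFFFJJ", 77, "H0164.2"))
    print(unmapped_sam_record(name + "/2", "TGAGGATC", "AA<FFJJJ", 141, "H0164.2"))
```

The two flag values are sums of SAM flag bits: 77 is 0x1 + 0x4 + 0x8 + 0x40 and 141 is 0x1 + 0x4 + 0x8 + 0x80.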



(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information and generate an unmapped BAM (uBAM). For our tutorial file we have to call on some additional parameters that we explain below. This illustrates the need to tailor the tool's parameters to each dataset. As such, it is a good idea to test the reversion process on a subset of reads before committing to reverting the entirety of a large BAM. Follow the directions in this How to to create a snippet of aligned reads corresponding to a genomic interval.

We use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
    I=6484_snippet.bam \
    O=6484_snippet_revertsam.bam \
    SANITIZE=true \ 
    MAX_DISCARD_FRACTION=0.005 \ #informational; does not affect processing
    ATTRIBUTE_TO_CLEAR=XT \
    ATTRIBUTE_TO_CLEAR=XN \
    ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
    ATTRIBUTE_TO_CLEAR=OC \
    ATTRIBUTE_TO_CLEAR=OP \
    SORT_ORDER=queryname \ #default
    RESTORE_ORIGINAL_QUALITIES=true \ #default
    REMOVE_DUPLICATE_INFORMATION=true \ #default
    REMOVE_ALIGNMENT_INFORMATION=true #default

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory

We invoke or change multiple RevertSam parameters to generate an unmapped BAM:

  • We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. Standard tags cleared by default are NM, UQ, PG, MD, MQ, SA, MC, and AS tags (AS for Picard releases starting 9/2015). Additionally, the OQ tag is removed by the default RESTORE_ORIGINAL_QUALITIES parameter. Remove all other nonstandard tags by specifying each with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option for our tutorial file so that it is free for use by other tools, e.g. MarkIlluminaAdapters. To list all tags within a BAM, use the command below.

    samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}' 
    

    For the tutorial file, this gives RG, OC, XN, OP and XT tags as well as those removed by default. We remove all of these except the RG tag. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags.

  • Additionally, we invoke the SANITIZE option to remove reads that cause problems for certain tools, e.g. MarkIlluminaAdapters. Downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Any paired reads file subset for a genomic interval requires sanitizing to remove reads with lost mates that align outside of the interval.

  • In this command, we've set MAX_DISCARD_FRACTION to a stricter threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. This parameter asks the tool to additionally inform you of the discarded fraction via an exception as it finishes processing.

    Exception in thread "main" picard.PicardException: Discarded 0.787% which is above MAX_DISCARD_FRACTION of 0.500%  
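Returning to the tag-listing one-liner given earlier in this list, the same idea can be sketched in Python for SAM-formatted text lines, for example the output of samtools view saved to a file. This is a minimal sketch that assumes tab-separated records with optional fields starting at column 12; the function name is hypothetical.

```python
def list_tags(sam_lines):
    """Return the distinct optional-field tags in order of first appearance."""
    seen = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines, if any were included
        optional_fields = line.rstrip("\n").split("\t")[11:]
        for field in optional_fields:
            tag = field.split(":", 1)[0]
            if tag and tag not in seen:
                seen.append(tag)
    return seen


if __name__ == "__main__":
    records = [
        "r1\t83\t10\t100\t60\t4M\t=\t50\t-54\tACGT\tFFFF\tMD:Z:4\tRG:Z:H0164.2\tNM:i:0",
        "r1\t163\t10\t50\t0\t4M\t=\t100\t54\tACGT\tFFFF\tRG:Z:H0164.2\tXT:Z:x",
    ]
    print(list_tags(records))
```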
    

Some comments on options kept at default:

  • SORT_ORDER=queryname
    For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order. Coordinate sorted reads result in the aligner incorrectly estimating insert size from blocks of paired reads as they are not randomly distributed.

  • RESTORE_ORIGINAL_QUALITIES=true
    Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag.

  • REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment flag and tag information. For example, flags are reset to unmapped values, e.g. 77 and 141 for paired reads. The parameter also invokes the default ATTRIBUTE_TO_CLEAR parameter which removes standard alignment tags. RevertSam ignores ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false.
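The Phred+33 encoding mentioned above for the QUAL field and OQ tag can be decoded with a one-line conversion. A minimal sketch, assuming Phred+33 as stated for the tutorial data; the function names are ours.

```python
def phred33_to_scores(qual):
    """Convert a Phred+33 encoded quality string to a list of integer scores."""
    return [ord(ch) - 33 for ch in qual]


def error_probability(score):
    """Convert a Phred score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-score / 10)


if __name__ == "__main__":
    print(phred33_to_scores("<-J#"))
    print(error_probability(20))
```

In the tutorial reads, the runs of # characters at the end of a read correspond to a score of 2, and J corresponds to 41.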

Below we show a read pair before and after RevertSam from the tutorial data. Notice the first listed read in the pair becomes reverse-complemented after RevertSam. This restores how reads are represented when they come off the sequencer--5' to 3' of the read being sequenced.

For 6484_snippet.bam, SANITIZE removes 2,202 out of 279,796 (0.787%) reads, leaving us with 277,594 reads.
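The discarded fraction and its comparison against MAX_DISCARD_FRACTION can be checked by hand. The sketch below uses the counts quoted above; the strict greater-than comparison is an assumption that matches the wording of the exception message ("above"), not a statement about Picard's source code.

```python
def discarded_fraction(discarded, total):
    """Fraction of reads removed by sanitizing."""
    return discarded / total


def exceeds_threshold(discarded, total, max_discard_fraction):
    """True if the discarded fraction is above the given threshold."""
    return discarded_fraction(discarded, total) > max_discard_fraction


if __name__ == "__main__":
    discarded, total = 2202, 279796
    print(f"{100 * discarded_fraction(discarded, total):.3f}% discarded")
    print("remaining reads:", total - discarded)
    print("above 0.005:", exceeds_threshold(discarded, total, 0.005))
    print("above 0.01: ", exceeds_threshold(discarded, total, 0.01))
```

With these counts the fraction is above our stricter 0.005 threshold but below the default of 0.01, which is why the exception appears only with the changed setting.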

Original BAM

H0164ALXX140820:2:1101:10003:23460  83  10  91515318    60  151M    =   91515130    -339    CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151    PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:0  MQ:i:0  OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA    UQ:i:0  AS:i:151

H0164ALXX140820:2:1101:10003:23460  163 10  91515130    0   60M91S  =   91515318    339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1;    MC:Z:151M   MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:2  MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF###########################################################################################    UQ:i:49 AS:i:50

After RevertSam

H0164ALXX140820:2:1101:10003:23460  77  *   0   0   *   *   0   0   TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2

H0164ALXX140820:2:1101:10003:23460  141 *   0   0   *   *   0   0   TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2
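The reverse-complementing seen in the first read of this pair can be reproduced with a short sketch: when the reverse strand bit (0x10) is set, the stored sequence is reverse-complemented and the quality string is reversed to recover the read as it came off the sequencer. This illustrates the idea only and is not RevertSam's implementation; the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def restore_read(seq, qual, flag):
    """Undo the reverse-strand representation if flag bit 0x10 is set."""
    if flag & 0x10:
        return reverse_complement(seq), qual[::-1]
    return seq, qual


if __name__ == "__main__":
    # Last eight bases of the flag-83 read in the original BAM.
    print(restore_read("CCAGCTCA", "ABCDEFGH", 83))
    # The flag-163 mate is on the forward strand and is left unchanged.
    print(restore_read("TCTTTCCT", "ABCDEFGH", 163))
```

The last eight bases of the original flag-83 record, CCAGCTCA, become the first eight bases of the reverted read, TGAGCTGG, as in the records above.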



CombineVariants --variants : can I give both a variant name and a file type?


I am trying to run tool CombineVariants on 3 vcf files.
There is an extra dot in the file names which I expect is why dynamic determination of type is not working.
I want to rename the variants, not have the default names of variant, variant1, variant2.

I can't seem to specify both variant-name and file type. Can you explain how to combine them? In the meantime I can rename my files to a simpler format, but I would like to know! Below are the error messages for three attempts.

Version with just variant-name:
% gatk -T CombineVariants \

-R $REFDIR/PlasmoDB-29_Pfalciparum3D7_Genome.fasta \
--variant:S3 D2_S3_indel.filterGff.vcf \
--variant:S4 E7_S4_indel.filterGff.vcf \
--variant:S5 G7_S5_indel.filterGff.vcf \
-o mergeS3S4S5.indel.vcf \
-genotypeMergeOptions UNSORTED

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file 'D2_S3_indel.filterGff.vcf' could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:
ERROR Name FeatureType Documentation
ERROR BCF2 VariantContext (this is an external codec and is not documented within GATK)
ERROR VCF VariantContext (this is an external codec and is not documented within GATK)
ERROR VCF3 VariantContext (this is an external codec and is not documented within GATK)
ERROR ------------------------------------------------------------------------------------------

Version with --variants:VariantName:VCF vcf_filename:
gatk -T CombineVariants \
-R $REFDIR/PlasmoDB-29_Pfalciparum3D7_Genome.fasta \
--variant:S3:VCF D2_S3_indel.filterGff.vcf \
--variant:S4 S4_indel.filterGff.vcf \
--variant:S5 S5_indel.filterGff.vcf \
-o mergeS3S4S5.indel.vcf \
-genotypeMergeOptions UNSORTED
...

ERROR MESSAGE: Invalid argument value '--variant:S3:VCF' at position 4.
ERROR Invalid argument value 'D2_S3_indel.filterGff.vcf' at position 5.
ERROR ------------------------------------------------------------------------------------------

Version with --variants:VCF:VariantName vcf_filename:
gatk -T CombineVariants \
-R $REFDIR/PlasmoDB-29_Pfalciparum3D7_Genome.fasta \
--variant:VCF:S3 D2_S3_indel.filterGff.vcf \
--variant:S4 S4_indel.filterGff.vcf \
--variant:S5 S5_indel.filterGff.vcf \
-o mergeS3S4S5.indel.vcf \
-genotypeMergeOptions UNSORTED
...

ERROR MESSAGE: Invalid argument value '--variant:VCF:S3' at position 4.
ERROR Invalid argument value 'D2_S3_indel.filterGff.vcf' at position 5.
ERROR ------------------------------------------------------------------------------------------

How can I prevent the file header from showing up in gigantic font?


Hi. My question is, when I post to the forum, some parts of my post become huge, e.g. file headers or error messages. I'm showing a truncated example below of a VCF header. How can I prevent this from happening and show the copy-pasted blocks in normal font?

fileformat=VCFv4.2

...

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878


(howto) Call variants with HaplotypeCaller


Objective

Call variants on a single genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

Prerequisites

  • TBD

Steps

  1. Determine the basic parameters of the analysis
  2. Call variants in your sequence data

1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

  • Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

  • Emission confidence threshold (-stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

  • Calling confidence threshold (-stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms "called" and "filtered" are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that the site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check the context when interpreting results that include these terms.
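To make the two thresholds and the GATK sense of "called" and "filtered" concrete, here is a minimal sketch that sorts a site into one of three outcomes by its confidence score. It is a simplification of the behavior described above, not HaplotypeCaller's code: the default values 10 and 30 are the ones used in this tutorial's command, and the handling of scores exactly at a threshold is an assumption.

```python
def classify_site(confidence, stand_emit_conf=10.0, stand_call_conf=30.0):
    """Classify a site by its phred-scaled confidence score."""
    if confidence < stand_emit_conf:
        return "not emitted"
    if confidence < stand_call_conf:
        # Emitted but tagged: "filtered" in the GATK sense.
        return "emitted, filtered (LowQual)"
    return "emitted, called"


if __name__ == "__main__":
    for score in (5.0, 18.5, 42.0):
        print(score, "->", classify_site(score))
```

Note that a site in the middle category still appears in the output VCF; it is tagged rather than removed.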


2. Call variants in your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fa \
    -I preprocessed_reads.bam \
    -L 20 \
    --genotyping_mode DISCOVERY \
    -stand_emit_conf 10 \
    -stand_call_conf 30 \
    -o raw_variants.vcf

Note that -L specifies that we only want to run the command on a subset of the data (here, chromosome 20). This is useful for testing as well as other purposes, as documented here. For example, when running on exome data, we use -L to specify a file containing the list of exome targets corresponding to the capture kit that was used to generate the exome libraries.

Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest are very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

GATK CNV Toolchain in Firehose and FAQ (Broad Internal)


We have put the GATK4 Somatic CNV Toolchain into Firehose. Please copy the below workflows from Algorithm_Commons

GATK_Somatic_CNV_Toolchain_Capture
GATK_Somatic_CNV_Toolchain_WGS

For questions and discussions, not just specific to Firehose, please see the GATK 4 forum: http://gatkforums.broadinstitute.org/gatk/categories/gatk-4-alpha

Who do I contact with an issue?

First, make sure that your question is not here or in another forum post.
If it is a Firehose issue or you are not sure, email pipeline-help@broadinstitute.org.
If you are sure that it is an issue with GATK CNV, ACNV, or GetBayesianHetPulldown, post to the forum.

What is GATK CNV vs. ACNV and which are run in the workflows above?

  • GATK CNV estimates total copy ratio and performs segmentation and (basic) event calling. This tool works very similarly to ReCapSeg (for now).

  • GATK ACNV creates credible intervals for copy ratio and minor allelic fraction (MAF). Under the hood, this tool is very different from Allelic CapSeg, but it can produce a file that can be ingested by ABSOLUTE (i.e. file is in same format produced by Allelic CapSeg)

  • Both GATK CNV and ACNV are in the workflows above.

Are the results (e.g. sensitivity and precision) better than ReCapSeg in the GATK CNV toolchain?

If you mean running without the allelic integration, then the results are equivalent. If you want more details, ask in the forum or invite us to talk to you -- we have a presentation or two about this topic.

Do I run these workflows on Pair Sets or Individual Sets?

Individual Sets

What entity types do the tasks run on?

Samples and Pairs. I realize that the above question says to run the workflow on Individual Sets. This is to work around a Firehose issue.

What are the caveats around WGS?

  • The total copy number tasks (similar to ReCapSeg) take about a tenth of the time that ReCapSeg takes, assuming good NFS performance. This is a good thing.

  • The allelic tasks (GetBayesianHetPulldown and Allelic CNV) take a very long time to run. Over a day of runtime is not uncommon. In the next version of the GATK4 CNV Toolchain, we will have addressed this issue, but due to dispatch limitations, Firehose may not be able to fully capitalize on these improvements.

  • The runtimes in general are very sensitive to filesystem performance.
  • The results still have the same oversegmentation issues that you will see in ReCapSeg. There is a GC correction tool, but this has not been integrated into the Firehose workflow.
  • There is a bug in a third-party library that limits the size of a PoN. This is unlikely to be an issue for capture, but can become a problem for WGS. For more details, please see gatkforums.broadinstitute.org/gatk/discussion/7594/limits-on-the-size-of-a-pon

What about the future of ReCapSeg?

We are phasing out ReCapSeg, for many reasons, everywhere -- not just Firehose. If you would like more details, post to the forum and we'll respond.

What about the future of Allelic CapSeg?

We have never supported (and never will support) Allelic CapSeg and cannot answer that question. We have some results comparing Allelic CapSeg and GATK ACNV. We can show you if you are interested (internal to Broad only).

Why are there fewer plots than in ReCapSeg?

We did not include plots that we did not believe were being used. If you would like to include additional plots, please post to the forum.

How is the GATK 4 CNV toolchain workflow better than the ReCapSeg workflow?

1) Faster. On exome, ReCapSeg takes ~105 minutes per case sample. GATK CNV takes < 30 minutes. Both time estimates assume good performance of NFS filesystem.
2) The workflows above include allelic integration results, from the tool GATK ACNV. These results are analogous to what Allelic CapSeg produces.
3) The workflow above produces results compatible with ABSOLUTE and TITAN. I.e. the results can be used as input to ABSOLUTE or TITAN.
4) All future improvements and bugfixes are going into GATK, not ReCapSeg. And many improvements are coming....
5) The workflows produce germline heterozygous SNP call files.
6) The ReCapSeg WGS workflow no longer works.

Are there new PoNs for these workflows?

Yes, but the PoN locations are already populated, if you run the workflows properly. Users do not need to do anything.

Is the correct PoN automatically selected for ICE vs. Agilent samples?

Yes, if you run the workflow.

Is there a PoN creation workflow in Firehose?

No. Never going to happen. Don't ask. See the forum for instructions (and a Queue workflow) to create PoNs.

Can I run ABSOLUTE from the output of GATK ACNV?

Yes. The annotations are gatk4cnv_acnv_acs_seg_file_capture (capture) and gatk4cnv_acnv_acs_seg_file_wgs (WGS).

Can I run TITAN from the output of GATK ACNV?

Yes, though there has been little testing. The annotations are gatk4cnv_acnv_acs_seg_file_capture and gatk4cnv_acnv_acs_seg_file_wgs.

Do the workflows above include Oncotator gene lists?

Yes.

Is the GATK4 CNV Toolchain in alpha?

Technically, the whole GATK4 is in alpha, but that includes more than just the GATK CNV toolchain. We are confident that the version in the workflows above produces high quality results. Please tell us if you find otherwise!

These workflows have Picard Target Mapper. Isn't that going to cause me to have to rerun all of my jobs (e.g. MuTect)?

The workflows above will rerun Picard Target Mapper, but only new annotations are added. All previous output annotations of Picard Target Mapper should be populated with the same values. It will look as if mutation calling (MuTect) and other tasks have been outdated, but their reruns will be job-avoided.

Can I do the tumor-only GATK ACNV workflow?

For exome that is working well, but is not available in Firehose. If you would like to see evaluation data for tumor-only on exome, we can show you (internal to Broad only). Please contact us if you need this in Firehose and we will work with you to set it up.

What are all of the annotations produced?

Where applicable, each annotation in the list below also has a *_wgs counterpart...
Sample annotations:

  • gatk4cnv_seg_file_capture -- seg file of GATK CNV. This file is analogous to the ReCapSeg seg file.

  • gatk4cnv_tn_file_capture -- tangent normalized (denoised) target copy ratio estimates of GATK CNV. This file is analogous to the ReCapSeg tn file.

  • gatk4cnv_pre_tn_file_capture -- coverage profile (i.e. target copy ratio estimates without denoising) of GATK CNV. This file is analogous to the ReCapSeg tn file.
  • gatk4cnv_betahats_capture -- Tangent normalization coefficients used in the projection. This is in the weeds.
  • gatk4cnv_called_seg_file_capture -- output called seg file of GATK CNV. This file is analogous to the ReCapSeg called seg file.
  • gatk4cnv_oncotated_called_seg_file_capture -- gene list file generated from the GATK CNV segments

  • gatk4cnv_dqc_capture (coming later) -- measure of noise reduction in the tangent normalization process. Lower is better.

  • gatk4cnv_preqc_capture (coming later) -- measure of noise before tangent normalization

  • gatk4cnv_postqc_capture (coming later) -- measure of noise after tangent normalization
  • gatk4cnv_num_seg_capture (coming later) -- number of segments in the GATK CNV output

Pair annotations:

  • gatk4cnv_case_het_file_capture -- het pulldown file for the tumor sample in the pair.

  • gatk4cnv_control_het_file_capture -- het pulldown file for the normal sample in the pair.

  • gatk4cnv_acnv_seg_file_capture -- ACNV seg file with confidence intervals for copy ratio and minor allelic fraction.

  • gatk4cnv_acnv_acs_seg_file_capture -- ACNV seg file in a format that looks as if it was produced by AllelicCapSeg. Any segments called as "balanced" will be pegged to a MAF of 0.5. This file is ready for ingestion by ABSOLUTE.

  • gatk4cnv_acnv_cnv_seg_file_capture -- ACNV seg file in a format that looks as if it was produced by GATK CNV

  • gatk4cnv_acnv_titan_het_file_capture -- het file in a format that can be ingested by TITAN
  • gatk4cnv_acnv_titan_cr_file_capture -- target copy ratio estimates file in a format that can be ingested by TITAN
  • gatk4cnv_acnv_cnloh_balanced_file_capture -- ACNV seg file with calls for whether a segment is balanced or CNLoH (or neither).

Do the workflows also run on the normals?

GATK CNV, yes.
GATK ACNV, no.
There is a het pulldown generated for the normal, as a side effect, when doing the het pulldown for the tumor.

What about array data?

The GATK4 CNV tools do not run on array data. Sequencing data only.

Do we still need separate PoNs if we want to run on X and Y?

Yes.

Can I run both the ReCapSeg workflow and the GATK CNV toolchain workflow?

Yes. All results are written to separate annotations.

Are the new workflows part of my PrAn?

No, not yet. You will need to copy (and run) these manually from Algorithm_Commons before you begin analysis. As a reminder, copy into your analysis workspace.

Does GATK CNV require matched (tumor-normal) samples?

No.

Does GATK ACNV require matched (tumor-normal) samples?

In Firehose, yes. Out of Firehose, no.

How do I modify the ABSOLUTE tasks in FH to accept the new GATK ACNV annotations?

There are two changes you need to make to the ABSOLUTE_v1.5_WES configuration to make it accept the new outputs.

1) replace alleliccapseg_tsv with gatk4cnv_acnv_acs_seg_file_capture in the inputs
2) replace alleliccapseg_skew with 0.9883274, and change the annotation type to "Literal" instead of "Simple Expression"

This answer thanks to Dimitri Livitz, Daniel Rosebrock, and David Kwiatkowski.

Gradle test failed after building GATK 4 Alpha

Hi,

I've just downloaded GATK 4 (from https://github.com/broadinstitute/gatk/) and successfully built it with the ./gradle installAll command.
But after I ran the ./gradle test command, it completed only 85% of the tests (294676 completed, 85 failed, 9729 skipped).
Can I go on to use the built program (for now I'm just trying to profile the performance of GATK4)?
Or should I wait for updates until it passes all the tests?

Thanks!

HaplotypeCaller pooled sequence problem

Hi,

I have a number of samples that consist of multiple individuals from the same population pooled together, and have been trying to use HaplotypeCaller to call the variants. I have set the ploidy to (2 * number of individuals) but keep getting the same or similar error message, after running for several hours or days:

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: the combination of ploidy (180) and number of alleles (9) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus
ERROR ------------------------------------------------------------------------------------------

and I'm not sure what I can do to rectify it... Obviously I can't limit the ploidy, it is what it is, and I thought that HC only allows a maximum of six alleles anyway?

My code is below, and any help would be appreciated.

java -Xmx24g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller \
-nct 6 \
-R ~/my_ref_sequence \
--intervals ~/my_intervals_file \
-ploidy 180 \
-log my_log_file \
-I ~/my_input_bam \
-o ~/my_output_vcf
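
For readers puzzling over the error message: the number of distinct unordered genotypes for a given ploidy and allele count is a standard combinatorial quantity (multisets of size ploidy drawn from the alleles). The sketch below is our own illustration, not GATK code. It assumes the limit in the message is the largest 32-bit signed integer (2147483647) and that "number of alleles" counts the reference allele as well. It needs Python 3.8 or later for math.comb.

```python
from math import comb

INT_MAX = 2**31 - 1  # 2147483647, the limit quoted in the error message


def genotype_count(ploidy, n_alleles):
    """Number of unordered genotypes: multisets of size `ploidy` over `n_alleles`."""
    return comb(ploidy + n_alleles - 1, n_alleles - 1)


def max_alleles(ploidy, limit=INT_MAX):
    """Largest allele count whose genotype count stays within `limit`."""
    n = 1
    while genotype_count(ploidy, n + 1) <= limit:
        n += 1
    return n
```

Under these assumptions, genotype_count(180, 9) is far above the limit, while max_alleles(180) reports how many alleles a ploidy of 180 can accommodate before the count overflows.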

GATK HaplotypeCaller QUAL field changes while running per chromosome

Dear GATK Team,

I ran GATK HaplotypeCaller (3.4) on one sample in two ways:

  1. Normal procedure: the whole BAM file was submitted to HaplotypeCaller.
  2. The BAM was split by chromosome (using the -L option), HaplotypeCaller was run on each chromosome separately, and the per-chromosome VCFs were concatenated at the end.

I got two VCF files and observed differences in the INFO and QUAL fields when comparing them.

Does HaplotypeCaller calculate the QUAL and INFO fields across all chromosomes?

Please help me to resolve the issue.

Thanks & Regards
Fazulur Rehaman

error CombineVariants UNIQUIFY

Hello,

I am trying to merge two VCFs for the same sample from two different callers (GATK UG and HC).
I use the CombineVariants tool with the -genotypeMergeOptions UNIQUIFY option.
Command line:
java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg19.fa --variant:SampleX_gatkHC SampleID_gatkHC.vcf --variant:SampleX_gatkUG SampleX_gatkUG.vcf -o SampleX.full.vcf -genotypeMergeOptions UNIQUIFY

SampleX_gatkHC.vcf :
GT:AD:DP:GQ:PL 1/2:7,53,18:78:99:2512,426,875,1811,0,2040

SampleX_gatkUG.vcf :
GT:AD:DP:GQ:PL 1/2:11,29,71:175:99:6052,4932,6910,983,0,367

SampleX.full.vcf :
FORMAT SampleX_gatkHC.vcf SampleX_gatkUG.vcf
GT:DP:GQ 1/2:78:99 2/1:175:99

Why do we lose some information, such as AD and PL?

The format that I want is:
FORMAT SampleX_gatkHC.vcf SampleX_gatkUG.vcf
GT:AD:DP:GQ:PL 1/2:7,53,18:78:99:2512,426,875,1811,0,2040 1/2:11,29,71:175:99:6052,4932,6910,983,0,367

Can you help me solve this problem?
I use GATK (version 3.5-0) and Java (1.8.0_65).

Thanks

(How to) Map reads to a reference with alternate contigs like GRCh38

Document is in BETA. It may be incomplete and/or inaccurate. Post suggestions to the Comments section and be sure to read about updates also within the Comments section.


image

This exploratory tutorial provides instructions and example data to map short reads to a reference genome with alternate haplotypes. Instructions are suitable for indexing and mapping reads to GRCh38.

► If you are unfamiliar with terms that describe reference genome components, or GRCh38 alternate haplotypes, take a few minutes to study the Dictionary entry Reference Genome Components.

► For an introduction to GRCh38, see Blog#8180.

Specifically, the tutorial uses BWA-MEM to index and map simulated reads for three samples to a mini-reference composed of a GRCh38 chromosome and alternate contig (sections 1–3). We align in an alternate contig aware (alt-aware) manner, which we also call alt-handling. This is the main focus of the tutorial.

The decision to align to a genome with alternate haplotypes has implications for variant calling. We discuss these in section 5 using the callset generated with the optional tutorial steps outlined in section 4. Because we strategically placed a number of SNPs on the sequence used to simulate the reads, in both homologous and divergent regions, we can use the variant calls and their annotations to examine the implications of analysis approaches. To this end, the tutorial fast-forwards through pre-processing and calls variants for a trio of samples that represents the combinations of the two reference haplotypes (the PA and the ALT). This first workflow (tutorial_8017) is suitable for calling variants on the primary assembly but is insufficient for capturing variants on the alternate contigs.

For those who are interested in calling variants on the alternate contigs, we also present a second and a third workflow in section 6. The second workflow (tutorial_8017_toSE) takes the processed BAM from the first workflow, makes some adjustments to the reads to maximize their information, and calls variants on the alternate contig. This approach is suitable for calling on ~75% of the non-HLA alternate contigs or ~92% of loci with non-HLA alternate contigs (see table in section 6). The third workflow (tutorial_8017_postalt) takes the alt-aware alignments from the first workflow and performs a postalt-processing step as well as the same adjustment from the second workflow. Postalt-processing uses the bwa-postalt.js javascript program that Heng Li provides as a companion to BWA. This allows for variant calling on all alternate contigs including HLA alternate contigs.

The tutorial ends by comparing the difference in call qualities from the multiple workflows for the given example data and discusses a few caveats of each approach.

► The three workflows shown in the diagram above are available as WDL scripts in our GATK Tutorials WDL scripts repository.


Jump to a section

  1. Index the reference FASTA for use with BWA-MEM
  2. Include the reference ALT index file
    What happens if I forget the ALT index file?
  3. Align reads with BWA-MEM
    How can I tell if a BAM was aligned with alt-handling?
    What is the pa tag?
  4. (Optional) Add read group information, preprocess to make a clean BAM and call variants
  5. How can I tell whether I should consider an alternate haplotype for a given sample?
    (5.1) Discussion of variant calls for tutorial_8017
  6. My locus includes an alternate haplotype. How can I call variants on alt contigs?
    (6.1) Variant calls for tutorial_8017_toSE
    (6.2) Variant calls for tutorial_8017_postalt
  7. Related resources

Tools involved

  • BWA v0.7.13 or later releases. The tutorial uses v0.7.15.
    Download from here and see Tutorial#2899 for installation instructions.
    The bwa-postalt.js script is within the bwakit folder.

  • Picard tools v2.5.0 or later releases. The tutorial uses v2.5.0.

  • Optional GATK tools. The tutorial uses v3.6.
  • Optional Samtools. The tutorial uses v1.3.1.
  • Optional Gawk, an AWK-like tool that can interpret bitwise SAM flags. The tutorial uses v4.1.3.
  • Optional k8 Javascript shell. The tutorial uses v0.2.3 downloaded from here.

Download example data

Download tutorial_8017.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. The data tarball contains the paired FASTQ reads files for three samples. It also contains a mini-reference chr19_chr19_KI270866v1_alt.fasta and corresponding .dict dictionary, .fai index and six BWA indices including the .alt index. The data tarball includes the output files from the workflow that we care most about. These are the aligned SAMs, processed and indexed BAMs and the final multisample VCF callsets from the three presented workflows.

image

The mini-reference contains two contigs subset from human GRCh38: chr19 and chr19_KI270866v1_alt. The ALT contig corresponds to a diverged haplotype of chromosome 19. Specifically, it corresponds to chr19:34350807-34392977, which contains the glucose-6-phosphate isomerase or GPI gene. Part of the ALT contig introduces novel sequence that lacks a corresponding region in the primary assembly.

Using instructions in Tutorial#7859, we simulated paired 2x151 reads to derive three different sample reads that when aligned give roughly 35x coverage for the target primary locus. We derived the sequences from either the 43 kbp ALT contig (sample ALTALT), the corresponding 42 kbp region of the primary assembly (sample PAPA) or both (sample PAALT). Before simulating the reads, we introduced four SNPs to each contig sequence in a deliberate manner so that we can call variants.

► Alternatively, you may instead use the example input files and commands with the full GRCh38 reference. Results will be similar with a handful of reads mapping outside of the mini-reference regions.


1. Index the reference FASTA for use with BWA-MEM

Our example chr19_chr19_KI270866v1_alt.fasta reference already has chr19_chr19_KI270866v1_alt.dict dictionary and chr19_chr19_KI270866v1_alt.fasta.fai index files for use with Picard and GATK tools. BWA requires a different set of index files for alignment. The command below creates five of the six index files we need for alignment. The command calls the index function of BWA on the reference FASTA.

bwa index chr19_chr19_KI270866v1_alt.fasta

This gives .pac, .bwt, .ann, .amb and .sa index files that all have the same chr19_chr19_KI270866v1_alt.fasta basename. Tools recognize index files within the same directory by their identical basename. In the case of BWA, it takes the reference FASTA filename as the basename and searches for the index file, e.g. with .bwt suffix or .64.bwt suffix. Depending on which of the two choices it finds, it looks for the same suffix for the other index files, e.g. .alt or .64.alt. Lack of a matching .alt index file will cause BWA to map reads without alt-handling. More on this next.

Note that the .64. part is an explicit indication that index files were generated with version 0.6 or later of BWA and are the 64-bit indices (as opposed to files generated by earlier versions, which were 32-bit). This .64. signifier can be added automatically by adding -6 to the bwa index command.
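
As a rough illustration of the naming rule described above, the following sketch checks a list of filenames for a complete set of BWA index files and for a matching .alt file. The function name is hypothetical, and the order in which the two naming schemes are tried (.64 first) is our assumption rather than something stated in this tutorial.

```python
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def describe_bwa_index(reference, filenames):
    """Report which index naming scheme is complete and whether .alt matches it.

    `reference` is the FASTA filename, e.g. "ref.fasta"; `filenames` is the
    directory listing to inspect. Returns None if no complete set is found.
    """
    present = set(filenames)
    for infix in (".64", ""):  # assumed search order
        expected = [reference + infix + suffix for suffix in BWA_INDEX_SUFFIXES]
        if all(name in present for name in expected):
            return {
                "infix": infix,
                # Without a matching .alt file, mapping is not alt-aware.
                "has_alt_index": (reference + infix + ".alt") in present,
            }
    return None
```

To inspect a real directory, pass os.listdir(directory) as the filenames argument.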



2. Include the reference ALT index file

Be sure to place the tutorial's mini-ALT index file chr19_chr19_KI270866v1_alt.fasta.alt with the other index files. Also, if it does not already match, change the file basename to match. This is the sixth index file we need for alignment. BWA-MEM uses this file to prioritize primary assembly alignments for reads that can map to both the primary assembly and an alternate contig. See BWA documentation for details.

  • As of this writing (August 8, 2016), the SAM format ALT index file for GRCh38 is available only in the x86_64-linux bwakit download as stated in this bwakit README. The hs38DH.fa.alt file is in the resource-GRCh38 folder.

  • In addition to mapped alternate contig records, the ALT index also contains decoy contig records as unmapped SAM records. This is relevant to the postalt-processing we discuss in section 6.2. As such, the postalt-processing in section 6 also requires the ALT index.

For the tutorial, we subset from hs38DH.fa.alt to create a mini-ALT index, chr19_chr19_KI270866v1_alt.fasta.alt. Its contents are shown below.

image

The record aligns the chr19_KI270866v1_alt contig to the chr19 locus starting at position 34,350,807 and uses CIGAR string nomenclature to indicate the pairwise structure. To interpret the CIGAR string, think of the primary assembly as the reference and the ALT contig sequence as the read. For example, the 11307M at the start indicates 11,307 corresponding sequence bases, either matches or mismatches. The 935S at the end indicates a 935 base softclip for the ALT contig sequence that lacks corresponding sequence in the primary assembly. This is a region that we consider highly divergent or novel. Finally, notice the NM tag that notes the edit distance to the reference.
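
To make the CIGAR reading concrete, here is a small sketch that totals how many bases a CIGAR string consumes on the reference (here, the primary assembly) and on the query (here, the ALT contig), following the operator definitions in the SAM specification. The function name is ours. Only the leading 11307M and trailing 935S are quoted in the text above, so any full CIGAR string you try it on should come from the actual .alt file.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operators that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operators that advance along the query


def cigar_spans(cigar):
    """Return (reference_bases, query_bases, softclipped_bases) for a CIGAR."""
    ops = CIGAR_OP.findall(cigar)
    # Reject strings that contain anything other than well-formed operations.
    if "".join(length + op for length, op in ops) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    reference = sum(int(n) for n, op in ops if op in CONSUMES_REFERENCE)
    query = sum(int(n) for n, op in ops if op in CONSUMES_QUERY)
    softclipped = sum(int(n) for n, op in ops if op == "S")
    return reference, query, softclipped
```

For the ALT index record, the softclipped total corresponds to ALT sequence with no counterpart in the primary assembly.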

☞ What happens if I forget the ALT index file?

If you omit the ALT index file from the reference, or if its naming structure mismatches the other indexes, then your alignments will be equivalent to the results you would obtain if you run BWA-MEM with the -j option. The next section gives an example of what this looks like.



3. Align reads with BWA-MEM

The command below uses an alt-aware version of BWA and maps reads using BWA's maximal exact match (MEM) option. Because the ALT index file is present, the tool prioritizes mapping to the primary assembly over ALT contigs. In the command, the tutorial's chr19_chr19_KI270866v1_alt.fasta serves as reference; one FASTQ holds the forward reads and the other holds the reverse reads.

bwa mem chr19_chr19_KI270866v1_alt.fasta 8017_read1.fq 8017_read2.fq > 8017_bwamem.sam

The resulting file 8017_bwamem.sam contains aligned read records.

  • BWA preferentially maps to the primary assembly any reads that can align equally well to the primary assembly or the ALT contigs, as well as any reads that it can reasonably align to the primary assembly even if they align better to an ALT contig. Preference is given by the primary alignment record status, i.e. not secondary and not supplementary. BWA takes the reads that it cannot map to the primary assembly and attempts to map them to the alternate contigs. If a read can map to an alternate contig, then it is mapped to the alternate contig as a primary alignment. For those reads that can map to both and align better to the ALT contig, the tool flags the ALT contig alignment record as supplementary (0x800). This is what we call alt-aware mapping or alt-handling.

  • Adding the -j option to the command disables the alt-handling. Reads that can map multiply are given low or zero MAPQ scores.
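
The primary, secondary and supplementary statuses mentioned above are encoded as bits in the SAM FLAG field. A minimal sketch of decoding them, using the bit values from the SAM specification (the function name is ours):

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def alignment_type(flag):
    """Classify a SAM record by its FLAG value."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    # Neither secondary nor supplementary: this is the primary alignment.
    return "primary"
```

Counting records by alignment_type per contig gives a quick view of how many ALT contig alignments were demoted to supplementary by alt-handling.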

image

☞ How can I tell if a BAM was aligned with alt-handling?

There are two approaches to this question.

First, you can view the alignments on IGV and compare primary assembly loci with their alternate contigs. The IGV screenshots to the right show how BWA maps reads with (top) or without (bottom) alt-handling.

Second, you can check the alignment SAM. Of two tags that indicate alt-aware alignment, one will persist after preprocessing only if the sample has reads that can map to alternate contigs. The first tag, the AH tag, is in the BAM header section of the alignment file, and is absent after any merging step, e.g. merging with MergeBamAlignment. The second tag, the pa tag, is present for reads that the aligner alt-handles. If a sample does not contain any reads that map equally or preferentially to alternate contigs, then this tag may be absent in a BAM even if the alignments were mapped in an alt-aware manner.

Here are three headers for comparison where only one indicates alt-aware alignment.

File header for alt-aware alignment. We use this type of alignment in the tutorial.
Each alternate contig's @SQ line in the header will have an AH:* tag to indicate alternate contig handling for that contig. This marking is based on the alternate contig being listed in the .alt index file and alt-aware alignment.
image

File header for -j alignment (alt-handling disabled) for example purposes. We do not perform this type of alignment in the tutorial.
Notice the absence of any special tags in the header.
image

File header for alt-aware alignment after merging with MergeBamAlignment. We use this step in the next section.
Again, notice the absence of any special tags in the header.
image
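
The header check described above can also be scripted. This sketch takes SAM header lines (for example, the output of samtools view -H) and lists the contigs whose @SQ line carries an AH tag. The function name is hypothetical, and the sketch assumes tab-separated TAG:VALUE fields as in the SAM specification.

```python
def alt_handled_contigs(header_lines):
    """Return names of contigs whose @SQ header line has an AH tag."""
    contigs = []
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue
        # Each remaining field looks like TAG:VALUE, e.g. SN:chr19 or AH:*
        tags = dict(field.split(":", 1) for field in fields[1:] if ":" in field)
        if "AH" in tags:
            contigs.append(tags.get("SN"))
    return contigs
```

An empty result means either that alt-handling was not used or that the header was rewritten by a later step such as MergeBamAlignment, so it is not proof on its own.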

☞ What is the pa tag?

For BWA v0.7.15, but not v0.7.13, ALT loci alignment records that can align to both the primary assembly and alternate contig(s) will have a pa tag on the primary assembly alignment. For example, read chr19_KI270866v1_alt_4hetvars_26518_27047_0:0:0_0:0:0_931 of the ALTALT sample has five alignment records, only three of which have the pa tag, as shown below.

image

A brief description of each of the five alignments, in order:

  1. First in pair, primary alignment on the primary assembly; AS=146, pa=0.967
  2. First in pair, supplementary alignment on the alternate contig; AS=151
  3. Second in pair, primary alignment on the primary assembly; AS=120; pa=0.795
  4. Second in pair, supplementary alignment on the primary assembly; AS=54; pa=0.358
  5. Second in pair, supplementary alignment on the alternate contig; AS=151

The pa tag measures how much better a read aligns to its best alternate contig alignment versus its primary assembly (pa) alignment. Specifically, it is the ratio of the primary assembly alignment score over the highest alternate contig alignment score. In our example we have primary assembly alignment scores of 146, 120 and 54 and alternate contig alignment scores of 151 and again 151. This gives us three different pa scores that tag the primary assembly alignments: 146/151=0.967, 120/151=0.795 and 54/151=0.358.
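
The arithmetic above can be written as a short sketch. The function name is ours, and rounding to three decimals is an assumption based on how the values are displayed here, not a statement about how BWA formats the tag.

```python
def pa_scores(primary_assembly_scores, alt_contig_scores):
    """Ratio of each primary assembly alignment score to the best ALT score."""
    best_alt = max(alt_contig_scores)
    return [round(score / best_alt, 3) for score in primary_assembly_scores]
```

Calling pa_scores([146, 120, 54], [151, 151]) reproduces the three values quoted above.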

In our tutorial's workflow, MergeBamAlignment may either change an alignment's pa score or add a previously unassigned pa score to an alignment. The result of this is summarized as follows for the same alignments.

  1. pa=0.967 --MergeBamAlignment--> same
  2. none --MergeBamAlignment--> assigns pa=0.967
  3. pa=0.795 --MergeBamAlignment--> same
  4. pa=0.358 --MergeBamAlignment--> replaces with pa=0.795
  5. none --MergeBamAlignment--> assigns pa=0.795

If you want to retain the BWA-assigned pa scores, then add the following options to the workflow commands in section 4.

  • For RevertSam, add ATTRIBUTE_TO_CLEAR=pa.

  • For MergeBamAlignment, add ATTRIBUTES_TO_RETAIN=pa.

In our sample set, after BWA-MEM alignment ALTALT has 1412 pa-tagged alignment records, PAALT has 805 pa-tagged alignment records and PAPA has zero pa-tagged records.



4. Add read group information, preprocess to make a clean BAM and call variants

The initial alignment file is missing read group information. One way to add that information, which we use in production, is to use MergeBamAlignment. MergeBamAlignment adds back read group information contained in an unaligned BAM and adjusts meta information to produce a clean BAM ready for pre-processing (see Tutorial#6483 for details on our use of MergeBamAlignment). Given the focus here is to showcase BWA-MEM's alt-handling, we refrain from going into the details of all this additional processing. They follow, with some variation, the PairedEndSingleSampleWf pipeline detailed here.

Remember these are simulated reads with simulated base qualities. We simulated the reads in a manner that only introduces the planned mismatches, without any errors. Coverage is good at roughly 35x. All of the base qualities for all of the reads are at I, which is, according to this page and this site, an excellent base quality score equivalent to a Sanger Phred+33 score of 40. We can therefore skip base quality score recalibration (BQSR) since the reads are simulated and the dataset is not large enough for recalibration anyway.
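
For reference, the Phred+33 conversion mentioned above is a one-line calculation: subtract 33 from the ASCII code of each quality character. The sketch below uses function names of our own choosing.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string in Sanger Phred+33 encoding."""
    return [ord(character) - 33 for character in quality_string]


def error_probability(quality):
    """Probability that a base call with this Phred quality is wrong."""
    return 10 ** (-quality / 10)
```

The character I has ASCII code 73, so it decodes to a quality of 40, which corresponds to an error probability of 1 in 10,000.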

Here are the commands to obtain a final multisample variant callset. The commands are given for one of the samples. Process each of the three samples independently in the same manner [4.1–4.6] until the last GenotypeGVCFs command [4.7].

[4.1] Create unmapped uBAM

java -jar picard.jar RevertSam \
    I=altalt_bwamem.sam O=altalt_u.bam \
    ATTRIBUTE_TO_CLEAR=XS ATTRIBUTE_TO_CLEAR=XA

[4.2] Add read group information to uBAM

java -jar picard.jar AddOrReplaceReadGroups \
    I=altalt_u.bam O=altalt_rg.bam \
    RGID=altalt RGSM=altalt RGLB=wgsim RGPU=shlee RGPL=illumina

[4.3] Merge uBAM with aligned BAM

java -jar picard.jar MergeBamAlignment \
    ALIGNED=altalt_bwamem.sam UNMAPPED=altalt_rg.bam O=altalt_m.bam \
    R=chr19_chr19_KI270866v1_alt.fasta \
    SORT_ORDER=unsorted CLIP_ADAPTERS=false \
    ADD_MATE_CIGAR=true MAX_INSERTIONS_OR_DELETIONS=-1 \
    PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
    UNMAP_CONTAMINANT_READS=false \
    ATTRIBUTES_TO_RETAIN=XS ATTRIBUTES_TO_RETAIN=XA

[4.4] Flag duplicate reads

java -jar picard.jar MarkDuplicates \
    INPUT=altalt_m.bam OUTPUT=altalt_md.bam METRICS_FILE=altalt_md.bam.txt \
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 ASSUME_SORT_ORDER=queryname 

[4.5] Coordinate sort, fix NM and UQ tags and index for clean BAM
As of Picard v2.7.0, released October 17, 2016, SetNmAndUqTags is no longer available. Use SetNmMdAndUqTags instead.

set -o pipefail
java -jar picard.jar SortSam \
    INPUT=altalt_md.bam OUTPUT=/dev/stdout SORT_ORDER=coordinate | \
    java -jar picard.jar SetNmAndUqTags \
    INPUT=/dev/stdin OUTPUT=altalt_snaut.bam \
    CREATE_INDEX=true R=chr19_chr19_KI270866v1_alt.fasta

[4.6] Call SNP and indel variants in emit reference confidence (ERC) mode per sample using HaplotypeCaller

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R chr19_chr19_KI270866v1_alt.fasta \
    -o altalt.g.vcf -I altalt_snaut.bam \
    -ERC GVCF --max_alternate_alleles 3 --read_filter OverclippedRead \
    --emitDroppedReads -bamout altalt_hc.bam

[4.7] Call genotypes on three samples

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R chr19_chr19_KI270866v1_alt.fasta -o multisample.vcf \
    --variant altalt.g.vcf --variant altpa.g.vcf --variant papa.g.vcf 

The altalt_snaut.bam, HaplotypeCaller's altalt_hc.bam and the multisample multisample.vcf are ready for viewing on IGV.

Before getting into the results in the next section, we have minor comments on two filtering options.

In our tutorial workflows, we turn off MergeBamAlignment's UNMAP_CONTAMINANT_READS option. If set to true, 68 reads become unmapped for PAPA and 40 reads become unmapped for PAALT. These unmapped reads are those reads caught by the UNMAP_CONTAMINANT_READS filter and their mates. MergeBamAlignment defines contaminant reads as those alignments that are overclipped, i.e. that are softclipped on both ends and that align with fewer than 32 bases. Changing the MIN_UNCLIPPED_BASES option from the default of 32 to 22 and 23 restores all of these reads for PAPA and PAALT, respectively. Contaminants are obviously absent from these simulated reads, so we set UNMAP_CONTAMINANT_READS to false to disable this filtering.

HaplotypeCaller's --read_filter OverclippedRead option similarly looks for alignments softclipped on both ends and filters those aligning with fewer than 30 bases. The difference is that HaplotypeCaller only excludes the overclipped alignments from its calling; it neither removes mapping information nor acts on the mate of the filtered alignment. Thus, we keep this read filter for the first workflow. However, for the second and third workflows in section 6, tutorial_8017_toSE and tutorial_8017_postalt, we omit the --read_filter OverclippedRead option from the HaplotypeCaller command. We also omit the --max_alternate_alleles 3 option for simplicity.
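
To make the both-ends-softclipped rule concrete, here is a minimal sketch that classifies a single CIGAR string. It only illustrates the rule as worded above and is not either tool's actual implementation. The function name is_overclipped is hypothetical, and counting only M, = and X operations as aligned bases is an assumption.

```shell
# Hypothetical helper: is_overclipped CIGAR MIN_ALIGNED_BASES
# Prints "overclipped" when the CIGAR is softclipped on both ends and has
# fewer than MIN_ALIGNED_BASES aligned bases; otherwise prints "ok".
# Assumption: only M, = and X operations count as aligned bases.
is_overclipped() {
    awk -v cigar="$1" -v min="$2" 'BEGIN {
        # A softclip may sit inside an outer hardclip (H).
        lead  = (cigar ~ /^([0-9]+H)?[0-9]+S/)
        trail = (cigar ~ /[0-9]+S([0-9]+H)?$/)
        aligned = 0
        rest = cigar
        # Consume one <length><operation> element per iteration.
        while (match(rest, /^[0-9]+[MIDNSHP=X]/)) {
            op = substr(rest, RLENGTH, 1)
            n  = substr(rest, 1, RLENGTH - 1) + 0
            if (op == "M" || op == "=" || op == "X") aligned += n
            rest = substr(rest, RLENGTH + 1)
        }
        if (lead && trail && aligned < min) print "overclipped"
        else print "ok"
    }'
}

is_overclipped 60S25M66S 30    # both ends clipped, 25 aligned bases
is_overclipped 10S120M21S 30   # both ends clipped, 120 aligned bases
```

With the thresholds quoted above, the second argument would be 32 for MergeBamAlignment's default and 30 for the OverclippedRead filter.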




5. How can I tell whether I should consider an alternate haplotype?

image

We consider this question only for our GPI locus, a locus we know has an alternate contig in the reference. Here we use the term locus in its biological sense to refer to a contiguous genomic region of interest. The three samples give the alignment and coverage profiles shown in the accompanying image.

What is immediately apparent from the IGV screenshot is that the scenarios that include the alternate haplotype give a distinct pattern of variant sites on the primary assembly, much like a fingerprint. These variants are predominantly heterozygous or homozygous. Looking closely at the 3' region of the locus, we see some alignment coverage anomalies that also show a distinct pattern. Coverage drops in some parts of the highly diverged region of the primary assembly and increases in others. If we look at the origin of the simulated reads in one of the excess coverage regions, we see that they come from two different regions of the alternate contig, which suggests duplicated sequence segments within the alternate locus.

The variation pattern and coverage anomalies on the primary locus suggest an alternate haplotype may be present for the locus. We can then confirm the presence of aligned reads, both supplementary and primary, on the alternate locus. Furthermore, if we count the alignment records for each region, e.g. using samtools idxstats, we see the following metrics.

                        ALT/ALT     PA/ALT     PA/PA   
chr19                     10005      10006     10000     
chr19_KI270866v1_alt       1407        799         0      
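
As a sketch of how such counts can be compared, the helper below reads samtools idxstats output (tab-separated contig name, contig length, mapped count, unmapped count) on standard input and prints the alternate-to-primary ratio of mapped counts. The function name alt_ratio is hypothetical, and treating the idxstats mapped column as the alignment-record count is an assumption.

```shell
# Hypothetical helper: alt_ratio PRIMARY_CONTIG ALT_CONTIG < idxstats.txt
# Input columns (samtools idxstats): contig, length, mapped, unmapped.
alt_ratio() {
    awk -F'\t' -v primary="$1" -v alt="$2" '
        $1 == primary { p = $3 }   # mapped count on the primary contig
        $1 == alt     { a = $3 }   # mapped count on the alternate contig
        END {
            if (p > 0) printf "%.3f\n", a / p
            else print "NA"
        }'
}

# Example with a BAM from this tutorial:
# samtools idxstats altalt_snaut.bam | alt_ratio chr19 chr19_KI270866v1_alt
```

For the counts in the table, the ratio is roughly 0.14 for ALT/ALT (1407/10005), 0.08 for PA/ALT (799/10006) and 0 for PA/PA.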

The number of alignments on the alternate locus increases proportionately with alternate contig dosage. All of these factors together suggest that the sample presents an alternate haplotype.

5.1 Discussion of variant calls for tutorial_8017

The three-sample variant callset gives 54 sites on the primary locus and two additional on the alternate locus for 56 variant sites. All of the eight SNP alleles we introduced are called, with six called on the primary assembly and two called on the alternate contig. Of the 15 expected genotype calls, four are incorrect. Namely, four PAALT calls that ought to be heterozygous are called homozygous variant. These are two each on the primary assembly and on the alternate contig in the region that is highly divergent.

► Our production pipelines use genomic intervals lists that exclude GRCh38 alternate contigs from variant calling. That is, variant calling is performed only for contigs of the primary assembly. This calling on even just the primary assembly of GRCh38 brings improvements to analysis results over previous assemblies. For example, if we align and call variants for our simulated reads on GRCh37, we call 50 variant sites with identical QUAL scores to the equivalent calls in our GRCh38 callset. However, this GRCh37 callset is missing six variant calls compared to the GRCh38 callset for the 42 kb locus: the two variant sites on the alternate contig and four variant sites on the primary assembly.

Consider the example variants on the primary locus. The variant calls from the primary assembly include 32 variant sites that are strictly homozygous variant in ALTALT and heterozygous variant in PAALT. The callset represents only those reads from the ALT that can be mapped to the primary assembly.

In contrast, the two variants in regions whose reads can only map to the alternate contig are absent from the primary assembly callset. For this simulated dataset, the primary alignments present on the alternate contig provide enough supporting reads that allow HaplotypeCaller to call the two variants. However, these variant calls have lower-quality annotation metrics than for those simulated in an equal manner on the primary assembly. We will get into why this is in section 6.

Additionally, for our PAALT sample that is heterozygous for an alternate haplotype, the genotype calls in the highly divergent regions are inaccurate. These are called homozygous variant on the primary assembly and on the alternate contig when in fact they are heterozygous variant. These calls have lower genotype quality (GQ) scores as well as lower allele depth (AD) and coverage (DP). The table below shows the variant calls for the introduced SNP sites. In blue are the genotype calls that should be heterozygous variant but are instead called homozygous variant.
image

Here is a SelectVariants command that selects the intentional variant sites:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R chr19_chr19_KI270866v1_alt.fasta \
    -V multisample.vcf -o multisample_selectvariants.vcf \
    -L chr19:34,383,500 -L chr19:34,389,485 -L chr19:34,391,800 -L chr19:34,392,600 \
    -L chr19_KI270866v1_alt:32,700 -L chr19_KI270866v1_alt:38,700 \
    -L chr19_KI270866v1_alt:41,700 -L chr19_KI270866v1_alt:42,700 \
    -L chr19:34,383,486 -L chr19_KI270866v1_alt:32,714 




6. My locus includes an alternate haplotype. How can I call variants on alt contigs?

If you want to call variants on alternate contigs, consider additional data processing that overcomes the following problems.

  • Loss of alignments from filtering of overclipped reads.

  • HaplotypeCaller's filtering of alignments whose mates map to another contig. Alt-handling produces many of these types of reads on the alternate contigs.

  • Zero MAPQ scores for alignments that map to two or more alternate contigs. HaplotypeCaller excludes these types of reads from contributing to evidence for variation.

Let us talk about these in more detail.

Ideally, if we are interested in alternate haplotypes, then we would have ensured we were using the most up-to-date analysis reference genome sequence with the latest patch fixes. Also, whatever approach we take to align and preprocess alignments, if we filter any reads as putative contaminants, e.g. with MergeBamAlignment's option to unmap cross-species contamination, then at this point we would want to fish back into the unmapped reads pool and pull out those reads. Specifically, these would have an SA tag indicating mapping to the alternate contig of interest and an FT tag indicating the reason for unmapping was because MergeBamAlignment's UNMAP_CONTAMINANT_READS option identified them as cross-species contamination. Similarly, we want to make sure not to include HaplotypeCaller's --read_filter OverclippedRead option that we use in the first workflow.

image

As section 5.1 shows, variant calls on the alternate contig are of low quality: they have roughly an order of magnitude lower QUAL scores than what should be equivalent variant calls on the primary assembly.

For this exploratory tutorial, we are interested in calling the introduced SNPs with equivalent annotation metrics. Whether they are called on the primary assembly or the alternate contig, and whether they are called homozygous variant or heterozygous, matters less here, especially given that pinning certain variants from highly homologous regions to one of the loci is nigh impossible with our short reads. To this end, we will use the second workflow shown in the workflows diagram. However, because this solution is limited, we present a third workflow as well.

► We present these workflows solely for exploratory purposes. They do not represent any production workflows.

Tutorial_8017_toSE uses the processed BAM from our first workflow and allows for calling on singular alternate contigs. That is, the workflow is suitable for calling on alternate contigs of loci with only a single alternate contig like our GPI locus. Tutorial_8017_postalt uses the aligned SAM from the first workflow before processing, and requires separate processing before calling. This third workflow allows for calling on all alternate contigs, even on HLA loci that have numerous contigs per primary locus. However, the callset will not be parsimonious. That is, each alternate contig will greedily represent alignments and it is possible the same variant is called for all the alternate loci for a given primary locus as well as on the primary locus. It is up to the analyst to figure out what to do with the resulting calls.

image

The reason for the divide in these two workflows is in the way BWA assigns mapping quality scores (MAPQ) to multimapping reads. Postalt-processing becomes necessary for loci with two or more alternate contigs because the shared alignments between the primary locus and alternate loci will have zero MAPQ scores. Postalt-processing gives non-zero MAPQ scores to the alignment records. The table presents the frequencies of GRCh38 non-HLA alternate contigs per primary locus. It appears that ~75% of non-HLA alternate contigs are singular to ~92% of primary loci with non-HLA alternate contigs. In terms of bases on the primary assembly, of the ~75 megabases that have alternate contigs, ~64 megabases (85%) have singular non-HLA alternate contigs and ~11 megabases (15%) have multiple non-HLA alternate contigs per locus. Our tutorial's example locus falls under this majority.

In both alt-aware mapping and postalt-processing, alternate contig alignments have a predominance of mates that map back to the primary assembly. HaplotypeCaller, for good reason, filters reads whose mates map to a different contig. However, we know that GRCh38 artificially represents alternate haplotypes as separate contigs and BWA-MEM intentionally maps these mates back to the primary locus. For comparable calls on alternate contigs, we need to include these alignments in calling. To this end, we have devised a temporary workaround.

6.1 Variant calls for tutorial_8017_toSE

Here we are only aiming for equivalent calls with similar annotation values for the two variants that are called on the alternate contig. For the solution that we will outline, here are the results.

image

Including the mate-mapped-to-other-contig alignments bolsters the variant call qualities for the two SNPs HaplotypeCaller calls on the alternate locus. We see the AD allele depths much improved for ALTALT and PAALT. Corresponding to the increase in reads, the GQ genotype quality and the QUAL score (highlighted in red) indicate higher qualities. For example, the QUAL scores increase from 332 and 289 to 2166 and 1764, respectively. We also see that one of the genotype calls changes. For sample ALTALT, we see a previous no call is now a homozygous reference call (highlighted in blue). This hom-ref call is further from the truth than not having a call as the ALTALT sample should not have coverage for this region in the primary assembly.

For our example data, tutorial_8017's callset subset for the primary assembly and tutorial_8017_toSE's callset subset for the alternate contigs together appear to make for a better callset.

What solution did we apply? As the workflow's name toSE implies, this approach converts paired reads to single-end reads. Specifically, it takes the processed and coordinate-sorted BAM from the first workflow and removes the 0x1 paired flag from the alignments. Removing the 0x1 flag from the reads allows HaplotypeCaller to consider alignments whose mates map to a different contig. We accomplish this using a script modified from the one presented in Biostars post https://www.biostars.org/p/106668/, then index with Samtools and call with HaplotypeCaller as follows. Note that this workaround creates an invalid BAM according to ValidateSamFile. Another caveat is that because HaplotypeCaller uses softclipped sequences, any overlapping regions of read pairs will count twice towards variation instead of once. Thus, this step may lead to overconfident calls in such regions.

Remove the 0x1 bitwise flag from alignments

samtools view -h altalt_snaut.bam | gawk '{printf "%s\t", $1; if(and($2,0x1))
{t=$2-0x1}else{t=$2}; printf "%s\t" , t; for (i=3; i<NF; i++){printf "%s\t", $i} ; 
printf "%s\n",$NF}'| samtools view -Sb - > altalt_se.bam
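
One way to sanity-check the conversion is to count the alignment records that still carry the 0x1 bit. The sketch below does this on SAM text with portable awk. The function name count_paired is hypothetical and is not part of the original script.

```shell
# Hypothetical helper: count_paired < SAM text
# Counts alignment records whose FLAG (column 2) still has the 0x1 bit set.
count_paired() {
    awk -F'\t' '
        /^@/ { next }               # skip header lines
        int($2) % 2 == 1 { n++ }    # 0x1 is the lowest bit, so an odd FLAG means paired
        END { print n + 0 }'
}

# Example: samtools view -h altalt_se.bam | count_paired
```

A count of 0 for altalt_se.bam would indicate that no record is still marked as paired.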

Index the resulting BAM

samtools index altalt_se.bam

Call variants in -ERC GVCF mode with HaplotypeCaller for each sample

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R chr19_chr19_KI270866v1_alt.fasta \
    -I altalt_se.bam -o altalt_hc.g.vcf \
    -ERC GVCF --emitDroppedReads -bamout altalt_hc.bam

Finally, use GenotypeGVCFs as shown in section 4's command [4.7] for a multisample variant callset. Tutorial_8017_toSE calls 68 variant sites--66 on the primary assembly and two on the alternate contig.

6.2 Variant calls for tutorial_8017_postalt

BWA's postalt-processing requires the query-grouped output of BWA-MEM. Piping an alignment step with postalt-processing is possible. However, to be able to compare variant calls from an identical alignment, we present the postalt-processing as an add-on workflow that takes the alignment from the first workflow.

The command uses the bwa-postalt.js script, which we run through k8, a Javascript execution shell. It takes the ALT index and the aligned SAM, altalt.sam, as arguments and redirects the output to altalt_postalt.sam.

k8 bwa-postalt.js \
    chr19_chr19_KI270866v1_alt.fasta.alt \
    altalt.sam > altalt_postalt.sam

image

The resulting postalt-processed SAM, altalt_postalt.sam, undergoes the same processing as the first workflow (commands 4.1 through 4.7) except that (i) we omit --max_alternate_alleles 3 and --read_filter OverclippedRead options for the HaplotypeCaller command like we did in section 6.1 and (ii) we perform the 0x1 flag removal step from section 6.1.

The effect of this postalt-processing is immediately apparent in the IGV screenshots. Previously empty regions are now filled with alignments. Look closely in the highly divergent region of the primary locus. Do you notice a change, albeit subtle, before and after postalt-processing for samples ALTALT and PAALT?

These alignments give the calls below for our SNP sites of interest. Here, notice calls are made for more sites--on the equivalent site if present in addition to the design site (highlighted in the first two columns). For the three pairs of sites that can be called on either the primary locus or alternate contig, the variant site QUALs, the INFO field annotation metrics and the sample level annotation values are identical for each pair.

image

Postalt-processing lowers the MAPQ of primary locus alignments in the highly divergent region that map better to the alt locus. You can see this as a subtle change in the IGV screenshot. After postalt-processing we see an increase in white zero MAPQ reads in the highly divergent region of the primary locus for ALTALT and PAALT. For ALTALT, this effectively cleans up the variant calls in this region at chr19:34,391,800 and chr19:34,392,600. Previously for ALTALT, these calls contained some reads: 4 and 25 for the first workflow and 0 and 28 for the second workflow. After postalt-processing, no reads are considered in this region giving us ./.:0,0:0:.:0,0,0 calls for both sites.
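
The change in zero-MAPQ reads can also be checked by counting rather than by eye. The sketch below tallies mapped records with MAPQ 0 per contig from SAM text. The function name mapq0_by_contig is hypothetical, and the sketch counts over whole contigs, so restricting it to the highly divergent region would need an additional coordinate filter.

```shell
# Hypothetical helper: mapq0_by_contig < SAM text
# Prints "<contig><TAB><count>" for mapped records whose MAPQ (column 5) is 0.
mapq0_by_contig() {
    awk -F'\t' '
        /^@/ { next }                        # skip header lines
        $3 != "*" && $5 == 0 { n[$3]++ }     # RNAME is column 3; "*" means unmapped
        END { for (c in n) printf "%s\t%d\n", c, n[c] }' | LC_ALL=C sort
}

# Example: compare the alignment before and after postalt-processing
# mapq0_by_contig < altalt.sam
# mapq0_by_contig < altalt_postalt.sam
```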

What we omit from examination are the effects of postalt-processing on decoy contig alignments. Namely, if an alignment on the primary assembly aligns better on a decoy contig, then postalt-processing discounts the alignment on the primary assembly by assigning it a zero MAPQ score.

To wrap up, here is the number of variant sites called by each of the three workflows. As you can see, this last workflow calls the most variants at 95 variant sites, with 62 on the primary assembly and 33 on the alternate contig.

Workflow                total    on primary assembly    on alternate contig
tutorial_8017           56       54                      2
tutorial_8017_toSE      68       66                      2
tutorial_8017_postalt   95       62                     33




7. Related resources

  • For WDL scripts of the workflows represented in this tutorial, see the GATK WDL scripts repository.

  • To revert an aligned BAM to unaligned BAM, see Section B of Tutorial#6484.

  • To simulate reads from a reference contig, see Tutorial#7859.
  • Dictionary entry Reference Genome Components reviews terminology that describe reference genome components.
  • The GATK resource bundle provides an analysis set GRCh38 reference FASTA as well as several other related resource files.
  • As of this writing (August 8, 2016), the SAM format ALT index file for GRCh38 is available only in the x86_64-linux bwakit download as stated in this bwakit README. The hs38DH.fa.alt file is in the resource-GRCh38 folder. Rename this file's basename to match that of the corresponding reference FASTA.
  • For more details on MergeBamAlignment features, see Section 3C of Tutorial#6483.
  • For details on the PairedEndSingleSampleWorkflow that uses GRCh38, see here.
  • See here for VCF specifications.




Missing positions after GenotypeGVCFs


Hi,

Sorry if this is a re-post but I have read through the last post on this issue: http://gatkforums.broadinstitute.org/gatk/discussion/4343/missing-positions-in-the-gvcf-file

I'm having the same issue except with GATK 3.6.0.

I ran haplotype caller on ~2,500 samples using the following cmd:
module load gatk/3.6.0 && module load java/1.8.0_91 && java -jar -Djava.io.tmpdir=/hpf/largeprojects/pray/llau/tmp/ -Xmx24G $GATK -T HaplotypeCaller -R /hpf/largeprojects/pray/llau/internal_databases/gatk_bundle/2.8_b37/human_g1k_v37_decoy.fasta -I .recalibrated.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -rf BadCigar --dontUseSoftClippedBases --min_base_quality_score 20 --emitRefConfidence GVCF -o .raw_variants.gvcf

I then ran genotypeGVCFs using the following cmd:
module load gatk/3.6.0 && module load java/1.8.0_91 && java -jar -Djava.io.tmpdir=/hpf/largeprojects/pray/llau/tmp/ -Xmx24G $GATK -T GenotypeGVCFs --disable_auto_index_creation_and_locking_when_reading_rods --max_alternate_alleles 500 -R /hpf/largeprojects/pray/llau/internal_databases/gatk_bundle/2.8_b37/human_g1k_v37_decoy.fasta -L 20:62025520-63025520 -allSites --variant .g.vcf --variant .g.vcf..... --variant .vcf -o .vcf

If you view the attached screenshot of the .vcf file, you can see I'm missing a couple of random positions (chr20: 62025529). I thought -allSites should report all positions? Am I screwing something up?

IndelRealigner


Hi. I am doing a matched Normal-Tumor mutation detection with whole exome sequencing data.

When evaluating my pipelines, I noticed that some somatic mutations are not showing if I follow the best practice for somatic detection. After I checked the bam files aligned by different pipelines with IGV, I found that the difference was made in the IndelRealigner process.
My question is this: I have the impression that it is better to realign the normal and tumor reads together as input and then produce separate outputs through the --nWayOut option in GATK IndelRealigner, but how is that better than realigning the normal and tumor samples separately? Comparing the separately realigned result (1way) and the realigned-together result (2way), I found a mutation showing in the 1way output (with allele frequency 24/193) that is almost gone in the 2way output (1/170). I checked the alignment with IGV, and I think the mutation might result from an artifact, which means the IndelRealigner made the right decision. However, I am still not sure if it is always better to trust the 2way result. I am worried about missing some somatic mutations without knowing it, since somatic mutations usually have lower allele frequencies.

May I ask how to interpret the benefit of using 2way output over 1way?

Any reply would be greatly appreciated.

Mutect 2 for amplicon sequencing


Hello
I am trying to use Mutect2 to call variants using a PON and no germline sample and had a few questions about my results. I am using the Illumina myeloid amplicon panel for library prep. All the samples are processed via GATK without marking duplicates and without the BQSR step. The PON was generated by using ~100 samples from 1000 genomes processed through M2 per the recommendations here.

Looking through the callset a small subset are outputted as passing all the filters. The vast majority were filtered for having both clustered events and homologous mapping. With amplicon sequencing won't the clustered events filter be applied too stringently since most of the reads start from the same location due to the primers? I've read through the forum to better understand how the homologous mapping filter is applied but do not understand it enough to know what impact the amplicon sequencing could have with the filter.

Any thoughts?

CatVariants Error - Features added out of order


I am running the practices pipeline on a large set of WGS data. On some of the CatVariants steps I am getting errors with the GATK v3.5.
These are the errors:

Mutect2 ERROR: Null qss lines in bcbio log


Hi, I'm running bcbio mutect2 tumor/normal somatic variant calling in a Linux cluster environment. I had to restart my job owing to an error in the bcbio config file (*.yaml), and after editing the config file and restarting, I noticed that the pipeline began producing thousands of these errors in the log:

[2017-01-19T22:28Z] INFO 16:28:45,687 ProgressMeter - 20:33587180 2.678151017E9 2.5 h 3.0 s 51.0% 4.9 h 2.4 h
[2017-01-19T22:28Z] INFO 16:28:46,114 ProgressMeter - 2:179499339 7.0449077071E10 11.3 h 0.0 s 73.3% 15.4 h 4.1 h
[2017-01-19T22:28Z] ERROR 16:28:47,438 MuTect2 - Null qss at 45506621
[2017-01-19T22:28Z] ERROR 16:28:48,223 MuTect2 - Null qss at 102394233
[2017-01-19T22:28Z] ERROR 16:28:48,411 MuTect2 - Null qss at 81919065

Before restarting, there were no (or few) errors. I tried searching for documentation about the meaning and causes of this error, but could not find anything except one github post saying not to worry about it. Any advice would be appreciated.

Here is an example mutect2 command being run by bcbio from the command log:

[2017-01-19T22:29Z] java -Xms681m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/Shared/Bioinformatics/data/mchiment/hansen_vs_exomes/bcbio_var2_mutect2_jan2017/work/mutect2/3/tx/tmpVAfsfP -jar /Shared/Bioinformatics/data/bcbio/toolplus/gatk/3.6-0-g89b7209/GenomeAnalysisTK.jar -T MuTect2 -R /Shared/Bioinformatics/data/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --annotation ClippingRankSumTest --annotation DepthPerSampleHC --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation GCContent --annotation HaplotypeScore --annotation HomopolymerRun --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation DepthPerAlleleBySample --annotation Coverage -I:tumor /Shared/Bioinformatics/data/mchiment/hansen_vs_exomes/bcbio_var2_mutect2_jan2017/work/bamprep/X3/3/X3-sort-3_0_198022430-prep.bam -I:normal /Shared/Bioinformatics/data/mchiment/hansen_vs_exomes/bcbio_var2_mutect2_jan2017/work/bamprep/X1/3/X1-sort-3_0_198022430-prep.bam -L /Shared/Bioinformatics/data/mchiment/hansen_vs_exomes/bcbio_var2_mutect2_jan2017/work/mutect2/3/X-3_0_198022430-regions-nolcr.bed --interval_set_rule INTERSECTION -ploidy 2 -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment | bgzip -c > /Shared/Bioinformatics/data/mchiment/hansen_vs_exomes/bcbio_var2_mutect2_jan2017/work/mutect2/3/tx/tmpVAfsfP/X-3_0_198022430.vcf.gz
