Channel: Recent Discussions — GATK-Forum

GATK4 on exome sequencing


Hello

I am interested in using the WDL pipeline available in the GATK workflow GitHub to run GATK4 on exome data, and I have a few questions:

  1. I have noticed that the notes for the gatk4-germline-snps-indels pipeline state that it has not been tested on exome data. Is there a pipeline that is more suitable for WES data?

  2. I understand (as explained here) that I need to adjust the interval list. Can I simply pass a BED file with the relevant exome kit's target regions as the scattered_calling_intervals_list argument of the workflow, or is there another way to do this? (See the conversion sketch after this list.)

  3. Are there any other adjustments that I should consider in light of the fact that I'm analysing exome data? I have looked at this thread: https://gatkforums.broadinstitute.org/gatk/discussion/6894/gatk-best-practices-for-exome-targeted-capture-small-region but it is a little bit old (that is, pre-GATK4). Are there any considerations specific to GATK4 that I should be aware of?
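For reference, exome target regions supplied as a BED file can be converted to the Picard-style interval list format these workflows expect. A minimal sketch using Picard's BedToIntervalList; the file names targets.bed, targets.interval_list and reference.dict are hypothetical placeholders:

    # Convert a capture kit's BED file to a Picard-style interval list.
    # SD= takes the sequence dictionary (.dict) of your reference.
    java -jar /path/picard.jar BedToIntervalList \
        I=targets.bed \
        O=targets.interval_list \
        SD=reference.dict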

Thanks in advance


(How to) Map and clean up short read sequence data efficiently



If you are interested in emulating the methods used by the Broad Genomics Platform to pre-process your short read sequencing data, you have landed on the right page. The parsimonious operating procedures outlined in this three-step workflow maximize data quality as well as storage and processing efficiency to produce a mapped and clean BAM. This clean BAM is ready for analysis workflows that start with MarkDuplicates.

Since your sequencing data could be in a number of formats, the first step of this workflow refers you to specific methods to generate a compatible unmapped BAM (uBAM, Tutorial#6484) or (uBAMXT, Tutorial#6570 coming soon). Not all unmapped BAMs are equal and these methods emphasize cleaning up prior meta information while giving you the opportunity to assign proper read group fields. The second step of the workflow has you marking adapter sequences, e.g. arising from read-through of short inserts, using MarkIlluminaAdapters such that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies meta data. The resulting clean BAM is coordinate-sorted and indexed.

The workflow reflects a lossless operating procedure that retains original sequencing read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one needs only keep the final BAM file.

Geraldine_VdAuwera points out that there are many different ways of correctly preprocessing HTS data for variant discovery and ours is only one approach. So keep this in mind.

We present this workflow using real data from a public sample. The original data file, called Solexa-272222, is large at 150 GB. The file contains 151 bp paired PCR-free reads giving 30x coverage of a human whole genome sample referred to as NA12878. The entire sample library was sequenced in a single flow cell lane, so all of the reads carry the same read group ID. The example commands work both on this large file and on smaller files containing a subset of the reads, collectively referred to as snippet. NA12878 has a variant in exon 5 of the CYP2C19 gene, on the portion of chromosome 10 covered by the snippet, resulting in a nonfunctional protein. Consistent with GATK's recommendation to use the most up-to-date tools, we used the most current versions of tools as of their testing (September to December 2015) for the given example results, with the exception of BWA. We provide illustrative example results, some of which were derived from processing the original large file and some of which show intermediate stages skipped by this workflow.

Download example snippet data to follow along the tutorial.

We welcome feedback. Share your suggestions in the Comments section at the bottom of this page.


Jump to a section

  1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL
  2. Mark adapter sequences using MarkIlluminaAdapters
  3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment
    A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq
    B. Align reads and flag secondary hits using BWA-MEM
    C. Restore altered data and apply & adjust meta information using MergeBamAlignment
    D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

Tools involved

Prerequisites

  • Installed Picard tools
  • Installed GATK tools
  • Installed BWA
  • Reference genome
  • Illumina or similar tech DNA sequence reads file containing data corresponding to one read group ID. That is, the file contains data from one sample and from one flow cell lane.

Download example data

  • To download the reference, open ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/ in your browser. Leave the password field blank. Download the following three files (~860 MB) to the same folder: human_g1k_v37_decoy.fasta.gz, .fasta.fai.gz, and .dict.gz. This same reference is available to load in IGV. (See the unpacking sketch after this list.)
  • I divided the example data into two tarballs: tutorial_6483_piped.tar.gz contains the files for the piped process and tutorial_6483_intermediate_files.tar.gz contains the intermediate files produced by running each process independently. The data contain reads originally aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) of GRCh37. The table shows the steps of the workflow, corresponding input and output example data files and approximate minutes and disk space needed to process each step. Additionally, we tabulate the time and minimum storage needed to complete the workflow as presented (piped) or without piping.
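A minimal sketch for unpacking the reference files named in the first bullet above:

    gunzip human_g1k_v37_decoy.fasta.gz
    gunzip human_g1k_v37_decoy.fasta.fai.gz
    gunzip human_g1k_v37_decoy.dict.gz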

[table: workflow steps with input and output example files and the approximate time and disk space per step]

Related resources

Other notes

  • When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.
  • For large files, (1) use the Java -Xmx setting and (2) set the TMP_DIR option to point to a temporary directory with ample space.

    java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
        TMP_DIR=/path/shlee 
    

    In the command, the -Xmx8G Java option caps the maximum heap size, or memory usage, to eight gigabytes. The path given by TMP_DIR points the tool to scratch space that it can use. Together these options allow the tool to run without slowing down or triggering an out-of-memory error. The -Xmx settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. These options can be omitted for small files such as the example data and the equivalent command is as follows.

    java -jar /path/picard.jar MarkIlluminaAdapters 
    

    To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads.
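For example, this filters the verbose output down to the relevant line (flag names can vary slightly between JVM versions):

    java -XX:+PrintFlagsFinal -version | grep MaxHeapSize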

  • When a command in this tutorial explicitly sets an option to its default value, follow suit to ensure the same results.


1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL

If you have raw reads data in BAM format with appropriately assigned read group fields, then you can start with step 2. Namely, besides differentiating samples, the read group ID should differentiate factors contributing to technical batch effects, i.e. flow cell lane. If not, you need to reassign read group fields. This dictionary post describes factors to consider, and related forum posts provide some strategic advice on handling multiplexed data.

If your reads are mapped, or in BCL or FASTQ format, then generate an unmapped BAM according to the following instructions.

  • To convert FASTQ or revert aligned BAM files, follow directions in Tutorial#6484. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).
  • To convert Illumina Base Call (BCL) files, use IlluminaBasecallsToSam. The tool marks adapter sequences at the same time. The resulting uBAMXT has adapter sequences marked with the XT tag so you can skip step 2 of this workflow and go directly to step 3. The corresponding Tutorial#6570 is coming soon.

See if you can revert 6483_snippet.bam, containing 279,534 aligned reads, to the unmapped 6483_snippet_revertsam.bam, containing 275,546 reads.
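If you want a head start, here is a minimal sketch of the reversion in the spirit of Tutorial#6484; consult that tutorial for the full set of ATTRIBUTE_TO_CLEAR tags and the reasoning behind each option:

    java -Xmx8G -jar /path/picard.jar RevertSam \
        I=6483_snippet.bam \
        O=6483_snippet_revertsam.bam \
        SANITIZE=true \
        MAX_DISCARD_FRACTION=0.005 \
        ATTRIBUTE_TO_CLEAR=XT \
        SORT_ORDER=queryname \
        RESTORE_ORIGINAL_QUALITIES=true \
        REMOVE_DUPLICATE_INFORMATION=true \
        REMOVE_ALIGNMENT_INFORMATION=true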

back to top


2. Mark adapter sequences using MarkIlluminaAdapters

MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence and produces a metrics file. Some of the marked adapters come from concatenated adapters that randomly arise from the primordial soup that is a PCR reaction. Others represent read-through to 3' adapter ends of reads and arise from insert sizes that are shorter than the read length. In some instances read-through can affect the majority of reads in a sample, e.g. in Nextera library samples over-titrated with transposomes, and render these reads unmappable by certain aligners. Tools such as SamToFastq use the XT tag in various ways to effectively remove adapter sequence contribution to read alignment and alignment scoring metrics. Depending on your library preparation, insert size distribution and read length, expect varying amounts of such marked reads.

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
O=6483_snippet_markilluminaadapters.bam \
M=6483_snippet_markilluminaadapters_metrics.txt \
TMP_DIR=/path/shlee
# Naming the metrics file with M= is required.
# TMP_DIR is optional, for processing large files.

This produces two files. (1) The metrics file, 6483_snippet_markilluminaadapters_metrics.txt, bins the number of tagged adapter bases versus the number of reads. (2) The 6483_snippet_markilluminaadapters.bam file is identical to the input BAM, 6483_snippet_revertsam.bam, except that reads with adapter sequences are marked with a tag in XT:i:# format, where # denotes the 5' starting position of the adapter sequence. At least six bases are required to mark a sequence. Reads without adapter sequence remain untagged.

  • By default, the tool uses Illumina adapter sequences. This is sufficient for our example data.
  • Adjust the default standard Illumina adapter sequences to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters. To clear and add new adapter sequences, first set ADAPTERS to 'null', then specify each sequence with the parameter, as sketched below.
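A minimal sketch of such an override; the file names and adapter sequences shown are placeholders to be replaced with your kit's actual adapters:

    java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
        I=input.bam \
        O=output_markilluminaadapters.bam \
        M=output_metrics.txt \
        ADAPTERS=null \
        FIVE_PRIME_ADAPTER=ACGTACGTACGTACGT \
        THREE_PRIME_ADAPTER=TGCATGCATGCATGCA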

We plot the metrics data, which is in GATKReport file format, using RStudio; the plots show that marked bases vary in length up to the full length of the reads.

Do you get the same number of marked reads? 6483_snippet marks 448 reads (0.16%) with XT, while the original Solexa-272222 marks 3,236,552 reads (0.39%).

Below, we show a read pair marked with the XT tag by MarkIlluminaAdapters. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. For XT:i:20, the majority of the read is adapter sequence. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.

[screenshots of the example read pair at each stage: unmapped uBAM (step 1), after MarkIlluminaAdapters (step 2), after SamToFastq (step 3), and after MergeBamAlignment (step 3)]

back to top


3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment

This step actually pipes three processes, performed by three different tools. Our tutorial example files are small enough to easily view, manipulate and store, so any difference in piped or independent processing will be negligible. For larger data, however, using Unix pipelines can add up to significant savings in processing time and storage.

Not all tools are amenable to piping and piping the wrong tools or wrong format can result in anomalous data.

The three tools we pipe are SamToFastq, BWA-MEM and MergeBamAlignment. By piping these we bypass storage of larger intermediate FASTQ and SAM files. We additionally save time by eliminating the need for the processor to read in and write out data for two of the processes, as piping retains data in the processor's input-output (I/O) device for the next process.

To make the information more digestible, we will first talk about each tool separately. At the end of the section, we provide the piped command.

back to top


3A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq

Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove previously marked adapter sequences, in this example marked with an XT tag. By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to downstream read alignment and alignment scoring metrics.

Illustration of an intermediate step not used in the final workflow. See the piped command.

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=6483_snippet_samtofastq_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \ 
NON_PF=true \
TMP_DIR=/path/shlee #optional to process large files         

This produces a FASTQ file in which all extant meta data, i.e. read group information, alignment information, flags and tags, are purged. What remains are the read query names prefaced with the @ symbol, read sequences and read base quality scores.

  • For our paired reads example file, we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query names of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file. The BWA aligner accepts interleaved FASTQ files given the -p option.
  • We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider an archaic 0x200 flag bit that denotes reads that do not pass quality controls, aka reads failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.
  • Other CLIPPING_ACTION options include (1) X to hard-clip, (2) N to change bases to Ns or (3) another number to change the base qualities of those positions to the given value.

back to top


3B. Align reads and flag secondary hits using BWA-MEM

In this workflow, alignment is the most compute intensive and will take the longest time. GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA-MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.

  • Aligning our snippet reads against either a portion or the whole genome is not equivalent to aligning our original Solexa-272222 file, merging and taking a new slice from the same genomic interval.
  • For the tutorial, we use BWA v 0.7.7.r441, the same aligner used by the Broad Genomics Platform as of this writing (9/2015).
  • As mentioned, alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Epstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
  • BWA alignment requires an indexed reference genome file. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa.

    bwa index -a bwtsw human_g1k_v37_decoy.fasta
    

The example command below aligns our example data against the GRCh37 genome. The tool automatically locates the index files within the same folder as the reference FASTA file.

Illustration of an intermediate step not used in the final workflow. See the piped command.

/path/bwa mem -M -t 7 -p /path/human_g1k_v37_decoy.fasta \ 
6483_snippet_samtofastq_interleaved.fq > 6483_snippet_bwa_mem.sam

This command takes the FASTQ file, 6483_snippet_samtofastq_interleaved.fq, and produces an aligned SAM format file, 6483_snippet_bwa_mem.sam, containing read alignment information, an automatically generated program group record and reads sorted in the same order as the input FASTQ file. Aligner-assigned alignment information, flag and tag values reflect each read's or split read segment's best sequence match and do not take into consideration whether pairs are mapped optimally or if a mate is unmapped. Added tags include the aligner-specific XS tag that marks secondary alignment scores in XS:i:# format. This tag is given for each read even when the score is zero and even for unmapped reads. The program group record (@PG) in the header gives the program group ID, group name, group version and recapitulates the given command. Reads are sorted by query name. For the given version of BWA, the aligned file is in SAM format even if given a BAM extension.

Does the aligned file contain read group information?

We invoke three options in the command.

  • -M to flag shorter split hits as secondary.
    This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.

  • -p to indicate the given file contains interleaved paired reads.

  • -t followed by a number for the number of processor threads to use concurrently. Here we use seven threads, which is one less than the total threads available on my Mac laptop. Check your server or system's total number of threads with the following command provided by KateN.

    getconf _NPROCESSORS_ONLN 
    

In the example data, all of the 1211 unmapped reads each have an asterisk (*) in column 6 of the SAM record, where a read typically records its CIGAR string. The asterisk represents that the CIGAR string is unavailable. The several asterisked reads I examined are recorded as mapping exactly to the same location as their mapped mates but with MAPQ of zero. Additionally, the asterisked reads had noticeable runs of low base qualities, e.g. strings of #s, that corresponded to original base quality calls and not those changed by SamToFastq. This accounting by BWA allows these pairs to always list together, even when the reads are coordinate-sorted, and leaves a pointer to the genomic mapping of the mate of the unmapped read. For the example read pair shown below, comparing sequences shows no apparent overlap, with the highest identity at 72% over 25 nts.
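To pull up such records yourself, a hedged sketch using samtools (the 0x4 flag marks an unmapped read; recent samtools versions read SAM directly):

    samtools view -f 4 6483_snippet_bwa_mem.sam | head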

[screenshots of the example read pair: after MarkIlluminaAdapters (step 2), after BWA-MEM (step 3), and after MergeBamAlignment (step 3)]

back to top


3C. Restore altered data and apply & adjust meta information using MergeBamAlignment

MergeBamAlignment is a beast of a tool, so its introduction is longer. It does more than is implied by its name. Explaining these features requires I fill you in on some background.

Broadly, the tool merges defined information from the unmapped BAM (uBAM, step 1) with that of the aligned BAM (step 3) to conserve read data, e.g. original read information and base quality scores. The tool also generates additional meta information based on the information generated by the aligner, which may alter aligner-generated designations, e.g. mate information and secondary alignment flags. The tool then makes adjustments so that all meta information is congruent, e.g. read and mate strand information based on proper mate designations. We ascribe the resulting BAM as clean.

Specifically, the aligned BAM generated in step 3 lacks read group information and certain tags--the UQ (Phred likelihood of the segment), MC (CIGAR string for mate) and MQ (mapping quality of mate) tags. It has hard-clipped sequences from split reads and altered base qualities. The reads also have what some call mapping artifacts but what are really just features we should not expect from our aligner. For example, the meta information so far does not consider whether pairs are optimally mapped and whether a mate is unmapped (in reality or for accounting purposes). Depending on these assignments, MergeBamAlignment adjusts the read and read mate strand orientations for reads in a proper pair. Finally, the alignment records are sorted by query name. We would like to fix all of these issues before taking our data to a variant discovery workflow.

Enter MergeBamAlignment. As the tool name implies, MergeBamAlignment applies read group information from the uBAM and retains the program group information from the aligned BAM. In restoring original sequences, the tool adjusts CIGAR strings from hard-clipped to soft-clipped. If the alignment file is missing reads present in the unaligned file, then these are retained as unmapped records. Additionally, MergeBamAlignment evaluates primary alignment designations according to a user-specified strategy, e.g. for optimal mate pair mapping, and changes secondary alignment and mate unmapped flags based on its calculations, along with additional changes for desired congruency. I will soon explain these and additional changes in more detail and show a read record to illustrate.

Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. Therefore, it is critical that these were previously flagged.

A read with multiple alignment records may map to multiple loci or may be chimeric--that is, splits the alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and adds the read mapped in proper pair (0x2) and mate unmapped (0x8) flags. The tool then adjusts the strand orientation flag for a read (0x10) and its proper mate (0x20).

In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters. The path to the reference FASTA given by R should also contain the corresponding .dict sequence dictionary with the same prefix as the reference FASTA. It is imperative that the uBAM and aligned BAM are both sorted by queryname (a sorting sketch follows).
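If either input is not queryname-sorted, Picard SortSam can fix that; a minimal sketch with hypothetical file names:

    java -Xmx8G -jar /path/picard.jar SortSam \
        I=input.bam \
        O=input_querynamesorted.bam \
        SORT_ORDER=queryname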

Illustration of an intermediate step not used in the final workflow. See the piped command.

java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
R=/path/Homo_sapiens_assembly19.fasta \
UNMAPPED_BAM=6483_snippet_revertsam.bam \
ALIGNED_BAM=6483_snippet_bwa_mem.sam \
O=6483_snippet_mergebamalignment.bam \
CREATE_INDEX=true \
ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false \
CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true \
MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
ATTRIBUTES_TO_RETAIN=XS \
TMP_DIR=/path/shlee
# Notes: ALIGNED_BAM accepts either SAM or BAM. CREATE_INDEX is the standard
# Picard option for coordinate-sorted outputs. ADD_MATE_CIGAR (default) adds
# the MC tag. CLIP_OVERLAPPING_READS (default) soft-clips ends so mates do not
# extend past each other. CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS (-1 allows
# any number of insertions or deletions) and PRIMARY_ALIGNMENT_STRATEGY (default
# BestMapq) are changed from default. Specify ATTRIBUTES_TO_RETAIN multiple
# times to retain tags starting with X, Y, or Z. TMP_DIR is optional, for
# processing large files.

This generates a coordinate-sorted and clean BAM, 6483_snippet_mergebamalignment.bam, and corresponding .bai index. These are ready for analyses starting with MarkDuplicates. The two bullet-point lists below describe changes to the resulting file. The first list gives general comments on select parameters and the second describes some of the notable changes to our example data.

Comments on select parameters

  • Setting PRIMARY_ALIGNMENT_STRATEGY to MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.
  • It may well be that alignments marked as secondary represent interesting biology, so we retain them with the INCLUDE_SECONDARY_ALIGNMENTS parameter.
  • Setting MAX_INSERTIONS_OR_DELETIONS to -1 retains reads regardless of the number of insertions and deletions. The default is 1.
  • Because we leave the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment will reassess and reassign proper pair designations made by the aligner. These are explained below using the example data.
  • ATTRIBUTES_TO_RETAIN is specified to carry over the XS tag from the alignment, which reports BWA-MEM's suboptimal alignment scores. My impression is that this is the next highest score for any alternative or additional alignments BWA considered, whether or not these additional alignments made it into the final aligned records. (IGV's BLAT feature allows you to search for additional sequence matches). For our tutorial data, this is the only additional unaccounted tag from the alignment. The XS tag is unnecessary for the Best Practices Workflow and is not retained by the Broad Genomics Platform's pipeline. We retain it here not only to illustrate that the tool carries over select alignment information only if asked, but also because I think it prudent. Given how compute intensive the alignment process is, the additional ~1% gain in the snippet file size seems a small price against having to rerun the alignment because we realize later that we want the tag.
  • Setting CLIP_ADAPTERS to false leaves reads unclipped.
  • By default the merged file is coordinate sorted. We set CREATE_INDEX to true to additionally create the bai index.
  • We need not invoke PROGRAM options as BWA's program group information is sufficient and is retained in the merging.
  • As a standalone tool, we would normally feed in a BAM file for ALIGNED_BAM instead of the much larger SAM. We will be piping this step however and so need not add an extra conversion to BAM.

Description of changes to our example data

  • MergeBamAlignment merges header information from the two sources that define read groups (@RG) and program groups (@PG) as well as reference contigs.
  • Tags are updated for our example data as shown in the table. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment), MC (mate CIGAR string) and MQ (mapping quality of the mate/next segment) tags if applicable. For unmapped reads (marked with an * asterisk in column 6 of the SAM record), the tool removes AS and XS tags and assigns MC (if applicable), PG and RG tags. This is illustrated for example read H0164ALXX140820:2:1101:29704:6495 in the BWA-MEM section of this document.
  • Original base quality score restoration is illustrated in step 2.

The example below shows a read pair for which MergeBamAlignment adjusts multiple information fields, and these changes are described in the remaining bullet points.

  • MergeBamAlignment changes hard-clipping to soft-clipping, e.g. 96H55M to 96S55M, and restores corresponding truncated sequences with the original full-length read sequence.
  • The tool reorders the read records to reflect the chromosome and contig ordering in the header and the genomic coordinates for each.
  • MergeBamAlignment's MostDistant PRIMARY_ALIGNMENT_STRATEGY asks the tool to consider the best pair to mark as primary from the primary and secondary records. In this pair, one of the reads has two alignment loci, on contig hs37d5 and on chromosome 10. The two loci align 115 and 55 nucleotides, respectively, and the aligned sequences are identical by 55 bases. Flag values set by BWA-MEM indicate the contig hs37d5 record is primary and the shorter chromosome 10 record is secondary. For this chimeric read, MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment and the contig hs37d5 mapping as secondary (0x100 flag bit).
  • In addition, MergeBamAlignment designates each record on chromosome 10 as read mapped in proper pair (0x2 flag bit) and the contig hs37d5 mapping as mate unmapped (0x8 flag bit). IGV's paired reads mode displays the two chromosome 10 mappings as a pair after these MergeBamAlignment adjustments.
  • MergeBamAlignment adjusts read reverse strand (0x10 flag bit) and mate reverse strand (0x20 flag bit) flags consistent with changes to the proper pair designation. For our non-stranded DNA-Seq library alignments displayed in IGV, a read pointing rightward is in the forward direction (absence of 0x10 flag) and a read pointing leftward is in the reverse direction (flagged with 0x10). In a typical pair, where the rightward pointing read is to the left of the leftward pointing read, the left read will also have the mate reverse strand (0x20) flag.

Two distinct classes of mate unmapped read records are now present in our example file: (1) reads whose mates truly failed to map and are marked by an asterisk * in column 6 of the SAM record and (2) multimapping reads whose mates are in fact mapped but in a proper pair that excludes the particular read record. Each of these two classes of mate unmapped reads can contain multimapping reads that map to two or more locations.

Comparing 6483_snippet_bwa_mem.sam and 6483_snippet_mergebamalignment.bam, we see the number of unmapped reads remains the same at 1211, while the number of records with the mate unmapped flag increases by 1359, from 1276 to 2635. These now account for 0.951% of the 276,970 read records.

For 6483_snippet_mergebamalignment.bam, how many additional unique reads become mate unmapped?
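A hedged way to tally these flags yourself with samtools (0x8 is the mate unmapped flag, 0x4 the read unmapped flag):

    samtools view -c -f 8 6483_snippet_mergebamalignment.bam   # records flagged mate unmapped
    samtools view -c -f 4 6483_snippet_mergebamalignment.bam   # unmapped records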

[screenshots of the read records: after BWA-MEM alignment, and after MergeBamAlignment]

back to top


3D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

We pipe the three tools described above to generate an aligned BAM file sorted by query name. In the piped command, the commands for the three processes are given together, separated by a vertical bar |. We also replace each intermediate output and input file name with a symbolic path to the system's output and input devices, here /dev/stdout and /dev/stdin, respectively. We need only provide the first input file and name the last output file.

Before using a piped command, we should ask the shell to report a failure if any step of the pipe errors, rather than reporting only the status of the final step. Type the following into your shell to set this option.

set -o pipefail
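If you run the pipe from within a script, you may also want the shell to exit as soon as any command fails; a sketch combining both options, with comments:

    set -e          # exit the script when any command fails
    set -o pipefail # a pipeline's exit status reflects the last command to fail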

Overview of command structure

[SamToFastq] | [BWA-MEM] | [MergeBamAlignment]

Piped command

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=/dev/stdout \
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
TMP_DIR=/path/shlee | \ 
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta /dev/stdin | \  
java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
ALIGNED_BAM=/dev/stdin \
UNMAPPED_BAM=6483_snippet_revertsam.bam \
OUTPUT=6483_snippet_piped.bam \
R=/path/Homo_sapiens_assembly19.fasta CREATE_INDEX=true ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS \
TMP_DIR=/path/shlee

The piped output file, 6483_snippet_piped.bam, is for all intents and purposes the same as 6483_snippet_mergebamalignment.bam, produced by running MergeBamAlignment separately without piping. However, the resulting files, as well as new runs of the workflow on the same data, have the potential to differ in small ways because each uses a different alignment instance.

How do these small differences arise?

Counting the number of mate unmapped reads shows that this number remains unchanged for the two described workflows. Two counts emitted at the end of the process updates, which also remain constant across these instances, are the number of alignment records and the number of unmapped reads.

INFO    2015-12-08 17:25:59 AbstractAlignmentMerger Wrote 275759 alignment records and 1211 unmapped reads.

back to top


Some final remarks

We have produced a clean BAM that is coordinate-sorted and indexed, in an efficient manner that minimizes processing time and storage needs. The file is ready for marking duplicates as outlined in Tutorial#2799. Additionally, we can now free up storage on our file system by deleting the original file we started with, the uBAM and the uBAMXT. We sleep well at night knowing that the clean BAM retains all original information.

We have two final comments (1) on multiplexed samples and (2) on fitting this workflow into a larger workflow.

For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates.
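For reference, a minimal sketch of a duplicate-marking step with Picard MarkDuplicates (output names are hypothetical; Tutorial#2799 covers the details):

    java -Xmx8G -jar /path/picard.jar MarkDuplicates \
        I=6483_snippet_piped.bam \
        O=6483_snippet_markduplicates.bam \
        M=6483_snippet_markduplicates_metrics.txt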

For workflows that nest this pipeline, consider additionally optimizing the Java parameters for SamToFastq and MergeBamAlignment. For example, the following are the additional settings used by the Broad Genomics Platform in the piped command for very large data sets.

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...

I give my sincere thanks to Julian Hess, the GATK team and the Data Sciences and Data Engineering (DSDE) team members for all their help in writing this and related documents.

back to top


Running HaplotypeCaller in GATK


I have an issue while running HaplotypeCaller in GATK. This is the entire protocol of what I did:

  1. I have done whole exome sequencing and I have two sequence files (paired ends, so forward and reverse). I indexed the human genome using bwa and also aligned the two sequences to it using bwa.
  2. I then sorted the resulting SAM file using Picard SortSam.
  3. I then wanted to run GATK's HaplotypeCaller on the output BAM file to get the VCF file of the called SNVs. But when I run it I get an error: java.lang.IllegalArgumentException: samples cannot be empty.
  4. So I ran Picard's ValidateSamFile and this is what I got:

HISTOGRAM java.lang.String

Error Type Count
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 131996644

What does this error mean and how do I solve it? Is there any mistake I am making in my protocol that I should change in order to not get this error in the future?
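For context, both messages indicate the BAM carries no read group (@RG) records, which is why HaplotypeCaller finds no samples. A minimal sketch of one common fix using Picard's AddOrReplaceReadGroups; every read group value below is a hypothetical placeholder:

    java -jar /path/picard.jar AddOrReplaceReadGroups \
        I=sorted.bam \
        O=sorted_rg.bam \
        RGID=lane1 \
        RGLB=lib1 \
        RGPL=ILLUMINA \
        RGPU=unit1 \
        RGSM=sample1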

Using Mutect2 to call the same variants in multiple tumor samples


Hi,

I want to use M2 to call variants in multiple tumor samples (from the same individual) and have followed the great guide by @shlee here. Let's assume I have one normal control N and three tumor samples T1, T2 and T3. After calling and filtering variants I am left with three VCF files, e.g.

T1-N.filtered.vcf
T2-N.filtered.vcf
T3-N.filtered.vcf

I want to merge these three VCF files so that each variant is genotyped in each tumor sample and the normal sample. The resulting merged VCF file should have a header like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  N    T1    T2    T3

Is there a recommended workflow for this?

Thanks!
Floris

Using JEXL to apply hard filters or select variants based on annotation values


1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.


2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".


4. Filtering on sample/genotype-level properties

You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1).
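A minimal sketch of genotype-level filtering with VariantFiltration, which populates the FT tag described above (file names assumed):

    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V variants.vcf \
        --genotypeFilterExpression "GQ < 5.0" \
        --genotypeFilterName "lowGQ" \
        -o variants.genotype_filtered.vcf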


5. Important caveats

Sensitivity to case and type

You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:

  • Case
    Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type
    The types (i.e. string, integer, non-integer, floating point or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (specifically, a Java exception, e.g. a NumberFormatException for numerical type mismatches).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.
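For instance, giving each criterion its own filter name makes the FILTER field record which test failed; a sketch (file names assumed):

    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V variants.vcf \
        --filterExpression "QD < 2.0" --filterName "lowQD" \
        --filterExpression "ReadPosRankSum < -20.0" --filterName "lowRPRS" \
        --filterExpression "FS > 200.0" --filterName "highFS" \
        -o variants.filtered.vcf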


6. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The best way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

Using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of a JEXL expression that finds all novel variants in the total set that have an allele frequency > 0.25 but not 1, are not filtered, and are non-reference in the 01-0263 sample:

-select '! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' \
-o 01-0263.high_freq_novels.vcf -sn 01-0263

Using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V my.vcf \
        -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V my.vcf \
        -select 'vc.hasAttribute("DB")'

Using VariantContext to access annotations in multiallelic sites

The order of alleles in the VariantContext object is not guaranteed to be the same as in the VCF output, so accessing the AF by an index derived from a scrambled alleles array is dangerous. However! If we have the sample genotypes, there's a workaround:

java -jar GenomeAnalysisTK.jar -T SelectVariants  \
        -R reference.fasta  \
        -V multiallelics.vcf  \
        -select 'vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf

The odd 1.0 is there because otherwise we're dividing two integers, which will always yield 0. The vc.hasGenotypes() is extra error checking. This might be slow for large files, but we could use something like this if performance is a concern:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V multiallelics.vcf \
         -select 'vc.isBiallelic() ? AF > 0.1 : vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf

Where hopefully the ternary expression shortcuts the extra vc calls for all the biallelics.

Using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").getAD().0 > 10'

If you would like to select sites where the alternate allele frequency is greater than 50%, you can use the following expression:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").getAD().1 / vc.getGenotype("NA12878").getDP() > 0.50'


HaplotypeCaller calls incorrect genotype at several sites


Hi,
I found that HaplotypeCaller made heterozygous calls where there is no support for them in the BAM. We used IGV to compare the input BAM and the HaplotypeCaller output BAM. The region shown in the figure confused us. At the top of the figure is the input BAM, while below it is the HaplotypeCaller output BAM. The HaplotypeCaller output GVCF also suggests this site is heterozygous.
It seems to be the same issue as https://gatkforums.broadinstitute.org/gatk/discussion/2319/haplotypecaller-incorrectly-making-heterozygous-calls-again. In that question, your suggested solution was updating GATK. However, we used GATK 3.8 and GATK 4.0.6 and we got the same results.
The command line we used is:
~/software/gatk-4.0.6.0/gatk --java-options "-Xmx30G" HaplotypeCaller -L chr01:9550000-9850000 -ERC GVCF -R -I -O <output_g.vcf> -bamout


Can FastaAlternateReferenceMaker put Ns instead of the reference bases where there is a no-call?


Dear GATK team,

I'm using FastaAlternateReferenceMaker to generate a FASTA file per sample. To do so, I first generated a VCF file per sample and then used FastaAlternateReferenceMaker to generate each FASTA file. Everything went well except that at any position where the genotype of a given sample was a no-call (./. or ./), it put the reference bases, while I would prefer to have Ns instead.

I have looked over a similar post (https://gatkforums.broadinstitute.org/gatk/discussion/5778/can-fastaalternatereferencemaker-put-n-instead-the-reference-base-when-data-is-bad). It basically explained that we could use SelectVariants to select only sites meeting certain criteria and make a VCF that could be fed to FastaAlternateReferenceMaker via -snpmask. However, whether a genotype is a no-call or not is not a criterion we can use with SelectVariants.

So, how can I tell GATK to select only sites with a no-call genotype?
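One hedged possibility, along the lines of the VariantContext JEXL queries described in the article above: select the no-call sites into a mask VCF and feed that to -snpmask. The sample and file names here are hypothetical, and isNoCall() is the htsjdk Genotype method for ./. calls:

    java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V sample.vcf \
        -select 'vc.getGenotype("mySample").isNoCall()' \
        -o sample_nocall_mask.vcf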


Run Mutect2 in gatk-4.0.4.0 with an error `java.lang.NumberFormatException: For input string: "."`


Hi there~

When I ran Mutect2 in gatk-4.0.4.0, something went wrong.

gatk --java-options "-Xmx7g -XX:+UseParallelGC -XX:ParallelGCThreads=8" Mutect2 \
--dont-use-soft-clipped-bases true --max-reads-per-alignment-start 3000 --min-base-quality-score 20 \
-R hg19.fasta \
--germline-resource gnomad.exomes.r2.0.2.sites.vcf.gz \
-I test-1.final.bam  -tumor test-1 \
-O test-1.raw.vcf \
-L agilent.gatk.intervals

It errors with:

[August 8, 2018 10:27:44 AM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 0.34 minutes.

Runtime.totalMemory()=3729260544
java.lang.NumberFormatException: For input string: "."
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.lang.Double.valueOf(Double.java:502)
at htsjdk.variant.variantcontext.CommonInfo.lambda$getAttributeAsDoubleList$2(CommonInfo.java:299)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Collections$2.tryAdvance(Collections.java:4717)
at java.util.Collections$2.forEachRemaining(Collections.java:4725)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsList(CommonInfo.java:273)
at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsDoubleList(CommonInfo.java:293)
at htsjdk.variant.variantcontext.VariantContext.getAttributeAsDoubleList(VariantContext.java:740)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.isActive(Mutect2Engine.java:253)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:159)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:135)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:34)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:290)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:271)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:892)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

How can I fix it?

haplotypecaller-gvcf-gatk4 v13: Premature end of file failures


Hi

Last night I ran haplotypecaller-gvcf-gatk4 Snapshot ID: 13 (latest from the Firecloud workspace help-gatk/Germline-SNPs-Indels-GATK4-b37) with default parameters on 254 WGS samples. One job completed and 254 failed. The pattern of failure appears to be that a random subset of the scattered jobs (n~5 of 53, where the number of failed jobs appears to follow a binomial distribution with rate ~10%) fails with a premature end of file message in stderr:

....
06:28:28.243 INFO ProgressMeter - 1:121139702 16.6 82360 4968.5
06:28:44.984 INFO ProgressMeter - 1:121142643 16.9 82380 4887.4
06:28:55.886 INFO ProgressMeter - 1:121351704 17.0 83290 4888.7
06:29:08.270 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.7598102280000001
06:29:08.270 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 148.740933848
06:29:08.270 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 111.76 sec
06:29:08.270 INFO HaplotypeCaller - Shutting down engine
[August 8, 2018 6:29:28 AM UTC] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 17.65 minutes.
Runtime.totalMemory()=2501902336
htsjdk.samtools.FileTruncatedException: Premature end of file: /SC208230/SC208230.bam
at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:530)
at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:458)
at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:196)
at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:331)
at java.io.DataInputStream.read(DataInputStream.java:149)
at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:418)
...

This looks like NIO failing in roughly 10% of the scattered jobs - and a consequent 99% of gathered jobs. Could there be something wrong in my workflow parameters that would cause this?

Workflow parameters (in json format):

{"HaplotypeCallerGvcf_GATK4.input_bam_index":"${this.WGS_bai_path}","HaplotypeCallerGvcf_GATK4.input_bam":" this.WGS_bam_path","HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list":"${workspace.scattered_calling_intervals_list}","HaplotypeCallerGvcf_GATK4.ref_dict":" workspace.ref_dict","HaplotypeCallerGvcf_GATK4.ref_fasta_index":" workspace.ref_fasta_index","HaplotypeCallerGvcf_GATK4.ref_fasta":"${workspace.ref_fasta_file}"}

scattered_calling_intervals_list = gs://gatk-test-data/intervals/b37_wgs_scattered_calling_intervals.txt:

gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0001_of_50/scattered.interval_list
gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0002_of_50/scattered.interval_list
gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0003_of_50/scattered.interval_list
gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0004_of_50/scattered.interval_list
gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0005_of_50/scattered.interval_list
...
gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0046_of_50/scattered.interval_list
gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0047_of_50/scattered.interval_list
gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0048_of_50/scattered.interval_list
gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0049_of_50/scattered.interval_list
gs://gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0050_of_50/scattered.interval_list

Questions:
a) Are there non-default parameters needed for WGS haplotypecaller-gvcf-gatk4 processing (e.g. bigger disk)?
b) What is the expected rate of NIO failures?
c) Do you need more information (parameters, log files,...) ?
d) Should this be posted in the Firecloud forum rather than the GATK forum?

Thanks,

Chip

P.S. Firecloud submission id: d77a4f70-ae18-431f-83b0-d119b7339dc4, cost $64.30


How to generate RefSeq ROD


Dear team,

I would like to include a RefSeq ROD file in order to get the coverage per gene using the GATK coverage tool (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_coverage_DepthOfCoverage.html). However, it is not really clear to me how I can easily generate such a file, since I cannot find the right documentation. All the links that should point to this information seem to be (incorrectly?) redirected to the main GATK homepage. For example, http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_utils_codecs_refseq_RefSeqCodec.html has a link pointing to http://www.broadinstitute.org/gsa/wiki/index.php/RefSeq, which is redirected to http://www.broadinstitute.org/gatk/.
Can someone point me to the correct docs? Other resources I found also point to the same wiki, which I can't find at the moment.

Kind regards,
JJ

Increasing Fastaalternatereferencemaker speed


Hello,
I'm using GATK 3.6. It takes about 20 minutes to build a new genome from a VCF file and a reference genome with FastaAlternateReferenceMaker. I wanted to make this faster by filtering my VCF file so it only contains transcript regions; I used VCFtools with --keep-INFO-all.
However I now have this error.

Done. There were 1 WARN messages, the first 1 are repeated below.
WARN 17:48:03,374 IndexDictionaryUtils - Track /home/gnojoomi/Project_1/RNAseq_pipeline/files_made/VCF_intervals_with_transcripts.recode.vcf doesn't have a sequence dictionary built in, skipping dictionary validation

I was wondering whether this is an important warning. More importantly, it still takes around 20 minutes, and I'm hoping there is a way to make this faster for my lab. If it would help, I can ask my school to update our cluster to the newest GATK version.
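
One thing that may speed this up considerably (a hedged sketch, assuming GATK 3.x syntax; the interval and output file names are placeholders) is to also restrict the tool's traversal to the transcript regions with -L, so it only walks those intervals instead of the whole genome:

java -jar GenomeAnalysisTK.jar \
    -T FastaAlternateReferenceMaker \
    -R reference.fasta \
    -V VCF_intervals_with_transcripts.recode.vcf \
    -L transcripts.intervals \
    -o alternate_genome.fasta

Note that with -L the output FASTA will contain only the requested intervals rather than the full genome, so this helps only if the transcript sequences are all you need downstream.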

Base Quality Score Recalibration (BQSR)


BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to anyone, especially Spanish-speaking users, who get tripped up by the similarity of these names.


Contents

  1. Overview
  2. Base recalibration procedure details
  3. Important factors for successful recalibration
  4. Examples of pre- and post-recalibration metrics
  5. Recalibration report

1. Overview

It's all about the base, 'bout the base (quality scores)

Base quality scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion base calls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases -- which is a lot of bad bases. The quality score each base call gets is determined through some dark magic jealously guarded by the manufacturer of the sequencing machines.
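
For reference, the Phred scale relates a quality score Q to an error probability P by Q = -10 * log10(P). So Q20 corresponds to P = 10^(-20/10) = 0.01 (1 error in 100 calls), and Q30 to P = 0.001 (1 in 1,000).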

Why does it matter? Because our short variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a base call that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally our impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

This procedure can be applied to BAM files containing data from any sequencing platform that outputs base quality scores on the expected scale. We have run it ourselves on data from several generations of Illumina, SOLiD, 454, Complete Genomics, and Pacific Biosciences sequencers.

That sounds great! How does it work?

The base recalibration process involves two key steps: first the BaseRecalibrator tool builds a model of covariation based on the input data and a set of known variants, producing a recalibration file; then the ApplyBQSR tool adjusts the base quality scores in the data based on the model, producing a new BAM file. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.
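
For reference, here is what those two steps look like on the command line. This is a minimal sketch using the GATK4 tool invocations; the file names are placeholders for your own data.

# Step 1: build the recalibration model from the input BAM and a known-sites resource
gatk BaseRecalibrator \
    -I input.bam \
    -R reference.fasta \
    --known-sites dbsnp.vcf.gz \
    -O recal_data.table

# Step 2: apply the model to produce a recalibrated BAM
gatk ApplyBQSR \
    -I input.bam \
    -R reference.fasta \
    --bqsr-recal-file recal_data.table \
    -O recalibrated.bam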

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
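
That optional step might look like the following sketch (placeholder file names again): build a second recalibration table from the recalibrated BAM, then compare the two tables with AnalyzeCovariates.

# Build an "after" model on the recalibrated data
gatk BaseRecalibrator \
    -I recalibrated.bam \
    -R reference.fasta \
    --known-sites dbsnp.vcf.gz \
    -O recal_data.after.table

# Generate before/after plots for quality control
gatk AnalyzeCovariates \
    -before recal_data.table \
    -after recal_data.after.table \
    -plots recalibration_plots.pdf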


2. Base recalibration procedure details

BaseRecalibrator builds the model

To build the recalibration model, this first tool goes through all of the reads in the input BAM file and tabulates data about the following features of the bases:

  • read group the read belongs to
  • quality score reported by the machine
  • machine cycle producing this base (Nth cycle = Nth base from the start of the read)
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to the known variants resource (typically dbSNP). This information is output to a recalibration file in GATKReport format.

Note that the recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting: the correction has only a minor impact on data sets with billions of bases, but it is critical for avoiding overconfidence in rare bins in sparse data.
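
As an invented worked example: a bin with 1,000 observed bases and 9 mismatches gets an empirical error rate of (9 + 1) / (1000 + 2) ≈ 0.01, i.e. roughly Q20, while a nearly empty bin with 10 bases and 0 mismatches gets (0 + 1) / (10 + 2) ≈ 0.083, i.e. roughly Q11, rather than an unjustifiably high score.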

ApplyBQSR adjusts the scores

This second tool goes through all the reads again, using the recalibration file to adjust each base's score based on which bins it falls in. So effectively the new quality score is (a worked example follows this list):

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effect
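
To make the additive adjustment concrete, here is an example with invented delta values: suppose a base was reported at Q30, the read group as a whole turned out to be overconfident by 2 (global shift of -2), the Q30 bin itself was overconfident by another 1 (shift of -1), and the cycle and dinucleotide covariates contribute +0.5 and -0.5 respectively. The recalibrated score would then be 30 - 2 - 1 + 0.5 - 0.5 = 27. The real shifts, of course, come from the recalibration table.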

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as variant calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.


3. Important factors for successful recalibration

Read groups

The recalibration system is read-group aware, meaning it uses @RG tags to partition the data by read group. This allows it to perform the recalibration per read group, which reflects which library a read belongs to and what lane it was sequenced in on the flowcell. We know that systematic biases can occur in one lane but not the other, or one library but not the other, so being able to recalibrate within each unit of sequence data makes the modeling process more accurate. As a corollary, that means it's okay to run BQSR on BAM files with multiple read groups. However, please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.

Amount of data

A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. This procedure will not work well on a small number of aligned reads. We usually expect to see more than 100M bases per read group; as a rule of thumb, larger numbers will work better.

No excuses

You should almost always perform recalibration on your sequencing data. In human data, given the exhaustive databases of variation we have available, almost all of the remaining mismatches -- even in cancer -- will be errors, so it's super easy to ascertain an accurate error model for your data, which is essential for downstream analysis. For non-human data it can be a little bit more work since you may need to bootstrap your own set of variants if there are no such resources already available for your organism, but it's worth it.

Here's how you would bootstrap a set of known variants (a command-line sketch follows the list):

  • First do an initial round of variant calling on your original, unrecalibrated data.
  • Then take the variants that you have the highest confidence in and use that set as the database of known variants by feeding it as a VCF file to the BaseRecalibrator.
  • Finally, do a real round of variant calling with the recalibrated data. These steps could be repeated several times until convergence.
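
In GATK4 terms, one round of that loop might look like the sketch below. The tool names are real, but the file names and the QD threshold are placeholder assumptions; the right confidence criteria depend on your data.

# 1. Initial calling on the unrecalibrated data
gatk HaplotypeCaller -R reference.fasta -I unrecalibrated.bam -O raw_variants.vcf.gz

# 2. Keep only the highest-confidence calls to serve as a stand-in known-sites resource
gatk VariantFiltration -R reference.fasta -V raw_variants.vcf.gz \
    --filter-name "lowconf" --filter-expression "QD < 20.0" \
    -O flagged_variants.vcf.gz
gatk SelectVariants -R reference.fasta -V flagged_variants.vcf.gz \
    --exclude-filtered -O confident_variants.vcf.gz

# 3. Recalibrate using the bootstrapped known sites
gatk BaseRecalibrator -I unrecalibrated.bam -R reference.fasta \
    --known-sites confident_variants.vcf.gz -O recal_data.table
gatk ApplyBQSR -I unrecalibrated.bam -R reference.fasta \
    --bqsr-recal-file recal_data.table -O recalibrated.bam

# 4. Call variants again on the recalibrated data, repeating until convergence
gatk HaplotypeCaller -R reference.fasta -I recalibrated.bam -O recalibrated_variants.vcf.gz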

The main scenario where you really might need to skip BQSR is when you have too little data (some small gene panels have that problem), or when you're working with a really weird organism that displays insane amounts of variation.


4. Examples of pre- and post-recalibration metrics

This shows recalibration results from a lane sequenced at the Broad by an Illumina GA-II in February 2010. This is admittedly not very recent but the results are typical of what we still see on some more recent runs, even if the overall quality of sequencing has improved. You can see there is a significant improvement in the accuracy of the base quality scores after applying the recalibration procedure. Note that the plots shown below are not the same as the plots that are produced by the AnalyzeCovariates tool.

[Four plots comparing reported vs. empirical base quality scores, before and after recalibration]


5. Recalibration report

The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSR for this dataset.

#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support for quantizing base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSR, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other GATK tool if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization. You can override this with the engine argument -qq: with -qq 0 you don't quantize qualities, and with -qq N you recalculate the quantization bins using N bins.
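
In GATK4, the analogous control lives on ApplyBQSR as the --quantize-quals argument. A minimal sketch with placeholder file names:

gatk ApplyBQSR \
    -I input.bam \
    -R reference.fasta \
    --bqsr-recal-file recal_data.table \
    --quantize-quals 6 \
    -O recalibrated.quantized.bam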

#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions, and deletions.

#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions, and deletions.

#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.

#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

java.lang.NoClassDefFoundError: org/apache/logging/log4j/core/appender/AbstractAppender


Hello!

When I try running gatk 4.0.7.0 on spark 2.2 Microsoft Azure hdinsight cluster (using data from here https://gatkforums.broadinstitute.org/gatk/discussion/6484/how-to-generate-an-unmapped-bam-from-fastq-or-aligned-bam#optionA) I get "java.lang.NoClassDefFoundError: org/apache/logging/log4j/core/appender/AbstractAppender" error like the one below.

The thing is that gatk-package-4.0.7.0-spark.jar does contain that class, as can be verified with $ jar tvf gatk-package-4.0.7.0-spark.jar,
but nonetheless it seems like it is not loaded correctly.
What would be a possible fix for this?

$./gatk PrintReadsSpark -I ../6484_snippet.bam -O ../output.bam -- --spark-runner SPARK --spark-master spark://10.0.0.21:7077
Using GATK jar /root/gatk-4.0.7.0/gatk-package-4.0.7.0-spark.jar
Running:
/usr/hdp/current/spark2-client/bin/spark-submit --master spark://10.0.0.21:7077 --conf spark.driver.userClassPathFirst=false --conf spark.io.compression.codec=lzf --conf spark.driver.maxResultSize=0 --conf spark.executor.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 --conf spark.driver.extraJavaOptions=-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=false -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 --conf spark.kryoserializer.buffer.max=512m --conf spark.yarn.executor.memoryOverhead=600 /root/gatk-4.0.7.0/gatk-package-4.0.7.0-spark.jar PrintReadsSpark -I ../6484_snippet.bam -O ../output.bam --spark-master spark://10.0.0.21:7077
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.3.40-13/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.3.40-13/spark_llap/spark-llap-assembly-1.0.0.2.6.3.40-13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
java.lang.NoClassDefFoundError: org/apache/logging/log4j/core/appender/AbstractAppender
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
...


FilterMutectCalls did not give any records with 'PASS' in the filter field


Hi, I used FilterMutectCalls to filter several VCF files one by one:

java -jar ~/gatk-4.0.5.1/gatk.jar FilterMutectCalls -V original.vcf.gz -O filtered.vcf.gz

And in every output file filtered.vcf.gz, there is no record with 'PASS' in the filter field.

I ran other tools on the same bam files, and they all output variants that passed their corresponding filters.

I used the default parameters when running FilterMutectCalls. It is strange that no variants in filtered.vcf have 'PASS' in the filter field. Can you please help me figure out how to fix this? Thank you!

How to get a variant's abundance from calling file?


I'm wondering how to get a mutation's abundance from a variant calling result file. I noticed there is an 'AF' field in the VCF, but the value is always 1 or 0.5, which differs from the ratio of ref/alt reads in the BAM file. I also found that the 'AD' field in the FORMAT column records the ref and alt read counts, but it too can differ from the corresponding read counts in the BAM file. So which field should I use to get a variant's abundance? Thank you!
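
For what it's worth, the read-level fraction is usually derived from AD rather than AF. As an invented example: with AD=30,10 (30 reference reads, 10 alternate reads), the alternate-allele fraction is 10 / (30 + 10) = 0.25. The AF values of 1 and 0.5 in a germline VCF describe the called genotype (homozygous or heterozygous), not the read counts.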

ApplyVQSR error in indels mode


Hi!
gatk-4.0.7.0. I successfully recalibrated SNP variants for my exomes, then got an error with ApplyVQSR for indels (VariantRecalibrator for indels ran successfully):

A USER ERROR has occurred: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC Unknown @ chr1:22660902-22660903 Q433.90 of type=MIXED alleles=[GC*, *, CC] attr={AC=[5, 1], AF=[0.250, 0.050], AN=20, BaseQRankSum=-6.740e-01, ClippingRankSum=0.00, DP=63, ExcessHet=6.1542, FS=44.773, InbreedingCoeff=-0.1502, MLEAC=[4, 1], MLEAF=[0.200, 0.050], MQ=74.07, MQRankSum=0.00, QD=22.84, ReadPosRankSum=-1.260e-01, SOR=5.130} GT=GT:AD:DP:GQ:PGT:PID:PL 0/2:2,0,2:4:78:.:.:78,84,168,0,84,78 ...

VariantRecalibrator and ApplyVQSR command lines:

gatk VariantRecalibrator \
-R hg38/Homo_sapiens_assembly38.fasta -V process/l-2.vcf -AS \
--max-gaussians 4 \
--trust-all-polymorphic \
--resource mills,known=false,training=true,truth=true,prior=12.0:hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:hg38/dbsnp_146.hg38.vcf \
--resource axiomPoly,known=false,training=true,truth=false,prior=10:hg38/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf \
-an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff -mode INDEL \
-O process/l-3.recal --tranches-file process/l-3.tranches --rscript-file process/l-3.plots.R

gatk ApplyVQSR -R hg38/Homo_sapiens_assembly38.fasta -V process/l-2.vcf -O process/l-4.vcf -ts-filter-level 99.0 \
--tranches-file process/l-3.tranches --recal-file process/l-3.recal -mode INDEL &>process/log-applyVQSR-indels

The problem line in l-2.vcf; it is indeed absent from the l-3.recal file:

chr1 22660902 . GC CC,* 433.90 . AC=5,1;AF=0.250,0.050;AN=20;BaseQRankSum=-6.740e-01;ClippingRankSum=0.00;DP=63;ExcessHet=6.1542;FS=44.773;InbreedingCoeff=-0.1502;MLEAC=4,1;MLEAF=0.200,0.050;MQ=74.07;MQRankSum=0.00;QD=22.84;ReadPosRankSum=-1.260e-01;SOR=5.130 GT:AD:DP:GQ:PGT:PID:PL 0/2:2,0,2:4:78:.:.:78,84,168,0,84,78 0/0:11,0,0:11:33:.:.:0,33,446,33,446,446 0/0:14,0,0:14:0:.:.:0,0,76,0,76,76 0/0:1,0,0:2:3:0|1:22660893_GC_G:0,3,45,3,45,45 1/1:0,1,0:1:3:1|1:22660893_GC_G:45,3,0,45,3,45 0/1:2,3,0:5:99:0|1:22660902_G_C:120,0,108,126,117,243 0/0:13,0,0:13:0:.:.:0,0,75,0,75,75 0/0:2,0,0:2:6:0|1:22660893_GC_G:0,6,90,6,90,90 0/1:2,3,0:5:99:0|1:22660893_GC_G:109,0,125,115,134,248 0/1:3,2,0:5:75:0|1:22660893_GC_G:75,0,176,84,182,266

I am new to bioinformatics, so this may be some obvious thing...
