
GATK on Alibaba Cloud


Alibaba Cloud, the largest cloud provider in China, has developed open-source command-line utilities that leverage Cromwell to enable execution of analysis pipelines on the Alibaba Cloud platform.

Alibaba Cloud provides pre-configured environments and pipeline templates to make it easy to set up your pipeline, modify your input files, and submit/monitor jobs (via the BatchCompute CLI or a console).

For more information, check out the Getting Started with Alibaba Guide and the full Alibaba Integration Guide in the Cromwell documentation.


GATK on Intel BIGstack for on-premises infrastructure


Broad-Intel Genomics Stack (BIGstack) is an end-to-end, optimized solution on Intel hardware for analyzing genomic data. It provides an efficient way to run pre-packaged, optimized workflows, including the GATK Best Practices workflows.

BIGstack’s software stack includes two components developed by Intel for efficient and scalable execution of genomics workflows: GenomicsDB and the Genomics Kernel Library (GKL). GenomicsDB is a data store for genomic variants. It is based on the TileDB array storage manager, a system for efficiently storing, querying, and accessing sparse and dense matrix/array data. GKL is a collection of common, compute-intensive kernels used in genomic analysis tools. Intel and The Broad Institute worked together to identify these kernels in GATK, and experts across Intel optimized the kernels for Intel architecture.

BIGstack also includes support for running other open-source libraries of genomic analysis tools: Picard, BWA, and Samtools. These tools perform a wide variety of tasks, from sorting and fixing tags to generating recalibration models. Using Workflow Description Language (WDL) files, users specify which files to analyze, which tools to run, and the order in which the execution engine (Cromwell) performs the tasks.

For more information, check out www.intel.com/broadinstitute and www.intel.com/selectsolutions.

GATK on FireCloud


FireCloud is an open platform for secure and scalable analysis on the cloud.

More concretely, it's a web-based portal that is provided as a freely accessible service by the Broad Institute's Data Sciences Platform, where GATK itself is also developed. FireCloud provides both GUI (point-and-click) and API access to a persistent Cromwell execution server that manages submissions to the Google Pipelines API. In addition to the core pipeline execution service, the FireCloud platform also includes functionality for data management, a data library of published datasets (including TCGA data) and a method repository for managing and sharing workflows.

The platform as a whole is designed to empower analysts, tool developers and production managers to perform large-scale analysis, engage in data curation, and store or publish results without having to worry about the underlying computational infrastructure.


All the Best Practices workflows, ready to run

As part of our effort to make it easier for everyone to run GATK regardless of their personal level of (dis)comfort with the intricacies of computational infrastructure, we make all of our Best Practices workflows (plus various additional utilities) available in FireCloud. This takes the form of workspaces where the workflows are preconfigured for common use cases, along with example data that is suitable for testing and benchmarking, both at small scale and at full scale. So it should just be a matter of a few clicks to run any pipeline you like on the preloaded example datasets -- or, with a few more (simple) steps, to run them on your own data. All this without ever touching a command line, unless you're the CLI-over-GUI type, in which case you're welcome to use the FireCloud APIs via Swagger or the FISS Python bindings to do all this programmatically.

We hope this will enable researchers to spend less time figuring out how to run GATK Best Practices and more time doing interesting science with the results. We also believe this will boost portability and reproducibility in genomic analysis.


Free Credits Program

We understand that moving your analysis to the cloud is a big cultural and logistical shift, and there is a clear need to make it possible to try out such a new option without having to commit financially. To address that need, we've teamed up with Google Cloud to give away free credits for running GATK4 pipelines on FireCloud, our cloud-based analysis portal. Learn more about this free credits program in the FireCloud Free Credits documentation.

Germline short variant discovery (SNPs + Indels)


Purpose

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.



Reference Implementations

Pipeline | Summary | Notes | Github | FireCloud
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | :) | TBD
Prod* germline short variant joint genotyping | GVCFs to cohort VCF | optimized for GCP | :) | TBD
Generic germline short variant per-sample calling | analysis-ready BAM to GVCF | universal | :) | TBD
Generic germline short variant joint genotyping | GVCFs to cohort VCF | universal | :) | TBD
Intel germline short variant per-sample calling | uBAM to GVCF | Intel optimized for local architectures | :) | TBD

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.


Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.


Main steps

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.
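For illustration, a minimal per-sample invocation in GVCF mode might look like the sketch below. This is not the full production command; the GATK4 launcher is assumed, and the reference, BAM and output paths are placeholders.

# per-sample calling with reference confidence output (GVCF)
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.analysis_ready.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF

The -ERC GVCF argument is what switches the tool from emitting a regular VCF to emitting a per-sample GVCF suitable for the joint genotyping steps described below.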

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly.

Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.
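A hedged sketch of the consolidation step with GenomicsDBImport is shown below (GATK4 syntax; the sample GVCFs, workspace path and interval are placeholders, and the tool requires that you specify the intervals to import):

# consolidate per-sample GVCFs into a GenomicsDB workspace for chromosome 20
gatk GenomicsDBImport \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -V sample3.g.vcf.gz \
    --genomicsdb-workspace-path cohort_chr20_gdb \
    -L 20

In practice this step is typically scattered over intervals (e.g. per chromosome), producing one workspace per interval.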

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.
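A minimal joint genotyping command might look like the following sketch (GATK4 syntax; the gendb:// prefix points at a GenomicsDB workspace created in the previous step, and all paths are placeholders):

# joint genotyping across the cohort from the GenomicsDB workspace
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V gendb://cohort_chr20_gdb \
    -O cohort_chr20.vcf.gz

The same tool also accepts a combined GVCF produced by CombineGVCFs as the -V input.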

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyRecalibration

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.
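To make the two steps concrete, here is a hedged sketch for SNP-mode recalibration. It uses GATK4-style tool names (ApplyVQSR rather than the older ApplyRecalibration), and exact argument spellings vary slightly between GATK versions; the resource files, annotations and the 99.0 sensitivity level are illustrative and should be taken from the current Best Practices recommendations.

# step 1: build the model and produce recalibration and tranches files
gatk VariantRecalibrator \
    -R reference.fasta \
    -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
    --resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -O cohort_snps.recal \
    --tranches-file cohort_snps.tranches

# step 2: apply the model at the chosen truth sensitivity threshold
gatk ApplyVQSR \
    -R reference.fasta \
    -V cohort.vcf.gz \
    --recal-file cohort_snps.recal \
    --tranches-file cohort_snps.tranches \
    --truth-sensitivity-filter-level 99.0 \
    -mode SNP \
    -O cohort_snps.recalibrated.vcf.gz

Indels are recalibrated in a separate pass with -mode INDEL and indel-appropriate resources.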

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.


Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which requires access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described below, enabling incremental growth of cohorts as well as scaling to large cohort sizes.

Somatic short variant discovery (SNVs + Indels)


Purpose

Identify somatic short variants (SNVs + Indels) in a tumor-normal sample pair from an individual. Requires an appropriate Panel of Normals (PON).



Reference Implementations

Pipeline | Summary | Notes | Github | FireCloud
Somatic short variants tumor-normal pair | T-N BAMs to VCF | universal | :) | TBD
Somatic short variants PON creation | Normal BAMs to PON | universal | placeholder | TBD

A brand new version of these workflows is about to be released and will be made available within the next few days, along with the relevant documentation.

Somatic copy number variant discovery (CNVs)


Purpose

Identify somatic copy number variants (CNVs) in a case sample. Requires an appropriate Panel of Normals (PON).



Reference Implementations

Pipeline | Summary | Notes | Github | FireCloud
Somatic CNV case sample | Case BAM to CNV | universal | placeholder | TBD
Somatic CNV PON creation | Normal BAMs to PON | universal | placeholder | TBD

A brand new version of these workflows is about to be released and will be made available within the next few days, along with the relevant documentation.

Germline copy number variant discovery (CNVs)


Purpose

Identify germline copy number variants.


Diagram is not available


Reference implementation is not available


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.

What is the FireCloud Free Credits Program?


The FireCloud Free Credits Program is an opportunity for you to use FireCloud, the Broad's cloud-based analysis portal, for trying out the new GATK4 Best Practices pipelines at no cost to you. All the main GATK pipelines have been preconfigured into FireCloud workspaces according to our Best Practices, so it'll be just a matter of a few clicks to run any pipeline you like on the preloaded example datasets -- or, with a few more (simple) steps, to run them on your own data.

Our friends at Google Cloud Platform are generously footing the bill for this credits program. We at the Broad Institute are not getting any share of any revenue that may be generated by GCP as a result of this program. By that I mean that if you continue using Google Cloud for your work on your own dime after you have exhausted your credits, we will not get a cut of the money you pay to Google.

For us (the GATK team), the FireCloud portal and cloud-based platforms in general present an unparalleled opportunity to make our tools available in a format that is much easier to support, since it removes a lot of the complexity involved with dealing with lots of different local infrastructures. The more people use this kind of platform to run our pipelines, the easier it becomes for us to help ensure that the pipelines are running smoothly and correctly for everyone. We are very aware that to many of you, moving your work to the cloud is a big logistical and cultural shift, so we hope that this program will grease the wheels and make it easier for you to try the cloud (and GATK4 itself) on for size. If you find it doesn't suit you, you'll still be able to go back to the traditional method of downloading the software and deploying it on your own infrastructure.

There is no obligation to continue using FireCloud after your free credits expire, and you will be presented with options to save any work you got done during that time.

For sign-up information and FAQs, please see the FireCloud Free Credits Program documentation on the FireCloud website.


Bonus FAQ: Where can I learn how to use FireCloud?

Check out the Quick Start Guide and this video of how to run an analysis in a GATK4 Featured Workspace.

The FireCloud User Guide also includes a Tutorials section, FAQs, and assorted documentation that should prove helpful to you as you get started with the platform. You can also ask questions and leave comments in our support forum, which is run by the same team as the GATK forum.


Introduction to the GATK Best Practices


This document provides important context about how the GATK Best Practices are developed and what their limitations are.


Contents

  1. What are the GATK Best Practices?
  2. Analysis phases
  3. Experimental designs
  4. Workflow scripts provided as reference implementations
  5. Scope and limitations
  6. What is not GATK Best Practices?
  7. Beware legacy scripts

1. What are the GATK Best Practices?

Reads-to-variants workflows used at the Broad Institute.

The GATK Best Practices provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. There are several different GATK Best Practices workflows tailored to particular applications depending on the type of variation of interest and the technology employed. The Best Practices documentation attempts to describe in detail the key principles of the processing and analysis steps required to go from raw reads coming off the sequencing machine, all the way to an appropriately filtered variant callset that can be used in downstream analyses. Wherever we can, we try to provide guidance regarding experimental design, quality control (QC) and pipeline implementation options, but please understand that those are dependent on many factors, including the sequencing technology and hardware infrastructure at your disposal, so you may need to adapt our recommendations to your specific situation.


2. Analysis phases

Although the Best Practices workflows are each tailored to a particular application (type of variation and experimental design), overall they follow similar patterns, typically comprising two or three analysis phases depending on the application.

(1) Data Pre-processing is the first phase in all cases, and involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.

(2) Variant Discovery proceeds from analysis-ready BAM files and produces variant calls. This involves identifying genomic variation in one or more individuals and applying filtering methods appropriate to the experimental design. The output is typically in VCF format although some classes of variants (such as CNVs) are difficult to represent in VCF and may therefore be represented in other structured text-based formats.

(3) Depending on the application, additional steps such as filtering and annotation may be required to produce a callset ready for downstream genetic analysis. This typically involves using resources of known variation, truthsets and other metadata to assess and improve the accuracy of the results as well as attach additional information.


3. Experimental designs

Whole genomes. Exomes. Gene panels. RNAseq

These are the major experimental designs we support explicitly. Some of our workflows are specific to a single experimental design, while others can be adapted to other designs with some modifications. This is indicated in the workflow documentation where applicable. Note that any workflow tagged as applicable to whole genome sequencing (WGS) as well as other designs is presented by default in the form suitable for whole genomes, and must be modified as recommended in the workflow documentation to apply to the other designs. Exomes, gene panels and other targeted sequencing experiments generally share the same workflow for a given variant type, with only minor modifications.


4. Workflow scripts provided as reference implementations

Less guesswork, more reproducibility.

It's one thing to know what steps should be run (which is what the Best Practices tell you) and quite another to set up a pipeline that does it in practice. To help you cross this important gap, we provide the scripts that we use in our own pipelines as reference implementations. The scripts are written in WDL, a workflow description language designed specifically to be readable and writable by humans without an advanced programming background. WDL scripts can be run on Cromwell, an open-source execution engine that can connect to a variety of different platforms, whether local or cloud-based, through pluggable backends. See the Pipelining Options section for more on the Cromwell + WDL pipelining solution.

We also make all the GATK Best Practices workflows available in ready-to-run form on FireCloud, our cloud-based analysis portal, which you can read more about here.

Note that some of the production scripts we provide are specifically optimized to run on the Google Cloud Platform.
Wherever possible we also provide "generic" versions that are not platform-specific.
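If you want to try one of these reference implementations locally, a hedged sketch of launching a WDL script with Cromwell in single-workflow ("run") mode is shown below; the jar name, WDL file and inputs JSON are placeholders for whatever you have downloaded:

# run a single workflow locally with Cromwell
java -jar cromwell.jar run my_pipeline.wdl --inputs my_pipeline.inputs.json

For production use, Cromwell is more commonly run in server mode with a backend configured for your local cluster or cloud platform.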


5. Scope and limitations

We can't test for every possible use case or technology.

We develop and validate these workflows in collaboration with many investigators within the Broad Institute's network of affiliated institutions. They are deployed at scale in the Broad's production pipelines -- a very large scale indeed. So as a general rule, the command-line arguments and parameters given in the documentation are meant to be broadly applicable (so to speak). However, our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data, organisms or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values.

In addition, several key steps make use of external resources, including validated databases of known variants. If there are few or no such resources available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware that some issues are currently still without a good solution. On the bright side, if you solve them for your field, you will be a hero to a generation of researchers and your citation index will go through the roof.


6. What is not GATK Best Practices?

Lots of workflows that people call GATK Best Practices diverge significantly from our recommendations.

Not that they're necessarily bad. Sometimes it makes perfect sense to diverge from our standard Best Practices in order to address a problem or use case that they're not designed to handle. The canonical Best Practices workflows (as run in production at the Broad) are designed specifically for human genome research and are optimized for the instrumentation (overwhelmingly Illumina) and needs of the Broad Institute sequencing facility. They can be adapted for analysis of non-human organisms of all kinds, including non-diploids, and of different data types, with varying degrees of effort depending on how divergent the use case and data type are. However, any workflow that has been significantly adapted or customized, whether for performance reasons or to fit a use case that we do not explicitly cover, should not be called "GATK Best Practices", which is a term that carries specific meaning. The correct way to refer to such workflows is "based on" or "adapted from" GATK Best Practices. When in doubt about whether a particular customization constitutes a significant divergence, feel free to ask us in the forum.


7. Beware legacy scripts

Trust, but verify.

If someone hands you a script and tells you "this runs the GATK Best Practices", start by asking what version of GATK it uses, when it was written, and what key steps it includes. Both our software and our usage recommendations evolve in step with the rapid pace of technological and methodological innovation in the field of genomics, so what was Best Practice last year (let alone in 2010) may no longer be applicable. And even if all the steps seem to be in accordance with our docs (same tools in the same order), you should still check every single parameter in the commands. If anything is unfamiliar to you, you should find out what it does. If you can't find it in the documentation, ask us in the forum. It's one or two hours of your life that can save you days of troubleshooting at the tail end of the pipeline, so please protect yourself by being thorough.

RNAseq short variant discovery (SNPs + Indels)


Purpose

Identify short variants (SNPs and Indels) in RNAseq data.


Diagram is not available


Reference Implementations

Pipeline | Summary | Notes | Github | FireCloud
RNAseq short variant per-sample calling | BAM to VCF | universal (expected) | :) | TBD

Expected input

This workflow is designed to operate on one sample at a time; joint calling of RNAseq data is not supported.


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.

Data pre-processing for variant discovery


Purpose

This is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.



Reference Implementations

Pipeline | Summary | Notes | Github | FireCloud
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | :) | TBD
Generic data pre-processing | uBAM to analysis-ready BAM | universal | :) | TBD

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.


Expected input

This workflow is designed to operate on individual samples, for which the data is initially organized in distinct subsets called readgroups. These correspond to the intersection of libraries (the DNA product extracted from biological samples and prepared for sequencing, which includes fragmenting and tagging with identifying barcodes) and lanes (units of physical separation on the DNA sequencing chips) generated through multiplexing (the process of mixing multiple libraries and sequencing them on multiple lanes, for risk and artifact mitigation purposes).

Our reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM.
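For example, a hedged sketch of converting a pair of FASTQ files to a per-readgroup uBAM with Picard's FastqToSam is shown below; all file names and read group values are placeholders:

# convert paired FASTQs for one readgroup into an unmapped BAM
java -jar picard.jar FastqToSam \
    FASTQ=sample1_rg1_R1.fastq.gz \
    FASTQ2=sample1_rg1_R2.fastq.gz \
    OUTPUT=sample1_rg1.unmapped.bam \
    READ_GROUP_NAME=rg1 \
    SAMPLE_NAME=sample1 \
    LIBRARY_NAME=lib1 \
    PLATFORM=illumina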


Main steps

We begin by mapping the sequence reads to the reference genome to produce a file in SAM/BAM format sorted by coordinate. Next, we mark duplicates to mitigate biases introduced by data generation steps such as PCR amplification. Finally, we recalibrate the base quality scores, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read.

Map to Reference

Tools involved: BWA, MergeBamAlignment

This first processing step is performed per-read group and consists of mapping each individual read pair to the reference genome, which is a synthetic single-stranded representation of common genome sequence that is intended to provide a common coordinate framework for all genomic analysis. Because the mapping algorithm processes each read pair in isolation, this can be massively parallelized to increase throughput as desired.
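A hedged sketch of the per-readgroup mapping step, going from uBAM through BWA-MEM and back to an aligned BAM with MergeBamAlignment, might look like the following (file names, thread count and reference are placeholders; the production WDL adds further options):

# uBAM -> interleaved FASTQ -> BWA-MEM alignment -> merge back with uBAM metadata
java -jar picard.jar SamToFastq \
    INPUT=sample1_rg1.unmapped.bam \
    FASTQ=/dev/stdout \
    INTERLEAVE=true | \
bwa mem -M -p -t 8 reference.fasta /dev/stdin | \
java -jar picard.jar MergeBamAlignment \
    UNMAPPED_BAM=sample1_rg1.unmapped.bam \
    ALIGNED_BAM=/dev/stdin \
    OUTPUT=sample1_rg1.mapped.bam \
    REFERENCE_SEQUENCE=reference.fasta

MergeBamAlignment carries the read group and other metadata from the uBAM over to the aligned reads.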

Mark Duplicates

Tools involved: MarkDuplicates, SortSam

This second processing step is performed per-sample and consists of identifying read pairs that are likely to have originated from duplicates of the same original DNA fragment through some artifactual process. These are considered to be non-independent observations, so the program tags all but one of the read pairs within each set of duplicates, causing them to be ignored by default during the variant discovery process. This step constitutes a major bottleneck since it involves making a large number of comparisons between all the read pairs belonging to the sample, across all of its readgroups. It is followed by a sorting operation (not explicitly shown in the workflow diagram) that also constitutes a performance bottleneck, since it also operates across all reads belonging to the sample. Both algorithms continue to be the target of optimization efforts to reduce their impact on latency.
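A minimal sketch of this step with Picard, run once per sample over all of its readgroup-level BAMs (file names are placeholders), is shown below:

# flag duplicate read pairs across all readgroups of the sample
java -jar picard.jar MarkDuplicates \
    INPUT=sample1_rg1.mapped.bam \
    INPUT=sample1_rg2.mapped.bam \
    OUTPUT=sample1.markdup.bam \
    METRICS_FILE=sample1.duplicate_metrics.txt

# coordinate-sort the duplicate-marked BAM and create an index
java -jar picard.jar SortSam \
    INPUT=sample1.markdup.bam \
    OUTPUT=sample1.markdup.sorted.bam \
    SORT_ORDER=coordinate \
    CREATE_INDEX=true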

Base (Quality Score) Recalibration

Tools involved: BaseRecalibrator, ApplyBQSR, AnalyzeCovariates (optional)

This third processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or from instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes, but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized, but it is computationally trivial and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.
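As a concrete illustration, a hedged sketch of the two BQSR commands in GATK4 is shown below; the known-sites resources and all paths are placeholders, and the per-interval scattering described above is omitted for brevity:

# step 1: collect covariate statistics and build the recalibration model
gatk BaseRecalibrator \
    -R reference.fasta \
    -I sample1.markdup.sorted.bam \
    --known-sites dbsnp.vcf.gz \
    --known-sites mills_and_1000G_gold_standard.indels.vcf.gz \
    -O sample1.recal_data.table

# step 2: apply the model to produce the analysis-ready BAM
gatk ApplyBQSR \
    -R reference.fasta \
    -I sample1.markdup.sorted.bam \
    --bqsr-recal-file sample1.recal_data.table \
    -O sample1.analysis_ready.bam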

VQSR and VariantAnnotator on Samtools VCFs


Hi everyone!
My goal is to run VQSR on VCFs generated with samtools mpileup.
According to the GATK Best Practices, I first have to run VariantAnnotator on each of my VCFs in order to do that.

Here are the annotation options I included in the command line:

--annotation QualByDepth
--annotation RMSMappingQuality
--annotation MappingQualityRankSumTest
--annotation ReadPosRankSumTest
--annotation FisherStrand
--annotation StrandOddsRatio
--annotation DepthPerSampleHC
--annotation InbreedingCoeff

Unfortunately, it returns these warnings:

WARN 17:26:07,297 StrandBiasTest - No StrandBiasBySample annotation or read data was found. Strand bias annotations will not be output.
WARN 17:26:07,297 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
WARN 17:26:07,297 StrandBiasTest - No StrandBiasBySample annotation or read data was found. Strand bias annotations will not be output.
WARN 17:29:25,076 AnnotationUtils - DP annotation will not be calculated, must be called from HaplotypeCaller or MuTect2, not VariantAnnotator

I don't care about DP, as it is not required for WXS, but the others are mandatory for VQSR (right?)

I'm using samtools to generate VCFs for a couple of reasons: I need to call variants from single samples, and I need a tool that calls "everything" without filters so that I can apply custom downstream filtering. The idea is to first try the VQSR approach for artifact filtering instead of hard-filtering (which is more tricky and complex).
I was thinking about using HaplotypeCaller (also because it would be easier to use HC VCFs with VQSR or other GATK tools), but from what I have understood it is meant to find "germline SNPs", which doesn't fit my needs as I am looking for novel SNVs (mutations) and not SNPs (am I right?).

Thank you very much.

Oncotator output from 1.8 and 1.9


Are the two versions giving the same output?
The columns in the output files from the two versions are the same, right?

Use Select Variants on a gnomAD vcf for Mutect2 contamination filtering.


I am trying to follow this set of steps to use the mutect2 wdl from the Broad.

https://github.com/broadinstitute/gatk/tree/master/scripts/mutect2_wdl

In that file, it is recommended to make a variants_for_contamination file, to filter out contaminating reads.

The command that is requested to run is here:
java -jar $gatk SelectVariants -V gnomad.vcf -L 1 --select "AF > 0.05" -O variants_for_contamination.vcf

I first got gnomad by going here: http://gnomad.broadinstitute.org/downloads
and getting the vcf for exomes, as I believe was instructed.

I pull the gatk container:
sudo docker pull broadinstitute/gatk:latest

and in that container run:
java -jar /gatk/gatk.jar SelectVariants -V gnomad.exomes.r2.0.2.sites.vcf.bgz -L 1 --select "AF > 0.05" -O variants_for_contamination.vcf

The output that I get is:

02:50:34.291 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/libs/gatk-package-4.beta.6-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
[December 16, 2017 2:50:33 AM UTC] SelectVariants --output variants_for_contamination.vcf --selectExpressions AF > 0.05 --variant gnomad.exomes.r2.0.2.sites.vcf.bgz --intervals 1 --invertSelect false --excludeNonVariants false --excludeFiltered false --preserveAlleles false --removeUnusedAlternates false --restrictAllelesTo ALL --keepOriginalAC false --keepOriginalDP false --mendelianViolation false --invertMendelianViolation false --mendelianViolationQualThreshold 0.0 --select_random_fraction 0.0 --remove_fraction_genotypes 0.0 --fullyDecode false --maxIndelSize 2147483647 --minIndelSize 0 --maxFilteredGenotypes 2147483647 --minFilteredGenotypes 0 --maxFractionFilteredGenotypes 1.0 --minFractionFilteredGenotypes 0.0 --maxNOCALLnumber 2147483647 --maxNOCALLfraction 1.0 --setFilteredGtToNocall false --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES false --SUPPRESS_REFERENCE_PATH false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[December 16, 2017 2:50:33 AM UTC] Executing as root@ae5a49b74378 on Linux 4.8.0-59-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11; Version: 4.beta.6-SNAPSHOT
02:50:35.167 INFO SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 5
02:50:35.167 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
02:50:35.168 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
02:50:35.168 INFO SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
02:50:35.168 INFO SelectVariants - Deflater: IntelDeflater
02:50:35.169 INFO SelectVariants - Inflater: IntelInflater
02:50:35.169 INFO SelectVariants - GCS max retries/reopens: 20
02:50:35.170 INFO SelectVariants - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
02:50:35.171 INFO SelectVariants - Initializing engine
02:50:35.925 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/gnomad.exomes.r2.0.2.sites.vcf.bgz
02:50:36.143 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/gnomad.exomes.r2.0.2.sites.vcf.bgz
02:50:36.220 WARN IndexUtils - Feature file "/mnt/gnomad.exomes.r2.0.2.sites.vcf.bgz" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file
02:50:36.341 WARN IndexUtils - Index file /mnt/gnomad.exomes.r2.0.2.sites.vcf.bgz.tbi is out of date (index older than input file). Use IndexFeatureFile to make a new index.
02:50:36.361 INFO SelectVariants - Shutting down engine
[December 16, 2017 2:50:36 AM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=422051840


A USER ERROR has occurred: Badly formed genome unclippedLoc: Parameters to GenomeLocParser are incorrect:The stop position 0 is less than start 1 in contig 1


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--javaOptions '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

I have no idea what is going on here. I am just trying to follow the instructions to use the provided wdl.

Any help would be great!

Differences between HaplotypeCaller and Mutect2


They share graph assembly--similarities end there

Operationally, Mutect2 works similarly to HaplotypeCaller in that they share the active region-based processing, assembly-based haplotype reconstruction and pairHMM alignment of reads to haplotypes. However, they use fundamentally different models for estimating variant likelihoods and genotypes. The HaplotypeCaller model uses ploidy in its genotype likelihood calculations. The Mutect2 model does not. We explain why this is the case.



Germline caller versus Somatic caller

The main difference is that HaplotypeCaller is designed to call germline variants, while Mutect2 is designed to call somatic variants. Neither is appropriate for the other use case.

Germline variants are straightforward. They vary against the reference. Germline calling typically assumes a fixed ploidy and calling includes genotyping sites. HaplotypeCaller allows setting a different ploidy than diploid with the -ploidy argument. HaplotypeCaller can call germline variants on one or multiple samples and the tool can use evidence of variation across the samples to increase confidence in a variant call.

Somatic variants contrast between two samples against the reference. What do we mean by somatic? The Greek word soma refers to parts of an organism other than the reproductive cells. For example, our skin cells are soma-tic and accumulate mutations from sun exposure that presumably our seed or germ cells are protected from. In this example, variants in skin cells that are not variant in the blood cells are somatic.

Mutect2 works primarily by contrasting the presence or absence of evidence for variation between two samples, the tumor and matched normal, from the same individual. The tool can run on unmatched tumors but this produces high rates of false positives. Technically speaking, somatic variants are both (i) different from the control sample and (ii) different from the reference. What this means is that if a site is variant in the control but in the somatic sample reverts to the reference allele, then it is not a somatic variant.


Here are some more specific differences

  1. Mutect2 is incapable of calculating reference confidence, which is a feature in HaplotypeCaller that is key to producing GVCFs. As a result, there is currently no way to perform joint calling for somatic variant discovery.
  2. Because a somatic callset is based on a single individual rather than a cohort, annotations in the INFO column of a Mutect2 VCF only refer to the ALT alleles and do not include values for the REF allele. This differs from a germline cohort callset, in which annotations in the INFO field are typically derived from data related to all observed alleles including the reference.
  3. While HaplotypeCaller relies on a fixed ploidy assumption to calculate the genotype likelihoods that are the basis for genotype probabilities (PL), Mutect2 allows for varying ploidy in the form of allele fractions for each variant. Varying allele fractions are often seen within a tumor sample due to fractional purity, multiple subclones and copy number variation.
  4. Mutect2 also differs from HaplotypeCaller in that it can apply various prefilters to sites and alleles depending on the use of a matched normal, a panel of normals (PoN) and a common population variant resource containing allele-specific frequencies. If a PoN or matched normal is provided, Mutect2 can use either to filter sites before reassembly, and it can use a germline resource to filter alleles.
  5. The variant site annotations that HaplotypeCaller and Mutect2 apply by default are very different; see their respective tool documentation for details.
  6. Finally, Mutect2 has additional parameters not available to HaplotypeCaller. These parameters factor into the decision to perform reassembly on a region, whether to emit a variant, and whether to filter a site; a hedged example command follows this list:
    • For one, the frequency of alleles not in the germline resource (--af-of-alleles-not-in-resource) defines the germline variant prior, which Mutect2 uses in likelihood calculations of a variant being germline.
    • Second, the log somatic prior (--log-somatic-prior) defines the somatic variant prior, which Mutect2 uses in likelihood calculations of a variant being somatic.
    • Third, the normal log odds ratio (--normal-lod) defines the filter threshold for variants in the tumor not being present in the normal, i.e. the germline risk factor.
    • Fourth, the tumor log odds ratio for emission (--tumor-lod-to-emit) defines the cutoff for a tumor variant to appear in a callset.
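As promised above, here is a hedged sketch of a GATK4 Mutect2 tumor-normal invocation that sets two of these thresholds explicitly. Argument names and defaults changed between GATK4 releases (for example, early versions identify the tumor and normal by sample name with -tumor and -normal), so treat this purely as an illustration; the values shown are placeholders, not recommendations:

# tumor-normal somatic calling with a germline resource and panel of normals
gatk Mutect2 \
    -R reference.fasta \
    -I tumor.bam -tumor tumor_sample_name \
    -I normal.bam -normal normal_sample_name \
    --germline-resource af-only-gnomad.vcf.gz \
    --panel-of-normals pon.vcf.gz \
    --af-of-alleles-not-in-resource 0.0000025 \
    --tumor-lod-to-emit 3.0 \
    -O sample1.somatic.unfiltered.vcf.gz

The unfiltered output is then passed to FilterMutectCalls, as described in the somatic short variant discovery workflow.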

Historical perspective explains some quirks of somatic calling

Somatic calling is NOT a simple subtraction of control variant alleles from case sample variant alleles. The reason for this stems from the original intent for somatic callsets.

Somatic calling was originally designed for cancer research--specifically, computational research that focuses on triangulating driver mutation loci in cancer cohorts. Analyses require callsets with high specificity. What this means is that researchers prefer to remove false positives even at the expense of losing some true positives. Somatic callers reflect this preference in their stringent filtering against likely germline variants.

A somatic caller should detect low fraction alleles, can make no explicit ploidy assumption and omits genotyping. Mutect2 adheres to all of these criteria. A number of cancer sample characteristics necessitate such caller features. For one, biopsied tumor samples are commonly contaminated with normal cells, and the normal fraction can be much higher than the tumor fraction of a sample. Second, a tumor can be heterogeneous in its mutations. Third, these mutations not uncommonly include aneuploid events that change the copy number of a cell's genome in patchwork fashion.

A variant allele in the case sample is not called if the site is variant in controls. We explain an exception for GATK4 Mutect2 in a bit.

Historically, somatic callers have called somatic variants at the site level. That is, if a variant site in the case is also variant in the matched control or in a population resource, e.g. dbSNP, it is discounted from the somatic callset even if the variant allele differs from that in the control or resource. This practice stems in part from cancer study designs in which the control normal sample is sequenced at much lower depth than the case tumor sample. Because of the assumption that mutations strike randomly, cancer geneticists view mutations at sites of common germline variation with skepticism. Remember that for humans, common germline variant sites occur on average at roughly one in a thousand reference bases. So if a commonly variant site accrues additional mutations, we must weigh the chance of it having arisen from a true somatic event against it being something else that will likely not add value to downstream analyses. For most sites and typical analyses, the latter is the case: the variant is unlikely to have arisen from a somatic event and more likely to be some artifact or germline variant, e.g. from mapping error or cross-sample contamination.

GATK4 Mutect2 still applies this practice in part. The tool discounts variant sites shared with the panel of normals or with a matched normal control's unambiguously variant site. If the matched normal's variant allele is supported by few reads, at low allele fraction, then the tool accounts for the possibility of the site not being a germline variant.

When it comes to the population germline resource, GATK4 Mutect2 distinguishes between the variant alleles in the germline resource and the case sample. That is, Mutect2 will call a variant site somatic if the allele differs from that in the germline resource. Blog#10911 explains this in a bit more detail and explains how Mutect2 factors germline variant allele frequencies in calling.

Somatic workflows filter case sites with multiple variant alleles. By a similar logic to that outlined above, and with the assumption that common variant sites are biallelic, any site that presents multiple variant alleles in the case sample is suspect. Mutect2 still calls such sites and the contrasting variant alleles; however, in the next step of the workflow, FilterMutectCalls filters such sites with the multiallelic filter. It is possible a multiallelic site in the case sample represents a somatic event, but it is more likely the site is a germline variant site or an artifactual site.


  • Tutorial#2801 outlines how to call germline short variants with HaplotypeCaller.
  • Tutorial#11136 outlines the GATK4 somatic short variant discovery workflow.
  • For differences between GATK4 Mutect2 and GATK3 MuTect2, see Blog#10911.
  • HaplotypeCaller tool documentation is here.
  • GATK4 Mutect2 tool documentation is here.




Variant Quality Score Recalibration (VQSR)


This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. The first section is a high-level overview aimed at non-specialists. Additional technical details are provided below.

For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article. See the VariantRecalibrator tool doc and the ApplyRecalibration tool doc for a complete description of available command line arguments.

As a complement to this document, we encourage you to watch the workshop videos available in the Presentations section.


High-level overview

VQSR stands for “variant quality score recalibration”, which is a bad name because it’s not re-calibrating variant quality scores at all; it is calculating a new quality score that is supposedly super well calibrated (unlike the variant QUAL score which is a hot mess) called the VQSLOD (for variant quality score log-odds). I know this probably sounds like gibberish, stay with me. The purpose of this new score is to enable variant filtering in a way that allows analysts to balance sensitivity (trying to discover all the real variants) and specificity (trying to limit the false positives that creep in when filters get too lenient) as finely as possible.

The basic, traditional way of filtering variants is to look at various annotations (context statistics) that describe e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation; things like that -- then choose threshold values and throw out any variants that have annotation values above or below the set thresholds. The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.

The VQSR method, in a nutshell, uses machine learning algorithms to learn from each dataset what is the annotation profile of good variants vs. bad variants, and does so in a way that integrates information from multiple dimensions (like, 5 to 8, typically). The cool thing is that this allows us to pick out clusters of variants in a way that frees us from the traditional binary choice of “is this variant above or below the threshold for this annotation?”

Let’s do a quick mental visualization exercise (pending an actual figure to illustrate this), in two dimensions because our puny human brains work best at that level. Imagine a topographical map of a mountain range, with North-South and East-West axes standing in for two variant annotation scales. Your job is to define a subset of territory that contains mostly mountain peaks, and as few lowlands as possible. Traditional hard-filtering forces you to set a single longitude cutoff and a single latitude cutoff, resulting in one rectangular quadrant of the map being selected, and all the rest being greyed out. It’s about as subtle as a sledgehammer and forces you to make a lot of compromises. VQSR allows you to select contour lines around the peaks and decide how low or how high you want to go to include or exclude territory within your subset.

How this is achieved is another can of worms. The key point is that we use known, highly validated variant resources (omni, 1000 Genomes, hapmap) to select a subset of variants within our callset that we're really confident are probably true positives (that's the training set). We look at the annotation profiles of those variants (in our own data!), and from that we learn some rules about how to recognize good variants. We do something similar for bad variants as well. Then we apply the rules we learned to all of the sites, which (through some magical hand-waving) yields a single score for each variant that describes how likely it is to be real based on all the examined dimensions. In our map analogy this is the equivalent of determining on which contour line the variant sits. Finally, we pick a threshold value indirectly by asking the question "what score do I need to choose so that e.g. 99% of the variants in my callset that are also in hapmap will be selected?". This is called the target sensitivity. We can twist that dial in either direction depending on what is more important for our project, sensitivity or specificity.


Technical overview

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.

The variant recalibrator contrastively evaluates variants in a two-step process, each step performed by a distinct tool:

  • VariantRecalibrator
    Create a Gaussian mixture model by looking at the annotation values over a high quality subset of the input call set and then evaluate all input variants. This step produces a recalibration file.

  • ApplyRecalibration
    Apply the model parameters to each variant in input VCF files producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by adding lines to the FILTER column for variants that don't meet the specified lod threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.


How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.


How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), what happens is that the program looks at what is the VQSLOD value above which 99.9% of the variants in the training callset are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.
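If you then want a callset containing only the records that passed the chosen tranche threshold, one simple approach (a sketch; argument spelling varies between GATK versions, e.g. --excludeFiltered in older releases) is:

# keep only records whose FILTER field is PASS
gatk SelectVariants \
    -R reference.fasta \
    -V recalibrated.vcf.gz \
    --exclude-filtered \
    -O recalibrated.pass_only.vcf.gz

Keeping the filtered records in the main output, however, preserves the ability to revisit lower tranches later without rerunning recalibration.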


Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run; the report is written to the plotting output path specified on the command line (e.g. path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

[Figure: example page from the Gaussian mixture model report]

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown in the figure above. The plot shows that the training data form a distinct cluster at low values for each of the two statistics shown (HaplotypeScore and mapping quality bias). As SNPs fall off the distribution in either one or both of the dimensions, they are assigned a lower probability (that is, they move into the red region of the model's PDF) and are filtered out. This makes sense: higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes, and higher values of mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts, and putative variants here should be filtered out!
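For readers who want a more hands-on feel for the model-fitting step described above, here is a scikit-learn sketch of the same idea: fit a mixture model to the annotation vectors of training variants and score putative variants against it. This is a simplified stand-in for what VariantRecalibrator does internally (the real tool fits both a "good" and a "bad" model and reports the log odds as VQSLOD); the data, annotation dimensions, and component counts here are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Rows = variants, columns = annotation values (e.g. QD, MQRankSum, ReadPosRankSum, FS).
training = rng.normal(0.0, 1.0, size=(5000, 4))  # truth/training variants (placeholder)
worst = rng.normal(3.0, 2.0, size=(1000, 4))     # lowest-scoring variants (placeholder)
novel = rng.normal(1.0, 1.5, size=(2000, 4))     # putative novel variants (placeholder)

good_model = GaussianMixture(n_components=8, covariance_type="full", random_state=0).fit(training)
bad_model = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(worst)

# A VQSLOD-like score: log-likelihood under the good model minus log-likelihood
# under the bad model. Higher values look more like known true variants.
vqslod_like = good_model.score_samples(novel) - bad_model.score_samples(novel)
```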


Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call set into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well-calibrated variant quality scores, you can generate call sets in which each variant doesn't need a hard in-or-out answer. If a very high-accuracy call set is desired, you can use the highest tranche; if a larger, more complete call set is a higher priority, then you can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only the PASSing records.
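As an illustration of using those FILTER annotations downstream, here is a small Python sketch that keeps the PASS records plus every tranche up to a chosen sensitivity. The record strings are made up, and the parsing simply assumes the VQSRTrancheSNP<lo>to<hi> naming convention mentioned above.

```python
import re

def keep_record(filter_field, max_sensitivity=99.9):
    """Return True if a record should be kept at the chosen truth sensitivity."""
    if filter_field == "PASS":
        return True
    # Tranche names encode their sensitivity range, e.g. VQSRTrancheSNP99.00to99.90.
    m = re.match(r"VQSRTrancheSNP([\d.]+)to([\d.]+)$", filter_field)
    # Keep a tranche only if its lower sensitivity bound is below our cutoff.
    return bool(m) and float(m.group(1)) < max_sensitivity

filters = ["PASS", "VQSRTrancheSNP99.00to99.90", "VQSRTrancheSNP99.90to100.00"]
kept = [f for f in filters if keep_record(f, max_sensitivity=99.9)]
# kept -> ["PASS", "VQSRTrancheSNP99.00to99.90"]; the 99.90to100.00 tranche is dropped.
```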

The first tranche (90), which has the lowest truth sensitivity but the highest novel Ti/Tv, is exceedingly specific but less sensitive. Each subsequent tranche introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can thereby select more specific or more sensitive call sets in a principled way, or use the recalibrated quality scores directly to weight individual variant calls by their probability of being real rather than analyzing only a fixed subset of calls. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

[image: example tranche plot]

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.


Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
  • The truth sensitivity (TS) approach is conceptually cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used HapMap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other high-quality sets of sites (~99% truly variable in the population) should work just as well. In our experience with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features (such as being close to indels, or actually being MNPs) and so receive low VQSLOD scores.
Note that the expected Ti/Tv is still an available argument but it is only used for display purposes.
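Since the novel Ti/Tv ratio comes back "for free" as a QC metric under the TS approach, here is a tiny sketch of how that ratio is computed from SNP REF/ALT pairs. The input representation is invented for the example; the rough expectations (around 2.0-2.1 genome-wide for largely true calls, approaching ~0.5 for random noise) are rules of thumb rather than exact targets.

```python
# Transitions are A<->G and C<->T; any other single-base change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snvs):
    """snvs: iterable of (ref, alt) single-base pairs."""
    ti = sum(1 for r, a in snvs if (r, a) in TRANSITIONS)
    tv = sum(1 for r, a in snvs if r != a and (r, a) not in TRANSITIONS)
    return ti / tv if tv else float("inf")

print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]))  # 3.0
```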


Finally, a couple of Frequently Asked Questions

- Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool expects thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole-exome call sets work well, but anything smaller than that scale may run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (i.e. Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination you can achieve between variant profiles/error modes. It's all about trade-offs; unfortunately, if you don't have a lot of variants, you can't afford to be very demanding in terms of resolution.
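To see why fewer Gaussians can be the right call for a small dataset, here is a small sketch with placeholder data (this is not how VariantRecalibrator selects its model): with only a few hundred variants, an information criterion such as BIC typically favors a mixture with fewer components, which is the same intuition behind lowering --maxGaussians.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
small_callset = rng.normal(size=(300, 4))  # stand-in for a small callset's annotations

for k in (2, 4, 8):
    gm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gm.fit(small_callset)
    # BIC penalizes the extra parameters of additional Gaussians; with few variants,
    # smaller models usually score better (lower BIC).
    print(k, round(gm.bic(small_callset), 1))
```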

- Why don't all the plots get generated for me?

The most common cause is not having Rscript accessible on your PATH. Rscript is the command-line version of R and is installed alongside R itself. We also make use of the ggplot2 library, so please be sure to install that package as well. See the Common Problems section of the Guide for more details.

Plot BQSR error using WDL


Hi there,

I keep getting a hard-link error during my plot_bqsr step when running with WDL.
Any suggestion on how to fix it?
I tested BQSR_1 and BQSR_2 and both worked, and on the command line I could run the AnalyzeCovariates step successfully.
Thanks!

```
#Step6.1 BQSR_1

call bqsr_1 {
input:
gatk=gatk,
inputBAM=markduplicate.dedupbam,
RefFasta=RefFasta,
knownSite1=dbsnp,
knownSite2=Mills_indels,
knownSite3=tenk_indels,
sampleName=sampleName,
Refdict=Refdict,
RefIndex=RefIndex,
bamindex=buildbamindex.bamindex
}

#Step7.1 BQSR_2

call bqsr_2 {
input:
gatk=gatk,
inputBAM=markduplicate.dedupbam,
RefFasta=RefFasta,
knownSite1=dbsnp,
knownSite2=Mills_indels,
knownSite3=tenk_indels,
sampleName=sampleName,
Refdict=Refdict,
RefIndex=RefIndex,
bamindex=buildbamindex.bamindex,
recaltable=bqsr_1.bqsr1table
}

#Step8 plot_BQSR

call plot_bqsr {
input:
gatk=gatk,
RefFasta=RefFasta,
Refdict=Refdict,
RefIndex=RefIndex,
sampleName=sampleName,
recaltable=bqsr_1.bqsr1table,
postable=bqsr_2.bqsr2table
}
}

task plot_bqsr {
File gatk
File RefFasta
File recaltable
File postable
File sampleName
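# note: sampleName is declared as File here, but as String in the bqsr_1/bqsr_2 tasks below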
File Refdict
File RefIndex

command{
ln ${recaltable} b1.table
ln ${postable} b2.table
ln ${Refdict}

java -jar ${gatk} -T AnalyzeCovariates \
-R ${RefFasta} \
-before b1.table \
-after b2.table \
-plots ${sampleName}_recalibration_plots.pdf
}

output{
File recalibration_plot = "${sampleName}_recalibration_plots.pdf"
}

}

task bqsr_2{
File gatk
File inputBAM
String sampleName
File knownSite1
File knownSite2
File knownSite3
File RefFasta
File Refdict
File RefIndex
File bamindex
File recaltable

command{
ln ${inputBAM} input.bam
ln ${bamindex} input.bam.bai
ln ${Refdict}

java -jar ${gatk} -T BaseRecalibrator \
-R ${RefFasta} \
-I input.bam \
-knownSites ${knownSite1} \
-knownSites ${knownSite2} \
-knownSites ${knownSite3} \
-BQSR ${recaltable} \
-o ${sampleName}_post_recal_data.table
}

output{
File bqsr2table = "${sampleName}_post_recal_data.table"
}

}

task bqsr_1{
File gatk
File inputBAM
String sampleName
File knownSite1
File knownSite2
File knownSite3
File RefFasta
File Refdict
File RefIndex
File bamindex

command{
ln ${inputBAM} input.bam
ln ${bamindex} input.bam.bai
ln ${Refdict}

java -jar ${gatk} -T BaseRecalibrator \
-R ${RefFasta} \
-I input.bam \
-knownSites ${knownSite1} \
-knownSites ${knownSite2} \
-knownSites ${knownSite3} \
-o ${sampleName}_recal_data.table
}

output{
File bqsr1table = "${sampleName}_recal_data.table"
}

}
```

{ "OUWES_test.Refsa": "/home/cytolab/Sand/Reference/ucsc.hg19.fasta.sa", "OUWES_test.picard": "/home/cytolab/Software/picard/picard.jar", "OUWES_test.gatk": "/home/cytolab/Software/GATK/GenomeAnalysisTK.jar", "OUWES_test.Refann": "/home/cytolab/Sand/Reference/ucsc.hg19.fasta.ann", "OUWES_test.inputBAML3R1": "/home/cytolab/Sand/testl3r1.fastq", "OUWES_test.inputBAML2R1": "/home/cytolab/Sand/testl2r1.fastq", "OUWES_test.dbsnp": "/home/cytolab/Sand/Reference/dbsnp_138.hg19.vcf", "OUWES_test.Mills_indels": "/home/cytolab/Sand/Reference/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf", "OUWES_test.tenk_indels": "/home/cytolab/Sand/Reference/1000G_phase1.snps.high_confidence.hg19.sites.vcf", "OUWES_test.Refdict": "/home/cytolab/Sand/Reference/ucsc.hg19.dict", "OUWES_test.RefFasta": "/home/cytolab/Sand/Reference/ucsc.hg19.fasta", "OUWES_test.RefIndex": "/home/cytolab/Sand/Reference/ucsc.hg19.fasta.fai", "OUWES_test.Refpac": "/home/cytolab/Sand/Reference/ucsc.hg19.fasta.pac", "OUWES_test.inputBAML1R2": "/home/cytolab/Sand/testl1r2.fastq", "OUWES_test.inputBAML1R1": "/home/cytolab/Sand/testl1r1.fastq", "OUWES_test.inputBAML2R2": "/home/cytolab/Sand/testl2r2.fastq", "OUWES_test.inputBAML4R1": "/home/cytolab/Sand/testl4r1.fastq", "OUWES_test.sampleName": "test769", "OUWES_test.Refbwt": "/home/cytolab/Sand/Reference/ucsc.hg19.fasta.bwt", "OUWES_test.inputBAML4R2": "/home/cytolab/Sand/testl4r2.fastq", "OUWES_test.inputBAML3R2": "/home/cytolab/Sand/testl3r2.fastq", "OUWES_test.Refamb": "/home/cytolab/Sand/Reference/ucsc.hg19.fasta.amb" }

calls: OUWES_test.plot_bqsr:NA:1
[2018-01-09 16:53:44,57] [warn] Localization via hard link has failed: /home/cytolab/Sand/cromwell-executions/OUWES_test/540ab5ba-34e3-4aaf-9992-0b891423e87f/call-plot_bqsr/inputs/home/cytolab/Sand/test769 -> /home/cytolab/Sand/test769
[2018-01-09 16:53:44,67] [warn] Couldn't find a suitable DSN, defaulting to a Noop one.
[2018-01-09 16:53:44,74] [info] Using noop to send events.
[2018-01-09 16:53:44,80] [warn] Localization via copy has failed: /home/cytolab/Sand/test769
[2018-01-09 16:53:44,93] [error] BackgroundConfigAsyncJobExecutionActor [540ab5baOUWES_test.plot_bqsr:NA:1]: Error attempting to Execute
cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$$nestedInanonfun$commandLinePreProcessor$1$1$$anon$1: :
Could not localize test769 -> /home/cytolab/Sand/cromwell-executions/OUWES_test/540ab5ba-34e3-4aaf-9992-0b891423e87f/call-plot_bqsr/inputs/home/cytolab/Sand/test769:
test769 doesn't exists
File not found /home/cytolab/Sand/cromwell-executions/OUWES_test/540ab5ba-34e3-4aaf-9992-0b891423e87f/call-plot_bqsr/inputs/home/cytolab/Sand/test769 -> /home/cytolab/Sand/test769
File not found test769
File not found /home/cytolab/Sand/test769
at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$$nestedInanonfun$commandLinePreProcessor$1$1.applyOrElse(StandardAsyncExecutionActor.scala:113)
at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$$nestedInanonfun$commandLinePreProcessor$1$1.applyOrElse(StandardAsyncExecutionActor.scala:112)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
at scala.util.Failure.recoverWith(Try.scala:232)
at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$commandLinePreProcessor$1(StandardAsyncExecutionActor.scala:112)
at cromwell.backend.wdl.Command$.instantiate(Command.scala:27)
at cromwell.backend.standard.StandardAsyncExecutionActor.instantiatedCommand(StandardAsyncExecutionActor.scala:207)
at cromwell.backend.standard.StandardAsyncExecutionActor.instantiatedCommand$(StandardAsyncExecutionActor.scala:206)
at cromwell.backend.impl.sfs.config.BackgroundConfigAsyncJobExecutionActor.instantiatedCommand$lzycompute(ConfigAsyncJobExecutionActor.scala:121)
at cromwell.backend.impl.sfs.config.BackgroundConfigAsyncJobExecutionActor.instantiatedCommand(ConfigAsyncJobExecutionActor.scala:121)
at cromwell.backend.standard.StandardAsyncExecutionActor.commandScriptContents(StandardAsyncExecutionActor.scala:177)
at cromwell.backend.standard.StandardAsyncExecutionActor.commandScriptContents$(StandardAsyncExecutionActor.scala:176)
at cromwell.backend.impl.sfs.config.BackgroundConfigAsyncJobExecutionActor.commandScriptContents(ConfigAsyncJobExecutionActor.scala:121)
at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.writeScriptContents(SharedFileSystemAsyncJobExecutionActor.scala:136)
at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.writeScriptContents$(SharedFileSystemAsyncJobExecutionActor.scala:135)
at cromwell.backend.impl.sfs.config.BackgroundConfigAsyncJobExecutionActor.cromwell$backend$sfs$BackgroundAsyncJobExecutionActor$$super$writeScriptContents(ConfigAsyncJobExecutionActor.scala:121)
at cromwell.backend.sfs.BackgroundAsyncJobExecutionActor.writeScriptContents(BackgroundAsyncJobExecutionActor.scala:11)
at cromwell.backend.sfs.BackgroundAsyncJobExecutionActor.writeScriptContents$(BackgroundAsyncJobExecutionActor.scala:10)
at cromwell.backend.impl.sfs.config.BackgroundConfigAsyncJobExecutionActor.writeScriptContents(ConfigAsyncJobExecutionActor.scala:121)
at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute(SharedFileSystemAsyncJobExecutionActor.scala:123)
at cromwell.backend.sfs.SharedFileSystemAsyncJobExecutionActor.execute$(SharedFileSystemAsyncJobExecutionActor.scala:121)
at cromwell.backend.impl.sfs.config.BackgroundConfigAsyncJobExecutionActor.execute(ConfigAsyncJobExecutionActor.scala:121)
at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$executeAsync$1(StandardAsyncExecutionActor.scala:254)
at scala.util.Try$.apply(Try.scala:209)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync(StandardAsyncExecutionActor.scala:254)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeAsync$(StandardAsyncExecutionActor.scala:254)
at cromwell.backend.impl.sfs.config.BackgroundConfigAsyncJobExecutionActor.executeAsync(ConfigAsyncJobExecutionActor.scala:121)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover(StandardAsyncExecutionActor.scala:510)
at cromwell.backend.standard.StandardAsyncExecutionActor.executeOrRecover$(StandardAsyncExecutionActor.scala:504)
at cromwell.backend.impl.sfs.config.BackgroundConfigAsyncJobExecutionActor.executeOrRecover(ConfigAsyncJobExecutionActor.scala:121)
at cromwell.backend.async.AsyncBackendJobExecutionActor.$anonfun$robustExecuteOrRecover$1(AsyncBackendJobExecutionActor.scala:56)
at cromwell.core.retry.Retry$.withRetry(Retry.scala:36)
at cromwell.backend.async.AsyncBackendJobExecutionActor.withRetry(AsyncBackendJobExecutionActor.scala:52)
at cromwell.backend.async.AsyncBackendJobExecutionActor.cromwell$backend$async$AsyncBackendJobExecutionActor$$robustExecuteOrRecover(AsyncBackendJobExecutionActor.scala:56)
at cromwell.backend.async.AsyncBackendJobExecutionActor$$anonfun$receive$1.applyOrElse(AsyncBackendJobExecutionActor.scala:79)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:513)
at akka.actor.Actor.aroundReceive$(Actor.scala:511)
at cromwell.backend.impl.sfs.config.BackgroundConfigAsyncJobExecutionActor.aroundReceive(ConfigAsyncJobExecutionActor.scala:121)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
at akka.actor.ActorCell.invoke(ActorCell.scala:496)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[2018-01-09 16:53:45,13] [error] WorkflowManagerActor Workflow 540ab5ba-34e3-4aaf-9992-0b891423e87f failed (during ExecutingWorkflowState): :
Could not localize test769 -> /home/cytolab/Sand/cromwell-executions/OUWES_test/540ab5ba-34e3-4aaf-9992-0b891423e87f/call-plot_bqsr/inputs/home/cytolab/Sand/test769:
test769 doesn't exists
File not found /home/cytolab/Sand/cromwell-executions/OUWES_test/540ab5ba-34e3-4aaf-9992-0b891423e87f/call-plot_bqsr/inputs/home/cytolab/Sand/test769 -> /home/cytolab/Sand/test769
File not found test769
File not found /home/cytolab/Sand/test769

[2018-01-09 16:53:45,13] [info] WorkflowManagerActor WorkflowActor-540ab5ba-34e3-4aaf-9992-0b891423e87f is in a terminal state: WorkflowFailedState
[2018-01-09 16:54:20,23] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
Workflow 540ab5ba-34e3-4aaf-9992-0b891423e87f transitioned to state Failed
[2018-01-09 16:54:20,46] [info] Automatic shutdown of the async connection
[2018-01-09 16:54:20,46] [info] Gracefully shutdown sentry threads.
[2018-01-09 16:54:20,46] [info] Shutdown finished.

exome hg38 interval list


Hi, GATK team!
Could you please tell me whether an exome interval list is available for hg38?

-thanks
