Channel: Recent Discussions — GATK-Forum

Base Quality Score Recalibration (BQSR)


BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to anyone, especially Spanish-speaking users, who get tripped up by the similarity of these names.


Contents

  1. Overview
  2. Base recalibration procedure details
  3. Important factors for successful recalibration
  4. Examples of pre- and post-recalibration metrics
  5. Recalibration report

1. Overview

It's all about the base, 'bout the base (quality scores)

Base quality scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion base calls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases -- which is a lot of bad bases. The quality score each base call gets is determined through some dark magic jealously guarded by the manufacturer of the sequencing machines.
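
For reference, the Phred scale relates a quality score Q to an estimated error probability P by Q = -10 * log10(P). So Q20 corresponds to a 1 in 100 chance of error (99% accuracy), Q30 to 1 in 1,000, and Q40 to 1 in 10,000.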

Why does it matter? Because our short variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a base call that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally our impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

This procedure can be applied to BAM files containing data from any sequencing platform that outputs base quality scores on the expected scale. We have run it ourselves on data from several generations of Illumina, SOLiD, 454, Complete Genomics, and Pacific Biosciences sequencers.

That sounds great! How does it work?

The base recalibration process involves two key steps: first the BaseRecalibrator tool builds a model of covariation based on the input data and a set of known variants, producing a recalibration file; then the ApplyBQSR tool adjusts the base quality scores in the data based on the model, producing a new BAM file. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
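
In current GATK4 syntax, the two-step process looks roughly like this (a minimal sketch: file names are placeholders, and in practice you would typically supply several --known-sites resources, e.g. dbSNP and a known-indels resource):

gatk BaseRecalibrator \
    -I input.bam \
    -R reference.fasta \
    --known-sites dbsnp.vcf.gz \
    -O recal_data.table

gatk ApplyBQSR \
    -I input.bam \
    -R reference.fasta \
    --bqsr-recal-file recal_data.table \
    -O recalibrated.bam

The before/after plots mentioned above are typically generated by running BaseRecalibrator a second time on the recalibrated output and passing both recalibration tables to AnalyzeCovariates.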


2. Base recalibration procedure details

BaseRecalibrator builds the model

To build the recalibration model, this first tool goes through all of the reads in the input BAM file and tabulates data about the following features of the bases:

  • read group the read belongs to
  • quality score reported by the machine
  • machine cycle producing this base (Nth cycle = Nth base from the start of the read)
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to the known variants resource (typically dbSNP). This information is output to a recalibration file in GATKReport format.

Note that the recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This smoothing guards against overfitting: it has only a minor impact on data sets with billions of bases, but it is critical to avoid overconfidence in rare bins in sparse data.
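
As a worked example with illustrative numbers (not from a real run): a bin with 2 mismatches out of 1,000 observed bases would get an empirical quality of -10 * log10((2 + 1) / (1,000 + 2)) ≈ Q25.2, rather than the -10 * log10(2 / 1,000) ≈ Q27.0 you would get without the correction. For a bin with millions of observations, the two values are essentially identical.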

ApplyBQSR adjusts the scores

This second tool goes through all the reads again, using the recalibration file to adjust each base's score based on which bins it falls in. So effectively the new quality score is:

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effect

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as variant calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.


3. Important factors for successful recalibration

Read groups

The recalibration system is read-group aware, meaning it uses @RG tags to partition the data by read group. This allows it to perform the recalibration per read group, which reflects which library a read belongs to and what lane it was sequenced in on the flowcell. We know that systematic biases can occur in one lane but not the other, or one library but not the other, so being able to recalibrate within each unit of sequence data makes the modeling process more accurate. As a corollary, that means it's okay to run BQSR on BAM files with multiple read groups. However, please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.

Amount of data

A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. This procedure will not work well on a small number of aligned reads. We usually expect to see more than 100M bases per read group; as a rule of thumb, larger numbers will work better.

No excuses

You should almost always perform recalibration on your sequencing data. In human data, given the exhaustive databases of variation we have available, almost all of the remaining mismatches -- even in cancer -- will be errors, so it's super easy to ascertain an accurate error model for your data, which is essential for downstream analysis. For non-human data it can be a little bit more work since you may need to bootstrap your own set of variants if no such resources are already available for your organism, but it's worth it.

Here's how you would bootstrap a set of known variants:

  • First do an initial round of variant calling on your original, unrecalibrated data.
  • Then take the variants that you have the highest confidence in and use that set as the database of known variants by feeding it as a VCF file to the BaseRecalibrator.
  • Finally, do a real round of variant calling with the recalibrated data. These steps can be repeated several times until convergence; a minimal command sketch of one iteration is shown below.
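
Assuming GATK4 tool syntax, one iteration of this bootstrap might look roughly like the sketch below. This is only an illustration: the file names, filtering expressions and caller options are placeholders that you would adapt to your data.

gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -O bootstrap_raw.vcf.gz

# Flag lower-confidence calls (thresholds below are illustrative, not a recommendation)
gatk VariantFiltration \
    -R reference.fasta \
    -V bootstrap_raw.vcf.gz \
    --filter-expression "QD < 2.0 || FS > 60.0" \
    --filter-name "low_confidence" \
    -O bootstrap_filtered.vcf.gz

# Keep only the records that passed filtering
gatk SelectVariants \
    -R reference.fasta \
    -V bootstrap_filtered.vcf.gz \
    --exclude-filtered \
    -O bootstrap_confident.vcf.gz

gatk BaseRecalibrator \
    -R reference.fasta \
    -I sample.bam \
    --known-sites bootstrap_confident.vcf.gz \
    -O recal_data.table

gatk ApplyBQSR \
    -R reference.fasta \
    -I sample.bam \
    --bqsr-recal-file recal_data.table \
    -O recalibrated.bam

# Then call variants again on recalibrated.bam and, if needed, repeat.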

The main cases where you really might need to skip BQSR are when you have too little data (some small gene panels have that problem), or when you're working with a really weird organism that displays insane amounts of variation.


4. Examples of pre- and post-recalibration metrics

This shows recalibration results from a lane sequenced at the Broad by an Illumina GA-II in February 2010. This is admittedly not very recent but the results are typical of what we still see on some more recent runs, even if the overall quality of sequencing has improved. You can see there is a significant improvement in the accuracy of the base quality scores after applying the recalibration procedure. Note that the plots shown below are not the same as the plots that are produced by the AnalyzeCovariates tool.

[Pre- and post-recalibration base quality score plots]


5. Recalibration report

The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSR for this dataset.

#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSR, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other GATK tool if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization. You can override this with the engine argument -qq: with -qq 0 you don't quantize qualities, and with -qq N you recalculate the quantization bins using N bins.

#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions and deletions.

#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions and deletions.

#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.

#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

UmiAwareMarkDuplicatesWithMateCigar sorted Error


Hello,

I want to use UmiAwareMarkDuplicatesWithMateCigar instead of MarkDuplicates to analyze my sequencing data with UMIs.

But when I use UmiAwareMarkDuplicatesWithMateCigar, I get an error saying my BAM file is not sorted in duplicate order, even though I sorted it with sambamba beforehand.

I use this command line: java -Djava.io.tmpdir=${PATH_TMP_DIR} -jar /illumina/software/Picard/picard_V2.18.11.jar UmiAwareMarkDuplicatesWithMateCigar INPUT=${PATH_SAMPLE}/${SAMPLE}.fixed_mate.umi.sorted.bam OUTPUT=${PATH_SAMPLE}/${SAMPLE}.markdup.sorted.bam METRICS_FILE=${PATH_SAMPLE}/metrics_duplication.txt BARCODE_TAG=RX REMOVE_DUPLICATES=true UMI_METRICS=${PATH_SAMPLE}/metrics_UMI.txt ASSUME_SORT_ORDER=coordinate

My picard version is 2.18.11

And I get this error:

Exception in thread "main" htsjdk.samtools.SAMException: The input records were not sorted in duplicate order:
K00103:407:HWCCJBBXX:8:1121:3457:27953 147 chr1 26731173 60 75M = 26730955 -293 GGCCCCAGCGGGTATGGTCAACAGGGCCAGACTCCATATTACAACCAGCAAAGTCCTCACCCTCAGCAGCAGCAG JA--JJJJJJJJ<JFAJF7JFAAFJFJ7JF7JJJFFJJJJJJA<AJJJFAF7JJJJ<JFJJJFJJJJJJJFFFAA MC:Z:75M MD:Z:75 RG:Z:GRChg38 NM:i:0 MQ:i:60 AS:i:75 XS:i:22 RX:Z:GTAGAAATAC
K00103:407:HWCCJBBXX:8:2215:31010:10844 181 chr1 15928612 0 * = 15928612 0 GAGCTACATCTTCCACTCCGGTCCAATCTCCTCCATTCCACTCCGTTCCATTTCATTCCATTTCTCTTCATTCCA F<JAF<F7-FFFJ<<AFJJAFAFFJ<F7FJJAFJAA7<<-<AA7-<F7F7-<F-JFJJJJJJ<F-J<FFFF<<AA MC:Z:75M RG:Z:GRChg38 AS:i:0 XS:i:0 RX:Z:TCAGGAGGGA

    at htsjdk.samtools.DuplicateSetIterator.next(DuplicateSetIterator.java:152)
    at picard.sam.markduplicates.UmiAwareDuplicateSetIterator.next(UmiAwareDuplicateSetIterator.java:117)
    at picard.sam.markduplicates.UmiAwareDuplicateSetIterator.next(UmiAwareDuplicateSetIterator.java:53)
    at picard.sam.markduplicates.SimpleMarkDuplicatesWithMateCigar.doWork(SimpleMarkDuplicatesWithMateCigar.java:133)
    at picard.sam.markduplicates.UmiAwareMarkDuplicatesWithMateCigar.doWork(UmiAwareMarkDuplicatesWithMateCigar.java:141)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:277)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

Thank you in advance,

too many open files


Hi,

I've seen this problem mentioned many times here, but I think I may have something new to contribute. Our users are running GATK 3.8, and see this error:

##### ERROR MESSAGE: Unable to parse header with error: /tmp/org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub3088197262002724909.tmp (Too many open files)

I've monitored some runs and noticed that as the run progresses, there are more and more leftover open file descriptors:

lsof | grep VariantContextWriterStub
java       67922   frodeli 1563r      REG                8,5    10740643                  968 /tmp/org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub498651467373697139.tmp (deleted)
java       67922   frodeli 1564w      REG                8,5     6282467                  992 /tmp/org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub2281266380489957735.tmp

Essentially all of the VariantContextWriterStub files have a (deleted) flag; only a few of them don't. After 24h of runtime, 1432 out of 1467 entries are reported as deleted. This makes me wonder whether there is a resource leak in GATK, i.e. the temporary file is deleted without being closed first.

Regards,

Marcin

Does GATK support BGEN?


Does GATK support BGEN files, which are the standard data format output by the UK Biobank?

Evaluating the quality of a variant callset


Introduction

Running through the steps involved in variant discovery (calling variants, joint genotyping and applying filters) produces a variant callset in the form of a VCF file. So what’s next? Technically, that callset is ready to be used in downstream analysis. But before you do that, we recommend running some quality control analyses to evaluate how “good” that callset is.

To be frank, distinguishing between a “good” callset and a “bad” callset is a complex problem. If you knew the absolute truth of what variants are present or not in your samples, you probably wouldn’t be here running variant discovery on some high-throughput sequencing data. Your fresh new callset is your attempt to discover that truth. So how do you know how close you got?

Methods for variant evaluation

There are several methods that you can apply which offer different insights into the probable biological truth, all with their own pros and cons. Possibly the most trusted method is Sanger sequencing of regions surrounding putative variants. However, it is also the least scalable as it would be prohibitively costly and time-consuming to apply to an entire callset. Typically, Sanger sequencing is only applied to validate candidate variants that are judged highly likely. Another popular method is to evaluate concordance against results obtained from a genotyping chip run on the same samples. This is much more scalable, and conveniently also doubles as a quality control method to detect sample swaps. Although it only covers the subset of known variants that the chip was designed for, this method can give you a pretty good indication of both sensitivity (ability to detect true variants) and specificity (not calling variants where there are none). This is something we do systematically for all samples in the Broad’s production pipelines.

The third method, presented here, is to evaluate how your variant callset stacks up against another variant callset (typically derived from other samples) that is considered to be a truth set (sometimes referred to as a gold standard -- these terms are very close and often used interchangeably). The general idea is that key properties of your callset (metrics discussed later in the text) should roughly match those of the truth set. This method is not meant to render any judgments about the veracity of individual variant calls; instead, it aims to estimate the overall quality of your callset and detect any red flags that might be indicative of error.

Underlying assumptions and truthiness*: a note of caution

It should be immediately obvious that there are two important assumptions being made here: 1) that the content of the truth set has been validated somehow and is considered especially trustworthy; and 2) that your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set. These assumptions are not always well-supported, depending on the truth set, your callset, and what they have (or don’t have) in common. You should always keep this in mind when choosing a truth set for your evaluation; it’s a jungle out there. Consider that if anyone can submit variants to a truth set’s database without a well-regulated validation process, and there is no process for removing variants if someone later finds they were wrong (I’m looking at you, dbSNP), you should be extra cautious in interpreting results.
*With apologies to Stephen Colbert.

Validation

So what constitutes validation? Well, the best validation is done with orthogonal methods, meaning that it is done with technology (wetware, hardware, software, etc.) that is not subject to the same error modes as the sequencing process. Calling variants with two callers that use similar algorithms? Great way to reinforce your biases. It won’t mean anything that both give the same results; they could both be making the same mistakes. On the wetlab side, Sanger and genotyping chips are great validation tools; the technology is pretty different, so they tend to make different mistakes. Therefore it means more if they agree or disagree with calls made from high-throughput sequencing.

Matching populations

Regarding the population genomics aspect: it’s complicated -- especially if we’re talking about humans (I am). There’s a lot of interesting literature on this topic; for now let’s just summarize by saying that some important variant calling metrics vary depending on ethnicity. So if you are studying a population with a very specific ethnic composition, you should try to find a truth set composed of individuals with a similar ethnic background, and adjust your expectations accordingly for some metrics.

Similar principles apply to non-human genomic data, with important variations depending on whether you’re looking at wild or domesticated populations, natural or experimentally manipulated lineages, and so on. Unfortunately we can’t currently provide any detailed guidance on this topic, but hopefully this explanation of the logic and considerations involved will help you formulate a variant evaluation strategy that is appropriate for your organism of interest.


Variant evaluation metrics

So let’s say you’ve got your fresh new callset and you’ve found an appropriate truth set. You’re ready to look at some metrics (but don’t worry yet about how; we’ll get to that soon enough). There are several metrics that we recommend examining in order to evaluate your data. The set described here should be considered a minimum and is by no means exclusive. It is nearly always better to evaluate more metrics if you possess the appropriate data to do so -- and as long as you understand why those additional metrics are meaningful. Please don’t try to use metrics that you don’t understand properly, because misunderstandings lead to confusion; confusion leads to worry; and worry leads to too many desperate posts on the GATK forum.

Variant-level concordance and genotype concordance

The relationship between variant-level concordance and genotype concordance is illustrated in this figure.

  • Variant-level concordance (aka % Concordance) gives the percentage of variants in your samples that match (are concordant with) variants in your truth set. It essentially serves as a check of how well your analysis pipeline identified variants contained in the truth set. Depending on what you are evaluating and comparing, the interpretation of percent concordance can vary quite significantly.
    Comparing your sample(s) against genotyping chip results matched per sample allows you to evaluate whether you missed any real variants within the scope of what is represented on the chip. Based on that concordance result, you can extrapolate what proportion you may have missed out of the real variants not represented on the chip.
    If you don't have a sample-matched truth set and you're comparing your sample against a truth set derived from a population, your interpretation of percent concordance will be more limited. You have to account for the fact that some variants that are real in your sample will not be present in the population and that conversely, many variants that are in the population will not be present in your sample. In both cases, "how many" depends on how big the population is and how representative it is of your sample's background.
    Keep in mind that for most tools that calculate this metric, all unmatched variants (present in your sample but not in the truth set) are considered to be false positives. Depending on your trust in the truth set and whether or not you expect to see true, novel variants, these unmatched variants could warrant further investigation -- or they could be artifacts that you should ignore.

  • Genotype concordance is a similar metric but operates at the genotype level. It allows you to evaluate, within a set of variant calls that are present in both your sample callset and your truth set, what proportion of the genotype calls have been assigned correctly. This assumes that you are comparing your sample to a matched truth set derived from the same original sample.

Number of Indels & SNPs and TiTv Ratio

These metrics are widely applicable. The table below summarizes their expected value ranges for Human Germline Data:

Sequencing Type    # of Variants*    TiTv Ratio
WGS                ~4.4M             2.0-2.1
WES                ~41k              3.0-3.3

*for a single sample

  • Number of Indels & SNPs
    The number of variants detected in your sample(s) are counted separately as indels (insertions and deletions) and SNPs (Single Nucleotide Polymorphisms). Many factors can affect this statistic including whole exome (WES) versus whole genome (WGS) data, cohort size, strictness of filtering through the GATK pipeline, the ethnicity of your sample(s), and even algorithm improvement due to a software update. For reference, see the 2015 Nature paper in which various ethnicities in a moderately large cohort were analyzed for the number of variants they carry. As such, this metric alone is insufficient to confirm data validity, but it can raise warning flags when something has gone extremely wrong: e.g. 1000 variants in a large cohort WGS data set, or 4 billion variants in a ten-sample whole-exome set.

  • TiTv Ratio
    This metric is the ratio of transition (Ti) to transversion (Tv) SNPs. If the distribution of transition and transversion mutations were random (i.e. without any biological influence) we would expect a ratio of 0.5. This is simply due to the fact that there are twice as many possible transversion mutations as there are transitions. However, in the biological context, it is very common to see a methylated cytosine undergo deamination to become thymine. As this is a transition mutation, it has been shown to increase the expected ratio from the random 0.5 to ~2.01. Furthermore, CpG islands, usually found in promoter regions, have higher concentrations of methylcytosines. By including these regions, whole exome sequencing shows an even stronger lean towards transition mutations, with an expected ratio of 3.0-3.3. A significant deviation from the expected values could indicate artifactual variants causing bias. If your TiTv Ratio is too low, your callset likely has more false positives.

    It should also be noted that the TiTv ratio from exome-sequenced data will vary from the expected value based upon the length of flanking sequences. When we analyze exome sequence data, we add some padding (usually 100 bases) around the targeted regions (using the -ip engine argument) because this improves calling of variants that are at the edges of exons (whether inside the exon sequence or in the promoter/regulatory sequence before the exon). These flanking sequences are not subject to the same evolutionary pressures as the exons themselves, so the relative numbers of transition and transversion mutations lean away from the expected ratio. The amount of "lean" depends on how long the flanking sequence is.

Ratio of Insertions to Deletions (Indel Ratio)

This metric is generally evaluated after filtering for purposes that are specific to your study, and the expected value range depends on whether you're looking for rare or common variants, as summarized in the table below.

Filtering for    Indel Ratio
common           ~1
rare             0.2-0.5

A significant deviation from the expected ratios listed in the table above could indicate a bias resulting from artifactual variants.


Tools for performing variant evaluation

VariantEval

This is the GATK’s main tool for variant evaluation. It is designed to collect and calculate a variety of callset metrics that are organized in evaluation modules, which are listed in the tool doc. For each evaluation module that is enabled, the tool will produce a table containing the corresponding callset metrics based on the specified inputs (your callset of interest and one or more truth sets). By default, VariantEval will run with a specific subset of the available modules (listed below), but all evaluation modules can be enabled or disabled from the command line. We recommend setting the tool to produce only the metrics that you are interested in, because each active module adds to the computational requirements and overall runtime of the tool.

It should be noted that all module calculations only include variants that passed filtering (i.e. FILTER column in your vcf file should read PASS); variants tagged as filtered out will be ignored. It is not possible to modify this behavior. See the example analysis for more details on how to use this tool and interpret its output.
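
As an illustration (argument syntax varies between GATK versions, and the file names here are placeholders), a basic VariantEval run comparing your callset against a truth set and dbSNP might look like this:

gatk VariantEval \
    -R reference.fasta \
    -eval my_callset.vcf.gz \
    -comp truth_set.vcf.gz \
    -D dbsnp.vcf.gz \
    -O my_callset.eval.grp

The output is a GATKReport containing one table per enabled evaluation module.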

GenotypeConcordance

This tool calculates -- you’ve guessed it -- the genotype concordance between callsets. In earlier versions of GATK, GenotypeConcordance was itself a module within VariantEval. It was converted into a standalone tool to enable more complex genotype concordance calculations.

Picard tools

The Picard toolkit includes two tools that perform similar functions to VariantEval and GenotypeConcordance, respectively called CollectVariantCallingMetrics and GenotypeConcordance. Both are relatively lightweight in comparison to their GATK equivalents; their functionalities are more limited, but they do run quite a bit faster. See the example analysis of CollectVariantCallingMetrics for details on its use and data interpretation. Note that in the coming months, the Picard tools are going to be integrated into the next major version of GATK, so at that occasion we plan to consolidate these two pairs of homologous tools to eliminate redundancy.

Which tool should I use?

We recommend Picard's version of each tool for most cases. The GenotypeConcordance tools provide mostly the same information, but Picard's version is preferred by Broadies. Both VariantEval and CollectVariantCallingMetrics produce similar metrics, however the latter runs faster and scales better for larger cohorts. By default, CollectVariantCallingMetrics stratifies by sample, allowing you to see the value of relevant statistics as they pertain to specific samples in your cohort. It includes all metrics discussed here, as well as a few more. On the other hand, VariantEval provides many more metrics beyond the minimum described here for analysis. It should be noted that none of these tools use phasing to determine metrics.

So when should I use CollectVariantCallingMetrics?

  • If you have a very large callset
  • If you want to look at the metrics discussed here and not much else
  • If you want your analysis back quickly

When should I use VariantEval?

  • When you require a more detailed analysis of your callset
  • If you need to stratify your callset by another factor (allele frequency, indel size, etc.)
  • If you need to compare to multiple truth sets at the same time

(How to part II) Sensitively detect copy ratio alterations and allelic segments


Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.


This workflow is broken into two tutorials. You are currently on the second part. See Tutorial#11682 for the first part.

For this second part, the heart of the workflow is segmentation, performed by ModelSegments. In segmentation, contiguous copy ratios are grouped together into segments. The tool performs segmentation for both copy ratios and for allelic copy ratios, given allelic counts. The segmentation is informed by both types of data, i.e. the tool uses allelic data to refine copy ratio segmentation and vice versa. The tutorial refers to this multi-data approach as joint segmentation. The presented commands showcase the full features of the tools. It is possible to perform segmentation for each data type independently, i.e. based solely on copy ratios or based solely on allelic counts.

The tutorial illustrates the workflow using a paired sample set. Specifically, detection of allelic copy ratios uses a matched control, i.e. the HCC1143 tumor sample is analyzed using a control, the HCC1143 blood normal. It is possible to run the workflow without a matched-control. See section 8.1 for considerations in interpreting allelic copy ratio results for different modes and for different purities.

The GATK4 CNV workflow offers a multitude of levers, e.g. towards fine-tuning analyses and towards controls. Researchers are expected to tune workflow parameters on samples with similar copy number profiles as their case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters


5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts

CollectAllelicCounts will tabulate counts of the reference allele and counts of the dominant alternate allele for each site in a given genomic intervals list. The tutorial performs this step for both the case sample, the HCC1143 tumor, and the matched-control, the HCC1143 blood normal. This allele-specific coverage collection is just that--raw coverage collection without any statistical inferences. In the next section, ModelSegments uses the allele counts towards estimating allelic copy ratios, which in turn the tool uses to refine segmentation.

Collect allele counts for the case and the matched-control alignments independently with the same intervals. For the matched-control analysis, the allelic count sites for the case and control must match exactly. Otherwise, ModelSegments, which takes the counts in the next step, will error. Here we use an intervals list that subsets gnomAD biallelic germline SNP sites to those within the padded, preprocessed exome target intervals [9].

The tutorial has already collected allele counts for full length sample BAMs. To demonstrate coverage collection, the following command uses the small BAMs originally made for Tutorial#11136 [6]. The tutorial does not use the resulting files in subsequent steps.

Collect counts at germline variant sites for the matched-control

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I normal.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_N_clean.allelicCounts.tsv

Collect counts at the same sites for the case sample

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I tumor.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_T_clean.allelicCounts.tsv

This results in counts table files. Each data file has header lines that start with an @ asperand symbol, e.g. @HD, @SQ and @RG lines, followed by a table of data with six columns. An example snippet is shown.
T_allelicCounts_header.png
T_allelicCounts_snippet.png
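
For orientation, the six data columns are laid out roughly as follows (a sketch of the expected column header only, with no example values; verify the exact names against the CollectAllelicCounts tool documentation):

CONTIG  POSITION  REF_COUNT  ALT_COUNT  REF_NUCLEOTIDE  ALT_NUCLEOTIDE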

Comments on select parameters

  • The tool requires one or more genomic intervals specified with -L. The intervals can be either a Picard-style intervals list or a VCF. See Article#1109 for descriptions of formats. The sites should be common and/or sample-specific germline SNP-only sites; omit indel-type and mixed-variant-type sites.
  • The tool requires the reference genome, specified with -R, and aligned reads, specified with -I.
  • As is the case for most GATK tools, the engine filters reads upfront using a number of read filters. Of note for CollectAllelicCounts is the MappingQualityReadFilter. By default, the tool sets the filter's --minimum-mapping-quality to thirty. As a result, the tool will only include reads with MAPQ30 and above in the analysis [10].

☞ 5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?

Another GATK tool, GetPileupSummaries, similarly counts reference and alternate alleles. The resulting summaries are meant for use with CalculateContamination in estimating cross-sample contamination. GetPileupSummaries limits counts collections to those sites with population allele frequencies set by the parameters --minimum-population-allele-frequency and --maximum-population-allele-frequency. Details are here.

CollectAllelicCounts employs fewer engine-level read filters than GetPileupSummaries. Of note, both tools use the MappingQualityReadFilter. However, each sets a different threshold with the filter. GetPileupSummaries uses a --minimum-mapping-quality threshold of 50. In contrast, CollectAllelicCounts sets the --minimum-mapping-quality parameter to 30. In addition, CollectAllelicCounts filters on base quality. The base quality threshold is set with the --minimum-base-quality parameter, whose default is 20.


back to top


6. Group contiguous copy ratios into segments with ModelSegments

ModelSegments groups together copy and allelic ratios that it determines are contiguous on the same segment. A Gaussian-kernel binary-segmentation algorithm differentiates ModelSegments from the GATK4.beta tool PerformSegmentation, which ModelSegments replaces. The older tool used a CBS (circular binary segmentation) algorithm. ModelSegments' kernel algorithm enables efficient segmentation of dense data, e.g. that of whole genome sequences. A discussion of preliminary algorithm performance is here.

The algorithm performs segmentation for both copy ratios and for allelic copy ratios jointly when given both datatypes together. For allelic copy ratios, ModelSegments uses only those sites it determines are heterozygous, either in the control in a paired analysis or in the case in a case-only analysis [11]. In the paired analysis, the tool models allelic copy ratios in the case using sites for which the control is heterozygous. The workflow defines allelic copy ratios in terms of alternate-allele fraction, where total allele fractions for reference allele and alternate allele add to one for each site.

For the following command, be sure to specify an existing --output directory or . for the current directory.

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --output sandbox \
    --output-prefix hcc1143_T_clean

This produces the nine files listed below, each with the basename hcc1143_T_clean, in the --output directory. The param files contain global parameters for copy ratios (cr) and allele fractions (af), and the seg files contain data on the segments. For either type of data, the tool gives data before and after segmentation smoothing. The tool documentation details what each file contains. The last two files, labeled hets, contain the allelic counts for the control's heterozygous sites: counts for the matched control (normal) and for the case, respectively.

  1. hcc1143_T_clean.modelBegin.seg
  2. hcc1143_T_clean.modelFinal.seg
  3. hcc1143_T_clean.cr.seg
  4. hcc1143_T_clean.modelBegin.af.param
  5. hcc1143_T_clean.modelBegin.cr.param
  6. hcc1143_T_clean.modelFinal.af.param
  7. hcc1143_T_clean.modelFinal.cr.param
  8. hcc1143_T_clean.hets.normal.tsv
  9. hcc1143_T_clean.hets.tsv

The tool has numerous adjustable parameters and these are described in the ModelSegments tool documentation. The tutorial uses the default values for all of the parameters. Adjusting parameters can change the resolution and smoothness of the segmentation results.

Comments on select parameters

  • The tool accepts both or either copy-ratios (--denoised-copy-ratios) or allelic-counts (--allelic-counts) data. The matched-control allelic counts (--normal-allelic-counts) is optional. If given both types of data, then copy ratios and allelic counts data together inform segmentation for both copy ratio and allelic segments. If given only one type of data, then segmentation is based solely on the given type of data.
  • The --minimum-total-allele-count is set to 30 by default. This means the tool only considers sites with 30 or more read depth coverage for allelic copy ratios.
  • The --genotyping-homozygous-log-ratio-threshold option is set to -10.0 by default. Increase this to increase the number of sites assumed to be heterozygous for modeling.
  • Default smoothing parameters are optimized for faster performance, given the size of whole genomes. The --maximum-number-of-smoothing-iterations option caps smoothing iterations to 25. MCMC model sampling is also set to 100, for both copy-ratio and allele-fraction sampling by the --number-of-samples-copy-ratio and --number-of-samples-allele-fraction options, respectively. Finally, --number-of-smoothing-iterations-per-fit is set to zero by default to disable model refitting between iterations. What this means is that the tool will generate only two MCMC fits--an initial and a final fit.

    • GATK4.beta's ACNV set this parameter such that each smoothing iteration refit using MCMC, at the cost of compute. For the tutorial data, which is targeted exomes, the default of zero gives 398 segments after two smoothing iterations, while setting --number-of-smoothing-iterations-per-fit to one gives 311 segments after seven smoothing iterations. Section 8 plots these alternative results, and a command sketch for this setting follows this list.
  • For advanced smoothing recommendations, see [12].
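
For example, to reproduce the 311-segment result mentioned above, rerun ModelSegments with a single parameter changed from the earlier command (a sketch; the output prefix here is just a placeholder to keep the two sets of results separate):

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --number-of-smoothing-iterations-per-fit 1 \
    --output sandbox \
    --output-prefix hcc1143_T_clean_smoother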

Section 8 shows the results of segmentation, the result from changing --number-of-smoothing-iterations-per-fit and the result of allelic segmentation modeled from allelic counts data alone. Section 8.1 details considerations depending on analysis approach and purity of samples. Section 8.2 shows the results of changing the advanced smoothing parameters given in [12].

ModelSegments runs in the following three stages.

  1. Genotypes heterozygous sites and filters on depth and for sites that overlap with copy-ratio intervals.
    • Allelic counts for sites in the control that are heterozygous are written to hets.normal.tsv. For the same sites in the case, allelic counts are written to hets.tsv.
    • If given only allelic counts data, ModelSegments does not apply intervals.
  2. Performs multidimensional kernel segmentation.
    • Uses allelic counts within each copy-ratio interval for each contig.
    • Uses denoised copy ratios and heterozygous allelic counts.
  3. Performs Markov-Chain Monte Carlo (MCMC) sampling and segment smoothing. In particular, the tool uses Gibbs sampling and slice sampling. These MCMC samplings inform smoothing, i.e. merging adjacent segments, and the tool can perform multiple iterations of sampling and smoothing [13].
    • Fits initial model. Writes initial segments to modelBegin.seg, posterior summaries for copy-ratio global parameters to modelBegin.cr.param and allele-fraction global parameters to modelBegin.af.param.
    • Iteratively performs segment smoothing and sampling. Fits allele-fraction model [14] until log likelihood converges. This process produces global parameters.
    • Samples final models. Writes final segments to modelFinal.seg, posterior summaries for copy-ratio global parameters to modelFinal.cr.param, posterior summaries for allele-fraction global parameters to modelFinal.af.param and final copy-ratio segments to cr.seg.

At the second stage, the tutorial data generates the following message.

INFO  MultidimensionalKernelSegmenter - Found 638 segments in 23 chromosomes.

At the third stage, the tutorial data generates the following message.

INFO  MultidimensionalModeller - Final number of segments after smoothing: 398

For the tutorial data, the initial number of segments before smoothing is 638, over 23 contigs. After smoothing with default parameters, 398 segments remain.


back to top


7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments

CallCopyRatioSegments allows for systematic calling of copy-neutral, amplified and deleted segments. This step is not required for plotting segmentation results. Provide the tool with the cr.seg segmentation result from ModelSegments.

gatk CallCopyRatioSegments \
    --input hcc1143_T_clean.cr.seg \
    --output sandbox/hcc1143_T_clean.called.seg

The resulting called.seg data adds the sixth column to the provided copy ratio segmentation table. The tool denotes amplifications with a + plus sign, deletions with a - minus sign and neutral segments with a 0 zero.

Here is a snippet of the results.
T_called_seg.png

Comments on select parameters
- The parameters --neutral-segment-copy-ratio-lower-bound (default 0.9) and --neutral-segment-copy-ratio-upper-bound (default 1.1) together set the copy ratio range for copy-neutral segments. These two parameters replace the GATK4.beta workflow’s --neutral-segment-copy-ratio-threshold option.


back to top


8. Plot modeled copy ratio and allelic fraction segments with PlotModeledSegments

PlotModeledSegments visualizes copy and allelic ratio segmentation results.

gatk PlotModeledSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.hets.tsv \
    --segments hcc1143_T_clean.modelFinal.seg \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces plots in the plots folder. The plots represent final modeled segments for both copy ratios and alternate allele fractions. If we are curious about the extent of smoothing provided by MCMC, then we can similarly plot initial kernel segmentation results by substituting in --segments hcc1143_T_clean.modelBegin.seg.

Comments on select parameters
- The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping [4].
- To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

As of this writing, it is NOT possible to subset plotting with genomic intervals, i.e. with the -L parameter. To interactively visualize data, consider the following options.

  • Modify the sequence dictionary to contain only the contigs of interest, in the order desired.
  • Convert the data to bedGraph format for targeted exomes or to bigWig format for whole genomes. An example of CNV data converted to bedGraph and visualized in IGV is given in this discussion.
  • Alternatively, researchers versed in R may choose to visualize subsets of data using RStudio.

Below are three sets of results for the HCC1143 tumor cell line in order of increasing smoothing. The top plot of each set shows the copy ratio segments. The bottom plot of each set shows the allele fraction segments.

  • In the denoised copy ratio segment plot, individual targets still display as points on the plot. Different copy ratio segments are indicated by alternating blue and orange color groups. The denoised median is drawn in thick black.
  • In the allele fraction plot, the boxes surrounding the alternate allelic fractions do NOT indicate standard deviation nor standard error, which biomedical researchers may be more familiar with. Rather, the allelic fraction data is given in credible intervals. The allelic copy ratio plot shows the 10th, 50th and 90th percentiles. These should be interpreted with care as explained in section 8.1. Individual allele fraction data display as faint data points, also in orange and blue.

8A. Initial segmentation before MCMC smoothing gives 638 segments.
T_modelbegin.modeled.png

8B. Default smoothing gives 398 segments.
T_modelfinal.modeled.png

8C. Enabling additional smoothing iterations per fit gives 311 segments. See section 6 for a description of the --number-of-smoothing-iterations-per-fit parameter.
T_increase_smoothing_1.modeled.png

Smoothing accounts for data points that are outliers. Some of these outliers could be artifactual and therefore not of interest, while others could be true copy number variation that would then be missed. To understand the impact of joint copy ratio and allelic counts segmentation, compare the results of 8B to the single-data segmentation results below. Each plot below shows the results of modeling segmentation on a single type of data, either copy-ratios or allelic counts, using default smoothing parameters.

8D. Copy ratio segmentation based on copy ratios alone gives 235 segments.
T_caseonly.modeled.png

8E. Allelic segmentation result based on allelic counts alone in the matched case gives 105 segments.
T-matched-normal_just_allelic.modeled.png

Compare chr1 and chr2 segmentation for the various plots. In particular, pay attention to the p arm (left side) of chr1 and q arm (right side) of chr2. What do you think is happening when adjacent segments are slightly shifted from each other in some sets but then seemingly at the same copy ratio for other sets?

For allelic counts, ModelSegments retains 16,872 sites that are heterozygous in the control. Of these, the case presents 15,486 usable sites. In allelic segmentation using allelic counts alone, the tool uses all of the usable sites. In the matched-control scenario, ModelSegments emits the following message.

INFO  MultidimensionalKernelSegmenter - Using first allelic-count site in each copy-ratio interval (12668 / 15486) for multidimensional segmentation...

The message informs us that for the matched-control scenario, ModelSegments uses the first allelic-count site in each copy-ratio interval towards allelic modeling. For the tutorial data, this is 12,668 out of the 15,486, or 81.8%, of the usable allelic-count sites. The exclusion of ~20% of the allelic-count sites, together with the lack of copy ratio data informing segmentation, accounts for the difference we observe between this and the previous allelic segmentation plot.

In the allele fraction plot, some of the alternate-allele fractions are around 0.35/0.65 and some are at 0/1. We also see alternate-allele fractions around 0.25/0.75 and 0.5. These suggest ploidies that are multiples of one, two, three and four.

Is it possible a copy ratio of one is not diploid but represents some other ploidy?

For the plots above, focus on chr4, chr5 and chr17. Based on both the copy ratio and allelic results, what is the zygosity of each of the chromosomes? What proportion of each chromosome could be described as having undergone copy-neutral loss of heterozygosity?


☞ 8.1 Some considerations in interpreting allelic copy ratios

For allelic copy ratio analysis, the matched-control is a sample from the same individual as the case sample. In the somatic case, the matched-control is the germline normal sample and the case is the tumor sample from the same individual.

The matched-control case presents the following considerations.

  • If a matched control contains any region with copy number amplification, the skewed allele fractions still allow correct interpretation of the original heterozygosity.
  • However, if a matched control contains deleted regions or regions with copy-neutral loss of heterozygosity or a long stretch of homozygosity, e.g. as occurs in uniparental disomy, then these regions would go dark so to speak in that they become apparently homozygous and so ModelSegments drops them from consideration.
  • From population sequencing projects, we know the expected heterozygosity of normal germline samples averages around one in a thousand. However, the GATK4 CNV workflow does not account for any heterozygosity expectations. An example of such an analysis that utilizes SNP array data is HAPSEG. It is available on GenePattern.
  • If a matched normal contains tumor contamination, this should still allow for the normal to serve as a control. The expectation is that somatic mutations coinciding with common germline SNP sites will be rare and ModelSegments (i) only counts the dominant alt allele at multiallelic sites and (ii) recognizes and handles outliers. To estimate tumor in normal (TiN) contamination, see the Broad CGA group's deTiN.

Here are some considerations for detecting loss of heterozygosity regions.

  • In the matched-control case, if the case sample is pure, i.e. not contaminated with the control sample, then we see loss of heterozygosity (LOH) segments near alternate-allele fractions of zero and one.
  • If the case is contaminated with matched control, whether the analysis is matched or not, then the range of alternate-allele fractions becomes squished so to speak in that the contaminating normal's heterozygous sites add to the allele fractions. In this case, putative LOH segments still appear at the top and bottom edges of the allelic plot, at the lowest and highest alternate-allele fractions. For a given depth of coverage, the fraction of reads that differentiates zygosity is narrower in range and therefore harder to differentiate visually.

    8F. Case-only analysis of tumor contaminated with normal still allows for LOH detection. Here, we bluntly added together the tutorial tumor and normal sample reads. Results for the matched-control analysis are similar.
    mixTN_tumoronly.modeled.png

  • In the tumor-only case, if the tumor is pure, because ModelSegments drops homozygous sites from consideration and only models sites it determines are heterozygous, the workflow cannot ascertain LOH segments. Such LOH regions may present as an absence of allelic data or as low confidence segments, i.e. having a wide confidence interval on the allelic plot. Compare such a result below to that of the matched case in 8E above.

    8G. Allelic segmentation result based on allelic counts alone for case-only, when the case is pure, can produce regions of missing representation and low confidence allelic fraction segments.
    T-only_just_allelic.modeled.png

    Compare results. Focus on chr4, chr5 and chr17. While the matched-case gives homozygous zygosity for each of these chromosomes, the case-only allelic segmentation either presents an absence of segments for regions or gives low confidence allelic fraction segments at alternate allele fractions that are inaccurate, i.e. do not represent actual zygosity. This is particularly true for tumor samples where aneuploidy and LOH are common. Interpret case-only allelic results with caution.

Finally, remember the tutorial analyses above utilize allelic counts from gnomAD sites of common population variation that have been lifted-over from GRCh37 to GRCh38. For allelic count sites, use of sample-specific germline variant sites may incrementally increase resolution. Also, use of confident variant sites from a callset derived from alignments to the target reference may help decrease noise. Confident germline variant sites can be derived with HaplotypeCaller calling on alignments and subsequent variant filtration. Alternatively, it is possible to fine-tune ModelSegments smoothing parameters to dampen noise.
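
As a hedged sketch of deriving such sample-specific sites, with placeholder file names and only the minimal arguments shown (additional filtering, e.g. with VariantFiltration, is advisable):

gatk HaplotypeCaller \
    -R Homo_sapiens_assembly38.fasta \
    -I hcc1143_N_clean.bam \
    -O hcc1143_N.vcf.gz

gatk SelectVariants \
    -V hcc1143_N.vcf.gz \
    --select-type-to-include SNP \
    --restrict-alleles-to BIALLELIC \
    -O hcc1143_N.snps_biallelic.vcf.gz

The resulting sites could then be provided to CollectAllelicCounts in place of, or in addition to, the gnomAD sites.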


☞ 8.2 Some results of fine-tuning smoothing parameters

This section shows plotting results of changing some advanced smoothing parameters. The parameters and their defaults are given below, in the order of recommended consideration [12].

--number-of-changepoints-penalty-factor 1.0 \
--kernel-variance-allele-fraction 0.025 \
--kernel-variance-copy-ratio 0.0 \
--kernel-scaling-allele-fraction 1.0 \
--smoothing-credible-interval-threshold-allele-fraction 2.0 \
--smoothing-credible-interval-threshold-copy-ratio 2.0

The first four parameters impact segmentation while the last two parameters impact modeling. The following plots show the results of changing these smoothing parameters. The tutorial chose argument values arbitrarily, for illustration purposes. Results should be compared to that of 8B, which gives 398 segments.
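
For reference, these parameters are simply appended to the ModelSegments call. A minimal sketch corresponding to 8H below, with placeholder input file names:

gatk ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --number-of-changepoints-penalty-factor 5.0 \
    --output sandbox \
    --output-prefix hcc1143_T_clean_penalty5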

8H. Increasing changepoints penalty factor from 1.0 to 5.0 gives 140 segments.

image

8I. Increasing kernel variance parameters each to 0.8 gives 144 segments. Changing --kernel-variance-copy-ratio alone to 0.025 increases the number of segments greatly, to 1,266 segments. Changing it to 0.2 gives 414 segments.

image

8J. Decreasing kernel scaling from 1.0 to 0 gives 236 segments. Conversely, increasing kernel scaling from 1.0 to 5.0 gives 551 segments.

image

8K. Increasing both smoothing parameters each from 2.0 to 10.0 gives 263 segments.

image

back to top


Footnotes


[9] The GATK Resource Bundle provides two variations of a SNPs-only gnomAD project resource VCF. Both VCFs are sites-only eight-column VCFs but one retains the AC allele count and AF allele frequency variant-allele-specific annotations, while the other removes these to reduce file size.

  • For targeted exomes, it may be convenient to subset these to the preprocessed intervals, e.g. with SelectVariants for use with CollectAllelicCounts. This is not necessary, however, as ModelSegments drops sites outside the target regions from its analysis in the joint-analysis approach.
  • For whole genomes, depending on the desired resolution of the analysis, consider subsetting the gnomAD sites to those commonly variant, e.g. above an allele frequency threshold, as in the sketch after this list. Note that SelectVariants, as of this writing, can filter on AF allele frequency only for biallelic sites. Non-biallelic sites make up ~3% of the gnomAD SNPs-only resource.
  • For more resolution, consider adding sample-specific germline variant biallelic SNPs-only sites to the intervals. Section 8.1 shows allelic segmentation results for such an analysis.
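
As a hedged sketch of the whole-genome subsetting idea, using a placeholder gnomAD VCF name and an arbitrary 5% allele frequency cutoff:

gatk SelectVariants \
    -V af-only-gnomad.hg38.vcf.gz \
    --restrict-alleles-to BIALLELIC \
    -select "AF > 0.05" \
    -O gnomad_common_biallelic.vcf.gz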


[10] The MAPQ20 threshold of CollectAllelicCounts is lower than that used by CollectFragmentCounts, which uses MAPQ30.


[11] In particular, the tool considers only heterozygous sites that have counts for both the reference allele and the alternate allele. If multiple alternate alleles are present, the tool uses the alternate allele with the highest count and ignores any other alternate allele(s).


[12] These advanced smoothing recommendations are from one of the workflow developers, @slee.

  • For smoother results, first increase --number-of-changepoints-penalty-factor from its default of 1.0.
  • If the above does not suffice, then consider changing the kernel-variance parameters --kernel-variance-copy-ratio (default 0.0) and --kernel-variance-allele-fraction (default 0.025), or change the weighting of the allele-fraction data by changing --kernel-scaling-allele-fraction (default 1.0).
  • If such changes are still insufficient, then consider adjusting the smoothing-credible-interval-threshold parameters --smoothing-credible-interval-threshold-copy-ratio (default 2.0) and --smoothing-credible-interval-threshold-allele-fraction (default 2.0). Increasing these will more aggressively merge adjacent segments.


[13] In particular, ModelSegments uses Gibbs sampling, a type of MCMC sampling, towards both allele-fraction modeling and copy-ratio modeling, and additionally uses slice sampling towards allele-fraction modeling. @slee details the following substeps.

  1. Perform MCMC (Gibbs) to fit the copy-ratio model posteriors.
  2. Use optimization (of the log likelihood) to initialize the Markov Chain for the allele-fraction model.
  3. Perform MCMC (Gibbs and slice) to fit the allele-fraction model posteriors.
  4. The initial model is now fit. We write the corresponding modelBegin files, including those for global parameters.
  5. Iteratively perform segment smoothing.
  6. Perform steps 1-4 again, this time to generate the final model fit and modelFinal files.


[14] @slee shares that the tool initializes the MCMC by starting off at the maximum a posteriori (MAP) point in parameter space.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.

back to top


Mutect2 error "Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded"

$
0
0

Hello,
I've been trying to use Mutect2 to call somatic variants in an Illumina WGS dataset; however, after running for a while, Mutect2 stops with the following error:

14:47:25.199 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 1.399932108
14:47:25.200 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 651.683336832
14:47:25.200 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 3408.11 sec
14:47:25.202 INFO Mutect2 - Shutting down engine
[August 28, 2018 2:47:25 PM EDT] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 156.17 minutes.
Runtime.totalMemory()=30375673856
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.broadinstitute.hellbender.utils.smithwaterman.SmithWatermanJavaAligner.align(SmithWatermanJavaAligner.java:88)
at org.broadinstitute.hellbender.utils.read.CigarUtils.calculateCigar(CigarUtils.java:302)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.findBestPaths(ReadThreadingAssembler.java:211)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.runLocalAssembly(ReadThreadingAssembler.java:151)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyBasedCallerUtils.assembleReads(AssemblyBasedCallerUtils.java:261)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:185)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:212)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:291)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:979)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

I am using GATK 4.0.8.1 with java version 1.8.0_171. My pipeline was as follows: reads aligned with bwa mem, duplicates marked with Picard, indel calling/realignment done with GATK, and then finally somatic SNP calling with Mutect2. The command I used for Mutect2 was as follows:

java -jar /PATH/apps/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar Mutect2 \
-R /PATH/reference_lemna/Lemna_minor.fa \
-I /PATH/lemna_MA/GPL7_D_filtered.bam \
-tumor HI.3183.001.Index_15.GPL7_D_ \
-L /PATH/mutect/lemna_clipped_ends_targets.intervals \
-O /PATH/PON_Lemna_D_filtered_trimmed.gz
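
For reference, a hedged variant of this invocation that raises the Java heap and disables the GC overhead limit check (values are illustrative and assume the machine has the memory to spare):

java -Xmx48g -XX:-UseGCOverheadLimit -jar /PATH/apps/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar Mutect2 \
-R /PATH/reference_lemna/Lemna_minor.fa \
... (remaining arguments unchanged)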

Any ideas what I could do to resolve this issue?

Many thanks!
George Sandler


Data pre-processing for variant discovery

$
0
0

Purpose

This is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.



Reference Implementations

  • Prod* germline short variant per-sample calling. Summary: uBAM to GVCF; Notes: optimized for GCP; Github: yes; FireCloud: pending.
  • $5 Genome Analysis Pipeline. Summary: uBAM to GVCF or cohort VCF; Notes: optimized for GCP (see blog); Github: yes; FireCloud: hg38.
  • Generic data pre-processing. Summary: uBAM to analysis-ready BAM; Notes: universal; Github: yes; FireCloud: hg38 & b37.

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.


Expected input

This workflow is designed to operate on individual samples, for which the data is initially organized in distinct subsets called readgroups. These correspond to the intersection of libraries (the DNA product extracted from biological samples and prepared for sequencing, which includes fragmenting and tagging with identifying barcodes) and lanes (units of physical separation on the DNA sequencing chips) generated through multiplexing (the process of mixing multiple libraries and sequencing them on multiple lanes, for risk and artifact mitigation purposes).

Our reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM.
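
For example, Picard's FastqToSam, bundled with GATK4, can perform this conversion; a minimal per-readgroup sketch with placeholder names:

gatk FastqToSam \
    --FASTQ sample_rg1_R1.fastq.gz \
    --FASTQ2 sample_rg1_R2.fastq.gz \
    --OUTPUT sample_rg1.unmapped.bam \
    --SAMPLE_NAME sample1 \
    --READ_GROUP_NAME rg1 \
    --LIBRARY_NAME lib1 \
    --PLATFORM ILLUMINA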


Main steps

We begin by mapping the sequence reads to the reference genome to produce a file in SAM/BAM format sorted by coordinate. Next, we mark duplicates to mitigate biases introduced by data generation steps such as PCR amplification. Finally, we recalibrate the base quality scores, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read.

Map to Reference

Tools involved: BWA, MergeBamAlignment

This first processing step is performed per-read group and consists of mapping each individual read pair to the reference genome, which is a synthetic single-stranded representation of common genome sequence that is intended to provide a common coordinate framework for all genomic analysis. Because the mapping algorithm processes each read pair in isolation, this can be massively parallelized to increase throughput as desired.
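
A minimal per-readgroup sketch of this step, assuming uBAM input and piping reads through bwa and MergeBamAlignment (paths and names are placeholders; the production pipelines add many more options):

gatk SamToFastq \
    --INPUT sample_rg1.unmapped.bam \
    --FASTQ /dev/stdout \
    --INTERLEAVE true \
| bwa mem -p -t 8 Homo_sapiens_assembly38.fasta /dev/stdin \
| gatk MergeBamAlignment \
    --ALIGNED_BAM /dev/stdin \
    --UNMAPPED_BAM sample_rg1.unmapped.bam \
    --OUTPUT sample_rg1.aligned.bam \
    --REFERENCE_SEQUENCE Homo_sapiens_assembly38.fasta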

Mark Duplicates

Tools involved: MarkDuplicates, SortSam

This second processing step is performed per-sample and consists of identifying read pairs that are likely to have originated from duplicates of the same original DNA fragments through some artifactual processes. These are considered to be non-independent observations, so the program tags all but one of the read pairs within each set of duplicates, causing them to be ignored by default during the variant discovery process. This step constitutes a major bottleneck since it involves making a large number of comparisons between all the read pairs belonging to the sample, across all of its readgroups. It is followed by a sorting operation (not explicitly shown in the workflow diagram) that also constitutes a performance bottleneck, since it also operates across all reads belonging to the sample. Both algorithms continue to be the target of optimization efforts to reduce their impact on latency.
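
A minimal sketch of this step, with one MarkDuplicates call per sample across its readgroup BAMs followed by coordinate sorting (file names are placeholders):

gatk MarkDuplicates \
    --INPUT sample_rg1.aligned.bam \
    --INPUT sample_rg2.aligned.bam \
    --OUTPUT sample.markdup.bam \
    --METRICS_FILE sample.duplicate_metrics.txt

gatk SortSam \
    --INPUT sample.markdup.bam \
    --OUTPUT sample.markdup.sorted.bam \
    --SORT_ORDER coordinate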

Base (Quality Score) Recalibration

Tools involved: BaseRecalibrator, ApplyBQSR, AnalyzeCovariates (optional)

This third processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.
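
A minimal sketch of the two main recalibration steps, with placeholder names for the sample BAM and the known-sites resources:

gatk BaseRecalibrator \
    -I sample.markdup.sorted.bam \
    -R Homo_sapiens_assembly38.fasta \
    --known-sites dbsnp.vcf.gz \
    --known-sites known_indels.vcf.gz \
    -O sample.recal_data.table

gatk ApplyBQSR \
    -I sample.markdup.sorted.bam \
    -R Homo_sapiens_assembly38.fasta \
    --bqsr-recal-file sample.recal_data.table \
    -O sample.analysis_ready.bam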

Germline copy number variant discovery (CNVs)

$
0
0

Purpose

Identify germline copy number variants.


Diagram is not available


Reference implementation is not available


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.

Questions about the RNAseq variant discovery workflow

genomestrip throws unhelpful slurm error when using the slurm-drmaa bridge

$
0
0

I am using the SLURM-DRMAA bridge and the pipeline throws an obstinate error: "org.ggf.drmaa.InternalException: slurm_submit_batch_job: Invalid account or account/partition combination specified". There are other errors when I try things differently, so it makes sense to ask here. If I could get access to the script being submitted to the cluster, then I could investigate what it is that the SLURM system hates in my parameter specifications. Can I do that? From the error listing, it appears the Java VM catches the error and simply delegates the error message to my local system, which I would argue is bad practice.

Note that running things locally with -run and no -jobRunner specs works, but I want to use the cluster. My system admin says that since the processing stops at the pipeline level and there is no SLURM submission, he cannot really help me, and he suggested that I try different combinations. Here is what I have tried and the errors.

- -run -jobRunner Drmaa -gatkJobRunner Drmaa -jobNative "-A sens2016011-bianca" -jobNative "-p node" fails with org.ggf.drmaa.InternalException: slurm_submit_batch_job: Invalid account or account/partition combination specified
- -run -jobRunner Drmaa -gatkJobRunner Drmaa -jobNative "-A sens2016011-bianca" -jobNative "-p core" -jobNative "-n 1" fails with org.ggf.drmaa.InternalException: slurm_submit_batch_job: Invalid account or account/partition combination specified
- -run -jobRunner Drmaa -gatkJobRunner Drmaa -jobNative "-A sens2016011-bianca" -jobNative "-p core" -jobNative "-n 1" -jobNative "-t 20:00" same error as above
- -run -jobRunner Drmaa -gatkJobRunner Drmaa -jobNative "-A sens2016011-bianca" fails with Too many cores requested for -p core partition. Minimum cpus requested is 4294967294. To use more than  16 cores, request -p node.
- -run -jobRunner Drmaa -gatkJobRunner Drmaa -jobNative "-A sens2016011-bianca" -jobNative "-N 1" same as above
- -run -jobRunner Drmaa -gatkJobRunner Drmaa -jobNative A sens2016011-bianca -jobNative p core -jobNative N 1 fails with Unable to submit job: Invalid native specification: A sens2016011-bianca p core N 1
- -run -jobRunner Drmaa fails with Use the flag -A to specify an active project with allocation on this cluster.

Here is the full error listing:

$ java -Xmx4g -cp /proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/lib/SVToolkit.jar:/proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/lib/gatk/Queue.jar org.broadinstitute.gatk.queue.QCommandLine -S /proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/qscript/SVPreprocess.q -S /proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/qscript/SVQScript.q -gatk /proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/lib/gatk/GenomeAnalysisTK.jar -configFile /proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/conf/genstrip_parameters.txt -R /sw/data/uppnex/GATK/2.8/b37/human_g1k_v37.fasta -I /proj/sens2016011/nobackup/melt/data/bam_links/00028285.sorted.bam -md meta -bamFilesAreDisjoint true -jobLogDir /proj/sens2016011/nobackup/genomestrip/tests/logs -run -jobRunner Drmaa -gatkJobRunner Drmaa -jobNative "-A sens2016011-bianca" -jobNative "-p core" -jobNative "-n 1" -jobNative "-t 20:00"
INFO  17:10:33,709 QScriptManager - Compiling 2 QScripts 
INFO  17:11:13,568 QScriptManager - Compilation complete 
INFO  17:11:13,936 HelpFormatter - ---------------------------------------------------------------------- 
INFO  17:11:13,936 HelpFormatter - Queue v3.7.GS-r1748-0-g74bfe0b, Compiled 2018/04/10 10:30:23 
INFO  17:11:13,936 HelpFormatter - Copyright (c) 2012 The Broad Institute 
INFO  17:11:13,936 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  17:11:13,937 HelpFormatter - Program Args: -S /proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/qscript/SVPreprocess.q -S /proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/qscript/SVQScript.q -gatk /proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/lib/gatk/GenomeAnalysisTK.jar -configFile /proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/conf/genstrip_parameters.txt -R /sw/data/uppnex/GATK/2.8/b37/human_g1k_v37.fasta -I /proj/sens2016011/nobackup/melt/data/bam_links/00028285.sorted.bam -md meta -bamFilesAreDisjoint true -jobLogDir /proj/sens2016011/nobackup/genomestrip/tests/logs -run -jobRunner Drmaa -gatkJobRunner Drmaa -jobNative -A sens2016011-bianca -jobNative -p core -jobNative -n 1 -jobNative -t 20:00 
INFO  17:11:13,937 HelpFormatter - Executing as sergiun@sens2016011-bianca.uppmax.uu.se on Linux 3.10.0-862.3.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_171-b10. 
INFO  17:11:13,938 HelpFormatter - Date/Time: 2018/08/28 17:11:13 
INFO  17:11:13,938 HelpFormatter - ---------------------------------------------------------------------- 
INFO  17:11:13,938 HelpFormatter - ---------------------------------------------------------------------- 
INFO  17:11:13,953 QCommandLine - Scripting SVPreprocess 
INFO  17:11:15,238 QCommandLine - Added 190 functions 
INFO  17:11:15,257 QGraph - Generating graph. 
INFO  17:11:15,351 QGraph - Running jobs. 
INFO  17:11:17,092 FunctionEdge - Starting:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/castor/project/proj_nobackup/genomestrip/tests/batch/.queue/tmp'  '-cp' '/proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/lib/SVToolkit.jar:/proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/sens2016011/nobackup/genomestrip/lib/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.apps.ComputeGenomeSizes'  '-O' '/castor/project/proj_nobackup/genomestrip/tests/batch/meta/genome_sizes.txt'  '-R' '/sw/data/uppnex/GATK/2.8/b37/human_g1k_v37.fasta'    
INFO  17:11:17,093 FunctionEdge - Output written to /proj/sens2016011/nobackup/genomestrip/tests/logs/SVPreprocess-5.out 
ERROR 17:11:17,119 Retry - Caught error during attempt 1 of 4. 
org.ggf.drmaa.InternalException: slurm_submit_batch_job: Invalid account or account/partition combination specified
        at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:400)
        at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392)
        at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.runJob(JnaSession.java:79)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner.runJob(DrmaaJobRunner.scala:115)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner$$anonfun$start$1.apply$mcV$sp(DrmaaJobRunner.scala:93)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner$$anonfun$start$1.apply(DrmaaJobRunner.scala:91)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner$$anonfun$start$1.apply(DrmaaJobRunner.scala:91)
        at org.broadinstitute.gatk.queue.util.Retry$.attempt(Retry.scala:50)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner.start(DrmaaJobRunner.scala:91)
        at org.broadinstitute.gatk.queue.engine.FunctionEdge.start(FunctionEdge.scala:101)
        at org.broadinstitute.gatk.queue.engine.QGraph.startOneJob(QGraph.scala:646)
        at org.broadinstitute.gatk.queue.engine.QGraph.runJobs(QGraph.scala:507)
        at org.broadinstitute.gatk.queue.engine.QGraph.run(QGraph.scala:168)
        at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:170)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
        at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
ERROR 17:11:17,121 Retry - Retrying in 1.0 minute. 

Unable to access jarfile when running on docker and local computer

Could you help me with this GenomeSTRIP error message?

$
0
0

INFO 13:25:42,533 29-Aug-2018 SVDiscovery - Locus search window: 16:1-500000
Caught exception while processing read: E00322:200:HK5JVALXX:5:1204:17827:193291 63 16 59894 0 148M2S = 59997 249 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCGA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJFJFJJJ7JFJJFFJFJJJ-AFJFJ7JA7A<AJ<AJJ7AAFJFFJAJA--A<AJF-<FFFF-FA-<7---77F-7-AA-----7F-- BC:Z:none RG:Z:4 NM:i:112 SM:i:0 AS:i:0

ERROR --
ERROR stack trace

java.lang.RuntimeException: Error processing input from 51-08665_S1.bam: null
at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.runTraversal(DeletionDiscoveryAlgorithm.java:160)
at org.broadinstitute.sv.discovery.SVDiscoveryWalker.onTraversalDone(SVDiscoveryWalker.java:105)
at org.broadinstitute.sv.discovery.SVDiscoveryWalker.onTraversalDone(SVDiscoveryWalker.java:40)
at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:115)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:141)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:91)
at org.broadinstitute.sv.main.SVDiscovery.main(SVDiscovery.java:21)
Caused by: java.lang.NullPointerException
at org.broadinstitute.sv.metadata.isize.InsertRadiusMap.init(InsertRadiusMap.java:52)
at org.broadinstitute.sv.metadata.isize.InsertRadiusMap.<init>(InsertRadiusMap.java:35)
at org.broadinstitute.sv.discovery.ReadPairInsertSizeSelector.initMinimumRadiusMap(ReadPairInsertSizeSelector.java:102)
at org.broadinstitute.sv.discovery.ReadPairInsertSizeSelector.getMinimumInsertSize(ReadPairInsertSizeSelector.java:86)
at org.broadinstitute.sv.discovery.ReadPairInsertSizeSelector.selectReadPairRecord(ReadPairInsertSizeSelector.java:62)
at org.broadinstitute.sv.discovery.ReadPairRecordSelector.selectReadPairRecord(ReadPairRecordSelector.java:107)
at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.processRead(DeletionDiscoveryAlgorithm.java:173)
at org.broadinstitute.sv.discovery.DeletionDiscoveryAlgorithm.runTraversal(DeletionDiscoveryAlgorithm.java:152)
... 11 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7.GS-r1748-0-g74bfe0b):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Error processing input from 51-08665_S1.bam: null
ERROR ------------------------------------------------------------------------------------------

Exception in SplitNCigarReads

$
0
0

Hello,

I am getting the following exception when running SplitNCigarReads on RNA-Seq data using GATK 4.0.8.1:

java.lang.ArrayIndexOutOfBoundsException: 100
at org.broadinstitute.hellbender.tools.walkers.rnaseq.OverhangFixingManager.overhangingBasesMismatch(OverhangFixingManager.java:313)
at org.broadinstitute.hellbender.tools.walkers.rnaseq.OverhangFixingManager.fixSplit(OverhangFixingManager.java:252)
at org.broadinstitute.hellbender.tools.walkers.rnaseq.OverhangFixingManager.addReadGroup(OverhangFixingManager.java:209)
at org.broadinstitute.hellbender.tools.walkers.rnaseq.SplitNCigarReads.splitNCigarRead(SplitNCigarReads.java:270)
at org.broadinstitute.hellbender.tools.walkers.rnaseq.SplitNCigarReads.firstPassApply(SplitNCigarReads.java:180)
at org.broadinstitute.hellbender.engine.TwoPassReadWalker.lambda$traverseReads$0(TwoPassReadWalker.java:62)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.TwoPassReadWalker.traverseReads(TwoPassReadWalker.java:60)
at org.broadinstitute.hellbender.engine.TwoPassReadWalker.traverse(TwoPassReadWalker.java:42)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:979)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

Command:

./gatk/gatk \
   SplitNCigarReads \
   --reference $REF \
   --input test3.bam \
   --output output.bam \
   --verbosity DEBUG \
    > split.log 2>&1

Running ValidateSamFile does not reveal anything suspicious and visual inspection of the reads also appears to be fine.

Error : Duplicate allele added to VariantContext

$
0
0

Hello,

I downloaded the gnomAD VCF files in GRCh38 (remapped from hg19). I want to convert this VCF into a table, so I tried to use VariantsToTable.
But I get this error:
The provided VCF file is malformed at approximately line number 10609: Duplicate allele added to VariantContext: C

line 10609 :
10 3101451 rs4881080 C C,T,G 191085017.71 PASS AC=237624,66,1;AF=9.99663e-01,2.77656e-04,4.20691e-06

I agree that C appears in both the REF and ALT columns, but is there a way to use VariantsToTable without it checking the integrity of the VCF file?

Thanks,

Steven

PathSeq resource bundle on Google Cloud?

$
0
0

Hi all - is the GATK4 PathSeq resource bundle located at ftp://ftp.broadinstitute.org/bundle/pathseq/ also available in a Google Cloud Bucket or on FireCloud? I am running PathSeq on FireCloud and would like to use the files in the PathSeq resource bundle - hopefully there is a way to do so without downloading and re-uploading them into a Google Cloud Bucket.

Thanks for your help!


GATK ERROR MESSAGE: 38 HaplotypeCaller

$
0
0

Hello everyone,

I'm using the HaplotypeCaller program on whole sheep genomes.

The next paragraph shows the command used for all 158 samples. We use nodes with 16 cores (-ntc 16) and 28 GB of RAM.

Could you tell me what might be causing the error below?
Thank you in advance
-------------------------------------------------------------------------------------------------------------------------------------------------------------

ERROR --
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 38
at org.broadinstitute.gatk.tools.walkers.annotator.BaseQualityRankSumTest.getElementForRead(BaseQualityRankSumTest.java:96)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.getElementForRead(RankSumTest.java:209)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.fillQualsFromLikelihoodMap(RankSumTest.java:187)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.annotate(RankSumTest.java:104)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContextForActiveRegion(VariantAnnotatorEngine.java:315)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContextForActiveRegion(VariantAnnotatorEngine.java:260)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.annotateCall(HaplotypeCallerGenotypingEngine.java:328)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:290)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:970)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:252)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR

##### ERROR MESSAGE: 38

##### ERROR ------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------------------------------------------------------------------


