Channel: Recent Discussions — GATK-Forum

What is the --intervals argument when using GenomicsDBImport in the germline short variant discovery pipeline?

What is the --intervals argument when using GenomicsDBImport in germline short variant discovery? Kindly let me know what the input file for the --intervals argument should be.
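For context, a minimal sketch of the kind of invocation being asked about (a hedged sketch, not an official answer): -L / --intervals accepts either a literal interval string (e.g. chr20 or chr20:1-2000000) or a file of intervals such as a Picard-style .interval_list, a .bed file, or a plain .list with one interval per line. The file and workspace names below are placeholders.

```
gatk GenomicsDBImport \
    --genomicsdb-workspace-path my_database \
    --sample-name-map cohort.sample_map \
    -L intervals.list        # or e.g. -L chr20
```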

GenotypeGVCFs shows no annotations on part of the data

Hi, I am doing an exome sequencing analysis with 31 samples, following the Best Practices.

-HaplotypeCaller:
java1.8 -Xmx10g -jar $gatk_jar HaplotypeCaller --native-pair-hmm-threads 20 -R $reference_dir -I ${SAMPLENAME}_bqsr.bam -O ${SAMPLENAME}.g.vcf.gz --genotyping-mode DISCOVERY -ERC GVCF

-GenomicsDBImport:
less target.list| while read line ; do gatk --java-options -DGATK_STACKTRACE_ON_USER_EXCEPTION=true GenomicsDBImport --genomicsdb-workspace-path diabetes_DB_$line -L $line --sample-name-map LB_map --batch-size 50 --tmp-dir tmp --reader-threads 5 ; done

-GenotypeGVCFs:
gatk GenotypeGVCFs -R /home/Graceca/references/hs38/hs38DH.fa -V gendb://diabetes_DB_chr9:3824126-4310693 -O test_out.vcf.gz --tmp-dir ./tmp

-some output :
15:56:21.275 INFO GenotypeGVCFs - GCS max retries/reopens: 20
15:56:21.275 INFO GenotypeGVCFs - Requester pays: disabled
15:56:21.275 INFO GenotypeGVCFs - Initializing engine
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
15:56:22.246 INFO GenotypeGVCFs - Done initializing engine
15:56:22.304 INFO ProgressMeter - Starting traversal
15:56:22.304 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
15:56:23.354 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
15:56:23.411 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
15:56:23.629 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
15:56:23.631 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
15:56:23.646 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
15:56:23.743 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
15:56:23.770 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
15:56:23.809 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples


This is the output from the 31 GVCF files:

chr9 3836809 . C A 11.03 . AC=2;AF=0.067;AN=30;DP=38;ExcessHet=0.0755;FS=0.000;InbreedingCoeff=0.2923;MLEAC=1;MLEAF=0.033;MQ=60.00;QD=5.51;SOR=0.693 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 0/0:5,0:5:15:0,15,109 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 0/0:2,0:2:6:0,6,38 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 0/0:2,0:2:6:0,6,37 ./.:0,0:0:.:0,0,0 0/0:2,0:2:6:0,6,39 0/0:2,0:2:6:0,6,37 ./.:0,0:0:.:0,0,0 0/0:1,0:1:3:0,3,14 1/1:0,2:2:6:49,6,0 0/0:2,0:2:6:0,6,37 ./.:0,0:0:.:0,0,0 0/0:2,0:2:6:0,6,43 0/0:2,0:2:6:0,6,37 0/0:2,0:2:6:0,6,39 0/0:5,0:5:15:0,15,114 0/0:2,0:2:6:0,6,39 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 0/0:4,0:4:12:0,12,79 0/0:3,0:3:9:0,9,59 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0

Some samples have no annotations at all. May I know what causes that?

Use GenomicsDBImport to extract coding regions from 88 human WGS GVCFs

I want to extract coding regions from 88 WGS GVCFs using GenomicsDBImport (followed by GenotypeGVCFs). I have a list of ~222,000 intervals and was thinking of using the --merge-input-intervals parameter (if appropriate for WGS) and scattering the process across 10 different jobs. Is that a good way to speed up the process? I am using GATK 4.1.0.0, but I can't use Cromwell on our local servers. Thanks!
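A rough sketch of the scatter approach described above, under a few assumptions: the ~222,000 intervals live in intervals.list, GNU split is available, and the chunk count, file names, and batch size are placeholders rather than recommendations.

```
# Split the interval list into 10 roughly equal chunks and import each chunk into its own workspace
split -n l/10 -d --additional-suffix=.list intervals.list chunk_
for chunk in chunk_*.list; do
    gatk GenomicsDBImport \
        --genomicsdb-workspace-path db_${chunk%.list} \
        --sample-name-map cohort.sample_map \
        -L ${chunk} \
        --merge-input-intervals \
        --batch-size 50 &
done
wait   # each workspace is then genotyped separately with GenotypeGVCFs
```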

Errors in Mutect2 AD, F1R2 and F2R1 counts

Hello GATK team,

We are using GATK 4.0.10.1, and we noticed errors in the Mutect2 ALT/REF counts that we cannot explain. Some variants are reported with supporting coverage from both forward and reverse reads where there is no reverse read carrying the alternative allele.
We looked at the original BAM and the --bam-output result with samtools tview, and neither of them displays the same numbers as the Mutect2 VCF.

Here is an example with the Mutect2 VCF line and the corresponding counts in the original BAM and the bam-out from Mutect2. Mutect2 reports 5 alternative reads on the reverse strand, where we see at most one in the Mutect2 BAM.

We noticed this for ~2% of the driver mutations that were "wrongly(?)" called in our project.

Hopefully it comes down to our understanding of the data; if so, could you explain to us how to correctly check Mutect2 results?

Thanks,

Ismael

############
## Mutect2 VCF
X 76939699 . T C . . DP=27;ECNT=1;POP_AF=5.000e-08;TLOD=32.34 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/1:16,9:0.370:25:10,4:6,5:34,42:195,232:60:19:0.222,0.364,0.360:0.069,0.013,0.918

ref = 16
alt = 9

ALT_F1R2: 4
ALT_F2R1: 5
REF_F1R2: 10
REF_F2R1: 6

############
## Original bam

ref = 17
alt = 8

ALT_F1R2: 8
ALT_F2R1: 0
REF_F1R2: 12
REF_F2R1: 5


############
## Mutect2 --bam-output
ref = 18
alt = 9

ALT_F1R2: 8
ALT_F2R1: 1
REF_F1R2: 12
REF_F2R1: 6

Read Downsampling

Hello,
I have sequences at 50X coverage from three apple cultivars. I was wondering whether GATK can process them, or whether it still has a read-downsampling problem. If GATK does not support this, what other pipeline would you recommend?

Error with DBimport

The following error is shown in my log file when I use DBimport for chr09:

A USER ERROR has occurred: Couldn't read file. Error was: Failure while waiting for FeatureReader to initialize with exception: htsjdk.tribble.TribbleException: Line 104: there aren't enough columns for line > . . END=22917210 GT:DP:GQ:MIN_DP:PL 0/0:7:21:7:0,21,276 (we expected 9 tokens, and saw 6 )

Then I check that line, which is:
chr09 22917210 . G . . END=22917210 GT:DP:GQ:MIN_DP:PL 0/0:7:21:7:0,21,276

I have 10 columns for this line, so I don't understand why I get an error like this.
The next line is:

chr10 46602 . A . . END=46604 GT:DP:GQ:MIN_DP:PL 0/0:20:51:20:0,51,765
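(A hedged diagnostic sketch, not an official answer: the "expected 9 tokens, and saw 6" message reflects how the line splits on tabs, so one quick check is to count the tab-separated fields on the exact line the error points at and to look for literal tabs versus spaces. The GVCF file name below is a placeholder.)

```
# Count tab-separated fields on the line number reported in the error
zcat sample.g.vcf.gz | sed -n '104p' | awk -F'\t' '{ print NF, "tab-separated fields" }'
# Show whitespace explicitly: tabs appear as ^I, so space-separated fields stand out
zcat sample.g.vcf.gz | sed -n '104p' | cat -A | head -c 300
```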

GATK GermlineCNVcaller Expected Run Time Calculations

Hello

I am working with a particularly large amount of data and have read about the limitations posed by the GermlineCNVCaller runtime. I wanted to ask if anyone knows a way to estimate the anticipated runtime from the number of BAM files (with their total size in GB), the number of CPUs used, and the amount of RAM available.

Thank You!

ERROR Stack trace: java.lang.NumberFormatException: For input string: "2520503224"

Hi-

I am working with GenomeSTRiP v2.0 on some bovine samples. Java is version 1.8.0_201. I created my own reference ploidy map as discussed on the Reference Genome Metadata page using the locations from my reference index:

```
X 139009144 2520503224 F 2
X 139009144 2520503224 M 1
Y 43300181 2661249986 F 0
Y 43300181 2661249986 M 1
* * * * 2
```

However, I am running into this error:

```
##### ERROR stack trace
java.lang.NumberFormatException: For input string: "2520503224"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:583)
    at java.lang.Integer.parseInt(Integer.java:615)
    at org.broadinstitute.sv.metadata.ploidy.PloidyMap.parsePloidyMapFile(PloidyMap.java:252)
    at org.broadinstitute.sv.metadata.ploidy.PloidyMap.open(PloidyMap.java:55)
    at org.broadinstitute.sv.metadata.depth.ComputeReadDepthCoverageWalker.initialize(ComputeReadDepthCoverageWalker.java:131)
    at org.broadinstitute.sv.metadata.ComputeMetadataWalker.initialize(ComputeMetadataWalker.java:202)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.sv.main.SVCommandLine.execute(SVCommandLine.java:141)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:91)
    at org.broadinstitute.sv.main.SVCommandLine.main(SVCommandLine.java:65)
##### ERROR ------------------------------------------------------------------------------------------
```

I am not sure if this is an exception based on my ploidy map or a format issue with java and the size of the integer.
Any insight would be greatly appreciated!

Thanks,
Beth
(I apologize for any format issues)

supporting dataset for CalculateGenotypePosteriors

Dear team,

I am relatively new to the GATK environment, so please forgive me if I missed something obvious. I realize that a similar question has come up before, but I did not find an answer that solved my problem.

I am trying to run CalculateGenotypePosteriors with a supporting dataset. In the tool documentation you use 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz, which I downloaded from the GATK bundle

console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/


When I run
gatk CalculateGenotypePosteriors -R Homo_sapiens_assembly38.fasta -V in.vcf.gz -O out.vcf.gz -supporting 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz

the result is a user error

A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths:
contig reference = chr15 / 101991189
contig features = chr15 / 90338345.

All my alignment, variant calling and genotyping has been done with the same Homo_sapiens_assembly38.fasta file (obtained from the GATK bundle). I am using GATK 4.0.6.0.

Running
gatk ValidateVariants -R Homo_sapiens_assembly38.fasta -V in.vcf.gz --dbsnp GATK-bundle/dbsnp_138.hg38.vcf.gz

completed without an error. So my question is: Is there a problem with this supporting input file (1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz)? Is there another file I could use? Are there other tests I could use to check for the integrity of my input vcf files?
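(A hedged sketch of one more check, using the files named above: compare the chr15 length declared in the supporting VCF header with the length recorded in the reference sequence dictionary; this assumes the .dict file sits next to the .fasta.)

```
# chr15 contig length declared in the supporting VCF header
zcat 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz | grep -m1 '##contig=<ID=chr15,'
# chr15 length recorded in the reference sequence dictionary
grep -wm1 'SN:chr15' Homo_sapiens_assembly38.dict
```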

Best wishes,

Georg

Base Quality Score Recalibration (BQSR)

BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to anyone, especially Spanish-speaking users, who get tripped up by the similarity of these names.


Contents

  1. Overview
  2. Base recalibration procedure details
  3. Important factors for successful recalibration
  4. Examples of pre- and post-recalibration metrics
  5. Recalibration report

1. Overview

It's all about the base, 'bout the base (quality scores)

Base quality scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion base calls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases -- which is a lot of bad bases. The quality score each base call gets is determined through some dark magic jealously guarded by the manufacturer of the sequencing machines.

Why does it matter? Because our short variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a base call that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally our impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

This procedure can be applied to BAM files containing data from any sequencing platform that outputs base quality scores on the expected scale. We have run it ourselves on data from several generations of Illumina, SOLiD, 454, Complete Genomics, and Pacific Biosciences sequencers.

That sounds great! How does it work?

The base recalibration process involves two key steps: first the BaseRecalibrator tool builds a model of covariation based on the input data and a set of known variants, producing a recalibration file; then the ApplyBQSR tool adjusts the base quality scores in the data based on the model, producing a new BAM file. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
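A minimal command-line sketch of the two steps, plus the optional before/after comparison (GATK4 syntax; the file names are placeholders):

```
# Step 1: build the recalibration model from the input BAM and the known variant sites
gatk BaseRecalibrator \
    -R reference.fasta \
    -I input.bam \
    --known-sites dbsnp.vcf.gz \
    -O recal.table

# Step 2: apply the model to produce a recalibrated BAM
gatk ApplyBQSR \
    -R reference.fasta \
    -I input.bam \
    --bqsr-recal-file recal.table \
    -O recalibrated.bam

# Optional QC: build a second model on the recalibrated BAM and plot before/after
gatk BaseRecalibrator -R reference.fasta -I recalibrated.bam --known-sites dbsnp.vcf.gz -O recal.after.table
gatk AnalyzeCovariates -before recal.table -after recal.after.table -plots recalibration_plots.pdf
```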


2. Base recalibration procedure details

BaseRecalibrator builds the model

To build the recalibration model, this first tool goes through all of the reads in the input BAM file and tabulates data about the following features of the bases:

  • read group the read belongs to
  • quality score reported by the machine
  • machine cycle producing this base (Nth cycle = Nth base from the start of the read)
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to the known variants resource (typically dbSNP). This information is output to a recalibration file in GATKReport format.

Note that the recalibrator applies a "Yates" correction for low-occupancy bins: rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting; the correction has only a minor impact on data sets with billions of bases, but it is critical for avoiding overconfidence in rare bins in sparse data.
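A quick worked illustration with invented bin counts (a sketch, not output from the tool):

```
# Illustration only: hypothetical bin with 1,000 observed bases and 10 mismatches
awk 'BEGIN {
    naive     = 10 / 1000;              # uncorrected error rate -> Q20.0
    corrected = (10 + 1) / (1000 + 2);  # with the +1/+2 correction -> ~Q19.6
    printf "naive Q = %.1f, corrected Q = %.1f\n", -10*log(naive)/log(10), -10*log(corrected)/log(10)
}'
```

The smaller the bin, the more the correction pulls the estimate toward lower confidence.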

ApplyBQSR adjusts the scores

This second tool goes through all the reads again, using the recalibration file to adjust each base's score based on which bins it falls in. So effectively the new quality score is:

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effect

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as variant calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
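To make the composition concrete with purely invented numbers: a base reported at Q25, in a read group whose global shift is -1.2, with a Q25 bin shift of +0.4, a cycle shift of -0.3, and a dinucleotide-context shift of +0.2, would come out at roughly 25 - 1.2 + 0.4 - 0.3 + 0.2 ≈ 24.1 before any quantization is applied.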


3. Important factors for successful recalibration

Read groups

The recalibration system is read-group aware, meaning it uses @RG tags to partition the data by read group. This allows it to perform the recalibration per read group, which reflects which library a read belongs to and what lane it was sequenced in on the flowcell. We know that systematic biases can occur in one lane but not the other, or one library but not the other, so being able to recalibrate within each unit of sequence data makes the modeling process more accurate. As a corollary, that means it's okay to run BQSR on BAM files with multiple read groups. However, please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.

Amount of data

A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. This procedure will not work well on a small number of aligned reads. We usually expect to see more than 100M bases per read group; as a rule of thumb, larger numbers will work better.

No excuses

You should almost always perform recalibration on your sequencing data. In human data, given the exhaustive databases of variation we have available, almost all of the remaining mismatches -- even in cancer -- will be errors, so it's super easy to ascertain an accurate error model for your data, which is essential for downstream analysis. For non-human data it can be a little bit more work since you may need to bootstrap your own set of variants if no such resources are already available for your organism, but it's worth it.

Here's how you would bootstrap a set of known variants (a command-line sketch follows the list):

  • First do an initial round of variant calling on your original, unrecalibrated data.
  • Then take the variants that you have the highest confidence in and use that set as the database of known variants by feeding it as a VCF file to the BaseRecalibrator.
  • Finally, do a real round of variant calling with the recalibrated data. These steps could be repeated several times until convergence.
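A hedged sketch of one bootstrap round (GATK4 tool names; the file names and the QD filtering threshold are placeholders, not recommendations):

```
# Round 1: call variants on the unrecalibrated BAM
gatk HaplotypeCaller -R ref.fasta -I sample.unrecal.bam -O round1.vcf.gz

# Keep only the calls you are most confident in, to serve as the provisional "known" set
gatk VariantFiltration -R ref.fasta -V round1.vcf.gz \
    --filter-name "lowQD" --filter-expression "QD < 2.0" -O round1.flagged.vcf.gz
gatk SelectVariants -R ref.fasta -V round1.flagged.vcf.gz --exclude-filtered -O known.round1.vcf.gz

# Recalibrate against that provisional known set
gatk BaseRecalibrator -R ref.fasta -I sample.unrecal.bam \
    --known-sites known.round1.vcf.gz -O recal.round1.table
gatk ApplyBQSR -R ref.fasta -I sample.unrecal.bam \
    --bqsr-recal-file recal.round1.table -O sample.recal.round1.bam

# Call again on the recalibrated BAM and iterate until the callsets converge
gatk HaplotypeCaller -R ref.fasta -I sample.recal.round1.bam -O round2.vcf.gz
```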

The main case where you really might need to skip BQSR is when you have too little data (some small gene panels have that problem), or you're working with a really weird organism that displays insane amounts of variation.


4. Examples of pre- and post-recalibration metrics

This shows recalibration results from a lane sequenced at the Broad by an Illumina GA-II in February 2010. This is admittedly not very recent but the results are typical of what we still see on some more recent runs, even if the overall quality of sequencing has improved. You can see there is a significant improvement in the accuracy of the base quality scores after applying the recalibration procedure. Note that the plots shown below are not the same as the plots that are produced by the AnalyzeCovariates tool.

[Figures: pre- and post-recalibration base quality metrics (not reproduced here). Note: the scale for number of bases differs between the two graphs.]


5. Recalibration report

The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSR for this dataset.

#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support for quantizing base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSR, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other GATK tool if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization. You can override this with the engine argument -qq: with -qq 0 you don't quantize qualities, while with -qq N you recalculate the quantization bins using N bins.

#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions, and deletions.

#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions, and deletions.

#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.

#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

Scheduled maintenance: Expected downtime 6:30-7am EST March 20

Summary

We're planning to upgrade one of the main servers at the heart of FireCloud in order to unlock some performance enhancements ahead of a training event taking place later this week. We plan to perform this operation between 6:30 and 7:00am EST on March 20, 2019, and expect it to take about 5 to 10 minutes to complete.

Impact

  • For the duration of the upgrade process, it won't be possible to log into the application, and it's very likely that any API calls will either timeout or fail.
  • Actively running workflows should not be impacted, but querying their status may time out or fail

For more information

If you would like to be notified of all service incidents or upcoming scheduled maintenance, click Follow on this page

HaplotypeCaller failed to detect variant.

I have experienced a variant detection issue in GATK 4.1 and older versions (v3.6) in the following examples.

I have two BAM files: NA24385_partial.bam and NA24143_partial.bam. In the IGV image, both samples show a variant at chr9:134784873 G>A. HaplotypeCaller detected the variant in NA24385 but failed to detect it in NA24143.

Detected:

NA24385_bp_region.vcf
chr9    134784873       .       G       A,<NON_REF>     2527.60 .       BaseQRankSum=-9.003;DP=364;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=1310400,364;ReadPosRankSum=-4.180  GT:AD:DP:GQ:PL:SB       0/1:256,103,0:359:99:2535,0,8789,3303,9099,12402:217,39,69,34

Failed:

NA24143_bp_region.vcf
chr9    134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:320,265:585:0:0,0,3619

At the variant site, NA24143's PL looks really odd: the likelihoods for the 0/0 and 0/1 genotypes are equal. Why did HaplotypeCaller assign such a high probability to 0/0 here?

Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.0.0-local.jar --version

Base quality for the variant region looks good enough.

$ samtools mpileup -r chr9:134784873-134784873 NA24143_partial.bam
[mpileup] 1 samples in 1 input files
[mpileup] Set max per-file depth to 8000
chr9    134784873       N       515     G$G$AAAGAAGGGGAAGGAAAAGAAGAAGAAGGGAGGGAGGGAGAGAGGGGGGGGGGGGGGAAAGAGA$GAAGGAGG$AGGAGAGGAGAGGGGGGAGGGGGGGGGAAGGGGGAAaAGGGAGGAAGaaGAAAAAGAAAGGAGGAGAGgGAAAGGAGgAAGGGGAgGGAGGaAGGAGAGGAGAAaAGAGGGAGGAAAGAGGGAGaAAGGAAAAGGGGGGGGGGGGGGGGGGGGGGGAAAAAAGGAGGAagAGGGGGGgGGGGAGAAGGAAGAGaaGGAAGaAAGAAAGGGGGGAAGGAAAGAAAGAGAAGAGGGggAGGAAAGAAGAGGAAggaAGAAGAGGAAGAAGAaggAGGGGagGAGAAAAGagAGAAAGGGagagagaagGGGAAAaAAGGAgGGAAGAGAGGGAAGGAGAAGAGAGGAAGGGAAAAAAGGGAAgaggGAGagGAGGAGGGAAGagagaggaGGAAAAGGAGGGGGAGGGGGGaaagGGAGGAAAGGA^]G^]G^]G^]G^]G^]A^]A^]G^]A^]A^]A^]G^]a^]a^]a^]g^]g       AA=;<?==????;?ADBB00E@CgC/ECbFFADE?E8FFADF/F@EFFEAF9FFFFAFF=.DF@F;F0DFF@A2DFF.F<FFDFDFFFFFFDFFDFFFDDD_8DFEEE8DBDFFFDFFCCEAAFDDDb@FD.DFFDFb@FdFDFDD8FFDFD@CFiFFDDFF/FFADFA@FDFFDFDDA8ADFFFDFFD0/FDFFhDFADCEFCDCDFFFFFFFFFFFFFFFFFFFFFFFDD<D/DFF/FFDADDFFFFFFDFFiFEFcEFF7@F.F@@FF7DF?DDFc0@FFFFFFDDFFE?7FE?D@.F1DF?FFFDDDFFCDDEC7EDEFaDDD?DF@DEOFFCCEDCFCAD@7@EFFdC2DF./D/FA@CE7CCEEE@C@DADAACEaEC?D@BCDD7DEEBCD?DB@DDCbEDCE76AACA3B6@BAA???==?AA@??C@@C3=@@CBABBABBBAAB?B?B?B@?:@?@@@@A@@AFFF1FEFFEE:::=@@?@@_??@@?>>???/?>6>>><<<??
$ samtools --version
samtools 1.2
Using htslib 1.2.1
Copyright (C) 2015 Genome Research Ltd.
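For anyone debugging a case like this, a hedged sketch of a common next step is to re-run HaplotypeCaller over just a small window around the site with a bamout, so the locally reassembled reads (rather than the original alignments) can be inspected in IGV; the window, reference name, and output names below are illustrative only:

```
gatk HaplotypeCaller \
    -R reference.fasta \
    -I NA24143_partial.bam \
    -L chr9:134784373-134785373 \
    -ERC GVCF \
    -bamout NA24143_debug.bam \
    -O NA24143_debug.g.vcf.gz
```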

large drop in variants in GATK 4.1.0.0 Mutect2

I recently switched to GATK 4.1.0.0. I ran Mutect2 on a small (~1Mb) targeted panel. I am using a normal control that is not the same individual (basically to exclude technical artifacts), so I do expect to see more variants than with a proper matched normal. I was getting around 100-300 variants per sample with GATK 4.0.6.0. I am still roughly in the same range for some samples, but for some I am getting 0.

The problem seems to be at the FilterMutectCalls stage where I am seeing the following error:

[March 19, 2019 10:43:17 PM EDT] org.broadinstitute.hellbender.tools.walkers.mutect.FilterMutectCalls done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=8851030016
java.lang.IllegalArgumentException: errorRate must be good probability but got NaN
    at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:730)
    at org.broadinstitute.hellbender.utils.QualityUtils.errorProbToQual(QualityUtils.java:227)
    at org.broadinstitute.hellbender.utils.QualityUtils.errorProbToQual(QualityUtils.java:211)
    at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2FilteringEngine.applyContaminationFilter(Mutect2FilteringEngine.java:79)
    at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2FilteringEngine.calculateFilters(Mutect2FilteringEngine.java:518)
    at org.broadinstitute.hellbender.tools.walkers.mutect.FilterMutectCalls.firstPassApply(FilterMutectCalls.java:130)
    at org.broadinstitute.hellbender.engine.TwoPassVariantWalker.lambda$traverseVariants$0(TwoPassVariantWalker.java:76)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.TwoPassVariantWalker.traverseVariants(TwoPassVariantWalker.java:74)
    at org.broadinstitute.hellbender.engine.TwoPassVariantWalker.traverse(TwoPassVariantWalker.java:27)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)

Do you know what may be causing this problem?

Run the germline GATK Best Practices Pipeline for $5 per genome

By Eric Banks, Director, Data Sciences Platform at the Broad Institute

Last week I wrote about our efforts to develop a data processing pipeline specification that would eliminate batch effects, in collaboration with other major sequencing centers. Today I want to share our implementation of the resulting "Functional Equivalence" pipeline spec, and highlight the cost-centric optimizations we've made that make it incredibly cheap to run on Google Cloud.

For a little background, we started transitioning our analysis pipelines to Google Cloud Platform in 2016. Throughout that process we focused most of our engineering efforts on bringing down compute cost, which is the most important factor for our production operation. It's been a long road, but all that hard work really paid off: we managed to get the cost of our main Best Practices analysis pipeline down from about $45 to $5 per genome! As you can imagine that kind of cost reduction has a huge impact on our ability to do more great science per research dollar -- and now, we’re making this same pipeline available to everyone.


The Best Practices pipeline I'm talking about is the most common type of analysis done on a 30x WGS: germline short variant discovery (SNPs and indels). This pipeline covers taking the data from unmapped reads all the way to an analysis-ready BAM or CRAM (i.e. the part covered by the Functional Equivalence spec), then either a single-sample VCF or an intermediate GVCF, plus 15 steps of quality control metrics collected at various points in the pipeline, totalling $5 in compute cost on Google Cloud. As far as I know this is the most comprehensive pipeline available for whole-genome data processing and germline short variant discovery (without skimping on QC and important cleanup steps like base recalibration).

Let me give you a real-world example of what this means for an actual project. In February 2017, our production team processed a cohort of about 900 30x WGS samples through our Best Practices germline variant discovery pipeline; the compute costs totalled $12,150 or $13.50 per sample. If we had run the version of this pipeline we had just one year prior (before the main optimizations were made), it would have cost $45 per sample; a whopping $40,500 total! Meanwhile we've made further improvements since February, and if we were to run this same pipeline today, the cohort would cost only $4,500 to analyze.

                              2016      2017      Today
# of Whole Genomes Analyzed   900       900       900
Total Compute Cost            $40,500   $12,150   $4,500
Cost per Genome Analyzed      $45       $13.50    $5

For the curious, the most dramatic reductions we saw came from using different machine types for each of the various tasks (rather than piping data between tasks), leveraging GCP’s preemptible VMs, and most recently incorporating NIO to minimize the amount of data localization involved. You can read more about these approaches on Google's blog. At this point the single biggest culprit for cost in the pipeline is BWA (the genome mapper), a problem which its author Heng Li is actively working to address through a much faster (but equivalently accurate) mapper. Once Heng's new mapper is available, we anticipate the cost per genome analyzed to drop below $3.

On top of the low cost of operating the pipeline, the other huge bonus we get from running this pipeline on the cloud is that we can get any number of samples done in the time it takes to do just one, due to the staggeringly elastic scalability of the cloud environment. Even though it takes a single genome 30 hours to run through the pipeline (and we're still working on speeding that up), we're able to process genomes at a rate of one every 3.6 minutes, and we've been averaging about 500 genomes completed per day.

We're making the workflow script for this pipeline available on GitHub under an open-source license so anyone can use it, and we're also providing it as a preconfigured pipeline in FireCloud, the pipelining service we run on Google Cloud. Anyone can access FireCloud for free; you just need to pay Google for any compute and storage costs you incur when running the pipelines. So to be clear, when you run this pipeline on your data in FireCloud, all $5 of compute costs will go directly to the cloud provider; we won't make any money off of it. And there are no licensing fees involved at any point!

As a cherry on the cake, our friends at Google Cloud Platform are sponsoring free credits to help first-time users get started with FireCloud: the first 1,000 applicants can get $250 worth of credits to cover compute and storage costs. You can learn more here on the FireCloud website if you're interested.

Of course, we understand that not everyone is on Google Cloud, so we are actively collaborating with other cloud vendors and technology partners to expand the range of options for taking advantage of our optimized pipelines. For example, the Chinese cloud giant Alibaba Cloud is developing a backend for Cromwell, the execution engine we use to run our pipelines. And it's not all cloud-centric either; we are also collaborating with our long-time partners at Intel to ensure our pipelines can be run optimally on on-premises infrastructure without compromising on quality.

In conclusion, this pipeline is the result of two years' worth of hard work by a lot of people, both on our team and on the teams of the institutions and companies we collaborate with. We're all really excited to finally share it with the world, and we hope it will make it easier for everyone in the community to get more mileage out of their research dollars, just like we do.
