Recent Discussions — GATK-Forum

Weird VariantRecalibrator result; should I add exome data from the 1000G Project into my WES analysis?

Hi, attached below is the result of running VariantRecalibrator on chr1 of 56 individuals. It seems that no true positives are discovered when truth sensitivity is 100.

known=(602444 @ 1.6475) novel=(238870 @ 0.3388)

gatk VariantRecalibrator -R $reference \
    -V ./LB56sa_chr1_output.vcf.gz \
    --resource hapmap,known=false,training=true,truth=true,prior=15.0:$hapmap \
    --resource omni,known=false,training=true,truth=false,prior=12.0:$omni \
    --resource 1000G,known=false,training=true,truth=false,prior=10.0:$oneKsnp \
    --resource dbsnp,known=true,training=false,truth=false,prior=2.0:$dbsnp \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP \
    -O LB56_chr1.output.recal \
    --tranches-file LB56_chr1.tranches \
    --rscript-file LB56.plots.R

May I have any advice on this case?

Create separate panels of normals for SNVs and Indels?

Hello,

I have created a panel of normals (PON) with >200 normal samples by running Mutect2 in tumor-only mode followed by CreateSomaticPanelOfNormals.
I have come across a few examples where the normals in the PON have indels at a particular site, but the somatic VCF has an SNV called at the exact same position, which then gets filtered because of the PON.
Similarly, I have seen sites that are in the PON because of recurrent SNPs in the normals, but that leads to filtering of an indel from the somatic VCF. These somatic calls looked real when I inspected them in IGV.
I was thinking of creating separate PONs from the SNVs and the indels called in the normals, and doing type-specific filtering of the somatic variants (see the sketch below). That way, sites where we recurrently find both SNPs and indels will appear in both lists, but a site where we only find recurrent SNPs will not make it into the indel PON, and vice versa.
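As a starting point, splitting an existing PON by variant type should be possible with SelectVariants; a minimal sketch, assuming GATK4 and hypothetical file names:

gatk SelectVariants -V pon.vcf.gz --select-type-to-include SNP -O pon.snv.vcf.gz
gatk SelectVariants -V pon.vcf.gz --select-type-to-include INDEL -O pon.indel.vcf.gz

Each type-specific PON would then be passed via --panel-of-normals when filtering calls of the matching type.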
I wanted to know if this is something you have already tried, and whether you would recommend it.

(howto) Run the genotype refinement workflow


Overview

This tutorial provides step-by-step instructions for applying the Genotype Refinement workflow (described in this method article) to your data.


Step 1: Derive posterior probabilities of genotypes

In this first step, we derive the posteriors of the genotype calls in our callset, recalibratedVariants.vcf, which has just come out of the VQSR filtering step. Among other samples, it contains a trio of individuals (mother, father and child) whose family structure is described in the pedigree file trio.ped (which you need to supply). To do this, we use the most comprehensive set of high-confidence SNPs available to us, a set of sites from Phase 3 of the 1000 Genomes Project (available in our resource bundle), which we pass via the --supporting argument.

 java -jar GenomeAnalysisTK.jar -R human_g1k_v37_decoy.fasta -T CalculateGenotypePosteriors --supporting 1000G_phase3_v4_20130502.sites.vcf -ped trio.ped -V recalibratedVariants.vcf -o recalibratedVariants.postCGP.vcf

This produces the output file recalibratedVariants.postCGP.vcf, in which the posteriors have been annotated wherever possible.


Step 2: Filter low quality genotypes

In this second, very simple step, we tag low-quality genotypes so we know not to use them in our downstream analyses. We use GQ 20 as the quality threshold, which means that any passing genotype has at least a 99% chance of being correct.

java -jar $GATKjar -T VariantFiltration -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf

Note that in the resulting VCF, the genotypes that failed the filter are still present, but they are tagged lowGQ in the FT field of the FORMAT column.
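If you prefer that downstream tools treat these genotypes as missing rather than merely tagged, one option is to convert filtered genotypes to no-calls with SelectVariants; a minimal sketch, assuming your GATK 3 version supports the --setFilteredGtToNocall flag and using a hypothetical output name:

java -jar $GATKjar -T SelectVariants -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.Gfiltered.vcf --setFilteredGtToNocall -o recalibratedVariants.postCGP.Gfiltered.nocall.vcf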


Step 3: Annotate possible de novo mutations

In this third and final step, we tag variants for which at least one family in the callset shows evidence of a de novo mutation based on the genotypes of the family members.

java -jar $GATKjar -T VariantAnnotator -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped trio.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf

The annotation output will include a list of the children with possible de novo mutations, classified as either high or low confidence.
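To pull out the high-confidence candidates for a first look, a simple shell sketch (assuming, per the method article, that the annotation writes a hiConfDeNovo key to the INFO field):

grep -v '^#' recalibratedVariants.postCGP.Gfiltered.deNovos.vcf | grep hiConfDeNovo | less -S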

See section 3 of the method article for a complete description of annotation outputs and section 4 for an example of a call and the interpretation of the annotation values.

gatk4 haplotypecaller

I am using the GATK4 HaplotypeCaller to call variants. I have a couple of questions:
1. Are there options to limit the read depth and allele frequency required before a call is made? For example: do not call variants if depth < 10 or variant allele frequency < 20%. (See the sketch after this list for the kind of filter I have in mind.)
2. Is HaplotypeCaller the best option for calling variants on bacterial sequencing data?
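For question 1, the closest I have found so far is post-call hard filtering rather than a calling-time threshold; a minimal sketch, assuming GATK4's VariantFiltration and a site-level DP annotation (the allele-frequency condition would need a genotype-level expression, which I have not worked out):

gatk VariantFiltration \
    -V calls.vcf.gz \
    --filter-expression "DP < 10" \
    --filter-name "LowDepth" \
    -O calls.filtered.vcf.gz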

GATK 3.7 reads are being thrown away due to poor mapping quality

Hello all,

I'm using GATK 3.7, Java 1.8, and the human genome. For specific regions I'm using a ploidy greater than 2, due to the specific aim of the study.

/usr/bin/java -Xmx10G -jar /mnt/mfs/hgrcgrid/shared/softwares/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T HaplotypeCaller -R CR1_region1.ploidy_6.fa -I washei36472_ploidy6CR1region1.bam -L CR1_region1.expanded.bed --sample_ploidy 6 --genotyping_mode DISCOVERY --emitRefConfidence GVCF --dontUseSoftClippedBases -o washei36472.3CR1region1.g.vcf

When I run this, the log on screen says:
``HCMappingQualityFilter - Filtering out reads with MAPQ < 20``

I don't want GATK to remove reads with poor mapping quality. I looked in the help, but it doesn't list any flag for mmq as suggested in the linked documentation.

Sorry, I can't quote/highlight the code and error text using the GitHub text guidelines. The GATK forum didn't allow me to post a link, maybe because my account is new.


Best,
dG

Core dump when using GATK 3.7 haplotyper

I'm using GATK 3.7, Java 1.8, and the human genome. For specific regions I'm using a ploidy greater than 2, due to the specific aim of the study.

/usr/bin/java -Xmx10G -jar /mnt/mfs/hgrcgrid/shared/softwares/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T HaplotypeCaller -R CR1_region1.ploidy_6.fa -I washei36472_ploidy6CR1region1.bam -L CR1_region1.expanded.bed --sample_ploidy 6 --genotyping_mode DISCOVERY --emitRefConfidence GVCF --dontUseSoftClippedBases -o washei36472.3CR1region1.g.vcf

Sorry, I can't quote/highlight the code and error text using the GitHub text guidelines. The GATK forum didn't allow me to post a link, maybe because my account is new.

I tried with 6G, 3G, 10G, and 8G. The input BAM and BED files are each less than 1 MB; I don't know what the problem is or how to fix it. The error persists even if I use the -nct flag. I'm working on a cluster node with 15G of memory, so I can easily provide just under 10G when running GATK.

Error log on screen:

Using AVX accelerated implementation of PairHMM
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b17edc60ce9, pid=30033, tid=0x00002b178ef5c700
#
# JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libVectorLoglessPairHMM8627199239981386149.so+0x1bce9] LoadTimeInitializer::LoadTimeInitializer()+0x1669
#
# Core dump written. Default location: /mnt/mfs/hgrcgrid/shared/GT_ADMIX/INDEL_comparisons/sequencing_projects/darkgenome/internal_pipeline/align_CR1exons/core or core.30033
#
# An error report file with more information is saved as:
# /mnt/mfs/hgrcgrid/shared/GT_ADMIX/INDEL_comparisons/sequencing_projects/darkgenome/internal_pipeline/align_CR1exons/hs_err_pid30033.log
#
# If you would like to submit a bug report, please visit:
#
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
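Since the crash is inside the vectorized PairHMM native library, one workaround I'm considering is forcing the Java PairHMM instead (assuming the GATK 3.x --pair_hmm_implementation argument accepts LOGLESS_CACHING; I haven't verified this on 3.7):

/usr/bin/java -Xmx10G -jar /mnt/mfs/hgrcgrid/shared/softwares/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T HaplotypeCaller -R CR1_region1.ploidy_6.fa -I washei36472_ploidy6CR1region1.bam -L CR1_region1.expanded.bed --sample_ploidy 6 --genotyping_mode DISCOVERY --emitRefConfidence GVCF --dontUseSoftClippedBases --pair_hmm_implementation LOGLESS_CACHING -o washei36472.3CR1region1.g.vcf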

Invitation for a workshop in Brazilian PhD program on bioinformatics


Hello GATK Team. There is a grant we can apply to for a 15-day visit to our PhD program, and we would love to host someone from the GATK team here. If this is possible, please let me know at biodados@gmail.com (Miguel). The paperwork is very simple: just a CV and a letter of intent from you. On my side I have to set up a cooperation agreement with your institute, but it is just a form. I hope it can be arranged; our students would love it!
BW, Miguel

CNNScoreVariants Hanging in 4.1.0

I am trying to run CNNScoreVariants in GATK 4.1.0, but the tool seems to hang at the 'INFO NativeLibraryLoader - Loading libgkl_utils.so from jar' step for both the 1D and 2D models.

My issue seems similar to this post, but the hang occurs at a different location:
(gatkforums.broadinstitute.org/gatk/discussion/12384/cnnscorevariants-hanging-in-4-0-5-2-and-4-0-6-0)

I have tried the accepted answer in the above post without success. Any help would be appreciated.


(howto) Recalibrate variant quality scores = run VQSR


Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

Prerequisites

  • TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document. See that document also for caveats regarding exome vs. whole-genome analysis design.

Steps

  1. Prepare recalibration parameters for SNPs
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  2. Build the SNP recalibration model

  3. Apply the desired level of recalibration to the SNPs in the call set

  4. Prepare recalibration parameters for Indels
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  5. Build the Indel recalibration model

  6. Apply the desired level of recalibration to the Indels in the call set


1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • True sites training resource: HapMap

This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G

This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP

This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
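For reference, these percentages follow directly from the Phred scale: a prior of Q corresponds to a probability of being true of 1 - 10^(-Q/10). For example:

    Q15: 1 - 10^(-1.5) ≈ 0.9684 (96.84%)
    Q12: 1 - 10^(-1.2) ≈ 0.9369 (93.69%)
    Q10: 1 - 10^(-1.0) = 0.9000 (90%)
    Q2:  1 - 10^(-0.2) ≈ 0.3690 (36.90%)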

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • RMSMappingQuality (MQ)

Estimation of the overall mapping quality of reads supporting a variant call.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.


2. Build the SNP recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fa \
    -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an DP \
    -an QD \
    -an FS \
    -an SOR \
    -an MQ \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -rscriptFile recalibrate_SNP_plots.R

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the VQSR method documentation and presentation videos.


3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps_raw_indels.vcf

Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the variants from the original raw_variants.vcf file; the SNPs are now annotated with their recalibrated quality scores (VQSLOD) and marked either PASS or with the tranche-specific filter name, depending on whether they fall within the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.
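If you then want a file containing only the variants that passed the chosen tranche, one option is SelectVariants; a minimal sketch, assuming GATK 3's --excludeFiltered flag and a hypothetical output name:

java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R reference.fa \
    -V recalibrated_snps_raw_indels.vcf \
    --excludeFiltered \
    -o recalibrated_snps_passing.vcf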


4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • Known and true sites training resource: Mills

This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

d. Determine additional model parameters

  • Maximum number of Gaussians (-maxGaussians) 4

This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include very many variants.


5. Build the Indel recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fa \
    -input recalibrated_snps_raw_indels.vcf \
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an QD \
    -an DP \
    -an FS \
    -an SOR \
    -an MQRankSum \
    -an ReadPosRankSum \
    -an InbreedingCoeff \
    -mode INDEL \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    --maxGaussians 4 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -rscriptFile recalibrate_INDEL_plots.R

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.


6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input recalibrated_snps_raw_indels.vcf \
    -mode INDEL \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_INDEL.recal \
    -tranchesFile recalibrate_INDEL.tranches \
    -o recalibrated_variants.vcf

Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the variants from the original recalibrated_snps_raw_indels.vcf file; the Indels are now also annotated with their recalibrated quality scores (VQSLOD) and marked either PASS or with the tranche-specific filter name, depending on whether they fall within the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills indel truth training set. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.


How to input a covariates table to GATK4-Alpha PrintReads


I'm trying to run PrintReads in GATK4-Alpha and apply the BaseRecalibrator covariates table. When I run

java -Xmx80G -jar $GATK PrintReads -R umd_3_1_reference_1000_bull_genomes.fa -I GA1442_dedup.bam -O GA1442_dedup_recal.bam -BQSR recal.table

I get the following error

A USER ERROR has occurred: Invalid command line: B is not a recognized option

I've tried -bqsr, --BQSR, and --bqsr, all with errors saying that they are not recognized options.

-BQSR worked as an option in version 3.5. The --help for PrintReads does not list an option to input a covariates table. Can you please advise on how to do this in GATK4-Alpha?
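Or has applying the table been moved to a dedicated tool? In later GATK4 releases this is done with ApplyBQSR, so I am guessing at something like the following (a sketch; I have not confirmed that the tool or the --bqsr-recal-file argument exists in the Alpha build):

java -Xmx80G -jar $GATK ApplyBQSR -R umd_3_1_reference_1000_bull_genomes.fa -I GA1442_dedup.bam --bqsr-recal-file recal.table -O GA1442_dedup_recal.bam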

GATK Spark Logging

Hello,

I've been trying to decrease the verbosity of Spark runs for GATK tools, e.g. MarkDuplicatesSpark.

My call is as follows:

python ${gatkDir}/gatk MarkDuplicatesSpark --spark-master local[$threads] -R ${GRC}.fa --input ${TU}.bam --output ${TU}.dd.bam --tmp-dir temp --verbosity ERROR

I thought --verbosity ERROR would emit only ERROR-level output from the tools, but I'm still getting a lot of INFO output.

Is there another way to get only ERROR level output?

Thanks!

Duplicate allele error during LiftoverVcf run

Hey folks

I'm trying to run VQSR on some data that was aligned to b37 (a couple of years ago). The reference files have moved and the best practices have changed since they were first posted for b37, so I thought it might be easiest to run LiftoverVcf to move the data to hg38 and run VQSR with the most recent hg38 files. I downloaded the chain file for b37 to hg19, and that liftover ran fine; then I tried to lift over from hg19 to hg38, and that's where I get errors.

I think I'm using the latest version of Picard, 2.18.29, with Java 1.8.0_201 (and GATK 4.1.0.0, if that's relevant).

This is WES + capture kit data.

The command I used was:
java -jar ~/tools/picard.jar LiftoverVcf I=MyData.vcf O=MyData_lifted_over.vcf CHAIN=hg19ToHg38.over.chain REJECT=rejected_variants.vcf R=hg38.fa

The error is:
Exception in thread "main" java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: C
at htsjdk.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1493)
at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:379)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:579)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:573)
at picard.util.LiftoverUtils.liftVariant(LiftoverUtils.java:117)
at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:396)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

I then ran it again with a slightly different reference file and got something similar:

Exception in thread "main" java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: A
at htsjdk.variant.variantcontext.VariantContext.makeAlleles(VariantContext.java:1493)
at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:379)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:579)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:573)
at picard.util.LiftoverUtils.liftVariant(LiftoverUtils.java:117)
at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:396)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

If I narrow down the top portion of the file to see where the error appears, I think it's this line:

chr1 144931739 rs375230102 AGGC *,AGGCGGC,A 2312.56 PASS AC=2,9,9;AF=4.950e-03,0.022,0.022;AN=404;BaseQRankSum=-2.890e-01;ClippingRankSum=0.289;DB;DP=5719;FS=0.751;InbreedingCoeff=-1.3149;MLEAC=2,9,7;MLEAF=4.950e-03,0.022,0.017;MQ=60.00;MQ0=0;MQRankSum=0.300;QD=5.02;ReadPosRankSum=0.286;SOR=0.800

I'm not sure what's wrong here. I'll also give you a few lines before and after in case I screwed that up somehow:

chr1 144931607 rs6673292 C T 492.12 PASS AC=3;AF=7.426e-03;AN=404;BaseQRankSum=-1.204e+00;ClippingRankSum=0.00;DB;DP=6458;FS=1.721;InbreedingCoeff=-0.0075;MLEAC=3;MLEAF=7.426e-03;MQ=60.00;MQ0=0;MQRankSum=0.012;QD=4.21;ReadPosRankSum=0.450;SOR=0.906
chr1 144931699 rs144526186 T A 192.12 PASS AC=2;AF=4.950e-03;AN=404;BaseQRankSum=2.13;ClippingRankSum=0.477;DB;DP=6505;FS=2.847;InbreedingCoeff=-0.0050;MLEAC=2;MLEAF=4.950e-03;MQ=60.00;MQ0=0;MQRankSum=-1.090e-01;QD=3.00;ReadPosRankSum=0.065;SOR=1.259
chr1 144931727 rs2985363 G A 28702.22 PASS AC=126;AF=0.312;AN=404;BaseQRankSum=0.299;ClippingRankSum=0.301;DB;DP=6107;FS=0.594;InbreedingCoeff=-0.4553;MLEAC=126;MLEAF=0.312;MQ=60.00;MQ0=0;MQRankSum=0.025;QD=7.51;ReadPosRankSum=0.586;SOR=0.790
chr1 144931737 rs765186109 CGAG C,GGAG 431.87 PASS AC=2,1;AF=4.950e-03,2.475e-03;AN=404;BaseQRankSum=-6.060e-01;ClippingRankSum=-6.270e-01;DB;DP=5793;FS=3.081;InbreedingCoeff=-0.0074;MLEAC=2,1;MLEAF=4.950e-03,2.475e-03;MQ=60.00;MQ0=0;MQRankSum=0.209;QD=5.47;ReadPosRankSum=0.079;SOR=1.055
chr1 144931739 rs375230102 AGGC *,AGGCGGC,A 2312.56 PASS AC=2,9,9;AF=4.950e-03,0.022,0.022;AN=404;BaseQRankSum=-2.890e-01;ClippingRankSum=0.289;DB;DP=5719;FS=0.751;InbreedingCoeff=-1.3149;MLEAC=2,9,7;MLEAF=4.950e-03,0.022,0.017;MQ=60.00;MQ0=0;MQRankSum=0.300;QD=5.02;ReadPosRankSum=0.286;SOR=0.800
chr1 144935104 rs147352020 A G 10599.84 PASS AC=5;AF=0.012;AN=404;BaseQRankSum=0.748;ClippingRankSum=-1.760e-01;DB;DP=48219;FS=0.000;InbreedingCoeff=-0.0125;MLEAC=5;MLEAF=0.012;MQ=60.00;MQ0=0;MQRankSum=-5.640e-01;QD=4.39;ReadPosRankSum=1.48;SOR=0.709
chr1 144935209 . T A 1324.16 PASS AC=1;AF=2.475e-03;AN=404;BaseQRankSum=2.60;ClippingRankSum=0.378;DP=47081;FS=15.309;InbreedingCoeff=-0.0025;MLEAC=1;MLEAF=2.475e-03;MQ=60.00;MQ0=0;MQRankSum=0.633;QD=4.09;ReadPosRankSum=-7.390e-01;SOR=1.570
chr1 144935261 . A G 1243.16 PASS AC=1;AF=2.475e-03;AN=404;BaseQRankSum=0.902;ClippingRankSum=-1.088e+00;DP=46942;FS=2.215;InbreedingCoeff=-0.0025;MLEAC=1;MLEAF=2.475e-03;MQ=60.00;MQ0=0;MQRankSum=1.60;QD=5.63;ReadPosRankSum=1.42;SOR=0.917
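(In case it helps, this is how I'm listing the other records that carry a spanning * allele, to check whether they all fail the same way; a quick shell sketch:)

grep -v '^#' MyData.vcf | awk '$5 ~ /\*/' | head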

If someone could point me in the right direction it would be a big help.

Thanks!

David



PS -- When I only run the first part of the file, 74348 variants are processed and 99.6584% of variants are successfully lifted over and written to the output. I also get a few warnings of the type "Interval chr1:120583691-120583697 failed to match chain 2 because intersection length 4 < minMatchSize 7.0 (0.5714286 < 1.0)" -- I see several references to this type of warning but couldn't find an explanation of its meaning; could someone point me to that?

storage.objects.list access error for freecredit project


Hi,
I ran a modified 3-Joint-Discovery method from the help-gatk/Germline-SNPs-Indels-GATK4-hg38 workspace. The GVCF input files failed to import. Because it's a trial account, I couldn't modify the IAM settings of the data bucket. I've searched storage.objects.list error topics and found the ACL-setting tricks, but I'm not sure which part of the workflow I should modify (the fix I keep seeing is sketched below the logs).

2019/03/25 05:17:02 I: Running command: sudo gsutil -q -m cp gs://fc-12f21f92-dfe5-451e-bca8-95552bd85f03/4665e834-aac7-40b7-b28f-8d574cf5be00/HaplotypeCallerGvcf_GATK4/f206209b-af79-482b-8f5b-77839edb52fb/call-MergeGVCFs/Sample_100.SpinachV2pseudo.aligned.duplicate_marked.sorted.g.vcf.gz /mnt/local-disk/fc-12f21f92-dfe5-451e-bca8-95552bd85f03/4665e834-aac7-40b7-b28f-8d574cf5be00/HaplotypeCallerGvcf_GATK4/f206209b-af79-482b-8f5b-77839edb52fb/call-MergeGVCFs/Sample_100.SpinachV2pseudo.aligned.duplicate_marked.sorted.g.vcf.gz
2019/03/25 05:17:05 E: command failed: AccessDeniedException: 403 pet-255319754645899bdec02@fccredits-carbon-gold-4123.iam.gserviceaccount.com does not have storage.objects.list access to fc-12f21f92-dfe5-451e-bca8-95552bd85f03.
CommandException: 1 file/object could not be transferred.
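For reference, the fix suggested in those threads looks something like granting the pet service account read access on the bucket; a sketch using gsutil, which I can't actually run on a trial account:

gsutil iam ch serviceAccount:pet-255319754645899bdec02@fccredits-carbon-gold-4123.iam.gserviceaccount.com:objectViewer gs://fc-12f21f92-dfe5-451e-bca8-95552bd85f03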

I want to use GATK tools on Hadoop Map-reduce.

Is it possible to run GATK tools on Hadoop? Can we set a tool's input and output paths to HDFS paths? (An example of what I mean is sketched below.)
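For context, this is the kind of invocation I have in mind; a sketch assuming a GATK4 Spark tool, a hypothetical HDFS namenode URI, and the documented --spark-runner/--spark-master launcher options:

gatk MarkDuplicatesSpark \
    -I hdfs://namenode:8020/data/sample.bam \
    -O hdfs://namenode:8020/data/sample.dedup.bam \
    -- --spark-runner SPARK --spark-master yarn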
