Channel: Recent Discussions — GATK-Forum

vendor pre-built reference package


Hi, FireCloud team
I am aligning my WGS data with a vendor pre-built hg38 reference package, which contains decoy contigs but not the alternate haplotypes and MHC alleles present in the standard hg38 reference assembly.

I recently tried running the Mutect2 and somatic CNV workflows on FireCloud for those BAMs aligned with the vendor reference package, but the FireCloud attributes I assigned were made from the standard hg38 assembly (which I got from the Broad resource bundle, etc.). Unfortunately, I have not made it work yet, and I am not sure whether that is because of the different reference used.

Do you suggest using the vendor reference for the ref_fasta, ref_dict, and ref_fai attributes on FireCloud? If so, I don't think I can use other reference resources such as the PoN, 1000G, and gnomAD files, because as far as I know they were all made with the standard reference build.

Please advise on what I should do.

Thank you very much


(howto) Recalibrate variant quality scores = run VQSR


Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

Prerequisites

  • TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document. See that document also for caveats regarding exome vs. whole-genome analysis designs.

Steps

  1. Prepare recalibration parameters for SNPs
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  2. Build the SNP recalibration model

  3. Apply the desired level of recalibration to the SNPs in the call set

  4. Prepare recalibration parameters for Indels
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  5. Build the Indel recalibration model

  6. Apply the desired level of recalibration to the Indels in the call set


1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • True sites training resource: HapMap

This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G

This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP

This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.
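
For reference, the percentages quoted above follow directly from the Phred scale: a prior of Q corresponds to a probability of 1 - 10^(-Q/10) that a site is real. Working out the values used in this document:

Q15: 1 - 10^(-1.5) ≈ 0.9684 (96.84%)
Q12: 1 - 10^(-1.2) ≈ 0.9369 (93.69%)
Q10: 1 - 10^(-1.0) = 0.90 (90%)
Q2:  1 - 10^(-0.2) ≈ 0.3690 (36.90%)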

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.
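
As a sketch (using the same GATK 3.x command style as the examples below; file names are placeholders), adding a missing annotation such as Coverage (DP) with VariantAnnotator might look like this:

java -jar GenomeAnalysisTK.jar \
    -T VariantAnnotator \
    -R reference.fa \
    -I recalibrated.bam \
    -V raw_variants.vcf \
    -A Coverage \
    -o raw_variants.annotated.vcf

Note that annotations which depend on read data (like Coverage) require the original BAM file to be provided with -I.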

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • RMSMappingQuality (MQ)

Estimation of the overall mapping quality of reads supporting a variant call.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.


2. Build the SNP recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \ 
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \ 
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \ 
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \ 
    -an DP \ 
    -an QD \ 
    -an FS \ 
    -an SOR \ 
    -an MQ \
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff \
    -mode SNP \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -rscriptFile recalibrate_SNP_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the VQSR method documentation and presentation videos.


3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -mode SNP \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -o recalibrated_snps_raw_indels.vcf 

Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the original variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.
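
For example, to apply the recommended 99.9% tranche instead, you would re-run the same command with only the filter level changed:

java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fa \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.9 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps_raw_indels.vcf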


4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • True sites training resource: Mills

This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Known sites resource, not used in training: dbSNP

This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

d. Determine additional model parameters

  • Maximum number of Gaussians (-maxGaussians) 4

This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include very many variants.


5. Build the Indel recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf  \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an QD \
    -an DP \ 
    -an FS \ 
    -an SOR \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff \
    -mode INDEL \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    --maxGaussians 4 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -rscriptFile recalibrate_INDEL_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.


6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -mode INDEL \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -o recalibrated_variants.vcf 

Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the original variants from the original recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set of Indels. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.

ArtificialHaplotypeRG Mutect2


Hello,

I found an interesting variant call which is supported by 'ArtificialHaplotypeRG' reads (8 reads in total) when using Mutect2.
They don't seem to be real reads.
What are these reads?
Is the call reliable?

NOT able to pull GATK4.0.5.0 image in Firecloud


even if I set the disk space to 200G ...

2018/06/08 19:16:16 I: Switching to status: pulling-image 2018/06/08 19:16:16 I: Calling SetOperationStatus(pulling-image) 2018/06/08 19:16:16 I: SetOperationStatus(pulling-image) succeeded 2018/06/08 19:16:16 I: Writing new Docker configuration file 2018/06/08 19:16:16 I: Pulling image "broadinstitute/gatk@sha256:76b5037167dac880a9651802dc06c7dcdfd487cfefd6f4db4f86623dd9a01ec9" 2018/06/08 19:19:14 W: "docker --config /tmp/.docker/ pull broadinstitute/gatk@sha256:76b5037167dac880a9651802dc06c7dcdfd487cfefd6f4db4f86623dd9a01ec9" failed: exit status 1: sha256:76b5037167dac880a9651802dc06c7dcdfd487cfefd6f4db4f86623dd9a01ec9: Pulling from broadinstitute/gatk ae79f2514705: Pulling fs layer 5ad56d5fc149: Pulling fs layer 170e558760e8: Pulling fs layer 395460e233f5: Pulling fs layer 6f01dc62e444: Pulling fs layer 98db058f41f6: Pulling fs layer dc9c3ece7593: Pulling fs layer c82b47286f3d: Pulling fs layer 16a3034a6570: Pulling fs layer ea15f6798d84: Pulling fs layer 978d56db40a6: Pulling fs layer 4b3ec876807a: Pulling fs layer 504f977e3da2: Pulling fs layer 66e54a65e68a: Pulling fs layer d86f1090b756: Pulling fs layer fb33d0c493c0: Pulling fs layer fdc65578d1e6: Pulling fs layer 400c525cbc78: Pulling fs layer 7848d22029f8: Pulling fs layer 0bf9f050734a: Pulling fs layer 65528f070366: Pulling fs layer 7eadcbdc8859: Pulling fs layer bc989902ecb5: Pulling fs layer 8ab4e34e8939: Pulling fs layer fbbe2d889fb9: Pulling fs layer ce9f6f562c58: Pulling fs layer 54bdd2bf38e8: Pulling fs layer 395460e233f5: Waiting 6f01dc62e444: Waiting 98db058f41f6: Waiting dc9c3ece7593: Waiting c82b47286f3d: Waiting 16a3034a6570: Waiting ea15f6798d84: Waiting 978d56db40a6: Waiting 4b3ec876807a: Waiting 504f977e3da2: Waiting 66e54a65e68a: Waiting d86f1090b756: Waiting fb33d0c493c0: Waiting fdc65578d1e6: Waiting 400c525cbc78: Waiting 7848d22029f8: Waiting 0bf9f050734a: Waiting 65528f070366: Waiting 7eadcbdc8859: Waiting bc989902ecb5: Waiting 8ab4e34e8939: Waiting fbbe2d889fb9: Waiting ce9f6f562c58: Waiting 54bdd2bf38e8: Waiting 170e558760e8: Verifying Checksum 170e558760e8: Download complete 5ad56d5fc149: Verifying Checksum 5ad56d5fc149: Download complete 395460e233f5: Verifying Checksum 395460e233f5: Download complete ae79f2514705: Verifying Checksum ae79f2514705: Download complete dc9c3ece7593: Verifying Checksum dc9c3ece7593: Download complete 6f01dc62e444: Verifying Checksum 6f01dc62e444: Download complete ae79f2514705: Pull complete 5ad56d5fc149: Pull complete 170e558760e8: Pull complete 395460e233f5: Pull complete 6f01dc62e444: Pull complete 16a3034a6570: Verifying Checksum 16a3034a6570: Download complete ea15f6798d84: Verifying Checksum ea15f6798d84: Download complete 98db058f41f6: Verifying Checksum 98db058f41f6: Download complete 4b3ec876807a: Verifying Checksum 4b3ec876807a: Download complete 504f977e3da2: Verifying Checksum 504f977e3da2: Download complete 978d56db40a6: Verifying Checksum 978d56db40a6: Download complete d86f1090b756: Verifying Checksum d86f1090b756: Download complete c82b47286f3d: Verifying Checksum c82b47286f3d: Download complete 66e54a65e68a: Verifying Checksum 66e54a65e68a: Download complete fb33d0c493c0: Verifying Checksum fb33d0c493c0: Download complete 7848d22029f8: Verifying Checksum 7848d22029f8: Download complete 0bf9f050734a: Verifying Checksum 0bf9f050734a: Download complete 65528f070366: Verifying Checksum 65528f070366: Download complete 7eadcbdc8859: Verifying Checksum 7eadcbdc8859: Download complete bc989902ecb5: Verifying Checksum bc989902ecb5: Download complete 8ab4e34e8939: 
Verifying Checksum 8ab4e34e8939: Download complete fbbe2d889fb9: Verifying Checksum fbbe2d889fb9: Download complete ce9f6f562c58: Verifying Checksum ce9f6f562c58: Download complete 400c525cbc78: Verifying Checksum 400c525cbc78: Download complete fdc65578d1e6: Verifying Checksum fdc65578d1e6: Download complete 98db058f41f6: Pull complete dc9c3ece7593: Pull complete c82b47286f3d: Pull complete 54bdd2bf38e8: Verifying Checksum 54bdd2bf38e8: Download complete 16a3034a6570: Pull complete ea15f6798d84: Pull complete 978d56db40a6: Pull complete 4b3ec876807a: Pull complete 504f977e3da2: Pull complete 66e54a65e68a: Pull complete d86f1090b756: Pull complete fb33d0c493c0: Pull complete fdc65578d1e6: Pull complete 400c525cbc78: Pull complete 7848d22029f8: Pull complete 0bf9f050734a: Pull complete 65528f070366: Pull complete 7eadcbdc8859: Pull complete bc989902ecb5: Pull complete 8ab4e34e8939: Pull complete fbbe2d889fb9: Pull complete ce9f6f562c58: Pull complete failed to register layer: Error processing tar file(exit status 1): write /root/.cache/pip/http/c/d/7/5/4/cd754ee3e1f32413f09f243232a69bc3b2d214c7d6bca9509ded9809: no space left on device

gatk 4.0.4 docker image missing dependencies


Hi there,

I am trying to perform base recalibration using the Docker image of GATK 4 (the log below shows v4.0.4.0). I used 3.6 before, but a dependency problem with R pointed me to the latest version, in which the problem should be fixed according to the GATK forum. However, here I am with the newer version and a similar error message: "Error in library("reshape") : there is no package called 'reshape'" (see the full message at the bottom of this post).

The library is indeed not installed in R.
Is there a repository with a container that has all dependencies installed? Or is it just like v3.6, where running the script "manually" is required? What am I missing to perform this step correctly?

thanks in advance for your answer.

Best,

B.

b35@toto$ docker run --mount type=bind,source="$ld",target=/data/ --mount type=bind,source=/media/b35/DATA/genomic/reference_genomes/,target=/ref/ broadinstitute/gatk:latest sh -c "gatk AnalyzeCovariates -bqsr /data/data/recal.table${i} -plots /data/data/AnalyzeCovariates${i}.pdf"
19:36:17.277 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/libs/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
19:36:17.414 INFO AnalyzeCovariates - ------------------------------------------------------------
19:36:17.415 INFO AnalyzeCovariates - The Genome Analysis Toolkit (GATK) v4.0.4.0
19:36:17.415 INFO AnalyzeCovariates - For support and documentation go to https://software.broadinstitute.org/gatk/
19:36:17.415 INFO AnalyzeCovariates - Executing as root@f43fb0936ac9 on Linux v4.13.0-43-generic amd64
19:36:17.415 INFO AnalyzeCovariates - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11
19:36:17.415 INFO AnalyzeCovariates - Start Date/Time: May 28, 2018 7:36:17 PM UTC
19:36:17.415 INFO AnalyzeCovariates - ------------------------------------------------------------
19:36:17.415 INFO AnalyzeCovariates - ------------------------------------------------------------
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Version: 2.14.3
19:36:17.416 INFO AnalyzeCovariates - Picard Version: 2.18.2
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Defaults.COMPRESSION_LEVEL : 2
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:36:17.416 INFO AnalyzeCovariates - Deflater: IntelDeflater
19:36:17.416 INFO AnalyzeCovariates - Inflater: IntelInflater
19:36:17.416 INFO AnalyzeCovariates - GCS max retries/reopens: 20
19:36:17.417 INFO AnalyzeCovariates - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
19:36:17.417 INFO AnalyzeCovariates - Initializing engine
19:36:17.417 INFO AnalyzeCovariates - Done initializing engine
19:36:17.731 INFO AnalyzeCovariates - Generating csv file '/tmp/root/AnalyzeCovariates2520511082455841657.csv'
19:36:17.789 INFO AnalyzeCovariates - Generating plots file '/data/data/AnalyzeCovariates1.pdf'
19:36:18.255 INFO AnalyzeCovariates - Shutting down engine
[May 28, 2018 7:36:18 PM UTC] org.broadinstitute.hellbender.tools.walkers.bqsr.AnalyzeCovariates done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=320864256
org.broadinstitute.hellbender.utils.R.RScriptExecutorException:
Rscript exited with 1
Command Line: Rscript -e tempLibDir = '/tmp/root/Rlib.7509374791326779134';source('/tmp/root/BQSR.8897197604108889283.R'); /tmp/root/AnalyzeCovariates2520511082455841657.csv /data/data/recal.table1 /data/data/AnalyzeCovariates1.pdf
Stdout:
Stderr:
Attaching package: 'gplots'

The following object is masked from 'package:stats':

lowess

Error in library("reshape") : there is no package called 'reshape'
Calls: source -> withVisible -> eval -> eval -> library
Execution halted

at org.broadinstitute.hellbender.utils.R.RScriptExecutor.getScriptException(RScriptExecutor.java:80)
at org.broadinstitute.hellbender.utils.R.RScriptExecutor.getScriptException(RScriptExecutor.java:19)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.R.RScriptExecutor.exec(RScriptExecutor.java:131)
at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.generatePlots(RecalUtils.java:360)
at org.broadinstitute.hellbender.tools.walkers.bqsr.AnalyzeCovariates.generatePlots(AnalyzeCovariates.java:329)
at org.broadinstitute.hellbender.tools.walkers.bqsr.AnalyzeCovariates.doWork(AnalyzeCovariates.java:341)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

Using GATK jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar AnalyzeCovariates -bqsr /data/data/recal.table1 -plots /data/data/AnalyzeCovariates1.pdf

strange alignment in mutect2 output bam


Hi, I use GATK 4.0.0.0 and found a strange alignment in the Mutect2 output BAM. I think the red T on the right of the indel should be aligned to the left T within the indel.

How to relax parameters in GenotypeGVCFs to get more variants?


Hi,

I am using GenotypeGVCFs in following command,

java -Xmx2g -jar CancerAnalysisPackage-2015.1-3/GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R Refgenome_ucsc/ucsc.hg19.fasta \
    --variant chrn.g.vcf \
    --out chrn.rawVariants.vcf \
    -L chrn.bed --interval_padding 100 \
    --disable_auto_index_creation_and_locking_when_reading_rods \
    -nt 4

Number of variants in chrn.g.vcf = 12248378
Number of variants in chrn.rawVariants.vcf = 85579

Is there any way to get more variants, as I seem to lose a lot of variants during this step? Please help me out, as I am in great need.

VCF contigs don't match reference genome


Hello,
I am working with fungi RNA-seq SNPs that I called using the GATK best practices pipeline. I have 12 vcf files of SNPs that I called using REFERENCE1.fa. I also have a vcf file from a collaborator with SNPs that were called using the same reference genome (REFERENCE1.fa). My goal was to combine all of my vcf files with those of my collaborator for phylogenetic tree data analysis. I was able to combine my 12 vcf files, but when I try to combine with my collaborator's file, I get an error saying that the contigs don't match. I looked more closely at both of our vcf files, and I noticed that there are two contigs present in my file that aren't present in their file. I found out that these contigs are for unmapped scaffolds and mitochondria. I thought maybe things would work if I removed these two contigs from my vcf file. I used SelectVariants to do this, and tried the merging again. Now I am getting an error saying that my vcf contigs do not match my reference genome. Is there a way to remove these contigs from my reference genome as well, so that I won't get this error?


Simple explanation of MarkDuplicates


I am having a hard time understanding how MarkDuplicates works. Based on the MarkDuplicates documentation, this is how it is described: "The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file." I don't understand what "5 prime positions" means in the above statement. Also, what does it mean in the context of "of both reads and read-pairs"? If you could please explain that to me using an example, I would really appreciate it.

funcotator error


Hi, FireCloud team

Sorry for the duplicated question. I just want to update the format.

I ran into a problem recently with Funcotator. I initially ran Mutect2 coupled with Funcotator on FireCloud (GATK version 4.0.4.0), but it threw an error. I also tried running Funcotator locally and it produces an almost identical error; see below.

[June 6, 2018 1:47:32 PM EDT] org.broadinstitute.hellbender.tools.funcotator.Funcotator done. Elapsed time: 12.63 minutes.
Runtime.totalMemory()=12082741248
java.lang.StringIndexOutOfBoundsException: String index out of range: -61
    at java.lang.String.substring(String.java:1967)
    at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createUtrFuncotation(GencodeFuncotationFactory.java:1088)
    at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createGencodeFuncotationOnTranscript(GencodeFuncotationFactory.java:601)
    at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createFuncotationsHelper(GencodeFuncotationFactory.java:529)
    at org.broadinstitute.hellbender.tools.funcotator.dataSources.gencode.GencodeFuncotationFactory.createFuncotationsOnVariant(GencodeFuncotationFactory.java:276

I know Funcotator is still in beta, but I want to ask for your help debugging this problem, and I am also wondering whether there is a more stable version of Funcotator.

Thanks,
Zhouwei

ArrayIndexOutOfBoundsException in GenotypeGVCFs on chrX with male/female adapted ploidy


I am attempting to call exomes using GATK 3.8, the new quality model, and AS annotations. However, for chrX I get an ArrayIndexOutOfBoundsException, likely because I am using different ploidy for males and females.

INFO 20:01:42,079 ProgressMeter - X:140994551 3505.0 30.0 s 2.4 h 1.9% 26.9 m 26.4 m

ERROR --
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 24
at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyGenotypeLikelihoods.getNumLikelihoodElements(GeneralPloidyGenotypeLikelihoods.java:440)
at org.broadinstitute.gatk.tools.walkers.genotyper.GeneralPloidyGenotypeLikelihoods.subsetToAlleles(GeneralPloidyGenotypeLikelihoods.java:339)
at org.broadinstitute.gatk.tools.walkers.genotyper.afcalc.IndependentAllelesExactAFCalculator.subsetAlleles(IndependentAllelesExactAFCalculator.java:494)
at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:292)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:392)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:375)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:330)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:327)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:305)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:136)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):

ApplyVQSR page example error


Dear team,
Thanks for the wonderful development of this tool.
When I tried out ApplyVQSR with GATK4, I was following the examples from https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_vqsr_ApplyVQSR.php,
and found that the option --ts_filter_level in the example is not recognised by GATK. Further reading on the page suggested the option should really be -ts-filter-level, and that --ts_filter_level was carried over from GATK 3.x. Can you please correct it?
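
For reference, a GATK4-style ApplyVQSR invocation using the renamed option might look like the following sketch (file names are placeholders; -ts-filter-level is the short form of --truth-sensitivity-filter-level):

gatk ApplyVQSR \
    -R reference.fa \
    -V input.vcf.gz \
    --recal-file recalibrate_SNP.recal \
    --tranches-file recalibrate_SNP.tranches \
    --truth-sensitivity-filter-level 99.0 \
    -mode SNP \
    -O recalibrated_snps.vcf.gz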
Thanks,
Jing

(How to) Run Spark-enabled GATK tools on a local multi-core machine


This is a placeholder for a document in progress.

(How to) Run Spark-enabled GATK tools on a Spark cluster


This is a placeholder for documentation in progress.

GATK4 SplitNCigarReads RuntimeIOException: Attempt to add record to closed writer.


On a Linux cluster, I ran this command on a node (no job scheduler):

./gatk SplitNCigarReads -R /bigdisk/databases/genomes/human/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa -I 28_tumor.dedupped.bam -O 28_tumor.split.bam

I get this error during SplitN's second pass:

13:27:28.945 INFO  ProgressMeter -           4:74283282            163.1             288256000        1767147.3
13:27:38.955 INFO  ProgressMeter -           4:74283830            163.3             288523000        1766976.9
13:27:46.176 INFO  SplitNCigarReads - Shutting down engine
[February 6, 2018 1:27:46 PM CET] org.broadinstitute.hellbender.tools.walkers.rnaseq.SplitNCigarReads done. Elapsed time: 163.42 minutes.
Runtime.totalMemory()=12006719488
htsjdk.samtools.util.RuntimeIOException: Attempt to add record to closed writer.
    at htsjdk.samtools.util.AbstractAsyncWriter.write(AbstractAsyncWriter.java:57)
    at htsjdk.samtools.AsyncSAMFileWriter.addAlignment(AsyncSAMFileWriter.java:53)
    at org.broadinstitute.hellbender.utils.read.SAMFileGATKReadWriter.addRead(SAMFileGATKReadWriter.java:21)
    at org.broadinstitute.hellbender.tools.walkers.rnaseq.OverhangFixingManager.writeReads(OverhangFixingManager.java:349)
    at org.broadinstitute.hellbender.tools.walkers.rnaseq.OverhangFixingManager.flush(OverhangFixingManager.java:329)
    at org.broadinstitute.hellbender.tools.walkers.rnaseq.SplitNCigarReads.closeTool(SplitNCigarReads.java:195)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:897)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:152)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:275)

The output file 28_tumor.split.bam is 0 bytes, and there is an index file that is also 0 bytes.

Java version: 1.8.0_162
GATK version: 4.0.0.0
OS: CentOS release 6.8

I ran this command on a different computer with Ubuntu 16.04 and had no problems. On different BAM files I get the same error. Any ideas? It's frustrating that I can't get GATK to run efficiently on the cluster, only on slow computers or on computers with limited disk space. It took a month to run on about 45 pairs of RNA-Seq samples (of course I made errors along the way), so I really need it to run on the cluster.

Thanks,
Zsuzsa


libVectorLoglessPairHMM is not present in GATK 3.8 - HaplotypeCaller is slower than 3.4-46!


We are running GATK on a multi-core Intel Xeon that does not have AVX. We have just upgraded from running 3.4-46 to running 3.8, and HaplotypeCaller runs much more slowly. I noticed that our logs used to say:

Using SSE4.1 accelerated implementation of PairHMM
INFO 06:18:09,932 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 06:18:09,933 VectorLoglessPairHMM - Using vectorized implementation of PairHMM

But now they say:

WARN 07:10:21,304 PairHMMLikelihoodCalculationEngine$1 - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
WARN 07:10:21,310 PairHMMLikelihoodCalculationEngine$1 - AVX-accelerated native PairHMM implementation is not supported. Falling back to slower LOGLESS_CACHING implementation

I'm guessing the newfangled Intel GKL isn't working so well for us. Note that I had a very similar problem with GATK 3.4-0, in http://gatk.vanillaforums.com/entry/passwordreset/21436/OrxbD0I4oRDaj8y1hDSE and this was resolved in GATK 3.4-46.

Read groups


There is no formal definition of what a read group is, but in practice this term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '^@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

Meaning of the read group fields required by GATK

  • ID = Read group identifier
    This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.
    Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since all reads within a read group are assumed to share the same error model.

  • PU = Platform Unit
    The PU holds three types of information: the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. The PU is not required by GATK, but it takes precedence over ID for base recalibration if it is present. In the example shown earlier, the two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

  • SM = Sample
    The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.

  • PL = Platform/technology used to produce the read
    This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

  • LB = DNA preparation library identifier
    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.
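
As a sketch (the input and output file names are placeholders, and the read group values simply reuse the @RG example shown above), a minimal AddOrReplaceReadGroups invocation might look like this:

java -jar picard.jar AddOrReplaceReadGroups \
    I=sample.bam \
    O=sample.rg.bam \
    RGID=H0164.2 \
    RGPU=H0164ALXX140820.2 \
    RGSM=NA12878 \
    RGPL=ILLUMINA \
    RGLB=Solexa-272222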


Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portions of the read names, which come after the flow cell lane and are separated by colons, are the tile number, the x-coordinate of the cluster, and the y-coordinate of the cluster.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane
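
As an illustration (a sketch, not part of the standard pipeline; sample.fastq.gz is a placeholder), the {FLOWCELL_BARCODE}.{LANE} portion of the PU field can be derived from the first read name in a FASTQ file with standard shell tools:

# Print flowcell.lane from the first read name, e.g. H0164ALXX140820.2 for the reads shown above
zcat sample.fastq.gz | head -n 1 | sed 's/^@//' | awk -F: '{print $1 "." $2}'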

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).

I have all Allele frequency 0.5 or 1.0


I finished my RNA-seq variant calling using the GATK pipeline described in your workflow.
But I realized that all the allele frequency (AF) values in the VCF files are 0.5 or 1.0.
Is that normal?

reason for using -new-qual in GenotypeGVCFs


I am aware that the -new-qual switch in GenotypeGVCFs uses a new algorithm to calculate allele frequencies. Can I get more details about it, with examples, and an explanation of why it should be used instead of the old one (which I would also like to understand)? On a quick look, I saw that the variant quality (QUAL) was different for a particular variant site with and without -new-qual.

Seems like CombineGVCFs is freezing


Hello,

I am using the latest version of GATK (gatk-4.0.3.0). It seems that CombineGVCFs is required before GenotypeGVCFs for multiple samples, according to the online Tool Documentation, so I am running CombineGVCFs after SNP calling. My problem is that CombineGVCFs runs very slowly and appears to freeze after reading the input files (I only have 6 samples), as shown below:

11:33:49.467 INFO CombineGVCFs - Initializing engine
11:33:56.507 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/151_RWG1_assembly.g.vcf
11:34:02.884 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/179_RWG1_assembly.g.vcf
11:34:08.362 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/338_RWG1_assembly.g.vcf
11:34:13.497 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/374_RWG1_assembly.g.vcf
11:34:22.429 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/449_RWG1_assembly.g.vcf
11:34:27.592 INFO FeatureManager - Using codec VCFCodec to read file file:///homes/yuanwen/SR39-2/RWG1_assembly/RWG1_control_RWG1_assembly.g.vcf

Could anyone help with or explain this issue? Thank you very much!

Best,
Yuanwen
