Channel: Recent Discussions — GATK-Forum

Calling a complete set of Germline+Somatic Mutations for a Cancer Sample


I need to produce a set of all DNA-level variants for cancer patients to serve as exclusion sites for RNA editing site finding. I currently have matched tumor-normal WES data for these patients. Which of the following would be the most appropriate way to produce these?

  1. Unfiltered paired MuTect2
  2. HC of NORMAL plus filtered paired MuTect2, or
  3. HC of TUMOR plus filtered paired MuTect2?

downsample_to_coverage in HaplotypeCaller


Hello,
Here I have a question about downsample_to_coverage in HaplotypeCaller. I found that -dcov cannot be used with HaplotypeCaller, so I tried changing the values of the maxReadsInRegionPerSample and minReadsPerAlignmentStart parameters to change the coverage level, but the coverage in the result files is still at the default level.
So I want to ask: which parameter in HaplotypeCaller controls the downsampled coverage level? And if it is the two parameters above, how can I increase the downsampled coverage?

Joint analysis for germline SNP discovery between paired samples


Hello,
I have a cohort of several sibling pairs, and I am interested in finding germline SNPs for each pair. I understand from the Best Practices that a joint analysis using HaplotypeCaller works very well, but I am not sure how to proceed. Is it reasonable/possible to do the following?
1) HaplotypeCaller to find SNPs between each pair <- is this possible with HaplotypeCaller? It seems like it is meant to handle one or more samples against a single reference sequence.
2) Joint genotyping using the HaplotypeCaller

Thank you in advance, any help appreciated,
Ramiro

GATK and MuTect licensing moves to direct-through-Broad model


We have some important news to share with you regarding the licensing of GATK and MuTect. The licensing agreement between us and Appistry will end effective April 15, 2015; from that point on, the tools will be licensed directly through Broad for commercial entities that run GATK and MuTect internally or as part of their own hardware offering. Current licensed users will transition to Broad Institute when their current license expires.

For our academic and non-profit GATK and MuTect users, the licensing transition will be essentially transparent. You will still be able to use GATK and MuTect for free, and access the source code through the existing public repository. The support forum and documentation website will also remain operational and freely accessible to all, as they have been previously.

Since our commercial users will now get their license --and their support!-- directly from Broad, they can expect to see some clear benefits:

  1. Better access.
    We have heard from our licensed user community that you would like greater access to the GATK support team. Licensing through Broad will remove intermediaries and allow all GATK and MuTect users --including those who purchase a license-- more direct contact with and support from our team.

  2. Most current tools.
Getting your GATK license through Broad will give you access to the most cutting-edge tools and features available without sacrificing support. Under Broad licensing, you will still have the option of purchasing a license for either of two packages, “GATK” or “GATK + Cancer Tools”.

  3. Up-to-date Best Practices recommendations.
You will get our latest recommendations, informed by the latest in our internal analysis and R&D work, directly from us. So you can be confident that you have access to the freshest information at all times.

Our development team is driven by the goal of building tools to enable better science and push the boundaries of genome analysis. Revenue from GATK and MuTect licensing enables these goals by directly feeding into GATK and MuTect development, in the form of critical codebase maintenance and bug fixing work, as well as expansion of the support team. This enables us to keep pace with the growth of the user community and the ever-increasing demand for GATK and MuTect support.

This is a significant new milestone in the life of GATK and MuTect, and we recognize that there are going to be a lot of questions and discussions on this topic since it will affect many of you in the research community. We’ve put together some FAQs (below the fold) that we hope will answer your most pressing questions; feel free to comment and suggest additional points that you think should be covered there.

Note that we are still working on defining some of the finer points of the support model and pricing structure, so we can’t address those quite yet -- but feel free to email softwarelicensing@broadinstitute.org if you have some burning question and/or concern that you’d like to discuss regarding licensing and/or pricing in particular. Rest assured that once the model has been finalized, we will make the full details (including pricing) available on our website in order to ensure full transparency.


Frequently Asked Questions

Who is impacted by the licensing transition and how?

Academic/non-profit users: No change. The licensing terms remain the same and the GATK remains free to use for not-for-profit research and educational purposes. The current free user support model will remain available through the online forum.
Currently licensed Commercial/for-profit users: Appistry will continue to provide full GATK support for the remainder of your current license term. After that point, Broad Institute can offer you a GATK license directly. This will offer you immediate access to the latest version of GATK. We can work directly with you on your specific licensing questions at softwarelicensing@broadinstitute.org. For support questions, GATK product upgrade information or other suggestions, please comment in the discussion below or send us a private message.
Prospective commercial/for-profit users: Broad Institute can offer you a GATK license directly. This will offer you immediate access to the latest version of GATK. We can work directly with you on your specific licensing questions at softwarelicensing@broadinstitute.org. For support questions, GATK product upgrade information or other suggestions, please comment in the discussion below or send us a private message.

Will licensed users (commercial / for-profit) and non-licensed users (non-profit and academic) have access to different versions of GATK?

No. There will only be one version for all users. We will provide our licensed users total support for the very latest version. This means they will always be able to use the most cutting-edge tools and features available without sacrificing support.

Now that the Broad's licensing agreement with Appistry is ending, why not make GATK free for commercial users, just like it is for academic/non-profit users?

Part of the Broad Institute’s mission is to share our tools and research as broadly as possible to enable others to do transformational research. We started developing GATK several years ago and, since then, have constantly upgraded it, thanks to the hard work and dedication of many talented programmers, developers and genomic researchers in our group and beyond. That is why GATK remains the most advanced, accurate and reliable toolkit for variant discovery available anywhere (if we do say so ourselves). But please understand that building, maintaining, testing and constantly improving GATK is neither easy nor free. This is why we charge commercial users a licensing fee and funnel these resources back into upgrades of the tool itself – it allows us to continue to offer GATK for free to academic and non-profit organizations while ensuring it is always the best-of-the-best in an emerging and rapidly-changing field of research.

Will Broad offer only GATK as a licensed product or will there be an equivalent to the Cancer Genomics Analysis Suite offered by Appistry?

In addition to the GATK package, we will also offer a package that bundles GATK with MuTect and ContEst. That package will not include the SomaticIndelDetector, but a replacement for that functionality is in preparation.

A recent announcement indicated that Picard tools will be integrated into future versions of GATK. Does that mean tools that originated in Picard will be subject to the protected GATK license?

No. Tools originating from the Picard toolkit will remain free and fully open-source for everyone. We are preparing to integrate them into a part of GATK that will be under a BSD license.

Will researchers who develop and publish analysis pipelines involving GATK be allowed to bundle GATK in, e.g., any Docker images that they provide to the community?

We are preparing to enable this in order to facilitate sharing of scientific methods, but we need input from the community first. To that end, we’d like to invite researchers who are developing or have developed such pipeline images to contact us in order to discuss options. We are envisioning simple technical solutions to ensure that users of these images are made fully aware of their own legal responsibility relative to the GATK licensing status, in a way that minimizes the burden on the researchers who distribute them.

Error message during PhaseByTransmission


Hi, I am running PhaseByTransmission on a set of 250+ family trios. I am getting this error message:

Sample lgsnd32563jz3 found in data sources but not in pedigree files with STRICT pedigree validation

However, the sample is in the pedigree file. I tried adding the flag recommended in previous forum threads (--pedigreeValidationType SILENT), but then all the trios were excluded.

Can you advise me on what the problem could be? Here is my command:

java -Xmx8g -jar GenomeAnalysisTK.jar -T PhaseByTransmission -R /Volumes/Thunderbolt/ref_genome/human_g1k_v37.fasta -V /Volumes/Passport2/July2016/merged_vcf/Trios_GATK_all.dbID.db.eff.vcf -ped /Users/Yam/Desktop/Trios.ped -o /Volumes/Passport2/July2016/merged_vcf/Trios_GATK_all.dbID.db.eff_phased.vcf

Calculate Posterior Probability on Targeted Panel


Hello

After hard-filtering a callset that is too small for VQSR, I wanted to make sure that it is okay to still use CalculateGenotypePosteriors on a targeted panel (i.e. 10-50 genes).

James

HaplotypeCaller/VariantAnnotator: no allele balance tag for any SNPs


Version 3.1.1. Human normal samples.

I couldn't find the AlleleBalance and AlleleBalanceBySample tags in my VCF outputs. The tags are missing even for single variants.
I tried HaplotypeCaller with -all, or directly with -A AlleleBalance or -A AlleleBalanceBySample.
I also tried VariantAnnotator with -all, -A AlleleBalance, or -A AlleleBalanceBySample.

Any help will be appreciated

Overview of Queue


1. Introduction

GATK-Queue is a command-line scripting framework for defining multi-stage genomic analysis pipelines, combined with an execution manager that runs those pipelines from end to end. Processing genome data often involves several steps to produce the final outputs; for example, our BAM-to-VCF calling pipeline includes, among other things:

  • Local realignment around indels
  • Emitting raw SNP calls
  • Emitting indels
  • Masking the SNPs at indels
  • Annotating SNPs using chip data
  • Labeling suspicious calls based on filters
  • Creating a summary report with statistics

Running these tools one by one in series can take weeks of processing time, or would require custom scripting to make use of parallel resources.

With a Queue script, users can declaratively define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.
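That run-in-parallel-then-merge strategy is essentially a scatter/gather pattern. A toy shell sketch (purely illustrative, with made-up file names; Queue does this for real GATK walkers and additionally handles job schedulers, retries, and dependency tracking):

```shell
#!/bin/sh
# Toy scatter/gather sketch: run the same "tool" on different portions of
# the input in parallel, then merge the per-portion outputs.
set -e

printf 'chr1\nchr2\nchr3\n' > intervals.txt

# Scatter: launch one background job per interval.
while read -r iv; do
  ( echo "processed $iv" > "out_$iv.txt" ) &
done < intervals.txt

# Gather: wait for every job to finish, then merge results in a fixed order.
wait
cat out_chr1.txt out_chr2.txt out_chr3.txt > merged.txt
```

The key point is that the per-interval jobs are independent, so they can run concurrently and be merged afterwards.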


2. Obtaining Queue

You have two options: download the binary distribution (prepackaged, ready to run program) or build it from source.

- Download the binary

This is obviously the easiest way to go. Links are on the Downloads page. Just get the Queue package; no need to get the GATK package separately as GATK is bundled in with Queue.

- Building Queue from source

Briefly, here's what you need to know/do:

Queue is part of the GATK repository. Download the source from the public repository on Github. Run the following command:

git clone https://github.com/broadgsa/gatk.git

IMPORTANT NOTE: These instructions refer to the MIT-licensed version of the GATK+Queue source code. With that version, you will be able to build Queue itself, as well as the public portion of the GATK (the core framework), but that will not include the GATK analysis tools. If you want to use Queue to pipeline the GATK analysis tools, you need to clone the 'protected' repository. Please note however that part of the source code in that repository (the 'protected' module) is under a different license which excludes for-profit use, modification and redistribution.

Move to the git root directory and use maven to build the source.

mvn clean verify

All dependencies will be managed by Maven as needed.

See this article on how to test your installation of Queue.


3. Running Queue

See this article on running Queue for the first time for full details.

Queue arguments can be listed by running with --help

java -jar dist/Queue.jar --help

To list the arguments required by a QScript, add the script with -S and run with --help.

java -jar dist/Queue.jar -S script.scala --help

Note that by default Queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands, execute the pipeline by adding -run.
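For example, with a hypothetical script named ExampleScript.scala, a typical session looks like this (command fragment only; requires a Queue installation):

```shell
# Dry run (the default): Queue only prints the commands it would execute.
java -jar dist/Queue.jar -S ExampleScript.scala

# After inspecting the generated commands, add -run to actually execute them.
java -jar dist/Queue.jar -S ExampleScript.scala -run
```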

See QFunction and Command Line Options for more info on adjusting Queue options.

4. QScripts

General Information

Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.

Every QScript includes the following steps:

  • New instances of CommandLineFunctions are created

  • Input and output arguments are specified on each function

  • The function is added with add() to Queue for dispatch and monitoring

The basic command line to run a Queue pipeline is

java -jar Queue.jar -S <script>.scala

See the main article Queue QScripts for more info on QScripts.

Supported QScripts

Most QScripts are analysis pipelines that are custom-built for specific projects, and we currently do not offer any QScripts as supported analysis tools. However, we do provide some example scripts that you can use as a basis to write your own QScripts (see below).

Example QScripts

The latest version of the example files is available in the Sting github repository under public/scala/qscript/examples


5. Visualization and Queue

QJobReport

Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.

Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally the script requires the gsalib to be installed on the machine, which is typically done by providing its path in your .Rprofile file:

bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile
.libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")

Note that gsalib is available from the CRAN repository so you can install it with the canonical R package install command.
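For reference, that canonical install command can be issued from a shell as follows (requires R and network access to CRAN):

```shell
# Install gsalib from CRAN into the current R library path.
Rscript -e 'install.packages("gsalib", repos = "https://cran.r-project.org")'
```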

Caveats

  • The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the previously completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves.

  • This feature only works for the command line and LSF execution models. SGE should be easy to add for a motivated individual, but we cannot test these capabilities here at the Broad. Please send us a patch if you do extend Queue to support SGE.

DOT visualization of Pipelines

Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like Graphviz (dot), OmniGraffle, etc. to view your pipeline. By default the system will print out your LSF command lines, but this can be too much detail in a complex pipeline.
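As an example, assuming Graphviz is installed, the emitted file can be rendered to an image from the command line:

```shell
# Render Queue's queue.dot pipeline graph to a PNG with Graphviz.
dot -Tpng queue.dot -o queue.png
```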

To clarify your pipeline, override the dotString() function:

class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
    @Input(doc="foo") var bam = bamIn
    @Input(doc="foo") var bamIndex = bai(bamIn)
    @Output(doc="foo") var recalData = recalDataIn
    memoryLimit = Some(4)
    override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
    def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}

Here we only see CountCovariates my.bam [-OQ], for example, in the dot file. [Figure: the base quality score recalibration pipeline, as visualized by DOT.]

6. Further reading


MuTect2 tumor only mode: empty VCFs


Hello,

I am trying to run MuTect2 in tumor-only mode without a matched normal. However, when I did this, MuTect2 produced output VCFs that had a full header but no variant calls.

Here is a sample command that I ran to produce variant calls for chromosome 1:
module load gatk/3.5.0; java -Xmx10g -Djava.io.tmpdir=$TMP -jar GenomeAnalysisTK.jar -T MuTect2 -R $REFERENCE_GENOME_FA --dbsnp $DBSNP_VCF --cosmic $COSMIC_VCF -dt NONE --input_file:tumor $TUMOR_BAM --intervals chr1:1-249250621 -o $OUTPUT_VCF

Note: I ran a similar command to this (same input files, etc.) using MuTect v1.1.4 and it produced a complete VCF.

Can you please explain if there is anything I need to change?

Thank you,
Jeremy

GenotypeGVCFs sees DP incorrectly in INFO, not FORMAT field


I have a potential bug running GATK GenotypeGVCFs. It complains that there is a DP key in the INFO field, but my HaplotypeCaller-generated -mg.g.vcf.gz files do not have DP in INFO; they do have DP in the FORMAT field, and it is declared in the headers, as shown below the error output.

Any idea what could be the problem?

INFO  18:30:12,694 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  18:30:12,698 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-geee94ec, Compiled 2015/03/09 14:27:22 
INFO  18:30:12,699 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  18:30:12,699 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  18:30:12,706 HelpFormatter - Program Args: -l INFO -T GenotypeGVCFs -R /net/NGSanalysis/ref/Mus_musculus.GRCm38/index/bwa/Mus_musculus.GRCm38.dna.primary_assembly.fa -o /dev/stdout -ploidy 2 --num_threads 32 --intervals:targets,BED /net/NGSanalysis/ref/Mus_musculus.GRCm38/bed/SeqCap/ex100/110624_MM10_exome_L2R_D02_EZ_HX1-ex100.bed --max_alternate_alleles 20 -V:3428_10_14_SRO_185_TGGCTTCA-mg,VCF 3428_10_14_SRO_185_TGGCTTCA-mg.g.vcf.gz -V:3428_11_14_SRO_186_TGGTGGTA-mg,VCF 3428_11_14_SRO_186_TGGTGGTA-mg.g.vcf.gz -V:3428_12_13_SRO_422_TTCACGCA-mg,VCF 3428_12_13_SRO_422_TTCACGCA-mg.g.vcf.gz -V:3428_13_13_SRO_492_AACTCACC-mg,VCF 3428_13_13_SRO_492_AACTCACC-mg.g.vcf.gz -V:3428_14_13_SRO_493_AAGAGATC-mg,VCF 3428_14_13_SRO_493_AAGAGATC-mg.g.vcf.gz -V:3428_15_14_SRO_209_AAGGACAC-mg,VCF 3428_15_14_SRO_209_AAGGACAC-mg.g.vcf.gz -V:3428_16_14_SRO_218_AATCCGTC-mg,VCF 3428_16_14_SRO_218_AATCCGTC-mg.g.vcf.gz -V:3428_17_14_SRO_201_AATGTTGC-mg,VCF 3428_17_14_SRO_201_AATGTTGC-mg.g.vcf.gz -V:3428_18_13_SRO_416_ACACGACC-mg,VCF 3428_18_13_SRO_416_ACACGACC-mg.g.vcf.gz -V:3428_19_14_SRO_66_ACAGATTC-mg,VCF 3428_19_14_SRO_66_ACAGATTC-mg.g.vcf.gz -V:3428_1_13_SRO_388_GTCGTAGA-mg,VCF 3428_1_13_SRO_388_GTCGTAGA-mg.g.vcf.gz -V:3428_20_14_SRO_68_AGATGTAC-mg,VCF 3428_20_14_SRO_68_AGATGTAC-mg.g.vcf.gz -V:3428_21_14_SRO_210_AGCACCTC-mg,VCF 3428_21_14_SRO_210_AGCACCTC-mg.g.vcf.gz -V:3428_22_14_SRO_256_AGCCATGC-mg,VCF 3428_22_14_SRO_256_AGCCATGC-mg.g.vcf.gz -V:3428_23_14_SRO_270_AGGCTAAC-mg,VCF 3428_23_14_SRO_270_AGGCTAAC-mg.g.vcf.gz -V:3428_24_13_SRO_452_ATAGCGAC-mg,VCF 3428_24_13_SRO_452_ATAGCGAC-mg.g.vcf.gz -V:3428_2_13_SRO_399_GTCTGTCA-mg,VCF 3428_2_13_SRO_399_GTCTGTCA-mg.g.vcf.gz -V:3428_3_13_SRO_461_GTGTTCTA-mg,VCF 3428_3_13_SRO_461_GTGTTCTA-mg.g.vcf.gz -V:3428_4_13_SRO_462_TAGGATGA-mg,VCF 3428_4_13_SRO_462_TAGGATGA-mg.g.vcf.gz -V:3428_5_13_SRO_465_TATCAGCA-mg,VCF 3428_5_13_SRO_465_TATCAGCA-mg.g.vcf.gz -V:3428_6_13_SRO_402_TCCGTCTA-mg,VCF 3428_6_13_SRO_402_TCCGTCTA-mg.g.vcf.gz 
-V:3428_7_13_SRO_474_TCTTCACA-mg,VCF 3428_7_13_SRO_474_TCTTCACA-mg.g.vcf.gz -V:3428_8_13_SRO_531_TGAAGAGA-mg,VCF 3428_8_13_SRO_531_TGAAGAGA-mg.g.vcf.gz -V:3428_9_14_SRO_166_TGGAACAA-mg,VCF 3428_9_14_SRO_166_TGGAACAA-mg.g.vcf.gz 
INFO  18:30:12,714 HelpFormatter - Executing as roel@utonium.nki.nl on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13. 
INFO  18:30:12,714 HelpFormatter - Date/Time: 2015/05/06 18:30:12 
INFO  18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  18:30:12,715 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  18:30:15,963 GenomeAnalysisEngine - Strictness is SILENT 
INFO  18:30:16,109 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  18:30:29,705 IntervalUtils - Processing 101539431 bp from intervals 
WARN  18:30:29,726 IndexDictionaryUtils - Track 3428_10_14_SRO_185_TGGCTTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_11_14_SRO_186_TGGTGGTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,727 IndexDictionaryUtils - Track 3428_12_13_SRO_422_TTCACGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_13_13_SRO_492_AACTCACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_14_13_SRO_493_AAGAGATC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,728 IndexDictionaryUtils - Track 3428_15_14_SRO_209_AAGGACAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_16_14_SRO_218_AATCCGTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,729 IndexDictionaryUtils - Track 3428_17_14_SRO_201_AATGTTGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_18_13_SRO_416_ACACGACC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_19_14_SRO_66_ACAGATTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,730 IndexDictionaryUtils - Track 3428_1_13_SRO_388_GTCGTAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_20_14_SRO_68_AGATGTAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_21_14_SRO_210_AGCACCTC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,731 IndexDictionaryUtils - Track 3428_22_14_SRO_256_AGCCATGC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_23_14_SRO_270_AGGCTAAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_24_13_SRO_452_ATAGCGAC-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,732 IndexDictionaryUtils - Track 3428_2_13_SRO_399_GTCTGTCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_3_13_SRO_461_GTGTTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_4_13_SRO_462_TAGGATGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,733 IndexDictionaryUtils - Track 3428_5_13_SRO_465_TATCAGCA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_6_13_SRO_402_TCCGTCTA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_7_13_SRO_474_TCTTCACA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,734 IndexDictionaryUtils - Track 3428_8_13_SRO_531_TGAAGAGA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
WARN  18:30:29,735 IndexDictionaryUtils - Track 3428_9_14_SRO_166_TGGAACAA-mg doesn't have a sequence dictionary built in, skipping dictionary validation 
INFO  18:30:29,749 MicroScheduler - Running the GATK in parallel mode with 32 total threads, 1 CPU thread(s) for each of 32 data thread(s), of 64 processors available on this machine 
INFO  18:30:29,878 GenomeAnalysisEngine - Preparing for traversal 
INFO  18:30:29,963 GenomeAnalysisEngine - Done preparing for traversal 
INFO  18:30:29,964 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  18:30:29,965 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  18:30:29,966 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  18:30:30,562 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
INFO  18:31:00,420 ProgressMeter -       1:4845033         0.0    30.0 s      50.3 w        0.0%    46.7 h      46.7 h 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
java.lang.IllegalStateException: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
    at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:176)
    at htsjdk.variant.vcf.VCFEncoder.encode(VCFEncoder.java:115)
    at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:221)
    at org.broadinstitute.gatk.engine.io.storage.VariantContextWriterStorage.add(VariantContextWriterStorage.java:182)
    at org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub.add(VariantContextWriterStub.java:269)
    at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:351)
    at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.reduce(GenotypeGVCFs.java:119)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:291)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociReduce.apply(TraverseLociNano.java:280)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:279)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Key DP found in VariantContext field INFO at 1:4839315 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.
##### ERROR ------------------------------------------------------------------------------------------

for f in *.g.vcf.gz; do echo -e "\n-- $f --"; zcat "$f" | sed -n -r "/^#.*DP/p;/^1\t4839315\t/{p;q;}"; done

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:21:0,0,432
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:20:0:20:0,0,410
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,773
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:25:2:25:0,3,790
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,837
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:23:31:23:0,31,765
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-0.578;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:21:39:0,39,488,60,491,512:20,0,0,0
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:18:0:18:0,0,514
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:29:0:29:0,0,810
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:33:0:33:0,0,812
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:25:0,0,624
1       4839315 .       GA      G,<NON_REF>     0.08    .       ClippingRankSum=-0.189;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:17:20:20,0,311,62,320,382:14,0,3,0
1       4839315 .       GA      G,<NON_REF>     6.76    .       ClippingRankSum=-0.374;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:25:43:43,0,401,102,417,519:20,0,3,2
1       4839315 .       GA      G,<NON_REF>     0       .       ClippingRankSum=-1.095;MLEAC=0,0;MLEAF=0.00,0.00        GT:DP:GQ:PL:SB  0/0:23:1:0,1,395,56,406,460:19,0,0,0
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:28:0:28:0,0,626
1       4839315 .       GA      G,<NON_REF>     5.99    .       ClippingRankSum=-0.584;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:18:42:42,0,293,84,305,388:13,1,3,1
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:22:0:22:0,0,558
1       4839315 .       G       GA,<NON_REF>    6.76    .       ClippingRankSum=0.850;MLEAC=1,0;MLEAF=0.500,0.00        GT:DP:GQ:PL:SB  0/1:19:43:43,0,262,87,274,361:12,3,4,0
1       4839315 .       GA      G,<NON_REF>     16.82   .       ClippingRankSum=-0.784;MLEAC=1,0;MLEAF=0.500,0.00       GT:DP:GQ:PL:SB  0/1:21:54:54,0,352,102,367,470:16,0,4,1
1       4839315 .       G       <NON_REF>       .       .       END=4839317     GT:DP:GQ:MIN_DP:PL      0/0:26:0:25:0,0,419
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:30:0:30:0,0,771
1       4839315 .       G       <NON_REF>       .       .       END=4839315     GT:DP:GQ:MIN_DP:PL      0/0:34:77:34:0,78,1136
1       4839315 .       G       <NON_REF>       .       .       END=4839316     GT:DP:GQ:MIN_DP:PL      0/0:26:0:20:0,0,397
1       4839315 .       GAA     G,GA,<NON_REF>  22.75   .       ClippingRankSum=-2.181;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00        GT:DP:GQ:PL:SB  0/2:11:22:60,22,209,0,87,104,63,153,113,176:4,2,3,0

GATK on RAD Data - extraordinarily long run times


Hi all,

I have Type 2B RAD data from many individuals from several populations of my non-model species, mapped using Bowtie 0.12.8 to a reference database made by extracting all potential RAD sites from the available genome. I would like to run a first-pass UnifiedGenotyper run on a single individual, but even on a supercomputer and using the -nct and -nt flags, GATK says it will need 4.9 weeks to finish!

A collaborator suggested that GATK may just not handle many reference contigs well, but I have already reduced my reference database from the 1.6 million possible tags to the 95,000 tags that were seen at least 100x across all my individuals.

Does GATK respond to the number of contigs like this? Are there any tips you can give me to reduce the amount of time necessary to something more reasonable?

Thank you!
Roxana

(How to) Generate an unmapped BAM from FASTQ or aligned BAM



Here we outline how to generate an unmapped BAM (uBAM) from either a FASTQ or aligned BAM file. We use Picard's FastqToSam to convert a FASTQ (Option A) or Picard's RevertSam to convert an aligned BAM (Option B).

Jump to a section on this page

(A) Convert FASTQ to uBAM and add read group information using FastqToSam
(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

Tools involved

Prerequisites

  • Installed Picard tools

Download example data

Tutorial data reads were originally aligned to the advanced tutorial bundle's human_g1k_v37_decoy.fasta reference and come from the interval 10:91,000,000-92,000,000.

Related resources

  • Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure your read group fields conform to the recommendations.

  • This post provides an example command for AddOrReplaceReadGroups.

  • This How to is part of a larger workflow and tutorial on (How to) Efficiently map and clean up short read sequence data.
  • To extract reads in a genomic interval from the aligned BAM, use GATK's PrintReads.
  • In the future we will post on how to generate a uBAM from BCL data using IlluminaBasecallsToSam.

(A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam converts a FASTQ file to an unmapped BAM. It requires two read group fields and allows optional specification of the other read group fields. In the command below we note which fields are required for GATK Best Practices Workflows. All other read group fields are optional.

java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_1.fastq \ #first read file of pair
    FASTQ2=6484_snippet_2.fastq \ #second read file of pair
    OUTPUT=6484_snippet_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \ #required; changed from default of A
    SAMPLE_NAME=NA12878 \ #required
    LIBRARY_NAME=Solexa-272222 \ #required 
    PLATFORM_UNIT=H0164ALXX140820.2 \ 
    PLATFORM=illumina \ #recommended
    SEQUENCING_CENTER=BI \ 
    RUN_DATE=2014-08-20T00:00:00-0400

Some details on select parameters:

  • For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first read file and the second read file, respectively. Records in each file must be queryname sorted as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate reads are paired. Do not provide a single interleaved fastq file, as the tool will assume reads are unpaired and the SAM flag values will reflect single ended reads.

  • For single ended reads, specify the input file with FASTQ.

  • QUALITY_FORMAT is detected automatically if unspecified.
  • SORT_ORDER by default is queryname.
  • PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
  • RUN_DATE is in ISO 8601 date format.

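The queryname-ordering requirement above can be sanity-checked before conversion. A small Python sketch (the helper below is illustrative, not a Picard utility):

```python
# Verify that two FASTQ files are query-name matched after stripping the
# /1 and /2 suffixes, since FastqToSam assumes identical ordering for pairs.
def names_match(r1_headers, r2_headers):
    """Compare stripped read names between the R1 and R2 header lines."""
    strip = lambda h: h.lstrip("@").split()[0].split("/")[0]
    return [strip(h) for h in r1_headers] == [strip(h) for h in r2_headers]

r1 = ["@H0164ALXX140820:2:1101:10003:49022/1"]
r2 = ["@H0164ALXX140820:2:1101:10003:49022/2"]
print(names_match(r1, r2))  # True
```

In practice you would iterate over the header lines (every fourth line) of both FASTQ files rather than hold them in memory.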
Paired reads will have SAM flag values that reflect pairing and the fact that the reads are unmapped as shown in the example read pair below.

Original first read

@H0164ALXX140820:2:1101:10003:49022/1
ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA
+
-FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ

Original second read

@H0164ALXX140820:2:1101:10003:49022/2
TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG
+
AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F######################

After FastqToSam

H0164ALXX140820:2:1101:10003:49022      77      *       0       0       *       *       0       0       ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA -FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:49022      141     *       0       0       *       *       0       0       TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F###################### RG:Z:H0164.2

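The flag values 77 and 141 in these records can be decoded bit by bit per the SAM specification. A small Python illustration (not part of the original tutorial):

```python
# Decode the SAM FLAG values 77 and 141 seen above into their component bits.
# Bit meanings follow the SAM format specification.
FLAG_BITS = {
    0x1:  "read paired",
    0x4:  "read unmapped",
    0x8:  "mate unmapped",
    0x40: "first in pair",
    0x80: "second in pair",
}

def decode_flag(flag):
    """Return the descriptions of all flag bits set in `flag`."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(decode_flag(77))   # read paired, read unmapped, mate unmapped, first in pair
print(decode_flag(141))  # read paired, read unmapped, mate unmapped, second in pair
```
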


(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information and generate an unmapped BAM (uBAM). For our tutorial file we have to call on some additional parameters that we explain below. This illustrates the need to tailor the tool's parameters to each dataset. As such, it is a good idea to test the reversion process on a subset of reads before committing to reverting a large BAM in its entirety. Follow the directions in this How to to create a snippet of aligned reads corresponding to a genomic interval.

We use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
    I=6484_snippet.bam \
    O=6484_snippet_revertsam.bam \
    SANITIZE=true \ 
    MAX_DISCARD_FRACTION=0.005 \ #informational; does not affect processing
    ATTRIBUTE_TO_CLEAR=XT \
    ATTRIBUTE_TO_CLEAR=XN \
    ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
    ATTRIBUTE_TO_CLEAR=OC \
    ATTRIBUTE_TO_CLEAR=OP \
    SORT_ORDER=queryname \ #default
    RESTORE_ORIGINAL_QUALITIES=true \ #default
    REMOVE_DUPLICATE_INFORMATION=true \ #default
    REMOVE_ALIGNMENT_INFORMATION=true #default

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #directs temporary files to this directory

We invoke or change multiple RevertSam parameters to generate an unmapped BAM:

  • We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. Standard tags cleared by default are NM, UQ, PG, MD, MQ, SA, MC, and AS tags (AS for Picard releases starting 9/2015). Additionally, the OQ tag is removed by the default RESTORE_ORIGINAL_QUALITIES parameter. Remove all other nonstandard tags by specifying each with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option for our tutorial file so that it is free for use by other tools, e.g. MarkIlluminaAdapters. To list all tags within a BAM, use the command below.

    samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}' 
    

    For the tutorial file, this gives RG, OC, XN, OP and XT tags as well as those removed by default. We remove all of these except the RG tag. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags.

  • Additionally, we invoke the SANITIZE option to remove reads that cause problems for certain tools, e.g. MarkIlluminaAdapters. Downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Any paired reads file subset for a genomic interval requires sanitizing to remove reads with lost mates that align outside of the interval.

  • In this command, we've set MAX_DISCARD_FRACTION to a more strict threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. This parameter asks the tool to additionally inform you of the discarded fraction via an exception as it finishes processing.

    Exception in thread "main" picard.PicardException: Discarded 0.787% which is above MAX_DISCARD_FRACTION of 0.500%  
    

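The samtools one-liner above for listing tags has a plain-Python equivalent that can be fed the output of `samtools view input.bam` (a sketch, not part of the original tutorial):

```python
# Collect the distinct optional tag names present in SAM-formatted records,
# equivalent to the samtools/cut/awk pipeline shown above.
def collect_tags(sam_lines):
    """Return tag names in order of first appearance across the records."""
    tags = []
    for line in sam_lines:
        # Optional TAG:TYPE:VALUE fields start at column 12 of a SAM record.
        for field in line.rstrip("\n").split("\t")[11:]:
            tag = field.split(":", 1)[0]
            if tag not in tags:
                tags.append(tag)
    return tags

record = "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\tRG:Z:H0164.2\tMD:Z:151\tNM:i:0"
print(collect_tags([record]))  # ['RG', 'MD', 'NM']
```
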
Some comments on options kept at default:

  • SORT_ORDER=queryname
    For paired read files, because each read in a pair has the same query name, queryname sorting results in interleaved reads: reads in a pair are listed consecutively within the same file. We make sure to change the previous coordinate sort order, because coordinate-sorted paired reads are not randomly distributed and cause the aligner to estimate insert sizes incorrectly from blocks of paired reads.

  • RESTORE_ORIGINAL_QUALITIES=true
    Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag.

  • REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment flag and tag information. For example, flags are reset to unmapped values, e.g. 77 and 141 for paired reads. The parameter also invokes the default ATTRIBUTE_TO_CLEAR parameter, which removes standard alignment tags. RevertSam ignores ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false.

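The Phred+33 encoding mentioned above for the OQ tag and QUAL field is simple to illustrate: the base quality is the character's ASCII code minus 33.

```python
# Convert between QUAL/OQ characters and Phred-scaled base qualities
# (ASCII Phred+33 encoding, as used by the tutorial data).
def qual_char_to_phred(c):
    return ord(c) - 33

def phred_to_qual_char(q):
    return chr(q + 33)

print(qual_char_to_phred("F"))  # 37
print(phred_to_qual_char(2))    # '#', the low-quality tail seen in the reads above
```
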
Below we show a read pair from the tutorial data before and after RevertSam. Notice the first listed read in the pair becomes reverse-complemented after RevertSam. This restores how reads are represented when they come off the sequencer: 5' to 3' of the read being sequenced.

For 6484_snippet.bam, SANITIZE removes 2,202 out of 279,796 (0.787%) reads, leaving us with 277,594 reads.

Original BAM

H0164ALXX140820:2:1101:10003:23460  83  10  91515318    60  151M    =   91515130    -339    CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151    PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:0  MQ:i:0  OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA    UQ:i:0  AS:i:151

H0164ALXX140820:2:1101:10003:23460  163 10  91515130    0   60M91S  =   91515318    339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1;    MC:Z:151M   MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:2  MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF###########################################################################################    UQ:i:49 AS:i:50

After RevertSam

H0164ALXX140820:2:1101:10003:23460  77  *   0   0   *   *   0   0   TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2

H0164ALXX140820:2:1101:10003:23460  141 *   0   0   *   *   0   0   TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2



What's the version of bwa being implemented in GATK4 bwaspark tool?


Hi,

Is the bwa used by the BwaSpark tool the latest version (0.7.15)?

Thanks!

MutSigCV gene covariates and full exome coverage, gene symbol discrepancies.


Hi,

I observed this issue while analyzing a known dataset. Gene names in two files that MutSig uses (gene.covariates.txt and exome_full192.coverage.txt) do not follow HGNC nomenclature; some still use old symbols. But most variant annotation programs, such as Oncotator or VEP, use HGNC symbols for gene annotation. This discrepancy causes MutSig to not recognize well-known oncogenes in the MAF file, so it ignores them.

For example, KMT2D in Esophageal Squamous Carcinoma is frequently mutated, but MutSig's covariates file doesn't have this gene. Instead it has MLL2/MLL4, which are synonyms for KMT2D. This causes MutSig to ignore KMT2D in the analysis. I tried to convert gene names in these two files into HGNC symbols, but MutSig doesn't recognize these altered files and the result contains NaN values for the expr, reptime and hic columns.

Is there any way to fix this?

I think this should be addressed, since this is one of the most widely used program and someone doing a denovo analysis might miss important genes.
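
Going the other direction may be worth trying: rather than editing MutSig's bundled files (which, as noted, MutSig then rejects), remap the MAF's gene column back to the legacy symbols the covariate files expect. A Python sketch; the HGNC_TO_LEGACY table here is a made-up example and would need to be built from an HGNC synonyms export:

```python
# Rewrite HGNC symbols in a MAF back to the legacy symbols used by MutSig's
# covariate files. HGNC_TO_LEGACY is a hypothetical example table; build the
# real one from an HGNC synonyms download.
HGNC_TO_LEGACY = {
    "KMT2D": "MLL2",  # example from the post: MutSig's files use MLL2/MLL4
}

def remap_gene_column(maf_line, gene_col=0, sep="\t"):
    """Replace the gene symbol in one MAF record, leaving unknown symbols alone."""
    fields = maf_line.rstrip("\n").split(sep)
    fields[gene_col] = HGNC_TO_LEGACY.get(fields[gene_col], fields[gene_col])
    return sep.join(fields)

print(remap_gene_column("KMT2D\tMissense_Mutation"))  # MLL2<TAB>Missense_Mutation
```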

P.S.: I have also posted this on the CGA forum, which doesn't seem to be active.

Genotype set to missing with lots of hom ref reads


In the following final VCF (produced by following the gVCF workflow) I get quite a number of positions reported as missing (1.2M out of ~24M bases). This of course isn't unexpected; however, upon closer inspection I cannot see a reason why HC would call some of these positions as missing genotypes. Take for instance this 11 bp extract from a gVCF.

Supercontig_1.1 613355  .   A   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:78,0:78:18:0,18,270
Supercontig_1.1 613356  .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:77,0:77:9:0,9,135
Supercontig_1.1 613357  .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:78,0:78:0:0,0,0
Supercontig_1.1 613358  .   T   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:79,0:79:0:0,0,0
Supercontig_1.1 613359  .   T   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:79,0:79:0:0,0,0
Supercontig_1.1 613360  .   T   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:78,0:78:0:0,0,0
Supercontig_1.1 613361  .   G   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:77,0:77:0:0,0,0
Supercontig_1.1 613362  .   C   CT,<NON_REF>    3242.73 .   DP=82;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;RAW_MQ=295200    GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,74,0:74:99:0|1:613362_C_CT:3280,223,0,3280,223,3280:0,0,44,30
Supercontig_1.1 613363  .   T   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:4,72:76:0:0,0,0
Supercontig_1.1 613364  .   T   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:76,0:76:99:0,120,1800
Supercontig_1.1 613365  .   T   <NON_REF>   .   .   .   GT:AD:DP:GQ:PL  0/0:76,0:76:99:0,120,1800

Here is the same 11 bp stretch after running GenotypeGVCFs on the gVCF:

Supercontig_1.1 613355  .   A   .   .   PASS    AN=2;DP=78;VariantType=NO_VARIATION GT:AD:DP:RGQ    0/0:78:78:18
Supercontig_1.1 613356  .   G   .   .   PASS    AN=2;DP=77;VariantType=NO_VARIATION GT:AD:DP:RGQ    0/0:77:77:9
Supercontig_1.1 613357  .   G   .   .   PASS    DP=78;VariantType=NO_VARIATION  GT:AD:DP:RGQ    ./.:78:78:0
Supercontig_1.1 613358  .   T   .   .   PASS    DP=79;VariantType=NO_VARIATION  GT:AD:DP:RGQ    ./.:79:79:0
Supercontig_1.1 613359  .   T   .   .   PASS    DP=79;VariantType=NO_VARIATION  GT:AD:DP:RGQ    ./.:79:79:0
Supercontig_1.1 613360  .   T   .   .   PASS    DP=78;VariantType=NO_VARIATION  GT:AD:DP:RGQ    ./.:78:78:0
Supercontig_1.1 613361  .   G   .   .   PASS    DP=77;VariantType=NO_VARIATION  GT:AD:DP:RGQ    ./.:77:77:0
Supercontig_1.1 613362  .   C   CT  3242.73 PASS    AC=2;AF=1;AN=2;DP=82;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=30.85;SOR=1.134;VariantType=INSERTION.NumRepetitions_3.EventLength_1.RepeatExpansion_T  GT:AD:DP:GQ:PGT:PID:PL  1/1:0,74:74:99:1|1:613362_C_CT:3280,223,0
Supercontig_1.1 613363  .   T   .   .   PASS    DP=76;VariantType=NO_VARIATION  GT:AD:DP:RGQ    ./.:4:76:0
Supercontig_1.1 613364  .   T   .   .   PASS    AN=2;DP=76;VariantType=NO_VARIATION GT:AD:DP:RGQ    0/0:76:76:99
Supercontig_1.1 613365  .   T   .   .   PASS    AN=2;DP=76;VariantType=NO_VARIATION GT:AD:DP:RGQ    0/0:76:76:99

Positions 613,357-61 are all assigned a missing genotype (which I assume is because the genotype likelihoods for these positions are all equally likely according to HC). However, examining the raw bam output, I can see that ALL the reads covering these positions are 100% hom-ref, and this is also the case when examining the BAMOUT from HC. Could anyone explain why I get these no-calls which appear to me to be erroneous? All the mapping qualities are very high as are the base qualities.
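
The pattern in these records is consistent with the PL fields: at positions 613,357-61 the PLs are 0,0,0, so no genotype is better supported than any other, GQ is 0, and GenotypeGVCFs emits a no-call. A sketch of that relationship (illustrative logic, not GATK's actual implementation):

```python
# Genotype quality (GQ) is the difference between the two smallest
# Phred-scaled genotype likelihoods (PL). When PL = 0,0,0 every genotype
# is equally likely, GQ is 0, and the site is effectively uncallable.
def gq_from_pls(pls):
    ordered = sorted(pls)
    return ordered[1] - ordered[0]

def call(pls, min_gq=1):
    """Illustrative rule: no-call when nothing separates the genotypes."""
    if gq_from_pls(pls) < min_gq:
        return "./."
    return ["0/0", "0/1", "1/1"][pls.index(min(pls))]

print(call([0, 0, 0]))     # './.' -- positions 613357-61 above
print(call([0, 18, 270]))  # '0/0' -- position 613355 above
```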


A sudden stop when running GenotypeGVCFs.


The input script is like this:
java -jar /nfs/home/gatk/GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R /nfs/home/tool/gatk/bundle/2.8/hg19/ucsc.hg19.fasta \
    --variant /home/sample/sample4/AD_1.g.vcf \
    --variant /home/sample/sample4/AD_2.g.vcf \
    --variant /home/sample/sample4/AD_3.g.vcf \
    --variant /home/sample/sample4/AD_4.g.vcf \
    -o /home/sample/sample4/AD_raw_variants.vcf

It was running smoothly, but suddenly it stopped at chr19. The screen is like this:
INFO 17:47:12,334 ProgressMeter - chr19:11563684 5.32049E7 7.6 h 8.6 m 85.1% 9.0 h 80.0 m
INFO 17:47:42,335 ProgressMeter - chr19:12766650 5.3285579E7 7.6 h 8.6 m 85.2% 9.0 h 79.8 m
INFO 17:47:47,835 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NumberFormatException: For input string: "495"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:481)
at java.lang.Integer.parseInt(Integer.java:527)
at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:161)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:257)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:129)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: "495"
ERROR ------------------------------------------------------------------------------------------

jiangli@darwin:~$ ^C

The .g.vcf files were created by CombineGVCFs. So, is there anything wrong with these gVCF files?
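
Before re-running, it may help to dump the records just past the last position the progress meter reported (chr19:12,766,650) and inspect them for malformed fields. A quick sketch, assuming an uncompressed .g.vcf:

```python
# Print gVCF records in a window around the last position reported before
# the crash, so the offending record can be inspected by eye.
def records_in_window(lines, chrom, start, end):
    """Return non-header records on `chrom` with POS in [start, end]."""
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            hits.append(line)
    return hits

demo = ["##fileformat=VCFv4.1",
        "chr19\t12766700\t.\tG\t<NON_REF>\t.\t.\tEND=12766710"]
print(records_in_window(demo, "chr19", 12766600, 12767000))
```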

Best Wishes,
River Lee

java.lang.ClassNotFoundException error when running FastqToSam in GATK4


Hi,
Here is my log file. What does "java.lang.ClassNotFoundException: org.xerial.snappy.LoadSnappy" mean? Thanks

Using GATK jar /home/kh3/gatk-4.alpha.2-36-g13333ba-SNAPSHOT/gatk-package-4.alpha.2-36-g13333ba-SNAPSHOT-local.jar
Running:
java -jar /home/kh3/gatk-4.alpha.2-36-g13333ba-SNAPSHOT/gatk-package-4.alpha.2-36-g13333ba-SNAPSHOT-local.jar FastqToSam -SM test -F1 /home/kh3/data/Illumina/GATK4/Platinum/Coriell_12891_GCC
AAT_L001_R1_001.fastq.gz -F2 /home/kh3/data/Illumina/GATK4/Platinum/Coriell_12891_GCCAAT_L001_R2_001.fastq.gz -O /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam -SO coordinate -R /hom
e/kh3/Resources/genome_b37/genome.2bit --STRIP_UNPAIRED_MATE_NUMBER true --VALIDATION_STRINGENCY LENIENT -PL ILLUMINA --MAX_RECORDS_IN_RAM 6624267
19:28:00.322 INFO IntelGKLUtils - Trying to load Intel GKL library from:
jar:file:/home/kh3/gatk-4.alpha.2-36-g13333ba-SNAPSHOT/gatk-package-4.alpha.2-36-g13333ba-SNAPSHOT-local.jar!/com/intel/gkl/native/libIntelGKL.so
19:28:00.371 INFO IntelGKLUtils - Intel GKL library loaded from classpath.
[September 15, 2016 7:28:00 PM EDT] org.broadinstitute.hellbender.tools.picard.sam.FastqToSam --FASTQ /home/kh3/data/Illumina/GATK4/Platinum/Coriell_12891_GCCAAT_L001_R1_001.fastq.gz --FASTQ2 /home/
kh3/data/Illumina/GATK4/Platinum/Coriell_12891_GCCAAT_L001_R2_001.fastq.gz --output /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam --SAMPLE_NAME test --PLATFORM ILLUMINA --SORT_ORDER
coordinate --STRIP_UNPAIRED_MATE_NUMBER true --VALIDATION_STRINGENCY LENIENT --MAX_RECORDS_IN_RAM 6624267 --reference /home/kh3/Resources/genome_b37/genome.2bit --READ_GROUP_NAME A --MIN_Q 0 --MAX_
Q 93 --ALLOW_AND_IGNORE_EMPTY_LINES false --COMPRESSION_LEVEL 5 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --verbosity INFO --QUIET false --use_jdk_deflater false
[September 15, 2016 7:28:00 PM EDT] Executing as kh3@rgcaahauva08091.rgc.aws.regeneron.com on Linux 3.13.0-91-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: Version:4.alpha.
2-36-g13333ba-SNAPSHOT
19:28:00.397 INFO FastqToSam - Defaults.BUFFER_SIZE : 131072
19:28:00.398 INFO FastqToSam - Defaults.COMPRESSION_LEVEL : 5
19:28:00.398 INFO FastqToSam - Defaults.CREATE_INDEX : false
19:28:00.398 INFO FastqToSam - Defaults.CREATE_MD5 : false
19:28:00.398 INFO FastqToSam - Defaults.CUSTOM_READER_FACTORY :
19:28:00.398 INFO FastqToSam - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
19:28:00.398 INFO FastqToSam - Defaults.NON_ZERO_BUFFER_SIZE : 131072
19:28:00.398 INFO FastqToSam - Defaults.REFERENCE_FASTA : null
19:28:00.398 INFO FastqToSam - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
19:28:00.398 INFO FastqToSam - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:28:00.398 INFO FastqToSam - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
19:28:00.398 INFO FastqToSam - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:28:00.398 INFO FastqToSam - Defaults.USE_CRAM_REF_DOWNLOAD : false
19:28:00.398 INFO FastqToSam - Deflater IntelDeflater
19:28:00.398 INFO FastqToSam - Initializing engine
19:28:00.398 INFO FastqToSam - Done initializing engine
19:28:00.489 INFO FastqToSam - Auto-detected quality format as: Standard.
19:28:03.741 INFO FastqToSam - Processed 1,000,000 records. Elapsed time: 00:00:03s. Time for last 1,000,000: 3s. Last read position: /
19:28:08.820 INFO FastqToSam - Processed 2,000,000 records. Elapsed time: 00:00:08s. Time for last 1,000,000: 5s. Last read position: /
19:28:12.281 INFO FastqToSam - Processed 3,000,000 records. Elapsed time: 00:00:11s. Time for last 1,000,000: 3s. Last read position: /
19:28:17.976 INFO FastqToSam - Processed 4,000,000 records. Elapsed time: 00:00:17s. Time for last 1,000,000: 5s. Last read position: /
19:28:23.149 INFO FastqToSam - Processed 5,000,000 records. Elapsed time: 00:00:22s. Time for last 1,000,000: 5s. Last read position: /
19:28:25.998 INFO FastqToSam - Processed 6,000,000 records. Elapsed time: 00:00:25s. Time for last 1,000,000: 2s. Last read position: /
19:29:16.029 INFO FastqToSam - Shutting down engine
[September 15, 2016 7:29:16 PM EDT] org.broadinstitute.hellbender.tools.picard.sam.FastqToSam done. Elapsed time: 1.26 minutes.
Runtime.totalMemory()=4369416192
Exception in thread "main" java.lang.NoClassDefFoundError: org/xerial/snappy/LoadSnappy
at htsjdk.samtools.util.SnappyLoader.<init>(SnappyLoader.java:86)
at htsjdk.samtools.util.SnappyLoader.<init>(SnappyLoader.java:52)
at htsjdk.samtools.util.TempStreamFactory.getSnappyLoader(TempStreamFactory.java:42)
at htsjdk.samtools.util.TempStreamFactory.wrapTempOutputStream(TempStreamFactory.java:74)
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:223)
at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:166)
at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:192)
at org.broadinstitute.hellbender.tools.picard.sam.FastqToSam.doPaired(FastqToSam.java:222)
at org.broadinstitute.hellbender.tools.picard.sam.FastqToSam.makeItSo(FastqToSam.java:181)
at org.broadinstitute.hellbender.tools.picard.sam.FastqToSam.doWork(FastqToSam.java:156)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:109)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:167)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgram.instanceMain(PicardCommandLineProgram.java:61)
at org.broadinstitute.hellbender.Main.instanceMain(Main.java:76)
at org.broadinstitute.hellbender.Main.main(Main.java:92)
Caused by: java.lang.ClassNotFoundException: org.xerial.snappy.LoadSnappy
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 15 more

trustAnchors mustn't be empty ...


I'm starting to get a weird warning message which appears four times at the end of any GATK command run as a qsub job on a Linux SGE.

INFO 15:57:27,936 HttpMethodDirector - I/O exception (javax.net.ssl.SSLException) caught when processing request: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty

Our sysadmin thinks it may have been broken by a Java update. I'm still using GATK 3.4, so I know, I know, I should get it updated.

incompatible reference and reads error when running GATK4 BwaSpark


The BAM was created using FastqToSam, so it is unaligned; I think that's why the read contig list is empty. Is there any solution for this? Thanks!

Running:
/home/kh3/Softwares/gatk/build/install/gatk/bin/gatk BwaSpark -I /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.sam -R /home/kh3/Resources/genome_b37/genome.fa -t 16 -O /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam
23:30:44.246 INFO IntelGKLUtils - Trying to load Intel GKL library from:
jar:file:/home/kh3/Softwares/gatk/build/install/gatk/lib/gkl-0.1.2.jar!/com/intel/gkl/native/libIntelGKL.so
23:30:44.295 INFO IntelGKLUtils - Intel GKL library loaded from classpath.
[September 15, 2016 11:30:44 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark --output /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.bam --threads 16 --reference /home/kh3/Resources/genome_b37/genome.fa --input /home/kh3/data/Illumina/GATK4/Platinum/TEST/test.spark.sam --fixedChunkSize 100000 --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --bamPartitionSize 0 --disableSequenceDictionaryValidation false --shardedOutput false --numReducers 0 --sparkMaster local[*] --help false --version false --verbosity INFO --QUIET false --use_jdk_deflater false --disableAllReadFilters false
[September 15, 2016 11:30:44 PM EDT] Executing as kh3@rgcaahauva08091.rgc.aws.regeneron.com on Linux 3.13.0-91-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13; Version: Version:4.alpha.2-45-ga30af5a-SNAPSHOT
23:30:44.320 INFO BwaSpark - Defaults.BUFFER_SIZE : 131072
23:30:44.320 INFO BwaSpark - Defaults.COMPRESSION_LEVEL : 1
23:30:44.320 INFO BwaSpark - Defaults.CREATE_INDEX : false
23:30:44.320 INFO BwaSpark - Defaults.CREATE_MD5 : false
23:30:44.321 INFO BwaSpark - Defaults.CUSTOM_READER_FACTORY :
23:30:44.321 INFO BwaSpark - Defaults.EBI_REFERENCE_SERVICE_URL_MASK : http://www.ebi.ac.uk/ena/cram/md5/%s
23:30:44.321 INFO BwaSpark - Defaults.NON_ZERO_BUFFER_SIZE : 131072
23:30:44.321 INFO BwaSpark - Defaults.REFERENCE_FASTA : null
23:30:44.321 INFO BwaSpark - Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
23:30:44.321 INFO BwaSpark - Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:30:44.321 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:30:44.321 INFO BwaSpark - Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:30:44.321 INFO BwaSpark - Defaults.USE_CRAM_REF_DOWNLOAD : false
23:30:44.321 INFO BwaSpark - Deflater IntelDeflater
23:30:44.321 INFO BwaSpark - Initializing engine
23:30:44.321 INFO BwaSpark - Done initializing engine
23:30:44.756 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23:30:46.883 INFO BwaSpark - Shutting down engine
[September 15, 2016 11:30:46 PM EDT] org.broadinstitute.hellbender.tools.spark.bwa.BwaSpark done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=498597888


A USER ERROR has occurred: Input files reference and reads have incompatible contigs: No overlapping contigs found.
reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605, hs37d5]
reads contigs = []


Here is the content of my input bam:
@HD VN:1.5 SO:coordinate
@RG ID:A SM:test PL:ILLUMINA
@CDDDDFHHHGHJHGHHIJJJGEGHGGFHGJIIIGIGHGGGGIGHAHGIGIBFCA(=FGGJIE;CG;AHFHFECD RG:Z:A
@FFFFFHFHFHJIIIIIJJJAEHIHIDF?BGCHEGCG*?DHGDB=DHHE@@GHGHG@EHJE>AE?B,776>A88 RG:Z:A
@FFFFFDHHGHJJJIIGGIIJGJIIGIJIBGGBDEBBFGHGGDEIDBF0CGBHIGC7=CEE=AEE@5?>(>C;; RG:Z:A
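
The contig check that fails here can be sketched as follows (a standalone illustration, not GATK's actual code): intersect the `SN:` names from the `@SQ` lines of the reference dictionary and of the BAM header. An unaligned BAM like the one above carries no `@SQ` lines at all, so the intersection is empty and the tool refuses to start. (The tool's own argument list shows a `--disableSequenceDictionaryValidation` flag, though whether relaxing the check is appropriate here is a judgment call.)

```python
def sq_contigs(header_text):
    """Collect SN: contig names from the @SQ lines of a SAM header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t"):
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names

# Reference dictionary header (abbreviated) vs. the unaligned BAM header above.
ref_header = "@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373"
ubam_header = "@HD\tVN:1.5\tSO:coordinate\n@RG\tID:A\tSM:test\tPL:ILLUMINA"

overlap = set(sq_contigs(ref_header)) & set(sq_contigs(ubam_header))
print(sorted(overlap))  # prints [] -- no overlapping contigs, hence the error
```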

Panel Of Normals for MuTect


I am wondering if there is a public repository from which one can build a panel of normals.
A couple of questions about that:
1. Must all the samples for a PON come from the same platform?
2. If not, is there a publicly available PON?

Many thanks
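
For context on question 2, the classic MuTect PON recipe (as I understand it; the exact commands vary by version) is to call each normal on its own in artifact-detection mode and then keep only the sites recurrent in at least two normals. The filtering idea can be sketched in a few lines (a standalone illustration, not MuTect code):

```python
from collections import Counter

def build_pon(per_normal_calls, min_normals=2):
    """Keep sites called in at least `min_normals` normals -- the idea behind
    combining the per-normal artifact-mode VCFs with a minimum-N threshold."""
    counts = Counter(site for calls in per_normal_calls for site in set(calls))
    return sorted(site for site, n in counts.items() if n >= min_normals)

# Each set holds one normal's calls as (chrom, pos, ref, alt) tuples.
normals = [
    {("1", 1000, "A", "T"), ("1", 2000, "C", "G")},
    {("1", 1000, "A", "T")},
    {("2", 500, "G", "A")},
]
print(build_pon(normals))  # only the site seen in >=2 normals survives
```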
