
Using JEXL to apply hard filters or select variants based on annotation values


1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.


2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".
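
To make this concrete, here is a hypothetical hard-filtering command that uses the OR expression above with VariantFiltration (the file names and filter name are placeholders); sites matching the expression are marked with the filter name in the FILTER column:

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V variants.vcf \
        --filterExpression "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0" \
        --filterName "my_hard_filter" \
        -o filtered.vcf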


4. Filtering on sample/genotype-level properties

You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1).
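
For example, a genotype-level filter might be applied with VariantFiltration's genotype filter arguments, along these lines (a sketch; the file names and filter name are placeholders):

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V variants.vcf \
        --genotypeFilterExpression "GQ < 5.0 || isHet == 1" \
        --genotypeFilterName "lowGQ_or_het" \
        -o genotype_filtered.vcf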


5. Important caveats

Sensitivity to case and type

You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:

  • Case
    Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type
The types (i.e. string, integer, floating-point or boolean) used in your expression must match the type of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (specifically, a Java exception, e.g. a NumberFormatException for numerical type mismatches).
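
To make the type point concrete, assuming QUAL is stored as a non-integer value, the first of these two expressions matches the type while the second may trigger the exception described above:

"QUAL < 50.0"
"QUAL < 50"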

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.
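
One way to split things up in practice is to give each criterion its own expression/name pair rather than writing one compound expression (a sketch with placeholder file and filter names); the FILTER column will then show exactly which test a site failed:

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V variants.vcf \
        --filterExpression "QD < 2.0" --filterName "lowQD" \
        --filterExpression "FS > 200.0" --filterName "highFS" \
        -o filtered.vcf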


6. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The best way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

Using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by accessing the underlying VariantContext as follows:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of a JEXL expression that finds all novel variants in the total set that have an allele frequency > 0.25 but not 1, are not filtered, and are non-reference in the 01-0263 sample:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select '! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' \
        -o 01-0263.high_freq_novels.vcf -sn 01-0263

Using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V my.vcf \
        -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V my.vcf \
        -select 'vc.hasAttribute("DB")'

Using VariantContext to access annotations in multiallelic sites

The order of alleles in the VariantContext object is not guaranteed to be the same as in the VCF output, so accessing the AF by an index derived from a scrambled alleles array is dangerous. However! If we have the sample genotypes, there's a workaround:

java -jar GenomeAnalysisTK.jar -T SelectVariants  \
        -R reference.fasta  \
        -V multiallelics.vcf  \
        -select 'vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf

The odd 1.0 is there because otherwise we're dividing two integers, which here would always truncate to 0. The vc.hasGenotypes() is extra error checking. This might be slow for large files, but we could use something like this if performance is a concern:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V multiallelics.vcf \
         -select 'vc.isBiallelic() ? AF > 0.1 : vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf

Here the ternary expression hopefully short-circuits the extra vc calls for all the biallelic sites.

Using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").getAD().0 > 10'

If you would like to select sites where the alternate allele frequency is greater than 50%, you can use the following expression:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R reference.fasta \
        -V variants.vcf \
        -select 'vc.getGenotype("NA12878").getAD().1 / (1.0*vc.getGenotype("NA12878").getDP()) > 0.50'

Java versioning changes and the future of compatibility


I have a question regarding this issue. Oracle has decided to start issuing rolling releases of the Java virtual machine every 6 months. Current tools work with Java 1.8 (a.k.a. Java 8), but I am wondering what will happen to compatibility in the long run. By the way, Java 10 was released yesterday. OpenJDK is adopting the same strategy, as they use the same codebase for their releases. Some systems may update inadvertently, which may break compatibility.

Thanks.

Haplotype caller not picking up variants for HiSeq Runs


Hello,
We were sequencing all our data on the HiSeq and have now moved to the NextSeq. We have sequenced the same batch of samples on both sequencers, and both are processed using the same pipeline/parameters.
What I have noticed is that GATK 3.7 HaplotypeCaller is not picking up variants, even though the coverage is good and the variants are evidently present in the BAM file.

For example, the screenshot below shows the BAM files for both the NextSeq and HiSeq sample. There are at least 3 variants in the region 22:29885560-29885861 (NEPH, exon 5) that are expected to be picked up for HiSeq.

These variants are picked up for the NextSeq samples (even though the coverage for HiSeq is much better).

The command that I have used for both samples is

java -Xmx32g -jar GATK_v3_7/GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37.fa --dbsnp GATK_ref/dbsnp_138.b37.vcf -I ${i}.HiSeq_Run31.variant_ready.bam -L NEPH.bed -o ${i}.HiSeq_Run31.NEPH.g.vcf

Any idea why this might happen?

Many thanks,

Why am I getting this error when running GetPileupSummaries?


The code:
gatk GetPileupSummaries -I V2.bam -V annot.vcf.gz -O getpileupsummaries.table

The error:

11:13:10.829 INFO GetPileupSummaries - Initializing engine
11:13:11.515 INFO FeatureManager - Using codec VCFCodec to read file file:///path.to file.vcf.gz
11:13:11.646 INFO GetPileupSummaries - Done initializing engine
11:13:11.647 INFO ProgressMeter - Starting traversal
11:13:11.648 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
11:13:11.694 INFO GetPileupSummaries - Shutting down engine
[March 21, 2018 11:13:11 AM PDT] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1705508864
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsDouble(CommonInfo.java:323)
at htsjdk.variant.variantcontext.VariantContext.getAttributeAsDouble(VariantContext.java:732)
at org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries.alleleFrequencyInRange(GetPileupSummaries.java:197)
at org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries.apply(GetPileupSummaries.java:172)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:110)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:277)

What is the GATKReport file format?


A GATKReport is simply a text document that contains a well-formatted, easy-to-read representation of some tabular data. Many GATK tools output their results as GATKReports, so it's important to understand how they are formatted and how you can use them in further analyses.

Here's a simple example:

#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7         qualavg.61PA8.7                                         
0      7.451835696110506E-3      25.474613284804366                                      
1      2.362777171937477E-3      29.844949954504095                                      
2      9.087604507451836E-4      32.875909752547310
3      5.452562704471102E-4      34.498999090081895                                      
4      9.087604507451836E-4      35.148316651501370                                       
5      5.452562704471102E-4      36.072234352256190                                       
6      5.452562704471102E-4      36.121724890829700                                        
7      5.452562704471102E-4      36.191048034934500                                        
8      5.452562704471102E-4      36.003457059679770                                       

#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key    column
1:1000  T 
1:1001  A 
1:1002  C 

This report contains two individual GATK report tables. Every table begins with a header for its metadata and then a header for its name and description. The next row contains the column names followed by the data.

We provide an R library called gsalib that allows you to load GATKReport files into R for further analysis. Here are four simple steps to getting gsalib, installing it and loading a report.

1. Start R (or open RStudio)

$ R

R version 2.11.0 (2010-04-22)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

2. Get the gsalib library from CRAN

The gsalib library is available on the Comprehensive R Archive Network (CRAN), so you can install it from within R (we use RStudio for convenience):

> install.packages("gsalib")

In some cases you need to explicitly tell R where to find the library; you can do this as follows:

$ cat .Rprofile 
.libPaths("/path/to/Sting/R/")

3. Load the gsalib library

> library(gsalib)

4. Finally, load the GATKReport file and have fun

> d = gsa.read.gatkreport("/path/to/my.gatkreport")
> summary(d)
              Length Class      Mode
CountVariants 27     data.frame list
CompOverlap   13     data.frame list

How to import GATK reports into Python/pandas DataFrames


Answering my own question: I searched around and could not find a Python equivalent to the gsalib R library, which enables analysis in R of GATK reports produced by VariantEval and other tools. So I created a port of the R gsalib to a Python module. The GatkReport object will read a report file (supports v0 and v1) and load each table into a pandas DataFrame.

Install with: pip install gsalib.

Example usage:

from gsalib import GatkReport

report = GatkReport('/path/to/gsalib/test/test_v1.0_gatkreport.table')
table = report.tables['ExampleTable']

Does GATK4 Mutect2 support Force Calling?


Hi,
Does GATK4 Mutect2 support Force Calling? If yes, could you let me know how to do it? Thanks a lot!

Thanks,
Chunyang

VariantRecalibrator Engine failure


Hi,

I'm trying to run the following on WGS samples:

/gatk/build/install/gatk/bin/gatk VariantRecalibrator -R /cromwell_root/fc-f36b3dc8-85f7-4d7f-bc99-a4610229d66a/broadinstitute/reference/hg19/fasta/Homo_sapiens_assembly19.fasta -V /cromwell_root/fc-123f0687-7fec-4d69-a5a3-9b4ac93beab0/7e1d530b-33ec-46a2-9967-441071e5d714/jointgenotype_genotype/ffe4d42d-6465-45de-b0bc-01fbd5f4622a/call-ApplyRecalSNP/normalsampleset.recal.snp.g.vcf.gz --resource mills,known=false,training=true,truth=true,prior=12.0:recalibrate/Mills_and_1000G_gold_standard.indels.b37.vcf --resource dbsnp,known=true,training=false,truth=false,prior=2.0:recalibrate/dbsnp_138.b37.vcf --use-annotation QD --use-annotation MQRankSum --use-annotation ReadPosRankSum --use-annotation FS --mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.7 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.2 -tranche 99.1 -tranche 99.0 -tranche 98.9 -tranche 98.8 -tranche 98.6 -tranche 98.5 -tranche 98.3 -tranche 98.2 -tranche 98.1 -tranche 98.0 -tranche 97.9 -tranche 97.8 -tranche 97.5 -tranche 97.0 -tranche 95.0 -tranche 90.0 --output normalsampleset.INDEL.recal --tranches-file normalsampleset.INDEL.tranches --rscript-file normalsampleset.INDEL.R

It is failing halfway with a cryptic error:

15:39:10.737 INFO VariantRecalibrator - Done initializing engine
15:39:10.767 INFO TrainingSet - Found mills track: Known = false Training = true Truth = true Prior = Q12.0
15:39:10.768 INFO TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0
15:39:10.775 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "recal". Defaulting to VCF.
15:39:10.795 INFO ProgressMeter - Starting traversal
15:39:10.795 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
15:39:20.827 INFO ProgressMeter - 1:219441985 0.2 191000 1142344.5
15:39:30.854 INFO ProgressMeter - 2:202422268 0.3 439000 1313126.3
15:39:40.880 INFO ProgressMeter - 3:44195999 0.5 687000 1370118.0
15:39:50.923 INFO ProgressMeter - 4:29584552 0.7 928000 1387559.8
15:40:00.940 INFO ProgressMeter - 5:20409595 0.8 1200000 1435836.1
15:40:10.978 INFO ProgressMeter - 6:21835741 1.0 1437000 1432654.3
15:40:20.994 INFO ProgressMeter - 7:23090705 1.2 1706000 1458161.2
15:40:31.056 INFO ProgressMeter - 8:39816783 1.3 1964000 1468210.0
15:40:41.078 INFO ProgressMeter - 9:28887875 1.5 2206000 1466073.0
15:40:51.109 INFO ProgressMeter - 10:32387304 1.7 2480000 1483342.3
15:41:01.113 INFO ProgressMeter - 11:39779480 1.8 2767000 1504922.1
15:41:11.211 INFO ProgressMeter - 12:44466895 2.0 3038000 1513764.9
15:41:21.242 INFO ProgressMeter - 14:21930361 2.2 3312000 1523377.3
15:41:31.269 INFO ProgressMeter - 16:18076709 2.3 3594000 1535088.3
15:41:41.269 INFO ProgressMeter - 17:22089136 2.5 3871000 1543522.5
15:41:51.281 INFO ProgressMeter - 18:27782257 2.7 4104000 1534339.4
15:42:01.290 INFO ProgressMeter - 19:36452733 2.8 4368000 1537180.2
15:42:11.307 INFO ProgressMeter - 20:61623489 3.0 4622000 1536296.8
15:42:21.318 INFO ProgressMeter - X:14415975 3.2 4917000 1548474.5
15:42:25.221 INFO ProgressMeter - GL000192.1:348384 3.2 5103668 1574995.5
15:42:25.221 INFO ProgressMeter - Traversal complete. Processed 5103668 total variants in 3.2 minutes.
15:42:25.291 INFO VariantDataManager - QD: mean = 20.37 standard deviation = 8.35
15:42:25.374 INFO VariantDataManager - MQRankSum: mean = 0.18 standard deviation = 0.65
15:42:25.456 INFO VariantDataManager - ReadPosRankSum: mean = 0.26 standard deviation = 0.96
15:42:25.524 INFO VariantDataManager - FS: mean = 3.15 standard deviation = 5.59
15:42:25.968 INFO VariantDataManager - Annotations are now ordered by their information content: [QD, FS, MQRankSum, ReadPosRankSum]
15:42:26.000 INFO VariantDataManager - Training with 332973 variants after standard deviation thresholding.
15:42:26.007 INFO GaussianMixtureModel - Initializing model with 100 k-means iterations...
15:42:47.908 INFO VariantRecalibratorEngine - Finished iteration 0.
15:42:54.626 INFO VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 1.45712
15:43:00.822 INFO VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.09152
15:43:07.009 INFO VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.09895
15:43:13.567 INFO VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.08980
15:43:19.994 INFO VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.08297
15:43:26.272 INFO VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.11782
15:43:32.744 INFO VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.10006
15:43:39.222 INFO VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.15533
15:43:46.351 INFO VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.01678
15:43:53.432 INFO VariantRecalibratorEngine - Finished iteration 50. Current change in mixture coefficients = 0.00999
15:44:00.404 INFO VariantRecalibratorEngine - Finished iteration 55. Current change in mixture coefficients = 0.00607
15:44:07.300 INFO VariantRecalibratorEngine - Finished iteration 60. Current change in mixture coefficients = 0.00359
15:44:14.207 INFO VariantRecalibratorEngine - Finished iteration 65. Current change in mixture coefficients = 0.00190
15:44:14.207 INFO VariantRecalibratorEngine - Convergence after 65 iterations!
15:44:14.830 INFO VariantRecalibratorEngine - Evaluating full set of 946745 variants...
15:44:14.830 WARN VariantRecalibratorEngine - Evaluate datum returned a NaN.
15:44:14.862 INFO VariantDataManager - Selected worst 0 scoring variants --> variants with LOD <= -5.0000.
15:44:14.897 INFO VariantRecalibrator - Shutting down engine
[March 21, 2018 3:44:14 PM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 5.11 minutes.
Runtime.totalMemory()=2712666112
java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:34)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.onTraversalSuccess(VariantRecalibrator.java:629)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:895)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:277)

I'm not sure what the 'No data found' error refers to. Does it mean the filtering is too strict, leaving no variants to output, or something of that sort?

Thanks for the help.


Best practice germline variant calling hg19 on Firecloud


Hello, I noticed that the five-dollar-genome-analysis pipeline on FireCloud is exclusively meant for use with an hg38 reference. Is there an equivalent for hg19? Thanks,

Eric

(Howto) Run GATK4 in a Docker container


1. Install Docker

Follow the relevant link below depending on your computer system; on Mac and Windows, select the "Stable channel" download. Run through the installation instructions and initial setup page; they are very straightforward and should only take you a few minutes (not counting download time).
We have included instructions below for all steps after that first page, so you shouldn't need to go to any other pages in the Docker documentation. Frankly their docs are targeted at people who want to do things like run web applications on the cloud and can be quite frustrating to deal with.

Click here for Mac

Click here for Windows

Full list of supported systems and their install pages


2. Get the GATK4 container image

Go to your Terminal (it doesn't matter where your working directory is) and run the following command.

docker pull broadinstitute/gatk:4.beta.6

Note that the last bit after gatk: is the version tag, which you can change to get a different version than what we've specified here.

The GATK4 image is quite large so the download may take a little while if you've never done this before. The good news is that next time you need to pull a GATK4 image (e.g. to get another release), Docker will only pull the components that have been updated, so it will go faster.


3. Start up the GATK4 container

There are several different ways to do this in Docker. Here we're going to use the simplest invocation that gets us the functionality we need, i.e. the ability to log into the container once it's running and execute commands from inside it.

docker run -it broadinstitute/gatk:4.beta.6

If all goes well, this will start up the container in interactive mode, and you will automatically get logged into it. Your terminal prompt will change to something like this:

root@ea3a5218f494:/gatk#

At this point you can use classic shell commands to explore the container and see what's in there, if you like.


4. Run a GATK4 command in the container

The container has the gatk-launch script all set up and ready to go, so you can now run any GATK or Picard command you want. Note that if you want to run a Picard command, you need to use the new syntax, which follows GATK conventions (-I instead of I= and so on). Let's use --list to list all tools available in this version.

./gatk-launch --list

The output will start with a usage message (shown below) then a full list of tools and their summary descriptions.

Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
Running:
    /gatk/build/install/gatk/bin/gatk --help
USAGE:  <program name> [-h]
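
To illustrate the syntax point made above about Picard tools, here is a hypothetical sketch of running a bundled Picard tool through the wrapper; note the GATK-style dashed argument instead of the classic I=input.bam form (input.bam is a placeholder):

./gatk-launch ValidateSamFile -I input.bam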

Once you've verified that this works for you, you know you can run any GATK4 commands you want. But before you proceed, there's one more setup thing to go through, which is technically optional but will make your life much easier.


5. Use a mounted volume to access data that lives outside the container

This is the final piece of the puzzle. By default, when you're inside the container you can't access any data that lives on the filesystem outside of the container. One way to deal with that is to copy things back and forth, but that's wasteful and tedious. So we're going to follow the better path, which is to mount a volume in the container, i.e. establish a link that makes part of the filesystem visible from inside the container.

The hitch is that you can't do this after you've started running the container, so you'll have to shut it down and run a new one (not just restart the first one) with an extra part to the command. In case you're wondering why we didn't do this from the get-go, it's because the first command we ran is simpler, so there's less chance that something will go wrong, which is nice when you're trying something for the first time.

To shut down your container from inside it, you can just type exit while still inside the container:

exit

That should stop the container and take you back to your regular prompt. It's also possible to exit the container without stopping it (a move called detaching), but that's a matter for another time since here we do want to stop it. You'll probably also want to learn how to clean up and delete old instances of containers that you no longer want.
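
If you want to do that cleanup now, these are standard Docker commands (not GATK-specific) that list all containers, including stopped ones, and then remove one by its ID (reusing the example ID shown earlier):

docker ps -a
docker rm ea3a5218f494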

For now, let's focus on starting a new instance of the GATK4 container, specifying in the following command the image version you want and the filesystem location you want to mount.

docker run -v ~/my_project:/gatk/my_data -it broadinstitute/gatk:4.beta.6

Here I set the external location to be an existing directory called my_project in my home directory (the key requirement is that it has to be an absolute path), and I'm setting the mount point inside the container's /gatk directory. The name of the mount point can be the same as the mount directory, or something completely different; the main constraint is that it should not conflict with an existing directory, otherwise that would make the existing directory inaccessible.

Assuming your paths are valid, this command starts up the container and logs you into it the same way as before; but now you can see by using ls that you have access to your filesystem. So now you can run GATK commands on any data you have lying around. Have fun!
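
For example, assuming the mount shown above and a hypothetical BAM and reference sitting in ~/my_project, a command run from inside the container might look like this:

./gatk-launch HaplotypeCaller \
        -R /gatk/my_data/reference.fasta \
        -I /gatk/my_data/sample.bam \
        -O /gatk/my_data/sample.g.vcf \
        -ERC GVCF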

How to properly merge bamouts after Mutect2


Hi,
I ran Mutect2 (GATK4) using the -L argument to split by chromosome, and the -bamout option.
I am now using CatVariants (GATK3) to merge all the VCF files (one per chromosome) into a final VCF.
I also need to merge the bamouts. I know there are several tools to merge BAMs, but I've read that some of these tools have issues with the headers, etc.
Is there any particular tool you would recommend to properly merge the BAMs?

Thanks!

Over estimation of AF in Mutect2 (GATK 4.0.2.1) ?


Hi,

I ran Mutect2 (GATK 4.0.2.1) followed by FilterMutectCalls with default parameters. I got some passed variants with AF much larger than alt_depth/total_depth (see attached image). I checked the read mapping qualities and most of them are good, so the reads are not likely being filtered by Mutect2.

Why is the AF larger than the alt_depth/total_depth?

I also noted that the three variants with overestimated AF show read orientation bias. Can Mutect2 filter variants by read orientation bias?

Thanks!

chrX 41000336 . A G . PASS DP=2025;ECNT=1;NLOD=241.68;N_ART_LOD=-6.740e-01;POP_AF=1.000e-03;P_GERMLINE=-2.387e+02;TLOD=5.88 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1094,7:0.030:563,1:531,6:33:200,195:60:33:false:false:.:.:100.00:51.51:0.010,0.010,6.358e-03:6.461e-04,1.733e-03,0.998 0/0:828,2:0.028:422,1:406,1:34:199,181:60:19:false:false
chrX 66766308 . A G . PASS DP=1691;ECNT=2;NLOD=90.81;N_ART_LOD=-1.323e+00;POP_AF=1.000e-03;P_GERMLINE=-8.780e+01;TLOD=6.03 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1292,8:0.029:578,0:714,8:32:291,291:60:38:false:false:.:.:100.00:51.51:0.010,0.00,6.154e-03:7.151e-04,2.907e-03,0.996 0/0:314,1:0.033:152,0:162,1:32:191,297:60:15:false:false
chrX 66766320 . C T . PASS DP=1382;ECNT=2;NLOD=74.68;N_ART_LOD=-2.139e+00;POP_AF=1.000e-03;P_GERMLINE=-7.140e+01;TLOD=7.39 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1060,10:0.030:453,0:607,10:30:292,329:60:32:false:false:.:.:100.00:52.34:0.010,0.010,9.346e-03:0.042,4.118e-04,0.957 0/0:256,1:0.029:124,0:132,1:20:186,135:60:32:false:false

Unable to determine status of job ID error


Hi,
I occasionally get the following type of error when using the Genome STRiP CNVDiscovery pipeline on a cluster running SGE. It will run successfully for long periods of time and then arrive at this type of error:

WARN 18:12:43,778 DrmaaJobRunner - Unable to determine status of job id 128098
org.ggf.drmaa.DrmCommunicationException: failed receiving gdi request response for mid=63604 (can't send response for this message id - protocol error).

Usually I can just restart the pipeline and it will run to completion, but I don't know whether I need to restart the whole stage. I sometimes have a hard time telling whether there are actually problems I need to address. Any tips on troubleshooting this correctly?

Thanks!

"Unable to determine status of job" warn in Genome STRiP preprocessing


During the preprocessing step, the following kind of warning keeps occurring:

WARN  10:50:19,392 DrmaaJobRunner - Unable to determine status of job id 2231784
org.ggf.drmaa.DrmCommunicationException: unable to send message to qmaster using port 6448 on host "sge2": got send error
        at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:402)
        at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392)
        at org.broadinstitute.gatk.utils.jna.drmaa.v1_0.JnaSession.getJobProgramStatus(JnaSession.java:156)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:124)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobRunner.updateJobStatus(DrmaaJobRunner.scala:123)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:56)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:56)
        at scala.collection.immutable.Set$Set3.foreach(Set.scala:115)
        at org.broadinstitute.gatk.queue.engine.drmaa.DrmaaJobManager.updateStatus(DrmaaJobManager.scala:56)
        at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1369)
        at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1361)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.broadinstitute.gatk.queue.engine.QGraph.updateStatus(QGraph.scala:1361)
        at org.broadinstitute.gatk.queue.engine.QGraph.runJobs(QGraph.scala:548)
        at org.broadinstitute.gatk.queue.engine.QGraph.run(QGraph.scala:168)
        at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:170)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
        at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)

This will certainly cause a job to fail:

INFO  18:03:26,604 QCommandLine - Script failed: 1 Pend, 0 Run, 1 Fail, 980 Done

If this occurs at the very end of the master job (as shown above: 1 Pend, 0 Run, 1 Fail, 980 Done), then I think the intermediate files have already been deleted, which forces all jobs to be redone if I resubmit the master job.

The "Unable to determine status of job" warning occurs randomly (it is not specific to a particular job), since resubmitting the master job can get past it. But if it occurs at the very end, then resubmitting the master job will redo everything.

So is there a way to prevent this warning? If not, I think the program really needs to be improved to prevent this kind of "redo everything" situation.

(howto) Recalibrate variant quality scores = run VQSR


Objective

Recalibrate variant quality scores and produce a callset filtered for the desired levels of sensitivity and specificity.

Prerequisites

  • TBD

Caveats

This document provides a typical usage example including parameter values. However, the values given may not be representative of the latest Best Practices recommendations. When in doubt, please consult the FAQ document on VQSR training sets and parameters, which overrides this document. See that document also for caveats regarding exome vs. whole genomes analysis design.

Steps

  1. Prepare recalibration parameters for SNPs
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  2. Build the SNP recalibration model

  3. Apply the desired level of recalibration to the SNPs in the call set

  4. Prepare recalibration parameters for Indels
    a. Specify which call sets the program should use as resources to build the recalibration model
    b. Specify which annotations the program should use to evaluate the likelihood of Indels being real
    c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches
    d. Determine additional model parameters

  5. Build the Indel recalibration model

  6. Apply the desired level of recalibration to the Indels in the call set


1. Prepare recalibration parameters for SNPs

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).
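
(As a reminder, a Phred-scaled prior of Q corresponds to a probability of 1 - 10^(-Q/10); for example, Q15 works out to 1 - 10^(-1.5) ≈ 0.9684, which is where the 96.84% quoted below comes from.)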

  • True sites training resource: HapMap

This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G

This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP

This resource is a SNP call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of SNPs being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.
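
For example, a missing annotation might be added with a command along these lines (a sketch; the file names are placeholders, -I supplies the reads that some annotations need, and -A names the annotation to add):

java -jar GenomeAnalysisTK.jar -T VariantAnnotator \
        -R reference.fa \
        -V raw_variants.vcf \
        -I recalibrated.bam \
        -A Coverage \
        -o annotated_variants.vcf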

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • RMSMappingQuality (MQ)

Estimation of the overall mapping quality of reads supporting a variant call.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.


2. Build the SNP recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \ 
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \ 
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \ 
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \ 
    -an DP \ 
    -an QD \ 
    -an FS \ 
    -an SOR \ 
    -an MQ \
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff \
    -mode SNP \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -rscriptFile recalibrate_SNP_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_SNP.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_SNP.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the VQSR method documentation and presentation videos.


3. Apply the desired level of recalibration to the SNPs in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input raw_variants.vcf \ 
    -mode SNP \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_SNP.recal \ 
    -tranchesFile recalibrate_SNP.tranches \ 
    -o recalibrated_snps_raw_indels.vcf 

Expected Result

This creates a new VCF file, called recalibrated_snps_raw_indels.vcf, which contains all the original variants from the original raw_variants.vcf file, but now the SNPs are annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the truth training sets of HapMap and Omni SNPs. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.


4. Prepare recalibration parameters for Indels

a. Specify which call sets the program should use as resources to build the recalibration model

For each training set, we use key-value tags to qualify whether the set contains known sites, training sites, and/or truth sites. We also use a tag to specify the prior likelihood that those sites are true (using the Phred scale).

  • Known and true sites training resource: Mills

This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

The default prior likelihood assigned to all other variants is Q2 (36.90%). This low value reflects the fact that the philosophy of the GATK callers is to produce a large, highly sensitive callset that needs to be heavily refined through additional filtering.

b. Specify which annotations the program should use to evaluate the likelihood of Indels being real

These annotations are included in the information generated for each variant call by the caller. If an annotation is missing (typically because it was omitted from the calling command) it can be added using the VariantAnnotator tool.

  • Coverage (DP)

Total (unfiltered) depth of coverage. Note that this statistic should not be used with exome datasets; see caveat detailed in the VQSR arguments FAQ doc.

  • QualByDepth (QD)

Variant confidence (from the QUAL field) / unfiltered depth of non-reference samples.

  • FisherStrand (FS)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the StrandOddsRatio (SOR) annotation.

  • StrandOddsRatio (SOR)

Measure of strand bias (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false positive calls. This complements the FisherStrand (FS) annotation.

  • MappingQualityRankSumTest (MQRankSum)

The rank sum test for mapping qualities. Note that the mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • ReadPosRankSumTest (ReadPosRankSum)

The rank sum test for the distance from the end of the reads. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.

  • InbreedingCoeff

Evidence of inbreeding in a population. See caveats regarding population size and composition detailed in the VQSR arguments FAQ doc.

c. Specify the desired truth sensitivity threshold values that the program should use to generate tranches

  • First tranche threshold 100.0

  • Second tranche threshold 99.9

  • Third tranche threshold 99.0

  • Fourth tranche threshold 90.0

Tranches are essentially slices of variants, ranked by VQSLOD, bounded by the threshold values specified in this step. The threshold values themselves refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

d. Determine additional model parameters

  • Maximum number of Gaussians (-maxGaussians) 4

This is the maximum number of Gaussians (i.e. clusters of variants that have similar properties) that the program should try to identify when it runs the variational Bayes algorithm that underlies the machine learning method. In essence, this limits the number of different "profiles" of variants that the program will try to identify. This number should only be increased for datasets that include a very large number of variants.


5. Build the Indel recalibration model

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantRecalibrator \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \
    -an QD \
    -an DP \ 
    -an FS \ 
    -an SOR \ 
    -an MQRankSum \ 
    -an ReadPosRankSum \ 
    -an InbreedingCoeff \
    -mode INDEL \ 
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \ 
    --maxGaussians 4 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -rscriptFile recalibrate_INDEL_plots.R 

Expected Result

This creates several files. The most important file is the recalibration report, called recalibrate_INDEL.recal, which contains the recalibration data. This is what the program will use in the next step to generate a VCF file in which the variants are annotated with their recalibrated quality scores. There is also a file called recalibrate_INDEL.tranches, which contains the quality score thresholds corresponding to the tranches specified in the original command. Finally, if your installation of R and the other required libraries was done correctly, you will also find some PDF files containing plots. These plots illustrate the distribution of variants according to certain dimensions of the model.

For detailed instructions on how to interpret these plots, please refer to the online GATK documentation.


6. Apply the desired level of recalibration to the Indels in the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ApplyRecalibration \ 
    -R reference.fa \ 
    -input recalibrated_snps_raw_indels.vcf \ 
    -mode INDEL \ 
    --ts_filter_level 99.0 \ 
    -recalFile recalibrate_INDEL.recal \ 
    -tranchesFile recalibrate_INDEL.tranches \ 
    -o recalibrated_variants.vcf 

Expected Result

This creates a new VCF file, called recalibrated_variants.vcf, which contains all the original variants from the original recalibrated_snps_raw_indels.vcf file, but now the Indels are also annotated with their recalibrated quality scores (VQSLOD) and either PASS or FILTER depending on whether or not they are included in the selected tranche.

Here we are taking the second lowest of the tranches specified in the original recalibration command. This means that we are applying to our data set the level of sensitivity that would allow us to retrieve 99% of true variants from the Mills truth training set. If we wanted to be more specific (and therefore have less risk of including false positives, at the risk of missing real sites) we could take the very lowest tranche, which would only retrieve 90% of the truth training sites. If we wanted to be more sensitive (and therefore less specific, at the risk of including more false positives) we could take the higher tranches. In our Best Practices documentation, we recommend taking the second highest tranche (99.9%) which provides the highest sensitivity you can get while still being acceptably specific.


Is it possible to merge the multiple samples within a vcf file?


Greetings from Australia,
Could someone please help me with this?
I got a single vcf file containing 14 samples at the end of the best practices pipeline.
For a bulk segregation analysis, I would like to have these samples merged to form 2 bulks.

A QTLseqr (R) function takes a single table as input (produced with VariantsToTable), but it only allows 2 sample names (BulkA and BulkB) to be specified.

Thank you

Inconsistent results with HaplotypeCaller on haploid organism


Hello GATK team,

I would appreciate some help in understanding how GATK works in GVCF mode on my data.
Here is my data example (I'm using GATK v3.8):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 328-16 983-16
NC_018661.1 859953 . C T 31035.30 SnpCluster AC=15;AF=0.536;AN=28;BaseQRankSum=-1.134e+00;ClippingRankSum=0.00;DP=7961;FS=3.292;MLEAC=15;MLEAF=0.536;MQ=56.52;MQRankSum=0.00;QD=7.76;ReadPosRankSum=-5.480e-01;SOR=0.486 GT:AD:DP:GQ:PL 0:354,0:354:99:0,106 1:157,157:314:99:361,0

  • The first weird thing is that the variant is called in heterozygosis with the highest GQ (99) even though we are analyzing a haploid sample; this is different from the explanation given in this post

  • The second issue appears when we look at this position in IGV in our aligned reads (aligned with bwa mem, BAM format). There we see that both samples seem to have this site at about 50% AD, but HaplotypeCaller calls it totally differently.

These are the parameters we use for one sample:
java -Djava.io.tmpdir=/processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/TMP/ -Xmx10g -jar /opt/gatk/gatk-3.8.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/REFERENCES/GCF_000299475.1_ASM29947v1_genomic_NoPlasmid.fna -I /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/Alignment/BAM/328-16/328-16.woduplicates.bam -o /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/variants/328-16.g.vcf -stand_call_conf 30 --emitRefConfidence GVCF -ploidy 1 -S LENIENT -log /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/snp_indels.vcf-HaplotypeCaller.log

How is this even possible? (I have checked over and over that the BAM files used in IGV and passed to GATK are the same, you never know...)

Could the effect referred to in this thread be somehow affecting the variant calling? Should we use BP_RESOLUTION? What is the main difference between GVCF and BP_RESOLUTION mode?

Our first idea was to select by AD on our GVCFs using JEXL expressions, but since the GVCF has reference blocks with no AD, the command fails:

ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![35,47]: 'vc.getGenotype('328-16').getAD().1.floatValue() / vc.getGenotype('328-16').getDP() > 0.90;' attempting to call method on null

I could filter them manually before GenotypeGVCFs, but is that good practice? As I read in this thread, it is not recommended, obviously because we would be overriding the GATK model, which takes many more variables into account...
Any ideas? We are kind of struggling; maybe it is something trivial, but we can't see it. Any help will be much appreciated.

Thanks very much in advance,
Best Regards
Sara

GenotypeGVCFs shows flag despite all reads in all samples supporting the reference allele


Hello,
I have been jointly genotyping variants in my samples using GenotypeGVCFs, but a few sites are emitted (flagged PASS) despite all reads in all samples supporting the reference allele. Do you know why this might be the case?

An example:
chr2 37521 . T 66.18 PASS AC=0;AF=0.00;AN=30;DP=320;InbreedingCoeff=-0.0000;MLEAC=0;MLEAF=0.00;MQ=64.28 GT:AD:DP:RGQ 0/0:22,0:22:63 0/0:27,0:27:66 0/0:17,0:17:48 0/0:17,0:17:42 0/0:31,0:31:63 0/0:25,0:25:60 0/0:15,0:15:36 0/0:22,0:22:63 0/0:26,0:26:72 0/0:22,0:22:60 0/0:28,0:28:81 0/0:22,0:22:60 0/0:20,0:20:57 0/0:12,0:12:33 0/0:14,0:14:42
Thank you in advance for your help.

Calling variants on whole exome and whole genome samples together?


Hi,

I have 15 affected samples. 2 are whole exome and 13 are whole genome. They have already been realigned at the single-sample level and had BQSR performed. I am contemplating running UnifiedGenotyper on all 15 samples together because we would like to compare the calls across the samples (especially in the coding regions). I am aware that there would be a large number of variant calls in the whole genome samples that would have little to no coverage in the exome samples. I haven't been able to find any posts that say you should or shouldn't run whole genome and exome samples through UnifiedGenotyper together. Are there any reasons why this should be discouraged?

Also, assuming I do perform multi-sample calling across all 15 samples, would it be ok to run that multi-sample VCF file through VQSR?

Thanks!
Jared

Cross comparison between Array and NGS data


Dear GATK staff,

I have 11 samples that were sequenced using NGS (Illumina HiSeq), and 2 of these samples were also genotyped using an Illumina Human Global Screening Array (Illumina iScan). I was looking at your latest WDL script and I've noticed a few steps that I think are related, but I don't know how to prepare the inputs for them. Any help or additional explanation on them would be really appreciated!

# Check identity of fingerprints across readgroups
CrossCheckFingerprints
input: haplotype_database_file

What information should I use to create this file? Array data? I have already read these links but I'm still lost:
http://gatkforums.broadinstitute.org/gatk/discussion/comment/37543
http://gatkforums.broadinstitute.org/gatk/discussion/9526/picard-haplotype-map-file-format

# Estimate level of cross-sample contamination
CheckContamination
input: contamination_sites_vcf

What information should I use to create this file?

# Check the sample BAM fingerprint against the sample array 
CheckFingerprint
input: haplotype_database_file
input: genotypes

What information should I use to create these files? What does each input stand for?

Thank you very much in advance.

Best regards,
Santiago
