Channel: Recent Discussions — GATK-Forum

Error: DP annotation: HaplotypeCaller in GATK pipeline

I was following the GATK pipeline on my patient data to find germline mutations.

Everything was fine through the BQSR step (the output file is patient1.sorted_dupl_realign_recalib.resort.bam).
Then I called variants using HaplotypeCaller:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R genome.fa -I patient1.sorted_dupl_realign_recalib.resort.bam -o patient1.g.vcf -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000

So, I finally got the patient1.g.vcf file.

Now, I am trying to do the variant recalibration step.

java -jar GenomeAnalysisTK.jar -nt 16 -T VariantRecalibrator -R genome.fa -input patient1.g.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf -an DP -an QD -an FS -an MQRankSum -mode SNP --maxGaussians 4 -recalFile patient1.raw.SNPs.recal -tranchesFile patient1.raw.SNPs.tranches -rscriptFile patient1.recal.plots.R

ERROR MESSAGE: Bad input: Values for DP annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.

I get this error even though my patient1.g.vcf file has the DP annotation; I checked.
I googled and searched this forum, but I could not solve the problem.

Could you please help me with this?


Why did the option -env not exclude all the non-variant sites?

Hi,

I used SelectVariants to remove some samples from my initial data. In doing so, I ended up with some sites where there is no longer any variant, because the variation was carried only by the samples that were removed.
However, even when I add the option -env (--excludeNonVariants) to remove all sites with no variation between samples, the output VCF still contains sites where there is no variation between the samples but all of them differ from the reference.

Here is my command:

java -Xmx4G -jar /path/to/GenomeAnalysisTK.jar -T SelectVariants -R /path/to/ref.fa -V starting_vcf_file.vcf -env -o results_vcf_file.vcf

I expected the resulting VCF to contain only sites with variation between the samples. However, I still get sites such as:

scaffold1 25003 . T C 40226.42 PASS AC=14;AF=1.00;AN=14;DP=866;ExcessHet=0.2482;FS=0.000;MQ=60.06;QD=31.72;SOR=1.046 GT:AD:DP:GQ:PL 1:0,54:54:99:2427,0 1:0,69:69:99:3089,0 1:0,83:83:99:3807,0 1:0,35:35:99:1530,0 1:0,40:40:99:1842,0 1:0,31:31:99:1405,0 1:0,65:65:99:2899,0 1:0,43:43:99:1935,0 1:0,55:55:99:2464,0 1:0,64:64:99:2900,0 1:0,86:86:99:3850,0 1:0,91:91:99:4182,0 1:0,84:84:99:3817,0 1:0,66:66:99:3029,0

As you can see, there is no variation between the samples! Does GATK consider sites where all samples carry the alternative allele to be variants?
If so, how can I exclude these sites?

Thank you very much in advance.

How to mark specific individual genotypes as No Call

Hi,

I have been trying to find a way to mark specific individual genotypes as No Call.

I know that in VariantFiltration it is possible to add the option --setFilteredGtToNocall in order to mark filtered genotypes as no-call. However, in my case, there is no available filter corresponding to my criteria. Let me explain. I have diploid-haploid paired related samples, i.e. I expect the haploid individual to share one of the two alleles of its diploid relative. Therefore, for positions with discordant genotypes, I would like to mark the individual genotypes as no-call so as not to include potential errors in my data. I couldn't find a way to apply this rationale within the VariantFiltration or SelectVariants pipeline...

Eventually, I managed to edit the .vcf file myself to mark discordant genotypes as "." or "./.". However, when I later tried to select the positions with at most a 0.2 fraction of no-calls (using the --maxNOCALLfraction option of SelectVariants from the nightly build), I got the error message "there aren't enough columns for line...". I suppose this is because I edited the VCF myself and may have introduced errors (although I'm fairly confident in my script).

Therefore, I think it's better not to edit the VCF by hand but to do this within the pipeline instead... but how?

When does IndelRealigner discard reads?

I'm using IndelRealigner version 3.4-46-gbc02625; the command-line (CL) field from the BAM's @PG record is:

CL:knownAlleles=[] targetIntervals=/data2/processed/dreamchallenge_set1/synthetic.challenge.set1.tumor.v2/tmp/synthetic.challenge.set1.tumor.v2.target.intervals.list LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null

The (sambamba-produced) flagstat output is very different before and after: about 18M reads are gone. We don't normally see this, but we also haven't run the DREAM data before. In what situations can IndelRealigner discard reads?

The documentation states that this tool does not downsample by default. I notice the RealignerTargetCreator logs show some reads failing filters and downsampling to 1000x coverage. My assumption is that this should not affect the final result, as that step only produces the regions to clean.

(I'm aware that in many cases indel re-alignment is no longer recommended).

How can I invoke read filters and their arguments?

Most GATK tools apply several read filters by default. You can look up exactly what the defaults are for each tool on their respective Technical Documentation pages.

But sometimes you want to specify additional filters yourself (and before you ask, no, you cannot disable the default read filters used by a given tool). This is how you do it:

The --read_filter argument (or -rf for short) allows you to apply whatever read filters you'd like. For example, to add the MaxReadLengthFilter filter to PrintReads, you just add this to your command line:

--read_filter MaxReadLength 

Notice that when you specify a read filter, you need to strip the Filter part of its name off!

The read filter will be applied with its default value (which you can also look up in the Tech Docs for that filter). Now, if you want to specify a different value from the default, you pass the relevant argument by adding this right after the read filter:

--read_filter MaxReadLength --maxReadLength 76

It's important that you pass the argument right after the filter itself, otherwise the command line parser won't know that they're supposed to go together.

And of course, you can add as many filters as you like by using multiple copies of the --read_filter parameter:

--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead

Interpreting Allele frequency equal to 1.0

Hi,

In the following VCF record, the allele frequency (AF) of the ALT allele is 1.0. Does that mean it is not a true SNP of interest? Does it also mean that the ALT allele always occurs and the REF allele never occurs in the population?

I'm also not sure how the allele count AC=37 is calculated.

Chr1 pos=565286, REF=C, ALT=T, INFO AC=37;AF=1.0 FORMAT GT:AC:AF:NC Child=1:11:1.0:+T=4,-T=7 mother=1:26:1.0:+T=8,-T=18

Cheers,
Ambi

I do not get the annotations I specified with -A

The problem

You specified -A <some annotation> in a command line invoking one of the annotation-capable tools (HaplotypeCaller, MuTect2, UnifiedGenotyper and VariantAnnotator), but that annotation did not show up in your output VCF.

Keep in mind that all annotations that are necessary to run our Best Practices are annotated by default, so you should generally not need to request annotations unless you're doing something a bit special.

Why this happens & solutions

There can be several reasons why this happens, depending on the tool, the annotation, and your data. These are the four we see most often; if you encounter another that is not listed here, let us know in the comments.

  1. You requested an annotation that cannot be calculated by the tool

    For example, you're running MuTect2 but requested an annotation that is specific to HaplotypeCaller. There should be an error message to that effect in the output log. It's not possible to override this, but if you believe the annotation should be available to the tool, let us know in the forum and we'll consider putting in a feature request.

  2. You requested an annotation that can only be calculated if an optional input is provided

    For example, you're running HaplotypeCaller and you want InbreedingCoefficient, but you didn't specify a pedigree file. There should be an error message to that effect in the output log. The solution is simply to provide the missing input file. Another example: you're running VariantAnnotator and you want to annotate Coverage, but you didn't specify a BAM file. The tool needs to see the read data in order to calculate the annotation, so again, you simply need to provide the BAM file (see the sketch after this list).

  3. You requested an annotation that has requirements which are not met by some or all sites

    For example, you're looking at RankSumTest annotations, which require heterozygous sites in order to perform the necessary calculations, but you're running on haploid data so you don't have any het sites. There is no workaround; the annotation is not applicable to your data. Another example: you requested InbreedingCoefficient, but your population includes fewer than 10 founder samples, which are required for the annotation calculation. There is no workaround; the annotation is not applicable to your data.

  4. You requested an annotation that is already applied by default by the tool you are running

    For example, you requested Coverage from HaplotypeCaller, which already annotates this by default. There is currently a bug that causes some default annotations to be dropped from the list if specified on the command line. This will be addressed in an upcoming version. For now the workaround is to check what annotations are applied by default and NOT request them with -A.

MuTect2 high insertion counts?

Hi Folks

I am using Mutect2 to analyze blood vs. FFPE tumor samples (breast cancer).

I am getting what I think are unusually high insertion:SNV ratios, between 2:1 and 3:1, i.e. very high numbers of insertions.
The deletion:SNV ratio is between 0.1:1 and 0.25:1.

I was wondering if anyone else had experienced something similar or had any advice / comments?

Best regards,

Fourie


UnifiedGenotyper error: Somehow the requested coordinate is not covered by the read.

Dear GATK Team,

I am receiving the following error while running GATK 1.6. Unfortunately, for project consistency I cannot update to a more recent version of GATK and would at least wish to understand the source of the error. I ran ValidateSamFile on the input bam files and they appear to be OK.

Any insight or advice would be greatly appreciated:

ERROR ------------------------------------------------------------------------------------------

ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions?
at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:425)
at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:374)
at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:370)
at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinates(ReadClipper.java:445)
at org.broadinstitute.sting.utils.clipping.ReadClipper.hardClipByReferenceCoordinatesRightTail(ReadClipper.java:176)
at org.broadinstitute.sting.gatk.walkers.indels.PairHMMIndelErrorModel.computeReadHaplotypeLikelihoods(PairHMMIndelErrorModel.java:196)
at org.broadinstitute.sting.gatk.walkers.genotyper.IndelGenotypeLikelihoodsCalculationModel.getLikelihoods(IndelGenotypeLikelihoodsCalculationModel.java:212)
at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoods(UnifiedGenotyperEngine.java:235)
at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine.calculateLikelihoodsAndGenotypes(UnifiedGenotyperEngine.java:164)
at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:302)
at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.map(UnifiedGenotyper.java:115)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:78)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:63)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:248)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 1.6-22-g3ec78bd):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
ERROR
ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
ERROR ------------------------------------------------------------------------------------------

Abbreviated command line used:

GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -et NO_ET \
-R "Saccharomyces_cerevisiae/UCSC/sacCer2/Sequence/WholeGenomeFasta/genome.fa" \
-dcov 5000 -I "someFile.bam" --output_mode EMIT_ALL_SITES -gvcf -l OFF \
-stand_call_conf 1 -L chrIV:1-1531919

Bug report: DepthOfCoverage csv output

Hi GATK team.

FYI:

when running DepthOfCoverage (GATK 3.6.0) with the following arguments:

-T DepthOfCoverage  -R $(REF)   --validation_strictness LENIENT  -L:capture,BED in.bed  -I in.al --outputFormat csv -o  out.tsv --omitIntervalStatistics  --omitPerSampleStats --omitLocusTable 

the header line appears to be tab-delimited:

Locus   Total_Depth Average_Depth_sample    Depth_for_SAMPLE

while the remaining lines are comma-delimited:

1:1116653,116,116.00,116

Regards,

Pierre

Are there published guidelines for interpreting FS (FisherStrand) and SOR (StrandOddsRatio) scores?

I have recently come across some false positive variant calls for which visual inspection revealed obvious strand bias. I looked at the FS and SOR scores for these variants, but they do not seem to provide a clear reason to filter when compared with the FS and SOR scores of other variants that passed our filters. I also can't find a straightforward guideline for interpreting these scores. Does GATK provide any such guideline?

The only useful differentiation I found between them is that SOR "is better at taking into account large amounts of data in high coverage situations", but the documentation doesn't define what threshold counts as "high coverage".

I can post more details if necessary (e.g. screen shots) but for now, here are the FS and SOR scores for two groups of variants, true positives and false positives, in our data:

True positives (FS, SOR):
0, 0.169
0, 0.705
0, 2.211
0, 2.303
5.65, 11.106
17.459, 14.404

False positives (FS, SOR):
1.307, 0.509
1.876, 0.33

Is it relevant that HaplotypeCaller produces one FS score and one SOR score for a variant, regardless of how many samples the variant is called in? We are running on 30+ samples, so having one score to determine the strand bias across all samples seems like it could lead to errors in variant calls.

Reference genotype quality in presence of conflicting reads

Given the following line from a VCF file:

Supercontig_1.1 308 . T A . PASS AC=0;AF=0;AN=2;BaseQRankSum=-1.932;ClippingRankSum=0;DP=99;ExcessHet=3.01;MQ=43.26;MQRankSum=-2.382;ReadPosRankSum=1.77;VariantType=SNP GT:AD:DP:RGQ 0/0:52,4:56:1

I note that despite there being 52 reads passing filters for the reference genotype, the reference genotype quality is still only 1. Is RGQ affected by the presence of reads indicating a possible variant (4 in this case)? Does the low RGQ score in this case reflect uncertainty over whether this position really is a reference call (T/T) or whether it might be a variant (A/A, A/T, or T/A)?

If I were being very strict about including only high-confidence positions in my analysis, would you recommend assigning this position a missing genotype, since I can't really be sure what it is?

Should I analyze my samples alone or together?

Together is (almost always) better than alone

We recommend performing variant discovery in a way that enables joint analysis of multiple samples, as laid out in our Best Practices workflow. That workflow includes a joint analysis step that empowers variant discovery by leveraging population-wide information from a cohort of multiple samples, allowing us to detect variants with high sensitivity and to genotype samples as accurately as possible. Our workflow recommendations provide a scalable way to do this that allows incremental processing of the sequencing data.

The key point is that you don’t actually have to call variants on all your samples together to perform a joint analysis. We have developed a workflow that decouples the initial identification of potential variant sites (i.e. variant calling) from the genotyping step, which is the only part that really needs to be done jointly. Since GATK 3.0, you can use the HaplotypeCaller to call variants individually per sample in -ERC GVCF mode, followed by a joint genotyping step on all samples in the cohort, as described in this method article. This achieves what we call incremental joint discovery, providing all the benefits of classic joint calling (described below) without the drawbacks.

Why "almost always"? Because some people have reported missing a small fraction of singletons (variants that are unique to individual samples) when using the new method. For most studies, this is an acceptable tradeoff (which is reduced by the availability of high quality sequencing data), but if you are very specifically looking for singletons, you may need to do some careful evaluation before committing to this method.


Previously established cohort analysis strategies

Until recently, three strategies were available for variant discovery in multiple samples:

- single sample calling: sample BAMs are analyzed individually, and individual call sets are combined in a downstream processing step;
- batch calling: sample BAMs are analyzed in separate batches, and batch call sets are merged in a downstream processing step;
- joint calling: variants are called simultaneously across all sample BAMs, generating a single call set for the entire cohort.

The best of these, from the point of view of variant discovery, was joint calling, because it provided the following benefits:

1. Clearer distinction between homozygous reference sites and sites with missing data

Batch calling does not output a genotype call at sites where no member of the batch has evidence for a variant; it is thus impossible to distinguish such sites from locations with missing data. In contrast, joint calling emits genotype calls at every site where any individual in the call set has evidence for variation.

2. Greater sensitivity for low-frequency variants

By sharing information across all samples, joint calling makes it possible to “rescue” genotype calls at sites where a carrier has low coverage but other samples within the call set have a confident variant at that location. However, this does not apply to singletons, which are unique to a single sample. To minimize the chance of missing singletons, we increase the cohort size -- in a larger cohort, a given variant is less likely to be a singleton in the first place.

3. Greater ability to filter out false positives

The current approaches to variant filtering (such as VQSR) use statistical models that work better with large amounts of data. Of the three calling strategies above, only joint calling provides enough data for accurate error modeling and ensures that filtering is applied uniformly across all samples.

Figure 1: (left) Power of joint calling in finding mutations at low coverage sites. The variant allele is present in only two of the N samples, in both cases with coverage so low that the variant is not callable when the samples are processed separately; joint calling accumulates evidence across all samples and renders the variant callable. (right) Importance of joint calling in squaring off the genotype matrix, using an example of two disease-relevant variants. Neither sample will have records in a variants-only output file, for different reasons: the first sample is homozygous reference while the second sample has no data. Merging the results of single-sample calling would incorrectly treat both samples identically as non-informative.


Drawbacks of traditional joint calling (all steps performed multi-sample)

There are two major problems with the joint calling strategy.

- Scaling & infrastructure
Joint calling scales very badly -- the calculations involved in variant calling (especially with methods like the HaplotypeCaller's) become exponentially more computationally costly as you add samples to the cohort. If you don't have a lot of compute available, you run into limitations pretty quickly. Even here at the Broad, where we have fairly ridiculous amounts of compute available, we can't brute-force our way through the numbers for the larger cohort sizes that we're called on to handle.

- The N+1 problem
When you’re getting a largish number of samples sequenced (especially clinical samples), you typically receive them in small batches over an extended period of time, and you analyze each batch as it comes in (whether because the analysis is time-sensitive or because your PI is breathing down your neck). But that’s not joint calling, that’s batch calling, and it doesn’t give you the significant gains that joint calling can. Unfortunately, the traditional joint calling approach doesn’t allow for incremental analysis -- every time you get even one new sample, you have to re-call all samples from scratch.

Both of these problems are solved by the single-sample calling + joint genotyping workflow.

Bug with version 3.6 and JEXL

I recently tried to run a GATK command that I previously used with GATK 3.5, and I'm now getting a JEXL error. I don't understand why. I've updated to Java 1.8.0_77. Here is the command:
java -Xmx${MEM} -jar ${gatk_dir}/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R ${genome} \
-L ${CHROM} \
-V ${data_dir}/'14a_'${abb}'HapCaller_mergedCHRs_VARIANTS'${CHROM}'.vcf' \
-G_filter "DP < 10 " \
-G_filterName "LowCov" \
-G_filter "DP > 100 " \
-G_filterName "HighCov" \
-G_filter "GQ < 20 " \
-G_filterName "LowGQ" \
--clusterWindowSize 10 --clusterSize 3 \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0 " \
--filterName "gatkHardFilter" \
--setFilteredGtToNocall \
-o ${data_dir}/'14b_newgatk_'${abb}'HapCaller_mergedCHRs_VARIANTS_hard_cluster_filterflag_DPGQmissing'${CHROM}'.vcf'

The error message:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: Invalid JEXL expression detected for HighCov with message no message
at htsjdk.variant.variantcontext.JEXLMap.evaluateExpression(JEXLMap.java:136)
at htsjdk.variant.variantcontext.JEXLMap.get(JEXLMap.java:93)
at htsjdk.variant.variantcontext.JEXLMap.get(JEXLMap.java:22)
at htsjdk.variant.variantcontext.VariantContextUtils.match(VariantContextUtils.java:323)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.filter(VariantFiltration.java:433)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.map(VariantFiltration.java:349)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.map(VariantFiltration.java:97)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Invalid JEXL expression detected for HighCov with message no message
ERROR ------------------------------------------------------------------------------------------

Dream challenge PON

Hello,

The Supplementary Data for the DREAM challenge (http://www.nature.com/nmeth/journal/v12/n7/full/nmeth.3407.html) indicates that a panel of normals filter was used for the Broad's MuTect submission:

"Panel of normals filter:
matlab: survey_panel_of_normals_for_mutations(maf_file, bamfile_list.txt) where bamfile_list.txt contains 258 normal whole genome bam files to use as an additional panel of normals."

Is this dataset available to the public?

Thank you.


Running HaplotypeCaller in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF

Hi Sheila and Geraldine

When I run HaplotypeCaller (v3.3-0-g37228af) in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF, I get the error:

Invalid command line: Argument ERC/gt_mode has a bad value: you cannot request reference confidence output and GENOTYPE_GIVEN_ALLELES at the same time

It is strange, however, that GENOTYPE_GIVEN_ALLELES is mentioned in the --max_alternate_alleles section of the GenotypeGVCFs documentation.

Maybe I'm missing something?

Thanks,
Gerrit

about the resource bundle for hg38: the VCF files

Dear all, could you please let me know whether there is a quick fix for the VCF files in the hg38 bundle
(available at ftp://ftp.broadinstitute.org/bundle/hg38/hg38bundle/)? In particular:

  1. dbsnp_144.hg38.vcf has chromosome names as "1, 2, ..." etc. instead of "chr1, chr2, ..." etc.
  2. dbsnp_138.hg38.vcf is missing (I can see only the "dbsnp_138.hg38.vcf.gz.tbi" file).

and also I would appreciate some information on the following:

  1. what is the difference between 1) "Homo_sapiens_assembly38.dbsnp.vcf" and 2) "Homo_sapiens_assembly38.dbsnp138.vcf"?

  2. which of the two files above, 1) or 2), should I use for base quality score recalibration?

  3. what is the difference between the files 3) "Homo_sapiens_assembly38.known_indels.vcf" and 4) "Homo_sapiens_assembly38.variantEvalGoldStandard.vcf", and when should I use each of them in the analysis?

many thanks,

bogdan

Is it possible to make VariantAnnotator check REF and ALT fields?

Hi! We use GATK a lot in our research and it works amazingly well most of the time, so first of all, thanks for creating it!

We have one problem that we were unable to solve on our own. Say we have a VCF file that contains called variants, and we want to annotate it against an external database, ClinVar for example. We used to use VariantAnnotator for this purpose until we found out (both by reading the documentation and by a quick experimental check) that it annotates variants based solely on position, ignoring the actual mutation. Imagine, for example, that variant A → C was called at a specific position, but ClinVar has data for A → G at that position. In this case VariantAnnotator will still carry over the INFO fields from ClinVar into our VCF. We do not want this to happen, because strictly speaking the ClinVar data was recorded for a completely different mutation and might not be relevant at all in our case.

My question: is there an option for VariantAnnotator that makes it check the REF and ALT fields during annotation? (Although I fear this may not be possible, because it uses the RodWalker class to traverse the variants.) Alternatively, can this be achieved by combining other GATK commands? Or will we have to write a custom walker to accomplish what we want? (The latter is obviously the worst case, but hopefully we could manage it.)

All the best,
Kirill

Extremely high depth of coverage

Dear all,
I've run the DepthOfCoverage tool on 263 WGS samples and found unusually high totals and averages for some regions.
Does this indicate some sort of alignment error, or can I just filter these regions out when calling genotypes?
I'm attaching an example for one chromosome.
Cheers,
Adriana

about VQSR -- VariantRecalibrator

Dear all, could you please advise on the following: what is the most appropriate way to set up VQSR? I have noticed some discrepancies between different resource materials. Here is what I mean:

I. about SNP :

--- in the GATK workshop materials (resource:omni has truth=false):

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 $HAPMAP33 \
-resource:omni,known=false,training=true,truth=false,prior=12.0 $OMNI25 \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 $SNP1000G \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $DBSNP138 \

--- while in the online documentation (https://software.broadinstitute.org/gatk/guide/article?id=1259), resource:omni has truth=true:

-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \

II. about INDEL :

-- in Current Protocols in Bioinformatics (resource:mills has known=true):

-resource:mills,known=true,training=true,truth=true,prior=12.0 $INDEL1000G \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $DBSNP138 \

-- while in the online documentation (https://software.broadinstitute.org/gatk/guide/article?id=1259), resource:mills has known=false:

-resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \

thanks !

bogdan
