Channel: Recent Discussions — GATK-Forum

MergeBamAlignment - what are all the exact steps it performs?


Hi, I have a question about the MergeBamAlignment tool.
I tried reading through the documentation and the couple of blog posts I found on the GATK website, but there are still a few things I could use help clearing up.

Basically, I ran the following tests:
1. Starting from an unmapped BAM (uBAM) with multiple read groups, I ran the GATK data pre-processing Best Practices WDL
2. Starting from the same uBAM, but with the read group information removed using AddOrReplaceReadGroups, I ran the same WDL
3. Starting from the uBAM without read group information, I ran the data pre-processing pipeline with the MergeBamAlignment step removed

After this, with the resulting BAM files, I ran the GATK generic germline variant calling Best Practices WDL.

Between the first two cases, for the samples I was testing with, I found a 2.1% difference in the variants called.
I understand that MergeBamAlignment carries the read group information over from the uBAM, which is then used in the MarkDuplicates, BaseRecalibrator and ApplyBQSR steps; this produces a different BAM than in the case without read group information (test 2), so the variant calls differ as well.

But between test 2 (uBAM with no read groups, MBA present) and test 3 (uBAM with no read groups, MBA removed), I also noted differences in the variants called - the difference was 0.18%, so albeit small, it still exists.
My understanding is that MergeBamAlignment performs more actions in addition to just merging read group and read-level tag information.
From one post, I understood that MBA turns reads hard-clipped by BWA (usually some chimeric reads) back into soft-clipped reads.
Does anyone have more info on this? Or on what exactly MBA does?
Should I expect these small differences, or not?
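For intuition on the clip-restoring step mentioned above, here is a minimal Python sketch (my own illustration, not GATK code) of what turning hard clips back into soft clips means at the CIGAR level. Note that the real MergeBamAlignment must also restore the clipped bases and qualities, which only exist in the unmapped BAM record - which is exactly why the tool needs the uBAM as input.

```python
import re

def hard_to_soft(cigar: str) -> str:
    """Rewrite hard-clip (H) CIGAR operators as soft clips (S).

    Illustration only: MergeBamAlignment additionally restores the
    clipped bases/qualities from the matching uBAM record.
    """
    return re.sub(r"(\d+)H", r"\1S", cigar)

# A BWA-MEM supplementary (chimeric) alignment often hard-clips one end:
print(hard_to_soft("50M51H"))    # -> 50M51S
print(hard_to_soft("30H70M1H"))  # -> 30S70M1S
```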


Original bam file vs -bamout bam file: which one should I rely on?


Dear GATK authors and other scientists,

I wonder which bam file is the 'correct' one. Let me explain. I have to select some interesting variants from my vcf files (called by HaplotypeCaller) and then I'm going to confirm them in the wet lab. I'd like to avoid false positives, so I prepared several filtering strategies with strict conditions. First of all, I'd like to see my variant in the bam file (IGV). However, sometimes the variant of interest is present only in the bam from -bamout, with no alteration at that position in the original bam file.
Yes, I know there is a similar question here:
https://gatkforums.broadinstitute.org/gatk/discussion/6129/ad-in-vcf-doesnt-match-bam

And Sheila responded that it's the result of the reassembly done by HaplotypeCaller, which may change the positions of the reads. I understand this: you use a de Bruijn graph to reconstruct and select the haplotype with the best likelihood.

However, should I take a variant like this into consideration (example below), or treat it as a false positive? What do you think? I have dozens of variants like this one.
Original bam: [IGV screenshot]

Bamout bam: [IGV screenshot]

Same position, variant present in the vcf file with a nice score. I see the 'variant pattern' in some reads in the bamout one. I think it's pretty suspicious and may be a group of false positives - what do you think?

MuTect2 Tumour/Normal variant calling with only tumour samples

Hello,
If two tumour samples are available from the same individual but no normal tissue (e.g. primary and relapsed tumour), is it acceptable to use MuTect2 Tumour/Normal variant calling and treat the primary tumour sample as "normal" and the relapsed tumour as "tumour" in the analysis?
Thank you for your help.

Old GATK 3.8 website needs a proper link to GATK4.

Safe to use HaplotypeCallerSpark?

Hi all,

I'm wondering how bad it is to use HaplotypeCallerSpark in GATK 4.0.2.1/JDK 1.8. I realize it's in beta, but does that mean "your results will be useless" or just "use with caution"?

The reason I'm asking is that Spark seems to be the only way to multi-thread in GATK 4, and just one of my bams took 64 hrs to run on a single node, and I have 110 bams, so parallelizing is a must.

Btw, when I tried to run HaplotypeCallerSpark in parallel with 48 nodes, my job crashed after running for two days. I thought that since it took 64 hrs on 1 node, using 48 nodes would finish in way less than 2 days.

Here's what I have:

gatk --java-options "-Xmx32g -XX:ParallelGCThreads=1" HaplotypeCallerSpark --spark-master local[48] -R myref.2bit -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1
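A widely used alternative to Spark multithreading is scatter-gather: run plain (non-Spark) HaplotypeCaller on many interval shards in parallel and merge the resulting GVCFs afterwards, which is how the Best Practices WDLs parallelize. A minimal Python sketch of splitting a contig into shards to pass as separate -L arguments (the contig name and length here are just examples):

```python
def shard_contig(contig, length, n_shards):
    """Split [1, length] into n_shards roughly equal, non-overlapping
    intervals in GATK's contig:start-end format (1-based, inclusive)."""
    size = -(-length // n_shards)  # ceiling division
    shards = []
    start = 1
    while start <= length:
        end = min(start + size - 1, length)
        shards.append(f"{contig}:{start}-{end}")
        start = end + 1
    return shards

# e.g. 4 shards of chr20 (hg19 length), one HaplotypeCaller job each:
for interval in shard_contig("chr20", 63025520, 4):
    print(interval)
```

Naive splitting like this can cut through an active region at shard boundaries; the Best Practices scatter intervals avoid that by breaking at N-runs in the reference, so prefer those interval lists when available.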

GATK_docker image to run Mutect2

I'd like to use a GATK Docker image to run Mutect2 on Terra. I'm trying to run the program using Terra's workspace data/demo files, but it says I need a Docker image. Do I need to download specific software to get a gatk_docker image? Otherwise, is there an alternative way to copy and paste from the workspace? If anyone has a Docker image, can I share it on Terra? What workspace attribute should I put down for gatk_docker? Thank you!

What should I specify as intervals to create a panel of normals from matched whole-genome data?

I have 16 matched normal samples for my cancer-normal matched dataset and want to call somatic SNVs. Following your Best Practices, I want to create a panel of normals. Toward this I have already generated 16 VCFs using GATK 4.1.3.0 with Mutect2, but while merging these 16 independent VCFs into a common GenomicsDB workspace using GenomicsDBImport, I am confused about what I should pass as intervals with the -L option. Since this is a whole-genome dataset, I am not sure about these intervals. Any suggestion will be highly appreciated.
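For WGS a common choice is simply to cover the whole genome: one interval per primary contig, or (better) the scattered calling interval lists shipped with the Best Practices resources. A Python sketch that builds whole-chromosome intervals from contig lengths, which would normally come from the reference .fai or sequence dictionary (the lengths below are a small hg19 subset used purely for illustration):

```python
# Whole-chromosome intervals for -L, in contig:start-end form
# (1-based, inclusive). Real lengths come from the .fai/.dict.
contigs = {"1": 249250621, "2": 243199373, "X": 155270560}

def wgs_intervals(contig_lengths):
    return [f"{name}:1-{length}" for name, length in contig_lengths.items()]

for interval in wgs_intervals(contigs):
    print(interval)  # one per line in a .intervals file, or repeated -L args
```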

Common SNPs obscured by identical VariantContext from HaplotypeCaller


Hi GATK team

During one of our recent analyses we found many cases where common SNPs are reported in a very unexpected way by the GATK4 HaplotypeCaller, and this makes our downstream analysis unnecessarily hard, as we usually assume that SNPs are actually encoded as SNPs. This looks very much like a bug to me, but I am interested in finding out whether there is actually a good reason for encoding the variants this way. In a cohort of 5000 patients, we saw that almost 2.5% of common SNPs were affected (200k out of 8M variants), and for larger cohorts the percentage goes up to 5%.

Here are some examples of called variants from a 160 patient cohort (sites-only, to make it small) where a SNP is obscured by some variant context that is identical between REF and all ALTs.

chr1 821054 . GTCTATACTACCTGCCTGTCGAGCAGAT CTCTATACTACCTGCCTGTCGAGCAGAT,ATCTATACTACCTGCCTGTCGAGCAGAT,* 4897594.93
chr1 876041 rs61768170 TACTCCCCCAC AACTCCCCCAC,* 319.96
chr1 93326098 rs71730518 CT CTT,CTTT,* 38399.11
chr1 112407840 rs10589164 CT CTTTT,CTT,CTTTTT,* 16295.78
chr1 158902694 rs5778098 CTT CT,* 118255.17

All those variants can be trimmed by removing the shared ends of the REF and ALT sequences. Some end up as simple SNPs, while others remain indels, albeit shorter ones:
chr1 821054 . G C,A,* 4897594.93 PASS
chr1 876041 rs61768170 T A,* 319.96
chr1 93326098 rs71730518 C CT,CTT,* 38399.11
chr1 112407840 rs10589164 C CTTT,CT,CTTTT,* 16295.78
chr1 158902694 rs5778098 CT C,* 118255.17

I noticed that all affected sites have an upstream deletion (*) so maybe this notation is trying to tell us something that we just don't understand?

Now, assuming there is a good reason why these variants are represented this way, how can I best trim the variants to the actually varying part? I tried bcftools norm and GATK's LeftAlignAndTrimVariants, but neither changed any of these sites. Only LeftAlignAndTrimVariants --split worked; however, it also splits the multiallelic sites into multiple lines, which comes with its own limitations.
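As a stopgap, the suffix trimming shown in the examples above can be done directly. A Python sketch (my own, not a GATK tool) that trims bases shared by the end of REF and every ALT, skipping the `*` spanning-deletion allele (it has no sequence to trim) and keeping at least one base per allele:

```python
def trim_shared_suffix(ref, alts):
    """Trim trailing bases shared by REF and all non-'*' ALT alleles,
    leaving at least one base in each allele."""
    seqs = [ref] + [a for a in alts if a != "*"]
    if len(seqs) < 2:           # nothing to compare against
        return ref, alts
    while all(len(s) > 1 for s in seqs) and len({s[-1] for s in seqs}) == 1:
        seqs = [s[:-1] for s in seqs]
    trimmed, i = [], 1
    for a in alts:              # re-interleave the '*' alleles
        if a == "*":
            trimmed.append("*")
        else:
            trimmed.append(seqs[i])
            i += 1
    return seqs[0], trimmed

# Reproduces the rs71730518 example: CT -> CTT,CTTT,* becomes C -> CT,CTT,*
print(trim_shared_suffix("CT", ["CTT", "CTTT", "*"]))
```

Note this only trims; it does not left-align, so it is not a substitute for full normalization.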

In case it helps, here are some more details: all samples are human, normal (non-tumor) samples sequenced to 30x. Individual GVCFs were produced by HaplotypeCaller in GVCF mode using GATK 3.5; the BAMs came from a functional equivalence pipeline. Joint genotypes were called using a published GATK4 WDL: https://github.com/gatk-workflows/gatk4-germline-snps-indels. I also tested a GATK 3.5 joint calling workflow and it shows the exact same behavior, so it does not seem to be caused by GenomicsDB or other GATK4 features.

thanks for any insight
Jens


AnalyzeCovariates error (R)


Hello

I am trying to generate base recalibration plots using AnalyzeCovariates.

My command is:

java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates -R GRCh37-lite.fa \
-before test_data/realigned/SA495-Tumor.sorted.realigned.grp \
-after test_data/realigned/SA495-Tumor.sorted.post_recal.grp2 \
-plots recal_plots.pdf

and it gives me this error:

INFO  17:01:06,050 HelpFormatter - Date/Time: 2014/05/16 17:01:06
INFO  17:01:06,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:01:06,050 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:01:06,962 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:01:07,193 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:01:07,317 GenomeAnalysisEngine - Preparing for traversal
INFO  17:01:07,339 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:01:07,340 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:01:07,340 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  17:01:08,293 ContextCovariate -       Context sizes: base substitution model 2, indel substitution model 3
INFO  17:01:08,537 ContextCovariate -       Context sizes: base substitution model 2, indel substitution model 3
INFO  17:01:08,592 AnalyzeCovariates - Generating csv file '/tmp/AnalyzeCovariates3565832248324656361.csv'
INFO  17:01:09,077 AnalyzeCovariates - Generating plots file 'recal_plots.pdf'
INFO  17:01:18,598 GATKRunReport - Uploaded run statistics report to AWS S3
 ERROR ------------------------------------------------------------------------------------------
 ERROR stack trace
org.broadinstitute.sting.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.
    at org.broadinstitute.sting.utils.R.RScriptExecutor.exec(RScriptExecutor.java:174)
    at org.broadinstitute.sting.utils.recalibration.RecalUtils.generatePlots(RecalUtils.java:548)
    at org.broadinstitute.sting.gatk.walkers.bqsr.AnalyzeCovariates.generatePlots(AnalyzeCovariates.java:380)
    at org.broadinstitute.sting.gatk.walkers.bqsr.AnalyzeCovariates.initialize(AnalyzeCovariates.java:394)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)
 ERROR ------------------------------------------------------------------------------------------
 ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
 ERROR
 ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
 ERROR If not, please post the error message, with stack trace, to the GATK forum.
 ERROR Visit our website and forum for extensive documentation and answers to
 ERROR commonly asked questions http://www.broadinstitute.org/gatk
 ERROR
 ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info.
 ERROR ------------------------------------------------------------------------------------------

Any ideas?
Thanks

Error running DetermineGermlineContigPloidy

Hi Team,
I am trying to generate CNV calls with the GATK pipeline.
PreprocessIntervals and CollectReadCounts are done; when I run DetermineGermlineContigPloidy I get this error.

Error in detail:

Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/data/gatk_cnv/gatk-4.1.4.0/gatk-package-4.1.4.0-local.jar DetermineGermlineContigPloidy --contig-ploidy-priors ploidy_model/interval_ploidy_new.txt --interval-merging-rule OVERLAPPING_ONLY -L preprocessed_intervals.interval_list -I ESI_17.tsv -I ES_msc.tsv --exclude-intervals exclude_intervals.bed --output esi_ploidy --output-prefix esi_cnvploidy --verbosity DEBUG
.................................................
................................................
04:46:53.856 INFO DetermineGermlineContigPloidy - Aggregating read-count file ESI_17.tsv (1 / 2)
04:46:57.935 INFO DetermineGermlineContigPloidy - Shutting down engine
[October 22, 2019 4:46:57 AM UTC] org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy done. Elapsed time: 0.60 minutes.
Runtime.totalMemory()=3437232128
java.lang.IllegalArgumentException: Intervals for read-count file ESI_17.tsv do not contain all specified intervals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:725)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.writeSamplesByCoveragePerContig(DetermineGermlineContigPloidy.java:374)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:285)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
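The exception says the intervals recorded in the read-count file ESI_17.tsv do not contain all the intervals passed with -L (minus the excluded ones); this usually means the counts were collected over a different interval list than the one given here. A quick sanity check, sketched in Python under the assumption that both files can be reduced to (contig, start, stop) triples:

```python
def missing_intervals(specified, counted):
    """Return the -L intervals absent from the read-count file's
    intervals (exact-match comparison, as the tool requires)."""
    return sorted(set(specified) - set(counted))

specified = {("1", 1000, 2000), ("1", 3000, 4000)}  # from -L, minus exclusions
counted = {("1", 1000, 2000)}                        # from the TSV's interval rows
print(missing_intervals(specified, counted))         # -> [('1', 3000, 4000)]
```

If this turns up missing intervals, rerunning CollectReadCounts with the same preprocessed_intervals.interval_list should fix it.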

Please help me fix this error.

Thank you

Merging same sample BAMs for HaplotypeCaller

Dear GATKers,

I really do appreciate you all for your hard work.

I work with non-human samples which were genotyped using a RAD-based technique (GBS). Each library was sequenced twice on distinct flow cell lanes, so I have two RG-tagged BAMs for each sample (SM is the same).

From screening the GATK forum, I have learned that BAM files belonging to the same sample should be merged before passing them to HaplotypeCaller.

I have noticed that you recommend merging BAMs either at the MarkDuplicates step, the Indel Realignment step, or at the BQSR step. The problem is that I have to skip these steps for the following reasons: (i) the MarkDuplicates step is out of the question because GBS relies heavily on PCR amplification, and (ii) as I work on non-model organisms, there is a lack of the known indel/polymorphic site databases required for the Indel Realignment and BQSR steps.

Could you please confirm whether I can simply merge the single-sample BAMs using MergeSamFiles (Picard) and feed the result to HaplotypeCaller?
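One thing worth checking before merging is that the two BAMs really do carry distinct @RG IDs but the same SM tag, since HaplotypeCaller groups reads by sample. A Python sketch that pulls the SM tags out of a SAM header (as printed by `samtools view -H`; the header text below is a made-up example):

```python
def sample_names(header_text):
    """Collect the SM tags from the @RG lines of a SAM header."""
    samples = set()
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            for field in line.split("\t")[1:]:
                if field.startswith("SM:"):
                    samples.add(field[3:])
    return samples

hdr = ("@RG\tID:lane1\tSM:sampleA\tPL:ILLUMINA\n"
       "@RG\tID:lane2\tSM:sampleA\tPL:ILLUMINA")
print(sample_names(hdr))  # -> {'sampleA'}: safe to merge as one sample
```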

I also came across a post by Sheila stating that 'some people used to input same sample GVCF files to GenotypeGVCFs with no problem', while also noting that 'this is not best practice'. I've tried this, and GenotypeGVCFs runs without throwing an error; however, I am concerned about the reliability of the genotyping. Is it a really bad idea to go this way?

Many thanks and have a nice day!

P.S. I am using GATK v. 3.8

Multi sample somatic variant calling filters


Hi,

for my PhD I have several patients with multiple biopsies available, so multi-sample variant calling is really very interesting for me.

I followed the latest tutorial on running Mutect2 v4.1.2.0, first creating a panel of normals with 40 healthy samples sequenced on the same machine with the same library prep.
Then I performed joint variant calling according to the tutorial, with a later filtering step.
From orthogonal validation of those exact samples with ddPCR I know that these variants are actually present in the samples, but they get filtered out with two filters: multiallelic, which I kind of understand, as there are a lot of different variants at this position (since it confers resistance to treatment), but it also says normal_artifact, which I do not understand at all.
First of all, I would love to know if there is a way for me to not filter out these variants, especially as they are pretty "common" in some samples (VAF around 0.4).
And secondly, I would like to understand why the calculated AF for most of these alleles is actually not 0 in the tumour when the AD field says that no supporting read was found.
In general, this is unfortunate, because I was really hoping Mutect2 would be an easy solution.
I would love to understand why this happened and whether I can change the behaviour.
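On the AF-vs-AD point: as far as I understand the Mutect2 docs, the FORMAT AF is a model-based estimate of the allele fraction, not simply AD[alt]/sum(AD), so it can be nonzero even when AD reports zero supporting reads. For comparison, the naive AD-based fraction would be computed like this (illustrative sketch, not Mutect2's actual model):

```python
def ad_fraction(ad):
    """Naive allele fractions from an AD vector [ref, alt1, alt2, ...].
    Mutect2's AF field is a posterior estimate and can differ from this."""
    total = sum(ad)
    return [alt / total if total else 0.0 for alt in ad[1:]]

print(ad_fraction([30, 20]))    # -> [0.4]
print(ad_fraction([50, 0, 0]))  # -> [0.0, 0.0]  (yet Mutect2's AF may be > 0)
```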

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=base_qual,Description="alt median base quality">
##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">
##FILTER=<ID=contamination,Description="contamination">
##FILTER=<ID=duplicate,Description="evidence for alt allele is overrepresented by apparent duplicates">
##FILTER=<ID=fragment,Description="abs(ref - alt) median fragment length">
##FILTER=<ID=germline,Description="Evidence indicates this site is germline, not somatic">
##FILTER=<ID=haplotype,Description="Variant near filtered variant on same haplotype.">
##FILTER=<ID=low_allele_frac,Description="Allele fraction is below specified threshold">
##FILTER=<ID=map_qual,Description="ref - alt median mapping quality">
##FILTER=<ID=multiallelic,Description="Site filtered because too many alt alleles pass tumor LOD">
##FILTER=<ID=n_ratio,Description="Ratio of N to alt exceeds specified ratio">
##FILTER=<ID=normal_artifact,Description="artifact_in_normal">
##FILTER=<ID=numt_chimera,Description="NuMT variant with too many ALT reads originally from autosome">
##FILTER=<ID=numt_novel,Description="Alt depth is below expected coverage of NuMT in autosome">
##FILTER=<ID=orientation,Description="orientation bias detected by the orientation bias mixture model">
##FILTER=<ID=panel_of_normals,Description="Blacklisted site in panel of normals">
##FILTER=<ID=position,Description="median distance of alt variants from end of reads">
##FILTER=<ID=slippage,Description="Site filtered due to contraction of short tandem repeat region">
##FILTER=<ID=strand_bias,Description="Evidence for alt allele comes from one read direction only">
##FILTER=<ID=strict_strand,Description="Evidence for alt allele is not represented in both directions">
##FILTER=<ID=weak_evidence,Description="Mutation does not meet likelihood threshold">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=FilterMutectCalls,CommandLine="FilterMutectCalls  --output /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_somatic.filtered.vcf.gz --contamination-table /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/contamination.table --tumor-segmentation /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/segments.table --orientation-bias-artifact-priors /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/readOrientationModel.tar.gz --variant /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_somatic.unfiltered.vcf.gz --reference /data/reference/indexes/human/hg19/fasta/Homo_sapiens.GRCh37.73.dna.toplevel.fa  --threshold-strategy OPTIMAL_F_SCORE --f-score-beta 1.0 --false-discovery-rate 0.05 --initial-threshold 0.1 --mitochondria-mode false --max-events-in-region 2 --max-alt-allele-count 1 --unique-alt-read-count 0 --min-median-mapping-quality 30 --min-median-base-quality 20 --max-median-fragment-length-difference 10000 --min-median-read-position 1 --max-n-ratio Infinity --min-reads-per-strand 0 --autosomal-coverage 0.0 --max-numt-fraction 0.85 --min-allele-fraction 0.0 --contamination-estimate 0.0 --log-snv-prior -13.815510557964275 --log-indel-prior -16.11809565095832 --log-artifact-prior -2.302585092994046 --normal-p-value-threshold 0.001 --min-slippage-length 8 --pcr-slippage-rate 0.1 --distance-on-haplotype 100 --long-indel-length 5 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false 
--help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false",Version="4.1.2.0",Date="August 12, 2019 10:14:09 AM UTC">
##GATKCommandLine=<ID=Mutect2,CommandLine="Mutect2  --f1r2-tar-gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/10-f1r2.tar.gz --normal-sample CA99 --panel-of-normals /data/reference/dawson_labs/Mutect2/PanelOfNormals/pon.vcf.gz --germline-resource /data/reference/dawson_labs/Mutect2/af-only-gnomad.raw.sites.GRCh37.73.vcf.gz --output /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_10.somatic.unfiltered.vcf.gz --intervals 10 --input /home/shollizeck/CASCADE/analysis/CA99/germline/Bam/CA99_merged.markdups.bam --input /home/shollizeck/CASCADE/analysis/CA99/tumor/CA99-26/Bam/CA99-26_merged.markdups.bam --input /home/shollizeck/CASCADE/analysis/CA99/tumor/CA99-31/Bam/CA99-31_merged.markdups.bam --input /home/shollizeck/CASCADE/analysis/CA99/tumor/CA99-41/Bam/CA99-41_merged.markdups.bam --input /home/shollizeck/CASCADE/analysis/CA99/tumor/CA99-57/Bam/CA99-57_merged.markdups.bam --input /home/shollizeck/CASCADE/analysis/CA99/tumor/CA99-11/Bam/CA99-11_merged.markdups.bam --input /home/shollizeck/CASCADE/analysis/CA99/tumor/CA99-59/Bam/CA99-59_merged.markdups.bam --input /home/shollizeck/CASCADE/analysis/CA99/tumor/CA99-47/Bam/CA99-47_merged.markdups.bam --input /home/shollizeck/CASCADE/analysis/CA99/tumor/CA99-55/Bam/CA99-55_merged.markdups.bam --reference /data/reference/indexes/human/hg19/fasta/Homo_sapiens.GRCh37.73.dna.toplevel.fa  --f1r2-median-mq 50 --f1r2-min-bq 20 --f1r2-max-depth 200 --genotype-pon-sites false --genotype-germline-sites false --af-of-alleles-not-in-resource -1.0 --mitochondria-mode false --tumor-lod-to-emit 3.0 --initial-tumor-lod 2.0 --pcr-snv-qual 40 --pcr-indel-qual 40 --max-population-af 0.01 --downsampling-stride 1 --callable-depth 10 --max-suspicious-reads-per-alignment-start 0 --normal-lod 2.2 --ignore-itr-artifacts false --gvcf-lod-band -2.5 --gvcf-lod-band -2.0 --gvcf-lod-band -1.5 --gvcf-lod-band -1.0 --gvcf-lod-band -0.5 --gvcf-lod-band 0.0 --gvcf-lod-band 0.5 --gvcf-lod-band 1.0 --minimum-allele-fraction 
0.0 --genotype-filtered-alleles false --disable-adaptive-pruning false --dont-trim-active-regions false --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --max-unpruned-variants 100 --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --emit-ref-confidence NONE --max-mnp-distance 1 --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 
--disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false --max-read-length 2147483647 --min-read-length 30 --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false",Version="4.1.2.0",Date="July 29, 2019 12:11:09 AM UTC">
##INFO=<ID=CONTQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to contamination">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=ECNT,Number=1,Type=Integer,Description="Number of events in this haplotype">
##INFO=<ID=GERMQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not germline variants">
##INFO=<ID=MBQ,Number=R,Type=Integer,Description="median base quality">
##INFO=<ID=MFRL,Number=R,Type=Integer,Description="median fragment length">
##INFO=<ID=MMQ,Number=R,Type=Integer,Description="median mapping quality">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="median distance from end of read">
##INFO=<ID=NALOD,Number=A,Type=Float,Description="Negative log 10 odds of artifact in normal with same allele fraction as tumor">
##INFO=<ID=NCount,Number=1,Type=Integer,Description="Count of N bases in the pileup">
##INFO=<ID=NLOD,Number=A,Type=Float,Description="Normal log 10 likelihood ratio of diploid het or hom alt genotypes">
##INFO=<ID=OCM,Number=1,Type=Integer,Description="Number of alt reads whose original alignment doesn't match the current contig.">
##INFO=<ID=PON,Number=0,Type=Flag,Description="site found in panel of normals">
##INFO=<ID=POPAF,Number=A,Type=Float,Description="negative log 10 population allele frequencies of alt alleles">
##INFO=<ID=ROQ,Number=1,Type=Float,Description="Phred-scaled qualities that alt allele are not due to read orientation artifact">
##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=SEQQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles are not sequencing errors">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##INFO=<ID=STRANDQ,Number=1,Type=Integer,Description="Phred-scaled quality of strand bias artifact">
##INFO=<ID=STRQ,Number=1,Type=Integer,Description="Phred-scaled quality that alt alleles in STRs are not polymerase slippage errors">
##INFO=<ID=TLOD,Number=A,Type=Float,Description="Log 10 likelihood ratio score of variant existing versus not existing">
##INFO=<ID=UNIQ_ALT_READ_COUNT,Number=1,Type=Integer,Description="Number of ALT reads with unique start and mate end positions at a variant site">
##MutectVersion=2.2
##bcftools_concatCommand=concat -o /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_somatic.unfiltered.vcf.gz -O z /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_10.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_11.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_12.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_13.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_14.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_15.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_16.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_17.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_18.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_19.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_1.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_20.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_21.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_22.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_2.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_3.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_4.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_5.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_6.somatic.unfiltered.vcf.gz 
/home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_7.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_8.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_9.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_MT.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_X.somatic.unfiltered.vcf.gz /home/shollizeck/CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_Y.somatic.unfiltered.vcf.gz; Date=Sat Aug 10 21:18:38 2019
##bcftools_concatVersion=1.9-80-gff3137d+htslib-1.9-66-gbcf9bff
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##filtering_status=These calls have been filtered by FilterMutectCalls to label false positives with a list of failed filters and true positives with PASS.
##normal_sample=CA99
##source=FilterMutectCalls
##source=Mutect2
##tumor_sample=CA99-11
##tumor_sample=CA99-26
##tumor_sample=CA99-31
##tumor_sample=CA99-41
##tumor_sample=CA99-47
##tumor_sample=CA99-55
##tumor_sample=CA99-57
##tumor_sample=CA99-59
##bcftools_viewVersion=1.9-80-gff3137d+htslib-1.9-66-gbcf9bff
##bcftools_viewCommand=view -h CASCADE/analysis/CA99/tumor/joined/mutect2/CA99_somatic.filtered.vcf.gz; Date=Wed Sep  4 13:01:27 2019
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  CA99    CA99-11 CA99-26 CA99-31 CA99-41 CA99-47 CA99-55 CA99-57 CA99-59
10  43615014    .   G   A,C,T   .   multiallelic;normal_artifact    CONTQ=93;DP=1438;ECNT=1;GERMQ=93;MBQ=37,37,37,37;MFRL=360,407,325,369;MMQ=60,60,60,60;MPOS=35,41,29;NALOD=0.957,1.52,1.52;NLOD=26.39,28.55,28.56;POPAF=6.00,6.00,6.00;ROQ=21;SEQQ=93;STRANDQ=93;TLOD=318.68,206.55,109.89   GT:AD:AF:DP:F1R2:F2R1:SB    0/0:95,1,0,0:0.018,9.934e-03,9.933e-03:96:39,0,0,0:56,1,0,0:49,46,1,0   0/1/2/3:104,69,1,0:0.394,5.737e-03,5.589e-03:174:59,30,0,0:43,36,0,0:49,55,34,36    0/1/2/3:86,0,32,4:7.869e-03,0.263,0.037:122:47,0,16,3:36,0,16,1:50,36,19,17 0/1/2/3:115,2,2,0:0.024,0.022,8.101e-03:119:57,1,2,0:56,1,0,0:63,52,2,2 0/1/2/3:189,0,3,1:5.039e-03,0.020,0.010:193:97,0,3,1:91,0,0,0:96,93,4,0 0/1/2/3:166,0,3,30:4.824e-03,0.020,0.151:199:89,0,2,18:75,0,1,11:82,84,15,18    0/1/2/3:184,2,19,17:0.013,0.089,0.080:222:97,0,13,7:84,2,6,10:80,104,18,20  0/1/2/3:101,0,35,0:7.072e-03,0.257,7.072e-03:136:49,0,21,0:50,0,14,0:53,48,17,18    0/1/2/3:85,56,0,0:0.390,6.877e-03,6.877e-03:141:47,28,0,0:38,25,0,0:42,43,27,29

Outlook on GRCh38/hg38 for exome and other targeted sequencing


Dear GATK team,

First of all, congratulations on releasing GATK4!

I was wondering, on this page: https://software.broadinstitute.org/gatk/download/bundle it is mentioned that the human genome reference builds you support actively are the following:
For Best Practices short variant discovery in exome and other targeted sequencing: b37/hg19

Last year we built an RNA-seq pipeline and a preliminary DNA-seq pipeline around GRCh38. Can you perhaps indicate how far out the publication of Best Practices for short variant discovery in exome and other targeted sequencing using GRCh38 is?

By the way, the link below the bullet points (https://software.broadinstitute.org/gatk/user%20guide/article.php?id=1213) gives a 404.

Keep up the good work,

Highest regards,

Freek.

ValidateVariants Error: Input files reference and features have incompatible contigs

Hello! I'm sorry if I am not using the correct format to ask this in but:

When using gatk ValidateVariants, I receive the error

``` A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths:
contig reference = 1 / 249250621
contig features = 1 / 249218993.
reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]
features contigs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26] ```

I am using human_g1k_v37.fasta as my reference, and a merged VCF created by merging grp1.vcf and grp2.vcf, both of which were run through the bcftools fixref tool against human_g1k_v37.fasta before merging.

I receive the same error (same contig lengths) when running ValidateVariants on grp1.vcf with human_g1k_v37.fasta before using the fixref tool.
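
For what it's worth, a clash like this can be diagnosed without GATK by comparing the reference's .fai index against the VCF's ##contig header lines. This is a minimal sketch; the helper functions are hypothetical (not part of any GATK or bcftools API), and real ##contig lines can carry extra quoted fields that the naive comma split below would not handle.

```python
# Sketch: find "same name, different length" contig clashes by comparing
# a reference .fai index with a VCF's ##contig header lines.

def read_fai_contigs(fai_lines):
    """.fai format: name<TAB>length<TAB>... -> {name: length}"""
    contigs = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        contigs[fields[0]] = int(fields[1])
    return contigs

def read_vcf_contigs(header_lines):
    """Parse ##contig=<ID=...,length=...> lines -> {name: length}"""
    contigs = {}
    for line in header_lines:
        if line.startswith("##contig=<"):
            inner = line.strip()[len("##contig=<"):].rstrip(">")
            kv = dict(item.split("=", 1) for item in inner.split(","))
            contigs[kv["ID"]] = int(kv["length"])
    return contigs

def contig_mismatches(ref, vcf):
    """List human-readable problems, mirroring the ValidateVariants error."""
    problems = []
    for name in sorted(set(ref) & set(vcf)):
        if ref[name] != vcf[name]:
            problems.append(f"{name}: reference length {ref[name]} != VCF length {vcf[name]}")
    for name in sorted(set(vcf) - set(ref)):
        problems.append(f"{name}: present in VCF but not in the reference")
    return problems

# Illustrative inputs modeled on the error above (numeric contigs 0..26
# in the VCF suggest plink-style chromosome codes rather than b37 names).
ref = read_fai_contigs(["1\t249250621\t52\t60\t61\n"])
vcf = read_vcf_contigs(["##contig=<ID=1,length=249218993>",
                        "##contig=<ID=0,length=1>"])
for problem in contig_mismatches(ref, vcf):
    print(problem)
```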

Do you have any suggestions on how to fix this?

Thanks in advance!

What do the empty counts in the ContingencyMetrics output file from GenotypeConcordance signify?

I have two VCF files produced from two different protocols that I want to compare. I applied the same set of hard filters to both, treated one as the truth set, and compared the other against it. Since both VCFs carry a FILTER status (PASS or a filter name), the comparison yields TP, TN, and empty counts, but no FP or FN.
When I compare the two VCFs prior to filtering with GenotypeConcordance, I get only TP and TN, with no empty counts.
With IGNORE_FILTER_STATUS set to false, does the tool compare the passing variants from the call set against all variants in the truth set, or only against the passing variants in the truth set? The documentation only says that empty counts mean there was no contingency information for those variants, which is not very informative. If the latter is the case, shouldn't a passing call-set variant absent from the truth set count as a false positive, and a passing truth-set variant absent from the call set as a false negative?
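
As a toy model of the question (this is NOT Picard's actual GenotypeConcordance implementation, just one plausible reading of the empty counts): if honoring filter status means a non-PASS record is treated as missing data, then the truth/call pair contributes no contingency information at all, landing in an "empty" bucket instead of becoming an FP or FN.

```python
# Toy contingency classifier -- illustrative only, not Picard's logic.

def classify(truth, call, ignore_filter=False):
    """truth/call: (genotype, filter) with genotype 'ref' or 'alt',
    or None if the site is absent from that callset."""
    def usable(entry):
        if entry is None:
            return None
        gt, filt = entry
        if not ignore_filter and filt != "PASS":
            return None  # filtered record treated as missing data
        return gt
    t, c = usable(truth), usable(call)
    if t is None or c is None:
        return "EMPTY"   # no contingency information for this pair
    if t == "alt":
        return "TP" if c == "alt" else "FN"
    return "TN" if c == "ref" else "FP"
```

Under this model, comparing two filtered VCFs produces EMPTY wherever either side failed a filter, while comparing the pre-filter VCFs (everything effectively PASS) produces none, which matches the pattern described above.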

I am using SelectVariants and VariantFiltration to impose the hard filters on both the VCF files. The hard filters are the same for the two files but different for SNPs and INDELs. I am comparing SNPs and INDELS separately in GenotypeConcordance.

GATK 4.1.3.0

Error while attempting to exclude intervals in GenotypeGVCFs


Hello, I've been trying to get the HaplotypeCaller-in-gVCF-mode to work for a combination of exome samples, some of which are diploid individuals, and others of which are pools of individuals, with ploidies ~20.

During the GenotypeGVCFs step, a number of regions have huge memory demands and fail, presumably related to the high ploidies.

These regions seem to be pretty small, and there are only a few of them, so my approach is simply to exclude the regions from analysis.

If I do this by telling GenotypeGVCFs to genotype all the non-problem regions using --intervals, this works just fine:

GenotypeGVCFs [etc] --intervals chr1:1-50000 --intervals chr1:60000-56000000

But if I tell it to exclude the problem regions, it fails:

GenotypeGVCFs [etc] -XL chr1:50000-60000
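
Until the -XL failure is understood, the inclusion-style workaround above can be generated programmatically: take the complement of the problem regions within each contig and emit explicit --intervals arguments. A minimal sketch (coordinates 1-based inclusive, matching GATK interval strings; the contig name and length are illustrative):

```python
# Sketch: turn excluded regions into the equivalent set of included
# intervals for a single contig.

def complement_intervals(contig_length, excluded):
    """excluded: sorted, non-overlapping (start, end) pairs, 1-based
    inclusive. Returns the intervals covering the rest of the contig."""
    kept, cursor = [], 1
    for start, end in excluded:
        if start > cursor:
            kept.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= contig_length:
        kept.append((cursor, contig_length))
    return kept

kept = complement_intervals(56_000_000, [(50_001, 60_000)])
args = " ".join(f"--intervals chr1:{s}-{e}" for s, e in kept)
# args -> "--intervals chr1:1-50000 --intervals chr1:60001-56000000"
```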

More details:

I am working with mosquito exome data, 150bp PE illumina reads, GATK 4.1.4.0, java 1.8.0_222. I've been following the Broad best practices pretty closely, with a single round of bootstrapped base recalibration.

Here is the command:

gatk --java-options '-Xmx20G' GenotypeGVCFs \
  -R genome.fa \
  -V gendb://../vcfs/combined_gvcfs_br/NW_021837065.1 \
  -O ../vcfs/chromosome_vcfs_br/NW_021837065.1.vcf \
  -XL NW_021837065.1:0-100000

I get the following message:

16:23:43.946 INFO  GenotypeGVCFs - Initializing engine
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
16:23:45.306 INFO  IntervalArgumentCollection - Initial include intervals span 2538371206 loci; exclude intervals span 100000 loci
16:23:45.307 INFO  IntervalArgumentCollection - Excluding 100000 loci from original intervals (0.00% reduction)
16:23:45.309 INFO  IntervalArgumentCollection - Processing 2538271206 bp from intervals
16:23:45.337 INFO  GenotypeGVCFs - Done initializing engine
16:23:45.459 INFO  ProgressMeter - Starting traversal
16:23:45.459 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
16:23:45.514 INFO  GenotypeGVCFs - Shutting down engine
[October 22, 2019 4:23:45 PM EDT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=476577792
java.lang.IllegalStateException: There are no sources based on those query parameters
        at org.genomicsdb.reader.GenomicsDBFeatureIterator.<init>(GenomicsDBFeatureIterator.java:132)
        at org.genomicsdb.reader.GenomicsDBFeatureReader.query(GenomicsDBFeatureReader.java:144)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.queryNextInterval(FeatureIntervalIterator.java:135)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.loadNextFeature(FeatureIntervalIterator.java:92)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.loadNextNovelFeature(FeatureIntervalIterator.java:74)
        at org.broadinstitute.hellbender.engine.FeatureIntervalIterator.<init>(FeatureIntervalIterator.java:47)
        at org.broadinstitute.hellbender.engine.FeatureDataSource.iterator(FeatureDataSource.java:467)
        at java.lang.Iterable.spliterator(Iterable.java:101)
        at org.broadinstitute.hellbender.engine.VariantLocusWalker.getSpliteratorForDrivingVariants(VariantLocusWalker.java:58)
        at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:133)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)

A USER ERROR has occurred: v is not a recognized option

I came across this error while collating all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. The command was:

java -jar -Xmx16g ./gatk-package-4.1.2.0-local.jar CreateSomaticPanelOfNormals \
-vcfs tutorial_11136/3_HG00190.vcf.gz \
-vcfs tutorial_11136/4_NA19771.vcf.gz \
-vcfs tutorial_11136/5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz
I tried it in both Cygwin and PowerShell on my PC.
Any suggestions?

HaplotypeCaller calls differ from what I see in the BAM in IGV


Hi,
I am trying to call SNPs with GATK 4.1.0.0. When I check the results in IGV, the depth and the calls differ, as in the screenshots below.


In IGV, all six samples are heterozygous at the two sites shown, but in my results three of them are called homozygous. The reported depth at each site also differs.
Is there anything wrong?
Thanks in advance!
Best wishes,
Yu Liu

HaplotypeCaller - non-variant block records in gVCF


Hi,

I have generated a gVCF for an exome (with non-variant block records) from a BAM file belonging to the 1000Genomes data.
I am using GATK tools version 3.5-0-g36282e4 and I have run the HaplotypeCaller as follows:

time java -jar $gatk_dir/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R $reference \
-I $bamfile \
-ploidy 2 \
-stand_call_conf 20 \
-stand_emit_conf 10 \
-ERC GVCF \
-o output.g.vcf.gz

For my analysis, I need to determine from this gVCF whether each position is a no-call, homozygous reference, a variant site, or simply not targeted by the exome capture.

However, with the gVCF file I obtained I am not able to do it because there are only variant site records or non-variant block records where the GT tag is always "0/0".

So I have few questions regarding the non-variant block records:

  1. Why does the output file not contain any no-call ("./.") records?

  2. Shouldn't regions with no reads have GT "./." rather than "0/0"?

  3. How can regions without reads (i.e. not targeted) be distinguished from regions with reads in which no variant was called?

  4. When I look at the BAM file in IGV, the non-variant blocks in the gVCF include regions that do contain reads. What explains this behaviour?
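
An illustrative sketch of how such a classification could be attempted from the gVCF records alone (this is not an official GATK tool, and the field handling is simplified): a 0/0 block with DP=0 and GQ=0 is effectively a no-call even though GT reads "0/0", while distinguishing "untargeted" requires the capture BED, which the gVCF by itself cannot provide.

```python
# Sketch: classify a gVCF data line that may be a variant record or a
# non-variant <NON_REF> block (END in INFO gives the block's last base).

def classify_record(fields):
    """fields: one tab-split gVCF data line (single-sample).
    Returns (label, start, end) for the span the record covers."""
    pos, alt = int(fields[1]), fields[4]
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    fmt = dict(zip(fields[8].split(":"), fields[9].split(":")))
    end = int(info.get("END", pos))
    if alt not in ("<NON_REF>", "."):       # a real variant site
        return ("VARIANT", pos, end)
    dp = int(fmt.get("DP", "0") or "0")
    gq = int(fmt.get("GQ", "0") or "0")
    if dp == 0:
        return ("NO_DATA", pos, end)        # no reads: "0/0" is vacuous here
    if gq == 0:
        return ("LOW_CONFIDENCE_REF", pos, end)
    return ("HOM_REF", pos, end)
```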

Thank you for your attention,

Sofia

What should I use as known variants/sites for running tool X?


1. Notes on known sites

Why are they important?

Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.

In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.

Human genomes

If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.

Non-human genomes

If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on What are the standard resources for non-human genomes? in which we hope people with non-human genomics experience will share their knowledge.

And if it turns out that there is as yet no suitable set of known sites for your organisms, here's how to make your own for the purposes of BaseRecalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence. Good luck!

Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
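
The set-intersection idea can be sketched in a few lines, keying each call on (chrom, pos, ref, alt); the inputs here are simplified tuples rather than full VCF records, and the function name is illustrative:

```python
# Sketch: keep only the sites reported by every caller as a
# high-confidence known-sites set for bootstrapped BQSR.

def high_confidence_sites(*callsets):
    """Each callset: iterable of (chrom, pos, ref, alt) tuples.
    Returns the sites present in every callset, sorted."""
    sets = [set(cs) for cs in callsets]
    return sorted(set.intersection(*sets))

caller_a = [("1", 100, "A", "G"), ("1", 200, "C", "T")]
caller_b = [("1", 100, "A", "G"), ("2", 50, "G", "A")]
# high_confidence_sites(caller_a, caller_b) -> [("1", 100, "A", "G")]
```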

2. Recommended sets of known sites per tool

Summary table

Tool                                 dbSNP 129   dbSNP >132   Mills indels   1KG indels   HapMap   Omni
RealignerTargetCreator                                        X              X
IndelRealigner                                                X              X
BaseRecalibrator                                 X            X              X
UnifiedGenotyper / HaplotypeCaller               X
VariantRecalibrator                              X            X                           X        X
VariantEval                          X

RealignerTargetCreator and IndelRealigner

These tools require known indels passed with the -known argument to function properly. We use both the following files:

  • Mills_and_1000G_gold_standard.indels.b37.vcf
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

BaseRecalibrator

This tool requires known SNPs and indels passed with the -knownSites argument to function properly. We use all the following files:

  • The most recent dbSNP release (build ID > 132)
  • Mills_and_1000G_gold_standard.indels.b37.vcf
  • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)

UnifiedGenotyper / HaplotypeCaller

These tools do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:

  • The most recent dbSNP release (build ID > 132)

VariantRecalibrator

For VariantRecalibrator, please see the FAQ article on VQSR training sets and arguments.

VariantEval

This tool requires known SNPs passed with the -dbsnp argument to function properly. We use the following file:

  • A version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.