Channel: Recent Discussions — GATK-Forum

How can I use the optional OUTPUT_VCF argument?


Dear,

How can I use the optional OUTPUT_VCF argument?
I saw https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.5.0/picard_vcf_GenotypeConcordance.php.
I entered the command below.
However, I couldn't get the VCF file. I only got three files: concordance_contingency_metrics, concordance_detail_metrics, and concordance_summary_metrics.

java -jar picard.jar GenotypeConcordance \
CALL_VCF=input.vcf \
CALL_SAMPLE=sample01 \
O=concordance.vcf \
TRUTH_VCF=Ref.vcf \
TRUTH_SAMPLE=sample02 \
OUTPUT_VCF=output.vcf

How can I get the VCF file?
Thank you.
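
For illustration only, a hedged variant of the command above, under two unconfirmed assumptions: that OUTPUT_VCF takes a true/false value rather than a file path, and that the annotated VCF is written alongside the metrics basename given by O=.

```
# Hedged sketch only: OUTPUT_VCF=true and the O= basename behaviour are assumptions,
# not confirmed here. File and sample names are the ones from the post.
java -jar picard.jar GenotypeConcordance \
    CALL_VCF=input.vcf \
    CALL_SAMPLE=sample01 \
    TRUTH_VCF=Ref.vcf \
    TRUTH_SAMPLE=sample02 \
    O=concordance \
    OUTPUT_VCF=true
```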


Disable duplicate read filter for M2


I have high coverage targeted sequencing data and would like to disable the duplicate read filter for Mutect2 (rather than skipping the de-duplication step in alignment).

Do I understand the documentation correctly that I should add the parameter --disable-read-filter NotDuplicateReadFilter to the Mutect2 call? I would do this both for the tumor-only M2 (PON step) and paired M2 (calling step).
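
For illustration, a minimal sketch of what that would look like for the tumor-only step; file names are placeholders, and the flag names are the ones discussed above.

```
# Hedged sketch: tumor-only Mutect2 call with the duplicate-read filter disabled.
# File names are placeholders; the same flag would be added to the paired call.
gatk Mutect2 \
    -R reference.fasta \
    -I tumor.bam \
    --disable-read-filter NotDuplicateReadFilter \
    -O tumor_only_for_pon.vcf.gz
```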

GATK4 GetPileupSummaries


Hi, I use GATK GetPileupSummaries to estimate contamination. The output has many sites, as in the following picture.

So, which sites are used to estimate contamination? Can someone point out these sites in the picture?
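
For context, a hedged sketch of the usual two-step contamination workflow; all file names, including the common-variant resource, are placeholders. The sites summarised in the pileup table come from whatever resource is passed with -V/-L.

```
# Hedged sketch: summarise pileups at a common-variant resource, then estimate
# contamination from that table. File names are placeholders.
gatk GetPileupSummaries \
    -I tumor.bam \
    -V common_biallelic_snps.vcf.gz \
    -L common_biallelic_snps.vcf.gz \
    -O tumor_pileups.table

gatk CalculateContamination \
    -I tumor_pileups.table \
    -O contamination.table
```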

Mutect2 "--mitochondria" flag doesn't work

First of all, since this is my first question on the forum, I wanted to thank you for all the help so far and the help that I will definitely get in the future.
Now to the actual question: I am working with mitochondrial DNA and I am trying to call variants with Mutect2, using the latest version of GATK (v4.1.1.0). The problem, however, is that neither the "--mitochondria" nor the "--median-autosomal-coverage" flag seems to be recognised. In fact, neither of them is included in the "Argument list/details" section of the tool's web page. But in the example just above it, (iii) Mitochondrial mode, they are.
So, is the "--mitochondria" argument not implemented in GATK yet? Or am I missing something else?
By the way, when I tried to run it with the old "-mitochondria-mode" flag, which is included in the argument list, it works. The output reports "##MutectVersion=2.2"; is that right, or should it be an older version?
Thank you
PanaZ

Does GATK4 v4.1.1.0 support MarkDuplicates?

I've run GATK3.5 with `MarkDuplicates`, but can't get it to run with GATK4 v4.1.1.0. I double-checked the best practices for data pre-processing for variant discovery and noted that the command `MarkDuplicates` still appears there. When I checked the tool documentation index I could pull up `MarkDuplicates` for GATK4 v4.0.8.0, but not v4.1.1.0. So I'm wondering if `MarkDuplicates` is supported by GATK4 v4.1.1.0?

Command:
```
strings=(
S1233686
)
for i in "${strings[@]}"; do
echo "${i}"

# Mark duplicates
/ast/emb/software/gatk-4.1.1.0/gatk MarkDuplicates \
I=/ast/emb/prjt3/aligned_data/${i}Aligned.sortedByCoord.out.bam \
O=/ast/emb/prjt3/aligned_data/${i}.dedupped.bam \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=SILENT \
METRICS_FILE=/ast/emb/prjt3/aligned_data/dedup.metrics.${i}.txt


done
```

Output:
```
USAGE: MarkDuplicates [arguments]

Identifies duplicate reads. This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads
are defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library
construction using PCR. See also MarkDuplicates for
detailed explanations of the output metrics.
Version:4.1.1.0


Required Arguments:

--INPUT,-I:String One or more input SAM or BAM files to analyze. Must be coordinate sorted. This argument
must be specified at least once. Required.

****************REMOVED STANDARD HELP INFO TO SHORTEN OUTPUT****************************

Invalid argument 'I=/ast/emb/prjt3/aligned_data/S1233686Aligned.sortedByCoord.out.bam'.
Tool returned:
1
```

The output suggests that `MarkDuplicates` is supported. I hope I didn't make a silly syntax error. I did double-check that my input file exists.
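
For comparison, a hedged sketch of the same loop using the space-separated --ARGUMENT value style that the usage message above lists (e.g. --INPUT/-I). Paths are unchanged from the post; this is illustrative, not a confirmed fix.

```
# Hedged sketch: same loop, but with GATK4-style space-separated arguments
# (the usage message above lists --INPUT/-I rather than I=).
strings=(
S1233686
)
for i in "${strings[@]}"; do
    echo "${i}"

    # Mark duplicates
    /ast/emb/software/gatk-4.1.1.0/gatk MarkDuplicates \
        --INPUT /ast/emb/prjt3/aligned_data/${i}Aligned.sortedByCoord.out.bam \
        --OUTPUT /ast/emb/prjt3/aligned_data/${i}.dedupped.bam \
        --CREATE_INDEX true \
        --VALIDATION_STRINGENCY SILENT \
        --METRICS_FILE /ast/emb/prjt3/aligned_data/dedup.metrics.${i}.txt

done
```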

Intervals that cross chromosomes


Hi team. I'm trying to put together an interval list (in any format that can be used by GenotypeGVCFs) that crosses chromosomes (multiple chromosomes within one interval).

This is for the most efficient processing on Terra - @jsoto assures me that about 900 intervals will be ideal for processing canine samples, and our reference has 3k or so contigs. I'd like to be able to combine all our tiny unanchored segments into larger intervals.

I can't find the syntax to specify an interval that includes multiple chromosomes - is it even possible? Help!
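
For reference only (not an answer to whether a single interval can span contigs), a hedged sketch of writing a Picard-style .interval_list, where each line describes one interval on a single contig; all contig names and lengths are made up.

```
# Hedged sketch: a Picard-style .interval_list is a SAM-style header plus one
# interval per line (contig, start, stop, strand, name). Columns must be
# tab-separated (shown with plain whitespace here for readability).
# Contig names and lengths are made up.
cat > unplaced_scaffolds.interval_list <<'EOF'
@HD	VN:1.6
@SQ	SN:chrUn_scaffold_001	LN:15000
@SQ	SN:chrUn_scaffold_002	LN:12000
chrUn_scaffold_001	1	15000	+	unplaced_group_1
chrUn_scaffold_002	1	12000	+	unplaced_group_1
EOF
```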

How much memory should it require to run MergeVCFs on 2500 samples?


My MergeVCFs job is not outputting any logs on FireCloud; it has just been sitting there for several days without finishing. We ran this successfully on 1200 samples with just 8 GB of RAM, so it's hard for me to imagine this is a memory issue. Are 2500 samples too many for MergeVCFs to handle?
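
For what it's worth, a hedged sketch of how the Java heap is usually raised for this tool; the 16 GB value and the file names are purely illustrative, not a recommendation for this case.

```
# Hedged sketch: raising the Java heap for MergeVcfs via --java-options.
# The 16g value and file names are illustrative only.
gatk --java-options "-Xmx16g" MergeVcfs \
    -I shard_1.vcf.gz \
    -I shard_2.vcf.gz \
    -O merged.vcf.gz
```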

Installing GATK4 via Conda


Hi there! I have a small problem, or a suggestion for improvement, related to the use of (Mini)conda and GATK4. I'm not entirely sure if this forum is the right place to ask, because I don't really know how GATK4's Conda package is maintained, but let's give it a try!

So I'm using a wide variety of bioinformatics tools in my work, which is why I prefer Conda for package management - just to make it a little bit easier to handle package dependencies and updates. I am now planning to try the new GATK4, as version 4.0.1.1 seems to be available in Bioconda. With GATK3 I was able to launch GATK simply with the command 'gatk', so I naturally tried the very same command for GATK4. However:

gatk -h
bash: gatk: command not found
gatk4 -h
bash: gatk4: command not found

I located the GATK4 .jar file and successfully tried the command:

java -jar /home/user/miniconda3/pkgs/gatk4-4.0.1.1-py36/share/gatk4-4.0.0.1-0/gatk-package-4-0.0.1-local.jar -h

This prints all available tools as expected. So the main problem seems to be that a shortcut to this .jar file is not included in the Conda distribution. Is there any particular reason for this behaviour, or is this just a bug in the package? It is, of course, possible to use GATK4 with the 'java -jar' command, but a simple 'gatk' or 'gatk4' command would be easier for Conda users. For example, if I update GATK4 in the future I must also update my pipelines so that the paths point to the right .jar file. If I could use a direct 'gatk4' command instead, I could simply update GATK4 with Conda and launch it with 'gatk4' in my pipeline - without manual path updating.
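
As a stopgap, a hedged sketch of a small wrapper script so pipelines can keep calling a stable name; the jar path is copied verbatim from the post (it may not match your install), and ~/bin is assumed to be on PATH.

```
# Hedged workaround sketch: a stable "gatk4" command that forwards to the jar.
# The jar path below is copied from the post and is an assumption about the install;
# ~/bin is assumed to be on PATH.
mkdir -p ~/bin
cat > ~/bin/gatk4 <<'EOF'
#!/bin/bash
exec java -jar /home/user/miniconda3/pkgs/gatk4-4.0.1.1-py36/share/gatk4-4.0.0.1-0/gatk-package-4-0.0.1-local.jar "$@"
EOF
chmod +x ~/bin/gatk4
```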

Thank you!


Could you please let me know of a tool to concatenate GVCF files? Or is there another solution?

Could you please let me know of a tool to concatenate GVCF files? Or is there another solution, i.e. running the intermediate HaplotypeCaller step in GVCF mode on parts of each chromosome to speed up the process, and then combining the parts into one GVCF file before joint genotyping?
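
A hedged sketch of one common scatter/gather pattern for a single sample; the intervals, file names, and tool choice (GatherVcfs for in-order, non-overlapping shards) are illustrative assumptions, not the only way to do this.

```
# Hedged sketch: per-interval HaplotypeCaller in GVCF mode, then concatenation.
# Interval and file names are placeholders.
gatk HaplotypeCaller -R ref.fasta -I sample.bam -L chr1:1-50000000 \
    -ERC GVCF -O sample.shard1.g.vcf.gz
gatk HaplotypeCaller -R ref.fasta -I sample.bam -L chr1:50000001-100000000 \
    -ERC GVCF -O sample.shard2.g.vcf.gz

# Concatenate the non-overlapping, genome-ordered shards into one per-sample GVCF.
gatk GatherVcfs -I sample.shard1.g.vcf.gz -I sample.shard2.g.vcf.gz \
    -O sample.g.vcf.gz
```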

Annotation problem: not all variants are taken into account

Hello,

I use GATK version 4.1 to annotate a VCF with the following command:

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx5g -jar gatk-package-4.1.1.0-local.jar VariantAnnotator -R /sandbox/resources/species/human/ensembl/release-75/Homo_sapiens.GRCh37.75.dna.toplevel.fa -V GQPDOMB-stats.vcf -O rsID_GQPDOMB.vcf --dbsnp /sandbox/resources/species/human/ensembl/release-75/dbSNP_b150_GRCh37_00-All.vcf.gz

However, it only looks up rs IDs for the first 5. How can I solve this problem so that it finds the rs IDs for the entire file?

Thank you for your help.

Potential for merging GenomeDBs or adding to an existing database


Hello,

I have adopted GenomicsDBImport for my workflows, starting with ~2500 samples. This data structure works well and I am very happy it was implemented. However, my data is such that I will be adding more samples every 2-3 months for the foreseeable future. It becomes very cumbersome and space-prohibitive to manage both gVCFs and a GenomicsDB workspace, as well as immensely time-consuming to re-import all samples again and again to maintain it. Is there a way now, or a plan, to add new samples to an existing database, or to merge GenomicsDB workspaces? I anticipate this will become a common need.

Sincerely,
Brian

GATK FastaAlternateReferenceMaker not correcting fasta reference

Hi,
I am trying to use GATK FastaAlternateReferenceMaker, but the output fasta file is the same as the input. In other words, my fasta genome file is not being corrected according to the VCF file I supply. I am wondering whether I am misusing the tool or whether this is a bug.
Here is the command line I used:
$ nice -19 gatk FastaAlternateReferenceMaker -R Dp_PB-MI_190104_dedup.fasta -V Mi_M-B-Dp_PB_B-M-freebayes_onlyindels_cov_qMi+20_SRRF-notrepeat_sorted.vcf -O Dp_PB-MI_190104_dedup_gatkcorrected.fasta &>gatk.log

Thanks in advance for your help.

Paul

VQSR: extremely low numbers of TP variants in tranches (0.01 novel variants?) and weird tranche results

Dear staff (I am wondering why I am only allowed to select "Zoo & Garden" for the category... HELP),

I have 45108 variants from 31 exome VCF files. After VariantRecalibrator, my tranche-specificity plot shows the best Ti/Tv ratio at 57.8, which is very different from the examples in the tutorial. Is this normal, and should I proceed with ApplyVQSR using a -tranche cutoff of 55.7? I have read several tutorials and watched a video which say it is alright to lower the value, but I have not yet seen anyone lower it to below 80.

If this is not normal, should I switch to hard filtering?

Command:

gatk --java-options "-Xmx48g -Xms48g" VariantRecalibrator -R $reference_dir -V tmp_sitesonly.vcf.gz -O tmp_sitesonly_recal_snps.vcf.gz --tranches-file tmp_sitesonly.snps.tranches -an MQ -an DP -an QD -an FS -an ReadPosRankSum --max-gaussians 4 --trust-all-polymorphic -tranche 60 -tranche 55.9 -tranche 55.8 -tranche 55.7 -tranche 55.5 -tranche 55 -mode SNP --rscript-file tmp_Merged_snps.plot.R --resource hapmap,known=false,training=true,truth=true,prior=15.0:$hapmap --resource omni,known=false,training=true,truth=false,prior=12.0:$omni --resource 1000G,known=false,training=true,truth=false,prior=10.0:$oneksnp --resource dbsnp,known=true,training=false,truth=false,prior=2.0:$dbsnp

Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx48g -Xms48g -jar /home/cruadmin01/installations/gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar VariantRecalibrator -R /home/Graceca/references/hs38/hs38DH.fa -V tmp_sitesonly.vcf.gz -O tmp_sitesonly_recal_snps.vcf.gz --tranches-file tmp_sitesonly.snps.tranches -an MQ -an DP -an QD -an FS -an ReadPosRankSum --max-gaussians 4 --trust-all-polymorphic -tranche 60 -tranche 55.9 -tranche 55.8 -tranche 55.7 -tranche 55.5 -tranche 55 -mode SNP --rscript-file tmp_Merged_snps.plot.R --resource hapmap,known=false,training=true,truth=true,prior=15.0:/home/Graceca/references/GATK/hapmap_3.3.hg38.vcf.gz --resource omni,known=false,training=true,truth=false,prior=12.0:/home/Graceca/references/GATK/1000G_omni2.5.hg38.vcf.gz --resource 1000G,known=false,training=true,truth=false,prior=10.0:/home/Graceca/references/GATK/1000G_phase1.snps.high_confidence.hg38.vcf.gz --resource dbsnp,known=true,training=false,truth=false,prior=2.0:/home/Graceca/references/GATK/dbsnp_146.hg38.vcf.gz
16:13:44.468 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/cruadmin01/installations/gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
16:13:46.193 INFO VariantRecalibrator - ------------------------------------------------------------
16:13:46.193 INFO VariantRecalibrator - The Genome Analysis Toolkit (GATK) v4.0.12.0
16:13:46.194 INFO VariantRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
16:13:46.194 INFO VariantRecalibrator - Executing as cruadmin01@KTPAPPCRU01 on Linux v3.13.0-103-generic amd64
16:13:46.194 INFO VariantRecalibrator - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_101-b13
16:13:46.194 INFO VariantRecalibrator - Start Date/Time: 22 March, 2019 4:13:44 PM SGT
16:13:46.194 INFO VariantRecalibrator - ------------------------------------------------------------
16:13:46.195 INFO VariantRecalibrator - ------------------------------------------------------------
16:13:46.195 INFO VariantRecalibrator - HTSJDK Version: 2.18.1
16:13:46.195 INFO VariantRecalibrator - Picard Version: 2.18.16
16:13:46.195 INFO VariantRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:13:46.195 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:13:46.196 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:13:46.196 INFO VariantRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:13:46.196 INFO VariantRecalibrator - Deflater: IntelDeflater
16:13:46.196 INFO VariantRecalibrator - Inflater: IntelInflater
16:13:46.196 INFO VariantRecalibrator - GCS max retries/reopens: 20
16:13:46.196 INFO VariantRecalibrator - Requester pays: disabled
16:13:46.196 INFO VariantRecalibrator - Initializing engine
16:13:46.593 INFO FeatureManager - Using codec VCFCodec to read file file:///home/Graceca/references/GATK/hapmap_3.3.hg38.vcf.gz
16:13:46.733 INFO FeatureManager - Using codec VCFCodec to read file file:///home/Graceca/references/GATK/1000G_omni2.5.hg38.vcf.gz
16:13:46.833 INFO FeatureManager - Using codec VCFCodec to read file file:///home/Graceca/references/GATK/1000G_phase1.snps.high_confidence.hg38.vcf.gz
16:13:46.903 INFO FeatureManager - Using codec VCFCodec to read file file:///home/Graceca/references/GATK/dbsnp_146.hg38.vcf.gz
16:13:46.957 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt1/fastq/hs38dr/tmp_sitesonly.vcf.gz
16:13:47.041 WARN IndexUtils - Feature file "/home/Graceca/references/GATK/dbsnp_146.hg38.vcf.gz" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file
16:13:47.148 INFO VariantRecalibrator - Done initializing engine
16:13:47.150 INFO TrainingSet - Found hapmap track: Known = false Training = true Truth = true Prior = Q15.0
16:13:47.150 INFO TrainingSet - Found omni track: Known = false Training = true Truth = false Prior = Q12.0
16:13:47.150 INFO TrainingSet - Found 1000G track: Known = false Training = true Truth = false Prior = Q10.0
16:13:47.150 INFO TrainingSet - Found dbsnp track: Known = true Training = false Truth = false Prior = Q2.0
16:13:47.206 INFO ProgressMeter - Starting traversal
16:13:47.206 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
16:13:52.601 INFO ProgressMeter - chr20:44357841 0.1 49559 551268.1
16:13:52.601 INFO ProgressMeter - Traversal complete. Processed 49559 total variants in 0.1 minutes.
16:13:52.616 INFO VariantDataManager - MQ: mean = 59.81 standard deviation = 1.40
16:13:52.629 INFO VariantDataManager - DP: mean = 90.09 standard deviation = 369.36
16:13:52.635 INFO VariantDataManager - QD: mean = 23.68 standard deviation = 7.09
16:13:52.640 INFO VariantDataManager - FS: mean = 0.16 standard deviation = 0.97
16:13:52.644 INFO VariantDataManager - ReadPosRankSum: mean = 0.15 standard deviation = 1.02
16:13:52.712 INFO VariantDataManager - Annotation order is: [DP, MQ, QD, FS, ReadPosRankSum]
16:13:52.718 INFO VariantDataManager - Training with 13094 variants after standard deviation thresholding.
16:13:52.720 INFO GaussianMixtureModel - Initializing model with 100 k-means iterations...
16:13:52.940 INFO VariantRecalibratorEngine - Finished iteration 0.
16:13:53.076 INFO VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.07580
16:13:53.165 INFO VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.06460
16:13:53.253 INFO VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.01033
16:13:53.341 INFO VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.01247
16:13:53.414 INFO VariantRecalibratorEngine - Convergence after 24 iterations!
16:13:53.434 INFO VariantRecalibratorEngine - Evaluating full set of 45108 variants...
16:13:54.081 INFO VariantDataManager - Selected worst 1354 scoring variants --> variants with LOD <= -5.0000.
16:13:54.081 INFO GaussianMixtureModel - Initializing model with 100 k-means iterations...
16:13:54.098 INFO VariantRecalibratorEngine - Finished iteration 0.
16:13:54.104 INFO VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.02325
16:13:54.110 INFO VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.00563
16:13:54.117 INFO VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.00108
16:13:54.117 INFO VariantRecalibratorEngine - Convergence after 15 iterations!
16:13:54.127 INFO VariantRecalibratorEngine - Evaluating full set of 45108 variants...
16:13:54.784 INFO TrancheManager - Finding 6 tranches for 45108 variants
16:13:54.809 INFO TrancheManager - TruthSensitivityTranche threshold 60.00 => selection metric threshold 0.400
16:13:54.821 INFO TrancheManager - Found tranche for 60.000: 0.400 threshold starting with variant 38815; running score is 0.400
16:13:54.821 INFO TrancheManager - TruthSensitivityTranche is TruthSensitivityTranche targetTruthSensitivity=60.00 minVQSLod=19.2431 known=(6221 @ 2.0883) novel=(72 @ 0.5000) truthSites(6689 accessible, 4013 called), name=anonymous]
16:13:54.821 INFO TrancheManager - TruthSensitivityTranche threshold 55.90 => selection metric threshold 0.441
16:13:54.827 INFO TrancheManager - Found tranche for 55.900: 0.441 threshold starting with variant 39577; running score is 0.441
16:13:54.827 INFO TrancheManager - TruthSensitivityTranche is TruthSensitivityTranche targetTruthSensitivity=55.90 minVQSLod=19.4768 known=(5517 @ 2.1017) novel=(14 @ 1.3333) truthSites(6689 accessible, 3739 called), name=anonymous]
16:13:54.827 INFO TrancheManager - TruthSensitivityTranche threshold 55.80 => selection metric threshold 0.442
16:13:54.830 INFO TrancheManager - Found tranche for 55.800: 0.442 threshold starting with variant 39592; running score is 0.442
16:13:54.830 INFO TrancheManager - TruthSensitivityTranche is TruthSensitivityTranche targetTruthSensitivity=55.80 minVQSLod=19.4814 known=(5503 @ 2.1025) novel=(13 @ 1.6000) truthSites(6689 accessible, 3732 called), name=anonymous]
16:13:54.830 INFO TrancheManager - TruthSensitivityTranche threshold 55.70 => selection metric threshold 0.443
16:13:54.833 INFO TrancheManager - Found tranche for 55.700: 0.443 threshold starting with variant 39607; running score is 0.443
16:13:54.833 INFO TrancheManager - TruthSensitivityTranche is TruthSensitivityTranche targetTruthSensitivity=55.70 minVQSLod=19.4869 known=(5489 @ 2.1034) novel=(12 @ 2.0000) truthSites(6689 accessible, 3725 called), name=anonymous]
16:13:54.833 INFO TrancheManager - TruthSensitivityTranche threshold 55.50 => selection metric threshold 0.445
16:13:54.836 INFO TrancheManager - Found tranche for 55.500: 0.445 threshold starting with variant 39643; running score is 0.445
16:13:54.836 INFO TrancheManager - TruthSensitivityTranche is TruthSensitivityTranche targetTruthSensitivity=55.50 minVQSLod=19.4990 known=(5455 @ 2.1041) novel=(10 @ 2.3333) truthSites(6689 accessible, 3712 called), name=anonymous]
16:13:54.836 INFO TrancheManager - TruthSensitivityTranche threshold 55.00 => selection metric threshold 0.450
16:13:54.839 INFO TrancheManager - Found tranche for 55.000: 0.450 threshold starting with variant 39744; running score is 0.450
16:13:54.839 INFO TrancheManager - TruthSensitivityTranche is TruthSensitivityTranche targetTruthSensitivity=55.00 minVQSLod=19.5381 known=(5359 @ 2.0886) novel=(5 @ 5.0000) truthSites(6689 accessible, 3678 called), name=anonymous]
16:13:54.840 INFO VariantRecalibrator - Writing out recalibration table...
16:13:55.390 INFO VariantRecalibrator - Writing out visualization Rscript file...
16:13:55.406 INFO VariantRecalibrator - Building DP x MQ plot...
16:13:55.410 INFO VariantRecalibratorEngine - Evaluating full set of 3660 variants...
16:13:55.856 INFO VariantRecalibratorEngine - Evaluating full set of 3660 variants...
16:13:56.079 INFO VariantRecalibrator - Building DP x QD plot...
16:13:56.082 INFO VariantRecalibratorEngine - Evaluating full set of 3660 variants...
16:13:56.424 INFO VariantRecalibratorEngine - Evaluating full set of 3660 variants...
16:13:56.700 INFO VariantRecalibrator - Building DP x FS plot...
16:13:56.703 INFO VariantRecalibratorEngine - Evaluating full set of 3660 variants...
16:13:57.044 INFO VariantRecalibratorEngine - Evaluating full set of 3660 variants...
16:13:57.255 INFO VariantRecalibrator - Building DP x ReadPosRankSum plot...
16:13:57.257 INFO VariantRecalibratorEngine - Evaluating full set of 3660 variants...
16:13:57.608 INFO VariantRecalibratorEngine - Evaluating full set of 3660 variants...
16:13:57.810 INFO VariantRecalibrator - Building MQ x QD plot...
16:13:57.811 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:13:58.207 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:13:58.395 INFO VariantRecalibrator - Building MQ x FS plot...
16:13:58.396 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:13:58.707 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:13:58.919 INFO VariantRecalibrator - Building MQ x ReadPosRankSum plot...
16:13:58.921 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:13:59.846 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:14:00.017 INFO VariantRecalibrator - Building QD x FS plot...
16:14:00.018 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:14:00.301 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:14:00.485 INFO VariantRecalibrator - Building QD x ReadPosRankSum plot...
16:14:00.485 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:14:00.768 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:14:00.944 INFO VariantRecalibrator - Building FS x ReadPosRankSum plot...
16:14:00.944 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:14:01.223 INFO VariantRecalibratorEngine - Evaluating full set of 3721 variants...
16:14:01.405 INFO VariantRecalibrator - Executing: Rscript /mnt1/fastq/hs38dr/tmp_Merged_snps.plot.R
16:14:12.480 INFO VariantRecalibrator - Executing: Rscript (resource)org/broadinstitute/hellbender/tools/walkers/vqsr/plot_Tranches.R /mnt1/fastq/hs38dr/tmp_sitesonly.snps.tranches 2.15



Last question: this is the result after ApplyVQSR:

chr2 88606308 . T *,A 597.39 PASS AC=2,10;AF=0.125,0.625;AN=16;DP=34;ExcessHet=0.0458;FS=0.000;MLEAC=7,24;MLEAF=0.438,1.00;MQ=60.00;POSITIVE_TRAIN_SITE;QD=29.11;SOR=0.804;VQSLOD=20.04;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS ./.:2,0,0:2:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:3,0,0:3:.:.:.:0,0,0,0,0,0 ./.:2,0,0:2:.:.:.:0,0,0,0,0,0 1/2:0,2,1:3:27:.:.:126,36,27,75,0,68 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 0/0:2,0,0:2:6:.:.:0,6,41,6,41,41 ./.:2,0,0:2:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:3,0,0:3:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 2|2:0,0,1:1:3:1|1:88606308_T_A:45,45,45,3,3,0:88606308 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:1,0,0:1:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 2|2:0,0,4:4:12:1|1:88606308_T_A:180,180,180,12,12,0:88606308 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 2|2:0,0,3:3:9:1|1:88606308_T_A:124,124,124,9,9,0:88606308 1/2:0,2,2:4:42:.:.:168,54,42,56,0,44 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 2|2:0,0,2:2:6:1|1:88606308_T_A:90,90,90,6,6,0:88606308 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0 0/0:2,0,0:2:6:.:.:0,6,39,6,39,39 ./.:0,0,0:0:.:.:.:0,0,0,0,0,0

This result shows one of the variants with a PASS status for tranche = 55.7. I don't get why DP is 34 when no DP of 34 appears in any of the 31 samples shown, so how is the DP value calculated? I feel there is something really weird with the output... is it simply because of low depth that it looks a bit abnormal to me? Some samples seem to have a DP of only one or two!

Is DP the right criterion to use in whole-exome sequencing?


I have whole-exome sequencing data from multiple mice. I have run the GATK 4.1 pipeline, using dbSNP142 for recalibration and SnpEff for annotation. Can I consider DP=100 as the minimum value for filtering the resulting SNPs? When I set the minimum coverage to DP=50, I end up with more SNPs, but I end up with fewer SNPs when I go higher in DP value.
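
For illustration only, a hedged sketch of tagging low-depth sites with a hard filter rather than removing them; the DP < 50 expression just mirrors the cutoff mentioned above, and all file names are placeholders.

```
# Hedged sketch: flag sites with low total depth using a hard filter.
# The DP < 50 value mirrors the cutoff discussed above; adjust as needed.
gatk VariantFiltration \
    -R reference.fasta \
    -V input.vcf.gz \
    --filter-expression "DP < 50" \
    --filter-name "LowDP" \
    -O filtered.vcf.gz
```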

How MuTect filters candidate mutations


Please note that this article refers to the original standalone version of MuTect. A new version is now available within GATK (starting at GATK 3.5) under the name MuTect2. This new version is able to call both SNPs and indels. See the GATK version 3.5 release notes and the MuTect2 tool documentation for further details.

Overview

This document describes the methodological underpinnings of the filters that MuTect applies by default to distinguish real mutations from sequencing artifacts and errors. Some of these filters are applied in all detection modes, while others are only applied in "High Confidence" detection mode.

Note that at the moment, there is no straightforward way to disable these filters. It is possible to disable each by passing parameter values that render the filters ineffective (e.g. set a value of zero for a filter that requires a minimum value of some quantity) but this has to be examined on a case-by-case basis. A more practical solution is to leave the filter parameters untouched, but instead perform some filtering on the CALLSTATS file using text processing functions (e.g. test for lines that have REJECT in only one of several columns).
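
As a concrete example of that kind of text processing, a hedged awk sketch: the judgement and failure_reasons column names, the failure-reason string, and the file names are assumptions and should be checked against the actual CALLSTATS header.

```
# Hedged sketch: keep the header plus calls that are KEEP, or whose only listed
# failure reason is one specific filter. Column names and the failure-reason
# string are assumptions; verify them against your CALLSTATS header first.
awk -F'\t' '
    /^#/ { print; next }                                    # pass through comment lines
    !hdr { for (i = 1; i <= NF; i++) col[$i] = i; hdr = 1; print; next }
    $col["judgement"] == "KEEP" || $col["failure_reasons"] == "clustered_read_position"
' call_stats.txt > call_stats.rescued.txt
```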


Filters used in high-confidence mode

1. Proximal Gap

This filter removes false positives (FP) caused by nearby misaligned small indel events. MuTect will reject a candidate site if there are more than a given number of reads with insertions/deletions in an 11 base pair window centered on the candidate. The threshold value is controlled by the --gap_events_threshold argument.

In the CALLSTATS output file, the relevant columns are labeled t_ins_count and t_del_count.

2. Poor Mapping

This filter removes FPs caused by reads that are poorly mapped (typically due to sequence similarities between different portions of the genome). The filter uses two tests:

  • Reject candidate if it does not meet a given threshold for the fraction of reads that have a mapping quality of 0 in tumor and normal samples. The threshold value is controlled by --fraction_mapq_threshold.

  • Reject candidate if it does not have at least one observation of the mutant allele with a mapping quality that satisfies a given threshold. The threshold value is controlled by --required_maximum_alt_allele_mapping_quality_score.

In the CALLSTATS output file, the relevant columns are labeled total_reads and map_Q0_reads for the first test, and t_alt_max_mapq for the second test.

3. Strand Bias

This filter rejects FPs caused by context-specific sequencing errors where the vast majority of alternate alleles are seen in a single direction of reads. Candidates are rejected if strand-specific LOD is below a given threshold in a direction where the sensitivity to have passed that threshold is above a certain percentage. The LOD threshold value is controlled by --strand_artifact_lod and the percentage is controlled by --strand_artifact_power_threshold.

In the CALLSTATS output file, the relevant columns are labeled power_to_detect_negative_strand_artifact and t_lod_fstar_forward. There are also complementary columns labeled power_to_detect_positive_strand_artifact and t_lod_fstar_reverse.

4. Clustered Position

This filter rejects FPs caused by misalignments evidenced by the alternate alleles being clustered at a consistent distance from the start or end of the read alignment. Candidates are rejected if their median distance from the start/end of the read and median absolute deviation are lower or equal to given thresholds. The position from end of read threshold value is controlled by --pir_median_threshold and the deviation value is controlled by --pir_mad_threshold.

In the CALLSTATS output file, the relevant columns are labeled tumor_alt_fpir_median and tumor_alt_fpir_mad for the forward strand, and complementary columns are labeled tumor_alt_rpir_median and tumor_alt_rpir_mad for the reverse (note the name difference is fpir vs. rpir, for forward vs. reverse position in read).

5. Observed in Control

This filter rejects FPs in tumor data by looking at control data (typically from a matched normal) for evidence of the alternate allele that is above random sequencing error. Candidates are rejected if both the following conditions are met:

  • The number of observations of the alternate allele or the proportion of reads carrying the alternate allele is above a given threshold, controlled by --max_alt_alleles_in_normal_count and --max_alt_allele_in_normal_fraction.

  • The sum of quality scores is above a given threshold value, controlled by --max_alt_alleles_in_normal_qscore_sum.

In the CALLSTATS output file, the relevant columns are labeled n_alt_count, normal_f, and n_alt_sum.
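
To make the knobs above concrete, a hedged sketch of how these thresholds would be passed on the command line. The basic I/O flags follow standalone MuTect 1.x conventions and may differ between versions, and the numeric values are purely illustrative; only the threshold flag names themselves come from the text above.

```
# Hedged sketch only: illustrative values, not recommendations. The I/O flags
# (--analysis_type, --reference_sequence, --input_file:*, --out) follow
# standalone MuTect 1.x conventions and may differ by version.
java -Xmx4g -jar muTect.jar \
    --analysis_type MuTect \
    --reference_sequence reference.fasta \
    --input_file:tumor tumor.bam \
    --input_file:normal normal.bam \
    --gap_events_threshold 3 \
    --fraction_mapq_threshold 0.5 \
    --required_maximum_alt_allele_mapping_quality_score 20 \
    --pir_median_threshold 10 \
    --pir_mad_threshold 3 \
    --max_alt_alleles_in_normal_count 2 \
    --max_alt_allele_in_normal_fraction 0.03 \
    --out call_stats.txt
```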


Filters applied in all MuTect modes

1. Tumor and normal LOD scores

This filter rejects candidates with a tumor LOD score below a given threshold value, controlled by --tumor_lod, and similarly for a normal LOD score threshold controlled by --normal_lod_threshold.

In the CALLSTATS output file, the relevant columns are labeled t_lod_fstar and init_n_lod, respectively.

2. Possible contamination

This filter rejects candidates with potential cross-patient contamination, controlled by --fraction_contamination.

In the CALLSTATS output file, the relevant columns are labeled t_lod_fstar and contaminant_lod.

3. Normal LOD score and dbsnp status

If a candidate mutation is in dbsnp but is not in COSMIC, it may be a germline variant. In that case, the normal LOD threshold that the candidate must clear is raised to a value controlled by --dbsnp_normal_lod.

In the CALLSTATS output file, the relevant column is labeled init_n_lod.

4. Triallelic Site Filter

When the program is evaluating a site, it considers all possible alternate alleles as mutation candidates, and puts them through all the filters detailed above. If more than one candidate allele passes all filters, resulting in a proposed triallelic site, the site is rejected with the reason triallelic_site because it is extremely unlikely that this would really happen in a tumor sample.


Getting specific information from Mutect or Mutect2 output

Hi

For a long time I have been looking for a variant caller that gives me the read depth for the tumour and matched normal, as well as the number of variant bases at the position in the tumor sample, for called SNVs and indels. I used Strelka, but I cannot find these in the output .vcf files. I need something like this table:

```
> head(mut_data[,c(1,3:9)])
    Sample CHROM       POS REF ALT Tumor_Varcount Tumor_Depth Normal_Depth
1 CHC2432T  chr1 102961055   G   A              4          64           62
```

I have read about the MuTect and Mutect2 output formats, but I could not work out whether I can extract this information when calling SNVs and indels with MuTect or Mutect2.
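
One hedged possibility, not necessarily the best tool for this: GATK's VariantsToTable can export per-sample AD and DP FORMAT fields from a Mutect2 VCF into a flat table. File names below are placeholders.

```
# Hedged sketch: extract per-site fields plus per-sample AD (ref,alt allele depths)
# and DP (sample depth) from a Mutect2 VCF. File names are placeholders.
gatk VariantsToTable \
    -V mutect2_calls.vcf.gz \
    -F CHROM -F POS -F REF -F ALT \
    -GF AD -GF DP \
    -O mutect2_calls.table.tsv
```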

Any suggestion please


Thank you

Missing metrics for a sample with CollectVariantCallingMetrics


Hi,
I am using CollectVariantCallingMetrics on a VCF generated with GATK version 4.1.1.0, and my Picard version is 2.1.0. Interestingly, while my VCF has 4 samples, I got metrics for only 3 of them. Am I missing something here? The tool did not error out while running.
-Uma

MergeBAM RG ID and SM tags


Dear GATK team,

I have gone through the discussion on merging lane-wise BAMs, and I still have a little confusion.

Does it make a difference if we have the same RG ID and SM tags across all lanes, given that it was mentioned that MarkDuplicates will consider the library (flowcell)?

What if lanes for the same sample are sequenced on more than one flowcell? How does MarkDuplicates handle such cases? How does it affect variant calling?
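
For context, a hedged sketch of per-lane read groups at alignment time: RG ID (and PU) differ per flowcell/lane, SM stays the same for the sample, and LB names the library. All flowcell, lane, sample, and library names below are made up.

```
# Hedged sketch: one read group per flowcell+lane, same SM for the sample.
# All names (flowcells, lanes, sample, library) are made up.
bwa mem -R '@RG\tID:FLOWCELL1.LANE1\tPU:FLOWCELL1.LANE1\tSM:sample01\tLB:lib01\tPL:ILLUMINA' \
    reference.fasta sample01_FC1_L001_R1.fastq.gz sample01_FC1_L001_R2.fastq.gz > sample01_FC1_L001.sam

bwa mem -R '@RG\tID:FLOWCELL2.LANE3\tPU:FLOWCELL2.LANE3\tSM:sample01\tLB:lib01\tPL:ILLUMINA' \
    reference.fasta sample01_FC2_L003_R1.fastq.gz sample01_FC2_L003_R2.fastq.gz > sample01_FC2_L003.sam
```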

Please help me resolve these kinds of cases.

Thanks In Advance
Fazulur Rehaman

GATK 4.1.0.0 Mutect2 error with gnomAD AF file

I'm running into an error when running GATK 4.1.0.0 with the following call:

java -Xmx16g -jar ${gatkDir}/GATK.jar Mutect2 -R ${GRC}.fa -I ${TU}.recal.bam -tumor TU -I ${NM}.recal.bam -normal NM --native-pair-hmm-threads $threads --germline-resource $gnomad --af-of-alleles-not-in-resource 0.0000025 -O ${sampleName}.mutect.UF.vcf --tmp-dir temp

The error is as follows:

```
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.lambda$getGermlineAltAlleleFrequencies$3(GermlineProbabilityCalculator.java:55)
at java.util.stream.ReferencePipeline$6$1.accept(ReferencePipeline.java:244)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
at java.util.stream.DoublePipeline.toArray(DoublePipeline.java:506)
at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getGermlineAltAlleleFrequencies(GermlineProbabilityCalculator.java:57)
at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getNegativeLog10PopulationAFAnnotation(GermlineProbabilityCalculator.java:29)
at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:165)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:233)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:232)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:291)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
```

I have seen errors like this listed before on the forums relating to the AF file. I removed the file, and it was able to successfully run. The AF file is af-only-gnomad.filtered.hg38.vcf.gz

However, the above function call with the AF file runs correctly on GATK 4.0.10.1 with no errors and completes successfully.

The AF file is formatted as follows:

```
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10067 . T TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 30.35 PASS .
1 10108 . CAACCCT C 46514.3 PASS .
1 10109 . AACCCT A 89837.3 PASS .
1 10114 . TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA CAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA,T 36729 PASS .
1 10119 . CT C 251.23 PASS .
1 10120 . T C 14928.7 PASS .
1 10128 . ACCCTAACCCTAACCCTAAC A 285.71 PASS .
1 10131 . CT C 378.93 PASS .
...
```

Any thoughts as to what the error could be?

Thank you!

Question about GATK4 SplitNCigarReads tool

Hi, I used the GATK SplitNCigarReads tool to process RNA-Seq data, which is said to reduce the false positive rate. The processed data was then used for SNP calling (using the variant calling tools in GATK). However, after annotating the SNP calls with a GTF file, it turns out that only 20%~30% of the SNP sites are located in exonic regions. I am puzzled by this; ideally, most of the SNP sites should be located in exonic regions. Could you please help me solve this puzzle?
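
For reference, a hedged sketch of the SplitNCigarReads step itself; file names are placeholders, and this does not by itself explain the exonic fraction observed.

```
# Hedged sketch: split reads containing Ns in their CIGAR strings (spliced
# RNA-seq alignments) before variant calling. File names are placeholders.
gatk SplitNCigarReads \
    -R reference.fasta \
    -I rnaseq.sorted.dedup.bam \
    -O rnaseq.split.bam
```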