Problems with Indel calling (following GATK Best Practices)

June 12, 2018, 7:19 am

Hi,

We sequenced two exomes (from a normal and a tumor sample) using the TruSeq DNA Exome kit (Illumina) on our NextSeq 500 (mid flowcell, 2x75 bp). We then analyzed the reads with our pipeline, which follows the GATK Best Practices, and finally used Mutect2 to call somatic variants.

We then filtered and prioritized the variants (using our browser QueryOR) to obtain some indels that could be interesting for our study. The problem is that when we look at the alignments in IGV, we cannot find those indels. The reads do not have any insertions or deletions.

Best regards
Erika

↧

SNP calling for Cell lines - how does the ploidy affect HC

November 20, 2017, 5:21 am

Hi all,

I am calling SNPs in various immortalised cell lines, which are known to be very unstable, so the ploidy is not known. Generally they should be diploid. So my question is: what can happen if the ploidy is not correct? Would HC miss SNPs? I see a relatively low overlap of common SNPs between two related cell lines, and I want to make sure this low overlap is real.

Thank you in advance.

↧

GenotypeGVCFs and VariantFiltration tools

June 12, 2018, 10:19 am

We are following the "Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode" best practices using GATK 3.8.1 and Java 1.8. We merged the raw.g.vcfs from HaplotypeCaller into one cohort.g.vcf and then carried out joint genotyping with the GenotypeGVCFs tool. We are working with a haploid model organism, so we then tried to run the VariantFiltration tool on the output (a VCF file containing the information from all of the sequences we are working with). However, this failed with the error
"Line 2176: there aren't enough columns for line 102"
Others have encountered the same problem, and I see that you responded that the GATK and Java versions were incompatible, but that was several versions ago. Is this true for us? Could you please tell me where to go next?
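One way to narrow down an error like this is to check whether every data line in the VCF has as many tab-separated columns as the #CHROM header line. The following is a minimal sketch of such a check, not a GATK tool; the function name find_short_vcf_lines is made up for illustration, and it assumes an uncompressed, tab-delimited VCF.

```python
def find_short_vcf_lines(handle):
    """Yield (line_number, n_columns, n_expected) for data lines that have
    fewer tab-separated columns than the #CHROM header line."""
    expected = None
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            expected = len(line.split("\t"))
            continue
        if expected is None:
            continue  # data before the header cannot be checked
        n_columns = len(line.split("\t"))
        if n_columns < expected:
            yield lineno, n_columns, expected
```

It could be used as `with open("cohort.vcf") as fh: print(list(find_short_vcf_lines(fh)))`, where cohort.vcf is a placeholder file name. A truncated line (for example from an interrupted write) would show up in the output with its line number.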

↧

Estimate tumor contamination in normal sample

May 31, 2018, 8:33 am

Hello again- I'd like to hear your thoughts about estimating the tumour contamination in a normal sample. (This is useful to know when the matched normal is tissue adjacent to the tumor).

From "Estimate cross-sample contamination using GetPileupSummaries and CalculateContamination" I see that the combination of GetPileupSummaries + CalculateContamination is not appropriate for this task.

However, I see in mutect.pdf, section VII, a proposed strategy for estimating tumor in normal. Do you have any experience/recommendations about it?

Thanks a lot!
Dario

↧

Should I make PON although having Matched normal samples? PLEASE LET ME KNOW (..)

May 29, 2018, 7:38 pm

I have matched normal-tumor samples for 30 individuals,
but every document I have seen says that making a PON is a very important step.

Although I have a matched normal sample for every tumor sample, should I still make a PON using Mutect tumor-only mode?

P.S. I currently use GATK3, and I want to compare GATK3 and GATK4 in terms of results.

↧

Applying GATK to microbiome data

May 16, 2017, 1:41 pm

Hi, I am interested in knowing how to apply GATK tools to microbiome data. Specifically, I would like to override the ploidy assumption in HaplotypeCaller and make it flexible, since one sample could contain an unknown number of haplotypes at the same time. I know the somatic mutation caller Mutect2 does not share this assumption, but it is designed specifically for normal-tumor sample pairs, which is not really applicable to microbiome studies. Thanks

↧

About fastaalternatereferencemaker in GATK 4.0

January 14, 2018, 10:36 pm

Where is FastaAlternateReferenceMaker in GATK 4.0?
I want to build a reference containing mutations with FastaAlternateReferenceMaker, but I cannot find the command
for it. Please tell me where it is.
Thank you!

↧

Haplotype caller not picking up variants for HiSeq Runs

October 12, 2017, 7:33 am

Hello,
We used to sequence all our data on HiSeq and have now moved to NextSeq. We have sequenced the same batch of samples on both sequencers. Both are processed using the same pipeline/parameters.
What I have noticed is that GATK 3.7 HC is not picking up variants, even though the coverage is good and the variants are evidently present in the BAM file.

For example, the screenshot below shows the BAM files for both the NextSeq and HiSeq samples. There are at least 3
variants in the region 22:29885560-29885861 (NEPH, exon 5) that are expected to be picked up for HiSeq.

These variants are picked up for the NextSeq samples (even though the coverage for HiSeq is much better).

The command that I have used for both samples is

java -Xmx32g -jar GATK_v3_7/GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37.fa --dbsnp GATK_ref/dbsnp_138.b37.vcf -I ${i}.HiSeq_Run31.variant_ready.bam -L NEPH.bed -o ${i}.HiSeq_Run31.NEPH.g.vcf

Any idea why this can happen ?

Many thanks,

↧

GATK4 silly "multithreading" workaround

June 12, 2018, 6:40 pm

Hi,
I'm working with RNAseq samples from sunflower. Right now I have samples from 8 genotypes (3 biological replicates each). These eight genotypes arise from the same biparental crossing; so the genotypes are related.

I'm using GATK4 in a VM with 16 Intel Xeon E7-4860 processors (they don't support AVX), 32 GB RAM and 16 GB swap (I can ask for more).

Since GATK4 no longer has the multithreading options (-nt and -nct), I often cannot take advantage of all processors. Because of this, I have been trying the Spark versions of the tools, but I don't really want to use them until you "approve" them officially.

I also tried a silly approach to "multithreading": dividing the genome into 16 intervals, running 16 parallel commands with the -L option, and then merging the results.

My question is: how wrong is this approach?
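The interval-splitting idea can be sketched as follows. This is only an illustration of one way to build the 16 groups: it assigns whole contigs to groups so that no interval boundary falls inside a contig, and it balances the groups by total length with a simple greedy rule. The function name group_contigs and the contig names and lengths are hypothetical, not taken from the post.

```python
import heapq

def group_contigs(contig_lengths, n_groups):
    """Assign whole contigs to n_groups, longest first, always adding the
    next contig to the group with the smallest total length so far."""
    heap = [(0, i) for i in range(n_groups)]  # (total length, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    by_length = sorted(contig_lengths.items(), key=lambda kv: -kv[1])
    for name, length in by_length:
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + length, idx))
    return groups

# Hypothetical contig lengths, for illustration only.
example = {"chr1": 100, "chr2": 80, "chr3": 60, "chr4": 40, "chr5": 20}
for i, names in enumerate(group_contigs(example, 2)):
    # Each group becomes the -L arguments of one parallel command.
    print(i, " ".join("-L " + n for n in names))
```

Splitting on contig boundaries is a design choice made here to sidestep the question of what happens to calls near an arbitrary cut point; whether mid-contig cuts are acceptable for a given tool is not something this sketch addresses.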

↧

GenomicsDBImport doesn't work with my five GVCF files

June 12, 2018, 8:48 pm

Hi Geraldine!
I am using the GenomicsDBImport tool to merge five GVCF files. The command is shown below:
gatk GenomicsDBImport -R /hg19/hg19.fa -V sample1.g.vcf -V sample2.g.vcf -V sample3.g.vcf -V sample4.g.vcf -V sample5.g.vcf --genomicsdb-workspace-path my_database --intervals chr20

However, it fails with the error htsjdk.tribble.TribbleException: An index is required, but none found., for input source: file:sample1.g.vcf. How can I resolve the problem?
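The message says an index file is expected next to each input. A small pre-flight check along these lines can list which inputs lack one. It is a sketch under the assumption that an uncompressed VCF is indexed as <file>.idx and a bgzipped one as <file>.tbi; the helper names are invented for illustration. GATK4 ships an IndexFeatureFile tool for creating such indexes, but its arguments differ between versions, so they are not shown here.

```python
import os

def expected_index(path):
    """Return the index file name conventionally found next to a VCF/GVCF."""
    if path.endswith(".gz"):
        return path + ".tbi"  # tabix index for bgzip-compressed files
    return path + ".idx"      # Tribble index for plain-text files

def missing_indexes(paths, exists=os.path.exists):
    """Return the inputs whose expected index file is not present."""
    return [p for p in paths if not exists(expected_index(p))]
```

For the command in the post, `missing_indexes(["sample1.g.vcf", "sample2.g.vcf", "sample3.g.vcf", "sample4.g.vcf", "sample5.g.vcf"])` would return the files that still need indexing.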

↧

Germline short variant discovery (SNPs + Indels)

January 7, 2018, 1:03 am

Purpose

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.

Reference Implementations

Pipeline	Summary	Notes	Github	FireCloud
Prod* germline short variant per-sample calling	uBAM to GVCF	optimized for GCP	yes	pending
Prod* germline short variant joint genotyping	GVCFs to cohort VCF	optimized for GCP	yes	pending
$5 Genome Analysis Pipeline	uBAM to GVCF or cohort VCF	optimized for GCP (see blog)	yes	hg38
Generic germline short variant per-sample calling	analysis-ready BAM to GVCF	universal	yes	hg38
Generic germline short variant joint genotyping	GVCFs to cohort VCF	universal	yes	hg38 & b37
Intel germline short variant per-sample calling	uBAM to GVCF	Intel optimized for local architectures	yes	NA

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.

Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.

Main steps

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed up the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly.

Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyRecalibration

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.
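To make "flat thresholds applied to all variants equally" concrete, here is a minimal sketch of hard-filtering over the INFO column of a VCF record. It is not how VariantFiltration is implemented; the rule list, the filter names and the threshold values are placeholders chosen for illustration and should not be read as recommendations.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flags map to True."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed

# (annotation, test that returns True when the variant FAILS, filter name).
# Threshold values are illustrative placeholders only.
EXAMPLE_RULES = [
    ("QD", lambda v: v < 2.0, "lowQD"),
    ("FS", lambda v: v > 60.0, "highFS"),
    ("MQ", lambda v: v < 40.0, "lowMQ"),
]

def hard_filter(info, rules=EXAMPLE_RULES):
    """Return the names of all failed filters, or ["PASS"] if none failed."""
    annotations = parse_info(info)
    failed = []
    for key, fails, name in rules:
        raw = annotations.get(key)
        if not isinstance(raw, str):
            continue  # annotation missing or a bare flag: rule not applied
        try:
            value = float(raw)
        except ValueError:
            continue  # multi-valued or non-numeric annotation: rule not applied
        if fails(value):
            failed.append(name)
    return failed or ["PASS"]
```

Every rule is evaluated independently of the others, which is the main contrast with VQSR, where annotations are modelled jointly.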

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.

Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which requires access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described above, enabling incremental growth of cohorts as well as scaling to large cohort sizes.
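The split between an expensive per-sample step and a cheap cohort step can be illustrated with a toy model. The sketch below is deliberately simplified and is not the model GATK uses: it assumes a diploid, biallelic site, takes per-sample PL values (Phred-scaled likelihoods for 0/0, 0/1, 1/1) as given, estimates a cohort allele frequency with a few EM iterations, and uses Hardy-Weinberg proportions as the prior. All function names are invented for this example.

```python
def pl_to_likelihoods(pl):
    """Convert Phred-scaled likelihoods to (unnormalised) likelihoods."""
    return [10 ** (-x / 10.0) for x in pl]

def hwe_prior(f):
    """Hardy-Weinberg genotype prior for alt allele frequency f."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]

def posteriors(pl, f):
    """Genotype posteriors for one sample given a cohort allele frequency."""
    joint = [l * p for l, p in zip(pl_to_likelihoods(pl), hwe_prior(f))]
    total = sum(joint)
    return [j / total for j in joint]

def estimate_af(cohort_pls, iterations=20, start=0.1):
    """EM estimate of the alt allele frequency from all samples' PLs."""
    f = start
    for _ in range(iterations):
        expected_alt = 0.0
        for pl in cohort_pls:
            p = posteriors(pl, f)
            expected_alt += p[1] + 2 * p[2]
        f = expected_alt / (2 * len(cohort_pls))
        f = min(max(f, 1e-6), 1 - 1e-6)  # keep the prior away from 0 and 1
    return f
```

The point of the toy model is that only the PL values are needed for the cohort step, so the reads never have to be revisited when a sample is added. A weakly supported genotype such as PL = 0,3,34 gets a different posterior depending on what the rest of the cohort looks like.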

↧

Human genome reference file and picard.jar MarkDuplicates

June 12, 2018, 11:24 pm

samtools faidx GRCh38_r77.all.fa
This command is not working with my data. I am using Linux Ubuntu. GRCh38_r77.all.fa is the reference file; I downloaded each chromosome separately and merged them into the final reference file.
The following commands work well with my data:
bwa aln -t 21 hg38bwaidx CHD1/CHD1_R1.fastq.gz >R1.sai
bwa aln -t 21 hg38bwaidx CHD1/CHD1_R2.fastq.gz >R2.sai
bwa sampe hg38bwaidx R1.sai R2.sai CHD1/CHD1_R1.fastq.gz CHD1/CHD1_R2.fastq.gz > chd1_pe.sam
samtools view -Sb chd1_pe.sam >chd1_pe.bam
samtools sort chd1_pe.bam chd1_sorted.bam
mv chd1_sorted.bam.bam chd1_s.bam
samtools index chd1_s.bam
java -jar CreateSequenceDictionary R=GRCh38_r77.all.fa O=GRCh38_r77.all.fa.dict
samtools view -H chd1_s.bam

The output of the last command:
@HD VN:1.3 SO:coordinate
@SQ SN:10 LN:133797422
@SQ SN:11 LN:135086622
@SQ SN:12 LN:133275309
@SQ SN:13 LN:114364328
@SQ SN:14 LN:107043718
@SQ SN:15 LN:101991189
@SQ SN:16 LN:90338345
@SQ SN:17 LN:83257441
@SQ SN:18 LN:80373285
@SQ SN:19 LN:58617616
@SQ SN:1 LN:248956422
@SQ SN:20 LN:64444167
@SQ SN:21 LN:46709983
@SQ SN:22 LN:50818468
@SQ SN:2 LN:242193529
@SQ SN:3 LN:198295559
@SQ SN:4 LN:190214555
@SQ SN:5 LN:181538259
@SQ SN:6 LN:170805979
@SQ SN:7 LN:159345973
@SQ SN:8 LN:145138636
@SQ SN:9 LN:138394717
@SQ SN:MT LN:16569
@SQ SN:X LN:156040895
@SQ SN:Y LN:57227415
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa sampe hg38bwaidx R1.sai R2.sai CHD1/CHD1_R1.fastq.gz CHD1/CHD1_R2.fastq.gz

Then I ran
java -jar picard.jar MarkDuplicates I=chd1_s.bam M=chd1_dup.metric O=chd1_dup.bam

The error message is:
INFO 2018-06-12 14:12:09 MarkDuplicates Tracking 14376 as yet unmatched pairs. 269 records in RAM.
[Tue Jun 12 14:12:10 AWST 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 28.26 minutes.
Runtime.totalMemory()=20756561920
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 346429592, Read name E00474:164:HJLVLALXX:5:1213:23470:25728, MAPQ should be 0 for unmapped read.
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:812)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:495)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:232)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

Can anyone give me advice on this? Thank you so much in advance.
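The validation error refers to a record that is flagged as unmapped (FLAG bit 0x4) but carries a non-zero MAPQ. To see how many such records the BAM contains, a sketch like the following can be run over SAM text. The function name is made up for illustration, and the input is assumed to be tab-delimited SAM lines, for example from `samtools view`.

```python
def unmapped_with_mapq(sam_lines):
    """Yield (read_name, flag, mapq) for records flagged unmapped (0x4)
    whose MAPQ field is not 0."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4 and mapq != 0:
            yield fields[0], flag, mapq
```

Picard tools accept a VALIDATION_STRINGENCY option (for example LENIENT), which is often suggested when this message appears; whether relaxing validation is appropriate for this data set is not something the post establishes.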

↧

GATK HaplotypeCaller missing SNPs at the terminals of the segment when calling SNPs for Influenza A

June 13, 2018, 2:23 am

We are trying to call variants for Influenza A virus sequenced on a MiSeq using HaplotypeCaller, following GATK Best Practices (GATK version 3.7). However, when checking the called variants against the BAM file in IGV, we frequently find SNPs at the beginning or end of a segment that HaplotypeCaller misses. The missing ones are well supported by the reads and are called by samtools and UnifiedGenotyper with high confidence.

As one example (shown below), there are three rows of called variants at the top: from top to bottom, those called by UnifiedGenotyper, samtools, and HaplotypeCaller. The rightmost SNP is called by the first two tools but missed by HaplotypeCaller, although the supporting reads show a consistent SNP.

Just to show that this SNP is well supported by the reads, here is the record reporting it in the VCF generated by UnifiedGenotyper:

A-New_Jersey-NHRC_93408-2016-H3N2(KY078630)-HA 15 . A T 166598 . AC=1;AF=1.00;AN=1;DP=3970;Dels=0.00;FS=0.000;HaplotypeScore=26.7856;MLEAC=1;MLEAF=1.00;MQ=59.99;MQ0=0;QD=34.24;SOR=4.823 GT:AD:DP:GQ:PL 1:0,3969:3970:99:166628,0

On closer inspection of the BAM file that HaplotypeCaller generates for debugging, we noticed that the variant is consistently missing from the de novo assembled haplotypes.

There are also other cases of missing SNPs. What they have in common is that they are always at the end of a segment, are well supported by reads, and are missed only by HaplotypeCaller. However, for some samples, similar variants at the ends are called by HaplotypeCaller.

My questions are as follows:

Is this a bug in HaplotypeCaller? If so, has it been fixed?
If it is not a bug, is there a HaplotypeCaller parameter that can be set to guarantee that it will not miss good-quality variants at the ends of a segment?

Many thanks.

↧

SelectVariants - ERROR MESSAGE: The PL index 1326 cannot be more than 1325

June 13, 2018, 3:33 am

Dear GATK Team,

I am using GATK version 3.6-0-g89b7209 (for consistency with control data).
From a VCF containing only INDELs in 500 samples I am trying to extract only variants in a subset of 269 samples:

time java -Xmx24g -jar /home/mhalache/tools/GATK3.6/GenomeAnalysisTK.jar -T SelectVariants \
-R /exports/igmm/eddie/NextGenResources/annotation/variants/1KG_phase3/reference/human_g1k_v37.fasta \
-V a.INDEL.ready.vcf.gz \
-sf sample_ids.txt \
-o a.INDEL.unrel.vcf.gz \
--removeUnusedAlternates \
-env

and getting the following error message

DEBUG 2018-06-13 11:14:33 BlockCompressedOutputStream Using deflater: Deflater

ERROR --

ERROR stack trace

java.lang.IllegalStateException: The PL index 1326 cannot be more than 1325
at htsjdk.variant.variantcontext.GenotypeLikelihoods.getAllelePair(GenotypeLikelihoods.java:492)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.getDiploidLikelihoodIndexes(GATKVariantContextUtils.java:697)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.determineDiploidLikelihoodIndexesToUse(GATKVariantContextUtils.java:647)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.fixDiploidGenotypesFromSubsettedAlleles(GATKVariantContextUtils.java:1421)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.updatePLsSACsAD(GATKVariantContextUtils.java:1403)
at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.subsetRecord(SelectVariants.java:1080)
at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:854)
at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:309)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------

ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):

ERROR

ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.

ERROR If not, please post the error message, with stack trace, to the GATK forum.

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://www.broadinstitute.org/gatk

ERROR

ERROR MESSAGE: The PL index 1326 cannot be more than 1325

ERROR ------------------------------------------------------------------------------------------

I could not find any relevant post on the GATK site, please accept my apologies if it has been discussed previously.
A corresponding script for extracting the SNPs for the same subset appears to be working properly (it is currently running; the output has not been validated yet).
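The two numbers in the message can be related to allele counts with a little arithmetic. For a diploid sample, a site with n alleles has n(n+1)/2 genotypes, and the VCF specification orders the PL values so that genotype j/k (j <= k) sits at index k(k+1)/2 + j. The sketch below works through the numbers; the function names are made up for illustration. Reading this as "some record has more than 50 alternate alleles" is an inference from the arithmetic, not a confirmed diagnosis.

```python
def num_diploid_genotypes(n_alleles):
    """Number of unordered diploid genotypes for n_alleles alleles."""
    return n_alleles * (n_alleles + 1) // 2

def pl_index(j, k):
    """Index of diploid genotype j/k in the PL array (VCF ordering)."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j

def alleles_for_index(index):
    """Inverse of pl_index: return (j, k) for a PL index."""
    k = 0
    while (k + 1) * (k + 2) // 2 <= index:
        k += 1
    return index - k * (k + 1) // 2, k

print(num_diploid_genotypes(51))  # 1326 genotypes, so valid indexes are 0..1325
print(alleles_for_index(1325))    # (50, 50): last genotype with 51 alleles
print(alleles_for_index(1326))    # (0, 51): needs a 52nd allele (index 51)
```

In other words, 1325 is the last valid PL index for a site with a reference allele plus 50 alternates, and index 1326 would correspond to genotype 0/51, which only exists if the site has at least 51 alternate alleles.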

Best,
Mike

↧

GATK4 - VariantFiltration --genotype-filter-expression

March 11, 2018, 5:39 pm

Hello there,
I am trying to apply some sample-level filters to a VCF generated using GATK4.0.2.1. My issue is that not all variant sites are getting an FT flag added, and I am wondering why. Additionally, "PASS" is being added to the FILTER column at the variant level (I am not sure if this behavior is expected, but it seems weird).

Here is some information about the system:

17:43:04.589 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/szs315/libgkl_compression8694733123384787175.so
17:43:04.681 INFO  VariantFiltration - ------------------------------------------------------------
17:43:04.681 INFO  VariantFiltration - The Genome Analysis Toolkit (GATK) v4.0.2.1
17:43:04.681 INFO  VariantFiltration - For support and documentation go to https://software.broadinstitute.org/gatk/
17:43:04.681 INFO  VariantFiltration - Executing as szs315@quser12 on Linux v3.10.0-514.36.5.el7.x86_64 amd64
17:43:04.681 INFO  VariantFiltration - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_112-b16
17:43:04.682 INFO  VariantFiltration - Start Date/Time: March 11, 2018 6:43:04 PM CDT
17:43:04.682 INFO  VariantFiltration - ------------------------------------------------------------
17:43:04.682 INFO  VariantFiltration - ------------------------------------------------------------
17:43:04.682 INFO  VariantFiltration - HTSJDK Version: 2.14.3
17:43:04.682 INFO  VariantFiltration - Picard Version: 2.17.2
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.BUFFER_SIZE : 131072
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.COMPRESSION_LEVEL : 1
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.CREATE_INDEX : false
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.CREATE_MD5 : false
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.CUSTOM_READER_FACTORY : 
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.DISABLE_SNAPPY_COMPRESSOR : false
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.EBI_REFERENCE_SERVICE_URL_MASK : https://www.ebi.ac.uk/ena/cram/md5/%s
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.NON_ZERO_BUFFER_SIZE : 131072
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.REFERENCE_FASTA : null
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.USE_CRAM_REF_DOWNLOAD : false
17:43:04.685 DEBUG ConfigFactory - Configuration file values: 
17:43:04.688 DEBUG ConfigFactory -  gcsMaxRetries = 20
17:43:04.688 DEBUG ConfigFactory -  gatk_stacktrace_on_user_exception = false
17:43:04.688 DEBUG ConfigFactory -  samjdk.use_async_io_read_samtools = false
17:43:04.688 DEBUG ConfigFactory -  samjdk.compression_level = 1
17:43:04.688 DEBUG ConfigFactory -  samjdk.use_async_io_write_samtools = true
17:43:04.688 DEBUG ConfigFactory -  samjdk.use_async_io_write_tribble = false
17:43:04.688 DEBUG ConfigFactory -  spark.kryoserializer.buffer.max = 512m
17:43:04.688 DEBUG ConfigFactory -  spark.driver.maxResultSize = 0
17:43:04.688 DEBUG ConfigFactory -  spark.driver.userClassPathFirst = true
17:43:04.688 DEBUG ConfigFactory -  spark.io.compression.codec = lzf
17:43:04.688 DEBUG ConfigFactory -  spark.yarn.executor.memoryOverhead = 600
17:43:04.689 DEBUG ConfigFactory -  spark.driver.extraJavaOptions = 
17:43:04.689 DEBUG ConfigFactory -  spark.executor.extraJavaOptions = 
17:43:04.689 DEBUG ConfigFactory -  codec_packages = [htsjdk.variant, htsjdk.tribble, org.broadinstitute.hellbender.utils.codecs]
17:43:04.689 DEBUG ConfigFactory -  cloudPrefetchBuffer = 40
17:43:04.689 DEBUG ConfigFactory -  cloudIndexPrefetchBuffer = -1
17:43:04.689 DEBUG ConfigFactory -  createOutputBamIndex = true
17:43:04.689 INFO  VariantFiltration - Deflater: IntelDeflater
17:43:04.689 INFO  VariantFiltration - Inflater: IntelInflater
17:43:04.689 INFO  VariantFiltration - GCS max retries/reopens: 20
17:43:04.689 INFO  VariantFiltration - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
17:43:04.689 INFO  VariantFiltration - Initializing engine

Here is the command I used to apply the filters

 gatk-launch VariantFiltration \
-variant wild_isolate.vcf.gz \
--genotype-filter-expression "DP < 2" \
--genotype-filter-name "depth" \
-O wi_dp_tet.vcf  \
--verbosity DEBUG \
--seconds-between-progress-updates 0.1 \
--disable-tool-default-read-filters true \
--lenient true \
--disable-sequence-dictionary-validation true \
--disable-bam-index-caching true

I added the --verbosity flag and all the flags below it after I noticed that some variants were not receiving the FT field. I thought some default filters might be applied that result in variants being skipped (maybe these flags need to be applied at previous steps?). I ran this step with and without those flags, and with and without the -R flag.

I am running this on a test data set to make sure my pipeline is working properly. 45576 variants are not receiving the FT field and 127762 variants did receive it. Also, note that I am not going through the VQSR procedure because I do not have a truth set.

As for the steps preceding VariantFiltration, I ran HaplotypeCaller in DISCOVERY mode with ERC GVCF (in chromosome blocks), ran ValidateVariants, combined the chromosome gVCFs for each sample using CombineGVCFs, combined the individual sample gVCFs with GenomicsDBImport, ran GenotypeGVCFs on individual chromosomes, and merged the chromosome VCFs using GatherVcfs.

Here are the last few entries of test VCF, highlighting the inconsistent FORMAT/FT field.

MtDNA   12998   .   C   A,T 2457.39 PASS    AC=8,6;AF=0.571,0.429;AN=14;AS_QD=15.04,31.74;DP=74;ExcessHet=3.0103;FS=0.000;GQ_MEAN=31.14;GQ_STDDEV=28.46;MLEAC=8,6;MLEAF=0.571,0.429;MQ=59.59;NCC=1;QD=33.66;SOR=0.720   GT:AD:DP:GQ:PL  1/1:0,2,0:2:6:80,6,0,80,6,80    2/2:0,0,2:2:6:83,83,83,6,6,0    1/1:0,3,0:3:9:125,9,0,125,9,125 ./.:1,0,0:1:.:0,0,0,0,0,0   1/1:0,22,0:22:66:817,66,0,817,66,817    1/1:0,8,0:8:24:235,24,0,235,24,235  2/2:0,0,11:11:33:383,383,383,33,33,0    2/2:0,0,25:25:74:749,749,749,74,74,0
MtDNA   13029   .   T   C   74.63   PASS    AC=2;AF=0.125;AN=16;AS_QD=32.99;DP=62;ExcessHet=0.1472;FS=0.000;GQ_MEAN=22.13;GQ_STDDEV=20.47;MLEAC=1;MLEAF=0.063;MQ=60.00;NCC=0;QD=26.41;SOR=0.693 GT:AD:DP:FT:GQ:PL   1/1:0,2:2:PASS:6:90,6,0 0/0:1,0:1:depth:3:0,3,34    0/0:5,0:5:PASS:15:0,15,195  0/0:1,0:1:depth:3:0,3,32    0/0:18,0:18:PASS:48:0,48,720    0/0:7,0:7:PASS:21:0,21,213  0/0:8,0:8:PASS:24:0,24,288  0/0:20,0:20:PASS:57:0,57,855
MtDNA   13069   .   T   C   2144.05 PASS    AC=12;AF=1.00;AN=12;AS_QD=27.59;DP=51;ExcessHet=3.0103;FS=0.000;GQ_MEAN=25.50;GQ_STDDEV=13.52;MLEAC=14;MLEAF=1.00;MQ=60.00;NCC=2;QD=30.55;SOR=0.994 GT:AD:DP:GQ:PL  1/1:0,2:2:6:87,6,0  ./.:0,0:0:.:0,0,0   1/1:0,7:7:21:292,21,0   ./.:0,0:0:.:0,0,0   1/1:0,12:12:36:531,36,0 1/1:0,7:7:21:259,21,0   1/1:0,8:8:24:334,24,0   1/1:0,15:15:45:620,45,0
MtDNA   13208   .   C   T   788.24  PASS    AC=6;AF=0.500;AN=12;AS_QD=25.73;DP=53;ExcessHet=0.1809;FS=0.000;GQ_MEAN=20.00;GQ_STDDEV=19.22;MLEAC=8;MLEAF=0.667;MQ=60.00;NCC=2;QD=28.92;SOR=1.127 GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0   0/0:2,0:2:6:0,6,65  1/1:0,4:4:12:157,12,0   ./.:0,0:0:.:0,0,0   1/1:0,8:8:24:341,24,0   1/1:0,8:8:24:303,24,0   0/0:13,0:13:0:0,0,353   0/0:18,0:18:54:0,54,472
MtDNA   13344   .   G   A   226.02  PASS    AC=2;AF=0.200;AN=10;AS_QD=28.25;DP=17;ExcessHet=0.2482;FS=0.000;GQ_MEAN=9.60;GQ_STDDEV=8.85;MLEAC=3;MLEAF=0.300;MQ=60.00;NCC=3;QD=28.25;SOR=1.179   GT:AD:DP:FT:GQ:PL   0/0:1,0:1:depth:3:0,3,39    ./.:0,0:0:PASS:.:0,0,0  ./.:0,0:0:PASS:.:0,0,0  ./.:0,0:0:PASS:.:0,0,0  0/0:2,0:2:PASS:3:0,3,45 0/0:4,0:4:PASS:12:0,12,136  0/0:2,0:2:PASS:6:0,6,88 1/1:0,8:8:PASS:24:239,24,0
MtDNA   13700   .   TA  T   49.17   PASS    AC=2;AF=0.250;AN=8;AS_QD=24.58;DP=24;ExcessHet=0.3218;FS=0.000;GQ_MEAN=17.25;GQ_STDDEV=7.89;MLEAC=2;MLEAF=0.250;MQ=48.99;NCC=4;QD=24.58;RPA=8,7;RU=A;SOR=2.303;STR  GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:1,0:1:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:7,0:7:21:0,21,298   0/0:6,0:6:18:0,18,141   1/1:0,2:2:6:61,6,0  0/0:8,0:8:24:0,24,211
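
A split like the one in these entries (some records list FT in FORMAT, others do not) can be tallied outside of GATK. The following is only a minimal sketch in Python: the function name `count_ft` and the shortened demo records are made up for illustration, and it assumes a tab-separated VCF with the FORMAT column in the ninth position.

```python
def count_ft(lines):
    """Count VCF records whose FORMAT column does / does not list FT."""
    with_ft = 0
    without_ft = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # sites-only record, no FORMAT column
        if "FT" in fields[8].split(":"):
            with_ft += 1
        else:
            without_ft += 1
    return with_ft, without_ft


# Shortened, hypothetical records modelled on the entries above.
demo = [
    "##fileformat=VCFv4.2",
    "MtDNA\t13029\t.\tT\tC\t74.63\tPASS\tAC=2\tGT:AD:DP:FT:GQ:PL\t1/1:0,2:2:PASS:6:90,6,0",
    "MtDNA\t13069\t.\tT\tC\t2144.05\tPASS\tAC=12\tGT:AD:DP:GQ:PL\t1/1:0,2:2:6:87,6,0",
]
print(count_ft(demo))  # (1, 1)
```

For a real file, the same function can be fed an open text handle (for a .vcf.gz, one opened with `gzip.open(path, "rt")`).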

Any and all help is appreciated! I'm hoping it is something simple!

Thanks

↧

Picard Sort Vcf Error

January 24, 2017, 7:27 pm

≫ Next: Artifact list on targeted panel data

≪ Previous: GATK4 - VariantFiltration --genotype-filter-expression

Hello.

I am using GATK version 3.6, picard-2.8.2.jar

I downloaded hapmap_3.3.hg38.vcf from the GATK resource bundle. I then used the command below to remove the chr notation.
awk '{gsub(/^chr/,""); print}' hapmap_3.3.hg38.vcf > no_chr_hapmap_3.3.hg38.vcf.vcf

Before (hapmap_3.3.hg38.vcf)
chr1 2242065 rs263526 T C . PASS AC=724;AF=0.259;AN=2792
chr1 2242417 rs16824926 C . . PASS AN=530
chr1 2242880 rs11581436 A . . PASS AN=540

After (no_chr_hapmap_3.3.hg38.vcf.vcf)
1 6421563 rs4908891 G A . PASS AC=1086;AF=0.389;AN=2792
1 6421782 rs4908892 A G . PASS AC=1692;AF=0.606;AN=2792
1 6421856 rs12078257 T C . PASS AC=368;AF=0.132;AN=2790

Then I used Picard SortVcf to sort the no_chr_hapmap_3.3.hg38.vcf.vcf
java -jar picard-2.8.2.jar SortVcf I=removedChr_HapMap.vcf O=sortedHapMap.vcf SEQUENCE_DICTIONARY=hg38.dict

hg38.dict
@SQ SN:1 LN:248956422 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:2648ae1bacce4ec4b6cf337dcae37816
@SQ SN:10 LN:133797422 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:907112d17fcb73bcab1ed1c72b97ce68
@SQ SN:11 LN:135086622 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:1511375dc2dd1b633af8cf439ae90cec
@SQ SN:12 LN:133275309 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:e81e16d3f44337034695a29b97708fce

I then encountered this error:

Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=248956422,dict_index=0,assembly=20) was found when SAMSequenceRecord(name=1,length=248956422,dict_index=0,assembly=null) was expected.
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:126)
at picard.vcf.SortVcf.doWork(SortVcf.java:95)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=248956422,dict_index=0,assembly=20) was found when SAMSequenceRecord(name=1,length=248956422,dict_index=0,assembly=null) was expected.
at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:170)
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:124)
... 4 more

I have tried many times but still get the same error. Kindly advise how I can solve this problem.

I would then like to run SelectVariants to extract variants that are missing from HapMap but present in my dataset.
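
One possible reading of the error is that the VCF header still declares `chr1`: the awk command only strips `chr` at the very start of a line, so a header line beginning with `##contig=<ID=chr1` would be left untouched. The sketch below, in Python, strips the prefix in both places. The function name `strip_chr` and the demo lines are illustrative only, and whether SortVcf then accepts the file also depends on the header contigs agreeing with hg38.dict in other respects, which this sketch does not check.

```python
import re

# Matches the "chr" prefix of the ID inside a ##contig header line.
CONTIG_ID = re.compile(r"^(##contig=<.*?\bID=)chr")


def strip_chr(line):
    """Remove the chr prefix from contig header lines and record lines."""
    if line.startswith("##contig="):
        return CONTIG_ID.sub(r"\1", line, count=1)
    if line.startswith("#"):
        return line  # other header lines are left unchanged
    return line[3:] if line.startswith("chr") else line


# Hypothetical demo lines modelled on the file described above.
demo = [
    "##contig=<ID=chr1,length=248956422,assembly=20>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t2242065\trs263526\tT\tC\t.\tPASS\tAC=724;AF=0.259;AN=2792",
]
for line in demo:
    print(strip_chr(line))
```

Applied line by line to the whole file, this keeps the header and the records consistent with each other, which the awk one-liner alone may not do.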

Thank you so much in advance.

Cheers,
Moon

↧

Artifact list on targeted panel data

June 13, 2018, 7:06 am

≫ Next: Base Quality Recalibration

≪ Previous: Picard Sort Vcf Error

Hi!

As is well described, targeted panel data is difficult to filter because tools like FilterByOrientationBias, which are available for WES data, are lacking. I was wondering, however, whether GATK is capable of building a "Pool of Artifacts" from the low-quality and/or user-specified false positives and using it as a proper filtering method.

Any ideas whether this exists or whether it could be implemented easily?
Thanks!

↧

Base Quality Recalibration

June 13, 2018, 8:02 am

≫ Next: Sparks tool TaskMemoryManager WARN

≪ Previous: Artifact list on targeted panel data

I am analyzing human whole exome sequencing reads from normal and lung tumor pairs for somatic mutations and have been using hg38 as the reference for pre-processing prior to Base Quality Recalibration. Is it possible to use b37 as the reference and resources for this step or do I need to start again and use b37 from the beginning and for the entire variant calling pipeline?

↧

Sparks tool TaskMemoryManager WARN

June 13, 2018, 8:14 am

≫ Next: Allele Depth (AD) / Allele Balance (AB) Filtering in GATK 4

≪ Previous: Base Quality Recalibration

Using ApplyBQSRSpark I got the following WARN messages and the job stopped. There were more than 60 GB of RAM free on the server I used at that time, and every time I launch the command it gives the same output. This is the command that I run using docker:

/gatk/gatk ApplyBQSRSpark \
-I ${INPUT_FILE_BAM} \
--bqsr-recal-file recal_data_.table \
-O /BQSRS_.bam

18/06/13 10:51:15 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 172.17.0.2:44253 in memory (size: 5.7 KB, free: 15.8 GB)
18/06/13 11:04:55 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again.
18/06/13 11:06:32 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
18/06/13 11:08:03 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again.
18/06/13 11:09:51 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again.

↧

Allele Depth (AD) / Allele Balance (AB) Filtering in GATK 4

June 5, 2018, 2:03 am

≫ Next: Upcoming workshops: June-July-September 2018

≪ Previous: Sparks tool TaskMemoryManager WARN

Hi,

I am trying to filter my GATK 4.0.3 - HaplotypeCaller generated multi-sample VCF for allele depth (AD) annotation at sample genotype-level (so available in "FORMAT" fields of each sample).

I think prior to GATK 4, this annotation was available as "Allele Balance" (AB) ratios (generated by AlleleBalanceBySample), but it is not available anymore in GATK 4. So I tried to filter genotypes based on the AD field, which carries the same information but in "X,Y" format, i.e. as an array of integers. This array format makes it difficult to filter on the depth of the alternative allele divided by the depth of all alleles at a specific site.

Can you please recommend any solution to this problem? If I could turn this array into a ratio, I could easily filter genotypes using VariantFiltration or other tools such as vcflib/vcffilter. I also tried the below code (following https://gatkforums.broadinstitute.org/gatk/discussion/1255/what-are-jexl-expressions-and-how-can-i-use-them-with-the-gatk):

gatk VariantFiltration -R $ref -V $vcf -O $output --genotype-filter-expression 'vc.getGenotype("Sample1").getAD().1 / vc.getGenotype("Sample1").getAD().0 > 0.33' --set-filtered-genotype-to-no-call --genotype-filter-name 'ABfilter'

This worked, but strangely it filters the variant for all samples if only one of the samples has allele depths that are not in balance (as defined by the filter). If it worked only for Sample1, I was planning to write a quick loop over all the samples, for instance. I tried the same with GATK 3.8, but it still filters the whole variant for all the samples if it is filtered in just one sample.
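
For illustration, the ratio described above can be computed per sample outside of GATK. This is a minimal sketch in Python, not a VariantFiltration feature: the function name `allele_balance` is made up, and it takes the ratio as non-reference depth divided by total depth (which differs from the alt/ref ratio in the JEXL expression above), returning None where AD is missing or sums to zero.

```python
def allele_balance(format_field, sample_field):
    """Return non-reference depth / total depth from AD, or None if undefined."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "AD" not in keys:
        return None
    idx = keys.index("AD")
    if idx >= len(values):
        return None  # trailing sample fields may be omitted
    try:
        depths = [int(x) for x in values[idx].split(",")]
    except ValueError:
        return None  # AD is "." or otherwise not numeric
    total = sum(depths)
    if len(depths) < 2 or total == 0:
        return None
    return (total - depths[0]) / total


# Hypothetical FORMAT and sample columns in the usual GT:AD:DP:GQ:PL layout.
fmt = "GT:AD:DP:GQ:PL"
samples = ["0/1:8,4:12:99:120,0,240", "1/1:0,2:2:6:87,6,0", "./.:0,0:0:.:0,0,0"]
print([allele_balance(fmt, s) for s in samples])
```

Because each sample column is handled on its own, a threshold such as 0.33 can be applied genotype by genotype rather than to the whole site.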

↧