GATK4 MergeVcfs "One or more header lines must be in the header line collection"

July 1, 2018, 9:04 am

≫ Next: How do I fix the issue "Sequence dictionaries are not the same size (6671, 242)"

≪ Previous: A complete script that processes a trio with WGS data from FASTQ to BAM to VCF

Hi! I am trying to use MergeVcfs to merge several VCF files (VarScan2 output files) but I am getting the following error:

gatk MergeVcfs \
   -I A.vcf \
   -I B.vcf \
   -D human_g1k_v37_decoy.dict
   -O out.vcf

...
java.lang.IllegalArgumentException: One or more header lines must be in the header line collection
...

Unfortunately I cannot find any information about this error message. I have tried using gatk ValidateVariants to validate the input VCF files but this does not return any errors:

gatk ValidateVariants \
   -V A.vcf \
   -R human_g1k_v37_decoy.fasta

...
12:01:11.764 INFO  ValidateVariants - Done initializing engine
12:01:11.764 INFO  ProgressMeter - Starting traversal
12:01:11.765 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
12:01:12.641 INFO  ProgressMeter -           1:29562369              0.0                 43393        2978924.5
12:01:12.642 INFO  ProgressMeter - Traversal complete. Processed 43393 total variants in 0.0 minutes.
12:01:12.642 INFO  ValidateVariants - Shutting down engine
[July 1, 2018 12:01:12 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.03 minutes.

Can anyone familiar with the code point me in the right direction?

The VCF header for A.vcf and B.vcf looks as follows:

##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth of quality bases">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
##INFO=<ID=SS,Number=1,Type=String,Description="Somatic status of variant (0=Reference,1=Germline,2=Somatic,3=LOH, or 5=Unknown)
##INFO=<ID=SSC,Number=1,Type=String,Description="Somatic score in Phred scale (0-255) derived from somatic p-value">
##INFO=<ID=GPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor+normal versus no variant for Germline calls
##INFO=<ID=SPV,Number=1,Type=Float,Description="Fisher's Exact Test P-value of tumor versus normal for Somatic/LOH calls">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=DP4,Number=1,Type=String,Description="Strand read counts: ref/fwd, ref/rev, var/fwd, var/rev">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR

↧

How do I fix the issue "Sequence dictionaries are not the same size (6671, 242)"

July 1, 2018, 12:24 pm

≫ Next: Can you give some suggestions on running gatk4-germline-joint-discovery pipeline on 799 human WES sa

≪ Previous: GATK4 MergeVcfs "One or more header lines must be in the header line collection"

I am using 2.18.7-1-gb02e42e-SNAPSHOT and Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

I used the CreateSequenceDictionary on my .fasta genome:

java -Xmx2g -jar picard.jar \
CreateSequenceDictionary \
R=.fasta \
O=.dict \

This seemed to work, and made a .dict file. However, when I run CollectMultipleMetrics to get information on my RNAseq alignments done by STAR 2.5.0c:

java -Xmx2g -jar picard.jar \
CollectMultipleMetrics \
R=.fasta \
I=.bam \
O= \
PROGRAM=null \
PROGRAM=CollectAlignmentSummaryMetrics \
PROGRAM=QualityScoreDistribution \
PROGRAM=CollectGcBiasMetrics \
PROGRAM=MeanQualityByCycle \
PROGRAM=CollectInsertSizeMetrics \

There is an exception: "Exception in thread "main" htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: Sequence dictionaries are not the same size (6671, 242)"

I'd love any help on understanding the problem and how to fix it. Let me know if I can provide any other useful information.

Complete CollectMultipleMetrics output:

14:00:31.203 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file://sw/picard/build/libs/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Jul 01 14:00:31 GMT-05:00 2018] CollectMultipleMetrics INPUT=.bam OUTPUT= PROGRAM=[CollectAlignmentSummaryMetrics, QualityScoreDistribution, CollectGcBiasMetrics, MeanQualityByCycle, CollectInsertSizeMetrics] REFERENCE_SEQUENCE=.fasta ASSUME_SORTED=true STOP_AFTER=0 METRIC_ACCUMULATION_LEVEL=[ALL_READS] INCLUDE_UNPAIRED=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Sun Jul 01 14:00:31 GMT-05:00 2018] Executing as on Linux 3.10.0-693.21.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.7-1-gb02e42e-SNAPSHOT
[Sun Jul 01 14:00:31 GMT-05:00 2018] picard.analysis.CollectMultipleMetrics done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: Sequence dictionaries are not the same size (6671, 242)
at htsjdk.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:237)
at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:320)
at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:306)
at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:107)
at picard.analysis.CollectMultipleMetrics.doWork(CollectMultipleMetrics.java:426)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

↧

Can you give some suggestions on running gatk4-germline-joint-discovery pipeline on 799 human WES sa

July 1, 2018, 7:53 pm

≫ Next: running cromwell with google cloud call VM with external IP address

≪ Previous: How do I fix the issue "Sequence dictionaries are not the same size (6671, 242)"

Hi, GATK team.
I have a cohort of 799 human WES samples, and have generated the g.vcf file. Now I want to run the germline-joint-discovery pipeline here.

As my local cluster is running grid engine, so I run the cromwell using this sge.conf file below. The cmd is java -Dconfig.file=$cromwell/sge.conf -jar $cromwell/cromwell-32.jar server >cromwell.log 2>&1

include required(classpath("application"))

system {
  input-read-limits {
    lines = 100000000
  }
}
backend {
  default = SGE

  providers {
    SGE {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        concurrent-job-limit = 50

        runtime-attributes = """
        Int cpu = 1
        Float? memory_gb
        String? sge_queue
        String? sge_project
        String? docker
        String? docker_user
        """

        submit = """
        qsub \
        -terse \
        -V \
        -b y \
        -N ${job_name} \
        -wd ${cwd} \
        -o ${out} \
        -e ${err} \
        -pe mpi ${cpu} \
        ${"-l mem_free=" + memory_gb + "g"} \
        ${"-q " + sge_queue} \
        ${"-P " + sge_project} \
        /usr/bin/env bash ${script}
        """

        job-id-regex = "(\\d+)"

        kill = "qdel ${job_id}"
        check-alive = "qstat -j ${job_id}"
      }
    }
  }
}

I modify the input-read-limits because my bed file is the agilent SureSelect human exome v6 bed file, which contains 243190 intervals, about 46M size.

I submit the job with the proper wdl file and input file. Everything seems ok.

It run the task DynamicallyCombineIntervals very quickly. But it takes about half an hour before I see the folder call-ImportGVCFs. I suppose the task call-ImportGVCFs will submit all intervals. In my case, DynamicallyCombineIntervals output contain 239964 intervals, it will submit 239964 jobs to sge. Actually, it submit very slowly. It takes about 4-5 min on average to submit a job, but the ImportGVCFs task uses only 1-2 min. Sometimes, it can submit dozens of jobs. Sometimes, it only submit 1 or 2 jobs. Besides, the shard number seems random. From 2018-07-01 14:11 to 2018-07-02 00:57, it only finish 2750 intervals ImportGVCFs task. Finally, the cromwell service give GC overhead limit exceeded error and exit.

I think the GC error occurs because the machine I run cromwell service on has low memory, about 32G. I will change to another machine with 188G memory to have a try.

As this is my first time to run GATK using wdl and cromwell on so many samples, I have no idea why it submit jobs to sge so slowly. Is it normal? Do you have some good suggestions?

The attach file is my workflow log. I provide the DynamicallyCombineIntervals out intervals file in the inputs file, so you can not see task DynamicallyCombineIntervals in the log.

↧

running cromwell with google cloud call VM with external IP address

July 1, 2018, 8:12 pm

≫ Next: MuTect2 beta --germline_resource for build b37

≪ Previous: Can you give some suggestions on running gatk4-germline-joint-discovery pipeline on 799 human WES sa

Hi,

I tried running the five dollar pipeline with WGS on google cloud and saw that it is always creating VM instances with external IP.

Is there a way in cromwell runtime to control the VM instance configuration? Or other methods to restrict google VM with only internal IPs?

↧

MuTect2 beta --germline_resource for build b37

July 24, 2017, 2:16 pm

≫ Next: GenotypeGVCFs liftOver problem?

≪ Previous: running cromwell with google cloud call VM with external IP address

Hi - I'm looking to run MuTect2 beta using the --germline_resource option. However, I've consistently used the b37 genome build throughout my analaysis, while the suggested resource (gnomad) appears to only be available for the hg19 build. So I'm wondering whether I should
1. Go ahead and use the gnomad hg19 files, despite the fact that my whole analysis has used the b37 build?
2. Lift over my existing gnomad vcfs from hg19 to b37? (In this case, I'd need an hg19tob37 liftOver file - I can't find one anywhere).
3. Use another germline resource?

I wonder which option you would recommend? Many thanks for your time.

↧

GenotypeGVCFs liftOver problem?

April 25, 2018, 2:12 am

≫ Next: how to let GATK support Coordinate Sorted Index (CSI) format of bam file

≪ Previous: MuTect2 beta --germline_resource for build b37

Hi,
I am trying to run GenotypeGVCFs on my combined gvcf file 'cohort.g.vcf' and i am getting the following error:

A USER ERROR has occurred: Given reference file does not have data at the requested contig(ChroChromosome01)!

I don't understand why this is happening as i used the same reference genome for all previous steps (such as variant calling,CombineGVCFs). Do i need to liftOver genomic coordinates on the VCF file based on the reference genome? I am trying to use flo to get the chain file for my reference genome and then use that chain file as input to picard tool liftOverVCF or liftOvervariants?. I'm just not sure why i have to do this as the vcf should match the reference genome coordinates as i am using the same reference throughout. Please help

↧

how to let GATK support Coordinate Sorted Index (CSI) format of bam file

October 11, 2015, 8:54 pm

≫ Next: Genotype Refinement workflow for germline short variants

≪ Previous: GenotypeGVCFs liftOver problem?

Since samtools 1.0, csi indexing format of bam file is specifically used for any other organisms with long chromosomes ( > 536Mb). Could you help me figure out how to let GATK SplitNCigarReads support cis indexing file?

Thanks so much.

↧

Genotype Refinement workflow for germline short variants

December 28, 2017, 3:18 pm

≫ Next: Specify more than one snpmask with Fasta Alternate Reference Maker

≪ Previous: how to let GATK support Coordinate Sorted Index (CSI) format of bam file

Overview
Summary of workflow steps
Output annotations
Example
More information about priors
Mathematical details

1. Overview

The core GATK Best Practices workflow has historically focused on variant discovery --that is, the existence of genomic variants in one or more samples in a cohorts-- and consistently delivers high quality results when applied appropriately. However, we know that the quality of the individual genotype calls coming out of the variant callers can vary widely based on the quality of the BAM data for each sample. The goal of the Genotype Refinement workflow is to use additional data to improve the accuracy of genotype calls and to filter genotype calls that are not reliable enough for downstream analysis. In this sense it serves as an optional extension of the variant calling workflow, intended for researchers whose work requires high-quality identification of individual genotypes.

While every study can benefit from increased data accuracy, this workflow is especially useful for analyses that are concerned with how many copies of each variant an individual has (e.g. in the case of loss of function) or with the transmission (or de novo origin) of a variant in a family.

If a “gold standard” dataset for SNPs is available, that can be used as a very powerful set of priors on the genotype likelihoods in your data. For analyses involving families, a pedigree file describing the relatedness of the trios in your study will provide another source of supplemental information. If neither of these applies to your data, the samples in the dataset itself can provide some degree of genotype refinement (see section 5 below for details).

After running the Genotype Refinement workflow, several new annotations will be added to the INFO and FORMAT fields of your variants (see below). Note that GQ fields will be updated, and genotype calls may be modified. However, the Phred-scaled genotype likelihoods (PLs) which indicate the original genotype call (the genotype candidate with PL=0) will remain untouched. Any analysis that made use of the PLs will produce the same results as before.

2. Summary of workflow steps

Input

Begin with recalibrated variants from VQSR at the end of the germline short variants pipeline. The filters applied by VQSR will be carried through the Genotype Refinement workflow.

Step 1: Derive posterior probabilities of genotypes

Tool used: CalculateGenotypePosteriors

Using the Phred-scaled genotype likelihoods (PLs) for each sample, prior probabilities for a sample taking on a HomRef, Het, or HomVar genotype are applied to derive the posterior probabilities of the sample taking on each of those genotypes. A sample’s PLs were calculated by HaplotypeCaller using only the reads for that sample. By introducing additional data like the allele counts from the 1000 Genomes project and the PLs for other individuals in the sample’s pedigree trio, those estimates of genotype likelihood can be improved based on what is known about the variation of other individuals.

SNP calls from the 1000 Genomes project capture the vast majority of variation across most human populations and can provide very strong priors in many cases. At sites where most of the 1000 Genomes samples are homozygous variant with respect to the reference genome, the probability of a sample being analyzed of also being homozygous variant is very high.

For a sample for which both parent genotypes are available, the child’s genotype can be supported or invalidated by the parents’ genotypes based on Mendel’s laws of allele transmission. Even the confidence of the parents’ genotypes can be recalibrated, such as in cases where the genotypes output by HaplotypeCaller are apparent Mendelian violations.

Step 2: Filter low quality genotypes

Tool used: VariantFiltration

After the posterior probabilities are calculated for each sample at each variant site, genotypes with GQ < 20 based on the posteriors are filtered out. GQ20 is widely accepted as a good threshold for genotype accuracy, indicating that there is a 99% chance that the genotype in question is correct. Tagging those low quality genotypes indicates to researchers that these genotypes may not be suitable for downstream analysis. However, as with the VQSR, a filter tag is applied, but the data is not removed from the VCF.

Step 3: Annotate possible de novo mutations

Tool used: VariantAnnotator

Using the posterior genotype probabilities, possible de novo mutations are tagged. Low confidence de novos have child GQ >= 10 and AC < 4 or AF < 0.1%, whichever is more stringent for the number of samples in the dataset. High confidence de novo sites have all trio sample GQs >= 20 with the same AC/AF criterion.

Step 4: Functional annotation of possible biological effects

Tool options: Funcotator (experimental)

Especially in the case of de novo mutation detection, analysis can benefit from the functional annotation of variants to restrict variants to exons and surrounding regulatory regions. Funcotator is a new tool that is currently still in development. If you would prefer to use a more mature tool, we recommend you look into SnpEff or Oncotator, but note that these are not GATK tools so we do not provide support for them.

3. Output annotations

The Genotype Refinement workflow adds several new info- and format-level annotations to each variant. GQ fields will be updated, and genotypes calculated to be highly likely to be incorrect will be changed. The Phred-scaled genotype likelihoods (PLs) carry through the pipeline without being changed. In this way, PLs can be used to derive the original genotypes in cases where sample genotypes were changed.

Population Priors

New INFO field annotation PG is a vector of the Phred-scaled prior probabilities of a sample at that site being HomRef, Het, and HomVar. These priors are based on the input samples themselves along with data from the supporting samples if the variant in question overlaps another in the supporting dataset.

Phred-Scaled Posterior Probability

New FORMAT field annotation PP is the Phred-scaled posterior probability of the sample taking on each genotype for the given variant context alleles. The PPs represent a better calibrated estimate of genotype probabilities than the PLs are recommended for use in further analyses instead of the PLs.

Genotype Quality

Current FORMAT field annotation GQ is updated based on the PPs. The calculation is the same as for GQ based on PLs.

Joint Trio Likelihood

New FORMAT field annotation JL is the Phred-scaled joint likelihood of the posterior genotypes for the trio being incorrect. This calculation is based on the PLs produced by HaplotypeCaller (before application of priors), but the genotypes used come from the posteriors. The goal of this annotation is to be used in combination with JP to evaluate the improvement in the overall confidence in the trio’s genotypes after applying CalculateGenotypePosteriors. The calculation of the joint likelihood is given as:

$$ -10*\log ( 1-GL_{mother}[\text{Posterior mother GT}] * GL_{father}[\text{Posterior father GT}] * GL_{child}[\text{Posterior child GT}] ) $$

where the GLs are the genotype likelihoods in [0, 1] probability space.

Joint Trio Posterior

New FORMAT field annotation JP is the Phred-scaled posterior probability of the output posterior genotypes for the three samples being incorrect. The calculation of the joint posterior is given as:

$$ -10*\log (1-GP_{mother}[\text{Posterior mother GT}] * GP_{father}[\text{Posterior father GT}] * GP_{child}[\text{Posterior child GT}] )$$

where the GPs are the genotype posteriors in [0, 1] probability space.

Low Genotype Quality

New FORMAT field filter lowGQ indicates samples with posterior GQ less than 20. Filtered samples tagged with lowGQ are not recommended for use in downstream analyses.

High and Low Confidence De Novo

New INFO field annotation for sites at which at least one family has a possible de novo mutation. Following the annotation tag is a list of the children with de novo mutations. High and low confidence are output separately.

4. Example

Before:

1       1226231 rs13306638      G       A       167563.16       PASS    AC=2;AF=0.333;AN=6;…        GT:AD:DP:GQ:PL  0/0:11,0:11:0:0,0,249   0/0:10,0:10:24:0,24,360 1/1:0,18:18:60:889,60,0

After:

1       1226231 rs13306638      G       A       167563.16       PASS    AC=3;AF=0.500;AN=6;…PG=0,8,22;…    GT:AD:DP:GQ:JL:JP:PL:PP 0/1:11,0:11:49:2:24:0,0,249:49,0,287    0/0:10,0:10:32:2:24:0,24,360:0,32,439   1/1:0,18:18:43:2:24:889,60,0:867,43,0

The original call for the child (first sample) was HomRef with GQ0. However, given that, with high confidence, one parent is HomRef and one is HomVar, we expect the child to be heterozygous at this site. After family priors are applied, the child’s genotype is corrected and its GQ is increased from 0 to 49. Based on the allele frequency from 1000 Genomes for this site, the somewhat weaker population priors favor a HomRef call (PG=0,8,22). The combined effect of family and population priors still favors a Het call for the child.

The joint likelihood for this trio at this site is two, indicating that the genotype for one of the samples may have been changed. Specifically a low JL indicates that posterior genotype for at least one of the samples was not the most likely as predicted by the PLs. The joint posterior value for the trio is 24, which indicates that the GQ values based on the posteriors for all of the samples are at least 24. (See above for a more complete description of JL and JP.)

5. More information about priors

The Genotype Refinement Pipeline uses Bayes’s Rule to combine independent data with the genotype likelihoods derived from HaplotypeCaller, producing more accurate and confident genotype posterior probabilities. Different sites will have different combinations of priors applied based on the overlap of each site with external, supporting SNP calls and on the availability of genotype calls for the samples in each trio.

Input-derived Population Priors

If the input VCF contains at least 10 samples, then population priors will be calculated based on the discovered allele count for every called variant.

Supporting Population Priors

Priors derived from supporting SNP calls can only be applied at sites where the supporting calls overlap with called variants in the input VCF. The values of these priors vary based on the called reference and alternate allele counts in the supporting VCF. Higher allele counts (for ref or alt) yield stronger priors.

Family Priors

The strongest family priors occur at sites where the called trio genotype configuration is a Mendelian violation. In such a case, each Mendelian violation configuration is penalized by a de novo mutation probability (currently 10-6). Confidence also propagates through a trio. For example, two GQ60 HomRef parents can substantially boost a low GQ HomRef child and a GQ60 HomRef child and parent can improve the GQ of the second parent. Application of family priors requires the child to be called at the site in question. If one parent has a no-call genotype, priors can still be applied, but the potential for confidence improvement is not as great as in the 3-sample case.

Caveats

Right now family priors can only be applied to biallelic variants and population priors can only be applied to SNPs. Family priors only work for trios.

6. Mathematical details

Note that family priors are calculated and applied before population priors. The opposite ordering would result in overly strong population priors because they are applied to the child and parents and then compounded when the trio likelihoods are multiplied together.

Review of Bayes’s Rule

HaplotypeCaller outputs the likelihoods of observing the read data given that the genotype is actually HomRef, Het, and HomVar. To convert these quantities to the probability of the genotype given the read data, we can use Bayes’s Rule. Bayes’s Rule dictates that the probability of a parameter given observed data is equal to the likelihood of the observations given the parameter multiplied by the prior probability that the parameter takes on the value of interest, normalized by the prior times likelihood for all parameter values:

$$ P(\theta|Obs) = \frac{P(Obs|\theta)P(\theta)}{\sum_{\theta} P(Obs|\theta)P(\theta)} $$

In the best practices pipeline, we interpret the genotype likelihoods as probabilities by implicitly converting the genotype likelihoods to genotype probabilities using non-informative or flat priors, for which each genotype has the same prior probability. However, in the Genotype Refinement Pipeline we use independent data such as the genotypes of the other samples in the dataset, the genotypes in a “gold standard” dataset, or the genotypes of the other samples in a family to construct more informative priors and derive better posterior probability estimates.

Calculation of Population Priors

Given a set of samples in addition to the sample of interest (ideally non-related, but from the same ethnic population), we can derive the prior probability of the genotype of the sample of interest by modeling the sample’s alleles as two independent draws from a pool consisting of the set of all the supplemental samples’ alleles. (This follows rather naturally from the Hardy-Weinberg assumptions.) Specifically, this prior probability will take the form of a multinomial Dirichlet distribution parameterized by the allele counts of each allele in the supplemental population. In the biallelic case the priors can be calculated as follows:

$$ P(GT = HomRef) = \dbinom{2}{0} \ln \frac{\Gamma(nSamples)\Gamma(RefCount + 2)}{\Gamma(nSamples + 2)\Gamma(RefCount)} $$

$$ P(GT = Het) = \dbinom{2}{1} \ln \frac{\Gamma(nSamples)\Gamma(RefCount + 1)\Gamma(AltCount + 1)}{\Gamma(nSamples + 2)\Gamma(RefCount)\Gamma(AltCount)} $$

$$ P(GT = HomVar) = \dbinom{2}{2} \ln \frac{\Gamma(nSamples)\Gamma(AltCount + 2)}{\Gamma(nSamples + 2)\Gamma(AltCount)} $$

where Γ is the Gamma function, an extension of the factorial function.

The prior genotype probabilities based on this distribution scale intuitively with number of samples. For example, a set of 10 samples, 9 of which are HomRef yield a prior probability of another sample being HomRef with about 90% probability whereas a set of 50 samples, 49 of which are HomRef yield a 97% probability of another sample being HomRef.

Calculation of Family Priors

Given a genotype configuration for a given mother, father, and child trio, we set the prior probability of that genotype configuration as follows:

$$ P(G_M,G_F,G_C) = P(\vec{G}) \cases{ 1-10\mu-2\mu^2 & no MV \cr \mu & 1 MV \cr \mu^2 & 2 MVs} $$

where the 10 configurations with a single Mendelian violation are penalized by the de novo mutation probability μ and the two configurations with two Mendelian violations by μ^2. The remaining configurations are considered valid and are assigned the remaining probability to sum to one.

This prior is applied to the joint genotype combination of the three samples in the trio. To find the posterior for any single sample, we marginalize over the remaining two samples as shown in the example below to find the posterior probability of the child having a HomRef genotype:

$$ P(G_C = HomRef | \vec{D}) = \frac{L_C(G_C = HomRef) \sum_{G_F,G_M} L_F(G_F)L_M(G_M)P(\vec{G})}{\sum_{\vec{H}}P(\vec{D}|\vec{H})P(\vec{H})} $$

This quantity P(Gc|D) is calculated for each genotype, then the resulting vector is Phred-scaled and output as the Phred-scaled posterior probabilities (PPs).

↧

Specify more than one snpmask with Fasta Alternate Reference Maker

July 2, 2018, 7:24 am

≫ Next: Downsampling

≪ Previous: Genotype Refinement workflow for germline short variants

Hi there,

I am hoping to use multiple snpmask vcfs with fasta alternate reference maker. The description of fasta alternate reference maker says "...it allows for one or more "snpmask" VCFs to set overlapping bases to 'N'."

However, I haven't been able to figure out how to specify multiple snpmasks.

I have tried

--snpmask mask1.vcf mask2.vcf

and

--snpmask mask1.vcf --snpmask mask2.vcf

but both give the error:

Argument 'snpmask' has too many values.

Both mask1.vcf and mask2.vcf work fine on their own (when only one snpmask specified).

I am using GATK v3.8-0.

My apologies for this very basic question! I've read the documentation carefully and searched for "--snpmask" in the forums, but am still stuck - any pointers would be appreciated!

Thanks!
Elaine

↧

Downsampling

August 10, 2012, 10:16 pm

≫ Next: UV mutagenesis hotspots

≪ Previous: Specify more than one snpmask with Fasta Alternate Reference Maker

Downsampling is a process by which read depth is reduced, either at a particular position or within a region.

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Note that there is also a proportional "downsample to fraction" mechanism that is mostly intended for testing the effect of different overall coverage means on analysis results.

See below for details of how this is implemented and controlled in GATK.

1. Downsampling to a target coverage

The principle of this downsampling type is to downsample reads to a given capping threshold coverage. Its purpose is to get rid of excessive coverage, because above a certain depth, having additional data is not informative and imposes unreasonable computational costs. The downsampling process takes two different forms depending on the type of analysis it is used with. For locus-based traversals (LocusWalkers like UnifiedGenotyper and ActiveRegionWalkers like HaplotypeCaller), downsample_to_coverage controls the maximum depth of coverage at each locus. For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position. For ReadWalkers you will typically need to use much lower dcov values than you would with LocusWalkers to see an effect. Note that this downsampling option does not produce an unbiased random sampling from all available reads at each locus: instead, the primary goal of the to-coverage downsampler is to maintain an even representation of reads from all alignment start positions when removing excess coverage. For a truly unbiased random sampling of reads, use -dfrac instead. Also note that the coverage target is an approximate goal that is not guaranteed to be met exactly: the downsampling algorithm will under some circumstances retain slightly more or less coverage than requested.

Defaults

The GATK's default downsampler (invoked by -dcov) exhibits the following properties:

The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.
The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.
The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.

Customizing

From the command line:

To disable the downsampler, specify -dt NONE.
To change the default coverage per-sample, specify the desired coverage to the -dcov option.

To modify the walker's default behavior:

Add the @Downsample interface to the top of your walker. Override the downsampling type by changing the by=<value>. Override the downsampling depth by changing the toCoverage=<value>.

Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:

For each sample:

Select reads with the next alignment start.
While the number of existing reads + the number of incoming reads is greater than the target sample size:

Now walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.

If we have n slots available where n is >= 1, randomly select n of the incoming reads and add them to the pileup.
Otherwise, we have zero slots available. Choose the read from the existing pileup with the least alignment start. Throw it out and add one randomly selected read from the new pileup.

2. Downsampling to a fraction of the coverage

Reads will be downsampled so the specified fraction remains; e.g. if you specify -dfrac 0.25, three-quarters of the reads will be removed, and the remaining one quarter will be used in the analysis. This method of downsampling is truly unbiased and random. It is typically used to simulate the effect of generating different amounts of sequence data for a given sample. For example, you can use this in a pilot experiment to evaluate how much target coverage you need to aim for in order to obtain enough coverage in all loci of interest.

↧

UV mutagenesis hotspots

July 2, 2018, 9:27 am

≫ Next: How does ApplyBQSR handle unmapped reads?

≪ Previous: Downsampling

How can I use GATK to identify mutation hotspots on a genome? What tools do I use? I have an organism that I UV mutagenized and resequenced and now I want to identify regions on the genome that were more prone to mutation.

↧

How does ApplyBQSR handle unmapped reads?

July 2, 2018, 3:50 pm

≫ Next: GATK SNP calling contains all point mutations different from the reference genome. True or false?

≪ Previous: UV mutagenesis hotspots

Dear GATK team,

I am working with WGS BAM files provided by a major consortium. They applied a BWA + GATK pipeline. The quality scores were recalibrated, but the resulting BAM files do not contain an OQ tag giving the original base qualities.

What does GATK do with the base quality scores of unmapped reads in the ApplyBQSR (GATK4) or PrintReads (GATK3) tool?

My analysis involves a subset of reads drawn from mapped reads and unmapped reads. Therefore, I am concerned that this subset will contain a mixture of quality scores - recalibrated for some reads, but not for others.

Thank you,
Matthew

↧

GATK SNP calling contains all point mutations different from the reference genome. True or false?

July 2, 2018, 6:40 pm

≫ Next: Do you recommend Windows 7 or 10 for installing and running GATK4 in a Docker Container?

≪ Previous: How does ApplyBQSR handle unmapped reads?

I want to knows the difference between SNP and SNV and how GATK source code operate ？

↧

Do you recommend Windows 7 or 10 for installing and running GATK4 in a Docker Container?

July 2, 2018, 8:25 pm

≫ Next: A Sanger verified variant is not called by HaplotypCaller

≪ Previous: GATK SNP calling contains all point mutations different from the reference genome. True or false?

Hello GATK team,
We would like to download and run a Docker container with the latest GATK4 software (or the version used for the GCC-BOSC 2018 Workshop) on a Windows 7 PC with the specifications stated below. Do you recommend we upgrade the OS from Windows 7 to Win 10 Pro 64, which we can easily request our IT professionals to do?

Dell Precision Tower 5810 XCTO Base
Intel Xeon Processor E5-1603 v3 (Four Core, 10MB Cache, 2.8GHz)
64GB (4x16GB) 2400MHz DDR4 RDIMM ECC
1TB 3.5inch Serial ATA (7,200 Rpm) Hard Drive, FPWS
4TB 3.5inch Serial ATA (5.400 Rpm) Hard Drive

Thanks in advance! (my very first post!)

↧

A Sanger verified variant is not called by HaplotypCaller

May 14, 2018, 1:28 am

≫ Next: Calling invaiant sites with the new pipeline of HaplotypeCaller

≪ Previous: Do you recommend Windows 7 or 10 for installing and running GATK4 in a Docker Container?

Hello,

We have a trio which all three members were whole exome sequenced. We processed the samples by the standard pipeline (bwa mem alignment, picard mark duplicates and gatk joint variant calling by HaplotypeCaller). The version of GATK we used is v3.8.1. We can see there is a variant both from mother and the proband from IGV (see here) and the variant is Sanger verified both from proband and the mother. The position is well covered in both proband bam(94x from vcf FORMAT) and mother bam (207x from vcf FORMAT). However, in the variant file, the variant is called as de novo, so there is no variant called from mother.

17 71189251 . C G 1581.68 . AC=1;AF=0.167;AN=6;BaseQRankSum=1.59;ClippingRankSum=0.00;DP=343;ExcessHet=3.0103;FS=2.970;MLEAC=1;MLEAF=0.167;MQ=60.00;MQRankSum=0.00;QD=16.83;ReadPosRankSum=1.27;SOR=1.095 GT:AD:DP:GQ:PGT:PID:PL 0/1:46,48:94:99:0|1:71189182_A_G:1610,0,1851 0/0:41,0:41:99:.:.:0,99,1485 0/0:207,0:207:0:.:.:0,0,117

I have both bam file and bamoutput file from HaplotypeCaller sliced on this variant. I could upload if you tell me where to upload.

Could you help us to check what is the reason for the caller to miss this variant? Thanks.

↧

Calling invaiant sites with the new pipeline of HaplotypeCaller

May 12, 2014, 1:10 am

≫ Next: All annotations in BP_RESOLUTION mode

≪ Previous: A Sanger verified variant is not called by HaplotypCaller

Hello,

I am using the new pipeline of haplotype caller in order to obtain a vcf file containing both variant and invariant sites.

For each individual, I called variant and invariant sites :

java -Xmx300g -jar GenomeAnalysisTK.jar \
     -T HaplotypeCaller \
     -R ref.fasta \
     -I ${INPUT}.bam \
     --genotyping_mode DISCOVERY 
     -stand_emit_conf 0 \
     -stand_call_conf 0 \
     -o ${INPUT}\_VC.vcf \
     --emitRefConfidence BP_RESOLUTION  \
     --variant_index_type LINEAR \
     --variant_index_parameter 128000 \
     -nct 16

In the vcf that I obtain, I indeed have every position.
The problem is that he INFO and QUAL fileds are empty (.) if the site is non variant.

KE332545.1      44      .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:13,0:13:39:0,39,503
KE332545.1      45      .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:13,0:13:39:0,39,518
KE332545.1      46      .       C       T,<NON_REF>     0       .       BaseQRankSum=-2.270;ClippingRankSum=-0.691;DP=17;MLEAC=0,0;MLEAF=0.00,0.00;MQ=38.98;MQ0=0;MQRankSum=0.099;ReadPosRankSum=0.493  GT:AD:DP:GQ:PL:SB      0/0:11,2,0:13:3:0,3,379,33,385,414:0,0,0,0
KE332545.1      47      .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:13,0:13:39:0,39,515
KE332545.1      48      .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:13,0:13:39:0,39,540
KE332545.1      49      .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:13,0:13:39:0,39,563

But I also wanted this information in order to use my filtering pipeline on those invariant sites as well !
Any solution ?

Thanks !

Muriel

↧

All annotations in BP_RESOLUTION mode

June 19, 2015, 12:20 pm

≫ Next: Invalid or corrupt jarfile

≪ Previous: Calling invaiant sites with the new pipeline of HaplotypeCaller

Hello,

I was wondering if there is a way to output all annotations for all sites when running HaplotypeCaller with BP_RESOLUTION. Currently it outputs all annotations for only called variants. Thanks in advance.

↧

Invalid or corrupt jarfile

January 23, 2018, 11:52 am

≫ Next: CombineGVCFs takes forever if there are no calls in one g.vcf

≪ Previous: All annotations in BP_RESOLUTION mode

When I run

./gatk --help

it seems to be working fine. However, running anything else such as

./gatk --list

produces an error:

Error: Invalid or corrupt jarfile /path/to/gatk/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar

What's going on? Sorry this might be noob question.

↧

CombineGVCFs takes forever if there are no calls in one g.vcf

July 3, 2018, 6:17 am

≫ Next: Error while running gatk_collect_coverage

≪ Previous: Invalid or corrupt jarfile

Hi,

we have one sample which produced only ~2000 mapped reads and therefore got no calls at all from HaplotypeCaller. Since we are doing the whole pipeline at once, we merged all per sample g.vcf of that run into one to do GenotypeGVCFs. In most runs this taks a few hours. In this case it took 2.5 weeks. As I removed the g.vcf of this sample it was done in 6 hours.

Why does CombineGVCFs takes so much longer if there is one file without calls? I attached a g.vcf of chromosme 2 as a txt file (since g.vcf was not possibe).

Looking forward to your ideas.

Best,
Daniel

↧

Error while running gatk_collect_coverage

July 3, 2018, 7:04 am

≫ Next: A USER ERROR has occurred: 'CalculateTargetCoverage' is not a valid command.

≪ Previous: CombineGVCFs takes forever if there are no calls in one g.vcf

Hi,

I keep getting the following error when I try to run the gatk_collect_coverage method:

ErrorReport(rawls,http error calling uri https://cromwell1-int-lb.dsde-prod.broadinstitute.org:443/api/workflows/v1/batch,Some(400 Bad Request),List(),List(),None)

https://portal.firecloud.org/#workspaces/broad-firecloud-wuclonal/SandBox_Analysis/monitor/db7eb98b-a6b5-4e09-8b43-13bfd48aaaa9

What might be causing this?

Thanks

↧

Contents

1. Overview

2. Summary of workflow steps

Input

Step 1: Derive posterior probabilities of genotypes

Tool used: CalculateGenotypePosteriors

Step 2: Filter low quality genotypes

Tool used: VariantFiltration

Step 3: Annotate possible de novo mutations

Tool used: VariantAnnotator

Step 4: Functional annotation of possible biological effects

Tool options: Funcotator (experimental)

3. Output annotations

Population Priors

Phred-Scaled Posterior Probability

Genotype Quality

Joint Trio Likelihood

Joint Trio Posterior

Low Genotype Quality

High and Low Confidence De Novo

4. Example

5. More information about priors

Input-derived Population Priors

Supporting Population Priors

Family Priors

Caveats

6. Mathematical details

Review of Bayes’s Rule

Calculation of Population Priors

Calculation of Family Priors

Downsampling is a process by which read depth is reduced, either at a particular position or within a region.

1. Downsampling to a target coverage

Defaults

Customizing

Algorithm details

2. Downsampling to a fraction of the coverage