Does this mode still work on **GATK4** with HaplotypeCaller and Mutect2? If so, is there a recommended value for ip? Thanks a lot.
Redo a variant calling run on a VCF
Hi, is there any way to solve the data skew problem of HaplotypeCallerSpark?
What does Mutect2FilteringStats.tsv stand for?
When I ran GATK, a file named Mutect2FilteringStats.tsv was produced.
Its content looks like the following:
How should I understand this? Thanks a lot.
Urgent!!! HaplotypeCaller argument --bam-output does not keep a variant supported by 228 reads
This is germline calling,
and this site is very important in chemotherapy; we need an accurate genotype at this site.
how to run CalculateGenotypePosteriors on a case-control cohort
One usage example is to refine genotypes based on the discovered allele frequency in an input VCF containing many samples:
gatk --java-options "-Xmx4g" CalculateGenotypePosteriors \
-V multisample_input.vcf.gz \
-O output.vcf.gz
Now, if I have a case-control cohort for an autoimmune disease, I wonder whether I should run the above step separately for cases and controls, or put them all into one VCF file? (The two alternatives I'm weighing are sketched below.)
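Concretely, these are the two alternatives I have in mind; the file names are hypothetical placeholders of mine:
# Option A: refine all samples together in one multi-sample VCF
gatk --java-options "-Xmx4g" CalculateGenotypePosteriors \
-V cases_and_controls.vcf.gz \
-O refined.all.vcf.gz
# Option B: refine cases and controls separately
gatk --java-options "-Xmx4g" CalculateGenotypePosteriors \
-V cases_only.vcf.gz \
-O refined.cases.vcf.gz
gatk --java-options "-Xmx4g" CalculateGenotypePosteriors \
-V controls_only.vcf.gz \
-O refined.controls.vcf.gz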
Thanks for any insight!
FASTQ read cleaning
In the Best Practices for short variant discovery from DNA sequencing, there is no mention of read cleaning (removing adapters, low-quality reads, short reads, etc.). I wonder whether that is because such a step is not necessary, or because the reads are assumed to be already cleaned?
If it's not necessary, is that because the bad reads will be filtered out after the mapping step?
VQSR on SNP and Indel
My impression is that the old recommendation from the GATK team was to run VQSR on snp.vcf and indel.vcf separately and in parallel. But the current pipeline and example runs them sequentially on all.variants.vcf (a sketch of my understanding of the sequential approach is below).
I wonder if the new approach is meant to keep variants that are neither SNPs nor indels?
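For reference, here is how I understand the sequential approach; this is only a rough sketch with placeholder file names and abbreviated resource/annotation arguments of my own, using the newer GATK4 --resource syntax:
# Pass 1: recalibrate SNPs on the combined callset
gatk VariantRecalibrator \
-R reference.fasta \
-V all.variants.vcf \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
-an QD -an FS -an MQ \
-mode SNP \
-O snps.recal \
--tranches-file snps.tranches
gatk ApplyVQSR \
-R reference.fasta \
-V all.variants.vcf \
--recal-file snps.recal \
--tranches-file snps.tranches \
--truth-sensitivity-filter-level 99.7 \
-mode SNP \
-O snps.recalibrated.vcf
# Pass 2: recalibrate indels on the SNP-recalibrated output
gatk VariantRecalibrator \
-R reference.fasta \
-V snps.recalibrated.vcf \
--resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf.gz \
-an QD -an FS \
-mode INDEL \
-O indels.recal \
--tranches-file indels.tranches
gatk ApplyVQSR \
-R reference.fasta \
-V snps.recalibrated.vcf \
--recal-file indels.recal \
--tranches-file indels.tranches \
--truth-sensitivity-filter-level 99.7 \
-mode INDEL \
-O all.recalibrated.vcf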
Haplotype phasing somatic mutations from MuTect2 using read-backed phasing and parental data
To whom it may concern,
I have both normal and tumour samples, and I also have the parental data (both mother and father) for the patient. I hope to first haplotype-phase the SNPs and indels from HaplotypeCaller using PhaseByTransmission. Thereafter, I want to haplotype-phase the somatic mutations from MuTect2 using Read-Backed Phasing.
I wanted to ask whether the Read-Backed Phasing method will consider both the SNPs and indels encompassed within a read, and whether it will also use the information from PhaseByTransmission when phasing the somatic mutations.
Regards,
Sangjin Lee
GVCF or final VCF used to construct an accession-specific genome
How do I pre-assign the multiple regions I want to process in each pipeline?
Dear Genome STRiP users,
I intend to process part of a chromosome in each pipeline rather than the whole sequence. I know there is an -L flag in SVPreprocess, although it is not listed in the documentation (http://software.broadinstitute.org/software/genomestrip/org_broadinstitute_sv_qscript_SVPreprocess.html), but I do not know how to pre-assign multiple regions. More precisely, if I intend to process regions 1-20000 and 25000-40000 of chr16, how should I set -L?
Similarly, I am not sure if I can pre-assign the regions in SVDiscovery and SVGenotyper; if so, how can I pre-assign the multiple regions in these two pipelines?
Besides, in SVCNVDiscovery, I know there is a -intervalList flag that takes a .list file specifying the interval list, but how do I set multiple regions there? For example, do I need to put them on different lines, or separate them with commas? (A sketch of what I am imagining follows below.)
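For concreteness, this is the kind of thing I am imagining; the syntax here is purely my guess, not taken from the documentation:
# Guess 1: repeating the -L flag, GATK-style
-L chr16:1-20000 -L chr16:25000-40000
# Guess 2: a regions.list file for -intervalList, one interval per line
chr16:1-20000
chr16:25000-40000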
A further question is that I am not sure if the following two cases are equivalent:
1. Running SVPreprocess with -L chr16:1-50000, and then running SVCNVDiscovery with -intervalList chr16:20000-40000;
2. Running SVPreprocess with -L chr16:20000-40000, and then running SVCNVDiscovery with -intervalList chr16:20000-40000.
May I have your suggestions about these questions? Thank you in advance.
Best regards,
Wusheng
Is the CNQ value a Phred quality score or a confidence percentile?
Dear Genome STRiP users,
I completed SVCNVDiscovery on all the samples in a single batch. Besides the CN, I also got a CNQ for the CN of each sample. Based on "https://gatkforums.broadinstitute.org/gatk/discussion/23345/is-there-any-pre-copy-number-value-in-the-processing#latest", CNQ should be a Phred quality score, but my CNQ values look like this:
sample 1 | sample 2 | sample 3 | sample 4 | sample 5 |
---|---|---|---|---|
53.1 | 99 | 80.2 | 46 | 93.4 |
They look like confidence percentiles, but I am not sure. Could you help me interpret these CNQ values? Thank you very much.
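For reference, my own reading of the Phred scale (which may be off): if CNQ were Phred-scaled, a value Q would correspond to an error probability of 10^(-Q/10), so CNQ = 53.1 would mean roughly a 5-in-a-million chance that the CN call is wrong, with 99 presumably being a cap. Since values between 0 and 99 are consistent with either interpretation, I cannot tell the two apart from the values alone.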
Best regards,
Wusheng
htsjdk.samtools.util.RuntimeIOException: {bqsr_bam_file} has invalid uncompressedLength: -196200583
I'm using GATK 4.0.7.0 and the Best Practices workflow: I first applied BaseRecalibrator to my RNA-seq BAM files, which seems to work fine (the files are not truncated). I then used those BQSR BAM files as input to HaplotypeCaller, but I always get the error htsjdk.samtools.util.RuntimeIOException: {bqsr_bam_file} has invalid uncompressedLength: -196200583
Using GATK jar /gatk-4.0.7.0/gatk-package-4.0.7.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx30G -jar /gatk-4.0.7.0/gatk-package-4.0.7.0-local.jar HaplotypeCaller -R /ref_info/HG19.karyo.fasta -I /raw_vcf_new/sample_bqsr.recal.bam -O /media/big_hd/Cosima/raw_vcf_new/sample_haplocaller0017-17.recal.bam.g.vcf.gz -L /media/big_hd/Cosima/ref_info/Exom_V6.interval_list
16:45:18.831 INFO HaplotypeCaller - Shutting down engine
[March 7, 2019 4:45:18 PM CET] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 13.07 minutes.
Runtime.totalMemory()=2069364736
htsjdk.samtools.util.RuntimeIOException: /media/big_hd/Cosima/raw_vcf_new/sample_bqsr0017-17.recal.bam has invalid uncompressedLength: -196200583
at htsjdk.samtools.util.BlockCompressedInputStream.inflateBlock(BlockCompressedInputStream.java:543)
at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:532)
at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:468)
at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:458)
at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:196)
at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:331)
at java.io.DataInputStream.read(DataInputStream.java:149)
at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:418)
at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:394)
at htsjdk.samtools.util.BinaryCodec.readByteBuffer(BinaryCodec.java:504)
at htsjdk.samtools.util.BinaryCodec.readInt(BinaryCodec.java:515)
at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:263)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:829)
at htsjdk.samtools.BAMFileReader$BAMFileIndexIterator.getNextRecord(BAMFileReader.java:981)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:803)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.BAMFileReader$BAMQueryFilteringIterator.advance(BAMFileReader.java:1034)
at htsjdk.samtools.BAMFileReader$BAMQueryFilteringIterator.next(BAMFileReader.java:1024)
at htsjdk.samtools.BAMFileReader$BAMQueryFilteringIterator.next(BAMFileReader.java:988)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextRecord(SamReaderQueryingIterator.java:114)
at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.next(SamReaderQueryingIterator.java:151)
at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.next(SamReaderQueryingIterator.java:29)
at org.broadinstitute.hellbender.utils.iterators.SAMRecordToReadIterator.next(SAMRecordToReadIterator.java:27)
at org.broadinstitute.hellbender.utils.iterators.SAMRecordToReadIterator.next(SAMRecordToReadIterator.java:13)
at org.broadinstitute.hellbender.utils.iterators.ReadTransformingIterator.next(ReadTransformingIterator.java:42)
at org.broadinstitute.hellbender.utils.iterators.ReadTransformingIterator.next(ReadTransformingIterator.java:14)
at org.broadinstitute.hellbender.utils.iterators.ReadFilteringIterator.loadNextRead(ReadFilteringIterator.java:53)
at org.broadinstitute.hellbender.utils.iterators.ReadFilteringIterator.next(ReadFilteringIterator.java:47)
at org.broadinstitute.hellbender.utils.iterators.ReadFilteringIterator.next(ReadFilteringIterator.java:13)
at org.broadinstitute.hellbender.utils.iterators.ReadTransformingIterator.next(ReadTransformingIterator.java:42)
at org.broadinstitute.hellbender.utils.iterators.ReadTransformingIterator.next(ReadTransformingIterator.java:14)
at org.broadinstitute.hellbender.utils.downsampling.ReadsDownsamplingIterator.fillDownsampledReadsCache(ReadsDownsamplingIterator.java:69)
at org.broadinstitute.hellbender.utils.downsampling.ReadsDownsamplingIterator.advanceToNextRead(ReadsDownsamplingIterator.java:55)
at org.broadinstitute.hellbender.utils.downsampling.ReadsDownsamplingIterator.next(ReadsDownsamplingIterator.java:49)
at org.broadinstitute.hellbender.utils.downsampling.ReadsDownsamplingIterator.next(ReadsDownsamplingIterator.java:16)
at org.broadinstitute.hellbender.utils.iterators.ReadCachingIterator.next(ReadCachingIterator.java:42)
at org.broadinstitute.hellbender.utils.iterators.ReadCachingIterator.next(ReadCachingIterator.java:17)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
at org.broadinstitute.hellbender.utils.locusiterator.ReadStateManager.collectPendingReads(ReadStateManager.java:160)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.lazyLoadNextAlignmentContext(LocusIteratorByState.java:315)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.hasNext(LocusIteratorByState.java:252)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.advanceAlignmentContext(IntervalAlignmentContextIterator.java:104)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.advanceAlignmentContextToCurrentInterval(IntervalAlignmentContextIterator.java:99)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.next(IntervalAlignmentContextIterator.java:69)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.next(IntervalAlignmentContextIterator.java:21)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:143)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:135)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:34)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:286)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:979)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Do you have an idea what is going wrong?
Thanks a lot!
Cosima
Understanding the math behind MUTECT LOD scores
Unfortunately, it appears that the Nature Biotechnology paper where the details are spelled out may have typos/errors in the "Methods: Variant Detection" section, which make it difficult to track how LOD_T and LOD_N are defined/used in MUTECT. In particular, if you look at the two LOD_T equations, there are two conflicting definitions/equalities for LOD_T(m,f). These two equations would only agree if P(m,f)=0.5, which is not an assumption I believe the package is making. Likewise, the LOD_N inequalities further down in "Methods: Variant Classification" are not consistent; these inequalities would only be equivalent if P(m,f)=P(germ line), which is again an assumption that I do not believe the method is making.
Can someone clarify, in the notation of that paper, what these equations should be to align with the actual implementation in Mutect? And is it safe to say that the same interpretation/logic carries over into Mutect2's TLOD score?
I did search the forums and see the other post here:
(I cannot post links as I made a new account, but there is a forum post whose link ends /gatk/discussion/4463/how-mutect-identifies-candidate-somatic-mutations)
But the notation there is somewhat ambiguous and does not match the more rigorous notation of the MuTect paper, making it difficult for me to track what it means exactly. I also see what appear to be errors in that post, for example:
LOD_T > \log_{10}(0.5 \times 10^{-6}) \approx 6.3
But I believe that \log_{10}(0.5 \times 10^{-6}) is approximately equal to negative 6.3, not positive 6.3 as written. Yet that positive 6.3 threshold also appears in the MuTect methods section.
I would greatly appreciate any clarity that can be provided regarding the details of the math behind these LOD scores as they are actually used/implemented in Mutect.
Thank you very much!
Garrett
How to Understand SVCNVDiscovery result: CN "-1" value with CNQ "." value
Dear Genome STRiP users,
I got the CN and CNQ values from SVCNVDiscovery. After sorting the CNQ values and the corresponding CN values, I found some odd CN and CNQ values, shown below: 10 samples out of 10686 have CN = -1 (in addition, 24 samples have CNQ (not .) < 13, and 24 samples have 13 < CNQ < 20).
ID | sample 1 | sample 2 | sample 3 | sample 4 | sample 5 |
---|---|---|---|---|---|
CNQ | . | . | . | 0.5 | 1.8 |
CN | -1 | -1 | -1 | 2 | 2 |
So may I have your suggestions about the following?
1. What do CN = -1 and CNQ = . mean? Do they mean that there is no confident copy number value for those samples?
2. Is a callset where 10 out of 10686 samples have -1 as the CN value (and 24 samples have CNQ (not .) < 13, and 24 samples have 13 < CNQ < 20) a good CNV calling or a bad one?
3. If this is a bad calling, may I have your suggestions on how to improve it, such as running in multiple batches (although wouldn't more samples make the estimation better?), using a longer -intervalList in SVCNVDiscovery, or using a longer -L in SVPreprocess (if the whole 94million CRAM file really cannot be loaded)?
Thank you very much.
Best regards,
Wusheng
Mutect2 IndexOutOfBoundsException when using a germline resource AF file
I'm running Mutect2 as follows:
java -Xmx16g -jar ${gatkDir}/GATK.jar Mutect2 -R ${GRC}.fa -I ${TU}.recal.bam -tumor TU -I ${NM}.recal.bam -normal NM --native-pair-hmm-threads $threads --germline-resource $gnomad --af-of-alleles-not-in-resource 0.0000025 -O ${sampleName}.mutect.UF.vcf --tmp-dir temp
The error is as follows:
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.lambda$getGermlineAltAlleleFrequencies$3(GermlineProbabilityCalculator.java:55)
at java.util.stream.ReferencePipeline$6$1.accept(ReferencePipeline.java:244)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
at java.util.stream.DoublePipeline.toArray(DoublePipeline.java:506)
at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getGermlineAltAlleleFrequencies(GermlineProbabilityCalculator.java:57)
at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getNegativeLog10PopulationAFAnnotation(GermlineProbabilityCalculator.java:29)
at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:165)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:233)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:232)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:291)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
I have seen errors like this, relating to the AF file, listed before on the forums. I removed the file, and the run then completed successfully. The AF file is af-only-gnomad.filtered.hg38.vcf.gz.
However, the above command with the AF file runs correctly on GATK 4.0.10.1, with no errors, and completes successfully.
The AF file is formatted as follows:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10067 . T TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 30.35 PASS .
1 10108 . CAACCCT C 46514.3 PASS .
1 10109 . AACCCT A 89837.3 PASS .
1 10114 . TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA CAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTA,T 36729 PASS .
1 10119 . CT C 251.23 PASS .
1 10120 . T C 14928.7 PASS .
1 10128 . ACCCTAACCCTAACCCTAAC A 285.71 PASS .
1 10131 . CT C 378.93 PASS .
...
Any thoughts as to what the error could be?
Thank you!
HaplotypeCaller VCF entries with 0/0 genotype. How to interpret?
Hi there,
I'm running HaplotypeCaller version 4.0.10.0 using your Cromwell pipeline, which puts this command information in the VCF header:
HaplotypeCaller --contamination-fraction-to-filter 0.0 --output GM12878_DNA_dup.bam.p4.bam.dup.bam_fixed.vcf.gz --intervals /projects/rcorbettprj2/clinicalGermline/GIABtests/bothReplicates/merges/variant_call_wdl/cromwell-executions/run_haplotypecaller_on_directory/259f5b72-4367-4dd1-a1b0-a26c3ddb2fdb/call-HaplotypeCallerGvcf_GATK4/shard-4/haplotypecaller.HaplotypeCallerGvcf_GATK4/953f9567-2161-4f45-a3be-2b19271d5f89/call-HaplotypeCaller/shard-35/inputs/1202757533/0035-scattered.intervals --input /projects/rcorbettprj2/clinicalGermline/GIABtests/bothReplicates/merges/variant_call_wdl/cromwell-executions/run_haplotypecaller_on_directory/259f5b72-4367-4dd1-a1b0-a26c3ddb2fdb/call-HaplotypeCallerGvcf_GATK4/shard-4/haplotypecaller.HaplotypeCallerGvcf_GATK4/953f9567-2161-4f45-a3be-2b19271d5f89/call-HaplotypeCaller/shard-35/inputs/-1421299472/GM12878_DNA_dup.bam.p4.bam.dup.bam_fixed.bam --reference /projects/rcorbettprj2/clinicalGermline/GIABtests/bothReplicates/merges/variant_call_wdl/cromwell-executions/run_haplotypecaller_on_directory/259f5b72-4367-4dd1-a1b0-a26c3ddb2fdb/call-HaplotypeCallerGvcf_GATK4/shard-4/haplotypecaller.HaplotypeCallerGvcf_GATK4/953f9567-2161-4f45-a3be-2b19271d5f89/call-HaplotypeCaller/shard-35/inputs/-1226353055/hg19a.fa --emit-ref-confidence NONE --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --indel-size-to-eliminate-in-ref-model 10 --use-alleles-trigger false --disable-optimizations false --just-determine-active-regions false --dont-genotype false --max-mnp-distance 0 --dont-trim-active-regions false --max-disc-ar-extension 25 --max-gga-ar-extension 300 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --recover-dangling-heads false --do-not-recover-dangling-branches false --min-dangling-branch-length 4 --consensus false --max-num-haplotypes-in-population 128 --error-correct-kmers false --min-pruning 2 --debug-graph-transformations false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --debug false --use-filtered-reads-for-annotations false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --capture-assembly-failure-bam false --error-correct-reads false --do-not-run-physical-phasing false --min-base-quality-score 10 --smith-waterman JAVA --use-new-qual-calculator false --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 10.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --genotyping-mode DISCOVERY --genotype-filtered-alleles false --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --max-reads-per-alignment-start 50 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false",Version=4.0.10.0,Date="March 6, 2019 5:54:45 PM PST
In the resulting VCF I am seeing a number of records like the following:
1 1453665 . T . 20.80 . AN=2;DP=11;MQ=55.12 GT:AD:DP 0/0:11:11
1 1453666 . A . 20.80 . AN=2;DP=11;MQ=55.12 GT:AD:DP 0/0:11:11
1 1453676 . A . 14.91 . AN=2;DP=13;MQ=55.90 GT:AD:DP 0/0:13:13
1 1453682 . A . 14.91 . AN=2;DP=13;MQ=55.90 GT:AD:DP 0/0:13:13
1 1453696 . A . 12.05 . AN=2;DP=13;MQ=55.90 GT:AD:DP 0/0:13:13
1 200783760 . AAAC . 384.73 . AN=2;DP=11;MQ=60.00 GT:AD:DP 0/0:11:11
My initial interpretation is that although my VCF file should contain only variants that GATK believes to be real (i.e. different from the reference), some variants have been included and then marked as homozygous reference by some part of the pipeline.
My concern is that casually developed tools would see that the records are not filtered and happily use those locations even though the GT field suggests otherwise. Do you suggest running a second command to filter these records out, or do most tools correctly filter these out at runtime? Do we know what SnpEff or Annovar will do with these records? (For reference, the kind of filtering command I had in mind is sketched below.)
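A sketch with my own file names, assuming SelectVariants' --exclude-non-variants flag behaves as I expect:
gatk SelectVariants \
-R hg19a.fa \
-V GM12878_DNA_dup.bam.p4.bam.dup.bam_fixed.vcf.gz \
--exclude-non-variants \
-O GM12878_variants_only.vcf.gz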
thanks,
Richard
What is a VCF and how should I interpret it?
This document describes "regular" VCF files produced for GERMLINE calls. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document. For information specific to SOMATIC calls, see the MuTect documentation.
Contents
- What is VCF?
- Basic structure of a VCF file
- Interpreting the VCF file header information
- Structure of variant call records
- How the genotype and other sample-level information is represented
- How to extract information from a VCF in a sane, straightforward way
1. What is VCF?
VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.
VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.
That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.
Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:
Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants (see the example after this list).
NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD, BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned.
Don't write home-brewed VCF parsing scripts. It never ends well.
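For example, here is a minimal SelectVariants invocation to subset a large callset down to a single region (the file names are placeholders):
gatk SelectVariants \
-R reference.fasta \
-V large_callset.vcf.gz \
-L chr1:1000000-2000000 \
-O subset.vcf.gz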
2. Basic structure of a VCF file
A valid VCF file is composed of two main parts: the header, and the variant call records.
The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version, etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also includes the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so, as it is not required by the VCF specification. For more information about the header, see the next section.
The actual data lines will look something like this:
[HEADER LINES]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26
1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:61:61,0,255
After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.
You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.
3. Interpreting the VCF file header information
The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!
##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta
We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.
- VCF spec version
The first line:
##fileformat=VCFv4.1
tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.
- FILTER lines
The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:
##FILTER=<ID=LowQual,Description="Low quality">
Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).
- FORMAT and INFO lines
These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.
- GATKCommandLine
The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.
- Contig lines and Reference
These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!
4. Structure of variant call records
For each site record, the information is structured into columns (also called fields) as follows:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...]
The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.
Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!
Site-level properties and annotations
These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot (.) to serve as a placeholder).
- CHROM and POS: The contig and genomic coordinates at which the variant occurs. Note that for deletions the position given is actually the base preceding the event.
- ID: An optional identifier for the variant. Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.
- REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending on how the VCF was generated). Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion, so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.
- QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a value of 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic. Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
- FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is a dot (.), then no filtering has been applied to the record. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.
This next field does not have to be present in the VCF.
- INFO: Various site-level annotations.
The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign (=), and pairs are separated by semicolons (;), as in this example: MQ=99.00;MQ0=0;QD=17.94.
They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.
Sample-level annotations
At this point you've met all the fields up to INFO in this lineup:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...]
All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.
5. How the genotype and other sample-level information is represented
The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but it's actually not that hard to interpret once you understand that it's just sets of tags and values.
Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:
1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26
Looking at that last column, here is what the tags mean:
GT: The genotype of this sample at this site.
For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:
- 0/0 - the sample is homozygous reference
- 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
- 1/1 - the sample is homozygous alternate
In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
AD and DP: Allele depth and depth of coverage.
These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
DP is the filtered depth, at the sample level. This gives you the number of reads at this site that passed the variant caller's internal filters; you can check the variant caller's documentation to see which filters are applied by default. However, unlike the AD calculation, uninformative reads are included in DP.
See the Tool Documentation on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.
PL: "Normalized" Phred-scaled likelihoods of the possible genotypes.
For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 on the Phred scale. We use "normalized" in quotes because these are not probabilities: the most likely genotype's PL is set to 0 purely for ease of reading, and the other values are scaled relative to this most likely genotype.
Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "how much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.
GQ: Quality of the assigned genotype.
The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.
Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.
Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.
1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:26:103,0,26
At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP = 4), 1 of which was REF (= had the reference base) and 3 of which were ALT (= had the alternate base), indicated by AD = 1,3. The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0), as is always the case for the assigned genotype, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele), since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.
6. How to extract information from a VCF in a sane, (mostly) straightforward way
Use VariantsToTable.
No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.
Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.
(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)
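For example, a minimal VariantsToTable invocation that pulls the fields discussed above into a tab-separated table (file names are placeholders):
gatk VariantsToTable \
-V input.vcf.gz \
-F CHROM -F POS -F ID -F REF -F ALT -F QUAL -F FILTER \
-GF GT -GF AD -GF DP -GF GQ \
-O output.table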
Mutect2 files from TCGA - FORMAT tag [QSS] expected different number of values (expected 1, found 2)
I am trying to run vcf-validator (vcftools) on some VCF files I obtained from TCGA. These are tumor-normal VCF files called using Mutect2. When I try to run vcf-validator I get an error:
------------------------
Summary:
6220 errors total
6220 .. column NORMAL at chr1:14677 .. FORMAT tag [QSS] expected different number of values (expected 1, found 2)
Any ideas on what is wrong and what I can do to fix it?
Thanks
GATK4 does not include DepthOfCoverage tool
It appears that DepthOfCoverage is not listed as a tool in GATK4; I get the following error:
A USER ERROR has occurred: 'DepthOfCoverage' is not a valid command.
I have searched through the list of available tools and cannot find anything with a similar function. Is there a new tool for calculating coverage?
Concordance crashed
Dear GATK team,
I'm trying to compare two VCFs with genotypes for a single sample on two platforms, using Concordance (GATK 4.0.1.1). I'm getting an error; please help:
[jlr328@cbsurf01 Concordance]$ /programs/gatk4/gatk --java-options "-Xmx120G" Concordance -R ../RemapWGS/Reference/v2.refseq/GRCh38_latest_genomic.fna --evaluation qmd-27-07.deep-wes.grch38-refseq.gatk4.qchip-sites.vcf --truth qmd-27-07.deep-wgs.grch38-refseq.gatk4.qchip-sites.vcf --summary concordance.qmd-27-07.eval-deep-wes.truth-deep-wgs.tsv
Using GATK jar /programs/gatk-4.0.1.1/gatk-package-4.0.1.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xmx120G -jar /programs/gatk-4.0.1.1/gatk-package-4.0.1.1-local.jar Concordance -R ../RemapWGS/Reference/v2.refseq/GRCh38_latest_genomic.fna --evaluation qmd-27-07.deep-wes.grch38-refseq.gatk4.qchip-sites.vcf --truth qmd-27-07.deep-wgs.grch38-refseq.gatk4.qchip-sites.vcf --summary concordance.qmd-27-07.eval-deep-wes.truth-deep-wgs.tsv
14:29:40.034 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/programs/gatk-4.0.1.1/gatk-package-4.0.1.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:29:40.157 INFO Concordance - ------------------------------------------------------------
14:29:40.157 INFO Concordance - The Genome Analysis Toolkit (GATK) v4.0.1.1
14:29:40.157 INFO Concordance - For support and documentation go to https://software.broadinstitute.org/gatk/
14:29:40.158 INFO Concordance - Executing as jlr328@cbsurf01.biohpc.cornell.edu on Linux v3.10.0-693.el7.x86_64 amd64
14:29:40.158 INFO Concordance - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-b12
14:29:40.158 INFO Concordance - Start Date/Time: January 8, 2019 2:29:40 PM EST
14:29:40.158 INFO Concordance - ------------------------------------------------------------
14:29:40.159 INFO Concordance - ------------------------------------------------------------
14:29:40.159 INFO Concordance - HTSJDK Version: 2.14.1
14:29:40.159 INFO Concordance - Picard Version: 2.17.2
14:29:40.159 INFO Concordance - HTSJDK Defaults.COMPRESSION_LEVEL : 1
14:29:40.159 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:29:40.159 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:29:40.159 INFO Concordance - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:29:40.160 INFO Concordance - Deflater: IntelDeflater
14:29:40.160 INFO Concordance - Inflater: IntelInflater
14:29:40.160 INFO Concordance - GCS max retries/reopens: 20
14:29:40.160 INFO Concordance - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
14:29:40.160 INFO Concordance - Initializing engine
14:29:40.712 INFO FeatureManager - Using codec VCFCodec to read file file:///local/storage/AnalysisData/Ongoing/Qatar/Array/QChip/QChipData/v4/GRCh38/Concordance/qmd-27-07.deep-wgs.grch38-refseq.gatk4.qchip-sites.vcf
14:29:40.736 INFO FeatureManager - Using codec VCFCodec to read file file:///local/storage/AnalysisData/Ongoing/Qatar/Array/QChip/QChipData/v4/GRCh38/Concordance/qmd-27-07.deep-wes.grch38-refseq.gatk4.qchip-sites.vcf
14:29:40.744 INFO Concordance - Done initializing engine
14:29:40.745 INFO ProgressMeter - Starting traversal
14:29:40.745 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute
14:29:40.958 INFO Concordance - Shutting down engine
[January 8, 2019 2:29:40 PM EST] org.broadinstitute.hellbender.tools.walkers.validation.Concordance done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2092433408
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:879)
at org.broadinstitute.hellbender.tools.walkers.validation.Concordance.areVariantsAtSameLocusConcordant(Concordance.java:256)
at org.broadinstitute.hellbender.engine.AbstractConcordanceWalker$ConcordanceIterator.next(AbstractConcordanceWalker.java:188)
at org.broadinstitute.hellbender.engine.AbstractConcordanceWalker$ConcordanceIterator.next(AbstractConcordanceWalker.java:163)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at org.broadinstitute.hellbender.engine.AbstractConcordanceWalker.traverse(AbstractConcordanceWalker.java:121)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:277)