Channel: Recent Discussions — GATK-Forum

MuTect2 all potential somatic mutations did not pass the alt_allele_in_normal filter


Hello,

I am trying to use MuTect2 to call low-frequency (0.1-5%) mutations in a yeast population, but none of them passed the alt_allele_in_normal filter, which means the allele is supposedly present in the control ("normal") sample. However, this does not look reasonable, because I used the same control to create the Panel of Normals and my samples successfully passed the panel_of_normals filter. In addition, when I visualized the reads in IGV, I did not see any alternate alleles in the control at these positions.

Briefly, my experimental design: I pooled 1000 yeast colonies (after certain mutagenic conditions), extracted DNA and amplified a gene of interest by PCR. I then purified the amplicons and sequenced them on an Illumina MiSeq, paired-end. The final coverage is between 10,000 and 30,000 (10-30 per genome). I preprocessed the data following the GATK Best Practices and created the PoN through the artifact detection mode, using a wild-type strain processed with the same protocol. At least one mutation is expected per colony.

Finally, I ran MuTect2 with the following parameters:

GenomeAnalysisTK -T MuTect2 -R path_to_reference/reference.fa -I:tumor path_to_file/realigned_Sample1.bam -I:normal path_to_file/realigned_control.bam -L I:X1-X2 --sample_ploidy 1000 -PON output.control.vcf -o output_control_vs_sample1.vcf
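One workaround I am considering is relaxing the thresholds behind the alt_allele_in_normal filter. A sketch of what that might look like -- the threshold argument names are taken from what I could find in the GATK 3.x MuTect2 documentation, so please correct me if they differ in my version:

GenomeAnalysisTK -T MuTect2 -R path_to_reference/reference.fa -I:tumor path_to_file/realigned_Sample1.bam -I:normal path_to_file/realigned_control.bam -L I:X1-X2 --sample_ploidy 1000 -PON output.control.vcf --max_alt_alleles_in_normal_count 1000000 --max_alt_allele_in_normal_fraction 1.0 -o output_control_vs_sample1.vcf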

Could you help me fix this problem?
Thank you!


GATK3 HC bug?


Hey GATK Devs!

I'm writing to report some unexpected behavior on the part of GATK 3.8 HC. I'm trying to use Illumina data to call SNPs and indels on a PacBio assembly and identify loci where assembly polishing has failed to correct the assembly. I was looking through the reads of a particular contig and identified a locus (tig00006168:59182) that GATK failed to call. According to the reads, the locus should have been called homozygous for a deletion at 30X depth. Looking at the gVCF, I see it is called homozygous for the reference allele (while reporting 31 reads supporting the alternate allele), and it reports 0 for GQ and all PLs:

tig00006168     59180   .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:31,0:31:15:0,15,225
tig00006168     59181   .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:30,0:30:15:0,15,225
tig00006168     59182   .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:1,31:32:0:0,0,0
tig00006168     59183   .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:1,31:32:0:0,0,0
tig00006168     59184   .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:1,31:32:0:0,0,0
tig00006168     59185   .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:27,2:29:54:0,54,1005
tig00006168     59186   .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:30,1:31:72:0,72,1080

The -bamout output for this run reports no variant-containing reads at this locus. However, if I include -L tig00006168:59172-59195 in the options, GATK calls the indel:

tig00006168     59182   .       CA      C       927.73  .       AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.36;SOR=2.753        GT:AD:DP:GQ:PL  1/1:0,27:27:81:965,81,0

The tview on the -bamout for this latter run displays:

59141     59151     59161     59171     59181     59191     59201     59211     592
GGAAATGAAGGAGAAGAAAGTGTTTATCAGCCTCGTGGGCACAAACAGGAATGGGCTGCAGGTTGGTACCCCCAATCTCTNNN
      ..........................................................................
      .................................        .................................
      ....................................*.....................................
      ....................................*.....................................
      ...........................              ,,,,,,,,,,,,,,,,,,,,c,,,,,,,,,,,,
      ..................                       ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ....................................*.....................................
      ....................................*..............................
      ....................................*..............................
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,       ,,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,                    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,   ,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,    .....................
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,c,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
         ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
            ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
                 ,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
                 ,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

The mapping quality for all of these reads is over 30 and the base qualities are in the 30+ range. In your experience, what might cause this odd behavior? I've tried GATK versions 3.2-2, 3.6-0 and 3.7, and they all exhibit the same behavior. My initial run log:

INFO  07:46:12,698 HelpFormatter - ---------------------------------------------------------------------------------------
INFO  07:46:12,701 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO  07:46:12,701 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  07:46:12,701 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  07:46:12,701 HelpFormatter - [Fri Oct 13 07:46:12 PDT 2017] Executing on Linux 2.6.32-696.3.2.el6.nersc.x86_64 amd64
INFO  07:46:12,701 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13
INFO  07:46:12,704 HelpFormatter - Program Args: -T HaplotypeCaller --standard_min_confidence_threshold_for_calling 0 -rf BadMate -R ./contigs.fasta -L
tig00006168 -I 10X.bam -mmq 25 -mbq 30 -o tig00006168.trg.vcf.gz -bamout tig00006168.trg.bam
INFO  07:46:12,714 HelpFormatter - Executing as bredeson@hostname on Linux 2.6.32-696.3.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM
1.8.0_31-b13.
INFO  07:46:12,714 HelpFormatter - Date/Time: 2017/10/13 07:46:12
INFO  07:46:12,714 HelpFormatter - ---------------------------------------------------------------------------------------
INFO  07:46:12,714 HelpFormatter - ---------------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:~bredeson/tools/bin/GATK/3.
8-0-ge9d80683/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  07:46:12,851 GenomeAnalysisEngine - Deflater: JdkDeflater
INFO  07:46:12,851 GenomeAnalysisEngine - Inflater: JdkInflater
INFO  07:46:12,852 GenomeAnalysisEngine - Strictness is SILENT
INFO  07:46:15,647 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO  07:46:15,654 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  07:46:15,879 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.22
INFO  07:46:16,007 HCMappingQualityFilter - Filtering out reads with MAPQ < 25
INFO  07:46:18,047 IntervalUtils - Processing 106407 bp from intervals
INFO  07:46:18,157 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  07:46:18,277 GenomeAnalysisEngine - Done preparing for traversal
INFO  07:46:18,277 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  07:46:18,278 ProgressMeter -                 |      processed |    time |         per 1M |           |   total | remaining
INFO  07:46:18,278 ProgressMeter -        Location | active regions | elapsed | active regions | completed | runtime |   runtime
INFO  07:46:18,278 HaplotypeCaller - Disabling physical phasing, which is supported only for reference-model confidence output
INFO  07:46:18,325 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
WARN  07:46:18,325 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
INFO  07:46:18,326 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
INFO  07:46:18,674 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO  07:46:19,188 VectorLoglessPairHMM - Using OpenMP multi-threaded AVX-accelerated native PairHMM implementation
[INFO] Available threads: 40
[INFO] Requested threads: 1
[INFO] Using 1 threads
WARN  07:46:19,268 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not HaplotypeCaller
INFO  07:46:32,888 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.012103603000000001
INFO  07:46:32,889 PairHMM - Total compute time in PairHMM computeLikelihoods() : 3.824669993
INFO  07:46:32,889 HaplotypeCaller - Ran local assembly on 82 active regions
INFO  07:46:33,175 ProgressMeter -            done         106407.0    14.0 s            2.3 m      100.0%    14.0 s       0.0 s
INFO  07:46:33,175 ProgressMeter - Total runtime 14.90 secs, 0.25 min, 0.00 hours
INFO  07:46:33,175 MicroScheduler - 106744 reads were filtered out during the traversal out of approximately 133359 total reads (80.04%)
INFO  07:46:33,176 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
INFO  07:46:33,176 MicroScheduler -   -> 6304 reads (4.73% of total) failing BadMateFilter
INFO  07:46:33,176 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  07:46:33,176 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  07:46:33,176 MicroScheduler -   -> 99743 reads (74.79% of total) failing HCMappingQualityFilter
INFO  07:46:33,177 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  07:46:33,188 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  07:46:33,188 MicroScheduler -   -> 697 reads (0.52% of total) failing NotPrimaryAlignmentFilter
INFO  07:46:33,188 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter
------------------------------------------------------------------------------------------
Done. There were 2 WARN messages, the first 2 are repeated below.
WARN  07:46:18,325 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
WARN  07:46:19,268 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not HaplotypeCaller
------------------------------------------------------------------------------------------

GATK events update for Fall 2017: ASHG and more!


Fall is my favorite season -- it combines the best weather in New England and the most active period of the year for GATK events and announcements (although sometimes the latter means we don't get to go out and enjoy the former as much as we'd like). In keeping with that, we have a couple of important announcements that will go out early next week. However I'm breaking radio silence now to give you a quick update on what's been happening with regard to events and workshops, specifically.

Coming soonest: FLOW pipelining workshop showcasing GATK4 pipelines, organized by DNAstack in Orlando, Oct 17

Plus upcoming GATK workshops and links to recent workshop materials.

Details below!


Upcoming events

ASHG in Orlando, FL (Oct 17-21)

The yearly meeting of the American Society for Human Genetics is always a hotspot for us, so we've got a lot going on there:

  • A 2-hour pipelining workshop called FLOW, organized by DNAstack, where we'll be demoing running GATK4 pipelines written in WDL. It's an evening workshop, 7-9pm on Tuesday 17; there are still a few spots left so sign up now.
  • A poster presentation about GATK4's expanded scope of analysis (PgmNr 758/T, 3-4pm on Thu 19). Come visit Soo Hee Lee (support team) and Louis Bergelson (engine development team) to learn what GATK4 will enable you to do in your research!
  • GATK4 demos and Q&A at the Broad Genomics booth (#1037 in the Exhibition Hall, Orange County Convention Center South Building, Floor 1), Thursday 19, 11:30-12:30 and Friday 20, 10:30-11:30. If you miss the poster presentation or you'd like to ask the kind of questions that are best answered with a running laptop, this would be your time.
  • FireCloud demos and Q&A, also at the Broad Genomics booth, Wednesday 18 afternoon, all day Thursday 19, and Friday 20 morning. The demos are scheduled for 3pm on Wednesday and Thursday; the rest of the time is open for walk-up questions and 1:1 guided tours. This is a great opportunity to learn how FireCloud, our cloud-based analysis platform, can help you by taking the pain out of pipelining GATK (or anything else you want to pipeline, really).

GATK workshop in Pretoria, South Africa (Oct 23-27)

This is a "bootcamp special" version of our popular workshop series, running five entire days, with interleaved talks and hands-on exercises, hosted by the University of Pretoria. I'm not sure there are any spots left but if you're interested (and presumably, local to that part of the world, given the short time remaining), see this document.

GATK workshop in Huntington, West Virginia (Nov 8-10)

This one is our classic workshop formula, running three days with one day of talks and two days of hands-on exercises, hosted by the WV-INBRE Bioinformatics Core Facility. It's open to all comers and I believe there are still a few spots left, for which you can register here.

SuperComputing '17 in Denver, Colorado (Nov 12-17)

We'll be demoing GATK4 pipelines running on Google Cloud; exact schedule TBD.

Coming next

We are currently developing the workshop schedule for 2018, and are so far talking with several prospective hosts in Europe, North America and Asia. We hope to be able to offer a good selection of locations worldwide. And we're still taking requests, so don't hesitate to reach out to us if you'd like to consider hosting us yourself!


Latest workshops materials

We held several workshops this summer; Cambridge and Edinburgh in July, and Helsinki in September. As always, the latest workshop materials (Helsinki version) are accessible from the Presentations page. If you're lazy (or should I say effort-averse), here's a direct link to the relevant Google Drive folder -- but keep in mind this will not remain the latest version for long.

Picard LiftoverVcf


I am having a problem with picard's LiftoverVcf.

I am trying to lift over HapMap files (downloaded as PLINK files from HapMap and converted to VCF using PLINK) from NCBI36 to hg38. I was able to do this with GATK LiftoverVariants. My problem came when I had to merge the hapmap.hg38 files with some genotype files (which I lifted over from hg19 to hg38 using GATK LiftoverVariants). I am merging them so that I can run population stratification using PLINK. I used vcf-merge, but it complained that a SNP (rs3094315) has different reference alleles in the two files: the reference allele should be G, which was correct in the genotype.hg38 files but wrong in the hapmap.hg38 files. I also tried lifting hapmap.ncbi36 to hg19 first and then to hg38, but the offending allele was still there. So I decided to try lifting hapmap.ncbi36 with LiftoverVcf from Picard.

  1. I downloaded the newest picard build (20 hours old) picard-tools-1.138.
  2. Used the command: java -jar -Xmx6000m ../../../tools/picard-tools-1.138/picard.jar LiftoverVcf I=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.vcf O=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.vcf C=../../../tools/liftover/chain_files/hg18ToHg38.over.chain REJECT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.reject.vcf R=../../../data/assemblies/hg38/hg38.fa VERBOSITY=ERROR

Here is the run:
[Thu Aug 13 00:43:40 CEST 2015] picard.vcf.LiftoverVcf INPUT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.vcf OUTPUT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.vcf CHAIN=......\tools\liftover\chain_files\hg18ToHg38.over.chain REJECT=all_samples_hapmap3_r3_b36_fwd.qc.poly.tar.picard.hg38.reject.vcf REFERENCE_SEQUENCE=......\data\assemblies\hg19\assemble\hg38.fa VERBOSITY=ERROR QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

Here is the error:
Exception in thread "main" java.lang.IllegalStateException: Allele in genotype A* not in the variant context [T*, C]
at htsjdk.variant.variantcontext.VariantContext.validateGenotypes(VariantContext.java:1357)
at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1295)
at htsjdk.variant.variantcontext.VariantContext.(VariantContext.java:410)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:496)
at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:490)
at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:200)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

  1. I have no idea which SNP is the problem.
  2. I do not know what T* means (does not seem to exist in the file).
  3. I am new to Picard, so I thought VERBOSITY=ERROR would give me something more, but nothing more appeared.
  4. Given that lifting hapmap.ncbi36 to hg19 and then to hg38 produced the same erroneous reference allele, I suppose lifting will not fix this and I will have to work with dbSNP to correct my file. Do you know how I can change the reference allele in a VCF? Is there a tool for this? Is there a liftover tool for dbSNP?
  5. As a side note, I want to make Picard work because I read that you will be deprecating the GATK liftover and will support the Picard liftover (at some point in the future), so help with this tool will be appreciated.

Most Variants Called


Hello all!

So I am forced to do hard-filtering on my VCF files. Looking at them before filtering, ~99% of my variants have a QD of <2.0. Looking at the distribution plots in ggplot, they do not follow the same distribution pattern as seen in http://gatkforums.broadinstitute.org/gatk/discussion/6925/understanding-and-adapting-the-generic-hard-filtering-recommendations. I have 24 samples and they are not at all similar in their distribution.
The other annotations, FS and MQ, are all within the recommendations. I wasn't sure where the values for ReadPosRankSum and MQRankSum were, so I couldn't plot those out.

I have used a ploidy of 20, and I'm looking at a population of bacteria. Does anyone know why the QD is so low?

I'm going to reduce the recommended QD filtering cutoff to 0.8, thereby keeping the top ~10% of variants by QD. Does that seem sensible?

Thanks,

A

How to merge GVCF files that were produced with a different reference genome?


Hi everyone. I have successfully produced my genotyped VCF file, but I needed to merge it with other g.vcf files I recently received from a collaborating company. These were gzipped VCF files with their associated .tbi files. I first had to unzip them in order to combine them with the GenotypeGVCFs tool and create a genotyped VCF file. It turned out that these files had contig names that differed from my reference genome (the g.vcf contig names are 1, 2, 3, ..., whereas my reference naming is chr1, ...). Luckily I was able to rename the contigs to match the reference contig names; however, the merging step with GenotypeGVCFs still didn't work out. This time the error was that the contigs in these g.vcf files were not in the same order as in my reference. In an attempt to reorder the contigs to match the ordering of the reference using SortVcf in Picard tools, I received an error that I really couldn't solve. Could anyone kindly help me out? Please find attached the various stages and error messages.
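For context, a typical Picard SortVcf invocation with an explicit sequence dictionary looks roughly like the sketch below; the file names are placeholders, and SEQUENCE_DICTIONARY points to the .dict of the reference whose contig order should be imposed.

java -jar picard.jar SortVcf \
    I=collaborator_sample.g.vcf.gz \
    O=collaborator_sample.sorted.g.vcf.gz \
    SEQUENCE_DICTIONARY=reference.dict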

Difference between vcf-result and "samtools tview"


I am calling mutations in the genes BRCA1/BRCA2 with HaplotypeCaller.
I use "samtools tview" to check whether the result is right.
Taking a deletion site as an example, I got the result shown in the following picture:

After counting all the "*" signals, the number is less than the one contained in the VCF file
(the number in the red frame in the picture).

In my opinion, the VCF number should be smaller than or equal to the samtools tview count.
Could you please help me solve this problem? Thanks.

GATK 3.5 FindCoveredIntervals works same regardless of DuplicateFilter tag


Hi,

When using the FindCoveredIntervals tool, I find no difference between calling the tool with the -drf DuplicateRead flag and without it (i.e. with the -rf DuplicateRead flag).
Manually checking the .bam file in IGV shows that duplicate reads are present in the input .bam, and that in both cases (with or without the flag) the resulting .bed files ignore intervals where duplicate reads push coverage above the given threshold.

Is this a known bug? I haven't been able to find similar questions on the forum. Is it fixed in later versions of GATK, and is there (or will there be) the same or a similar tool in GATK 4.0? Or am I using the DuplicateRead filter incorrectly and expecting the wrong result?

Thanks in advance,
-Boris


(How to) Call somatic copy number variants using GATK4 CNV


This demonstrative tutorial provides instructions and example data to detect somatic copy number variation (CNV) using a panel of normals (PoN). The workflow is optimized for Illumina short-read whole exome sequencing (WES) data. It is not suitable for whole genome sequencing (WGS) data or for germline calling.

The tutorial recapitulates the GATK demonstration given at the 2016 ASHG meeting in Vancouver, Canada, for a beta version of the CNV workflow. Because we are still actively developing the CNV tools (writing as of March 2017), the underlying algorithms and current workflow options, e.g. syntax, may change. However, the basic approach and general concepts presented here will still be germane. Please check the forum for updates.

Many thanks to Samuel Lee (@slee) for developing the example data, data figures and discussion that set the backbone of this tutorial.

► For a similar example workflow that pertains to earlier releases of GATK4, see Article#6791.
► For the mathematics behind the workflow, see this whitepaper.

Different data types come with their own caveats. WGS, while providing even coverage that enables better CNV detection, is costly. SNP arrays, while the standard for CNV detection, may not be part of an analysis protocol. Being able to resolve CNVs from WES, which additionally introduces artifacts and variance in the target capture step, requires sophisticated denoising.


Jump to a section

  1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage
  2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals
  3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts
  4. Segment the normalized coverage profile using PerformSegmentation
    I get an error at this step!
  5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio
    What is the QC value?
  6. Call segmented copy number variants using CallSegments
  7. Discussion of interest to some
    Why can't I use just a matched normal?
    How do the results compare to SNP6 analyses?

Tools, system requirements and example data download

  • This tutorial uses a beta version of the CNV workflow tools within the GATK4 gatk-protected-1.0.0.0-alpha1.2.3 pre-release (Version:0288cff-SNAPSHOT from September 2016). We have made the program jar specially available alongside the data bundle here. Note other tools in this program release may be unsuitable for analyses.

    The example data is whole exome capture sequence data for chromosomes 1–7 of matched normal and tumor samples aligned to GRCh37. Because the data is from real cancer patients, we have anonymized them at multiple levels. The anonymization process preserves the noise inherent in real samples. The data is representative of Illumina sequencing technology from 2011.

  • R (install from https://www.r-project.org/) and certain R components. After installing R, install the components with the following command.

    Rscript install_R_packages.R
    

    We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.

  • XQuartz for optional section 5. Your system may already have this installed.

  • The tutorial does not require reference files. The optional plotting step that uses the PlotSegmentedCopyRatio tool plots against GRCh37 and should NOT be used for other reference assemblies.


1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage

In this step, we collect proportional coverage using target intervals and read data. We have actually pre-computed this for you and we provide the command here for reference.

We process each BAM, whether normal or tumor. The tool collects coverage per read group at each target and divides these counts by the total number of reads per sample.

java -jar gatk4.jar CalculateTargetCoverage \
    -I <input_bam_file> \
    -T <input_target_tsv> \
    -transform PCOV \
    -groupBy SAMPLE \
    -targetInfo FULL \
    --keepdups \
    -O <output_pcov_file>
  • The target file -T is a padded intervals list of the baited regions. You can add padding to a target list using the GATK4 PadTargets tool. For our example data, padding each exome target 250bp on either side increases sensitivity.
  • Setting the -targetInfo option to FULL keeps the original target names from the target list.
  • The --keepdups option asks the tool to include alignments flagged as duplicates.
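To make the -transform PCOV option concrete with made-up numbers: a target covered by 500 reads in a sample with 50 million total reads gets a proportional coverage of 500 / 50,000,000 = 1.0e-5.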

The top plot shows the raw proportional coverage for our tumor sample for chromosomes 1–7. Each dot represents a target. The y-axis plots proportional coverage and the x-axis targets. The middle plot shows the data after a median-normalization and log2-transformation. The bottom plot shows the tumor data after normalization against its matched-normal.

image

image

image

For each of these progressions, how certain are you that there are copy-number events? How many copy-number variants are you certain of? What is contributing to your uncertainty?


back to top


2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals

In this step, we use two commands to create the CNV panel of normals (PoN).

The normals should represent the same sequencing technology, e.g. sample preparation and capture target kit, as that of the tumor samples under scrutiny. The PoN is meant to encapsulate sequencing noise and may also capture common germline variants. Like any control, you should think carefully about what sample set would make an effective panel. At the least, the PoN should consist of ten normal samples that were ideally subject to the same batch effects as that of the tumor sample, e.g. from the same sequencing center. Our current recommendation is 40 or more normal samples. Depending on the coverage depth of samples, adjust the number.

What is better, tissue-matched normals or blood normals of tumor samples?
What makes a better background control, a matched normal sample or a panel of normals?

The first step combines the proportional read counts from the multiple normal samples into a single file. The -inputList parameter takes a file listing the relative file paths, one sample per line, of the proportional coverage data of the normals.

java -jar gatk4.jar CombineReadCounts \
    -inputList normals.txt \
    -O sandbox/combined-normals.tsv
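For illustration, the normals.txt list passed to -inputList is just a plain-text file with one proportional coverage file path per line; with hypothetical file names it might look like this:

cov/normal-01.pcov.tsv
cov/normal-02.pcov.tsv
cov/normal-03.pcov.tsv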

The second step creates a single CNV PoN file. The PoN stores information such as the median proportional coverage per target across the panel and projections of systematic noise calculated with PCA (principal component analysis). Our tutorial’s PoN is built with 39 normal blood samples from cancer patients from the same cohort (not suffering from blood cancers).

java -jar gatk4.jar CreatePanelOfNormals \
    -I sandbox/combined-normals.tsv \
    -O sandbox/normals.pon \
    -noQC \
    --disableSpark \
    --minimumTargetFactorPercentileThreshold 5

This results in two files, the CNV PoN and a target_weights.txt file that typical workflows can ignore. Because we have a small number of normals, we include the -noQC option and change the --minimumTargetFactorPercentileThreshold to 5%.

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers?


back to top


3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts

In this step, we normalize the raw proportional coverage (PCOV) profile using the PoN. Specifically, we normalize the tumor coverage against the PoN’s target medians and against the principal components of the PoN.

java -jar gatk4.jar NormalizeSomaticReadCounts \
    -I cov/tumor.tsv \
    -PON sandbox/normals.pon \
    -PTN sandbox/tumor.ptn.tsv \
    -TN sandbox/tumor.tn.tsv

This produces the pre-tangent-normalized file -PTN and the tangent-normalized file -TN, respectively. Resulting data is log2-transformed.

Denoising with a PoN is critical for calling copy-number variants from WES coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.


back to top


4. Segment the normalized coverage profile using PerformSegmentation

Here we segment the normalized coverage profile. Segmentation groups contiguous targets with the same copy ratio.

java -jar gatk4.jar PerformSegmentation \
    -TN sandbox/tumor.tn.tsv \
    -O sandbox/tumor.seg \
    -LOG

For our tumor sample, we reduce the ~73K individual targets to 14 segments. The -LOG parameter tells the tool that the input coverages are log2-transformed.

View the resulting file with cat sandbox/tumor.seg.

image

Which chromosomes have events?

☞ I get an error at this step!

This command will error if you have not installed R and certain R components. Take a few minutes to install R from https://www.r-project.org/. Then install the components with the following command.

Rscript install_R_packages.R

We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.


back to top


5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio

This is an optional step that plots segmented coverage.

This command requires XQuartz installation. If you do not have this dependency, then view the results in the precomputed_results folder instead. Currently plotting only supports human assembly b37 autosomes. Going forward, this tool will accommodate other references and the workflow will support calling on sex chromosomes.

java -jar gatk4.jar PlotSegmentedCopyRatio \
    -TN sandbox/tumor.tn.tsv \
    -PTN sandbox/tumor.ptn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox \
    -pre tumor \
    -LOG

The -O defines the output directory, and the -pre defines the basename of the files. Again, the -LOG parameter tells the tool that the inputs are log2-transformed. The output folder contains seven files--three PNG images and four text files.

image
  • Before_After.png (shown above) plots copy-ratios pre (top) and post (bottom) tangent-normalization across the chromosomes. The plot automatically adjusts the y-axis to show all available data points. Dotted lines represent centromeres.
  • Before_After_CR_Lim_4.png shows the same but fixes the y-axis range from 0 to 4 for comparability across samples.
  • FullGenome.png colors differential copy-ratio segments in alternating blue and orange. The horizontal line plots the segment mean. Again the y-axis ranges from 0 to 4.

Open each of these images. How many copy-number variants do you see?

☞ What is the QC value?

Each of the four text files contains a single quality control (QC) value. This value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number variants and should decrease post tangent-normalization.

  • preQc.txt gives the QC value before tangent-normalization.
  • postQc.txt gives the post-tangent-normalization QC value.
  • dQc.txt gives the difference between pre and post QC values.
  • scaled_dQc.txt gives the fraction difference (preQc - postQc)/(preQc).
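As a worked example with made-up values: if preQc = 0.10 and postQc = 0.04, then dQc = 0.10 - 0.04 = 0.06 and scaled_dQc = (0.10 - 0.04)/(0.10) = 0.6.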


back to top


6. Call segmented copy number variants using CallSegments

In this final step, we call segmented copy number variants. The tool makes one of three calls for each segment--neutral (0), deletion (-) or amplification (+). These deleted or amplified segments could represent somatic events.

java -jar gatk4.jar CallSegments \
    -TN sandbox/tumor.tn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox/tumor.called

View the results with cat sandbox/tumor.called.

image

Besides the last column, how is this result different from that of step 4?


back to top


7. Discussion of interest to some

☞ Why can't I use just a matched normal?

Let’s compare results from the raw coverage (top), from normalizing using the matched-normal only (middle) and from normalizing using the PoN (bottom).

image

image

image

What is different between the plots? Look closely.

The matched-normal normalization appears to perform well. However, its noisiness brings uncertainty to any call that would be made, even if visible by eye. Furthermore, its level of noise obscures detection of the 4th variant that the PoN normalization reveals.

☞ How do the results compare to SNP6 analyses?

As with any algorithmic analysis, it’s good to confirm results with orthogonal methods. If we compare calls from the original unscrambled tumor data against GISTIC SNP6 array analysis of the same sample, we similarly find three deletions and a single large amplification.

back to top


Testing FPGA implementation of HaplotypeCaller (PairHMM)


Hi,
We are two researchers from the Politecnico di Milano.
We are trying to test the FPGA implementation of the HaplotypeCaller (PairHMM) on GATK 3.8-0-ge9d806836, using a Terasic DE5a-Net (Arria 10, 10AX115N3F45I2SG).

According to the version highlights for GATK 3.8 (https://gatkforums.broadinstitute.org/gatk/discussion/10063/version-highlights-for-gatk-version-3-8) FPGA support was added to pairHMM, and it should be used if the appropriate hardware is detected.
However, from our tests it seems that the CPU implementation is being used instead. Is there a way to enforce the usage of the FPGA implementation?

Kind regards,
Chiara & Alberto

Read groups


There is no formal definition of what a read group is, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

Meaning of the read group fields required by GATK

  • ID = Read group identifier
    This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.
    Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects; therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since all reads within a read group are assumed to share the same error model.

  • PU = Platform Unit
    The PU holds three types of information: the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. The PU is not required by GATK, but it takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

  • SM = Sample
    The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.

  • PL = Platform/technology used to produce the read
    This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

  • LB = DNA preparation library identifier
    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.
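For example, here is a sketch of such a Picard command using the read group values from the example @RG line shown earlier (the file names are placeholders):

java -jar picard.jar AddOrReplaceReadGroups \
    I=sample.bam \
    O=sample.rg.bam \
    RGID=H0164.2 \
    RGPL=illumina \
    RGPU=H0164ALXX140820.2 \
    RGLB=Solexa-272222 \
    RGSM=NA12878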


Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portions of the read names, which come after the flow cell lane and are separated by colons, are the tile number, the x-coordinate of the cluster and the y-coordinate of the cluster.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane
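As a small shell sketch (assuming the Broad-style read names above), the flowcell barcode and lane that make up these fields can be pulled out of a read name like so:

echo "H0164ALXX140820:2:1101:10003:23460" | awk -F: '{print $1"."$2}'

This prints H0164ALXX140820.2, i.e. the {FLOWCELL_BARCODE}.{LANE} portion of the PU field; the sample barcode, if any, comes from the sample sheet rather than the read name.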

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship of read groups (unique for each lane) to libraries (each sequenced on two lanes) and samples (each spanning four lanes, two lanes per library).

Understanding and adapting the generic hard-filtering recommendations


This document aims to provide insight into the logic of the generic hard-filtering recommendations that we provide as a substitute for VQSR. Hopefully it will also serve as a guide for adapting these recommendations or developing new filters that are appropriate for datasets that diverge significantly from what we usually work with.


Introduction

Hard-filtering consists of choosing specific thresholds for one or more annotations and throwing out any variants that have annotation values above or below the set thresholds. By annotations, we mean properties or statistics that describe for each variant e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation, and so on.

The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.

In contrast, VQSR is more powerful because it uses machine-learning algorithms to learn from the data what are the annotation profiles of good variants (true positives) and of bad variants (false positives) in a particular dataset. This empowers you to pull out variants based on how they cluster together along different dimensions, and liberates you to a large extent from the linear tyranny of single-dimension thresholds.

Unfortunately this method requires a large number of variants and well-curated known variant resources. For those of you working with small gene panels or with non-model organisms, this is a deal-breaker, and you have to fall back on hard-filtering.


Outline

In this article, we illustrate how the generic hard-filtering recommendations we provide relate to the distribution of annotation values we typically see in callsets produced by our variant calling tools, and how this in turn relates to the underlying physical properties of the sequence data.

We also use results from VQSR filtering (which we take as ground truth in this context) to highlight the limitations of hard-filtering.

We do this in turn for each of six annotations that are highly informative among the recommended annotations: QD, FS, SOR, MQ, MQRankSum and ReadPosRankSum. The same principles can be applied to most other annotations produced by GATK tools.
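For concreteness, here is a sketch of how these thresholds are typically applied to a SNP callset with VariantFiltration; the file names and filter name are placeholders, and the expression simply strings together the generic recommendations discussed below:

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R reference.fasta \
    -V raw_snps.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    --filterName "generic_snp_hard_filter" \
    -o filtered_snps.vcf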


Overview of data and methods

Origin of the dataset

We called variants on a whole genome trio (samples NA12878, NA12891, NA12892, previously pre-processed) using HaplotypeCaller in GVCF mode, yielding a gVCF file for each sample. We then joint-genotyped the gVCFs using GenotypeGVCFs, yielding an unfiltered VCF callset for the trio. Finally, we ran VQSR on the trio VCF, yielding the filtered callset. We will be looking at the SNPs only.

Plotting methods and interpretation notes

All plots shown below are density plots generated using the ggplot2 library in R. On the x-axis are the annotation values, and on the y-axis are the density values. The area under the density plot gives you the probability of observing the annotation values. So, the entire area under all of the plots will be equal to 1. However, if you would like to know the probability of observing an annotation value between 0 and 1, you will have to take the area under the curve between 0 and 1.

In plain English, this means that the plots show you, for a given set of variants, what the distribution of their annotation values is. The caveat is that when we're comparing two or more sets of variants on the same plot, we have to keep in mind that they may contain very different numbers of variants, so the number of variants in a given part of the distribution is not directly comparable; only their proportions are comparable.


QualByDepth (QD)

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.
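As a quick worked example with made-up numbers: a site with QUAL = 600 whose variant-carrying samples have a combined unfiltered depth of 30 reads gets QD = 600 / 30 = 20.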

The generic filtering recommendation for QD is to filter out variants with QD below 2. Why is that?

First, let’s look at the distribution of QD values for unfiltered variants. Notice the values can be anywhere from 0 to 40. There are two peaks where the majority of variants are (around QD = 12 and QD = 32). These two peaks correspond to variants that are mostly observed in heterozygous (het) versus mostly homozygous-variant (hom-var) states, respectively, in the called samples. This is because hom-var samples contribute twice as many reads supporting the variant as het samples do. We also see, to the left of the distribution, a "shoulder" of variants with QD hovering between 0 and 5.

image

We expect to see a similar distribution profile in callsets generated from most types of high-throughput sequencing data, although values where the peaks form may vary.

Now, let’s look at the plot of QD values for variants that passed VQSR and those that failed VQSR. Red indicates the variants that failed VQSR, and blue (green?) the variants that passed VQSR.

image

We see that the majority of variants filtered out correspond to that low-QD "shoulder" (remember that since this is a density plot, the y-axis indicates proportion, not number of variants); that is what we would filter out with the generic recommendation of the threshold value 2 for QD.

Notice however that VQSR has failed some variants that have a QD greater than 30! All those variants would have passed the hard filter threshold, but VQSR tells us that these variants looked artifactual in one or more other annotation dimensions. Conversely, although it is not obvious in the figure, we know that VQSR has passed some variants that have a QD less than 2, which hard filters would have eliminated from our callset.


FisherStrand (FS)

This is the Phred-scaled probability that there is strand bias at the site. Strand bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.

Note: SB, SOR and FS are related but not the same! They all measure strand bias (a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other) in different ways. SB gives the raw counts of reads supporting each allele on the forward and reverse strand. FS is the result of using those counts in a Fisher's Exact Test. SOR is a related annotation that applies a different statistical test (using the SB counts) that is better for high coverage data.

Let’s look at the FS values for the unfiltered variants. The FS values have a very wide range; we made the x-axis log-scaled so the distribution is easier to see. Notice most variants have an FS value less than 10, and almost all variants have an FS value less than 100. However, there are indeed some variants with a value close to 400.

image

The plot below shows FS values for variants that passed VQSR and failed VQSR.

image

Notice most of the variants that fail have an FS value greater than 55. Our hard filtering recommendations tell us to fail variants with an FS value greater than 60. Notice that although we are able to remove many false positives by removing variants with FS greater than 60, we still keep many false positive variants. If we move the threshold to a lower value, we risk losing true positive variants.


StrandOddsRatio (SOR)

This is another way to estimate strand bias using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Reads at the ends of exons tend to only be covered by reads in one direction and FS gives those variants a bad score. SOR will take into account the ratios of reads that cover both alleles.

Let’s look at the SOR values for the unfiltered variants. The SOR values range from 0 to greater than 9. Notice most variants have an SOR value less than 3, and almost all variants have an SOR value less than 9. However, there is a long tail of variants with a value greater than 9.

image

The plot below shows SOR values for variants that passed VQSR and failed VQSR.

image

Notice most of the variants that have an SOR value greater than 3 fail the VQSR filter. Although there is a non-negligible population of variants with an SOR value less than 3 that failed VQSR, our hard filtering recommendation of failing variants with an SOR value greater than 3 will at least remove the long tail of variants that show fairly clear bias according to the SOR test.


RMSMappingQuality (MQ)

This is the root mean square mapping quality over all the reads at the site. Instead of the average mapping quality of the site, this annotation gives the square root of the average of the squares of the mapping qualities at the site. It is meant to incorporate the standard deviation of the mapping qualities: including the standard deviation allows us to account for the variation in the dataset. A low standard deviation means the values are all close to the mean, whereas a high standard deviation means the values are all far from the mean. When the mapping qualities are good at a site, the MQ will be around 60.
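As a toy illustration with made-up values: three reads with mapping qualities 60, 60 and 20 give MQ = sqrt((60^2 + 60^2 + 20^2) / 3) ≈ 50.3, so even a single poorly mapped read pulls the value noticeably below 60.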

Now let’s check out the graph of MQ values for the unfiltered variants. Notice the very large peak around MQ = 60. Our recommendation is to fail any variant with an MQ value less than 40.0. You may argue that hard filtering any variant with an MQ value less than 50 is fine as well. This brings up an excellent point that our hard filtering recommendations are meant to be very lenient. We prefer to keep all potentially decent variants rather than get rid of a few bad variants.

image

Let’s look at the VQSR pass vs fail variants. At first glance, it seems like VQSR has passed the variants in the high peak and failed any variants not in the peak.

image

It is hard to tell which variants passed and failed, so let’s zoom in and see what exactly is happening.

image

The plot above shows the x-axis from 59-61. Notice the variants in blue (the ones that passed) all have MQ around 60. However, some variants in red (the ones that failed) also have an MQ around 60.


MappingQualityRankSumTest (MQRankSum)

This is the u-based z-approximation from the Rank Sum Test for mapping qualities. It compares the mapping qualities of the reads supporting the reference allele and the alternate allele. A positive value means the mapping qualities of the reads supporting the alternate allele are higher than those supporting the reference allele; a negative value indicates the mapping qualities of the reference allele are higher than those supporting the alternate allele. A value close to zero is best and indicates little difference between the mapping qualities.

Next, let’s look at the distribution of values for MQRankSum in the unfiltered variants. Notice the values range from approximately -10.5 to 6.5. Our hard filter threshold is -12.5. There are no variants in this dataset that have MQRankSum less than -10.5! In this case, hard filtering would not fail any variants based on MQRankSum. Remember, our hard filtering recommendations are meant to be very lenient. If you do plot your annotation values for your samples and find none of your variants have MQRankSum less than -12.5, you may want to refine your hard filters. Our recommendations are indeed recommendations that you the scientist will want to refine yourself.

image

Looking at the plot of pass VQSR vs fail VQSR variants, we see the variants with an MQRankSum value less than -2.5 fail VQSR. However, the region between -2.5 to 2.5 contains both pass and fail variants. Are you noticing a trend here? It is very difficult to pick a threshold for hard filtering. If we pick -2.5 as our hard filtering threshold, we still have many variants that fail VQSR in our dataset. If we try to get rid of those variants, we will lose some good variants as well. It is up to you to decide how many false positives you would like to remove from your dataset vs how many true positives you would like to keep and adjust your threshold based on that.

image


ReadPosRankSumTest (ReadPosRankSum)

This is the u-based z-approximation from the Rank Sum Test for site position within reads. It compares whether the positions of the reference and alternate alleles are different within the reads. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele; a positive value indicates that the reference allele is found at the ends of reads more often than the alternate allele. A value close to zero is best because it indicates there is little difference between the positions of the reference and alternate alleles in the reads.

The last annotation we will look at is ReadPosRankSum. Notice the values fall mostly between -4 and 4. Our hard filtering threshold removes any variant with a ReadPosRankSum value less than -8.0. Again, there are no variants in this dataset that have a ReadPosRankSum value less than -8.0, but some datasets might. If you plot your variant annotations and find there are no variants that have a value less than or greater than one of our recommended cutoffs, you will have to refine them yourself based on your annotation plots.

[Image: distribution of ReadPosRankSum values in the unfiltered variants]

Looking at the VQSR-passing vs VQSR-failing variants, we can see that VQSR fails variants with ReadPosRankSum values below -1.0 and above 3.5. However, notice that VQSR also fails some variants whose ReadPosRankSum values sit in the same range as the passing variants.

[Image: ReadPosRankSum density plot for VQSR-passing vs. VQSR-failing variants]
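
To make this concrete, here is a hedged sketch of how the hard-filter thresholds discussed in this post could be applied with VariantFiltration (GATK3-style syntax; reference.fasta and unfiltered.vcf are placeholder names, and the expressions below only cover the annotations discussed above, not the full set of recommended filters):

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R reference.fasta \
    -V unfiltered.vcf \
    --filterExpression "MQ < 40.0" --filterName "MQ40" \
    --filterExpression "MQRankSum < -12.5" --filterName "MQRankSum-12.5" \
    --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum-8" \
    -o hard_filtered.vcf

Variants that trip an expression are marked with the corresponding filter name in the FILTER column rather than removed, so you can always revisit the thresholds after plotting your annotations.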

GATK 4.0 will be released Jan 9, 2018

A long time ago in a galaxy far, far away, we started work on a brand new version of GATK in which the engine framework was to be completely revamped, streamlined and accelerated, with support for cloud technologies and an impressively expanded scope of analysis (copy number! structural variation! somatic and germline versions of everything!). Oh, and it would be fully open-source.

Today that new beginning is tantalizingly close to fruition. We've had a series of beta versions out for preview for about three months, and we've actually had several segments of our genome production pipeline running a subset of fully-vetted GATK4 tools for over a year. Aside from a few remaining technical issues that are actively being addressed, the work left to be done before general release mainly involves clean-up and streamlining of user-facing functionality: what gets logged and how, argument names and syntax, documentation and so on.

So it's time to set a date and put a ring on it! I'm thrilled to announce the happy event will take place on Jan 9, 2018.


Wait, why January, you ask? Earlier this summer I announced that we hoped to push out the general 4.0 release by the end of September. Obviously it's now mid-October and it's not out, so what's up with that?

To be quite frank, what's up is mainly that my initial assessment was not sufficiently realistic. I underestimated how much time it would take to identify and resolve technical issues during the beta testing phase (things I learned: humans like to take vacations during summer months) and how much work we still needed to do from the support side to make the migration to GATK4 as painless as it could be.

So when we realized we were going to blow straight through my September estimate, we undertook a much more thorough status review. We formulated the Jan 9 release date based on a generous budgeting of time that assumes the work will be complete by late November, early December at the latest, giving us a few weeks' worth of padding to cope with any last-minute surprises. And this avoids the end of year holiday period, which is a bad time to release anything anyway -- except perhaps a final beta for any early adopters out there looking for an excuse to get away from their in-laws.

As we get closer to the big day we'll post some additional details of all the goodies that will accompany the release, so stay tuned for further announcements on this blog.

How does GATK4 Mutect2 know which -I is tumor?

Oncotator error---IndexError: list index out of range

I used Oncotator (1.9.3.0) to annotate a VCF file.
If I run this on the command line:
oncotator -v --input_format VCF --output_format TCGAMAF --db-dir /share/apps/oncotator_v1_ds_April052016/ -d . test_oncotator_ffpe.vcf oncotator.maf hg19

There will be an error:

2017-10-17 11:43:39,572 ERROR [oncotator.output.TcgaMafOutputRenderer:333] Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/oncotator/output/TcgaMafOutputRenderer.py", line 317, in renderMutations
    self._add_output_annotations(m)
  File "build/bdist.linux-x86_64/egg/oncotator/output/TcgaMafOutputRenderer.py", line 241, in _add_output_annotations
    alt_count = vals[1]
IndexError: list index out of range

2017-10-17 11:43:39,572 ERROR [oncotator.output.TcgaMafOutputRenderer:334] Error at mutation 0 ['1', '11166639', '11166639', 'T', 'A']:
2017-10-17 11:43:39,572 ERROR [oncotator.output.TcgaMafOutputRenderer:335] Incomplete: rendered 0 mutations.
Traceback (most recent call last):
  File "/share/apps/oncotator/bin/oncotator", line 11, in <module>
    load_entry_point('Oncotator==1.9.3.0', 'console_scripts', 'oncotator')()
  File "build/bdist.linux-x86_64/egg/oncotator/Oncotator.py", line 309, in main
  File "build/bdist.linux-x86_64/egg/oncotator/Annotator.py", line 437, in annotate
  File "build/bdist.linux-x86_64/egg/oncotator/output/TcgaMafOutputRenderer.py", line 337, in renderMutations
IndexError: list index out of range

If I run this on the command line:
oncotator -v --input_format VCF --output_format VCF --db-dir /share/apps/oncotator_v1_ds_April052016/ -d . test_oncotator_ffpe.vcf oncotator.maf hg19

There is no error. This VCF was produced by VarScan 2.3.4 (fileformat=VCF4.1) and normalized with bcftools.

And if I run this on the command line:
oncotator -v --input_format VCF --output_format TCGAMAF --db-dir /share/apps/oncotator_v1_ds_April052016/ -d . test_oncotator_wes.vcf oncotator.maf hg19

There is no error. This VCF was produced by GATK 3.8 (fileformat=VCF4.2) and normalized with bcftools.
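
Judging from the traceback, the TCGAMAF renderer appears to split an allele-depth-style value on "," and take the second element (alt_count = vals[1]), which would fail if that field holds a single number; VarScan-style VCFs often encode read depths differently from GATK, which could explain why only the first file trips the error. This is a hedged guess at the cause, not a confirmed diagnosis. One way to check what the per-sample AD field actually contains, assuming bcftools is available, is:

bcftools query -f '%CHROM\t%POS[\t%SAMPLE=%AD]\n' test_oncotator_ffpe.vcf | head

If AD shows a single value instead of a ref,alt pair, that would be consistent with the IndexError above.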


about genotypegvcfs

I want to know about the algorithm GenotypeGVCFs uses to convert the raw gVCF files from different samples into a raw multi-sample VCF.
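
For reference, a typical GATK3-style invocation looks roughly like this (file names are placeholders; each per-sample gVCF comes from HaplotypeCaller run in -ERC GVCF mode):

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R reference.fasta \
    --variant sample1.g.vcf \
    --variant sample2.g.vcf \
    --variant sample3.g.vcf \
    -o joint_genotyped.vcf

Roughly speaking, the tool considers every site where at least one sample shows evidence of variation in its gVCF, combines the per-sample genotype likelihoods stored in those records, and re-computes the site QUAL, allele counts and genotypes jointly across all samples.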

final result

After running GenotypeGVCFs and obtaining the joint VCF, how can we get a result for each individual sample? For example, if we run HaplotypeCaller on 100 samples and get a gVCF for each one, then after joint genotyping produces a single multi-sample VCF, how do we produce 100 per-sample results?
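
One hedged way to get a per-sample view out of the joint VCF is SelectVariants, run once per sample (GATK3-style syntax; sample and file names are placeholders):

java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R reference.fasta \
    -V joint_genotyped.vcf \
    -sn Sample_001 \
    --excludeNonVariants \
    -o Sample_001.vcf

The --excludeNonVariants flag drops sites that become non-variant once only the selected sample is kept, which is usually what is meant by "the result for every sample".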

Install GATK 1.0

We need to re-analyze some data produced a few years ago, so I need to install the first GATK version on my local machine. From the following link https://github.com/broadgsa/gatk/releases?after=1.5 I downloaded the tar.gz file, but when uncompressing it I obtain a folder containing only two subfolders: lib and src. There is no information on how to install this version. Any hints? Thanks in advance.
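
I can’t verify the exact layout of that tarball, but old GATK releases were built from source with Apache Ant rather than shipped as a ready-made jar. If the unpacked folder (or its src subfolder) contains a build.xml, something along these lines may work; treat it as an assumption, not a tested recipe, and note that GATK 1.x likely needs an older JDK (Java 6/7):

cd gatk-1.x        # the unpacked source folder (placeholder name)
ant                # assumes an Ant build.xml that produces dist/GenomeAnalysisTK.jar
java -jar dist/GenomeAnalysisTK.jar --help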

GATK Community - Take this survey, win a prize!

As many of you know, GATK4 Beta is out and we are excited for the full GATK4 release in the new year. It has been a long time coming, and we hope that many of you have gotten to experiment with its features before the big release. In fact, we’d like to know if you’ve tried it out! We crafted a survey about your experience with GATK, the Beta release, your thoughts on the upcoming GATK4 release, and the infrastructure you run on, to help improve our team’s communication, support, and product development efforts. If you have not used the Beta or do not know much about GATK4, please tell us in the survey, as that is very useful for us to know. This survey is for anyone who has ever used any GATK version.

The Survey

The survey is 27 questions long and should take 15-20 minutes to complete. We want to compensate you for your time, so we have secured some reward funding from the Intel Center for Genomic Data Engineering, established at the Broad Institute in 2017.

We will randomly draw 100 survey participants and offer each winner one prize of their choosing:
* $50 Amazon gift card (85 available)
* $250 FireCloud credit (10 available)
* $500 FireCloud credit (5 available)


In the survey, we ask that you leave your contact email and rank the prizes in the order you’d like to receive them. We will allocate them as we randomly draw winners. This means if we have run out of FireCloud credit and you are the sixth person who put this as your number one choice, we will allocate you your second choice. The survey will be live for one month.

We know that everyone is busy with their work, and we think that these gifts, and the action your feedback will generate, are worth the twenty minutes spared. The FireCloud credit reward could even go toward funding your research! Whether you win something or not, your feedback serves a greater purpose: your opinions will help the Broad team and the collaborators at the Intel Center learn how to improve the GATK community’s experience by understanding its needs.

Thank you for reading & good luck!

Note on prizes: Amazon gift cards will be issued for use in your home country.

When will GATK4 release to general availability status?

We've already used GATK4 beta 5 for testing and would like to know when GATK4 will be released to general availability status. Thanks.
