ploidy level in HaplotypeCaller and GenotypeGVCFs

January 31, 2018, 4:40 am

Hello,
I am trying to call SNPs from chloroplast DNA reads from 85 samples, using GATK v4.0.
First, I used HaplotypeCaller to produce per-sample GVCFs with the default ploidy setting. Joint calling with GenotypeGVCFs then took only a few minutes. But then I realized that the sample ploidy should be set to 1, since I am working with chloroplast data.
I changed nothing except adding “-ploidy 1” when running HaplotypeCaller, and it worked. However, when I ran “gatk GenotypeGVCFs” with default settings, the program hung at “WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples” for hours and hours. I then tried “-ploidy 85” when running GenotypeGVCFs but got the same problem.
I wonder what is wrong with the ploidy setting.
Also, I found that HaplotypeCaller with “-ploidy 1” detected far fewer SNPs than the run with the default setting. I assume this is reasonable, right?

Looking forward to your reply! Thanks a lot in advance!

↧

PE mates are lost while downsampling with -dfrac?

November 13, 2017, 7:12 am

Hi GATK team and users,

I am using PrintReads with the -dfrac option to simulate different depths of coverage. The original data are WGS, PE reads (from the GATK Bundle BAM; PrintReads with -L 20, -dfrac 0.18). I'm using gatk-3.7.

I think that the PE mates are lost while downsampling (first observed in IGV with 'view as pairs').

samtools flagstat still reports 96.64% of reads as "properly paired", but I guess that is because the flags are inherited from the original BAM reads.

samtools flagstat ./NA12878/CEUTrio.HiSeq.WGS.b37.NA12878.L20.dfrac0.18.bam:

##  9278360 + 0 in total (QC-passed reads + QC-failed reads)
## ...
##  8966551 + 0 properly paired (96.64% : N/A)
##  9023705 + 0 with itself and mate mapped
## ...

82% of the reads have a unique name (rather than a name shared by two reads, as expected for PE data).

samtools view ./NA12878/CEUTrio.HiSeq.WGS.b37.NA12878.L20.dfrac0.18.bam | awk '{print $1}' | sort -n | uniq -c | awk '{print $1}' | sort -n | uniq -c

## 7612964 1
##   832698 2

Is there a way to downsample a BAM file while keeping read pairs intact, to simulate having less data that is still properly paired?

Thanks a lot for any help/discussion,
EsterQ
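
One way to keep mates together is to make the keep/drop decision a function of the read name instead of the individual record, so both mates always share the same fate. As far as I know, this is also how `samtools view -s` and Picard's DownsampleSam behave. Below is a minimal Python sketch of the idea on SAM text lines. The function names and the toy records are made up for illustration, and this is not how PrintReads works.

```python
import hashlib

def keep_template(read_name, fraction, seed=0):
    """Decide once per read name, so both mates get the same answer."""
    digest = hashlib.md5(f"{seed}:{read_name}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < fraction

def downsample_sam_lines(lines, fraction, seed=0):
    """Yield header lines unchanged and alignment lines whose QNAME is kept."""
    for line in lines:
        if line.startswith("@"):
            yield line
        elif keep_template(line.split("\t", 1)[0], fraction, seed):
            yield line

# Toy input (hypothetical): 1000 read pairs on contig 20.
sam = ["@HD\tVN:1.6"]
for i in range(1000):
    sam.append(f"read{i}\t99\t20\t{100 + i}\t60\t100M\t=\t{300 + i}\t300\t*\t*")
    sam.append(f"read{i}\t147\t20\t{300 + i}\t60\t100M\t=\t{100 + i}\t-300\t*\t*")

kept = [l for l in downsample_sam_lines(sam, 0.18) if not l.startswith("@")]
```

Because the decision depends only on the name and the seed, roughly 18% of templates survive and every surviving name still appears twice.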

↧

IndelRealignment step: Lost of reads?

January 31, 2018, 6:31 am

I'm using the IndelRealigner tool on my BAM file and then counting the reads with samtools stats. My input BAM has 167170574 mapped reads, while the output BAM of the IndelRealigner step has 121608609. Is this expected behaviour?

Thank you in advance,

↧

Is GATK4 HaplotypeCaller in evaluation phase?

January 31, 2018, 7:29 am

Hi GATK team,

Congratulations on the release! I just found this public method in FireCloud, which notes that HaplotypeCaller in GATK4 should not be used in production yet since it is still in the evaluation phase. The post was last updated on January 9th, the day of the GATK4 release. Is this statement still true? Could you provide more details about the HaplotypeCaller evaluation?

Thanks!

↧

Troubleshooting: ERROR - variant files have inconsistent references for the same position.

January 31, 2018, 8:26 am

Greetings,
I am hoping to get some help troubleshooting a frustrating error I am having while trying to genotype a large set of data. The source data is nearly 12000 WES samples, which were sequenced by a third-party company, so I am assuming it is worth the money that was spent. I know they followed best practices and used the same reference file for all samples. I have the gvcf files for the entire set, and I have successfully genotyped the entire WES intervals, as well as subset the gvcf files for 74 genes and successfully genotyped those. All of this was done with GATK v3.7.

I now have a third set of intervals (SXP) I am trying to process. SelectVariants with this interval set works fine. I create 40 cohort.g.vcf files with roughly 290 samples in each, and this process has worked without any errors in all three use cases.

However, now with these SXP cohorts, I get about 2.5% of the way through GenotypeGVCFs and then receive an error:

##### ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 1:62732364, A* vs. G*

I identified a single cohort that has this ref anomaly. I looked for it in the individual SXP subset g.vcfs of all the samples in that cohort, but cannot find a single sample with that position as such; I have no idea where it comes from. I tried removing that position from the cohort g.vcf. I then received the same error, at a different position, in a different cohort, but I notice that it's technically happening in the same gene as the original error.

I removed that gene from my interval list, re-subset the entire sample set and made the same cohorts from the modified data; I received the same error, at a different position, in a different cohort, in a different gene.

I can find no evidence that these data had any sort of inconsistent reference when they were created, and again I have used them successfully a couple of times already, and so have other researchers working with the data files.

I do not understand where these genotypes are coming from. As I understand it, I cannot run ValidateVariants on gvcfs and get anything meaningful. Is there anything else I can do to find the issue, or is there a way to make GenotypeGVCFs move past these error positions? I think the only thing I haven't tried is upgrading to GATK 4.0, but I doubt it will make a difference. Thank you!

-bwubb
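
One way to hunt for the source of such a conflict is to scan every input file for records that start at the same position but disagree on the first REF base, which is what the "A* vs. G*" message appears to compare. A rough Python sketch follows. The file names and records are invented, and only the coordinate comes from the error above. It only compares records that start at the same position, so a reference block that merely spans the site will not be flagged.

```python
from collections import defaultdict

def find_ref_conflicts(vcf_texts):
    """Return {(chrom, pos): {first REF base: [source names]}} where sources disagree.

    vcf_texts: dict mapping a source name to an iterable of VCF/gVCF text lines.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for name, lines in vcf_texts.items():
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue  # skip header and empty lines
            chrom, pos, _id, ref = line.split("\t")[:4]
            seen[(chrom, int(pos))][ref[0].upper()].append(name)
    return {site: dict(refs) for site, refs in seen.items() if len(refs) > 1}

# Hypothetical cohort files; only the coordinate is taken from the error message.
cohorts = {
    "cohort_01.g.vcf": ["##fileformat=VCFv4.2",
                        "1\t62732364\t.\tA\t<NON_REF>\t.\t.\tEND=62732370"],
    "cohort_02.g.vcf": ["1\t62732364\t.\tG\tT,<NON_REF>\t50\t.\t."],
    "cohort_03.g.vcf": ["1\t62732364\t.\tA\t<NON_REF>\t.\t.\t.",
                        "1\t62732400\t.\tC\t<NON_REF>\t.\t.\t."],
}
conflicts = find_ref_conflicts(cohorts)
```

Running the same scan over the per-sample g.vcfs and then over the cohort g.vcfs should show at which step the disagreeing record first appears.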

↧

Did GATK 4 break taking files containing lists of files from command line?

January 31, 2018, 8:40 am

In GATK 3 I used to be able to provide the list of input files to GATK as a file containing a list of files rather than repeating the command line argument. Now with GATK 4 I get the error message:

Cannot read list.list because no suitable codecs found

Am I missing something or was this functionality removed?

↧

# in file names converted to %23 resulting in file not found

January 31, 2018, 9:17 am

Whilst I was trying to run CombineGVCFs 4.0.0 I got a very strange error, file not found for a file I knew existed. Looking into the backtrace it looks like somehow a # is getting mistakenly URL escaped?

org.broadinstitute.hellbender.exceptions.GATKException: Error initializing feature reader for path /project/gvcf-pcr/23232_1#1/1.g.vcf.gz
    at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:341)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.getFeatureReader(FeatureDataSource.java:292)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:244)
    at org.broadinstitute.hellbender.engine.FeatureManager.addToFeatureSources(FeatureManager.java:202)
    at org.broadinstitute.hellbender.engine.MultiVariantWalker.lambda$initializeDrivingVariants$0(MultiVariantWalker.java:66)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
    at org.broadinstitute.hellbender.engine.MultiVariantWalker.initializeDrivingVariants(MultiVariantWalker.java:56)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.initializeFeatures(VariantWalkerBase.java:47)
    at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:558)
    at org.broadinstitute.hellbender.engine.MultiVariantWalker.onStartup(MultiVariantWalker.java:48)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:152)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:275)
Caused by: htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to create BasicFeatureReader using feature file , for input source: file:///project/gvcf-pcr/23232_1%231/1.g.vcf.gz
    at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:113)
    at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:337)
    ... 16 more
Caused by: java.io.FileNotFoundException: /project/gvcf-pcr/23232_1%231/1.g.vcf.gz (No such file or directory)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
    at htsjdk.samtools.seekablestream.SeekableFileStream.<init>(SeekableFileStream.java:47)
    at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:99)
    at htsjdk.tribble.readers.TabixReader.<init>(TabixReader.java:129)
    at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:83)
    at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:106)
    ... 17 more
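
The trace looks consistent with the path being turned into a file:// URI (where # has to be escaped as %23, since a bare # starts a URI fragment) and the still-escaped path then being opened as a file name. The Python snippet below only illustrates that general escape/unescape round trip with the path from the trace. It is not a reproduction of the htsjdk code.

```python
from urllib.parse import quote, unquote, urlsplit

path = "/project/gvcf-pcr/23232_1#1/1.g.vcf.gz"

# Building a URI: '#' is percent-encoded so it is not read as a fragment marker.
escaped_uri = "file://" + quote(path)

# Using the URI path as a file name without decoding keeps the %23.
wrong_path = urlsplit(escaped_uri).path

# Decoding first recovers the real file name.
right_path = unquote(urlsplit(escaped_uri).path)

# Without escaping, everything after '#' is parsed as a fragment instead.
naive = urlsplit("file://" + path)
```

Here `wrong_path` matches the name in the FileNotFoundException, while `right_path` matches the file that actually exists.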

↧

ReadBacked phasing vs Trio phasing?

January 31, 2018, 11:10 am

I think I understand the technical difference. But in terms of phasing quality, how does one compare to the other? Are there any publications/reports/blog posts comparing the two? Is there some quantifiable metric that shows how different the estimates are?

↧

Panel of Normals (PON)

December 26, 2017, 6:10 pm

A Panel of Normals, or PON, is a type of resource used in somatic variant analysis. Depending on the type of variant you're looking for, the PON will be generated differently. What all PONs have in common is that (1) they are made from normal samples (in this context, "normal" means derived from healthy tissue that is believed to not have any somatic alterations) and (2) their main purpose is to capture recurrent technical artifacts in order to improve the results of the variant calling analysis.

As a result, the most important selection criteria for choosing normals to include in any PON are the technical properties of how the data was generated. It's very important to use normals that are as technically similar as possible to the tumor (same exome or genome preparation methods, sequencing technology and so on). Additionally, the samples should come from subjects that were young and healthy to minimize the chance of using as normal a sample from someone who has an undiagnosed tumor. Normals are typically derived from blood samples.

There is no definitive rule for how many samples should be used to make a PON (even a small PON is better than no PON) but in practice we recommend aiming for a minimum of 40.

At the Broad Institute, we typically make a standard PON for a given version of the pipeline (corresponding to the combination of all protocols used in production to generate the sequence data, starting from sample preparation and including the analysis software) and use it to process all tumor samples that go through that version of the pipeline. Because we process many samples in the same way, we are able to make PONs composed of hundreds of samples.

Variant type-specific recommendations are given below.

Short variants (SNVs and indels)

For short variant discovery, the PON is created by running the variant caller Mutect2 individually on a set of normal samples and combining the resulting variant calls with some criteria (e.g. excluding any sites that are not present in at least 2 normals) as defined in the Best Practices documentation. This produces a sites-only VCF file that can be used as PON for Mutect2.
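
To make the example criterion concrete ("present in at least 2 normals"), here is a minimal Python sketch that counts in how many normals each site was called and keeps the recurrent ones. The sample names, sites, and the `pon_sites` helper are hypothetical. It shows only the selection rule, not how the actual PON creation tooling is implemented.

```python
from collections import Counter

def pon_sites(normal_calls, min_normals=2):
    """Keep sites called in at least `min_normals` different normals.

    normal_calls: dict mapping sample name -> iterable of (chrom, pos, ref, alt).
    """
    counts = Counter()
    for sites in normal_calls.values():
        counts.update(set(sites))  # count each site at most once per normal
    return sorted(site for site, n in counts.items() if n >= min_normals)

# Made-up per-normal call sets.
normals = {
    "normal_A": [("1", 1000, "A", "T"), ("1", 2000, "G", "C")],
    "normal_B": [("1", 1000, "A", "T"), ("2", 500, "C", "CA")],
    "normal_C": [("1", 2000, "G", "C"), ("1", 2000, "G", "C"), ("3", 42, "T", "G")],
}
panel = pon_sites(normals)
```

Sites seen in a single normal are dropped, which is what keeps the panel focused on recurrent artifacts.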

Copy Number Variants

For CNV discovery, the PON is created by running the initial coverage collection tools individually on a set of normal samples and combining the resulting copy ratio data using a dedicated PON creation tool. This produces a binary file that can be used as PON.

↧

HaplotypeCaller Reference Confidence Model (GVCF mode)

December 28, 2017, 3:02 pm

This document describes the reference confidence model applied by HaplotypeCaller to generate a per-sample GVCF, invoked by -ERC GVCF or -ERC BP_RESOLUTION.

As explained here, HaplotypeCaller works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. At that point, we can calculate the likelihoods of each possible genotype and emit variant calls.

What that article does not explain is how HaplotypeCaller additionally estimates the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

Estimate the confidence that no SNP exists at the site by contrasting all reads with the REF base vs. all reads with any non-reference base.
Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the less confident of these two models. We use a symbolic ALT allele, <NON_REF>, to hold the likelihood that the site is not homozygous reference, as well as allele-specific AD and PL field values.

We do this at all sites in the territory covered by the analysis, including homozygous-reference sites, both inside and outside the ActiveRegions determined by HaplotypeCaller.
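
As a small illustration of how GQ follows from the PLs, the Python sketch below converts made-up genotype likelihoods into normalized PLs and takes GQ as the gap between the two lowest PLs, capped at 99. Picking the model with the lower GQ is my reading of "the less confident of these two models". The likelihood values are invented and this is not the HaplotypeCaller implementation.

```python
import math

def phred_scaled_likelihoods(likelihoods):
    """Convert genotype likelihoods to normalized PLs (best genotype gets 0)."""
    raw = [-10 * math.log10(p) for p in likelihoods]
    best = min(raw)
    return [round(x - best) for x in raw]

def genotype_quality(pls, cap=99):
    """GQ: difference between the second-lowest and lowest PL, capped."""
    lowest, second = sorted(pls)[:2]
    return min(second - lowest, cap)

# Invented likelihoods for hom-ref, het ref/<NON_REF>, hom <NON_REF>.
snp_pls = phred_scaled_likelihoods([1.0, 1e-3, 1e-30])
indel_pls = phred_scaled_likelihoods([1.0, 1e-2, 1e-20])

# Report the PLs of whichever model is less confident (lower GQ).
reported_pls = min(snp_pls, indel_pls, key=genotype_quality)
reported_gq = genotype_quality(reported_pls)
```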

↧

Concatenating GVCF files in GATK4

January 31, 2018, 1:29 pm

Hello,

I am currently running 60 whole genome sequences, broken up into a number of non-overlapping intervals, through HaplotypeCaller in GATK4. After I have created the GVCF files, what is the best method for concatenating those files for each sample before using CombineGVCFs to combine the files prior to running GenotypeGVCFs? I saw a mention of CatVariants in an older thread. Is this tool included in GATK4, or will I need to use something else?
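
For what it's worth, concatenating non-overlapping shards amounts to keeping one header and appending the records in genomic order. I believe the Picard tools bundled with GATK4 (e.g. GatherVcfs or MergeVcfs) cover this, but please check the tool docs. A bare-bones Python sketch of the operation on uncompressed VCF text, with made-up shard contents:

```python
def concat_vcf_shards(shards):
    """Concatenate shards that share a header and are already in genomic order.

    shards: list of line lists, one per interval, for a single sample.
    """
    out = []
    for i, lines in enumerate(shards):
        for line in lines:
            if line.startswith("#"):
                if i == 0:
                    out.append(line)  # keep the header of the first shard only
            else:
                out.append(line)
    return out

header = ["##fileformat=VCFv4.2",
          "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1"]
shard_1 = header + ["1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=199\tGT\t0/0"]
shard_2 = header + ["1\t200\t.\tC\t<NON_REF>\t.\t.\tEND=299\tGT\t0/0"]
merged = concat_vcf_shards([shard_1, shard_2])
```

This does no sorting or index creation, so a dedicated tool is the safer choice for real data.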

↧

How is a haplotype called by HaplotypeCaller across the genome with RADseq data?

January 31, 2018, 2:49 pm

Hi,
I have a question about how a haplotype is called by HaplotypeCaller across the genome with reduced representation sequencing data. I have ddRADseq data from a diploid organism and I used HaplotypeCaller to get the raw vcf file. I saw that some heterozygous SNP sites were phased; however, I also found some unphased heterozygous sites in the vcf file. I guess this is because there was not enough information available to phase the sequence.
I wonder how the program deals with reduced representation sequencing data when calling a haplotype across the whole genome.
Also, I was wondering whether I should exclude unphased heterozygous sites from my downstream analysis and, if so, how I can do that.

Hope my questions make sense. Thanks!
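
If excluding them turns out to be what you want, one possible filter is sketched below in Python. It assumes a heterozygous call counts as phased when GT uses "|" or when a PGT tag is present, which is my understanding of how HaplotypeCaller records physical phasing. The helper names and records are made up, and whether dropping such sites is appropriate for your analysis is a separate question.

```python
def is_unphased_het(sample):
    """sample: dict of FORMAT key -> value for one sample at one site."""
    gt = sample.get("GT", ".")
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles or len(set(alleles)) < 2:
        return False  # missing or homozygous call
    if "|" in gt:
        return False  # phased in GT itself
    return sample.get("PGT", ".") in (".", "")  # het without a phasing tag

def drop_unphased_hets(lines):
    """Yield header lines and records in which no sample is an unphased het."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        keys = fields[8].split(":")
        samples = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
        if not any(is_unphased_het(s) for s in samples):
            yield line

# Made-up single-sample records.
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1",
    "1\t100\t.\tA\tT\t50\t.\t.\tGT:AD:PGT:PID\t0/1:5,5:0|1:100_A_T",
    "1\t250\t.\tG\tC\t50\t.\t.\tGT:AD\t0/1:4,6",
    "1\t300\t.\tC\tG\t50\t.\t.\tGT:AD\t1/1:0,9",
    "1\t400\t.\tT\tA\t50\t.\t.\tGT\t0|1",
]
filtered = list(drop_unphased_hets(vcf))
```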

↧

Forgot to add -ERC GVCF when using haplotypecaller

January 31, 2018, 5:30 pm

Hi,
I have used HaplotypeCaller to call variants for each of my samples without -ERC GVCF. Thus, I cannot use GenotypeGVCFs to merge them together. What should I do to resolve this problem? Should I re-run them? There are a lot of samples. Although I can merge them together using bcftools, the result does not contain the 0/0 type; there are only three types: ./., 0/1 and 1/1. I want to get a final merged VCF file which contains 0/0, ./., 0/1 and 1/1.
Can anyone help me?

Cheers,
Jian

↧

How to Prepare the normal.bam and tumor.bam files

January 31, 2018, 11:55 pm

Dear Sir,

I am new and currently trying to learn whole exome analysis of breast cancer samples using the GDC Bioinformatics Pipeline https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/

The data .bam file was downloaded from the GDC legacy archive.
https://portal.gdc.cancer.gov/legacy-archive/files/9efa8d39-37e0-4236-9737-e14ddcfd93ff

The reference genome is downloaded from here
https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files
GRCh38.d1.vd1.fa.tar.gz

I was able to complete the Genome Alignment and Alignment Co-Cleaning steps. Next, I want to do the variant calling step using MuSE:

MuSE call -f <reference> -r <region> <tumor.bam> <normal.bam> -O <intermediate_muse_call.txt>

I don't know what the region is (is it the chromosome number, or the read group?).
I also don't know how to prepare the normal.bam and tumor.bam files. Please help.

Thanks
Dr. Prabhakar

↧

Current status of GATK4 GermlineCNVCaller tools and best practices.

January 31, 2018, 11:56 pm

Hi,

I would like to try out GATK4 for discovering or genotyping germline CNVs in a cohort of a few hundred whole genome sequenced samples. I work with non-human species data, but the genome sizes are about the same as human or smaller.

The best practice documentation for germline CNV calling is still empty.
https://software.broadinstitute.org/gatk/best-practices/workflow?id=11148

According to the gatk4-4.0.0.0-0 JAR file, germline CNV calling tools are already included:
java -jar ./gatk4-4.0.0.0-0/gatk-package-4.0.0.0-local.jar
USAGE: [-h]
--------------------------------------------------------------------------------------
Copy Number Variant Discovery: Tools that analyze read coverage to detect copy number variants.
AnnotateIntervals (BETA Tool) Annotates intervals with GC content
CallCopyRatioSegments (BETA Tool) Calls copy-ratio segments as amplified, deleted, or copy-number neutral
CombineSegmentBreakpoints (EXPERIMENTAL Tool) Combine the breakpoints of two segment files and annotate the resulting intervals with chosen columns from each file.
CreateReadCountPanelOfNormals (BETA Tool) Creates a panel of normals for read-count denoising
DenoiseReadCounts (BETA Tool) Denoises read counts to produce denoised copy ratios
DetermineGermlineContigPloidy (BETA Tool) Determines the baseline contig ploidy for germline samples given counts data.
GermlineCNVCaller (BETA Tool) Calls copy-number variants in germline samples given their counts and the output of DetermineGermlineContigPloidy.
ModelSegments (BETA Tool) Models segmented copy ratios from denoised read counts and segmented minor-allele fractions from allelic counts
PlotDenoisedCopyRatios (BETA Tool) Creates plots of denoised copy ratios
PlotModeledSegments (BETA Tool) Creates plots of denoised and segmented copy-ratio and minor-allele-fraction estimates

Can you give some more information about the current status of the GATK4 GermlineCNVCaller tools, and do you have an estimate of when the best practices for these tools will be available?

It would also be nice if you could give an idea of whether the GATK4 GermlineCNVCaller tools are expected to work for non-human species, e.g. other vertebrates, simple / complex plant genomes and bacteria.

Thank you.

↧

Confusion in using gVCF mode

February 1, 2018, 4:57 am

I have some questions about using HaplotypeCaller in gVCF mode (GATK4 Best Practices):

1- Should we run gVCF mode even when we have only one WES sample?

2- I have 3 WES samples. Should I use gVCF --> Consolidate --> GenotypeGVCFs --> VCF, or is it better to obtain the VCF directly from HaplotypeCaller and skip the subsequent steps?

3- If I have 3-5 WES samples, is it better to run HaplotypeCaller with multiple input (bams) or separately?

Regards.

↧

No variants with GenotypeGVCFs on polyploid samples

February 1, 2018, 5:54 am

Hello, I have two bam files with sequenced pooled individuals (16 and 20 individuals). I ran HaplotypeCaller (gatk 3.8), setting the ploidy option to 32 and 40 respectively for my two samples. When I ran GenotypeGVCFs, I got an empty file with no variants. Does GenotypeGVCFs not support these ploidy levels?
As I’m interested only in the DP and AD counts, I'm thinking of trying ploidy 2, but I don’t know whether this setting would change the AD and DP fields.

↧

AF in vcf files

February 1, 2018, 6:27 am

There is some inconsistency in the community on how to calculate the AF (allele frequency) value in vcf files. GATK calculates a hypothetical value (0, 0.5 or 1 for normal diploid organisms). Other callers will calculate the AF as an Alt Allele Frequency.

Is there a way to have both values in the VCF file from GATK?
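
To make the two definitions concrete, here is a small Python sketch that computes both numbers from a record's genotypes and AD values: the genotype-based frequency (ALT allele count over called alleles, which I believe is what the INFO AF reflects) and the read-based ALT allele fraction. The function names and the example values are made up.

```python
def genotype_af(gts):
    """ALT allele frequency from called genotypes: ALT alleles / called alleles."""
    alleles = [a for gt in gts
               for a in gt.replace("|", "/").split("/") if a != "."]
    return sum(a != "0" for a in alleles) / len(alleles)

def read_allele_fraction(ad):
    """ALT allele fraction from read depths: ALT reads / all informative reads."""
    ref, *alts = ad
    return sum(alts) / (ref + sum(alts))

# A single diploid het call supported by 30 REF reads and 10 ALT reads.
af_from_genotype = genotype_af(["0/1"])
af_from_reads = read_allele_fraction([30, 10])
```

For a single diploid sample the first value can only be 0, 0.5 or 1, whereas the second follows the read counts, so the two can legitimately differ at the same site.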

↧

trio pipeline

June 28, 2017, 2:40 pm

Dear friends,
I am analyzing a trio.
I have followed the pipeline described in van der Auwera et al. 2013
on each person individually, up to HaplotypeCaller and VariantRecalibrator.

Is there a pipeline I can follow to put the data together and identify disease variants in the affected child (de novo or inherited)?
Thank you, vittoria

↧

Undefined variable VariantFiltration

March 13, 2013, 4:36 am

Hello
I am filtering SNPs and indels for a single-sample targeted resequencing dataset. My command is:

my $bigfilter_snps="-filter \"QUAL<80.0\" -filterName vm_QUAL -filter \"DP<20\" -filterName vm_DP -filter \"MQ<40.0\" -filterName GATK_v3_MQ -filter \"QD<2.0\" -filterName GATK_v3_QD -filter \"MQRankSum<-12.5\" -filterName GATK_v3_MQRankSum -filter \"HRun>5\" -filterName GATK_v2_HRun ";

java -jar GenomeAnalysisTK.jar -R /software/GenomeAnalysisTK_support/human_g1k_v37.fasta -T UnifiedGenotyper -I ${sample}.sorted.realigned.recal.bam -o ${sample}.sorted.realigned.recal.snps.vcf --intervals /506amplicons.interval_list --min_base_quality_score 15 -stand_call_conf 30 --baq CALCULATE_AS_NECESSARY -glm SNP --baqGapOpenPenalty 65 --downsampling_type BY_SAMPLE --downsample_to_coverage 250 --output_mode EMIT_ALL_CONFIDENT_SITES

I get the following error:

MQRankSum < -12.5;' undefined variable MQRankSum
WARN 11:04:58,217 Interpreter - ![0,4]: 'HRun > 5;' undefined variable HRun
WARN 11:04:58,217 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD

Also, the filtered vcf file doesn't have the ref allele or quality score:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 158259801 . A . 31.15 vm_DP;vm_QUAL AN=2;DP=2;MQ=66.59;MQ0=0 GT:DP 0/0:2
1 158259802 . G . 34.23 vm_DP;vm_QUAL AN=2;DP=2;MQ=66.59;MQ0=0 GT:DP 0/0:2
1 158259803 . A . 37.23 vm_DP;vm_QUAL AN=2;DP=5;MQ=68.66;MQ0=0 GT:DP 0/0:5
1 158259804 . A . 40.23 vm_DP;vm_QUAL AN=2;DP=10;MQ=69.33;MQ0=0 GT:DP 0/0:10

Any help on why the filtering is not working is appreciated, as I am quite new to this.

Thanks
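
The warnings appear to name exactly the annotations that are absent from the INFO column of the records shown above (only AN, DP, MQ and MQ0 are present). A quick way to check this per record is sketched below in Python. The helper names are made up, the INFO string is copied from the first record in the post, and QUAL is left out because it is a column rather than an INFO annotation.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict (flag entries get an empty value)."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out

def missing_annotations(info, needed):
    """List the annotation names a filter expression needs but the record lacks."""
    present = parse_info(info)
    return [name for name in needed if name not in present]

# INFO annotations referenced by the -filter expressions in the post.
needed = ["DP", "MQ", "QD", "MQRankSum", "HRun"]
info = "AN=2;DP=2;MQ=66.59;MQ0=0"
missing = missing_annotations(info, needed)
```

For this record the missing names are QD, MQRankSum and HRun, the same three variables reported as undefined.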

↧