Channel: Recent Discussions — GATK-Forum

Calculation of PQ and TP values


Dear GATK team,

I'd like to be able to work through the calculations for the PQ (ReadBackedPhasing) and TP (PhaseByTransmission) values for small toy data sets. Is there an article or document anywhere that describes the algorithms used to calculate PQ and TP? Unfortunately I'm only a beginner at Java, so can't answer my questions by looking at the source code.

Thanks for all the great work you do with the GATK.

Best wishes,

Katherine


Mutect2 bamout depths not matching VCF


Hi,

After running Mutect2 on tumor/normal paired BAM files, I get an output VCF with unusually high depth counts. I understand that the numbers here can differ from the input BAM depths due to reassembly within Mutect2. However, these depths sometimes jump from around 20 reads to 200 reads, which seems hard to believe. To investigate further, I re-ran Mutect2 with the -bamout option so I could analyze some positions in IGV. The Mutect2 command is posted below. The problem is that when I look at the output BAM (bamout) and the output VCF in IGV, the numbers do not match. Note: I ran FilterMutectCalls and selected only PASS calls when deciding which positions to look at. An example position: depth 146 and 192 in tumor and normal, respectively, in the VCF, but only 41 reads at the same position in the bamout file.

Questions:
1) What can explain the discrepancy between the bamout depths and the vcf depths?
2) Is there a way to get a bam/bamout that matches the VCF exactly?

Thanks a lot,
Sujay

Mutect2 command:
./gatk-4.0.10.1/gatk --java-options "-Xmx4g" Mutect2 \
    -R /genomes/Hsapiens/hg19/seq/hg19.fa \
    --annotation ClippingRankSumTest --annotation DepthPerSampleHC \
    --annotation MappingQualityRankSumTest --annotation MappingQualityZero \
    --annotation QualByDepth --annotation ReadPosRankSumTest \
    --annotation RMSMappingQuality --annotation FisherStrand \
    --annotation MappingQuality --annotation DepthPerAlleleBySample \
    --annotation Coverage \
    --read-validation-stringency LENIENT \
    -I tumorX.markduplicates.grouped.bam -tumor tumorX \
    -I normalX.markduplicates.grouped.bam -normal normalX \
    -L ./baits.bed --interval-set-rule INTERSECTION \
    --disable-read-filter NotDuplicateReadFilter \
    -ploidy 2 \
    -bamout tumorX.bamout.bam \
    -O tumorX.vcf
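
For reference, this is how I compare the two depths at a given position (the position below is just an illustrative placeholder; the VCF is bgzipped and tabix-indexed first):

# Per-sample DP reported in the VCF at an example position.
bgzip tumorX.vcf && tabix -p vcf tumorX.vcf.gz
bcftools query -r 21:10000000 -f '%CHROM\t%POS[\t%SAMPLE=%DP]\n' tumorX.vcf.gz

# Read depth in the bamout at the same position.
samtools depth -a -r 21:10000000-10000000 tumorX.bamout.bam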

GenomicsDBImport - Duplicate field name MQ0 found in vid map


I am running into an issue running GenomicsDBImport (version 4.0.5.1) on a single gVCF. The gVCF was generated using GATK version 3.5-0-g36282e4.

Here is the error,

...
15:22:35.273 DEBUG GenomeLocParser -  HLA-DRB1*16:02:01 (11005 bp)
15:22:35.626 INFO  IntervalArgumentCollection - Processing 64444167 bp from intervals
15:22:35.751 INFO  GenomicsDBImport - Done initializing engine
15:22:35.752 DEBUG IOUtils - Deleted /my_database_chr20
Created workspace /my_database_chr20
15:22:35.941 INFO  GenomicsDBImport - Vid Map JSON file will be written to /my_database_chr20/vidmap.json
15:22:35.941 INFO  GenomicsDBImport - Callset Map JSON file will be written to /my_database_chr20/callset.json
15:22:35.941 INFO  GenomicsDBImport - Complete VCF Header will be written to /my_database_chr20/vcfheader.vcf
15:22:35.941 INFO  GenomicsDBImport - Importing to array - /my_database_chr20/genomicsdb_array
15:22:35.955 INFO  ProgressMeter - Starting traversal
15:22:35.955 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
15:22:35.955 INFO  GenomicsDBImport - Starting batch input file preload
15:22:36.161 INFO  GenomicsDBImport - Finished batch preload
15:22:36.161 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
Duplicate field name MQ0 found in vid map
terminate called after throwing an instance of 'ProtoBufBasedVidMapperException'
  what():  ProtoBufBasedVidMapperException : Duplicate fields exist in vid map

GenomicsDBImport command,

JAVA_OPTS="-Xmx${memory}g" ${gatk_exe} GenomicsDBImport \
  --genomicsdb-workspace-path ${run_dir}/my_database_chr20 \
  --intervals chr20:1-100000 \
  --verbosity DEBUG \
  --lenient \
  --reader-threads 5 \
  -ip 500 \
  --batch-size 1  \
  --sample-name-map ${data_dir}/samples.list \
  --overwrite-existing-genomicsdb-workspace \
  -R ${ref}

Haplotype caller command,

/Software/bin/java
 -XX:ParallelGCThreads=2
 -Djava.io.tmpdir=/scratch/68a3f0e84d564e729fc42934118dc642
 -Xmx8196M
 -jar /Software/GenomeAnalysisTK/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar
 -T HaplotypeCaller
 --genotyping_mode DISCOVERY
 -A AlleleBalanceBySample
 -A DepthPerAlleleBySample
 -A DepthPerSampleHC
 -A InbreedingCoeff
 -A MappingQualityZeroBySample
 -A StrandBiasBySample
 -A Coverage
 -A FisherStrand
 -A HaplotypeScore
 -A MappingQualityRankSumTest
 -A MappingQualityZero
 -A QualByDepth
 -A RMSMappingQuality
 -A ReadPosRankSumTest
 -A VariantType
 -l INFO
 --emitRefConfidence GVCF
 -rf BadCigar
 --variant_index_parameter 128000
 --variant_index_type LINEAR
 -R /Resources/GRCh38_full_analysis_set_plus_decoy_hla.fa
 -nct 1
 -I /data/NA12878.final.bam
 -o /data/NA12878.20.haplotypeCalls.raw.g.vcf
 -L chr20

I don't think this should be the issue, but FYI:

$ bcftools view -h ../data/NA12878.haplotypeCalls.raw.g.vcf.gz | grep MQ0

##FORMAT=<ID=MQ0,Number=1,Type=Integer,Description="Number of Mapping Quality Zero Reads per sample">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
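
In case it helps, a workaround I am considering is stripping one of the duplicate MQ0 definitions before import. This is only an untested sketch; bcftools annotate -x removes both the per-record annotations and the corresponding header line:

# Drop the FORMAT-level MQ0 so only the INFO-level MQ0 remains, then re-index.
bcftools annotate -x FORMAT/MQ0 -O z -o NA12878.noFmtMQ0.g.vcf.gz ../data/NA12878.haplotypeCalls.raw.g.vcf.gz
tabix -p vcf NA12878.noFmtMQ0.g.vcf.gz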

[GATK4 beta] no filter-passing variants in Mutect2 tumor-only runs using default parameters


Hello,

I would like to ask your advice on the tumor-only mode of Mutect2.
I ran GATK4 beta.3's Mutect2 on 20 tumor samples in tumor-only mode and found no variants passing filters. Every variant is filtered out after running the FilterMutectCalls tool. It seems that germline risk is estimated very high overall.
Mutect2 was executed using scripts/mutect2_wdl/mutect2_multi_sample.wdl from the GATK source repository. gnomAD is given as the population allele frequency source, and default parameters are used.
I'd appreciate any help with running the tumor-only mode of Mutect2.

FYI, 10^P_GERMLINE (where P_GERMLINE is the INFO-field log10 posterior probability that the alt allele is a germline variant) for one tumor sample is distributed as below. Outliers are not plotted for the sake of simplicity.

summary(10^P_GERMLINE)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.04699 0.93345 0.99919 0.94155 1.00000 1.00000
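
For reference, the posteriors were extracted along these lines before summarizing in R (the file name is an example):

# Pull P_GERMLINE (log10 posterior) from INFO and convert to a probability.
bcftools query -f '%INFO/P_GERMLINE\n' tumor01.filtered.vcf | awk '{ print 10^$1 }' > p_germline.txt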

P_GERMLINE plot

(Additionally, none of the toolbar buttons, such as bold, italic, or file upload, work on the 'ask a question' page where I am writing this question. Is it just me?)

Which training sets / arguments should I use for running VQSR?


This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes, to comply with the GATK Best Practices. The recommendations detailed in this document take precedence over any others you may see elsewhere in our documentation (e.g. in Tutorial articles, which are only meant to illustrate usage, or in past presentations, which may be out of date).

The document covers:

  • Explanation of resource datasets
  • Important notes about annotations
  • Important notes about exome experiments
  • Argument recommendations for VariantRecalibrator
  • Argument recommendations for ApplyRecalibration

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).


Explanation of resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.
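
For example, a provisional truth set could be extracted from a first-pass callset along these lines (a sketch only; the file names and the QUAL cutoff are illustrative and should be tuned to your data):

# Keep only the highest-confidence SNPs from an initial round of calling
# to serve as a provisional truth/training resource.
bcftools view -v snps -i 'QUAL>=500' initial_calls.vcf.gz -O z -o provisional_truth.vcf.gz
tabix -p vcf provisional_truth.vcf.gz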

Resources for SNPs

  • True sites training resource: HapMap

    This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

    This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G
    This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP
    This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

Resources for Indels

  • True sites training resource: Mills
    This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Known sites resource, not used in training: dbSNP
    This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).
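
For reference, the percentages quoted above follow directly from the phred-scaled priors:

confidence = 1 - 10^(-Q/10)

Q15: 1 - 10^(-1.5) ≈ 96.84%
Q12: 1 - 10^(-1.2) ≈ 93.69%
Q10: 1 - 10^(-1.0) = 90%
Q2:  1 - 10^(-0.2) ≈ 36.90%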


Important notes about annotations

Some of the annotations included in the recommendations given below might not be the best for your particular dataset. In particular, the following caveats apply:

  • Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

  • You may have seen HaplotypeScore mentioned in older documents. That is a statistic produced by UnifiedGenotyper that should only be used if you called your variants with UG. This statistic isn't produced by the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes with HC.

  • The InbreedingCoeff is a population-level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or that include many closely related samples (such as a family), please omit this annotation from the command line.


Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.

  • You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort.


Argument recommendations for VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.

Common, base command line

This is the first part of the VariantRecalibrator command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   [SPECIFY TRUTH AND TRAINING SETS] \
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
   [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle.

   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
   -mode SNP \
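
Assembled, a complete SNP-mode invocation therefore looks like the following (simply the base command above plus the SNP-specific arguments; file paths are placeholders):

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
   -mode SNP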

Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see below for details of why).

Note also that, for the above to work, the input VCF needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibrator can run properly.

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset, as well as adding in very high-confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

   --maxGaussians 4 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf  \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
   -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
   -mode INDEL \

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.



Argument recommendations for ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

Common, base command line

This is the first part of the ApplyRecalibration command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

 
 java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
   [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
 

SNP specific recommendations

For SNPs we use HapMap 3.3 and the Omni 2.5M chip as our truth sets. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.5 \
   -mode SNP \

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.0 \
   -mode INDEL \

Can DepthOfCoverage data from GATK3 be corrected using CorrectGCBias?


Hello,

I have coverage data for a large exome cohort obtained from DepthOfCoverage (GATK 3.6+). We are interested in regions with high GC content and would like to correct sample-level GC bias. Unfortunately, the BAM files aren't easily accessible, so I am curious whether it is possible at all to run CorrectGCBias on this coverage data.

If it's doable, does the input file format follow what is described here: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.beta.3/org_broadinstitute_hellbender_tools_exome_CombineReadCounts.php

Thanks,
Nick


No outputs from PathSeq


Hello,

I've been trying to run PathSeq with no success so far... here's my input:

gatk --java-options "-Xmx256g" PathSeqPipelineSpark \
--input sample.bam \
--filter-bwa-image /home/wangjh/shaba033/pathseq_host.fa.img \
--kmer-file /home/wangjh/shared/G12Microbiome/pathseq/pathseq_host.bfi \
--min-clipped-read-length 70 \
--microbe-fasta /home/wangjh/shared/G12Microbiome/pathseq/pathseq_microbe.fa \
--microbe-bwa-image /home/wangjh/shared/G12Microbiome/pathseq/pathseq_microbe.fa.img \
--taxonomy-file /home/wangjh/shared/G12Microbiome/pathseq/pathseq_taxonomy.db \
--output sample.pathseq.bam \
--scores-output sample.bam.txt

Anything missing here?

Best and Thank you,

-Ashraf

Argument syntax conversion failed.


Hi,

I am working with GATK version 4.0.11 and get an error from the SortSam tool. My command is the following:

gatk SortSam -I '/media/Berechnungen/181102_NB501654_0091_AH5CCFAFXY/1239-18_Brustkrebs=(BRCA1.=BRCA2)/1239-18.sam' -O '/media/Berechnungen/181102_NB501654_0091_AH5CCFAFXY/1239-18_Brustkrebs=(BRCA1.=BRCA2)/1239-18.sorted.bam' --SORT_ORDER coordinate --TMP_DIR /media/Ergebnisse/picardtmp

which gives me the following error:

java.lang.RuntimeException: Argument syntax conversion failed. Too many "=" separated tokens to translate: /media/Berechnungen/181102_NB501654_0091_AH5CCFAFXY/1239-18_Brustkrebs=(BRCA1.=BRCA2)/1239-18.sam
    at picard.cmdline.CommandLineSyntaxTranslater.lambda$translatePicardStyleToPosixStyle$1(CommandLineSyntaxTranslater.java:40)
    at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at picard.cmdline.CommandLineSyntaxTranslater.translatePicardStyleToPosixStyle(CommandLineSyntaxTranslater.java:44)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

So it seems that the tool cannot handle "=" in a file path. But why is there no error for the -I parameter? There are also "=" signs in that file path. Earlier versions of GATK do not throw this error. Is there a workaround, or do I have to use the older version until this is fixed in a new release?
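
As a temporary workaround I am considering symlinking the files to paths without "=" characters, e.g. (untested sketch):

# Work around the "=" parsing issue by running SortSam on "="-free paths.
ln -s '/media/Berechnungen/181102_NB501654_0091_AH5CCFAFXY/1239-18_Brustkrebs=(BRCA1.=BRCA2)/1239-18.sam' /tmp/1239-18.sam
gatk SortSam -I /tmp/1239-18.sam -O /tmp/1239-18.sorted.bam --SORT_ORDER coordinate --TMP_DIR /media/Ergebnisse/picardtmp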

Thanks
Stefan

GATK (v4.0.10.1) CombineGVCFs failing with 'java.lang.OutOfMemoryError'; not using memory provided


Hi,

We ran a CombineGVCFs job using the following command, where gvcfs.list contained only 31 gvcf files with 24 samples each:

$GATK --java-options "-Xmx650G" \
CombineGVCFs \
-R $referenceFasta \
-O full_cohort.b37.g.vcf \
--variant gvcfs.list

We tried the extreme memory because CombineGVCFs kept failing. This node has 750G of RAM.

Despite the high memory provided, we get the stacktrace below. The total memory reported by GATK is only ~12G, though (Runtime.totalMemory()=12662603776). Am I missing something? I don't understand why GATK is only using 12G of RAM when we provided much more, and then failing with an OutOfMemoryError.

We are currently setting up GenomicsDBImport, but this seems worth reporting.
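
In the meantime, we are testing whether sharding the combine step by chromosome keeps the memory bounded, along these lines (a sketch; b37-style contig names assumed):

# Combine per chromosome so no single job holds the whole cohort at once.
for chr in $(seq 1 22) X Y; do
    $GATK --java-options "-Xmx64G" CombineGVCFs \
        -R $referenceFasta \
        -L $chr \
        --variant gvcfs.list \
        -O full_cohort.b37.${chr}.g.vcf
done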

Really appreciate your help.

18:55:51.944 INFO ProgressMeter - 4:26649295 23.6 18617000 787894.4
18:56:01.988 INFO ProgressMeter - 4:26655758 23.8 18779000 789159.6
18:59:13.407 INFO CombineGVCFs - Shutting down engine
[October 19, 2018 6:59:13 PM CDT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 27.06 minutes.
Runtime.totalMemory()=12662603776
Exception in thread "main" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:316)
at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:149)
at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
at java.io.BufferedWriter.close(BufferedWriter.java:266)
at htsjdk.variant.variantcontext.writer.VCFWriter.close(VCFWriter.java:226)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.closeTool(CombineGVCFs.java:461)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:970)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

SelectVariants (GATK 4.10.0) outputs a site not in the interval list (?)


Hi,

I have run SelectVariants just to output the sites belonging to my interval list (i.e. a list of single sites), which is in a format like this:

...
chr21:28407515
chr21:29497692
chr21:30101335
chr21:31599751
chr21:33096867
...

However, when I run SelectVariants with the following minimized parameters:

${gatk} SelectVariants \
-V ${sample}.f.vcf \
-L ${dir0}/mysite.intervals \
-O ${sample}.f.select.vcf \
&> log.SelectVariants.${sample}

The output VCF file contains both position chr21:30101334 and position chr21:30101335:

...
chr21   29497692        rs9974441       G       A,<NON_REF>     3445.77 PASS    BaseQRankSum=0.283;DB;DP=270;ExcessHet=3.0103;GQ_MEAN=2108.00;MLEAC=1,0;MLEAF=0.500,0.00;MQRankS
chr21   30101334        .       GA      G,<NON_REF>     0       PASS    BaseQRankSum=-0.637;DP=326;ExcessHet=3.0103;GQ_MEAN=869.00;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=0.000;NCC
chr21   30101335        rs2831995       A       *,G,T,<NON_REF> 3364.77 PASS    BaseQRankSum=0.967;DB;DP=320;ExcessHet=3.0103;GQ_MEAN=3272.00;MLEAC=0,1,0,0;MLEAF=0.00,0.500,0.0
chr21   31599751        rs2832663       T       C,<NON_REF>     6836.77 PASS    DB;DP=269;ExcessHet=3.0103;GQ_MEAN=801.00;MLEAC=2,0;MLEAF=1.00,0.00;NCC=0;RAW_MQandDP=968400,269
...

Could you please explain why this happens, and how I could get only the sites in the interval file?
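
For now I am post-filtering to exact positions with something like this (a sketch; it assumes the interval file contains only chr:pos lines):

# Keep header lines plus only records whose CHROM:POS exactly matches a listed site.
awk 'NR==FNR { keep[$1]; next } /^#/ { print; next } ($1":"$2) in keep' \
    mysite.intervals sample.f.select.vcf > sample.f.select.exact.vcf
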
Thank you very much!

(How to part II) Sensitively detect copy ratio alterations and allelic segments


Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.


This workflow is broken into two tutorials. You are currently on the second part. See Tutorial#11682 for the first part.

At the heart of this second part is segmentation, performed by ModelSegments. In segmentation, contiguous copy ratios are grouped together into segments. The tool performs segmentation for both copy ratios and for allelic copy ratios, given allelic counts. The segmentation is informed by both types of data, i.e. the tool uses allelic data to refine copy ratio segmentation and vice versa. The tutorial refers to this multi-data approach as joint segmentation. The presented commands showcase the full features of the tools. It is also possible to perform segmentation for each data type independently, i.e. based solely on copy ratios or solely on allelic counts.

The tutorial illustrates the workflow using a paired sample set. Specifically, detection of allelic copy ratios uses a matched control, i.e. the HCC1143 tumor sample is analyzed using a control, the HCC1143 blood normal. It is possible to run the workflow without a matched-control. See section 8.1 for considerations in interpreting allelic copy ratio results for different modes and for different purities.

The GATK4 CNV workflow offers a multitude of levers, e.g. for fine-tuning analyses and for controls. Researchers are expected to tune workflow parameters on samples with copy number profiles similar to that of the case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters


5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts

CollectAllelicCounts will tabulate counts of the reference allele and counts of the dominant alternate allele for each site in a given genomic intervals list. The tutorial performs this step for both the case sample, the HCC1143 tumor, and the matched-control, the HCC1143 blood normal. This allele-specific coverage collection is just that--raw coverage collection without any statistical inferences. In the next section, ModelSegments uses the allele counts towards estimating allelic copy ratios, which in turn the tool uses to refine segmentation.

Collect allele counts for the case and the matched-control alignments independently with the same intervals. For the matched-control analysis, the allelic count sites for the case and control must match exactly. Otherwise, ModelSegments, which takes the counts in the next step, will error. Here we use an intervals list that subsets gnomAD biallelic germline SNP sites to those within the padded, preprocessed exome target intervals [9].

The tutorial has already collected allele counts for full length sample BAMs. To demonstrate coverage collection, the following command uses the small BAMs originally made for Tutorial#11136 [6]. The tutorial does not use the resulting files in subsequent steps.

Collect counts at germline variant sites for the matched-control

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I normal.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_N_clean.allelicCounts.tsv

Collect counts at the same sites for the case sample

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I tumor.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_T_clean.allelicCounts.tsv

This results in counts table files. Each data file has header lines that start with an @ asperand symbol, e.g. @HD, @SQ and @RG lines, followed by a table of data with six columns. An example snippet is shown.
T_allelicCounts_header.png
T_allelicCounts_snippet.png

Comments on select parameters

  • The tool requires one or more genomic intervals specified with -L. The intervals can be either a Picard-style intervals list or a VCF. See Article#1109 for descriptions of formats. The sites should be common and/or sample-specific germline SNP-only variant sites. Omit indel-type and mixed-variant-type sites.
  • The tool requires the reference genome, specified with -R, and aligned reads, specified with -I.
  • As is the case for most GATK tools, the engine filters reads upfront using a number of read filters. Of note for CollectAllelicCounts is the MappingQualityReadFilter. By default, the tool sets the filter's --minimum-mapping-quality to twenty. As a result, the tool will include reads with MAPQ20 and above in the analysis [10].

☞ 5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?

Another GATK tool, GetPileupSummaries, similarly counts reference and alternate alleles. The resulting summaries are meant for use with CalculateContamination in estimating cross-sample contamination. GetPileupSummaries limits counts collections to those sites with population allele frequencies set by the parameters --minimum-population-allele-frequency and --maximum-population-allele-frequency. Details are here.

CollectAllelicCounts employs fewer engine-level read filters than GetPileupSummaries. Of note, both tools use the MappingQualityReadFilter. However, each sets a different threshold with the filter. GetPileupSummaries uses a --minimum-mapping-quality threshold of 50. In contrast, CollectAllelicCounts sets the --minimum-mapping-quality parameter to 30. In addition, CollectAllelicCounts filters on base quality. The base quality threshold is set with the --minimum-base-quality parameter, whose default is 20.


back to top


6. Group contiguous copy ratios into segments with ModelSegments

ModelSegments groups together copy and allelic ratios that it determines are contiguous on the same segment. A Gaussian-kernel binary-segmentation algorithm differentiates ModelSegments from the GATK4.beta tool PerformSegmentation, which ModelSegments replaces. The older tool used a CBS (circular binary segmentation) algorithm. ModelSegments' kernel algorithm enables efficient segmentation of dense data, e.g. that of whole genome sequences. A discussion of preliminary algorithm performance is here.

The algorithm performs segmentation for both copy ratios and for allelic copy ratios jointly when given both datatypes together. For allelic copy ratios, ModelSegments uses only those sites it determines are heterozygous, either in the control in a paired analysis or in the case in a case-only analysis [11]. In the paired analysis, the tool models allelic copy ratios in the case using sites for which the control is heterozygous. The workflow defines allelic copy ratios in terms of alternate-allele fraction, where total allele fractions for reference allele and alternate allele add to one for each site.

For the following command, be sure to specify an existing --output directory or . for the current directory.

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --output sandbox \
    --output-prefix hcc1143_T_clean

This produces nine files, each with the basename hcc1143_T_clean, in the output directory, listed below; a quick way to preview them follows the list. The param files contain global parameters for copy ratios (cr) and allele fractions (af), and the seg files contain data on the segments. For either type of data, the tool gives data before and after segmentation smoothing. The tool documentation details what each file contains. The last two files, labeled hets, contain the allelic counts for the control's heterozygous sites. Counts are for the matched control (normal) and the case.

  1. hcc1143_T_clean.modelBegin.seg
  2. hcc1143_T_clean.modelFinal.seg
  3. hcc1143_T_clean.cr.seg
  4. hcc1143_T_clean.modelBegin.af.param
  5. hcc1143_T_clean.modelBegin.cr.param
  6. hcc1143_T_clean.modelFinal.af.param
  7. hcc1143_T_clean.modelFinal.cr.param
  8. hcc1143_T_clean.hets.normal.tsv
  9. hcc1143_T_clean.hets.tsv
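
To preview any of these outputs, note that the data tables follow SAM-style @ header lines, so a quick look at the copy-ratio segments could be:

# Print the first few copy-ratio segments, skipping the SAM-style @ headers.
grep -v '^@' sandbox/hcc1143_T_clean.cr.seg | head -n 5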

The tool has numerous adjustable parameters and these are described in the ModelSegments tool documentation. The tutorial uses the default values for all of the parameters. Adjusting parameters can change the resolution and smoothness of the segmentation results.

Comments on select parameters

  • The tool accepts both or either copy-ratios (--denoised-copy-ratios) or allelic-counts (--allelic-counts) data. The matched-control allelic counts (--normal-allelic-counts) is optional. If given both types of data, then copy ratios and allelic counts data together inform segmentation for both copy ratio and allelic segments. If given only one type of data, then segmentation is based solely on the given type of data.
  • The --minimum-total-allele-count is set to 30 by default. This means the tool only considers sites with 30 or more read depth coverage for allelic copy ratios.
  • The --genotyping-homozygous-log-ratio-threshold option is set to -10.0 by default. Increase this to increase the number of sites assumed to be heterozygous for modeling.
  • Default smoothing parameters are optimized for faster performance, given the size of whole genomes. The --maximum-number-of-smoothing-iterations option caps smoothing iterations to 25. MCMC model sampling is also set to 100, for both copy-ratio and allele-fraction sampling by the --number-of-samples-copy-ratio and --number-of-samples-allele-fraction options, respectively. Finally, --number-of-smoothing-iterations-per-fit is set to zero by default to disable model refitting between iterations. What this means is that the tool will generate only two MCMC fits--an initial and a final fit.

    • GATK4.beta's ACNV set this parameter such that each smoothing iteration refit using MCMC, at the cost of compute. For the tutorial data, which is targeted exomes, the default zero gives 398 segments after two smoothing iterations, while setting --number-of-smoothing-iterations-per-fit to one gives 311 segments after seven smoothing iterations. Section 8 plots these alternative results.
  • For advanced smoothing recommendations, see [12].

Section 8 shows the results of segmentation, the result from changing --number-of-smoothing-iterations-per-fit and the result of allelic segmentation modeled from allelic counts data alone. Section 8.1 details considerations depending on analysis approach and purity of samples. Section 8.2 shows the results of changing the advanced smoothing parameters given in [12].

ModelSegments runs in the following three stages.

  1. Genotypes heterozygous sites and filters on depth and for sites that overlap with copy-ratio intervals.
    • Allelic counts for sites in the control that are heterozygous are written to hets.normal.tsv. For the same sites in the case, allelic counts are written to hets.tsv.
    • If given only allelic counts data, ModelSegments does not apply intervals.
  2. Performs multidimensional kernel segmentation (1, 2).
    • Uses allelic counts within each copy-ratio interval for each contig.
    • Uses denoised copy ratios and heterozygous allelic counts.
  3. Performs Markov-Chain Monte Carlo (MCMC, 1, 2, 3) sampling and segment smoothing. In particular, the tool uses Gibbs sampling and slice sampling. These MCMC samplings inform smoothing, i.e. merging adjacent segments, and the tool can perform multiple iterations of sampling and smoothing [13].
    • Fits initial model. Writes initial segments to modelBegin.seg, posterior summaries for copy-ratio global parameters to modelBegin.cr.param and allele-fraction global parameters to modelBegin.af.param.
    • Iteratively performs segment smoothing and sampling. Fits allele-fraction model [14] until log likelihood converges. This process produces global parameters.
    • Samples final models. Writes final segments to modelFinal.seg, posterior summaries for copy-ratio global parameters to modelFinal.cr.param, posterior summaries for allele-fraction global parameters to modelFinal.af.param and final copy-ratio segments to cr.seg.

At the second stage, the tutorial data generates the following message.

INFO  MultidimensionalKernelSegmenter - Found 638 segments in 23 chromosomes.

At the third stage, the tutorial data generates the following message.

INFO  MultidimensionalModeller - Final number of segments after smoothing: 398

For the tutorial data, the initial number of segments before smoothing is 638 over 23 contigs. After smoothing with default parameters, 398 segments remain.


back to top


7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments

CallCopyRatioSegments allows for systematic calling of copy-neutral, amplified and deleted segments. This step is not required for plotting segmentation results. Provide the tool with the cr.seg segmentation result from ModelSegments.

gatk CallCopyRatioSegments \
    --input hcc1143_T_clean.cr.seg \
    --output sandbox/hcc1143_T_clean.called.seg

The resulting called.seg data adds a sixth column to the provided copy ratio segmentation table. The tool denotes amplifications with a + plus sign, deletions with a - minus sign and neutral segments with a 0 zero.

Here is a snippet of the results.
T_called_seg.png
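
To tally the calls, the call column is the sixth field; skipping the @ header lines and the column-name row, a quick count could be:

# Count amplified (+), deleted (-) and copy-neutral (0) segments.
grep -v '^@' sandbox/hcc1143_T_clean.called.seg | tail -n +2 | cut -f6 | sort | uniq -c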

Comments on select parameters
- The parameters --neutral-segment-copy-ratio-lower-bound (default 0.9) and --neutral-segment-copy-ratio-upper-bound (default 1.1) together set the copy ratio range for copy-neutral segments. These two parameters replace the GATK4.beta workflow’s --neutral-segment-copy-ratio-threshold option.


back to top


8. Plot modeled copy ratio and allelic fraction segments with PlotModeledSegments

PlotModeledSegments visualizes copy and allelic ratio segmentation results.

gatk PlotModeledSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.hets.tsv \
    --segments hcc1143_T_clean.modelFinal.seg \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces plots in the plots folder. The plots represent final modeled segments for both copy ratios and alternate allele fractions. If we are curious about the extent of smoothing provided by MCMC, then we can similarly plot initial kernel segmentation results by substituting in --segments hcc1143_T_clean.modelBegin.seg.

Comments on select parameters
- The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping [4].
- To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

As of this writing, it is NOT possible to subset plotting with genomic intervals, i.e. with the -L parameter. To interactively visualize data, consider the following options.

  • Modify the sequence dictionary to contain only the contigs of interest, in the order desired.
  • Convert the data to the bedGraph format for targeted exomes or to bigWig for whole genomes. An example of CNV data converted to bedGraph and visualized in IGV is given in this discussion.
  • Alternatively, researchers versed in R may choose to visualize subsets of data using RStudio.

Below are three sets of results for the HCC1143 tumor cell line in order of increasing smoothing. The top plot of each set shows the copy ratio segments. The bottom plot of each set shows the allele fraction segments.

  • In the denoised copy ratio segment plot, individual targets still display as points on the plot. Different copy ratio segments are indicated by alternating blue and orange color groups. The denoised median is drawn in thick black.
  • In the allele fraction plot, the boxes surrounding the alternate allelic fractions do NOT indicate standard deviation nor standard error, which biomedical researchers may be more familiar with. Rather, the allelic fraction data is given in credible intervals. The allelic copy ratio plot shows the 10th, 50th and 90th percentiles. These should be interpreted with care as explained in section 8.1. Individual allele fraction data display as faint data points, also in orange and blue.

8A. Initial segmentation before MCMC smoothing gives 638 segments.
T_modelbegin.modeled.png

8B. Default smoothing gives 398 segments.
T_modelfinal.modeled.png

8C. Enabling additional smoothing iterations per fit gives 311 segments. See section 6 for a description of the --number-of-smoothing-iterations-per-fit parameter.
T_increase_smoothing_1.modeled.png

Smoothing accounts for data points that are outliers. Some of these outliers could be artifactual and therefore not of interest, while others could be true copy number variation that would then be missed. To understand the impact of joint copy ratio and allelic counts segmentation, compare the results of 8B to the single-data segmentation results below. Each plot below shows the results of modeling segmentation on a single type of data, either copy-ratios or allelic counts, using default smoothing parameters.

8D. Copy ratio segmentation based on copy ratios alone gives 235 segments.
T_caseonly.modeled.png

8E. Allelic segmentation result based on allelic counts alone in the matched case gives 105 segments.
T-matched-normal_just_allelic.modeled.png

Compare chr1 and chr2 segmentation for the various plots. In particular, pay attention to the p arm (left side) of chr1 and q arm (right side) of chr2. What do you think is happening when adjacent segments are slightly shifted from each other in some sets but then seemingly at the same copy ratio for other sets?

For allelic counts, ModelSegments retains 16,872 sites that are heterozygous in the control. Of these, the case presents 15,486 usable sites. In allelic segmentation using allelic counts alone, the tool uses all of the usable sites. In the matched-control scenario, ModelSegments emits the following message.

INFO  MultidimensionalKernelSegmenter - Using first allelic-count site in each copy-ratio interval (12668 / 15486) for multidimensional segmentation...

The message informs us that for the matched-control scenario, ModelSegments uses the first allele-count site for each genomic interval towards allelic modeling. For the tutorial data, this is 12,668 out of the 15,486 or 81.8% of the usable allele-count sites. The exclusion of ~20% of allelic-counts sites, together with the lack of copy ratio data informing segmentation, account for the difference we observe in this and the previous allelic segmentation plot.

In the allele fraction plot, some of the alternate-allele fractions are around 0.35/0.65 and some are at 0/1. We also see alternate-allele fractions around 0.25/0.75 and 0.5. These suggest ploidies that are multiples of one, two, three and four.

Is it possible a copy ratio of one is not diploid but represents some other ploidy?

For the plots above, focus on chr4, chr5 and chr17. Based on both the copy ratio and allelic results, what is the zygosity of each of the chromosomes? What proportion of each chromosome could be described as having undergone copy-neutral loss of heterozygosity?


☞ 8.1 Some considerations in interpreting allelic copy ratios

For allelic copy ratio analysis, the matched-control is a sample from the same individual as the case sample. In the somatic case, the matched-control is the germline normal sample and the case is the tumor sample from the same individual.

The matched-control case presents the following considerations.

  • If a matched control contains any region with copy number amplification, the skewed allele fractions still allow correct interpretation of the original heterozygosity.
  • However, if a matched control contains deleted regions or regions with copy-neutral loss of heterozygosity or a long stretch of homozygosity, e.g. as occurs in uniparental disomy, then these regions would go dark so to speak in that they become apparently homozygous and so ModelSegments drops them from consideration.
  • From population sequencing projects, we know the expected heterozygosity of normal germline samples averages around one in a thousand. However, the GATK4 CNV workflow does not account for any heterozygosity expectations. An example of such an analysis that utilizes SNP array data is HAPSEG. It is available on GenePattern.
  • If a matched normal contains tumor contamination, this should still allow for the normal to serve as a control. The expectation is that somatic mutations coinciding with common germline SNP sites will be rare and ModelSegments (i) only counts the dominant alt allele at multiallelic sites and (ii) recognizes and handles outliers. To estimate tumor in normal (TiN) contamination, see the Broad CGA group's deTiN.

Here are some considerations for detecting loss of heterozygosity regions.

  • In the matched-control case, if the case sample is pure, i.e. not contaminated with the control sample, then we see loss of heterozygosity (LOH) segments near alternate-allele fractions of zero and one.
  • If the case is contaminated with matched control, whether the analysis is matched or not, then the range of alternate-allele fractions becomes squished so to speak in that the contaminating normal's heterozygous sites add to the allele fractions. In this case, putative LOH segments still appear at the top and bottom edges of the allelic plot, at the lowest and highest alternate-allele fractions. For a given depth of coverage, the fraction of reads that differentiates zygosity is narrower in range and therefore harder to differentiate visually.

    8F. Case-only analysis of tumor contaminated with normal still allows for LOH detection. Here, we bluntly added together the tutorial tumor and normal sample reads. Results for the matched-control analysis are similar.
    mixTN_tumoronly.modeled.png

  • In the tumor-only case, if the tumor is pure, because ModelSegments drops homozygous sites from consideration and only models sites it determines are heterozygous, the workflow cannot ascertain LOH segments. Such LOH regions may present as an absence of allelic data or as low confidence segments, i.e. having a wide confidence interval on the allelic plot. Compare such a result below to that of the matched case in 8E above.

    8G. Allelic segmentation result based on allelic counts alone for case-only, when the case is pure, can produce regions of missing representation and low confidence allelic fraction segments.
    T-only_just_allelic.modeled.png

    Compare results. Focus on chr4, chr5 and chr17. While the matched-case gives homozygous zygosity for each of these chromosomes, the case-only allelic segmentation either presents an absence of segments for regions or gives low confidence allelic fraction segments at alternate allele fractions that are inaccurate, i.e. do not represent actual zygosity. This is particularly true for tumor samples where aneuploidy and LOH are common. Interpret case-only allelic results with caution.

Finally, remember the tutorial analyses above utilize allelic counts from gnomAD sites of common population variation that have been lifted-over from GRCh37 to GRCh38. For allelic count sites, use of sample-specific germline variant sites may incrementally increase resolution. Also, use of confident variant sites from a callset derived from alignments to the target reference may help decrease noise. Confident germline variant sites can be derived with HaplotypeCaller calling on alignments and subsequent variant filtration. Alternatively, it is possible to fine-tune ModelSegments smoothing parameters to dampen noise.


☞ 8.2 Some results of fine-tuning smoothing parameters

This section shows plotting results of changing some advanced smoothing parameters. The parameters and their defaults are given below, in the order of recommended consideration [12].

--number-of-changepoints-penalty-factor 1.0 \
--kernel-variance-allele-fraction 0.025 \
--kernel-variance-copy-ratio 0.0 \
--kernel-scaling-allele-fraction 1.0 \
--smoothing-credible-interval-threshold-allele-fraction 2.0 \
--smoothing-credible-interval-threshold-copy-ratio 2.0 \

The first four parameters impact segmentation while the last two parameters impact modeling. The following plots show the results of changing these smoothing parameters. The tutorial chose argument values arbitrarily, for illustration purposes. Results should be compared to that of 8B, which gives 398 segments.

8H. Increasing changepoints penalty factor from 1.0 to 5.0 gives 140 segments.

image

8I. Increasing kernel variance parameters each to 0.8 gives 144 segments. Changing --kernel-variance-copy-ratio alone to 0.025 increases the number of segments greatly, to 1,266 segments. Changing it to 0.2 gives 414 segments.

image

8J. Decreasing kernel scaling from 1.0 to 0 gives 236 segments. Conversely, increasing kernel scaling from 1.0 to 5.0 gives 551 segments.

image

8K. Increasing both smoothing parameters each from 2.0 to 10.0 gives 263 segments.

image

back to top


Footnotes


[9] The GATK Resource Bundle provides two variations of a SNPs-only gnomAD project resource VCF. Both VCFs are sites-only eight-column VCFs but one retains the AC allele count and AF allele frequency variant-allele-specific annotations, while the other removes these to reduce file size.

  • For targeted exomes, it may be convenient to subset these to the preprocessed intervals, e.g. with SelectVariants for use with CollectAllelicCounts. This is not necessary, however, as ModelSegments drops sites outside the target regions from its analysis in the joint-analysis approach.
  • For whole genomes, depending on the desired resolution of the analysis, consider subsetting the gnomAD sites to those commonly variant, e.g. above an allele frequency threshold (a sketch follows this list). Note that SelectVariants, as of this writing, can filter on AF allele frequency only for biallelic sites. Non-biallelic sites make up ~3% of the gnomAD SNPs-only resource.
  • For more resolution, consider adding sample-specific germline variant biallelic SNPs-only sites to the intervals. Section 8.1 shows allelic segmentation results for such an analysis.
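
For illustration, a minimal SelectVariants sketch that subsets the sites-only resource to common biallelic SNPs; the file names and the 0.05 frequency threshold are illustrative, not recommendations:

gatk SelectVariants \
    -V af-only-gnomad.vcf.gz \
    --restrict-alleles-to BIALLELIC \
    -select "AF > 0.05" \
    -O gnomad_common_biallelic.vcf.gz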


[10] The MAPQ20 threshold of CollectAllelicCounts is lower than that used by CollectFragmentCounts, which uses MAPQ30.


[11] In particular, the tool considers only heterozygous sites that have counts for both the reference allele and the alternate allele. If multiple alternate alleles are present, the tool uses the alternate allele with the highest count and ignores any other alternate allele(s).


[12] These advanced smoothing recommendations are from one of the workflow developers--@slee.

  • For smoother results, first increase --number-of-changepoints-penalty-factor from its default of 1.0.
  • If the above does not suffice, then consider changing the kernel-variance parameters --kernel-variance-copy-ratio (default 0.0) and --kernel-variance-allele-fraction (default 0.025), or change the weighting of the allele-fraction data by changing --kernel-scaling-allele-fraction (default 1.0).
  • If such changes are still insufficient, then consider adjusting the smoothing-credible-interval-threshold parameters --smoothing-credible-interval-threshold-copy-ratio (default 2.0) and --smoothing-credible-interval-threshold-allele-fraction (default 2.0). Increasing these will more aggressively merge adjacent segments.


[13] In particular, the tool uses Gibbs sampling, a type of MCMC sampling, for both copy-ratio modeling and allele-fraction modeling, and additionally uses slice sampling for allele-fraction modeling. @slee details the following substeps.

  1. Perform MCMC (Gibbs) to fit the copy-ratio model posteriors.
  2. Use optimization (of the log likelihood) to initialize the Markov Chain for the allele-fraction model.
  3. Perform MCMC (Gibbs and slice) to fit the allele-fraction model posteriors.
  4. The initial model is now fit. We write the corresponding modelBegin files, including those for global parameters.
  5. Iteratively perform segment smoothing.
  6. Perform steps 1-4 again, this time to generate the final model fit and modelFinal files.


[14] @slee shares that the tool initializes the MCMC by starting off at the maximum a posteriori (MAP) point in parameter space.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.

back to top


(How to) Mark duplicates with MarkDuplicates or MarkDuplicatesWithMateCigar



This tutorial updates Tutorial#2799.

Here we discuss two tools, MarkDuplicates and MarkDuplicatesWithMateCigar, that flag duplicates. We provide example data and example commands for you to follow along with (section 1), including tips on estimating library complexity for PCR-free samples and patterned flow cell technologies. In section 2, we point out special memory considerations for these tools. In section 3, we highlight the similarities and differences between the two tools. Finally, in section 4, we get into details that may be of interest to some, including comments on the metrics file.

To mark duplicates in RNA-Seq data, use MarkDuplicates; the reasons are explained in sections 2 and 3. If you are considering MarkDuplicatesWithMateCigar for your DNA data, be sure your insert lengths are short and that you have a low percentage of split or multi-mapping records.

Obviously, expect more duplicates for samples prepared with PCR than for PCR-free preparations. Duplicates arise from various sources, including within the sequencing run. As such, even PCR-free data can give rise to duplicates, albeit at low rates, as illustrated here with our example data.

Which tool should I use, MarkDuplicates or MarkDuplicatesWithMateCigar? (new section 5/25/2016)

The Best Practices currently recommend MarkDuplicates. However, as always, consider your research goals.

If your research uses paired end reads and pre-processing that generates missing mates, for example through application of an intervals list or removal of reference contigs after the initial alignment, and you wish to flag duplicates for the remaining singletons, then MarkDuplicatesWithMateCigar will flag these at the insert level using the mate cigar (MC) tag. MarkDuplicates skips such singletons from consideration.

If the strategy by which the representative insert in a duplicate set is selected matters to your analyses, note that MarkDuplicatesWithMateCigar is limited to prioritizing by the total mapped length of a pair, while MarkDuplicates can use either this or the default sum of base qualities of a pair.

If you are still unsure which tool is appropriate, then consider maximizing comparability to previous analyses. The Broad Genomics Platform has used only MarkDuplicates in their production pipelines. MarkDuplicatesWithMateCigar is a newer tool that has yet to gain traction.

This tutorial compares the two tools to dispel the circulating notion that the outcomes from the two tools are equivalent and to provide details helpful to researchers in optimizing their analyses.

We welcome feedback. Share your suggestions in the Comment section at the bottom of this page.


Jump to a section

  1. Commands for MarkDuplicates and MarkDuplicatesWithMateCigar
  2. Slow or out of memory error? Special memory considerations for duplicate marking tools
  3. Conceptual overview of duplicate flagging
  4. Details of interest to some

Tools involved

Prerequisites

  • Installed Picard tools
  • Coordinate-sorted and indexed BAM alignment data, with secondary/supplementary alignments flagged appropriately (256 and 2048 flags) and additionally with the mate-unmapped (8) flag. See the MergeBamAlignment section (3C) of Tutorial#6483 for a description of how MergeBamAlignment ensures such flagging. **Revision as of 5/17/2016:** I wrote this tutorial at a time when the input could only be an indexed, coordinate-sorted BAM. Recently, the tools added a feature to accept queryname-sorted inputs, which in turn activates additional features that give DIFFERENT duplicate flagging results. For the tutorial's observations to apply, use coordinate-sorted data.
  • For MarkDuplicatesWithMateCigar, pre-computed Mate CIGAR (MC) tags. Data produced according to Tutorial#6483 will have the MC tags added by MergeBamAlignment. Alternatively, see the tools RevertOriginalBaseQualitiesAndAddMateCigar and FixMateInformation; a sketch follows this list.
  • Appropriately assigned Read Group (RG) information. Read Group library (RGLB) information factors into molecular duplicate detection. Optical duplicates are limited to those from the same RGID.
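
A minimal sketch of adding MC tags to existing data with Picard FixMateInformation; the file names are illustrative:

# FixMateInformation fixes mate fields and, with ADD_MATE_CIGAR=true,
# computes the MC tag for each paired record.
java -jar picard.jar FixMateInformation \
    INPUT=input.bam \
    OUTPUT=input_mc.bam \
    ADD_MATE_CIGAR=true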

Download example data

  • Use the advanced tutorial bundle's human_g1k_v37_decoy.fasta as reference. This same reference is available to load in IGV.
  • tutorial_6747.tar.gz data contain human paired 2x150 whole genome sequence reads originally aligning at ~30x depth of coverage. The sample is a PCR-free preparation of the NA12878 individual run on the HiSeq X platform. This machine type, along with the HiSeq 4000, has the newer patterned flow cell that differs from the typical non-patterned flow cell. I took the reads aligning to a one-Mbp genomic interval (10:96,000,000-97,000,000), then sanitized and realigned them (BWA-MEM -M) to the entire genome according to the workflow presented in Tutorial#6483 to produce snippet.bam. The data have (i) no supplementary records; (ii) secondary records flagged with the 256 flag and the mate-unmapped (8) flag; and (iii) unmapped records (4 flag) with mapped mates (mates have the 8 flag), zero MAPQ (column 5) and asterisks for CIGAR (column 6). This notation allows read pairs where one mate maps and the other does not to sort and remain together when we apply genomic intervals, such as in the generation of the snippet.

Related resources


1. Commands for MarkDuplicates and MarkDuplicatesWithMateCigar

The following commands take a coordinate-sorted and indexed BAM and return (i) a BAM with the same records in coordinate order and with duplicates marked by the 1024 flag, (ii) a duplication metrics file, and (iii) an optional matching BAI index.

For a given file with all MC (mate CIGAR) tags accounted for:

  • and where all mates are accounted for, each tool--MarkDuplicates and MarkDuplicatesWithMateCigar--examines the same duplicate sets but prioritizes differently which inserts get marked duplicate. This situation is represented by our snippet example data.
  • but containing records with missing mates, MarkDuplicates ignores those records, while MarkDuplicatesWithMateCigar still considers them for duplicate marking, using the MC tag for mate information. Again, the duplicate scoring methods differ between the tools.

Use the following commands to flag duplicates for 6747_snippet.bam. These commands produce qualitatively different data.

Score duplicate sets based on the sum of base qualities using MarkDuplicates:

# INPUT may be specified multiple times to merge files.
# OPTICAL_DUPLICATE_PIXEL_DISTANCE is changed from its default of 100.
# CREATE_INDEX is optional.
java -Xmx32G -jar picard.jar MarkDuplicates \
INPUT=6747_snippet.bam \
OUTPUT=6747_snippet_markduplicates.bam \
METRICS_FILE=6747_snippet_markduplicates_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
CREATE_INDEX=true \
TMP_DIR=/tmp

Score duplicate sets based on total mapped reference length using MarkDuplicatesWithMateCigar:

# INPUT may be specified multiple times to merge files.
# OPTICAL_DUPLICATE_PIXEL_DISTANCE is changed from its default of 100.
# CREATE_INDEX is optional.
java -Xmx32G -jar picard.jar MarkDuplicatesWithMateCigar \
INPUT=6747_snippet.bam \
OUTPUT=6747_snippet_markduplicateswithmatecigar.bam \
METRICS_FILE=6747_snippet_markduplicateswithmatecigar_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
CREATE_INDEX=true \
TMP_DIR=/tmp

Comments on select parameters

  • **Revision as of 5/17/2016:** The example input 6747_snippet.bam is coordinate-sorted and indexed. Recently, the tools added a feature to accept queryname-sorted inputs, which in turn by default activates additional features that give DIFFERENT duplicate flagging results than outlined in this tutorial. Namely, if you provide MarkDuplicates a queryname-sorted BAM and a primary alignment is marked as duplicate, the tool also flags its (i) unmapped mate, (ii) secondary and/or (iii) supplementary alignment record(s) as duplicate.
  • Each tool has a distinct default DUPLICATE_SCORING_STRATEGY. For MarkDuplicatesWithMateCigar it is TOTAL_MAPPED_REFERENCE_LENGTH and this is the only scoring strategy available. For MarkDuplicates you can switch the DUPLICATE_SCORING_STRATEGY between the default SUM_OF_BASE_QUALITIES and TOTAL_MAPPED_REFERENCE_LENGTH. Both scoring strategies use information pertaining to both mates in a pair, but in the case of MarkDuplicatesWithMateCigar the information for the mate comes from the read's MC tag and not from the actual mate.
  • To merge multiple files into a single output, e.g. when aggregating a sample from across lanes, specify the INPUT parameter for each file. The tools merge the read records from the multiple files into the single output file. The tools mark duplicates for the entire library (RGLB) and account for optical duplicates per RGID. INPUT files must be coordinate sorted and indexed.
  • The Broad's production workflow increases OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500 to better estimate library complexity; the default for this parameter is 100. Changing this parameter does not alter duplicate marking. It only changes the count of optical duplicates and the library complexity estimate in the metrics file, in that whatever is counted as an optical duplicate does not factor into library complexity. The increase reflects the fact that our example data were sequenced on the patterned flow cell of a HiSeq X machine. Both HiSeq X and HiSeq 4000 technologies decrease pixel unit area by 10-fold, so the equivalent pixel distance for non-patterned flow cells is 250. You may ask why we still count optical duplicates for patterned flow cells that by design should have no optical duplicates. We are hijacking this feature of the tools to account for other types of duplicates arising from the sequencer. Sequencer duplicates are not limited to optical duplicates and should be differentiated from PCR duplicates for more accurate library complexity estimates.
  • By default the tools flag duplicates and retain them in the output file. To remove the duplicate records from the output, set the REMOVE_DUPLICATES parameter to true. However, given that you can set GATK tools to include duplicates in analyses by adding -drf DuplicateRead to commands (see the sketch after this list), the better option for storage efficiency is to retain the marked file in place of the unmarked input file.
  • To optionally create a .bai index, add and set the CREATE_INDEX parameter to true.
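
As a hedged illustration of including duplicates downstream, a GATK3-style command; the tool choice and output file name are illustrative:

# -drf DuplicateRead disables the duplicate read filter, so duplicate-flagged
# reads are included in the analysis.
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R human_g1k_v37_decoy.fasta \
    -I 6747_snippet_markduplicates.bam \
    -drf DuplicateRead \
    -o 6747_snippet_withdups.vcf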

For snippet, the duplication metrics are identical whether marked by MarkDuplicates or MarkDuplicatesWithMateCigar. We have 13.4008% duplication, with 255 unpaired read duplicates and 18,254 paired read duplicates. However, as the screenshot at the top of this page illustrates, and as section 4 explains, the data qualitatively differ.

back to top


2. Slow or out of memory error? Special memory considerations for duplicate marking tools

The seemingly simple task of marking duplicates is one of the most memory-hungry processes, especially for paired end reads. Both tools are compute-intensive and require more memory than other processes.

MarkDuplicatesWithMateCigar is a single-pass tool that streams the duplicate marking routine in a manner that allows for piping; as a consequence, for a given file its memory requirements can be greater than those of MarkDuplicates. Due to these memory constraints, we recommend MarkDuplicates for alignments that have large reference skips, e.g. spliced RNA alignments.

For large files, (1) use the Java -Xmx setting and (2) set the TMP_DIR parameter to a suitable temporary directory. These options allow the tool to run without slowing down or hitting an out-of-memory error. For the purposes of this tutorial, commands are given as if the example data were a large file, which we know it is not.

    java -Xmx32G -jar picard.jar MarkDuplicates \
    ... \
    TMP_DIR=/tmp 

These options can be omitted for small files such as the example data and the equivalent command is as follows.

    java -jar picard.jar MarkDuplicates ...   

Set the Java maximum heap size, specified by the -Xmx#G option, to the maximum your system allows.

The high memory cost, especially for MarkDuplicatesWithMateCigar, arises in part because the tool systematically traverses genomic coordinate intervals for the inserts in question, and for every read it marks as a duplicate it must keep track of the mate, which may or may not map nearby, so that reads are marked as pairs and each record is emitted in its coordinate turn. In the meantime, this information is held in memory--the first choice for faster processing--until the memory limit is reached, at which point memory spills to disk. We set this limit high to minimize instances of memory spilling to disk.

In the example command, the -Xmx32G Java option caps the maximum heap size, or memory usage, at 32 gigabytes, which is the limit on the server I use. This is in contrast to the 8G setting I use for other processes on the same sample data--a 75G BAM file. To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version and look for MaxHeapSize, e.g. as below.
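
For example, to print just that value:

    java -XX:+PrintFlagsFinal -version | grep -i MaxHeapSize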

Set an additional temporary directory with the TMP_DIR parameter for memory spillage.

When the tool hits the memory limit, memory spills to disk, and data must shuttle between memory and disk, slowing the process down. The disk location is one you specify with the TMP_DIR parameter. If you work on a server separate from where you read and write files, setting TMP_DIR to the server's local temporary directory (typically /tmp) can reduce processing time compared to setting it to the storage disk, because the tool then avoids traversing the network file system when spilling memory. Be sure the TMP_DIR location you specify provides enough storage space; use df -h to see how much is available.

back to top


3. Conceptual overview of duplicate flagging

The aim of duplicate marking is to flag all but one insert of a duplicate set as duplicates and to use duplicate metrics to estimate library complexity. Duplicates have a higher probability of being non-independent measurements of the exact same template DNA. Duplicate inserts are marked by the 0x400 bit (1024 flag) in the second column of a SAM record, for each mate of a pair. This allows downstream GATK tools to exclude duplicates from analyses (most do this by default). Certain duplicates, i.e. PCR and sequencer duplicates, violate assumptions of variant calling and can also amplify errors. Removing these, even at the cost of removing serendipitous biological duplicates, allows us to be conservative in calculating the confidence of variants.
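
As a quick sanity check after marking, you can count the records carrying the 1024 flag with samtools; a sketch using the output file from the tutorial command above:

    samtools view -c -f 1024 6747_snippet_markduplicates.bam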

GATK tools allow you to disable the duplicate read filter with -drf DuplicateRead so you can include duplicates in analyses.

For a whole genome DNA sample, duplicates arise from three sources: (i) by chance, from distinct molecular templates that are identical in insert mapping; (ii) from PCR amplification of a template (PCR duplicates); and (iii) from sequencing, e.g. optical duplicates. The tools cannot distinguish between these types of duplicates, with the exception of optical duplicates. In estimating library complexity, the latter two types of duplicates are undesirable and should each factor differently.

When should we not care about duplicates? Given duplication metrics, we can make some judgement calls on the quality of our sample preparation and sequencer run. Of course, we may not expect a complex library if our samples are targeted amplicons. Also, we may expect minimal duplicates if our samples are PCR-free. Or it may be that because of the variation inherent in expression level data, e.g. RNA-Seq, duplicate marking becomes ritualistic. Unless you are certain of your edge case (amplicon sequencing, RNA-Seq allele-specific expression analysis, etc.) where duplicate marking adds minimal value, you should go ahead and mark duplicates. You may find yourself staring at an IGV session trying to visually calculate the strength of the evidence for a variant. We can pat ourselves on the back for having the forethought to systematically mark duplicates and turn on the IGV duplicate filter.

The Broad's Genomics Platform uses MarkDuplicates twice for multiplexed samples. Duplicates are flagged first per sample per lane to estimate lane-level library complexity, and second to aggregate data per sample while marking all library duplicates. In the second pass, duplicate marking tools again assess all reads for duplicates and overwrite any prior flags.

Our two duplicate flagging tools share common features but differ at the core. As the name implies, MarkDuplicatesWithMateCigar uses the MC (mate CIGAR) tag for mate alignment information. Unlike MarkDuplicates, it is a single-pass tool that requires pre-computed MC tags.

  • For RNA-Seq data mapped against the genome, use MarkDuplicates. Specifically, MarkDuplicatesWithMateCigar will refuse to process data with the large reference skips frequent in spliced RNA transcripts, where the gaps are denoted with an N in the CIGAR string.
  • Both tools only consider primary mappings, even if mapped to different contigs, and ignore secondary/supplementary alignments (256 flag and 2048 flag) altogether. Because of this, before flagging duplicates, be sure to mark primary alignments according to a strategy most suited to your experimental aims. See MergeBamAlignment's PRIMARY_ALIGNMENT_STRATEGY parameter for strategies the tool considers for changing primary markings made by an aligner.
  • Both tools identify duplicate sets identically with the exception that MarkDuplicatesWithMateCigar additionally considers reads with missing mates. Missing mates occur for example when aligned reads are filtered using an interval list of genomic regions. This creates divorced reads whose mates aligned outside the targeted intervals.
  • Both tools identify duplicates as sets of read pairs that have the same unclipped alignment start and unclipped alignment end. The tools intelligently account for discordant pair orientations given these start and end coordinates. Within a duplicate set, with the exception of optical duplicates, read pairs may have either pair orientation--F1R2 or F2R1. For optical duplicates, pairs in the set must have the same orientation. Why this is the case is explained in section 4.
  • Both tools take into account clipped and gapped alignments and singly mapping reads (mate unmapped and not secondary/supplementary).
  • Each tool flags duplicates according to different priorities. MarkDuplicatesWithMateCigar prioritizes which pair to leave as the representative nondup based on the total mapped length of a pair while MarkDuplicates can prioritize based on the sum of base qualities of a pair (default) or the total mapped length of a pair. Duplicate inserts are marked at both ends.

back to top


4. Details of interest to some

To reach a high target coverage depth, some fraction of sequenced reads will, by chance, be duplicate reads.

Let us hope the truth of a variant never comes down to so few reads that duplicates should matter so. Keep in mind the better evidence for a variant is the presence of overlapping reads that contain the variant. Also, take estimated library complexity at face value--an estimate.

Don't be duped by identical numbers. Data from the two tools qualitatively differ.

First, let me reiterate that secondary and supplementary alignment records are skipped and never flagged as duplicate.

Given a file with no missing mates, each tool identifies the same duplicate sets from primary alignments only, and therefore the same number of duplicates. To reiterate, the number of duplicate sets, and the records within each set, are the same for each tool. However, the tools differ in how they decide which insert(s) within a set get flagged, and thus which insert remains the representative nondup. Also, if there are ties, the tools may break them differently, in that tie-breaking can depend on the sort order of the records in memory.

  • MarkDuplicates by default prioritizes the sum of base qualities for both mates of a pair. The pair with the highest sum of base qualities remains as the nondup.
  • As a consequence of using the mate's CIGAR string (provided by the MC tag), MarkDuplicatesWithMateCigar can only prioritize the total mapped reference length, as provided by the CIGAR string, in scoring duplicates in a set. The pair with the longest mapping length remains as the nondup.
  • If there are ties after applying each scoring strategy, both tools break the ties down a chain of deterministic factors starting with read name.

Duplicate metrics in brief

We can break down the metrics file into two parts: (1) a table of metrics that counts various categories of duplicates and gives the library complexity estimate, and (2) histogram values in two columns.

See DuplicationMetrics for descriptions of each metric. For paired reads, duplicates are considered at the insert level. For single-end reads, duplicates are considered per read, increasing the likelihood of a read being identified as a duplicate. Given the lack of insert-level information for these singly mapping reads, the insert metrics calculations exclude them.
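
Since Picard metrics files are plain text, a grep sketch like the following pulls out just the metrics table; the file name is from the tutorial command above, and -A 2 assumes a single-library file:

    grep -A 2 "^## METRICS CLASS" 6747_snippet_markduplicates_metrics.txt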

The library complexity estimate only considers the duplicates that remain after subtracting out optical duplicates. For the math to derive estimated library size, see formula (1.2) in Mathematical Notes on SAMtools Algorithms.
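
Per our reading of that formula, the estimated library size $X$ solves the saturation equation

    C = X (1 - e^{-N/X})

where $N$ is the number of read pairs after subtracting optical duplicates and $C$ is the number of distinct read pairs observed; the tools solve this numerically.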

The histogram values extrapolate the calculated library complexity to a saturation curve that plots the gains in complexity if you were to sequence additional aliquots of the same library. The first bin's value represents the current complexity.

Pair orientation F1R2 is distinct from F2R1 for optical duplicates

Here we refer you to a five minute video illustrating what happens at the molecular level in a typical sequencing by synthesis run.

What I would like to highlight is that each strand of an insert has a chance to seed a different cluster. I will also point out that, due to sequencing chemistry, F1 and R1 reads typically have better base qualities than F2 and R2 reads.

Optical duplicate designation requires the same pair orientation.

Let us work out the implications of this for a paired end, unstranded DNA library. During sequencing, within the flow cell, for a particular insert produced by sample preparation, the strands of the insert are separated and each strand has a chance to seed a different cluster--say ClusterA and ClusterB for InsertAB, and ClusterC and ClusterD for InsertCD. InsertAB and InsertCD are identical in sequence and length and map to the same loci. It is possible InsertAB and InsertCD are PCR duplicates, and also possible they represent original inserts. Each strand is then sequenced in the forward and reverse directions to give four pieces of information in total for a given insert, e.g. ReadPairA and ReadPairB for InsertAB. The pair orientations of these two read pairs are reversed--one cluster gives the F1R2 orientation and the other gives F2R1. Both read pairs map to exactly the same loci. Our duplicate marking tools consider ReadPairA and ReadPairB in the same duplicate set for regular duplicates, but not for optical duplicates. Optical duplicates require identical pair orientation.

back to top


microRNA Annotated by Oncotator


I have identified some mutations that are all annotated by Oncotator as belonging to a specific microRNA, miRNAxxb. I checked the genomic regions, transcripts, and products of the microRNA, and found that both strands in this genomic region have mature miRNA products, named miRNAxxa and miRNAxxb. The genomic regions of the primary miRNAs of the two mature miRNAs are exactly the same, but the genomic regions where the mature miRNAs are encoded do not overlap. In fact, the mutations I identified all fall in the region where mature miRNAxxa is encoded.
If we assume that even the genomic regions trimmed away during processing from primary miRNA to mature miRNA can have some effect on the expression level of the miRNA, then in this case the expression levels of both miRNAs would be affected.
My question is: why is only miRNAxxb annotated by Oncotator?


Incorrect .tsv and hdf5 output in CollectReadCounts


I am running a custom-made interval_list to count reads in specific intervals in mouse very-low-pass WGS data (generated on an Oxford Nanopore MinION).

I get the output, and from the command-line log I can see that the correct reads were filtered, but I can't find them in the .tsv file.

Any help would be super appreciated.

Nada

Are there any Broad-specific instructions for using GATK?


In general you should use FireCloud, which has all the major GATK workflows preloaded, is more scalable and makes it easier to share any work you do with external collaborators, since the portal is publicly accessible and you can grant anyone access to workspaces securely and conveniently.

However, there are a few Broad-internal resources that you can use if FireCloud is not yet a suitable option for you.

  1. Dotkits for running GATK CNV and ACNV
  2. GATK CNV Toolchain in Firehose

1. Dotkits for running GATK CNV and ACNV

The following dotkits should load all the necessary dependencies:

use .hdfview-2.9
use Java-1.8
use .r-3.1.3-gatk-only

If these don't work, move to a VM where the dotkits are not broken. If that still doesn't work, go to FireCloud.


2. GATK CNV Toolchain in Firehose

We make this available as a courtesy, but we will not be able to provide support for any Firehose-specific aspects. Note that Firehose will be phased out at some point in 2018, and you will need to move your work to FireCloud by then. Rest assured we will provide support for the migration (phase-out calendar TBD).

We have put the GATK4 Somatic CNV Toolchain into Firehose. Please copy the below workflows from Algorithm_Commons:

GATK_Somatic_CNV_Toolchain_Capture
GATK_Somatic_CNV_Toolchain_WGS

Frequently asked questions:

Who do I contact with an issue?

First, make sure that your question is not here or in another forum post. If it is a Firehose issue or you are not sure, email pipeline-help@broadinstitute.org. If you are sure that it is an issue with GATK CNV, ACNV, or GetBayesianHetPulldown, post to the forum.

What is GATK CNV vs. ACNV and which are run in the workflows above?

  • GATK CNV estimates total copy ratio and performs segmentation and (basic) event calling. This tool works very similarly to ReCapSeg (for now).
  • GATK ACNV creates credible intervals for copy ratio and minor allelic fraction (MAF). Under the hood, this tool is very different from Allelic CapSeg, but it can produce a file that can be ingested by ABSOLUTE (i.e. file is in same format produced by Allelic CapSeg)
  • Both GATK CNV and ACNV are in the workflows above.

Are the results (e.g. sensitivity and precision) of the GATK CNV toolchain better than ReCapSeg's?

If you mean running without the allelic integration, then the results are equivalent. If you want more details, ask in the forum or invite us to talk to you -- we have a presentation or two about this topic.

Do I run these workflows on Pair Sets or Individual Sets?

Individual Sets

What entity types do the tasks run on?

Samples and Pairs. I realize that the above question says to run the workflow on Individual Sets. This is to work around a Firehose issue.

What are the caveats around WGS?

  • The total copy number tasks (similar to ReCapSeg) take about a tenth of the time of ReCapSeg, assuming good NFS performance. This is a good thing.
  • The allelic tasks (GetBayesianHetPulldown and GATK ACNV) take a very long time to run; over a day of runtime is not uncommon. The next version of the GATK4 CNV Toolchain will address this issue, but due to dispatch limitations, Firehose may not be able to fully capitalize on these improvements.
  • The runtimes in general are very sensitive to filesystem performance.
  • The results still have the same oversegmentation issues that you will see in ReCapSeg. There is a GC correction tool, but this has not been integrated into the Firehose workflow.
  • There is a bug in a third-party library that limits the size of a PoN. This is unlikely to be an issue for capture, but can become a problem for WGS. For more details, please see gatkforums.broadinstitute.org/gatk/discussion/7594/limits-on-the-size-of-a-pon

What is the future of ReCapSeg?

We are phasing out ReCapSeg, for many reasons, everywhere -- not just Firehose. If you would like more details, post to the forum and we'll respond.

What is the future of Allelic CapSeg?

We have never supported (and never will support) Allelic CapSeg and cannot answer that question. We have some results comparing Allelic CapSeg and GATK ACNV. We can show you if you are interested (internal to Broad only).

Why are there fewer plots than in ReCapSeg?

We did not include plots that we did not believe were being used. If you would like to include additional plots, please post to the forum.

How is the GATK 4 CNV toolchain workflow better than the ReCapSeg workflow?

  • Faster. On exome, ReCapSeg takes ~105 minutes per case sample. GATK CNV takes < 30 minutes. Both time estimates assume good performance of NFS filesystem.
  • The workflows above include allelic integration results, from the tool GATK ACNV. These results are analogous to what Allelic CapSeg produces.
  • The workflows above produce results compatible with ABSOLUTE and TITAN, i.e. the results can be used as input to either tool.
  • All future improvements and bugfixes are going into GATK, not ReCapSeg. And many improvements are coming....
  • The workflows produce germline heterozygous SNP call files.
  • The ReCapSeg WGS workflow no longer works.

Are there new PoNs for these workflows?

Yes, but the PoN locations are already populated if you run the workflows properly. You should not need to do any setup yourself.

Is the correct PoN automatically selected for ICE vs. Agilent samples?

Yes, if you run the workflow as provided.

Is there a PoN creation workflow in Firehose?

No. Never going to happen. Don't ask. See the forum for instructions to create PoNs.

Can I run ABSOLUTE from the output of GATK ACNV?

Yes. The annotations are gatk4cnv_acnv_acs_seg_file_capture (capture) and gatk4cnv_acnv_acs_seg_file_wgs (WGS).

Can I run TITAN from the output of GATK ACNV?

Yes, though there has been little testing done on this. The annotations are gatk4cnv_acnv_acs_seg_file_capture and gatk4cnv_acnv_acs_seg_file_wgs.

Do the workflows above include Oncotator gene lists?

Yes.

These workflows include Picard Target Mapper. Isn't that going to cause me to have to rerun all of my jobs (e.g. MuTect)?

The workflows above will rerun Picard Target Mapper, but only new annotations are added; all previous output annotations of Picard Target Mapper should be populated with the same values. This will look as if it has outdated mutation calling (MuTect) and other tasks, but the reruns will be job-avoided.

Can I do the tumor-only GATK ACNV workflow?

For exome that is working well, but is not available in Firehose. If you would like to see evaluation data for tumor-only on exome, we can show you (internal to Broad only).

What are all of the annotations produced?

Where applicable, each annotation in the list below also has a *_wgs counterpart.
Sample annotations:

  • gatk4cnv_seg_file_capture -- seg file of GATK CNV. This file is analogous to the ReCapSeg seg file.
  • gatk4cnv_tn_file_capture -- tangent normalized (denoised) target copy ratio estimates of GATK CNV. This file is analogous to the ReCapSeg tn file.
  • gatk4cnv_pre_tn_file_capture -- coverage profile (i.e. target copy ratio estimates without denoising) of GATK CNV. This file is analogous to the ReCapSeg pre_tn file.
  • gatk4cnv_betahats_capture -- Tangent normalization coefficients used in the projection. This is in the weeds.
  • gatk4cnv_called_seg_file_capture -- output called seg file of GATK CNV. This file is analogous to the ReCapSeg called seg file.
  • gatk4cnv_oncotated_called_seg_file_capture -- gene list file generated from the GATK CNV segments

  • gatk4cnv_dqc_capture (coming later) -- measure of noise reduction in the tangent normalization process. Lower is better.

  • gatk4cnv_preqc_capture (coming later) -- measure of noise before tangent normalization
  • gatk4cnv_postqc_capture (coming later) -- measure of noise after tangent normalization
  • gatk4cnv_num_seg_capture (coming later) -- number of segments in the GATK CNV output

Pair annotations:

  • gatk4cnv_case_het_file_capture -- het pulldown file for the tumor sample in the pair.
  • gatk4cnv_control_het_file_capture -- het pulldown file for the normal sample in the pair.
  • gatk4cnv_acnv_seg_file_capture -- ACNV seg file with confidence intervals for copy ratio and minor allelic fraction.

  • gatk4cnv_acnv_acs_seg_file_capture -- ACNV seg file in a format that looks as if it was produced by AllelicCapSeg. Any segments called as "balanced" will be pegged to a MAF of 0.5. This file is ready for ingestion by ABSOLUTE.

  • gatk4cnv_acnv_cnv_seg_file_capture -- ACNV seg file in a format that looks as if it was produced by GATK CNV
  • gatk4cnv_acnv_titan_het_file_capture -- het file in a format that can be ingested by TITAN
  • gatk4cnv_acnv_titan_cr_file_capture -- target copy ratio estimates file in a format that can be ingested by TITAN
  • gatk4cnv_acnv_cnloh_balanced_file_capture -- ACNV seg file with calls for whether a segment is balanced or CNLoH (or neither).

Do the workflows also run on the normals?

GATK CNV, yes.
GATK ACNV, no.
There is a het pulldown generated for the normal, as a side effect, when doing the het pulldown for the tumor.

What about array data?

The GATK4 CNV tools do not run on array data. Sequencing data only.

Do we still need separate PoNs if we want to run on X and Y?

Yes.

Can I run both the ReCapSeg workflow and the GATK CNV toolchain workflow?

Yes. All results are written to separate annotations.

Are the new workflows part of my PrAn?

No, not yet. You will need to copy (and run) these manually from Algorithm_Commons before you begin analysis. As a reminder, copy into your analysis workspace.

Does GATK CNV require matched (tumor-normal) samples?

No.

Does GATK ACNV require matched (tumor-normal) samples?

In Firehose, yes. Out of Firehose, no.

How do I modify the ABSOLUTE tasks in FH to accept the new GATK ACNV annotations?

There are two changes you need to make to the ABSOLUTE_v1.5_WES configuration to make it accept the new outputs.

  • replace alleliccapseg_tsv with gatk4cnv_acnv_acs_seg_file_capture in the inputs
  • replace alleliccapseg_skew with 0.9883274, and change the annotation type to "Literal" instead of "Simple Expression"

Is there a panel-of-cancer option for somatic variant calling


I am trying to identify somatic variants from 18 pairs of matched tumour/normal samples from a single cancer type.

I've done the tutorial at https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2. The idea of a panel of normals makes a lot of sense to me, so I am wondering whether there is an option to call somatic variants from all tumour samples concurrently (hence "panel of cancers") with respect to the panel of normals, leveraging information from all tumour samples together somehow with GATK4 Mutect2?

Annotate possible de novo in WES


Hi,
I've been following the Best Practices for WES data on 97 individuals (10 trios plus other relatives) using GATK4. I currently have a multisample VCF file after VQSR, and I want to run the genotype refinement workflow described here: https://gatkforums.broadinstitute.org/gatk/discussion/4727/howto-run-the-genotype-refinement-workflow. I would like to focus first on the 10 trios, and I wonder whether I need to split my multisample VCF into 10 files (one per trio) and then apply genotype refinement to each of them.
I think I can get the genotype posteriors using the multisample VCF and a PED file with the information for the 10 trios, which generates a new multisample VCF. But I am not sure whether I can annotate the de novo variants using this new multisample VCF, or whether it is better to do all the steps on the per-trio VCFs.
Thank you very much

Add or Replace read group


Hi, I am trying to use GATK for the first time with RNA-Seq data coming from two different runs (one from a HiSeq 1000 and the other from a HiSeq 2000). When I add or replace read groups using the Picard tool, do I need to create a different read group for each run?
If yes, how? Thank you.
Eleonora
