Channel: Recent Discussions — GATK-Forum

(How to) Filter on genotype using VariantFiltration


This tutorial illustrates how to filter on genotype, e.g. on heterozygous genotype calls. The steps apply to both single-sample and multi-sample callsets.

First, the genotype is annotated with a filter expression using VariantFiltration. Then, the filtered genotypes are made into no-call (./.) genotypes with SelectVariants so that downstream tools may discount them.

We use this example variant record's FORMAT fields from trio.vcf to illustrate.

GT:AD:DP:GQ:PL  
0/1:17,15:32:99:399,0,439       0/1:11,12:23:99:291,0,292       1/1:0,30:30:90:948,90,0

1. Annotate genotypes using VariantFiltration

If we want to filter heterozygous genotypes, we use VariantFiltration's --genotype-filter-expression "isHet == 1" option. We can specify the annotation value the tool uses to label the heterozygous genotypes with the --genotype-filter-name option. Here, this parameter's value is set to "isHetFilter".

gatk VariantFiltration \
-V trio.vcf \
-O trio_VF.vcf \
--genotype-filter-expression "isHet == 1" \
--genotype-filter-name "isHetFilter"

After filtering, in the resulting trio_VF.vcf, our example record gains an FT field and becomes:

GT:AD:DP:FT:GQ:PL
0/1:17,15:32:isHetFilter:99:399,0,439   0/1:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0

We see that HET (0/1) genotype calls get an isHetFilter value in the FT field, while other genotype calls get a PASS.

The VariantFiltration tool document lists the various options to filter on the FORMAT (aka genotype call) field:

We have put in convenience methods so that one can now filter out hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1"). Also available are expressions isCalled, isNoCall, isMixed, and isAvailable, in accordance with the methods of the Genotype object.
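For example, a minimal sketch (using the same trio.vcf; the output file name and filter name here are illustrative) that flags homozygous-variant genotypes instead of heterozygous ones:

gatk VariantFiltration \
-V trio.vcf \
-O trio_VF_homvar.vcf \
--genotype-filter-expression "isHomVar == 1" \
--genotype-filter-name "isHomVarFilter"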


2. Transform filtered genotypes to no call

Running SelectVariants with --set-filtered-gt-to-nocall further transforms the flagged genotypes into null genotype calls. This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.

gatk SelectVariants \
-V trio_VF.vcf \
--set-filtered-gt-to-nocall \
-O trioGGVCF_VF_SV.vcf

The result is that the GT genotypes of the isHetFilter-flagged records become null, or no call (./.), as follows.

GT:AD:DP:FT:GQ:PL
./.:17,15:32:isHetFilter:99:399,0,439   ./.:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0


Combine phased calls from Mutect2


Hello, is there a way to have Mutect2 emit multi-nucleotide variants instead of multiple adjacent SNVs?

For example, consider this variant:

REF: AGGT
ALT: ATCT

Mutect will call the G/T SNP at position 2 as one line, and the G/C SNP at position 3 as another line. The fact that they are part of the same haplotype is then indicated by the phasing information in the INFO column of the VCF.

I would prefer to have it call a multi-nucleotide variant: REF GG and ALT TC.

Can I get Mutect to do this? Or is there any post-processing tool you can recommend?

The reason for preferring MNVs instead of SNVs is that I am using ensembl-VEP to predict the protein consequences of the variants. In that case it's quite important to represent the actual haplotypes instead of stepping through variant sites one-by-one.

Thanks,
Patrick

Stateprovider is not working in Cordova angularJS app


I am developing an app using Cordova and AngularJS. ngRoute is working fine, but when I try to use ui.router it does not work: the view template is not rendering inside the view. My app.js:

angular.module('helloApp', [
  'ngAnimate',
  'ngCookies',
  'ngResource',
  'ngRoute',
  'ngSanitize',
  'ngTouch',
  'ui.router'
])
.config(function ($routeProvider, $stateProvider) {
  $stateProvider
    .state('home', {
      url: '/',
      templateUrl: 'views/main.html',
      controller: 'MainCtrl'
    });
});

Running GATK WDL on FireCloud with TCGA controlled bam files


Hi, GATK team!

I have an issue with the GATK 4.0 pipeline when running an analysis on FireCloud.

I am going to run GATK with TCGA controlled mRNA-Seq BAM files. As far as I know, FireCloud offers TCGA level 1 BAM files named .sorted_genome_alignments.bam. So I ran the pipeline from the MarkDuplicates step according to rnaseq-germline-snps-indels, the public WDL example that the GATK team put forward on FireCloud. Then I set the proper parameters and workspace attributes in the configuration, in particular the reference fasta gs://broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta, but I got an error reported like this:

[Mon Jun 11 04:59:58 UTC 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 18.94 minutes.
Runtime.totalMemory()=24761073664
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: This program requires input that are either coordinate or query sorted. Found unsorted
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:254)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar MarkDuplicates --INPUT /cromwell_root/5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/LUAD/RNA/RNA-Seq/UNC-LCCC/ILLUMINA/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --CREATE_INDEX true --VALIDATION_STRINGENCY SILENT --METRICS_FILE UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.metrics

Since the name of the input BAM file includes "sorted", I thought it was reasonable to add the "--ASSUME_SORTED" option after searching through solutions that other people and GATK staff had posted. The MarkDuplicates step then finally worked, but in the next step, SplitNCigarReads, this error occurred:

INFO  12:59:01,086 HelpFormatter - Program Args: -T SplitNCigarReads -R /cromwell_root/broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta -I /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/e44da35f-1087-423f-95ea-53944c30c5f2/RNAseq/e5b8550b-f301-47b7-a709-f6d91554ab6f/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam -o UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS 
INFO  12:59:01,089 HelpFormatter - Executing as root@1ffd1fee7d64 on Linux 4.9.0-0.bpo.6-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-8u111-b14-2~bpo8+1-b14. 
INFO  12:59:01,089 HelpFormatter - Date/Time: 2018/06/19 12:59:01 
INFO  12:59:01,090 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  12:59:01,090 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  12:59:01,234 GenomeAnalysisEngine - Strictness is SILENT 
INFO  12:59:01,292 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  12:59:01,298 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
WARNING: BAM index file /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/e44da35f-1087-423f-95ea-53944c30c5f2/RNAseq/e5b8550b-f301-47b7-a709-f6d91554ab6f/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bai is older than BAM /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/e44da35f-1087-423f-95ea-53944c30c5f2/RNAseq/e5b8550b-f301-47b7-a709-f6d91554ab6f/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam
INFO  12:59:01,319 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 
INFO  12:59:02,073 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM_rCRS, chrX, chrY]
##### ERROR ------------------------------------------------------------------------------------------

I tried to fix this by adding a ReorderSam step, as the error message suggested. The reference fasta I used was still Homo_sapiens_assembly19_1000genomes_decoy.fasta, but it still didn't work. The error message was:

[Mon Jun 11 15:30:14 UTC 2018] picard.sam.ReorderSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=665845760
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: New reference sequence does not contain a matching contig for chr1
    at picard.sam.ReorderSam.buildSequenceDictionaryMap(ReorderSam.java:263)
    at picard.sam.ReorderSam.doWork(ReorderSam.java:146)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar ReorderSam --INPUT /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/64c3a79b-04f8-46f8-a238-8717380c7768/RNAseq/4e8ce380-6f4b-41f6-b1d2-4fe11ed8fa68/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.reorder.bam -R /cromwell_root/broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta --CREATE_INDEX true

I then added the options --VALIDATION_STRINGENCY LENIENT and --ALLOW_INCOMPLETE_DICT_CONCORDANCE, and got this:

Ignoring SAM validation error: ERROR: Record 178984837, Read name UNC9-SN296_246:4:1107:4151:192010/2, Mapped mate should have mate reference name
Ignoring SAM validation error: ERROR: Record 178984905, Read name UNC9-SN296_246:4:2205:17136:94561/2, Mapped mate should have mate reference name
INFO 2018-06-13 03:23:44 ReorderSam Wrote 186956859 reads
[Wed Jun 13 03:23:46 UTC 2018] picard.sam.ReorderSam done. Elapsed time: 40.71 minutes.
Runtime.totalMemory()=10954997760
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Exception when processing alignment for BAM index UNC9-SN296_246:4:1101:10000:103197/2 2/2 50b aligned read.
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:140)
    at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:226)
    at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyClose(AsyncSAMFileWriter.java:38)
    at htsjdk.samtools.util.AbstractAsyncWriter.close(AbstractAsyncWriter.java:89)
    at picard.sam.ReorderSam.doWork(ReorderSam.java:167)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record UNC9-SN296_246:4:1101:10000:103197/2 2/2 50b aligned read.
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:119)
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:137)
    ... 9 more
Caused by: htsjdk.samtools.SAMException: Unexpected reference -1 when constructing index for 0 for record UNC9-SN296_246:4:1101:10000:103197/2 2/2 50b aligned read.
    at htsjdk.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:218)
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:117)
    ... 10 more
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar ReorderSam --INPUT /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/4cc0531d-5121-4140-8344-f38235f035fd/RNAseq/f7b4b882-effb-43ed-a70a-76720b2d8772/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.reorder.bam -R /cromwell_root/broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --ALLOW_INCOMPLETE_DICT_CONCORDANCE

It seems that the input sorted_genome_alignments BAM file was aligned to a reference FASTA different from the one my pipeline uses. Although I looked through the metadata that TCGA provides on their official website and the description of the TCGA controlled-access workspace in FireCloud, I couldn't find which specific reference FASTA file was used.
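For reference, one way to see which contigs a BAM was aligned against (assuming samtools is available in the environment) is to print its @SQ header lines and compare them with the contigs in the reference's .dict file:

samtools view -H UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.bam | grep '^@SQ' | head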

Could you please provide some help to solve the problem?

(howto) Evaluate a callset with CollectVariantCallingMetrics


Related Documents

Context

This document will walk you through use of Picard's CollectVariantCallingMetrics tool, an excellent tool for large callsets, especially if you need your results quickly and do not require many metrics beyond those described here. Your callset consists of variants identified by earlier steps in the GATK best practices pipeline, and now requires additional evaluation to determine where your callset falls on the spectrum of "perfectly identifies all true, biological variants" to "only identifies artifactual or otherwise unreal variants". When variant calling, we want the callset to maximize the correct calls, while minimizing false positive calls. While very robust methods, such as Sanger sequencing, can be used to individually sequence each potential variant, statistical analysis can be used to evaluate callsets instead, saving both time and money. These callset-based analyses are accomplished by comparing relevant metrics between your samples and a known truth set, such as dbSNP. Two tools exist to examine these metrics: VariantEval in GATK, and CollectVariantCallingMetrics in Picard. While the latter is currently used in the Broad Institute's production pipeline, the merits of each tool, as well as the basis for variant evaluation, are discussed here.


Example Use

Command

java -jar picard.jar CollectVariantCallingMetrics \
INPUT=CEUtrio.vcf \
OUTPUT=CEUtrioMetrics \
DBSNP=dbsnp_138.b37.excluding_sites_after_129.vcf 
  • INPUT
    The CEU trio (NA12892, NA12891, and NA12878) from the 1000 Genomes Project is the input chosen for this example. It is the callset whose metrics we wish to examine, and thus this is the field where you would specify the VCF file containing your samples' variant calls.

  • OUTPUT
    The output for this command will be written to two files named CEUtrioMetrics.variant_calling_summary_metrics and CEUtrioMetrics.variant_calling_detail_metrics, hereafter referred to as "summary" and "detail", respectively. The value given in this field is used as the base name of the output files; the file extensions are appended by the tool itself.

  • DBSNP
    The last required input to run this tool is a dbSNP file. The one used here is available in the current GATK bundle. CollectVariantCallingMetrics utilizes this dbSNP file as a base of comparison against the sample(s) present in your vcf.

Getting Results

After running the command, CollectVariantCallingMetrics will return both a detail and a summary metrics file. These files can be viewed as text files if needed, or read in as tables using your preferred spreadsheet viewer (e.g. Excel) or scripting language of choice (e.g. Python, R, etc.). The files contain headers and are tab-delimited; the commands for reading the tables into R are shown below. (Note: Replace "~/path/to/" with the path to your output files as needed.)

summary <- read.table("~/path/to/CEUtrioMetrics.variant_calling_summary_metrics", header=TRUE, sep="\t")
detail <- read.table("~/path/to/CEUtrioMetrics.variant_calling_detail_metrics", header=TRUE, sep="\t")
  • Summary
    The summary metrics file will contain a single row of data for each metric, taking into account all samples present in your INPUT file.

  • Detail
    The detail metrics file gives a breakdown of each statistic by sample. In addition to all metrics covered in the summary table, the detail table also contains entries for SAMPLE_ALIAS and HET_HOMVAR_RATIO. In the example case here, the detail file will contain metrics for the three different samples, NA12892, NA12891, and NA12878.

Analyzing Results

image
*Concatenated in the above table are the detail file's (rows 1-3) and the summary file's (row 4) relevant metrics; for full output table, see attached image file.

  • Number of Indels & SNPs
    This tool collects the number of SNPs (single nucleotide polymorphisms) and indels (insertions and deletions) as found in the variants file. It counts only biallelic sites and filters out multiallelic sites. Many factors affect these counts, including cohort size, relatedness between samples, strictness of filtering, ethnicity of samples, and even algorithm improvement due to updated software. While this metric alone is insufficient to evaluate your variants, it does provide a good baseline. It is reassuring to see that across the three related samples, we saw very similar numbers of SNPs and indels. It could be cause for concern if a particular sample had significantly more or fewer variants than the rest.

  • Indel Ratio
    The indel ratio is the total number of insertions divided by the total number of deletions; this tool does not include filtered variants in this calculation. Usually the indel ratio is around 1, as insertions typically occur about as frequently as deletions. In rare variant studies, however, the indel ratio should be around 0.2-0.5. Our samples have an indel ratio of ~0.95, indicating that these variants are not likely to have a bias affecting their insertion/deletion ratio.

  • TiTv Ratio
    This metric is the ratio of transition (Ti) to transversion (Tv) mutations. For whole genome sequencing data, TiTv should be ~2.0-2.1, whereas whole exome sequencing data will have a TiTv ratio of ~3.0-3.3. In the case of the CEU trio of samples, the TiTv values of ~2.08 and ~1.91 are within reason, and this variant callset is unlikely to have a bias affecting its transition/transversion ratio.

Germline short variant discovery (SNPs + Indels)


Purpose

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.



Reference Implementations

Pipeline | Summary | Notes | Github | FireCloud
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | pending
Prod* germline short variant joint genotyping | GVCFs to cohort VCF | optimized for GCP | yes | pending
$5 Genome Analysis Pipeline | uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38
Generic germline short variant per-sample calling | analysis-ready BAM to GVCF | universal | yes | hg38
Generic germline short variant joint genotyping | GVCFs to cohort VCF | universal | yes | hg38 & b37
Intel germline short variant per-sample calling | uBAM to GVCF | Intel optimized for local architectures | yes | NA

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.


Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.


Main steps

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.
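As an illustrative sketch, a minimal per-sample HaplotypeCaller command in GVCF mode might look like the following; the file names are placeholders, and the tool documentation lists the full set of recommended arguments.

gatk HaplotypeCaller \
-R reference.fasta \
-I sample1.bam \
-O sample1.g.vcf.gz \
-ERC GVCF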

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly.

Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.
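A minimal consolidation sketch; the sample GVCFs, workspace path and interval are placeholders, and note that GenomicsDBImport requires intervals via -L.

gatk GenomicsDBImport \
-V sample1.g.vcf.gz \
-V sample2.g.vcf.gz \
-V sample3.g.vcf.gz \
--genomicsdb-workspace-path cohort_db \
-L 20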

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.
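A minimal joint-genotyping sketch, assuming the GenomicsDB workspace created above; the gendb:// prefix points GenotypeGVCFs at the datastore, and file names are placeholders.

gatk GenotypeGVCFs \
-R reference.fasta \
-V gendb://cohort_db \
-O cohort.vcf.gz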

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyRecalibration

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the amount of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.
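As a sketch of the two-step process for SNPs (resource files, annotations and the tranche threshold are placeholders, the exact --resource syntax differs between GATK4 releases, and the same process is run separately in INDEL mode; in GATK4 the second step is performed by ApplyVQSR):

gatk VariantRecalibrator \
-R reference.fasta \
-V cohort.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum -an SOR \
-mode SNP \
--tranches-file cohort_snps.tranches \
-O cohort_snps.recal

gatk ApplyVQSR \
-R reference.fasta \
-V cohort.vcf.gz \
--recal-file cohort_snps.recal \
--tranches-file cohort_snps.tranches \
--truth-sensitivity-filter-level 99.7 \
-mode SNP \
-O cohort_snps_vqsr.vcf.gz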

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.


Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which requires access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described above, enabling incremental growth of cohorts as well as scaling to large cohort sizes.

Mutect2 with latest nightly build problem


Hi,

I am trying to run the variant calling step for the first time as a test. As I don't have any matched normal, I have only my tumour .bam file and a PoN composed of only 2 samples (for now).

So I combined my two .vcf normals to create a PoN and launched MuTect2 like this:

java -jar /analysis/GATK/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar \
-T MuTect2 \
-R /projects/acoudray/star/ref_genome/hg38_gatk/GRCh38.primary_assembly.genome.fa \
-I:tumor /projects/acoudray/star/RNAseq/BaseRecalibration/rnaseq_added_sorted_dedup_split_recal.bam \
-PON MuTect2_PON_2samples.vcf \
--dbsnp /projects/acoudray/star/ref_genome/dbsnp_b147_hg38.vcf \
-o MuTect2_output_SD01_RNAseq_PON2samples.vcf

and I got some strange warnings during the run:

WARN 18:01:55,963 SomaticGenotypingEngine - At Locus chrchr1:8685652, we detected that variant context had alleles that not in PRALM. VC alleles = [C, T], PRALM alleles = []
WARN 18:01:56,283 SomaticGenotypingEngine - At Locus chrchr1:8717782, we detected that variant context had alleles that not in PRALM. VC alleles = [A, G], PRALM alleles = []
WARN 18:01:56,458 SomaticGenotypingEngine - At Locus chrchr1:8733981, we detected that variant context had alleles that not in PRALM. VC alleles = [C, T], PRALM alleles = []
WARN 18:01:57,860 SomaticGenotypingEngine - At Locus chrchr1:8808360, we detected that variant context had alleles that not in PRALM. VC alleles = [A, C], PRALM alleles = []
INFO 18:02:03,694 ProgressMeter - chr1:8861373 0.0 5.5 m 545.7 w 0.3% 32.1 h 32.0 h
INFO 18:03:03,696 ProgressMeter - chr1:8865651 0.0 6.5 m 644.9 w 0.3% 37.9 h 37.8 h

and it seems that nothing is written in the output.

Any idea where this comes from?

Note: I just installed the nightly build for the last 2 steps (combining the .vcf files and variant calling with MuTect2). The normal-only runs to generate the .vcf files in artifact_detection_mode were done with plain GATK v3.6, not with the nightly build.

Thanks a lot !

Alex

MergeBamAlignment Problem


Hi,

I'm trying to merge two bam files that I got going through the Drop-seq pipeline. I've checked both inputs using ValidateSamFile and have no errors from either. But when I run MergeBamAlignment, I keep getting the same warning while it's running:

WARNING 2018-06-20 11:33:53     SamAlignmentMerger      Exception merging bam alignment - attempting to sort aligned reads and try again: Inappropriate call if not paired read

And this one after it finishes reading in all the records from the alignment SAM/BAM:

Exception in thread "main" java.lang.IllegalStateException: Inappropriate call if not paired read
        at htsjdk.samtools.SAMRecord.requireReadPaired(SAMRecord.java:866)
        at htsjdk.samtools.SAMRecord.getProperPairFlag(SAMRecord.java:874)
        at picard.sam.AbstractAlignmentMerger.setValuesFromAlignment(AbstractAlignmentMerger.java:655)
        at picard.sam.AbstractAlignmentMerger.transferAlignmentInfoToFragment(AbstractAlignmentMerger.java:548)
        at picard.sam.AbstractAlignmentMerger.transferAlignmentInfoToPairedRead(AbstractAlignmentMerger.java:578)
        at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:390)
        at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:157)
        at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:266)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I'm not sure what the issue is. I've been successful using the same code on previous Drop-seq files but this newest batch is causing these issues. As far as I can tell, they are the same as older runs that worked. Any ideas would be greatly appreciated!

Thanks!
pschnepp


(How to) Map and clean up short read sequence data efficiently



image
If you are interested in emulating the methods used by the Broad Genomics Platform to pre-process your short read sequencing data, you have landed on the right page. The parsimonious operating procedures outlined in this three-step workflow maximize data quality as well as storage and processing efficiency to produce a mapped and clean BAM. This clean BAM is ready for analysis workflows that start with MarkDuplicates.

Since your sequencing data could be in a number of formats, the first step of this workflow refers you to specific methods to generate a compatible unmapped BAM (uBAM, Tutorial#6484) or (uBAMXT, Tutorial#6570 coming soon). Not all unmapped BAMs are equal and these methods emphasize cleaning up prior meta information while giving you the opportunity to assign proper read group fields. The second step of the workflow has you marking adapter sequences, e.g. arising from read-through of short inserts, using MarkIlluminaAdapters such that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies metadata. The resulting clean BAM is coordinate-sorted and indexed.

The workflow reflects a lossless operating procedure that retains original sequencing read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one needs only keep the final BAM file.

Geraldine_VdAuwera points out that there are many different ways of correctly preprocessing HTS data for variant discovery and ours is only one approach. So keep this in mind.

We present this workflow using real data from a public sample. The original data file, called Solexa-272222, is large at 150 GB. The file contains 151 bp paired PCR-free reads giving 30x coverage of a human whole genome sample referred to as NA12878. The entire sample library was sequenced in a single flow cell lane and thereby assigns all the reads the same read group ID. The example commands work both on this large file and on smaller files containing a subset of the reads, collectively referred to as snippet. NA12878 has a variant in exon 5 of the CYP2C19 gene, on the portion of chromosome 10 covered by the snippet, resulting in a nonfunctional protein. Consistent with GATK's recommendation of using the most up-to-date tools, for the given example results, with the exception of BWA, we used the most current versions of tools as of their testing (September to December 2015). We provide illustrative example results, some of which were derived from processing the original large file and some of which show intermediate stages skipped by this workflow.

Download example snippet data to follow along the tutorial.

We welcome feedback. Share your suggestions in the Comments section at the bottom of this page.


Jump to a section

  1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL
  2. Mark adapter sequences using MarkIlluminaAdapters
  3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment
    A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq
    B. Align reads and flag secondary hits using BWA-MEM
    C. Restore altered data and apply & adjust meta information using MergeBamAlignment
    D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

Tools involved

Prerequisites

  • Installed Picard tools
  • Installed GATK tools
  • Installed BWA
  • Reference genome
  • Illumina or similar tech DNA sequence reads file containing data corresponding to one read group ID. That is, the file contains data from one sample and from one flow cell lane.

Download example data

  • To download the reference, open ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/ in your browser. Leave the password field blank. Download the following three files (~860 MB) to the same folder: human_g1k_v37_decoy.fasta.gz, .fasta.fai.gz, and .dict.gz. This same reference is available to load in IGV.
  • I divided the example data into two tarballs: tutorial_6483_piped.tar.gz contains the files for the piped process and tutorial_6483_intermediate_files.tar.gz contains the intermediate files produced by running each process independently. The data contain reads originally aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) of GRCh37. The table shows the steps of the workflow, corresponding input and output example data files and approximate minutes and disk space needed to process each step. Additionally, we tabulate the time and minimum storage needed to complete the workflow as presented (piped) or without piping.

image

Related resources

Other notes

  • When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.
  • For large files, (1) use the Java -Xmx setting and (2) set the environmental variable TMP_DIR for a temporary directory.

    java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
        TMP_DIR=/path/shlee 
    

    In the command, the -Xmx8G Java option caps the maximum heap size, or memory usage, to eight gigabytes. The path given by TMP_DIR points the tool to scratch space that it can use. These options allow the tool to run without slowing down as well as run without causing an out of memory error. The -Xmx settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. These options can be omitted for small files such as the example data and the equivalent command is as follows.

    java -jar /path/picard.jar MarkIlluminaAdapters 
    

    To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads. e.g. if 1G is the minimum required for one thread, then use 2G for two threads.

  • When I call default options within a command, follow suit to ensure the same results.


1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL

If you have raw reads data in BAM format with appropriately assigned read group fields, then you can start with step 2. Namely, besides differentiating samples, the read group ID should differentiate factors contributing to technical batch effects, i.e. flow cell lane. If not, you need to reassign read group fields. This dictionary post describes factors to consider and this post and this post provide some strategic advice on handling multiplexed data.

If your reads are mapped, or in BCL or FASTQ format, then generate an unmapped BAM according to the following instructions.

  • To convert FASTQ or revert aligned BAM files, follow directions in Tutorial#6484. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).
  • To convert Illumina Base Call (BCL) files, use IlluminaBasecallsToSam. The tool marks adapter sequences at the same time. The resulting uBAMXT has adapter sequences marked with the XT tag so you can skip step 2 of this workflow and go directly to step 3. The corresponding Tutorial#6570 is coming soon.

See if you can revert 6483_snippet.bam, containing 279,534 aligned reads, to the unmapped 6383_snippet_revertsam.bam, containing 275,546 reads.
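If you want to try it, a minimal RevertSam sketch along these lines may serve as a starting point; Tutorial#6484 lists the full set of recommended options, and the memory setting and paths here are placeholders.

java -Xmx8G -jar /path/picard.jar RevertSam \
I=6483_snippet.bam \
O=6383_snippet_revertsam.bam \
SANITIZE=true \
SORT_ORDER=queryname \
RESTORE_ORIGINAL_QUALITIES=true \
REMOVE_DUPLICATE_INFORMATION=true \
REMOVE_ALIGNMENT_INFORMATION=true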



2. Mark adapter sequences using MarkIlluminaAdapters

MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence and produces a metrics file. Some of the marked adapters come from concatenated adapters that randomly arise from the primordial soup that is a PCR reaction. Others represent read-through to 3' adapter ends of reads and arise from insert sizes that are shorter than the read length. In some instances read-through can affect the majority of reads in a sample, e.g. in Nextera library samples over-titrated with transposomes, and render these reads unmappable by certain aligners. Tools such as SamToFastq use the XT tag in various ways to effectively remove adapter sequence contribution to read alignment and alignment scoring metrics. Depending on your library preparation, insert size distribution and read length, expect varying amounts of such marked reads.

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
O=6483_snippet_markilluminaadapters.bam \
M=6483_snippet_markilluminaadapters_metrics.txt \ #naming required
TMP_DIR=/path/shlee #optional to process large files

This produces two files. (1) The metrics file, 6483_snippet_markilluminaadapters_metrics.txt bins the number of tagged adapter bases versus the number of reads. (2) The 6483_snippet_markilluminaadapters.bam file is identical to the input BAM, 6483_snippet_revertsam.bam, except reads with adapter sequences will be marked with a tag in XT:i:# format, where # denotes the 5' starting position of the adapter sequence. At least six bases are required to mark a sequence. Reads without adapter sequence remain untagged.

  • By default, the tool uses Illumina adapter sequences. This is sufficient for our example data.
  • Adjust the default standard Illumina adapter sequences to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters. To clear and add new adapter sequences, first set ADAPTERS to 'null', then specify each sequence with the parameter, as sketched below.
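For instance, a hedged sketch of marking custom adapters; the adapter sequences and output names here are placeholders rather than recommendations, so substitute your library's actual sequences.

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
O=6483_snippet_markcustomadapters.bam \
M=6483_snippet_markcustomadapters_metrics.txt \
ADAPTERS=null \
FIVE_PRIME_ADAPTER=ACACTCTTTCCCTACACGACGCTCTTCCGATCT \
THREE_PRIME_ADAPTER=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC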

We plot the metrics data that is in GATKReport file format using RStudio, and as you can see, marked bases vary in size up to the full length of reads.
image image

Do you get the same number of marked reads? 6483_snippet marks 448 reads (0.16%) with XT, while the original Solexa-272222 marks 3,236,552 reads (0.39%).
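One quick way to count them yourself (assuming samtools is on your path) is to count the records carrying the XT tag:

samtools view 6483_snippet_markilluminaadapters.bam | grep -c 'XT:i:'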

Below, we show a read pair marked with the XT tag by MarkIlluminaAdapters. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. For XT:i:20, the majority of the read is adapter sequence. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.

Unmapped uBAM (step 1)
image

After MarkIlluminaAdapters (step 2)
image

After SamToFastq (step 3)
image

After MergeBamAlignment (step 3)
image



3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment

This step actually pipes three processes, performed by three different tools. Our tutorial example files are small enough to easily view, manipulate and store, so any difference in piped or independent processing will be negligible. For larger data, however, using Unix pipelines can add up to significant savings in processing time and storage.

Not all tools are amenable to piping and piping the wrong tools or wrong format can result in anomalous data.

The three tools we pipe are SamToFastq, BWA-MEM and MergeBamAlignment. By piping these we bypass storage of larger intermediate FASTQ and SAM files. We additionally save time because piping streams data directly from one process to the next, eliminating the need to write out and read back in intermediate data for two of the processes.

To make the information more digestible, we will first talk about each tool separately. At the end of the section, we provide the piped command.



3A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq

Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove previously marked adapter sequences, in this example marked with an XT tag. By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to downstream read alignment and alignment scoring metrics.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=6483_snippet_samtofastq_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \ 
NON_PF=true \
TMP_DIR=/path/shlee #optional to process large files         

This produces a FASTQ file in which all extant meta data, i.e. read group information, alignment information, flags and tags are purged. What remains are the read query names prefaced with the @ symbol, read sequences and read base quality scores.

  • For our paired reads example file we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query name of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file. BWA aligner accepts interleaved FASTQ files given the -p option.
  • We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider an archaic 0x200 flag bit that denotes reads that do not pass quality controls, aka reads failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.
  • Other CLIPPING_ACTION options include (1) X to hard-clip, (2) N to change bases to Ns or (3) another number to change the base qualities of those positions to the given value.



3B. Align reads and flag secondary hits using BWA-MEM

In this workflow, alignment is the most compute intensive and will take the longest time. GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA-MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.

  • Aligning our snippet reads against either a portion or the whole genome is not equivalent to aligning our original Solexa-272222 file, merging and taking a new slice from the same genomic interval.
  • For the tutorial, we use BWA v 0.7.7.r441, the same aligner used by the Broad Genomics Platform as of this writing (9/2015).
  • As mentioned, alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Epstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
  • BWA alignment requires an indexed reference genome file. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa.

    bwa index -a bwtsw human_g1k_v37_decoy.fasta
    

The example command below aligns our example data against the GRCh37 genome. The tool automatically locates the index files within the same folder as the reference FASTA file.

Illustration of an intermediate step unused in workflow. See piped command.

/path/bwa mem -M -t 7 -p /path/human_g1k_v37_decoy.fasta \ 
6483_snippet_samtofastq_interleaved.fq > 6483_snippet_bwa_mem.sam

This command takes the FASTQ file, 6483_snippet_samtofastq_interleaved.fq, and produces an aligned SAM format file, 6483_snippet_bwa_mem.sam, containing read alignment information, an automatically generated program group record and reads sorted in the same order as the input FASTQ file. Aligner-assigned alignment information, flag and tag values reflect each read's or split read segment's best sequence match and do not take into consideration whether pairs are mapped optimally or if a mate is unmapped. Added tags include the aligner-specific XS tag that marks secondary alignment scores in XS:i:# format. This tag is given for each read even when the score is zero and even for unmapped reads. The program group record (@PG) in the header gives the program group ID, group name, group version and recapitulates the given command. Reads are sorted by query name. For the given version of BWA, the aligned file is in SAM format even if given a BAM extension.

Does the aligned file contain read group information?
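You can check directly (assuming samtools is available) by looking for @RG lines in the header; for the aligner output described here there should be none:

samtools view -H 6483_snippet_bwa_mem.sam | grep '^@RG'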

We invoke three options in the command.

  • -M to flag shorter split hits as secondary.
    This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.

  • -p to indicate the given file contains interleaved paired reads.

  • -t followed by a number for the number of processor threads to use concurrently. Here we use seven threads, which is one less than the total threads available on my Mac laptop. Check your server or system's total number of threads with the following command provided by KateN.

    getconf _NPROCESSORS_ONLN 
    

In the example data, each of the 1211 unmapped reads has an asterisk (*) in column 6 of the SAM record, where a read typically records its CIGAR string. The asterisk represents that the CIGAR string is unavailable. The several asterisked reads I examined are recorded as mapping exactly to the same location as their _mapped_ mates but with MAPQ of zero. Additionally, the asterisked reads had varying noticeable amounts of low base qualities, e.g. strings of #s, that corresponded to original base quality calls and not those changed by SamToFastq. This accounting by BWA allows these pairs to always list together, even when the reads are coordinate-sorted, and leaves a pointer to the genomic mapping of the mate of the unmapped read. For the example read pair shown below, comparing sequences shows no apparent overlap, with the highest identity at 72% over 25 nts.

After MarkIlluminaAdapters (step 2)
image

After BWA-MEM (step 3)
image

After MergeBamAlignment (step 3)
image



3C. Restore altered data and apply & adjust meta information using MergeBamAlignment

MergeBamAlignment is a beast of a tool, so its introduction is longer. It does more than is implied by its name. Explaining these features requires I fill you in on some background.

Broadly, the tool merges defined information from the unmapped BAM (uBAM, step 1) with that of the aligned BAM (step 3) to conserve read data, e.g. original read information and base quality scores. The tool also generates additional meta information based on the information generated by the aligner, which may alter aligner-generated designations, e.g. mate information and secondary alignment flags. The tool then makes adjustments so that all meta information is congruent, e.g. read and mate strand information based on proper mate designations. We ascribe the resulting BAM as clean.

Specifically, the aligned BAM generated in step 3 lacks read group information and certain tags--the UQ (Phred likelihood of the segment), MC (CIGAR string for mate) and MQ (mapping quality of mate) tags. It has hard-clipped sequences from split reads and altered base qualities. The reads also have what some call mapping artifacts but what are really just features we should not expect from our aligner. For example, the meta information so far does not consider whether pairs are optimally mapped and whether a mate is unmapped (in reality or for accounting purposes). Depending on these assignments, MergeBamAlignment adjusts the read and read mate strand orientations for reads in a proper pair. Finally, the alignment records are sorted by query name. We would like to fix all of these issues before taking our data to a variant discovery workflow.

Enter MergeBamAlignment. As the tool name implies, MergeBamAlignment applies read group information from the uBAM and retains the program group information from the aligned BAM. In restoring original sequences, the tool adjusts CIGAR strings from hard-clipped to soft-clipped. If the alignment file is missing reads present in the unaligned file, then these are retained as unmapped records. Additionally, MergeBamAlignment evaluates primary alignment designations according to a user-specified strategy, e.g. for optimal mate pair mapping, and changes secondary alignment and mate unmapped flags based on its calculations. It makes additional adjustments for desired congruency. I will soon explain these and additional changes in more detail and show a read record to illustrate.

Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. Therefore, it is critical that these were previously flagged.

image
A read with multiple alignment records may map to multiple loci or may be chimeric--that is, splits the alignment. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and adds the read mapped in proper pair (0x2) and mate unmapped (0x8) flags. The tool then adjusts the strand orientation flag for a read (0x10) and its proper mate (0x20).

In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters. The path to the reference FASTA given by R should also contain the corresponding .dict sequence dictionary with the same prefix as the reference FASTA. It is imperative that the uBAM and aligned BAM are both sorted by queryname.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
R=/path/Homo_sapiens_assembly19.fasta \ 
UNMAPPED_BAM=6383_snippet_revertsam.bam \ 
ALIGNED_BAM=6483_snippet_bwa_mem.sam \ #accepts either SAM or BAM
O=6483_snippet_mergebamalignment.bam \
CREATE_INDEX=true \ #standard Picard option for coordinate-sorted outputs
ADD_MATE_CIGAR=true \ #default; adds MC tag
CLIP_ADAPTERS=false \ #changed from default
CLIP_OVERLAPPING_READS=true \ #default; soft-clips ends so mates do not extend past each other
INCLUDE_SECONDARY_ALIGNMENTS=true \ #default
MAX_INSERTIONS_OR_DELETIONS=-1 \ #changed to allow any number of insertions or deletions
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \ #changed from default BestMapq
ATTRIBUTES_TO_RETAIN=XS \ #specify multiple times to retain tags starting with X, Y, or Z 
TMP_DIR=/path/shlee #optional to process large files

This generates a coordinate-sorted and clean BAM, 6483_snippet_mergebamalignment.bam, and corresponding .bai index. These are ready for analyses starting with MarkDuplicates. The two bullet-point lists below describe changes to the resulting file. The first list gives general comments on select parameters and the second describes some of the notable changes to our example data.

Comments on select parameters

  • Setting PRIMARY_ALIGNMENT_STRATEGY to MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that, of the chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.
  • It may well be that alignments marked as secondary represent interesting biology, so we retain them with the INCLUDE_SECONDARY_ALIGNMENTS parameter.
  • Setting MAX_INSERTIONS_OR_DELETIONS to -1 retains reads regardless of the number of insertions and deletions. The default is 1.
  • Because we leave the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment will reassess and reassign proper pair designations made by the aligner. These are explained below using the example data.
  • ATTRIBUTES_TO_RETAIN is specified to carry over the XS tag from the alignment, which reports BWA-MEM's suboptimal alignment scores. My impression is that this is the next highest score for any alternative or additional alignments BWA considered, whether or not these additional alignments made it into the final aligned records. (IGV's BLAT feature allows you to search for additional sequence matches). For our tutorial data, this is the only additional unaccounted tag from the alignment. The XS tag is unnecessary for the Best Practices Workflow and is not retained by the Broad Genomics Platform's pipeline. We retain it here not only to illustrate that the tool carries over select alignment information only if asked, but also because I think it prudent. Given how compute intensive the alignment process is, the additional ~1% gain in the snippet file size seems a small price against having to rerun the alignment because we realize later that we want the tag.
  • Setting CLIP_ADAPTERS to false leaves reads unclipped.
  • By default the merged file is coordinate sorted. We set CREATE_INDEX to true to additionally create the bai index.
  • We need not invoke PROGRAM options as BWA's program group information is sufficient and is retained in the merging.
  • As a standalone tool, we would normally feed in a BAM file for ALIGNED_BAM instead of the much larger SAM. We will be piping this step however and so need not add an extra conversion to BAM.

Description of changes to our example data

  • MergeBamAlignment merges header information from the two sources that define read groups (@RG) and program groups (@PG) as well as reference contigs.
  • image Tags are updated for our example data as shown in the table. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment), MC (mate CIGAR string) and MQ (mapping quality of the mate/next segment) tags if applicable. For unmapped reads (marked with an * asterisk in column 6 of the SAM record), the tool removes AS and XS tags and assigns MC (if applicable), PG and RG tags. This is illustrated for example read H0164ALXX140820:2:1101:29704:6495 in the BWA-MEM section of this document.
  • Original base quality score restoration is illustrated in step 2.

The example below shows a read pair for which MergeBamAlignment adjusts multiple information fields, and these changes are described in the remaining bullet points.

  • MergeBamAlignment changes hard-clipping to soft-clipping, e.g. 96H55M to 96S55M, and restores corresponding truncated sequences with the original full-length read sequence.
  • The tool reorders the read records to reflect the chromosome and contig ordering in the header and the genomic coordinates for each.
  • MergeBamAlignment's MostDistant PRIMARY_ALIGNMENT_STRATEGY asks the tool to consider the best pair to mark as primary from the primary and secondary records. In this pair, one of the reads has two alignment loci, on contig hs37d5 and on chromosome 10. The two loci align 115 and 55 nucleotides, respectively, and the aligned sequences are identical by 55 bases. Flag values set by BWA-MEM indicate the contig hs37d5 record is primary and the shorter chromosome 10 record is secondary. For this chimeric read, MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment and the contig hs37d5 mapping as secondary (0x100 flag bit).
  • In addition, MergeBamAlignment designates each record on chromosome 10 as read mapped in proper pair (0x2 flag bit) and the contig hs37d5 mapping as mate unmapped (0x8 flag bit). IGV's paired reads mode displays the two chromosome 10 mappings as a pair after these MergeBamAlignment adjustments.
  • MergeBamAlignment adjusts read reverse strand (0x10 flag bit) and mate reverse strand (0x20 flag bit) flags consistent with changes to the proper pair designation. For our non-stranded DNA-Seq library alignments displayed in IGV, a read pointing rightward is in the forward direction (absence of 0x10 flag) and a read pointing leftward is in the reverse direction (flagged with 0x10). In a typical pair, where the rightward pointing read is to the left of the leftward pointing read, the left read will also have the mate reverse strand (0x20) flag.
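To sanity-check flag combinations like those described above, assuming a reasonably recent samtools, the flags subcommand converts between named and numeric representations of the SAM flag field.

samtools flags PAIRED,PROPER_PAIR,MREVERSE,READ1
#prints 0x63 and 99: a forward-strand, first-of-pair read in a proper pair whose mate is on the reverse strand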

Two distinct classes of mate unmapped read records are now present in our example file: (1) reads whose mates truly failed to map and are marked by an asterisk * in column 6 of the SAM record and (2) multimapping reads whose mates are in fact mapped but in a proper pair that excludes the particular read record. Each of these two classes of mate unmapped reads can contain multimapping reads that map to two or more locations.

Comparing 6483_snippet_bwa_mem.sam and 6483_snippet_mergebamalignment.bam, we see the number of unmapped reads remains the same at 1211, while the number of records with the mate unmapped flag increases by 1359, from 1276 to 2635. These now account for 0.951% of the 276,970 read records.
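One way to reproduce these tallies, assuming samtools is installed, is to count the records carrying the mate unmapped flag (0x8) and, separately, the unique read names among them.

samtools view -c -f 8 6483_snippet_mergebamalignment.bam
#total records with the mate unmapped flag set

samtools view -f 8 6483_snippet_mergebamalignment.bam | cut -f1 | sort -u | wc -l
#unique read names among mate-unmapped records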

For 6483_snippet_mergebamalignment.bam, how many additional unique reads become mate unmapped?

After BWA-MEM alignment
image

After MergeBamAlignment
image

back to top


3D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

image We pipe the three tools described above to generate a clean, coordinate-sorted and indexed BAM. In the piped command, the commands for the three processes are given together, separated by a vertical bar |. We also replace each intermediate output and input file name with a symbolic path to the system's output and input devices, here /dev/stdout and /dev/stdin, respectively. We need only provide the first input file and name the last output file.

Before using a piped command, we should tell the shell to return a failing exit status if any step of the pipe errors, rather than reporting only the status of the last command. Type the following into your shell to set this option.

set -o pipefail

Overview of command structure

[SamToFastq] | [BWA-MEM] | [MergeBamAlignment]

Piped command

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=/dev/stdout \
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
TMP_DIR=/path/shlee | \ 
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta /dev/stdin | \  
java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
ALIGNED_BAM=/dev/stdin \
UNMAPPED_BAM=6483_snippet_revertsam.bam \ 
OUTPUT=6483_snippet_piped.bam \
R=/path/Homo_sapiens_assembly19.fasta CREATE_INDEX=true ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS \
TMP_DIR=/path/shlee

The piped output file, 6483_snippet_piped.bam, is for all intents and purposes the same as 6483_snippet_mergebamalignment.bam, produced by running MergeBamAlignment separately without piping. However, the resulting files, as well as new runs of the workflow on the same data, have the potential to differ in small ways because each uses a different alignment instance.

How do these small differences arise?

Counting the number of mate unmapped reads shows that this number remains unchanged for the two described workflows. Two counts emitted at the end of the process, which also remain constant for these instances, are the number of alignment records and the number of unmapped reads.

INFO    2015-12-08 17:25:59 AbstractAlignmentMerger Wrote 275759 alignment records and 1211 unmapped reads.
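If you want to compare the two outputs yourself, one quick check, assuming samtools is available, is to run flagstat on each BAM and diff the summaries; for this tutorial's data, totals such as the record and unmapped-read counts remain the same. The flagstat output file names below are illustrative.

samtools flagstat 6483_snippet_mergebamalignment.bam > standalone_flagstat.txt
samtools flagstat 6483_snippet_piped.bam > piped_flagstat.txt
diff standalone_flagstat.txt piped_flagstat.txt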

back to top


Some final remarks

We have produced a clean BAM that is coordinate-sorted and indexed, in an efficient manner that minimizes processing time and storage needs. The file is ready for marking duplicates as outlined in Tutorial#2799. Additionally, we can now free up storage on our file system by deleting the original file we started with, the uBAM and the adapter-marked uBAM (uBAMXT). We sleep well at night knowing that the clean BAM retains all original information.

We have two final comments (1) on multiplexed samples and (2) on fitting this workflow into a larger workflow.

For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates.
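For reference, a minimal per-lane MarkDuplicates sketch might look like the following; the file names are illustrative, and Tutorial#2799 covers this step in detail.

java -Xmx16G -jar /path/picard.jar MarkDuplicates \
I=6483_snippet_piped.bam \
O=6483_snippet_markduplicates.bam \
M=6483_snippet_markduplicates_metrics.txt \
CREATE_INDEX=true \
TMP_DIR=/path/shlee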

For workflows that nest this pipeline, consider additionally optimizing the Java Virtual Machine parameters for SamToFastq and MergeBamAlignment. For example, the following are the additional settings used by the Broad Genomics Platform in the piped command for very large data sets.

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...

I give my sincere thanks to Julian Hess, the GATK team and the Data Sciences and Data Engineering (DSDE) team members for all their help in writing this and related documents.

back to top


Too many (?) variants detected by joint genotyping of 8232 exomes

$
0
0

Hello,

I am about to finish analyzing 8232 exome samples. I have used GATK 3.8 and 3.6 throughout my workflow and followed the Best Practices guidelines. After calling variants by running HaplotypeCaller in GVCF mode using the standard workflow, I merged gVCF files hierarchically until all 8232 samples had been merged. After the initial rounds of gVCF merging, I ran the program chromosome by chromosome, and then further divided the genome into 30-50 Mb pieces to reduce computational time. Finally, I started running GenotypeGVCFs and VQSR on each of the 70 genomic parts. 60 parts have been completed, and a total of 6.4M variants have been detected. I expect to get around 7.5M variants when I completely finish the workflow. I suspect this might be more than expected for that many exomes, but I am not sure; therefore I would like your comments. If it is indeed too many, what might be the cause of so many false calls? I can post all the commands I have used in this analysis.
In the end, I plan to select the variants passing filters (I used tranches 99.5 and 99.0 for SNPs and indels, respectively), but I think this will not reduce the number of variants significantly. I calculated the number of variants with call rate >80% in a few genomic parts and found that only around 50% of all variants reach this call rate. My second question is this: is it normal that my VCF files contain so many variants with low call rate? If I select the variants with the PASS flag and call rate >80%, can I trust the remaining set, or do you think getting so many variants with low call rate indicates that the output is unreliable?

Cigdem


An error when running PathSeqFilterSpark

$
0
0

Dear GATKer,
I just ran PathSeqFilterSpark using the commands below, but there is an error and I have no idea what it means. I uploaded the log file; could you please help me take a look?

I generated an unmapped BAM from raw paired-end fastq files using Picard's FastqToSam as input:
java -Xmx8G -jar /home/jw380/tools/PICARD/picard.jar FastqToSam \
FASTQ=/n/groups/liu/public/Xiaoqi_samples/fastq/FZ-97_R1.fq \
FASTQ2=/n/groups/liu/public/Xiaoqi_samples/fastq/FZ-97_R2.fq \
OUTPUT=FZ-97_fastqtosam.bam \
SAMPLE_NAME=FZ-97

Then running PathSeqFilterSpark:
/home/jw380/tools/GATK4/gatk-4.0.5.2/gatk PathSeqFilterSpark \
--input /n/groups/liu/jinwang/Lung_samples/TB/gatk/FZ-97_fastqtosam.bam \
--paired-output /n/groups/liu/jinwang/Lung_samples/TB/gatk/FZ-97_reads_paired.bam \
--unpaired-output /n/groups/liu/jinwang/Lung_samples/TB/gatk/FZ-97_reads_unpaired.bam \
--is-host-aligned true \
--kmer-file host_hg38.hss \
--filter-bwa-image hg38.fa.img \
--filter-metrics FZ-97_metrics.txt

Best,
Jin

Off-label workflow to simply call differences in two samples

$
0
0

image
Given my years as a biochemist, if given two samples to compare, my first impulse is to ask what the functional differences are, i.e. differences in the proteins expressed between the two samples. I am interested in genomic alterations that ripple down the central dogma to transform a cell.

Please note the workflow that follows is NOT a part of the Best Practices. This is an illustrative, unsupported workflow. For the official Somatic Short Variant Calling Best Practices workflow, see Tutorial#11136.

To call every allele that is different between two samples, I have devised a two-pass workflow that takes advantage of Mutect2 features. This workflow uses Mutect2 in tumor-only mode and appropriates the --germline-resource argument to supply a single-sample VCF with allele fractions instead of population allele frequencies. The workflow assumes the two case samples being compared originate from the same parental line and the ploidy and mutation rates make it unlikely that any site accumulates more than one allele change.


First, call on each sample using Mutect2's tumor-only mode.

gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
-O A.vcf

gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
-O B.vcf

Second, for each single-sample VCF, move the sample-level AF allele-fraction annotation to the INFO field and simplify to a sites-only VCF.

This is a heuristic solution in which we substitute sample-level allele fractions for the expected population germline allele frequencies. Mutect2 is actually designed to use population germline allele frequencies in somatic likelihood calculations, so this substitution allows us to fulfill the requirement for an AF annotation with plausible fractional values. The terminal screenshots highlight the data transpositions.

Before:

image

After:

image
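The post leaves the mechanics of this transposition to the reader. As one possible sketch, assuming bcftools, bgzip and tabix are available and that the Mutect2 single-sample VCF carries a per-sample FORMAT/AF annotation, you could extract the sample's allele fractions into an annotation file and then write a sites-only VCF with INFO/AF.

#extract per-allele fractions from sample A's FORMAT/AF field
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t[%AF]\n' A.vcf | bgzip > A_af.tab.gz
tabix -s1 -b2 -e2 A_af.tab.gz

#annotate a copy with INFO/AF and drop genotypes to make it sites-only
echo '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele fraction from sample A">' > af_header.txt
bcftools annotate -a A_af.tab.gz -c CHROM,POS,REF,ALT,INFO/AF -h af_header.txt A.vcf | bcftools view -G -o Aaf.vcf

#repeat for B.vcf to produce Baf.vcf, and index each resource before passing it to --germline-resource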

Third, call on each sample in a second pass, again in tumor-only mode, with the following additions.

gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
--germline-resource Baf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O A-B.vcf

gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
--germline-resource Aaf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O B-A.vcf
  • Provide the matched single-sample callset for the case sample with the --germline-resource argument.
  • Avoid calling any allele in the --germline-resource by setting --max-population-af to zero.
  • Maximize the probability of calling any differing allele by setting --af-of-alleles-not-in-resource to zero.
  • Prefilter sites with artifacts and cross-sample contamination with a panel of normals (PoN) from which confident variant sites for both sample A and B have been removed, e.g. with SelectVariants as sketched after this list.
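A sketch of that masking step, using this workflow's illustrative file names, excludes A's and B's confident HaplotypeCaller sites from the PoN by treating their joint callset as exclusion intervals.

gatk SelectVariants \
-V pon.vcf \
-XL AandB_haplotypecaller.vcf \
-O pon_maskAB.vcf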

Fourth, filter out unlikely calls with FilterMutectCalls.

gatk FilterMutectCalls \
-V A-B.vcf \
-O A-B-filter.vcf

gatk FilterMutectCalls \
-V B-A.vcf \
-O B-A-filter.vcf

FilterMutectCalls provides many filters, e.g. that account for low base quality, for events that are clustered, for low mapping quality and for short-tandem-repeat contractions. Of the filters, let's consider the multiallelic filter. It discounts sites where more than one variant allele passes the tumor LOD threshold.

  • We assume case sample variant sites will have a maximum of one allele that is different from the --germline-resource control. A single allele call will pass the multiallelic filter. However, if we emit any shared variant allele alongside the differing allele, e.g. for a heterozygous site without ref alleles, then the call becomes multiallelic and will be filtered, which is not what we want. We previously set Mutect2’s --max-population-af to zero to ensure only the differing allele is called, and so here we can rely on FilterMutectCalls to filter artifactual multiallelic sites.
  • If multiple variant alleles are expected per call, then FilterMutectCalls' multiallelic filtering will be undesirable. For example, if changes to allele fractions for alleles that are shared were of interest for the two samples derived from the same parental line, and Mutect2 --max-population-af was set to one in the previous step to additionally emit the shared variant alleles, then you would expect multiallelic calls. These will be indistinguishable from artifactual multiallelic sites.

This workflow produces contrastive variants. If the samples are a tumor and its matched normal, then the calls include sites where heterozygosity was lost.

We know that loss of heterozygosity (LOH) plays a role in tumorigenesis (doi:10.1186/s12920-015-0123-z). This leads us to believe the heterozygosity of proteins we express contributes to our health. If this is true, then for somatic studies, if cataloging the gain of alleles is of interest, then cataloging the loss of alleles should also be of interest. Can we assume just because variants are germline that they do not play a role in disease processes? How can we account for the combinatorial effects of the diploid nature of our genomes?

Remember regions of LOH do not necessarily represent a haploid state but can be copy-neutral or even copy-amplified. It may be that as one parental chromosome copy is lost, the other is duplicated to maintain copy number, which presumably compensates for dosage effects as is the case in uniparental isodisomy.


A variant seen in a single-sample gVCF is missing after combining multiple samples

$
0
0

@Sheila, I generally use the following command to combine the gVCFs of my samples. When new samples come in, I keep adding them to the already combined gVCFs:
java -Xmx20g -jar /mnt/exome/Softwares/GenomeAnalysisTK.jar -T CombineGVCFs -R /mnt/exome/ReferenceFiles/human_g1k_v37.fasta -V Combined_200.gvcf -V samplea.gvcf -V sampleb.gvcf -V samplec.gvcf -o $outname.gvcf
But one of the variants seen in the gVCF of a single sample is not seen in the combined gVCF. Please help me rectify the issue.

Thanks and Regards
Neethu


htsjdk.samtools.CRAMFileReader error gatk 4.0.6.0

$
0
0

I'm running HaplotypeCaller on a CRAM file. The WDL I'm using is haplotypecaller-gvcf-gatk4, with NIO enabled, using the GATK 4.0.6.0 docker, running on FireCloud.

It looks like HaplotypeCaller finishes, but something happens at the very end with htsjdk which causes the task to fail.
The error message is
Caused by: java.nio.channels.ClosedChannelException at org.broadinstitute.hellbender.utils.nio.SeekableByteChannelPrefetcher.position(SeekableByteChannelPrefetcher.java:406) at htsjdk.samtools.seekablestream.SeekablePathStream.seek(SeekablePathStream.java:63) at htsjdk.samtools.cram.build.CramSpanContainerIterator.<init>(CramSpanContainerIterator.java:26) at htsjdk.samtools.cram.build.CramSpanContainerIterator.fromFileSpan(CramSpanContainerIterator.java:40) at htsjdk.samtools.CRAMIterator.<init>(CRAMIterator.java:104) at htsjdk.samtools.CRAMFileReader$CRAMIntervalIterator.<init>(CRAMFileReader.java:516)

The whole stderr log is below

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.eb6715df
13:44:46.291 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
13:44:48.009 INFO  HaplotypeCaller - ------------------------------------------------------------
13:44:48.010 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.6.0
13:44:48.010 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
13:44:48.011 INFO  HaplotypeCaller - Executing as root@4a27babf2e2b on Linux v4.9.0-0.bpo.6-amd64 amd64
13:44:48.011 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11
13:44:48.011 INFO  HaplotypeCaller - Start Date/Time: July 9, 2018 1:44:46 PM UTC
13:44:48.011 INFO  HaplotypeCaller - ------------------------------------------------------------
13:44:48.012 INFO  HaplotypeCaller - ------------------------------------------------------------
13:44:48.013 INFO  HaplotypeCaller - HTSJDK Version: 2.16.0
13:44:48.013 INFO  HaplotypeCaller - Picard Version: 2.18.7
13:44:48.013 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:44:48.013 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:44:48.014 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:44:48.014 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:44:48.014 INFO  HaplotypeCaller - Deflater: IntelDeflater
13:44:48.014 INFO  HaplotypeCaller - Inflater: IntelInflater
13:44:48.015 INFO  HaplotypeCaller - GCS max retries/reopens: 20
13:44:48.015 INFO  HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
13:44:48.015 INFO  HaplotypeCaller - Initializing engine
13:44:53.912 INFO  IntervalArgumentCollection - Processing 59173529 bp from intervals
13:44:53.941 INFO  HaplotypeCaller - Done initializing engine
13:44:54.018 INFO  HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
13:44:54.023 INFO  HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
13:44:54.023 INFO  HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
13:44:54.049 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/gatk/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
13:44:54.061 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/gatk/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
13:44:54.159 WARN  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
13:44:54.160 INFO  IntelPairHmm - Available threads: 2
13:44:54.161 INFO  IntelPairHmm - Requested threads: 4
13:44:54.161 WARN  IntelPairHmm - Using 2 available threads, but 4 were requested
13:44:54.161 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
13:44:54.371 INFO  ProgressMeter - Starting traversal
13:44:54.371 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
13:45:04.566 INFO  ProgressMeter -       chr2:239026118              0.2                   140            824.0
13:45:14.759 INFO  ProgressMeter -       chr2:239101115              0.3                   530           1559.8
13:45:24.850 INFO  ProgressMeter -       chr2:239184333              0.5                   980           1929.3
13:45:34.876 INFO  ProgressMeter -       chr2:239282453              0.7                  1480           2192.4
13:45:44.983 INFO  ProgressMeter -       chr2:239417055              0.8                  2180           2584.5
13:45:54.987 INFO  ProgressMeter -       chr2:239578517              1.0                  3020           2989.4
13:46:05.021 INFO  ProgressMeter -       chr2:239756375              1.2                  3880           3295.2
13:46:15.080 INFO  ProgressMeter -       chr2:239888664              1.3                  4650           3456.9
13:46:25.115 INFO  ProgressMeter -       chr2:240026623              1.5                  5430           3590.4
13:46:35.137 INFO  ProgressMeter -       chr2:240201034              1.7                  6210           3697.7
13:46:45.145 INFO  ProgressMeter -       chr2:240339896              1.8                  6980           3780.7
13:46:55.164 INFO  ProgressMeter -       chr2:240586461              2.0                  8060           4003.6
13:47:05.212 INFO  ProgressMeter -       chr2:240717379              2.2                  8840           4053.8
13:47:15.229 INFO  ProgressMeter -       chr2:240892713              2.3                  9730           4144.7
13:47:25.316 INFO  ProgressMeter -       chr2:241107408              2.5                 10780           4285.0
13:47:35.371 INFO  ProgressMeter -       chr2:241251611              2.7                 11560           4308.1
13:47:45.428 INFO  ProgressMeter -       chr2:241419542              2.9                 12440           4363.5
13:47:55.491 INFO  ProgressMeter -       chr2:241594749              3.0                 13350           4422.5
13:48:05.570 INFO  ProgressMeter -       chr2:241747417              3.2                 14150           4440.4
13:48:15.592 INFO  ProgressMeter -       chr2:241934771              3.4                 15090           4499.6
13:48:25.699 INFO  ProgressMeter -       chr2:242072069              3.5                 15860           4503.0
13:48:32.401 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.140101847
13:48:32.402 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 18.015375589
13:48:32.402 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 27.18 sec
13:48:32.402 INFO  HaplotypeCaller - Shutting down engine
[July 9, 2018 1:48:32 PM UTC] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 3.77 minutes.
Runtime.totalMemory()=2653945856
htsjdk.samtools.util.RuntimeEOFException: java.nio.channels.ClosedChannelException
    at htsjdk.samtools.CRAMFileReader$CRAMIntervalIterator.<init>(CRAMFileReader.java:519)
    at htsjdk.samtools.CRAMFileReader$CRAMIntervalIterator.<init>(CRAMFileReader.java:504)
    at htsjdk.samtools.CRAMFileReader.query(CRAMFileReader.java:455)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:528)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:400)
    at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextIterator(SamReaderQueryingIterator.java:125)
    at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.<init>(SamReaderQueryingIterator.java:66)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.prepareIteratorsForTraversal(ReadsDataSource.java:404)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.iterator(ReadsDataSource.java:330)
    at org.broadinstitute.hellbender.engine.MultiIntervalLocalReadShard.iterator(MultiIntervalLocalReadShard.java:134)
    at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.<init>(AssemblyRegionIterator.java:109)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:282)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:984)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.nio.channels.ClosedChannelException
    at org.broadinstitute.hellbender.utils.nio.SeekableByteChannelPrefetcher.position(SeekableByteChannelPrefetcher.java:406)
    at htsjdk.samtools.seekablestream.SeekablePathStream.seek(SeekablePathStream.java:63)
    at htsjdk.samtools.cram.build.CramSpanContainerIterator.<init>(CramSpanContainerIterator.java:26)
    at htsjdk.samtools.cram.build.CramSpanContainerIterator.fromFileSpan(CramSpanContainerIterator.java:40)
    at htsjdk.samtools.CRAMIterator.<init>(CRAMIterator.java:104)
    at htsjdk.samtools.CRAMFileReader$CRAMIntervalIterator.<init>(CRAMFileReader.java:516)
    ... 19 more
Using GATK jar /gatk/gatk-package-4.0.6.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx6G -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -jar /gatk/gatk-package-4.0.6.0-local.jar HaplotypeCaller -R /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -I gs://broad-public-datasets/NA12878/NA12878.cram -L /cromwell_root/genomics-public-data/resources/broad/hg38/v0/scattered_calling_intervals/temp_0009_of_50/scattered.interval_list -O NA12878.cram.g.vcf.gz -contamination 0 -ERC GVCF

CNNScoreVariants Hanging in 4.0.5.2 and 4.0.6.0

$
0
0

I had no issues running CNNScoreVariants when using GATK 4.0.5.1 but as soon as I tried to run the exact same command in 4.0.5.2 or 4.0.6.0 it would get to the part where it says: "INFO CNNScoreVariants - Done initializing engine" and then it appears to just hang with no CPU activity. I even tried to let it go for about 10 minutes with nothing happening.

Below is the command I am using; it works just fine when I downgrade to 4.0.5.1. I also tried the 1D model and the same issue occurs.

gatk CNNScoreVariants -V test_data.vcf -I test_data.bam -R hg19.fa -O annotated_data.vcf -tensor-type read_tensor

If it matters the Java version is 1.8.0_121

genotypeConcordance for GVCF file

$
0
0

Hi,
I need to compare two gVCF files using the genotypeConcordance tool. In spite of them being the same, genotypeConcordance lists one difference between them:

EVAL.gVCF

1       11854456        rs1801131       T       G       1484.77 .       DB;DP=35;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0       GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:0,35,0:35:99:0|1:11854457_G_A:1513,105,0,1513,105,1513:0,0,14,21
1       11856348        .       N       G       .       .       .       GT:DP:GQ:MIN_DP:PL      0/0:45:99:38:0,102,1141

Comp.gVCF

1       11854456        .       T       G       6189.77 .       DP=141;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00       GT:AD:DP:GQ:PGT:PID:PL:SB       1/1:0,141,0:141:99:0|1:11854457_G_A:6218,424,0,6218,424,6218:0,0,66,75
1       11856348        .       N       G       .       .       .       GT:DP:GQ:MIN_DP:PL      0/0:57:99:48:0,102,1800
ALLELES_MATCH  EVAL_SUPERSET_TRUTH  EVAL_SUBSET_TRUTH  ALLELES_DO_NOT_MATCH  EVAL_ONLY  TRUTH_ONLY
            1                    0                  0                     0          0           1

I wonder why that is. Is there a way to run genotypeConcordance on the GVCF file?

Please note that I replaced the ref allele with 'N' and the alt allele with G (which was the ref allele) because otherwise GATK throws the error of duplicate alleles.

Any help would be appreciated.
Thanks!

Malformed walker argument for RealignerTargetCreator

$
0
0

Version 3.8-0-ge9d806836: when I run java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator, it keeps giving the error
Malformed walker argument for RealignerTargetCreator

libVectorLoglessPairHMM is not present in GATK 3.8 - HaplotypeCaller is slower than 3.4-46!

$
0
0

We are running GATK on a multi-core Intel Xeon that does not have AVX. We have just upgraded from running 3.4-46 to running 3.8, and HaplotypeCaller runs much more slowly. I noticed that our logs used to say:

Using SSE4.1 accelerated implementation of PairHMM
INFO 06:18:09,932 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 06:18:09,933 VectorLoglessPairHMM - Using vectorized implementation of PairHMM

But now they say:

WARN 07:10:21,304 PairHMMLikelihoodCalculationEngine$1 - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
WARN 07:10:21,310 PairHMMLikelihoodCalculationEngine$1 - AVX-accelerated native PairHMM implementation is not supported. Falling back to slower LOGLESS_CACHING implementation

I'm guessing the newfangled Intel GKL isn't working so well for us. Note that I had a very similar problem with GATK 3.4-0, in http://gatk.vanillaforums.com/entry/passwordreset/21436/OrxbD0I4oRDaj8y1hDSE and this was resolved in GATK 3.4-46.
