Recent Discussions — GATK-Forum

Help choosing truth sensitivity


Hello,

I am trying to decide which set of SNPs to use for my downstream analyses. I need >1 SNP per kbp to detect signatures of selection (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4611237/). The tranches that achieve this SNP density start at 80% truth sensitivity.
Looking at the tranches plot, tranches 80-90 are reasonable based on the novel TiTv ratio for the species I am working with.

However, the fraction of false positives (Target TiTv ratio = 2) might be too high for these tranches (0.2-0.25):

    | targetSensitivity|     nTP|     nFP| FP_fraction|
    |-----------------:|-------:|-------:|-----------:|
    |                60|  140491|    5357|       0.037|
    |                65|  200758|   10581|       0.050|
    |                70|  289547|   23544|       0.075|
    |                75|  303540|   76741|       0.202|
    |                80|  311006|   81051|       0.207|
    |                85|  322111|   90922|       0.220|
    |                90|  340943|  110478|       0.245|
    |                95|  377469|  340152|       0.474|
    |               100| 1479090| 2361373|       0.615|

But I am not really sure how to interpret this information. The way this makes sense to me is that if I use, for instance, tranche 85 as my final SNP subset, I accept that each novel SNP has a 22% chance of being a false positive. However, this corresponds to 90922 SNPs and 1% of the total SNP set, which I am willing to live with so I can move on with the analysis.
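
Working through the numbers myself, the marginal cost of moving from tranche 85 to tranche 90 can be computed directly from the table above (a quick sanity check in the shell; bc is just one way to do the arithmetic):

    # extra calls gained by moving from tranche 85 to tranche 90:
    #   delta nTP = 340943 - 322111 = 18832
    #   delta nFP = 110478 -  90922 = 19556
    # fraction of those extra calls expected to be false positives:
    echo "scale=3; 19556 / (18832 + 19556)" | bc
    # => .509, i.e. roughly half of the additional novel SNPs would be FPs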

I would like to know if this interpretation is correct, and whether you have any suggestions (e.g. would it make a big difference to choose tranche 90 instead of 85?).

Thanks!


Calling variants in RNAseq


Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are given only as examples of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that end, you can find more guidance here.

[figure: workflow overview]

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

[figure: detailed workflow overview]

Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline produces both false positive (wrong variant call) and false negative (missed variant) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article, as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback on their experiences applying our recommendations to their data.


The workflow

1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels using the STAR aligner. Specifically, we use the STAR 2-pass method described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details; we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

4) The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

2. Add read groups, sort, mark duplicates, and create index

The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample 

java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics 

3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads, developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clips any sequences overhanging into the intronic regions.

[figure: a read spanning a splice junction is split into exon segments, with intronic overhangs hard-clipped]

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use GATK's ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.
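
For reference, here is a minimal sketch of what steps 4 and 5 look like when run as for DNAseq (the known-sites file names indels.vcf and dbsnp.vcf are placeholders for the appropriate bundle resources):

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I split.bam -known indels.vcf -o realign_targets.intervals

java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I split.bam -targetIntervals realign_targets.intervals -known indels.vcf -o realigned.bam

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I realigned.bam -knownSites dbsnp.vcf -knownSites indels.vcf -o recal.table

java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I realigned.bam -BQSR recal.table -o recal.bam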

6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it performs much better in our hands than UnifiedGenotyper (our tests show that UG was able to call fewer than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code that intelligently takes into account the information about intron-exon split regions embedded in the BAM file by SplitNCigarReads. In brief, the new code performs “dangling head merging” operations and avoids using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument, which was previously required, is no longer necessary, as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we set the minimum phred-scaled confidence threshold for calling variants to 20, but you can lower this to increase sensitivity if needed.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf

7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter out clusters of at least 3 SNPs within a 35-base window by adding -window 35 -cluster 3 to your command. This filter recommendation is specific to RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf

Please note that we selected these hard filtering values while attempting to optimize both sensitivity and specificity together. By applying the hard filters, some real sites will get filtered out. This is a tradeoff that each analyst should consider based on their own project. If you care more about sensitivity and are willing to tolerate more false positive calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

[figure]

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

[figure]


Known issues

There are a few known issues; one is that the allelic ratio is problematic. At many heterozygous sites, even when both alleles present in the DNA are visible in the RNAseq data, the ratio between the numbers of reads carrying each allele is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may also be candidates for downstream analysis of allele-specific expression).

Although our new tool (SplitNCigarReads) cleans up many false positive calls that are caused by splicing inaccuracies in the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

[figures: examples of false calls near splice junctions]

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionality, increase true positive rates and minimize false positive rates, and develop statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.


[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013


NOTE: Questions about this document that were posted before June 2014 have been moved to this archival thread: http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow


Does the PathSeq pipeline support Ion Torrent data?


I can see that PathSeq has been tested using data generated on the Illumina platform. Does the PathSeq pipeline support Ion Torrent data?

The stop position is less than start for Broad.human.exome.b37.scattered.txt

I was running a test with the gatk3 germline workflow (located at `gatk-workflows/gatk3-germline-snps-indels` on GitHub), but since I'm only interested in exome performance I used `Broad.human.exome.b37.scattered.txt`, located at `gs://gatk-test-data/intervals/Broad.human.exome.b37.scattered.txt`, rather than the default intervals file.

However, running the workflow with this intervals file results in the following error:

```
2019-03-11 03:51:27,464 cromwell-system-akka.dispatchers.engine-dispatcher-21 ERROR - WorkflowManagerActor Workflow 91edb9e9-0f44-4c5c-8995-2c77090c7022 failed (during ExecutingWorkflowState): Job HCV_3.HaplotypeCaller:13:1 exited with return code 2 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: s3://cromwell-results/cromwell-execution/best_practise/91edb9e9-0f44-4c5c-8995-2c77090c7022/call-HCV_3/haplotype.HCV_3/ccf7ae57-4d04-4f16-b4e0-02450bcd4aca/call-HaplotypeCaller/shard-13/HaplotypeCaller-13-stderr.log.
Using GATK jar /usr/gitc/gatk4/gatk-package-4.beta.5-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -Xms2g -jar /usr/gitc/gatk4/gatk-package-4.beta.5-local.jar PrintReads -I /cromwell_root/cromwell-results/cromwell-execution/best_practise/91edb9e9-0f44-4c5c-8995-2c77090c7022/call-GPPW/processing.GPPW/f6ef85cb-7488-4248-b31c-ba42addfcc7d/call-GBF/NA12878.bam --interval_padding 500 -L /cromwell_root/genovic-cromwell-inputs/reference_data/b37/intervals/Broad.human.exome.scattered/Broad.human.exome.b37_21.bed -O local.sharded.bam
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/cromwell-results/cromwell-execution/best_practise/91edb9e9-0f44-4c5c-8995-2c77090c7022/call-HCV_3/haplotype.HCV_3/ccf7ae57-4d04-4f16-b4e0-02450bcd4aca/call-HaplotypeCaller/shard-13/tmp.66897de2
[March 11, 2019 3:42:49 AM UTC] PrintReads --output local.sharded.bam --intervals /cromwell_root/genovic-cromwell-inputs/reference_data/b37/intervals/Broad.human.exome.scattered/Broad.human.exome.b37_21.bed --interval_padding 500 --input /cromwell_root/cromwell-results/cromwell-execution/best_practise/91edb9e9-0f44-4c5c-8995-2c77090c7022/call-GPPW/processing.GPPW/f6ef85cb-7488-4248-b31c-ba42addfcc7d/call-GBF/NA12878.bam --interval_set_rule UNION --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[March 11, 2019 3:42:49 AM UTC] Executing as root@ip-10-0-33-14 on Linux 4.14.97-74.72.amzn1.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-8u111-b14-2~bpo8+1-b14; Version: 4.beta.5
[March 11, 2019 3:42:51 AM UTC] org.broadinstitute.hellbender.tools.PrintReads done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2058354688
***********************************************************************

A USER ERROR has occurred: Badly formed genome unclippedLoc: Parameters to GenomeLocParser are incorrect:The stop position 19506651 is less than start 19506652 in contig 21

***********************************************************************
```

You can understand why this happens by looking at the file `gs://gatk-test-data/intervals/Broad.human.exome.scattered/Broad.human.exome.b37_21.bed`, which is referenced by this intervals file. Lines like this cause GATK to fail:
```
21 19506651 19506651 + new_exome_1.1_content
```
Here, the start and end positions are the same. I'm not really sure what the purpose of such zero-length intervals is, but they are definitely the cause of the issue.
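
For anyone hitting the same thing, here is a quick way to list the offending records (a sketch; assumes whitespace-delimited BED-like interval files as above):
```
# print every interval whose stop is not strictly greater than its start
awk '$3 <= $2' Broad.human.exome.b37_21.bed
```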

Hi, I have a question about the accuracy of the GATK 4.1 test process described below.


Test process:

Step 1: split the genetic data into segments and run multiple concurrent bwa mem processes, generating multiple BAM files.

Step 2: run ReadsPipelineSpark with the multiple BAM files as input; the command is as follows:

$gatk_dir ReadsPipelineSpark -I hdfs://Master:9000/test_block_2/test_sort.bam \
    -R hdfs://Master:9000/test_block_2/human_g1k_v37.fasta \
    -O hdfs://Master:9000/test_block_2/test-$size-reads.vcf \
    --known-sites hdfs://Master:9000/test_block_2/dbsnp132_20101103.vcf \
    --smith-waterman AVX_ENABLED \
    -- --spark-runner SPARK --spark-master spark://Master:7077

How accurate is this test process?
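
One way I am considering to quantify this is to compare the Spark output against a call set produced by the standard single-node pipeline on the same data, e.g. with GATK4's Concordance tool (a sketch; baseline.vcf is a placeholder for the non-Spark call set):

gatk Concordance -R human_g1k_v37.fasta -eval test-reads.vcf --truth baseline.vcf --summary concordance_summary.tsv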

GATK4 Mutect2: how should I use (and should I use) the newer gnomAD r2.1 as germline resource?

Dear GATK Team,

We have a program to detect somatic mutations in tumor-vs-normal samples. Although we have read the Mutect2 guide (the best practices article for Mutect2, GATK post #11136), we are still not sure how to proceed.

gnomAD has released a newer version, r2.1, but the GATK bundle holds an old version; in particular, the b37 file is dated 2017.

Now we don't know whether we should use r2.1 as the germline resource, given that the newer version contains more allele frequencies.

If we want to use the newer gnomAD as a resource, what should we do to make an 'af-only-gnomad_hg19.vcf' (we use hg19, not GRCh38)? The gnomAD resource is very large and hard to work with. By the way, we want to detect mutations across the whole genome.
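
For context, the approach we are considering is to strip everything except AF out of the INFO column with bcftools (a sketch; assumes the r2.1 sites VCF is bgzipped and carries a per-site AF INFO tag; the file names are placeholders):

bcftools annotate -x "^INFO/AF" gnomad.genomes.r2.1.sites.vcf.bgz -Oz -o af-only-gnomad.r2.1.vcf.gz
bcftools index -t af-only-gnomad.r2.1.vcf.gz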
We look forward to your advice.
Thank you.

热烈欢迎我们的中国朋友 / A warm welcome to our Chinese friends


Today we are reaching out to the Chinese research community with great news: we are partnering with key companies and institutions in China to empower Chinese researchers to use GATK effectively and at scale.


As you may know, we have developed a "full stack" genomics solution that combines analysis tools (GATK itself, with version 4 soon to be released), a workflow definition language called WDL, and an execution engine called Cromwell that can execute pipelines in multiple environments, on-premises and on the cloud. This integrated solution aims to empower biomedical researchers to run and replicate analysis pipelines, starting with the GATK Best Practices, for which we are now publishing ready-to-use WDL workflows. We hope this will dramatically cut down on the effort -- and sometimes guesswork! -- previously involved in standing up GATK pipelines.

But our goals didn't stop at just building the pipelining software -- we wanted to make sure our tools would be easy to use on any of the major public clouds. So two years ago, as we were knuckling down to the hard work of developing these software tools, we forged a partnership with six industry leaders who agreed to help us bring our solution to the Cloud -- Intel, Google, Cloudera, Amazon Web Services (AWS), IBM and Microsoft.

Now, we are thrilled that Alibaba Cloud, the major cloud service provider in China, and BGI, the major sequencing service provider, are both signing on to help in the pursuit of our common goal, which is to provide top-quality, reproducible genomics pipelines to everyone in the global research community. It is a happy coincidence that this brings our fellowship of the Cloud to a lucky number eight! We are also engaging with other key companies and institutions in China, including the Beijing Institute of Genomics, Novogene and Inspur, who have expressed interest in adopting our genomics stack.

But that's not all. We're aware that language is often an obstacle for our Chinese audience, so we are looking at options for establishing an outreach program specifically aimed at the Chinese community. This would include a Chinese-language forum, translations of the GATK and WDL documentation, as well as workshops in China. This will be a challenging new undertaking for us but I am optimistic that it will yield great benefits, as I am certain our communities have much to learn from each other.

Finally, I should mention I have personal reasons for being especially pleased that we are reaching out to the Chinese research community in this way. In 2008, I spent several months living and working on a research project at Huazhong Agricultural University in Wuhan, Hubei Province, and I will never forget the wonderful welcome I was given by the staff and students at HZAU. I look forward to finally reciprocating that welcome, at scale!

Photographic evidence… at the 2008 Olympic torch parade in Wuhan!


Many thanks to members of the Intel China team and to Steve Huang of the GATK development team for their invaluable help with the translation!


I can't access the resource bundle; what can I do to solve this problem?

I can't access the resource bundle. What can I do to solve this problem?

Remove symbolic alleles from REF in a VCF file

Hi there!

I'm trying to do BQSR on some sheep genomes. I downloaded the known variants from Ensembl and converted the GVF file to VCF using their script. However, when I try to use it to run GATK 3.5.0 BaseRecalibrator, it prints this message:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions
##### ERROR
##### ERROR MESSAGE: Cannot tag a symbolic allele as the reference allele


I've looked for "<>" symbols, as they are the markers for symbolic alleles in VCF as far as I've read, but grep didn't find any outside of the header, and I've tried GATK 3.8 with the same results.
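
In case it is useful, this is the extra check I ran on the REF column (column 4 of a VCF), in case the symbolic allele is not written with angle brackets (a sketch; known_variants.vcf is a placeholder for my converted file):

grep -v '^#' known_variants.vcf | awk '$4 !~ /^[ACGTNacgtn]+$/' | head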

Any idea of what could be happening and how I can solve it?

Thank you for your help.

GATK v4 Variant Recalibrator command line


Could someone please provide me with a sample command line to run VariantRecalibrator in GATK v4? I am running the tool using GATK 4 Alpha with the following command line:

~/gatk-protected/gatk-launch VariantRecalibrator \
    -R ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/hg19/seq/hg19.fa \
    -input Stromal-combined-New.vcf \
    --resource hapmap,known=false,training=true,truth=true,prior=15.0 ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/hapmap_3.3.hg19.sites.vcf \
    --resource omni,known=false,training=true,truth=true,prior=12.0 ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/1000G_omni2.5.hg19.sites.vcf \
    --resource 1000G,known=false,training=true,truth=false,prior=10.0 ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/1000G_phase1.snps.high_confidence.hg19.sites.vcf \
    --resource dbsnp,known=true,training=false,truth=false,prior=2.0 ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/dbsnp_138.hg19.vcf \
    -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -tranchesFile Stromal-combined-New.tranches \
    --rscriptFile Stromal-combined-New.R

and I get the following error:
A USER ERROR has occurred: Invalid argument '/home/galaxy/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/hapmap_3.3.hg19.sites.vcf'.

The command syntax follows the same pattern as this
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php

My Java version is 1.8.0_131.

Has the syntax been changed for GATK version 4?

Thank you very much.

CombineVariants in GATK4


Is it planned to add the CombineVariants tool to the GATK 4.0 toolkit (it existed in previous GATK versions)? The only similar tool currently available in GATK 4.0 Beta is GatherVCFs, which has very limited functionality and cannot concatenate unsorted VCFs or merge different INFO fields correctly.
Thanks! :)

DetermineGermlineContigPloidy: issue when using more samples

Hello, I have been testing the gCNV caller from GATK 4.1.0.0.

I was able to test and complete the gCNV pipeline using 30 samples, but I would like to scale up to a larger dataset of 200 samples and am having trouble. The DetermineGermlineContigPloidy tool gives me errors when I try to use 200 samples. I have tried dividing the run up by chromosome, and have still been unable to find a solution.

gatk --java-options "-Xmx25G" DetermineGermlineContigPloidy \
-I CVH-1051.bam.counts.hdf5 \
.... (other 199 samples)
--contig-ploidy-priors ./contig_priors.tsv \
--output ../output/output.gatk.DGCP/ \
--output-prefix test_Data \
-verbosity DEBUG


12:21:12.899 DEBUG ScriptExecutor - --interval_list=/tmp/intervals1902729868733878616.tsv
12:21:12.899 DEBUG ScriptExecutor - --contig_ploidy_prior_table=/gpfs/gsfs10/users/islekda/projectCNV/gatk/contig_priors.tsv
12:21:12.899 DEBUG ScriptExecutor - --output_model_path=/gpfs/gsfs10/users/islekda/projectCNV/output/output.gatk.DGCP/test_Data-model
/data/islekda/conda/envs/gatk/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "/tmp/cohort_determine_ploidy_and_depth.4716223482773228095.py", line 86, in
sample_metadata_collection, args.sample_coverage_metadata)
File "/data/islekda/conda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/io/io_metadata.py", line 78, in read_sample_coverage_metadata
sample_name, n_j, contig_list))
File "/data/islekda/conda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/structs/metadata.py", line 242, in add_sample_coverage_metadata
'Sample "{0}" already has coverage metadata annotations'.format(sample_name))
gcnvkernel.structs.metadata.SampleAlreadyInCollectionException: Sample "none" already has coverage metadata annotations
12:21:36.494 DEBUG ScriptExecutor - Result: 1
12:21:36.495 INFO DetermineGermlineContigPloidy - Shutting down engine
[March 11, 2019 12:21:36 PM EDT] org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy done. Elapsed time: 1.15 minutes.
Runtime.totalMemory()=3002597376
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
Command Line: python /tmp/cohort_determine_ploidy_and_depth.4716223482773228095.py --sample_coverage_metadata=/tmp/samples-by-coverage-per-contig5215066494113015797.tsv --output_calls_path=/gpfs/gsfs10/users/islekda/projectCNV/output/output.gatk.DGCP/test_Data-calls --mapping_error_rate=1.000000e-02 --psi_s_scale=1.000000e-04 --mean_bias_sd=1.000000e-02 --psi_j_scale=1.000000e-03 --learning_rate=5.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.990000e-01 --log_emission_samples_per_round=2000 --log_emission_sampling_rounds=100 --log_emission_sampling_median_rel_error=5.000000e-04 --max_advi_iter_first_epoch=1000 --max_advi_iter_subsequent_epochs=1000 --min_training_epochs=20 --max_training_epochs=100 --initial_temperature=2.000000e+00 --num_thermal_advi_iters=5000 --convergence_snr_averaging_window=5000 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=1 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=7.500000e-01 --disable_caller=false --disable_sampler=false --disable_annealing=false --interval_list=/tmp/intervals1902729868733878616.tsv --contig_ploidy_prior_table=/gpfs/gsfs10/users/islekda/projectCNV/gatk/contig_priors.tsv --output_model_path=/gpfs/gsfs10/users/islekda/projectCNV/output/output.gatk.DGCP/test_Data-model
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.executeDeterminePloidyAndDepthPythonScript(DetermineGermlineContigPloidy.java:403)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:283)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
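
Since the exception mentions a sample called "none", I wonder whether the sample names stored in the count files are all identical or missing; as far as I understand, they come from the SM tag of the read groups in the source BAMs, so I am checking those (a sketch on one of my inputs):

samtools view -H CVH-1051.bam | grep '^@RG'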

Empty FILTER column in Mutect2 output vcf - tumor-only mode

Hi,
I have a question regarding the Mutect2 output VCF. I am trying to run Mutect2 on paired-end reads (forward and reverse reads for one sample at a time), tumor-only, sequenced on Illumina.

I ran the following bash script, and unfortunately I get a VCF file that has all "None" values in the FILTER column. The other columns look reasonable. I have run this script on similar data previously and it worked, so I wonder what the problem might be.

My plan is then to sort the vcf file based on coverage and filter status, but I really don't know where the problem might be.

Thank you in advance.

#!/bin/bash

ROOTDIR="$HOME/projects/patologie"  # the original line had an unclosed quote, and ~ does not expand inside quotes
cd $ROOTDIR/hs37d5

bwa index -a bwtsw $ROOTDIR/hs37d5/hs37d5.fa
samtools faidx $ROOTDIR/hs37d5/hs37d5.fa
picard CreateSequenceDictionary R=$ROOTDIR/hs37d5/hs37d5.fa O=$ROOTDIR/hs37d5/hs37d5.dict

cd $ROOTDIR/analysis2

bwa mem -M $ROOTDIR/hs37d5/hs37d5.fa $ROOTDIR/01_raw_data/BRCA1_S1_L001_R1_001.fastq $ROOTDIR/01_raw_data/BRCA1_S1_L001_R2_001.fastq > sample.sam

samtools view -bS sample.sam > sample.bam

picard SortSam I=sample.bam O=sorted_sample.bam SORT_ORDER=coordinate

samtools index sorted_sample.bam

picard MarkDuplicates I=sorted_sample.bam O=sample_marked.bam M=marked_metrics.txt ASSUME_SORT_ORDER=coordinate

samtools index sample_marked.bam

picard AddOrReplaceReadGroups I=sample_marked.bam O=sample_rg.bam RGID=1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=1

gatk IndexFeatureFile -F $ROOTDIR/reference/1000G_phase1.indels.b37.vcf

gatk IndexFeatureFile -F $ROOTDIR/reference/Mills_and_1000G_gold_standard.indels.b37.vcf

gatk IndexFeatureFile -F $ROOTDIR/reference/dbsnp_138.b37.vcf


gatk BaseRecalibrator -R $ROOTDIR/hs37d5/hs37d5.fa -I sample_rg.bam --known-sites $ROOTDIR/reference/1000G_phase1.indels.b37.vcf --known-sites $ROOTDIR/reference/dbsnp_138.b37.vcf --known-sites $ROOTDIR/reference/Mills_and_1000G_gold_standard.indels.b37.vcf -O recal_table

gatk ApplyBQSR -bqsr recal_table -I sample_rg.bam -R $ROOTDIR/hs37d5/hs37d5.fa -O sample_recal.bam

gatk IndexFeatureFile -F $ROOTDIR/reference/af-only-gnomad.raw.sites.b37.vcf

gatk Mutect2 -R $ROOTDIR/hs37d5/hs37d5.fa -I $ROOTDIR/analysis2/sample_recal.bam -tumor 1 -O sample_single.vcf

gatk VariantAnnotator -R $ROOTDIR/hs37d5/hs37d5.fa -V sample_single.vcf --dbsnp $ROOTDIR/reference/dbsnp_138.b37.vcf -O sample_single_id_som.vcf

grep -v '^##.*' sample_single_id_som.vcf > sample_filter.vcf
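
One thing I have not tried yet: as far as I understand, in GATK4 the FILTER column is only populated once FilterMutectCalls is run on the raw Mutect2 output, e.g. (a sketch; newer GATK versions may also require -R and the Mutect2 stats file):

gatk FilterMutectCalls -V sample_single.vcf -O sample_single_filtered.vcf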

1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf corrupted?


Hi, I downloaded the file from the GATK Google Cloud bucket, but it seems the file is corrupted: only chr1-chr15 sites are present.


GATK4 Mutect2 variants IDs not shown


Hi everyone!

I am a beginner using GATK, so please bear with me. Also, I am sorry this is duplicated on Biostars; I believe here is more appropriate.

I am trying to do my variant calling with GATK4's new Mutect2 (not MuTect2), using the af-only-gnomad.hg38.vcf.gz from the GATK bundle as --germline-resource. The command is pretty much the same as seen in the tutorial (https://software.broadinstitute.org/gatk/documentation/article?id=11136), only updated to allow it to be used with WES.

My problem is that after successfully completing the variant calling, none of the variants have ID information, whereas when I compared with a MuTect2 result (run with --dbsnp, an option that is not available in Mutect2), several did. Basically, all of the IDs are ".".

I checked the uncompressed gnomad.hg38.vcf.gz manually and the ID info (rsXXXXXX) is there.
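
As a possible workaround, I am considering copying the rsIDs over afterwards with bcftools (a sketch; assumes both VCFs are bgzipped and tabix-indexed, and dbsnp.vcf.gz and the call-set names are placeholders):

bcftools annotate -a dbsnp.vcf.gz -c ID sample_calls.vcf.gz -Oz -o sample_calls_with_ids.vcf.gz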

Any ideas on why the variant calling is not collecting the ID info from the germline resource to populate the ID field? Thank you!

Daiana

Web-based Oncotator server


There is a web-based version of Oncotator which you can use for annotation without running anything on your own machine.

However, please note that the web-based version is an older version, with fewer datasources and many limitations. We urge you to use the downloadable version instead, and at this time we do not provide user support for the web-based version. It is simply provided as-is.

Note also that on rare occasions the server malfunctions and needs to be rebooted. If you experience any server errors (e.g. an error message stating that the server is unavailable), please post a note in the thread below and we'll reboot it as soon as we can.

about ASEReadCounter


Dear all,

I am using ASEReadCounter to count the number of reads per variant in a BAM file, and, somewhat related to a previous post (see below), I am encountering a similar error:

"MESSAGE: More then one variant context at position: chr19:125517"

i.e. in the VCF file, there are 2 entries for the same position:

chr19 125517 . A G 42.01 . AC1=1;AF1=0.5;BQB=0.950129;DP=43;DP4=12,7,4,1;FQ=45.0154;MQ=46;MQ0F=0;MQB=0.984335;MQSB=0.998127;PV4=0.631094,1,1,1;RPB=1;SGB=-0.590765;VDB=0.233642 GT:PL 0/1:72,0,255

chr19 125517 . AA AAGAGA 5.79 . AC1=1;AF1=0.499984;DP=43;DP4=7,5,5,0;FQ=8.19012;IDV=2;IMF=0.0444444;INDEL;MQ=45;MQ0F=0;MQSB=0.99446;PV4=0.244505,1,0.0559047,0.273348;SGB=-0.590765;VDB=0.125771 GT:PL 0/1:42,0,151

The question is: is there any way in GATK to remove these sites? Of course, I could do it with a simple script outside GATK, although doing it outside GATK may complicate the pipeline a bit. Thank you very much!
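
One idea I had: since in my case the two entries are a SNP and an indel at the same position, restricting the VCF to SNPs before counting might be enough (a sketch for GATK 3.x; ref.fasta and input.vcf are placeholders):

java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta -V input.vcf -selectType SNP -o snps_only.vcf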

-- bogdan

PS: the previous post was:
http://gatkforums.broadinstitute.org/gatk/discussion/comment/30752#Comment_30752


VariantRecalibrator Problem: QD Annotation

Hi,

I'm new to this community and I'm trying to perform a joint analysis of somatic SNVs + indels according to the Best Practices. When I try to perform variant recalibration:

```
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fna \
-input cohort_raw.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 /hapmap_3.3.hg19.sites.vcf.gz \
-resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg19.sites.vcf.gz \
-resource:MG,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 GRCh37_latest_dbSNP_all.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode SNP -recalFile output_snp_cohort.recal \
-tranchesFile output_snp_cohort.tranches \
-rscriptFile output_snp_cohort.plots.R
```
I always get the message:

```
## MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.
```

I downloaded the training files from the resource bundle.

My input file look like this:

```
## NC_000017.10:41196312-41577500 52 . C T 18871.87 . AC=3;AF=0.021;AN=144;BaseQRankSum=0.721;ClippingRankSum=0.00;DP=32141;ExcessHet=3.1024;FS=0.521;InbreedingCoeff=-0.0228;MLEAC=3;MLEAF=0.021;MQ=60.00;MQRankSum=0.00;QD=9.24;ReadPosRankSum=1.43;SOR=0.728 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0
```
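
For what it's worth, QD does appear to be present in the raw callset; this is the quick count I used to check (a sketch):

```
# count non-header records that carry a QD annotation
grep -v '^#' cohort_raw.vcf | grep -c 'QD='
```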

I read in other posts that it could be due to "-nt"; I disabled it and it didn't work. I also read that it might be due to the lack of QD annotation in the joint g.vcf, so I tried to re-annotate before the g.vcf merge:

```
java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-T VariantAnnotator \
-I input.bam \
-V input.vcf \
-o output.vcf \
-A Coverage -A MappingQualityRankSumTest -A QualByDepth \
-A RMSMappingQuality -A ReadPosRankSumTest -A StrandOddsRatio \
-L input.vcf \
--dbsnp dbsnp.vcf
```
It didn't work. I also tried:

```
${gatk3} -T GenotypeGVCFs -R ${ref} \
--variant ${final}/cohort.g.vcf -maxAltAlleles 8 -nt 8 --dbsnp ${vcfref} \
-A Coverage -A MappingQualityRankSumTest -A QualByDepth \
-A RMSMappingQuality -A ReadPosRankSumTest -A StrandOddsRatio \
-A FisherStrand -o ${final}/cohort_raw.vcf
```
The same result. I'm out of options. Help!

Thanks
Tarr