Channel: Recent Discussions — GATK-Forum

JEXL expressions for selecting variants


Location of contamination files in new Google Bucket

I'm having a hard time finding a few files in the new Google Bucket.

From the five-dollar-genome JSON, these are the files:

"contamination_sites_ud": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.contam.UD",
"contamination_sites_bed": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.contam.bed",
"contamination_sites_mu": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.contam.mu",
"calling_interval_list": "gs://broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list",

I expect them to be in this bucket: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/

It looks like the other reference files are in this directory, but I can't find the four files listed above.
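For reference, one quick way to check is to list the bucket prefix with gsutil and grep for the files (a sketch; assumes the Google Cloud SDK is installed):

```
# List the reference prefix and search for the contamination files
gsutil ls gs://broad-references/hg38/v0/ | grep -i contam
```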

Can someone please point me to the correct directory in the bucket?

Thanks in advance for the help.

How to get a known variant VCF file to recalibrate human mitochondrial NGS data with GATK4

I'm analysing human mitochondrial NGS sequences for germline variants. I'm using GATK4 and wondering whether recalibration is necessary, and whether base recalibration and variant recalibration can be done in a single step. If so, how should I get a known variant VCF for human mitochondrial DNA? Is there a way to do this if I don't have that file?

I'm also curious what should be done about the realignment step, as that function is deprecated in GATK4.
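For context, this is the shape of the command I expect to run; the --known-sites input is exactly the file I'm missing (all file names here are placeholders):

```
# Sketch of the BQSR step in question; chrM_known_sites.vcf.gz is hypothetical
gatk BaseRecalibrator \
    -I sample.chrM.bam \
    -R Homo_sapiens_assembly38.fasta \
    --known-sites chrM_known_sites.vcf.gz \
    -O recal.table
```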

Merging parallelized BaseRecalibrator outputs

Hi,

It looks like BaseRecalibrator is parallelizable (based on the fact that BaseRecalibratorSpark exists). However, I don't see a tool for merging the recalibration tables (except inside BaseRecalibratorSpark). Is the merge step available for non-Spark parallelization? There was some discussion of a tool back in 2013, but I don't see it anymore.
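For concreteness, the scatter pattern I mean looks like this (interval lists and file names are hypothetical); it's the final merge of the per-interval tables that I can't find a non-Spark tool for:

```
# Per-interval recalibration tables; the open question is how to merge them
gatk BaseRecalibrator -I sample.bam -R ref.fasta --known-sites known.vcf.gz \
    -L chr1.interval_list -O recal.chr1.table
gatk BaseRecalibrator -I sample.bam -R ref.fasta --known-sites known.vcf.gz \
    -L chr2.interval_list -O recal.chr2.table
```

(I did come across a mention of GatherBQSRReports in the GATK4 tool list — is that the intended merger for these tables?)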

Regards, Alec

Mutect2 - Germline Resource

Hi,

Are the details behind how Mutect2 utilizes the Germline Resource documented anywhere? It does not appear to be a simple filter of variants in the input file. How does Mutect2 determine whether a given call not in the resource is somatic or germline?

Thanks!

Gene based DepthOfCoverage and correctly editing Refseq file to include all target regions?

Hello,

I want to run DepthOfCoverage to get a gene summary of coverage. I have a BED file of target gene regions that I want to run this with, but the .refseq file I downloaded does not cover some intervals in my target regions, so I had to edit my refseq file to get this to work.

For example:

This is the Refseq entry for SNHG3:
```
#bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
100 NR_002909.2 chr1 + 28832454 28837404 28837404 28837404 3 28832454,28834639,28835341, 28832596,28834672,28837404, 0 SNHG3 none none -1,-1,-1,
```

My hypothetical bed file has 2 extra regions outside of the SNHG3 refseq gene:
```
chr1 20000000 21000000
chr1 28832454 28832596
chr1 28834639 28834672
chr1 28835341 28837404
chr1 29000000 30000000
```

So I reworked the refseq file to look like this:

```
3 SNHG3 chr1 + 20000000 30000000 20000000 30000000 5 20000000,28832454,28834639,28835341,29000000, 21000000,28832596,28834672,28837404,30000000, 0 SNHG3 none none 0,0,0,0,0
```
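For reproducibility, here's roughly how such an entry could be generated from the per-gene BED lines (a sketch; snhg3_targets.bed is a hypothetical file name, and the padded tx/cds coordinates are the ones above):

```
# Collapse a sorted single-gene BED into one refseq-style row
awk 'BEGIN{OFS="\t"}
     {chrom = $1; starts = starts $2 ","; ends = ends $3 ","; frames = frames "0,"; n++}
     END{print 3, "SNHG3", chrom, "+", 20000000, 30000000, 20000000, 30000000,
               n, starts, ends, 0, "SNHG3", "none", "none", frames}' snhg3_targets.bed
```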

From what I understand, DepthOfCoverage uses the exonStarts and exonEnds columns when matching intervals in -L to the correct genes. So, as long as the tx and cds coordinates lie outside my exonStart/End intervals and I have the correct number of exons, do the other columns (other than chrom) matter at all to DepthOfCoverage?

From the test cases I've run, it looks like this works, but I wanted to be 100% sure that the way I edited columns like exonFrames, #bin, and name is OK.

I'm running this on the latest version of GATK 3.

```
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I input.list -o test -geneList trial.refseq -L targets.bed -R hg19.fa -mmq 30 -mbq 20
```

Thank you

GATK4.0.9.0 fails to detect a delins site

Dear GATK team,
I used GATK4.0.9.0 HaplotypeCaller to detect germline variation, and found a delins site that is not called. As the attached figure shows, in the gvcf output 32944606 is delTTT and 32944609 is insAAAA. But the vcf output contains only the delTTT; the insAAAA is filtered out. I find that GATK4.1.1.0 and 4.1.2.0 also do not call the insAAAA in the vcf. The site is verified by Sanger and is a delinsAAAA. So why does GATK4.0.9.0 filter out the insAAAA at 32944609?

The vcf of GATK4.0.9.0:
chr13 32944606 . CTTT C 9318.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.294;DP=814;ExcessHet=3.0103;FS=1.093;MLEAC=1;MLEAF=0.500;MQ=60.03;MQRankSum=0.000;QD=11.90;ReadPosRankSum=0.290;SOR=0.769 GT:AD:DP:GQ:PL 0/1:423,360:783:99:9326,0,45236

GATK HaplotypeCaller MIXED variants delins

Hello GATK team!

I am facing a problem that I am not exactly sure your caller can address and I would need your opinion on that.
I use the latest GATK version (3.4-46) HaplotypeCaller to call my variants (after all the Best Practices steps).

I am getting the following two variants:
chr17 41222982 . ATTC A 8447.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.398;ClippingRankSum=-0.136;DP=515;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.02;MQRankSum=1.616;QD=16.40;ReadPosRankSum=-19.399;SOR=0.470 GT:AD:DP:GQ:PL 0/1:292,223:515:99:8485,0,17722
chr17 41222986 . T TAAAA 8363.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-12.032;ClippingRankSum=0.954;DP=515;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.02;MQRankSum=-1.440;QD=16.24;ReadPosRankSum=-19.368;SOR=0.453 GT:AD:DP:GQ:PL 0/1:292,223:515:99:8401,0,17731

This is a deletion directly followed by an insertion. As you can see, the read counts for the reference (292) and the alternate (223) are exactly the same in both records. The problem is that my biologists are looking for one single mutation called a "delins", and because HC calls 2 distinct variants, I get 2 annotations instead of 1 (it should be something like c.1234delTTCinsAAAA).

Do you have any idea how I could handle that using HC? Or maybe after getting the vcf, with a post-processing tool?
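For example, here is a minimal post-processing sketch (not an official GATK tool) that flags pairs of consecutive records within a few bp of each other as delins candidates for downstream merging and re-annotation; calls.vcf is a placeholder:

```
# Print pairs of consecutive variant records within 5 bp as delins candidates
awk 'BEGIN{OFS="\t"}
     !/^#/ {
       if ($1 == chr && $2 - pos <= 5)
         print chr, pos, ref ">" alt, "+", $1, $2, $4 ">" $5
       chr = $1; pos = $2; ref = $4; alt = $5
     }' calls.vcf
```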

Thanks a lot.
Manon

VAF by INDEL size

Hello,

FLT3 ITD (FLT3 internal tandem duplication; a duplication in exon 13 ~ exon 14 of FLT3) is a well-known variant in leukemia.
The length of the duplicated bases ranges from 10 bp to 200 bp, and Mutect2 is able to detect the variant if the size is less than about 60 bp (read length is 150 bp in my data).
Recently, I updated GATK from 4.0.6 to 4.1.4, where Mutect2 allows keeping both overlapping reads, and the DP, AD, and AF values became similar to the values observed in the IGV viewer.
Thank you for the change; there were many questions about the VAF, which looked quite different from IGV.
When I compared the VAF (AD_alt / AD_total) of the FLT3-ITD variant between v4.0.6 and v4.1.4, the VAF was overestimated in v4.0.6.
Two questions:

  1. The VAF difference increases as the length of the duplicated bases increases. Does the size of an INDEL variant affect the VAF?

  2. The VAF of SNV variants showed little difference between the two GATK versions. Is keeping overlapping reads more sensitive for INDEL variants?

(How to) Run the Pathseq pipeline

Beta tutorial. Please report any issues in the comments section.

Overview

PathSeq is a GATK pipeline for detecting microbial organisms in short-read deep sequencing samples taken from a host organism (e.g. human). The diagram below summarizes how it works. In brief, the pipeline performs read quality filtering, subtracts reads derived from the host, aligns the remaining (non-host) reads to a reference of microbe genomes, and generates a table of detected microbial organisms. The results can be used to determine the presence and abundance of microbial organisms as well as to discover novel microbial sequences.


[PathSeq pipeline diagram] Boxes outlined with dashed lines represent files. The green boxes at the top depict the three phases of the pipeline: read quality filtering / host subtraction, microbe alignment, and taxonomic abundance scoring. The blue boxes show tools used for pre-processing the host and microbe references for use with PathSeq.

Tutorial outline

This tutorial describes:

  • How to run the full PathSeq pipeline on a simulated mixture of human and E. coli reads using pre-built small-scale reference files
  • How to prepare custom host and microbe reference files for use with PathSeq

A more detailed introduction of the pipeline can be found in the PathSeqPipelineSpark tool documentation. For more information about the other tools, see the Metagenomics section of the GATK documentation.

How to obtain reference files

Host and microbe references must be prepared for PathSeq as described in this tutorial. The tutorial files provided below contain references that are designed specifically for this tutorial and should not be used in practice. Users can download recommended pre-built reference files for use with PathSeq from the GATK Resource Bundle FTP server in /bundle/pathseq/ (see readme file). This tutorial also covers how to build custom host and microbe references.

Tutorial Requirements

The PathSeq tools are bundled with the GATK 4 release. For the most up-to-date GATK installation instructions, please see https://github.com/broadinstitute/gatk. This tutorial assumes you are using a POSIX (e.g. Linux or macOS) operating system with at least 2 GB of memory.

Obtain tutorial files

Download tutorial_10913.tar.gz from the ftp site. Extract the archive with the command:

> tar xzvf pathseq_tutorial.tar.gz
> cd pathseq_tutorial

You should now have the following files in your current directory:

  • test_sample.bam : simulated sample of 3M paired-end 151-bp reads from human and E. coli
  • hg19mini.fasta : human reference sequences (indexed)
  • e_coli_k12.fasta : E. coli reference sequences (indexed)
  • e_coli_k12.fasta.img : PathSeq BWA-MEM index image
  • e_coli_k12.db : PathSeq taxonomy file

Run the PathSeq pipeline

The pipeline accepts reads in BAM format (if you have FASTQ files, please see this article on how to convert to BAM). In this example, the pipeline can be run using the following command:

> gatk PathSeqPipelineSpark \
    --input test_sample.bam \
    --filter-bwa-image hg19mini.fasta.img \
    --kmer-file hg19mini.hss \
    --min-clipped-read-length 70 \
    --microbe-fasta e_coli_k12.fasta \
    --microbe-bwa-image e_coli_k12.fasta.img \
    --taxonomy-file e_coli_k12.db \
    --output output.pathseq.bam \
    --scores-output output.pathseq.txt

This ran in 2 minutes on a MacBook Pro with a 2.8 GHz quad-core CPU and 16 GB of RAM. If running on a local workstation, users can monitor the progress of the pipeline through a web browser at http://localhost:4040.

Interpreting the output

The PathSeq output files are:

  • output.pathseq.bam : contains all high-quality non-host reads aligned to the microbe reference. The YP read tag lists the NCBI taxonomy IDs of any aligned species meeting the alignment identity criteria (see the --min-score-identity and --identity-margin parameters). This tag is omitted if the read was not successfully mapped, which may indicate the presence of organisms not represented in the microbe database.
  • output.pathseq.txt : a tab-delimited table of the input sample’s microbial composition. This can be imported into Excel and organized by selecting Data -> Filter from the menu:
```
tax_id  | taxonomy                                      | type         | name                                      | kingdom  | score  | score_normalized | reads  | unambiguous | reference_length
1       | root                                          | root         | root                                      | root     | 189580 | 100              | 189580 | 189580      | 0
131567  | root|cellular_organisms                       | no_rank      | cellular_organisms                        | root     | 189580 | 100              | 189580 | 189580      | 0
2       | ...|cellular_organisms|Bacteria               | superkingdom | Bacteria                                  | Bacteria | 189580 | 100              | 189580 | 189580      | 0
1224    | ...|Proteobacteria                            | phylum       | Proteobacteria                            | Bacteria | 189580 | 100              | 189580 | 189580      | 0
1236    | ...|Proteobacteria|Gammaproteobacteria        | class        | Gammaproteobacteria                       | Bacteria | 189580 | 100              | 189580 | 189580      | 0
91347   | ...|Gammaproteobacteria|Enterobacterales      | order        | Enterobacterales                          | Bacteria | 189580 | 100              | 189580 | 189580      | 0
543     | ...|Enterobacterales|Enterobacteriaceae       | family       | Enterobacteriaceae                        | Bacteria | 189580 | 100              | 189580 | 189580      | 0
561     | ...|Enterobacteriaceae|Escherichia            | genus        | Escherichia                               | Bacteria | 189580 | 100              | 189580 | 189580      | 0
562     | ...|Escherichia|Escherichia_coli              | species      | Escherichia_coli                          | Bacteria | 189580 | 100              | 189580 | 189580      | 0
83333   | ...|Escherichia_coli|Escherichia_coli_K-12    | no_rank      | Escherichia_coli_K-12                     | Bacteria | 189580 | 100              | 189580 | 189580      | 0
511145  | ...|Escherichia_coli_str._K-12_substr._MG1655 | no_rank      | Escherichia_coli_str._K-12_substr._MG1655 | Bacteria | 189580 | 100              | 189580 | 189580      | 4641652
```

Each line provides information for a single node in the taxonomic tree. A "root" node corresponding to the top of the tree is always listed. Columns to the right of the taxonomic information are:

  • score : indicates the amount of evidence that this taxon is present, based on the number of reads that aligned to references in this taxon. This takes into account uncertainty due to ambiguously mapped reads by dividing their weight across each possible hit (e.g. a read mapping equally well to two species contributes half a read's weight to each). It is also normalized by genome length.
  • score_normalized : the same as score, but normalized to sum to 100 within each kingdom.
  • reads : number of mapped reads (ambiguous or unambiguous)
  • unambiguous : number of unambiguously mapped reads
  • reference_length : reference length (in bases) if there is a reference assigned to this taxon. Unlike scores, this number is not propagated up the tree, i.e. it is 0 if there is no reference corresponding directly to the taxon. In the above example, the MG1655 strain reference length is only shown in the strain row (4,641,652 bases).

In this example, one can see that PathSeq detected 189,580 reads that mapped to the strain reference for E. coli K-12 MG1655. This read count is propagated up the tree (species, genus, family, etc.) to the root node. If other species were present, their read counts would be listed and added to their corresponding ancestral taxonomic classes.

Microbe discovery

PathSeq can also be used to discover novel microorganisms by analyzing the unmapped reads, e.g. using BLAST or de novo assembly. To get the number of non-host (microbe plus unmapped) reads, use the samtools view command:

> samtools view -c output.pathseq.bam
189580

Since the reported number of E. coli reads is the same as the number of reads in the output BAM, there are 0 reads of unknown origin in this sample.
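If unmapped reads were present, one way to pull them out for BLAST or de novo assembly (a sketch using the standard samtools unmapped flag; output file names are placeholders) would be:

> samtools view -b -f 4 output.pathseq.bam > unmapped.bam
> samtools fastq unmapped.bam > unmapped.fastq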

Preparing Custom Reference Files

Custom host and microbe references must both be prepared for use with PathSeq. The references should be supplied as FASTA files with proper indices and sequence dictionaries. The host reference is used to build a BWA-MEM index image and a k-mer file. The microbe reference is used to build another BWA-MEM index image and a taxonomy file. Here we assume you are starting with the FASTA reference files that have been properly indexed:

  • host.fasta : your custom host reference sequences
  • microbe.fasta : your custom microbe reference sequences

Build the host and microbe BWA index images

The BWA index images must be built using BwaMemIndexImageCreator:

> gatk BwaMemIndexImageCreator -I host.fasta
> gatk BwaMemIndexImageCreator -I microbe.fasta

Generate the host k-mer library file

The PathSeqBuildKmers tool creates a library of k-mers from a host reference FASTA file. Create a hash set of all k-mers in the host reference with the following command:

> gatk PathSeqBuildKmers \
--reference host.fasta \
-O host.hss

Build the taxonomy file

Download the latest RefSeq accession catalog, RefSeq-releaseXX.catalog.gz, where XX is the latest RefSeq release number, from:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/
Download the NCBI taxonomy data dump (no need to extract the archive):
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
Assuming these files are now in your current working directory, build the taxonomy file using PathSeqBuildReferenceTaxonomy:

> gatk PathSeqBuildReferenceTaxonomy \
-R microbe.fasta \
--refseq-catalog RefSeq-releaseXX.catalog.gz \
--tax-dump taxdump.tar.gz \
-O microbe.db

Example reference build script

The preceding instructions can be conveniently executed with the following bash script:

#!/bin/bash
set -eu
GATK_HOME=/path/to/gatk
REFSEQ_CATALOG=/path/to/RefSeq-releaseXX.catalog.gz
TAXDUMP=/path/to/taxdump.tar.gz

echo "Building pathogen reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I microbe.fasta
$GATK_HOME/gatk PathSeqBuildReferenceTaxonomy -R microbe.fasta --refseq-catalog $REFSEQ_CATALOG --tax-dump $TAXDUMP -O microbe.db

echo "Building host reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I host.fasta
$GATK_HOME/gatk PathSeqBuildKmers --reference host.fasta -O host.hss

Troubleshooting

Java heap out of memory error

Increase the Java heap limit. For example, to increase the limit to 4GB with the --java-options flag:

> gatk --java-options "-Xmx4G" ... 

This should generally be set to a value greater than the combined size of all reference files.

The output is empty

The input reads must pass an initial validity filter, WellFormedReadFilter. A common cause of empty output is that the input reads do not pass this filter, often because none of the reads have been assigned to a read group (with an RG tag). For instructions on adding read groups, see this article, but note that PathSeqPipelineSpark and PathSeqFilterSpark do not require the input BAM to be sorted or indexed.

Multi-sample VariantsToTable: each sample on a new line with the sample name

I'm interested in using the VariantsToTable tool to convert a multi-sample VCF to a tab-separated file. While doing so, I want each sample with a non-reference genotype on a separate line.
I reviewed the 'moltenize' option in the tool, but the Sample column only populates 'site'. Is there a way for it to contain the sample name instead?
https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.2.0/org_broadinstitute_hellbender_tools_walkers_variantutils_VariantsToTable.php#--moltenize
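For reference, this is the kind of invocation I'm using (the field choices and file names are just an example):

```
gatk VariantsToTable \
    -V multisample.vcf.gz \
    -F CHROM -F POS -F REF -F ALT -GF GT \
    --moltenize \
    -O variants.molten.table
```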

Genotype annotation "1/0" instead of the regular "0/1" in VCF files generated by HaplotypeCaller

Please, could you help me?

In my VCF files generated by GATK 4.0.4.0 using HaplotypeCaller following the GATK Best Practices, I got the expected genotypes for most variants, for example:

0/1:238,78:316:99:1430,0,6482 >> the sample is heterozygous
1/1:0,79:79:99:2743,237,0 >> the sample is homozygous alternate
0/0:7,0:7:21:0,21,231 >> the sample is homozygous reference

However, for some variants, I also got genotypes like this:
1/0:0,209:269:99:9550,1843,1208
1/0:1,155:304:99:8979,4486,4054
1/0:7,15:35:75:456,75,118

1) Why does this happen? Why does the 1/0 genotype appear?
2) What is the difference between the 0/1 and 1/0 genotypes?
3) Should I exclude variants with the 1/0 genotype?
4) In 1/0 cases, is the allelic depth for the alt allele the first or the second field?
For example, for 1/0:7,15:35:75:456,75,118, is the AD for the alt allele 7?

I am sorry for all these questions but I could not find clear answers in GATK forums or VCF specs.

Thank you in advance!

Best regards,

Include SAMPLE name in Picard CollectHsMetrics output at runtime?

When we run Picard CollectHsMetrics from GATK 4.1.x and earlier, the last three columns in the initial "METRICS CLASS" data line are SAMPLE, LIBRARY, and READ_GROUP, respectively.

Every time I've run this (and I've run this kind of thing literally thousands of times over the years), the last three columns always come back as three blanks.

Is there a way to populate these three values (especially the SAMPLE) at runtime e.g. a command line argument? Or is there another way?

For the record, this is a question I've had simmering since the Picard 1.x days, but I've just recently been asked to combine the HsMetrics outputs in a systematic/automated fashion. It would be super easy if I could just inject the SAMPLE value into that row at runtime. I realize I could do this with a perl/sed/awk/etc. one-liner, but if there's an easier way, I'd like to know about it.
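For illustration, this is the kind of one-liner I mean (file and sample names are placeholders; it assumes the data line keeps its trailing tabs for the blank SAMPLE/LIBRARY/READ_GROUP fields):

```
# Post-hoc fix: fill the empty SAMPLE column (third-from-last) on the metrics data line
awk -F'\t' -v OFS='\t' -v sample=NA12878 '
    prev_was_header { $(NF-2) = sample; prev_was_header = 0 }  # data line follows the header
    $1 == "BAIT_SET" { prev_was_header = 1 }                   # HsMetrics header starts with BAIT_SET
    { print }
' NA12878.hsmetrics.txt > NA12878.hsmetrics.fixed.txt
```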


Size of gvcfs generated by HaplotypeCaller at 30x and 100x coverage of the same sample differs

I have raw reads at 30x and 100x coverage for the same sample. I followed all the pre-processing steps mentioned in the GATK Best Practices to call variants. Finally, gvcf files were generated from both datasets by GATK4 HaplotypeCaller, but the file sizes differ. Why are the gvcf file sizes different, even though the same reference sequence was used for alignment by BWA-MEM? I thought the sizes of both gvcf files should be the same, since variants were called using the same reference, the same aligner (BWA-MEM with default parameters), and the same pre-processing steps for both samples.

Question about setting intervals for variant calling

Hello GATK team!
I am processing non-human WGS data (50 samples, more than 11,000 scaffolds) with GATK4.
Now I have a question about the HaplotypeCalling and JointDiscovery steps.
In my understanding, parallelizing the haplotype calling step with the -nct option is not recommended,
so I want to parallelize these steps with intervals.
The longest scaffold is around 160 Mb, and I chopped all the scaffolds into 30 Mb pieces.
Then I made 50 interval files (xxx.intervals), some of which contain only one interval (30 Mb long) and some of which contain many short scaffolds.
I performed 50 independent jobs for both the HaplotypeCalling and JointDiscovery steps with different interval files.
I believe that since I only need SNPs and don't call indels, cutting the sequence into arbitrary intervals does not harm me... Is this correct?
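For reference, the 30 Mb chunks could be generated from the FASTA index with a small awk snippet like this (ref.fasta.fai is a placeholder):

```
# Chop every scaffold listed in the .fai index into <=30 Mb intervals
awk '{for (s = 1; s <= $2; s += 30000000) {
        e = (s + 29999999 < $2) ? s + 29999999 : $2
        print $1 ":" s "-" e
      }}' ref.fasta.fai > all_intervals.txt
```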
Here I paste my modified pipelines and examples of my interval files:

#### Script for Haplotype Calling ####

# WORKFLOW DEFINITION
workflow HaplotypeCallerGvcf_GATK4 {
String sample_name
File input_bam
File input_bam_index
#String input_dir
File ref_dict
File ref_fasta
File ref_fasta_index
File interval_file

String sample_basename = basename(input_bam, ".bam")
String sample_num = basename(interval_file, ".intervals")
String vcf_basename = sample_basename
#String output_suffix = if making_gvcf then ".g.vcf.gz" else ".vcf.gz"
String output_suffix = ".g.vcf.gz"
String output_filename = vcf_basename + "_" + sample_num +output_suffix


# Call variants in parallel over grouped calling intervals
#scatter (interval_file in scattered_calling_intervals) {

# Generate GVCF by interval
call HaplotypeCaller {
input:
#input_bam = select_first([CramToBamTask.output_bam, input_bam]),
#input_bam_index = select_first([CramToBamTask.output_bai, input_bam_index]),
interval_list = interval_file,
input_bam = input_bam,
input_bam_index = input_bam_index,
output_filename = output_filename,
ref_dict = ref_dict,
ref_fasta = ref_fasta,
ref_fasta_index = ref_fasta_index
#make_gvcf = making_gvcf
#docker = gatk_docker,
#gatk_path = gatk_path
}
#}

# Outputs that will be retained when execution is complete
output {
#File output_vcf = MergeGVCFs.output_vcf
#File output_vcf_index = MergeGVCFs.output_vcf_index
File output_vcf = HaplotypeCaller.output_vcf
File output_vcf_index = HaplotypeCaller.output_vcf_index
}
}

# TASK DEFINITIONS

task CramToBamTask {
# Command parameters
File ref_fasta
File ref_fasta_index
File ref_dict
File input_cram
String sample_name


Float output_bam_size = size(input_cram, "GB") / 0.60
Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
Int disk_size = ceil(size(input_cram, "GB") + output_bam_size + ref_size) + 20

command {
set -e
set -o pipefail

samtools view -h -T ${ref_fasta} ${input_cram} |
samtools view -b -o ${sample_name}.bam -
samtools index -b ${sample_name}.bam
mv ${sample_name}.bam.bai ${sample_name}.bai
}

output {
File output_bam = "${sample_name}.bam"
File output_bai = "${sample_name}.bai"
}
}

# HaplotypeCaller per-sample in GVCF mode
task HaplotypeCaller {
File input_bam
File input_bam_index
File interval_list
String output_filename
File ref_dict
File ref_fasta
File ref_fasta_index
#Float? contamination
#Boolean make_gvcf

String gatk_path
#String? java_options
#String java_opt = select_first([java_options, "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10"])
String java_opt = "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10"


Int? mem_gb
Int machine_mem_gb = select_first([mem_gb, 7])
Int command_mem_gb = machine_mem_gb - 1

Float ref_size = size(ref_fasta, "GB") + size(ref_fasta_index, "GB") + size(ref_dict, "GB")
Int disk_size = ceil(size(input_bam, "GB") + ref_size) + 20

command <<<
set -e

${gatk_path} --java-options "-Xmx${command_mem_gb}G ${java_opt}" \
HaplotypeCaller \
-R ${ref_fasta} \
-I ${input_bam} \
-L ${interval_list} \
-O ${output_filename} \
-ERC GVCF
>>>

#runtime {
#docker: docker
#memory: machine_mem_gb + " GB"
#disks: "local-disk " + select_first([disk_space_gb, disk_size]) + if use_ssd then " SSD" else " HDD"
#preemptible: select_first([preemptible_attempts, 3])
#}

output {
File output_vcf = "${output_filename}"
File output_vcf_index = "${output_filename}.tbi"
}
}
# Merge GVCFs generated per-interval for the same sample
task MergeGVCFs {
Array[File] input_vcfs
Array[File] input_vcfs_indexes
String output_filename

String gatk_path

# Runtime parameters
String docker
Int mem_gb
#Int? disk_space_gb
Boolean use_ssd = false
Int? preemptible_attempts

Int machine_mem_gb = select_first([mem_gb, 3])
Int command_mem_gb = machine_mem_gb - 1

command <<<
set -e

${gatk_path} --java-options "-Xmx${command_mem_gb}G" \
MergeVcfs \
--INPUT ${sep=' --INPUT ' input_vcfs} \
--OUTPUT ${output_filename}
>>>

output {
File output_vcf = "${output_filename}"
File output_vcf_index = "${output_filename}.tbi"
}
}


#### Script for Joint Discovery ####

workflow JointGenotyping {
File intervals_file

String callset_name
#Boolean ready
File ref_fasta
File ref_fasta_index
File ref_dict

#File dbsnp_vcf
#File dbsnp_vcf_index

#Array[String] sample_names
#Array[File] input_gvcfs
#Array[File] input_gvcfs_indices
File cohort_sample_map

#File eval_interval_list
#File dbsnp_resource_vcf = dbsnp_vcf
#File dbsnp_resource_vcf_index = dbsnp_vcf_index

#String sample_num = basename(intervals_file, ".intervals")
String output_filename = callset_name

# ExcessHet is a phred-scaled p-value. We want a cutoff of anything more extreme
# than a z-score of -4.5 which is a p-value of 3.4e-06, which phred-scaled is 54.69
Float excess_het_threshold = 54.69


#Int num_of_original_intervals = length(read_lines(unpadded_intervals_file))

#Array[String] unpadded_intervals = read_lines(unpadded_intervals_file)

#scatter (idx in range(length(unpadded_intervals))) {
# the batch_size value was carefully chosen here as it
# is the optimal value for the amount of memory allocated
# within the task; please do not change it without consulting
# the Hellbender (GATK engine) team!
call ImportGVCFs {
input:
#sample_names = sample_names,
interval = intervals_file,
workspace_dir_name = "genomicsdb",
inputs_samples = cohort_sample_map,
batch_size = 5
}

call GenotypeGVCFs {
input:
workspace_tar = ImportGVCFs.output_genomicsdb,
interval = intervals_file,
output_vcf_filename = "output.vcf.gz",
ref_fasta = ref_fasta,
ref_fasta_index = ref_fasta_index,
ref_dict = ref_dict,
#dbsnp_vcf = dbsnp_vcf,
#dbsnp_vcf_index = dbsnp_vcf_index
}

call HardFilterAndMakeSitesOnlyVcf {
input:
vcf = GenotypeGVCFs.output_vcf,
vcf_index = GenotypeGVCFs.output_vcf_index,
excess_het_threshold = excess_het_threshold,
variant_filtered_vcf_filename = output_filename + ".variant_filtered.vcf.gz"
#sites_only_vcf_filename = output_filename + ".sites_only.variant_filtered.vcf.gz"
}


call GatherVcfs as SitesOnlyGatherVcf {
input:
input_vcfs_fofn = HardFilterAndMakeSitesOnlyVcf.variant_filtered_vcf,
input_vcf_indexes_fofn = HardFilterAndMakeSitesOnlyVcf.variant_filtered_vcf_index,
output_vcf_name = output_filename + ".vcf.gz"
}

output {
# outputs from the small callset path through the wdl

File prepared_vcf = SitesOnlyGatherVcf.output_vcf
File prepared_vcf_index = SitesOnlyGatherVcf.output_vcf_index


Boolean done = true

}
}

task GetNumberOfSamples {
File sample_name_map
#String docker
command <<<
wc -l ${sample_name_map} | awk '{print $1}'
>>>
output {
Int sample_count = read_int(stdout())
}
}

task ImportGVCFs {
#Array[String] sample_names
#Array[File] input_gvcfs
#Array[File] input_gvcfs_indices
File inputs_samples
String interval

String workspace_dir_name
Int batch_size

command <<<
# The memory setting here is very important and must be several GB lower
# than the total memory allocated to the VM because this tool uses
# a significant amount of non-heap memory for native libraries.
# Also, testing has shown that the multithreaded reader initialization
# does not scale well beyond 5 threads, so don't increase beyond that.
gatk --java-options "-Xmx7g -Xms7g -XX:ConcGCThreads=1" \
GenomicsDBImport \
--genomicsdb-workspace-path ${workspace_dir_name} \
--batch-size ${batch_size} \
-L ${interval} \
--sample-name-map ${inputs_samples}

tar -cf ${workspace_dir_name}.tar ${workspace_dir_name}

>>>
output {
File output_genomicsdb = "${workspace_dir_name}.tar"
}
}

task GenotypeGVCFs {
File workspace_tar
String interval

String output_vcf_filename

File ref_fasta
File ref_fasta_index
File ref_dict

#File dbsnp_vcf
#File dbsnp_vcf_index

command <<<
set -e

tar -xf ${workspace_tar}
WORKSPACE=$( basename ${workspace_tar} .tar)

gatk --java-options "-Xmx7g -Xms7g -XX:ConcGCThreads=1" \
GenotypeGVCFs \
-R ${ref_fasta} \
-O ${output_vcf_filename} \
-G StandardAnnotation \
--only-output-calls-starting-in-intervals \
--use-new-qual-calculator \
-V gendb://$WORKSPACE \
-L ${interval}
>>>
output {
File output_vcf = "${output_vcf_filename}"
File output_vcf_index = "${output_vcf_filename}.tbi"
}
}

task HardFilterAndMakeSitesOnlyVcf {
File vcf
File vcf_index
Float excess_het_threshold

String variant_filtered_vcf_filename
#String sites_only_vcf_filename
#String gatk_path

#String docker
#Int disk_size

command {
set -e

gatk --java-options "-Xmx32g -Xms32g -XX:ConcGCThreads=1" \
VariantFiltration \
--filter-expression "ExcessHet > ${excess_het_threshold}" \
--filter-name ExcessHet \
-O ${variant_filtered_vcf_filename} \
-V ${vcf}
}
output {
File variant_filtered_vcf = "${variant_filtered_vcf_filename}"
File variant_filtered_vcf_index = "${variant_filtered_vcf_filename}.tbi"
}
}

task GatherVcfs {
File input_vcfs_fofn
File input_vcf_indexes_fofn
String output_vcf_name


command <<<
set -e
set -o pipefail

# ignoreSafetyChecks makes a big performance difference, so we include it in our invocation
gatk --java-options "-Xmx60g -Xms60g -XX:ConcGCThreads=1" \
GatherVcfsCloud \
--ignore-safety-checks \
--gather-type BLOCK \
--input ${input_vcfs_fofn} \
--output ${output_vcf_name}

gatk --java-options "-Xmx60g -Xms60g -XX:ConcGCThreads=1" \
IndexFeatureFile \
--feature-file ${output_vcf_name}
>>>
output {
File output_vcf = "${output_vcf_name}"
File output_vcf_index = "${output_vcf_name}.tbi"
}
}


#### interval files ####
# A.intervals
HiC_scaffold_10:30000001-30097849
HiC_scaffold_11:1-24811943

# B.intervals
HiC_scaffold_10:1-30000000

# C.intervals
HiC_scaffold_4888:1-2392
HiC_scaffold_4889:1-2392
HiC_scaffold_4890:1-2389
HiC_scaffold_4891:1-2388
HiC_scaffold_4892:1-2387
HiC_scaffold_4893:1-2387
HiC_scaffold_4894:1-2386
HiC_scaffold_4895:1-2385
HiC_scaffold_4896:1-2385
HiC_scaffold_4897:1-2384
HiC_scaffold_4898:1-2384
HiC_scaffold_4899:1-2384
HiC_scaffold_4900:1-2384
HiC_scaffold_4901:1-2383
HiC_scaffold_4902:1-2382
HiC_scaffold_4903:1-2382
HiC_scaffold_4904:1-2381
HiC_scaffold_4905:1-2381
HiC_scaffold_4906:1-2381
HiC_scaffold_4907:1-2380
HiC_scaffold_4908:1-2380
HiC_scaffold_4909:1-2380
HiC_scaffold_4910:1-2380
HiC_scaffold_4911:1-2380

Discovering singletons with GenotypeGVCFs?

Hi,

I have several samples on which I ran HaplotypeCaller (in normal mode), and I am looking to discover germline variants from them. I read that GenotypeGVCFs isn't good at discovering singletons, and it is likely that there will be many singletons in my samples. Does anyone have a solution to this? I was planning on running GenotypeGVCFs on each sample individually so as to prevent singletons from being lost.

MarkDuplicates policy on reads that are unaligned/unmapped?

Our bams are created according to the "lossless" alignment procedure described in this article. The procedure involves mixing unaligned and aligned reads with Picard's MergeBamAlignment, so they contain both mapped and unmapped reads. These bams are then sorted with SortSam, so that the sort order in the header becomes:

@HD VN:1.6 SO:coordinate

On such bams, is there any special sort order that should be specified with MarkDuplicates to reduce memory usage or speed up processing? Can you recommend --ASSUME_SORT_ORDER X? It's not clear from the documentation how MarkDuplicates handles reads that don't have reliable position information in the bam.
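For concreteness, this is the invocation I have in mind (file names are placeholders):

```
gatk MarkDuplicates \
    --INPUT merged.sorted.bam \
    --OUTPUT marked.bam \
    --METRICS_FILE duplicate_metrics.txt \
    --ASSUME_SORT_ORDER coordinate
```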
