
(How to) Run the Pathseq pipeline


Beta tutorial. Please report any issues in the comments section.

Overview

PathSeq is a GATK pipeline for detecting microbial organisms in short-read deep sequencing samples taken from a host organism (e.g. human). The diagram below summarizes how it works. In brief, the pipeline performs read quality filtering, subtracts reads derived from the host, aligns the remaining (non-host) reads to a reference of microbe genomes, and generates a table of detected microbial organisms. The results can be used to determine the presence and abundance of microbial organisms as well as to discover novel microbial sequences.


PathSeq pipeline diagram. Boxes outlined with dashed lines represent files. The green boxes at the top depict the three phases of the pipeline: read quality filtering / host subtraction, microbe alignment, and taxonomic abundance scoring. The blue boxes show tools used for pre-processing the host and microbe references for use with PathSeq.

Tutorial outline

This tutorial describes:

  • How to run the full PathSeq pipeline on a simulated mixture of human and E. coli reads using pre-built small-scale reference files
  • How to prepare custom host and microbe reference files for use with PathSeq

A more detailed introduction of the pipeline can be found in the PathSeqPipelineSpark tool documentation. For more information about the other tools, see the Metagenomics section of the GATK documentation.

How to obtain reference files

Host and microbe references must be prepared for PathSeq as described in this tutorial. The tutorial files provided below contain references that are designed specifically for this tutorial and should not be used in practice. Users can download recommended pre-built reference files for use with PathSeq from the GATK Resource Bundle FTP server in /bundle/pathseq/ (see readme file). This tutorial also covers how to build custom host and microbe references.

Tutorial Requirements

The PathSeq tools are bundled with the GATK 4 release. For the most up-to-date GATK installation instructions, please see https://github.com/broadinstitute/gatk. This tutorial assumes you are using a POSIX (e.g. Linux or MacOS) operating system with at least 2 GB of memory.

Obtain tutorial files

Download tutorial_10913.tar.gz from the ftp site. Extract the archive with the command:

> tar xzvf pathseq_tutorial.tar.gz
> cd pathseq_tutorial

You should now have the following files in your current directory:

  • test_sample.bam : simulated sample of 3M paired-end 151-bp reads from human and E. coli
  • hg19mini.fasta : human reference sequences (indexed)
  • hg19mini.fasta.img : PathSeq BWA-MEM index image of the host reference (used with --filter-bwa-image)
  • hg19mini.hss : PathSeq host k-mer file (used with --kmer-file)
  • e_coli_k12.fasta : E. coli reference sequences (indexed)
  • e_coli_k12.fasta.img : PathSeq BWA-MEM index image
  • e_coli_k12.db : PathSeq taxonomy file

Run the PathSeq pipeline

The pipeline accepts reads in BAM format (if you have FASTQ files, please see this article on how to convert to BAM). In this example, the pipeline can be run using the following command:

> gatk PathSeqPipelineSpark \
    --input test_sample.bam \
    --filter-bwa-image hg19mini.fasta.img \
    --kmer-file hg19mini.hss \
    --min-clipped-read-length 70 \
    --microbe-fasta e_coli_k12.fasta \
    --microbe-bwa-image e_coli_k12.fasta.img \
    --taxonomy-file e_coli_k12.db \
    --output output.pathseq.bam \
    --scores-output output.pathseq.txt

This ran in 2 minutes on a Macbook Pro with a 2.8GHz Quad-core CPU and 16 GB of RAM. If running on a local workstation, users can monitor the progress of the pipeline through a web browser at http://localhost:4040.

Interpreting the output

The PathSeq output files are:

  • output.pathseq.bam : contains all high-quality non-host reads aligned to the microbe reference. The YP read tag lists the NCBI taxonomy IDs of any aligned species meeting the alignment identity criteria (see the --min-score-identity and --identity-margin parameters). This tag is omitted if the read was not successfully mapped, which may indicate the presence of organisms not represented in the microbe database.
  • output.pathseq.txt : a tab-delimited table of the input sample’s microbial composition. This can be imported into Excel and organized by selecting Data -> Filter from the menu:
tax_id taxonomy type name kingdom score score_normalized reads unambiguous reference_length
1 root root root root 189580 100 189580 189580 0
131567 root cellular_organisms no_rank cellular_organisms root 189580 100 189580 189580 0
2 ... cellular_organisms Bacteria superkingdom Bacteria Bacteria 189580 100 189580 189580 0
1224 ... Proteobacteria phylum Proteobacteria Bacteria 189580 100 189580 189580 0
1236 ... Proteobacteria Gammaproteobacteria class Gammaproteobacteria Bacteria 189580 100 189580 189580 0
91347 ... Gammaproteobacteria Enterobacterales order Enterobacterales Bacteria 189580 100 189580 189580 0
543 ... Enterobacterales Enterobacteriaceae family Enterobacteriaceae Bacteria 189580 100 189580 189580 0
561 ... Enterobacteriaceae Escherichia genus Escherichia Bacteria 189580 100 189580 189580 0
562 ... Escherichia Escherichia_coli species Escherichia_coli Bacteria 189580 100 189580 189580 0
83333 ... Escherichia_coli Escherichia_coli_K-12 no_rank Escherichia_coli_K-12 Bacteria 189580 100 189580 189580 0
511145 ... Escherichia_coli_str._K-12_substr._MG1655 no_rank Escherichia_coli_str._K-12_substr._MG1655 Bacteria 189580 100 189580 189580 4641652

Each line provides information for a single node in the taxonomic tree. A "root" node corresponding to the top of the tree is always listed. Columns to the right of the taxonomic information are:

  • score : indicates the amount of evidence that this taxon is present, based on the number of reads that aligned to references in this taxon. This takes into account uncertainty due to ambiguously mapped reads by dividing their weight across each possible hit. It is also normalized by genome length.
  • score_normalized : the same as score, but normalized to sum to 100 within each kingdom.
  • reads : number of mapped reads (ambiguous or unambiguous)
  • unambiguous : number of unambiguously mapped reads
  • reference_length : reference length (in bases) if there is a reference assigned to this taxon. Unlike scores, this number is not propagated up the tree, i.e. it is 0 if there is no reference corresponding directly to the taxon. In the above example, the MG1655 strain reference length is only shown in the strain row (4,641,652 bases).

In this example, one can see that PathSeq detected 189,580 reads that mapped to the strain reference for E. coli K-12 MG1655. This read count is propagated up the tree (species, genus, family, etc.) to the root node. If other species were present, their read counts would be listed and added to their corresponding ancestral taxonomic classes.
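For a quick look at the top-scoring taxa without opening Excel, the table can also be sorted by its score column on the command line (a minimal sketch using standard UNIX tools; in the output above, score is column 6):

> tail -n +2 output.pathseq.txt | sort -t$'\t' -k6,6gr | head

This drops the header line, sorts the remaining rows in descending order of score, and prints the top entries.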

Microbe discovery

PathSeq can also be used to discover novel microorganisms by analyzing the unmapped reads, e.g. using BLAST or de novo assembly. To get the number of non-host (microbe plus unmapped) reads use the samtools view command:

> samtools view -c output.pathseq.bam
189580

Since the reported number of E. coli reads is the same as the number of reads in the output BAM, there are 0 reads of unknown origin in this sample.
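If the counts had differed, the leftover reads of unknown origin could be extracted and converted to FASTA for BLAST or de novo assembly (a minimal sketch using samtools; output file names are illustrative):

> samtools view -b -f 4 output.pathseq.bam > unmapped.bam
> samtools fasta unmapped.bam > unmapped.fasta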

Preparing Custom Reference Files

Custom host and microbe references must both be prepared for use with PathSeq. The references should be supplied as FASTA files with proper indices and sequence dictionaries. The host reference is used to build a BWA-MEM index image and a k-mer file. The microbe reference is used to build another BWA-MEM index image and a taxonomy file. Here we assume you are starting with the FASTA reference files that have been properly indexed:

  • host.fasta : your custom host reference sequences
  • microbe.fasta : your custom microbe reference sequences

Build the host and microbe BWA index images

The BWA index images must be built using BwaMemIndexImageCreator:

> gatk BwaMemIndexImageCreator -I host.fasta
> gatk BwaMemIndexImageCreator -I microbe.fasta

Generate the host k-mer library file

The PathSeqBuildKmers tool creates a library of k-mers from a host reference FASTA file. Create a hash set of all k-mers in the host reference with the following command:

> gatk PathSeqBuildKmers \
--reference host.fasta \
-O host.hss

Build the taxonomy file

Download the latest RefSeq accession catalog RefSeq-releaseXX.catalog.gz, where XX is the latest RefSeq release number, at:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/
Download the NCBI taxonomy data dump (no need to extract the archive):
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
Assuming these files are now in your current working directory, build the taxonomy file using PathSeqBuildReferenceTaxonomy:

> gatk PathSeqBuildReferenceTaxonomy \
-R microbe.fasta \
--refseq-catalog RefSeq-releaseXX.catalog.gz \
--tax-dump taxdump.tar.gz \
-O microbe.db

Example reference build script

The preceding instructions can be conveniently executed with the following bash script:

#!/bin/bash
set -eu
GATK_HOME=/path/to/gatk
REFSEQ_CATALOG=/path/to/RefSeq-releaseXX.catalog.gz
TAXDUMP=/path/to/taxdump.tar.gz

echo "Building pathogen reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I microbe.fasta
$GATK_HOME/gatk PathSeqBuildReferenceTaxonomy -R microbe.fasta --refseq-catalog $REFSEQ_CATALOG --tax-dump $TAXDUMP -O microbe.db

echo "Building host reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I host.fasta
$GATK_HOME/gatk PathSeqBuildKmers --reference host.fasta -O host.hss

Troubleshooting

Java heap out of memory error

Increase the Java heap limit. For example, to increase the limit to 4GB with the --java-options flag:

> gatk --java-options "-Xmx4G" ... 

This should generally be set to a value greater than the combined size of all reference files.
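For example, the combined size of the reference files used in this tutorial can be checked with du (a quick sketch; substitute your own reference file names):

> du -ch hg19mini.fasta.img hg19mini.hss e_coli_k12.fasta e_coli_k12.fasta.img e_coli_k12.db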

The output is empty

The input reads must pass an initial validity filter, WellFormedReadFilter. A common cause of empty output is that the input reads do not pass this filter, often because none of the reads have been assigned to a read group (with an RG tag). For instructions on adding read groups, see this article, but note that PathSeqPipelineSpark and PathSeqFilterSpark do not require the input BAM to be sorted or indexed.
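For example, a read group can be added with Picard's AddOrReplaceReadGroups, which is bundled with GATK 4 (a sketch only; the file names and read-group values are placeholders to replace with values appropriate to your data):

> gatk AddOrReplaceReadGroups \
    --INPUT reads_without_rg.bam \
    --OUTPUT reads_with_rg.bam \
    --RGID group1 \
    --RGLB lib1 \
    --RGPL ILLUMINA \
    --RGPU unit1 \
    --RGSM sample1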


File not found while executing CollectHsMetrics on GATK 4.1.2.0


I am executing this command

time gatk CollectHsMetrics -I test.bam -O hs_metrics.txt  -R genome.fa -BAIT_INTERVALS Target_bait.bed.interval.list -TARGET_INTERVALS Exome_Target_hg19.interval_list 

I am getting this error

Using GATK jar /home/bioinfo/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/bioinfo/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar CollectHsMetrics -I test.bam -O hs_metrics.txt -R genome.fa -BAIT_INTERVALS Target_bait.bed.interval.list -TARGET_INTERVALS Exome_Target_hg19.interval_list
19:05:20.489 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/bioinfo/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Oct 07 19:05:20 IST 2019] CollectHsMetrics  --BAIT_INTERVALS Target_bait.bed.interval.list --TARGET_INTERVALS Exome_Target_hg19.interval_list --INPUT test.bam --OUTPUT hs_metrics.txt --REFERENCE_SEQUENCE genome.fa  --METRIC_ACCUMULATION_LEVEL ALL_READS --NEAR_DISTANCE 250 --MINIMUM_MAPPING_QUALITY 20 --MINIMUM_BASE_QUALITY 20 --CLIP_OVERLAPPING_READS true --INCLUDE_INDELS false --COVERAGE_CAP 200 --SAMPLE_SIZE 10000 --ALLELE_FRACTION 0.001 --ALLELE_FRACTION 0.005 --ALLELE_FRACTION 0.01 --ALLELE_FRACTION 0.02 --ALLELE_FRACTION 0.05 --ALLELE_FRACTION 0.1 --ALLELE_FRACTION 0.2 --ALLELE_FRACTION 0.3 --ALLELE_FRACTION 0.5 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Oct 07, 2019 7:05:22 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Mon Oct 07 19:05:22 IST 2019] Executing as bioinfo@bioinfo-pc on Linux 4.15.0-64-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.2.0
[Mon Oct 07 19:05:22 IST 2019] picard.analysis.directed.CollectHsMetrics done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=726138880
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Cannot read non-existent file: file:///media/bioinfo/@HD%09VN:1.6%09SO:coordinate
    at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:405)
    at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:392)
    at picard.analysis.directed.CollectHsMetrics.getProbeIntervals(CollectHsMetrics.java:146)
    at picard.analysis.directed.CollectTargetedMetrics.doWork(CollectTargetedMetrics.java:129)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)

real    0m3.631s
user    0m8.760s
sys 0m0.307s

Failures running VariantRecalibrator


We want to run joint germline calling on a set of 122 WES BRCA normal hg19 BAMs from the CPTAC 3 project. We are using the GATK4 workflows showcased in the Terra workspace https://app.terra.bio/#workspaces/help-gatk/Germline-SNPs-Indels-GATK4-b37. We are starting with data that has already been aligned to hg19, so of the three workflows in the showcase workspace, we are running two: haplotypecaller-gvcf-gatk4 and joint-discovery-gatk4. We are encountering problems with the joint-discovery-gatk4 workflow, in particular in the running of the VariantRecalibrator task. Initially, we are just running on 3 sample gvcfs, recognizing that you need a minimum of 30 exome samples, just to ensure we can run the pipeline. We are using gatk4 v4.1.2.0.

We are getting more or less the same error for both instances of the VariantRecalibrator task...

task instance: JointGenotyping.SNPsVariantRecalibratorClassic:

A USER ERROR has occurred: Couldn't read file file:///cromwell_root/hapmap,known=false,training=true,truth=true,prior=15:/cromwell_root/broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz. Error was: It doesn't exist.

task instance: JointGenotyping.IndelsVariantRecalibrator:

A USER ERROR has occurred: Couldn't read file file:///cromwell_root/mills,known=false,training=true,truth=true,prior=12:/cromwell_root/broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.sites.vcf. Error was: It doesn't exist.

Here is the java command line (from the task log file in Terra):

Using GATK jar /gatk/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx24g -Xms24g -jar /gatk/gatk-package-4.1.2.0-local.jar VariantRecalibrator -V /cromwell_root/fc-secure-823808d0-5404-49c9-990f-b3d9e353e468/02fdb905-0a50-47d5-9a0b-8abb8d0a9636/JointGenotyping/71c87f4b-5e0e-40bc-9b61-71a5e52ac82a/call-SitesOnlyGatherVcf/CBB_Test.sites_only.vcf.gz -O CBB_Test.indels.recal --tranches-file CBB_Test.indels.tranches --trust-all-polymorphic -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 -an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP -mode INDEL --max-gaussians 4 -resource mills,known=false,training=true,truth=true,prior=12:/cromwell_root/broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.sites.vcf -resource axiomPoly,known=false,training=true,truth=false,prior=10:/cromwell_root/broad-references/hg19/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=2:/cromwell_root/broad-references/hg19/v0/dbsnp_138.b37.vcf.gz

The problem is clearly with the attributes that prepend the -resource input parameter... they are being interpreted as part of the filename by gatk4.
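For reference, GATK 4.1.x expects the resource attributes to be attached to the argument name after a colon, with the file path alone as the argument value. A sketch of the expected form for one of the resources (other arguments elided, path shortened):

> gatk VariantRecalibrator \
    ... \
    --resource:mills,known=false,training=true,truth=true,prior=12 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
    ...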

Issue with running CNV germline workflows using "cnv_germline_cohort_workflow.wdl"


Greetings GATK team,

I am trying to run CNV germline workflows using "cnv_germline_cohort_workflow.wdl". I'm using the examples provided on GitHub.
However, after the step PostprocessGermlineCNVCalls I got the following error:

22:37:00.588 INFO  PostprocessGermlineCNVCalls - Shutting down engine
[October 7, 2019 10:37:00 PM UTC] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 1.67 minutes.
Runtime.totalMemory()=2294808576
Exception in thread "Thread-1" htsjdk.samtools.util.RuntimeIOException: java.nio.file.DirectoryNotEmptyException: /cromwell-executions/CNVGermlineCohortWorkflow/208592f4-6e9d-403d-99bf-1066a8a82cd7/call-PostprocessGermlineCNVCalls/shard-0/tmp.c77022dd/gcnv-segmented-calls2725641956355372386/SAMPLE_0
        at htsjdk.samtools.util.IOUtil.recursiveDelete(IOUtil.java:1346)
        at org.broadinstitute.hellbender.utils.io.IOUtils.deleteRecursively(IOUtils.java:1061)
        at org.broadinstitute.hellbender.utils.io.DeleteRecursivelyOnExitPathHook.runHooks(DeleteRecursivelyOnExitPathHook.java:56)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.DirectoryNotEmptyException: /cromwell-executions/CNVGermlineCohortWorkflow/208592f4-6e9d-403d-99bf-1066a8a82cd7/call-PostprocessGermlineCNVCalls/shard-0/tmp.c77022dd/gcnv-segmented-calls2725641956355372386/SAMPLE_0
        at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242)
        at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
        at java.nio.file.Files.deleteIfExists(Files.java:1165)
        at htsjdk.samtools.util.IOUtil$3.postVisitDirectory(IOUtil.java:1338)
        at htsjdk.samtools.util.IOUtil$3.postVisitDirectory(IOUtil.java:1327)
        at java.nio.file.Files.walkFileTree(Files.java:2688)
        at java.nio.file.Files.walkFileTree(Files.java:2742)
        at htsjdk.samtools.util.IOUtil.recursiveDelete(IOUtil.java:1344)
        ... 3 more
Using GATK jar /cromwell-executions/CNVGermlineCohortWorkflow/208592f4-6e9d-403d-99bf-1066a8a82cd7/call-PostprocessGermlineCNVCalls/shard-0/inputs/-25482348/gatk-package-4.1.2.0-local.jar defined in environment variable GATK_LOCAL_JAR
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx6000m -jar /cromwell-executions/CNVGermlineCohortWorkflow/208592f4-6e9d-403d-99bf-1066a8a82cd7/call-PostprocessGermlineCNVCalls/shard-0/inputs/-25482348/gatk-package-4.1.2.0-local.jar PostprocessGermlineCNVCalls --calls-shard-path CALLS_0 --model-shard-path MODEL_0 --allosomal-contig X --allosomal-contig Y --autosomal-ref-copy-number 2 --contig-ploidy-calls contig-ploidy-calls --sample-index 0 --output-genotyped-intervals genotyped-intervals-SM-74NEG_20xy-downsampled.vcf.gz --output-genotyped-segments genotyped-segments-SM-74NEG_20xy-downsampled.vcf.gz
/encrypted/genomics/nationalguard/new_cnv_azza/final_workflow/gatk/scripts/cnv_wdl/germline/workdir/CNVGermlineCohortWorkflow/208592f4-6e9d-403d-99bf-1066a8a82cd7/call-PostprocessGermlineCNVCalls/shard-0/execution/script: line 68: --output-denoised-copy-ratios: command not found

I've been struggling to understand why this is happening, and already investigated and tested many things, but I cannot seem to find an explanation. Any advice is really appreciated.

Looking forward to your answer,

Best Regards,
Azza.

New! Mitochondrial Analysis with Mutect2


Overcoming barriers to understanding the mitochondrial genome

Announcing a brand new “Best Practices” pipeline for calling SNPs and INDELs in the mitochondrial genome! Calling low VAF (variant allele fraction) alleles in the mitochondrial genome presents special problems, but comes with great rewards, including diagnosing rare diseases and identifying asymptomatic carriers of pathogenic variants. We’re excited to begin using this pipeline on tens of thousands of diverse samples from the gnomAD project (http://gnomad.broadinstitute.org/about) to gain greater understanding of population genetics from the perspective of mitochondrial DNA.


Mitochondrial genome - a history of challenges

We had often been advised to “try using a somatic caller,” since we expect mitochondria to have variable allele fraction variants, but we had never actually tried it ourselves. Over the past year we focused on creating truth data for low allele fraction variants on the mitochondria and developing a production-quality, high-throughput pipeline that overcomes the unique challenges that calling SNPs and INDELs on the mitochondria presents.

See below the four challenges to unlocking the mitochondrial genome and how we’ve improved our pipeline to overcome them.

1. Mitochondria have a circular genome

Though the genome is linearized in the typical references we use, the breakpoint is artificial -- purely for the sake of bioinformatic convenience. Since the breakpoint is inside the “control region”, which is non-coding but highly variable across people, we want to be sensitive to variation in that region, to capture the most genetic diversity.

2. A pushy genome makes for difficult mapping

The mitochondrial genome has inserted itself into the autosomal genome many times throughout human evolution - and continues to do so. These regions in the autosomal genome, called Nuclear Mitochondrial DNA segments (NuMTs), make mapping difficult: if the sequences are identical, it’s hard to know if a read belongs in an autosomal NuMT or the mitochondrial contig.

3. Most mitochondria are normal

Variation in the mitochondria can have very low heteroplasmy. In fact, the variation “signal” can be comparable to the inherent sequencer noise, but the scientific community tasked us with calling 1% allele fraction sites with as much accuracy as we can. Our pipeline achieves 99% sensitivity at 5% VAF at depths greater than 1000. With depth in the thousands or tens of thousands of reads for most whole genome mitochondrial samples, it should be possible to call most 1% allele fraction sites with high confidence.

4. High depth coverage is a blessing… and a curse

The mitochondrial contig typically has extremely high depth in whole genome sequence data:
around 2000x for a typical blood sample compared to autosomes (typically ~30x coverage). Samples from mitochondria-rich tissues like heart and muscle have even higher depth (e.g. 80,000x coverage). This depth is a blessing for calling low-allele fraction sites with confidence, but can overwhelm computations that use algorithms not designed to handle the large amounts of data that come with this extreme depth.

Solving a geometry problem by realigning twice

We’ve solved the first problem by extracting reads that align to carefully selected NuMT regions and the mitochondria itself, from a whole genome sample. We take these aligned, recalibrated reads and realign them twice: once to the canonical mitochondria reference, and once to a “rotated” mitochondria reference that moves the breakpoint from the control region to the opposite side of the circular contig.

To help filter out NuMTs, we mark reads with their original alignment position before realigning to the mitochondrial contig. Then we use Mutect2 filters tuned to the high depth we expect from the mitochondria, by running Mutect2 in “--mitochondria-mode”. We increase accuracy on the “breakpoint” location by calling only the non-control region on the original mitochondria reference, and call the control region on the shifted reference (now in the middle of the linearized chromosome). We then shift the calls on the rotated reference back to the original mitochondrial reference and merge the VCFs.
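For illustration, the calling step on each reference boils down to running Mutect2 in mitochondria mode restricted to the mitochondrial contig (a simplified sketch; file names are placeholders, and the full pipeline in the WDL linked below also handles the shifted reference, realignment and filtering):

> gatk Mutect2 \
    -R chrM.fasta \
    -I sample.chrM.bam \
    -L chrM \
    --mitochondria-mode \
    -O sample.chrM.vcf.gz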

Adaptive pruning for high sensitivity and precision

The rest of the problems benefited from recent improvements to the local assembly code that Mutect2 shares with HaplotypeCaller (HC). Incorporating the new adaptive pruning strategy in the latest version of Mutect2 will improve sensitivity and precision for samples with varying depth across the mitochondrial reference, and enable us to adapt our pipeline to exome and RNA samples. See the blog post on the newest version of Mutect2 here.

Unblocking genetic bottlenecks with a new pipeline

The new pipeline’s high sensitivity to low allele fraction variants is especially powerful since low AF variants may be at higher AF in other tissue.

Our pipeline harnesses the power of low AFs to help:

1. Diagnose rare diseases

Mutations can be at high allele fraction in affected tissues but low in blood samples typically used for genetic testing.

2. Identify asymptomatic carriers of pathogenic variants

If you carry a pathogenic allele even at low VAF, you can pass this along at high VAF to your offspring.

3. Discover somatic variants in tissues or cell lineages

For example, studies have used rare somatic mtDNA variants for lineage tracing in single-cell RNA-seq studies.

You can find the WDLs used to run this pipeline in the GATK repo under the scripts directory (https://github.com/broadinstitute/gatk/blob/master/scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl). Keep an eye out for an official “Best Practices” pipeline, coming soon in the gatk-workflows repo and in Firecloud.

Caveat: We're not so confident in calls under 5% AF (due to false positives from NuMTs). We're working on a longer-term fix for this for a future release.

GATK4 SplitNCigarReads: file size of bam file is "0"

Hi, I ran the following command for my RNA-Seq data. No error was shown, but for some of the samples the size of the split.bam file produced by SplitNCigarReads was 0. Could you please help me check?

the command line is:
~gatk SplitNCigarReads \
-R ${reference} \
-I ${outputdir}/~/${SM}.dedupped.bam \
-O ${outputdir}/~/${SM}.split.bam \
-RMQF 255 \
-RMQT 60 \
-U ALLOW_N_CIGAR_READS

How to get metadata reference?

Hi, I have recently installed Genome STRiP. I want to ask how I can get the metadata reference files (GC mask, SV mask and ploidy map) for my data set. The reference genome I am using is HG38/B38.

VCF - Variant Call Format


This document describes "regular" VCF files produced for GERMLINE short variant (SNP and indel) calls (e.g. by HaplotypeCaller in "normal" mode and by GenotypeGVCFs). For information on the special kind of VCF called GVCF produced by HaplotypeCaller in -ERC GVCF mode, please see the GVCF entry. For information specific to SOMATIC calls, see the Mutect2 documentation.


Contents

  1. Overview
  2. Structure of a VCF file
  3. Interpreting the header information
  4. Structure of variant call records
  5. Interpreting genotype and other sample-level information
  6. Basic operations: validating, subsetting and exporting from a VCF
  7. Merging VCF files

1. Overview

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and further development has been taken over by the Genomic Data Toolkit team of the Global Alliance for Genomics and Health. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specifications like SAM/BAM/CRAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.


2. Structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

image

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also includes the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so, as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
20  10001019    .   T   G   364.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=0.699;ClippingRankSum=0.00;DP=34;ExcessHet=3.0103;FS=3.064;MLEAC=1;MLEAF=0.500;MQ=42.48;MQRankSum=-3.219e+00;QD=11.05;ReadPosRankSum=-6.450e-01;SOR=0.537   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480
20  10001298    .   T   A   884.77  .   AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.49;SOR=1.765    GT:AD:DP:GQ:PL  1/1:0,30:30:89:913,89,0
20  10001436    .   A   AAGGCT  1222.73 .   AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=0.836    GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0
20  10001474    .   C   T   843.77  .   AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=1.302    GT:AD:DP:GQ:PL  1/1:0,27:27:81:872,81,0
20  10001617    .   C   A   493.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=1.63;ClippingRankSum=0.00;DP=38;ExcessHet=3.0103;FS=1.323;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=12.99;ReadPosRankSum=0.170;SOR=1.179   GT:AD:DP:GQ:PL  0/1:19,19:38:99:522,0,480

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs and indels, but other variation types could be described (see the VCF specification for details). Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.
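If you need to produce a sites-only VCF yourself, for example to share cohort-level variation without the sample genotypes, the Picard tool MakeSitesOnlyVcf bundled with GATK 4 strips the FORMAT and sample columns (a minimal sketch; file names are illustrative):

> gatk MakeSitesOnlyVcf \
    -I cohort.vcf.gz \
    -O cohort.sites_only.vcf.gz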


3. Interpreting the header information

The following is a valid VCF header produced by GenotypeGVCFs on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself.

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.7-0-gcfedb67,Date="Fri Jan 20 11:14:15 EST 2017",Epoch=1484928855435,CommandLineOptions="[command-line goes here]">
##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="[command-line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 5:45:56 PM EST">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=20,length=63025520>
##reference=file:///data/ref/ref.fasta
##source=GenotypeGVCFs

That's a lot of lines, so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

VCF spec version

The first line:

##fileformat=VCFv4.2

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).

FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation (at least if you're using a civilized program that writes definition lines to the header).
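For example, to look up the definition of the QD annotation directly from a compressed VCF (a small sketch using standard UNIX tools):

> zcat calls.vcf.gz | grep '^##INFO=<ID=QD,'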

GATKCommandLine

The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, along with the values that were applied (if you don't pass one, a default is applied); so it's not just the arguments specified explicitly by the user in the command line.

Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input BAM file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for many organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

For more information on genome references, see the corresponding Dictionary entry.


4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

CHROM and POS

The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.

ID

An optional identifier for the variant, based on the contig and position of the call and on whether a record exists at this site in a reference database such as dbSNP. A typical identifier is the dbSNP ID, which in human data would look like rs28548431, for example.

REF and ALT

The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). The REF and ALT alleles are the only required elements of a VCF record that tell us whether the variant is a SNP or an indel (or in complex cases, a mixed-type variant). If we look at the following three sites, we see the first is a SNP, the second is an insertion and the third is a deletion:

20  10001298    .   T   A   884.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,30:30:89:913,89,0
20  10001436    .   A   AAGGCT  1222.73 .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0
20  10004769    .   TAAAACTATGC T   622.73  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,17:35:99:660,0,704

Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

QUAL

The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log10(1-p), a value of 10 indicates a 1 in 10 chance of error, while a value of 100 indicates a 1 in 10^10 chance (see the Dictionary entry). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.

Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

FILTER

This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

INFO

Various site-level annotations. This field is not required to be present in the VCF.

The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by semi-colons, ie ;, as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.


5. Interpreting genotype and other sample-level information

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

20  10001019    .   T   G   364.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480
20  10001298    .   T   A   884.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,30:30:89:913,89,0
20  10001436    .   A   AAGGCT  1222.73 .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0

Looking at that last column, here is what the tags mean:

GT

The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

- 0/0 : the sample is homozygous reference
- 0/1 : the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
- 1/1 : the sample is homozygous alternate

In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, A/A and AAGGCT/AAGGCT respectively. For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT (e.g. 1); for polyploids there will be more, e.g. 4 values for a tetraploid organism (e.g. 0/0/1/1).

AD and DP

Allele depth (AD) and depth of coverage (DP). These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.

AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.

DP is the filtered depth, at the sample level. This gives you the total number of filtered reads at this site for the sample, counting reads supporting any allele. You can check the variant caller’s documentation to see which filters are applied by default; only reads that passed those filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.

See the Tool Documentation on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.

PL

"Normalized" Phred-scaled likelihoods of the possible genotypes. For the typical case of a monomorphic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale. We use "normalized" in quotes because these are not probabilities. We set the most likely genotype PL to 0 for easy reading purpose.The other values are scaled relative to this most likely genotype.

Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.

GQ

The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.

Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.

Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

A few examples

With all the definitions out of the way, let's interpret the genotype information for a few records from our NA12878 callset, starting with at position 10001019 on chromosome 20:

20  10001019    .   T   G   364.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480

At this site, the called genotype is GT = 0/1, which corresponds to a heterozygous genotype with alleles T/G. The confidence indicated by GQ = 99 is very good; there were a total of 33 informative reads at this site (DP=33), 18 of which supported the REF allele (=had the reference base) and 15 of which supported the ALT allele (=had the alternate base) (indicated by AD=18,15). The degree of certainty in our genotype is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele; the next PL is PL(0/0) = 393, corresponding to 10^(-39.3), or 5.0118723e-40 which is a very small number indeed; and the next one will be even smaller. The GQ ends up being 99 because of the capping as explained above.

Now let's look at a site where our confidence is quite a bit lower:

20  10024300    .   C   CTT 43.52   .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:1,4:6:20:73,0,20

Here we have an indel -- specifically an insertion of TT after the reference C base at position 10024300. The called genotype is GT = 0/1 again, but this time the GQ = 20 indicates that even though this is probably a real variant (the QUAL is not too bad), we're not sure we have the right genotype. Looking at the coverage annotations, we see we only had 6 reads there, of which 1 supported REF and 4 supported ALT (and one read must have been considered uninformative, possibly due to quality issues). With so little coverage, we can't be sure that the genotype shouldn't in fact be homozygous variant.

Finally, let's look at a more complicated example:

20  10009875    .   A   G,AGGGAGG   1128.77 .   [CLIPPED]   GT:AD:DP:GQ:PL  1/2:0,11,5:16:99:1157,230,161,487,0,434

This site is a doozy; two credible ALT alleles were observed, but the REF allele was not -- so technically this is a biallelic site in our sample, but will be considered multiallelic because there are more than two alleles notated in the record. It's also a mixed-type record, since one of the ALTs by itself would make it an A->G SNP, and the other would make it an insertion of GGGAGG after the reference A. The called genotype is GT = 1/2, which means it's a heterozygous genotype composed of two different ALT alleles. The coverage wasn't great, and wasn't all that balanced between the two ALTs (since one was supported by 11 reads and the other by 5) but it was sufficient for the program to have high confidence in its call.


6. Basic operations: validating, subsetting and exporting from a VCF

These are a few common things you may want to do with your VCFs that don't deserve their own tutorial. Let us know if there are other operations you think we should cover here.

Validate your VCF

By that I mean check that the format of the file is correct, follows the specification, and will therefore not break any well-behaved tool you choose to run on it. You can do this very simply with ValidateVariants. Note that ValidateVariants can also be used on GVCFs if you use the --gvcf argument.
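A minimal invocation might look like this (a sketch; file names are illustrative, and you would add --gvcf when validating a GVCF):

> gatk ValidateVariants \
    -R reference.fasta \
    -V calls.vcf.gz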

Subset records from your VCF

Sometimes you want to subset just one or a few samples from a big cohort. Sometimes you want to subset to just a genomic region. Sometimes you want to do both at the same time! Well, the same tool can do both, and more; it's called SelectVariants and has a lot of options for doing things like that (including operating over intervals in the usual way). There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name, or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency. This is done using JEXL expressions. Other VCF files can also be used to modify the selection based on concordance or discordance between different callsets (see the --discordance / --concordance arguments in the Tool Doc).
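For example, to extract a single sample over one chromosome while requiring a minimum depth, something along these lines should work (a sketch; the sample name, interval and JEXL expression are illustrative):

> gatk SelectVariants \
    -R reference.fasta \
    -V cohort.vcf.gz \
    --sample-name NA12878 \
    -L 20 \
    -select "DP > 10" \
    -O NA12878.chr20.vcf.gz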

Important notes about subsetting operations

  • In the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset.

  • By default, SelectVariants will keep all ALT alleles, even if they are no longer supported by any samples after subsetting. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. In some cases this will produce monomorphic records, i.e. where no ALT alleles are supported. The tool accepts flags that exclude unsupported alleles and/or monomorphic records from the output.

Extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.
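A typical invocation pulls out whichever site-level (-F) and genotype-level (-GF) fields you need (a sketch; the chosen fields and file names are just examples):

> gatk VariantsToTable \
    -V calls.vcf.gz \
    -F CHROM -F POS -F REF -F ALT -F QUAL \
    -GF GT -GF AD -GF DP -GF GQ \
    -O calls.table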

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal according to the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)


7. Merging VCF files

There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

  1. The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or GVCF files) per-chromosome. For that case, you can use the Picard tool MergeVcfs to merge the files; a minimal example is sketched after this list. See the relevant Tool Doc page for usage details.

  2. The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION mode to call variants on a large cohort, producing many GVCF files. You then need to consolidate them before joint-calling variants with GenotypeGVCFs (for performance reasons). This can be done with either the CombineGVCFs or GenomicsDBImport tools, both of which are specifically designed to handle GVCFs in this way. See the relevant Tool Doc pages for usage details and the Best Practices workflow documentation to learn more about the logic of this workflow.

  3. The third case is when you want to compare variant calls that were produced from the same samples but using different methods. For example, if you're evaluating variant calls produced by different variant callers, different workflows, or the same workflow with different parameters. For this case, we recommend taking a different approach; rather than merging the VCF files (which can have all sorts of complicated consequences), you can use the VariantAnnotator tool to annotate one of the VCFs with the other treated as a resource. See the relevant Tool Doc page for usage details.
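
For case 1, a minimal sketch using MergeVcfs as bundled with GATK4 (the file names are placeholders; the shards must all have been called against the same reference):

> gatk MergeVcfs \
    -I calls.chr1.vcf.gz \
    -I calls.chr2.vcf.gz \
    -I calls.chr3.vcf.gz \
    -O calls.merged.vcf.gz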

There is actually one more reason why you might want to combine variants from different files into one, but we do not recommend doing it: you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.


Which part of my QD plot is the homozygous peak?


Hi there,

Just a quick question, which I think may be of use to people with similarly...squiffy...plots! I've plotted out QD values v Density so as to inform the hard filtering process but I'm having difficulty discerning the expected peaks for heterozygous calls and homozygous calls, as described at https://software.broadinstitute.org/gatk/guide/article?id=6925. As you can see from the attached plot, there is a peak at the lower values (or is it a shoulder?), a tiny bump and then a major peak, but then just a shoulder on the other side of the peak. As filtering effectively (and stringently) is key to my study, I'd like to know what each peak and shoulder represents before I take the plunge if anyone can make an educated guess, please?

Many thanks,

Ian

Annotated intervals do not match provided intervals | CreateReadCountPanelOfNormals

Hello all,

I'm using WDL for somatic CNV, I start with cnv_somatic_panel_workflow.wdl which basically includes 4 tasks/functions:

CNVTasks.PreprocessIntervals (Done)
CNVTasks.AnnotateIntervals (Done)
CNVTasks.CollectCounts (Done)
CreateReadCountPanelOfNormals (Error)

The first 3 work well, but the problem/error is from the last one, which shows me this error:

```
java.lang.IllegalArgumentException: Annotated intervals do not match provided intervals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:724)
at org.broadinstitute.hellbender.tools.copynumber.arguments.CopyNumberArgumentValidationUtils.validateAnnotatedIntervals(CopyNumberArgumentValidationUtils.java:135)
at org.broadinstitute.hellbender.tools.copynumber.CreateReadCountPanelOfNormals.runPipeline(CreateReadCountPanelOfNormals.java:276)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
19/10/08 18:23:25 INFO ShutdownHookManager: Shutdown hook called
19/10/08 18:23:25 INFO ShutdownHookManager: Deleting directory /cromwell-executions/CNVSomaticPanelWorkflow/7481fd6a-e289-4bd6-b195-77b4d45d752e/call-CreateReadCountPanelOfNormals/tmp.16de278a/spark-8a26fda4-9b30-49b0-b795-846caa9a1e35
Using GATK jar /cromwell-executions/CNVSomaticPanelWorkflow/7481fd6a-e289-4bd6-b195-77b4d45d752e/call-CreateReadCountPanelOfNormals/inputs/-1551391607/gatk-package-4.1.2.0-local.jar defined in environment variable GATK_LOCAL_JAR
```

The initial interval file was taken from here: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0 as suggested here: https://gatkforums.broadinstitute.org/gatk/discussion/10215/intervals-and-interval-lists

Any help on how to solve this, please?

VQSR in dog


Hi team! I am working on running a bunch of dogs through GATK variant calling in Terra, as many of you know. I'm grappling with VQSR at this point. I have compiled a bunch of dog variant resources and was hoping you folks would be willing to offer opinions on my current plan for using them for VQSR.

I'm working with the joint-discovery-gatk4 WDL, v13, as acquired from the Terra library. I have removed the human-specific variant resources (hapmap, omni, etc.) and have compiled the following to replace them, in order from most to least confidence:

  • axiom_klab: the intersection of the klab variants (see next entry) with Axiom 1.2 million array variants. These are our highest confidence variants.
  • klab: variants called by Karlsson lab on 20-30X dogs, filtered with hard filters
  • ostrander435: variants called by Ostrander lab on 20-30X dogs, filtered with VQSR (method details unknown, many more variants than the klab compilation so I assume more sensitive / less specific)
  • broad: variants from the Broad track on UCSC
  • axelsson: variants called for Axelsson et al., 2014
  • dogsd: variants downloaded from http://bigd.big.ac.cn/dogsdv2/

I'm putting these into the VariantRecalibrator calls in the WDL as below:

      --resource:axiom_klab,known=false,training=true,truth=true,prior=10 ${axiom_klab_vcf} \
      --resource:klab,known=false,training=true,truth=false,prior=8 ${klab_vcf} \
      --resource:ostrander435,known=false,training=true,truth=false,prior=7 ${ostrander435_vcf} \
      --resource:broad,known=false,training=true,truth=false,prior=5 ${broad_vcf} \
      --resource:axelsson,known=false,training=true,truth=false,prior=5 ${axelsson_vcf} \
      --resource:dogsd,known=true,training=false,truth=false,prior=2 ${dogsd_vcf}

I would love feedback on whether my decisions are generally sensible here or if I am completely missing the boat on how VQSR is supposed to work.

(Howto) Run GATK4 in a Docker container


1. Install Docker

Follow the relevant link below depending on your computer system; on Mac and Windows, select the "Stable channel" download. Run through the installation instructions and initial setup page; they are very straightforward and should only take you a few minutes (not counting download time).
We have included instructions below for all steps after that first page, so you shouldn't need to go to any other pages in the Docker documentation. Frankly their docs are targeted at people who want to do things like run web applications on the cloud and can be quite frustrating to deal with.

Click here for Mac

Click here for Windows

Full list of supported systems and their install pages


2. Get the GATK4 container image

Go to your Terminal (it doesn't matter where your working directory is) and run the following command.

docker pull broadinstitute/gatk:4.beta.6

Note that the last bit after gatk: is the version tag, which you can change to get a different version than what we've specified here.

The GATK4 image is quite large so the download may take a little while if you've never done this before. The good news is that next time you need to pull a GATK4 image (e.g. to get another release), Docker will only pull the components that have been updated, so it will go faster.


3. Start up the GATK4 container

There are several different ways to do this in Docker. Here we're going to use the simplest invocation that gets us the functionality we need, i.e. the ability to log into the container once it's running and execute commands from inside it.

docker run -it broadinstitute/gatk:4.beta.6

If all goes well, this will start up the container in interactive mode, and you will automatically get logged into it. Your terminal prompt will change to something like this:

root@ea3a5218f494:/gatk#

At this point you can use classic shell commands to explore the container and see what's in there, if you like.


4. Run a GATK4 command in the container

The container has the gatk-launch script all set up and ready to go, so you can now run any GATK or Picard command you want. Note that if you want to run a Picard command, you need to use the new syntax, which follows GATK conventions (-I instead of I= and so on). Let's use --list to list all tools available in this version.

./gatk-launch --list

The output will start with a usage message (shown below) then a full list of tools and their summary descriptions.

Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
Running:
    /gatk/build/install/gatk/bin/gatk --help
USAGE:  <program name> [-h]
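
As a side note on the Picard syntax point above, a Picard tool invoked through the container would look something like this in the new GATK-style syntax (the tool choice and file names here are just an illustration, not part of this tutorial):

./gatk-launch SortSam -I my_reads.bam -O my_reads.sorted.bam --SORT_ORDER coordinate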

Once you've verified that this works for you, you know you can run any GATK4 commands you want. But before you proceed, there's one more setup thing to go through, which is technically optional but will make your life much easier.


5. Use a mounted volume to access data that lives outside the container

This is the final piece of the puzzle. By default, when you're inside the container you can't access any data that lives on the filesystem outside of the container. One way to deal with that is to copy things back and forth, but that's wasteful and tedious. So we're going to follow the better path, which is to mount a volume in the container, i.e. establish a link that makes part of the filesystem visible from inside the container.

The hitch is that you can't do this after you started running the container, so you'll have to shut it down and run a new one (not just restart the first one) with an extra part to the command. In case you're wondering why we didn't do this from the get-go, it's because the first command we ran is simpler so there's less chance that something will go wrong, which is nice when you're trying something for the first time.

To shut down your container from inside it, you can just type exit while still inside the container:

exit

That should stop the container and take you back to your regular prompt. It's also possible to exit the container without stopping it (a move called detaching) but that's a matter for another time since here we do want to stop it. You'll probably also want to learn how to clean up and delete old instances of containers that you no longer want.

For now, let's focus on starting a new instance of the GATK4 container, specifying in the following command what is your particular container ID and the filesystem location you want to mount.

docker run -v ~/my_project:/gatk/my_data -it broadinstitute/gatk:4.beta.6

Here I set the external location to be an existing directory called my_project in my home directory (the key requirement is that it has to be an absolute path) and I'm setting the mount point inside the container's /gatk directory. The name of the mount point can be the same as the mount directory, or something completely different; the main constraint is that it should not conflict with an existing directory, otherwise that would make the existing directory inaccessible.

Assuming your paths are valid, this command starts up the container and logs you into it the same way as before; but now you can see by using ls that you have access to your filesystem. So now you can run GATK commands on any data you have lying around. Have fun!
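
As a final illustration, here is a sketch of running a GATK command on the mounted data (the reference and BAM names below are placeholders, not files provided by this tutorial):

./gatk-launch HaplotypeCaller \
    -R /gatk/my_data/reference.fasta \
    -I /gatk/my_data/sample.bam \
    -O /gatk/my_data/sample.g.vcf \
    -ERC GVCF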

File size is largely reduced in MarkIlluminaAdapters step


Hi,

I was doing the data processing steps for raw reads (FASTQ) using two approaches:

1. Merging all the forward reads and all the reverse reads, and using the merged files as input for the subsequent steps  
2. Without merging, using each raw FASTQ file as input for each step   

While doing the MarkIlluminaAdapters step, I observed that the data file size is reduced for the 2nd approach; the size details are as follows:

1. Raw fastq files size (80Gb)  
2. MarkIlluminaAdapters output size: **1st way (merged) 215Gb; 2nd way 179Gb**

But I observed that in the BWA-MEM alignment (1st way (merged): 258Gb; 2nd way: 263Gb), BAM conversion (1st way: 60Gb; 2nd way: 80Gb) and MarkDuplicates (1st way: 59Gb; 2nd way: 60Gb) steps, the data size is approximately retained for BWA and MarkDuplicates, and the size is increased for the 2nd way.

Another thing is that when I did an alignment quality check on both BAM files with samtools flagstat (for the 2nd way, the per-read outputs were merged into a single file for the quality check), both showed 99.65% mapped, but less duplication was observed in the 1st way (merged reads):

1st way duplication: 7197218 + 0 duplicates and 2nd way duplication: 208749 + 0 duplicates

Could you please explain why this large reduction in data size was seen at the MarkIlluminaAdapters step, and why there is this difference in duplication between the two approaches in the alignment quality check?

Intervals and interval lists


Many of our workflow recommendations and example commands involve intervals or lists of intervals, which you can specify in your command line using -L (or -XL to exclude specific intervals).
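
For example, here is a sketch of restricting a run to one list of intervals while excluding another (the tool and file names are placeholders; -L and -XL are accepted by the tools that support intervals):

> gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -L targets.interval_list \
    -XL blacklist.interval_list \
    -O sample.vcf.gz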

So where do those intervals come from? It depends a lot on what you're working with (everyone's least favorite answer, I know). The most important distinction is the sequencing experiment type: is it whole genome, or targeted sequencing of some sort?


Targeted sequencing (exomes etc.)

For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).

We make our exome interval lists available, but be aware that they are specific to the custom exome targeting kits used at the Broad. If you got your sequencing done somewhere else, you should seek to get the appropriate intervals list from the sequencing provider.


Whole genomes (WGS)

For whole genome sequence, the interval lists don’t depend on the prep (since in principle you captured the “whole genome”) so instead it depends on what regions of the genome you want to blacklist (e.g. centromeric regions that waste your time for nothing) and how the reference genome build enables you to cut up regions (separated by Ns) for scatter-gather parallelizing.

We make our WGS interval lists available, and the good news is that you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist! Which you can examine by looking at the intervals themselves (we don't currently have documentation on their provenance, sorry -- baby steps).


Further reading

For more information, see the FAQ article on using interval lists.

Cannot construct fragment from more than two reads


Hi,

I am running a Best Practices Mutect2 workflow, and after having upgraded to GATK 4.1.4.0 from 4.1.3.0, I am starting to see this error:

(...)
17:35:39.809 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
17:35:39.942 INFO  ProgressMeter - Starting traversal
17:35:39.943 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
17:35:50.272 INFO  ProgressMeter -       chr22:10938027              0.2                 36510         212082.5
17:36:00.317 INFO  ProgressMeter -       chr22:12587748              0.3                 42160         124158.2
17:36:10.409 INFO  ProgressMeter -       chr22:16564638              0.5                 55550         109400.6
17:36:20.431 INFO  ProgressMeter -       chr22:18088679              0.7                 60790          90086.0
17:36:30.395 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.022855297
17:36:30.395 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 7.454791689
17:36:30.395 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 11.04 sec
17:36:30.516 INFO  Mutect2 - Shutting down engine
[October 9, 2019 5:36:30 PM CEST] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 0.86 minutes.
Runtime.totalMemory()=1360003072
java.lang.IllegalArgumentException: Cannot construct fragment from more than two reads
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:725)
        at org.broadinstitute.hellbender.utils.read.Fragment.create(Fragment.java:36)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.broadinstitute.hellbender.utils.genotyper.AlleleLikelihoods.groupEvidence(AlleleLikelihoods.java:595)
        at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:93)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:251)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:320)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:308)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:281)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)
Using GATK jar /home/michaelk/miniconda3/envs/moma-somatic-pipeline-gatk-4.1.4.0/share/gatk4-4.1.4.0-0/gatk-package-4.1.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8g -Djava.io.tmpdir=/scratch/2114474/tmp.xejWjcYSOO -jar /home/michaelk/miniconda3/envs/moma-somatic-pipeline-gatk-4.1.4.0/share/gatk4-4.1.4.0-0/gatk-package-4.1.4.0-local.jar Mutect2 -R /faststorage/project/MomaRAWfiles/BACKUP/reference/hg38/reference_hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -I output/raw_output/alignments/TUMOR.recalibrated.markdup.sorted.bam -tumor TUMOR -I output/raw_output/alignments/NORMAL.recalibrated.markdup.sorted.bam -normal NORMAL --bam-output /scratch/2114474/TUMOR_chr22.mutect.somatic.assembled.haplotypes.bam --f1r2-tar-gz /scratch/2114474/TUMOR_chr22.mutect.somatic.f1r2.tar.gz -pon /faststorage/project/MomaRAWfiles/BACKUP/reference/hg38/broad_bundle_hg38/1000g_pon.hg38.vcf.gz -L chr22 -O /scratch/2114474/TUMOR_chr22.mutect.somatic.vcf.gz

Both the VCF file and the BAM output are produced and seem to be OK. If I run the exact same pipeline with GATK 4.1.3.0, the error does not occur.

I have not seen others report this error, either here or as a GitHub issue. Any clue what could be going on?

Thanks!


Get Error when using CreateReadCountPanelOfNormals in Calling Somatic Copy Number Variation

**Error information:**

```
Using GATK jar /home/yangyuan/Desktop/Tool/gatk-4.0.5.2/gatk-package-4.0.5.2-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx6500m -jar /home/yangyuan/Desktop/Tool/gatk-4.0.5.2/gatk-package-4.0.5.2-local.jar CreateReadCountPanelOfNormals -I 1_19_0427_S18.counts.hdf5 -I 1_20_0427_S19.counts.hdf5 -I 1_21_0427_S20.counts.hdf5 -I 1_22_0427_S21.counts.hdf5 -I 1_23_0427_S22.counts.hdf5 -I 1_24_0427_S23.counts.hdf5 -I 1_25_0427_S24.counts.hdf5 -I 1_26_0427_S25.counts.hdf5 -I 1_50_0427_S48.counts.hdf5 -I 1_51_0427_S49.counts.hdf5 -I ......
......
18/07/29 19:46:57 INFO Executor: Running task 32.0 in stage 1.0 (TID 33)
18/07/29 19:46:57 INFO Executor: Running task 33.0 in stage 1.0 (TID 34)
18/07/29 19:46:57 INFO Executor: Running task 34.0 in stage 1.0 (TID 35)
18/07/29 19:46:57 INFO Executor: Running task 35.0 in stage 1.0 (TID 36)
18/07/29 19:46:57 INFO Executor: Running task 36.0 in stage 1.0 (TID 37)
18/07/29 19:46:57 INFO Executor: Running task 37.0 in stage 1.0 (TID 38)
18/07/29 19:46:57 INFO Executor: Running task 38.0 in stage 1.0 (TID 39)
18/07/29 19:46:57 INFO Executor: Running task 39.0 in stage 1.0 (TID 40)
Jul 29, 2018 7:46:57 PM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /tmp/yangyuan/jniloader7881962181404460704netlib-native_system-linux-x86_64.so
java: symbol lookup error: /tmp/yangyuan/jniloader7881962181404460704netlib-native_system-linux-x86_64.so: undefined symbol: cblas_dspr
(the same "symbol lookup error ... undefined symbol: cblas_dspr" line is repeated many times, interleaved across several java processes)
```

Hi, when I'm using CreateReadCountPanelOfNormals while Calling Somatic Copy Number Variation, I got the error above.
When I searched for the error on Google, I found someone who met the same problem as me (https://gatkforums.broadinstitute.org/gatk/discussion/8810/something-about-create-pon-workflow).

But that solution did not work for me; this is the solution the above link gives:

I met this problem too. it was running very well with one sample input, but this bug appeared when I input multiple samples... BTW, my version is 4.0.3.0.
It seems related to Spark, and I just solved it.
1. install libblas.so, liblapacke.so and libopenblas.so(which I lacked).
2. add to environment. export LD_PRELOAD=/path/to/libopenblas.so
Then everything works as expected.

The command I input was:
gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
-I 1_19_0427_S18.counts.hdf5 \
-I 1_20_0427_S19.counts.hdf5 \
-I 1_21_0427_S20.counts.hdf5 \
-I 1_22_0427_S21.counts.hdf5 \
-I 1_23_0427_S22.counts.hdf5 \
-I 1_24_0427_S23.counts.hdf5 \
-I 1_25_0427_S24.counts.hdf5 \
-I 1_26_0427_S25.counts.hdf5 \
-I 1_50_0427_S48.counts.hdf5 \
-I 1_51_0427_S49.counts.hdf5 \
-I 1_52_0427_S50.counts.hdf5 \
-I 1_53_0427_S51.counts.hdf5 \
-I 1_54_0427_S52.counts.hdf5 \
-I 1_55_0427_S53.counts.hdf5 \
-I 1_56_0427_S54.counts.hdf5 \
-I 1_57_0427_S55.counts.hdf5 \
-I 1_58_0427_S56.counts.hdf5 \
-I 1_59_0427_S57.counts.hdf5 \
--minimum-interval-median-percentile 55.0 \
-O cnvponC.pon.hdf5


How do I reach my data when running GATK in docker?

Hi! I would like to try out GATK but I am new to Linux OS; generally I use Windows OS. I installed Ubuntu 18.04.3 LTS through Oracle VirtualBox 6.0 on a laptop running Windows 10. I followed the instructions on the https://software.broadinstitute.org/gatk/documentation/article?id=11090 website to install Docker and download the GATK container image. When I try to run the FastqToSam tool it looks like this:

```
lmi@lmi-VirtualBox:~$ sudo docker run -v ~/home/lmi/NGS:/gatk/my_data -it broadinstitute/gatk:4.1.3.0
[sudo] password for lmi:
(gatk) root@f5e690616506:/gatk# gatk FastqToSam -F1 F1.fastq -F2 F2.fastq -O uBAM.bam -SM sample001 -RG rg0013
Using GATK jar /gatk/gatk-package-4.1.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.3.0-local.jar FastqToSam -F1 F1.fastq -F2 F2.fastq -O uBAM.bam -SM sample001 -RG rg0013
09:40:34.696 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Oct 10 09:40:34 UTC 2019] FastqToSam --FASTQ F1.fastq --FASTQ2 F2.fastq --OUTPUT uBAM.bam --READ_GROUP_NAME rg0013 --SAMPLE_NAME sample001 --USE_SEQUENTIAL_FASTQS false --SORT_ORDER queryname --MIN_Q 0 --MAX_Q 93 --STRIP_UNPAIRED_MATE_NUMBER false --ALLOW_AND_IGNORE_EMPTY_LINES false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Oct 10, 2019 9:40:41 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Thu Oct 10 09:40:41 UTC 2019] Executing as root@f5e690616506 on Linux 5.0.0-31-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.3.0
[Thu Oct 10 09:40:41 UTC 2019] picard.sam.FastqToSam done. Elapsed time: 0.12 minutes.
Runtime.totalMemory()=79167488
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Cannot read non-existent file: file:///gatk/F1.fastq
at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:483)
at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:470)
at picard.sam.FastqToSam.doWork(FastqToSam.java:312)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
(gatk) root@f5e690616506:/gatk#
```

If I understand it correctly from "Cannot read non-existent file", it can't find the input fastq files I have in my NGS directory. I would like for my working directory to be /home/lmi/NGS. How should I start GATK to resolve this?

Thank you!

Mutect2 - java.lang.IllegalArgumentException: Cannot construct fragment from more than two reads


Hi,
I am trying the latest version of Mutect2 (4.1.4.0):

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms4G -Xmx9G -jar /home/tools/gatk/4.1.4.0/gatk-package-4.1.4.0-local.jar Mutect2 --annotation ClippingRankSumTest --annotation DepthPerSampleHC --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage -R hs38DH.fa -I SynSet3_N.dedup.sorted.indel.dedup.bam -I SynSet3_T.dedup.sorted.indel.dedup.bam -normal SynSet3_N --panel-of-normals 1000g_pon.hg38.vcf.gz --germline-resource af-only-gnomad.hg38.vcf.gz --f1r2-tar-gz SynSet3.sg00.dedup.sorted.indel.dedup.f1f2.tar.gz -O SynSet3.sg00.dedup.sorted.indel.dedup.vcf --bam-output SynSet3.sg00.dedup.sorted.indel.dedup.mt2.bam -L 00.interval_list --native-pair-hmm-threads 1

I got the following error.

java.lang.IllegalArgumentException: Cannot construct fragment from more than two reads
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:725)
        at org.broadinstitute.hellbender.utils.read.Fragment.create(Fragment.java:36)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
        at org.broadinstitute.hellbender.utils.genotyper.AlleleLikelihoods.groupEvidence(AlleleLikelihoods.java:595)
        at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:93)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:251)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:320)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:308)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:281)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
        at org.broadinstitute.hellbender.Main.main(Main.java:292)

The same command works with Mutect2 4.1.3.0. Could you explain what happened?
Thank you

GATK Exception in HaplotypeCaller

Hi,

I am using GATK 4.0.8.1 HaplotypeCaller to make a g.vcf. I am running the following command:
gatk --java-options "-Xms24g -Xmx48g" HaplotypeCaller -R new_hg38.fa -I S11_.sorted.BQRC.bam -O S11.g.vcf -L ../../nextera-dna-exome-targeted-regions-manifest-v1-2.bed --native-pair-hmm-threads 6 --min-base-quality-score 20 -stand-call-conf 30 --dbsnp /All_20180418.chr.hg38.vcf.gz -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation --read-validation-stringency SILENT --TMP_DIR scratch-2


But GATK is shutting down with an exception. Here is the log of the exception:

23:56:09.618 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
23:56:09.692 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/hpcc/tools/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:56:10.099 INFO HaplotypeCaller - ------------------------------------------------------------
23:56:10.099 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.8.1
23:56:10.099 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
23:56:10.100 INFO HaplotypeCaller - Executing as hpcc@hpcc on Linux v4.4.0-159-generic amd64
23:56:10.100 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10
23:56:10.100 INFO HaplotypeCaller - Start Date/Time: September 16, 2019 11:56:09 PM PKT
23:56:10.100 INFO HaplotypeCaller - ------------------------------------------------------------
23:56:10.100 INFO HaplotypeCaller - ------------------------------------------------------------
23:56:10.101 INFO HaplotypeCaller - HTSJDK Version: 2.16.0
23:56:10.101 INFO HaplotypeCaller - Picard Version: 2.18.7
23:56:10.101 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:56:10.101 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:56:10.102 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:56:10.102 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:56:10.102 INFO HaplotypeCaller - Deflater: IntelDeflater
23:56:10.102 INFO HaplotypeCaller - Inflater: IntelInflater
23:56:10.102 INFO HaplotypeCaller - GCS max retries/reopens: 20
23:56:10.102 INFO HaplotypeCaller - Using google-cloud-java fork https://github.com/broadinstitute/google-cloud-java/releases/tag/0.20.5-alpha-GCS-RETRY-FIX
23:56:10.102 INFO HaplotypeCaller - Initializing engine
23:56:10.936 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/2d6b1dc8-eccd-46f4-a5b2-39966cd786c9/data-base/All_20180418.chr.hg38.vcf.gz
23:56:11.249 INFO FeatureManager - Using codec BEDCodec to read file file:///mnt/2d6b1dc8-eccd-46f4-a5b2-39966cd786c9/scratch-2/exome-run2/cleanfastq/part3/newwork/../../nextera-dna-exome-targeted-regions-manifest-v1-2.bed
23:56:13.499 INFO IntervalArgumentCollection - Processing 45326818 bp from intervals
23:56:13.651 WARN IndexUtils - Feature file "/mnt/2d6b1dc8-eccd-46f4-a5b2-39966cd786c9/data-base/All_20180418.chr.hg38.vcf.gz" appears to contain no sequence dictionary. Attempting to retrieve a sequence dictionary from the associated index file
23:56:14.042 INFO HaplotypeCaller - Shutting down engine
[September 16, 2019 11:56:14 PM PKT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=25739919360
java.lang.NullPointerException
at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:325)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:202)
at java.util.Arrays.sort(Arrays.java:1312)
at java.util.Arrays.sort(Arrays.java:1506)
at java.util.ArrayList.sort(ArrayList.java:1462)
at java.util.Collections.sort(Collections.java:143)
at org.broadinstitute.hellbender.utils.IntervalUtils.sortAndMergeIntervals(IntervalUtils.java:459)
at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:955)
at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:970)
at org.broadinstitute.hellbender.engine.MultiIntervalLocalReadShard.<init>(MultiIntervalLocalReadShard.java:59)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.makeReadShards(AssemblyRegionWalker.java:195)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.onStartup(AssemblyRegionWalker.java:175)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)