Channel: Recent Discussions — GATK-Forum

Improve performance of GATK4 GermlineCNVCaller


Hi,

I was able to run and finish all the steps in the GATK4 GermlineCNVCaller COHORT pipeline thanks to the information that I received in this post:
https://gatkforums.broadinstitute.org/gatk/discussion/11344/current-status-of-gatk4-germlinecnvcaller-tools-and-best-practices

The GATK4 GermlineCNVCaller COHORT command that I started for 108 samples (diploid, genome size 0.5 Gb) has been running since last week.

I am running this as a grid job without capturing stdout, and I just see this Python script running at roughly 600% CPU and 150 GB of memory.

0.147t 31288 R 618.5 10.0 47006:15 python /tmp/cohort_denoising_calling.5354881656462301387.py

Do you have any kind of ball park figure for how long the GATK4 GermlineCNVCaller step in COHORT mode is supposed to take?

Is there a limit to the population size that you recommend running the GATK4 GermlineCNVCaller tool on in COHORT mode? Is 108 samples perhaps too many, even when the species has a small 0.5 Gb genome?

Do you recommend using a bin size other than the default 1 kb --bin-length of PreprocessIntervals? Should I increase this value for larger genomes to keep the number of bins the same, i.e. is there a computationally optimal number of bins per genome/chromosome?

Is it possible to run the GATK4 GermlineCNVCaller tool in COHORT mode with more parallelism?
For example, running the tool per chromosome in parallel, or using more CPUs; only 6 of the 60 CPUs seem to be used.

Thank you.


Picard/GATK MergeVcfs throws errors


Dear all,
I am following your guidelines for germline SNP detection in GATK 4. Nevertheless, I cannot complete the concatenation of region-wise gvcfs.
Using GATK MergeVcfs I get the following error:
/package/sequencer/java/8/bin/java -jar -XX:+UseSerialGC -verbose:GC -Xmx8g -Djava.io.tmpdir=/scratch/cluster/seqcore/temp/smith /package/sequencer/gatk/current/gatk-package-4.0.1.1-local.jar MergeVcfs --INPUT ./03_GATK/core_L11935-2_Mystique.chrEBV.gvcf --INPUT ./03_GATK/core_L11935-2_Mystique.chrUn_KI270742v1.gvcf --OUTPUT ./03_GATK/core_L11935-2_Mystique.gvcf

[Fri Feb 09 13:20:55 CET 2018] MergeVcfs --INPUT ./03_GATK/core_L11935-2_Mystique.chrEBV.gvcf --INPUT ./03_GATK/core_L11935-2_Mystique.chrUn_KI270742v1.gvcf --OUTPUT ./03_GATK/core_L11935-2_Mystique.gvcf --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 1 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX true --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Fri Feb 09 13:20:55 CET 2018] Executing as smith@bromhidrosophobie.molgen.mpg.de on Linux 4.14.17.mx64.205 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17; Deflater: Intel; Inflater: Intel; Picard version: Version:4.0.1.1

java.lang.IllegalArgumentException: Illegal character in fragment at index 1: ##fileformat=VCFv4.2
at java.net.URI.create(URI.java:852)
at htsjdk.samtools.util.IOUtil.getPath(IOUtil.java:1134)
at htsjdk.samtools.util.IOUtil.lambda$unrollPaths$2(IOUtil.java:1088)
at htsjdk.samtools.util.IOUtil$$Lambda$29/1967434886.accept(Unknown Source)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at htsjdk.samtools.util.IOUtil.unrollPaths(IOUtil.java:1085)
at htsjdk.samtools.util.IOUtil.unrollFiles(IOUtil.java:1050)
at picard.vcf.MergeVcfs.doWork(MergeVcfs.java:164)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:24)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:277)
Caused by: java.net.URISyntaxException: Illegal character in fragment at index 1: ##fileformat=VCFv4.2
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parse(URI.java:3067)
at java.net.URI.<init>(URI.java:588)
at java.net.URI.create(URI.java:850)

Applying the picard commands I get the following:
/package/sequencer/java/8/bin/java -jar -XX:+UseSerialGC -verbose:GC -Xmx8g -Djava.io.tmpdir=/scratch/cluster/seqcore/temp/smith /package/sequencer/picard-tools/current/picard.jar MergeVcfs INPUT=./03_GATK/core_L11935-2_Mystique.chrEBV.gvcf INPUT=./03_GATK/core_L11935-2_Mystique.chrUn_KI270742v1.gvcf OUTPUT= ./03_GATK/core_L11935-2_Mystique.gvcf

13:24:51.701 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/package/sequencer/picard-tools/2.12.1/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Feb 09 13:24:51 CET 2018] MergeVcfs INPUT=[./03_GATK/core_L11935-2_Mystique.chrEBV.gvcf, ./03_GATK/core_L11935-2_Mystique.chrUn_KI270742v1.gvcf] OUTPUT=./03_GATK/core_L11935-2_Mystique.gvcf VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

Exception in thread "main" htsjdk.samtools.SAMException: Cannot read non-existent file: /project/seqcore-cluster/data/superhero/chrUn_KI270742v1 186727 . C .. END=186739 GT:DP:GQ:MIN_DP:PL 0/0:9:0:4:0,0,0
at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:347)
at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:334)
at htsjdk.samtools.util.IOUtil.unrollFiles(IOUtil.java:948)
at picard.vcf.MergeVcfs.doWork(MergeVcfs.java:98)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

I appreciate any help on this issue.
Best
Stefan

Meaning of warning (


Hello!
I have combined my gvcfs produced by HaplotypeCaller into one vcf file using CombineGVCFs.

When I use said VCF file for joint genotyping with GenotypeGVCFs, I often get the following warning:

WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples

This makes me wonder if I have done the previous steps correctly, so I will give you some background on what I am doing.

1/ We are working with 6 different mouse lines
2/ For each line, 10 animals were sequenced.
3/ Variants were called per sample with HaplotypeCaller in GVCF mode (after preparing BAMs as per GATK's best practices)
4/ gvcf files were combined with CombineGVCFs regardless of group membership (60 gvcfs)
5/ Joint genotyping with GenotypeGVCFs

I am now wondering if the fact that on step 4 I did not produce one combined vcf per group (i.e. 6 vcfs, each resulting from 10 gvcfs) could have something to do with this warning.

Input bam file for the CNN pipeline


Hi,

I was looking to test the CNN pipeline and I saw that the "-bamout" option was added to the HaplotypeCaller (HC) step. To run the 2D model in the CNNScoreVariants tool, should I use the original BAM (post-BQSR) or the bamout BAM from the HC step as the input BAM file?

Thanks

Intervals and interval lists


Interval lists define subsets of genomic regions, sometimes even just individual positions in the genome. You can provide GATK tools with intervals or lists of intervals when you want to restrict them to operating on a subset of genomic regions. There are four main reasons for doing so:

  • You want to run a quick test on a subset of data (often used in troubleshooting)
  • You want to parallelize execution of an analysis across genomic regions
  • You need to exclude regions that have bad or uninformative data where a tool is getting stuck
  • The analysis you're running should only take data from those subsets due to how the underlying algorithm works

Regarding the last case, see the Best Practices workflow recommendations and tool example commands for guidance regarding when to restrict analysis to intervals.


Interval-related arguments and syntax

Arguments for specifying and modifying intervals are provided by the engine and can be applied to most if not all tools. The main arguments you need to know about are the following:

  • -L / --intervals allows you to specify an interval or list of intervals to include.
  • -XL / --exclude-intervals allows you to specify an interval or list of intervals to exclude.
  • -ip / --interval-padding allows you to add padding (in bp) to the intervals you include.
  • -ixp / --interval-exclusion-padding allows you to add padding (in bp) to the intervals you exclude.

By default the engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by specifying an alternate interval merging rule (see --interval-merging-rule in the Tool Docs).
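
For example, here is a hedged sketch of overriding that default on an interval-aware tool (the file names are placeholders, not files from this article):

gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -L targets.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O output.vcf.gz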

The syntax for using -L is as follows; it applies equally to -XL:

  • -L chr20 for contig chr20.
  • -L chr20:1-100 for contig chr20, positions 1-100.
  • -L intervals.list (or intervals.interval_list, or intervals.bed) when specifying a text file containing intervals (see supported formats below).
  • -L variants.vcf when specifying a VCF file containing variant records; their genomic coordinates will be used as intervals.

If you want to provide several intervals or several interval lists, just pass them in using separate -L or -XL arguments (you can even use both of them in the same command). You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by specifying an alternate interval set rule (see --interval-set-rule in the Tool Docs).
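
As a hedged illustration of combining several of these arguments in one command (file names are placeholders), you could include a whole contig plus a target list, exclude a blacklist, and pad the included intervals:

gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -L chr20 \
    -L exome_targets.interval_list \
    -XL blacklist.bed \
    -ip 100 \
    -O output.vcf.gz

Adding --interval-set-rule INTERSECTION to a command like this would instead restrict the run to positions present in every included set.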


Supported interval list formats

GATK supports several types of interval list formats: Picard-style .interval_list, GATK-style .list, BED files with extension .bed, and VCF files. The intervals MUST be sorted by coordinate (in increasing order) within contigs, and the contigs must be sorted in the same order as in the sequence dictionary. This is required for efficiency reasons.

A. Picard-style .interval_list

Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6

This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. applying hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).

B. GATK-style .list or .intervals

This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
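
For instance, a minimal GATK-style list (coordinates invented for illustration) can mix whole contigs and specific ranges:

chr20
chr21:1-10000
chr21:2000000-2100000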

C. BED files with extension .bed

We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.
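
As a quick worked example (coordinates invented for illustration), the 1-based interval chr20:1001-2000 corresponds to the following tab-separated BED line, with the start offset by 1 and the end left as-is:

chr20	1000	2000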

D. VCF files

Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.
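
For example, a hedged sketch of such a command (file names are placeholders):

gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -L colleague_calls.vcf \
    -ip 100 \
    -O recalled.vcf.gz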


Obtaining suitable interval lists

So where do those intervals come from? It depends a lot on what you're working with (everyone's least favorite answer, I know). The most important distinction is the sequencing experiment type: is it whole genome, or targeted sequencing of some sort?

Targeted sequencing (exomes, gene panels etc.)

For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).

We make our exome interval lists available, but be aware that they are specific to the custom exome targeting kits used at the Broad. If you got your sequencing done somewhere else, you should seek to get the appropriate intervals list from the sequencing provider.

Whole genomes (WGS)

For whole genome sequencing, the interval lists don't depend on the prep (since in principle you captured the "whole genome"), so instead they depend on which regions of the genome you want to blacklist (e.g. centromeric regions that waste your time for nothing) and how the reference genome build enables you to cut up regions (separated by Ns) for scatter-gather parallelization.

We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist! Which you can examine by looking at the intervals themselves. However, we don't currently have documentation on their provenance, sorry -- baby steps.

VCF from RNA-seq data

Hi,
I would like to call variants from RNA-seq data that was generated in different lanes. I combined the lanes during the alignment step using HISAT2, and 80% of the reads map uniquely to the reference genome. Can I use this BAM file from step 2 (i.e. add read groups, sort, mark duplicates, and create index) onwards in the GATK Best Practices workflow for SNP and indel calling on RNA-seq data?

GATK on Amazon Web Services


We are soon adding support for running Cromwell on AWS Batch, integrating with AWS products. This will allow you to log in with your AWS credentials, access your files in S3, and run your WDL files through AWS Batch.

Stay tuned for more updates!

Allele Depth and Uninformative Reads

I am very new to this, so apologies for some basic questions; however, I am struggling to find the answers.

Using Mutect2 to call mutations, paired tumour & normal specimen
DP=722;ECNT=1;NLOD=108.97;N_ART_LOD=-2.27;POP_AF=1e-06;P_CONTAM=0;P_GERMLINE=-194.4;TLOD=7.06 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/1:312,4:0.016:316:148,4:164,0:40,40:149,162:60:32:0:0.01,0.01,0.013:0.002001,0.00324,0.995 0/0:362,0:0.026:362:182,0:180,0:40,0:134,0:0:0:0:.:.

Can I check whether the following is correct: the AD of 312,4 is the unfiltered allele depth, including reads that didn't pass filters but not uninformative reads.
It is 316, 0 in the normal.
The DP of 316 is greater than the AD; I read this is the filtered depth at the sample level. Is it greater due to the uninformative reads, or does this represent the unfiltered depth?
Similarly, looking at the total depth of 722, (312+4+362) is 678. Is the difference due to uninformative reads?
Although in the normal sample the AD is 0,362, the allele frequency is not 0. Is this the case when the only alternate-allele reads are not informative, so the variant is reported but the AD is 0?
Are there other reasons?

Finally, when I look at this position in the original BAM file in IGV:
tumour
total count = 468
a = 2 (1, 1)
c = 1 (0, 1)
t = 5 (3, 2)
g = 460 (232+, 228-)

normal total = 566
g = 566 (296+, 270-)

I understand the difficulty with tri-allelic sites, and that in Mutect, if there are forward and reverse strands covering a site, one would be disregarded. But still, the total count seems very different from the DP.

Thank you!

conda env create fails: Invalid requirement: '$tensorFlowDependency'

Hi,

I get an error while trying to create the conda environment for GATK on CentOS 7. GATK installed successfully, conda is installed, and the standard Python supplied with the system has been updated to the latest version. From what I can see, Anaconda has its own Python 3. The $tensorFlowDependency and other lines are puzzling to me.

Below is the entire output of the command:

------------------------------------------------
# conda env create -n gatk -f gatkcondaenv.yml
Solving environment: done


==> WARNING: A newer version of conda exists. <==
current version: 4.5.12
latest version: 4.6.8

Please update conda by running

$ conda update -n base -c defaults conda



Downloading and Extracting Packages
intel-openmp-2018.0. | 620 KB | ############################################################################################################## | 100%
pip-9.0.1 | 1.7 MB | ############################################################################################################## | 100%
zlib-1.2.11 | 109 KB | ############################################################################################################## | 100%
readline-6.2 | 606 KB | ############################################################################################################## | 100%
openssl-1.0.2l | 3.2 MB | ############################################################################################################## | 100%
tk-8.5.18 | 1.9 MB | ############################################################################################################## | 100%
certifi-2016.2.28 | 216 KB | ############################################################################################################## | 100%
xz-5.2.3 | 667 KB | ############################################################################################################## | 100%
python-3.6.2 | 16.5 MB | ############################################################################################################## | 100%
sqlite-3.13.0 | 4.0 MB | ############################################################################################################## | 100%
setuptools-36.4.0 | 563 KB | ############################################################################################################## | 100%
mkl-2018.0.1 | 184.7 MB | ############################################################################################################## | 100%
wheel-0.29.0 | 88 KB | ############################################################################################################## | 100%
mkl-service-1.1.2 | 11 KB | ############################################################################################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Invalid requirement: '$tensorFlowDependency'
Traceback (most recent call last):
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/packaging/requirements.py", line 92, in __init__
req = REQUIREMENT.parseString(requirement_string)
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1617, in parseString
raise exc
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1607, in parseString
loc, tokens = self._parse( instring, 0 )
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1379, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 3376, in parseImpl
loc, exprtokens = e._parse( instring, loc, doActions )
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1379, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 3698, in parseImpl
return self.expr._parse( instring, loc, doActions, callPreParse=False )
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1379, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 3359, in parseImpl
loc, resultlist = self.exprs[0]._parse( instring, loc, doActions, callPreParse=False )
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1383, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 2670, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pip._vendor.pyparsing.ParseException: Expected W:(abcd...) (at char 0), (line:1, col:1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/req/req_install.py", line 82, in __init__
req = Requirement(req)
File "/usr/share/anaconda2/envs/gatk/lib/python3.6/site-packages/pip/_vendor/packaging/requirements.py", line 96, in __init__
requirement_string[e.loc:e.loc + 8]))
pip._vendor.packaging.requirements.InvalidRequirement: Invalid requirement, parse error at "'$tensorF'"


CondaValueError: pip returned an error

-----------------------------------------------

Can you please point me in the right direction as to what's missing (apart from the updated conda)?


Thanks

Best Regards
Maciej

ERROR:INVALID_INSERT_SIZE using picard CollectAlignmentSummaryMetrics

Hi all,

I am trying to run: java -jar picard.jar CollectAlignmentSummaryMetrics R=reference.fasta I=myfile.SORT.bam O=outputpicard.txt

my bam file was created using bwa and samtools from paired end reads.

bwa mem -t 24 myreference.fasta C1_1_Q30L50.fq C1_2_Q30L50.fq | samtools view -F 12 -Sb -o C1.Q30L50MAP.bam -
And the output was sorted using: samtools sort -o C1.Q30L50MAP.SORT.bam C1.Q30L50MAP.bam

But I have this error:

SAMRecord my reference length is too large for BAM bin field. A00551:34:H7KNLDSXX:1:2675:7048:35493 record bin field value is incorrect.

Then I followed the workflow for diagnosing SAM/BAM file errors with ValidateSamFile.
And the summary was:

## HISTOGRAM java.lang.String
Error Type Count
ERROR:INVALID_INSERT_SIZE 6971533
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 574151359

And the detailed list was:

ERROR: Record 272534869, Read name A00551:34:H7KNLDSXX:1:2670:8404:5963, Insert size out of range
.
.
I haven't been able to resolve this problem; could you help me? I read similar questions in the forum but could not find the solution.

Thank you,

Best regards,

germline heterozygous positions estimated by GATK 3.2-2

Hi,

I have WGS files for 3 patients (tumour and its matched derived model) but without a matched normal sample. If I call copy number for each patient (tumour and its matched derived model), how can I use read counts at germline heterozygous positions estimated by GATK 3.2-2 to compensate for the absence of a matched normal sample?

GATK Spark Logging

Hello,

I've been trying to decrease the verbosity of the Spark runs for GATK tools, e.g. MarkDuplicatesSpark.

My call is as follows:

python ${gatkDir}/gatk MarkDuplicatesSpark --spark-master local[$threads] -R ${GRC}.fa --input ${TU}.bam --output ${TU}.dd.bam --tmp-dir temp --verbosity ERROR

I thought the --verbosity ERROR would write only ERROR level output from the tools, but I'm still getting a lot of INFO output.

Is there another way to get only ERROR level output?

Thanks!

(How to) Run the Pathseq pipeline


Beta tutorial Please report any issues in the comments section.

Overview

PathSeq is a GATK pipeline for detecting microbial organisms in short-read deep sequencing samples taken from a host organism (e.g. human). The diagram below summarizes how it works. In brief, the pipeline performs read quality filtering, subtracts reads derived from the host, aligns the remaining (non-host) reads to a reference of microbe genomes, and generates a table of detected microbial organisms. The results can be used to determine the presence and abundance of microbial organisms as well as to discover novel microbial sequences.


PathSeq pipeline diagram: boxes outlined with dashed lines represent files. The green boxes at the top depict the three phases of the pipeline: read quality filtering / host subtraction, microbe alignment, and taxonomic abundance scoring. The blue boxes show tools used for pre-processing the host and microbe references for use with PathSeq.

Tutorial outline

This tutorial describes:

  • How to run the full PathSeq pipeline on a simulated mixture of human and E. coli reads using pre-built small-scale reference files
  • How to prepare custom host and microbe reference files for use with PathSeq

A more detailed introduction of the pipeline can be found in the PathSeqPipelineSpark tool documentation. For more information about the other tools, see the Metagenomics section of the GATK documentation.

How to obtain reference files

Host and microbe references must be prepared for PathSeq as described in this tutorial. The tutorial files provided below contain references that are designed specifically for this tutorial and should not be used in practice. Users can download recommended pre-built reference files for use with PathSeq from the GATK Resource Bundle FTP server in /bundle/pathseq/ (see readme file). This tutorial also covers how to build custom host and microbe references.

Tutorial Requirements

The PathSeq tools are bundled with the GATK 4 release. For the most up-to-date GATK installation instructions, please see https://github.com/broadinstitute/gatk. This tutorial assumes you are using a POSIX (e.g. Linux or MacOS) operating system with at least 2Gb of memory.

Obtain tutorial files

Download tutorial_10913.tar.gz from the ftp site. Extract the archive with the command:

> tar xzvf pathseq_tutorial.tar.gz
> cd pathseq_tutorial

You should now have the following files in your current directory:

  • test_sample.bam : simulated sample of 3M paired-end 151-bp reads from human and E. coli
  • hg19mini.fasta : human reference sequences (indexed)
  • e_coli_k12.fasta : E. coli reference sequences (indexed)
  • e_coli_k12.fasta.img : PathSeq BWA-MEM index image
  • e_coli_k12.db : PathSeq taxonomy file

Run the PathSeq pipeline

The pipeline accepts reads in BAM format (if you have FASTQ files, please see this article on how to convert to BAM). In this example, the pipeline can be run using the following command:

> gatk PathSeqPipelineSpark \
    --input test_sample.bam \
    --filter-bwa-image hg19mini.fasta.img \
    --kmer-file hg19mini.hss \
    --min-clipped-read-length 70 \
    --microbe-fasta e_coli_k12.fasta \
    --microbe-bwa-image e_coli_k12.fasta.img \
    --taxonomy-file e_coli_k12.db \
    --output output.pathseq.bam \
    --scores-output output.pathseq.txt

This ran in 2 minutes on a Macbook Pro with a 2.8GHz Quad-core CPU and 16 GB of RAM. If running on a local workstation, users can monitor the progress of the pipeline through a web browser at http://localhost:4040.

Interpreting the output

The PathSeq output files are:

  • output.pathseq.bam : contains all high-quality non-host reads aligned to the microbe reference. The YP read tag lists the NCBI taxonomy IDs of any aligned species meeting the alignment identity criteria (see the --min-score-identity and --identity-margin parameters). This tag is omitted if the read was not successfully mapped, which may indicate the presence of organisms not represented in the microbe database.
  • output.pathseq.txt : a tab-delimited table of the input sample’s microbial composition. This can be imported into Excel and organized by selecting Data -> Filter from the menu:
tax_id taxonomy type name kingdom score score_normalized reads unambiguous reference_length
1 root root root root 189580 100 189580 189580 0
131567 root cellular_organisms no_rank cellular_organisms root 189580 100 189580 189580 0
2 ... cellular_organisms Bacteria superkingdom Bacteria Bacteria 189580 100 189580 189580 0
1224 ... Proteobacteria phylum Proteobacteria Bacteria 189580 100 189580 189580 0
1236 ... Proteobacteria Gammaproteobacteria class Gammaproteobacteria Bacteria 189580 100 189580 189580 0
91347 ... Gammaproteobacteria Enterobacterales order Enterobacterales Bacteria 189580 100 189580 189580 0
543 ... Enterobacterales Enterobacteriaceae family Enterobacteriaceae Bacteria 189580 100 189580 189580 0
561 ... Enterobacteriaceae Escherichia genus Escherichia Bacteria 189580 100 189580 189580 0
562 ... Escherichia Escherichia_coli species Escherichia_coli Bacteria 189580 100 189580 189580 0
83333 ... Escherichia_coli Escherichia_coli_K-12 no_rank Escherichia_coli_K-12 Bacteria 189580 100 189580 189580 0
511145 ... Escherichia_coli_str._K-12_substr._MG1655 no_rank Escherichia_coli_str._K-12_substr._MG1655 Bacteria 189580 100 189580 189580 4641652

Each line provides information for a single node in the taxonomic tree. A "root" node corresponding to the top of the tree is always listed. Columns to the right of the taxonomic information are:

  • score : indicates the amount of evidence that this taxon is present, based on the number of reads that aligned to references in this taxon. This takes into account uncertainty due to ambiguously mapped reads by dividing their weight across each possible hit. It is also normalized by genome length.
  • score_normalized : the same as score, but normalized to sum to 100 within each kingdom.
  • reads : number of mapped reads (ambiguous or unambiguous)
  • unambiguous : number of unambiguously mapped reads
  • reference_length : reference length (in bases) if there is a reference assigned to this taxon. Unlike scores, this number is not propagated up the tree, i.e. it is 0 if there is no reference corresponding directly to the taxon. In the above example, the MG1655 strain reference length is only shown in the strain row (4,641,652 bases).

In this example, one can see that PathSeq detected 189,580 reads that mapped to the strain reference for E. coli K-12 MG1655. This read count is propagated up the tree (species, genus, family, etc.) to the root node. If other species were present, their read counts would be listed and added to their corresponding ancestral taxonomic classes.

Microbe discovery

PathSeq can also be used to discover novel microorganisms by analyzing the unmapped reads, e.g. using BLAST or de novo assembly. To get the number of non-host (microbe plus unmapped) reads use the samtools view command:

> samtools view -c output.pathseq.bam
189580

Since the reported number of E. coli reads is the same number of reads in the output BAM, there are 0 reads of unknown origin in this sample.
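
If the numbers had differed, a reasonable next step (a sketch using standard samtools features, not a PathSeq tool) would be to pull out the reads that did not align to the microbe reference and convert them to FASTQ for BLAST or assembly:

> samtools view -b -f 4 output.pathseq.bam > unmapped.bam
> samtools fastq unmapped.bam > unmapped.fq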

Preparing Custom Reference Files

Custom host and microbe references must both be prepared for use with PathSeq. The references should be supplied as FASTA files with proper indices and sequence dictionaries. The host reference is used to build a BWA-MEM index image and a k-mer file. The microbe reference is used to build another BWA-MEM index image and a taxonomy file. Here we assume you are starting with the FASTA reference files that have been properly indexed:

  • host.fasta : your custom host reference sequences
  • microbe.fasta : your custom microbe reference sequences

Build the host and microbe BWA index images

The BWA index images must be built using BwaMemIndexImageCreator:

> gatk BwaMemIndexImageCreator -I host.fasta
> gatk BwaMemIndexImageCreator -I microbe.fasta

Generate the host k-mer library file

The PathSeqBuildKmers tool creates a library of k-mers from a host reference FASTA file. Create a hash set of all k-mers in the host reference with the following command:

> gatk PathSeqBuildKmers \
--reference host.fasta \
-O host.hss

Build the taxonomy file

Download the latest RefSeq accession catalog RefSeq-releaseXX.catalog.gz, where XX is the latest RefSeq release number, at:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/
Download NCBI taxonomy data files dump (no need to extract the archive):
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
Assuming these files are now in your current working directory, build the taxonomy file using PathSeqBuildReferenceTaxonomy:

> gatk PathSeqBuildReferenceTaxonomy \
-R microbe.fasta \
--refseq-catalog RefSeq-releaseXX.catalog.gz \
--tax-dump taxdump.tar.gz \
-O microbe.db

Example reference build script

The preceding instructions can be conveniently executed with the following bash script:

#!/bin/bash
set -eu
GATK_HOME=/path/to/gatk
REFSEQ_CATALOG=/path/to/RefSeq-releaseXX.catalog.gz
TAXDUMP=/path/to/taxdump.tar.gz

echo "Building pathogen reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I microbe.fasta
$GATK_HOME/gatk PathSeqBuildReferenceTaxonomy -R microbe.fasta --refseq-catalog $REFSEQ_CATALOG --tax-dump $TAXDUMP -O microbe.db

echo "Building host reference..."
$GATK_HOME/gatk BwaMemIndexImageCreator -I host.fasta
$GATK_HOME/gatk PathSeqBuildKmers --reference host.fasta -O host.hss

Troubleshooting

Java heap out of memory error

Increase the Java heap limit. For example, to increase the limit to 4GB with the --java-options flag:

> gatk --java-options "-Xmx4G" ... 

This should generally be set to a value greater than the combined size of all the reference files.

The output is empty

The input reads must pass an initial validity filter, WellFormedReadFilter. A common cause of empty output is that the input reads do not pass this filter, often because none of the reads have been assigned to a read group (with an RG tag). For instructions on adding read groups, see this article, but note that PathSeqPipelineSpark and PathSeqFilterSpark do not require the input BAM to be sorted or indexed.

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs


In GATK4, the GenotypeGVCFs tool can only take a single input, i.e. 1) a single single-sample GVCF, 2) a single multi-sample GVCF created by CombineGVCFs, or 3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case), you will need to combine them before feeding them to GenotypeGVCFs. The input samples must possess genotype likelihoods containing the <NON_REF> allele, produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION.

Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


Using GenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21

That generates a directory called my_database containing the combined GVCF data for chromosome 20 and 21. (The contents of the directory are not really human-readable; see "extracting GVCF data from a GenomicsDB" to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like Buffer resized from 178298bytes to 262033 -- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually, as above, becomes arduous.
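
For example, here is a hedged sketch of a larger-cohort import using those two options (the map file name is a placeholder; each line of a sample map is a sample name, a tab, and the path to that sample's GVCF):

gatk GenomicsDBImport \
    --sample-name-map cohort.sample_map \
    --genomicsdb-workspace-path my_database \
    --batch-size 50 \
    --intervals chr20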

Then you run joint genotyping; note the gendb:// prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -newQual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations and Common “Gotchas”:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At least one interval must be provided when using GenomicsDBImport.

  3. Input GVCFs cannot contain multiple entries for a single genomic position

  4. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs) or scatter the following steps by chromosome as well.

  5. The annotation counts specified in the header MUST BE VALID! If not, you may see an error like A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV with mention of a core dump (which may or may not be output depending on your system configuration). You can check your annotation headers with vcf-validator from VCFtools [https://github.com/vcftools/vcftools].

  6. GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.

  7. If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable TILEDB_DISABLE_FILE_LOCKING=1 before running any GenomicsDB tool. If you don’t, you will likely see an error like Could not open array genomicsdb_array at workspace:[...]

  8. HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample callsets, MNPs can be inferred from the phase set (PS) tag in the FORMAT field.

  9. There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues [https://github.com/broadinstitute/gatk/issues?utf8=✓&q=is:issue+is:open+genomicsdb] to see if a fix is in progress.

If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.
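
As a minimal sketch of that fallback on the same trio as above:

gatk CombineGVCFs \
    -R data/ref/ref.fasta \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    -O trio.g.vcf

The resulting trio.g.vcf can then be passed to GenotypeGVCFs with -V trio.g.vcf in place of the gendb:// path.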


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Bells and Whistles

GenomicsDB now supports allele-specific annotations [ https://software.broadinstitute.org/gatk/documentation/article?id=9622 ], which have become standard in our Broad exome production pipeline.

GenomicsDB can now import directly from a Google cloud path (i.e. gs://) using NIO.
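
For example, a hedged sketch of a cloud-based import (the bucket and file names are made up):

gatk GenomicsDBImport \
    -V gs://my-bucket/gvcfs/sample1.g.vcf.gz \
    -V gs://my-bucket/gvcfs/sample2.g.vcf.gz \
    --genomicsdb-workspace-path my_database \
    --intervals chr20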

New workflow for FFPE sample in GATK 4.1.0.0 ?

We are glad to see the new release of GATK 4.1.0.0 compiling all the changes over 2018, especially Mutect2 out of Beta, and the Workflow and tools for FFPE samples.
However, our team would like some clarification on running FFPE samples. We have been creating the artifact-prior.tsv file for the parameter --orientation-bias-artifact-priors by following the three-step workflow suggested elsewhere in this forum:
1.) java gatk4 CollectF1R2Counts
2.) java gatk4 LearnReadOrientationModel
3.) java gatk4 Mutect2 -I FFPE_tumour.bam --orientation-bias-artifact-priors my_artifact-prior.tsv

This seems to work. But from your list of major changes for the GATK 4.1.0.0 version, we quote:
"...
-Many new/improved filters to reduce false positives (eg., FilterAlignmentArtifacts)
-Mutect2 now automatically recognizes and removes end repair artifacts in regions with inverted tandem repeats. This is extremely important for some FFPE samples.
..."
Does this suggest that further tools/switches, other than --orientation-bias-artifact-priors my_artifact-prior.tsv, should be used with FFPE samples?


Running GenomicsDBImport: stuck on 'INFO GenomicsDBImport - Importing batch 1 with 62 samples'


Hi,

I am running GenomicsDBImport on 62 human exome samples and somehow the process gets stuck at the very beginning. Please see the message below:

16:24:19.150 INFO  ProgressMeter - Starting traversal
16:24:19.150 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
16:24:19.378 INFO  GenomicsDBImport - Starting batch input file preload
16:24:36.900 INFO  GenomicsDBImport - Finished batch preload
16:24:36.900 INFO  GenomicsDBImport - Importing batch 1 with 62 samples

It has been stuck right here for several hours.

Here is the command I used:

java -jar $GATK GenomicsDBImport -R $hg19 \
     --sampleNameMap sample.map \
     -L chr1 \
     --genomicsdb-workspace-path $output

I wonder if this is normal. Thank you for your help in advance!

Masaki

a question about running HaplotypeCaller with intervals


Hi,

I have a question when running HaplotypeCaller functions with intervals on exome-seq data.
Here is the command I used:
java -jar gatk-package-4.0.6.0-local.jar HaplotypeCaller -R /espresso/share/genomes/hg38/genome.fa -I recal_reads.bam -O variants.g.vcf -ERC GVCF -L capture.bed

However, when I ran the command, I got the following message:
17:13:14.439 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.0.6.0/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:13:14.591 INFO HaplotypeCaller - ------------------------------------------------------------
17:13:14.591 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.6.0
17:13:14.591 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:13:14.591 INFO HaplotypeCaller - Executing as ... on Linux v2.6.32-431.29.2.el6.x86_64 amd64
17:13:14.592 INFO HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_121-b13
17:13:14.592 INFO HaplotypeCaller - Start Date/Time: July 16, 2018 5:13:14 PM EDT
17:13:14.592 INFO HaplotypeCaller - ------------------------------------------------------------
17:13:14.592 INFO HaplotypeCaller - ------------------------------------------------------------
17:13:14.592 INFO HaplotypeCaller - HTSJDK Version: 2.16.0
17:13:14.592 INFO HaplotypeCaller - Picard Version: 2.18.7
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:13:14.593 INFO HaplotypeCaller - Deflater: IntelDeflater
17:13:14.593 INFO HaplotypeCaller - Inflater: IntelInflater
17:13:14.593 INFO HaplotypeCaller - GCS max retries/reopens: 20
17:13:14.593 INFO HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
17:13:14.593 INFO HaplotypeCaller - Initializing engine
17:13:15.037 INFO FeatureManager - Using codec BEDCodec to read file file:///capture.bed
17:13:16.883 INFO IntervalArgumentCollection - Processing 64190747 bp from intervals
17:13:17.009 INFO HaplotypeCaller - Shutting down engine
[July 16, 2018 5:13:17 PM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2041053184
java.lang.NullPointerException
at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:325)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:202)
at java.util.Arrays.sort(Arrays.java:1312)
at java.util.Arrays.sort(Arrays.java:1506)
at java.util.ArrayList.sort(ArrayList.java:1454)
at java.util.Collections.sort(Collections.java:141)
at org.broadinstitute.hellbender.utils.IntervalUtils.sortAndMergeIntervals(IntervalUtils.java:459)
at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:956)
at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:971)
at org.broadinstitute.hellbender.engine.MultiIntervalLocalReadShard.<init>(MultiIntervalLocalReadShard.java:59)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.makeReadShards(AssemblyRegionWalker.java:195)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.onStartup(AssemblyRegionWalker.java:175)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:133)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

I did not see any error but it seems HaplotypeCaller did not run and there is no output.
So I will really appreciate it if I can get help from you guys.

Thank you!

Best,
Siyu

Are there any published approaches/methods for merging variant calls from RNA-seq and whole exome


Hi,

This is a general question and I would appreciate any pointers in this regard. I was wondering if there are any best practices/recommendations/published methods for merging variant calls from RNA-seq and whole exome data after we have filtered variants from GATK using best practices.

Thanks!

New! Mutect2 for Mitochondrial Analysis


Overcoming barriers to understanding the mitochondrial genome

Announcing a brand new “Best Practices” pipeline for calling SNPs and INDELs in the mitochondrial genome! Calling low VAF (variant allele fraction) alleles in the mitochondrial genome presents special problems but comes with great rewards, including diagnosing rare diseases and identifying asymptomatic carriers of pathogenic diseases. We’re excited to begin using this pipeline on tens of thousands of diverse samples from the gnomAD project (http://gnomad.broadinstitute.org/about) to gain greater understanding of population genetics from the perspective of mitochondrial DNA.

Mitochondrial genome - a history of challenges

We had often been advised to “try using a somatic caller,” since we expect mitochondria to have variable allele fraction variants, but we never actually tried it ourselves. Over the past year we focused on creating truth data for low allele fraction variants on the mitochondria and developing a production-quality, high-throughput pipeline that overcomes the unique challenges that calling SNPs and INDELs on the mitochondria presents.

See below the four challenges to unlocking the mitochondrial genome and how we’ve improved our pipeline to overcome them.

1. Mitochondria have a circular genome

Though the genome is linearized in the typical references we use, the breakpoint is artificial -- purely for the sake of bioinformatic convenience. Since the breakpoint is inside the “control region”, which is non-coding but highly variable across people, we want to be sensitive to variation in that region, to capture the most genetic diversity.

2. A pushy genome makes for difficult mapping

The mitochondrial genome has inserted itself into the autosomal genome many times throughout human evolution - and continues to do so. These regions in the autosomal genome, called Nuclear Mitochondrial DNA segments (NuMTs), make mapping difficult: if the sequences are identical, it’s hard to know if a read belongs in an autosomal NuMT or the mitochondrial contig.

3. Most mitochondria are normal

Variation in the mitochondria can have very low heteroplasmy. In fact, the variation “signal” can be comparable to the inherent sequencer noise, but the scientific community tasked us with calling 1% allele fraction sites with as much accuracy as we can. Our pipeline achieves 99% sensitivity at 5% VAF at depths greater than 1000. With depth in the thousands or tens of thousands of reads for most whole genome mitochondrial samples, it should be possible to call most 1% allele fraction sites with high confidence.

4. High depth coverage is a blessing… and a curse

The mitochondrial contig typically has extremely high depth in whole genome sequence data:
around 2000x for a typical blood sample compared to autosomes (typically ~30x coverage). Samples from mitochondria-rich tissues like heart and muscle have even higher depth (e.g. 80,000x coverage). This depth is a blessing for calling low-allele fraction sites with confidence, but can overwhelm computations that use algorithms not designed to handle the large amounts of data that come with this extreme depth.

Solving a geometry problem by realigning twice

We’ve solved the first problem by extracting reads that align to carefully selected NuMT regions and the mitochondria itself, from a whole genome sample. We take these aligned, recalibrated reads and realign them twice: once to the canonical mitochondria reference, and once to a “rotated” mitochondria reference that moves the breakpoint from the control region to the opposite side of the circular contig.

To help filter out NuMTs, we mark reads with their original alignment position before realigning to the mitochondrial contig. Then we use Mutect2 filters tuned to the high depth we expect from the mitochondria, by running Mutect2 in “--mitochondria-mode”. We increase accuracy on the “breakpoint” location by calling only the non-control region on the original mitochondria reference, and call the control region on the shifted reference (now in the middle of the linearized chromosome). We then shift the calls on the rotated reference back to the original mitochondrial reference and merge the VCFs.
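
As a hedged illustration of just the calling step on the canonical mitochondrial reference (file names are placeholders, and this omits the realignment to the shifted reference and the downstream filtering and merging described above):

gatk Mutect2 \
    -R chrM_reference.fasta \
    -I sample_chrM.bam \
    -L chrM \
    --mitochondria-mode \
    -O sample_chrM.vcf.gz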

Adaptive pruning for high sensitivity and precision

The rest of the problems benefited from recent improvements to the local assembly code that Mutect2 shares with HaplotypeCaller (HC). Incorporating the new adaptive pruning strategy in the latest version of Mutect2 will improve sensitivity and precision for samples with varying depth across the mitochondrial reference, and enable us to adapt our pipeline to exome and RNA samples. See the blog post on the newest version of Mutect2 here.

Unblocking genetic bottlenecks with a new pipeline

The new pipeline’s high sensitivity to low allele fraction variants is especially powerful since low AF variants may be at higher AF in other tissue.

Our pipeline harnesses the power of low AFs to help:

1. Diagnose rare diseases

Mutations can be at high allele fraction in affected tissues but low in blood samples typically used for genetic testing.

2. Identify asymptomatic carriers of pathogenic variants

If you carry a pathogenic allele even at low VAF, you can pass this along at high VAF to your offspring.

3. Discover somatic variants in tissues or cell lineages

For example, studies have used rare somatic mtDNA variants for lineage tracing in single-cell RNA-seq studies.

You can find the WDLs used to run this pipeline in the GATK repo under the scripts directory (https://github.com/broadinstitute/gatk/blob/master/scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl). Keep an eye out for an official “Best Practices” pipeline, coming soon in the gatk-workflows repo and in Firecloud.

Caveat: We're not so confident in calls under 5%AF (due to false positives from NuMTs). We're working on a longer term fix for this for a future release.
