Channel: Recent Discussions — GATK-Forum

CNNScoreVariants, too many threads


Hi,

in the Best Practices workflows you advise running HaplotypeCaller with the "-XX:GCTimeLimit=50" and "-XX:GCHeapFreeLimit=10" java options.

Is there something similar for CNNScoreVariants? I have tried several java options with different values to limit the number of threads, but it seems impossible. Without any options I get 116 threads while running a single command; with 5 java options I can limit them to 95... still too many! What should I limit here?
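
For reference, here is the kind of invocation I have been experimenting with. This is only a sketch: the JVM flags cap garbage-collector threads, OMP_NUM_THREADS caps OpenMP pools that native math libraries may spawn, and the --inter-op-threads/--intra-op-threads arguments assume a GATK build that exposes TensorFlow's threading controls (file names are placeholders):

    # Sketch only: JVM flags cap GC threads; OMP_NUM_THREADS caps OpenMP pools
    # spawned by native math libraries; the two TensorFlow arguments assume
    # your GATK build exposes them. File names are placeholders.
    export OMP_NUM_THREADS=4
    gatk --java-options "-XX:ParallelGCThreads=2 -XX:ConcGCThreads=2" \
        CNNScoreVariants \
        -R reference.fasta \
        -V input.vcf \
        -O annotated.vcf \
        --inter-op-threads 4 \
        --intra-op-threads 4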

Many thanks


Mutect1 multi-allelic cases


Hello,
We have worked hard to figure this out.
By default, and with any combination of parameters we tried, Mutect1 does not catch multi-allelic cases at a single position. It seems to pick one of the alternate alleles at each position (randomly?). Unlike Mutect2, there is no ploidy-like parameter. We need to detect every alternate allele at a given position, as our somatic analysis requires this.

Do you have an idea? How should one use Mutect1 for detecting multiallelic cases?

Thanks

CreateReadCountPanelOfNormals in GATK4.1 doesn't output valid HDF5 files


Hi GATK team,

I was testing the somatic CNV workflow in GATK 4.1 and found that CreateReadCountPanelOfNormals doesn't output valid HDF5 files. As an example, I generated HDF5 files from three normal samples using CollectCounts, which can be found here:

https://www.dropbox.com/sh/lblf0u339t8asqo/AAA21rJSmtgRLfJw-fnEONwDa?dl=0

Then I tried to create a PoN from these files:

gatk --java-options -Xmx8g CreateReadCountPanelOfNormals --input 1302003-B-ready.counts.hdf5 --input 1436156-B-ready.counts.hdf5 --input 1436468-B-ready.counts.hdf5 --output example.hdf5

This command runs without errors, but the resulting HDF5 file cannot be used by other tools such as DenoiseReadCounts. I noted that the three individual HDF5 files can be validated by HDFView (https://support.hdfgroup.org/products/java/hdfview/), but the one generated by CreateReadCountPanelOfNormals cannot.
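
For reference, the files can also be checked from the command line (a sketch; assumes the standard HDF5 command-line tools from the hdf5 package are installed):

    # h5ls walks the group/dataset tree; it fails loudly on an invalid HDF5 file
    h5ls -r 1302003-B-ready.counts.hdf5   # an input file: lists its datasets
    h5ls -r example.hdf5                  # the PoN output, which fails validation here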

Can you help? Thanks,

Jiantao

What are the requirements for running GATK?


Contents

    1. Skills / experience
    2. Input data
    3. Software
    4. Hardware

1. Skills / experience

We aim to make the tools usable by everyone, regardless of your background.

The GATK does not have a Graphical User Interface (GUI). You don't open it by clicking on the .jar file; to use it directly you have to use the Console (or Terminal) to input commands. If this is all new to you, we recommend you first learn about that and get some basic training (we recommend Software Carpentry) before trying to use the GATK. It's not difficult but you'll need to learn some jargon and get used to living without a mouse. Trust us, it's a liberating experience :)

If you prefer working in a point-and-click environment, consider trying FireCloud. FireCloud is a secure, freely accessible cloud-based analysis portal developed at the Broad Institute. It includes preconfigured GATK Best Practices pipelines as well as tools for building your own custom pipelines (with any command line tool you want, not just GATK).

Note that FireCloud is not a GUI-only solution; it's also possible to interact with it programmatically through an API and still take advantage of all the work we've done to preconfigure GATK pipelines and working examples.


2. Input data

Typical inputs and format requirements are documented here as well as in each tool's respective Tool Doc.


3. Software

Most GATK4 tools have fairly simple software requirements: a Unix-style OS and Java 1.8. However, a subset of tools have additional R or Python dependencies. These dependencies (as well as the base system requirements) are described further below. We strongly encourage you to use the Docker container system, if that's an option on your infrastructure, rather than a custom installation. All released versions of GATK4 are available as prepackaged container images on Dockerhub.

Operating system

The GATK runs natively on most if not all flavors of UNIX, including macOS, Linux and BSD. It is possible to get it running on some recent versions of Windows, but we provide neither support nor instructions for that. If you need to run on a Windows machine, consider using Docker.

Java 8 / JRE or SDK 1.8

The GATK is a Java-based program, so you'll need to have Java installed on your machine. The Java runtime version must be 1.8 exactly. To be clear: we do not yet support 1.9, and older versions (1.6 and 1.7) no longer work. You can check which version you have by typing java -version at the command line. This article has some more details about what to do if you don't have the right version. Both the Sun/Oracle Java JDK and OpenJDK are fully supported.

R dependencies

Some of the GATK tools produce plots using R, so if you want the plots you'll need to have R and Rscript installed, as well as these R libraries: gsalib, ggplot2, reshape, and gplots.
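
For example, these can be installed from CRAN in one shot (a sketch; assumes Rscript is on your PATH and that the packages are available on CRAN for your R version):

`Rscript -e 'install.packages(c("gsalib", "ggplot2", "reshape", "gplots"), repos = "https://cloud.r-project.org")'`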

Python dependencies

The gatk-launch wrapper script requires Python 2.6 or greater.

Some of the newer tools and workflows require Python 3.6.2 along with a set of additional Python packages. We use the Conda package manager to establish and manage the environment and dependencies required by these tools. The GATK Docker image comes with this environment pre-configured. In order to establish an environment suitable to run these tools outside of the Docker image, we provide a Conda config file, gatkcondaenv.yml. To use this, you must first install Conda, then create the GATK-appropriate environment by running the following command:

`conda env create -n gatk -f gatkcondaenv.yml`

To activate the environment once it has been created, run the command

`source activate gatk`

See the Conda documentation for additional information about using and managing Conda environments.

Developers only

If you plan to build GATK from source, you will need Git 2.5 or greater, git-lfs 1.1.0 or greater, and Gradle 3.1 or greater. Use the ./gradlew script to build from source; see the Github repository README for more details.
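
For example, a from-source build might look like the following (a sketch; assumes the Git, git-lfs and Gradle requirements above are already satisfied):

    git clone https://github.com/broadinstitute/gatk.git
    cd gatk
    ./gradlew bundle   # produces a zip of the GATK package under build/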


4. Hardware

We do not provide guidelines for hardware requirements, as these can vary enormously depending on the type of work you plan to do. However, you may find the following helpful:

Local infrastructure

Our collaborators at the Intel-Broad Center for Genomic Data Engineering can provide you with recommended hardware configurations based on your planned usage. Let us know in the comment thread if you'd like us to introduce you.

Cloud options

As noted above, we make our own cloud-based analysis portal freely available to everyone. It is built on Google Cloud; using the portal is free of charge, and compute/storage/egress costs are charged directly by Google. The advantage to you of using this portal is that we have already set up preconfigured workspaces for all the GATK Best Practices (including runtime hardware parameters, memory etc), and you also have the option of adding your own custom pipelines. This removes most of the typical limitations and guesswork involved in working with local infrastructure, and it also makes it easier to share your results and methods.

In addition, we are working with all the other major commercial cloud vendors to make it easy to run GATK pipelines on their platforms. See the "Pipelining Options" documentation for more details.

Install with conda, gatkcondaenv.yml not found

(How to) Install and use Conda for GATK4


Some tools in GATK4, like the gCNV pipeline and the new deep learning variant filtering tools, require extensive Python dependencies. To avoid having to manage these dependencies yourself, we recommend using the GATK4 Docker container, which comes with everything pre-installed, as explained here. If you are running GATK4 on a server and/or cannot use the Docker image, we recommend the Conda package manager as a backup solution. The GATK Conda environment provides all the dependencies you need, so you do not need to install each one separately. Both Conda and Docker are intended to solve the same problem, but one big benefit of Conda is that you can use it without root access. Conda should be easy to install if you follow these steps.

1) Refer to the installation instructions from Conda and choose the correct version for your computer. You will have the option of downloading Anaconda or Miniconda; Conda provides documentation about the difference between them. We chose Miniconda for this tutorial because we just wanted the GATK package and did not want to take up too much space on our computer. If you are not going to use Conda for anything other than GATK4, you might consider doing the same. If you choose to install Anaconda instead, you will have access to other bioinformatics packages that may be helpful to you, without having to install each package you need. Follow the prompts to properly install the .pkg file, and make sure you choose the correct package for the version of Python you are using (for example, if you have Python 2.7 on your computer, choose the version specific to it).

2) Go to the directory where you have stored the GATK4 jars and the gatk wrapper script, and make sure gatkcondaenv.yml is present. Then run

`conda env create -n gatk -f gatkcondaenv.yml`

`source activate gatk`

3) To check that your Conda environment is working, type `conda list`; you should see a list of installed packages, and gatkpythonpackages should be one of them.

4) You can also test whether the new variant filtering tool (CNNScoreVariants) runs properly. If you run

`gatk NeuralNetInference -R reference.fasta -V NA12878.vcf -O NeuralNetInferenceFiltered.vcf -a cnn_1d_annotations.hd5`

the tool should run to completion without errors. If the Conda environment is not configured correctly, you will immediately get an error saying ImportError: No module named keras.models.

5) If you later upgrade to a new version of GATK4, you will need to re-create the Conda environment from the new GATK4 folder. If you simply create it over the old one, you will get an error message saying "CondaValueError: prefix already exists: /anaconda2/envs/gatk". For example, when I upgraded from GATK 4.0.1.2 to GATK 4.0.2.0, I simply ran (in my 4.0.2.0 folder):

`source deactivate`

`conda env remove -n gatk`

Then follow steps 2-4 again to re-create the environment.

Strategies used by Genome Strip to detect CNVs

Dear all,

I just want to know which strategies Genome STRiP uses to detect CNVs... I looked on the Genome STRiP web page but I'm not sure... I have seen different papers classify detection by discordant read pairs (RP) and read depth (RD); other papers also include split-read (SR) signals or even local assembly (AS).

So I would appreciate it if you could clarify which strategies this tool uses...

Thanks for your help

Jordi

Java related error encountered while running gatk PathSeqPipelineSpark


Hi,

I am trying to run the PathSeq pipeline, following the tutorial at this link.

I ran the following commands

bioinfo@bioinfo$ conda activate gatk
(gatk) bioinfo@bioinfo$ gatk PathSeqPipelineSpark \
>     --input test_sample.bam \
>     --filter-bwa-image hg19mini.fasta.img \
>     --kmer-file hg19mini.hss \
>     --min-clipped-read-length 70 \
>     --microbe-fasta e_coli_k12.fasta \
>     --microbe-bwa-image e_coli_k12.fasta.img \
>     --taxonomy-file e_coli_k12.db \
>     --output output.pathseq.bam \
>     --scores-output output.pathseq.txt

and encountered the error below:

Using GATK jar /home/bioinfo/Installers/gatk4/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/bioinfo/Installers/gatk4/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar PathSeqPipelineSpark --input test_sample.bam --filter-bwa-image hg19mini.fasta.img --kmer-file hg19mini.hss --min-clipped-read-length 70 --microbe-fasta e_coli_k12.fasta --microbe-bwa-image e_coli_k12.fasta.img --taxonomy-file e_coli_k12.db --output output.pathseq.bam --scores-output output.pathseq.txt
18:57:39.629 WARN  SparkContextFactory - Environment variables HELLBENDER_TEST_PROJECT and HELLBENDER_JSON_SERVICE_ACCOUNT_KEY must be set or the GCS hadoop connector will not be configured properly
18:57:39.729 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/bioinfo/Installers/gatk4/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
18:57:41.594 INFO  PathSeqPipelineSpark - ------------------------------------------------------------
18:57:41.594 INFO  PathSeqPipelineSpark - The Genome Analysis Toolkit (GATK) v4.1.0.0
18:57:41.594 INFO  PathSeqPipelineSpark - For support and documentation go to https://software.broadinstitute.org/gatk/
18:57:41.739 INFO  PathSeqPipelineSpark - Initializing engine
18:57:41.739 INFO  PathSeqPipelineSpark - Done initializing engine
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/03/05 18:57:41 INFO SparkContext: Running Spark version 2.2.0
18:57:41.968 WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:57:42.155 INFO  PathSeqPipelineSpark - Shutting down engine
[5 March, 2019 6:57:42 PM IST] org.broadinstitute.hellbender.tools.spark.pathseq.PathSeqPipelineSpark done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=645922816
Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:546)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:373)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at org.broadinstitute.hellbender.engine.spark.SparkContextFactory.createSparkContext(SparkContextFactory.java:178)
    at org.broadinstitute.hellbender.engine.spark.SparkContextFactory.getSparkContext(SparkContextFactory.java:110)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:28)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
Caused by: java.net.UnknownHostException: bioinfo: bioinfo: unknown error
    at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
    at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:891)
    at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:884)
    at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:884)
    at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
    at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.util.Utils$.localHostName(Utils.scala:941)
    at org.apache.spark.internal.config.package$.<init>(package.scala:204)
    at org.apache.spark.internal.config.package$.<clinit>(package.scala)
    ... 12 more
Caused by: java.net.UnknownHostException: bioinfo: unknown error
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getLocalHost(InetAddress.java:1500)
    ... 21 more
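
For what it's worth, the failure appears to come from Spark being unable to resolve my machine's hostname ("bioinfo"). If that diagnosis is right, a workaround sketch would be one of the following (assumptions: a Linux host where I can edit /etc/hosts, or a shell where Spark picks up SPARK_LOCAL_IP):

    # Option 1: make the hostname resolvable locally (requires root)
    echo "127.0.0.1   bioinfo" | sudo tee -a /etc/hosts

    # Option 2: pin Spark's local address without touching /etc/hosts
    export SPARK_LOCAL_IP=127.0.0.1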

Somewhat off-topic: BroadE GATK4 workshop on Terra


On the events webpage there's an event called "BroadE GATK4 workshop on Terra", but I can't find any information about it at all. Is it an online workshop? Do you have to sign up?

I didn't know a better place to ask this question.

New! Mutect2 for Mitochondrial Analysis


Overcoming barriers to understanding the mitochondrial genome

Announcing a brand new "Best Practices" pipeline for calling SNPs and INDELs in the mitochondrial genome! Calling low variant allele fraction (VAF) alleles in the mitochondrial genome presents special problems, but comes with great rewards, including diagnosing rare diseases and identifying asymptomatic carriers of pathogenic variants. We're excited to begin using this pipeline on tens of thousands of diverse samples from the gnomAD project (http://gnomad.broadinstitute.org/about) to gain a greater understanding of population genetics from the perspective of mitochondrial DNA.

Mitochondrial genome - a history of challenges

We had often been advised to "try using a somatic caller", since we expect mitochondria to have variable allele fraction variants, but we had never actually tried it ourselves. Over the past year we focused on creating truth data for low allele fraction variants on the mitochondria, and on developing a production-quality, high-throughput pipeline that overcomes the unique challenges that calling SNPs and INDELs on the mitochondria presents.

See below for the four challenges to unlocking the mitochondrial genome, and how we've improved our pipeline to overcome each of them.

1. Mitochondria have a circular genome

Though the genome is linearized in the typical references we use, the breakpoint is artificial -- purely for the sake of bioinformatic convenience. Since the breakpoint is inside the “control region”, which is non-coding but highly variable across people, we want to be sensitive to variation in that region, to capture the most genetic diversity.

2. A pushy genome makes for difficult mapping

The mitochondrial genome has inserted itself into the autosomal genome many times throughout human evolution - and continues to do so. These regions in the autosomal genome, called Nuclear Mitochondrial DNA segments (NuMTs), make mapping difficult: if the sequences are identical, it’s hard to know if a read belongs in an autosomal NuMT or the mitochondrial contig.

3. Most mitochondria are normal

Variation in the mitochondria can have very low heteroplasmy. In fact, the variation “signal” can be comparable to the inherent sequencer noise, but the scientific community tasked us with calling 1% allele fraction sites with as much accuracy as we can. Our pipeline achieves 99% sensitivity at 5% VAF at depths greater than 1000. With depth in the thousands or tens of thousands of reads for most whole genome mitochondrial samples, it should be possible to call most 1% allele fraction sites with high confidence.

4. High depth coverage is a blessing… and a curse

The mitochondrial contig typically has extremely high depth in whole genome sequence data:
around 2000x for a typical blood sample compared to autosomes (typically ~30x coverage). Samples from mitochondria-rich tissues like heart and muscle have even higher depth (e.g. 80,000x coverage). This depth is a blessing for calling low-allele fraction sites with confidence, but can overwhelm computations that use algorithms not designed to handle the large amounts of data that come with this extreme depth.

Solving a geometry problem by realigning twice

We’ve solved the first problem by extracting reads that align to carefully selected NuMT regions and the mitochondria itself, from a whole genome sample. We take these aligned, recalibrated reads and realign them twice: once to the canonical mitochondria reference, and once to a “rotated” mitochondria reference that moves the breakpoint from the control region to the opposite side of the circular contig.

To help filter out NuMTs, we mark reads with their original alignment position before realigning to the mitochondrial contig. Then we use Mutect2 filters tuned to the high depth we expect from the mitochondria, by running Mutect2 in “--mitochondria-mode”. We increase accuracy on the “breakpoint” location by calling only the non-control region on the original mitochondria reference, and call the control region on the shifted reference (now in the middle of the linearized chromosome). We then shift the calls on the rotated reference back to the original mitochondrial reference and merge the VCFs.
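
For illustration, a bare-bones invocation of the calling step might look like this (a sketch with placeholder file names; --mitochondria-mode is the flag described above, and the call against the shifted reference is analogous):

    # Sketch: call the non-control region against the canonical chrM reference.
    # File names are placeholders; repeat against the shifted reference for the
    # control region, then shift back and merge as described above.
    gatk Mutect2 \
        -R Homo_sapiens_assembly38.chrM.fasta \
        -I sample.chrM.realigned.bam \
        -L chrM \
        --mitochondria-mode \
        -O sample.chrM.vcf.gz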

Adaptive pruning for high sensitivity and precision

The rest of the problems benefited from recent improvements to the local assembly code that Mutect2 shares with HaplotypeCaller (HC). Incorporating the new adaptive pruning strategy in the latest version of Mutect2 improves sensitivity and precision for samples with varying depth across the mitochondrial reference, and enables us to adapt our pipeline to exome and RNA samples. See the blog post on the newest version of Mutect2 here.

Unblocking genetic bottlenecks with a new pipeline

The new pipeline’s high sensitivity to low allele fraction variants is especially powerful since low AF variants may be at higher AF in other tissue.

Our pipeline harnesses the power of low AFs to help:

1. Diagnose rare diseases

Mutations can be at high allele fraction in affected tissues but low in blood samples typically used for genetic testing.

2. Identify asymptomatic carriers of pathogenic variants

If you carry a pathogenic allele even at low VAF, you can pass this along at high VAF to your offspring.

3. Discover somatic variants in tissues or cell lineages

For example, studies have used rare somatic mtDNA variants for lineage tracing in single-cell RNA-seq studies.

You can find the WDLs used to run this pipeline in the GATK repo under the scripts directory (https://github.com/broadinstitute/gatk/blob/master/scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl). Keep an eye out for an official “Best Practices” pipeline, coming soon in the gatk-workflows repo and in Firecloud.

Caveat: we're not so confident in calls under 5% AF (due to false positives from NuMTs). We're working on a longer-term fix for a future release.

VariantRecalibrator tranche plots have a lot of false positives

Hello!

I am working with data from 122 human whole exomes, captured using SeqCap EZ Prime Exome. My software versions are GATK 3.8.0 and java 1.8.0_131.

After following the Best Practices guidelines, I have gotten tranche plots from VariantRecalibrator that show a high proportion of 'false positives' in my novel variants (resulting from a low Ti/Tv ratio). I can't find anything this extreme on the forum, and I'm wondering if I may be doing something wrong with my variant calling.

The command that produced the tranche plots is:

```
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T VariantRecalibrator \
-R hg38.fa \
-input SNP.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
-resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg38.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode SNP \
-recalFile SNP.recal \
-tranchesFile SNP.tranches \
-rscriptFile SNP.plots.R
```

As you can see in 'all_SNPs.pdf', something like 40% of the novel SNPs are estimated to be false positives. 'more_tranches.pdf' shows that lowering the truth threshold does not resolve this (though it does discard a ton of SNPs).

As an alternative, I did hard filtering based on the distributions of all my annotations in R. (They looked pretty normal except for QD, I think because of high depths--see 'QD.png' attached here, and QUAL by DP plots in the thread for Discussion 23514 [sorry, can't post links]).

```
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R hg38.fa \
--variant SNP.vcf \
-o SNP.FILT.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 55.0 || MQRankSum < -1.0 || ReadPosRankSum < -2.5 || SOR > 2.5 || DP < 500 || InbreedingCoeff < -0.1" \
--filterName "HARDFILTER"
```

I then ran VariantRecalibrator on the hard-filtered variants to see what would happen. Hard filtering reduces the false positive estimates a little (see 'hard_filtered_SNPs.pdf'), but it does not really solve this problem.

I ran VariantEval on the filtered variants to get a better idea of what was going on, and found the following:

My data:

    Subset                        Ti/Tv
    All SNPs                      2.23
    SNPs in dbSNP (68% of total)  2.65
    Novel SNPs (32% of total)     1.52

So, it seems like my SNPs that also appear in dbSNP are alright, but the novel ones are not trustworthy.

One obvious option is to just filter out any variant not found in an existing database. This is OK for my purposes, since I'm looking for effects of common variants. But it still gives me pause that my novel variants can't be trusted. Any ideas about what could lead to such a low Ti/Tv in an exome dataset? (Note: I used '-L PrimeExome.intervals -ip 100' at the relevant steps.)
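
To illustrate that option, I could restrict to database sites with something like the following (a sketch; it assumes GATK3 SelectVariants' --concordance argument does what I want here, and reuses the dbSNP resource from my VQSR command):

```
java -Xmx16000m -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R hg38.fa \
-V SNP.FILT.vcf \
--concordance dbsnp_138.hg38.vcf.gz \
-o SNP.KNOWN.vcf
```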

Thanks a lot!

Error when running CNVDiscovery in a batch-like way: “Read count cache file is truncated”


Dear Genome STRiP users,

I am running the CNVDiscovery pipeline in a batch-like way, and it always fails in batch No. 4 and batch No. 23 with the following error:

INFO  02:38:02,459 RefineCNVBoundaries - Initialized data set: 1 file, 769 read groups, 98 samples. 
INFO  02:38:02,927 ReadCountCache - Initializing read count cache with 1 file. 
mInputFile=file:///proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_batch_success/4/md_tempdir/rccache.bin mCurrentSequenceName=chr16; mCurrentPosition=500001
Exception in thread "main" java.lang.RuntimeException: Read count cache file file:///proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_batch_success/4/md_tempdir/rccache.bin is truncated
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:65)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
    at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
    at org.broadinstitute.sv.genotyping.RefineCNVBoundaries.main(RefineCNVBoundaries.java:133)
Caused by: java.lang.RuntimeException: Read count cache file file:///proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_batch_success/4/md_tempdir/rccache.bin is truncated
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.decodeRow(ReadCountFileReader.java:516)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.getReadCacheItems(ReadCountFileReader.java:470)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.aggregateSampleReadCounts(ReadCountFileReader.java:476)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader.getReadCounts(ReadCountFileReader.java:266)
    at org.broadinstitute.sv.common.ReadCountCache.getReadCounts(ReadCountCache.java:100)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:295)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:245)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getReadCounts(GenotypingDepthModule.java:230)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getCnpReadCounts(GenotypingDepthModule.java:217)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.genotypeCnp(GenotypingDepthModule.java:141)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.genotypeCnp(BoundaryRefinementAlgorithm.java:287)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineOneBoundary(BoundaryRefinementAlgorithm.java:633)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineBoundaryStep(BoundaryRefinementAlgorithm.java:553)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineBoundaries(BoundaryRefinementAlgorithm.java:536)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.processVariant(BoundaryRefinementAlgorithm.java:232)
    at org.broadinstitute.sv.genotyping.RefineCNVBoundaries.run(RefineCNVBoundaries.java:204)
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
    ... 5 more 
INFO  02:38:16,126 QGraph - Writing incremental jobs reports... 

I divided all 3418 samples into 33 batches: batches #0-31 with 100 samples each, and batch #32 with 218 samples.

Besides, I have also successfully run the CNVDiscovery pipeline on all 3418 samples in a single run. Does that mean there is no error in my BAM files?

May I have your suggestions? Thank you in advance.

Best regards,
Wusheng

Error when running SVCNVDiscovery in batch-like way: Read count cache file is truncated


Dear Genome STRiP users,

I am running the SVCNVDiscovery process in a batch-like way. To be precise, I have 3418 samples, which I divided into 33 batches: the first 32 batches with 100 samples each, and the 33rd batch with 218 samples. I completed SVPreprocess on each batch without any error. Next, I ran SVCNVDiscovery on each batch with the following script.

classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
svpreprocess_dir="/proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_batch_success/${1}"
rundir="/proj/yunligrp/users/minzhi/gs_test_svcnvdiscovery/${1}"

java -Xmx4g -cp ${classpath} \
    org.broadinstitute.gatk.queue.QCommandLine \
    -S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
    -S ${SV_DIR}/qscript/SVQScript.q \
    -cp ${classpath} \
    -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
    -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
    -R /proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
    -I /proj/yunligrp/users/minzhi/gs_script/NWD.recab_${1}.list \
    -genderMapFile /proj/yunligrp/users/minzhi/gs_script/JHS_full_all_male_gender_${1}.map \
    -ploidyMapFile /proj/yunligrp/users/minzhi/gs_script/standard_ploidy.map \
    -md ${svpreprocess_dir}/md_tempdir \
    -runDirectory ${rundir} \
    -jobLogDir ${rundir}/logs \
    -intervalList /proj/yunligrp/users/minzhi/gs_script/reference_chromosomes16_1-500000.list \
    -tilingWindowSize 1000 \
    -tilingWindowOverlap 500 \
    -maximumReferenceGapLength 1000 \
    -boundaryPrecision 100 \
    -minimumRefinedLength 500 \
    -jobRunner Drmaa \
    -gatkJobRunner Drmaa \
    -jobNative "--mem-per-cpu=1000 --time=05:00:00 --nodes=1 --ntasks-per-node=4" \
    -jobQueue general \
    -run \
    || exit 1

However, the 5th batch and the 24th batch always fail with the error below ("always" meaning I have re-run these two batches more than 200 times). All other batches complete without any error.

…
ERROR 02:38:16,120 FunctionEdge - Error:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs_test_svcnvdiscovery/4/.queue/tmp'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar'  'org.broadinstitute.sv.genotyping.RefineCNVBoundaries'  '-I' '/proj/yunligrp/users/minzhi/gs_test_svcnvdiscovery/4/cnv_stage6/seq_chr16/seq_chr16.merged_headers.bam'  '-O' '/proj/yunligrp/users/minzhi/gs_test_svcnvdiscovery/4/cnv_stage7/seq_chr16/P0007/seq_chr16.merged.brig.vcf'  '-R' '/proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta'  '-md' '/proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_batch_success/4/md_tempdir'  '-configFile' '/proj/yunligrp/users/minzhi/svtoolkit/conf/genstrip_parameters.txt' '-configFile' '/proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.gsparams.txt'  '-P' 'depth.readCountCacheIgnoreGenomeMask:true'  '-genomeMaskFile' '/proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.svmask.fasta' '-genomeMaskFile' '/proj/yunligrp/users/minzhi/Homo_sapiens_assembly38/Homo_sapiens_assembly38.lcmask.fasta'  '-genderMapFile' '/proj/yunligrp/users/minzhi/gs_script/JHS_full_all_male_gender_4.map'  '-ploidyMapFile' '/proj/yunligrp/users/minzhi/gs_script/standard_ploidy.map'  '-vcf' '/proj/yunligrp/users/minzhi/gs_test_svcnvdiscovery/4/cnv_stage4/seq_chr16/seq_chr16.merged.genotypes.vcf.gz'  '-site' '/proj/yunligrp/users/minzhi/gs_test_svcnvdiscovery/4/cnv_stage7/seq_chr16/P0007.sites.list'  '-boundaryPrecision' '100'  '-minimumRefinedLength' '500'  '-maximumReferenceGapLength' '1000'  
ERROR 02:38:16,124 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs_test_svcnvdiscovery/4/cnv_stage7/seq_chr16/logs/CNVDiscoveryStage7-8.out:
INFO  02:38:02,034 HelpFormatter - 
…
INFO  02:38:02,927 ReadCountCache - Initializing read count cache with 1 file. 
mInputFile=file:///proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_batch_success/4/md_tempdir/rccache.bin mCurrentSequenceName=chr16; mCurrentPosition=500001
Exception in thread "main" java.lang.RuntimeException: Read count cache file file:///proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_batch_success/4/md_tempdir/rccache.bin is truncated
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:65)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
    at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
    at org.broadinstitute.sv.genotyping.RefineCNVBoundaries.main(RefineCNVBoundaries.java:133)
Caused by: java.lang.RuntimeException: Read count cache file file:///proj/yunligrp/users/minzhi/gs_test_svpreprocess_fulllist_batch_success/4/md_tempdir/rccache.bin is truncated
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.decodeRow(ReadCountFileReader.java:516)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.getReadCacheItems(ReadCountFileReader.java:470)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader$ReadCountDataIterator.aggregateSampleReadCounts(ReadCountFileReader.java:476)
    at org.broadinstitute.sv.metadata.depth.ReadCountFileReader.getReadCounts(ReadCountFileReader.java:266)
    at org.broadinstitute.sv.common.ReadCountCache.getReadCounts(ReadCountCache.java:100)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:295)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.computeRefReadCounts(GenotypingDepthModule.java:245)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getReadCounts(GenotypingDepthModule.java:230)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.getCnpReadCounts(GenotypingDepthModule.java:217)
    at org.broadinstitute.sv.genotyping.GenotypingDepthModule.genotypeCnp(GenotypingDepthModule.java:141)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.genotypeCnp(BoundaryRefinementAlgorithm.java:287)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineOneBoundary(BoundaryRefinementAlgorithm.java:633)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineBoundaryStep(BoundaryRefinementAlgorithm.java:553)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.refineBoundaries(BoundaryRefinementAlgorithm.java:536)
    at org.broadinstitute.sv.genotyping.BoundaryRefinementAlgorithm.processVariant(BoundaryRefinementAlgorithm.java:232)
    at org.broadinstitute.sv.genotyping.RefineCNVBoundaries.run(RefineCNVBoundaries.java:204)
    at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
    ... 5 more 
INFO  02:38:16,126 QGraph - Writing incremental jobs reports... 
INFO  02:38:16,126 QGraph - 4 Pend, 0 Run, 1 Fail, 7 Done 
INFO  02:38:16,128 QCommandLine - Writing final jobs report... 
INFO  02:38:16,129 QCommandLine - Done with errors
...

Besides, I have tried other ways to run SVCNVDiscovery:
1. Completed SVPreprocess on all 3418 samples together, then ran SVCNVDiscovery in the batch-like way (batches divided as above); the same error occurred, again in the 5th and 24th batches.
2. Completed both SVPreprocess and SVCNVDiscovery on all 3418 samples together; there was no error in either process.

May I have your suggestions on this situation?

Thank you very much.

Best regards,
Wusheng

VCF and pedigree info

Hi,

I have trio WES data on which I'm trying to run a Mendelian error rate check with RTG Tools.

Following the GATK Best Practices, I ran BWA, SortSam, MarkDuplicates, BaseRecalibrator, BQSR, and HaplotypeCaller. Since I want GVCFs from HaplotypeCaller, I ran the three trio samples separately:

gatk HaplotypeCaller -R ./Homo_sapiens_assembly38.fasta -I WBTB-001A.sort.dup.gatk.recal.bqsr.bam -ERC GVCF -L ./truseq-exome-targeted-regions-manifest-v1-2.hg38.bed --dbsnp ./Homo_sapiens_assembly38.dbsnp138.vcf -O WBTB-001A.g.vcf.gz


Then I ran CombineGVCFs and GenotypeGVCFs

gatk CombineGVCFs -R ./Homo_sapiens_assembly38.fasta --variant ./gatk2.001A/WBTB-001A.g2.vcf --variant ./gatk2.001B/WBTB-001B.g2.vcf --variant ./gatk2.001C/WBTB-001C.g2.vcf -o ./gatk/gatk2.WBTB-001.c.g.vcf

gatk GenotypeGVCFs -R ./Homo_sapiens_assembly38.fasta --variant ./gatk/gatk2.cgvcf/gatk2.WBTB-001.c.g.vcf -O ./gatk/gatk2.WBTB-001.gatk.genotype.vcf

==========================================================
For RTG mendelian I created a pedigree file (tab-delimited) WBTB-001.ped,

WBTB-001 A B C 1 2
WBTB-001 B 0 0 1 1
WBTB-001 C 0 0 2 1
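
For clarity, the columns I intended are annotated below (my understanding, not verified: the individual IDs in column 2 must exactly match the sample names in the VCF header):

    # family-ID  individual-ID  father-ID  mother-ID  sex(1=male,2=female)  phenotype
    WBTB-001  A  B  C  1  2
    WBTB-001  B  0  0  1  1
    WBTB-001  C  0  0  2  1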

./tools/rtg-tools/dist/rtg-tools-3.9.1-6dde278/rtg mendelian -i ./gatk/gatk2.WBTB-001.gatk.genotype.vcf -t ./ref/ref.hg38.sdf/ --pedigree=./WBTB-001.ped > ./gatk/WBTB.001.gatk.results.txt

However, it came back with "No family information found, no checking done."


Just wondering if I had missed a step for incorporating pedigree info into the VCFs.
Could you give us some suggestions on how to solve this?

Thank you

Mutect2 allele specific stats for Multiallelic sites

Hello,

I want to get allele-specific stats for multi-allelic sites. I was able to get this information from other somatic variant callers, but I couldn't get it from either Mutect2 or Mutect1.

If you could suggest one possible way to do this with either Mutect1 (which I understand you aren't supporting right now) or Mutect2, I would appreciate it.

Thanks in advance, Gufran

Feedback on approach to create a custom truth set for VQSR


Hello!

I would like to ask for your feedback on my approach to constructing a truth set, since no such resource exists for my species.

What I am doing is to:
1/ call variants with GATK best practices, joint-calling with GenotypeGVCFs
2/ call variants with another caller (samtools mpileup -> bcftools call)
3/ filter each set by retaining sites at which all samples have a depth of at least 10 (DP>=10) and a genotype quality of at least 30 (GQ>=30) in the FORMAT fields
4/ use the retained sites common to both callers as the truth set for VQSR

My reasoning was that sites called by two different algorithms, with GQ>=30 and DP>=10 in all samples of the cohort, are very likely to be true variants, and their annotations can be used to learn the rules for what a good variant looks like.
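
Concretely, steps 3 and 4 look roughly like this (a sketch with placeholder file names; assumes bcftools is installed and both call sets are bgzipped and indexed):

    # Step 3: keep sites where every sample has DP>=10 and GQ>=30, per call set
    bcftools view -i 'MIN(FMT/DP)>=10 && MIN(FMT/GQ)>=30' gatk.vcf.gz -Oz -o gatk.filt.vcf.gz
    bcftools view -i 'MIN(FMT/DP)>=10 && MIN(FMT/GQ)>=30' mpileup.vcf.gz -Oz -o mpileup.filt.vcf.gz
    bcftools index gatk.filt.vcf.gz
    bcftools index mpileup.filt.vcf.gz

    # Step 4: keep only the sites present in both filtered sets as the truth set
    bcftools isec -n=2 -w1 gatk.filt.vcf.gz mpileup.filt.vcf.gz -Oz -o truthset.vcf.gz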

I would like to know whether my reasoning makes sense to you and, if so, what you would suggest I change/add/remove (for example, I am not completely convinced about retaining sites only if all samples pass the minimum GQ and DP; what if only one sample passes the condition?).

I greatly appreciate your feedback. Thanks in advance!

Using CAVA-Annotated VCF file for VariantsToTable


Hi all,

I have generated a VCF file via the 5-dollar genome analysis pipeline, then used a script called CAVA (https://tinyurl.com/y6bjhskc) to annotate the variants in it. CAVA added some 18 extra ##INFO lines to the beginning of the VCF file, and extra fields to the INFO column of each variant, as shown below. I wanted to try the VariantsToTable function in the GATK toolbox, testing with GATK v3.3, and I am getting this error:

ERROR MESSAGE: Your input file has a malformed header: unexpected tag count 6 in line <ID=TYPE,Number=.,Type=String,Description="Variant type: Substitution, Insertion, Deletion or Complex",Source="CAVA",Version="1.2.2">

VCF file:

##fileformat=VCFv4.2
##fileDate=2019-03-04
##INFO=<ID=TYPE,Number=.,Type=String,Description="Variant type: Substitution, Insertion, Deletion or Complex",Source="CAVA",Version="1.2.2">
##INFO=<ID=GENE,Number=.,Type=String,Description="HGNC gene symbol",Source="CAVA",Version="1.2.2">
##INFO=<ID=TRANSCRIPT,Number=.,Type=String,Description="Transcript identifier",Source="CAVA",Version="1.2.2">
##INFO=<ID=GENEID,Number=.,Type=String,Description="Gene identifier",Source="CAVA",Version="1.2.2">
##INFO=<ID=TRINFO,Number=.,Type=String,Description="Transcript information: Strand/Length of transcript/Number of exons/Length of coding DNA + UTR/Protein length",Source="CAVA",Version="1.2.2">
##INFO=<ID=LOC,Number=.,Type=String,Description="Location of variant in transcript",Source="CAVA",Version="1.2.2">
##INFO=<ID=CSN,Number=.,Type=String,Description="CSN annotation",Source="CAVA",Version="1.2.2">
##INFO=<ID=PROTPOS,Number=.,Type=String,Description="Protein position",Source="CAVA",Version="1.2.2">
##INFO=<ID=PROTREF,Number=.,Type=String,Description="Reference amino acids",Source="CAVA",Version="1.2.2">
##INFO=<ID=PROTALT,Number=.,Type=String,Description="Alternate amino acids",Source="CAVA",Version="1.2.2">
##INFO=<ID=CLASS,Number=.,Type=String,Description="5PU: Variant in 5 prime untranslated region, 3PU: Variant in 3 prime untranslated region, INT: Intronic variant that does not alter splice site bases, SS: Intronic variant that alters a splice site base but not an ESS or SS5 base, ESS: Variant that alters essential splice site base (+1,+2,-1,-2), SS5: Variant that alters the +5 splice site base, but not an ESS base, SY: Synonymous change caused by a base substitution (i.e. does not alter amino acid), NSY: Nonsynonymous change (missense) caused by a base substitution (i.e. alters amino acid), IF: Inframe insertion and/or deletion (variant alters the length of coding sequence but not the frame), IM: Variant that alters the start codon, SG: Variant resulting in stop-gain (nonsense) mutation, SL: Variant resulting in stop-loss mutation, FS: Frameshifting insertion and/or deletion (variant alters the length and frame of coding sequence), EE: Inframe deletion, insertion or base substitution which affects the first or last three bases of the exon",Source="CAVA",Version="1.2.2">
##INFO=<ID=SO,Number=.,Type=String,Description="Sequence Ontology term",Source="CAVA",Version="1.2.2">
##INFO=<ID=ALTFLAG,Number=.,Type=String,Description="None: variant has the same CSN annotation regardless of its left or right-alignment, AnnNotClass/AnnNotSO/AnnNotClassNotSO: indel has an alternative CSN but the same CLASS and/or SO, AnnAndClass/AnnAndSO/AnnAndClassNotSO/AnnAndSONotClass/AnnAndClassAndSO: Multiple CSN with different CLASS and/or SO annotations",Source="CAVA",Version="1.2.2">
##INFO=<ID=ALTANN,Number=.,Type=String,Description="Alternate CSN annotation",Source="CAVA",Version="1.2.2">
##INFO=<ID=ALTCLASS,Number=.,Type=String,Description="Alternate CLASS annotation",Source="CAVA",Version="1.2.2">
##INFO=<ID=ALTSO,Number=.,Type=String,Description="Alternate SO annotation",Source="CAVA",Version="1.2.2">
##INFO=<ID=IMPACT,Number=.,Type=String,Description="Impact group the variant is stratified into",Source="CAVA",Version="1.2.2">
##INFO=<ID=DBSNP,Number=.,Type=String,Description="rsID from dbSNP",Source="CAVA",Version="1.2.2">

.
.
.

#CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    21840336    .   T   C   40.74   PASS    AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=20.37;SOR=2.303;TYPE=Substitution;TRANSCRIPT=ENST00000374840;GENE=ALPL;GENEID=ENSG00000162551;TRINFO=+/69.0kb/12/2.6kb/524;LOC=In1/2;CSN=c.-105+4326T>C;PROTPOS=.;PROTREF=.;PROTALT=.;CLASS=5PU;SO=5_prime_UTR_variant;IMPACT=3;ALTANN=.;ALTCLASS=.;ALTSO=. GT:AD:DP:GQ:PL  1/1:0,2:2:6:68,6,0

Is there a simple solution to this problem? I could remove all the extra lines manually for now, but for future automation purposes I would like to avoid having a parser or parser-like script in between.

Secondly, I also wanted to ask whether it is possible to retrieve the extra fields CAVA added to the INFO column, such as "GENE=" or "GENEID=" (which do not exist in the original VCF file), via VariantsToTable.

Thanks for your time and help.

Intervals in Joint Discovery WDL


Hi folks. We are trying to use the sample Joint Discovery WDL on Terra/FireCloud for joint discovery of some canine gVCFs. I've had to make some edits to remove human-specific resources, but here's the original:

https://app.terra.bio/#workspaces/help-gatk/Germline-SNPs-Indels-GATK4-hg38/tools/gatk/3-Joint-Discovery

We are running up against a problem with intervals for scattering. The WDL appears to generate its own list of intervals to scatter across (JointGenotyping.DynamicallyCombineIntervals). The minimum number of intervals appears to be the number of chromosomes.

In dog, we have about 3000 unconnected "ChrUn" chromosome fragments, which we treat as individual chromosomes and which we wish to include in genotyping. So when I use this Joint Discovery WDL, it declines to merge any of the ChrUns together (even though some of them are as small as 2 kb) and scatters across 3000+ jobs, which is very expensive.

I'm not sure how to progress. Is there a way to combine the ChrUns together into one chromosome that JointGenotyping.DynamicallyCombineIntervals will accept?

Word on the street is that the author of this task was Jose, if that is helpful.

Best,
Jessica

Should the BaseRecalibrator command include a BED file for WES or panel SNV calling?

(How to part II) Sensitively detect copy ratio alterations and allelic segments


Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.


This workflow is broken into two tutorials. You are currently on the second part. See Tutorial#11682 for the first part.

For this second part, segmentation is at the heart, performed by ModelSegments. In segmentation, contiguous copy ratios are grouped together into segments. The tool performs segmentation for both copy ratios and for allelic copy ratios, given allelic counts. The segmentation is informed by both types of data, i.e. the tool uses allelic data to refine copy ratio segmentation and vice versa. The tutorial refers to this multi-data approach as joint segmentation. The presented commands showcase the full features of the tools. It is also possible to perform segmentation for each data type independently, i.e. based solely on copy ratios or solely on allelic counts.

The tutorial illustrates the workflow using a paired sample set. Specifically, detection of allelic copy ratios uses a matched control, i.e. the HCC1143 tumor sample is analyzed using a control, the HCC1143 blood normal. It is possible to run the workflow without a matched-control. See section 8.1 for considerations in interpreting allelic copy ratio results for different modes and for different purities.

The GATK4 CNV workflow offers a multitude of levers, e.g. for fine-tuning analyses and for quality control. Researchers are expected to tune workflow parameters on samples with copy number profiles similar to that of the case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters


5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts

CollectAllelicCounts will tabulate counts of the reference allele and counts of the dominant alternate allele for each site in a given genomic intervals list. The tutorial performs this step for both the case sample, the HCC1143 tumor, and the matched-control, the HCC1143 blood normal. This allele-specific coverage collection is just that--raw coverage collection without any statistical inferences. In the next section, ModelSegments uses the allele counts towards estimating allelic copy ratios, which in turn the tool uses to refine segmentation.

Collect allele counts for the case and the matched-control alignments independently with the same intervals. For the matched-control analysis, the allelic count sites for the case and control must match exactly. Otherwise, ModelSegments, which takes the counts in the next step, will error. Here we use an intervals list that subsets gnomAD biallelic germline SNP sites to those within the padded, preprocessed exome target intervals [9].

The tutorial has already collected allele counts for full length sample BAMs. To demonstrate coverage collection, the following command uses the small BAMs originally made for Tutorial#11136 [6]. The tutorial does not use the resulting files in subsequent steps.

Collect counts at germline variant sites for the matched-control

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I normal.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_N_clean.allelicCounts.tsv

Collect counts at the same sites for the case sample

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I tumor.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_T_clean.allelicCounts.tsv

This results in counts table files. Each data file has header lines that start with an @ asperand symbol, e.g. @HD, @SQ and @RG lines, followed by a table of data with six columns. An example snippet is shown.
[Images: T_allelicCounts example header and data snippet]

Comments on select parameters

  • The tool requires one or more genomic intervals specified with -L. The intervals can be either a Picard-style intervals list or a VCF. See Article#1109 for descriptions of formats. The sites should be common and/or sample-specific germline SNP-only variant sites; omit indel-type and mixed-variant-type sites.
  • The tool requires the reference genome, specified with -R, and aligned reads, specified with -I.
  • As is the case for most GATK tools, the engine filters reads upfront using a number of read filters. Of note for CollectAllelicCounts is the MappingQualityReadFilter. By default, the tool sets the filter's --minimum-mapping-quality to thirty. As a result, the tool will include reads with MAPQ30 and above in the analysis [10].

☞ 5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?

Another GATK tool, GetPileupSummaries, similarly counts reference and alternate alleles. The resulting summaries are meant for use with CalculateContamination in estimating cross-sample contamination. GetPileupSummaries limits counts collections to those sites with population allele frequencies set by the parameters --minimum-population-allele-frequency and --maximum-population-allele-frequency. Details are here.

CollectAllelicCounts employs fewer engine-level read filters than GetPileupSummaries. Of note, both tools use the MappingQualityReadFilter. However, each sets a different threshold with the filter. GetPileupSummaries uses a --minimum-mapping-quality threshold of 50. In contrast, CollectAllelicCounts sets the --minimum-mapping-quality parameter to 30. In addition, CollectAllelicCounts filters on base quality. The base quality threshold is set with the --minimum-base-quality parameter, whose default is 20.


back to top


6. Group contiguous copy ratios into segments with ModelSegments

ModelSegments groups together copy and allelic ratios that it determines to be contiguous on the same segment. A Gaussian-kernel binary-segmentation algorithm differentiates ModelSegments from the GATK4.beta tool it replaces, PerformSegmentation, which used a CBS (circular binary segmentation) algorithm. ModelSegments' kernel algorithm enables efficient segmentation of dense data, e.g. that of whole genome sequences. A discussion of preliminary algorithm performance is here.

The algorithm performs segmentation for both copy ratios and for allelic copy ratios jointly when given both datatypes together. For allelic copy ratios, ModelSegments uses only those sites it determines are heterozygous, either in the control in a paired analysis or in the case in a case-only analysis [11]. In the paired analysis, the tool models allelic copy ratios in the case using sites for which the control is heterozygous. The workflow defines allelic copy ratios in terms of alternate-allele fraction, where total allele fractions for reference allele and alternate allele add to one for each site.

For the following command, be sure to specify an existing --output directory or . for the current directory.

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --output sandbox \
    --output-prefix hcc1143_T_clean

This produces nine files, each with the basename hcc1143_T_clean, in the current directory and listed below. The param files contain global parameters for copy ratios (cr) and allele fractions (af), and the seg files contain data on the segments. For either type of data, the tool gives data before and after segmentation smoothing. The tool documentation details what each file contains. The last two files, labeled hets, contain the allelic counts for the control's heterozygous sites. Counts are for the matched control (normal) and the case.

  1. hcc1143_T_clean.modelBegin.seg
  2. hcc1143_T_clean.modelFinal.seg
  3. hcc1143_T_clean.cr.seg
  4. hcc1143_T_clean.modelBegin.af.param
  5. hcc1143_T_clean.modelBegin.cr.param
  6. hcc1143_T_clean.modelFinal.af.param
  7. hcc1143_T_clean.modelFinal.cr.param
  8. hcc1143_T_clean.hets.normal.tsv
  9. hcc1143_T_clean.hets.tsv

The tool has numerous adjustable parameters and these are described in the ModelSegments tool documentation. The tutorial uses the default values for all of the parameters. Adjusting parameters can change the resolution and smoothness of the segmentation results.

Comments on select parameters

  • The tool accepts both or either copy-ratios (--denoised-copy-ratios) or allelic-counts (--allelic-counts) data. The matched-control allelic counts (--normal-allelic-counts) is optional. If given both types of data, then copy ratios and allelic counts data together inform segmentation for both copy ratio and allelic segments. If given only one type of data, then segmentation is based solely on the given type of data.
  • The --minimum-total-allele-count parameter is set to 30 by default. This means the tool considers only sites with a read depth of 30 or more for allelic copy ratios.
  • The --genotyping-homozygous-log-ratio-threshold option is set to -10.0 by default. Increase this to increase the number of sites assumed to be heterozygous for modeling.
  • Default smoothing parameters are optimized for faster performance, given the size of whole genomes. The --maximum-number-of-smoothing-iterations option caps smoothing iterations at 25. The number of MCMC samples is set to 100 for both copy-ratio and allele-fraction sampling, by the --number-of-samples-copy-ratio and --number-of-samples-allele-fraction options, respectively. Finally, --number-of-smoothing-iterations-per-fit is set to zero by default to disable model refitting between iterations. This means the tool generates only two MCMC fits: an initial and a final fit.

    • GATK4.beta's ACNV set this parameter such that every smoothing iteration triggered an MCMC refit, at the cost of additional compute. For the tutorial data, which is targeted exomes, the default of zero gives 398 segments after two smoothing iterations, while setting --number-of-smoothing-iterations-per-fit to one gives 311 segments after seven smoothing iterations (see the sketch after this list). Section 8 plots these alternative results.
  • For advanced smoothing recommendations, see [12].
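As an illustration, the command below reruns this section's ModelSegments invocation with refitting enabled. Only the final option differs from the earlier command; the output prefix is changed here, as our own precaution, so the earlier results are not overwritten.

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --output sandbox \
    --output-prefix hcc1143_T_clean_refit \
    --number-of-smoothing-iterations-per-fit 1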

Section 8 shows the results of segmentation, the result from changing --number-of-smoothing-iterations-per-fit and the result of allelic segmentation modeled from allelic counts data alone. Section 8.1 details considerations depending on analysis approach and purity of samples. Section 8.2 shows the results of changing the advanced smoothing parameters given in [12].

ModelSegments runs in the following three stages.

  1. Genotypes heterozygous sites, filtering on depth and retaining sites that overlap with copy-ratio intervals.
    • Allelic counts for sites in the control that are heterozygous are written to hets.normal.tsv. For the same sites in the case, allelic counts are written to hets.tsv.
    • If given only allelic counts data, ModelSegments does not apply intervals.
  2. Performs multidimensional kernel segmentation (1, 2).
    • Uses allelic counts within each copy-ratio interval for each contig.
    • Uses denoised copy ratios and heterozygous allelic counts.
  3. Performs Markov-Chain Monte Carlo (MCMC, 1, 2, 3) sampling and segment smoothing. In particular, the tool uses Gibbs sampling and slice sampling. These MCMC samplings inform smoothing, i.e. merging adjacent segments, and the tool can perform multiple iterations of sampling and smoothing [13].
    • Fits initial model. Writes initial segments to modelBegin.seg, posterior summaries for copy-ratio global parameters to modelBegin.cr.param and allele-fraction global parameters to modelBegin.af.param.
    • Iteratively performs segment smoothing and sampling. Fits allele-fraction model [14] until log likelihood converges. This process produces global parameters.
    • Samples final models. Writes final segments to modelFinal.seg, posterior summaries for copy-ratio global parameters to modelFinal.cr.param, posterior summaries for allele-fraction global parameters to modelFinal.af.param and final copy-ratio segments to cr.seg.

At the second stage, the tutorial data generates the following message.

INFO  MultidimensionalKernelSegmenter - Found 638 segments in 23 chromosomes.

At the third stage, the tutorial data generates the following message.

INFO  MultidimensionalModeller - Final number of segments after smoothing: 398

For the tutorial data, the initial segmentation gives 638 segments over 23 contigs; smoothing with default parameters reduces this to 398 segments.


back to top


7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments

CallCopyRatioSegments allows for systematic calling of copy-neutral, amplified and deleted segments. This step is not required for plotting segmentation results. Provide the tool with the cr.seg segmentation result from ModelSegments.

gatk CallCopyRatioSegments \
    --input sandbox/hcc1143_T_clean.cr.seg \
    --output sandbox/hcc1143_T_clean.called.seg

The resulting called.seg data adds a sixth column to the provided copy ratio segmentation table. The tool denotes amplifications with a plus sign (+), deletions with a minus sign (-) and copy-neutral segments with a zero (0).

Here is a snippet of the results.
T_called_seg.png
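For orientation, the header row of the called file should contain columns along the lines of the sketch below; the first five columns carry over from the ModelSegments cr.seg output and the last holds the call. This is a sketch of the expected SEG-style layout, so verify the exact names against your own output.

# Expected column layout of the .called.seg table (illustrative, not quoted output):
# CONTIG  START  END  NUM_POINTS_COPY_RATIO  MEAN_LOG2_COPY_RATIO  CALL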

Comments on select parameters
- The parameters --neutral-segment-copy-ratio-lower-bound (default 0.9) and --neutral-segment-copy-ratio-upper-bound (default 1.1) together set the copy ratio range for copy-neutral segments. These two parameters replace the GATK4.beta workflow’s --neutral-segment-copy-ratio-threshold option.
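For example, to widen the copy-neutral band, one could relax both bounds. The sketch below uses arbitrary illustrative values, not recommendations, and a placeholder output name.

gatk CallCopyRatioSegments \
    --input sandbox/hcc1143_T_clean.cr.seg \
    --neutral-segment-copy-ratio-lower-bound 0.8 \
    --neutral-segment-copy-ratio-upper-bound 1.2 \
    --output sandbox/hcc1143_T_clean.called.relaxed.seg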


back to top


8. Plot modeled copy ratio and allelic fraction segments with PlotModeledSegments

PlotModeledSegments visualizes copy and allelic ratio segmentation results.

gatk PlotModeledSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts sandbox/hcc1143_T_clean.hets.tsv \
    --segments sandbox/hcc1143_T_clean.modelFinal.seg \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces plots in the sandbox/plots folder. The plots represent final modeled segments for both copy ratios and alternate allele fractions. If we are curious about the extent of smoothing provided by MCMC, we can similarly plot the initial kernel segmentation results by substituting in --segments sandbox/hcc1143_T_clean.modelBegin.seg, as shown below.
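Here is the corresponding sketch. Only the --segments argument differs from the command above, along with an output prefix changed, as our own precaution, so the plots do not collide.

gatk PlotModeledSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts sandbox/hcc1143_T_clean.hets.tsv \
    --segments sandbox/hcc1143_T_clean.modelBegin.seg \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean_modelBegin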

Comments on select parameters
- The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping [4].
- To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

As of this writing, it is NOT possible to subset plotting with genomic intervals, i.e. with the -L parameter. To interactively visualize data, consider the following options.

  • Modify the sequence dictionary to contain only the contigs of interest, in the order desired (see the sketch after this list).
  • Convert the data to bedGraph format for targeted exomes or to bigWig format for whole genomes. An example of CNV data converted to bedGraph and visualized in IGV is given in this discussion.
  • Alternatively, researchers versed in R may choose to visualize subsets of data using RStudio.
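A minimal sketch of the first option, assuming a standard SAM-style .dict file in which each @SQ record's second tab-separated field is SN:<contig>; the contig names here are arbitrary examples.

# Keep the @HD header line plus only the @SQ records for chr1 and chr2,
# preserving their original order; adjust the contig list as needed.
awk 'BEGIN { FS = "\t" }
     /^@HD/ { print; next }
     /^@SQ/ { split($2, f, ":"); if (f[2] == "chr1" || f[2] == "chr2") print }' \
    Homo_sapiens_assembly38.dict > chr1_chr2.dict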

Below are three sets of results for the HCC1143 tumor cell line in order of increasing smoothing. The top plot of each set shows the copy ratio segments. The bottom plot of each set shows the allele fraction segments.

  • In the denoised copy ratio segment plot, individual targets still display as points on the plot. Different copy ratio segments are indicated by alternating blue and orange color groups. The denoised median is drawn in thick black.
  • In the allele fraction plot, the boxes surrounding the alternate allelic fractions do NOT indicate the standard deviation or standard error that biomedical researchers may be more familiar with. Rather, the allelic fraction data is given as credible intervals: the plot shows the 10th, 50th and 90th percentiles. These should be interpreted with care, as explained in section 8.1. Individual allele fraction data display as faint data points, also in orange and blue.

8A. Initial segmentation before MCMC smoothing gives 638 segments.
T_modelbegin.modeled.png

8B. Default smoothing gives 398 segments.
T_modelfinal.modeled.png

8C. Enabling additional smoothing iterations per fit gives 311 segments. See section 6 for a description of the --number-of-smoothing-iterations-per-fit parameter.
T_increase_smoothing_1.modeled.png

Smoothing accounts for data points that are outliers. Some of these outliers could be artifactual and therefore not of interest, while others could be true copy number variation that would then be missed. To understand the impact of joint copy ratio and allelic counts segmentation, compare the results of 8B to the single-data segmentation results below. Each plot below shows the results of modeling segmentation on a single type of data, either copy-ratios or allelic counts, using default smoothing parameters.

8D. Copy ratio segmentation based on copy ratios alone gives 235 segments.
T_caseonly.modeled.png

8E. Allelic segmentation result based on allelic counts alone in the matched case gives 105 segments.
T-matched-normal_just_allelic.modeled.png

Compare chr1 and chr2 segmentation for the various plots. In particular, pay attention to the p arm (left side) of chr1 and q arm (right side) of chr2. What do you think is happening when adjacent segments are slightly shifted from each other in some sets but then seemingly at the same copy ratio for other sets?

For allelic counts, ModelSegments retains 16,872 sites that are heterozygous in the control. Of these, the case presents 15,486 usable sites. In allelic segmentation using allelic counts alone, the tool uses all of the usable sites. In the matched-control scenario, ModelSegments emits the following message.

INFO  MultidimensionalKernelSegmenter - Using first allelic-count site in each copy-ratio interval (12668 / 15486) for multidimensional segmentation...

The message informs us that for the matched-control scenario, ModelSegments uses the first allelic-count site in each copy-ratio interval towards multidimensional segmentation. For the tutorial data, this is 12,668 of the 15,486 usable allelic-count sites, or 81.8%. The exclusion of ~20% of allelic-count sites in the joint analysis, together with the absence of copy ratio data informing the allelic-counts-only analysis, accounts for the differences we observe between this and the previous allelic segmentation plot.

In the allele fraction plot, some of the alternate-allele fractions are around 0.35/0.65 and some are at 0/1. We also see alternate-allele fractions around 0.25/0.75 and 0.5. These suggest local ploidies of one, two, three and four.
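One way to reason about these fractions, as a back-of-the-envelope sketch assuming a pure sample and ignoring noise: for a heterozygous site on a segment with allele-specific copy numbers (nA, nB), the expected alternate-allele fraction is nB / (nA + nB). Thus (1, 1) gives 0.5; (2, 1) gives 1/3 ≈ 0.33 and 2/3 ≈ 0.67, close to the observed 0.35/0.65; (3, 1) gives 0.25/0.75; and (1, 0) gives 0/1.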

Is it possible a copy ratio of one is not diploid but represents some other ploidy?

For the plots above, focus on chr4, chr5 and chr17. Based on both the copy ratio and allelic results, what is the zygosity of each of the chromosomes? What proportion of each chromosome could be described as having undergone copy-neutral loss of heterozygosity?


☞ 8.1 Some considerations in interpreting allelic copy ratios

For allelic copy ratio analysis, the matched-control is a sample from the same individual as the case sample. In the somatic case, the matched-control is the germline normal sample and the case is the tumor sample from the same individual.

The matched-control case presents the following considerations.

  • If a matched control contains any region with copy number amplification, the skewed allele fractions still allow correct interpretation of the original heterozygosity.
  • However, if a matched control contains deleted regions, regions with copy-neutral loss of heterozygosity, or a long stretch of homozygosity, e.g. as occurs in uniparental disomy, then these regions go dark, so to speak: they become apparently homozygous, and so ModelSegments drops them from consideration.
  • From population sequencing projects, we know the expected heterozygosity of normal germline samples averages around one in a thousand. However, the GATK4 CNV workflow does not account for any heterozygosity expectations. An example of such an analysis that utilizes SNP array data is HAPSEG. It is available on GenePattern.
  • If a matched normal contains tumor contamination, this should still allow for the normal to serve as a control. The expectation is that somatic mutations coinciding with common germline SNP sites will be rare and ModelSegments (i) only counts the dominant alt allele at multiallelic sites and (ii) recognizes and handles outliers. To estimate tumor in normal (TiN) contamination, see the Broad CGA group's deTiN.

Here are some considerations for detecting loss of heterozygosity regions.

  • In the matched-control case, if the case sample is pure, i.e. not contaminated with the control sample, then we see loss of heterozygosity (LOH) segments near alternate-allele fractions of zero and one.
  • If the case is contaminated with the matched control, whether the analysis is matched or not, the range of alternate-allele fractions becomes squished, so to speak, in that the contaminating normal's heterozygous sites add to the allele fractions. In this case, putative LOH segments still appear at the top and bottom edges of the allelic plot, at the lowest and highest alternate-allele fractions. For a given depth of coverage, however, the range of read fractions that distinguishes zygosity states is narrower and therefore harder to differentiate visually.

    8F. Case-only analysis of tumor contaminated with normal still allows for LOH detection. Here, we bluntly added together the tutorial tumor and normal sample reads. Results for the matched-control analysis are similar.
    mixTN_tumoronly.modeled.png

  • In the tumor-only case, if the tumor is pure, because ModelSegments drops homozygous sites from consideration and only models sites it determines are heterozygous, the workflow cannot ascertain LOH segments. Such LOH regions may present as an absence of allelic data or as low confidence segments, i.e. having a wide confidence interval on the allelic plot. Compare such a result below to that of the matched case in 8E above.

    8G. Allelic segmentation result based on allelic counts alone for case-only, when the case is pure, can produce regions of missing representation and low confidence allelic fraction segments.
    T-only_just_allelic.modeled.png

    Compare results. Focus on chr4, chr5 and chr17. While the matched-case gives homozygous zygosity for each of these chromosomes, the case-only allelic segmentation either presents an absence of segments for regions or gives low confidence allelic fraction segments at alternate allele fractions that are inaccurate, i.e. do not represent actual zygosity. This is particularly true for tumor samples where aneuploidy and LOH are common. Interpret case-only allelic results with caution.

Finally, remember the tutorial analyses above utilize allelic counts from gnomAD sites of common population variation that have been lifted over from GRCh37 to GRCh38. For allelic count sites, use of sample-specific germline variant sites may incrementally increase resolution. Also, use of confident variant sites from a callset derived from alignments to the target reference may help decrease noise. Confident germline variant sites can be derived by calling with HaplotypeCaller on the alignments and then filtering the variants, as sketched below. Alternatively, it is possible to fine-tune ModelSegments smoothing parameters to dampen noise.
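Here is a minimal sketch of that approach. The tool names are real, but the file names are placeholders, and a realistic workflow would apply variant filtration to the raw calls before using the sites.

# Call germline variants on the control alignments (placeholder names).
gatk HaplotypeCaller \
    -R Homo_sapiens_assembly38.fasta \
    -I hcc1143_N_clean.bam \
    -O hcc1143_N_germline.vcf.gz

# Keep biallelic SNPs; apply your preferred variant filtration (not shown)
# before or after this step.
gatk SelectVariants \
    -V hcc1143_N_germline.vcf.gz \
    --select-type-to-include SNP \
    --restrict-alleles-to BIALLELIC \
    -O hcc1143_N_germline.biallelic_snps.vcf.gz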


☞ 8.2 Some results of fine-tuning smoothing parameters

This section shows plotting results of changing some advanced smoothing parameters. The parameters and their defaults are given below, in the order of recommended consideration [12].

--number-of-changepoints-penalty-factor 1.0
--kernel-variance-allele-fraction 0.025
--kernel-variance-copy-ratio 0.0
--kernel-scaling-allele-fraction 1.0
--smoothing-credible-interval-threshold-allele-fraction 2.0
--smoothing-credible-interval-threshold-copy-ratio 2.0
The first four parameters impact segmentation, while the last two impact modeling. The following plots show the results of changing these smoothing parameters. The tutorial chose argument values arbitrarily, for illustration purposes. Results should be compared to those of 8B, which gives 398 segments.
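For instance, the 8H result below can be reproduced by adding a single option to the tutorial's ModelSegments command. This is a sketch: the output prefix is changed here, as our own precaution, so earlier outputs are not overwritten.

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --output sandbox \
    --output-prefix hcc1143_T_clean_cp5 \
    --number-of-changepoints-penalty-factor 5.0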

8H. Increasing changepoints penalty factor from 1.0 to 5.0 gives 140 segments.


8I. Increasing kernel variance parameters each to 0.8 gives 144 segments. Changing --kernel-variance-copy-ratio alone to 0.025 increases the number of segments greatly, to 1,266 segments. Changing it to 0.2 gives 414 segments.


8J. Decreasing kernel scaling from 1.0 to 0 gives 236 segments. Conversely, increasing kernel scaling from 1.0 to 5.0 gives 551 segments.


8K. Increasing both smoothing parameters each from 2.0 to 10.0 gives 263 segments.


back to top


Footnotes


[9] The GATK Resource Bundle provides two variations of a SNPs-only gnomAD project resource VCF. Both VCFs are sites-only eight-column VCFs but one retains the AC allele count and AF allele frequency variant-allele-specific annotations, while the other removes these to reduce file size.

  • For targeted exomes, it may be convenient to subset these to the preprocessed intervals, e.g. with SelectVariants for use with CollectAllelicCounts. This is not necessary, however, as ModelSegments drops sites outside the target regions from its analysis in the joint-analysis approach.
  • For whole genomes, depending on the desired resolution of the analysis, consider subsetting the gnomAD sites to those commonly variant, e.g. above an allele frequency threshold. Note that SelectVariants, as of this writing, can filter on AF allele frequency only for biallelic sites. Non-biallelic sites make up ~3% of the gnomAD SNPs-only resource. A subsetting sketch follows this list.
  • For more resolution, consider adding sample-specific germline variant biallelic SNPs-only sites to the intervals. Section 8.1 shows allelic segmentation results for such an analysis.
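For illustration, such a subsetting command might look like the sketch below. The resource file name follows the bundle's naming, but the interval list, output name and the 0.05 allele-frequency cutoff are arbitrary placeholders.

# Sketch only: restrict to biallelic sites first, since AF filtering
# applies only to biallelic records; then keep commonly variant sites.
gatk SelectVariants \
    -V af-only-gnomad.hg38.vcf.gz \
    -L targets.preprocessed.interval_list \
    --restrict-alleles-to BIALLELIC \
    -select "AF > 0.05" \
    -O gnomad.common_biallelic.vcf.gz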


[10] The MAPQ30 threshold of CollectAllelicCounts matches the MAPQ30 threshold of CollectFragmentCounts and is lower than the MAPQ50 threshold of GetPileupSummaries.


[11] In particular, the tool considers only heterozygous sites that have counts for both the reference allele and the alternate allele. If multiple alternate alleles are present, the tool uses the alternate allele with the highest count and ignores any other alternate allele(s).


[12] These advanced smoothing recommendations are from one of the workflow developers, @slee.

  • For smoother results, first increase --number-of-changepoints-penalty-factor from its default of 1.0.
  • If the above does not suffice, then consider changing the kernel-variance parameters --kernel-variance-copy-ratio (default 0.0) and --kernel-variance-allele-fraction (default 0.025), or change the weighting of the allele-fraction data by changing --kernel-scaling-allele-fraction (default 1.0).
  • If such changes are still insufficient, then consider adjusting the smoothing-credible-interval-threshold parameters --smoothing-credible-interval-threshold-copy-ratio (default 2.0) and --smoothing-credible-interval-threshold-allele-fraction (default 2.0). Increasing these will more aggressively merge adjacent segments.


[13] In particular, ModelSegments uses Gibbs sampling, a type of MCMC sampling, for both allele-fraction modeling and copy-ratio modeling, and additionally uses slice sampling for allele-fraction modeling. @slee details the following substeps.

  1. Perform MCMC (Gibbs) to fit the copy-ratio model posteriors.
  2. Use optimization (of the log likelihood) to initialize the Markov Chain for the allele-fraction model.
  3. Perform MCMC (Gibbs and slice) to fit the allele-fraction model posteriors.
  4. The initial model is now fit. We write the corresponding modelBegin files, including those for global parameters.
  5. Iteratively perform segment smoothing.
  6. Perform steps 1-4 again, this time to generate the final model fit and modelFinal files.


[14] @slee shares that the tool initializes the MCMC by starting at the maximum a posteriori (MAP) point in parameter space.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.

back to top

