Channel: Recent Discussions — GATK-Forum

StatusLogger Log4j2 could not find a logging implementation.


I have been trying to use RealignerTargetCreator to create realignment target intervals around my indels.

My command is: nohup java -jar /data/ngs/programs/GenomeAnalysisTK-3.8-0-ge9d806836/nightly/GenomeAnalysisTK.jar -T RealignerTargetCreator -R Ref.fasta -I Merge.bam -o INDEL.intervals

However the following error is thrown:
Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/data/ngs/programs/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...

Note that the job is not terminated after the error is thrown; it continues running without making any progress.

Kindly help.


Parallelism


This document explains the concepts involved and how they are applied within the GATK (and Cromwell+WDL or Queue where applicable). For specific configuration recommendations, see the companion document on parallelizing GATK tools.


1. The concept of parallelism

Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).

Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.

This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.

The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.

A quick warning about tradeoffs

Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.

Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.

Parallel computing in practice (sort of)

OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?

Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.

Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:

  • open the file, count the number of lines in the file, tell us the number, close the file

Note that tell us the number can mean writing it to the console, or storing it somewhere for use later on.

Now let's say we want to know the number of words on each line. The set of instructions would be:

  • open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number

And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.

So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:

  • open the file, index the lines

  • read the first line, count the number of words, tell us the number

  • read the second line, count the number of words, tell us the number
  • read the third line, count the number of words, tell us the number
  • [repeat for all lines]

  • collect final results and close the file

Here, the read the Nth line steps can be performed in parallel, because they are all independent operations.

You'll notice that we added a step, index the lines. That's a little bit of preliminary work that allows us to perform the read the Nth line steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.

Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.
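To make the word-counting example concrete, here is a minimal shell sketch of the same scatter-gather pattern, assuming a plain text file named input.txt (and summing the per-chunk counts into a single total, to keep the gather step simple):

    # Scatter: split the file into chunks of 1000 lines each (this plays the role of "index the lines")
    split -l 1000 input.txt chunk_

    # Middle steps: count the words in each chunk, running the counts in parallel as background jobs
    for chunk in chunk_*; do
        wc -w < "$chunk" > "$chunk.count" &
    done
    wait    # block until all the parallel counts are done

    # Gather: sum the per-chunk counts into a single total, then clean up
    cat chunk_*.count | awk '{ total += $1 } END { print total }'
    rm chunk_*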


2. Parallelizing the GATK

There are three different modes of parallelism offered by the GATK, and to really understand the difference you first need to understand the different levels of computing that are involved.

A quick word about levels of computing

By levels of computing, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster or cloud.

  • Core: the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Latin terminology) but the CPUs on some professional-grade machines can have tens or hundreds of cores.

  • Machine: the middle of the scale. For most of us, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say machine. Except if the machine is part of a cluster, in which case it's called a node.

  • Cluster or cloud: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility.

Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster level.

Multi-threading

In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.

image

Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.

If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.

Hey, if you have a better example, let us know in the forum and we'll use that instead.

Alright, now that you understand the idea of multithreading, let's get practical: how do we get the GATK to use multi-threading?

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively. They can be combined, since they act at different levels of computing:

  • -nt / --num_threads controls the number of data threads sent to the processor (acting at the machine level)

  • -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread (acting at the core level).

Not all GATK tools can use these options due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports -nt but not -nct, while IndelRealigner does not support either of these options.
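For example, a RealignerTargetCreator command requesting four data threads might look like the sketch below (the file names are placeholders); adding -nct to this particular tool would fail, since it does not support that option.

    java -jar GenomeAnalysisTK.jar \
        -T RealignerTargetCreator \
        -R reference.fasta \
        -I input.bam \
        -nt 4 \
        -o realignment_targets.intervals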

In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a companion document on parallelizing the GATK.

Scatter-gather

If you Google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.

So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?

As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves separate programs. There are two pipelining solutions that we support for scatter-gathering GATK jobs, Cromwell+WDL and Queue. They are quite different, but both are able to generate separate GATK jobs (each with its own command-line) to achieve the instructions given in a script.

image

At the simplest level, the script can involve a single GATK tool*. In that case, the execution engine (Cromwell or Queue) will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, the engine will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).

Note that Queue and Cromwell have additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the Queue documentation. To learn more about Cromwell+WDL, see the WDL website.

Compare and combine

So you see, scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But an execution engine like Queue or Cromwell can dispatch scattered GATK jobs to different machines in a computing cluster or on a cloud platform by interfacing with the appropriate job management software.

That being said, multi-threading has the great advantage that all the cores within a machine have access to the machine's shared memory, with very high bandwidth. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by network transfer costs.

The good news is that you can combine scatter-gather and multithreading: use Queue or Cromwell to scatter GATK jobs to different nodes on your cluster or cloud platform, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.

Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.
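To sketch what that combination can look like in practice (independently of Queue or Cromwell, which automate this for you), the commands below scatter one GATK job per chromosome onto a cluster using a hypothetical submit_to_cluster wrapper that stands in for your site's job submission command (qsub, sbatch, etc.); each scattered job then uses -nt on its own node. This is a rough sketch, not a complete pipeline.

    # Scatter: one GATK job per chromosome, each dispatched to a different node
    for chrom in chr1 chr2 chr3; do
        submit_to_cluster java -jar GenomeAnalysisTK.jar \
            -T UnifiedGenotyper \
            -R reference.fasta \
            -I input.bam \
            -L "$chrom" \
            -nt 4 \
            -o "calls.$chrom.vcf"
    done

    # Gather (after all the jobs finish): combine the per-chromosome VCFs into one,
    # e.g. with a tool such as CombineVariants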

Read groups


There is no formal definition of what a read group is, but in practice this term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

Meaning of the read group fields required by GATK

  • ID = Read group identifier
    This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.
    Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects; therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since all the reads within a read group are assumed to share the same error model.

  • PU = Platform Unit
    The PU holds three types of information: the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. The PU is not required by GATK, but if present it takes precedence over ID for base recalibration. In the example shown earlier, the two read group fields ID and PU appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

  • SM = Sample
    The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name.

  • PL = Platform/technology used to produce the read
    This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

  • LB = DNA preparation library identifier
    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.
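As an illustration, a Picard AddOrReplaceReadGroups command setting all of the fields discussed above might look like the sketch below; the field values are taken from the example @RG line shown earlier and the file names are placeholders.

    java -jar picard.jar AddOrReplaceReadGroups \
        I=reads.bam \
        O=reads_with_RG.bam \
        RGID=H0164.2 \
        RGPU=H0164ALXX140820.2 \
        RGSM=NA12878 \
        RGPL=ILLUMINA \
        RGLB=Solexa-272222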


Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portion of each read name, which comes after the flow cell lane and is separated by colons, consists of the tile number, the x-coordinate of the cluster and the y-coordinate of the cluster.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane
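As a small shell sketch of pulling those pieces out of a read name: here the full flow cell string plus the lane is used for both ID and PU, matching the PU in the example @RG line earlier; that example shortened the flow cell portion of its ID to H0164.2, which works equally well as long as each read group ID stays unique. A sample barcode from the sample sheet may additionally be appended to PU, per the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE} convention described above.

    read_name="H0164ALXX140820:2:1101:10003:23460"

    flowcell=$(echo "$read_name" | cut -d: -f1)   # H0164ALXX140820 (flow cell plus index portion)
    lane=$(echo "$read_name" | cut -d: -f2)       # 2

    rg_id="${flowcell}.${lane}"    # any scheme works as long as read group IDs stay unique
    rg_pu="${flowcell}.${lane}"    # matches PU:H0164ALXX140820.2 in the example @RG line

    echo "ID=${rg_id}  PU=${rg_pu}"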

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane), libraries (each sequenced on two lanes) and samples (each spread across four lanes, two lanes per library).

SNP calling issue: HaplotypeCaller not calling all SNPs


Hi, I am working with the Candida albicans genome for SNP calling. It's whole genome data.
I used the standard steps described here: https://gencore.bio.nyu.edu/variant-calling-pipeline/

I am finding a very strange problem in my VCF file, which I believe could be associated with HaplotypeCaller.
The SNPs exist in the BAM file but are not being called in the VCF file.
I can also share my script and interval list file if that would help solve the problem.

What is truth? Or, how an accident of nature can illuminate our path


A note to explain the context of the new paper by Heng Li, myself and others, “New synthetic-diploid benchmark for accurate variant calling evaluation”, available as a preprint on bioRxiv.

Developing new tools and algorithms for genome analysis relies heavily on the availability of so-called "truth sets" that are used to evaluate performance (accuracy, sensitivity etc.). This has long been a sticking point, though recently the situation has improved dramatically with the availability of several public, high-quality truth sets such as Genome In A Bottle from NIST and Platinum Genomes from Illumina. Even these resources, which have been produced through painstaking analysis and curation, are not immune to the lack of “orthogonality” that plagues most available truth sets. Chief among the resulting problems is that the failure modes of Illumina sequencing are usually masked out, so the resulting data are biased towards the easier parts of the genome.

The paper I linked above introduces a new dataset that we developed to be less biased. It is based solely on PacBio sequencing, and thus its error modes are less correlated with Illumina’s error modes. Using this dataset for benchmarking has given us high confidence in the accuracy of our validations and has enabled us to improve our methods with less concern of overfitting.


Truth data (for germline DNA methods) tend to be derived from two sources: synthetic (that is, computer-generated) data, or Illumina (and other) sequencing of a particular sample called NA12878. Both of these sources are deeply flawed and, ultimately, not good enough. First, it is virtually impossible to create synthetic data that truly resemble the results of sequencing actual biological tissue, for several reasons: the reference is an approximation, and the effects of sample extraction, library construction and sequencing are really hard to model accurately. As for NA12878, our biggest issue is that we simply love this sample too much! Nearly all of NA12878’s variants are present in our resource files (dbSNP, the training files for VQSR, etc.). When we evaluate our method’s performance on NA12878, we cannot really trust the results since we have been using the answer all along. Furthermore, the NIST and Platinum Genomes truth sets are each restricted to a subset of the genome that they consider the “confidence region”. This region is defined differently in the two datasets, but in both cases it is dependent on the performance of Illumina sequencing of NA12878 (among other things). This has the perverse effect that the results reflect performance only in the easier-to-sequence-and-analyze part of the genome, falsely inflating our self-confidence and giving no blame or credit for performance in the harder regions of the genome.

The “Synthetic-diploid” (or as we affectionately call it, SynDip) is generated from two human cell lines (CHM1 and CHM13, PacBio-sequenced and assembled by others) that were derived from Complete Hydatidiform Moles. This rare and devastating condition results in a non-viable collection of cells that is almost entirely homozygous. The homozygosity implies that PacBio sequencing is much more trustworthy as there are no heterozygous sites that tend to confuse the assembly: any confusion is almost certainly due to sequencing error and can therefore be masked out. To make use of this, we aligned the CHM1 and CHM13 assemblies to the hg38 reference, and created a VCF and a confidence region that characterize the variation that a 50-50 mixture of the two cell lines would have. At the same time, we also sequenced and aligned such a 50-50 mixture using our WEx and WGS protocols on Illumina. So to be clear, in that regard, the name is misleading. The only “synthetic” part about SynDip is that it’s synthetically diploid, but in all other aspects it’s as natural as can be, since it was generated from live cells using regular sequencing protocols.

Since the CHM dataset was generated using PacBio data alone, with no consideration for the flaws of Illumina’s short-read technology, there should be less correlation between the failure modes of our methods on the short-read data and SynDip’s confidence regions. This allows us to have better, more trustworthy truth-data. It enables us to remove much uncertainty, defusing our natural tendency to “look under the lamp” and to overfit our methods.

And beyond that, it empowers us to push our method development further by exposing large tracts of the reference where our methods (and not only ours!) do not perform well -- and provides us with a more truthful picture of what lies in those regions. Here are the main ways we have used this resource to that end:

  • We have used the insights gained from applying our filtering methods on the SynDip data, which reveal the flaws in their performance, to design better filtering architectures and fine-tune existing ones. (More on this in a future post….)
  • We have used the dataset to assess new variant calling methods for CNVs and SVs.
  • We have used it to compare different analysis pipelines and determine whether there’s a significant difference between them (e.g. What is the effect of running BQSR over and over again? Answer: Not much beyond the first run.)
  • We are currently using it to develop the next version of our joint-calling pipeline which will be able to joint call more than 100K genomes (!!!)

One thing that the current CHM dataset doesn’t help us do is develop better lab methods. This is because the CHM cell lines are not currently commercially available and thus the technology companies cannot test their new protocols and technologies on it. Hopefully, this will eventually be made possible and could enable us to explore hard-to-sequence regions of the genome.

If you are a method developer or you are in a position to evaluate the performance of various pipelines, we encourage you to check out the CHM dataset, and we hope it will help you develop new methods and pipelines! In the future we plan to share more data from the CHM cell lines and make the methods we use for evaluating our methods and data publicly available.

GenomicsDBImport multi-job issue


I'm having a strange problem with GenomicsDBImport. I'm trying to call SNPs for the full genome, so I've been importing my g.vcfs into one GenomicsDB database per chromosome. Each chromosome is a different job, which I've been launching simultaneously. Each job uses the same set of samples but covers a different part of the genome. My issue is that when I launch the jobs simultaneously, only the first job writes its output correctly, while the others appear to run but don't produce any output. The jobs that don't write correctly only output a file named "__tiledb_workspace.tdb", which is empty. Despite this, those jobs seem to continue running and produce a log file saying they worked. If I wait a bit and launch the jobs separately instead, they run concurrently just fine.

For reference, I'm using GATK 4.beta.3-SNAPSHOT, and I'm launching jobs on a slurm job management system.

CreatePanelOfNormals problem


I am trying to use CreatePanelOfNormals for CNV analysis.

java -Xmx4G  -jar /ifshk7/BC_PS/luoshizhi/software/gatk/gatk-4.beta.2-4/build/libs/gatk.jar CreatePanelOfNormals -I B.CalculateTargetCoverage/normal_coverage.tsv -O  C.CreatePanelOfNormals/ponC.pon  --disableSpark

There is only one sample in my normal_coverage.tsv file.

However, that command exits with this message:

[August 16, 2017 1:03:08 AM HKT] org.broadinstitute.hellbender.tools.exome.CreatePanelOfNormals done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=545259520
org.apache.commons.math3.exception.OutOfRangeException: column index (-1)
    at org.apache.commons.math3.linear.MatrixUtils.checkColumnIndex(MatrixUtils.java:484)
    at org.apache.commons.math3.linear.MatrixUtils.checkSubMatrixIndex(MatrixUtils.java:514)
    at org.apache.commons.math3.linear.AbstractRealMatrix.getSubMatrix(AbstractRealMatrix.java:307)
    at org.broadinstitute.hellbender.tools.pon.coverage.pca.HDF5PCACoveragePoNCreationUtils.calculateReducedPanelAndPInverses(HDF5PCACoveragePoNCreationUtils.java:312)
    at org.broadinstitute.hellbender.tools.pon.coverage.pca.HDF5PCACoveragePoNCreationUtils.create(HDF5PCACoveragePoNCreationUtils.java:106)
    at org.broadinstitute.hellbender.tools.exome.CreatePanelOfNormals.runPipeline(CreatePanelOfNormals.java:296)
    at org.broadinstitute.hellbender.utils.SparkToggleCommandLineProgram.doWork(SparkToggleCommandLineProgram.java:39)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:173)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
    at org.broadinstitute.hellbender.Main.main(Main.java:230)

Is this a bug?

What is a good number of samples that can be used to detect a variant - I have 15K GVCFs with 1000DP


Hi,

I have 15k GVCFs. To call variants, I understand I can run the CombineGVCFs step to combine the GVCFs in batches. I would like to know what a good sample-set size is for detecting a variant, for BAMs with coverage of over 800-1000X. Would variants called from batches of 500 samples have the same power to detect a variant in all the samples as calling variants on all 15k samples together?


(How to) Call somatic copy number variants using GATK4 CNV


This demonstrative tutorial provides instructions and example data to detect somatic copy number variation (CNV) using a panel of normals (PoN). The workflow is optimized for Illumina short-read whole exome sequencing (WES) data. It is suitable neither for whole genome sequencing (WGS) data nor for germline calling.

The tutorial recapitulates the GATK demonstration given at the 2016 ASHG meeting in Vancouver, Canada, for a beta version of the CNV workflow. Because we are still actively developing the CNV tools (writing as of March 2017), the underlying algorithms and current workflow options, e.g. syntax, may change. However, the presented basic approach and general concepts will still be germane. Please check the forum for updates.

Many thanks to Samuel Lee (@slee) for developing the example data, data figures and discussion that set the backbone of this tutorial.

► For a similar example workflow that pertains to earlier releases of GATK4, see Article#6791.
► For the mathematics behind the workflow, see this whitepaper.

Different data types come with their own caveats. WGS, while providing even coverage that enables better CNV detection, is costly. SNP arrays, while the standard for CNV detection, may not be part of an analysis protocol. Being able to resolve CNVs from WES, which additionally introduces artifacts and variance in the target capture step, requires sophisticated denoising.


Jump to a section

  1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage
  2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals
  3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts
  4. Segment the normalized coverage profile using PerformSegmentation
    I get an error at this step!
  5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio
    What is the QC value?
  6. Call segmented copy number variants using CallSegments
  7. Discussion of interest to some
    Why can't I use just a matched normal?
    How do the results compare to SNP6 analyses?

Tools, system requirements and example data download

  • This tutorial uses a beta version of the CNV workflow tools within the GATK4 gatk-protected-1.0.0.0-alpha1.2.3 pre-release (Version:0288cff-SNAPSHOT from September 2016). We previously made the program jar specially available alongside the data bundle in the workshops directory here. The original worksheets are in the 1610 folder. However, the data bundle was only available to workshop attendees. Note other tools in this program release may be unsuitable for analyses.

    The example data is whole exome capture sequence data for chromosomes 1–7 of matched normal and tumor samples aligned to GRCh37. Because the data is from real cancer patients, we have anonymized them at multiple levels. The anonymization process preserves the noise inherent in real samples. The data is representative of Illumina sequencing technology from 2011.

  • R (install from https://www.r-project.org/) and certain R components. After installing R, install the components with the following command.

    Rscript install_R_packages.R
    

    We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.

  • XQuartz for optional section 5. Your system may already have this installed.

  • The tutorial does not require reference files. The optional plotting step that uses the PlotSegmentedCopyRatio tool plots against GRCh37 and should NOT be used for other reference assemblies.


1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage

In this step, we collect proportional coverage using target intervals and read data. We have actually pre-computed this for you and we provide the command here for reference.

We process each BAM, whether normal or tumor. The tool collects coverage per read group at each target and divides these counts by the total number of reads per sample.

java -jar gatk4.jar CalculateTargetCoverage \
    -I <input_bam_file> \
    -T <input_target_tsv> \
    -transform PCOV \
    -groupBy SAMPLE \
    -targetInfo FULL \
    -keepdups \
    -O <output_pcov_file>
  • The target file -T is a padded intervals list of the baited regions. You can add padding to a target list using the GATK4 PadTargets tool. For our example data, padding each exome target 250bp on either side increases sensitivity.
  • Setting the -targetInfo option to FULL keeps the original target names from the target list.
  • The -keepdups option asks the tool to include alignments flagged as duplicate.

The top plot shows the raw proportional coverage for our tumor sample for chromosomes 1–7. Each dot represents a target. The y-axis plots proportional coverage and the x-axis targets. The middle plot shows the data after a median-normalization and log2-transformation. The bottom plot shows the tumor data after normalization against its matched-normal.

[Plots: raw proportional coverage (top), median-normalized and log2-transformed coverage (middle), and coverage normalized against the matched normal (bottom), for the tumor sample across chromosomes 1–7.]

For each of these progressions, how certain are you that there are copy-number events? How many copy-number variants are you certain of? What is contributing to your uncertainty?


back to top


2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals

In this step, we use two commands to create the CNV panel of normals (PoN).

The normals should represent the same sequencing technology, e.g. sample preparation and capture target kit, as that of the tumor samples under scrutiny. The PoN is meant to encapsulate sequencing noise and may also capture common germline variants. Like any control, you should think carefully about what sample set would make an effective panel. At the least, the PoN should consist of ten normal samples that were ideally subject to the same batch effects as that of the tumor sample, e.g. from the same sequencing center. Our current recommendation is 40 or more normal samples. Depending on the coverage depth of samples, adjust the number.

What is better, tissue-matched normals or blood normals of tumor samples?
What makes a better background control, a matched normal sample or a panel of normals?

The first step combines the proportional read counts from the multiple normal samples into a single file. The -inputList parameter takes a file listing the relative file paths, one sample per line, of the proportional coverage data of the normals.

java -jar gatk4.jar CombineReadCounts \
    -inputList normals.txt \
    -O sandbox/combined-normals.tsv

The second step creates a single CNV PoN file. The PoN stores information such as the median proportional coverage per target across the panel and projections of systematic noise calculated with PCA (principal component analysis). Our tutorial’s PoN is built with 39 normal blood samples from cancer patients from the same cohort (not suffering from blood cancers).

java -jar gatk4.jar CreatePanelOfNormals \
    -I sandbox/combined-normals.tsv \
    -O sandbox/normals.pon \
    -noQC \
    --disableSpark \
    --minimumTargetFactorPercentileThreshold 5

This results in two files, the CNV PoN and a target_weights.txt file that typical workflows can ignore. Because we have a small number of normals, we include the -noQC option and change the --minimumTargetFactorPercentileThreshold to 5%.

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers?


back to top


3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts

In this step, we normalize the raw proportional coverage (PCOV) profile using the PoN. Specifically, we normalize the tumor coverage against the PoN’s target medians and against the principal components of the PoN.

java -jar gatk4.jar NormalizeSomaticReadCounts \
    -I cov/tumor.tsv \
    -PON sandbox/normals.pon \
    -PTN sandbox/tumor.ptn.tsv \
    -TN sandbox/tumor.tn.tsv

This produces the pre-tangent-normalized file -PTN and the tangent-normalized file -TN, respectively. Resulting data is log2-transformed.

Denoising with a PoN is critical for calling copy-number variants from WES coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.


back to top


4. Segment the normalized coverage profile using PerformSegmentation

Here we segment the normalized coverage profile. Segmentation groups contiguous targets with the same copy ratio.

java -jar gatk4.jar PerformSegmentation \
    -TN sandbox/tumor.tn.tsv \
    -O sandbox/tumor.seg \
    -LOG

For our tumor sample, we reduce the ~73K individual targets to 14 segments. The -LOG parameter tells the tool that the input coverages are log2-transformed.

View the resulting file with cat sandbox/tumor.seg.

image

Which chromosomes have events?

☞ I get an error at this step!

This command will error if you have not installed R and certain R components. Take a few minutes to install R from https://www.r-project.org/. Then install the components with the following command.

Rscript install_R_packages.R

We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.


back to top


5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio

This is an optional step that plots segmented coverage.

This command requires XQuartz installation. If you do not have this dependency, then view the results in the precomputed_results folder instead. Currently plotting only supports human assembly b37 autosomes. Going forward, this tool will accommodate other references and the workflow will support calling on sex chromosomes.

java -jar gatk4.jar PlotSegmentedCopyRatio \
    -TN sandbox/tumor.tn.tsv \
    -PTN sandbox/tumor.ptn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox \
    -pre tumor \
    -LOG

The -O defines the output directory, and the -pre defines the basename of the files. Again, the -LOG parameter tells the tool that the inputs are log2-transformed. The output folder contains seven files--three PNG images and four text files.

image
  • Before_After.png (shown above) plots copy-ratios pre (top) and post (bottom) tangent-normalization across the chromosomes. The plot automatically adjusts the y-axis to show all available data points. Dotted lines represent centromeres.
  • Before_After_CR_Lim_4.png shows the same but fixes the y-axis range from 0 to 4 for comparability across samples.
  • FullGenome.png colors differential copy-ratio segments in alternating blue and orange. The horizontal line plots the segment mean. Again the y-axis ranges from 0 to 4.

Open each of these images. How many copy-number variants do you see?

☞ What is the QC value?

Each of the four text files contain a single quality control (QC) value. This value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number variants and should decrease post tangent-normalization.

  • preQc.txt gives the QC value before tangent-normalization.
  • postQc.txt gives the post-tangent-normalization QC value.
  • dQc.txt gives the difference between pre and post QC values.
  • scaled_dQc.txt gives the fraction difference (preQc - postQc)/(preQc).
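As a tiny worked example of how the last two values relate to the first two (the numbers below are made up purely for illustration):

    # hypothetical QC values for illustration only
    preQc=0.10
    postQc=0.04

    awk -v pre="$preQc" -v post="$postQc" \
        'BEGIN { printf "dQc = %g\nscaled_dQc = %g\n", pre - post, (pre - post) / pre }'
    # prints dQc = 0.06 and scaled_dQc = 0.6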


back to top


6. Call segmented copy number variants using CallSegments

In this final step, we call segmented copy number variants. The tool makes one of three calls for each segment--neutral (0), deletion (-) or amplification (+). These deleted or amplified segments could represent somatic events.

java -jar gatk4.jar CallSegments \
    -TN sandbox/tumor.tn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox/tumor.called

View the results with cat sandbox/tumor.called.

image

Besides the last column, how is this result different from that of step 4?


back to top


7. Discussion of interest to some

☞ Why can't I use just a matched normal?

Let’s compare results from the raw coverage (top), from normalizing using the matched-normal only (middle) and from normalizing using the PoN (bottom).

[Plots: raw coverage (top), coverage normalized using the matched normal only (middle), and coverage normalized using the PoN (bottom).]

What is different between the plots? Look closely.

The matched-normal normalization appears to perform well. However, its noisiness brings uncertainty to any call that would be made, even if visible by eye. Furthermore, its level of noise obscures detection of the 4th variant that the PoN normalization reveals.

☞How do the results compare to SNP6 analyses?

As with any algorithmic analysis, it’s good to confirm results with orthogonal methods. If we compare calls from the original unscrambled tumor data against GISTIC SNP6 array analysis of the same sample, we similarly find three deletions and a single large amplification.

back to top


RealignerTargetCreator hangs with the nightly version GenomeAnalysisTK-nightly-2017-12-11-1.


I was getting an "ERROR StatusLogger Log4j2 could not find a logging implementation" error and I tried to rectify it with the nightly build, but to no avail.

The version of nightly build is : GenomeAnalysisTK-nightly-2017-12-11-1
The version of java is: 1.8.0_131
Now, instead of getting the aforementioned error, the command just hangs without any output.

The command used by me is:
nohup java -jar GenomeAnalysisTK-nightly-2017-12-11-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I Merge.bam -o INDEL.intervals &

https://gatkforums.broadinstitute.org/gatk/discussion/10131/gatk-3-8-logger-error (the thread here says that the nightly version has fixed the hanging/freezing issue)

Please help.

Undefined variable in VariantFiltration


Dear all!
I have tried to use VariantFiltration (GATK 3.8) but the program does not give me the results I expected.
This is an example part of my VCF. You can see that the AD field is present:

es. 8 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:505,31:0.075:11:20:0.645:16565,900:227:278
chr16 29394496 . T C . alt_allele_in_normal ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=175.04;TLOD=23.65 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:1109,18:0.022:10:8:0.556:35227,531:545:564
chr16 29395098 rs370525489 C G . alt_allele_in_normal;t_lod_fstar DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=106.55;TLOD=4.44 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:704,10:0.02:5:5:0.5:21497,265:346:358

java -Xms2g -jar /illumina/software/PROG2/GenomeAnalysisTK.jar -T VariantFiltration -R /illumina/software/database/Starhg19/hg19/hg19_primary.fa -V 1000.mutect2.vcf --filterExpression 'AD > 20.0 ' --filterName "basic_sno_filter" -o pippo.vcf

What is wrong here?

WARN  14:40:10,445 Interpreter - ![0,2]: 'AD > 20.0;' undefined variable AD
WARN  14:40:10,445 Interpreter - ![0,2]: 'AD > 20.0;' undefined variable AD
WARN  14:40:10,445 Interpreter - ![0,2]: 'AD > 20.0;' undefined variable AD
WARN  14:40:10,445 Interpreter - ![0,2]: 'AD > 20.0;' undefined variable AD
WARN  14:40:10,445 Interpreter - ![0,2]: 'AD > 20.0;' undefined variable AD
WARN  14:40:10,445 Interpreter - ![0,2]: 'AD > 20.0;' undefined variable AD
WARN  14:40:10,445 Interpreter - ![0,2]: 'AD > 20.0;' undefined variable AD
WARN  14:40:10,445 Interpreter - ![0,2]: 'AD > 20.0;' undefined variable AD
INFO  14:40:10,668 ProgressMeter -            done     11419.0     7.0 s      10.5 m       98.5%     7.0 s       0.0 s
INFO  14:40:10,668 ProgressMeter - Total runtime 7.17 secs, 0.12 min, 0.00 hours
------------------------------------------------------------------------------------------
Done. There were no warn messages.

RealignerTargetCreator shows runtime error


I've been trying for over three months now and RealignerTargetCreator consistently shows a runtime error. Is there a fix for this? If not, is there an alternative that I can use?

Reference Genome Components


Document is in BETA. It may be incomplete and/or inaccurate. Post suggestions to the Comments section.


This document defines several components of a reference genome. We use the human GRCh38/hg38 assembly to illustrate.

GRCh38/hg38 is the assembly of the human genome released in December 2013 that uses alternate or ALT contigs to represent common complex variation, including HLA loci. Alternate contigs are also present in past assemblies, but not to the extent we see with GRCh38. Many of the improvements in GRCh38 are the result of other genome sequencing and analysis projects, including the 1000 Genomes Project.

The ideogram is from the Genome Reference Consortium website and showcases GRCh38.p7. The zoomed region illustrates how regions in blue are full of Ns.

Analysis set reference genomes have special features to accommodate sequence read alignment. This type of genome reference can differ from the reference you use to browse the genome.

  • For example, the GRCh38 analysis set hard-masks, i.e. replaces with Ns, a proportion of homologous centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR (pseudoautosomal) regions on chromosome Y. Confirm the set you are using by viewing a PAR region of the Y chromosome on IGV as shown in the figure below, or with the command sketched after this list. The chrY locations of PAR1 and PAR2 on GRCh38 are chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415.
    image
    The sequence in the reference set is a mix of uppercase and lowercase letters. The lowercase letters represent soft-masked sequence corresponding to repeats from RepeatMasker and Tandem Repeats Finder.

  • The GRCh38 analysis sets also include a contig to siphon off reads corresponding to the Epstein-Barr virus sequence as well as decoy contigs. The EBV contig can help correct for artifacts stemming from immortalization of human blood lymphocytes with EBV transformation, as well as capture endogenous EBV sequence as EBV naturally infects B cells in ~90% of the world population. Heng Li provides the decoy contigs.
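As the command-line alternative mentioned in the first bullet above, you can pull a few bases from inside the chrY PAR1 interval with samtools faidx; on the analysis set reference these should come back as Ns, whereas a non-masked reference returns actual sequence. The FASTA file name is a placeholder for whichever copy of GRCh38 you are checking.

    # build the .fai index first if it does not already exist
    samtools faidx GRCh38_analysis_set.fasta

    # extract 60 bases from inside PAR1 on chrY (chrY:10,000-2,781,479)
    samtools faidx GRCh38_analysis_set.fasta chrY:10001-10060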


Nomenclature: words to describe components of reference genomes

  • A contig is a contiguous sequence without gaps.

  • Alternate contigs, alternate scaffolds or alternate loci allow for representation of diverging haplotypes. These regions are too complex for a single representation. Identify ALT contigs by their _alt suffix.

    The GRCh38 ALT contigs total 109Mb in length and span 60Mb of the primary assembly. Alternate contig sequences can range from novel or highly diverged to nearly identical to the corresponding primary assembly sequence. Sequences that are highly diverged from the primary assembly only contribute a few million bases. Most subsequences of ALT contigs are fairly similar to the primary assembly. This means that if we align sequence reads to GRCh38+ALT blindly, then we obtain many multi-mapping reads with zero mapping quality. Since many GATK tools have a ZeroMappingQuality filter, we will then miss variants corresponding to such loci.

  • Primary assembly refers to the collection of (i) assembled chromosomes, (ii) unlocalized and (iii) unplaced sequences. It represents a non-redundant haploid genome.

    (i) Assembled chromosomes for hg38 are chromosomes 1–22 (chr1–chr22), X (chrX), Y (chrY) and Mitochondrial (chrM).
    (ii) Unlocalized sequences are assigned to a specific chromosome but their order or orientation is unknown. Identify them by the _random suffix.
    (iii) Unplaced sequences are on an unknown chromosome. Identify them by the chrUn_ prefix.

  • PAR stands for pseudoautosomal region. PAR regions in mammalian X and Y chromosomes allow for recombination between the sex chromosomes. Because the PAR sequences together create a diploid or pseudo-autosomal sequence region, the X and Y chromosome sequences are intentionally identical in the genome assembly. Analysis set genomes further hard-mask two of the Y chromosome PAR regions so as to allow mapping of reads solely to the X chromosome PAR regions.

  • Different assemblies shift coordinates for loci and are released infrequently. Hg19 and hg38 represent two different major assemblies. Comparing data from different assemblies requires lift-over tools that adjust genomic coordinates to match loci, at times imperfectly. In the special case of hg19 and GRCh37, the primary assembly coordinates are identical for loci but patch updates differ. Also, the naming conventions of the references differ, e.g. the use of chr1 versus 1 to indicate chromosome 1, such that these also require lift-over to compare data. GRCh38/hg38 unifies the assemblies and the naming conventions.

  • Patches are regional fixes that are released periodically for a given assembly. GRCh38.p7 indicates the seventh patched minor release of GRCh38. This NCBI page explains in more detail. Patches add information to the assembly without disrupting the chromosome coordinates. Again, they improve representation without affecting chromosome coordinate stability. The two types of patches, fixed and novel, represent different types of sequence.

    (i) Fix patches represent sequences that will replace primary assembly sequence in the next major assembly release. When interpreting data, fix patches should take precedence over the chromosomes.
    (ii) Novel patches represent alternate loci. When interpreting data, treat novel patches as population sequence variants.


The GATK perspective on reference genomes

Within GATK documentation, Tutorial#8017 outlines how to map reads in an alternate contig aware manner and discusses some of the implications of mapping reads to reference genomes with alternate contigs.

GATK tools allow for use of a genomic intervals list that tells tools which regions of the genome the tools should act on. Judicious use of an intervals list, e.g. one that excludes regions of Ns and low complexity repeat regions in the genome, makes processes more efficient. This brings us to the next point.

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

  • For example, HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 from prior assemblies. This has implications for using the -L option of GATK as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
  • When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
  • However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

     -L HLA-A*01:01:01:01:1+
    

Viewing CRAM alignments on genome browsers

Because CRAM compression depends on the alignment reference genome, tools that use CRAM files ensure correct decompression by comparing reference contig MD5 hashtag values. These are sensitive to any changes in the sequence, e.g. masking with Ns. This can have implications for viewing alignments in genome browsers when there is a disjoint between the reference that is loaded in the browser and the reference that was used in alignment. If you are using a version of tools for which this is an issue, be sure to load the original analysis set reference genome to view the CRAM alignments.
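The same applies on the command line: a hedged sketch of decoding a CRAM with samtools, pointing -T at the same analysis set reference that was used for alignment (file names are placeholders).

    # decode reads from a CRAM using the exact reference it was compressed against
    samtools view -h -T GRCh38_analysis_set.fasta sample.cram | head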

Should I switch to a newer reference?

Yes you should. In addition to adding many alternate contigs, GRCh38 corrects thousands of SNPs and indels in the GRCh37 assembly that are absent in the population and are likely sequencing artifacts. It also includes synthetic centromeric sequence and updates non-nuclear genomic sequence.

The ability to recognize alternate haplotypes for loci is a drastic improvement that GRCh38 makes possible. Going forward, expanding genomics data will help identify variants for alternate haplotypes, improve existing and add additional alternate haplotypes and give us a better accounting of alternate haplotypes within populations. We are already seeing improvements and additions in the patch releases to reference genomes, e.g. the seven minor releases of GRCh38 available at the time of this writing.

Note that variants produced by alternate haplotypes when they are represented on the primary assembly may or may not be present in data resources, e.g. dbSNP. This could have varying degrees of impact, including negligible, for any process that relies on known variant sites. Consider the impact this discrepant coverage in data resources may have for your research aims and weigh this against the impact of missing variants because their sequence context is unaccounted for in previous assemblies.


External resources

  1. New 11/16/2016 For a brief history and discussion on challenges in using GRCh38, see the 2015 Genome Biology article Extending reference assembly models by Church et al. (DOI: 10.1186/s13059-015-0587-3).
  2. For press releases highlighting improvements in GRCh38 from December 2013, see http://www.ncbi.nlm.nih.gov/news/12-23-2013-grch38-released/ and http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html. The latter post summarizes major improvements, including the correction of thousands of SNPs and indels in GRCh37 not seen in the population and the inclusion of synthetic centromeric sequence.
  3. Recent releases of BWA, e.g. v0.7.15+, handle alt contig mapping and HLA typing. See the BWA repository for information. See these pages for download and installation instructions.
  4. The Genome Reference Consortium (GRC) provides human, mouse, zebrafish and chicken sequences, and this particular webpage gives an overview of GRCh38. Namely, an interactive chromosome ideogram marks regions with corresponding alternate loci, regions with fix patches and regions containing novel patches. For additional assembly terminology, see http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/definitions.shtml.
  5. The UCSC Genome Browser allows browsing and download of genomes, including analysis sets, from many different species. For more details on the difference between the GRCh38 reference and analysis sets, see ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/README.txt and ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/README.txt, respectively. In addition, the site provides annotation files, e.g. here is the annotation database for GRCh38. Within this particular page, the file named gap.txt.gz catalogues the gapped regions of the assembly that are filled with Ns. For the illustration above, the corresponding region in this file shows:

        585    chr14    0    10000    1    N    10000    telomere    no
        1    chr14    10000    16000000    2    N    15990000    short_arm    no
        707    chr14    16022537    16022637    4    N    100    contig    no
    
  6. The Integrative Genomics Viewer is a desktop application for viewing genomics data, including alignments. The tool can load reference genomes that you provide via file or URL, or genomes that it hosts on a server. The numerous hosted reference genomes include GRCh38. See this page for information on hosted reference genomes. For the most up-to-date list of hosted genomes, open IGV and go to Genomes > Load Genome From Server; a menu lists the genomes you can make available in the main genome dropdown menu.


Question and suggestion re -nct & -num_threads options

Hi,

I'm trying to implement a workflow with GATK for the first time, and I'm getting caught out by the -nct/-num_threads options not being supported by all walkers: using one with an unsupported walker raises an error and kills the process.

Can I suggest that if a flag is not implemented/supported by a walker, the option simply be ignored? The docs don't clarify which walkers support which flags, so I need to test each one. It would be much easier if a warning message were given instead.

Also, I don't fully understand the difference between -nct and -num_threads. Can someone explain it, please?
TIA
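For reference, in GATK 3.x -nt (--num_threads) sets the number of data threads the engine runs, each working on its own portion of the data, while -nct (--num_cpu_threads_per_data_thread) sets how many CPU threads each data thread uses; individual walkers support one, both, or neither. Illustrative commands (file names are placeholders):

     # RealignerTargetCreator supports data threads (-nt)
     java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I input.bam -nt 8 -o targets.intervals

     # BaseRecalibrator supports CPU threads per data thread (-nct)
     java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I input.bam --knownSites dbsnp.vcf -nct 4 -o recal.table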

Should CheckIlluminaDirectory be able to handle a non-standard read structure

I have a flowcell (from a 10x library) in which the 'natural' read structure is 178T8B14B5T. However, I want to interpret the flowcell as 178T8B14T5S, so that is what I passed to CheckIlluminaDirectory. I get the exception below. It looks like the code is trying to check all the cycles, including the skips. However, CbclReader.outputCycles is initialized only with enough elements to hold the non-skip cycles. Is this a bug? Or is it wrong to pass a read structure with skips in it?

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 200
at picard.illumina.parser.readers.CbclReader.readSurfaceTile(CbclReader.java:119)
at picard.illumina.parser.readers.CbclReader.(CbclReader.java:102)
at picard.illumina.CheckIlluminaDirectory.doWork(CheckIlluminaDirectory.java:170)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
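For reference, a sketch of the kind of invocation being described (the basecalls directory, lane, and jar path are placeholders; the read structure is the one quoted above):

     java -jar picard.jar CheckIlluminaDirectory \
         BASECALLS_DIR=/path/to/run/Data/Intensities/BaseCalls \
         LANES=1 \
         READ_STRUCTURE=178T8B14T5S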


RealignerTargetCreator hangs

Hi GATK team!

unfortunately we have an issue running RealignerTargetCreator. The command line looks like this:

gatk -T RealignerTargetCreator -R ref.fasta -I /testsample.sorted.bam -nt 32 -o /testsample.intervals
INFO  13:00:59,111 HelpFormatter - ---------------------------------------------------------------------------------------------
INFO  13:00:59,141 HelpFormatter - The Genome Analysis Toolkit (GATK) vnightly-2017-07-11-g1f763d5, Compiled 2017/07/11 00:01:14
INFO  13:00:59,141 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  13:00:59,142 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  13:00:59,142 HelpFormatter - [Thu Jul 20 13:00:58 UTC 2017] Executing on Linux 3.10.0-327.3.1.el7.x86_64 amd64
INFO  13:00:59,142 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11
INFO  13:00:59,170 HelpFormatter - Program Args:  -T RealignerTargetCreator -R ref.fasta -I /testsample.sorted.bam -nt 32 -o /testsample.intervals
INFO  13:00:59,226 HelpFormatter - Executing as user on Linux 3.10.0-327.3.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11.
INFO  13:00:59,227 HelpFormatter - Date/Time: 2017/07/20 13:00:59
INFO  13:00:59,227 HelpFormatter - ---------------------------------------------------------------------------------------------
INFO  13:00:59,228 HelpFormatter - ---------------------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/opt/gatk/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console…

After this, the application unfortunately hangs. Running this with the stable GATK v3.7 release does not work either; there we ran into the bug in HaplotypeCaller's VectorHMM library. Any ideas what we can do?
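One way to narrow this down (a suggestion only, not something reported above) would be to check whether the hang persists when the data-thread option is dropped:

     gatk -T RealignerTargetCreator -R ref.fasta -I /testsample.sorted.bam -o /testsample.intervals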

Why aren't my variants phased with PBT?

I am running PhaseByTransmission on a trio VCF file. I am getting zero phased genotypes and I couldn't figure out why:

This is the output of the run:

ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/data/pipeline_in/Keimbahn_pipeline/Phasing/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  16:26:34,895 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO  16:26:34,896 GenomeAnalysisEngine - Inflater: IntelInflater
INFO  16:26:34,896 GenomeAnalysisEngine - Strictness is SILENT
INFO  16:26:34,999 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  16:26:35,078 PedReader - Reading PED file family.ped with missing fields: []
INFO  16:26:35,082 PedReader - Phenotype is other? false
INFO  16:26:35,149 GenomeAnalysisEngine - Preparing for traversal
INFO  16:26:35,155 GenomeAnalysisEngine - Done preparing for traversal
INFO  16:26:35,155 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  16:26:35,156 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  16:26:35,156 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  16:26:40,931 PhaseByTransmission - Number of complete trio-genotypes: 24366
INFO  16:26:40,932 PhaseByTransmission - Number of trio-genotypes containing no call(s): 5
INFO  16:26:40,932 PhaseByTransmission - Number of trio-genotypes phased: 0
INFO  16:26:40,932 PhaseByTransmission - Number of resulting Het/Het/Het trios: 2642
INFO  16:26:40,932 PhaseByTransmission - Number of remaining single mendelian violations in trios: 0
INFO  16:26:40,933 PhaseByTransmission - Number of remaining double mendelian violations in trios: 0
INFO  16:26:40,933 PhaseByTransmission - Number of complete pair-genotypes: 0
INFO  16:26:40,933 PhaseByTransmission - Number of pair-genotypes containing no call(s): 0
INFO  16:26:40,933 PhaseByTransmission - Number of pair-genotypes phased: 0
INFO  16:26:40,933 PhaseByTransmission - Number of resulting Het/Het pairs: 0
INFO  16:26:40,934 PhaseByTransmission - Number of remaining mendelian violations in pairs: 0
INFO  16:26:40,934 PhaseByTransmission - Number of genotypes updated: 0
INFO  16:26:41,068 ProgressMeter -            done     24385.0     5.0 s       4.0 m       29.6%    16.0 s      11.0 s
INFO  16:26:41,069 ProgressMeter - Total runtime 5.91 secs, 0.10 min, 0.00 hours
------------------------------------------------------------------------------------------
Done. There were no warn messages.

Why is the number of phased genotypes zero?

This is what my VCF file looks like:

##fileformat=VCFv4.2
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FILTER=<ID=mendelError,Description="Apparent Mendelian inheritance error (MIE) in trio">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FORMAT=<ID=ABQ,Number=1,Type=Integer,Description="Average quality of variant-supporting bases (qual2)">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting bases on forward strand (reads2plus)">
##FORMAT=<ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting bases on reverse strand (reads2minus)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Quality Read Depth of bases with Phred score >= 15">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=HP,Number=.,Type=String,Description="Read-backed phasing haplotype identifiers">
##FORMAT=<ID=PQ,Number=1,Type=Float,Description="Read-backed phasing quality">
##FORMAT=<ID=PVAL,Number=1,Type=String,Description="P-value from Fisher's Exact Test">
##FORMAT=<ID=RBQ,Number=1,Type=Integer,Description="Average quality of reference-supporting bases (qual1)">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting bases on forward strand (reads1plus)">
##FORMAT=<ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting bases on reverse strand (reads1minus)">
##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Raw Read Depth as reported by SAMtools">
##INFO=<ID=ADP,Number=1,Type=Integer,Description="Average per-sample depth of bases with Phred score >= 15">
##INFO=<ID=DENOVO,Number=0,Type=Flag,Description="Indicates apparent de novo mutations unique to the child">
##INFO=<ID=PhasingInconsistent,Number=0,Type=Flag,Description="Are the reads significantly haplotype-inconsistent?">
##INFO=<ID=STATUS,Number=1,Type=String,Description="Variant status in trio (0=unknown, 1=untransmitted, 2=transmitted, 3=denovo, 4=MIE)">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##source=VarScan2
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  child   father  mother
1   892460  .   G   C   .   PASS    ADP=79;STATUS=2 GT:ABQ:AD:ADF:ADR:DP:FREQ:GQ:PVAL:RBQ:RD:RDF:RDR:SDP    1/1:40:73:51:22:73:100%:99:1.7006E-43:0:0:0:0:73    1/1:44:122:93:29:122:100%:99:6.9324E-73:0:0:0:0:122 1/1:40:43:31:12:43:100%:99:1.5066E-25:0:0:0:0:43

and this is what my PED file looks like:
FAM child father mother 0 -9
FAM father 0 0 1 -9
FAM mother 0 0 2 -9

Catch-22 for filtering reads using FilterSamReads

Hi, while trying to prepare some BAM files using Picard MarkDuplicates, I got a SAM validation error (Padding operator not between real operators in CIGAR) for a few reads. I figured I'd remove them using FilterSamReads. I built a list of reads by running ValidateSamFile and putting the read names into a text file to use as the read list file. However, when I run the filtering step, FilterSamReads stops with the same exception as MarkDuplicates, complaining about the very reads I'm trying to remove. I'm probably missing something here, but I'm stuck right now. The Picard version used is 2.16.0-1-g763d98e-SNAPSHOT, Java is OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-1~bpo8+1-b11, and the command is:
java -jar /home/picard/build/libs/picard.jar FilterSamReads I=input.bam FILTER=excludeReadList O=output.bam USE_JDK_INFLATER=true USE_JDK_DEFLATER=true RLF=input.bam.brokenreads

Exception thrown is:
ERROR 2017-12-12 14:48:43 FilterSamReads Failed to filter 1724-0121-WholeExome_S1_L001_R1_001_paired.bam
htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name NS500396:228:HKL5MBGX2:1:12210:18362:13274_2:N:0:TAAGGCGA, Padding operator not between real operators in CIGAR
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:253)
at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:606)
at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1575)
at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2087)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:811)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:797)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:765)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548)
at picard.sam.FilterSamReads.writeReadsFile(FilterSamReads.java:193)
at picard.sam.FilterSamReads.doWork(FilterSamReads.java:213)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

input.bam.brokenreads looks like this:
NS500396:228:HKL5MBGX2:1:12210:18362:13274_2:N:0:TAAGGCGA
NS500396:228:HKL5MBGX2:4:23505:12020:5193_2:N:0:TAAGGCGA
NS500396:228:HKL5MBGX2:1:13203:21100:6035_1:N:0:TAAGGCGA
etc.

Thanks for any help!
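One possible workaround (untested here, and assuming the offending records only need to be read leniently in order to be skipped) would be to relax validation stringency, which Picard tools accept as a common argument:

     java -jar /home/picard/build/libs/picard.jar FilterSamReads I=input.bam O=output.bam FILTER=excludeReadList RLF=input.bam.brokenreads VALIDATION_STRINGENCY=LENIENT USE_JDK_INFLATER=true USE_JDK_DEFLATER=true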

CRAM support in GATK 3.7 is broken

I have not been able to get GATK 3.7 HaplotypeCaller to work with CRAM files at all (it has a 100% failure rate so far with our whole-genome CRAMs). Based on my analysis of the problem, I don't think GATK 3.7 will work with any CRAM files whose alignment reference contains IUPAC ambiguity codes other than 'N' (which includes GRCh37/hs37d5 and GRCh38/HS38DH).

The error I get is:

ERROR   2017-01-05 02:18:59     Slice   Reference MD5 mismatch for slice 2:60825966-60861215, ATCTTTCATG...CTCTCCCATT
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: SAM/BAM/CRAM file /keep/46909b690725869e1d9bfbc1da4a1398+19932/20657_7.cram is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more
##### ERROR ------------------------------------------------------------------------------------------

This error occurs for 100% of my CRAM files, which can be read by samtools, scramble, or previous versions of GATK (including 3.6) without any issues, so the error message is incorrect and the CRAM files are not malformed.

The CRAM slice in question is on chromosome 3 of hs37d5 (3:60825966-60861215). We can verify externally that the FASTA reference we are passing into GATK with -R does have the md5 that GATK reports it is expecting:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | md5sum
0e0ff678755616cba9ac362f15b851cc  -

And the sequence starts and ends with the bases that htsjdk reports:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c1-10
ATCTTTCATG
$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c35241-
CTCTCCCATT

I ended up recompiling GATK and htsjdk from source and adding some print debugging to htsjdk to dump the whole sequence from which the md5 was being calculated. It seems the sequences that cause problems are regions of the reference with IUPAC ambiguity codes other than 'N' (in this case a slice of chromosome 3 that contains an 'M' and two 'R's). In GATK 3.7 (built with htsjdk 2.8.1), the reference that is used to calculate the md5 for the slice has had all ambiguity codes converted to 'N'. The md5 it calculates for this slice (according to my print debugging) is: 5d820b3624e78202f503796f7330d8d9

I have verified that this is the md5 we would get from converting the IUPAC codes in this slice to N's:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | tr RYMKWSBDHV NNNNNNNNNN | md5sum
5d820b3624e78202f503796f7330d8d9  -

I have tried in vain to figure out where in GATK and/or htsjdk the ambiguous reference bases are being converted to 'N's. I initially thought that it was in the CachingIndexedFastaSequenceFile call to BaseUtils.convertIUPACtoN (when preserveIUPAC is false, although I didn't find any code path that could set it to true). However, after recompiling with preserveIUPAC manually set to true, the problem persisted. I guess there must be some other place where the bases are remapped. I'll leave it to you guys to figure out how to get an unmodified view of the reference for htsjdk to use for CRAM decoding.
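For illustration, the kind of remapping suspected above would look something like this minimal Java sketch (an assumption about the behavior, not the actual GATK/htsjdk code path):

     // Sketch: remap any IUPAC ambiguity code other than A/C/G/T/N to 'N'.
     static byte[] convertAmbiguityCodesToN(final byte[] bases) {
         final byte[] out = new byte[bases.length];
         for (int i = 0; i < bases.length; i++) {
             switch (Character.toUpperCase((char) bases[i])) {
                 case 'A': case 'C': case 'G': case 'T': case 'N':
                     out[i] = bases[i];      // concrete bases and N pass through unchanged
                     break;
                 default:
                     out[i] = (byte) 'N';    // R, Y, M, K, W, S, B, D, H, V become N
             }
         }
         return out;
     }

Feeding the remapped slice to an MD5 digest reproduces the 5d820b36... value shown above rather than the 0e0ff678... value of the unmodified slice, matching the tr/md5sum check earlier in this post.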

There is, however, no mystery as to why this problem has suddenly appeared in GATK 3.7. The slice md5 validation code in htsjdk was only added in July 2016 (https://github.com/samtools/htsjdk/commit/a781afa9597dcdbcde0020bfe464abee269b3b2e). The first release version it appears in is version 2.7.0. Prior to that, it seems CRAM slice md5's were not validated in htsjdk, so this error would not have occurred.

java.lang.ArrayIndexOutOfBoundsException in BaseRecalibrator on Grc38

I get a consistent failure with BaseRecalibrator on a handful of samples. It occurs with every version of GATK from 3.5 to 3.8 and with the current nightly build. I've also tried altering the maximum memory given to Java and changing the knownSites file; it fails in the same way. I've trimmed the command line down to the minimum necessary to generate the error, and I've trimmed the input files down to the minimal section needed to trigger the failure (a specific single read).

You can find the full error below, but I also tracked down the location of the failure and have a proposed fix.

The failure occurs here --> line 184 of public/gatk-engine/src/main/java/org/broadinstitute/gatk/engine/recalibration/covariates/ContextCovariate.java

        while (bases[currentNPenalty] != 'N') {
            final int baseIndex = BaseUtils.simpleBaseToBaseIndex(bases[currentNPenalty]);
            currentKey |= (baseIndex << offset);
            offset -= 2;
            currentNPenalty--;
        }

The current while loop allows the array index to become negative and walk right off the edge. So a proposed fix is as follows (assuming it does not break the covariate logic) -->

        while (currentNPenalty > 0 && bases[currentNPenalty] != 'N') {
            final int baseIndex = BaseUtils.simpleBaseToBaseIndex(bases[currentNPenalty]);
            currentKey |= (baseIndex << offset);
            offset -= 2;
            currentNPenalty--;
        }
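For comparison, a guard that still lets the loop examine index 0 (whether the covariate logic should visit position 0 is an assumption) would only stop once the index would go negative:

        while (currentNPenalty >= 0 && bases[currentNPenalty] != 'N') {
            final int baseIndex = BaseUtils.simpleBaseToBaseIndex(bases[currentNPenalty]);
            currentKey |= (baseIndex << offset);
            offset -= 2;
            currentNPenalty--;
        }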

Minimal Command (test.bam attached - gzipped to allow the attachment to work) -->

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I test.bam -o test.table -R GATK_Bundle_Build38/Homo_sapiens_assembly38.fasta --knownSites GATK_Bundle_Build38/dbsnp_146.hg38.vcf.gz

Error Message -->

java.lang.ArrayIndexOutOfBoundsException: -1
    at org.broadinstitute.gatk.engine.recalibration.covariates.ContextCovariate.contextWith(ContextCovariate.java:184)
    at org.broadinstitute.gatk.engine.recalibration.covariates.ContextCovariate.recordValues(ContextCovariate.java:100)
    at org.broadinstitute.gatk.engine.recalibration.RecalUtils.computeCovariates(RecalUtils.java:926)
    at org.broadinstitute.gatk.engine.recalibration.RecalUtils.computeCovariates(RecalUtils.java:906)
    at org.broadinstitute.gatk.tools.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:286)
    at org.broadinstitute.gatk.tools.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:156)
    at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228)
    at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102)
    at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:107)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: -1
##### ERROR ------------------------------------------------------------------------------------------