Channel: Recent Discussions — GATK-Forum

GATK4: RMSMappingQuality results differ between v4.0.0.0 and v4.1.1.0

Good morning everybody, and thanks in advance for your advice and your help.
I searched for this problem before submitting this question; I hope it is not a duplicate.

We are working on whole-genome sequencing and SNP identification with the GATK Best Practices workflow.
We got strange results after a change of version (we usually work with GATK v4.0.0.0, but recently switched to v4.1.1.0), so we conducted a little test.

We took 40 BAM files produced with GATK v4.0.0.0 and ran the workflow twice to obtain VCF files, once with GATK v4.0.0.0 and once with GATK v4.1.1.0.

Then we followed the recommendations for hard-filtering germline short variants (article id 11069 of the GATK documentation).
We filtered on MQ < 40, SOR > 3 and FS > 60 as recommended, and we observed a dramatic decrease in SNP count between versions:
3.14M SNPs for v4.0 against 615k SNPs for v4.1.

We plotted the FS, SOR and MQ values obtained with the two workflows.
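(For reference, we extracted the annotations from each VCF with something along these lines; a sketch, with placeholder file names:)

gatk VariantsToTable -V calls.vcf.gz -F CHROM -F POS -F FS -F SOR -F MQ -O annotations.tsv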

FS v4.0 vs v4.1 is a straight line, i.e. the FS values did not change between versions.
This is the kind of result we expect.

(see first figure, FS, top left; sorry, my account is not old enough to include links)

The SOR results are a little messier, but they look acceptable and do not explain our strange SNP filtering results.

(see second figure, SOR, top right)

The MQ results seem to be the problem.

(see third figure, MQ, in the middle of the page)

The MQ values are systematically lower with v4.1 than with v4.0.
You can see on the individual graphs that most of the v4.1 MQ values fail the filter because they fall below 40.

(see fourth and fifth figures on the bottom of the page)

I don't know how to explain these results.

I went to GitHub and found the "Improve MQ calculation accuracy" change (#4969).
Apparently the MQ calculation was improved, and additional tests were implemented.
But I assumed MQ values were not supposed to change between versions.

Are the MQ tests the same in v4.0 and v4.1?
Is there another test that should be used instead of MQ?
Did I miss anything obvious?
Thanks for reading!

Getting started with GATK4


GATK, pronounced "Gee Ay Tee Kay" (not "Gat-Kay"), stands for GenomeAnalysisToolkit. It is a collection of command-line tools for analyzing high-throughput sequencing data with a primary focus on variant discovery. The tools can be used individually or chained together into complete workflows. We provide end-to-end workflows, called GATK Best Practices, tailored for specific use cases.

Starting with version 4.0, GATK contains a copy of the Picard toolkit, so all Picard tools are available from within GATK itself and their documentation is available in the Tool Documentation section of this website.


Contents

  1. Quick start for the impatient
  2. Requirements
  3. Get GATK
  4. Install it
  5. Test that it works
  6. Run GATK and Picard commands
  7. Grok the Best Practices
  8. Run pipelines
  9. Get help

1. Quick start for the impatient

  • Run on Linux or MacOSX; MS Windows is not supported.
  • Make sure you have Java 8 / JDK 1.8 (Oracle or OpenJDK, doesn't matter).
  • Download the GATK package here OR get the Docker image here.
  • There are two jars for packaging reasons, but don't worry about it; see the next point.
  • Invoke GATK through the gatk wrapper script rather than calling either jar directly.
  • Basic syntax is gatk [--java-options "-Xmx4G"] ToolName [GATK args]; full details here.
  • If you need help, read the User Guide and ask questions on the forum.

2. Requirements

Most GATK4 tools have fairly simple software requirements: a Unix-style OS and Java 1.8. However, a subset of tools have additional R and/or Python dependencies. These dependencies (as well as the base system requirements) are described in detail here. Because managing those dependencies can be fiddly, we strongly recommend using the Docker container system, if that's an option on your infrastructure, rather than a custom installation. All released versions of GATK4 are available as prepackaged container images on Dockerhub here. If you can't use Docker, do yourself a favor and use the Conda environment that we provide to manage dependencies, as described in the github repository README. If you run into a pip error after recently updating your Mac OS, see this solution.
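For example, creating and activating the Conda environment looks roughly like this (a sketch; the exact environment file name and any extra steps are given in the repository README):

conda env create -n gatk -f gatkcondaenv.yml
source activate gatk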

You will also need Python 2.6 or greater to run the gatk wrapper script (described below).

If you run into difficulties with the Java version requirement, see this article for help.


3. Get GATK

You can download the GATK package here OR get the Docker image here. The instructions below will assume you downloaded the GATK package to your local machine and are planning to run it directly. For instructions on how to go the Docker route, see this tutorial.

Once you have downloaded and unzipped the package (named gatk-[version]), you will find four files inside the resulting directory:

gatk
gatk-package-[version]-local.jar
gatk-package-[version]-spark.jar
README.md 

Now you may ask, why are there two jars? As the names suggest, gatk-package-[version]-spark.jar is the jar for running Spark tools on a Spark cluster, while gatk-package-[version]-local.jar is the jar that is used for everything else (including running Spark tools "locally", i.e. on a regular server or cluster).

So does that mean you have to specify which one you want to run each time? Nope! See the gatk file in there? That's an executable wrapper script that you invoke and that will choose the appropriate jar for you based on the rest of your command line. You could still invoke a specific jar if you wanted, but using gatk is easier, and it will also take care of setting some parameters that you would otherwise have to specify manually.
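For illustration, the following two invocations should be roughly equivalent (a sketch; substitute the actual version number in the jar name):

gatk --list
java -jar gatk-package-[version]-local.jar --list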


4. Install it

There is no installation necessary in the traditional sense, since the precompiled jar files should work on any POSIX platform that satisfies the requirements listed above. You'll simply need to open the downloaded package and place the folder containing the jar files and launch script in a convenient directory on your hard drive (or server filesystem). Although the jars themselves cannot simply be added to your PATH, you can do so with the gatk wrapper script. Please look up instructions depending on the terminal shell you use; in bash the typical syntax is export PATH=$PATH:/path/to/gatk-package where /path/to/gatk-package is the directory containing the gatk executable. Note that the jars must remain in the same directory as gatk for it to work.
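Concretely, that might look like the following (a sketch; adjust the path to wherever you unzipped the package):

export PATH=$PATH:/path/to/gatk-[version]
which gatk    # should print the location of the wrapper script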


5. Test that it works

To test that you can successfully invoke GATK, run the following command in your terminal application. Here we assume that you have added gatk to your PATH as recommended above:

gatk --help

This should output a summary of the invocation syntax, options for listing tools and invoking a specific tool's help documentation, and main Spark options if applicable.


6. Run GATK and Picard commands

Available tools are listed and described in some detail in the Tool Documentation section, along with available options. The basic syntax for invoking any GATK or Picard tool is the following:

gatk [--java-options "jvm args like -Xmx4G go here"] ToolName [GATK args go here]

So for example, a simple GATK command would look like:

 gatk --java-options "-Xmx8G" HaplotypeCaller -R reference.fasta -I input.bam -O output.vcf

You can find more information about GATK command-line syntax here.

Syntax for Picard tools

When used from within GATK, all Picard tools use the same syntax as GATK. The conversion relative to the "Picard-style" syntax is very straightforward; wherever you used to do e.g. I=input.bam, you now do -I input.bam. So for example, a simple Picard command would look like:

gatk ValidateSamFile -I input.bam -MODE SUMMARY

7. Grok the Best Practices

The GATK Best Practices are end-to-end workflows that are meant to provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. We have several such workflows tailored to project aims (by type of variants of interest) and experimental designs (by type of sequencing approach). And although they were originally designed for human genome research, the GATK Best Practices can be adapted for analysis of non-human organisms of all kinds, including non-diploids.

The documentation for the Best Practices includes high-level descriptions of the processes involved, various types of documents that explain deeper details and adaptations that can be made depending on constraints and use cases, a set of actual pipeline implementations of these recommendations, and, perhaps most important, workshop materials including slide decks, videos and tutorials that walk you through every step.


8. Run pipelines

Most of the work involved in processing sequence data and performing variant discovery can be automated in the form of pipeline scripts, which often include some form of parallelization to speed up execution. We provide scripted implementations of the GATK Best Practices workflows plus some additional helper/accessory scripts in order to make it easier for everyone to run these sometimes rather complex workflows.

These workflows are written in WDL and intended to be run on any platform that supports WDL execution. Options are listed in the Pipelining section of the User Guide. Our preferred option is the Cromwell execution engine, which, like GATK, is developed by the Broad's Data Sciences Platform (DSP) and is available as a service on our cloud-based portal, FireCloud. Note that if you choose to run GATK workflows through FireCloud, you don't really need to do any of the above, since everything is already preloaded in a ready-to-run form (the software, the scripts, even some example data). At this point FireCloud is the easiest way to run the workflows exactly as we do in our own work.
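For local experimentation, running one of the WDL workflows with Cromwell looks roughly like this (a sketch; it assumes you have downloaded a Cromwell release jar and prepared an inputs JSON for the workflow):

java -jar cromwell.jar run pipeline.wdl --inputs pipeline.inputs.json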


9. Get Help

We provide all support through our very active community forum. You can ask questions and report any problems that you might encounter, with the following guidelines:

Before asking for help

Before posting to the Forum, please do the following:

  1. Use the Search box in the top-right corner of every page -- it will search everything including the User Guide and the Forum.
  2. If something is not working:
    • run validation checks on all your input files to make sure they're all properly formatted
    • look at the Solutions to Problems section of the User Guide, which covers common issues that are not bugs
    • search the forum for previous reports, e.g. using the error message
    • try again with the latest version of whatever tool you're using
    • check the Bugs and Feature Requests section of the User Guide

When asking for help

When asking a question about a problem, please include the following:

  1. All version information (GATK version, Java, your operating system if possible).
  2. Don't just tell us you are following the Best Practices -- describe exactly what you are doing.
  3. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
  4. For tool errors, include the full command you ran AND the stacktrace (i.e. the long pile of unreadable software gobbledygook in the terminal output) if there is one.
  5. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
  6. For weird/unexpected results, include an illustrative example, e.g. attach IGV screenshots, and explain in detail why you think the result is weird -- especially if you're working with non-human data. We may not be aware of your organism's quirks.

We will typically get back to you with a response within one or two business days, but be aware that more complex issues (or unclear reports) may take longer to address. In addition, some times of the year are especially busy for us and we may take longer than usual to answer your question.

We may ask you to submit a formal bug report, which involves sending us some test data that we can use to reproduce the problem ourselves. This is often required for debugging. Rest assured we treat all data transferred to us as private and confidential. In some cases we may ask for your permission to include a snippet of your test case in our testing framework, which is publicly accessible. In such a case, YOU are responsible for verifying with whoever owns the data whether you are authorized to allow us to make that data public.

Note that the information in this documentation guide is targeted at end-users. For developers, the source code and related resources are available on GitHub.

How to edit MULTIPLE read groups in one bam file


Hi everyone,

I recently received a WGS BAM from Broad for 1 sample, but with about 8 read groups. BQSR kicked it back saying that the sequencer name in the read group is not recognized.

Anyway, I need to edit the sequencer name so that BQSR can run. AddOrReplaceReadGroups in Picard will toss out the 8 RGs and add a single RG, so that will not work. So how do you edit one or two of the RGs, or replace all 8 RGs in the BAM?

I am sure this is a common issue.
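For what it's worth, the workaround I am considering is rewriting the header with samtools (a sketch; it assumes the offending sequencer name appears only in the @RG lines):

samtools view -H input.bam | sed 's/OLD_SEQUENCER/NEW_SEQUENCER/g' | samtools reheader - input.bam > fixed.bam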

Thanks

Filtering VCF help


Hi,

I am trying to filter a VCF based on two criteria: 1) coverage > 3x, and 2) a minimum allele frequency of 12% among these > 3x-filtered variants.
For the coverage filter I used the expression --filterExpression "DP >= 3", but my question is: what would be a suitable expression for the second filter?
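(For context, the kind of command I have in mind is sketched below, using GATK4-style argument names and assuming the VCF actually carries an AF annotation to filter on; note that VariantFiltration marks records matching the expression as filtered, so the expressions describe failures:)

gatk VariantFiltration -V input.vcf --filter-expression "DP < 3" --filter-name "lowDP" --filter-expression "AF < 0.12" --filter-name "lowAF" -O filtered.vcf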

Any help will be great.
Thanks,
Satish

GenotypeGVCFs exits before Traversal is complete


Hi,

I am running gatk-4.1.0.0 on CentOS 6.3. My Java version is jdk1.8.0_152.

I split the human genome into 40 intervals, each of which is < 100 Mb, with interval boundaries placed in gap regions of the reference genome. The variants were called separately for each interval.

I followed the GATK Best Practices for germline SNPs/indels. First I used HaplotypeCaller to generate GVCF files for each sample, and then I used GenomicsDBImport to import the GVCF files into a GenomicsDB workspace.
The GenomicsDBImport step was successful, as I can see the following information:

20:24:25.849 INFO  ProgressMeter - Traversal complete. Processed 10 total batches in 2633.3 minutes.
20:24:25.849 INFO  GenomicsDBImport - Import of all batches to GenomicsDB completed!

When I run GenotypeGVCFs, however, it always exits before the traversal is complete. I tried increasing the memory to 48G (-Xmx48g -Xms48g), but it didn't help.

For example, I can see the following information:

21:06:38.038 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
21:07:10.501 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:07:10.501 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.0.0
21:07:10.501 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
21:07:10.502 INFO  GenotypeGVCFs - Executing as xieshangqian@compute-0-0.local on Linux v2.6.32-279.14.1.el6.x86_64 amd64
21:07:10.502 INFO  GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_152-b16
21:07:10.502 INFO  GenotypeGVCFs - Start Date/Time: June 18, 2019 9:06:37 PM CST
21:07:10.502 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:07:10.502 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Version: 2.18.2
21:07:10.503 INFO  GenotypeGVCFs - Picard Version: 2.18.25
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
21:07:10.503 INFO  GenotypeGVCFs - Deflater: IntelDeflater
21:07:10.503 INFO  GenotypeGVCFs - Inflater: IntelInflater
21:07:10.503 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
21:07:10.503 INFO  GenotypeGVCFs - Requester pays: disabled
21:07:10.503 INFO  GenotypeGVCFs - Initializing engine
WARNING: No valid combination operation found for INFO field DB - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DB - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
21:07:28.713 INFO  GenotypeGVCFs - Done initializing engine
21:07:28.768 INFO  ProgressMeter - Starting traversal
21:07:28.769 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
WARNING: No valid combination operation found for INFO field DB - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
21:13:19.041 INFO  ProgressMeter -           X:76704691              5.8                  1000            171.3
21:13:52.734 INFO  ProgressMeter -           X:76705691              6.4                  2000            312.5
21:14:42.171 INFO  ProgressMeter -           X:76706691              7.2                  3000            415.3
21:15:31.340 INFO  ProgressMeter -           X:76707691              8.0                  4000            497.3
21:16:28.727 INFO  ProgressMeter -           X:76708691              9.0                  5000            555.6
21:16:47.125 INFO  ProgressMeter -           X:76709691              9.3                  6000            644.7
21:17:31.225 INFO  ProgressMeter -           X:76711691             10.0                  8000            796.7
21:18:18.273 INFO  ProgressMeter -           X:76712691             10.8                  9000            831.4
21:18:28.852 INFO  ProgressMeter -           X:76713691             11.0                 10000            909.0
21:18:42.161 INFO  ProgressMeter -           X:76714691             11.2                 11000            980.1
21:18:53.288 INFO  ProgressMeter -           X:76717691             11.4                 14000           1227.1
21:19:29.089 INFO  ProgressMeter -           X:76721691             12.0                 18000           1499.3
21:19:53.928 INFO  ProgressMeter -           X:76727691             12.4                 24000           1932.5
21:20:37.732 INFO  ProgressMeter -           X:76739691             13.1                 36000           2737.8
21:20:55.007 INFO  ProgressMeter -           X:76750691             13.4                 47000           3497.7
21:22:40.732 INFO  ProgressMeter -           X:76774691             15.2                 71000           4671.2

......
......


09:06:20.738 INFO  ProgressMeter -           X:90915023            718.9              14204000          19758.9
09:06:35.564 INFO  ProgressMeter -           X:90918023            719.1              14207000          19756.3
09:06:51.435 INFO  ProgressMeter -           X:90920023            719.4              14209000          19751.8
09:07:03.766 INFO  ProgressMeter -           X:90921023            719.6              14210000          19747.5
09:08:05.525 INFO  ProgressMeter -           X:90932023            720.6              14221000          19734.6
09:08:18.631 INFO  ProgressMeter -           X:90949023            720.8              14238000          19752.2
09:08:28.639 INFO  ProgressMeter -           X:90955023            721.0              14244000          19756.0
Using GATK jar /public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx48g -Xms48g -jar /public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R /public/users/xieshangqian/fangli/db/hg19/hs37d5.fa -V gendb:///public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/genomicsdb_reg20 -O /public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/NGS.reg20.raw.vcf.gz

The last position it processed is X:90955023; however, the interval is X:76653693-155270560, which means a large portion of the variants were not processed. I can't see these variants in the vcf.gz file either.

If I run the same command again, GenotypeGVCFs may exit at a different position. For example, I ran it again and got the following information:

......

04:08:36.664 INFO  ProgressMeter -           X:89888970            809.8              13178000          16273.5
04:08:51.137 INFO  ProgressMeter -           X:89903970            810.0              13193000          16287.1
04:09:03.381 INFO  ProgressMeter -           X:89912970            810.2              13202000          16294.2
04:09:49.453 INFO  ProgressMeter -           X:89924970            811.0              13214000          16293.5
04:10:02.710 INFO  ProgressMeter -           X:89927970            811.2              13217000          16292.8
04:10:12.716 INFO  ProgressMeter -           X:89936970            811.4              13226000          16300.5
04:10:28.162 INFO  ProgressMeter -           X:89951970            811.6              13241000          16313.8
Using GATK jar /public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -Xms4g -jar /public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R /public/users/xieshangqian/fangli/db/hg19/hs37d5.fa -V gendb:///public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/genomicsdb_reg20 -O /public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/NGS.reg20.raw.vcf.gz


The last position is X:89951970, which is different from the previous one. The vcf.gz file size is also different.
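(In case it is relevant: one thing I plan to check is whether the process is being killed by the operating system's out-of-memory killer, e.g. with something like the following; this is just a guess on my part.)

dmesg | grep -i 'killed process'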

For a few intervals, the GenotypeGVCFs step succeeded, and I can see the following information:

   ProgressMeter - Traversal complete. Processed 24096952 total variants in 1819.9 minutes.

My command is:

###  GenomicsDBImport
/public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport -L X:76653693-155270560 --genomicsdb-workspace-path /public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/genomicsdb_reg20 --batch-size 10  -V sample1.gatk_raw.g.vcf.gz -V sample2.gatk_raw.g.vcf.gz -V sample3.gatk_raw.g.vcf.gz ...


###  GenotypeGVCFs
/public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk --java-options "-Xmx48g -Xms48g" GenotypeGVCFs -R /public/users/xieshangqian/fangli/db/hg19/hs37d5.fa -V gendb:///public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/genomicsdb_reg20  -O /public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/NGS.reg20.raw.vcf.gz

Looking forward to your help!

Thank you!
Li

Does the FastaAlternateReferenceMaker also work for simple sequence repeats?


Hello,
I have questions regarding the FastaAlternateReferenceMaker. Does it also work for simple sequence repeats? How long can the repeats be and still be integrated into the new reference?
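(For reference, the invocation I have in mind is along these lines; a sketch with placeholder file names:)

gatk FastaAlternateReferenceMaker -R reference.fasta -V variants.vcf -O alternate_reference.fasta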

Many thanks in advance!

Read groups


There is no formal definition of what a read group is, but in practice the term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

Meaning of the read group fields required by GATK

  • ID = Read group identifier
    This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.
    Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects; therefore, each read group is effectively treated as a separate run of the instrument in data-processing steps such as base quality score recalibration, since all reads within a read group are assumed to share the same error model.

  • PU = Platform Unit
    The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. The PU is not required by GATK, but it takes precedence over ID for base recalibration if it is present. In the example shown earlier, the two read group fields ID and PU appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects.

  • SM = Sample
    The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name. Note, when we say pools, we mean samples that are not individually barcoded. In the case of multiplexing (often confused with pooling) where you know which reads come from each sample and you have simply run the samples together in one lane, you can keep the SM tag as the sample name and not the "pooled name".

  • PL = Platform/technology used to produce the read
    This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

  • LB = DNA preparation library identifier
    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.
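For example, a minimal invocation might look like the following (a sketch; the field values shown are placeholders to be replaced with your sample's actual metadata):

gatk AddOrReplaceReadGroups -I input.bam -O output.bam -RGID H0164.2 -RGLB Solexa-272222 -RGPL ILLUMINA -RGPU H0164ALXX140820.2 -RGSM NA12878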


Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portions of the read names, which come after the flow cell lane and are separated by colons, are the tile number, the x-coordinate of the cluster and the y-coordinate of the cluster.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane
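As a quick sanity check, you can pull the flowcell and lane out of the first read name directly (a sketch, assuming colon-delimited read names as above):

samtools view sample.bam | head -n 1 | cut -d: -f1,2    # prints e.g. H0164ALXX140820:2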

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship from read groups (unique for each lane) to libraries (each sequenced on two lanes) to samples (each spanning four lanes, two lanes per library).

How to interpret results from GATK ApplyVQSR

Hello,

I am currently following the GATK v4 Best Practices pipeline for short germline variant calling. The final step is applying a variant quality score recalibration model to the output VCF file, using different tranche sensitivity values. However, the documentation for this is very unclear and I am struggling to identify which variants are the ones I want to keep - are the ones labelled as PASS the ones I want to select? Or the ones that contain tranche values?

The only clue I can find in old documentation is that variants labelled as "filtered" should be discarded - but this word doesn't appear anywhere in my output file. Could the documentation be updated for version 4 to make this clearer for other users too?
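(For context, if PASS records are indeed the ones to keep, I assume the selection step would look something like this sketch:)

gatk SelectVariants -V recalibrated.vcf.gz --exclude-filtered -O pass_only.vcf.gz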

Any help is massively appreciated!!

Graphical (GUI) and interactive exploration tool for large genotype matrices like 1KG or gnomAD.


Dear GATK development team and GATK users,

What is currently the best visual (GUI) and interactive genotype matrix exploration tool (a browser) for large genotype matrices, say the 1000 human genomes VCF?
Or something between the 1000 Genomes VCF and the gnomAD (15K genomes) VCF? The full VCF, including the genotypes, should be visualized and explorable, not just the variant sites.

So 100M plus variants, 1000+ samples, raw uncompressed VCF file size 1TB+.

One requirement is that it should do all kinds of filtering that 'bcftools view' or 'GATK VariantFiltration' does:
http://www.htslib.org/doc/bcftools.html#view
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_filters_VariantFiltration.php#--filter-expression

But then in an interactive and visual way (graphical user interface).
Queries should return within seconds, and a (paged) variant and genotype table should be shown - maybe even summary stats for the current selection.

Does something like this already exist? If so, which tools? Or is it being built by someone? If not, why not?

My preference would be:
1. An open source solution that builds on bcftools or GATK, or the HTS-JDK or HTSLib libraries. Maybe in combination with an open source big data backend.
2. A standard commercial front end/analytical tool (e.g. SpotFire/Tableau) that takes in a tab-delimited file created by bcftools query or GATK VariantsToTable (see the sketch after this list). The downside is of course that SpotFire/Tableau don't have any genomics/genetics domain logic that can be used for filtering the table. And a machine with very large memory would be needed, since all data is loaded into memory? Did anyone try this?
3. A standard commercial front end/analytical tool (e.g. SpotFire/Tableau) that somehow works with the domain logic of bcftools/GATK/HTS-JDK/HTSlib in the backend? Maybe with a 'big data' distributed or in-memory database backend, e.g. Apache Spark? Is this possible?
4. A custom commercial software front end tool that builds on top of the functionality/results of GATK GenotypeGVCFs or maybe even IntelGenomicsDB.
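For option 2, the export step might look like this (a sketch; the format string just illustrates pulling site fields plus per-sample genotypes):

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input.vcf.gz > genotypes.tsv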

Thank you.

Test drive GATK Best Practices workflows on Terra


Last week, I wrote about a new initiative we're kicking off to make it easier to get started with GATK. Part of that involves making it easier for anyone to try out the Best Practices workflows without having to do a ton of work up front. That's a pretty big can of worms, because for a long time the Best Practices were really meant to describe at a high level the key GATK (and related) tools/steps you need to run for a particular type of analysis (e.g. germline short variant discovery). They weren't intended to provide an exact recipe of commands and parameters… Yet that's what many of you have told us you want.

For the past couple of years we've been providing actual reference implementations in the form of workflows written in the Workflow Description Language, but that still leaves you with a big old learning curve to overcome before you can actually run them. And we know that for many of you, that learning curve can feel both overwhelming and unwarranted - especially when you're in the exploratory phase of a project and you're not even sure yet that you'll end up using GATK.

To address that problem, we've set up all the GATK Best Practices workflows in public workspaces on our cloud platform, Terra. These workspaces feature workflows that are fully configured with all commands and parameters, as well as resource files and example data you need to run them right out of the box. All it takes is a click of a button! (Almost. There's like three clicks involved, for real).

Let me show you one of these workspaces, and how you would use it to try out Best Practices pipelines. It should take about 15 mins if you follow along and actually click all the things. Or you can just read through to get a sense of what's involved.


GATK Best Practices workspaces live in the Terra Showcase library

Terra has a growing library of workspaces showcasing a variety of analysis use cases and tools, including GATK. You can get to it by clicking the "View Examples" button on the Terra landing page or selecting "Terra Library" then "Showcase" in the dropdown menu (top left icon, horizontal stripes) from any page.

If you go there now (go on, we'll wait for you) you'll be asked to log in with a Google identity. If you don't have one already you can create one, and choose to either create a new Gmail account for it or associate your new Google identity with your existing email address. See this article for step-by-step instructions on how to register. Once you've logged in, look for the big green banner at the top of the screen and click "Start trial" to take advantage of the free credits program. As a reminder, access to Terra is free but Google charges you for compute and storage; the credits (a $300 value!) will allow you to try out the Best Practices for free.

Let's try out the germline short variants pipeline

The Terra Showcase is organized in two major categories: "GATK4 Examples" are all the Best Practices workspaces, and "Featured Workspaces" are various others (including GATK workshop materials -- I'll cover that in an upcoming blog post in this series). Find the "Germline-SNPs-Indels-GATK4-hg38" card and click on it to access a read-only version of the workspace. If you want to be able to actually run things, you need to clone it. To do that, expand the workspace action menu (three-dot icon, top right) and select the "Clone" option. The resulting workspace clone belongs to you. See the animation below or this article for an exact step-by-step walkthrough.

You can find a detailed description of the workspace contents in the Dashboard tab, including instructions and links to relevant documentation. There's a lot of interesting info in there that we could go into, but let's zip straight over to the Data tab to look at the example data that we're providing as input for testing the pipeline.

Example data

Go to the Data tab of the workspace and click on "sample" in the left hand menu to see the table of input samples we provide. This is all metadata; the actual data files live in Google Cloud Storage. Later I'll point you to docs where you can learn more about how that works and how you can import your own data securely (it stays private unless you choose to share it) but for now, I just want to point out that in this workspace, we provide a full whole genome (WGS) input dataset in CRAM format for full-scale testing as well as a "small" downsampled dataset in BAM format for running faster tests, typically as sanity checks.

There's also a table called "Workspace Data" that lists resource files like the reference genome, known variants files, interval lists and so on -- everything you need to run the pipeline. So let's do that now.

Pre-configured Best Practices workflows

Finally, we get to the good stuff! The workflows are set up in the Tools tab of your workspace. In this particular one, you should see three workflows corresponding to the pre-processing, single-sample calling and joint variant discovery portions of the Best Practices for germline SNP & Indel discovery, respectively:

1-Processing-For-Variant-Discovery takes the raw data in unmapped BAM format to analysis-ready BAMs (and yes, we have conversion utilities in the Sequence-Format-Conversion workspace in case your data is in FASTQ);
2-Haplotypecaller-GVCF takes the output of the first WDL and does the variant calling per sample, producing a GVCF;
3-Joint-Discovery implements the joint calling and VQSR filtering portion to return a VCF file and its index.

The three workflows are designed to be run back-to-back. Each workflow's outputs will get added to the data table in the appropriate columns, so that the next workflow will find the right inputs automatically.

Click on the first tool to load up the details; the page will open at the inputs definition form, which is pre-filled for you. To launch the workflow, select some data to run it on, hit the "Run analysis" button then click "Launch" in the popup dialog, as shown in the animation below.

That's all it takes! Congratulations, especially if this is the first GATK pipeline you've ever run.

You can check its status in the Job History tab; as the system processes your request, the status label will change from “Queued” to “Submitted” to “Done” (remember to refresh the page to see the current status). Behind the scenes, Terra will interpret the workflow script, dispatch jobs for execution on Google Cloud virtual machines (with parallelization in all the right places), move data around as needed, and eventually write the results to your workspace storage bucket. The best part of all that? You don't have to worry about any of it. :-)

The Dashboard lists expected runtime and costs of each workflow for each input dataset provided for testing. For example, you see that you can run the complete pipeline on the 3GB sample NA12878_24RG_small in about six hours, for less than the cost of a medium Dunkin's coffee.

What's next?

At this point you should have a sense of what it's like to test drive GATK workflows on Terra. If you'd like to learn more about how you can take further advantage of these resources, e.g. by uploading your own data to evaluate how our pipelines behave on that, have a look at this quick start guide. You may also want to check out this handy utility workspace that contains preconfigured tools for converting between various input formats, or look at the other GATK Best Practices workspaces in the Terra Showcase.

Next week I'll walk you through using the workspaces that we use in workshops to teach the component steps of each pipeline within Jupyter notebooks. If you want a sneak peek, have a look at this tutorial workspace, where all the action is in the Notebooks tab...

And of course, we're always here to help

It's the same crack team that provides frontline support for both Terra and GATK, so whenever you're using Terra, you can expect the same speedy and caring support you're used to getting on the GATK forum. In fact, you can even write to the support team privately through the Terra Support helpdesk, which you can't do in the GATK forum…

Trouble accessing FTP files


I'm trying to access files in the FTP bundle. Once I connect I can see the directory, but I am getting the following errors:

USER gsapubftp-anonymous

<<< 331 Anonymous login ok, send your complete email address as your password

PASS ***********

<<< 530 Sorry, max 25 users -- try again later

--> FTP reconnected

CWD /bundle/hg19/

Error EElFTPSError: Control channel transfer error

I have a spinning wheel and can't access files in the directory. I would appreciate your help with this. I am trying to access files for the variant recalibrator step. Thanks.

How to create and collect PONs using GATK (4.1.2.0)

Hi GATK team,
I am using the latest version of GATK (4.1.2.0) for the somatic short variant discovery pipeline. I have some problems at the PON creation step using the CreateSomaticPanelOfNormals function. When I try the following commands, it returns: A USER ERROR has occurred: v is not a recognized option

Create PONs: gatk Mutect2 -R ref.fasta -I normal1.recalib.bam -normal normal1 -disable-read-filter MateOnSameContigOrNoMappedMateReadFilter -L targets.interval_list -O normal1.vcf.gz

Collect PONs: gatk CreateSomaticPanelOfNormals -vcfs normal1.vcf.gz -vcfs normal2.vcf.gz -vcfs normal3.vcf.gz -O all_pon.vcf.gz
I also tried -V instead of -vcfs, but I got the same error.

I also tried the following commands:

Step 1. Run Mutect2 in tumor-only mode for each normal sample.
gatk Mutect2 -R reference.fasta -I normal1.bam -O normal1.vcf.gz

Step 2. Create a GenomicsDB from the normal Mutect2 calls.
gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list \
--genomicsdb-workspace-path pon_db \
-V normal1.vcf.gz \
-V normal2.vcf.gz \
-V normal3.vcf.gz

Step 3. Combine the normal calls using CreateSomaticPanelOfNormals.
gatk CreateSomaticPanelOfNormals -R reference.fasta -V gendb://pon_db -O pon.vcf.gz

And it returned "Warning: CreateSomaticPanelOfNormals is a BETA tool and is not yet ready for use in production".

Should I use one of the older versions of GATK?

Could you suggest any commands that I can use with GATK 4.1.2.0 to create PONs?

Regards,

Running Mutect2 on tumor-only mode in Firecloud


Hi, I have exomes for several tumors without any matched normals. I am trying to call mutations using Mutect's tumor-only mode in the featured "Somatic-SNVs-Indels-GATK4" workspace. However, in this workspace, I only see methods for a matched tumor-normal pair or for creating a panel of normals. Can you please tell me how I can run Mutect in tumor-only mode in Firecloud? Thanks!
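(For reference, the command-line equivalent of what I want to run is, I believe, something like this sketch:)

gatk Mutect2 -R reference.fasta -I tumor.bam -O tumor_only.vcf.gz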

Does providing an exome target interval list overlook many non-targeted, high-quality SNPs?


Hi GATK Team,

I recently came across a paper that states that exome sequencing can generate high-quality SNPs in non-targeted regions, and even in regions far from the targets: ncbi.nlm.nih.gov/pubmed/22607156. I just completed variant calling on some exome data and used the kit manufacturer's exome target list to restrict the variant calling during the process. However, I wonder if, in retrospect, this was the best thing to do.

I will likely go back and redo the variant calling without an exome target interval list to see how many other SNPs (if any) we get. In the meantime, I wanted to post this reference here in case other GATK users (particularly those doing exome sequencing) find it interesting, and to ask the GATK team whether they have any thoughts on not using exome interval lists during variant calling on exome data. Perhaps it's just a tradeoff between runtime and overall SNP count...

Anyway, thank you and keep up the good work.

GenomicsDBImport: problem using more than one chromosome as intervals


I have mapped reads and called GVCF files with HaplotypeCaller using a genome whose chromosomes are named with NCBI-style accessions, e.g., NC_018723.3. With GenomicsDBImport, if I use a single chromosome as the interval with -L NC_018723.3, the program runs as expected, but if I specify more than one chromosome using the recommended syntax -L NC_018723.3,NC_018724.3, I receive the following error message:
A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "NC_018723.3,NC_018724.3" is not valid for this input
How can I specify several chromosomes as intervals when they are named this way? I tried single quotes, but it gave the same error message.
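(For completeness, one variation I have not tried yet would be passing each chromosome as its own -L argument, sketched here:)

gatk GenomicsDBImport -L NC_018723.3 -L NC_018724.3 --genomicsdb-workspace-path my_db -V sample1.gatk_raw.g.vcf.gz ...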
Thanks
Thierry


Are there runtime measurements of the HaplotypeCaller with the PairHMM FPGA accelerator?

Hello.

I would like to know if there are runtime measurements of the HaplotypeCaller with the "--pairHMM EXPERIMENTAL_FPGA_LOGLESS_CACHING" option against other options such as "--pairHMM FASTEST_AVAILABLE"?
(which will not even try to use the FPGA as can be seen in the source code of PairHMM.java line 81+)

Or if someone did experiment with this and made the results available somewhere?

The only results available (or at least that I could find) seem to come from synthetic benchmarks and not from benchmarks run with the HaplotypeCaller itself in a real-life situation.

Did anyone run the FPGA version against the fastest available (e.g., AVX) version?
I do not have access to the cards supported by the current GKL FPGA implementation otherwise I would have done these measurements myself.

While the performance of the FPGA accelerators looks really nice on paper, I am interested in real test cases.
Is there anyone who ran the accelerator with a job similar to what is done in the "GATK Tutorial :: Germline SNPs & Indels" or (even better) with bigger workloads/datasets?

Thank you very much.
Regards.
Rick

PathSeq running too slow


Hi,

I am trying to run the PathSeq pipeline on single-end RNA-seq data (a 2.37GB unaligned BAM file). The pipeline is still running after 3 days.

Below is a snapshot of the log file for the step that is taking the most time. Is there any option I can use to speed up this process?

Thanks

Regards

Gaurav

command line parameters used:

gatk-4.1.2.0/gatk --java-options "-Xmx300000m" \
            PathSeqPipelineSpark H-CRC-07TT-APC_S71_L006_unaligned.bam \
            --output H-CRC-07TT-APC_S71_L006.pathseq.bam \
            --scores-output H-CRC-07TT-APC_S71_L006.pathseq.tsv \
            --filter-metrics H-CRC-07TT-APC_S71_L006.pathseq.filter_metrics \
            --score-metrics H-CRC-07TT-APC_S71_L006.pathseq.score_metrics \
            --kmer-file cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_host.bfi \
            --filter-bwa-image cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_host.fa.img \
            --microbe-bwa-image cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_microbe.fa.img \
            --microbe-fasta cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_microbe.fa \
            --taxonomy-file cromwell-executions/PathSeqPipelineWorkflow/aaabb67b-a6c0-4fe3-910c-3cf0d12018b1/call-PathseqPipeline/inputs/284996793/pathseq_taxonomy.db \
            --bam-partition-size 4000000 \
            --is-host-aligned false \
            --skip-quality-filters false \
            --min-clipped-read-length 60 \
            --filter-bwa-seed-length 19 \
            --host-min-identity 30 \
            --filter-duplicates true \
            --skip-pre-bwa-repartition false \
            --min-score-identity 0.9 \
            --identity-margin 0.02 \
            --divide-by-genome-length true \
-- \
        --spark-runner LOCAL --spark-master local[4]
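For reference, one tweak I plan to try (assuming more cores are available on this machine) is raising the local Spark parallelism on the last line, e.g.:

--spark-runner LOCAL --spark-master 'local[*]'    # use all available cores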

Partial log file:

19/05/28 21:58:04 INFO BlockManagerInfo: Added rdd_65_23 in memory on 192.168.22.18:38230 (size: 4.0 MB, free: 155.9 GB)
19/05/28 21:58:23 INFO Executor: Finished task 24.0 in stage 38.0 (TID 9505). 2070 bytes result sent to driver
19/05/28 21:58:23 INFO TaskSetManager: Starting task 28.0 in stage 38.0 (TID 9509, localhost, executor driver, partition 28, ANY, 4995 bytes)
19/05/28 21:58:23 INFO TaskSetManager: Finished task 24.0 in stage 38.0 (TID 9505) in 11085359 ms on localhost (executor driver) (25/117)
19/05/28 21:58:23 INFO Executor: Running task 28.0 in stage 38.0 (TID 9509)
19/05/28 21:58:23 INFO ShuffleBlockFetcherIterator: Getting 592 non-empty blocks out of 592 blocks
19/05/28 21:58:23 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/05/29 00:03:30 INFO MemoryStore: Block rdd_65_24 stored as bytes in memory (estimated size 4.2 MB, free 153.4 GB)
19/05/29 00:03:30 INFO BlockManagerInfo: Added rdd_65_24 in memory on 192.168.22.18:38230 (size: 4.2 MB, free: 155.9 GB)
19/05/29 00:03:41 INFO Executor: Finished task 25.0 in stage 38.0 (TID 9506). 2070 bytes result sent to driver
19/05/29 00:03:41 INFO TaskSetManager: Starting task 29.0 in stage 38.0 (TID 9510, localhost, executor driver, partition 29, ANY, 4995 bytes)
19/05/29 00:03:41 INFO TaskSetManager: Finished task 25.0 in stage 38.0 (TID 9506) in 9629953 ms on localhost (executor driver) (26/117)
19/05/29 00:03:41 INFO Executor: Running task 29.0 in stage 38.0 (TID 9510)
19/05/29 00:03:41 INFO ShuffleBlockFetcherIterator: Getting 592 non-empty blocks out of 592 blocks
19/05/29 00:03:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/05/29 00:09:28 INFO MemoryStore: Block rdd_65_25 stored as bytes in memory (estimated size 4.0 MB, free 153.4 GB)
19/05/29 00:09:28 INFO BlockManagerInfo: Added rdd_65_25 in memory on 192.168.22.18:38230 (size: 4.0 MB, free: 155.9 GB)
19/05/29 00:09:40 INFO Executor: Finished task 26.0 in stage 38.0 (TID 9507). 2070 bytes result sent to driver
19/05/29 00:09:40 INFO TaskSetManager: Starting task 30.0 in stage 38.0 (TID 9511, localhost, executor driver, partition 30, ANY, 4995 bytes)
19/05/29 00:09:40 INFO Executor: Running task 30.0 in stage 38.0 (TID 9511)
19/05/29 00:09:40 INFO TaskSetManager: Finished task 26.0 in stage 38.0 (TID 9507) in 9169125 ms on localhost (executor driver) (27/117)
19/05/29 00:09:40 INFO ShuffleBlockFetcherIterator: Getting 592 non-empty blocks out of 592 blocks
19/05/29 00:09:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/05/29 00:35:58 INFO MemoryStore: Block rdd_65_26 stored as bytes in memory (estimated size 4.0 MB, free 153.4 GB)
19/05/29 00:35:58 INFO BlockManagerInfo: Added rdd_65_26 in memory on 192.168.22.18:38230 (size: 4.0 MB, free: 155.9 GB)
19/05/29 00:36:09 INFO Executor: Finished task 27.0 in stage 38.0 (TID 9508). 2070 bytes result sent to driver
19/05/29 00:36:09 INFO TaskSetManager: Starting task 31.0 in stage 38.0 (TID 9512, localhost, executor driver, partition 31, ANY, 4995 bytes)
19/05/29 00:36:09 INFO Executor: Running task 31.0 in stage 38.0 (TID 9512)
19/05/29 00:36:09 INFO TaskSetManager: Finished task 27.0 in stage 38.0 (TID 9508) in 10080656 ms on localhost (executor driver) (28/117)
19/05/29 00:36:09 INFO ShuffleBlockFetcherIterator: Getting 592 non-empty blocks out of 592 blocks
19/05/29 00:36:09 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/05/29 00:41:31 INFO MemoryStore: Block rdd_65_27 stored as bytes in memory (estimated size 4.0 MB, free 153.4 GB)
19/05/29 00:41:31 INFO BlockManagerInfo: Added rdd_65_27 in memory on 192.168.22.18:38230 (size: 4.0 MB, free: 155.9 GB)
19/05/29 00:41:50 INFO Executor: Finished task 28.0 in stage 38.0 (TID 9509). 2070 bytes result sent to driver
19/05/29 00:41:50 INFO TaskSetManager: Starting task 32.0 in stage 38.0 (TID 9513, localhost, executor driver, partition 32, ANY, 4995 bytes)

Error in GermlineCNVCaller: Anomalous ploidy and karyotypes

I ran GermlineCNVCaller in the GATK4 docker using data from Illumina WES runs.
I saw one post about this from 2018, but no solutions were provided, so I was hoping someone has figured out what might be causing this issue. It says that anomalous ploidies (3) and karyotypes were found.
Am I required to provide separate contig-ploidy-priors files for male and female samples?
Is there a way to circumvent the errors and proceed with the CNV calls?
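(For reference, my current understanding is that a single contig-ploidy-priors table covering both sexes is supplied to DetermineGermlineContigPloidy, something like the sketch below; the exact column names and prior values here are my assumption, not verified:)

CONTIG_NAME	PLOIDY_PRIOR_0	PLOIDY_PRIOR_1	PLOIDY_PRIOR_2	PLOIDY_PRIOR_3
1	0.00	0.01	0.98	0.01
X	0.01	0.49	0.49	0.01
Y	0.50	0.49	0.01	0.00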

Here is the error log on the terminal:
-------------------------------------------------------------------------------------------------------------------------------------------------------
```
[May 30, 2019 9:34:28 PM UTC] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 42.22 minutes.
Runtime.totalMemory()=309329920
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 137
Command Line: python /tmp/cohort_denoising_calling.2469363531812992060.py --ploidy_calls_path=/gatk/contig_ploidy_out/201to221-calls --output_calls_path=/gatk/cnv_caller_out/201to221-calls --output_tracking_path=/gatk/cnv_caller_out/201to221-tracking --modeling_interval_list=/tmp/intervals5455361474614479137.tsv --output_model_path=/gatk/cnv_caller_out/201to221-model --enable_explicit_gc_bias_modeling=False --read_count_tsv_files /tmp/sample-06897365439090412481.tsv /tmp/sample-14321148589333665336.tsv /tmp/sample-21835961979843646097.tsv /tmp/sample-38074673515857876969.tsv /tmp/sample-43743553031942260664.tsv /tmp/sample-57298179702079672321.tsv /tmp/sample-62031280085994514055.tsv /tmp/sample-75741767774624679683.tsv /tmp/sample-81194219972171310383.tsv /tmp/sample-97680992559618886592.tsv /tmp/sample-107437152082991706984.tsv /tmp/sample-111888210707192633556.tsv /tmp/sample-128036150598221044845.tsv /tmp/sample-138872009798693940440.tsv /tmp/sample-14940235851191146248.tsv /tmp/sample-154069286361387789329.tsv /tmp/sample-166690524464389231566.tsv /tmp/sample-17146091880304952416.tsv /tmp/sample-187389732363112723677.tsv /tmp/sample-192301262323034965667.tsv --psi_s_scale=1.000000e-04 --mapping_error_rate=1.000000e-02 --depth_correction_tau=1.000000e+04 --q_c_expectation_mode=hybrid --max_bias_factors=5 --psi_t_scale=1.000000e-03 --log_mean_bias_std=1.000000e-01 --init_ard_rel_unexplained_variance=1.000000e-01 --num_gc_bins=20 --gc_curve_sd=1.000000e+00 --active_class_padding_hybrid_mode=50000 --enable_bias_factors=True --disable_bias_factors_in_active_class=False --p_alt=1.000000e-06 --cnv_coherence_length=1.000000e+04 --max_copy_number=5 --p_active=0.010000 --class_coherence_length=10000.000000 --learning_rate=1.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.900000e-01 --log_emission_samples_per_round=50 --log_emission_sampling_rounds=10 --log_emission_sampling_median_rel_error=5.000000e-03 --max_advi_iter_first_epoch=5000 --max_advi_iter_subsequent_epochs=200 --min_training_epochs=10 --max_training_epochs=50 --initial_temperature=1.500000e+00 --num_thermal_advi_iters=2500 --convergence_snr_averaging_window=500 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=10 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=1.000000e+00 --disable_caller=false --disable_sampler=false --disable_annealing=false
Stdout: 20:52:35.963 INFO cohort_denoising_calling - Loading 20 read counts file(s)...
20:52:57.460 INFO gcnvkernel.io.io_metadata - Loading germline contig ploidy and global read depth metadata...
20:52:57.467 WARNING gcnvkernel.structs.metadata - Sample 219-Exp29_S115 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.468 WARNING gcnvkernel.structs.metadata - Sample 219-Exp29_S115 has an anomalous karyotype ({'Y': 1, 'X': 2}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.472 WARNING gcnvkernel.structs.metadata - Sample 220-Exp29_S116 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.472 WARNING gcnvkernel.structs.metadata - Sample 220-Exp29_S116 has an anomalous karyotype ({'Y': 1, 'X': 2}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.476 WARNING gcnvkernel.structs.metadata - Sample 208-Exp29_S104 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.476 WARNING gcnvkernel.structs.metadata - Sample 208-Exp29_S104 has an anomalous karyotype ({'Y': 0, 'X': 3}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.481 WARNING gcnvkernel.structs.metadata - Sample 221-Exp29_S117 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.481 WARNING gcnvkernel.structs.metadata - Sample 221-Exp29_S117 has an anomalous karyotype ({'Y': 1, 'X': 2}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.486 WARNING gcnvkernel.structs.metadata - Sample 213-Exp29_S109 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.486 WARNING gcnvkernel.structs.metadata - Sample 213-Exp29_S109 has an anomalous karyotype ({'Y': 1, 'X': 2}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.491 WARNING gcnvkernel.structs.metadata - Sample 201-Exp29_S97 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.491 WARNING gcnvkernel.structs.metadata - Sample 201-Exp29_S97 has an anomalous karyotype ({'Y': 0, 'X': 3}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.495 WARNING gcnvkernel.structs.metadata - Sample 209-Exp29_S105 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.495 WARNING gcnvkernel.structs.metadata - Sample 209-Exp29_S105 has an anomalous karyotype ({'Y': 0, 'X': 3}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.499 WARNING gcnvkernel.structs.metadata - Sample 203-Exp29_S99 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.499 WARNING gcnvkernel.structs.metadata - Sample 203-Exp29_S99 has an anomalous karyotype ({'Y': 0, 'X': 3}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.504 WARNING gcnvkernel.structs.metadata - Sample 212-Exp29_S108 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.504 WARNING gcnvkernel.structs.metadata - Sample 212-Exp29_S108 has an anomalous karyotype ({'Y': 1, 'X': 2}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.508 WARNING gcnvkernel.structs.metadata - Sample 204-Exp29_S100 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.509 WARNING gcnvkernel.structs.metadata - Sample 204-Exp29_S100 has an anomalous karyotype ({'Y': 1, 'X': 2}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.513 WARNING gcnvkernel.structs.metadata - Sample 218-Exp29_S114 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.513 WARNING gcnvkernel.structs.metadata - Sample 218-Exp29_S114 has an anomalous karyotype ({'Y': 0, 'X': 3}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.517 WARNING gcnvkernel.structs.metadata - Sample 215-Exp29_S111 has an anomalous ploidy (3) for contig 19. The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unreliable ploidy designations. It is recommended that the user verifies this designation by orthogonal methods.
20:52:57.517 WARNING gcnvkernel.structs.metadata - Sample 215-Exp29_S111 has an anomalous karyotype ({'Y': 0, 'X': 3}). The presence of unmasked PAR regions and regions of low mappability in the coverage metadata can result in unrelia
Stderr:
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.executeGermlineCNVCallerPythonScript(GermlineCNVCaller.java:441)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:288)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
```
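
An editorial note on the failure itself, not from the original post: "python exited with 137" means the Python process was killed with SIGKILL (137 = 128 + 9), which inside a Docker container most often indicates the kernel's out-of-memory killer; the ploidy/karyotype lines above are warnings, not the crash. A hedged sketch of a re-run with a larger container memory limit, where the image tag, mounts, and input arguments are all placeholders:

```
# Exit code 137 = 128 + SIGKILL; in a memory-limited container this usually
# means the OOM killer terminated the process. Raising the Docker memory cap
# is one way to test that hypothesis (all paths and tags are placeholders).
docker run --memory=32g -v /data:/gatk/data broadinstitute/gatk:4.1.2.0 \
    gatk GermlineCNVCaller \
        --run-mode COHORT \
        --contig-ploidy-calls /gatk/data/contig_ploidy_out/201to221-calls \
        --input /gatk/data/sample1.counts.hdf5 \
        --input /gatk/data/sample2.counts.hdf5 \
        -L /gatk/data/intervals.interval_list \
        --interval-merging-rule OVERLAPPING_ONLY \
        --output /gatk/data/cnv_caller_out \
        --output-prefix 201to221
```

Running fewer samples per shard, or fewer intervals per scatter, would reduce the memory footprint in a similar way.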

Why are variant callers' results different between GATK 3.8 and GATK 4.0?

Hello, I am a beginner. I used two different tools (GATK 3.8 and GATK 4.0) to analyze my data, but I got two different results. Why?
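
An editorial aside: some divergence between GATK 3.8 and GATK 4.0 is expected, since defaults, annotations, and caller internals changed between the major versions. One way to quantify the difference, assuming both call sets are bgzipped and tabix-indexed (file names are placeholders):

```
# Hypothetical comparison of the two call sets with bcftools isec
# (file names are placeholders; both VCFs must be bgzipped and indexed).
bcftools isec -p isec_out calls_gatk3.8.vcf.gz calls_gatk4.0.vcf.gz
# isec_out/0000.vcf  records private to the GATK 3.8 call set
# isec_out/0001.vcf  records private to the GATK 4.0 call set
# isec_out/0002.vcf  shared records, as represented in the GATK 3.8 file
# isec_out/0003.vcf  shared records, as represented in the GATK 4.0 file
```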

FilterMutectCalls fails on some samples using gatk-4.1.2.0


java version "1.8.0_45"
I am getting an error on some samples.

Running:
```
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/jocostello/shared/LG3_Pipeline_HIDE/tools/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar FilterMutectCalls --intervals /home/jocostello/repositories/UCSF-Costello-Lab/LG3_Pipeline/resources/SeqCap_EZ_Exome_v3_capture.interval_list --interval-padding 0 --contamination-table Z00600t10-contamination.table --tumor-segmentation Z00600t10-segments.table --orientation-bias-artifact-priors Z00600t10-artifact-prior-table.tar.gz --verbosity ERROR --variant NOR-Z00599t10__TUM-Z00600t10.m2.obmm.vcf.gz --output NOR-Z00599t10__TUM-Z00600t10.m2.obmm.cc.vcf.gz --stats Z00600t10-M2FilteringStats.tsv --contamination-estimate 0 --reference /home/jocostello/repositories/UCSF-Costello-Lab/LG3_Pipeline/resources/UCSC_hg19/hg19.fa
[May 10, 2019 8:07:44 PM PDT] org.broadinstitute.hellbender.tools.walkers.mutect.filtering.FilterMutectCalls done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=2468872192
java.lang.IllegalArgumentException: log10p: Log10-probability must be 0 or less
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:724)
at org.broadinstitute.hellbender.utils.MathUtils.log10BinomialProbability(MathUtils.java:917)
at org.broadinstitute.hellbender.utils.MathUtils.binomialProbability(MathUtils.java:910)
at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.ContaminationFilter.calculateErrorProbability(ContaminationFilter.java:56)
at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.Mutect2VariantFilter.errorProbability(Mutect2VariantFilter.java:15)
at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.ErrorProbabilities.lambda$new$1(ErrorProbabilities.java:19)
at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.ErrorProbabilities$$Lambda$131/23218037.apply(Unknown Source)
[...]
```
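
An editorial aside on the stack trace: the exception fires while ContaminationFilter turns the contamination estimate into a binomial log10-probability, so a contamination fraction outside [0, 1] in the input table is one plausible trigger. A hedged sanity check, assuming the CalculateContamination output layout of one header row followed by tab-separated sample, contamination, and error columns:

```
# Flag any contamination fraction outside [0, 1]; the column layout (sample,
# contamination, error; one header row) is an assumption and should be
# verified against the actual file.
awk -F'\t' 'NR > 1 && ($2 < 0 || $2 > 1) { print "suspect row:", $0 }' \
    Z00600t10-contamination.table
```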

Switching back to gatk-4.1.1.0 solves the problem, but it would be nice to be able to use the latest version...

Thanks!
Ivan


