Channel: Recent Discussions — GATK-Forum

Algorithm question for VQSR


As far as I understand, VQSR selects a pool of SNPs present in both the test set and a known, annotated SNP database. These SNPs are treated as true variants, and a Gaussian mixture model is built on the features of these true variants to classify additional SNPs.

These true SNPs are clustered using a Gaussian model. However, a Gaussian mixture model means we are also clustering "bad" SNPs. I imagine that these "bad" SNPs have different poor qualities in different directions, so the Gaussian mixture model will ultimately produce multiple clusters (one true-SNP cluster and multiple bad-SNP clusters), right?

Then why can't we just use a simple Gaussian model to describe the distribution of true SNPs, so that any SNP far from this cluster is more likely to be false?


Why are bait locations used rather than target locations in the Best Practices workspace?


The GATK Best Practices workspace for somatic SNVs and indels (help-gatk/Somatic-SNVs-Indels-GATK4) uses an interval list of all of the baits as its M2 intervals file (see the workspace attribute intervals). Why are you using this file? An interval list of all the target intervals should do just as well and is much smaller (the intervals file we use for running Mutect1 has 237137 rows, vs. the Mutect2 baits interval file, which has 473150 rows).

I am running benchmark tests to compare the mutect2-gatk4 workflow with CGA's production workflow. There are significant differences in the results produced by the two pipelines when run against the same dataset, and if the difference in intervals files could be contributing to this discordance, I want to eliminate that source of error by replacing the baits interval list with the target interval list.

Base Quality Score Recalibration (BQSR)


BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencer when it estimates the quality score of each base call. This document starts with a high-level overview of the purpose of this method; deeper technical details are provided further down.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to any Spanish-speaking users who might get awfully confused at this point.


Wait, what are base quality scores again?

These scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- on the Phred scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion basecalls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call for 900 million bases. In practice each basecall gets its own quality score, determined through some dark magic jealously guarded by the manufacturer of the sequencer.
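
For reference, the Phred scale relates the quality score Q to the error probability P as follows:

    Q = -10 * log10(P_error)        equivalently  P_error = 10^(-Q/10)

    Q20  ->  P_error = 10^(-2) = 1 error in 100 calls
    Q30  ->  P_error = 10^(-3) = 1 error in 1,000 calls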

Variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a basecall that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally my impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

Fantastic! How does it work?

The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
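
As a sketch in GATK3 syntax (file names hypothetical), the before/after plots can be generated by building a second recalibration table on the recalibrated data and then passing both tables to AnalyzeCovariates:

java -jar GenomeAnalysisTK.jar \
   -T AnalyzeCovariates \
   -R reference.fasta \
   -before recal_first_pass.table \
   -after recal_second_pass.table \
   -plots recalibration_plots.pdf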


More detailed information

Detailed information about command line options for BaseRecalibrator can be found here.

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

This process is accomplished by analyzing the covariation among several features of a base. For example:

  • Reported quality score
  • The position within the read
  • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating the low-quality end-of-read AC bases from the high-quality TG bases at the start of the read. See below for examples of pre- and post-corrected values.

The system was designed so that (sophisticated) users are able to easily add new covariates to the calculations. For users wishing to add their own covariate, simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

Running the tools

BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

  • read group the read belongs to
  • assigned quality score
  • machine cycle producing this base
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
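
For reference, a minimal first-pass invocation in GATK3 syntax might look like this (dbsnp.vcf stands in for whichever known-sites resource you use):

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I my_reads.bam \
   -knownSites dbsnp.vcf \
   -o my_reads.recal_data.grp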

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:

 
java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -o output.bam

After computing covariates in the initial BAM file, we walk through the BAM file a second time and rewrite the quality scores (in the QUAL field) into a new BAM file. This step uses the recalibration table data in recalibration_report.grp, produced by BaseRecalibrator, to recalibrate the quality scores in input.bam, writing out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is (a worked example follows the list):

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effects
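
As a hedged illustration (all numbers invented for the example), the shifts combine additively in Q-score space:

    recalibrated Q = reported Q
                   + (global empirical Q - global reported Q)
                   + delta(reported-quality bin)
                   + delta(machine cycle) + delta(dinucleotide context)

    e.g.  30 + (-1.8) + (+0.4) + (-0.6) + (+0.2) = 28.2  ->  reported as Q28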

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high-quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

  • The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
  • A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases per read group from a next-generation DNA sequencer; 1B bases yields significantly better results.
  • Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
  • The recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). For example, a bin with 0 mismatches in 10 bases would naively imply a perfect quality score, whereas the corrected estimate is 1/12, or roughly Q11. This deals very nicely with overfitting: the correction has only a minor impact on data sets with billions of bases, but is critical to avoid overconfidence in rare bins in sparse data.

Example pre and post recalibration results

  • Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
  • There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

[Plots: reported vs. empirical base quality scores before and after recalibration]

The output of the BaseRecalibrator

  • A Recalibration report containing all the recalibration information for the data

Note that BaseRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport); it not only contains the main result of the analysis, but is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSRv2 for this dataset. It is important so that the on-the-fly recalibration step uses the same parameters that were used in the recalibration step (context sizes, covariates, ...).

Example Arguments table:

 
#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq: with -qq 0 you don't quantize qualities, and with -qq N you recalculate the quantization bins on the fly using N bins. Note that quantization is currently completely experimental, and we do not recommend using it unless you are a super advanced user.
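
As a hedged sketch (assuming the GATK3 engine argument described above and the file names from the earlier PrintReads example), on-the-fly recalibration with recomputed quantization bins might look like:

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -qq 8 \
   -o output.bam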

Example Quantization table:

 
#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions, and deletions. This is no different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions, and deletions. This is no different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.

 
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, but applying the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

  • First do an initial round of SNP calling on your original, unrecalibrated data.
  • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
  • Finally, do a real round of SNP calling with the recalibrated data. These steps can be repeated several times until convergence; a command-line sketch of one round is shown after this list.
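
A minimal sketch of one bootstrap round, in GATK3 syntax with hypothetical file names and an illustrative confidence threshold:

# Round 1: call variants on the original, unrecalibrated BAM
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I unrecalibrated.bam -o round1.vcf

# Keep only the highest-confidence calls (threshold illustrative only)
java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R reference.fasta -V round1.vcf \
    -select "QD > 20.0" -o high_confidence.vcf

# Feed that set back in as the known sites for recalibration
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R reference.fasta -I unrecalibrated.bam \
    -knownSites high_confidence.vcf -o recal.table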

Downsampling to reduce run time

For users concerned about run time, please note the small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy, please disregard this plot and continue to use all available data when generating the recalibration table.

[Plot: recalibration performance as a function of the number of reads per read group]
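
As a hedged example (interval chosen arbitrarily), the downsampling described above might look like restricting the first BaseRecalibrator pass to a single chromosome:

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I my_reads.bam \
   -knownSites dbsnp.vcf \
   -L chr1 \
   -o my_reads.recal_data.grp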

How do I specify Java 8 when it's not in a system-wide location?


The original version of Java on my computer is 1.7.0_131, and I installed JDK 8 in a user directory (not the system-wide location, because I don't have root access).

Then, when I try to use GATK 4.0 for SNP calling:
./python /path/to/gatk/gatk --java-options "-Xmx4g" HaplotypeCaller -R sequence/reference.fa.fasta -I sequence/A6475_aligned_out.bam -O output.g.vcf.gz -ERC GVCF
Error: Invalid or corrupt jarfile /export/home/biostuds/2257069w/gatk/gatk-package-4.0.4.0-local.jar

I specify the Python executable explicitly because I didn't install it in a system-wide location either (again because I don't have root access, same as Java). It seems that this command does not specify JDK 8, but I don't know where I can specify the Java version in this command.
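
Not an authoritative answer, but the usual fix is to put the private JDK 8 first on your PATH (paths hypothetical); the gatk wrapper launches whichever java it finds there:

export JAVA_HOME=/path/to/jdk1.8.0
export PATH="$JAVA_HOME/bin:$PATH"
java -version    # should now report 1.8.x
/path/to/gatk/gatk --java-options "-Xmx4g" HaplotypeCaller \
    -R sequence/reference.fa.fasta \
    -I sequence/A6475_aligned_out.bam \
    -O output.g.vcf.gz -ERC GVCF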

CombineGVCFs outputs a genomic region outside the specified intervals


Hi,

I am using the CombineGVCFs module to merge a number of individual WGS gVCFs generated by HaplotypeCaller into a single gVCF file. The -L argument was used to restrict processing to the specific genomic interval chr1:100000001-150000000. However, the output gVCF file contains info from the region chr1:99999813-100000000, which is supposed to be excluded from the output.

Did I make a mistake?

Here is my command-line:

gatk --java-options "-Xmx4G -XX:+PrintCommandLineFlags -XX:ParallelGCThreads=1" CombineGVCFs -R hg38.fa -L chr1:100000001-150000000 --variant gvcf.list -O combine50_1.chr1.100000001-150000000.g.vcf.gz

Picard tools - FastqToSam not working properly, and a question about the quality of the converted BAM


1.
It says that it needs some DLL. I am using the latest Picard. It makes a BAM file which I think is empty.

F:\gen\picard>java -jar F:\gen\picard\picard.jar FastqToSam f1=1.fastq o=1.bam sm=a rg=rg0013
20:32:54.610 WARN NativeLibraryLoader - Unable to find native library: native/gkl_compression.dll
20:32:54.610 WARN NativeLibraryLoader - Unable to find native library: native/gkl_compression.dll
[Tue Jun 19 20:32:54 MSK 2018] FastqToSam FASTQ=1.fastq OUTPUT=1.bam READ_GROUP_NAME=rg0013 SAMPLE_NAME=a USE_SEQUENTIAL_FASTQS=false SORT_ORDER=queryname MIN_Q=0 MAX_Q=93 STRIP_UNPAIRED_MATE_NUMBER=false ALLOW_AND_IGNORE_EMPTY_LINES=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue Jun 19 20:32:54 MSK 2018] Executing as Admin@DESKTOP-I2HT0TV on Windows 10 10.0 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_171-b11; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.18.7-SNAPSHOT
20:32:54.625 WARN IntelDeflaterFactory - IntelDeflater is not supported, using Java.util.zip.Deflater
INFO 2018-06-19 20:32:54 FastqToSam Auto-detected quality format as: Standard.
INFO 2018-06-19 20:32:55 FastqToSam Processed 459980 fastq reads
[Tue Jun 19 20:32:58 MSK 2018] picard.sam.FastqToSam done. Elapsed time: 0,07 minutes.
Runtime.totalMemory()=444071936

  2. How do I convert FASTQ to BAM with the best possible BAM quality? For example, there is a 28 MB BAM file. I downloaded its FASTQ and used the tool "SRA_FASTQ_to_BAM_Kit" from y-str.org to convert the FASTQ to BAM, but the resulting BAM was 23 MB and showed different autosomal DNA results. I want maximum quality. Thank you in advance for trying to help.

Direct download URL for GATK 3.8?


Hello,

I saw GATK 3.8 allows direct downloading now, without the requirement of registration. So I was wondering if it is possible to provide me with a direct download URL for GATK 3.8. I am asking this because I use GATK3 as one of the dependencies for a couple of software pipelines that I am developing, and it would be great if my software pipeline could handle the GATK 3.8 installation automatically without asking the users to manually download it. Thanks for your consideration!

Best,
Jia-Xing

Do you need to do Variant Quality Score Recalibration when calling somatic variants with Mutect2?


Hi,
I am currently working to call somatic variants from tumour samples with matched normal pairs from the same patient. I have carried out all of the steps in this tutorial: https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2#2. However, I am now confused as to whether I am supposed to run the VQSR step described in some of the GATK tutorials on YouTube.

I initially thought I did have to do it but as I've gone over the explanation of VQSR I now get the impression that it is designed for calling germline variants on many samples simultaneously. However I wanted to make sure before taking that step out of my pipeline that you genuinely do not need to do VQSR for somatic variant calling. I would also like to understand WHY it is that you need VQSR for germline variant calling but do not use it for somatic mutation calling.
Thanks!
Peter O'Donovan


Problems with Indel calling (following GATK best practices)


Hi,

we sequenced two exomes (from a normal and a tumor sample) using the TruSeq DNA Exome kit (Illumina) on our NextSeq 500 (mid flowcell, 2x75 bp). Then we analyzed the reads running our pipeline that follows the GATK best practices, and finally used Mutect2 to call somatic variants.

Then we filtered and prioritized the variants (using our browser QueryOR) in order to obtain some indels that could be interesting for our study. The problem is that when we have a look at the alignments with IGV, we cannot find those indels. The reads do not have any insertions or deletions.

Best regards
Erika

(How to) Generate an unmapped BAM from FASTQ or aligned BAM



Here we outline how to generate an unmapped BAM (uBAM) from either a FASTQ or aligned BAM file. We use Picard's FastqToSam to convert a FASTQ (Option A) or Picard's RevertSam to convert an aligned BAM (Option B).

Jump to a section on this page

(A) Convert FASTQ to uBAM and add read group information using FastqToSam
(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

Tools involved

  • FastqToSam
  • RevertSam

Prerequisites

  • Installed Picard tools

Download example data

Tutorial data reads were originally aligned to the advanced tutorial bundle's human_g1k_v37_decoy.fasta reference and to 10:91,000,000-92,000,000.

Related resources


(A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file to an unmapped BAM, requires two read group fields, and allows optional specification of other read group fields. Below the command we note which fields are required for GATK Best Practices Workflows; all other read group fields are optional.

java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_1.fastq \
    FASTQ2=6484_snippet_2.fastq \
    OUTPUT=6484_snippet_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \
    SAMPLE_NAME=NA12878 \
    LIBRARY_NAME=Solexa-272222 \
    PLATFORM_UNIT=H0164ALXX140820.2 \
    PLATFORM=illumina \
    SEQUENCING_CENTER=BI \
    RUN_DATE=2014-08-20T00:00:00-0400

FASTQ and FASTQ2 take the first and second read files of the pair. READ_GROUP_NAME (changed here from the default of A), SAMPLE_NAME, and LIBRARY_NAME are required; PLATFORM is recommended.

Some details on select parameters:

  • For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first read file and the second read file, respectively. Records in each file must be queryname sorted as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate reads are paired. Do not provide a single interleaved fastq file, as the tool will assume reads are unpaired and the SAM flag values will reflect single ended reads.
  • For single ended reads, specify the input file with FASTQ.
  • QUALITY_FORMAT is detected automatically if unspecified.
  • SORT_ORDER by default is queryname.
  • PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
  • RUN_DATE is in Iso8601 date format.

Paired reads will have SAM flag values that reflect pairing and the fact that the reads are unmapped as shown in the example read pair below.

Original first read

@H0164ALXX140820:2:1101:10003:49022/1
ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA
+
-FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ

Original second read

@H0164ALXX140820:2:1101:10003:49022/2
TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG
+
AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F######################

After FastqToSam

H0164ALXX140820:2:1101:10003:49022      77      *       0       0       *       *       0       0       ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA -FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:49022      141     *       0       0       *       *       0       0       TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F###################### RG:Z:H0164.2



(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information and generate an unmapped BAM (uBAM). For our tutorial file we have to call on some additional parameters that we explain below. This illustrates the need to tailor the tool's parameters to each dataset. As such, it is a good idea to test the reversion process on a subset of reads before committing to reverting the entirety of a large BAM. Follow the directions in this How to to create a snippet of aligned reads corresponding to a genomic interval.

We use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
    I=6484_snippet.bam \
    O=6484_snippet_revertsam.bam \
    SANITIZE=true \
    MAX_DISCARD_FRACTION=0.005 \
    ATTRIBUTE_TO_CLEAR=XT \
    ATTRIBUTE_TO_CLEAR=XN \
    ATTRIBUTE_TO_CLEAR=AS \
    ATTRIBUTE_TO_CLEAR=OC \
    ATTRIBUTE_TO_CLEAR=OP \
    SORT_ORDER=queryname \
    RESTORE_ORIGINAL_QUALITIES=true \
    REMOVE_DUPLICATE_INFORMATION=true \
    REMOVE_ALIGNMENT_INFORMATION=true

MAX_DISCARD_FRACTION is informational and does not affect processing. ATTRIBUTE_TO_CLEAR=AS is listed even though Picard releases from 9/2015 onward clear AS by default. SORT_ORDER, RESTORE_ORIGINAL_QUALITIES, REMOVE_DUPLICATE_INFORMATION, and REMOVE_ALIGNMENT_INFORMATION are shown at their default values.

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets the directory for temporary files

We invoke or change multiple RevertSam parameters to generate an unmapped BAM

  • We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. Standard tags cleared by default are NM, UQ, PG, MD, MQ, SA, MC, and AS tags (AS for Picard releases starting 9/2015). Additionally, the OQ tag is removed by the default RESTORE_ORIGINAL_QUALITIES parameter. Remove all other nonstandard tags by specifying each with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option for our tutorial file so that it is free for use by other tools, e.g. MarkIlluminaAdapters. To list all tags within a BAM, use the command below.

    samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}' 
    

    For the tutorial file, this gives RG, OC, XN, OP and XT tags as well as those removed by default. We remove all of these except the RG tag. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags.

  • Additionally, we invoke the SANITIZE option to remove reads that cause problems for certain tools, e.g. MarkIlluminaAdapters. Downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Any paired reads file subset for a genomic interval requires sanitizing to remove reads with lost mates that align outside of the interval.

  • In this command, we've set MAX_DISCARD_FRACTION to a stricter threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. If the discarded fraction exceeds the threshold, the tool additionally reports it via an exception as it finishes processing.

    Exception in thread "main" picard.PicardException: Discarded 0.787% which is above MAX_DISCARD_FRACTION of 0.500%  
    

Some comments on options kept at default:

  • SORT_ORDER=queryname
    For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order: coordinate-sorted paired reads are not randomly distributed, which causes the aligner to incorrectly estimate insert sizes from blocks of paired reads.

  • RESTORE_ORIGINAL_QUALITIES=true
    Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag.

  • REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment flag and tag information. For example, flags are reset to unmapped values, e.g. 77 and 141 for paired reads. The parameter also invokes the default ATTRIBUTE_TO_CLEAR parameter, which removes standard alignment tags. RevertSam ignores ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false.

Below we show a read pair from the tutorial data before and after RevertSam. Notice the first listed read in the pair becomes reverse-complemented after RevertSam. This restores how reads are represented when they come off the sequencer--5' to 3' of the read being sequenced.

For 6484_snippet.bam, SANITIZE removes 2,202 out of 279,796 (0.787%) reads, leaving us with 277,594 reads.

Original BAM

H0164ALXX140820:2:1101:10003:23460  83  10  91515318    60  151M    =   91515130    -339    CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151    PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:0  MQ:i:0  OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA    UQ:i:0  AS:i:151

H0164ALXX140820:2:1101:10003:23460  163 10  91515130    0   60M91S  =   91515318    339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1;    MC:Z:151M   MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:2  MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF###########################################################################################    UQ:i:49 AS:i:50

After RevertSam

H0164ALXX140820:2:1101:10003:23460  77  *   0   0   *   *   0   0   TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2

H0164ALXX140820:2:1101:10003:23460  141 *   0   0   *   *   0   0   TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2



SNP calling for cell lines - how does ploidy affect HC?


Hi all,

I am calling SNPs in various immortalised cell lines, which are known to be very unstable - hence the ploidy is not known. Generally it should be diploid. So my question is - what can happen if the ploidy is not correct? Would HC miss SNPs? I see a relatively low overlap of common SNPs between two related cell lines and I want to make sure this low overlap is real.

Thank you in advance.

GenotypeGVCFs and VariantFiltration tools


We are following "Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode" best practices using GATK 3.8.1 and Java 1.8. Thus we merged the raw.g.vcfs from HaplotypeCaller into one cohort.g.vcf and then carried out joint genotyping using the GenotypeGVCFs tool. We are working in a haploid model organism so we then tried to use the VariantFiltration tool on the output (which is a vcf file containing the information from all of the sequences with which we are working). However this failed and we got the error
"Line 2176: there aren't enough columns for line 102"
Others have encountered the same problem, and I see that you have responded that the GATK and Java versions are incompatible, but that was several versions ago. Is this true for us? Please can you tell me where to go next.

Why is GQ listed as 0 when PL shows a clear GT distribution?


Hello,
Thank you for taking my question. I am working with a pre-processed VCF file, and I am wondering why the GQ is listed as 0 in a number of my lines, whereas the DP is adequate and the PL shows a clear GT distribution.

Such as:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AU4289301

1 10051 . A . . END=10051 GT:DP:GQ:MIN_DP:PL 0/0:19:0:19:0,0,466
1 10120 . T . . END=10121 GT:DP:GQ:MIN_DP:PL 0/0:39:0:36:0,0,607
1 10122 . A . . END=10122 GT:DP:GQ:MIN_DP:PL 0/0:40:12:40:0,12,180

Outputting and Using VariantRecalibration Models


Hello,

I'm working on optimizing the variant filtering pipeline for my team. Currently we're using VQSR following the best practices guidelines. I've been testing VQSR's ability to discern FPs from TPs by applying VQSR to sequencing data we've generated from the GM12878 cell line and comparing those VCFs to GIAB's gold standard NA12878 call-set using VCFeval. While following best practices for VQSR lowers the total number of FPs, it also lowers the F-measure score, meaning more TPs are being filtered than FPs. This isn't optimal, naturally.

If I apply VariantRecalibrator with the GIAB SNP call-set as a resource, more FPs are filtered than TPs, down to a tranche specificity of ~99.9. Ideally, I'd like to train a VQSR model using VCFs with gold standard call-sets as resources, output that model, and then apply that model to other VCFs.

I've been working to test the possibility of using this approach to variant filtering. The first step is to feed VariantRecalibrator a GM12878 library and the GIAB truth set, output the model, apply the recalibration and get results from VCFeval.

/home/gatk-4.0.3.0/gatk VariantRecalibrator \
   --reference $ref_fa \
   --variant $raw_snp \
   --resource GIAB,known=false,training=true,truth=true,prior=15.0:$GIAB_snp \
   --use-annotation DP \
   --use-annotation QD \
   --use-annotation FS \
   --use-annotation SOR \
   --use-annotation MQ \
   --use-annotation MQRankSum \
   --use-annotation ReadPosRankSum \
   --mode SNP \
   --truth-sensitivity-tranche 100.0 \
   --truth-sensitivity-tranche 99.98 \
   --truth-sensitivity-tranche 99.95 \
   --truth-sensitivity-tranche 99.90 \
   --output recalibrate_snp_giab.recal \
   --tranches-file recalibrate_snp_giab.tranches \
   --rscript-file recalibrate_snp_giab.plots.R \
   --output-model recalibrate_snp_giab.model

/home/gatk-4.0.3.0/gatk ApplyVQSR \
   --reference $ref_fa \
   --variant $raw_snp \
   --output ${laneid}.filtered.giab.99.9.snp.vcf.gz \
   --truth-sensitivity-filter-level 99.9 \
   --tranches-file recalibrate_snp_giab.tranches \
   --recal-file recalibrate_snp_giab.recal \
   --mode SNP

Next I want to input the model and the same VCF to VariantRecalibrator - ideally without resources, though that isn't possible - apply the recalibration and get results from VCFeval. If what I'm looking to do is possible, the two results should be the same for any given tranche.

An example of VariantRecalibrator options I've tried are below.

/home/gatk-4.0.3.0/gatk VariantRecalibrator \
   --reference $ref_fa \
   --variant $raw_snp \
   --resource HapMap,known=false,training=true,truth=true,prior=15.0:$HapMap \
   --resource Omni,known=false,training=true,truth=true,prior=12.0:$Omni \
   --resource 1000G,known=false,training=true,truth=false,prior=10.0:$Thousand_g  \
   --resource dbsnp,known=true,training=false,truth=false,prior=2.0:$DBsnp \
   --input-model $snp_model_orig \
   --use-annotation DP \
   --use-annotation QD \
   --use-annotation FS \
   --use-annotation SOR \
   --use-annotation MQ \
   --use-annotation MQRankSum \
   --use-annotation ReadPosRankSum \
   --mode SNP \
   --truth-sensitivity-tranche 100.0 \
   --truth-sensitivity-tranche 99.98 \
   --truth-sensitivity-tranche 99.95 \
   --truth-sensitivity-tranche 99.90 \
   --output recalibrate_snp_${laneid}.recal \
   --tranches-file recalibrate_snp_${laneid}.tranches \
   --rscript-file recalibrate_snp_${laneid}.plots.R

/home/gatk-4.0.3.0/gatk ApplyVQSR \
   --reference $ref_fa \
   --variant $raw_snp \
   --output ${laneid}.filtered.model.99.9.snp.vcf.gz \
   --truth-sensitivity-filter-level 99.9 \
   --tranches-file recalibrate_snp_${laneid}.tranches \
   --recal-file recalibrate_snp_${laneid}.recal \
   --mode SNP

As I must supply some resources to VariantRecalibrator, I also tried minimizing the effect of any resources.

/home/gatk-4.0.3.0/gatk VariantRecalibrator \
   --reference $ref_fa \
   --variant $raw_snp \
   --resource HapMap,known=false,training=true,truth=true,prior=0.0:$HapMap \
   --input-model $snp_model_ms \
   --use-annotation DP \
   --use-annotation QD \
   --use-annotation FS \
   --use-annotation SOR \
   --use-annotation MQ \
   --use-annotation MQRankSum \
   --use-annotation ReadPosRankSum \
   --mode SNP \
   --truth-sensitivity-tranche 100.0 \
   --truth-sensitivity-tranche 99.98 \
   --truth-sensitivity-tranche 99.95 \
   --truth-sensitivity-tranche 99.90 \
   --output recalibrate_snp_${laneid}.recal \
   --tranches-file recalibrate_snp_${laneid}.tranches \
   --rscript-file recalibrate_snp_${laneid}.plots.R

Neither of these approaches has been very successful in producing results similar to those from the model when it was first generated and applied. Is there any way to use an output model "as is", without having it changed by VariantRecalibrator the second time around? Or do I misunderstand the nature of VQSR and how models are trained and applied?

Thanks a ton for any help! I really appreciate all the work y'all do!

-Ellis

HaplotypeCaller not picking up variants for HiSeq runs


Hello,
We were sequencing all our data in HiSeq and now moved to nextseq. We have sequenced the same batch of samples on both the sequencers. Both are processed using the same pipeline/parameters.
What I have noticed is that GATK 3.7 HC is not picking up variants, even though the coverage is good and the variants are evidently present in the BAM file.

For example, the screenshot below shows the BAM files for both the NextSeq and HiSeq samples. There are at least 3 variants in the region 22:29885560-29885861 (NEPH, exon 5) that are expected to be picked up for HiSeq.

These variants are picked up for the NextSeq samples (even though the coverage for HiSeq is much better).

The command that I have used for both samples is

java -Xmx32g -jar GATK_v3_7/GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37.fa --dbsnp GATK_ref/dbsnp_138.b37.vcf -I ${i}.HiSeq_Run31.variant_ready.bam -L NEPH.bed -o ${i}.HiSeq_Run31.NEPH.g.vcf

Any idea why this can happen ?

Many thanks,


GenomicsDBImport doesn't work on my five GVCF files


Hi Geraldine!
I am using the GenomicsDBImport tool to merge five gVCF files; the script is shown below:
gatk GenomicsDBImport -R /hg19/hg19.fa -V sample1.g.vcf -V sample2.g.vcf -V sample3.g.vcf -V sample4.g.vcf -V sample5.g.vcf --genomicsdb-workspace-path my_database --intervals chr20

However, it doesn't work, warning that htsjdk.tribble.TribbleException: An index is required, but none found., for input source: file:sample1.g.vcf. How can I resolve the problem?
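
A minimal sketch of the usual fix, assuming GATK 4.0's IndexFeatureFile tool (run it for each unindexed input):

gatk IndexFeatureFile -F sample1.g.vcf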

Germline short variant discovery (SNPs + Indels)


Purpose

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.



Reference Implementations

Pipeline | Summary | Notes | Github | FireCloud
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | pending
Prod* germline short variant joint genotyping | GVCFs to cohort VCF | optimized for GCP | yes | pending
$5 Genome Analysis Pipeline | uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38
Generic germline short variant per-sample calling | analysis-ready BAM to GVCF | universal | yes | hg38
Generic germline short variant joint genotyping | GVCFs to cohort VCF | universal | yes | hg38 & b37
Intel germline short variant per-sample calling | uBAM to GVCF | Intel optimized for local architectures | yes | NA

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.


Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.


Main steps

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.
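
A minimal GATK4 sketch of this per-sample step (file names hypothetical):

gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF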

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly.

Prior to GATK4, this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.
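
A minimal sketch, assuming a GenomicsDB workspace created in the previous step (names hypothetical):

gatk GenotypeGVCFs \
    -R reference.fasta \
    -V gendb://my_database \
    -O cohort.vcf.gz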

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyRecalibration

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.
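
As a hedged sketch of the hard-filtering alternative (GATK4 syntax; annotations and thresholds illustrative only, not a recommendation):

gatk VariantFiltration \
    -R reference.fasta \
    -V cohort.vcf.gz \
    --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    --filter-name "basic-snp-filter" \
    -O cohort.filtered.vcf.gz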

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.


Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced that decoupled the two internal processes previously bundled into variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which requires access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described above, enabling incremental growth of cohorts as well as scaling to large cohort sizes.
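Step (1) is what HaplotypeCaller's GVCF mode implements; a per-sample sketch in GATK 3.x syntax (file names are placeholders) looks like:

# per-sample likelihood collection; emits a GVCF rather than a final callset
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    -o sample1.g.vcf.gz

The GVCF records genotype likelihoods and reference confidence at every position, which is what lets step (2) be rerun across the whole cohort without revisiting the BAMs.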

GATK HaplotypeCaller missing SNPs at the terminals of the segment when calling SNPs for Influenza A

We are trying to call variants for Influenza A virus sequenced on MiSeq using HaplotypeCaller, following the GATK Best Practices (GATK version 3.7). However, when we inspect the called variants in IGV together with the BAM file, we frequently find SNPs that HaplotypeCaller misses at the beginning or end of a segment. These missing SNPs are well supported by the reads and are called with high confidence by both samtools and UnifiedGenotyper.

As one example, there were three rows of called variants at the top of the IGV view: from top to bottom, calls from UnifiedGenotyper, samtools, and HaplotypeCaller. The rightmost SNP was called by the first two tools but missed by HaplotypeCaller, even though the supporting reads consistently show the variant.

Just to show that this SNP is well supported by the reads, here is the record for it in the VCF generated by UnifiedGenotyper:

A-New_Jersey-NHRC_93408-2016-H3N2(KY078630)-HA 15 . A T 166598 . AC=1;AF=1.00;AN=1;DP=3970;Dels=0.00;FS=0.000;HaplotypeScore=26.7856;MLEAC=1;MLEAF=1.00;MQ=59.99;MQ0=0;QD=34.24;SOR=4.823 GT:AD:DP:GQ:PL 1:0,3969:3970:99:166628,0

A close check of the debugging BAM file generated by HaplotypeCaller showed that the variant is consistently missing from the de novo assembled haplotypes.
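For anyone reproducing this check, here is a sketch of how such a debugging BAM is typically generated in GATK 3.x (the interval shown is a placeholder around the segment end in question):

# re-run over the region of interest, writing the assembled haplotypes to a BAM
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R flu_reference.fasta \
    -I sample.bam \
    -L 'A-New_Jersey-NHRC_93408-2016-H3N2(KY078630)-HA:1-200' \
    -forceActive \
    -disableOptimizations \
    -bamout debug_assembly.bam \
    -o debug.vcf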

There are other cases of missing SNPs as well. What they have in common is that they are always at the end of a segment, are well supported by reads, and are missed only by HaplotypeCaller. For some samples, however, similar variants at segment ends are called by HaplotypeCaller.

My questions are the following:

  • Is this a bug in HaplotypeCaller? If so, has it been fixed?
  • If it is not a bug, is there a HaplotypeCaller parameter that can be set to guarantee that well-supported variants at segment ends are not missed?

Many thanks.

VariantRecalibrator - no data found

I just updated to the latest nightly and got the same error:

INFO 12:03:16,652 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00258
INFO 12:03:23,474 ProgressMeter - GL000202.1:10465 5.68e+07 32.4 m 34.0 s 98.7% 32.9 m 25.0 s
INFO 12:03:32,263 VariantRecalibratorEngine - Convergence after 46 iterations!
INFO 12:03:41,008 VariantRecalibratorEngine - Evaluating full set of 4944219 variants...
INFO 12:03:41,100 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392)
at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138)
at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-03-20-g65934ae):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

SelectVariants - ERROR MESSAGE: The PL index 1326 cannot be more than 1325

Dear GATK Team,

I am using GATK version 3.6-0-g89b7209 (for consistency with control data).
From a VCF containing only INDELs in 500 samples I am trying to extract only variants in a subset of 269 samples:

time java -Xmx24g -jar /home/mhalache/tools/GATK3.6/GenomeAnalysisTK.jar -T SelectVariants \
-R /exports/igmm/eddie/NextGenResources/annotation/variants/1KG_phase3/reference/human_g1k_v37.fasta \
-V a.INDEL.ready.vcf.gz \
-sf sample_ids.txt \
-o a.INDEL.unrel.vcf.gz \
--removeUnusedAlternates \
-env

and I am getting the following error message:

DEBUG 2018-06-13 11:14:33 BlockCompressedOutputStream Using deflater: Deflater

ERROR --
ERROR stack trace

java.lang.IllegalStateException: The PL index 1326 cannot be more than 1325
at htsjdk.variant.variantcontext.GenotypeLikelihoods.getAllelePair(GenotypeLikelihoods.java:492)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.getDiploidLikelihoodIndexes(GATKVariantContextUtils.java:697)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.determineDiploidLikelihoodIndexesToUse(GATKVariantContextUtils.java:647)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.fixDiploidGenotypesFromSubsettedAlleles(GATKVariantContextUtils.java:1421)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.updatePLsSACsAD(GATKVariantContextUtils.java:1403)
at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.subsetRecord(SelectVariants.java:1080)
at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:854)
at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.map(SelectVariants.java:309)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: The PL index 1326 cannot be more than 1325
ERROR ------------------------------------------------------------------------------------------

I could not find any relevant post on the GATK site; please accept my apologies if this has been discussed previously.
A corresponding script extracting the SNPs for the same subset appears to be working properly (it is currently running; the output has not been validated yet).

Best,
Mike
