
Base Quality Score Recalibration (BQSR)


BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencer when it estimates the quality score of each base call. This document starts with a high-level overview of the purpose of this method; deeper technical details are provided further down.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to any Spanish-speaking users who might get awfully confused at this point.


Wait, what are base quality scores again?

These scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion basecalls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases. In practice each basecall gets its own quality score, determined through some dark magic jealously guarded by the manufacturer of the sequencer.
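For reference, the Phred scale relates a quality score Q to an error probability P(error) as Q = -10 × log10(P(error)); equivalently, P(error) = 10^(-Q/10). So Q20 corresponds to 10^(-2) = 0.01 (1 error in 100 calls) and Q30 to 1 error in 1,000.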

Variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a basecall that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally my impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

Fantastic! How does it work?

The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
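As a sketch of what this QC step can look like with the GATK3-style tools used elsewhere in this article (file names are placeholders; the second table comes from re-running BaseRecalibrator on top of the first report, and plotting requires a working R installation):

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I input.bam \
   -knownSites dbsnp.vcf \
   -BQSR recal_1st_pass.grp \
   -o recal_2nd_pass.grp

java -jar GenomeAnalysisTK.jar \
   -T AnalyzeCovariates \
   -R reference.fasta \
   -before recal_1st_pass.grp \
   -after recal_2nd_pass.grp \
   -plots recalibration_plots.pdf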


More detailed information

Detailed information about command line options for BaseRecalibrator can be found here.

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

This process is accomplished by analyzing the covariation among several features of a base. For example:

  • Reported quality score
  • The position within the read
  • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, pre-recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur more frequently at the end of reads than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating the low-quality end-of-read AC bases from the high-quality TG bases at the start of the read. See below for examples of pre- and post-corrected values.

The system was designed so that (sophisticated) users can easily add new covariates to the calculations. Users wishing to add their own covariate can simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs a getValue method that looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

Running the tools

BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

  • read group the read belongs to
  • assigned quality score
  • machine cycle producing this base
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
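A representative first-pass command in the GATK3 style used later in this article (file names are placeholders; the known-sites resource is whichever variant database you use, e.g. dbSNP):

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I my_reads.bam \
   -knownSites dbsnp.vcf \
   -o my_reads.recal_data.grp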

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:

 
java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -o output.bam

After computing covariates in the initial BAM file, we walk through the BAM file again and rewrite the quality scores (in the QUAL field) into a new BAM file, using the data in the recalibration_report.grp file.

This step uses the recalibration table data in recalibration_report.grp, produced by BaseRecalibrator, to recalibrate the quality scores in input.bam, writing out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is:

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effects (a worked example follows)
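As a purely illustrative example (the numbers are invented, not taken from a real report): if a base is reported as Q25, the read group's global reported-vs-empirical difference is -3, the Q25 bin itself is a further 1 unit over-confident, and the cycle and dinucleotide covariates contribute -0.5 and +0.5 respectively, then the recalibrated quality is 25 - 3 - 1 - 0.5 + 0.5 = Q21.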

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high-quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

  • The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
  • A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases per read group from a next-generation DNA sequencer; 1B bases yields significantly better results.
  • Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
  • The recalibrator applies a "Yates" correction for low occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting: the correction has only a minor impact on data sets with billions of bases, but it is critical to avoid overconfidence in rare bins in sparse data (see the worked example below).
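For example, a bin with 0 mismatches out of 10 observed bases would naively imply an error rate of 0 (an infinite Q score); with the correction, the inferred error rate is (0 + 1) / (10 + 2) ≈ 0.083, roughly Q11, which is a far safer estimate for such a sparsely observed bin.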

Example pre and post recalibration results

  • Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
  • There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

[Before and after recalibration plots]

The output of the BaseRecalibrator

  • A Recalibration report containing all the recalibration information for the data

Note that BaseRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.

The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This table contains all the arguments used to run BQSRv2 for this dataset. It is important so that the on-the-fly recalibration step can use the same parameters as the recalibration step (context sizes, covariates, ...).

Example Arguments table:

 
#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is produced, along with a 'default' quantization table. This table is a required parameter for any other GATK tool if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this with the engine argument -qq: with -qq 0 qualities are not quantized, and with -qq N the quantization bins are recalculated on the fly using N bins. Note that quantization is still experimental and we do not recommend using it unless you are a very advanced user.
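For example, a hypothetical on-the-fly recalibration command that also requests re-quantization into 8 bins (following the PrintReads command shown earlier; file names are placeholders):

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -BQSR recalibration_report.grp \
   -qq 8 \
   -o output.bam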

Example Quantization table:

 
#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions and deletions. This is no different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions and deletions. This is no different from the table used in the old table recalibration walker.

 
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.

 
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and applying the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

  • First do an initial round of SNP calling on your original, unrecalibrated data.
  • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
  • Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence (a minimal command sketch follows this list).
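A minimal sketch of one bootstrap round in the GATK3 style used elsewhere in this document (file names, the filtering expression and the choice of caller are all placeholders to adapt to your own data):

java -jar GenomeAnalysisTK.jar \
   -T HaplotypeCaller \
   -R reference.fasta \
   -I original.bam \
   -o round0.vcf

java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V round0.vcf \
   -select "QD > 20.0" \
   -o round0.highconf.vcf

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I original.bam \
   -knownSites round0.highconf.vcf \
   -o recal0.table

java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I original.bam \
   -BQSR recal0.table \
   -o recal0.bam

Call variants again on recal0.bam and repeat until the callset stops changing appreciably.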

Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.

[Plot of recalibration accuracy versus number of reads per read group]
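For example, restricting the modeling step to a single chromosome (the interval and file names are illustrative; pick any region representative of your data):

java -jar GenomeAnalysisTK.jar \
   -T BaseRecalibrator \
   -R reference.fasta \
   -I my_reads.bam \
   -knownSites dbsnp.vcf \
   -L chr20 \
   -o my_reads.recal_data.grp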


How to add sample names in VCF?

I am using the GATK best practices for germline SNPs and indels with version 4.1.2.0. After mapping and recalibration, I run HaplotypeCaller in GVCF mode. I am combining all the VCF files (output from HaplotypeCaller) using GenotypeGVCFs. But I am not sure where and how I should add sample names to the VCF samples?

is my pipeline correct to perform bqsr bootstrap with gatk4 ?


Hello gatk team,

I'm confused because I have no idea of the exact pipeline to use to perform the BQSR bootstrap in order to obtain my final_recalibrated.bam and do my variant calling.
I was familiar with the gatk3.5 pipeline but not at all with version 4...
Some threads say to always use the original sample.bam file to perform the BQSR bootstrap and others use the sample_recalN.bam file...
And now I am lost.

Is the following pipeline correct?

1st round

sample.bam + HaplotypeCaller --> sample_recal0.vcf

sample_recal0.vcf + vcftools --> sample_filtered_recal0.vcf

sample_filtered_recal0.vcf + sample.bam + BaseRecalibrator --> sample_recal0.table

sample.bam + sample_recal0.table + ApplyBQSR --> sample_recal0.bam

2nd round

sample_recal0.bam + HaplotypeCaller --> sample_recal1.vcf

sample_recal1.vcf + vcftools --> sample_filtered_recal1.vcf

sample_filtered_recal1.vcf + sample.bam (correct? or should I use sample_recal0.bam?) + BaseRecalibrator --> sample_recal1.table

sample.bam + sample_recal1.table + ApplyBQSR --> sample_recal1.bam

3rd round

and so on until convergence ??

As soon as I reach convergence, I run ApplyBQSR with my sample.bam file and the sample_recalN.table, then HaplotypeCaller?

Thanks a lot in advance for helping to solve my issue and for allowing me to understand how BQSR works in GATK4 with samples that have no known SNPs or indels!!

Access to the TCGA PoN


Hi there,

I am authorized to access controlled data from TCGA, but I don't have access to the following file:

controlled_access_token_pon_from_tcga8000.final_summed_tokens.hist.bin

I have been told that I need to be added to group_TCGA-dbGaP-Authorized@firecloud.org, which would in turn provide read access to the bucket.

Could you please help? My Broad id is sachet. Happy to provide any other info you need either on this thread or via email.

Thanks.

VariantRecalibrator resource known training and truth confusion

Running VariantRecalibrator on a raw mouse VCF file with the following command:

gatk --java-options "-Xmx4g" VariantRecalibrator -R Mus_musculus.GRCm38.dna.primary_assembly_ordered.fa -V allSamples_bwa_genotyped.vcf -resource:VCF,known=true,training=false,truth=false,prior=2.0 /mouse/mm10/Ensembl/mus_musculus.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode BOTH -O allSamples_bwa_genotyped.recal --tranches-file allSamples_bwa_genotyped.tranches 2> allSamples_bwa_genotyped_recal.log &

This produces the following:

A USER ERROR has occurred: No training set found! Please provide sets of known polymorphic loci marked with the training=true feature input tag. For example, -resource hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmapFile.vcf

We are not sure what to use for our known, training and truth datasets. Currently, for our known (known=true,training=false,truth=false) we are using /mouse/mm10/Ensembl/mus_musculus.vcf which we downloaded from:

ftp://ftp.ensembl.org/pub/release-98/variation/vcf/mus_musculus/mus_musculus.vcf.gz

For our training (known=false,training=true,truth=true) we want to use the following (merged different mouse strains):

ftp://ftp-mouse.sanger.ac.uk/current_snps/mgp.v5.merged.snps_all.dbSNP142.vcf.gz

But this we believe contains only SNPs and we would need to download another large file for indels.

Should we use the first (Ensembl) file as training (known=true,training=true,truth=true)?

How many different files do we need to specify for the -resource parameter, and what should be our known=?,training=?,truth=?,prior=? for them?

Are we using the right files or if not could you please suggest where we can get the right files?

CNVPipeline stage 7 error (unable to read the list of metadata directories)

I ran into this issue with reading metadata location when running CNVPipeline.
I have 4 metadata directories, so I put their locations into a list called metadata.list

The error message below can be found from the log file CNVDiscoveryPipeline-10.out and log files from cnv_stage7 where the program org.broadinstitute.sv.genotyping.RefineCNVBoundaries was run.

INFO 14:52:34,256 MetaData - Adding metadata location /cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test/metadata.list ...
Exception in thread "main" org.broadinstitute.sv.commandline.ArgumentException: Invalid metadata directory: /cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test/metadata.list

I checked log files from stage 1-6. There was no issue reading the list of multiple metadata directories.
Does someone know how to fix this issue?

Below is my script to run CNVPipeline and the log files are attached.

#!/bin/env bash
module load gcc/4.8.5 jdk samtools/1.6 r/3.3.3 htslib/1.6 lsf_drmaa/1.1.1

chrom='1'
SV_TMPDIR=/cluster/work/pausch/temp_scratch/fang/SV_tmp_cnv/chr${chrom}
export SV_DIR='/cluster/work/pausch/fang/svtoolkit'
inputFile='/cluster/work/pausch/fang/svtoolkit/bam.list'
runDir=/cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test/cnv_run/chr${chrom}
reference_prefix='/cluster/work/pausch/fang/svtoolkit/reference_meta/ARS-UCD1.2_Btau5.0.1Y'
outdir='/cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test/output/cnv'
output_prefix='test'
gendermap='/cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test/gendermap.txt'
jobRunner="Drmaa"
jobWrapper="/cluster/work/pausch/fang/test_cnv/queue_lsf_wrapper.sh"

export PATH=${SV_DIR}/bwa:${PATH}
export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}

mkdir -p ${SV_TMPDIR} || exit 1
mkdir -p ${runDir} || exit 1
mkdir -p ${runDir}/logs_cnv || exit 1

mx="-Xmx5g"
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"

echo "-- Run CNVPipeline -- "
LC_ALL=C java -cp ${classpath} ${mx} \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-cp ${classpath} \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-configFile ${SV_DIR}/conf/genstrip_parameters.txt \
-memLimit 7.0 \
-jobRunner ${jobRunner} \
-gatkJobRunner ${jobRunner} \
-jobWrapperScript ${jobWrapper} \
-jobNative "-W 24:00 -R rusage[mem=8000]" \
--disableJobReport \
-R ${reference_prefix}.fa \
-genomeMaskFile ${reference_prefix}.svmask.fasta \
-genderMapFile ${gendermap} \
-ploidyMapFile ${reference_prefix}.ploidymap.txt \
-md /cluster/work/pausch/fang/svtoolkit/ucd_sv/ucd_test/metadata.list \
-runDirectory ${runDir} \
-jobLogDir ${runDir}/logs_cnv \
-I ${inputFile} \
-intervalList ${chrom} \
-lastStage 7 \
-tilingWindowSize 5000 \
-tilingWindowOverlap 2500 \
-maximumReferenceGapLength 2500 \
-boundaryPrecision 200 \
-minimumRefinedLength 2500 \
-run \
|| exit 1

Extracting MQ and QUAL values for invariant sites in VCF files

I'm having problems getting mapping quality (MQ) values and PHRED called site quality scores (QUAL) for invariant sites in the VCF files generated by GATK, even when I specify that all sites should be called.

First, I cannot see an MQ value for invariant sites. Is it possible to obtain this value for these sites?

The QUAL value is generated for some invariant sites, but for a majority of them an 'Infinity' value is obtained instead. After inspecting these sites in IGV, it is not clear to me whether this is related to high- or low-quality regions, as they appear both in sites with good and low coverage.

The options used with GATK are the following.

First, for each sample I'm working with, I'm using GATK HaplotypeCaller as follows:

GATK HaplotypeCaller \
-I file.bam \
-O file.g.vcf \
-R reference.fa.gz \
-ploidy 1 \
-ERC BP_RESOLUTION \
-stand-call-conf 10.0

After that, I combine the generated GVCF files and call variants as follows:

GATK CombineGVCFs \
-R reference.fa.gz \
-O combined.g.vcf \
--variant file1.g.vcf \
--variant file2.g.vcf \
...

GATK GenotypeGVCFs \
-R reference.fa.gz \
-V combined.g.vcf \
-O file.vcf \
-ploidy 1 \
-all-sites

Is there a way to get numeric MQ and QUAL values for all invariant sites?

Thank you

What variants does Mutect2 --germline-resource filter out?

Hi everybody,

I am new to analyzing WES data and I am trying the GATK4 workflow for the detection of somatic variants.

I run Mutect2 in tumor-only mode with these commands:

gatk Mutect2 -R reference.ucsc.hg19.fasta -L Target.bed -I input.bam --f1r2-tar-gz input.tar.gz -O output1.unfiltered.vcf --germline-resource af-only-gnomad.raw.sites.hg19.vcf [max-population-af default value = 0.01]

gatk Mutect2 -R reference.ucsc.hg19.fasta -L Target.bed -I input.bam --f1r2-tar-gz input.tar.gz -O output2.unfiltered.vcf [max-population-af default value = 0.01]

For the first command I obtained a VCF with around 40000 variants, whereas with the second I obtained a VCF with around 100000 variants. I expected --germline-resource to filter out germline variants based on AF in the resource (af-only-gnomad.raw.sites.hg19.vcf).
However, I notice that output1.unfiltered.vcf contains variants with POPAF (negative log10 population allele frequency of the alt alleles) values that correspond to an AF (in the resource) greater than 0.01.

Does anybody know the logic used by Mutect2 to filter out variants based on the germline resource? Is it possible that Mutect2 considers parameters other than AF in the resource?

Thank you,
Simone

Apparent difference between active region algorithm in mutect.pdf and Mutect2Engine.java:492

Hello,

First off, thank you so much for an excellent toolkit and brilliant forum.
Both have helped me enormously in my work and continue to do so. I am very grateful.

My question relates to an apparent difference, in Mutect2 (GATK4), between the algorithm for identifying active regions (as outlined in docs/mutect/mutect.pdf, page 2) and how it is implemented in the source file tools/walkers/mutect/Mutect2Engine.java, line 492.

In the code (Mutect2Engine.java:492), fTildeRatio is evaluated as,

```
fTildeRatio = FastMath.exp(MathUtils.digamma(nRef + 1) - MathUtils.digamma(nAlt + 1));
```

However, in mutect.pdf, page 2, just after equation 1, it appears to indicate it should be evaluated as,

```
fTildeRatio = FastMath.exp(MathUtils.digamma(nRef + 1) - MathUtils.digamma(n + 2));
```

Is there a discrepancy there and, if so, how and in what situations would it affect the ultimate TLOD value in the VCF file?

Apologies in advance if I'm missing something here or it has been answered before.
I haven't come across it as yet.

Thanks again and best regards, Brian

(How to) Call common and rare germline copy number variants


Document is in BETA. It may be incomplete and/or inaccurate. Post suggestions and read about updates in the Comments section.


[gCNV workflow diagram]

The tutorial outlines steps in detecting germline copy number variants (gCNVs) and illustrates two workflow modes--cohort mode and case mode. The cohort mode simultaneously generates a cohort model and calls CNVs for the cohort samples. The case mode analyzes a single sample against an already constructed cohort model. The same workflow steps apply to both targeted exome and whole genome sequencing (WGS) data. The workflow is able to call both rare and common events and intelligently handles allosomal ploidies, i.e. cohorts of mixed male and female samples.

For the cohort mode, the general recommendation is at least a hundred samples to start. Researchers should expect to tune workflow parameters from the provided defaults. In particular, GermlineCNVCaller's default inference parameters are conservatively set for efficient run times.

The figure diagrams the workflow tools. Section 1 creates an intervals list and counts read alignments overlapping the intervals. Section 2 shows optional but recommended cohort mode steps to annotate intervals with covariates, for use both in filtering intervals and in explicit modeling; this section also removes intervals with outlier counts. Section 3 generates global baseline observations for the data and models and calls the ploidy of each contig. Section 4 is the heart of the workflow and models per-interval copy number. Because model fitting is computationally intensive, the section shows how to analyze data in parts. Finally, Section 5 calls per-sample copy number events per interval and per segment. Results are in VCF format.

► A highly recommended whitepaper detailing the methods is in the gatk GitHub repository's docs/CNV directory.
► For pipelined workflows, see the gatk GitHub repository's scripts/cnv_wdl directory. Be sure to obtain a tagged version of the script, e.g. v4.1.0.0, following instructions in Section 4 of Article#23405.
► This workflow is not appropriate for bulk tumor samples, as it infers absolute copy numbers. For somatic copy number alteration calling, see Tutorial#11682.

Article#11687 visualizes the results in IGV and provides followup discussion. Towards data exploration, here are two illustrative Jupyter Notebook reports that dissect the results.

  • Notebook#11685 shows an approach to measuring concordance of sample NA19017 gCNV calls to 1000 Genomes Project truth set calls using tutorial chr20sub small data.
  • Notebook#11686 examines gCNV callset annotations using larger data, namely chr20 gCNV results from the tutorial's 24-sample cohort.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectReadCounts

    1.1 How do I view HDF5 format data?

  2. (Optional) Annotate intervals with features and subset regions of interest with FilterIntervals
  3. Call autosomal and allosomal contig ploidy with DetermineGermlineContigPloidy
  4. Model interval copy number for a cohort or case sample with GermlineCNVCaller

    4.1 How do I increase the sensitivity of detection?
    4.2 How do I make interval lists for scattering?

  5. Call copy number segments and consolidate sample results with PostprocessGermlineCNVCalls


Tools involved

  • GATK 4.1.0.0
  • Workflow tools DetermineGermlineContigPloidy, GermlineCNVCaller and PostprocessGermlineCNVCalls require a Python environment with specific packages, e.g. the gCNV computational python module gcnvkernel. See Article#12836 for instructions on setting up and managing the environment with the user-friendly conda. Once the conda environment is set up, e.g. with conda env create -f gatkcondaenv.yml, activate it with source activate gatk or conda activate gatk before running the tool.

    Alternatively, use the broadinstitute/gatk Docker, which activates the Python environment by default. Allocation of at least 8GB memory to Docker is recommended for the tutorial commands. See Article#11090 for instructions to launch a Docker container.

Download example data

The tutorial provides example small WGS data sourced from the 1000 Genomes Project. Cohort mode illustrations use 24 samples, while case mode illustrations analyze one sample against a cohort model made from the remaining 23 samples. The tutorial uses a fraction of the workflow's recommended hundred samples for ease of illustration. Furthermore, commands in each step use one of three differently sized intervals lists for efficiency. Coverage data are from the entirety of chr20, chrX and chrY. So although a step may analyze a subset of regions, it is possible to instead analyze all three contigs in case or cohort modes.

Download tutorial_11684.tar.gz either from the GoogleDrive or from the FTP site. The bundle includes data for Notebook#11685 and Notebook#11686. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. The example data is from the 1000 Genomes project Phase 3 aligned to GRCh38.


1. Collect raw counts data with PreprocessIntervals and CollectReadCounts

PreprocessIntervals pads exome targets and bins WGS intervals. Binning refers to creating equally sized intervals across the reference. For example, 1000 base binning would define chr1:1-1000 as the first bin. Because counts of reads on reference N bases are not meaningful, the tool automatically excludes bins consisting entirely of Ns. For GRCh38 chr1, non-N sequences start at base 10,001, so the first few bins become:

[Table of the first chr1 bins, starting at chr1:10001-11000]

For WGS data, bin the entirety of the reference, e.g. with 1000 base intervals.

gatk PreprocessIntervals \
-R ~/ref/Homo_sapiens_assembly38.fasta \
--padding 0 \
-imr OVERLAPPING_ONLY \
-O grch38.preprocessed.interval_list

This produces a Picard-style intervals list of 1000 base bins.

For exome data, pad target regions, e.g. with 250 bases.

gatk PreprocessIntervals \
-R ~/ref/Homo_sapiens_assembly38.fasta \
-L targets.interval_list \
--bin-length 0 \
-imr OVERLAPPING_ONLY \
-O targets.preprocessed.interval_list

This produces a Picard-style intervals list of exome target regions padded by 250 bases on either side.

For the tutorial, bin three contigs.

The contigs in gcnv-chr20XY-contig.list subset the reference to chr20, chrX and chrY.

gatk PreprocessIntervals \
-R ref/Homo_sapiens_assembly38.fasta \
--padding 0 \
-L gcnv-chr20XY-contig.list \
-imr OVERLAPPING_ONLY \
-O chr20XY.interval_list

This generates a Picard-style intervals list with 242,549 intervals. The file has a header section with @ header lines and a five-column body. See Article#11009 for a description of the columns.

Comments on select parameters

  • For WGS, the default 1000 --bin-length is the recommended starting point for typical 30x data. Be sure to set --padding 0 to disable padding outside of given genomic regions. Bin size should correlate with depth of coverage, e.g. lower coverage data should use larger bin size while higher coverage data can support smaller bin size. The size of the bin defines the resolution of CNV calls. The factors to consider in sizing include how noisy the data is, average coverage depth and how even coverage is across the reference.
  • For targeted exomes, provide the exome capture kit's target intervals with -L, set --bin-length 0 to disable binning and pad the intervals with --padding 250 or other desired length.
  • Provide intervals to exclude from analysis with --exclude-intervals or -XL, e.g. centromeric regions. Consider using this option especially if data is aligned to a reference other than GRCh38. The workflow also enables excluding regions again later using -XL. A frugal strategy is to collect read counts using the entirety of intervals and then to exclude undesirable regions later at the FilterIntervals step (section 2), the DetermineGermlineContigPloidy step (section 3), the GermlineCNVCaller step (section 4) and/or post-calling.

CollectReadCounts tabulates the raw integer counts of reads overlapping an interval. The tutorial has already collected read counts ahead of time for the three contigs--chr20, chrX and chrY. Here, we collect read counts on small data.

Count reads per bin using CollectReadCounts

gatk CollectReadCounts \
-L chr20sub.interval_list \
-R ref/Homo_sapiens_assembly38.fasta \
-imr OVERLAPPING_ONLY \
-I NA19017.chr20sub.bam \
--format TSV \
-O NA19017.tsv 

This generates a TSV format table of read counts.

Comments on select parameters

  • The tutorial generates text-based TSV (tab-separated-value) format data instead of the default HDF5 format by adding --format TSV to the command. Omit this option to generate the default HDF5 format. Downstream tools process HDF5 format more efficiently.
  • Here and elsewhere in the workflow, set --interval-merging-rule (-imr) to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals.
  • The tool employs a number of engine-level read filters. Of note are NotDuplicateReadFilter and MappingQualityReadFilter: the tool excludes reads marked as duplicates and reads with mapping quality less than 10. Change the mapping quality threshold with the --minimum-mapping-quality option (see the example after this list).
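For example, a variant of the earlier command that raises the mapping quality threshold (the value 30 is illustrative):

gatk CollectReadCounts \
-L chr20sub.interval_list \
-R ref/Homo_sapiens_assembly38.fasta \
-imr OVERLAPPING_ONLY \
--minimum-mapping-quality 30 \
-I NA19017.chr20sub.bam \
--format TSV \
-O NA19017.mq30.tsv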

After the SAM format header section, denoted by lines starting with @, the body of the data has a column header line followed by read counts for every interval.

[Example CollectReadCounts output: column header line followed by per-interval read counts]

☞ 1.1 How do I view HDF5 format data?

See Article#11508 for an overview of the format and instructions on how to navigate the data with external application HDFView. The article illustrates features of the format using data generated in another tutorial.
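If the HDF5 command-line utilities are installed (an assumption; they are distributed separately from GATK, e.g. as an hdf5-tools package on many Linux systems), you can also get a quick look at a counts file from the shell. The file name below is a placeholder:

h5ls -r sample.counts.hdf5
h5dump -n sample.counts.hdf5

The first command lists the groups and datasets recursively; the second prints the names of all objects in the file.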


back to top


2. (Optional) Annotate intervals with features and subset regions of interest with FilterIntervals

The steps in this section pertain to the cohort mode.

Researchers may desire to subset the intervals that GermlineCNVCaller will analyze, either to exclude potentially problematic regions or to retain only regions of interest. For example one may wish to exclude regions where all samples in a large cohort have copy number zero. Filtering intervals can be especially impactful for analyses that utilize references other than GRCh38 or that are based on sequencing technologies affected by sequence context, e.g. targeted exomes. The tutorial data is WGS data aligned to GRCh38, and the gCNV workflow can process the entirety of the data, without the need for any interval filtering.

Towards deciding which regions to exclude, AnnotateIntervals labels the given intervals with GC content and additionally with mappability and segmental duplication content if given the respective optional resource files. FilterIntervals then subsets the intervals list based on the annotations and other tunable thresholds. Later, GermlineCNVCaller also takes in the annotated intervals to use as covariates towards analysis.

Explicit GC-correction, although optional, is recommended. The default v4.1.0.0 cnv_germline_cohort_workflow.wdl pipeline workflow omits explicit GC-correction; we activate it in the pipeline by setting "do_explicit_gc_correction": "True". The tutorial illustrates the optional AnnotateIntervals step by performing the recommended explicit GC-content-based filtering.

AnnotateIntervals with GC content

gatk AnnotateIntervals \
-L chr20XY.interval_list \
-R ref/Homo_sapiens_assembly38.fasta \
-imr OVERLAPPING_ONLY \
-O chr20XY.annotated.tsv

This produces a four-column table where the fourth column gives the fraction of GC content.

[Example AnnotateIntervals output: four-column table with GC content in the fourth column]

Comments on select parameters

  • The tool requires the -R reference and the -L intervals. The tool calculates GC-content for the intervals using the reference.
  • Although optional for the tool, we recommend annotating mappability by providing a --mappability-track regions file in either .bed or .bed.gz format. Be sure to merge any overlapping intervals beforehand. The tutorial omits use of this resource (a hypothetical command is shown after this list).

    GATK recommends use of the single-read mappability track, as the multi-read track requires much longer times to process. For example, the Hoffman lab at the University of Toronto provides human and mouse mappability BED files for various kmer lengths at https://bismap.hoffmanlab.org/. The accompanying publication is titled Umap and Bismap: quantifying genome and methylome mappability.

  • Optionally and additionally, annotate segmental duplication content by providing a --segmental-duplication-track regions file in either .bed or .bed.gz format.

  • Exclude undesirable intervals with the -XL parameter, e.g. intervals corresponding to centromeric regions.
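A hypothetical AnnotateIntervals command that adds the optional mappability annotation (the BED file name is a placeholder for a merged single-read Umap track, as discussed in the list above):

gatk AnnotateIntervals \
-L chr20XY.interval_list \
-R ref/Homo_sapiens_assembly38.fasta \
--mappability-track k36.umap.merged.bed.gz \
-imr OVERLAPPING_ONLY \
-O chr20XY.annotated.tsv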

FilterIntervals takes preprocessed intervals and either annotated intervals or read counts or both. It can also exclude intervals given with -XL. When given both types of data, the tool retains the intersection of the intervals that pass filtering on each data type. The v4.1.0.0 cnv_germline_cohort_workflow.wdl pipeline script requires read counts files, and so by default the pipeline script always performs the FilterIntervals step on read counts.

FilterIntervals based on GC-content and cohort extreme counts

gatk FilterIntervals \
-L chr20XY.interval_list \
--annotated-intervals chr20XY.annotated.tsv \
-I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \
-I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \
-I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \
-I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \
-I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \
-I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \
-imr OVERLAPPING_ONLY \
-O chr20XY.cohort.gc.filtered.interval_list

This produces a Picard-style intervals list containing a subset of the starting intervals (230,126 of 242,549). Of the removed intervals, GC-content filtering accounts for 42, and extreme coverage counts from the 24-sample cohort remove an additional 12,381, for a total of 12,423 filtered intervals (5.122% of the starting intervals).

Comments on select parameters

  • The tool requires the preprocessed intervals, provided with -L, from Section 1.
  • Given annotated intervals with --annotated-intervals, the tool filters intervals on the given annotation(s).

    • GC-content thresholds are set by --minimum-gc-content and --maximum-gc-content, where defaults are 0.1 and 0.9, respectively.
    • Mappability thresholds are set by --minimum-mappability and --maximum-mappability. Defaults are 0.9 and 1.0, respectively.
    • Segmental duplication content thresholds are set by --minimum-segmental-duplication-content and --maximum-segmental-duplication-content. Defaults are 0.0 and 0.5, respectively.
  • Given read counts files, each with -I and in either HDF5 or TSV format, the tool filters intervals on low and extreme read counts with the following tunable thresholds.

    • --low-count-filter-count-threshold default is 5
    • --low-count-filter-percentage-of-samples default is 90.0
    • --extreme-count-filter-minimum-percentile default is 1.0
    • --extreme-count-filter-maximum-percentile default is 99.0
    • --extreme-count-filter-percentage-of-samples default is 90.0

    The read counts data must match each other in intervals.

    For the default parameters, the tool first filters out intervals with a count less than 5 in greater than 90% of the samples. The tool then filters the remaining intervals with a count percentile less than 1 or greater than 99 in a percentage of samples greater than 90%. These parameters effectively exclude intervals where all samples have extreme outlier counts, e.g. regions deleted in every sample.

    To disable counts based filtering, omit the read counts or, e.g. when using the v4.1.0.0 cnv_germline_cohort_workflow.wdl pipeline script, set the two percentage-of-samples parameters as follows.

    --low-count-filter-percentage-of-samples 100 \
    --extreme-count-filter-percentage-of-samples 100 \
    
  • Provide intervals to exclude from analysis with --exclude-intervals or -XL, e.g. centromeric regions. A frugal strategy is to collect read counts using the entirety of intervals and then to exclude undesirable regions later at the FilterIntervals step (section 2), the DetermineGermlineContigPloidy step (section 3), the GermlineCNVCaller step (section 4) and/or post-calling.


back to top


3. Call autosomal and allosomal contig ploidy with DetermineGermlineContigPloidy

DetermineGermlineContigPloidy calls contig-level ploidies for both autosomal contigs, e.g. human chr20, and allosomal contigs, e.g. human chrX. The tool determines baseline contig ploidies using sample coverages and contig ploidy priors that give the prior probabilities of each ploidy state for each contig. In this process, the tool generates global baseline coverage and noise data that GermlineCNVCaller will use in section 4.

The tool determines baseline contig ploidies using the total read count per contig. Researchers should consider the impact of this for their data. For example, for the tutorial WGS data, the contribution of the PAR regions to total coverage counts on chrX is small and the tool correctly calls allosomal ploidies. However, consider blacklisting PAR regions for data where the contribution is disproportionate, e.g. targeted panels.

DetermineGermlineContigPloidy in COHORT MODE

The cohort mode requires a --contig-ploidy-priors table and produces a ploidy model.

gatk DetermineGermlineContigPloidy \
-L chr20XY.cohort.gc.filtered.interval_list \
--interval-merging-rule OVERLAPPING_ONLY \
-I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \
-I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \
-I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \
-I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \
-I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \
-I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \
--contig-ploidy-priors chr20XY_contig_ploidy_priors.tsv \
--output . \
--output-prefix ploidy \
--verbosity DEBUG

This produces a ploidy-calls directory and a ploidy-model directory. The ploidy-calls directory contains a folder of data for each sample in the cohort including the contig ploidy calls. Each sample directory, e.g. ploidy-calls/SAMPLE_0, contains five files.

  1. contig_ploidy.tsv notes the ploidy and genotype quality (GQ) of the ploidy call for each contig.
  2. global_read_depth.tsv notes an average depth value and an average ploidy across all the intervals of the sample.
  3. mu_psi_s_log__.tsv captures the posterior mean for all of the modeled parameters.
  4. sample_name.txt contains the readgroup sample (RG SM) name.
  5. std_psi_s_log__.tsv captures the standard deviation for all of the modeled parameters.

The ploidy-model directory contains aggregated model data for the cohort. This is the model to provide to a case-mode DetermineGermlineContigPloidy analysis and to GermlineCNVCaller. The tutorial ploidy-model directory contains the eight files as follows.

  1. contig_ploidy_prior.tsv is a copy of the ploidy priors given to the tool.
  2. gcnvkernel_version.json notes the version of the kernel.
  3. interval_list.tsv recapitulates the intervals used, e.g. the filtered intervals.
  4. mu_mean_bias_j_lowerbound__.tsv
  5. mu_psi_j_log__.tsv
  6. ploidy_config.json
  7. std_mean_bias_j_lowerbound__.tsv
  8. std_psi_j_log__.tsv

The Theano model automatically generates mu_ and std_ files and may append transformations it performs to the file name, e.g. log or lowerbound as we see above. These are likely of interest only to advanced users.


DetermineGermlineContigPloidy in CASE MODE

The case mode calls contig ploidies for each sample against the ploidy model given by --model. The following command runs sample NA19017 against a 23-sample cohort model.

gatk DetermineGermlineContigPloidy \
--model cohort-23wgs-20190213-contig-ploidy-model \
-I cvg/NA19017.tsv \
-O . \
--output-prefix ploidy-case \
--verbosity DEBUG

This produces a ploidy-case-calls directory, which in turn contains a directory of sample data, SAMPLE_0. A list of the five resulting files is some paragraphs above.

Comments on select parameters

  • It is possible to analyze multiple samples simultaneously in a case mode command. Provide each sample with -I.
  • For the -L intervals, supply the most processed intervals list. For the tutorial, this is the filtered intervals. Note the case mode does not require explicit intervals because the ploidy model provides them.
  • Provide a --contig-ploidy-priors table containing the per-contig prior probabilities for integer ploidy state. Again, the case mode does not require an explicit priors file as the ploidy model provides them. Tool documentation describes this resource in detail. The tutorial uses the following contig ploidy priors.
  • Optionally provide intervals to exclude from analysis with --exclude-intervals or -XL, e.g. pseudoautosomal (PAR) regions, which can skew results for certain data.

    [Contig ploidy priors table used by the tutorial]

The results for NA19017, from either the cohort mode or the case mode, show ploidy 2 for chr20 and chrX and ploidy 0 for chrY. The PLOIDY_GQ quality metrics differ slightly for the modes. The entirety of NA19017's contig_ploidy.tsv is shown.

[NA19017 contig_ploidy.tsv showing per-contig ploidy calls and PLOIDY_GQ values]

Checking the ploidy calls for each of the 24 samples against metadata confirms expectations. The following table summarizes results for the 24 samples. The data was collated from DetermineGermlineContigPloidy results using a bash script.

[Summary table of contig ploidy calls for the 24 tutorial samples]

It should be noted that the tutorial's default-parameter run gives XY samples CN1 for the majority of chrX, including PAR regions, where coverage is actually on par with the CN2 of XX samples. See Article#11687 for further discussion.


back to top


4. Call copy number variants with GermlineCNVCaller

GermlineCNVCaller learns a denoising model per scattered shard while consistently calling CNVs across the shards. The tool models systematic biases and CNVs simultaneously, which allows for sensitive detection of both rare and common CNVs. For a description of innovations, see Blog#23439.

As the tool documentation states under Important Remarks (v4.1.0.0), the tool should see data from a large enough genomic region so as to be exposed to diverse genomic features. The current recommendation is to provide at least ~10–50Mbp genomic coverage per scatter. This applies to exomes or WGS. This allows reliable inference of bias factors including GC bias. The limitation of analyzing larger regions is available memory. As an analysis covers more data, memory requirements increase.

For expediency, the tutorial commands below analyze small data, specifically the 1400 bins in twelveregions.cohort.gc.filtered.interval_list and use default parameters. The tutorial splits the 1400 bins into two shards with 700 bins each to illustrate scattering. This results in ~0.7Mbp genomic coverage per shard. See section 4.2 for how to split interval lists by a given number of intervals. Default inference parameters are conservatively set for efficient run times.

The tutorial coverage data are sufficient to analyze the ~15Mb in chr20sub.cohort.gc.filtered.interval_list as well as the entirety of chr20, chrX and chrY using the ~230Mb of chr20XY.cohort.gc.filtered.interval_list. The former, at 5K bins per shard, gives three shards. When running the default parameters in a GATK v4.1.0.0 Docker locally on a MacBook Pro, each cohort-mode shard analysis takes ~20 minutes. The latter gives 46 shards at 5K bins per shard. When running the default parameters of the v4.1.0.0 WDL cohort-mode workflow on the cloud, the majority of the shard analyses complete in half an hour.

image GermlineCNVCaller in COHORT MODE

Call gCNVs on the 24-sample cohort in two scatters. Notice the different -L intervals and --output-prefix basenames.

gatk GermlineCNVCaller \
--run-mode COHORT \
-L scatter-sm/twelve_1of2.interval_list \
-I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \
-I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \
-I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \
-I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \
-I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \
-I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \
--contig-ploidy-calls ploidy-calls \
--annotated-intervals twelveregions.annotated.tsv \
--interval-merging-rule OVERLAPPING_ONLY \
--output cohort24-twelve \
--output-prefix cohort24-twelve_1of2 \
--verbosity DEBUG
gatk GermlineCNVCaller \
--run-mode COHORT \
-L scatter-sm/twelve_2of2.interval_list \
-I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv \
-I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv \
-I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv \
-I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv \
-I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv \
-I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv \
--contig-ploidy-calls ploidy-calls \
--annotated-intervals twelveregions.annotated.tsv \
--interval-merging-rule OVERLAPPING_ONLY \
--output cohort24-twelve \
--output-prefix cohort24-twelve_2of2 \
--verbosity DEBUG

This produces per-interval gCNV calls for each of the cohort samples and a gCNV model for the cohort. Each command produces three directories within cohort24-twelve: a cohort24-twelve_1of2-calls folder of per sample gCNV call results, a cohort24-twelve_1of2-model folder of cohort model data and a cohort24-twelve_1of2-tracking folder of data that tracks model fitting. The table below lists the cohort mode data files alongside case mode files.


image GermlineCNVCaller in CASE MODE

Call gCNVs on a sample against a cohort model. The case analysis must use the same scatter approach as the model generation. So, as above, we run two shard analyses. Here, --model and --output-prefix differ between the two scatter commands.

gatk GermlineCNVCaller \
--run-mode CASE \
-I cvg/NA19017.tsv \
--contig-ploidy-calls ploidy-case-calls \
--model cohort23-twelve/cohort23-twelve_1of2-model \
--output case-twelve-vs-cohort23 \
--output-prefix case-twelve-vs-cohort23_1of2 \
--verbosity DEBUG
gatk GermlineCNVCaller \
--run-mode CASE \
-I cvg/NA19017.tsv \
--contig-ploidy-calls ploidy-case-calls \
--model cohort23-twelve/cohort23-twelve_2of2-model \
--output case-twelve-vs-cohort23 \
--output-prefix case-twelve-vs-cohort23_2of2 \
--verbosity DEBUG

This produces both calls and tracking folders with, e.g. the case-twelve-vs-cohort23_1of2 basename. The case-twelve-vs-cohort23_1of2-calls folder contains case sample gCNV call results and the case-twelve-vs-cohort23_1of2-tracking folder contains model fitting results. The case mode results files are listed in the table below alongside cohort mode data files.

image

Comments on select parameters

  • The -O output directory must already exist before running the command. Future releases (v4.1.1.0) will create the directory.
  • The default --max-copy-number caps copy number at 5. This means the tool reports any events with more copies as CN5.
  • For the cohort mode, optionally provide --annotated-intervals to include the annotations as covariates. These must contain all of the -L intervals; the -L intervals must be an exact match to or a subset of the annotated intervals.
  • For the case mode, the tool accepts only a single --model directory at a time, so the case must be analyzed with the same number of scatters as the cohort model run. The case mode takes fewer parameters than the cohort mode because the --model directory provides the seemingly missing requirements, i.e. the scatter intervals and the annotated intervals.
  • For both modes, provide the --contig-ploidy-calls results from DetermineGermlineContigPloidy (Section 3). This not only informs ploidy but also establishes baseline coverage and noise levels for each sample. Later, in section 5, GermlineCNVCaller's shard analyses refer back to these global observations.
  • --verbosity DEBUG allows tracking the Python gcnvkernel model fitting in the stdout, e.g. with information on denoising epochs and whether the model converged. The default INFO level verbosity is the next most verbose and emits only GATK Engine level messages.

At this point, the workflow has done its most heavy lifting to produce data towards copy number calling. In Section 5, we consolidate the data from the scattered GermlineCNVCaller runs, perform segmentation and call copy number states.

One artificial construct of the tutorial is the use of the full three-contig ploidy calls data even when modeling copy number states for much smaller subset regions. This effectively stabilizes the small analysis.


☞ 4.1 How do I increase the sensitivity of detection?

The tutorial uses default GermlineCNVCaller modeling parameters. However, researchers should expect to tune parameters for data, e.g. from different sequencing technologies. For tuning, first consider the coherence length parameters, p-alt, p-active and the psi-scale parameters. These hyperparameters are just a few of the plethora of adjustable parameters GermlineCNVCaller offers. Refer to the GermlineCNVCaller tool documentation for detailed explanations, and ask on the GATK Forum for further guidance.

The tutorial illustrates one set of parameter changes for WGS data provided by @markw of the GATK SV (Structural Variants) team that dramatically increase the sensitivity of calling on the tutorial data. Article#11687 and Notebook#11686 compare the results of using default vs. the increased-sensitivity parameters. Given the absence of off-the-shelf filtering solutions for CNV calls, when tuning parameters to increase sensitivity, researchers should expect to perform additional due diligence, especially for analyses requiring high precision calls.

WGS parameters that increase the sensitivity of calling from @markw

--class-coherence-length 1000.0 \
--cnv-coherence-length 1000.0 \
--enable-bias-factors false \
--interval-psi-scale 1.0E-6 \
--log-mean-bias-standard-deviation 0.01 \
--sample-psi-scale 1.0E-6 \

Comments on select sensitivity parameters

  • Decreasing --class-coherence-length from its default of 10,000bp to 1000bp decreases the expected length of contiguous segments. Factor in bin size when tuning.
  • Decreasing --cnv-coherence-length from its default of 10,000bp to 1000bp decreases the expected length of CNV events. Factor in bin size when tuning.
  • Turning off --enable-bias-factors from the default true state to false turns off active discovery of learnable bias factors. This should always be on for targeted exome data and in general can be turned off for WGS data.
  • Decreasing --interval-psi-scale from its default of 0.001 to 1.0E-6 reduces the scale the tool considers normal in per-interval noise.
  • Decreasing --log-mean-bias-standard-deviation from its default of 0.1 to 0.01 reduces what is considered normal noise in bias factors.
  • Decreasing --sample-psi-scale from its default of 0.0001 to 1.0E-6 reduces the scale that is considered normal in sample-to-sample variance.

Additional parameters to consider include --depth-correction-tau, --p-active and --p-alt.

  • --depth-correction-tau has a default of 10000.0 (10K) and defines the precision of read-depth concordance with the global depth value.
  • --p-active has a default of 1e-2 (0.01) and defines the prior probability of common CNV states.
  • --p-alt has a default of 1e-6 (0.000001) and defines the expected probability of CNV events (in rare CNV states).

☞ 4.2 How do I make interval lists for scattering?

This step applies to the cohort mode. It is unnecessary for case mode analyses as the model implies the scatter intervals.

The v4.1.0.0 cnv_germline_cohort_workflow.wdl pipeline workflow scatters the GermlineCNVCaller step. Each scattered analysis is on genomic intervals subset from intervals produced either from PreprocessIntervals (section 1) or from FilterIntervals (section 2). The workflow uses Picard IntervalListTools to break up the intervals list into roughly balanced lists.

gatk IntervalListTools \
--INPUT chr20sub.cohort.gc.filtered.interval_list \
--SUBDIVISION_MODE INTERVAL_COUNT \
--SCATTER_CONTENT 5000 \
--OUTPUT scatter        

This produces three intervals lists with ~5K intervals each. For the tutorial's 1Kbp bins, this gives ~5Mbp genomic coverage per scatter. Each list is identically named scattered.interval_list within its own folder within the scatter directory. IntervalListTools systematically names the intermediate folders, e.g. temp_0001_of_3, temp_0002_of_3 and temp_0003_of_3.

Comments on select parameters

  • The --SUBDIVISION_MODE INTERVAL_COUNT mode scatters intervals into similarly sized lists according to the count of intervals regardless of the base count. The tool intelligently breaks up the chr20sub.cohort.gc.filtered.interval_list's ~15K intervals into lists of 5031, 5031 and 5033 intervals. This is preferable to having a fourth interval list with just 95 intervals.
  • The tool has another useful feature in the context of the gCNV workflow. To subset the -I binned intervals, provide the regions of interest with -SI (--SECOND_INPUT) and use the --ACTION OVERLAPS mode to create a new intervals list of the overlapping bins. Adding --SUBDIVISION_MODE INTERVAL_COUNT --SCATTER_CONTENT 5000 produces scatter intervals concurrently with the subsetting, as sketched below.
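
For example, the following sketch subsets the tutorial's binned intervals to a hypothetical regions of interest list and scatters the result in the same pass; the regions_of_interest.interval_list and scatter-subset names are placeholders for illustration.

gatk IntervalListTools \
--INPUT chr20sub.cohort.gc.filtered.interval_list \
--SECOND_INPUT regions_of_interest.interval_list \
--ACTION OVERLAPS \
--SUBDIVISION_MODE INTERVAL_COUNT \
--SCATTER_CONTENT 5000 \
--OUTPUT scatter-subset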


back to top


5. Call copy number segments and consolidate sample results with PostprocessGermlineCNVCalls

PostprocessGermlineCNVCalls consolidates the scattered GermlineCNVCaller results, performs segmentation and calls copy number states. The tool generates per-interval and per-segment sample calls in VCF format and runs on a single sample at a time.

image PostprocessGermlineCNVCalls COHORT MODE

Process a single sample from the 24-sample cohort using the sample index. For NA19017, the sample index is 19.

gatk PostprocessGermlineCNVCalls \
--model-shard-path cohort24-twelve/cohort24-twelve_1of2-model \
--model-shard-path cohort24-twelve/cohort24-twelve_2of2-model \
--calls-shard-path cohort24-twelve/cohort24-twelve_1of2-calls \
--calls-shard-path cohort24-twelve/cohort24-twelve_2of2-calls \
--allosomal-contig chrX --allosomal-contig chrY \
--contig-ploidy-calls ploidy-calls \
--sample-index 19 \
--output-genotyped-intervals genotyped-intervals-cohort24-twelve-NA19017.vcf.gz \
--output-genotyped-segments genotyped-segments-cohort24-twelve-NA19017.vcf.gz \
--sequence-dictionary ref/Homo_sapiens_assembly38.dict

image PostprocessGermlineCNVCalls CASE MODE

NA19017 is the singular sample with index 0.

gatk PostprocessGermlineCNVCalls \
--model-shard-path cohort23-twelve/cohort23-twelve_1of2-model \
--model-shard-path cohort23-twelve/cohort23-twelve_2of2-model \
--calls-shard-path case-twelve-vs-cohort23/case-twelve-vs-cohort23_1of2-calls \
--calls-shard-path case-twelve-vs-cohort23/case-twelve-vs-cohort23_2of2-calls \
--allosomal-contig chrX --allosomal-contig chrY \
--contig-ploidy-calls ploidy-case-calls \
--sample-index 0 \
--output-genotyped-intervals genotyped-intervals-case-twelve-vs-cohort23.vcf.gz \
--output-genotyped-segments genotyped-segments-case-twelve-vs-cohort23.vcf.gz \
--sequence-dictionary ref/Homo_sapiens_assembly38.dict

Each command generates two VCFs with indices. The genotyped-intervals VCF contains variant records for each analysis bin and therefore data covers only the interval regions. For the tutorial's small data, this gives 1400 records. The genotyped-segments VCF contains records for each contiguous copy number state segment. For the tutorial's small data, this is 30 and 31 records for cohort and case mode analyses, respectively.

The two modes--cohort and case--give highly concordant but slightly different results for sample NA19017. The factor that explains the difference is the contribution of the sample itself to the model.

Comments on select parameters

  • Specify a --model-shard-path directory for each scatter of the cohort model.
  • Specify a --calls-shard-path directory for each scatter of the cohort or case analysis.
  • Specify the --contig-ploidy-calls directory for the cohort or case analysis.
  • By default --autosomal-ref-copy-number is set to 2.
  • Define allosomal contigs with the --allosomal-contig parameter.
  • The tool requires specifying the --output-genotyped-intervals VCF.
  • Optionally generate segmented VCF results with --output-genotyped-segments. The tool segments the regions between the starting bin and the ending bin on a contig. The implication is that even if there is a gap between two analysis bins on the same contig, if the copy number state is equal for the bins, then the bins and the entire region between them can end up as part of the same segment. The extent of this smoothing over gaps depends on the --cnv-coherence-length parameter.
  • The --sample-index refers to the index number given to a sample by GermlineCNVCaller. In a case mode analysis of a single sample, the index will always be zero.
  • The --sequence-dictionary is optional. Without it, the tool generates unindexed VCF results. Alternatively, to produce the VCF indices, provide the -R reference FASTA or use IndexFeatureFile afterward, as sketched below. The v4.1.0.0 cnv_germline_cohort_workflow.wdl pipeline workflow omits index generation.
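
The following is a minimal sketch of after-the-fact index generation for the case mode segments VCF. Note that in v4.1.0.0 IndexFeatureFile takes the file with -F (--feature-file); later GATK releases rename this argument to -I (--input).

gatk IndexFeatureFile \
-F genotyped-segments-case-twelve-vs-cohort23.vcf.gz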

Here is the result. The header section, with lines starting with ##, gives information on the analysis and defines the annotations. Notice the singular END annotation in the INFO column that denotes the end position of the event. This use of the END notation is reminiscent of the reference blocks of the GVCF format.

image

In the body of the data, as with any VCF, the first two columns give the contig and genomic start position for the variant. The third ID column concatenates together CNV_contig_start_stop, e.g. CNV_chr20_1606001_1609000. The REF column is always N and the ALT column gives the two types of CNV events of interest in symbolic allele notation--<DEL> for deletion and <DUP> for duplication or amplification. Again, the INFO field gives the END position of the variant. The FORMAT field lists sample-level annotations GT:CN:NP:QA:QS:QSE:QSS. GT of 0 indicates normal ploidy, 1 indicates deletion and 2 denotes duplication. The CN annotation indicates the copy number state. For the tutorial small data, CN ranges from 0 to 3.
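
To glance at these records on the command line, something along the following lines works; the file name assumes the case mode segments VCF from above.

# print the column header and the first few segment records, skipping the ## metadata lines
zcat genotyped-segments-case-twelve-vs-cohort23.vcf.gz | grep -v '^##' | head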

Developer @asmirnov notes filtering on QS can increase specificity. For a discussion of results, see Article#11687.
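
One possible sketch of such filtering uses VariantFiltration to flag, in the FT sample field, genotypes whose QS falls below a threshold rather than removing records outright. The QS < 50 cutoff here is an arbitrary illustration, not a recommendation.

gatk VariantFiltration \
-V genotyped-segments-case-twelve-vs-cohort23.vcf.gz \
--genotype-filter-expression "QS < 50" \
--genotype-filter-name LowQS \
-O genotyped-segments-case-twelve-vs-cohort23.QSflagged.vcf.gz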


back to top

How can I debug and develop algorithms in GATK, such as HaplotypeCaller, using IntelliJ IDEA?

Are there any papers or documents? And where do I need to download the test data, such as NA12878.HiSeq.b37.chr20.10_11mb.bam?

Query about wgs_calling_regions.hg38.interval_list from GRCh38 gatk bundle


Hi GATK team,

We are using wgs_calling_regions.hg38.interval_list from the GATK bundle to call variants. Could you please confirm the details about the removed/masked regions from the reference?
In particular, we would like to know whether centromere regions are included. Thanks.

Regards
Lavanya

GenomeSTRiP no genotype VCF

Dear all,

I am calling SVs for WGS using the GenomeSTRiP tool. The calling finished successfully for some chromosomes, but for other chromosomes only the discovery VCF was generated. I ran the script again with only the genotyping command; genotyping finished within 2 seconds without error and did not generate a genotype VCF file. I attached the command below:

ref=reference
runDir=SV3
bam=bam_chr3.list
sites=chr3.discovery.vcf
genotypes=chr3.genotypes.vcf

java -cp ${classpath} ${mx} \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/SVGenotyper.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-jobRunner ParallelShell -maxConcurrentRun 8 \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
--disableJobReport \
-cp ${classpath} \
-configFile ${SV_DIR}/conf/genstrip_parameters.txt \
-tempDir ${SV_TMPDIR} \
-R ${ref}/Homo_sapiens_assembly38.fasta \
-genomeMaskFile ${ref}/Homo_sapiens_assembly38.gcmask.fasta \
-genderMapFile gender.map \
-runDirectory ${runDir} \
-md ${runDir}/metadata \
-disableGATKTraversal \
-jobLogDir ${runDir}/logs \
-I ${bam} \
-vcf ${sites} \
-P chimerism.use.correction:false \
-O ${genotypes} \
-run \
|| exit 1

Are there any suggestions for this weird result?
Thank you so much for your help!

Best wishes,

CZS

Reference Genome Components


image This document defines several components of a reference genome. We use the human GRCh38/hg38 assembly to illustrate.

GRCh38/hg38 is the assembly of the human genome released in December 2013 that uses alternate or ALT contigs to represent common complex variation, including HLA loci. Alternate contigs are also present in past assemblies but not to the extent we see with GRCh38. Many of the improvements in GRCh38 are the result of other genome sequencing and analysis projects, including the 1000 Genomes Project.

The ideogram is from the Genome Reference Consortium website and showcases GRCh38.p7. The zoomed region illustrates how regions in blue are full of Ns.

Analysis set reference genomes have special features to accommodate sequence read alignment. This type of genome reference can differ from the reference you use to browse the genome.

  • For example, the GRCh38 analysis set hard-masks, i.e. replaces with Ns, a proportion of homologous centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR (pseudoautosomal) regions on chromosome Y. Confirm the set you are using by viewing a PAR region of the Y chromosome on IGV as shown in the figure below. The chrY locations of PAR1 and PAR2 on GRCh38 are chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415, respectively.
    image
    The sequence in the reference set is a mix of uppercase and lowercase letters. The lowercase letters represent soft-masked sequence corresponding to repeats from RepeatMasker and Tandem Repeats Finder.

  • The GRCh38 analysis sets also include a contig to siphon off reads corresponding to the Epstein-Barr virus sequence as well as decoy contigs. The EBV contig can help correct for artifacts stemming from immortalization of human blood lymphocytes with EBV transformation, as well as capture endogenous EBV sequence as EBV naturally infects B cells in ~90% of the world population. Heng Li provides the decoy contigs.


Nomenclature: words to describe components of reference genomes

  • A contig is a contiguous sequence without gaps.

  • Alternate contigs, alternate scaffolds or alternate loci allow for representation of diverging haplotypes. These regions are too complex for a single representation. Identify ALT contigs by their _alt suffix.

    The GRCh38 ALT contigs total 109Mb in length and span 60Mb of the primary assembly. Alternate contig sequences range from novel or highly diverged to nearly identical to the corresponding primary assembly sequence. Sequences that are highly diverged from the primary assembly only contribute a few million bases. Most subsequences of ALT contigs are fairly similar to the primary assembly. This means that if we align sequence reads to GRCh38+ALT blindly, then we obtain many multi-mapping reads with zero mapping quality. Since many GATK tools have a ZeroMappingQuality filter, we will then miss variants corresponding to such loci.

  • Primary assembly refers to the collection of (i) assembled chromosomes, (ii) unlocalized and (iii) unplaced sequences. It represents a non-redundant haploid genome.

    (i) Assembled chromosomes for hg38 are chromosomes 1–22 (chr1–chr22), X (chrX), Y (chrY) and Mitochondrial (chrM).
    (ii) Unlocalized sequences are assigned to a specific chromosome but their order or orientation is unknown. Identify them by the _random suffix.
    (iii) Unplaced sequences are on an unknown chromosome. Identify them by the chrUn_ prefix.

  • PAR stands for pseudoautosomal region. PAR regions in mammalian X and Y chromosomes allow for recombination between the sex chromosomes. Because the PAR sequences together create a diploid or pseudo-autosomal sequence region, the X and Y chromosome sequences are intentionally identical in the genome assembly. Analysis set genomes further hard-mask two of the Y chromosome PAR regions so as to allow mapping of reads solely to the X chromosome PAR regions.

  • Different assemblies shift coordinates for loci and are released infrequently. Hg19 and hg38 represent two different major assemblies. Comparing data from different assemblies requires lift-over tools that adjust genomic coordinates to match loci, at times imperfectly. In the special case of hg19 and GRCh37, the primary assembly coordinates are identical for loci but patch updates differ. Also, the naming conventions of the references differ, e.g. the use of chr1 versus 1 to indicate chromosome 1, such that these also require lift-over to compare data. GRCh38/hg38 unifies the assemblies and the naming conventions.

  • Patches are regional fixes that are released periodically for a given assembly. GRCh38.p7 indicates the seventh patched minor release of GRCh38. This NCBI page explains in more detail. Patches add information to the assembly without disrupting the chromosome coordinates. Again, they improve representation without affecting chromosome coordinate stability. The two types of patches, fix and novel, represent different types of sequence.

    (i) Fix patches represent sequences that will replace primary assembly sequence in the next major assembly release. When interpreting data, fix patches should take precedence over the chromosomes.
    (ii) Novel patches represent alternate loci. When interpreting data, treat novel patches as population sequence variants.


The GATK perspective on reference genomes

Within GATK documentation, Tutorial#8017 outlines how to map reads in an alternate contig aware manner and discusses some of the implications of mapping reads to reference genomes with alternate contigs.

GATK tools allow for use of a genomic intervals list that tells tools which regions of the genome the tools should act on. Judicious use of an intervals list, e.g. one that excludes regions of Ns and low complexity repeat regions in the genome, makes processes more efficient. This brings us to the next point.

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

  • For example, HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 from prior assemblies. This has implications for using the -L option of GATK as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
  • When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
  • However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

     -L HLA-A*01:01:01:01:1+
    

Viewing CRAM alignments on genome browsers

Because CRAM compression depends on the alignment reference genome, tools that use CRAM files ensure correct decompression by comparing reference contig MD5 hash values. These are sensitive to any changes in the sequence, e.g. masking with Ns. This can have implications for viewing alignments in genome browsers when there is a mismatch between the reference that is loaded in the browser and the reference that was used in alignment. If you are using a version of tools for which this is an issue, be sure to load the original analysis set reference genome to view the CRAM alignments.

Should I switch to a newer reference?

Yes you should. In addition to adding many alternate contigs, GRCh38 corrects thousands of SNPs and indels in the GRCh37 assembly that are absent in the population and are likely sequencing artifacts. It also includes synthetic centromeric sequence and updates non-nuclear genomic sequence.

The ability to recognize alternate haplotypes for loci is a drastic improvement that GRCh38 makes possible. Going forward, expanding genomics data will help identify variants for alternate haplotypes, improve existing and add additional alternate haplotypes and give us a better accounting of alternate haplotypes within populations. We are already seeing improvements and additions in the patch releases to reference genomes, e.g. the seven minor releases of GRCh38 available at the time of this writing.

Note that variants produced by alternate haplotypes when they are represented on the primary assembly may or may not be present in data resources, e.g. dbSNP. This could have varying degrees of impact, including negligible, for any process that relies on known variant sites. Consider the impact this discrepant coverage in data resources may have for your research aims and weigh this against the impact of missing variants because their sequence context is unaccounted for in previous assemblies.


External resources

  1. New 11/16/2016 For a brief history and discussion on challenges in using GRCh38, see the 2015 Genome Biology article Extending reference assembly models by Church et al. (DOI: 10.1186/s13059-015-0587-3).
  2. For press releases highlighting improvements in GRCh38 from December 2013, see http://www.ncbi.nlm.nih.gov/news/12-23-2013-grch38-released/ and http://genomeref.blogspot.co.uk/2013/12/announcing-grch38.html. The latter post summarizes major improvements, including the correction of thousands of SNPs and indels in GRCh37 not seen in the population and the inclusion of synthetic centromeric sequence.
  3. Recent releases of BWA, e.g. v0.7.15+, handle alt contig mapping and HLA typing. See the BWA repository for information. See these pages for download and installation instructions.
  4. The Genome Reference Consortium (GRC) provides human, mouse, zebrafish and chicken sequences, and this particular webpage gives an overview of GRCh38. Namely, an interactive chromosome ideogram marks regions with corresponding alternate loci, regions with fix patches and regions containing novel patches. For additional assembly terminology, see http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/definitions.shtml.
  5. The UCSC Genome Browser allows browsing and download of genomes, including analysis sets, from many different species. For more details on the difference between GRCh38 reference and analysis sets, see ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/README.txt and ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/README.txt, respectively. In addition, the site provides annotation files, e.g. here is the annotation database for GRCh38. Within this particular page, the file named gap.txt.gz catalogues the gapped regions of the assembly full of Ns. For our illustration above, the corresponding region in this file shows:

        585    chr14    0    10000    1    N    10000    telomere    no
        1    chr14    10000    16000000    2    N    15990000    short_arm    no
        707    chr14    16022537    16022637    4    N    100    contig    no
    
  6. The Integrative Genomics Viewer is a desktop application for viewing genomics data including alignments. The tool accesses reference genomes you provide via file or URL or that it hosts over a server. The numerous hosted reference genomes include GRCh38. See this page for information on hosted reference genomes. For the most up-to-date list of hosted genomes, open IGV and go to Genomes>Load Genome From Server. A menu lists genomes you can make available in the main genome dropdown menu.


ECNT Value in Mutect2

Hello,

I am using Mutect2 and FilterMutectCalls to call variants in mtDNA. According to the vcf file, the value recorded for ECNT is "Number of events in this haplotype". I am assuming that this is the number of times that a particular mutation was found in all the reads under that base pair. I am concerned because in my data, I am seeing clusters of mutations with the same ECNT value. For example:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
gi|9626243|ref|NC_001416.1| 115 . C T VL PASS DP=445;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-7.025e+01;TLOD=6.86 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 157 . T A VL PASS DP=427;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.113e+02;TLOD=8.55 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 470 . C T VL PASS DP=703;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.859e+02;TLOD=15.58 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 500 . A T VL PASS DP=691;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.927e+02;TLOD=7.59 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 601 . CT C VL PASS DP=671;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.766e+02;RPA=3,2;RU=T;STR;TLOD=7.68 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 635 . C T VL PASS DP=665;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.602e+02;TLOD=34.37 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 645 . C T VL PASS DP=668;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.818e+02;TLOD=7.87 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 704 . T TAAAAAA VL PASS DP=660;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.845e+02;RPA=5,11;RU=A;STR;TLOD=5.89 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 736 . A G VL PASS DP=666;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.723e+02;TLOD=20.03 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 754 . A G VL PASS DP=654;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.606e+02;TLOD=29.79 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 788 . A T VL PASS DP=639;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.738e+02;TLOD=6.65 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 898 . C T VL PASS DP=671;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.821e+02;TLOD=5.38 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 958 . A G VL PASS DP=671;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.847e+02;TLOD=6.69

Is there an explanation for why the data would look like this? It seems odd that the mutations would be occurring exactly the same number of times as the variants surrounding it.

Thank you,

kzwon

CombineGVCFs performance


I've got 300 GVCFs as a result of a Queue pipeline that I want to combine. When I run CombineGVCFs (GATK v3.1-1) this however seems fairly slow:

INFO  15:24:22,100 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  15:57:52,778 ProgressMeter -      1:11456201        1.10e+07   33.5 m        3.0 m      0.4%         6.4 d     6.3 d 
INFO  15:58:52,780 ProgressMeter -      1:11805001        1.10e+07   34.5 m        3.1 m      0.4%         6.4 d     6.3 d 
INFO  15:59:52,781 ProgressMeter -      1:12140201        1.20e+07   35.5 m        3.0 m      0.4%         6.4 d     6.3 d 

Is there a way of improving the performance of this merge? 6 days seems like a lot, but of course not unfeasible. Likewise, what kind of performance could I expect in the GenotypeGVCFs step?

Germline short variant discovery (SNPs + Indels)


Important: This document is currently being updated


Purpose

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.


Reference Implementations

Pipeline | Summary | Notes | Github | Terra
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | pending
Prod* germline short variant joint genotyping | GVCFs to cohort VCF | optimized for GCP | yes | pending
$5 Genome Analysis Pipeline | uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38
Generic germline short variant per-sample calling | analysis-ready BAM to GVCF | universal | yes | hg38
Generic germline short variant joint genotyping | GVCFs to cohort VCF | universal | yes | hg38 & b37
Intel germline short variant per-sample calling | uBAM to GVCF | Intel optimized for local architectures | yes | NA

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.


Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.


Main steps for Germline Cohort Data

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.
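
As a rough per-sample sketch, with placeholder reference and BAM names, the GVCF-mode call looks like this:

gatk HaplotypeCaller \
-R Homo_sapiens_assembly38.fasta \
-I sample1.bam \
-O sample1.g.vcf.gz \
-ERC GVCF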

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly.

Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.
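
A minimal consolidation sketch, assuming three placeholder GVCFs and a single chromosome 20 interval, follows; note that GenomicsDBImport requires explicit -L intervals.

gatk GenomicsDBImport \
-V sample1.g.vcf.gz \
-V sample2.g.vcf.gz \
-V sample3.g.vcf.gz \
--genomicsdb-workspace-path cohort_gdb \
-L chr20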

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.
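
A minimal joint genotyping sketch, pointing GenotypeGVCFs at the GenomicsDB workspace from the previous step via the gendb:// prefix (file names are placeholders):

gatk GenotypeGVCFs \
-R Homo_sapiens_assembly38.fasta \
-V gendb://cohort_gdb \
-O cohort.vcf.gz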

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyRecalibration

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.
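
A minimal SNP-mode sketch of the two-step process follows, assuming the hg38 resource bundle files and the GATK4 argument syntax; an analogous INDEL-mode pass is run in practice, and in GATK4 the ApplyRecalibration step is performed by the ApplyVQSR tool.

gatk VariantRecalibrator \
-R Homo_sapiens_assembly38.fasta \
-V cohort.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
--resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
-O cohort.snps.recal \
--tranches-file cohort.snps.tranches

gatk ApplyVQSR \
-R Homo_sapiens_assembly38.fasta \
-V cohort.vcf.gz \
--recal-file cohort.snps.recal \
--tranches-file cohort.snps.tranches \
--truth-sensitivity-filter-level 99.7 \
-mode SNP \
-O cohort.snps.recalibrated.vcf.gz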

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.


Main steps for Germline Single-Sample Data

Single sample variant discovery uses HaplotypeCaller in its default single-sample mode to call variants in an analysis-ready BAM file. The VCF that HaplotypeCaller emits errs on the side of sensitivity, so some filtering is often desired. To filter variants, first run the CNNScoreVariants tool. This tool annotates each variant with a score indicating the model's prediction of the quality of each variant. To apply filters based on those scores, run the FilterVariantTranches tool with SNP and INDEL sensitivity tranches appropriate for your task, as sketched below.
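
A minimal single-sample filtering sketch, assuming the hg38 reference and bundle resources; the tranche values shown are common starting points rather than fixed recommendations.

gatk CNNScoreVariants \
-R Homo_sapiens_assembly38.fasta \
-V sample1.vcf.gz \
-O sample1.cnn.vcf.gz

gatk FilterVariantTranches \
-V sample1.cnn.vcf.gz \
--resource hapmap_3.3.hg38.vcf.gz \
--resource Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--info-key CNN_1D \
--snp-tranche 99.95 \
--indel-tranche 99.4 \
-O sample1.cnn.filtered.vcf.gz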


Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which require access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described below, enabling incremental growth of cohorts as well as scaling to large cohort sizes.

Filter by TLOD only in Mutect2

Hello
I am using Mutect2 in GATK v4.1.4.0 to look for somatic variants in several tumor samples with matched germline. Because of the nature of the samples, I know I can trust variants with relatively low VAF, so I wanted to relax the filtering to allow tumor variants with an LOD similar to that of germline variants (~2.2). In previous versions of Mutect2 I would have simply set --tlod at 2.2 during the filtering step. The newest versions of Mutect2, however, do not have this option anymore and rely instead on a beta score (--f-score-beta) to tighten or relax the false discovery rate during the filtering step.
The issue with the beta score is that if I relax the filtering to a point where variants with TLOD >= 2.2 pass the filter, I end up with many variants with very low values in other fields (e.g. STRANDQ=1).
I could relax the filter to allow TLOD >= 2.2 and then filter again manually the resulting VCF to remove variants with low values in other fields, but this seems a rather convoluted way of approaching this issue and it feels like there should be a better way to do it.
In short, is there a way in the latest Mutect2 versions to allow for variants with low TLOD to pass the filtering step without relaxing all filters in the other fields?
Thank you!

Mutect2 stops running midway


Hi!
I am analyzing WES data in a few samples. It had worked just fine until some days ago. However, for the remaining samples, it is not working anymore. I am running Mutect2 in all of them with the following command (running in docker):

gatk Mutect2 -R /gatk/my_data/reference/hg19_torrent/hg19.fasta -I IonXpress_041_rawlib.bam --f1r2-tar-gz IonXpress_041_f1r2.tar.gz -O /gatk/my_data/output/IonXpress_041_unfiltered.vcf  --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' --independent-mates

I was able to process about half of the samples, with all the corresponding output files. But now, using the exact same command, Mutect2 executes until a certain point and then stops, throwing some messages (file enclosed).
All the samples have been sequenced on the same platform, and their BAM files have been generated by the same pipeline (performed by the IonTorrent platform), using the same reference build file (also used on the command line above, reference/hg19_torrent/hg19.fasta).

I have read in other threads that it might be due to lack of disk space. However, there are around 3TB available on the mounted volume being used.

Any help is greatly appreciated.

GenotypeGVCF stuck(?) after ProgressMeter - Starting traversal


I am running GenotypeGVCF on ~1700 samples. I use the intervals from https://console.cloud.google.com/storage/browser/_details/gatk-test-data/intervals/hg38.even.handcurated.20k.intervals?project=broad-dsde-outreach&organizationId=548622027621. The genomicDB is filled from haplotypecaller VCFs which have been produced with RNAseq data. I am using GATK version 4.1.0.0.

Runtime ranges from 1-5 hours. However, some of the intervals are now running for > 8 hours. When I look in the log output

00:27:41.070 WARN  GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
00:27:41.122 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/umcg-ndeklein/apps/software/GATK/4.1.0.0-foss-2015b-Python-3.6.3/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/n
Dec 06, 2019 12:27:42 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
WARNING: Failed to detect whether we are running on Google Compute Engine.
shaded.cloud_nio.com.google.api.client.http.HttpResponseException: 404 Not Found
00:27:43.167 INFO  GenotypeGVCFs - ------------------------------------------------------------
00:27:43.167 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.0.0
00:27:43.167 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
00:27:43.168 INFO  GenotypeGVCFs - Executing as umcg-ndeklein@pg-node109 on Linux v3.10.0-1062.4.1.el7.x86_64 amd64
00:27:43.168 INFO  GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_45-b14
00:27:43.168 INFO  GenotypeGVCFs - Start Date/Time: December 6, 2019 12:27:41 AM GMT-05:00
00:27:43.168 INFO  GenotypeGVCFs - ------------------------------------------------------------
00:27:43.168 INFO  GenotypeGVCFs - ------------------------------------------------------------
00:27:43.169 INFO  GenotypeGVCFs - HTSJDK Version: 2.18.2
00:27:43.169 INFO  GenotypeGVCFs - Picard Version: 2.18.25
00:27:43.169 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:27:43.169 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:27:43.169 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:27:43.169 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:27:43.169 INFO  GenotypeGVCFs - Deflater: IntelDeflater
00:27:43.169 INFO  GenotypeGVCFs - Inflater: IntelInflater
00:27:43.169 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
00:27:43.169 INFO  GenotypeGVCFs - Requester pays: disabled
00:27:43.169 INFO  GenotypeGVCFs - Initializing engine
00:27:43.620 INFO  FeatureManager - Using codec VCFCodec to read file file:///data/umcg-ndeklein/apps/data/ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20180418.biallelicSNP_only.with_chr.vcf.gz
WARNING: No valid combination operation found for INFO field DB - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DB - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
00:28:06.355 INFO  IntervalArgumentCollection - Processing 154451 bp from intervals
00:28:06.362 WARN  IndexUtils - Feature file "/data/umcg-ndeklein/apps/data/ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All_20180418.biallelicSNP_only.with_chr.vcf.gz" appears to contain no sequence di
00:28:06.582 INFO  GenotypeGVCFs - Done initializing engine
00:28:06.734 INFO  ProgressMeter - Starting traversal
00:28:06.735 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
WARNING: No valid combination operation found for INFO field DB - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records

It has not made any progress. The size of the interval is not that large (chr1:118432037-118569219), so I would have expected this to go faster. Is there something in the settings that I am missing? I run it with:

gatk --java-options "-Xmx20G" GenotypeGVCFs \
    --reference GRCh38.primary_assembly.genome.fa \
    --dbsnp All_20180418.biallelicSNP_only.with_chr.vcf.gz \
    --output chr1_118432037_118569219.gg.vcf.gz \
    --variant gendb:///genomicDB//chr1 \
    --stand-call-conf 10.0 \
    -L chr1:118432037-118569219 \
    -G StandardAnnotation

Thanks for your help,
Niek
