Recent Discussions — GATK-Forum

Mutect2 - Filtering mutation calls issue

Dear GATK Support Team,

We are currently using Mutect2 (gatk/4.0.4.0) to call somatic variants on circulating tumour DNA (ctDNA) and germline DNA (normal DNA), both derived from a single blood sample. The analysis is working well, but we are encountering a recurrent problem: definite tumour-specific mutations are detected at a low level in the normal DNA sample, which results in Mutect2 rejecting the mutation.

This only occurs in samples with a high tumour fraction of ctDNA (e.g. VAF ~60%), where the mutation is also found at low VAF in the gDNA control (e.g. VAF ~4%). These mutations are being rejected by FilterMutectCalls with a filter reason of "artifact_in_normal".

We know for certain that these are actually false negatives, as we can identify the mutation in the corresponding tumour material.


Here are the examples:
1225380276  .  T  A  .  artifact_in_normal  DP=1364;ECNT=1;NLOD=133.57;N_ART_LOD=0.047;POP_AF=1.000e-05;P_CONTAM=0.00;P_GERMLINE=-1.350e+02;TLOD=1468.54  GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB  0/0:468,2:0.037:197,1:271,1:34:197,173:60:38  0/1:366,469:0.560:207,272:159,197:32:166,167:60:33:0.556,0.545,0.562:8.357e-03,0.016,0.976

177574003  .  G  A  .  artifact_in_normal  DP=893;ECNT=1;NLOD=106.88;N_ART_LOD=1.94;POP_AF=1.000e-05;P_CONTAM=0.00;P_GERMLINE=-1.263e+02;TLOD=1154.12  GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB  0/0:391,3:0.030:171,0:220,3:33:202,163:60:59  0/1:131,339:0.716:85,224:46,115:33:168,174:60:39:0.707,0.717,0.721:0.025,0.012,0.963

As these are true positives, we want to ensure that they appear in our analysis. Is there any way we could still include these mutation calls by adapting the various parameters within the pipeline?
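For illustration, here is a minimal sketch of what adjusting the filtering threshold might look like. The --normal-artifact-lod argument name and the value used are assumptions on our part; please confirm the exact argument and a sensible threshold with gatk FilterMutectCalls --help for this release before relying on it.

# Hypothetical sketch: raise the normal-artifact log-odds threshold so that low-level
# tumour-derived reads in the normal no longer trigger the artifact_in_normal filter.
# Argument name and value are assumptions; verify against this GATK version's help text.
gatk FilterMutectCalls \
    -V somatic_unfiltered.vcf.gz \
    --normal-artifact-lod 2.0 \
    -O somatic_filtered.vcf.gz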

Any suggestions would be greatly appreciated; please contact me if you require further details.

Best regards,

Problem with BaseRecalibrator in v2.2-8-gec077cd


Hello dear GATK People,

BaseRecalibrator is failing for me in the new GATK version; my pipeline worked with 2.1-11. Below is my error message.
Any quick fix or should I stick to the old version?

Ania

ERROR stack trace

java.lang.IllegalArgumentException: fromIndex(402) > toIndex(101)
at java.util.Arrays.rangeCheck(Unknown Source)
at java.util.Arrays.fill(Unknown Source)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateKnownSites(BaseRecalibrator.java:280)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateSkipArray(BaseRecalibrator.java:259)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:239)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:287)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:252)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.2-8-gec077cd):

.....

ERROR MESSAGE: fromIndex(402) > toIndex(101)
ERROR ------------------------------------------------------------------------------------------

SplitNCigarReads: Hard clipping of overhangs


Hi,

Running SplitNCigarReads doesn't seem to hard-clip intronic overhangs, despite using aggressive clipping parameters.
I ran GATK (version 4.1.0.0) with the following command:

gatk SplitNCigarReads -R /path/to/genome/hg19.fa -I /path/to/$inputbam --max-mismatches-in-overhang 0 --max-bases-in-overhang 5 -O $outbam

I'm assuming the exon boundaries are inferred; if not, is there an option to provide transcript annotations to SplitNCigarReads?

Best,
A

GATK resource bundle not available

There is nothing in
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/

MuTect2 AD does not match AF


I am seeing issues with the allelic depths and allelic frequencies/fraction reported in some of our variants. For example:

Chr   Start      End        Ref  Alt  POS        REF  ALT  QUAL    NORMAL.AD  NORMAL.AF  TUMOR.AD  FREQ   TUMOR.AD.TOTAL  NORMAL.AD.TOTAL
chrX  152482917  152482917  T    C    152482917  T    C    986.47  79,0       0.02       436,443   0.503  879             79

Any idea what is going on? In this case, the Normal sample AF is reported as "0.02", or 2%, but it also reports that the Normal sample has 0 alt reads. How can you have a variant frequency when there are no reads? I have seen this occurring as well for the tumor samples; how could you have a variant at all when there are no reported alternate reads for the variant?

Is there a minimum coverage recommended for inputs to Mutect2?

I notice the -min-pruning default is set to 2. Would that imply that coverage of less than 2X would lead to some variants being missed? I know that lowering this option would slow things down, but for our project, where coverage could be as low as 12X, maybe that should be considered.
Opinions?
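For concreteness, here is a sketch of what lowering the pruning threshold might look like on a low-coverage sample. This is only an illustration: the file names are placeholders, and the value of 1 is an assumption that would need to be validated against your own data.

# Hypothetical sketch: keep assembly-graph paths supported by a single read.
# Expect longer runtimes and potentially more noise at low pruning values.
gatk Mutect2 \
    -R reference.fasta \
    -I tumor.bam \
    --min-pruning 1 \
    -O low_coverage_test.vcf.gz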

Compressing gVCFs in GATK4


Hi all. Our team has been gzipping gVCFs for medium-term storage, but has found it quite burdensome to have to periodically ungzip them for joint genotyping (the space requirements for our large numbers of files are challenging). We are seeking a solution where we can joint genotype gzipped or otherwise compressed gVCFs.

In face-to-face meetings with GATK team members, we were told that in GATK4 we can ask earlier steps in the pipeline (HaplotypeCaller, I'm guessing?) to gzip gVCFs, and then joint genotyping can run on those GATK-compressed gVCFs without us having to unzip them. We are all for that!

I have been looking for documentation about how exactly this works and failed to find it. Help?
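For context, here is my best guess at what this might look like, assuming the convention that GATK4 block-compresses output when the file name ends in .gz; please correct me if this is not the intended mechanism.

# The .g.vcf.gz suffix should make HaplotypeCaller write a block-compressed gVCF plus a .tbi index.
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -ERC GVCF \
    -O sample1.g.vcf.gz

# Joint genotyping (or GenomicsDBImport/CombineGVCFs) would then read the compressed gVCF directly.
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V sample1.g.vcf.gz \
    -O sample1.vcf.gz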

Best,
Jessica

ASEReadCounter java error


I'm just running ASEReadCounter on an RNA-seq BAM that has undergone duplicate marking, read-group addition, and SplitNCigarReads. These Java errors don't provide any help for the user:

14:24:03.421 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/groups/Spellmandata/heskett/packages/share/gatk4-4.0.11.0-0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:24:05.114 INFO  ASEReadCounter - ------------------------------------------------------------
14:24:05.114 INFO  ASEReadCounter - The Genome Analysis Toolkit (GATK) v4.0.11.0
14:24:05.114 INFO  ASEReadCounter - For support and documentation go to https://software.broadinstitute.org/gatk/
14:24:05.115 INFO  ASEReadCounter - Executing as heskett@exanode-3-1 on Linux v3.10.0-862.14.4.el7.x86_64 amd64
14:24:05.115 INFO  ASEReadCounter - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_192-b01
14:24:05.116 INFO  ASEReadCounter - Start Date/Time: March 8, 2019 2:24:03 PM PST
14:24:05.116 INFO  ASEReadCounter - ------------------------------------------------------------
14:24:05.116 INFO  ASEReadCounter - ------------------------------------------------------------
14:24:05.117 INFO  ASEReadCounter - HTSJDK Version: 2.16.1
14:24:05.117 INFO  ASEReadCounter - Picard Version: 2.18.13
14:24:05.117 INFO  ASEReadCounter - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:24:05.118 INFO  ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:24:05.118 INFO  ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:24:05.118 INFO  ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:24:05.118 INFO  ASEReadCounter - Deflater: IntelDeflater
14:24:05.118 INFO  ASEReadCounter - Inflater: IntelInflater
14:24:05.119 INFO  ASEReadCounter - GCS max retries/reopens: 20
14:24:05.119 INFO  ASEReadCounter - Requester pays: disabled
14:24:05.119 INFO  ASEReadCounter - Initializing engine
14:24:05.581 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/groups/Spellmandata/heskett/replication.rnaseq/scripts/../platinum.genome/NA12878.nochr.vcf
14:24:05.604 INFO  ASEReadCounter - Done initializing engine
contig  position    variantID   refAllele   altAllele   refCount    altCount    totalCount  lowMAPQDepth    lowBaseQDepth   rawDepth    otherBases  improperPairs
14:24:05.604 INFO  ProgressMeter - Starting traversal
14:24:05.604 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute
14:24:05.638 INFO  ASEReadCounter - Shutting down engine
[March 8, 2019 2:24:05 PM PST] org.broadinstitute.hellbender.tools.walkers.rnaseq.ASEReadCounter done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=1859649536
java.lang.ArrayIndexOutOfBoundsException: 0
    at org.broadinstitute.hellbender.engine.ReferenceContext.getBase(ReferenceContext.java:396)
    at org.broadinstitute.hellbender.tools.walkers.rnaseq.ASEReadCounter.apply(ASEReadCounter.java:183)
    at org.broadinstitute.hellbender.engine.LocusWalker.lambda$traverse$0(LocusWalker.java:176)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at org.broadinstitute.hellbender.engine.LocusWalker.traverse(LocusWalker.java:174)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /home/groups/Spellmandata/heskett/packages/share/gatk4-4.0.11.0-0/gatk-package-4.0.11.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx20G -jar /home/groups/Spellmandata/heskett/packages/share/gatk4-4.0.11.0-0/gatk-package-4.0.11.0-local.jar ASEReadCounter -I ../alignments/gm12878.rep2Aligned.out.rg.sorted.markdup.bam --variant ../platinum.genome/NA12878.nochr.vcf
srun: error: exanode-3-1: task 0: Exited with exit code 3

Errors about misencoded quality scores


The problem

You get an error like this:

SAM/BAM/CRAM file <filename> appears to be using the wrong encoding for quality scores

Why this happens

The standard format for quality score encodings is that Q0 == ASCII 33 according to the SAM specification. However, in some datasets (including older Illumina data), encoding starts at ASCII 64. This is a problem because the GATK assumes that it can use the quality scores as they are. If they are in fact encoded using a different scale, our tools will make an incorrect estimation of the quality of your data, and your analysis results will be off.

To prevent this from happening, the GATK engine performs a sanity check of the quality score encodings that will abort the program run if they are not standard, and output the error message shown above.

Solution

If this happens to you, you'll need to run again with the flag --fix_misencoded_quality_scores / -fixMisencodedQuals. What will happen is that the engine will simply subtract 31 from every quality score as it is read in, and proceed with the corrected values. Output files will include the correct scores where applicable.
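As a concrete illustration, the flag is an engine-level argument in GATK3-era tools, so it can in principle be appended to whichever tool you were running; a minimal sketch using PrintReads to write out a corrected BAM might look like this (tool choice and file names are illustrative):

java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fasta \
    -I miscoded.bam \
    --fix_misencoded_quality_scores \
    -o fixed.bam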

Note that the argument names in this article have not yet been updated for GATK4. Let us know if you run into problems and we'll fix them.

Related problems

In some cases the data contains a mix of encodings (which is likely to arise if you're passing in a lot of different files from different sources together), and the GATK can't automatically compensate for that. There is an argument you can use to override this check: -allowPotentiallyMisencodedQuals / --allow_potentially_misencoded_quality_scores; but you use it at your own risk. We strongly encourage you to check the encodings of your files rather than use this option.

(How to part I) Sensitively detect copy ratio alterations and allelic segments


Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.


This workflow is broken into two tutorials. You are currently on the first part.

The tutorial outlines steps in detecting copy ratio alterations, more familiarly copy number variants (CNVs), as well as allelic segments in a single sample using GATK4. The tutorial (i) denoises case sample alignment data against a panel of normals (PoN) to obtain copy ratios (Tutorial#11682) and (ii) models segments from the copy ratios and allelic counts (Tutorial#11683). The latter modeling incorporates data from a matched control. The same workflow steps apply to targeted exome and whole genome sequencing data.

Tutorial#11682 covers sections 1–4. Section 1 prepares a genomic intervals list with PreprocessIntervals and collects read coverage counts across the intervals. Section 2 creates a CNV PoN with CreateReadCountPanelOfNormals using read coverage counts. Section 3 denoises read coverage data against the PoN with DenoiseReadCounts using principal component analysis. Section 4 plots the results of standardizing and denoising copy ratios against the PoN.

Tutorial#11683 covers sections 5–8. Section 5 collects counts of reference versus alternate alleles with CollectAllelicCounts. Section 6 incorporates copy ratio and allelic counts data to group contiguous copy ratio and allelic counts segments with ModelSegments using kernel segmentation and Markov-chain Monte Carlo. The tool can also segment either copy ratio data or allelic counts data alone. Both types of data together refine segmentation results in that segments are based on the same copy ratio and the same minor allele fraction. Section 7 calls amplification, deletion and neutral events for the segmented copy ratios. Finally, Section 8 plots the results of segmentation and estimated allele-specific copy ratios.

Plotting is across genomic loci on the x-axis and copy or allelic ratios on the y-axis. The first part of the workflow focuses on removing systematic noise from coverage counts and adjusts the data points vertically. The second part focuses on segmentation and groups the data points horizontally. The extent of grouping, or smoothing, is adjustable with ModelSegments parameters. These adjustments do not change the copy ratios; the denoising in the first part of the workflow remains invariant in the second part of the workflow. See Figure 3 of this poster for a summary of tutorial results.

► The official GATK4 workflow is capable of running efficiently on WGS data and provides much greater resolution, up to ~50-fold more resolution for tested data. In these ways, GATK4 CNV improves upon its predecessor workflows in GATK4.alpha and GATK4.beta. Validations are still in progress and therefore the workflow itself is in BETA status, even if most tools, with the exception of ModelSegments, are production ready. The ModelSegments tool is still in BETA status and may change in small but significant ways going forward. Use at your own risk.

► The tutorial skips explicit GC-correction, an option in CNV analysis. For instructions on how to correct for GC bias, see AnnotateIntervals and DenoiseReadCounts tool documentation.

The GATK4 CNV workflow offers a multitude of levers, e.g. towards fine-tuning analyses and towards controls. Researchers are expected to tune workflow parameters on samples with similar copy number profiles as their case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters

Tools involved

  • GATK 4.0.1.1 or later releases.
  • The plotting tools require particular R components. Options are to install these or to use the broadinstitute/gatk Docker. In particular, to match versions, use the broadinstitute/gatk:4.0.1.1 version.

Download example data

Download tutorial_11682.tar.gz and tutorial_11683.tar.gz, either from the GoogleDrive or from the FTP site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data, see Tutorial#11136's third footnote and [1].

Alternatively, download the spacecade7/tutorial_11682_11683 docker image from DockerHub. The image contains GATK4.0.1.1 and the data necessary to run the tutorial commands, including the GRCh38 reference. Allocation of at least 4GB memory to Docker is recommended before launching the container.


1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts

Before collecting the coverage counts that form the basis of copy number variant detection, we define the resolution of the analysis with a genomic intervals list. The extent of genomic coverage and the size of genomic intervals in the intervals list factor into the resolution.

Preparing a genomic intervals list is necessary whether an analysis is on targeted exome data or whole genome data. In the case of exome data, we pad the target regions of the capture kit. In the case of whole genome data, we divide the reference genome into equally sized intervals or bins. In either case, we use PreprocessIntervals to prepare the intervals list.

For the tutorial exome data, we provide the capture kit target regions in 1-based intervals and set --bin-length to zero.

gatk PreprocessIntervals \
    -L targets_C.interval_list \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.preprocessed.interval_list

This produces a Picard-style intervals list, targets_C.preprocessed.interval_list, for use in the coverage collection step. Each interval is expanded by 250 bases on either side.

Comments on select parameters

  • The -L argument is optional. If provided, the tool expects the intervals list to be in Picard-style as described in Article#1319. The tool errors out for other formats. If this argument is omitted, then the tool assumes each contig is a single interval. See [2] for additional discussion.
  • Set the --bin-length argument to be appropriate for the type of data, e.g. default 1000 for whole genome or 0 for exomes. In binning, an interval is divided into equal-sized regions of the specified length. The tool does not bin regions that contain Ns. [3]
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The --reference or -R is required and implies the presence of a corresponding reference index and a reference dictionary in the same directory.
  • To change the padding interval, specify the new value with --padding. The default value of 250 bases was determined to work well empirically for TCGA targeted exome data. This argument is relevant for exome data, as binning without an intervals list does not allow for intervals expansion. [5]
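For comparison, here is a sketch of the whole-genome form of the command under the same conventions; the 1000-base bin size is the tool default mentioned above, and the output name is illustrative.

# Whole-genome sketch: no -L intervals list, so each contig is binned directly.
# Padding is not applicable when binning whole contigs.
gatk PreprocessIntervals \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 1000 \
    --padding 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/wgs.preprocessed.interval_list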

Take a look at the intervals before and after padding.

cnv_intervals

For consecutive intervals less than 250 bases apart, how does the tool pad the intervals?

Now collect the raw integer counts data. The tutorial uses GATK4.0.1.1's CollectFragmentCounts, which counts coverage of paired-end fragments. The tool counts a fragment once if the fragment's center overlaps the interval. In GATK4.0.3.0, CollectReadCounts replaces CollectFragmentCounts; CollectReadCounts counts reads that overlap the interval.

The tutorial has already collected coverage on the tumor case sample, on the normal matched-control and on each of the normal samples that constitute the PoN. To demonstrate coverage collection, the following command uses the small BAM from Tutorial#11136’s data bundle [6]. The tutorial does not use the resulting file in subsequent steps. The CollectReadCounts command swaps out the tool name but otherwise uses identical parameters.

gatk CollectFragmentCounts \
    -I tumor.bam \
    -L targets_C.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/tumor.counts.hdf5

In the tutorial data bundle, the equivalent full-length result is hcc1143_T_clean.counts.hdf5. The data tabulates CONTIG, START, END and raw COUNT values for each genomic interval.

Comments on select parameters

  • The -L argument interval list is a Picard-style interval list prepared with PreprocessIntervals.
  • The -I input is alignment data.
  • By default, data is in HDF5 format. To generate text-based TSV (tab-separated values) format data, specify --format TSV. The HDF5 format allows for quicker panel of normals creation.
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The tool employs a number of engine-level read filters. Of note are NotDuplicateReadFilter, FirstOfPairReadFilter, ProperlyPairedReadFilter and MappingQualityReadFilter. [7]

☞ 1.1 How do I view HDF5 format data?

See Article#11508 for an overview of the format and instructions on how to navigate the data with external application HDFView. The article illustrates features of the format using data generated in this tutorial.
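If HDFView is not at hand, the standard HDF5 command-line utilities offer a quick look as well; a small sketch, assuming h5ls/h5dump are installed and using the counts file from the previous step:

# Recursively list the groups and datasets stored in the counts file.
h5ls -r sandbox/tumor.counts.hdf5

# Print the object names recorded in the file; dataset paths found this way can then be dumped individually.
h5dump -n sandbox/tumor.counts.hdf5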




2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals

In creating a PoN, CreateReadCountPanelOfNormals abstracts the counts data for the samples and the intervals using Singular Value Decomposition (SVD), a type of Principal Component Analysis (PCA). The normal samples in the PoN should match the sequencing approach of the case sample under scrutiny. This applies especially to targeted exome data because the capture step introduces target-specific noise.

The tutorial has already created a CNV panel of normals using forty 1000 Genomes Project samples. The command below illustrates PoN creation using just three samples.

gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
    -I HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.counts.hdf5 \
    -I HG00733.alt_bwamem_GRCh38DH.20150826.PUR.exome.counts.hdf5 \
    -I NA19654.alt_bwamem_GRCh38DH.20150826.MXL.exome.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O sandbox/cnvponC.pon.hdf5

This generates a PoN in HDF5 format. The PoN stores information that, when applied, will (i) standardize case sample counts to PoN median counts and (ii) remove systematic noise in the case sample.

Comments on select parameters

  • Provide integer read coverage counts for each sample using -I. Coverage data may be in either TSV or HDF5 format. The tool will accept a single sample, e.g. the matched-normal.
  • The default --number-of-eigensamples or principal components is twenty. The tool will adjust this number to the smaller of twenty or the number of samples the tool retains after filtering. In general, denoising against a PoN with more components improves segmentation, but at the expense of sensitivity. Ideally, researchers should perform a sensitivity analysis to choose an appropriate value for this parameter. See this related discussion.
  • To run the tool using Spark, specify the Spark Master with --spark-master. See Article#11245 for details.

Comments on filtering and imputation parameters, in the order of application

  1. The tutorial changes the --minimum-interval-median-percentile argument from the default of 10.0 to a smaller value of 5.0. The tool filters out targets or bins with a median proportional coverage below this percentile. The median is across the samples. The proportional coverage is the target coverage divided by the sum of the coverage of all targets for a sample. The effect of setting this parameter to a smaller value is that we retain more information.
  2. The --maximum-zeros-in-sample-percentage default is 5.0. Any sample with more than 5% zero coverage targets is filtered.
  3. The --maximum-zeros-in-interval-percentage default is 5.0. Any target interval with more than 5% zero coverage across samples is filtered.
  4. The --extreme-sample-median-percentile default is 2.5. Any sample with less than 2.5 percentile or more than 97.5 percentile normalized median proportional coverage is filtered.
  5. The --do-impute-zeros default is set to true. The tool takes zero-coverage regions and changes these values to the median of the non-zero values. The tool additionally normalizes zero values below the 0.10 percentile or above the 99.90 percentile to these percentile values.
  6. The --extreme-outlier-truncation-percentile default is 0.1. The tool takes any proportional coverage below the 0.1 percentile or above the 99.9 percentile and sets it to the corresponding percentile value.

The current filtering and imputation parameters are identical to those in the BETA release of the CNV workflow and may change in later versions based on evaluations. The implementation has been made more memory efficient, so the tool runs faster than the BETA release.

If the data are not uniform, e.g. have many intervals with zero or low counts, the tool gives a warning to adjust filtering parameters and stops the run. This may happen, for example, if one attempts to construct a panel of mixed-sex samples and includes the allosomal contigs [8]. In this case, first be sure to either exclude allosomal contigs via a subset intervals list or subset the panel samples to those expected to have similar coverage across the given contigs, e.g. panels of the same sex. If the warning still occurs, then adjust --minimum-interval-median-percentile to a larger value. See this thread for the original discussion.

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers? Could PCA account for GC-bias?
What do you know about the 1000 Genomes Project? Specifically, the exome data?
How could we tell a good PoN from a bad PoN? What control could we use?

In a somatic analysis, what is better for a PoN: tissue-matched normals or blood normals?
Should we include our particular tumor’s matched normal in the PoN?




3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts

Provide DenoiseReadCounts with counts collected by CollectFragmentCounts and the CNV PoN generated with CreateReadCountPanelOfNormals.

gatk --java-options "-Xmx12g" DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals cnvponC.pon.hdf5 \
    --standardized-copy-ratios sandbox/hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_clean.denoisedCR.tsv

This produces two files, the standardized copy ratios hcc1143_T_clean.standardizedCR.tsv and the denoised copy ratios hcc1143_T_clean.denoisedCR.tsv, each of which represents a data transformation. In the first transformation, the tool standardizes counts by the PoN median counts. The standardization includes log2 transformation and normalizing the counts data to center around one. In the second transformation, the tool denoises the standardized copy ratios using the principal components of the PoN.

Comments on select parameters

  • Because the default --number-of-eigensamples is null, the tool uses the maximum number of eigensamples available in the PoN. In section 2, by using default CreateReadCountPanelOfNormals parameters, we capped the number of eigensamples in the PoN at twenty. Changing the --number-of-eigensamples in DenoiseReadCounts to lower values can change the resolution of results, i.e. how smooth segments are. See this thread for detailed discussion.
  • Additionally provide the optional --annotated-intervals generated by AnnotateIntervals to concurrently perform GC-bias correction.
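To make the GC-correction option above concrete, here is a sketch of generating annotations with AnnotateIntervals and passing them to DenoiseReadCounts. File names are illustrative, and depending on the release the annotations may instead need to be supplied when the PoN is created, so treat this as an assumption to verify against the tool documentation.

# Annotate the preprocessed intervals (GC content) once.
gatk AnnotateIntervals \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -L targets_C.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.annotated.tsv

# Provide the annotations alongside the denoising inputs to perform GC-bias correction.
gatk DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals cnvponC.pon.hdf5 \
    --annotated-intervals sandbox/targets_C.annotated.tsv \
    --standardized-copy-ratios sandbox/hcc1143_T_clean.gc.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_clean.gc.denoisedCR.tsv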




4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios

We plot the standardized and denoised read counts with PlotDenoisedCopyRatios. The plots allow visually assessing the efficacy of denoising. Provide the tool with both the standardized and denoised copy ratios from the previous step as well as a reference sequence dictionary.

gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces six files in the plots directory--two PNG images and four text files as follows.

  • hcc1143_T_clean.denoised.png plots the standardized and denoised read counts across the contigs and scales the y-axis to accommodate all copy ratio data.
  • hcc1143_T_clean.denoisedLimit4.png plots the same but limits the y-axis range from 0 to 4 for comparability across samples.

Each of the text files contains a single quality control value. The value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number events and should decrease after denoising.

  • hcc1143_T_clean.standardizedMAD.txt gives the MAD for standardized copy ratios.
  • hcc1143_T_clean.denoisedMAD.txt gives the MAD for denoised copy ratios.
  • hcc1143_T_clean.deltaMAD.txt gives the difference between standardized MAD and denoised MAD.
  • hcc1143_T_clean.scaledDeltaMAD.txt gives the fractional difference (standardized MAD - denoised MAD)/(standardized MAD).

Comments on select parameters

  • The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping.
  • To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

Here are the results for the HCC1143 tumor cell line and its matched normal cell line. The normal cell line serves as a control. For each sample, there are two plots that show the effects of PCA denoising. The upper plot shows standardized copy ratios in blue and the lower plot shows denoised copy ratios in green.

4A. Tumor standardized and denoised copy ratio plots
hcc1143_T_clean.denoisedLimit4.png

4B. Normal standardized and denoised copy ratio plots
hcc1143_N_clean.denoisedLimit4.png

Would you guess there are CNV events in the normal? Should we be surprised?

The next step is to perform segmentation. This can be done either using copy ratios alone or in combination with allelic copy ratios. In part II, Section 6 outlines considerations in modeling segments with allelic copy ratios, Section 7 generates a callset, and Section 8 shows how to plot segmented copy and allelic ratios. Again, the tutorial presents these steps using the full features of the workflow. However, researchers may desire to perform copy ratio segmentation independently of allelic counts data, e.g. for a case without a matched control. For the case-only approach, segmentation gives the following plots. To recapitulate this approach, omit allelic-counts parameters from the example commands in sections 6 and 8.

4C. Tumor case-only copy ratios segmentation gives 235 segments.
T_caseonly.modeled.png

4D. Normal case-only copy ratios segmentation gives 41 segments.
hcc1143_N_caseonly.png

While the normal sample shows trisomy of chr2 and a subpopulation with deletion of chr6, the tumor sample is highly aberrant. The extent of aneuploidy is unsurprising and consistent with these HCC1143 tumor dSKY results by Wenhan Chen. Remember that cell lines, with increasing culture time and selective bottlenecks, can give rise to new somatic events, undergo clonal selection and develop population heterogeneity much like in cancer.


☞ 4.1 Compare two PoNs: considerations in the panel of normals creation

Denoising with a PoN is critical for calling copy-number variants from targeted exome coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.

To understand the impact a PoN's constituents can have on an analysis, compare the results of denoising the normal sample against two different PoNs. Each PoN consists of forty 1000 Genomes Project exome samples. PoN-M consists of the same cohort used in the Mutect2 tutorial's PoN. We selected PoN-C's constituents with more care and this is the PoN the CNV tutorial uses.

4E. Compare standardization and denoising with PoN-C versus PoN-M.
compare_pons.png

What is the difference in the targets for the two cohorts--cohort-M and cohort-C? Is this a sufficient reason for the difference in noise profiles we observe above?

GATK4 denoises exome coverage profiles robustly with either panel of normals. However, a good panel allows maximal denoising, as is the case for PoN-C over PoN-M.

We use publicly available 1000 Genomes Project data so as to be able to share the data and to illustrate considerations in CNV analyses. In an actual somatic analysis, we would construct the PoNs using the blood normals of the tumor cohort(s). We would construct a PoN for each sex, so as to be able to call events on allosomal chromosomes. Such a PoN should give better results than either of the tutorial PoNs.

Somatic analyses, due to the confounding factors of tumor purity and heterogeneity, require high sensitivity in calling. However, a sensitive caller can only do so much. Use of a carefully constructed PoN augments the sensitivity and helps illuminate copy number events.

This section is adapted from a hands-on tutorial developed and written by Soo Hee Lee (@shlee) in July of 2017 for the GATK workshops in Cambridge and Edinburgh, UK. The original tutorial uses the GATK4.beta workflow and can be found in the 1707 through 1711 GATK workshops folders. Although the Somatic CNV workflow has changed from GATK4.beta and the official GATK4 release, the PCA denoising remains the same. The hands-on tutorial focuses on differences in PCA denoising based on two different panels of normals (PoNs). Researchers may find working through the worksheet to the very end with either release version beneficial, as considerations in selecting PoN constituents remain identical.

Examining the read group information for the samples in the two PoNs shows a difference in mixtures of sequencing centers--four different sequencing centers for PoN-M versus a single sequencing center for PoN-C. The single sequencing center corresponds to that of the HCC1143 samples. Furthermore, tracing sample information will show different targeted exome capture kits for the sequencing centers. Comparing the denoising results of the two PoNs stresses the importance of selective PoN creation.


☞ 4.2 Compare PoN denoising versus matched-normal denoising

A feature of the GATK4 CNV workflow is the ability to normalize a case against a single control sample, e.g. a tumor case against its matched normal. This involves running the control sample through CreateReadCountPanelOfNormals, then denoising the case against this single-sample projection with DenoiseReadCounts. To illustrate this approach, here is the result of denoising the HCC1143 tumor sample against its matched normal. For single-sample matched-control denoising, DenoiseReadCounts produces identical data for standardizedCR.tsv and denoisedCR.tsv.
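As a sketch of this matched-normal approach, using the tutorial's file-naming pattern (the matched-normal counts file name and the output names are illustrative):

# Build a single-sample "panel" from the matched normal; CreateReadCountPanelOfNormals accepts a single input.
gatk CreateReadCountPanelOfNormals \
    -I hcc1143_N_clean.counts.hdf5 \
    -O sandbox/matched_normal.pon.hdf5

# Denoise the tumor case against the single-sample projection.
gatk DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals sandbox/matched_normal.pon.hdf5 \
    --standardized-copy-ratios sandbox/hcc1143_T_vs_N.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_vs_N.denoisedCR.tsv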

4F. Tumor case standardized against the normal matched-control
T_normalonly.png

Compare these results to those of section 4.1. Notice the depression in chr2 copy ratios, which occurs because the normal sample used as the single-sample PoN carries the chr2 trisomy. Here, the median absolute deviation (MAD) of 0.149 is an incremental improvement over section 4.1's PoN-M denoising (MAD=0.15). In contrast, PoN-C denoising (MAD=0.125) and even PoN-C standardization alone (MAD=0.134) are seemingly better normalization approaches than the matched-normal standardization. Again, the results stress the importance of selective PoN creation.

The PoN accounts for germline CNVs common to its constituents such that the workflow discounts the same variation in the case. It is possible for the workflow to detect germline CNVs not represented in the PoN, in particular, rare germline CNVs. In the case of matched-normal standardization, the workflow should discount germline CNVs and reveal only somatic events.

The workflow does not support iteratively denoising two samples each against a PoN and then against each other.

The tutorial continues in a second document at #11683.



Footnotes


[1] The constituents of the forty-sample CNV panel of normals differ from those of the Mutect2 panel of normals. Preliminary CNV data were generated with v4.0.1.1 somatic CNV WDL scripts run locally on a Google Cloud Compute Engine VM with Cromwell v30.2. Additional refinements were performed on a 16GB MacBook Pro laptop. Additional plots were generated using a broadinstitute/gatk:4.0.1.1 Docker container. Note the v4.0.1.1 WDL script does not allow custom sequence dictionaries for the plotting steps.


[2] Considerations in genomic intervals are as follows.

  • For targeted exomes, the intervals should represent the bait capture or target capture regions.
  • For whole genomes, either supply regions where coverage is expected across samples, e.g. regions that exclude alternate haplotypes and decoy regions in GRCh38, or omit the -L option for references where coverage is expected across the entirety of the reference.
  • For either type of data, expect to modify the intervals depending on (i) extent of masking in the reference used in read mapping and (ii) expectations in coverage on allosomal contigs. For example, for mammalian data, expect to remove Y chromosome intervals for female samples.


[3] See original discussion on bin size here. The bin size determines the resolution of CNV breakpoints. The theoretical limit depends on coverage depth and the insert-size distribution. Typically bin sizes on the order of the read length will give reasonable results. The GATK developers have tested WGS runs where the bin size is as small as 250 bases.


[4] Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. The default is set to ALL for GATK4.0.1.1. For future versions, the default will be set to OVERLAPPING_ONLY.


[5] The tool allows specifying both the padding and the binning arguments simultaneously. If exome targets are very long, it may be preferable to both pad and break up the intervals with binning. This may provide some additional resolution.


[6] The data bundle from Tutorial#11136 contains tumor.bam and normal.bam. These tumor and normal samples are identical to that in the current tutorial and represent a subset of the full data for the following regions:

chr6    29941013    29946495    +    
chr11   915890  1133890 +    
chr17   1   83257441    +    
chr11_KI270927v1_alt    1   218612  +    
HLA-A*24:03:01  1   3502    +


[7] The following regarding read filters may be of interest and applies to the workflow illustrated in the tutorial, which uses CollectFragmentCounts.

  • In contrast to prior versions of the workflow, the GATK4 CNV workflow excludes duplicate fragments from consideration with the NotDuplicateReadFilter. To instead include duplicate fragments, specify -DF NotDuplicateReadFilter.
  • The tool only considers paired-end reads (0x1 SAM flag) and the first of pair (0x40 flag) with the FirstOfPairReadFilter. The tool uses the first-of-pair read’s mapping information for the fragment center.
  • The tool only considers properly paired reads (0x2 SAM flag) using the ProperlyPairedReadFilter. Depending on whether and how data was preprocessed with MergeBamAlignment, proper pair assignments can differ from that given by the aligner. This filter also removes single ended reads.
  • The MappingQualityReadFilter sets a threshold for alignment MAPQ. The tool sets --minimum-mapping-quality to 30. Thus, the tool uses reads with MAPQ 30 or higher.


[8] The current tool version requires strategizing denoising of allosomal chromosomes, e.g. X and Y in humans, against the panel of normals. This is because coverage will vary for these regions depending on the sex of the sample. To determine the sex of samples, analyze them with DetermineGermlineContigPloidy. Aneuploidy in allosomal chromosomes, much like trisomy, can still make for viable organisms and so phenotypic sex designations are insufficient. GermlineCNVCaller can account for differential sex in data.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.



Funcotator reproducibly crashes on a specific WES VCF record produced by GATK 4.1.0 (Java 1.8.0_45)


Here is the error I got:
[February 25, 2019 8:18:10 PM PST] org.broadinstitute.hellbender.tools.funcotator.Funcotator done. Elapsed time: 11.08 minutes.
Runtime.totalMemory()=5379719168
java.lang.StringIndexOutOfBoundsException: String index out of range: 545
at java.lang.String.substring(String.java:1951)
at org.broadinstitute.hellbender.tools.funcotator.ProteinChangeInfo.initializeForDeletion(ProteinChangeInfo.java:192)
at org.broadinstitute.hellbender.tools.funcotator.ProteinChangeInfo.<init>(ProteinChangeInfo.java:96)
at org.broadinstitute.hellbender.tools.funcotator.ProteinChangeInfo.create(ProteinChangeInfo.java:371)
[...]

and here is the offending record:
chr12 70747693 . TAAAAAAA T,TAAAA,TAAAAA,TAAAAAA,TAAAAAAAA . artifact_in_normal;germline_risk;multiallelic CONTQ=93;DP=537;ECNT=1;GERMQ=253,113,0,0,18;MBQ=36,24,36,28,36,33;MFRL=293,529,291,288,325,299;MMQ=60,29,60,60,60,60;MPOS=43,43,41,44,26;NALOD=0.912,0.217,-2.040e+00,-1.342e+01,-4.057e+00;NLOD=20.49,13.33,2.58,-1.476e+01,2.13;POPAF=2.27,1.08,2.19,2.53,5.40;REF_BASES=GCAAGCCTTCTAAAAAAAAAA;RPA=25,18,22,23,24,26;RU=A;SAAF=0.394,0.404,0.420;SAPP=0.019,0.015,0.965;STR;TLOD=3.88,7.29,6.62,36.73,3.85 GT:AD:AF:DP:F1R2:F2R1 0/0:21,0,1,7,14,6:0.011,0.025,0.099,0.262,0.124:49:14,0,1,6,7,4:7,0,0,1,7,2 0/1/2/3/4/5:65,3,11,20,47,18:0.016,0.057,0.076,0.245,0.092:164:32,3,6,14,30,8:33,0,5,6,17,10

As a result, the Funcotator output is truncated. Is this a bug?

Thanks!
Ivan

GATK4: How to reassign STAR mapping quality from 255 to 60 with SplitNCigarReads


Hi,

How can I reassign STAR mapping quality from 255 to 60 with SplitNCigarReads?

In GATK 3.X this used to be done like this:
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
See this blog post: https://software.broadinstitute.org/gatk/blog?id=4285

With GATK4 latest beta the read filter argument has been renamed. Trying the same -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS arguments leads to the following error:
A USER ERROR has occurred: rf is not a recognized option

Through looking at the CLI help documentation I got as far as:
--readFilter ReassignOneMappingQuality -RMQF 255 -RMQT 60

The readFilter argument is now recognized. But not the -RMQF 255 -RMQT 60 part:
A USER ERROR has occurred: U is not a recognized option

Could you please advise on how to run the GATK4 SplitNCigarReads tool with reassignment of the mapping quality?

Without reassignment of the mapping quality, the GATK HaplotypeCaller discards all of the STAR-mapped reads and calls the full chromosome as reference, without any variants.

Thank you.

HaplotypeCaller filters out all reads (trying to use GATK4 for RNA-seq data)


Dear all,
I'm trying to update our pipeline to identify SNPs and mutations in RNA sequencing samples. I spent a lot of time online trying to figure out how to adjust the previous commands to the newest releases of the various tools, but the final steps filter out all of the bases, producing a VCF file that doesn't include anything but headers.
Attached is the general workflow, trying to analyze ERR361240 from SRA:

STAR --runThreadN 32 --genomeDir /mnt/lustre/hms-01/fs01/galaxy1/reference_data/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex --readFilesIn file.fastq --outSAMtype BAM Unsorted --outSAMmapqUnique 60 --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --limitSjdbInsertNsj 1200000 \

Note that I have already changed the output mapping quality to 60 rather than 255 (--outSAMmapqUnique 60).

picard.jar AddOrReplaceReadGroups I=Aligned.out.bam O=Aligned.out_rg.bam RGID=null RGLB=lb RGPL=illumina RGPU=pu RGSM=ES

picard.jar ReorderSam I=Aligned.out_rg.bam O=Aligned.out_rg_sorted.bam R=/mnt/lustre/hms-01/fs01/yishaia/data/human/galaxy/hg38.fa

picard.jar SortSam I=Aligned.out_rg_sorted.bam O=Aligned.out_rg_sorted2.bam SORT_ORDER=coordinate CREATE_INDEX=true

gatk SplitNCigarReads -R /mnt/lustre/hms-01/fs01/yishaia/data/human/galaxy/hg38.fa -I Aligned.out_rg_sorted2.bam -O split.bam

gatk --java-options -Xmx4g HaplotypeCaller -R /mnt/lustre/hms-01/fs01/yishaia/data/human/galaxy/hg38.fa -I split.bam --dont-use-soft-clipped-bases --stand-call-conf 20.0 -O variants_output.vcf

It seems that the SplitNCigarReads worked, as the log notes:
14:32:31.381 INFO SplitNCigarReads - No reads filtered by: AllowAllReadsReadFilter
14:32:31.381 INFO ProgressMeter - KI270752.1:25118 140.1 121004130 863481.5
14:32:31.381 INFO ProgressMeter - Traversal complete. Processed 121004130 total reads in 140.1 minutes.
INFO 2018-07-11 14:32:33 SortingCollection Creating merging iterator from 137 files
14:35:23.427 INFO SplitNCigarReads - Shutting down engine
[July 11, 2018 2:35:23 PM IDT] org.broadinstitute.hellbender.tools.walkers.rnaseq.SplitNCigarReads done. Elapsed time: 143.02 minutes.
Runtime.totalMemory()=3614441472

However, in the next step the following is specified:

14:49:16.138 INFO HaplotypeCaller - 68471979 read(s) filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
18480956 read(s) filtered by: (((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter)
18480956 read(s) filtered by: ((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
18480956 read(s) filtered by: (((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
18480956 read(s) filtered by: ((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter)
18480956 read(s) filtered by: (((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter)
18480956 read(s) filtered by: ((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter)
18480956 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)
18480956 read(s) filtered by: MappingQualityReadFilter
49991023 read(s) filtered by: WellformedReadFilter

14:49:16.138 INFO ProgressMeter - KI270394.1:901 13.7 10332600 751583.9
14:49:16.138 INFO ProgressMeter - Traversal complete. Processed 10332600 total regions in 13.7 minutes.
14:49:16.146 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
14:49:16.146 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
14:49:16.146 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.00 sec
14:49:16.146 INFO HaplotypeCaller - Shutting down engine
[July 11, 2018 2:49:16 PM IDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 13.77 minutes.
Runtime.totalMemory()=1112539136


I honestly don't know where the problem is. This sample worked well with an older version of GATK, using TopHat in order to align it. Any help would be highly appreciated.

Thanks,
Yishai

Question: How to treat a BAM file with multiple read groups (RG)

Hi everyone, I am pretty new to NGS data analysis. I have downloaded a WES dataset in BAM format from the SRA database. I tried to mark duplicates in the BAM file using the MarkDuplicates command in Picard. However, I encountered an error in the terminal: "error parsing sam header. @rg line missing sm tag". I thought there was an error in the @RG tags, so I ran "samtools view -H SRR1693634_NC_000005.9.sorted.bam | grep '@RG'" to see the RG tags. The output is below:

@RG ID:FGC0630.4.ACTGAT
@RG ID:FGC0639.8.ACTGAT
@RG ID:FGC0639.7.ACTGAT
@RG ID:FGC0639.4.ACTGAT
@RG ID:FGC0639.6.ACTGAT
@RG ID:FGC0639.5.ACTGAT

Now I have two questions:

1. What is the meaning of the several read groups for a single sample? Are MarkDuplicates results for a sample with a single RG similar to those for the same sample with multiple RGs?

2. The RGs in my sample lack some necessary information such as RGID, RGLB, RGPL, RGPU, RGSM. How can I obtain and then add this information to the BAM file for each RG? As far as I know, AddOrReplaceReadGroups in Picard only treats samples with a single RG.
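For illustration, one possible approach is to edit the header directly and reapply it, so each @RG line gains the missing tags. This is only a sketch: the SM/LB/PL values below are placeholders that would need to be filled in from the sequencing metadata, and the \t tab escape assumes GNU sed.

# Extract the current header.
samtools view -H SRR1693634_NC_000005.9.sorted.bam > header.sam

# Append the missing tags to every @RG line (placeholder values; adjust per read group if they differ; GNU sed).
sed -i '/^@RG/ s/$/\tSM:SRR1693634\tLB:lib1\tPL:ILLUMINA/' header.sam

# Write a new BAM with the corrected header.
samtools reheader header.sam SRR1693634_NC_000005.9.sorted.bam > SRR1693634_NC_000005.9.rehead.bam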

GATK4 with Mutect2, calling somatic SNVs and indels with normal-tumor matched sample


Hi all,

Recently, I have been testing different variant callers with one matched tumor-normal sample. I have successfully run the tests with Strelka, MuTect, and GATK3+MuTect2. However, the test with GATK4+Mutect2 has not been very successful: I cannot find any mutations with a PASS flag in the VCF file. I think that may be because I am using the wrong commands.

Here is the command line; I copied it from the Mutect2 homepage.

gatk --java-options "-Xmx$MAX_MEM" Mutect2 \
-R $GENOME_REFERENCE \
-I $OUT_DIR/$TUMOR \
-I $OUT_DIR/$NORMAL \
-tumor Illumina_cancer \
-normal Illumina_normal \
--germline-resource $GERMLINE \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L $EXON_REGION \
-O $OUT_DIR/$PREFIX.vcf

  1. Do I need to add an additional PoN file when running paired samples?
  2. Could anyone help me check whether the command above should produce a list of somatic variants with a PASS flag? Actually, I didn't get any variants with a PASS flag in my VCF file.
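For reference, my understanding is that in GATK4 the Mutect2 output is unfiltered and the PASS/filter flags are only assigned by a separate FilterMutectCalls step; a minimal sketch of that step is below (file names reuse the variables above, and additional inputs such as a contamination table are optional).

gatk FilterMutectCalls \
    -V $OUT_DIR/$PREFIX.vcf \
    -O $OUT_DIR/$PREFIX.filtered.vcf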

My GATK version is 4.1.0.0

Many thanks.


How can I prepare a FASTA file to use as reference?


This article describes the steps necessary to prepare your reference file (if it's not one that you got from us). As a complement to this article, see the relevant tutorial.

Why these steps are necessary

The GATK uses two files to access and safety-check access to the reference: a .dict dictionary of the contig names and sizes, and a .fai fasta index file to allow efficient random access to the reference bases. You have to generate these files in order to be able to use a FASTA file as reference.

NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.

Creating the fasta sequence dictionary file

We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.

> java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
Runtime.totalMemory()=2112487424
44.922u 2.308s 0:47.09 100.2%   0+0k 0+0io 2pf+0w

This produces a SAM-style header file describing the contents of our fasta file.

> cat Homo_sapiens_assembly18.dict 
@HD     VN:1.0  SO:unsorted
@SQ     SN:chrM LN:16571        UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ     SN:chr1 LN:247249719    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9ebc6df9496613f373e73396d5b3b6b6
@SQ     SN:chr2 LN:242951149    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:b12c7373e3882120332983be99aeb18d
@SQ     SN:chr3 LN:199501827    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:0e48ed7f305877f66e6fd4addbae2b9a
@SQ     SN:chr4 LN:191273063    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:cf37020337904229dca8401907b626c2
@SQ     SN:chr5 LN:180857866    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:031c851664e31b2c17337fd6f9004858
@SQ     SN:chr6 LN:170899992    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bfe8005c536131276d448ead33f1b583
@SQ     SN:chr7 LN:158821424    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:74239c5ceee3b28f0038123d958114cb
@SQ     SN:chr8 LN:146274826    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
@SQ     SN:chr9 LN:140273252    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:ea244473e525dde0393d353ef94f974b
@SQ     SN:chr10        LN:135374737    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4ca41bf2d7d33578d2cd7ee9411e1533
@SQ     SN:chr11        LN:134452384    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:425ba5eb6c95b60bafbf2874493a56c3
@SQ     SN:chr12        LN:132349534    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d17d70060c56b4578fa570117bf19716
@SQ     SN:chr13        LN:114142980    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c4f3084a20380a373bbbdb9ae30da587
@SQ     SN:chr14        LN:106368585    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:c1ff5d44683831e9c7c1db23f93fbb45
@SQ     SN:chr15        LN:100338915    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:5cd9622c459fe0a276b27f6ac06116d8
@SQ     SN:chr16        LN:88827254     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3e81884229e8dc6b7f258169ec8da246
@SQ     SN:chr17        LN:78774742     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2a5c95ed99c5298bb107f313c7044588
@SQ     SN:chr18        LN:76117153     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:3d11df432bcdc1407835d5ef2ce62634
@SQ     SN:chr19        LN:63811651     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2f1a59077cfad51df907ac25723bff28
@SQ     SN:chr20        LN:62435964     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f126cdf8a6e0c7f379d618ff66beb2da
@SQ     SN:chr21        LN:46944323     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
@SQ     SN:chr22        LN:49691432     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:2041e6a0c914b48dd537922cca63acb8
@SQ     SN:chrX LN:154913754    UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d7e626c80ad172a4d7c95aadb94d9040
@SQ     SN:chrY LN:57772954     UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:62f69d0e82a12af74bad85e2e4a8bd91
@SQ     SN:chr1_random  LN:1663265      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:cc05cb1554258add2eb62e88c0746394
@SQ     SN:chr2_random  LN:185571       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:18ceab9e4667a25c8a1f67869a4356ea
@SQ     SN:chr3_random  LN:749256       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9cc571e918ac18afa0b2053262cadab6
@SQ     SN:chr4_random  LN:842648       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:9cab2949ccf26ee0f69a875412c93740
@SQ     SN:chr5_random  LN:143687       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:05926bdbff978d4a0906862eb3f773d0
@SQ     SN:chr6_random  LN:1875562      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:d62eb2919ba7b9c1d382c011c5218094
@SQ     SN:chr7_random  LN:549659       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:28ebfb89c858edbc4d71ff3f83d52231
@SQ     SN:chr8_random  LN:943810       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:0ed5b088d843d6f6e6b181465b9e82ed
@SQ     SN:chr9_random  LN:1146434      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
@SQ     SN:chr10_random LN:113275       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:50be2d2c6720dabeff497ffb53189daa
@SQ     SN:chr11_random LN:215294       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bfc93adc30c621d5c83eee3f0d841624
@SQ     SN:chr13_random LN:186858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:563531689f3dbd691331fd6c5730a88b
@SQ     SN:chr15_random LN:784346       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:bf885e99940d2d439d83eba791804a48
@SQ     SN:chr16_random LN:105485       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:dd06ea813a80b59d9c626b31faf6ae7f
@SQ     SN:chr17_random LN:2617613      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:34d5e2005dffdfaaced1d34f60ed8fc2
@SQ     SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f3814841f1939d3ca19072d9e89f3fd7
@SQ     SN:chr19_random LN:301858       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:420ce95da035386cc8c63094288c49e2
@SQ     SN:chr21_random LN:1679693      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:a7252115bfe5bb5525f34d039eecd096
@SQ     SN:chr22_random LN:257318       UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:4f2d259b82f7647d3b668063cf18378b
@SQ     SN:chrX_random  LN:1719168      UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta      M5:f4d71e0758986c15e5455bf3e14e5d6f
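Note that CreateSequenceDictionary.jar above comes from an older per-tool Picard distribution; with a current combined picard.jar the equivalent command is usually run as shown below, and samtools dict produces an equivalent dictionary (the command forms here are standard Picard/samtools usage, not taken from the output above, and the file names are placeholders):

> java -jar picard.jar CreateSequenceDictionary R=reference.fasta O=reference.dict
> samtools dict -o reference.dict reference.fasta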

Creating the fasta index file

We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.

> samtools faidx Homo_sapiens_assembly18.fasta 
108.446u 3.384s 2:44.61 67.9%   0+0k 0+0io 0pf+0w

This produces a text file with one record per line for each of the fasta contigs. Each record consists of: contig name, size, location (byte offset of the first base), basesPerLine and bytesPerLine. The index file produced above looks like:

> cat Homo_sapiens_assembly18.fasta.fai 
chrM    16571   6       50      51
chr1    247249719       16915   50      51
chr2    242951149       252211635       50      51
chr3    199501827       500021813       50      51
chr4    191273063       703513683       50      51
chr5    180857866       898612214       50      51
chr6    170899992       1083087244      50      51
chr7    158821424       1257405242      50      51
chr8    146274826       1419403101      50      51
chr9    140273252       1568603430      50      51
chr10   135374737       1711682155      50      51
chr11   134452384       1849764394      50      51
chr12   132349534       1986905833      50      51
chr13   114142980       2121902365      50      51
chr14   106368585       2238328212      50      51
chr15   100338915       2346824176      50      51
chr16   88827254        2449169877      50      51
chr17   78774742        2539773684      50      51
chr18   76117153        2620123928      50      51
chr19   63811651        2697763432      50      51
chr20   62435964        2762851324      50      51
chr21   46944323        2826536015      50      51
chr22   49691432        2874419232      50      51
chrX    154913754       2925104499      50      51
chrY    57772954        3083116535      50      51
chr1_random     1663265 3142044962      50      51
chr2_random     185571  3143741506      50      51
chr3_random     749256  3143930802      50      51
chr4_random     842648  3144695057      50      51
chr5_random     143687  3145554571      50      51
chr6_random     1875562 3145701145      50      51
chr7_random     549659  3147614232      50      51
chr8_random     943810  3148174898      50      51
chr9_random     1146434 3149137598      50      51
chr10_random    113275  3150306975      50      51
chr11_random    215294  3150422530      50      51
chr13_random    186858  3150642144      50      51
chr15_random    784346  3150832754      50      51
chr16_random    105485  3151632801      50      51
chr17_random    2617613 3151740410      50      51
chr18_random    4262    3154410390      50      51
chr19_random    301858  3154414752      50      51
chr21_random    1679693 3154722662      50      51
chr22_random    257318  3156435963      50      51
chrX_random     1719168 3156698441      50      51
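Given those fields, a tool can locate any base directly. As a sketch of the arithmetic (assuming one newline character per line, which is why bytesPerLine is 51 for 50 bases), the byte offset of the base at 1-based position pos on a contig is:

  offset = location + floor((pos - 1) / basesPerLine) * bytesPerLine + ((pos - 1) mod basesPerLine)

For example, chr1:1,000,000 in the index above sits at 16915 + 19999 * 51 + 49 = 1,036,913 bytes into the fasta file.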

Base Quality Score Recalibration (BQSR)


BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to anyone, especially Spanish-speaking users, who get tripped up by the similarity of these names.


Contents

  1. Overview
  2. Base recalibration procedure details
  3. Important factors for successful recalibration
  4. Examples of pre- and post-recalibration metrics
  5. Recalibration report

1. Overview

It's all about the base, 'bout the base (quality scores)

Base quality scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion base calls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases -- which is a lot of bad bases. The quality score each base call gets is determined through some dark magic jealously guarded by the manufacturer of the sequencing machines.
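For reference, the Phred scale relates a quality score Q to an error probability P by Q = -10 * log10(P), i.e. P = 10^(-Q/10). So:

  Q10 -> P = 0.1   (90% confidence)
  Q20 -> P = 0.01  (99% confidence)
  Q30 -> P = 0.001 (99.9% confidence)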

Why does it matter? Because our short variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a base call that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally our impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

This procedure can be applied to BAM files containing data from any sequencing platform that outputs base quality scores on the expected scale. We have run it ourselves on data from several generations of Illumina, SOLiD, 454, Complete Genomics, and Pacific Biosciences sequencers.

That sounds great! How does it work?

The base recalibration process involves two key steps: first the BaseRecalibrator tool builds a model of covariation based on the input data and a set of known variants, producing a recalibration file; then the ApplyBQSR tool adjusts the base quality scores in the data based on the model, producing a new BAM file. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.
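As a concrete illustration of the two steps with GATK4 (the file names are placeholders; adjust them and the known-sites resource to your own data):

> gatk BaseRecalibrator -R reference.fasta -I input.bam --known-sites dbsnp.vcf.gz -O recal_data.table
> gatk ApplyBQSR -R reference.fasta -I input.bam --bqsr-recal-file recal_data.table -O recalibrated.bam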

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
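One way to produce that before/after comparison (exact argument names can differ slightly between GATK releases, and the file names below are placeholders) is to run BaseRecalibrator a second time on the recalibrated BAM and feed both tables to AnalyzeCovariates:

> gatk BaseRecalibrator -R reference.fasta -I recalibrated.bam --known-sites dbsnp.vcf.gz -O recal_data.after.table
> gatk AnalyzeCovariates -before recal_data.table -after recal_data.after.table -plots recalibration_plots.pdf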


2. Base recalibration procedure details

BaseRecalibrator builds the model

To build the recalibration model, this first tool goes through all of the reads in the input BAM file and tabulates data about the following features of the bases:

  • read group the read belongs to
  • quality score reported by the machine
  • machine cycle producing this base (Nth cycle = Nth base from the start of the read)
  • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to the known variants resource (typically dbSNP). This information is output to a recalibration file in GATKReport format.
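For example, under the default covariates a single bin might be keyed by (read group, reported quality, machine cycle, dinucleotide context) -- say, SRR032764, Q25, cycle 37, context "CA" (these specific values are purely illustrative); the recalibration table then records how many bases fell into that bin and how many of them mismatched the reference.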

Note that the recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting problems; the correction has only a minor impact on data sets with billions of bases, but it is critical for avoiding overconfidence in rare bins in sparse data.
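For example, a sparse bin with 0 mismatches out of 10 observed bases would naively imply an error rate of 0 (infinite quality); with the correction the estimate is (0 + 1) / (10 + 2) ≈ 0.083, i.e. roughly Q11, which is far more defensible.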

ApplyBQSR adjusts the scores

This second tool goes through all the reads again, using the recalibration file to adjust each base's score based on which bins it falls in. So effectively the new quality score is:

  • the sum of the global difference between reported quality scores and the empirical quality
  • plus the quality bin specific shift
  • plus the cycle x qual and dinucleotide x qual effect

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as variant calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
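Written out schematically (this simply restates the bullet points above as a formula, it is not a literal excerpt from the GATK code):

  Q_recalibrated = Q_reported + dQ(read group) + dQ(reported quality bin) + dQ(cycle) + dQ(dinucleotide context)

where each dQ term is the difference between the empirical quality and the reported quality observed for that bin in the recalibration table.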


3. Important factors for successful recalibration

Read groups

The recalibration system is read-group aware, meaning it uses @RG tags to partition the data by read group. This allows it to perform the recalibration per read group, which reflects which library a read belongs to and what lane it was sequenced in on the flowcell. We know that systematic biases can occur in one lane but not the other, or one library but not the other, so being able to recalibrate within each unit of sequence data makes the modeling process more accurate. As a corollary, that means it's okay to run BQSR on BAM files with multiple read groups. However, please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
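In practice this means each BAM should carry well-formed @RG header lines, with each read pointing at one of them via its RG tag. A minimal illustrative header line (the values are placeholders, not from the original article) might look like:

@RG     ID:H0164.1      SM:sample1      LB:lib1 PU:H0164ABCDE.1 PL:ILLUMINA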

Amount of data

A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. This procedure will not work well on a small number of aligned reads. We usually expect to see more than 100M bases per read group; as a rule of thumb, larger numbers will work better.

No excuses

You should almost always perform recalibration on your sequencing data. In human data, given the exhaustive databases of variation we have available, almost all of the remaining mismatches -- even in cancer -- will be errors, so it's super easy to ascertain an accurate error model for your data, which is essential for downstream analysis. For non-human data it can be a little bit more work since you may need to bootstrap your own set of variants if there are no such resources already available for your organism, but it's worth it.

Here's how you would bootstrap a set of known variants (a command-line sketch follows the list below):

  • First do an initial round of variant calling on your original, unrecalibrated data.
  • Then take the variants that you have the highest confidence in and use that set as the database of known variants by feeding it as a VCF file to the BaseRecalibrator.
  • Finally, do a real round of variant calling with the recalibrated data. These steps could be repeated several times until convergence.
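A rough command-line sketch of one bootstrap round with GATK4 (the file names are placeholders, and the hard-filtering step shown is just one way of selecting confident calls):

# 1. Initial round of calling on the original, unrecalibrated data
> gatk HaplotypeCaller -R reference.fasta -I original.bam -O raw_round1.vcf.gz
# 2. Keep only your most confident calls, e.g. hard-filter and drop filtered records
> gatk VariantFiltration -R reference.fasta -V raw_round1.vcf.gz --filter-expression "QD < 2.0" --filter-name lowQD -O filtered_round1.vcf.gz
> gatk SelectVariants -R reference.fasta -V filtered_round1.vcf.gz --exclude-filtered -O confident_round1.vcf.gz
# 3. Recalibrate using that set as the known variants, then call again on the recalibrated data
> gatk BaseRecalibrator -R reference.fasta -I original.bam --known-sites confident_round1.vcf.gz -O recal_round1.table
> gatk ApplyBQSR -R reference.fasta -I original.bam --bqsr-recal-file recal_round1.table -O recal_round1.bam
> gatk HaplotypeCaller -R reference.fasta -I recal_round1.bam -O calls_round1.vcf.gz

Repeat the cycle until the recalibration results stop changing appreciably.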

The main case where you really might need to skip BQSR is when you have too little data (some small gene panels have that problem), or when you're working with a really weird organism that displays insane amounts of variation.


4. Examples of pre- and post-recalibration metrics

This shows recalibration results from a lane sequenced at the Broad by an Illumina GA-II in February 2010. This is admittedly not very recent but the results are typical of what we still see on some more recent runs, even if the overall quality of sequencing has improved. You can see there is a significant improvement in the accuracy of the base quality scores after applying the recalibration procedure. Note that the plots shown below are not the same as the plots that are produced by the AnalyzeCovariates tool.

[Images: four plots of pre- and post-recalibration base quality metrics. Note: the scale for the number of bases in the two graphs is different.]


5. Recalibration report

The recalibration report contains the following 5 tables:

  • Arguments Table -- a table with all the arguments and their values
  • Quantization Table
  • ReadGroup Table
  • Quality Score Table
  • Covariates Table

Arguments Table

This is the table that contains all the arguments used to run BQSR for this dataset.

#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                    Value
covariate                   null
default_platform            null
deletions_context_size      6
force_platform              null
insertions_context_size     6
...

Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSR, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other GATK tool if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization. You can override this with the engine argument -qq: with -qq 0 you don't quantize qualities, while with -qq N you recalculate the quantization bins using N bins.

#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count        QuantizedScore
0                     252               0
1                   15972               1
2                  553525               2
3                 2190142               9
4                 5369681               9
9                83645762               9
...

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions and deletions.

#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D                   40.7476             45.0000    2642683174    222475
SRR032766  D                   40.9072             45.0000    2630282426    213441
SRR032764  D                   40.5931             45.0000    2919572148    254687
SRR032769  D                   40.7448             45.0000    2850110574    240094
SRR032767  D                   40.6820             45.0000    2820040026    241020
SRR032765  D                   40.9034             45.0000    2441035052    198258
SRR032766  M                   23.2573             23.7733    2630282426  12424434
SRR032768  M                   23.0281             23.5366    2642683174  13159514
SRR032769  M                   23.2608             23.6920    2850110574  13451898
SRR032764  M                   23.2302             23.6039    2919572148  13877177
SRR032765  M                   23.0271             23.5527    2441035052  12158144
SRR032767  M                   23.1195             23.5852    2820040026  13750197
SRR032766  I                   41.7198             45.0000    2630282426    177017
SRR032768  I                   41.5682             45.0000    2642683174    184172
SRR032769  I                   41.5828             45.0000    2850110574    197959
SRR032764  I                   41.2958             45.0000    2919572148    216637
SRR032765  I                   41.5546             45.0000    2441035052    170651
SRR032767  I                   41.5192             45.0000    2820040026    198762

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions and deletions.

#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767            49  M                   33.7794          9549        3
SRR032769            49  M                   36.9975          5008        0
SRR032764            49  M                   39.2490          8411        0
SRR032766            18  M                   17.7397      16330200   274803
SRR032768            18  M                   17.7922      17707920   294405
SRR032764            45  I                   41.2958    2919572148   216637
SRR032765             6  M                    6.0600       3401801   842765
SRR032769            45  I                   41.5828    2850110574   197959
SRR032764             6  M                    6.0751       4220451  1041946
SRR032767            45  I                   41.5192    2820040026   198762
SRR032769             6  M                    6.3481       5045533  1169748
SRR032768            16  M                   15.7681      12427549   329283
SRR032766            16  M                   15.8173      11799056   309110
SRR032764            16  M                   15.9033      13017244   334343
SRR032769            16  M                   15.8042      13817386   363078
...

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.

#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767            16  TACGGA          Context        M                   14.2139           817      30
SRR032766            16  AACGGA          Context        M                   14.9938          1420      44
SRR032765            16  TACGGA          Context        M                   15.5145           711      19
SRR032768            16  AACGGA          Context        M                   15.0133          1585      49
SRR032764            16  TACGGA          Context        M                   14.5393           710      24
SRR032766            16  GACGGA          Context        M                   17.9746          1379      21
SRR032768            45  CACCTC          Context        I                   40.7907        575849      47
SRR032764            45  TACCTC          Context        I                   43.8286        507088      20
SRR032769            45  TACGGC          Context        D                   38.7536         37525       4
SRR032768            45  GACCTC          Context        I                   46.0724        445275      10
SRR032766            45  CACCTC          Context        I                   41.0696        575664      44
SRR032769            45  TACCTC          Context        I                   43.4821        490491      21
SRR032766            45  CACGGC          Context        D                   45.1471         65424       1
SRR032768            45  GACGGC          Context        D                   45.3980         34657       0
SRR032767            45  TACGGC          Context        D                   42.7663         37814       1
SRR032767            16  AACGGA          Context        M                   15.9371          1647      41
SRR032764            16  GACGGA          Context        M                   18.2642          1273      18
SRR032769            16  CACGGA          Context        M                   13.0801          1442      70
SRR032765            16  GACGGA          Context        M                   15.9934          1271      31
...

(--sitesVCFFile) ASEReadCounter

Picard FixVCFHeader and UpdateVCFSequenceDictionary are mutually incompatible


I have a VCF without a full header or a sequence dictionary. I can't add a sequence dictionary with Picard because of the header issue, and I can't add a header because of the sequence dictionary issue. I'm using GATK4.

Select Variants restrict variants to 'BIALLELIC' doesn't remove biallelic variants


This is from the official dbSNP file after restricting to biallelic with SelectVariants. Apparently SelectVariants only works if all alleles are on the same line, which is clearly not how dbSNP works.

1 10177 rs201752861 A C . . ASP;GENEINFO=DDX11L1:100287102;R5;RS=201752861;RSPOS=10177;SAO=0;SSR=0;VC=SNV;VP=0x050000020005000002000100;WGT=1;dbSNPBuildID=137
1 10177 rs367896724 A AC . . ASP;CAF=0.5747,0.4253;COMMON=1;G5;G5A;GENEINFO=DDX11L1:100287102;KGPhase3;R5;RS=367896724;RSPOS=10177;SAO=0;SSR=0;VC=DIV;VLD;VP=0x050000020005170026000200;WGT=1;dbSNPBuildID=138
