Channel: Recent Discussions — GATK-Forum

(How to part II) Sensitively detect copy ratio alterations and allelic segments


Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.


This workflow is broken into two tutorials. You are currently on the second part. See Tutorial#11682 for the first part.

For this second part, segmentation, performed by ModelSegments, is at the heart of the workflow. In segmentation, contiguous copy ratios are grouped together into segments. The tool performs segmentation for both copy ratios and allelic copy ratios, given allelic counts. The segmentation is informed by both types of data, i.e. the tool uses allelic data to refine copy ratio segmentation and vice versa. The tutorial refers to this multi-data approach as joint segmentation. The presented commands showcase the full features of the tools; it is also possible to perform segmentation for each data type independently, i.e. based solely on copy ratios or solely on allelic counts.

The tutorial illustrates the workflow using a paired sample set. Specifically, detection of allelic copy ratios uses a matched control: the HCC1143 tumor sample is analyzed using the HCC1143 blood normal as its control. It is possible to run the workflow without a matched control. See section 8.1 for considerations in interpreting allelic copy ratio results for different modes and different purities.

The GATK4 CNV workflow offers many adjustable levers, e.g. for fine-tuning analyses and for controls. Researchers are expected to tune workflow parameters on samples with copy number profiles similar to the case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters


5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts

CollectAllelicCounts will tabulate counts of the reference allele and counts of the dominant alternate allele for each site in a given genomic intervals list. The tutorial performs this step for both the case sample, the HCC1143 tumor, and the matched-control, the HCC1143 blood normal. This allele-specific coverage collection is just that--raw coverage collection without any statistical inferences. In the next section, ModelSegments uses the allele counts towards estimating allelic copy ratios, which in turn the tool uses to refine segmentation.

Collect allele counts for the case and the matched-control alignments independently with the same intervals. For the matched-control analysis, the allelic count sites for the case and control must match exactly. Otherwise, ModelSegments, which takes the counts in the next step, will error. Here we use an intervals list that subsets gnomAD biallelic germline SNP sites to those within the padded, preprocessed exome target intervals [9].
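
For reference, one way to produce such a subset is with SelectVariants, restricting a gnomAD sites VCF to biallelic SNPs within the padded, preprocessed intervals (see also footnote [9]). This is a minimal sketch; the gnomAD resource and intervals file names below are placeholders.

gatk SelectVariants \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -V af-only-gnomad.hg38.vcf.gz \
    -L targets.preprocessed.interval_list \
    --select-type-to-include SNP \
    --restrict-alleles-to BIALLELIC \
    -O gnomad_biallelic_snps.subset.vcf.gz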

The tutorial has already collected allele counts for full length sample BAMs. To demonstrate coverage collection, the following command uses the small BAMs originally made for Tutorial#11136 [6]. The tutorial does not use the resulting files in subsequent steps.

Collect counts at germline variant sites for the matched-control

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I normal.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_N_clean.allelicCounts.tsv

Collect counts at the same sites for the case sample

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I tumor.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -O sandbox/hcc1143_T_clean.allelicCounts.tsv

This results in counts table files. Each data file has header lines that start with an @ asperand symbol, e.g. @HD, @SQ and @RG lines, followed by a table of data with six columns. An example snippet is shown.
T_allelicCounts_header.png
T_allelicCounts_snippet.png
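
To peek at the counts table in a terminal, one can skip the @ header lines with standard Unix tools; a minimal sketch:

grep -v '^@' sandbox/hcc1143_T_clean.allelicCounts.tsv | head -5 | column -t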

Comments on select parameters

  • The tool requires one or more genomic intervals specified with -L. The intervals can be either a Picard-style intervals list or a VCF. See Article#1109 for descriptions of formats. The sites should be common and/or sample-specific germline SNP-only sites; omit indel-type and mixed-variant-type sites.
  • The tool requires the reference genome, specified with -R, and aligned reads, specified with -I.
  • As is the case for most GATK tools, the engine filters reads upfront using a number of read filters. Of note for CollectAllelicCounts is the MappingQualityReadFilter. By default, the tool sets the filter's --minimum-mapping-quality to twenty. As a result, the tool will include reads with MAPQ20 and above in the analysis [10]. This threshold can be overridden on the command line, as sketched below.
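
A minimal sketch of overriding the engine-level mapping-quality threshold, reusing the tutorial's file names; verify the argument name against the tool documentation for your GATK version:

gatk --java-options "-Xmx3g" CollectAllelicCounts \
    -L chr17_theta_snps.interval_list \
    -I tumor.bam \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --minimum-mapping-quality 30 \
    -O sandbox/hcc1143_T_clean.allelicCounts.tsv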

☞ 5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?

Another GATK tool, GetPileupSummaries, similarly counts reference and alternate alleles. The resulting summaries are meant for use with CalculateContamination in estimating cross-sample contamination. GetPileupSummaries limits counts collections to those sites with population allele frequencies set by the parameters --minimum-population-allele-frequency and --maximum-population-allele-frequency. Details are here.

CollectAllelicCounts employs fewer engine-level read filters than GetPileupSummaries. Of note, both tools use the MappingQualityReadFilter; however, each sets a different threshold with the filter. GetPileupSummaries uses a --minimum-mapping-quality threshold of 50, whereas CollectAllelicCounts sets the same parameter to 20 [10]. In addition, CollectAllelicCounts filters on base quality. The base quality threshold is set with the --minimum-base-quality parameter, whose default is 20.


back to top


6. Group contiguous copy ratios into segments with ModelSegments

ModelSegments groups together copy and allelic ratios that it determines are contiguous on the same segment. A Gaussian-kernel binary-segmentation algorithm differentiates ModelSegments from the GATK4.beta tool it replaces, PerformSegmentation, which used a CBS (circular binary segmentation) algorithm. ModelSegments' kernel algorithm enables efficient segmentation of dense data, e.g. that of whole genome sequences. A discussion of preliminary algorithm performance is here.

The algorithm performs segmentation for both copy ratios and for allelic copy ratios jointly when given both datatypes together. For allelic copy ratios, ModelSegments uses only those sites it determines are heterozygous, either in the control in a paired analysis or in the case in a case-only analysis [11]. In the paired analysis, the tool models allelic copy ratios in the case using sites for which the control is heterozygous. The workflow defines allelic copy ratios in terms of alternate-allele fraction, where total allele fractions for reference allele and alternate allele add to one for each site.

For the following command, be sure to specify an existing --output directory or . for the current directory.

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --output sandbox \
    --output-prefix hcc1143_T_clean

This produces nine files in the sandbox directory, each with the basename hcc1143_T_clean, listed below. The param files contain global parameters for copy ratios (cr) and allele fractions (af), and the seg files contain data on the segments. For either type of data, the tool gives data before and after segmentation smoothing. The tool documentation details what each file contains. The last two files, labeled hets, contain the allelic counts for the control's heterozygous sites. Counts are for the matched control (normal) and the case.

  1. hcc1143_T_clean.modelBegin.seg
  2. hcc1143_T_clean.modelFinal.seg
  3. hcc1143_T_clean.cr.seg
  4. hcc1143_T_clean.modelBegin.af.param
  5. hcc1143_T_clean.modelBegin.cr.param
  6. hcc1143_T_clean.modelFinal.af.param
  7. hcc1143_T_clean.modelFinal.cr.param
  8. hcc1143_T_clean.hets.normal.tsv
  9. hcc1143_T_clean.hets.tsv

The tool has numerous adjustable parameters and these are described in the ModelSegments tool documentation. The tutorial uses the default values for all of the parameters. Adjusting parameters can change the resolution and smoothness of the segmentation results.

Comments on select parameters

  • The tool accepts either or both copy-ratios (--denoised-copy-ratios) and allelic-counts (--allelic-counts) data. The matched-control allelic counts (--normal-allelic-counts) are optional. If given both types of data, then copy ratios and allelic counts together inform segmentation for both copy ratio and allelic segments. If given only one type of data, then segmentation is based solely on that data type.
  • The --minimum-total-allele-count is set to 30 by default. This means the tool only considers sites with 30 or more read depth coverage for allelic copy ratios.
  • The --genotyping-homozygous-log-ratio-threshold option is set to -10.0 by default. Increase this to increase the number of sites assumed to be heterozygous for modeling.
  • Default smoothing parameters are optimized for faster performance, given the size of whole genomes. The --maximum-number-of-smoothing-iterations option caps smoothing iterations at 25. The number of MCMC model samples is set to 100 for both copy-ratio and allele-fraction sampling, by the --number-of-samples-copy-ratio and --number-of-samples-allele-fraction options, respectively. Finally, --number-of-smoothing-iterations-per-fit is set to zero by default to disable model refitting between iterations. This means the tool generates only two MCMC fits--an initial and a final fit.

    • GATK4.beta's ACNV set this parameter such that each smoothing iteration refit using MCMC, at the cost of compute. For the tutorial data, which is targeted exomes, the default of zero gives 398 segments after two smoothing iterations, while setting --number-of-smoothing-iterations-per-fit to one gives 311 segments after seven smoothing iterations. Section 8 plots these alternative results; a sketch of the corresponding command follows this list.
  • For advanced smoothing recommendations, see [12].
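
To reproduce the 311-segment result mentioned above, one could rerun ModelSegments with refitting enabled between smoothing iterations. A minimal sketch reusing the tutorial's file names; the output prefix is changed so as not to overwrite earlier results:

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.allelicCounts.tsv \
    --normal-allelic-counts hcc1143_N_clean.allelicCounts.tsv \
    --number-of-smoothing-iterations-per-fit 1 \
    --output sandbox \
    --output-prefix hcc1143_T_clean_smoother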

Section 8 shows the results of segmentation, the result from changing --number-of-smoothing-iterations-per-fit and the result of allelic segmentation modeled from allelic counts data alone. Section 8.1 details considerations depending on analysis approach and purity of samples. Section 8.2 shows the results of changing the advanced smoothing parameters given in [12].

ModelSegments runs in the following three stages.

  1. Genotypes heterozygous sites and filters on depth and for sites that overlap with copy-ratio intervals.
    • Allelic counts for sites in the control that are heterozygous are written to hets.normal.tsv. For the same sites in the case, allelic counts are written to hets.tsv.
    • If given only allelic counts data, ModelSegments does not apply intervals.
  2. Performs multidimensional kernel segmentation (1, 2).
    • Uses allelic counts within each copy-ratio interval for each contig.
    • Uses denoised copy ratios and heterozygous allelic counts.
  3. Performs Markov-Chain Monte Carlo (MCMC, 1, 2, 3) sampling and segment smoothing. In particular, the tool uses Gibbs sampling and slice sampling. These MCMC samplings inform smoothing, i.e. merging adjacent segments, and the tool can perform multiple iterations of sampling and smoothing [13].
    • Fits initial model. Writes initial segments to modelBegin.seg, posterior summaries for copy-ratio global parameters to modelBegin.cr.param and allele-fraction global parameters to modelBegin.af.param.
    • Iteratively performs segment smoothing and sampling. Fits allele-fraction model [14] until log likelihood converges. This process produces global parameters.
    • Samples final models. Writes final segments to modelFinal.seg, posterior summaries for copy-ratio global parameters to modelFinal.cr.param, posterior summaries for allele-fraction global parameters to modelFinal.af.param and final copy-ratio segments to cr.seg.

At the second stage, the tutorial data generates the following message.

INFO  MultidimensionalKernelSegmenter - Found 638 segments in 23 chromosomes.

At the third stage, the tutorial data generates the following message.

INFO  MultidimensionalModeller - Final number of segments after smoothing: 398

For the tutorial data, kernel segmentation gives 638 initial segments over 23 contigs; smoothing with default parameters reduces this to 398 segments.
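
These counts can be spot-checked against the output files. A minimal sketch that tallies the data rows of the copy-ratio segments file, assuming the file carries @-prefixed header lines plus one column-header row:

grep -v '^@' sandbox/hcc1143_T_clean.cr.seg | tail -n +2 | wc -l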


back to top


7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments

CallCopyRatioSegments allows for systematic calling of copy-neutral, amplified and deleted segments. This step is not required for plotting segmentation results. Provide the tool with the cr.seg segmentation result from ModelSegments.

gatk CallCopyRatioSegments \
    --input hcc1143_T_clean.cr.seg \
    --output sandbox/hcc1143_T_clean.called.seg

The resulting called.seg data adds a sixth column to the provided copy ratio segmentation table. The tool denotes amplifications with a + plus sign, deletions with a - minus sign and neutral segments with a 0 zero.

Here is a snippet of the results.
T_called_seg.png
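
To tally calls by type, one can group the values of the added column. A minimal sketch, assuming the call is the final column as described above:

grep -v '^@' sandbox/hcc1143_T_clean.called.seg | tail -n +2 | awk '{print $NF}' | sort | uniq -c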

Comments on select parameters
- The parameters --neutral-segment-copy-ratio-lower-bound (default 0.9) and --neutral-segment-copy-ratio-upper-bound (default 1.1) together set the copy ratio range for copy-neutral segments. These two parameters replace the GATK4.beta workflow’s --neutral-segment-copy-ratio-threshold option.


back to top


8. Plot modeled copy ratio and allelic fraction segments with PlotModeledSegments

PlotModeledSegments visualizes copy and allelic ratio segmentation results.

gatk PlotModeledSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.hets.tsv \
    --segments hcc1143_T_clean.modelFinal.seg \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces plots in the plots folder. The plots represent final modeled segments for both copy ratios and alternate allele fractions. If we are curious about the extent of smoothing provided by MCMC, then we can similarly plot initial kernel segmentation results by substituting in --segments hcc1143_T_clean.modelBegin.seg.
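
For reference, the corresponding command differs only in the --segments argument and, to keep the plots distinct, the output prefix:

gatk PlotModeledSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --allelic-counts hcc1143_T_clean.hets.tsv \
    --segments hcc1143_T_clean.modelBegin.seg \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean_modelBegin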

Comments on select parameters
- The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping [4].
- To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

As of this writing, it is NOT possible to subset plotting with genomic intervals, i.e. with the -L parameter. To interactively visualize data, consider the following options.

  • Modify the sequence dictionary to contain only the contigs of interest, in the order desired (see the sketch after this list).
  • Convert the data to bedGraph format for targeted exomes or to bigWig format for whole genomes. An example of CNV data converted to bedGraph and visualized in IGV is given in this discussion.
  • Alternatively, researchers versed in R may choose to visualize subsets of data using RStudio.
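
For the first option, note that a .dict file is a SAM-style header, so a subset can be assembled with standard text tools. A rough sketch that keeps the @HD line plus the @SQ lines for chr1 and chr2 only, assuming tab-delimited fields; plotting order follows the dictionary order:

awk -F'\t' 'NR==1 || ($1=="@SQ" && ($2=="SN:chr1" || $2=="SN:chr2"))' \
    Homo_sapiens_assembly38.dict > chr1_chr2.dict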

Below are three sets of results for the HCC1143 tumor cell line in order of increasing smoothing. The top plot of each set shows the copy ratio segments. The bottom plot of each set shows the allele fraction segments.

  • In the denoised copy ratio segment plot, individual targets still display as points on the plot. Different copy ratio segments are indicated by alternating blue and orange color groups. The denoised median is drawn in thick black.
  • In the allele fraction plot, the boxes surrounding the alternate allelic fractions do NOT indicate the standard deviation or standard error that biomedical researchers may be more familiar with. Rather, the boxes show credible intervals: the allelic copy ratio plot marks the 10th, 50th and 90th percentiles. These should be interpreted with care, as explained in section 8.1. Individual allele fraction data display as faint data points, also in orange and blue.

8A. Initial segmentation before MCMC smoothing gives 638 segments.
T_modelbegin.modeled.png

8B. Default smoothing gives 398 segments.
T_modelfinal.modeled.png

8C. Enabling additional smoothing iterations per fit gives 311 segments. See section 6 for a description of the --number-of-smoothing-iterations-per-fit parameter.
T_increase_smoothing_1.modeled.png

Smoothing accounts for data points that are outliers. Some of these outliers could be artifactual and therefore not of interest, while others could be true copy number variation that would then be missed. To understand the impact of joint copy ratio and allelic counts segmentation, compare the results of 8B to the single-data segmentation results below. Each plot below shows the results of modeling segmentation on a single type of data, either copy-ratios or allelic counts, using default smoothing parameters.

8D. Copy ratio segmentation based on copy ratios alone gives 235 segments.
T_caseonly.modeled.png

8E. Allelic segmentation result based on allelic counts alone in the matched case gives 105 segments.
T-matched-normal_just_allelic.modeled.png

Compare chr1 and chr2 segmentation for the various plots. In particular, pay attention to the p arm (left side) of chr1 and q arm (right side) of chr2. What do you think is happening when adjacent segments are slightly shifted from each other in some sets but then seemingly at the same copy ratio for other sets?

For allelic counts, ModelSegments retains 16,872 sites that are heterozygous in the control. Of these, the case presents 15,486 usable sites. In allelic segmentation using allelic counts alone, the tool uses all of the usable sites. In the matched-control scenario, ModelSegments emits the following message.

INFO  MultidimensionalKernelSegmenter - Using first allelic-count site in each copy-ratio interval (12668 / 15486) for multidimensional segmentation...

The message informs us that for the matched-control scenario, ModelSegments uses the first allelic-count site in each copy-ratio interval towards allelic modeling. For the tutorial data, this is 12,668 out of 15,486, or 81.8% of the usable allele-count sites. The exclusion of ~20% of allelic-count sites, together with the lack of copy ratio data informing segmentation, accounts for the difference we observe between this and the previous allelic segmentation plot.

In the allele fraction plot, some of the alternate-allele fractions are around 0.35/0.65 and some are at 0/1. We also see alternate-allele fractions around 0.25/0.75 and 0.5. These suggest ploidies that are multiples of one, two, three and four.

Is it possible a copy ratio of one is not diploid but represents some other ploidy?

For the plots above, focus on chr4, chr5 and chr17. Based on both the copy ratio and allelic results, what is the zygosity of each of the chromosomes? What proportion of each chromosome could be described as having undergone copy-neutral loss of heterozygosity?


☞ 8.1 Some considerations in interpreting allelic copy ratios

For allelic copy ratio analysis, the matched-control is a sample from the same individual as the case sample. In the somatic case, the matched-control is the germline normal sample and the case is the tumor sample from the same individual.

The matched-control case presents the following considerations.

  • If a matched control contains any region with copy number amplification, the skewed allele fractions still allow correct interpretation of the original heterozygosity.
  • However, if a matched control contains deleted regions or regions with copy-neutral loss of heterozygosity or a long stretch of homozygosity, e.g. as occurs in uniparental disomy, then these regions would go dark so to speak in that they become apparently homozygous and so ModelSegments drops them from consideration.
  • From population sequencing projects, we know the expected heterozygosity of normal germline samples averages around one in a thousand. However, the GATK4 CNV workflow does not account for any heterozygosity expectations. An example of such an analysis that utilizes SNP array data is HAPSEG. It is available on GenePattern.
  • If a matched normal contains tumor contamination, this should still allow for the normal to serve as a control. The expectation is that somatic mutations coinciding with common germline SNP sites will be rare and ModelSegments (i) only counts the dominant alt allele at multiallelic sites and (ii) recognizes and handles outliers. To estimate tumor in normal (TiN) contamination, see the Broad CGA group's deTiN.

Here are some considerations for detecting loss of heterozygosity regions.

  • In the matched-control case, if the case sample is pure, i.e. not contaminated with the control sample, then we see loss of heterozygosity (LOH) segments near alternate-allele fractions of zero and one.
  • If the case is contaminated with matched control, whether the analysis is matched or not, then the range of alternate-allele fractions becomes squished so to speak in that the contaminating normal's heterozygous sites add to the allele fractions. In this case, putative LOH segments still appear at the top and bottom edges of the allelic plot, at the lowest and highest alternate-allele fractions. For a given depth of coverage, the fraction of reads that differentiates zygosity is narrower in range and therefore harder to differentiate visually.

    8F. Case-only analysis of tumor contaminated with normal still allows for LOH detection. Here, we bluntly added together the tutorial tumor and normal sample reads. Results for the matched-control analysis are similar.
    mixTN_tumoronly.modeled.png

  • In the tumor-only case, if the tumor is pure, because ModelSegments drops homozygous sites from consideration and only models sites it determines are heterozygous, the workflow cannot ascertain LOH segments. Such LOH regions may present as an absence of allelic data or as low confidence segments, i.e. having a wide confidence interval on the allelic plot. Compare such a result below to that of the matched case in 8E above.

    8G. Allelic segmentation result based on allelic counts alone for case-only, when the case is pure, can produce regions of missing representation and low confidence allelic fraction segments.
    T-only_just_allelic.modeled.png

    Compare results. Focus on chr4, chr5 and chr17. While the matched-case gives homozygous zygosity for each of these chromosomes, the case-only allelic segmentation either presents an absence of segments for regions or gives low confidence allelic fraction segments at alternate allele fractions that are inaccurate, i.e. do not represent actual zygosity. This is particularly true for tumor samples where aneuploidy and LOH are common. Interpret case-only allelic results with caution.

Finally, remember the tutorial analyses above utilize allelic counts from gnomAD sites of common population variation that have been lifted-over from GRCh37 to GRCh38. For allelic count sites, use of sample-specific germline variant sites may incrementally increase resolution. Also, use of confident variant sites from a callset derived from alignments to the target reference may help decrease noise. Confident germline variant sites can be derived with HaplotypeCaller calling on alignments and subsequent variant filtration. Alternatively, it is possible to fine-tune ModelSegments smoothing parameters to dampen noise.


☞ 8.2 Some results of fine-tuning smoothing parameters

This section shows plotting results of changing some advanced smoothing parameters. The parameters and their defaults are given below, in the order of recommended consideration [12].

--number-of-changepoints-penalty-factor 1.0 \
--kernel-variance-allele-fraction 0.025 \
--kernel-variance-copy-ratio 0.0 \
--kernel-scaling-allele-fraction 1.0 \
--smoothing-credible-interval-threshold-allele-fraction 2.0 \
--smoothing-credible-interval-threshold-copy-ratio 2.0 \

The first four parameters impact segmentation while the last two parameters impact modeling. The following plots show the results of changing these smoothing parameters. The tutorial chose argument values arbitrarily, for illustration purposes. Results should be compared to that of 8B, which gives 398 segments.

8H. Increasing changepoints penalty factor from 1.0 to 5.0 gives 140 segments.

image

8I. Increasing kernel variance parameters each to 0.8 gives 144 segments. Changing --kernel-variance-copy-ratio alone to 0.025 increases the number of segments greatly, to 1,266 segments. Changing it to 0.2 gives 414 segments.

image

8J. Decreasing kernel scaling from 1.0 to 0 gives 236 segments. Conversely, increasing kernel scaling from 1.0 to 5.0 gives 551 segments.

image

8K. Increasing both smoothing parameters each from 2.0 to 10.0 gives 263 segments.

image

back to top


Footnotes


[9] The GATK Resource Bundle provides two variations of a SNPs-only gnomAD project resource VCF. Both VCFs are sites-only eight-column VCFs but one retains the AC allele count and AF allele frequency variant-allele-specific annotations, while the other removes these to reduce file size.

  • For targeted exomes, it may be convenient to subset these to the preprocessed intervals, e.g. with SelectVariants for use with CollectAllelicCounts. This is not necessary, however, as ModelSegments drops sites outside the target regions from its analysis in the joint-analysis approach.
  • For whole genomes, depending on the desired resolution of the analysis, consider subsetting the gnomAD sites to those commonly variant, e.g. above an allele frequency threshold. Note that SelectVariants, as of this writing, can filter on AF allele frequency only for biallelic sites; non-biallelic sites make up ~3% of the gnomAD SNPs-only resource. A sketch of such a subsetting command follows this list.
  • For more resolution, consider adding sample-specific germline variant biallelic SNPs-only sites to the intervals. Section 8.1 shows allelic segmentation results for such an analysis.
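
A minimal sketch of such allele-frequency subsetting with SelectVariants; the resource file name and the 0.05 threshold are placeholders, and per the note above the AF filter applies only to biallelic sites. Check your GATK version's documentation for the exact selection-expression syntax.

gatk SelectVariants \
    -V af-only-gnomad.hg38.vcf.gz \
    --restrict-alleles-to BIALLELIC \
    -select 'AF > 0.05' \
    -O gnomad_common_biallelic_snps.vcf.gz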


[10] The MAPQ20 threshold of CollectAllelicCounts is lower than that used by CollectFragmentCounts, which uses MAPQ30.


[11] In particular, the tool considers only heterozygous sites that have counts for both the reference allele and the alternate allele. If multiple alternate alleles are present, the tool uses the alternate allele with the highest count and ignores any other alternate allele(s).


[12] These advanced smoothing recommendations are from one of the workflow developers--@slee.

  • For smoother results, first increase --number-of-changepoints-penalty-factor from its default of 1.0.
  • If the above does not suffice, then consider changing the kernel-variance parameters --kernel-variance-copy-ratio (default 0.0) and --kernel-variance-allele-fraction (default 0.025), or change the weighting of the allele-fraction data by changing --kernel-scaling-allele-fraction (default 1.0).
  • If such changes are still insufficient, then consider adjusting the smoothing-credible-interval-threshold parameters --smoothing-credible-interval-threshold-copy-ratio (default 2.0) and --smoothing-credible-interval-threshold-allele-fraction (default 2.0). Increasing these will more aggressively merge adjacent segments.


[13] In particular, uses Gibbs sampling, a type of MCMC sampling, towards both allele-fraction modeling and copy-ratio modeling, and additionally uses slice sampling towards allele-fraction modeling. @slee details the following substeps.

  1. Perform MCMC (Gibbs) to fit the copy-ratio model posteriors.
  2. Use optimization (of the log likelihood) to initialize the Markov Chain for the allele-fraction model.
  3. Perform MCMC (Gibbs and slice) to fit the allele-fraction model posteriors.
  4. The initial model is now fit. We write the corresponding modelBegin files, including those for global parameters.
  5. Iteratively perform segment smoothing.
  6. Perform steps 1-4 again, this time to generate the final model fit and modelFinal files.


[14] @slee shares that the tool initializes the MCMC by starting off at the maximum a posteriori (MAP) point in parameter space.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.

back to top



CNV standardized and denoised plots disagree with each other


Hi GATK Team,
I'm following the CNV workflow: Sensitively detect copy ratio alterations and allelic segments.

I ran steps 1-4 with my own WGS 30x samples and a 7-sample PoN to test against my tumor samples. All BAM files were preprocessed by the FireCloud workflow. I'm using hg38. No error messages at the end. My GATK is 4.0.8.1.

At the end of step 4, I have a few questions:

  • for one patient, the standardized and denoised plots disagree with each other, not only for one cell clone but for different ones, and only for this patient.

I can't post markdown links to my images, please see attachment.

1. The standardized plot shows an increase at X but the denoised plot shows a decrease. How did this happen? Is this related to my sample itself, the PoN components (7 samples), or the calling algorithm?

2. As in the two sample plots above, some samples give me a smooth line of CNV ratio while other samples show more noise. How does this difference occur and how can I improve it?

Thank you!
Le

GATK4 CNV ModelSegments hets output


Hi GATK team!

I'm having trouble precisely understanding the ModelSegments hets output when run on a tumor sample with both tumor and normal AllelicCounts given.

The documentation reads:

If the matched normal is available, its allelic counts will be used to genotype the sites, and we will simply assume these genotypes are the same in the case sample. (This can be critical, for example, for determining sites with loss of heterozygosity in high purity case samples; such sites will be genotyped as homozygous if the matched-normal sample is not available.)

If this were truly the case then why:
1. Is a different number of variants (and not a 1:1, exactly overlapping set) output to hets.tsv and hets.normal.tsv?
2. If I roughly quantify variant allele fractions in the hets.normal.tsv file, why are a large portion of them far from 0.5?

Both these observations seem to contradict what the documentation states. Can someone explain the difference and similarities between the hets.tsv and hets.normal.tsv output file in a way other than stated in the documentation because I'm not understanding this explanation.

Collected FAQs about interval lists


1. Can GATK tools be restricted to specific intervals instead of processing the entire reference?

Absolutely. Just use the -L argument to provide the list of intervals you wish to run on. Or you can use -XL to exclude intervals, e.g. to blacklist genome regions that are problematic.


2. What file formats does GATK support for interval lists?

GATK supports several types of interval list formats: Picard-style .interval_list, GATK-style .list, BED files with extension .bed, and VCF files.

A. Picard-style .interval_list

Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6

This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. applying hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).

B. GATK-style .list or .intervals

This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/ contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
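
For example, a valid GATK-style intervals file could contain both forms:

chr1:100-200
chr2:3000-4000
chr20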

C. BED files with extension .bed

We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.
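
For example, the 1-based interval chr1:100-200 corresponds to the following BED line, where the start is offset by one but the stop is not:

chr1    99      200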

D. VCF files

Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.


3. Is there a required order of intervals?

Yes, thanks for asking. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is for efficiency reasons.


4. Can I provide multiple sets of intervals?

Sure, no problem -- just pass them in using separate -L arguments. You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by setting an interval_set rule.
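
For example, in GATK4 syntax, to restrict a run to the intersection of a capture target list and a region of chr20; a sketch with placeholder file names:

gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -L targets.interval_list \
    -L chr20:1-30000000 \
    --interval-set-rule INTERSECTION \
    -O sample.vcf.gz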


5. How will GATK handle intervals that abut or overlap?

Very gracefully. By default the GATK engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by setting an interval_merging rule.


6. What's the best way to pad intervals?

You can use the -ip engine argument to add padding on the fly. No need to produce separate padded targets files. Sweet, right?

Note that if intervals that previously didn't abut or overlap before you added padding now do so, by default the GATK engine will merge them as described above. This behavior can be modified by setting an interval_merging rule.
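
For example, in GATK4 syntax, to pad each target interval by 100 bp on the fly; a sketch with placeholder file names:

gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    -L targets.interval_list \
    -ip 100 \
    -O sample.vcf.gz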

Mutect2 bamout depths not matching vcf.


Hi,

After running Mutect2 on tumor/normal paired bam files, I get an output VCF with unusually high depth counts. I understand that the numbers here can differ from the input bam depths due to genotype reassembly within Mutect2. However, these depths are sometimes jumping from around 20 reads to 200 reads, which seems hard to believe. In order to investigate further, I re-ran Mutect2 with the -bamout option, in order to analyze some positions in IGV. The Mutect2 command is posted below. The problem is that when I look at the output bam (bamout) and the output VCF in IGV, the numbers do not match. Note: I ran FilterMutectCalls and selected only PASS calls when deciding which positions to look at. An example position: depth 146 and 192 in tumor and normal, respectively in the VCF, but only 41 reads for the same position in the bamout file.

Questions:
1) What can explain the discrepancy between the bamout depths and the vcf depths?
2) Is there a way to get a bam/bamout that matches the VCF exactly?

Thanks a lot,
Sujay

Mutect2 command:
./gatk-4.0.10.1/gatk --java-options "-Xmx4g" Mutect2 \
    -R /genomes/Hsapiens/hg19/seq/hg19.fa \
    --annotation ClippingRankSumTest --annotation DepthPerSampleHC \
    --annotation MappingQualityRankSumTest --annotation MappingQualityZero \
    --annotation QualByDepth --annotation ReadPosRankSumTest \
    --annotation RMSMappingQuality --annotation FisherStrand \
    --annotation MappingQuality --annotation DepthPerAlleleBySample \
    --annotation Coverage \
    --read-validation-stringency LENIENT \
    -I tumorX.markduplicates.grouped.bam -tumor tumorX \
    -I normalX.markduplicates.grouped.bam -normal normalX \
    -L ./baits.bed --interval-set-rule INTERSECTION \
    --disable-read-filter NotDuplicateReadFilter \
    -ploidy 2 -bamout tumorX.bamout.bam -O tumorX.vcf

Somatic copy number variant discovery (CNVs)


Purpose

Identify somatic copy number variant (CNVs) in a case sample. Requires an appropriate Panel of Normals (PON).



Reference Implementations

Pipeline                  Summary             Notes      Github  FireCloud
Somatic CNV case sample   Case BAM to CNV     universal  yes     b37
Somatic CNV PON creation  Normal BAMs to PON  universal  yes     b37

Documentation for these workflows is in development.

(howto) Get started with GATK4 beta


Download the software

The GATK4 beta version command-line tools are provided as a single executable jar file. You can download a zipped package containing the jar file from this Github link (GATK4 Download page coming soon). Once you unzip the package, you will find four files inside the resulting directory:

gatk-launch
gatk-package-4.beta.x-local.jar
gatk-package-4.beta.x-spark.jar
README.md 

where x is the minor release version in the jar file names.

Now you may ask, why are there two jars? As the names suggest, gatk-package-4.beta.x-spark.jar is the jar for running Spark tools on a Spark cluster, while gatk-package-4.beta.x-local.jar is the jar that is used for everything else (including running Spark tools "locally", ie on a regular server or cluster).

So does that mean you have to specify which one you want to run each time? Nope! See the gatk-launch file in there? That's an executable wrapper script that you invoke and that will choose the appropriate jar for you based on the rest of your command line. You can still invoke a specific jar if you want, but using gatk-launch is easier, and it will also take care of setting some parameters that you would otherwise have to specify manually. We'll talk about that in a minute.


Install it

There is no installation necessary in the traditional sense, since the precompiled jar files should work on any POSIX platform (NOT Microsoft Windows!) equipped with the appropriate version of Java (see below). You'll simply need to open the downloaded package and place the folder containing the jar files in a convenient directory on your hard drive (or server). Although the jars themselves cannot simply be added to your PATH, you can add the directory containing the gatk-launch wrapper script. Please look up instructions depending on the terminal shell you use; in bash the typical syntax is export PATH=$PATH:/path/to/gatk where /path/to/gatk is the directory containing the gatk-launch executable. Note that the jars must remain in the same directory as gatk-launch for it to work.

Important note about Java version

For the tools to run properly, you must have Java 8 / JDK or JRE 1.8 installed. To check your java version, open your terminal application and run the following command:

java -version

If the output looks something like java version "1.8.x_y", you are good to go. If not, you may need to change your version. You can download a suitable upgrade either from Oracle or from OpenJDK. To be clear, OpenJDK is now fully supported.


Test that it works

To test that you can run GATK tools, run the following command in your terminal application (we assume that you have added gatk-launch to your PATH):

./gatk-launch --help

This will output a summary of the GATK4 invocation syntax, options for listing tools and invoking a specific tool's help documentation, and main Spark options.


Use GATK tools

Tools are invoked as follows:

./gatk-launch ToolName -OPTION1 value1 -OPTION2 value2 

If you have previously used older GATK versions, you'll notice that ToolName is no longer passed with -T and that it is now positional: the tool name must always be the first thing you write after the ./gatk-launch part (or the jar file if you're invoking the jar directly).
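
For example, a minimal invocation through the wrapper might look like this (file names are placeholders):

./gatk-launch PrintReads \
    -I input.bam \
    -O output.bam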

Available tools are all listed in the Tool Documentation section, which is versioned; on the website, use the orange dropdown menu button to switch between versions. This provides a complete list of tools with usage recommendations, options, and example commands.


Docker image

Docker images for GATK4 releases can be found at https://hub.docker.com/r/broadinstitute/gatk/

After SplitNCigarReads, does the separated reads both labeled as primary alignment?


Hi, I am doing RNA-seq and use STAR for alignment. I am wondering about the following situation: a read overlaps a junction site, and when I apply the SplitNCigarReads operation it splits the read into two. Do both of the resulting split reads get relabeled as primary alignments (which would make more sense), or is the sequence on the right side still a secondary alignment?

Thanks a lot!

Jason


Bug report: listing of workflow input display incorrect after first page.


In the monitor tab, when looking at workflow-level inputs and you need to page through the input, after the first page, the Task column shows the label name rather than the task name.

Removing variants based on the FORMAT field


Hi

I have been trying to remove the variants in my sample with a GQ lower than 20, as part of the Genotype Refinement workflow:

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fa -V Genotype_refinement.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o Genotype_refinement_GQ.vcf

This gives me a vcf file with the Filter name lowGQ in the FORMAT field.

Now I do not know how to use SelectVariants to remove these filtered variants. I think my JEXL is wrong

java -jar GenomeAnalysisTK.jar -R ref.fa -T SelectVariants -V Genotype_refinement_GQ.vcf -o Genotype_refinement_GQ_filtered.vcf -xl_se 'lowGQ'

I still get the lowGQ samples when I check the file. Could you help me understand how to remove these?

Also I wanted to preferably do this in a single step but apparently GATK SelectVariants does not recognise G_filter

##### ERROR MESSAGE: Argument with name 'G_filter' isn't defined.

GATK v4.0.8.1 GenomicsDBImport Error (VariantStorageManagerException exception)


Hi, I am following the current best practices to prepare a consolidated GVCF from 5 WGS samples for joint calling
with the following command, and I encounter an error:

java -Djava.io.tmpdir=/work/TMP -Xmx40g \
    -jar ~/bin/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar GenomicsDBImport \
    -V /work/Analysis/III_3P_RG_DupMark.raw.snps.indels.g.vcf \
    -V /work/Analysis/IV_11N_RG_DupMark.raw.snps.indels.g.vcf \
    -V /work/Analysis/IV_8N_RG_DupMark.raw.snps.indels.g.vcf \
    -V /work/Analysis/IV_10P_RG_DupMark.raw.snps.indels.g.vcf \
    -V /work/Analysis/IV_20P_RG_DupMark.raw.snps.indels.g.vcf \
    --genomicsdb-workspace-path /work/Analysis/wang_chr19_re \
    --intervals chr19

Error Log

15:00:35.770 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/wang/bin/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
15:00:35.944 INFO GenomicsDBImport - ------------------------------------------------------------
15:00:35.944 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.8.1
15:00:35.945 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
15:00:35.945 INFO GenomicsDBImport - Executing as wang@Ubuntu1604 on Linux v3.16.0-43-generic amd64
15:00:35.945 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-8u171-b11-2~14.04-b11
15:00:35.945 INFO GenomicsDBImport - Start Date/Time: October 2, 2018 3:00:35 PM JST
15:00:35.945 INFO GenomicsDBImport - ------------------------------------------------------------
15:00:35.945 INFO GenomicsDBImport - ------------------------------------------------------------
15:00:35.946 INFO GenomicsDBImport - HTSJDK Version: 2.16.0
15:00:35.946 INFO GenomicsDBImport - Picard Version: 2.18.7
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:00:35.946 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:00:35.946 INFO GenomicsDBImport - Deflater: IntelDeflater
15:00:35.946 INFO GenomicsDBImport - Inflater: IntelInflater
15:00:35.946 INFO GenomicsDBImport - GCS max retries/reopens: 20
15:00:35.946 INFO GenomicsDBImport - Using google-cloud-java fork https://github.com/broadinstitute/google-cloud-java/releases/tag/0.20.5-alpha-GCS-RETRY-FIX
15:00:35.946 INFO GenomicsDBImport - Initializing engine
15:00:38.360 INFO IntervalArgumentCollection - Processing 58617616 bp from intervals
15:00:38.366 INFO GenomicsDBImport - Done initializing engine
Created workspace /work/Analysis/wgs_chr19
15:00:38.849 INFO GenomicsDBImport - Vid Map JSON file will be written to /work/Analysis/wgs_chr19/vidmap.json
15:00:38.849 INFO GenomicsDBImport - Callset Map JSON file will be written to /work/Analysis/wgs_chr19/callset.json
15:00:38.849 INFO GenomicsDBImport - Complete VCF Header will be written to /work/Analysis/wgs_chr19/vcfheader.vcf
15:00:38.850 INFO GenomicsDBImport - Importing to array - /work/Analysis/wgs_chr19/genomicsdb_array
15:00:38.850 INFO ProgressMeter - Starting traversal
15:00:38.850 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
15:00:39.771 INFO GenomicsDBImport - Importing batch 1 with 5 samples
Buffer resized from 28469bytes to 32688
Buffer resized from 28473bytes to 32630
Buffer resized from 28469bytes to 32745
Buffer resized from 28469bytes to 32717
Buffer resized from 28466bytes to 32648
Buffer resized from 32688bytes to 32758
Buffer resized from 32630bytes to 32726
Buffer resized from 32648bytes to 32703
Buffer resized from 32717bytes to 32751
Buffer resized from 32703bytes to 32765
Buffer resized from 32745bytes to 32768
Buffer resized from 32726bytes to 32763
Buffer resized from 32765bytes to 32767
Buffer resized from 32758bytes to 32765
Buffer resized from 32751bytes to 32762
Buffer resized from 32767bytes to 32769
Buffer resized from 32763bytes to 32768
Buffer resized from 32762bytes to 32768
Buffer resized from 32765bytes to 32767
Buffer resized from 32767bytes to 32769
Buffer resized from 32768bytes to 32769
Buffer resized from 32768bytes to 32769
Buffer resized from 32768bytes to 32769
terminate called after throwing an instance of 'VariantStorageManagerException'
what(): VariantStorageManagerException exception : Error while syncing array chr19$1$58617616 to disk
TileDB error message : [TileDB::utils] Error: Cannot sync file '/work/Analysis/wgs_chr19/chr19$1$58617616/.__a89fdd44-1241-43ba-9072-6fcf116fbc1d139627949156096_1538460040234'; File syncing error

Things I have checked:
I have confirmed there is enough disk space and that the working directory is on a shared volume.
It would be appreciated if you could help me with the troubleshooting.

Thanks

Haplotype Caller

GATK4 VariantAnnotator


Hello,

Has this tool been removed in the last release? :o

Thanks,
Pedro

GATK resource bundle

Found contigs with the same name but different lengths; BQSR


Normal-lod and tumor-lod in Mutect2


Hello,

What are normal-lod and tumor-lod in Mutect2?
What is the basis for the default values of 2.2 and 3.0, respectively?
And if the thresholds are lowered or raised, what would happen in terms of mutation calling?
Please explain using very basic terms (AD, AF, ...).

Many thanks,
Luke

GATK4.alpha emitDroppedReads option


Does GATK4.alpha support the -emitDroppedReads option that works with -bamout?

Mutect2 troubleshooting


I run Mutect2 in conjunction with a different variant caller and I try to keep the cutoffs as similar as possible. I don't expect the results to be identical. One caller might return twice as many variants, but those will generally include most of the ones from the other caller.

However, occasionally the differences are substantial. One of them might return 5x as many variants and with a poor overlap. Clearly, there is a problem with my samples in that case. However, I am not sure how to quantify the underlying cause. Is there a systematic way to try to diagnose some quality issues?

For example, the depth and evenness of coverage are important, but are not necessarily sufficient. Are there other metrics I should be tracking?

Error when running oncotator


Hi,

I installed Oncotator v1.8.0.0 on RHEL.

I am getting following error when running the command oncotator -h

Traceback (most recent call last):
  File "/usr/bin/oncotator", line 9, in <module>
    load_entry_point('Oncotator==v1.8.0.0', 'console_scripts', 'oncotator')()
  File "/usr/lib/python2.6/site-packages/distribute-0.6.15-py2.6.egg/pkg_resources.py", line 305, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.6/site-packages/distribute-0.6.15-py2.6.egg/pkg_resources.py", line 2244, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.6/site-packages/distribute-0.6.15-py2.6.egg/pkg_resources.py", line 1954, in load
    entry = __import__(self.module_name, globals(), globals(), ['__name__'])
  File "/usr/lib/python2.6/site-packages/Oncotator-v1.8.0.0-py2.6.egg/oncotator/Oncotator.py", line 52, in <module>
    from oncotator.utils.RunSpecificationFactory import RunSpecificationFactory
  File "/usr/lib/python2.6/site-packages/Oncotator-v1.8.0.0-py2.6.egg/oncotator/utils/RunSpecificationFactory.py", line 2, in <module>
    from oncotator.DatasourceFactory import DatasourceFactory
  File "/usr/lib/python2.6/site-packages/Oncotator-v1.8.0.0-py2.6.egg/oncotator/DatasourceFactory.py", line 53, in <module>
    from oncotator.datasources.EnsemblTranscriptDatasource import EnsemblTranscriptDatasource
  File "/usr/lib/python2.6/site-packages/Oncotator-v1.8.0.0-py2.6.egg/oncotator/datasources/EnsemblTranscriptDatasource.py", line 75
    POPULATED_ANNOTATION_NAMES = {'transcript_exon', 'variant_type', 'variant_classification', 'other_transcripts',
                                 ^
SyntaxError: invalid syntax

What am I doing wrong?

LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 6.5 (Santiago)
Release: 6.5
Codename: Santiago

Linux 2.6.32-431.17.1.el6.x86_64 x86_64

HaplotypeCaller may fail to detect a variant given the same reads with a different composition.

$
0
0

I have run into a confusing variant detection issue. The attached png file shows the results of the exact same NextSeq experiment, but with different read extraction ranges.

NextSeq2_point.bam: a bam composed of the reads that cover position chr16:89100686 only.

NextSeq2_region.bam: a bam composed of the reads that cover the region chr16:89100686 +-100 bp.
At position chr16:89100686, I presume T>C should be detected, but HaplotypeCaller fails to detect the variant with NextSeq2_region.bam.

NextSeq2_point.vcf:
chr16 89100686 . T C,<NON_REF> 7397.77 . DP=199;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=716400.00 GT:AD:DP:GQ:PL:SB 1/1:0,199,0:199:99:7426,599,0,7426,599,7426:0,0,155,44

NextSeq2_region.vcf:
chr16 89100686 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:0,199:199:0:0,0,0

What causes the difference and why?

--- GATK Version (Docker latest)
    Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
    Running:
        /gatk/build/install/gatk/bin/gatk HaplotypeCaller --version
    Version:4.0.1.2
---

--- Command used
gatk HaplotypeCaller -I /temp/NextSeq2_region.bam -O /temp/NextSeq2_region.vcf -R /temp/genome.fa -L /temp/only16.bed --debug true --output-mode EMIT_ALL_SITES --all-site-pls true --dont-trim-active-regions true --emit-ref-confidence BP_RESOLUTION
---
--- Genome Version: hg38
--- bed
chr16   89100681    89101347    NM_174917.4_cds_2_0_chr16_89100682_f    0   +

If you need the bams and vcfs, I can post them here.
