Given my years as a biochemist, if given two samples to compare, my first impulse is to want to know what are the functional differences, i.e. differences in proteins expressed between the two samples. I am interested in genomic alterations that ripple down the central dogma to transform a cell.
Please note the workflow that follows is NOT a part of the Best Practices. This is an illustrative, unsupported workflow. For the official Somatic Short Variant Calling Best Practices workflow, see Tutorial#11136.
To call every allele that is different between two samples, I have devised a two-pass workflow that takes advantage of Mutect2 features. This workflow uses Mutect2 in tumor-only mode and appropriates the --germline-resource
argument to supply a single-sample VCF with allele fractions instead of population allele frequencies. The workflow assumes the two case samples being compared originate from the same parental line and the ploidy and mutation rates make it unlikely that any site accumulates more than one allele change.
First, call on each sample using Mutect2's tumor-only mode.
gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
-O A.vcf
gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
-O B.vcf
Second, for each single-sample VCF, move the sample-level AF
allele-fraction annotation to the INFO field and simplify to a sites-only VCF.
This is a heuristic solution in which we substitute sample-level allele fractions for the expected population germline allele frequencies. Mutect2 is actually designed to use population germline allele frequencies in somatic likelihood calculations, so this substitution allows us to fulfill the requirement for an AF
annotation with plausible fractional values. The terminal screenshots highlight the data transpositions.
Third, call on each sample in a second pass, again in tumor-only mode, with the following additions.
gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
--germline-resource Baf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O A-B.vcf
gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
--germline-resource Aaf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O B-A.vcf
- Provide the matched single-sample callset for the case sample with the
--germline-resource
argument. - Avoid calling any allele in the
--germline-resource
by setting--max-population-af
to zero. - Maximize the probability of calling any differing allele by setting
--af-of-alleles-not-in-resource
to zero. - Prefilter sites with artifacts and cross-sample contamination with a panel of normals (PoN) in which confident variant sites for both sample A and B have been removed, e.g. with
gatk SelectVariants –V pon.vcf -XL AandB_haplotypecaller.vcf –O pon_maskAB.vcf
.
Fourth, filter out unlikely calls with FilterMutectCalls.
gatk FilterMutectCalls \
-V A-B.vcf \
-O A-B-filter.vcf
gatk FilterMutectCalls \
-V B-A.vcf \
-O B-A-filter.vcf
FilterMutectCalls provides many filters, e.g. that account for low base quality, for events that are clustered, for low mapping quality and for short-tandem-repeat contractions. Of the filters, let's consider the multiallelic filter. It discounts sites with more than two variant alleles that pass the tumor LOD threshold.
- We assume case sample variant sites will have a maximum of one allele that is different from the
--germline-resource
control. A single allele call will pass the multiallelic filter. However, if we emit any shared variant allele alongside the differing allele, e.g. for a heterozygous site without ref alleles, then the call becomes multiallelic and will be filtered, which is not what we want. We previously set Mutect2’s--max-population-af
to zero to ensure only the differing allele is called, and so here we can rely on FilterMutectCalls to filter artifactual multiallelic sites. - If multiple variant alleles are expected per call, then FilterMutectCall’s multiallelic filtering will be undesirable. For example, if changes to allele fractions for alleles that are shared was of interest for the two samples derived from the same parental line, and Mutect2
--max-population-af
was set to one in the previous step to additionally emit the shared variant alleles, then you would expect multiallelic calls. These will be indistinguishable from artifactual multiallelic sites.
This workflow produces contrastive variants. If the samples are a tumor and its matched normal, then the calls include sites where heterozygosity was lost.
We know that loss of heterozygosity (LOH) plays a role in tumorigenesis (doi:10.1186/s12920-015-0123-z). This leads us to believe the heterozygosity of proteins we express contributes to our health. If this is true, then for somatic studies, if cataloging the gain of alleles is of interest, then cataloging the loss of alleles should also be of interest. Can we assume just because variants are germline that they do not play a role in disease processes? How can we account for the combinatorial effects of the diploid nature of our genomes?
Remember regions of LOH do not necessarily represent a haploid state but can be copy-neutral or even copy-amplified. It may be that as one parental chromosome copy is lost, the other is duplicated to maintain copy number, which presumably compensates for dosage effects similar to uniparental isodisomy.