the current error I have is
Error parsing SAM header. Unrecognized header record type.
Your help is very much appreciated.
Hi GATK team,
We have a bunch of WGS samples and would like to import them with GenomicsDBImport before joint genotyping. For this project we are interested in coding sequences, so we want to use all exon coordinates from Gencode (~220K lines). In GenomicsDBImport we saw the parameter --merge-input-intervals, with this explanation:
--merge-input-intervals / -merge-input-intervals
Boolean flag to import all data in between intervals. Improves performance using large lists of intervals, as in exome sequencing, especially if GVCF data only exists for specified intervals.
What I understood is that, given an interval file such as:
chr1 1065 2000
chr1 2010 2250
chr2 500 700
chr2 800 1200
if --merge-input-intervals is set, the tool will also consider the regions between all intervals? So in effect the interval list becomes:
chr1 1065 2250
chr2 500 1200
Could you clarify? Another idea would be to run one instance of GenomicsDBImport per chromosome and then filter the VCF against the interval list using SelectVariants.
Thank you
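The reading of --merge-input-intervals described in the question above can be sketched in a few lines of Python. This is only an illustration of the poster's interpretation (one spanning interval per contig), not a statement of how GenomicsDBImport implements the flag; the function name span_per_contig is hypothetical.

```python
def span_per_contig(intervals):
    """Collapse (contig, start, end) tuples to one spanning interval per contig.

    Assumption: this mirrors the interpretation in the question, not GATK code.
    """
    spans = {}
    for contig, start, end in intervals:
        if contig in spans:
            lo, hi = spans[contig]
            spans[contig] = (min(lo, start), max(hi, end))
        else:
            spans[contig] = (start, end)
    # dicts keep insertion order, so contigs come out in first-seen order
    return [(contig, lo, hi) for contig, (lo, hi) in spans.items()]


# The four intervals quoted in the question
intervals = [
    ("chr1", 1065, 2000),
    ("chr1", 2010, 2250),
    ("chr2", 500, 700),
    ("chr2", 800, 1200),
]
print(span_per_contig(intervals))
```

If the interpretation is right, the gaps chr1:2001-2009 and chr2:701-799 would be imported along with the exons.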
Hello GATK team!
I am currently following your best practices for Mutect2 somatic calling. In the steps for creating a PoN, I generated my normal samples' gVCFs without any problem.
However, in the step just before CreateSomaticPanelOfNormals, I need to use GenomicsDBImport.
This step is not working and I am running out of ideas.
This is the command I used :
gatk --java-options "-XX:+UseParallelGC -XX:ParallelGCThreads=4 -Xmx14g -Xms14g -Djava.io.tmpdir=/tmp/pon" \
GenomicsDBImport \
-R Homo_sapiens_assembly38.fasta \
-V 1.vcf.gz -V 2.vcf.gz -V 3.vcf.gz -V 4.vcf.gz -V 5.vcf.gz -V 6.vcf.gz -V 7.vcf.gz -V 8.vcf.gz -V 9.vcf.gz -V 10.vcf.gz -V 11.vcf.gz -V 12.vcf.gz -V 13.vcf.gz -V 14.vcf.gz -V 15.vcf.gz -V 16.vcf.gz -V 17.vcf.gz -V 18.vcf.gz -V 19.vcf.gz -V 20.vcf.gz -V 21.vcf.gz -V 22.vcf.gz -V 23.vcf.gz -V 24.vcf.gz \
-L wgs_calling_regions.hg38.interval_list \
--tmp-dir=/tmp/pon \
--genomicsdb-workspace-path pon_db
And the last lines of the log are :
...
16:04:31.323 INFO ProgressMeter - Starting traversal
16:04:31.323 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
16:04:33.428 INFO GenomicsDBImport - Importing batch 1 with 24 samples
Duplicate field name AF found in vid attribute "fields"
terminate called after throwing an instance of 'FileBasedVidMapperException'
what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute "fields"
It is true: if I look for AF occurrences in one of the VCF headers, I find:
$ zgrep "AF" 1.vcf.gz
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=POPAF,Number=A,Type=Float,Description="negative-log-10 population allele frequencies of alt alleles">
What do you think the issue is?
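The duplicate reported in the log can be located programmatically. The Python sketch below scans VCF header lines for IDs that are declared as both INFO and FORMAT fields. It only finds such collisions; whether the collision is the actual cause of the crash is not established in this post. The function name shared_info_format_ids is hypothetical.

```python
import re
from collections import defaultdict

# Matches the record type and ID of INFO and FORMAT header lines
HEADER_ID = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def shared_info_format_ids(header_lines):
    """Return the sorted IDs declared as both an INFO and a FORMAT field."""
    kinds = defaultdict(set)
    for line in header_lines:
        match = HEADER_ID.match(line)
        if match:
            kinds[match.group(2)].add(match.group(1))
    return sorted(i for i, seen in kinds.items() if seen == {"INFO", "FORMAT"})


# The three header lines quoted in the post
header = [
    '##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">',
    '##INFO=<ID=POPAF,Number=A,Type=Float,Description="negative-log-10 population allele frequencies of alt alleles">',
]
print(shared_info_format_ids(header))
```

On the quoted header this reports AF only; POPAF is declared once and is not flagged.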
Hi there,
I am interested in using the GATK Best Practices to call SNPs/indels for DNA-Seq samples from Arabidopsis populations, but I am struggling to find step-by-step command lines for this kind of analysis. Which pipeline would you recommend?
Could you please direct me to the website where I can find the detailed command lines with their arguments?
Many thanks,
Dapeng
I'm using Mutect2 v4.0.4.0 to call variants for the purpose of making a panel of normals using the recommended workflow. I observe many heterozygous variants in the output VCF that have genotype 0/1 but have AD allele depths of 0 for the reference allele (and dozens to hundreds of alternate allele reads). The genotype should be 1/1, should it not?
If necessary I can provide the input data.
The command line is:
java8 -Xmx8g -jar $GATK4 Mutect2 -R $REF -I $BAM -tumor $SAMPID -O out.vcf.gz --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter
Below is the start of the log output.
12:06:22.792 WARN GATKReadFilterPluginDescriptor - Disabled filter (MateOnSameContigOrNoMappedMateReadFilter) is not enabled by this tool
12:06:22.917 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/carvajal-archive/PACKAGES/src/GATK/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:06:23.320 INFO Mutect2 - ------------------------------------------------------------
12:06:23.320 INFO Mutect2 - The Genome Analysis Toolkit (GATK) v4.0.4.0
12:06:23.321 INFO Mutect2 - For support and documentation go to https://software.broadinstitute.org/gatk/
12:06:23.321 INFO Mutect2 - Executing as twtoal@carcinos on Linux v4.4.0-109-generic amd64
12:06:23.321 INFO Mutect2 - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_152-b16
12:06:23.322 INFO Mutect2 - Start Date/Time: April 26, 2018 12:06:22 PM PDT
12:06:23.322 INFO Mutect2 - ------------------------------------------------------------
12:06:23.322 INFO Mutect2 - ------------------------------------------------------------
12:06:23.323 INFO Mutect2 - HTSJDK Version: 2.14.3
12:06:23.323 INFO Mutect2 - Picard Version: 2.18.2
12:06:23.323 INFO Mutect2 - HTSJDK Defaults.COMPRESSION_LEVEL : 1
12:06:23.323 INFO Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:06:23.323 INFO Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:06:23.323 INFO Mutect2 - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:06:23.323 INFO Mutect2 - Deflater: IntelDeflater
12:06:23.323 INFO Mutect2 - Inflater: IntelInflater
12:06:23.324 INFO Mutect2 - GCS max retries/reopens: 20
12:06:23.324 INFO Mutect2 - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
12:06:23.324 INFO Mutect2 - Initializing engine
12:06:25.847 INFO Mutect2 - Done initializing engine
12:06:28.279 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/share/carvajal-archive/PACKAGES/src/GATK/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
12:06:28.363 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/share/carvajal-archive/PACKAGES/src/GATK/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
12:06:28.701 WARN NativeLibraryLoader - Unable to load libgkl_pairhmm_omp.so from native/libgkl_pairhmm_omp.so (/share/carvajal-archive/tmp/twtoal/libgkl_pairhmm_omp7409537124124025621.so: /usr/lib/x86_64-linux-gnu/libgomp.so.1: version `GOMP_4.0' not found (required by /share/carvajal-archive/tmp/twtoal/libgkl_pairhmm_omp7409537124124025621.so))
12:06:28.702 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
12:06:28.702 INFO NativeLibraryLoader - Loading libgkl_pairhmm.so from jar:file:/share/carvajal-archive/PACKAGES/src/GATK/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm.so
12:06:29.534 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
12:06:29.535 WARN IntelPairHmm - Ignoring request for 4 threads; not using OpenMP implementation
12:06:29.536 INFO PairHMM - Using the AVX-accelerated native PairHMM implementation
12:06:29.976 INFO ProgressMeter - Starting traversal
Could someone please provide me with a sample command line to run Variant Recalibrator for GATK v4? I am running the tool using GATK 4 Alpha with the following command line:
~/gatk-protected/gatk-launch VariantRecalibrator -R ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/hg19/seq/hg19.fa -input Stromal-combined-New.vcf --resource hapmap,known=false,training=true,truth=true,prior=15.0 ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/hapmap_3.3.hg19.sites.vcf --resource omni,known=false,training=true,truth=true,prior=12.0 ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/1000G_omni2.5.hg19.sites.vcf --resource 1000G,known=false,training=true,truth=false,prior=10.0 ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/1000G_phase1.snps.high_confidence.hg19.sites.vcf --resource dbsnp,known=true,training=false,truth=false,prior=2.0 ~/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/dbsnp_138.hg19.vcf -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -tranchesFile Stromal-combined-New.tranches --rscriptFile Stromal-combined-New.R
and I get the following error
A USER ERROR has occurred: Invalid argument '/home/galaxy/MiSeq/Bioinformatics/Archive/ReferenceFiles/GATK/hapmap_3.3.hg19.sites.vcf'.
The command syntax follows the same pattern as this
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php
My Java version is java version "1.8.0_131"
Has the syntax been changed for GATK version 4?
Thank you very much.
Hi,
I am trying to build the SNP recalibration model by running the following GATK command:
./gatk-4.0.3.0/gatk VariantRecalibrator \
-R human_g1k_v37_decoy.fasta \
-input /mergedFiles.vcf \
--resource hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
--resource omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
--resource 1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_135.b37.vcf \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
-mode SNP \
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
--recalFile recalibrate_SNP.recal \
-tranchesFile output.tranches \
--rscriptFile output.plots.R
But I am getting the following error.
Error:
A USER ERROR has occurred: Invalid argument 'hapmap_3.3.b37.sites.vcf'.
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
I used human_g1k_v37_decoy.fasta for alignment and am therefore using the same reference for recalibration. I would like to turn the raw variants into analysis-ready variants by applying filtration and annotation. Please let me know if you have any pointers on the best practice approach.
Thanks
Hi,
I am looking to use HaplotypeCaller to call germline variants, and I am particularly interested in the orientation of these variants relative to one another (cis- or trans-). There seems to be reference to physical phasing in the [HaplotypeCaller documentation](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php#--do-not-run-physical-phasing), but I cannot find any physical phasing information in my VCF file.
For instance, I would expect the two variants below:
1 1647722 . G T 307.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.861;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=53.28;MQRankSum=-5.260;QD=10.61;ReadPosRankSum=-0.098;SOR=0.155 GT:AD:DP:GQ:PL 0/1:21,8:29:99:315,0,841
1 1647725 . G A 304.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.277;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=52.38;MQRankSum=-5.262;QD=10.50;ReadPosRankSum=-0.448;SOR=0.204 GT:AD:DP:GQ:PL 0/1:20,9:29:99:312,0,883
to be in the cis- orientation because they share nearly identical read counts, but I cannot find a corresponding annotation in the VCF file that says as much.
My command to call HaplotypeCaller is as below:
$gatk_launcher --java-options -Xmx${mem}g HaplotypeCaller \
-R $reference \
-I $bam_file \
-O $out_file \
-L $intervals_split &>> $log_file
Thank you for the help!!
This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.
This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.
In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).
In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.
This allows us to achieve the same results as joint calling in terms of accurate genotyping results, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.
This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".
This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.
1. Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option --emitRefConfidence GVCF, and using the .g.vcf extension for the output file.
Note that versions older than 3.4 require passing the options --variant_index_type LINEAR --variant_index_parameter 128000 to set the correct index strategy for the output gVCF.
2. A new tool called GenomicsDBImport is necessary to aggregate the GVCF files and feed in one GVCF to GenotypeGVCFs. You can read more about it here. You can also run CombineGVCFs if you are not able to use GenomicsDBImport.
3. Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.
4. Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.
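The first three steps above can be laid out as command lines. The Python sketch below only builds the argument lists, without running anything. It assumes GATK4-style syntax (the gatk launcher, -ERC GVCF, --genomicsdb-workspace-path and the gendb:// prefix); all file names are hypothetical, and the helper joint_discovery_commands is not part of GATK.

```python
def joint_discovery_commands(reference, sample_bams, intervals, workspace, out_vcf):
    """Build per-sample calling, import and joint genotyping commands.

    Assumptions: GATK4-style arguments; sample_bams maps sample name -> BAM path.
    """
    commands = []
    gvcfs = []
    # Step 1: one HaplotypeCaller run per sample, emitting a gVCF
    for sample, bam in sample_bams.items():
        gvcf = f"{sample}.g.vcf"
        gvcfs.append(gvcf)
        commands.append(["gatk", "HaplotypeCaller", "-R", reference,
                         "-I", bam, "-O", gvcf, "-ERC", "GVCF"])
    # Step 2: aggregate all gVCFs into one GenomicsDB workspace
    importer = ["gatk", "GenomicsDBImport",
                "--genomicsdb-workspace-path", workspace, "-L", intervals]
    for gvcf in gvcfs:
        importer += ["-V", gvcf]
    commands.append(importer)
    # Step 3: joint genotyping on the aggregated data
    commands.append(["gatk", "GenotypeGVCFs", "-R", reference,
                     "-V", f"gendb://{workspace}", "-O", out_vcf])
    return commands


for cmd in joint_discovery_commands(
        "ref.fasta", {"s1": "s1.bam", "s2": "s2.bam"},
        "intervals.list", "cohort_db", "cohort.vcf"):
    print(" ".join(cmd))
```

Adding a sample to the cohort later means one more HaplotypeCaller run plus a repeat of steps 2 and 3; the existing gVCFs are reused.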
That's it! Fairly simple in practice, but we predict this is going to have a huge impact in how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.
As always, we look forward to comments and observations from the research community!
Hello GATK team,
As you all know, there are many blogs/docs explaining how MuTect2 works but with lots of technical and statistical details. People who don't specialize in these domains can't easily understand how MuTect2 works. For this reason, I would like to have a discussion on how MuTect2 works with a simple example.
Let's say that we have the following information:
Reference genome sequence in a given region:
...ATCGTCAGATCATTTACGCCAGTCACTGACTGCACG...
The normal sample in the same region having the following reads:
...ATCGTCAGATCATTTACGCCAGTCACTGACTGCACG... (x50 times reads and the 5 reads below)
...ATCGTCAGAACATTTACGCCAGTCACTGACTGCACG...
...ATCGTCAGAACATTTACGCCAGTCACTGACTGCACG...
...ATCGTCAGAACATTTACGCCAGTCACTGACTGCACG...
...ATCGTCAGAACATTTACGCCAGTCACTGACTGCACG...
...ATCGTCAGAACATTTACGCCAGTCACTGACTGCACG...
And the tumor sample in the same region:
...ATCGTCAGAACATTTACGCCAGTCACTGACTGCACG... (x55 times reads)
...ATCGTCAGATCATTTACGCCAGTCACTGACTGCACG... (x30 times reads)
How does MuTect2 handle such a situation?
Could we go over each step, explaining simply what MuTect2 does?
I made up this example by typing a random sequence with a single variant. If there are better situations that would illustrate all the decisions MuTect2 makes when comparing reads, I would be happy to hear them.
Let's not forget that there are also the filtering options (dbSNP membership or 1k mills genome) or the hard filters to take into account:
I have another situation in mind. Let's say, for example, that the same variant is found in both the normal and the tumor sample but differs from the reference genome. What happens in this case?
Thanks in advance.
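As a starting point for the discussion, the raw numbers in the example above can be tallied in Python. This sketch only counts reads that differ from the reference. It is not how MuTect2 reaches a decision; it just makes the alternate allele fractions in the two samples explicit.

```python
REFERENCE = "ATCGTCAGATCATTTACGCCAGTCACTGACTGCACG"
VARIANT = "ATCGTCAGAACATTTACGCCAGTCACTGACTGCACG"


def mismatch_positions(read, reference):
    """0-based positions where a read differs from the reference."""
    return [i for i, (a, b) in enumerate(zip(read, reference)) if a != b]


def alt_fraction(reads, reference):
    """Fraction of reads that carry at least one mismatch."""
    alt = sum(1 for read in reads if mismatch_positions(read, reference))
    return alt / len(reads)


# Read counts taken from the post
normal = [REFERENCE] * 50 + [VARIANT] * 5
tumor = [VARIANT] * 55 + [REFERENCE] * 30

print(mismatch_positions(VARIANT, REFERENCE))    # single T>A substitution
print(round(alt_fraction(normal, REFERENCE), 3))  # 5 of 55 reads
print(round(alt_fraction(tumor, REFERENCE), 3))   # 55 of 85 reads
```

So the variant is supported by about 9% of the normal reads and about 65% of the tumor reads, which is the contrast any explanation of the example has to account for.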
Hi! Could you help me, please?
Following the GATK4 Best Practices, I did not perform any indel realignment pre-processing step.
I used GATK4 (HaplotypeCaller in GVCF mode) for variant discovery.
Afterwards, I used "SelectVariants" (GATK4) to extract each sample from a multi-sample VCF.
Now I am trying to left-align indels and split multiallelic sites into biallelic ones in my VCF files, so that I can annotate them with the Annovar tool.
For this purpose, I used the "LeftAlignAndTrimVariants" (GATK4) tool, but I got the following error message:
"A USER ERROR has occurred: 'LeftAlignAndTrimVariants' is not a valid command."
After reading many questions on the GATK forums, I concluded that "LeftAlignAndTrimVariants" is a deprecated tool that is not available in GATK4.
So I would like to know whether there is another tool in GATK4 to left-align indels and split multiallelic sites into biallelic ones, or whether I can use GATK3 to run "LeftAlignAndTrimVariants" on my VCF files generated by GATK4.
Thank you!
This tutorial introduces researchers to considerations in somatic short variant discovery using GATK4 Mutect2. Example data are based on a breast cancer cell line and its matched normal cell line derived from blood and are aligned to GRCh38 with post-alt processing [1]. The tutorial focuses on how to call traditional somatic short mutations, as described in Article#11127 and pipelined in GATK v4.0.0.0's mutect2.wdl [2]. The tool and its workflow are in BETA status as of this writing, which means they may undergo changes and are not guaranteed for production.
► For Broad Mutation Calling Best Practices, see FireCloud Article#45055.
Section 1 calls somatic mutations with Mutect2 using all the bells and whistles of the tool. Section 2 outlines how to create the panel of normals resource using the tumor-only mode of Mutect2. Section 3 outlines how to estimate cross-sample contamination. Section 4 shows how to filter the callset with FilterMutectCalls. Unlike GATK3, in GATK4 the somatic calling and filtering functionalities are embodied by separate tools. Section 5 shows an optional filtering step to filter by sequence context artifacts that present with orientation bias, e.g. OxoG artifacts. Section 6 shows how to set up in IGV for manual review. Finally, section 7 provides a brief list of related resources that may be of interest to researchers.
GATK4 Mutect2 is a versatile variant caller that not only is more sensitive than, but is also roughly twice as fast as, HaplotypeCaller's reference confidence mode. Researchers who wish to customize analyses should find the tutorial's descriptions of the multiple levers of Mutect2 in section 1 and descriptions of the tumor-only mode of Mutect2 in section 2 of interest.
gatk launch script.
Download tutorial_11136.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data and resources, see [3] and [4].
► The tutorial steps switch between the subset and full data. Some of the data files, e.g. BAMs, are restricted to a small region of the genome to efficiently pace the tutorial. Other files, e.g. the Mutect2 calls that the tutorial filters, are from the entire genome. The tutorial content was originally developed for the 2017-09 Helsinki workshop and we make the full data files, i.e. the resource files and the BAMs, available at gs://gatk-best-practices/somatic-hg38.
Here we have a rather complex command to call somatic variants on the HCC1143 tumor sample using Mutect2. For a synopsis of what somatic calling entails, see Article#11127. The command calls somatic variants in the tumor sample and uses a matched normal, a panel of normals (PoN) and a population germline variant resource.
gatk --java-options "-Xmx2g" Mutect2 \
-R hg38/Homo_sapiens_assembly38.fasta \
-I tumor.bam \
-I normal.bam \
-tumor HCC1143_tumor \
-normal HCC1143_normal \
-pon resources/chr17_pon.vcf.gz \
--germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
--af-of-alleles-not-in-resource 0.0000025 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 1_somatic_m2.vcf.gz \
-bamout 2_tumor_normal_m2.bam
This produces a raw unfiltered somatic callset 1_somatic_m2.vcf.gz, a reassembled reads BAM 2_tumor_normal_m2.bam and the respective indices 1_somatic_m2.vcf.gz.tbi and 2_tumor_normal_m2.bai.
Comments on select parameters
Provide the tumor BAM with -I and the sample's read group sample name (the SM field value) with -tumor. To look up the read group SM field use GetSampleName. Alternatively, use samtools view -H tumor.bam | grep '@RG'.
Provide the matched normal BAM with -I and the control sample's read group sample name (the SM field value) with -normal. In the case of a tumor with a matched normal control, we can exclude even rare germline variants and individual-specific artifacts. If we analyze our tumor sample with Mutect2 without the matched normal, we get an order of magnitude more calls than with the matched normal.
Provide a panel of normals with -pon. Section 2 outlines how to create a PoN. The panel of normals not only represents common germline variant sites, it also presents commonly noisy sites in sequencing data, e.g. mapping artifacts or other somewhat random but systematic artifacts of sequencing. By default, the tool does not reassemble nor emit variant sites that match identically to a PoN variant. To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.
Provide a population germline resource with --germline-resource. The germline resource must contain allele-specific frequencies, i.e. it must contain the AF annotation in the INFO field [4]. The tool annotates variant alleles with the population allele frequencies. When using a population germline resource, consider adjusting the --af-of-alleles-not-in-resource parameter from its default of 0.001. For example, the gnomAD resource af-only-gnomad_grch38.vcf.gz represents ~200k exomes and ~16k genomes and the tutorial data is exome data, so we adjust --af-of-alleles-not-in-resource to 0.0000025, which corresponds to 1/(2*exome samples). The default of 0.001 is appropriate for human sample analyses without any population resource. It is based on the human average rate of heterozygosity. The population allele frequencies (POP_AF) and the af-of-alleles-not-in-resource factor in probability calculations of the variant being somatic.
Include reads that would otherwise be filtered with --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This filter removes from analysis paired reads whose mate maps to a different contig. Because of the way BWA crisscrosses mate information for mates that align better to alternate contigs (in alt-aware mapping to GRCh38), we want to include these types of reads in our analysis. Otherwise, we may miss out on detecting SNVs and indels associated with alternate haplotypes. Disabling this filter deviates from current production practices.
Restrict the analysis to specific genomic intervals with the -L parameter. Here we specify this option to speed up our run on the small tutorial data. For the full callset we use in section 4, calling was on the entirety of the data, without an intervals file.
Write the reassembled alignments to a file with -bamout. The bamout alignments contain the artificial haplotypes and reassembled alignments for the normal and tumor and enable manual review of calls. The parameter is not required by the tool but is recommended as adding it costs only a small fraction of the total run time.
To illustrate how Mutect2 applies annotations, below are five multiallelic sites from the full callset. Pull these out with gzcat somatic_m2.vcf.gz | awk '$5 ~","'. The awk '$5 ~","' subsets records that contain a comma in the 5th column.
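The --af-of-alleles-not-in-resource value given in the parameter comments above follows from simple arithmetic, sketched here in Python. The sample count of 200,000 is the approximate exome count quoted in the text; the function name is hypothetical.

```python
def af_of_alleles_not_in_resource(n_samples):
    """Frequency of one allele copy among 2 * n_samples chromosomes.

    Assumes diploid samples, as in the 1/(2*exome samples) rule in the text.
    """
    return 1.0 / (2 * n_samples)


exome_samples = 200_000  # ~200k exomes in the gnomAD resource, per the text
print(af_of_alleles_not_in_resource(exome_samples))
```

For 200,000 exomes this gives 0.0000025, the value passed to Mutect2 in the command.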
We see eleven columns of information per variant call including genotype calls for the normal and tumor. Notice the empty fields for QUAL and FILTER, and annotations at the site (INFO) and sample level (columns 10 and 11). The samples each have genotypes and when a site is multiallelic, we see allele-specific annotations. Samples may have additional annotations, e.g. PGT and PID that relate to phasing.
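The comma test that the awk filter applies to the 5th column can also be written in Python, which is handy when the records need further processing. This is a minimal sketch: the two example records are made up for illustration, and unlike the awk one-liner it skips header lines and splits on tabs only.

```python
def multiallelic_records(vcf_lines):
    """Keep VCF records whose ALT column (5th field) lists several alleles."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines carry no ALT column
        fields = line.rstrip("\n").split("\t")
        if "," in fields[4]:
            kept.append(line)
    return kept


# Hypothetical records, not taken from the tutorial callset
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t100\t.\tA\tC,T\t.\t.\tDP=50",
    "chr17\t200\t.\tG\tA\t.\t.\tDP=40",
]
print(multiallelic_records(records))
```

Only the first record is kept, because its ALT field lists two alleles.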
We can view the standard FORMAT-level and INFO-level Mutect2 annotations in the VCF header.
The Variant Annotations section of the Tool Documentation further describes some of the annotations. For a complete list of annotations available in GATK4, see this site.
To enable specific filtering that relies on nonstandard annotations, or just to add additional annotations, use the -A argument. For example, -A ReferenceBases adds the ReferenceBases annotation to variant calls. Note that if an annotation a filter relies on is absent, FilterMutectCalls will skip the particular filtering without any warning messages.
To understand the impact, consider some numbers. After all other read filters, the MateOnSameContigOrNoMappedMateReadFilter (MOSCO) filter additionally removes from analysis 8.71% (8,681,271) tumor sample reads and 8.18% (6,256,996) normal sample reads from the full data. The impact of disabling the MOSCO filter is that reads on alternate contigs and read pairs that span contigs can now lend support to variant calls.
For the tutorial data, including reads normally filtered by the MOSCO filter roughly doubles the number of Mutect2 calls. The majority of the additional calls comes from the ALT, HLA and decoy contigs.
We go through the motions of creating a PoN using three germline samples. These samples are HG00190, NA19771 and HG02759 [3].
First, run Mutect2 in tumor-only mode on each normal sample. In tumor-only mode, a single case sample is analyzed with the -tumor flag without an accompanying matched control -normal sample. For the tutorial, we run this command only for sample HG00190.
gatk Mutect2 \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta \
-I HG00190.bam \
-tumor HG00190 \
--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter \
-L chr17plus.interval_list \
-O 3_HG00190.vcf.gz
This generates a callset 3_HG00190.vcf.gz and a matching index. Mutect2 calls variants in the sample with the same sensitive criteria it uses for calling mutations in the tumor in somatic mode. Because the command omits the use of options that trigger upfront filtering, we expect all detectable variants to be called. The calls will include low allele fraction variants and sites with multiple variant alleles, i.e. multiallelic sites. Here are two multiallelic records from 3_HG00190.vcf.gz.
We see that for each site, Mutect2 calls the ref allele and three alternate alleles. The GT genotype call is 0/1/2/3. The AD allele depths are 16,3,12,4 and 41,5,24,4, respectively, for the two sites.
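The AD values quoted above translate directly into per-allele read fractions. The Python sketch below does the division for the two sites; the function name allele_fractions is hypothetical, and the result is a plain ratio of depths rather than any annotation Mutect2 itself emits.

```python
def allele_fractions(ad):
    """Convert a comma-separated AD string (ref first) to per-allele fractions."""
    depths = [int(x) for x in ad.split(",")]
    total = sum(depths)
    return [round(depth / total, 3) for depth in depths]


# AD strings for the two multiallelic sites discussed in the text
for ad in ("16,3,12,4", "41,5,24,4"):
    print(ad, allele_fractions(ad))
```

At both sites the reference allele and one alternate allele account for most reads, while the two remaining alternates each sit at roughly 5% to 11%.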
Comments on select parameters
One option not used here is --germline-resource. Remember from section 1 this resource must contain AF population allele frequencies in the INFO column. Use of this resource in tumor-only mode, just as in somatic mode, allows upfront filtering of common germline variant alleles. This effectively omits common germline variant alleles from the PoN. Note the related optional parameter --max-population-af (default 0.01) defines the cutoff for allele frequencies. Given a resource, and read evidence for the variant, Mutect2 will still emit variant alleles with AF less than or equal to the --max-population-af.
As in section 1, the command includes --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter. This particular option is relevant for alt-aware and post-alt processed alignments.
Second, collate all the normal VCFs into a single callset with CreateSomaticPanelOfNormals. For the tutorial, to illustrate the step with small data, we run this command on three normal sample VCFs. The general recommendation for a panel of normals is a minimum of forty samples.
gatk CreateSomaticPanelOfNormals \
-vcfs 3_HG00190.vcf.gz \
-vcfs 4_NA19771.vcf.gz \
-vcfs 5_HG02759.vcf.gz \
-O 6_threesamplepon.vcf.gz
This generates a PoN VCF 6_threesamplepon.vcf.gz and an index. The tutorial PoN contains 8,275 records.
CreateSomaticPanelOfNormals retains sites with variants in two or more samples. It retains the alleles from the samples but drops all other annotations to create an eight-column, sites-only VCF as shown.
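The retention rule described above, keeping sites seen in two or more samples, can be sketched in Python. This is a simplified illustration that keys on chromosome and position only; it is not the CreateSomaticPanelOfNormals implementation, and the site coordinates are made up.

```python
from collections import Counter


def sites_in_two_or_more(sample_sites, min_samples=2):
    """Return sites present in at least min_samples of the per-sample call sets."""
    counts = Counter()
    for sites in sample_sites.values():
        counts.update(set(sites))  # count each sample at most once per site
    return sorted(site for site, n in counts.items() if n >= min_samples)


# Hypothetical (chrom, pos) calls for the three tutorial normals
calls = {
    "HG00190": [("chr17", 100), ("chr17", 250)],
    "NA19771": [("chr17", 100), ("chr17", 400)],
    "HG02759": [("chr17", 250), ("chr17", 100)],
}
print(sites_in_two_or_more(calls))
```

The site seen in only one normal (position 400) is dropped, which is why a PoN built from few samples captures only the most recurrent artifacts.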
Ideally, the PoN includes samples that are technically representative of the tumor case sample--i.e. samples sequenced on the same platform using the same chemistry, e.g. exome capture kit, and analyzed using the same toolchain. However, even an unmatched PoN will be remarkably effective in filtering a large proportion of sequencing artifacts. This is because mapping artifacts and polymerase slippage errors occur for pretty much the same genomic loci for short read sequencing approaches.
What do you think of including samples of family members in the PoN?
For example, consider variant calling on data that represents a pool of individuals or a collective of highly similar but distinct DNA molecules, e.g. mitochondrial DNA. Mutect2 calls multiple variants at a site in a computationally efficient manner. Furthermore, the tumor-only mode can be co-opted to simply call differences between two samples. This approach is described in Blog#11315.
First, run GetPileupSummaries on the tumor BAM to summarize read support for a set number of known variant sites. Use a population germline resource containing only common biallelic variants, e.g. subset by using SelectVariants --restrict-alleles-to BIALLELIC, as well as population AF allele frequencies in the INFO field [4]. The tool tabulates read counts that support reference, alternate and other alleles for the sites in the resource.
gatk GetPileupSummaries \
-I tumor.bam \
-V resources/chr17_small_exac_common_3_grch38.vcf.gz \
-O 7_tumor_getpileupsummaries.table
This produces a six-column table as shown. The alt_count is the count of reads that support the ALT allele in the germline resource. The allele_frequency corresponds to that given in the germline resource. Counts for other_alt_count refer to reads that support all other alleles.
Comments on select parameters
The tool considers sites in the resource whose population allele frequency falls between --minimum-population-allele-frequency (default 0.01) and --maximum-population-allele-frequency (default 0.2). The rationale for these settings is as follows. If the homozygous alternate site has a rare allele, we are more likely to observe the presence of REF or other more common alleles if there is cross-sample contamination. This allows us to measure contamination more accurately.
To restrict the analysis to particular intervals, use the -L argument.
On the -L and -V parameters: for the tutorial, provide the same resources/chr17_small_exac_common_3_grch38.vcf.gz file to each parameter. For details, see the GetPileupSummaries tool documentation.
Second, estimate contamination with CalculateContamination. The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.
gatk CalculateContamination \
-I 7_tumor_getpileupsummaries.table \
-O 8_tumor_calculatecontamination.table
This produces a table with estimates for contamination and error. The estimate for the full tumor sample is shown below and gives a contamination fraction of 0.0205. Going forward, we know to suspect calls with less than ~2% alternate allele fraction.
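To make the rule of thumb above concrete, the following Python sketch flags calls whose alternate allele fraction falls below the estimated contamination fraction. The call names and fractions are hypothetical, and the comparison is only the informal reading given in the text, not the FilterMutectCalls logic.

```python
CONTAMINATION_FRACTION = 0.0205  # estimate reported for the full tumor sample


def suspect_calls(alt_fractions, contamination=CONTAMINATION_FRACTION):
    """Return the sites whose alternate allele fraction is below the estimate."""
    return [site for site, af in alt_fractions.items() if af < contamination]


# Hypothetical sites with their alternate allele fractions
example = {"chr17:1000": 0.015, "chr17:2000": 0.020, "chr17:3000": 0.35}
print(suspect_calls(example))
```

The first two sites fall under the ~2% level and would deserve a closer look; the third is far above it.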
Comments on select parameters
The tool can also take a matched normal's pileup summaries table with the -matched argument.
► Cross-sample contamination differs from normal contamination of tumor and tumor contamination of normal. Currently, the workflow does not account for the latter type of purity issue.
One thing to rule out is sample swaps at the read group level.
Picard’s CrosscheckFingerprints can detect sample-swaps at the read group level and can additionally measure how related two samples are. Because sequencing can involve multiplexing a sample across lanes and regrouping a sample’s multiple read groups, depending on the level of automation in handling these, there is a possibility of including read groups from unrelated samples. The inclusion of such a cross-sample in the tumor sample would be detrimental to a somatic analysis. Without getting into details, the tool allows us to (i) check at the sample level that our tumor and normal are related, as it is imperative they should come from the same individual and (ii) check at the read group level that each of the read group data come from the same individual.
Again, imagine if we mistook the contaminating read group data for some tumor subpopulation! The tutorial normal and tumor samples consist of 16 and 22 read groups respectively, and when we provide these and set EXPECT_ALL_GROUPS_TO_MATCH=true, CrosscheckReadGroupFingerprints (a tool now replaced by CrosscheckFingerprints) informs us "All read groups related as expected."
FilterMutectCalls determines whether a call is a confident somatic call. The tool uses the annotations within the callset and applies preset thresholds that are tuned for human somatic analyses.
Filter the Mutect2 callset with FilterMutectCalls. Here we use the full callset, somatic_m2.vcf.gz. To activate filtering based on the contamination estimate, provide the contamination table with --contamination-table. In GATK v4.0.0.0, the tool uses the contamination estimate as a hard cutoff.
gatk FilterMutectCalls \
-V somatic_m2.vcf.gz \
--contamination-table tumor_calculatecontamination.table \
-O 9_somatic_oncefiltered.vcf.gz
This produces a VCF callset 9_somatic_oncefiltered.vcf.gz and index. Calls that are likely true positives get the PASS label in the FILTER field, and calls that are likely false positives are labeled with the reason(s) for filtering in the FILTER field of the VCF. We can view the available filters in the VCF header using grep '##FILTER'.
This step seemingly applies 14 filters, including contamination. However, if an annotation a filter relies on is absent, the tool skips the particular filtering. The filter will still appear in the header. For example, the duplicate_evidence filter requires a nonstandard annotation that our callset omits.
So far, we have 3,695 calls, of which 2,966 are filtered and 729 pass as confident somatic calls. Of the filtered, contamination filters eight calls, all of which would have been filtered for other reasons. For the statistically inclined, this may come as a surprise. However, remember that the great majority of contaminant variants would be common germline alleles, for which we have in place other safeguards.
► In the next GATK version, FilterMutectCalls will use a statistical model to filter based on the contamination estimate.
FilterByOrientationBias allows filtering based on sequence context artifacts, e.g. OxoG and FFPE. This step is optional and if employed, should always be performed after filtering with FilterMutectCalls. The tool requires the pre_adapter_detail_metrics from Picard CollectSequencingArtifactMetrics.
First, collect metrics on sequence context artifacts with CollectSequencingArtifactMetrics. The tool categorizes these as those that occur before hybrid selection (preadapter) and those that occur during hybrid selection (baitbias). Results provide a global view across the genome that empowers decision making in ways that site-specific analyses cannot. The metrics can help decide whether to consider downstream filtering.
gatk CollectSequencingArtifactMetrics \
-I tumor.bam \
-O 10_tumor_artifact \
--FILE_EXTENSION ".txt" \
-R ~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
Alternatively, use the tool from a standalone Picard jar.
java -jar picard.jar \
CollectSequencingArtifactMetrics \
I=tumor.bam \
O=10_tumor_artifact \
FILE_EXTENSION=.txt \
R=~/Documents/ref/hg38/Homo_sapiens_assembly38.fasta
This generates five metrics files, including pre_adapter_detail_metrics, which contains counts that FilterByOrientationBias uses. Below are the summary pre_adapter_summary_metrics for the full data. Our samples were not from FFPE so we do not expect this artifact. However, it appears that we could have some OxoG transversions.
Picard metrics are described in detail here. For the purposes of this tutorial, we focus on the TOTAL_QSCORE.
F1R2 and F2R1 annotations count the reads in the pair orientation supporting the allele(s). This is a change from GATK3’s FOXOG (fraction OxoG) annotation.
Second, perform orientation bias filtering with FilterByOrientationBias. We provide the tool with the once-filtered calls 9_somatic_oncefiltered.vcf.gz, the pre_adapter_detail_metrics file and the sequencing contexts for FFPE (C→T transition) and OxoG (G→T transversion). The tool knows to include the reverse complement contexts.
gatk FilterByOrientationBias \
-A G/T \
-A C/T \
-V 9_somatic_oncefiltered.vcf.gz \
-P tumor_artifact.pre_adapter_detail_metrics.txt \
-O 11_somatic_twicefiltered.vcf.gz
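To make the remark about reverse complement contexts concrete, here is a minimal sketch that expands the two artifact modes passed with -A into the four single-base substitutions they cover. The helper names `reverse_complement_mode` and `expand_modes` are hypothetical, and this illustrates the idea only; it is not the tool's implementation.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement_mode(mode):
    """Map an artifact mode such as 'G/T' to its reverse complement 'C/A'."""
    ref, alt = mode.split("/")
    # For a single base, the reverse complement is simply the complement.
    return f"{COMPLEMENT[ref]}/{COMPLEMENT[alt]}"


def expand_modes(modes):
    """Return each mode followed by its reverse complement, without duplicates."""
    expanded = []
    for mode in modes:
        for candidate in (mode, reverse_complement_mode(mode)):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_modes(["G/T", "C/T"]))  # ['G/T', 'C/A', 'C/T', 'G/A']
```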
This produces a VCF 11_somatic_twicefiltered.vcf.gz, index and summary 11_somatic_twicefiltered.vcf.gz.summary. In the summary, we see the number of calls for the sequence context and the number of those that the tool filters.
Is the filtering in line with our earlier prediction?
In the VCF header, we see the addition of the 15th filter, orientation_bias, which the tool applies to 56 calls. All 56 of these calls were previously PASS sites, i.e. unfiltered. We now have 673 passing calls out of 3,695 total calls.
The table shows the breakdown in filters applied to 11_somatic_twicefiltered.vcf.gz. The middle column tallies the instances in which each filter was applied across the calls and the third column tallies the instances in which a filter was the sole reason for a site not passing. Of the total calls, ~18% (673/3,695) are confident somatic calls. Of the filtered calls, ~56% (1,694/3,022) are filtered singly. We see an average of ~1.73 filters per filtered call (5,223/3,022).
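The two tallies described here (how often each filter is applied, and how often it is the sole filter) can be sketched as follows. The function name `tally_filters` and the example FILTER values are hypothetical; the counts in the second half are the ones quoted in the text, used only to reproduce the stated percentages.

```python
from collections import Counter


def tally_filters(filter_fields):
    """Count filter usage from VCF FILTER column values.

    Returns (applied, sole): `applied` counts every appearance of a filter,
    `sole` counts the records where it is the only filter.
    """
    applied, sole = Counter(), Counter()
    for field in filter_fields:
        if field in ("PASS", "."):
            continue
        names = field.split(";")
        applied.update(names)
        if len(names) == 1:
            sole[names[0]] += 1
    return applied, sole


# Hypothetical FILTER values, for illustration only.
example = ["PASS", "t_lod", "panel_of_normals;t_lod", "panel_of_normals", "PASS"]
applied, sole = tally_filters(example)
print(applied["t_lod"], sole["t_lod"])  # 2 1

# Arithmetic behind the figures quoted in the text.
total, passing, filter_instances, sole_filtered = 3695, 673, 5223, 1694
filtered = total - passing
print(filtered)                                  # 3022
print(round(100 * passing / total, 1))           # 18.2
print(round(100 * sole_filtered / filtered, 1))  # 56.1
print(round(filter_instances / filtered, 2))     # 1.73
```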
Which filters appear to have the greatest impact? What types of calls do you think compel manual review?
Examine passing records with the following command. Take note of the AD and AF annotation values in particular, as they show the high sensitivity of the caller.
gzcat 11_somatic_twicefiltered.vcf.gz | grep -v '#' | awk '$7=="PASS"' | less
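For scripted review, the same selection can be sketched in Python. The function below reads a gzipped VCF, keeps records whose FILTER column is PASS and pulls out each sample's AD and AF values. The function name `passing_records` is hypothetical, and the sketch assumes a tab-delimited VCF with a #CHROM header line and a FORMAT column.

```python
import gzip


def passing_records(vcf_path):
    """Yield (chrom, pos, {sample: (AD, AF)}) for records whose FILTER is PASS."""
    samples = []
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information header lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample names follow the FORMAT column
                continue
            if fields[6] != "PASS":
                continue
            keys = fields[8].split(":")
            per_sample = {}
            for name, column in zip(samples, fields[9:]):
                values = dict(zip(keys, column.split(":")))
                per_sample[name] = (values.get("AD"), values.get("AF"))
            yield fields[0], fields[1], per_sample
```

Calling `passing_records("11_somatic_twicefiltered.vcf.gz")` and iterating over the result would list the passing sites together with their AD and AF strings.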
Deriving a good somatic callset involves comparing callsets, e.g. from different callers or calling approaches, manually reviewing passing and filtered calls and, if necessary, combining callsets and additional filtering. Manual review extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.
To manually review calls, use the feature-rich desktop version of the Integrative Genomics Viewer (IGV). Remember that Mutect2 makes calls on reassembled alignments that do not necessarily reflect those of the starting BAM. Given this, viewing the raw BAM is insufficient for understanding calls. We must examine the bamout that Mutect2's graph-assembly produces.
First, load Human (hg38) as the reference in IGV. Then load these six files in order:
With the exception of the somatic callset 11_somatic_twicefiltered.vcf.gz, the subset regions the data cover are in chr17plus.interval_list.
Second, navigate IGV to the TP53 locus (chr17:7,666,402-7,689,550).
chr17_af-only-gnomad_grch38.vcf.gz and collapse its view.
11_somatic_twicefiltered.vcf.gz, the gray rectangle in exon 3, by click-dragging on the ruler.
11_somatic_twicefiltered.vcf.gz to view INFO level annotations. Similarly, the blue call underneath gives HCC1143_tumor sample level information.
A C→T variant is in tumor.bam but not normal.bam. What is happening in 2_tumor_normal_m2.bam?
Third, tweak IGV settings that aid in visualizing reassembled alignments.
2_tumor_normal_m2.bam. Shift+select on the left panels for tracks tumor.bam, normal.bam and their coverages. Right-click and Remove Tracks.
Right-click on the alignments track and
Scroll to take note of the number of groups. Click on a read in each group to determine which group belongs to which sample.
What are the three grouped tracks for the bamout? What do the pastel versus gray colors indicate? How plausible is it that all tumor copies of this locus have this alteration?
Here is the corresponding VCF record. Remember Mutect2 makes no ploidy assumption. The GT field tabulates the presence for each allele starting with the reference allele.
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO |
---|---|---|---|---|---|---|---|
chr17 | 7,674,220 | . | C | T | . | PASS | DP=122;ECNT=1;NLOD=13.54;N_ART_LOD=-1.675e+00;POP_AF=2.500e-06;P_GERMLINE=-1.284e+01;TLOD=257.15 |
FORMAT | GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB |
---|---|
HCC1143_normal | 0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false |
HCC1143_tumor | 0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946 |
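To read the sample columns of this record, pair each FORMAT key with the value in the same position. The sketch below does that for the two sample strings shown above; the helper names `parse_sample` and `int_list` are hypothetical. It also shows that the tumor's raw AD ratio (70/70) differs from the reported AF of 0.973, which suggests that AF is a model estimate rather than a plain read ratio, and that the F1R2 and F2R1 alt counts sum to the alt AD.

```python
FORMAT = ("GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:"
          "OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB")
NORMAL = "0/0:45,0:0.032:19,0:26,0:0:151,0:0:0:false:false"
TUMOR = ("0/1:0,70:0.973:0,34:0,36:33:0,147:60:21:true:false:"
         "0.486:0.00:46.01:100.00:0.990,0.990,1.00:0.028,0.026,0.946")


def parse_sample(format_keys, sample_column):
    """Pair FORMAT keys with a sample's values; trailing keys may be absent."""
    return dict(zip(format_keys.split(":"), sample_column.split(":")))


def int_list(value):
    """Split a comma-separated per-allele value into integers."""
    return [int(x) for x in value.split(",")]


tumor = parse_sample(FORMAT, TUMOR)
normal = parse_sample(FORMAT, NORMAL)

ref_reads, alt_reads = int_list(tumor["AD"])
print(ref_reads, alt_reads)                                     # 0 70
print(int_list(tumor["F1R2"])[1] + int_list(tumor["F2R1"])[1])  # 70
print(alt_reads / (ref_reads + alt_reads), tumor["AF"])         # 1.0 0.973

# Keys present for the tumor but dropped from the shorter normal column.
print(sorted(set(tumor) - set(normal)))
```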
Finally, here are the indel calls for which we have bamout alignments. All 17 of these happen to be filtered. Explore a few of these sites in IGV to practice the motions of setting up for manual review and to study the logic behind different filters.
CHROM | POS | REF | ALT | FILTER |
---|---|---|---|---|
chr17 | 4,539,344 | T | TA | artifact_in_normal;germline_risk;panel_of_normals |
chr17 | 7,221,420 | CACTGCCCTAGGTCAGGA | C | artifact_in_normal;panel_of_normals;str_contraction |
chr17 | 7,483,063 | A | AC | mapping_quality;t_lod |
chr17 | 8,513,688 | GTT | G | panel_of_normals |
chr17 | 19,748,387 | G | GA | t_lod |
chr17 | 26,982,033 | G | GC | artifact_in_normal;clustered_events |
chr17 | 30,059,463 | CT | C | t_lod |
chr17 | 35,422,473 | C | CA | t_lod |
chr17 | 35,671,734 | CTT | C,CT,CTTT | artifact_in_normal;multiallelic;panel_of_normals |
chr17 | 43,104,057 | CA | C | artifact_in_normal;germline_risk;panel_of_normals |
chr17 | 43,104,072 | AAAAAAAAAGAAAAG | A | panel_of_normals;t_lod |
chr17 | 46,332,538 | G | GT | artifact_in_normal;panel_of_normals |
chr17 | 47,157,394 | CAA | C | panel_of_normals;t_lod |
chr17 | 50,124,771 | GCACACACACACACACA | G | clustered_events;panel_of_normals;t_lod |
chr17 | 68,907,890 | GA | G | artifact_in_normal;base_quality;germline_risk;panel_of_normals;t_lod |
chr17 | 69,182,632 | C | CA | artifact_in_normal;t_lod |
chr17 | 69,182,835 | GAAAA | G | panel_of_normals |
The next step after generating a carefully manicured somatic callset is typically functional annotation.
For a cohort, after annotation, use MutSig to discover driver mutations. MutsigCV (the version is CV) is available on GenePattern. If more samples are needed to increase the power of the analysis, consider padding the analysis set with TCGA Project or other data.
The dSKY plot at https://figshare.com/articles/D_SKY_for_HCC1143/2056665 shows somatic copy number alterations for the HCC1143 tumor sample. Its colorful results remind us that calling SNVs and indels is only one part of cancer genome analyses. Somatic copy number alteration detection will be covered in another GATK tutorial. For reference implementations of Somatic CNV workflows see here.
[1] Data was alt-aware aligned to GRCh38 and post-alt processed. For an introduction to alt-aware alignment and post-alt processing, see [Blog#8180](https://software.broadinstitute.org/gatk/blog?id=8180). The HCC1143 alignments are identical to those in [Tutorial#9183](https://software.broadinstitute.org/gatk/documentation/article?id=9183), which uses GATK3 MuTect2.
[2] For scripted GATK Best Practices Somatic Short Variant Discovery workflows, see [https://github.com/gatk-workflows](https://github.com/gatk-workflows). Within the repository, as of this writing, [gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels), which uses GRCh37, is the sole GATK4 Mutect2 workflow. This tutorial uses additional parameters not used in the [GRCh37 gatk-somatic-snvs-indels](https://github.com/gatk-workflows/gatk4-somatic-snvs-indels) example because the tutorial data was preprocessed with post-alt processing of alt-aware alignments, which deviates from production practices. The general workflow steps remain the same.
[3] About the tutorial data:
chr17plus.interval_list, the m2pon consists of forty 1000 Genomes Project samples, pon to panel of normals, tumor to the tumor HCC1143 breast cancer sample and normal to its matched blood normal.
b4d1ddd. The GATK Docker image was broadinstitute/gatk:4.0.0.0 and Picard was v2.14.1. A single modification was made to the script to enable generating the bamout. The script was run locally on a Google Cloud Compute VM using Cromwell v30.1. Given Docker was installed and the specified Docker images were present on the VM, Cromwell automatically launched local Docker container instances during the run and handled the local files as hard-links to avoid redundant copying. Workflow input variables were as follows.
{
"##_COMMENT1:": "WORKFLOW STEP OPTIONS",
"Mutect2.is_run_oncotator": "False",
"Mutect2.is_run_orientation_bias_filter": "True",
"Mutect2.picard": "/home/shlee/picard-2.14.1.jar",
"Mutect2.gatk_docker": "broadinstitute/gatk:4.0.0.0",
"Mutect2.oncotator_docker": "broadinstitute/oncotator:1.9.3.0",
...
"##_COMMENT3:": "ANALYSIS PARAMETERS",
"Mutect2.artifact_modes": ["G/T", "C/T"],
"Mutect2.m2_extra_args": "--af-of-alleles-not-in-resource 0.0000025 --disable-read-filter MateOnSameContigOrNoMappedMateReadFilter",
"Mutect2.m2_extra_filtering_args": "",
"Mutect2.scatter_count": "10"
}
mutect2.wdl that allow setting SplitIntervals optional arguments, then @shlee recommends setting --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION to avoid splitting contigs.
[4] The WDL script [mutect_resources.wdl](https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect_resources.wdl) takes a large gnomAD VCF or other typical cohort VCF and from it prepares both a simplified germline resource for use in _section 1_ and a common biallelic variants resource for use in _section 3_. The script first generates a sites-only VCF and in the process _removes all extraneous annotations_ except for `AF` allele frequencies. We recommend this simplification as the unburdened VCF allows Mutect2 to run much more efficiently. To generate the common biallelic variants resource, the script then selects the biallelic sites from the sites-only VCF.
hi, sometimes we are interested in some important gene sites, which may not be PASS in Mutect2 or HaplotypeCaller, so is there an argument to add this site list, not all sites, but just these sites?
thanks a lot
We ran GATK 3.7 HaplotypeCaller on a sample to get a .gVCF file a few months back. Recently we tested the same sample with the same GATK 3.7 HaplotypeCaller parameters and found differences in the DP and PL values for many variants when comparing the two output .gVCF files from these two runs.
The command line parameters used for both the runs:
java -Xmx32g -Djava.io.tmpdir=Temp/ -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fa -I sample.bam -nct 24 --dbsnp dbsnp138.vcf --genotyping_mode DISCOVERY --minPruning 2 -newQual -stand_call_conf 30 --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -L chr1 -G none -l INFO -log sample.log -o sample_chr1.g.vcf.gz
Sample differences between the two files, extracted using the diff command:
F1 chr1 marks a line extracted from the .gVCF file generated a few months back
F2 chr1 marks a line extracted from the .gVCF file generated recently
Change 1 observed: DP, PL values different between two output .GVCF files from these two runs
F1 chr1 1510162 . A <NON_REF> . . END=1510162 GT:DP:GQ:MIN_DP:PL 0/0:46:12:46:0,12,1425
F2 chr1 1510162 . A <NON_REF> . . END=1510162 GT:DP:GQ:MIN_DP:PL 0/0:45:9:45:0,9,1380
F1 chr1 6941045 . C <NON_REF> . . END=6941080 GT:DP:GQ:MIN_DP:PL 0/0:14:0:7:0,0,139
F2 chr1 6941045 . C <NON_REF> . . END=6941080 GT:DP:GQ:MIN_DP:PL 0/0:15:0:7:0,0,139
F1 chr1 45683203 rs34100486 CTTTT C,<NON_REF> 177.60 . DB;MLEAC=1,0;MLEAF=0.500,0.00 GT:GQ:PL:SB 0/1:22:185,0,22,188,37,225:1,0,3,2
F2 chr1 45683203 rs34100486 CTTTT C,<NON_REF> 168.60 . DB;MLEAC=1,0;MLEAF=0.500,0.00 GT:GQ:PL:SB 0/1:22:176,0,22,179,37,215:1,0,3,2
Change 2 observed: 29 variants added in the recent run's .gVCF output file which were not present in the previous run's .gVCF output file
Below are a few sample variants added to the new run's .gVCF output file
F2 chr1 15357649 . G <NON_REF> . . END=15357649 GT:DP:GQ:MIN_DP:PL 0/0:41:94:41:0,94,1235
F2 chr1 15357650 . A <NON_REF> . . END=15357650 GT:DP:GQ:MIN_DP:PL 0/0:39:99:39:0,102,1284
Change 3 observed: 10 variants present in the previous run's .gVCF output file which were not present in the recent run's .gVCF output file
Below are a few sample variants present only in the previous run's .gVCF output file
F1 chr1 9282514 . C CTCCCCCTCCTCCTTGTCTCCTCCTCCCTCTCCCCCT,<NON_REF> 274.01 . MLEAC=2,0;MLEAF=1.00,0.00 GT:GQ:PL:SB 1/1:20:288,20,0,289,21,290:0,0,0,3
F1 chr1 9282515 . T <NON_REF> . . END=9282515 GT:DP:GQ:MIN_DP:PL 0/0:37:0:37:0,0,820
F1 chr1 27014608 . T <NON_REF> . . END=27014608 GT:DP:GQ:MIN_DP:PL 0/0:35:91:35:0,91,1388
Could you please explain why I get different results in two runs of HaplotypeCaller and what this change in values between the two output .gVCF files means? Can this affect variant calling (joint genotyping) that will be done at a later stage with all samples together?
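For reference, the three kinds of change listed above can be tabulated with a short script instead of reading raw diff output. This is only a sketch: the function names `index_records` and `compare_runs` are hypothetical, records are keyed by CHROM and POS (which assumes one record per position), and the example lines are copied from the excerpts above.

```python
def index_records(lines):
    """Index gVCF data lines by (CHROM, POS), skipping header and blank lines."""
    records = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        records[(fields[0], int(fields[1]))] = fields
    return records


def compare_runs(old_lines, new_lines):
    """Return (changed, added, removed) lists of (CHROM, POS) keys."""
    old, new = index_records(old_lines), index_records(new_lines)
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    return changed, added, removed


old_run = [
    "chr1 1510162 . A <NON_REF> . . END=1510162 GT:DP:GQ:MIN_DP:PL 0/0:46:12:46:0,12,1425",
    "chr1 27014608 . T <NON_REF> . . END=27014608 GT:DP:GQ:MIN_DP:PL 0/0:35:91:35:0,91,1388",
]
new_run = [
    "chr1 1510162 . A <NON_REF> . . END=1510162 GT:DP:GQ:MIN_DP:PL 0/0:45:9:45:0,9,1380",
    "chr1 15357649 . G <NON_REF> . . END=15357649 GT:DP:GQ:MIN_DP:PL 0/0:41:94:41:0,94,1235",
]
changed, added, removed = compare_runs(old_run, new_run)
print(changed)  # [('chr1', 1510162)]
print(added)    # [('chr1', 15357649)]
print(removed)  # [('chr1', 27014608)]
```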
Hello GATK team,
I have been trying the brand new Mutect2 v4.1.1.0 on ctDNA samples with high coverage. I noticed one case where Mutect2 v4.1.0.0 detected a variant, but the new version didn't even emit the mismatch.
Here is the original command I used:
gatk Mutect2 -R reference.fasta -I normal.bam \
-O variants.vcf.gz --germline-resource gnomadhg19.vcf.gz \
--panel-of-normals pon.vcf.gz -L targets.interval_list \
-ip 300 -normal normalSample -I tumor1.bam \
-I tumor2.bam -I tumor3.bam -I tumor4.bam
When I realized the missing variant, I reran Mutect2 with the following additional parameters, but that didn't help.
--force-active true --tumor-lod-to-emit 0 --initial-tumor-lod 0
Here is an IGV snapshot of the variant (it exists in only one tumor sample: Ref: 820, Alt: 29, N: 2)
The reads in the upper part are from the original bam and in the lower part from the bam file emitted by Mutect2
Thank you for your help
Hi.
I've been using the DepthOfCoverage tool for coverage estimation for human WGS data, which was aligned with BWA MEM, filtered using samtools and passed through MarkDuplicates. I tried to run DepthOfCoverage in parallel mode using -nt and --omitIntervalStatistics and in a single-threaded mode. All the data is stored on an SSD and being processed on a server with 12 actual cores. Surprisingly, the speed of data processing as reported by ProgressMeter is two times faster in a single-threaded mode (15 sec per 1 million sites vs 30 sec). I understand the limitations of I/O, but it is confusing when compared with some other GATK (non-Spark) tools which are actually able to process data in -nt or -nct mode with reading/writing.
Is this behaviour actually what is supposed to happen? Any comment would be greatly appreciated.
hi, imagine there is A -> T in position 2, and A -> G in position 3 or A -> NONE,
will gatk try to merge them (the combination can be snp+snp, snp+indel, indel+indel)? A different combination, or a single variant alone, can generate a totally different amino acid change and finally impact the drug instruction.
VarScan will merge some variants if it thinks that is ok, but also gives the single variants alone