Where is FastaAlternateReferenceMaker in GATK 4.0?
I want to build a reference with mutations using FastaAlternateReferenceMaker, but I cannot find the command
for FastaAlternateReferenceMaker. Please tell me where it is.
Thank you!
Allele Depth (AD) / Allele Balance (AB) Filtering in GATK 4
Hi,
I am trying to filter my GATK 4.0.3 HaplotypeCaller-generated multi-sample VCF on the allele depth (AD) annotation at the sample genotype level (i.e., in the per-sample "FORMAT" fields).
I think that prior to GATK 4 this information was available as "Allele Balance" (AB) ratios (generated by AlleleBalanceBySample), but that annotation is no longer available in GATK 4. So I tried to filter genotypes based on the AD field, which carries the same information but as an "X,Y" array of integers. This array format makes it difficult to filter on the depth of the alternative allele divided by the total depth at a specific site.
Can you please recommend any solution to this problem? If I could turn this array into a ratio, I could easily filter genotypes using VariantFiltration or other tools such as vcflib/vcffilter. I also tried the command below (following https://gatkforums.broadinstitute.org/gatk/discussion/1255/what-are-jexl-expressions-and-how-can-i-use-them-with-the-gatk):
gatk VariantFiltration -R $ref -V $vcf -O $output --genotype-filter-expression 'vc.getGenotype("Sample1").getAD().1 / vc.getGenotype("Sample1").getAD().0 > 0.33' --set-filtered-genotype-to-no-call --genotype-filter-name 'ABfilter'
This worked, but strangely it filters the genotypes of all samples if even one sample has allele depths that are out of balance (as defined by the filter). If it applied only to Sample1, I was planning to write a quick loop over all samples. I tried the same with GATK 3.8, but it still filters the variant across all samples when the expression fails in just one of them.
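For reference, this is roughly the loop I was planning (just a sketch with the sample names hard-coded for illustration; the 1.0 is only there to avoid integer division in JEXL, and of course it still runs into the same per-variant behaviour described above):
# run VariantFiltration once per sample, feeding each output into the next pass
samples="Sample1 Sample2 Sample3"
current=$vcf
for s in $samples; do
  gatk VariantFiltration \
    -R $ref \
    -V $current \
    -O ${s}_ABfiltered.vcf \
    --genotype-filter-expression "vc.getGenotype(\"$s\").getAD().1 / (1.0 * vc.getGenotype(\"$s\").getAD().0) > 0.33" \
    --genotype-filter-name "ABfilter_${s}" \
    --set-filtered-genotype-to-no-call
  current=${s}_ABfiltered.vcf
done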
Infinity RGQ and no ReadPosRankSum
Hi,
I have the following two questions, it would be great if you could help me with:
What are non-variant sites for which the genotype quality is "Infinity"? What does that mean?
NW_009243200.1 47644 . G . Infinity . AN=2;ClippingRankSum=0.00;DP=57;ExcessHet=3.01 GT:DP:RGQ 0/0:57:99
NW_009243203.1 24000 . G . Infinity . AN=2;DP=46;ExcessHet=3.01 GT:DP:RGQ 0/0:42:99
Why do some sites not have the "ReadPosRankSum" field?
NW_009243187.1 1403 . A T 719.03 . AC=2;AF=1.00;AN=2;DP=32;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=44.82;QD=29.00;SOR=1.022 GT:AD:DP:GQ:PL 1/1:0,19:19:56:733,56,0
NW_009243191.1 855 . T C 241.60 . AC=1;AF=0.500;AN=2;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=27.64;SOR=1.329 GT:AD:DP:GQ:PGT:PID:PL 0/1:0,6:6:59:0|1:843_C_A:249,0,59
Thanks,
Homa
Germline short variant discovery with GATK4, exome sequencing, single sample
Dear GATK team,
I'm implementing "Germline short variant discovery" (we often do not have matched normal samples, otherwise I'd go for the somatic pipeline) with GATK4. We do exome sequencing and we generally look at only 1 or a couple of samples at a time.
Thus the following sentence in joint-discovery-gatk4-local.wdl has me a bit concerned:
## - Bare minimum 1 WGS sample or 30 Exome samples. Gene panels are not supported.
How hard is this requirement for 30 samples minimum and what step exactly requires it? What would you recommend if I just have a single BAM file from a single exome sequenced patient/sample, made using processing-for-variant-discovery-gatk4.wdl?
Highest regards,
Freek.
Can we run GATK 4 MuTect2 on indel realigned files?
I had used the old GATK 3 pipeline for preparing BAM files for germline variants. The pipeline included indel realignment, which was part of the Best Practice Pipeline back then.
Those BAM files include normal and tumor samples. Now I'd like to use the latest GATK 4 Mutect2 to run tumor/normal somatic variant calling. However, those BAM files have been indel realigned, whereas the latest GATK 4 no longer requires indel realignment.
Question:
Would giving indel-realigned BAM files (from GATK 3) to GATK 4 Mutect2 introduce bias? In general, can I reuse the BAM files prepared with the GATK 3 pipeline (e.g. duplicate marking) with the new GATK 4 Mutect2 tool?
Picard CollectSequencingArtifactMetrics and CollectOxoGMetrics return all zero stats for iontorrent
Hi,
I'm trying to run some Picard QC on Ion Torrent BAM files. The results table turned out to be all "0" or "?", without any errors in the log files.
Below is an example of the command and the log output.
java -Xmx16g -jar /DCEG/Resources/Tools/Picard/Picard-2.10.10/picard.jar CollectOxoGMetrics I=/DCEG/Projects/Exome/Followup/NP0318-AP1_Mirabello_TP53_MSKCCosteosarcoma/build1/BAM/OS_SC-1_A_rawlib.bam O=oxoG_metrics.txt R=/CGF/Sequencing/IonTorrent/PGM_Primary_Data/referenceLibrary/tmap-f3/hg19/hg19.fasta VALIDATION_STRINGENCY=LENIENT
09:37:29.061 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/nfs/gigantor/ifs/DCEG/Resources/Tools/Picard/Picard-2.10.10/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Jun 05 09:37:29 EDT 2018] CollectOxoGMetrics INPUT=/DCEG/Projects/Exome/Followup/NP0318-AP1_Mirabello_TP53_MSKCCosteosarcoma/build1/BAM/OS_SC-1_A_rawlib.bam OUTPUT=oxoG_metrics.txt VALIDATION_STRINGENCY=LENIENT REFERENCE_SEQUENCE=/CGF/Sequencing/IonTorrent/PGM_Primary_Data/referenceLibrary/tmap-f3/hg19/hg19.fasta MINIMUM_QUALITY_SCORE=20 MINIMUM_MAPPING_QUALITY=30 MINIMUM_INSERT_SIZE=60 MAXIMUM_INSERT_SIZE=600 INCLUDE_NON_PF_READS=true USE_OQ=true CONTEXT_SIZE=1 STOP_AFTER=2147483647 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue Jun 05 09:37:29 EDT 2018] Executing as luow2@cgemsIII on Linux 2.6.32-696.6.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_141-b16; Deflater: Intel; Inflater: Intel; Picard version: 2.10.10-SNAPSHOT
INFO 2018-06-05 09:37:29 CollectOxoGMetrics Generated 16 context strings.
INFO 2018-06-05 09:37:29 CollectOxoGMetrics Loading dbSNP File: null
INFO 2018-06-05 09:37:29 CollectOxoGMetrics Starting iteration.
[Tue Jun 05 09:37:31 EDT 2018] picard.analysis.CollectOxoGMetrics done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2058354688
Does this mean something is wrong with my BAM format, or that BAM files generated by tmap are not suitable for these metrics?
Thanks,
Wen
How to keep unique sample ID when combining gvcf files?
Hello,
I am working with RNA-seq data, and I need to get SNP calls for multiple samples (12). I first tried following the Best Practices method with HaplotypeCaller and later merging my VCF files. However, I realized that when I do this, any site that is not a variant in all of my samples is marked as missing data for the non-variant samples. This is a problem because I need to know which of these samples are actually missing data and which match the reference.
I don't think the gVCF mode of HaplotypeCaller is completely supported for RNA-seq yet, but a paper doing similar work to mine used it and it seemed to work well for them. Because of this, I gave it a try, but I keep running into the same problem: when I combine my .g.vcf files, all of my samples get merged together. I need to make a combined VCF file in which all of my sample IDs remain unique. Is there a way to do this? Thank you very much for your help, and I'm sorry if this has been asked before; I have done a lot of searching but can't seem to find this question.
Picard MarkDuplicates behavior
Hi,
Based on what I could find on this forum, it appears that Picard's MarkDuplicates is "library-aware" (link). However, I am not exactly sure what that means. One comment in the thread says, "In our pipeline, we mark duplicates twice (once at the lane level then again after merging samples across lanes)."
Does that mean that a read fragment which appears in two replicates run across two lanes will be marked as a duplicate in the second step?
Thanks,
V
VariantFiltration - log4j:WARN
Hi,
I have run this command:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -jar /share/apps/bio/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar VariantFiltration -R /home/shared/resources/hgRef/hg38/Homo_sapiens_assembly38.fasta -V WES_HChg38noG_49x150_4x100_rawSNP.vcf --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || ExcessHet > 54.69" --filter-name HF_snp_filter_w_ExHet -O WES_HChg38noG_49x150_4x100_HF-SNP.vcf
18:39:06.282 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/apps/bio/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
18:39:06.454 INFO VariantFiltration - ------------------------------------------------------------
18:39:06.454 INFO VariantFiltration - The Genome Analysis Toolkit (GATK) v4.0.3.0
18:39:06.454 INFO VariantFiltration - For support and documentation go to https://software.broadinstitute.org/gatk/
18:39:06.455 INFO VariantFiltration - Executing as manolis@genemonster on Linux v3.5.0-36-generic amd64
18:39:06.455 INFO VariantFiltration - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_91-b14
18:39:06.456 INFO VariantFiltration - Start Date/Time: April 4, 2018 6:39:06 PM CEST
18:39:06.456 INFO VariantFiltration - ------------------------------------------------------------
18:39:06.456 INFO VariantFiltration - ------------------------------------------------------------
18:39:06.457 INFO VariantFiltration - HTSJDK Version: 2.14.3
18:39:06.457 INFO VariantFiltration - Picard Version: 2.17.2
18:39:06.457 INFO VariantFiltration - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:39:06.457 INFO VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:39:06.458 INFO VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:39:06.458 INFO VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:39:06.458 INFO VariantFiltration - Deflater: IntelDeflater
18:39:06.458 INFO VariantFiltration - Inflater: IntelInflater
18:39:06.458 INFO VariantFiltration - GCS max retries/reopens: 20
18:39:06.458 INFO VariantFiltration - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
18:39:06.459 INFO VariantFiltration - Initializing engine
18:39:07.805 INFO FeatureManager - Using codec VCFCodec to read file file:///home/manolis/GATK4/IlluminaExomePairEnd/6.vcf/filtered/WES_HChg38noG_49x150_4x100_rawSNP.vcf
18:39:08.392 INFO VariantFiltration - Done initializing engine
18:39:08.659 INFO ProgressMeter - Starting traversal
18:39:08.660 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
log4j:WARN No appenders could be found for logger (org.apache.commons.jexl2.JexlEngine).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
18:39:18.697 INFO ProgressMeter - chr1:65116577 0.2 179000 1070147.5
Do you have any suggestions about how to fix this WARN? I found some threads mentioning this error, but I missed the solution...
Many thanks
basic bam-readcount function in GATK?
Hi,
I'm hoping to find a GATK solution to replace the use of bam-readcount in a pipeline.
For a given BAM and a given position, I need to get the number of reads supporting each base (reference and alternate), as well as the overall read depth at that position.
If this functionality is present, pointers to the appropriate classes/methods would be greatly appreciated.
I apologize if this is an obvious question. This is my first rodeo.
thanks.
Best practices for ContEst for WGS
Hi,
What are the best practices for ContEst as far as the population file and interval list parameters go? I would imagine that interval list should be irrelevant when looking across the whole genome, but this seems to be a required argument.
I'm currently using /xchip/cga/reference/hg19/hg19_population_stratified_af_hapmap_3.3.vcf for the population file -- is that appropriate for WGS?
Thanks for the advice,
Eric
When should I use -L to pass in a list of intervals?
The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.
In a nutshell, if you’re doing:
- Whole genome analysis: intervals are not required but they can help speed up analysis
- Whole exome analysis: you must provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you must provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet
Important notes:
Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.
Example uses of -L:
- -L 20 for chromosome 20 in the b37/b39 build
- -L chr20:1-100 for chromosome 20, positions 1-100, in the hg18/hg19 build
- -L intervals.list (or intervals.interval_list, or intervals.bed), where the value passed to the argument is a text file containing intervals
- -L some_variant_calls.vcf, where the value passed to the argument is a VCF file containing variant records; their genomic coordinates will be used as intervals.
Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.
- For example, HLA-A*01:01:01:01 is a new contig in GRCh38. Colons in contig names are a new feature of GRCh38 compared to prior assemblies. This has implications for the -L option of GATK, as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
- When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
- However, when passing in an entire contig, for contigs with colons in the name you must add :1+ to the end of the contig name, as shown below. This ensures that the whole name is identified as the contig name rather than being interpreted as genomic coordinates.
-L HLA-A*01:01:01:01:1+
So here’s a little more detail for each experimental design type.
Whole genome analysis
It is not necessary to use an intervals list in whole genome analysis -- presumably you're interested in the whole genome!
However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromeres) where you know the data is not reliable or is very messy, causing excessive slowdowns. You can do this by providing a list of "good" intervals with -L, or you could also provide a list of "bad" intervals with -XL, which does the exact opposite of -L: it excludes the provided intervals. We share the whole-genome interval lists (of good intervals) that we use in our production pipelines in our resource bundle (see the Download page).
Whole exome analysis
By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.
Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
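For example, a HaplotypeCaller command using a target list with 100 bp of padding might look like this (a sketch; the file names are placeholders, and -ip is the short form of the GATK3 --interval_padding engine argument):
java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fa \
    -I exome_sample.bam \
    -L exome_targets.interval_list \
    -ip 100 \
    -o exome_sample_variants.vcf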
Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.
Tool | -L? | Why / why not |
---|---|---|
BaseRecalibrator | YES | This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration. |
PrintReads | NO | Output is a bam file; using -L would lead to lost data. |
UnifiedGenotyper/HaplotypeCaller | YES | We’re only interested in making calls in exome regions; the rest is a waste of time and includes lots of false positives. |
Next steps | NO | No need since subsequent steps operate on the callset, which was restricted to the exome at the calling step. |
Small targeted experiments
The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.
Debugging / troubleshooting
You can use -L a lot while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain the data from that interval. This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).
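For example, to snip out a small region of a BAM to attach to a bug report, a command along these lines does the trick (GATK3 syntax; the interval and file names are just placeholders):
java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fa \
    -I problem_sample.bam \
    -L 20:10000000-10200000 \
    -o snippet.bam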
Does FilterByOrientationBias consider normal samples?
FilterByOrientationBias takes the output of CollectSequencingArtifactMetrics to do somatic variant filtering. Its manual says:
CollectSequencingArtifactMetrics should be run for both the normal sample and the tumor sample, if the matched normal is available.
But the example command-line shown in the manual is:
gatk-launch --javaOptions "-Xmx4g" FilterByOrientationBias \
--artifactModes 'G/T' \
-V tumor_unfiltered.vcf.gz \
-P tumor.pre_adapter_detail_metrics \
--output oxog_filtered.vcf.gz
The input only involves the tumor sample. Do I really need to run CollectSequencingArtifactMetrics on the matched normal sample? If so, how should I use its output in FilterByOrientationBias?
Thanks.
Error in Mutect2
I'm running the following command for mutect2 and keep getting an error:
java -jar gatk Mutect2 \
-R ../../cf3genome.fa \
-I test.bam \
--tumor-sample test \
-O a1.vcf
A USER ERROR has occurred: Bad input: Sample test is not in BAM header: [20]
But I added "test" to the header with Picard tools; here's my header:
samtools view -H test.bam
@HD VN:1.5 SO:coordinate
@SQ SN:1 LN:122678785
@SQ SN:10 LN:69331447
@SQ SN:11 LN:74389097
@SQ SN:12 LN:72498081
@SQ SN:13 LN:63241923
@SQ SN:14 LN:60966679
@SQ SN:15 LN:64190966
@SQ SN:16 LN:59632846
@SQ SN:17 LN:64289059
@SQ SN:18 LN:55844845
@SQ SN:19 LN:53741614
@SQ SN:2 LN:85426708
@SQ SN:20 LN:58134056
@SQ SN:21 LN:50858623
@SQ SN:22 LN:61439934
@SQ SN:23 LN:52294480
@SQ SN:24 LN:47698779
@SQ SN:25 LN:51628933
@SQ SN:26 LN:38964690
@SQ SN:27 LN:45876710
@SQ SN:28 LN:41182112
@SQ SN:29 LN:41845238
@SQ SN:3 LN:91889043
@SQ SN:30 LN:40214260
@SQ SN:31 LN:39895921
@SQ SN:32 LN:38810281
@SQ SN:33 LN:31377067
@SQ SN:34 LN:42124431
@SQ SN:35 LN:26524999
@SQ SN:36 LN:30810995
@SQ SN:37 LN:30902991
@SQ SN:38 LN:23914537
@SQ SN:4 LN:88276631
@SQ SN:5 LN:88915250
@SQ SN:6 LN:77573801
@SQ SN:7 LN:80974532
@SQ SN:8 LN:74330416
@SQ SN:9 LN:61074082
@SQ SN:MT LN:16727
@SQ SN:X LN:123869142
@SQ SN:JH373233.1 LN:2660953
@SQ SN:JH373234.1 LN:1881673
@SQ SN:JH373235.1 LN:1415205
@SQ SN:JH373236.1 LN:1067467
@SQ SN:JH373238.1 LN:881102
@SQ SN:JH373237.1 LN:866315
@SQ SN:JH373239.1 LN:822601
@SQ SN:JH373241.1 LN:745551
...
#A bunch of unassembled contigs here
...
@RG ID:test LB:test PL:illumin SM:20 PU:unit1
@PG ID:STAR PN:STAR VN:STAR_2.5.3a CL:/projects/evcon@colostate.edu/STAR-2.5.3a/bin/Linux_x86_64/STAR --genomeDir /scratch/summit/evcon@colostate.edu/str_idx/ --readFilesIn AESC006_1_val_1.fq.gz AESC006_2_val_2.fq.gz --readFilesCommand zcat --outFileNamePrefix stralgnout/AESC006_1_val_1 --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 1 --sjdbGTFfile /scratch/summit/evcon@colostate.edu/cf3gtf.gtf --quantMode GeneCounts
@CO user command line: /projects/evcon@colostate.edu/STAR-2.5.3a/bin/Linux_x86_64/STAR --genomeDir /scratch/summit/evcon@colostate.edu/str_idx/ --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --sjdbGTFfile /scratch/summit/evcon@colostate.edu/cf3gtf.gtf --outFilterMultimapNmax 1 --readFilesCommand zcat --outFileNamePrefix stralgnout/AESC006_1_val_1 --readFilesIn AESC006_1_val_1.fq.gz AESC006_2_val_2.fq.gz
(howto) Recalibrate base quality scores = run BQSR
Objective
Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.
Prerequisites
- TBD
Steps
- Analyze patterns of covariation in the sequence dataset
- Do a second pass to analyze covariation remaining after recalibration
- Generate before/after plots
- Apply the recalibration to your sequence data
1. Analyze patterns of covariation in the sequence dataset
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-I input_reads.bam \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-o recal_data.table
Expected Result
This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.
It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.
Note that -L 20 is used here and in the next steps to restrict analysis to only chromosome 20 in the b37 human genome reference build. To run against a different reference, you may need to change the name of the contig according to the nomenclature used in your reference.
2. Do a second pass to analyze covariation remaining after recalibration
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R reference.fa \
-I input_reads.bam \
-L 20 \
-knownSites dbsnp.vcf \
-knownSites gold_indels.vcf \
-BQSR recal_data.table \
-o post_recal_data.table
Expected Result
This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.
3. Generate before/after plots
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
-T AnalyzeCovariates \
-R reference.fa \
-L 20 \
-before recal_data.table \
-after post_recal_data.table \
-plots recalibration_plots.pdf
Expected Result
This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.
4. Apply the recalibration to your sequence data
Action
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fa \
-I input_reads.bam \
-L 20 \
-BQSR recal_data.table \
-o recal_reads.bam
Expected Result
This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag --emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.
Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.
PCR Duplicate detection on PCR-Free Libraries
Hi
We've just started a big project in which we plan to use GATK with PCR-free libraries, and I was curious what your thoughts are on using PCR duplicate detection (MarkDuplicates) with the current Illumina PCR-free libraries. Internally, do you still run this stage of the GATK pipeline with the new libraries?
Looking at your Best Practices documents I see it's still in there, but I didn't see anything mentioning PCR-free libraries.
Thanks!
Possible to GenotypeGVCFs at all sites?
Dear,
I am using GATK 4.0.4.0, following the Best Practices for joint variant calling on a cohort of samples. Everything works. However, in the GenotypeGVCFs step I would like to genotype all sites, including non-variant ones, at least for a specific set of genes. I assume that this is currently impossible. Will this functionality be ported to GATK4? Is there another way to get the same result, for example by switching the HaplotypeCaller output from -ERC GVCF to -ERC BP_RESOLUTION? In my final VCF I would like to see genotypes for all positions. I am asking because with the current VCF I cannot discriminate between highly reliable and less reliable reference genotypes at positions where no sample has a variant allele.
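For example, something along these lines is what I have in mind (just a sketch; genes.interval_list is a placeholder for my genes of interest):
gatk HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC BP_RESOLUTION \
    -L genes.interval_list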
Best wishes,
Wouter
gVCF from different gatk versions
Hello,
I'm adding individuals to my project. Can I use GATK4 to create the new g.vcf files and then combine them with my old g.vcf files (produced with GATK3) for joint genotyping and filtering in GATK4?
Introduction to the GATK Best Practices
This document provides important context about how the GATK Best Practices are developed and what their limitations are.
Contents
- What are the GATK Best Practices?
- Analysis phases
- Experimental designs
- Workflow scripts provided as reference implementations
- Scope and limitations
- What is not GATK Best Practices?
- Beware legacy scripts
1. What are the GATK Best Practices?
Reads-to-variants workflows used at the Broad Institute.
The GATK Best Practices provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. There are several different GATK Best Practices workflows tailored to particular applications depending on the type of variation of interest and the technology employed. The Best Practices documentation attempts to describe in detail the key principles of the processing and analysis steps required to go from raw reads coming off the sequencing machine, all the way to an appropriately filtered variant callset that can be used in downstream analyses. Wherever we can, we try to provide guidance regarding experimental design, quality control (QC) and pipeline implementation options, but please understand that those are dependent on many factors including sequencing technology and the hardware infrastructure that are at your disposal, so you may need to adapt our recommendations to your specific situation.
2. Analysis phases
Although the Best Practices workflows are each tailored to a particular application (type of variation and experimental design), overall they follow similar patterns, typically comprising two or three analysis phases depending on the application.
(1) Data Pre-processing is the first phase in all cases, and involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.
(2) Variant Discovery proceeds from analysis-ready BAM files and produces variant calls. This involves identifying genomic variation in one or more individuals and applying filtering methods appropriate to the experimental design. The output is typically in VCF format although some classes of variants (such as CNVs) are difficult to represent in VCF and may therefore be represented in other structured text-based formats.
(3) Depending on the application, additional steps such as filtering and annotation may be required to produce a callset ready for downstream genetic analysis. This typically involves using resources of known variation, truthsets and other metadata to assess and improve the accuracy of the results as well as attach additional information.
3. Experimental designs
Whole genomes. Exomes. Gene panels. RNAseq.
These are the major experimental designs we support explicitly. Some of our workflows are specific to a single experimental design, while others can be adapted to additional designs with some modifications. This is indicated in the workflow documentation where applicable. Note that any workflow tagged as applicable to whole genome sequence (WGS) and other designs is presented by default in the form that is suitable for whole genomes, and must be modified to apply to the others as recommended in the workflow documentation. Exomes, gene panels and other targeted sequencing experiments generally share the same workflow for a given variant type with only minor modifications.
4. Workflow scripts provided as reference implementations
Less guesswork, more reproducibility.
It's one thing to know what steps should be run (which is what the Best Practices tell you) and quite another to set up a pipeline that does it in practice. To help you cross this important gap, we provide the scripts that we use in our own pipelines as reference implementations. The scripts are written in WDL, a workflow description language designed specifically to be readable and writable by humans without an advanced programming background. WDL scripts can be run on Cromwell, an open-source execution engine that can connect to a variety of different platforms, whether local or cloud-based, through pluggable backends. See the Pipelining Options section for more on the Cromwell + WDL pipelining solution.
We also make all the GATK Best Practices workflows available in ready-to-run form on FireCloud, our cloud-based analysis portal, which you can read more about here.
Note that some of the production scripts we provide are specifically optimized to run on the Google Cloud Platform.
Wherever possible we also provide "generic" versions that are not platform-specific.
5. Scope and limitations
We can't test for every possible use case or technology.
We develop and validate these workflows in collaboration with many investigators within the Broad Institute's network of affiliated institutions. They are deployed at scale in the Broad's production pipelines -- a very large scale indeed. So as a general rule, the command-line arguments and parameters given in the documentation are meant to be broadly applicable (so to speak). However, our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data, organisms or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values.
In addition, several key steps make use of external resources, including validated databases of known variants. If there are few or no such resources available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware that some issues are currently still without a good solution. On the bright side, if you solve them for your field, you will be a hero to a generation of researchers and your citation index will go through the roof.
6. What is not GATK Best Practices?
Lots of workflows that people call GATK Best Practices diverge significantly from our recommendations.
Not that they're necessarily bad. Sometimes it makes perfect sense to diverge from our standard Best Practices in order to address a problem or use case that they're not designed to handle. The canonical Best Practices workflows (as run in production at the Broad) are designed specifically for human genome research and are optimized for the instrumentation (overwhelmingly Illumina) and needs of the Broad Institute sequencing facility. They can be adapted for analysis of non-human organisms of all kinds, including non-diploids, and of different data types, with varying degrees of effort depending on how divergent the use case and data type are. However, any workflow that has been significantly adapted or customized, whether for performance reasons or to fit a use case that we do not explicitly cover, should not be called "GATK Best Practices", which is a term that carries specific meaning. The correct way to refer to such workflows is "based on" or "adapted from" GATK Best Practices. When in doubt about whether a particular customization constitutes a significant divergence, feel free to ask us in the forum.
7. Beware legacy scripts
Trust, but verify.
If someone hands you a script and tells you "this runs the GATK Best Practices", start by asking what version of GATK it uses, when it was written, and what key steps it includes. Both our software and our usage recommendations evolve in step with the rapid pace of technological and methodological innovation in the field of genomics, so what was Best Practice last year (let alone in 2010) may no longer be applicable. And if all the steps seem to be in accordance with our docs (same tools in the same order), you should still check every single parameter in the commands. If anything is unfamiliar to you, you should find out what it does. If you can't find it in the documentation, ask us in the forum. It's one or two hours of your life that can save you days of troubleshooting on the tail end of the pipeline, so please protect yourself by being thorough.
How should I pre-process data from multiplexed sequencing and multi-library designs?
Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.
However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample all of which should have distinct read group IDs (RGID).
At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping, sorting and marking duplicates) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done by re-running Mark Duplicates, this time on all read group BAM files for a sample at the same time. Then we run Indel Realignment and Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below and the presentation on Broad Production Pipelines here for more details.
Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.
Example
Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two sample libraries, sampleA and sampleB, which were each sequenced on two lanes, lane1 and lane2:
- sampleA_lane1.fq
- sampleA_lane2.fq
- sampleB_lane1.fq
- sampleB_lane2.fq
These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (e.g. sampleA_lib1_lane1.fq leading to read group A11, sampleA_lib2_lane1.fq leading to read group A21, and so on).
1. Run initial steps per-readgroup once
Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a read group), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the read groups dictionary entry for guidance.
The example data becomes:
- sampleA_rgA1.bam
- sampleA_rgA2.bam
- sampleB_rgB1.bam
- sampleB_rgB2.bam
At this point we mark duplicates in each read group BAM file (dedup), which allows us to estimate the complexity of the corresponding library of origin as a quality control step. This step is optional.
The example data becomes:
- sampleA_rgA1.dedup.bam
- sampleA_rgA2.dedup.bam
- sampleB_rgB1.dedup.bam
- sampleB_rgB2.dedup.bam
Technically this first run of marking duplicates is not necessary because we will run it again per-sample, and that per-sample marking would be enough to achieve the desired result. To reiterate, we only do this round of marking duplicates for QC purposes.
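Putting step 1 together, the per-read group commands might look something like this (a sketch using BWA-MEM and Picard; the reference path, library name and picard.jar location are placeholder values):
# map one interleaved read group FASTQ, assigning read group information at this stage
bwa mem -p -R "@RG\tID:A1\tSM:sampleA\tLB:libA\tPU:lane1\tPL:ILLUMINA" \
    reference.fa sampleA_lane1.fq | \
    samtools sort -o sampleA_rgA1.bam -
# optional per-read group duplicate marking, used here only for library QC
java -jar picard.jar MarkDuplicates \
    I=sampleA_rgA1.bam \
    O=sampleA_rgA1.dedup.bam \
    M=sampleA_rgA1.dedup_metrics.txt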
2. Merge read groups and mark duplicates per sample (aggregation + dedup)
Once you have pre-processed each read group individually, you merge the read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, but for the sake of efficiency we combine it with this second round of duplicate marking (it's simply a matter of passing the multiple per-read group BAMs to MarkDuplicates in a single command).
The example data becomes:
- sampleA.merged.dedup.bam
- sampleB.merged.dedup.bam
To be clear, this is the round of marking duplicates that matters. It eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).
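Concretely, the aggregation step for sampleA might look like this (a sketch; MarkDuplicates accepts multiple inputs and writes a single merged, duplicate-marked output; file names follow the example above):
java -jar picard.jar MarkDuplicates \
    I=sampleA_rgA1.dedup.bam \
    I=sampleA_rgA2.dedup.bam \
    O=sampleA.merged.dedup.bam \
    M=sampleA.merged.dedup_metrics.txt \
    CREATE_INDEX=true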
3. Remaining per-sample pre-processing
Then you run indel realignment (optional) and base recalibration (BQSR).
The example data becomes:
- sampleA.merged.dedup.(realn).recal.bam
- sampleB.merged.dedup.(realn).recal.bam
Realigning around indels per-sample leads to consistent alignments across all lanes within a sample. This step is only necessary if you will be using a locus-based variant caller like MuTect 1 or UnifiedGenotyper (for legacy reasons). If you will be using HaplotypeCaller or MuTect2, you do not need to perform indel realignment.
Base recalibration will be applied per-read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is Illumina HiSeq or similar technology).
People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.