Can the GATK Best Practices Pipeline on Google Cloud Platform be used on FASTQ inputs?
I read the documentation on this pipeline (https://cloud.google.com/genomics/docs/tutorials/gatk) and saw that its input is unaligned BAMs. Is there a way to use the pipeline with FASTQ inputs?
BQSR can't run!
Hello!
I'm running BQSR with GATK v4.0.4.0, and I've run into a problem: I can't get the right output file.
The command line is:
$GATK --java-options "-Xmx10240m -Djava.io.tmpdir=./" BaseRecalibratorSpark -R $GENOME -I $sample-md_rl.bam --known-sites /home/gaotiangang/niuguohao/201806call/50100-step5-3/combined_1.raw_snp.vcf -O $sample.4.table --spark-master local[4]
$GATK --java-options "-Xmx10240m -Djava.io.tmpdir=./" ApplyBQSRSpark -I $sample-md_rl.bam -bqsr $sample.4.table -O $sample.4.bam --spark-master local[4]
echo "BQSR 1 over "
But I get these lines back almost every time:
18/07/25 08:13:23 ERROR Executor: Exception in task 9.0 in stage 1.0 (TID 191)
java.io.IOException: Failed to create local dir in /home/gaotiangang/niuguohao/201806call/50-step5-7/niuguohao/blockmgr-accb99db-cd04-4a44-a018-0672408e3f03/1d.
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:80)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.getDataFile(IndexShuffleBlockResolver.scala:55)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:212)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:169)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/07/25 08:13:23 INFO TaskSetManager: Starting task 11.0 in stage 1.0 (TID 193, localhost, executor driver, partition 11, PROCESS_LOCAL, 4906 bytes)
18/07/25 08:13:23 INFO Executor: Running task 11.0 in stage 1.0 (TID 193)
18/07/25 08:13:23 WARN TaskSetManager: Lost task 9.0 in stage 1.0 (TID 191, localhost, executor driver): java.io.IOException: Failed to create local dir in /home/gaotiangang/niuguohao/201806call/50-step5-7/niuguohao/blockmgr-accb99db-cd04-4a44-a018-0672408e3f03/1d.
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:80)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.getDataFile(IndexShuffleBlockResolver.scala:55)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:212)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:169)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Does that mean I set something wrong in my command line?
Thanks!
Problem with CombineGVCFs
Hi,
I ran into a problem when I used CombineGVCFs. Here is the error: "java.lang.IllegalStateException: Key END found in VariantContext field INFO at chrM:247 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default."
The command is as follows: "gatk CombineGVCFs -R ~/ref/GATK_REF/ucsc.hg19.fasta -V 788_filter.vcf -V 789_filter.vcf -O Merged.vcf"
I generated the VCF file for each sample with GATK version 4, as follows:
gatk HaplotypeCaller -R $ref/ucsc.hg19.fasta -I $temp_output_dir/gp_add_sorted_dedup_chr_ordered_split_$sample_id.bam --dont-use-soft-clipped-bases true -stand-call-conf 20.0 -O $temp_output_dir/$sample_id.vcf
gatk VariantFiltration -R $ref/ucsc.hg19.fasta -V $temp_output_dir/$sample_id.vcf -window 35 -cluster 3 --filter-name FS -filter "FS > 30.0" --filter-name QD -filter "QD < 2.0" -O $final_results_dir/"$sample_id"_filter.vcf
I only retained the files $sample_id_filter.vcf.
Can you help me address this issue? Thank you very much.
Xiao
How to hard filter Mutect2 calls based on strand bias
Hi, how can I apply a hard filter to Mutect2 somatic calls based on strand bias (e.g. to filter out calls with 60 (out of 80) reads on the plus strand and only 2 (out of 70) reads on the minus strand supporting the non-ref allele)? Which FORMAT field can I use to do this with VariantFiltration? Apologies if this has been asked before; I couldn't find a similar question in my search.
Thanks,
Rao
Why is HaplotypeCaller ignoring some reads and calling 1/1 variants instead of 0/1?
Hello,
I compared the BAM output from BWA-MEM with the BAM output from HaplotypeCaller and saw that, for a specific active region, GATK ignored some reads. Because of that, some variants were not called and others were called as homozygous instead of heterozygous. Could you please help me understand why GATK does not consider these reads during the realignment step? I have checked mapping quality and alignment score and everything is OK. I am using GATK v4.
Thanks in advance.
Best
Where can I find known variants, training and truth sets, and other resource files?
For general definitions of these terms, see this Dictionary entry.
Humans
If you're working with human data, you're in luck. We provide all resource files necessary for applying the Best Practices pipelines to human data as part of our Resource Bundle, and we provide specific recommendations on which sets to use for each tool in the variant calling pipelines, as well as default settings for all parameters. See the Best Practices documentation for details to that effect.
Everything else
Unfortunately we're not currently able to provide centralized resources for non-human organisms. That means you will need to do some additional homework to find out what is available for your organism. In order to facilitate this process, we have created a forum section called Zoo & Garden specifically for the purpose of collecting information on this topic. We invite researchers who have experience in non-human genomics analysis to share their knowledge by contributing documentation to this section.
Recommended parameters for VariantRecalibrator in GATK4
Hi,
I'm trying to figure out the recommended set of parameters for VariantRecalibrator on SNPs. The official joint discovery pipeline on Github [1] and the example documentation [2] seem to differ in a few ways: a different number of Gaussians (6 vs 8), slightly different training feature sets, and different priors for dbSNP (7 vs 2). There are other differences, like --trust-all-polymorphic and training on every 10th variant in the Github version.
The dataset I'm working on: 17k samples, 15x WGS, but only called in exonic regions so far (i.e should look like WES, but with lower, yet more homogeneous depth). Variants are called with GATK4.
I understand that a lot of tweaking can be made to the parameters, but would like to start with the most "standard" version first.
Thanks!
[1] The Github pipeline:
https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/3087accf86b325bb5b511f2e7f6e8574fc0c1ff0/joint-discovery-gatk4.wdl
https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/5e3d54aa68899248af066b2fbf00954ae052f9b7/joint-discovery-gatk4-local.hg38.wgs.inputs.json
${gatk_path} --java-options "-Xmx100g -Xms100g" \
  VariantRecalibrator \
  -V ${sites_only_variant_filtered_vcf} \
  -O ${recalibration_filename} \
  --tranches-file ${tranches_filename} \
  --trust-all-polymorphic \
  -tranche ${sep=' -tranche ' recalibration_tranche_values} \
  -an ${sep=' -an ' recalibration_annotation_values} \
  -mode SNP \
  --sample-every-Nth-variant ${downsampleFactor} \
  --output-model ${model_report_filename} \
  --max-gaussians 6 \
  -resource hapmap,known=false,training=true,truth=true,prior=15:${hapmap_resource_vcf} \
  -resource omni,known=false,training=true,truth=true,prior=12:${omni_resource_vcf} \
  -resource 1000G,known=false,training=true,truth=false,prior=10:${one_thousand_genomes_resource_vcf} \
  -resource dbsnp,known=true,training=false,truth=false,prior=7:${dbsnp_resource_vcf}
and training features:
"JointGenotyping.indel_recalibration_annotation_values": ["FS", "ReadPosRankSum", "MQRankSum", "QD", "SOR", "DP"]
[2] Example in the docs:
https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/org_broadinstitute_hellbender_tools_walkers_vqsr_VariantRecalibrator.php
gatk VariantRecalibrator \
  -R Homo_sapiens_assembly38.fasta \
  -V input.vcf.gz \
  --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \
  --resource omni,known=false,training=true,truth=false,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \
  --resource 1000G,known=false,training=true,truth=false,prior=10.0:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
  --resource dbsnp,known=true,training=false,truth=false,prior=2.0:Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  -mode SNP \
  --recal-file output.recal \
  --tranches-file output.tranches \
  --rscript-file output.plots.R
Indel realignment for MuTect2 in GATK 4.0
Hello, I have already read that indel realignment is deprecated in GATK 4.0, since HaplotypeCaller does the realignment, but what about MuTect2?
In one of your posts ( https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2 ) I see you already have data where indel realignment was done. Was this intentional?
How do I proceed with the pipeline then?
Should I run indel realignment on my data with 3.8 and then use 4.0 Mutect2?
Best regards,
Srdjan
VariantStorageManagerException exception : status == TILEDB_OK
Using 4.0.4.0, I got the following rather unhelpful error from GenomicsDBImport. I suspect it might be a variant in one of my imported VCFs causing the issue, but the error message doesn't even hint at which file it was working on.
~/gatk-4.0.4.0/gatk GenomicsDBImport -R hs37d5.fa --sample-name-map gvcfs.samplemap --genomicsdb-workspace-path outputsb.workspace -L 6:29691241-33054015 --batch-size 50 --consolidate true
Using GATK jar /home/cloud-user/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/cloud-user/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar GenomicsDBImport -R hs37d5.fa --sample-name-map gvcfs.samplemap --genomicsdb-workspace-path outputsb.workspace -L 6:29691241-33054015 --batch-size 50 --consolidate true
17:01:47.406 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/cloud-user/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:01:47.522 INFO GenomicsDBImport - ------------------------------------------------------------
17:01:47.522 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.4.0
17:01:47.522 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
17:01:47.522 INFO GenomicsDBImport - Executing as cloud-user@lustre-assembly-1 on Linux v3.10.0-862.3.3.el7.x86_64 amd64
17:01:47.522 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-b10
17:01:47.523 INFO GenomicsDBImport - Start Date/Time: 17 July 2018 17:01:47 BST
17:01:47.523 INFO GenomicsDBImport - ------------------------------------------------------------
17:01:47.523 INFO GenomicsDBImport - ------------------------------------------------------------
17:01:47.523 INFO GenomicsDBImport - HTSJDK Version: 2.14.3
17:01:47.523 INFO GenomicsDBImport - Picard Version: 2.18.2
17:01:47.523 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:01:47.523 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:01:47.523 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:01:47.523 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:01:47.523 INFO GenomicsDBImport - Deflater: IntelDeflater
17:01:47.523 INFO GenomicsDBImport - Inflater: IntelInflater
17:01:47.523 INFO GenomicsDBImport - GCS max retries/reopens: 20
17:01:47.523 INFO GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
17:01:47.524 INFO GenomicsDBImport - Initializing engine
17:01:47.959 INFO IntervalArgumentCollection - Processing 3362775 bp from intervals
17:01:47.984 INFO GenomicsDBImport - Done initializing engine
Created workspace /mnt/chla/outputsb.workspace
17:01:48.120 INFO GenomicsDBImport - Vid Map JSON file will be written to outputsb.workspace/vidmap.json
17:01:48.120 INFO GenomicsDBImport - Callset Map JSON file will be written to outputsb.workspace/callset.json
17:01:48.120 INFO GenomicsDBImport - Complete VCF Header will be written to outputsb.workspace/vcfheader.vcf
17:01:48.120 INFO GenomicsDBImport - Importing to array - outputsb.workspace/genomicsdb_array
17:01:48.136 INFO ProgressMeter - Starting traversal
17:01:48.136 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
17:01:48.250 INFO GenomicsDBImport - Importing batch 1 with 50 samples
17:37:58.214 INFO ProgressMeter - 6:29691241 36.2 1 0.0
17:37:58.215 INFO GenomicsDBImport - Done importing batch 1/143
17:37:58.316 INFO GenomicsDBImport - Importing batch 2 with 50 samples
18:14:11.907 INFO ProgressMeter - 6:29691241 72.4 2 0.0
18:14:11.907 INFO GenomicsDBImport - Done importing batch 2/143
SNIP
04:51:49.665 INFO GenomicsDBImport - Importing batch 61 with 50 samples
terminate called after throwing an instance of 'VariantStorageManagerException'
what(): VariantStorageManagerException exception : status == TILEDB_OK
Data pre-processing for variant discovery
Purpose
This is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This includes alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.
Reference Implementations
Pipeline | Summary | Notes | Github | FireCloud |
---|---|---|---|---|
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | pending |
$5 Genome Analysis Pipeline | uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38 |
Generic data pre-processing | uBAM to analysis-ready BAM | universal | yes | hg38 & b37 |
* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.
Expected input
This workflow is designed to operate on individual samples, for which the data is initially organized in distinct subsets called readgroups. These correspond to the intersection of libraries (the DNA product extracted from biological samples and prepared for sequencing, which includes fragmenting and tagging with identifying barcodes) and lanes (units of physical separation on the DNA sequencing chips) generated through multiplexing (the process of mixing multiple libraries and sequencing them on multiple lanes, for risk and artifact mitigation purposes).
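For illustration only, a read group is identified in the BAM header by an @RG line; the values below are made up (in a real header the fields are tab-separated):
@RG ID:rg1 SM:sample1 LB:lib1 PL:ILLUMINA PU:unit1
Here ID identifies the readgroup, SM the sample, LB the library, PL the sequencing platform and PU the platform unit (typically flowcell-barcode.lane).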
Our reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM.
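As a minimal sketch of that conversion (not the reference implementation command; file names and read group values are placeholders), Picard's FastqToSam can produce a uBAM from a pair of FASTQs:
# Hypothetical example: convert one readgroup's paired FASTQs to an unmapped BAM
java -jar picard.jar FastqToSam \
    FASTQ=sample1_R1.fastq.gz \
    FASTQ2=sample1_R2.fastq.gz \
    OUTPUT=sample1_unmapped.bam \
    READ_GROUP_NAME=rg1 SAMPLE_NAME=sample1 LIBRARY_NAME=lib1 PLATFORM=ILLUMINA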
Main steps
We begin by mapping the sequence reads to the reference genome to produce a file in SAM/BAM format sorted by coordinate. Next, we mark duplicates to mitigate biases introduced by data generation steps such as PCR amplification. Finally, we recalibrate the base quality scores, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read.
Map to Reference
Tools involved: BWA, MergeBamAlignment
This first processing step is performed per-read group and consists of mapping each individual read pair to the reference genome, which is a synthetic single-stranded representation of common genome sequence that is intended to provide a common coordinate framework for all genomic analysis. Because the mapping algorithm processes each read pair in isolation, this can be massively parallelized to increase throughput as desired.
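As a rough sketch of what this per-readgroup step can look like (illustrative file names, not the exact production command), the uBAM is converted to FASTQ on the fly, aligned with BWA-MEM, and the alignment is then merged back with the uBAM metadata:
# Hypothetical example for one readgroup; ref.fasta needs both BWA and sequence dictionary/index files
java -jar picard.jar SamToFastq INPUT=rg1_unmapped.bam FASTQ=/dev/stdout INTERLEAVE=true | \
    bwa mem -p -t 8 ref.fasta /dev/stdin > rg1_aligned.sam
java -jar picard.jar MergeBamAlignment \
    ALIGNED=rg1_aligned.sam UNMAPPED=rg1_unmapped.bam \
    OUTPUT=rg1_mapped_merged.bam REFERENCE_SEQUENCE=ref.fasta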
Mark Duplicates
Tools involved: MarkDuplicates, SortSam
This second processing step is performed per-sample and consists of identifying read pairs that are likely to have originated from duplicates of the same original DNA fragments through some artifactual processes. These are considered to be non-independent observations, so the program tags all but one of the read pairs within each set of duplicates, causing the marked pairs to be ignored by default during the variant discovery process. This step constitutes a major bottleneck since it involves making a large number of comparisons between all the read pairs belonging to the sample, across all of its readgroups. It is followed by a sorting operation (not explicitly shown in the workflow diagram) that also constitutes a performance bottleneck, since it also operates across all reads belonging to the sample. Both algorithms continue to be the target of optimization efforts to reduce their impact on latency.
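A minimal per-sample sketch with Picard (file names are placeholders; in practice all of the sample's readgroup BAMs are given as inputs):
# Hypothetical example: mark duplicates across the sample's readgroup BAMs, then coordinate-sort and index
java -jar picard.jar MarkDuplicates \
    INPUT=sample1_rg1.bam INPUT=sample1_rg2.bam \
    OUTPUT=sample1_markdup.bam METRICS_FILE=sample1_dup_metrics.txt
java -jar picard.jar SortSam \
    INPUT=sample1_markdup.bam OUTPUT=sample1_sorted.bam \
    SORT_ORDER=coordinate CREATE_INDEX=true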
Base (Quality Score) Recalibration
Tools involved: BaseRecalibrator, ApplyBQSR, AnalyzeCovariates (optional)
This third processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.
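For illustration, a minimal GATK4 sketch of the two main recalibration steps (resource and file names are placeholders; use the known-sites files recommended for your reference build):
# Hypothetical example: build the recalibration model, then apply it to produce the analysis-ready BAM
gatk BaseRecalibrator -R ref.fasta -I sample1_sorted.bam \
    --known-sites dbsnp.vcf.gz --known-sites known_indels.vcf.gz \
    -O sample1_recal.table
gatk ApplyBQSR -R ref.fasta -I sample1_sorted.bam \
    --bqsr-recal-file sample1_recal.table -O sample1_recalibrated.bam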
Calling variants in RNAseq
Overview
This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.
Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.
In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:
Caveats
Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.
We know that the current recommended pipeline produces both false positive (wrong variant call) and false negative (missed variant) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article, as well as our ideas for fixing them in the future.
We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.
The workflow
1. Mapping to the reference
The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that specialize in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels using the STAR aligner. Specifically, we use the STAR 2-pass method, which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.
Here is a walkthrough of the STAR 2-pass alignment steps:
1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:
genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa --runThreadN <n>
2) Alignment jobs were executed as follows:
runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>
3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:
genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
--sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>
4) The resulting index is then used to produce the final alignments as follows:
runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>
2. Add read groups, sort, mark duplicates, and create index
The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.
java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample
java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics
3. Split'N'Trim and reassign mapping qualities
Next, we use a new GATK tool called SplitNCigarReads, developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clips any sequences overhanging into the intronic regions.
In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.
At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.
Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
4. Indel Realignment (optional)
After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).
5. Base Recalibration
We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.
Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.
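For reference, a minimal sketch of the base recalibration step in the GATK 3.x syntax used elsewhere in this document (the known-sites files shown are placeholders; use the standard resources for your reference build):
# Hypothetical example: build the recalibration table on the split BAM, then apply it with PrintReads
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I split.bam -knownSites dbsnp.vcf -knownSites known_indels.vcf -o recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I split.bam -BQSR recal.table -o recalibrated.bam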
6. Variant calling
Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it performs much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary, as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we set the minimum phred-scaled confidence threshold for calling variants to 20, but you can lower this to increase sensitivity if needed.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf
7. Variant filtering
To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).
We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.
As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf
Please note that we selected these hard filtering values in an attempt to optimize both sensitivity and specificity together. By applying the hard filters, some real sites will get filtered out. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positive calls, you can choose not to filter at all (or to use less restrictive thresholds).
An example of filtered (SNP cluster filter) and unfiltered false variant calls:
An example of true variants that were filtered (false negatives). As explained in the text, there is a tradeoff that comes with applying filters:
Known issues
There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).
Although our new tool (SplitNCigarReads) cleans up many false positive calls that are caused by splicing inaccuracies introduced by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.
As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionality, improve true positive rates and minimize false positive rates, as well as develop statistical filtering (i.e. variant recalibration) recommendations.
We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.
[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013
NOTE: Questions about this document that were posted before June 2014 have been moved to this archival thread: http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow
Short read data in highly repetitive genomic region for heterozygous individuals
Hello GATK team,
This might be a very general and overrated question, but I appreciate your input. I am working with natural populations of plants (individuals expected to be highly heterozygous) and an enriched genomic region which contains some promoters of interest together with transposons, duplications and a lot of expected indels and SVs, including a potential paralog for one of our BACs. Unfortunately the long-read sequencing is not yet ready, so I am using the 2x75 bp data and our BAC sequences as references to test how close we can get with HaplotypeCaller to getting some SNP and short indel calls for an association analysis. Our coverage distribution seems to be heavily biased towards areas with duplications and potential TEs, and most of the assemblers based on local assembly are thrown off by our data. I have used very strict mapping parameters to avoid this problem with misaligned reads, given that we can't discard the possibility of having hyper-variable regions.
I understand that aiming for genotype calls is dangerous given our kind of data and the lack of a genome reference, so I am aiming to include the genotype likelihoods in the association analysis. With HaplotypeCaller I get a VCF file for my population and an associated PL value. My question is basically: given our type of data, do you think the local assembly inherent to HaplotypeCaller will give us false positive variants in the final output? Do you have any suggestions or alternative tools to get genotype likelihoods (without local assembly?) and input those into an association analysis tool?
I really appreciate your insight.
Best,
Bootstrapping high confidence variants for VQSR
Hi,
I was wondering about the current best practices recommendation for refining the variant calls made by the HaplotypeCaller when no prior known variants are available (i.e. in non-model species). I can see that for base recalibration, you recommend bootstrapping a set of high confidence variants by first doing an initial round of SNP calling on your original, unrecalibrated data, and then using a high confidence subset of the called SNPs as the "known SNPs" for the base recalibration step.
Do you recommend a similar approach for variant recalibration? I have seen some people implement that, but I don't find any mention of this option in your description of VQSR in the current best practices. Does not mentioning it there imply that you recommend simply hard-filtering the called variants if you don't have a database of known variants available, or would you suggest that it may be worthwhile to try bootstrapping a set of "known variants" for the VQSR step as well?
Thanks very much for any advice you can share.
MuTect2 strandbias + TLOD clarification
Hi,
I have a set of tumour samples and I would like to call variants using MuTect2 without matching normals, annotate using VEP and filter out known germline variants afterwards.
I used tumor-only mode with the downsampling process turned off. There are a number of artefacts being called, and I found at least one variant that looks real but was not called. I could think of two options to improve the calling, hence my questions:
1- Strand bias: How can I find information about strand bias? I am looking for details like what we typically see in the call.stats output of MuTect (i.e. LOD scores of the forward and reverse strands), but have not been able to modify my code to include that information. I think some artefacts may be due to strand bias.
2- TLOD: This is where I got confused. Could you explain how MuTect2 calculates TLOD in the absence of a matching normal? I use the LOD scores to determine real calls. The majority of real variants have a massive TLOD compared to all calls within each sample. But in my set of samples, there was one variant that seems to be true and had a small TLOD value. I started to think that MuTect2 needs something as a normal to generate a correct TLOD, but I am not sure.
This is what I ran:
Using hg38
gatk Mutect2 \
-R hg38 \
-I test.bam \
-L interval_list \
-O test.vcf \
-tumor test.bam \
--contamination-fraction-to-filter 0.0 \
--max-reads-per-alignment-start 0
Any comments would be highly appreciated.
Thank you
Invalid SAM?
I used BWA MEM to map reads from an interleaved FASTQ.
fastq="all.fastq"
fasta="/share/PI/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
bwa="/share/PI/apps/bcbio/anaconda/bin/bwa"
nThreads="12"
#Run BWA MEM
#IMPORTANT: NEED -p since "$fastq" is an interleaved fastq
readGroup="@RG\tID:CHM1\tSM:CHM1\tPL:Illumina"
sam="CHM1.sam"
"$bwa" mem -R "$readGroup" -t "$nThreads" -p "$fasta" "$fastq" -o "$sam"
(The FASTQs are actually CHM1; I used prefetch to fetch .sra files from three different runs from NCBI, then used fastq-dump to convert the SRAs to FASTQs, then cat-ed them all together into one FASTQ.)
The SAM is 515 GB but has no obvious problems. samtools quickcheck says it's valid. But when I run GATK4 (4.0.4.0)'s FixMateInformation or ValidateSamFile, I get output like this:
ERROR: Record 1, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
WARNING: Record 1, Read name ######################################################################################################################################################################################################, QUAL field is set to * (unspecified quality scores), this is allowed by the SAM specification but many tools expect reads to include qualities
ERROR: Record 421522661, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
There may be even more errors, but this is what I got after two hours.
I can, in fact, see that the first line in the SAM file is
###################################################################################################################################################################################################### 4 * 0 0 * * 0 0 * * AS:i:0 XS:i:0 RG:Z:CHM1
Is this SAM really invalid? Or is there something I need to do so GATK4 will accept it?
Bad PED line 1: wrong number of fields in PED files in PhaseByTransmission
Hello,
When using PhaseByTransmission I always get this error:
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/pub/yuanjian/software_script/GATK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 02:32:46,897 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 02:32:46,897 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 02:32:46,898 GenomeAnalysisEngine - Strictness is SILENT
INFO 02:32:46,989 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 02:32:47,058 PedReader - Reading PED file output.vcf2plink.ped with missing fields: []
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: File associated with name java.io.FileReader@3a0e7f89 is malformed: Bad PED line 1: wrong number of fields
ERROR ------------------------------------------------------------------------------------------
my command is:
java -jar /pub/yuanjian/software_script/GATK-3.8/GenomeAnalysisTK.jar -T PhaseByTransmission -R /pub/yuanjian/reference/human_g1k_v37.fasta -V output.vcf -ped output.vcf2plink.ped -o file.vcf
I used PLINK to convert a VCF file to a PED file, and my PED file is:
GENOTYPE.437 GENOTYPE.437 0 0 0 0 0 0 0 0 A A 0 0 C C G G C C A A 0 0 0 0 C T 0 0 0 0
GENOTYPE.430 GENOTYPE.430 0 0 0 0 0 0 0 0 A A C T C C G G C C A A 0 0 G A T T C G A
GENOTYPE.450 GENOTYPE.450 0 0 0 0 G G G G A A 0 0 C C G G C C A A C G 0 0 0 0 0 0 0 0
So what's wrong with my ped file?
Hard-Filtering odd MQ distributions
Hello,
I'm working on filtering my SNP calls from a non-model organism and thus going with hard-filtering instead of VQSR. I know this is always a bit touch-and-go and there's no definite answer as to which thresholds to employ (I've been through the documentation), but I'm hoping you can give me some more pointers.
I started by using your indications, except for QD, plus filters based on missing data and coverage. So right now I have filtered SNPs with: QD < 5 || FS < 60 || MQ < 40 || MQRankSum < -12.5 || ReadPosRankSum < -8 || SOR < 3, maximum 30% missing calls, resulting in 16'397'726 SNPs (of 19'047'259 total unfiltered calls).
These are the distributions before and after filtering (QUAL just for indication, I didn't apply any filter to it).
It looks considerably better to me, but still far from what your example data looks like. In particular, I'm wondering about my MQ distribution, which has these very high values (over 400, some even come out as "Inf") - have you seen this before?
Further, would it be "ok" to filter on the upper value of MQ as well as the lower? Thanks!
How can I generate a PNG image rather than a PDF from Picard tools (CollectInsertSizeMetrics)?
I am using Picard tools to generate an insert size histogram, but I found that the PNG image is not viewable.
java -jar picard-tools-1.130/picard.jar CollectInsertSizeMetrics I=sorted.bam O=insert_size_metrics.txt H=histogram.png
How can I fix this issue?
GATK4 for somatic WES or Panel on limited number of samples
Hello,
Which of the workflows listed on "https://github.com/gatk-workflows" is adapted for SOMATIC WES or Panel sequencing on cohorts from 2 to 50 samples and can be run locally?
Many thanks,
Is it possible to call SNPs without assigning read groups first?
Hi, I have tens of thousands of BAM files, each coming from a single biological sample. Assigning read groups before calling SNPs for all of these BAM files is too tedious. Is it possible to call SNPs without assigning read groups to these BAM files? Thanks!