Can the GATK Best Practices Pipeline on Google Cloud Platform be used on FASTQ inputs?
I read the documentation on this pipeline (https://cloud.google.com/genomics/docs/tutorials/gatk) and saw that its input is unaligned BAMs. Is there a way to use the pipeline with FASTQ inputs?
BQSR can't run!
Hello!
I'm running BQSR with GATK v4.0.4.0, and I've run into a problem: I can't get the right output file.
The command line is:
$GATK --java-options "-Xmx10240m -Djava.io.tmpdir=./" BaseRecalibratorSpark -R $GENOME -I $sample-md_rl.bam --known-sites /home/gaotiangang/niuguohao/201806call/50100-step5-3/combined_1.raw_snp.vcf -O $sample.4.table --spark-master local[4]
$GATK --java-options "-Xmx10240m -Djava.io.tmpdir=./" ApplyBQSRSpark -I $sample-md_rl.bam -bqsr $sample.4.table -O $sample.4.bam --spark-master local[4]
echo "BQSR 1 over "
But I get these lines back almost every time:
18/07/25 08:13:23 ERROR Executor: Exception in task 9.0 in stage 1.0 (TID 191)
java.io.IOException: Failed to create local dir in /home/gaotiangang/niuguohao/201806call/50-step5-7/niuguohao/blockmgr-accb99db-cd04-4a44-a018-0672408e3f03/1d.
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:80)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.getDataFile(IndexShuffleBlockResolver.scala:55)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:212)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:169)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/07/25 08:13:23 INFO TaskSetManager: Starting task 11.0 in stage 1.0 (TID 193, localhost, executor driver, partition 11, PROCESS_LOCAL, 4906 bytes)
18/07/25 08:13:23 INFO Executor: Running task 11.0 in stage 1.0 (TID 193)
18/07/25 08:13:23 WARN TaskSetManager: Lost task 9.0 in stage 1.0 (TID 191, localhost, executor driver): java.io.IOException: Failed to create local dir in /home/gaotiangang/niuguohao/201806call/50-step5-7/niuguohao/blockmgr-accb99db-cd04-4a44-a018-0672408e3f03/1d.
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:80)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.getDataFile(IndexShuffleBlockResolver.scala:55)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:212)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:169)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Does that mean I set something wrong in my command line?
Thanks!
Problem with CombineGVCFs
Hi,
I ran into a problem when I used CombineGVCFs. Here is the error: "java.lang.IllegalStateException: Key END found in VariantContext field INFO at chrM:247 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default."
The command is as follows: "gatk CombineGVCFs -R ~/ref/GATK_REF/ucsc.hg19.fasta -V 788_filter.vcf -V 789_filter.vcf -O Merged.vcf"
I generated the VCF file for each sample with GATK version 4, as follows:
gatk HaplotypeCaller -R $ref/ucsc.hg19.fasta -I $temp_output_dir/gp_add_sorted_dedup_chr_ordered_split_$sample_id.bam --dont-use-soft-clipped-bases true -stand-call-conf 20.0 -O $temp_output_dir/$sample_id.vcf
gatk VariantFiltration -R $ref/ucsc.hg19.fasta -V $temp_output_dir/$sample_id.vcf -window 35 -cluster 3 --filter-name FS -filter "FS > 30.0" --filter-name QD -filter "QD < 2.0" -O $final_results_dir/"$sample_id"_filter.vcf
I only retained the files $sample_id_filter.vcf.
Can you help me address this issue? Thank you very much.
Xiao
How to hard filter Mutect2 calls based on strand bias
Hi, how can I apply a hard filter to Mutect2 somatic calls based on strand bias (e.g. to filter out calls with 60 (out of 80) reads on the plus strand and only 2 (out of 70) reads on the minus strand supporting the non-ref allele)? Which FORMAT field can I use to do this with VariantFiltration? Apologies if this has been asked before; I couldn't find a similar question in my search.
Thanks,
Rao
Why is HaplotypeCaller ignoring some reads and calling 1/1 variants instead of 0/1?
Hello,
I compared the BAM output from BWA-MEM with the BAM output from HaplotypeCaller and saw that, for a specific active region, GATK ignored some reads. Because of that, some variants were not called and others were called as homozygous instead of heterozygous. Could you please help me understand why GATK does not consider these reads during the realignment step? I have checked mapping quality and alignment score and everything is OK. I am using GATK v4.
Thanks in advance.
Best
Where can I find known variants, training and truth sets, and other resource files?
For general definitions of these terms, see this Dictionary entry.
Humans
If you're working with human data, you're in luck. We provide all resource files necessary for applying the Best Practices pipelines to human data as part of our Resource Bundle, and we provide specific recommendations on which sets to use for each tool in the variant calling pipelines, as well as default settings for all parameters. See the Best Practices documentation for details to that effect.
Everything else
Unfortunately we're not currently able to provide centralized resources for non-human organisms. That means you will need to do some additional homework to find out what is available for your organism. In order to facilitate this process, we have created a forum section called Zoo & Garden specifically for the purpose of collecting information on this topic. We invite researchers who have experience in non-human genomics analysis to share their knowledge by contributing documentation to this section.
Recommended parameters for VariantRecalibrator in GATK4
Hi,
I'm trying to figure out the recommended set of parameters for VariantRecalibrator on SNPs. The official joint discovery pipeline on Github [1] and the example documentation [2] seem to differ in a few ways: a different number of Gaussians (6 vs 8), slightly different training feature sets, and different priors for dbSNP (7 vs 2). There are other differences, like --trust-all-polymorphic and training on every 10th variant in the Github version.
The dataset I'm working on: 17k samples, 15x WGS, but only called in exonic regions so far (i.e should look like WES, but with lower, yet more homogeneous depth). Variants are called with GATK4.
I understand that a lot of tweaking can be made to the parameters, but would like to start with the most "standard" version first.
Thanks!
[1] The Github pipeline:
https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/3087accf86b325bb5b511f2e7f6e8574fc0c1ff0/joint-discovery-gatk4.wdl
https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/5e3d54aa68899248af066b2fbf00954ae052f9b7/joint-discovery-gatk4-local.hg38.wgs.inputs.json
${gatk_path} --java-options "-Xmx100g -Xms100g" \
  VariantRecalibrator \
  -V ${sites_only_variant_filtered_vcf} \
  -O ${recalibration_filename} \
  --tranches-file ${tranches_filename} \
  --trust-all-polymorphic \
  -tranche ${sep=' -tranche ' recalibration_tranche_values} \
  -an ${sep=' -an ' recalibration_annotation_values} \
  -mode SNP \
  --sample-every-Nth-variant ${downsampleFactor} \
  --output-model ${model_report_filename} \
  --max-gaussians 6 \
  -resource hapmap,known=false,training=true,truth=true,prior=15:${hapmap_resource_vcf} \
  -resource omni,known=false,training=true,truth=true,prior=12:${omni_resource_vcf} \
  -resource 1000G,known=false,training=true,truth=false,prior=10:${one_thousand_genomes_resource_vcf} \
  -resource dbsnp,known=true,training=false,truth=false,prior=7:${dbsnp_resource_vcf}
and training features:
"JointGenotyping.indel_recalibration_annotation_values": ["FS", "ReadPosRankSum", "MQRankSum", "QD", "SOR", "DP"]
[2] Example in the docs:
https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/org_broadinstitute_hellbender_tools_walkers_vqsr_VariantRecalibrator.php
gatk VariantRecalibrator \
  -R Homo_sapiens_assembly38.fasta \
  -V input.vcf.gz \
  --resource hapmap,known=false,training=true,truth=true,prior=15.0:hapmap_3.3.hg38.sites.vcf.gz \
  --resource omni,known=false,training=true,truth=false,prior=12.0:1000G_omni2.5.hg38.sites.vcf.gz \
  --resource 1000G,known=false,training=true,truth=false,prior=10.0:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
  --resource dbsnp,known=true,training=false,truth=false,prior=2.0:Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
  -mode SNP \
  --recal-file output.recal \
  --tranches-file output.tranches \
  --rscript-file output.plots.R
Indel realignment for MuTect2 in GATK 4.0
Hello, I have already read that indel realignment is deprecated in GATK 4.0, since HaplotypeCaller does the realignment, but what about MuTect2?
In one of your posts ( https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2 ) I see you already have data where indel realignment was done. Was this intentional?
How do I proceed with the pipeline then?
Should I run indel realignment on my data with 3.8 and then use 4.0 Mutect2?
Best regards,
Srdjan
VariantStorageManagerException exception : status == TILEDB_OK
Using 4.0.4.0, I got the following rather unhelpful error from GenomicsDBImport. I suspect it might be a variant in one of my imported VCFs causing the issue, but the error message doesn't even hint at which file it was working on.
~/gatk-4.0.4.0/gatk GenomicsDBImport -R hs37d5.fa --sample-name-map gvcfs.samplemap --genomicsdb-workspace-path outputsb.workspace -L 6:29691241-33054015 --batch-size 50 --consolidate true
Using GATK jar /home/cloud-user/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/cloud-user/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar GenomicsDBImport -R hs37d5.fa --sample-name-map gvcfs.samplemap --genomicsdb-workspace-path outputsb.workspace -L 6:29691241-33054015 --batch-size 50 --consolidate true
17:01:47.406 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/cloud-user/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:01:47.522 INFO GenomicsDBImport - ------------------------------------------------------------
17:01:47.522 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.4.0
17:01:47.522 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
17:01:47.522 INFO GenomicsDBImport - Executing as cloud-user@lustre-assembly-1 on Linux v3.10.0-862.3.3.el7.x86_64 amd64
17:01:47.522 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-b10
17:01:47.523 INFO GenomicsDBImport - Start Date/Time: 17 July 2018 17:01:47 BST
17:01:47.523 INFO GenomicsDBImport - ------------------------------------------------------------
17:01:47.523 INFO GenomicsDBImport - ------------------------------------------------------------
17:01:47.523 INFO GenomicsDBImport - HTSJDK Version: 2.14.3
17:01:47.523 INFO GenomicsDBImport - Picard Version: 2.18.2
17:01:47.523 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:01:47.523 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:01:47.523 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:01:47.523 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:01:47.523 INFO GenomicsDBImport - Deflater: IntelDeflater
17:01:47.523 INFO GenomicsDBImport - Inflater: IntelInflater
17:01:47.523 INFO GenomicsDBImport - GCS max retries/reopens: 20
17:01:47.523 INFO GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
17:01:47.524 INFO GenomicsDBImport - Initializing engine
17:01:47.959 INFO IntervalArgumentCollection - Processing 3362775 bp from intervals
17:01:47.984 INFO GenomicsDBImport - Done initializing engine
Created workspace /mnt/chla/outputsb.workspace
17:01:48.120 INFO GenomicsDBImport - Vid Map JSON file will be written to outputsb.workspace/vidmap.json
17:01:48.120 INFO GenomicsDBImport - Callset Map JSON file will be written to outputsb.workspace/callset.json
17:01:48.120 INFO GenomicsDBImport - Complete VCF Header will be written to outputsb.workspace/vcfheader.vcf
17:01:48.120 INFO GenomicsDBImport - Importing to array - outputsb.workspace/genomicsdb_array
17:01:48.136 INFO ProgressMeter - Starting traversal
17:01:48.136 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
17:01:48.250 INFO GenomicsDBImport - Importing batch 1 with 50 samples
17:37:58.214 INFO ProgressMeter - 6:29691241 36.2 1 0.0
17:37:58.215 INFO GenomicsDBImport - Done importing batch 1/143
17:37:58.316 INFO GenomicsDBImport - Importing batch 2 with 50 samples
18:14:11.907 INFO ProgressMeter - 6:29691241 72.4 2 0.0
18:14:11.907 INFO GenomicsDBImport - Done importing batch 2/143
SNIP
04:51:49.665 INFO GenomicsDBImport - Importing batch 61 with 50 samples
terminate called after throwing an instance of 'VariantStorageManagerException'
what(): VariantStorageManagerException exception : status == TILEDB_OK
Data pre-processing for variant discovery
Purpose
This is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ or uBAM format) to produce analysis-ready BAM files. This includes alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.
Reference Implementations
Pipeline | Summary | Notes | Github | FireCloud |
---|---|---|---|---|
Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | pending |
$5 Genome Analysis Pipeline | uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38 |
Generic data pre-processing | uBAM to analysis-ready BAM | universal | yes | hg38 & b37 |
* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.
Expected input
This workflow is designed to operate on individual samples, for which the data is initially organized in distinct subsets called readgroups. These correspond to the intersection of libraries (the DNA product extracted from biological samples and prepared for sequencing, which includes fragmenting and tagging with identifying barcodes) and lanes (units of physical separation on the DNA sequencing chips) generated through multiplexing (the process of mixing multiple libraries and sequencing them on multiple lanes, for risk and artifact mitigation purposes).
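For illustration only, a read group is identified in the BAM header by an @RG line; the values below are made up (in a real header the fields are tab-separated):
@RG ID:rg1 SM:sample1 LB:lib1 PL:ILLUMINA PU:unit1
Here ID identifies the readgroup, SM the sample, LB the library, PL the sequencing platform and PU the platform unit (typically flowcell-barcode.lane).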
Our reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM.
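As a minimal sketch of that conversion (not the reference implementation command; file names and read group values are placeholders), Picard's FastqToSam can produce a uBAM from a pair of FASTQs:
# Hypothetical example: convert one readgroup's paired FASTQs to an unmapped BAM
java -jar picard.jar FastqToSam \
    FASTQ=sample1_R1.fastq.gz \
    FASTQ2=sample1_R2.fastq.gz \
    OUTPUT=sample1_unmapped.bam \
    READ_GROUP_NAME=rg1 SAMPLE_NAME=sample1 LIBRARY_NAME=lib1 PLATFORM=ILLUMINA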
Main steps
We begin by mapping the sequence reads to the reference genome to produce a file in SAM/BAM format sorted by coordinate. Next, we mark duplicates to mitigate biases introduced by data generation steps such as PCR amplification. Finally, we recalibrate the base quality scores, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read.
Map to Reference
Tools involved: BWA, MergeBamAlignment
This first processing step is performed per-read group and consists of mapping each individual read pair to the reference genome, which is a synthetic single-stranded representation of common genome sequence that is intended to provide a common coordinate framework for all genomic analysis. Because the mapping algorithm processes each read pair in isolation, this can be massively parallelized to increase throughput as desired.
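As a rough sketch of what this per-readgroup step can look like (illustrative file names, not the exact production command), the uBAM is converted to FASTQ on the fly, aligned with BWA-MEM, and the alignment is then merged back with the uBAM metadata:
# Hypothetical example for one readgroup; ref.fasta needs both BWA and sequence dictionary/index files
java -jar picard.jar SamToFastq INPUT=rg1_unmapped.bam FASTQ=/dev/stdout INTERLEAVE=true | \
    bwa mem -p -t 8 ref.fasta /dev/stdin > rg1_aligned.sam
java -jar picard.jar MergeBamAlignment \
    ALIGNED=rg1_aligned.sam UNMAPPED=rg1_unmapped.bam \
    OUTPUT=rg1_mapped_merged.bam REFERENCE_SEQUENCE=ref.fasta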
Mark Duplicates
Tools involved: MarkDuplicates, SortSam
This second processing step is performed per-sample and consists of identifying read pairs that are likely to have originated from duplicates of the same original DNA fragments through some artifactual processes. These are considered to be non-independent observations, so the program tags all but one of the read pairs within each set of duplicates, causing the marked pairs to be ignored by default during the variant discovery process. This step constitutes a major bottleneck since it involves making a large number of comparisons between all the read pairs belonging to the sample, across all of its readgroups. It is followed by a sorting operation (not explicitly shown in the workflow diagram) that also constitutes a performance bottleneck, since it also operates across all reads belonging to the sample. Both algorithms continue to be the target of optimization efforts to reduce their impact on latency.
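A minimal per-sample sketch with Picard (file names are placeholders; in practice all of the sample's readgroup BAMs are given as inputs):
# Hypothetical example: mark duplicates across the sample's readgroup BAMs, then coordinate-sort and index
java -jar picard.jar MarkDuplicates \
    INPUT=sample1_rg1.bam INPUT=sample1_rg2.bam \
    OUTPUT=sample1_markdup.bam METRICS_FILE=sample1_dup_metrics.txt
java -jar picard.jar SortSam \
    INPUT=sample1_markdup.bam OUTPUT=sample1_sorted.bam \
    SORT_ORDER=coordinate CREATE_INDEX=true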
Base (Quality Score) Recalibration
Tools involved: BaseRecalibrator, ApplyBQSR, AnalyzeCovariates (optional)
This third processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.
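For illustration, a minimal GATK4 sketch of the two main recalibration steps (resource and file names are placeholders; use the known-sites files recommended for your reference build):
# Hypothetical example: build the recalibration model, then apply it to produce the analysis-ready BAM
gatk BaseRecalibrator -R ref.fasta -I sample1_sorted.bam \
    --known-sites dbsnp.vcf.gz --known-sites known_indels.vcf.gz \
    -O sample1_recal.table
gatk ApplyBQSR -R ref.fasta -I sample1_sorted.bam \
    --bqsr-recal-file sample1_recal.table -O sample1_recalibrated.bam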
Calling variants in RNAseq
Overview
This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.
Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.
In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:
Caveats
Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.
We know that the current recommended pipeline produces both false positive (wrong variant call) and false negative (missed variant) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article, as well as our ideas for fixing them in the future.
We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.
The workflow
1. Mapping to the reference
The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that specialize in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels using the STAR aligner. Specifically, we use the STAR 2-pass method, which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.
Here is a walkthrough of the STAR 2-pass alignment steps:
1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:
genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa --runThreadN <n>
2) Alignment jobs were executed as follows:
runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>
3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:
genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
--sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>
4) The resulting index is then used to produce the final alignments as follows:
runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>
2. Add read groups, sort, mark duplicates, and create index
The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.
java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample
java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics
3. Split'N'Trim and reassign mapping qualities
Next, we use a new GATK tool called SplitNCigarReads, developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clips any sequences overhanging into the intronic regions.
In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.
At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.
Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
4. Indel Realignment (optional)
After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).
5. Base Recalibration
We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.
Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.
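For reference, a minimal sketch of the base recalibration step in the GATK 3.x syntax used elsewhere in this document (the known-sites files shown are placeholders; use the standard resources for your reference build):
# Hypothetical example: build the recalibration table on the split BAM, then apply it with PrintReads
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I split.bam -knownSites dbsnp.vcf -knownSites known_indels.vcf -o recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I split.bam -BQSR recal.table -o recalibrated.bam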
6. Variant calling
Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it performs much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary, as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we set the minimum phred-scaled confidence threshold for calling variants to 20, but you can lower this to increase sensitivity if needed.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf
7. Variant filtering
To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).
We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.
As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf
Please note that we selected these hard filtering values in an attempt to optimize both sensitivity and specificity together. By applying the hard filters, some real sites will get filtered out. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positive calls, you can choose not to filter at all (or to use less restrictive thresholds).
An example of filtered (SNP cluster filter) and unfiltered false variant calls:
An example of true variants that were filtered (false negatives). As explained in the text, there is a tradeoff that comes with applying filters:
Known issues
There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).
Although our new tool (SplitNCigarReads) cleans up many false positive calls that are caused by splicing inaccuracies introduced by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.
As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionality, improve true positive rates and minimize false positive rates, as well as develop statistical filtering (i.e. variant recalibration) recommendations.
We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.
[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013
NOTE: Questions about this document that were posted before June 2014 have been moved to this archival thread: http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow
Short read data in highly repetitive genomic region for heterozygous individuals
Hello GATK team,
This might be a very general and overrated question, but I appreciate your input. I am working with natural populations of plants (individuals expected to be highly heterozygous) and an enriched genomic region which contains some promoters of interest together with transposons, duplications and a lot of expected indels and SVs, including a potential paralog for one of our BACs. Unfortunately the long-read sequencing is not yet ready, so I am using the 2x75 bp data and our BAC sequences as references to test how close we can get with HaplotypeCaller to getting some SNP and short indel calls for an association analysis. Our coverage distribution seems to be heavily biased towards areas with duplications and potential TEs, and most of the assemblers based on local assembly are thrown off by our data. I have used very strict mapping parameters to avoid this problem with misaligned reads, given that we can't discard the possibility of having hyper-variable regions.
I understand that aiming for genotype calls is dangerous given our kind of data and the lack of a genome reference, so I am aiming to include the genotype likelihoods in the association analysis. With HaplotypeCaller I get a VCF file for my population and an associated PL value. My question is basically: given our type of data, do you think the local assembly inherent to HaplotypeCaller will give us false positive variants in the final output? Do you have any suggestions or alternative tools to get genotype likelihoods (without local assembly?) and input those into an association analysis tool?
I really appreciate your insight.
Best,
Bootstrapping high confidence variants for VQSR
Hi,
I was wondering about the current best practices recommendation for refining the variant calls made by the HaplotypeCaller when no prior known variants are available (i.e. in non-model species). I can see that for base recalibration, you recommend bootstrapping a set of high confidence variants by first doing an initial round of SNP calling on your original, unrecalibrated data, and then using a high confidence subset of the called SNPs as the "known SNPs" for the base recalibration step.
Do you recommend a similar approach for variant recalibration? I have seen some people implement that, but I don't find any mention of this option in your description of VQSR in the current best practices. Does not mentioning it there imply that you recommend simply hard-filtering the called variants if you don't have a database of known variants available, or would you suggest that it may be worthwhile to try bootstrapping a set of "known variants" for the VQSR step as well?
Thanks very much for any advice you can share.
MuTect2 strandbias + TLOD clarification
Hi,
I have a set of tumour samples and I would like to call variants using MuTect2 without matching normals, annotate using VEP and filter out known germline variants afterwards.
I used tumor-only mode with the downsampling process turned off. There are a number of artefacts being called, and I found at least one variant that looks real but was not called. I could think of two options to improve the calling, hence my questions:
1- Strand bias: How can I find information about strand bias? I am looking for details like what we typically see in the call.stats output of MuTect (i.e. LOD scores of the forward and reverse strands), but have not been able to modify my code to include that information. I think some artefacts may be due to strand bias.
2- TLOD: This is where I got confused. Could you explain how MuTect2 calculates TLOD in the absence of a matching normal? I use the LOD scores to determine real calls. The majority of real variants have a massive TLOD compared to all calls within each sample. But in my set of samples, there was one variant that seems to be true and had a small TLOD value. I started to think that MuTect2 needs something as a normal to generate a correct TLOD, but I am not sure.
This is what I ran:
Using hg38
gatk Mutect2 \
-R hg38 \
-I test.bam \
-L interval_list \
-O test.vcf \
-tumor test.bam \
--contamination-fraction-to-filter 0.0 \
--max-reads-per-alignment-start 0
Any comments would be highly appreciated.
Thank you
Invalid SAM?
I used BWA MEM to map reads from an interleaved FASTQ.
fastq="all.fastq"
fasta="/share/PI/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
bwa="/share/PI/apps/bcbio/anaconda/bin/bwa"
nThreads="12"
#Run BWA MEM
#IMPORTANT: NEED -p since "$fastq" is an interleaved fastq
readGroup="@RG\tID:CHM1\tSM:CHM1\tPL:Illumina"
sam="CHM1.sam"
"$bwa" mem -R "$readGroup" -t "$nThreads" -p "$fasta" "$fastq" -o "$sam"
(The FASTQs are actually CHM1; I used prefetch to fetch .sra files from three different runs from NCBI, then used fastq-dump to convert the SRAs to FASTQs, then cat-ed them all together into one FASTQ.)
The SAM is 515 GB but has no obvious problems. samtools quickcheck says it's valid. But when I run GATK4 (4.0.4.0)'s FixMateInformation or ValidateSamFile, I get output like this:
ERROR: Record 1, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
WARNING: Record 1, Read name ######################################################################################################################################################################################################, QUAL field is set to * (unspecified quality scores), this is allowed by the SAM specification but many tools expect reads to include qualities
ERROR: Record 421522661, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
There may be even more errors, but this is what I got after two hours.
I can, in fact, see that the first line in the SAM file is
###################################################################################################################################################################################################### 4 * 0 0 * * 0 0 * * AS:i:0 XS:i:0 RG:Z:CHM1
Is this SAM really invalid? Or is there something I need to do so GATK4 will accept it?
Bad PED line 1: wrong number of fields in PED files in PhaseByTransmission
Hello,
When using PhaseByTransmission I always get this error:
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/pub/yuanjian/software_script/GATK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 02:32:46,897 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 02:32:46,897 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 02:32:46,898 GenomeAnalysisEngine - Strictness is SILENT
INFO 02:32:46,989 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 02:32:47,058 PedReader - Reading PED file output.vcf2plink.ped with missing fields: []
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: File associated with name java.io.FileReader@3a0e7f89 is malformed: Bad PED line 1: wrong number of fields
ERROR ------------------------------------------------------------------------------------------
my command is:
java -jar /pub/yuanjian/software_script/GATK-3.8/GenomeAnalysisTK.jar -T PhaseByTransmission -R /pub/yuanjian/reference/human_g1k_v37.fasta -V output.vcf -ped output.vcf2plink.ped -o file.vcf
I used PLINK to convert a VCF file to a PED file, and my PED file is:
GENOTYPE.437 GENOTYPE.437 0 0 0 0 0 0 0 0 A A 0 0 C C G G C C A A 0 0 0 0 C T 0 0 0 0
GENOTYPE.430 GENOTYPE.430 0 0 0 0 0 0 0 0 A A C T C C G G C C A A 0 0 G A T T C G A
GENOTYPE.450 GENOTYPE.450 0 0 0 0 G G G G A A 0 0 C C G G C C A A C G 0 0 0 0 0 0 0 0
So what's wrong with my ped file?
Hard-Filtering odd MQ distributions
Hello,
I'm working on filtering my SNP calls from a non-model organism and thus going with hard-filtering instead of VQSR. I know this is always a bit touch-and-go and there's no definite answer as to which thresholds to employ (I've been through the documentation), but I'm hoping you can give me some more pointers.
I started by using your indications, except for QD, plus filters based on missing data and coverage. So right now I have filtered SNPs with: QD < 5 || FS < 60 || MQ < 40 || MQRankSum < -12.5 || ReadPosRankSum < -8 || SOR < 3, maximum 30% missing calls, resulting in 16'397'726 SNPs (of 19'047'259 total unfiltered calls).
These are the distributions before and after filtering (QUAL just for indication, I didn't apply any filter to it).
It looks considerably better to me, but still far from what your example data looks like. In particular, I'm wondering about my MQ distribution, which has these very high values (over 400, some even come out as "Inf") - have you seen this before?
Further, would it be "ok" to filter on the upper value of MQ as well as the lower? Thanks!
How can I generate a PNG image rather than a PDF from Picard tools (CollectInsertSizeMetrics)?
I am using Picard tools to generate an insert size histogram, but I found that the PNG image is not viewable.
java -jar picard-tools-1.130/picard.jar CollectInsertSizeMetrics I=sorted.bam O=insert_size_metrics.txt H=histogram.png
How can I fix this issue?
GATK4 for somatic WES or Panel on limited number of samples
Hello,
Which of the workflows listed on "https://github.com/gatk-workflows" is adapted for SOMATIC WES or Panel sequencing on cohorts from 2 to 50 samples and can be run locally?
Many thanks,
Is it possible to call SNPs without assigning read groups first?
Hi, I have tens of thousands of BAM files, each coming from a single biological sample. Assigning read groups before calling SNPs for all of these BAM files is too tedious. Is it possible to call SNPs without assigning read groups to these BAM files? Thanks!