Channel: Recent Discussions — GATK-Forum

GATK4's CalculateContamination reports no hom alt sites found


I have been trying to use GATK4's CalculateContamination but the output is not as expected:

level   contamination   error
whole_bam   0.0 1.0

The GATK log contained warnings that there were not enough data points to segment and that no hom alt sites were found.

Using GATK jar /mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/share/gatk4-4.0.4.0-0/gatk-package-4.0.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx16g -jar /mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/share/gatk4-4.0.4.0-0/gatk-package-4.0.4.0-local.jar CalculateContamination -I out/BC002-03042014_A_getpileupsummaries.table -O out/BC002-03042014_A_calculatecontamination.table
Picked up _JAVA_OPTIONS: -XX:+UseSerialGC
09:46:05.758 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/share/gatk4-4.0.4.0-0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:46:05.872 INFO  CalculateContamination - ------------------------------------------------------------
09:46:05.872 INFO  CalculateContamination - The Genome Analysis Toolkit (GATK) v4.0.4.0
09:46:05.872 INFO  CalculateContamination - For support and documentation go to https://software.broadinstitute.org/gatk/
09:46:05.872 INFO  CalculateContamination - Executing as dlho@n086.default.domain on Linux v2.6.32-431.el6.x86_64 amd64
09:46:05.872 INFO  CalculateContamination - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_102-b14
09:46:05.873 INFO  CalculateContamination - Start Date/Time: May 14, 2018 9:46:05 AM SGT
09:46:05.873 INFO  CalculateContamination - ------------------------------------------------------------
09:46:05.873 INFO  CalculateContamination - ------------------------------------------------------------
09:46:05.873 INFO  CalculateContamination - HTSJDK Version: 2.14.3
09:46:05.873 INFO  CalculateContamination - Picard Version: 2.18.2
09:46:05.873 INFO  CalculateContamination - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:46:05.873 INFO  CalculateContamination - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:46:05.873 INFO  CalculateContamination - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:46:05.873 INFO  CalculateContamination - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:46:05.873 INFO  CalculateContamination - Deflater: IntelDeflater
09:46:05.874 INFO  CalculateContamination - Inflater: IntelInflater
09:46:05.874 INFO  CalculateContamination - GCS max retries/reopens: 20
09:46:05.874 INFO  CalculateContamination - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
09:46:05.874 INFO  CalculateContamination - Initializing engine
09:46:05.874 INFO  CalculateContamination - Done initializing engine
09:46:05.935 WARN  KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (2) to segment; using all data points to calculate kernel matrix.
09:46:05.961 WARN  KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (2).  Local changepoint costs will not be calculated for this window size.
09:46:05.961 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.083 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
09:46:06.090 WARN  KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (3) to segment; using all data points to calculate kernel matrix.
09:46:06.090 WARN  KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (3).  Local changepoint costs will not be calculated for this window size.
09:46:06.090 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.091 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
09:46:06.091 WARN  KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (2) to segment; using all data points to calculate kernel matrix.
09:46:06.092 WARN  KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (2).  Local changepoint costs will not be calculated for this window size.
09:46:06.092 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.092 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
09:46:06.093 WARN  KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (1) to segment; using all data points to calculate kernel matrix.
09:46:06.093 WARN  KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (1).  Local changepoint costs will not be calculated for this window size.
09:46:06.093 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.093 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
09:46:06.113 WARN  CalculateContamination - No hom alt sites found!  Perhaps GetPileupSummaries was run on too small of an interval, or perhaps the sample was extremely inbred or haploid.
09:46:06.116 WARN  CalculateContamination - No hom alt sites found!  Perhaps GetPileupSummaries was run on too small of an interval, or perhaps the sample was extremely inbred or haploid.
09:46:06.117 WARN  CalculateContamination - No hom alt sites found!  Perhaps GetPileupSummaries was run on too small of an interval, or perhaps the sample was extremely inbred or haploid.

To generate the pileup file required by CalculateContamination, I ran GetPileupSummaries and restricted the region with -L to a BED file containing the 77 genes of interest. The pileup file looks normal and contains 311 variants; is this not enough for CalculateContamination? Can CalculateContamination not be run on small targeted sequencing panels? I would appreciate any assistance!
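
For reference, this is roughly how the pileup table and the contamination table were generated (the BAM name and the common-variants resource are placeholders; the -L interval was our 77-gene BED file):

gatk GetPileupSummaries \
    -I BC002-03042014_A.bam \
    -V common_biallelic_snps.vcf.gz \
    -L panel_77_genes.bed \
    -O out/BC002-03042014_A_getpileupsummaries.table

gatk CalculateContamination \
    -I out/BC002-03042014_A_getpileupsummaries.table \
    -O out/BC002-03042014_A_calculatecontamination.table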


Allele Depth (AD) is lower than expected


The problem:

You're trying to evaluate the support for a particular call, but the numbers in the DP (total depth) and AD (allele depth) fields aren't making any sense. For example, the sum of all the ADs doesn't match up to the DP, or even more baffling, the AD for an allele that was called is zero!

Many users have reported being confused by variant calls where there is apparently no evidence for the called allele. For example, sometimes a VCF may contain a variant call that looks like this:

2 151214 . G A 673.77 . AN=2;DP=10;FS=0.000;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:10:38:702,0,38

You can see in the FORMAT field that the AD values are 0 for both alleles. However, in both the INFO and FORMAT fields, the DP is 10. Because the DP in the INFO field is unfiltered while the DP in the FORMAT field is filtered, the fact that both are 10 tells you that none of the reads were filtered out by the engine's built-in read filters. And if you look at the "bamout", you see 10 reads covering the position! So why is the VCF reporting an AD value of 0?


The explanation: uninformative reads

This is not actually a bug -- the program is doing what we expect; this is an interpretation problem. The answer lies in uninformative reads.

We call a read “uninformative” when it passes the quality filters, but the likelihood of the most likely allele given the read is not significantly larger than the likelihood of the second most likely allele given the read. Specifically, the difference between the Phred-scaled likelihoods must be greater than 0.2 for the read to be considered informative. In other words, the most likely allele must be roughly 60% more likely than the second most likely allele.

Let’s walk through an example to make this clearer. Let’s say we have 2 reads and 2 possible alleles at a site. All of the reads have passed HaplotypeCaller’s quality filters, and the likelihoods of the alleles given the reads are in the table below.

Read    Likelihood of A    Likelihood of T
1       3.8708e-7          3.6711e-7
2       4.9992e-7          2.8425e-7

Note: Keep in mind that HaplotypeCaller marginalizes the likelihoods of the haplotypes given the reads to get the likelihoods of the alleles given the reads. The table above shows the likelihoods of the alleles given the reads. For additional details, please see the HaplotypeCaller method documentation.

Now, let’s convert the likelihoods into Phred-scaled likelihoods. To do this, we simply take the log (base 10) of the likelihoods.

Read    Phred-scaled likelihood of A    Phred-scaled likelihood of T
1       -6.4122                         -6.4352
2       -6.3011                         -6.5463

Now, we want to determine if read 1 is informative. To do this, we simply look at the Phred-scaled likelihoods of the most likely allele and the second most likely allele. The Phred-scaled likelihood of the most likely allele (A) is -6.4122. The Phred-scaled likelihood of the second most likely allele (T) is -6.4352. Taking the difference between the two likelihoods gives us 0.023. Because 0.023 is less than 0.2, read 1 is considered uninformative.

To determine if read 2 is informative, we take -6.3011-(-6.5463). This gives us 0.2452, which is greater than 0.2. Read 2 is considered informative.

How does a difference of 0.2 mean the most likely allele is ~60% more likely than the second most likely allele? Because the likelihoods are log-scaled, a difference of 0.2 corresponds to a likelihood ratio of 10^0.2 ≈ 1.585, which is approximately 60% greater.
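
If you want to double-check these numbers yourself, a quick shell check works (awk's log() is the natural log, so we divide by log(10) to convert to log10):

awk 'BEGIN { print (log(3.8708e-7) - log(3.6711e-7)) / log(10) }'   # ~0.023, less than 0.2: read 1 is uninformative
awk 'BEGIN { print (log(4.9992e-7) - log(2.8425e-7)) / log(10) }'   # ~0.245, greater than 0.2: read 2 is informative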


Conclusion

So, now that we know the math behind determining which reads are informative, let’s look at how this affects the record output to the VCF. If a read is considered informative, it gets counted toward both the DP and the AD of the allele it supports in the output record. If a read is considered uninformative, it is counted towards the DP, but not the AD. That way, the AD value reflects how many reads actually contributed support for a given allele at the site. We would not want to include uninformative reads in the AD value because we don’t have confidence in them.

Please note, however, that although an uninformative read is not reported in the AD, it is still used in calculations for genotyping. In future we may add an annotation to indicate counts of reads that were considered informative vs. uninformative. Let us know in the comments if you think that would be helpful.

In most cases, you will have enough coverage at a site to disregard small numbers of uninformative reads. Unfortunately, sometimes uninformative reads are the only reads you have at a site. In this case, we report the potential variant allele, but keep the AD values at 0. The uncertainty at the site will be reflected in the GQ and PL values.

A question about the kmer lengths used during the second step of HaplotypeCaller



As shown in my screenshot, I found that HC parses the sequence corresponding to the ActiveRegion on the reference genome, as well as the reads, into kmers of length 10 and 25.

Furthermore, here (https://software.broadinstitute.org/gatk/documentation/article.php?id=4146) you state that in the read threading process, HC starts with the first read and compares its first kmer against the hash table to see whether it has a match.

Given this, I have two points of confusion:
Shouldn't the kmer length be an odd number?
If the kmer length is not consistent between the ref-kmers and the read-kmers, how can a read-kmer be considered a match to a ref-kmer in the hash table?

One more small question: as of the time of this post, I cannot load the web page of your resource bundle via FTP. Every time I try to open that page, a small window pops up asking for a username and password. I enter the username and leave the password blank as instructed, but it does not work; the window keeps popping up every time I hit Enter.

Overestimation of AF in Mutect2 (GATK 4.0.2.1)?


Hi,

I ran Mutect2 (GATK 4.0.2.1) followed by FilterMutectCalls with default parameters. I got some PASSing variants with AF much larger than alt_depth/total_depth (see attached image). I checked the read mapping qualities and most of them are good, so the reads are unlikely to have been filtered out by Mutect2.

Why is the AF larger than the alt_depth/total_depth?

I also noticed that the three variants with overestimated AF show read orientation bias. Can Mutect2 filter variants based on read orientation bias?

Thanks!

chrX 41000336 . A G . PASS DP=2025;ECNT=1;NLOD=241.68;N_ART_LOD=-6.740e-01;POP_AF=1.000e-03;P_GERMLINE=-2.387e+02;TLOD=5.88 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1094,7:0.030:563,1:531,6:33:200,195:60:33:false:false:.:.:100.00:51.51:0.010,0.010,6.358e-03:6.461e-04,1.733e-03,0.998 0/0:828,2:0.028:422,1:406,1:34:199,181:60:19:false:false
chrX 66766308 . A G . PASS DP=1691;ECNT=2;NLOD=90.81;N_ART_LOD=-1.323e+00;POP_AF=1.000e-03;P_GERMLINE=-8.780e+01;TLOD=6.03 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1292,8:0.029:578,0:714,8:32:291,291:60:38:false:false:.:.:100.00:51.51:0.010,0.00,6.154e-03:7.151e-04,2.907e-03,0.996 0/0:314,1:0.033:152,0:162,1:32:191,297:60:15:false:false
chrX 66766320 . C T . PASS DP=1382;ECNT=2;NLOD=74.68;N_ART_LOD=-2.139e+00;POP_AF=1.000e-03;P_GERMLINE=-7.140e+01;TLOD=7.39 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SA_MAP_AF:SA_POST_PROB 0/1:1060,10:0.030:453,0:607,10:30:292,329:60:32:false:false:.:.:100.00:52.34:0.010,0.010,9.346e-03:0.042,4.118e-04,0.957 0/0:256,1:0.029:124,0:132,1:20:186,135:60:32:false:false
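
For example, for the first record above, the raw fraction implied by the tumor AD values (1094,7) is far below the AF of 0.030 that Mutect2 reports; a quick check:

awk 'BEGIN { print 7 / (1094 + 7) }'   # ~0.0064, versus AF=0.030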

MuTect2 LOD calculations


I am interested in how NLOD and TLOD are calculated in MuTect2. The MuTect2.java code (https://github.com/broadgsa/gatk-protected/blob/master/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/cancer/m2/MuTect2.java#L856) leads me to a function calcGenotypeLikelihoodsOfRefVsAny(), in which when a non-ref is observed:
genotypeLikelihoods[AB] += Math.log10(f*pobs + (1-f)*pobs/3.0d);
However, my understanding is that the real likelihood should be
genotypeLikelihoods[AB] += Math.log10(f*pobs + (1-f)*(1-pobs)/3.0d);

Is this an error, or could you clarify how these likelihood values, NLOD, TLOD are calculated?

I have read the algorithm guide here https://software.broadinstitute.org/gatk/guide/article?id=4442, but couldn't find my answer.
I have also read the MuTect1 paper https://www.ncbi.nlm.nih.gov/pubmed/23396013, but I don't know if they are the same as in MuTect2.

[GATK4 beta] no filter-passing variants in Mutect2 tumor-only runs using default parameters


Hello,

I would like to ask your advice on the tumor-only mode of Mutect.
I ran GATK4 beta.3's Mutect on 20 tumor samples in tumor-only mode and found no variants passing filters. Every variant is filtered out after running the FilterMutectCalls tool. It seems that the germline risk is estimated as very high overall.
Mutect2 was executed using the scripts/mutect2_wdl/mutect2_multi_sample.wdl in the GATK source repository. gnomAD is given as the population allele frequency resource and default parameters are used.
I'd appreciate any help with running Mutect in tumor-only mode.

FYI, the distribution of 10^P_GERMLINE (P_GERMLINE is the log10 posterior probability, reported in the INFO field, that the alt allele is a germline variant) for one tumor sample is summarized below. Outliers are not plotted for the sake of simplicity.

Summary(10^P_GERMLINE)

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04699 0.93345 0.99919 0.94155 1.00000 1.00000

P_GERMLINE plot
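
(For anyone who wants to reproduce the summary above, the P_GERMLINE values can be pulled out of the filtered VCF with something along these lines; bcftools and R are assumed to be available, and the file name is a placeholder.)

bcftools query -f '%INFO/P_GERMLINE\n' tumor_sample.filtered.vcf.gz | grep -v '^\.$' \
    | Rscript -e 'x <- scan("stdin"); print(summary(10^x))'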

(Additionally, none of the toolbar buttons work on the 'ask a question' page where I am writing this question, such as the bold, italic, and file upload buttons. Is it just me?)

Following Best Practices, I have ERROR:MATE_NOT_FOUND in all my files


I am running Mutect2 on about 100 pairs of tumor-normal samples of whole exome sequencing. I am using GATK4.0.9.0. The pipeline used the recommended best practice pipeline:
1. FastqToSam,
2. MarkIlluminaAdapters,
3. SamToFastq, bwa (mem -M -t4),
4. MergeBamAlignment (CREATE_INDEX=true, ADD_MATE_CIGAR=true, CLIP_ADAPTERS=false, CLIP_OVERLAPPING_READS=true, INCLUDE_SECONDARY_ALIGNMENTS=true, MAX_INSERTIONS_OR_DELETIONS=-1, PRIMARY_ALIGNMENT_STRATEGY=MostDistant, ATTRIBUTES_TO_RETAIN=XS )
5. MarkDuplicates (VALIDATION_STRINGENCY SILENT, OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500, CREATE_INDEX=true)
6. BaseRecalibrator (using 100bp-padded interval bed file of the exome design)
7. ApplyBQSR
8. generate the PON VCF successfully using all normal samples.
9. When I tried to run Mutect2 to call the somatic short variants, all pairs went through the analysis except one pair, which repeatedly failed at the same region over several trials. This is part of the log from one failed trial:
...
19:12:18.761 INFO ProgressMeter - chr1:41708411 5.8 5020 858.5
19:12:28.648 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 262.425723624
19:12:28.649 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 13.26 sec
19:12:29.035 INFO Mutect2 - Shutting down engine
[October 10, 2018 7:12:29 PM EDT] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 6.09 minutes.
Runtime.totalMemory()=2715287552
java.lang.IllegalArgumentException: readMaxLength must be > 0 but got 0
...

When I tested the BAM files using ValidateSamFile, I got these results for the tumor and normal samples:
Error Type Count
ERROR:MATE_NOT_FOUND 3622058

Error Type Count
ERROR:MATE_NOT_FOUND 2773161

I used the FixMateInformation tool to fix the files, re-indexed them, and re-ran Mutect2 successfully. But when I tested all of the BAM files again, I found them carrying the same error! Why did this happen? Is it a bug or did I do something wrong? Why did one pair fail the variant calling step while all the other pairs did not? Does this affect the quality of the variant calls?
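
For reference, the validation and the fix were run roughly like this (Picard-style syntax; file names are placeholders):

java -jar picard.jar ValidateSamFile I=tumor.bam MODE=SUMMARY
java -jar picard.jar FixMateInformation I=tumor.bam O=tumor.fixmate.bam CREATE_INDEX=true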

Thank you

How does a panel of normals affect variant calling using Mutect2?


Hello,
I am running an analysis on tumor samples using Mutect2. To investigate the effect of a PON on variant calling, I ran my analysis with and without the PON. The two analyses yield different numbers of called variants; as expected, the analysis without the PON yields more. But in the analysis with the PON there are still variants marked as filtered on the basis of the PON.

That surprises me, as I thought Mutect2 would do the calling and then filter out variants marked as site-specific artefacts. However, based on the reduced number of called variants plus the variants filtered on the basis of the PON, I conclude that the PON affects variant calling in (at least) two different ways. I would appreciate it if you could clarify this.
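
For context, this is a rough sketch of how the two runs differed (GATK4 syntax; file and sample names are placeholders for our actual commands):

# run without the PON
gatk Mutect2 -R ref.fasta -I tumor.bam -tumor tumor_sample -O tumor.no_pon.vcf.gz

# PON built beforehand from per-normal Mutect2 VCFs, then used in the tumor run
gatk CreateSomaticPanelOfNormals -vcfs normal1.vcf.gz -vcfs normal2.vcf.gz -O pon.vcf.gz
gatk Mutect2 -R ref.fasta -I tumor.bam -tumor tumor_sample --panel-of-normals pon.vcf.gz -O tumor.with_pon.vcf.gz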


accuracy produced by Mutect2 is too low


Using data simulated with BAMSurgeon, I tested Mutect2 and Varscan2. There is almost no overlap between Mutect2's VCF and my ground truth, while Varscan2's result is reasonable.

Why? Is my usage wrong?

My code is below.

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class run_M2 extends QScript {

  @Argument(shortName = "L",  required=false, doc = "Intervals file")
  var intervalsFile: List[File] = Nil
  @Argument(shortName = "normal",  required=true, doc = "Normal sample BAM")
  var normalBAM: String = ""
  @Argument(shortName = "tumor", required=true, doc = "Tumor sample BAM")
  var tumorBAM: String = ""
  @Argument(shortName = "o",  required=true, doc = "Output file")
  var outputFile: String = ""
  @Argument(shortName = "sc",  required=false, doc = "base scatter count")
  var scatter: Int = 200
//  @Argument(shortName = "output_mode",  required=false, doc = "output_mode")                                     
//  var output_mode: String = "EMIT_ALL_SITES"                                                                     
  /*                                                                                                               
  @Argument(shortName = "cosmic",  required=true, doc = "cosmic vcf file ")                                        
  var cosmic: String = ""                                                                                          
  @Argument(shortName = "pon",  required=true, doc = "vcf panel of normal file ")                                  
  var pon: String = ""                                                                                             
   */
    def script() {

    val mutect2 = new MuTect2

    mutect2.reference_sequence = new File("/ref/GATK/ucsc.hg19/ucsc.hg19.fasta")
   /*                                                                                                              
    mutect2.cosmic = new File(cosmic)                                                                              
    mutect2.normal_panel = new File(pon)                                                                           
    */
    mutect2.dbsnp = new File("/ref/GATK/knownsites/dbsnp_138.hg19.vcf")
    mutect2.intervalsString = intervalsFile
    mutect2.memoryLimit = 2
    mutect2.input_file = List(new TaggedFile(normalBAM, "normal"), new TaggedFile(tumorBAM, "tumor"))

    mutect2.scatterCount = scatter
    mutect2.out = outputFile
//    mutect2.output_mode = output_mode                                                                            
    add(mutect2)
  }

}

java -jar ~/Github/gatk-protected/target/Queue.jar -S M2.scala --job_queue main.q -qsub -startFromScratch \
    -sc 1000 -tumor ../data/benchmark/tumor.bam -normal ../data/benchmark/normal.bam \
    -o ../data/benchmark/Test1_noInterval_benchmark.vcf --start_from_scratch -run

Mutect2 missed variant called by HaplotypeCaller


Hi,

I am running GATK 3.5.0 with Java version 1.8.0. I have two cell line samples that I paired with a Promega baseline reference (it's essentially a mixed germline sample) to run Mutect2 (which I am aware is not part of the Best Practices). I also ran the tumour sample alone using the HaplotypeCaller and noticed a very clear ALK variant that was missed by Mutect2 but called by the HaplotypeCaller in both samples. Due to the nature of the cell line we also expected to see an ALK variant, which is why we checked for it specifically.

What I find odd is that the local reassembly in Mutect2 seems to have discarded the variant: the Mutect2 bamout does not contain the variant (C > T) at position chr2:29443695, whereas the HaplotypeCaller output does for both samples. I have read through the documentation and the specifics of the local reassembly and would be very interested in knowing at what stage this occurs, and in your suggestions on what can be done.

I will be trying GATK v4.0 as well as some of the things mentioned here https://software.broadinstitute.org/gatk/documentation/article?id=1235. In the meantime I would be very grateful if someone could look into this. I will be posting updates on my new tests as well. See details below on various metrics and IGV screenshots.

The chemistry is a DNA capture Kapa hyperplus kit, 75 paired end reads.

Sample 945

  • Entire ALK covered up to 80X
  • Mean/min coverage 1013/378
  • BWA bam shows 50% allele frequency

HaplotypeCaller line Sample 945

  • chr2 29443695 . G T 8496.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=5.863;ClippingRankSum=-0.368;DP=601;ExcessHet=3.0103;FS=0.536;MLEAC=1;MLEAF=0.500;MQ=62.46;MQRankSum=1.113;QD=14.21;ReadPosRankSum=0.502;SOR=0.76 GT:AD:DP:GQ:PL 0/1:300,298:598:99:8525,0,8240

Sample 946

  • Entire ALK covered up to 80x
  • Mean/min coverage 523/204
  • BWA bam shows 49% allele frequency

HaplotypeCaller line Sample 946

  • chr2 29443695 . G T 5056.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.569;ClippingRankSum=-0.212;DP=397;ExcessHet=3.0103;FS=2.133;MLEAC=1;MLEAF=0.500;MQ=63.61;MQRankSum=-1.274;QD=13.00;ReadPosRankSum=0.063;SOR=0.595 GT:AD:DP:GQ:PL 0/1:199,190:389:99:5085,0,5319

Promega control sample

  • Same control sample used as pair for both 945 and 946 using Mutect
  • Coverage around ALK region ~200+

Please see IGV images of the various cases below. The --bamout command (run together with disabling optimization and forcing output) was run with 500bp of padding upstream and downstream of the target location that contains the variant (i.e. the actual padding upstream and downstream of the variant at position 29443695 is slightly more than 500bp). I also ran Mutect2 with the same 500bp padding but included all the targets on chr2, without adding padding to any target other than the one containing the variant.
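
For completeness, the forced bamout runs were along these lines (GATK 3.5 syntax; file names and the padded interval list are placeholders, and as far as I recall -forceActive / -disableOptimizations are the flags for forcing output and disabling optimization):

java -jar GenomeAnalysisTK.jar -T MuTect2 \
    -R ref.fasta \
    -I:tumor Sample945.bam \
    -I:normal promega_control.bam \
    -L alk_target_padded500bp.intervals \
    -forceActive -disableOptimizations \
    -bamout Sample945_MutectForcedBamOut.bam \
    -o Sample945_Mutect2.vcf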

Sample945_bwaBAM - Bam output from BWA

Sample946_bwaBAM - Bam output from BWA

Sample945_GATKForcedBamOut

Sample946_GATKForcedBamOut

Sample945_MutectForcedBamOutChr2

Sample946_MutectForcedBamOutChr2

Sample945_MutectForcedBamOutALKOnly

Sample946_MutectForcedBamOutALKOnly

Thank you very much, and I look forward to hearing your thoughts on this.
Sabri

Oncotator overview and basic usage


Overview

Oncotator is a tool for annotating information onto genomic point mutations (SNPs/SNVs) and indels. It is primarily intended to be used on human genome variant callsets and we only provide data sources that are relevant to cancer researchers. However, the tool can technically be used to annotate any kind of information onto variant callsets from any organism, and we provide instructions on how to prepare custom data sources for inclusion in the process.

Usage

By default, Oncotator is set up to take a simple TSV (a.k.a. MAFLITE) as input and produce a TCGA MAF as output. See details below.

Oncotator also supports VCF as an input and/or output format.

Input

The input tsv (MAFLITE) file must have the following columns (with column headers):

  • build (at this time the build must be hg19 for all variants)
  • chr
  • start
  • end
  • ref_allele (should be "-" for an insertion)
  • alt_allele (should be "-" for a deletion)

An example input file is provided with the program files. For SNPs, see test/testdata/maflite/Patient0.snp.maf.txt. For Indels, see test/testdata/maflite/Patient0.indel.maf.txt

Several additional columns are not populated by the annotation process and must be provided by the user (instructions below). If these are missing, UNKNOWN will appear in the output file.

  • tumor_barcode
  • normal_barcode
  • NCBI_Build
  • Strand
  • Center
  • source
  • status
  • phase
  • sequencer
  • Tumor_Validation_Allele1
  • Tumor_Validation_Allele2
  • Match_Norm_Validation_Allele1
  • Match_Norm_Validation_Allele2
  • Verification_Status
  • Validation_Status
  • Validation_Method
  • Score
  • BAM_file
  • Match_Norm_Seq_Allele1
  • Match_Norm_Seq_Allele2

If you would like to eliminate the UNKNOWN values, you have four options:

1. Create an annotation override file

This will overwrite (or create) values in all variants for the specified annotations. See the --override_config or --default_config flag. An example override config file is provided with the program files (exampleOverrides.config found in the doc/ dir of the source code). Use this when one value should go into the specified annotations for all input variants.

2. Provide the fields as part of the input tsv file

Do this when the annotations change between variants.

3. Use the override flag on the command line

See the -a flag in the usage information.

4. Specify that your output should be a simple TSV instead of a TCGA MAF

This will put all annotations as column headers and, since no annotations are required, no UNKNOWN values will appear. Use -o SIMPLE_TSV when calling oncotator. Do this when you want a simple dump of all annotations for all variants.

Output

The default output is a TCGA MAF (version 2.4). The specification can be found at: https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification

If you would prefer a simple tsv as output, just include the -o SIMPLE_TSV flag when running oncotator.
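
For reference, a minimal end-to-end invocation might look like the following (the data-source directory and file names are placeholders; check oncotator --help for the exact options available in your installed version):

# MAFLITE in, TCGA MAF out (the default formats)
oncotator -v --db-dir /path/to/oncotator_datasources input.maflite.tsv output.annotated.maf hg19

# VCF in, simple TSV out
oncotator -v --db-dir /path/to/oncotator_datasources -i VCF -o SIMPLE_TSV input.vcf output.tsv hg19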

Evaluating the evidence for haplotypes and variant alleles (HaplotypeCaller & Mutect2)


This document details the procedure used by HaplotypeCaller to evaluate the evidence for variant alleles based on candidate haplotypes determined in the previous step for a given ActiveRegion. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

This procedure is also applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.


Contents

  1. Overview
  2. Evaluating the evidence for each candidate haplotype
  3. Evaluating the evidence for each candidate site and corresponding alleles

1. Overview

The previous step produced a list of candidate haplotypes for each ActiveRegion, as well as a list of candidate variant sites borne by the non-reference haplotypes. Now, we need to evaluate how much evidence there is in the data to support each haplotype. This is done by aligning each sequence read to each haplotype using the PairHMM algorithm, which produces per-read likelihoods for each haplotype. From that, we'll be able to derive how much evidence there is in the data to support each variant allele at the candidate sites, and that produces the actual numbers that will finally be used to assign a genotype to the sample.


2. Evaluating the evidence for each candidate haplotype

We originally obtained our list of haplotypes for the ActiveRegion by constructing an assembly graph and selecting the most likely paths in the graph by counting the number of supporting reads for each path. That was a fairly naive evaluation of the evidence, done over all reads in aggregate, and was only meant to serve as a preliminary filter to whittle down the number of possible combinations that we're going to look at in this next step.

Now we want to do a much more thorough evaluation of how much evidence we have for each haplotype. So we're going to take each individual read and align it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm (see Durbin et al., 1998). If you're not familiar with PairHMM, it's a lot like the BLAST algorithm, in that it's a pairwise alignment method that uses a Hidden Markov Model (HMM) and produces a likelihood score. In this use of the PairHMM, the output score expresses the likelihood of observing the read given the haplotype by taking into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). Note: If reads from a pair overlap at a site and they have the same base, the base quality is capped at Q20 for both reads (Q20 is half the expected PCR error rate). If they do not agree, we set both base qualities to Q0.

This produces a big table of likelihoods where the columns are haplotypes and the rows are individual sequence reads. The table essentially represents how much supporting evidence there is for each haplotype (including the reference), itemized by read.


3. Evaluating the evidence for each candidate site and corresponding alleles

Having per-read likelihoods for entire haplotypes is great, but ultimately we want to know how much evidence there is for individual alleles at the candidate sites that we identified in the previous step. To find out, we take the per-read likelihoods of the haplotypes and marginalize them over alleles, which produces per-read likelihoods for each allele at a given site. In practice, this means that for each candidate site, we're going to decide how much support each read contributes for each allele, based on the per-read haplotype likelihoods that were produced by the PairHMM.

This may sound complicated, but the procedure is actually very simple -- there is no real calculation involved, just cherry-picking appropriate values from the table of per-read likelihoods of haplotypes into a new table that will contain per-read likelihoods of alleles. This is how it happens. For a given site, we list all the alleles observed in the data (including the reference allele). Then, for each read, we look at the haplotypes that support each allele; we select the haplotype that has the highest likelihood for that read, and we write that likelihood in the new table. And that's it! For a given allele, the total likelihood will be the product of all the per-read likelihoods.
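
In symbols, the selection described above amounts to the following, where r is a read, A is an allele at the candidate site, and H ranges over the candidate haplotypes that carry A:

    \mathcal{L}(r \mid A) = \max_{H \,:\, A \in H} \mathcal{L}(r \mid H), \qquad \mathcal{L}(A) = \prod_{r} \mathcal{L}(r \mid A)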

At the end of this step, sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant, and a genotype will be assigned to the sample in the next (final) step.

MuTect2 strandbias + TLOD clarification


Hi,

I have a set of tumour samples and I would like to call variants using MuTect2 without matching normals, annotate using VEP and filter out known germline variants afterwards.

I used tumor-only mode with the downsampling process turned off. A number of artefacts are being called, and I found at least one variant that looks real but was not called. I could think of two options to improve the calling, hence my questions :)

1- Strand bias: How can I find information about strand bias? I am looking for details like what we typically see in the call.stats output of MuTect (i.e. LOD scores of the forward and reverse strands), but have not been able to modify my command to include that information. I think some artefacts may be due to strand bias.

2- TLOD: This is where I got confused. Could you explain how MuTect2 calculates TLOD in the absence of a matched normal? I use the LOD scores to determine real calls. The majority of real variants have massive TLOD values compared to all calls within each sample. But in my set of samples, there was one variant that seems to be true and had a small TLOD value. I am starting to think that MuTect2 needs something as a normal to generate a correct TLOD, but I am not sure.

This is what I ran:

Using hg38

gatk Mutect2 \
-R hg38 \
-I test.bam \
-L interval_list \
-O test.vcf \
-tumor test.bam \
--contamination-fraction-to-filter 0.0 \
--max-reads-per-alignment-start 0 \

Any comments would be highly appreciated.

Thank you

Run Mutect2 in gatk-4.0.4.0 with an error `java.lang.NumberFormatException: For input string: "."`


Hi there~

When I ran Mutect2 in gatk-4.0.4.0, something went wrong.

gatk --java-options "-Xmx7g -XX:+UseParallelGC -XX:ParallelGCThreads=8" Mutect2 \
--dont-use-soft-clipped-bases true --max-reads-per-alignment-start 3000 --min-base-quality-score 20 \
-R hg19.fasta \
--germline-resource gnomad.exomes.r2.0.2.sites.vcf.gz \
-I test-1.final.bam  -tumor test-1 \
-O test-1.raw.vcf \
-L agilent.gatk.intervals

It fails with:

[August 8, 2018 10:27:44 AM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 0.34 minutes.

Runtime.totalMemory()=3729260544
java.lang.NumberFormatException: For input string: "."
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.lang.Double.valueOf(Double.java:502)
at htsjdk.variant.variantcontext.CommonInfo.lambda$getAttributeAsDoubleList$2(CommonInfo.java:299)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Collections$2.tryAdvance(Collections.java:4717)
at java.util.Collections$2.forEachRemaining(Collections.java:4725)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsList(CommonInfo.java:273)
at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsDoubleList(CommonInfo.java:293)
at htsjdk.variant.variantcontext.VariantContext.getAttributeAsDoubleList(VariantContext.java:740)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.isActive(Mutect2Engine.java:253)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:159)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:135)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:34)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:290)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:271)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:892)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

How can I fix it?

Mutect2 pipeline fails for some inputs


I'm running a WGS analysis and parallelizing the run per chromosome. A few chromosomes are failing the somatic variant calling process with the following error. It is really difficult to pinpoint what the problem is, because most of the chromosomes are processed correctly to the end (20 out of 24). I'm guessing there is some integer vs. floating point conversion error. For now, I would really appreciate it if you could tell me how to get rid of this issue!

I think this is really something you should fix.

I'm running Mutect2 in a docker container: GATK jar /gatk/gatk-package-4.0.11.0-local.jar

...
15:39:59.895 INFO  ProgressMeter -        chr2:89909854            218.8                770780           3523.1
15:40:09.967 INFO  ProgressMeter -        chr2:90031053            218.9                771760           3524.9
15:40:20.003 INFO  ProgressMeter -        chr2:90125567            219.1                772540           3525.7
15:40:30.146 INFO  ProgressMeter -        chr2:90285632            219.3                773610           3527.9
15:40:43.457 INFO  ProgressMeter -        chr2:90296305            219.5                773680           3524.7
15:40:53.596 INFO  ProgressMeter -        chr2:90357855            219.7                774100           3523.9
15:40:59.892 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 11.926295505
15:40:59.892 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 1194.2077419680002
15:40:59.892 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 2637.29 sec
INFO    2018-10-24 15:41:01     SortingCollection       Creating merging iterator from 8 files
15:41:13.039 INFO  Mutect2 - Shutting down engine
[October 24, 2018 3:41:13 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 220.03 minutes.
Runtime.totalMemory()=12962496512
java.lang.NumberFormatException: For input string: "."
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.lang.Double.valueOf(Double.java:502)
        at htsjdk.variant.variantcontext.CommonInfo.lambda$getAttributeAsDoubleList$2(CommonInfo.java:299)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Collections$2.tryAdvance(Collections.java:4717)
        at java.util.Collections$2.forEachRemaining(Collections.java:4725)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsList(CommonInfo.java:273)
        at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsDoubleList(CommonInfo.java:293)
        at htsjdk.variant.variantcontext.VariantContext.getAttributeAsDoubleList(VariantContext.java:740)
        at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getGermlineAltAlleleFrequencies(GermlineProbabilityCalculator.java:49)
        at org.broadinstitute.hellbender.tools.walkers.mutect.GermlineProbabilityCalculator.getPopulationAFAnnotation(GermlineProbabilityCalculator.java:27)
        at org.broadinstitute.hellbender.tools.walkers.mutect.SomaticGenotypingEngine.callMutations(SomaticGenotypingEngine.java:155)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2Engine.callRegion(Mutect2Engine.java:221)
        at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2.apply(Mutect2.java:230)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:291)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/gatk-package-4.0.11.0-local.jar

Current status of GATK4 GermlineCNVCaller tools and best practices.


Hi,

I would like to try out GATK4 for discovering or genotyping germline CNVs in a cohort of a few hundred whole-genome sequenced samples. I work with non-human species data, but the genome sizes are about the same as human or smaller.

The best practice documentation for germline CNV calling is still empty.
https://software.broadinstitute.org/gatk/best-practices/workflow?id=11148

According to the gatk4-4.0.0.0-0 JAR file, germline CNV calling tools are already included.
java -jar ./gatk4-4.0.0.0-0/gatk-package-4.0.0.0-local.jar
USAGE: [-h]
--------------------------------------------------------------------------------------
Copy Number Variant Discovery: Tools that analyze read coverage to detect copy number variants.
AnnotateIntervals (BETA Tool) Annotates intervals with GC content
CallCopyRatioSegments (BETA Tool) Calls copy-ratio segments as amplified, deleted, or copy-number neutral
CombineSegmentBreakpoints (EXPERIMENTAL Tool) Combine the breakpoints of two segment files and annotate the resulting intervals with chosen columns from each file.
CreateReadCountPanelOfNormals (BETA Tool) Creates a panel of normals for read-count denoising
DenoiseReadCounts (BETA Tool) Denoises read counts to produce denoised copy ratios
DetermineGermlineContigPloidy (BETA Tool) Determines the baseline contig ploidy for germline samples given counts data.
GermlineCNVCaller (BETA Tool) Calls copy-number variants in germline samples given their counts and the output of DetermineGermlineContigPloidy.
ModelSegments (BETA Tool) Models segmented copy ratios from denoised read counts and segmented minor-allele fractions from allelic counts
PlotDenoisedCopyRatios (BETA Tool) Creates plots of denoised copy ratios
PlotModeledSegments (BETA Tool) Creates plots of denoised and segmented copy-ratio and minor-allele-fraction estimates

Can you give some more information about the current status of the GATK4 GermlineCNVCaller tools, and do you have an estimate of when the best practices for these tools will be available?

It would also be nice if you could give an idea of whether the GATK4 GermlineCNVCaller tools are expected to work for non-human species, e.g. other vertebrates, simple/complex plant genomes, and bacteria.

Thank you.

CalculateGenotypePosteriors produces a bunch of zero coverage variants


Hi GATK Team,

We are running small targeted panels on GATK4. It seems that most of the variants (~90%) are DP 0 variants, emerging after applying CalculateGenotypePosteriors. Before this step we run VQSR with over thirty exomes. Should we use external databases for CalculateGenotypePosteriors?

This is what we do now:

GATK version 4.0.4.0
${tool_gatk} --java-options "${javaArg_xms} ${javaArg_xmx}" CalculateGenotypePosteriors \
    -R ${reference} \
    -V ${outDirectory}/${variant_vcf}.vcf \
    -O ${outDirectory}/${variant_vcf}.postCGP.vcf \
    --supporting ${knownsite_hapmap} \
    --pedigree ${pedigree}

Greetings
Martin

Invalid or corrupt jarfile


When I run

./gatk --help

it seems to be working fine. However, running anything else such as

./gatk --list

produces an error:

Error: Invalid or corrupt jarfile /path/to/gatk/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar

What's going on? Sorry, this might be a noob question.

sorting a BAM file with PICARD


Dear all,

Would you please advise: I am using Picard to sort a BAM by read name (it is a BAM file from EGA that contains cancer sequencing data). When I run Picard SortSam, I get the following error and the file does not get sorted. Is there a way I could fix it? Thank you very much!

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 957453876, Read name HWI-ST7001002R:223:C14GPACXX:3:1305:7471:56486, MAPQ should be 0 for unmapped read
The Picard command is:

java -jar $PICARD SortSam \
    I=$FILE \
    O="${FILE}.sorted.picard.queryname.bam"

Calling variants in RNAseq


Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

image

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

image

Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline is producing both false positive (wrong variant calls) and false negative (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback on their experiences applying our recommendations to their data.


The workflow

1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

4) The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

2. Add read groups, sort, mark duplicates, and create index

The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample 

java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics 

3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clips any sequences overhanging into the intronic regions.

image

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).
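
For reference, the realignment commands have the same shape as in the DNAseq pipeline (GATK 3 syntax; the known-indels resource and file names are placeholders):

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I split.bam -known known_indels.vcf -o realignment_targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I split.bam -targetIntervals realignment_targets.intervals -known known_indels.vcf -o realigned.bam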

5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.
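
For reference, the recalibration commands likewise follow the DNAseq pattern (GATK 3 syntax; the known-sites files are placeholders):

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I realigned.bam -knownSites dbsnp.vcf -knownSites known_indels.vcf -o recal_data.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I realigned.bam -BQSR recal_data.table -o recalibrated.bam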

6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we set the minimum phred-scaled confidence threshold for calling variants to 20, but you can lower this to increase sensitivity if needed.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf

7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs within a window of 35 bases by adding -window 35 -cluster 3 to your command. This filter recommendation is specific to RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf 

Please note that we selected these hard filtering values in attempting to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positive calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

image

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

image


Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (SplitNCigarReads) cleans up many false positive calls that are caused by splicing inaccuracies from the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

image

image

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, improve the true positive rate and minimize the false positive rate, and develop statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.


[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013


NOTE: Questions about this document that were posted before June 2014 have been moved to this archival thread: http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow
