Channel: Recent Discussions — GATK-Forum

What types of variants can GATK tools detect / handle?


The answer depends on what tool we're talking about, and whether we're considering variant discovery or variant manipulation.

Variant manipulation

GATK variant manipulation tools are able to recognize the following types of alleles:

  • SNP (single nucleotide polymorphism)
  • INDEL (insertion/deletion)
  • MIXED (combination of SNPs and indels at a single position)
  • MNP (multi-nucleotide polymorphism, e.g. a dinucleotide substitution)
  • SYMBOLIC (such as the <NON_REF> allele used in GVCFs produced by HaplotypeCaller, the * allele used to signify the presence of a spanning deletion, or undefined events such as a very large allele or one that is fuzzy and not fully modeled; i.e. some event is going on here, but we don't know what exactly)

Note that SelectVariants, the GATK tool most commonly used for VCF subsetting operations, discriminates strictly between these categories. This means that if you use, for example, -selectType INDEL to pull out indels, it will only select pure INDEL records, excluding any MIXED records that might include a SNP allele in addition to the insertion or deletion alleles of interest. To include those, you would also have to specify -selectType MIXED in the same command.
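To make the categories above concrete, here is a small Python sketch (not GATK's actual implementation; the classification logic is a simplified approximation) of how a record's REF and ALT alleles map onto these mutually exclusive types:

```python
def classify(ref, alts):
    """Roughly classify a VCF record as SNP, INDEL, MNP, MIXED, or SYMBOLIC
    from its REF allele and list of ALT alleles (simplified approximation)."""
    def allele_type(ref, alt):
        if alt.startswith("<") or alt == "*":
            return "SYMBOLIC"        # e.g. <NON_REF>, or the spanning-deletion *
        if len(ref) == len(alt):
            return "SNP" if len(alt) == 1 else "MNP"
        return "INDEL"
    types = {allele_type(ref, a) for a in alts}
    # A record with alleles of more than one type is MIXED.
    return types.pop() if len(types) == 1 else "MIXED"

print(classify("A", ["T"]))          # SNP
print(classify("AT", ["A"]))         # INDEL
print(classify("AT", ["GC"]))        # MNP
print(classify("AT", ["A", "GT"]))   # MIXED: one indel allele, one substitution
```

A MIXED record like the last example is exactly what -selectType INDEL alone would skip.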

Variant discovery

The HaplotypeCaller is a sophisticated variant caller that can call different types of variants at the same time. In addition to SNPs and indels, it is capable of emitting mixed records by default, as well as symbolic representations for e.g. spanning deletions. It does emit physical phasing information, but in its current version, HaplotypeCaller is not able to emit MNPs. If you would like to combine contiguous SNPs into MNPs, you will need to use the ReadBackedPhasing tool with the MNP merging function activated; see the tool documentation for details.

Our older (and now deprecated) variant caller, UnifiedGenotyper, was even more limited. It only called SNPs and indels, and did so separately (even if you ran in calling mode BOTH, the program performed separate calling operations internally), so it was not able to recognize that SNPs and indels should be emitted together as a joint record when they occur at the same site.

The general release version of GATK is currently not able to detect SVs (structural variations) or CNVs (copy number variations). However, the alpha version of GATK 4 (the next generation of GATK tools) includes tools for performing CNV (copy number variation) analysis in exome data. Let us know if you're interested in trying them out by commenting on this article in the forum.

There is also a third-party software package called GenomeSTRiP built on top of GATK that provides SV (structural variation) analysis capabilities.


Regarding ploidy in HaplotypeCaller for multiple replicates of pooled RNAseq


Hi,
I am a little confused about the best practices for running HaplotypeCaller to call variants given the pooled nature of my study; any feedback is super appreciated!

I have 10 replicates of pooled RNAseq data for each of two samples (10 replicates for Sample A, 10 replicates for Sample B). By pooled I mean each replicate has mRNA from 20 individuals all mixed together with no barcoding (population genetics study).

I had planned to simply merge the BAM files of these replicates, which have read group sample names (SM) of SampleA and SampleB, and run HaplotypeCaller once for Sample A and once for Sample B. However, that would mean setting ploidy = 2 x 200. This seems very high!
Would it be better to run HaplotypeCaller on each replicate separately, without merging the BAM files, setting ploidy = 2 x 20, and then use a tool such as CombineVariants to stack my VCF files into two samples for downstream comparisons?
Any advice?
Regards!
Chris

VariantFiltration is not filtering?


Hi,

I'm trying to filter out variants in a .vcf file that I generated from DNAseq data (BWA->picard->GATK).

I'm following the most recent documentation I can find on VariantFiltration and the most basic command:

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R genome.fa -V raw_snps.vcf -o filtered_snp.vcf --filterExpression "QD < 10.0" --filterName "my_snp_filter"

This throws no errors but filters out nothing. I've been fighting with this for several days now (after I realized that this step in my pipeline was failing).

Apparently my university's cluster only has GATK 3.6 installed--could that be the problem?

I'll paste a couple of lines from my input .vcf so you can check whether the file looks correctly formatted (it was generated via HaplotypeCaller):

X   23538706    .   A   T   160.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.068e+00;ClippingRankSum=0.00;DP=15;ExcessHet=3.0103;FS=9.859;MLEAC=1;MLEAF=0.500;MQ=48.55;MQRankSum=-3.239e+00;QD=10.72;ReadPosRankSum=-1.964e+00;SOR=3.556  GT:AD:DP:GQ:PGT:PID:PL  0/1:9,6:15:99:0|1:23538706_A_T:189,0,353
X   23538720    .   T   A   67.77   .   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.147e+00;ClippingRankSum=0.00;DP=12;ExcessHet=3.0103;FS=2.932;MLEAC=1;MLEAF=0.500;MQ=51.53;MQRankSum=-2.467e+00;QD=5.65;ReadPosRankSum=-2.362e+00;SOR=1.828   GT:AD:DP:GQ:PGT:PID:PL  0/1:9,3:12:96:0|1:23538706_A_T:96,0,408
X   23538721    .   G   T   67.77   .   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.000e+00;ClippingRankSum=0.00;DP=12;ExcessHet=3.0103;FS=2.932;MLEAC=1;MLEAF=0.500;MQ=51.53;MQRankSum=-2.467e+00;QD=5.65;ReadPosRankSum=-2.838e+00;SOR=1.828   GT:AD:DP:GQ:PGT:PID:PL  0/1:9,3:12:96:0|1:23538706_A_T:96,0,408
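For reference, evaluating the poster's expression by hand on these three records (a rough Python sketch, not GATK's JEXL engine) suggests only the two QD=5.65 records should be flagged; note also that VariantFiltration marks records in the FILTER column rather than removing them from the file:

```python
def qd(info):
    """Pull the QD annotation out of a VCF INFO string."""
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return float(fields["QD"])

# QD values copied from the three pasted records (INFO strings abbreviated).
records = [
    ("X", 23538706, "QD=10.72"),
    ("X", 23538720, "QD=5.65"),
    ("X", 23538721, "QD=5.65"),
]
for chrom, pos, info in records:
    # The filter expression names records to FLAG, not to drop.
    status = "my_snp_filter" if qd(info) < 10.0 else "PASS"
    print(chrom, pos, status)
```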

Thank you for any help!

I have been trying to run the following command for Queue.jar, but it's not compiling properly.


Command:

java -jar Queue-3.4-46/Queue.jar -S call_hs_snps.scala -R human_g1k_v37_decoy.fasta -D dbsnp_138.b37.vcf -H hapmap_3.3.b37.vcf -O 1000G_omni2.5.b37.vcf -T 1000G_phase1.snps.high_confidence.b37.vcf -M Mills_and_1000G_gold_standard.indels.b37.vcf -o output_gatk/ -P pedigree.txt -E 1 -F 1 -C 1 -jobRunner ParallelShell -jobReport call_hs_snps.jobreport.txt -resMemLimit 3 -memLimit 2.00000000000000000000 --maximumNumberOfJobsToRunConcurrently 1 -I ERR1025650/alignments/splitters.bam

Error:
INFO 11:35:23,449 QScriptManager - Compiling 1 QScript
ERROR 11:35:27,463 QScriptManager - call_hs_snps.scala:147: value dbsnp is not a member of org.broadinstitute.gatk.queue.extensions.gatk.CombineGVCFs with pipeline.this.CommonArguments
ERROR 11:35:27,481 QScriptManager - combine.dbsnp = dbSnp
ERROR 11:35:27,483 QScriptManager - ^
ERROR 11:35:28,240 QScriptManager - two errors found

Could you please help in this issue?

Allele Depth (AD) is lower than expected


The problem:

You're trying to evaluate the support for a particular call, but the numbers in the DP (total depth) and AD (allele depth) fields aren't making any sense. For example, the sum of all the ADs doesn't match up to the DP, or even more baffling, the AD for an allele that was called is zero!

Many users have reported being confused by variant calls where there is apparently no evidence for the called allele. For example, sometimes a VCF may contain a variant call that looks like this:

2 151214 . G A 673.77 . AN=2;DP=10;FS=0.000;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:10:38:702,0,38

You can see in the FORMAT field that the AD values are 0 for both alleles. However, the DP is 10 in both the INFO and FORMAT fields. Because the DP in the INFO field is unfiltered while the DP in the FORMAT field is filtered, the fact that they are equal tells you that none of the reads were removed by the engine's built-in read filters. And if you look at the "bamout", you see 10 reads covering the position! So why is the VCF reporting an AD value of 0?


The explanation: uninformative reads

This is not actually a bug -- the program is doing what we expect; this is an interpretation problem. The answer lies in uninformative reads.

We call a read “uninformative” when it passes the quality filters, but the likelihood of the most likely allele given the read is not significantly larger than the likelihood of the second most likely allele given the read. Specifically, the difference between the Phred-scaled likelihoods must be greater than 0.2 to be considered significant. In other words, the most likely allele must be roughly 60% more likely than the second most likely allele.

Let’s walk through an example to make this clearer. Let’s say we have 2 reads and 2 possible alleles at a site. All of the reads have passed HaplotypeCaller’s quality filters, and the likelihoods of the alleles given the reads are in the table below.

Read    Likelihood of A    Likelihood of T
1       3.8708e-7          3.6711e-7
2       4.9992e-7          2.8425e-7

Note: Keep in mind that HaplotypeCaller marginalizes the likelihoods of the haplotypes given the reads to get the likelihoods of the alleles given the reads. The table above shows the likelihoods of the alleles given the reads. For additional details, please see the HaplotypeCaller method documentation.

Now, let’s convert the likelihoods into Phred-scaled likelihoods. To do this, we simply take the log (base 10) of the likelihoods.

Read    Phred-scaled likelihood of A    Phred-scaled likelihood of T
1       -6.4122                         -6.4352
2       -6.3011                         -6.5463

Now, we want to determine whether read 1 is informative. To do this, we simply look at the Phred-scaled likelihoods of the most likely and second most likely alleles. The Phred-scaled likelihood of the most likely allele (A) is -6.4122. The Phred-scaled likelihood of the second most likely allele (T) is -6.4352. Taking the difference between the two likelihoods gives us 0.023. Because 0.023 is less than 0.2, read 1 is considered uninformative.

To determine if read 2 is informative, we take -6.3011-(-6.5463). This gives us 0.2452, which is greater than 0.2. Read 2 is considered informative.

How does a difference of 0.2 mean the most likely allele is ~60% more likely than the second most likely allele? Because the likelihoods are Phred-scaled, a difference of 0.2 corresponds to a likelihood ratio of 10^0.2 ≈ 1.585, which is approximately 60% greater.
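The worked example above can be reproduced in a few lines (a sketch only; the 0.2 threshold and the marginalized likelihoods are taken straight from the tables):

```python
import math

THRESHOLD = 0.2  # required log10-likelihood difference, per the text above

def is_informative(likelihoods):
    """likelihoods: dict mapping allele -> likelihood of that allele given the read."""
    log10s = sorted((math.log10(x) for x in likelihoods.values()), reverse=True)
    # Informative iff the best allele beats the runner-up by more than 0.2.
    return (log10s[0] - log10s[1]) > THRESHOLD

read1 = {"A": 3.8708e-7, "T": 3.6711e-7}
read2 = {"A": 4.9992e-7, "T": 2.8425e-7}
print(is_informative(read1))       # False: difference is ~0.023
print(is_informative(read2))       # True: difference is ~0.245
print(round(10 ** THRESHOLD, 3))   # 1.585, i.e. ~60% more likely
```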


Conclusion

So, now that we know the math behind determining which reads are informative, let’s look at how this affects the record output to the VCF. If a read is considered informative, it gets counted toward the AD and DP of the variant allele in the output record. If a read is considered uninformative, it is counted towards the DP, but not the AD. That way, the AD value reflects how many reads actually contributed support for a given allele at the site. We would not want to include uninformative reads in the AD value because we don’t have confidence in them.

Please note, however, that although an uninformative read is not reported in the AD, it is still used in calculations for genotyping. In the future we may add an annotation to indicate counts of reads that were considered informative vs. uninformative. Let us know in the comments if you think that would be helpful.

In most cases, you will have enough coverage at a site to disregard small numbers of uninformative reads. Unfortunately, sometimes uninformative reads are the only reads you have at a site. In this case, we report the potential variant allele, but keep the AD values at 0. The uncertainty at the site will be reflected in the GQ and PL values.

-maxAltAlleles argument ignored in HC and GenotypeGVCFs in GATK 3.7?


Hi there
I have a similar issue to the one described here:
http://gatkforums.broadinstitute.org/gatk/discussion/5111/haplotypecaller-pooled-sequence-problem

I am using GATK version 3.7-0-gcfedb67 on poolseq data. I have 60 chromosomes per pool and calculated that the maximum genotype count should be 31470 when considering 3 alternate alleles.
The combination of -maxGT 31470 -maxNumPLValues 31470 -maxAltAlleles 3 works fine in HaplotypeCaller. However, running those output GVCFs through GenotypeGVCFs with the same parameters, I receive this error:

ERROR MESSAGE: the number of genotypes is too large for ploidy 60 and allele 9: approx. 7392009768

The program considers 9 alleles although I have set -maxAltAlleles 3 explicitly, which should give 5 alleles including the reference and the symbolic non-ref allele.
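For reference, the count in the error message matches the standard multiset formula for the number of unordered genotypes, C(ploidy + alleles - 1, alleles - 1). A quick check (Python sketch) with ploidy 60 reproduces the "approx. 7392009768" figure only if 9 alleles are in play:

```python
from math import comb

def n_genotypes(ploidy, n_alleles):
    """Number of unordered genotypes: multisets of size `ploidy`
    drawn from `n_alleles` distinct alleles."""
    return comb(ploidy + n_alleles - 1, n_alleles - 1)

# 9 alleles reproduces the number in the error message exactly,
# consistent with GenotypeGVCFs considering more alleles than requested.
print(n_genotypes(60, 9))  # 7392009768
# 5 alleles (ref + 3 alt + symbolic non-ref) would be far smaller:
print(n_genotypes(60, 5))  # 635376
```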

Any ideas why this is happening or tips how I can fix this?

thanks
christian

Spanning or overlapping deletions (* allele)


We use the term spanning deletion or overlapping deletion to refer to a deletion that spans a position of interest.

The presence of a spanning deletion affects how we can represent genotypes at any site(s) that it spans for those samples that carry the deletion, whether in heterozygous or homozygous variant form. Page 8, item 5 of the VCF v4.3 specification reserves the * allele to reference overlapping deletions. This is not to be confused with the bracketed asterisk <*> used to denote symbolic alternate alleles.


[image: diagram of the four samples' sequences around position 20]

Here we illustrate with four human samples. Bob and Lian each have a heterozygous A-to-T single-nucleotide polymorphism at position 20, our position of interest. Kyra has a 9 bp deletion from position 15 to 23 on both homologous chromosomes that extends across position 20. Lian and Omar are each heterozygous for the same 9 bp deletion. Omar's and Bob's other allele is the reference A.

What are the genotypes for each individual at position 20? For Bob, the reference A and variant T alleles are clearly present for a genotype of A/T.

What about Lian? Lian has a variant T allele plus a 9 bp deletion overlapping position 20. To notate the deletion as we do single nucleotide deletions is technically inaccurate. We need a placeholder notation to signify absent sequence that extends beyond the position of interest and that is listed for an earlier position, in our case position 14. The solution is to use a star or asterisk * at position 20 to refer to the spanning deletion. Using this convention, Lian's genotype is T/*.

At the sample level, Kyra and Omar would not have records for position 20. However, since we are comparing multiple samples, we indicate the spanning deletion at position 20 with *. Omar's genotype is A/* and Kyra's is */*.


[image: example VCF representations of the spanning deletion]

In the VCF, depending on the format used by tools, positions equivalent to our example position 20 may or may not be listed. If listed, such as in the first example VCF shown, the spanning deletion is noted with the asterisk * under the ALT column. The spanning deletion is then referred to in the genotype GT for Kyra, Lian and Omar. Alternatively, a VCF may altogether avoid referencing the spanning deletion by listing the variant with the spanning deletion together with the deletion. This is shown in the second example VCF at position 14.
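A hypothetical minimal sketch of the first representation described above (records and values invented for illustration; sample order Bob, Kyra, Lian, Omar; the 9 bp deletion is anchored at position 14):

```
#CHROM  POS  ID  REF          ALT  ...  FORMAT  Bob  Kyra  Lian  Omar
1       14   .   GTCCCGACTCT  G    ...  GT      0/0  1/1   0/1   0/1
1       20   .   A            T,*  ...  GT      0/1  2/2   1/2   0/2
```

Here the * is ALT allele number 2, so Kyra's */* genotype is written 2/2, Lian's T/* is 1/2, and Omar's A/* is 0/2.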

Can the results of UnifiedGenotyper and HaplotypeCaller be significantly different?


Hi!

I'm trying to replicate the variant calling procedure done by a previous graduate student in my lab. To be precise, I have access to the original fastq files and his final vcf file; unfortunately, his pipeline seems to have been lost, and I had to replicate the procedure from what is described in his thesis. The only difference is that he used UnifiedGenotyper; I'm using HaplotypeCaller (and a more modern version of GATK - his thesis was done almost three years ago). After running my scripts, my vcf is significantly smaller than his. He obtained ~200 million sites; I got ~70 million. I have done a thorough check of my scripts against the pipeline description in his thesis and everything seems OK. This may mean two different things: 1. that I'm unable to replicate his pipeline because some crucial steps or parameters were left out of the description; 2. that HaplotypeCaller and UnifiedGenotyper can produce different results under certain circumstances.

Could the second option explain the differences, especially considering that the sequencing data weren't of very good quality? Maybe HaplotypeCaller and more modern versions of GATK are stricter when calling variants...

Thanks!


ERROR MESSAGE: java.lang.reflect.InvocationTargetException in the SelectVariants step


Hi guys
I'm having this error (attached) when running SelectVariants on my vcf.
I tried deleting the .idx of my vcf (a recommended solution) and rerunning, but it didn't work.
I appreciate any help you can give me.

Spanning deletion in pooled sequencing studies


Hey GATK team,

I have a question regarding spanning deletions in one of my runs. Briefly, the study is set up so that we sequenced 21 pools of 3-5 animals each. Using GATK 3.5, I followed the Best Practices workflow as best as I could, but had to use hard filtering as this is a non-reference species. In my final variant call set, there are quite a few spanning deletions that don't seem to have supporting evidence for a "position of interest" within that deletion.

In other words, if I had a 5 bp spanning deletion at positions 1000 - 1005, I would expect there to be something else interesting (a SNP, whatever) to be somewhere between 1000 - 1005. Hopefully that is the correct understanding?

Here is an actual example:

chr9    62268141    .   CACAGTGA    C   46727.97    PASS    AC=63;AF=0.354;AN=178;ANN=C|intron_variant|MODIFIER|COLEC10|ENSECAG00000018266|transcript|ENSECAT00000019417.1|protein_coding|1/5|c.148+4396_148+4402delACAGTGA||||||;BaseQRankSum=1.02;ClippingRankSum=0.00;DP=3820;FS=0.000;MLEAC=64;MLEAF=0.360;MQ=59.98;MQRankSum=0.350;QD=13.07;ReadPosRankSum=-4.150e-01;SOR=0.693;set=variant    GT:AD:DP:GQ:PL  0/0/0/0/0/0/0/1/1/1:166,61:227:11:2069,110,11,0,35,107,219,385,642,1110,9818    0/0/1/1/1/1/1/1/1/1:26,136:162:8:5756,1060,664,439,286,175,92,33,0,8,955    0/0/0/0/0/0/0/1:156,25:181:29:809,0,29,108,228,398,653,1105,8990    0/0/0/0/1/1/1/1:109,107:216:28:4085,380,131,28,0,32,139,395,5412    0/0/0/0/0/0/1/1/1/1:110,81:191:9:1550,287,104,27,0,9,52,135,282,572,5275    0/0/0/0/1/1/1/1/1/1:65,93:158:11:3620,495,248,122,49,11,0,19,79,227,3984    0/0/0/0/1/1/1/1:69,63:132:12:2358,212,69,12,0,25,96,261,3633    0/0/0/0/0/0/0/1:195,13:208:91:347,0,91,222,395,626,959,1537,11748   0/0/0/0/0/1/1/1:132,76:208:29:2746,170,29,0,33,124,296,642,6407 0/0/0/0/0/1:205,23:228:99:654,0,129,345,677,1271,11880  0/0/0/0/0/0/0/0:135,0:135:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/0/0/0/1/1:146,28:174:10:898,10,0,35,98,186,305,468,709,1134,8574  0/0/0/0/0/0/1/1:101,45:146:1:1608,68,0,1,42,125,267,540,4419    0/0/0/0/0/0/0/0/1/1:109,33:142:5:1129,44,0,5,36,91,170,284,456,767,6255 0/0/0/0/0/1/1/1:98,75:173:1:2856,215,55,0,1,51,164,408,4095 0/0/0/0/0/0/1/1:130,51:181:13:1826,67,0,13,74,187,375,732,8147  0/0/0/0/1/1/1/1:99,95:194:23:3644,332,112,23,0,31,130,364,5154  0/0/0/0/0/1/1/1:104,56:160:17:2058,116,17,0,31,106,244,520,6247 0/0/0/0/0/0/0/1:128,11:139:52:294,0,52,134,244,394,610,988,6434 0/0/1/1/1/1/1/1:61,176:237:32:7280,1038,549,288,127,32,0,66,2454    0/0/0/0/0/1/1/1:81,37:118:1:1322,58,1,0,32,97,211,429,3404
chr9    62268143    .   C   *   44681.43    PASS    AC=60;AF=0.337;AN=178;DP=3815;FS=0.000;MLEAC=63;MLEAF=0.354;QD=13.08;SOR=0.698;set=variant2 GT:AD:DP:GQ:PL  0/0/0/0/0/0/0/1/1/1:166,61:227:11:2069,110,11,0,35,107,219,385,642,1110,9818    0/0/1/1/1/1/1/1/1/1:26,136:162:8:5756,1060,664,439,286,175,92,33,0,8,955    0/0/0/0/0/0/0/1:156,25:181:29:809,0,29,108,228,398,653,1105,8990    0/0/0/0/1/1/1/1:109,107:216:28:4085,380,131,28,0,32,139,395,5412    0/0/0/0/0/0/1/1/1/1:110,81:191:9:1550,287,104,27,0,9,52,135,282,572,5275    0/0/0/0/1/1/1/1/1/1:65,93:158:11:3620,495,248,122,49,11,0,19,79,227,3984    0/0/0/0/1/1/1/1:69,63:132:12:2358,212,69,12,0,25,96,261,3633    0/0/0/0/0/0/0/1:195,13:208:91:347,0,91,222,395,626,959,1537,11748   0/0/0/0/0/1/1/1:132,76:208:29:2746,170,29,0,33,124,296,642,6407 0/0/0/0/0/1:205,23:228:99:654,0,129,345,677,1271,11880  0/0/0/0/0/0/0/0:135,0:135:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/0/0/0/1/1:146,28:174:10:898,10,0,35,98,186,305,468,709,1134,8574  0/0/0/0/0/0/1/1:101,45:146:1:1608,68,0,1,42,125,267,540,4419    0/0/0/0/0/0/0/0/1/1:109,33:142:5:1129,44,0,5,36,91,170,284,456,767,6255 0/0/0/0/0/1/1/1:98,75:173:1:2856,215,55,0,1,51,164,408,4095 0/0/0/0/0/0/1/1:130,51:181:13:1826,67,0,13,74,187,375,732,8147  0/0/0/0/1/1/1/1:99,95:194:23:3644,332,112,23,0,31,130,364,5154  0/0/0/0/0/0/0/0:158,0:158:0:0,0,0,0,0,0,0,0,582 0/0/0/0/0/0/0/1:128,11:139:52:294,0,52,134,244,394,610,988,6434 0/0/1/1/1/1/1/1:61,176:237:32:7280,1038,549,288,127,32,0,66,2454    0/0/0/0/0/1/1/1:81,37:118:1:1322,58,1,0,32,97,211,429,3404
chr9    62268144    .   A   *   46727.97    PASS    AC=63;AF=0.354;AN=178;DP=3820;FS=0.000;MLEAC=64;MLEAF=0.360;QD=13.07;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL  0/0/0/0/0/0/0/1/1/1:166,61:227:11:2069,110,11,0,35,107,219,385,642,1110,9818    0/0/1/1/1/1/1/1/1/1:26,136:162:8:5756,1060,664,439,286,175,92,33,0,8,955    0/0/0/0/0/0/0/1:156,25:181:29:809,0,29,108,228,398,653,1105,8990    0/0/0/0/1/1/1/1:109,107:216:28:4085,380,131,28,0,32,139,395,5412    0/0/0/0/0/0/1/1/1/1:110,81:191:9:1550,287,104,27,0,9,52,135,282,572,5275    0/0/0/0/1/1/1/1/1/1:65,93:158:11:3620,495,248,122,49,11,0,19,79,227,3984    0/0/0/0/1/1/1/1:69,63:132:12:2358,212,69,12,0,25,96,261,3633    0/0/0/0/0/0/0/1:195,13:208:91:347,0,91,222,395,626,959,1537,11748   0/0/0/0/0/1/1/1:132,76:208:29:2746,170,29,0,33,124,296,642,6407 0/0/0/0/0/1:205,23:228:99:654,0,129,345,677,1271,11880  0/0/0/0/0/0/0/0:135,0:135:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/0/0/0/1/1:146,28:174:10:898,10,0,35,98,186,305,468,709,1134,8574  0/0/0/0/0/0/1/1:101,45:146:1:1608,68,0,1,42,125,267,540,4419    0/0/0/0/0/0/0/0/1/1:109,33:142:5:1129,44,0,5,36,91,170,284,456,767,6255 0/0/0/0/0/1/1/1:98,75:173:1:2856,215,55,0,1,51,164,408,4095 0/0/0/0/0/0/1/1:130,51:181:13:1826,67,0,13,74,187,375,732,8147  0/0/0/0/1/1/1/1:99,95:194:23:3644,332,112,23,0,31,130,364,5154  0/0/0/0/0/1/1/1:104,56:160:17:2058,116,17,0,31,106,244,520,6247 0/0/0/0/0/0/0/1:128,11:139:52:294,0,52,134,244,394,610,988,6434 0/0/1/1/1/1/1/1:61,176:237:32:7280,1038,549,288,127,32,0,66,2454    0/0/0/0/0/1/1/1:81,37:118:1:1322,58,1,0,32,97,211,429,3404
chr9    62268146    .   T   *   46727.97    PASS    AC=63;AF=0.354;AN=178;DP=3820;FS=0.000;MLEAC=64;MLEAF=0.360;QD=13.07;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL  0/0/0/0/0/0/0/1/1/1:166,61:227:11:2069,110,11,0,35,107,219,385,642,1110,9818    0/0/1/1/1/1/1/1/1/1:26,136:162:8:5756,1060,664,439,286,175,92,33,0,8,955    0/0/0/0/0/0/0/1:156,25:181:29:809,0,29,108,228,398,653,1105,8990    0/0/0/0/1/1/1/1:109,107:216:28:4085,380,131,28,0,32,139,395,5412    0/0/0/0/0/0/1/1/1/1:110,81:191:9:1550,287,104,27,0,9,52,135,282,572,5275    0/0/0/0/1/1/1/1/1/1:65,93:158:11:3620,495,248,122,49,11,0,19,79,227,3984    0/0/0/0/1/1/1/1:69,63:132:12:2358,212,69,12,0,25,96,261,3633    0/0/0/0/0/0/0/1:195,13:208:91:347,0,91,222,395,626,959,1537,11748   0/0/0/0/0/1/1/1:132,76:208:29:2746,170,29,0,33,124,296,642,6407 0/0/0/0/0/1:205,23:228:99:654,0,129,345,677,1271,11880  0/0/0/0/0/0/0/0:135,0:135:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/0/0/0/1/1:146,28:174:10:898,10,0,35,98,186,305,468,709,1134,8574  0/0/0/0/0/0/1/1:101,45:146:1:1608,68,0,1,42,125,267,540,4419    0/0/0/0/0/0/0/0/1/1:109,33:142:5:1129,44,0,5,36,91,170,284,456,767,6255 0/0/0/0/0/1/1/1:98,75:173:1:2856,215,55,0,1,51,164,408,4095 0/0/0/0/0/0/1/1:130,51:181:13:1826,67,0,13,74,187,375,732,8147  0/0/0/0/1/1/1/1:99,95:194:23:3644,332,112,23,0,31,130,364,5154  0/0/0/0/0/1/1/1:104,56:160:17:2058,116,17,0,31,106,244,520,6247 0/0/0/0/0/0/0/1:128,11:139:52:294,0,52,134,244,394,610,988,6434 0/0/1/1/1/1/1/1:61,176:237:32:7280,1038,549,288,127,32,0,66,2454    0/0/0/0/0/1/1/1:81,37:118:1:1322,58,1,0,32,97,211,429,3404
chr9    62268147    .   G   *   46727.97    PASS    AC=63;AF=0.354;AN=178;DP=3820;FS=0.000;MLEAC=64;MLEAF=0.360;QD=13.07;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL  0/0/0/0/0/0/0/1/1/1:166,61:227:11:2069,110,11,0,35,107,219,385,642,1110,9818    0/0/1/1/1/1/1/1/1/1:26,136:162:8:5756,1060,664,439,286,175,92,33,0,8,955    0/0/0/0/0/0/0/1:156,25:181:29:809,0,29,108,228,398,653,1105,8990    0/0/0/0/1/1/1/1:109,107:216:28:4085,380,131,28,0,32,139,395,5412    0/0/0/0/0/0/1/1/1/1:110,81:191:9:1550,287,104,27,0,9,52,135,282,572,5275    0/0/0/0/1/1/1/1/1/1:65,93:158:11:3620,495,248,122,49,11,0,19,79,227,3984    0/0/0/0/1/1/1/1:69,63:132:12:2358,212,69,12,0,25,96,261,3633    0/0/0/0/0/0/0/1:195,13:208:91:347,0,91,222,395,626,959,1537,11748   0/0/0/0/0/1/1/1:132,76:208:29:2746,170,29,0,33,124,296,642,6407 0/0/0/0/0/1:205,23:228:99:654,0,129,345,677,1271,11880  0/0/0/0/0/0/0/0:135,0:135:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/0/0/0/1/1:146,28:174:10:898,10,0,35,98,186,305,468,709,1134,8574  0/0/0/0/0/0/1/1:101,45:146:1:1608,68,0,1,42,125,267,540,4419    0/0/0/0/0/0/0/0/1/1:109,33:142:5:1129,44,0,5,36,91,170,284,456,767,6255 0/0/0/0/0/1/1/1:98,75:173:1:2856,215,55,0,1,51,164,408,4095 0/0/0/0/0/0/1/1:130,51:181:13:1826,67,0,13,74,187,375,732,8147  0/0/0/0/1/1/1/1:99,95:194:23:3644,332,112,23,0,31,130,364,5154  0/0/0/0/0/1/1/1:104,56:160:17:2058,116,17,0,31,106,244,520,6247 0/0/0/0/0/0/0/1:128,11:139:52:294,0,52,134,244,394,610,988,6434 0/0/1/1/1/1/1/1:61,176:237:32:7280,1038,549,288,127,32,0,66,2454    0/0/0/0/0/1/1/1:81,37:118:1:1322,58,1,0,32,97,211,429,3404
chr9    62268148    .   A   *   46727.97    PASS    AC=63;AF=0.354;AN=178;DP=3820;FS=0.000;MLEAC=64;MLEAF=0.360;QD=13.07;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL  0/0/0/0/0/0/0/1/1/1:166,61:227:11:2069,110,11,0,35,107,219,385,642,1110,9818    0/0/1/1/1/1/1/1/1/1:26,136:162:8:5756,1060,664,439,286,175,92,33,0,8,955    0/0/0/0/0/0/0/1:156,25:181:29:809,0,29,108,228,398,653,1105,8990    0/0/0/0/1/1/1/1:109,107:216:28:4085,380,131,28,0,32,139,395,5412    0/0/0/0/0/0/1/1/1/1:110,81:191:9:1550,287,104,27,0,9,52,135,282,572,5275    0/0/0/0/1/1/1/1/1/1:65,93:158:11:3620,495,248,122,49,11,0,19,79,227,3984    0/0/0/0/1/1/1/1:69,63:132:12:2358,212,69,12,0,25,96,261,3633    0/0/0/0/0/0/0/1:195,13:208:91:347,0,91,222,395,626,959,1537,11748   0/0/0/0/0/1/1/1:132,76:208:29:2746,170,29,0,33,124,296,642,6407 0/0/0/0/0/1:205,23:228:99:654,0,129,345,677,1271,11880  0/0/0/0/0/0/0/0:135,0:135:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/0/0/0/1/1:146,28:174:10:898,10,0,35,98,186,305,468,709,1134,8574  0/0/0/0/0/0/1/1:101,45:146:1:1608,68,0,1,42,125,267,540,4419    0/0/0/0/0/0/0/0/1/1:109,33:142:5:1129,44,0,5,36,91,170,284,456,767,6255 0/0/0/0/0/1/1/1:98,75:173:1:2856,215,55,0,1,51,164,408,4095 0/0/0/0/0/0/1/1:130,51:181:13:1826,67,0,13,74,187,375,732,8147  0/0/0/0/1/1/1/1:99,95:194:23:3644,332,112,23,0,31,130,364,5154  0/0/0/0/0/1/1/1:104,56:160:17:2058,116,17,0,31,106,244,520,6247 0/0/0/0/0/0/0/1:128,11:139:52:294,0,52,134,244,394,610,988,6434 0/0/1/1/1/1/1/1:61,176:237:32:7280,1038,549,288,127,32,0,66,2454    0/0/0/0/0/1/1/1:81,37:118:1:1322,58,1,0,32,97,211,429,3404

I cannot find any evidence - even in non-filtered variants - of another "position of interest" at this locus.

The one thing I have noticed is that the exact genotype isn't consistent for each sample through the region. For example, animal-1 may be 0/0/0/0/1/1/1/1 at pos'n 1001, 0/0/0/0/1/1/1/1 at pos'n 1002, 0/0/0/0/0/1/1/1 at pos'n 1003, 0/0/0/0/1/1/1/1 at pos'n 1004, etc. I don't believe that the reported differences in genotype are accurate reflections of true differences in genotype, but rather a reflection of the slight variability in the number of supporting reads at the different positions, which affects the PLs and thus the final genotype. Would something like that be enough to bring these up as spanning deletions?

Thanks for your time and continued help, and sorry for the wall of text.

Cheers,
Russ

The contig order in knownSites and reference is not the same


Hi,

I tried to use baserecalibrator. Here are my inputs:

Human reference: HG38 downloaded from GATK bundle

dbsnp_vcf: hg38.dbsnp.vcf (fileformat=VCFv4.2) downloaded from GATK bundle, sorted by Picard sortVcf
java -jar ../exe_program/picard.jar SortVcf I=hg38.dbsnp.rename.vcf O=hg38.dbsnp.sorted.vcf SEQUENCE_DICTIONARY=../ref/hg38.dict

Bam files: aligned to HG38 and sorted by Picard sortBam.
java -jar ../exe_program/picard.jar I=../test_5m/test_5m_trimmed_AG.bam O=../test_5m/test_5m_trimmed_AG_reorder.bam R=../ref/hg38.fa CREATE_INDEX=TRUE

However, I got the following error message:
MESSAGE: Input files knownSites and reference have incompatible contigs.
Error details: The contig order in knownSites and reference is not the same
knownSites contigs = [HLA-A01:01:01:01, HLA-A01:01:01:02N, HLA-A01:01:38L, HLA-A01:02, HLA-A01:03, HLA-A01:04N...
reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15...
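The engine requires the knownSites VCF's contig list to match the reference dictionary's order exactly, not just as a set. A quick way to see the mismatch (illustrative Python sketch with contig lists abbreviated from the error message) is to compare the two sequences directly:

```python
# Hypothetical, abbreviated contig orders taken from the error message above.
reference_contigs  = ["chr1", "chr2", "chr3", "HLA-A01:01:01:01"]
knownsites_contigs = ["HLA-A01:01:01:01", "chr1", "chr2", "chr3"]

# Same set of contigs, but in a different order -> BaseRecalibrator errors out.
print(set(reference_contigs) == set(knownsites_contigs))  # True
print(reference_contigs == knownsites_contigs)            # False
```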

Can anyone give me some hint? Would deeply appreciate it.

coolman

Oncotator Error -- Please help me


I am running Oncotator 1.8.0.0 on MuTect2 output .vcf file. I did perform filtering for PASS before running Oncotator.

If I run this on command line:
oncotator -v --input_format=VCF --output_format=VCF ../PASS/B100327_T3398.vcf B100327_T3398.vcf hg19 --db-dir ${data_source}

The last part of the log file is below (something is wrong in the last line, and I cannot figure out how to fix it):

2016-11-23 17:56:20,932 WARNING [oncotator.input.VcfInputMutationCreator:293] Tumor-Normal VCF detected. The Normal will assume GT= 0/0, unless GT field specified otherwise.
Traceback (most recent call last):
File "/home/kong/lib/miniconda2/envs/oncotator/bin/oncotator", line 9, in <module>
load_entry_point('Oncotator==v1.8.0.0', 'console_scripts', 'oncotator')()
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/Oncotator.py", line 309, in main
annotator.annotate()
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/Annotator.py", line 437, in annotate
filename = self._outputRenderer.renderMutations(mutations, metadata=metadata, comments=comments)
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/output/VcfOutputRenderer.py", line 118, in renderMutations
dataManager = OutputDataManager(self.configTable, mutations, comments, metadata, path)
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/output/OutputDataManager.py", line 90, in __init__
self.mutation, self.mutations = self._fetchFirstMutation(muts)
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/output/OutputDataManager.py", line 105, in _fetchFirstMutation
for mutation in muts:
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/Annotator.py", line 448, in _applyManualAnnotations
for m in mutations:
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/Annotator.py", line 456, in _applyDefaultAnnotations
for m in mutations:
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/Annotator.py", line 519, in _annotate_mutations_using_datasources
m = self._annotate_func_ptr(m, datasource)
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/Annotator.py", line 88, in _annotate_mut
return ds.annotate_mutation(m)
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/Oncotator-v1.8.0.0-py2.7.egg/oncotator/datasources/TabixIndexedVcfDatasource.py", line 230, in annotate_mutation
vcf_records = self.vcf_reader.fetch(mutation.chr, mut_start - 1, mut_end) # query database for records
File "/home/kong/lib/miniconda2/envs/oncotator/lib/python2.7/site-packages/vcf/parser.py", line 626, in fetch
encoding=self.encoding)
File "ctabix.pyx", line 92, in pysam.ctabix.Tabixfile.__cinit__ (pysam/ctabix.c:2234)
File "ctabix.pyx", line 98, in pysam.ctabix.Tabixfile._open (pysam/ctabix.c:2344)
TypeError: _open() got an unexpected keyword argument 'encoding'

Limit core usage with CombineGVCFs


Hi,
I'm trying to merge GVCF files created by HaplotypeCaller, but I'm seeing excessive core usage. The options for CombineGVCFs do not include --nt, so I don't see a way to limit the number of cores used.

Is there a way to prevent CombineGVCFs from using all available cores?
Thanks
M
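One possible angle, sketched here with placeholder paths and resource values (not verified against this poster's setup): CombineGVCFs itself is a single-threaded walker in GATK 3, so extra core usage often comes from the JVM's parallel garbage collector, which can be capped with standard HotSpot flags; on Linux, taskset can additionally pin the process to specific cores.

```shell
# Sketch, assuming the extra cores are consumed by JVM garbage-collection
# threads rather than by CombineGVCFs itself. -XX:ParallelGCThreads and
# -XX:ConcGCThreads are standard HotSpot options; taskset pins the process
# to cores 0-3. File names and memory are placeholders.
taskset -c 0-3 java -Xmx8g \
    -XX:ParallelGCThreads=2 -XX:ConcGCThreads=1 \
    -jar GenomeAnalysisTK.jar -T CombineGVCFs \
    -R ref.fasta \
    -V sample1.g.vcf -V sample2.g.vcf \
    -o combined.g.vcf
```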

Cross comparison between Array and NGS data


Dear GATK staff,

I have 11 samples that were sequenced using NGS (Illumina HiSeq), and 2 of these samples were also genotyped on an Illumina Global Screening Array (Illumina iScan). I was looking at your latest WDL script and noticed a few steps that I think are related, but I don't know how to prepare the inputs for them. Any help or additional explanation would be really appreciated!

# Check identity of fingerprints across readgroups
CrossCheckFingerprints
input: haplotype_database_file

What information should I use to create this file? Array data? I have already read these links but I'm still lost:
http://gatkforums.broadinstitute.org/gatk/discussion/comment/37543
http://gatkforums.broadinstitute.org/gatk/discussion/9526/picard-haplotype-map-file-format

# Estimate level of cross-sample contamination
CheckContamination
input: contamination_sites_vcf

What information should I use to create this file?

# Check the sample BAM fingerprint against the sample array 
CheckFingerprint
input: haplotype_database_file
input: genotypes

What information should I use to create these files? What does each input stand for?

Thank you very much in advance.

Best regards,
Santiago

SelectVariants Large VCF slow runtime


I am attempting to subset and filter a large (10k exome samples, 250GB) VCF file using SelectVariants. My goal is to subset by individual sample (iterating over each sample with a custom script and issuing a separate SelectVariants command for each), selecting only heterozygous genotypes with an alt allele depth > 5 and GQ > 30, and keeping only SNPs that pass filters. My issue is very slow runtime, which seems like it shouldn't be a problem when I only want calls from a single sample. I suspect it may be an issue with how I have set up my SelectVariants command (shown below), or an issue with SelectVariants and large VCFs.

Here is the command I am using:

java -jar GATK.3.7.jar -T SelectVariants \
    -R ref.fa \
    -V very.large.vcf.gz \
    -o single.sample.filtered.vcf.gz \
    -sn sample.name \
    -selectType SNP \
    -select 'vc.getGenotype("sample.name").isHet()' \
    -select 'vc.getGenotype("sample.name").getAD().1 > 5' \
    -select 'vc.getGenotype("sample.name").getGQ() > 30' \
    -select 'vc.isNotFiltered()'
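For comparison, a hypothetical alternative sketch (not the poster's pipeline): bcftools can often subset a single sample out of a very large VCF considerably faster than a JVM walker. The sample name, thresholds, and file names mirror the question and are placeholders; the -i expression syntax follows recent bcftools releases and should be verified against your installed version.

```shell
# Sketch of an equivalent single-sample extraction with bcftools:
# keep only PASS sites for one sample, then keep het SNPs with
# alt allele depth (AD[0:1]) > 5 and GQ > 30.
bcftools view --threads 4 -s sample.name -f PASS very.large.vcf.gz \
  | bcftools view -i 'TYPE="snp" && GT[0]="het" && FORMAT/AD[0:1]>5 && FORMAT/GQ[0]>30' \
      -Oz -o single.sample.filtered.vcf.gz
```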


SelectVariants modifies VCF entries, keeping only the base calls intact.


I am using GATKv3.5. I used SelectVariants as shown below to remove 11 samples from a vcf file:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R reference.fasta \
    -V all_samples.vcf \
    -xl_sn sample90 -xl_sn sample91 -xl_sn sample92 -xl_sn samples93 \
    -xl_sn sample94 -xl_sn sample95 -xl_sn sample96 -xl_sn sample97 \
    -xl_sn sample98 -xl_sn sample99 -xl_sn sample100 \
    -o subset_samples.vcf

However, when I compare the SNPs between the original VCF and the subset VCF, the 0/0, 0/1, 1/1 genotype calls remain the same, but the AD, DP, GQ, and PL values change to the point of nonsense. For example, a site with AD 0,45 is called 0/1 (heterozygous). That is the correct call based on the original file, where the AD is 53,24, but based on 0,45 it should be 1/1. As long as the genotype calls themselves are correct this shouldn't cause any downstream errors, but I can't be sure that's the case. Has anyone else had this error?

Original:
KB222897.1 10810 . C T 113425.71 . AC=102;AF=0.359;AN=284;BaseQRankSum=0.698;ClippingRankSum=0.029;DP=8522;ExcessHet=87.0598;FS=0.000;InbreedingCoeff=-0.4686;MLEAC=102;MLEAF=0.359;MQ=41.97;MQRankSum=-1.540e-01;QD=18.08;ReadPosRankSum=0.132;SOR=0.682 GT:AD:DP:GQ:PL 0/1:53,24:77:99:705,0,1819 0/1:39,16:55:99:470,0,1231 0/1:29,21:50:99:589,0,973

Subset:
KB222897.1 10810 SKB222897.1_10810 C T . PASS AC=96;AF=0.366;AN=262;BaseQRankSum=0.698;ClippingRankSum=0.029;DP=8027;ExcessHet=87.0598;FS=0.000;InbreedingCoeff=-0.4686;MQ=41.97;MQRankSum=-1.540e-01;QD=18.08;ReadPosRankSum=0.132;SOR=0.682;DP=6516 GT:AD:DP:GQ:PL 0/1:0,45:45:99:255,135,0 0/1:0,48:48:99:255,144,0 0/1:0,44:44:99:255,132,0
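A hedged workaround sketch, not a fix for the GATK behaviour itself: bcftools subsets samples by copying per-sample FORMAT fields (AD, DP, GQ, PL) verbatim rather than recomputing them, so the original annotations survive. '^' marks the sample list as an exclusion; the names are taken from the question.

```shell
# Sketch: drop the same 11 samples with bcftools, which carries
# FORMAT fields over unchanged instead of recalculating them.
bcftools view \
    -s ^sample90,sample91,sample92,samples93,sample94,sample95,sample96,sample97,sample98,sample99,sample100 \
    -o subset_samples.vcf all_samples.vcf
```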

Merging two different data (missing or ref/ref)


Hello,

I have two different types of data (one is WES data for 100 cases and another is WGS data for 200 controls).
I combined the data separately (into one WES.vcf and one WGS.vcf) using GenotypeGVCFs, and ran VQSR/genotype refinement separately as well (WES.refined.vcf, WGS.refined.vcf).
I processed the cases and controls separately because the two data types (WES and WGS) are very different.

After that, I merged the two datasets (WES+WGS.refined.vcf) and conducted an association analysis. The problem is that case-specific variants (found only in WES.refined.vcf) were set to "./.:.:.:." (missing, with no annotation info) for the control samples. Should I check these variants in the control file (WGS.refined.vcf) to see whether each one is truly missing or homozygous for the reference allele? Or is there a more convenient way to check this?

And in this case, where the two data types are very different, do you suggest that VQSR and genotype refinement be run separately, as I did? Or should I merge all GVCF files into one VCF and then run VQSR and refinement together?
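One possible alternative, sketched with placeholder file names: joint genotyping the case and control GVCFs together lets GenotypeGVCFs emit explicit 0/0 (hom-ref) genotypes for controls at case-only sites, instead of the ./. missing genotypes produced by merging two already-genotyped VCFs.

```shell
# Sketch, assuming combined per-cohort GVCFs already exist: GATK 3
# GenotypeGVCFs accepts multiple -V inputs and genotypes all samples
# at every site seen in any input, so hom-ref calls are filled in.
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R ref.fasta \
    -V wes_cases.g.vcf \
    -V wgs_controls.g.vcf \
    -o combined.vcf
```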

Picard LiftOverVcf produces duplicate positions


Hi,

I have produced a set of VCF files with UnifiedGenotyper, using a custom BED file. Subsequently I have used Picard LiftOverVcf in order to lift these VCF files from hg38 to hg19.

/jdk/current/bin/java -jar picard.jar LiftoverVcf I=input.vcf O=output.vcf CHAIN=hg38ToHg19.over.chain REJECT=rejected.vcf R=ucsc.hg19.fasta

While inspecting the resulting VCFs, I realised that the newly generated "lifted-over" VCF files contain a few duplicated genomic positions, some of which have different base counts. I checked the corresponding genomic positions in the original VCFs, and there are no such "duplications" there.

#Before lift-over, hg 38
chr21    43107642    .    G    .    .    .    BaseCounts=0,0,6,0;DP=6;LowMQ=1.0000,1.0000,6;MQ=0.00;MQ0=6;PercentNBase=0.0000;VariantType=NO_VARIATION    GT    ./.

#After lift-over, hg19
chr21   44527752        .       G       .       .       PASS    BaseCounts=0,0,5,0;DP=5;LowMQ=1.0000,1.0000,5;MQ=0.00;MQ0=5;PercentNBase=0.0000;VariantType=NO_VARIATION GT       ./.
chr21   44527752        .       G       .       .       PASS    BaseCounts=0,0,6,0;DP=6;LowMQ=1.0000,1.0000,6;MQ=0.00;MQ0=6;PercentNBase=0.0000;VariantType=NO_VARIATION GT       ./.

I was not able to understand why the LiftOverVcf produces these duplicates. Could you please help me on this?
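A small diagnostic sketch for situations like this: listing CHROM/POS pairs that occur more than once in the lifted-over file shows every duplicated target position, which can then be traced back through the chain file. An inline two-record example mirrors the duplicate from the question; point the pipeline at your real output.vcf instead.

```shell
# Build a tiny tab-separated VCF body mirroring the duplicated record,
# then list positions that appear more than once (uniq -d).
printf 'chr21\t44527752\t.\tG\t.\t.\tPASS\tDP=5\n'  >  lifted.vcf
printf 'chr21\t44527752\t.\tG\t.\t.\tPASS\tDP=6\n'  >> lifted.vcf
printf 'chr21\t44527753\t.\tA\t.\t.\tPASS\tDP=7\n'  >> lifted.vcf
grep -v '^#' lifted.vcf | cut -f1,2 | sort | uniq -d
# prints the duplicated position: chr21 44527752 (tab-separated)
```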

Picard ValidateSamFile failing with INVALID_TAG_NM on hg38 HLA contigs


Picard ValidateSamFile is failing with INVALID_TAG_NM on hg38 HLA contigs when MODE=VERBOSE. The first 100 HLA reads in my BAM file failed. I assume all would fail as there were a number of different contigs among the first 100. When I validated the same BAM file with IGNORE=INVALID_TAG_NM, it passes.

Oddly, when MODE=SUMMARY, I got 'No errors found'.

I am running Picard 2.9.2 and using the GATK bundle Homo_sapiens_assembly38.fasta* as the reference. The BAM file was produced by Novoalign and processed with SortSam and SetNmMdAndUqTags. I also tried the deprecated SetNmAndUqTags. Moreover, I manually checked a few of the failing reads in the original BAM file produced by Novoalign, and the records, including the NM tags, were the same as after running SetNmMdAndUqTags. I looked at the subset of failed reads where NM was 0 and compared them directly to the sequence in the FASTA file; all were a perfect match at the expected position.

*This shouldn't matter, but I replaced the tab character with two spaces to separate the contig name field in the '>' lines of the HLA records of Homo_sapiens_assembly38.fasta. All the other records use two spaces, and the tab character was causing problems for Novoalign (to be fixed in the next release).

Picard ValidateSamFile command/stderr:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/run/media/yoursham/MY_6Tb_1/germline/cromwell-executions/PairedEndSingleSampleWorkflow/d626fa7a-c423-4f0c-98b2-aa0274b658e3/call-ValidateReadGroupSamFile/shard-0/execution/tmp.Et4z6V
[Tue Jun 06 00:40:48 UTC 2017] picard.sam.ValidateSamFile INPUT=/run/media/yoursham/MY_6Tb_1/germline/cromwell-executions/PairedEndSingleSampleWorkflow/d626fa7a-c423-4f0c-98b2-aa0274b658e3/call-ValidateReadGroupSamFile/shard-0/inputs/run/media/yoursham/MY_6Tb_1/germline/cromwell-executions/PairedEndSingleSampleWorkflow/d626fa7a-c423-4f0c-98b2-aa0274b658e3/call-SortAndFixReadGroupBam/shard-0/execution/NIST7035_TAAGGCGA_L001.aligned.sorted.bam OUTPUT=NIST7035_TAAGGCGA_L001.validation_report MODE=VERBOSE IGNORE=[] MAX_OUTPUT=1000000000 IS_BISULFITE_SEQUENCED=false REFERENCE_SEQUENCE=/mnt/hdd/germline/resources/gatk_bundle/Homo_sapiens_assembly38.fasta    IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Jun 06 00:40:48 UTC 2017] Executing as yoursham@yoursham-linux on Linux 3.10.0-514.16.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_131-b12; Picard version: 2.9.2-SNAPSHOT
INFO    2017-06-06 00:41:45 SamFileValidator    Validated Read    10,000,000 records.  Elapsed time: 00:00:56s.  Time for last 10,000,000:   55s.  Last read position: chr5:75,080,505
INFO    2017-06-06 00:42:37 SamFileValidator    Validated Read    20,000,000 records.  Elapsed time: 00:01:48s.  Time for last 10,000,000:   52s.  Last read position: chr11:65,355,941
INFO    2017-06-06 00:43:31 SamFileValidator    Validated Read    30,000,000 records.  Elapsed time: 00:02:41s.  Time for last 10,000,000:   53s.  Last read position: chr19:1,918,167
INFO    2017-06-06 00:44:24 SamFileValidator    Validated Read    40,000,000 records.  Elapsed time: 00:03:34s.  Time for last 10,000,000:   52s.  Last read position: */*
[Tue Jun 06 00:45:14 UTC 2017] picard.sam.ValidateSamFile done. Elapsed time: 4.42 minutes.
Runtime.totalMemory()=1348468736
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

Typical errors:

ERROR: Record 38230964, Read name HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239, NM tag (nucleotide differences) in file [0] does not match reality [75]
ERROR: Record 38230965, Read name HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756, NM tag (nucleotide differences) in file [1] does not match reality [76]
ERROR: Record 38230966, Read name HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908, NM tag (nucleotide differences) in file [1] does not match reality [72]
ERROR: Record 38230967, Read name HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488, NM tag (nucleotide differences) in file [0] does not match reality [72]
ERROR: Record 38230968, Read name HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440, NM tag (nucleotide differences) in file [0] does not match reality [72]

BAM records for the above errors:

HWI-D00119:50:H7AP8ADXX:1:2203:14796:98239  99  HLA-A*11:50Q    1091    30  101M    =   1111    121 CGCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGG   @?>?=>?@>@@@@==@@>>>>@>>@@@@@>@??@@>@@>@@>@@@@>@@>@@>@@@@@?@@@=@>=AA@>@@>@>@>>?>@?>>?@@@>>@>@@@>@?>??   ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:101    PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:2  NM:i:0  SM:i:2  PQ:i:1  UQ:i:0  AS:i:0  PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1209:5003:56756   163 HLA-A*11:50Q    1092    30  101M    =   1286    295 GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAACGCAAGTGGGAGGCGGC   ?@@>=@@8@@@?==?@<==<@9<@5@@@>@>=@@9??>@@?@@@@>@>=@@<9?@<?><??<?<>>:>.??==:>>>@=9>=>.>>?==?=@??9>>?9=@   ZA:f:30 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:83G17  PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:2  NM:i:1  SM:i:2  PQ:i:26 UQ:i:13 AS:i:18 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1108:3556:87908   163 HLA-A*11:50Q    1101    30  101M    =   1162    162 GGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCCCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGC   ?>@;<@@>===?==@@@@@=@=>@?=@@>@@>@>@@=@@=@A=@@@@@?@@@=9==@>@=@?+<=@<:?=@?/2<>9:/;;;@@@>=:?@@@==>;>?@@>   ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:62T38  PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:1  NM:i:1  SM:i:3  PQ:i:13 UQ:i:10 AS:i:13 PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1208:5673:42488   163 HLA-A*11:50Q    1111    30  101M    =   1143    133 ACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGCGGAGCAGCGG   ;@;>@@?@@=@==@@>@@>@@>@@@@>@@?@@?@@@@@@@@@>@>>@@@>@@=@>@>>@=@@>=@@A@=>?>?@@>@@@@@@@@@>@?@@?@@>@?>@@=>   ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:101    PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:1  NM:i:0  SM:i:25 PQ:i:0  UQ:i:0  AS:i:0  PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
HWI-D00119:50:H7AP8ADXX:1:1211:16359:18440  163 HLA-A*11:50Q    1111    30  101M    =   1135    125 ACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCGGCGGACATGGCAGCTCAGATCACCAAGCGCAAGTGGGAGGCGGCCCGTCGGGCGGAGCAGCGG   ;@;>@@?@@=@==@@>@@>@@>@@@@>@@?@@?@@@@@@@@@>@>>??@>@@=@>@>>@>@@>>@?@@>>@>@@@>@@@@@@@@@>?>@@?>?=?@>@@@@   ZA:f:27 LB:Z:NIST7035_Nextera-Rapid-Capture-Exome-and-Expanded-Exome    MD:Z:101    PG:Z:novoalign  RG:Z:H7AP8ADXX_TAAGGCGA_1_NA12878   AM:i:1  NM:i:0  SM:i:25 PQ:i:1  UQ:i:0  AS:i:0  PU:Z:H7AP8ADXX_TAAGGCGA_1_NA12878
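An independent cross-check sketch (placeholder file names, not verified against this dataset): samtools calmd recomputes MD and NM from the reference, so diffing its NM values against the original BAM's shows whether the tags or the validator disagree with the FASTA the HLA contigs were aligned to.

```shell
# Sketch: recompute NM/MD with samtools calmd against the same reference,
# then print reads whose stored NM differs from the recomputed one.
samtools calmd -b input.bam Homo_sapiens_assembly38.fasta > recalmd.bam 2>/dev/null
paste <(samtools view input.bam   | grep -o 'NM:i:[0-9]*') \
      <(samtools view recalmd.bam | grep -o 'NM:i:[0-9]*') \
  | awk '$1 != $2' | head
```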

Phasing SNPs and Indels


Hi
I was looking into tools for phasing indels and SNPs. It seems that ReadBackedPhasing only supports phasing of SNPs. I'm working on data from TCGA, and according to your docs HaplotypeCaller, which can produce phased output, is not recommended for tumor samples. Is there any workaround for this problem? Can MuTect phase the variants?
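For reference, a sketch of a GATK 3 ReadBackedPhasing run on existing calls (file names are placeholders, and the behaviour on tumor data is exactly what the question is asking about, so treat this as a starting point rather than a recommendation). Note that it phases SNPs only; the GATK docs describe a separate MNP-merging option for combining phased adjacent SNPs, which does not cover indels.

```shell
# Sketch: read-backed phasing of an existing VCF against the BAM it was
# called from. ReadBackedPhasing phases SNPs only.
java -jar GenomeAnalysisTK.jar -T ReadBackedPhasing \
    -R ref.fasta \
    -I tumor.bam \
    -V mutect_calls.vcf \
    -o phased.vcf
```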
