Channel: Recent Discussions — GATK-Forum

Dog breed genetics/health committee: Recently received 8 canine genomes (4 healthy/4 diseased)

I’m a microbiologist with limited computational/bioinformatic training (but can hack it when necessary with prokaryotic genomes). However, I was recently given a committee position within an AKC dog breed club and have found 8 whole genome sequences related to a breed-specific genetic disease that were not being worked up due to personnel time constraints (they focused on a GWAS).

I was given FASTQ files for the 8 dogs, 4 healthy and 4 diseased, all run on the same Illumina HiSeq flow cell. The disease is not cancer.

Any suggestions on what general steps to take to even have a shot at alignments would be great. Even if I can just formulate a plan that we can present to a computational person at some point, that would be a big step. I assume I want to do some sort of alignment of the 4 healthy and 4 diseased genomes in order to generate SNP, mobile element and CNV pileups. The only canine-specific platform I could find was CANFAM3.1, which doesn’t seem to allow for WGS alignments, just mapping of a single set of reads to the ‘stock’ canine genome, or checking what a gene may be involved in.


Speed up HaplotypeCaller on IBM POWER8 systems

We all know how HaplotypeCaller and Mutect2 analyses can take a long time. IBM is now providing a native implementation of the PairHMM algorithm that leverages the new hardware available in their POWER8 systems. The optimized native library is currently available on POWER8 for the following Linux distributions: Ubuntu 15.10, Ubuntu 16.04 and Red Hat Enterprise Linux 7.1, Red Hat Enterprise Linux 7.2.

To take advantage of the optimized library, you need to do the following:

  • Download the shared library corresponding to your Linux distribution from here

  • Set your java library path to the location of libVectorLoglessPairHMM.so using -Djava.library.path

Here is an example for running HaplotypeCaller on a P8 system with Ubuntu:

export PHMM_N_THREADS=$Num
java -Xmx32g -Djava.library.path=/path/to/PairHMM_P8_Ubuntu -jar $GATK_PATH/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R $REFERENCE -I $INPUT_BAM --dbsnp $SNP_VCF \
-stand_emit_conf 10 -stand_call_conf 50 \
-o $OUTPUT_VCF

Here is an example for running Mutect2 on a P8 system with Ubuntu:

export PHMM_N_THREADS=$Num
java -Djava.library.path=/path/to/PairHMM_P8_Ubuntu -jar GenomeAnalysisTK.jar \
-T MuTect2 \
-R $REFERENCE -L $GENOME_INTERVALS_FILE \
-I:tumor $TUMOR_BAM -I:normal $NORMAL_BAM \
--cosmic $COSMIC_VCF --dbsnp $SNP_VCF \
-o $OUT_VCF

The latest version of the library uses the same floating-point precision as Java on POWER8, so it generates the same results as a run without the library. Also, by exploiting multithreading along with SIMD vectorization, it accelerates HaplotypeCaller and Mutect2 more than the previous version did, especially in single-thread mode (no -nct option specified).

SMT (simultaneous multithreading) is a processor technology that allows multiple instruction streams (threads) to run concurrently on the same physical processor, improving overall throughput. From the point of view of the operating system, each hardware thread is treated as an independent logical processor. POWER8 offers SMT8, SMT4, SMT2 and ST modes, in which each physical processor exposes 8, 4, 2 and 1 logical processors, respectively. By default, this PairHMM library uses a number of threads equal to 37% of the available logical processors. The number of threads can be tuned by setting the environment variable PHMM_N_THREADS, as shown in the examples above.
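To make the thread arithmetic concrete, here is a minimal Python sketch of how the logical processor count and the 37% default relate. The 20-core machine is hypothetical, and the rounding rule (round down, never below 1) is an assumption, since the post does not say how the library rounds.

```python
import math

# SMT modes named in the post and the logical processors each exposes per core.
SMT_LOGICAL_PER_CORE = {"SMT8": 8, "SMT4": 4, "SMT2": 2, "ST": 1}


def logical_processors(physical_cores: int, smt_mode: str) -> int:
    """Logical processors the OS sees for a given core count and SMT mode."""
    return physical_cores * SMT_LOGICAL_PER_CORE[smt_mode]


def default_phmm_threads(logical: int, fraction: float = 0.37) -> int:
    """Estimate the default thread count as 37% of the logical processors.

    Assumption: the value is rounded down and never drops below 1.
    The library's real rounding rule is not stated in the post.
    """
    return max(1, math.floor(logical * fraction))


if __name__ == "__main__":
    cores = 20  # hypothetical machine size, for illustration only
    for mode in SMT_LOGICAL_PER_CORE:
        n = logical_processors(cores, mode)
        print(f"{mode}: {n} logical processors, about {default_phmm_threads(n)} threads")
```

If the estimate does not suit your workload, setting PHMM_N_THREADS explicitly, as in the examples above, removes the guesswork.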

The library can accelerate HaplotypeCaller, Mutect2 and UnifiedGenotyper of GATK. It can accelerate HaplotypeCaller up to 1.9x and Mutect2 up to 9.26x depending on the test case. For example, if the PairHMM computation consumes about half of the HaplotypeCaller runtime in single-thread mode, a 1.88x speed-up can be expected.
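The 1.88x figure can be read through Amdahl's law, which bounds the overall gain when only part of the runtime is accelerated. The post does not name Amdahl's law, so treat the sketch below as one interpretation of the numbers rather than IBM's own calculation. The function names are illustrative.

```python
def overall_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: whole-run speed-up when `fraction` of the runtime
    is made `component_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


def implied_component_speedup(fraction: float, overall: float) -> float:
    """Invert Amdahl's law: how much faster the accelerated part must be
    to reach a given overall speed-up."""
    remaining = 1.0 / overall - (1.0 - fraction)
    if remaining <= 0:
        raise ValueError("overall speed-up is not reachable for this fraction")
    return fraction / remaining


if __name__ == "__main__":
    # PairHMM takes about half of the runtime and the whole run is 1.88x faster.
    print(round(implied_component_speedup(0.5, 1.88), 1))
    # Even an extremely fast PairHMM cannot push the whole run past 2x.
    print(overall_speedup(0.5, 1e12))
```

Under this reading, a 1.88x overall gain with PairHMM at half of the runtime corresponds to the PairHMM step itself running roughly 16 times faster, and 2x is the ceiling for that test case.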

The source code is available here.
If you have any questions or issues (aside from downloading the file), please contact Yinhue Cheng at IBM (ycheng@us.ibm.com) or Takeshi Ogasawara at IBM Japan (TAKESHI@jp.ibm.com).

Disclaimer: Please note that these libraries are not an official IBM product. You use them entirely at your own risk, and neither IBM nor the author assumes any liability whatsoever, nor do they assume responsibility for maintenance. Please report comments and corrections to ycheng@us.ibm.com.

Unclear ERROR message

Dear GATK Team,
I have performed a parallel analysis of > 200 samples with HaplotypeCaller in GVCF mode on a computing cluster. All but two files ran without error. For these two, I got the following message after about 13h of runtime:

ERROR --
ERROR stack trace

java.util.NoSuchElementException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1431)
at java.util.HashMap$KeyIterator.next(HashMap.java:1453)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.reduceNumberOfAlternativeAllelesBasedOnLikelihoods(HaplotypeCallerGenotypingEngine.java:336)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:264)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:964)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:251)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

What does this tell me? I already tried to remap the data to start with fresh bam-files - with the same outcome. The data is whole genome data of an owl mapped against an experimental owl genome.
Thanks for your support
Stefan

Spanning or overlapping deletions

We use the term spanning deletion or overlapping deletion to refer to a deletion that spans a position of interest.

The presence of a spanning deletion affects how we can represent genotypes at any site(s) that it spans for those samples that carry the deletion, whether in heterozygous or homozygous variant form. Page 8, item 5 of the VCF v4.3 specification reserves the * allele to reference overlapping deletions. This is not to be confused with the bracketed asterisk <*> used to denote symbolic alternate alleles.


image

Here we illustrate with four human samples. Bob and Lian each have a heterozygous A to T single nucleotide polymorphism at position 20, our position of interest. Kyra has a 9 bp deletion from position 15 to 23 on both homologous chromosomes that extends across position 20. Lian and Omar each are heterozygous for the same 9 bp deletion. Omar's and Bob's other allele is the reference A.

What are the genotypes for each individual at position 20? For Bob, the reference A and variant T alleles are clearly present for a genotype of A/T.

What about Lian? Lian has a variant T allele plus a 9 bp deletion overlapping position 20. To notate the deletion as we do single nucleotide deletions is technically inaccurate. We need a placeholder notation to signify absent sequence that extends beyond the position of interest and that is listed for an earlier position, in our case position 14. The solution is to use a star or asterisk * at position 20 to refer to the spanning deletion. Using this convention, Lian's genotype is T/*.

At the sample level, Kyra and Omar would not have records for position 20. However, we are comparing multiple samples and so we indicate the spanning deletion at position 20 with *. Omar's genotype is A/* and Kyra's is */*.


image

In the VCF, depending on the format used by tools, positions equivalent to our example position 20 may or may not be listed. If listed, such as in the first example VCF shown, the spanning deletion is noted with the asterisk * under the ALT column. The spanning deletion is then referred to in the genotype GT for Kyra, Lian and Omar. Alternatively, a VCF may altogether avoid referencing the spanning deletion by listing the variant with the spanning deletion together with the deletion. This is shown in the second example VCF at position 14.
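As a rough illustration of the first representation, the Python sketch below turns the four genotypes described above into numeric GT values for a record at position 20 with ALT alleles T and *. It is not a reproduction of the example VCF in the image. The chromosome name "1" and the empty ID/QUAL/FILTER/INFO columns are placeholders.

```python
REF = "A"
ALTS = ["T", "*"]  # "*" stands for the allele removed by the upstream deletion

# Alleles carried by each sample at position 20, as described in the text.
SAMPLES = {
    "Bob": ("A", "T"),
    "Kyra": ("*", "*"),
    "Lian": ("T", "*"),
    "Omar": ("A", "*"),
}


def allele_index(allele: str) -> int:
    """VCF allele index: 0 for REF, then 1..n in ALT column order."""
    return 0 if allele == REF else ALTS.index(allele) + 1


def genotype(alleles) -> str:
    """Unphased GT string such as 1/2, with indices in ascending order."""
    return "/".join(str(i) for i in sorted(allele_index(a) for a in alleles))


def vcf_record(chrom: str, pos: int) -> str:
    """Tab-separated record line; chrom and the '.' columns are placeholders."""
    fixed = [chrom, str(pos), ".", REF, ",".join(ALTS), ".", ".", ".", "GT"]
    calls = [genotype(SAMPLES[name]) for name in SAMPLES]
    return "\t".join(fixed + calls)


if __name__ == "__main__":
    print("\t".join(SAMPLES))   # sample order used for the GT columns
    print(vcf_record("1", 20))
```

With this allele order, Lian's T/* becomes 1/2, Omar's A/* becomes 0/2 and Kyra's */* becomes 2/2, while Bob stays at 0/1.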

May I get the old format for reporting a spanning del using GATK 3.6?

MuTect2 tumorOnly vs paired loses true variants

Hi GATK team !

I have an issue with MuTect2. I'm using the latest GATK version (nightly build from 16 March) in a somatic context on an amplicon design.

I have a variant that I know is a true one (although the depth of coverage at this position is quite low in the somatic context).
MuTect2 does call the variant in tumor-only mode; it is the first of the three listed here:
chr13 32900222 . C T . clustered_events;homologous_mapping_event ECNT=3;HCNT=45;MAX_ED=45;MIN_ED=41;NLOD=0.00;TLOD=39.33 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:85,36:0.149:36:0:.:2087,972:83:0
chr13 32900263 . G A . clustered_events;homologous_mapping_event ECNT=3;HCNT=16;MAX_ED=45;MIN_ED=41;NLOD=0.00;TLOD=8.56 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:1020,18:8.696e-03:18:0:1.00:26997,444:1020:0
chr13 32900267 . C T . clustered_events;homologous_mapping_event ECNT=3;HCNT=5;MAX_ED=45;MIN_ED=41;NLOD=0.00;TLOD=6.81 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:1024,16:0.015:16:0:0.00:27735,411:1024:0

It seems to consider the 3 variants to be clustered and on the same haplotype; might that be important for my issue? Position 222 is already quite far from 263, though.

When calling in paired mode, feeding it the recalibrated germline BAM file, I have no variants left, even though none of these 3 variants is a germline one.

Could you please tell me why those variants are filtered out? Is there a parameter I should play with?

thanks a lot
Manon

MuTect variant calling: the parameters for filtering

Hello All,
I have a couple of VCF files that resulted from a batch run of our MuTect pipeline, and I want to filter the output VCF files based on the parameters max_alt_alleles_in_normal_count=2 and max_alt_allele_in_normal_fraction=0.03. I want to do this as part of a comparison between results from a sample that I ran first with those parameters and then without them. After our in-house filters, the run with those parameters yielded 59 variants, whereas the run without them yielded 387 variants (after the same in-house filtering).

Since I have a couple of VCF files and reprocessing them with the parameters takes a long time, I am trying to filter the result VCF files directly. I tried removing variants that have alternate allele reads in the normal sample > 2 and a normal variant allele fraction > 0.03. Still, the resulting variant counts don't match.

It would be great if someone could comment on my understanding of these parameters. Isn't max_alt_allele_in_normal_fraction the frequency of the alternate allele in the normal sample, i.e. FA for the normal sample in the output VCF file?
And isn't max_alt_alleles_in_normal_count the alternate allele count in the normal sample, i.e. AD[1] for the normal sample?
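The post-hoc filter described above can be sketched in Python as follows. This encodes the poster's reading of the two parameters (AD[1] and FA of the normal sample), not MuTect's internal logic. The sample column values are made up for illustration. The post does not say whether MuTect combines the two limits with AND or OR, or whether it compares with > or >=, so the `require_both` switch is an assumption added to make both readings easy to try.

```python
def parse_sample(format_field: str, sample_field: str) -> dict:
    """Pair FORMAT keys with one sample's values, e.g. GT:AD:FA with 0:48,3:0.059."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def normal_alt_evidence(format_field: str, normal_field: str):
    """Return (alt read count, alt allele fraction) for the normal sample.

    Follows the poster's interpretation: count = AD[1], fraction = FA.
    """
    fields = parse_sample(format_field, normal_field)
    alt_count = int(fields["AD"].split(",")[1])
    alt_fraction = float(fields["FA"])
    return alt_count, alt_fraction


def exceeds_normal_limits(alt_count, alt_fraction,
                          max_count=2, max_fraction=0.03, require_both=True):
    """True if the normal sample shows too much alt evidence.

    require_both=True  -> both limits must be exceeded (AND)
    require_both=False -> either limit is enough (OR)
    Which of these matches MuTect is an open question in the post.
    """
    over_count = alt_count > max_count
    over_fraction = alt_fraction > max_fraction
    if require_both:
        return over_count and over_fraction
    return over_count or over_fraction


if __name__ == "__main__":
    # Hypothetical FORMAT and normal-sample columns, for illustration only.
    count, fraction = normal_alt_evidence("GT:AD:BQ:DP:FA", "0:48,3:30:51:0.059")
    print(count, fraction, exceeds_normal_limits(count, fraction))
```

Running the filter with both settings of `require_both` on the same VCF is one way to see whether the AND/OR choice accounts for the mismatch in variant counts.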

variants predicted by HaplotypeCaller with DP=0

Hello,

We found two variants predicted by HaplotypeCaller with DP=0. We also generated --bamout output for the variant regions. However, we didn't find any output for these regions (attached: the first track is the input BAM for HaplotypeCaller, and the second track is the BAM from --bamout). The variant lines in the VCF file are attached. I just wondered why a variant can be called with a depth equal to zero. Is that because the DP shown here is from before re-assembly by HaplotypeCaller? Thanks and have a nice weekend!

Ying


GATK Depth of Coverage randomly crashes?

Hi everyone,

Thank you for reading! I'm running GATK (v3.6-0-g89b7209) DepthOfCoverage on a list of BAM files. I originally had my files separated as a list of 23 paired BAMs for normal and tumor, but I ran out of memory, so I'm running them as 4 groups (essentially N1, N2, T1, and T2).

The first batch, N1, completed successfully without any errors and gave proper output files. T1 and N2, however, just exited without giving a reason for the exit in the output file. They both stopped at around 94% (Chr22, ~90 min left) and there's nothing else in the output.

Does this look like a system issue or a problem with one of my bam files that's throwing off the batch? I'm re-running the failed batches at the moment.

Thank you for your time!

Error in MarkDuplicates

I am running the MarkDuplicates step on a server, but it always prints an error message, even after I switched servers.
This is my command:
java -jar /home/yangguoqian/biosoft/picard-tools-2.5.0/picard.jar MarkDuplicates INPUT=/home/chenyunmei/hetero/moso_gatk/bam/Bamboo-PCRfree_Round68_Lane1.bam OUTPUT=/home/chenyunmei/hetero/moso_gatk/markDup/Bamboo-PCRfree_Round68_Lane1.dedupped.bam METRICS_FILE=/home/chenyunmei/hetero/moso_gatk/markDup/Bamboo-PCRfree_Round68_Lane1.dedupped.metrics.txt VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true REMOVE_DUPLICATES=TRUE

And the following is the error message:
optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Sat Jul 30 19:56:01 CST 2016] Executing as chenyunmei@localhost.localdomain on Linux 3.10.0-229.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_77-b03; Picard version: 2.5.0(2c370988aefe41f579920c8a6a678a201c5261c1_1466708365)
INFO 2016-07-30 19:56:01 MarkDuplicates Start of doWork freeMemory: 2045995552; totalMemory: 2058354688; maxMemory: 28631367680
INFO 2016-07-30 19:56:01 MarkDuplicates Reading input file and constructing read end information.
INFO 2016-07-30 19:56:01 MarkDuplicates Will retain up to 110120644 data points before spilling to disk.
[Sat Jul 30 19:56:12 CST 2016] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.18 minutes.
Runtime.totalMemory()=3755474944
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: /tmp/chenyunmei/CSPI.2016309087006492130.tmp/150462.tmpnot found
at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:63)
at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:49)
at htsjdk.samtools.util.ResourceLimitedMap.get(ResourceLimitedMap.java:76)
at htsjdk.samtools.CoordinateSortedPairInfoMap.getOutputStreamForSequence(CoordinateSortedPairInfoMap.java:180)
at htsjdk.samtools.CoordinateSortedPairInfoMap.put(CoordinateSortedPairInfoMap.java:164)
at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.put(DiskBasedReadEndsForMarkDuplicatesMap.java:65)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:449)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:193)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.io.FileNotFoundException: /tmp/chenyunmei/CSPI.2016309087006492130.tmp/150462.tmp (Too many open files)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:60)
... 10 more

Please help me. I have wasted a lot of time on this, but I really don't know what the cause is.
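The "Caused by" line in the trace above reports "Too many open files", which points at the per-process open-file limit of the account running the job. As a hedged diagnostic sketch (Unix-only, and separate from Picard itself), the Python snippet below reads that limit with the standard `resource` module so it can be compared across servers.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit: int) -> str:
    """Render a limit value, mapping the 'no limit' sentinel to a readable word."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft limit: {describe(soft)}")
    print(f"hard limit: {describe(hard)}")
```

The soft limit is the one a running process actually hits. Picard's MarkDuplicates documentation lists a MAX_FILE_HANDLES_FOR_READ_ENDS_MAP option intended to stay below that limit; check the documentation for your Picard version before relying on it.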

variation info found in bam but not in final vcf file

When I view the BAM with samtools, there is a SNP at one position, but in the final VCF file there is no variant call at that position.
Please see sample NA12878-580ng-2 below for an example.

run.sh:

/share/bin/samtools tview -p chr10:101124187 -d T outdir2/Upload/Alignment/NA12878-580ng-1/NA12878-580ng-1.rmdup.bam >NA12878-580ng-1.chr10_101124187

/share/bin/samtools tview -p chr10:101124187 -d T outdir2/Upload/Alignment/NA12878-580ng-2/NA12878-580ng-2.rmdup.bam >NA12878-580ng-2.chr10_101124187

$more NA12878-580ng-1.chr10_101124187
101124191 101124201 101124211 101124221 101124231 101124241 101124251
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
TGGAGGTTGGTAAGGAAGGCCTTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGAGGTTGGTAAGGAAGGCCTTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
T gaggttggtaaggaaggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
TGGAGGTT ttaaggaaggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
TGGAGGTT aggaaggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
TGGAGGTTGG ggaaggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
TGGAGGTTGGT ggaaggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
TGGAGGTTGGTAAGGAAGGCCTTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGAGGTTGGT gaaggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
GGGAGGTTGGTAAG AAGGCCTTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGAGGTTGGTAAG aggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
TGGAGGTTGGTAAGG aggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
GGGAGGTTGGTAAGG ggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
TGGAGGTTGGTAAGG cttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
tggaggttggtaagg cttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
GGGGGGTTGGTAAGGA TTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGAGGTTGGTAAGGAA tcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
TGGAGGTTGGTAAGGAAGGCC CGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGAGGTTGGTAAGGAAGGCC GCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGAGGTTGGTAAGGAAGGCCT CTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGAGGTTGGTAAGGAAGGCCT ctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
TGGAGGTTGGTAAGGAAGGCCTT tttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
tggaggttggtaaggaaggccttc TTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGGGGTTGGTAAGGAAGGCCTTCGCT TGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCCTGACCCCTGCT
GGGGGGTTGGTAAGGAAGGCCTTCGCTT aaaatggagcctttacttactatggcgtcccagccatcatgaccactgct
tggaggttggtaaggaaggccttcgctt AAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGAGGTTGGTAAGGAAGGCCTTCGCTTT AATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
TGGAGGTTGGTAAGGAAGCCCTTCGCTTT aatggagcctttacttactatggcgtcccagccatcatgaccactgct
tggaggttggtaaggaaggccttcgcttt ATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTGCT
...
...
...

$more NA12878-580ng-2.chr10_101124187
101124191 101124201 101124211 101124221 101124231 101124241 101124251
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
TGGAGGTTG GTAAGG AAGGCCTTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
t GAGGTTGGTAAGGAAGGCCTTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
TGgaggttggtaaggaaggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactg
TGGAGGTTGGTAAGGAAGGCCTTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
TGGAGGTTGG AAGGAAGGCCTTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
TGGAGGTTGGTAAG ggccttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactg
TGGAGGTTG
GTAAG cttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactg
TGGAGGTTGGTAAGGA cttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactg
TGGAGGTTGGTAAGGAA cttcgctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactg
tggaggttggtaaggaag TTCGCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
tggaggttggtaaggaaggcc GCTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
GGGGGGTTGGTAAGGAAGGCCT gctttgaaaatggagcctttacttactatggcgtcccagccatcatgaccactg
tggaggttggtaaggaaggccttcg TTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
tggaggttggtaaggaaggccttc CTTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
TGGCGGTTGGTAAGGAAGGCCTTCG TTTGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
TGGAGGTTGGTAAGGAAGGCCTTCGC TGCAAATAGAGCCTTTACTTACTATGGCGTCACAGCCATCATGACCACTG
TGGAGGTTGGTAAGGAAGGCCTTCGC TGAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
TGGAGGTTGGTAAGGAAGGCCTTCGCT GAAAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
tggaggttggtaaggaaggccttcgcttt AAATGGAGCCTTTACTTACTATGGCGTCCCAGCCATCATGACCACTG
TGGAGGTTGGTAAGGAAGGCCTTCGCTTTGA aatggagcctttacttactatggcgtcccagccatcatgaccactg
TGGAGGTTGGTAAGGAAGGCCTTCGCTTTGAA ggagcctttacttactatggcgtcccagccatcatgaccactg
....
....
.....

$grep 101124187 result/NA12878-580ng-1/result_variation/snp/NA12878-580ng-1.snp.vcf.xls
chr10 101124187 . T G 90.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-5.053;DP=109;Dels=0.00;ExcessHet=3.0103;FS=21.068;HaplotypeScore=20.9861;MLEAC=1;MLEAF=0.500;MQ=58.94;MQ0=0;MQRankSum=-2.249;QD=0.83;ReadPosRankSum=2.098;SOR=4.092 GT:AD:DP:GQ:PL 0/1:98,11:109:99:119,0,3607

$grep 101124187 result/NA12878-580ng-2/result_variation/snp/NA12878-580ng-2.snp.vcf.xls
PS: there is no record in the VCF file for position chr10:101124187.

Looking forward to your answers, thanks

GenotypeGVCFs Parallelism

Hello,

I am trying to do joint genotyping with GenotypeGVCFs on about 250 exomes. I looked through the docs for the best way to parallelize this process, but didn't find a clear answer. Are -nt and -nct supported for GenotypeGVCFs? Are there recommendations for these parameters with this tool?

Thank you very much!
Luke

RNA-SeQC/GATK: IntronicExpressionReadBlock error

Hello,

I seem to have the same error as the one posted 14 days ago on BioStar (https://www.biostars.org/p/208982/).

I had successfully run this command in June: java -jar RNA-SeQC_v1.1.8.jar -o rnaSeQC -r Homo_sapiens_assembly19.fasta -t gencode.v19.gtf -s sample.info.file -n 1.

I tried poking around, but without any success. Could you please help me figure out how to get it to run?

Thank you!

Output + error stack trace:
Creating rRNA Interval List based on given GTF annotations
Retriving contig names from reference
contig names in reference: 85
Loading GTF for Read Counting
Converting to refGene
Transcript objects to RefGen format: 1 s
Running IntronicExpressionReadBlock Walker ....
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(GenomeAnalysisEngine.java:146)
at org.broadinstitute.sting.gatk.CommandLineExecutable.(CommandLineExecutable.java:53)
at org.broadinstitute.sting.gatk.CommandLineGATK.(CommandLineGATK.java:55)
at org.broadinstitute.cga.rnaseq.gatk.GATKTools.runIntronReadCount(GATKTools.java:216)
at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runRegionCounting(ReadCountMetrics.java:244)
at org.broadinstitute.cga.rnaseq.ReadCountMetrics.runReadCountMetrics(ReadCountMetrics.java:59)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.runMetrics(RNASeqMetrics.java:225)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:171)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:139)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: could not create class file from NativeString.class
at org.reflections.Reflections.scan(Reflections.java:166)
at org.reflections.Reflections.(Reflections.java:91)
at org.broadinstitute.sting.utils.classloader.PluginManager.(PluginManager.java:79)
... 9 more
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: could not create class file from NativeString.class
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.reflections.Reflections.scan(Reflections.java:162)
... 11 more
Caused by: java.lang.RuntimeException: could not create class file from NativeString.class
at org.reflections.scanners.AbstractScanner.scan(AbstractScanner.java:41)
at org.reflections.Reflections$2.run(Reflections.java:149)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: invalid constant type: 15
at javassist.bytecode.ConstPool.readOne(ConstPool.java:1023)
at javassist.bytecode.ConstPool.read(ConstPool.java:966)
at javassist.bytecode.ConstPool.(ConstPool.java:127)
at javassist.bytecode.ClassFile.read(ClassFile.java:693)
at javassist.bytecode.ClassFile.(ClassFile.java:85)
at org.reflections.adapters.JavassistAdapter.createClassObject(JavassistAdapter.java:86)
at org.reflections.adapters.JavassistAdapter.createClassObject(JavassistAdapter.java:22)
at org.reflections.scanners.AbstractScanner.scan(AbstractScanner.java:38)
... 6 more

Picard tools MarkDuplicates using CRAM format: how to pass a valid CRAM reference?

Hello there!

I am trying to use picard tools to mark duplicates using a cram format file; however I could not find any documentation to address this problem. How can I pass a valid CRAM reference?

Thanks in advance,

-lili

[Sat Feb 06 16:09:42 CST 2016] picard.sam.markduplicates.MarkDuplicatesWithMateCigar MINIMUM_DISTANCE=250 INPUT=[/EXOME/gatk/test_10990_bwa_srtd.cram] OUTPUT=EXOME/gatk/test_10990_wes_dupMC.cram METRICS_FILE=/EXOME/gatk/test_10990_wes_dupMC_metrics.txt OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 CREATE_INDEX=true SKIP_PAIRS_WITH_NO_MATE_CIGAR=true BLOCK_SIZE=100000 REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=TOTAL_MAPPED_REFERENCE_LENGTH PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicatesWithMateCigar READ_NAME_REGEX= VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Sat Feb 06 16:09:42 CST 2016] Executing as antunes@gpu10 on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26; Picard version: 2.1.0(25ebc07f7fbaa7c1a4a8e6c130c88c1d10681802_1454776546) IntelDeflater
[Sat Feb 06 16:09:42 CST 2016] picard.sam.markduplicates.MarkDuplicatesWithMateCigar done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=4116185088
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: A valid CRAM reference was not supplied and one cannot be acquired via the property settings reference_fasta or use_cram_ref_download
at htsjdk.samtools.cram.ref.ReferenceSource.getDefaultCRAMReferenceSource(ReferenceSource.java:98)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:269)
at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:205)
at picard.sam.markduplicates.MarkDuplicatesWithMateCigar.doWork(MarkDuplicatesWithMateCigar.java:118)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Final VCF is skipping sites covered by the intervals and the ALT allele appears missing - GATK 3.6

Hi GATK,

For GATK version 3.6, I have come across two issues. First, I have noticed that there are many positions that do not show up in my final VCF, even though they are covered by the interval file I provided. This problem seems to occur near the start of some of the intervals. Second, ALT alleles appear to be missing from non-variant sites, even when they have assigned reads. I'm not sure if the second issue is by design or not. Both problems are captured below, where chr1:10041296 is missing and chr1:10041297 has a missing ALT allele:

1       10041295        .       T       .       21.76   .       AN=264;DP=1366;ExcessHet=3.01;MLEAC=.;MLEAF=.;MQ=NaN    GT:DP:RGQ       0/0:7:6 0/0:10:29       0/0:8:0 0/0:10:12       0/0:14:30       0/0:13:22       0/0:13:23       0/0:8:15        0/0:17:23       0/0:12:24       0/0:14:26       0/0:8:18        0/0:3:0 0/0:3:0 0/0:2:6 0/0:6:0 0/0:2:6 0/0:14:24       0/0:14:0        0/0:5:6 0/0:21:27       0/0:12:18       0/0:21:6        0/0:17:0        0/0:12:19       0/0:21:39       0/0:11:0        0/0:11:19       0/0:16:21       0/0:12:0        0/0:0:0 0/0:3:9 0/0:10:16       0/0:10:15       0/0:9:10        0/0:21:15       0/0:11:0        0/0:15:24       0/0:7:15        0/0:13:15       0/0:14:21       0/0:15:36       0/0:18:17       0/0:21:44       0/0:6:9 0/0:10:12       0/0:12:11       0/0:12:21       0/0:10:21       0/0:6:9 0/0:25:44       0/0:13:27       0/0:2:0 0/0:11:17       0/0:12:4        0/0:6:12        0/0:0:0 0/0:12:18       0/0:19:0        0/0:5:9 0/0:8:8 0/0:5:0 0/0:5:0 0/0:11:5        0/0:2:9 0/0:14:27       0/0:10:9        0/0:2:3 0/0:10:6        0/0:11:24       0/0:11:16       0/0:11:21       0/0:3:6 0/0:11:0        0/0:6:9 0/0:14:24       0/0:3:9 0/0:8:18        0/0:5:12        0/0:6:9 0/0:21:18       0/0:10:14       0/0:13:18       0/0:3:0 0/0:5:12        0/0:6:9 0/0:15:24       0/0:8:9 0/0:13:4        0/0:12:18       0/0:9:3 0/0:8:24        0/0:4:6 0/0:7:3 0/0:5:9 0/0:7:0 0/0:11:21       0/0:9:15        0/0:12:18       0/0:13:33       0/0:10:12       0/0:4:0 0/0:6:0 0/0:18:17       0/0:12:0        0/0:7:0 0/0:19:39       0/0:9:9 0/0:23:42       0/0:8:21        0/0:13:0        0/0:14:24       0/0:16:21       0/0:16:27       0/0:9:21        0/0:12:15       0/0:25:43       0/0:13:21       0/0:9:21        0/0:4:12        0/0:8:15        0/0:13:21       0/0:13:0        0/0:9:11        0/0:16:0        0/0:2:0 0/0:15:27       0/0:8:8 0/0:10:0        0/0:14:21       0/0:10:16       0/0:12:30
1       10041297        .       TA      .       19.98   .       AN=264;BaseQRankSum=0.431;ClippingRankSum=0.00;DP=1381;ExcessHet=3.01;MLEAC=.;MLEAF=.;MQ=NaN;MQRankSum=0.00;ReadPosRankSum=-2.100e-01   GT:DP:RGQ       0/0:9:12        0/0:11:0        0/0:8:0 0/0:10:12       0/0:14:30       0/0:13:27       0/0:13:27       0/0:8:12        0/0:17:0        0/0:12:24       0/0:14:30       0/0:8:18        0/0:3:6 0/0:3:0 0/0:2:6 0/0:6:0 0/0:2:6 0/0:14:24       0/0:14:21       0/0:5:6 0/0:21:37       0/0:12:0        0/0:20:0        0/0:17:21       0/0:12:18       0/0:22:37       0/0:11:17       0/0:11:24       0/0:15:6        0/0:10:18       0/0:0:0 0/0:16:0        0/0:9:0 0/0:10:12       0/0:9:0 0/0:21:0        0/0:11:17       0/0:15:24       0/0:6:15        0/0:13:15       0/0:14:18       0/0:15:10       0/0:18:0        0/0:19:44       0/0:6:9 0/0:10:18       0/0:11:27       0/0:12:21       0/0:10:21       0/0:6:9 0/0:25:0        0/0:13:27       0/0:2:0 0/0:10:0        0/0:12:27       0/0:6:12        0/0:0:0 0/0:13:0        0/0:19:4        0/0:5:9 0/0:8:9 0/0:5:6 0/0:5:0 0/0:11:18       0/0:10:12       0/0:14:27       0/0:11:12       0/0:2:3 0/0:10:6        0/0:11:24       0/0:11:19       0/0:11:21       0/0:3:6 0/0:11:21       0/0:6:9 0/0:13:0        0/0:5:0 0/0:8:18        0/0:5:12        0/0:5:9 0/0:21:0        0/0:9:11        0/0:13:18       0/0:3:6 0/0:5:0 0/0:6:0 0/0:14:0        0/0:8:0 0/0:13:24       0/0:6:4 0/0:9:3 0/0:8:24        0/0:4:6 0/0:7:3 0/0:5:9 0/0:7:0 0/0:11:21       0/0:9:15        0/0:12:18       0/0:13:33       0/0:10:21       0/0:4:9 0/0:6:9 0/0:18:22       0/0:12:21       0/0:7:12        0/0:17:39       0/0:8:0 0/0:23:13       0/0:8:0 0/0:13:33       0/0:14:24       0/0:16:13       0/0:16:27       0/0:11:7        0/0:12:15       0/0:25:43       0/0:13:21       0/0:9:21        0/0:4:12        0/0:8:0 0/0:12:0        0/0:13:3        0/0:9:18        0/0:16:0        0/0:2:0 0/0:15:27       0/0:8:0 0/0:7:9 0/0:14:21       0/0:11:24       0/0:12:30

from interval:

1 10041293 10041377 ENSG00000173614 84 + NMNAT1

Both problems seem to go away when I run the same subjects with the same interval file in GATK version 3.5.

1       10041295        .       T       G       .       PASS    AC=0;AF=0.00;AN=210;DP=1366;ExcessHet=3.01;MQ=3.63;VQSLOD=2.04;culprit=MQ       GT:AD:DP:PGT:PID:RGQ    0/0:7,0:7:.:.:6 0/0:10,0:10:.:.:29      ./.:8,0:8:.:.:0 0/0:10,0:10:.:.:12      0/0:14,0:14:.:.:30      0/0:13,0:13:.:.:22      0/0:13,0:13:.:.:23      0/0:8,0:8:.:.:15        0/0:17,0:17:.:.:23      0/0:11,0:11:.:.:24      0/0:14,0:14:.:.:26      0/0:8,0:8:.:.:18        ./.:3,0:3:.:.:0 ./.:3,0:3:.:.:0 0/0:2,0:2:.:.:6 ./.:6,0:6:.:.:0 0/0:2,0:2:.:.:6 0/0:14,0:14:.:.:24      ./.:14,0:14:.:.:0       0/0:3,0:3:.:.:6 0/0:21,0:21:.:.:27      0/0:12,0:12:.:.:18      0/0:21,0:21:.:.:6       ./.:17,0:17:.:.:0       0/0:12,0:12:.:.:19      0/0:21,0:21:.:.:39      ./.:11,0:11:.:.:0       0/0:11,0:11:.:.:19      0/0:16,0:16:.:.:21      ./.:12,0:12:.:.:0       ./.:0,0:0:.:.:0 0/0:3,0:3:0|1:10041295_T_G:9    0/0:10,0:10:.:.:16      0/0:10,0:10:.:.:15      0/0:9,0:9:.:.:10        0/0:21,0:21:.:.:15      ./.:9,0:9:.:.:0 0/0:15,0:15:.:.:24      0/0:7,0:7:.:.:15        0/0:13,0:13:.:.:15      0/0:14,0:14:.:.:21      0/0:15,0:15:.:.:36      0/0:18,0:18:.:.:17      0/0:21,0:21:.:.:44      0/0:6,0:6:.:.:9 0/0:10,0:10:.:.:12      0/0:12,0:12:.:.:11      0/0:12,0:12:.:.:21      0/0:9,0:9:.:.:21        0/0:6,0:6:.:.:9 0/0:25,0:25:.:.:44      0/0:12,0:12:.:.:27      ./.:2,0:2:.:.:0 0/0:11,0:11:.:.:17      0/0:12,0:12:.:.:4       0/0:6,0:6:.:.:12        ./.:0,0:0:.:.:0 0/0:12,0:12:.:.:18      ./.:18,0:18:.:.:0       0/0:5,0:5:.:.:9 0/0:8,0:8:.:.:8 ./.:5,0:5:.:.:0 ./.:5,0:5:.:.:0 0/0:11,0:11:.:.:5       0/0:2,0:2:0|1:10041295_T_G:9    0/0:14,0:14:.:.:27      0/0:9,0:9:.:.:9 0/0:2,0:2:.:.:3 0/0:10,0:10:.:.:6       0/0:11,0:11:.:.:24      0/0:11,0:11:.:.:16      0/0:11,0:11:.:.:21      0/0:3,0:3:.:.:6 ./.:11,0:11:.:.:0       0/0:6,0:6:.:.:9 0/0:14,0:14:.:.:24      0/0:3,0:3:.:.:9 0/0:8,0:8:.:.:18        0/0:5,0:5:.:.:12        0/0:6,0:6:.:.:9 0/0:21,0:21:.:.:18      0/0:10,0:10:.:.:14      0/0:13,0:13:.:.:18      ./.:3,0:3:.:.:0 
0/0:5,0:5:.:.:12        0/0:6,0:6:.:.:9 0/0:15,0:15:.:.:24      0/0:7,0:7:.:.:9 0/0:13,0:13:.:.:4       0/0:12,0:12:.:.:18      0/0:9,0:9:.:.:3 0/0:8,0:8:.:.:24        0/0:3,0:3:.:.:6 0/0:7,0:7:.:.:3 0/0:5,0:5:.:.:9 ./.:6,0:6:.:.:0 0/0:9,0:9:.:.:21        0/0:9,0:9:.:.:15        0/0:12,0:12:.:.:18      0/0:13,0:13:.:.:33      0/0:10,0:10:.:.:12      ./.:4,0:4:.:.:0 ./.:6,0:6:.:.:0 0/0:18,0:18:.:.:17      ./.:12,0:12:.:.:0       ./.:7,0:7:.:.:0 0/0:16,0:16:.:.:39      0/0:9,0:9:.:.:9 0/0:23,0:23:.:.:42      0/0:8,0:8:.:.:21        ./.:11,0:11:.:.:0       0/0:13,0:13:.:.:24      0/0:16,0:16:.:.:21      0/0:15,0:15:.:.:27      0/0:9,0:9:.:.:21        0/0:12,0:12:.:.:15      0/0:25,0:25:.:.:43      0/0:13,0:13:.:.:21      0/0:9,0:9:.:.:21        0/0:4,0:4:.:.:12        0/0:8,0:8:.:.:15        0/0:13,0:13:.:.:21      ./.:13,0:13:.:.:0       0/0:9,0:9:.:.:11        ./.:16,0:16:.:.:0       ./.:2,0:2:.:.:0 0/0:15,0:15:.:.:27      0/0:8,0:8:.:.:8 ./.:8,0:8:.:.:0 0/0:14,0:14:.:.:21      0/0:10,0:10:.:.:16      0/0:12,0:12:.:.:30
1       10041296        .       G       T,GT    .       .       AC=0,0;AF=0.00,0.00;AN=168;BaseQRankSum=-1.231e+00;ClippingRankSum=-3.580e-01;DP=1349;ExcessHet=3.01;MQ=5.42;MQRankSum=-1.231e+00;ReadPosRankSum=0.358  GT:AD:DP:PGT:PID:RGQ    ./.:9,0,0:9:.:.:0       0/0:11,0,0:11:.:.:2     ./.:8,0,0:8:.:.:0       ./.:10,0,0:10:.:.:0     0/0:12,0,0:12:.:.:4     0/0:13,0,0:13:.:.:7     0/0:13,0,0:13:.:.:7     ./.:8,0,0:8:.:.:0       ./.:17,0,0:17:.:.:0     0/0:11,0,0:11:.:.:24    0/0:14,0,0:14:.:.:23    0/0:6,0,0:6:.:.:18      0/0:3,0,0:3:.:.:6       ./.:3,0,0:3:.:.:0       0/0:2,0,0:2:.:.:6       0/0:6,0,0:6:.:.:4       0/0:2,0,0:2:.:.:6       0/0:14,0,0:14:.:.:24    0/0:14,0,0:14:.:.:21    0/0:3,0,0:3:.:.:6       0/0:21,0,0:21:.:.:8     0/0:3,2,0:5:0|1:10041296_G_T:5  0/0:20,0,0:20:.:.:13    ./.:17,0,0:17:.:.:0     0/0:12,0,0:12:.:.:5     0/0:22,0,0:22:.:.:16    ./.:11,0,0:11:.:.:0     ./.:11,0,0:11:.:.:0     0/0:16,0,0:16:.:.:21    ./.:12,0,0:12:.:.:0     ./.:0,0,0:0:.:.:0       ./.:16,0,0:16:.:.:0     ./.:9,0,0:9:.:.:0       ./.:10,0,0:10:.:.:0     ./.:9,0,0:9:.:.:0       0/0:19,0,0:19:.:.:7     ./.:9,0,0:9:.:.:0       0/0:15,0,0:15:.:.:24    ./.:7,0,0:7:.:.:0       0/0:13,0,0:13:.:.:15    ./.:14,0,0:14:.:.:0     0/0:4,0,0:4:0|1:10041296_G_T:12 ./.:18,0,0:18:.:.:0     0/0:17,0,0:17:.:.:20    0/0:6,0,0:6:.:.:9       0/0:10,0,0:10:.:.:18    ./.:11,0,0:11:.:.:0     0/0:12,0,0:12:.:.:15    0/0:9,0,0:9:.:.:21      0/0:6,0,0:6:.:.:9       ./.:25,0,0:25:.:.:0     0/0:12,0,0:12:.:.:27    ./.:2,0,0:2:.:.:0       ./.:10,0,0:10:.:.:0     0/0:12,0,0:12:.:.:3     0/0:6,0,0:6:.:.:12      ./.:0,0,0:0:.:.:0       0/0:12,0,0:12:.:.:18    ./.:18,0,0:18:.:.:0     0/0:5,0,0:5:.:.:9       0/0:8,0,0:8:.:.:9       0/0:5,0,0:5:.:.:6       ./.:5,0,0:5:.:.:0       0/0:9,0,0:9:.:.:10      0/0:2,0,0:2:0|1:10041295_T_G:9  0/0:14,0,0:14:.:.:26    0/0:9,0,0:9:.:.:9       0/0:2,0,0:2:.:.:3       0/0:10,0,0:10:.:.:6     ./.:11,0,0:11:.:.:0     0/0:11,0,0:11:.:.:2     ./.:11,0,0:11:.:.:0     
0/0:3,0,0:3:.:.:6       ./.:11,0,0:11:.:.:0     ./.:6,0,0:6:.:.:0       ./.:12,0,0:12:.:.:0     ./.:5,0,0:5:.:.:0       0/0:8,0,0:8:.:.:7       0/0:5,0,0:5:.:.:12      ./.:5,0,0:5:.:.:0       0/0:21,0,0:21:.:.:42    0/0:10,0,0:10:.:.:14    0/0:13,0,0:13:.:.:18    ./.:3,0,0:3:.:.:0       ./.:5,0,0:5:.:.:0       0/0:6,0,0:6:.:.:1       ./.:14,0,0:14:.:.:0     0/0:7,0,0:7:.:.:9       0/0:13,0,0:13:.:.:8     0/0:12,0,0:12:.:.:18    0/0:9,0,0:9:.:.:3       0/0:8,0,0:8:.:.:24      0/0:2,0,0:2:.:.:6       0/0:7,0,0:7:.:.:3       ./.:5,0,0:5:.:.:0       ./.:6,0,0:6:.:.:0       0/0:9,0,0:9:.:.:21      0/0:9,0,0:9:.:.:11      0/0:12,0,0:12:.:.:18    0/0:13,0,0:13:.:.:33    0/0:10,0,0:10:.:.:15    0/0:4,0,0:4:.:.:9       0/0:6,0,0:6:.:.:9       0/0:16,0,0:16:.:.:30    ./.:12,0,0:12:.:.:0     ./.:7,0,0:7:.:.:0       0/0:17,0,0:17:.:.:32    0/0:9,0,0:9:.:.:9       0/0:23,0,0:23:.:.:17    0/0:8,0,0:8:.:.:8       0/0:13,0,0:13:.:.:22    0/0:13,0,0:13:.:.:24    0/0:16,0,0:16:.:.:21    0/0:15,0,0:15:.:.:27    ./.:9,0,0:9:.:.:0       0/0:12,0,0:12:.:.:15    0/0:25,0,0:25:.:.:28    0/0:13,0,0:13:.:.:21    ./.:9,0,0:9:.:.:0       ./.:4,0,0:4:.:.:0       0/0:8,0,0:8:.:.:7       0/0:13,0,0:13:.:.:21    0/0:13,0,0:13:.:.:10    0/0:9,0,0:9:.:.:18      ./.:16,0,0:16:.:.:0     ./.:2,0,0:2:.:.:0       0/0:15,0,0:15:.:.:27    ./.:8,0,0:8:.:.:0       ./.:8,0,0:8:.:.:0       0/0:14,0,0:14:.:.:16    ./.:12,0,0:12:.:.:0     0/0:12,0,0:12:.:.:1
1       10041297        .       TA      T       .       .       AC=0;AF=0.00;AN=196;BaseQRankSum=0.736;ClippingRankSum=-7.360e-01;DP=1381;ExcessHet=3.01;MQ=3.95;MQRankSum=0.736;ReadPosRankSum=0.736   GT:AD:DP:RGQ    0/0:9,0:9:12    ./.:11,0:11:0   ./.:8,0:8:0     0/0:10,0:10:12  0/0:14,0:14:30  0/0:13,0:13:27  0/0:13,0:13:27  0/0:8,0:8:12    ./.:17,0:17:0   0/0:11,0:11:24  0/0:14,0:14:30  0/0:8,0:8:18    0/0:3,0:3:6     ./.:3,0:3:0     0/0:2,0:2:6     ./.:6,0:6:0     0/0:2,0:2:6     0/0:14,0:14:24  0/0:14,0:14:21  0/0:3,0:3:6     0/0:21,0:21:37  ./.:12,0:12:0   ./.:20,0:20:0   0/0:17,0:17:21  0/0:12,0:12:18  0/0:22,0:22:37  0/0:11,0:11:17  0/0:11,0:11:24  0/0:15,0:15:6   0/0:10,0:10:18  ./.:0,0:0:0     ./.:16,0:16:0   ./.:9,0:9:0     0/0:10,0:10:12  ./.:9,0:9:0     ./.:21,0:21:0   0/0:11,0:11:17  0/0:15,0:15:24  0/0:6,0:6:15    0/0:13,0:13:15  0/0:14,0:14:18  0/0:15,0:15:10  ./.:18,0:18:0   0/0:19,0:19:44  0/0:6,0:6:9     0/0:10,0:10:18  0/0:11,0:11:27  0/0:12,0:12:21  0/0:9,0:9:21    0/0:6,0:6:9     ./.:25,0:25:0   0/0:12,0:12:27  ./.:2,0:2:0     ./.:10,0:10:0   0/0:12,0:12:27  0/0:6,0:6:12    ./.:0,0:0:0     ./.:12,0:12:0   0/0:19,0:19:4   0/0:5,0:5:9     0/0:8,0:8:9     0/0:5,0:5:6     ./.:5,0:5:0     0/0:11,0:11:18  0/0:10,0:10:12  0/0:14,0:14:27  0/0:11,0:11:12  0/0:2,0:2:3     0/0:10,0:10:6   0/0:11,0:11:24  0/0:11,0:11:19  0/0:11,0:11:21  0/0:3,0:3:6     0/0:11,0:11:21  0/0:6,0:6:9     ./.:12,0:12:0   ./.:5,0:5:0     0/0:8,0:8:18    0/0:5,0:5:12    0/0:5,0:5:9     ./.:21,0:21:0   0/0:9,0:9:11    0/0:13,0:13:18  0/0:3,0:3:6     ./.:5,0:5:0     ./.:6,0:6:0     ./.:14,0:14:0   ./.:8,0:8:0     0/0:13,0:13:24  0/0:5,1:6:4     0/0:9,0:9:3     0/0:8,0:8:24    0/0:4,0:4:6     0/0:7,0:7:3     0/0:5,0:5:9     ./.:6,0:6:0     0/0:9,0:9:21    0/0:9,0:9:15    0/0:12,0:12:18  0/0:13,0:13:33  0/0:10,0:10:21  0/0:4,0:4:9     0/0:6,0:6:9     0/0:18,0:18:22  0/0:12,0:12:21  0/0:7,0:7:12    0/0:17,0:17:39  ./.:8,0:8:0     0/0:23,0:23:13  ./.:8,0:8:0     0/0:13,0:13:33  
0/0:13,0:13:24  0/0:16,0:16:13  0/0:15,0:15:27  0/0:11,0:11:7   0/0:12,0:12:15  0/0:25,0:25:43  0/0:13,0:13:21  0/0:9,0:9:21    0/0:4,0:4:12    ./.:8,0:8:0     ./.:11,0:11:0   0/0:13,0:13:3   0/0:9,0:9:18    ./.:16,0:16:0   ./.:2,0:2:0     0/0:15,0:15:27  ./.:8,0:8:0     0/0:7,0:7:9     0/0:14,0:14:21  0/0:11,0:11:24  0/0:12,0:12:30

Here you can see that one of the subjects, 0/0:5,1:6:4, has a read called for the ALT T allele. If GATK 3.6 is filtering out this single read, shouldn't we expect the REF allele to be a T rather than TA?

Thanks,
Ari
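
For anyone wanting to check which samples carry ALT-supporting reads at a site like this, here is a minimal Python sketch that scans the per-sample fields of one VCF record and reports samples whose AD (allelic depth) entry has a non-zero ALT count. The function name and the three-sample example are hypothetical; it assumes the FORMAT column contains AD, as in the GATK 3.5 rows above.

```python
def alt_supporting_samples(format_field, sample_fields):
    """Return (sample_index, ref_depth, alt_depths) for samples with ALT reads in AD."""
    keys = format_field.split(":")
    if "AD" not in keys:
        return []  # e.g. the GATK 3.6 rows above, which only carry GT:DP:RGQ
    ad_index = keys.index("AD")
    hits = []
    for i, field in enumerate(sample_fields):
        values = field.split(":")
        if ad_index >= len(values):
            continue  # VCF allows trailing fields to be dropped
        depths = values[ad_index].split(",")
        if "." in depths:
            continue  # missing depth information
        ref_depth = int(depths[0])
        alt_depths = [int(d) for d in depths[1:]]
        if any(d > 0 for d in alt_depths):
            hits.append((i, ref_depth, alt_depths))
    return hits


# Three sample fields copied from the 10041297 record above.
samples = ["0/0:9,0:9:12", "./.:11,0:11:0", "0/0:5,1:6:4"]
print(alt_supporting_samples("GT:AD:DP:RGQ", samples))
```

The indices returned are positions in the sample columns, so they can be mapped back to sample names from the #CHROM header line.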


HashMap iterator problem with GATK 3.6 on NA12878 validations

Hi all;
I was running validations with the latest GATK 3.6-0 release and ran into an issue on NA12878 where a region around the centromere on X fails with a HashMap NoSuchElementException. I tried to isolate it into a test case, and here is a tarball with the smallest set of regions I could reproduce it on:

https://s3.amazonaws.com/chapmanb/testcases/gatk36_hashmap_report.tar.gz

This has the inputs and a small shell script to demonstrate.

It's a bit of a confusing one to me. If I try to reduce the test case further -- to only the region that appears to fail when DEBUG is turned on -- it will work. The problem seems to have some dependence on the prior state.

Here is the full traceback:

##### ERROR --
##### ERROR stack trace 
java.util.NoSuchElementException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1431)
        at java.util.HashMap$KeyIterator.next(HashMap.java:1453)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.reduceNumberOfAlternativeAllelesBasedOnLikelihoods(HaplotypeCallerGenotypingEngine.java:336)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:264)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:964)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:251)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:255)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:157)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

Any ideas for working around or avoiding this are welcome. Please let me know if I can provide any other information. Thanks for all the great work on GATK,
Brad

MarkDuplicates on CRAM files: A valid CRAM reference was not supplied...

Hello,
I wonder if you can help with the following. I am trying to mark duplicates in a cram file with the following command (picard latest):
'picard-tools MarkDuplicates I=09_1#21.cram O=09_1#21_md.cram M=09_1#21_md.txt'

I keep getting the following errors:

To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalStateException: A valid CRAM reference was not supplied and one cannot be acquired via the property settings reference_fasta or use_cram_ref_download
at htsjdk.samtools.cram.ref.ReferenceSource.getDefaultCRAMReferenceSource(ReferenceSource.java:107)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:301)
at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:212)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:421)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:220)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I have also tried supplying the FASTA reference with R=in.fasta, but that did not help.
There is a similar post asking the same question but it doesn't seem to have been solved.
Many thanks in advance,

Carmen
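
The error message names two property settings, reference_fasta and use_cram_ref_download. Below is a minimal Python sketch that assembles a MarkDuplicates command passing the reference both ways: as the R= argument and as a Java system property. The `samjdk.` prefix on the property name is my understanding of how htsjdk reads its defaults and should be treated as an assumption, as should the file names ref.fa and picard.jar. Whether this resolves the error for this particular Picard version is untested.

```python
import shlex


def markduplicates_cram_command(cram_in, cram_out, metrics, reference_fasta,
                                picard_jar="picard.jar"):
    """Build an argv list for Picard MarkDuplicates on a CRAM file.

    Assumption: htsjdk looks up the property as 'samjdk.reference_fasta'.
    """
    return [
        "java",
        f"-Dsamjdk.reference_fasta={reference_fasta}",  # assumed property name
        "-jar", picard_jar,
        "MarkDuplicates",
        f"I={cram_in}",
        f"O={cram_out}",
        f"M={metrics}",
        f"R={reference_fasta}",
    ]


cmd = markduplicates_cram_command(
    "09_1#21.cram", "09_1#21_md.cram", "09_1#21_md.txt", "ref.fa"
)
# shlex.join quotes arguments containing '#', so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The reference must be the same FASTA the CRAM was written against, since CRAM decoding depends on it.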

(howto) Map and mark duplicates

See Tutorial#6747 for a comparison of MarkDuplicates and MarkDuplicatesWithMateCigar, downloadable example data to follow along, and additional commentary.


Objective

Map the read data to the reference and mark duplicates.

Prerequisites

  • This tutorial assumes adapter sequences have been removed.

Steps

  1. Identify read group information
  2. Generate a SAM file containing aligned reads
  3. Convert to BAM, sort and mark duplicates

1. Identify read group information

The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided. For more information about all the possible fields in the @RG tag, take a look at the SAM specification.

Action

Compose the read group identifier in the following format:

@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1 

where the \t stands for the tab character.
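
As an illustration, the read group string can be composed programmatically. This is a minimal Python sketch, not part of the original tutorial; the function name is hypothetical. It emits a literal backslash followed by t between fields, which is the form BWA's -R option accepts and converts to tabs.

```python
def build_read_group(rg_id, sample, platform, library, unit):
    """Compose an @RG string with literal '\\t' separators for 'bwa mem -R'."""
    fields = [("ID", rg_id), ("SM", sample), ("PL", platform),
              ("LB", library), ("PU", unit)]
    for tag, value in fields:
        # Empty values or embedded tabs would produce a malformed header line.
        if not value or "\t" in value:
            raise ValueError(f"invalid value for {tag}: {value!r}")
    # rf"..." keeps the backslash literal instead of inserting a real tab.
    return "@RG" + "".join(rf"\t{tag}:{value}" for tag, value in fields)


print(build_read_group("group1", "sample1", "illumina", "lib1", "unit1"))
```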


2. Generate a SAM file containing aligned reads

Action

Run the following BWA command, replacing <read group info> with the read group identifier you composed in the previous step:

bwa mem -M -R '<read group info>' -p reference.fa raw_reads.fq > aligned_reads.sam

The -M flag causes BWA to mark shorter split hits as secondary (essential for Picard compatibility).

Expected Result

This creates a file called aligned_reads.sam containing the aligned reads from all input files, combined, annotated and aligned to the same reference.

Note that here we are using a command that is specific to paired-end data in an interleaved fastq file (read pairs together in the same file, with the forward read followed directly by its paired reverse read), which is what we are providing to you as a tutorial file. To map other types of datasets (e.g. single-end, or paired-end in separate forward/reverse read files) you will need to adapt the command accordingly. Please see the BWA documentation for exact usage and more options for these commands.
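
To make the adaptation concrete, here is a minimal Python sketch that builds the bwa mem argument list for the three layouts mentioned: interleaved (-p), paired-end in two files, and single-end. The function name and the file names reads_1.fq, reads_2.fq and single_reads.fq are hypothetical. It only builds the command; bwa writes SAM to standard output, so the caller still has to redirect it to a file.

```python
def bwa_mem_command(reference, read_group, reads, mates=None, interleaved=False):
    """Return an argv list for 'bwa mem' with -M and a read group string."""
    if interleaved and mates is not None:
        raise ValueError("use either an interleaved file or a separate mates file")
    cmd = ["bwa", "mem", "-M", "-R", read_group]
    if interleaved:
        cmd.append("-p")  # reads file holds forward/reverse pairs back to back
    cmd += [reference, reads]
    if mates is not None:
        cmd.append(mates)  # second fastq with the reverse reads
    return cmd


rg = r"@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1"
print(bwa_mem_command("reference.fa", rg, "raw_reads.fq", interleaved=True))
print(bwa_mem_command("reference.fa", rg, "reads_1.fq", mates="reads_2.fq"))
print(bwa_mem_command("reference.fa", rg, "single_reads.fq"))
```

To run one of these, something like `subprocess.run(cmd, stdout=open("aligned_reads.sam", "w"), check=True)` would capture the SAM output.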


3. Convert to BAM, sort and mark duplicates

These initial pre-processing operations format the data to suit the requirements of the GATK tools.

Action

Run the following Picard command to sort the SAM file and convert it to BAM:

java -jar picard.jar SortSam \
    INPUT=aligned_reads.sam \
    OUTPUT=sorted_reads.bam \
    SORT_ORDER=coordinate

Expected Result

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

Action

Run the following Picard command to mark duplicates:

java -jar picard.jar MarkDuplicates \
    INPUT=sorted_reads.bam \
    OUTPUT=dedup_reads.bam \
    METRICS_FILE=metrics.txt

Expected Result

This creates a sorted BAM file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt containing (can you guess?) metrics.

Action

Run the following Picard command to index the BAM file:

java -jar picard.jar BuildBamIndex \
    INPUT=dedup_reads.bam

Expected Result

This creates an index file for the BAM file called dedup_reads.bai.
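
The three Picard steps above can be chained from a small driver script. The following is a minimal Python sketch under the assumption that java and picard.jar are available in the working directory; the function names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess


def picard_steps(sam="aligned_reads.sam", sorted_bam="sorted_reads.bam",
                 dedup_bam="dedup_reads.bam", metrics="metrics.txt",
                 picard_jar="picard.jar"):
    """Return the SortSam, MarkDuplicates and BuildBamIndex commands in order."""
    base = ["java", "-jar", picard_jar]
    return [
        base + ["SortSam", f"INPUT={sam}", f"OUTPUT={sorted_bam}",
                "SORT_ORDER=coordinate"],
        base + ["MarkDuplicates", f"INPUT={sorted_bam}", f"OUTPUT={dedup_bam}",
                f"METRICS_FILE={metrics}"],
        base + ["BuildBamIndex", f"INPUT={dedup_bam}"],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


run_steps(picard_steps())
```

Each step consumes the previous step's output file, so stopping at the first failure avoids running MarkDuplicates on a missing or partial BAM.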

cannot use SelectVariants to combine VCF files

I am trying to use SelectVariants to find the concordance between two VCF files, one generated by GATK UnifiedGenotyper and the other by samtools mpileup.
However, I get an error message saying that the samtools VCF file is malformed, which I found is caused by heterozygous sites in indels (e.g., R, Y, W, M or S in the indel sequence).
Is there a way to bypass this problem when using GATK SelectVariants?

Thanks,

Chih-Ming
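
One way to see how many records are affected is to scan the REF and ALT columns for characters other than A, C, G, T and N. This is a minimal Python sketch with a hypothetical function name and made-up example records; it only reports the offending alleles and does not decide how they should be rewritten. Symbolic alleles such as <DEL> are skipped, while breakend notation with brackets would be flagged and may need special handling.

```python
ALLOWED_BASES = set("ACGTN")


def ambiguous_alleles(vcf_lines):
    """Return (chrom, pos, allele) for alleles containing non-ACGTN characters."""
    problems = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a complete VCF record
        chrom, pos, ref, alt = cols[0], cols[1], cols[3], cols[4]
        for allele in [ref] + alt.split(","):
            if allele in (".", "*") or allele.startswith("<"):
                continue  # missing, spanning-deletion or symbolic allele
            if set(allele.upper()) - ALLOWED_BASES:
                problems.append((chrom, int(pos), allele))
    return problems


# Made-up records: the second one has an IUPAC code (R) inside an indel allele.
example = [
    "##fileformat=VCFv4.1",
    "1\t100\t.\tA\tG\t50\t.\t.",
    "1\t200\t.\tAR\tA\t50\t.\t.",
]
print(ambiguous_alleles(example))
```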

Problem with chromosomes longer than 512 Mb in GATK

I am using GATK on the Wheat_survey genome, which has a 3B chromosome of more than 700 Mb. With default parameters, it cannot call SNPs in the region from 512 Mb to 700 Mb. How can I fix this?
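
One possible explanation, which is an assumption on my part and not something stated in the post, is the BAM index: the standard BAI format only addresses coordinates below 2^29 (536,870,912, i.e. 512M), which matches the position where calling stops. A minimal Python sketch for checking which contigs in a reference exceed that limit, reading lines in the samtools .fai layout (name, length, offset, ...), is shown below. The function name and the contig lengths in the example are hypothetical.

```python
BAI_MAX = 2 ** 29  # 536,870,912: largest coordinate range a BAI index can address


def contigs_over_bai_limit(fai_lines):
    """Return (name, length) for contigs longer than the BAI coordinate limit."""
    over = []
    for line in fai_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, length = parts[0], int(parts[1])
        if length > BAI_MAX:
            over.append((name, length))
    return over


# Made-up .fai lines: name, length, byte offset, bases per line, bytes per line.
example_fai = [
    "chrA\t400000000\t6\t60\t61",
    "chrB\t700000000\t406666679\t60\t61",
]
print(contigs_over_bai_limit(example_fai))
```

If a contig is flagged, a commonly suggested workaround is to split it into pieces shorter than the limit before alignment and to shift the coordinates back afterwards.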
