Channel: Recent Discussions — GATK-Forum

Phantom indels from HaplotypeCaller?

Dear GATK users and developers,

I am running HaplotypeCaller followed by ValidateVariants, and the latter complains about variants where an alternative allele is called without any observations supporting it.

ERROR MESSAGE: File /storage/rafal.gutaker/NEXT_test/work/4f/6f8738a66d1c9d12651b76b7ef8819/IRIS_313-15896.g.vcf fails strict validation: one or more of the ALT allele(s) for the record at position LOC_Os01g01010:6190 are not observed at all in the sample genotypes |
ERROR ------------------------------------------------------------------------------------------

Here is an example of site that ValidateVariant complains about:

LOC_Os01g01010 6190 . GT G,<NON_REF> 0 . DP=4;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=14400.00 GT:AD:DP:GQ:PL:SB 0/0:4,0,0:4:12:0,12,135,12,135,135:4,0,0,0
LOC_Os01g01010 6192 . T . . END=6192 GT:DP:GQ:MIN_DP:PL 0/0:8:0:8:0,0,254

In general, it seems harmless, so I am thinking of removing this check, but why HaplotypeCaller is finding phantom variants is a mystery to me.
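For what it's worth, ValidateVariants can be told to skip that specific check; a minimal sketch, assuming GATK 3.x argument names and a placeholder reference path:

java -jar GenomeAnalysisTK.jar -T ValidateVariants \
    -R reference.fasta \
    -V IRIS_313-15896.g.vcf \
    --validationTypeToExclude ALLELES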

Thank you and

Best!
Rafal


9 Things You've Been Dying To Know About The HaplotypeCaller Paper

Q: What, there's a HaplotypeCaller paper?

A: Yes! We are super pumped to announce the long-awaited release of The HaplotypeCaller Paper -- or rather, the preprint in bioRxiv. (Actually we announced it on Twitter a while back but we understand not everyone enjoys such an old-school way of keeping up with the news). Hopefully you’re as excited as we are, if not more so, but we understand that this probably raises a few questions for some of you, so we tried to address some of those below.


Q: Why did it take so long?!

A: Our mission is to develop the tools that get used by others to do groundbreaking scientific research. Benchmarking and validation are important parts of our prototyping and development cycle, but given that we’re not subject to the “publish or perish” culture of a research lab, submitting manuscripts presenting those results wasn’t a high priority for us.

Q: Are you going to submit it to a peer-reviewed journal?

A: Probably not.

Q: Why not?

A: Our main motivation for posting the HaplotypeCaller manuscript to bioRxiv was to provide something recent/reasonable to cite and to make more details of the methods public. Submitting to a peer-reviewed journal usually involves a lot of time working on revisions that we’d rather put towards working on further improvements to the tools.

Q: Is it still a preprint if it's never intended to go to print?

A: You tell us.

Q: What version of HaplotypeCaller does the paper describe?

A: The paper describes the GATK 3.4 version of the HaplotypeCaller (yes we started this a while back) but the HaplotypeCaller has not changed significantly in later 3.x versions so it's fair to say the paper covers up to version 3.8 completely.

Q: How do these results compare to GATK4?

A: At time of writing, the GATK4 version of HaplotypeCaller is still considered a beta version. The team is actively working on validating the GATK4 version to make sure that it’s guaranteed to be as good as or better than the GATK3 version described in the paper.

Q: How does the methodology compare to GATK4?

A: The GATK engine that parses the BAM and “shards” the data to pass to the tools has been rewritten for improved efficiency over GATK3, and the HaplotypeCaller code has been refactored for better organization and readability. So there's a lot that is different in terms of software implementation. However the algorithms and equations presented in the manuscript remain the same, so overall the paper's description of how the HaplotypeCaller operates also applies to the GATK4 beta version, and it is appropriate to use it as a citation for results derived from versions up to the current beta (4.beta.6).

Q: Does the release of this paper hint at a change in how the team prioritizes publication?

A: To some extent. The developers of the somatic variant caller Mutect2 and related tools have put in a lot of effort to prepare white papers on the methods involved (Mutect2 itself, the assembly process and the pairHMM algorithm), some of which are shared with the HaplotypeCaller. They hope to release a manuscript featuring Mutect2 somatic SNV and INDEL variant calling results in the near future. Additionally, the GATK development team as a whole aims to make more of our internal benchmarking and validation efforts more transparent and available to other tool developers; an effort that our colleague Yossi Farjoun kicked off in style in his blog post about the new "SynDip" benchmark last week.

Picard RevertSam java.nio.file.NoSuchFileException

Hi,

I'm starting to process a set of BAMs following the Best Practices, beginning from BAMs that were processed by someone else. Thus, I'm attempting to generate unmapped BAMs following this post, using the latest version of Picard (2.15.0). Unfortunately, Picard throws an exception showing that it is unable to find the temporary files it is writing. I know there is space for these files, and in fact I now have Picard version 1.141 running without issue. The output from version 2.15.0 is below.

15:34:14.012 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/REDACTED/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat Nov 18 15:34:14 CST 2017] RevertSam INPUT=/REDACTED.bam OUTPUT=/REDACTED/808302_LP6008048-DNA_B02.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/REDACTED/tmp] VALIDATION_STRINGENCY=LENIENT OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Sat Nov 18 15:34:14 CST 2017] Executing as awilliams@REDACTED on Linux 3.10.0-229.7.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12; Deflater: Intel; Inflater: Intel; Picard version: 2.15.0-SNAPSHOT
[Sat Nov 18 15:34:30 CST 2017] picard.sam.RevertSam done. Elapsed time: 0.27 minutes.
Runtime.totalMemory()=1272971264
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.nio.file.NoSuchFileException: /REDACTED/tmp/awilliams/sortingcollection.728972638772980431.tmp
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:246)
at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:166)
at picard.sam.RevertSam$RevertSamSorter.add(RevertSam.java:637)
at picard.sam.RevertSam.doWork(RevertSam.java:260)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
Caused by: java.nio.file.NoSuchFileException: /REDACTED/tmp/awilliams/sortingcollection.728972638772980431.tmp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.createFile(Files.java:632)
at java.nio.file.TempFileHelper.create(TempFileHelper.java:138)
at java.nio.file.TempFileHelper.createTempFile(TempFileHelper.java:161)
at java.nio.file.Files.createTempFile(Files.java:852)
at htsjdk.samtools.util.IOUtil.newTempPath(IOUtil.java:316)
at htsjdk.samtools.util.SortingCollection.newTempFile(SortingCollection.java:255)
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:220)
... 6 more
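In case it helps: one possible explanation (an assumption, not confirmed) is that this Picard version creates its temporary files via java.nio, which will not create a missing TMP_DIR subdirectory the way older versions did. A minimal sketch of a workaround, with placeholder input/output names:

# make sure the temp directory shown in the stack trace actually exists
mkdir -p /REDACTED/tmp/awilliams
java -Djava.io.tmpdir=/REDACTED/tmp/awilliams -jar picard.jar RevertSam \
    I=input.bam O=reverted.bam SANITIZE=true SORT_ORDER=queryname \
    MAX_DISCARD_FRACTION=0.005 TMP_DIR=/REDACTED/tmp/awilliams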

How can I exclude SNP sites with an ALT asterisk using SelectVariants?

Hello ,

I am using the latest GATK 3.6 to analyze my human WGS data.

For snp analysis, when I ran VariantRecalibrator , it reported error as following:

1 788419 . A * 854.77 PASS DP=33 GT 0/1
java.lang.RuntimeException: java.lang.RuntimeException: WARNING: Unkown IUB code for SNP '*'

I found that my raw.snp.vcf.gz had these sites :

chr1 64764 . C T
chr1 64976 . C T
chr1 66161 . T *
chr1 66164 . A *
chr1 66165 . T *
chr1 66166 . A *
chr1 66239 . A *
chr1 66240 . T *
chr1 66241 . T *
chr1 66242 . A *

I added the parameter --selectTypeToExclude SYMBOLIC to SelectVariants, but these sites were still in my snp.vcf.gz.

I don't know how to skip these sites and run VariantRecalibrator smoothly.
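One workaround, sketched under the assumption that it is acceptable to simply drop every record whose ALT column contains the spanning-deletion allele before running VariantRecalibrator:

zcat raw.snp.vcf.gz \
    | awk -F'\t' '/^#/ || $5 !~ /\*/' \
    | bgzip > raw.snp.nostar.vcf.gz
tabix -p vcf raw.snp.nostar.vcf.gz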

Thank you very much...

VariantFiltration not filtering correctly

Hi,
I'm running the following command to hard-filtering some variants:

gatk -R /Users/debortoli/Doutorado/hg19/hg19.fa \
-T VariantFiltration -V vcf_no_indels.recode.vcf \
--filterExpression "ReadPosRankSum < -5.0 || MQRankSum < -4.0 || MQ < 40.0 || QD < 2.0 || FS > 60.0 || SOR > 2.0" \
--filterName "FAIL" \
-o vcf_no_indels_hard_filtered_test.vcf

The output shows some strange things, like:

chr15 28228924 . G A 68.73 PASS AC=2;AF=3.236e-03;AN=618;DP=3360;ExcessHet=0.0106;FS=0.000;InbreedingCoeff=0.0278;MLEAC=1;MLEAF=1.618e-03;MQ=60.00;QD=22.91;SOR=2.833 GT:AD:DP:GQ:PL 0/0:22,0:22:66:0,66,768 0/0:35,0:35:99:0,105,1186 0/0:11,0:11:33:0,33,378
chr15 28419695 rs149592795 T C 339531 MQRankSum AC=133;AF=0.196;AN=680;BaseQRankSum=-8.510e-01;ClippingRankSum=0.036;DP=154411;ExcessHet=73.3363;FS=0.623;InbreedingCoeff=-0.2431;MLEAC=133;MLEAF=0.196;MQ=59.11;MQRankSum=-9.913e+00;QD=3.03;ReadPosRankSum=3.09;SOR=0.610 GT:AD:DP:GQ:PGT:PID:PL 0/0:89,0:89:99:.:.:0,120,1800 0/0:1055,0:1055:99:.:.:0,120,1800 0/0:338,0:338:99:.:.:0,120,1800

I'm wondering why these records pass the filter when they shouldn't. There are more examples throughout the VCF that also pass one of the filters when they shouldn't.

#### ERROR MESSAGE: Writing failed because there is no space left on the disk or hard drive. Please

Hello, I am getting this error when running my analysis with GenotypeGVCFs. I have enough space available, so I don't understand what the issue is. The script always stops at the same position. This is the script I am running:

module load java/1.8.0
srun java -Xmx"$MEM"g -jar "$GATK" \
-T GenotypeGVCFs \
-R "$REFERENCE" \
--variant "$GVCF" \
--dbsnp "$DBSNP" \
--disable_auto_index_creation_and_locking_when_reading_rods \
--max_alternate_alleles 7 \
-nt 32 \
-o "$OUTPUT"All_samples_raw.snps.indels.vcf


What is a GVCF and how is it different from a 'regular' VCF?

Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

[Figure: comparison of a regular VCF with gVCFs produced in BP_RESOLUTION and banded GVCF modes]

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.
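For reference, typical invocations for the two modes might look like this (a minimal sketch, assuming GATK 3.x syntax and placeholder file names):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    --emitRefConfidence GVCF \
    -o sample.g.vcf.gz

# one record per site instead of banded blocks:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta \
    -I sample.bam \
    --emitRefConfidence BP_RESOLUTION \
    -o sample.bp_resolution.g.vcf.gz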

Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta

Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)

which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000117    .   C   T,<NON_REF> 612.77  .   BaseQRankSum=0.000;ClippingRankSum=-0.411;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.39;MQ0=0;MQRankSum=-2.172;ReadPosRankSum=-0.235   GT:AD:DP:GQ:PL:SB   0/1:17,21,0:38:99:641,0,456,691,519,1210:6,11,11,10
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000211    .   C   T,<NON_REF> 638.77  .   BaseQRankSum=0.894;ClippingRankSum=-1.927;DP=42;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.89;MQ0=0;MQRankSum=-1.750;ReadPosRankSum=1.549    GT:AD:DP:GQ:PL:SB   0/1:20,22,0:42:99:667,0,566,728,632,1360:9,11,12,10
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000694    .   G   A,<NON_REF> 961.77  .   BaseQRankSum=0.736;ClippingRankSum=-0.009;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=106.92;MQ0=0;MQRankSum=0.482;ReadPosRankSum=1.537 GT:AD:DP:GQ:PL:SB   0/1:21,32,0:53:99:990,0,579,1053,675,1728:9,12,10,22
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001019    .   T   G,<NON_REF> 93.77   .   BaseQRankSum=0.058;ClippingRankSum=-0.347;DP=26;MLEAC=1,0;MLEAF=0.500,0.00;MQ=29.65;MQ0=0;MQRankSum=-0.925;ReadPosRankSum=0.000 GT:AD:DP:GQ:PL:SB   0/1:19,7,0:26:99:122,0,494,179,515,694:12,7,4,3
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).

Error: Could not find or load main class org.broadinstitute.gatk.engine.CommandLineGATK

I installed GenomeAnalysisTK-3.8-0-ge9d806836 with java version "1.8.0_151" to call variants using the SRR database.
The command line I use is:
$ java -cp "sra_gatk_package/*" org.broadinstitute.gatk.engine.CommandLineGATK -T UnifiedGenotyper \
    -I SRR835775 -R GRCh37/GRCh37.fa -L NC_000020.10:61000001-61100000 -o chr20.SRR835775.vcf
Unfortunately, the error I received is:
Could not find or load main class org.broadinstitute.gatk.engine.CommandLineGATK.
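A sketch of one thing to check (an assumption: the classpath is relative, so the command only works from the directory that contains sra_gatk_package/, and the GATK jar must actually be in it):

ls sra_gatk_package/*.jar    # confirm the GATK jar is present in this directory
java -cp "/full/path/to/sra_gatk_package/*" org.broadinstitute.gatk.engine.CommandLineGATK \
    -T UnifiedGenotyper \
    -I SRR835775 \
    -R GRCh37/GRCh37.fa \
    -L NC_000020.10:61000001-61100000 \
    -o chr20.SRR835775.vcf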

HaplotypeCaller does not filter duplicate reads, why?

Hi,
I'm running HaplotypeCaller on a server this way:
java -XX:ParallelGCThreads=8 -Xmx80g -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -I a2tl1_14_final.bam --min_base_quality_score 25 --min_mapping_quality_score 25 -rf DuplicateRead -rf BadMate -rf BadCigar -R JIC_reference/alygenomes.fasta -o a2tl1_14_HC1.g.vcf.gz -ploidy 2 -stand_call_conf 25 -ERC GVCF --pcr_indel_model NONE -nct 8 --max_num_PL_values 350

I cannot figure out why no duplicate reads are being filtered out, although they were marked by Picard (with option TAGGING_POLICY=All) and I also see around 20% duplicates in the corresponding samtools flagstat output.

The beginning of the stdout looks like this:

INFO 04:37:54,634 HelpFormatter - --------------------------------------------------------------------------------
INFO 04:37:54,986 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 04:37:54,986 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 04:37:54,986 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 04:37:54,986 HelpFormatter - [Tue Nov 07 04:37:54 CET 2017] Executing on Linux 3.16.0-4-amd64 amd64
INFO 04:37:54,986 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27
INFO 04:37:54,990 HelpFormatter - Program Args: -T HaplotypeCaller -I h2al1_21_final.bam --min_base_quality_score 25 --min_mapping_quality_score 25 -rf DuplicateRead -rf BadMate -rf BadCigar -R JIC_reference/alygenomes.fasta -o h2al1_21_HC1.g.vcf.gz -ploidy 2 -stand_call_conf 25 -ERC GVCF --pcr_indel_model NONE -nct 8 --max_num_PL_values 350
INFO 04:37:55,002 HelpFormatter - Executing as vlkofly@zigur17.cerit-sc.cz on Linux 3.16.0-4-amd64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27.
INFO 04:37:55,003 HelpFormatter - Date/Time: 2017/11/07 04:37:54
INFO 04:37:55,003 HelpFormatter - --------------------------------------------------------------------------------
INFO 04:37:55,003 HelpFormatter - --------------------------------------------------------------------------------
WARN 04:37:55,009 GATKVCFUtils - Creating Tabix index for h2al1_21_HC1.g.vcf.gz, ignoring user-specified index type and parameter
INFO 04:37:55,237 GenomeAnalysisEngine - Strictness is SILENT
INFO 04:37:56,044 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 04:37:56,051 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
WARNING: BAM index file /scratch/vlkofly/job_386162.wagap-pro.cerit-sc.cz/h2al1_21_final.bai is older than BAM /scratch/vlkofly/job_386162.wagap-pro.cerit-sc.cz/h2al1_21_final.bam
INFO 04:37:56,221 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.17
INFO 04:37:56,244 HCMappingQualityFilter - Filtering out reads with MAPQ < 25
INFO 04:37:56,289 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 8 processors available on this machine
INFO 04:37:57,903 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 04:37:58,090 GenomeAnalysisEngine - Done preparing for traversal
INFO 04:37:58,090 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 04:37:58,091 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 04:37:58,091 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 04:37:58,091 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
INFO 04:37:58,092 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output
WARN 04:37:58,411 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
INFO 04:37:58,510 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 04:37:58,511 PairHMM - Performance profiling for PairHMM is disabled because the program is being run with multiple threads (-nct>1) option

And the info lines showing no duplicates removed:

INFO 16:53:05,155 ProgressMeter - Total runtime 130507.06 secs, 2175.12 min, 36.25 hours
INFO 16:53:05,155 MicroScheduler - 46705813 reads were filtered out during the traversal out of approximately 149962396 total reads (31.15%)
INFO 16:53:05,155 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 16:53:05,156 MicroScheduler - -> 13334530 reads (8.89% of total) failing BadMateFilter
INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 16:53:05,156 MicroScheduler - -> 31823278 reads (21.22% of total) failing HCMappingQualityFilter
INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 16:53:05,157 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 16:53:05,157 MicroScheduler - -> 1548005 reads (1.03% of total) failing NotPrimaryAlignmentFilter
INFO 16:53:05,157 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
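One quick sanity check, sketched on the assumption that samtools is available: confirm that the BAM actually passed to HaplotypeCaller has the duplicate flag (0x400) set on its reads, since DuplicateReadFilter keys on that flag rather than on the DT tag added by TAGGING_POLICY:

samtools view -c -f 1024 a2tl1_14_final.bam    # count reads carrying the duplicate flag
samtools flagstat a2tl1_14_final.bam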

Spanning or overlapping deletions (* allele)

We use the term spanning deletion or overlapping deletion to refer to a deletion that spans a position of interest.

The presence of a spanning deletion affects how we can represent genotypes at any site(s) that it spans for those samples that carry the deletion, whether in heterozygous or homozygous variant form. Page 8, item 5 of the VCF v4.3 specification reserves the * allele to reference overlapping deletions. This is not to be confused with the bracketed asterisk <*> used to denote symbolic alternate alleles.


[Figure: four samples, with Bob and Lian carrying a heterozygous A-to-T SNP at position 20, and Kyra, Lian and Omar carrying a 9 bp deletion spanning positions 15-23]

Here we illustrate with four human samples. Bob and Lian each have a heterozygous A to T single polymorphism at position 20, our position of interest. Kyra has a 9 bp deletion from position 15 to 23 on both homologous chromosomes that extends across position 20. Lian and Omar each are heterozygous for the same 9 bp deletion. Omar and Bob's other allele is the reference A.

What are the genotypes for each individual at position 20? For Bob, the reference A and variant T alleles are clearly present for a genotype of A/T.

What about Lian? Lian has a variant T allele plus a 9 bp deletion overlapping position 20. To notate the deletion as we do single nucleotide deletions is technically inaccurate. We need a placeholder notation to signify absent sequence that extends beyond the position of interest and that is listed for an earlier position, in our case position 14. The solution is to use a star or asterisk * at position 20 to refer to the spanning deletion. Using this convention, Lian's genotype is T/*.

At the sample-level, Kyra and Omar would not have records for position 20. However, we are comparing multiple samples and so we indicate the spanning deletion at position 20 with *. Omar's genotype is A/* and Kyra's is */*.


[Figure: two example VCF representations of the spanning deletion]

In the VCF, depending on the format used by tools, positions equivalent to our example position 20 may or may not be listed. If listed, such as in the first example VCF shown, the spanning deletion is noted with the asterisk * under the ALT column. The spanning deletion is then referred to in the genotype GT for Kyra, Lian and Omar. Alternatively, a VCF may altogether avoid referencing the spanning deletion by listing the variant with the spanning deletion together with the deletion. This is shown in the second example VCF at position 14.
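To make this concrete, here is a purely illustrative snippet of the first style of representation for the four samples above (the reference bases are invented for illustration; allele indices in GT are 0 = REF, 1 = first ALT, 2 = second ALT):

#CHROM  POS  ID  REF         ALT  QUAL  FILTER  INFO  FORMAT  Bob  Kyra  Lian  Omar
20      14   .   ATGTGTATGT  A    .     .       .     GT      0/0  1/1   0/1   0/1
20      20   .   A           T,*  .     .       .     GT      0/1  2/2   1/2   0/2

At position 14, the 10 bp REF allele spans positions 14-23 and the single-base ALT encodes the 9 bp deletion; at position 20, the * allele lets the deleted haplotypes of Kyra, Lian and Omar be genotyped alongside Bob's A/T call.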

ReorderSam: Error "Invalid reference index -1"

I'm running java 1.8.0_25 and the latest version of picard tools to perform RNA-seq variant calling analysis.

When I try to run ReorderSam, it says that the process is completed, but throws the following error:

Command used:
java -jar picard.jar ReorderSam I=input_dedupped.bam O=output_reordered.bam R=Homo_sapiens_assembly19.fasta CREATE_INDEX=TRUE

[Wed Dec 13 14:35:46 EST 2017] picard.sam.ReorderSam done. Elapsed time: 2.38 minutes.
Runtime.totalMemory()=130547712
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: Invalid reference index -1
at htsjdk.samtools.QueryInterval.<init>(QueryInterval.java:24)
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
at picard.sam.ReorderSam.doWork(ReorderSam.java:141)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

I think that there is something wrong with the output file since the next step (which is SplitNCigarReads) throws an error about a malformed BAM file.

Any help would be greatly appreciated!

Thank you in advance!
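One possibility worth checking (an assumption about the cause: the dedupped BAM's header contains contigs that cannot be matched to Homo_sapiens_assembly19.fasta, which would yield a reference index of -1) is to allow an incomplete dictionary match; a minimal sketch:

java -jar picard.jar ReorderSam \
    I=input_dedupped.bam \
    O=output_reordered.bam \
    R=Homo_sapiens_assembly19.fasta \
    ALLOW_INCOMPLETE_DICT_CONCORDANCE=true \
    CREATE_INDEX=TRUE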

Allelic CNV (common SNPs and GetBayesianHetCoverage)

I created a list of common SNPs for GetBayesianHetCoverage following https://gatkforums.broadinstitute.org/gatk/discussion/7812/creating-a-list-of-common-snps-for-use-with-getbayesianhetcoverage.
I used GATK 3.7 (because there is no "CatVariants" in GATK 4?) with the Hg19 fasta (from ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/), and got allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.interval_list.

When I ran "GetBayesianHetCoverage (gatk-4.beta.6)", I got the following error:

04:38:01.403 INFO GetBayesianHetCoverage - Shutting down engine
[November 29, 2017 4:38:01 AM CST] org.broadinstitute.hellbender.tools.exome.GetBayesianHetCoverage done. Elapsed time: 0.21 minutes.
Runtime.totalMemory()=5701107712
htsjdk.samtools.SAMException: Intervals not in order: 9:141146683-141146683 + rs374057746; 10:60969-60969 + rs61838556
at htsjdk.samtools.util.IntervalUtil.assertOrderedNonOverlapping(IntervalUtil.java:70)
at htsjdk.samtools.filter.IntervalFilter.<init>(IntervalFilter.java:57)
at htsjdk.samtools.util.SamRecordIntervalIteratorFactory.makeSamRecordIntervalIterator(SamRecordIntervalIteratorFactory.java:67)
at htsjdk.samtools.util.AbstractLocusIterator.iterator(AbstractLocusIterator.java:192)
at org.broadinstitute.hellbender.tools.exome.pulldown.BayesianHetPulldownCalculator.getHetPulldown(BayesianHetPulldownCalculator.java:339)
at org.broadinstitute.hellbender.tools.exome.GetBayesianHetCoverage.runMatchedNormalTumor(GetBayesianHetCoverage.java:344)
at org.broadinstitute.hellbender.tools.exome.GetBayesianHetCoverage.doWork(GetBayesianHetCoverage.java:385)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
at org.broadinstitute.hellbender.Main.main(Main.java:233)

(Command line:
java -jar gatk-package-4.beta.5-local.jar GetBayesianHetCoverage \
--reference genome.fa \
--snpIntervals allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.interval_list \
--tumor tumor.bam \
--tumorHets tumor_het_pulldown.tsv \
--normal normal.bam \
--normalHets normal_het_pulldown.tsv \
--hetCallingStringency 30)

I also tried picard "GatherVcfs" (replace CatVariants) and "SortVcf"(for interval order), but it was no use.
Please teach me how to solve it .
Thank you.


Problems in getting correct number of SNPs and INDELs in SelectVariant tool

Hi Everybody!

I am trying to use the SelectVariants tool to separate my variants by type, writing SNPs and INDELs to separate files.

After that I use bcftools to count the number of variants, but I am afraid that after the separation many of my SNPs and INDELs were lost and not considered. I did not apply any filters to exclude any type of variant, and the program ran smoothly. Can anyone please tell me what types of SNPs and INDELs the 'SelectVariants' tool considers by default, or what I am missing?

MY GATK command was:
Step2 Select SNPs
java -Xmx10g -jar /GTK/GenomeAnalysisTK.jar -T SelectVariants \
-nt 5 \
-R /Ref/human_g1k_v37.fasta \
-V /GATK/VQSR/VQSR_snp_indel_CHG000691_2_3.vcf \
-o /GATK/SLECVAR/HC/HC_SNP_only_CHG000691_2_3.vcf \
-selectType SNP

Step2 Select INDELs
java -Xmx10g -jar /GTK/GenomeAnalysisTK.jar -T SelectVariants \
-nt 5 \
-R /leofs/zengchq_group/sohail/Ref/human_g1k_v37.fasta \
-V /GATK/VQSR/VQSR_snp_indel_CHG000691_2_3.vcf \
-o /GATK/SLECVAR/HC/HC_INDELs_only_CHG000691_2_3.vcf \
-selectType INDEL


My bcftools commands were:
bcftools stats /GATK/VQSR/VQSR_snp_indel_CHG000691_2_3.vcf > /stats/vcf-combined.stats
bcftools stats /GATK/SLECVAR/HC/HC_SNP_only_CHG000691_2_3.vcf > /stats/vcf-snp.stats
bcftools stats /GATK/SLECVAR/HC/HC_INDELs_only_CHG000691_2_3.vcf > /stats/vcf-indel.stats

Resultant statistics are as follows:
Combined-VCF: (mixed file before SelectVariant command)

SN, Summary numbers:
SN [2]id [3]key [4]value
SN 0 number of samples: 3
SN 0 number of records: 5816944
SN 0 number of SNPs: 4887422
SN 0 number of MNPs: 0
SN 0 number of indels: 937230
SN 0 number of others: 0
SN 0 number of multiallelic sites: 81719
SN 0 number of multiallelic SNP sites: 3629

SNP-VCF (SNP file after SelectVariants)
SN 0 number of samples: 3
SN 0 number of records: 4879714
SN 0 number of SNPs: 4879714
SN 0 number of MNPs: 0
SN 0 number of indels: 0
SN 0 number of others: 0
SN 0 number of multiallelic sites: 3629
SN 0 number of multiallelic SNP sites: 3629

INDEL-VCF (INDEL file after SelectVariants)
SN 0 number of samples: 3
SN 0 number of records: 929522
SN 0 number of SNPs: 0
SN 0 number of MNPs: 0
SN 0 number of indels: 929522
SN 0 number of others: 0
SN 0 number of multiallelic sites: 70382
SN 0 number of multiallelic SNP sites: 0

You can see that the number of SNPs in the separate file does not correspond with the combined statistics file. Any idea why this happens?
I am sorry this post is so long, but I think I needed to explain my question fully. Apologies in advance for any inconvenience.

Thanks!
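For what it's worth, the arithmetic points at the missing records: the combined file has 5,816,944 records, while the SNP-only (4,879,714) and INDEL-only (929,522) outputs together account for 5,809,236, leaving 7,708 records, exactly the number of "missing" SNPs (4,887,422 - 4,879,714). A plausible explanation (an assumption, not confirmed) is that these are multiallelic records mixing SNP and indel alleles, which SelectVariants types as MIXED and therefore excludes from both -selectType SNP and -selectType INDEL. A minimal sketch to check this, assuming GATK 3.x:

java -Xmx10g -jar /GTK/GenomeAnalysisTK.jar -T SelectVariants \
-R /Ref/human_g1k_v37.fasta \
-V /GATK/VQSR/VQSR_snp_indel_CHG000691_2_3.vcf \
-o /GATK/SLECVAR/HC/HC_MIXED_only_CHG000691_2_3.vcf \
-selectType MIXED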

Mutect2 suggestion - option to output normal sites too

I'd like to see an option for Mutect2 that would allow input of a germline .vcf call file, with the output including all sites present in that file.

The --output_mode option could be expanded to include an option EMIT_VARIANTS_AND_NORMAL, which would cause the output VCF to include not only the discovered somatic variants, but also include all variants that were discovered in the normal data but may not be significantly different in the tumor data.

The reasons for wanting this option are:

  • to have all mutations in the vcf file whether they are germline or somatic

  • to have an indication (in the FILTER field) from Mutect2 of whether it found the same germline mutation and no change in somatic, or same germline mutation but somatic has different allele state, or it did not find the same germline mutations that the vcf file indicates.

  • to have the alternate allele read counts and frequencies in the normal data together in the same vcf line as the somatic variant data, for easy comparison of the two.

  • to permit LOH or het to hom changes to be spotted in the vcf data, since Mutect2 does not call LOH

JAVA comsol shell problem

I have a problem launching the COMSOL program via Xshell because of this:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f759ac002b7, pid=11548, tid=140144906778432
#
# JRE version: Java(TM) SE Runtime Environment (7.0_75-b13) (build 1.7.0_75-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.75-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libgdk-x11-2.0.so.0+0x7b2b7]  gdk_window_enable_synchronized_configure+0x7
#
# Core dump written. Default location: /home/user/core or core.11548
#
# An error report file with more information is saved as:
# /home/user/hs_err_pid11548.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
How can I get past this?
Please help me.

PhaseByTransmission does not phase any genotype

$
0
0

Dear GATK development team:
Hi, I'm trying to use GATK to phase trio data. Below is the log I copied; I filtered my PED file and VCF file so that both files contain exactly the same set of samples, with strictly matched trio relationships.

However, the result I get is always the same as the input file: none of the genotypes are phased. Could anyone help me out here?

Thanks a million!

INFO  19:39:01,840 HelpFormatter - ----------------------------------------------------------------------------------
INFO  19:39:01,842 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO  19:39:01,843 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  19:39:01,843 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  19:39:01,843 HelpFormatter - [Thu Dec 14 19:39:01 EST 2017] Executing on Linux 3.13.0-135-generic amd64
INFO  19:39:01,843 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12
INFO  19:39:01,848 HelpFormatter - Program Args: -T PhaseByTransmission -R resource/hs37d5.fa -V ALL.chr20.pass.minDP0.gtonly.genotypes.bcf.HRC.10k.site.sorted.vcf.gz.has.ped.recode.vcf.gz -ped ./no.problem.trio.family.list.ped -o output.vcf
INFO  19:39:01,854 HelpFormatter - Executing as fanzhang@1000g on Linux 3.13.0-135-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12.
INFO  19:39:01,855 HelpFormatter - Date/Time: 2017/12/14 19:39:01
INFO  19:39:01,855 HelpFormatter - ----------------------------------------------------------------------------------
INFO  19:39:01,855 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/net/wonderland/home/fanzhang/WorkingSpace/install-packages/gatk/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar!/META-INF/log4j-provider.prope
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  19:39:02,010 GenomeAnalysisEngine - Deflater: JdkDeflater
INFO  19:39:02,010 GenomeAnalysisEngine - Inflater: JdkInflater
INFO  19:39:02,011 GenomeAnalysisEngine - Strictness is SILENT
INFO  19:39:02,139 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
WARN  19:39:02,210 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  19:39:02,219 PedReader - Reading PED file ./no.problem.trio.family.list.ped with missing fields: []
INFO  19:39:02,273 PedReader - Phenotype is other? false
INFO  19:39:02,375 GenomeAnalysisEngine - Preparing for traversal
INFO  19:39:02,381 GenomeAnalysisEngine - Done preparing for traversal
INFO  19:39:02,381 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  19:39:02,382 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  19:39:02,382 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  19:39:32,388 ProgressMeter -       20:158114         0.0    30.0 s      49.6 w       86.7%    34.0 s       4.0 s
INFO  19:40:02,392 ProgressMeter -       20:228889         0.0    60.0 s      99.2 w       86.7%    69.0 s       9.0 s
INFO  19:40:32,397 ProgressMeter -       20:303259         0.0    90.0 s     148.8 w       86.7%   103.0 s      13.0 s
INFO  19:41:02,399 ProgressMeter -       20:381399         0.0   120.0 s     198.4 w       86.7%     2.3 m      18.0 s
INFO  19:41:32,401 ProgressMeter -       20:484650         0.0     2.5 m     248.0 w       86.7%     2.9 m      23.0 s
INFO  19:42:02,402 ProgressMeter -       20:578423         0.0     3.0 m     297.7 w       86.7%     3.5 m      27.0 s
INFO  19:42:32,404 ProgressMeter -       20:678727         0.0     3.5 m     347.3 w       86.7%     4.0 m      32.0 s
INFO  19:42:34,827 PhaseByTransmission - Number of complete trio-genotypes: 9920280
INFO  19:42:34,827 PhaseByTransmission - Number of trio-genotypes containing no call(s): 0
INFO  19:42:34,828 PhaseByTransmission - Number of trio-genotypes phased: 0
INFO  19:42:34,828 PhaseByTransmission - Number of resulting Het/Het/Het trios: 108196
INFO  19:42:34,828 PhaseByTransmission - Number of remaining single mendelian violations in trios: 0
INFO  19:42:34,828 PhaseByTransmission - Number of remaining double mendelian violations in trios: 0
INFO  19:42:34,828 PhaseByTransmission - Number of complete pair-genotypes: 0
INFO  19:42:34,828 PhaseByTransmission - Number of pair-genotypes containing no call(s): 0
INFO  19:42:34,829 PhaseByTransmission - Number of pair-genotypes phased: 0
INFO  19:42:34,829 PhaseByTransmission - Number of resulting Het/Het pairs: 0
INFO  19:42:34,829 PhaseByTransmission - Number of remaining mendelian violations in pairs: 0
INFO  19:42:34,829 PhaseByTransmission - Number of genotypes updated: 0
INFO  19:42:34,847 ProgressMeter -            done     10427.0     3.5 m       5.7 h       86.7%     4.1 m      32.0 s
INFO  19:42:34,848 ProgressMeter - Total runtime 212.47 secs, 3.54 min, 0.06 hours
------------------------------------------------------------------------------------------
Done.
------------------------------------------------------------------------------------------