Recent Discussions — GATK-Forum

GenotypeGVCFs: WARNING about INFO fields not parsing


The HaplotypeCaller calls at issue were generated in a complete GATK 3.6-0 / JDK 1.8 workflow as follows:
java -Xmx64G -jar $GATK_JAR -T HaplotypeCaller -ERC GVCF -R $REFGENOME -I $INPUT_FILE -o $HAPLOTYPECALLER_OUTPUT_FILE -G Standard -G AS_Standard -A HomopolymerRun

The output log is largely unremarkable, with the exception of occasional alt-allele count warnings:

INFO  11:24:33,739 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  11:24:33,742 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29 
INFO  11:24:33,743 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  11:24:33,743 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk 
INFO  11:24:33,743 HelpFormatter - [Thu Sep 15 11:24:33 CDT 2016] Executing on Linux 2.6.32-431.23.3.el6.x86_64 amd64 
INFO  11:24:33,743 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 JdkDeflater 
INFO  11:24:33,749 HelpFormatter - Program Args: [skipped] 
INFO  11:24:33,762 HelpFormatter - Executing as [redacted] on Linux 2.6.32-431.23.3.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14. 
INFO  11:24:33,763 HelpFormatter - Date/Time: 2016/09/15 11:24:33 
INFO  11:24:33,763 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  11:24:33,763 HelpFormatter - ---------------------------------------------------------------------------------- 
INFO  11:24:33,784 GenomeAnalysisEngine - Strictness is SILENT 
INFO  11:24:33,984 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 
INFO  11:24:33,994 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  11:24:34,097 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.10 
INFO  11:24:34,169 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 
INFO  11:24:34,327 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  11:24:34,690 GenomeAnalysisEngine - Done preparing for traversal 
INFO  11:24:34,691 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  11:24:34,691 ProgressMeter -                 |      processed |    time |         per 1M |           |   total | remaining 
INFO  11:24:34,692 ProgressMeter -        Location | active regions | elapsed | active regions | completed | runtime |   runtime 
INFO  11:24:34,693 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output 
INFO  11:24:34,693 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output 
WARN  11:24:34,754 AS_InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples. 
WARN  11:24:34,755 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples. 
INFO  11:24:34,960 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units 
Using un-vectorized C++ implementation of PairHMM
INFO  11:24:38,259 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file 
INFO  11:24:38,260 VectorLoglessPairHMM - Using vectorized implementation of PairHMM 
WARN  11:24:38,361 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper
[...]
INFO  18:09:21,878 VectorLoglessPairHMM - Time spent in setup for JNI call : 5.591074305 
INFO  18:09:21,878 PairHMM - Total compute time in PairHMM computeLikelihoods() : 4182.0076576070005 
INFO  18:09:21,879 HaplotypeCaller - Ran local assembly on 10816364 active regions 
INFO  18:09:21,989 ProgressMeter -            done    3.099750718E9     6.7 h            7.0 s      100.0%     6.7 h       0.0 s 
INFO  18:09:21,990 ProgressMeter - Total runtime 24287.30 secs, 404.79 min, 6.75 hours 
INFO  18:09:21,992 MicroScheduler - 15060649 reads were filtered out during the traversal out of approximately 71598790 total reads (21.03%) 
INFO  18:09:21,993 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  18:09:21,994 MicroScheduler -   -> 10614114 reads (14.82% of total) failing DuplicateReadFilter 
INFO  18:09:21,995 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
INFO  18:09:21,996 MicroScheduler -   -> 4446535 reads (6.21% of total) failing HCMappingQualityFilter 
INFO  18:09:21,997 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
INFO  18:09:21,998 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
INFO  18:09:21,999 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter 
INFO  18:09:22,000 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

I then attempted to use GenotypeGVCFs to make calls from these GVCFs:
java -Xmx64G -jar $GATK_JAR -T GenotypeGVCFs -A HomopolymerRun -R $REFGENOME -stand_call_conf 30 -stand_emit_conf 10 -V [skipped] -o [skipped]

While GenotypeGVCFs does complete, it emits a large number of warnings (the stderr log is larger than 1 GB) of the type:

WARN  11:19:19,340 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 1058.00|1542.00|0.00 doesn't parse and will not be annotated in the final VC. 
WARN  11:19:19,341 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 8,1,20,1|4,1,9,1,20,1| doesn't parse and will not be annotated in the final VC. 
WARN  11:19:19,341 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 23,2|22,1,23,2| doesn't parse and will not be annotated in the final VC. 
WARN  11:19:19,341 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 1,1|1,2|0,0 doesn't parse and will not be annotated in the final VC. 
WARN  11:19:19,342 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 20,2|33,1,36,2| doesn't parse and will not be annotated in the final VC.

I have traced these warnings to a variant called in a single sample:
chr1 13273 . G C,<NON_REF> 38.77 . AS_RAW_BaseQRankSum=20,2|33,1,36,2|;AS_RAW_MQ=1058.00|1542.00|0.00;AS_RAW_MQRankSum=23,2|22,1,23,2|;AS_RAW_ReadPosRankSum=8,1,20,1|4,1,9,1,20,1|;AS_SB_TABLE=1,1|1,2|0,0;BaseQRankSum=1.645;DP=5;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.524;RAW_MQ=2600.00;ReadPosRankSum=-0.253 GT:AD:GQ:PL:SB 0/1:2,3,0:34:67,0,34,73,43,117:1,1,1,2

But so many WARNs are emitted that I have been able to identify calls from every sample, and every possible INFO field that contains a pipe separator.

I noticed a previous thread that described a similar warning message, but it doesn't seem to fit my current issue.

ValidateVariants turns out to be more of a pain to run than I thought; Java spawns so many threads that even reserving 8 cores on my cluster still breaks the PROC hard limit... I'll try to generate some results, but I rather doubt that's the issue here.
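For reference, the next thing I plan to try is passing the allele-specific annotation groups to GenotypeGVCFs as well, on the theory that the merger only knows how to reduce the AS_RAW_* fields when the AS_Standard group is requested at genotyping time. This is just a guess on my part, not a confirmed fix:

java -Xmx64G -jar $GATK_JAR -T GenotypeGVCFs -G Standard -G AS_Standard -A HomopolymerRun -R $REFGENOME -stand_call_conf 30 -stand_emit_conf 10 -V [skipped] -o [skipped]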


piping GATK output to stdout


I want to pipe GATK output to standard output.

I am using a command like this (GATK v2.8-1-g932cd3a):
java -Xmx4g -jar GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T CombineVariants -V in1.vcf.gz -V in2.vcf.gz -o /dev/stdout

However, GATK echoes its INFO log messages to standard output, mixing in information that is not meant to end up in a VCF file.

I have also tried the following command line:
java -Xmx4g -jar GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T CombineVariants -V in1.vcf.gz -V in2.vcf.gz -log /dev/stderr -o /dev/stdout

But this only results in the INFO messages being sent to both standard output and standard error.

Is there a way to have GATK not use the standard output to communicate information to the user?

I have checked the documentation at http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_CommandLineGATK.html#--log_to_file but I don't understand how I could do this.
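One workaround I am considering, assuming the engine-level --logging_level (-l) argument behaves the same for CombineVariants: silence everything below ERROR and keep a copy of the log in a file, so that only the VCF should reach standard output:

java -Xmx4g -jar GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T CombineVariants -V in1.vcf.gz -V in2.vcf.gz -l ERROR -log gatk.log -o /dev/stdout

I have not verified that this suppresses all of the engine's stdout chatter, though.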

GATK 4 CNV Proportional Coverage for WGS : Firehose task "ERROR SparkUI: Failed to bind SparkUI"


Hi -

Today I've been trying to use the "GATK 4 CNV Proportional Coverage for WGS" (version 4) task in Firehose, copied over from Algorithms Commons. After 16 warnings of the form:

WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

there is an error:

16/09/16 14:14:39 ERROR SparkUI: Failed to bind SparkUI
java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.spark-project.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at org.spark-project.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at org.spark-project.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.spark-project.jetty.server.Server.doStart(Server.java:293)
at org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.(SparkContext.scala:481)
at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59)
at org.broadinstitute.hellbender.engine.spark.SparkContextFactory.createSparkContext(SparkContextFactory.java:152)
at org.broadinstitute.hellbender.engine.spark.SparkContextFactory.getSparkContext(SparkContextFactory.java:84)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:36)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:102)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:155)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:174)
at org.broadinstitute.hellbender.Main.instanceMain(Main.java:69)
at org.broadinstitute.hellbender.Main.main(Main.java:84)
16/09/16 14:14:39 INFO DiskBlockManager: Shutdown hook called
16/09/16 14:14:39 INFO ShutdownHookManager: Shutdown hook called
16/09/16 14:14:39 INFO ShutdownHookManager: Deleting directory /tmp/cgaadm/spark-921e83a7-f0a2-4d14-b52e-1a8a9733e16c
16/09/16 14:14:39 INFO ShutdownHookManager: Deleting directory /tmp/cgaadm/spark-921e83a7-f0a2-4d14-b52e-1a8a9733e16c/userFiles-a080c53e-7413-48ed-a3dc-1d4afcf7875b

This task ran OK in the An_REBC_dedicated workspace at the beginning of August, and it's not clear why it should fail now.

Could this be an environment problem on the nodes the task is running on?

Thanks

Chip

P.S. the top lines of the error log:

Picked up JAVA_TOOL_OPTIONS: -Xmx1g -DR_HOME=/broad/software/free/Linux/redhat_6_x86_64/pkgs/r_2.10.1
[September 16, 2016 2:14:31 PM EDT] org.broadinstitute.hellbender.tools.genome.SparkGenomeReadCounts --keepXYMT false --binsize 3000 --outputFile THCA-TCGA-DJ-A2Q8-Tumor-SM-2BWKC.pcov --reference /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta --input /seq/picard_aggregation/G32528/TCGA-DJ-A2Q8-01A-11D-A18F-08/v5/TCGA-DJ-A2Q8-01A-11D-A18F-08.bam --sparkMaster local[1] --readValidationStringency SILENT --interval_set_rule UNION --interval_padding 0 --bamPartitionSize 0 --disableSequenceDictionaryValidation false --shardedOutput false --numReducers 0 --help false --version false --verbosity INFO --QUIET false
[September 16, 2016 2:14:31 PM EDT] Executing as cgaadm@rebc-c001.broadinstitute.org on Linux 2.6.32-642.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14; Version: Version:version-unknown-SNAPSHOT
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/16 14:14:32 INFO SparkContext: Running Spark version 1.6.1
16/09/16 14:14:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/16 14:14:33 INFO SecurityManager: Changing view acls to: cgaadm
16/09/16 14:14:33 INFO SecurityManager: Changing modify acls to: cgaadm
16/09/16 14:14:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cgaadm); users with modify permissions: Set(cgaadm)
16/09/16 14:14:36 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
16/09/16 14:14:36 INFO Utils: Successfully started service 'sparkDriver' on port 38238.
16/09/16 14:14:37 INFO Slf4jLogger: Slf4jLogger started
16/09/16 14:14:37 INFO Remoting: Starting remoting
16/09/16 14:14:37 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.200.103.92:46814]
16/09/16 14:14:37 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 46814.
16/09/16 14:14:37 INFO SparkEnv: Registering MapOutputTracker
16/09/16 14:14:37 INFO SparkEnv: Registering BlockManagerMaster
16/09/16 14:14:37 INFO DiskBlockManager: Created local directory at /tmp/cgaadm/blockmgr-7b3e574a-a01d-46ac-bed4-3b85e5458a0a
16/09/16 14:14:37 INFO MemoryStore: MemoryStore started with capacity 3.8 GB
16/09/16 14:14:38 INFO SparkEnv: Registering OutputCommitCoordinator
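Update: if this is just a port collision with other Spark drivers running on the same node, one workaround I may try is disabling the SparkUI altogether. My understanding is that Spark reads spark.* JVM system properties, so something like the following (untested) should work through the same JAVA_TOOL_OPTIONS mechanism visible at the top of the log:

export JAVA_TOOL_OPTIONS="$JAVA_TOOL_OPTIONS -Dspark.ui.enabled=false"

Alternatively, -Dspark.port.maxRetries=64 should let the driver search a wider port range than the default 16 before giving up.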

What input files can I annotate with Oncotator?


Input formats supported by Oncotator

  • VCF -- as described in the VCF specification, version 4.1: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

  • MAFLITE -- maflite, a generic tab-separated values file. The following columns must be present (though an aliasing mechanism in Oncotator will automatically recognize some obvious synonyms):

    chr -- contig name

    start -- start position. For inserts, this is the base preceding the insert. For deletions, this is the first base that is removed.

    end -- end position. For inserts, this is the base immediately after the insertion. In other words, this is start + 1.

    ref_allele -- the reference allele. For insertions, this should be "-"

    alt_allele -- the alternate allele. For deletions, this should be "-"

All other columns in the maflite input will be treated as annotations. Column order does not matter.

For TCGA MAF input, use MAFLITE as the input type.


TCGA MAF files created by Oncotator

You can use an annotated TCGA MAF file generated with Oncotator as an input to Oncotator.

However, there is a caveat: if the input file is a MAF generated by Oncotator 0.5.x.x or earlier, the columns may get reordered. Additionally, if the input was generated by any earlier version of Oncotator, columns may change whether they are marked as internal (i.e. the "i_" prefix may be added or removed).

There are several reasons why you may want to do this. The most illustrative example is when you have generated a very large annotated MAF file and a new datasource is added. Rather than rerun Oncotator and re-generate the large annotated MAF file, you can use the large MAF file as input to a run of Oncotator configured with only the new datasource.

The input format should be MAFLITE and the output format should be TCGAMAF.

If you wish to reannotate an input TCGA MAF, use -i TCGAMAF. This will overwrite old values with new ones. Oncotator 1.8.x.x or above is required.
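For example, a reannotation call might look like the following (the file paths and datasource directory are placeholders; -i and -o select the input and output formats):

oncotator -i TCGAMAF -o TCGAMAF --db-dir /path/to/datasources old_annotated.maf reannotated.maf hg19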


When annotating an input TCGA MAF, if you see a DuplicateAnnotationException...

As of Oncotator 1.8.x.x, you can directly reannotate a TCGA MAF using -i TCGAMAF. This is preferable to the instructions below.

This happens when the input file and a datasource are trying to write different values for the same annotation.

You can use an annotated TCGA MAF file generated with Oncotator as an input to Oncotator, but you will need to preserve the following columns:

Chromosome, Start_position, End_position, ref_allele, alt_allele, Tumor_Sample_Barcode, Matched_Norm_Sample_Barcode, Tumor_Sample_UUID, Matched_Norm_Sample_UUID

The following cut command will extract those columns into a new MAFLITE file (note the redirection; without it, cut writes to standard output):

cut -f 5,6,7,11,13,16,17,33,34 my_maf_file.maf.annotated > my_maf_file.maflite

Additionally, if you are running on the Broad cluster, you will want to add the following option to your oncotator call:

--default_config=/xchip/cga/reference/annotation/db/tcgaMAFManualOverrides2.4.config

window size in haplotypecaller's output bam


Hi,

I am using HaplotypeCaller to re-align reads according to a set of pre-existing somatic variant calls. So far, this seems to work well; however, when I review the output bam in IGV, I notice that HC only provides 20nt of context upstream and downstream of each variant site. I am wondering how to boost this to a larger window size, say 100, so I can see more context. I saw a couple of other threads discussing a "-L" argument, but I could not quite figure it out. I have pasted my current command below.

Thanks in advance,
Mike

java -Xmx18000M -jar /opt/GenomeAnalysisTK_3.5-0-g36282e4.jar \
--analysis_type HaplotypeCaller \
--out hc_out.vcf \
--bamout hc_out.bam \
--bamWriterType ALL_POSSIBLE_HAPLOTYPES \
--standard_min_confidence_threshold_for_emitting 20 \
--standard_min_confidence_threshold_for_calling 20 \
--reference_sequence HG19.gencode.fasta \
--input_file hc_in.bam \
--dontUseSoftClippedBases \
--intervals hc_in.vcf \
--interval_padding 500 \
--genotyping_mode GENOTYPE_GIVEN_ALLELES \
--gatk_key gatk.key \
--forceActive \
--disableOptimizations \
--dbsnp germline.vcf \
--alleles hc_in.vcf
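In case it clarifies what I took away from those threads: since -L accepts an intervals file of chr:start-stop entries, my best guess is to pre-pad the variant positions myself, e.g. with ±100 bp around each site:

awk '!/^#/ {s = $2 - 100; if (s < 1) s = 1; print $1 ":" s "-" ($2 + 100)}' hc_in.vcf > hc_in.padded.intervals

and then pass --intervals hc_in.padded.intervals while keeping --alleles hc_in.vcf, but I am not sure whether the bamout window follows the intervals or the active region.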

Empty output file and providing malformed VCF file error when using GATK ContEst


Hi all,
Recently I used ContEst to estimate cross-sample contamination. First, I downloaded all the example data from the CGA website http://www.broadinstitute.org/cancer/cga/contest_download and used contest-1.0.24530-bin for a test, and it worked great!
java -jar contest-1.0.24530-bin/ContEst.jar \
-I ContEst_example_data/chr20_sites.bam \
-R human_g1k_v37.fasta \
-B:pop,vcf hg19_population_stratified_af_hapmap_3.3.vcf.gz \
-T Contamination \
-B:genotypes,vcf ContEst_example_data/hg00142.vcf \
-BTI genotypes \
-o contamination_results_chr20_1.txt

However, when I ran the example data through GATK 3.6 or 3.5, it failed with the following error:
../jdk1.8.0_91/bin/java -jar ../GenomeAnalysisTK-3.6.jar \
-T ContEst \
-R human_g1k_v37.fasta \
-I ContEst_example_data/chr20_sites.bam \
--genotypes ContEst_example_data/hg00142.vcf \
--popfile ../hg19_population_stratified_af_hapmap_3.3.vcf.gz \
-isr INTERSECTION \
-o contamination_results_chr20_2.txt

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 4: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "AC=1239;AF=0.44377;ALL={G=0.55627,T=0.44373};AN=2792;ASW={G=0.50575, T=0.49425};CEU={G=0.69091, T=0.30909};CHB={G=0.57721, T=0.42279};CHD={G=0.66514, T=0.33486};CHS={G=0.00000, T=0.00000};CLM={G=0.00000, T=0.00000};FIN={G=0.00000, T=0.00000};GBR={G=0.00000, T=0.00000};GIH={G=0.61386, T=0.38614};IBS={G=0.00000, T=0.00000};JPT={G=0.57080, T=0.42920};LWK={G=0.45413, T=0.54587};MKK={G=0.47826, T=0.52174};MXL={G=0.53488, T=0.46512};PUR={G=0.00000, T=0.00000};TSI={G=0.63725, T=0.36275};YRI={G=0.45320, T=0.54680};set=Intersection GT",forinput source: /pub6/Temp/liaojianlong/contamination_test1/../hg19_population_stratified_af_hapmap_3.3.vcf.gz

Following the error message, I eliminated the whitespace in the INFO field using R and tested again, but got another error:

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 4: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "AC=1239;AF=0.44377;ALL={G=0.55627,T=0.44373};AN=2792;ASW={G=0.50575,T=0.49425};CEU={G=0.69091,T=0.30909};CHB={G=0.57721,T=0.42279};CHD={G=0.66514,T=0.33486};CHS={G=0.00000,T=0.00000};CLM={G=0.00000,T=0.00000};FIN={G=0.00000,T=0.00000};GBR={G=0.00000,T=0.00000};GIH={G=0.61386,T=0.38614};IBS={G=0.00000,T=0.00000};JPT={G=0.57080,T=0.42920};LWK={G=0.45413,T=0.54587};MKK={G=0.47826,T=0.52174};MXL={G=0.53488,T=0.46512};PUR={G=0.00000,T=0.00", for input source: /pub6/Temp/liaojianlong/contamination_test1/../population_files/hg19_population_stratified_af_hapmap_3.3.vcf.gz
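For what it's worth, the whitespace removal can also be done in one pass on the command line instead of in R, assuming the only offending whitespace is the space after the commas inside the INFO field (the file must be re-bgzipped and re-indexed afterwards, which is easy to forget):

zcat hg19_population_stratified_af_hapmap_3.3.vcf.gz | sed 's/, /,/g' | bgzip > popfile.fixed.vcf.gz
tabix -p vcf popfile.fixed.vcf.gz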

On the other hand, I tested GATK ContEst in another mode, but got an output file that was empty apart from the header.
../jdk1.8.0_91/bin/java -jar ../GenomeAnalysisTK-3.6.jar \
-T ContEst \
-R ../reference_genome/hg19_complete.fasta \
-I:eval G01H_chr22.recal.bam \
-I:genotype G01N_chr22.recal.bam \
--popfile hg19_population_stratified_af_hapmap_3.3.vcf.gz \
-isr INTERSECTION \
-o contamination_output.txt
The process ran to completion with no errors in the log, and yet contamination_output.txt contained nothing besides the header. (Screenshots omitted.)

Thank you very much for any recommendation!

How should I pre-process data from multiplexed sequencing and multi-library designs?


Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.

However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing. This typically results in multiple sets of FASTQ files per sample, all of which should have distinct read group IDs (RGID).

At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping, sorting and marking duplicates) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done by re-running Mark Duplicates, this time on all read group BAM files for a sample at the same time. Then we run Indel Realignment and Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below and this presentation for more details.

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Example

Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two sample libraries, sampleA and sampleB, which were each sequenced on two lanes, lane1 and lane2:

  • sampleA_lane1.fq

  • sampleA_lane2.fq

  • sampleB_lane1.fq
  • sampleB_lane2.fq

These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (eg sampleA_lib1_lane1.fq leading to read group A11, sampleA_lib2_lane1.fq leading to read group A21 and so on).

1. Run initial steps per-readgroup once

Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a read group), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the read groups dictionary entry for guidance.
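For instance, with BWA-MEM the read group can be assigned at mapping time. A minimal sketch using the example data above (the reference path and the library/platform values are assumptions; -p tells bwa the FASTQ is interleaved):

bwa mem -p -R '@RG\tID:A1\tSM:sampleA\tLB:libA\tPU:lane1\tPL:ILLUMINA' ref.fasta sampleA_lane1.fq | samtools sort -o sampleA_rgA1.bam -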

The example data becomes:

  • sampleA_rgA1.bam

  • sampleA_rgA2.bam

  • sampleB_rgB1.bam
  • sampleB_rgB2.bam

At this point we mark duplicates in each read group BAM file (dedup), which allows us to estimate the complexity of the corresponding library of origin as a quality control step. This step is optional.

The example data becomes:

  • sampleA_rgA1.dedup.bam

  • sampleA_rgA2.dedup.bam

  • sampleB_rgB1.dedup.bam
  • sampleB_rgB2.dedup.bam

Technically this first run of marking duplicates is not necessary because we will run it again per-sample, and that per-sample marking would be enough to achieve the desired result. To reiterate, we only do this round of marking duplicates for QC purposes.

2. Merge read groups and mark duplicates per sample (aggregation + dedup)

Once you have pre-processed each read group individually, you merge the read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, but for the sake of efficiency we combine it with the duplicate marking step: it's simply a matter of passing the multiple read group BAMs to MarkDuplicates in a single command, as sketched below.
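A minimal sketch with Picard, using the example data above (the jar path is a placeholder); passing both read group BAMs to one MarkDuplicates command performs the merge and the per-sample duplicate marking in a single step:

java -jar picard.jar MarkDuplicates I=sampleA_rgA1.dedup.bam I=sampleA_rgA2.dedup.bam O=sampleA.merged.dedup.bam M=sampleA.dedup_metrics.txt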

The example data becomes:

  • sampleA.merged.dedup.bam

  • sampleB.merged.dedup.bam

To be clear, this is the round of marking duplicates that matters. It eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).

3. Remaining per-sample pre-processing

Then you run indel realignment (optional) and base recalibration (BQSR).

The example data becomes:

  • sampleA.merged.dedup.(realn).recal.bam

  • sampleB.merged.dedup.(realn).recal.bam

Realigning around indels per-sample leads to consistent alignments across all lanes within a sample. This step is only necessary if you will be using a locus-based variant caller like MuTect 1 or UnifiedGenotyper (for legacy reasons). If you will be using HaplotypeCaller or MuTect2, you do not need to perform indel realignment.

Base recalibration will be applied per-read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is Illumina HiSeq or similar technology).

People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.

Genotyping VCF with HC


Hi,

I am using Haplotype Caller to genotype a VCF file. I am using this exact command line:

java -Xmx20g -jar GATK/3.6-0-g89b7209/GenomeAnalysisTK.jar \
-T HaplotypeCaller -R $indexFasta -I $Bam -o output.snps.indels.g.vcf \
-dt NONE --genotyping_mode GENOTYPE_GIVEN_ALLELES \
--alleles $vcffile

For most cases it genotypes everything fine, but for one sample it doesn't genotype 2 indel calls that are present in the "--alleles" VCF file.

These are the 2 variants it is missing:

CHROM POS ID REF ALT QUAL FILTER INFO

13 28608262 . T TTCATATTCTCTGAAATCCTGA 343 PASS DP=2682;AF=0.006711;SB=0;DP4=1317,1363,9,9;INDEL;HRUN=2
13 28608265 . A ATATTCTCTGACTTCG 6470 PASS DP=2712;AF=0.082227;SB=4;DP4=1220,1288,116,107;INDEL;HRUN=1

They are both close to each other, so I am not sure whether it cannot genotype them because these 2 indels are in such close proximity and overlap each other. Do you have any suggestions on how this can be fixed?

Thank you.
abolia


Filtering of heterozygotes only


Hi!

I need help, please!
I'm working on lovebirds and trying to identify SNPs that can be included in a parentage verification panel. The reference genome is the offspring, and I have mapped its parents' reads to the reference to identify SNPs. I want to identify only those SNPs where the mother and father are both heterozygous, which will imply that all four of the grandparents also had a polymorphism at that site.

I did hard filtering using the following parameters. First, as the Best Practices guidelines suggest:

QD < 2 || FS > 60 || MQ < 40 || MQRankSum < -12.5 || ReadPosRankSum < -8.0

And then, to filter in the heterozygotes:

QD > 2 || FS < 10 || MQ > 50 || MQRankSum > -5.1 || ReadPosRankSum < -8.0

The mother is more heterozygous than the father and I get around (raw) 1.9mil SNPs for her vs 1.2mil for the father. After filtering, there is of course much less.

I then combined the genotypes of the two parents and repeated the process.

The results I get, for both sets of filtering parameters and for the combined and separate genotypes, are not bad, but I wish to keep only those SNPs where both the mother and father are heterozygous. I've checked the results in IGV, and it seems that only about 1 in every 10-20 SNPs that passed filtering complies with this. However, I cannot see any difference in parameters or quality or anything that would let me filter these further. I went through them manually and selected the ones I wanted, but there were no significant similarities in this subset that would let me separate them from the rest.

So my questions are:
1. Is there any way to select only those SNPs that are heterozygous in all individuals, other than going through them manually? (See my attempt sketched below.)
2. Some of the SNPs with the highest quality are heterozygous, but fewer than 20% of the reads carry the alternative allele. Can I select these, or should I go for lower quality but a higher percentage of alternative-allele reads (e.g. 50%)?
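For question 1: I came across JEXL-based selection with SelectVariants, which, if I understand the docs, should be able to keep only sites where both parents are heterozygous in one pass. The sample names and file paths here are placeholders, and I have not confirmed this works:

java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fasta -V parents.vcf -select 'vc.getGenotype("mother").isHet() && vc.getGenotype("father").isHet()' -o het_in_both.vcf

Is this the right approach?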

Thanks a lot!
Henriette

Why do MQRankSum and ReadPosRankSum not appear in some vcf file entries?


Hi all,

I've generated a vcf file of SNPs having used HaplotypeCaller, CombineGVCFs, GenotypeGVCFs and SelectVariants. However, when I try to extract annotations from each line in the vcf, I find I am losing roughly 40% of my variants because they do not have MQRankSum or ReadPosRankSum annotations. I've read that these annotations are not generated for homozygous sites, but - as I have restricted my output so far to variants only - this cannot be the problem. I've also checked and it's not to do with sites being multi-allelic as the following line shows (please note, it's been reduced to cut down the number of 0/0 samples included but every genotype listed is from this entry):

 NC_027944.1     74      .       A       G       25631.51        .       AC=2;AF=3.584e-03;AN=558;DP=51760;ExcessHet=0.0039;FS=0.000;InbreedingCoeff=1.0000;MLEAC=2;MLEAF=3.584e-03;MQ=60.00;QD=30.63;SOR=0.918  GT:AD:DP:GQ:PL  0/0:118,0:118:99:0,120,1800     0/0:141,0:141:99:0,120,1800     0/0:171,0:171:99:0,120,1800     0/0:116,0:116:99:0,120,1800     0/0:106,0:106:99:0,120,1800     0/0:145,0:145:99:0,120,1800     0/0:89,0:89:99:0,120,1800       0/0:101,0:101:99:0,120,1800     0/0:114,0:114:99:0,120,1800     0/0:120,0:120:99:0,120,1800     0/0:175,0:175:99:0,120,1800      1/1:0,724:724:99:25705,2168,0     0/0:243,0:243:99:0,120,1800     0/0:110,0:110:99:0,120,1800     0/0:149,0:149:99:0,120,1800     0/0:120,0:120:99:0,120,1800     0/0:167,0:167:99:0,120,1800

I was hoping someone might be able to tell me what I'm missing here, because I cannot work it out from the available information pages.

Also, how much of a dealbreaker is not having these values? I was originally omitting any site that was missing even one of the annotations I was filtering on, but I wonder if that's too overzealous in the case of the rank-sum test annotations?

Any and all advice would be greatly received,

Many thanks,

Ian

GATK 3.6 GenotypeGVCFs Failure: java.util.zip.DataFormatException: invalid distances set


I am attempting to genotype a set of gVCFs, which I have bgzipped and indexed with tabix. This worked up to the point of receiving the seemingly odd error message:

ERROR ------------------------------------------------------------------------------------------
ERROR A BAM/CRAM ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This means that there is something wrong with the BAM/CRAM file(s) you provided.
ERROR The error message below tells you what is the problem.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum until you have followed these instructions:
ERROR - Make sure that your BAM file is well-formed by running Picard's validator on it
ERROR (see http://picard.sourceforge.net/command-line-overview.shtml#ValidateSamFile for details)
ERROR - Ensure that your BAM index is not corrupted: delete the current one and regenerate it with 'samtools index'
ERROR - Ensure that your CRAM index is not corrupted: delete the current one and regenerate it with
ERROR 'java -jar cramtools-3.0.jar index --bam-style-index --input-file --reference-fasta-file '
ERROR (see https://github.com/enasequence/cramtools/tree/v3.0 for details)
ERROR
ERROR MESSAGE: java.util.zip.DataFormatException: invalid distances set
ERROR ------------------------------------------------------------------------------------------

As I said, I am dealing with gVCFs here, not BAMs or CRAMs. I did delete all my indexes on the gVCFs, reindexed, and got the same error message. Any suggestions or ideas as to what is occurring here?
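For what it's worth, since "invalid distances set" is a low-level zlib error, my next step is to test each compressed stream for corruption directly (bgzip output is valid gzip, so gzip -t should flag any truncated or damaged file):

for f in *.g.vcf.gz; do gzip -t "$f" || echo "corrupt: $f"; done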

ContEst for WGS data


I am running ContEst on in silico diluted whole-genome sequencing data to detect the contamination levels.

The predicted contamination values are correlated with the expected dilution levels, but they are consistently off by 10%. Any reason why this happens?

I am not even sure where to look to troubleshoot this.

about HaplotypeCaller


Dear all,

I would appreciate advice on a simple question, please: I've been running HaplotypeCaller with -stand_call_conf 30 (the default), while some of our collaborators used -stand_call_conf 20. In order to get calls at -stand_call_conf 20, would I need to re-run HaplotypeCaller, or can I just filter the VCF files? Thank you,

-- bogdan

Obtaining phased haplotype info for individuals from the 1000 genome project


Hi,

I am trying to obtain the phased haplotypes of a very specific region of a human gene for 3 individuals who participated in the 1000 Genomes Project. I used the Data Slicer tool to download the VCF file for just the CDS of the gene. However, the file was missing info for all 7 of the important SNPs in the CDS that contribute to the phenotype. These are SNPs that are found in dbSNP and SNPedia. I would greatly appreciate any help.
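In case the Data Slicer itself is the problem, I am also considering pulling the region straight from the release files with tabix and subsetting to my three individuals with bcftools; the URL, region, and sample IDs below are placeholders:

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/<path-to-release>/ALL.chr7.<release>.vcf.gz 7:<start>-<end> | bcftools view -s SAMPLE1,SAMPLE2,SAMPLE3 - > region.vcf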

Thanks!
Sunita

Further information about MuTect2 filters (clustered_events, homologous_mapping_event etc.)


Dear GATK developers,

We are using MuTect2 for variant calling, and we have noticed that several variants fail to pass filters such as clustered_events, homologous_mapping_event, str_contraction and t_lod_fstar. We have reason to believe that at least some of these variants are incorrectly filtered out by MuTect2. For example, several variants positioned very close to each other have been filtered out as clustered_events, although they have previously been verified and confirmed using Sanger sequencing.

I am aware that these filters are briefly described in the VCF file header (e.g. ##FILTER=<ID=clustered_events,Description="Clustered events observed in the tumor">). However, I would like to ask whether there are any better definitions and further information about these filters and whether there are ways of manipulating or disabling them.

Thank you for your help.


Reversions: an Algorithm issue


Since this is an algorithm question that covers both versions of MuTect, I'd rather raise it here.

I have noticed that the variant calling model in MuTect seems to require AF[TUMOR] > AF[NORMAL] for a variant to be called. This implies that a reversion/back mutation, i.e. AF[TUMOR] < AF[NORMAL], will not be called. Is there any rationale for this?

how to build/get populationAlleleFrequencies.vcf for ContEst


How do I build or obtain populationAlleleFrequencies.vcf for ContEst?

And which field of the VCF is required? The AF field? The MAF field?

I found no guide for this. I tried a 1000G VCF file, but it failed with the following error.

java -jar /usr/hpc-bio/gatk/GATK.jar -T ContEst -l WARN -R /usr/bio-ref/GRCh38.83/GRCh38.dna.fa -pf /usr/bio-ref/GRCh38.83/1000G.vcf -isr INTERSECTION -I:eval /biowrk/bam.bqsr.pair/Project_14686/Sample_100T/bqsr.tumor.bam -I:genotype /biowrk/bam.bqsr.pair/Project_14686/Sample_100T/bqsr.normal.bam -L /usr/bio-ref/GRCh38.83/S04380110_Covered.intervals -o output.txt

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException
at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.calcStats(ContEst.java:625)
at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:400)
at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:127)

Somatic mutations based on sliced bam


Hi,

I would like to retrieve simple somatic mutations (using MuTect2) from a BAM file. However, I am interested in mutations in specific areas of the genome only (for example, specific chromosomes). Can I call the mutations on a sliced BAM file, or should I first run the tool on the full BAM to generate the somatic mutations file, and then slice that?
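For what it's worth, the route I am leaning towards is skipping the slicing entirely and restricting the caller with -L, which, as far as I understand, MuTect2 in GATK 3.x supports like any other walker (the file names below are placeholders):

java -jar GenomeAnalysisTK.jar -T MuTect2 -R ref.fasta -I:tumor tumor.bam -I:normal normal.bam -L 17 -o somatic.chr17.vcf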

Thanks, Michal.

StrandBiasbySample, FisherStrand Annotation


Hi, I am using GATK version 3.2-2 to analyze MiSeq data from a human SNP panel, aligned to its "own" reference. I use UnifiedGenotyper to call all desired SNPs (ref or non-ref variants) from the panel, and it works very well. I would like to know the forward and reverse read counts for each allele. I have used the FisherStrand values, but they are all 0.00; does that mean there is no strand bias? I assume the SB and StrandBiasBySample annotations are not used anymore? Is there any other way I can get forward and reverse read counts without having to walk through the BAM file?
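P.S. One thing I have been meaning to try, on the assumption that StrandBiasBySample is available as an optional annotation in 3.2-2, is requesting it explicitly; it should add a per-sample SB field with forward/reverse counts for each allele (the file names below are placeholders):

java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R panel_ref.fasta -I sample.bam -A StrandBiasBySample -o calls.vcf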

about ContEst


Dear Sheila,

I would appreciate a suggestion on using ContEst: how should I modify the command line in order to change the population (from CEU to any other population or, better, to include ALL the populations from the HapMap file)?

the current command line is :

$GATK \
-T ContEst \
-R $REFERENCE \
-I:eval $TUMOR_MD \
-I:genotype $NORMAL_MD \
-L $CHR \
--popfile $POPFILE \
-isr INTERSECTION \
-o "vcf.check-CONTAMINATION.${TUMOR_MD%.bam}vs${NORMAL_MD%.bam}on${CHR}.analysis-ContEST.txt" \
--disable_auto_index_creation_and_locking_when_reading_rods

where the POPFILE is "hg19_population_stratified_af_hapmap_3.3.vcf.with-chr.converted-to-hg38".
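For reference, the ContEst documentation appears to mention a --population argument (defaulting to CEU); if it accepts the population codes present in the popfile, then perhaps adding the following line to the command above would do it, though I have not verified this:

--population ALL \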

thank you !
