Channel: Recent Discussions — GATK-Forum

Upcoming workshops: June-July-September 2018


We've had a very active workshop season so far, and just because it's almost summer doesn't mean we're slowing down. Later this month we'll be at the GCC/BOSC conference in Portland, OR, teaching a 2.5 hr GATK4 workshop, as well as assisting colleagues who are teaching a WDL pipelining workshop. There's still some space open so register now if you'd like to join us!

In July we'll be in Cambridge, UK to teach our now-classic 4-day workshop; it's fully booked at this point but there's a waitlist you can add yourself to here. Even if you don't get in, it tells us how many people would have liked to attend but couldn't, and that helps us determine how many more workshops we need to organize and where.

In September we'll be teaching the same 4-day workshop formula in Seville, Spain, augmented with a 5th day on variant interpretation taught by the host institution. Registration for this workshop just opened here.

As always, there will be more -- and if you're interested in hosting us at your institution, just let me know in the comments or over private message.


Using NIO with GATK4 HaplotypeCaller


Is GATK4 HaplotypeCaller NIO compatible? If not, is there another version that is?
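For context, GATK4 engine tools are generally able to read inputs through Java NIO paths such as gs:// URLs. A hypothetical invocation is sketched below; the bucket and file names are placeholders, and whether every input type (e.g. the reference) can be streamed this way depends on the tool and version:

gatk HaplotypeCaller \
    -R /ref/Homo_sapiens_assembly38.fasta \
    -I gs://my-bucket/bams/sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF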

Thanks!

Infinity RGQ and no ReadPosRankSum


Hi,

I have the following two questions; it would be great if you could help me with them:

  1. What are non-variant sites for which the genotype quality is "Infinity"? What does that mean?
    NW_009243200.1 47644 . G . Infinity . AN=2;ClippingRankSum=0.00;DP=57;ExcessHet=3.01 GT:DP:RGQ 0/0:57:99
    NW_009243203.1 24000 . G . Infinity . AN=2;DP=46;ExcessHet=3.01 GT:DP:RGQ 0/0:42:99

  2. Why do some sites not have the "ReadPosRankSum" field?
    NW_009243187.1 1403 . A T 719.03 . AC=2;AF=1.00;AN=2;DP=32;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=44.82;QD=29.00;SOR=1.022 GT:AD:DP:GQ:PL 1/1:0,19:19:56:733,56,0
    NW_009243191.1 855 . T C 241.60 . AC=1;AF=0.500;AN=2;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=27.64;SOR=1.329 GT:AD:DP:GQ:PGT:PID:PL 0/1:0,6:6:59:0|1:843_C_A:249,0,59

Thanks,
Homa

Picard CollectSequencingArtifactMetrics and CollectOxoGMetrics return all zero stats for iontorrent


Hi,
I'm trying to do some Picard QC on Ion Torrent BAM files. The results tables turned out to be all "0" or "?" without any errors in the log files.
Below is an example of the command syntax and log output.

java -Xmx16g -jar /DCEG/Resources/Tools/Picard/Picard-2.10.10/picard.jar CollectOxoGMetrics I=/DCEG/Projects/Exome/Followup/NP0318-AP1_Mirabello_TP53_MSKCCosteosarcoma/build1/BAM/OS_SC-1_A_rawlib.bam O=oxoG_metrics.txt R=/CGF/Sequencing/IonTorrent/PGM_Primary_Data/referenceLibrary/tmap-f3/hg19/hg19.fasta VALIDATION_STRINGENCY=LENIENT

09:37:29.061 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/nfs/gigantor/ifs/DCEG/Resources/Tools/Picard/Picard-2.10.10/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Jun 05 09:37:29 EDT 2018] CollectOxoGMetrics INPUT=/DCEG/Projects/Exome/Followup/NP0318-AP1_Mirabello_TP53_MSKCCosteosarcoma/build1/BAM/OS_SC-1_A_rawlib.bam OUTPUT=oxoG_metrics.txt VALIDATION_STRINGENCY=LENIENT REFERENCE_SEQUENCE=/CGF/Sequencing/IonTorrent/PGM_Primary_Data/referenceLibrary/tmap-f3/hg19/hg19.fasta MINIMUM_QUALITY_SCORE=20 MINIMUM_MAPPING_QUALITY=30 MINIMUM_INSERT_SIZE=60 MAXIMUM_INSERT_SIZE=600 INCLUDE_NON_PF_READS=true USE_OQ=true CONTEXT_SIZE=1 STOP_AFTER=2147483647 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue Jun 05 09:37:29 EDT 2018] Executing as luow2@cgemsIII on Linux 2.6.32-696.6.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_141-b16; Deflater: Intel; Inflater: Intel; Picard version: 2.10.10-SNAPSHOT
INFO 2018-06-05 09:37:29 CollectOxoGMetrics Generated 16 context strings.
INFO 2018-06-05 09:37:29 CollectOxoGMetrics Loading dbSNP File: null
INFO 2018-06-05 09:37:29 CollectOxoGMetrics Starting iteration.
[Tue Jun 05 09:37:31 EDT 2018] picard.analysis.CollectOxoGMetrics done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2058354688

Does this mean something is wrong with my BAM format, or that BAM files generated by tmap are not applicable to these metrics?

Thanks,
Wen

Appropriate WGS Mutect normal_panel

How to keep unique sample ID when combining gvcf files?


Hello,
I am working with RNA-seq data, and I need to get SNP calls for multiple samples (12). I first tried following the Best Practices method with HaplotypeCaller and later merging my VCF files. However, I realized that when I do this, any site that is not a variant in all of my samples will be marked as missing data for the non-variant samples. This is a problem because I need to know which of these samples are actually missing data and which of them match the reference. I don't think the gVCF mode of HaplotypeCaller is completely supported for RNA-seq yet, but a paper doing similar work to mine has used it and it seemed to work well for them. Because of this, I gave it a try, but I keep running into the same problem: when I combine my .g.vcf files, all of my samples get merged together. I need to make a combined VCF file in which all of my sample IDs remain unique. Is there a way to do this? Thank you very much for your help, and I'm sorry if this has been asked before; I have done a lot of searching but can't seem to find this question.
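For reference, a minimal sketch of the GVCF-based joint-genotyping workflow described above, using GATK4-style commands with placeholder file names:

# one GVCF per sample
gatk HaplotypeCaller -R ref.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF
# combine all per-sample GVCFs (add one -V per sample, 12 in this case)
gatk CombineGVCFs -R ref.fasta -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O combined.g.vcf.gz
# joint genotyping; each sample keeps its own column in the output VCF
gatk GenotypeGVCFs -R ref.fasta -V combined.g.vcf.gz -O joint_calls.vcf.gz

In this workflow the sample names in the final VCF columns come from the SM tags in the BAM read groups, so they stay distinct as long as each sample was processed with its own SM value.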

Best practices for ContEst for WGS


Hi,

What are the best practices for ContEst as far as the population file and interval list parameters go? I would imagine that interval list should be irrelevant when looking across the whole genome, but this seems to be a required argument.

I'm currently using /xchip/cga/reference/hg19/hg19_population_stratified_af_hapmap_3.3.vcf for the population file -- is that appropriate for WGS?

Thanks for the advice,

Eric

When should I use -L to pass in a list of intervals?


The -L argument (short for --intervals) enables you to restrict your analysis to specific intervals instead of running over the whole genome. Using this argument can have important consequences for performance and/or results. Here, we present some guidelines for using it appropriately depending on your experimental design.

In a nutshell, if you’re doing:

- Whole genome analysis: intervals are not required but they can help speed up analysis
- Whole exome analysis: you must provide the list of capture targets (typically genes/exons)
- Small targeted experiment: you must provide the targeted interval(s)
- Troubleshooting: you can run on a specific interval to test parameters or create a data snippet

Important notes:

Whatever you end up using -L for, keep this in mind: for tools that output a bam or VCF file, the output file will only contain data from the intervals specified by the -L argument. To be clear, we do not recommend using -L with tools that output a bam file since doing so will omit some data from the output.

Example Use of -L:

  • -L 20 for chromosome 20 in the b37/b36 builds

  • -L chr20:1-100 for chromosome 20 positions 1-100 in hg18/hg19 build

  • -L intervals.list (or intervals.interval_list, or intervals.bed) where the value passed to the argument is a text file containing intervals

  • -L some_variant_calls.vcf where the value passed to the argument is a VCF file containing variant records; their genomic coordinates will be used as intervals.

Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.

- For example, HLA-A*01:01:01:01 is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 from prior assemblies. This has implications for using the -L option of GATK as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.
- When defining coordinates of interest for a contig, e.g. positions 1-100 for chr1, we would use -L chr1:1-100. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100.
- However, when passing in an entire contig, for contigs with colons in the name, you must add :1+ to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.

-L HLA-A*01:01:01:01:1+
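For instance, a hypothetical GATK3-style command restricted to that entire contig could look like the line below (placeholder file names); note the quotes, which keep the shell from interpreting the asterisk:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Homo_sapiens_assembly38.fasta -I sample.bam -L "HLA-A*01:01:01:01:1+" -o sample.HLA-A.vcf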

So here’s a little more detail for each experimental design type.

Whole genome analysis

It is not necessary to use an intervals list in whole genome analysis -- presumably you're interested in the whole genome!

However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere) where you know the data is not reliable or is very messy, causing excessive slowdowns. You can do this by providing a list of "good" intervals with -L, or you could also provide a list of "bad" intervals with -XL, which does the exact opposite of -L: it excludes the provided intervals. We share the whole-genome interval lists (of good intervals) that we use in our production pipelines, in our resource bundle (see Download page).
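As an illustrative sketch only (placeholder file names), excluding a list of known-bad regions in a GATK3-style invocation could look like:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R human_g1k_v37.fasta \
    -I sample.bam \
    -XL blacklisted_regions.interval_list \
    -o sample.vcf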

Whole exome analysis

By definition, exome sequencing data doesn’t cover the entire genome, so many analyses can be restricted to just the capture targets (genes or exons) to save processing time. There are even some analyses which should be restricted to the capture targets because failing to do so can lead to suboptimal results.

Note that we recommend adding some “padding” to the intervals in order to include the flanking regions (typically ~100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use -L.
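For example, a hypothetical GATK3-style HaplotypeCaller command restricted to padded exome targets might look like this (the target list and other file names are placeholders):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R human_g1k_v37.fasta \
    -I exome_sample.bam \
    -L exome_targets.interval_list \
    -ip 100 \
    -o exome_sample.vcf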

Below is a step-by-step breakdown of the Best Practices workflow, with a detailed explanation of why -L should or shouldn’t be used with each tool.

Tool / use -L? / why or why not:

- BaseRecalibrator: YES. This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration.
- PrintReads: NO. Output is a bam file; using -L would lead to lost data.
- UnifiedGenotyper/HaplotypeCaller: YES. We’re only interested in making calls in exome regions; the rest is a waste of time & includes lots of false positives.
- Next steps: NO. No need, since subsequent steps operate on the callset, which was restricted to the exome at the calling step.

Small targeted experiments

The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

Debugging / troubleshooting

You can use -L a lot while troubleshooting! For example, you can just provide an interval at the command line, and the output file will contain only the data from that interval. This is really useful when you’re trying to figure out what’s going on in a specific interval (e.g. why HaplotypeCaller is not calling your favorite indel) or what would be the effect of changing a parameter (e.g. what happens to your indel call if you increase the value of -minPruning). This is also what you’d use to generate a file snippet to send us as part of a bug report (except that never happens because GATK has no bugs, ever).
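For instance, a minimal sketch of generating such a snippet with GATK3-style PrintReads, with placeholder coordinates and file names:

java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R human_g1k_v37.fasta \
    -I sample.bam \
    -L 20:10000000-10010000 \
    -o snippet.bam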


VariantEval on non-vcf files


Hello GATK team,

I am learning to use VariantEval, but I realize that my input files are not standard VCF files. I compared somatic variant calling results from two programs and want to evaluate the overlap. Simply put, my overlap file only contains four columns: chr, pos, genotype of the control sample, and genotype of the tumor sample. The genotypes are two letters -- I guess I can change that into only one letter to record the alternate allele, but I still do not have a full VCF table.

I could use the position information and extract lines from one of the software output. For example, one software is strelka, and its output looks like:

chr1 4159398 . C T . PASS NT=ref;QSS=21;QSS_NT=21;SGT=CC->CT;SOMATIC;TQSS=1;TQSS_NT=1 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 30:0:0:0:0,0:30,31:0,0:0,0 24:0:0:0:0,0:20,20:0,0:4,4

The somatic change was highlighted in bold in the original post. Do you think this is an appropriate input format?

Any advice is appreciated. Thank you.

Helen

VariantRecalibrator (VQSR) without any training sets? (Simulated data)

$
0
0

How can I run VQSR on simulated variant data? In this dataset variants are randomly distributed along the genome, so I cannot use real truth VCF files. I have tried using the golden.vcf that the simulator produces, but it says that it also needs annotation files, so what can I do?

Thank you

ApplyRecalibration error, during snps recalibration


GATK 4.0.3.0

Hi,

I have a problem with ApplyRecalibration:

A USER ERROR has occurred: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC Unknown @ chr1:12899 Q75.51 of type=SNP alleles=[A*, C] attr={AC=2, AF=0.040, AN=50, DP=66, ExcessHet=0.0445, FS=0.000, InbreedingCoeff=0.1386, MLEAC=3, MLEAF=0.060, MQ=21.65, QD=25.17, SOR=2.833} GT=[] filters=

Here is my pipeline:

1A) "Hard Filtering"

input="0.raw.vcf"
output="1.HF_raw.vcf"

${ph6} --java-options -Xmx25g VariantFiltration --filter-expression 'ExcessHet > 54.69' --filter-name 'ExcessHet' -V ${input} -O ${output}

1A) "Make Sites Only Vcf"

input="1.HF_raw.vcf"
output="2.HFso_raw.vcf"

${ph6} --java-options -Xmx25g MakeSitesOnlyVcf -I ${input} -O ${output}

input="2.HFso_raw.vcf"
output="HFso_indel.recal.vcf"
tranches="indels.IVR.tranches"

2A) "Indels Variant Recalibrator"

${ph6} --java-options -Xmx100g VariantRecalibrator -V ${input} -O ${output} --tranches-file ${tranches} -trust-all-polymorphic -tranche "100.0" -tranche "99.95" -tranche "99.9" -tranche "99.5" -tranche "99.0" -tranche "97.0" -tranche "96.0" -tranche "95.0" -tranche "94.0" -tranche "93.5" -tranche "93.0" -tranche "92.0" -tranche "91.0" -tranche "90.0" -an "FS" -an "ReadPosRankSum" -an "MQRankSum" -an "QD" -an "SOR" -an "DP" -mode INDEL --max-gaussians 4 -resource mills,known=false,training=true,truth=true,prior=12:${sorgsorg}/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -resource axiomPoly,known=false,training=true,truth=false,prior=10:${sorgsorg}/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=2:${sorgsorg}/Homo_sapiens_assembly38.dbsnp138.vcf

2B) "SNPs Variant Recalibrator Create Model"

input="2.HFso_raw.vcf"
output="HFso_snp.recal.vcf"
tranches="snps.SVRM.tranches"
model="snps_model.SVRM.report"

${ph6} --java-options -Xmx125g VariantRecalibrator -V ${input} -O ${output} --tranches-file ${tranches} -trust-all-polymorphic -tranche "100.0" -tranche "99.95" -tranche "99.9" -tranche "99.8" -tranche "99.6" -tranche "99.5" -tranche "99.4" -tranche "99.3" -tranche "99.0" -tranche "98.0" -tranche "97.0" -tranche "90.0" -an "QD" -an "MQRankSum" -an "ReadPosRankSum" -an "FS" -an "MQ" -an "SOR" -an "DP" -mode SNP -sample-every "10" --output-model ${model} --max-gaussians 6 -resource hapmap,known=false,training=true,truth=true,prior=15:${sorgsorg}/hapmap_3.3.hg38.vcf.gz -resource omni,known=false,training=true,truth=true,prior=12:${sorgsorg}/1000G_omni2.5.hg38.vcf.gz -resource 1000G,known=false,training=true,truth=false,prior=10:${sorgsorg}/1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=7:${sorgsorg}/Homo_sapiens_assembly38.dbsnp138.vcf

2C) "SNPsVariantRecalibrator"

input="2.HFso_raw.vcf"
output="provaoutput.vcf"
tranches="snps.SVR.tranches"
model="snps_model.SVR.report"

${ph6} --java-options -Xmx50g VariantRecalibrator -V ${input} -O ${output} --tranches-file ${tranches} -trust-all-polymorphic -tranche "100.0" -tranche "99.95" -tranche "99.9" -tranche "99.8" -tranche "99.6" -tranche "99.5" -tranche "99.4" -tranche "99.3" -tranche "99.0" -tranche "98.0" -tranche "97.0" -tranche "90.0" -an "QD" -an "MQRankSum" -an "ReadPosRankSum" -an "FS" -an "MQ" -an "SOR" -an "DP" -mode SNP --input-model ${model} --max-gaussians 6 -resource hapmap,known=false,training=true,truth=true,prior=15:${sorgsorg}/hapmap_3.3.hg38.vcf.gz -resource omni,known=false,training=true,truth=true,prior=12:${sorgsorg}/1000G_omni2.5.hg38.vcf.gz -resource 1000G,known=false,training=true,truth=false,prior=10:${sorgsorg}/1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=7:${sorgsorg}/Homo_sapiens_assembly38.dbsnp138.vcf

2D)"ApplyRecalibration"

input="2.HFso_raw.vcf"
outin="tmp.indel.recalibrated.vcf"
output="2.HFso_recalibrated.vcf"
indelrecal="HFso_indel.recal.vcf"
snprecal="HFso_snp.recal.vcf"
indeltranches="indels.IVR.tranches"
snptranches="snps.SVR.tranches"

${ph6} --java-options -Xmx50g ApplyVQSR -O ${outin} -V ${input} --recal-file ${indelrecal} -tranches-file ${indeltranches} -truth-sensitivity-filter-level "99.7" --create-output-variant-index true -mode INDEL

${ph6} --java-options -Xmx50g ApplyVQSR -O ${output} -V ${outin} --recal-file ${snprecal} -tranches-file ${snptranches} -truth-sensitivity-filter-level "99.7" --create-output-variant-index true -mode SNP

Any suggestions?

In the forum I found only old conversations, none about GATK4, if I'm correct.

Many thanks

combine GVCFs error


I am trying to combine 240 GVCF files to run joint GenotypeGVCFs. I created 12 meta-merged GVCFs by combining 20 samples into each one, and I did this separately for each chromosome. When I combine the 12 meta-merged files for each chromosome, I get this error:
09:54:21.509 INFO CombineGVCFs - Shutting down engine
[March 27, 2018 9:54:21 AM EDT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 1.10 minutes.
Runtime.totalMemory()=6771179520
java.lang.IllegalArgumentException: Unexpected base in allele bases '*AAAAAAAAC'
at htsjdk.variant.variantcontext.Allele.<init>(Allele.java:165)
at htsjdk.variant.variantcontext.Allele.create(Allele.java:239)
at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.extendAllele(ReferenceConfidenceVariantContextMerger.java:406)
at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.remapAlleles(ReferenceConfidenceVariantContextMerger.java:178)
at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:70)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.endPreviousStates(CombineGVCFs.java:340)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.createIntermediateVariants(CombineGVCFs.java:189)
at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.apply(CombineGVCFs.java:134)
at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.apply(MultiVariantWalkerGroupedOnStart.java:73)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:110)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:118)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:277)

On grepping:

File 1

zgrep "AAAAAAAAC" ///sc/orga/projects/psychgen/scratch/meta-merged.1.chr22.gvcf.gz
chr22 16190298 . AAAAAAAAAAC A, . . DP=20;ExcessHet=3.01;RAW_MQ=23109.00 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:0,4,0:4:12:.:127,12,0,127,12,127:0,0,3,1 ./.:0,2,0:2:6:.:74,6,0,74,6,74:0,0,2,0 ./.:.:2:3:2:0,3,45,3,45,45 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:1:3:1:0,3,27,3,27,27 ./.:.:2:3:2:0,3,45,3,45,45 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:1:3:1:0,3,16,3,16,16 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:0,4,0:4:12:.:125,12,0,125,12,125:0,0,2,2 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:1:0:1:0,0,0,0,0,0 ./.:.:1:0:1:0,0,0,0,0,0 ./.:.:2:3:2:0,3,45,3,45,45
^C

File 2

zgrep "AAAAAAAAC" ///sc/orga/projects/psychgen/scratch/meta-merged.101.chr22.gvcf.gz
chr22 16190298 . AAAAAAAAAAC A, . . DP=12;ExcessHet=3.01;RAW_MQ=12436.00 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:.:0:0:0:0,0,0,0,0,0 ./.:0,3,0:3:9:.:135,9,0,135,9,135:0,0,2,1 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:2:6:2:0,6,34,6,34,34 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:1:3:1:0,3,35,3,35,35 ./.:0,2,0:2:6:.:77,6,0,77,6,77:0,0,1,1 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:3:0:2:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:1:3:1:0,3,36,3,36,36 ./.:.:1:3:1:0,3,37,3,37,37 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0
chr22 16190299 . AAAAAAAAAC A,*, . . DP=10;ExcessHet=3.01;RAW_MQ=3600.00 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:0,0,3,0:3:9:.:135,135,135,9,9,0,135,135,9,135:0,0,2,1 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:2:3:1:0,3,35,3,35,35,3,35,35,35 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:1:3:1:0,3,18,3,18,18,3,18,18,18 ./.:0,0,2,0:2:6:.:77,77,77,6,6,0,77,77,6,77:0,0,1,1 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:0,1,0,0:1:3:.:45,3,0,45,3,45,45,3,45,45:0,0,1,0 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:1:3:1:0,3,36,3,36,36,3,36,36,36 ./.:.:1:3:1:0,3,37,3,37,37,3,37,37,37 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0,0,0,0,0

zgrep "AAAAAAAAC" ///sc/orga/projects/psychgen/scratch/meta-merged.121.chr22.gvcf.gz
chr22 16190298 . AAAAAAAAAAC A, . . DP=10;ExcessHet=3.01;RAW_MQ=5882.00 GT:AD:DP:GQ:MIN_DP:PL:SB ./.:.:1:3:1:0,3,34,3,34,34 ./.:.:2:0:2:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:1:3:1:0,3,36,3,36,36 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:0,2,0:2:6:.:90,6,0,90,6,90:0,0,2,0 ./.:.:2:6:2:0,6,70,6,70,70 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0./.:.:0:0:0:0,0,0,0,0,0 ./.:.:1:3:1:0,3,18,3,18,18 ./.:.:1:3:1:0,3,37,3,37,37 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0 ./.:.:0:0:0:0,0,0,0,0,0

and when I run
java -jar $GenomeAnalysisTK_jar ValidateVariants -V $FILE1 -R /hg19/chr22.fa -gvcf

java -jar $GenomeAnalysisTK_jar ValidateVariants -V $VCF -R /hpc/users/girdhk01/psychencode/resources/hg19/chr22.fa -gvcf
10:17:45.571 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/hpc/packages/minerva-common/gatk/4.0.1.2/gatk-4.0.1.2/gatk-package-4.0.1.2-local.jar!/com/intel/gkl/native/libgkl_compression.so
10:17:45.740 INFO ValidateVariants - ------------------------------------------------------------
10:17:45.740 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.0.1.2
10:17:45.740 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
10:17:45.740 INFO ValidateVariants - Executing as girdhk01@purcell2 on Linux v2.6.32-358.6.2.el6.x86_64 amd64
10:17:45.741 INFO ValidateVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_111-b14
10:17:45.741 INFO ValidateVariants - Start Date/Time: March 27, 2018 10:17:45 AM EDT
10:17:45.741 INFO ValidateVariants - ------------------------------------------------------------
10:17:45.741 INFO ValidateVariants - ------------------------------------------------------------
10:17:45.742 INFO ValidateVariants - HTSJDK Version: 2.14.1
10:17:45.742 INFO ValidateVariants - Picard Version: 2.17.2
10:17:45.742 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 1
10:17:45.742 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
10:17:45.742 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
10:17:45.743 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
10:17:45.743 INFO ValidateVariants - Deflater: IntelDeflater
10:17:45.743 INFO ValidateVariants - Inflater: IntelInflater
10:17:45.743 INFO ValidateVariants - GCS max retries/reopens: 20
10:17:45.743 INFO ValidateVariants - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
10:17:45.743 INFO ValidateVariants - Initializing engine
10:17:46.560 INFO FeatureManager - Using codec VCFCodec to read file file:///sc/orga/projects/psychgen/scratch/meta-merged.41.chr22.gvcf.gz
10:17:46.619 INFO ValidateVariants - Done initializing engine
10:17:46.623 WARN ValidateVariants - GVCF format is currently incompatible with allele validation. Not validating Alleles.
10:17:46.623 INFO ProgressMeter - Starting traversal
10:17:46.623 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
10:17:56.628 INFO ProgressMeter - chr22:19214664 0.2 1650000 9897030.9
10:18:06.626 INFO ProgressMeter - chr22:22421630 0.3 3455000 10363445.5
10:18:16.631 INFO ProgressMeter - chr22:25287649 0.5 5290000 10577179.4
10:18:26.631 INFO ProgressMeter - chr22:27996611 0.7 7141000 10709358.1
10:18:36.635 INFO ProgressMeter - chr22:30653901 0.8 8935000 10719427.3
10:18:46.637 INFO ProgressMeter - chr22:33277050 1.0 10735000 10732495.8
10:18:56.642 INFO ProgressMeter - chr22:35995428 1.2 12542000 10747522.1
10:19:06.641 INFO ProgressMeter - chr22:38521573 1.3 14322000 10739083.7
10:19:16.645 INFO ProgressMeter - chr22:41203498 1.5 16145000 10760702.9
10:19:26.645 INFO ProgressMeter - chr22:43888896 1.7 17965000 10776629.1
10:19:36.647 INFO ProgressMeter - chr22:46816339 1.8 19807000 10801559.7
10:19:46.649 INFO ProgressMeter - chr22:50017181 2.0 21660000 10827654.0
10:19:50.658 INFO ProgressMeter - chr22:51238008 2.1 22414832 10842826.0
10:19:50.659 INFO ProgressMeter - Traversal complete. Processed 22414832 total variants in 2.1 minutes.
10:19:50.660 INFO ValidateVariants - Shutting down engine
[March 27, 2018 10:19:50 AM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 2.09 minutes.
Runtime.totalMemory()=4264034304


A USER ERROR has occurred: A GVCF must cover the entire region. Found 3044389417 loci with no VariantContext covering it. The first uncovered segment is:chr1:1-249250621


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

can you help me with this?

VariantRecalibrator Error with simulated vcf


I simulated random variants on the genome and I have a golden.vcf of those variants. I used this golden.vcf as the truth VCF database, but I got the following error:

"A USER ERROR has occurred: Bad input: Values for MQ annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations."

But MQ is present in the training set:

##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="Mapping Quality">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=1,length=4641652>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  20
1       347     .       T       A       70      PASS    MQ=60.0 GT      0/0
1       400     .       G       T       70      PASS    MQ=60.0 GT      0/1
1       440     .       CC      C       70      PASS    MQ=60.0 GT      0/0

GATK4 pre-processing


Dear GATK team.

Hello.

I conducted GATK pre-processing using GATK 4.0.2.

I have some questions.

I used WXS (whole-exome sequencing) NGS data for mutation calling.

Do you recommend using an exome interval BED file when running base recalibration (BaseRecalibrator)?
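For reference, a hedged sketch of what restricting BaseRecalibrator to capture targets looks like in GATK4; the interval list and known-sites file names below are placeholders:

gatk BaseRecalibrator \
    -R Homo_sapiens_assembly38.fasta \
    -I sample.bam \
    -L exome_targets.interval_list \
    --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
    --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    -O sample.recal.table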

Thanks.

Picard ExtractIlluminaBarcodes Error


Hello
I'm using ExtractIlluminaBarcodes (Picard version 2.18.7) for the first time and am encountering an error with the following command:

java -jar picard.jar ExtractIlluminaBarcodes \
BASECALLS_DIR=/project/JIY3012/work/data/BaseCalls/ \
LANE=1 \
READ_STRUCTURE=250T8B250T \
BARCODE_FILE=/project/JIY3012/work/data/barcode_file \
METRICS_FILE=250T8B250T_metrics_output.txt \
NUM_PROCESSORS=36 \
MAX_MISMATCHES=0

This produces:

11:03:10.765 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/rosema1/BioInfo/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Jun 14 11:03:10 EDT 2018] ExtractIlluminaBarcodes BASECALLS_DIR=/project/JIY3012/work/data/BaseCalls LANE=1 READ_STRUCTURE=250T8B250T BARCODE_FILE=/project/JIY3012/work/data/barcode_file METRICS_FILE=250T8B250T_metrics_output.txt MAX_MISMATCHES=0 NUM_PROCESSORS=36 MIN_MISMATCH_DELTA=1 MAX_NO_CALLS=2 MINIMUM_BASE_QUALITY=0 MINIMUM_QUALITY=2 COMPRESS_OUTPUTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Jun 14 11:03:10 EDT 2018] Executing as rosema1@usrebcs11.nafta.syngenta.org on Linux 2.6.32-696.18.7.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.7-SNAPSHOT
INFO 2018-06-14 11:03:10 ExtractIlluminaBarcodes Processing with 36 PerTileBarcodeExtractor(s).
[Thu Jun 14 11:03:10 EDT 2018] picard.illumina.ExtractIlluminaBarcodes done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Expected CycledIlluminaFileMap to contain 8 cycles but only 0 were found!
at picard.illumina.parser.CycleIlluminaFileMap.assertValid(CycleIlluminaFileMap.java:66)
at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:407)
at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292)
at picard.illumina.ExtractIlluminaBarcodes$PerTileBarcodeExtractor.<init>(ExtractIlluminaBarcodes.java:750)
at picard.illumina.ExtractIlluminaBarcodes.doWork(ExtractIlluminaBarcodes.java:317)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

Perhaps this has something to do with my READ_STRUCTURE string (250T8B250T). These libraries were sequenced with dual unique barcodes with UMIs. I am interested in processing them using single indices (hence my attempted use of 250T8B250T), dual unique indices (250T8B8B250T), and dual unique indices with UMIs (250T8B9M8B250T). I am not confident that these READ_STRUCTURE values are correct, or whether this is the cause of the error.
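For comparison, a dual-index variant of the command above might look like the sketch below; barcode_file_dual is a hypothetical two-column barcode file (barcode_sequence_1 and barcode_sequence_2, without the UMI Ns), and all other paths are the same placeholders as in the original command:

java -jar picard.jar ExtractIlluminaBarcodes \
BASECALLS_DIR=/project/JIY3012/work/data/BaseCalls/ \
LANE=1 \
READ_STRUCTURE=250T8B8B250T \
BARCODE_FILE=/project/JIY3012/work/data/barcode_file_dual \
METRICS_FILE=250T8B8B250T_metrics_output.txt \
NUM_PROCESSORS=36 \
MAX_MISMATCHES=0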

Additionally, my barcode file looks like this:

barcode_sequence_1 barcode_sequence_2 barcode_name library_name
CTGATCGTNNNNNNNNN ATATGCGC Dual Index UMI Adapter 1 GAR2161A459
ACTCTCGANNNNNNNNN TGGTACAG Dual Index UMI Adapter 2 GAR2161A460
TGAGCTAGNNNNNNNNN AACCGTTC Dual Index UMI Adapter 3 GAR2161A461
GAGACGATNNNNNNNNN TAACCGGT Dual Index UMI Adapter 4 GAR2161A462
CTTGTCGANNNNNNNNN GAACATCG Dual Index UMI Adapter 5 GAR2161A463
TTCCAAGGNNNNNNNNN CCTTGTAG Dual Index UMI Adapter 6 GAR2161A464
CGCATGATNNNNNNNNN TCAGGCTT Dual Index UMI Adapter 7 GAR2161A465
ACGGAACANNNNNNNNN GTTCTCGT Dual Index UMI Adapter 8 GAR2161A466
CGGCTAATNNNNNNNNN AGAACGAG Dual Index UMI Adapter 9 9
ATCGATCGNNNNNNNNN TGCTTCCA Dual Index UMI Adapter 10 10
GCAAGATCNNNNNNNNN CTTCGACT Dual Index UMI Adapter 11 11
(etc.)

I included all 384 barcodes as I am interested in observing any cross-talk that occurs.

Thank you for your help

Mark


Variant Quality Score Recalibration (VQSR)


This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. The first section is a high-level overview aimed at non-specialists. Additional technical details are provided below.

For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article. See the VariantRecalibrator tool doc and the ApplyRecalibration tool doc for a complete description of available command line arguments.

As a complement to this document, we encourage you to watch the workshop videos available in the Presentations section.


High-level overview

VQSR stands for “variant quality score recalibration”, which is a bad name because it’s not re-calibrating variant quality scores at all; it is calculating a new quality score that is supposedly super well calibrated (unlike the variant QUAL score which is a hot mess) called the VQSLOD (for variant quality score log-odds). I know this probably sounds like gibberish, stay with me. The purpose of this new score is to enable variant filtering in a way that allows analysts to balance sensitivity (trying to discover all the real variants) and specificity (trying to limit the false positives that creep in when filters get too lenient) as finely as possible.

The basic, traditional way of filtering variants is to look at various annotations (context statistics) that describe e.g. what the sequence context is like around the variant site, how many reads covered it, how many reads covered each allele, what proportion of reads were in forward vs reverse orientation; things like that -- then choose threshold values and throw out any variants that have annotation values above or below the set thresholds. The problem with this approach is that it is very limiting because it forces you to look at each annotation dimension individually, and you end up throwing out good variants just because one of their annotations looks bad, or keeping bad variants in order to keep those good variants.

The VQSR method, in a nutshell, uses machine learning algorithms to learn from each dataset what is the annotation profile of good variants vs. bad variants, and does so in a way that integrates information from multiple dimensions (like, 5 to 8, typically). The cool thing is that this allows us to pick out clusters of variants in a way that frees us from the traditional binary choice of “is this variant above or below the threshold for this annotation?”

Let’s do a quick mental visualization exercise (pending an actual figure to illustrate this), in two dimensions because our puny human brains work best at that level. Imagine a topographical map of a mountain range, with North-South and East-West axes standing in for two variant annotation scales. Your job is to define a subset of territory that contains mostly mountain peaks, and as few lowlands as possible. Traditional hard-filtering forces you to set a single longitude cutoff and a single latitude cutoff, resulting in one rectangular quadrant of the map being selected, and all the rest being greyed out. It’s about as subtle as a sledgehammer and forces you to make a lot of compromises. VQSR allows you to select contour lines around the peaks and decide how low or how high you want to go to include or exclude territory within your subset.

How this is achieved is another can of worms. The key point is that we use known, highly validated variant resources (omni, 1000 Genomes, hapmap) to select a subset of variants within our callset that we’re really confident are probably true positives (that’s the training set). We look at the annotation profiles of those variants (in our own data!), and from that we learn some rules about how to recognize good variants. We do something similar for bad variants as well. Then we apply the rules we learned to all of the sites, which (through some magical hand-waving) yields a single score for each variant that describes how likely it is to be real, based on all the examined dimensions. In our map analogy this is the equivalent of determining on which contour line the variant sits. Finally, we pick a threshold value indirectly by asking the question “what score do I need to choose so that e.g. 99% of the variants in my callset that are also in hapmap will be selected?”. This is called the target sensitivity. We can twist that dial in either direction depending on what is more important for our project, sensitivity or specificity.


Technical overview

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.

The variant recalibrator contrastively evaluates variants in a two-step process, each step performed by a distinct tool:

  • VariantRecalibrator
    Create a Gaussian mixture model by looking at the annotation values over a high-quality subset of the input call set, and then evaluate all input variants. This step produces a recalibration file.

  • ApplyRecalibration
    Apply the model parameters to each variant in the input VCF file, producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step will filter the calls based on this new lod score by annotating the FILTER column for variants that don't meet the specified lod threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.
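As a minimal illustrative sketch only (not the recommended command lines; the FAQ article mentioned above gives the current resources, annotations and tranche settings), the two steps might look like this in GATK3-style syntax, with placeholder file names and an abbreviated resource list:

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R reference.fasta \
    -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an QD -an FS -an MQRankSum -an ReadPosRankSum \
    -mode SNP \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches

java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
    -R reference.fasta \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.0 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps.vcf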


How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.


How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), the program finds the VQSLOD value above which 99.9% of the variants in the training callset are included. It then takes that value of VQSLOD and uses it as a threshold to filter your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.


Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (the report will appear as something like path/to/output.plots.R.pdf, depending on the output name specified on the command line). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

[Figure: example page of the Gaussian mixture model report]

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!


Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (90), which has the lowest value of truth sensitivity but the highest value of novel Ti/Tv, is exceedingly specific but less sensitive. Each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select more specific or more sensitive call sets in a principled way, or incorporate the recalibrated quality scores directly, avoiding the need to analyze only a fixed subset of calls and instead weighting individual variant calls by their probability of being real. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

[Figure: example tranches plot]

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.


Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
  • The truth sensitivity (TS) approach is conceptually cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used hapmap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other high-quality sets of sites (~99% truly variable in the population) should work just as well. In our experience with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features like being close to indels or are actually MNPs, and so receive a low VQSLOD score.

Note that the expected Ti/Tv is still an available argument, but it is only used for display purposes.


Finally, a couple of Frequently Asked Questions

- Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

- Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside. We also make use of the ggplot2 library so please be sure to install that package as well. See the Common Problems section of the Guide for more details.

GenotypeGVCFs 4.0.5.0 error


Hi,

Do you have any idea what this error is?

06:43:19.441 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default

java: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed.

The full log file...

Using GATK jar /share/apps/bio/gatk-4.0.5.0/gatk-package-4.0.5.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10g -jar /share/apps/bio/gatk-4.0.5.0/gatk-package-4.0.5.0-local.jar GenotypeGVCFs -R /home/shared/resources/hgRef/hg38/Homo_sapiens_assembly38.fasta -O /home/manolis/GATK4/IlluminaExomePairEnd/6.vcf/processing/WESx3_WgWgCC_prova01_01.vcf -G StandardAnnotation --only-output-calls-starting-in-intervals -new-qual -V gendb://WESx3_WgWgCC_prova01/01 -L chr1
06:43:19.441 WARN  GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
06:43:19.578 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/apps/bio/gatk-4.0.5.0/gatk-package-4.0.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
06:43:19.865 INFO  GenotypeGVCFs - ------------------------------------------------------------
06:43:19.866 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.5.0
06:43:19.866 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
06:43:19.867 INFO  GenotypeGVCFs - Executing as manolis@genemonster on Linux v3.5.0-36-generic amd64
06:43:19.867 INFO  GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_91-b14
06:43:19.867 INFO  GenotypeGVCFs - Start Date/Time: June 12, 2018 6:43:19 AM CEST
06:43:19.867 INFO  GenotypeGVCFs - ------------------------------------------------------------
06:43:19.868 INFO  GenotypeGVCFs - ------------------------------------------------------------
06:43:19.869 INFO  GenotypeGVCFs - HTSJDK Version: 2.15.1
06:43:19.869 INFO  GenotypeGVCFs - Picard Version: 2.18.2
06:43:19.869 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:43:19.869 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:43:19.869 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:43:19.869 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:43:19.869 INFO  GenotypeGVCFs - Deflater: IntelDeflater
06:43:19.870 INFO  GenotypeGVCFs - Inflater: IntelInflater
06:43:19.870 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
06:43:19.870 INFO  GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
06:43:19.870 INFO  GenotypeGVCFs - Initializing engine
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
06:43:22.800 INFO  IntervalArgumentCollection - Processing 248956422 bp from intervals
06:43:22.909 INFO  GenotypeGVCFs - Done initializing engine
06:43:23.225 INFO  ProgressMeter - Starting traversal
06:43:23.226 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
06:43:35.787 INFO  ProgressMeter -           chr1:17703              0.2                  1000           4776.7
06:43:38.536 INFO  GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.3484763260000022,Cpu time(s),0.27263762600000135
java: tpp.c:63: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= __sched_fifo_min_prio && new_prio <= __sched_fifo_max_prio)' failed.

Many thanks

known_sites for v37 genome for Base Recalibration (Gatk_4.0.4)


Hi GATK team, I see in the latest WDL scripts of the Broad pipeline (posted on GitHub) that the databases used as known_sites for the base recalibration step are 'dbSNP_138' and 'Mills_and_1000g_gold_standard_indels' for the v37 human genome. I have a few questions with respect to the usage of known sites:

a) I would like to upgrade to the latest dbSNP_151 available on the dbSNP website and use the All.vcf provided there. Do you foresee any issues using that dbSNP version? May I ask why the Broad Institute is not using the latest dbSNP version in its scripts? Is there any special reason for it other than maintaining stability?

b) Is it OK to use a huge database such as gnomAD as known_sites? Do you see any negative impact on the calculations underlying the base recalibration algorithm? (Note: I understand how base quality error rates are calculated and adjusted after masking known sites (from population databases) in a BAM, and I am therefore afraid that masking a very large number of sites may affect the sensitivity and accuracy of the base recalibration process. I am aware of the role the four covariates play, but I am still worried that masking more and more sites would lower the estimated error rate, since fewer unmasked sites remain to be treated as 'error mismatches', ultimately resulting in less recalibration happening at ALL sites.) On the other side of the coin, masking too many sites may not be helpful, because the base qualities at those masked sites may not be corrected significantly, so a site with a low machine-annotated base quality would remain low quality and hence not be called appropriately by the variant calling step.

c) How about also using the 1000g_phase3 VCFs as known_sites, in addition to dbSNP_151 and the old Mills_and_1000G_gold_standard_indels? (A sketch of such a command is shown after this list.)

d) Over the last 8 years, sequencing techniques have improved, and so have base qualities. Does Base Recalibration still make sense as a necessary step in exome or whole-genome analysis? Does it have a significant effect on variant qualities? I still have to look at the change logs for the Base Recalibration tool over the past years. Could you highlight any significant changes made to this tool since its introduction?
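
To make c) concrete, this is roughly the command shape I have in mind, with one --known-sites argument per resource (a sketch only — the file names are placeholders, not the resources you actually ship):

gatk --java-options "-Xmx8G" BaseRecalibrator \
    -I sample.bam \
    -R human_g1k_v37.fasta \
    --known-sites dbsnp_151.b37.vcf.gz \
    --known-sites Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
    --known-sites 1000G_phase3_sites.b37.vcf.gz \
    -O sample.recal.table

(My understanding is that each additional resource simply enlarges the set of sites masked during recalibration, which is exactly why question b) matters.)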

Note: Please view these questions from the standpoint of a clinical lab.

Thanks!
-S

GATK4 Spark


I am investigating the Spark implementations of some of the GATK tools. I notice that all of the Spark tools are marked as 'BETA', with comments like 'do not use spark if you care about the results'.

I am asking for a comment on which Spark-enabled GATK tool is the most mature and the most suitable for me to experiment with.

I understand that some of the tools that do not have 'Spark' in the name may be using (local) Spark under the hood as a way to parallelise. Is there any way for me to know which tools do this, so that I can use one of them in my experiments?
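
For context, the kind of run I have been planning to experiment with looks like this — a minimal sketch using the gatk wrapper to run one of the Spark tools on a single machine (the tool choice and the thread count are my own assumptions, not recommendations I found in the docs):

gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    -- \
    --spark-runner LOCAL \
    --spark-master 'local[8]'

(As far as I understand, everything after the lone -- is passed to the Spark runner rather than to the tool itself.)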

Thanks

Java and GenotypeGVCFs errors


Dear GATK team members,

I have a few (but very long...) questions and I'd be really grateful if you could answer them. I used GATK for the first time three days ago, and some things are not working well in the process, so I am writing to sort them out. I'm really sorry if these problems have already been solved.

My goal is to get linkage data from the genotypes of recombinants. I am using C. elegans data as practice; this nematode has many recombinant lines made from two divergent strains, N2 and CB4856. Each recombinant carries either the N2 or the CB4856 genotype at each genomic position, so I can (possibly) see how strongly positions are linked. The data I have is sequencing of 40 recombinants to ~6x depth.

To do this, I analyzed my data following the Germline SNPs + Indels section of the Best Practices (and the 2015 PDF document with script examples) and the Tool Documentation Index. In the process, I ran into some errors and questions.

I'm using Ubuntu 18.04 and GATK-4.0.5.1.

  1. Java error
    I partly solved this problem, but still have some issues. I followed this link:
    https://software.broadinstitute.org/gatk/documentation/article?id=11135
    In that article, they suggest the following command,
    /usr/libexec/java_home -v 1.7.0_79 --exec java -jar GenomeAnalysisTK.jar -T ...
    , so I mimicked it:
    ../jre1.8.0_171/ -v 1.7.0_79 --exec java -jar gatk-package-4.0.5.1-local.jar
    However, it showed an error:
    -bash: ../jre1.8.0_171/: Is a directory
    I used the following command to run GATK instead, and it works fine:
    ../jre1.8.0_171/bin/java -jar gatk-package-4.0.5.1-local.jar

However, this does not work for the Spark jar, only for the local one.

../jre1.8.0_171/bin/java -jar gatk-package-4.0.5.1-spark.jar

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Partitioner
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
    at java.lang.Class.getConstructors(Class.java:1651)
    at org.broadinstitute.hellbender.utils.ClassUtils.canMakeInstances(ClassUtils.java:30)
    at org.broadinstitute.hellbender.Main.extractCommandLineProgram(Main.java:318)
    at org.broadinstitute.hellbender.Main.setupConfigAndExtractProgram(Main.java:180)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:202)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Partitioner
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 8 more

../jdk1.8.0_171/bin/java -jar gatk-package-4.0.5.1-spark.jar

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Partitioner
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
    at java.lang.Class.getConstructors(Class.java:1651)
    at org.broadinstitute.hellbender.utils.ClassUtils.canMakeInstances(ClassUtils.java:30)
    at org.broadinstitute.hellbender.Main.extractCommandLineProgram(Main.java:318)
    at org.broadinstitute.hellbender.Main.setupConfigAndExtractProgram(Main.java:180)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:202)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Partitioner
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 8 more

Could you suggest anything? If it is not a GATK problem, please just let me know. I found some articles suggesting that some Java errors come from my server rather than from GATK.
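
For reference, I understand the release zip also contains a gatk wrapper script that is supposed to pick the right jar automatically; a minimal sketch of how I would call it for step (4) below (assuming the script is executable and Python is available on my server — I have not verified this):

# a sketch, not what I actually ran: using the bundled wrapper instead of java -jar
../gatk-4.0.5.1/gatk --java-options "-Xmx32G" HaplotypeCaller \
    --reference ../reference.fasta \
    --input strain1.sort.dup.rg.bam \
    -ERC GVCF \
    --output strain1.g.vcf \
    --use-new-qual-calculator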

  2. GenotypeGVCFs error
    I tried to follow the Best Practices document for Germline SNPs + Indels, but I ran into trouble when I used the GenotypeGVCFs tool.

What I did last three days:
(1) extract reads and align them to another genome version. (44 samples = N2, CB4856 and 42 recombinants)
../picard SamToFastq I=$i.bam F=fastq/$i.1.fastq F2=fastq/$i.2.fastq ($i = 44 strain names sequentially)
bwa mem -t 40 reference.fasta $i.1.fastq $i.2.fastq | samtools sort -@ 40 -O BAM -o $i.sort.bam
(2) remove duplicates
/media/elegans/main/tools/picard MarkDuplicatesWithMateCigar I=$i.sort.bam O=$i.sort.dup.bam M=$i.dup.txt REMOVE_DUPLICATES=true
(3) add read groups and make index files
/media/elegans/main/tools/picard AddOrReplaceReadGroups I=$i.sort.dup.bam O=$i.sort.dup.rg.bam RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=$i
samtools index $i.sort.dup.rg.bam
(4) variant calling with HaplotypeCaller
../jre1.8.0_171/bin/java -Xmx32G -jar ../gatk-4.0.5.1/gatk-package-4.0.5.1-local.jar HaplotypeCaller --reference ..reference.fasta --input $i.sort.dup.rg.bam -ERC GVCF --output $i.g.vcf --use-new-qual-calculator
(5) import database using GenomicsDBImport
../jre1.8.0_171/bin/java -Xmx32G -jar ../gatk-4.0.5.1/gatk-package-4.0.5.1-local.jar GenomicsDBImport -V N2.g.vcf -V CB4856.g.vcf -V strain1.g.vcf ... -V strain42 --genomicsdb-workspace-path recombinant_DB/$j --intervals $j ($j = chromosome names, I, II, III, IV, V, MtDNA, X)
(6) joint genotyping using GenotypeGVCFs
../jre1.8.0_171/bin/java -Xmx32G -jar ../gatk-4.0.5.1/gatk-package-4.0.5.1-local.jar GenotypeGVCFs -R ../reference.fasta -V gendb://recombinant_DB/I -G StandardAnnotation -O joint.genotyping.chrI.vcf --founder-id N2 --use-new-qual-calculator

Then I got the following errors (I picked out only the WARN lines).
10:36:54.710 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
and
10:36:54.838 INFO GenotypeGVCFs - Initializing engine
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
10:36:55.750 INFO GenotypeGVCFs - Done initializing engine
10:36:55.786 INFO ProgressMeter - Starting traversal
10:36:55.786 INFO ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
10:37:00.587 INFO GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.573704676999999,Cpu time(s),0.5681586370000016

Then it stopped. When I opened the joint.genotyping.chrI.vcf file, it contained only 394 lines (positions 3040 - 55642) and it has not changed since. Chromosome I is about 15 Mb in size.

I didn't know whether that was because of the large size, so I tried reducing the target region.
../jre1.8.0_171/bin/java -Xmx32G -jar ../gatk-4.0.5.1/gatk-package-4.0.5.1-local.jar GenomicsDBImport -V N2.g.vcf -V CB4856.g.vcf -V strain1.g.vcf ... -V strain42 --genomicsdb-workspace-path recombinant_DB/VR1.4Mb --intervals V:20000000-21389866 (1.4 Mb region)
../jre1.8.0_171/bin/java -Xmx32G -jar ../gatk-4.0.5.1/gatk-package-4.0.5.1-local.jar GenotypeGVCFs -R reference.fasta -V gendb://recombinant_DB/VR1.4Mb -G StandardAnnotation -O joint.genotyping.chrVR.vcf --founder-id N2 --use-new-qual-calculator

It was done, but it showed similar errors and additional ones.
java: tpp.c:84: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed. Aborted (core dumped)
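
As an aside, I also wondered whether, instead of re-importing, I could query the existing chromosome I workspace in smaller slices with -L (I am not sure whether GenomicsDB supports this); a rough sketch of what I mean, with placeholder coordinates:

../jre1.8.0_171/bin/java -Xmx32G -jar ../gatk-4.0.5.1/gatk-package-4.0.5.1-local.jar GenotypeGVCFs \
    -R ../reference.fasta -V gendb://recombinant_DB/I -L I:1-5000000 \
    -G StandardAnnotation -O joint.genotyping.chrI.part1.vcf --founder-id N2 --use-new-qual-calculator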

Could you suggest anything? I really appreciate your efforts.

A small question: what are founder samples? I'm wondering whether both N2 and CB4856 are founder samples or not.
Should --founder-id N2 be replaced by --founder-id 'N2 CB4856'?
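If it turns out that both parents should be listed, my assumption (which I have not verified) is that the option can simply be repeated rather than quoted, e.g.:

../jre1.8.0_171/bin/java -Xmx32G -jar ../gatk-4.0.5.1/gatk-package-4.0.5.1-local.jar GenotypeGVCFs \
    -R ../reference.fasta -V gendb://recombinant_DB/I -G StandardAnnotation \
    -O joint.genotyping.chrI.vcf --founder-id N2 --founder-id CB4856 --use-new-qual-calculator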

If I have made mistakes during variant calling, please let me know.

Thank you for maintaining this wonderful software and this community!


