
Resource bundle


The GATK resource bundle is a collection of standard files for working with human sequencing data. We provide several versions of the bundle corresponding to the various reference builds, but be aware that we no longer support very old versions (b36/hg18). In addition, we are currently transitioning to support the GRCh38/hg38 reference build, which will eventually become the default, while b37/hg19 will be considered legacy and eventually phased out. See the Dictionary entry on human genome reference builds for more information.

We do not currently provide any non-human resources in the resource bundle.

1. Accessing the bundle

See the Resource Bundle page. In a nutshell, there's a Google Cloud bucket and an FTP server. These resources are also available through FireCloud, our cloud-based analysis portal, in workspaces that are preconfigured for the major Best Practices analysis use cases.
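
For illustration, files in the cloud bucket can be listed and fetched with gsutil; the bucket path below is a placeholder, so check the Resource Bundle page for the actual location:

# Placeholder bucket path; substitute the location given on the Resource Bundle page
gsutil ls gs://<bundle-bucket>/hg38/
gsutil cp gs://<bundle-bucket>/hg38/Homo_sapiens_assembly38.fasta .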

2. GRCh38/hg38 Resources: the soon-to-be Standard Set

This contains all the resource files needed for Best Practices germline short variant discovery in whole-genome sequencing data (WGS). Exome files and itemized resource list coming soon(ish). Somatic resources are in development.

3. b37 Resources: the Standard Data Set pending completion of the hg38 bundle

Note that many of these resources are out of date and will eventually be retired. All new development is being done against GRCh38/hg38.

  • Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
  • dbSNP in VCF. This includes two files:

    • A recent dbSNP release (build 138)
    • The same file subsetted to sites discovered in or before dbSNP build 129, which excludes the impact of the 1000 Genomes Project and is useful for evaluating the dbSNP rate and Ti/Tv values at novel sites.
  • HapMap genotypes and sites VCFs

  • OMNI 2.5 genotypes for 1000 Genomes samples, as well as a sites-only VCF
  • The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:

    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • The latest set from 1000G phase 3 (v4) for genotype refinement: 1000G_phase3_v4_20130502.sites.vcf

  • A large-scale standard single sample BAM file for testing:

    • NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam containing ~64x reads of NA12878 on chromosome 20
    • A callset produced by running UnifiedGenotyper on the dataset above. Note that this resource is out of date and does not represent the results of our Best Practices. This will be updated in the near future.
  • The Broad's custom exome targets list: Broad.human.exome.b37.interval_list (note that you should always use the exome targets list that is appropriate for your data, which typically depends on the prep kit that was used, and should be available from the kit manufacturer's website)

Additionally, these files all have supplementary indices, statistics, and other QC data available.


All resources below this are available only on the FTP server, not on the cloud.

4. hg19 Resources: lifted over from b37

Includes the UCSC-style hg19 reference along with all lifted-over VCF files.

5. hg18 Resources: lifted over from b37

Includes the UCSC-style hg18 reference along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.

6. b36 Resources: lifted over from b37

Includes the 1000 Genomes pilot b36 formatted reference sequence (human_b36_both.fasta) along with all lifted-over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome build that can be lifted over "easily" from our master b37 repository. Sorry for any inconvenience this might cause.

Also includes a chain file to lift over to b37.


Read groups


There is no formal definition of a read group, but in practice this term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane belong to the same read group. When multiplexing is involved, each subset of reads originating from a separate library run on that lane constitutes a separate read group.

Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied. See this article for common problems related to read groups.

To see the read group information for a BAM file, use the following command.

samtools view -H sample.bam | grep '@RG'

This prints the lines starting with @RG within the header, e.g. as shown in the example below.

@RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2    LB:Solexa-272222    PI:0    DT:2014-08-20T00:00:00-0400 SM:NA12878  CN:BI

Meaning of the read group fields required by GATK

  • ID = Read group identifier
    This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell + lane name and number, making them a globally unique identifier across all sequencing data in the world.
    Use for BQSR: ID is the lowest common denominator that differentiates factors contributing to technical batch effects; therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration, since all reads within a read group are assumed to share the same error model.

  • PU = Platform Unit
    The PU holds three types of information: {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell, {LANE} indicates the lane of the flow cell, and {SAMPLE_BARCODE} is a sample/library-specific identifier. The PU is not required by GATK, but it takes precedence over ID for base recalibration if present. In the example shown earlier, the two read group fields ID and PU appropriately differentiate the flow cell lane, marked by .2, a factor that contributes to batch effects.

  • SM = Sample
    The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it's critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name. Note, when we say pools, we mean samples that are not individually barcoded. In the case of multiplexing (often confused with pooling) where you know which reads come from each sample and you have simply run the samples together in one lane, you can keep the SM tag as the sample name and not the "pooled name".

  • PL = Platform/technology used to produce the read
    This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

  • LB = DNA preparation library identifier
    MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes.

If your sample collection's BAM files lack required fields or do not differentiate pertinent factors within the fields, use Picard's AddOrReplaceReadGroups to add or appropriately rename the read group fields as outlined here.
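
For illustration, a minimal sketch of such a command, reusing the values from the earlier @RG example (substitute your own; see the Picard documentation for the full set of options):

# Sketch only: the read group values mirror the @RG example above
java -jar picard.jar AddOrReplaceReadGroups \
    I=sample.bam \
    O=sample.rg.bam \
    RGID=H0164.2 \
    RGPU=H0164ALXX140820.2 \
    RGPL=illumina \
    RGLB=Solexa-272222 \
    RGSM=NA12878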


Deriving ID and PU fields from read names

Here we illustrate how to derive both ID and PU fields from read names as they are formed in the data produced by the Broad Genomic Services pipelines (other sequence providers may use different naming conventions). We break down the common portion of two different read names from a sample file. The unique portion of each read name, which comes after the flow cell lane and is separated by colons, consists of the tile number and the x- and y-coordinates of the cluster.

H0164ALXX140820:2:1101:10003:23460
H0164ALXX140820:2:1101:15118:25288

Breaking down the common portion of the query names:

H0164____________ #portion of @RG ID and PU fields indicating Illumina flow cell
_____ALXX140820__ #portion of @RG PU field indicating barcode or index in a multiplexed run
_______________:2 #portion of @RG ID and PU fields indicating flow cell lane
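
As a sketch, the shared ID/PU prefix can be pulled out of a read name with standard shell tools (the sample barcode, the third PU component, comes from the sample sheet rather than from the read name):

# Prints H0164ALXX140820.2, the {FLOWCELL_BARCODE}.{LANE} prefix used in ID and PU
echo "H0164ALXX140820:2:1101:10003:23460" | awk -F: '{print $1"."$2}'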

Multi-sample and multiplexed example

Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 BAM files, with the following @RG fields in their headers:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:ILLUMINA     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:ILLUMINA     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:ILLUMINA     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:ILLUMINA     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:ILLUMINA     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:ILLUMINA     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship between read groups (unique for each lane), libraries (each sequenced on two lanes), and samples (each spanning four lanes, two per library).

Install with conda, gatkcondaenv.yml not found

How can I get a VCF file without repeated SNPs?


I just called SNPs on the RNA-seq data of several of my samples.
That gave me several VCF files, so I used MergeVcfs to combine them into one big VCF file,
and I used CollectVariantCallingMetrics to evaluate it.
I then found that this big VCF file contains all the SNPs from all my samples, including SNPs that share the same sites.
So what I wonder is: can I get a VCF file in which every SNP has a unique site?
I know it may be a complicated question, because this kind of big VCF file contains SNPs whose genotypes differ between samples,
so if I collapse it to a one-site-one-SNP VCF file, the genotype information may become wrong.
Put more simply: I just want to delete the repeated SNPs so I can get a net count of my SNPs.
Maybe my description is not so clear, but I am really trying my best to describe my question.
Thanks a lot.
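
P.S. Would something like this work (an untested sketch, and bcftools rather than GATK)?

# Untested: collapse records that share the same position into one
bcftools norm -d all merged.vcf -o merged.dedup.vcf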

What is the meaning of this WARN?


Happy Thanksgiving to the GATK staff!
I know you are on vacation, so I am in no hurry to get an answer, but I still want to ask in advance. :smile:
Here is my question:
When I run GenotypeGVCFs, I get a WARN message like this:

What is the meaning of it? It seems this WARN is not something severe; am I right?
Besides, I ran ValidateVariants to validate my GVCF file,
and ValidateVariants gave me a message like this:

I think that means my GVCF file is totally fine.
I hope you can give me some help when you come back after vacation.
Thanks a lot.

About "Ask the team"

This is the place to post any questions, problems or bug reports for the GATK development team to look at. We try to respond within a reasonably short amount of time, but keep in mind that we are not support agents -- we are programmers and scientists, with code to write and data to analyze. In the same spirit, because our resources are limited and our time precious, we ask that you please consult all available sources of information in the GATK Guide and previous posts in this forum before posting your question here, and above all, refrain from posting problems that are clearly identified as USER ERRORS in the GATK's output. Thank you!

We also welcome discussions and responses from everyone in the user community. If you know something, say something!

Use VariantsToTable to extract alternate allele count


I'm using the following VariantsToTable command options to extract fields from a VCF file:

/usr/lib/jvm/jre-1.8.0-openjdk/bin/java -Xmx8g -jar /home/1GenomeRef/GATK/GATK_3.5/GenomeAnalysisTK.jar \
-T VariantsToTable \
-R ref.fa \
-V file1.vcf \
-F POS -F ID -F REF -F ALT -F QUAL -F FILTER -F AC -F AN -GF GT  \
--showFiltered \
--out outputfile

This extracts the correct information, but my original VCF file reports each sample genotype (GT field) as an alternate allele count (0/0, 0/1 or 1/1) and the new output file reports the genotype as the base (C/T, for example.) So the GT for sample1 in my original file might be "0/1" but in the new file it's recoded as "C/T."

I prefer to retain the original genotype format but do not see an option that allows me to request this. Is there an option I can use for this? Or another tool I can apply that will quickly recode the new VCF?

Thanks so much. (And I am following the Best Practices Guidelines; although we are using GATK version 3.5, this is a choice we made to ensure the highest possible consistency with older data called using version 3.5.)
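
P.S. If there is no built-in option, would a sketch like this (untested, using bcftools rather than GATK) keep the 0/1-style genotypes?

# Untested: %GT prints allele indices (0/1) rather than translated bases
bcftools query -f '%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO/AC\t%INFO/AN[\t%GT]\n' file1.vcf > outputfile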

JEXL error arising from SNPs with zero coverage for either REF or ALT alleles


I ran:

gatk-4.0.11.0/gatk SelectVariants -R Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fasta -V variants/P2-E8-ACTGAGCG-CTTAATAG_S152_first_pass_filtered.vcf -select '(1.0*vc.getGenotype("P2-E8-ACTGAGCG-CTTAATAG_S152").getAD().1)/(1.0*vc.getGenotype("P2-E8-ACTGAGCG-CTTAATAG_S152").getDP()) > 0.9' -output variants/P2-E8-ACTGAGCG-CTTAATAG_S152_first_pass_selected.vcf --exclude-filtered

and get:

A USER ERROR has occurred: Invalid JEXL expression detected for select-0

The exact same filter worked for 90% of my files, but failed on about 10% of them. I then found that if I replace the numerator with a '1', it still fails:

1/(1.0*vc.getGenotype("P2-E8-ACTGAGCG-CTTAATAG_S152").getDP()) > 0.9

and looking at the vcf file, it turns out there are some SNPs from the initial SNP calling that have zero coverage:

I   27036   .   G   A   15.14   PASS    AC=1;AF=1.00;AN=1;FS=0.000;MLEAC=1;MLEAF=1.00;SOR=0.693 GT:AD:DP:GQ:PL  1:0,0:0:45:45,0
I   27063   .   G   A   60  PASS    AC=1;AF=1.00;AN=1;FS=0.000;MLEAC=1;MLEAF=1.00;SOR=0.693 GT:AD:DP:GQ:PL  1:0,0:0:90:90,0
XII 585947  .   T   A   16.11   PASS    AC=1;AF=1.00;AN=1;FS=0.000;MLEAC=1;MLEAF=1.00;SOR=0.693 GT:AD:DP:GQ:PL  1:0,0:0:46:46,0

Thus, I now understand why my filter failed, because of a divide by zero error, but I don't understand how I got these SNPs in the first place. They were called with:

gatk-4.0.11.0/gatk HaplotypeCaller -ploidy 1 -R Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fasta -I bam/P2-E8-ACTGAGCG-CTTAATAG_S152.dedup.realigned.bam --output variants/P2-E8-ACTGAGCG-CTTAATAG_S152_first_pass_raw.vcf
gatk-4.0.11.0/gatk VariantFiltration -R Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fasta -V variants/P2-E8-ACTGAGCG-CTTAATAG_S152_first_pass_raw.vcf -filter 'QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -10.5 || ReadPosRankSum < -8.0' -output variants/P2-E8-ACTGAGCG-CTTAATAG_S152_first_pass_filtered.vcf -filter-name "hard_filter"

Any ideas how a SNP can be called with zero coverage in either the REF or the ALT alleles?
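
P.S. In the meantime I am considering guarding the expression like this (untested; I am assuming JEXL's && short-circuits so the division is skipped when DP is 0):

-select 'vc.getGenotype("P2-E8-ACTGAGCG-CTTAATAG_S152").getDP() > 0 && (1.0*vc.getGenotype("P2-E8-ACTGAGCG-CTTAATAG_S152").getAD().1)/(1.0*vc.getGenotype("P2-E8-ACTGAGCG-CTTAATAG_S152").getDP()) > 0.9'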


Variant Recalibrator Error: Track Input out of Coordinate Order on contig ....


Hi:

After running HaplotypeCaller, I wanted to recalibrate the variant scores using VariantRecalibrator on SNPs and indels simultaneously. Only 4 samples were used (I am just testing the pipeline right now). I am getting the following error saying one of the track inputs is out of coordinate order. Has anyone encountered a similar error?

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.4-7-g5e89f01):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: LocationAwareSeekableRODIterator: track input is out of coordinate order on contig chr4:140811083-140811084 compared to chr4:140811085
ERROR ------------------------------------------------------------------------------------------

[1]+ Exit 1 java -Xmx4g -jar $gatk -T VariantRecalibrator -R $ref2 -input ${data}/TruSeq/truseq.reduced.snps.indels.vcf --maxGaussians 6 -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 $hapmap -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 $omni -resource:mills,VCF,known=true,training=true,truth=true,prior=12.0 $goldstandard -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 $dbsnp -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an ClippingRankSum -mode BOTH -recalFile ${data}/TruSeq/truseq.interstroke.snps.indels.reduced.recal -tranchesFile ${data}/TruSeq/truseq.interstroke.snps.indels.reduced.tranches -rscriptFile ${data}/TruSeq/truseq.interstroke.snps.indels.reduced.recal.plots.R

Thanks,

MC

UmiAwareMarkDuplicatesWithMateCigar between Picard 2.9.0 and 2.18.9


Hi,

My institution is moving its HPC servers, and a side effect is a small downgrade of Picard, from 2.18.9 to 2.9.0. I have gone through the release notes and there seem to be no algorithm changes between these two versions aside from the metrics output. Can the GATK team please confirm that's the case? Thanks!

How can I check which version of Picard I am running?


Hi,
I have used GATK and Picard to mark and remove duplicates and to perform indel realignment. I can see the GATK version from the command line. Is there any way to find out which version of Picard I am running?
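
The only ideas I have so far are to check the @PG lines that the tools write into the BAM header, or to ask the jar itself (both untested sketches):

# @PG header lines record the program name and version that produced the BAM
samtools view -H dedup.bam | grep '@PG'

# Recent Picard releases report their version like this
java -jar picard.jar MarkDuplicates --version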

VariantRecalibrator - Can't do comparison because Locatables' contigs not found in sequence dict


Hi,
sorry if the question isn't precise enough.

Here's the command:

tools/gatk-4.0.11.0/gatk VariantRecalibrator -R data/humanRefGenome/hs37d5.fa \
--variant data/mongolian/Mongolian_genome_sorted_noChr2.snp.vcf \
-O data/mongolian/mongolian.recal \
-tranches-file data/mongolian/mongolian.tranches \
-rscript-file data/mongolian/mongolian_plots.R \
--resource hapmap,known=false,training=true,truth=true,prior=15.0:data/mongolian/hapmap_3.3.hg19.sites_noChr2.vcf.gz \
--resource omni,known=false,training=true,truth=true,prior=12.0:data/mongolian/1000G_omni2.5.hg19.sites_noChr2.vcf.gz \
--resource 1000G,known=false,training=true,truth=false,prior=10.0:data/mongolian/1000G_phase1.snps.high_confidence.hg19.sites_noChr2.vcf.gz \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:data/mongolian/dbsnp_138.hg19_noChr2.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an DP \
-tranche 90.0 -tranche 99.0 -tranche 99.9 -tranche 100.0 \
-mode SNP

and that's the error:

21:10:45.445 INFO  VariantRecalibrator - Writing out recalibration table...
21:10:45.459 INFO  VariantRecalibrator - Shutting down engine
[November 15, 2018 9:10:45 PM CET] org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator done. Elapsed time: 33.89 minutes.
Runtime.totalMemory()=2090336256
java.lang.IllegalArgumentException: Can't do comparison because Locatables' contigs not found in sequence dictionary
    at org.broadinstitute.hellbender.utils.IntervalUtils.compareContigs(IntervalUtils.java:149)
    at org.broadinstitute.hellbender.utils.IntervalUtils.compareLocatables(IntervalUtils.java:85)
    at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantDatum.lambda$getComparator$2(VariantDatum.java:49)
    at java.util.TimSort.binarySort(TimSort.java:296)
    at java.util.TimSort.sort(TimSort.java:239)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1462)
    at java.util.Collections.sort(Collections.java:175)
    at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantDataManager.writeOutRecalibrationTable(VariantDataManager.java:456)
    at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.onTraversalSuccess(VariantRecalibrator.java:695)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:968)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

The output file (mongolian.recal) doesn't contain any variants:
...
##contig=<ID=6,length=171115067,assembly=hs37d5.fa>
##contig=<ID=7,length=159138663,assembly=hs37d5.fa>
##contig=<ID=8,length=146364022,assembly=hs37d5.fa>
##contig=<ID=9,length=141213431,assembly=hs37d5.fa>
##contig=<ID=M,length=16571,assembly=hs37d5.fa>
##contig=<ID=X,length=155270560,assembly=hs37d5.fa>
##contig=<ID=Y,length=59373566,assembly=hs37d5.fa>
##reference=hs37d5
##source=VariantRecalibrator
#CHROM POS ID REF ALT QUAL FILTER INFO
END OF FILE (there are no other rows).

What am I doing wrong?
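
P.S. To debug this, I am thinking of comparing the contig names in my VCF (and in each resource VCF) against the reference sequence dictionary, something like this (untested, and assuming the .dict file sits next to the reference):

# Contig names actually used in the VCF records
grep -v '^#' data/mongolian/Mongolian_genome_sorted_noChr2.snp.vcf | cut -f1 | sort -u

# Contig names (SN: fields) in the reference sequence dictionary
grep '^@SQ' data/humanRefGenome/hs37d5.dict | cut -f2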

About argument presentation


Dear team,
I sometimes find it quite non-intuitive which form to use among

-<LETTER> <value>
<LETTER>=<value>
--some_param <value>

and/or

some_param=<value>

with some Picard and/or GATK commands.
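
For example, I have seen the same hypothetical MarkDuplicates call written in both styles:

# Classic Picard style
java -jar picard.jar MarkDuplicates I=in.bam O=out.bam M=metrics.txt

# Newer POSIX-style arguments in GATK4-era tools
gatk MarkDuplicates -I in.bam -O out.bam -M metrics.txt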

Would it be possible to unify the way arguments are called and separated from their values and to include all valid alternates in the docs?

Is there already a preference for the future to which we should start to converge when we have the choice?

I have scripts that no longer execute with today's tool versions, which makes it painful sometimes.

Thanks in advance

GenomicsDBImport terminates after Overlapping contigs found error


My original query was about batching and making intervals for GenomicsDBImport, but I have run into a new problem. I am using version 4.0.7.0. I tried the following:

gatk GenomicsDBImport \
--java-options "-Xmx250G -XX:+UseParallelGC -XX:ParallelGCThreads=24" \
-V input.list \
--genomicsdb-workspace-path 5sp_45ind_assmb_00 \
--intervals interval.00.list \
--batch-size 9 

where I have split my list of contigs into 50 lists and set the batch size to 9 (instead of reading in 45 g.vcf files at once), for a total of 5 batches. It looked like it had started to run, but it terminated quickly with an error.

The resulting stack trace is:

00:53:23.869 INFO  GenomicsDBImport - HTSJDK Version: 2.16.0
00:53:23.869 INFO  GenomicsDBImport - Picard Version: 2.18.7
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:53:23.869 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:53:23.869 INFO  GenomicsDBImport - Deflater: IntelDeflater
00:53:23.869 INFO  GenomicsDBImport - Inflater: IntelInflater
00:53:23.869 INFO  GenomicsDBImport - GCS max retries/reopens: 20
00:53:23.869 INFO  GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
00:53:23.869 INFO  GenomicsDBImport - Initializing engine
01:26:13.490 INFO  IntervalArgumentCollection - Processing 58057410 bp from intervals
01:26:13.517 INFO  GenomicsDBImport - Done initializing engine
Created workspace /home/leq/gvcfs/5sp_45ind_assmb_00
01:26:13.655 INFO  GenomicsDBImport - Vid Map JSON file will be written to 5sp_45ind_assmb_00/vidmap.json
01:26:13.655 INFO  GenomicsDBImport - Callset Map JSON file will be written to 5sp_45ind_assmb_00/callset.json
01:26:13.655 INFO  GenomicsDBImport - Complete VCF Header will be written to 5sp_45ind_assmb_00/vcfheader.vcf
01:26:13.655 INFO  GenomicsDBImport - Importing to array - 5sp_45ind_assmb_00/genomicsdb_array
01:26:13.656 INFO  ProgressMeter - Starting traversal
01:26:13.656 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
01:33:16.970 INFO  GenomicsDBImport - Importing batch 1 with 9 samples
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
Contig/chromosome ctg7180018354961 begins at TileDB column 0 and intersects with contig/chromosome ctg7180018354960 that spans columns [1380207667, 1380207970]
terminate called after throwing an instance of 'ProtoBufBasedVidMapperException'
  what():  ProtoBufBasedVidMapperException : Overlapping contigs found

How do I overcome this issue of 'overlapping contigs found'? Is there a problem with my set of contigs? Also, is the warning about protocol messages something to worry about?

Thank you!
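
P.S. As a first sanity check I plan to look for contigs that appear more than once across my 50 interval lists (untested):

# Any contig listed twice, within or across the interval files, shows up here
cat interval.*.list | sort | uniq -d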

Issue with GenotypeGVCFs - GATK 4.0.10.1


Hi,
I have got a problem running GenotypeGVCFs. The error is mentioned below.

A USER ERROR has occurred: Bad input: malformed RAW_MQ annotation: 2346838272,3678071

It seems the issue has been raised on GitHub but has not been addressed yet. Here is the link:

https://github.com/broadinstitute/gatk/issues/5433

I was wondering whether anyone could give me a quick solution for it. Thank you.

Cheers,
Kouisk


SplitNCigarReads exception


My script:

java -jar ~/bin/gatk-3.2-2/GenomeAnalysisTK.jar -T SplitNCigarReads -R Gmax.fa -I NPB18L_mark.bam -o NPB18L_snc.bam -U ALLOW_N_CIGAR_READS -fixNDN

When I use -fixNDN, I get:

java.lang.UnsupportedOperationException
at java.util.AbstractList.add(AbstractList.java:148)
at java.util.AbstractList.add(AbstractList.java:108)
at org.broadinstitute.gatk.tools.walkers.rnaseq.SplitNCigarReads.initialize(SplitNCigarReads.java:150)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)

But when I don't use -fixNDN, I get:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Bad input: Cannot split this read (might be an empty section between Ns, for example 1N1D1N): 94M407N1D1033N6M

How can I fix this?

Variants with AD 0,0 and DP 0


I was investigating why some of the variants in my VCF produced with HaplotypeCaller lack some fields, such as QD, DP, MQ, MQRankSum, and BaseQRankSum.

By searching this forum I understood why: some of these values need other annotations to be present in order to be calculated, such as AD, which is needed to calculate QD, and so on.

I then realized that all the variants missing these fields have the same GT:AD:DP values of 1/1:0,0:0; that is, AD is always 0,0 and DP is 0.

I read about informative and uninformative reads; some posts suggest that AD does not include uninformative reads, but DP does. From the documentation of DepthPerSampleHC, it seems that DP in the FORMAT field only includes informative reads, like AD, whereas DP in the INFO field counts all unfiltered reads supporting the call.

Why is there no DP tag in the INFO column, but only in the FORMAT, for some calls?
Why are these variants called at all if AD:DP is 0,0:0 and DP in the INFO is missing?

I report an example:

Qrob_H2.3_Sc0001210 5578 . G T 18.59 . AC=2;AF=1.00;AN=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,0:0:3:45,3,0

A call with complete info is:
Qrob_H2.3_Sc0001407 1861 . A G 739.78 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.371;DP=22;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=32.87;MQRankSum=-1.787;QD=30.87;ReadPosRankSum=0.240;SOR=1.609 GT:AD:DP:GQ:PL 0/1:2,19:21:27:768,0,27

It is only one sample, and I know that is not enough; I am learning how to use the variant calling pipeline for the first time before I run it on all my data. It is just a test, but I would like to understand this.

I apologize if this is a repetitive question; I have searched the forums and got a few hints but did not obtain a satisfactory answer.

Thank you

Variant calling design

Mutect2

GATK4 pre-processing


Dear GATK team.

Hello.

I conducted GATK pre-processing using GATK 4.0.2.

I have some questions.

I used WXS NGS data for mutation calling.

Do you recommend that I use an exome interval BED file when I run recalibration (BaseRecalibrator)?
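
In case it helps, this is the sort of command I have in mind (hypothetical file names):

gatk BaseRecalibrator \
    -I sample.bam \
    -R ref.fasta \
    --known-sites dbsnp.vcf \
    -L exome_targets.interval_list \
    -O recal.table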

Thanks.
