Channel: Recent Discussions — GATK-Forum

Which training sets / arguments should I use for running VQSR?

This document describes the resource datasets and arguments that we recommend for use in the two steps of VQSR (i.e. the successive application of VariantRecalibrator and ApplyRecalibration), based on our work with human genomes, to comply with the GATK Best Practices. The recommendations detailed in this document take precedence over any others you may see elsewhere in our documentation (e.g. in Tutorial articles, which are only meant to illustrate usage, or in past presentations, which may be out of date).

The document covers:

  • Explanation of resource datasets
  • Important notes about annotations
  • Important notes about exome experiments
  • Argument recommendations for VariantRecalibrator
  • Argument recommendations for ApplyRecalibration

These recommendations are valid for use with calls generated by both the UnifiedGenotyper and HaplotypeCaller. In the past we made a distinction in how we processed the calls from these two callers, but now we treat them the same way. These recommendations will probably not work properly on calls generated by other (non-GATK) callers.

Note that VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs (see the VQSR documentation for more details).


Explanation of resource datasets

The human genome training, truth and known resource datasets mentioned in this document are all available from our resource bundle.

If you are working with non-human genomes, you will need to find or generate at least truth and training resource datasets with properties corresponding to those described below. To generate your own resource set, one idea is to first do an initial round of SNP calling and only use those SNPs which have the highest quality scores. These sites which have the most confidence are probably real and could be used as truth data to help disambiguate the rest of the variants in the call set. Another idea is to try using several SNP callers in addition to the UnifiedGenotyper or HaplotypeCaller, and use those sites which are concordant between the different methods as truth data. In either case, you'll need to assign your set a prior likelihood that reflects your confidence in how reliable it is as a truth set. We recommend Q10 as a starting value, which you can then experiment with to find the most appropriate value empirically. There are many possible avenues of research here. Hopefully the model reporting plots that are generated by the recalibration tools will help facilitate this experimentation.
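For example, here is a minimal sketch of the first idea, using GATK3 SelectVariants to keep only the highest-scoring SNPs from an initial callset as a bootstrap truth set (the file names and the QUAL cutoff are assumptions you will need to tune for your own data):

java -jar GenomeAnalysisTK.jar \
   -T SelectVariants \
   -R reference.fasta \
   -V initial_snp_calls.vcf \
   -selectType SNP \
   -select "QUAL > 100.0" \
   -o bootstrap_truth_snps.vcf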

Resources for SNPs

  • True sites training resource: HapMap

    This resource is a SNP call set that has been validated to a very high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). We will also use these sites later on to choose a threshold for filtering variants based on sensitivity to truth sites. The prior likelihood we assign to these variants is Q15 (96.84%).

  • True sites training resource: Omni

    This resource is a set of polymorphic SNP sites produced by the Omni genotyping array. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Non-true sites training resource: 1000G
    This resource is a set of high-confidence SNP sites produced by the 1000 Genomes Project. The program will consider that the variants in this resource may contain true variants as well as false positives (truth=false), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q10 (90%).

  • Known sites resource, not used in training: dbSNP
    This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).

Resources for Indels

  • True sites training resource: Mills
    This resource is an Indel call set that has been validated to a high degree of confidence. The program will consider that the variants in this resource are representative of true sites (truth=true), and will use them to train the recalibration model (training=true). The prior likelihood we assign to these variants is Q12 (93.69%).

  • Known sites resource, not used in training: dbSNP
    This resource is a call set that has not been validated to a high degree of confidence (truth=false). The program will not use the variants in this resource to train the recalibration model (training=false). However, the program will use these to stratify output metrics such as Ti/Tv ratio by whether variants are present in dbSNP or not (known=true). The prior likelihood we assign to these variants is Q2 (36.90%).


Important notes about annotations

Some of the annotations included in the recommendations given below might not be the best for your particular dataset. In particular, the following caveats apply:

  • Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with exome datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.

  • You may have seen HaplotypeScore mentioned in older documents. That is a statistic produced by UnifiedGenotyper and should only be used if you called your variants with UG. HaplotypeCaller does not produce this statistic, because the equivalent mathematics is already built into the likelihood function used when calling full haplotypes.

  • The InbreedingCoeff is a population-level statistic that requires at least 10 samples in order to be computed. For projects with fewer samples, or projects that include many closely related samples (such as a family), please omit this annotation from the command line.


Important notes for exome capture experiments

In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:

  • Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs using GenotypeGVCFs (see the sketch after this list).

  • You can also try using the VQSR with the smaller variant callset, but experiment with argument settings (try adding --maxGaussians 4 to your command line, for example). You should only do this if you are working with a non-model organism for which there are no available genomes or exomes that you can use to supplement your own cohort.
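As referenced in the first bullet above, here is a minimal sketch of the reference-model route in GATK3 syntax (the file names are placeholders):

# Generate a GVCF from a public 1000 Genomes exome BAM
java -jar GenomeAnalysisTK.jar \
   -T HaplotypeCaller \
   -R reference.fasta \
   -I 1kg_exome_sample.bam \
   -ERC GVCF \
   -o 1kg_exome_sample.g.vcf

# Joint-genotype it together with your own samples' GVCFs
java -jar GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R reference.fasta \
   --variant my_sample.g.vcf \
   --variant 1kg_exome_sample.g.vcf \
   -o joint_genotyped.vcf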


Argument recommendations for VariantRecalibrator

The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. One major improvement over previously recommended protocols is that hand filters no longer need to be applied at any point in the process. All filtering criteria are learned from the data itself.

Common, base command line

This is the first part of the VariantRecalibrator command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R path/to/reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -nt 4 \
   [SPECIFY TRUTH AND TRAINING SETS] \
   [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \
   [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \

SNP specific recommendations

For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition, we take the highest-confidence SNPs from the 1000 Genomes Project callset. These datasets are available in the GATK resource bundle.

   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
   -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an InbreedingCoeff \
   -mode SNP \

Please note that these recommendations are formulated for whole-genome datasets. For exomes, we do not recommend using DP for variant recalibration (see below for details of why).

Note also that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibrator can run properly.
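For example, here is a hedged sketch of backfilling missing annotations with VariantAnnotator (the BAM and the annotation list are assumptions; request whichever of the annotations above your VCF lacks):

java -jar GenomeAnalysisTK.jar \
   -T VariantAnnotator \
   -R path/to/reference/human_g1k_v37.fasta \
   -I sample.bam \
   -V raw.input.vcf \
   -A QualByDepth -A FisherStrand -A StrandOddsRatio -A Coverage \
   -o raw.annotated.vcf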

Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.

You may notice that these recommendations no longer include the --numBadVariants argument. That is because we have removed this argument from the tool, as the VariantRecalibrator now determines the number of variants to use for modeling "bad" variants internally based on the data.

Indel specific recommendations

When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.

   --maxGaussians 4 \
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf  \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
   -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -an InbreedingCoeff \
   -mode INDEL \

Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed, since there is a conflation between the length of an indel in a read and the degradation in mapping quality that the aligner assigns to the read. This covariation is not necessarily indicative of error in the same way that it is for SNPs.


Argument recommendations for ApplyRecalibration

The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.

Common, base command line

This is the first part of the ApplyRecalibration command line, to which you need to add either the SNP-specific recommendations or the indel-specific recommendations given further below.

java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \
   [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \

SNP specific recommendations

For SNPs we use HapMap 3.3 and the Omni 2.5M chip as our truth set. We typically seek to achieve 99.5% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.5 \
   -mode SNP \

Indel specific recommendations

For indels we use the Mills / 1000 Genomes indel truth set described above. We typically seek to achieve 99.0% sensitivity to the accessible truth sites, but this is by no means universally applicable: you will need to experiment to find out what tranche cutoff is right for your data. Generally speaking, projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity than projects with a smaller scope.

   --ts_filter_level 99.0 \
   -mode INDEL \
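
For clarity, here is the fully assembled indel command, i.e. the base command plus the indel-specific arguments above:

java -Xmx3g -jar GenomeAnalysisTK.jar \
   -T ApplyRecalibration \
   -R reference/human_g1k_v37.fasta \
   -input raw.input.vcf \
   -tranchesFile path/to/input.tranches \
   -recalFile path/to/input.recal \
   -o path/to/output.recalibrated.filtered.vcf \
   --ts_filter_level 99.0 \
   -mode INDEL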

all reads are filtered out at haplotypecaller

I am trying to apply variant calling to RNA-seq data according to the GATK Best Practices. I got the following output at the HaplotypeCaller step:

INFO 00:50:15,597 ProgressMeter - Total runtime 2246.28 secs, 37.44 min, 0.62 hours
INFO 00:50:15,598 MicroScheduler - 78891062 reads were filtered out during the traversal out of approximately 78891062 total reads (100.00%)
INFO 00:50:15,598 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 00:50:15,598 MicroScheduler - -> 41106793 reads (52.11% of total) failing DuplicateReadFilter
INFO 00:50:15,599 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 00:50:15,599 MicroScheduler - -> 10909111 reads (13.83% of total) failing HCMappingQualityFilter
INFO 00:50:15,599 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 00:50:15,599 MicroScheduler - -> 26875158 reads (34.07% of total) failing MappingQualityUnavailableFilter
INFO 00:50:15,599 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 00:50:15,599 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

All reads are filtered out, so no data gets through to the next step. The RNA-seq data were previously aligned with STAR in two-pass mode, and duplicates were marked with Picard MarkDuplicates. I then ran AddOrReplaceReadGroups on the markduplicates BAM and continued the pipeline from there. I can't understand why this happens.
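(For reference: the two dominant filters above, MappingQualityUnavailableFilter and HCMappingQualityFilter, may indicate reads carrying STAR's default MAPQ of 255, which the GATK RNA-seq workflow reassigns during the SplitNCigarReads step. A minimal sketch of that step in GATK3 syntax, with file names assumed:)

# Split N-CIGAR reads and reassign STAR's MAPQ 255 to 60
java -jar GenomeAnalysisTK.jar \
   -T SplitNCigarReads \
   -R reference.fasta \
   -I markduplicates_with_readgroups.bam \
   -o split.bam \
   -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 \
   -U ALLOW_N_CIGAR_READS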

Any suggestions?

bug in docs

gatk-4.0.4.0 CombineGVCF

On running CombineGVCFs from gatk-4.0.4.0, nothing happens... even the help text does not appear; Java just hangs. Has anyone experienced this issue? This is the command I used:

java -jar gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar CombineGVCFs

Java version 1.8.0
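
(For comparison, a minimal sketch of a complete invocation — with no tool arguments at all, the launcher should normally print usage rather than hang. The reference and GVCF names below are assumptions:)

java -jar gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar CombineGVCFs \
   -R reference.fasta \
   --variant sample1.g.vcf.gz \
   --variant sample2.g.vcf.gz \
   -O combined.g.vcf.gz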

Could not find walker with name: "RealignerTargetCreator" and "BaseRecalibrator"

Please find below the command and the error I am getting. I tried all versions from 3.4 to 3.8.1 but couldn't figure it out. The same error happens with "BaseRecalibrator" as well. Please let me know how to resolve this problem.

brontobyte@brontobyte:~/Satish$ java -Xmx4g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fa -o Exome_sample_RG_index.marked.bam.list -I Exome_sample_RG_index.marked.bam -known dbsnp_138.hg19.vcf -L nexterarapidcapture_expandedexome_targetedregions.bed

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid command line: Malformed walker argument: Could not find walker with name: RealignerTargetCreator

Running GATK WDL on FireCloud with TCGA controlled bam files

Hi, GATK team!

I have an issue with the GATK 4.0 pipeline when running an analysis on FireCloud.

I am going to run GATK on TCGA controlled mRNA-Seq BAM files. As far as I know, FireCloud offers TCGA level 1 BAM files, named *.sorted_genome_alignments.bam. So I ran the pipeline from the MarkDuplicates step according to rnaseq-germline-snps-indels, the public WDL example the GATK team put forward on FireCloud. I then set the proper parameters and workspace attributes in the configuration, especially the reference fasta as gs://broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta, but I got an error like this:

[Mon Jun 11 04:59:58 UTC 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 18.94 minutes.
Runtime.totalMemory()=24761073664
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: This program requires input that are either coordinate or query sorted. Found unsorted
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:254)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar MarkDuplicates --INPUT /cromwell_root/5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/LUAD/RNA/RNA-Seq/UNC-LCCC/ILLUMINA/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --CREATE_INDEX true --VALIDATION_STRINGENCY SILENT --METRICS_FILE UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.metrics

Since the name of the input BAM file includes "sorted", I thought it was reasonable to add the option --ASSUME_SORTED, based on solutions other people and GATK staff had posted. The MarkDuplicates step then worked, but in the next step, SplitNCigarReads, an error occurred:

INFO  12:59:01,086 HelpFormatter - Program Args: -T SplitNCigarReads -R /cromwell_root/broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta -I /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/e44da35f-1087-423f-95ea-53944c30c5f2/RNAseq/e5b8550b-f301-47b7-a709-f6d91554ab6f/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam -o UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS 
INFO  12:59:01,089 HelpFormatter - Executing as root@1ffd1fee7d64 on Linux 4.9.0-0.bpo.6-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-8u111-b14-2~bpo8+1-b14. 
INFO  12:59:01,089 HelpFormatter - Date/Time: 2018/06/19 12:59:01 
INFO  12:59:01,090 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  12:59:01,090 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  12:59:01,234 GenomeAnalysisEngine - Strictness is SILENT 
INFO  12:59:01,292 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  12:59:01,298 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
WARNING: BAM index file /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/e44da35f-1087-423f-95ea-53944c30c5f2/RNAseq/e5b8550b-f301-47b7-a709-f6d91554ab6f/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bai is older than BAM /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/e44da35f-1087-423f-95ea-53944c30c5f2/RNAseq/e5b8550b-f301-47b7-a709-f6d91554ab6f/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam
INFO  12:59:01,319 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 
INFO  12:59:02,073 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM_rCRS, chrX, chrY]
##### ERROR ------------------------------------------------------------------------------------------

I tried to fix this by adding a ReorderSam step, as the error message suggested. The reference fasta I used was still Homo_sapiens_assembly19_1000genomes_decoy.fasta, but it still didn't work. The error message was:

[Mon Jun 11 15:30:14 UTC 2018] picard.sam.ReorderSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=665845760
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: New reference sequence does not contain a matching contig for chr1
    at picard.sam.ReorderSam.buildSequenceDictionaryMap(ReorderSam.java:263)
    at picard.sam.ReorderSam.doWork(ReorderSam.java:146)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar ReorderSam --INPUT /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/64c3a79b-04f8-46f8-a238-8717380c7768/RNAseq/4e8ce380-6f4b-41f6-b1d2-4fe11ed8fa68/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.reorder.bam -R /cromwell_root/broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta --CREATE_INDEX true

I then added the options --VALIDATION_STRINGENCY LENIENT and --ALLOW_INCOMPLETE_DICT_CONCORDANCE, and got this:

Ignoring SAM validation error: ERROR: Record 178984837, Read name UNC9-SN296_246:4:1107:4151:192010/2, Mapped mate should have mate reference name
Ignoring SAM validation error: ERROR: Record 178984905, Read name UNC9-SN296_246:4:2205:17136:94561/2, Mapped mate should have mate reference name
INFO 2018-06-13 03:23:44 ReorderSam Wrote 186956859 reads
[Wed Jun 13 03:23:46 UTC 2018] picard.sam.ReorderSam done. Elapsed time: 40.71 minutes.
Runtime.totalMemory()=10954997760
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Exception when processing alignment for BAM index UNC9-SN296_246:4:1101:10000:103197/2 2/2 50b aligned read.
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:140)
    at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:226)
    at htsjdk.samtools.AsyncSAMFileWriter.synchronouslyClose(AsyncSAMFileWriter.java:38)
    at htsjdk.samtools.util.AbstractAsyncWriter.close(AbstractAsyncWriter.java:89)
    at picard.sam.ReorderSam.doWork(ReorderSam.java:167)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record UNC9-SN296_246:4:1101:10000:103197/2 2/2 50b aligned read.
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:119)
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:137)
    ... 9 more
Caused by: htsjdk.samtools.SAMException: Unexpected reference -1 when constructing index for 0 for record UNC9-SN296_246:4:1101:10000:103197/2 2/2 50b aligned read.
    at htsjdk.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:218)
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:117)
    ... 10 more
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx32G -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar ReorderSam --INPUT /cromwell_root/fc-85926c4b-dcec-49b1-a0b1-446abe208477/4cc0531d-5121-4140-8344-f38235f035fd/RNAseq/f7b4b882-effb-43ed-a70a-76720b2d8772/call-MarkDuplicates/UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.dedupped.bam --OUTPUT UNCID_1209060.e6a101b9-61f9-4ed1-a59f-d9db3fdb4555.sorted_genome_alignments.reorder.bam -R /cromwell_root/broad-references/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta --CREATE_INDEX true --VALIDATION_STRINGENCY LENIENT --ALLOW_INCOMPLETE_DICT_CONCORDANCE

It seems the input sorted_genome_alignments.bam file was aligned to a reference fasta different from the one my pipeline uses. Although I looked through the metadata that TCGA provides on their official website and the description of the TCGA controlled-access workspace in FireCloud, I couldn't find which specific reference fasta file was used.

Could you please provide some help to solve the problem?

GATK4 GenomicsDBImport very slow

Dear GATK team,

we recently implemented a new WGS pipeline based on GATK4 (now 4.0.5.1). GVCFs were generated using HaplotypeCaller as per the best practice guide.

However, for a current analysis based on ~200 human genomes (30X), many of the individual GenomicsDBImport jobs take several days, even though we are parallelizing using the WGS intervals provided in the GATK bundle (b37, I think).

Code snippet:

gatk --java-options "-Xmx${task.memory.toGiga()}G" GenomicsDBImport \
   --variant ${vcf_list.join(" --variant ")} \
   --reference $REF \
   -L $region \
   --reader-threads ${task.cpus} \
   --genomicsdb-workspace-path $genodb

reader-threads are currently set to 8, and memory to 64GB. All VCFs are gzipped GVCFs with accompanying index file (.tbi).

The only "potential" hardware issue I can see is that the individual nodes are currently connected only via 1Gbit to our parallel storage, which is not ideal, but also seems "ok'ish", since we are only dealing with GVCFs.

Anything fundamental I am missing that may help to improve the speed?
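
(One setting that may be worth trying is --batch-size, which caps how many GVCF readers GenomicsDBImport holds open at once and can reduce memory and filesystem pressure with ~200 samples. The flag is a real GenomicsDBImport option; the value 50 below is just an assumed starting point:)

gatk --java-options "-Xmx${task.memory.toGiga()}G" GenomicsDBImport \
   --variant ${vcf_list.join(" --variant ")} \
   --reference $REF \
   -L $region \
   --batch-size 50 \
   --reader-threads ${task.cpus} \
   --genomicsdb-workspace-path $genodb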

Kind regards,
Marc

Combine phased calls from Mutect2

Hello, is there a way to have Mutect2 emit multi-nucleotide variants instead of multiple adjacent SNVs?

For example, consider this variant:

REF: AGGT
ALT: ATCT

Mutect will call the G/T SNP in position 2 as one line, and the G/C SNP at position 3 as another line. Then the fact that they are part of the same haplotype is indicated by the phasing information in the info column of the vcf.

I would prefer to have it call a multiple nucleotide variant: REF GG and ALT TC.

Can I get Mutect to do this? Or is there any post-processing tool you can recommend?

The reason for preferring MNVs instead of SNVs is that I am using ensembl-VEP to predict the protein consequences of the variants. In that case it's quite important to represent the actual haplotypes instead of stepping through variant sites one-by-one.

Thanks,
Patrick


HaplotypeCaller ERROR with ERC invalid

Hi everyone,
I'm a really novice on GATK. I want to call SNPs from 12 samples and I got some troubles.
First, I ran HaplotypeCaller on the 12 samples' BAM files combined. Each BAM file represents one sample and was generated from two PE sequencing files (R1 & R2). However, I got this error:

A USER ERROR has occurred: Argument --emitRefConfidence has a bad value: Can only be used in single sample mode currently. Use the sample_name argument to run on a single sample out of a multi-sample BAM file.

Referring to earlier reports of this problem, it seems the BAM file contains reads from multiple lanes. I checked the head and tail lines of the SAM file. The head lines are as follows:
HWI-7001446:625:C9078ANXX:6:1101:1209:2074 ……
HWI-7001446:625:C9078ANXX:6:1101:1209:2074 ……
HWI-7001446:625:C9078ANXX:6:1101:1232:2088 ……
HWI-7001446:625:C9078ANXX:6:1101:1232:2088 ……
And the tail lines are as follows:
ST-E00159:217:HYL2KCCXX:6:2224:23979:72667 ……
ST-E00159:217:HYL2KCCXX:6:2224:23979:72667 ……
ST-E00159:217:HYL2KCCXX:6:2224:24000:72667 ……
ST-E00159:217:HYL2KCCXX:6:2224:24000:72667 ……

Next, I fixed the read groups with the command "AddOrReplaceReadGroups" and reran HaplotypeCaller on the reformatted BAM file. The .g.vcf file was generated, but I got another error:

A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.
Please index all input files:
samtools index xxx.group.bam

Is the generated .g.vcf file valid? Why did I get the two ERROR messages?

Thank you very much.

Best wishes,

Tony

Outlook on GRCh38/hg38 for exome and other targeted sequencing

Dear GATK team,

First of all, congratulations on releasing GATK4!

I was wondering, on this page: https://software.broadinstitute.org/gatk/download/bundle it is mentioned that the human genome reference builds you support actively are the following:
For Best Practices short variant discovery in exome and other targeted sequencing: b37/hg19

Last year we built an RNAseq pipeline and a preliminary DNAseq pipeline around GRCh38. Can you perhaps indicate how far out the publication of Best Practices for short variant discovery in exome and other targeted sequencing using GRCh38 is?

By the way, the link below the bullet points (https://software.broadinstitute.org/gatk/user%20guide/article.php?id=1213) gives a 404.

Keep up the good work,

Highest regards,

Freek.

Invalid or corrupt jarfile

When I run

./gatk --help

it seems to be working fine. However, running anything else such as

./gatk --list

produces an error:

Error: Invalid or corrupt jarfile /path/to/gatk/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar

What's going on? Sorry, this might be a noob question.

GATK4 - VariantFiltration --genotype-filter-expression

Hello there,
I am trying to apply some sample-level filters to a VCF generated with GATK 4.0.2.1. My issue is that not all variant sites are getting an FT flag added, and I am wondering why. Additionally, "PASS" is being added to the FILTER column at the variant level (I am not sure if this behavior is expected, but it seems weird).

Here is some information about the system:

17:43:04.589 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/szs315/libgkl_compression8694733123384787175.so
17:43:04.681 INFO  VariantFiltration - ------------------------------------------------------------
17:43:04.681 INFO  VariantFiltration - The Genome Analysis Toolkit (GATK) v4.0.2.1
17:43:04.681 INFO  VariantFiltration - For support and documentation go to https://software.broadinstitute.org/gatk/
17:43:04.681 INFO  VariantFiltration - Executing as szs315@quser12 on Linux v3.10.0-514.36.5.el7.x86_64 amd64
17:43:04.681 INFO  VariantFiltration - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_112-b16
17:43:04.682 INFO  VariantFiltration - Start Date/Time: March 11, 2018 6:43:04 PM CDT
17:43:04.682 INFO  VariantFiltration - ------------------------------------------------------------
17:43:04.682 INFO  VariantFiltration - ------------------------------------------------------------
17:43:04.682 INFO  VariantFiltration - HTSJDK Version: 2.14.3
17:43:04.682 INFO  VariantFiltration - Picard Version: 2.17.2
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.BUFFER_SIZE : 131072
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.COMPRESSION_LEVEL : 1
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.CREATE_INDEX : false
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.CREATE_MD5 : false
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.CUSTOM_READER_FACTORY : 
17:43:04.684 INFO  VariantFiltration - HTSJDK Defaults.DISABLE_SNAPPY_COMPRESSOR : false
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.EBI_REFERENCE_SERVICE_URL_MASK : https://www.ebi.ac.uk/ena/cram/md5/%s
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.NON_ZERO_BUFFER_SIZE : 131072
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.REFERENCE_FASTA : null
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:43:04.685 INFO  VariantFiltration - HTSJDK Defaults.USE_CRAM_REF_DOWNLOAD : false
17:43:04.685 DEBUG ConfigFactory - Configuration file values: 
17:43:04.688 DEBUG ConfigFactory -  gcsMaxRetries = 20
17:43:04.688 DEBUG ConfigFactory -  gatk_stacktrace_on_user_exception = false
17:43:04.688 DEBUG ConfigFactory -  samjdk.use_async_io_read_samtools = false
17:43:04.688 DEBUG ConfigFactory -  samjdk.compression_level = 1
17:43:04.688 DEBUG ConfigFactory -  samjdk.use_async_io_write_samtools = true
17:43:04.688 DEBUG ConfigFactory -  samjdk.use_async_io_write_tribble = false
17:43:04.688 DEBUG ConfigFactory -  spark.kryoserializer.buffer.max = 512m
17:43:04.688 DEBUG ConfigFactory -  spark.driver.maxResultSize = 0
17:43:04.688 DEBUG ConfigFactory -  spark.driver.userClassPathFirst = true
17:43:04.688 DEBUG ConfigFactory -  spark.io.compression.codec = lzf
17:43:04.688 DEBUG ConfigFactory -  spark.yarn.executor.memoryOverhead = 600
17:43:04.689 DEBUG ConfigFactory -  spark.driver.extraJavaOptions = 
17:43:04.689 DEBUG ConfigFactory -  spark.executor.extraJavaOptions = 
17:43:04.689 DEBUG ConfigFactory -  codec_packages = [htsjdk.variant, htsjdk.tribble, org.broadinstitute.hellbender.utils.codecs]
17:43:04.689 DEBUG ConfigFactory -  cloudPrefetchBuffer = 40
17:43:04.689 DEBUG ConfigFactory -  cloudIndexPrefetchBuffer = -1
17:43:04.689 DEBUG ConfigFactory -  createOutputBamIndex = true
17:43:04.689 INFO  VariantFiltration - Deflater: IntelDeflater
17:43:04.689 INFO  VariantFiltration - Inflater: IntelInflater
17:43:04.689 INFO  VariantFiltration - GCS max retries/reopens: 20
17:43:04.689 INFO  VariantFiltration - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
17:43:04.689 INFO  VariantFiltration - Initializing engine

Here is the command I used to apply the filters

 gatk-launch VariantFiltration \
-variant wild_isolate.vcf.gz \
--genotype-filter-expression "DP < 2" \
--genotype-filter-name "depth" \
-O wi_dp_tet.vcf  \
--verbosity DEBUG \
--seconds-between-progress-updates 0.1 \
--disable-tool-default-read-filters true \
--lenient true \
--disable-sequence-dictionary-validation true \
--disable-bam-index-caching true

I added the --verbosity flag and all the other flags below --verbosity after I noticed some variants were not receiving the FT field. I thought there might be some default filters being applied that result in variants being skipped (maybe these flags need to be applied at previous steps?). I ran this step with and without those flags, and with/without the -R flag.

I am running this on a test dataset to make sure my pipeline is working properly... 45576 variants are not receiving the FT field and 127762 variants did receive it. Also, note that I am not going through the VQSR procedure because I do not have a truth set.

As for the steps preceding VariantFiltration, I ran HaplotypeCaller in DISCOVERY mode with -ERC GVCF (in chromosome blocks), ran ValidateVariants, combined chromosome gVCFs for each sample using CombineGVCFs, combined individual sample gVCFs with GenomicsDBImport, ran GenotypeGVCFs on individual chromosomes, and collapsed the chromosome VCFs using GatherVcfs.

Here are the last few entries of test VCF, highlighting the inconsistent FORMAT/FT field.

MtDNA   12998   .   C   A,T 2457.39 PASS    AC=8,6;AF=0.571,0.429;AN=14;AS_QD=15.04,31.74;DP=74;ExcessHet=3.0103;FS=0.000;GQ_MEAN=31.14;GQ_STDDEV=28.46;MLEAC=8,6;MLEAF=0.571,0.429;MQ=59.59;NCC=1;QD=33.66;SOR=0.720   GT:AD:DP:GQ:PL  1/1:0,2,0:2:6:80,6,0,80,6,80    2/2:0,0,2:2:6:83,83,83,6,6,0    1/1:0,3,0:3:9:125,9,0,125,9,125 ./.:1,0,0:1:.:0,0,0,0,0,0   1/1:0,22,0:22:66:817,66,0,817,66,817    1/1:0,8,0:8:24:235,24,0,235,24,235  2/2:0,0,11:11:33:383,383,383,33,33,0    2/2:0,0,25:25:74:749,749,749,74,74,0
MtDNA   13029   .   T   C   74.63   PASS    AC=2;AF=0.125;AN=16;AS_QD=32.99;DP=62;ExcessHet=0.1472;FS=0.000;GQ_MEAN=22.13;GQ_STDDEV=20.47;MLEAC=1;MLEAF=0.063;MQ=60.00;NCC=0;QD=26.41;SOR=0.693 GT:AD:DP:FT:GQ:PL   1/1:0,2:2:PASS:6:90,6,0 0/0:1,0:1:depth:3:0,3,34    0/0:5,0:5:PASS:15:0,15,195  0/0:1,0:1:depth:3:0,3,32    0/0:18,0:18:PASS:48:0,48,720    0/0:7,0:7:PASS:21:0,21,213  0/0:8,0:8:PASS:24:0,24,288  0/0:20,0:20:PASS:57:0,57,855
MtDNA   13069   .   T   C   2144.05 PASS    AC=12;AF=1.00;AN=12;AS_QD=27.59;DP=51;ExcessHet=3.0103;FS=0.000;GQ_MEAN=25.50;GQ_STDDEV=13.52;MLEAC=14;MLEAF=1.00;MQ=60.00;NCC=2;QD=30.55;SOR=0.994 GT:AD:DP:GQ:PL  1/1:0,2:2:6:87,6,0  ./.:0,0:0:.:0,0,0   1/1:0,7:7:21:292,21,0   ./.:0,0:0:.:0,0,0   1/1:0,12:12:36:531,36,0 1/1:0,7:7:21:259,21,0   1/1:0,8:8:24:334,24,0   1/1:0,15:15:45:620,45,0
MtDNA   13208   .   C   T   788.24  PASS    AC=6;AF=0.500;AN=12;AS_QD=25.73;DP=53;ExcessHet=0.1809;FS=0.000;GQ_MEAN=20.00;GQ_STDDEV=19.22;MLEAC=8;MLEAF=0.667;MQ=60.00;NCC=2;QD=28.92;SOR=1.127 GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0   0/0:2,0:2:6:0,6,65  1/1:0,4:4:12:157,12,0   ./.:0,0:0:.:0,0,0   1/1:0,8:8:24:341,24,0   1/1:0,8:8:24:303,24,0   0/0:13,0:13:0:0,0,353   0/0:18,0:18:54:0,54,472
MtDNA   13344   .   G   A   226.02  PASS    AC=2;AF=0.200;AN=10;AS_QD=28.25;DP=17;ExcessHet=0.2482;FS=0.000;GQ_MEAN=9.60;GQ_STDDEV=8.85;MLEAC=3;MLEAF=0.300;MQ=60.00;NCC=3;QD=28.25;SOR=1.179   GT:AD:DP:FT:GQ:PL   0/0:1,0:1:depth:3:0,3,39    ./.:0,0:0:PASS:.:0,0,0  ./.:0,0:0:PASS:.:0,0,0  ./.:0,0:0:PASS:.:0,0,0  0/0:2,0:2:PASS:3:0,3,45 0/0:4,0:4:PASS:12:0,12,136  0/0:2,0:2:PASS:6:0,6,88 1/1:0,8:8:PASS:24:239,24,0
MtDNA   13700   .   TA  T   49.17   PASS    AC=2;AF=0.250;AN=8;AS_QD=24.58;DP=24;ExcessHet=0.3218;FS=0.000;GQ_MEAN=17.25;GQ_STDDEV=7.89;MLEAC=2;MLEAF=0.250;MQ=48.99;NCC=4;QD=24.58;RPA=8,7;RU=A;SOR=2.303;STR  GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   ./.:1,0:1:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:7,0:7:21:0,21,298   0/0:6,0:6:18:0,18,141   1/1:0,2:2:6:61,6,0  0/0:8,0:8:24:0,24,211

Any and all help is appreciated! I'm hoping it is something simple!

Thanks

Picard SortVcf Error

Hello.

I am using GATK version 3.6, picard-2.8.2.jar

I downloaded hapmap_3.3.hg38.vcf from gatk resource bundle. I then used the below command to remove chr notation.
awk '{gsub(/^chr/,""); print}' hapmap_3.3.hg38.vcf > no_chr_hapmap_3.3.hg38.vcf.vcf

Before (hapmap_3.3.hg38.vcf)
chr1 2242065 rs263526 T C . PASS AC=724;AF=0.259;AN=2792
chr1 2242417 rs16824926 C . . PASS AN=530
chr1 2242880 rs11581436 A . . PASS AN=540

After (no_chr_hapmap_3.3.hg38.vcf.vcf)
1 6421563 rs4908891 G A . PASS AC=1086;AF=0.389;AN=2792
1 6421782 rs4908892 A G . PASS AC=1692;AF=0.606;AN=2792
1 6421856 rs12078257 T C . PASS AC=368;AF=0.132;AN=2790

Then I used Picard SortVcf to sort no_chr_hapmap_3.3.hg38.vcf.vcf:
java -jar picard-2.8.2.jar SortVcf I=removedChr_HapMap.vcf O=sortedHapMap.vcf SEQUENCE_DICTIONARY=hg38.dict

hg38.dict
@SQ SN:1 LN:248956422 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:2648ae1bacce4ec4b6cf337dcae37816
@SQ SN:10 LN:133797422 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:907112d17fcb73bcab1ed1c72b97ce68
@SQ SN:11 LN:135086622 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:1511375dc2dd1b633af8cf439ae90cec
@SQ SN:12 LN:133275309 UR:file:/media/ubuntu/Elements/TOOL/hg38.fa M5:e81e16d3f44337034695a29b97708fce

I have then encountered this error:

Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=248956422,dict_index=0,assembly=20) was found when SAMSequenceRecord(name=1,length=248956422,dict_index=0,assembly=null) was expected.
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:126)
at picard.vcf.SortVcf.doWork(SortVcf.java:95)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=248956422,dict_index=0,assembly=20) was found when SAMSequenceRecord(name=1,length=248956422,dict_index=0,assembly=null) was expected.
at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:170)
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:124)
... 4 more
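
(One hedged guess, for reference: the awk above only strips "chr" from the start of each line, so any ##contig=<ID=chr1,...> lines in the VCF header keep their chr names, and SortVcf compares that header dictionary against hg38.dict — which would match the "found chr1, expected 1" message above. A sed sketch that rewrites both the data lines and the header contig IDs, assuming the header uses the ##contig=<ID=...> form:)

sed -e 's/^chr//' -e 's/##contig=<ID=chr/##contig=<ID=/' hapmap_3.3.hg38.vcf > no_chr_hapmap_3.3.hg38.vcf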

I have tried many times but keep getting the same error. Kindly advise how I can solve this problem.

I would then like to run SelectVariants to extract variants that are missing from HapMap but present in my dataset.

Thank you so much in advance.

Cheers,
Moon

GATK4 silly "multithreading" workaround

Hi,
I'm working with RNAseq samples from sunflower. Right now I have samples from 8 genotypes (3 biological replicates each). These eight genotypes arise from the same biparental crossing; so the genotypes are related.

I'm using GATK4 in a VM with 16 Intel Xeon E7-4860 processors (they don't support AVX) and 32 GB RAM + 16 GB swap (I can ask for more).

Since GATK4 no longer has the multithreading options (-nt and -nct), I often cannot take advantage of all the processors. Because of this, I have been trying the Spark versions of the tools, but I don't really want to use them until you approve them officially.

Also, I tried a silly approach to "multithreading": dividing the genome into 16 intervals, running 16 parallel commands with the -L option, and then merging the results (see the sketch below).

My question is: how wrong is this approach?
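
(A sketch of the scatter-gather pattern just described, assuming GATK4 HaplotypeCaller, 16 pre-made interval lists in genomic order, and placeholder file names:)

# Scatter: one HaplotypeCaller job per interval list, run in parallel
for i in $(seq -w 1 16); do
   gatk HaplotypeCaller -R reference.fasta -I sample.bam \
      -L intervals/${i}.interval_list \
      -ERC GVCF -O sample.${i}.g.vcf.gz &
done
wait

# Gather: concatenate the per-interval GVCFs back into one file
gatk MergeVcfs \
   $(for i in $(seq -w 1 16); do printf -- '-I sample.%s.g.vcf.gz ' "$i"; done) \
   -O sample.g.vcf.gz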

Artifact list on targeted panel data

Hi!

As is well described, targeted panel data is difficult to filter due to the lack of tools like FilterByOrientationBias that apply to WES data. I was wondering, however, whether GATK can build a "pool of artifacts" from low-quality and/or user-specified false positives and use it as a proper filtering method.

Any idea if this exists, or whether it could be implemented easily?
Thanks!


CatVariants - missing variants gatk3.7

Hello,
I ran HaplotypeCaller separately on different intervals of the genome. Each interval is a set of contigs, with no overlap between intervals.
Afterwards, I used CatVariants to concatenate the per-interval g.vcf files into a single g.vcf per sample. This seemed to work fine, but when checking the number of variants before and after concatenating, I am missing around 300,000-400,000 variants out of a total of ~5 million.

Do you have any idea why this could be happening? Thanks!

Spark tool TaskMemoryManager WARN

Using ApplyBQSRSpark I got the following WARN messages and the run stopped. There were more than 60 GB of RAM free on the server at the time, and every time I launch the command it gives the same output. This is the command I run using docker:

/gatk/gatk ApplyBQSRSpark \
-I ${INPUT_FILE_BAM} \
--bqsr-recal-file recal_data_.table \
-O /BQSRS_.bam

18/06/13 10:51:15 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 172.17.0.2:44253 in memory (size: 5.7 KB, free: 15.8 GB)
18/06/13 11:04:55 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again.
18/06/13 11:06:32 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
18/06/13 11:08:03 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again.
18/06/13 11:09:51 WARN TaskMemoryManager: Failed to allocate a page (8388608 bytes), try again.
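
(For reference, a hedged sketch of the same invocation with an explicit heap and a pinned local Spark master — arguments after the bare -- are passed to the Spark runner; the heap size and core count below are assumptions:)

/gatk/gatk --java-options "-Xmx32G" ApplyBQSRSpark \
   -I ${INPUT_FILE_BAM} \
   --bqsr-recal-file recal_data_.table \
   -O /BQSRS_.bam \
   -- \
   --spark-runner LOCAL --spark-master 'local[8]'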

Using NIO with GATK4 HaplotypeCaller

Is GATK4 HaplotypeCaller NIO compatible? If not, is there another version that is?

Thanks!

Appropriate WGS Mutect normal_panel

VariantEval on non-vcf files

Hello GATK team,

I am learning to use VariantEval, but my input files are not standard VCF files. I compared somatic variant-calling results from two programs and want to evaluate the overlap. Simply put, my overlap file contains only four columns: chr, pos, genotype of the control sample, genotype of the tumor sample. The genotypes are two letters -- I guess I can change that to one letter to record the variant allele, but I still do not have a full VCF table.

I could use the position information and extract lines from one of the programs' outputs. For example, one program is Strelka, and its output looks like:

chr1 4159398 . C T . PASS NT=ref;QSS=21;QSS_NT=21;SGT=CC->CT;SOMATIC;TQSS=1;TQSS_NT=1 DP:FDP:SDP:SUBDP:AU:CU:GU:TU 30:0:0:0:0,0:30,31:0,0:0,0 24:0:0:0:0,0:20,20:0,0:4,4

The somatic change is recorded in bold in the original post (the SGT=CC->CT annotation). Do you think this is an appropriate input format?
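
For illustration, here is a minimal sketch that wraps such a four-column table in a VCF skeleton VariantEval can read. It assumes the third and fourth columns can be reduced to reference and alternate alleles; a real file would also need ##contig lines matching your reference:

# overlap.txt columns assumed: chrom pos ref_allele alt_allele
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > overlap.vcf
awk 'BEGIN{OFS="\t"} {print $1, $2, ".", $3, $4, ".", ".", "."}' overlap.txt >> overlap.vcf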

Any advice is appreciated. Thank you.

Helen
