manba and picard_gatk_mj questions
SVPreprocess Error: Alignment file does not exist
Dear Genome STRiP users,
I ran SVPreprocess successfully on one cohort. Now I am applying the same script to another cohort to call the same variants. However, unexpected errors like the following example are raised:
Exception in thread "main" org.broadinstitute.sv.commandline.ArgumentException: Alignment file does not exist: /proj/yunligrp/users/minzhi/gs/PAGE_chr16/bam_PAGE_chr16_1-500000/H_TK-12498-AB33938473
at org.broadinstitute.sv.dataset.SAMLocation.create(SAMLocation.java:99)
at org.broadinstitute.sv.commandline.CommandLineParser.createSAMLocation(CommandLineParser.java:256)
at org.broadinstitute.sv.commandline.CommandLineParser.parseSAMLocationFile(CommandLineParser.java:247)
at org.broadinstitute.sv.commandline.CommandLineParser.parseSAMLocations(CommandLineParser.java:234)
at org.broadinstitute.sv.commandline.CommandLineParser.parseSAMLocations(CommandLineParser.java:220)
at org.broadinstitute.sv.apps.ExtractBAMSubset.run(ExtractBAMSubset.java:79)
at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
at org.broadinstitute.sv.apps.ExtractBAMSubset.main(ExtractBAMSubset.java:74)
I have 950 samples, and nearly 900 of them hit this kind of error during SVPreprocess. I am not sure whether this is related to the reference file, but I am using the reference file listed on the Broad Institute's website. In my former analysis the alignment file was a single file, header.bam, for all 3418 samples, but now it looks like the alignment files are different for each BAM file. Is this related to my script? I have attached my script below. May I have your suggestions? Thank you very much.
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
gs_dir="$4"
rundir="${gs_dir}/$7_$1/$3"
java -Xmx4g -cp ${classpath} \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/SVPreprocess.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-cp ${classpath}\
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-configFile ${SV_DIR}/conf/genstrip_parameters.txt \
-R ${gs_dir}/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
-L $1:$2 \
-I ${gs_dir}/$7_$1/supporting_$7_$1/$7_$1_$5_sample.list \
-md ${rundir}/md_tempdir \
-tempDir ${gs_dir}/gs_tempdir/svpre_tmp \
-runDirectory ${rundir} \
-ploidyMapFile ${gs_dir}/$7_$1/supporting_$7_$1/$7_$1_$8_ploidy.map \
-jobLogDir ${rundir}/logs \
-run \
|| exit 1
Best regards,
Wusheng
How to set the HaplotypeCaller ploidy argument, especially for pooled samples?
If the species is not diploid, do I need to set the ploidy argument myself?
What does the following mean?
"For pooled data, set to (Number of samples in each pool * Sample Ploidy)."
In GVCF mode we call one sample at a time, so when do we need to set it to (Number of samples in each pool * Sample Ploidy)?
Do annotations in the INFO column of a Mutect2 VCF only refer to the ALT?
Because a somatic callset is based on a single individual rather than a cohort, annotations in the INFO column of a Mutect2 VCF only refer to the ALT alleles and do not include values for the REF allele. This differs from a germline cohort callset, in which annotations in the INFO field are typically derived from data related to all observed alleles including the reference.
Does this mean there is only a TLOD value in the INFO column of a somatic VCF?
BaseRecalibrator help is misleading (gatk 4.0.12.0)
I was trying to use the BaseRecalibrator walker. The help says:
Required Arguments:
--input,-I:String BAM/SAM/CRAM file containing reads This argument must be specified at least once.
Required.
--known-sites:FeatureInput One or more databases of known polymorphic sites used to exclude regions around known
polymorphisms from analysis. This argument must be specified at least once. Required.
--output,-O:File The output recalibration table file to create Required.
--reference,-R:String Reference sequence file Required.
However, if I include the ":" into the command line, for example with "--known-sites:#####", I get errors such as:
A USER ERROR has occurred: No value found for tagged argument: known-sites:#####
While if I replace the ":" with a space, such as "--known-sites #####", everything works.
It seems to me that after the command line format was updated in GATK4, someone forgot to update the help text. I found this quite disruptive. I actually think the right way to handle it would be for the error message to explain that the ":" needs to be removed in GATK4. This would go a long way toward a smooth transition for users moving from GATK3 to GATK4.
SVCNVDiscovery Error: "Caused by: java.lang.NullPointerException"
Dear Genome STRiP users,
I ran SVCNVDiscovery successfully on a cohort with the interval list 1-500000. However, when I changed the interval list to 151000-198000, the error below popped up:
ERROR 18:54:30,495 FunctionEdge - Contents of /proj/yunligrp/users/minzhi/gs/PAGE_chr16/svcnv_PAGE_chr16_standard_full_single_151000-198000over1-500000_serial/cnv_stage7/seq_chr16/logs/CNVDiscoveryStage7-2.out:
INFO 18:54:05,386 HelpFormatter - ---------------------------------------------------------------
INFO 18:54:05,389 HelpFormatter - Program Name: org.broadinstitute.sv.discovery.MergeBrigVcfFiles
INFO 18:54:05,394 HelpFormatter - Program Args: -R /proj/yunligrp/users/minzhi/gs/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta -vcfFile /proj/yunligrp/users/minzhi/gs/PAGE_chr16/svcnv_PAGE_chr16_standard_full_single_151000-198000over1-500000_serial/cnv_stage7/seq_chr16/brig.vcf.file.list -mergedVcfFile /proj/yunligrp/users/minzhi/gs/PAGE_chr16/svcnv_PAGE_chr16_standard_full_single_151000-198000over1-500000_serial/cnv_stage7/seq_chr16/seq_chr16.brig.sites.vcf.gz
INFO 18:54:05,400 HelpFormatter - Executing as minzhi@b1010.ll.unc.edu on Linux 3.10.0-957.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_191-b12.
INFO 18:54:05,400 HelpFormatter - Date/Time: 2018/12/22 18:54:05
INFO 18:54:05,401 HelpFormatter - ---------------------------------------------------------------
INFO 18:54:05,401 HelpFormatter - ---------------------------------------------------------------
Exception in thread "main" java.lang.RuntimeException
at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:65)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.sv.commandline.CommandLineProgram.runAndReturnResult(CommandLineProgram.java:29)
at org.broadinstitute.sv.commandline.CommandLineProgram.run(CommandLineProgram.java:25)
at org.broadinstitute.sv.discovery.MergeBrigVcfFiles.main(MergeBrigVcfFiles.java:52)
Caused by: java.lang.NullPointerException
at org.broadinstitute.sv.discovery.MergeBrigVcfFiles.updateVCFHeader(MergeBrigVcfFiles.java:118)
at org.broadinstitute.sv.discovery.MergeBrigVcfFiles.mergeVCFFiles(MergeBrigVcfFiles.java:90)
at org.broadinstitute.sv.discovery.MergeBrigVcfFiles.run(MergeBrigVcfFiles.java:57)
at org.broadinstitute.sv.commandline.CommandLineProgram.execute(CommandLineProgram.java:54)
... 5 more
INFO 18:54:30,497 QGraph - Writing incremental jobs reports...
INFO 18:54:30,498 QGraph - 3 Pend, 0 Run, 1 Fail, 1 Done
INFO 18:54:30,500 QCommandLine - Writing final jobs report...
INFO 18:54:30,501 QCommandLine - Done with errors
INFO 18:54:30,510 QGraph - -------
INFO 18:54:30,512 QGraph - Failed: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/proj/yunligrp/users/minzhi/gs/gs_tempdir/svcnv_tmp' '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar' '-cp' '/proj/yunligrp/users/minzhi/svtoolkit/lib/SVToolkit.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/proj/yunligrp/users/minzhi/svtoolkit/lib/gatk/Queue.jar' 'org.broadinstitute.sv.discovery.MergeBrigVcfFiles' '-R' '/proj/yunligrp/users/minzhi/gs/
The script I used for SVCNVDiscovery is here:
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
gs_dir="/proj/yunligrp/users/minzhi/gs"
svpreprocess_dir="${gs_dir}/PAGE_chr16/svpre_PAGE_chr16_standard_full_single_1-500000over1-500000_serial_success"
rundir="${gs_dir}/PAGE_chr16/svcnv_PAGE_chr16_standard_full_single_151000-198000over1-500000_serial"
java -Xmx4g -cp ${classpath} \
org.broadinstitute.gatk.queue.QCommandLine \
-S ${SV_DIR}/qscript/discovery/cnv/CNVDiscoveryPipeline.q \
-S ${SV_DIR}/qscript/SVQScript.q \
-cp ${classpath} \
-gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
-configFile ${SV_DIR}/conf/genstrip_parameters.txt \
-R ${gs_dir}/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
-I ${gs_dir}/PAGE_chr16/supporting_PAGE_chr16/PAGE_chr16_full_sample.list \
-genderMapFile ${gs_dir}/PAGE_chr16/supporting_PAGE_chr16/PAGE_chr16_full_all-male_gender.map \
-ploidyMapFile ${gs_dir}/PAGE_chr16/supporting_PAGE_chr16/PAGE_chr16_standard_ploidy.map \
-md ${svpreprocess_dir}/md_tempdir \
-tempDir ${gs_dir}/gs_tempdir/svcnv_tmp \
-runDirectory ${rundir} \
-jobLogDir ${rundir}/logs \
-intervalList ${gs_dir}/PAGE_chr16/supporting_PAGE_chr16/PAGE_chr16_151000-198000_interval.list \
-tilingWindowSize 1000 \
-tilingWindowOverlap 500 \
-maximumReferenceGapLength 1000 \
-boundaryPrecision 100 \
-minimumRefinedLength 500 \
-jobRunner Shell \
-gatkJobRunner Shell \
-run \
|| exit 1
Is this related to the narrow interval list? May I have your suggestion about this error? Thank you very much.
Best regards,
Wusheng
Picard MergeBamAlignment's CLIP_OVERLAPPING_READS not working?
I have created a toy unmapped bam called u.bam with two pairs of overlapping reads:
181221_A00719_0016_BH5FY3DRXX:1:1101:10004:11584 77 * 0 0 * * 0 0 TGAAAACAAGCACAGCTTCATAGCATAGAATGGGATTGGGGGTCCAGTCTTCCCAGAATATTTTCTTTCCCATCTTCCCCTTGGGGAACAGATTCACCCAC :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF RG:Z:18122.1
181221_A00719_0016_BH5FY3DRXX:1:1101:10004:11584 141 * 0 0 * * 0 0 TAACTAACTTGTTCAAGGTCTCCCAGCTAGGATCTAGCAGAAAGATCAGGAAGGTATCACAAGGAAGGTGGGTGAATCTGTTCCCCAAGGGGAAGATGGGA FFFFFFFF,F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFFFFF:FFFFFFFFFFFF RG:Z:18122.1
181221_A00719_0016_BH5FY3DRXX:1:1101:10004:19257 77 * 0 0 * * 0 0 AATTATACTACGGTGGAGATGATTCATTTAGAAATGAGACTGAACAGGTCTGGGGGGCATAAGTACGTTTTGCAAGCATGTGGCATGGCCCAGATTCCTAT FFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF RG:Z:18122.1
181221_A00719_0016_BH5FY3DRXX:1:1101:10004:19257 141 * 0 0 * * 0 0 CGATATTCAGAAGGACTCCCACCAAGAAGCCACAGGTGCAAGTTGAGAGAGAATCATCAGTAGGAGAGTATAGCTGCTATAGGAATCTGGGCCATGCCACA FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF RG:Z:18122.1
But when I feed them through the command below, nothing is soft clipped in the output o.bam, contrary to what the documentation describes. Why is that?
java -jar /usr/bin/picard/picard-2.17.10.jar SamToFastq I=u.bam F=/dev/stdout INTERLEAVE=true|/usr/bin/bwa/bwa-0.7.17/bwa mem -M -p /home/db/hg38/hs38DH.fa /dev/stdin | java -jar /usr/bin/picard/picard-2.17.10.jar MergeBamAlignment UNMAPPED=u.bam ALIGNED=/dev/stdin O=o.bam R=/home/db/hg38/hs38DH.fa CLIP_OVERLAPPING_READS=true ALIGNER_PROPER_PAIR_FLAGS=true MAX_GAPS=-1 ORIENTATIONS=FR VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true
MuTect2 for amplicon did not call some variants
I am a beginner with GATK. I performed amplicon-based targeted sequencing and then used GATK4 Mutect2 to call variants. However, when we compared the variants from GATK4 Mutect2 with those from VariantCaller on the Ion Torrent Server, we found some inconsistencies. I therefore generated the bamout and found that some variants seem to have been realigned and were therefore not called (see figure chr17:50196078).
In another case, the allele frequency (AF) appears homozygous in both the input BAM file and the bamout, but heterozygous in the VCF, as shown below and in the figure chr17:50188065.
chr17 50188065 . A G . clustered_events DP=6046;ECNT=18;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-2.846e+01;TLOD=1518.06 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/1:447,1238:0.544:1685:174,937:273,301:32,36:0,0:60:12:0.737,0.525,0.735:0.00,1.00,0
The following are my parameters, IGV and VCF data:
date;/share/app/GATK/gatk-4.0.9.0/gatk --java-options "-Xmx256g" Mutect2 -R grch38.p2_rmsk.fasta -I 18060114.bam -tumor 18060114 -O 8060114_Mutect2_tumor_maxaf1_mrra0.vcf.gz --max-population-af 1 --max-reads-per-alignment-start 0 --min-base-quality-score 0
Your advice is highly appreciated, and I look forward to your reply.
Thank you!
Respectfully yours,
Ching-Yuan Wang
Is there a list of annotations in each annotation group?
Commands like Mutect2 allow annotation group arguments.
Where can we find what annotations are in each group, without running the tool all the way through?
Intervals and interval lists
Interval lists define subsets of genomic regions, sometimes even just individual positions in the genome. You can provide GATK tools with intervals or lists of intervals when you want to restrict them to operating on a subset of genomic regions. There are four main types of reasons for doing so:
- You want to run a quick test on a subset of data (often used in troubleshooting)
- You want to parallelize execution of an analysis across genomic regions
- You need to exclude regions that have bad or uninformative data where a tool is getting stuck
- The analysis you're running should only take data from those subsets due to how the underlying algorithm works
Regarding the latter case, see the Best Practices workflow recommendations and tool example commands for guidance regarding when to restrict analysis to intervals.
Interval-related arguments and syntax
Arguments for specifying and modifying intervals are provided by the engine and can be applied to most if not all tools. The main arguments you need to know about are the following:
- -L / --intervals allows you to specify an interval or list of intervals to include.
- -XL / --exclude-intervals allows you to specify an interval or list of intervals to exclude.
- -ip / --interval-padding allows you to add padding (in bp) to the intervals you include.
- -ixp / --interval-exclusion-padding allows you to add padding (in bp) to the intervals you exclude.
By default the engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by specifying an alternate interval merging rule (see --interval-merging-rule in the Tool Docs).
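To make the default merging behavior concrete, here is a minimal Python sketch. It is only an illustration of the rule described above for intervals on a single contig, not the engine's actual implementation; the function name merge_intervals is hypothetical, and intervals are assumed to be 1-based and inclusive on both ends.

```python
def merge_intervals(intervals):
    """Merge 1-based inclusive (start, stop) intervals on ONE contig.

    Illustrative only: intervals are merged when they overlap or abut
    (the next start is at most one past the previous stop).
    """
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or abutting: extend the previous interval.
            previous_start, previous_stop = merged[-1]
            merged[-1] = (previous_start, max(previous_stop, stop))
        else:
            merged.append((start, stop))
    return merged


# 1-100 and 101-200 abut, 150-300 overlaps, 400-500 stays separate.
print(merge_intervals([(1, 100), (101, 200), (150, 300), (400, 500)]))
```

An alternate merging rule that only merges overlapping intervals would correspond to dropping the "+ 1" in the comparison.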
The syntax for using -L is as follows; it applies equally to -XL:
- -L chr20 for contig chr20.
- -L chr20:1-100 for contig chr20, positions 1-100.
- -L intervals.list (or intervals.interval_list, or intervals.bed) when specifying a text file containing intervals (see supported formats below).
- -L variants.vcf when specifying a VCF file containing variant records; their genomic coordinates will be used as intervals.
If you want to provide several intervals or several interval lists, just pass them in using separate -L or -XL arguments (you can even use both of them in the same command). You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by specifying an alternate interval set rule (see --interval-set-rule in the Tool Docs).
Supported interval list formats
GATK supports several types of interval list formats: Picard-style .interval_list, GATK-style .list, BED files with extension .bed, and VCF files. The intervals MUST be sorted by coordinate (in increasing order) within contigs, and the contigs must be sorted in the same order as in the sequence dictionary. This is required for efficiency reasons.
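If you want to sanity-check the sort order before handing a list to a tool, something like the following Python sketch can help. It is a hypothetical helper, not part of GATK: is_sorted_for_dictionary and its arguments are names made up for illustration, and contig_order is assumed to be the list of contig names in the same order as the sequence dictionary.

```python
def is_sorted_for_dictionary(intervals, contig_order):
    """Return True if (contig, start, stop) intervals follow dictionary order.

    contig_order is assumed to list contig names in sequence-dictionary
    order. A contig missing from it raises KeyError, which is itself a
    useful signal of a reference mismatch.
    """
    rank = {name: index for index, name in enumerate(contig_order)}
    keys = [(rank[contig], start, stop) for contig, start, stop in intervals]
    return all(a <= b for a, b in zip(keys, keys[1:]))


dictionary_order = ["1", "2", "X"]
print(is_sorted_for_dictionary([("1", 30366, 30503), ("2", 100, 200)], dictionary_order))
print(is_sorted_for_dictionary([("2", 100, 200), ("1", 30366, 30503)], dictionary_order))
```

Note that the contigs are ranked by their position in the dictionary, not alphabetically, so "2" sorts before "10" whenever the dictionary lists it first.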
A. Picard-style .interval_list
Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).
@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens
@SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens
1 30366 30503 + target_1
1 69089 70010 + target_2
1 367657 368599 + target_3
1 621094 622036 + target_4
1 861320 861395 + target_5
1 865533 865718 + target_6
This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. applying hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).
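As a rough illustration of why the embedded dictionary is useful, here is a minimal Python sketch that reads a Picard-style file and checks each interval against the contig lengths declared in the header. This is not how GATK or Picard parse the format internally; parse_interval_list is a hypothetical name, and the example text is a shortened version of the file shown above.

```python
def parse_interval_list(text):
    """Parse Picard-style interval_list text (tab-separated, 1-based).

    Returns (contig_lengths, intervals): contig_lengths comes from the
    @SQ header lines, intervals are (contig, start, stop, strand, name).
    """
    contig_lengths = {}
    intervals = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            if fields[0] == "@SQ":
                # Header tags look like SN:1 or LN:249250621.
                tags = dict(field.split(":", 1) for field in fields[1:])
                contig_lengths[tags["SN"]] = int(tags["LN"])
            continue
        contig, start, stop, strand, name = fields[:5]
        intervals.append((contig, int(start), int(stop), strand, name))
    return contig_lengths, intervals


example = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\tAS:GRCh37\n"
    "1\t30366\t30503\t+\ttarget_1\n"
    "1\t69089\t70010\t+\ttarget_2\n"
)
contig_lengths, intervals = parse_interval_list(example)
for contig, start, stop, strand, name in intervals:
    # An interval on an unknown contig or past its end hints at a build mismatch.
    assert contig in contig_lengths and 1 <= start <= stop <= contig_lengths[contig]
print(intervals)
```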
B. GATK-style .list or .intervals
This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
C. BED files with extension .bed
We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so start coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be decreased by 1, while end coordinates stay the same. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.
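The coordinate shift is easy to get wrong, so here is a small Python sketch of the conversion in both directions. The function names are hypothetical; the only assumption is the standard BED convention of a 0-based start and an exclusive end, versus the 1-based inclusive coordinates used by the other formats described here.

```python
def one_based_to_bed(contig, start, stop):
    """Convert a 1-based inclusive interval to BED (0-based start, exclusive end)."""
    return (contig, start - 1, stop)


def bed_to_one_based(contig, bed_start, bed_end):
    """Convert a BED interval back to 1-based inclusive coordinates."""
    return (contig, bed_start + 1, bed_end)


# chr20:1-100 in GATK-style notation becomes "chr20 0 100" in BED.
print(one_based_to_bed("chr20", 1, 100))
print(bed_to_one_based("chr20", 0, 100))
```

Only the start coordinate changes: because the BED end is exclusive, it has the same numeric value as the 1-based inclusive stop.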
D. VCF files
Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.
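To picture what "single-base interval plus padding" means, here is a minimal Python sketch that turns VCF records into padded intervals. It only mimics the behavior described above and is not the engine's code: padded_intervals_from_vcf is a hypothetical name, each record is reduced to its POS as described, and clipping at the contig end is left out because that would need the sequence dictionary.

```python
def padded_intervals_from_vcf(vcf_text, padding):
    """Return (contig, start, stop) for each VCF record, padded on both sides.

    Each record is treated as the single-base interval POS-POS (1-based),
    then widened by `padding` bases, never going below position 1.
    """
    intervals = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        contig, position = line.split("\t")[:2]
        position = int(position)
        intervals.append((contig, max(1, position - padding), position + padding))
    return intervals


vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr17\t50188065\t.\tA\tG\t.\t.\t.\n"
)
print(padded_intervals_from_vcf(vcf_text, 100))
```

With -ip 100 in mind, the record at chr17:50188065 would correspond to the region chr17:50187965-50188165.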
Obtaining suitable interval lists
So where do those intervals come from? It depends a lot on what you're working with (everyone's least favorite answer, I know). The most important distinction is the sequencing experiment type: is it whole genome, or targeted sequencing of some sort?
Targeted sequencing (exomes, gene panels etc.)
For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).
We make our exome interval lists available, but be aware that they are specific to the custom exome targeting kits used at the Broad. If you got your sequencing done somewhere else, you should seek to get the appropriate intervals list from the sequencing provider.
Whole genomes (WGS)
For whole genome sequence, the intervals lists don’t depend on the prep (since in principle you captured the “whole genome”) so instead it depends on what regions of the genome you want to blacklist (e.g. centromeric regions that waste your time for nothing) and how the reference genome build enables you to cut up regions (separated by Ns) for scatter-gather parallelizing.
We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist! Which you can examine by looking at the intervals themselves. However, we don't currently have documentation on their provenance, sorry -- baby steps.
Error occurs in VariantRecalibrator: Malformed floating point valueprior
The problem is about VQSR. I have no idea how to fix this error:
A USER ERROR has occurred: Unknown file is malformed: Malformed floating point valueprior
The input VCF file was generated by HaplotypeCaller and the resources are from broadinstitute.org/bundle.
My GATK version is v4.0.8.1.
This is my code:
gatk --java-options -DGATK_STACKTRACE_ON_USER_EXCEPTION=true VariantRecalibrator \
-R human_g1k_v37.fasta \
-V NV0047-01_S14_gatk4.vcf \
-O raw.SNPs.recal \
-resource [hapmap,known=false,training=true,truth=true,prior=15.0]: hapmap_3.3.b37.vcf \
-resource [omni,known=false,training=true,truth=true,prior=12.0]: 1000G_omni2.5.b37.vcf \
-resource [dbsnp,known=true,training=false,truth=false,prior=2.0]: dbsnp_138.b37.vcf \
-resource [1000G,known=false,training=true,truth=false,prior=10.0]:1000G_phase1.snps.high_confidence.b37.vcf \
-an DP \
-an QD \
-an FS \
-an SOR \
-an MQ \
-an MQRankSum \
-an ReadPosRankSum \
-an InbreedingCoeff \
-mode SNP \
--tranches-file raw.SNPs.tranches \
-rscript-file recal.plots.R
And this is the full error message:
A USER ERROR has occurred: Unknown file is malformed: Malformed floating point valueprior
org.broadinstitute.hellbender.exceptions.UserException$MalformedFile: Unknown file is malformed: Malformed floating point valueprior
at org.broadinstitute.hellbender.tools.walkers.vqsr.TrainingSet.getDoubleAttributeOrElse(TrainingSet.java:67)
at org.broadinstitute.hellbender.tools.walkers.vqsr.TrainingSet.<init>(TrainingSet.java:41)
at org.broadinstitute.hellbender.tools.walkers.vqsr.VariantRecalibrator.onTraversalStart(VariantRecalibrator.java:423)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:891)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Genome Mask Files
1. Introduction
Genome STRiP makes use of mask files that identify portions of the reference
sequence that are not reliably alignable.
Genome mask files are fasta files with the same number of sequences and of the
same length as the reference sequence. In a genome mask file, a base position
is marked with a 0 if it is reliably alignable and 1 if it is not. Each genome
mask file is specific to the reference sequence and to the parameters used to
determine alignability.
The current generation of mask files are based on fixed read lengths. A base
is assigned a 0 if an N base sequence centered on this base is unique within
the reference genome. You should use a genome mask with a value of N that
corresponds to the read lengths of your input data set. For example, if you
have data that is a uniform set of Illumina paired-end data with 101bp reads,
then you should use (or generate) a genome mask with a read length of 101. If
your data is a mixture of read lengths, one viable strategy is to use a
"lowest common denominator" approach and use a mask length corresponding to
the shortest reads in your input data set. Using the smallest read length will
cause a small additional fraction of the genome to be marked inaccessible, but
will give the best specificity. Alternatively, you can use a larger N, which
should modestly improve sensitivity at the cost of a modest increase in false
discovery rate and a modest decrease in genotyping accuracy.
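As a toy illustration of the uniqueness rule described above, the following
Python sketch marks each base of a tiny reference string. It is NOT the
ComputeGenomeMask algorithm: the function name compute_toy_mask is
hypothetical, reverse-complement matches are ignored, and bases whose window
runs off the end of the sequence are simply marked 1 (both are simplifying
assumptions made for this example).

```python
from collections import Counter


def compute_toy_mask(reference, read_length):
    """Return a mask string: '0' if the window centered on a base is unique.

    Simplifications (assumptions, not Genome STRiP behavior): forward
    strand only, and bases too close to either end are marked '1'.
    """
    window_counts = Counter(
        reference[i:i + read_length]
        for i in range(len(reference) - read_length + 1)
    )
    half = read_length // 2
    mask = []
    for center in range(len(reference)):
        start = center - half
        end = start + read_length
        if start < 0 or end > len(reference):
            mask.append("1")  # window does not fit on the sequence
            continue
        unique = window_counts[reference[start:end]] == 1
        mask.append("0" if unique else "1")
    return "".join(mask)


# ACG and CGT each occur twice, so bases centered on them are masked.
print(compute_toy_mask("ACGTACGTTT", 3))
```

A shorter window is repeated more often across the sequence, which is why
a smaller N marks more of the genome as inaccessible.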
2. Resources
Some precomputed mask files for a variety of reference sequences and read
lengths are available at ftp://ftp.broadinstitute.org/pub/svtoolkit/svmasks.
3. Generating your own genome mask
The ComputeGenomeMask command line utility is available
to generate genome mask files, but queue scripts to automate the process have
not been written. A reasonable strategy is to compute the genome mask in
parallel chromosome-by-chromosome and then merge the resulting fasta files into
a final genome-wide mask file.
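The merge step can be as simple as concatenating the per-chromosome fasta
files in reference order. The Python sketch below shows one way to do that;
it is an assumption about workflow, not a Genome STRiP utility, and the
function name, the dictionary of paths and the file names are all
hypothetical.

```python
from pathlib import Path


def merge_mask_fastas(per_contig_paths, contig_order, merged_path):
    """Concatenate per-contig mask fasta files in reference contig order.

    per_contig_paths maps contig name -> path of that contig's mask file;
    contig_order is assumed to match the order of the reference sequence.
    """
    with open(merged_path, "w") as merged:
        for contig in contig_order:
            text = Path(per_contig_paths[contig]).read_text()
            # Make sure each record ends with a newline before the next header.
            merged.write(text if text.endswith("\n") else text + "\n")
```

Keeping the contigs in the same order as the reference matters because the
mask must have the same number of sequences, with the same lengths, as the
reference sequence.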
4. Planned Enhancements
The implementation of mask files will be replaced in a future release.
Mask files are being converted from textual fasta files to binary files and
are being enhanced to better support input data sets with multiple read
lengths (so the use of a "lowest common denominator" strategy will no longer
be necessary).
Do alignment differences have a large effect on GATK HaplotypeCaller?
Hi,
We find that when the same sample is aligned with different versions of bwa, producing different BAM files (we compared the two BAM files with deeptools and found them to be quite different), we still get very similar variant results from the GATK germline pipeline. Is this because HaplotypeCaller is more tolerant of differences in the alignment file, since it performs its own realignment, or are we overlooking some other factor that leads to this result?
java.lang.IncompatibleClassChangeError GATK 4
Hi,
I hit an error with GATK 4 beta 6 using the RealignerTargetCreator. As a complete Java newbie, I find it quite incomprehensible. I'm running (Oracle) Java 9.0.1 (and thus GATK 3 RealignerTargetCreator isn't working for me either).
Here is the command I ran:
gatk-launch RealignerTargetCreator -R ~/data/ref/hg38.fa -I Sample1_dedup.bam -o Sample1_int.intervals
And this is the output:
Using GATK jar /usr/local/bin/gatk-4.beta.6/gatk-package-4.beta.6-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -jar /usr/local/bin/gatk-4.beta.6/gatk-package-4.beta.6-local.jar RealignerTargetCreator -R /home/jamie/data/ref/hg38.fa -I Sample1_dedup.bam -o Sample1_int.intervals
Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/broadinstitute/barclay/argparser/CommandLineProgramGroup. Method lambda$static$0(Lorg/broadinstitute/barclay/argparser/CommandLineProgramGroup;Lorg/broadinstitute/barclay/argparser/CommandLineProgramGroup;)I at index 43 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef
at org.broadinstitute.barclay.argparser.CommandLineProgramGroup.<clinit>(CommandLineProgramGroup.java:16)
at org.broadinstitute.hellbender.Main.printUsage(Main.java:332)
at org.broadinstitute.hellbender.Main.extractCommandLineProgram(Main.java:305)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:156)
at org.broadinstitute.hellbender.Main.main(Main.java:239)
Many thanks!
How to include non variant sites in the output vcf of GenotypeGVCFs?
Hello
I'd like to include confident non variant sites in my downstream analyses. If I understand correctly, this was possible in previous versions with --includeNonVariantSites but I'm not seeing an option that allows me to do this in GATK4. Am I missing something obvious or is this currently not possible?
Thank you very much in advance and best wishes
Sam
Include non-variant sites
Hi,
I'm wondering whether the option to include non-variant sites in the output VCF of GenotypeGVCFs is available yet.
If not, could someone give me a tip on how to do that another way?
How does Picard's CollectVariantCallingMetrics calculate novel Ti/Tv per sample and as summary?
Hello!
I am having a hard time making sense of the novel Ti/Tv values produced by Picard's CollectVariantCallingMetrics when comparing the summary and the detailed reports.
The summary reports a novel Ti/Tv of 1.700175 (which is the same value found in tranche 90 of the VariantRecalibrator's tranches plot). However, sample-wise, novel Ti/Tv ranges from 1.803826 to 1.929386.
I do not see how 1.7 summarizes a range of numbers from 1.8 to 1.9. What am I missing?
Thanks in advance and happy new year!
user error message for HaplotypeCaller in GVCF mode
'''A USER ERROR has occurred: Read M02780:305:000000000-C4988:1:1102:23167:21145_1:N:0:16%26M02780:305:000000000-C4988:1:1102:23167:21145_2:N:0:16_(reversed) NC_007793:117448-117692 is malformed: read starts with deletion. Cigar: 13H2P1D244M. Although the SAM spec technically permits such reads, this is often indicative of malformed files.'''
I have tried multiple files, and I get the same error for all of them. A vcf file is produced, but it is truncated. When I run the same command without -ERC GVCF, it completes without an error and produces a full vcf file.
I am using Illumina data, 150bp PE, DNA bacterial origin (S. aureus). My pipeline is as follows: run the PE fastq with Bowtie2, then sort with samtools, then AddOrReplaceReadGroups and MarkDuplicates with picard, then index with samtools, and finally ValidateSamFile with picard. All of the files pass the validation without any errors.
I have also tried various ReadFilter flags including WellformedReadFilter, MappedReadFilter, GoodCigarReadFilter, ValidAlignmentStartReadFilter, but get the same message.
I am running Java version 1.8.0_121 on macOS 10.13.1.
stacktrace:
'''at org.broadinstitute.hellbender.utils.locusiterator.AlignmentStateMachine.stepForwardOnGenome(AlignmentStateMachine.java:274)
at org.broadinstitute.hellbender.utils.locusiterator.ReadStateManager.addReadsToSample(ReadStateManager.java:258)
at org.broadinstitute.hellbender.utils.locusiterator.ReadStateManager.collectPendingReads(ReadStateManager.java:177)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.lazyLoadNextAlignmentContext(LocusIteratorByState.java:315)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.hasNext(LocusIteratorByState.java:252)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.ReferenceConfidenceModel.getPileupsOverReference(ReferenceConfidenceModel.java:490)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.ReferenceConfidenceModel.calculateRefConfidence(ReferenceConfidenceModel.java:251)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.referenceModelForNoVariation(HaplotypeCallerEngine.java:688)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:522)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:240)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:291)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)'''
I would very much appreciate your help.
Thank you,
Rebecca
Illumina ICE interval list
Hi! I am using the Featured/Best Practices workspace Somatic-SNVs-Indels-GATK4 to analyze WES samples. My samples were captured using Illumina's ICE exome capture kit, and I have a list of intervals (attached). Does the "workspace.intervals" attribute in this workspace's method configurations already include the Illumina ICE exome interval list? If not, how do I upload my interval list and set it as an attribute in my method configuration?
Thanks!!
NaN error when running FilterMutectCalls with gatk 4.0.12.0 (phredScaleLog10ErrorRate)
I'm getting a NaN error when I try to run the FilterMutectCalls walker, following the instructions in step 4 of the How To guide (https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2#4).
This is the second sample VCF that has produced this error, and I can provide both VCFs for your own testing/debugging if you would like.
Thanks.
The complete error message and stack-trace is:
Using GATK jar /ifs/scratch/c2b2/ac_lab/pw2470/devel/mskilab/flows/modules/FilterMutectCalls/gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8g -jar /ifs/scratch/c2b2/ac_lab/pw2470/devel/mskilab/flows/modules/FilterMutectCalls/gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar FilterMutectCalls -V /ifs/scratch/c2b2/ac_lab/pw2470/PROJECTS/patient0/Flow/Mutect2_rsync_b37/CUAC1857/hand_combined_partial_result_with_contig_specific_results/manually_combined_mutect2_mutations.vcf --contamination-table 02_small_exac_calcualatecontamination.table -O 03_gatk4_m2_snvs_indels_1st_pass_filter_with_cont_table.vcf.gz
13:27:55.287 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/ifs/scratch/c2b2/ac_lab/pw2470/devel/mskilab/flows/modules/FilterMutectCalls/gatk-4.0.12.0/gatk-package-4.0.12.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
13:27:58.696 INFO FilterMutectCalls - ------------------------------------------------------------
13:27:58.696 INFO FilterMutectCalls - The Genome Analysis Toolkit (GATK) v4.0.12.0
13:27:58.696 INFO FilterMutectCalls - For support and documentation go to https://software.broadinstitute.org/gatk/
13:27:58.697 INFO FilterMutectCalls - Executing as pw2470@c2b2acld1 on Linux v4.15.0-43-generic amd64
13:27:58.697 INFO FilterMutectCalls - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_191-b12
13:27:58.697 INFO FilterMutectCalls - Start Date/Time: January 3, 2019 1:27:55 PM EST
13:27:58.697 INFO FilterMutectCalls - ------------------------------------------------------------
13:27:58.697 INFO FilterMutectCalls - ------------------------------------------------------------
13:27:58.698 INFO FilterMutectCalls - HTSJDK Version: 2.18.1
13:27:58.698 INFO FilterMutectCalls - Picard Version: 2.18.16
13:27:58.698 INFO FilterMutectCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:27:58.698 INFO FilterMutectCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:27:58.699 INFO FilterMutectCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:27:58.699 INFO FilterMutectCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:27:58.699 INFO FilterMutectCalls - Deflater: IntelDeflater
13:27:58.699 INFO FilterMutectCalls - Inflater: IntelInflater
13:27:58.699 INFO FilterMutectCalls - GCS max retries/reopens: 20
13:27:58.699 INFO FilterMutectCalls - Requester pays: disabled
13:27:58.699 INFO FilterMutectCalls - Initializing engine
13:28:00.140 INFO FeatureManager - Using codec VCFCodec to read file file:///ifs/scratch/c2b2/ac_lab/pw2470/PROJECTS/patient0/Flow/Mutect2_rsync_b37/CUAC1857/hand_combined_partial_result_with_contig_specific_results/manually_combined_mutect2_mutations.vcf
13:28:00.623 INFO FilterMutectCalls - Done initializing engine
13:28:00.979 INFO ProgressMeter - Starting traversal
13:28:00.979 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
13:28:00.980 INFO FilterMutectCalls - Starting first pass through the variants
13:28:03.482 INFO FilterMutectCalls - Shutting down engine
[January 3, 2019 1:28:03 PM EST] org.broadinstitute.hellbender.tools.walkers.mutect.FilterMutectCalls done. Elapsed time: 0.14 minutes.
Runtime.totalMemory()=1011351552
java.lang.IllegalArgumentException: errorRateLog10 must be good probability but got NaN
at org.broadinstitute.hellbender.utils.QualityUtils.phredScaleLog10ErrorRate(QualityUtils.java:321)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2FilteringEngine.lambda$applyGermlineVariantFilter$10(Mutect2FilteringEngine.java:207)
at java.util.stream.DoublePipeline$3$1.accept(DoublePipeline.java:231)
at java.util.Spliterators$DoubleArraySpliterator.forEachRemaining(Spliterators.java:1198)
at java.util.Spliterator$OfDouble.forEachRemaining(Spliterator.java:822)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
at java.util.stream.IntPipeline.toArray(IntPipeline.java:502)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2FilteringEngine.applyGermlineVariantFilter(Mutect2FilteringEngine.java:207)
at org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2FilteringEngine.calculateFilters(Mutect2FilteringEngine.java:436)
at org.broadinstitute.hellbender.tools.walkers.mutect.FilterMutectCalls.firstPassApply(FilterMutectCalls.java:120)
at org.broadinstitute.hellbender.engine.TwoPassVariantWalker.lambda$traverseVariants$0(TwoPassVariantWalker.java:76)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.TwoPassVariantWalker.traverseVariants(TwoPassVariantWalker.java:74)
at org.broadinstitute.hellbender.engine.TwoPassVariantWalker.traverse(TwoPassVariantWalker.java:27)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
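One way to narrow this down, offered as a rough sketch rather than a confirmed diagnosis: the trace shows phredScaleLog10ErrorRate receiving NaN from inside applyGermlineVariantFilter, so it may be worth checking whether the hand-combined VCF already carries non-finite numbers in its INFO column. The Python below scans each record's INFO field for values that parse as NaN or infinity. The function name find_nonfinite_info and the sample records (including their POP_AF and TLOD values) are made up for illustration, and whether such a value is the actual cause is an assumption that the trace does not confirm.

```python
import math


def find_nonfinite_info(vcf_lines):
    """Yield (chrom, pos, key, raw_value) for INFO entries holding NaN or +/-inf.

    vcf_lines is any iterable of text lines from an uncompressed VCF.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        chrom, pos, info = cols[0], cols[1], cols[7]
        for entry in info.split(";"):
            if "=" not in entry:
                continue  # flag-type INFO key with no value
            key, raw = entry.split("=", 1)
            for item in raw.split(","):
                try:
                    value = float(item)
                except ValueError:
                    continue  # non-numeric value such as "." or a string
                if not math.isfinite(value):
                    yield chrom, pos, key, raw
                    break  # report each INFO key at most once per record


# Hypothetical records, for illustration only (not taken from the real VCF).
demo_records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tT\t.\t.\tDP=10;POP_AF=1.0e-03;TLOD=5.2",
    "1\t200\t.\tG\tC\t.\t.\tDP=12;POP_AF=NaN;TLOD=4.1",
]

for hit in find_nonfinite_info(demo_records):
    print(hit)
```

To run this against a real file, pass an open text handle, for example find_nonfinite_info(open(path)); a gzipped VCF would need gzip.open(path, "rt") instead. If nothing is reported, the NaN is presumably produced during filtering rather than read from the input.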