Recent Discussions — GATK-Forum

supporting dataset for CalculateGenotypePosteriors

Dear team,

I am relatively new to the GATK environment, so please forgive me if I missed something obvious. I realize that a similar question has come up before, but I did not find an answer that solved my problem.

I am trying to run CalculateGenotypePosteriors with a supporting dataset. The tool documentation uses 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz, which I downloaded from the GATK bundle at

console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/


When I run
gatk CalculateGenotypePosteriors -R Homo_sapiens_assembly38.fasta -V in.vcf.gz -O out.vcf.gz -supporting 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz

the result is a user error

A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths:
contig reference = chr15 / 101991189
contig features = chr15 / 90338345.

All of my alignment, variant calling, and genotyping was done with the same Homo_sapiens_assembly38.fasta file (obtained from the GATK bundle). I am using GATK 4.0.6.0.

Running
gatk ValidateVariants -R Homo_sapiens_assembly38.fasta -V in.vcf.gz --dbsnp GATK-bundle/dbsnp_138.hg38.vcf.gz

completed without error. So my question is: is there a problem with this supporting input file (1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz)? Is there another file I could use? Are there other tests I could run to check the integrity of my input VCF files?
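In case it helps, the mismatch can be confirmed directly from the two headers (a quick sketch; I am assuming the sequence dictionary is named Homo_sapiens_assembly38.dict):

```
# Contig length declared in the reference sequence dictionary
grep 'SN:chr15' Homo_sapiens_assembly38.dict
# Contig length declared in the supporting VCF's header
zcat 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz | grep -m1 'ID=chr15,'
```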

Best wishes,

Georg

LiftoverVcf with large structural variants results in IndelStraddlesMultipleIntevals

I am trying to use the LiftoverVcf tool to lift structural variants over from hg38 to hg19. All of the variants are larger than 50 bp. The tool is not able to lift any of the roughly 100k variants. A proportion of the variants fail due to a non-existent target (Error: NoTarget), which is expected. However, the majority fail with the IndelStraddlesMultipleIntevals error. As far as I understand, this error occurs if the target size (i.e., the interval between the start and end positions) differs from the original variant size.

Since some of the variants could be lifted over using a BED file with the same positions and the UCSC liftOver tool, I would be interested in which aspects other than the position itself are considered during the liftover of structural variants or indels.

This is one variant that could be lifted over in BED format but failed as a VCF variant with the IndelStraddlesMultipleIntevals error:

```
chr1 59599 chr1-59599-INS-308 A <INS> 5 IndelStraddlesMultipleIntevals BKPTID=NA19434_chr1-59599-INS-308;CONTIG=NA19434_chr1-20000-80000-ctg7180000000004;CONTIG_DEPTH=7;CONTIG_END=4125;CONTIG_START=3817;CONTIG_SUPPORT=3;END=59600;MERGE_AC=1;MERGE_AF=0.07;MERGE_SAMPLES=NA19434;MERGE_SOURCE=NA19434;MERGE_VARIANTS=NA19434_chr1-59599-INS-308;MERGE_VARIANTS_RO=1.00;PUBLISHED_ID=NA19434_chr1-59599-INS-308;REPEAT_TYPE=AluY_simple;SVLEN=308;SVTYPE=INS
```

I am using GATK version 4.1.0.0 and the following command:

```
gatk LiftoverVcf --CHAIN hg38ToHg19.over.chain.gz -I input.vcf --OUTPUT output.hg19.vcf --REFERENCE_SEQUENCE hg19.fasta --REJECT output.liftOverfailed.vcf --LIFTOVER_MIN_MATCH 0.95
```
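For reference, this is roughly how the BED-based check with the UCSC tool was done, using the example variant above (a sketch; 0-based, half-open BED coordinates assumed):

```
# Convert the VCF record to a BED interval and lift it with UCSC liftOver
echo -e "chr1\t59598\t59600\tchr1-59599-INS-308" > sv.bed
liftOver sv.bed hg38ToHg19.over.chain.gz sv.hg19.bed sv.unmapped.bed
```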

Combining separately joint-called VCFs


Hello,

I have read through the guides and man pages I could find here, but am a bit confused. I have two joint-called VCFs produced with the same GATK 3.7 pipeline, one with 3000 samples and one with 1000 samples. Am I able to combine those VCFs, or is it wiser to re-joint-call all 4000 samples together?

https://gatkforums.broadinstitute.org/gatk/discussion/53/combining-variants-from-different-files-into-one

This page mentions (as an aside) joint calling in batches of 200 samples and then combining the results. However, it does not explain how that combining would occur; the three combining methods it describes are for cases different from this one.

https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php

It seems this tool is technically capable of merging VCFs, as are various non-GATK tools. However, I believe merging VCFs is generally hard (many edge cases, missing data, and so on), which is, after all, the reason for the GVCF workflow. I suspect the output of such a merge would be markedly different from a single joint-called VCF.
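For concreteness, this is the kind of GATK3 merge I mean (a sketch; file names are placeholders, and UNIQUIFY is just one of the genotype merge options):

```
java -jar GenomeAnalysisTK.jar -T CombineVariants \
    -R ref.fasta \
    --variant batch1.3000samples.vcf \
    --variant batch2.1000samples.vcf \
    -o merged.vcf \
    -genotypeMergeOptions UNIQUIFY
```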

https://gatkforums.broadinstitute.org/gatk/discussion/23201/merging-population-vcf-files-without-gvcf

In this question you recommend not attempting to merge VCFs, which seems to conflict with the first link above.

https://software.broadinstitute.org/gatk/documentation/article?id=11019

This page does not mention batching at all, I think because GenomicsDB and GATK4 are expected to scale better with more samples.

Hope you can clear up my confusion

Thanks!

Questions about calculating the genotype likelihoods


On this page, https://software.broadinstitute.org/gatk/documentation/article.php?id=4442, you show the formula used to calculate PL.

I can understand most of the formulas used there, but I can't follow the change in the formula when you substitute the genotype G = H1H2 into P(D|G). I have tried many times and cannot complete the derivation on my own. I think the formula you use for P(D|G) should be obtainable by pure mathematical deduction.

Therefore, if convenient, would you please show me the mathematical derivation proving that P(D|G) = P(D|H1)/2 + P(D|H2)/2 (given a single read)?
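For what it's worth, my best guess at the missing step is an equal-sampling assumption: a single read is drawn from either haplotype of the diploid genotype with probability 1/2, so by the law of total probability

P(D|G) = P(D|H1) P(read drawn from H1) + P(D|H2) P(read drawn from H2) = P(D|H1)/2 + P(D|H2)/2.

Is that the assumption I am missing, or is there more to it?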

Thank you!

Zero length tag name found in tagged argument: O:

Hello! I have a problem with my samples; could someone please help me run CombineGVCFs? After generating the GVCF files with HaplotypeCaller, I run CombineGVCFs as follows:

```
gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' CombineGVCFs \
    -R /media/sf_E_DRIVE/work/Glycine_max_v2.1_edited.fa \
    -V S1.raw.g.vcf S2.raw.g.vcf S3.raw.g.vcf \
    -O: 3_soybean.g.vcf \
    -D /media/sf_E_DRIVE/work/reference_genome_v2.1/ncbi_snp_soybean_3847_All.vcf.gz
```

and I get this error:

A USER ERROR has occurred: Zero length tag name found in tagged argument: O:

How can I fix this error? (I tried to solve it with the RenameSampleInVcf program, but it did not work for me.)
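Is the fix simply to drop the colon after -O and give each GVCF its own -V flag? This is my guess at the corrected command (untested):

```
gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' CombineGVCFs \
    -R /media/sf_E_DRIVE/work/Glycine_max_v2.1_edited.fa \
    -V S1.raw.g.vcf -V S2.raw.g.vcf -V S3.raw.g.vcf \
    -O 3_soybean.g.vcf \
    -D /media/sf_E_DRIVE/work/reference_genome_v2.1/ncbi_snp_soybean_3847_All.vcf.gz
```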

Is there a way to speed up ApplyBQSR?


For a 7 GB FASTQ, the ApplyBQSR step takes 6 hours. Is there a way to speed it up? I found no thread-count argument for this command. Thanks a lot!
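Would scattering by intervals and merging afterwards be the intended workaround? A sketch of what I have in mind (file names are placeholders; unmapped reads would still need separate handling):

```
# Run ApplyBQSR on each contig in parallel, then merge the pieces
for chr in $(cut -f1 ref.fasta.fai); do
    gatk ApplyBQSR -R ref.fasta -I sample.bam \
        --bqsr-recal-file recal.table -L "${chr}" \
        -O "recal.${chr}.bam" &
done
wait
samtools merge recal.bam recal.chr*.bam
```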

How to fix error "INVALID_MATE_REF_INDEX"

What is the meaning of the error "INVALID_MATE_REF_INDEX" in a SAM/BAM file, and how can I fix it?
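For anyone answering: would something like this be the right way to diagnose and repair it (a sketch; I am assuming the Picard tools are the appropriate ones)?

```
# Summarize all validation errors in the BAM
java -jar picard.jar ValidateSamFile I=input.bam MODE=SUMMARY
# Rewrite the mate reference/position fields from the paired reads themselves
java -jar picard.jar FixMateInformation I=input.bam O=fixed.bam
```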

Recommendations for pixel distance when sequences are from NovaSeq


Hi,

We recently resequenced samples using Illumina's NovaSeq technology. We are aware of the increase in optical duplicates due to the use of patterned flowcells. In the MarkDuplicates documentation, 2500 is suggested for OPTICAL_DUPLICATE_PIXEL_DISTANCE for the patterned flowcells of the HiSeq, compared to the default of 100 (NextSeq).

Do you currently have any best-practices recommendation to handle this? Should the value be increased further, and if so by how much, given that the wells in NovaSeq are even more closely packed?
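For concreteness, this is the kind of command we would run, with 2500 taken from the patterned-flowcell suggestion above (a sketch, not a recommendation):

```
java -jar picard.jar MarkDuplicates \
    I=sample.bam O=sample.md.bam M=sample.md_metrics.txt \
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500
```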

Thanks a lot for any thoughts on this!


FilterByOrientationBias FFPE artifacts

Hi GATK team,

We have analyzed multiple tumor samples with Mutect2 (v4.0.6.0), in particular FFPE samples. So far, FilterByOrientationBias has not filtered out any FFPE artifacts in these samples.
We recently analyzed a sample with a lot of background noise in the mutational signature, which we strongly suspect is caused by FFPE artifacts. Unfortunately, no FFPE artifacts were filtered out in this sample either; on the other hand, about 1000 OxoG artifacts were filtered out.

The two artifact-modes arguments we give in the FilterByOrientationBias command are:
--artifact-modes G/T --artifact-modes C/T

We have tried to work out why no FFPE artifacts are filtered out. This is where we suspect things go wrong:
FilterByOrientationBias uses the number of reads that support the reference and the variant in the F1R2 and F2R1 orientations. That works if you isolate DNA and load it directly onto the sequencer.
What we do is isolate DNA and first amplify it, in order to have sufficient material; the DNA is then loaded onto the (Illumina) sequencer. Because of the amplification step before sequencing, we no longer have an orientation bias: the variants (FFPE artifacts) appear on both strands.
Does this explain why we do not filter out FFPE artifacts?

Another question:
We checked what happens when we filter not on C/T artifacts (FFPE) but on G/A artifacts (the opposite). Now, 4000 variants are filtered out, and the mutational signature we obtain correlates highly with the signature we expect.
How is it possible that FilterByOrientationBias filters out the G/A artifacts that we suspect are the FFPE artifacts?
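For completeness, this is the variant of the command we tested (file names are placeholders):

```
gatk FilterByOrientationBias \
    -V mutect2_filtered.vcf \
    -P tumor.pre_adapter_detail_metrics \
    --artifact-modes 'G/A' \
    -O mutect2_ob_filtered.vcf
```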

Erik

MuTect2 beta --germline_resource for build hg19, af-only-gnomad.vcf


Hi - I'm looking to run MuTect2 beta using the --germline_resource option. However, I cannot find af-only-gnomad.vcf for build hg19.
Where can I find this VCF for hg19?

GATK GermlineCNVCaller Procedure

Hello

I am trying to run the GATK 4 GermlineCNVCaller on exome sequencing data to call CNVs from read depth, but I am getting confused about how to actually use the tool. From my understanding, I need to run the workflow on training samples in COHORT mode, then run the workflow again on the test samples. I have run my training samples through DetermineGermlineContigPloidy and GermlineCNVCaller, but I do not know whether I should run the test samples in CASE run mode, or whether I should use PostProcessIntervals?

One other question: where would the call and model "shards" be?
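To make my confusion concrete: is something like this the intended procedure for the test samples (a sketch of my understanding; directory and file names are placeholders)?

```
# Determine ploidy for a test sample against the cohort ploidy model
gatk DetermineGermlineContigPloidy \
    --model cohort_ploidy-model \
    -I test_sample.counts.hdf5 \
    --output ploidy-case --output-prefix test

# Call CNVs in CASE mode against the cohort CNV model
gatk GermlineCNVCaller --run-mode CASE \
    --model cohort_cnv-model \
    -I test_sample.counts.hdf5 \
    --contig-ploidy-calls ploidy-case/test-calls \
    --output cnv-case --output-prefix test
```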

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs


In GATK4, the GenotypeGVCFs tool can only take a single input: (1) a single single-sample GVCF, (2) a single multi-sample GVCF created by CombineGVCFs, or (3) a GenomicsDB workspace created by GenomicsDBImport. If you have GVCFs from multiple samples (which is usually the case), you will need to combine them before feeding them to GenotypeGVCFs. The input samples must possess genotype likelihoods containing the allele <NON_REF>, as produced by HaplotypeCaller with -ERC GVCF or -ERC BP_RESOLUTION.

Although there are several tools in the GATK and Picard toolkits that provide some type of VCF merging functionality, for this use case ONLY two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


Using GenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over at least one genomics interval (this feature is available in v4.0.6.0 and later and stable in v4.0.8.0 and later), and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20 and chromosome 21):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20,chr21

That generates a directory called my_database containing the combined GVCF data for chromosomes 20 and 21. (The contents of the directory are not really human-readable; see the addendum “extracting GVCF data from the GenomicsDB” below to evaluate the combined, pre-genotyped data. Also note that the log will contain a series of messages like Buffer resized from 178298bytes to 262033 -- this is expected.) For larger cohort sizes, we recommend specifying a batch size of 50 for improved memory usage. A sample map file can also be specified when enumerating the GVCFs individually as above becomes arduous.
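For reference, a sample map is a plain tab-separated file with one sample name and one GVCF path per line; a sketch using the trio above:

    mother	data/gvcfs/mother.g.vcf
    father	data/gvcfs/father.g.vcf
    son	data/gvcfs/son.g.vcf

It is passed with --sample-name-map (for example together with --batch-size 50) in place of the individual -V arguments.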

Then you run joint genotyping; note the gendb:// prefix to the database input directory path. Note that this step requires a reference, even though the import can be run without one.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -newQual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations and Common “Gotchas”:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At least one interval must be provided when using GenomicsDBImport.

  3. Input GVCFs cannot contain multiple entries for a single genomic position

  4. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if, for example, you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using GatherVcfs; see the sketch after this list) or scatter the following steps by chromosome as well.

  5. The annotation counts specified in the header MUST BE VALID! If not, you may see an error like A fatal error has been detected by the Java Runtime Environment [...] SIGSEGV, with mention of a core dump (which may or may not be output depending on your system configuration). You can check your annotation headers with vcf-validator from VCFtools [https://github.com/vcftools/vcftools]

  6. GenomicsDB will not overwrite an existing workspace. To rerun an import, you will have to manually delete the workspace before running the command again.

  7. If you’re working on a POSIX filesystem (e.g. Lustre, NFS, xfs, ext4 etc), you must set the environment variable TILEDB_DISABLE_FILE_LOCKING=1 before running any GenomicsDB tool. If you don’t, you will likely see an error like Could not open array genomicsdb_array at workspace:[...]

  8. HaplotypeCaller output containing MNPs cannot be merged with CombineGVCFs or GenotypeGVCFs. For phasing nearby variants in multi-sample callsets, MNPs can be inferred from the phase set (PS) tag in the FORMAT field.

  9. There are a few other, rare bugs we’re in the process of working out. If you run into problems, you can check the open github issues [https://github.com/broadinstitute/gatk/issues?utf8=✓&q=is:issue+is:open+genomicsdb] to see if a fix is in progress.
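As mentioned in item 4, concatenating per-chromosome GVCFs into a single per-sample GVCF can be sketched as follows (inputs must be listed in genomic order; file names are placeholders):

    gatk GatherVcfs \
        -I sampleA.chr1.g.vcf \
        -I sampleA.chr2.g.vcf \
        -O sampleA.g.vcf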

If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Bells and Whistles

GenomicsDB now supports allele-specific annotations [https://software.broadinstitute.org/gatk/documentation/article?id=9622], which have become standard in our Broad exome production pipeline.

GenomicsDB can now import directly from a Google cloud path (i.e. gs://) using NIO.

Question about the best haplotype finder in HaplotypeCaller

Hello everyone,

I was looking at the recent commits and noticed that the behavior of the best-haplotype finding method has changed. I was wondering whether this is intentional.

if you look at
hellbender/tools/walkers/haplotypecaller/graphs/KBestHaplotypeFinder.java:89

there is a line that limits the number of haplotypes that end in a specific vertex:

if (vertexCounts.get(targetVertex).getAndIncrement() < maxNumberOfHaplotypes)

I believe this means we just keep the first maxNumberOfHaplotypes haplotypes found for this vertex, not necessarily the maxNumberOfHaplotypes highest-scoring ones. If you remove this if statement, the algorithm can find higher-scoring haplotypes. Keeping the check is certainly faster, but my question is whether this is the behavior you are expecting. I have sample BAMs that I can share.

Thank you for reading this.
Mehrzad

Starting to use GISTIC2

Hi,

I have called somatic copy number on a tumor/matched-normal pair with ascatNgs.

I have three output files; this is one of the outputs:

```
SNP         Chromosome  Position  Log R               segmented LogR       BAF     segmented BAF      Copy number  Minor allele  Raw copy number
rs62635286  1           13116     -1.23040761342626   0.00160351549950095  0       0.317683333333333  2            1             0.41361756890851
rs75454623  1           14930     -0.390888460751959  0.00160351549950095  0.3548  0.317683333333333  2            1             1.19149081633643
```

Could you please tell me which columns correspond to

Num markers (the number of markers in the segment)

Seg.CN (log2(copy number) - 1)

required as GISTIC2 input?
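In case it clarifies what I am after, this is my understanding of the Seg.CN conversion (a sketch in awk, assuming column 10 is the raw copy number as in the table above, and keeping in mind that GISTIC2 expects per-segment rather than per-marker values):

```
# Per-marker log2(copy number) - 1 from the raw copy number column
awk 'NR > 1 && $10 > 0 { print $2, $3, log($10)/log(2) - 1 }' ascat_output.txt
```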


Also, could you please give me an idea of how to install GISTIC2?

Thanks a lot in advance

Why does Mutect2 return SNPs with non-zero alt-allele frequencies in the normal sample, but 0 alt reads?


I'm finding that most of the somatic SNPs Mutect2 infers have non-zero alt-allele fractions (AF) in the normal sample, but when I look at the AD field they almost always have 0 reads supporting the alt allele (see the example below).

Is this because of the PoN I used (the one provided in the Mutect2 resource kit; my data set is too small (n=10) to generate my own)?

Example below (copied from R, which was used to parse the VCF):

```
observedInType2AF$vcf[1:3, c( 1:7, 9:10, 42:44 ) ]

chrom pos id ref alt qual filter
794 1 9795131 . T C . PASS
12960 1 148004625 . C A . PASS
format
794 GT:AD:AF:DP:F1R2:F2R1:OBAM:OBAMRC:OBQ:OBQRC:SAAF:SAPP
12960 GT:AD:AF:DP:F1R2:F2R1:OBAM:OBAMRC:OBF:OBP:OBQ:OBQRC:SAAF:SAPP
ac1
794 0/0:38,0:0.098:38:19,0:19,0:false:false
12960 0/0:40,0:0.053:40:24,0:16,0:false:false
cuac1856
794 0/1:58,52:0.473:110:20,25:38,27:false:false:52.65:100.00:0.444,0.455,0.473:0.024,0.014,0.962
12960 ./.
cuac1857
794 0/1:15,7:0.327:22:6,2:9,5:false:false:53.96:100.00:0.273,0.303,0.318:0.021,0.023,0.956
12960 ./.
```


Choice of known_indels.vcf on google cloud bucket

Dear all:

After remapping whole-genome sequencing data to the GRCh38 reference assembly, I would like to do local realignment around indels, and I am wondering which known-indels file to use. I saw this one in the Google Cloud bucket: Homo_sapiens_assembly38.known_indels.vcf.gz. Is this the 1000 Genomes phase 3 indel set? Can I use it for local realignment?

I noticed that the Best Practices pipeline no longer includes a realignment step, but I am using GATK version 3.3, so I would still like to do local realignment.
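For reference, this is the two-step realignment I am planning to run (a sketch, assuming this bundle file is indeed appropriate):

```
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R Homo_sapiens_assembly38.fasta -I sample.bam \
    -known Homo_sapiens_assembly38.known_indels.vcf.gz \
    -o realign.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
    -R Homo_sapiens_assembly38.fasta -I sample.bam \
    -known Homo_sapiens_assembly38.known_indels.vcf.gz \
    -targetIntervals realign.intervals -o sample.realigned.bam
```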

many thanks
Yidong

Which sort mode to apply before MarkDuplicates?


What is the difference between queryname and coordinate sorting?
If the SortSamSpark step is removed before MarkDuplicatesSpark is executed, MarkDuplicatesSpark will sort by queryname by default. Does this affect accuracy?
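For context, this is the way I know to check which sort order a BAM currently has (a sketch):

```
# The SO: field of the @HD header line reports queryname vs coordinate sorting
samtools view -H input.bam | grep '^@HD'
```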

Picard RevertSam java.nio.file.NoSuchFileException


Hi,

I'm starting to process a set of BAMs following the Best Practices, beginning from BAMs that were processed by someone else. I'm therefore attempting to generate unmapped BAMs following this post, using the latest version of Picard (2.15.0). Unfortunately, Picard throws an exception showing that it is unable to find temporary files it is writing. I know there is space for these files, and in fact I now have version 1.141 of Picard running without issue. The output from version 2.15.0 is below.

15:34:14.012 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/REDACTED/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat Nov 18 15:34:14 CST 2017] RevertSam INPUT=/REDACTED.bam OUTPUT=/REDACTED/808302_LP6008048-DNA_B02.bam SORT_ORDER=queryname RESTORE_ORIGINAL_QUALITIES=true REMOVE_DUPLICATE_INFORMATION=true REMOVE_ALIGNMENT_INFORMATION=true ATTRIBUTE_TO_CLEAR=[NM, UQ, PG, MD, MQ, SA, MC, AS, XT, XN, AS, OC, OP] SANITIZE=true MAX_DISCARD_FRACTION=0.005 TMP_DIR=[/REDACTED/tmp] VALIDATION_STRINGENCY=LENIENT OUTPUT_BY_READGROUP=false OUTPUT_BY_READGROUP_FILE_FORMAT=dynamic VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Sat Nov 18 15:34:14 CST 2017] Executing as awilliams@REDACTED on Linux 3.10.0-229.7.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12; Deflater: Intel; Inflater: Intel; Picard version: 2.15.0-SNAPSHOT
[Sat Nov 18 15:34:30 CST 2017] picard.sam.RevertSam done. Elapsed time: 0.27 minutes.
Runtime.totalMemory()=1272971264
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.util.RuntimeIOException: java.nio.file.NoSuchFileException: /REDACTED/tmp/awilliams/sortingcollection.728972638772980431.tmp
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:246)
at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:166)
at picard.sam.RevertSam$RevertSamSorter.add(RevertSam.java:637)
at picard.sam.RevertSam.doWork(RevertSam.java:260)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
Caused by: java.nio.file.NoSuchFileException: /REDACTED/tmp/awilliams/sortingcollection.728972638772980431.tmp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.createFile(Files.java:632)
at java.nio.file.TempFileHelper.create(TempFileHelper.java:138)
at java.nio.file.TempFileHelper.createTempFile(TempFileHelper.java:161)
at java.nio.file.Files.createTempFile(Files.java:852)
at htsjdk.samtools.util.IOUtil.newTempPath(IOUtil.java:316)
at htsjdk.samtools.util.SortingCollection.newTempFile(SortingCollection.java:255)
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:220)
... 6 more

GATK HaplotypeCaller MNP output problem.


Hi
I have a situation here where a clear MNP is not emitted as such by the HaplotypeCaller engine.

The top VCF is with MNP distance 0 and the bottom with 5. It looks like the GATK HC engine is too indel-centric for this kind of situation: it prefers emitting a GC deletion and an AA insertion rather than a TGC-to-AAT substitution. I am wondering whether this could be related to the pairHMM and its limitations for certain transition types (not allowing 1 transversion and 2 transitions at the same position without introducing gaps in both sequences). If that is the case, I am also wondering whether an alternative engine could be implemented in GATK that does not rely solely on the pairHMM to detect sequence differences.

I can submit snippets of this if you would like to check.

Thanks.

Known Issues with VariantRecalibrator


The syntax for specifying argument tags has changed (and the documentation was out of sync for a while, though it is now fixed). The tags must now be specified with the argument name, not with the argument value, like this:

--resource:hapmap,known=false,training=true,truth=true,prior=15.0 /trainee/ref/hapmap_3.3.hg38.vcf

Note that the ":" and tags are listed with the argument name ("--resource"), not with the file name.
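To make the change concrete, here is a sketch of the two forms side by side (the old form is reconstructed from the description above):

    # Old, out-of-sync documentation form: tags attached to the value -- no longer accepted
    --resource hapmap,known=false,training=true,truth=true,prior=15.0:/trainee/ref/hapmap_3.3.hg38.vcf

    # Current form: tags attached to the argument name
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 /trainee/ref/hapmap_3.3.hg38.vcf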
