Channel: Recent Discussions — GATK-Forum

Errors about input files having missing or incompatible contigs


These errors occur when the names or sizes of contigs don't match between input files. This is a classic problem that typically happens when you get some files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the contigs don't match.

The first thing you need to do is find out which files are mismatched, because that will affect how you can fix the problem. This information is included in the error message, as shown in the examples below. You'll notice that GATK always evaluates everything relative to the reference. For more information about that, see the Dictionary entry on reference genomes.


Contents

  1. BAM file contigs not matching the reference
  2. VCF file contigs not matching the reference

BAM file contigs not matching the reference

A very common case we see looks like this:

##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR   contig reads = chrM / 16569
##### ERROR   contig reference = chrM / 16571.
##### ERROR   reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
##### ERROR   reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]

First, the error tells us that the mismatch is between the file containing reads, i.e. our BAM file, and the reference:

Input files reads and reference have incompatible contigs

It further tells us that the contig length doesn't match for the chrM contig:

Found contigs with the same name but different lengths:
##### ERROR   contig reads = chrM / 16569
##### ERROR   contig reference = chrM / 16571.

This can be caused either by using the wrong genome build version entirely, or using a reference that was hacked from a build that's very close but not identical, like b37 vs hg19, as detailed a bit more below.

We sometimes also see cases where people are using a very different reference; this is especially the case for non-model organisms where there is not yet a widely-accepted standard genome reference build.

Note that the error message also lists the content of the sequence dictionaries that it found for each file, and we see that some contigs in our reference dictionary are not listed in the BAM dictionary, but that's not a problem. If it was the opposite, with extra contigs in the BAM (or VCF), then GATK wouldn't know what to do with the reads from these extra contigs and would error out (even if we try restricting analysis using -L) with something like this:

#### ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with.
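
To see for yourself exactly which contigs each file declares, you can dump the sequence dictionaries directly (file names below are placeholders):

    # Contigs the BAM was aligned to:
    samtools view -H my.bam | grep '^@SQ'

    # Contigs in the reference, from its .dict companion file
    # (create it with Picard CreateSequenceDictionary if it doesn't exist):
    grep '^@SQ' reference.dict

Comparing the SN (name) and LN (length) fields on the @SQ lines will show you which contigs are mismatched.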

Solution

If you can, simply switch to the correct reference. Note that file names may be misleading, as people will sometimes rename files willy-nilly. Sometimes you'll need to do some detective work to identify the correct reference if you inherited someone else's sequence data.

If that's not an option because you either can't find the correct reference or you absolutely MUST use a particular reference build, then you will need to redo the alignment altogether. Sadly there is no liftover procedure for reads. If you don't have access to the original unaligned sequence files, you can use Picard tools to revert your BAM file back to an unaligned state (either unaligned BAM or FASTQ depending on the workflow you wish to follow).
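
For example, a minimal RevertSam invocation might look like this (file names are placeholders; see the Picard documentation for the options controlling which tags and records get stripped):

    java -jar picard.jar RevertSam \
        I=aligned.bam \
        O=reverted.unaligned.bam \
        SANITIZE=true \
        REMOVE_ALIGNMENT_INFORMATION=true

From the reverted BAM you can then realign to the reference build you actually need.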

Special case of b37 vs. hg19

The b37 and hg19 human genome builds are very similar, and the canonical chromosomes (1 through 22, X and Y) only differ by their names (no prefix vs. chr prefix, respectively). If you only care about those, and don't give a flying fig about the decoys or the mitochondrial genome, you could just rename the contigs throughout your mismatching file and call it done, right?

Well... This can work if you do it carefully and cleanly -- but many things can go wrong during the editing process that can screw up your files even more, and it only applies to the canonical chromosomes. The mitochondrial contig is a slightly different length (see error above) in addition to having a different naming convention, and all the other contigs (decoys, herpes virus etc) don't have direct equivalents.

So only try that if you know what you're doing. YMMV.


VCF file contigs not matching the reference

ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR contig known = chrM / 16569
ERROR contig reference = chrM / 16571.

Yep, it's just like the error we had with the BAM file above. Looks like we're using the wrong genome build again and a contig length doesn't match. But this time the error tells us that the mismatch is between the file identified as known and the reference:

Input files known and reference have incompatible contigs

In this case the error was output by a tool that takes a VCF file of known variants provided through the known argument, so this makes sense and tells us which file is at fault. Depending on the tool, the way the file is identified may vary, but the logic should be fairly obvious.

Solution

If you can, find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbsnp, you're in luck -- we make sets of suitable resources available for the supported reference builds. If you're working on your own installation of GATK, you can get these from the Resource Bundle. If you're using GATK on FireCloud, our cloud-based analysis platform, the featured GATK workspaces are preloaded with the appropriate resources.

If that's not an option, then you'll have to "liftover" -- specifically, lift over the mismatching VCF to the reference you need to work with. The best tool for this is Picard's LiftoverVcf. We provide several chain files for liftover between the major human reference builds, also in our resource bundle, in the Liftover_Chain_Files directory. If you are working with non-human organisms, we can't help you -- but others may have chain files, so ask around in your field.
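
A typical invocation looks something like this (the chain file name is a placeholder; pick the one matching your source and target builds from the bundle):

    java -jar picard.jar LiftoverVcf \
        I=input.hg19.vcf \
        O=output.b37.vcf \
        CHAIN=hg19tob37.chain \
        REJECT=rejected_records.vcf \
        R=b37_reference.fasta

Note that R here must be the target reference, and any records that cannot be lifted over end up in the REJECT file.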

GATK used to include some liftover utilities but we no longer support them.


Reference and Known input files in GATK hg38


Hi,

1) The dbSNP 151 VCF file states that it uses GRCh38.p7 as its reference. When I use dbSNP 151 in GATK4, should I use this specific reference build, or can I use whatever build I want, e.g. GRCh38.p12 (the latest)?

2) Can I use whatever build of GRCh38.p* I want in VariantRecalibrator together with the resource files used in this step from the bundle (1000G_phase1.snps.high_confidence.hg38.vcf.gz, 1000G_omni2.5.hg38.vcf.gz, hapmap_3.3.hg38.vcf.gz, etc.)? Or should I only use them with the specific hg38 reference file from the bundle?

3) Can I use 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf in VariantRecalibrator instead of 1000G_phase1.snps.high_confidence.hg38.vcf.gz? What exactly is the first one? It is in the cloud bundle but not in the FTP bundle(?!)

4) If I want to use the latest and best release of all of these files, which files should I use at every step?

Variant calling: Bam file is indexed but GATK thinks it is not


Trying to call variants from a bam file with the command below:

java -jar /usr/local/softw/GenomeAnalysisTK.jar -T UnifiedGenotyper -I LBK_Lipatov2014_DS2X.bam -R /mnt/NAS/share/Ref/human/hs37d5.fa -L 1000G.bed --output_mode EMIT_ALL_SITES --genotyping_mode GENOTYPE_GIVEN_ALLELES -o LBK_Lipatov2014_DS2X_GATK.raw.vcf

I get the error below saying the BAM is not indexed. But I did index the file, and the index is present in the same folder as the BAM. (I did the sorting and indexing with samtools, if that makes a difference.)

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid command line: Cannot process the provided BAM/CRAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAM/CRAMs in --unsafe mode, but this feature is unsupported -- use it at your own risk!

What can I do?
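
For reference, a typical samtools sort-and-index sequence looks like this (a sketch; the exact commands used weren't posted). One thing worth double-checking is that the index sits next to the BAM as LBK_Lipatov2014_DS2X.bam.bai (or LBK_Lipatov2014_DS2X.bai) and has a newer timestamp than the BAM itself, since a stale or misnamed index is a common cause of this error:

    samtools sort -o LBK_Lipatov2014_DS2X.bam unsorted.bam
    samtools index LBK_Lipatov2014_DS2X.bam    # writes LBK_Lipatov2014_DS2X.bam.bai
    ls -l LBK_Lipatov2014_DS2X.bam*            # index should be newer than the BAM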

Question about Base calling for pacbio data


I have looked at some links about running GATK on PacBio data, but I cannot work out what values should be used for -variant_index_parameter, -stand_emit_conf, and -stand_call_conf during variant calling with HaplotypeCaller.

Also, for VariantFiltration of SNPs and indels, what values should be used in an expression like the following? (See the generic example at the end of this post.)
-filterExpression 'QD || FS || MQ || HaplotypeScore || MappingQualityRankSum || ReadPosRankSum'

Could you give me an example of a basic call on PacBio data with the values of these parameters?
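
For reference, a complete expression with the generic hard-filtering thresholds from the GATK documentation looks like this for SNPs (these are the standard germline defaults, not PacBio-specific values, so treat them as a starting point only):

    java -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R reference.fasta \
        -V raw_snps.vcf \
        --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
        --filterName "snp_hard_filter" \
        -o filtered_snps.vcf

For indels, the documented generic defaults are QD < 2.0, FS > 200.0 and ReadPosRankSum < -20.0.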

Inference of genotype likelihoods for lower ploidy based on genotyping at higher ploidy using GATK


Let's say I have a bunch of mixed ploidy individuals (with biallelic markers) in my data. Some are tetraploid and some are diploid. But I choose to run GATK HaplotypeCaller (to get genotype likelihoods) with -ploidy set to 4 for all organisms since I know the highest ploidy level in the data to be 4.

My idea is to run the data and obtain genotype likelihoods with the highest resolution and then downscale those values obtained to a lower ploidy level post-hoc.

For instance, given that there are 5 genotype classes/dosage levels for tetraploid organisms (0 of the reference allele, 1 of the reference, 2 of the reference, 3 of the reference and 4 of the reference), I will get 5 phred-scaled scores for each locus in each individual. Each score represents the probability of having a certain count for the reference allele (0 through 4).
Now if I deduce that one of these individuals is a diploid but I've already run the analyses:

  • Can I just combine the genotype likelihoods of the 3 heterozygote classes in the tetraploid call (1/3, 2/2, 3/1) to get the genotype likelihood of the one heterozygote class (1/1) in a diploid individual?
  • If so, how do I do this quantitatively?

For example, at a locus in an individual that I assumed to be tetraploid during the GATK run, I get these phred-scaled genotype likelihoods:
0/4 1/3 2/2 3/1 4/0
6     67    0    4    60

But I now know that this individual is diploid, so I am now looking for just 3 phred-scaled genotype likelihoods instead of 5:
0/2 1/1 2/0
?     ?     ?

Would I keep the homozygote classes the same i.e. 6 and 60 and then just average the 3 dosage classes for the heterozygote of the diploid? Or would I perform another similar mathematical operation?
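
For context, phred-scaled PLs convert back to linear-scale likelihoods as L = 10^(-PL/10); here is that conversion for the example values above (just the rescaling -- how to collapse the tetraploid classes into diploid ones is the open question here):

    # PL -> linear-scale likelihood, L = 10^(-PL/10)
    echo "6 67 0 4 60" | awk '{ for (i = 1; i <= NF; i++) printf "%.2g ", 10^(-$i/10); print "" }'
    # output: 0.25 2e-07 1 0.4 1e-06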

Thanks,
Vivaswat

Goodbye GATK Forum


Dear GATK users,

I am writing this blog post to let you know it has been a wonderful 4+ years working on the forum and answering your questions. Thank you all for giving me a job for all this time and keeping me entertained :) I enjoyed learning all about the GATK while answering your questions. But the time has come for me to move on and learn some new skills.

I am not leaving the forum unattended, however. We have a new team member joining, along with a few other people who will be able to help you all out. There will be another blog post soon to introduce the new team members.

I trust these new team members will do a great job, and I wish you all the best in your journeys.

Sincerely,
Sheila

CombineGVCFs java.lang.IllegalArgumentException: Unexpected base in allele bases '*AACC'


Hi,

Hoping to work around the limitations of GenomicsDBImport, I've used CombineGVCFs to combine my data into batches of 200, and then combined those again into a master GVCF for genotyping. Unfortunately I seem to have run into an exception when attempting to combine my 200-sample batch GVCFs prior to genotyping.

Using GATK jar /lustre/scratch115/realdata/mdt2/projects/gdap-wgs/gvcf-4.0/scripts/gatk-4.0.1.2/gatk-package-4.0.1.2-local.jar
Running:
    /software/jre1.8.0_74/bin/java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Djava.io.tmpdir=/lustre/scratch115/projects/gdap-wgs/gvcf-4.0/tmp -XX:-UsePerfData -Xrs -Xmx3200m -jar /lustre/scratch115/realdata/mdt2/projects/gdap-wgs/gvcf-4.0/scripts/gatk-4.0.1.2/gatk-package-4.0.1.2-local.jar CombineGVCFs -R /lustre/scratch115/resources/ref/Homo_sapiens/HS38DH/hs38DH.fa -V /tmp/tmp.GXoF3ghQt3.list -O output/1.g.vcf.gz -L /lustre/scratch115/resources/ref/Homo_sapiens/HS38DH/intervals/arvados/wgs_calling_regions.hg38.interval_list.1_of_200.interval_list
12:08:21.902 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/lustre/scratch115/realdata/mdt2/projects/gdap-wgs/gvcf-4.0/scripts/gatk-4.0.1.2/gatk-package-4.0.1.2-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:08:22.587 INFO  CombineGVCFs - ------------------------------------------------------------
12:08:22.587 INFO  CombineGVCFs - The Genome Analysis Toolkit (GATK) v4.0.1.2
12:08:22.587 INFO  CombineGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
12:08:22.588 INFO  CombineGVCFs - Executing as mp15@bc-31-1-09 on Linux v3.2.0-105-generic amd64
12:08:22.588 INFO  CombineGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_74-b02
12:08:22.588 INFO  CombineGVCFs - Start Date/Time: 23 February 2018 12:08:21 GMT
12:08:22.589 INFO  CombineGVCFs - ------------------------------------------------------------
12:08:22.589 INFO  CombineGVCFs - ------------------------------------------------------------
12:08:22.589 INFO  CombineGVCFs - HTSJDK Version: 2.14.1
12:08:22.590 INFO  CombineGVCFs - Picard Version: 2.17.2
12:08:22.590 INFO  CombineGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
12:08:22.590 INFO  CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:08:22.590 INFO  CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:08:22.590 INFO  CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:08:22.590 INFO  CombineGVCFs - Deflater: IntelDeflater
12:08:22.591 INFO  CombineGVCFs - Inflater: IntelInflater
12:08:22.597 INFO  CombineGVCFs - GCS max retries/reopens: 20
12:08:22.597 INFO  CombineGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
12:08:22.597 INFO  CombineGVCFs - Initializing engine
12:08:25.101 INFO  FeatureManager - Using codec VCFCodec to read file file:///lustre/scratch115/projects/gdap-wgs/gvcf-4.0/gvcf-pcr_combined/1_1.g.vcf.gz
12:08:25.520 INFO  FeatureManager - Using codec VCFCodec to read file file:///lustre/scratch115/projects/gdap-wgs/gvcf-4.0/gvcf-pcrfree_combined/1_1.g.vcf.gz
12:08:26.136 INFO  FeatureManager - Using codec VCFCodec to read file file:///lustre/scratch115/projects/gdap-wgs/gvcf-4.0/gvcf-pcrfree_combined/1_2.g.vcf.gz
12:08:26.463 INFO  FeatureManager - Using codec VCFCodec to read file file:///lustre/scratch115/projects/gdap-wgs/gvcf-4.0/gvcf-pcrfree_combined/1_3.g.vcf.gz
12:09:15.365 INFO  IntervalArgumentCollection - Processing 14112327 bp from intervals
12:09:15.534 INFO  CombineGVCFs - Done initializing engine
12:09:17.050 INFO  ProgressMeter - Starting traversal
12:09:17.051 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
12:09:22.280 INFO  CombineGVCFs - Shutting down engine
[23 February 2018 12:09:22 GMT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 1.01 minutes.
Runtime.totalMemory()=2233991168
java.lang.IllegalArgumentException: Unexpected base in allele bases '*AACC'
    at htsjdk.variant.variantcontext.Allele.<init>(Allele.java:165)
    at htsjdk.variant.variantcontext.Allele.create(Allele.java:239)
    at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.extendAllele(ReferenceConfidenceVariantContextMerger.java:406)
    at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.remapAlleles(ReferenceConfidenceVariantContextMerger.java:178)
    at org.broadinstitute.hellbender.tools.walkers.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:70)
    at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.endPreviousStates(CombineGVCFs.java:340)
    at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.createIntermediateVariants(CombineGVCFs.java:189)
    at org.broadinstitute.hellbender.tools.walkers.CombineGVCFs.apply(CombineGVCFs.java:134)
    at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.apply(MultiVariantWalkerGroupedOnStart.java:73)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:110)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:108)
    at org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:118)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:277)

Coverage after Base Quality Recalibration: enough for variant calling?


Dear GATK community
I used Picard's CollectWgsMetrics to check the mean, median and sd of coverage for each of my samples (see attached figure).

For that I used base-quality-recalibrated BAMs and ran them through CollectWgsMetrics with default settings (including a minimum mapping quality of 20 and a minimum base quality of 20).
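
For reference, a default CollectWgsMetrics invocation looks something like this (a sketch; file names are placeholders):

    java -jar picard.jar CollectWgsMetrics \
        I=sample.recalibrated.bam \
        O=sample.wgs_metrics.txt \
        R=reference.fasta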

I was expecting to get coverage close to 30, as this is what we requested, however for many samples the coverage is much lower than that.

My general question is the same as in the title, but my specific questions are:

1/ am I using the right tool to check the coverage?
2/ did I run CollectWgsMetrics too stringently?
3/ should I evaluate coverage before base quality recalibration, and consider that as confirmation that the requested coverage was met?
4/ should we increase coverage for some of the samples by sequencing them again?

Any help and suggestions on how to go about this are greatly welcome.

Thanks in advance!


Could I use variantEval to evaluate somatic mutation calling by mutect2?


Hello,
I have a basic question about VariantEval: could I use this method to evaluate somatic mutations from a tumor? I tried with default parameters: even though more than 40K mutation lines passed the filter, VariantEval only reports about 1K variants in the callset. I am trying to understand why VariantEval keeps so few. What filter does VariantEval use, and how do I tell how these mutations were filtered out by VariantEval?
Thanks a lot for help,
WIlley

Recommended protocol for bootstrapping HaplotypeCaller and BaseRecalibrator outputs?


I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecalibrator, it is recommended that when lacking a reliable set of sequence variants:
"You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecalibrator in a loop should be fairly straightforward (see the sketch below). What is a good strategy for comparing VCF files and assessing convergence?
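
A minimal sketch of such a loop, assuming GATK4-style tool names and a purely illustrative hard filter (file names, thresholds and the fixed iteration count are all placeholders; RNA-seq pre-processing steps such as SplitNCigarReads are omitted):

    REF=reference.fasta
    BAM=sample.bam
    for i in 1 2 3; do
        gatk HaplotypeCaller -R $REF -I $BAM -O raw_$i.vcf.gz
        # Flag low-confidence calls, then keep only the records that pass:
        gatk VariantFiltration -R $REF -V raw_$i.vcf.gz \
            --filter-expression "QD < 2.0 || FS > 60.0" --filter-name lowconf \
            -O flagged_$i.vcf.gz
        gatk SelectVariants -V flagged_$i.vcf.gz --exclude-filtered -O known_$i.vcf.gz
        gatk BaseRecalibrator -R $REF -I $BAM --known-sites known_$i.vcf.gz -O recal_$i.table
        gatk ApplyBQSR -R $REF -I $BAM --bqsr-recal-file recal_$i.table -O recal_$i.bam
        BAM=recal_$i.bam   # feed the recalibrated BAM into the next round
    done

For convergence, one plausible metric is the fraction of sites shared between successive rounds' high-confidence call sets, e.g. counted with bcftools isec.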

How to get best performance in Merging 6K samples using CombineGVCFs and GenotypeGVCFs (GATK 4.0.9)


I am looking for the right practices for merging 6K samples using CombineGVCFs and GenotypeGVCFs to get the best performance. I don't see a Spark suffix appended to these tools; does that mean they can't make use of Spark?

I also came across some old threads about plans to bring TileDB into the 4.0 version, which could improve performance. How can I make use of TileDB?

PS: I generated the GVCFs using HaplotypeCallerSpark in GATK4

Difference between Genotype GVCFs and CombineGVCFs


Hi,

I would like to be sure of the difference between those 2 tools.
From what I understand, GenotypeGVCFs somehow recalculates likelihoods and annotations (QUAL, DP, MQ...) for each variant position present in at least one input sample. Right? And is that not the case for CombineGVCFs?

Thank you for your help,
Fabrice

HaplotypeCaller --dbsnp


The doc says "dbSNP is not used in any way for the calculations themselves. --dbsnp binds reference ordered data". Does this mean that the determination of whether a locus is a variant is not influenced by whether that variant is present in dbSNP? And what does "--dbsnp binds reference ordered data" mean?

Also why isn't there a --indel option?

Are there any plans to add multi-interval support to GenomicsDBImport?


The reason I ask is that it's rather annoying when you're chunking your input data and one of your chunks crosses a chromosome boundary. According to the GitHub docs, it seems that GenomicsDB supports this with vcf2tiledb, but I'm not sure whether it will then work with GenotypeGVCFs?

Germline short variant discovery (SNPs + Indels)


Purpose

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.



Reference Implementations

Pipeline                                          | Summary                     | Notes                                    | Github | FireCloud
Prod* germline short variant per-sample calling   | uBAM to GVCF                | optimized for GCP                        | yes    | pending
Prod* germline short variant joint genotyping     | GVCFs to cohort VCF         | optimized for GCP                        | yes    | pending
$5 Genome Analysis Pipeline                       | uBAM to GVCF or cohort VCF  | optimized for GCP (see blog)             | yes    | hg38
Generic germline short variant per-sample calling | analysis-ready BAM to GVCF  | universal                                | yes    | hg38
Generic germline short variant joint genotyping   | GVCFs to cohort VCF         | universal                                | yes    | hg38 & b37
Intel germline short variant per-sample calling   | uBAM to GVCF                | Intel optimized for local architectures  | yes    | NA

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.


Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.


Main steps

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF, which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes.

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.
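
A minimal per-sample invocation in GATK4 looks something like this (file names are placeholders):

    gatk HaplotypeCaller \
        -R reference.fasta \
        -I sample1.bam \
        -O sample1.g.vcf.gz \
        -ERC GVCF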

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed up the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step: variants in the resulting merged GVCF cannot be considered to have been called jointly.

Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.
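
As a sketch, consolidating a few samples for one interval might look like this (workspace path and interval are placeholders; the tool requires -L and, in the GATK4 versions current at the time of writing, operated on a single interval per run):

    gatk GenomicsDBImport \
        -V sample1.g.vcf.gz \
        -V sample2.g.vcf.gz \
        -V sample3.g.vcf.gz \
        --genomicsdb-workspace-path my_database \
        -L chr20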

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.
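
A sketch of the joint genotyping call, reading directly from the GenomicsDB workspace created above (the gendb:// prefix tells the tool the input is a datastore rather than a file):

    gatk GenotypeGVCFs \
        -R reference.fasta \
        -V gendb://my_database \
        -O cohort.vcf.gz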

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyRecalibration

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.
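
As a sketch, the two steps for SNPs might look like this (resource files are the hg38 bundle versions; the apply step is named ApplyVQSR in GATK4, and exact flag spellings vary between versions, so check the docs for yours):

    gatk VariantRecalibrator \
        -R reference.fasta -V cohort.vcf.gz \
        --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
        --resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
        --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
        --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_146.hg38.vcf.gz \
        -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum -an SOR \
        -mode SNP \
        -O cohort.snps.recal --tranches-file cohort.snps.tranches

    gatk ApplyVQSR \
        -R reference.fasta -V cohort.vcf.gz \
        --recal-file cohort.snps.recal --tranches-file cohort.snps.tranches \
        --truth-sensitivity-filter-level 99.7 \
        -mode SNP -O cohort.snps.vqsr.vcf.gz

Indels get a second pass with -mode INDEL and the indel-appropriate resources.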

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.


Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which require access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described below, enabling incremental growth of cohorts as well as scaling to large cohort sizes.


How to get exact allele frequency using HaplotypeCaller-GATK4 and speed up the running?


Hi there,

I have performed HaplotypeCaller in GATK4 (version:4.0.9.0) for variant calling of germline DNA. Here are the results in the vcf file.

chr1 17365 rs369606208 C G 146.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.215;DB;DP=53;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=40.27;MQRankSum=-0.719;QD=2.77;ReadPosRankSum=-0.581;SOR=0.664 GT:AD:DP:GQ:PL 0/1:43,10:53:99:175,0,1354
chr1 17407 rs372841554 G A 249.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.924;DB;DP=106;ExcessHet=3.0103;FS=2.034;MLEAC=1;MLEAF=0.500;MQ=46.70;MQRankSum=-0.797;QD=2.36;ReadPosRankSum=0.729;SOR=0.433 GT:AD:DP:GQ:PL 0/1:84,22:106:99:278,0,2088
chr1 981931 rs2465128 A G 2859.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=0.989;DB;DP=98;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=29.18;ReadPosRankSum=-0.013;SOR=1.313 GT:AD:DP:GQ:PL 1/1:2,96:98:99:2888,239,0

As you can see, all the AF values are the estimates 0.5 or 1.0, not exact numbers. In the meantime, I ran Mutect2 in GATK4 on another sample, and I could get the exact AF from Mutect2. Would you please help me figure this out?

Here is the script for HaplotypeCaller:
gatk --java-options "-Xmx40g" HaplotypeCaller -R /data3/IonProton/bwa-index_hg19/hg19_forGATK_sinica.fa -I 13029-10ng_F0x100_intersect.bam -O 13029-10ng_F0x100_intersect_HC.vcf.gz --dbsnp /data3/IonProton/bwa-index_hg19/dbsnp_138.hg19.vcf

In addition, is there any way to speed up HaplotypeCaller and Mutect2? A single job (one sample) of HaplotypeCaller or Mutect2 took me about 14 hr and 19 hr, respectively.

Thank you!




Interpreting CNV pipeline results


Dear GATK team, first of all, thanks a lot for the very detailed manual for CNV calling; it was super helpful!

Second, I'd like to ask for advice on interpreting the results.

I'm trying to perform CNV calling with a matched tumor-normal pair, but without a PON (I don't have enough of my own samples to construct one, and I'm not sure that samples I could download would perform better than the matched normal). To complicate things, it seems that my tumor sample is strongly contaminated with normal tissue (which I had already learned from the Mutect2 results).
After running the pipeline (https://gatkforums.broadinstitute.org/dsde/discussion/11682,
https://gatkforums.broadinstitute.org/dsde/discussion/11683) with default parameters (for starters), I'm getting the following output (please see the attached image).

I'm not quite sure how to interpret a situation where, according to the AF plot, my sample seems to have a copy loss over the whole of chromosomes 7 and 14 and a partial loss of chromosome 4, but there is no evidence for that in the CR plot. Similarly, there seems to be a copy-number change on chromosome 20 in the CR plot, but no evidence for it in the AF plot.
Does that mean there are copy-neutral LOH events in my sample?

Best, Eugene
