Troubleshoot non-called variants

January 3, 2019, 1:23 pm

I am calling variants on a polyploid plant sample using GATK 4 HaplotypeCaller.
SNP calling works reasonably well, but in some regions the read alignments clearly show SNPs that GATK4 does not call. I have tweaked many parameters, but I cannot get these SNPs into my output VCF file.
Does anyone have recommendations for troubleshooting these false-negative SNPs?
Or which parameters I can tweak to get these SNPs called?

Thanks a lot!

↧

the reference allele mismatch and the annotation program give me Invalid alternative alleles

January 3, 2019, 4:05 pm

Lifting variants over and sorting (not yet writing the output file.)
INFO 2019-01-02 21:12:31 LiftoverVcf Processed 116415 variants.
INFO 2019-01-02 21:12:31 LiftoverVcf 0 variants failed to liftover.
INFO 2019-01-02 21:12:31 LiftoverVcf 31704 variants lifted over but had mismatching reference alleles after lift over.
INFO 2019-01-02 21:12:31 LiftoverVcf 27.2336% of variants were not successfully lifted over and written to the output.
INFO 2019-01-02 21:12:31 LiftoverVcf liftover success by source contig:
INFO 2019-01-02 21:12:31 LiftoverVcf 1: 84711 / 116415 (72.7664%)
INFO 2019-01-02 21:12:31 LiftoverVcf lifted variants by target contig:
INFO 2019-01-02 21:12:31 LiftoverVcf 1: 84711
WARNING 2019-01-02 21:12:31 LiftoverVcf 29374 variants with a swapped REF/ALT were identified, but were not recovered. See RECOVER_SWAPPED_REF_ALT and associated caveats.
INFO 2019-01-02 21:12:31 LiftoverVcf Writing out sorted records to final VCF.
[Wed Jan 02 21:12:50 CST 2019] picard.vcf.LiftoverVcf done. Elapsed time: 1.09 minutes.
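
The counts in the log above are internally consistent, which a little arithmetic confirms. A minimal sketch using only the numbers printed in the log; the variable names are illustrative, and it assumes the swapped REF/ALT variants are counted among the reference mismatches:

```shell
# Counts copied from the LiftoverVcf log above.
total=116415      # variants processed
mismatch=31704    # lifted over, but with a mismatching reference allele
swapped=29374     # variants flagged as having a swapped REF/ALT

# Percentage not written to the output (the log reports 27.2336%).
rejected_pct=$(awk -v a="$mismatch" -v b="$total" 'BEGIN { printf "%.4f", 100 * a / b }')
# Percentage written to the output (the log reports 72.7664%).
lifted_pct=$(awk -v a="$mismatch" -v b="$total" 'BEGIN { printf "%.4f", 100 * (b - a) / b }')
# Share of the rejected variants that were flagged as REF/ALT swaps.
swapped_pct=$(awk -v a="$swapped" -v b="$mismatch" 'BEGIN { printf "%.1f", 100 * a / b }')

echo "rejected: ${rejected_pct}%  lifted: ${lifted_pct}%  swapped among rejected: ${swapped_pct}%"
```

Under that assumption, most of the rejected variants are REF/ALT swaps, which is the case the RECOVER_SWAPPED_REF_ALT option named in the warning is meant for. The caveats mentioned in the warning are worth reading before enabling it.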

↧

SplitNCigarReads reduces the number of exons per transcript to one

January 4, 2019, 10:27 am

I am working with RNA-seq data from a non-model Drosophila species, mapping with HISAT2 and using StringTie to reconstruct transcripts.

Code:

./hisat2 --phred33 --rna-strandness RF -x reference.fa -1 reads1.fq -2 reads2.fq -S out.sam

samtools view -Su out.sam | samtools sort - out.sorted

samtools addreplacerg -r 'ID:' -r 'LB:' -r 'SM:' -o out.sorted.RG.bam out.sorted.bam

samtools index out.sorted.RG.bam

java -jar picard.jar MarkDuplicates I=out.sorted.RG.bam O=out.sorted.RG.duplicates.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=out.metrics

SplitNCigarReads was used in one pipeline and not the other:

java -jar GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I out.sorted.RG.duplicates.bam -o out.sorted.RG.duplicates.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

Initially I did the assembly without the split 'n' cigar step with GATK. The output of stringtie then had multiple exons per transcript in the .gtf file, as follows:

transcript 18495 19529 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; cov "3.623000"; FPKM "4.150527"; TPM "4.448586";
exon 18495 18666 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "1"; cov "1.837597";
exon 18740 18908 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "2"; cov "5.701381";
exon 18996 19229 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "3"; cov "3.265812";
exon 19278 19529 1000 - . gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "4"; cov "3.779449";

Wanting to make sure my analysis was robust, I ran split 'n' cigar on the alignment files and redid the estimation of transcript abundance with stringtie.

Following this, only one exon was estimated per transcript, like so:

transcript 6866 7438 1000 + . gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "12.038646"; FPKM "2.925254"; TPM "2.443733";
exon 6866 7438 1000 + . gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "12.038646";
transcript 12315 12592 1000 + . gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "2.542066"; FPKM "0.617693"; TPM "0.516016";
exon 12315 12592 1000 + . gene_id "STRG.2"; transcript_id "STRG.2.1"; exon_number "1"; cov "2.542066";

I think including SplitNCigarReads is an appropriate part of the workflow, but I must admit that having only one exon per transcript in the output file intuitively seems wrong. In addition, I do not understand how SplitNCigarReads would cause this difference. Am I using SplitNCigarReads incorrectly? Is there something I don't understand about the workflow?

I'm pretty confused. If anybody has any insight into this issue, or more experience with what StringTie output should optimally look like, that would be most appreciated.

Thanks!
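
One way to put numbers on the difference between the two runs is to count exon records per transcript in each GTF. A minimal sketch, assuming a standard tab-separated 9-column GTF with the feature type in column 3; the function name is illustrative, and the three input records are made-up stand-ins for real StringTie output (contig1 is a placeholder sequence name):

```shell
# Count exon records per transcript in a GTF read from stdin.
exons_per_transcript() {
    awk -F '\t' '
        $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            # Skip the 15-character prefix (transcript_id plus space and quote)
            # and drop the closing quote to isolate the identifier.
            id = substr($9, RSTART + 15, RLENGTH - 16)
            count[id]++
        }
        END { for (id in count) print id, count[id] }' | sort
}

# Toy input: two exons for STRG.7.1 and one for STRG.1.1.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    contig1 StringTie exon 18495 18666 1000 - . 'gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "1";' \
    contig1 StringTie exon 18740 18908 1000 - . 'gene_id "STRG.7"; transcript_id "STRG.7.1"; exon_number "2";' \
    contig1 StringTie exon 6866 7438 1000 + . 'gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1";' |
    exons_per_transcript
```

With real files, something like `exons_per_transcript < out.gtf` for each pipeline would show whether every transcript has collapsed to a single exon or only some of them.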

↧

joint-discovery-gatk4-local.wdl uses deprecated output syntax

January 4, 2019, 5:02 pm

In the GitHub-provided, official WDL files for running GATK, the "local" version of joint discovery uses a deprecated form of declaring outputs, which prevents it from being run as a subworkflow.

https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4-local.wdl

Note that another user already has a pull request with a fix for this issue. However, that has been pending for nearly a month now, so might be worth accepting:

https://github.com/gatk-workflows/gatk4-germline-snps-indels/pull/29

↧

ERROR MESSAGE: java.lang.Integer cannot be cast to java.lang.Double

December 16, 2014, 6:44 am

Hi,

After running 5 exomes with GATK v3.3 and HaplotypeCaller, I saw a very low Ti/Tv ratio in my samples (~2.1), as the VariantEval report indicated. I tried running VariantFiltration on these samples, but I saw no improvement in the Ti/Tv ratio, nor was any filtering done. I therefore filtered them with bcftools, after which the Ti/Tv ratio improved to 2.5. Then, when I tried running GenotypeGVCFs on the bcftools-filtered samples, I encountered the following error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
at java.lang.Double.compareTo(Double.java:49)
at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:290)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:157)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
at java.util.Arrays.sort(Arrays.java:472)
at java.util.Collections.sort(Collections.java:155)
at org.broadinstitute.gatk.utils.MathUtils.median(MathUtils.java:999)
at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.combineAnnotationValues(ReferenceConfidenceVariantContextMerger.java:73)
at org.broadinstitute.gatk.tools.walkers.variantutils.ReferenceConfidenceVariantContextMerger.merge(ReferenceConfidenceVariantContextMerger.java:158)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:202)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:121)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:310)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-11-17-g58cfab1):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: java.lang.Integer cannot be cast to java.lang.Double
ERROR ------------------------------------------------------------------------------------------

Any advice on solving this issue would be much appreciated.

Victoria

↧

why new version mutect2 output vcf add DP?

January 5, 2019, 6:27 am

I find that the Mutect2 output in GATK 4.0.11.0 differs from 4.0.0.0 in VCF structure:
GT:AD:AF:DP:F1R2:F2R1 (4.0.11.0)
GT:AD:AF:F1R2:F2R1 (4.0.0.0)
Why was this change needed? For example, in the new version
GT:AD:AF:DP: has the value 0/1:31,34:0.522:65:
I once read that in a somatic VCF the first value of AD does not stand for the ref reads. Is that true? Thanks a lot.

Also, can you tell me whether AF is usually smaller or bigger than the fraction computed from AD (I mean second / (first + second), here 34/(31+34))?

FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">

FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
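
For reference, the fraction implied by the AD values quoted above can be computed directly. A minimal sketch that only does the arithmetic on the numbers from the post; it says nothing about how Mutect2 itself derives AF:

```shell
# AD values from the example genotype 0/1:31,34:0.522:65
ad_ref=31
ad_alt=34

# Alt reads over total reads counted in AD.
naive_af=$(awk -v r="$ad_ref" -v a="$ad_alt" 'BEGIN { printf "%.3f", a / (r + a) }')
echo "AD-based fraction: $naive_af (reported AF: 0.522, reported DP: 65)"
```

Here the AD-based fraction comes out at 0.523, slightly different from the reported AF of 0.522, and the two AD values sum to the reported DP of 65.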

↧

Where can I find the detailed documentation on how Mutect2 works?

January 5, 2019, 3:19 am

Note: I have already read the following page.
https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php
Please recommend something more detailed.
Thanks

↧

Evaluating the quality of a variant callset? is there something new to add?

January 5, 2019, 6:41 am

https://software.broadinstitute.org/gatk/documentation/article.php?id=6308
Is there anything new to add for the updated version of GATK? Thanks a lot.

↧

ClippingRankSum tag missing in vcf files

January 6, 2019, 7:40 pm

Hi,

I am using GATK version 4.0.10.0 and find that the ClippingRankSum tag is missing after running the Best Practices workflow. Is this because the tag is deprecated, or did I miss something in running the Best Practices? Thanks!

↧

Spark

January 18, 2018, 7:02 pm

In a nutshell, Spark is a piece of software that GATK4 uses to do multithreading, which is a form of parallelization that allows a computer (or cluster of computers) to finish executing a task sooner. You can read more about multithreading and parallelism in GATK here. The Spark software library is open-source and maintained by the Apache Software Foundation. It is very widely used in the computing industry and is one of the most promising technologies for accelerating execution of analysis pipelines.

Not all GATK tools use Spark

Tools that can use Spark generally have a note to that effect in their respective Tool Doc.

- Some GATK tools exist in distinct Spark-capable and non-Spark-capable versions

The "sparkified" versions have the suffix "Spark" at the end of their names. Many of these are still experimental; down the road we plan to consolidate them so that there will be only one version per tool.

- Some GATK tools only exist in a Spark-capable version

Those tools don't have the "Spark" suffix.

You don't need a Spark cluster to run Spark-enabled GATK tools!

If you're working on a "normal" machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine -- or however many you choose to allocate. See the example parameters below and the local-Spark tutorial for more information on how to control this. And if your machine only has a single core, these tools can always be run in single-core mode -- it'll just take longer for them to finish.

To be clear, even the Spark-only tools can be run on regular machines, though in practice a few of them may be prohibitively slow (SV tools and PathSeq). See the Tool Docs for tool-specific recommendations.

If you do have access to a Spark cluster, the Spark-enabled tools are going to be extra happy but you may need to provide some additional parameters to use them effectively. See the cluster-Spark tutorial for more information.

Example command-line parameters

Here are some example arguments you would give to a Spark-enabled GATK tool:

--spark-master local[*] -> "Run on the local machine using all cores"
--spark-master local[2] -> "Run on the local machine using two cores"
--spark-master spark://23.195.26.187:7077 -> "Run on the cluster at 23.195.26.187, port 7077"
--spark-runner GCS --cluster my_cluster -> "Run on my_cluster in Google Dataproc"
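
As a rough illustration of the local-mode values above, the sketch below builds a `--spark-master` value and prints the resulting command rather than running it. The helper name and file names are hypothetical, PrintReadsSpark is used only as an example tool, and placing the Spark arguments after a `--` separator is an assumption to check against the documentation for your GATK version.

```shell
# Hypothetical helper: build a --spark-master value for local mode.
# With no argument it falls back to "*", meaning all available cores.
spark_master_local() {
    cores=${1:-*}
    printf 'local[%s]\n' "$cores"
}

master=$(spark_master_local 2)

# Dry run: print the command instead of executing it.
# Keep the value quoted, because unquoted square brackets and "*" are
# glob characters in most shells.
echo gatk PrintReadsSpark -I input.bam -O output.bam -- --spark-master "$master"
```

Calling `spark_master_local` with no argument gives the all-cores form, `local[*]`.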

You don't need to install any additional software to use Spark in GATK

All the necessary software for using Spark, whether it's on a local machine or a Spark cluster, is bundled within the GATK itself. Just make sure to invoke GATK using the gatk wrapper script rather than calling the jar directly, because the wrapper will select the appropriate jar file (there are two!) and will set some parameters for you.

↧

guidelines for BAM realignment in GATK 4

January 7, 2019, 4:02 am

Dear all,
I have aligned my sequences with BWA and HISAT2, and I would like to perform the realignment step. I find conflicting reports on the internet about whether to perform it or not. Is it still good practice to perform realignment, and how can I run it?
I created a dictionary with
java -jar picard.jar CreateSequenceDictionary R=<ref.fa> O=<ref.dict>
But I do not have a VCF file. The syntax I have is:
# the dict file is in the same folder as <ref.fa>
java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R <ref.fa> \
-I <file_dedup.bam> \
-o <file.intervals>

(the -known option is skipped) followed by
java -Xmx8G -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar \
-T IndelRealigner \
-R <ref.fa> \
-targetIntervals <file.intervals> \
-I <file_dedup.bam> \
-o <file_indelrealigned.bam>

What would be the new syntax for GATK4?
Thank you,
Luigi

↧

(How to) Map reads to a reference with alternate contigs like GRCh38

July 19, 2016, 10:15 am

Document is in `BETA`. It may be incomplete and/or inaccurate. Post suggestions to the Comments section and be sure to read about updates also within the Comments section.

This exploratory tutorial provides instructions and example data to map short reads to a reference genome with alternate haplotypes. Instructions are suitable for indexing and mapping reads to GRCh38.

► If you are unfamiliar with terms that describe reference genome components, or GRCh38 alternate haplotypes, take a few minutes to study the Dictionary entry Reference Genome Components.

► For an introduction to GRCh38, see Blog#8180.

Specifically, the tutorial uses BWA-MEM to index and map simulated reads for three samples to a mini-reference composed of a GRCh38 chromosome and alternate contig (sections 1–3). We align in an alternate contig aware (alt-aware) manner, which we also call alt-handling. This is the main focus of the tutorial.

The decision to align to a genome with alternate haplotypes has implications for variant calling. We discuss these in section 5 using the callset generated with the optional tutorial steps outlined in section 4. Because we strategically placed a number of SNPs on the sequence used to simulate the reads, in both homologous and divergent regions, we can use the variant calls and their annotations to examine the implications of analysis approaches. To this end, the tutorial fast-forwards through pre-processing and calls variants for a trio of samples that represents the combinations of the two reference haplotypes (the PA and the ALT). This first workflow (tutorial_8017) is suitable for calling variants on the primary assembly but is insufficient for capturing variants on the alternate contigs.

For those who are interested in calling variants on the alternate contigs, we also present a second and a third workflow in section 6. The second workflow (tutorial_8017_toSE) takes the processed BAM from the first workflow, makes some adjustments to the reads to maximize their information, and calls variants on the alternate contig. This approach is suitable for calling on ~75% of the non-HLA alternate contigs or ~92% of loci with non-HLA alternate contigs (see table in section 6). The third workflow (tutorial_8017_postalt) takes the alt-aware alignments from the first workflow and performs a postalt-processing step as well as the same adjustment from the second workflow. Postalt-processing uses the bwa-postalt.js javascript program that Heng Li provides as a companion to BWA. This allows for variant calling on all alternate contigs including HLA alternate contigs.

The tutorial ends by comparing the difference in call qualities from the multiple workflows for the given example data and discusses a few caveats of each approach.

► The three workflows shown in the diagram above are available as WDL scripts in our GATK Tutorials WDL scripts repository.

Tools involved

BWA v0.7.13 or later releases. The tutorial uses v0.7.15.
Download from here and see Tutorial#2899 for installation instructions.
The bwa-postalt.js script is within the bwakit folder.
Picard tools v2.5.0 or later releases. The tutorial uses v2.5.0.
Optional GATK tools. The tutorial uses v3.6.
Optional Samtools. The tutorial uses v1.3.1.
Optional Gawk, an AWK-like tool that can interpret bitwise SAM flags. The tutorial uses v4.1.3.
Optional k8 Javascript shell. The tutorial uses v0.2.3 downloaded from here.

Download example data

Download tutorial_8017.tar.gz, either from the GoogleDrive or from the ftp site. To access the ftp site, leave the password field blank. The data tarball contains the paired FASTQ reads files for three samples. It also contains a mini-reference chr19_chr19_KI270866v1_alt.fasta and corresponding .dict dictionary, .fai index and six BWA indices including the .alt index. The data tarball includes the output files from the workflow that we care most about. These are the aligned SAMs, processed and indexed BAMs and the final multisample VCF callsets from the three presented workflows.

The mini-reference contains two contigs subset from human GRCh38: chr19 and chr19_KI270866v1_alt. The ALT contig corresponds to a diverged haplotype of chromosome 19. Specifically, it corresponds to chr19:34350807-34392977, which contains the glucose-6-phosphate isomerase or GPI gene. Part of the ALT contig introduces novel sequence that lacks a corresponding region in the primary assembly.

Using instructions in Tutorial#7859, we simulated paired 2x151 reads to derive three different sample reads that when aligned give roughly 35x coverage for the target primary locus. We derived the sequences from either the 43 kbp ALT contig (sample ALTALT), the corresponding 42 kbp region of the primary assembly (sample PAPA) or both (sample PAALT). Before simulating the reads, we introduced four SNPs to each contig sequence in a deliberate manner so that we can call variants.

► Alternatively, you may instead use the example input files and commands with the full GRCh38 reference. Results will be similar with a handful of reads mapping outside of the mini-reference regions.

1. Index the reference FASTA for use with BWA-MEM

Our example chr19_chr19_KI270866v1_alt.fasta reference already has chr19_chr19_KI270866v1_alt.dict dictionary and chr19_chr19_KI270866v1_alt.fasta.fai index files for use with Picard and GATK tools. BWA requires a different set of index files for alignment. The command below creates five of the six index files we need for alignment. The command calls the index function of BWA on the reference FASTA.

bwa index chr19_chr19_KI270866v1_alt.fasta

This gives .pac, .bwt, .ann, .amb and .sa index files that all have the same chr19_chr19_KI270866v1_alt.fasta basename. Tools recognize index files within the same directory by their identical basename. In the case of BWA, it uses the basename preceding the .fasta suffix and searches for the index file, e.g. with .bwt suffix or .64.bwt suffix. Depending on which of the two choices it finds, it looks for the same suffix for the other index files, e.g. .alt or .64.alt. Lack of a matching .alt index file will cause BWA to map reads without alt-handling. More on this next.

Note that the .64. part is an explicit indication that index files were generated with version 0.6 or later of BWA and are the 64-bit indices (as opposed to files generated by earlier versions, which were 32-bit). This .64. signifier can be added automatically by adding -6 to the bwa index command.

2. Include the reference ALT index file

Be sure to place the tutorial's mini-ALT index file chr19_chr19_KI270866v1_alt.fasta.alt with the other index files. Also, if it does not already match, change the file basename to match. This is the sixth index file we need for alignment. BWA-MEM uses this file to prioritize primary assembly alignments for reads that can map to both the primary assembly and an alternate contig. See BWA documentation for details.

As of this writing (August 8, 2016), the SAM format ALT index file for GRCh38 is available only in the x86_64-linux bwakit download as stated in this bwakit README. The hs38DH.fa.alt file is in the resource-GRCh38 folder.
In addition to mapped alternate contig records, the ALT index also contains decoy contig records as unmapped SAM records. This is relevant to the postalt-processing we discuss in section 6.2. As such, the postalt-processing in section 6 also requires the ALT index.

For the tutorial, we subset from hs38DH.fa.alt to create a mini-ALT index, chr19_chr19_KI270866v1_alt.fasta.alt. Its contents are shown below.

The record aligns the chr19_KI270866v1_alt contig to the chr19 locus starting at position 34,350,807 and uses CIGAR string nomenclature to indicate the pairwise structure. To interpret the CIGAR string, think of the primary assembly as the reference and the ALT contig sequence as the read. For example, the 11307M at the start indicates 11,307 corresponding sequence bases, either matches or mismatches. The 935S at the end indicates a 935 base softclip for the ALT contig sequence that lacks corresponding sequence in the primary assembly. This is a region that we consider highly divergent or novel. Finally, notice the NM tag that notes the edit distance to the reference.
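
To make the reference-versus-read bookkeeping concrete, the sketch below totals the bases a CIGAR string consumes on each side. The CIGAR used here is a shortened, hypothetical one: only the leading 11307M and trailing 935S come from the record described above, and the 2I5D in the middle is invented for illustration.

```shell
# Toy CIGAR (hypothetical apart from its first and last operations).
cigar='11307M2I5D935S'

# Split into <length><op> tokens, then total the bases each side consumes.
# The reference (primary assembly) is consumed by M, D, N, = and X.
# The query (ALT contig) is consumed by M, I, S, = and X.
summary=$(printf '%s\n' "$cigar" | grep -oE '[0-9]+[MIDNSHP=X]' | awk '
    {
        op = substr($0, length($0), 1)
        n  = substr($0, 1, length($0) - 1) + 0
        if (op ~ /[MDN=X]/) ref += n
        if (op ~ /[MIS=X]/) alt += n
    }
    END { printf "ref_bases=%d alt_bases=%d\n", ref, alt }')

echo "$summary"
```

For this toy string the primary assembly side spans 11,312 bases and the ALT contig side 12,244 bases. The 935 soft-clipped bases count only toward the ALT contig, which matches the description of that region as lacking corresponding primary assembly sequence.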

☞ What happens if I forget the ALT index file?

If you omit the ALT index file from the reference, or if its naming structure mismatches the other indexes, then your alignments will be equivalent to the results you would obtain if you run BWA-MEM with the -j option. The next section gives an example of what this looks like.
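
A quick way to catch a missing or misnamed index before aligning is to check for all six files. A minimal sketch based on the suffixes listed above; the function name is illustrative, and it does not look for the .64. variants:

```shell
# Report which BWA index files are missing next to a reference FASTA.
check_bwa_index() {
    ref=$1
    missing=""
    for suffix in amb ann bwt pac sa alt; do
        [ -e "$ref.$suffix" ] || missing="$missing .$suffix"
    done
    if [ -z "$missing" ]; then
        echo "all six index files present"
    else
        echo "missing:$missing"
    fi
}

check_bwa_index chr19_chr19_KI270866v1_alt.fasta
```

If the report lists only `.alt` as missing, BWA-MEM would still run, but without alt-handling, as described above.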

3. Align reads with BWA-MEM

The command below uses an alt-aware version of BWA and maps reads using BWA's maximal exact match (MEM) option. Because the ALT index file is present, the tool prioritizes mapping to the primary assembly over ALT contigs. In the command, the tutorial's chr19_chr19_KI270866v1_alt.fasta serves as reference; one FASTQ holds the forward reads and the other holds the reverse reads.

bwa mem chr19_chr19_KI270866v1_alt.fasta 8017_read1.fq 8017_read2.fq > 8017_bwamem.sam

The resulting file 8017_bwamem.sam contains aligned read records.

BWA preferentially maps to the primary assembly any reads that align equally well to the primary assembly and the ALT contigs, as well as any reads that it can reasonably align to the primary assembly even if they align better to an ALT contig. Preference is given by the primary alignment record status, i.e. not secondary and not supplementary. BWA takes the reads that it cannot map to the primary assembly and attempts to map them to the alternate contigs. If a read can map to an alternate contig, then it is mapped to the alternate contig as a primary alignment. For those reads that can map to both and align better to the ALT contig, the tool flags the ALT contig alignment record as supplementary (0x800). This is what we call alt-aware mapping or alt-handling.
Adding the -j option to the command disables the alt-handling. Reads that can map to multiple locations are then given low or zero MAPQ scores.

☞ How can I tell if a BAM was aligned with alt-handling?

There are two approaches to this question.

First, you can view the alignments on IGV and compare primary assembly loci with their alternate contigs. The IGV screenshots to the right show how BWA maps reads with (top) or without (bottom) alt-handling.

Second, you can check the alignment SAM. Of two tags that indicate alt-aware alignment, one will persist after preprocessing only if the sample has reads that can map to alternate contigs. The first tag, the AH tag, is in the BAM header section of the alignment file, and is absent after any merging step, e.g. merging with MergeBamAlignment. The second tag, the pa tag, is present for reads that the aligner alt-handles. If a sample does not contain any reads that map equally or preferentially to alternate contigs, then this tag may be absent in a BAM even if the alignments were mapped in an alt-aware manner.

Here are three headers for comparison where only one indicates alt-aware alignment.

File header for alt-aware alignment. We use this type of alignment in the tutorial.
Each alternate contig's @SQ line in the header will have an AH:* tag to indicate alternate contig handling for that contig. This marking is based on the alternate contig being listed in the .alt index file and alt-aware alignment.

File header for -j alignment (alt-handling disabled) for example purposes. We do not perform this type of alignment in the tutorial.
Notice the absence of any special tags in the header.

File header for alt-aware alignment after merging with MergeBamAlignment. We use this step in the next section.
Again, notice the absence of any special tags in the header.
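
The header check described above can be scripted. A minimal sketch that counts @SQ lines carrying an AH tag in SAM header text read from stdin; the function name is illustrative, and the two-line header is a made-up example with placeholder lengths:

```shell
# Count @SQ header lines that carry an AH tag.
count_alt_handled_contigs() {
    awk -F '\t' '
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^AH:/) { n++; break }
            }
        }
        END { print n + 0 }'
}

# Toy header: one primary contig and one alternate contig tagged AH:*.
printf '@SQ\tSN:chr19\tLN:100\n@SQ\tSN:chr19_KI270866v1_alt\tLN:50\tAH:*\n' |
    count_alt_handled_contigs
```

With a real file the header could be piped in with `samtools view -H`. A count of zero does not rule out alt-aware alignment, since the tag is absent after merging with MergeBamAlignment.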

☞ What is the `pa` tag?

For BWA v0.7.15, but not v0.7.13, ALT loci alignment records that can align to both the primary assembly and alternate contig(s) will have a pa tag on the primary assembly alignment. For example, read chr19_KI270866v1_alt_4hetvars_26518_27047_0:0:0_0:0:0_931 of the ALTALT sample has five alignment records only three of which have the pa tag as shown below.

A brief description of each of the five alignments, in order:

First in pair, primary alignment on the primary assembly; AS=146, pa=0.967
First in pair, supplementary alignment on the alternate contig; AS=151
Second in pair, primary alignment on the primary assembly; AS=120; pa=0.795
Second in pair, supplementary alignment on the primary assembly; AS=54; pa=0.358
Second in pair, supplementary alignment on the alternate contig; AS=151

The pa tag measures how much better a read aligns to its best alternate contig alignment versus its primary assembly (pa) alignment. Specifically, it is the ratio of the primary assembly alignment score over the highest alternate contig alignment score. In our example we have primary assembly alignment scores of 146, 120 and 54 and alternate contig alignment scores of 151 and again 151. This gives us three different pa scores that tag the primary assembly alignments: 146/151=0.967, 120/151=0.795 and 54/151=0.358.
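
The three ratios can be reproduced from the alignment scores quoted above. A minimal sketch; the function name is illustrative:

```shell
# pa score = primary assembly alignment score / best alternate contig alignment score
pa_score() {
    awk -v p="$1" -v a="$2" 'BEGIN { printf "%.3f\n", p / a }'
}

best_alt_as=151
for pa_as in 146 120 54; do
    echo "AS=$pa_as pa=$(pa_score "$pa_as" "$best_alt_as")"
done
```

A value close to 1 means the read aligns almost as well to the primary assembly as to its best alternate contig, while a low value such as 0.358 marks a primary assembly alignment that is much weaker than the ALT alignment.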

In our tutorial's workflow, MergeBamAlignment may either change an alignment's pa score or add a previously unassigned pa score to an alignment. The result of this is summarized as follows for the same alignments.

pa=0.967 --MergeBamAlignment--> same
none --MergeBamAlignment--> assigns pa=0.967
pa=0.795 --MergeBamAlignment--> same
pa=0.358 --MergeBamAlignment--> replaces with pa=0.795
none --MergeBamAlignment--> assigns pa=0.795

If you want to retain the BWA-assigned pa scores, then add the following options to the workflow commands in section 4.

For RevertSam, add ATTRIBUTE_TO_CLEAR=pa.
For MergeBamAlignment, add ATTRIBUTES_TO_RETAIN=pa.

In our sample set, after BWA-MEM alignment ALTALT has 1412 pa-tagged alignment records, PAALT has 805 pa-tagged alignment records and PAPA has zero pa-tagged records.

4. Add read group information, preprocess to make a clean BAM and call variants

The initial alignment file is missing read group information. One way to add that information, which we use in production, is to use MergeBamAlignment. MergeBamAlignment adds back read group information contained in an unaligned BAM and adjusts meta information to produce a clean BAM ready for pre-processing (see Tutorial#6483 for details on our use of MergeBamAlignment). Given the focus here is to showcase BWA-MEM's alt-handling, we refrain from going into the details of all this additional processing. They follow, with some variation, the PairedEndSingleSampleWf pipeline detailed here.

Remember these are simulated reads with simulated base qualities. We simulated the reads in a manner that only introduces the planned mismatches, without any errors. Coverage is good at roughly 35x. All of the base qualities for all of the reads are at I, which is, according to this page and this site, an excellent base quality score equivalent to a Sanger Phred+33 score of 40. We can therefore skip base quality score recalibration (BQSR) since the reads are simulated and the dataset is not large enough for recalibration anyway.
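
The quality value behind the `I` character follows from the Phred+33 encoding, where the score is the character's ASCII code minus 33. A minimal sketch; the function name is illustrative:

```shell
# Convert a single Phred+33 quality character to its numeric score.
phred33() {
    # A leading quote makes printf emit the character's numeric code.
    ascii=$(printf '%d' "'$1")
    echo $((ascii - 33))
}

phred33 I
```

The character `I` has ASCII code 73, so the score is 73 - 33 = 40.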

Here are the commands to obtain a final multisample variant callset. The commands are given for one of the samples. Process each of the three samples independently in the same manner [4.1–4.6] until the last GenotypeGVCFs command [4.7].

[4.1] Create unmapped uBAM

java -jar picard.jar RevertSam \
    I=altalt_bwamem.sam O=altalt_u.bam \
    ATTRIBUTE_TO_CLEAR=XS ATTRIBUTE_TO_CLEAR=XA

[4.2] Add read group information to uBAM

java -jar picard.jar AddOrReplaceReadGroups \
    I=altalt_u.bam O=altalt_rg.bam \
    RGID=altalt RGSM=altalt RGLB=wgsim RGPU=shlee RGPL=illumina

[4.3] Merge uBAM with aligned BAM

java -jar picard.jar MergeBamAlignment \
    ALIGNED=altalt_bwamem.sam UNMAPPED=altalt_rg.bam O=altalt_m.bam \
    R=chr19_chr19_KI270866v1_alt.fasta \
    SORT_ORDER=unsorted CLIP_ADAPTERS=false \
    ADD_MATE_CIGAR=true MAX_INSERTIONS_OR_DELETIONS=-1 \
    PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
    UNMAP_CONTAMINANT_READS=false \
    ATTRIBUTES_TO_RETAIN=XS ATTRIBUTES_TO_RETAIN=XA

[4.4] Flag duplicate reads

java -jar picard.jar MarkDuplicates \
    INPUT=altalt_m.bam OUTPUT=altalt_md.bam METRICS_FILE=altalt_md.bam.txt \
    OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 ASSUME_SORT_ORDER=queryname

[4.5] Coordinate sort, fix NM and UQ tags and index for clean BAM
As of Picard v2.7.0, released October 17, 2016, SetNmAndUqTags is no longer available. Use SetNmMdAndUqTags instead.

set -o pipefail
java -jar picard.jar SortSam \
    INPUT=altalt_md.bam OUTPUT=/dev/stdout SORT_ORDER=coordinate | \
    java -jar picard.jar SetNmMdAndUqTags \
    INPUT=/dev/stdin OUTPUT=altalt_snaut.bam \
    CREATE_INDEX=true R=chr19_chr19_KI270866v1_alt.fasta

[4.6] Call SNP and indel variants in emit reference confidence (ERC) mode per sample using HaplotypeCaller

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R chr19_chr19_KI270866v1_alt.fasta \
    -o altalt.g.vcf -I altalt_snaut.bam \
    -ERC GVCF --max_alternate_alleles 3 --read_filter OverclippedRead \
    --emitDroppedReads -bamout altalt_hc.bam

[4.7] Call genotypes on three samples

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R chr19_chr19_KI270866v1_alt.fasta -o multisample.vcf \
    --variant altalt.g.vcf --variant altpa.g.vcf --variant papa.g.vcf

The clean BAM altalt_snaut.bam, HaplotypeCaller's altalt_hc.bam and the multisample callset multisample.vcf are ready for viewing in IGV.

Before getting into the results in the next section, we have minor comments on two filtering options.

In our tutorial workflows, we turn off MergeBamAlignment's UNMAP_CONTAMINANT_READS option. If set to true, 68 reads become unmapped for PAPA and 40 reads become unmapped for PAALT. These unmapped reads are the reads caught by the UNMAP_CONTAMINANT_READS filter and their mates. MergeBamAlignment defines contaminant reads as alignments that are overclipped, i.e. softclipped on both ends, and that align with fewer than 32 bases. Changing the MIN_UNCLIPPED_BASES option from the default of 32 to 22 and 23 restores all of these reads for PAPA and PAALT, respectively. Because contaminants are absent by design from these simulated reads, we set UNMAP_CONTAMINANT_READS to false to disable this filtering.

HaplotypeCaller's --read_filter OverclippedRead option similarly looks for both-end-softclipped alignments, then filters reads aligning with fewer than 30 bases. The difference is that HaplotypeCaller only excludes the overclipped alignments from its calling; it neither removes mapping information nor acts on the mate of the filtered alignment. Thus, we keep this read filter for the first workflow. However, for the second and third workflows in section 6, tutorial_8017_toSE and tutorial_8017_postalt, we omit the --read_filter OverclippedRead option from the HaplotypeCaller command. We also omit the --max_alternate_alleles 3 option for simplicity.
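To make the overclipping criterion concrete, here is a rough sketch that tests a single CIGAR string. It assumes that "aligned bases" means the read bases under the M, I, = and X operators and it ignores hard clips. The tools' actual accounting may differ, and is_overclipped is a name we made up for illustration.

```shell
# is_overclipped CIGAR MIN_BASES
# Prints "yes" when the CIGAR is softclipped on both ends and has fewer than
# MIN_BASES non-clipped read bases, "no" otherwise.
# Assumption: non-clipped bases are those under M, I, = and X operators.
is_overclipped() {
    echo "$1" | awk -v min="$2" '{
        cigar = $0
        both = (cigar ~ /^[0-9]+S/ && cigar ~ /[0-9]+S$/)
        n = 0
        # Walk the CIGAR one length/operator pair at a time
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (op ~ /[MI=X]/) n += len
            cigar = substr(cigar, RLENGTH + 1)
        }
        result = (both && n < min) ? "yes" : "no"
        print result
    }'
}

is_overclipped 60S25M65S 30    # clipped on both ends with 25 aligned bases
```

Under this sketch, lowering the threshold to 22 for the same made-up CIGAR would let the read through, which mirrors the effect of lowering MIN_UNCLIPPED_BASES described above.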

5. How can I tell whether I should consider an alternate haplotype?

We consider this question only for our GPI locus, a locus we know has an alternate contig in the reference. Here we use the term locus in its biological sense to refer to a contiguous genomic region of interest. The three samples give the alignment and coverage profiles shown on the right.

What is immediately apparent from the IGV screenshot is that the scenarios that include the alternate haplotype give a distinct pattern of variant sites on the primary assembly, much like a fingerprint. These variants are predominantly heterozygous or homozygous. Looking closely at the 3' region of the locus, we see some alignment coverage anomalies that also show a distinct pattern. The coverage in some of the highly diverged regions of the primary assembly drops while in others it increases. If we look at the origin of simulated reads in one of the excess coverage regions, we see that they are from two different regions of the alternate contig, which suggests duplicated sequence segments within the alternate locus.

The variation pattern and coverage anomalies on the primary locus suggest an alternate haplotype may be present for the locus. We can then confirm the presence of aligned reads, both supplementary and primary, on the alternate locus. Furthermore, if we count the alignment records for each region, e.g. using samtools idxstats, we see the following metrics.

                        ALT/ALT    PA/ALT    PA/PA
chr19                     10005     10006    10000
chr19_KI270866v1_alt       1407       799        0

The number of alignments on the alternate locus increases proportionately with alternate contig dosage. All of these factors together suggest that the sample presents an alternate haplotype.
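For readers who want to tally alignment records per contig from SAM text instead of an index, here is a minimal sketch. The function name count_by_contig and the toy records are invented for illustration. It counts every record with a reference name in column 3, which may differ slightly from what samtools idxstats reports.

```shell
# count_by_contig: read SAM text on stdin and print "contig<TAB>count".
# Hypothetical helper; counts any record whose RNAME (column 3) is set.
count_by_contig() {
    awk -F '\t' '
        /^@/ { next }                 # skip header lines
        $3 != "*" { n[$3]++ }         # tally by reference name
        END { for (c in n) printf "%s\t%d\n", c, n[c] }
    ' | LC_ALL=C sort
}

# Toy input: two made-up records, one per contig
printf 'r1\t99\tchr19\t100\t60\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII\nr2\t0\tchr19_KI270866v1_alt\t50\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' | count_by_contig
```

On a real file, the function could be fed with samtools view -h altalt_snaut.bam, assuming Samtools is installed.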

5.1 Discussion of variant calls for tutorial_8017

The three-sample variant callset gives 54 sites on the primary locus and two additional on the alternate locus for 56 variant sites. All of the eight SNP alleles we introduced are called, with six called on the primary assembly and two called on the alternate contig. Of the 15 expected genotype calls, four are incorrect. Namely, four PAALT calls that ought to be heterozygous are called homozygous variant. These are two each on the primary assembly and on the alternate contig in the region that is highly divergent.

► Our production pipelines use genomic intervals lists that exclude GRCh38 alternate contigs from variant calling. That is, variant calling is performed only for contigs of the primary assembly. This calling on even just the primary assembly of GRCh38 brings improvements to analysis results over previous assemblies. For example, if we align and call variants for our simulated reads on GRCh37, we call 50 variant sites with identical QUAL scores to the equivalent calls in our GRCh38 callset. However, this GRCh37 callset is missing six variant calls compared to the GRCh38 callset for the 42 kb locus: the two variant sites on the alternate contig and four variant sites on the primary assembly.

Consider the example variants on the primary locus. The variant calls from the primary assembly include 32 variant sites that are strictly homozygous variant in ALTALT and heterozygous variant in PAALT. The callset represents only those reads from the ALT that can be mapped to the primary assembly.

In contrast, the two variants in regions whose reads can only map to the alternate contig are absent from the primary assembly callset. For this simulated dataset, the primary alignments present on the alternate contig provide enough supporting reads that allow HaplotypeCaller to call the two variants. However, these variant calls have lower-quality annotation metrics than for those simulated in an equal manner on the primary assembly. We will get into why this is in section 6.

Additionally, for our PAALT sample that is heterozygous for an alternate haplotype, the genotype calls in the highly divergent regions are inaccurate. These are called homozygous variant on the primary assembly and on the alternate contig when in fact they are heterozygous variant. These calls have lower genotype scores GQ as well as lower allele depth AD and coverage DP. The table below shows the variant calls for the introduced SNP sites. In blue are the genotype calls that should be heterozygous variant but are instead called homozygous variant.

Here is a command to select out the intentional variant sites that uses SelectVariants:

java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R chr19_chr19_KI270866v1_alt.fasta \
    -V multisample.vcf -o multisample_selectvariants.vcf \
    -L chr19:34,383,500 -L chr19:34,389,485 -L chr19:34,391,800 -L chr19:34,392,600 \
    -L chr19_KI270866v1_alt:32,700 -L chr19_KI270866v1_alt:38,700 \
    -L chr19_KI270866v1_alt:41,700 -L chr19_KI270866v1_alt:42,700 \
    -L chr19:34,383,486 -L chr19_KI270866v1_alt:32,714

6. My locus includes an alternate haplotype. How can I call variants on alt contigs?

If you want to call variants on alternate contigs, consider additional data processing that overcomes the following problems.

Loss of alignments from filtering of overclipped reads.
HaplotypeCaller's filtering of alignments whose mates map to another contig. Alt-handling produces many of these types of reads on the alternate contigs.
Zero MAPQ scores for alignments that map to two or more alternate contigs. HaplotypeCaller excludes these types of reads from contributing to evidence for variation.

Let us talk about these in more detail.

Ideally, if we are interested in alternate haplotypes, then we would have ensured we were using the most up-to-date analysis reference genome sequence with the latest patch fixes. Also, whatever approach we take to align and preprocess alignments, if we filter any reads as putative contaminants, e.g. with MergeBamAlignment's option to unmap cross-species contamination, then at this point we would want to fish back into the unmapped reads pool and pull out those reads. Specifically, these would have an SA tag indicating mapping to the alternate contig of interest and an FT tag indicating that MergeBamAlignment's UNMAP_CONTAMINANT_READS option unmapped them as cross-species contamination. Similarly, we want to make sure not to include HaplotypeCaller's --read_filter OverclippedRead option that we use in the first workflow.
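One way to fish such reads back out could look like the sketch below. It is only an illustration: rescue_candidates is a made-up name, the filter accepts any FT tag rather than matching a specific reason string, and the toy record is invented.

```shell
# rescue_candidates CONTIG: read SAM text on stdin and print unmapped records
# (flag bit 0x4 set) that carry an FT tag and an SA tag naming CONTIG.
# Hypothetical helper; it does not check the text of the FT tag.
rescue_candidates() {
    awk -F '\t' -v contig="$1" '
        /^@/ { next }
        {
            unmapped = int($2 / 4) % 2      # test the 0x4 bit in plain awk
            has_ft = 0; has_sa = 0
            for (i = 12; i <= NF; i++) {    # optional tags start at column 12
                if ($i ~ /^FT:Z:/) has_ft = 1
                # SA entries are "rname,pos,strand,CIGAR,mapQ,NM;" repeated
                if ($i ~ /^SA:Z:/ && index(";" substr($i, 6), ";" contig ",") > 0) has_sa = 1
            }
            if (unmapped && has_ft && has_sa) print
        }
    '
}

# Toy input: one made-up unmapped record that qualifies
printf 'u1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tSA:Z:chr19_KI270866v1_alt,100,+,4M,60,0;\tFT:Z:example\n' | rescue_candidates chr19_KI270866v1_alt
```

What to do with the recovered records, e.g. how to restore their mapping information, is a separate question that this sketch does not address.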

As section 5.1 shows, variant calls on the alternate contig are of low quality--they have roughly an order of magnitude lower QUAL scores than what should be equivalent variant calls on the primary assembly.

For this exploratory tutorial, we are interested in calling the introduced SNPs with equivalent annotation metrics. Whether they are called on the primary assembly or the alternate contig and whether they are called homozygous variant or heterozygous--let's say these are less important, especially given pinning certain variants from highly homologous regions to one of the loci is nigh impossible with our short reads. To this end, we will use the second workflow shown in the workflows diagram. However, because this solution is limited, we present a third workflow as well.

► We present these workflows solely for exploratory purposes. They do not represent any production workflows.

Tutorial_8017_toSE uses the processed BAM from our first workflow and allows for calling on singular alternate contigs. That is, the workflow is suitable for calling on alternate contigs of loci with only a single alternate contig like our GPI locus. Tutorial_8017_postalt uses the aligned SAM from the first workflow before processing, and requires separate processing before calling. This third workflow allows for calling on all alternate contigs, even on HLA loci that have numerous contigs per primary locus. However, the callset will not be parsimonious. That is, each alternate contig will greedily represent alignments and it is possible the same variant is called for all the alternate loci for a given primary locus as well as on the primary locus. It is up to the analyst to figure out what to do with the resulting calls.

The reason for the divide in these two workflows is in the way BWA assigns mapping quality scores (MAPQ) to multimapping reads. Postalt-processing becomes necessary for loci with two or more alternate contigs because the shared alignments between the primary locus and alternate loci will have zero MAPQ scores. Postalt-processing gives non-zero MAPQ scores to the alignment records. The table presents the frequencies of GRCh38 non-HLA alternate contigs per primary locus. It appears that ~75% of non-HLA alternate contigs are singular to ~92% of primary loci with non-HLA alternate contigs. In terms of bases on the primary assembly, of the ~75 megabases that have alternate contigs, ~64 megabases (85%) have singular non-HLA alternate contigs and ~11 megabases (15%) have multiple non-HLA alternate contigs per locus. Our tutorial's example locus falls under this majority.

In both alt-aware mapping and postalt-processing, alternate contig alignments have a predominance of mates that map back to the primary assembly. HaplotypeCaller, for good reason, filters reads whose mates map to a different contig. However, we know that GRCh38 artificially represents alternate haplotypes as separate contigs and BWA-MEM intentionally maps these mates back to the primary locus. For comparable calls on alternate contigs, we need to include these alignments in calling. To this end, we have devised a temporary workaround.
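To get a feel for how many alignments are affected, a sketch along these lines tallies records whose mate is placed on a different contig. The function name mate_on_other_contig and the toy records are our own, and the test on the RNEXT column is only an approximation of HaplotypeCaller's filter.

```shell
# mate_on_other_contig: read SAM text on stdin and print, per contig, the
# number of records whose mate maps to a different contig.
# Hypothetical helper; uses RNAME (column 3) and RNEXT (column 7) only.
mate_on_other_contig() {
    awk -F '\t' '
        /^@/ { next }
        # RNEXT is "=" for the same contig and "*" when unavailable
        $3 != "*" && $7 != "=" && $7 != "*" && $7 != $3 { n[$3]++ }
        END { for (c in n) printf "%s\t%d\n", c, n[c] }
    ' | LC_ALL=C sort
}

# Toy input: a made-up alternate contig record whose mate is on chr19
printf 'a1\t97\tchr19_KI270866v1_alt\t50\t60\t4M\tchr19\t900\t0\tACGT\tIIII\n' | mate_on_other_contig
```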

6.1 Variant calls for tutorial_8017_toSE

Here we are only aiming for equivalent calls with similar annotation values for the two variants that are called on the alternate contig. For the solution that we will outline, here are the results.

Including the mate-mapped-to-other-contig alignments bolsters the variant call qualities for the two SNPs HaplotypeCaller calls on the alternate locus. We see the AD allele depths much improved for ALTALT and PAALT. Corresponding to the increase in reads, the GQ genotype quality and the QUAL score (highlighted in red) indicate higher qualities. For example, the QUAL scores increase from 332 and 289 to 2166 and 1764, respectively. We also see that one of the genotype calls changes. For sample ALTALT, we see a previous no call is now a homozygous reference call (highlighted in blue). This hom-ref call is further from the truth than not having a call as the ALTALT sample should not have coverage for this region in the primary assembly.

For our example data, tutorial_8017's callset subset for the primary assembly and tutorial_8017_toSE's callset subset for the alternate contigs together appear to make for a better callset.

What solution did we apply? As the workflow's name toSE implies, this approach converts paired reads to single end reads. Specifically, this approach takes the processed and coordinate-sorted BAM from the first workflow and removes the 0x1 paired flag from the alignments. Removing the 0x1 flag from the reads allows HaplotypeCaller to consider alignments whose mates map to a different contig. We accomplish this using a modified version of the script presented in Biostars post https://www.biostars.org/p/106668/, indexing with Samtools and then calling with HaplotypeCaller as follows. Note this workaround creates an invalid BAM according to ValidateSamFile. Another caveat is that because HaplotypeCaller uses softclipped sequences, any overlapping regions of read pairs will count twice towards variation instead of once. Thus, this step may lead to overconfident calls in such regions.

Remove the 0x1 bitwise flag from alignments

# Header lines (starting with @) pass through untouched; for alignment records
# with the 0x1 bit set in the FLAG column, subtract it.
samtools view -h altalt_snaut.bam | \
    gawk 'BEGIN{FS=OFS="\t"} !/^@/ && and($2,0x1) {$2=$2-0x1} {print}' | \
    samtools view -Sb - > altalt_se.bam
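To see what the flag edit does to an individual value, here is a small sketch using shell arithmetic. The function name clear_paired_flag is invented for illustration.

```shell
# clear_paired_flag FLAG: print FLAG with the 0x1 (paired) bit cleared.
# Hypothetical helper for illustration only.
clear_paired_flag() {
    echo $(( $1 & ~1 ))
}

clear_paired_flag 99    # prints 98
clear_paired_flag 16    # prints 16, the bit was not set
```

Only the 0x1 bit changes. The other pair-related bits, e.g. 0x20 and 0x40 in a flag of 99, stay set, which is presumably part of why ValidateSamFile reports the resulting BAM as invalid.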

Index the resulting BAM

samtools index altalt_se.bam

Call variants in -ERC GVCF mode with HaplotypeCaller for each sample

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R chr19_chr19_KI270866v1_alt.fasta \
    -I altalt_se.bam -o altalt_hc.g.vcf \
    -ERC GVCF --emitDroppedReads -bamout altalt_hc.bam

Finally, use GenotypeGVCFs as shown in section 4's command [4.7] for a multisample variant callset. Tutorial_8017_toSE calls 68 variant sites--66 on the primary assembly and two on the alternate contig.

6.2 Variant calls for tutorial_8017_postalt

BWA's postalt-processing requires the query-grouped output of BWA-MEM. Piping an alignment step with postalt-processing is possible. However, to be able to compare variant calls from an identical alignment, we present the postalt-processing as an add-on workflow that takes the alignment from the first workflow.

The command uses the bwa-postalt.js script, which we run through k8, a Javascript execution shell. Its arguments are the ALT index and the aligned SAM altalt.sam, and we redirect the output to altalt_postalt.sam.

k8 bwa-postalt.js \
    chr19_chr19_KI270866v1_alt.fasta.alt \
    altalt.sam > altalt_postalt.sam

The resulting postalt-processed SAM, altalt_postalt.sam, undergoes the same processing as the first workflow (commands 4.1 through 4.7) except that (i) we omit --max_alternate_alleles 3 and --read_filter OverclippedRead options for the HaplotypeCaller command like we did in section 6.1 and (ii) we perform the 0x1 flag removal step from section 6.1.

The effect of this postalt-processing is immediately apparent in the IGV screenshots. Previously empty regions are now filled with alignments. Look closely in the highly divergent region of the primary locus. Do you notice a change, albeit subtle, before and after postalt-processing for samples ALTALT and PAALT?

These alignments give the calls below for our SNP sites of interest. Here, notice calls are made for more sites--on the equivalent site if present in addition to the design site (highlighted in the first two columns). For the three pairs of sites that can be called on either the primary locus or alternate contig, the variant site QUALs, the INFO field annotation metrics and the sample level annotation values are identical for each pair.

Postalt-processing lowers the MAPQ of primary locus alignments in the highly divergent region that map better to the alt locus. You can see this as a subtle change in the IGV screenshot. After postalt-processing we see an increase in white zero MAPQ reads in the highly divergent region of the primary locus for ALTALT and PAALT. For ALTALT, this effectively cleans up the variant calls in this region at chr19:34,391,800 and chr19:34,392,600. Previously for ALTALT, these calls contained some reads: 4 and 25 for the first workflow and 0 and 28 for the second workflow. After postalt-processing, no reads are considered in this region giving us ./.:0,0:0:.:0,0,0 calls for both sites.
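To read a sample field such as ./.:0,0:0:.:0,0,0, it helps to pair each value with its FORMAT key. The sketch below assumes a FORMAT column of GT:AD:DP:GQ:PL, which matches the five values in the string but should be checked against the actual VCF record. The function name explain_genotype is our own.

```shell
# explain_genotype FORMAT SAMPLE: pair FORMAT keys with the sample's values.
# Hypothetical helper; missing trailing values are shown as ".".
explain_genotype() {
    awk -v fmt="$1" -v val="$2" 'BEGIN {
        nk = split(fmt, k, ":")
        nv = split(val, v, ":")
        for (i = 1; i <= nk; i++) printf "%s=%s\n", k[i], (i <= nv ? v[i] : ".")
    }'
}

explain_genotype GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0
```

Under that assumption the no-call reads as GT=./. with AD=0,0 and DP=0, i.e. no reads contribute to the call at the site.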

What we omit from examination are the effects of postalt-processing on decoy contig alignments. Namely, if an alignment on the primary assembly aligns better on a decoy contig, then postalt-processing discounts the alignment on the primary assembly by assigning it a zero MAPQ score.

To wrap up, here is the number of variant sites called by each of the three workflows. As you can see, this last workflow calls the most variants at 95 variant sites, with 62 on the primary assembly and 33 on the alternate contig.

Workflow                total    on primary assembly    on alternate contig
tutorial_8017           56       54                      2
tutorial_8017_toSE      68       66                      2
tutorial_8017_postalt   95       62                     33
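Counts like those in the table can be tallied from a VCF with a few lines of awk. This is a sketch rather than the method used to produce the table, and the function name count_sites_by_contig and the toy records are invented for illustration.

```shell
# count_sites_by_contig: read VCF text on stdin, print "contig<TAB>sites"
# per contig followed by a "total" line. Hypothetical helper.
count_sites_by_contig() {
    awk -F '\t' '
        /^#/ { next }                 # skip meta and header lines
        { n[$1]++ }                   # column 1 is CHROM
        END { for (c in n) printf "%s\t%d\n", c, n[c] }
    ' | LC_ALL=C sort | awk -F '\t' '
        { print; t += $2 }
        END { printf "total\t%d\n", t }
    '
}

# Toy input: three made-up records
printf '#CHROM\tPOS\tID\tREF\tALT\nchr19\t100\t.\tA\tG\nchr19\t200\t.\tC\tT\nchr19_KI270866v1_alt\t50\t.\tG\tA\n' | count_sites_by_contig
```

For the tutorial callset the input would be multisample.vcf, e.g. count_sites_by_contig < multisample.vcf.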

7. Related resources

For WDL scripts of the workflows represented in this tutorial, see the GATK WDL scripts repository.
To revert an aligned BAM to unaligned BAM, see Section B of Tutorial#6484.
To simulate reads from a reference contig, see Tutorial#7859.
Dictionary entry Reference Genome Components reviews terminology that describes reference genome components.
The GATK resource bundle provides an analysis set GRCh38 reference FASTA as well as several other related resource files.
As of this writing (August 8, 2016), the SAM format ALT index file for GRCh38 is available only in the x86_64-linux bwakit download as stated in this bwakit README. The hs38DH.fa.alt file is in the resource-GRCh38 folder. Rename this file's basename to match that of the corresponding reference FASTA.
For more details on MergeBamAlignment features, see Section 3C of Tutorial#6483.
For details on the PairedEndSingleSampleWorkflow that uses GRCh38, see here.
See here for VCF specifications.

↧

second pass base recalibration gatk4

January 7, 2019, 7:32 am

≫ Next: Dataset to Genotype refinement

≪ Previous: (How to) Map reads to a reference with alternate contigs like GRCh38

hi there

I'm trying to run the second pass for the BaseRecalibrator with GATK4.
as seen in this post (https://gatkforums.broadinstitute.org/gatk/discussion/12985/how-to-do-a-second-pass-to-analyze-after-recalibration-in-gatk4) the previous -BQSR was substituted with -bqsr. also, it seems that there is no second pass available to do that?!

when I use it in:
GATK/gatk-4.0.12.0/gatk BaseRecalibrator \
-R CanFam3.1.dna.toplevel.fa \
-I sorted_WES_.bam \
--known-sites GATK/final.Axelsson.SNPs_nochr.vcf \
-bqsr recal.table \
-O post_recal.table

I get indeed the error "b is not a recognized option", suggesting that the option is not there..

however, the documentation here (https://software.broadinstitute.org/gatk/documentation/tooldocs/4.beta.4/org_broadinstitute_hellbender_tools_walkers_bqsr_AnalyzeCovariates.php) still says that you can run it.

in the --help there is no mention of the second pass and I'm confused about what I can/should do.

also, in previous versions I used PrintReads with the recalibration table, while the first link above mentions ApplyBQSR...so, which one is the one to use?

↧

Dataset to Genotype refinement

December 13, 2018, 12:21 pm

Hi,

I'm trying to run the genotype refinement steps using GATK version 3.8. In the first step (CalculateGenotypePosteriors) we need to provide the "-supporting" flag with a VCF file from a gold standard database.
I found 2 pages from gatk with 2 different files.

1) software DOT broadinstitute.org / gatk / documentation / tooldocs / 3.8-0 / org_broadinstitute_gatk_tools_walkers_variantutils_CalculateGenotypePosteriors -- using 1000G_EUR.genotypes.combined.vcf

and

2) software DOT broadinstitute.org / gatk / documentation / article . php ? id=4727 -- using 1000G_phase3_v4_20130502.sites.vcf

I believe the right choice for me is to use 1000G_phase3_v4_20130502.sites.vcf, since my 91 samples are from African Americans and not from Europeans. Am I right?

Another question: I couldn't find this VCF file, 1000G_phase3_v4_20130502.sites.vcf, for the hg19 version anywhere, neither in the bundle nor on the GATK bundle FTP. I only found this page ftp DOT 1000genomes DOT ebi DOT ac DOT uk / vol1 / ftp / release / 20130502 with a lot of VCFs (separated by chromosome), and there is no single combined VCF available.

So, do you have any advice for me where I can get this vcf file to run calculateGenotypePosteriors?

I don't know how I can proceed from here.

Thank you so much for help.

PS: Really sorry! I had to remove "https : / /" , "ftp : / /" and DOTS from links because I'm newbie in this forum, so I can't post links. :-/

Best,

Monete

↧

SplitNCigarReads exception

September 15, 2014, 7:16 pm

my script :

java -jar ~/bin/gatk-3.2-2/GenomeAnalysisTK.jar -T SplitNCigarReads -R Gmax.fa -I NPB18L_mark.bam -o NPB18L_snc.bam -U ALLOW_N_CIGAR_READS -fixNDN

When I use -fixNDN, I get:

java.lang.UnsupportedOperationException
at java.util.AbstractList.add(AbstractList.java:148)
at java.util.AbstractList.add(AbstractList.java:108)
at org.broadinstitute.gatk.tools.walkers.rnaseq.SplitNCigarReads.initialize(SplitNCigarReads.java:150)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------

ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):

ERROR

ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.

ERROR If not, please post the error message, with stack trace, to the GATK forum.

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR MESSAGE: Code exception (see stack trace for error itself)

But if I don't use -fixNDN, I get:

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.2-2-gec30cee):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Bad input: Cannot split this read (might be an empty section between Ns, for example 1N1D1N): 94M407N1D1033N6M

how can i fix it???

↧

HOW to do a second pass to analyze after recalibration in GATK4

September 20, 2018, 1:42 am

Why doesn't the BaseRecalibrator function have a -BQSR argument? I don't know how to do a second pass to analyze after recalibration in GATK4.

↧

Per-allele filters?

January 7, 2019, 3:56 pm

I'm doing analysis on a large multi-ancestry project (XX,XXX exomes) in which approximately 20% of variant sites are multi-allelic. Our sequencing/calling is done at the Broad. It seems that at these multi-allelic sites, only one filter is applied. For example, where there is a SNP and an indel, the SNP may get an indel filter.

Is it possible to disclose how the filter at a given site is chosen? (E.g., by severity, etc.) Is there a plan to allow each variant to have its own filter in the future? (Not sure if there is an internal Broad place to discuss this, since it may have to do more with the pipeline and how SNPs and indel calls are combined than the GATK per se.)

↧

PhaseByTransmission reassigns genotype code that conflicts with the allelic depths in the VCF

October 25, 2016, 1:56 am

I encountered a problem running the PhaseByTransmission (PBT) module on a female proband with two healthy parents. Before running the module, the correct genotype was assigned to the Father, but afterwards the genotype was re-coded incorrectly, in conflict with the allelic depths for ref and alt in the VCF. I would like to know why this is happening and how to prevent this error.

The PBT command used:
java -Xmx2g -jar GenomeAnalysisTK.jar -R human_g1k_v37.fa -T PhaseByTransmission -V trio_combined_haplotypeCaller.vcf -ped trio.ped -o pbt_trio_combined_haplotypeCaller.vcf

The following are the lines from the VCFs showing the variant record that has problem with the genotype coding. The sample columns are Proband, Mother and Father from left to right in the VCFs:

Before PBT (trio_combined_haplotypeCaller.vcf):
X 153296071 . GGGGGGCTGGTGGGGTCCTCGGAGCTCTCGGGCTCAGGTGGAGGT G 257.13 PASS AC=1;AF=0.167;AN=6;BaseQRankSum=-1.311e+00;ClippingRankSum=0.495;DP=73;FS=2.782;GQ_MEAN=138.00;GQ_STDDEV=132.68;MLEAC=1;MLEAF=0.167;MQ=60.00;MQ0=0;MQRankSum=1.31;NCC=0;QD=0.70;ReadPosRankSum=0.990;SOR=1.911;VQSLOD=1.51;culprit=ReadPosRankSum;set=variant2 GT:AD:DP:GQ:PL 0/1:17,8:25:99:288,0,4362 0/0:32,0:32:90:0,90,1350 0/0:13,0:13:36:0,36,540

After PBT (pbt_trio_combined_haplotypeCaller.vcf):
X 153296071 . GGGGGGCTGGTGGGGTCCTCGGAGCTCTCGGGCTCAGGTGGAGGT G 257.13 PASS AC=1;AF=0.167;AN=6;BaseQRankSum=-1.311e+00;ClippingRankSum=0.495;DP=73;FS=2.782;GQ_MEAN=138.00;GQ_STDDEV=132.68;MLEAC=1;MLEAF=0.167;MQ=60.00;MQ0=0;MQRankSum=1.31;NCC=0;QD=0.70;ReadPosRankSum=0.990;SOR=1.911;VQSLOD=1.51;culprit=ReadPosRankSum;set=variant2 GT:AD:DP:GQ:PL:TP 0|1:17,8:25:99:288,0,4362:43 0|0:32,0:32:90:0,90,1350:43 1|0:13,0:13:0:0,36,540:43

PED file content:
fam father 0 0 1 1
fam mother 0 0 2 1
fam proband father mother 2 2

Please advise.

↧

GenotypeGVCFs empty vcf (only headers)

January 8, 2019, 2:14 am

I want to use the joint genotyper (“GenotypeGVCFs” command) to call variants and genotypes for five different samples for which I previously obtained the corresponding g.vcf files. The command that I am using is the following:

java -Xmx64g -jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R myReferenceSequence.fasta\
-o jointGenotyperSNPs.vcf \
-nt 23 \
-V sample1.final.variants.g.vcf \
-V sample2.final.variants.g.vcf \
-V sample3.final.variants.g.vcf \
-V sample4.final.variants.g.vcf \
-V sample5.final.variants.g.vcf

Unfortunately, the command produces an empty vcf file (only the headers are present).

Any idea why?

Here, the first five rows for each of the samples:

sample1.final.variants.g.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RS16
cruralis_HybSeqProbes_test_with400Ns_beginend 1 . N . . END=400 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
cruralis_HybSeqProbes_test_with400Ns_beginend 401 . A <NON_REF> . . END=410 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,39
cruralis_HybSeqProbes_test_with400Ns_beginend 411 . A <NON_REF> . . END=411 GT:DP:GQ:MIN_DP:PL 0/0:2:0:2:0,0,0
cruralis_HybSeqProbes_test_with400Ns_beginend 412 . A <NON_REF> . . END=412 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,38
cruralis_HybSeqProbes_test_with400Ns_beginend 413 . A <NON_REF> . . END=413 GT:DP:GQ:MIN_DP:PL 0/0:2:0:2:0,0,0

sample2.final.variants.g.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RS17
cruralis_HybSeqProbes_test_with400Ns_beginend 1 . N <NON_REF> . . END=977 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
cruralis_HybSeqProbes_test_with400Ns_beginend 978 . T <NON_REF> . . END=1046 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,16
cruralis_HybSeqProbes_test_with400Ns_beginend 1047 . A <NON_REF> . . END=1080 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
cruralis_HybSeqProbes_test_with400Ns_beginend 1081 . T <NON_REF> . . END=1116 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,18
cruralis_HybSeqProbes_test_with400Ns_beginend 1117 . T <NON_REF> . . END=1125 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,32

sample3.final.variants.g.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RS21
cruralis_HybSeqProbes_test_with400Ns_beginend 1 . N <NON_REF> . . END=400 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
cruralis_HybSeqProbes_test_with400Ns_beginend 401 . A <NON_REF> . . END=410 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,59
cruralis_HybSeqProbes_test_with400Ns_beginend 411 . A <NON_REF> . . END=413 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,11
cruralis_HybSeqProbes_test_with400Ns_beginend 414 . A <NON_REF> . . END=420 GT:DP:GQ:MIN_DP:PL 0/0:4:6:4:0,6,90
cruralis_HybSeqProbes_test_with400Ns_beginend 421 . C <NON_REF> . . END=421 GT:DP:GQ:MIN_DP:PL 0/0:2:0:2:0,0,0

sample4.final.variants.g.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RS22
cruralis_HybSeqProbes_test_with400Ns_beginend 1 . N <NON_REF> . . END=400 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
cruralis_HybSeqProbes_test_with400Ns_beginend 401 . A <NON_REF> . . END=406 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,57
cruralis_HybSeqProbes_test_with400Ns_beginend 407 . A <NON_REF> . . END=410 GT:DP:GQ:MIN_DP:PL 0/0:3:3:3:0,3,45
cruralis_HybSeqProbes_test_with400Ns_beginend 411 . A <NON_REF> . . END=411 GT:DP:GQ:MIN_DP:PL 0/0:3:0:3:0,0,21
cruralis_HybSeqProbes_test_with400Ns_beginend 412 . A <NON_REF> . . END=412 GT:DP:GQ:MIN_DP:PL 0/0:3:3:3:0,3,45

sample5.final.variants.g.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RS60
cruralis_HybSeqProbes_test_with400Ns_beginend 1 . N <NON_REF> . . END=484 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
cruralis_HybSeqProbes_test_with400Ns_beginend 485 . T <NON_REF> . . END=493 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,49
cruralis_HybSeqProbes_test_with400Ns_beginend 494 . G <NON_REF> . . END=494 GT:DP:GQ:MIN_DP:PL 0/0:3:0:3:0,0,0
cruralis_HybSeqProbes_test_with400Ns_beginend 495 . A <NON_REF> . . END=497 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,53
cruralis_HybSeqProbes_test_with400Ns_beginend 498 . T <NON_REF> . . END=499 GT:DP:GQ:MIN_DP:PL 0/0:3:0:3:0,0,0
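
For anyone wanting to compare excerpts like these across samples, here is a rough Python sketch (not part of the original post) that adds up how many bases of a GVCF fall into reference blocks with DP=0 and how many have coverage but GQ=0. The function names `parse_ref_block` and `summarize` are made up for illustration. Columns are read from the end of each record (INFO, FORMAT, sample), so the sketch does not depend on what the ALT column contains or whether it survived copy-paste. It only handles single-sample reference-block records like the ones shown, not variant records.

```python
def parse_ref_block(line):
    """Parse one single-sample GVCF reference-block record into a dict."""
    fields = line.split()
    # Index from the end: INFO, FORMAT, and the one sample column.
    info, fmt, sample = fields[-3], fields[-2], fields[-1]
    start = int(fields[1])
    end = start  # fall back to a one-base block if INFO has no END
    for item in info.split(";"):
        if item.startswith("END="):
            end = int(item[4:])
    values = dict(zip(fmt.split(":"), sample.split(":")))
    return {
        "contig": fields[0],
        "start": start,
        "end": end,
        "dp": int(values["DP"]),
        "gq": int(values["GQ"]),
    }


def summarize(lines):
    """Return (bases with DP == 0, bases with DP > 0 and GQ == 0, total bases)."""
    no_cov = zero_gq = total = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        block = parse_ref_block(line)
        length = block["end"] - block["start"] + 1
        total += length
        if block["dp"] == 0:
            no_cov += length
        elif block["gq"] == 0:
            zero_gq += length
    return no_cov, zero_gq, total


# Records taken from the sample5 excerpt above.
SAMPLE5 = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RS60",
    "cruralis_HybSeqProbes_test_with400Ns_beginend 1 . N <NON_REF> . . END=484 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0",
    "cruralis_HybSeqProbes_test_with400Ns_beginend 485 . T <NON_REF> . . END=493 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,49",
    "cruralis_HybSeqProbes_test_with400Ns_beginend 494 . G <NON_REF> . . END=494 GT:DP:GQ:MIN_DP:PL 0/0:3:0:3:0,0,0",
    "cruralis_HybSeqProbes_test_with400Ns_beginend 495 . A <NON_REF> . . END=497 GT:DP:GQ:MIN_DP:PL 0/0:3:9:3:0,9,53",
    "cruralis_HybSeqProbes_test_with400Ns_beginend 498 . T <NON_REF> . . END=499 GT:DP:GQ:MIN_DP:PL 0/0:3:0:3:0,0,0",
]

print(summarize(SAMPLE5))  # (484, 3, 499)
```

For the sample5 excerpt this reports 484 uncovered bases and 3 covered bases with GQ=0 out of the 499 positions listed.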

↧

default read filter in haplotypecaller

August 23, 2018, 7:55 am

≫ Next: Picard Sort Vcf Error

≪ Previous: GenotypeGVCFs empty vcf (only headers)

Hi,

Where can I find which read filters are applied by default in HaplotypeCaller in GATK4 (on the GATK web page or the GitHub page)?

↧

ERROR ------------------------------------------------------------------------------------------

ERROR stack trace

ERROR ------------------------------------------------------------------------------------------

ERROR A GATK RUNTIME ERROR has occurred (version nightly-2014-11-17-g58cfab1):

ERROR

ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.

ERROR If not, please post the error message, with stack trace, to the GATK forum.

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR MESSAGE: java.lang.Integer cannot be cast to java.lang.Double

ERROR ------------------------------------------------------------------------------------------

FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">

FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
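
The three FORMAT declarations above follow the VCF meta-information syntax (`ID`, `Number`, `Type`, `Description`). As a rough sketch, assuming lines of this shape with or without the leading `##`, the declared number and type of each field can be read out in Python as follows. The helper names `parse_meta_line` and `declared_types` are hypothetical and not part of any GATK or htsjdk API.

```python
import re

# Structured meta line, e.g. FORMAT=<ID=AD,Number=R,...>; the "##" prefix is optional
# because it is missing from the lines as pasted.
_META = re.compile(r'^(?:##)?(\w+)=<(.*)>$')
# key=value pairs, where a value is either a double-quoted string or runs to the next comma.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta_line(line):
    """Return (kind, attributes) for one structured VCF header line."""
    match = _META.match(line.strip())
    if match is None:
        raise ValueError(f"not a structured meta line: {line!r}")
    kind, body = match.groups()
    attrs = {}
    for key, value in _PAIR.findall(body):
        if value.startswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        attrs[key] = value
    return kind, attrs


def declared_types(lines):
    """Map each FORMAT field ID to its declared (Number, Type)."""
    result = {}
    for line in lines:
        kind, attrs = parse_meta_line(line)
        if kind == "FORMAT":
            result[attrs["ID"]] = (attrs["Number"], attrs["Type"])
    return result


HEADER_LINES = [
    'FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">',
    'FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">',
    'FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">',
]

print(declared_types(HEADER_LINES))
```

Here AD and DP come out as Integer and AF as Float, which is the kind of information worth having at hand when a tool complains about a value's type.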

Not all GATK tools use Spark

- Some GATK tools exist in distinct Spark-capable and non-Spark-capable versions

- Some GATK tools only exist in a Spark-capable version

You don't need a Spark cluster to run Spark-enabled GATK tools!

Example command-line parameters

You don't need to install any additional software to use Spark in GATK

Document is in BETA. It may be incomplete and/or inaccurate. Post suggestions to the Comments section and be sure to read about updates also within the Comments section.

Jump to a section

Tools involved

Download example data

1. Index the reference FASTA for use with BWA-MEM

2. Include the reference ALT index file

☞ What happens if I forget the ALT index file?

3. Align reads with BWA-MEM

☞ How can I tell if a BAM was aligned with alt-handling?

☞ What is the pa tag?

4. Add read group information, preprocess to make a clean BAM and call variants

5. How can I tell whether I should consider an alternate haplotype?

5.1 Discussion of variant calls for tutorial_8017

6. My locus includes an alternate haplotype. How can I call variants on alt contigs?

6.1 Variant calls for tutorial_8017_toSE

6.2 Variant calls for tutorial_8017_postalt

7. Related resources

ERROR ------------------------------------------------------------------------------------------

ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):

ERROR

ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.

ERROR If not, please post the error message, with stack trace, to the GATK forum.

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR MESSAGE: Code exception (see stack trace for error itself)

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.2-2-gec30cee):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Bad input: Cannot split this read (might be an empty section between Ns, for example 1N1D1N): 94M407N1D1033N6M
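
The message points at a CIGAR in which two `N` (skipped region) operators have nothing but a deletion between them, so there are no read bases to put in the middle piece. The Python sketch below is one reading of that message, not the check GATK itself performs: it parses a CIGAR string and reports any stretch between two `N` operators that consumes no read bases. The function names are made up for illustration.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that consume bases of the read, per the SAM specification.
_READ_CONSUMING = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) tuples."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def empty_sections_between_ns(cigar):
    """Return the operator runs lying between two N operators that contain no read bases."""
    ops = parse_cigar(cigar)
    n_positions = [i for i, (_, op) in enumerate(ops) if op == "N"]
    empty = []
    for left, right in zip(n_positions, n_positions[1:]):
        between = ops[left + 1:right]
        if not any(op in _READ_CONSUMING for _, op in between):
            empty.append("".join(f"{n}{op}" for n, op in between))
    return empty


print(empty_sections_between_ns("94M407N1D1033N6M"))  # ['1D']
print(empty_sections_between_ns("50M100N50M"))        # []
```

Running something like this over the CIGAR column of the input BAM (for example on the output of `samtools view`) would be one way to find the offending reads before deciding how to handle them.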
