Channel: Recent Discussions — GATK-Forum

(How to) Generate an unmapped BAM from FASTQ or aligned BAM


Here we outline how to generate an unmapped BAM (uBAM) from either a FASTQ or aligned BAM file. We use Picard's FastqToSam to convert a FASTQ (Option A) or Picard's RevertSam to convert an aligned BAM (Option B).

Jump to a section on this page

(A) Convert FASTQ to uBAM and add read group information using FastqToSam
(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

Tools involved

  • Picard FastqToSam
  • Picard RevertSam

Prerequisites

  • Installed Picard tools

Download example data

Tutorial data reads were originally aligned to the advanced tutorial bundle's human_g1k_v37_decoy.fasta reference and subset to the interval 10:91,000,000-92,000,000.

Related resources


(A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file to an unmapped BAM. It requires a minimal set of read group fields and makes specification of the other read group fields optional. In the command below we note which fields are required for GATK Best Practices Workflows; all other read group fields are optional.

java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_1.fastq \ #first read file of pair
    FASTQ2=6484_snippet_2.fastq \ #second read file of pair
    OUTPUT=6484_snippet_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \ #required; changed from default of A
    SAMPLE_NAME=NA12878 \ #required
    LIBRARY_NAME=Solexa-272222 \ #required 
    PLATFORM_UNIT=H0164ALXX140820.2 \ 
    PLATFORM=illumina \ #recommended
    SEQUENCING_CENTER=BI \ 
    RUN_DATE=2014-08-20T00:00:00-0400

Some details on select parameters:

  • For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first and second read files, respectively. Records in each file must be queryname-sorted, as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate that reads are paired. Do not provide a single interleaved FASTQ file: the tool will assume the reads are unpaired, and the SAM flag values will reflect single-end reads.
  • For single-end reads, specify the input file with FASTQ alone (see the sketch after this list).
  • QUALITY_FORMAT is detected automatically if unspecified.
  • SORT_ORDER by default is queryname.
  • PLATFORM_UNIT is often in run_barcode.lane format. Include it if the sample is multiplexed.
  • RUN_DATE is in ISO 8601 date format.
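
For example, a minimal single-end sketch (the FASTQ file name here is hypothetical; read group fields follow the same requirements as above):

java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_single.fastq \
    OUTPUT=6484_snippet_single_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \
    SAMPLE_NAME=NA12878 \
    LIBRARY_NAME=Solexa-272222 \
    PLATFORM=illumina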

Paired reads will have SAM flag values that reflect pairing and the fact that the reads are unmapped as shown in the example read pair below.

Original first read

@H0164ALXX140820:2:1101:10003:49022/1
ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA
+
-FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ

Original second read

@H0164ALXX140820:2:1101:10003:49022/2
TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG
+
AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F######################

After FastqToSam

H0164ALXX140820:2:1101:10003:49022      77      *       0       0       *       *       0       0       ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA -FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:49022      141     *       0       0       *       *       0       0       TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F###################### RG:Z:H0164.2



(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information and generate an unmapped BAM (uBAM). For our tutorial file we have to call on some additional parameters, which we explain below. This illustrates the need to tailor the tool's parameters to each dataset. As such, it is a good idea to test the reversion process on a subset of reads before committing to reverting an entire large BAM. Follow the directions in this How to to create a snippet of aligned reads corresponding to a genomic interval.

We use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
    I=6484_snippet.bam \
    O=6484_snippet_revertsam.bam \
    SANITIZE=true \ 
    MAX_DISCARD_FRACTION=0.005 \ #informational; does not affect processing
    ATTRIBUTE_TO_CLEAR=XT \
    ATTRIBUTE_TO_CLEAR=XN \
    ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
    ATTRIBUTE_TO_CLEAR=OC \
    ATTRIBUTE_TO_CLEAR=OP \
    SORT_ORDER=queryname \ #default
    RESTORE_ORIGINAL_QUALITIES=true \ #default
    REMOVE_DUPLICATE_INFORMATION=true \ #default
    REMOVE_ALIGNMENT_INFORMATION=true #default

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets the temporary directory

We invoke or change multiple RevertSam parameters to generate an unmapped BAM:

  • We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. Standard tags cleared by default are NM, UQ, PG, MD, MQ, SA, MC, and AS (the AS tag for Picard releases starting 9/2015). Additionally, the OQ tag is removed by the default RESTORE_ORIGINAL_QUALITIES parameter. Remove all other nonstandard tags by specifying each with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option for our tutorial file so that it is free for use by other tools, e.g. MarkIlluminaAdapters. To list all tags within a BAM, use the command below.

    samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}' 
    

    For the tutorial file, this gives RG, OC, XN, OP and XT tags as well as those removed by default. We remove all of these except the RG tag. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags.

  • Additionally, we invoke the SANITIZE option to remove reads that cause problems for certain tools, e.g. MarkIlluminaAdapters. Downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Any paired reads file subset for a genomic interval requires sanitizing to remove reads with lost mates that align outside of the interval.

  • In this command, we've set MAX_DISCARD_FRACTION to a stricter threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool reports the number and fraction of reads it discards; if the fraction is exceeded, the tool additionally informs you via an exception as it finishes processing.

    Exception in thread "main" picard.PicardException: Discarded 0.787% which is above MAX_DISCARD_FRACTION of 0.500%  
    

Some comments on options kept at default:

  • SORT_ORDER=queryname
    For paired read files, because each read in a pair has the same query name, queryname sorting results in interleaved reads, i.e. the reads of a pair are listed consecutively within the same file. We make sure to alter the previous sort order: coordinate-sorted reads cause the aligner to estimate insert sizes incorrectly, since reads in coordinate-sorted blocks are not randomly distributed.

  • RESTORE_ORIGINAL_QUALITIES=true
    Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag.

  • REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment flag and tag information. For example, flags are reset to unmapped values, e.g. 77 and 141 for paired reads. The parameter also invokes the default ATTRIBUTE_TO_CLEAR parameter, which removes standard alignment tags. RevertSam ignores ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false.

Below we show a read pair from the tutorial data before and after RevertSam. Notice the first listed read in the pair becomes reverse-complemented after RevertSam. This restores how reads are represented when they come off the sequencer: 5' to 3' of the read being sequenced.

For 6484_snippet.bam, SANITIZE removes 2,202 out of 279,796 (0.787%) reads, leaving us with 277,594 reads.
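
You can confirm these counts with samtools (a quick check; view -c simply counts records):

    samtools view -c 6484_snippet.bam            #279796 reads
    samtools view -c 6484_snippet_revertsam.bam  #277594 reads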

Original BAM

H0164ALXX140820:2:1101:10003:23460  83  10  91515318    60  151M    =   91515130    -339    CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151    PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:0  MQ:i:0  OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA    UQ:i:0  AS:i:151

H0164ALXX140820:2:1101:10003:23460  163 10  91515130    0   60M91S  =   91515318    339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1;    MC:Z:151M   MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:2  MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF###########################################################################################    UQ:i:49 AS:i:50

After RevertSam

H0164ALXX140820:2:1101:10003:23460  77  *   0   0   *   *   0   0   TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2

H0164ALXX140820:2:1101:10003:23460  141 *   0   0   *   *   0   0   TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2




Different HaplotypeCaller variant calls based on Java version?

Hi, we are running HaplotypeCaller on two different Linux servers and getting slightly different VCF calls (only with SNVs, not Indels) with identical BAM files despite using the same script and having the same version of GATK, 4.0.3.0. The only difference between the servers is the Java versions: openjdk 1.8.0_161_b14 vs. 1.8.0_222_b10. Could this be causing differences in the variant calls? It seems like previous versions of GATK did have non-deterministic components, but that's not the case in v 4.0.3.0. Thank you.
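
In case it helps with triage, here is a sketch for quantifying the discordance between the two call sets (file names are hypothetical; bcftools isec needs bgzipped and indexed VCFs):

```
# sites private to each run and shared sites are written under isec_out/
bcftools isec -p isec_out server1.vcf.gz server2.vcf.gz
# or a quick header-insensitive diff
diff <(grep -v '^#' server1.vcf) <(grep -v '^#' server2.vcf) | head
```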

GenomicsDBImport not completing for mixed ploidy samples

I'm attempting to call variants on whole genomes for about 500 Illumina paired-end samples of varying ploidy (haploid to tetraploid). I'm running a fairly standard uBAM-to-GVCF pipeline, with HaplotypeCaller passed the ploidy information (1, 2, 3, or 4) in -ERC GVCF mode. I then collect the GVCFs with GenomicsDBImport using a batch size of 50 and run GenotypeGVCFs on the combined database. The interval list passed to GenomicsDBImport is just each chromosome on a separate line. I'm using GATK v4.1.1.0.

Command:
```
${GATK_DIR}/gatk GenomicsDBImport \
--java-options "-Xmx110g -Xms110g" \
-R ${REF} \
--variant ${FILE_LIST} \
-L ${SCRIPT_DIR}/GATK_Style_Interval.list \
--genomicsdb-workspace-path ${WORK_DIR}/GenomicsDB_20190912 \
--batch-size 50 \
--tmp-dir=${WORK_DIR}/
```

GenomicsDBImport appears to run without error, but only shows progress for the first 6000 bp before moving on to the next batch. When I run SelectVariants on the created database, I only get variants up to position 6716 in the first interval. When I try to run GenotypeGVCFs on it, I get a strange error:
htsjdk.tribble.TribbleException: Invalid block size -1570639203

My first assumption is that one of the GVCFs is malformed from HaplotypeCaller failing after the first 6000 bp, but I've verified that the GVCFs have all completed and have 'validated' them with ValidateVariants using GATK v4.1.3.0. When I grep for the particular position in the samples' GVCFs, I don't find anything out of the ordinary. I would use CombineGVCFs, but it fails when trying to combine mixed ploidies.

Any ideas on troubleshooting or experience with problems like this?
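
One way to probe where the import actually stopped is to query the workspace directly over a chosen interval using the gendb:// prefix (a sketch; the interval is illustrative):

```
${GATK_DIR}/gatk SelectVariants \
-R ${REF} \
-V gendb://${WORK_DIR}/GenomicsDB_20190912 \
-L chr1:1-50000 \
-O check_region.vcf.gz
```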

Failures running VariantRecalibrator


We want to run joint germline calling on a set of 122 WES BRCA normal hg19 BAMs from the CPTAC 3 project. We are using the GATK4 workflows showcased in the Terra workspace https://app.terra.bio/#workspaces/help-gatk/Germline-SNPs-Indels-GATK4-b37. We are starting with data that has already been aligned to hg19, so of the three workflows in the showcase workspace we are running two: haplotypecaller-gvcf-gatk4 and joint-discovery-gatk4. We are encountering problems with the joint-discovery-gatk4 workflow, in particular in the running of the VariantRecalibrator task. Initially we are running on just 3 sample GVCFs, recognizing that you need a minimum of 30 exome samples, simply to ensure we can run the pipeline. We are using GATK4 v4.1.2.0.

We are getting more or less the same error for both instances of the VariantRecalibrator task...

task instance: JointGenotyping.SNPsVariantRecalibratorClassic:

A USER ERROR has occurred: Couldn't read file file:///cromwell_root/hapmap,known=false,training=true,truth=true,prior=15:/cromwell_root/broad-references/hg19/v0/hapmap_3.3.b37.vcf.gz. Error was: It doesn't exist.

task instance: JointGenotyping.IndelsVariantRecalibrator:

A USER ERROR has occurred: Couldn't read file file:///cromwell_root/mills,known=false,training=true,truth=true,prior=12:/cromwell_root/broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.sites.vcf. Error was: It doesn't exist.

Here is the java command line (from the task log file in Terra):

Using GATK jar /gatk/gatk-package-4.1.2.0-local.jar

Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx24g -Xms24g -jar /gatk/gatk-package-4.1.2.0-local.jar VariantRecalibrator -V /cromwell_root/fc-secure-823808d0-5404-49c9-990f-b3d9e353e468/02fdb905-0a50-47d5-9a0b-8abb8d0a9636/JointGenotyping/71c87f4b-5e0e-40bc-9b61-71a5e52ac82a/call-SitesOnlyGatherVcf/CBB_Test.sites_only.vcf.gz -O CBB_Test.indels.recal --tranches-file CBB_Test.indels.tranches --trust-all-polymorphic -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 -an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP -mode INDEL --max-gaussians 4 -resource mills,known=false,training=true,truth=true,prior=12:/cromwell_root/broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.sites.vcf -resource axiomPoly,known=false,training=true,truth=false,prior=10:/cromwell_root/broad-references/hg19/v0/Axiom_Exome_Plus.genotypes.all_populations.poly.vcf.gz -resource dbsnp,known=true,training=false,truth=false,prior=2:/cromwell_root/broad-references/hg19/v0/dbsnp_138.b37.vcf.gz

The problem is clearly with the attributes that prepend the -resource input parameter: they are being interpreted by GATK4 as part of the filename.
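
For reference, a sketch of the argument form GATK4's VariantRecalibrator expects, with the resource name and attributes bound to the flag by a colon and the path as a separate token (paths copied from the log above):

```
--resource:mills,known=false,training=true,truth=true,prior=12 /cromwell_root/broad-references/hg19/v0/Mills_and_1000G_gold_standard.indels.b37.sites.vcf \
--resource:dbsnp,known=true,training=false,truth=false,prior=2 /cromwell_root/broad-references/hg19/v0/dbsnp_138.b37.vcf.gz
```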

How to diagnose missing MQRankSum annotations (when BaseQRankSum is available)


We wish to discover short variants in a cohort of 60 plant whole-genome-samples. We're blocked on VariantRecalibrator.

We have a VCF truth set (aka resource) of SNPs which has been computed beforehand and hard-filtered. And we have a raw VCF for the 60 samples under study. This input VCF has been joint-called with HaplotypeCaller (GVCF) + GenomicsDBImport + GenotypeGVCFs over the whole genome. We computed a sites-only version of that input VCF and fed it to VariantRecalibrator. We configured HaplotypeCaller to produce allele-specific annotations (-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation) and GenotypeGVCFs as well (-G StandardAnnotation -G AS_StandardAnnotation).

We've configured VariantRecalibrator to build its SNP model based on the set of annotations: -an AS_QD -an MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_MQ -an AS_SOR -an DP. This was based on this allele-specific annotation and filtering article.

Unfortunately, both the AS_MQRankSum and MQRankSum annotations have variance 0 over our data, which prevents the model from being produced. Dropping the annotation is one option, but it's ill-advised as far as we know (see reference #1).

How do we diagnose this?

/gatk/gatk VariantRecalibrator --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true -Xmx183296m' --tmp-dir <redacted> <redacted_list_of_input_vcf_files> --resource:GOLD,known=false,training=true,truth=true,prior=10.0 <redacted>/gold.snps.vcf.gz --mode SNP -an AS_QD -an MQRankSum -an AS_ReadPosRankSum -an AS_FS -an AS_MQ -an AS_SOR -an DP --trust-all-polymorphic --truth-sensitivity-tranche 100.0 --truth-sensitivity-tranche 99.0 --truth-sensitivity-tranche 90.0 --truth-sensitivity-tranche 70.0 --truth-sensitivity-tranche 50.0 --max-gaussians 6 --rscript-file <redacted>  --tranches-file <redacted> -AS --output <redacted>/snp.recal.vcf.gz
22:16:29.941 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
...
A USER ERROR has occurred: Bad input: Found annotations with zero variance. They must be excluded before proceeding.
...
19:52:31.846 INFO  ProgressMeter - Traversal complete. Processed 2786933 total variants in 0.8 minutes.
19:52:31.959 INFO  VariantDataManager - AS_QD:   mean = 17.44    standard deviation = 8.75
19:52:32.173 INFO  VariantDataManager - MQRankSum:       mean = 0.00     standard deviation = 0.00
19:52:32.487 INFO  VariantDataManager - AS_ReadPosRankSum:       mean = 0.01     standard deviation = 0.84
19:52:32.797 INFO  VariantDataManager - AS_FS:   mean = 1.79     standard deviation = 3.03
19:52:32.962 INFO  VariantDataManager - AS_MQ:   mean = 60.00    standard deviation = 0.12
19:52:33.139 INFO  VariantDataManager - AS_SOR:          mean = 0.68     standard deviation = 0.25
19:52:33.335 INFO  VariantDataManager - DP:      mean = 1323.99  standard deviation = 945.30

(We initially tried AS_MQRankSum instead of MQRankSum, but it also had 0.00 variance.)

Question 1: As we understand it, MQRankSum can only be computed at sites that are heterozygous for the reference (see reference #2). Generally speaking, and not just for our particular dataset, do we need a good representation of such sites both in the truth "resource" sets and in the input raw variant VCFs, or just in the input VCFs?

Question 2: We've confirmed that our data (both the truth set and the input set) has many het-ref sites with good read support for all alleles. One piece of evidence for this (we think) is that AS_ReadPosRankSum has non-zero variance, as shown in the VariantRecalibrator output above. MQRankSum's variance couldn't be calculated, yet both it and AS_MQRankSum carry the same caveats in the documentation. In what cases can one annotation be calculated but not the other? E.g. does this indicate that my mapping qualities are too uniform (in which case the variance would be exactly 0.000)?

Question 3: If we dig into the dataset (i.e. into the sites-only VCF inputs), we see a lot of sites whose relevant annotations are a mix of "nul" and "0.000". At certain sites, AS_MQRankSum is present but not MQRankSum; sometimes both are present. How should we interpret the different values? Is there anything "wrong" with that?

Ex: biallelic heterozygous-ref site. AS_MQRankSum and AS_ReadPosRankSum are "nul". MQRankSum and ReadPosRankSum are omitted altogether.

HanXRQChr01     16169   .       G       T       114.60  PASS    AC=2;AF=0.250;AN=8;AS_BaseQRankSum=nul;AS_FS=0.000;AS_MQ=60.00;AS_MQRankSum=nul;AS_QD=30.82;AS_ReadPosRankSum=nul;AS_SOR=0.693;DP=6;ExcessHet=0.3218;FS=0.000;MLEAC=10;MLEAF=1.00;MQ=60.00;QD=29.27;SOR=0.693

Ex: biallelic het-ref site. ReadPosRankSum is non-zero. AS_MQRankSum is zero.

HanXRQChr01     17137   .       G       A       53.23   PASS AC=1;AF=0.010;AN=96;AS_BaseQRankSum=0.600;AS_FS=0.000;AS_InbreedingCoeff=-0.0476;AS_MQ=60.00;AS_MQRankSum=0.000;AS_QD=8.87;AS_ReadPosRankSum=0.800;AS_SOR=1.179;BaseQRankSum=0.623;DP=243;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.0476;MLEAC=1;MLEAF=0.010;MQ=60.00;MQRankSum=0.00;QD=8.87;ReadPosRankSum=0.842;SOR=1.179

Ex: site where one allele has nuls, and the other one has floats

HanXRQChr01     18154   .       G       C,A     8466.08 PASS    AC=25,1;AF=0.240,9.615e-03;AN=104;AS_BaseQRankSum=-0.550,nul;AS_FS=1.536,2.158;AS_InbreedingCoeff=0.8253,-0.0126;AS_MQ=60.00,60.00;AS_MQRankSum=0.000,nul;AS_QD=29.59,8.09;AS_ReadPosRankSum=0.900,nul;AS_SOR=0.400,0.223;BaseQRankSum=0.494;DP=813;ExcessHet=0.0000;FS=1.538;InbreedingCoeff=0.8833;MLEAC=26,1;MLEAF=0.250,9.615e-03;MQ=60.00;MQRankSum=0.00;QD=32.25;ReadPosRankSum=1.52;SOR=0.391
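
To look at the spread of these annotations genome-wide rather than site by site, a sketch with bcftools (the file name is hypothetical; bcftools prints missing values as "."):

```
bcftools query -f '%CHROM\t%POS\t%INFO/MQRankSum\t%INFO/AS_MQRankSum\n' sites_only.vcf.gz | head
# count sites where MQRankSum is present and nonzero
bcftools query -f '%INFO/MQRankSum\n' sites_only.vcf.gz | awk '$1 != "." && $1+0 != 0' | wc -l
```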

Relevant pages and comments I've found on the subject:
1. "MQRankSum is one of the core annotations that we recommend using, so I would recommend going to the trouble of finding out why it's not working." (https://gatkforums.broadinstitute.org/gatk/discussion/comment/9737/#Comment_9737 )
2. "The Rank Sum Tests require at least one individual to be heterozygous and have a mix of ref and alt reads" (https://gatkforums.broadinstitute.org/gatk/discussion/comment/33174/#Comment_33174 )
3. AS_ReadPosRankSum annotation documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_AS_ReadPosRankSumTest.php)
4. AS_MQRankSum annotation documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_AS_MappingQualityRankSumTest.php )
5. Hard-filtering recommendations (which talks about how the tests work, in particular MQRankSum and ReadPosRankSum): (https://software.broadinstitute.org/gatk/documentation/article.php?id=6925 )
6. Allele-specific annotation and filtering article. (https://gatkforums.broadinstitute.org/gatk/discussion/9622/allele-specific-annotation-and-filtering/)

Using HaplotypeCaller for human samples, should I set the ploidy for non-autosomal chromosomes?

I've been working on the assumption that this complexity is automatically handled by the algorithm, but I'm not sure.
My question specifically relates to HaplotypeCaller in GATK4.
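
For what it's worth, HaplotypeCaller applies a single --sample-ploidy (-ploidy) value to everything in a run, so sex chromosomes are not handled automatically. A common workaround sketch is to call per-ploidy regions in separate runs (the interval files here are assumptions you would build for your reference):

```
# diploid call over autosomes (and female X)
gatk HaplotypeCaller -R ref.fasta -I sample.bam -L autosomes.intervals -ERC GVCF -O sample.auto.g.vcf.gz
# haploid call over male X/Y non-PAR regions
gatk HaplotypeCaller -R ref.fasta -I sample.bam -L xy_nonpar.intervals -ploidy 1 -ERC GVCF -O sample.xy.g.vcf.gz
```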

AnalyzeCovariates error


Hi there, I'm trying to get AnalyzeCovariates to work, and I get the following error message...any thoughts as to what is going wrong? Not sure whether this is a program-related issue or not?

Thanks!!

docsmb17:No_Backup sgfriede$ java -jar /Users/sgfriede/GeneApps/GenomeAnalysisTK-3.1.1/GenomeAnalysisTK.jar -T AnalyzeCovariates \
    -R canFam3.fa \
    -before recal.table \
    -after after_recal.table \
    -plots Mariah_recal_plots.pdf
INFO 09:00:35,945 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:00:35,947 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 09:00:35,947 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 09:00:35,948 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 09:00:35,951 HelpFormatter - Program Args: -T AnalyzeCovariates -R canFam3.fa -before recal.table -after after_recal.table -plots Mariah_recal_plots.pdf
INFO 09:00:36,278 HelpFormatter - Executing as sgfriede@docsmb17.local on Mac OS X 10.9.3 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_55-b13.
INFO 09:00:36,279 HelpFormatter - Date/Time: 2014/05/26 09:00:35
INFO 09:00:36,279 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:00:36,279 HelpFormatter - --------------------------------------------------------------------------------
INFO 09:00:36,706 GenomeAnalysisEngine - Strictness is SILENT
INFO 09:00:37,005 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 09:00:37,163 GenomeAnalysisEngine - Preparing for traversal
INFO 09:00:37,179 GenomeAnalysisEngine - Done preparing for traversal
INFO 09:00:37,179 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 09:00:37,179 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 09:00:37,514 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 09:00:37,802 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 09:00:37,852 AnalyzeCovariates - Generating csv file '/var/folders/jx/3t4ylccn0g32xwl0qs6y4tz80000gp/T/AnalyzeCovariates3417596129729096429.csv'
INFO 09:00:38,256 AnalyzeCovariates - Generating plots file 'Mariah_recal_plots.pdf'
INFO 09:00:39,706 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.sting.utils.R.RScriptExecutorException: RScript exited with 1. Run with -l DEBUG for more info.
at org.broadinstitute.sting.utils.R.RScriptExecutor.exec(RScriptExecutor.java:174)
at org.broadinstitute.sting.utils.recalibration.RecalUtils.generatePlots(RecalUtils.java:548)
at org.broadinstitute.sting.gatk.walkers.bqsr.AnalyzeCovariates.generatePlots(AnalyzeCovariates.java:380)
at org.broadinstitute.sting.gatk.walkers.bqsr.AnalyzeCovariates.initialize(AnalyzeCovariates.java:394)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: RScript exited with 1. Run with -l DEBUG for more info.
ERROR ------------------------------------------------------------------------------------------
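
Since the failure is inside the R plotting step ("RScript exited with 1"), a debugging sketch: rerun with debug logging as the message suggests, and check that R and the plotting packages load cleanly (this package list for GATK3-era BQSR plots is an assumption to verify):

```
java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R canFam3.fa \
    -before recal.table -after after_recal.table \
    -plots Mariah_recal_plots.pdf -l DEBUG
Rscript -e 'library(ggplot2); library(gplots); library(reshape); library(gsalib)'
```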

Using -L to filter calls from SomVarIUS and other tools

Hi,

I am trying to use Mutect2 (v4.1.3.0) to filter variants called by other somatic variant callers such as SomVarIUS, LoFreq, and Platypus. This works for LoFreq and Platypus, but for SomVarIUS I am unable to use FilterMutectCalls.

The error generated is:

```
14:12:53.697 INFO ProgressMeter - chr19:43946067 3.8 1478000 389678.9
14:13:03.016 INFO FilterMutectCalls - Finished pass 1 through the variants
14:13:56.805 INFO FilterMutectCalls - Starting pass 2 through the variants
14:13:56.833 INFO FilterMutectCalls - Shutting down engine
[October 16, 2019 2:13:56 PM PDT] org.broadinstitute.hellbender.tools.walkers.mutect.filtering.FilterMutectCalls done. Elapsed time: 4.90 minutes.
Runtime.totalMemory()=42649780224
java.lang.IllegalArgumentException: errorRate must be good probability but got NaN
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:730)
at org.broadinstitute.hellbender.utils.QualityUtils.errorProbToQual(QualityUtils.java:225)
at org.broadinstitute.hellbender.utils.QualityUtils.errorProbToQual(QualityUtils.java:209)
at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.Mutect2FilteringEngine.lambda$applyFiltersAndAccumulateOutputStats$13(Mutect2FilteringEngine.java:176)
at java.util.Optional.ifPresent(Optional.java:159)
at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.Mutect2FilteringEngine.applyFiltersAndAccumulateOutputStats(Mutect2FilteringEngine.java:174)
at org.broadinstitute.hellbender.tools.walkers.mutect.filtering.FilterMutectCalls.nthPassApply(FilterMutectCalls.java:148)
at org.broadinstitute.hellbender.engine.MultiplePassVariantWalker.lambda$traverse$0(MultiplePassVariantWalker.java:40)
at org.broadinstitute.hellbender.engine.MultiplePassVariantWalker.lambda$traverseVariants$1(MultiplePassVariantWalker.java:77)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.MultiplePassVariantWalker.traverseVariants(MultiplePassVariantWalker.java:75)
at org.broadinstitute.hellbender.engine.MultiplePassVariantWalker.traverse(MultiplePassVariantWalker.java:40)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
```

My workflow is:
1) Call variants using SomVarIUS/LoFreq/Platypus

2) Pass the VCF file generated by the above tools into Mutect2 (via the interval (-L) argument) to generate an unfiltered VCF file
```
java -jar gatk-package-4.1.3.0-local.jar Mutect2
-R hg38.fa
-L path_to_vcf_generated_by_SomVarIUS
-I tumor.bam
--germline-resource path_to_gnomad_germline_resource
--f1r2-tar-gz f1r2.tar.gz
-O unfiltered.vcf
```

3) Learn Orientation model
```
java -jar gatk-package-4.1.3.0-local.jar LearnReadOrientationModel
-I f1r2.tar.gz
-O read-orientation-model.tar.gz
```
4) Get Pileup Summaries
```
java -jar gatk-package-4.1.3.0-local.jar GetPileupSummaries
-I tumor_bam
-V small_exac_common_3.hg38.vcf.gz
-L small_exac_common_3.hg38.vcf.gz
-O getpileupsummaries.table
```
5) Calculate contamination
```
java -jar gatk-package-4.1.3.0-local.jar CalculateContamination
-I getpileupsummaries.table
--tumor-segmentation segments.table
-O calculatecontamination.table
```

6) Filtering Mutect calls
```
java -jar gatk-package-4.1.3.0-local.jar FilterMutectCalls
-V unfiltered.vcf
--tumor-segmentation segments.table
--contamination-table calculatecontamination.table
--ob-priors read-orientation-model.tar.gz
-R hg38.fa
-O Filtered.vcf
```

The error is not generated when I use other tools like LoFreq or Platypus.

The contamination value generated by CalculateContamination is always less than 1 and more or less consistent across the three tools.
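
A troubleshooting sketch that may help localize the NaN: run FilterMutectCalls with only the required inputs, then add the optional tables back one at a time to see which introduces it:

```
java -jar gatk-package-4.1.3.0-local.jar FilterMutectCalls
-R hg38.fa
-V unfiltered.vcf
-O Filtered_minimal.vcf
# then re-add --contamination-table, --tumor-segmentation, and --ob-priors individually
```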

Any help is appreciated!

Thanks,
Dollina

(How to) Filter on genotype using VariantFiltration


Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration's unintuitive behavior when processing compound expressions.


This tutorial illustrates how to filter on genotype, e.g. heterozygous genotype call. The steps apply to either single-sample or multi-sample callsets.

First, the genotype is annotated with a filter expression using VariantFiltration. Then, the filtered genotypes are made into no-call (./.) genotypes with SelectVariants so that downstream tools may discount them.

We use example variant record FORMAT fields from trio.vcf to illustrate.

GT:AD:DP:GQ:PL  
0/1:17,15:32:99:399,0,439       0/1:11,12:23:99:291,0,292       1/1:0,30:30:90:948,90,0

1. Annotate genotypes using VariantFiltration

If we want to filter heterozygous genotypes, we use VariantFiltration's --genotype-filter-expression "isHet == 1" option. We can specify the annotation value with which the tool labels heterozygous genotypes using the --genotype-filter-name option. Here, this parameter's value is set to "isHetFilter".

gatk VariantFiltration \
-V trio.vcf \
-O trio_VF.vcf \
--genotype-filter-expression "isHet == 1" \
--genotype-filter-name "isHetFilter"

After filtering, in the resulting trio_VF.vcf, our example record adds an FT field and becomes:

GT:AD:DP:FT:GQ:PL
0/1:17,15:32:isHetFilter:99:399,0,439   0/1:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0

We see that HET (0/1) genotype calls get an isHetFilter in the FT field and the other genotype calls get a PASS in the FT field.

The VariantFiltration tool document lists the various options to filter on the FORMAT (aka genotype call) field:

We have put in convenience methods so that one can now filter out hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1"). Also available are expressions isCalled, isNoCall, isMixed, and isAvailable, in accordance with the methods of the Genotype object.
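
The same mechanism works for other FORMAT-level annotations. For example, a sketch that flags low-confidence genotype calls (the GQ threshold is illustrative; note the compound-expression caveat linked at the top of this article):

gatk VariantFiltration \
-V trio.vcf \
-O trio_VF_GQ.vcf \
--genotype-filter-expression "GQ < 20" \
--genotype-filter-name "lowGQ"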


2. Transform filtered genotypes to no call

Running SelectVariants with --set-filtered-gt-to-nocall further transforms the flagged genotypes into null genotype calls. This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.

gatk SelectVariants \
-V trio_VF.vcf \
--set-filtered-gt-to-nocall \
-O trioGGVCF_VF_SV.vcf

The result is that the GT genotypes of the isHetFilter-flagged records become null or no-call (./.) as follows.

GT:AD:DP:FT:GQ:PL
./.:17,15:32:isHetFilter:99:399,0,439   ./.:11,12:23:isHetFilter:99:291,0,292   1/1:0,30:30:PASS:90:948,90,0

Can't download data from ftp address

I just can't download the data from ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/beta/Mutect2/af-only-gnomad.raw.sites.b37.vcf.gz; the website is not available, so I want to know whether the FTP server is available now (2019-10-25)? Thanks!
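
For anyone landing here later: the GATK resource bundle has been moving off FTP to Google Cloud Storage. A sketch with gsutil; the exact bucket path is an assumption to verify against the current documentation:

```
gsutil ls gs://gatk-best-practices/somatic-b37/
gsutil cp gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf .
```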

Questions about JEXL expressions for selecting variants according to specific requirements

Error in ReadsPipelineSpark 4.1.4 when using -L interval list option


Hi all.
Always trying to tune the pipeline in our environment...

When I add the -L option to the ReadsPipelineSpark I obtain the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 25, cloudera02.opbg.dom, executor 2): java.lang.IllegalArgumentException: Contig chr1 not present in reads sequence dictionary
at org.disq_bio.disq.impl.formats.BoundedTraversalUtil.convertSimpleIntervalToQueryInterval(BoundedTraversalUtil.java:73)
at org.disq_bio.disq.impl.formats.BoundedTraversalUtil.lambda$prepareQueryIntervals$0(BoundedTraversalUtil.java:46)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
at java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:438)
at org.disq_bio.disq.impl.formats.BoundedTraversalUtil.prepareQueryIntervals(BoundedTraversalUtil.java:47)
at org.disq_bio.disq.impl.formats.sam.AbstractBinarySamSource.lambda$getReads$c0b65654$1(AbstractBinarySamSource.java:128)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun
...

The tool works well when I call it WITHOUT the -L interval list option.

This is the command I run:

nohup /opt/gatk/gatk-4.1.4.0/gatk ReadsPipelineSpark --spark-runner SPARK --spark-master yarn --spark-submit-command spark2-submit -I hdfs://cloudera08/gatk-test2/WES2019-023_S6_reheader.bam -O hdfs://cloudera08/gatk-test2/WES2019-023_S6_out.g.vcf -R hdfs://cloudera08/gatk-test1/ucsc.hg19.fasta -L hdfs://cloudera08/gatk-test2/RefGene_exons.bed --dbsnp hdfs://cloudera08/gatk-test1/dbsnp_150_hg19.vcf.gz --known-sites hdfs://cloudera08/gatk-test1/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz --align true --emit-ref-confidence GVCF --standard-min-confidence-threshold-for-calling 100.0 --conf deploy-mode=cluster --conf "spark.driver.memory=2g" --conf "spark.executor.memory=18g" --conf "spark.storage.memoryFraction=1" --conf "spark.akka.frameSize=200" --conf "spark.default.parallelism=100" --conf "spark.core.connection.ack.wait.timeout=600" --conf "spark.yarn.executor.memoryOverhead=4096" --conf "spark.yarn.driver.memoryOverhead=400" > WES2019-023_S6.out &

My input BAM is coming from the FastqToSam conversion tool, and the header is:

[root@cloudera02 WES2019-023-40045]# ../samtools-1.7/samtools view -H WES2019-023_S6.bam
@HD VN:1.6 SO:queryname
@RG ID:WES2019-023 SM:WES2019-023_S6 PL:illumina PU:L1

I tried to reheader the BAM, adding @SQ lines referencing the fasta file, but I had the same error.

Since I have been struggling with this problem for some days, could you please help me identify a way forward?
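
A sanity check worth running first: confirm the reheadered BAM really carries the full reference dictionary (a sketch using samtools on local copies of the files):

```
samtools dict /path/to/ucsc.hg19.fasta | grep -c '^@SQ'        # number of contigs in the reference
samtools view -H WES2019-023_S6_reheader.bam | grep -c '^@SQ'  # should match, and include chr1
```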

Thanks a lot in advance.
Alessandro

MarkDuplicates - 0 pairs never matched

Hello, I am trying to use MarkDuplicates in order to combine uBAMs generated from paired fastq files across two lanes (WGS on Illumina NovaSeq) using the GATK paired-fastq-to-unmapped-bam.wdl. I believe I have parsed the correct RG information in generating the uBAMs (ValidateSamFile passed) but I am a little confused as to the MarkDuplicates output. I have three, likely related questions.

1) What is the meaning of "Tracking 0 as yet unmatched pairs" and "0 pairs never matched"?
2) There was also a WARNING - what is meant by "Reading will be unbuffered"?

```
INFO 2019-07-28 10:15:16 MarkDuplicates Read 506,000,000 records. Elapsed time: 07:30:29s. Time for last 1,000,000: 43s. Last read position: chr1:102,335,300
INFO 2019-07-28 10:15:16 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2019-07-28 10:15:47 MarkDuplicates Read 506691598 records. 0 pairs never matched.
INFO 2019-07-28 10:16:24 MarkDuplicates After buildSortedReadEndLists freeMemory: 16451310808; totalMemory: 16540762112; maxMemory: 16540762112
INFO 2019-07-28 10:16:24 MarkDuplicates Will retain up to 516898816 duplicate indices before spilling to disk.
INFO 2019-07-28 10:16:25 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2019-07-28 10:16:25 SortingCollection Creating merging iterator from 5 files
WARNING 2019-07-28 10:16:27 SortingCollection There is not enough memory per file for buffering. Reading will be unbuffered.
```

3) Finally, the workflow ends by transitioning to a terminal state (return code not 0), yet it appears MarkDuplicates is "done".

```
INFO 2019-07-28 12:27:32 MarkDuplicates Before output close freeMemory: 8443873968; totalMemory: 13359382528; maxMemory: 14944829440
INFO 2019-07-28 12:27:33 MarkDuplicates After output close freeMemory: 13288318136; totalMemory: 13332643840; maxMemory: 14944829440
[Sun Jul 28 12:27:33 CDT 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 582.83 minutes.
Runtime.totalMemory()=13332643840

[2019-07-28 12:32:26,21] [info] WorkflowManagerActor WorkflowActor-69a61389-a401-4e87-af57-66a767c7b994 is in a terminal state: WorkflowFailedState
[2019-07-28 12:32:37,63] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
[2019-07-28 12:32:40,80] [info] Workflow polling stopped
[2019-07-28 12:32:40,84] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2019-07-28 12:32:40,84] [info] 0 workflows released by cromid-95ee9e9
[2019-07-28 12:32:40,85] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2019-07-28 12:32:40,88] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2019-07-28 12:32:40,90] [info] JobExecutionTokenDispenser stopped
[2019-07-28 12:32:40,90] [info] Aborting all running workflows.
[2019-07-28 12:32:40,99] [info] WorkflowStoreActor stopped
[2019-07-28 12:32:41,00] [info] WorkflowLogCopyRouter stopped
[2019-07-28 12:32:41,00] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2019-07-28 12:32:41,00] [info] WorkflowManagerActor All workflows finished
[2019-07-28 12:32:41,00] [info] WorkflowManagerActor stopped
[2019-07-28 12:32:42,32] [info] Connection pools shut down
[2019-07-28 12:32:42,32] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] SubWorkflowStoreActor stopped
[2019-07-28 12:32:42,32] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] KvWriteActor Shutting down: 0 queued messages to process
[2019-07-28 12:32:42,32] [info] WriteMetadataActor Shutting down: 0 queued messages to process
[2019-07-28 12:32:42,32] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2019-07-28 12:32:42,32] [info] CallCacheWriteActor stopped
[2019-07-28 12:32:42,32] [info] IoProxy stopped
[2019-07-28 12:32:42,33] [info] JobStoreActor stopped
[2019-07-28 12:32:42,33] [info] ServiceRegistryActor stopped
[2019-07-28 12:32:42,33] [info] DockerHashActor stopped
[2019-07-28 12:32:42,35] [info] Database closed
[2019-07-28 12:32:42,35] [info] Stream materializer shut down
[2019-07-28 12:32:42,35] [info] WDL HTTP import resolver closed
```
------------------------------------------------
I am using the gatk4-data-processing workflow with Cromwell 40. I was unable to post the entirety of the workflow output, and the code blocks do not seem to be working for me; I am not sure why, as I have never posted before!
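
For reference, a sketch for digging the real failure out of Cromwell's default local execution layout (workflow name and run id are placeholders):

```
# each task records its exit code and logs under its execution directory
cat cromwell-executions/<workflow>/<run-id>/call-MarkDuplicates/execution/rc
cat cromwell-executions/<workflow>/<run-id>/call-MarkDuplicates/execution/stderr
```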

GATK-Mutect2 Running and Output Error in Terra

The problem I face is with the output data after running Mutect2.

I am running GATK-Mutect2 using the practice pipeline for somatic mutation screening between tumor and normal samples in Terra. I have cloned the demo workspace without any changes to the attributes except for the tumor and normal BAM/BAI index files. When I tried running the two sample files, I got the error without any comments. The system automatically set the run up with Cromwell 47 once I clicked the "Run Analysis" button on Mutect2 in Terra. The job status is finished but it displays the red triangle as well. Are there any critical steps or options that I need to use to get output results from Mutect2? I'm confused as to why there are no output files.

I'm using the workflow: fccredits-barium-rust-2976/Mutect2-GATK4

GATK 4.1.4.0 Mutect2 error: Contig chr1 does not have a length field.


I'm running a somatic workflow that worked under 4.1.3.0, but when I shifted to 4.1.4.0, I started getting this error.

htsjdk.tribble.TribbleException: Contig chr1 does not have a length field.
at htsjdk.variant.vcf.VCFContigHeaderLine.getSAMSequenceRecord(VCFContigHeaderLine.java:81)
at htsjdk.variant.vcf.VCFHeader.getSequenceDictionary(VCFHeader.java:273)

I looked in the fasta, fasta.dict, BAMs, and VCFs input into the tool, and they all have a length field for chr1. I'm using the current GDC GRCh38 reference, and a germline resource and PON downloaded from the Broad.

e.g.

$ head -1 GRCh38.d1.vd1.fa
>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38

$ samtools view -H tumor.bam |head |grep chr1  
@SQ SN:chr1 LN:248956422

$ samtools view -H normal.abra.md.bam |head |grep chr1 
@SQ SN:chr1 LN:248956422

$ zgrep chr1 mutect/af-only-gnomad.hg38.vcf.gz   |head -1
##contig=<ID=chr1,length=248956422>

$ zgrep chr1 mutect/MuTect2.PON.5210.vcf.gz |head -1
##contig=<ID=chr1,length=248956422> 

There seems to be a way around it by adding --disable-sequence-dictionary-validation true to my Mutect2 command line, but that wasn't necessary in earlier versions.
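
If one of the VCF inputs does turn out to carry a contig line without a length, a hedged fix is to rewrite its header from the reference dictionary with Picard (a sketch; which input is at fault still needs confirming):

```
java -jar picard.jar UpdateVcfSequenceDictionary \
    INPUT=suspect_input.vcf.gz \
    OUTPUT=suspect_input.fixedheader.vcf.gz \
    SEQUENCE_DICTIONARY=GRCh38.d1.vd1.dict
```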


ASEReadCounter java error


I'm just running ASEReadCounter on an RNA-seq BAM that has undergone MarkDuplicates, AddOrReplaceReadGroups, and SplitNCigarReads. These Java errors don't provide any help for the user:

14:24:03.421 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/groups/Spellmandata/heskett/packages/share/gatk4-4.0.11.0-0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:24:05.114 INFO  ASEReadCounter - ------------------------------------------------------------
14:24:05.114 INFO  ASEReadCounter - The Genome Analysis Toolkit (GATK) v4.0.11.0
14:24:05.114 INFO  ASEReadCounter - For support and documentation go to https://software.broadinstitute.org/gatk/
14:24:05.115 INFO  ASEReadCounter - Executing as heskett@exanode-3-1 on Linux v3.10.0-862.14.4.el7.x86_64 amd64
14:24:05.115 INFO  ASEReadCounter - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_192-b01
14:24:05.116 INFO  ASEReadCounter - Start Date/Time: March 8, 2019 2:24:03 PM PST
14:24:05.116 INFO  ASEReadCounter - ------------------------------------------------------------
14:24:05.116 INFO  ASEReadCounter - ------------------------------------------------------------
14:24:05.117 INFO  ASEReadCounter - HTSJDK Version: 2.16.1
14:24:05.117 INFO  ASEReadCounter - Picard Version: 2.18.13
14:24:05.117 INFO  ASEReadCounter - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:24:05.118 INFO  ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:24:05.118 INFO  ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:24:05.118 INFO  ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:24:05.118 INFO  ASEReadCounter - Deflater: IntelDeflater
14:24:05.118 INFO  ASEReadCounter - Inflater: IntelInflater
14:24:05.119 INFO  ASEReadCounter - GCS max retries/reopens: 20
14:24:05.119 INFO  ASEReadCounter - Requester pays: disabled
14:24:05.119 INFO  ASEReadCounter - Initializing engine
14:24:05.581 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/groups/Spellmandata/heskett/replication.rnaseq/scripts/../platinum.genome/NA12878.nochr.vcf
14:24:05.604 INFO  ASEReadCounter - Done initializing engine
contig  position    variantID   refAllele   altAllele   refCount    altCount    totalCount  lowMAPQDepth    lowBaseQDepth   rawDepth    otherBases  improperPairs
14:24:05.604 INFO  ProgressMeter - Starting traversal
14:24:05.604 INFO  ProgressMeter -        Current Locus  Elapsed Minutes        Loci Processed      Loci/Minute
14:24:05.638 INFO  ASEReadCounter - Shutting down engine
[March 8, 2019 2:24:05 PM PST] org.broadinstitute.hellbender.tools.walkers.rnaseq.ASEReadCounter done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=1859649536
java.lang.ArrayIndexOutOfBoundsException: 0
    at org.broadinstitute.hellbender.engine.ReferenceContext.getBase(ReferenceContext.java:396)
    at org.broadinstitute.hellbender.tools.walkers.rnaseq.ASEReadCounter.apply(ASEReadCounter.java:183)
    at org.broadinstitute.hellbender.engine.LocusWalker.lambda$traverse$0(LocusWalker.java:176)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at org.broadinstitute.hellbender.engine.LocusWalker.traverse(LocusWalker.java:174)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /home/groups/Spellmandata/heskett/packages/share/gatk4-4.0.11.0-0/gatk-package-4.0.11.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx20G -jar /home/groups/Spellmandata/heskett/packages/share/gatk4-4.0.11.0-0/gatk-package-4.0.11.0-local.jar ASEReadCounter -I ../alignments/gm12878.rep2Aligned.out.rg.sorted.markdup.bam --variant ../platinum.genome/NA12878.nochr.vcf
srun: error: exanode-3-1: task 0: Exited with exit code 3
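
One observation from the trace: the crash is in ReferenceContext.getBase, and the Running: line shows no -R argument. A sketch of the same command with a reference supplied, which may be all that is missing (the reference path is hypothetical):

```
gatk --java-options "-Xmx20G" ASEReadCounter \
    -R /path/to/hg19_nochr.fa \
    -I ../alignments/gm12878.rep2Aligned.out.rg.sorted.markdup.bam \
    -V ../platinum.genome/NA12878.nochr.vcf \
    -O gm12878.rep2.ase.table
```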

ASEReadCounter error


Hi, I am having a similar problem as in this thread.
I am running ASEReadCounter on RNA-seq data and I get this error:

  • gatk ASEReadCounter -I /mnt/beegfs/Steph_WKDIR/1XXXXXXX_Single.bam -V filtered_Phased_1831.vcf.gz -R /mnt/XXXXXXs/Genomes/genome_hg19/hg19.fa -O ASE_1831.csv
    14:56:58.684 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/XXXXXXX/tools/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Oct 25, 2019 2:57:00 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    14:57:00.552 INFO ASEReadCounter - ------------------------------------------------------------
    14:57:00.553 INFO ASEReadCounter - The Genome Analysis Toolkit (GATK) v4.1.3.0
    14:57:00.554 INFO ASEReadCounter - For support and documentation go to https://software.broadinstitute.org/gatk/
    14:57:00.555 INFO ASEReadCounter - Executing as sfotsing@compute-0-1.hpc.lji.org on Linux v3.10.0-514.el7.x86_64 amd64
    14:57:00.556 INFO ASEReadCounter - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_102-b14
    14:57:00.557 INFO ASEReadCounter - Start Date/Time: October 25, 2019 2:56:58 PM PDT
    14:57:00.558 INFO ASEReadCounter - ------------------------------------------------------------
    14:57:00.559 INFO ASEReadCounter - ------------------------------------------------------------
    14:57:00.560 INFO ASEReadCounter - HTSJDK Version: 2.20.1
    14:57:00.561 INFO ASEReadCounter - Picard Version: 2.20.5
    14:57:00.562 INFO ASEReadCounter - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    14:57:00.563 INFO ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    14:57:00.563 INFO ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    14:57:00.564 INFO ASEReadCounter - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    14:57:00.565 INFO ASEReadCounter - Deflater: IntelDeflater
    14:57:00.566 INFO ASEReadCounter - Inflater: IntelInflater
    14:57:00.570 INFO ASEReadCounter - GCS max retries/reopens: 20
    14:57:00.571 INFO ASEReadCounter - Requester pays: disabled
    14:57:00.571 INFO ASEReadCounter - Initializing engine
    WARNING: BAM index file /mnt/XXXXE_Single.bai is older than BAM /mnt/XXXXX_WKDIR/1831_CD4_NAIVE_Single.bam
    14:57:01.356 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/XXXXX/filtered_Phased_1831.vcf.gz
    14:57:01.521 INFO ASEReadCounter - Done initializing engine
    14:57:01.523 INFO ProgressMeter - Starting traversal
    14:57:01.524 INFO ProgressMeter - Current Locus Elapsed Minutes Loci Processed Loci/Minute
    14:57:06.958 INFO ASEReadCounter - Shutting down engine
    [October 25, 2019 2:57:06 PM PDT] org.broadinstitute.hellbender.tools.walkers.rnaseq.ASEReadCounter done. Elapsed time: 0.14 minutes.
    Runtime.totalMemory()=3250061312

A USER ERROR has occurred: More then one variant context at position: chr1:11125729


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /......and so on. Please do not pay attention to the paths :)

1- I am sure that my vcf does not have duplicate variants. I used SelectVariants to remove multiallelic variants and awk to remove duplicated variants by location. For example, this is my output for this location
$ zcat myvcf.vcf.gz |grep 11125729
chr1 11125729 rs2039841:11125729:T:C T C . PASS . GT 0|1

2- I get ASE output up to this position, so that might rule out formatting?
In fact, this error came up in a different position
(chr1 9355278 rs4080311:9355278:T:A T A . PASS . GT 0|1);
which I removed and reran on the new VCF. This led to more output and stopped on this locus. I have whole human genomes, so it is not practical to remove loci manually.
Please help. Any input is welcome
Thanks
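
A sketch for hunting down same-position records that a position-based awk pass can miss (e.g. records differing only in ID or alleles):

```
# positions occurring more than once
zcat filtered_Phased_1831.vcf.gz | grep -v '^#' | cut -f1,2 | sort | uniq -d | head
# drop exact duplicate records with bcftools
bcftools norm -d exact filtered_Phased_1831.vcf.gz -Oz -o deduped.vcf.gz
```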

HaplotypeCaller generate incorrect heterozygous Calls


Hi,
We have used the GATK 3.6 and GATK 4.1 HaplotypeCaller to generate variants; unfortunately, we are getting lots of incorrect heterozygous calls. Even when only one read supports the variant, it is considered a heterozygous call. It would be great if someone could suggest what can be done in this scenario.

Chr Start End Ref Alt Zygosity Reads(ref,alt) ReadNo GenotypeQuality
1 27101249 27101249 - GGCGGGGCCCTGGGG Het 164,1 165 99
6 33141690 33141690 - GTAGGGTCCACGGGGTCAGCGGG Het 158,1 163 99
10 71906033 71906033 - G Het 150,9 160 91
11 66411384 66411384 C T Het 144,5 155 64
11 66411384 66411384 C T Het 144,5 155 64
17 79880575 79880575 G T Het 147,8 155 83
18 23807079 23807079 - ACTTTGGTCA Het 158,4 162 99
19 36276206 36276206 - GT Het 144,9 153 99
19 50340140 50340140 - GGGGGACCGGGCCCCGGGGA Het 154,3 157 99
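
If such low-allele-fraction hets need to be removed post hoc, one hedged option is an allele-fraction filter on the AD field, e.g. with bcftools (the threshold is illustrative and calls.vcf.gz is a placeholder):

```
# keep sites where the first sample's alt-allele fraction is at least 0.15
bcftools view -i 'FMT/AD[0:1] / (FMT/AD[0:0] + FMT/AD[0:1]) >= 0.15' calls.vcf.gz -Oz -o calls.abfilt.vcf.gz
```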

I have tried creating a GATK project in the IntelliJ IDE as documented in https://github.com/broadin

It gives the error "error: package com.sun.javadoc does not exist final com.sun.javadoc.ClassDoc classDoc" every time I run the 'GATK debug' configuration. Please help me.
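
For context, com.sun.javadoc ships in the JDK 8 tools.jar and was removed in JDK 9+, so this error usually means the project is building against a JRE or a newer JDK. A sketch of a command-line check (the JDK path is a placeholder):

```
# confirm the build works when pointed explicitly at a JDK 8 installation
./gradlew compileJava -Dorg.gradle.java.home=/path/to/jdk1.8
```

In IntelliJ, also set the Project SDK (File > Project Structure) to a JDK 1.8 installation rather than a JRE.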

File size is largely reduced in MarkIlluminaAdapters step


Hi,

I was running the data processing steps on raw reads (FASTQ) using two approaches:

1. Merging all the forward reads and all the reverse reads, and using the merged files as input for the subsequent steps  
2. Without merging, using each pair of raw FASTQ files as input for each step

While running the MarkIlluminaAdapters step I observed that the output file size is smaller for the 2nd approach. The size details are as follows:

1. Raw FASTQ file size: 80 Gb  
2. MarkIlluminaAdapters output size: **1st approach (merged) 215 Gb; 2nd approach 179 Gb**

But I observed that in the BWA-MEM alignment (1st approach (merged) 258 Gb; 2nd approach 263 Gb), BAM conversion (1st approach 60 Gb; 2nd approach 80 Gb), and MarkDuplicates (1st approach 59 Gb; 2nd approach 60 Gb), the data size is approximately retained through BWA and MarkDuplicates, with a size increase for the 2nd approach.

Another thing: when I ran an alignment quality check on both BAM files (for the 2nd approach, the per-read outputs were merged into a single file for the quality check) using samtools flagstat, both showed 99.65% mapped, but less duplication was observed in the 1st approach (merged reads):

1st approach duplication: 7197218 + 0 duplicates; 2nd approach duplication: 208749 + 0 duplicates

Could you please explain why this large reduction in data size is seen at the MarkIlluminaAdapters step, and the duplication difference in the merged files seen in the alignment quality check?


