Channel: Recent Discussions — GATK-Forum

When I call Indels from my vcf file using GATK analysis tools I get an Error!


Hi, I used the GATK pipeline until I got a vcf that had SNPs and Indels, so I used GATK Analysis tools to remove SNPs and keep Indels. But after adding the reference genome, dictionary and index I get this error:

The provided VCF file is malformed at approximately line number 455: Unparsable vcf record with allele *, for input source: /home/helenadarmancier/Documents/Estagio/Original/vcf_NoAngH201_NoMono.vcf

How can I fix this?
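In case it helps anyone suggest a fix: the offending records carry the * (spanning deletion) allele named in the error message. A crude workaround I could try is dropping those records before re-running (a sketch with awk; this discards the spanning-deletion sites rather than fixing them):

# keep header lines and any record whose ALT column (field 5) lacks the * allele
awk -F'\t' '$0 ~ /^#/ || $5 !~ /\*/' vcf_NoAngH201_NoMono.vcf > vcf_NoAngH201_NoMono.noSpanDel.vcf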


(How to) Map and clean up short read sequence data efficiently



If you are interested in emulating the methods used by the Broad Genomics Platform to pre-process your short read sequencing data, you have landed on the right page. The parsimonious operating procedures outlined in this three-step workflow maximize data quality as well as storage and processing efficiency to produce a mapped and clean BAM. This clean BAM is ready for analysis workflows that start with MarkDuplicates.

Since your sequencing data could be in a number of formats, the first step of this workflow refers you to specific methods to generate a compatible unmapped BAM (uBAM, Tutorial#6484) or (uBAMXT, Tutorial#6570, coming soon). Not all unmapped BAMs are equal and these methods emphasize cleaning up prior meta information while giving you the opportunity to assign proper read group fields. The second step of the workflow has you marking adapter sequences, e.g. arising from read-through of short inserts, using MarkIlluminaAdapters such that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies meta data. The resulting clean BAM is coordinate-sorted and indexed.

The workflow reflects a lossless operating procedure that retains original sequencing read information within the final BAM file such that data is amenable to reversion and analysis by different means. These practices make scaling up and long-term storage efficient, as one needs only keep the final BAM file.

Geraldine_VdAuwera points out that there are many different ways of correctly preprocessing HTS data for variant discovery and ours is only one approach. So keep this in mind.

We present this workflow using real data from a public sample. The original data file, called Solexa-272222, is large at 150 GB. The file contains 151 bp paired PCR-free reads giving 30x coverage of a human whole genome sample referred to as NA12878. The entire sample library was sequenced in a single flow cell lane and thereby assigns all the reads the same read group ID. The example commands work both on this large file and on smaller files containing a subset of the reads, collectively referred to as snippet. NA12878 has a variant in exon 5 of the CYP2C19 gene, on the portion of chromosome 10 covered by the snippet, resulting in a nonfunctional protein. Consistent with GATK's recommendation to use the most up-to-date tools, we used the most current versions of tools as of their testing (September to December 2015) for the given example results, with the exception of BWA. We provide illustrative example results, some of which were derived from processing the original large file and some of which show intermediate stages skipped by this workflow.

Download example snippet data to follow along with the tutorial.

We welcome feedback. Share your suggestions in the Comments section at the bottom of this page.


Jump to a section

  1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL
  2. Mark adapter sequences using MarkIlluminaAdapters
  3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment
    A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq
    B. Align reads and flag secondary hits using BWA-MEM
    C. Restore altered data and apply & adjust meta information using MergeBamAlignment
    D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

Tools involved

Prerequisites

  • Installed Picard tools
  • Installed GATK tools
  • Installed BWA
  • Reference genome
  • Illumina or similar tech DNA sequence reads file containing data corresponding to one read group ID. That is, the file contains data from one sample and from one flow cell lane.

Download example data

  • To download the reference, open ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/ in your browser. Leave the password field blank. Download the following three files (~860 MB) to the same folder: human_g1k_v37_decoy.fasta.gz, .fasta.fai.gz, and .dict.gz. This same reference is available to load in IGV. A command-line sketch of these downloads follows this list.
  • I divided the example data into two tarballs: tutorial_6483_piped.tar.gz contains the files for the piped process and tutorial_6483_intermediate_files.tar.gz contains the intermediate files produced by running each process independently. The data contain reads originally aligning to a one Mbp genomic interval (10:96,000,000-97,000,000) of GRCh37. The table shows the steps of the workflow, corresponding input and output example data files and approximate minutes and disk space needed to process each step. Additionally, we tabulate the time and minimum storage needed to complete the workflow as presented (piped) or without piping.
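The downloads can also be scripted. A minimal sketch, assuming wget is available and using the FTP URL from the first bullet (the tarballs download the same way from their tutorial links):

wget 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/human_g1k_v37_decoy.fasta.gz'
wget 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/human_g1k_v37_decoy.fasta.fai.gz'
wget 'ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.8/b37/human_g1k_v37_decoy.dict.gz'
gunzip human_g1k_v37_decoy.*.gz
tar -xzf tutorial_6483_piped.tar.gz                  # files for the piped process
tar -xzf tutorial_6483_intermediate_files.tar.gz     # intermediate files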


Related resources

Other notes

  • When transforming data files, we stick to using Picard tools over other tools to avoid subtle incompatibilities.
  • For large files, (1) use the Java -Xmx setting and (2) set the tool's TMP_DIR argument to point to a temporary directory.

    java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
        TMP_DIR=/path/shlee
    

    In the command, the -Xmx8G Java option caps the maximum heap size, or memory usage, to eight gigabytes. The path given by TMP_DIR points the tool to scratch space that it can use. These options allow the tool to run without slowing down and without causing an out-of-memory error. The -Xmx settings we provide here are more than sufficient for most cases. For GATK, 4G is standard, while for Picard less is needed. Some tools, e.g. MarkDuplicates, may require more. These options can be omitted for small files such as the example data, and the equivalent command is as follows.

    java -jar /path/picard.jar MarkIlluminaAdapters
    

    To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize. Note that any setting beyond available memory spills to storage and slows a system down. If multithreading, increase memory proportionately to the number of threads, e.g. if 1G is the minimum required for one thread, use 2G for two threads.
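    For example, on a typical system the check looks like this (the grep is added only for convenience, to pull out the relevant line):

    java -XX:+PrintFlagsFinal -version | grep -i MaxHeapSize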

  • When a command in this tutorial explicitly invokes default options, follow suit to ensure the same results.


1. Generate an unmapped BAM from FASTQ, aligned BAM or BCL

If you have raw reads data in BAM format with appropriately assigned read group fields, then you can start with step 2. Namely, besides differentiating samples, the read group ID should differentiate factors contributing to technical batch effects, i.e. flow cell lane. If not, you need to reassign read group fields. This dictionary post describes factors to consider and this post and this post provide some strategic advice on handling multiplexed data.

If your reads are mapped, or in BCL or FASTQ format, then generate an unmapped BAM according to the following instructions.

  • To convert FASTQ or revert aligned BAM files, follow directions in Tutorial#6484. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).
  • To convert Illumina Base Call (BCL) files, use IlluminaBasecallsToSam. The tool marks adapter sequences at the same time. The resulting uBAMXT has adapter sequences marked with the XT tag so you can skip step 2 of this workflow and go directly to step 3. The corresponding Tutorial#6570 is coming soon.

See if you can revert 6483_snippet.bam, containing 279,534 aligned reads, to the unmapped 6483_snippet_revertsam.bam, containing 275,546 reads.
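Tutorial#6484 walks through the reversion with RevertSam. A minimal sketch of such a command (check the tutorial for the full recommended option set; the options below are a subset):

java -Xmx8G -jar /path/picard.jar RevertSam \
I=6483_snippet.bam \
O=6483_snippet_revertsam.bam \
SANITIZE=true \
ATTRIBUTE_TO_CLEAR=XT \
ATTRIBUTE_TO_CLEAR=XN \
SORT_ORDER=queryname \
RESTORE_ORIGINAL_QUALITIES=true \
REMOVE_DUPLICATE_INFORMATION=true \
REMOVE_ALIGNMENT_INFORMATION=true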



2. Mark adapter sequences using MarkIlluminaAdapters

MarkIlluminaAdapters adds the XT tag to a read record to mark the 5' start position of the specified adapter sequence and produces a metrics file. Some of the marked adapters come from concatenated adapters that randomly arise from the primordial soup that is a PCR reaction. Others represent read-through to 3' adapter ends of reads and arise from insert sizes that are shorter than the read length. In some instances read-through can affect the majority of reads in a sample, e.g. in Nextera library samples over-titrated with transposomes, and render these reads unmappable by certain aligners. Tools such as SamToFastq use the XT tag in various ways to effectively remove adapter sequence contribution to read alignment and alignment scoring metrics. Depending on your library preparation, insert size distribution and read length, expect varying amounts of such marked reads.

# The metrics file (M) is required. TMP_DIR is optional, for processing large files.
java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
O=6483_snippet_markilluminaadapters.bam \
M=6483_snippet_markilluminaadapters_metrics.txt \
TMP_DIR=/path/shlee

This produces two files. (1) The metrics file, 6483_snippet_markilluminaadapters_metrics.txt bins the number of tagged adapter bases versus the number of reads. (2) The 6483_snippet_markilluminaadapters.bam file is identical to the input BAM, 6483_snippet_revertsam.bam, except reads with adapter sequences will be marked with a tag in XT:i:# format, where # denotes the 5' starting position of the adapter sequence. At least six bases are required to mark a sequence. Reads without adapter sequence remain untagged.

  • By default, the tool uses Illumina adapter sequences. This is sufficient for our example data.
  • Adjust the default standard Illumina adapter sequences to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters. To clear the defaults and add new adapter sequences, first set ADAPTERS to 'null', then specify each sequence with these parameters. A sketch of such a command follows this list.
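A sketch of a custom-adapter invocation, per the bullet above. The adapter sequences shown are illustrative placeholders; substitute your kit's actual sequences:

java -Xmx8G -jar /path/picard.jar MarkIlluminaAdapters \
I=6483_snippet_revertsam.bam \
O=6483_snippet_markilluminaadapters.bam \
M=6483_snippet_markilluminaadapters_metrics.txt \
ADAPTERS=null \
FIVE_PRIME_ADAPTER=AATGATACGGCGACCACCGAGATCTACAC \
THREE_PRIME_ADAPTER=AGATCGGAAGAGC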

We plot the metrics data, which is in GATKReport file format, using RStudio, and as you can see, the marked regions vary in length, up to the full length of the reads.

Do you get the same number of marked reads? 6483_snippet marks 448 reads (0.16%) with XT, while the original Solexa-272222 marks 3,236,552 reads (0.39%).

Below, we show a read pair marked with the XT tag by MarkIlluminaAdapters. The insert region sequences for the reads overlap by a length corresponding approximately to the XT tag value. For XT:i:20, the majority of the read is adapter sequence. The same read pair is shown after SamToFastq transformation, where adapter sequence base quality scores have been set to 2 (# symbol), and after MergeBamAlignment, which restores original base quality scores.

Unmapped uBAM (step 1) [image]

After MarkIlluminaAdapters (step 2) [image]

After SamToFastq (step 3) [image]

After MergeBamAlignment (step 3) [image]



3. Align reads with BWA-MEM and merge with uBAM using MergeBamAlignment

This step actually pipes three processes, performed by three different tools. Our tutorial example files are small enough to easily view, manipulate and store, so any difference in piped or independent processing will be negligible. For larger data, however, using Unix pipelines can add up to significant savings in processing time and storage.

Not all tools are amenable to piping and piping the wrong tools or wrong format can result in anomalous data.

The three tools we pipe are SamToFastq, BWA-MEM and MergeBamAlignment. By piping these we bypass storage of larger intermediate FASTQ and SAM files. We additionally save time by eliminating the need to read in and write out data between two of the processes, as piping streams data directly from one process to the next instead of going through files on disk.

To make the information more digestible, we will first talk about each tool separately. At the end of the section, we provide the piped command.



3A. Convert BAM to FASTQ and discount adapter sequences using SamToFastq

Picard's SamToFastq takes read identifiers, read sequences, and base quality scores to write a Sanger FASTQ format file. We use additional options to effectively remove previously marked adapter sequences, in this example marked with an XT tag. By specifying CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2, SamToFastq changes the quality scores of bases marked by XT to two--a rather low score in the Phred scale. This effectively removes the adapter portion of sequences from contributing to downstream read alignment and alignment scoring metrics.

Illustration of an intermediate step unused in workflow. See piped command.

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=6483_snippet_samtofastq_interleaved.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true \
TMP_DIR=/path/shlee #optional to process large files

This produces a FASTQ file in which all extant meta data, i.e. read group information, alignment information, flags and tags are purged. What remains are the read query names prefaced with the @ symbol, read sequences and read base quality scores.

  • For our paired reads example file we set SamToFastq's INTERLEAVE to true. During the conversion to FASTQ format, the query names of the reads in a pair are marked with /1 or /2 and paired reads are retained in the same FASTQ file (see the spot-check after this list). The BWA aligner accepts interleaved FASTQ files given the -p option.
  • We change the NON_PF, aka INCLUDE_NON_PF_READS, option from default to true. SamToFastq will then retain reads marked by what some consider an archaic 0x200 flag bit that denotes reads that do not pass quality controls, aka reads failing platform or vendor quality checks. Our tutorial data do not contain such reads and we call out this option for illustration only.
  • Other CLIPPING_ACTION options include (1) X to hard-clip, (2) N to change bases to Ns or (3) another number to change the base qualities of those positions to the given value.
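To spot-check the interleaving mentioned above, peek at the first pair of records; mates should be adjacent, with /1 and /2 suffixes on the query names:

head -n 8 6483_snippet_samtofastq_interleaved.fq    # first read pair (4 FASTQ lines per read)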



3B. Align reads and flag secondary hits using BWA-MEM

In this workflow, alignment is the most compute intensive and will take the longest time. GATK's variant discovery workflow recommends Burrows-Wheeler Aligner's maximal exact matches (BWA-MEM) algorithm (Li 2013 reference; Li 2014 benchmarks; homepage; manual). BWA-MEM is suitable for aligning high-quality long reads ranging from 70 bp to 1 Mbp against a large reference genome such as the human genome.

  • Aligning our snippet reads against either a portion or the whole genome is not equivalent to aligning our original Solexa-272222 file, merging and taking a new slice from the same genomic interval.
  • For the tutorial, we use BWA v 0.7.7.r441, the same aligner used by the Broad Genomics Platform as of this writing (9/2015).
  • As mentioned, alignment is a compute intensive process. For faster processing, use a reference genome with decoy sequences, also called a decoy genome. For example, the Broad's Genomics Platform uses an Hg19/GRCh37 reference sequence that includes Epstein-Barr virus (EBV) sequence to soak up reads that fail to align to the human reference that the aligner would otherwise spend an inordinate amount of time trying to align as split reads. GATK's resource bundle provides a standard decoy genome from the 1000 Genomes Project.
  • BWA alignment requires an indexed reference genome file. Indexing is specific to algorithms. To index the human genome for BWA, we apply BWA's index function on the reference genome file, e.g. human_g1k_v37_decoy.fasta. This produces five index files with the extensions amb, ann, bwt, pac and sa.

    bwa index -a bwtsw human_g1k_v37_decoy.fasta
    

The example command below aligns our example data against the GRCh37 genome. The tool automatically locates the index files within the same folder as the reference FASTA file.

Illustration of an intermediate step unused in workflow. See piped command.

/path/bwa mem -M -t 7 -p /path/human_g1k_v37_decoy.fasta \
6483_snippet_samtofastq_interleaved.fq > 6483_snippet_bwa_mem.sam

This command takes the FASTQ file, 6483_snippet_samtofastq_interleaved.fq, and produces an aligned SAM format file, 6483_snippet_bwa_mem.sam, containing read alignment information, an automatically generated program group record and reads sorted in the same order as the input FASTQ file. Aligner-assigned alignment information, flag and tag values reflect each read's or split read segment's best sequence match and do not take into consideration whether pairs are mapped optimally or if a mate is unmapped. Added tags include the aligner-specific XS tag that marks secondary alignment scores in XS:i:# format. This tag is given for each read even when the score is zero and even for unmapped reads. The program group record (@PG) in the header gives the program group ID, group name, group version and recapitulates the given command. Reads are sorted by query name. For the given version of BWA, the aligned file is in SAM format even if given a BAM extension.

Does the aligned file contain read group information?
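Since the aligner output is plain SAM text, you can check with grep. Expect a @PG header line but no @RG line; read group information is restored later, in step 3C:

grep '^@PG' 6483_snippet_bwa_mem.sam
grep '^@RG' 6483_snippet_bwa_mem.sam    # no output expected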

We invoke three options in the command.

  • -M to flag shorter split hits as secondary.
    This is optional for Picard compatibility as MarkDuplicates can directly process BWA's alignment, whether or not the alignment marks secondary hits. However, if we want MergeBamAlignment to reassign proper pair alignments, to generate data comparable to that produced by the Broad Genomics Platform, then we must mark secondary alignments.

  • -p to indicate the given file contains interleaved paired reads.

  • -t followed by a number for the number of processor threads to use concurrently. Here we use seven threads, which is one less than the total threads available on my Mac laptop. Check your server or system's total number of threads with the following command provided by KateN.

    getconf _NPROCESSORS_ONLN
    

In the example data, all of the 1211 unmapped reads each have an asterisk (*) in column 6 of the SAM record, where a read typically records its CIGAR string. The asterisk represents that the CIGAR string is unavailable. The several asterisked reads I examined are recorded as mapping exactly to the same location as their mapped mates but with MAPQ of zero. Additionally, the asterisked reads had noticeable stretches of low base qualities, e.g. strings of #s, that correspond to original base quality calls and not those changed by SamToFastq. This accounting by BWA allows these pairs to always list together, even when the reads are coordinate-sorted, and leaves a pointer to the genomic mapping of the mate of the unmapped read. For the example read pair shown below, comparing sequences shows no apparent overlap, with the highest identity at 72% over 25 nts.
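One way to reproduce the 1211 count, assuming you kept the intermediate SAM from the command above (awk skips header lines and matches the asterisked CIGAR column):

awk -F'\t' '$1 !~ /^@/ && $6 == "*"' 6483_snippet_bwa_mem.sam | wc -l    # expect 1211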

After MarkIlluminaAdapters (step 2) [image]

After BWA-MEM (step 3) [image]

After MergeBamAlignment (step 3) [image]



3C. Restore altered data and apply & adjust meta information using MergeBamAlignment

MergeBamAlignment is a beast of a tool, so its introduction is longer. It does more than is implied by its name. Explaining these features requires I fill you in on some background.

Broadly, the tool merges defined information from the unmapped BAM (uBAM, step 1) with that of the aligned BAM (step 3) to conserve read data, e.g. original read information and base quality scores. The tool also generates additional meta information based on the information generated by the aligner, which may alter aligner-generated designations, e.g. mate information and secondary alignment flags. The tool then makes adjustments so that all meta information is congruent, e.g. read and mate strand information based on proper mate designations. We ascribe the resulting BAM as clean.

Specifically, the aligned BAM generated in step 3 lacks read group information and certain tags--the UQ (Phred likelihood of the segment), MC (CIGAR string for mate) and MQ (mapping quality of mate) tags. It has hard-clipped sequences from split reads and altered base qualities. The reads also have what some call mapping artifacts but what are really just features we should not expect from our aligner. For example, the meta information so far does not consider whether pairs are optimally mapped and whether a mate is unmapped (in reality or for accounting purposes). Depending on these assignments, MergeBamAlignment adjusts the read and read mate strand orientations for reads in a proper pair. Finally, the alignment records are sorted by query name. We would like to fix all of these issues before taking our data to a variant discovery workflow.

Enter MergeBamAlignment. As the tool name implies, MergeBamAlignment applies read group information from the uBAM and retains the program group information from the aligned BAM. In restoring original sequences, the tool adjusts CIGAR strings from hard-clipped to soft-clipped. If the alignment file is missing reads present in the unaligned file, then these are retained as unmapped records. Additionally, MergeBamAlignment evaluates primary alignment designations according to a user-specified strategy, e.g. for optimal mate pair mapping, and changes secondary alignment and mate unmapped flags based on its calculations. It makes additional adjustments for desired congruency. I will soon explain these changes in more detail and show a read record to illustrate.

Consider what PRIMARY_ALIGNMENT_STRATEGY option best suits your samples. MergeBamAlignment applies this strategy to a read for which the aligner has provided more than one primary alignment, and for which one is designated primary by virtue of another record being marked secondary. MergeBamAlignment considers and switches only existing primary and secondary designations. Therefore, it is critical that these were previously flagged.

A read with multiple alignment records may map to multiple loci or may be chimeric--that is, the alignment is split across loci. It is possible for an aligner to produce multiple alignments as well as multiple primary alignments, e.g. in the case of a linear alignment set of split reads. When one alignment, or alignment set in the case of chimeric read records, is designated primary, others are designated either secondary or supplementary. Invoking the -M option, we had BWA mark the record with the longest aligning section of split reads as primary and all other records as secondary. MergeBamAlignment further adjusts this secondary designation and adds the read mapped in proper pair (0x2) and mate unmapped (0x8) flags. The tool then adjusts the strand orientation flag for a read (0x10) and its proper mate (0x20).

In the command, we change CLIP_ADAPTERS, MAX_INSERTIONS_OR_DELETIONS and PRIMARY_ALIGNMENT_STRATEGY values from default, and invoke other optional parameters. The path to the reference FASTA given by R should also contain the corresponding .dict sequence dictionary with the same prefix as the reference FASTA. It is imperative that both the uBAM and aligned BAM are sorted by queryname.

Illustration of an intermediate step unused in workflow. See piped command.

# ALIGNED_BAM accepts either SAM or BAM. CREATE_INDEX is a standard Picard option for
# coordinate-sorted outputs. ADD_MATE_CIGAR (default) adds the MC tag, and
# CLIP_OVERLAPPING_READS (default) soft-clips ends so mates do not extend past each other.
# INCLUDE_SECONDARY_ALIGNMENTS is also the default. CLIP_ADAPTERS and
# PRIMARY_ALIGNMENT_STRATEGY (default BestMapq) are changed from their defaults, and
# MAX_INSERTIONS_OR_DELETIONS=-1 allows any number of insertions or deletions.
# Specify ATTRIBUTES_TO_RETAIN multiple times to retain tags starting with X, Y, or Z.
# TMP_DIR is optional, for processing large files.
java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
R=/path/Homo_sapiens_assembly19.fasta \
UNMAPPED_BAM=6483_snippet_revertsam.bam \
ALIGNED_BAM=6483_snippet_bwa_mem.sam \
O=6483_snippet_mergebamalignment.bam \
CREATE_INDEX=true \
ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false \
CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true \
MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
ATTRIBUTES_TO_RETAIN=XS \
TMP_DIR=/path/shlee

This generates a coordinate-sorted and clean BAM, 6483_snippet_mergebamalignment.bam, and corresponding .bai index. These are ready for analyses starting with MarkDuplicates. The two bullet-point lists below describe changes to the resulting file. The first list gives general comments on select parameters and the second describes some of the notable changes to our example data.

Comments on select parameters

  • Setting PRIMARY_ALIGNMENT_STRATEGY to MostDistant marks primary alignments based on the alignment pair with the largest insert size. This strategy is based on the premise that of chimeric sections of a read aligning to consecutive regions, the alignment giving the largest insert size with the mate gives the most information.
  • It may well be that alignments marked as secondary represent interesting biology, so we retain them with the INCLUDE_SECONDARY_ALIGNMENTS parameter.
  • Setting MAX_INSERTIONS_OR_DELETIONS to -1 retains reads regardless of the number of insertions and deletions. The default is 1.
  • Because we leave the ALIGNER_PROPER_PAIR_FLAGS parameter at the default false value, MergeBamAlignment will reassess and reassign proper pair designations made by the aligner. These are explained below using the example data.
  • ATTRIBUTES_TO_RETAIN is specified to carry over the XS tag from the alignment, which reports BWA-MEM's suboptimal alignment scores. My impression is that this is the next highest score for any alternative or additional alignments BWA considered, whether or not these additional alignments made it into the final aligned records. (IGV's BLAT feature allows you to search for additional sequence matches). For our tutorial data, this is the only additional unaccounted tag from the alignment. The XS tag is unnecessary for the Best Practices Workflow and is not retained by the Broad Genomics Platform's pipeline. We retain it here not only to illustrate that the tool carries over select alignment information only if asked, but also because I think it prudent. Given how compute intensive the alignment process is, the additional ~1% gain in the snippet file size seems a small price against having to rerun the alignment because we realize later that we want the tag.
  • Setting CLIP_ADAPTERS to false leaves reads unclipped.
  • By default the merged file is coordinate sorted. We set CREATE_INDEX to true to additionally create the bai index.
  • We need not invoke PROGRAM options as BWA's program group information is sufficient and is retained in the merging.
  • As a standalone tool, we would normally feed in a BAM file for ALIGNED_BAM instead of the much larger SAM. We will be piping this step however and so need not add an extra conversion to BAM.

Description of changes to our example data

  • MergeBamAlignment merges header information from the two sources that define read groups (@RG) and program groups (@PG) as well as reference contigs.
  • Tags are updated for our example data as shown in the table. The tool retains SA, MD, NM and AS tags from the alignment, given these are not present in the uBAM. The tool additionally adds UQ (the Phred likelihood of the segment), MC (mate CIGAR string) and MQ (mapping quality of the mate/next segment) tags if applicable. For unmapped reads (marked with an * asterisk in column 6 of the SAM record), the tool removes AS and XS tags and assigns MC (if applicable), PG and RG tags. This is illustrated for example read H0164ALXX140820:2:1101:29704:6495 in the BWA-MEM section of this document.
  • Original base quality score restoration is illustrated in step 2.

The example below shows a read pair for which MergeBamAlignment adjusts multiple information fields, and these changes are described in the remaining bullet points.

  • MergeBamAlignment changes hard-clipping to soft-clipping, e.g. 96H55M to 96S55M, and restores corresponding truncated sequences with the original full-length read sequence.
  • The tool reorders the read records to reflect the chromosome and contig ordering in the header and the genomic coordinates for each.
  • MergeBamAlignment's MostDistant PRIMARY_ALIGNMENT_STRATEGY asks the tool to consider the best pair to mark as primary from the primary and secondary records. In this pair, one of the reads has two alignment loci, on contig hs37d5 and on chromosome 10. The two loci align 115 and 55 nucleotides, respectively, and the aligned sequences are identical by 55 bases. Flag values set by BWA-MEM indicate the contig hs37d5 record is primary and the shorter chromosome 10 record is secondary. For this chimeric read, MergeBamAlignment reassigns the chromosome 10 mapping as the primary alignment and the contig hs37d5 mapping as secondary (0x100 flag bit).
  • In addition, MergeBamAlignment designates each record on chromosome 10 as read mapped in proper pair (0x2 flag bit) and the contig hs37d5 mapping as mate unmapped (0x8 flag bit). IGV's paired reads mode displays the two chromosome 10 mappings as a pair after these MergeBamAlignment adjustments.
  • MergeBamAlignment adjusts read reverse strand (0x10 flag bit) and mate reverse strand (0x20 flag bit) flags consistent with changes to the proper pair designation. For our non-stranded DNA-Seq library alignments displayed in IGV, a read pointing rightward is in the forward direction (absence of 0x10 flag) and a read pointing leftward is in the reverse direction (flagged with 0x10). In a typical pair, where the rightward pointing read is to the left of the leftward pointing read, the left read will also have the mate reverse strand (0x20) flag.

Two distinct classes of mate unmapped read records are now present in our example file: (1) reads whose mates truly failed to map and are marked by an asterisk * in column 6 of the SAM record and (2) multimapping reads whose mates are in fact mapped but in a proper pair that excludes the particular read record. Each of these two classes of mate unmapped reads can contain multimapping reads that map to two or more locations.

Comparing 6483_snippet_bwa_mem.sam and 6483_snippet_mergebamalignment.bam, we see the number of unmapped reads remains the same at 1211, while the number of records with the mate unmapped flag increases by 1359, from 1276 to 2635. These now account for 0.951% of the 276,970 read records.
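If you would like to tally these flags yourself, samtools can count flag bits directly; a minimal sketch, assuming samtools is installed:

samtools view -c -f 4 6483_snippet_mergebamalignment.bam    # records flagged unmapped (0x4); expect 1211
samtools view -c -f 8 6483_snippet_mergebamalignment.bam    # records flagged mate unmapped (0x8); expect 2635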

For 6483_snippet_mergebamalignment.bam, how many additional unique reads become mate unmapped?

After BWA-MEM alignment [image]

After MergeBamAlignment [image]



3D. Pipe SamToFastq, BWA-MEM and MergeBamAlignment to generate a clean BAM

We pipe the three tools described above to generate an aligned BAM file sorted by query name. In the piped command, the commands for the three processes are given together, separated by a vertical bar |. We also replace each intermediate output and input file name with /dev/stdout and /dev/stdin, respectively, the paths to the system's standard output and input streams. We need only provide the first input file and name the last output file.

Before using a piped command, we should ask the shell to report failure if any step of the pipe errors, rather than only the last step's exit status, so that errors do not pass silently. Type the following into your shell to set this option.

set -o pipefail
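To see the effect, compare the exit status of a pipe with a failing first step before and after setting the option:

false | true; echo $?    # prints 0: only the last command's status is reported
set -o pipefail
false | true; echo $?    # prints 1: the failing step now fails the whole pipe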

Overview of command structure

[SamToFastq] | [BWA-MEM] | [MergeBamAlignment]

Piped command

java -Xmx8G -jar /path/picard.jar SamToFastq \
I=6483_snippet_markilluminaadapters.bam \
FASTQ=/dev/stdout \
CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
TMP_DIR=/path/shlee | \
/path/bwa mem -M -t 7 -p /path/Homo_sapiens_assembly19.fasta /dev/stdin | \
java -Xmx16G -jar /path/picard.jar MergeBamAlignment \
ALIGNED_BAM=/dev/stdin \
UNMAPPED_BAM=6483_snippet_revertsam.bam \
OUTPUT=6483_snippet_piped.bam \
R=/path/Homo_sapiens_assembly19.fasta CREATE_INDEX=true ADD_MATE_CIGAR=true \
CLIP_ADAPTERS=false CLIP_OVERLAPPING_READS=true \
INCLUDE_SECONDARY_ALIGNMENTS=true MAX_INSERTIONS_OR_DELETIONS=-1 \
PRIMARY_ALIGNMENT_STRATEGY=MostDistant ATTRIBUTES_TO_RETAIN=XS \
TMP_DIR=/path/shlee

The piped output file, 6483_snippet_piped.bam, is for all intents and purposes the same as 6483_snippet_mergebamalignment.bam, produced by running MergeBamAlignment separately without piping. However, the resulting files, as well as new runs of the workflow on the same data, have the potential to differ in small ways because each uses a different alignment instance.

How do these small differences arise?

Counting the number of mate unmapped reads shows that this number remains unchanged for the two described workflows. Two counts emitted at the end of the process log, which also remain constant for these instances, are the number of alignment records and the number of unmapped reads.

INFO    2015-12-08 17:25:59 AbstractAlignmentMerger Wrote 275759 alignment records and 1211 unmapped reads.



Some final remarks

We have produced a clean BAM that is coordinate-sorted and indexed, in an efficient manner that minimizes processing time and storage needs. The file is ready for marking duplicates as outlined in Tutorial#2799. Additionally, we can now free up storage on our file system by deleting the original file we started with, the uBAM and the uBAMXT. We sleep well at night knowing that the clean BAM retains all original information.

We have two final comments (1) on multiplexed samples and (2) on fitting this workflow into a larger workflow.

For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates.

For workflows that embed this pipeline, consider additionally optimizing the Java and JVM parameters for SamToFastq and MergeBamAlignment. For example, the following are the additional settings used by the Broad Genomics Platform in the piped command for very large data sets.

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx128m -jar /path/picard.jar SamToFastq ...

    java -Dsamjdk.buffer_size=131072 -Dsamjdk.use_async_io=true -Dsamjdk.compression_level=1 -XX:+UseStringCache -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx5000m -jar /path/picard.jar MergeBamAlignment ...

I give my sincere thanks to Julian Hess, the GATK team and the Data Sciences and Data Engineering (DSDE) team members for all their help in writing this and related documents.



GATK 3.7 to 3.8 migration problems creating/accessing temp files


We observed a strange behavior in the UnifiedGenotyper with GATK 3.8 that we didn't see using the previous version 3.7. All our scripts remained unchanged between the version comparisons. Here is the command line:

java -Xmx23g -Djava.io.tmpdir=temp -jar /project/gatk/dist/GenomeAnalysisTK.jar -T UnifiedGenotyper -nt 12 -R /reference/human_g1k_v37.fasta --input_file Exome.S24416.clean.dedup.recal.bam --out Exome.S24416.GATK_UG.vcf.gz --metrics_file Exome.S24416.GATK_UG.vcf.gz.metrics --intervals /targets/all.500_flanking.bed --dbsnp /reference/dbsnp_132.b37.vcf --genotype_likelihoods_model BOTH --output_mode EMIT_ALL_SITES --annotation HomopolymerRun --downsample_to_coverage 1000

And here is a snippet of the error log:

(...)
INFO 07:38:07,724 ProgressMeter - 1:109544813 2.9475091E7 3.0 m 6.0 s 4.7% 63.9 m 60.9 m
INFO 07:38:37,727 ProgressMeter - 1:147440544 3.4888328E7 3.5 m 6.0 s 5.6% 62.7 m 59.2 m

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.8-0-g31cc35e):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: An error occurred while working with the tmp directory /exec5/GROUP/rouleau/COMMUN/pipeline_analysis/exome.new_pipeline/S24416/Illumina_HiSeq_paired-IC-Exome_NG_V3-merged/GATK_BWA.v37.new/temp. You can specify -Djava.io.tmpdir=X on the command line (before the -jar argument) where X is a directory path, to use a more appropriate temporary directory. Note that this is a JVM argument, not a GATK argument. The exact error was Unable to create temporary file for stub: org.broadinstitute.gatk.engine.io.stubs.OutputStreamStub
ERROR ------------------------------------------------------------------------------------------

Couple of observations/comments:
1. We have full read-write access to the "temp" directory, and our quota is fine
2. ulimit -n (open files) is set to 1024. We cannot change the hard limits, and setting soft limits to a higher value doesn't change the behavior
3. Reverting from the new Intel deflater to the old JDK one didn't change anything
4. The exact same command line with GATK 3.7 didn't produce this error
5. Using exactly one thread (-nt 1) will NOT generate the error

And now for the interesting stuff:
Monitoring the system for file handles with lsof, we can see that GATK 3.8 uses the same number of active handles as 3.7 (between 50-90 for 12 file threads). However, the number of "deleted" handles keeps growing with GATK 3.8. Some of them get removed, but not at the same pace as new ones get created.

Here is the command we used to count the number of deleted handles:
lsof -F sn0 | tr -d '\000' | grep deleted | sed 's/^[a-z]*\([0-9]*\)n/\1 /' | sort -n | wc -l

Here is an example of a deleted handle from lsof:
java 49147 dionnela 3030r REG 0,21 397059 131250621 /exec5/GROUP/rouleau/COMMUN/pipeline_analysis/exome.new_pipeline/S24416/Illumina_HiSeq_paired-IC-Exome_NG_V3-merged/GATK_BWA.v37.new/temp/org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub4651242732645532534.tmp (deleted)

This behavior was reproduced on two different cluster systems. At around 3900 deleted handles, the tool stops and generates the error log.

Using the previous GATK 3.7, the number of deleted handles grows up to around 1000, and they get removed by the system at the same pace as new ones get created.

I understand that this is not a real GATK bug and it seems to be a systems architecture or Java problem. However, we are posting this because the behavior of deleted file handles changed from one GATK version to another.

Problem using GATK4 GenomicsDBImport for joint calling.


Hi,
I am trying to use GATK4 for joint-calling. I am calling variants with HaplotypeCaller in GVCF mode for three samples (HG002-3-4). HaplotypeCaller is parallelized with the scatter-gather method over the recommended genomic intervals. For each sample, the GVCFs output for each interval are concatenated using GATK CatVariants and then sorted and indexed using Picard SortVCF. Note that I also tried VCFtools for sorting and indexing, but there appears to be a long-standing problem with GATK's compatibility with it (see here).

Then, I input these three sorted and indexed GVCFs to GenomicsDBImport. I also give a BED file with only a single line, since GenomicsDBImport can only work with a single interval. Here is the error I'm getting.

2017-09-19T17:15:49.006057064Z terminate called after throwing an instance of 'VCF2TileDBException'
2017-09-19T17:15:49.006097259Z what(): VCF2TileDBException : Incorrect cell order found - cells must be in column major order. Previous cell: [ 0, 14415 ] current cell: [ 0, 14415 ]

I stumbled upon this post about this error (see item 8 in the link). As recommended, I used BCFtools norm to remove variants that are supposedly at the same position (BCFtools output showed that there aren't any; these GVCFs are outputs of HaplotypeCaller anyway).
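For the record, that deduplication pass amounted to something like the following sketch (file name is a placeholder; bcftools norm's -d option removes duplicate records, and exact flags may vary by BCFtools version):

bcftools norm -d both sample1.g.vcf.gz -O z -o sample1.norm.g.vcf.gz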

There are two options here: the first is to run Picard SortVCF on the output of GATK CatVariants and then BCFtools norm; the second is the other way around (BCFtools norm, then Picard SortVCF). Both cases fail when their outputs are given to GenomicsDBImport. The second case gives the same error as above. The first gives the following error.

htsjdk.tribble.TribbleException: Could not decode field MLEAF with value nan of declared type Float
at htsjdk.variant.variantcontext.VariantContext.decodeOne(VariantContext.java:1630)
at htsjdk.variant.variantcontext.VariantContext.decodeValue(VariantContext.java:1601)
at htsjdk.variant.variantcontext.VariantContext.fullyDecodeAttributes(VariantContext.java:1561)
at htsjdk.variant.variantcontext.VariantContext.fullyDecodeInfo(VariantContext.java:1546)
at htsjdk.variant.variantcontext.VariantContext.fullyDecode(VariantContext.java:1530)
at htsjdk.variant.variantcontext.writer.BCF2Writer.add(BCF2Writer.java:197)
at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1232)
at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1282)
at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:381)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
at org.broadinstitute.hellbender.Main.main(Main.java:230)

which brings me to this post and consequently to the first link I posted above, both of which say that I should not use BCFtools with GenomicsDBImport.

Finally, there is the GatherVCFs tool to replace CatVariants. However, GatherVCFs requires the input VCFs (or GVCFs) to be in the correct (sorted) order. Due to the implementation of my workflow, I have no control over the order of input files, so I cannot use GatherVCFs. (I mean come on! Who needs VCF files in a specific order when each of them is individually sorted anyway? This should be trivial to fix.)

Therefore, I am trapped in an inescapable cycle of incompatible tools :dizzy: It appears to me that if I use the scatter-gather method with HaplotypeCaller, there is no way I can make a joint-calling pipeline with GATK4. If I don't use scatter-gather, things just become so slow that they are not useful for my application anymore. Does anyone have a solution that I'm missing here?

I will post results when I test without scatter-gather, just to see if it works.

Thanks a lot!

Serhat

GenotypeGVCFs tool gives different output depending on the order of input GVCFs?


Hi,
I have been using the GATK GenotypeGVCFs tool (versions 3.5, 3.7 and 4.0). It has come to my attention that depending on the order of input GVCFs, the output slightly changes, i.e. the total number of variants in the output VCF changes. For example, everything else kept constant, the following two command lines output slightly different VCFs.

1)
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant sample1.g.vcf --variant sample2.g.vcf -o output.vcf
2)
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant sample2.g.vcf --variant sample1.g.vcf -o output.vcf

I have observed this in GATK 3.5 and 3.7 versions. GATK 4 for some reason does not work with multiple GVCFs, which I talk about in a different question. There is no parallelization applied whatsoever. Does anyone have any idea what's going on?
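For anyone wanting to see where the two orderings diverge, a quick spot-check (assuming the two runs were saved as output1.vcf and output2.vcf, hypothetical names for the two results above):

diff <(grep -v '^#' output1.vcf) <(grep -v '^#' output2.vcf) | head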

Thanks a lot.

"recapseg_bed" should be "recapseg_tsv" in "gatk_cov_pull" ?


Hi,
For "gatk_cov_pull" in Firecloud, the input file "recapseg_bed" should be a GATK-style tsv file based on manual of "CalculateTargetCoverage". So, "recapseg_bed" is a little bit misleading. Is that correct? I have modified it and published a new one.

Best,
Chunyang

IndelRealigner Crashing, Java Fatal Error


Hello GATK,

I am receiving the following error when running this script:

java -Xmx32G -Djava.io.tmpdir=/data1/home/nipm/arvin/tmp \
-jar /data1/APPS/gatk_nightly/GenomeAnalysisTK.jar \
-T IndelRealigner \
-R /data1/DATA/REFERENCE/human_g1k_v37_decoy.fasta \
-targetIntervals S14_03B_CHG000251_realignment_targets.list \
-known /data1/DATA/REFERENCE/1000G_phase1.indels.b37.vcf \
-I S14_03B_CHG000251_combined_dedup.bam \
-o S14_03B_CHG000251_indelrealigner.bam

INFO 18:21:42,635 ProgressMeter - 1:184280371 4.010121E7 33.9 m 50.0 s 5.9% 9.6 h 9.0 h
INFO 18:22:12,636 ProgressMeter - 1:187049819 4.0801219E7 34.4 m 50.0 s 6.0% 9.6 h 9.0 h
INFO 18:22:43,558 ProgressMeter - 1:190054021 4.1601238E7 34.9 m 50.0 s 6.1% 9.6 h 9.0 h
INFO 18:23:13,561 ProgressMeter - 1:192972143 4.2301251E7 35.4 m 50.0 s 6.2% 9.6 h 9.0 h
INFO 18:23:20,807 IndelRealigner - Not attempting realignment in interval 1:193831502-193831691 because there are too many reads.
INFO 18:23:43,562 ProgressMeter - 1:195808771 4.3001259E7 35.9 m 50.0 s 6.2% 9.6 h 9.0 h
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fec98d8689b, pid=11969, tid=0x00007fec94265700
#
# JRE version: OpenJDK Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x62e89b]
#
# Core dump written. Default location: /data2/travis/WGS-86/CHG000251/core or core.11969
#
# An error report file with more information is saved as:
# /data2/travis/WGS-86/CHG000251/hs_err_pid11969.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp

I am running GATK 3.8, nightly build from 09/18/2017

As for java:
openjdk version "1.8.0_144"
OpenJDK Runtime Environment (build 1.8.0_144-b01)
OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)

I am trying to use 3.8 because 3.7 keeps crashing at HaplotypeCaller. I am now trying 3.7 using the flag --pair_hmm_implementation LOGLESS_CACHING, as this was a suggestion from my server administrator. She was also running into problems with HaplotypeCaller and this fixed her issue.

(howto) Use Oncotator on the Broad servers


This information is specific for using Oncotator at the Broad. This includes running Oncotator on the cluster and configurations used specifically at the Broad.

Please note: if you are working on the Broad servers, you do not need to install or upgrade Oncotator. It will be kept up to date on the Broad cluster. But you may need to tweak your environment settings as described below.


1. Deactivate any Python virtual environment

If you are in an active Python virtual environment, you may need to deactivate it. Just run this simple command:

$ deactivate

If you are not sure whether this applies to you, you probably are not using a virtual environment.


2. Activate appropriate packages

You may need to load the Python and zlib packages with reuse:

reuse .python-2.7.1-sqlite3-rtrees
reuse .zlib-1.2.6

Hopefully, this step can be eliminated in the future.


3. Activate the preset virtual environment

Add the following to your ~/.my.bashrc:

PATH=/xchip/tcga/Tools/oncotator/onco_env/ubin:$PATH

then run the following from the command line:

source ~/.my.bashrc

This will activate a virtual environment that has Oncotator and all its dependencies installed. See the public documentation about running Oncotator in a virtual environment for more information on what this means.


4. Test that you can invoke Oncotator

Try calling Oncotator's help page by running:

oncotator --help

If that works (and there's no reason why it shouldn't) you're all set to use Oncotator!


NOTE:

You may run into trouble if you try to run Oncotator on the cluster with the Broad's pyvcf dotkit. If that happens, use the following command:

unuse .pyvcf-0.6.3-python-2.7.1-sqlite3-rtrees

About GenotypeGVCFs?


I run into a problem when running GenotypeGVCFs to deal with population GVCFs. My command line is as follows:
/scratch/gkm/software/jdk1.8.0_131/bin/java -Xmx4g -jar /scratch/gkm/software/GATK/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R /newVol/gkm/Project/malus_domestica/reference/GCA_002114115.1_ASM211411v1_genomic.fna \
-nt 12 \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/1/1_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/10/10_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/11/11_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/11-133/11-133_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/11-39/11-39_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/12/12_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/13/13_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/14/14_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/15_merge/15_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/16/16_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/17/17_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/18/18_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/19/19_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/2/2_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/20/20_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/21/21_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/22/22_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/23/23_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/24/24_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/25/25_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/26/26_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/27/27_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/28/28_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/29/29_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/3/3_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/30/30_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/31/31_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/32/32_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/33/33_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/34/34_raw.g.vcf \
--variant /newVol/gkm/Project/malus_domestica/gatk/finished/qinhe/variants/35/35_raw.g.vcf \
-o out.vcf
I get this error:

ERROR --
ERROR stack trace

java.lang.NullPointerException
at java.util.LinkedList$ListItr.next(LinkedList.java:893)
at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.coveredByDeletion(GenotypingEngine.java:426)
at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateOutputAlleleSubset(GenotypingEngine.java:387)
at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:251)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:392)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:375)
at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:330)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:326)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:304)
at org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs.map(GenotypeGVCFs.java:135)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.ShardTraverser.call(ShardTraverser.java:98)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

I look forward to a solution. Thanks!

Should I provide the exome target list (-L argu) even while calling gVCF file using Haplotypecaller?


Hi,

Recently we performed exome sequencing using Nextera Illumina platform for three samples (Father, Mother and Son). I downloaded the exome interval list from Illumina's website.

1) Trimmed the raw reads
2) Aligned the trimmed reads against the human reference hg19 as recommended for exome-sequencing
3) Then sorted, deduped, recalibrated the bam file.
4) Then performed variant calling in two steps process for all three samples individually
4.1) Used the GATK Haplotype Caller tool in GVCF mode
Command: java -Xmx16g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /GATK_bundle/hg19.fa -I sample1.sorted.dedup.recal.bam --emitRefConfidence GVCF --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.raw.g.vcf
4.2) Used GenotypeGVCFs (Joint SNP calling) for all three samples together
Command: java -Xmx16g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R /GATK_bundle/hg19.fa --variant sample1.raw.g.vcf --variant sample2.raw.g.vcf --variant sample3.raw.g.vcf --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.2.3.trio.raw.vcf

In the above commands, I didn't use Illumina's exome interval list that was used for targeting the exomes in the sequencing process.

As per this link "https://software.broadinstitute.org/gatk/documentation/article.php?id=4669", under the example section of GATK command lines, for exome sequencing the article suggests us to provide the exome targets using -L argument.

I have the following queries, as per the aforementioned article:
1) Should I provide the exome target list (-L argument) only while calling regular VCF file using Haplotype caller?
or
2) Should I provide the exome target list (-L argument) even while calling gVCF file using Haplotype caller?
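
For reference, the usual GATK guidance for exome data is to restrict both steps to the capture targets, padding the intervals at the calling step so variants near target edges are not lost. A minimal sketch, assuming the Illumina interval file is saved as nextera_targets.interval_list (a hypothetical name):

    # HaplotypeCaller in GVCF mode, restricted to the exome targets plus 100 bp padding
    java -Xmx16g -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R /GATK_bundle/hg19.fa -I sample1.sorted.dedup.recal.bam \
        --emitRefConfidence GVCF \
        -L nextera_targets.interval_list -ip 100 \
        -o sample1.raw.g.vcf

    # GenotypeGVCFs over the same target intervals
    java -Xmx16g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
        -R /GATK_bundle/hg19.fa \
        --variant sample1.raw.g.vcf --variant sample2.raw.g.vcf --variant sample3.raw.g.vcf \
        -L nextera_targets.interval_list \
        -o sample1.2.3.trio.raw.vcf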

Segfault when running GATK 3.6 in a container


I'm using GATK on the DNAnexus platform, which can convert Docker images to the ACI format in order to run them. I have a Docker image that uses GATK 3.6 to call variants, which runs fine under ordinary Docker, but which segfaults when run on DNAnexus using this converted container format.

The log for this error is attached. The key information is this:

#  SIGSEGV (0xb) at pc=0x00007f3980e9dce9, pid=17744, tid=0x00007f39a146a700
#
# JRE version: OpenJDK Runtime Environment (8.0_141-b15) (build 1.8.0_141-8u141-b15-1~deb9u1-b15)
# Java VM: OpenJDK 64-Bit Server VM (25.141-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libVectorLoglessPairHMM6177525404161670245.so+0x1bce9]  LoadTimeInitializer::LoadTimeInitializer()+0x1669

So it seems that the compiled native code in GATK is having trouble in this specific environment. How should I attempt to resolve this? Should I upgrade or downgrade Java? GATK? Ubuntu? System libraries? Unfortunately, upgrading GATK would be difficult because our workflow is accredited using GATK 3.6, but that might be possible if it is the only solution.
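
Given that the problematic frame is inside the native PairHMM library (libVectorLoglessPairHMM), one workaround worth testing (an assumption on my part, not a confirmed fix) is to make HaplotypeCaller fall back to the pure-Java PairHMM so the shared library is never loaded:

    # Sketch: avoid the native VectorLoglessPairHMM by selecting the Java implementation
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fa -I sample.bam \
        -pairHMM LOGLESS_CACHING \
        -o sample.vcf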

CRAM support in GATK 3.7 is broken


I have not been able to get GATK 3.7 HaplotypeCaller to work with CRAM files at all (it has a 100% failure rate so far with our whole genome CRAMs). Based on my analysis of the problem, I don't think GATK 3.7 will work with CRAMs made against any reference containing IUPAC ambiguity codes other than 'N' (which includes GRCh37/hs37d5 and GRCh38/HS38DH).

The error I get is:

ERROR   2017-01-05 02:18:59     Slice   Reference MD5 mismatch for slice 2:60825966-60861215, ATCTTTCATG...CTCTCCCATT
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: SAM/BAM/CRAM file /keep/46909b690725869e1d9bfbc1da4a1398+19932/20657_7.cram is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317 for more
##### ERROR ------------------------------------------------------------------------------------------

This error occurs for 100% of my CRAM files, which can be read by samtools, scramble, or previous versions of GATK (including 3.6) without any issues, so the error message is incorrect and the CRAM files are not malformed.

The CRAM slice in question is on chromosome 3 of hs37d5 (3:60825966-60861215). We can verify externally that the FASTA reference we are passing into GATK with -R does have the md5 that GATK reports it is expecting:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | md5sum
0e0ff678755616cba9ac362f15b851cc  -

And the sequence starts and ends with the bases that htsjdk reports:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c1-10
ATCTTTCATG
$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c35241-
CTCTCCCATT

I ended up having to recompile GATK and htsjdk from source and add some print debugging to htsjdk to dump the whole sequence from which the md5 was being calculated. It seems the sequences that cause problems are regions of the reference containing IUPAC ambiguity codes other than 'N' (in this case a slice of chromosome 3 that contains an 'M' and two 'R's). In GATK 3.7 (built with htsjdk 2.8.1), the reference used to calculate the md5 for the slice has had all ambiguity codes converted to 'N'. The md5 it calculates for this slice (according to my print debugging) is: 5d820b3624e78202f503796f7330d8d9

I have verified that this is the md5 we would get from converting the IUPAC codes in this slice to N's:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | tr RYMKWSBDHV NNNNNNNNNN | md5sum
5d820b3624e78202f503796f7330d8d9  -

I have tried in vain to figure out where in GATK and/or htsjdk the ambiguous reference bases are being converted to 'N's. I initially thought that it was in the CachingIndexedFastaSequenceFile call to BaseUtils.convertIUPACtoN (when preserveIUPAC is false, although I didn't find any code path that could set it to true). However, after recompiling with preserveIUPAC manually set to true, the problem persisted. I guess there must be some other place where the bases are remapped. I'll leave it to you guys to figure out how to get an unmodified view on the reference for htsjdk to use for CRAM decoding.

There is, however, no mystery as to why this problem has suddenly appeared in GATK 3.7. The slice md5 validation code in htsjdk was only added in July 2016 (https://github.com/samtools/htsjdk/commit/a781afa9597dcdbcde0020bfe464abee269b3b2e). The first release version it appears in is version 2.7.0. Prior to that, it seems CRAM slice md5's were not validated in htsjdk, so this error would not have occurred.
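
In case it helps others hitting this: since samtools decodes these CRAMs without complaint, a stopgap while the htsjdk validation is fixed is to convert back to BAM outside of GATK and call on that. A sketch using the file names above:

    # Decode the CRAM with samtools (which reads it fine) and index the resulting BAM
    samtools view -b -T hs37d5.fa -o 20657_7.bam 20657_7.cram
    samtools index 20657_7.bam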

VariantRecalibrator Error


My task:

task VariantRecalibratorSNPs {

  File Raw_VCF
  String CohortName
  String Chromosome
  File? Parallelization
  String? InbreedingCoeff

  Map[String, String] Paths
  Array[String] RuntimeParams

  command {
    ${Paths["java"]} -Xmx4G -jar ${Paths["gatk"]} \
      -T VariantRecalibrator \
      -R ${Paths["refFasta"]} \
      -input ${Raw_VCF} \
      -recalFile ${CohortName}_${Chromosome}_SNPs.recal \
      -tranchesFile ${CohortName}_${Chromosome}_SNPs.tranches \
      -nt 4 \
      -L ${default=Chromosome Parallelization} \
      -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${Paths["hapmap"]} \
      -resource:omni,known=false,training=true,truth=true,prior=12.0 ${Paths["omni"]} \
      -resource:1000G,known=false,training=true,truth=false,prior=10.0 ${Paths["1000G"]} \
      -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${Paths["dbsnp"]} \
      -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP ${default="" InbreedingCoeff} \
      -mode SNP
  }

  runtime {
    runtime_minutes: RuntimeParams[3]
    cpus: RuntimeParams[12]
    requested_memory_mb_per_core: RuntimeParams[21]
    queue: RuntimeParams[30]
  }

  output {
    File recal_SNPs_VCF = "${CohortName}_${Chromosome}_SNPs.recal"
    File tranches_SNPs_VCF = "${CohortName}_${Chromosome}_SNPs.tranches"
  }
}

GATK Error (full stderr attached):

##### ERROR --
##### ERROR stack trace
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Unable to retrieve result
        at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:190)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
Caused by: java.lang.IllegalArgumentException: No data found.
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:536)
        at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:191)
        at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.notifyTraversalDone(HierarchicalMicroScheduler.java:226)
        at org.broadinstitute.gatk.engine.executive.HierarchicalMicroScheduler.execute(HierarchicalMicroScheduler.java:183)
        ... 5 more
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Unable to retrieve result
##### ERROR ------------------------------------------------------------------------------------------

Interesting to note, my VariantRecalibrator for Indels works just fine:

task VariantRecalibratorIndels {

  File Raw_VCF
  String CohortName
  String Chromosome
  File? Parallelization
  String? InbreedingCoeff

  Map[String, String] Paths
  Array[String] RuntimeParams

  command {
    ${Paths["java"]} -Xmx4G -jar ${Paths["gatk"]} \
      -T VariantRecalibrator \
      -R ${Paths["refFasta"]} \
      -input ${Raw_VCF} \
      -recalFile ${CohortName}_${Chromosome}_Indels.recal \
      -tranchesFile ${CohortName}_${Chromosome}_Indels.tranches \
      -nt 4 \
      -L ${default=Chromosome Parallelization} \
      --maxGaussians 4 \
      -resource:mills,known=false,training=true,truth=true,prior=12.0 ${Paths["mills"]} \
      -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${Paths["dbsnp"]} \
      -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum ${default="" InbreedingCoeff} \
      -mode INDEL
  }

  runtime {
    runtime_minutes: RuntimeParams[4]
    cpus: RuntimeParams[13]
    requested_memory_mb_per_core: RuntimeParams[22]
    queue: RuntimeParams[31]
  }

  output {
    File recal_Indels_VCF = "${CohortName}_${Chromosome}_Indels.recal"
    File tranches_Indels_VCF = "${CohortName}_${Chromosome}_Indels.tranches"
  }
}

I am running all the most recent versions. Attaching my entire script for reference.

Thanks a lot,

Alon
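
A note for anyone hitting the same "No data found" from VariantRecalibratorEngine.generateModel: it typically means too few variants were available to fit the SNP Gaussian mixture model, which is easy to trigger when recalibrating per chromosome. A sketch of the commonly suggested first mitigation, lowering --maxGaussians in SNP mode just as the INDEL task above already does (placeholder paths; something to try, not a guaranteed fix):

    # Hypothetical standalone equivalent of the SNP task with a simpler model
    java -Xmx4G -jar GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R ref.fasta -input cohort_chr1.vcf \
        -recalFile cohort_chr1_SNPs.recal \
        -tranchesFile cohort_chr1_SNPs.tranches \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
        --maxGaussians 4 \
        -mode SNP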

GATK4 StrandArtifact


Hi,

Can you offer any guidance on how the StrandArtifact values are calculated in GATK4? Is there any recommendation on how they should be used?

I am specifically looking at the SA_MAP_AF and SA_POST_PROB values for Mutect2.

Thanks.

How should I pre-process data from multiplexed sequencing and multi-library designs?


Our Best Practices pre-processing documentation assumes a simple experimental design in which you have one set of input sequence files (forward/reverse or interleaved FASTQ, or unmapped uBAM) per sample, and you run each step of the pre-processing workflow separately for each sample, resulting in one BAM file per sample at the end of this phase.

However, if you are generating multiple libraries for each sample, and/or multiplexing samples within and/or across sequencing lanes, the data must be de-multiplexed before pre-processing, typically resulting in multiple sets of FASTQ files per sample, all of which should have distinct read group IDs (RGID).

At that point there are several different valid strategies for implementing the pre-processing workflow. Here at the Broad Institute, we run the initial steps of the pre-processing workflow (mapping, sorting and marking duplicates) separately on each individual read group. Then we merge the data to produce a single BAM file for each sample (aggregation); this is done by re-running Mark Duplicates, this time on all read group BAM files for a sample at the same time. Then we run Indel Realignment and Base Recalibration on the aggregated per-sample BAM files. See the worked-out example below and this presentation for more details.

Note that there are many possible ways to achieve a similar result; here we present the way we think gives the best combination of efficiency and quality. This assumes that you are dealing with one or more samples, and each of them was sequenced on one or more lanes.

Example

Let's say we have this example data (assuming interleaved FASTQs containing both forward and reverse reads) for two sample libraries, sampleA and sampleB, which were each sequenced on two lanes, lane1 and lane2:

  • sampleA_lane1.fq
  • sampleA_lane2.fq
  • sampleB_lane1.fq
  • sampleB_lane2.fq

These will each be identified as separate read groups A1, A2, B1 and B2. If we had multiple libraries per sample, we would further distinguish them (e.g. sampleA_lib1_lane1.fq leading to read group A11, sampleA_lib2_lane1.fq leading to read group A21, and so on).

1. Run initial steps per-readgroup once

Assuming that you received one FASTQ file per sample library, per lane of sequence data (which amounts to a read group), run each file through mapping and sorting. During the mapping step you assign read group information, which will be very important in the next steps so be sure to do it correctly. See the read groups dictionary entry for guidance.
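
An illustrative sketch of that mapping step for one read group, with placeholder file names and read group fields (adapt these to your own data):

    # Map one interleaved FASTQ (one read group), assigning read group info at alignment time;
    # -p tells bwa mem the input is interleaved paired-end data
    bwa mem -M -p \
        -R '@RG\tID:A1\tSM:sampleA\tLB:sampleA_lib1\tPL:ILLUMINA\tPU:lane1' \
        ref.fa sampleA_lane1.fq \
        | samtools sort -o sampleA_rgA1.bam -
    samtools index sampleA_rgA1.bam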

The example data becomes:

  • sampleA_rgA1.bam
  • sampleA_rgA2.bam
  • sampleB_rgB1.bam
  • sampleB_rgB2.bam

At this point we mark duplicates in each read group BAM file (dedup), which allows us to estimate the complexity of the corresponding library of origin as a quality control step. This step is optional.

The example data becomes:

  • sampleA_rgA1.dedup.bam
  • sampleA_rgA2.dedup.bam
  • sampleB_rgB1.dedup.bam
  • sampleB_rgB2.dedup.bam

Technically this first run of marking duplicates is not necessary because we will run it again per-sample, and that per-sample marking would be enough to achieve the desired result. To reiterate, we only do this round of marking duplicates for QC purposes.
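
To make the QC purpose concrete: the MarkDuplicates metrics file carries the library complexity estimate in its ESTIMATED_LIBRARY_SIZE column. A per-readgroup sketch with placeholder names:

    # Mark duplicates in one read group, mainly to obtain the complexity estimate
    java -jar picard.jar MarkDuplicates \
        INPUT=sampleA_rgA1.bam \
        OUTPUT=sampleA_rgA1.dedup.bam \
        METRICS_FILE=sampleA_rgA1.dup_metrics.txt
    # The metrics table's header row starts with LIBRARY; the row beneath it
    # includes ESTIMATED_LIBRARY_SIZE
    grep -A 1 '^LIBRARY' sampleA_rgA1.dup_metrics.txt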

2. Merge read groups and mark duplicates per sample (aggregation + dedup)

Once you have pre-processed each read group individually, you merge read groups belonging to the same sample into a single BAM file. You can do this as a standalone step, but for the sake of efficiency we combine it with the per-sample duplicate marking step (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command, as shown below).
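
A sketch of that combined merge-and-dedup call for sampleA (placeholder names; MarkDuplicates accepts INPUT multiple times):

    # Merging and duplicate marking in one step: pass all read group BAMs for the sample
    java -jar picard.jar MarkDuplicates \
        INPUT=sampleA_rgA1.bam \
        INPUT=sampleA_rgA2.bam \
        OUTPUT=sampleA.merged.dedup.bam \
        METRICS_FILE=sampleA.dup_metrics.txt \
        CREATE_INDEX=true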

The example data becomes:

  • sampleA.merged.dedup.bam
  • sampleB.merged.dedup.bam

To be clear, this is the round of marking duplicates that matters. It eliminates PCR duplicates (arising from library preparation) across all lanes in addition to optical duplicates (which are by definition only per-lane).

3. Remaining per-sample pre-processing

Then you run indel realignment (optional) and base recalibration (BQSR).

The example data becomes:

  • sampleA.merged.dedup.(realn).recal.bam
  • sampleB.merged.dedup.(realn).recal.bam

Realigning around indels per-sample leads to consistent alignments across all lanes within a sample. This step is only necessary if you will be using a locus-based variant caller like MuTect 1 or UnifiedGenotyper (for legacy reasons). If you will be using HaplotypeCaller or MuTect2, you do not need to perform indel realignment.

Base recalibration will be applied per-read group if you assigned appropriate read group information in your data. BaseRecalibrator distinguishes read groups by RGID, or RGPU if it is available (PU takes precedence over ID). This will identify separate read groups (distinguishing both lanes and libraries) as such even if they are in the same BAM file, and it will always process them separately -- as long as the read groups are identified correctly of course. There would be no sense in trying to recalibrate across lanes, since the purpose of this processing step is to compensate for the errors made by the machine during sequencing, and the lane is the base unit of the sequencing machine (assuming the equipment is Illumina HiSeq or similar technology).
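
A sketch of the per-sample BQSR step with GATK3 (placeholder resource paths); BaseRecalibrator performs the per-read-group separation described above internally, so you simply pass the aggregated BAM:

    # Build the recalibration model; covariates are computed per read group automatically
    java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
        -R ref.fa -I sampleA.merged.dedup.bam \
        -knownSites dbsnp.vcf \
        -o sampleA.recal.table
    # Write the recalibrated BAM
    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -R ref.fa -I sampleA.merged.dedup.bam \
        -BQSR sampleA.recal.table \
        -o sampleA.merged.dedup.recal.bam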

People often ask also if it's worth the trouble to try realigning across all samples in a cohort. The answer is almost always no, unless you have very shallow coverage. The problem is that while it would be lovely to ensure consistent alignments around indels across all samples, the computational cost gets too ridiculous too fast. That being said, for contrastive calling projects -- such as cancer tumor/normals -- we do recommend realigning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.


Error from Genome STRiP's SVPreprocess


Hi, I'm using Genome STRiP to call SVs in pigs. I have 30 BAMs, and I followed the documentation to prepare the reference file, genomeMaskFile, copyNumberMaskFile and ploidyMapFile. But when I run SVPreprocess I get the following three errors:

INFO 01:50:01,554 QGraph - Writing incremental jobs reports...
INFO 01:50:01,565 QGraph - 578 Pend, 0 Run, 3 Fail, 152 Done
INFO 01:50:01,572 QCommandLine - Writing final jobs report...
INFO 01:50:01,572 QCommandLine - Done with errors
INFO 01:50:01,587 QGraph - -------
INFO 01:50:01,588 QGraph - Failed: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/WORK/cau_jfliu_1/SVmap/genomestrip/test/tmpdir' '-cp' '/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/SVToolkit.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/Queue.jar' '-cp' '/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/SVToolkit.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/Queue.jar' 'org.broadinstitute.sv.apps.IndexReadCountFile' '-I' '/WORK/cau_jfliu_1/SVmap/genomestrip/test/test1/metadata/rccache/20161017.recal.rc.bin' '-O' '/WORK/cau_jfliu_1/SVmap/genomestrip/test/test1/metadata/rccache/20161017.recal.rc.bin.idx'
INFO 01:50:01,588 QGraph - Log: /WORK/cau_jfliu_1/SVmap/genomestrip/test/test1/logs/SVPreprocess-166.out
INFO 01:50:01,588 QGraph - -------
INFO 01:50:01,589 QGraph - Failed: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/WORK/cau_jfliu_1/SVmap/genomestrip/test/tmpdir' '-cp' '/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/SVToolkit.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/Queue.jar' org.broadinstitute.sv.main.SVCommandLine '-T' 'ComputeReadCountsWalker' '-R' '/WORK/cau_jfliu_1/SVmap/reference_metadata_bundle/test/pig.fa' '-I' '/WORK/cau_jfliu_1/SVmap/RC/20160995/20160995.recal.bam' '-O' '/WORK/cau_jfliu_1/SVmap/genomestrip/test/test1/metadata/rccache/20160995.recal.rc.bin' '-disableGATKTraversal' 'true' '-md' 'test1/metadata' '-configFile' '/WORK/cau_jfliu_1/SVmap/software/svtoolkit/conf/genstrip_parameters.txt' '-P' 'chimerism.use.correction:false' '-insertSizeRadius' '10.0'
INFO 01:50:01,589 QGraph - Log: /WORK/cau_jfliu_1/SVmap/genomestrip/test/test1/logs/SVPreprocess-171.out
INFO 01:50:01,589 QGraph - -------
INFO 01:50:01,589 QGraph - Failed: 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/WORK/cau_jfliu_1/SVmap/genomestrip/test/tmpdir' '-cp' '/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/SVToolkit.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/Queue.jar' '-cp' '/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/SVToolkit.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/GenomeAnalysisTK.jar:/WORK/cau_jfliu_1/SVmap/software/svtoolkit/lib/gatk/Queue.jar' 'org.broadinstitute.sv.apps.ComputeGCProfiles' '-O' '/WORK/cau_jfliu_1/SVmap/genomestrip/test/test1/metadata/gcprofile/reference.gcprof.zip' '-R' '/WORK/cau_jfliu_1/SVmap/reference_metadata_bundle/test/pig.fa' '-md' 'test1/metadata' '-writeReferenceProfile' 'true' '-genomeMaskFile' '/WORK/cau_jfliu_1/SVmap/reference_metadata_bundle/test/pig.svmask.75.fa' '-copyNumberMaskFile' '/WORK/cau_jfliu_1/SVmap/reference_metadata_bundle/test/pig.gcmask.fasta' '-configFile' '/WORK/cau_jfliu_1/SVmap/software/svtoolkit/conf/genstrip_parameters.txt' '-P' 'chimerism.use.correction:false'
INFO 01:50:01,589 QGraph - Log: /WORK/cau_jfliu_1/SVmap/genomestrip/test/test1/logs/SVPreprocess-6.out
INFO 01:50:01,590 QCommandLine - Script failed: 578 Pend, 0 Run, 3 Fail, 152 Done

But when I check the log files SVPreprocess-166.out, SVPreprocess-171.out and SVPreprocess-6.out, no errors are reported in any of them.

The scripts list as follows:

    #!/bin/bash
    export SV_DIR=/WORK/cau_jfliu_1/SVmap/software/svtoolkit
    SV_TMPDIR=./tmpdir
    runDir=test1
    export PATH=${SV_DIR}/bwa:${PATH}
    export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}
    mx="-Xmx4g"
    classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar"
    mkdir -p ${runDir}/logs || exit 1
    mkdir -p ${runDir}/metadata || exit 1
    # Display version information.
    java -cp ${classpath} ${mx} -jar ${SV_DIR}/lib/SVToolkit.jar

    # Run preprocessing.
    # For large scale use, you should use -reduceInsertSizeDistributions, but this is too slow for the installation test.
    # The method employed by -computeGCProfiles requires a GC mask and is currently only supported for human genomes.
    java -cp ${classpath} ${mx} \
        org.broadinstitute.gatk.queue.QCommandLine \
        -S ${SV_DIR}/qscript/SVPreprocess.q \
        -S ${SV_DIR}/qscript/SVQScript.q \
        -gatk ${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar \
        --disableJobReport \
        -cp ${classpath} \
        -configFile ${SV_DIR}/conf/genstrip_parameters.txt \
        -tempDir ${SV_TMPDIR} \
        -R /WORK/cau_jfliu_1/SVmap/reference_metadata_bundle/test/pig.fa \
        -genomeMaskFile /WORK/cau_jfliu_1/SVmap/reference_metadata_bundle/test/pig.svmask.75.fa \
        -copyNumberMaskFile /WORK/cau_jfliu_1/SVmap/reference_metadata_bundle/test/pig.gcmask.fasta \
        -ploidyMapFile /WORK/cau_jfliu_1/SVmap/reference_metadata_bundle/test/pig.ploidymap.txt \
        -runDirectory ${runDir} \
        -md ${runDir}/metadata \
        -disableGATKTraversal \
        -useMultiStep \
        -reduceInsertSizeDistributions false \
        -computeReadCounts true \
        -jobLogDir ${runDir}/logs \
        -I /WORK/cau_jfliu_1/SVmap/genomestrip/test/file.list \
        -P chimerism.use.correction:false \
        -run \
        || exit 1

It also seems I can't post the detailed log files; is there any other way to share the log information?
Thank you!

VQSR: low TiTv


Hi

I'm trying out VQSR on a batch of 16 human whole genomes (~25-30x). I was wondering if someone could review the profiles below. It seems the false-positive rate is much higher than in the GATK examples.

Has anyone else experienced similar results? Any possible solutions?

Here are the commands used with GATK-3.7.0:

#Build the SNP recalibration model

    /share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
        -input "$seqId"_variants.vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /state/partition1/db/human/gatk/2.8/b37/hapmap_3.3.b37.vcf \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 /state/partition1/db/human/gatk/2.8/b37/1000G_omni2.5.b37.vcf \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 /state/partition1/db/human/gatk/2.8/b37/1000G_phase1.snps.high_confidence.b37.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /state/partition1/db/human/gatk/2.8/b37/dbsnp_138.b37.vcf \
        -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff \
        -mode SNP \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
        -recalFile "$seqId"_SNP.recal \
        -tranchesFile "$seqId"_SNP.tranches \
        -rscriptFile "$seqId"_SNP_plots.R \
        -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
        -nt 12 \
        -ped "$seqId"_pedigree.ped \
        -dt NONE

#Apply the desired level of recalibration to the SNPs in the call set

    /share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
        -input "$seqId"_variants.vcf \
        -mode SNP \
        --ts_filter_level 99.0 \
        -recalFile "$seqId"_SNP.recal \
        -tranchesFile "$seqId"_SNP.tranches \
        -o "$seqId"_recalibrated_snps_raw_indels.vcf \
        -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
        -nt 12 \
        -ped "$seqId"_pedigree.ped \
        -dt NONE

#Build the Indel recalibration model

    /share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
        -T VariantRecalibrator \
        -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
        -input "$seqId"_recalibrated_snps_raw_indels.vcf \
        -resource:mills,known=false,training=true,truth=true,prior=12.0 /state/partition1/db/human/gatk/2.8/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /state/partition1/db/human/gatk/2.8/b37/dbsnp_138.b37.vcf \
        -an DP -an QD -an FS -an SOR -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff \
        -mode INDEL \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
        --maxGaussians 4 \
        -recalFile "$seqId"_INDEL.recal \
        -tranchesFile "$seqId"_INDEL.tranches \
        -rscriptFile "$seqId"_INDEL_plots.R \
        -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
        -nt 12 \
        -ped "$seqId"_pedigree.ped \
        -dt NONE

#Apply the desired level of recalibration to the Indels in the call set

    /share/apps/jre-distros/jre1.8.0_101/bin/java -Djava.io.tmpdir=/state/partition1/tmpdir -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -jar /share/apps/GATK-distros/GATK_3.7.0/GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R /state/partition1/db/human/gatk/2.8/b37/human_g1k_v37.fasta \
        -input "$seqId"_recalibrated_snps_raw_indels.vcf \
        -mode INDEL \
        --ts_filter_level 99.0 \
        -recalFile "$seqId"_INDEL.recal \
        -tranchesFile "$seqId"_INDEL.tranches \
        -o "$seqId"_recalibrated_variants.vcf \
        -L 1 -L 2 -L 3 -L 4 -L 5 -L 6 -L 7 -L 8 -L 9 -L 10 -L 11 -L 12 -L 13 -L 14 -L 15 -L 16 -L 17 -L 18 -L 19 -L 20 -L 21 -L 22 -L X -L Y -L MT \
        -nt 12 \
        -ped "$seqId"_pedigree.ped \
        -dt NONE

MarkDuplicates out of Memory


Dear all,

I have been using the Picard MarkDuplicates tool to mark duplicate reads, and it uses a huge amount of memory compared to what GATK uses in the subsequent steps.

I have used the below command:

java -Xmx128G -jar /appl/bio/picard/picard-2.6.0/picard/build/libs/picard.jar MarkDuplicates INPUT=alignment_sorted.bam OUTPUT=alignment_duprem.bam CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT QUIET=true AS=true METRICS_FILE=metrics VERBOSITY=WARNING

which used ~50GB of memory to complete the run. I tried to reduce the memory consumption by adding MAX_RECORDS_IN_RAM=50000 to the above command, but it used the same amount of memory as before.

I searched different forums but did not find a solution; perhaps I am missing some discussions. Could someone shed some light on this issue and how to decrease the memory consumption?
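
One detail worth checking (an observation, not a guaranteed diagnosis): -Xmx sets the ceiling the JVM heap is allowed to grow to, and the JVM will often expand toward that ceiling before collecting garbage, so with -Xmx128G the process can appear to use ~50GB regardless of MAX_RECORDS_IN_RAM. A sketch with a tighter cap:

    # Cap the JVM heap so the process cannot balloon toward 128 GB
    java -Xmx16G -jar /appl/bio/picard/picard-2.6.0/picard/build/libs/picard.jar MarkDuplicates \
        INPUT=alignment_sorted.bam OUTPUT=alignment_duprem.bam \
        METRICS_FILE=metrics CREATE_INDEX=true \
        VALIDATION_STRINGENCY=LENIENT QUIET=true AS=true VERBOSITY=WARNING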

A dog, a ship and an algorithm that measures relatedness


What is Beagle?

Beagle is a type of dog known for its even temper and intelligence. It is also the name given to the ship Darwin sailed to the Galapagos (the H.M.S. Beagle), where he developed his theory of natural selection from observing finches. It is also the name of a genomics software package known for phasing and imputing genotypes. Beagle also calls genotypes and detects identity-by-descent (IBD), i.e. it can find segments of identical DNA that indicate two individuals are related.

I will be writing a series of posts where I share with you how I take 23andMe raw data to locate IBD segments using Beagle v4.1 (website; doi:10.1534/genetics.113.150029). For a review of the statistical methods and other theory underlying IBD, see doi:10.1534/genetics.112.148825. To see a skipper’s dog on a ship at sea, watch Irving Johnson’s footage of the Peking barque.


Why a GATK blog series that is mostly not about GATK?

I became interested in genomics after doing my 23andMe in 2013. I am in this field as a newb--some might say upstart. My team is giving me the freedom to write this series as a twenty percent project, as a reward for my hard work.

It appears to me that we are at the start of an era of genomics, as evidenced by the Precision Medicine Initiative. The initiative’s All of Us program will begin to recruit study participants this year (anyone in the U.S. can enroll) and aims to build a genomics resource that reflects the diversity of the U.S. population (doi:10.1038/ncomms7596). I became aware of this project when I attended a Town Hall meeting a week ago, because other teams under the same umbrella as the GATK team work on it.

I think this series of posts is useful for two reasons.

First, GATK and Picard are not the only programs that offer useful tools in genomic analyses. There are a multitude of bioinformatics software tools. Each can be said to occupy some niche in the ecosystem of tools that help us understand genomics. These versioned software programs enable reproducible data transformations. By experimenting on projects like this one, I hope to become more aware of the available tools. By focusing on one specific question, I can see how the tools fit together for a particular use case. Through this exercise, my perspective of the field will widen.

Second, the steps that I will go through to preprocess data will illustrate some considerations when preparing genotype data for analyses. If you haven't noticed from our forum, or experienced first-hand from error messages, Picard and GATK are sticklers for adherence to the SAM and VCF specifications. This is to ensure compatibility of data across the many genomics applications. This upfront cost is justified: it prevents the pain that unconventional formats, even in experienced hands, tend to spawn.


Let’s go over some background.

Genotyping data is not just about calling two alleles for a diploid genomic site, aka locus. We can also capture phasing information, i.e. which set of alleles are on one chromosome molecule, or haplotype, and which are on the other. For example, you might have an AT genotype for site 1 and a GC for site 2. If we know that A and C are from mom and T and G are from dad, then we know the phasing of these two genotypes in relation to each other. I illustrate this concept in a diagram later.

Extend phasing to more loci, and what we get are blocks of sequence with a pattern of variant alleles that travel together, which we refer to as a linkage disequilibrium (LD) block. These blocks can be of variable size because they can be broken up by meiotic recombination [1], which happens on average roughly twice per chromosome. To phase short-read sequence and SNP array data, we impute an individual's LD blocks using known population LD blocks. The alternative approach is to physically phase variants using expensive technology that sequences long molecules, e.g. from PacBio, or to incorporate long-range phasing information, e.g. using 10X Genomics technology or algorithms such as phASER with RNA splice isoforms (doi:10.1101/039529).

Similarity and relatedness are two different concepts. You may have heard each of the following.

  1. Humans are 98.8% similar to chimpanzees and 84% similar to dogs.
  2. Humans are 99.9% genetically identical.
  3. Humans share 99.5% DNA with any other human.
  4. Siblings share between 83.81% and 87.47% of SNPs.
  5. Siblings are on average 50% identical in DNA to each other.

The first two instances compare the entirety of syntenic genomic sequence. The third incorporates copy number variants (CNVs). The fourth compares only genomic positions that are commonly variant in humans and ignores nonvariant DNA. The last refers to shared LD blocks.

As we decrease the degrees of relatedness between family members, the expected average percent of identical DNA segments drops precipitously, by halves. My sister Sue and I expect ~50% identical DNA (half-identical on 50% and fully identical on 25%). However, my 23andMe cousin Susan, who lives in Arkansas and who connected with me via 23andMe in 2013, only shares 0.11% of her DNA with me, stemming from a single half-identical segment on chromosome 5 [2]. Beagle and algorithms like it enable us to identify and score such segments of potential shared haplotypes given unphased genotype data and a phased reference population resource. The algorithms impute phases using this resource.

What these algorithms do not capture is how Susan, Sue and I will share many alleles at sites of common variation across our genomes due to shared ethnicity. I want to clarify that despite the similarity of our names (my name is Soo Hee, pronounced So-hee), we are each real and distinct individuals.


The 23andMe data that I will use in this series of posts is from my sister Sue and myself. We are clearly related, not just by the happenstance of our births but by 54.7% shared DNA across 44 segments according to 23andMe. Given this, the results of the IBD analysis will be unsurprising, and they serve as a control for how well we can impute phasing information using a public population resource file that, to be clear, does not represent our Korean ethnic background. This resource is the 1000 Genomes Project phase 3 genotypes (doi:10.1038/nature15393), which I will refer to as 1KGP. Of the resource's 2504 individuals, who are all forward-thinking and generous volunteers, approximately 500 are East Asian. This supergroup includes ~300 Chinese, ~100 Vietnamese and ~100 Japanese, but no Koreans. Phasing confidence improves when the reference panel is ethnically matched to the individual, so ideally we would want to match Sue's and my data against a panel that includes Koreans. I should mention that the 1000 Genomes Project is continuing to fill gaps in population representation via the International Genome Sample Resource (IGSR).

Our expectation is that Beagle's Refined IBD approach (doi:10.1534/genetics.113.150029) using the 1KGP resource will be robust to our genetic heterogeneity. I'm told there are better tools out there, e.g. IBDSeq, but that these require a reference panel from the same population as the analyzed samples. I did ask around whether phased data representing my people exists, but word is that none would be as carefully phased and reliable as the 1KGP resource.

To illustrate my last statement, let me digress a bit. Sue is getting married tomorrow, on April 8th, to Dave. So first I'd like to congratulate them. Sue and Dave's ethnic origins are far apart--9,241 kilometers to be exact, or a ~10 hour flight. Besides lending support to HLA-mediated attraction, their marriage has me asking this: what reference resource should we use to impute phasing for samples from Sue and Dave's future children, who would be admixed hapas?


Another reason to take this journey is to flex our file processing muscles. To start, posts will focus on specific pre-processing aims, and I use a variety of tools, both GATK and non-GATK, to achieve these. My selection criterion is that of a satisficer, i.e. I settle on the first tool or approach that achieves what I need. Given this, there may be more efficient ways to process the data, and hopefully those in the know who are reading will share them.


I’m curious.

I'm curious to find out how one of our field's better IBD algorithms performs using a reference resource that underrepresents the ethnic background of the samples. I am a minority in this country; my parents immigrated when I was a baby. Sue was born while we lived in Fargo, North Dakota. Growing up in the Midwest and in Washington state, I have gotten used to the majority of faces not looking like me. What I have learned is that, to enable well-powered genomics studies whose results will be applicable to me, my ethnicity needs to be represented in programs like All of Us out of proportion to its small share of the population. I want my genetics represented because I want to benefit from the findings of genomics studies towards a longer and healthier life.

Gleaning equally useful information from studies as those in the ethnic majority means that those of us in the minority, including those of us who are African, Hispanic, mixed and Native Americans, need to be overrepresented. I know there are other efforts, e.g. Genome Asia 100K, which plans for 100,000 Asian genomes. However, consider context. Genome Asia 100K represents diversity within Asia, where populations are relatively homogeneous. The resource, and findings from it, cannot, because of their population context, apply widely to admixed populations such as that of America. As I explained earlier, context matters for IBD imputation: for Sue's and my ethnically homogeneous samples, the IBDSeq algorithm with a homogeneous reference panel would outperform Beagle with the diverse reference panel.

Why is IBD analysis important to research? For one, validating pedigrees by measuring relatedness is an integral quality control step in population genomics studies involving extended families. Second, it can help genetic linkage mapping. For example, haplotype phase inference can help fill in missing genotypes. Third, it can help decrease bias. We see this application in the routine removal of closely related samples in population reference resources.

We can learn the importance of representing diversity from Darwin's Dogs. Last I heard, at the Broad company retreat in December 2016, the project was only interested in sequencing mutts. You have to fill out a 100-question survey, and they will let you know if they are interested in sequencing your dog's genome. One thing they are interested in is associating behavior, e.g. OCD in dogs, with complex variation, e.g. multiple common variants. Sue and Dave have signed up Pepper, who is of indeterminate mixed breed.

From where I’m standing, the contribution of genomics towards understanding disease appears so far to have come from (i) rare de novo mutations and (ii) isolated founder effects in extended but small families that control for genetic backdrop. Darwin’s Dogs explains a third approach in their blog. They are enabled to take this third approach in their studies because of findings in a 2016 article titled Complex disease and phenotype mapping in the domestic dog (doi:10.1038/ncomms10460). The key enabling factor is knowing the number of dogs to power an analysis, e.g. either 400 dogs of a single breed (200 with and 200 without the trait) or 1000 dogs of many breeds (500 with and 500 without). I would surmise that the latter scenario would yield qualitatively different results that would be more applicable to dogs in general.


[1] The average frequency of recombination events across a chromosome is captured in a metric geneticists call the centiMorgan.
[2] The algorithm 23andMe uses has a 7 cM (centiMorgan) and 700 SNPs threshold for relatedness to ensure high quality matches, so it is highly likely we share common recent ancestors.

Picard MarkDuplicates Error: Value was put into PairInfoMap more than once


Dear all,
I got an error when I used Picard tools to mark duplicates.
Picard tools version: 2.8.3
Sample: paired-end (PE), one lane
Pipeline: BWA + GATK

Index the reference

bwa index /disk/BGI_jiangzy_humilis/BGI_j_humilis.fa
java -jar /disk/share/picard-tools-2.8.3/picard.jar CreateSequenceDictionary REFERENCE=/disk/BGI_jiangzy_humilis/BGI_j_humilis.fa OUTPUT=/disk/BGI_jiangzy_humilis/BGI_j_humilis.dict
samtools faidx /disk/BGI_jiangzy_humilis/BGI_j_humilis.fa

mapping

bwa mem -t 20 -M -R "@RG\tID:Ph01111\tLB:Ph01111\tPL:Illumina\tPU:Ph01111\tSM:Ph01111" /disk/BGI_jiangzy_humilis/BGI_j_humilis.fa /disk/gtdata/jiangzy_mydata/test/11_1_clean.fq.gz /disk/gtdata/jiangzy_mydata/test/11_2_clean.fq.gz | gzip > /disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping.sam.gz

reorder

java -jar /disk/share/picard-tools-2.8.3/picard.jar ReorderSam REFERENCE=/disk/BGI_jiangzy_humilis/BGI_j_humilis.fa INPUT=/disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping.sam.gz OUTPUT=/disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping_Reorder.sam.gz VALIDATION_STRINGENCY=LENIENT

SAM to BAM

samtools view -bS /disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping_Reorder.sam.gz -o /disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping_Reorder.bam

sort

java -jar /disk/share/picard-tools-2.8.3/picard.jar SortSam I=/disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping_Reorder.bam O=/disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping_Reorder_sort.bam SORT_ORDER=coordinate

MarkDuplicates

java -jar /disk/share/picard-tools-2.8.3/picard.jar MarkDuplicates REMOVE_DUPLICATES=false INPUT=/disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping_Reorder_sort.bam OUTPUT=/disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping_Reorder_sort_MD.bam METRICS_FILE=marked_dup_metrics.txt CREATE_INDEX=true
Error:
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once.
1: Ph01111:HWI-ST1307:159:C48TVACXX:7:1109:1787:63474

test

ValidateSamFile

java -jar /disk/share/picard-tools-2.8.3/picard.jar ValidateSamFile I=/disk/gtdata/jiangzy_mydata/test/mapping/新建文件夹/Ph01111_mapping_Reorder_sort.bam MODE=SUMMARY
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: HWI-ST1307:159:C48TVACXX:7:1109:1787:63474
at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at htsjdk.samtools.SamFileValidator$CoordinateSortedPairEndInfoMap.remove(SamFileValidator.java:765)
at htsjdk.samtools.SamFileValidator.validateMateFields(SamFileValidator.java:499)
at htsjdk.samtools.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:297)
at htsjdk.samtools.SamFileValidator.validateSamFile(SamFileValidator.java:215)
at htsjdk.samtools.SamFileValidator.validateSamFileSummary(SamFileValidator.java:143)
at picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:196)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

FixMateInformation

java -jar /disk/share/picard-tools-2.8.3/picard.jar FixMateInformation I=/disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping_Reorder.bam O=/disk/gtdata/jiangzy_mydata/test/mapping/Ph01111_mapping_Reorder_Fixed.bam

Error
Exception in thread "main" htsjdk.samtools.SAMException: Found two records that are paired, not supplementary, and first of the pair
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.advance(SamPairUtil.java:453)
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:499)
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:388)
at picard.sam.FixMateInformation.doWork(FixMateInformation.java:206)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

Thank you all very much for your help.
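
This exception generally indicates that the same read name occurs more times than expected in the BAM: for paired-end data, each QNAME should appear exactly twice among primary, non-supplementary records, and duplicated input (e.g. the same reads aligned twice and concatenated) breaks that invariant. A diagnostic sketch to confirm this before re-running anything:

    # Count primary, non-supplementary records per read name; counts above 2 are suspect
    # (-F 0x900 filters out secondary (0x100) and supplementary (0x800) alignments)
    samtools view -F 0x900 Ph01111_mapping_Reorder_sort.bam \
        | cut -f1 | sort | uniq -c | awk '$1 > 2' | head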
