
Does FilterByOrientationBias consider normal samples?


FilterByOrientationBias takes the output of CollectSequencingArtifactMetrics to do somatic variant filtering. Its manual says:

CollectSequencingArtifactMetrics should be run for both the normal sample and the tumor sample, if the matched normal is available.

But the example command-line shown in the manual is:

 gatk-launch --javaOptions "-Xmx4g" FilterByOrientationBias \
   --artifactModes 'G/T' \
   -V tumor_unfiltered.vcf.gz \
   -P tumor.pre_adapter_detail_metrics \
   --output oxog_filtered.vcf.gz

The inputs only involve the tumor sample. Do I really need to run CollectSequencingArtifactMetrics on the matched normal sample? If so, how should I use its output in FilterByOrientationBias?

Thanks.


Simple explanation of MarkDuplicates


I am having a hard time understanding how MarkDuplicates works. The MarkDuplicates documentation describes it as follows: “The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.” I don’t understand what “5 prime positions” means in this statement. Also, what does it mean in the context of “of both reads and read-pairs”? If you could explain this to me with an example, I would really appreciate it.

StrandArtifact fields missing for the normal sample in Mutect2 output


I used the latest version of Mutect2 to call somatic variants by comparing a tumor sample against a normal sample. The fields are inconsistent between the "FORMAT", "NORMAL", and "TUMOR" columns. Taking one variant as an example:

1 3009350 . T TTG . . DP=59;ECNT=1;NLOD=2.46;N_ART_LOD=-9.935e-01;POP_AF=1.000e-05;RPA=26,27;RU=TG;STR;TLOD=10.24 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/0:8,0:0.409:5,0:3,0:0:353,0:0:0 0/1:3,5:0.565:2,3:1,2:28:221,218:60:5:0.616,0.566,0.625:0.018,0.044,0.938

The last two fields, "SA_MAP_AF" and "SA_POST_PROB", were calculated only for the tumor sample (last column) but are missing for the normal sample (10th column), which results in a discrepancy between the FORMAT column and the normal sample column. Is it true that Mutect2 only reports them for the tumor sample? If so, it would be best for Mutect2 to print missing values (e.g. a period) to fill the last two fields for the normal sample.

BQSR in GATK 4.0


Hi,

Thanks first for such a great tool! I have a question about BQSR in GATK 4.0 Best Practices.

In 3.8, PrintReads supports applying a covariates table file (with --BQSR) produced by the BaseRecalibrator tool, along with additional read-filtering options, to prepare an analysis-ready BAM file, as stated in https://gatkforums.broadinstitute.org/gatk/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr#latest

In 4.0, does ApplyBQSR replace PrintReads for the recalibration application step? I see that PrintReads in v4.0 no longer supports the --BQSR flag and that ApplyBQSR supports the --read-filter flag, yet it is not very clear from the v4.0 Best Practices documentation that ApplyBQSR is the replacement. Please advise.

Thanks for your time
Eugene

FastqToSam "No value found for tagged argument" for ISO 8601 --RUN_DATE parameter


I am using FastqToSam to convert paired FQ files to uBAM as follows:

gatk --java-options {java_opt} FastqToSam \
            --FASTQ={R1} \
            --FASTQ2={R2} \
            --OUTPUT={output} \
            --READ_GROUP_NAME={RGID} \
            --PLATFORM_UNIT={RGPU} \
            --SAMPLE_NAME={RGSM} \
            --PLATFORM={RGPL} \
            --LIBRARY_NAME={RGLB} \
            --SEQUENCING_CENTER={RGCN} \
            --RUN_DATE={RGDT} \
            --SORT_ORDER=queryname

When I include the --RUN_DATE parameter (in my test I tried 2011-04-30T01:00:00+0100), I get the following error message: No value found for tagged argument: RUN_DATE=2011-04-30T01:00:00+0100. I have confirmed the string is a valid ISO 8601 date using an online validation tool. Running the exact same FastqToSam command without the --RUN_DATE parameter works without issues. Is there a reason for this error that I am missing?

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs


In GATK4, the GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case only two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport, which has a few limitations (for example it can only run on diploid data at the moment). We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


Using GenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs, your GenomicsDBImport command would look like this, assuming you're running per chromosome (here we're showing the tool running on chromosome 20):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20

That generates a directory called my_database containing the combined GVCF data for chromosome 20. The contents of the directory are not really human-readable; see further down for tips to deal with that.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At the moment you can only run GenomicsDBImport on a single genomic interval (i.e. max one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this, of course; see the sketch after this list.

  3. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using CatVariants) or scatter the following steps by chromosome as well.
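
For example, here is a minimal sketch of such a script, looping GenomicsDBImport over the human autosomes (the contig names and GVCF paths are placeholders taken from the trio example above; adjust them to your data):

for chrom in chr{1..22}; do
    # one GenomicsDB workspace per contig
    gatk GenomicsDBImport \
        -V data/gvcfs/mother.g.vcf \
        -V data/gvcfs/father.g.vcf \
        -V data/gvcfs/son.g.vcf \
        --genomicsdb-workspace-path my_database_${chrom} \
        --intervals ${chrom}
done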

**If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.**


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Collected FAQs about interval lists


1. Can GATK tools be restricted to specific intervals instead of processing the entire reference?

Absolutely. Just use the -L argument to provide the list of intervals you wish to run on. Or you can use -XL to exclude intervals, e.g. to blacklist genome regions that are problematic.
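
For example, a hypothetical HaplotypeCaller invocation restricted to a target list while excluding a blacklist (the file names are placeholders):

gatk HaplotypeCaller \
    -R ref.fasta \
    -I sample.bam \
    -L targets.interval_list \
    -XL blacklist.bed \
    -O sample.vcf.gz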


2. What file formats does GATK support for interval lists?

GATK supports several types of interval list formats: Picard-style .interval_list, GATK-style .list, BED files with extension .bed, and VCF files.

A. Picard-style .interval_list

Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD     VN:1.0  SO:coordinate
@SQ     SN:1    LN:249250621    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:1b22b98cdeb4a9304cb5d48026a85128     SP:Homo Sapiens
@SQ     SN:2    LN:243199373    AS:GRCh37       UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta   M5:a0d9851da00400dec1098a9255ac712e     SP:Homo Sapiens
1       30366   30503   +       target_1
1       69089   70010   +       target_2
1       367657  368599  +       target_3
1       621094  622036  +       target_4
1       861320  861395  +       target_5
1       865533  865718  +       target_6

This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. applying hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).

B. GATK-style .list or .intervals

This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
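
For instance, a GATK-style intervals file might contain the following lines (coordinates are made up for illustration); whole contigs and coordinate ranges can be mixed freely:

chr20
chr21:1-10000000
chr21:15000000-15001000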

C. BED files with extension .bed

We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.
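
For example, the 1-based interval 1:100-200 from a Picard- or GATK-style list corresponds to the following 0-based, tab-separated BED line (the start is offset by 1; the end stays the same because BED intervals are half-open):

1	99	200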

D. VCF files

Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.
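
As a hypothetical example of that workflow (file names are placeholders):

gatk HaplotypeCaller \
    -R ref.fasta \
    -I sample.bam \
    -L colleague_calls.vcf \
    -ip 100 \
    -O recalled.vcf.gz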


3. Is there a required order of intervals?

Yes, thanks for asking. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is for efficiency reasons.


4. Can I provide multiple sets of intervals?

Sure, no problem -- just pass them in using separate -L arguments. You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by setting an interval_set rule.
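
For example, to take the intersection of two interval sets instead of the union (this assumes GATK4's --interval-set-rule engine argument; the file names are placeholders):

gatk HaplotypeCaller \
    -R ref.fasta \
    -I sample.bam \
    -L capture_targets.interval_list \
    -L chr20:1-5000000 \
    --interval-set-rule INTERSECTION \
    -O sample.vcf.gz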


5. How will GATK handle intervals that abut or overlap?

Very gracefully. By default the GATK engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by setting an interval_merging rule.


6. What's the best way to pad intervals?

You can use the -ip engine argument to add padding on the fly. No need to produce separate padded targets files. Sweet, right?

Note that if intervals that previously didn't abut or overlap before you added padding now do so, by default the GATK engine will merge them as described above. This behavior can be modified by setting an interval_merging rule.
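
For example, to pad targets by 100 bp while merging only intervals that actually overlap after padding (this assumes GATK4's --interval-padding and --interval-merging-rule engine arguments; the file names are placeholders):

gatk HaplotypeCaller \
    -R ref.fasta \
    -I sample.bam \
    -L exome_targets.interval_list \
    --interval-padding 100 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sample.vcf.gz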

Merge VCF Files


Hello,

BACKGROUND: I am working with a public data set that consists of VCF files (I cannot go back upstream in the process). The VCF files are broken out by patient sample, and further by chromosome for 0/0 calls with NON_REF listed as the ALT. The variant calls (0/1, 1/1, and so forth) are in a separate VCF file for each patient, covering the entire genome. I concatenated all the files for each patient, so for each patient the ALT for a 0/0 call is NON_REF and the ALT for a variant call is always listed as a value, such as "G" or "TT". Now I wish to merge my 5000 patient samples into a single VCF file.

STEPS TRIED:
1. I went back to an older version of GATK 3.5 and used CombineVariants and got flagged with this message:

ERROR MESSAGE: CombineVariants should not be used to merge gVCFs produced by the HaplotypeCaller; use CombineGVCFs instead
  2. I also tried GATK4 and used CombineGVCFs and got flagged with this message:

ERROR MESSAGE: The list of input alleles must contain <NON_REF> as an allele but that is not the case at position 15274; please use the Haplotype Caller with gVCF output to generate appropriate records

QUESTION: How do I solve this and merge my files? Is there a VCF merge function that can handle a mix of calls that sometimes list NON_REF as the ALT and sometimes list an actual value for ALT?

P.S. Bcftools will not let me do this; vcf-tools merge will handle it, but it is very slow. I am hoping to use GATK.

Jim Kozubek


error_java.lang.IllegalArgumentException: No data found--when using VQSR

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found.
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:88)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:409)
at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:157)
at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

(howto) Visualize an alignment with IGV



Visualize sequence read alignment data (BAM or SAM) on IGV using this quick-start tutorial. The Integrative Genomics Viewer is a non-GATK tool developed at the Broad Institute that allows for interactive exploration of large genomic datasets.

Tools involved

Prerequisites

  • Coordinate-sorted and aligned BAM or SAM file
  • Corresponding BAI index
  • Matching reference genome to which the reads align. See IGV hosted genomes to check if IGV hosts a reference genome or this page for instructions on loading a .genome or FASTA file genome.

Download example data

  • tutorial_6491.tar.gz contains a coordinate-sorted BAM and corresponding BAI. Most reads align to a 1 Mbp genomic interval on chromosome 10 (10:96,000,000–97,000,000) of the human GRCh37 reference assembly. Specifically, reads align to the GATK bundle's human_g1k_v37_decoy.fasta, which corresponds to the Human (1kg, b37+decoy) reference hosted by IGV.

Related resources


View aligned reads using IGV

To view aligned reads using the Integrative Genomics Viewer (IGV), the SAM or BAM file must be coordinate-sorted and indexed.

  1. Always load the reference genome first. Go to Genomes>Load Genome From Server or load from the drop-down menu in the upper left corner. Select Human (1kg, b37+decoy).
  2. Load the data file. Go to File>Load from File and select 6491_snippet.bam. IGV automatically uses the corresponding 6491_snippet.bai index in the same folder.
  3. Zoom in to see alignments. For our tutorial data, copy and paste 10:96,867,400-96,869,400 into the textbox at the top and press Go. A 2 kbp region of chromosome 10 comes into view as shown in the screenshot above.

Alongside read data, IGV automatically generates a coverage track that sums the depth of reads for each genomic position.

Find a specific read and view as pairs


  1. Right-click on the alignment track and Select by name. Copy and paste H0164ALXX140820:2:2107:7323:30703 into the read name textbox and press OK. IGV will highlight two reads corresponding to this query name in bold red.
  2. Right-click on the alignment track and select View as pairs. The two highlighted reads will display in the same row connected by a line as shown in the screenshot.

Because IGV holds in memory a limited set of data overlapping with the genomic interval in view (this is what makes IGV fast), the select by name feature also applies only to the data that you call into view. For example, we know this read has a secondary alignment on contig hs37d5 (hs37d5:10,198,000-10,200,000).

If you jump to this new region, is the read also highlighted in red?


Some tips

If you find IGV sluggish, download a Java Web Start jnlp version of IGV that allows more memory. The highest memory setting as of this writing is 10 GB (RAM) for machines with 64-bit Java. For the tutorial example data, the typical 2 GB allocation is sufficient.

  • To run the jnlp version of IGV, you may need to adjust your system's Java Control Panel settings, e.g. enable Java content in the browser. Also, when first opening the jnlp, overcome Mac OS X's gatekeeper function by right-clicking the saved jnlp and selecting Open with Java Web Start.

To change display settings, check out either the Alignment Preferences panel or the Alignment track Pop-up menu. For persistent changes to your IGV display settings, use the Preferences panel. For track-by-track changes, use the Pop-up menus.

Default Alignment Preferences settings are tuned to genomic sequence libraries. Go to View>Preferences and make sure the settings under the Alignments tab allow you to view reads of interest, e.g. duplicate reads.

  • IGV saves any changes you make to these settings and applies them to future sessions.
  • Some changes apply only to new sessions started after the change.
  • To restore default preferences, delete or rename the prefs.properties file within your system's igv folder. IGV automatically generates a new prefs.properties file with default settings. See IGV's user guide for details.

After loading data, adjust viewing modes specific to track type by right-clicking on a track to pop up a menu of options. For alignment tracks, these options are described here.


Discrepancies of DepthOfCoverage report


DepthOfCoverage is a module of GATK 3.8.0.
The command is:
java -Xmx32g -jar /share/data1/local/bin/GenomeAnalysisTK.jar -T DepthOfCoverage -R /share/data1/genome/hs38DH.fa -o coverage -I combined_realign.bam -L /share/data1/PublicProject/GATK_bundle/wgs_calling_regions.hg38.bed -ct 1 -ct 3 -ct 10 -ct 20 -omitBaseOutput

wgs_calling_regions.hg38.bed is a simplified version of wgs_calling_regions.hg38.interval_list, which is taken from the GATK bundle.

I also paste the sample_summary below; the discrepancy is between "granular_third_quartile granular_median granular_first_quartile" and "%_bases_above_1 %_bases_above_3 %_bases_above_10 %_bases_above_20".

Take the first sample for example: 61% of the region has coverage above 3, yet the granular_median is reported as 1. Can the team kindly explain the discrepancy and tell me which value I should believe?

sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_1 %_bases_above_3 %_bases_above_10 %_bases_above_20
Y87645825 10726716375 3.67 1 1 1 90.8 61.0 2.8 0.1
Y87646062 10236869174 3.50 1 1 1 90.1 59.7 1.9 0.1
Y87645848 11931990342 4.08 1 1 1 91.0 64.7 4.8 0.2
E229909 8072356330 2.76 1 1 1 87.8 47.3 0.6 0.1
E230084 8801018351 3.01 1 1 1 88.4 51.5 1.0 0.1
Y87645831 10232223584 3.50 1 1 1 90.9 60.0 1.8 0.1
E229884 30810889945 10.54 1 1 1 99.4 96.8 52.5 3.8
Y87645851 12119947525 4.15 1 1 1 90.5 64.7 5.5 0.2
Y87646049 10461827187 3.58 1 1 1 89.9 60.2 2.3 0.1
Total 113393838813 38.78 N/A N/A N/A

RealignerTargetCreator: A USER ERROR has occurred: '-T' is not a valid command.


Hi,
I was running GATK RealignerTargetCreator. I ran the same command before with another BAM file and it ran fine. Both BAM files were produced using BWA MEM, then sorted, indexed, and had their mate information fixed (samtools), and finally duplicates were removed with Picard.

My command is as follows:
java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R reference.fasta \
-I input.bam \
--known indels.vcf \
-o forIndelRealigner.intervals

In the end, this error occurred:


A USER ERROR has occurred: '-T' is not a valid command.


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

It would be most helpful if someone can tell me how to solve this problem. Many thanks in advance.
Regards

GATK4's CalculateContamination reports no hom alt sites found


I have been trying to use GATK4's CalculateContamination but the output is not as expected:

level   contamination   error
whole_bam   0.0 1.0

The GATK log contained warnings that there were not enough data points to segment and that no hom alt sites were found.

Using GATK jar /mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/share/gatk4-4.0.4.0-0/gatk-package-4.0.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx16g -jar /mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/share/gatk4-4.0.4.0-0/gatk-package-4.0.4.0-local.jar CalculateContamination -I out/BC002-03042014_A_getpileupsummaries.table -O out/BC002-03042014_A_calculatecontamination.table
Picked up _JAVA_OPTIONS: -XX:+UseSerialGC
09:46:05.758 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/share/gatk4-4.0.4.0-0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
09:46:05.872 INFO  CalculateContamination - ------------------------------------------------------------
09:46:05.872 INFO  CalculateContamination - The Genome Analysis Toolkit (GATK) v4.0.4.0
09:46:05.872 INFO  CalculateContamination - For support and documentation go to https://software.broadinstitute.org/gatk/
09:46:05.872 INFO  CalculateContamination - Executing as dlho@n086.default.domain on Linux v2.6.32-431.el6.x86_64 amd64
09:46:05.872 INFO  CalculateContamination - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_102-b14
09:46:05.873 INFO  CalculateContamination - Start Date/Time: May 14, 2018 9:46:05 AM SGT
09:46:05.873 INFO  CalculateContamination - ------------------------------------------------------------
09:46:05.873 INFO  CalculateContamination - ------------------------------------------------------------
09:46:05.873 INFO  CalculateContamination - HTSJDK Version: 2.14.3
09:46:05.873 INFO  CalculateContamination - Picard Version: 2.18.2
09:46:05.873 INFO  CalculateContamination - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:46:05.873 INFO  CalculateContamination - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:46:05.873 INFO  CalculateContamination - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:46:05.873 INFO  CalculateContamination - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:46:05.873 INFO  CalculateContamination - Deflater: IntelDeflater
09:46:05.874 INFO  CalculateContamination - Inflater: IntelInflater
09:46:05.874 INFO  CalculateContamination - GCS max retries/reopens: 20
09:46:05.874 INFO  CalculateContamination - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
09:46:05.874 INFO  CalculateContamination - Initializing engine
09:46:05.874 INFO  CalculateContamination - Done initializing engine
09:46:05.935 WARN  KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (2) to segment; using all data points to calculate kernel matrix.
09:46:05.961 WARN  KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (2).  Local changepoint costs will not be calculated for this window size.
09:46:05.961 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.083 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
09:46:06.090 WARN  KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (3) to segment; using all data points to calculate kernel matrix.
09:46:06.090 WARN  KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (3).  Local changepoint costs will not be calculated for this window size.
09:46:06.090 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.091 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
09:46:06.091 WARN  KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (2) to segment; using all data points to calculate kernel matrix.
09:46:06.092 WARN  KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (2).  Local changepoint costs will not be calculated for this window size.
09:46:06.092 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.092 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
09:46:06.093 WARN  KernelSegmenter - Specified dimension of the kernel approximation (100) exceeds the number of data points (1) to segment; using all data points to calculate kernel matrix.
09:46:06.093 WARN  KernelSegmenter - Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (1).  Local changepoint costs will not be calculated for this window size.
09:46:06.093 WARN  KernelSegmenter - No changepoint candidates were found.  The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.093 INFO  KernelSegmenter - Found 0 changepoints after applying the changepoint penalty.
09:46:06.113 WARN  CalculateContamination - No hom alt sites found!  Perhaps GetPileupSummaries was run on too small of an interval, or perhaps the sample was extremely inbred or haploid.
09:46:06.116 WARN  CalculateContamination - No hom alt sites found!  Perhaps GetPileupSummaries was run on too small of an interval, or perhaps the sample was extremely inbred or haploid.
09:46:06.117 WARN  CalculateContamination - No hom alt sites found!  Perhaps GetPileupSummaries was run on too small of an interval, or perhaps the sample was extremely inbred or haploid.

To get the pileup file required for CalculateContamination, I used GetPileupSummaries and restricted the region with -L to a BED file containing the 77 genes of interest. The pileup file looks normal and contains 311 variants; is this not enough for CalculateContamination? Can CalculateContamination not be performed on small targeted sequencing panels? I would appreciate it if someone could assist, please!

Genotype all sites in MuTect2


Dear colleagues,
I've been trying different options to obtain genotypes for all sites, either variant or non-variant. No success: I only get the variant sites. I went through the options several times and do not see what I am doing wrong. I am running gatk-4.0.4.0. This is my latest attempt, putting together --output-mode EMIT_ALL_SITES and --all-site-pls true:

gatk Mutect2 \
-R hs37d5.fa \
-I tumour.bam \
-tumor tumsample \
-I bulk.bam \
-normal bulksample \
--germline-resource gnomad.exomes.r2.0.2.sites.vcf.bgz \
-O calls.vcf.gz \
-L 1:1-100000 \
--all-site-pls true \
--output-mode EMIT_ALL_SITES \
--af-of-alleles-not-in-resource 0.00003125


Thanks for any hint

gatk 4.4 docker image missing dependencies


Hi there,

I am trying to perform base recalibration using the Docker image of GATK 4.4. (I used 3.6 before, but a dependency problem with R pointed me to the latest version, in which the problem should be fixed according to the GATK forum.) But here I am with v4.4 and a similar error message: "Error in library("reshape") : there is no package called 'reshape'" (see the full message at the bottom of this post).

The library is indeed not installed in R.
Is there a repo with a container that has all dependencies installed? Or is it just like v3.6, where running the script "manually" is required? What am I missing to perform this step correctly?

thanks in advance for your answer.

Best,

B.

b35@toto$ docker run --mount type=bind,source="$ld",target=/data/ --mount type=bind,source=/media/b35/DATA/genomic/reference_genomes/,target=/ref/ broadinstitute/gatk:latest sh -c "gatk AnalyzeCovariates -bqsr /data/data/recal.table${i} -plots /data/data/AnalyzeCovariates${i}.pdf"
19:36:17.277 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/libs/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
19:36:17.414 INFO AnalyzeCovariates - ------------------------------------------------------------
19:36:17.415 INFO AnalyzeCovariates - The Genome Analysis Toolkit (GATK) v4.0.4.0
19:36:17.415 INFO AnalyzeCovariates - For support and documentation go to https://software.broadinstitute.org/gatk/
19:36:17.415 INFO AnalyzeCovariates - Executing as root@f43fb0936ac9 on Linux v4.13.0-43-generic amd64
19:36:17.415 INFO AnalyzeCovariates - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11
19:36:17.415 INFO AnalyzeCovariates - Start Date/Time: May 28, 2018 7:36:17 PM UTC
19:36:17.415 INFO AnalyzeCovariates - ------------------------------------------------------------
19:36:17.415 INFO AnalyzeCovariates - ------------------------------------------------------------
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Version: 2.14.3
19:36:17.416 INFO AnalyzeCovariates - Picard Version: 2.18.2
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Defaults.COMPRESSION_LEVEL : 2
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
19:36:17.416 INFO AnalyzeCovariates - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:36:17.416 INFO AnalyzeCovariates - Deflater: IntelDeflater
19:36:17.416 INFO AnalyzeCovariates - Inflater: IntelInflater
19:36:17.416 INFO AnalyzeCovariates - GCS max retries/reopens: 20
19:36:17.417 INFO AnalyzeCovariates - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
19:36:17.417 INFO AnalyzeCovariates - Initializing engine
19:36:17.417 INFO AnalyzeCovariates - Done initializing engine
19:36:17.731 INFO AnalyzeCovariates - Generating csv file '/tmp/root/AnalyzeCovariates2520511082455841657.csv'
19:36:17.789 INFO AnalyzeCovariates - Generating plots file '/data/data/AnalyzeCovariates1.pdf'
19:36:18.255 INFO AnalyzeCovariates - Shutting down engine
[May 28, 2018 7:36:18 PM UTC] org.broadinstitute.hellbender.tools.walkers.bqsr.AnalyzeCovariates done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=320864256
org.broadinstitute.hellbender.utils.R.RScriptExecutorException:
Rscript exited with 1
Command Line: Rscript -e tempLibDir = '/tmp/root/Rlib.7509374791326779134';source('/tmp/root/BQSR.8897197604108889283.R'); /tmp/root/AnalyzeCovariates2520511082455841657.csv /data/data/recal.table1 /data/data/AnalyzeCovariates1.pdf
Stdout:
Stderr:
Attaching package: 'gplots'

The following object is masked from 'package:stats':

lowess

Error in library("reshape") : there is no package called 'reshape'
Calls: source -> withVisible -> eval -> eval -> library
Execution halted

at org.broadinstitute.hellbender.utils.R.RScriptExecutor.getScriptException(RScriptExecutor.java:80)
at org.broadinstitute.hellbender.utils.R.RScriptExecutor.getScriptException(RScriptExecutor.java:19)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.R.RScriptExecutor.exec(RScriptExecutor.java:131)
at org.broadinstitute.hellbender.utils.recalibration.RecalUtils.generatePlots(RecalUtils.java:360)
at org.broadinstitute.hellbender.tools.walkers.bqsr.AnalyzeCovariates.generatePlots(AnalyzeCovariates.java:329)
at org.broadinstitute.hellbender.tools.walkers.bqsr.AnalyzeCovariates.doWork(AnalyzeCovariates.java:341)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

Using GATK jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar AnalyzeCovariates -bqsr /data/data/recal.table1 -plots /data/data/AnalyzeCovariates1.pdf


GenotypeGVCFs on whole genomes taking too long


Dear GATK team members and forum users,

I am analysing 200 germline whole genomes following the GATK Best Practices. I am experiencing issues with GenotypeGVCFs, whose runtime increases exponentially as the number of samples (gVCFs) increases.

To set you in context, I have 200 germline whole genomes in BAM format. These are high coverage, so their size ranges between 40-130GB. After recalibration, the size of these BAM files increases around 2-fold. The recalibrated BAMs are the input of HaplotypeCaller. I have run ~100 of these BAMs and got the gVCFs.

Now I want to perform joint genotyping with GenotypeGVCFs. When I had only 22 samples, running GenotypeGVCFs on those 22 gVCFs did not take long (around 4.5 h), but now that I want to re-run it with 100 samples this single command takes too long (around 1 week). I am running the pipeline on an HPC, which has a maximum walltime of 1 week, so GenotypeGVCFs is killed before finishing.

The gVCFs are compressed using bgzip + tabix. The .g.vcf.gz files weigh between 1.9 and 7 GB. These are used to feed GenotypeGVCFs. I am using 230 GB of memory. The exact command I am running is the following:

java -Xmx230g -Djava.io.tmpdir=/tmp \
-jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fa \
--dbsnp bundle2_8/b37/dbsnp_138.b37.vcf \
--variant sample1.g.vcf.gz --variant sample2.g.vcf.gz ... --variant sample100.g.vcf.gz \
-o joint.vcf

The reason I am not using the -nt option is that it gives an "ERROR MESSAGE: Code exception".
The GATK version I am using is 3.7

I also tried combining the 100 gVCFs into 2 batches of 50 each, but this also takes too long, around 3 days for each batch (6 days in total).

I wonder what approach would be suitable to handle this amount of data and whether this is normal. I am really concerned because I don't know how I am going to manage this once I have the 200 gVCFs.

All answers will be appreciated.

Thanks,
Ezequiel

What is a VCF and how should I interpret it?


This document describes "regular" VCF files produced for GERMLINE calls. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document. For information specific to SOMATIC calls, see the MuTect documentation.


Contents

  1. What is VCF?
  2. Basic structure of a VCF file
  3. Interpreting the VCF file header information
  4. Structure of variant call records
  5. How the genotype and other sample-level information is represented
  6. How to extract information from a VCF in a sane, straightforward way

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.


2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.


The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also includes the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so, as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
1   873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1   974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.


3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta

We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

  • VCF spec version

The first line:

##fileformat=VCFv4.1

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

  • FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in its FILTER field (see how records are structured further below).

  • FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.

  • GATKCommandLine

The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

  • Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

[todo: FAQ on genome builds]


4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

  • CHROM and POS : The contig and genomic coordinates on which the variant occurs.
    Note that for deletions the position given is actually the base preceding the event.

  • ID: An optional identifier for the variant.
    Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.

  • REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated).
    Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

  • QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data.
    Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.
    Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

  • FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters.
    If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

  • INFO: Various site-level annotations.
    The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by semicolons, ie ;, as in this example: MQ=99.00;MQ0=0;QD=17.94.
    They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.


5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but it's actually not that hard to interpret once you understand that it's just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1   873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample at this site.
    For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate
      In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
      For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
  • AD and DP : Allele depth and depth of coverage.
    These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
    AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
    DP is the filtered depth, at the sample level. This gives you the total number of reads at this site, for this sample, that passed the variant caller’s filters. You can check the variant caller’s documentation to see which filters are applied by default. However, unlike the AD calculation, uninformative reads are included in DP.
    See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.

  • PL : "Normalized" Phred-scaled likelihoods of the possible genotypes.
    For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale. We use "normalized" in quotes because these are not probabilities; we set the most likely genotype's PL to 0 for ease of reading. The other values are scaled relative to this most likely genotype.
    Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.

  • GQ : Quality of the assigned genotype.
    The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.
    Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.
    Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP=4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.


6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.
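
To make this concrete, here is a minimal GATK4-style VariantsToTable invocation that pulls a few site-level fields (-F) and per-sample genotype fields (-GF) into a tab-delimited table; the input and output file names are placeholders:

gatk VariantsToTable \
    -V input.vcf.gz \
    -F CHROM -F POS -F REF -F ALT -F QUAL \
    -GF GT -GF AD -GF DP -GF GQ \
    -O output.table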

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)

Detecting Deletions


Hi,
How can I use GATK to detect deletions? I used HaplotypeCaller, and the output format was very confusing.
I ran the command
java -jar -Xmx420M /u/local/apps/gatk/3.8.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~/reference.genome/chr19_new.fa -I ~/reference.genome/LP_J.chr19.1.25p.5_sorted.header.bam -L 19

(How to part I) Sensitively detect copy ratio alterations and allelic segments


Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.


This workflow is broken into two tutorials. You are currently on the first part.

The tutorial outlines steps in detecting copy ratio alterations, more familiarly copy number variants (CNVs), as well as allelic segments in a single sample using GATK4. The tutorial (i) denoises case sample alignment data against a panel of normals (PoN) to obtain copy ratios (Tutorial#11682) and (ii) models segments from the copy ratios and allelic counts (Tutorial#11683). The latter modeling incorporates data from a matched control. The same workflow steps apply to targeted exome and whole genome sequencing data.

Tutorial#11682 covers sections 1–4. Section 1 prepares a genomic intervals list with PreprocessIntervals and collects read coverage counts across the intervals. Section 2 creates a CNV PoN with CreateReadCountPanelOfNormals using read coverage counts. Section 3 denoises read coverage data against the PoN with DenoiseReadCounts using principal component analysis. Section 4 plots the results of standardizing and denoising copy ratios against the PoN.

Tutorial#11683 covers sections 5–8. Section 5 collects counts of reference versus alternate alleles with CollectAllelicCounts. Section 6 incorporates copy ratio and allelic counts data to group contiguous copy ratio and allelic counts segments with ModelSegments using kernel segmentation and Markov-chain Monte Carlo. The tool can also segment either copy ratio data or allelic counts data alone. Both types of data together refine segmentation results in that segments are based on the same copy ratio and the same minor allele fraction. Section 7 calls amplification, deletion and neutral events for the segmented copy ratios. Finally, Section 8 plots the results of segmentation and estimated allele-specific copy ratios.

Plotting is across genomic loci on the x-axis and copy or allelic ratios on the y-axis. The first part of the workflow focuses on removing systematic noise from coverage counts and adjusts the data points vertically. The second part focuses on segmentation and groups the data points horizontally. The extent of grouping, or smoothing, is adjustable with ModelSegments parameters. These adjustments do not change the copy ratios; the denoising in the first part of the workflow remains invariant in the second part of the workflow. See Figure 3 of this poster for a summary of tutorial results.

► The official GATK4 workflow is capable of running efficiently on WGS data and provides much greater resolution, up to ~50-fold more resolution for tested data. In these ways, GATK4 CNV improves upon its predecessor workflows in GATK4.alpha and GATK4.beta. Validations are still in progress and therefore the workflow itself is in BETA status, even if most tools, with the exception of ModelSegments, are production ready. The ModelSegments tool is still in BETA status and may change in small but significant ways going forward. Use at your own risk.

► The tutorial skips explicit GC-correction, an option in CNV analysis. For instructions on how to correct for GC bias, see AnnotateIntervals and DenoiseReadCounts tool documentation.

The GATK4 CNV workflow offers a multitude of levers, e.g. towards fine-tuning analyses and towards controls. Researchers are expected to tune workflow parameters on samples with similar copy number profiles as their case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters

Tools involved

  • GATK 4.0.1.1 or later releases.
  • The plotting tools require particular R components. Options are to install these or to use the broadinstitute/gatk Docker. In particular, to match versions, use the broadinstitute/gatk:4.0.1.1 version.

Download example data

Download tutorial_11682.tar.gz and tutorial_11683.tar.gz, either from the GoogleDrive or from the FTP site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data, see Tutorial#11136's third footnote and [1].

Alternatively, download the spacecade7/tutorial_11682_11683 docker image from DockerHub. The image contains GATK4.0.1.1 and the data necessary to run the tutorial commands, including the GRCh38 reference. Allocation of at least 4GB memory to Docker is recommended before launching the container.
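
If you go the Docker route, pulling and entering the container looks roughly like this (a sketch; the exact run options, e.g. mounted volumes, are up to you):

docker pull spacecade7/tutorial_11682_11683
docker run -it spacecade7/tutorial_11682_11683 bash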


1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts

Before collecting the coverage counts that form the basis of copy number variant detection, we define the resolution of the analysis with a genomic intervals list. The extent of genomic coverage and the size of genomic intervals in the intervals list factor into the resolution.

Preparing a genomic intervals list is necessary whether an analysis is on targeted exome data or whole genome data. In the case of exome data, we pad the target regions of the capture kit. In the case of whole genome data, we divide the reference genome into equally sized intervals or bins. In either case, we use PreprocessIntervals to prepare the intervals list.

For the tutorial exome data, we provide the capture kit target regions in 1-based intervals and set --bin-length to zero.

gatk PreprocessIntervals \
    -L targets_C.interval_list \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.preprocessed.interval_list

This produces a Picard-style intervals list targets_C.preprocessed.interval_list for use in the coverage collection step. Each interval is expanded by 250 bases on either side.

Comments on select parameters

  • The -L argument is optional. If provided, the tool expects the intervals list to be in Picard-style as described in Article#1319. The tool errs for other formats. If this argument is omitted, then the tool assumes each contig is a single interval. See [2] for additional discussion.
  • Set the --bin-length argument to be appropriate for the type of data, e.g. default 1000 for whole genome or 0 for exomes. In binning, an interval is divided into equal-sized regions of the specified length. The tool does not bin regions that contain Ns. [3]
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The --reference or -R is required and implies the presence of a corresponding reference index and a reference dictionary in the same directory.
  • To change the padding interval, specify the new value with --padding. The default value of 250 bases was determined to work well empirically for TCGA targeted exome data. This argument is relevant for exome data, as binning without an intervals list does not allow for intervals expansion. [5]
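
For whole genome data, by contrast, a minimal sketch (our illustration, not a tutorial step) omits the targets list so that the reference is divided into the default 1000-base bins:

gatk PreprocessIntervals \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 1000 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/wgs.preprocessed.interval_list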

Take a look at the intervals before and after padding.

cnv_intervals

For consecutive intervals less than 250 bases apart, how does the tool pad the intervals?

Now collect raw integer counts data. The tutorial uses GATK4.0.1.1's CollectFragmentCounts, which counts coverage of paired-end fragments. The tool counts each fragment once, assigning it to the interval that overlaps the fragment center. In GATK4.0.3.0, CollectReadCounts replaces CollectFragmentCounts. CollectReadCounts counts reads that overlap the interval.

The tutorial has already collected coverage on the tumor case sample, on the normal matched-control and on each of the normal samples that constitute the PoN. To demonstrate coverage collection, the following command uses the small BAM from Tutorial#11136’s data bundle [6]. The tutorial does not use the resulting file in subsequent steps. The CollectReadCounts command swaps out the tool name but otherwise uses identical parameters.

gatk CollectFragmentCounts \
    -I tumor.bam \
    -L targets_C.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/tumor.counts.hdf5

In the tutorial data bundle, the equivalent full-length result is hcc1143_T_clean.counts.hdf5. The data tabulates CONTIG, START, END and raw COUNT values for each genomic interval.

Comments on select parameters

  • The -L argument interval list is a Picard-style interval list prepared with PreprocessIntervals.
  • The -I input is alignment data.
  • By default, data is in HDF5 format. To generate text-based TSV (tab-separated values) format data, specify --format TSV. The HDF5 format allows for quicker panel of normals creation.
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The tool employs a number of engine-level read filters. Of note are NotDuplicateReadFilter, FirstOfPairReadFilter, ProperlyPairedReadFilter and MappingQualityReadFilter. [7]

☞ 1.1 How do I view HDF5 format data?

See Article#11508 for an overview of the format and instructions on how to navigate the data with external application HDFView. The article illustrates features of the format using data generated in this tutorial.
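
Alternatively, if you have the HDF5 command-line utilities installed (they are not part of GATK), you can get a quick look at the file structure directly from the shell, e.g.:

h5ls -r hcc1143_T_clean.counts.hdf5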


back to top


2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals

In creating a PoN, CreateReadCountPanelOfNormals abstracts the counts data for the samples and the intervals using Singular Value Decomposition (SVD), a type of Principal Component Analysis (PCA). The normal samples in the PoN should match the sequencing approach of the case sample under scrutiny. This applies especially to targeted exome data because the capture step introduces target-specific noise.

The tutorial has already created a CNV panel of normals using forty 1000 Genomes Project samples. The command below illustrates PoN creation using just three samples.

gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
    -I HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.counts.hdf5 \
    -I HG00733.alt_bwamem_GRCh38DH.20150826.PUR.exome.counts.hdf5 \
    -I NA19654.alt_bwamem_GRCh38DH.20150826.MXL.exome.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O sandbox/cnvponC.pon.hdf5

This generates a PoN in HDF5 format. The PoN stores information that, when applied, will (i) standardize case sample counts to PoN median counts and (ii) remove systematic noise in the case sample.

Comments on select parameters

  • Provide integer read coverage counts for each sample using -I. Coverage data may be in either TSV or HDF5 format. The tool will accept a single sample, e.g. the matched-normal.
  • The default --number-of-eigensamples or principal components is twenty. The tool will adjust this number to the smaller of twenty or the number of samples the tool retains after filtering. In general, denoising against a PoN with more components improves segmentation, but at the expense of sensitivity. Ideally, researchers should perform a sensitivity analysis to choose an appropriate value for this parameter. See this related discussion.
  • To run the tool using Spark, specify the Spark Master with --spark-master. See Article#11245 for details.

Comments on filtering and imputation parameters, in the order of application

  1. The tutorial changes the --minimum-interval-median-percentile argument from the default of 10.0 to a smaller value of 5.0. The tool filters out targets or bins with a median proportional coverage below this percentile. The median is across the samples. The proportional coverage is the target coverage divided by the sum of the coverage of all targets for a sample. The effect of setting this parameter to a smaller value is that we retain more information.
  2. The --maximum-zeros-in-sample-percentage default is 5.0. Any sample with more than 5% zero coverage targets is filtered.
  3. The --maximum-zeros-in-interval-percentage default is 5.0. Any target interval with more than 5% zero coverage across samples is filtered.
  4. The --extreme-sample-median-percentile default is 2.5. Any sample with less than 2.5 percentile or more than 97.5 percentile normalized median proportional coverage is filtered.
  5. The --do-impute-zeros default is set to true. The tool takes zero coverage regions and changes these values to the median of the non-zero values. The tool additionally normalizes zero values below the 0.10 percentile or above the 99.90 percentile to the corresponding percentile value.
  6. The --extreme-outlier-truncation-percentile default is 0.1. The tool takes any proportional coverage below the 0.1 percentile or above the 99.9 percentile and sets it to the corresponding percentile value.

The current filtering and imputation parameters are identical to those in the BETA release of the CNV workflow and may change in later versions based on evaluations. The implementation has been made more memory efficient so that the tool runs faster than the BETA release.

If the data are not uniform, e.g. have many intervals with zero or low counts, the tool emits a warning to adjust filtering parameters and stops the run. This may happen, for example, if one attempts to construct a panel of mixed-sex samples and includes the allosomal contigs [8]. In this case, first be sure to either exclude allosomal contigs via a subset intervals list or subset the panel samples to those expected to have similar coverage across the given contigs, e.g. panels of the same sex. If the warning still occurs, then adjust --minimum-interval-median-percentile to a larger value. See this thread for the original discussion.
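
For example, one way to exclude the allosomal contigs when preparing the intervals list is to pass GATK's standard exclude-intervals argument to PreprocessIntervals (a sketch; contig names assume GRCh38 and the output name is made up):

gatk PreprocessIntervals \
    -L targets_C.interval_list \
    -XL chrX -XL chrY \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.autosomes.preprocessed.interval_list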

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers? Could PCA account for GC-bias?
What do you know about the 1000 Genome Project? Specifically, the exome data?
How could we tell a good PoN from a bad PoN? What control could we use?

In a somatic analysis, what is better for a PoN: tissue-matched normals or blood normals?
Should we include our particular tumor’s matched normal in the PoN?


back to top


3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts

Provide DenoiseReadCounts with counts collected by CollectFragmentCounts and the CNV PoN generated with CreateReadCountPanelOfNormals.

gatk --java-options "-Xmx12g" DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals cnvponC.pon.hdf5 \
    --standardized-copy-ratios sandbox/hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_clean.denoisedCR.tsv

This produces two files, the standardized copy ratios hcc1143_T_clean.standardizedCR.tsv and the denoised copy ratios hcc1143_T_clean.denoisedCR.tsv, each representing a data transformation. In the first transformation, the tool standardizes counts by the PoN median counts. The standardization includes a log2 transformation and normalizes the counts data to center around one. In the second transformation, the tool denoises the standardized copy ratios using the principal components of the PoN.

Comments on select parameters

  • Because the default --number-of-eigensamples is null, the tool uses the maximum number of eigensamples available in the PoN. In section 2, by using default CreateReadCountPanelOfNormals parameters, we capped the number of eigensamples in the PoN to twenty. Changing the --number-of-eigensamples in DenoiseReadCounts to lower values can change the resolution of results, i.e. how smooth segments are. See this thread for detailed discussion.
  • Additionally provide the optional --annotated-intervals generated by AnnotateIntervals to concurrently perform GC-bias correction, as sketched after this list.
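
A minimal sketch of that GC-correction route follows (our illustration, not a tutorial step; the annotated-intervals file name is made up). To our understanding, combining --annotated-intervals with a PoN requires the PoN itself to have been created with the same annotated intervals, so the sketch shows the simpler case without a PoN:

gatk AnnotateIntervals \
    -L targets_C.preprocessed.interval_list \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.annotated.tsv

gatk DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --annotated-intervals sandbox/targets_C.annotated.tsv \
    --standardized-copy-ratios sandbox/hcc1143_T_clean.gc.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_clean.gc.denoisedCR.tsv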


back to top


4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios

We plot the standardized and denoised read counts with PlotDenoisedCopyRatios. The plots allow visually assessing the efficacy of denoising. Provide the tool with both the standardized and denoised copy ratios from the previous step as well as a reference sequence dictionary.

gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces six files in the plots directory--two PNG images and four text files as follows.

  • hcc1143_T_clean.denoised.png plots the standardized and denoised read counts across the contigs and scales the y-axis to accommodate all copy ratio data.
  • hcc1143_T_clean.denoisedLimit4.png plots the same but limits the y-axis range from 0 to 4 for comparability across samples.

Each of the text files contains a single quality control value. The value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number events and should decrease after denoising.

  • hcc1143_T_clean.standardizedMAD.txt gives the MAD for standardized copy ratios.
  • hcc1143_T_clean.denoisedMAD.txt gives the MAD for denoised copy ratios.
  • hcc1143_T_clean.deltaMAD.txt gives the difference between standardized MAD and denoised MAD.
  • hcc1143_T_clean.scaledDeltaMAD.txt gives the fractional difference (standardized MAD - denoised MAD)/(standardized MAD).

Comments on select parameters

  • The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping.
  • To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

Here are the results for the HCC1143 tumor cell line and its matched normal cell line. The normal cell line serves as a control. For each sample are two plots that show the effects of PCA denoising. The upper plot shows standardized copy ratios in blue and the lower plot shows denoised copy ratios in green.

4A. Tumor standardized and denoised copy ratio plots
hcc1143_T_clean.denoisedLimit4.png

4B. Normal standardized and denoised copy ratio plots
hcc1143_N_clean.denoisedLimit4.png

Would you guess there are CNV events in the normal? Should we be surprised?

The next step is to perform segmentation. This can be done either using copy ratios alone or in combination with allelic copy ratios. In part II, Section 6 outlines considerations in modeling segments with allelic copy ratios, section 7 generates a callset and section 8 shows how to plot segmented copy and allelic ratios. Again, the tutorial presents these steps using the full features of the workflow. However, researchers may desire to perform copy ratio segmentation independently of allelic counts data, e.g. for a case without a matched-control. For the case-only approach, segmentation gives the following plots. To recapitulate this approach, omit allelic-counts parameters from the example commands in sections 6 and 8.
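
For reference, a copy-ratio-only segmentation sketch (anticipating the ModelSegments command covered in part II, with the allelic-counts arguments simply left out; the output prefix is made up) would look something like this:

gatk --java-options "-Xmx4g" ModelSegments \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --output sandbox \
    --output-prefix hcc1143_T_caseonly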

4C. Tumor case-only copy ratios segmentation gives 235 segments.
T_caseonly.modeled.png

4D. Normal case-only copy ratios segmentation gives 41 segments.
hcc1143_N_caseonly.png

While the normal sample shows trisomy of chr2 and a subpopulation with deletion of chr6, the tumor sample is highly aberrant. The extent of aneuploidy is unsurprising and consistent with these HCC1143 tumor dSKY results by Wenhan Chen. Remember that cell lines, with increasing culture time and selective bottlenecks, can give rise to new somatic events, undergo clonal selection and develop population heterogeneity much like in cancer.


☞ 4.1 Compare two PoNs: considerations in the panel of normals creation

Denoising with a PoN is critical for calling copy-number variants from targeted exome coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.

To understand the impact a PoN's constituents can have on an analysis, compare the results of denoising the normal sample against two different PoNs. Each PoN consists of forty 1000 Genomes Project exome samples. PoN-M consists of the same cohort used in the Mutect2 tutorial's PoN. We selected PoN-C's constituents with more care and this is the PoN the CNV tutorial uses.

4E. Compare standardization and denoising with PoN-C versus PoN-M.
compare_pons.png

What is the difference in the targets for the two cohorts--cohort-M and cohort-C? Is this a sufficient reason for the difference in noise profiles we observe above?

GATK4 denoises exome coverage profiles robustly with either panel of normals. However, a good panel allows maximal denoising, as is the case for PoN-C over PoN-M.

We use publicly available 1000 Genomes Project data so as to be able to share the data and to illustrate considerations in CNV analyses. In an actual somatic analysis, we would construct the PoNs using the blood normals of the tumor cohort(s). We would construct a PoN for each sex, so as to be able to call events on allosomal chromosomes. Such a PoN should give better results than that from either of the tutorial PoNs.

Somatic analyses, due to the confounding factors of tumor purity and heterogeneity, require high sensitivity in calling. However, a sensitive caller can only do so much. Use of a carefully constructed PoN augments the sensitivity and helps illuminate copy number events.

This section is adapted from a hands-on tutorial developed and written by Soo Hee Lee (@shlee) in July of 2017 for the GATK workshops in Cambridge and Edinburgh, UK. The original tutorial uses the GATK4.beta workflow and can be found in the 1707 through 1711 GATK workshops folders. Although the Somatic CNV workflow has changed between GATK4.beta and the official GATK4 release, the PCA denoising remains the same. The hands-on tutorial focuses on differences in PCA denoising based on two different panels of normals (PoNs). Researchers may find working through the worksheet to the very end with either release version beneficial, as considerations in selecting PoN constituents remain identical.

Examining the read group information for the samples in the two PoNs shows a difference in mixtures of sequencing centers--four different sequencing centers for PoN-M versus a single sequencing center for PoN-C. The single sequencing center corresponds to that of the HCC1143 samples. Furthermore, tracing sample information will show different targeted exome capture kits for the sequencing centers. Comparing the denoising results of the two PoNs stresses the importance of selective PoN creation.


☞ 4.2 Compare PoN denoising versus matched-normal denoising

A feature of the GATK4 CNV workflow is the ability to normalize a case against a single control sample, e.g. a tumor case against its matched normal. This involves running the control sample through CreateReadCountPanelOfNormals, then denoising the case against this single-sample projection with DenoiseReadCounts. To illustrate this approach, here is the result of denoising the HCC1143 tumor sample against its matched normal. For single-sample matched-control denoising, DenoiseReadCounts produces identical data for standardizedCR.tsv and denoisedCR.tsv.
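
A sketch of this matched-normal approach follows (our illustration; it assumes the matched-normal counts file is named hcc1143_N_clean.counts.hdf5 and the output names are made up):

gatk CreateReadCountPanelOfNormals \
    -I hcc1143_N_clean.counts.hdf5 \
    -O sandbox/hcc1143_N_only.pon.hdf5

gatk DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals sandbox/hcc1143_N_only.pon.hdf5 \
    --standardized-copy-ratios sandbox/hcc1143_T_vs_N.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_vs_N.denoisedCR.tsv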

4F. Tumor case standardized against the normal matched-control
T_normalonly.png

Compare these results to that of section 4.1. Notice the depression in chr2 copy ratios that occurs due to the PoN normal sample's chr2 trisomy. Here, the median absolute deviation (MAD) of 0.149 is an incremental improvement to section 4.1's PoN-M denoising (MAD=0.15). In contrast, PoN-C denoising (MAD=0.125) and even PoN-C standardization alone (MAD=0.134) are seemingly better normalization approaches than the matched-normal standardization. Again, results stress the importance of selective PoN creation.

The PoN accounts for germline CNVs common to its constituents such that the workflow discounts the same variation in the case. It is possible for the workflow to detect germline CNVs not represented in the PoN, in particular, rare germline CNVs. In the case of matched-normal standardization, the workflow should discount germline CNVs and reveal only somatic events.

The workflow does not support iteratively denoising two samples each against a PoN and then against each other.

The tutorial continues in a second document at #11683.

back to top


Footnotes


[1] The constituents of the forty sample CNV panel of normals differ from those of the Mutect2 panel of normals. Preliminary CNV data were generated with v4.0.1.1 somatic CNV WDL scripts run locally on a Gcloud Compute Engine VM with Cromwell v30.2. Additional refinements were performed on a 16GB MacBook Pro laptop. Additional plots were generated using a broadinstitute/gatk:4.0.1.1 Docker container. Note the v4.0.1.1 WDL script does not allow custom sequence dictionaries for the plotting steps.


[2] Considerations in genomic intervals are as follows.

  • For targeted exomes, the intervals should represent the bait capture or target capture regions.
  • For whole genomes, either supply regions where coverage is expected across samples, e.g. regions that exclude alternate haplotypes and decoy regions in GRCh38, or omit the -L option for references where coverage is expected across the entirety of the reference.
  • For either type of data, expect to modify the intervals depending on (i) extent of masking in the reference used in read mapping and (ii) expectations in coverage on allosomal contigs. For example, for mammalian data, expect to remove Y chromosome intervals for female samples.


[3] See original discussion on bin size here. The bin size determines the resolution of CNV breakpoints. The theoretical limit depends on coverage depth and the insert-size distribution. Typically bin sizes on the order of the read length will give reasonable results. The GATK developers have tested WGS runs where the bin size is as small as 250 bases.


[4] Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. The default is set to ALL for GATK4.0.1.1. For future versions, the default will be set to OVERLAPPING_ONLY.


[5] The tool allows specifying both the padding and the binning arguments simultaneously. If exome targets are very long, it may be preferable to both pad and break up the intervals with binning. This may provide some additional resolution.


[6] The data bundle from Tutorial#11136 contains tumor.bam and normal.bam. These tumor and normal samples are identical to that in the current tutorial and represent a subset of the full data for the following regions:

chr6    29941013    29946495    +    
chr11   915890  1133890 +    
chr17   1   83257441    +    
chr11_KI270927v1_alt    1   218612  +    
HLA-A*24:03:01  1   3502    +


[7] The following notes regarding read filters may be of interest and apply to the workflow illustrated in the tutorial, which uses CollectFragmentCounts.

  • In contrast to prior versions of the workflow, the GATK4 CNV workflow excludes duplicate fragments from consideration with the NotDuplicateReadFilter. To instead include duplicate fragments, specify -DF NotDuplicateReadFilter.
  • The tool only considers paired-end reads (0x1 SAM flag) and the first of pair (0x40 flag) with the FirstOfPairReadFilter. The tool uses the first-of-pair read’s mapping information for the fragment center.
  • The tool only considers properly paired reads (0x2 SAM flag) using the ProperlyPairedReadFilter. Depending on whether and how data was preprocessed with MergeBamAlignment, proper pair assignments can differ from that given by the aligner. This filter also removes single ended reads.
  • The MappingQualityReadFilter sets a threshold for alignment MAPQ. The tool sets --minimum-mapping-quality to 30. Thus, the tool uses reads with MAPQ 30 or higher.


[8] The current tool version requires strategizing denoising of allosomal chromosomes, e.g. X and Y in humans, against the panel of normals. This is because coverage will vary for these regions depending on the sex of the sample. To determine the sex of samples, analyze them with DetermineGermlineContigPloidy. Aneuploidy in allosomal chromosomes, much like trisomy, can still make for viable organisms and so phenotypic sex designations are insufficient. GermlineCNVCaller can account for differential sex in data.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.

back to top


Picard IlluminaBasecallsToSam error

$
0
0

I'm getting an error when trying to run IlluminaBasecallsToSam. Please help.

java -Xmx8g -jar picard.jar IlluminaBasecallsToSam BASECALLS_DIR=/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls BARCODES_DIR=/project/JIY3012/work/dual_index_umi/250T8B9M8B250T LANE=1 READ_STRUCTURE=250T8B9M8B250T RUN_BARCODE=180619_M00831_0315_000000000-BTMY3 LIBRARY_PARAMS=JIY3012_library_params.xls TMP_DIR=java_io_tmpdir MOLECULAR_INDEX_TAG=RX NUM_PROCESSORS=10 IGNORE_UNEXPECTED_BARCODES=true

17:58:32.040 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/rosema1/BioInfo/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so [Mon Jun 25 17:58:32 EDT 2018] IlluminaBasecallsToSam BASECALLS_DIR=/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls BARCODES_DIR=/project/JIY3012/work/dual_index_umi/250T8B9M8B250T LANE=1 RUN_BARCODE=180619_M00831_0315_000000000-BTMY3 READ_STRUCTURE=250T8B9M8B250T LIBRARY_PARAMS=JIY3012_library_params.xls NUM_PROCESSORS=10 IGNORE_UNEXPECTED_BARCODES=true MOLECULAR_INDEX_TAG=RX TMP_DIR=[java_io_tmpdir] SEQUENCING_CENTER=BI PLATFORM=illumina ADAPTERS_TO_CHECK=[INDEXED, DUAL_INDEXED, NEXTERA_V2, FLUIDIGM] FORCE_GC=true APPLY_EAMSS_FILTER=true MAX_READS_IN_RAM_PER_TILE=1200000 MINIMUM_QUALITY=2 INCLUDE_NON_PF_READS=true MOLECULAR_INDEX_BASE_QUALITY_TAG=QX VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false [Mon Jun 25 17:58:32 EDT 2018] Executing as rosema1@my_server on Linux 2.6.32-279.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.7-SNAPSHOT INFO 2018-06-25 17:58:38 IlluminaBasecallsToSam DONE_READING STRUCTURE IS 250T8B9M8B250T Exception in thread "pool-2-thread-7" ERROR 2018-06-25 17:58:39 IlluminaBasecallsConverter Failure encountered in worker thread; attempting to shut down remaining worker threads and terminate ... java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator.awaitWorkComplete(IlluminaBasecallsConverter.java:609) at picard.illumina.IlluminaBasecallsConverter.doTileProcessing(IlluminaBasecallsConverter.java:234) at picard.illumina.IlluminaBasecallsToSam.doWork(IlluminaBasecallsToSam.java:273) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113) picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C200.1/s_1_1107.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C200.1/s_1_1107.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 14 more Exception in thread "pool-2-thread-10" picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C200.1/s_1_1110.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C200.1/s_1_1110.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 14 more [Mon Jun 25 17:58:39 EDT 2018] picard.illumina.IlluminaBasecallsToSam done. Elapsed time: 0.12 minutes. 
Runtime.totalMemory()=1519386624 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" htsjdk.samtools.util.RuntimeIOException at htsjdk.samtools.util.BinaryCodec.close(BinaryCodec.java:625) at htsjdk.samtools.BAMFileWriter.finish(BAMFileWriter.java:154) at htsjdk.samtools.SAMFileWriterImpl.close(SAMFileWriterImpl.java:231) at picard.illumina.IlluminaBasecallsToSam$SAMFileWriterWrapper.close(IlluminaBasecallsToSam.java:566) at picard.illumina.IlluminaBasecallsConverter.doTileProcessing(IlluminaBasecallsConverter.java:257) at picard.illumina.IlluminaBasecallsToSam.doWork(IlluminaBasecallsToSam.java:273) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113) Caused by: java.nio.channels.ClosedByInterruptException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:216) at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) at java.nio.channels.Channels.writeFully(Channels.java:101) at java.nio.channels.Channels.access$000(Channels.java:61) at java.nio.channels.Channels$1.write(Channels.java:174) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) at htsjdk.samtools.util.BlockCompressedOutputStream.flush(BlockCompressedOutputStream.java:267) at htsjdk.samtools.util.BinaryCodec.close(BinaryCodec.java:610) ... 8 more Exception in thread "pool-2-thread-6" Exception in thread "pool-2-thread-11" picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C200.1/s_1_1106.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C200.1/s_1_1106.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at 
picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 14 more picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C1.1/s_1_1111.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C1.1/s_1_1111.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 
14 more Exception in thread "pool-2-thread-5" Exception in thread "pool-2-thread-12" picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C199.1/s_1_1105.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C199.1/s_1_1105.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 
14 more picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C2.1/s_1_1112.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C2.1/s_1_1112.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 
14 more Exception in thread "pool-2-thread-9" picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C202.1/s_1_1109.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C202.1/s_1_1109.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 
14 more Exception in thread "pool-2-thread-2" picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C204.1/s_1_1102.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C204.1/s_1_1102.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 
14 more Exception in thread "pool-2-thread-1" picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C198.1/s_1_1101.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C198.1/s_1_1101.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 
14 more Exception in thread "pool-2-thread-4" picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C201.1/s_1_1104.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C201.1/s_1_1104.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 
14 more Exception in thread "pool-2-thread-3" picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C197.1/s_1_1103.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C197.1/s_1_1103.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 
14 more Exception in thread "pool-2-thread-8" picard.PicardException: File not found: (/scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C195.1/s_1_1108.bcl) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:84) at picard.illumina.parser.readers.BclReader.<init>(BclReader.java:94) at picard.illumina.parser.BclParser$BclDataCycleFileParser.<init>(BclParser.java:269) at picard.illumina.parser.BclParser.makeCycleFileParser(BclParser.java:75) at picard.illumina.parser.PerTileCycleParser.makeCycleFileParser(PerTileCycleParser.java:97) at picard.illumina.parser.PerTileCycleParser.seekToTile(PerTileCycleParser.java:133) at picard.illumina.parser.BclParser.initialize(BclParser.java:80) at picard.illumina.parser.BclParser.<init>(BclParser.java:63) at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:408) at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:292) at picard.illumina.IlluminaBasecallsConverter$TileReader.process(IlluminaBasecallsConverter.java:463) at picard.illumina.IlluminaBasecallsConverter$TileReadAggregator$1.run(IlluminaBasecallsConverter.java:560) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /scratch-large/4-quarterly/JIY3012/180619_M00831_0315_000000000-BTMY3/Data/Intensities/BaseCalls/L001/C195.1/s_1_1108.bcl (Too many open files) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.<init>(FileInputStream.java:138) at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:81) ... 14 more 4.83user 3.15system 0:08.53elapsed 93%CPU (0avgtext+0avgdata 2424688maxresident)k 9888inputs+1368outputs (1major+27369minor)pagefaults 0swaps
