Channel: Recent Discussions — GATK-Forum

Picard MarkDuplicates Barcode Tag

Hello!

I was wondering how the implementation of MarkDuplicates differs when the --barcode_tag option is specified and I haven't been able to find anything in my searches. I'm terribly sorry if this has been explained and I just haven't found it.

As I understand it, MarkDuplicates tries to mark duplicates that stem from the same original DNA fragment (thus "Both tools identify duplicates as sets of read pairs that have the same unclipped alignment start and unclipped alignment end." [HowTo](https://software.broadinstitute.org/gatk/documentation/article?id=6747)). So if barcodes such as 10x Chromium barcodes are representative of an original fragment, will MarkDuplicates only mark a read pair as a duplicate if it has the same start, end, and barcode tag?
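For reference, a minimal sketch of the kind of command I have in mind (a guess on my part; file names are placeholders, and I am assuming the 10x barcode has been copied into a SAM tag such as BX):

java -jar picard.jar MarkDuplicates \
  I=sample.bam \
  O=sample.markdup.bam \
  M=sample.markdup_metrics.txt \
  BARCODE_TAG=BX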

Thanks!

Does trimming affect the variant calling?


Dear GATK team,

I am aware that you do not particularly recommend adapter or quality trimming, as discussed here:
https://gatkforums.broadinstitute.org/gatk/discussion/2957/read-trimming

However, does trimming negatively affect variant calling? I am using a bunch of tools on the same BAM files, and some of them benefit from adapter trimming. Is it OK to feed the same trimmed BAM files (trimmed with Trimmomatic as recommended in its manual for DNA-seq) into the GATK variant calling pipeline (HaplotypeCaller or Mutect2), or will there be any negative consequences?

Thanks for your help!

Why GenomicsDBImport is not taking the argument --genomicsdb-update-workspace-path ?

Dear GATK team,

I am getting the following error message when I try to update my existing datastores with some new g.vcf files:

***********************************************************************

A USER ERROR has occurred: genomicsdb-update-workspace-path is not a recognized option

***********************************************************************

The command used is as follows:

"gatk GenomicsDBImport -V ../../path/to/file1.g.vcf -V ../../path/to/file2.g.vcf --genomicsdb-update-workspace-path /path/to/existing/datastore"

Kindly help in resolving this issue.
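For what it's worth, a quick way to check whether the installed GATK build actually exposes that argument (incremental import is only available in newer releases) is:

gatk --version
gatk GenomicsDBImport --help 2>&1 | grep -i "update-workspace-path"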

Thank you ,

Regards

Abhishek Panda

Badly formed genome unclippedLoc: Query interval "20" is not valid for this input.


Hi, I tried using GenomicsDBImport with 28 samples. I tried to load my data using a sample map file (tab-separated sample names and their absolute paths), but when using the command -

gatk --java-options "-Xmx40g -Xms2g" GenomicsDBImport --genomicsdb-workspace-path /mnt/drive/Exome_data/PROCESSING/gatk4_processing/genome_db/demo_DB  --batch-size 25 -L 20  --sample-name-map sample.map --tmp-dir=/mnt/exome/tmp --reader-threads 40

I actually want to load all the variants, not just small intervals of them. Using -L 20 gives me the error in the title.

My questions are two-fold -
1) I take it that -L 20 effectively means chr20, so why am I getting the above error?
2) If I create a DB importing only from chr20, can I add the same samples to the same DB, but with different intervals, so that I can import all the chromosomes? Or would I need to keep a separate DB for each chromosome? (See the sketch after this list.)
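A sketch of how a single GenomicsDBImport run can cover several chromosomes, either by giving -L multiple times or by pointing -L at an interval list file (contig names must match the reference dictionary; paths are placeholders):

gatk GenomicsDBImport \
  --genomicsdb-workspace-path /path/to/genome_db/all_chroms_DB \
  --sample-name-map sample.map \
  --batch-size 25 \
  -L 20 -L 21 -L 22 \
  --merge-input-intervals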

I've tried my best to understand, but I think I'm surely missing something!

Any help appreciated.

SelectVariants halted without an error

Germline copy number variant discovery (CNVs)


Purpose

Identify germline copy number variants.


Diagram is not available


Reference implementation is not available


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.

Does HaplotypeCaller support Xilinx FPGA acceleration cards?

We want to accelerate HaplotypeCaller with a Xilinx FPGA acceleration card. I know acceleration is supported on Intel FPGA A10 cards; is it also supported on Xilinx FPGA cards? Thanks :)

gatk FilterVariantTranches


I am a bit unsure about the usage of FilterVariantTranches. I have applied CNNScoreVariants to a VCF and am now trying to filter. I am using another VCF as the resource (also after applying CNNScoreVariants) like this:

gatk FilterVariantTranches -V ERR1213935.CNN.vcf.gz --resource por7A1.CNN.vcf.gz --info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 -O out.vcf

However it is throwing the following error: A USER ERROR has occurred: VCF must contain SNPs and indels with scores and resources must contain matching SNPs and indels.

So my question is: how much overlap between the sample VCF and the resource does there have to be? At the moment they contain some of the same variants but also different ones. From the error message it seems like they should contain exactly the same variants?
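For comparison, the usage pattern I have seen documented passes known truth/training callsets (e.g. HapMap and the Mills indels) as --resource rather than another sample VCF; a sketch assuming hg38 bundle file names:

gatk FilterVariantTranches \
  -V ERR1213935.CNN.vcf.gz \
  --resource hapmap_3.3.hg38.vcf.gz \
  --resource Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --info-key CNN_1D \
  --snp-tranche 99.95 \
  --indel-tranche 99.4 \
  -O out.vcf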

Thanks in advance,
Jody


0 variants lifted over but had mismatching reference alleles after lift over


Hi, guys:

I am trying to liftOver a VCF file. Please see my log file below. There does not seem to be an error, but somehow none of my SNPs are lifted over. Can someone please let me know what is wrong here?

Thanks!

Jie

===============================================================================

gatk --java-options "-Xmx6g" LiftoverVcf -R $ref -I A01.b37.vcf.gz -O A01.vcf -C /mnt/d/files/hg19ToHg38.chain --MAX_RECORDS_IN_RAM 50000 --REJECT rejected.vcf --DISABLE_SORT true
Using GATK jar /mnt/d/software_lin/gatk/gatk-package-4.1.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx6g -jar /mnt/d/software_lin/gatk/gatk-package-4.1.4.0-local.jar LiftoverVcf -R /mnt/d/data/gatk_bundle/hg38/Homo_sapiens_assembly38.fasta.gz -I A01.b37.vcf.gz -O A01.vcf -C /mnt/d/files/hg19ToHg38.chain --MAX_RECORDS_IN_RAM 50000 --REJECT rejected.vcf --DISABLE_SORT true

20:17:23.553 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/d/software_lin/gatk/gatk-package-4.1.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Mon Dec 09 20:17:23 GMT 2019] LiftoverVcf --INPUT A01.b37.vcf.gz --OUTPUT A01.vcf --CHAIN /mnt/d/files/hg19ToHg38.chain --REJECT rejected.vcf --DISABLE_SORT true --MAX_RECORDS_IN_RAM 50000 --REFERENCE_SEQUENCE /mnt/d/data/gatk_bundle/hg38/Homo_sapiens_assembly38.fasta.gz --WARN_ON_MISSING_CONTIG false --LOG_FAILED_INTERVALS true --WRITE_ORIGINAL_POSITION false --WRITE_ORIGINAL_ALLELES false --LIFTOVER_MIN_MATCH 1.0 --ALLOW_MISSING_FIELDS_IN_HEADER false --RECOVER_SWAPPED_REF_ALT false --TAGS_TO_REVERSE AF --TAGS_TO_DROP MAX_AF --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Dec 09, 2019 8:17:24 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Mon Dec 09 20:17:24 GMT 2019] Executing as jiehuang@DESKTOP-POF1PJ4 on Linux 4.4.0-17763-Microsoft amd64; OpenJDK 64-Bit Server VM 11.0.4+11-post-Ubuntu-1ubuntu218.04.3; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.4.0
INFO 2019-12-09 20:17:24 LiftoverVcf Loading up the target reference genome.
INFO 2019-12-09 20:17:54 LiftoverVcf Lifting variants over and writing the output file. Variants will not be sorted.
INFO 2019-12-09 20:17:57 LiftoverVcf Processed 100731 variants.
INFO 2019-12-09 20:17:57 LiftoverVcf 100731 variants failed to liftover.
INFO 2019-12-09 20:17:57 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO 2019-12-09 20:17:57 LiftoverVcf 100.0000% of variants were not successfully lifted over and written to the output.
INFO 2019-12-09 20:17:57 LiftoverVcf liftover success by source contig:
INFO 2019-12-09 20:17:57 LiftoverVcf 1: 0 / 10327 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 10: 0 / 3963 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 11: 0 / 5720 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 12: 0 / 4725 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 13: 0 / 2739 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 14: 0 / 2827 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 15: 0 / 3498 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 16: 0 / 3744 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 17: 0 / 5274 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 18: 0 / 2393 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 19: 0 / 4765 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 2: 0 / 7043 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 20: 0 / 1919 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 21: 0 / 1880 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 22: 0 / 2314 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 3: 0 / 5267 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 4: 0 / 4855 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 5: 0 / 5261 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 6: 0 / 5742 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 7: 0 / 4964 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 8: 0 / 4264 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf 9: 0 / 4148 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf X: 0 / 3093 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf Y: 0 / 6 (0.0000%)
INFO 2019-12-09 20:17:57 LiftoverVcf lifted variants by target contig:
INFO 2019-12-09 20:17:57 LiftoverVcf no successfully lifted variants
WARNING 2019-12-09 20:17:57 LiftoverVcf 0 variants with a swapped REF/ALT were identified, but were not recovered. See RECOVER_SWAPPED_REF_ALT and associated caveats.
[Mon Dec 09 20:17:57 GMT 2019] picard.vcf.LiftoverVcf done. Elapsed time: 0.56 minutes.
Runtime.totalMemory()=4588568576
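
P.S. One sanity check that might be relevant (a guess, not a confirmed diagnosis): b37 VCFs name contigs 1, 2, ... X, while hg19-based chain files usually expect chr1, chr2, ...; the contig names in the input VCF and the source contigs in the chain file have to agree:

# contig names used in the input VCF
zcat A01.b37.vcf.gz | grep -v '^#' | cut -f1 | sort -u | head
# the third field of each "chain" header line is the source (hg19) contig name
grep '^chain' /mnt/d/files/hg19ToHg38.chain | awk '{print $3}' | sort -u | head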

RNAseq short variant discovery (SNPs + Indels)


Purpose

Identify short variants (SNPs and Indels) in RNAseq data.



Reference Implementations

Pipeline: RNAseq short variant per-sample calling
Summary: BAM to VCF
Notes: universal (expected)
Github: yes
Terra: TBD

Expected input

This workflow is designed to operate on a set of samples (uBAM files) one-at-a-time; joint calling RNAseq is not supported.


Main Steps

Mapping to the Reference

Tools involved: STAR

We begin by mapping RNA reads to the reference. We recommend the STAR aligner because of its increased sensitivity compared to TopHat (especially for indels). We use STAR's two-pass mode to get better alignments around novel splice junctions.
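A minimal sketch of such a two-pass STAR run (index directory, FASTQ names, and thread count are placeholders to adapt):

STAR \
  --runThreadN 8 \
  --genomeDir /path/to/star_index \
  --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
  --readFilesCommand zcat \
  --twopassMode Basic \
  --outSAMtype BAM Unsorted \
  --outFileNamePrefix sample.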

Data Cleanup

Tools involved: MergeBamAlignment, MarkDuplicates

We use MergeBamAlignment and MarkDuplicates, as in our DNA pre-processing Best Practices pipeline.

SplitNCigarReads

Tools involved: SplitNCigarReads

Because RNA aligners have different conventions than DNA aligners, we need to reformat some of the alignments that span introns for HaplotypeCaller. This step splits reads with N in the cigar into multiple supplementary alignments and hard clips mismatching overhangs. By default this step also reassigns mapping qualities for good alignments to match DNA conventions.
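A minimal sketch of this step (file names are placeholders):

gatk SplitNCigarReads \
  -R Homo_sapiens_assembly38.fasta \
  -I sample.markdup.bam \
  -O sample.split.bam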

Base Quality Recalibration

Tools involved: BaseRecalibrator, ApplyBQSR, AnalyzeCovariates (optional)

This step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer.

The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model.

The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes, but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.
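A sketch of the two recalibration commands (reference and known-sites file names are assumed from the hg38 resource bundle; adjust to your organism and resources):

gatk BaseRecalibrator \
  -R Homo_sapiens_assembly38.fasta \
  -I sample.split.bam \
  --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
  --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  -O sample.recal.table

gatk ApplyBQSR \
  -R Homo_sapiens_assembly38.fasta \
  -I sample.split.bam \
  --bqsr-recal-file sample.recal.table \
  -O sample.recal.bam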

Variant Calling

Tools involved: HaplotypeCaller

HaplotypeCaller doesn’t need any specific changes to run with RNA once the bam has been run through SplitNCigarReads. We do adjust the minimum phred-scaled confidence threshold for calling variants to 20, but this value will depend on your specific use case.
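A sketch of the corresponding call (file names are placeholders):

gatk HaplotypeCaller \
  -R Homo_sapiens_assembly38.fasta \
  -I sample.recal.bam \
  --standard-min-confidence-threshold-for-calling 20 \
  -O sample.vcf.gz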

Variant Filtering

Tools involved: VariantFiltration

We recommend specific hard filters, since VQSR and CNNScoreVariants require truth data for training that we don’t yet have for RNA.
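A sketch of one such hard-filtering command; the clustered-SNP window and the FS/QD thresholds shown here echo the older GATK3 RNAseq recommendations and should be treated as starting points rather than fixed values:

gatk VariantFiltration \
  -R Homo_sapiens_assembly38.fasta \
  -V sample.vcf.gz \
  --cluster-window-size 35 \
  --cluster-size 3 \
  --filter-expression "FS > 30.0" --filter-name "FS" \
  --filter-expression "QD < 2.0" --filter-name "QD" \
  -O sample.filtered.vcf.gz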


GATK4 CNV, output 3 million CNVs !!!


Hi, there:

I was very excited to know that GATK4 could now generate CNV data from WGS.

However, I spent a long time trying to make it work, and I think it still does not.

I recently followed all the instructions and managed to generate an output file. But this file has almost 3 million rows. The first 15 rows of the output file are shown below.

I think each person is expected to have ~1,000 CNVs, not ~3 million!
I understand that GATK is using a 1 kb sliding window to detect CNVs. But then how could I get the ~1,000 CNVs that I could use to run downstream analysis?
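In case it is relevant: my understanding (which may be wrong) is that the per-bin copy ratios are meant to be collapsed into a much smaller set of segments by the downstream steps, roughly like this (file names are placeholders):

gatk ModelSegments \
  --denoised-copy-ratios sample.denoisedCR.tsv \
  --output-prefix sample \
  --output segments_dir

gatk CallCopyRatioSegments \
  --input segments_dir/sample.cr.seg \
  --output sample.called.seg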

Your help would be greatly appreciated!

Thank you & best regards,
Jie

MarkDuplicatesSpark not respecting --conf 'spark.executor.cores=4' option


Hi,

I'm trying to run gatk MarkDuplicatesSpark (v 4.1.4.1) locally, so not on a spark cluster, and provided the option --conf 'spark.executor.cores=4' to tell MarkDuplicatesSpark to use only 4 cores on the machine. However when I check the system load with e.g. top I see that all 44 cores of the system are used by MarkDuplicatesSpark. What am I doing wrong?

command:
gatk MarkDuplicatesSpark \
--tmp-dir /local/scratch/tmp \
-I Control_aligned.bam \
-O Control_aligned_sort_mkdp.bam \
-M Control_aligned_sort_mkdp.txt \
--create-output-bam-index true \
--read-validation-stringency LENIENT \
--conf 'spark.executor.cores=4'
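
For reference, my understanding is that in local mode the thread count is governed by the Spark master URL rather than by executor cores, i.e. something like this (a guess, not a confirmed fix):

gatk MarkDuplicatesSpark \
  --tmp-dir /local/scratch/tmp \
  -I Control_aligned.bam \
  -O Control_aligned_sort_mkdp.bam \
  -M Control_aligned_sort_mkdp.txt \
  --create-output-bam-index true \
  --read-validation-stringency LENIENT \
  --spark-master 'local[4]'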

Best
Dietmar

germlineCNVCaller Procedure

Hello,

I am trying to use the gCNV caller to call CNVs in a large set of samples. I am trying to figure out how to do so, but am quite confused as to the proper steps to take, and inputs and outputs for each step.

Are you to run DetermineGermlineContigPloidy-cohort and then DetermineGermlineContigPloidy-case mode? Then use that model for germlineCNVcaller-cohort and then proceed to germlineCNVcaller-case?

or are you to run DetermineGermlineContigPloidy-cohort, then germlineCNVcaller-cohort, DetermineGermlineContigPloidy-case , germlineCNVcaller-case?

Can someone clarify the order of the steps one is to take to follow the proper best practices?

GATK 4.1.4 DenoiseReadCounts: Sample intervals must be identical to the original intervals ...


Hi

I've been getting failures in Terra from the most recent 2-CNV_Somatic_Pair workflow copied from help-gatk/Somatic-CNVs-GATK4 that has an error message:

18:58:30.763 INFO SVDDenoisingUtils - Validating sample intervals against original intervals used to build panel of normals...
18:58:31.210 INFO DenoiseReadCounts - Shutting down engine [December 9, 2019 6:58:31 PM UTC]
org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1198522368
java.lang.IllegalArgumentException: Sample intervals must be identical to the original intervals used to build the panel of normals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:725) at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDDenoisingUtils.denoise(SVDDenoisingUtils.java:119)

Historical forum posts regarding this error message were tracked back to differences in interval lists, but my hunt for discrepant interval lists has not yet revealed a clue. I must be missing something.

This task runs on WGS data so I modified the interval list to correspond to 1k bins across the genome with blacklist intervals gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list .

Relevant inputs to the PoN building task 1-CNV_Somatic_Panel were:

  • blacklist_intervals
    gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list

  • intervals
    gs://fc-9c84e685-79f8-4d84-9e52-640943257a9b/reference/Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.interval_list

which produced an output interval list file and a pon:

  • preprocessed_intervals
    gs://fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-PreprocessIntervals/Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.preprocessed.interval_list

  • read_count_pon
    gs://fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-CreateReadCountPanelOfNormals/attempt-3/REBC-WGS-do-gc.pon.hdf5

The relevant inputs to 2-CNV_Somatic_Pair were:

  • blacklist_intervals
    gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list

  • intervals
    gs://fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-PreprocessIntervals/Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.preprocessed.interval_list

  • read_count_pon
    gs://fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-CreateReadCountPanelOfNormals/attempt-3/REBC-WGS-do-gc.pon.hdf5

which match the pon-making intervals as far as I can tell.

I've been running this as task 2-CNV_Somatic_Pair_gatk414 in workspace rebc-oct16/rebc_analysis, which is a very old workspace that the Terra and GATK teams should already have access to. If not, let me know. An example failed job is e6f01225-5db5-4f61-99f8-23689a32d42f.

Thanks,

Chip

P.S.
The complete error message is:
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.95676adf
18:58:29.302 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
18:58:29.522 INFO DenoiseReadCounts - ------------------------------------------------------------
18:58:29.523 INFO DenoiseReadCounts - The Genome Analysis Toolkit (GATK) v4.1.4.0
18:58:29.523 INFO DenoiseReadCounts - For support and documentation go to https://software.broadinstitute.org/gatk/
18:58:29.523 INFO DenoiseReadCounts - Executing as root@98f1c7f0eb6f on Linux v4.19.72+ amd64
18:58:29.524 INFO DenoiseReadCounts - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03
18:58:29.524 INFO DenoiseReadCounts - Start Date/Time: December 9, 2019 6:58:29 PM UTC
18:58:29.524 INFO DenoiseReadCounts - ------------------------------------------------------------
18:58:29.524 INFO DenoiseReadCounts - ------------------------------------------------------------
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Version: 2.20.3
18:58:29.525 INFO DenoiseReadCounts - Picard Version: 2.21.1
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:58:29.525 INFO DenoiseReadCounts - Deflater: IntelDeflater
18:58:29.525 INFO DenoiseReadCounts - Inflater: IntelInflater
18:58:29.525 INFO DenoiseReadCounts - GCS max retries/reopens: 20
18:58:29.525 INFO DenoiseReadCounts - Requester pays: disabled
18:58:29.525 INFO DenoiseReadCounts - Initializing engine
18:58:29.525 INFO DenoiseReadCounts - Done initializing engine
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
18:58:29.603 INFO DenoiseReadCounts - Reading read-counts file (/cromwell_root/fc-035f5652-acf7-4642-abb7-e8c10848c8ed/91a0cdd9-276f-40de-817f-ee4055054c5f/CNVSomaticPairWorkflow/e6f01225-5db5-4f61-99f8-23689a32d42f/call-CollectCountsNormal/SC217007.counts.hdf5)...
18:58:30.763 INFO SVDDenoisingUtils - Validating sample intervals against original intervals used to build panel of normals...
18:58:31.210 INFO DenoiseReadCounts - Shutting down engine
[December 9, 2019 6:58:31 PM UTC] org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1198522368
java.lang.IllegalArgumentException: Sample intervals must be identical to the original intervals used to build the panel of normals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:725)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDDenoisingUtils.denoise(SVDDenoisingUtils.java:119)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDReadCountPanelOfNormals.denoise(SVDReadCountPanelOfNormals.java:88)
at org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts.doWork(DenoiseReadCounts.java:200)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
Using GATK jar /root/gatk.jar defined in environment variable GATK_LOCAL_JAR
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx62000m -jar /root/gatk.jar DenoiseReadCounts --input /cromwell_root/fc-035f5652-acf7-4642-abb7-e8c10848c8ed/91a0cdd9-276f-40de-817f-ee4055054c5f/CNVSomaticPairWorkflow/e6f01225-5db5-4f61-99f8-23689a32d42f/call-CollectCountsNormal/SC217007.counts.hdf5 --count-panel-of-normals /cromwell_root/fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-CreateReadCountPanelOfNormals/attempt-3/REBC-WGS-do-gc.pon.hdf5 --standardized-copy-ratios SC217007.standardizedCR.tsv --denoised-copy-ratios SC217007.denoisedCR.tsv

.

appropriate members for generating "known-sites" list


I have 46 complete genomes and a good reference genome. Two of the individuals are "outgroups" (two different species). The rest are the same species as the reference genome. One of the outgroups hybridizes with the ingroup (we are studying this admixture). I have gVCF files for all individuals generated by HaplotypeCaller. When selecting and filtering variants to generate a "known-sites" list, should I exclude the outgroups? That seems like the right thing to do, but I could not think of a reason why adding the two outgroups would be a problem. Perhaps they will have unique SNPs and compromise the "known-sites" list?
Also, when creating a database of gVCFs (GenomicsDBImport), should I include all individuals and then exclude individuals in the GenotypeGVCFs tool? I could not find an obvious option to exclude individuals, except perhaps --annotations-to-exclude.
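For the second question, one option I am considering (an assumption on my part, not something I have confirmed) is keeping everyone in the database and dropping the outgroup samples from the joint callset afterwards with SelectVariants:

gatk SelectVariants \
  -V joint_genotyped.vcf.gz \
  --exclude-sample-name outgroup_sample_1 \
  --exclude-sample-name outgroup_sample_2 \
  -O ingroup_only.vcf.gz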
Thanks,


Safe to use HaplotypeCallerSpark?

Hi all,

I'm wondering how bad it is to use HaplotypeCallerSpark in GATK 4.0.2.1/JDK 1.8? I realize it's in beta, but I'm wondering if that means "your results will be useless" or just "use with caution".

The reason I'm asking is that it seems like Spark is the only way to multi-thread in GATK 4, and just 1 of my bams took 64 hrs to run on a single node, and I have 110 bams, so parallelizing is a must.

Btw, when I tried to run HaplotypeCallerSpark in parallel with 48 nodes, my job crashed after running for two days. I thought since with 1 node it took 64 hrs, using 48 nodes would mean it would finish in wayyy less than 2 days.

Here's what I have:

gatk --java-options "-Xmx32g -XX:ParallelGCThreads=1" HaplotypeCallerSpark --spark-master local[48] -R myref.2bit -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1
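
For context, the non-Spark alternative I am considering is scattering plain HaplotypeCaller over intervals and gathering the per-interval GVCFs afterwards, roughly like this (interval and file names are placeholders; plain HaplotypeCaller takes a FASTA reference rather than 2bit):

gatk HaplotypeCaller -R myref.fasta -I mybam.bam -L chr20 -O mygvcf.chr20.g.vcf.gz --emit-ref-confidence GVCF
gatk HaplotypeCaller -R myref.fasta -I mybam.bam -L chr21 -O mygvcf.chr21.g.vcf.gz --emit-ref-confidence GVCF
# then gather the per-interval GVCFs in reference order
gatk GatherVcfs -I mygvcf.chr20.g.vcf.gz -I mygvcf.chr21.g.vcf.gz -O mygvcf.g.vcf.gz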

what is the current page for VCF hard-filtering?


Hi GATK people,
I am preparing a training session using 4.1.4.1 and would need the current official list of parameters for hard filtering of hg38 VCF data. I also demonstrate the VQSR but for completeness want to include hard-filtering for those who will do this with alien species.
My current source of info is:
https://gatkforums.broadinstitute.org/gatk/discussion/23216/how-to-filter-variants-either-with-vqsr-or-by-hard-filtering
Can you please confirm that this is the correct set of parameters or direct me to a better alternative.
thanks in advance
Stephane

(How to) Filter variants either with VQSR or by hard-filtering


Document is in BETA. It may be incomplete and/or inaccurate. Post suggestions to the Comments section and be sure to read about updates also within the Comments section.



This article outlines two different approaches to site-level variant filtration. Site-level filtering involves using INFO field annotations in filtering. Section 1 outlines steps in Variant Quality Score Recalibration (VQSR) and section 2 outlines steps in hard-filtering. See Article#6925 for in-depth descriptions of the different variant annotations.

The GATK Best Practices recommends filtering germline variant callsets with VQSR. For WDL script implementations of the pipelines the Broad Genomics Platform uses in production, i.e. reference implementations, see the links provided on https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145 to both the gatk-workflows WDL script repository and the FireCloud workspaces. Each includes example data as well as publicly available recommended human population resources.

Hard-filtering is useful when the data cannot support VQSR or when an analysis requires manual filtering. Additionally, hard-filtering allows for filtering on sample-level annotations, i.e. FORMAT field annotations, which this article does not cover. See Tutorial#12350 to filter on FORMAT field attributes and to change the genotypes of such filtered sample sites to null (./.).

► GATK4 offers a deep learning method to filter germline variants that is applicable to single sample callsets. As of this writing, the CNN workflow is in experimental status (check here for an update). See Blog#10996 for an overview and initial benchmarking results, and see the gatk4-cnn-variant-filter repository for the WDL pipeline.
► For more complex variant filtering and annotation, see the Broad Hail.is framework at https://hail.is/index.html.
► After variant filtration, if downstream analyses require high-quality genotype calls, consider genotype refinement, e.g. filtering posterior-corrected GQ<20 genotypes. See Article#11074 for an overview.


Jump to a section

  1. VQSR: filter a cohort callset with VariantRecalibrator & ApplyVQSR
    ☞ 1.1 How can I parallelize VQSR?
  2. Hard filter a cohort callset with VariantFiltration
  3. Evaluate the filtered callset


1. VQSR: filter a cohort callset with VariantRecalibrator & ApplyVQSR

This section outlines the VQSR filtering steps performed in the 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline. Note the workflow hard-filters on the ExcessHet annotation before filtering with VQSR with the expectation that the callset represents many samples.

[A] Hard-filter a large cohort callset on ExcessHet using VariantFiltration
ExcessHet filtering applies only to callsets with a large number of samples, e.g. hundreds of unrelated samples. Small cohorts should not trigger ExcessHet filtering as values should remain small. Note cohorts of consanguineous samples will inflate ExcessHet, and it is possible to limit the annotation to founders for such cohorts by providing a pedigree file during variant calling.

gatk --java-options "-Xmx3g -Xms3g" VariantFiltration \
-V cohort.vcf.gz \
--filter-expression "ExcessHet > 54.69" \
--filter-name ExcessHet \
-O cohort_excesshet.vcf.gz 

This produces a VCF callset where any record with ExcessHet greater than 54.69 is filtered with the ExcessHet label in the FILTER column. The phred-scaled 54.69 corresponds to a z-score of -4.5. If a record lacks the ExcessHet annotation, it will pass filters.
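As a quick arithmetic check of that correspondence (assuming Python 3.8+ is available for statistics.NormalDist):

python3 -c "from statistics import NormalDist; from math import log10; print(-10 * log10(NormalDist().cdf(-4.5)))"
# prints approximately 54.69, the phred-scaled probability of a standard normal value below -4.5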

[B] Create sites-only VCF with MakeSitesOnlyVcf
Site-level filtering requires only site-level annotations. We can speed up the analysis in the modeling step by using a VCF that drops sample-level columns.

gatk MakeSitesOnlyVcf \
-I cohort_excesshet.vcf.gz \
-O cohort_sitesonly.vcf.gz

This produces a VCF that retains only the first eight columns.

[C] Calculate VQSLOD tranches for indels using VariantRecalibrator
All of the population resource files are publicly available at gs://broad-references/hg38/v0. The parameters in this article reflect those in the 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline and are thus tuned for WGS samples. For recommendations specific to exome samples, reasons why SNPs versus indels require different filtering, and additional discussion of training sets and arguments, see Article#1259. For example, the article states:

[For filtering indels, m]ost annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

gatk --java-options "-Xmx24g -Xms24g" VariantRecalibrator \
-V cohort_sitesonly.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 96.0 -tranche 95.0 -tranche 94.0 -tranche 93.5 -tranche 93.0 -tranche 92.0 -tranche 91.0 -tranche 90.0 \
-an FS -an ReadPosRankSum -an MQRankSum -an QD -an SOR -an DP \      
-mode INDEL \
--max-gaussians 4 \
-resource mills,known=false,training=true,truth=true,prior=12:Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-resource axiomPoly,known=false,training=true,truth=false,prior=10:Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz \
-resource dbsnp,known=true,training=false,truth=false,prior=2:Homo_sapiens_assembly38.dbsnp138.vcf \
-O cohort_indels.recal \
--tranches-file cohort_indels.tranches

The --max-gaussians parameter sets the expected number of clusters in modeling. If a dataset gives fewer distinct clusters, e.g. as can happen for smaller data, then the tool will tell you there is insufficient data with a No data found error message. In this case, try decrementing the --max-gaussians value.

[D] Calculate VQSLOD tranches for SNPs using VariantRecalibrator

gatk --java-options "-Xmx3g -Xms3g" VariantRecalibrator \
-V cohort_sitesonly.vcf.gz \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR -an DP \
-mode SNP \
--max-gaussians 6 \
-resource hapmap,known=false,training=true,truth=true,prior=15:hapmap_3.3.hg38.vcf.gz \
-resource omni,known=false,training=true,truth=true,prior=12:1000G_omni2.5.hg38.vcf.gz \
-resource 1000G,known=false,training=true,truth=false,prior=10:1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-resource dbsnp,known=true,training=false,truth=false,prior=7:Homo_sapiens_assembly38.dbsnp138.vcf \
-O cohort_snps.recal \
--tranches-file cohort_snps.tranches

Each step, C and D, produces a .recal recalibration table and a .tranches tranches table. In the filtering step, ApplyVQSR will use both types of data.

  • To additionally produce the optional tranche plot, specify the --rscript-file parameter. See the VariantRecalibrator tool documentation for details and this discussion thread for an example plot.
  • For allele-specific recalibration of an allele-specific callset, a beta feature as of this writing, add the -AS parameter.


☞ 1.1 How can I parallelize VQSR?

For cohorts with more than 10,000 WGS samples, it is possible to break down the analysis across genomic regions for parallel processing. The 1.1.1 version of the broad-prod-wgs-germline-snps-indels pipeline does so first by increasing --java-options to "-Xmx100g -Xms100g" and second by adding the following parameters to the command to subsample variants and to produce a file of the VQSR model.

--sample-every-Nth-variant 10 \
--output-model ${model_report_filename} \

The pipeline then applies the resulting model to each genomic interval with the same parameters as above with two additions. It provides the resulting model report to VariantRecalibrator with --input-model and specifies the flag --output-tranches-for-scatter. The pipeline then collates the resulting per-interval tranches with GatherTranches. Refer to the pipeline script for implementation details.


Successively apply the indel and SNP recalibrations to the full callset that has already undergone ExcessHet filtering.

[E] Filter indels on VQSLOD using ApplyVQSR

gatk --java-options "-Xmx5g -Xms5g" \
ApplyVQSR \
-V cohort_excesshet.vcf.gz \
--recal-file cohort_indels.recal \
--tranches-file cohort_indels.tranches \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode INDEL \
-O indel.recalibrated.vcf.gz

This produces an indel-filtered callset. At this point, SNP-type variants remain unfiltered.

[F] Filter SNPs on VQSLOD using ApplyVQSR

gatk --java-options "-Xmx5g -Xms5g" \
ApplyVQSR \
-V indel.recalibrated.vcf.gz \
--recal-file ${snps_recalibration} \
--tranches-file ${snps_tranches} \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode SNP \
-O snp.recalibrated.vcf.gz

This produces a SNP-filtered callset. Given the indel-filtered callset, this results in the final filtered callset.


back to top


2. Hard filter a cohort callset with VariantFiltration

This section of the tutorial provides generic hard-filtering thresholds and example commands for site-level manual filtering. A typical scenario requiring manual filtration is small cohort callsets, e.g. less than thirty exomes. See the GATK3 hard filtering Tutorial#2806 for additional discussion.

Researchers are expected to fine-tune hard-filtering thresholds for their data. Towards gauging the relative informativeness of specific variant annotations, the GATK hands-on hard-filtering workshop tutorial demonstrates how to plot distributions of annotation values for variant calls stratified against a truthset.

As with VQSR, hard-filter SNPs and indels separately. As of this writing, SelectVariants subsets SNP-only records, indel-only records or mixed-type, i.e. SNP and indel alternate alleles in the same record, separately. Therefore, when subsetting to SNP-only or indel-only records, mixed-type records are excluded. See this GitHub ticket for the status of a feature request to apply VariantFiltration directly on types of variants.

To avoid the loss of mixed-type variants, break up the multiallelic records into biallelic records before proceeding with the following subsetting. Alternatively, to process mixed-type variants with indel filtering thresholds similar to VQSR, add -select-type MIXED to the second command [B].
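For example, one way to split multiallelic records into biallelic records before the subsetting (a sketch; bcftools norm -m- is a common alternative outside GATK):

gatk LeftAlignAndTrimVariants \
  -R Homo_sapiens_assembly38.fasta \
  -V cohort.vcf.gz \
  --split-multi-allelics \
  -O cohort.biallelic.vcf.gz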

[A] Subset to SNPs-only callset with SelectVariants

gatk SelectVariants \
-V cohort.vcf.gz \
-select-type SNP \
-O snps.vcf.gz

This produces a VCF with records with SNP-type variants only.

[B] Subset to indels-only callset with SelectVariants

gatk SelectVariants \
-V cohort.vcf.gz \
-select-type INDEL \
-O indels.vcf.gz

This produces a VCF with records with indel-type variants only.

[C] Hard-filter SNPs on multiple expressions using VariantFiltration
The GATK does not recommend use of compound filtering expressions, e.g. the logical || "OR". For such expressions, if a record is null for or missing a particular annotation in the expression, the tool negates the entire compound expression and so automatically passes the variant record even if it fails on one of the expressions. See this issue ticket for details.

Provide each expression separately with the -filter parameter followed by the --filter-name. The tool evaluates each expression independently. Here we show basic filtering thresholds researchers may find useful to start.

gatk VariantFiltration \
-V snps.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O snps_filtered.vcf.gz

This produces a VCF with the same variant records now annotated with filter status. Specifically, if a record passes all the filters, it receives a PASS label in the FILTER column. A record that fails a filter receives the filter name in the FILTER column, e.g. SOR3. If a record fails multiple filters, then each failing filter name appears in the FILTER column separated by semi-colons ;, e.g. MQRankSum-12.5;ReadPosRankSum-8.

[D] Similarly, hard-filter indels on multiple expressions using VariantFiltration

gatk VariantFiltration \
-V indels.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
-O indels_filtered.vcf.gz

This produces a VCF with the same variant records annotated with filter names for failing records. At this point, consider merging the separate callsets together. Select comments follow.

  • RankSum annotations can only be calculated for REF/ALT heterozygous sites and therefore will be absent from records that do not present read counts towards heterozygous genotypes.
  • By default, GATK HaplotypeCaller and GenotypeGVCFs do not emit variants with QUAL < 10. The --standard-min-confidence-threshold-for-calling (-stand-call-conf) parameter adjusts this threshold. GATK recommends filtering variants with QUAL less than 30. The lower default QUAL threshold of the callers allows for more negative training data in VQSR filtering.
  • When providing filtering thresholds, the tool expects the value to match the type specified in the ##INFO lines of the callset. For example, an Integer type is a whole number without decimals, e.g. 0, and a Float type is a number with decimals, e.g. 0.0. If the expected type mismatches, the tool will give a java.lang.NumberFormatException error.
  • If a filter expression is misspelled, the tool does not give a warning, so be sure to carefully review filter expressions for correctness.


back to top


3. Evaluate the filtered callset

Filtering is about balancing sensitivity and precision for research aims. ​For example, genome-wide association studies can afford to maximize sensitivity over precision such that there are more false positives in the callset. Conversely, downstream analyses that require high precision, e.g. those that cannot tolerate false positive calls because validating variants is expensive, maximize precision over sensitivity such that the callset loses true positives.

Two tools enable site-level evaluation: CollectVariantCallingMetrics and VariantEval. Another tool, GenotypeConcordance, measures sample-level genotype concordance and is not covered here. For an overview of all three tools, see Article#6308.

Compare callset against a known population callset using CollectVariantCallingMetrics

gatk CollectVariantCallingMetrics \
-I filtered.vcf.gz \
--DBSNP Homo_sapiens_assembly38.dbsnp138.vcf \
-SD Homo_sapiens_assembly38.dict \
-O metrics 

This produces detailed and summary metrics report files. The summary metrics provide cohort-level variant metrics and the detailed metrics segment variant metrics for each sample in the callset. The detail metrics give the same metrics as the summary metrics for the samples plus several additional metrics. These are explained in detail at https://broadinstitute.github.io/picard/picard-metric-definitions.html.

Compare callset against a known population callset using VariantEval
As of this writing, VariantEval is in beta status in GATK v4.1, so we provide an example GATK3 command, where the tool is in production status. GATK3 Dockers are available at https://hub.docker.com/r/broadinstitute/gatk3.

java -jar gatk3.jar \
-T VariantEval \
-R Homo_sapiens_assembly38.fasta \
-eval cohort.vcf.gz \
-D Homo_sapiens_assembly38.dbsnp138.vcf \
-noEV \
-EV CompOverlap -EV IndelSummary -EV TiTvVariantEvaluator \
-EV CountVariants -EV MultiallelicSummary \
-o cohortEval.txt

This produces a file containing a table for each of the evaluation modules, e.g. CompOverlap.

Please note the GA4GH (Global Alliance for Genomics and Health) recommends using hap.py for stratified variant evaluations (1, 2). One approach using hap.py wraps the vcfeval module of RTG-Tools. The module accounts for differences in variant representation by matching variants mapped back to the reference.
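For orientation, a typical hap.py invocation comparing a query callset against a truth callset looks roughly like the following (file names are placeholders; --engine vcfeval selects the RTG vcfeval comparison engine):

hap.py truth.vcf.gz query.vcf.gz \
  -r Homo_sapiens_assembly38.fasta \
  -f truth_confident_regions.bed \
  -o happy_output_prefix \
  --engine vcfeval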


back to top

HaplotypeCaller successfully generated my g.vcf.gz, no error message but exit status!=0


Hi the gatk team,
I used HaplotypeCaller to generate a VCF file. Everything went fine: I got no error message, and the vcf.gz and the vcf.gz.tbi were generated. However, the exit status was not '0' but 141.

Dec 11, 2019 1:03:37 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
13:03:37.743 INFO  HaplotypeCaller - ------------------------------------------------------------
13:03:37.744 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.1.4.1
13:03:37.744 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
13:03:37.745 INFO  HaplotypeCaller - Executing as lindenbaum-p@gkq0xd2.compute.bird2.prive on Linux v3.10.0-957.21.3.el7.x86_64 amd64
13:03:37.745 INFO  HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_131-b11
13:03:37.746 INFO  HaplotypeCaller - Start Date/Time: December 11, 2019 1:03:36 PM CET
13:03:37.746 INFO  HaplotypeCaller - ------------------------------------------------------------
13:03:37.746 INFO  HaplotypeCaller - ------------------------------------------------------------
13:03:37.749 INFO  HaplotypeCaller - HTSJDK Version: 2.21.0
13:03:37.750 INFO  HaplotypeCaller - Picard Version: 2.21.2
13:03:37.750 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:03:37.750 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:03:37.750 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:03:37.750 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:03:37.751 INFO  HaplotypeCaller - Deflater: IntelDeflater
13:03:37.751 INFO  HaplotypeCaller - Inflater: IntelInflater
13:03:37.751 INFO  HaplotypeCaller - GCS max retries/reopens: 20
13:03:37.751 INFO  HaplotypeCaller - Requester pays: disabled
13:03:37.751 INFO  HaplotypeCaller - Initializing engine
(...)
13:28:24.643 INFO  HaplotypeCaller - Shutting down engine
[December 11, 2019 1:28:24 PM CET] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 24.81 minutes.
Runtime.totalMemory()=1274019840
Using GATK jar /sandbox/apps/bioinfo/binaries/gatk/0.0.0/gatk-4.1.4.1/gatk-package-4.1.4.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Djava.
io.tmpdir=. -Xmx5g -jar /path/to/gatk-package-4.1.4.1-local.jar HaplotypeCaller --tmp-dir . --reference /path/to/ref.fasta -I /path/to/input.bam -L <my-interval>  -O out.g.vcf.gz --dbsnp path/to/dbsnp_135.b3
7.vcf.gz --do-not-run-physical-phasing --emit-ref-confidence GVCF

The VCF itself is OK:

$ bcftools view out.g.vcf.gz > /dev/null && echo ok
ok

$ bcftools view out.g.vcf.gz 22 > /dev/null && echo ok
ok

Is there anything I can do to find out more about this error? My filesystem is quite slow these days; is there any timeout to unlock some files?
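
For what it's worth, exit statuses above 128 conventionally mean the process was killed by a signal (status minus 128); 141 corresponds to signal 13, SIGPIPE, so a broken pipe in my wrapper or job script may be the culprit rather than HaplotypeCaller itself (just a guess):

echo $((141 - 128))   # 13
kill -l 13            # PIPE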

Error of INDEL mode during VQSR process

Hello,
I'm trying to do VQSR on exome data from 50,000 samples. Since this dataset is very big, I used GenomicsDBImport to merge. Whole-exome merging is also slower than expected, so I did this per chromosome.
SNP mode for VQSR went smoothly, while the INDEL mode has several problems.
I used the command below at first:

```
time $gatk VariantRecalibrator \
-R $reference \
-V $outdir/population/${outname}.HC.snps.VQSR.vcf.gz \
-resource:mills,known=true,training=true,truth=true,prior=12.0 $GATK_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-an DP -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
-mode INDEL \
--max-gaussians 6 \
--rscript-file $outdir/population/${outname}.HC.indels.plots.R \
--tranches-file $outdir/population/${outname}.HC.indels.tranches \
-O $outdir/population/${outname}.HC.snps.indels.recal && \
time $gatk ApplyVQSR \
-R $reference \
-V $outdir/population/${outname}.HC.snps.VQSR.vcf.gz \
--truth-sensitivity-filter-level 99.0 \
--tranches-file $outdir/population/${outname}.HC.snps.indels.tranches \
--recal-file $outdir/population/${outname}.HC.snps.indels.recal \
-mode INDEL \
-O $outdir/population/${outname}.HC.VQSR.vcf.gz && echo "** SNPs and Indels VQSR (${sample}.HC.VQSR.vcf.gz finish) done **"
```
There are two warnings:
WARN VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable.
WARN VariantRecalibratorEngine - Evaluate datum returned a NaN
And the program stopped due to "No data found"

I searched this forum for a solution and removed "MQRankSum" from the -an annotations. The recalibration process passed this time, although the first warning is still there:
WARN VariantDataManager - WARNING: Training with very few variant sites! Please check the model reporting PDF to ensure the quality of the model is reliable.

But the ApplyVQSR process stopped, due to:

```
13:55:06.069 INFO ApplyVQSR - Deflater: IntelDeflater
13:55:06.069 INFO ApplyVQSR - Inflater: IntelInflater
13:55:06.070 INFO ApplyVQSR - GCS max retries/reopens: 20
13:55:06.070 INFO ApplyVQSR - Requester pays: disabled
13:55:06.070 INFO ApplyVQSR - Initializing engine
13:55:06.517 INFO FeatureManager - Using codec VCFCodec to read file file:///home/pang/data/public_data/UKBB/exome_population/population/ukb_efe_chr4.HC.snps.indels.recal
13:55:06.593 INFO FeatureManager - Using codec VCFCodec to read file file:///home/pang/data/public_data/UKBB/exome_population/population/ukb_efe_chr4.HC.snps.VQSR.vcf.gz
13:55:06.776 INFO ApplyVQSR - Done initializing engine
13:55:06.778 INFO ApplyVQSR - Shutting down engine
[December 11, 2019 1:55:06 PM CET] org.broadinstitute.hellbender.tools.walkers.vqsr.ApplyVQSR done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=1933574144
***********************************************************************

A USER ERROR has occurred: Couldn't read file /home/pang/data/public_data/UKBB/exome_population/population/ukb_efe_chr4.HC.snps.indels.tranches. Error was: /home/pang/data/public_data/UKBB/exome_population/population/ukb_efe_chr4.HC.snps.indels.tranches with exception: /home/pang/data/public_data/UKBB/exome_population/population/ukb_efe_chr4.HC.snps.indels.tranches

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
```

I do not understand what this "exception" is. Could you give me some suggestions on how to solve it?
On the other hand, I know VQSR requires more than about 30 exomes, and in some cases that is the reason for the "No data found" error. Although I did VQSR for each chromosome separately, I think 50,000 samples should be enough.
Is there some method to merge these chromosomes together and do the VQSR afterwards?
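
A generic check when ApplyVQSR cannot read a tranches file is to verify that the path given to --tranches-file matches what VariantRecalibrator actually wrote (note the VariantRecalibrator command above writes ${outname}.HC.indels.tranches while ApplyVQSR asks for ${outname}.HC.snps.indels.tranches):

```
ls -l $outdir/population/${outname}.HC.*tranches*
```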

Thanks!
Shichao