Channel: Recent Discussions — GATK-Forum

What is the current recommended practice for interval padding?


In the GATK Best Practices WDL on GitHub, the GenotypeGVCFs command line does not specify any interval padding via the -ip / --interval-padding parameter.

https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/joint-discovery-gatk4.wdl
(as of commit d9fea5462f148ab3a49a177ddd328ede304f4f62).

However, this older guide from 2017 recommends an interval padding of 100 bp:

https://software.broadinstitute.org/gatk/documentation/article?id=11062

What is the recommended practice for interval padding for HaplotypeCaller and GenotypeGVCFs in GATK 4.x?
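For reference, this is how 100 bp of padding would be applied if one followed the 2017 guidance; a minimal sketch assuming GATK 4.x, where -ip / --interval-padding is a standard engine argument and the file names are placeholders:

```
# Sketch: pad each target interval by 100 bp when running HaplotypeCaller (GATK 4.x).
gatk HaplotypeCaller \
  -R reference.fasta \
  -I sample.bam \
  -L targets.interval_list \
  -ip 100 \
  -ERC GVCF \
  -O sample.g.vcf.gz
```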


GATK Funcotator and GATK VariantFiltration


Hey there!

I have used the GATK Funcotator tool, which generated values for the many fields indicated below.

INFO=<ID=FUNCOTATION,Number=A,Type=String,Description="Functional annotation from the Funcotator tool. Funcotation fields are: Gencode_19_hugoSymbol|Gencode_19_ncbiBuild|Gencode_19_chromosome|Gencode_19_start|Gencode_19_end|Gencode_19_variantClassification|Gencode_19_secondaryVariantClassification|Gencode_19_variantType|Gencode_19_refAllele|Gencode_19_tumorSeqAllele1|Gencode_19_tumorSeqAllele2|Gencode_19_genomeChange|Gencode_19_annotationTranscript|Gencode_19_transcriptStrand|Gencode_19_transcriptExon|Gencode_19_transcriptPos|Gencode_19_cDnaChange|Gencode_19_codonChange|Gencode_19_proteinChange|Gencode_19_gcContent|Gencode_19_referenceContext|Gencode_19_otherTranscripts|ClinVar_HGMD_ID|ClinVar_SYM|ClinVar_TYPE|ClinVar_ASSEMBLY|ClinVar_rs|gnomAD_exome_AF|gnomAD_exome_AF_afr|gnomAD_exome_AF_afr_female|gnomAD_exome_AF_afr_male|gnomAD_exome_AF_amr|gnomAD_exome_AF_amr_female|gnomAD_exome_AF_amr_male|gnomAD_exome_AF_asj|gnomAD_exome_AF_asj_female|gnomAD_exome_AF_asj_male|gnomAD_exome_AF_eas|gnomAD_exome_AF_eas_female|gnomAD_exome_AF_eas_jpn|gnomAD_exome_AF_eas_kor|gnomAD_exome_AF_eas_male|gnomAD_exome_AF_eas_oea|gnomAD_exome_AF_female|gnomAD_exome_AF_fin|gnomAD_exome_AF_fin_female|gnomAD_exome_AF_fin_male|gnomAD_exome_AF_male|gnomAD_exome_AF_nfe|gnomAD_exome_AF_nfe_bgr|gnomAD_exome_AF_nfe_est|gnomAD_exome_AF_nfe_female|gnomAD_exome_AF_nfe_male|gnomAD_exome_AF_nfe_nwe|gnomAD_exome_AF_nfe_onf|gnomAD_exome_AF_nfe_seu|gnomAD_exome_AF_nfe_swe|gnomAD_exome_AF_oth|gnomAD_exome_AF_oth_female|gnomAD_exome_AF_oth_male|gnomAD_exome_AF_popmax|gnomAD_exome_AF_raw|gnomAD_exome_AF_sas|gnomAD_exome_AF_sas_female|gnomAD_exome_AF_sas_male|gnomAD_exome_ID|gnomAD_exome_FILTER|gnomAD_genome_AF|gnomAD_genome_AF_afr|gnomAD_genome_AF_afr_female|gnomAD_genome_AF_afr_male|gnomAD_genome_AF_amr|gnomAD_genome_AF_amr_female|gnomAD_genome_AF_amr_male|gnomAD_genome_AF_asj|gnomAD_genome_AF_asj_female|gnomAD_genome_AF_asj_male|gnomAD_genome_AF_eas|gnomAD_genome_AF_eas_female|gnomAD_genome_AF_eas_male|gnomAD_genome_AF_female|gnomAD_genome_AF_fin|gnomAD_genome_AF_fin_female|gnomAD_genome_AF_fin_male|gnomAD_genome_AF_male|gnomAD_genome_AF_nfe|gnomAD_genome_AF_nfe_est|gnomAD_genome_AF_nfe_female|gnomAD_genome_AF_nfe_male|gnomAD_genome_AF_nfe_nwe|gnomAD_genome_AF_nfe_onf|gnomAD_genome_AF_nfe_seu|gnomAD_genome_AF_oth|gnomAD_genome_AF_oth_female|gnomAD_genome_AF_oth_male|gnomAD_genome_AF_popmax|gnomAD_genome_AF_raw|gnomAD_genome_ID|gnomAD_genome_FILTER">

I have used GATK VariantFiltration to apply hard filters with the help of the links below:
https://software.broadinstitute.org/gatk/documentation/article.php?id=6925
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_filters_VariantFiltration.php

However, I would like to know whether I can use the GATK VariantFiltration tool to apply filters to specific fields of interest produced by GATK Funcotator, say gnomAD_exome_AF and gnomAD_genome_AF. If yes, could you please let me know how?

Also, is there a way to make GATK Funcotator report only the columns of interest, as opposed to all the columns listed above?

Would appreciate your help.

Amit
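VariantFiltration's JEXL expressions operate on whole INFO keys, so the pipe-delimited subfields packed into FUNCOTATION generally have to be extracted before numeric filtering; a minimal sketch with bcftools and awk, assuming gnomAD_exome_AF is the 28th Funcotation field per the header above and using 0.01 purely as an example threshold:

```
# Sketch: extract FUNCOTATION, split it on '|', and keep records with gnomAD_exome_AF < 0.01.
# N=28 assumes gnomAD_exome_AF is the 28th field in the Funcotation field list above.
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/FUNCOTATION\n' funcotated.vcf.gz \
  | awk -F'\t' -v N=28 '{
      gsub(/[\[\]]/, "", $5);        # Funcotator wraps the annotation string in [ ]
      split($5, f, "\\|");
      if (f[N] != "" && f[N] + 0 < 0.01) print
    }' > rare_candidates.tsv
```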

A point mutation was not called by Mutect2


I used Mutect2 to call somatic variants in cancer WES samples, but a certain known point mutation is not called when the number of reads containing it is low (for instance, 2, or even 4). Is there any way to increase the sensitivity of variant calling?

Help understanding genotype in spanning deletion notation


Hello,

I am working with a collaborator who had sequencing and variant calling done at the Broad.
The resulting multi-sample VCF has spanning deletion notation.
I am trying to understand what the genotype call is for the patient at the following site (columns suppressed):

chr2 178535858 . GA G,GAA ... GT:AD:DP:GQ:PL 0/1:38,8,5:51:30:30,0,822,74,770,1497
chr2 178535859 rs202214630 A *,G ... GT:AD:DP:GQ:PL 0/2:19,8,24:51:99:752,342,731,0,301,685

It seems to me to be a case that isn't handled in the spanning deletions tutorial, namely more than two alleles appearing in a patient with a spanning deletion: https://software.broadinstitute.org/gatk/documentation/article?id=6926

The sample has reads with the reference (38), a 1 bp deletion (8) and a 1 bp insertion (5) at 178535858, and the reference (19), the spanning deletion (8) and the variant (24) at 178535859. How do I tell which genotype GATK determined at 178535859? I don't understand how the 0/1 at 178535858 and the 0/2 at 178535859 together give a diploid genotype.

From the VCF header, I see that the HaplotypeCaller version used was 4.0.10.1.

Thanks for your help,
Rebecca

4.1.4.0 germline copy number - missing denoising_config.json file?


Hi all,

I have a cohort of germline genome samples that I am analyzing by following the steps in this tutorial:
https://gatkforums.broadinstitute.org/gatk/discussion/11684/how-to-call-common-and-rare-germline-copy-number-variants

I am running the following commands:

```
/gatk-4.1.4.0/gatk PreprocessIntervals -R hg19a.fa --bin-length 500 -interval-merging-rule OVERLAPPING_ONLY -O hg19a.interval_list

#This is run on ~200 germline genome BAM files
/gatk-4.1.4.0/gatk CollectReadCounts  -I myBam -L hg19a.interval_list --interval-merging-rule OVERLAPPING_ONLY -O myBam.hdf5

#Trim the interval list
/home/rcorbett/bin/gatk-4.1.4.0/gatk AnnotateIntervals -L hg19a.interval_list -R hg19a.fa -imr OVERLAPPING_ONLY -O hg19a.interval_list.annotated
/home/rcorbett/bin/gatk-4.1.4.0/gatk FilterIntervals -L hg19a.interval_list -imr OVERLAPPING_ONLY --annotated-intervals hg19a.interval_list.annotated -I myBam.hdf5 -I P01447_1_lane_dupsFlagged.bam.hdf5 -I myBam2.hdf5 -I myBam3.hdf5 -I myBam4.hdf5 ... -O hg19a.interval_list.annotated.filtered

#Get the ploidy estimates
gatk-4.1.4.0/gatk DetermineGermlineContigPloidy -L hg19a.interval_list.annotated.filtered.list -imr OVERLAPPING_ONLY --contig-ploidy-priors ploidy_priors.tsv --output . --output-prefix ploidy --verbosity DEBUG -I myBam.hdf5 -I myBam2.hdf5 -I myBam2.hdf5 ...

#Get the ploidy estimates for a single sample
gatk-4.1.4.0/gatk DetermineGermlineContigPloidy --model ploidy-model -I myBam2.hdf5 -O  . --output-prefix ploidy-case --verbosity DEBUG
```

The last command listed above gives me an error before failing. Here's the full trace (with some IDs and paths changed):

Using GATK jar /gatk-4.1.4.0/gatk-package-4.1.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/rcorbett/bin/gatk-4.1.4.0/gatk-package-4.1.4.0-local.jar GermlineCNVCaller --run-mode CASE --model ploidy-model -I P01447_1_lane_dupsFlagged.bam.hdf5 --contig-ploidy-calls polidy-case-calls --output case_output --output-prefix case_calls --verbosity DEBUG
13:17:59.610 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/rcorbett/bin/gatk-4.1.4.0/gatk-package-4.1.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
13:17:59.624 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/libgkl_compression2553444245973287106.so
13:17:59.885 INFO GermlineCNVCaller - ------------------------------------------------------------
13:17:59.885 INFO GermlineCNVCaller - The Genome Analysis Toolkit (GATK) v4.1.4.0
13:17:59.885 INFO GermlineCNVCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
13:17:59.886 INFO GermlineCNVCaller - Executing as rcorbett@xyz on Linux v2.6.32-573.8.1.el6.x86_64 amd64
13:17:59.886 INFO GermlineCNVCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_66-b17
13:17:59.886 INFO GermlineCNVCaller - Start Date/Time: November 26, 2019 1:17:59 PM PST
13:17:59.886 INFO GermlineCNVCaller - ------------------------------------------------------------
13:17:59.886 INFO GermlineCNVCaller - ------------------------------------------------------------
13:17:59.887 INFO GermlineCNVCaller - HTSJDK Version: 2.20.3
13:17:59.887 INFO GermlineCNVCaller - Picard Version: 2.21.1
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.BUFFER_SIZE : 131072
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.CREATE_INDEX : false
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.CREATE_MD5 : false
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.CUSTOM_READER_FACTORY :
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.DISABLE_SNAPPY_COMPRESSOR : false
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.EBI_REFERENCE_SERVICE_URL_MASK : https://www.ebi.ac.uk/ena/cram/md5/%s
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.NON_ZERO_BUFFER_SIZE : 131072
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.REFERENCE_FASTA : null
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.USE_CRAM_REF_DOWNLOAD : false
13:17:59.890 DEBUG ConfigFactory - Configuration file values:
13:17:59.896 DEBUG ConfigFactory - gcsMaxRetries = 20
13:17:59.897 DEBUG ConfigFactory - gcsProjectForRequesterPays =
13:17:59.897 DEBUG ConfigFactory - gatk_stacktrace_on_user_exception = false
13:17:59.897 DEBUG ConfigFactory - samjdk.use_async_io_read_samtools = false
13:17:59.897 DEBUG ConfigFactory - samjdk.use_async_io_write_samtools = true
13:17:59.897 DEBUG ConfigFactory - samjdk.use_async_io_write_tribble = false
13:17:59.897 DEBUG ConfigFactory - samjdk.compression_level = 2
13:17:59.897 DEBUG ConfigFactory - spark.kryoserializer.buffer.max = 512m
13:17:59.897 DEBUG ConfigFactory - spark.driver.maxResultSize = 0
13:17:59.897 DEBUG ConfigFactory - spark.driver.userClassPathFirst = true
13:17:59.897 DEBUG ConfigFactory - spark.io.compression.codec = lzf
13:17:59.897 DEBUG ConfigFactory - spark.executor.memoryOverhead = 600
13:17:59.897 DEBUG ConfigFactory - spark.driver.extraJavaOptions =
13:17:59.897 DEBUG ConfigFactory - spark.executor.extraJavaOptions =
13:17:59.897 DEBUG ConfigFactory - codec_packages = [htsjdk.variant, htsjdk.tribble, org.broadinstitute.hellbender.utils.codecs]
13:17:59.898 DEBUG ConfigFactory - read_filter_packages = [org.broadinstitute.hellbender.engine.filters]
13:17:59.898 DEBUG ConfigFactory - annotation_packages = [org.broadinstitute.hellbender.tools.walkers.annotator]
13:17:59.898 DEBUG ConfigFactory - cloudPrefetchBuffer = 40
13:17:59.898 DEBUG ConfigFactory - cloudIndexPrefetchBuffer = -1
13:17:59.898 DEBUG ConfigFactory - createOutputBamIndex = true
13:17:59.898 INFO GermlineCNVCaller - Deflater: IntelDeflater
13:17:59.898 INFO GermlineCNVCaller - Inflater: IntelInflater
13:17:59.898 INFO GermlineCNVCaller - GCS max retries/reopens: 20
13:17:59.898 INFO GermlineCNVCaller - Requester pays: disabled
13:17:59.898 INFO GermlineCNVCaller - Initializing engine
13:17:59.903 DEBUG ScriptExecutor - Executing:
13:17:59.903 DEBUG ScriptExecutor - python
13:17:59.903 DEBUG ScriptExecutor - -c
13:17:59.903 DEBUG ScriptExecutor - import gcnvkernel

/miniconda3/envs/gatk/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
13:18:11.641 DEBUG ScriptExecutor - Result: 0
13:18:11.641 INFO GermlineCNVCaller - Done initializing engine
13:18:19.503 INFO GermlineCNVCaller - Running the tool in CASE mode...
13:18:19.504 INFO GermlineCNVCaller - Validating and aggregating data from input read-count files...
13:18:20.588 INFO GermlineCNVCaller - Aggregating read-count file myBam2.hdf5 (1 / 1)
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
13:18:30.759 DEBUG ScriptExecutor - Executing:
13:18:30.759 DEBUG ScriptExecutor - python
13:18:30.759 DEBUG ScriptExecutor - /tmp/case_denoising_calling.6711074121962911645.py
13:18:30.759 DEBUG ScriptExecutor - --ploidy_calls_path=ploidy-case-calls
13:18:30.759 DEBUG ScriptExecutor - --output_calls_path=/case_output/case_calls-calls
13:18:30.759 DEBUG ScriptExecutor - --output_tracking_path=/case_output/case_calls-tracking
13:18:30.759 DEBUG ScriptExecutor - --input_model_path=/ploidy-model
13:18:30.759 DEBUG ScriptExecutor - --read_count_tsv_files
13:18:30.759 DEBUG ScriptExecutor - /tmp/sample-01045833658862224981.tsv
13:18:30.759 DEBUG ScriptExecutor - --psi_s_scale=1.000000e-04
13:18:30.759 DEBUG ScriptExecutor - --mapping_error_rate=1.000000e-02
13:18:30.765 DEBUG ScriptExecutor - --depth_correction_tau=1.000000e+04
13:18:30.765 DEBUG ScriptExecutor - --q_c_expectation_mode=hybrid
13:18:30.765 DEBUG ScriptExecutor - --p_alt=1.000000e-06
13:18:30.765 DEBUG ScriptExecutor - --cnv_coherence_length=1.000000e+04
13:18:30.765 DEBUG ScriptExecutor - --max_copy_number=5
13:18:30.765 DEBUG ScriptExecutor - --learning_rate=1.000000e-02
13:18:30.765 DEBUG ScriptExecutor - --adamax_beta1=9.000000e-01
13:18:30.765 DEBUG ScriptExecutor - --adamax_beta2=9.900000e-01
13:18:30.765 DEBUG ScriptExecutor - --log_emission_samples_per_round=50
13:18:30.765 DEBUG ScriptExecutor - --log_emission_sampling_rounds=10
13:18:30.765 DEBUG ScriptExecutor - --log_emission_sampling_median_rel_error=5.000000e-03
13:18:30.765 DEBUG ScriptExecutor - --max_advi_iter_first_epoch=5000
13:18:30.765 DEBUG ScriptExecutor - --max_advi_iter_subsequent_epochs=200
13:18:30.765 DEBUG ScriptExecutor - --min_training_epochs=10
13:18:30.765 DEBUG ScriptExecutor - --max_training_epochs=50
13:18:30.765 DEBUG ScriptExecutor - --initial_temperature=1.500000e+00
13:18:30.765 DEBUG ScriptExecutor - --num_thermal_advi_iters=2500
13:18:30.765 DEBUG ScriptExecutor - --convergence_snr_averaging_window=500
13:18:30.765 DEBUG ScriptExecutor - --convergence_snr_trigger_threshold=1.000000e-01
13:18:30.765 DEBUG ScriptExecutor - --convergence_snr_countdown_window=10
13:18:30.765 DEBUG ScriptExecutor - --max_calling_iters=10
13:18:30.765 DEBUG ScriptExecutor - --caller_update_convergence_threshold=1.000000e-03
13:18:30.765 DEBUG ScriptExecutor - --caller_internal_admixing_rate=7.500000e-01
13:18:30.765 DEBUG ScriptExecutor - --caller_external_admixing_rate=1.000000e+00
13:18:30.765 DEBUG ScriptExecutor - --disable_caller=false
13:18:30.765 DEBUG ScriptExecutor - --disable_sampler=false
13:18:30.765 DEBUG ScriptExecutor - --disable_annealing=false
13:18:40.525 INFO root - Loading modeling interval list from the provided model...
13:19:13.378 INFO root - The model contains 5631393 intervals and 24 contig(s)
13:19:13.379 INFO root - Loading 1 read counts file(s)...
13:20:06.039 INFO gcnvkernel.io.io_metadata - Loading germline contig ploidy and global read depth metadata...
13:20:06.080 INFO root - Loading denoising model configuration from the provided model...
miniconda3/envs/gatk/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "/tmp/case_denoising_calling.6711074121962911645.py", line 176, in
update_args_dict_from_saved_model(args.input_model_path, args_dict)
File "/tmp/case_denoising_calling.6711074121962911645.py", line 106, in update_args_dict_from_saved_model
with open(os.path.join(input_model_path, "denoising_config.json"), 'r') as fp:
FileNotFoundError: [Errno 2] No such file or directory: '/ploidy-model/denoising_config.json'
13:20:13.878 DEBUG ScriptExecutor - Result: 1
13:20:13.880 INFO GermlineCNVCaller - Shutting down engine
[November 26, 2019 1:20:13 PM PST] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 2.24 minutes.
Runtime.totalMemory()=5160566784
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
......

In the supplied example data there is a denoising_config.json file in each sample's folder, but I don't get one in the folders created by the above commands. Is that the source of my error?
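For comparison, a CASE-mode GermlineCNVCaller run normally points --model at the <output-prefix>-model directory written by a previous COHORT-mode GermlineCNVCaller run, which is where denoising_config.json would be expected, rather than at the DetermineGermlineContigPloidy output; a sketch with placeholder directory names:

```
# Sketch: CASE-mode gCNV calling against a model trained by a COHORT-mode run.
# "cohort_gcnv-model" stands in for the <output-prefix>-model directory produced by
# a previous `GermlineCNVCaller --run-mode COHORT` run.
gatk GermlineCNVCaller \
  --run-mode CASE \
  --model cohort_gcnv-model \
  -I myBam2.hdf5 \
  --contig-ploidy-calls ploidy-case-calls \
  --output case_output \
  --output-prefix case_calls \
  --verbosity DEBUG
```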

GenomicsDBImport not completing for mixed ploidy samples

I'm attempting to call variants on whole genomes for about 500 Illumina paired-end samples with varying ploidy (haploid to tetraploid). I'm running a fairly standard uBAM-to-GVCF pipeline, with HaplotypeCaller given the ploidy (1, 2, 3, or 4) and run in -ERC GVCF mode. I then collect the GVCFs with GenomicsDBImport using a batch size of 50 and run GenotypeGVCFs on the combined database. The interval list passed to GenomicsDBImport is just each chromosome on a separate line. I'm using GATK v4.1.1.0.

Command:
```
${GATK_DIR}/gatk GenomicsDBImport \
--java-options "-Xmx110g -Xms110g" \
-R ${REF} \
--variant ${FILE_LIST} \
-L ${SCRIPT_DIR}/GATK_Style_Interval.list \
--genomicsdb-workspace-path ${WORK_DIR}/GenomicsDB_20190912 \
--batch-size 50 \
--tmp-dir=${WORK_DIR}/
```

GenomicsDBImport appears to run without error, but only shows progress for the first 6000 bp before moving on to the next batch. When I run SelectVariants on the created database, I only get variants up to position 6716 in the first interval. When I try to run GenotypeGVCFs on it, I get a strange error:
htsjdk.tribble.TribbleException: Invalid block size -1570639203

My first assumption was that one of the GVCFs was malformed because HaplotypeCaller failed after the first 6000 bp, but I've verified that the GVCFs all completed and have 'validated' them with ValidateVariants using GATK v4.1.3.0. When I grep for the particular position in the samples' GVCFs I don't find anything out of the ordinary. I would use CombineGVCFs, but it fails when trying to combine mixed ploidies.

Any ideas on troubleshooting or experience with problems like this?
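One way to check how far an import actually got is to stream the workspace back out with SelectVariants against the gendb:// URI, one interval at a time; a sketch with placeholder paths and a placeholder contig name:

```
# Sketch: export a single interval from the GenomicsDB workspace to inspect its extent.
# "chr1" is a placeholder for the first interval in GATK_Style_Interval.list.
${GATK_DIR}/gatk SelectVariants \
  -R ${REF} \
  -V gendb://${WORK_DIR}/GenomicsDB_20190912 \
  -L chr1 \
  -O check_chr1.vcf.gz
```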

Facing an issue with the IlluminaBasecallsToSam module; working on Ubuntu 18 but not on CentOS 7


Dear GATK Support Team,

We are facing issues while running the IlluminaBasecallsToSam module on CentOS 7. We tried running each lane separately to generate unaligned BAMs, and we got "OutOfMemory: Java heap space" (error image attached).

Below are the Lane 1 command-line arguments.

Command:
gatk IlluminaBasecallsToSam --BASECALLS_DIR Data/Intensities/BaseCalls/ --BARCODES_DIR Barcode1_dir/ --RUN_BARCODE Lane1 --LIBRARY_PARAMS library_param.txt --LANE 1 --NUM_PROCESSORS "num" --READ_STRUCTURE "pattern"

The above command works on Ubuntu 18. Our production servers run CentOS 7, so kindly help us fix this issue on CentOS 7.
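Since the failure is Java heap exhaustion, the first thing worth checking is how much heap the wrapper is actually granting; it can be raised explicitly through --java-options (the 32g below is only an example value and should be sized to the machine):

```
# Sketch: give the JVM an explicit, larger heap for IlluminaBasecallsToSam.
gatk --java-options "-Xmx32g" IlluminaBasecallsToSam \
  --BASECALLS_DIR Data/Intensities/BaseCalls/ \
  --BARCODES_DIR Barcode1_dir/ \
  --RUN_BARCODE Lane1 \
  --LIBRARY_PARAMS library_param.txt \
  --LANE 1 \
  --NUM_PROCESSORS "num" \
  --READ_STRUCTURE "pattern"
```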

Thank you
Krithika S

M2 and GDBI for PON: [E::vcf_parse_format] Invalid character '.' in 'AF' FORMAT field at chr1:16949


GATK 4.1.1.0, local linux server

Hi,

I ran Mutect2 on some WES normal samples:

${gatk} Mutect2 \
-R ${hg38} \
-I "${sample}.bam" \ 
-O "${sample}.vcf.gz" \
-L ${interval} \
-ip 5 \
--max-mnp-distance 0

and then GenomicsDBImport:

${gatk} GenomicsDBImport \
-R ${hg38} \
-V "${sample1}.vcf.gz" \
-V "${sample2}.vcf.gz" \
--batch-size 1 --reader-threads 1 \
--genomicsdb-workspace-path "GDBI_pon" \
-L chr1

Here is the error:

```
13:18:45.329 INFO  GenomicsDBImport - Done initializing engine
13:18:45.517 INFO  GenomicsDBImport - Vid Map JSON file will be written to /home/manolis/prove/GDBI_pon/GDBI_pon/vidmap.json
13:18:45.517 INFO  GenomicsDBImport - Callset Map JSON file will be written to /home/manolis/prove/GDBI_pon/GDBI_pon/callset.json
13:18:45.517 INFO  GenomicsDBImport - Complete VCF Header will be written to /home/manolis/prove/GDBI_pon/GDBI_pon/vcfheader.vcf
13:18:45.517 INFO  GenomicsDBImport - Importing to array - /home/manolis/prove/GDBI_pon/GDBI_pon/genomicsdb_array
13:18:45.517 INFO  ProgressMeter - Starting traversal
13:18:45.517 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
13:18:45.820 INFO  GenomicsDBImport - Importing batch 1 with 1 samples
[E::vcf_parse_format] Invalid character '.' in 'AF' FORMAT field at chr1:14653
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe721be816b, pid=12942, tid=0x00007fe7801f7700
#
# JRE version: OpenJDK Runtime Environment (8.0_152-b12) (build 1.8.0_152-release-1056-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.152-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtiledbgenomicsdb8166440819035845683.so+0x35416b]  bcf_unpack+0x36b
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/manolis/prove/GDBI_pon/hs_err_pid12942.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
```

Here are the header line from the vcf.gz and the variant record:

##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">

chr1    14653   .   C   T   .   .   DP=13;ECNT=2;MBQ=20,30;MFRL=212,211;MMQ=43,33;MPOS=40;POPAF=7.30;TLOD=10.18 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:9,4:0.333:13:6,2:3,1:0|1:14653_C_T:14653:5,4,3,1

Here is the VCF validation:

${gatk} ValidateVariants \
-R ${hg38} \
-V "${sample1}.vcf.gz" \
-L ${interval} \
-ip 5

No warnings at all ...

When I process the "${sample1}.vcf.gz" with:

bcftools annotate -x FORMAT/AF "${sample1}.vcf.gz" -O z -o "${sample1}_noAF.vcf.gz"

and then run GenomicsDBImport, I do not get any error ...

Any suggestions, please?
Many thanks


Best strategy to "fix" the HaplotypeCaller / GenotypeGVCFs "missing DP field" bug?


Hi,

I've run into the already-reported bug (http://gatkforums.broadinstitute.org/dsde/discussion/5598/missing-depth-dp-after-haplotypecaller) of the missing DP FORMAT field in my callings.

I've run the following (relevant) commands:

HaplotypeCaller -> generate a GVCF:

    java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
       -T HaplotypeCaller \
       -R ${ref} \
       -I ${NEWTMPDIR}/${prefix}.realigned.fixed.recal.bam \
       -L ${reg} \
       -ERC GVCF \
       -nct ${nct} \
       --genotyping_mode DISCOVERY \
       -stand_emit_conf 10 \
       -stand_call_conf 30  \
       -o ${prefix}.raw_variants.annotated.g.vcf \
       -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage

That generates GVCF files that DO HAVE the DP field for all reference positions, but DO NOT HAVE the DP FORMAT field for any called variant (though they still keep DP in the INFO field):

18      11255   .       T       <NON_REF>       .       .       END=11256       GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
18      11257   .       C       G,<NON_REF>     229.77  .       BaseQRankSum=1.999;DP=20;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-1.377;ReadPosRankSum=0.489      GT:AD:GQ:PL:SB  0/1:10,8,0:99:258,0,308,288
18      11258   .       G       <NON_REF>       .       .       END=11260       GT:DP:GQ:MIN_DP:PL      0/0:17:48:16:0,48,530

Later, I ran GenotypeGVCFs, joining all the samples, with the following command:

java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R ${ref} \
   -L ${pos} \
   -o ${prefix}.raw_variants.annotated.vcf \
   --variant ${variant} [...]

This generated VCF files where the DP field is present in the FORMAT description; it IS present for the homozygous-REF samples, but IS MISSING for any heterozygous or homozygous-ALT samples.

22  17280388    .   T   C   18459.8 PASS    AC=34;AF=0.340;AN=100;BaseQRankSum=-2.179e+00;DP=1593;FS=2.526;InbreedingCoeff=0.0196;MLEAC=34;MLEAF=0.340;MQ=60.00;MQRankSum=0.196;QD=19.76;ReadPosRankSum=-9.400e-02;SOR=0.523    GT:AD:DP:GQ:PL  0/0:29,0:29:81:0,81,1118    0/1:20,22:.:99:688,0,682    1/1:0,27:.:81:1018,81,0 0/0:22,0:22:60:0,60,869 0/1:20,10:.:99:286,0,664    0/1:11,17:.:99:532,0,330    0/1:14,14:.:99:431,0,458    0/0:28,0:28:81:0,81,1092    0/0:35,0:35:99:0,99,1326    0/1:14,20:.:99:631,0,453    0/1:13,16:.:99:511,0,423    0/1:38,29:.:99:845,0,1231   0/1:20,10:.:99:282,0,671    0/0:22,0:22:63:0,63,837 0/1:8,15:.:99:497,0,248 0/0:32,0:32:90:0,90,1350    0/1:12,12:.:99:378,0,391    0/1:14,26:.:99:865,0,433    0/0:37,0:37:99:0,105,1406   0/0:44,0:44:99:0,120,1800   0/0:24,0:24:72:0,72,877 0/0:30,0:30:84:0,84,1250    0/0:31,0:31:90:0,90,1350    0/1:15,25:.:99:827,0,462    0/0:35,0:35:99:0,99,1445    0/0:29,0:29:72:0,72,1089    1/1:0,32:.:96:1164,96,0 0/0:21,0:21:63:0,63,809 0/1:21,15:.:99:450,0,718    1/1:0,40:.:99:1539,120,0    0/0:20,0:20:60:0,60,765 0/1:11,9:.:99:293,0,381 1/1:0,35:.:99:1306,105,0    0/1:18,14:.:99:428,0,606    0/0:32,0:32:90:0,90,1158    0/1:24,22:.:99:652,0,816    0/0:20,0:20:60:0,60,740 1/1:0,30:.:90:1120,90,0 0/1:15,13:.:99:415,0,501    0/0:31,0:31:90:0,90,1350    0/1:15,18:.:99:570,0,480    0/1:22,13:.:99:384,0,742    0/1:19,11:.:99:318,0,632    0/0:28,0:28:75:0,75,1125    0/0:20,0:20:60:0,60,785 1/1:0,27:.:81:1030,81,0 0/0:30,0:30:90:0,90,1108    0/1:16,16:.:99:479,0,493    0/1:14,22:.:99:745,0,439    0/0:31,0:31:90:0,90,1252
22  17280822    .   G   A   5491.56 PASS    AC=8;AF=0.080;AN=100;BaseQRankSum=1.21;DP=1651;FS=0.000;InbreedingCoeff=-0.0870;MLEAC=8;MLEAF=0.080;MQ=60.00;MQRankSum=0.453;QD=17.89;ReadPosRankSum=-1.380e-01;SOR=0.695   GT:AD:DP:GQ:PL  0/0:27,0:27:72:0,72,1080    0/0:34,0:34:90:0,90,1350    0/1:15,16:.:99:528,0,491    0/0:27,0:27:60:0,60,900 0/1:15,22:.:99:699,0,453    0/0:32,0:32:90:0,90,1350    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:87:0,87,1305    0/0:40,0:40:99:0,108,1620   0/1:20,9:.:99:258,0,652 0/0:26,0:26:72:0,72,954 0/1:16,29:.:99:943,0,476    0/0:27,0:27:69:0,69,1035    0/0:19,0:19:48:0,48,720 0/0:32,0:32:81:0,81,1215    0/0:36,0:36:99:0,99,1435    0/0:34,0:34:99:0,99,1299    0/0:35,0:35:99:0,102,1339   0/0:38,0:38:99:0,102,1520   0/0:36,0:36:99:0,99,1476    0/0:31,0:31:81:0,81,1215    0/0:31,0:31:75:0,75,1125    0/0:35,0:35:99:0,99,1485    0/0:37,0:37:99:0,99,1485    0/0:35,0:35:90:0,90,1350    0/0:20,0:20:28:0,28,708 0/1:16,22:.:99:733,0,474    0/0:32,0:32:90:0,90,1350    0/0:35,0:35:99:0,99,1467    0/1:27,36:.:99:1169,0,831   0/0:28,0:28:75:0,75,1125    0/0:36,0:36:81:0,81,1215    0/0:35,0:35:90:0,90,1350    0/0:28,0:28:72:0,72,1080    0/0:31,0:31:81:0,81,1215    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:84:0,84,1260    0/0:39,0:39:99:0,101,1575   0/0:37,0:37:96:0,96,1440    0/0:34,0:34:99:0,99,1269    0/0:30,0:30:81:0,81,1215    0/0:36,0:36:99:0,99,1485    0/1:17,17:.:99:567,0,530    0/0:26,0:26:72:0,72,1008    0/0:18,0:18:45:0,45,675 0/0:33,0:33:84:0,84,1260    0/0:25,0:25:61:0,61,877 0/1:9,21:.:99:706,0,243 0/0:35,0:35:81:0,81,1215    0/0:35,0:35:99:0,99,1485

I've just discovered this issue, and I need to run an analysis of the differential depth of coverage in different regions, and of whether there is a DP bias between called and not-called samples.

I have thousands of files and I've spent almost a year generating all these callings, so redoing the calling is not an option.

What would be the best/fastest strategy to either fix my final VCFs with the DP data present in all the intermediate gVCFs (preferably) or, at least, to extract this data for all SNPs and samples?

Thanks in advance,

Txema

PS: Re-calling the individual samples from BAM files is not an option. Fixing the individual gVCFs and redoing the joint GenotypeGVCFs could be.
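One relatively cheap way to recover per-site depths is to read them back out of the intermediate gVCFs, where DP is still present in the INFO field of variant records; a sketch using GATK3's VariantsToTable, assuming the sites of interest are supplied as an interval list (the list name is a placeholder):

```
# Sketch (GATK3): dump CHROM/POS/DP from one gVCF, restricted to the sites of interest.
java -Xmx4g -jar ${gatkpath}/GenomeAnalysisTK.jar \
   -T VariantsToTable \
   -R ${ref} \
   -V ${prefix}.raw_variants.annotated.g.vcf \
   -L sites_of_interest.intervals \
   -F CHROM -F POS -F DP \
   --allowMissingData \
   -o ${prefix}.DP.table
```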

GATK v4.1.0.0 ValidateVariants, gVCF mode, error; not in v4.0.11.0


GATK v4.0.11.0 & v4.1.0.0, linux server, bash

Hi,

I was running the following commands:

${GATK4} --java-options '-Xmx10g -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ConcGCThreads=1 -XX:ParallelGCThreads=2' HaplotypeCaller -R /shared/resources/hgRef/hg38/Homo_sapiens_assembly38.fasta -I /home/manolis/GATK4/2.BQSR/bqsr_PROVA/WES_16-1239_bqsr.bam -O "PROVA_${version}.g.vcf.gz" -L /home/manolis/GATK4/DB/hg38_SureSelectV6noUTR_S07604514_HC_1-22_XY.intervals -ip 100 -ERC GVCF --max-alternate-alleles 3 -ploidy 2 -A StrandBiasBySample --tmp-dir /home/manolis/GATK4/tmp/

${GATK4} --java-options '-Xmx10g -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:ConcGCThreads=1 -XX:ParallelGCThreads=2' ValidateVariants -R /shared/resources/hgRef/hg38/Homo_sapiens_assembly38.fasta -V "PROVA_${version}.g.vcf.gz" -L /home/manolis/GATK4/DB/hg38_SureSelectV6noUTR_S07604514_HC_1-22_XY.intervals -ip 100 -gvcf -Xtype ALLELES --tmp-dir /home/manolis/GATK4/tmp/

and I created the following files:

HaplotypeCaller v4.0.11.0 -> output "PROVA_v40110.g.vcf.gz"
HaplotypeCaller v4.1.0.0 -> output "PROVA_v4100.g.vcf.gz"

When I go to validate them, I get the following results:

1) ValidateVariants v4.0.11.0 -> input "PROVA_v40110.g.vcf.gz" ........ Everything OK !!!

2) ValidateVariants v4.0.11.0 -> input "PROVA_v4100.g.vcf.gz" ........ Everything OK !!!

3) ValidateVariants v4.1.0.0 -> input "PROVA_v4100.g.vcf.gz" ........ ERROR !!!

***********************************************************************
A USER ERROR has occurred: In a GVCF all records must ordered. Record: [VC Unknown @ chr2:41350-41765 Q. of type=SYMBOLIC alleles=[A*, <NON_REF>] attr={END=41765} filters= covers a position previously traversed.
***********************************************************************

4) ValidateVariants v4.1.0.0 -> input "PROVA_v40110.g.vcf.gz" ........ ERROR !!!

***********************************************************************
A USER ERROR has occurred: In a GVCF all records must ordered. Record: [VC Unknown @ chr2:41350-41765 Q. of type=SYMBOLIC alleles=[A*, <NON_REF>] attr={END=41765} filters= covers a position previously traversed.
***********************************************************************

If I create a vcf.gz file with HaplotypeCaller v4.1.0.0 (standard mode, NOT gVCF) and validate it with ValidateVariants v4.1.0.0, I do not get any error!

For now... can I validate my g.vcf.gz files generated with HC v4.1.0.0 using ValidateVariants from v4.0.11.0?

Thanks

Testing FPGA implementation of HaplotypeCaller (PairHMM)


Hi,
We are two researchers from the Politecnico di Milano.
We are trying to test the FPGA implementation of the HaplotypeCaller (PairHMM) on GATK 3.8-0-ge9d806836, using a Terasic DE5a-Net (Arria 10, 10AX115N3F45I2SG).

According to the version highlights for GATK 3.8 (https://gatkforums.broadinstitute.org/gatk/discussion/10063/version-highlights-for-gatk-version-3-8) FPGA support was added to pairHMM, and it should be used if the appropriate hardware is detected.
However, from our tests it seems that the CPU implementation is being used instead. Is there a way to force the use of the FPGA implementation?

Kind regards,
Chiara & Alberto

Why are there <NON_REF>s in the VCF output of GenotypeGVCFs?

I have 13 samples of re-sequenced whole genomes from several insect species; they were aligned to a reference genome of a closely related species using BWA-MEM. When using the pipeline "HaplotypeCaller => GenomicsDBImport => GenotypeGVCFs" in GATK (invariant sites included), I ended up with VCFs that didn't pass ValidateVariants (with error messages shown as WARN); however, when invariant sites were not requested when running GenotypeGVCFs, ValidateVariants gave no error messages.

The error messages read like:

21:42:18.766 WARN ValidateVariants - ***** Input 03_vcfs_sg/Joint.final.InvariantIncluded.HiC_scaffold_9.sorted.vcf fails strict validation: one or more of the ALT allele(s) for the record at position HiC_scaffold_9:9755 are not observed at all in the sample genotypes of type: *****
21:42:18.767 WARN ValidateVariants - ***** Input 03_vcfs_sg/Joint.final.InvariantIncluded.HiC_scaffold_9.sorted.vcf fails strict validation: one or more of the ALT allele(s) for the record at position HiC_scaffold_9:9775 are not observed at all in the sample genotypes of type: *****
21:42:21.138 WARN ValidateVariants - ***** Input 03_vcfs_sg/Joint.final.InvariantIncluded.HiC_scaffold_9.sorted.vcf fails strict validation: one or more of the ALT allele(s) for the record at position HiC_scaffold_9:67171 are not observed at all in the sample genotypes of type: *****

When I went back to check the VCF file, these lines looked like:

HiC_scaffold_9 9755 . T A,<NON_REF> . . . GT:AD:DP:RGQ ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0

HiC_scaffold_9 9775 . A G,<NON_REF> . . . GT:AD:DP:RGQ ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0:0:0 ./.:0,0,0

HiC_scaffold_9 67171 . A T,<NON_REF> . . ExcessHet=3.01 GT:AD:DP:PGT:PID:RGQ ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:.:.:0 ./.:0,0,0:0:0|1:67160_C_G:3

A lot more similar lines gave out tons of similar error messages. These positions are not found in the VCF files when invariant sites are not included during joint genotyping.

To trace the origin of these variant calls, I went back to the original GVCF files. For example, for the site "HiC_scaffold_9 9755", the first 12 samples have no explicit information on this site (which means it is considered the same as the reference?). The only difference comes from the last sample, which has:

HiC_scaffold_9 9755 . T A,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT ./.

and

HiC_scaffold_9 9775 . A G,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT ./.

etc.

It seems these individual genotypes determined by HaplotypeCaller are poorly supported. Why does this information still get passed to the final VCF output after joint genotyping?

I am using GATK 4.1.2.0, and the commands are as follows:

gatk HaplotypeCaller \
--java-options "-Xmx4g -XX:ParallelGCThreads=1" \
-R $RefGenome \
-I ${sp}.dedup.sorted.bam \
-O ${sp}.raw.${Interval}.g.vcf \
--emit-ref-confidence GVCF \
-L ${Interval} \

gatk GenomicsDBImport \
--java-options "-Xmx4g -XX:ParallelGCThreads=1" \
--genomicsdb-workspace-path ${PATH_TO_GENOMICDB} \
-L ${Intervals} \
--sample-name-map CohortSample_map/${IntervalName}.list \
--reader-threads 1 \

gatk GenotypeGVCFs \
--java-options "-Xmx4g -XX:ParallelGCThreads=1" \
-R $RefGenome \
-V gendb://${PATH_TO_GENOMICDB} \
-O Joint.final.${Interval}.vcf \
--tmp-dir=${DIR}/tmp \
--heterozygosity=0.022 \
--include-non-variant-sites=true

How can I resolve this error message? Or is it something that should be filtered out in the first place? Thank you
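For tracing records like these back to the per-sample gVCFs (as was done above for HiC_scaffold_9 9755), a single position can be pulled out of one gVCF with SelectVariants and a one-position interval; a sketch reusing the file names from the commands above:

```
# Sketch: extract one position from one sample's gVCF for closer inspection.
gatk SelectVariants \
  -R $RefGenome \
  -V ${sp}.raw.${Interval}.g.vcf \
  -L HiC_scaffold_9:9755 \
  -O check_HiC_scaffold_9_9755.vcf
```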

Excessive memory usage with MuTect2


I am trying to use MuTect2 for somatic variant discovery. I am running GATK v4.1.4.0 with Java HotSpot(TM) 64-Bit Server VM v1.8.0_112-b15.

When running either a single sample (tumor-only mode, to generate a panel of normals) or a tumor together with a normal sample, memory usage is very high, > 400 GB RAM. This is not the case initially, but memory usage gradually climbs during the run. The data are not whole-genome sequence data, but rather RADseq/GBS data. This means much of the genome is not covered by reads, but where there are reads, they start and stop in similar places and cover a ~85 bp region at moderate coverage (around 10X on average). Here is an example of the command I am running (note that I have made some modifications to the standard command to add more memory and obtain additional information for debugging):

java -Xmx384g -XX:-UseGCOverheadLimit -jar ~/bin/gatk-package-4.1.4.0-local.jar Mutect2 -R /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa -I aln_mem_mod_003-S.uniqe.bam -I aln_mem_mod_013-S.uniqe.bam -normal potr-mod_013-S --independent-mates --max-mnp-distance 0 -debug --dont-increase-kmer-sizes-for-cycles -O somatic.vcf.gz

The run generates a VCF file that doesn't have any obvious errors for the regions of the genome it gets to, but it fails to finish before running out of memory.

I have tried the identical command on a different data set with whole-genome sequences and do not see the same memory issue. Thus, I think the memory problem stems from the RADseq/GBS data; with that said, I don't know what about RADseq/GBS data would cause such a problem. Additionally, the reference genome I am using for aligning the RADseq/GBS data is highly fragmented (most contigs ~10 kb). Are there any modifications I could make to the command that might solve this problem?
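One way to keep the footprint of a single process bounded (a workaround rather than an explanation of the growth) is to scatter the run over interval subsets with -L and merge the per-shard VCFs afterwards; a sketch with placeholder shard names (the shards could, for example, be produced with SplitIntervals):

```
# Sketch: run Mutect2 on one shard of the genome, repeat per shard, then merge.
java -Xmx32g -jar ~/bin/gatk-package-4.1.4.0-local.jar Mutect2 \
  -R /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa \
  -I aln_mem_mod_003-S.uniqe.bam -I aln_mem_mod_013-S.uniqe.bam \
  -normal potr-mod_013-S \
  --max-mnp-distance 0 \
  -L shard_001.intervals \
  -O somatic.shard_001.vcf.gz

# After all shards have finished:
java -jar ~/bin/gatk-package-4.1.4.0-local.jar MergeVcfs \
  -I somatic.shard_001.vcf.gz -I somatic.shard_002.vcf.gz \
  -O somatic.merged.vcf.gz
```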

CountBasesSpark doesn't work with the -L option


I tested this in 4.1.4.0 and 4.1.4.1.

gatk CountBasesSpark \
     -I input_reads.bam \
     -O base_count.txt

When I run this command it works and produces the correct base_count.txt output.
But I want to count the bases located in an interval file, so:

gatk CountBasesSpark \
     -I input_reads.bam \
     -O base_count.txt\
     -L interval.file

This command cannot run successfully; the errors look like this:

......
9/11/28 17:44:01 INFO NewHadoopRDD: Input split: file:/disks/disk1/data_sample/19NGS14
2/19NGS142.bam:1476395008+33554432
19/11/28 17:44:01 INFO NewHadoopRDD: Input split: file:/disks/disk1/data_sample/19NGS14
2/19NGS142.bam:1509949440+33554432
19/11/28 17:44:01 INFO NewHadoopRDD: Input split: file:/disks/disk1/data_sample/19NGS14
2/19NGS142.bam:704643072+33554432
19/11/28 17:44:02 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 7)
java.util.NoSuchElementException: next on empty iterator
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
        at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
        at scala.collection.Iterator$$anon$13.next(Iterator.scala:469)
......

The interval file is fine because I use it for the whole GATK pipeline.
CountReadsSpark fails with the same error.

Please check this

Thanks.
Chris

Large VCF files after running the GATK SNV + indel pipeline


Hi
Simple question: why do I get large VCF files after filtering variant calls?
I am following your Best Practices pipeline (SNV + indel), with some minor modifications suggested in another thread (with bug fixes for Mutect2).

In brief (not adding LearnReadOrientationModel or Funcotator), here is the pipeline for one whole-exome sequencing sample (run in a Docker container):

gatk Mutect2 -R my_data/reference/hg19/hg19.fa -I my_data/input/CRF.sorted.bam -O my_data/output/CRF_unfiltered.vcf  --independent-mates
gatk GetPileupSummaries -I my_data/input/CRF.sorted.bam -V my_data/reference/ExAc_r1/ExAC_hg19_BiallelicOnly.r1.sites.vep.vcf.gz -L my_data/reference/ExAc_r1/ExAC_hg19_BiallelicOnly.r1.sites.vep.vcf.gz -O my_data/output/CRFpileups.table
gatk CalculateContamination -I my_data/output/CRFpileups.table -O my_data/output/CRFcontamination.table
gatk FilterMutectCalls -R my_data/reference/hg19/hg19.fa -V my_data/output/CRF_unfiltered.vcf  --contamination-table CRFcontamination.table --tumor-segmentation CRFsegments.tsv -O my_data/output/CRF_filtered.vcf

Here are the sizes of each of the files involved (only those specified on the command lines):
CRF.sorted.bam (12.9GB)
CRF_unfiltered.vcf (432.6MB)
CRFpileups.table (1.1MB)
CRFcontamination.table (80B)
CRFsegments.tsv (989B)
CRF_filtered.vcf (558.7MB)

The CRF_filtered.vcf file won't even open in a text editor (e.g. Atom) for visualization. Also, although not included here, the Funcotator output file was very large (4.6GB) as well.
Sorry for the lay question, but is there anything missing here?

Thanks a lot in advance.

Edit: I notice that in the tutorial posted here, the output is not gz-compressed. Can one still designate a vcf.gz output file?
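On the compression question: GATK 4.x chooses the output codec from the -O extension, so naming the output with .vcf.gz produces a block-compressed (bgzipped) file, by default together with a .tbi index; a sketch of the last step:

```
# Sketch: write block-compressed output directly by using a .vcf.gz extension.
gatk FilterMutectCalls -R my_data/reference/hg19/hg19.fa \
  -V my_data/output/CRF_unfiltered.vcf \
  --contamination-table CRFcontamination.table \
  --tumor-segmentation CRFsegments.tsv \
  -O my_data/output/CRF_filtered.vcf.gz
```

The same applies to the Mutect2 -O argument, which would shrink CRF_unfiltered.vcf as well.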


How can I prevent the file header from showing up in gigantic font?


Hi. My question is: when I post to the forum, some parts of my post show up huge, e.g. file headers or error messages. I'm showing a truncated example of a VCF header below. How can I prevent this from happening and show the copy-pasted blocks in normal font?

##fileformat=VCFv4.2

...

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
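The huge text comes from Markdown treating lines that begin with # (as VCF header lines do) as headings. Wrapping the pasted block in a fenced code block keeps it verbatim; a minimal example, assuming the forum accepts Markdown-style triple-backtick fences:

````
```
##fileformat=VCFv4.2
...
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  NA12878
```
````

Indenting each pasted line by four spaces has the same effect in most Markdown flavours.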

Are there issues with using reads coming from different technologies and having different depths?


Hello!

We are analyzing WGS data from 60 samples (6 groups, 10 samples/group) produced on a HiSeq4000. The mean coverage per sample is 25x (the lowest sample is 15x).

Now we realize we need to sequence more samples in order to better estimate the allele frequencies. Due to budget and technical constraints we came down to sequencing 90 samples (6 groups, 15 samples/group) at a target coverage of 5x, this time on a NovaSeq platform.

Our aim is to do population analyses using SNP allele frequencies after combining the HiSeq4000 (25x coverage) data and the NovaSeq (5x coverage) data.

My plan for the new batch (NovaSeq, 5x) is to run it through the steps of GATK's Best Practices up to HaplotypeCaller and then combine it with the original batch (HiSeq4000, 25x) using CombineGVCFs and do joint calling with GenotypeGVCFs.
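A minimal sketch of that combine-and-joint-call step, with placeholder file names (per-sample gVCFs from both batches go into CombineGVCFs, and the merged gVCF is genotyped once):

```
# Sketch: merge gVCFs from both sequencing batches, then joint-genotype them together.
gatk CombineGVCFs \
  -R reference.fasta \
  -V hiseq_sample1.g.vcf.gz -V hiseq_sample2.g.vcf.gz \
  -V novaseq_sample1.g.vcf.gz -V novaseq_sample2.g.vcf.gz \
  -O cohort.g.vcf.gz

gatk GenotypeGVCFs \
  -R reference.fasta \
  -V cohort.g.vcf.gz \
  -O cohort.joint.vcf.gz
```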

I am working with mice samples, so I will do VQSR.

I have the following questions:

  • Is there an issue with jointly calling variants and genotypes using information from different technologies?

  • Is 5x too low to confidently determine genotypes? In other words, would such results be publishable?

A similar thread is here, but the data were produced with the same technology.

Thanks!

GATK3 MuTect2 AF calculation


Dear GATK developers.

I've run GATK3 MuTect2 in tumor-only mode and encountered the following INDEL:
chr12 133218878 . C CAGCCAGAGCAGGTGGGGCCTCCTGTGCCCTCGGGAATCTGAAT . clustered_events;homologous_mapping_event ECNT=6;HCNT=7;MAX_ED=23;MIN_ED=7;TLOD=45.92 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:653,0:0.043:0:0:.:19476,0:308:345

As you can see in the AD field, the alt allele depth is zero. However, AF is 0.043.
I'm aware that the AD field only represents informative reads, so AD can be zero.
I'm curious about two things:
1. How does MuTect2 call variants if there are no informative reads? Does it use all the reads (informative + uninformative) to call variants and just report the informative reads in the AD field?
2. Is there a flag or metric in the VCF which indicates the quality of a variant affected by uninformative reads?

Thank you.

VariantFiltration Invalid argument ' ' error

Hello, I have read a lot of posts trying to solve my problem but I didn't find anything that could help me.
I am trying to hard-filter my .vcf data but I get the same error while running this command:

srun gatk VariantFiltration \
-R ${ref} \
-V /${sample}_snp_recal0.vcf \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
--excludeFiltered
-O /${sample}_snp_filtered_recal0.vcf

I tried running the first line of the filter, changing -filter to --filter-expression: same error.
I tried replacing "QD < 2" with 'QD < 2': same error.
I tried replacing "QD < 2" with "QD lt 2": same error.

A USER ERROR has occurred: Invalid argument ' '.

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /shared/mfs/data/software/miniconda/envs/gatk4-4.0.10.0/share/gatk4-4.0.10.0-0/gatk-package-4.0.10.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /shared/mfs/data/software/miniconda/envs/gatk4-4.0.10.0/share/gatk4-4.0.10.0-0/gatk-package-4.0.10.0-local.jar VariantFiltration
srun: error: cpu-node-14: task 0: Exited with exit code 1

Could anyone help me? I'm sure it's a syntax error but I really don't know how to correct it!
thanks in advance

Chloé
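Messages like Invalid argument ' ' usually come from the shell rather than from the filter expressions themselves: a stray space after a line-continuation backslash is passed to GATK as a literal space argument, and a line such as --excludeFiltered without a trailing backslash cuts the command short so the -O line is parsed separately. A cleaned-up sketch of the same step, assuming the intent of --excludeFiltered was to drop failing records afterwards (VariantFiltration itself only marks the FILTER column; SelectVariants --exclude-filtered can then remove them):

```
# Sketch: hard-filter SNPs, then optionally drop the records that failed.
gatk VariantFiltration \
  -R ${ref} \
  -V /${sample}_snp_recal0.vcf \
  -filter "QD < 2.0" --filter-name "QD2" \
  -filter "QUAL < 30.0" --filter-name "QUAL30" \
  -filter "SOR > 3.0" --filter-name "SOR3" \
  -filter "FS > 60.0" --filter-name "FS60" \
  -filter "MQ < 40.0" --filter-name "MQ40" \
  -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
  -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
  -O /${sample}_snp_filtered_recal0.vcf

# Optional second pass: keep only PASS records.
gatk SelectVariants \
  -V /${sample}_snp_filtered_recal0.vcf \
  --exclude-filtered \
  -O /${sample}_snp_pass_recal0.vcf
```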

Cohort mode of GermlineCNVCaller: FileNotFoundError

Running GermlineCNVCaller in cohort mode fails with: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/intervals2009046725159223001.tsv'

My code for calling CNV:

cgsdir=/paedwy/disk1/wma/old_samples/raw_data/read_cgs
gatk=/home/yangyxt/software/gatk-4.1.3.0/gatk
ref_gen=/paedwy/disk1/yangyxt/indexed_genome
hdfsamples=/paedwy/disk1/wma/old_samples/raw_data/read_cgs

gCNV_model=${hdfsamples}/cgsnoB9_gCNV_model
valid_ploidy_call=${hdfsamples}/cgsNOB9_ploidy_model/cgs_normal_cohort-calls


module load miniconda3
module load gcc/4.9.1
module load GenomeAnalysisTK/4.1.3.0

source activate /software/GenomeAnalysisTK/4.1.3.0
$gatk GermlineCNVCaller \
--run-mode COHORT \
-L ${hdfsamples}/cgs.cohort.gc.filtered.interval_list \
--class-coherence-length 1000 \
--cnv-coherence-length 1000 \
--interval-merging-rule OVERLAPPING_ONLY \
--contig-ploidy-calls ${valid_ploidy_call} \
--verbosity DEBUG \
--annotated-intervals ${cgsdir}/cgs.annotated.tsv \
-I ${hdfsamples}/A100175.counts.hdf5 \
-I ${hdfsamples}/A100288B.counts.hdf5 \
-I ${hdfsamples}/A100308.counts.hdf5 \
-I ${hdfsamples}/A130042.counts.hdf5 \
-I ${hdfsamples}/A140135A.counts.hdf5 \
-I ${hdfsamples}/A140136A.counts.hdf5 \
-I ${hdfsamples}/A140347.counts.hdf5 \
-I ${hdfsamples}/A140348.counts.hdf5 \
-I ${hdfsamples}/A140417A.counts.hdf5 \
-I ${hdfsamples}/A140418A.counts.hdf5 \
-I ${hdfsamples}/A150123.counts.hdf5 \
-I ${hdfsamples}/A160134A.counts.hdf5 \
-I ${hdfsamples}/A160135A.counts.hdf5 \
-I ${hdfsamples}/A160149A.counts.hdf5 \
-I ${hdfsamples}/A160352A.counts.hdf5 \
-I ${hdfsamples}/A160353.counts.hdf5 \
-I ${hdfsamples}/A160354.counts.hdf5 \
-I ${hdfsamples}/A160355.counts.hdf5 \
-I ${hdfsamples}/A160627.counts.hdf5 \
-I ${hdfsamples}/A160788A.counts.hdf5 \
-I ${hdfsamples}/A160790B.counts.hdf5 \
-I ${hdfsamples}/A160792B.counts.hdf5 \
-I ${hdfsamples}/A170001.counts.hdf5 \
-I ${hdfsamples}/A170007.counts.hdf5 \
-I ${hdfsamples}/PID11-210.counts.hdf5 \
-I ${hdfsamples}/PID12-027.counts.hdf5 \
-I ${hdfsamples}/PID12-028.counts.hdf5 \
-I ${hdfsamples}/PID12-029.counts.hdf5 \
-I ${hdfsamples}/PID12-102.counts.hdf5 \
-I ${hdfsamples}/PID12-103.counts.hdf5 \
-I ${hdfsamples}/PID13-020.counts.hdf5 \
-I ${hdfsamples}/PID13-100.counts.hdf5 \
-I ${hdfsamples}/PID13-101.counts.hdf5 \
-I ${hdfsamples}/PID13-119.counts.hdf5 \
-I ${hdfsamples}/PID13-120.counts.hdf5 \
-I ${hdfsamples}/PID13-129.counts.hdf5 \
-I ${hdfsamples}/PID13-130.counts.hdf5 \
-I ${hdfsamples}/PID13-215.counts.hdf5 \
-I ${hdfsamples}/PID13-216.counts.hdf5 \
-I ${hdfsamples}/PID13-217.counts.hdf5 \
-I ${hdfsamples}/PID13-218.counts.hdf5 \
-I ${hdfsamples}/PID13-223.counts.hdf5 \
-I ${hdfsamples}/PID13-224.counts.hdf5 \
-I ${hdfsamples}/PID13-264.counts.hdf5 \
-I ${hdfsamples}/PID13-272.counts.hdf5 \
-I ${hdfsamples}/PID13-273.counts.hdf5 \
-I ${hdfsamples}/PID13-286.counts.hdf5 \
-I ${hdfsamples}/PID13-313.counts.hdf5 \
-I ${hdfsamples}/PID13-376.counts.hdf5 \
-I ${hdfsamples}/PID14-097.counts.hdf5 \
-I ${hdfsamples}/PID14-101.counts.hdf5 \
-I ${hdfsamples}/PID14-152A.counts.hdf5 \
-I ${hdfsamples}/PID14-191.counts.hdf5 \
-I ${hdfsamples}/PID14-192.counts.hdf5 \
-I ${hdfsamples}/PID14-200.counts.hdf5 \
-I ${hdfsamples}/PID14-203.counts.hdf5 \
-I ${hdfsamples}/PID14-205.counts.hdf5 \
-I ${hdfsamples}/PID14-208.counts.hdf5 \
-I ${hdfsamples}/PID14-209.counts.hdf5 \
-I ${hdfsamples}/PID14-210.counts.hdf5 \
-I ${hdfsamples}/PID14-211.counts.hdf5 \
-I ${hdfsamples}/PID14-229.counts.hdf5 \
-I ${hdfsamples}/PID14-230.counts.hdf5 \
-I ${hdfsamples}/PID14-325.counts.hdf5 \
-I ${hdfsamples}/PID14-360.counts.hdf5 \
-I ${hdfsamples}/PID14-361.counts.hdf5 \
-I ${hdfsamples}/PID14-362.counts.hdf5 \
-I ${hdfsamples}/PID15-017.counts.hdf5 \
-I ${hdfsamples}/PID15-018.counts.hdf5 \
-I ${hdfsamples}/PID15-105.counts.hdf5 \
-I ${hdfsamples}/PID15-106.counts.hdf5 \
-I ${hdfsamples}/PID15-116A.counts.hdf5 \
-I ${hdfsamples}/PID15-119A.counts.hdf5 \
-I ${hdfsamples}/PID16-101.counts.hdf5 \
-I ${hdfsamples}/PID16-102.counts.hdf5 \
-I ${hdfsamples}/PID16-124B.counts.hdf5 \
-I ${hdfsamples}/PID16-125A.counts.hdf5 \
-I ${hdfsamples}/PID16-177.counts.hdf5 \
-I ${hdfsamples}/PID16-178.counts.hdf5 \
-I ${hdfsamples}/PID16-179.counts.hdf5 \
-I ${hdfsamples}/PID16-202A.counts.hdf5 \
-I ${hdfsamples}/PID16-203A.counts.hdf5 \
-I ${hdfsamples}/PID16-204A.counts.hdf5 \
-I ${hdfsamples}/PID16-205A.counts.hdf5 \
-I ${hdfsamples}/PID16-211.counts.hdf5 \
-I ${hdfsamples}/PID16-212.counts.hdf5 \
-I ${hdfsamples}/PID16-213.counts.hdf5 \
-I ${hdfsamples}/XLA-016A.counts.hdf5 \
--output ${gCNV_model} \
--output-prefix cgsnoB9_gCNV_normal_cohort

source deactivate