Channel: Recent Discussions — GATK-Forum

GATK3 VariantFiltration is not working

EDIT:

I think I've understood what is happening. "FS > 30" doesn't mean "keep FS values above 30"; it means "mark those records as filtered", which, after doing some reading, is what we want. Apologies for my misunderstanding! Not sure how to delete this post.
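For anyone else who lands here with the same confusion: the expression given to VariantFiltration names the records to mark as filtered (they are annotated in the FILTER column, not removed). A minimal GATK3 sketch, with placeholder file names and an arbitrary filter name:

java -jar GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R reference.fasta \
    -V input.vcf \
    --filterExpression "FS > 30.0" \
    --filterName "StrandBiasFS" \
    -o filtered.vcf

Records with FS above 30 get "StrandBiasFS" in their FILTER field; everything else is marked PASS.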

How can I get a germline-resource file for Mutect2?

Hi team,
I am running Mutect2 on mouse data. May I ask how I can obtain these two files for my data?
1. the input .vcf file for --germline-resource (eg. resources/chr17_af-only-gnomad_grch38.vcf.gz)
2. the input .vcf file for GetPileupSummaries -V (eg. resources/chr17_small_exac_common_3_grch38.vcf.gz)

Thanks for any of your kind help!

HaplotypeCaller: SNPs with three genotypes have higher missing rates.

Dear GATK team,

I called SNPs from 150 samples of WGS data on a non-model species (coral). The reference is a 500 Mb draft genome; each sample has roughly 15 million paired-end reads. The species is diploid.

I ran the GATK pipeline following the Best Practices and called genotypes using HaplotypeCaller. Since I needed to speed up the computation and be as conservative as possible, I set the min-pruning flag to 10. Next, I kept only bi-allelic SNPs.
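For reference, a sketch of how that bi-allelic SNP subsetting is typically done with SelectVariants (file names are placeholders):

gatk SelectVariants \
    -R reference.fasta \
    -V genotyped.vcf.gz \
    --select-type-to-include SNP \
    --restrict-alleles-to BIALLELIC \
    -O biallelic_snps.vcf.gz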

Now I have noticed that the per-SNP missing rates are generally higher for SNPs with three genotypes (the most frequent case) than for those with two genotypes.

In other words, if I filter SNPs by missing rate, I end up with a genotype matrix in which most of the SNPs have only two genotypes (one homozygote plus the heterozygote).

Any idea what could be the cause? Is it possible that a min-pruning value of 10 systematically creates more missing calls at SNPs with three genotypes (particularly for homozygotes)?

thank you in advance

best

OS

(How to) Consolidate GVCFs for joint calling with GenotypeGVCFs


In GATK4, the GenotypeGVCFs tool can only take a single input, so if you have GVCFs from multiple samples (which is usually the case) you will need to combine them before feeding them to GenotypeGVCFs. Although there are several tools in the GATK and Picard toolkits that provide some type of VCF or GVCF merging functionality, for this use case only two of them can do the GVCF consolidation step correctly: GenomicsDBImport and CombineGVCFs.

GenomicsDBImport is the preferred tool (see detailed instructions below), although it has a few limitations (for example, it can only run on diploid data at the moment); CombineGVCFs is provided only as a backup solution for people who cannot use GenomicsDBImport. We know CombineGVCFs is quite inefficient and typically requires a lot of memory, so we encourage you to try GenomicsDBImport first and only fall back on CombineGVCFs if you experience issues that we are unable to help you solve (ask us for help in the forum!).


Using GenomicsDBImport in practice

The GenomicsDBImport tool takes in one or more single-sample GVCFs and imports data over a single interval, and outputs a directory containing a GenomicsDB datastore with combined multi-sample data. GenotypeGVCFs can then read from the created GenomicsDB directly and output the final multi-sample VCF.

So if you have a trio of GVCFs, your GenomicsDBImport command would look like this, assuming you're running one chromosome at a time (here we're showing the tool running on chromosome 20):

gatk GenomicsDBImport \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    --genomicsdb-workspace-path my_database \
    --intervals chr20

That generates a directory called my_database containing the combined GVCF data for chromosome 20. The contents of the directory are not really human-readable; see further down for tips to deal with that.

Then you run joint genotyping; note the gendb:// prefix to the database input directory path.

gatk GenotypeGVCFs \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -G StandardAnnotation -newQual \
    -O test_output.vcf 

And that's all there is to it.


Important limitations:

  1. You can't add data to an existing database; you have to keep the original GVCFs around and reimport them all together when you get new samples. For very large numbers of samples, there are some batching options.

  2. At the moment you can only run GenomicsDBImport on a single genomic interval (i.e. at most one contig) at a time. Down the road this will change (the work is tentatively scheduled for the second quarter of 2018), because we want to make it possible to run on multiple intervals in one go. But for now you need to run on each interval separately. We recommend scripting this, of course (see the sketch after this list).

  3. GenomicsDBImport cannot accept multiple GVCFs for the same sample, so if for example you generated separate GVCFs per chromosome for each sample, you'll need to either concatenate the chromosome GVCFs to produce a single GVCF per sample (using CatVariants) or scatter the following steps by chromosome as well.
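To make limitations 1 and 2 concrete, here is a minimal scripting sketch that re-runs the import once per chromosome for the trio above (the chromosome list is illustrative; for large cohorts you would also batch the inputs with --batch-size):

for chr in chr20 chr21 chr22; do
    gatk GenomicsDBImport \
        -V data/gvcfs/mother.g.vcf \
        -V data/gvcfs/father.g.vcf \
        -V data/gvcfs/son.g.vcf \
        --genomicsdb-workspace-path my_database_${chr} \
        --intervals ${chr}
done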

**If you can't use GenomicsDBImport for whatever reason, fall back to CombineGVCFs instead. It is slower but will allow you to combine GVCFs the old-fashioned way.**
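For completeness, the equivalent CombineGVCFs invocation for the trio above would be as follows; the combined GVCF can then be passed to GenotypeGVCFs as a plain -V input:

gatk CombineGVCFs \
    -R data/ref/ref.fasta \
    -V data/gvcfs/mother.g.vcf \
    -V data/gvcfs/father.g.vcf \
    -V data/gvcfs/son.g.vcf \
    -O combined.g.vcf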


Addendum: extracting GVCF data from the GenomicsDB

If you want to generate a flat multisample GVCF file from the GenomicsDB you created, you can do so with SelectVariants as follows:

gatk SelectVariants \
    -R data/ref/ref.fasta \
    -V gendb://my_database \
    -O combined.g.vcf

Parameters for running GenomicsDBImport


I have a system with about 8 GB RAM. I've run HaplotypeCaller (-ERC GVCF) on specific genes of interest using a .list file and have 109 .g.vcf.gz files of about 5-10 GB each. What would be the most optimal way to run GenomicsDBImport on these samples for joint calling? Will I need to further subset these files into specific intervals or set a batch size?

GATK version - 4.0.11, Java version-1.8

Optimal = avoid errors, maximise input samples, minimise computational load, and minimise time, in that order.
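For readers with a similar setup, a batched import might look like the sketch below; the --batch-size value, interval, heap size, and file names are purely illustrative, not a tested recommendation for this hardware (cohort.sample_map is a two-column sample-to-GVCF map, and the heap is kept below total RAM to leave headroom for GenomicsDB's native memory use):

gatk --java-options "-Xmx6g" GenomicsDBImport \
    --sample-name-map cohort.sample_map \
    --batch-size 10 \
    --intervals gene_of_interest.interval_list \
    --genomicsdb-workspace-path my_database_gene1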

About read filters


I did read the latest worksheet/tutorial for somatic variant calling with Mutect2.

In this tutorial, the Mutect2 version is GATK 4.0.2.0 and the following option is used:

--disable-read-filter MateOnSameContigOrNoMappedMateReadFilter

The explanation is: "we disabled the read filter called MateOnSameContigOrNoMappedMateReadFilter, which keeps only reads whose mate maps to the same contig or is unmapped".
My understanding is that MateOnSameContigOrNoMappedMateReadFilter is turned on by default for Mutect2 (i.e., --read-filter MateOnSameContigOrNoMappedMateReadFilter).

For our analysis (using amplicon data), we want to make sure that the duplicate filter is disabled within Mutect2 (otherwise we lose 95% of our data).
However, with my local installation:

java -jar gatk-package-4.0.5.1-local.jar Mutect2 -h

    --read-filter,-RF:String          Read filters to be applied before analysis. This argument may be specified 0 or more times. Default value: null.
    --disable-read-filter,-DF:String  Read filters to be disabled before analysis. This argument may be specified 0 or more times. Default value: null.

It seems that no read filter is turned on by default with version 4.0.5.1; is that correct? Should I still use --disable-read-filter within Mutect2 to make sure we keep duplicates in our analysis?

Best regards
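For illustration: in GATK4 the duplicate filter is its own read filter (NotDuplicateReadFilter), separate from the mate-mapping filter discussed above, and a tool's default filters are switched off with --disable-read-filter. A sketch with placeholder file and sample names; check the Mutect2 tool documentation for the default filter list of your exact version:

gatk Mutect2 \
    -R reference.fasta \
    -I tumor.bam \
    -tumor tumor_sample_name \
    --disable-read-filter NotDuplicateReadFilter \
    -O somatic.vcf.gz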

SAMException "Query asks for data past end of contig" occurred in Mutect2


Dear team

Currently, I have downloaded the GATK4 (version 4.0.7.0) Docker image and the "gatk4-data-processing-master" WDL coupled with the "gatk4-somatic-snvs-indels-master" WDL for somatic mutation detection. The following commands are used for the whole workflow with the human genome reference data (version hg38):

(1) The commands "java -jar cromwell-34.jar run processing-for-variant-discovery-gatk4.wdl --inputs normal.json" and "java -jar cromwell-34.jar run processing-for-variant-discovery-gatk4.wdl --inputs tissue.json" are called to generate BAM and BAI files for both the normal and the cancer tissue samples.
This step finishes successfully and the BAM and BAI files are produced.

(2) The command "java -jar cromwell-34.jar run mutect2.wdl --inputs mutect2.json" is called for somatic mutation detection with the normal and tissue BAM and BAI files as input.
Unfortunately, the error "htsjdk.samtools.SAMException: Query asks for data past end of contig" occurs on many contigs (for example, Query contig chrX start:224664443 stop:224664487 contigLength:156040895).

Can someone help me fix these errors? Thanks a lot.

Here are the JSON files used in the issue.

(a) the normal json file
{
"##_COMMENT1": "SAMPLE NAME AND UNMAPPED BAMS",
"PreProcessingForVariantDiscovery_GATK4.sample_name": "mytestN",
"PreProcessingForVariantDiscovery_GATK4.ref_name": "hg38",
"PreProcessingForVariantDiscovery_GATK4.flowcell_unmapped_bams_list": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/test7/normal_u
bam_list.txt",
"PreProcessingForVariantDiscovery_GATK4.unmapped_bam_suffix": ".bam",

"##COMMENT2": "REFERENCE FILES",
"PreProcessingForVariantDiscovery_GATK4.ref_dict": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens

assembly38.dict",
"PreProcessingForVariantDiscovery_GATK4.ref_fasta": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens
_assembly38.fasta",
"PreProcessingForVariantDiscovery_GATK4.ref_fasta_index": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_s
apiens_assembly38.fasta.fai",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_alt": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.alt",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_sa": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/
hg38/Homo_sapiens_assembly38.fasta.64.sa",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_amb": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.amb",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_bwt": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.bwt",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_ann": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.ann",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_pac": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.pac",

"##_COMMENT3": "KNOWN SITES RESOURCES",
"PreProcessingForVariantDiscovery_GATK4.dbSNP_vcf": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens
_assembly38.dbsnp138.sort.vcf",
"PreProcessingForVariantDiscovery_GATK4.dbSNP_vcf_index": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_s
apiens_assembly38.dbsnp138.sort.vcf.idx",
"PreProcessingForVariantDiscovery_GATK4.known_indels_sites_VCFs": [
"/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
"/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz"
],
"PreProcessingForVariantDiscovery_GATK4.known_indels_sites_indices": [
"/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
"/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
],

"##_COMMENT4": "MISC PARAMETERS",
"PreProcessingForVariantDiscovery_GATK4.bwa_commandline": "bwa mem -K 100000000 -p -v 3 -t 16 -Y $bash_ref_fasta",
"PreProcessingForVariantDiscovery_GATK4.compression_level": 5,
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.num_cpu": "16",

"##_COMMENT5": "DOCKERS",
"PreProcessingForVariantDiscovery_GATK4.gotc_docker": "broadinstitute/genomes-in-the-cloud:2.3.1-1512499786",
"PreProcessingForVariantDiscovery_GATK4.gatk_docker": "broadinstitute/gatk:4.0.7.0",
"PreProcessingForVariantDiscovery_GATK4.python_docker": "python:2.7",

"##_COMMENT6": "PATHS",
"PreProcessingForVariantDiscovery_GATK4.gotc_path": "/usr/gitc/",
"PreProcessingForVariantDiscovery_GATK4.gatk_path": "/gatk/gatk",

"##_COMMENT7": "JAVA OPTIONS",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.MergeBamAlignment.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.MarkDuplicates.java_opt": "-Xms4000m",
"PreProcessingForVariantDiscovery_GATK4.SortAndFixTags.java_opt_sort": "-Xms4000m",
"PreProcessingForVariantDiscovery_GATK4.SortAndFixTags.java_opt_fix": "-Xms500m",
"PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator.java_opt": "-Xms4000m",
"PreProcessingForVariantDiscovery_GATK4.GatherBqsrReports.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.ApplyBQSR.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.GatherBamFiles.java_opt": "-Xms2000m",

"##_COMMENT8": "MEMORY ALLOCATION",
"PreProcessingForVariantDiscovery_GATK4.GetBwaVersion.mem_size": "1 GB",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.mem_size": "14 GB",
"PreProcessingForVariantDiscovery_GATK4.MergeBamAlignment.mem_size": "3500 MB",
"PreProcessingForVariantDiscovery_GATK4.MarkDuplicates.mem_size": "7 GB",
"PreProcessingForVariantDiscovery_GATK4.SortAndFixTags.mem_size": "5000 MB",
"PreProcessingForVariantDiscovery_GATK4.CreateSequenceGroupingTSV.mem_size": "2 GB",
"PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator.mem_size": "6 GB",
"PreProcessingForVariantDiscovery_GATK4.GatherBqsrReports.mem_size": "3500 MB",
"PreProcessingForVariantDiscovery_GATK4.ApplyBQSR.mem_size": "3500 MB",
"PreProcessingForVariantDiscovery_GATK4.GatherBamFiles.mem_size": "3 GB",

"##_COMMENT9": "DISK SIZE ALLOCATION",
"PreProcessingForVariantDiscovery_GATK4.agg_small_disk": 200,
"PreProcessingForVariantDiscovery_GATK4.agg_medium_disk": 300,
"PreProcessingForVariantDiscovery_GATK4.agg_large_disk": 400,
"PreProcessingForVariantDiscovery_GATK4.flowcell_small_disk": 100,
"PreProcessingForVariantDiscovery_GATK4.flowcell_medium_disk": 200,

"##_COMMENT10": "PREEMPTIBLES",
"PreProcessingForVariantDiscovery_GATK4.preemptible_tries": 3,
"PreProcessingForVariantDiscovery_GATK4.agg_preemptible_tries": 3
}

(b) the tissue json file
{
"##_COMMENT1": "SAMPLE NAME AND UNMAPPED BAMS",
"PreProcessingForVariantDiscovery_GATK4.sample_name": "mytestT",
"PreProcessingForVariantDiscovery_GATK4.ref_name": "hg38",
"PreProcessingForVariantDiscovery_GATK4.flowcell_unmapped_bams_list": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/test7/tissue_u
bam_list.txt",
"PreProcessingForVariantDiscovery_GATK4.unmapped_bam_suffix": ".bam",

"##COMMENT2": "REFERENCE FILES",
"PreProcessingForVariantDiscovery_GATK4.ref_dict": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens

assembly38.dict",
"PreProcessingForVariantDiscovery_GATK4.ref_fasta": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens
_assembly38.fasta",
"PreProcessingForVariantDiscovery_GATK4.ref_fasta_index": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_s
apiens_assembly38.fasta.fai",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_alt": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.alt",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_sa": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/
hg38/Homo_sapiens_assembly38.fasta.64.sa",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_amb": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.amb",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_bwt": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.bwt",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_ann": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.ann",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_pac": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data
/hg38/Homo_sapiens_assembly38.fasta.64.pac",

"##_COMMENT3": "KNOWN SITES RESOURCES",
"PreProcessingForVariantDiscovery_GATK4.dbSNP_vcf": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens
_assembly38.dbsnp138.sort.vcf",
"PreProcessingForVariantDiscovery_GATK4.dbSNP_vcf_index": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_s
apiens_assembly38.dbsnp138.sort.vcf.idx",
"PreProcessingForVariantDiscovery_GATK4.known_indels_sites_VCFs": [
"/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
"/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz"
],
"PreProcessingForVariantDiscovery_GATK4.known_indels_sites_indices": [
"/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
"/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
],

"##_COMMENT4": "MISC PARAMETERS",
"PreProcessingForVariantDiscovery_GATK4.bwa_commandline": "bwa mem -K 100000000 -p -v 3 -t 16 -Y $bash_ref_fasta",
"PreProcessingForVariantDiscovery_GATK4.compression_level": 5,
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.num_cpu": "16",

"##_COMMENT5": "DOCKERS",
"PreProcessingForVariantDiscovery_GATK4.gotc_docker": "broadinstitute/genomes-in-the-cloud:2.3.1-1512499786",
"PreProcessingForVariantDiscovery_GATK4.gatk_docker": "broadinstitute/gatk:4.0.7.0",
"PreProcessingForVariantDiscovery_GATK4.python_docker": "python:2.7",

"##_COMMENT6": "PATHS",
"PreProcessingForVariantDiscovery_GATK4.gotc_path": "/usr/gitc/",
"PreProcessingForVariantDiscovery_GATK4.gatk_path": "/gatk/gatk",

"##_COMMENT7": "JAVA OPTIONS",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.MergeBamAlignment.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.MarkDuplicates.java_opt": "-Xms4000m",
"PreProcessingForVariantDiscovery_GATK4.SortAndFixTags.java_opt_sort": "-Xms4000m",
"PreProcessingForVariantDiscovery_GATK4.SortAndFixTags.java_opt_fix": "-Xms500m",
"PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator.java_opt": "-Xms4000m",
"PreProcessingForVariantDiscovery_GATK4.GatherBqsrReports.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.ApplyBQSR.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.GatherBamFiles.java_opt": "-Xms2000m",

"##_COMMENT8": "MEMORY ALLOCATION",
"PreProcessingForVariantDiscovery_GATK4.GetBwaVersion.mem_size": "1 GB",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.mem_size": "14 GB",
"PreProcessingForVariantDiscovery_GATK4.MergeBamAlignment.mem_size": "3500 MB",
"PreProcessingForVariantDiscovery_GATK4.MarkDuplicates.mem_size": "7 GB",
"PreProcessingForVariantDiscovery_GATK4.SortAndFixTags.mem_size": "5000 MB",
"PreProcessingForVariantDiscovery_GATK4.CreateSequenceGroupingTSV.mem_size": "2 GB",
"PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator.mem_size": "6 GB",
"PreProcessingForVariantDiscovery_GATK4.GatherBqsrReports.mem_size": "3500 MB",
"PreProcessingForVariantDiscovery_GATK4.ApplyBQSR.mem_size": "3500 MB",
"PreProcessingForVariantDiscovery_GATK4.GatherBamFiles.mem_size": "3 GB",

"##_COMMENT9": "DISK SIZE ALLOCATION",
"PreProcessingForVariantDiscovery_GATK4.agg_small_disk": 200,
"PreProcessingForVariantDiscovery_GATK4.agg_medium_disk": 300,
"PreProcessingForVariantDiscovery_GATK4.agg_large_disk": 400,
"PreProcessingForVariantDiscovery_GATK4.flowcell_small_disk": 100,
"PreProcessingForVariantDiscovery_GATK4.flowcell_medium_disk": 200,

"##_COMMENT10": "PREEMPTIBLES",
"PreProcessingForVariantDiscovery_GATK4.preemptible_tries": 3,
"PreProcessingForVariantDiscovery_GATK4.agg_preemptible_tries": 3
}

(c) the mutect2 json file
{
"##_COMMENT1": "Runtime",
"##Mutect2.oncotator_docker": "(optional) String?",
"Mutect2.gatk_docker": "broadinstitute/gatk:4.0.7.0",

"##_COMMENT2": "Workflow options",
"##_Mutect2.intervals": "gs://gatk-best-practices/somatic-b37/whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.baits.
interval_list",
"Mutect2.scatter_count": 50,
"Mutect2.artifact_modes": ["G/T", "C/T"],
"##_Mutect2.m2_extra_args": "(optional) String?",
"##_Mutect2.m2_extra_filtering_args": "(optional) String?",
"Mutect2.run_orientation_bias_filter": "False",
"Mutect2.run_oncotator": "False",

"##_COMMENT3": "Primary inputs",
"Mutect2.ref_fasta": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens_assembly38.fasta",
"Mutect2.ref_dict": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens_assembly38.dict",
"Mutect2.ref_fai": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/Homo_sapiens_assembly38.fasta.fai",
"Mutect2.normal_bam": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/test7/mytestN.hg38.bam",
"Mutect2.normal_bai": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/test7/mytestN.hg38.bai",
"Mutect2.tumor_bam": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/test7/mytestT.hg38.bam",
"Mutect2.tumor_bai": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/test7/mytestT.hg38.bai",

"##COMMENT4": "Primary resources",
"##_Mutect2.pon": "(optional) File?",
"##_Mutect2.pon_index": "(optional) File?",
"Mutect2.gnomad": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/somatic/af-only-gnomad.hg38.vcf.gz",
"Mutect2.gnomad_index": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/somatic/af-only-gnomad.hg38.vcf.gz.tbi",
"Mutect2.variants_for_contamination": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/somatic/small_exac_common

3.hg38.vcf.gz",
"Mutect2.variants_for_contamination_index": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/somatic/small_exac_c
ommon_3.hg38.vcf.gz.tbi",
"##Mutect2.realignment_index_bundle": "File? (optional)",

"##_COMMENT5": "Secondary resources",
"Mutect2.onco_ds_tar_gz": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/somatic/oncotator_v1_ds_April052016.ta
r.gz",
"Mutect2.default_config_file": "/bdp-picb/bioinfo/gyzheng/GATK_pipeline/GATK_resource/reference_data/hg38/somatic/onco_config.txt",
"##_Mutect2.sequencing_center": "(optional) String?",
"##_Mutect2.sequence_source": "(optional) String?",

"##_COMMENT6": "Secondary resources",
"##_Mutect2.MergeBamOuts.mem": "(optional) Int?",
"##_Mutect2.SplitIntervals.mem": "(optional) Int?",
"##_Mutect2.M2.mem": "(optional) Int?",
"##_Mutect2.MergeVCFs.mem": "(optional) Int?",
"##_Mutect2.oncotate_m2.mem": "(optional) Int?",

"##_COMMENT7": "Secondary resources",
"##_Mutect2.onco_ds_local_db_dir": "(optional) String?",
"##_Mutect2.sequencing_center": "(optional) String?",
"##_Mutect2.oncotate_m2.oncotator_exe": "(optional) String?",
"##_Mutect2.gatk4_override": "(optional) File?",
"##_Mutect2.CollectSequencingArtifactMetrics.mem": "(optional) Int?",

"##_COMMENT8": "Disk space",
"##_Mutect2.MergeVCFs.disk_space_gb": "(optional) Int?",
"##_Mutect2.Filter.disk_space_gb": "(optional) Int?",
"##_Mutect2.M2.disk_space_gb": "(optional) Int?",
"##_Mutect2.M2.disk_space_gb": 100,
"##_Mutect2.oncotate_m2.disk_space_gb": "(optional) Int?",
"##_Mutect2.SplitIntervals.disk_space_gb": "(optional) Int?",
"##_Mutect2.MergeBamOuts.disk_space_gb": "(optional) Int?",
"##_Mutect2.CollectSequencingArtifactMetrics.disk_space_gb": "(optional) Int?",
"##_Mutect2.emergency_extra_disk": "(optional) Int?",

"##_COMMENT9": "Preemptibles",
"##_Mutect2.MergeBamOuts.preemptible_attempts": "(optional) Int?",
"Mutect2.preemptible_attempts": 3
}

How to run PathSeqPipelineSpark on the local machine?


How to run PathSeqPipelineSpark on a "normal" machine (even just a laptop) with multiple CPU cores?
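For reference, a local run is typically launched with trailing Spark arguments; below is a sketch based on the argument names used in the PathSeq tutorial (the reference images, taxonomy file, and core count are placeholders; check gatk PathSeqPipelineSpark --help for your version):

gatk PathSeqPipelineSpark \
    --input sample.bam \
    --filter-bwa-image host.fa.img \
    --kmer-file host.hss \
    --microbe-fasta microbe.fa \
    --microbe-bwa-image microbe.fa.img \
    --taxonomy-file taxonomy.db \
    --output pathseq_output.bam \
    --scores-output pathseq_scores.txt \
    -- --spark-runner LOCAL --spark-master local[8]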


spark-master is not a recognized option

Hello,

I am trying to run Mutect2 (GATK v4.0.10.0) in the spark-runner LOCAL mode. As I explain below, I am failing to correctly request the number of local cores.

This is how I run Mutect2 with the default --spark-runner LOCAL mode:

```
gatk Mutect2 \
-R GRCh38_full_analysis_set_plus_decoy_hla.fa \
--tumor-sample HCC1143_tumor \
--input hcc1143_N_subset50K.bam \
--input hcc1143_T_subset50K.bam \
--output mutect2.vcf \
-- --spark-runner LOCAL
```

Then, for testing, I book a one-CPU machine in our cluster, and I see the Spark runner is trying to use its default sparkMaster value of local[4].

The following two lines from the Mutect2 logs confirm this:
14:34:43.360 INFO IntelPairHmm - Available threads: 1
14:34:43.360 INFO IntelPairHmm - Requested threads: 4


If I try to request one CPU in gatk like this:

```
gatk Mutect2 \
-R GRCh38_full_analysis_set_plus_decoy_hla.fa \
--tumor-sample HCC1143_tumor \
--input hcc1143_N_subset50K.bam \
--input hcc1143_T_subset50K.bam \
--output mutect2.vcf \
-- --spark-runner LOCAL --spark-master local[1]
```

I get the error:
A USER ERROR has occurred: spark-master is not a recognized option


Any help on how to correctly use --spark-master to select the number of cores would be much appreciated.

Thanks a lot,
Jorge
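For comparison, here is a sketch of the same trailing-argument syntax applied to a tool that has a Spark implementation (MarkDuplicatesSpark; file names are placeholders). Arguments after the bare -- separator go to the Spark runner, not to the tool itself:

```
gatk MarkDuplicatesSpark \
-I input.bam \
-O marked_duplicates.bam \
-- --spark-runner LOCAL --spark-master local[1]
```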

The AnnotatePairOrientation tool and non-Mutect2 VCF input - only "0,0:0,0" values in the output


Hello!

I would like to annotate VCF files (not generated by Mutect2) with F1R2/F2R1 read support counts. I was happy to find out that a dedicated tool is now provided for this very purpose (AnnotatePairOrientation), but even though I have tried the tool in multiple versions of GATK 4 (ranging from 4.0.0.0 to 4.0.11.0) and never received an error, I am still left without useful output: for all the annotated variants, the reported F1R2:F2R1 values are "0,0:0,0". I do not expect the input BAM and VCF files to be at fault, because Mutect2 running on the same inputs (with the option "--genotyping-mode" set to "GENOTYPE_GIVEN_ALLELES") reports non-zero F1R2/F2R1 counts. The tool's documentation page doesn't seem to list any specific requirements regarding the input VCF files (anything that could, let's say, cause the tool to output default zero values due to some information missing from the INFO or FORMAT fields).
Is any good soul out there able to confirm that the tool either a) works as intended on their data, or b) only outputs zeroes for them as well?

With many thanks and best regards,
Daniel

Cross-Species Contamination Identification with PathSeq


Overview

PathSeq is a computational pathogen-discovery pipeline in the Genome Analysis Toolkit (GATK) for detecting microbial organisms from short-read deep sequencing of a host organism, such as human. The pipeline detects microbial reads in the host organism by performing read quality filtering, subtracting reads derived from the host, aligning the remaining (non-host) reads to a reference of microbe genomes, and finally generating a table of the detected microbial organisms. The GATK version improves on the previous version of the pipeline by incorporating faster computational approaches, broadening the use cases of the pipeline, and integrating the pipeline into GATK's Apache Spark framework, enabling parallelized data processing (Walker et al., 2018). We've written in detail in our documentation on how to use PathSeq, but I have a particularly intriguing story to share about how I used the PathSeq workflow in FireCloud to quickly identify the cause of mysteriously low sequencing alignment rates.

I first heard about this specific problem when a project manager in the sequencing lab told me that they were seeing low alignment rates on multiple samples from the same project, and asked if I could help. We normally see alignment rates (as reported from Picard’s CollectAlignmentSummaryMetrics) above 99%, but this cohort of samples was producing rates between 60% and 95%, requiring the lab to sequence more in order to reach the agreed-upon coverage for the project (which doesn’t include unaligned reads, of course).

I suspected bacterial contamination since (by manual inspection) the unaligned reads did not seem to be artifactual (for example they all had pretty random-seeming sequence, not all the same). To approach this problem, I used the new GATK4 PathSeq Workflow (publication, how-to tutorial) and a small Python script. In this document I’ll walk you through how I used PathSeq on FireCloud using workflows and the beta “Notebooks” feature to quickly identify that the unaligned reads all belong to a single bacterial genus, Burkholderia.

PathSeq Data Bundle and Documentation


Setup

I prepared to import the data into my cloned PathSeq workspace. First, a set of "participant" entities and a set of "sample" entities were added to the data model. I wanted to perform batch processing in this experiment, so I also imported a "sample_set". Below are examples of simple TSV files that can be used to quickly set up the data model.

Example Participant TSV:

    entity:participant_id   
    dummy_participant

After importing the participants, I imported the samples with three vital columns: entity:sample_id, participant_id, and WGS_bam_path (the path to the BAM file). I have simplified the text in the path column for the purposes of this blog post. The additional column labeled Aligned contains the percentage of aligned reads as reported by the preceding data-processing pipeline. This column isn't needed for running the PathSeq pipeline but will be used in the Python script written in the Notebook.

Example Sample TSV:

entity:sample_id    Aligned WGS_bam_path    participant_id
sample1 68  gs://bucket/directory/file1.unmapped.bam    dummy_participant
sample2 80  gs://bucket/directory/file2.unmapped.bam    dummy_participant
sample3 76  gs://bucket/directory/file3.unmapped.bam    dummy_participant
sample4 83  gs://bucket/directory/file4.unmapped.bam    dummy_participant
sample5 86  gs://bucket/directory/file5.unmapped.bam    dummy_participant
sample6 89  gs://bucket/directory/file6.unmapped.bam    dummy_participant
sample7 92  gs://bucket/directory/file7.unmapped.bam    dummy_participant
sample8 99  gs://bucket/directory/file8.unmapped.bam    dummy_participant
sample9 99  gs://bucket/directory/file9.unmapped.bam    dummy_participant

A few comments about the file above:
1. The column separator is <TAB>.
2. The entity:sample_id column must come first.
3. The value of the participant_id column must exist in the entity:participant_id column of the Participant TSV.
4. The column header WGS_bam_path is referenced in the method configuration that will be used to run the PathSeq pipeline. It is important to be certain that the correct column is being referenced in the method configuration. Any discrepancy between the column names in the Data tab and the method configuration will cause the method to fail.

After importing this data, I turned to the last TSV file: sample_set. This will enable me to analyze all of the samples with a single action.
Example Sample_Set TSV:

membership:sample_set_id    sample
dummy_sample_set    sample1
dummy_sample_set    sample2
dummy_sample_set    sample3
dummy_sample_set    sample4
dummy_sample_set    sample5
dummy_sample_set    sample6
dummy_sample_set    sample7
dummy_sample_set    sample8
dummy_sample_set    sample9

Here, dummy_sample_set is an arbitrary name for the sample set, and this file specifies the "membership" of samples in sample_sets. A given sample can be in more than one sample_set.

After importing this third file I can see three sub-tabs within the “Data” tab (in the case I’m showing here, my data model has one participant per sample and two sample_sets).

Launch Analysis

To run the analysis on the desired sample_set, I go into the “Method Configuration” tab, select the “pathseq-pipeline-WGS” configuration, and click “Launch Analysis” after setting the method variables. The data model is then presented to me again in a window and I select the “entity” on which to run the configuration. Before I run a configuration on an entire cohort, I like to test it on a single sample to check whether my workspace is correctly set up. To perform a single run, I choose one of the samples in the sample sub-tab and click “Launch”. The view switches to the “Monitor” tab, and the submission I just launched is shown as “Queued”.

During the first attempt, my submission reported an error because I had not given read access to the pet service account so that it could read my files. If you upload your data into the native Google Cloud bucket that is created along with a new FireCloud workspace, this won't be an issue. But an error came up in my case because I had set up my workspace to reference Google Cloud buckets that I had previously created. So if this happens to you, head over to the Google Cloud Platform Storage Console, select the bucket that holds your files, and give the "Storage Object Viewer" role to that service account.

After I added the role, I relaunched the method, and since it succeeded I proceeded to launch it over the whole sample_set.

To launch over an entire set, I go (as before) to the “Method Configuration” tab, select the “pathseq-pipeline-WGS” configuration and click “Launch Analysis”. But now I go to the sample_sets tab and choose the sample_set I want to use. Before clicking the “Launch” button, I type “this.samples” into the “Define Expression” textbox.

Only then do I click the “Launch” button.

At this point, I see my request has been “submitted” and eventually it transitions to “running” (refreshing the page updates the status). Clicking on “view” should show you the progression of all the samples. Because I already ran one sample to completion, one of the workflows completes quickly, as it uses call caching to retrieve output results from my previous run and does not actually re-compute anything. The remaining samples run in parallel, and when they complete the entire submission is marked as “done”.

So, where are the results? If the output variables are set in the method configuration, the results will be listed in new columns under the “sample” entity in the Data tab.

Additional Analysis using Jupyter

To complete the analysis, we want to see what the top contaminating genus is and how it relates to the “Aligned” value that we have for each sample, in a scatter plot. To do this I switched my analysis to “Jupyter Notebooks”. In the Notebook, Python functions were used to retrieve the metrics produced by the PathSeq workflow, extract the vital information from those metrics, and then create a sorted table with the extracted data to show a clear picture of what PathSeq found. The data was also used to generate a scatter plot with the percentage of aligned reads from the original data on the X axis and the percentage of reads aligning to the top contaminating genus on the Y axis.
Here is an HTML export of the completed Jupyter Notebook used to find the top contaminant from the PathSeq output metrics: pathseq-topmed-blog.

In my case, the resulting scatter plot produced a downward slope, showing that as samples had fewer reads aligned (to hg38), more reads were aligning to PathSeq's top genus.

You’ll also find a count of the top genus of the samples, indicating that, in my case, all 78 samples are contaminated with “Burkholderia.”

Closure

With this data in hand, I went back to the collaborator to discuss how PathSeq detected Burkholderia as the likely contaminant. Given other key details, such as the location where the samples were collected (not conveyed to me in advance), it made sense to the project manager that this proteobacterium would be detected as the contaminant. As a result, the collaborator has since identified some problems in their sample-collection protocols and has taken action to improve them.

Nota Bene

Like many other terms, “contamination” is overloaded in sequencing parlance, and while contamination of a human sample with material from another human (at any stage of the pipeline) is certainly of interest and importance, this is not the type of contamination that PathSeq can detect. For that kind of contamination I can refer you to VerifyBamID, ContEst (GATK3), and CalculateContamination (GATK4). PathSeq can only detect contamination by a different species.

Featured PathSeq workspace

For those interested in testing the PathSeq pipeline, a workspace has been provided in FireCloud. Ideally we’d like to have a workspace that replicates this blog, but due to the data privacy regulations of TOPMed we’ve instead created a similar scenario involving meat contamination. The workspace uses a few useful workflows to simulate contaminated sequence data and detect the contamination using PathSeq. As in this blog, a Jupyter Notebook is used to sort the metrics outputs from PathSeq to identify the top contaminant. Workspace: contamination-identification-with-pathseq

Citation

Walker MA, Pedamallu CS, Ojesina AI, Bullman S, Sharpe T, Whelan CW, Meyerson M (2018). GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts. Bioinformatics, bty501. https://doi.org/10.1093/bioinformatics/bty501

Variant calling: high (and strange) number of alternative alleles


Dear GATK team,

I am calling variants on a trio (mother, father and offspring) of Macaca mulatta. I have 60X whole-genome sequencing for each individual. I use GATK 4.0.7.0: I call variants with HaplotypeCaller in BP_RESOLUTION mode, combine per chromosome with GenomicsDBImport, and genotype with GenotypeGVCFs.

I am interested in the number of sites where I have only the reference allele (AD=0 for the alternative) and the number of sites where some reads support the ALT allele (AD>0) in the parents.

I found a lot of sites (for each individual) where I have AD>0 in the GVCF file (per individual, in the combined one, and after genotyping). I looked at each site that is HomRef, and for each individual fewer than 30% of the HomRef sites have AD=0 for the alternative allele. I know that HaplotypeCaller does a realignment step that may change the positions of the reads, but 70% of sites with AD>0 seems like a lot. I looked back at the BAM file and those alternative alleles don’t seem to be there. I tried calling again using the bam-out option, and here again I don’t see that many alternative alleles. However, I see that sometimes a read that carried no alternative allele in the input BAM carries an alternative allele in the output.
Also, I have tried samtools mpileup, and in that case almost 90% of the HomRef sites have AD=0 for the alternative allele.

Just as an example, below is the VCF output from HaplotypeCaller for one individual, followed by the corresponding counts from both the input BAM file and the output BAM file.
For chr1 pos 24203380 the ref is A and I have:
VCF --> DP=96, AD=92,4
BAM input --> DP=93 (92 ref, 1 N)
BAM output --> DP=80 (79 ref, 1 N)

chr1 24203380 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,4:96:57:0,57,5771
chr1 24203381 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:90,5:95:0:0,0,5897
chr1 24203382 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:78:0,78,6075
chr1 24203383 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:68:0,68,6127

Just in case, here is my command:

gatk --java-options "-XX:ParallelGCThreads=16 -Xmx64g" HaplotypeCaller -R /PATH/rheMac8.fa -I /PATH/R01068_sorted.merged.addg.uniq.rmdup.bam -O /PATH/R01068_res.g.vcf -ERC BP_RESOLUTION

I don’t know why I have this high number of alternative alleles, nor how to get rid of them to obtain the 'real' number of alternative alleles per position. The problem persists in the genotyped VCF files, with some alternative alleles that are not present in any BAM (input or HaplotypeCaller output).

I hope I gave you enough details so you have a clear idea of my problem and will be able to help me.
Best,

Germline copy number variant discovery (CNVs)


Purpose

Identify germline copy number variants.


Diagram is not available


Reference implementation is not available


This workflow is in development; detailed documentation will be made available when the workflow is considered fully released.

ECNT Value in Mutect2

Hello,

I am using Mutect2 and FilterMutectCalls to call variants in mtDNA. According to the VCF header, the value recorded for ECNT is the "number of events in this haplotype". I am assuming that this is the number of times that a particular mutation was found in all the reads covering that base pair. I am concerned because in my data I am seeing clusters of mutations with the same ECNT value. For example:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
gi|9626243|ref|NC_001416.1| 115 . C T VL PASS DP=445;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-7.025e+01;TLOD=6.86 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 157 . T A VL PASS DP=427;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.113e+02;TLOD=8.55 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 470 . C T VL PASS DP=703;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.859e+02;TLOD=15.58 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 500 . A T VL PASS DP=691;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.927e+02;TLOD=7.59 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 601 . CT C VL PASS DP=671;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.766e+02;RPA=3,2;RU=T;STR;TLOD=7.68 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 635 . C T VL PASS DP=665;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.602e+02;TLOD=34.37 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 645 . C T VL PASS DP=668;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.818e+02;TLOD=7.87 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 704 . T TAAAAAA VL PASS DP=660;ECNT=9;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.845e+02;RPA=5,11;RU=A;STR;TLOD=5.89 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 736 . A G VL PASS DP=666;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.723e+02;TLOD=20.03 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 754 . A G VL PASS DP=654;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.606e+02;TLOD=29.79 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 788 . A T VL PASS DP=639;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.738e+02;TLOD=6.65 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 898 . C T VL PASS DP=671;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.821e+02;TLOD=5.38 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB
gi|9626243|ref|NC_001416.1| 958 . A G VL PASS DP=671;ECNT=6;POP_AF=5.000e-08;P_CONTAM=0.00;P_GERMLINE=-1.847e+02;TLOD=6.69

Is there an explanation for why the data would look like this? It seems odd that a mutation would occur exactly the same number of times as the variants surrounding it.

Thank you,

kzwon

Mathematical Notes on Mutect.pdf: for germline hom alts, both the tumor and normal allele fractions


Why are they both similarly large under that assumption?


How Can I Get Strand Bias Information From Mutect?


Hi,
Mutect does have a filter for strand bias, but it does not give strand information (like DP4 or similar) in its output .call file or .vcf file.
But sometimes I want to check the strand distribution of the called SNVs and do further filtering; I wonder how I could get such information?

Thanks! Hartblue

confident somatic variants through such steps?



Is the red-line sentence right, i.e. that one obtains confident somatic variants through such steps?

FINDING ACTIVE REGIONS in GATK4 Mutect2



I want to ask several questions about this tool.

q1: "Mutect triages sites based on their pileup at a single base locus"; do you mean it looks at each base step by step? What is the criterion for the triage, and what does the pileup refer to, i.e. what accumulates?

q2: The pairHMM is used in the germline HaplotypeCaller, but for somatic calling you simply focus on base quality; is that good enough?

q3: "we can now estimate the likelihoods of no variation and of a true alt allele with allele fraction f."

Does the allele fraction f refer to AF or POP_AF in the VCF file, or neither?

q4: "The likelihood of no variation is the likelihood that every alt read was in error." Does "in error" mean the likelihood value is not big enough?

q5: "where we again assign infinite base quality to ref reads"; but you have said that you use base quality instead of the pairHMM. If it is infinite, is that fair to the ref bases?

Broad website contact form: Two things want to confirm with you before running Genome STRiP


Hi Bob,

I have pasted my questions here. It will be helpful for users who have the same queries.

Thanks for your information. From this site https://gatkforums.broadinstitute.org/gatk/discussion/1492/genome-mask-files, @Geraldine_VdAuwera explained that a base is assigned a 0 if an N-base sequence centered on this base is unique within the reference genome after running ComputeGenomeMask. Hope you can help us do a final check. Thank you very much.

Best wishes,
Zhuqing

Hi,

It is probably better to submit questions on the GATK forum.

The masks all use 1 for a position to keep, 0 for a position to drop (like bitwise AND).

For the CN2 mask, you want to keep positions that are more likely to be non-variable in most individuals (so you set the sex chromosomes to zero, along with known repeats, CNVs, etc.).

For the alignability mask, reliably alignable positions should be marked as 1 after running ComputeGenomeMask.
If you look at the human masks, I believe they should follow this same pattern.

-Bob

Question/Comment: For non-human genomes, we should prepare the alignability mask and CN2 mask files before running Genome STRiP. For the alignability mask I will use ComputeGenomeMask; for the CN2 mask I will exclude the sex chromosomes, unplaced contigs, and repeat annotations from RepeatMasker (all these regions should be masked with a 0). Am I right?

Another thing I want to confirm: for the alignability mask FASTA file, positions are masked with a 0 if they are reliably alignable and 1 if they are not. However, for the CN2 mask FASTA file, positions are masked with a 0 if they are likely to be copy-number polymorphic and 1 if they are unlikely. Am I right?

Thank you.
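For reference, a sketch of the ComputeGenomeMask invocation as described in the Genome STRiP documentation (the classpath and read length are placeholders; the read length should match your sequencing reads):

java -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.apps.ComputeGenomeMask \
    -R reference.fasta \
    -O reference.svmask.fasta \
    -readLength 100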

Normal-lod and tumor-lod in Mutect2


Hello,

What are normal-lod and tumor-lod in Mutect2?
What is the basis of the default values, 2.2 and 3.0 respectively?
And if the thresholds are lowered or raised, what would happen in terms of mutation calling?
Please explain in very basic terms (AD, AF, ...).

Many thanks,
Luke
