Channel: Recent Discussions — GATK-Forum

Error with VariantFiltration

My VariantFiltration command was working with lenient filters, but it threw an exception with stricter filters. I need the more stringent variants for haplotype phasing, so how do I fix this? This step is part of the bootstrapping BQSR implementation.

Command:
```
gatk VariantFiltration -R ${refdir}/ref.fa -V ${outdir}/recal0_raw_variants.vcf --filter-expression "QD < 10.0 || FS > 10.0 || SOR > 1.0 || MQ < 59.0 || MQRankSum < -0.5 || ReadPosRankSum < -1" --filter-name "my_filter" -O ${outdir}/recal0_filtered_variants.vcf 
```

Partial Error:
```
NumberFormatException: For input string: "-0.621"
```

I tried posting the entire script and error, but the forum kept giving me a links error.
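
One commonly suggested way to narrow this down (a sketch, not a confirmed fix; the filter names here are made up for illustration) is to split the compound expression into one filter per annotation, since VariantFiltration accepts repeated --filter-expression/--filter-name pairs. That isolates which sub-expression trips the parser and sidesteps known issues with missing annotations in compound JEXL expressions:

```
gatk VariantFiltration -R ${refdir}/ref.fa -V ${outdir}/recal0_raw_variants.vcf \
    --filter-expression "QD < 10.0" --filter-name "QD_filter" \
    --filter-expression "FS > 10.0" --filter-name "FS_filter" \
    --filter-expression "SOR > 1.0" --filter-name "SOR_filter" \
    --filter-expression "MQ < 59.0" --filter-name "MQ_filter" \
    --filter-expression "MQRankSum < -0.5" --filter-name "MQRankSum_filter" \
    --filter-expression "ReadPosRankSum < -1.0" --filter-name "ReadPosRankSum_filter" \
    -O ${outdir}/recal0_filtered_variants.vcf
```

Note the sketch writes -1.0 rather than -1, in case the parser is choking on mixed integer and floating-point literals.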

Several Annotations not working in GATK HaplotypeCaller


I am using Genotype Given Alleles mode with HaplotypeCaller. I am trying to explicitly request all annotations that the documentation says are compatible with HaplotypeCaller (and that make sense for a single sample, e.g. no Hardy-Weinberg).

The following annotations all return "NA": GCContent (GC), HomopolymerRun (HRun), TandemRepeatAnnotator (STR, RU, RPA). They are nonetheless valid requests, because I get no errors from GATK.

This is the command I ran (all on one line)

java -Xmx40g -jar /data5/bsi/bictools/alignment/gatk/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller --input_file /data2/external_data/[...]/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/[...]m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/[...]m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff

The log file is below. Notice the "weird" WARNings that "StrandBiasBySample annotation exists in input VCF header", which make no sense because the header is empty apart from the barebone fields.

This is the barebone VCF
head /data2/external_data/[...]_m026645/s109575.ez/Sequencing_2016/OMNI.vcf

##fileformat=VCFv4.2

#CHROM POS ID REF ALT QUAL FILTER INFO

chr1 723918 rs144434834 G A 30 PASS .
chr1 729632 rs116720794 C T 30 PASS .
chr1 752566 rs3094315 G A 30 PASS .
chr1 752721 rs3131972 A G 30 PASS .
chr1 754063 rs12184312 G T 30 PASS .
chr1 757691 rs74045212 T C 30 PASS .
chr1 759036 rs114525117 G A 30 PASS .
chr1 761764 rs144708130 G A 30 PASS .

This is the output

INFO 10:03:06,080 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,082 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12
INFO 10:03:06,083 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 10:03:06,083 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 10:03:06,086 HelpFormatter - Program Args: -T HaplotypeCaller --input_file /data2/external_data/[...]/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/[...]m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/[...]m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff
INFO 10:03:06,093 HelpFormatter - Executing as m037385@franklin04-213 on Linux 2.6.32-573.8.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26.
INFO 10:03:06,094 HelpFormatter - Date/Time: 2016/01/19 10:03:06
INFO 10:03:06,094 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,094 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,545 GenomeAnalysisEngine - Strictness is SILENT
INFO 10:03:06,657 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Fraction: 0.04
INFO 10:03:06,666 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 10:03:07,012 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.35
INFO 10:03:07,031 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 10:03:07,170 IntervalUtils - Processing 51304566 bp from intervals
INFO 10:03:07,256 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 10:03:07,595 GenomeAnalysisEngine - Done preparing for traversal
INFO 10:03:07,595 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 10:03:07,595 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 10:03:07,596 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 10:03:07,596 HaplotypeCaller - Disabling physical phasing, which is supported only for reference-model confidence output
WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
INFO 10:03:07,719 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 10:03:37,599 ProgressMeter - chr22:5344011 0.0 30.0 s 49.6 w 10.4% 4.8 m 4.3 m
INFO 10:04:07,600 ProgressMeter - chr22:11875880 0.0 60.0 s 99.2 w 23.1% 4.3 m 3.3 m
Using AVX accelerated implementation of PairHMM
INFO 10:04:29,924 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 10:04:29,925 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called
WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called
WARN 10:04:29,939 AnnotationUtils - Annotation will not be calculated, genotype is not called
INFO 10:04:37,601 ProgressMeter - chr22:17412465 0.0 90.0 s 148.8 w 33.9% 4.4 m 2.9 m
INFO 10:05:07,602 ProgressMeter - chr22:18643131 0.0 120.0 s 198.4 w 36.3% 5.5 m 3.5 m
INFO 10:05:37,603 ProgressMeter - chr22:20133744 0.0 2.5 m 248.0 w 39.2% 6.4 m 3.9 m
INFO 10:06:07,604 ProgressMeter - chr22:22062452 0.0 3.0 m 297.6 w 43.0% 7.0 m 4.0 m
INFO 10:06:37,605 ProgressMeter - chr22:23818297 0.0 3.5 m 347.2 w 46.4% 7.5 m 4.0 m
INFO 10:07:07,606 ProgressMeter - chr22:25491290 0.0 4.0 m 396.8 w 49.7% 8.1 m 4.1 m
INFO 10:07:37,607 ProgressMeter - chr22:27044271 0.0 4.5 m 446.4 w 52.7% 8.5 m 4.0 m
INFO 10:08:07,608 ProgressMeter - chr22:28494980 0.0 5.0 m 496.1 w 55.5% 9.0 m 4.0 m
INFO 10:08:47,609 ProgressMeter - chr22:30866786 0.0 5.7 m 562.2 w 60.2% 9.4 m 3.8 m
INFO 10:09:27,610 ProgressMeter - chr22:32908950 0.0 6.3 m 628.3 w 64.1% 9.9 m 3.5 m
INFO 10:09:57,610 ProgressMeter - chr22:34451306 0.0 6.8 m 677.9 w 67.2% 10.2 m 3.3 m
INFO 10:10:27,611 ProgressMeter - chr22:36013343 0.0 7.3 m 727.5 w 70.2% 10.4 m 3.1 m
INFO 10:10:57,613 ProgressMeter - chr22:37387478 0.0 7.8 m 777.1 w 72.9% 10.7 m 2.9 m
INFO 10:11:27,614 ProgressMeter - chr22:38534891 0.0 8.3 m 826.8 w 75.1% 11.1 m 2.8 m
INFO 10:11:57,615 ProgressMeter - chr22:39910054 0.0 8.8 m 876.4 w 77.8% 11.4 m 2.5 m
INFO 10:12:27,616 ProgressMeter - chr22:41738463 0.0 9.3 m 926.0 w 81.4% 11.5 m 2.1 m
INFO 10:12:57,617 ProgressMeter - chr22:43113306 0.0 9.8 m 975.6 w 84.0% 11.7 m 112.0 s
INFO 10:13:27,618 ProgressMeter - chr22:44456937 0.0 10.3 m 1025.2 w 86.7% 11.9 m 95.0 s
INFO 10:13:57,619 ProgressMeter - chr22:45448656 0.0 10.8 m 1074.8 w 88.6% 12.2 m 83.0 s
INFO 10:14:27,620 ProgressMeter - chr22:46689073 0.0 11.3 m 1124.4 w 91.0% 12.5 m 67.0 s
INFO 10:14:57,621 ProgressMeter - chr22:48062438 0.0 11.8 m 1174.0 w 93.7% 12.6 m 47.0 s
INFO 10:15:27,622 ProgressMeter - chr22:49363910 0.0 12.3 m 1223.6 w 96.2% 12.8 m 29.0 s
INFO 10:15:57,623 ProgressMeter - chr22:50688233 0.0 12.8 m 1273.2 w 98.8% 13.0 m 9.0 s
INFO 10:16:12,379 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.061128124000000006
INFO 10:16:12,379 PairHMM - Total compute time in PairHMM computeLikelihoods() : 22.846350295
INFO 10:16:12,380 HaplotypeCaller - Ran local assembly on 25679 active regions
INFO 10:16:12,434 ProgressMeter - done 5.1304566E7 13.1 m 15.0 s 100.0% 13.1 m 0.0 s
INFO 10:16:12,435 ProgressMeter - Total runtime 784.84 secs, 13.08 min, 0.22 hours
INFO 10:16:12,435 MicroScheduler - 727347 reads were filtered out during the traversal out of approximately 4410423 total reads (16.49%)
INFO 10:16:12,435 MicroScheduler - -> 2 reads (0.00% of total) failing BadCigarFilter
INFO 10:16:12,436 MicroScheduler - -> 669763 reads (15.19% of total) failing DuplicateReadFilter
INFO 10:16:12,436 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 10:16:12,436 MicroScheduler - -> 57582 reads (1.31% of total) failing HCMappingQualityFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 10:16:12,438 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

Using a SNP database of known variants from RAD-seq for BQSR on whole-genome data

Is it a bad idea to use a dataset of known variants determined from RAD-seq data as the input for BQSR on whole-genome resequencing data? After reading the tool description, my understanding is that novel variation present in the whole-genome sequence that is not a previously known variant would be treated as a sequencing error for the purposes of finding associations between sequencing errors and genomic context, machine cycle, etc.; BQSR then adjusts quality scores based on these models. Thus, if these variants are not in fact sequencing errors, but are also not associated with any of the putative error covariates, will their quality scores remain fine? Or will these novel variants have their quality scores downgraded simply by virtue of being assumed to be errors, even if they are not found to be associated with the putative error covariates?

Thanks in advance for your advice.

GATK4 VariantFiltration is unable to tag the variants properly with --genotype-filter-expression


I am trying to filter variants based on the FORMAT annotation GQ < 20.

A couple of variants from the input VCF (APOE.recode.genotypeRefined.vcf) are shown below, with genotypes for only 5 of the 95 samples:

chr19 44907654 rs769451 T G 1131.52 PASS AC=3;AF=0.016;AN=190;BaseQRankSum=-4.920e-01;DB;DP=2959;ExcessHet=3.0798;FS=0.788;InbreedingCoeff=-0.0160;MLEAC=3;MLEAF=0.016;MQ=59.93;MQRankSum=0.00;PG=0,16,39;POSITIVE_TRAIN_SITE;QD=12.04;ReadPosRankSum=0.00;SOR=0.582;VQSLOD=9.43;culprit=MQRankSum GT:AD:DP:GQ:PL:PP 0/0:37,0:37:99:0,99,1485:0,115,1524 0/0:35,0:35:99:0,99,1485:0,115,1524 0/0:27,0:27:85:0,69,1035:0,85,1074 0/0:37,0:37:99:0,99,1485:0,115,1524 0/0:25,0:25:82:0,66,990:0,82,1029

chr19 44908684 rs429358 T C 4672.69 PASS AC=24;AF=0.126;AN=190;BaseQRankSum=-2.469e+00;DB;DP=1958;ExcessHet=4.7065;FS=1.755;InbreedingCoeff=-0.0544;MLEAC=24;MLEAF=0.126;MQ=59.98;MQRankSum=0.00;PG=0,10,26;POSITIVE_TRAIN_SITE;QD=9.40;ReadPosRankSum=0.283;SOR=0.859;VQSLOD=8.99;culprit=MQRankSum GT:AD:DP:GQ:PL:PP 0/1:27,4:31:28:38,0,909:28,0,925 0/0:29,0:29:94:0,84,1260:0,94,1286 0/0:15,0:15:55:0,45,563:0,55,589 0/0:19,0:19:67:0,57,694:0,67,720 0/0:10,0:10:40:0,30,302:0,40,328

I ran the following command:
/usr/bin/java -jar gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar VariantFiltration --variant APOE.recode.genotypeRefined.vcf --genotype-filter-expression "GQ < 20" --genotype-filter-name "lowGQ" --output APOE.recode.genotypeRefined.filteredoutLowGQ.vcf

The output for the above two variants is shown below:

chr19 44907654 rs769451 T G 1131.52 PASS AC=3;AF=0.016;AN=190;BaseQRankSum=-4.920e-01;DB;DP=2959;ExcessHet=3.0798;FS=0.788;InbreedingCoeff=-0.0160;MLEAC=3;MLEAF=0.016;MQ=59.93;MQRankSum=0.00;PG=0,16,39;POSITIVE_TRAIN_SITE;QD=12.04;ReadPosRankSum=0.00;SOR=0.582;VQSLOD=9.43;culprit=MQRankSum GT:AD:DP:GQ:PL:PP 0/0:37,0:37:99:0,99,1485:0,115,1524 0/0:35,0:35:99:0,99,1485:0,115,1524 0/0:27,0:27:85:0,69,1035:0,85,1074 0/0:37,0:37:99:0,99,1485:0,115,1524 0/0:25,0:25:82:0,66,990:0,82,1029

chr19 44908684 rs429358 T C 4672.69 PASS AC=24;AF=0.126;AN=190;BaseQRankSum=-2.469e+00;DB;DP=1958;ExcessHet=4.7065;FS=1.755;InbreedingCoeff=-0.0544;MLEAC=24;MLEAF=0.126;MQ=59.98;MQRankSum=0.00;PG=0,10,26;POSITIVE_TRAIN_SITE;QD=9.40;ReadPosRankSum=0.283;SOR=0.859;VQSLOD=8.99;culprit=MQRankSum GT:AD:DP:FT:GQ:PL:PP 0/1:27,4:31:PASS:28:38,0,909:28,0,925 0/0:29,0:29:PASS:94:0,84,1260:0,94,1286 0/0:15,0:15:PASS:55:0,45,563:0,55,589 0/0:19,0:19:PASS:67:0,57,694:0,67,720 0/0:10,0:10:PASS:40:0,30,302:0,40,328

Notice that the variant rs429358 has the genotype-level filter (FT tag) for all samples, while the variant rs769451 has no FT tag, which means the genotype filter was not applied to that variant.

Is there something wrong or missing in the above command that could explain the missing FT tag? Or is this expected behavior that I am not aware of? I thought the genotype-level filter would be applied to all input variants.
Please help.
Thanks
Srikant

(How to) Run GATK in a Docker container


This document explains how to install and use Docker to run GATK on a local machine. For a primer on what Docker containers are for and related terminology, see this Dictionary entry.


Contents

  1. Install Docker
  2. Test that it works
  3. Get the GATK container image
  4. Start up the GATK container
  5. Run a GATK command in the container
  6. Use a mounted volume to access data that lives outside the container

1. Install Docker

Follow the relevant link below depending on your computer system; on Mac and Windows, select the "Stable channel" download. Run through the installation instructions and initial setup page; they are very straightforward and should only take you a few minutes (not counting download time).
We have included instructions below for all steps after that first page, so you shouldn't need to go to any other pages in the Docker documentation. Frankly, their docs are targeted at people who want to do things like run web applications on the cloud, and they can be quite frustrating to deal with.

MacOS systems

Click here for the MacOS install instructions

On Mac, the installation adds a menu bar item that looks like a whale/container-ship, which conveniently shows you the status of the Docker "daemon" (= program that runs in the background) and gives you GUI access to various Docker-related functionalities. But you can also just use it from the command-line, which is what we'll do in the rest of this tutorial.

Windows systems

Click here for the Windows install instructions

Note that on some Windows systems (including non-Pro versions like Windows Home, and older versions) the "normal" Docker app doesn't work, and you have to use an older app called Docker Toolbox, which you can find here.

Linux systems

Here is the full list of supported systems and their install pages.


2. Test that it works

Now, open a terminal window and invoke the docker program directly. Checking the version is always a good way to test that a program will run without investing too much effort into finding a command that will work, so let's do:

docker --version

This should return something like "Docker version 17.06.0-ce, build 02c1d87".

If you run into trouble at this step, you may need to run one or more of the following commands:

docker-machine restart default
docker-machine regenerate-certs
docker-machine env

Note that we have had reports that Docker is not compatible with some other virtual machine software; if you run into that problem you may need to uninstall other software. Or, uh, install Docker in a virtual machine? Ahhhh, too many layers! Let's just assume your Docker install worked fine. (If not, let us know in the forum and we'll try to help you)


3. Get the GATK container image

Still in your terminal (it doesn't matter where your working directory is), run the following command to retrieve the GATK image from Docker Hub:

docker pull broadinstitute/gatk:4.1.0.0

Note that the last bit after gatk: is the version tag, which you can change to get a different version than what we've specified here. At the time of writing, we're using the latest released version.

The GATK container image is quite large so the download may take a little while if you've never done this before. The good news is that next time you need to pull a GATK image (e.g. to get another release), Docker will only pull the components that have been updated, so it will go faster.
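
To confirm the image is now available locally (an optional check), list your local images; you should see broadinstitute/gatk in the list with the 4.1.0.0 tag:

docker images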


4. Start up the GATK container

There are several different ways to do this in Docker. Here we're going to use the simplest invocation that gets us the functionality we need, i.e. the ability to log into the container once it's running and execute commands from inside it.

docker run -it broadinstitute/gatk:4.1.0.0

If all goes well, this will start up the container in interactive mode, and you will automatically get logged into it. Your terminal prompt will change to something like this:

root@ea3a5218f494:/gatk#

At this point you can use classic shell commands to explore the container and see what's in there, if you like.


5. Run a GATK command in the container

The container has the gatk wrapper script all set up and ready to go, so you can now run any GATK or Picard command you want. Note that if you want to run a Picard command, you need to use the new syntax, which follows GATK conventions (-I instead of I= and so on). Let's use --list to list all tools available in this version.

./gatk --list

The output will start with a usage message (shown below) then a full list of tools and their summary descriptions.

Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
Running:
    /gatk/build/install/gatk/bin/gatk --help
USAGE:  <program name> [-h]

Once you've verified that this works for you, you know you can run any GATK commands you want. But before you proceed, there's one more setup thing to go through, which is technically optional but will make your life much easier.


6. Use a mounted volume to access data that lives outside the container

This is the final piece of the puzzle. By default, when you're inside the container you can't access any data that lives on the filesystem outside of the container. One way to deal with that is to copy things back and forth, but that's wasteful and tedious. So we're going to follow the better path, which is to mount a volume in the container, i.e. establish a link that makes part of the filesystem visible from inside the container.

The hitch is that you can't do this after you started running the container, so you'll have to shut it down and run a new one (not just restart the first one) with an extra part to the command. In case you're wondering why we didn't do this from the get-go, it's because the first command we ran is simpler so there's less chance that something will go wrong, which is nice when you're trying something for the first time.

To shut down your container from inside it, you can just type exit while still inside the container:

exit

That should stop the container and take you back to your regular prompt. It's also possible to exit the container without stopping it (a move called detaching) but that's a matter for another time, since here we do want to stop it. You'll probably also want to learn how to clean up and delete old instances of containers that you no longer want.

For now, let's focus on starting a new instance of the GATK4 container, specifying in the following command the image version you want and the filesystem location you want to mount.

docker run -v ~/my_project:/gatk/my_data -it broadinstitute/gatk:4.1.0.0

Here I set the external location to an existing directory called my_project in my home directory (the key requirement is that it must be an absolute path), and I set the mount point inside the container's /gatk directory. The name of the mount point can be the same as the mounted directory or something completely different; the main constraint is that it should not conflict with an existing directory, since that would make the existing directory unreachable.

Assuming your paths are valid, this command starts up the container and logs you into it the same way as before; but now you can see by using ls that you have access to your filesystem. So now you can run GATK commands on any data you have lying around. Have fun!
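
For example (sample.bam here is a hypothetical file sitting in your mounted my_project directory, just to illustrate the idea), you could list the mounted data and run a tool on it directly:

ls /gatk/my_data
./gatk ValidateSamFile -I /gatk/my_data/sample.bam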

"Could not open array genomicsdb_array at workspace" from GenotypeGVCFs in GATK 4.0.0.0


I am experiencing issues with GenotypeGVCFs and GenomicsDB input in the final GATK4 release, using the official Docker image. This does not occur with the 4.beta.6 release. It looks like there was a bug in an earlier beta release with the same error message, which got fixed. Is my issue related to that old bug, or does it just produce the same error message? What can I do to debug the issue?

2018-01-10T12:15:04.154516155Z terminate called after throwing an instance of 'VariantQueryProcessorException'
2018-01-10T12:15:04.154547266Z   what():  VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db
2018-01-10T12:15:04.154561314Z 
2018-01-10T12:15:04.620517615Z Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
2018-01-10T12:15:04.620517615Z Running:
2018-01-10T12:15:04.620517615Z     /gatk/build/install/gatk/bin/gatk GenotypeGVCFs -V gendb:///keep/d22f668d4f44631d98bc650d582975ca+1399/chr22_db --output chr22_db.vcf --reference /keep/db91e5f04cbd9018e42708316c28e82d+2160/hg19.fa

How can I get a germline-resource file for Mutect2?

Hi team,
I am running Mutect2 on mouse data. May I ask how I can get these two files for my data?
1. the input .vcf file for --germline-resource (e.g. resources/chr17_af-only-gnomad_grch38.vcf.gz)
2. the input .vcf file for GetPileupSummaries -V (e.g. resources/chr17_small_exac_common_3_grch38.vcf.gz)

Thanks for any of your kind help!

(How to part I) Sensitively detect copy ratio alterations and allelic segments


Document is currently under review and in BETA. It is incomplete and may contain inaccuracies. Expect changes to the content.


This workflow is broken into two tutorials. You are currently on the first part.

The tutorial outlines steps in detecting copy ratio alterations, more familiarly known as copy number variants (CNVs), as well as allelic segments in a single sample using GATK4. The tutorial (i) denoises case sample alignment data against a panel of normals (PoN) to obtain copy ratios (Tutorial#11682) and (ii) models segments from the copy ratios and allelic counts (Tutorial#11683). The latter modeling incorporates data from a matched control. The same workflow steps apply to targeted exome and whole genome sequencing data.

Tutorial#11682 covers sections 1–4. Section 1 prepares a genomic intervals list with PreprocessIntervals and collects read coverage counts across the intervals. Section 2 creates a CNV PoN with CreateReadCountPanelOfNormals using read coverage counts. Section 3 denoises read coverage data against the PoN with DenoiseReadCounts using principal component analysis. Section 4 plots the results of standardizing and denoising copy ratios against the PoN.

Tutorial#11683 covers sections 5–8. Section 5 collects counts of reference versus alternate alleles with CollectAllelicCounts. Section 6 incorporates copy ratio and allelic counts data to group contiguous copy ratio and allelic counts segments with ModelSegments using kernel segmentation and Markov-chain Monte Carlo. The tool can also segment either copy ratio data or allelic counts data alone. Both types of data together refine segmentation results in that segments are based on the same copy ratio and the same minor allele fraction. Section 7 calls amplification, deletion and neutral events for the segmented copy ratios. Finally, Section 8 plots the results of segmentation and estimated allele-specific copy ratios.

Plotting is across genomic loci on the x-axis and copy or allelic ratios on the y-axis. The first part of the workflow focuses on removing systematic noise from coverage counts and adjusts the data points vertically. The second part focuses on segmentation and groups the data points horizontally. The extent of grouping, or smoothing, is adjustable with ModelSegments parameters. These adjustments do not change the copy ratios; the denoising in the first part of the workflow remains invariant in the second part of the workflow. See Figure 3 of this poster for a summary of tutorial results.

► The official GATK4 workflow is capable of running efficiently on WGS data and provides much greater resolution, up to ~50-fold more resolution for tested data. In these ways, GATK4 CNV improves upon its predecessor workflows in GATK4.alpha and GATK4.beta. Validations are still in progress and therefore the workflow itself is in BETA status, even if most tools, with the exception of ModelSegments, are production ready. The ModelSegments tool is still in BETA status and may change in small but significant ways going forward. Use at your own risk.

► The tutorial skips explicit GC-correction, an option in CNV analysis. For instructions on how to correct for GC bias, see AnnotateIntervals and DenoiseReadCounts tool documentation.

The GATK4 CNV workflow offers a multitude of levers, e.g. towards fine-tuning analyses and towards controls. Researchers are expected to tune workflow parameters on samples with similar copy number profiles as their case sample under scrutiny. Refer to each tool's documentation for descriptions of parameters.


Jump to a section

  1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts
    1.1 How do I view HDF5 format data?
  2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals
  3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts
  4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios
    4.1 Compare two PoNs: considerations in panel of normals creation
    4.2 Compare PoN denoising versus matched-normal denoising
  5. Count ref and alt alleles at common germline variant sites using CollectAllelicCounts
    5.1 What is the difference between CollectAllelicCounts and GetPileupSummaries?
  6. Group contiguous copy ratios into segments with ModelSegments
  7. Call copy-neutral, amplified and deleted segments with CallCopyRatioSegments
  8. Plot modeled segments and allelic copy ratios with PlotModeledSegments
    8.1 Some considerations in interpreting allelic copy ratios
    8.2 Some results of fine-tuning smoothing parameters

Tools involved

  • GATK 4.0.1.1 or later releases.
  • The plotting tools require particular R components. Options are to install these or to use the broadinstitute/gatk Docker. In particular, to match versions, use the broadinstitute/gatk:4.0.1.1 version.

Download example data

Download tutorial_11682.tar.gz and tutorial_11683.tar.gz, either from the GoogleDrive or from the FTP site. To access the ftp site, leave the password field blank. If the GoogleDrive link is broken, please let us know. The tutorial also requires the GRCh38 reference FASTA, dictionary and index. These are available from the GATK Resource Bundle. For details on the example data, see Tutorial#11136's third footnote and [1].

Alternatively, download the spacecade7/tutorial_11682_11683 docker image from DockerHub. The image contains GATK4.0.1.1 and the data necessary to run the tutorial commands, including the GRCh38 reference. Allocation of at least 4GB memory to Docker is recommended before launching the container.


1. Collect raw counts data with PreprocessIntervals and CollectFragmentCounts

Before collecting the coverage counts that form the basis of copy number variant detection, we define the resolution of the analysis with a genomic intervals list. The extent of genomic coverage and the size of the genomic intervals in the list factor into resolution.

Preparing a genomic intervals list is necessary whether an analysis is on targeted exome data or whole genome data. In the case of exome data, we pad the target regions of the capture kit. In the case of whole genome data, we divide the reference genome into equally sized intervals or bins. In either case, we use PreprocessIntervals to prepare the intervals list.

For the tutorial exome data, we provide the capture kit target regions in 1-based intervals and set --bin-length to zero.

gatk PreprocessIntervals \
    -L targets_C.interval_list \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.preprocessed.interval_list

This produces a Picard-style intervals list, targets_C.preprocessed.interval_list, for use in the coverage collection step. Each interval is expanded by 250 bases on either side.

Comments on select parameters

  • The -L argument is optional. If provided, the tool expects the intervals list to be in Picard style as described in Article#1319. The tool errors out for other formats. If this argument is omitted, then the tool assumes each contig is a single interval. See [2] for additional discussion.
  • Set the --bin-length argument to be appropriate for the type of data, e.g. the default 1000 for whole genome or 0 for exomes. In binning, an interval is divided into equal-sized regions of the specified length. The tool does not bin regions that contain Ns. [3] (A whole-genome sketch follows this list.)
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The --reference or -R is required and implies the presence of a corresponding reference index and a reference dictionary in the same directory.
  • To change the padding interval, specify the new value with --padding. The default value of 250 bases was determined to work well empirically for TCGA targeted exome data. This argument is relevant for exome data, as binning without an intervals list does not allow for intervals expansion. [5]
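
For whole genome data, the equivalent preparation omits the -L argument and keeps the default 1000-base bins. A minimal sketch (the output name here is illustrative, not a tutorial file):

gatk PreprocessIntervals \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    --bin-length 1000 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/wgs.preprocessed.interval_list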

Take a look at the intervals before and after padding.

[Image cnv_intervals: the tutorial intervals before and after padding]

For consecutive intervals less than 250 bases apart, how does the tool pad the intervals?

Now collect the raw integer counts data. The tutorial uses GATK4.0.1.1's CollectFragmentCounts, which counts coverage of paired-end fragments; a fragment is counted once, in the interval that its center overlaps. In GATK4.0.3.0, CollectReadCounts replaces CollectFragmentCounts. CollectReadCounts counts reads that overlap the interval.

The tutorial has already collected coverage on the tumor case sample, on the normal matched-control and on each of the normal samples that constitute the PoN. To demonstrate coverage collection, the following command uses the small BAM from Tutorial#11136’s data bundle [6]. The tutorial does not use the resulting file in subsequent steps. The CollectReadCounts command swaps out the tool name but otherwise uses identical parameters.

gatk CollectFragmentCounts \
    -I tumor.bam \
    -L targets_C.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/tumor.counts.hdf5

In the tutorial data bundle, the equivalent full-length result is hcc1143_T_clean.counts.hdf5. The data tabulates CONTIG, START, END and raw COUNT values for each genomic interval.

Comments on select parameters

  • The -L argument interval list is a Picard-style interval list prepared with PreprocessIntervals.
  • The -I input is alignment data.
  • By default, data is in HDF5 format. To generate text-based TSV (tab-separated values) format data, specify --format TSV. The HDF5 format allows for quicker panel of normals creation.
  • Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. [4]
  • The tool employs a number of engine-level read filters. Of note are NotDuplicateReadFilter, FirstOfPairReadFilter, ProperlyPairedReadFilter and MappingQualityReadFilter. [7]

☞ 1.1 How do I view HDF5 format data?

See Article#11508 for an overview of the format and instructions on how to navigate the data with external application HDFView. The article illustrates features of the format using data generated in this tutorial.


back to top


2. Generate a CNV panel of normals with CreateReadCountPanelOfNormals

In creating a PoN, CreateReadCountPanelOfNormals abstracts the counts data for the samples and the intervals using Singular Value Decomposition (SVD), a type of Principal Component Analysis (PCA). The normal samples in the PoN should match the sequencing approach of the case sample under scrutiny. This applies especially to targeted exome data because the capture step introduces target-specific noise.

The tutorial has already created a CNV panel of normals using forty 1000 Genomes Project samples. The command below illustrates PoN creation using just three samples.

gatk --java-options "-Xmx6500m" CreateReadCountPanelOfNormals \
    -I HG00133.alt_bwamem_GRCh38DH.20150826.GBR.exome.counts.hdf5 \
    -I HG00733.alt_bwamem_GRCh38DH.20150826.PUR.exome.counts.hdf5 \
    -I NA19654.alt_bwamem_GRCh38DH.20150826.MXL.exome.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O sandbox/cnvponC.pon.hdf5

This generates a PoN in HDF5 format. The PoN stores information that, when applied, will (i) standardize case sample counts to PoN median counts and (ii) remove systematic noise in the case sample.

Comments on select parameters

  • Provide integer read coverage counts for each sample using -I. Coverage data may be in either TSV or HDF5 format. The tool will accept a single sample, e.g. the matched-normal.
  • The default --number-of-eigensamples or principal components is twenty. The tool will adjust this number to the smaller of twenty or the number of samples the tool retains after filtering. In general, denoising against a PoN with more components improves segmentation, but at the expense of sensitivity. Ideally, researchers should perform a sensitivity analysis to choose an appropriate value for this parameter. See this related discussion.
  • To run the tool using Spark, specify the Spark Master with --spark-master. See Article#11245 for details.

Comments on filtering and imputation parameters, in the order of application

  1. The tutorial changes the --minimum-interval-median-percentile argument from the default of 10.0 to a smaller value of 5.0. The tool filters out targets or bins with a median proportional coverage below this percentile. The median is across the samples. The proportional coverage is the target coverage divided by the sum of the coverage of all targets for a sample (see the worked example after this list). The effect of setting this parameter to a smaller value is that we retain more information.
  2. The --maximum-zeros-in-sample-percentage default is 5.0. Any sample with more than 5% zero coverage targets is filtered.
  3. The --maximum-zeros-in-interval-percentage default is 5.0. Any target interval with more than 5% zero coverage across samples is filtered.
  4. The --extreme-sample-median-percentile default is 2.5. Any sample with less than 2.5 percentile or more than 97.5 percentile normalized median proportional coverage is filtered.
  5. The --do-impute-zeros default is set to true. The tool takes zero coverage regions and changes these values to the median of the non-zero values. The tool additionally normalizes zero values below the 0.10 percentile or above the 99.90 percentile to the corresponding percentile values.
  6. The --extreme-outlier-truncation-percentile default is 0.1. The tool takes any proportional coverage below the 0.1 percentile or above the 99.9 percentile and sets it to the corresponding percentile value.
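
To make proportional coverage concrete (with made-up numbers, not tutorial data): a target with 500 counts in a sample whose targets sum to 1,000,000 counts has proportional coverage 500 / 1,000,000 = 0.0005. The filter in step 1 then removes the target if the median of this value across samples falls below the chosen percentile cutoff.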

The current filtering and imputation parameters are identical to those in the BETA release of the CNV workflow and may change in later versions based on evaluations. The implementation has been made more memory efficient, so the tool runs faster than the BETA release.

If the data are not uniform, e.g. have many intervals with zero or low counts, the tool emits a warning to adjust the filtering parameters and stops the run. This may happen, for example, if one attempts to construct a panel of mixed-sex samples and includes the allosomal contigs [8]. In this case, first be sure to either exclude allosomal contigs via a subset intervals list or subset the panel samples to those expected to have similar coverage across the given contigs, e.g. panels of the same sex. If the warning still occurs, then adjust --minimum-interval-median-percentile to a larger value. See this thread for the original discussion.

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers? Could PCA account for GC-bias?
What do you know about the 1000 Genome Project? Specifically, the exome data?
How could we tell a good PoN from a bad PoN? What control could we use?

In a somatic analysis, what is better for a PoN: tissue-matched normals or blood normals?
Should we include our particular tumor’s matched normal in the PoN?


back to top


3. Standardize and denoise case read counts against the PoN with DenoiseReadCounts

Provide DenoiseReadCounts with counts collected by CollectFragmentCounts and the CNV PoN generated with CreateReadCountPanelOfNormals.

gatk --java-options "-Xmx12g" DenoiseReadCounts \
    -I hcc1143_T_clean.counts.hdf5 \
    --count-panel-of-normals cnvponC.pon.hdf5 \
    --standardized-copy-ratios sandbox/hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios sandbox/hcc1143_T_clean.denoisedCR.tsv

This produces two files, the standardized copy ratios hcc1143_T_clean.standardizedCR.tsv and the denoised copy ratios hcc1143_T_clean.denoisedCR.tsv, each representing a data transformation. In the first transformation, the tool standardizes counts by the PoN median counts. The standardization includes log2 transformation and normalizing the counts data to center around one. In the second transformation, the tool denoises the standardized copy ratios using the principal components of the PoN.

Comments on select parameters

  • Because the default --number-of-eigensamples is null, the tool uses the maximum number of eigensamples available in the PoN. In section 2, by using default CreateReadCountPanelOfNormals parameters, we capped the number of eigensamples in the PoN at twenty. Changing --number-of-eigensamples in DenoiseReadCounts to lower values can change the resolution of results, i.e. how smooth segments are. See this thread for detailed discussion.
  • Additionally provide the optional --annotated-intervals file generated by AnnotateIntervals to concurrently perform GC-bias correction (a sketch follows this list).
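
A minimal sketch of that GC-correction path, using the tutorial reference and intervals list (the output name is illustrative): first annotate the intervals with GC content, then add --annotated-intervals sandbox/targets_C.annotated.tsv to the DenoiseReadCounts command above.

gatk AnnotateIntervals \
    -R /gatk/ref/Homo_sapiens_assembly38.fasta \
    -L targets_C.preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sandbox/targets_C.annotated.tsv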


back to top


4. Plot standardized and denoised copy ratios with PlotDenoisedCopyRatios

We plot the standardized and denoised read counts with PlotDenoisedCopyRatios. The plots allow visually assessing the efficacy of denoising. Provide the tool with both the standardized and denoised copy ratios from the previous step as well as a reference sequence dictionary.

gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios hcc1143_T_clean.standardizedCR.tsv \
    --denoised-copy-ratios hcc1143_T_clean.denoisedCR.tsv \
    --sequence-dictionary Homo_sapiens_assembly38.dict \
    --minimum-contig-length 46709983 \
    --output sandbox/plots \
    --output-prefix hcc1143_T_clean

This produces six files in the plots directory: two PNG images and four text files, as follows.

  • hcc1143_T_clean.denoised.png plots the standardized and denoised read counts across the contigs and scales the y-axis to accommodate all copy ratio data.
  • hcc1143_T_clean.denoisedLimit4.png plots the same but limits the y-axis range from 0 to 4 for comparability across samples.

Each of the text files contains a single quality control value. The value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number events and should decrease after denoising.

  • hcc1143_T_clean.standardizedMAD.txt gives the MAD for standardized copy ratios.
  • hcc1143_T_clean.denoisedMAD.txt gives the MAD for denoised copy ratios.
  • hcc1143_T_clean.deltaMAD.txt gives the difference between standardized MAD and denoised MAD.
  • hcc1143_T_clean.scaledDeltaMAD.txt gives the fractional difference (standardized MAD - denoised MAD)/(standardized MAD); a worked example follows this list.
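
For example, plugging in the PoN-C values reported in section 4.2: a standardized MAD of 0.134 and a denoised MAD of 0.125 give deltaMAD = 0.134 - 0.125 = 0.009 and scaledDeltaMAD = 0.009 / 0.134 ≈ 0.07, i.e. denoising reduced the adjacent-target scatter by about 7%.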

Comments on select parameters

  • The tutorial provides the --sequence-dictionary that matches the GRCh38 reference used in mapping.
  • To omit alternate and decoy contigs from the plots, the tutorial adjusts the --minimum-contig-length from the default value of 1,000,000 to 46,709,983, the length of the smallest of GRCh38's primary assembly contigs.

Here are the results for the HCC1143 tumor cell line and its matched normal cell line. The normal cell line serves as a control. For each sample are two plots that show the effects of PCA denoising. The upper plot shows standardized copy ratios in blue and the lower plot shows denoised copy ratios in green.

4A. Tumor standardized and denoised copy ratio plots
hcc1143_T_clean.denoisedLimit4.png

4B. Normal standardized and denoised copy ratio plots
hcc1143_N_clean.denoisedLimit4.png

Would you guess there are CNV events in the normal? Should we be surprised?

The next step is to perform segmentation. This can be done either using copy ratios alone or in combination with allelic copy ratios. In part II, Section 6 outlines considerations in modeling segments with allelic copy ratios, section 7 generates a callset and section 8 shows how to plot segmented copy and allelic ratios. Again, the tutorial presents these steps using the full features of the workflow. However, researchers may desire to perform copy ratio segmentation independently of allelic counts data, e.g. for a case without a matched-control. For the case-only, segmentation gives the following plots. To recapitulate this approach, omit allelic-counts parameters from the example commands in sections 6 and 8.

4C. Tumor case-only copy ratios segmentation gives 235 segments.
T_caseonly.modeled.png

4D. Normal case-only copy ratios segmentation gives 41 segments.
hcc1143_N_caseonly.png

While the normal sample shows trisomy of chr2 and a subpopulation with deletion of chr6, the tumor sample is highly aberrant. The extent of aneuploidy is unsurprising and consistent with these HCC1143 tumor dSKY results by Wenhan Chen. Remember that cell lines, with increasing culture time and selective bottlenecks, can give rise to new somatic events, undergo clonal selection and develop population heterogeneity much like in cancer.


☞ 4.1 Compare two PoNs: considerations in the panel of normals creation

Denoising with a PoN is critical for calling copy-number variants from targeted exome coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.

To understand the impact a PoN's constituents can have on an analysis, compare the results of denoising the normal sample against two different PoNs. Each PoN consists of forty 1000 Genomes Project exome samples. PoN-M consists of the same cohort used in the Mutect2 tutorial's PoN. We selected PoN-C's constituents with more care and this is the PoN the CNV tutorial uses.

4E. Compare standardization and denoising with PoN-C versus PoN-M.
compare_pons.png

What is the difference in the targets for the two cohorts, cohort-M and cohort-C? Is this a sufficient reason for the difference in noise profiles we observe above?

GATK4 denoises exome coverage profiles robustly with either panel of normals. However, a good panel allows maximal denoising, as is the case for PoN-C over PoN-M.

We use publicly available 1000 Genomes Project data so as to be able to share the data and to illustrate considerations in CNV analyses. In an actual somatic analysis, we would construct the PoNs using the blood normals of the tumor cohort(s). We would construct a PoN for each sex, so as to be able to call events on allosomal chromosomes. Such a PoN should give better results than either of the tutorial PoNs.

Somatic analyses, due to the confounding factors of tumor purity and heterogeneity, require high sensitivity in calling. However, a sensitive caller can only do so much. Use of a carefully constructed PoN augments the sensitivity and helps illuminate copy number events.

This section is adapted from a hands-on tutorial developed and written by Soo Hee Lee (@shlee) in July of 2017 for the GATK workshops in Cambridge and Edinburgh, UK. The original tutorial uses the GATK4.beta workflow and can be found in the 1707 through 1711 GATK workshops folders. Although the Somatic CNV workflow has changed from GATK4.beta and the official GATK4 release, the PCA denoising remains the same. The hands-on tutorial focuses on differences in PCA denoising based on two different panels of normals (PoNs). Researchers may find working through the worksheet to the very end with either release version beneficial, as considerations in selecting PoN constituents remain identical.

Examining the read group information for the samples in the two PoNs shows a difference in the mixture of sequencing centers: four different sequencing centers for PoN-M versus a single sequencing center for PoN-C. The single sequencing center corresponds to that of the HCC1143 samples. Furthermore, tracing sample information shows different targeted exome capture kits across the sequencing centers. Comparing the denoising results of the two PoNs stresses the importance of selective PoN creation.


☞ 4.2 Compare PoN denoising versus matched-normal denoising

A feature of the GATK4 CNV workflow is the ability to normalize a case against a single control sample, e.g. a tumor case against its matched normal. This involves running the control sample through CreateReadCountPanelOfNormals, then denoising the case against this single-sample projection with DenoiseReadCounts. To illustrate this approach, here is the result of denoising the HCC1143 tumor sample against its matched normal. For single-sample matched-control denoising, DenoiseReadCounts produces identical data for standardizedCR.tsv and denoisedCR.tsv.

4F. Tumor case standardized against the normal matched-control
T_normalonly.png

Compare these results to that of section 4.1. Notice the depression in chr2 copy ratios that occurs due to the PoN normal sample's chr2 trisomy. Here, the median absolute deviation (MAD) of 0.149 is an incremental improvement to section 4.1's PoN-M denoising (MAD=0.15). In contrast, PoN-C denoising (MAD=0.125) and even PoN-C standardization alone (MAD=0.134) are seemingly better normalization approaches than the matched-normal standardization. Again, results stress the importance of selective PoN creation.

The PoN accounts for germline CNVs common to its constituents such that the workflow discounts the same variation in the case. It is possible for the workflow to detect germline CNVs not represented in the PoN, in particular, rare germline CNVs. In the case of matched-normal standardization, the workflow should discount germline CNVs and reveal only somatic events.

The workflow does not support iteratively denoising two samples each against a PoN and then against each other.

The tutorial continues in a second document at #11683.

back to top


Footnotes


[1] The constituents of the forty-sample CNV panel of normals differ from those of the Mutect2 panel of normals. Preliminary CNV data was generated with v4.0.1.1 somatic CNV WDL scripts run locally on a Gcloud Compute Engine VM with Cromwell v30.2. Additional refinements were performed on a 16GB MacBook Pro laptop. Additional plots were generated using a broadinstitute/gatk:4.0.1.1 Docker container. Note the v4.0.1.1 WDL script does not allow custom sequence dictionaries for the plotting steps.


[2] Considerations in genomic intervals are as follows.

  • For targeted exomes, the intervals should represent the bait capture or target capture regions.
  • For whole genomes, either supply regions where coverage is expected across samples, e.g. regions that exclude alternate haplotypes and decoy regions in GRCh38, or omit the option entirely for references where coverage is expected across the entirety of the reference.
  • For either type of data, expect to modify the intervals depending on (i) extent of masking in the reference used in read mapping and (ii) expectations in coverage on allosomal contigs. For example, for mammalian data, expect to remove Y chromosome intervals for female samples.


[3] See original discussion on bin size here. The bin size determines the resolution of CNV breakpoints. The theoretical limit depends on coverage depth and the insert-size distribution. Typically bin sizes on the order of the read length will give reasonable results. The GATK developers have tested WGS runs where the bin size is as small as 250 bases.


[4] Set --interval-merging-rule to OVERLAPPING_ONLY, to prevent the tool from merging abutting intervals. The default is set to ALL for GATK4.0.1.1. For future versions, the default will be set to OVERLAPPING_ONLY.


[5] The tool allows specifying both the padding and the binning arguments simultaneously. If exome targets are very long, it may be preferable to both pad and break up the intervals with binning. This may provide some additional resolution.


[6] The data bundle from Tutorial#11136 contains tumor.bam and normal.bam. These tumor and normal samples are identical to that in the current tutorial and represent a subset of the full data for the following regions:

chr6    29941013    29946495    +    
chr11   915890  1133890 +    
chr17   1   83257441    +    
chr11_KI270927v1_alt    1   218612  +    
HLA-A*24:03:01  1   3502    +


[7] The following notes regarding read filters apply to the workflow illustrated in the tutorial, which uses CollectFragmentCounts.

  • In contrast to prior versions of the workflow, the GATK4 CNV workflow excludes duplicate fragments from consideration with the NotDuplicateReadFilter. To instead include duplicate fragments, specify -DF NotDuplicateReadFilter.
  • The tool only considers paired-end reads (0x1 SAM flag) and the first of pair (0x40 flag) with the FirstOfPairReadFilter. The tool uses the first-of-pair read’s mapping information for the fragment center.
  • The tool only considers properly paired reads (0x2 SAM flag) using the ProperlyPairedReadFilter. Depending on whether and how data was preprocessed with MergeBamAlignment, proper pair assignments can differ from that given by the aligner. This filter also removes single ended reads.
  • The MappingQualityReadFilter sets a threshold for alignment MAPQ. The tool sets --minimum-mapping-quality to 30, so it uses reads with MAPQ 30 or higher.


[8] The current tool version requires strategizing denoising of allosomal chromosomes, e.g. X and Y in humans, against the panel of normals. This is because coverage will vary for these regions depending on the sex of the sample. To determine the sex of samples, analyze them with DetermineGermlineContigPloidy. Aneuploidy in allosomal chromosomes, much like trisomy, can still make for viable organisms and so phenotypic sex designations are insufficient. GermlineCNVCaller can account for differential sex in data.

Many thanks to Samuel Lee (@slee), no relation, for his patient explanations of the workflow parameters and mathematics behind the workflow. Researchers may find reading through @slee's comments on the forum, a few of which the tutorials link to, insightful.

back to top



Physical Phasing Information HaplotypeCaller 4.1.0.0


Hi,

I am looking to use HaplotypeCaller to call germline variants, and I am particularly interested in the orientation of these variants relative to one another (cis- or trans-). There seems to be a reference to physical phasing in the [HaplotypeCaller documentation](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php#--do-not-run-physical-phasing), but I cannot find any physical phasing information in my VCF file.

For instance, I would expect the two variants below:

1 1647722 . G T 307.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.861;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=53.28;MQRankSum=-5.260;QD=10.61;ReadPosRankSum=-0.098;SOR=0.155 GT:AD:DP:GQ:PL 0/1:21,8:29:99:315,0,841
1 1647725 . G A 304.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.277;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=52.38;MQRankSum=-5.262;QD=10.50;ReadPosRankSum=-0.448;SOR=0.204 GT:AD:DP:GQ:PL 0/1:20,9:29:99:312,0,883

to be in the cis- orientation because they share nearly identical read counts, but I cannot find a corresponding annotation in the VCF file that says as much.

My command to call HaplotypeCaller is as below:

$gatk_launcher --java-options -Xmx${mem}g HaplotypeCaller \
-R $reference \
-I $bam_file \
-O $out_file \
-L $intervals_split &>> $log_file

Thank you for the help!!

missing physical phasing information in vcf?


The way the phasing algorithm decides to phase is by checking whether two variants always occur on the same haplotype or always occur on different haplotypes. Excess haplotypes severely dilute the signal.

For example, let's say variants A and B both occur on real haplotype H1, but that HC also assembled a similar false haplotype H2. If any reads supporting variant A match H2 better than H1, the phasing via H1 is lost.

This raises the question of whether we could do better, and the answer is yes, easily. The current code is very naive.
However, instead of improving our phasing algorithm our current efforts are in assembling fewer and better haplotypes.

Basically, the goal is to prevent H2 from existing in the first place, in which case the current naive phasing algorithm will probably work well enough.

Deep sequencing data is missing variants in M2-called VCF


--max-reads-per-alignment-start is helpful because the genome has a few hotspots of extremely high coverage, mostly due to mapping error; to avoid spending an inordinate amount of compute on these few regions, we truncate the coverage there. For example, a 100x exome may have a few thousand bp with 10,000x coverage.

However, this behavior should be turned off by setting --max-reads-per-alignment-start 0 when coverage is uniformly high and you want to use that depth to discover low-AF variants.
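
A minimal GATK4 sketch of the latter (file names are placeholders):

gatk Mutect2 \
    -R reference.fasta \
    -I tumor.bam \
    --max-reads-per-alignment-start 0 \
    -O unfiltered.vcf.gz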

Is there a recommended method for GATK downstream analysis?


Hi, I have finished calling variants on 20 WES samples via the GATK Best Practices (joint calling).
My goal is to find the genes that may cause the disease in these 20 samples. I have tried some tools and methods, but I don't know whether they are right.
So, is there a good guidance document, like the GATK Best Practices, for reaching this goal?

How to extract germline mutations from Mutect2?

I am interested in performing a study on germline mutations in a cancer patient. VCF files in TCGA report only somatic mutations. As per the GDC DNA-seq analysis pipeline, they use the command below to get somatic mutations:


java -jar GenomeAnalysisTK.jar \
-T MuTect2 \
-R <reference> \
-L <region> \
-I:tumor <tumor.bam> \
-I:normal <normal.bam> \
--normal_panel <pon.vcf> \
--cosmic <cosmic.vcf> \
--dbsnp <dbsnp.vcf> \
--contamination_fraction_to_filter 0.02 \
-o <mutect_variants.vcf> \
--output_mode EMIT_VARIANTS_ONLY \
--disable_auto_index_creation_and_locking_when_reading_rods


After reading the MuTect2 documentation, I could not figure out how to modify the above command to retrieve germline mutations from the BAM files.
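
For what it's worth, a hedged note: MuTect2 is designed to report somatic variants and deliberately suppresses germline sites, so germline calls are conventionally made by running HaplotypeCaller on the normal BAM rather than by modifying the somatic command. A minimal GATK4 sketch (paths hypothetical):

gatk HaplotypeCaller \
    -R reference.fasta \
    -I normal.bam \
    -O germline.vcf.gz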

How to exit in the middle of a linear chain of tasks in WDL


Hi,
I have a sequence of tasks. If the output of task1 contains the error message "No space on the disk", I would like to exit the analysis and not continue with the following tasks. Is that implementable in WDL, and if so, how? Thank you very much.
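
One common pattern, sketched under the assumption that the error text appears in task1's captured output: make task1 itself exit non-zero when the message is detected, so the engine fails that task and never schedules the downstream tasks. The bash below would sit inside task1's WDL command block (run_analysis and output.log are placeholders):

# inside task1's command <<< ... >>> section
run_analysis > output.log 2>&1 || true   # capture the tool's output even if it fails
if grep -q "No space on the disk" output.log; then
    echo "Fatal: no space left on disk, aborting workflow" >&2
    exit 1   # a non-zero exit fails the task, so later tasks in the chain never run
fi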

Known Issues with VariantRecalibrator


The syntax for specifying argument tags has changed (and the documentation was out of sync for a while, though it is now fixed). The tags must now be specified with the argument name, not with the argument value, like this:

--resource:hapmap,known=false,training=true,truth=true,prior=15.0 /trainee/ref/hapmap_3.3.hg38.vcf

Note that the ":" and tags are listed with the argument name ("--resource"), not with the file name.
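
For context, a fuller sketch of the corrected syntax (the reference, the second resource, and the annotations here are illustrative, not a recommendation):

gatk VariantRecalibrator \
    -R reference.fasta \
    -V input.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 /trainee/ref/hapmap_3.3.hg38.vcf \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg38.vcf \
    -an QD -an FS -an MQ -an MQRankSum \
    -mode SNP \
    --tranches-file output.tranches \
    -O output.recal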


PoN: Mutect2 includes MNPs and crashes GenomicsDBImport


GATK v4.1.1.0, Linux server, bash

Hi,

Using the following command I create the per-sample input files for the PoN:

${gatk}  Mutect2 \
-R ${hg38} \
-I ${bqsr_bam} \
-O ${sample_pon} \
-L ${intervals}

When I run GenomicsDBImport, I get an error reporting the presence of MNPs in the ${sample_pon} files.

Then I exclude the MNPs with:

${gatk} SelectVariants \
-R ${hg38} \
-V ${sample_pon} \
-O ${clean_sample_pon} \
-xl-select-type MNP

Counting the variants before and after SelectVariants gives the following:

sample_pon_1        62223
clean_sample_pon_1  61395
sample_pon_2        66974
clean_sample_pon_2  66013

How can I disable MNP calling in Mutect2 so that GenomicsDBImport accepts the output?
Is this the default behavior of the Mutect2 PoN pipeline, or am I doing something wrong?

Many thanks
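
A hedged pointer: GATK 4.1's Mutect2 exposes a --max-mnp-distance argument, and setting it to 0 disables MNP calling; to the best of our knowledge this is the setting the PoN documentation suggests for per-sample calls, which would avoid the SelectVariants workaround entirely. Applied to the command above:

${gatk} Mutect2 \
-R ${hg38} \
-I ${bqsr_bam} \
--max-mnp-distance 0 \
-O ${sample_pon} \
-L ${intervals}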

Interpreting CNNScoreVariants Scores


Hi,

I'm playing around with the new CNNScoreVariants module and was wondering if you have any guidelines for choosing a score cutoff or, more generally, for interpreting the scores added to the output VCF.

Thanks,
Will
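
For reference, a cutoff is typically applied downstream with FilterVariantTranches, which converts the CNN INFO score (CNN_1D or CNN_2D) into tranche-based filters calibrated against truth resources; a minimal sketch (resource files and tranche values illustrative):

gatk FilterVariantTranches \
    -V scored.vcf.gz \
    --resource hapmap_3.3.hg38.vcf.gz \
    --resource 1000G_omni2.5.hg38.vcf.gz \
    --info-key CNN_1D \
    --snp-tranche 99.95 \
    --indel-tranche 99.4 \
    -O filtered.vcf.gz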

CNNScoreVariants, too many threads


Hi,

in the Best Practices workflows you advise running HaplotypeCaller with the "-XX:GCTimeLimit=50" and "-XX:GCHeapFreeLimit=10" java options.

Is there something similar for CNNScoreVariants? I tried several java options with different values to limit the threads, but it seems impossible. Without any options I get 116 threads while running a single command; with 5 java options I can limit them to 95 ... still too many! What should I limit here?

Many thanks
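
A hedged note: most of those threads belong to the native TensorFlow runtime rather than the JVM, so java options cannot cap them. Recent GATK builds expose --inter-op-threads and --intra-op-threads on CNNScoreVariants for this purpose (availability may depend on your version):

gatk CNNScoreVariants \
    -R reference.fasta \
    -V input.vcf.gz \
    --inter-op-threads 1 \
    --intra-op-threads 4 \
    -O scored.vcf.gz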

(How to) Call somatic copy number variants using GATK4 CNV


A more recent CNV tutorial using v4.0.1.1 has been posted in two parts elsewhere at:

The first part mostly recapitulates the workflow on this page, while the second part adds detection of allelic ratios. Although the v4.0.1.1 tutorial is under review as of May 2, 2018, we recommend you update to the official workflow, especially if performing CNV analyses on WGS data. The official workflow has algorithmic improvements to the GATK4.beta workflow illustrated here.


image

This demonstrative tutorial provides instructions and example data to detect somatic copy number variation (CNV) using a panel of normals (PoN). The workflow is optimized for Illumina short-read whole exome sequencing (WES) data. It is suitable neither for whole genome sequencing (WGS) data nor for germline calling.

The tutorial recapitulates the GATK demonstration given at the 2016 ASHG meeting in Vancouver, Canada, for a beta version of the CNV workflow. Because we are still actively developing the CNV tools (writing as of March 2017), the underlying algorithms and current workflow options, e.g. syntax, may change. However, the basic approach and general concepts presented here will still be germane. Please check the forum for updates.

Many thanks to Samuel Lee (@slee) for developing the example data, data figures and discussion that set the backbone of this tutorial.

► For a similar example workflow that pertains to earlier releases of GATK4, see Article#6791.
► For the mathematics behind the workflow, see this whitepaper.

Different data types come with their own caveats. WGS, while providing even coverage that enables better CNV detection, is costly. SNP arrays, the standard for CNV detection, may not be part of an analysis protocol. Resolving CNVs from WES, which additionally introduces artifacts and variance in the target capture step, requires sophisticated denoising.


Jump to a section

  1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage
  2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals
  3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts
  4. Segment the normalized coverage profile using PerformSegmentation
    I get an error at this step!
  5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio
    What is the QC value?
  6. Call segmented copy number variants using CallSegments
  7. Discussion of interest to some
    Why can't I use just a matched normal?
    How do the results compare to SNP6 analyses?

Tools, system requirements and example data download

  • This tutorial uses a beta version of the CNV workflow tools within the GATK4 gatk-protected-1.0.0.0-alpha1.2.3 pre-release (Version:0288cff-SNAPSHOT from September 2016). We previously made the program jar available alongside the data bundle in the workshops directory here. The original worksheets are in the 1610 folder. However, the data bundle was only available to workshop attendees. Note that other tools in this pre-release may be unsuitable for analyses.

    The example data is whole exome capture sequence data for chromosomes 1–7 of matched normal and tumor samples aligned to GRCh37. Because the data is from real cancer patients, we have anonymized them at multiple levels. The anonymization process preserves the noise inherent in real samples. The data is representative of Illumina sequencing technology from 2011.

  • R (install from https://www.r-project.org/) and certain R components. After installing R, install the components with the following command.

    Rscript install_R_packages.R 
    

    We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.

  • XQuartz for optional section 5. Your system may already have this installed.

  • The tutorial does not require reference files. The optional plotting step that uses the PlotSegmentedCopyRatio tool plots against GRCh37 and should NOT be used for other reference assemblies.


1. Collect proportional coverage using target intervals and read data using CalculateTargetCoverage

In this step, we collect proportional coverage using target intervals and read data. We have actually pre-computed this for you and we provide the command here for reference.

We process each BAM, whether normal or tumor. The tool collects coverage per read group at each target and divides these counts by the total number of reads per sample.

java -jar gatk4.jar CalculateTargetCoverage \
    -I <input_bam_file> \
    -T <input_target_tsv> \
    -transform PCOV \
    -groupBy SAMPLE \
    -targetInfo FULL \
    --keepdups \
    -O <output_pcov_file>
  • The target file -T is a padded intervals list of the baited regions. You can add padding to a target list using the GATK4 PadTargets tool, as sketched after this list. For our example data, padding each exome target 250 bp on either side increases sensitivity.
  • Setting the -targetInfo option to FULL keeps the original target names from the target list.
  • The --keepdups option asks the tool to include alignments flagged as duplicate.
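
A sketch of the padding step mentioned in the first bullet (argument names for this beta tool are assumptions patterned on the command above; check the tool's own usage):

java -jar gatk4.jar PadTargets \
    -T <input_target_tsv> \
    -padding 250 \
    -O <padded_target_tsv>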

The top plot shows the raw proportional coverage for our tumor sample for chromosomes 1–7. Each dot represents a target. The y-axis plots proportional coverage and the x-axis targets. The middle plot shows the data after a median-normalization and log2-transformation. The bottom plot shows the tumor data after normalization against its matched-normal.

image

image

image

For each of these progressions, how certain are you that there are copy-number events? How many copy-number variants are you certain of? What is contributing to your uncertainty?


back to top


2. Create the CNV PoN using CombineReadCounts and CreatePanelOfNormals

In this step, we use two commands to create the CNV panel of normals (PoN).

The normals should represent the same sequencing technology, e.g. sample preparation and capture target kit, as that of the tumor samples under scrutiny. The PoN is meant to encapsulate sequencing noise and may also capture common germline variants. Like any control, think carefully about what sample set would make an effective panel. At a minimum, the PoN should consist of ten normal samples, ideally subject to the same batch effects as the tumor sample, e.g. from the same sequencing center. Our current recommendation is 40 or more normal samples; adjust the number depending on sample coverage depth.

What is better, tissue-matched normals or blood normals of tumor samples?
What makes a better background control, a matched normal sample or a panel of normals?

The first step combines the proportional read counts from the multiple normal samples into a single file. The -inputList parameter takes a file listing the relative file paths, one sample per line, of the proportional coverage data of the normals.

java -jar gatk4.jar CombineReadCounts \
    -inputList normals.txt \
    -O sandbox/combined-normals.tsv

The second step creates a single CNV PoN file. The PoN stores information such as the median proportional coverage per target across the panel and projections of systematic noise calculated with PCA (principal component analysis). Our tutorial's PoN is built from 39 normal blood samples of cancer patients in the same cohort (none with blood cancers).

java -jar gatk4.jar CreatePanelOfNormals \
    -I sandbox/combined-normals.tsv \
    -O sandbox/normals.pon \
    -noQC \
    --disableSpark \
    --minimumTargetFactorPercentileThreshold 5 

This results in two files, the CNV PoN and a target_weights.txt file that typical workflows can ignore. Because we have a small number of normals, we include the -noQC option and change the --minimumTargetFactorPercentileThreshold to 5%.

Based on what you know about PCA, what do you think are the effects of using more normal samples? A panel with some profiles that are outliers?


back to top


3. Normalize a raw proportional coverage profile against the PoN using NormalizeSomaticReadCounts

In this step, we normalize the raw proportional coverage (PCOV) profile using the PoN. Specifically, we normalize the tumor coverage against the PoN’s target medians and against the principal components of the PoN.

java -jar gatk4.jar NormalizeSomaticReadCounts \
    -I cov/tumor.tsv \
    -PON sandbox/normals.pon \
    -PTN sandbox/tumor.ptn.tsv \
    -TN sandbox/tumor.tn.tsv

This produces the pre-tangent-normalized file (-PTN) and the tangent-normalized file (-TN). The resulting data are log2-transformed.

Denoising with a PoN is critical for calling copy-number variants from WES coverage profiles. It can also improve calls from WGS profiles that are typically more evenly distributed and subject to less noise. Furthermore, denoising with a PoN can greatly impact results for (i) samples that have more noise, e.g. those with lower coverage, lower purity or higher activity, (ii) samples lacking a matched normal and (iii) detection of smaller events that span only a few targets.


back to top


4. Segment the normalized coverage profile using PerformSegmentation

Here we segment the normalized coverage profile. Segmentation groups contiguous targets with the same copy ratio.

java -jar gatk4.jar PerformSegmentation \
    -TN sandbox/tumor.tn.tsv \
    -O sandbox/tumor.seg \
    -LOG

For our tumor sample, we reduce the ~73K individual targets to 14 segments. The -LOG parameter tells the tool that the input coverages are log2-transformed.

View the resulting file with cat sandbox/tumor.seg.

image

Which chromosomes have events?

☞ I get an error at this step!

This command will error if you have not installed R and certain R components. Take a few minutes to install R from https://www.r-project.org/. Then install the components with the following command.

Rscript install_R_packages.R 

We include install_R_packages.R in the tutorial data bundle. Alternatively, download it from here.


back to top


5. (Optional) Plot segmented coverage using PlotSegmentedCopyRatio

This is an optional step that plots segmented coverage.

This command requires XQuartz installation. If you do not have this dependency, then view the results in the precomputed_results folder instead. Currently plotting only supports human assembly b37 autosomes. Going forward, this tool will accommodate other references and the workflow will support calling on sex chromosomes.

java -jar gatk4.jar PlotSegmentedCopyRatio \
    -TN sandbox/tumor.tn.tsv \
    -PTN sandbox/tumor.ptn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox \
    -pre tumor \
    -LOG

The -O defines the output directory, and the -pre defines the basename of the files. Again, the -LOG parameter tells the tool that the inputs are log2-transformed. The output folder contains seven files: three PNG images and four text files.

image
  • Before_After.png (shown above) plots copy-ratios pre (top) and post (bottom) tangent-normalization across the chromosomes. The plot automatically adjusts the y-axis to show all available data points. Dotted lines represent centromeres.
  • Before_After_CR_Lim_4.png shows the same but fixes the y-axis range from 0 to 4 for comparability across samples.
  • FullGenome.png colors differential copy-ratio segments in alternating blue and orange. The horizontal line plots the segment mean. Again the y-axis ranges from 0 to 4.

Open each of these images. How many copy-number variants do you see?

☞ What is the QC value?

Each of the four text files contains a single quality control (QC) value. This value is the median of absolute differences (MAD) in copy-ratios of adjacent targets. Its calculation is robust to actual copy-number variants and should decrease after tangent-normalization. A rough command-line sketch of this calculation follows the file list below.

  • preQc.txt gives the QC value before tangent-normalization.
  • postQc.txt gives the post-tangent-normalization QC value.
  • dQc.txt gives the difference between pre and post QC values.
  • scaled_dQc.txt gives the fraction difference (preQc - postQc)/(preQc).
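
For the curious, the QC metric as described can be approximated with standard command-line tools; a rough sketch, assuming a one-line header and the copy-ratio in the last column of the tangent-normalized file:

    tail -n +2 sandbox/tumor.tn.tsv | awk '{print $NF}' \
      | awk 'NR > 1 {d = $1 - prev; if (d < 0) d = -d; print d} {prev = $1}' \
      | sort -g \
      | awk '{a[NR] = $1} END {print (NR % 2 ? a[(NR+1)/2] : (a[NR/2] + a[NR/2+1]) / 2)}'

Reading left to right: strip the header, keep the copy-ratio column, take absolute differences of adjacent targets, sort them, and report the median.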


back to top


6. Call segmented copy number variants using CallSegments

In this final step, we call segmented copy number variants. The tool makes one of three calls for each segment--neutral (0), deletion (-) or amplification (+). These deleted or amplified segments could represent somatic events.

java -jar gatk4.jar CallSegments \
    -TN sandbox/tumor.tn.tsv \
    -S sandbox/tumor.seg \
    -O sandbox/tumor.called

View the results with cat sandbox/tumor.called.

image

Besides the last column, how is this result different from that of step 4?


back to top


7. Discussion of interest to some

☞ Why can't I use just a matched normal?

Let’s compare results from the raw coverage (top), from normalizing using the matched-normal only (middle) and from normalizing using the PoN (bottom).

image

image

image

What is different between the plots? Look closely.

The matched-normal normalization appears to perform well. However, its noisiness brings uncertainty to any call that would be made, even one visible by eye. Furthermore, its level of noise obscures detection of the fourth variant, which the PoN normalization reveals.

☞How do the results compare to SNP6 analyses?

As with any algorithmic analysis, it’s good to confirm results with orthogonal methods. If we compare calls from the original unscrambled tumor data against GISTIC SNP6 array analysis of the same sample, we similarly find three deletions and a single large amplification.

back to top


GATK4 VariantsToTable unable to properly assign ANN field to multi-allelic variants


The input VCF record was:

chr19 9115341 rs2217657 C G,A 29868.6 PASS AC=70,1;AF=0.368,5.263e-03;AN=190;BaseQRankSum=-3.238e+00;DB;DP=3117;ExcessHet=9.1637;FS=1.757;InbreedingCoeff=-0.1235;MLEAC=69,1;MLEAF=0.363,5.263e-03;MQ=59.95;MQRankSum=0.00;PG=3,0,3,26,26,51;POSITIVE_TRAIN_SITE;QD=15.16;ReadPosRankSum=-7.980e-01;SOR=0.616;VQSLOD=9.05;culprit=MQRankSum;ANN=A|missense_variant|MODERATE|OR7G1|ENSG00000161807|transcript|ENST00000541538.1|protein_coding|1/1|c.423G>T|p.Trp141Cys|423/936|423/936|141/311||,G|missense_variant|MODERATE|OR7G1|ENSG00000161807|transcript|ENST00000541538.1|protein_coding|1/1|c.423G>C|p.Trp141Cys|423/936|423/936|141/311|| ...

The following command was run:
java -jar gatk-package-4.1.0.0-local.jar VariantsToTable --variant chr19.genotypeRefined.ann.recode.vcf --split-multi-allelic -F CHROM -F POS -F REF -F ALT -F ID -F TYPE -F TRANSITION -F FILTER -F HET -F HOM-REF -F HOM-VAR -F VAR -F ANN -F LOF -F NMD -GF GT -GF GQ --output ch19.genotypeRefined.ann.recode.table

The output table contains two entries associated with the above variant:

chr19 9115341 C G rs2217657 SNP -1 PASS 51 34 10 61 A|missense_variant|MODERATE|OR7G1|ENSG00000161807|transcript|ENST00000541538.1|protein_coding|1/1|c.423G>T|p.Trp141Cys|423/936|423/936|141/311|| ...

chr19 9115341 C A rs2217657 SNP -1 PASS 51 34 10 61 G|missense_variant|MODERATE|OR7G1|ENSG00000161807|transcript|ENST00000541538.1|protein_coding|1/1|c.423G>C|p.Trp141Cys|423/936|423/936|141/311|| ...

Look at the ANN field: the annotations for C>G and C>A have been swapped.

Furthermore, there are entries where the ANN field has not been split but simply copied to all alleles. For example:

chr19 3752876 A G rs8102086 SNP -1 PASS 50 14 31 81 C|missense_variant|MODERATE|APBA3|ENSG00000011132|transcript|ENST00000316757.3|protein_coding|7/11|c.1126T>G|p.Cys376Gly|1327/2075|1126/1728|376/575||,G|missense_variant|MODERATE|APBA3|ENSG00000011132|transcript|ENST00000316757.3|protein_coding|7/11|c.1126T>C|p.Cys376Arg|1327/2075|1126/1728|376/575||...,G|non_coding_transcript_exon_variant|MODIFIER|APBA3|ENSG00000011132|transcript|ENST00000592826.1|retained_intron|3/4|n.400T>C||||||...

chr19 3752876 A C rs8102086 SNP -1 PASS 50 14 31 81 C|missense_variant|MODERATE|APBA3|ENSG00000011132|transcript|ENST00000316757.3|protein_coding|7/11|c.1126T>G|p.Cys376Gly|1327/2075|1126/1728|376/575||,G|missense_variant|MODERATE|APBA3|ENSG00000011132|transcript|ENST00000316757.3|protein_coding|7/11|c.1126T>C|p.Cys376Arg|1327/2075|1126/1728|376/575||...,G|non_coding_transcript_exon_variant|MODIFIER|APBA3|ENSG00000011132|transcript|ENST00000592826.1|retained_intron|3/4|n.400T>C||||||...

Is there an issue with the command or am I misinterpreting the observation?
Thanks
Srikant
