The CNV case and PoN workflows (description and examples) for earlier releases of GATK4.
For a newer tutorial using GATK4's v1.0.0.0-alpha1.2.3 release (Version:0288cff-SNAPSHOT from September 2016), see Article#9143 and its accompanying data bundle. If you have a question about the Somatic_CNV_handson tutorial, please post it as a new question.
Requirements
- Java 1.8
- A functioning GATK4-protected jar (hellbender-protected.jar or gatk-protected.jar)
- HDF5 1.8.13
- The location of the HDF5-Java JNI Libraries, Release 2.9 (2.11 for Macs).
Typical locations:
Ubuntu: /usr/lib/jni/
Mac: /Applications/HDFView.app/Contents/Resources/lib/
Broad internal servers: /broad/software/free/Linux/redhat_6_x86_64/pkgs/hdfview_2.9/HDFView/lib/linux/
- Reference genome (fasta files) with fai and dict files. This can be downloaded as part of the GATK resource bundle: http://www.broadinstitute.org/gatk/guide/article?id=1213
- PoN file (when running case samples only). This file should be created using the Create PoN workflow (see below).
- Target BED file that was used to create the PoN file. Format details can be found here. NOTE: For the CNV tools, you will need a fourth column for target name, which must be unique across rows.
1 12200 12275 target1
1 13505 13600 target2
1 31000 31500 target3
1 35138 35174 target4
....snip....
Before running the workflows, we recommend padding the target file by 250 bases with the PadTargets tool. This allows some off-target reads to be factored into the copy ratio estimates; our internal evaluations have shown that this improves results. Example:
java -jar gatk-protected.jar PadTargets --targets initial_target_file.bed --output initial_target_file.padded.bed --padding 250
If you are using the premade Queue scripts (see below), you can specify the padding there and the workflow will generate the padded targets automatically (i.e. there is no reason to run PadTargets explicitly if you are using the premade Queue scripts).
Case sample workflow
This workflow requires a PoN file generated by the Create PoN workflow.
If you do not have a PoN, please skip to the Create PoN workflow, below.
Overview of steps
- Step 0. (recommended) Pad Targets (see example above)
- Step 1. Collect proportional coverage
- Step 2. Create coverage profile
- Step 3. Segment coverage profile
- Step 4. Plot coverage profile
- Step 5. Call segments
Step 1. Collect proportional coverage
Inputs
- bam file
- target BED file -- must be the same file that was used for the PoN
- reference_sequence (required by GATK) -- fasta file with b37 reference.
Outputs
- Proportional coverage tsv file -- Mx5 matrix of proportional coverage, where M is the number of targets. The fifth column will be named for the sample in the bam file (found in the bam file's SM tag). If the file exists, it will be overwritten.
##fileFormat = tsv
##commandLine = org.broadinstitute.hellbender.tools.exome.ExomeReadCounts ...snip...
##title = Read counts per target and sample
CONTIG START END NAME SAMPLE1
1 12200 12275 target1 1.150e-05
1 13505 13600 target2 1.500e-05
1 31000 31500 target3 7.000e-05
....snip....
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> CalculateTargetCoverage -I <input_bam_file> -O <pcov_output_file_path> --targets <target_BED> -R <ref_genome> \
-transform PCOV --targetInformationColumns FULL -groupBy SAMPLE -keepdups
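For example, with hypothetical file names (a sketch; substitute your own paths):
java -Xmx8g -jar gatk-protected.jar CalculateTargetCoverage -I tumor_sample.bam -O tumor_sample.pcov.tsv --targets targets.padded.bed -R human_g1k_v37.fasta \
-transform PCOV --targetInformationColumns FULL -groupBy SAMPLE -keepdups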
Step 2. Create coverage profile
Inputs
- proportional coverage file from Step 1
- target BED file -- must be the same that was used for the PoN
- PoN file
- directory containing the HDF5 JNI native libraries
Outputs
- normalized coverage file (tsv) -- details each target with chromosome, start, end, and log copy ratio estimate
#fileFormat = tsv
#commandLine = ....snip....
#title = ....snip....
name contig start stop SAMPLE1
target1 1 12200 12275 -0.5958351605220968
target2 1 13505 13600 -0.2855054918109098
target3 1 31000 31500 -0.11450116047248263
....snip....
- pre-tangent-normalization coverage file (tsv) -- same format as the normalized coverage file above, but with copy ratio estimates taken before the noise reduction (tangent normalization) step.
- fnt file (tsv) -- proportional coverage divided by the target factors contained in the PoN. The file format is the same as the proportional coverage in step 1.
- betaHats (tsv) -- typically used only by developers and evaluators, but an output location must be specified. These are the coefficients used in the projection of the case sample into the (reduced) PoN; an Mx1 matrix, where M is the number of targets.
Invocation
java -Djava.library.path=<hdf_jni_native_dir> -Xmx8g -jar <path_to_hellbender_protected_jar> NormalizeSomaticReadCounts -I <pcov_input_file_path> -T <target_BED> -pon <pon_file> \
-O <output_target_cr_file> -FNO <output_target_fnt_file> -BHO <output_beta_hats_file> -PTNO <output_pre_tangent_normalization_cr_file>
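A hypothetical concrete invocation, reusing the Step 1 output above and the Ubuntu JNI path from the Requirements section (my_pon.hd5 is a placeholder PoN file name):
java -Djava.library.path=/usr/lib/jni/ -Xmx8g -jar gatk-protected.jar NormalizeSomaticReadCounts -I tumor_sample.pcov.tsv -T targets.padded.bed -pon my_pon.hd5 \
-O tumor_sample.cr.tsv -FNO tumor_sample.fnt.tsv -BHO tumor_sample.betaHats.tsv -PTNO tumor_sample.preTN.cr.tsv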
Step 3. Segment coverage profile
Inputs
- normalized coverage file (tsv) -- from step 2.
- sample name
Outputs
- seg file (tsv) -- segment file detailing contig, start, end, and copy ratio (segment_mean) for each detected segment. Note that this is a different format from that of python recapseg, since the segment mean no longer has log2 applied.
Sample Chromosome Start End Num_Probes Segment_Mean
SAMPLE1 1 12200 70000 18 0.841235
SAMPLE1 1 300600 1630000 337 1.23232323
....snip....
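If you need to compare against python recapseg output, apply log2 to the Segment_Mean values; for example, the first segment above corresponds to a recapseg-style value of log2(0.841235) ≈ -0.249.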
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> PerformSegmentation -S <sample_name> -T <normalized_coverage_file> -O <output_seg_file> -log
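For example, continuing the hypothetical file names from the earlier steps:
java -Xmx8g -jar gatk-protected.jar PerformSegmentation -S SAMPLE1 -T tumor_sample.cr.tsv -O tumor_sample.seg -log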
Step 4. Plot coverage profile
Inputs
- normalized coverage file (tsv) -- from step 2.
- pre-tangent-normalization coverage file (tsv) -- from step 2.
- segmented coverage file (seg) -- from step 3.
- sample name, see above
Outputs
- beforeAfterTangentLimPlot (png) -- before/after tangent-normalization plot, with copy ratio capped at 4
- beforeAfterTangentPlot (png) -- before/after tangent-normalization plot
- fullGenomePlot (png) -- Full genome plot after tangent normalization
- preQc (txt) -- Median absolute differences of targets before normalization
- postQc (txt) -- Median absolute differences of targets after normalization
- dQc (txt) -- Difference in median absolute differences of targets before and after normalization
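As a hypothetical illustration: if preQc reports 0.25 and postQc reports 0.10, then dQc = 0.25 - 0.10 = 0.15; the larger the dQc, the more noise the tangent normalization removed.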
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> PlotSegmentedCopyRatio -S <sample_name> -T <normalized_coverage_file> -P <pre_tangent_normalization_coverage_file> -seg <segmented_coverage_file> -O <output_location> -log
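A hypothetical concrete invocation (here -O points to a directory for the plot and QC outputs):
java -Xmx8g -jar gatk-protected.jar PlotSegmentedCopyRatio -S SAMPLE1 -T tumor_sample.cr.tsv -P tumor_sample.preTN.cr.tsv -seg tumor_sample.seg -O plots/ -log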
Step 5. Call segments
Inputs
- normalized coverage file (tsv) -- from step 2.
- seg file (tsv) -- from step 3.
- sample name
Outputs
- called file (tsv) -- output is exactly the same as the seg file (step 3), except that a Segment_Call column is added. Calls are either "+", "0", or "-" (no quotes).
Sample Chromosome Start End Num_Probes Segment_Mean Segment_Call
SAMPLE1 1 12200 70000 18 0.841235 -
SAMPLE1 1 300600 1630000 337 1.23232323 0
....snip....
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> CallSegments -T <normalized_coverage_file> -S <seg_file> -O <output_called_seg_file> -sample <sample_name>
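For example, with the hypothetical file names used throughout:
java -Xmx8g -jar gatk-protected.jar CallSegments -T tumor_sample.cr.tsv -S tumor_sample.seg -O tumor_sample.called.seg -sample SAMPLE1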
Create PoN workflow
This workflow can take some time to run depending on how many samples are going into your PoN and the number of targets you are covering. Basic time estimates are found in the Overview of Steps.
Additional requirements
- Normal sample bam files to be used in the PoN. The index files (.bai) must reside alongside their associated bam files.
Overview of steps
- Step 1. Collect proportional coverage. (~20 minutes for mean 150x coverage and 150k targets, per sample)
- Step 2. Combine proportional coverage files (< 5 minutes for 150k targets and 300 samples)
- Step 3. Create the PoN file (~1.75 hours for 150k targets and 300 samples)
All time estimates are using the internal Broad infrastructure.
Step 1. Collect proportional coverage on each bam file
This is exactly the same as the case sample workflow, except that it must be run once for each input bam file, each run with a different output file name. Otherwise, the inputs should be the same for each bam file.
Please see the documentation above.
IMPORTANT NOTE: You must create a list of the proportional coverage files (i.e. the output files) that you create in this step, one output file path per line in a plain text file (see step 2). A sketch of this per-sample loop is shown below.
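A minimal shell sketch of this loop (assuming the normal bam paths are listed one per line in a hypothetical normal_bams.txt, and reusing the padded target file and reference from the case workflow):
mkdir -p pcov
while read bam; do
  name=$(basename "${bam}" .bam)
  java -Xmx8g -jar gatk-protected.jar CalculateTargetCoverage -I "${bam}" -O pcov/${name}.pcov.tsv --targets targets.padded.bed -R human_g1k_v37.fasta \
  -transform PCOV --targetInformationColumns FULL -groupBy SAMPLE -keepdups
  echo pcov/${name}.pcov.tsv >> pcov_file_list.txt
done < normal_bams.txt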
Step 2. Merge proportional coverage files
This step merges the proportional coverage files into one large file with a separate column for each sample.
Inputs
- text file listing the proportional coverage files generated in step 1 (you may need to create this list manually), one path per line:
/path/to/pcov_file1.txt
/path/to/pcov_file2.txt
/path/to/pcov_file3.txt
....snip....
Outputs
- merged tsv of proportional coverage
CONTIG START END NAME SAMPLE1 SAMPLE2 SAMPLE3 ....snip....
1 12191 12227 target1 8.835E-6 1.451E-5 1.221E-5 ....snip....
1 12596 12721 target2 1.602E-5 1.534E-5 1.318E-5 ....snip....
....snip....
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> CombineReadCounts --inputList <text_file_list_of_proportional_coverage_files> \
-O <output_merged_file> -MOF 200
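Continuing the hypothetical file names from step 1:
java -Xmx8g -jar gatk-protected.jar CombineReadCounts --inputList pcov_file_list.txt -O merged_pcov.tsv -MOF 200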
Step 3. Create the PoN file
Inputs
- merged tsv of proportional coverage -- generated in step 2.
Outputs
- PoN file -- HDF5 format. This file can be used for running case samples sequenced with the same process.
Invocation
java -Xmx16g -Djava.library.path=<hdf_jni_native_dir> -jar <path_to_hellbender_protected_jar> CreatePanelOfNormals -I <merged_pcov_file> \
-O <output_pon_file_full_path>
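For example, on Ubuntu with the JNI path from the Requirements section and the hypothetical merged file from step 2:
java -Xmx16g -Djava.library.path=/usr/lib/jni/ -jar gatk-protected.jar CreatePanelOfNormals -I merged_pcov.tsv -O my_pon.hd5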