The CNV case and PoN workflows (description and examples) for earlier releases of GATK4.
For a newer tutorial using GATK4's v1.0.0.0-alpha1.2.3 release (Version:0288cff-SNAPSHOT from September 2016), see Article#9143 and its accompanying data bundle. If you have a question about the Somatic_CNV_handson tutorial, please post it as a new question.
Requirements
- Java 1.8
- A functioning GATK4-protected jar (hellbender-protected.jar or gatk-protected.jar)
- HDF5 1.8.13
- The location of the HDF5-Java JNI Libraries, Release 2.9 (2.11 for Macs).
Typical locations:
Ubuntu: /usr/lib/jni/
Mac: /Applications/HDFView.app/Contents/Resources/lib/
Broad internal servers: /broad/software/free/Linux/redhat_6_x86_64/pkgs/hdfview_2.9/HDFView/lib/linux/
- Reference genome (fasta files) with fai and dict files. This can be downloaded as part of the GATK resource bundle: http://www.broadinstitute.org/gatk/guide/article?id=1213
- PoN file (when running case samples only). This file should be created using the Create PoN workflow (see below).
- Target BED file that was used to create the PoN file. Format details can be found here. NOTE: For the CNV tools, you will need a fourth column for target name, which must be unique across rows.
1 12200 12275 target1
1 13505 13600 target2
1 31000 31500 target3
1 35138 35174 target4
....snip....
Before running the workflows, we recommend padding the target file by 250 bases with the PadTargets tool. This allows some off-target reads to be factored into the copy ratio estimates; our internal evaluations have shown that this improves results. Example:
java -jar gatk-protected.jar PadTargets --targets initial_target_file.bed --output initial_target_file.padded.bed --padding 250
If you are using the premade Queue scripts (see below), you can specify the padding there and the workflow will generate the padded targets automatically (i.e. there is no reason to run PadTargets explicitly if you are using the premade Queue scripts).
Case sample workflow
This workflow requires a PoN file generated by the Create PoN workflow.
If you do not have a PoN, please skip to the Create PoN workflow, below.
Overview of steps
- Step 0. (recommended) Pad Targets (see example above)
- Step 1. Collect proportional coverage
- Step 2. Create coverage profile
- Step 3. Segment coverage profile
- Step 4. Plot coverage profile
- Step 5. Call segments
Step 1. Collect proportional coverage
Inputs
- bam file
- target BED file -- must be the same file that was used for the PoN
- reference_sequence (required by GATK) -- fasta file with b37 reference.
Outputs
- Proportional coverage tsv file -- Mx5 matrix of proportional coverage, where M is the number of targets. The fifth column will be named for the sample in the bam file (found in the bam file's SM tag). If the file exists, it will be overwritten.
##fileFormat = tsv
##commandLine = org.broadinstitute.hellbender.tools.exome.ExomeReadCounts ...snip...
##title = Read counts per target and sample
CONTIG START END NAME SAMPLE1
1 12200 12275 target1 1.150e-05
1 13505 13600 target2 1.500e-05
1 31000 31500 target3 7.000e-05
....snip....
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> CalculateTargetCoverage -I <input_bam_file> -O <pcov_output_file_path> --targets <target_BED> -R <ref_genome> \
-transform PCOV --targetInformationColumns FULL -groupBy SAMPLE -keepdups
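For example, with hypothetical file names (a sketch; substitute your own paths):
java -Xmx8g -jar gatk-protected.jar CalculateTargetCoverage -I tumor_sample.bam -O tumor_sample.pcov.tsv --targets targets.padded.bed -R human_g1k_v37.fasta \
-transform PCOV --targetInformationColumns FULL -groupBy SAMPLE -keepdups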
Step 2. Create coverage profile
Inputs
- proportional coverage file from Step 1
- target BED file -- must be the same that was used for the PoN
- PoN file
- directory containing the HDF5 JNI native libraries
Outputs
- normalized coverage file (tsv) -- details each target with chromosome, start, end, and log copy ratio estimate
#fileFormat = tsv
#commandLine = ....snip....
#title = ....snip....
name contig start stop SAMPLE1
target1 1 12200 12275 -0.5958351605220968
target2 1 13505 13600 -0.2855054918109098
target3 1 31000 31500 -0.11450116047248263
....snip....
- pre-tangent-normalization coverage file (tsv) -- same format as the normalized coverage file above, but with copy ratio estimates taken before the noise reduction (tangent normalization) step.
- fnt file (tsv) -- proportional coverage divided by the target factors contained in the PoN. The file format is the same as the proportional coverage in step 1.
- betaHats (tsv) -- typically used only by developers and evaluators, but an output location must be specified. These are the coefficients used in the projection of the case sample into the (reduced) PoN; an Mx1 matrix, where M is the number of targets.
Invocation
java -Djava.library.path=<hdf_jni_native_dir> -Xmx8g -jar <path_to_hellbender_protected_jar> NormalizeSomaticReadCounts -I <pcov_input_file_path> -T <target_BED> -pon <pon_file> \
-O <output_target_cr_file> -FNO <output_target_fnt_file> -BHO <output_beta_hats_file> -PTNO <output_pre_tangent_normalization_cr_file>
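A hypothetical concrete invocation, reusing the Step 1 output above and the Ubuntu JNI path from the Requirements section (my_pon.hd5 is a placeholder PoN file name):
java -Djava.library.path=/usr/lib/jni/ -Xmx8g -jar gatk-protected.jar NormalizeSomaticReadCounts -I tumor_sample.pcov.tsv -T targets.padded.bed -pon my_pon.hd5 \
-O tumor_sample.cr.tsv -FNO tumor_sample.fnt.tsv -BHO tumor_sample.betaHats.tsv -PTNO tumor_sample.preTN.cr.tsv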
Step 3. Segment coverage profile
Inputs
- normalized coverage file (tsv) -- from step 2.
- sample name
Outputs
- seg file (tsv) -- segment file detailing contig, start, end, and copy ratio (segment_mean) for each detected segment. Note that this is a different format from that of python recapseg, since the segment mean no longer has log2 applied.
Sample Chromosome Start End Num_Probes Segment_Mean
SAMPLE1 1 12200 70000 18 0.841235
SAMPLE1 1 300600 1630000 337 1.23232323
....snip....
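If you need to compare against python recapseg output, apply log2 to the Segment_Mean values; for example, the first segment above corresponds to a recapseg-style value of log2(0.841235) ≈ -0.249.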
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> PerformSegmentation -S <sample_name> -T <normalized_coverage_file> -O <output_seg_file> -log
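For example, continuing the hypothetical file names from the earlier steps:
java -Xmx8g -jar gatk-protected.jar PerformSegmentation -S SAMPLE1 -T tumor_sample.cr.tsv -O tumor_sample.seg -log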
Step 4. Plot coverage profile
Inputs
- normalized coverage file (tsv) -- from step 2.
- pre-tangent-normalization coverage file (tsv) -- from step 2.
- segmented coverage file (seg) -- from step 3.
- sample name, see above
Outputs
- beforeAfterTangentLimPlot (png) -- before/after tangent-normalization plot, with copy ratio capped at 4
- beforeAfterTangentPlot (png) -- before/after tangent-normalization plot
- fullGenomePlot (png) -- Full genome plot after tangent normalization
- preQc (txt) -- Median absolute differences of targets before normalization
- postQc (txt) -- Median absolute differences of targets after normalization
- dQc (txt) -- Difference in median absolute differences of targets before and after normalization
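As a hypothetical illustration: if preQc reports 0.25 and postQc reports 0.10, then dQc = 0.25 - 0.10 = 0.15; the larger the dQc, the more noise the tangent normalization removed.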
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> PlotSegmentedCopyRatio -S <sample_name> -T <normalized_coverage_file> -P <pre_tangent_normalization_coverage_file> -seg <segmented_coverage_file> -O <output_location> -log
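A hypothetical concrete invocation (here -O points to a directory for the plot and QC outputs):
java -Xmx8g -jar gatk-protected.jar PlotSegmentedCopyRatio -S SAMPLE1 -T tumor_sample.cr.tsv -P tumor_sample.preTN.cr.tsv -seg tumor_sample.seg -O plots/ -log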
Step 5. Call segments
Inputs
- normalized coverage file (tsv) -- from step 2.
- seg file (tsv) -- from step 3.
- sample name
Outputs
- called file (tsv) -- output is exactly the same as the seg file (step 3), except that a Segment_Call column is added. Calls are either "+", "0", or "-" (no quotes).
Sample Chromosome Start End Num_Probes Segment_Mean Segment_Call
SAMPLE1 1 12200 70000 18 0.841235 -
SAMPLE1 1 300600 1630000 337 1.23232323 0
....snip....
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> CallSegments -T <normalized_coverage_file> -S <seg_file> -O <output_called_seg_file> -sample <sample_name>
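For example, with the hypothetical file names used throughout:
java -Xmx8g -jar gatk-protected.jar CallSegments -T tumor_sample.cr.tsv -S tumor_sample.seg -O tumor_sample.called.seg -sample SAMPLE1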
Create PoN workflow
This workflow can take some time to run depending on how many samples are going into your PoN and the number of targets you are covering. Basic time estimates are found in the Overview of Steps.
Additional requirements
- Normal sample bam files to be used in the PoN. The index files (.bai) must reside alongside their associated bam files.
Overview of steps
- Step 1. Collect proportional coverage. (~20 minutes for mean 150x coverage and 150k targets, per sample)
- Step 2. Combine proportional coverage files (< 5 minutes for 150k targets and 300 samples)
- Step 3. Create the PoN file (~1.75 hours for 150k targets and 300 samples)
All time estimates are using the internal Broad infrastructure.
Step 1. Collect proportional coverage on each bam file
This is exactly the same as the case sample workflow, except that it must be run once for each input bam file, each run with a different output file name. Otherwise, the inputs should be the same for each bam file.
Please see the documentation above.
IMPORTANT NOTE: You must create a list of the proportional coverage files (i.e. the output files) that you create in this step, one output file path per line in a plain text file (see step 2). A sketch of this per-sample loop is shown below.
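A minimal shell sketch of this loop (assuming the normal bam paths are listed one per line in a hypothetical normal_bams.txt, and reusing the padded target file and reference from the case workflow):
mkdir -p pcov
while read bam; do
  name=$(basename "${bam}" .bam)
  java -Xmx8g -jar gatk-protected.jar CalculateTargetCoverage -I "${bam}" -O pcov/${name}.pcov.tsv --targets targets.padded.bed -R human_g1k_v37.fasta \
  -transform PCOV --targetInformationColumns FULL -groupBy SAMPLE -keepdups
  echo pcov/${name}.pcov.tsv >> pcov_file_list.txt
done < normal_bams.txt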
Step 2. Merge proportional coverage files
This step merges the proportional coverage files into one large file with a separate column for each sample.
Inputs
- text file listing the proportional coverage files generated in step 1 (you may need to create this list manually), one path per line:
/path/to/pcov_file1.txt
/path/to/pcov_file2.txt
/path/to/pcov_file3.txt
....snip....
Outputs
- merged tsv of proportional coverage
CONTIG START END NAME SAMPLE1 SAMPLE2 SAMPLE3 ....snip....
1 12191 12227 target1 8.835E-6 1.451E-5 1.221E-5 ....snip....
1 12596 12721 target2 1.602E-5 1.534E-5 1.318E-5 ....snip....
....snip....
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> CombineReadCounts --inputList <text_file_list_of_proportional_coverage_files> \
-O <output_merged_file> -MOF 200
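Continuing the hypothetical file names from step 1:
java -Xmx8g -jar gatk-protected.jar CombineReadCounts --inputList pcov_file_list.txt -O merged_pcov.tsv -MOF 200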
Step 3. Create the PoN file
Inputs
- merged tsv of proportional coverage -- generated in step 2.
Outputs
- PoN file -- HDF5 format. This file can be used for running case samples sequenced with the same process.
Invocation
java -Xmx16g -Djava.library.path=<hdf_jni_native_dir> -jar <path_to_hellbender_protected_jar> CreatePanelOfNormals -I <merged_pcov_file> \
-O <output_pon_file_full_path>
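For example, on Ubuntu with the JNI path from the Requirements section and the hypothetical merged file from step 2:
java -Xmx16g -Djava.library.path=/usr/lib/jni/ -jar gatk-protected.jar CreatePanelOfNormals -I merged_pcov.tsv -O my_pon.hd5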