Channel: Recent Discussions — GATK-Forum

Description and examples of the steps in the CNV case and CNV PoN creation workflows


This article describes the CNV case and PoN creation workflows, with examples, for earlier releases of GATK4.

For a newer tutorial using GATK4's v1.0.0.0-alpha1.2.3 release (Version:0288cff-SNAPSHOT from September 2016), see Article#9143 and this data bundle. If you have a question on the Somatic_CNV_handson tutorial, please post it as a new question using this form.


Requirements

  1. Java 1.8
  2. A functioning GATK4-protected jar (hellbender-protected.jar or gatk-protected.jar)
  3. HDF5 1.8.13
  4. The location of the HDF5-Java JNI Libraries Release 2.9 (2.11 for Macs).
    Typical locations:
    Ubuntu: /usr/lib/jni/
    Mac: /Applications/HDFView.app/Contents/Resources/lib/
    Broad internal servers: /broad/software/free/Linux/redhat_6_x86_64/pkgs/hdfview_2.9/HDFView/lib/linux/
  5. Reference genome (fasta files) with fai and dict files. This can be downloaded as part of the GATK resource bundle: http://www.broadinstitute.org/gatk/guide/article?id=1213
  6. PoN file (when running case samples only). This file should be created using the Create PoN workflow (see below).
  7. Target BED file that was used to create the PoN file. Format details can be found here. NOTE: The CNV tools require a fourth column containing the target name, which must be unique across rows.
1       12200   12275   target1
1       13505   13600   target2
1       31000   31500   target3
1       35138   35174   target4
....snip....
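
Since the target names must be unique, it can help to check the BED file before starting. The following is a minimal sketch and not part of GATK: the function names are hypothetical, and it assumes a whitespace-delimited BED file without header lines.

```shell
# Hypothetical helper (not a GATK tool): print every target name that occurs
# more than once in column 4. Prints nothing when all names are unique.
duplicate_target_names() {
    # $1: path to the target BED file
    awk 'NF >= 4 { count[$4]++ }
         END { for (name in count) if (count[name] > 1) print name }' "$1" | sort
}

# Hypothetical helper: print the line numbers of non-empty rows that have
# fewer than four columns, i.e. rows with no target name.
lines_missing_name() {
    # $1: path to the target BED file
    awk 'NF > 0 && NF < 4 { print NR }' "$1"
}
```

For example, duplicate_target_names initial_target_file.bed should print nothing for a file that is ready for the CNV tools.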

Before running the workflows, we recommend padding the target file by 250 bases with the PadTargets tool. Example:
java -jar gatk-protected.jar PadTargets --targets initial_target_file.bed --output initial_target_file.padded.bed --padding 250
This allows some off-target reads to be factored into the copy ratio estimates. Our internal evaluations have shown that this improves results.
If you are using the premade Queue scripts (see below), you can specify the padding there and the workflow will generate the padded targets automatically (i.e. there is no reason to run PadTargets explicitly if you are using the premade Queue scripts).

Case sample workflow

This workflow requires a PoN file generated by the Create PoN workflow.

If you do not have a PoN, skip to the Create PoN workflow below.

Overview of steps
  • Step 0. (recommended) Pad Targets (see example above)
  • Step 1. Collect proportional coverage
  • Step 2. Create coverage profile
  • Step 3. Segment coverage profile
  • Step 4. Plot coverage profile
  • Step 5. Call segments
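
The five steps listed above can be chained in one script. The sketch below uses only the tool names and flags from the invocations documented in each step of this article; the function names, the output file suffixes, and the DRY_RUN switch are hypothetical choices made for illustration. In particular, the value passed to -O in Step 4 is a placeholder, as in the documented invocation.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the case sample workflow.
# Arguments: jar, bam, target BED (same as used for the PoN), reference fasta,
#            PoN file, HDF5 JNI directory, sample name, output prefix
cnv_case_workflow() {
    jar=$1; bam=$2; targets=$3; ref=$4; pon=$5; jni=$6; sample=$7; out=$8

    # Step 1. Collect proportional coverage
    run java -Xmx8g -jar "$jar" CalculateTargetCoverage -I "$bam" -O "$out.pcov.tsv" \
        --targets "$targets" -R "$ref" \
        -transform PCOV --targetInformationColumns FULL -groupBy SAMPLE -keepdups || return 1

    # Step 2. Create coverage profile (needs the HDF5 JNI native libraries)
    run java -Djava.library.path="$jni" -Xmx8g -jar "$jar" NormalizeSomaticReadCounts \
        -I "$out.pcov.tsv" -T "$targets" -pon "$pon" \
        -O "$out.tn.tsv" -FNO "$out.fnt.tsv" -BHO "$out.betaHats.tsv" -PTNO "$out.preTN.tsv" || return 1

    # Step 3. Segment coverage profile
    run java -Xmx8g -jar "$jar" PerformSegmentation -S "$sample" -T "$out.tn.tsv" \
        -O "$out.seg" -log || return 1

    # Step 4. Plot coverage profile
    run java -Xmx8g -jar "$jar" PlotSegmentedCopyRatio -S "$sample" -T "$out.tn.tsv" \
        -P "$out.preTN.tsv" -seg "$out.seg" -O "$out.plots" -log || return 1

    # Step 5. Call segments
    run java -Xmx8g -jar "$jar" CallSegments -T "$out.tn.tsv" -S "$out.seg" \
        -O "$out.called.seg" -sample "$sample"
}
```

Running it with DRY_RUN=1 set prints the five commands without executing them, which is a cheap way to review the paths before committing compute time.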
Step 1. Collect proportional coverage
Inputs
  • bam file
  • target bed file -- must be the same that was used for the PoN
  • reference_sequence (required by GATK) -- fasta file with b37 reference.
Outputs
  • Proportional coverage tsv file -- Mx5 matrix of proportional coverage, where M is the number of targets. The fifth column will be named for the sample in the bam file (found in the bam file SM tag). If the file exists, it will be overwritten.
##fileFormat  = tsv
##commandLine = org.broadinstitute.hellbender.tools.exome.ExomeReadCounts  ...snip...
##title       = Read counts per target and sample
CONTIG  START   END     NAME    SAMPLE1
1       12200   12275   target1    1.150e-05
1       13505   13600   target2    1.500e-05
1       31000   31500   target3    7.000e-05
....snip....
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> CalculateTargetCoverage -I <input_bam_file> -O <pcov_output_file_path> --targets <target_BED> -R <ref_genome> \
       -transform PCOV --targetInformationColumns FULL -groupBy SAMPLE -keepdups
Step 2. Create coverage profile
Inputs
  • proportional coverage file from Step 1
  • target BED file -- must be the same that was used for the PoN
  • PoN file
  • directory containing the HDF5 JNI native libraries
Outputs
  • normalized coverage file (tsv) -- details each target with chromosome, start, end, and log copy ratio estimate
#fileFormat = tsv
#commandLine = ....snip....
#title = ....snip....
name    contig  start   stop    SAMPLE1
target1    1       12200   12275   -0.5958351605220968
target2    1       13505   13600   -0.2855054918109098
target3    1       31000   31500   -0.11450116047248263
....snip....
  • pre-tangent-normalization coverage file (tsv) -- same as normalized coverage file (tsv) above, but copy ratio estimates are before the noise reduction step. The file format is the same as the normalized coverage file (tsv).
  • fnt file (tsv) -- proportional coverage divided by the target factors contained in the PoN. The file format is the same as the proportional coverage in step 1.
  • betaHats (tsv) -- typically used only by developers and evaluators, but an output location must still be specified. These are the
    coefficients used in the projection of the case sample into the (reduced) PoN. This will be an Mx1 matrix, where M is the number of targets.
Invocation
java -Djava.library.path=<hdf_jni_native_dir> -Xmx8g -jar <path_to_hellbender_protected_jar> NormalizeSomaticReadCounts -I <pcov_input_file_path> -T <target_BED> -pon <pon_file> \
 -O <output_target_cr_file> -FNO <output_target_fnt_file> -BHO <output_beta_hats_file> -PTNO <output_pre_tangent_normalization_cr_file>
Step 3. Segment coverage profile
Inputs
  • normalized coverage file (tsv) -- from step 2.
  • sample name
Outputs
  • seg file (tsv) -- details contig, start, end, and copy ratio (segment_mean) for each detected segment. Note that this format differs from that of the Python ReCapSeg, since the segment mean no longer has log2 applied.
Sample  Chromosome      Start   End     Num_Probes      Segment_Mean
SAMPLE1        1       12200   70000   18       0.841235
SAMPLE1        1       300600  1630000 337     1.23232323
....snip....
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar>  PerformSegmentation  -S <sample_name> -T <normalized_coverage_file> -O <output_seg_file> -log
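
Downstream tools that expect log2 segment means need the Segment_Mean column converted. The following is a minimal sketch under the assumption, taken from the output description above, that Segment_Mean holds a linear copy ratio and that the file has the six columns shown in the example. The function name is hypothetical and this is not a GATK tool.

```shell
# Hypothetical helper: rewrite column 6 (Segment_Mean) of a seg file as log2.
# Assumes the header row starts with "Sample" and all means are positive.
seg_to_log2() {
    # $1: path to the seg file produced in Step 3
    awk 'BEGIN { OFS = "\t" }
         $1 == "Sample" { $1 = $1; print; next }     # header: re-emit tab-separated
         NF >= 6 { $6 = sprintf("%.6f", log($6) / log(2)); print }' "$1"
}
```

A segment mean of 0.5 becomes -1.000000 and a segment mean of 4 becomes 2.000000.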
Step 4. Plot coverage profile
Inputs
  • normalized coverage file (tsv) -- from step 2.
  • pre-tangent-normalization coverage file (tsv) -- from step 2.
  • segmented coverage file (seg) -- from step 3.
  • sample name, see above
Outputs
  • beforeAfterTangentLimPlot (png) -- Output before/after tangent normalization plot up to copy-ratio 4
  • beforeAfterTangentPlot (png) -- Output before/after tangent normalization plot
  • fullGenomePlot (png) -- Full genome plot after tangent normalization
  • preQc (txt) -- Median absolute differences of targets before normalization
  • postQc (txt) -- Median absolute differences of targets after normalization
  • dQc (txt) -- Difference in median absolute differences of targets before and after normalization
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar>  PlotSegmentedCopyRatio  -S <sample_name> -T <normalized_coverage_file> -P <pre_normalized_coverage_file> -seg <segmented_coverage_file> -O <output_seg_file> -log
Step 5. Call segments
Inputs
  • normalized coverage file (tsv) -- from step 2.
  • seg file (tsv) -- from step 3.
  • sample name
Outputs
  • called file (tsv) -- exactly the same as the seg file (step 3), except that a Segment_Call column is added. Calls are either "+", "0", or "-" (no quotes).
Sample  Chromosome      Start   End     Num_Probes      Segment_Mean      Segment_Call
SAMPLE1        1       12200   70000   18       0.841235      -
SAMPLE1        1       300600  1630000 337     1.23232323     0 
....snip....
Invocation
java -Xmx8g -jar <path_to_hellbender_protected_jar> CallSegments -T <normalized_coverage_file> -S <seg_file> -O <output_called_seg_file> -sample <sample_name> 
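
A quick tally of the calls gives a first impression of a sample. This is a minimal sketch, not a GATK tool: the function name is hypothetical, and it assumes the seven-column called file format shown above, with the call in the last column.

```shell
# Hypothetical helper: count amplified (+), neutral (0) and deleted (-)
# segments in a called file from Step 5.
count_segment_calls() {
    # $1: path to the called seg file
    awk '$1 != "Sample" && NF >= 7 { n[$7]++ }      # skip the header row
         END { printf "+\t%d\n0\t%d\n-\t%d\n", n["+"], n["0"], n["-"] }' "$1"
}
```

The output has one line per call type, with the call symbol and the number of segments separated by a tab.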

Create PoN workflow

This workflow can take some time to run, depending on how many samples go into your PoN and how many targets you are covering. Rough time estimates are given in the overview of steps below.

Additional requirements
  • Normal sample bam files to be used in the PoN. Each bam file must have its index file (.bai) in the same location.

Overview of steps
  • Step 1. Collect proportional coverage. (~20 minutes for mean 150x coverage and 150k targets, per sample)
  • Step 2. Combine proportional coverage files (< 5 minutes for 150k targets and 300 samples)
  • Step 3. Create the PoN file (~1.75 hours for 150k targets and 300 samples)

All time estimates were obtained on the internal Broad infrastructure.

Step 1. Collect proportional coverage on each bam file

This is exactly the same as the case sample workflow, except that this needs to be run once for each input bam file, each with a different output file name. Otherwise, the inputs should be the same for each bam file.

Please see documentation above.

IMPORTANT NOTE: You must create a list of the proportional coverage files (i.e. the output files) that you create in this step, as a text file with one output file path per line (see step 2).
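
Running Step 1 per bam file and building the list can be done in one loop. The sketch below reuses the CalculateTargetCoverage invocation from the case sample workflow; the function names, the .pcov.tsv suffix, the directory layout, and the DRY_RUN switch are hypothetical.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: run Step 1 on every normal bam in a directory and
# record each output path, one per line, in a list file for Step 2.
# Arguments: jar, directory of normal bams, target BED, reference fasta,
#            output directory, list file to write
collect_normals_pcov() {
    jar=$1; bam_dir=$2; targets=$3; ref=$4; out_dir=$5; list=$6
    mkdir -p "$out_dir"
    : > "$list"                                   # start with an empty list
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue                 # no bam files matched
        pcov="$out_dir/$(basename "$bam" .bam).pcov.tsv"
        run java -Xmx8g -jar "$jar" CalculateTargetCoverage -I "$bam" -O "$pcov" \
            --targets "$targets" -R "$ref" \
            -transform PCOV --targetInformationColumns FULL -groupBy SAMPLE -keepdups || return 1
        printf '%s\n' "$pcov" >> "$list"
    done
}
```

Because each output name is derived from the bam file name, two bam files with the same base name in different directories would collide; this sketch assumes all normals sit in one directory.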

Step 2. Merge proportional coverage files

This step merges the proportional coverage files into one large file with a separate column for each sample.

Inputs
  • list of proportional coverage files generated (possibly manually) in step 1. This is a text file.
/path/to/pcov_file1.txt
/path/to/pcov_file2.txt
/path/to/pcov_file3.txt
....snip....
Outputs
  • merged tsv of proportional coverage
CONTIG  START   END     NAME    SAMPLE1    SAMPLE2 SAMPLE3 ....snip....
1       12191   12227   target1    8.835E-6  1.451E-5     1.221E-5    ....snip....
1       12596   12721   target2    1.602E-5  1.534E-5     1.318E-5   ....snip....
....snip....
Invocation
java -Xmx8g -jar  <path_to_hellbender_protected_jar> CombineReadCounts --inputList <text_file_list_of_proportional_coverage_files> \
    -O <output_merged_file> -MOF 200 
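
Before the long-running Step 3, it may be worth confirming that the merged file contains one column per listed input. The following is a minimal sketch, not a GATK tool: the function names are hypothetical, and it assumes one sample per proportional coverage file and the four leading columns (CONTIG, START, END, NAME) shown above.

```shell
# Hypothetical helper: number of sample columns in the merged tsv, read from
# the first line that is not a "#" comment line.
merged_sample_count() {
    awk '/^#/ { next } { print NF - 4; exit }' "$1"
}

# Hypothetical helper: number of non-empty lines in the input list file.
listed_file_count() {
    grep -c . "$1"
}

# Succeeds only when both counts agree.
check_merged_against_list() {
    # $1: merged tsv, $2: list of proportional coverage files
    [ "$(merged_sample_count "$1")" = "$(listed_file_count "$2")" ]
}
```

A failing check_merged_against_list usually points to a missing or duplicated entry in the list file.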
Step 3. Create the PoN file
Inputs
  • merged tsv of proportional coverage -- generated in step 2.
Outputs
  • PoN file -- HDF5 format. This file can be used for running case samples sequenced with the same process.
Invocation
java -Xmx16g -Djava.library.path=<hdf_jni_native_dir> -jar <path_to_hellbender_protected_jar> CreatePanelOfNormals -I <merged_pcov_file> \
       -O <output_pon_file_full_path>
